<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:clearspace="http://www.jivesoftware.com/xmlns/clearspace/rss" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>VMware Communities: Message List - problems after SAN failure.</title>
    <link>http://communities.vmware.com/community/vmtn/vi/esx3.5?view=discussions</link>
    <description>Most recent forum messages</description>
    <language>en</language>
    <pubDate>Thu, 21 May 2009 22:51:28 GMT</pubDate>
    <generator>Clearspace 1.10.12 (http://jivesoftware.com/products/clearspace/)</generator>
    <dc:date>2009-05-21T22:51:28Z</dc:date>
    <dc:language>en</dc:language>
    <item>
      <title>Re: problems after SAN failure.</title>
      <link>http://communities.vmware.com/message/1259103?tstart=0#1259103</link>
      <description>&lt;br /&gt;
I'm having the same issue: 6 VMware ESXi servers connected over iSCSI to a Clariion AX4-5i. From time to time, all Windows guests get these errors at the same time:&lt;br /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;
Event ID 11 - Disk - The driver detected a controller error on \Device\Harddisk1.&lt;br /&gt;
&lt;p /&gt;
Event ID 15 - symmpi - The device, \Device\Scsi\symmpi1, is not ready for access yet.&lt;br /&gt;
&lt;p /&gt;
Linux guests fare even worse: they either remount their filesystems read-only or lock up completely.&lt;br /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;
These errors come up at the exact same time in /var/log/messages on all ESX hosts.&lt;br /&gt;
&lt;p /&gt;
This one comes up a lot:&lt;br /&gt;
&lt;p /&gt;
May 21 22:20:46 vmkernel: 1:16:47:08.634 cpu4:1305)iSCSI: bus 0 target 3 trying to establish session 0x35e402c0 to portal 0, address 10.0.0.2 port 3260 group 4&lt;br /&gt;
&lt;p /&gt;
Also this:&lt;br /&gt;
 May 21 22:19:58 vmkernel: 1:16:46:19.933 cpu7:5865)SCSI: 638: Queue for device vml.020001000060060160eaa0210086031f2471c4dd11524149442035 is being blocked to check for hung SP.&lt;br /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;
EMC replaced an SP yesterday, but that didn't help at all. I found this article, which seemed to help a little, but the issue has occurred at least once since I implemented it: &lt;a class="jive-link-external" href="http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&amp;#38;cmd=displayKC&amp;#38;externalId=1008113"&gt;http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&amp;#38;cmd=displayKC&amp;#38;externalId=1008113&lt;/a&gt; I used values of 32 and 16.&lt;br /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;
Did you get your issue resolved? I don't know what to do at this point. &lt;br /&gt;
&lt;p /&gt;
&lt;br /&gt;</description>
      <pubDate>Thu, 21 May 2009 22:51:28 GMT</pubDate>
      <author>oroadwarrioro</author>
      <guid>http://communities.vmware.com/message/1259103?tstart=0#1259103</guid>
      <dc:date>2009-05-21T22:51:28Z</dc:date>
      <clearspace:dateToText>6 months, 1 week ago</clearspace:dateToText>
    </item>
    <item>
      <title>Re: problems after SAN failure.</title>
      <link>http://communities.vmware.com/message/1211310?tstart=0#1211310</link>
      <description>There are a couple of Primus articles that reference this ASC/ASCQ combination with a Sense Key of 6. They say it is due to data changes within the LUN and does not indicate data corruption. I would expect to see these during an internal LUN migration on the Clariion. Did the last migration complete?</description>
      <pubDate>Sat, 28 Mar 2009 20:50:37 GMT</pubDate>
      <author>whynotq</author>
      <guid>http://communities.vmware.com/message/1211310?tstart=0#1211310</guid>
      <dc:date>2009-03-28T20:50:37Z</dc:date>
      <clearspace:dateToText>8 months, 1 day ago</clearspace:dateToText>
      <clearspace:replyCount>1</clearspace:replyCount>
    </item>
    <item>
      <title>Re: problems after SAN failure.</title>
      <link>http://communities.vmware.com/message/1211298?tstart=0#1211298</link>
      <description>I will get back to you when I have more info. They are looking into it and so is VMware.&lt;br /&gt;
VMware has never seen anything like this. Initially EMC said it was caused by a bug in the&lt;br /&gt;
LUN migration software, so we didn't use that again.&lt;br /&gt;
&lt;br /&gt;
Now after the latest round of problems, they are thinking it is all caused by a hardware fault in the backend.&lt;br /&gt;
I don't have access to the collects right now so cannot give you more info at this time. The SAN was sending out&lt;br /&gt;
an ASC/ASCQ 3f/0xe to ESX, and apparently that points to a hardware issue.&lt;br /&gt;
&lt;p /&gt;
No Primus reference, as this looks like a first. They have engineering working on it.&lt;br /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;
cheers</description>
      <pubDate>Sat, 28 Mar 2009 20:36:34 GMT</pubDate>
      <author>AllBlack</author>
      <guid>http://communities.vmware.com/message/1211298?tstart=0#1211298</guid>
      <dc:date>2009-03-28T20:36:34Z</dc:date>
      <clearspace:dateToText>8 months, 1 day ago</clearspace:dateToText>
      <clearspace:replyCount>2</clearspace:replyCount>
    </item>
    <item>
      <title>Re: problems after SAN failure.</title>
      <link>http://communities.vmware.com/message/1211294?tstart=0#1211294</link>
      <description>&lt;br /&gt;
Any chance of some more detail relating to the fault and environment? I work with EMC Clariion constantly and have not yet seen a bug that causes this, so I would be interested to hear. What is the Flare code that you currently have, and what is the bug detail that EMC highlighted? Did they reference any Primus articles, and do you have the bug check or panic ID?&lt;br /&gt;
&lt;p /&gt;
Lots of questions, I know, but it may help more people avoid your pain in the future &lt;img class="jive-emoticon" border="0" src="http://communities.vmware.com/images/emoticons/happy.gif" alt=":)" /&gt; &lt;br /&gt;
&lt;p /&gt;
I'll take a look at the SP collects if you care to post them...</description>
      <pubDate>Sat, 28 Mar 2009 20:09:07 GMT</pubDate>
      <author>whynotq</author>
      <guid>http://communities.vmware.com/message/1211294?tstart=0#1211294</guid>
      <dc:date>2009-03-28T20:09:07Z</dc:date>
      <clearspace:dateToText>8 months, 1 day ago</clearspace:dateToText>
      <clearspace:replyCount>3</clearspace:replyCount>
    </item>
    <item>
      <title>Re: problems after SAN failure.</title>
      <link>http://communities.vmware.com/message/1211292?tstart=0#1211292</link>
      <description>&lt;br /&gt;
A lot has happened since my last post. Two days after it, the entire SAN started to fall over and we had a major outage.&lt;br /&gt;
It was traced to a bug in the Flare software that was unknown until then. Things stabilized after that and we did not use&lt;br /&gt;
the functionality that was buggy. A few days ago things went balls up again!!! We are getting lots of trespassing and the finger&lt;br /&gt;
was pointed at VMware. They pretty much proved it was the SAN. It now looks like there is an issue in the hardware backend&lt;br /&gt;
that can cause trespassing. It has been a full-on week, to say the least.</description>
      <pubDate>Sat, 28 Mar 2009 19:39:06 GMT</pubDate>
      <author>AllBlack</author>
      <guid>http://communities.vmware.com/message/1211292?tstart=0#1211292</guid>
      <dc:date>2009-03-28T19:39:06Z</dc:date>
      <clearspace:dateToText>8 months, 1 day ago</clearspace:dateToText>
      <clearspace:replyCount>4</clearspace:replyCount>
    </item>
    <item>
      <title>Re: problems after SAN failure.</title>
      <link>http://communities.vmware.com/message/1179876?tstart=0#1179876</link>
      <description>I suppose better safe than sorry: reinstall the host so it does not become an issue down the line.&lt;br /&gt;
&lt;br /&gt;
Really hammer EMC to get an answer as to what happened to the SAN. Things could have been much worse, and you do not want that to occur again.</description>
      <pubDate>Tue, 24 Feb 2009 00:11:26 GMT</pubDate>
      <author>Lightbulb</author>
      <guid>http://communities.vmware.com/message/1179876?tstart=0#1179876</guid>
      <dc:date>2009-02-24T00:11:26Z</dc:date>
      <clearspace:dateToText>9 months, 4 days ago</clearspace:dateToText>
      <clearspace:replyCount>5</clearspace:replyCount>
    </item>
    <item>
      <title>Re: problems after SAN failure.</title>
      <link>http://communities.vmware.com/message/1179620?tstart=0#1179620</link>
      <description>&lt;br /&gt;
The SAN had what looks like the same failure last night. Haven't heard back from EMC.&lt;br /&gt;
&lt;br /&gt;
This host is standalone, but I was thinking of reinstalling it as there were plans to add it to a cluster.&lt;br /&gt;
The hosts in the other cluster have no dead paths as far as I can see, although their VMs have been affected&lt;br /&gt;
by the SAN failure. We pretty much have to reboot every VM.</description>
      <pubDate>Mon, 23 Feb 2009 19:42:10 GMT</pubDate>
      <author>AllBlack</author>
      <guid>http://communities.vmware.com/message/1179620?tstart=0#1179620</guid>
      <dc:date>2009-02-23T19:42:10Z</dc:date>
      <clearspace:dateToText>9 months, 4 days ago</clearspace:dateToText>
      <clearspace:replyCount>6</clearspace:replyCount>
    </item>
    <item>
      <title>Re: problems after SAN failure.</title>
      <link>http://communities.vmware.com/message/1179176?tstart=0#1179176</link>
      <description>&lt;br /&gt;
I take it that your other hosts (is this a cluster?) are fine and the VMs are running on those hosts. Is this correct?&lt;br /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;br /&gt;
You could try step &lt;b&gt;3&lt;/b&gt; from the following document, which deals with cleaning up the iSCSI config on an ESX system:&lt;br /&gt;
&lt;p /&gt;
&lt;br /&gt;
&lt;a class="jive-link-external" href="http://apps.sourceforge.net/mediawiki/iscsitarget/index.php?title=The_case_of_stale_iSCSI_LUNs"&gt;http://apps.sourceforge.net/mediawiki/iscsitarget/index.php?title=The_case_of_stale_iSCSI_LUNs&lt;/a&gt;&lt;br /&gt;
&lt;p /&gt;
&lt;br /&gt;
If your VMs are safely on other hosts, you may want to evict this host, reinstall ESX, and add the host back to the cluster. Kind of a cop-out, but it may be the best use of your time. Of course, these suggestions are predicated on your VMs running on another host that is not having an issue.&lt;br /&gt;
&lt;br /&gt;
Note: On the Clariion, check whether the failure happened at the same time as the weekly battery test. This is a scheduled activity that could affect both SPs.&lt;br /&gt;
&lt;br /&gt;
Just a thought.&lt;br /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;</description>
      <pubDate>Mon, 23 Feb 2009 11:26:25 GMT</pubDate>
      <author>Lightbulb</author>
      <guid>http://communities.vmware.com/message/1179176?tstart=0#1179176</guid>
      <dc:date>2009-02-23T11:26:25Z</dc:date>
      <clearspace:dateToText>9 months, 5 days ago</clearspace:dateToText>
      <clearspace:replyCount>7</clearspace:replyCount>
    </item>
    <item>
      <title>problems after SAN failure.</title>
      <link>http://communities.vmware.com/message/1178985?tstart=0#1178985</link>
      <description>Hi everyone. &lt;br /&gt;
&lt;br /&gt;
On Friday we had a SAN failure. We logged a job with EMC and they had no clue, as nothing was obvious.&lt;br /&gt;
They did notice that both storage processors had a panic at the same time. The EMC development engineers are looking into it. Needless to say, pretty much everything turned to custard. A lot of VMs are unhappy and pretty much needed a cold reboot.&lt;br /&gt;
&lt;p /&gt;
&lt;br /&gt;
On one of our hosts I have lost one of the LUNs. On the SAN there are two LUNs available. ESX seems to detect the same LUN twice and obviously I have issues with my paths.&lt;br /&gt;
&lt;p /&gt;
 Some output that is related &lt;br /&gt;
&lt;p /&gt;
# esxcfg-mpath -l&lt;br /&gt;
Disk vmhba32:2:0  (0MB) has 2 paths and policy of Most Recently Used&lt;br /&gt;
 iScsi sw iqn.1998-01.com.vmware:localhost-3055e03e&amp;lt;-&amp;gt;iqn.1992-04.com.emc:cx.ck200064601253.a2 vmhba32:2:0 On active preferred&lt;br /&gt;
 iScsi sw iqn.1998-01.com.vmware:localhost-3055e03e&amp;lt;-&amp;gt;iqn.1992-04.com.emc:cx.ck200064601253.b2 vmhba32:6:0 On&lt;br /&gt;
&lt;br /&gt;
Disk vmhba0:0:0 /dev/cciss/c0d0 (69973MB) has 1 paths and policy of Fixed&lt;br /&gt;
 Local 6:0.0 vmhba0:0:0 On active preferred&lt;br /&gt;
&lt;br /&gt;
Disk vmhba32:3:1 /dev/sdb (512000MB) has 2 paths and policy of Most Recently Used&lt;br /&gt;
 iScsi sw iqn.1998-01.com.vmware:localhost-3055e03e&amp;lt;-&amp;gt;iqn.1992-04.com.emc:cx.ck200064601253.a3 vmhba32:3:1 Standby  preferred&lt;br /&gt;
 iScsi sw iqn.1998-01.com.vmware:localhost-3055e03e&amp;lt;-&amp;gt;iqn.1992-04.com.emc:cx.ck200064601253.b3 vmhba32:7:1 On active&lt;br /&gt;
&lt;br /&gt;
Disk vmhba32:3:3  (512000MB) has 2 paths and policy of Most Recently Used&lt;br /&gt;
 iScsi sw iqn.1998-01.com.vmware:localhost-3055e03e&amp;lt;-&amp;gt;iqn.1992-04.com.emc:cx.ck200064601253.a3 vmhba32:3:3 Dead  preferred&lt;br /&gt;
 iScsi sw iqn.1998-01.com.vmware:localhost-3055e03e&amp;lt;-&amp;gt;iqn.1992-04.com.emc:cx.ck200064601253.b3 vmhba32:7:3 Dead&lt;br /&gt;
&lt;br /&gt;
Disk vmhba32:3:0 /dev/sda (512000MB) has 2 paths and policy of Most Recently Used&lt;br /&gt;
 iScsi sw iqn.1998-01.com.vmware:localhost-3055e03e&amp;lt;-&amp;gt;iqn.1992-04.com.emc:cx.ck200064601253.a3 vmhba32:3:0 Standby  preferred&lt;br /&gt;
 iScsi sw iqn.1998-01.com.vmware:localhost-3055e03e&amp;lt;-&amp;gt;iqn.1992-04.com.emc:cx.ck200064601253.b3 vmhba32:7:0 On active&lt;br /&gt;
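&lt;p /&gt;
A minimal Python sketch for picking out disks like vmhba32:3:3 above, where every listed path is Dead. It assumes the ESX 3.x esxcfg-mpath -l layout shown in this post and nothing more; the function name dead_disks is made up for illustration, and the sample is two of the disks copied from the output above.&lt;br /&gt;

```python
import re

# Two disks copied from the esxcfg-mpath -l output in this post.
SAMPLE = """\
Disk vmhba32:3:3  (512000MB) has 2 paths and policy of Most Recently Used
 iScsi sw iqn.1998-01.com.vmware:localhost-3055e03e&lt;-&gt;iqn.1992-04.com.emc:cx.ck200064601253.a3 vmhba32:3:3 Dead  preferred
 iScsi sw iqn.1998-01.com.vmware:localhost-3055e03e&lt;-&gt;iqn.1992-04.com.emc:cx.ck200064601253.b3 vmhba32:7:3 Dead

Disk vmhba32:3:0 /dev/sda (512000MB) has 2 paths and policy of Most Recently Used
 iScsi sw iqn.1998-01.com.vmware:localhost-3055e03e&lt;-&gt;iqn.1992-04.com.emc:cx.ck200064601253.a3 vmhba32:3:0 Standby  preferred
 iScsi sw iqn.1998-01.com.vmware:localhost-3055e03e&lt;-&gt;iqn.1992-04.com.emc:cx.ck200064601253.b3 vmhba32:7:0 On active
"""

# A path line ends with the path name followed by its state (On, Standby or Dead).
PATH_RE = re.compile(r"(vmhba\d+:\d+:\d+)\s+(On|Standby|Dead)\b")


def dead_disks(text):
    """Return {disk: [path, ...]} for disks whose every listed path is Dead."""
    states = {}
    disk = None
    for line in text.splitlines():
        if line.startswith("Disk "):
            # Second token of a "Disk ..." header is the canonical disk name.
            disk = line.split()[1]
            states[disk] = []
        elif disk is not None:
            match = PATH_RE.search(line)
            if match:
                states[disk].append((match.group(1), match.group(2)))
    return {
        name: [path for path, _ in paths]
        for name, paths in states.items()
        if paths and all(state == "Dead" for _, state in paths)
    }


print(dead_disks(SAMPLE))
```

On the sample it reports only vmhba32:3:3, with paths vmhba32:3:3 and vmhba32:7:3.&lt;br /&gt;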
&lt;p /&gt;
&lt;br /&gt;
# esxcfg-vmhbadevs&lt;br /&gt;
vmhba0:0:0     /dev/cciss/c0d0&lt;br /&gt;
vmhba32:3:0    /dev/sda&lt;br /&gt;
vmhba32:3:1    /dev/sdb&lt;br /&gt;
&lt;p /&gt;
These two seem to be the same physical LUN though.  &lt;br /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;br /&gt;
Some log data&lt;br /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.003 cpu5:1043)WARNING: SCSI: 4541: Delaying failover to path vmhba32:7:3&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.004 cpu1:1025)SCSI: 5270: vml.020003000060060160a2a01a007eea3c745c6edd11524149442035: Cmd failed. Blocking device during path failover.&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.006 cpu2:1058)SCSI: 2741: Could not locate path to peer SP for CX SP B path vmhba32:7:3.&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.006 cpu2:1058)SCSI: 2741: Could not locate path to peer SP for CX SP B path vmhba32:7:3.&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.006 cpu2:1058)SCSI: 2308: Unmapped LUN state for DGC path vmhba32:7:3&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.007 cpu2:1058)SCSI: 2308: Unmapped LUN state for DGC path vmhba32:3:3&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.007 cpu2:1058)WARNING: SCSI: 4559: Manual switchover to path vmhba32:7:3 begins.&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.007 cpu2:1058)SCSI: 2308: Unmapped LUN state for DGC path vmhba32:7:3&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.007 cpu2:1058)WARNING: SCSI: 3743: Could not switchover to vmhba32:7:3. Check Unit Ready Command returned an error instead of NOT READY for standby controller .&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.007 cpu2:1058)WARNING: SCSI: 4619: Manual switchover to vmhba32:7:3 completed unsuccessfully.&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.010 cpu5:1057)SCSI: 2741: Could not locate path to peer SP for CX SP B path vmhba32:7:3.&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.010 cpu5:1057)SCSI: 2741: Could not locate path to peer SP for CX SP B path vmhba32:7:3.&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.010 cpu5:1057)SCSI: 2308: Unmapped LUN state for DGC path vmhba32:7:3&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.010 cpu5:1057)SCSI: 2308: Unmapped LUN state for DGC path vmhba32:3:3&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.010 cpu5:1057)WARNING: SCSI: 4559: Manual switchover to path vmhba32:3:3 begins.&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.010 cpu5:1057)iSCSI: session 0xba402c0 eh_device_reset at 1589761539 for command 0x6636888 to (0 0 3 3), cdb 0x0&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.010 cpu2:1081)iSCSI: session 0xba402c0 requested target reset for (0 0 3 *), warm reset itt 25080319 at 1589761539&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.016 cpu6:1082)iSCSI: session 0xba402c0 warm target reset success for mgmt 25080319 at 1589761539&lt;br /&gt;
Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.017 cpu2:1081)iSCSI: session 0xba402c0 (0 0 3 *) finished reset at 15897&lt;br /&gt;
&lt;p /&gt;
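To see at a glance which paths the vmkernel is complaining about, the log lines above can be tallied per path and message code. This is a rough Python sketch, assuming only the "SCSI: NNNN: message" shape visible in this excerpt; the function name summarize is made up for illustration, and the sample lines are copied from the log above with wrapped lines joined.&lt;br /&gt;

```python
import re
from collections import Counter

# Four entries copied from the vmkernel excerpt in this post.
LINES = [
    "Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.003 cpu5:1043)WARNING: SCSI: 4541: Delaying failover to path vmhba32:7:3",
    "Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.006 cpu2:1058)SCSI: 2308: Unmapped LUN state for DGC path vmhba32:7:3",
    "Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.007 cpu2:1058)SCSI: 2308: Unmapped LUN state for DGC path vmhba32:3:3",
    "Feb 21 14:01:03 tur-esx-dev1 vmkernel: 184:00:00:15.007 cpu2:1058)WARNING: SCSI: 3743: Could not switchover to vmhba32:7:3.",
]

# Numeric code that follows "SCSI:" in each message, as seen in the excerpt.
CODE_RE = re.compile(r"SCSI: (\d+):")
# First vmhbaA:T:L path name mentioned on the line.
PATH_RE = re.compile(r"vmhba\d+:\d+:\d+")


def summarize(lines):
    """Count (path, message code) pairs across vmkernel SCSI log lines."""
    counts = Counter()
    for line in lines:
        code = CODE_RE.search(line)
        path = PATH_RE.search(line)
        if code and path:
            counts[(path.group(0), code.group(1))] += 1
    return counts


for (path, code), n in sorted(summarize(LINES).items()):
    print(path, code, n)
```

Lines without a SCSI code or a path name, such as the iSCSI session resets, are simply skipped.&lt;br /&gt;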
&lt;p /&gt;
Some dmesg output&lt;br /&gt;
&lt;p /&gt;
&lt;p /&gt;
VMWARE: Device that would have been attached as scsi disk sda at scsi1, channel 0, id 2, lun 0&lt;br /&gt;
Has not been attached because this path is not active.&lt;br /&gt;
key = 0x2, asc = 0x4, ascq = 0x1&lt;br /&gt;
VMWARE: Device that would have been attached as scsi disk sda at scsi1, channel 0, id 2, lun 0&lt;br /&gt;
Has not been attached because it is a duplicate path or on a passive path &lt;br /&gt;
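&lt;p /&gt;
The key/asc/ascq triple in the dmesg output above is standard SCSI sense data, so it can be decoded with a small lookup. This is a hedged Python sketch: the sense key names are the standard ones, but the ASC/ASCQ table holds only the two pairs that come up in this thread (0x4/0x1 here, and the 3f/0xe mentioned in the replies), and the function name decode_sense is made up for illustration.&lt;br /&gt;

```python
import re

# Sense key names as defined by the SCSI Primary Commands standard (subset).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
}

# Only the two ASC/ASCQ pairs mentioned in this thread; not a full table.
ASC_ASCQ = {
    (0x04, 0x01): "LOGICAL UNIT IS IN PROCESS OF BECOMING READY",
    (0x3F, 0x0E): "REPORTED LUNS DATA HAS CHANGED",
}

# Matches the "key = 0x2, asc = 0x4, ascq = 0x1" form printed in dmesg above.
SENSE_RE = re.compile(
    r"key = (0x[0-9a-fA-F]+), asc = (0x[0-9a-fA-F]+), ascq = (0x[0-9a-fA-F]+)"
)


def decode_sense(line):
    """Return (sense key text, ASC/ASCQ text), or None if the line has no sense data."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(group, 16) for group in match.groups())
    key_text = SENSE_KEYS.get(key, "key 0x%x" % key)
    asc_text = ASC_ASCQ.get((asc, ascq), "asc/ascq 0x%02x/0x%02x" % (asc, ascq))
    return key_text, asc_text


print(decode_sense("key = 0x2, asc = 0x4, ascq = 0x1"))
```

For the line above this gives NOT READY with LOGICAL UNIT IS IN PROCESS OF BECOMING READY. Unknown pairs fall back to their hex values rather than guessing.&lt;br /&gt;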
&lt;p /&gt;
I have never dealt with such an issue so any pointers would be appreciated. &lt;br /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;p /&gt;
&lt;br /&gt;
Cheers</description>
      <pubDate>Mon, 23 Feb 2009 01:10:18 GMT</pubDate>
      <author>AllBlack</author>
      <guid>http://communities.vmware.com/message/1178985?tstart=0#1178985</guid>
      <dc:date>2009-02-23T01:10:18Z</dc:date>
      <clearspace:dateToText>9 months, 5 days ago</clearspace:dateToText>
      <clearspace:replyCount>8</clearspace:replyCount>
    </item>
  </channel>
</rss>

