Re: ESX 3.5 Update 2 - vmkernel messages FileIO fa... - Page 2

bigdee · ‎07-30-2008

Hi gents

I updated four of our ESX hosts to 3.5 Update 2. The hosts all had ESX 3.5 Update 1 installed with the latest patches. After a reboot of the host the following entries are added to the vmkernel logfile for all physical CPUs:

Jul 30 13:29:57 svstr18003 vmkernel: 0:00:02:23.455 cpu2:1035)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Jul 30 13:29:57 svstr18003 vmkernel: 0:00:02:24.414 cpu2:1036)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Jul 30 13:29:57 svstr18003 vmkernel: 0:00:02:24.504 cpu2:1035)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Jul 30 13:29:57 svstr18003 vmkernel: 0:00:02:24.626 cpu3:1037)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

...
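For anyone who wants to see which CPUs are logging these entries, here is a minimal Python sketch that tallies them. It assumes the line layout shown in the excerpt above. The regular expression and the function name `count_fileio_failures` are made up for illustration, and reading the number after `cpuN:` as a world ID is my assumption.

```python
import re
from collections import Counter

# Pattern assumed from the excerpt above, e.g.
#   "... vmkernel: 0:00:02:23.455 cpu2:1035)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)"
LINE_RE = re.compile(
    r"vmkernel: (?P<uptime>[\d:.]+) cpu(?P<cpu>\d+):(?P<world>\d+)\)"
    r"BC: \d+: FileIO failed with (?P<status>\S+?)\((?P<reason>[^)]*)\)"
)

def count_fileio_failures(lines):
    """Return a Counter keyed by (cpu, world) for every matching line."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            counts[(int(m.group("cpu")), int(m.group("world")))] += 1
    return counts

# The four lines quoted in this post.
sample = [
    "Jul 30 13:29:57 svstr18003 vmkernel: 0:00:02:23.455 cpu2:1035)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)",
    "Jul 30 13:29:57 svstr18003 vmkernel: 0:00:02:24.414 cpu2:1036)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)",
    "Jul 30 13:29:57 svstr18003 vmkernel: 0:00:02:24.504 cpu2:1035)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)",
    "Jul 30 13:29:57 svstr18003 vmkernel: 0:00:02:24.626 cpu3:1037)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)",
]

for (cpu, world), n in sorted(count_fileio_failures(sample).items()):
    print(f"cpu{cpu} world {world}: {n}")
```

To run it against a real log, pass an open file object instead of `sample`; the path to the vmkernel log on your host is not assumed here.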

Besides that, I have noticed no other issues. The hosts seem to run fine!

Has anyone experienced something similar, or does anyone know what these messages mean?

Thanks

CU

pboguszewski · ‎08-08-2008

I opened a service request on this issue.

I am wondering if anyone else is seeing guest OS (Win2003 R2 x64 servers) errors in conjunction with the FileIO errors. The errors are "The device, \Device\Scsi\symmpi2, is not ready for access yet" and "The driver detected a controller error on \Device\Harddisk2\DR2". These guest OS errors only show up on the virtual machines that have massive storage on the SAN trays that run SATA disks (each of these has 2 to 20 TB of disks mapped to local drives in the guest OS). All our other guests run on trays that have FC disks and are not seeing these issues. For full disclosure, I will mention that the guests on the FC disks do not have massive numbers of disks attached to them and do not see as much traffic. I am not sure this is related, but I want to see if anyone else is seeing what we are seeing.

Thanks,

Pete

aworkman · ‎08-13-2008

I too have noticed that all of the VMs that share one of my NetApp aggregates will periodically log the same errors: "The device, \Device\Scsi\symmpi2, is not ready for access yet" and "The driver detected a controller error on \Device\Harddisk2\DR2". This occurs at exactly the same time for all VMs sharing this storage. It is not isolated to a single host in my cluster; it happens on all VMs across all hosts in the cluster that use this aggregate. I do not have VMs on other aggregates, so I can't say whether it is also a problem on other aggregates. The VM disks are not large at all, 15-30 GB. The volumes on the aggregate are in the 1-3 TB range, and the aggregate itself is 11 TB. I'm not sure what to think, and I am curious whether you could run into SCSI reservation issues with only 3 ESX hosts and 3-4 primary LUNs (there are others not used as frequently). Will reservation conflicts go up as the number of VMs increases? I'm also using A-SIS deduplication on the storage end to turn 3 TB worth of VMDKs into 300 GB.

Edit: I am using FCP across a dual fabric, with iSCSI as a backup path in case both fabrics fail while the Ethernet network is still up. I would assume the only two possible causes in this case are the actual disks or VMFS. It also seems peculiar that these issues only go back as far as our outage, when we brought down the entire disk subsystem and updated the firmware on the shelves to clear several erroneous shelf fault messages caused by a bug.

foam · ‎08-13-2008

I have the issue also. I only noticed it after being bitten by the Aug 12th license bug. It has happened on all of my 3.5 U2 servers.

Aug 13 17:25:56 caesars vmkernel: 0:00:02:42.843 cpu5:1040)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:25:57 caesars vmkernel: 0:00:02:43.015 cpu5:1040)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:25:57 caesars vmkernel: 0:00:02:43.102 cpu1:1041)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:25:57 caesars vmkernel: 0:00:02:43.188 cpu5:1040)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:25:57 caesars vmkernel: 0:00:02:43.273 cpu1:1041)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:26:09 caesars vmkernel: 0:00:02:55.373 cpu5:1040)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:26:10 caesars vmkernel: 0:00:02:56.283 cpu1:1041)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:26:10 caesars vmkernel: 0:00:02:56.532 cpu5:1040)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:26:10 caesars vmkernel: 0:00:02:56.690 cpu5:1040)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:26:12 caesars vmkernel: 0:00:02:58.596 cpu1:1041)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:26:12 caesars vmkernel: 0:00:02:58.821 cpu5:1040)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:26:13 caesars vmkernel: 0:00:02:58.993 cpu3:1039)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:26:13 caesars vmkernel: 0:00:02:59.082 cpu3:1039)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:26:13 caesars vmkernel: 0:00:02:59.163 cpu5:1040)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:26:13 caesars vmkernel: 0:00:02:59.287 cpu3:1039)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:26:13 caesars vmkernel: 0:00:02:59.382 cpu1:1041)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:26:13 caesars vmkernel: 0:00:02:59.842 cpu1:1041)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

Aug 13 17:26:15 caesars vmkernel: 0:00:03:01.394 cpu1:1041)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

I also got hit with a PCPU lockup pink screen: "Failed to ack TLB invalidate."

RParker · ‎08-18-2008

> Does anyone with Update 2 NOT have this issue?

We are seeing the same thing. We have 8 machines in a cluster (for hosts on the SAN) and there is only 1 with the issue. Update 2 Build 110181

So it's odd there is only 1 machine with the problem, since all 8 access the same VMFS volumes.

RParker · ‎08-18-2008

Any updates on this from VMware?

foam · ‎08-18-2008

When I was working through the August time bomb, VMware told me that it would be fixed in an upcoming patch. They said it was cosmetic. It has not caused us any issues since we put the "new" Update 2 on. I'm not sure I am 100% comfortable with that, but I have not had any other problems.

Nashwood · ‎08-18-2008

I, too, am experiencing this issue. The server hangs on reboot. This is the only thing that stands out as a problem that I can see. I am working on a new ESX build for our unattended installs, so this is a single server in a lab environment. The specifics:

Brand new install of ESX 3.5 Update 2 Build 110268 - 08/13/2008. Dell R900 with Emulex LPE 1150, 4GB, Optional Host Bus Adapter, PCI-E Card (341-4608). Fiber was disconnected during install and remains disconnected.

I will be opening an SR in a moment, but I wanted to know if earlier posts in this thread got an answer.

Thanks in advance.

hegars · ‎08-19-2008

We have also seen this problem on VMs hosted on ESX v3.5 U2. Our systems are in a test/dev lab, so we had the luxury of being able to do a fresh install of ESX v3.5 U1 and import the VMs back from the SAN storage.

The problem persists even now, which has us thinking that the VMware Tools may be the problem, as the VMs still have the v3.5 U2 VMware Tools installed.

So we've removed the VMware Tools via Add/Remove Programs, rebooted the VMs, and installed the v3.5 U1 version of VMware Tools. We'll wait and see if the problem rears its head again ... so far so good!

hegars · ‎08-26-2008

OK, none of the attempts described in my previous post made any difference in the end.

But we have found what could be the answer ...

We have >4TB of VMFS storage attached and were indeed running out of heap memory, which was eventually causing the symmpi errors. Once we increased heap memory from the default 16MB to 64MB, the problems went away. It may even be an idea to just configure heap memory to the max of 128MB and be done with it ... unless of course somebody here can give a reason why NOT to do that.

So far the errors have not reappeared ... if they do at a later stage, I will update this thread accordingly.

awinkel · ‎08-28-2008

After Update 2 I have the same errors in the vmkernel logfile.

Unfortunately, the "solution" described by hegars does not work for my ESX servers.

I'm running ESX 3.5 U2 on an HP DL380 G5 with HP SIM 8.1.

kaufmac · ‎09-02-2008

3.5u2 (110268) here. But no SAN, we use NFS to a Netapp Filer. Same Messages:

vmkernel: 18:22:28:40.046 cpu2:1037)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

awinkel · ‎09-08-2008

Does anyone have an update for this problem? I still have the same problem.

madliv · ‎09-08-2008

I have the same error:

HP DL360 G5, connected by fiber to an MSA1000. I do not have any VMs on it yet, so I am going to unplug the SAN and see what my logs say.

Darksun777 · ‎09-09-2008

Same issue on our ESX cluster. The error appears for all CPUs (1-4):

vmkernel: 18:22:28:40.046 cpu2:1037)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)

I haven't tried this solution, but it sounds very interesting. Maybe you guys want to have a look at it:

http://zealkabi.blogspot.com/2008/09/vmware-esx-server-30-and-35-esx-server.html

Greetings,

Christian

VirtualKenneth · ‎09-11-2008

Same issue here. I got it on BL460s with no SAN attached at all, just a clean scripted install with HP Management Agents 8.10.

I will try disabling the specific HP services as discussed in the article. I don't like VMware's answer that it's "cosmetic", since I don't like error messages at all.

nabsltd · ‎09-11-2008

I'm seeing the same error as everyone else, but most people seem to be running the HP management agent.

I'm not, as I don't even have HP hardware.

I'm seeing the same error on a SuperMicro server with no IPMI hardware or management software. Identical hardware and software without U2 installed does not have the error.

dmartushev · ‎09-14-2008

I was seeing the same issue as the previous posters, and I'm in the same boat as you, since I'm not running HP hardware either.

Scratch installations of ESX 3.5 UP2 Installable resulted in the same 0x0xbad0006 error. I filed a case with VMware and they said it was a known issue. When I pressed them for a KB or a date for a fix, I was told the KB was pending release and that the fix was currently scheduled for release with UP3. I pushed the matter further to get a workaround, so that my vmkernel logs weren't being blitzed out after a few hours, and was provided the following:

-

1) Add the following line to /etc/vmware/logfilters:

3 xbad0006

2) Restart the vmklogger:

killall -HUP vmklogger

This will result in vmklogger ignoring these messages after 3 instances.

-

Just make sure to start the vmklogger service up again (or just reboot the ESX host), or else after a while you'll notice your logs aren't populating with any information.
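To make the effect of that filter line concrete, here is a small Python sketch of the behaviour as described in the workaround: the first 3 lines containing the pattern get through, and the rest are dropped. It is only a model of that description, not vmklogger's actual code. `apply_logfilter` and its parameters are hypothetical names, and plain substring matching is an assumption.

```python
def apply_logfilter(lines, pattern="xbad0006", limit=3):
    """Yield lines, suppressing those that contain `pattern` once
    `limit` of them have already been passed through."""
    seen = 0
    for line in lines:
        if pattern in line:  # assumption: simple substring match
            seen += 1
            if seen > limit:
                continue  # suppressed, as the workaround describes
        yield line

# Five matching lines followed by one unrelated line.
log = (
    ["BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)"] * 5
    + ["some other vmkernel message"]
)

kept = list(apply_logfilter(log))
print(len(kept))
```

With this input, four lines survive: the first three matches plus the unrelated message. Unrelated lines are never counted or dropped.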

Oh and as of this morning, I was given the following KB link regarding this issue:

spex · ‎09-14-2008

Hello,

I will be back in the office on 17.09.2008.

For urgent matters, please contact Dr. Matthias Weger.

Your e-mail has not been forwarded.

Thank you

Stefan Holzwarth

GreyhoundHH · ‎09-25-2008

Hi!

I'm experiencing the same issue on two FSC RX200S4s with ESX 3.5u2 (+ latest patches). The only difference is that the event doesn't occur just every 2 minutes; I'm getting several events per second. The systems seem to work perfectly normally: they are two nodes of a four-node cluster, they can access their SAN, and VMotion works. So is there a reason to worry about this massive number of events, or is it still cosmetic?

I've added the logfilter as described to keep my logs clean, but I had to reboot the first server to get the vmklogger started again. Can anyone tell me how to start the vmklogger without rebooting? There is no such service under /etc/init.d/, and running /usr/sbin/vmklogger just results in a blank line...
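For comparing how often the message fires on different hosts, a rough Python sketch like the following could estimate the event rate from the uptime stamps. It assumes the stamp after `vmkernel:` reads as days:hours:minutes:seconds, which is my interpretation of the excerpts in this thread. `uptime_seconds` and `events_per_second` are illustrative names, and the sample lines are copied from the log posted earlier in this thread.

```python
import re

# Uptime stamp assumed to be days:hours:minutes:seconds.milliseconds.
UPTIME_RE = re.compile(
    r"vmkernel: (\d+):(\d+):(\d+):(\d+\.\d+) cpu\d+:\d+\)"
    r"BC: \d+: FileIO failed"
)

def uptime_seconds(line):
    """Return the uptime stamp of a FileIO failure line in seconds, or None."""
    m = UPTIME_RE.search(line)
    if m is None:
        return None
    days, hours, minutes, seconds = m.groups()
    return int(days) * 86400 + int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def events_per_second(lines):
    """Average rate over the span covered by the matching lines, or None."""
    stamps = sorted(t for t in map(uptime_seconds, lines) if t is not None)
    if len(stamps) < 2 or stamps[-1] == stamps[0]:
        return None  # not enough data to compute a rate
    return (len(stamps) - 1) / (stamps[-1] - stamps[0])

sample = [
    "Aug 13 17:25:56 caesars vmkernel: 0:00:02:42.843 cpu5:1040)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)",
    "Aug 13 17:25:57 caesars vmkernel: 0:00:02:43.015 cpu5:1040)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)",
    "Aug 13 17:25:57 caesars vmkernel: 0:00:02:43.102 cpu1:1041)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)",
    "Aug 13 17:25:57 caesars vmkernel: 0:00:02:43.188 cpu5:1040)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)",
    "Aug 13 17:25:57 caesars vmkernel: 0:00:02:43.273 cpu1:1041)BC: 814: FileIO failed with 0x0xbad0006(Limit exceeded)",
]

rate = events_per_second(sample)
print(f"{rate:.1f} events per second")
```

The rate is an average over the sampled span only, so a short burst will look much busier than a whole day's log would.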

thx

nabsltd · ‎10-09-2008

Try "/usr/sbin/vmklogger &". This puts it in the background, and is exactly what the vmware-late initscript does.


ESX 3.5 Update 2 - vmkernel messages FileIO failed with 0x0xbad0006(Limit exceeded)