Solved: FileIO failed with 0x0xbad0006(Limit exceeded) in ...

dtnvm · ‎06-25-2008

Hi,

We see repeated messages in the vmkernel logs of our ESX host:

Jun 25 05:42:33 cs500102 vmkernel: 11:13:30:24.439 cpu0:1036)BC: 810: FileIO failed with 0x0xbad0006(Limit exceeded)

I searched the forum and the KB but could not find anything useful. "vmkerrcode -l" does not give a more detailed explanation of the message. Which resource or device does the message refer to? All system and VMFS partitions still have space left; no partition is filled up. Is 1036 the world ID that generates the message? In this case 1036 is the ID of one of the helper processes, not a VM.

Any help?

Thanks,

Jan

Texiwill · ‎06-25-2008

Hello,

This could be a number of things relating to your file system. I cannot find much more documentation than you already have. In this case I would open a support case with your VMware Support Representative. What could help is more detail on the error, i.e. the surrounding lines of the logfile. Often there is other information on other lines, and we can decode the limit issue using that. My guess is that the I/O failed because some buffer limit was exceeded.

Best regards,

Edward L. Haletky

VMware Communities User Moderator

====

CIO Virtualization Blog: http://www.cio.com/blog/index/topic/168354

As well as the Virtualization Wiki at http://www.astroarch.com/wiki/index.php/Virtualization

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill

dtnvm · ‎06-25-2008

Hello Edward,

thanks for your fast reply! Here is some more vmkernel log:

Jun 20 16:25:54 cs500102 vmkernel: 7:00:13:40.129 cpu0:1039)FS3: 4828: Reclaimed timed out heartbeat [HB state abcdef02 offset 3175936 gen 66 stamp 605594682518 uuid 4852805b-7501bef5-409f-0019bbc92338 jrnl <FB 328789> drv 4.31]

Jun 20 19:01:05 cs500102 vmkernel: 7:02:48:51.534 cpu0:1112)SCSI: 4782: path vmhba1:5:1: Passing device status RESERVATION_CONFLICT (18) through

Jun 21 05:41:43 cs500102 vmkernel: 7:13:29:30.505 cpu2:1036)BC: 810: FileIO failed with 0x0xbad0006(Limit exceeded)

Jun 21 05:53:53 cs500102 vmkernel: 7:13:41:39.993 cpu0:1036)BC: 810: FileIO failed with 0x0xbad0006(Limit exceeded)

Jun 22 05:41:42 cs500102 vmkernel: 8:13:29:30.581 cpu2:1037)BC: 810: FileIO failed with 0x0xbad0006(Limit exceeded)

Jun 22 05:53:50 cs500102 vmkernel: 8:13:41:38.692 cpu2:1035)BC: 810: FileIO failed with 0x0xbad0006(Limit exceeded)

Jun 23 04:13:32 cs500102 vmkernel: 9:12:01:21.584 cpu0:1032)FS3: 1974: Checking if lock holders are live for lock [type 10c00002 offset 6967296 v 62688, hb offset 4011520

Jun 23 04:13:32 cs500102 vmkernel: gen 5296, mode 1, owner 48592b6a-7cfbe73a-236b-0017a4f614c3 mtime 383858]

Jun 23 04:18:39 cs500102 vmkernel: 9:12:06:28.171 cpu0:1032)FS3: 1974: Checking if lock holders are live for lock [type 10c00002 offset 6967296 v 62696, hb offset 4011520

Jun 23 04:18:39 cs500102 vmkernel: gen 5300, mode 1, owner 48592b6a-7cfbe73a-236b-0017a4f614c3 mtime 384165]

Jun 23 05:41:23 cs500102 vmkernel: 9:13:29:12.575 cpu0:1035)BC: 810: FileIO failed with 0x0xbad0006(Limit exceeded)

Jun 23 05:53:32 cs500102 vmkernel: 9:13:41:21.228 cpu0:1035)BC: 810: FileIO failed with 0x0xbad0006(Limit exceeded)

Jun 24 13:54:51 cs500102 vmkernel: 10:21:42:42.338 cpu0:1037)BC: 810: FileIO failed with 0x0xbad0006(Limit exceeded)

Jun 24 14:16:37 cs500102 vmkernel: 10:22:04:27.625 cpu1:1032)FS3: 1974: Checking if lock holders are live for lock [type 10c00002 offset 6967296 v 65876, hb offset 4011520

Jun 24 14:16:37 cs500102 vmkernel: gen 6962, mode 1, owner 48592b6a-7cfbe73a-236b-0017a4f614c3 mtime 506438]

Jun 24 14:26:52 cs500102 vmkernel: 10:22:14:42.720 cpu3:1032)FS3: 1974: Checking if lock holders are live for lock [type 10c00002 offset 6967296 v 65892, hb offset 4011520

Jun 24 14:26:52 cs500102 vmkernel: gen 6970, mode 1, owner 48592b6a-7cfbe73a-236b-0017a4f614c3 mtime 507052]

Jun 25 05:42:33 cs500102 vmkernel: 11:13:30:24.439 cpu0:1036)BC: 810: FileIO failed with 0x0xbad0006(Limit exceeded)

Jun 25 13:31:44 cs500102 vmkernel: 11:21:19:39.677 cpu0:1032)FS3: 1974: Checking if lock holders are live for lock [type 10c00002 offset 8924160 v 5180, hb offset 3458048

Jun 25 13:31:44 cs500102 vmkernel: gen 23545, mode 1, owner 48523e50-e10c6b55-d817-0019bbce5ffe mtime 1044088]

Jun 25 13:34:20 cs500102 vmkernel: 11:21:22:15.479 cpu0:1036)VSCSI: 2803: Reset request on handle 8202 (1 outstanding commands)

Jun 25 13:34:20 cs500102 vmkernel: 11:21:22:15.479 cpu1:1056)VSCSI: 3019: Resetting handle 8202

Jun 25 13:34:20 cs500102 vmkernel: 11:21:22:15.479 cpu1:1056)<4>lpfc0:0712:FPe:SCSI layer issued abort device Data: x5 x1

Jun 25 13:34:20 cs500102 vmkernel: 11:21:22:15.621 cpu3:1107)WARNING: VSCSI: vm 1108: 5847: closing handle 0x200a with 2 pending cmds, 2 ref count

Jun 25 13:34:21 cs500102 vmkernel: 11:21:22:16.743 cpu1:1056)<4>lpfc0:0749:FPe:Completed Abort Task Set Data: x5 x1 x128

Jun 25 13:34:21 cs500102 vmkernel: 11:21:22:17.144 cpu2:1056)VSCSI: 2871: Completing reset on handle 8202 (0 outstanding commands)

Jun 25 13:34:22 cs500102 vmkernel: 11:21:22:17.480 cpu2:1037)WARNING: VSCSI: 2837: Ignoring double reset on handle 8202

Jun 25 17:14:36 cs500102 vmkernel: 12:01:02:31.962 cpu3:1032)FS3: 1974: Checking if lock holders are live for lock [type 10c00002 offset 6967296 v 68520, hb offset 3294720

Jun 25 17:14:36 cs500102 vmkernel: gen 420, mode 1, owner 486206c3-a9ba2167-407c-0017a4f614c3 mtime 23085]

Usually there is not much going on before the "Limit exceeded" error gets logged. Sometimes it is preceded by "RESERVATION_CONFLICT" messages a few minutes earlier, but sometimes the vmkernel log has no entries for hours and then the "Limit exceeded" message appears suddenly.

The vmkwarning log does not show anything, not a single entry for days.

Jan
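One way to check the pattern described above is to measure, for every "Limit exceeded" entry, how long ago the last RESERVATION_CONFLICT was logged. The following is only a rough sketch: the function names are made up for illustration, the year is assumed to be 2008 because syslog lines carry no year, and the number after "cpuN:" is captured as `ident` on the unconfirmed assumption that it is the world ID asked about earlier in the thread.

```python
import re
from datetime import datetime

# Syslog prefix plus the vmkernel "uptime cpuN:ID)" field, as seen in the
# excerpts above. Continuation lines without that field do not match.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) vmkernel: "
    r"(?P<uptime>[\d:.]+) cpu(?P<cpu>\d+):(?P<ident>\d+)\)(?P<msg>.*)$"
)


def parse_line(line, year=2008):
    """Split one vmkernel log line into fields; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # The year is an assumption: syslog timestamps do not include one.
    when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
    return {
        "when": when,
        "host": m["host"],
        "cpu": int(m["cpu"]),
        "ident": int(m["ident"]),  # assumed to be the world ID
        "msg": m["msg"],
    }


def gaps_since_conflict(lines, year=2008):
    """For each 'Limit exceeded' entry, report the time since the last
    RESERVATION_CONFLICT entry (None if no conflict was seen before it)."""
    last_conflict = None
    result = []
    for line in lines:
        rec = parse_line(line, year)
        if rec is None:
            continue
        if "RESERVATION_CONFLICT" in rec["msg"]:
            last_conflict = rec["when"]
        elif "Limit exceeded" in rec["msg"]:
            gap = rec["when"] - last_conflict if last_conflict else None
            result.append((rec["when"], rec["ident"], gap))
    return result
```

Feeding it the lines of the vmkernel log (for example `gaps_since_conflict(open(path))`, with `path` pointing at your own log file) gives a list that shows at a glance whether the two messages really travel together or whether "Limit exceeded" also appears on its own.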

Texiwill · ‎06-25-2008

Hello,

The limit refers to the reservation conflict limit. You are performing some actions that cause a reservation conflict, which is why you see them in the logs. Here is an excerpt from my book which discusses the causes of reservation conflicts.

Best regards,

Edward L. Haletky

dtnvm · ‎06-26-2008

Hi Edward,

How can you tell that "Limit exceeded" refers to reservation conflicts? Is there any more detail on this? As I already mentioned, sometimes I see the "Limit exceeded" error without a reservation conflict message beforehand.

While searching for information on this issue I already found your article on VMFS reservations. There is no scripting or third-party software installed on the ESX hosts apart from the HP Insight Agents. I cannot think of any massively parallel actions that perform VMFS metadata updates. We see this on two different clusters, one with 8 ESX hosts and about 60 running VMs, the other with 2 hosts and only 8 VMs.

We will open a support call on this as you suggested.

Jan

Texiwill · ‎06-26-2008

Hello,

Massive is a misnomer: it just has to be two simultaneous actions, depending on the array in use and the volume of traffic over the LUNs in question. I would review the logfile and try to determine WHAT is happening at the time the limit is exceeded, as well as when the reservation conflicts occur.

Within the VIC you have a timestamp for just about everything; map that time to the logfile and then look at your SAN/NAS logfile for the same time. Perhaps the cause is not within ESX but in the SAN/NAS device. Be sure you understand the time differences between the systems, i.e. your VC could be out of time sync with the SAN/NAS, which could be out of time sync with the ESX server. Otherwise the timestamps in the logfiles will mislead you.

At the moment you are looking for the cause of the problem by reviewing the logfiles to determine any correlation of actions. For example: you did a vMotion at time X, the VMkernel log shows certain errors or warnings, and the SAN/NAS log shows other issues or none at all. Once you have that information you can get an idea of what is happening.

SCSI subsystem issues like this are extremely hard to pinpoint as it takes time investigating all the logs.
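The correlation described above, including the warning about clocks that are out of sync, could be sketched roughly as follows. Everything here is illustrative: `correlate`, its parameters, and the idea of feeding it (timestamp, text) pairs are assumptions, not part of any VMware or SAN tooling, and the clock offset has to be measured by hand for each pair of systems.

```python
from datetime import timedelta


def correlate(esx_events, other_events,
              other_offset=timedelta(0), window=timedelta(minutes=5)):
    """Pair ESX log events with events from another log (VC, SAN/NAS).

    esx_events, other_events: lists of (datetime, text) tuples.
    other_offset: how far the other system's clock runs AHEAD of the
                  ESX clock (negative if it runs behind).
    window:       maximum distance between two events to call them related.
    """
    matches = []
    for esx_time, esx_text in esx_events:
        for other_time, other_text in other_events:
            # Translate the other system's timestamp into ESX time first,
            # otherwise the comparison is skewed by the clock difference.
            corrected = other_time - other_offset
            if abs(corrected - esx_time) <= window:
                matches.append((esx_time, esx_text, corrected, other_text))
    return matches
```

Starting with a generous window and tightening it once the offsets are known keeps chance matches down. The nested loop is fine for a few hundred entries per log; larger logs would want sorting and a sliding window instead.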

Best regards,

Edward L. Haletky

Cruicer · ‎07-31-2008

Have you had any success with your issue? I am seeing it now on a brand new ESX host with only about 14 VMs, while merely attempting to push out VMtools to the VMs. I assume you are using SAN storage? If so, what type: EVA or XP?

mpverr · ‎08-02-2008

I'm just starting to see this as well - right when the host boots up

bobross · ‎08-02-2008

Both clustering and multiple datastores on the same LUN can and do cause res conflicts. Looking at your log, whatever process happens regularly(?) on that VM on that server is getting reported multiple days at the same time (05:41). Have a look at what processes are going on at or before 05:41. I suspect some heavy I/O activity, such as backups or other file-sweeping activity is causing the res conflict for that particular LUN. Best practice is to segregate datastores on individual LUNs. With some SAN systems, creating individual LUNs and automatically formatting VMFS for a datastore is bone simple; others, it's a giant mess. Good luck.
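The observation that the error recurs at the same time of day can be checked by bucketing the matching log lines by time of day and ignoring the date. This is a small sketch under assumptions: the function name, the 15-minute bucket size, and the default search string are all arbitrary choices made for illustration.

```python
import re
from collections import Counter

# Hour and minute from the syslog timestamp at the start of each line.
STAMP_RE = re.compile(r"^\w{3}\s+\d+ (\d{2}):(\d{2}):\d{2} ")


def time_of_day_histogram(lines, needle="Limit exceeded", bucket_minutes=15):
    """Count lines containing `needle` per time-of-day bucket, across all days."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        m = STAMP_RE.match(line)
        if m is None:
            continue
        minutes = int(m.group(1)) * 60 + int(m.group(2))
        start = minutes - minutes % bucket_minutes  # round down to bucket start
        counts[f"{start // 60:02d}:{start % 60:02d}"] += 1
    return counts
```

A bucket that fills up on several different days points at a scheduled job (backup, agent poll, cron task) that is worth looking for on the host or on the storage side.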

dtnvm · ‎08-05-2008

Cruiser,

Indeed, our problem has been identified: the HP SIM Agents were causing the conflicts. After disabling the cmahost agent, the reservation conflict messages stopped. VMware created a KB article for this issue (1004771), here is the link:

By the way, the SAN is an EVA.

Regards,

Jan
