VMware Cloud Community
LiamA2013
Contributor

ESXi 5.1 Hangs after a few hours to a few days of running

Hi, I am hoping someone can help with the issue I am currently facing.

The machine we built for ESXi 5.1 is as follows:

Gigabyte GA-970A-D3 motherboard

AMD FX-8350 (Eight Core) CPU

32GB Vengeance RAM

4TB HDD (2 x 2TB)

BIOS settings are default apart from the virtualisation and boot options.

The install went perfectly with no issues at all. However, after anywhere from a couple of hours to a few days of running, the system just hangs. Going to the console and pressing the various key options doesn't do anything. There isn't anything in the log files, so it looks like the system just froze. The only way out of it is to turn the machine off and back on.

Can anyone help?

17 Replies
Cooldude09
Commander

Check the logs to see if there are any errors.

Also connect with the vSphere Client and check the CPU graph to see if it helps.

If U find my answer useful, feel free to give points by clicking Helpful or Correct.

Subscribe yourself at walkonblock.com

LiamA2013
Contributor

Hi, and thanks for the input, Cooldude09. CPU, RAM and so on all look OK in the performance stats.

I can see the following errors in the logs, but I'm not sure what they mean:

Completing Region/Field/Buffer/Package initialization:....................................................................................................ACPI Error (dsobject-0207): [LNKC]0:00:00:04.334 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKD]0:00:00:04.334 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKA]0:00:00:04.334 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKB]0:00:00:04.334 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
..ACPI Error (dsobject-0207): [LNKD]0:00:00:04.334 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKA]0:00:00:04.334 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKB]0:00:00:04.334 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKC]0:00:00:04.334 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
..ACPI Error (dsobject-0207): [LNKA]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKB]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKC]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKD]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
..ACPI Error (dsobject-0207): [LNKB]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKC]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKD]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKA]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
..ACPI Error (dsobject-0207): [LNKC]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKD]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKA]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKB]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
..ACPI Error (dsobject-0207): [LNKD]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKA]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKB]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKC]0:00:00:04.335 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
..ACPI Error (dsobject-0207): [LNKB]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKC]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKD]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKA]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
..ACPI Error (dsobject-0207): [LNKC]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKD]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKA]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKB]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
..ACPI Error (dsobject-0207): [LNKD]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKA]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKB]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKC]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
..ACPI Error (dsobject-0207): [LNKA]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKB]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKC]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKD]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
..ACPI Error (dsobject-0207): [LNKB]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKC]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKD]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKA]0:00:00:04.336 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
......................................................0:00:00:04.338 cpu0:8192)
Initialized 33/33 Regions 2/2 Fields 30/30 Buffers 109/109 Packages (866 nodes)
Initializing Device/Processor/Thermal objects by executing _INI methods:..0:00:00:04.342 cpu0:8192)
Executed 2 _INI methods requiring 1 _STA executions (examined 79 objects)
evgpeblk-1210 [03] EvInitializeGpeBlock  : 0:00:00:04.343 cpu0:8192)Found 9 Wake, Enabled 0 Runtime GPEs in this block
0:00:00:04.343 cpu0:8192)Device: 527: Registered device: p=0x417fc1cbabe0 0x41000e669020 \_SB_.PCI0 PNP0A03 bd=0x4100018b18a0
0:00:00:04.343 cpu0:8192)VMKAcpi: 1320: Registered root bridge (0 0 0).
0:00:00:04.344 cpu0:8192)VMKAcpi: 1240: Ignoring disabled device.
0:00:00:04.345 cpu0:8192)VMKAcpi: 1240: Ignoring disabled device.
0:00:00:04.346 cpu0:8192)VMKAcpi: 1240: Ignoring disabled device.
0:00:00:04.348 cpu0:8192)VMKAcpi: 1240: Ignoring disabled device.
0:00:00:04.348 cpu0:8192)VMKAcpi: 1240: Ignoring disabled device.
0:00:00:04.348 cpu0:8192)VMKAcpi: 1240: Ignoring disabled device.
0:00:00:04.349 cpu0:8192)VMKAcpi: 1238: Ignoring disabled device 3.
0:00:00:04.349 cpu0:8192)VMKAcpi: 1399: 1 root bridges found, 16 pci-pci bridges found, 3 max depth
0:00:00:04.349 cpu0:8192)VMKAcpi: 418: ISA irq 2 uses ioapicID 8, intIn 2 which is already present with trigger=1, polarity=2
0:00:00:04.349 cpu0:8192)VMKAcpi: 264: Printing ACPI irq routing information

The section below is the last part of the log before the hang:

0:00:00:04.360 cpu0:8192)PCI: 3542:   irq 3 vector 0x98
0:00:00:04.360 cpu0:8192)Device: 527: Registered device: p=0x41000e669110 0x41000e669440 00:00:16.0 1002:4397 1458:5004 bd=0x410001910ac0
0:00:00:04.360 cpu0:8192)VMK_PCI: 317: device 00:00:16.0 event: Device inserted: new owner module
0:00:00:04.360 cpu0:8192)PCI: 3527: 00:00:16.2 1002:4396 1458:5004 added
0:00:00:04.360 cpu0:8192)PCI: 3529:   classCode 0c03 progIFRevID 2000
0:00:00:04.360 cpu0:8192)PCI: 3533:   intPIN B intLine 10
0:00:00:04.360 cpu0:8192)Chipset: 404: 00:16 B busIRQ= 89 on 00-17
0:00:00:04.360 cpu0:8192)PCI: 3542:   irq 10 vector 0x78
0:00:00:04.360 cpu0:8192)Device: 527: Registered device: p=0x41000e669110 0x41000e669440 00:00:16.2 1002:4396 1458:5004 bd=0x410001910e00
0:00:00:04.360 cpu0:8192)VMK_PCI: 317: device 00:00:16.2 event: Device inserted: new owner module
0:00:00:04.360 cpu0:8192)PCI: 3407: Device 00:00:18.0 is disabled by the BIOS
0:00:00:04.360 cpu0:8192)PCI: 3407: Device 00:00:18.1 is disabled by the BIOS
0:00:00:04.360 cpu0:8192)PCI: 3407: Device 00:00:18.2 is disabled by the BIOS
0:00:00:04.360 cpu0:8192)PCI: 3407: Device 00:00:18.3 is disabled by the BIOS
0:00:00:04.360 cpu0:8192)PCI: 3407: Device 00:00:18.4 is disabled by the BIOS
0:00:00:04.360 cpu0:8192)PCI: 3407: Device 00:00:18.5 is disabled by the BIOS
0:00:00:04.361 cpu0:8192)PCI: 3527: 00:00:18.0 1022:1600 0000:0000 added
0:00:00:04.361 cpu0:8192)PCI: 3529:   classCode 0600 progIFRevID 0000
0:00:00:04.361 cpu0:8192)Device: 527: Registered device: p=0x41000e669110 0x41000e669440 00:00:18.0 1022:1600 0000:0000 bd=0x4100019112a0
0:00:00:04.361 cpu0:8192)VMK_PCI: 317: device 00:00:18.0 event: Device inserted: new owner vmkernel
0:00:00:04.361 cpu0:8192)PCI: 3527: 00:00:18.1 1022:1601 0000:0000 added
0:00:00:04.361 cpu0:8192)PCI: 3529:   classCode 0600 progIFRevID 0000
0:00:00:04.361 cpu0:8192)Device: 527: Registered device: p=0x41000e669110 0x41000e669440 00:00:18.1 1022:1601 0000:0000 bd=0x410001911600
0:00:00:04.361 cpu0:8192)VMK_PCI: 317: device 00:00:18.1 event: Device inserted: new owner vmkernel
0:00:00:04.361 cpu0:8192)PCI: 3527: 00:00:18.2 1022:1602 0000:0000 added
0:00:00:04.361 cpu0:8192)PCI: 3529:   classCode 0600 progIFRevID 0000
0:00:00:04.361 cpu0:8192)Device: 527: Registered device: p=0x41000e669110 0x41000e669440 00:00:18.2 1022:1602 0000:0000 bd=0x410001911940
0:00:00:04.361 cpu0:8192)VMK_PCI: 317: device 00:00:18.2 event: Device inserted: new owner vmkernel
0:00:00:04.361 cpu0:8192)PCI: 3527: 00:00:18.3 1022:1603 0000:0000 added
0:00:00:04.361 cpu0:8192)PCI: 3529:   classCode 0600 progIFRevID 0000
0:00:00:04.361 cpu0:8192)Device: 527: Registered device: p=0x41000e669110 0x41000e669440 00:00:18.3 1022:1603 0000:0000 bd=0x410001911c80
0:00:00:04.361 cpu0:8192)VMK_PCI: 317: device 00:00:18.3 event: Device inserted: new owner vmkernel
0:00:00:04.361 cpu0:8192)PCI: 3527: 00:00:18.4 1022:1604 0000:0000 added
0:00:00:04.361 cpu0:8192)PCI: 3529:   classCode 0600 progIFRevID 0000
0:00:00:04.361 cpu0:8192)Device: 527: Registered device: p=0x41000e669110 0x41000e669440 00:00:18.4 1022:1604 0000:0000 bd=0x410001911fe0
0:00:00:04.361 cpu0:8192)VMK_PCI: 317: device 00:00:18.4 event: Device inserted: new owner vmkernel
0:00:00:04.361 cpu0:8192)PCI: 3527: 00:00:18.5 1022:1605 0000:0000 added
0:00:00:04.361 cpu0:8192)PCI: 3529:   classCode 0600 progIFRevID 0000
0:00:00:04.361 cpu0:8192)Device: 527: Registered device: p=0x41000e669110 0x41000e669440 00:00:18.5 1022:1605 0000:0000 bd=0x410001912320
0:00:00:04.361 cpu0:8192)VMK_PCI: 317: device 00:00:18.5 event: Device inserted: new owner vmkernel
0:00:00:04.361 cpu0:8192)Device: 196: Found driver pci for device 0x41000e669110
0:00:00:04.361 cpu0:8192)PCI: 3819: 00:03:07.0 to 4
0:00:00:04.361 cpu0:8192)VMK_PCI: 317: device 00:03:07.0 event: Device changed ownership: new owner vmkernel
0:00:00:04.361 cpu0:8192)HPET: 456: HPET timer 0 capabilities 0xc0000000000010
0:00:00:04.361 cpu0:8192)IRQ: 233: 0x28 <hpet> exclusive, flags 0x3
0:00:00:04.361 cpu0:8192)IOAPIC: 1277: 0x28 retriggerred
0:00:00:04.361 cpu0:8192)HPET: 243: Got an HPET interrupt, counter = 62893887
0:00:00:04.364 cpu0:8192)HPET: 307: 1000 calls to HPET_SetTimeout(hpetHz) took 9421661 TSC cycles
0:00:00:04.374 cpu0:8192)HPET: 243: Got an HPET interrupt, counter = 63084987
0:00:00:04.394 cpu0:8192)HPET: 456: HPET timer 0 capabilities 0xc0000000000014
0:00:00:04.394 cpu0:8192)IRQ: 233: 0x28 <hpet> exclusive, flags 0x3
0:00:00:04.394 cpu0:8192)GlobalTimer: 78: GlobalTimer service available
0:00:00:04.394 cpu0:8192)Initializing Power Management ...
0:00:00:04.396 cpu0:8192)Power: 2648: AMD Enhanced PowerNow(R) detected on this system
0:00:00:04.398 cpu0:8192)Power: 2366: Current power management policy was set to "dynamic"
0:00:00:04.398 cpu0:8192)MCE: 635: Fixed 1 MCE bank/CPU-package ownership settings
0:00:00:04.398 cpu0:8192)CpuSched: 12114: Reset scheduler statistics
0:00:00:04.398 cpu0:8192)Init: 862: Vmkernel initialization done.  Returning to console.
0:00:00:04.398 cpu0:8192)VMKernel loaded successfully.

Thanks for your help.

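For anyone reading a paste like the one above, a small script can make two things easier to see: how many ACPI errors there are per link object, and what uptime range the lines cover. This is a minimal sketch in Python, not a VMware tool. It assumes the leading field on each line is the host uptime in D:HH:MM:SS.mmm form, as the pasted lines appear to show; `summarize`, `STAMP` and `ACPI_ERR` are names made up for the example.

```python
import re
from collections import Counter

# Assumed line format, based only on the paste above:
#   "<D>:<HH>:<MM>:<SS>.<mmm> cpu<N>:<world>)" is an uptime stamp.
STAMP = re.compile(r"(\d+):(\d{2}):(\d{2}):(\d{2})\.(\d{3}) cpu\d+:\d+\)")
# Captures the object name in brackets, e.g. LNKA, from an ACPI error line.
ACPI_ERR = re.compile(r"ACPI Error \([\w-]+\): \[(\w+)\]")


def summarize(log_text):
    """Return (ACPI error count per object, (first, last) uptime in ms)."""
    errors = Counter(ACPI_ERR.findall(log_text))
    uptimes_ms = [
        (((int(d) * 24 + int(h)) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)
        for d, h, m, s, ms in STAMP.findall(log_text)
    ]
    span = (min(uptimes_ms), max(uptimes_ms)) if uptimes_ms else None
    return errors, span


# Three lines copied from the paste above, used as sample input.
sample = """\
ACPI Error (dsobject-0207): [LNKC]0:00:00:04.334 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
ACPI Error (dsobject-0207): [LNKD]0:00:00:04.334 cpu0:8192) Namespace lookup failure, AE_NOT_FOUND
0:00:00:04.398 cpu0:8192)VMKernel loaded successfully.
"""
errors, span = summarize(sample)
print(dict(errors), span)
```

If that uptime assumption holds, every stamp in the paste above falls within the first five seconds after power-on and the paste ends with "VMKernel loaded successfully", which suggests these are boot-time messages rather than something logged at the moment of the hang.
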
Cooldude09
Commander

Looks like some kind of interrupt issue. Disable ACPI in the BIOS and you'll be fine.

LiamA2013
Contributor

Hey Cooldude09, thanks again for the info; your comments do make sense. My one issue is that my BIOS doesn't allow me to disable ACPI. Instead it has the following two options:

ACPI Suspend Type
Specifies the ACPI sleep state when the system enters suspend.
S1(POS) Enables the system to enter the ACPI S1 (Power on Suspend) sleep state.
In S1 sleep state, the system appears suspended and stays in a low power mode. The
system can be resumed at any time.
S3(STR) Enables the system to enter the ACPI S3 (Suspend to RAM) sleep state (default).
In S3 sleep state, the system appears to be off and consumes less power than in the S1
state. When signaled by a wake-up device or event, the system resumes to its working
state exactly where it was left off.

It's currently set to the default, which is S3(STR). Do you think the other option will help at all?

Thanks

Cooldude09
Commander

Thanks mate. Try the S1 state and see if it helps.

LiamA2013
Contributor

Cooldude09, the ACPI S1 state has fixed my hanging issue. However, I have just noticed one other issue that wasn't apparent before I changed the ACPI state.

The CPU installed has 8 cores, which were listed in the Summary tab of ESXi before the BIOS change. Now it's only showing 4 cores. Can you think of any reason for this?

Thanks

LiamA2013
Contributor

Hi guys, I think I spoke too soon. The machine is still hanging with the same errors, but ironically I think it's down to the SATA controller, as it doesn't seem to hang when no VMs are running.

I have logged a call with VMware, so I will post the results once I get them.

kashifkarar01
Enthusiast

If you migrate the VMs to any other datastore do you notice the same issue?

LiamA2013
Contributor

Yeah, same issue :(

kashifkarar01
Enthusiast

I have observed similar behaviour, where the host logged an illegal vector error in the VMkernel and messages log files shortly before an HBA stopped responding to the driver. Try disabling interrupt remapping on the host and check if that helps.

The command is (the host needs a reboot for the change to take effect):

esxcli system settings kernel set --setting=iovDisableIR -v TRUE

LiamA2013
Contributor

Thanks for the suggestion, kashifkarar01.

I think the issue is related to the SATA controller: according to the VMware compatibility charts, the AMD SB700 chipset is supported, but mine is the AMD SB950, which isn't on the list yet.

I will give it a shot and let you know the outcome.

Thanks again

LiamA2013
Contributor

I've just noticed on the VMware site that the SATA controller chipset on the Gigabyte GA-970A-D3 motherboard, the AMD SB950, is not supported. ESXi apparently only supports the AMD SB700. Do you think this could also be causing the system hangs?

Thanks

JonaR01
Contributor

This looks like exactly the same issue I have seen on Cisco servers. It is an interrupt problem and requires changing a setting in ESXi, not in the BIOS. See: https://supportforums.cisco.com/docs/DOC-23667 This worked perfectly on the Cisco servers that had the problem. Hope this helps.

robinsedman
Contributor

Hi,

I hope it is okay to borrow your thread.

It is possible that I have the same problem, but mine freezes after maybe five minutes to one hour.

Is there any document that specifies which SATA controllers are supported and which are not?

I would be very happy if anyone knew.

Thanks!

LiamA2013
Contributor

Hi guys, well, I ran the command:

esxcli system settings kernel set --setting=iovDisableIR -v TRUE

This worked for about a week, but now the system is hanging again and I can't connect to it. I will be looking at the logs later today, but does anyone know if running the above command can lead to other issues?

Many thanks

CB4009
Contributor

I had a very similar issue to the one you encountered, but I'm using a GA-990FXA-UD3 board with an FX8130 processor. I had 4 NICs (2 x e1000e, 1 x e1000, and the onboard Realtek) and I'd experience lockups just like yours. I was able to induce the lockups by having multiple VMs transfer large files (2+ GB) at the same time. I traced the problem to the PCI slot on my board: as soon as I removed the Intel e1000 card, the system was rock solid for over a month. I reintroduced the e1000 last night and had 2 lockups, and once the system just powered off completely. I removed the card again and it's running like a champ. I hope you're able to isolate your issue. I know how frustrating these lockups are, so I wanted to lend my perspective.

In my case it's either a bad PCI slot, an IRQ sharing issue, an ESXi configuration issue from running both e1000 and e1000e in the same machine, or a bad card (I don't think it's the card: I tried 2 different Intel cards and the system crashed with either one in the PCI slot).
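
If you want to try a similar kind of load in a controlled way, here is a rough sketch in Python that writes and copies several large files at once. It is only an assumption that this resembles the load described above: it exercises the disk path inside one guest, not the network transfers between VMs, and `run_load`, `write_and_copy` and the sizes are made up for illustration.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1024 * 1024  # write in 1 MiB pieces


def write_and_copy(directory, index, size_bytes):
    """Write one file of size_bytes, copy it, and return the copy's size."""
    src = os.path.join(directory, f"load-{index}.bin")
    dst = src + ".copy"
    remaining = size_bytes
    with open(src, "wb") as f:
        while remaining > 0:
            n = min(CHUNK, remaining)
            f.write(os.urandom(n))
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to the disk
    shutil.copyfile(src, dst)
    return os.path.getsize(dst)


def run_load(directory, workers, size_bytes):
    """Run several write-and-copy jobs at once; returns the size of each copy."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = [
            pool.submit(write_and_copy, directory, i, size_bytes)
            for i in range(workers)
        ]
        return [job.result() for job in jobs]


if __name__ == "__main__":
    # Tiny demo sizes so this finishes quickly; scale up for a real test.
    with tempfile.TemporaryDirectory() as scratch:
        print(run_load(scratch, workers=4, size_bytes=4 * CHUNK))
```

For a real test you would point it at a scratch directory on the disk you suspect and pass something like `size_bytes=2 * 1024**3` per worker, making sure there is enough free space for each file and its copy.
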

work4coffee
Contributor

Were you receiving any specific errors in the log files that pointed you to the NIC cards?
