Hey all,
I am running ESXi 4.1 on an Intel server and I get PSODs whenever I perform certain disk operations inside a guest. It seems to happen when I partition a disk, or occasionally when writing to disk.
It happens on both thin and thick provisioned disks.
It ONLY happens with the LSI Logic Parallel controller, not LSI Logic SAS.
I have attached the purple screen I get. It sucks that a Linux guest's disk operation can crash an entire ESXi host with 15 guests on it. The guest's disk operations SHOULD be irrelevant to the host.
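If it helps anyone reproduce this, the virtual controller type is just a line in the guest's .vmx file; here's a quick way to check which type a guest is using (the datastore path and VM name below are only examples):

# Show which virtual SCSI controller the guest is configured with
# (assuming scsi0 is the first/only controller; adjust the path to your VM)
grep -i 'scsi0.virtualDev' /vmfs/volumes/datastore1/LinuxGuest/LinuxGuest.vmx
#   "lsilogic"   = LSI Logic Parallel (the config that PSODs for me)
#   "lsisas1068" = LSI Logic SAS (no PSOD so far)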
Any help would be appreciated.
Thanks
-J
Just got two of the ARC-1882ix-16 with 2 GB cache memory and the battery option this morning. I'm in the process of migrating VMs off one of my hosts and will start the process of installing the new card and ESXi 5, for the 100 billionth time lol.
Here are some of my findings, and a recap. When I had the Adaptec 51245 with firmware 18937 in place it was very difficult to manage, and the logs were pretty hard for me to read because the date is not appended to events. The array would randomly rebuild, and the disk latency during the rebuild would cause the server to crash. I was able to install Windows and export a support bundle. I sent it to Adaptec, but because I am out of warranty they couldn't tell me anything. Maybe if I had spent more time with it I could have figured out the problem, but I didn't.
So, I replaced it with an ARC-1882ix-16 with 2 GB cache and a battery module. I built the array and installed ESXi 5. I had to use ESXi-Customizer, an extremely easy to use and very handy utility, to insert the driver into the ESXi install CD. It ran idle over the first night, and when I came in the next day I heard an alarm. To my surprise, the array was degraded and the controller showed me which drive had failed!
Why couldn't the Adaptec controller do its job and show me which drive had failed, instead of simply rebuilding the array and failing again and again?
I replaced the drive and everything has been pretty stable since.
I'm still testing the stability of the Areca and its ability to not corrupt data. I started storage vMotioning a powered-off VM between local and remote iSCSI storage, running md5sum in between moves. The first move was successful.
7702b3e875d04326e7fb5a96555138a1 Test1-flat.vmdk
7702b3e875d04326e7fb5a96555138a1 Test1-flat.vmdk
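In case anyone wants to repeat the check, this is roughly what I run in the ESXi shell between moves (md5sum is available there, at least on my hosts; the datastore and VM names below are just placeholders):

# On the source datastore, before the storage vMotion
md5sum /vmfs/volumes/local-ds/Test1/Test1-flat.vmdk
# On the destination (iSCSI) datastore, after the move completes
md5sum /vmfs/volumes/iscsi-ds/Test1/Test1-flat.vmdk
# The two sums should match; a mismatch means the data changed during the move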
I imported an OVF, an HP P4000 VSA 9.5. When I booted it, the Linux OS detected storage corruption and would not fully boot. Is this a coincidence? I deleted it and reimported it, and everything appears to be OK.
I'm going to start another VM, load IOmeter on it, perform a storage vMotion of the other test VM, and compare the md5 checksums again.
Thoughts anyone?
Thanks!
John
Well, I'm not sure what to do at this point.
The last copy and md5sum were different.
180bbf306867e5b7886e17159559d855 Test1-flat.vmdk (resides on test host) --->
352a0c3bd422345c0f39311ba7a65860 Test1-flat.vmdk (resides on areca host)
And I had a corrupted imported OVF VM.
The VM booted and I ran a check disk; no problems were found. I am very confused by these results.
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Checking file system on C:
The type of the file system is NTFS.
A disk check has been scheduled.
Windows will now check the disk.
CHKDSK is verifying files (stage 1 of 3)...
63488 file records processed.
File verification completed.
52 large file records processed.
0 bad file records processed.
0 EA records processed.
60 reparse records processed.
CHKDSK is verifying indexes (stage 2 of 3)...
92030 index entries processed.
Index verification completed.
0 unindexed files scanned.
0 unindexed files recovered.
CHKDSK is verifying security descriptors (stage 3 of 3)...
63488 file SDs/SIDs processed.
Cleaning up 103 unused index entries from index $SII of file 0x9.
Cleaning up 103 unused index entries from index $SDH of file 0x9.
Cleaning up 103 unused security descriptors.
Security descriptor verification completed.
14272 data files processed.
CHKDSK is verifying Usn Journal...
1235848 USN bytes processed.
Usn Journal verification completed.
Windows has checked the file system and found no problems.
31352831 KB total disk space.
15843052 KB in 46954 files.
37800 KB in 14273 indexes.
0 KB in bad sectors.
131819 KB in use by the system.
65536 KB occupied by the log file.
15340160 KB available on disk.
4096 bytes in each allocation unit.
7838207 total allocation units on disk.
3835040 allocation units available on disk.
Internal Info:
00 f8 00 00 37 ef 00 00 70 dc 01 00 00 00 00 00 ....7...p.......
64 00 00 00 3c 00 00 00 00 00 00 00 00 00 00 00 d...<...........
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Windows has finished checking your disk.
Please wait while your computer restarts.
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
As far as I know, IOmeter doesn't test data integrity - just read/write speed. That's great for seeing if you're going to get a PSOD caused by high I/O load (like we saw with the older driver on VMware 4.1), but it won't find what you're looking for here (unless you just want to simulate high I/O loads while doing other testing).
At this point, I'd send the results to Areca support and see what they send you.
Just curious to see if you got any response from VMware on this issue?
Sorry about the slow response. No, I never did get any answers to any of my problems. I still have strange issues from time to time, like very intermittent high disk latency on the iSCSI software adapter, but nothing that has affected the actual production of the servers or caused downtime. I have had a couple of purple screens on one host, but not the other. According to VMware support, the logs don't point to a problem with any particular piece of hardware or software. No smoking gun, if you will. I guess that's why I invested in mirrored VSAs, Veeam, and redundant hardware for HA.
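If anyone wants to watch for those latency spikes themselves, this is roughly how I keep an eye on it from the ESXi shell (the counters are standard esxtop fields; the thresholds are just my rule of thumb):

# Run esxtop, then press 'd' for the disk adapter view or 'u' for disk devices
# Watch DAVG/cmd (device latency) and KAVG/cmd (VMkernel latency);
# sustained DAVG in the tens of milliseconds is roughly when the iSCSI
# software adapter starts reporting high latency for me
esxtop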
I've been in production with this hardware for some time now. I may have been a little paranoid during my testing. Sometimes problems just happen when you are setting up a server for the first time. The key is to test, test, and test again until it's stable. And after it's in production, back up, back up, and back up some more, and test your backups! So I feel confident that everything is OK with this card and the current driver. So far no corrupt data at all.
Hi ErikCarlseen,
Listen,
I still have the same problems (and PSODs on VMware 5) with my Areca 1882ix and I'm tired of it.
Can you give more detailed info about the drivers you received from Areca support? Could you share them somewhere?
I don't have a good relationship with their support; all they tell me is that they would have to replicate my environment to check this... sic!!!
thanks dude
NTShad0w
Hi jlong,
If you haven't yet bought the Areca ARC-1882ix-16 you are planning on, I don't recommend it. It still crashes (PSODs on ESXi 5 when copying data between RAID sets) with a lot of Supermicro boards, and Areca support can't help; they just tell a story about not having a similar configuration to check this... :( Yep, wonderful company, just wonderful. I don't recommend Areca RAID controllers for VMware.
best regards
NTShad0w
jlong, mates,
Heeh, it makes me crazy to read that some of you have problems with data corruption on Areca 1882ix controllers...!!! :((
I have an 1882ix-24 with 1 GB cache + BBU, 6x 2 TB Seagate (low-end) SATA 7.2k drives in RAID 6 and 5x SSD drives in RAID 5; some months earlier I had 10x Seagate 11 1 TB drives in RAID 5 and 6. The board is a Supermicro X8DAH+-F with 2x L5638 and 144 GB RAM. I have never had data corruption (yep, I'm lucky, as I can see...), but if I do, I will probably jump off a bridge (I hope not). It's my lab, but I really like it and the data on it... not only lab data, in truth.
Hmm, these are strange, bad things you're telling here about data corruption on Areca 1882ix (and maybe other) controllers. I have had PSODs from the beginning of using the Areca controller (yep, I hate it, but I have never lost a VMFS datastore, only 2-3 VMs max). I have used it only with ESXi 5 from the beginning; earlier I had an HP P800 on ESX 3, 3.5, 4, and 4.1 and everything was OK (apart from disks sometimes dropping from the RAID group), but never any PSOD...
On the Areca, PSODs are so "natural" for me; I had about 7-8 of them in one year. Support generally can't help because... heeh... because they would have to replicate my environment to test it, but they never do, and they don't give me any working solutions (for about a year now I have talked and written with them many times, but no solution at all). Why didn't I return the controller? I don't know; I have a lot of data on it and don't have enough time or spare disks to easily and quickly migrate it to another RAID set or controller. That's the problem.
I like the Areca's web management, and I like the theoretical possibility of using 4 GB cache with a BBU... a nice option. But these PSODs and the problem with sharing data streams/multitasking (on VMware with the Areca; it's probably a driver problem, possibly firmware and driver)... I hate it. This is not a cheap controller; it costs (in my country, PL in the EU) around 1600 USD... and it behaves like something cheap for 100 bucks... :((
I am probably making one last attempt to solve my problems with them; I sent them a long email with a lot of data and examples of failures... If they don't solve the PSODs and the multitasking/multistreaming problem, I will probably (if possible) return it and go to LSI - stable and fast controllers (I should have gone with them from the start, but I'm not happy with only 512 MB of cache).
Here is my email to Areca support (2012-08-07); some of the attachments are attached to this post too:
regards
NTShad0w
Hi mates,
How much cache do you have in the Areca controllers that produce the data integrity errors (CRC/MD5)? Do you have a BBU or not?
thanks
NTShad0w
NTShad0w,
Have you ever emailed the author of the Areca driver, NickCheng, nick.cheng@areca.com.tw, directly?
About the PSOD: this can most likely be fixed by disabling interrupt remapping (the iovDisableIR option).
Just log in to the host with SSH and issue:
esxcfg-advcfg -k TRUE iovDisableIR
After that, reboot the server and you should be fine. By the way, this is not a problem directly related to Areca; we had the same issues with servers running Adaptec controllers.
Here you might find more information about this problem: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=103026...
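To verify the setting took before rebooting (the -j switch reads back VMkernel boot options, if I remember correctly; double-check against the KB for your build):

# Should report: iovDisableIR = TRUE
esxcfg-advcfg -j iovDisableIR
# The change only takes effect after a reboot
reboot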
As for the data corruption (and that's why I'm here): we now have the same problem with two servers where we recently upgraded to ESXi 5.1.
Ever since then, we have seen corruption (e.g. a Windows VM can no longer boot after migrating/cloning) on the Areca-backed datastores.
Interestingly, it only happens when the datastore is formatted with VMFS5 (5.58).
On both of these servers (one equipped with an ARC-1680 and one with an ARC-1880) we have two datastores, one with VMFS3 (3.34) and one with VMFS5, and it only happens on the latter.
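If you want to confirm which VMFS version a datastore is actually on, vmkfstools will tell you (the datastore names below are just placeholders):

# Print the file system header for each datastore
vmkfstools -P /vmfs/volumes/old-datastore    # reports VMFS-3.34 here
vmkfstools -P /vmfs/volumes/new-datastore    # reports VMFS-5.58 here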
I'm also in contact with Areca, but so far no response.
Interestingly, we never had any trouble with ESXi 4.1 or 5.0, but I see that those were using different/older Areca drivers...
mauricev,
Thanks for the contact to someone technical at Areca support, or at Areca itself. But, as I expected, he wrote the driver yet doesn't fully grasp the problem, and he doesn't have the hardware, resources, or knowledge to understand that something is not right in the Areca driver and/or firmware when working with VMware. The same Areca works quite OK when we use it in a NAS (a Solaris-based NAS server): no kernel panics, and none of the problem where I/O to one RAID set monopolizes the controller and you lose normal access to the other RAID sets.
The PSODs went away when I changed the driver to the default one or to a slightly older one. I'm not sure where the differences are, because Areca has a strange naming and numbering scheme for its drivers, but it works quite OK under VMware (only with one RAID set), apart from long latency under high load and then very low performance. A much better solution is to use it in a NAS/SAN server and then serve the storage to VMware. In my opinion, Areca simply doesn't have good drivers, firmware, or support for the VMware environment.
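For anyone who wants to try the same driver swap on ESXi 5, this is roughly how I do it (the package name scsi-arcmsr is an assumption on my part; use whatever name "esxcli software vib list" actually shows on your host):

# See which Areca driver is currently installed
esxcli software vib list | grep -i arc
# Remove the current driver and install the default/older one
esxcli software vib remove -n scsi-arcmsr
esxcli software vib install -v /vmfs/volumes/datastore1/scsi-arcmsr-older.vib
# Reboot so the driver change takes effect
reboot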
iwayag,
Thanks for the info. It may be helpful, but I decided to move the Areca and the HDDs/SSDs to a NAS server, and now it's working quite OK (and really fast).
regards
NTShad0w
I migrated to LSI MegaRAID. It's not ideal, as their WebBIOS doesn't actually work on the web (or at all, for that matter), but they do have MegaRAID Storage Manager, which runs in Windows and can connect to the ESXi host. For some odd reason it's very slow, but it does let you do everything, and it's easier to use than their crummy WebBIOS. The other important thing is to update the firmware, as LSI is notorious for buggy firmware. Since then, it just works and it NEVER crashes.
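If it's useful to anyone, this is roughly how I check the firmware level from the ESXi shell (this assumes LSI's MegaCLI package is installed on the host; the install path below is the usual one but may differ on your setup):

# Print adapter info and pull out the firmware package version
/opt/lsi/MegaCLI/MegaCli -AdpAllInfo -aALL | grep -i 'fw package'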