Hey all,
I am running ESXi 4.1 on an Intel server and I get PSODs whenever I perform certain disk operations inside a guest. It seems to happen when I partition a disk, or occasionally when writing to disk.
It happens on both thin and thick provisioned disks.
It ONLY happens with the LSI Logic Parallel controller, not LSI Logic SAS.
I have attached the purple screen I get. It sucks that a Linux guest's disk operation can crash an entire ESXi host with 15 guests on it. The guest's disk operations SHOULD be irrelevant to the host.
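If it helps anyone reproduce this, the virtual controller type is just a line in the guest's .vmx file; here's a quick way to check which type a guest is using (the datastore path and VM name below are only examples):

# Show which virtual SCSI controller the guest is configured with
# (assuming scsi0 is the first/only controller; adjust the path to your VM)
grep -i 'scsi0.virtualDev' /vmfs/volumes/datastore1/LinuxGuest/LinuxGuest.vmx
#   "lsilogic"   = LSI Logic Parallel (the config that PSODs for me)
#   "lsisas1068" = LSI Logic SAS (no PSOD so far)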
Any help would be appreciated.
Thanks
-J
Just got two of the ARC-1882ix-16 with 2 GB cache memory and the battery option this morning. I'm in the process of migrating VMs off one of my hosts and will start the process of installing the new card and ESXi 5, for the 100 billionth time lol.
Here are some of my findings, and a recap. When I had the Adaptec 51245 with firmware 18937 in place it was very difficult to manage, and the logs were pretty hard for me to read because the date is not appended to events. The array would randomly rebuild, and the disk latency during the rebuild would cause the server to crash. I was able to install Windows and export a support bundle. I sent it to Adaptec, but because I am out of warranty they couldn't tell me anything. Maybe if I had spent more time with it I could have figured out the problem, but I didn't.
So, I replaced it with an ARC-1882ix-16 with 2 GB cache and a battery module. I built the array and installed ESXi 5. I had to use ESXi-Customizer, an extremely easy to use and very handy utility, to insert the driver into the ESXi install CD. It ran idle over the first night, and when I came in the next day I heard an alarm. To my surprise, the array was degraded and the controller showed me which drive had failed!
Why couldn't the Adaptec controller do its job and show me which drive had failed, instead of simply rebuilding the array and failing again and again?
I replaced the drive and everything has been pretty stable since.
I'm still testing the stability of the Areca and its ability to not corrupt data. I started storage vMotioning a powered-off VM between local and remote iSCSI storage, running md5sum in between moves. The first move was successful.
7702b3e875d04326e7fb5a96555138a1 Test1-flat.vmdk
7702b3e875d04326e7fb5a96555138a1 Test1-flat.vmdk
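In case anyone wants to repeat the check, this is roughly what I run in the ESXi shell between moves (md5sum is available there, at least on my hosts; the datastore and VM names below are just placeholders):

# On the source datastore, before the storage vMotion
md5sum /vmfs/volumes/local-ds/Test1/Test1-flat.vmdk
# On the destination (iSCSI) datastore, after the move completes
md5sum /vmfs/volumes/iscsi-ds/Test1/Test1-flat.vmdk
# The two sums should match; a mismatch means the data changed during the move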
I imported an OVF, an HP P4000 VSA 9.5. When I booted it, the Linux OS detected storage corruption and would not fully boot. Is this a coincidence? I deleted it and reimported it, and everything appears to be OK.
I'm going to start another VM, load IOmeter on it, perform a storage vMotion of the other test VM, and compare the md5 checksums again.
Thoughts anyone?
Thanks!
John
Well, I'm not sure what to do at this point.
The last copy and md5sum were different.
180bbf306867e5b7886e17159559d855 Test1-flat.vmdk (resides on test host) --->
352a0c3bd422345c0f39311ba7a65860 Test1-flat.vmdk (resides on areca host)
And I had a corrupted imported OVF VM.
The VM booted and I ran a check disk; no problems were found. I am very confused by these results.
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Checking file system on C:
The type of the file system is NTFS.
A disk check has been scheduled.
Windows will now check the disk.
CHKDSK is verifying files (stage 1 of 3)...
63488 file records processed.
File verification completed.
52 large file records processed.
0 bad file records processed.
0 EA records processed.
60 reparse records processed.
CHKDSK is verifying indexes (stage 2 of 3)...
92030 index entries processed.
Index verification completed.
0 unindexed files scanned.
0 unindexed files recovered.
CHKDSK is verifying security descriptors (stage 3 of 3)...
63488 file SDs/SIDs processed.
Cleaning up 103 unused index entries from index $SII of file 0x9.
Cleaning up 103 unused index entries from index $SDH of file 0x9.
Cleaning up 103 unused security descriptors.
Security descriptor verification completed.
14272 data files processed.
CHKDSK is verifying Usn Journal...
1235848 USN bytes processed.
Usn Journal verification completed.
Windows has checked the file system and found no problems.
31352831 KB total disk space.
15843052 KB in 46954 files.
37800 KB in 14273 indexes.
0 KB in bad sectors.
131819 KB in use by the system.
65536 KB occupied by the log file.
15340160 KB available on disk.
4096 bytes in each allocation unit.
7838207 total allocation units on disk.
3835040 allocation units available on disk.
Internal Info:
00 f8 00 00 37 ef 00 00 70 dc 01 00 00 00 00 00 ....7...p.......
64 00 00 00 3c 00 00 00 00 00 00 00 00 00 00 00 d...<...........
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Windows has finished checking your disk.
Please wait while your computer restarts.
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
As far as I know, IOmeter doesn't test data integrity - just read/write speed. That's great for seeing if you're going to get a PSOD caused by high I/O load (like we saw with the older driver on VMware 4.1), but it won't find what you're looking for here (unless you just want to simulate high I/O loads while doing other testing).
At this point, I'd send the results to Areca support and see what they send you.
Just curious to see if you got any response from VMware on this issue?
Sorry about the slow response. No, I never did get any answers to any of my problems. I still have strange issues from time to time, like very intermittent high disk latency on the iSCSI software adapter, but nothing that has affected the actual production of the servers or caused downtime. I have had a couple of purple screens on one host, but not the other. According to VMware support, the logs don't point to a problem with any particular piece of hardware or software. No smoking gun, if you will. I guess that's why I invested in mirrored VSAs, Veeam, and redundant hardware for HA.
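If anyone wants to watch for those latency spikes themselves, this is roughly how I keep an eye on it from the ESXi shell (the counters are standard esxtop fields; the thresholds are just my rule of thumb):

# Run esxtop, then press 'd' for the disk adapter view or 'u' for disk devices
# Watch DAVG/cmd (device latency) and KAVG/cmd (VMkernel latency);
# sustained DAVG in the tens of milliseconds is roughly when the iSCSI
# software adapter starts reporting high latency for me
esxtop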
I've been in production with this hardware for some time now. I may have been a little paranoid during my testing. Sometimes problems just happen when you are setting up a server for the first time. The key is to test, test, and test again until it's stable. And after it's in production, back up, back up, and back up some more, and test your backups! So I feel confident that everything is OK with this card and the current driver. So far no corrupt data at all.
Hi ErikCarlseen,
Listen,
I still have the same problems (and PSODs on VMware 5) with my Areca 1882ix and I'm tired of it.
Can you give more detailed info about the drivers you received from Areca support? Could you share them somewhere?
I don't have a good relationship with their support; all they tell me is that they would have to replicate my environment to check this... sic!!!
thanks dude
NTShad0w
Hi jlong,
If you haven't yet bought the Areca ARC-1882ix-16 you are planning on, I don't recommend it. It still crashes (PSODs on ESXi 5 when copying data between RAID sets) with a lot of Supermicro boards, and Areca support can't help; they just tell a story about not having a similar configuration to check this... :( Yep, wonderful company, just wonderful. I don't recommend Areca RAID controllers for VMware.
best regards
NTShad0w
jlong, mates,
Heeh, it makes me crazy to read that some of you have problems with data corruption on Areca 1882ix controllers...!!! :((
I have an 1882ix-24 with 1 GB cache + BBU, 6x 2 TB Seagate (low-end) SATA 7.2k drives in RAID 6 and 5x SSD drives in RAID 5; some months earlier I had 10x Seagate 11 1 TB drives in RAID 5 and 6. The board is a Supermicro X8DAH+-F with 2x L5638 and 144 GB RAM. I have never had data corruption (yep, I'm lucky, as I can see...), but if I do, I will probably jump off a bridge (I hope not). It's my lab, but I really like it and the data on it... not only lab data, in truth.
Hmm, these are strange, bad things you're telling here about data corruption on Areca 1882ix (and maybe other) controllers. I have had PSODs from the beginning of using the Areca controller (yep, I hate it, but I have never lost a VMFS datastore, only 2-3 VMs max). I have used it only with ESXi 5 from the beginning; earlier I had an HP P800 on ESX 3, 3.5, 4, and 4.1 and everything was OK (apart from disks sometimes dropping from the RAID group), but never any PSOD...
On the Areca, PSODs are so "natural" for me; I had about 7-8 of them in one year. Support generally can't help because... heeh... because they would have to replicate my environment to test it, but they never do, and they don't give me any working solutions (for about a year now I have talked and written with them many times, but no solution at all). Why didn't I return the controller? I don't know; I have a lot of data on it and don't have enough time or spare disks to easily and quickly migrate it to another RAID set or controller. That's the problem.
I like the Areca's web management, and I like the theoretical possibility of using 4 GB cache with a BBU... a nice option. But these PSODs and the problem with sharing data streams/multitasking (on VMware with the Areca; it's probably a driver problem, possibly firmware and driver)... I hate it. This is not a cheap controller; it costs (in my country, PL in the EU) around 1600 USD... and it behaves like something cheap for 100 bucks... :((
I am probably making one last attempt to solve my problems with them; I sent them a long email with a lot of data and examples of failures... If they don't solve the PSODs and the multitasking/multistreaming problem, I will probably (if possible) return it and go to LSI - stable and fast controllers (I should have gone with them from the start, but I'm not happy with only 512 MB of cache).
Here is my email to Areca support (2012-08-07); some of the attachments are attached to this post too:
regards
NTShad0w
Hi mates,
How much cache do you have in the Areca controllers that produce the data integrity errors (CRC/MD5)? Do you have a BBU or not?
thanks
NTShad0w
NTShad0w,
Have you ever emailed the author of the Areca driver, NickCheng, nick.cheng@areca.com.tw, directly?
About the PSOD: this can most likely be fixed by disabling interrupt remapping (the iovDisableIR option).
Just log in to the host with SSH and issue:
esxcfg-advcfg -k TRUE iovDisableIR
After that, reboot the server and you should be fine. By the way, this is not a problem directly related to Areca; we had the same issues with servers running Adaptec controllers.
Here you might find more information about this problem: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=103026...
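To verify the setting took before rebooting (the -j switch reads back VMkernel boot options, if I remember correctly; double-check against the KB for your build):

# Should report: iovDisableIR = TRUE
esxcfg-advcfg -j iovDisableIR
# The change only takes effect after a reboot
reboot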
As for the data corruption (and that's why I'm here): we now have the same problem with two servers where we recently upgraded to ESXi 5.1.
Ever since then, we have seen corruption (e.g. a Windows VM can no longer boot after migrating/cloning) on the Areca-backed datastores.
Interestingly, it only happens when the datastore is formatted with VMFS5 (5.58).
On both of these servers (one equipped with an ARC-1680 and one with an ARC-1880) we have two datastores, one with VMFS3 (3.34) and one with VMFS5, and it only happens on the latter.
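If you want to confirm which VMFS version a datastore is actually on, vmkfstools will tell you (the datastore names below are just placeholders):

# Print the file system header for each datastore
vmkfstools -P /vmfs/volumes/old-datastore    # reports VMFS-3.34 here
vmkfstools -P /vmfs/volumes/new-datastore    # reports VMFS-5.58 here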
I'm also in contact with Areca, but so far no response.
Interestingly, we never had any trouble with ESXi 4.1 or 5.0, but I see that those were using different/older Areca drivers...
mauricev,
Thanks for the contact to someone technical at Areca support, or at Areca itself. But, as I expected, he wrote the driver yet doesn't fully grasp the problem, and he doesn't have the hardware, resources, or knowledge to understand that something is not right in the Areca driver and/or firmware when working with VMware. The same Areca works quite OK when we use it in a NAS (a Solaris-based NAS server): no kernel panics, and none of the problem where I/O to one RAID set monopolizes the controller and you lose normal access to the other RAID sets.
The PSODs went away when I changed the driver to the default one or to a slightly older one. I'm not sure where the differences are, because Areca has a strange naming and numbering scheme for its drivers, but it works quite OK under VMware (only with one RAID set), apart from long latency under high load and then very low performance. A much better solution is to use it in a NAS/SAN server and then serve the storage to VMware. In my opinion, Areca simply doesn't have good drivers, firmware, or support for the VMware environment.
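For anyone who wants to try the same driver swap on ESXi 5, this is roughly how I do it (the package name scsi-arcmsr is an assumption on my part; use whatever name "esxcli software vib list" actually shows on your host):

# See which Areca driver is currently installed
esxcli software vib list | grep -i arc
# Remove the current driver and install the default/older one
esxcli software vib remove -n scsi-arcmsr
esxcli software vib install -v /vmfs/volumes/datastore1/scsi-arcmsr-older.vib
# Reboot so the driver change takes effect
reboot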
iwayag,
Thanks for the info. It may be helpful, but I decided to move the Areca and the HDDs/SSDs to a NAS server, and now it's working quite OK (and really fast).
regards
NTShad0w
I migrated to LSI MegaRAID. It's not ideal, as their WebBIOS doesn't actually work on the web (or at all, for that matter), but they do have MegaRAID Storage Manager, which runs in Windows and can connect to the ESXi host. For some odd reason it's very slow, but it does let you do everything, and it's easier to use than their crummy WebBIOS. The other important thing is to update the firmware, as LSI is notorious for buggy firmware. Since then, it just works and it NEVER crashes.
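If it's useful to anyone, this is roughly how I check the firmware level from the ESXi shell (this assumes LSI's MegaCLI package is installed on the host; the install path below is the usual one but may differ on your setup):

# Print adapter info and pull out the firmware package version
/opt/lsi/MegaCLI/MegaCli -AdpAllInfo -aALL | grep -i 'fw package'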