Re: Harddisk events in the guest OS logs

dcaperton · ‎12-08-2008

I've been getting the following 2 errors in my guest OS and I can't seem to find a solution. Any advice would be great.

Observations:

I do not get them on all my guests

It happens to guests across all 3 esx host servers in my farm

It's always at night when the errors occur but not at the same time

The events in the guest usually last about 20 seconds

Farm Configuration:

I have 3 ESX hosts (Dell R805, dual quad-core 2.4 GHz processors, 64 GB of memory) attached to an EMC NS20 iSCSI back end. I use 2 dedicated interfaces for host-to-iSCSI communication and 2 for incoming guest communication. I have 40ish virtual guests spread evenly across all 3 systems.

System: computer1

Log: System

Source: Disk

Dec 6, 2008 19:1:28

Entry 9722: The driver detected a controller error on \Device\Harddisk0\DR0.

System: computer1

Log: System

Source: symmpi

Dec 6, 2008 19:1:28

Entry 9721: The device, \Device\Scsi\symmpi1, is not ready for access yet.

Thanks in advance for the help,

Danny

SuryaVMware · ‎12-08-2008

It appears to me that you are not able to get enough I/O for disk access from all the VMs. In particular, if you are noticing these issues during the night, can you check whether a backup job is scheduled for these VMs at that time?

My first guess is that a backup job runs while the VMs are still live and you are running into an I/O bottleneck on your iSCSI network.

Let me know if this helps.

-Surya

dcaperton · ‎12-08-2008

This was my first thought as well, but I only have 1 backup job, which runs on 1 guest, is scheduled for 2am, and is very small (1 GB). I did go back and look, and it appears I only get these errors on Sunday night. I wonder if this is because of something my EMC is doing? Does anyone have an NS-20 out there?

SuryaVMware · ‎12-08-2008

If this is a one-off incident and you are sure that someone was doing something with your array, then I guess you should be fine. If this is a repeated issue, then you have reason to be worried.

Let us know if this happens again. We can look into a few things to tweak.

-Surya

dcaperton · ‎12-08-2008

This is not a one-time event. It happens every Sunday night. I lied to you earlier in the thread when I said it happens at different times (sorry about that). I looked back as far as I could in my logs and noticed that the errors happen every Sunday around 7pm.

The EMC device I use as a back end for the .vmdk files is used only by ESX host systems, so there are no issues with other systems or processes accessing the EMC. Another thing to note is that I recently (in the past 2 months) migrated all my guests from 2 old IBM host systems to the new Dell systems I referred to earlier in the thread. While I was at it, I rebuilt the Virtual Center server and exported my guests to a brand new environment. (Well, almost brand new: the EMC is still the datastore, but the host systems and Virtual Center server are new.) The problem was happening before I created the new environment and carried out the migration. My hope was that it would go away once I got the better servers. The only common piece is the EMC.

I'm not an expert when it comes to the NS-20, so I'm not sure whether an internal process it runs may be causing my problems. Just a thought.

Danny
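
The weekly recurrence described in this post can be checked mechanically rather than by eye. Below is a minimal sketch in Python, assuming the event timestamps have been exported as text in the same layout as the log excerpt at the top of the thread ("Dec 6, 2008 19:1:28") and that month names are English abbreviations. The function name and the sample timestamps are hypothetical and only there for illustration; this is not part of any VMware or EMC tool.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout taken from the guest log excerpt in the first post,
# e.g. "Dec 6, 2008 19:1:28". Parsing "%b" assumes English month names.
EVENT_FORMAT = "%b %d, %Y %H:%M:%S"

# datetime.weekday() numbers days with Monday = 0 and Sunday = 6.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def weekly_pattern(timestamps):
    """Count events per (weekday name, hour of day) bucket."""
    buckets = Counter()
    for text in timestamps:
        when = datetime.strptime(text.strip(), EVENT_FORMAT)
        buckets[(DAY_NAMES[when.weekday()], when.hour)] += 1
    return buckets


if __name__ == "__main__":
    # Hypothetical timestamps for illustration only; paste your own here.
    sample = [
        "Dec 7, 2008 19:1:28",
        "Dec 7, 2008 19:1:29",
        "Dec 14, 2008 19:2:05",
    ]
    for (day, hour), count in weekly_pattern(sample).most_common():
        print(f"{day} {hour:02d}:00  {count} event(s)")
```

If nearly all events fall into a single weekday and hour bucket, a weekly scheduled task is the likely trigger, and that bucket is the time window to compare against schedules on the hosts and the array.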

SuryaVMware · ‎12-08-2008

Hmm... Are you sure you don't have any ESX-level backups running at that time? If it happens at a particular time every week, there is a scheduled task running somewhere.

It must be a backup job running from within a bunch of VMs, or an ESX-level backup. If backup jobs are triggered from within the VMs, I would stagger their schedules a little so that only 1 or 2 backup jobs are running at any given time.

Hope this helps.

-Surya

bry125vm · ‎01-21-2009

I have witnessed the same errors in our guests. My infrastructure is similar to yours: we have 2 Dell PE2950 hosts and use the NS20 with iSCSI to store the VMs. After I contacted EMC support about this, they informed me that a scheduled battery check was taking place once a week on the NS20's back end (Clariion CX3-10). During the battery check the cache is disabled on your datamovers, meaning your disk I/Os jump dramatically. What led me to this was checking the kernel logs of our ESX hosts. I was seeing iSCSI timeout errors at the exact same time as the scheduled battery check and the Event Viewer errors in Windows. The times were all consistent. We changed the scheduled time of the battery check on the NS20 from a weekday to every Sunday evening. I had assumed usage on the system would be low at that time and it wouldn't have an impact, but the errors still occur. Luckily a VM has not crashed, nor has a datastore gone offline, but I hope we are working towards a permanent solution to this.

If you're still experiencing this, I can look up the command to check (and change) what time your NS20 is doing the battery check. EMC informed me it cannot be disabled; it has to run every 7 days, whether on a weekday or a weekend.

Lightbulb · ‎01-21-2009

Log into the Clariion that your Celerra is using for back-end storage via Navisphere. Right-click on one of your SPs and choose view events. You are looking for event ID 740a or 2580 (which one depends on the Clariion model, and off the top of my head I do not know which event matches which model). The weekly battery testing schedule can be determined by drilling down to your batteries in the hardware listing and right-clicking (there should be a context menu item like "testing schedule"). If the schedule matches the time frame of your issue, this will confirm that it is your issue.

Call EMC; I know they have a few Powerlink articles regarding issues with the weekly battery test. They will probably want to do a Flare update (their stock answer). Be careful: bad things can happen during Flare code upgrades.

dcaperton · ‎01-22-2009

Thank you so much for this. I have an open case with EMC and they can't seem to identify the problem. I'm going to bring this to the technician's attention and see what he has to say. More to follow.

Danny

dcaperton · ‎01-22-2009

I don't have Navisphere. We use the Celerra Manager web UI. Any ideas where to find it in there? I'm about to contact EMC.

-Danny

Lightbulb · ‎01-22-2009

In the tree listing of your Celerra UI you should have a folder called Storage (I am doing this all from memory; I don't have a functioning system in front of me). Expand the Storage folder and you should see your Clariion's SPs with their associated IP addresses. Depending on your setup, you should be able to click on an SP to open a Navisphere session (the Clariion web interface). You will need a username and password; a good guess would be clariion / clariion.

bry125vm · ‎01-22-2009

From a PuTTY or terminal session on your Control Station, here is the command for checking and changing the time of the battery check. You'll want to just check the time of the battery check and see if it matches the Event Viewer errors in your guest OS.

SET SPS TIME

1. To check what the current time is:

[]$ /nas/sbin/navicli -h SPA setspstime

2. To change the time:

"usage: setspstime <-d dayOfWeek -h hour -m minute> <-nolocal>"

For example (please adjust the values to fit your needs), this sets the check to Saturday at 22:00:

[]$ /nas/sbin/navicli -h SPA setspstime -d 6 -h 22 -m 00

NOTE: dayOfWeek setting

0 Sunday
1 Monday
2 Tuesday
3 Wednesday
4 Thursday
5 Friday
6 Saturday
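
The dayOfWeek numbering above starts at Sunday = 0, which is easy to get wrong when working from a calendar or from a scripting language that counts from Monday. Here is a small sketch in Python that only builds the argument list for the "change the time" form shown above. It does not run anything. The helper names are hypothetical, and the only flags used are the ones in the usage line quoted in this post.

```python
from datetime import datetime

# dayOfWeek numbering from the NOTE in this post: 0 = Sunday ... 6 = Saturday.
DAY_NUMBERS = {
    "sunday": 0, "monday": 1, "tuesday": 2, "wednesday": 3,
    "thursday": 4, "friday": 5, "saturday": 6,
}


def navicli_day(when):
    """Convert a datetime to the 0 = Sunday numbering used by setspstime.

    datetime.weekday() counts Monday as 0, so shift by one and wrap.
    """
    return (when.weekday() + 1) % 7


def setspstime_args(day, hour, minute, sp="SPA", navicli="/nas/sbin/navicli"):
    """Build the argument list for the 'change the time' form (hypothetical helper)."""
    day_number = DAY_NUMBERS[day.lower()]  # KeyError on an unknown day name
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be between 0 and 59")
    return [navicli, "-h", sp, "setspstime",
            "-d", str(day_number), "-h", str(hour), "-m", f"{minute:02d}"]


if __name__ == "__main__":
    # Reproduces the example command from this post (Saturday at 22:00).
    print(" ".join(setspstime_args("Saturday", 22, 0)))
```

Printing the command instead of executing it leaves room to review the day number before anything is changed on the array.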

bry125vm · ‎01-22-2009

Also search the /var/log/vmkernel file on your ESX hosts for strings like "finished error recovery" and "abort success". These are iSCSI timeout errors between your ESX hosts and the NS20.
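
A minimal sketch of that search in Python, assuming you have a copy of the vmkernel log on a machine with a Python 3 interpreter (the ESX host itself may not have one). The function names are hypothetical. The two search strings are the ones quoted above, matched without regard to case.

```python
# Strings named in this post as signs of iSCSI timeouts in the vmkernel log.
PATTERNS = ("finished error recovery", "abort success")


def find_timeouts(lines, patterns=PATTERNS):
    """Return (line_number, line) pairs for lines containing any pattern.

    Matching is case-insensitive; line numbers start at 1.
    """
    wanted = [pattern.lower() for pattern in patterns]
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(pattern in lowered for pattern in wanted):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path):
    """Read a log file from disk and return its matching lines."""
    with open(path, errors="replace") as log:
        return find_timeouts(log)
```

Calling scan_log() on a copy of /var/log/vmkernel returns the matching lines with their line numbers, so their timestamps can be compared with the battery check schedule and with the Event Viewer errors in the guests.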

dcaperton · ‎01-22-2009

Yes these are the errors I have been seeing.

-danny

dcaperton · ‎01-22-2009

Yep this is the problem. I just ran the command and everything looks just like you said it would. Is there a solution to the problem? Is there any reason for concern?

-Danny

dcaperton · ‎01-22-2009

Leave it to the VMware Communities to find an EMC problem before EMC.

Anyway, you may be able to shed some light on another issue I'm having along the same lines. I run 4 Citrix terminal servers: 3 in ESX and 1 on a stand-alone server. Every night between 11 and 11:10pm all the guest TS servers slow to a crawl and become 100% unusable. The sessions into the servers do not disconnect; it's as if the screen froze on your desktop for 10 minutes. Then everything magically comes back to life and people can continue to work. It only happens on my virtual systems. I look at all the performance logs and all the systems look as if nothing happened. I don't see anything in the ESX or EMC logs either. I have network graphs showing nothing out of the ordinary as far as traffic is concerned, and I'm not running any backups at this time. The only thing I can think of is that some process is running at 11pm every night on the NS-20. The truth is I wouldn't believe it was even happening if I hadn't seen it with my own eyes. Any ideas?

-danny

bry125vm · ‎01-22-2009

The schedule cannot be disabled, so I tried moving it to the least impactful time. The errors for us still occur every Sunday. It's pure luck we haven't had a VM crash or a datastore go offline during this time. Look at the error in Windows: "The driver detected a controller error on \Device\Harddisk0\DR0". If this were a physical box, you'd be scrambling to find out why before something really bad happened. If a firmware or driver update didn't fix it, you'd most likely be replacing the controller.

The fix (for us) is to increase the performance of our NS20 and the RAID groups used by the ESX hosts. Getting higher performance from the RAID groups means that when the cache is disabled during the battery check, the disk I/Os do not spike, the iSCSI session timeouts do not occur, and the Event Viewer in Windows is left free of errors. We will be migrating to a new NS20 with a different RAID layout, and the expectation is that the battery check will no longer have a negative effect. I'd press EMC support to help you find a resolution. Using Navianalyzer was a big help with this.
