VMware Cloud Community
cagriffith2000
Contributor

Disk IO problem

We have a perplexing issue.

We have a number of virtual environments, each set up to use 4 SQL servers. At midnight, one of these servers (call it DB1) loses communication and IO with the iSCSI storage and stops writing to the logs, so we cannot see what is causing it. We have resources looking at SQL, at VMware, and at the hardware and network connectivity, but cannot identify the root cause.

Can you help?

CG

15 Replies
Kevin_Gao
Hot Shot

Can you please provide some details? For example: What's your storage backend? What's the storage protocol? Are you using multipathing to the storage nodes?

RParker
Immortal

We have a number of virtual environments set up to use 4 SQL servers in each. One of these servers, call it DB1, at midnight, loses communication and IO with the iSCSI storage

I am only guessing, but I suspect that the iSCSI target is a Windows machine and that authentication is done via AD. I am also speculating that there is a group policy that kicks users off after midnight, which affects that authentication and in turn disconnects you from the iSCSI 'network', causing your VM to stop writing to the logs.

It sure seems like something is done regularly that is causing the outage, so I would say this isn't random.

cagriffith2000
Contributor

Storage backend is NetApp, protocol is TCP/IP, and yes, there is multipathing to the storage nodes.

cagriffith2000
Contributor

Yes, the iSCSI target in this case is a Windows machine. Yes, authentication is done via AD. There is no group policy employed to kick users off (it's not production, it's a dev environment), so the iSCSI 'network' is always available. It is apparently not random, as it hits this one type of DB1 server in several different virtual environments at the same time. There are other servers that are possibly sending rapid-fire requests at the databases on DB1 and bringing it to its knees, but as yet we have been unable to prove it. Appreciate your help.

Kevin_Gao
Hot Shot

If this is the only VM losing its storage (I'm assuming all the other VMs are fine), can you check its volume? For example, do you have anything scheduled around the time of the crashes, such as multiple snaps, dedupes, etc.?

Also, what version of Data ONTAP are you on? Our FAS shipped with a buggy version that didn't let me turn on dedupe. :(

cagriffith2000
Contributor

There are no logged errors or issues found on the NetApp. SQL just indicates timeout errors. As part of our troubleshooting, we have isolated the iSCSI traffic to the NetApp, run stress tests against the DB1 servers, and caused the timeouts with test tools, but we still don't know the cause of the midnight madness.

cagriffith2000
Contributor

I'll check on that info and get back to you.

Kevin_Gao
Hot Shot

OK, I'll also go out on a limb here and look at it from the Windows side: have you looked into this hotfix? http://support.microsoft.com/kb/959384

cagriffith2000
Contributor

This could be a possible solution, except all the SQL servers here are SQL 2005 running on Win2003 Server Enterprise.

The hotfix is for Vista / Windows Server 2008.

cagriffith2000
Contributor

Data ONTAP version is 7.2.

Volume size is 2.0 TB (we are planning to lower it).

Nothing is scheduled in that time frame. We don't use dedupe yet.

kjb007
Immortal

Virus scan? Backups? If you can recreate the timeouts with a stress test, then run a perfmon job to watch what the CPU/memory/IO is doing from within the OS, and also look at the performance charts from ESX / VC. What kind of drives are you using on your NetApp? Do you see SCSI timeouts in the ESXi logs?

-KjB

VMware vExpert

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
cagriffith2000
Contributor

That was the first thing we looked at. No virus scanning and no backups occur in that timeframe. We have perfmon captures, but they show no source for the problem, and the source is what we are after.

kjb007
Immortal

On your 2 TB volume, how many disks are backing that volume? What type of disks are they? How many IOPS are you sending to that LUN?

-KjB

cagriffith2000
Contributor

We identified the issue.

We have an aging version of Symantec AntiVirus Enterprise, and apparently it is configured to acquire new virus definition files at midnight. Once it has these, it distributes them to more than 400 virtual servers (a relatively small spike in network traffic). The issue we found was that once the servers have the new definitions, they trigger automatic 'mini' scans of their respective systems, so all at once the hosts have to contend with hundreds of I/O-intensive scans. This activity kills the virtual hosts.

We've separated the servers into groups and configured the definition update timing individually for each group, and this has apparently fixed the problem.
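
For anyone wanting to plan a similar split, here is a minimal sketch of the idea in Python. It only works out which group gets which start time; it does not talk to Symantec or any other product. The function name `stagger_updates`, the placeholder server names, the group size of 50, and the 30-minute gap are all assumptions made for illustration, not the values we actually used.

```python
from datetime import datetime, timedelta


def stagger_updates(servers, group_size, first_start, gap_minutes):
    """Split servers into fixed-size groups and give each group its own start time."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    schedule = []
    for index in range(0, len(servers), group_size):
        group = servers[index:index + group_size]
        # Each successive group starts gap_minutes after the previous one.
        offset = timedelta(minutes=gap_minutes * (index // group_size))
        schedule.append((first_start + offset, group))
    return schedule


# Hypothetical inventory: 400 placeholder names standing in for the real servers.
servers = [f"vm{n:03d}" for n in range(1, 401)]

# Assumed settings: 50 servers per group, first group at midnight, 30 minutes apart.
schedule = stagger_updates(
    servers,
    group_size=50,
    first_start=datetime(2000, 1, 1, 0, 0),
    gap_minutes=30,
)

for start, group in schedule:
    print(start.strftime("%H:%M"), len(group), "servers")
```

With these assumed numbers, no more than 50 servers pick up definitions (and kick off their scans) in any one window, instead of all of them at midnight. The right group size and gap depend on how long a scan takes and how much I/O the hosts and storage can absorb.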

Thanks everyone for your input. We checked everything you gave us.

CG

kjb007
Immortal

Don't forget to leave points for helpful / correct posts.

-KjB
