VMware Cloud Community
timcwhite
Contributor

VMkernel Marks Paths as dead

Good morning,

This past weekend I had a major outage in my ESX environment. My entire infrastructure went down because of a faulty Fibre Channel switch. All hardware components in my infrastructure are supposed to be redundant, so when one switch failed we assumed that I/O would fail over to our secondary switch.

Each server has four paths to our backend array (CLARiiON CX4). Two paths go through one switch (Brocade DCX) and the other two paths go through a second switch (Brocade DCX).

Two weeks ago, the first DCX switch rebooted and all paths failed over to the second switch as expected. What we didn't notice was that the first two paths didn't come back up after the switch recovered. The Emulex driver had caused the kernel to mark the paths as dead!

So, when the secondary switch rebooted this weekend, unbeknownst to me, I/O couldn't fail back to the first two paths because they had been marked dead for over a week!

So here is my question,

Can someone assist me with writing a script that can poll the VMkernel logs for such an event? I'd like to poll for the following:

$ grep EXPIRED *

vmkernel.4:May 25 02:41:03 sknxbldesx01 vmkernel: 2:14:45:15.264 cpu0:1024)<4>lpfc1:0250:DIe:EXPIRED nodev timer Data: x661713 x3 x7

vmkernel.4:May 25 02:41:03 sknxbldesx01 vmkernel: 2:14:45:15.264 cpu0:1024)<4>lpfc1:0250:DIe:EXPIRED nodev timer Data: x661715 x4 x8

Can a cron job be written to search for such an expression and then send me an email?

Thanks in advance.

Tim

6 Replies
RParker
Immortal

Well, first of all, this is visible in the storage configuration for your SAN, where each path should show as active / on. If a path shows as dead, you still have a cable or port problem, because ESX only reports what it can see.

So a script won't help if you don't fix whatever is causing the dead paths. Once both paths (or all 4) are working, ESX will show this, so writing a script only means doing from the command line what VI Client already has built in.

And if it failed over once, that means it's working. The outage happened because, after the initial failure, nobody confirmed that the other paths were still working.

COdlk
Hot Shot

What I have done is set up a centralized syslog host and configure my ESX servers to send certain syslog messages to it. On the loghost I run a program called Simple Event Correlator (SEC, http://www.estpak.ee/~risto/sec/). It takes a little while to configure, but it basically watches logfiles for phrases that you specify. When it matches a phrase, it performs whatever action you tell it to (e.g. send an email, run a script).

david
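For anyone who wants something smaller than SEC, the "watch a logfile for a phrase, then act" idea can be sketched in plain shell. This is only a rough sketch, not SEC's behaviour: the phrase comes from Tim's log excerpt, while the /var/log/vmkernel* location, the MAILTO and LOG_GLOB variables, and the availability of a working `mail` command are all assumptions to check on your own hosts.

```shell
#!/bin/sh
# Sketch: report lpfc "EXPIRED nodev timer" lines found in vmkernel logs.
# Assumptions (not from the thread): logs live under /var/log/vmkernel*,
# and a `mail` command that accepts -s is installed and configured.

PATTERN='EXPIRED nodev timer'

# Print every matching line from the given files, prefixed with the file name.
find_expired() {
    grep -H -- "$PATTERN" "$@" 2>/dev/null
}

# Mail the matches, if there are any, to the given recipient.
notify() {
    recipient=$1
    shift
    matches=$(find_expired "$@")
    if [ -n "$matches" ]; then
        printf '%s\n' "$matches" | mail -s "lpfc nodev timer expired on $(hostname)" "$recipient"
    fi
}

# Only act when called with --run, so the functions can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    # LOG_GLOB is left unquoted on purpose so the shell expands the wildcard.
    notify "${MAILTO:-root}" ${LOG_GLOB:-/var/log/vmkernel*}
fi
```

Run from cron with the --run argument. Note that it keeps reporting the same lines on every run until the matching log file is rotated away, so a real deployment would want some way to remember what has already been sent.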

timcwhite
Contributor

Thanks, I do see this in Virtual Center. I have 32 ESX hosts in my environment, so I was hoping to find an automated way of being notified when ESX doesn't see a path. I don't see any preconfigured alarms for this.

Thanks,

Tim

RParker
Immortal

"so I was hoping to find an automated way of being notified when ESX doesn't see the path. I don't see any preconfigured alarms for this."

Another thing: with 4 paths, there is something wrong with your Fibre switch setup. If you have 4 paths, you should have 4 distinct Fibre switches; if you don't, then you only have 2 paths. A path means dedicated physical access to the SAN. You can't count the failover path between the SAN devices as a path, because if you break the connection, as happened here, more than just 1 path is affected.

A path is a single route ALL the way from the ESX host to the SAN. If you break one, you should have 3 left over. Each switch has 4 fibre cables, each cable is a dedicated physical route, and each SAN should therefore have 2 physical paths to 2 physical switches, and EACH SAN should have 2 distinct paths (2 with virtual WWNs so they can work in a failover).

That way, if you break a SAN path, you still have 3 to route your VMs. So you actually have only 2 paths, or you need to fix it so that you have 4 distinct paths.

Also, I see what you are saying about SAN connectivity and notification. However, your SAN switches should be able to see this as well. They will show that their path is broken either to the SAN or to the ESX host, and THEY can monitor it just as easily. Since they are the central part of your route, I would use them to monitor the paths. If they go down, that's what Nagios or Big Brother or some other 'ping' tool is for: to tell you they are down.

I think many people don't take this into account when they assume VMs solve every issue. If the VMs that do your monitoring (Virtual Center, for example) are hosted on the SAN (via ESX) and are critical to your environment, then the SAN/VI they monitor shouldn't be what THEY are running on. Things like this should be physical. If you yank a Fibre cable, it shouldn't bring down your entire organization; the critical systems you use for monitoring should be up on a physical host. This is a prime example.

I'm not saying you keep VC on ESX, but if you did, you couldn't monitor or notify when things go wrong, because those systems would ALSO be down.

admin
Immortal

Tim,

Rather than scanning the vmkernel log file, it would be easier to write a cron job that periodically runs the esxcfg-mpath -l command and checks for any paths in the dead state.
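A minimal sketch of such a check follows. It assumes that a dead path shows up with the word "dead" somewhere on its line in the esxcfg-mpath -l output (verify this against your own ESX version), and that a `mail` command is available; the MAILTO variable is a placeholder introduced here for illustration.

```shell
#!/bin/sh
# Sketch: alert when the multipath listing mentions a dead path.
# Assumption: `esxcfg-mpath -l` marks such paths with the word "dead"
# (matched in any letter case); confirm the exact wording on your hosts.

# Read a path listing on stdin and print only the lines mentioning "dead".
dead_paths() {
    grep -i -w 'dead'
}

# Run the listing on this host and mail any dead-path lines.
check_paths() {
    dead=$(esxcfg-mpath -l | dead_paths)
    if [ -n "$dead" ]; then
        printf '%s\n' "$dead" | mail -s "Dead storage paths on $(hostname)" "${MAILTO:-root}"
    fi
}

# Only act when called with --run, so the functions can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    check_paths
fi
```

Saved under a name of your choosing and called with --run from a crontab entry on each host, this sends mail only while at least one dead path is listed, so the alert repeats until the path is repaired.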

penaut
Contributor

"so I was hoping to find an automated way of being notified when ESX doesn't see the path. I don't see any preconfigured alarms for this."

Did you try KIWI?
