Re: ESX on HP DL380 - Bad raid controller?

ryan_p · ‎03-20-2007

Hi everyone,

I'm having a problem that I really hope you all can help with. Please bear with me, I'm trying to provide as much information up front as I can.

We're running ESX on an HP DL380 with two arrays configured: a sys volume of two mirrored 36GB drives (VMLinux drive), and a data volume of four 146GB drives combined into one big 400+ GB partition, where our VMware virtual machines & data (/vmfs) are stored.

Today we noticed that one of our Windows 2003 instances had locked up, and restarting it via the web console wasn't working, so I rebooted the whole server. I watched the console as it came back up; it got to the step where it loads VMKernel and just sat there. Eventually it displayed the following messages:

/dev/cciss/c0d0: No such file or directory

sfdisk: cannot open /dev/cciss/c0d0 for reading

/dev/cciss/c0d1: No such file or directory

sfdisk: cannot open /dev/cciss/c0d1 for reading

At this point it finishes booting and displays the welcome screen. I am unable to log in via the web console, because I get the following message:

Unexpected response from vmware-authd: 511 Error connecting to /usr/sbin/vmware-serverd process.

I can ssh to the server and log in that way. The following is the result of a df -h:

Filesystem            Size  Used Avail Use% Mounted on
/dev/cciss/c0d0p2     2.4G  1.2G  1.0G  53% /
/dev/cciss/c0d0p1      50M   12M   36M  24% /boot
none                  392M     0  392M   0% /dev/shm

So, apparently it CAN read c0d0, but notice that c0d1 is missing. This is presumably what's causing it to hang when it tries to load VMKernel, as it can't get any of the virtual machine files off of that array.
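One way to test that directly is to ask whether the two device nodes named in the sfdisk errors exist at all once the system is up. The following is only a minimal sketch: it assumes a Python 3 interpreter is available (which may not be the case on the service console), and `node_status` is a made-up helper name, not part of any VMware or HP tool.

```python
import os
import stat


def node_status(path):
    """Classify a path as 'missing', 'block device', or 'other'."""
    try:
        mode = os.stat(path).st_mode
    except (FileNotFoundError, NotADirectoryError):
        return "missing"
    return "block device" if stat.S_ISBLK(mode) else "other"


if __name__ == "__main__":
    # Device names copied from the sfdisk error messages above.
    for dev in ("/dev/cciss/c0d0", "/dev/cciss/c0d1"):
        print(dev, "->", node_status(dev))
```

If c0d0 reports as a block device while c0d1 reports as missing, that would fit the df output; if both exist, the sfdisk errors were probably a timing issue during boot rather than a permanently missing node.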

One last thing that's weird: it DOES appear to see both arrays during boot up. I noticed this part in the output from the dmesg command:

HP CISS Driver (v 2.4.54-14VMS)

cciss:Smart Array 6i is at irq 25

cciss: using DAC cycles

blocks= 71122560 block_size= 512

heads= 255, sectors= 32, cylinders= 8716 RAID 1(1+0)

blocks= 860220400 block_size= 512

heads= 255, sectors= 32, cylinders= 105419 RAID 5

blk: queue c0306f20, no I/O memory limit

Partition check:

cciss/c0d0: p1 p2 p3 p4 < p5 p6 >

cciss/c0d1: p1

So you can see that it sees both logical drives, sees what type of RAID they are, and even sees partition information for each one during the check. But once it gets past that point, it's like c0d1 falls off the face of the earth.
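As a sanity check, the block counts in that dmesg output can be converted to sizes to confirm they match the arrays described earlier. This is a small sketch; it assumes the two "blocks=" lines appear in the same order as c0d0 and c0d1 in the partition check, which the output suggests but does not state.

```python
import re

# The four geometry lines copied from the dmesg output above.
DMESG = """\
blocks= 71122560 block_size= 512
heads= 255, sectors= 32, cylinders= 8716 RAID 1(1+0)
blocks= 860220400 block_size= 512
heads= 255, sectors= 32, cylinders= 105419 RAID 5
"""


def logical_drive_sizes(text):
    """Return the size in bytes of each logical drive, in order of appearance."""
    pattern = re.compile(r"blocks=\s*(\d+)\s+block_size=\s*(\d+)")
    return [int(blocks) * int(size) for blocks, size in pattern.findall(text)]


sizes = logical_drive_sizes(DMESG)
for index, size in enumerate(sizes):
    # Assumption: the Nth "blocks=" line belongs to c0dN.
    print(f"c0d{index}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
# c0d0: 36.4 GB (33.9 GiB)
# c0d1: 440.4 GB (410.2 GiB)
```

The first figure lines up with a mirrored pair of 36GB drives, and the second is roughly three drives' worth of usable space from four 146GB drives in RAID 5, so the controller is reporting both arrays at full size.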

Edit - I should also mention that trying to do an ls on /vmfs hangs the console; I typically have to kill bash from another login session. Just another symptom of not being able to read a file system it expects to be there, I guess.

Any help you guys can provide would be greatly appreciated. Thanks for your time!

Message was edited by:

ryan_p

ZMkenzie · ‎03-20-2007

Recently I've seen A LOT of broken Smart Array controllers. I don't know why, but in the last year we lost about 7-8 DL380 SCSI controllers.

Maybe this is your situation (the symptoms are quite similar, although I've never experienced this problem under ESX, only with Windows or Linux servers).

I'd try these steps before calling HP:

- Upgrade the firmware using HP Firmware Maintenance 7.70

- Boot with a rescue CD and check if the files are there (the ext3 ones)

- Back up all your files (using netcat)

If still nothing changes:

- Try to change the controller, or try to put the disks in another DL380 if you have one

- Call HP
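The reply doesn't say how netcat would be used for the backup step. The usual pattern is a listener on the machine receiving the backup and a sender on the rescue-booted server, with the data streamed over a plain TCP connection. The sketch below shows that same idea in Python purely as an illustration; the function names are hypothetical, it stands in for netcat rather than reproducing it, and it has no encryption or integrity checking.

```python
import socket
import threading

CHUNK = 64 * 1024  # bytes read or written per call


def receive_to_file(server, dest_path):
    """Accept one connection and write everything it sends to dest_path."""
    conn, _ = server.accept()
    with conn, open(dest_path, "wb") as out:
        while True:
            data = conn.recv(CHUNK)
            if not data:  # sender closed the connection
                break
            out.write(data)


def send_file(src_path, host, port):
    """Stream src_path to host:port, then close the connection."""
    with socket.create_connection((host, port)) as sock, open(src_path, "rb") as src:
        while True:
            data = src.read(CHUNK)
            if not data:
                break
            sock.sendall(data)


def loopback_copy(src_path, dest_path):
    """Demo only: push a file through a TCP socket on 127.0.0.1."""
    with socket.create_server(("127.0.0.1", 0)) as server:
        port = server.getsockname()[1]  # OS-assigned free port
        receiver = threading.Thread(target=receive_to_file, args=(server, dest_path))
        receiver.start()
        send_file(src_path, "127.0.0.1", port)
        receiver.join()
```

In a real backup the receiver would run on the backup machine and `send_file` on the rescue-booted DL380, pointed at that machine's address; `loopback_copy` only exists so the two halves can be tried on one box first.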

wila · ‎03-20-2007

Maybe not. There have been similar complaints; please check the following thread, it may help.

Patching leaves system with kernel panic & fix

| Author of Vimalin. The virtual machine Backup app for VMware Fusion, VMware Workstation and Player |
| More info at vimalin.com | Twitter @wilva

ryan_p · ‎03-20-2007

Thanks for the replies so far, I'm downloading the new firmware CD at the moment.

I do have another DL380, I might try putting the drives in that one just to see what happens. It's a production server also though, so I'm nervous about doing that.

I'll update you all once I've tried the CD.
