I'm having a problem that I really hope you all can help with. Please bear with me, I'm trying to provide as much information up front as I can.
We're running ESX on an HP DL380 with two arrays configured: a sys volume of two mirrored 36GB drives (the VMLinux drive), and a data volume of four 146GB drives combined into one big 400+ GB partition, where our VMware virtual machines and data (/vmfs) are stored.
Today we noticed that one of our Windows 2003 instances had locked up, and restarting it via the web console wasn't working, so I rebooted the whole server. I watched the console as it was coming back up, and it got to the step where it loads VMKernel and just sat there. Eventually it displayed the following messages:
/dev/cciss/c0d0: No such file or directory
sfdisk: cannot open /dev/cciss/c0d0 for reading
/dev/cciss/c0d1: No such file or directory
sfdisk: cannot open /dev/cciss/c0d1 for reading
At this point it finishes booting and displays the welcome screen. I am unable to log in via the web console, because I get the following message:
Unexpected response from vmware-authd: 511 Error connecting to /usr/sbin/vmware-serverd process.
I can ssh to the server and log in that way. The following is the result from a df -h:
Filesystem            Size  Used Avail Use% Mounted on
/dev/cciss/c0d0p2     2.4G  1.2G  1.0G  53% /
/dev/cciss/c0d0p1      50M   12M   36M  24% /boot
none                  392M     0  392M   0% /dev/shm
So, apparently it CAN read c0d0, but notice that c0d1 is missing. This is presumably what's causing it to hang when it tries to load VMKernel, as it can't get any of the virtual machine files off of that array.
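One way to narrow this down is to compare the block devices the kernel lists in /proc/partitions with the device nodes that actually exist under /dev. The following is only a sketch: `check_nodes` is a made-up helper name, and it assumes the service console exposes the usual Linux /proc/partitions layout (a header line, a blank line, then rows whose fourth column is the device name).

```shell
#!/bin/sh
# Sketch: report which kernel-known block devices have a node under /dev.
# check_nodes is a hypothetical helper, not part of ESX or any HP tool.
#   $1 = partitions table (default: /proc/partitions)
#   $2 = device directory  (default: /dev)
check_nodes() {
    table="${1:-/proc/partitions}"
    devdir="${2:-/dev}"
    # Skip the header line and the blank line; column 4 is the device name.
    awk 'NR > 2 && NF >= 4 { print $4 }' "$table" | while read -r name; do
        if [ -e "$devdir/$name" ]; then
            echo "ok      $devdir/$name"
        else
            echo "MISSING $devdir/$name"
        fi
    done
}

# Usage on the live system (no arguments needed):
#   check_nodes
```

If cciss/c0d1 shows up in the table but is reported as MISSING, the problem is on the device-node side; if it is not in the table at all, the kernel itself has lost track of the array.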
One last thing that's weird: it DOES appear to see both arrays during boot up. I noticed this part in the output from the dmesg command:
HP CISS Driver (v 2.4.54-14VMS)
cciss:Smart Array 6i is at irq 25
cciss: using DAC cycles
blocks= 71122560 block_size= 512
heads= 255, sectors= 32, cylinders= 8716 RAID 1(1+0)
blocks= 860220400 block_size= 512
heads= 255, sectors= 32, cylinders= 105419 RAID 5
blk: queue c0306f20, no I/O memory limit
cciss/c0d0: p1 p2 p3 p4 < p5 p6 >
So you can see that it sees both arrays, sees what type of RAID they are, and even sees partition information for each one during the check. But once it gets past that point, it's like c0d1 falls off the face of the earth.
Edit - I should also mention that trying to do an ls on /vmfs hangs the console; I typically have to kill bash from another login session. Just another symptom of not being able to read a file system it expects to be there, I guess.
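As a sanity check on those dmesg numbers, block count times block size gives the capacity of each logical drive. A quick sketch follows; `blocks_to_gb` is just an illustrative helper name, it reports decimal gigabytes truncated by integer division, and it assumes a shell with 64-bit arithmetic such as bash.

```shell
#!/bin/bash
# Sketch: convert a dmesg "blocks= N block_size= S" pair to whole decimal GB.
# blocks_to_gb is a hypothetical helper for illustration only.
blocks_to_gb() {
    # $1 = block count, $2 = block size in bytes
    echo $(( $1 * $2 / 1000000000 ))
}

blocks_to_gb 71122560 512     # first logical drive (RAID 1)  -> 36
blocks_to_gb 860220400 512    # second logical drive (RAID 5) -> 440
```

That works out to roughly 36 GB and 440 GB, which lines up with the mirrored 36GB pair and with four 146GB drives in RAID 5 (three drives' worth of usable space, about 438 GB). So the controller is reporting the full size of the data array at this point in the boot.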
Any help you guys can provide would be greatly appreciated. Thanks for your time!
Recently I've seen A LOT of broken Smart Array controllers. I don't know why, but in the last year we lost about 7-8 DL380 SCSI controllers.
Maybe this is your situation (the symptoms are quite similar, although I've never experienced this problem under ESX, only with Windows or Linux servers).
I'd try these steps before calling HP:
- Upgrade the firmware using the HP Firmware Maintenance CD 7.70
- Boot with a rescue CD and check whether the files are there (the ext3 ones)
- Back up all your files (using netcat)
If still nothing changes:
- Try changing the controller, or put the disks in another DL380 if you have one
- Call HP
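For the netcat backup step, the usual pattern is to stream a tar archive over a TCP connection. Here is a rough sketch, assuming the rescue CD has mounted the filesystem under /mnt/rescue, the receiving machine is reachable as backuphost, and port 9000 is free (all three are placeholders, as are the function names). Netcat variants differ on listen syntax: some want `nc -l -p 9000`, others `nc -l 9000`, and some need a timeout option on the sending side to exit once tar finishes.

```shell
#!/bin/sh
# Sketch only: function names, paths, host and port are placeholders.

# On the receiving machine: listen and write the incoming stream to a file.
receive_backup() {
    # $1 = port, $2 = output file
    nc -l -p "$1" > "$2"
}

# On the broken server (booted from the rescue CD): tar the mounted
# filesystem and pipe the archive to the receiver.
send_backup() {
    # $1 = directory to back up, $2 = receiver host, $3 = port
    tar czf - -C "$1" . | nc "$2" "$3"
}

# Example (start the receiver first):
#   receiver:  receive_backup 9000 dl380-backup.tar.gz
#   sender:    send_backup /mnt/rescue backuphost 9000
```

Afterwards, `tar tzf dl380-backup.tar.gz` on the receiving side lists the archive contents, which is a cheap way to confirm the transfer completed.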
Thanks for the replies so far, I'm downloading the new firmware CD at the moment.
I do have another DL380, so I might try putting the drives in that one just to see what happens. It's also a production server, though, so I'm nervous about doing that.
I'll update you all once I've tried the CD.