Hi all,
Today a client called me with a strange problem.
The Service Console responded to ping, but I could not connect using ssh or log on locally (I never received a prompt). To me, this indicated that the Service Console had hung or crashed. The only option left was to reset the server (luckily there weren't any VMs running).
After a reboot I started looking at the logs straight away. In the VMkernel log I noticed spamming of the following line:
Apr 21 16:12:20 <servername> vmkernel: 3:01:08:24.385 cpu1:1037)BC: 810: FileIO failed with 0x0xbad0006(Limit exceeded)
Apr 22 16:12:12 <servername> vmkernel: 4:01:08:15.060 cpu2:1037)BC: 810: FileIO failed with 0x0xbad0006(Limit exceeded)
Apr 23 16:12:10 <servername> vmkernel: 5:01:08:12.703 cpu1:1035)BC: 810: FileIO failed with 0x0xbad0006(Limit exceeded)
There were no other messages indicating an issue. VMkwarning didn't indicate a problem. Neither did /var/log/messages.
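For anyone who wants to see how often this message shows up per day, here is a minimal Python sketch that tallies the FileIO failures by date and reason. The regular expression is only a guess based on the three lines quoted above, the function name is made up for illustration, and /var/log/vmkernel as the log location is an assumption about your host.

```python
import re

# Pattern inferred from the three log lines quoted in this thread (an assumption,
# not an official log format): syslog date, time, host, then the vmkernel message.
FILEIO_RE = re.compile(
    r"^(\w{3}\s+\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+vmkernel:"
    r".*FileIO failed with (\w+)\((.*?)\)"
)

def count_fileio_failures(lines):
    """Return {(date, reason): count} for lines matching FILEIO_RE."""
    counts = {}
    for line in lines:
        match = FILEIO_RE.search(line)
        if match:
            key = (match.group(1), match.group(3))
            counts[key] = counts.get(key, 0) + 1
    return counts

# The three lines from the original post, used as sample input.
SAMPLE = [
    "Apr 21 16:12:20 <servername> vmkernel: 3:01:08:24.385 cpu1:1037)BC: 810: FileIO failed with 0x0xbad0006(Limit exceeded)",
    "Apr 22 16:12:12 <servername> vmkernel: 4:01:08:15.060 cpu2:1037)BC: 810: FileIO failed with 0x0xbad0006(Limit exceeded)",
    "Apr 23 16:12:10 <servername> vmkernel: 5:01:08:12.703 cpu1:1035)BC: 810: FileIO failed with 0x0xbad0006(Limit exceeded)",
]

if __name__ == "__main__":
    # To scan a real log instead, pass an open file object, for example
    # open("/var/log/vmkernel") (the path is an assumption).
    for (day, reason), n in sorted(count_fileio_failures(SAMPLE).items()):
        print("%s  %-16s %d" % (day, reason, n))
```

Grouping by date makes it easy to see whether the failures cluster around the time of the hang or are spread over earlier days.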
The environment runs Virtual Center 2.5 Update 1 and ESX 3.5 Update 1.
After the reboot everything returned to normal, but I don't know for how long. Before I put the server back into production, I would like to know what caused this.
Any ideas?
Thomas
Hi Thomas
I suggest you verify the free space on the partitions of your ESX installation. Maybe something on / (where the vmkernel resides) is taking up a lot of unwanted space. If you can, log in to the ESX host as root over ssh and monitor partition usage with the du or df command, for example. I hope this helps.
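If you would rather script the check than run df by hand, something along these lines might do. It is only a sketch: it assumes a Python interpreter is available on the host, the mount point list and the 90% threshold are made up for illustration, and the percentage is an approximation of what df reports in its Use% column.

```python
import os

def usage_percent(path):
    """Approximate the Use% column of df for the filesystem holding path."""
    st = os.statvfs(path)
    used = st.f_blocks - st.f_bfree   # blocks currently in use
    usable = used + st.f_bavail       # in use + available to non-root users
    if usable == 0:
        return 0.0
    return 100.0 * used / usable

# Hypothetical list of mount points to watch; adjust to your own partition layout.
MOUNT_POINTS = ["/", "/var/log", "/tmp"]
THRESHOLD = 90.0  # arbitrary warning level, in percent

if __name__ == "__main__":
    for mount in MOUNT_POINTS:
        if not os.path.isdir(mount):
            continue  # skip mount points that do not exist on this host
        pct = usage_percent(mount)
        flag = "  <-- check this" if pct >= THRESHOLD else ""
        print("%-10s %5.1f%%%s" % (mount, pct, flag))
```

Run it periodically (from cron, for instance) and watch whether any partition creeps toward the threshold.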
Thank you for the tip. I'll keep monitoring this server in particular. After the reboot, everything returned to normal, so I'll give it a day or two.
I'll keep you posted.
Thomas
Are you sure that logfile messages from over a month ago are relevant to a problem your customer had yesterday?
The FileIO bad message certainly doesn't look good, and you wouldn't want to run production on a host logging lots of those messages.
However, the message does ring a bell with me in connection with released patches.
I would check that your host is up-to-date with all available patches and in particular the driver(s) for the storage controller on your host.
Forgot to say, check the logs (nvram log etcetera) on your storage controller as well, or any hardware log you can get your hands on.
--
Wil
Are you booting from SAN? Which SAN technology? Did you have a SAN path failover?
Ben
Wila,
It had not occurred to me that the logs are indeed a month old, and it does seem odd that they would be connected to this problem. Nevertheless, I have the remaining (8) updates scheduled for this evening. I will take another look tomorrow and the day after. Who knows, things might clear up.
I think it would be far-fetched to monitor the logs on the EVA controllers, since there are about 24 servers connected to the SAN, and those servers have clean logs...
Anyhow, I'll keep you posted on the progress of the patches.
Thomas
BenConrad,
Thanks for your response. I am not booting from SAN. VMware has been installed on local storage. We are definitely using path failover in our HP EVA configuration (2 SAN fabrics and 4 paths). Path failover has been set up correctly and works perfectly in the environment.
Thomas
I have been monitoring the server for the past few days. The server is behaving normally. No files are growing out of control, so I guess everything is in order.
I am quite confident the cause was a partition that ran out of space because of growing log files.
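For anyone who lands here with the same symptoms, a quick way to spot runaway files is to list the biggest ones under the log directory. A rough Python sketch follows; the function name is made up for illustration, and /var/log as the place to look is an assumption.

```python
import os

def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) tuples under root, biggest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # ignore symlinks so targets are not counted twice
            try:
                found.append((os.path.getsize(path), path))
            except OSError:
                # File vanished or is unreadable (e.g. rotated mid-scan); skip it.
                continue
    found.sort(reverse=True)
    return found[:limit]

if __name__ == "__main__":
    # /var/log is an assumption about where the growing files live.
    for size, path in largest_files("/var/log", 10):
        print("%12d  %s" % (size, path))
```

Comparing the output from one day to the next shows which files are actually growing.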
Passahobe, I'll mark your answer as the correct one. Thanks for your help everyone.
Regards,
Thomas
Thanks Thomas Louis, I am glad to have helped. Until next time...