Oczkov
Enthusiast

SCO Unix and unrecoverable read/write errors on disks

Hi,

We have successfully installed SCO OpenServer 5.0.6 in an ESX 3.0.2 VM. We started with the mmunix driver available from one of the private sites (I mean during the installation process) - it worked OK. However, we couldn't get the SMP configuration working. The machine was generally running OK (as a uniprocessor), but when we put heavy load on the disks (we actually noticed this while loading large amounts of data into Informix Dynamic Server 7.31) we got these errors in the /var/adm/messages log:

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 1 dev 1/74 (ha=0 bus=0 id=1 lun=0) block=438804

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 1 dev 1/74 (ha=0 bus=0 id=1 lun=0) block=489556

NOTICE: Sdsk: Unrecoverable error writing SCSI disk 1 dev 1/72 (ha=0 bus=0 id=1 lun=0) block=527056

NOTICE: Sdsk: Unrecoverable error writing SCSI disk 1 dev 1/72 (ha=0 bus=0 id=1 lun=0) block=858348

NOTICE: Sdsk: Unrecoverable error writing SCSI disk 1 dev 1/72 (ha=0 bus=0 id=1 lun=0) block=858436

NOTICE: Sdsk: Unrecoverable error writing SCSI disk 1 dev 1/72 (ha=0 bus=0 id=1 lun=0) block=859632

NOTICE: Sdsk: Unrecoverable error writing SCSI disk 1 dev 1/72 (ha=0 bus=0 id=1 lun=0) block=898192

NOTICE: Sdsk: Unrecoverable error writing SCSI disk 1 dev 1/72 (ha=0 bus=0 id=1 lun=0) block=651100

NOTICE: Sdsk: Unrecoverable error writing SCSI disk 1 dev 1/72 (ha=0 bus=0 id=1 lun=0) block=651304

NOTICE: Sdsk: Unrecoverable error writing SCSI disk 1 dev 1/72 (ha=0 bus=0 id=1 lun=0) block=651880

NOTICE: Sdsk: Unrecoverable error writing SCSI disk 1 dev 1/72 (ha=0 bus=0 id=1 lun=0) block=674000

NOTICE: Sdsk: Unrecoverable error writing SCSI disk 1 dev 1/72 (ha=0 bus=0 id=1 lun=0) block=719736

NOTICE: Sdsk: Unrecoverable error writing SCSI disk 1 dev 1/72 (ha=0 bus=0 id=1 lun=0) block=720264

NOTICE: Sdsk: Unrecoverable error writing SCSI disk 1 dev 1/72 (ha=0 bus=0 id=1 lun=0) block=825640

NOTICE: Sdsk: Unrecoverable error writing SCSI disk 1 dev 1/72 (ha=0 bus=0 id=1 lun=0) block=825956

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 1 dev 1/82 (ha=0 bus=0 id=1 lun=0) block=1371800

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 1 dev 1/64 (ha=0 bus=0 id=1 lun=0) block=1002254

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 0 dev 1/0 (ha=0 bus=0 id=0 lun=0) block=1330485

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 1 dev 1/64 (ha=0 bus=0 id=1 lun=0) block=1504702

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 2 dev 1/139 (ha=1 bus=0 id=0 lun=0) block=2288464

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 1 dev 1/64 (ha=0 bus=0 id=1 lun=0) block=112352

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 1 dev 1/64 (ha=0 bus=0 id=1 lun=0) block=112384

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 1 dev 1/64 (ha=0 bus=0 id=1 lun=0) block=2471066

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 1 dev 1/64 (ha=0 bus=0 id=1 lun=0) block=125312

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 1 dev 1/64 (ha=0 bus=0 id=1 lun=0) block=847992

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 1 dev 1/64 (ha=0 bus=0 id=1 lun=0) block=848024

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 1 dev 1/64 (ha=0 bus=0 id=1 lun=0) block=477916

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 1 dev 1/64 (ha=0 bus=0 id=1 lun=0) block=4910304

NOTICE: Sdsk: Unrecoverable error reading SCSI disk 1 dev 1/64 (ha=0 bus=0 id=1 lun=0) block=4910336

The simple way to reproduce the problem is to run multiple rounds of:

  1. dd if=/dev/dsk/1s0 of=/dev/null bs=4096

... on the different disks you have. On each of the virtual disks we easily get these errors, both for reading and for writing (the latter only while loading the database with data, which is logical). The storage is an EMC CLARiiON CX3, but I think this is not related to the underlying hardware.
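If it helps, the rounds can be scripted. A minimal sh sketch - the device nodes, the pass count and the helper function name here are just examples I made up, adjust them to your own /dev/dsk layout:

```shell
# Hedged sketch of the stress test above; /dev/dsk/0s0, /dev/dsk/1s0 and the
# pass count are examples only -- substitute your own device nodes.
stress_read() {                      # $1 = raw device, $2 = number of passes
  pass=0
  while [ "$pass" -lt "$2" ]; do
    dd if="$1" of=/dev/null bs=4096 2>/dev/null || return 1
    pass=$((pass + 1))
  done
}

for disk in /dev/dsk/0s0 /dev/dsk/1s0; do
  stress_read "$disk" 3 || echo "dd reported a failure on $disk"
done

# Afterwards, look for fresh notices:
if [ -f /var/adm/messages ]; then
  grep -c 'Sdsk: Unrecoverable error' /var/adm/messages
fi
```

On our machines a few passes over each disk are enough to produce new NOTICE lines in the log.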

We changed the driver to the one SCO claims to be working, which is available here:

They say:

HARDWARE:

BusLogic BT-440C SCSI host adapter
BusLogic BT-445S SCSI host adapter
BusLogic BT-540CF SCSI host adapter
BusLogic BT-542B SCSI host adapter
BusLogic BT-542D SCSI host adapter
BusLogic BT-545S SCSI host adapter
BusLogic BT-646D SCSI host adapter
BusLogic BT-646S SCSI host adapter
BusLogic BT-742A SCSI host adapter
BusLogic BT-747D SCSI host adapter
BusLogic BT-747S SCSI host adapter
BusLogic BT-757C SCSI host adapter
BusLogic BT-757CD SCSI host adapter
BusLogic BT-930 FlashPoint LT SCSI host adapter
BusLogic BT-948 SCSI host adapter
BusLogic BT-958 SCSI host adapter
VMware virtual machines

PROBLEM: SCO OpenServer Release 5 includes a version of the "blc" host bus adapter driver which incorrectly enumerates PCI devices when running in a multiprocessor (SMP) environment or when running under VMware (which emulates a BusLogic BT-958 SCSI host adapter). This results in a "no root disk controller found" message when the system is first booted after installation.

However, after installing SMP the system behaves exactly the same as with mmunix (reboots in a loop), and the unrecoverable disk errors (using the new SCO driver, which installed without a single warning) are still easy to reproduce.

The only strange thing after installing the new blc driver was that at the install-required reboot the system failed to boot, saying that the controller was not ready. After powering it off and on, it worked OK and the system came up normally.

Can any of you check this scenario on your machines?

Any suggestions or hints?

Let me (and the community) know.

Best regards,
Oczkov

Oczkov
Enthusiast

Is anybody willing to run the 'dd' command (see the example above) on your SCO machine and help me confirm that the problem is reproducible? This is a very simple task, with no harm to your disks or data! You just read from the device and write it to /dev/null.

I would really appreciate it if anybody who said (on the forum) that they are running SCO in an ESX VM could test this in their environment and share the relevant /var/adm/messages content (whether there is anything like an unrecoverable error).
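To make comparing results easier, here is a small sketch that tallies the notices per disk and direction. The awk field positions match the NOTICE lines I quoted above; the LOG variable is something I added so you can point it at a saved copy of the log:

```shell
# Hedged sketch: count unrecoverable-error notices per disk and per direction
# (reading/writing). LOG defaults to the SCO location; override it if your
# log has been rotated or copied elsewhere.
LOG=${LOG:-/var/adm/messages}
if [ -f "$LOG" ]; then
  # In the NOTICE lines, field 5 is "reading"/"writing" and field 8 is the disk number.
  awk '/Sdsk: Unrecoverable error/ { print "disk", $8, $5 }' "$LOG" | sort | uniq -c
fi
```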

Best regards,

Oczkov

vvarnell
Enthusiast

Finally, someone else crazy enough to attempt running SCO in a VM. Welcome to the club.

I've been running a couple of SCO VM's for two years now. Both exhibited the same symptoms you describe. The problem turns out to have something to do with either the timing of traffic in the SAN environment or the way the data is split up in that same environment. Either way, I was able to localize the issue to disk I/O traffic. The way I got around it was to move the VM to local SCSI/Serial Attached SCSI disk on the host. Since then (summer 2007) I have been running 24x7 without a single reported I/O issue. I lose vMotion but gain stability.

I believe the problem was resolved by going to block-type I/O.

Hardware in this particular SCO environment is built around the Dell 2850 platform.

Good luck.

VwV

Oczkov
Enthusiast
Enthusiast

Dear Warnell,

Thank you for your response and workaround suggestion. I really appreciate your help!!

I will test this (using local SCSI) soon and report my findings. We are running our VMware farm on HP ProLiant DL360 G5 servers with a SAN-connected (Brocade 200E based) EMC CLARiiON CX3-20. I am sure this is not related to the host hardware itself, but rather to the disk and timing issues (as you say) caused by emulation of the LSI Logic adapter (on SAN hardware).

Can you please let me know several more things:

1.) which SCO OpenServer version do you run inside your VMs (5.0.4/5.0.5/5.0.6(a)/5.0.7, or maybe 6)?

2.) do you run your VMs in multiprocessor (SMP) mode? If yes, did you configure it during the system installation phase or afterwards? Any problems with this?

3.) do you use the SCO-provided blc driver I mentioned above?

4.) did you try to investigate whether these I/O errors have any impact on the data? Our observation is that we have loaded several GB of data into the Informix Dynamic Server chunks (raw devices) and actually we see nothing in the IDS log - really nothing, no errors, no problems. The only mention of the unrecoverable SCSI errors is in /var/adm/messages.

5.) by going to block-type I/O, do you mean just moving the VMs to direct-attached storage such as SCSI or SAS?

I am giving you one "Helpful Answer" point and am really willing to grant the remaining 10 points (after my tests, of course). Your hint is really valuable!

Best regards,

Oczkov

vvarnell
Enthusiast

Howdy Oczkov,

I'll see what I can answer for you:

1) SCO 5.0.7. I started an attempt with SCO 6 but abandoned it before I got too far into it. From where I stopped, I think 6.0 may have worked, but the application needed 5.0.7.

2) I've got the VM's as uniprocessor. The applications don't demand SMP, so I eliminated that from the equation. The VM's are configured with the "other" guest OS type, so I suspect SMP would not play well. As a standard practice, I also do not normally run SMP VM's for Windows or Linux unless they demonstrate a need and a benefit for SMP. This keeps my %READY down across the environment and sometimes aggravates my users (an added bonus, occasionally).

3) I started off with the blc driver from the aplawrence document mentioned elsewhere in the thread. Later I installed Maintenance Pack 5, which has an updated driver. (It's been a couple of years, so it's fuzzy.) Before ANY attempt to update the driver, shut the VM down and clone it; that way you have a way back to a good state. If this step is skipped, I guarantee unhappiness. I learned the hard way.

4) Only once did I see a data-impact issue with the errors (it showed up as a bad disk block). Most of the time the errors appeared to be benign, but at least once there was a restore involved. Most problems/incidents showed up during backups, when high read I/O was present. The OS would occasionally crash, so a reboot was needed. fsck ran afterwards and usually found no issues. I'm not surprised that you've seen good results with data loading/writes. That tracks with the behavior I experienced.

5) Yes, block I/O on a local (internal) SCSI controller. Where I think the issue lies is in responsiveness to I/O requests in the driver. The driver was written for a SCSI environment with that pattern of data flow and timing. When the driver sits in a VM it is abstracted (virtualized) so that it thinks it is still in a native SCSI environment but in reality its I/O is being broken up into packets for the SAN which has different timing and flow patterns. I attempted to increase the I/O timeout to 60 seconds, but that only marginally resolved the issue. My theory is that when the VM resides on true SCSI disk there are no added timing issues. My SAN is Brocade and the storage is Hitachi, but I believe the issue is not vendor/manufacturer specific.

I've designed my ESX environment with one stand-alone ESX host. It has more local disk than my other hosts and no SAN connection. The idea is to have on it only the VM's that we still want around should the SAN and/or storage take a dive. It's a Dell 2850 with a PERC 4/Di controller and a 6 x 72GB RAID array. The disk not utilized for the ESX OS is given over to VMFS3 (about 120GB). On this host I have two domain controllers, the SCO VM's and a VM that monitors the SAN and storage systems. The remainder of my ESX environment is all Dell 6850 class hardware that is fully SAN resident (no VM's on local disk). Like I mentioned before, I lose vMotion capability, but I've put redundancy in hardware where possible and I keep it clean and well fed. Periodically I shut down and clone the SCO VM's (moving them to the SAN side in a powered-off state) for D/R purposes.

Hope this helps. Good luck.

VwV

Oczkov
Enthusiast

Hi,

I don't know if you still care, but your suggestion was very good, and using SAS DASD in the server helped (HP DL380 G5, Smart Array P400 controller, 2.5-inch 72GB 10K drives).

It also seems that there is a second working solution: using RDMs (Raw Device Mappings) in physical compatibility mode on an FC storage array.

http://communities.vmware.com/message/797175

Best regards,

Oczkov
