kastlr
Expert

Hi,

To answer your question, let me provide some additional information about the IO flow in general.

As already posted, for any write IO the host expects an ACK (acknowledgement) from the array.

Without this ACK, the IO isn't completed.

The host usually waits for a defined time and then tries to resend the IO.
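
To make the ACK/retry behaviour a bit more concrete, here is a minimal Python sketch. It is only a simulation: `write_with_retry`, `flaky_array` and the attempt limit are names and values I made up for illustration, not anything taken from a real HBA driver or from ESX.

```python
def write_with_retry(send, max_attempts=3):
    """Send a write and resend it until the array acknowledges it.

    `send` is a hypothetical callable that returns True when an ACK
    arrived before the host's timeout expired, and False otherwise.
    """
    for attempt in range(1, max_attempts + 1):
        if send():
            # The IO counts as completed only once the ACK has been seen.
            return attempt
    raise TimeoutError("no ACK after %d attempts" % max_attempts)


def flaky_array(lost_acks):
    """Simulated array that loses the first `lost_acks` acknowledgements."""
    state = {"remaining": lost_acks}

    def send():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            return False  # timeout expired, no ACK seen by the host
        return True       # ACK received

    return send
```

Calling `write_with_retry(flaky_array(2))` returns the number of attempts that were needed, and an array that never answers within the allowed attempts ends in a `TimeoutError`.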

Most of the time a read IO is answered immediately, meaning the array holds the IO flow to the LUN until the read request is answered.

This usually happens when the requested information is already available in the storage array's cache. (Read Hit)

Sometimes the array asks the host to release the SCSI bus for additional requests because it isn't able to answer the read request immediately.

This happens when the requested information isn't available in the array's cache and therefore some backend operations are required. (Read Miss)

The IO is queued in the HBA driver, and in the meantime the host can send additional IO requests to that LUN.
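
The difference between a Read Hit and a Read Miss can be sketched roughly like this. The `Array` class and its fields are hypothetical and heavily simplified; a real array cache works with tracks or slots, prefetching and so on.

```python
class Array:
    """Toy model of a storage array with a read cache (illustration only)."""

    def __init__(self, backend):
        self.backend = dict(backend)  # block number -> data on the disks
        self.cache = {}               # block number -> data held in cache

    def read(self, block):
        if block in self.cache:
            # Read Hit: answered immediately from cache.
            return self.cache[block], "read hit"
        # Read Miss: a backend operation is required first. In the
        # meantime the host is free to send other IOs to the LUN.
        data = self.backend[block]
        self.cache[block] = data
        return data, "read miss"
```

Reading the same block twice shows both cases: the first read is a miss that fetches from the backend, the second one is a hit.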

Let's talk about your SAN now.

If something changes in your SAN, all affected zone members are informed by an RSCN (Registered State Change Notification).

When the link is broken, the switch informs all remaining zone members about this event.

With this notification, your HBA driver informs the OS so it can take proper action.

This usually causes the LVM to transmit IOs over another available path.

All IOs already sent over the broken link but not yet answered are automatically retransmitted over the new path.

Such an event can be found in the ESX logs, such as vmkernel or vmkwarning.

This is completely transparent for a VM; the VM isn't even aware of multiple paths to the disk it uses.
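
The failover described above can be pictured with a small Python sketch. It only models the bookkeeping (which outstanding IOs move to which path); `fail_over` and the path names are made up for illustration, and picking the first surviving path is my assumption, not the actual ESX path selection policy.

```python
def fail_over(paths, outstanding, broken):
    """Move unanswered IOs from the broken path to a surviving one.

    paths       -- list of path names to the LUN
    outstanding -- dict: path name -> list of IOs sent but not yet answered
    broken      -- name of the path that just failed
    Returns a new dict with the same shape for the surviving paths.
    """
    alive = [p for p in paths if p != broken]
    if not alive:
        raise RuntimeError("no remaining path to the LUN")
    new_path = alive[0]  # assumption: simply take the first surviving path
    result = {p: list(outstanding.get(p, [])) for p in alive}
    # IOs sent over the broken link but never answered are retransmitted.
    result[new_path].extend(outstanding.get(broken, []))
    return result
```

For example, with paths `["vmhba1", "vmhba2"]` and two unanswered writes on `vmhba1`, a failure of `vmhba1` leaves everything queued on `vmhba2`. The VM never sees any of this.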

Let's talk about "freezed".

When a host has unanswered IO requests, it simply waits for an answer.

There's no kind of transaction log or snapshot area for such a condition.

Depending on the application, either some RAM is used to queue the IOs or the application can't continue without the requested information.

A database relies on consistent data, and the vendors have implemented mechanisms to guarantee this (transaction logs, redo logs, etc.).

When such an event occurs on a database server, the DB may crash if it takes too long to answer an IO request.

For example, transaction logs are really critical and produce nearly 100% write IOs.

When the DB is unable to write these IOs for a time frame defined by the vendor, the DB crashes.

As long as your outage doesn't hit this limit, the DB simply freezes (sleeps) and doesn't answer any SQL statements.
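
Put as a tiny decision rule in Python (the function name and the numbers in the example are hypothetical; the real limit depends on the database vendor and its configuration):

```python
def database_outcome(outage_s, vendor_limit_s):
    """Rough outcome for a DB whose log writes stall for `outage_s` seconds.

    `vendor_limit_s` stands for the vendor-defined time frame after which
    the database gives up on its outstanding write IOs.
    """
    if outage_s <= 0:
        return "normal"
    if outage_s >= vendor_limit_s:
        return "crash"               # the outage hit the vendor limit
    return "freeze, then resume"     # the DB sleeps until the IO is answered
```

With an assumed limit of 60 seconds, `database_outcome(20, 60)` gives "freeze, then resume", while `database_outcome(90, 60)` gives "crash".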

A file server is less restrictive; as already seen during your tests, the IO flow pauses during the outage.

After the outage, the IO flow resumes normally.

From a user's point of view, the server didn't respond for a few seconds (maybe the damned slow network again ;-) ).

Hope this helps a bit.

Greetings from Germany. (CEST)