62 Replies Latest reply on Oct 19, 2006 6:40 AM by elebel
      • 15. Re: Problem migrating a VM
        kitcolbert Expert

        I think it is a general problem; it's just that you're getting lucky sometimes.  For iSCSI, how busy is the network?  I wonder if storage updates are sometimes delayed due to IP traffic, which could result in the source's close or lock free not being propagated to the iSCSI server in a timely manner, which in turn causes the destination to think the lock is still held.  That's just a theory, of course.  But from all the error messages I've seen, it seems that there are problems at times when the VMotion isn't occurring.  In the non-VMotion case, instead of failures it could be that you're just seeing reduced performance.

        • 16. Re: Problem migrating a VM
          steverding Enthusiast

          It is a dedicated VLAN for iSCSI, but it is on a shared backbone. I did some performance tests today, because I was on the same track you are on now. Here are the measurements:

           

          I copied a 600 MByte file to the /vmfs volume from the command line (on the ESX server: cp /var/tmp/file /vmfs/volumes/ISCSI/) 30 times in a row. It took 25 minutes. So I got about 720 MByte/minute, which is around 12 MByte/second, which is almost exactly 100 Mbit/s.

           

          Is this okay or is it way too slow ?
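For anyone who wants to double-check numbers like these, the arithmetic can be sketched in a few lines of shell. The function name and its arguments are made up for illustration; it only converts file size, number of copies and elapsed minutes into MByte/s and Mbit/s, assuming 8 bits per byte and ignoring any protocol overhead.

```shell
# Hypothetical helper: throughput from file size (MByte),
# number of copies, and elapsed time (minutes).
throughput() {
    awk -v size="$1" -v copies="$2" -v minutes="$3" 'BEGIN {
        mbyte_s = size * copies / (minutes * 60)
        printf "%.1f MByte/s, %.0f Mbit/s\n", mbyte_s, mbyte_s * 8
    }'
}

# The copy test described above: 600 MByte, 30 copies, 25 minutes.
throughput 600 30 25
```

With the figures from the test this works out to roughly 12 MByte/s, i.e. a bit under 100 Mbit/s.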

          • 17. Re: Problem migrating a VM
            steverding Enthusiast

            Some more observations :

             

            I installed a Linux software iSCSI target in our LAN and tested migration there. Absolutely no problems, and the software target is really slow. I did the same tests as above and got about 12 Mbit/s (so a factor of 8 slower than the non-working hardware target).

            • 18. Re: Problem migrating a VM
              kitcolbert Expert

              Yeah, I'm not too concerned with throughput, but more with latency.

               

              Hmm, so the Linux iSCSI target works fine?  What are you using as your default target?  This makes it seem like there's some problem with the default target (or am I missing something?).

              • 19. Re: Problem migrating a VM
                steverding Enthusiast

                Yeah, the hardware target is the default one. I installed the linux thing for the tests only.

                • 20. Re: Problem migrating a VM
                  kitcolbert Expert

                  So the Linux one works but the hardware one doesn't.  Are they on the same VLAN?  What other differences in setup are there?  Also (stupid question, but I have to ask), is the hardware target on our supported list of hardware?

                  • 21. Re: Problem migrating a VM
                    steverding Enthusiast

                    The devices are on the same VLAN; network-wise it's all the same.

                     

                    And no for the hard part: the iSCSI hardware device isn't on the list. But shouldn't iSCSI be iSCSI? Isn't it a standard?

                    • 22. Re: Problem migrating a VM
                      slkiran Enthusiast

                      iSCSI is a standard, but we have seen different vendors interpret the standard differently. The implementations also seem to differ between vendors.

                       

                      If you don't mind, what is the hardware target that you are using here?

                      • 23. Re: Problem migrating a VM
                        steverding Enthusiast

                        It's a low-cost iSCSI device for the lab phase.

                         

                        It's a Promise VTrak M200i. If we are successful there may be a NetApp filer 'round the corner, but we need the tests to run perfectly, and that isn't the case at the moment.

                        • 24. Re: Problem migrating a VM
                          donbaek Enthusiast

                          Unfortunately iSCSI is not just iSCSI - both the iSCSI initiator and all target systems have bugs, a few only showing with ESX and not with Linux because we use SCSI commands that Linux would never issue.

                           

                          For instance, ESX relies heavily on RESERVE/RELEASE commands (the 6-byte versions) for atomic file operations on a shared VMFS. All open-source iSCSI targets I have looked at (including Linux IET, OpenFiler, Intel iSCSI target, NetBSD iSCSI target) and a bunch of commercial ones (Open-E iSCSI and Wasabi Storage Builder) don't correctly implement RESERVE/RELEASE, so you are really playing Russian roulette with your data if you run a shared VMFS off such a target on a production system. If you are not using a shared VMFS, most of them should work fine.

                           

                          If you have a NetApp filer I would recommend you use that for production use until more storage arrays get on the HCL. For testing etc. I welcome any testing with unsupported arrays, since it sometimes helps us find issues we didn't know about.

                           

                          With respect to your problem, it does seem that the iSCSI part works most of the way, but perhaps not all the way. First off, are you running this off 100 Mbit NICs or through a 100 Mbit network? If so, there are known issues and we require the use of Gbit NICs for this. The problem is that things take a lot longer on 100 Mbit when both network traffic and iSCSI traffic need to share the same NIC, and this can sometimes push us over some timeout limits that you would normally never see. If you are running off 100 Mbit, please try with Gbit and see if the problem goes away.

                           

                          Secondly, could you please check the vmkernel log for warnings about reservations failing? We have seen some arrays fail with ESX due to a bug in the iSCSI initiator that prevents RESERVE/RELEASE from working, and that could be the cause of your troubles.
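One possible way to run that check is sketched below. The function name is made up, and the search word is only a broad guess, since the exact wording of the warnings is not assumed here; pass it whatever vmkernel log files exist on the host (for example /var/log/vmkernel*).

```shell
# Hypothetical helper: count log lines that mention reservations, per file.
# Usage: count_reservation_lines /var/log/vmkernel*
count_reservation_lines() {
    for f in "$@"; do
        # grep -c prints the number of matching lines; -i ignores case
        printf '%s: %s\n' "$f" "$(grep -ci 'reservation' "$f")"
    done
}
```

A count of 0 on every file only means that this particular word never appears, so it is worth double-checking the spelling of the pattern before drawing conclusions.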

                           

                          If it is none of the above, you should probably file a support incident with us, but let's get the answers to the above questions first and take it from there.

                           

                          Regards,

                          Thor

                           

                          NB: I work for VMware, but opinions expressed above are my own and not necessarily those of my employer.

                          • 25. Re: Problem migrating a VM
                            steverding Enthusiast

                            Thanks for the explanation. I think you misunderstood one thing about the linux-iscsi part:

                             

                            We don't use a Linux initiator to test the hardware target; we used a Linux target (iSCSI Enterprise Target 0.4.13) and could successfully migrate the VMs. But that doesn't really help with the real problem.

                             

                            For the Ethernet speed: I just finished a test on a dedicated Gbit switch. I put only the iSCSI hardware target and both ESX servers on a separate 1000 Mbit switch and didn't change anything else. Still the same error.

                             

                            The reservation errors: I did a "grep -i reservation /var/log/vmkernel*" on both machines and there wasn't a result, so if the errors are found that way, there weren't any.

                             

                            The incident: Does VMware support me when I use unsupported hardware?

                            • 26. Re: Problem migrating a VM
                              donbaek Enthusiast

                              No, I saw that you used a linux-iscsi target only for a quick test and otherwise use an unsupported hardware target. I was just getting the point across that iSCSI is not simply iSCSI just because there is a standard - I was merely using the reservation issue on IET as an example of why things are not always so. Also, since the iSCSI initiator used by ESX comes from the Linux world (linux-iscsi), I wanted to point out that you can still see errors with ESX that do not occur on Linux - even with the same initiator source code. The reason is that we sometimes issue different commands through the initiator (Linux never uses RESERVE, for instance, but we do it a lot).

                               

                              Back to your problem, I would be interested in knowing whether sessions are dropped and reestablished a lot. If you grep through the vmkernel log for "iSCSI", do you see many messages indicating that the session was dropped and then later reestablished? This might occur if either party (initiator or target) does something the other party does not like - and it takes a little time to reestablish the session, and if it happens often this time is increased, so relogins may actually take many seconds to occur.
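To get a feel for "a lot", one option is to bucket the matching log lines by minute. The sketch below is only an illustration: the function name is made up, it matches any line containing "iscsi" (the exact text of the drop and relogin messages is not assumed), and it expects standard syslog timestamps such as "Sep  7 16:23:00" at the start of each line.

```shell
# Hypothetical helper: number of log lines mentioning iSCSI, per minute.
# Usage: iscsi_lines_per_minute /var/log/vmkernel*
iscsi_lines_per_minute() {
    # -h drops file names from the output, -i ignores case
    grep -hi 'iscsi' "$@" |
        awk '{ count[$1 " " $2 " " substr($3, 1, 5)]++ }
             END { for (m in count) print m, count[m] }' |
        sort
}
```

Minutes with unusually high counts are the ones worth reading in full, since the count alone does not say whether the lines are drops, relogins, or something harmless.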

                               

                              If you do not find any such warnings, I have one last thing for you to check before filing an incident - please make sure that the two ESX servers use different initiator names. If they use the same name, bad things will happen.

                               

                              Lastly, to answer your question:

                              "The incident: Does VMware support me when I use unsupported hardware?"

                               

                              We give you no promises other than the fact that we will look at the logs.

                               

                              You can tell the support folks that Thor from iscsi dev asked you to file it - and please let them reference this as a possible case of bug 121440. The reason I want to get the logs in your case is that I am not 100% sure the problem is related to the target - e.g. it might be a generic problem that you could hit on other (supported) targets and that's primarily why I want to have a look.

                               

                              If you want to make my job going through the incident easier (and the chance of figuring out the problem higher), then please reproduce the problem and run vm-support on both systems immediately after. Also note the time on both systems before and after so I can narrow the search to the interesting parts of the log.
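A minimal sketch of that procedure, assuming the failure can be triggered from the command line: the wrapper below (a made-up name) prints the system time before and after whatever command is passed to it, so the two timestamps can go into the incident. reproduce_problem is a placeholder, not a real command.

```shell
# Hypothetical wrapper: run a command and print the time before and after it.
run_with_timestamps() {
    echo "start: $(date '+%Y-%m-%d %H:%M:%S')"
    "$@"
    status=$?
    echo "end:   $(date '+%Y-%m-%d %H:%M:%S')"
    return $status
}

# On each ESX server (left commented out on purpose):
# run_with_timestamps reproduce_problem
# vm-support
```

If the problem is triggered from the GUI instead, running date by hand just before and just after serves the same purpose.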

                               

                              NB: I work for VMware, but opinions expressed above are my own and not necessarily those of my employer.

                              • 27. Re: Problem migrating a VM
                                steverding Enthusiast

                                I started a little stress test on the systems to see if I get iSCSI errors. I'll tell you tomorrow if something showed up in the logs, and I'll file an incident after that (once I get through Fujitsu Siemens support, because I only have OEM support).

                                • 28. Re: Problem migrating a VM
                                  steverding Enthusiast

                                  After I ran:

                                  # endless write loop to stress the datastore
                                  while true
                                  do
                                    dd if=/dev/zero of=/vmfs/volumes/VDISKS1/file bs=1k count=100000
                                  done

                                   

                                   

                                  I get lots of those now :

                                   

                                   

                                   

                                  Sep  7 16:23:00 esx2 vmkernel: 0:00:16:00.726 cpu2:1031)SCSI: vm 1031: 5436: Sync CR at 64

                                  Sep  7 16:23:01 esx2 vmkernel: 0:00:16:01.895 cpu2:1031)SCSI: vm 1031: 5436: Sync CR at 48

                                  Sep  7 16:23:02 esx2 vmkernel: 0:00:16:02.956 cpu2:1031)SCSI: vm 1031: 5436: Sync CR at 32

                                  Sep  7 16:23:03 esx2 vmkernel: 0:00:16:04.083 cpu2:1031)SCSI: vm 1031: 5436: Sync CR at 16

                                  Sep  7 16:23:05 esx2 vmkernel: 0:00:16:06.087 cpu2:1031)SCSI: vm 1031: 5436: Sync CR at 0

                                  Sep  7 16:23:05 esx2 vmkernel: 0:00:16:06.087 cpu2:1031)WARNING: SCSI: 5446: Failing I/O due to too many reservation conflicts

                                  Sep  7 16:23:05 esx2 vmkernel: 0:00:16:06.087 cpu2:1031)WARNING: SCSI: 5541: status 0xbad0022, rstatus 0xc0de01 for vmhba40:0:0. residual R 919, CR 0, ER 3

                                  Sep  7 16:23:05 esx2 vmkernel: 0:00:16:06.087 cpu2:1031)BC: 1516: Failed to flush buffer for object f530 28 3 44f820a5 e1a1a490 15008acb be6e0e17 14404 1 0 0 0 0 0: SCSI reservation conflict

                                  Sep  7 16:23:07 esx2 vmkernel: 0:00:16:07.456 cpu2:1031)SCSI: vm 1031: 5436: Sync CR at 64

                                  Sep  7 16:23:08 esx2 vmkernel: 0:00:16:08.556 cpu2:1031)SCSI: vm 1031: 5436: Sync CR at 48

                                  Sep  7 16:23:10 esx2 vmkernel: 0:00:16:11.103 cpu2:1031)SCSI: vm 1031: 5436: Sync CR at 32

                                  Sep  7 16:23:11 esx2 vmkernel: 0:00:16:12.298 cpu2:1031)SCSI: vm 1031: 5436: Sync CR at 16

                                  Sep  7 16:23:24 esx2 vmkernel: 0:00:16:24.593 cpu2:1031)SCSI: vm 1031: 5436: Sync CR at 64

                                  Sep  7 16:23:25 esx2 vmkernel: 0:00:16:25.745 cpu2:1031)SCSI: vm 1031: 5436: Sync CR at 48

                                  Sep  7 16:23:30 esx2 vmkernel: 0:00:16:30.731 cpu2:1031)SCSI: vm 1031: 5436: Sync CR at 64

                                  Sep  7 16:23:31 esx2 vmkernel: 0:00:16:31.860 cpu2:1031)SCSI: vm 1031: 5436: Sync CR at 48

                                  Sep  7 16:23:36 esx2 vmkernel: 0:00:16:36.824 cpu2:1031)SCSI: vm 1031: 5436: Sync CR at 64

                                   

                                  followed later (another test) by :

                                   

                                  On ESX1 :

                                   

                                  Sep  7 16:36:34 esx1 vmkernel: 0:00:39:15.675 cpu2:1034)WARNING: FS3: 1570: Lock corruption detected at offset 0x3a37800

                                  Sep  7 16:36:34 esx1 vmkernel: 0:00:39:15.678 cpu2:1034)WARNING: FS3: 1570: Lock corruption detected at offset 0x3a37800

                                  Sep  7 16:36:34 esx1 vmkernel: 0:00:39:15.680 cpu2:1034)WARNING: FS3: 1570: Lock corruption detected at offset 0x3a37800

                                  Sep  7 16:36:34 esx1 vmkernel: 0:00:39:15.682 cpu2:1034)WARNING: FS3: 1570: Lock corruption detected at offset 0x3a37800

                                  Sep  7 16:36:34 esx1 vmkernel: 0:00:39:15.684 cpu1:1033)WARNING: FS3: 1570: Lock corruption detected at offset 0x3a37800

                                  Sep  7 16:36:34 esx1 vmkernel: 0:00:39:15.687 cpu2:1034)WARNING: FS3: 1570: Lock corruption detected at offset 0x3a37800

                                  Sep  7 16:36:34 esx1 vmkernel: 0:00:39:15.689 cpu2:1034)WARNING: FS3: 1570: Lock corruption detected at offset 0x3a37800

                                  Sep  7 16:36:34 esx1 vmkernel: 0:00:39:15.691 cpu2:1034)WARNING: FS3: 1570: Lock corruption detected at offset 0x3a37800

                                  Sep  7 16:36:34 esx1 vmkernel: 0:00:39:16.002 cpu2:1034)WARNING: FS3: 1570: Lock corruption detected at offset 0x3a37800

                                  Sep  7 16:36:34 esx1 vmkernel: 0:00:39:16.134 cpu2:1034)WARNING: FS3: 1570: Lock corruption detected at offset 0x3a37800

                                   

                                   

                                  On ESX2 :

                                  Sep  7 16:36:03 esx2 vmkernel: 0:00:28:20.814 cpu3:1034)WARNING: Fil3: 3206: Unknown object type 0

                                  Sep  7 16:36:03 esx2 vmkernel: 0:00:28:20.814 cpu3:1034)WARNING: Fil3: 7937: Found invalid object on 44f820a5-e1a1a490-8acb-0015170e6ebe

                                  Sep  7 16:36:03 esx2 vmkernel: 0:00:28:20.817 cpu3:1034)WARNING: Fil3: 3206: Unknown object type 0

                                   

                                  Message was edited by:

                                          steverding

                                   


                                  • 29. Re: Problem migrating a VM
                                    steverding Enthusiast

                                    One more thing: I got those errors only on one server, none on the other server.

                                     

                                    I had run:

                                    while true
                                    do
                                      dd if=/dev/zero of=/vmfs/volumes/VDISKS1/file bs=1k count=100000
                                    done

                                     

                                    Now it leaves a file /vmfs/volumes/VDISKS1/file which can't be accessed anymore :

                                     

                                    [root@esx1 VDISKS1]# rm file
                                    rm: remove regular file `file'? y
                                    rm: cannot remove `file': Invalid argument
                                    [root@esx1 VDISKS1]#