VMware Cloud Community
altmiket
Contributor

cold migrate takes hours one way, minutes the other

Hey All,

We have a 3-node ESXi cluster running 6.0.0, build 3380124 (Storage Essentials Plus), that has been happily chugging along.

The primary datastore is an old NetApp 3070 with a ton of FC disks, served over NFS (Data ONTAP 7-Mode 7.3.7, trunked quad-gigabit Ethernet). We have hooked up a new datastore: a somewhat newer 6210 cluster with faster 15K RPM SAS disks and trunked 10-gigabit Ethernet, running clustered Data ONTAP 8.3.2.

The hosts are HP C7000 blades with Flex-10, and we get great performance out of all the components individually.

I'm having an issue with incredibly slow cold migrations of powered-off VMs from the old datastore to the new one. A thin-provisioned 20 GB VM takes a couple of hours to migrate to the new datastore, but moving it back takes only about 5 minutes.

It seems that when the move to the new datastore kicks off, most of the files immediately appear on the destination and there is a burst of write traffic that then winds down; after that we see almost no write traffic as the hours tick by, while the status indicator in the vSphere client slowly ticks up in percent complete.

What's weird is that if I let that complete, and then move the same VM back to the original datastore, it takes just a few minutes to complete.

The VMs themselves have very fast access to this filer (they can write to it in excess of 500 MB/s), so I don't think the issue is on the filer side. We have several other applications running on it just fine.
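For reference, a similar write test can also be run from the ESXi shell, over the same NFS client path a cold migration uses. A minimal sketch, assuming a 1 GB test file is acceptable; the datastore name is a placeholder, and /dev/zero may compress on the filer, so treat the number as a rough ceiling:

time dd if=/dev/zero of=/vmfs/volumes/<new-datastore>/ddtest bs=1M count=1024   # ~1 GB sequential write
rm /vmfs/volumes/<new-datastore>/ddtest                                         # clean up the test file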

I'm not really sure where to look for insight into what is happening during the move.
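The best idea I have so far is watching live uplink throughput on the host while the copy runs, e.g.:

esxtop                                    # press 'n' for the network view (MbTX/s / MbRX/s per vmnic)
esxcli network nic stats get -n vmnic0    # raw byte/packet counters for one uplink

but that would only confirm the lack of traffic, not explain it.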

In /var/log/vmkernel.log on the node that owns the VM, there is the following:

2016-04-25T22:55:55.994Z cpu4:33234)NMP: nmp_ThrottleLogForDevice:3286: Cmd 0x12 (0x439dc07f51c0, 0) to dev "naa.600508b1001030393146394334301000" on path "vmhba0:C0:T0:L1" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

2016-04-25T22:57:56.666Z cpu8:33964)Config: 680: "SIOControlFlag1" = 33964, Old Value: 0, (Status: 0x0)

2016-04-25T22:59:02.715Z cpu9:33964)Config: 680: "SIOControlFlag1" = 0, Old Value: 33964, (Status: 0x0)

2016-04-25T23:00:55.988Z cpu1:32841)NMP: nmp_ThrottleLogForDevice:3286: Cmd 0x12 (0x439dcc956040, 0) to dev "naa.600508b1001030393146394334301000" on path "vmhba0:C0:T0:L1" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

2016-04-25T23:01:08.976Z cpu9:33964)Config: 680: "SIOControlFlag1" = 33964, Old Value: 0, (Status: 0x0)

2016-04-25T23:02:15.041Z cpu9:33964)Config: 680: "SIOControlFlag1" = 0, Old Value: 33964, (Status: 0x0)

2016-04-25T23:04:21.239Z cpu9:33964)Config: 680: "SIOControlFlag1" = 33964, Old Value: 0, (Status: 0x0)

2016-04-25T23:05:27.299Z cpu13:33964)Config: 680: "SIOControlFlag1" = 0, Old Value: 33964, (Status: 0x0)

2016-04-25T23:05:55.979Z cpu4:32841)NMP: nmp_ThrottleLogForDevice:3286: Cmd 0x12 (0x439dc07f5340, 0) to dev "naa.600508b1001030393146394334301000" on path "vmhba0:C0:T0:L1" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

2016-04-25T23:07:33.490Z cpu11:33964)Config: 680: "SIOControlFlag1" = 33964, Old Value: 0, (Status: 0x0)

I'm thinking that nmp_ThrottleLogForDevice message may be relevant, or it may not. I don't see any complaints or warnings in the vSphere client at all.
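For what it's worth, the device in those messages can be identified with the standard device listing; I believe the 600508b1 NAA prefix belongs to HP Smart Array controllers, i.e. the blade's local storage on vmhba0, not anything on the NFS side:

esxcli storage core device list -d naa.600508b1001030393146394334301000   # shows vendor, model, and display name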

Any ideas?

4 Replies
hussainbte
Expert

Could you share the current VMkernel networking configuration, including the management network?

When you say offline migrations, are you doing them because you are migrating from one ESXi cluster to another?

Also, share the IP details for both NAS boxes if you can. 🙂

If you found my answers useful, please consider marking them as Correct or Helpful.
Regards, Hussain
https://virtualcubes.wordpress.com/
altmiket
Contributor

Thanks for the reply!

Not sure if this is all the info you were looking for - please let me know if not!

'Cold migration' in this case means I have powered off the VM and am changing the datastore (we do not have Storage vMotion, so we can't move storage live while the VM is running).

[root@dd-esxi-02:~] esxcfg-nics -l

Name    PCI          Driver      Link Speed     Duplex MAC Address       MTU    Description

vmnic0  0000:02:00.0 bnx2x       Up   10000Mbps Full   00:17:a4:77:08:0c 1500   QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet

vmnic1  0000:02:00.1 bnx2x       Up   10000Mbps Full   00:17:a4:77:08:0e 1500   QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet

vmnic2  0000:02:00.2 bnx2x       Down 0Mbps     Half   1c:c1:de:05:80:d1 1500   QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet

vmnic3  0000:02:00.3 bnx2x       Down 0Mbps     Half   1c:c1:de:05:80:d5 1500   QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet

vmnic4  0000:02:00.4 bnx2x       Down 0Mbps     Half   1c:c1:de:05:80:d2 1500   QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet

vmnic5  0000:02:00.5 bnx2x       Down 0Mbps     Half   1c:c1:de:05:80:d6 1500   QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet

vmnic6  0000:02:00.6 bnx2x       Down 0Mbps     Half   1c:c1:de:05:80:d3 1500   QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet

vmnic7  0000:02:00.7 bnx2x       Down 0Mbps     Half   1c:c1:de:05:80:d7 1500   QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet


[root@dd-esxi-02:~] esxcfg-vmknic -l

Interface  Port Group/DVPort/Opaque Network        IP Family IP Address                              Netmask         Broadcast       MAC Address       MTU     TSO MSS   Enabled Type                NetStack

vmk0       Management Network                      IPv4      10.78.3.137                             255.255.0.0     10.78.255.255   00:17:a4:77:08:0c 1500    65535     true    STATIC              defaultTcpipStack

vmk0       Management Network                      IPv6      fe80::217:a4ff:fe77:80c                 64                              00:17:a4:77:08:0c 1500    65535     true    STATIC, PREFERRED   defaultTcpipStack


[root@dd-esxi-02:~] esxcfg-vswitch -l

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks

vSwitch0         3072        23          128               1500    vmnic0,vmnic1

  PortGroup Name        VLAN ID  Used Ports  Uplinks

  VM Network            0        17          vmnic0,vmnic1

  Management Network    0        1           vmnic0,vmnic1

[root@dd-esxi-02:~] esxcli network ip interface list

vmk0

   Name: vmk0

   MAC Address: 00:17:a4:77:08:0c

   Enabled: true

   Portset: vSwitch0

   Portgroup: Management Network

   Netstack Instance: defaultTcpipStack

   VDS Name: N/A

   VDS UUID: N/A

   VDS Port: N/A

   VDS Connection: -1

   Opaque Network ID: N/A

   Opaque Network Type: N/A

   External ID: N/A

   MTU: 1500

   TSO MSS: 65535

   Port ID: 33554438
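Not sure if it matters, but basic reachability and path MTU from the host to each filer can be checked with vmkping (filer IPs are in the outputs below; -d sets don't-fragment, and 1472 bytes of payload plus 28 bytes of IP/ICMP headers matches the 1500 MTU in use everywhere):

vmkping -I vmk0 -d -s 1472 10.78.0.108   # old filer (vif1)
vmkping -I vmk0 -d -s 1472 10.54.9.12    # new filer (d3_cifs_nfs_lif1)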

On the filer side of things:

This is the 'old' filer, the one we are looking to migrate away from; migrations to it run at normal/expected speeds:

(Ignore the name iscsi1b - that's the filer's hostname, even though it is sharing the datastores to ESXi over NFS.)

iscsi1b> ifconfig -a

e0a: flags=0xad4c867<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500

        ether 02:a0:98:07:46:db (auto-1000t-fd-up) flowcontrol full

        trunked vif1

e0b: flags=0xad4c867<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500

        ether 02:a0:98:07:46:db (auto-1000t-fd-up) flowcontrol full

        trunked vif1

e0c: flags=0xad4c867<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500

        ether 02:a0:98:07:46:db (auto-1000t-fd-up) flowcontrol full

        trunked vif1

e0d: flags=0xad4c867<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500

        ether 02:a0:98:07:46:db (auto-1000t-fd-up) flowcontrol full

        trunked vif1

lo: flags=0x1948049<UP,LOOPBACK,RUNNING,MULTICAST,TCPCKSUM> mtu 8160

        inet 127.0.0.1 netmask-or-prefix 0xff000000 broadcast 127.0.0.1

        ether 00:00:00:00:00:00 (VIA Provider)

vif1: flags=0x22d4c863<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500

        inet 10.78.0.108 netmask-or-prefix 0xffff0000 broadcast 10.78.255.255

        partner vif0 (not in use)

        ether 02:a0:98:07:46:db (Enabled virtual interface)

And this is the 'new' filer, which migrations run very, very slowly to, even though other traffic to it is very fast:

dd-san::> net int show -vserver d3

  (network interface show)

            Logical    Status     Network            Current       Current Is

Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home

----------- ---------- ---------- ------------------ ------------- ------- ----

d3

            d3_cifs_nfs_lif1

                         up/up    10.54.9.12/24      dd-san-01     a0a-549 true

dd-san::> net port show -node dd-san-01

  (network port show)

                                                             Speed (Mbps)

Node   Port      IPspace      Broadcast Domain Link   MTU    Admin/Oper

------ --------- ------------ ---------------- ----- ------- ------------

dd-san-01

       a0a       Default      Default          up       1500  auto/10000

       a0a-549   Default      Default          up       1500  auto/10000

       e0M       Default      Default          up       1500  auto/100

       e0a       Default      Default          up       1500  auto/1000

       e0b       Default      Default          down     1500  auto/10

       e0c       Default      -                up       1500  auto/10000

       e0d       Cluster      Cluster          up       9000  auto/10000

       e0e       Default      -                up       1500  auto/10000

       e0f       Cluster      Cluster          up       9000  auto/10000

dd-san::> net port ifgrp show

  (network port ifgrp show)

         Port       Distribution                   Active

Node     IfGrp      Function     MAC Address       Ports   Ports

-------- ---------- ------------ ----------------- ------- -------------------

dd-san-01

         a0a        ip           02:a0:98:3b:03:cb full    e0c, e0e

dd-san-02

         a0a        ip           02:a0:98:3b:03:23 full    e0c, e0e

2 entries were displayed.
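One thing I notice writing this up: the new filer's data LIF (10.54.9.12/24 on port a0a-549) is on a different subnet than vmk0 (10.78.0.0/16), so NFS traffic to it is routed rather than layer-2 adjacent, unlike the old filer. The host's view of that path shows up in the routing table:

esxcfg-route -l   # lists VMkernel routes, including the default gateway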

brendenc00k
Contributor

altmiket,

Did you ever figure this out? I'm seeing the same issue.

altmiket
Contributor

Unfortunately, I don't recall. I also just looked through my email from around that time, but found no mention of it.

I suspect we just went ahead and migrated everything incredibly slowly, as we are indeed fully cut over to the new filer now.

I do vaguely recall the speed had something to do with the thin provisioning. That is, if the VM had, say, 1 TB allocated to it, it would take an insane amount of time to migrate even though it was only using a few GB (and I'm pretty positive we confirmed that nothing was traveling over the wire this whole time; ESXi just sat there), whereas tiny VMs with only a small amount of space allocated/used would move at normal speed. Almost as if thin provisioning wasn't 'working' for the migration, though I'm not sure how that would be possible.
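brendenc00k, if you want to check whether your slow VMs are the big thin-provisioned ones too, comparing provisioned size against blocks actually written is quick from the ESXi shell (paths are placeholders):

ls -lh /vmfs/volumes/<datastore>/<vm>/<vm>-flat.vmdk   # provisioned (logical) size
du -h  /vmfs/volumes/<datastore>/<vm>/<vm>-flat.vmdk   # space actually written on the filer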

Sorry I don't have more info for you.
