Hey All,
We have a 3-node ESXi cluster running 6.0.0 build 3380124 (Essentials Plus licensing) that has been happily chugging along.
The primary datastore is an old NetApp FAS3070 with a ton of FC disks served over NFS (running Data ONTAP 7-Mode 7.3.7, trunked quad-gigabit Ethernet), and we have hooked up a new datastore: a somewhat newer FAS6210 cluster with faster 15k RPM SAS disks and trunked 10-gigabit Ethernet, running clustered Data ONTAP 8.3.2.
The hosts are HP blades in a C7000 enclosure with Flex-10 interconnects, and we get great performance out of all components individually.
I'm having an issue with incredibly slow cold migrations of offline VMs from the old datastore to the new one. A thin-provisioned 20 GB VM takes a couple of hours to migrate to the new datastore, but moving it back only takes about 5 minutes.
When the move to the new datastore kicks off, most of the files immediately appear on the destination and there is a burst of write traffic that then winds down; after that we see almost no write traffic as the hours tick by, while the status indicator in the vSphere client slowly ticks up in percentage completed.
What's weird is that if I let that complete, and then move the same VM back to the original datastore, it takes just a few minutes to complete.
The VMs themselves have very fast access to this filer; they can write to it in excess of 500 MB/s, so I don't think the issue is on the filer side. We have several other applications running on it just fine.
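One host-side check I could run, to rule out the ESXi host's own NFS client path as opposed to the guests' path (the datastore mount point below is a placeholder; substitute the real one):

```shell
# Write 1 GB of zeros straight from the host shell to the new datastore,
# then clean up. /vmfs/volumes/new-ds is a hypothetical mount point.
time dd if=/dev/zero of=/vmfs/volumes/new-ds/throughput-test bs=1M count=1024
rm /vmfs/volumes/new-ds/throughput-test
```

If that is fast, the host-to-filer path is fine and the slowness is somewhere in the migration's copy mechanism itself.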
I'm not really sure where to look for insight into what is happening during the move.
In /var/log/vmkernel.log on the node that owns the VM, I see the following:
2016-04-25T22:55:55.994Z cpu4:33234)NMP: nmp_ThrottleLogForDevice:3286: Cmd 0x12 (0x439dc07f51c0, 0) to dev "naa.600508b1001030393146394334301000" on path "vmhba0:C0:T0:L1" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE
2016-04-25T22:57:56.666Z cpu8:33964)Config: 680: "SIOControlFlag1" = 33964, Old Value: 0, (Status: 0x0)
2016-04-25T22:59:02.715Z cpu9:33964)Config: 680: "SIOControlFlag1" = 0, Old Value: 33964, (Status: 0x0)
2016-04-25T23:00:55.988Z cpu1:32841)NMP: nmp_ThrottleLogForDevice:3286: Cmd 0x12 (0x439dcc956040, 0) to dev "naa.600508b1001030393146394334301000" on path "vmhba0:C0:T0:L1" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE
2016-04-25T23:01:08.976Z cpu9:33964)Config: 680: "SIOControlFlag1" = 33964, Old Value: 0, (Status: 0x0)
2016-04-25T23:02:15.041Z cpu9:33964)Config: 680: "SIOControlFlag1" = 0, Old Value: 33964, (Status: 0x0)
2016-04-25T23:04:21.239Z cpu9:33964)Config: 680: "SIOControlFlag1" = 33964, Old Value: 0, (Status: 0x0)
2016-04-25T23:05:27.299Z cpu13:33964)Config: 680: "SIOControlFlag1" = 0, Old Value: 33964, (Status: 0x0)
2016-04-25T23:05:55.979Z cpu4:32841)NMP: nmp_ThrottleLogForDevice:3286: Cmd 0x12 (0x439dc07f5340, 0) to dev "naa.600508b1001030393146394334301000" on path "vmhba0:C0:T0:L1" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE
2016-04-25T23:07:33.490Z cpu11:33964)Config: 680: "SIOControlFlag1" = 33964, Old Value: 0, (Status: 0x0)
I'm thinking that nmp_ThrottleLogForDevice message may be relevant, or it may not be. I don't see any complaints or warnings in the vSphere client at all.
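For what it's worth, that message seems to decode as mostly harmless: Cmd 0x12 is a SCSI INQUIRY, D:0x2 is a CHECK CONDITION returned by the device, and the sense triple 0x5 0x24 0x0 reads as Illegal Request / Invalid field in CDB per the standard SPC sense tables. The naa.600508b1... device is a local HP Smart Array volume, not either filer, so this looks like the local controller rejecting an INQUIRY variant it doesn't support rather than anything NFS-related. A throwaway decoder for the sense key, covering just the common values:

```shell
# Map a SCSI sense key (first value of the "Valid sense data" triple in
# vmkernel.log) to its SPC name. Only the common keys are covered.
decode_sense_key() {
    case "$1" in
        0x0) echo "No Sense" ;;
        0x1) echo "Recovered Error" ;;
        0x2) echo "Not Ready" ;;
        0x3) echo "Medium Error" ;;
        0x4) echo "Hardware Error" ;;
        0x5) echo "Illegal Request" ;;
        0x6) echo "Unit Attention" ;;
        *)   echo "Sense key $1" ;;
    esac
}

decode_sense_key 0x5   # -> Illegal Request
```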
Any ideas?
Could you share the current VMkernel networking configuration, including the management network?
When you say offline migrations, are you doing them because you are migrating from one ESXi cluster to another?
Also, share the IP details for both NAS boxes, if you can.
Thanks for the reply!
Not sure if this is all the info you were looking for - please let me know if not!
'Cold migration' in this case means I have powered off the VM and am changing its datastore (we do not have Storage vMotion, so we can't change the datastore live while the VM is running).
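One thing I could try, to separate the slow copy from the migration workflow itself, is cloning a disk by hand with vmkfstools from the host shell; if that is also slow, the problem is in the host's copy path rather than in vCenter's orchestration. Both paths below are placeholders:

```shell
# Clone a VMDK from the old datastore to the new one, explicitly keeping
# it thin. old-ds/new-ds and myvm are hypothetical names.
time vmkfstools -i /vmfs/volumes/old-ds/myvm/myvm.vmdk \
                -d thin /vmfs/volumes/new-ds/myvm/myvm.vmdk
```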
[root@dd-esxi-02:~] esxcfg-nics -l
Name PCI Driver Link Speed Duplex MAC Address MTU Description
vmnic0 0000:02:00.0 bnx2x Up 10000Mbps Full 00:17:a4:77:08:0c 1500 QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet
vmnic1 0000:02:00.1 bnx2x Up 10000Mbps Full 00:17:a4:77:08:0e 1500 QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet
vmnic2 0000:02:00.2 bnx2x Down 0Mbps Half 1c:c1:de:05:80:d1 1500 QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet
vmnic3 0000:02:00.3 bnx2x Down 0Mbps Half 1c:c1:de:05:80:d5 1500 QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet
vmnic4 0000:02:00.4 bnx2x Down 0Mbps Half 1c:c1:de:05:80:d2 1500 QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet
vmnic5 0000:02:00.5 bnx2x Down 0Mbps Half 1c:c1:de:05:80:d6 1500 QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet
vmnic6 0000:02:00.6 bnx2x Down 0Mbps Half 1c:c1:de:05:80:d3 1500 QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet
vmnic7 0000:02:00.7 bnx2x Down 0Mbps Half 1c:c1:de:05:80:d7 1500 QLogic Corporation NetXtreme II BCM57711E/NC532i 10 Gigabit Ethernet
[root@dd-esxi-02:~] esxcfg-vmknic -l
Interface Port Group/DVPort/Opaque Network IP Family IP Address Netmask Broadcast MAC Address MTU TSO MSS Enabled Type NetStack
vmk0 Management Network IPv4 10.78.3.137 255.255.0.0 10.78.255.255 00:17:a4:77:08:0c 1500 65535 true STATIC defaultTcpipStack
vmk0 Management Network IPv6 fe80::217:a4ff:fe77:80c 64 00:17:a4:77:08:0c 1500 65535 true STATIC, PREFERRED defaultTcpipStack
[root@dd-esxi-02:~] esxcfg-vswitch -l
Switch Name Num Ports Used Ports Configured Ports MTU Uplinks
vSwitch0 3072 23 128 1500 vmnic0,vmnic1
PortGroup Name VLAN ID Used Ports Uplinks
VM Network 0 17 vmnic0,vmnic1
Management Network 0 1 vmnic0,vmnic1
[root@dd-esxi-02:~] esxcli network ip interface list
vmk0
Name: vmk0
MAC Address: 00:17:a4:77:08:0c
Enabled: true
Portset: vSwitch0
Portgroup: Management Network
Netstack Instance: defaultTcpipStack
VDS Name: N/A
VDS UUID: N/A
VDS Port: N/A
VDS Connection: -1
Opaque Network ID: N/A
Opaque Network Type: N/A
External ID: N/A
MTU: 1500
TSO MSS: 65535
Port ID: 33554438
On the filer side of things:
This is the 'old' filer that we are looking to migrate away from, but migrations to it run at normal/expected speeds.
(Ignore the name iscsi1b; that's the filer's hostname, even though it shares the datastores to ESXi over NFS.)
iscsi1b> ifconfig -a
e0a: flags=0xad4c867<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 02:a0:98:07:46:db (auto-1000t-fd-up) flowcontrol full
trunked vif1
e0b: flags=0xad4c867<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 02:a0:98:07:46:db (auto-1000t-fd-up) flowcontrol full
trunked vif1
e0c: flags=0xad4c867<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 02:a0:98:07:46:db (auto-1000t-fd-up) flowcontrol full
trunked vif1
e0d: flags=0xad4c867<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 02:a0:98:07:46:db (auto-1000t-fd-up) flowcontrol full
trunked vif1
lo: flags=0x1948049<UP,LOOPBACK,RUNNING,MULTICAST,TCPCKSUM> mtu 8160
inet 127.0.0.1 netmask-or-prefix 0xff000000 broadcast 127.0.0.1
ether 00:00:00:00:00:00 (VIA Provider)
vif1: flags=0x22d4c863<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
inet 10.78.0.108 netmask-or-prefix 0xffff0000 broadcast 10.78.255.255
partner vif0 (not in use)
ether 02:a0:98:07:46:db (Enabled virtual interface)
And this is the 'new' filer; migrations to it run very, very slowly, but other traffic to it is very fast:
dd-san::> net int show -vserver d3
(network interface show)
Logical Status Network Current Current Is
Vserver Interface Admin/Oper Address/Mask Node Port Home
----------- ---------- ---------- ------------------ ------------- ------- ----
d3
d3_cifs_nfs_lif1
up/up 10.54.9.12/24 dd-san-01 a0a-549 true
dd-san::> net port show -node dd-san-01
(network port show)
Speed (Mbps)
Node Port IPspace Broadcast Domain Link MTU Admin/Oper
------ --------- ------------ ---------------- ----- ------- ------------
dd-san-01
a0a Default Default up 1500 auto/10000
a0a-549 Default Default up 1500 auto/10000
e0M Default Default up 1500 auto/100
e0a Default Default up 1500 auto/1000
e0b Default Default down 1500 auto/10
e0c Default - up 1500 auto/10000
e0d Cluster Cluster up 9000 auto/10000
e0e Default - up 1500 auto/10000
e0f Cluster Cluster up 9000 auto/10000
dd-san::> net port ifgrp show
(network port ifgrp show)
Port Distribution Active
Node IfGrp Function MAC Address Ports Ports
-------- ---------- ------------ ----------------- ------- -------------------
dd-san-01
a0a ip 02:a0:98:3b:03:cb full e0c, e0e
dd-san-02
a0a ip 02:a0:98:3b:03:23 full e0c, e0e
2 entries were displayed.
altmiket,
Did you ever figure out the issue? I'm seeing the same issue.
Unfortunately, I don't recall; I also just looked through my email from around that time, but there's no mention of it.
I suspect we just went ahead and migrated everything incredibly slowly, as we are indeed now cut over to the new filer.
I do vaguely recall that the speed had something to do with thin provisioning. That is, if the VM had, say, 1 TB allocated to it, even though it was only using a few GB, the migration would take an insane amount of time (and I'm pretty positive we confirmed that nothing was traveling over the wire the whole time; ESXi just sat there), whereas tiny VMs with only a small amount of space allocated/used would move at normal speed. It was almost as if thin provisioning wasn't 'working' for the migration, though I'm not sure how that would be possible.
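For what it's worth, if you want to check whether thin allocation is what the copy is chewing on, comparing a disk file's apparent (provisioned) size against its actually-allocated blocks shows the difference. The sketch below uses a sparse file as a stand-in for a thin-provisioned flat VMDK (the filename is made up); the same ls/du comparison works against the real file on a datastore:

```shell
# Create a sparse 1 GB file standing in for a thin-provisioned disk.
truncate -s 1G vm-flat.vmdk

# Apparent (provisioned) size: reported as the full 1 GB.
ls -lh vm-flat.vmdk

# Blocks actually allocated on disk: near zero for a fresh sparse file.
du -h vm-flat.vmdk
```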
Sorry I don't have more info for you.