VMware Cloud Community
rob_ellison
Contributor

Nexus 1000v Disaster!!

At the moment I have a single ESXi host: 2 NICs in a standard vSwitch, plus 2 extra as uplinks from the Nexus module.

I'm not entirely sure what happened, but after shutting the ESX server down and bringing it back up in a new location, connectivity was lost completely. The Service Console was a port-group inside the Nexus switch, so I think that must have failed. As the VSM was hosted on the same ESX host, it would not have come up at the time, but I thought the VEM could run without the VSM?

Anyway, I removed the VEM from the ESX server and reconfigured a few of the networks on the standard vSwitch to get it up and running. No problems there, but it would be a bit of a nightmare on a production system. I re-added the ESX server to vCenter and everything looked OK, EXCEPT vCenter thought the VEM was still installed. It wasn't. I tried to remove the distributed switch, but this failed with a meaningless error message. I ended up removing the ESX server from vCenter and re-adding it. Now the distributed switch doesn't think it's installed, but I can't add the host to the Nexus switch!!

I get this error message:

Cannot complete a Distributed Virtual Switch operation for one or more host members.

(I haven't got the detailed text to hand.)

Is there any way to do a cleanup so that it is completely uninstalled? This is becoming a bit of a nightmare!

8 Replies
lwatta
Hot Shot

Rob,

If you migrated the Service Console to the VEM and used the "system vlan" directive on both the port-profile for the Service Console and the uplink port-profile, then I believe it should have worked. If you still have the configuration of the VSM, please post it so we can take a look; it would help us debug what caused the issue. In production environments we recommend customers use HA mode, which provides two VSMs, one primary and one standby. We recommend that these instances be kept on different ESX hosts to prevent an outage like the one you experienced.
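To make the "system vlan" point a bit more concrete, here is a rough Python sketch (not a Cisco tool) that scans the text of a running-config and lists the port-profiles that do not carry a given VLAN under "system vlan". It assumes the config is indented the way the VSM normally prints it, with each port-profile block ending at the next non-indented line. The function names, the "service-console" profile name, and the VLAN numbers in the sample are all made up for illustration.

```python
import re


def expand_vlans(spec):
    """Expand a VLAN list such as '104,111-112' into a set of ints."""
    vlans = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            vlans.update(range(int(low), int(high) + 1))
        else:
            vlans.add(int(part))
    return vlans


def system_vlans_by_profile(config_text):
    """Map each port-profile name to the VLANs on its 'system vlan' lines."""
    profiles = {}
    current = None
    for line in config_text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # A non-indented line either opens a port-profile or closes the last one.
            m = re.match(r"port-profile\s+(?:type\s+\S+\s+)?(\S+)", line)
            current = m.group(1) if m else None
            if current is not None:
                profiles.setdefault(current, set())
        elif current is not None:
            m = re.match(r"\s+system vlan\s+(.+)", line)
            if m:
                profiles[current] |= expand_vlans(m.group(1))
    return profiles


def profiles_missing_system_vlan(config_text, vlan):
    """Return the names of port-profiles that do not list `vlan` as a system VLAN."""
    found = system_vlans_by_profile(config_text)
    return sorted(name for name, vlans in found.items() if vlan not in vlans)


# Illustrative sample only; profile names and VLAN numbers are placeholders.
SAMPLE = """\
port-profile system-uplink
  capability uplink
  switchport mode trunk
  system vlan 111-112
  state enabled
port-profile service-console
  vmware port-group
  switchport access vlan 104
  system vlan 104
  state enabled
interface mgmt0
  ip address 10.100.5.101/24
"""

if __name__ == "__main__":
    print(profiles_missing_system_vlan(SAMPLE, 104))  # ['system-uplink']
```

In the sample, the Service Console profile carries VLAN 104 as a system VLAN but the uplink profile does not, so the uplink is flagged. That is the "both profiles" condition described above.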

When you say you removed the VEM, how did you remove it? Did you use vem-remove -d? If you ran that command, then the VEM module should be unloaded and all parts of the VEM removed.

If the VEM parts are still on the ESX host (as in, vemcmd still works), then do the following:

root@cae-cali-172--# cd /usr/lib/ext/cisco/nexus/vem-v100/sbin

root@cae-cali-172--# hotswap.sh -u

root@cae-cali-172--# vem-remove -d -v

Then try to add your host back. If the above fails, or if /usr/lib/ext/cisco... does not exist, then try the following to clean the DVS info out of ESX:

1. Note the first line of "/usr/lib/vmware/bin/net-dvs -l". It will look like "switch af 34 3c ..." The hex sequence is the switch uuid in ASCII.

2. Try running net-dvs -d -n "switch-uuid-in-ascii", like so:

[root@cmhlab-vm4 ~]# /usr/lib/vmware/bin/net-dvs -d -n "d0 97 06 50 59 a2 f2 52-78 26 2f 5f ff 15 d4 2b"
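If you would rather not retype the UUID by hand, the two steps can be sketched in Python as below. This is only an illustration: it assumes the first non-empty line of the "net-dvs -l" output is the word "switch" followed by the UUID as two groups of eight hex pairs joined by a dash, as in the example command. The function names are made up, and the sample listing is a stand-in built from that example, not real output. The script only builds the command string; it does not run anything.

```python
import re

# "switch" followed by 8 hex pairs, a dash, and 8 more hex pairs.
# Anything after the UUID on the same line is ignored.
UUID_RE = re.compile(
    r"^switch\s+((?:[0-9a-f]{2} ){7}[0-9a-f]{2}-(?:[0-9a-f]{2} ){7}[0-9a-f]{2})",
    re.IGNORECASE,
)


def switch_uuid(net_dvs_listing):
    """Pull the switch UUID out of the first non-empty line of 'net-dvs -l' output."""
    first = next((l for l in net_dvs_listing.splitlines() if l.strip()), "")
    m = UUID_RE.match(first.strip())
    if not m:
        raise ValueError("first line does not look like 'switch <uuid>': %r" % first)
    return m.group(1)


def delete_command(uuid, net_dvs="/usr/lib/vmware/bin/net-dvs"):
    """Build (but do not run) the cleanup command for the given switch UUID."""
    return '%s -d -n "%s"' % (net_dvs, uuid)


# Stand-in listing: only the first line matters for this sketch.
SAMPLE_LISTING = """\
switch d0 97 06 50 59 a2 f2 52-78 26 2f 5f ff 15 d4 2b
    (remaining output omitted)
"""

if __name__ == "__main__":
    print(delete_command(switch_uuid(SAMPLE_LISTING)))
```

Check the printed command against what "net-dvs -l" actually shows on your host before running it, since the delete removes the DVS data from the host.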

Hopefully that will get you back up and running.

louis

rob_ellison
Contributor

Hi Louis, thanks for the prompt reply!

I've had a run through those commands and it seems that the VEM isn't installed on the ESXi host anyway. To prove this I have built an extra ESXi server, and I get the same issue. It looks like it is a problem with the vCenter server.

I have removed the extensions from vCenter and re-added them, as well as removing the DVS.

I have also redeployed the VSM under a different name to try to bypass this issue. Still no luck, however!

This is the error message I get when I try to add a host (even a freshly installed one!)

Here is the VSM config as it is at the moment:

NGNX-Nexus-VSM1(config-port-prof)# sh run
version 4.0(4)SV1(1)
telnet server enable
banner motd # NOTICE TO USERS=============================================================================This
=============================#
ssh key rsa 2048
ip domain-lookup
ip host NGNX-Nexus-VSM1 10.100.5.101
kernel core target 0.0.0.0
kernel core limit 1
system default switchport
vrf context management
  ip route 0.0.0.0/0 10.100.5.254
switchname NGNX-Nexus-VSM1
vlan 1
vlan 100
  name M100-Private-Management
vlan 101
  name M101-Database-Servers
vlan 102
  name M102-Customer-Facing
vlan 103
  name M103-Internet-Facing
vlan 104
  name M104-ESX-Service-Console
vlan 111
  name M111-Nexus-Control
vlan 112
  name M112-Nexus-Packet
vdc NGNX-Nexus-VSM1 id 1
  limit-resource vlan minimum 16 maximum 513
  limit-resource monitor-session minimum 0 maximum 64
  limit-resource vrf minimum 16 maximum 8192
  limit-resource port-channel minimum 0 maximum 256
  limit-resource u4route-mem minimum 32 maximum 80
  limit-resource u6route-mem minimum 16 maximum 48
port-profile system-uplink
  capability uplink
  vmware port-group
  switchport mode trunk
  switchport trunk allowed vlan 2-998
  no shutdown
  system vlan 111-112
  state enabled
interface mgmt0
  ip address 10.100.5.101/24
interface control0
boot kickstart bootflash:/nexus-1000v-kickstart-mz.4.0.4.SV1.1.bin sup-1
boot system bootflash:/nexus-1000v-mz.4.0.4.SV1.1.bin sup-1
boot kickstart bootflash:/nexus-1000v-kickstart-mz.4.0.4.SV1.1.bin sup-2
boot system bootflash:/nexus-1000v-mz.4.0.4.SV1.1.bin sup-2
svs-domain
  domain id 2
  control vlan 111
  packet vlan 112
  svs mode L2

NGNX-Nexus-VSM1(config-port-prof)# exit
NGNX-Nexus-VSM1(config)# svs connection VC
NGNX-Nexus-VSM1(config-svs-conn)# vmware dvs datacenter-name Nottingham
NGNX-Nexus-VSM1(config-svs-conn)# protocol vmware-vim
NGNX-Nexus-VSM1(config-svs-conn)# remote ip address 10.100.1.10
NGNX-Nexus-VSM1(config-svs-conn)# connect
Note: Command execution in progress..please wait
NGNX-Nexus-VSM1(config-svs-conn)# exit

lwatta
Hot Shot

Rob,

We are thinking that your VEM module might not be installed correctly. Can you tell us how you installed the VEM module on the ESXi hosts? Did you use VUM, or are you installing it manually with RCLI?

louis

rob_ellison
Contributor

Hi Louis,

I was using VUM. Looking at the logs, it looked like that failed due to a dodgy firewall. I've fixed that issue now and applied the VEM patch. I now get a different error..

I'm now just trying to remove all the extensions from the ESXi host and will try again.

P.S. Is there any way of resetting the admin password on the VSM? For some reason I now can't log into it!

thanks again,

Rob

lwatta
Hot Shot

Rob,

To reset the password on the VSM you can use this guide:

http://www.cisco.com/en/US/docs/switches/datacenter/nexus1000/sw/4_0/pw_recovery/guide/n1000v_pwd_re...

The error message you are getting now can also mean that the VEM module is not loaded. Make sure that VUM is going through all the steps to install the VEM. We have seen the VUM process die and cause errors like the one you are seeing, so make sure that the VUM processes are still running.

louis

rob_ellison
Contributor

Hi Louis,

It looks to me like the VEM is being installed properly; I can't see anything strange in the logs. VUM shows the VEM patch as being installed OK.

Have you got any other ideas?

lwatta
Hot Shot

Rob,

I'm out of ideas at this point. I can escalate to engineering if you have not found a workaround.

louis

rob_ellison
Contributor

Hi Louis, I did get this working in the end - it was a combination of firewall ports between the update server and the ESX hosts and a seemingly corrupted ESXi install.

After starting afresh with the correct ports opened and a clean install of ESXi, it worked correctly. It's still not perfect every time, but at least when it fails now I can rebuild the host and re-add it.
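For anyone hitting the same firewall problem, a quick way to check whether a TCP port on the update server can be reached is a small Python sketch like the one below. It is only a rough illustration: the function names are made up, and the host name and port numbers in the usage comment are placeholders, so take the real ports from the Update Manager documentation for your version. It needs to be run from a machine on the same side of the firewall as the ESX hosts, and it only tests that a TCP connection can be opened.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_ports(host, ports, timeout=3.0):
    """Check several ports on one host; returns {port: True/False}."""
    return {port: port_open(host, port, timeout) for port in ports}


# Example use (placeholder host and ports, substitute your own):
#   results = check_ports("update-server.example.local", [80, 443])
#   for port, ok in results.items():
#       print(port, "reachable" if ok else "blocked or closed")
```

A False result does not tell you whether the firewall or the service itself is at fault, but it narrows things down before you start rebuilding hosts.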

thanks for your help,

Rob
