VMware Cloud Community
Phatsta
Enthusiast

Network problems

I'm humbly asking for your help with a stubborn problem in a customer's network. I've done what I feel I can, but it seems my knowledge just isn't adequate to solve this. I really need your help.

Situation (see attached PDF for more info):

This network is located in a 4-story building with 6 APs on each level. These APs are PoE and connected to a 10-port PoE switch, one on each level. From there the switches are connected to a unified management switch in the server room that centrally manages all the APs and their configs. There are 2 SSIDs available, each with its own VLAN, meaning there are a total of 3 VLANs, counting the untagged network (VLAN 1).

An ESXi host is connected to the management switch, running 4 virtual servers: one for each VLAN, one router, and one server for testing purposes. SRV01 is DHCP, DNS and router for VLAN 2, and SRV02 is DHCP, DNS and router for VLAN 3. They both route to the shared router on VLAN 1 to reach the internet.
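For readers following along: a setup like this usually means one port group per VLAN on the standard vSwitch, each tagged with its VLAN ID (VST mode), with the physical uplink trunked. A minimal sketch using the classic esxcfg-vswitch tooling; the port group names here are illustrative, not the actual config:

```shell
# Sketch only -- port group names are assumptions, not the actual config.
# One port group per VLAN on vSwitch0, tagged in the vSwitch (VST mode):
esxcfg-vswitch -A "VLAN2-LAN" vSwitch0       # add a port group for SRV01's subnet
esxcfg-vswitch -v 2 -p "VLAN2-LAN" vSwitch0  # tag it with VLAN ID 2
esxcfg-vswitch -A "VLAN3-LAN" vSwitch0       # add a port group for SRV02's subnet
esxcfg-vswitch -v 3 -p "VLAN3-LAN" vSwitch0  # tag it with VLAN ID 3
esxcfg-vswitch -l                            # list port groups and VLAN IDs to verify
```

The physical switch port facing the host's uplink then has to carry VLANs 1, 2 and 3; if it doesn't, tagged frames are silently dropped.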

Problem:

At first the problem was that some wireless clients couldn't get an IP, and hence had no network connection. After some testing I concluded that I had mistakenly placed a non-VLAN-capable switch in the middle of a VLAN network, so I corrected that by removing it and reconfigured the network to what it is today, according to the attached network map. This didn't help at all, though. Next I went over the cables and swapped some old ones out for new. Still nothing. I then deleted vSwitch0 on the ESXi host and added a new one, configured the same way, and that worked for half a day or so.

After this I contacted HP support (which now handles all 3Com products) and got help from a network tech there. We analysed the logs and saw that the management switch was logging memory leak errors. He then sent me upgraded firmware that was supposed to eliminate this bug, but unfortunately that didn't help either. Rebooting the switch helps for half a day or so, same as rebooting the ESXi host.

Since our switches are no longer under warranty, the tech couldn't help me more than that, but he left me a few ideas. He thinks some unit on the network keeps rebooting, making the spanning tree protocol restart and reconverge every time, which keeps the switches busy and unavailable for traffic in the meantime. So next I started checking every physical unit I could find (except computers, of course), but no switch seems to be rebooting, unless it's doing so without actually restarting (i.e. bad software).

This leaves me puzzled; I can't see where the problem lies or what to look for or try next. I'm not even sure it isn't my own fault: due to my noobness I might have configured something completely wrong. The latter would be the most probable explanation.

I'm attaching both the network map and a screenshot of the vSwitch setup. I replaced some of the customer-specific names with other info. If any more info is needed, just ask.

Anyone willing and able to help?

6 Replies
chengh
Contributor

If the wireless users cannot get an IP address from the DHCP server, it means something is wrong with the broadcast traffic between the AP and the DHCP server.

You said "Rebooting the switch helps for half a day or so, same as rebooting the ESXi host." Did you mean the management switch?


Are there two SSIDs on each AP? Since you have 2 VLANs and 2 DHCP servers, can you try configuring each AP to use only one SSID and one VLAN? For example, level 1 only uses VLAN 2 and level 2 only uses VLAN 3. This may help you narrow down the issue and find out which device causes the problem.

I also recommend you try replacing the management switch with another one for testing, if possible. In my experience, a malfunctioning switch can cause weird communication problems between a VMware host and its VMs.

sakibpavel
Enthusiast

Have you configured the switch port where the access point connects as an access port? Please also make sure the access point is configured correctly.
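To clarify the suggestion: an access port is an untagged member of a single VLAN, suitable for a simple end device, while a port feeding an AP that serves two tagged SSID VLANs would normally be a trunk. In Comware-style syntax as used on 3Com/HP gear it looks roughly like this; the interface and VLAN numbers are just illustrative:

```
interface GigabitEthernet1/0/1
 port link-type access
 port access vlan 2
#
interface GigabitEthernet1/0/2
 port link-type trunk
 port trunk permit vlan 1 2 3
```

`port link-type access` with `port access vlan` puts the port untagged in one VLAN; `port link-type trunk` with `port trunk permit vlan` lets tagged frames for the listed VLANs through.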
Sakibpavel
Phatsta
Enthusiast

@chengh The wireless clients can actually get IPs; that was just the initial symptom. Looking at it more closely, what happens is that all clients, wired or wireless, lose network connectivity; or rather, the switches go offline and don't handle their requests, so they time out. Looking through the logs of the management switch, there are about 170 lines of "memory leak" errors. According to the HP tech this is also a symptom rather than the actual error. He said there must be something else on the network that keeps rebooting, making STP reconverge, and since traffic isn't forwarded while STP reconverges, everything comes to a halt until it's done.

I meant the management switch, yes. Rebooting the ESXi host will also solve the problem, temporarily.

Yes, these are 3Com 8760s that can host 4 roaming SSIDs at the same time, but they're really only dummies, since the actual config is pushed out from the management switch. I could change their config, but not without losing functions in the production environment. Granted, it doesn't work for everyone all the time, but everything still works for most people, most of the time. When the HP tech said "something keeps rebooting", I believe he meant a switch or router that participates in STP. Also, we've actually monitored all the switches (and with them the APs), and none of them keeps rebooting. Worth mentioning is that we've had these APs and switches working perfectly for years, but since we expanded the network, added another server and moved the VLAN separation from the Windows 2003 VM to the vSwitch, this has started to show. Just a little in the beginning, but it keeps getting worse.

I'd love to replace the management switch just to rule it out, but I need to be really sure that would help, and I'm not. Firstly, they aren't cheap (I'd need to buy a new one, as I don't have a spare, and I can't just use any switch since that would leave the wireless network dead), and on top of that the HP tech said he didn't think it would solve the problem even if I replaced it with a newer one. Since I'm a consultant, I need to have all my facts straight to get the money to buy another switch for the customer. Oh, and another thing... I tried to capture data with Wireshark on SRV01's interface towards vSwitch0, but every time the STP reconvergence occurs, Wireshark crashes... which makes it a bit hard to see any details of the traffic. To me it feels like it's the vSwitch that keeps "rebooting" somehow, and if that is the STP root, then I can see why the switches go bonkers. Although I can't really understand how or why this would happen.

Phatsta
Enthusiast

What do you mean by an access port?

As I posted in my previous answer, we've had these APs and switches working perfectly for years, and that config is unchanged since. What has changed is the vSwitch and VLAN handling. I'm not sure that's what's at fault, though, as we've also had increased strain on the network. It might be a dodgy switch that goes bananas due to overuse.

Phatsta
Enthusiast

Okay, so the mob (pissed-off users) has forced me to take desperate action. I don't like temporary solutions; in fact I know I'm going to get torn apart for it, but I just have no choice. I had to remove all the VLANs altogether and put every client on the untagged network, with the m0n0wall router as DHCP server and internet router. Now it works, at least for the moment. No sign of the errors, even under heavy load.

The more I keep trying to solve the original problem, the worse it gets. And in my head, everything points to the VLAN handling in the vSwitches. There's a small chance it might be faulty hardware on the host computer, but it's not looking that way. Let me just say I'm re-planning the network and won't ever rely on the virtual networks in ESXi again.

Phatsta
Enthusiast

Just thought I'd add the solution for future reference, should anyone want to know. It had nothing to do with VMware in the end. It was a faulty switch that didn't immediately indicate any error, but on further troubleshooting we found it was acting very strangely. It was a 3CRUS2475 wireless controller.
