Network troubleshooting on the VMware side and the network layer

#VMware #vSphereNetworking #DVS #NICTeaming #CLI

Today a friend asked me about a ping issue in his vSphere environment. After discussing it with him, I decided to visit the site and find the reason behind it. Here is a brief summary of the issue.

They have two ESXi servers in their production cluster. Some VMs have connectivity and some do not, and sometimes connectivity is lost when a vMotion occurs. I was surprised, and first checked whether an OS-level firewall or a physical firewall was blocking ICMP traffic between the client PC and the VMs. Then I realized that if a firewall were the cause, the VMs would never have had connectivity at all. So I asked what they had changed recently. It turned out they had recently replaced their upstream layer 2 switches: previously Cisco, now Arista.

Okay, cool. Next I checked the virtual switch level configuration: uplinks, LLDP, VLANs and so on. Finally, I chose one test VM and ruled out the basic things that affect ping by disabling the OS-level firewall (Windows Firewall on Windows, iptables or ufw on Linux). Because this environment runs production workloads, I had to troubleshoot carefully without impacting the rest of the VMs. So I selected one ESXi host for troubleshooting, migrated the VMs to the standby ESXi server, and started troubleshooting with the selected VM.

At this point the test VM had connectivity. When I powered it off and on again, it no longer responded to ping on the same ESXi server. I migrated it to the other ESXi server, but the result was the same. I checked the dvSwitch settings: the uplinks were assigned correctly and set as active. The VM was powered on, the port group was properly configured, and the dvSwitch had two vmnics connected to the outside. Everything looked fine, but there was no connectivity. I opened an SSH session to the ESXi server and listed the VMs running on the host with the command below.

esxcli vm process list

Figure 1.0 output of esxcli vm process list command
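On a host with many VMs, picking the World ID out of that listing by eye is error-prone. Here is a minimal sketch of a helper that does it, assuming each VM block in the listing has a `World ID:` line followed later by a `Display Name:` line. The function name `world_id_for` and the VM name `test-vm` are made up for illustration.

```shell
# Print the World ID of the VM whose display name matches $1.
# Reads the output of `esxcli vm process list` on stdin.
# Assumption: each VM block has "World ID: <n>" before "Display Name: <name>".
world_id_for() {
    awk -v name="$1" '
        $1 == "World" && $2 == "ID:" { wid = $3 }
        $1 == "Display" && $2 == "Name:" {
            sub(/^[ \t]*Display Name:[ \t]*/, "")   # keep only the name
            sub(/[ \t\r]+$/, "")                    # drop trailing whitespace
            if ($0 == name) print wid
        }
    '
}
```

On the host it would be used as `esxcli vm process list | world_id_for "test-vm"`.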

Note the World ID of the test VM from the output above. The following command then shows which uplink is actively serving that VM's traffic.

esxcli network vm port list -w <VM world ID>

E.g.: esxcli network vm port list -w 5375861
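The part of that output that matters here is the uplink each port is currently pinned to. Here is a small sketch that reduces the output to one `port uplink` pair per virtual NIC, assuming the fields are labelled `Port ID:` and `Team Uplink:`. The function name `vm_uplinks` is hypothetical.

```shell
# Print "<port id> <team uplink>" for every port in the listing.
# Reads the output of `esxcli network vm port list -w <world id>` on stdin.
# Assumption: each port block has "Port ID:" before "Team Uplink:".
vm_uplinks() {
    awk '
        $1 == "Port" && $2 == "ID:"     { port = $3 }
        $1 == "Team" && $2 == "Uplink:" { print port, $3 }
    '
}
```

On the host it would be used as `esxcli network vm port list -w 5375861 | vm_uplinks`. Running it before and after a power cycle or vMotion shows whether the loss of ping lines up with one particular vmnic.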

Next I thought about changing the active uplinks on the dvSwitch. I set vmnic0 as unused and left vmnic1 as active. Be aware that this can impact your production workload, so be very careful when making these changes, and keep a record of what you changed and on which port group so that you can revert it later.

After switching to vmnic1, the VM responded to ping again. That was the first clue to the intermittent connectivity: when both uplinks are active on a distributed or standard switch, a VM is automatically assigned one of them whenever it is powered on or migrated to a host. If only one of the uplinks passes traffic correctly, connectivity depends on which uplink the VM happens to get. To change the uplink order, select the port group under Networking in vCenter, right-click it and click Edit Settings. If you are using the vCenter HTML5 client, you will be taken to the page below.

In my case I set dvUplink2 (which maps to vmnic0 on the host) as unused. After that the environment was stable: every VM placed on the host kept responding to ping.

Starting the troubleshooting part with the network team.

I had been curious about the physical switch configuration from the start, so I asked the network team to show me how they had configured the upstream switches. Sure enough, they had created a port channel. In that case we must create a LAG on the dvSwitch and add our uplinks to it. Without a LAG, multiple active uplinks will not work reliably against a port channel. Note that LACP support is one of the improvements a distributed switch (DVS) offers over a standard switch (VSS). The VSS does not support LACP.

In vCenter, go to the dvSwitch, click the Configure tab and create a new LAG with the following settings.

Name: any name you prefer

Number of ports: picked automatically by the switch

Mode: check with the network team how the physical side is configured (Active or Passive)

Load balancing mode: select the same load balancing policy as the physical switch

Once the LAG has been created successfully, add your active uplinks to it. After that, everything should work as expected. As mentioned earlier, revert the changes you made for troubleshooting, and stop the SSH service on the ESXi server as well.
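Before closing the SSH session, the LAG can also be checked from the host itself. Here is a sketch that wraps the two `esxcli` LACP commands that seem most useful for this. The wrapper name `lacp_check` is hypothetical, and the commands only return useful data once a LAG exists on a distributed switch.

```shell
# Show the LACP configuration and the negotiated state of the LAG uplinks.
# Run in the ESXi shell; needs a dvSwitch with a LAG configured.
lacp_check() {
    echo "== LACP config =="
    esxcli network vswitch dvs vmware lacp config get
    echo "== LACP status =="
    esxcli network vswitch dvs vmware lacp status get
}
```

Call `lacp_check` on the host and compare the reported mode with what the network team configured on the port channel.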

Thank you for reading. Stay safe!!!

I am Pubudu Wijerathna who is the author of SystemsMedic blog.
