VMware Guests Lose Network Connectivity
Closed     Case # 10002     Affiliated Job:  New Trier Township District 2031
Opened:  Tuesday, December 15, 2009     Closed:  Tuesday, February 9, 2010
Case Type(s):  Server, Vendor Support, Network
Case Note(s):  All cases are posted for review purposes only. Any implementations should be performed at your own risk.

Problem:
Four Dell PowerEdge R710 servers running vSphere 4.0 U1 in a clustered environment, using the embedded 1 Gb quad-port Broadcom 5709 NICs as a single vSwitch NIC team connected directly to the Cisco core, with the switch ports trunked across VLANs. Shared EMC CX4-120 SAN storage is connected over dual Fibre Channel via Dell QLogic QLE2462 controllers joined by a Cisco MDS 9124 Fibre Channel switch running v4.1.3a.
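
For reference, this NIC team layout can be verified from the ESX 4.0 service console. A minimal sketch, with the vmnic numbers and vSwitch name assumed for illustration rather than taken from the case:

    # List the physical NICs and their drivers; the embedded Broadcom 5709
    # ports are backed by the bnx2 driver
    esxcfg-nics -l

    # List the vSwitch and its uplink team; the four Broadcom ports would
    # appear together as the uplinks of a single vSwitch, e.g. vSwitch0
    esxcfg-vswitch -l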

At random times the guests, and sometimes even the service console, lose network connectivity on the affected vmnic: they cannot be reached remotely across the network, and the guest console cannot send outbound traffic. If the service console is still accessible, the affected guests may be vMotioned off or the host may be placed into maintenance mode; as soon as a guest is moved to another host in the cluster, its network accessibility returns immediately. If the service console itself is on the affected vmnic, disabling the corresponding port on the Cisco core forces a failover to an alternate vmnic on the vSwitch, restoring connectivity to the service console and allowing the affected host to be placed into maintenance mode.
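
As a sketch of that workaround, assuming an IOS-based core switch and a hypothetical port number (the actual port feeding the affected vmnic would come from the switch's port documentation):

    ! Shut the core port feeding the affected vmnic so the vSwitch
    ! fails the service console over to a healthy uplink in the team
    configure terminal
     interface GigabitEthernet1/0/10
      shutdown
     end

Once the service console responds again, the host can be placed into maintenance mode and the guests evacuated normally.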

Action(s) Performed:
Total Action(s): 1

Action #: 10033
Recorded Date: 2/9/2010 12:20:36 PM
Type: Vendor Support
User: contact@danieljchu.com
Details: Upgrade to latest 1/5 VMware releases using VM Update Manager. Also update ...
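
For reference, the running ESX build can be confirmed from the service console before and after patching via Update Manager; a minimal sketch:

    # Report the ESX version and build number
    vmware -v

    # List the patch bulletins installed on an ESX 4.x classic host
    esxupdate query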

Resolution:
After discussions with VMware Support, multiple VMware log transfers, an EMC SAN review with SPCollects and emcgrab.sh log transfers, and overall system reviews by Dell, VMware came to the conclusion that the Broadcom NICs were at fault in all four of our servers and suggested a motherboard replacement by Dell. As we began to follow up with Dell on this, VMware got back to us with a change of diagnosis. Apparently the U1 release of vSphere 4.0 introduced a problem, acknowledged worldwide, affecting many customers with Dell PowerEdge R710 & R900 servers equipped with the embedded Broadcom 5709 quad-port NICs.
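
For context, the VMware log transfers mentioned here are typically produced with the vm-support tool on each host; a minimal sketch:

    # Generate a diagnostic bundle (vmkernel logs, host config, etc.)
    # for upload to VMware Support; output is a compressed .tgz archive
    vm-support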

"Thank you for your Support Request.

I involved VMware escalation engineer into this SR and we found that same issue is already faced at different customers worldwide with DELL PowerEdge servers which has 4 port Broadcom LOM. We found that the issue happens after putting continuous stress on the bnx2 NIC for sometime which is also true in your environment. VMware engineering has already worked the solution with the product vendors and it is expected to be released with ESX 4.0 Update 2.

Please let me know as how do you want us to proceed on this as of now as U2 is due for release sometime in the end of second quarter 2010 (Dates are tentative and may change if needed).

Looking forward to hearing from you."

10/15/2010 Update: Not long after we filed these complaints with VMware, EMC & Dell, Dell provided us with 4 Intel quad-port gigabit NICs; these have operated perfectly, further isolating the issue to the embedded Broadcom NICs. U2 was released to us in beta; however, we were told that by installing the U2 beta we would have to rebuild the server in order to deploy the final release, so we decided to continue using the Intel NICs until the official release. As of last week we are now operating 4.1; however, we have not reverted to the Broadcom NICs, simply because we have these Intel NICs and don't feel any urgency to migrate back. But we are told that these issues have been resolved as of U2.
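
A hedged sketch of that uplink swap, with the vmnic numbers and vSwitch name assumed for illustration (on these hosts the Broadcom ports report the bnx2 driver, while the Intel ports report an Intel driver such as igb):

    # Confirm which driver backs a given NIC
    ethtool -i vmnic0     # driver: bnx2 -> embedded Broadcom 5709 port
    ethtool -i vmnic4     # driver: igb  -> add-in Intel quad-port NIC

    # Add the Intel uplinks to the team, then remove the Broadcom uplinks
    esxcfg-vswitch -L vmnic4 vSwitch0
    esxcfg-vswitch -U vmnic0 vSwitch0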


