[–]noukthx 1 point2 points  (6 children)

This kinda feels like bad L2 load balancing down one of the port-channel links.

Had similar problems with only specific machines being unavailable, eventually traced to a bad SFP in an 8x10G downlink group to a FEX (using Nexus switching).

You may need to drop links one by one to see which one causes the problem to go away.

https://www.cisco.com/c/en/us/support/docs/lan-switching/etherchannel/12023-4.html#topic1

This may help you find which link in a port channel your traffic is traversing. But it's likely easiest to briefly shut down each link one by one to hunt out the problem interface.

Edit: Given that Switch D is routing, I'd start with the port channels between C-D and D-E. The MAC addresses change when the packets traverse L3, so it'll be on one or other side of the switch doing the inter-VLAN routing.
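
To illustrate why only certain host pairs might land on a bad member link, here is a minimal sketch of a MAC-based hash. It assumes a simplified XOR-and-modulo scheme; real platforms use their own hash inputs and algorithms, so treat `member_link` and every address below as hypothetical.

```python
def member_link(src_mac: str, dst_mac: str, n_links: int) -> int:
    """Pick a member-link index from the XOR of the two MAC addresses.

    Simplified model only: actual switches may hash IPs and ports as well.
    """
    src = int(src_mac.replace(":", ""), 16)
    dst = int(dst_mac.replace(":", ""), 16)
    return (src ^ dst) % n_links


# Hypothetical addresses, invented for illustration.
HOST_A = "00:11:22:33:44:01"
HOST_E = "00:11:22:33:44:02"
SVI_D = "00:aa:bb:cc:dd:07"

for name, mac in (("A", HOST_A), ("E", HOST_E)):
    link = member_link(mac, SVI_D, 8)
    print(f"host {name} <-> SVI on D hashes to member link {link}")
```

In this model the result depends only on the address pair, so a given pair of hosts always uses the same member link, and a single faulty link would affect only the pairs that hash onto it.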

[–][deleted]  (5 children)

[deleted]

    [–]noukthx 2 points3 points  (4 children)

    Possibly... the load balancing decisions depend on how it's configured and/or the platform it's on.

    But it's generally a combination of SRC/DST MAC, SRC/DST IP, and sometimes SRC/DST protocol/port. It may be that the mathematical operation only hits the dodgy link for certain host and MAC combos, and you might not have enough gear in the subnets to hit it on more than one host, just out of pure luck.

    If you've got graphing, see if one of the port-channel member interfaces is noticeably less utilised than the others (long shot).

    [–][deleted]  (3 children)

    [deleted]

      [–]noukthx 1 point2 points  (2 children)

      Could try temporarily changing the MAC address of host A or host E or host C and seeing what happens.

      But yeah, toggling port channel members or taking packet captures at various test points is probably your best bet.

      [–][deleted]  (1 child)

      [deleted]

        [–]Apachez 0 points1 point  (0 children)

        Perhaps a MAC address collision then?

        Some years ago Asus (or was it Abit?) accidentally put the same MAC address on a series of motherboards they made, which of course screws things up if these motherboards show up on the same L2 network.

        So verify that too while you are at it.
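
        One rough way to check for this is to collect (IP, MAC) pairs from an ARP table and look for a MAC that shows up under more than one IP. A minimal sketch follows; the function name and the sample entries are hypothetical, and a host with several IPs on one NIC would legitimately show up too.

```python
from collections import defaultdict


def find_duplicate_macs(entries):
    """Return {mac: [ip, ...]} for every MAC seen with more than one IP."""
    by_mac = defaultdict(set)
    for ip, mac in entries:
        by_mac[mac.lower()].add(ip)
    return {mac: sorted(ips) for mac, ips in by_mac.items() if len(ips) > 1}


# Hypothetical (ip, mac) pairs, e.g. copied by hand from an ARP table.
arp_entries = [
    ("10.0.1.10", "00:11:22:33:44:01"),
    ("10.0.2.20", "00:11:22:33:44:02"),
    ("10.0.3.30", "00:11:22:33:44:01"),
]

print(find_duplicate_macs(arp_entries))
```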

        Also, when you have host <-> switch <-> router, it's not uncommon that you need to wait for the ARP timeout before a switched host will start to work (or log in to the router and manually clear the ARP entry). Back in the day, ARP entries (in routers/L3 switches) were cached for, ehm, 30 minutes or so, while today many vendors default to below 5 minutes.

        There can also be "sticky" MAC address filters at the access layer, especially if you use port security and/or have dhcp-relay/dhcp-snooping on the access switches. This creates an ACL which isn't clearly visible through "show config".

        [–]vnetman (Freelance Network Coder) 0 points1 point  (0 children)

        Isolate the direction where the packet loss is happening, i.e. is it that ICMP echo requests are reaching C but the Replies aren't getting back to A, or is the packet loss in the other direction?

        One way to do it is to run packet captures on C. Another way is to enable "debug ip icmp" on each of the network switches and routers (one at a time), and have C ping that device, and see whether the requests are printed. If you're going to do this, though, make sure the network device is not getting any ICMP traffic (directed to it) from anywhere else, otherwise the debug logs will kill the device.

        [–]Apachez 0 points1 point  (0 children)

        Why are you doing HSRP with a single box?

        This looks like a classic troubleshooting case where you need to verify each hop; use port-mirroring/tcpdump if you are uncertain.

        So if A/E cannot reach C (and C cannot reach A/E) then start with the obvious: do you even see the packets at each end (just to rule out any local firewalls)?

        If not, start to dig into where the packets go missing.

        You will normally start with the wrong side (Murphy's law or something) if you have two to choose from, so let's start with A/E :-)

        Does the local ip config look correct on A/E?

        Use ifconfig or ipconfig /all, depending on OS, along with netstat -rn to verify the routing table. You should see the SVI as default gateway.

        Do the above also on the C host.
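
        Part of that check is confirming that the default gateway actually sits inside the host's own subnet. A minimal sketch using Python's standard ipaddress module is below; the function name and addresses are hypothetical, so substitute whatever ifconfig/ipconfig reports.

```python
import ipaddress


def gateway_on_link(host_cidr: str, gateway: str) -> bool:
    """Return True if the gateway address lies inside the host's subnet."""
    iface = ipaddress.ip_interface(host_cidr)
    return ipaddress.ip_address(gateway) in iface.network


# Hypothetical values for illustration.
print(gateway_on_link("10.0.1.10/24", "10.0.1.1"))
print(gateway_on_link("10.0.1.10/24", "10.0.2.1"))
```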

        Next, try to ping the default gateway. Does an ARP entry for the default gateway IP show up on each box (arp -a, arp -an, or arp -n)?

        If the above worked, we can log in to switch D and verify in its ARP table that each IP matches the MAC address of each host.

        Now if A/E can reach switch D and host C can reach switch D, then we have something in switch D that's bogus.

        If it's an L3 switch, it's most likely that you forgot to enable routing. Usually something like "ip routing" enables it in the CLI (read the manual). Some vendors want a reboot before that kicks in, so try that as well.

        If it still doesn't work and you don't see the packets arriving (through tcpdump, or better yet port-mirroring into a box running tcpdump), then you should check how ACLs are configured on all switches and routers along the path, and then use port-mirroring at each physical hop to locate where the packets vanish...