Posted by CCNP/CCDP · 3 months ago

ASA 5585-X Poor throughput and sub-intf packet loss

I have a weird (and very frustrating) issue I'm troubleshooting and I'm hoping someone can point me in the right direction...

   

BRIEF OVERVIEW  

I have a cluster of two ASA 5585-X SSP-60 firewalls in transparent mode. Upstream are two Nexus 7Ks. As you may know, in cluster mode the upstream N7Ks see the ASAs as a single unit. The N7Ks have a vPC that spans the ASA cluster. Each N7K has two 10GE links to each ASA in the vPC, providing 40Gbps to each ASA. It’s a bit overkill IMO, but it is what it is.
  Each ASA should be able to handle between 20Gbps and 40Gbps, depending on traffic type/protocols/etc.
  The firewalls have several firewalled networks (e.g. 10.63.100.0/24, 10.63.253.0/24, etc). The firewalled networks are separated via subinterfaces (e.g. Network 10.63.100.0/24 is on Po10.14, Network 10.63.253.0/24 is on Po10.15, etc).
  And as you may know, in transparent mode each firewalled network gets two subinterfaces (one as the inside interface and one as the outside interface), two VLANs (one for the inside interface and one for the outside), one bridge-group to bridge the two VLANs, and one BVI interface for the bridge-group.
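
For reference, here's roughly what one firewalled network looks like in the config. The inside subinterface matches my Po10.14 example above; the outside subinterface number, VLAN IDs, names, and BVI address are illustrative:

interface Port-channel10.14
 vlan 14
 nameif net100-inside
 bridge-group 14
 security-level 100
!
interface Port-channel10.140
 vlan 140
 nameif net100-outside
 bridge-group 14
 security-level 0
!
interface BVI14
 ip address 10.63.100.5 255.255.255.0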

   

ISSUE  

I’m seeing no more than 2Gbps of throughput when going from either a non-firewalled network to a firewalled network, or from a firewalled network to another firewalled network. So if a host on a firewalled network (e.g. 10.63.100.x) does a file transfer with a host on another firewalled network (e.g. 10.63.253.x), they see no more than 2Gbps. The hosts are VMs living on Cisco UCS chassis. The chassis connect to Fabric Interconnects (FIs), the FIs to Nexus 5Ks, the N5Ks to N7Ks, and the N7Ks to the ASAs. I did the throughput tests using iperf (with a 50M window) and FTP file transfers. Running the same tests from non-firewalled hosts to non-firewalled hosts over the same switching infrastructure, I get 9Gbps (the maximum limit of the virtual servers I was testing). This tells me the issue isn't on the UCS, FI, N5K, or N7K, but instead on the ASAs.
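
In case it matters, the iperf tests looked roughly like this (flags from memory; 10.63.253.x stands in for the receiving VM):

# on the receiving host
iperf -s -w 50M
# on the sending host, single flow
iperf -c 10.63.253.x -w 50M -t 30 -i 1
# parallel flows, to check whether the limit is per-flow
iperf -c 10.63.253.x -w 50M -t 30 -P 4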

   

ANALYSIS  

I see two things that could be causing this issue.  

  • I see incrementing packet drops on the pertinent port-channel subinterfaces that I’m transferring data through on the ASA. However, I do NOT see errors/drops on the physical interfaces (the port-channel doesn’t show me counters).  

  • Using the command “show asp drop,” I see "Dispatch queue tail drops (dispatch-queue-limit)" increase while I transfer the traffic, but the Google results are vague.

My findings suggest the packet loss (or poor performance) is software related within the ASA and NOT a layer 1 issue…but I don’t know how to troubleshoot this further.
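
For anyone following along, this is roughly how I've been watching the counters (the subinterface name is from my setup; the include filters are approximate):

ciscoasa# clear asp drop
! ...run the transfer, then:
ciscoasa# show asp drop | include dispatch
ciscoasa# show interface Port-channel10.14 | include error|drop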

   

QUESTIONS:  

  1. Have any of you seen something similar?

  2. How can subinterfaces incur errors, but not the physical interfaces?

   

COUNTERS:  

I've posted the output of "show interface" for one of the subinterfaces and the physical links here.

level 1
CCNA · 5 points · 3 months ago

Are there any overruns in "show interface detail" for the internal data interfaces?

I remember from a few Cisco Live sessions that flows can get pinned to a CPU and limit throughput.

https://www.fir3net.com/Firewalls/Cisco/cisco-asa-5585x.html
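
If it helps, something like this pulls just the relevant counters in one shot (filter strings are approximate):

ciscoasa# show interface detail | include Interface|overrun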

level 2
CCIE Security, CCIE Voice Candidate · 5 points · 3 months ago

This is the right answer. Only one CPU core can work on a flow at a time. Other cores that receive the packets have to wait for a lock.

Single-flow performance is somewhere around 2Gbps. If you need to break that barrier for single flows, then the flow offload feature on the 4100/9300 is the business.
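
For reference, static flow offload on those platforms looks roughly like this (ACL/class names are made up, and IIRC a reload is required after flow-offload enable):

flow-offload enable
!
access-list offload extended permit tcp 10.63.100.0 255.255.255.0 10.63.253.0 255.255.255.0
class-map offload-class
 match access-list offload
policy-map global_policy
 class offload-class
  set connection advanced-options flow-offload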

level 3
CCNP/CCDP · Original Poster · 2 points · 3 months ago

I was afraid of that... A Cisco TAC engineer casually mentioned that flow limit, but we didn't believe him. Do you remember where you read it?

Thanks!

level 4
CCIE Security, CCIE Voice Candidate · 4 points · 3 months ago

Well, mask off here, I’m also a Cisco support engineer. But ASA architecture and performance is my specialty 😬. There’s a little note buried in here:

https://www.cisco.com/c/en/us/products/collateral/security/asa-5585-x-adaptive-security-appliance/whitepaper_C11-731802.html

“To preserve the packet order and help ensure accurate state checking, each stateful flow can be processed by only one CPU core at any given time.”

To be fair, I think the actual single flow performance will vary across models and generations of ASA based on a few things:

  • CPU architecture (more modern and more efficient architectures get the packet out quicker, so less waiting for the lock)

  • CPU clock speed (given the same CPU architecture)

  • number of overall cores/threads (fewer means fewer cores contending for the same resource)

level 5
CCNA · 1 point · 3 months ago

Out of curiosity, does enabling asp per-packet load balancing help with the performance of individual flows, or does it more just help mitigate the overruns?

From this Cisco Live session: https://clnv.s3.amazonaws.com/2015/usa/pdf/BRKSEC-3021.pdf (slide 41).

Never did get a clear answer on that when we were running into drop issues between N7Ks and our 5585-Xs.

level 6
CCIE Security, CCIE Voice Candidate · 1 point · 3 months ago

It can help a little bit, but I think it’s barely noticeable in practice. Normally, CPU cores have an affinity for a smaller set of the RX rings. This makes sense because packets for the same flow get hashed to the same RX ring. So if you have a smaller set of cores that consistently pull packets from them, you should have less contention for the lock.

Per-packet load balancing makes it so this doesn’t happen. The cores are then optimized to empty the RX rings wherever there are packets. So in the case of a single flow, you can have more of the cores handling packets for the same flow. They still have to wait for a lock, but it can help if your drops are because of overruns on the RX ring the flow gets hashed to. It doesn’t overcome the lock/contention problem, though.

level 6
CCNP/CCDP · Original Poster · 1 point · 3 months ago

I enabled the "asp load-balance per-packet-auto" command last night and did not see an improvement.

level 7
CCIE Security, CCIE Voice Candidate · 1 point · 3 months ago

Auto means we switch to per-packet load balancing under certain conditions. I pulled the spec and it’s.....complicated. Basically, the intent of auto is to default to the style where cores have an affinity or binding to the same RX rings. But if there’s a transient condition where one RX ring is getting smashed and overrun, why not have other cores come in and help pull packets? The auto feature tries to detect this and minimize the overruns. It doesn’t solve any of the problems with having to lock the structure in DRAM for the conn/flow (in fact, asp load-balance per-packet can make the locking/waiting problem worse, so you can see a rise in CPU usage with no associated performance increase).

level 5
CCNP/CCDP · Original Poster · 1 point · 3 months ago

Awww, I was afraid of that :-(. I received a response from our TAC engineer and he said the same thing. Thanks for letting me know.

level 2
CCNP/CCDP · Original Poster · 1 point · 3 months ago

Yep! Most have overruns in the hundreds, whereas Internal-Data0/3 has them in the thousands. See the output here.

Since posting this, I have incurred some overruns on two of the physical data interfaces. The overrun count matches that for input errors.

I read something like that too, and that's what prompted me to enable flowcontrol on the physical interfaces and on the upstream N7K. That didn't solve my problem, but it decreased my overruns. I'm seeing pause frame counters now. I'm still not sure if I should just disable flowcontrol, considering it didn't help.
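
For reference, the flowcontrol config was roughly this (interface numbers are illustrative):

! ASA side, per 10GE member interface
interface TenGigabitEthernet0/8
 flowcontrol send on
! N7K side, honor pause frames from the ASA
interface Ethernet1/1
 flowcontrol receive on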

level 1
2 points · 3 months ago · edited 3 months ago

I’ve not heard of this issue, but the CPU pinning sounds plausible.

I just wanted to address your assumptions about throughput, however. Per ASA, you actually probably have 2x10G links for the data plane (spanned EtherChannel) and 2x10G links for the cluster control link. There are three vPCs on the 7Ks: one 20G for cluster control link #1, one 20G for CCL #2, and one 40G for the shared data plane across the two ASAs.

Each ASA can do about 20Gbps of multiprotocol throughput, but in a cluster you add them all together and multiply by 70%, giving 28G available for the cluster.

Now, if you actually have 4x10G data links per ASA and the cluster control link (CCL) is using the remaining 1G interfaces on the SSP-60, this could be your problem. The CCL NEEDS to match the data links - this ain’t an HA failover/state interface... here’s why: one of the ASAs ‘owns’ a particular flow. Because the ASA cannot control how the N7K hashes the traffic across the vPCs, a flow managed by ASA#1 may actually arrive on ASA#2. At that point, ASA#2 redirects the flow to ASA#1 via the CCL. Now you see why the CCL bandwidth should match the data-plane bandwidth.

So dive into the 7K and ASA cluster configs and see if the cluster is set up with over-provisioned spanned EtherChannel links and under-provisioned CCLs. A few outputs worth comparing are below.
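
Concretely, I'd compare these to confirm which links carry the CCL vs. the spanned data EtherChannel (prompts are placeholders):

ciscoasa# show cluster info
ciscoasa# show port-channel summary
N7K# show vpc
N7K# show port-channel summary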

Edit: looking at the pastebin, I see ints Te2/x. Does this mean you have a FirePOWER services SSP and are using those interfaces in addition to the 4x10G on the ASA SSP-60?

level 1

I am sorry I cannot provide much help, but FWIW I have a similar setup with Nexus 9Ks vPC’d to ASA 5555s. The difference is we are set up L3 routed, not transparent, and all my links are 1G; otherwise it’s similar. We have UCS inside running VMware.

I see the exact same things: tail drops on the ASA ingress at much lower throughput (a few hundred meg) than I would expect. So it suggests the ASA cannot dequeue/process the incoming traffic fast enough. No link errors or anything like that, and no suggestion of problems on the switching.

Until now I haven’t raised a TAC case; a colleague was supposed to be looking at it, but he recently left the job, so it’s just been landed on me. He was suggesting the ASA was basically at capacity and we should use Ethernet pause frames to try to affect the send rate at the far end. That sounds like an awful solution to me, though.

Interested to hear if you find a solution. My instinct says the FW does a lot in software, not silicon, so throughput is probably highly dependent on what features are enabled, the size of the rule sets, etc. Like I say, I’ve literally just been handed management of the FW layer, so I’m not well briefed on this.

level 2
CCNP/CCDP · Original Poster · 2 points · 3 months ago

Thanks! At least I know I'm not alone :-).

 

I enabled pause frames earlier on in my troubleshooting, but that neither helped with performance nor with stopping overruns.

 

It definitely looks like the ASA is all software-driven... and although mine has 10GE links, it actually processes much, much less than that.

level 3
1 point · 3 months ago · edited 3 months ago

Yeah that makes sense.

At one extreme, every packet can be part of the same flow, for example a single HTTP download; at the other, every packet can be for a different destination from a different source. There must be a different impact on the ASA CPU in these two cases, there simply has to be.

As I say, this landed on me last week. If I get any other meaningful info, from TAC or otherwise, I'll try to pass it on.

level 1

What about the command: "asp load-balance"

As previously mentioned, the cores lock to a ring until the ring is empty. To change this behavior, the command asp load-balance is used.

It is recommended to change the default ASP load balancing behavior when any of the following is observed:

  • uneven usage across the cores

  • overruns seen on the RX rings even though the CPU usage appears low (typically seen with traffic microbursts)

There are two options when configuring ASP load balancing:

Per-Packet - Each core (i.e. DATA_PATH thread) locks to the ring on a per-packet basis.

asp load-balance per-packet

Per-Packet-Auto - Introduced in 9.3(1); allows the ASA to adaptively switch asp load-balance per-packet on and off. Further details around this command can be found here.

asp load-balance per-packet-auto
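
If you try it, both are entered in global configuration mode, and (if memory serves) you can watch how evenly the cores are pulling from the rings with:

show asp load-balance detail
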
level 2
CCNP/CCDP · Original Poster · 1 point · 3 months ago

Yep, tried it... but it didn't make a difference :-(. My TAC engineer responded and also said that what I'm seeing is "by design." Only one CPU core can work on a flow, and it's limited to about 2Gbps.

level 3

This is a very interesting thread. Thanks for sharing.

Just curious. Any idea how you will overcome the limitation?

level 4
CCNP/CCDP · Original Poster · 1 point · 3 months ago

I think the only thing we can do is move the servers that need >2Gbps throughput OFF the firewall. Thankfully, they're not externally-facing systems. We're talking with our Cisco rep to see if we have any other options.

Apparently some of the new high-end FirePOWER firewalls overcome this limitation... but we're quite a ways from replacing these firewalls at this time.

level 1
Packet Ninja · 1 point · 3 months ago

There are Ethernet pause frames in your output, which would point at microbursting.

level 1

There is a bug with nearly the exact symptoms you describe. I would open a TAC case for this immediately.
