all 78 comments

[–]kapache33 30 points31 points  (27 children)

Linux and Unix have been part of networking since the very beginning. Juniper, the best hardware/software vendor in my book, runs off BSD, and Cisco runs its own Linux-based kernels on many types of equipment. The integration has been around for years.

[–]qupada42 19 points20 points  (21 children)

The interesting thing is how much or how little some of them hide it.

Arista and Juniper have taken the "sure, here's the OS shell, knock yourself out" point of view, which gives the network admin some seriously powerful debugging tools.

In Arista's case, this goes to the extent of "yep, it's Fedora under here, install yourself some RPMs if you like", which gives you options for monitoring, programmability and all sorts of general mischief.

For others, you really have to dig into some obscure diagnostic commands to even find reference to the fact that there's the Linux kernel buried somewhere under there.

[–]scritty 3 points4 points  (0 children)

I've been heavily enjoying Arista's approach, especially their ZTP process. I've got custom bootstrap scripts fetching data from /etc/prefdl, making decisions on tweaking interfaces, hitting callback URLs on ansible tower to kick off jobs against the switch in question. It's super neat.
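
For illustration, here's a minimal Python sketch of what such a bootstrap might look like. The prefdl field names, the SKU value, and the Tower URL are all invented placeholders; a real script would add error handling and authentication:

```python
# Hypothetical ZTP bootstrap helper: parse an /etc/prefdl-style file of
# "Key: value" lines and notify an automation server about this switch.
import json
import urllib.request

def parse_prefdl(text):
    """Parse 'Key: value' lines into a dict, ignoring malformed lines."""
    info = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
    return info

def notify_tower(info, url="https://tower.example.com/api/v2/job_templates/42/launch/"):
    """POST the switch identity to a (hypothetical) callback URL."""
    data = json.dumps({"extra_vars": info}).encode()
    req = urllib.request.Request(url, data=data,
                                 headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)

sample = "SerialNumber: SSJ0000000\nSKU: DCS-7280SR-48C6\n"
print(parse_prefdl(sample)["SKU"])   # → DCS-7280SR-48C6
```

The actual decisions (tweaking interfaces, choosing which job to kick off) would hang off the parsed fields.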

[–][deleted]  (14 children)


    [–]HonkeyTalk 6 points7 points  (13 children)

    You're looking at software as being the difference. The difference is in the hardware. Networking gear has ASICs, so the performance will be great no matter what, as long as the drivers are done right. Commodity server hardware can't really compete.

    As for the software, whether the networking code is in the kernel or not makes very little difference, IMO. The performance and functionality is in the hardware.

    [–]anothersackofmeatI break things, professionally. 3 points4 points  (0 children)

    I’ll agree that the performance is in the hardware, sure. But the functionality is just as much a part of the software stack as it is the hardware’s capabilities. For example, you can have hardware that can do single pass VXLAN encap/decap, but it really isn’t worth anything without a control plane that understands how to make a table of VTEPs and destinations and then actually program that into the hardware.

    I think OP has a good question regarding what's more valuable: native support inside the Linux kernel versus proprietary services. At the end of the day, though, I think we are going to see proprietary services as king for quite a while longer. Even Cumulus, for all of their contributions to the kernel and FRRouting/Quagga, still hasn't open sourced switchd, which is the piece that makes that ASIC do what it needs to do.
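
The VXLAN point above can be made concrete with a toy model of the control plane's job: learn which remote VTEP owns each MAC, then render that table into forwarding entries. The command strings mimic Linux `bridge fdb` syntax as a stand-in for real ASIC programming, and every name and address here is invented:

```python
# Toy VXLAN control-plane model: a table of MAC -> remote VTEP mappings,
# rendered into fdb-style entries that a real driver would push to hardware.

class VtepTable:
    def __init__(self, vni):
        self.vni = vni
        self.entries = {}              # MAC address -> remote VTEP IP

    def learn(self, mac, vtep_ip):
        """Record which remote VTEP owns this MAC."""
        self.entries[mac] = vtep_ip

    def render(self, dev="vxlan100"):
        """Emit one fdb-style forwarding entry per learned MAC."""
        return [
            f"bridge fdb add {mac} dev {dev} dst {ip} vni {self.vni}"
            for mac, ip in sorted(self.entries.items())
        ]

table = VtepTable(vni=100)
table.learn("52:54:00:aa:bb:01", "10.0.0.2")
table.learn("52:54:00:aa:bb:02", "10.0.0.3")
for cmd in table.render():
    print(cmd)
```

The hardware is only useful once something like this table actually gets programmed into it, which is the point being made above.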

    [–]amaged73[S] -2 points-1 points  (11 children)

    A long time ago, you needed expensive hardware to run small things and now we don't, isn't that right? We need GPUs for ML today, but is that going to continue?

    [–]HonkeyTalk 7 points8 points  (7 children)

    Yes. High performance will always require specialized hardware.

    [–]kevlarcupid 2 points3 points  (2 children)

    The floor and ceiling of “high performance” increase with time. What was top-of-the-line performance in the 60s has been available in calculators since the 90s, to say nothing of the incredible computing power we all carry in our pockets these days.

    [–]HonkeyTalk 0 points1 point  (1 child)

    Yet there's still a demand for the latest iteration of infrastructure, whether it's networking equipment, servers, GPUs, etc.

    Technology is iterative. All of it. That means newer iterations of one type of technology will increase demand for other types of technology. The "power we all carry in our pockets", for example, massively increases demand for bandwidth and higher performance networking equipment.

    [–]kevlarcupid 1 point2 points  (0 children)

    Yep, exactly. It's not like there's some maximum we'll reach, either. As we develop new technology to meet current demands, we unlock pent-up demand for the more capable technology, which drives more demand. It's a self-perpetuating machine.

    [–]MertsA 0 points1 point  (3 children)

    There is a counterpoint to this. Network hardware isn't ever going to look like general purpose compute hardware but that doesn't mean that it won't ever be something more programmable. This is the same problem as GPUs all over again. At first they did absolutely nothing other than graphics and now GPUs are completely programmable via CUDA and OpenCL. I'm sure networking hardware will eventually go the same route with products like Tofino.

    [–]HonkeyTalk -1 points0 points  (2 children)

    In some ways yes, however ASICs are not programmable.

    [–]MertsA 0 points1 point  (1 child)

    Tofino is fully programmable and supposedly already shipping. It runs at 6.5 Tbps in a single ASIC.

    [–]kijikiCumulus co-founder 1 point2 points  (0 children)

    Tofino is very impressive and way more programmable than a Broadcom or Mellanox ASIC, but it is still far less flexible than a CPU or NPU. That said, it greatly increases what is possible on a line-rate ASIC.

    A good way to think about Tofino is that it is like an FPGA, but with high-level parse, lookup, and modification blocks instead of low-level LUTs.

    [–]jackoman03CCNA 4 points5 points  (2 children)

    You don't need expensive hardware for it to work initially, but you do need expensive hardware if you want it to work with 20,000 hosts and multi-gigabit throughput. Software will never be able to keep up with Hardware. I'm sure that's written in the Bible somewhere.

    [–]amaged73[S] 2 points3 points  (1 child)

    Distributed architectures aren't useful for these kinds of problems?

    [–]My-RFC1918-Dont-LieSystems Engineer 2 points3 points  (0 children)

    It depends on whether the problem is better solved with distributed computing using commodity hardware, or centralized expensive purpose-built hardware. There are a lot of advantages to distributed computing using commodity hardware, but there are also additional complexities with distributed computing.

    In short: it depends

    [–]packet_whispererProbably drunk | Moderator -1 points0 points  (3 children)

    And now Cisco's like, here, have an SDK, run python apps natively on the box. Oh and if you want a Bash shell, would you like the native shell or a disposable guest shell?

    [–]scritty 1 point2 points  (2 children)

    Can you install whatever libraries you want?

    [–]packet_whispererProbably drunk | Moderator 1 point2 points  (1 child)

    I believe so, or you can bundle the library with your app.

    [–]scritty 0 points1 point  (0 children)

    That's a good change - I've been investing a lot of time in creating 'network device native' Python scripts to enhance certain operational processes, so it's good to hear Cisco's coming along for that ride.

    [–]drdabbles 2 points3 points  (1 child)

    Juniper's base OS is now Linux, and their REs are BSD virtual machines running in QEMU.

    [–]Torgen_ChickenvaldIt places the packet on the wire or else it gets the hose again. 0 points1 point  (0 children)

    I think the EX4600 was the first of their gear to run in that configuration.

    [–]amaged73[S] 4 points5 points  (0 children)

    I could have been more specific: "networking in Linux" for me is when you have networking applications in Linux using kernel data structures and constructs, together with the Linux toolset and utilities.

    [–]wombleh 0 points1 point  (0 children)

    Indeed quite a lot of networking products run Linux under the hood.

    Makes sense: if you need an OS, there's a ready-made and easily modifiable one. You'd need a very strong reason to use or develop something else.

    [–]amaged73[S] -1 points0 points  (0 children)

    I'm not talking about the common "networking in Linux" approach of hiding the Linux kernel behind a wrapper, but real stuff that is done in the kernel, like these folks: https://cumulusnetworks.com/blog/vrf-for-linux/

    [–]kWV0XhdO 9 points10 points  (13 children)

    Ivan Pepelnjak covered the question of using the Linux kernel as the control plane vs. using Linux as the OS running a user space control plane a few months ago:


    It was a good episode, featuring some of the developers who extended Linux to make the use case truly workable.

    I'm convinced. Linux control plane all the things!

    [–]amaged73[S] 0 points1 point  (12 children)

    How about the data plane?

    [–]kWV0XhdO 7 points8 points  (5 children)

    What about it? Putting the Linux kernel in the forwarding path implies not using hardware acceleration, so I'll pass on that, I guess.

    [–]amaged73[S] 0 points1 point  (4 children)

    That's exactly what I'm interested in. Can we do forwarding efficiently in the Linux kernel, with hardware acceleration? Is it a matter of time before these closed APIs from Broadcom and Cavium open up?

    [–]kWV0XhdO 11 points12 points  (0 children)

    Can we do forwarding efficiently in the Linux kernel, with hardware acceleration?

    "forwarding in the Linux kernel" and "with hardware acceleration" are mutually exclusive in my book.

    We may have a misunderstanding about terms here.

    I'd describe Cumulus as having an external, hardware accelerated data plane managed by the Linux kernel.

    A frame/packet transiting the front panel of a Cumulus switch never touches the kernel and, other than pulling stats from the underlying hardware, the kernel will never know that packet went by.

    [–]djdawsonCCIE #1937 2 points3 points  (0 children)

    You may find this post informative. There's apparently a lot of effort being put into high performance packet forwarding on generic hardware, and they've got some very impressive results.

    [–]MertsA 1 point2 points  (1 child)

    You can do something that feels pretty close to the same thing. Cavium has for years maintained a Linux kernel module that transparently offloads most forwarding traffic to their ASIC, and configuration is just bog-standard Linux networking. You set up your firewall rules in iptables, and behind the scenes that kernel module translates them into an equivalent configuration in hardware. Any fancy features that the hardware doesn't support, like QoS, just get forwarded to the kernel and processed as usual. Eventually I'm sure the community will standardize on an API that can be supported by the kernel natively.

    [–]OverOnTheRock 0 points1 point  (0 children)

    The Mellanox guys have upstreamed SwitchDev into the kernel. This allows a vendor neutral mechanism to sync the fib between the kernel and the hardware.
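
Conceptually, that sync works like the toy sketch below, where the "ASIC" is just a dict and the event shapes are invented; real switchdev drivers are C code subscribing to the kernel's FIB notifier chains:

```python
# Toy sketch of the switchdev idea: the kernel FIB is the source of truth,
# and a driver mirrors route add/del events into the hardware's table.

class FakeAsic:
    """Stand-in for a hardware forwarding table."""
    def __init__(self):
        self.fib = {}                      # prefix -> nexthop

    def program_route(self, prefix, nexthop):
        self.fib[prefix] = nexthop

    def delete_route(self, prefix):
        self.fib.pop(prefix, None)

def sync_event(asic, event):
    """Apply one kernel FIB notification to the hardware table."""
    if event["op"] == "add":
        asic.program_route(event["prefix"], event["nexthop"])
    elif event["op"] == "del":
        asic.delete_route(event["prefix"])

asic = FakeAsic()
sync_event(asic, {"op": "add", "prefix": "10.1.0.0/16", "nexthop": "192.0.2.1"})
sync_event(asic, {"op": "add", "prefix": "10.2.0.0/16", "nexthop": "192.0.2.2"})
sync_event(asic, {"op": "del", "prefix": "10.1.0.0/16"})
print(asic.fib)   # → {'10.2.0.0/16': '192.0.2.2'}
```

The vendor-neutral part is exactly this shape: any driver that can consume the kernel's notifications can keep its hardware in sync.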

    [–]jiannone 1 point2 points  (5 children)

    To what extent? Hardware computes in the dataplane, not software. Obviously the hooks between planes require software, but if you care about scale and performance, forwarding is a hardware task. Current ASICs do line-rate processing of ACLs and L4 DPI. Sure, there are some insane instances of high-scale throughput, but they're feature-poor or built for lab environments.

    [–]amaged73[S] 0 points1 point  (4 children)

    What are the architectural and design pros and cons of a userland process like VPP versus the way things are done in Pica8 and Cumulus with Mellanox and Broadcom? Which one performs better and has decent feature velocity? Has anyone compared the two?

    [–]JRHelgeson 1 point2 points  (2 children)

    If you use the general processor for packet handling, what happens when something causes the processor to stop and ponder its life? An ASIC would drop the packet, but a processor can get consumed by other tasks unrelated to forwarding packets. This is why packet forwarding is handled on the ASIC, and other tasks get escalated to the processor on an as-needed basis.

    [–]amaged73[S] -1 points0 points  (1 child)

    The same thing we do when any type of server is interrupted: we rely on service high availability to work around bugs. We can't expect a perfect network that never skips a beat, because it can and will, just like any other piece of software out there. Are we restricting ourselves to hardware solutions with these ideas?

    [–]JRHelgeson 1 point2 points  (0 children)

    I'm just stating that it is most efficient to handle edge connections at the edge, and only escalate to the core proc when necessary. There are massive trade-offs to routing packets at the core. 15 years ago, all routing was done at the core - and therefore routing packets took milliseconds, whereas switching was nanoseconds. Now that routing is done on silicon we have routing happening at wire-speed.

    If you bring all traffic to the kernel, there is no two ways about it, it will be slower - unless you find a way to slow time or speed up light.

    [–]kijikiCumulus co-founder 0 points1 point  (0 children)

    All the below are rough estimates based on when I last looked into this back in 2016, but nothing has changed radically since then.

    An ASIC like Mellanox or Broadcom will always, or almost always be able to do all features at full line rate. For the same power budget, that rate will be approximately 10-20x faster than an NPU.

    An NPU will outperform a general-purpose CPU by 4-10x, depending on how aggressive you get with optimizing the code for the CPU. The 10x is if you just use the normal Linux/BSD forwarding code. The ~4x is if you do custom code using something like DPDK.

    So why do people use customized CPU forwarding? If you just need a router with three 10G interfaces, x86 servers are cheap, easily available, and can almost always keep up. And they have a much richer feature set than fixed-function ASICs.

    Why do people use NPUs instead of ASICs? Flexibility. The feature set can be much greater than the fixed function ASICs.

    Some of these lines are blurring a bit with semi-programmable ASICs like Cavium, and especially with Barefoot's Tofino ASIC.

    [–]HoorayInternetDramaMeow 🐈🐈Meow 🐱🐱 Meow Meow🍺🐈🐱Meow A+! 1 point2 points  (3 children)

    Putting it into the control plane or data plane?

    [–]amaged73[S] 0 points1 point  (1 child)


    [–]HoorayInternetDramaMeow 🐈🐈Meow 🐱🐱 Meow Meow🍺🐈🐱Meow A+! 2 points3 points  (0 children)

    Then yes.

    [–]UIfHvsv12 1 point2 points  (0 children)

    Two things,


    Support (Especially for enterprise customers)

    [–]cylemmulo 1 point2 points  (0 children)

    Along with everything being based on Linux, I use Linux VyOS VMs for routing as well.

    [–]drdabbles 1 point2 points  (0 children)

    It's a little hard for me to follow what exactly you're asking, but I'll give you an example from both directions to illustrate what I think is useful.

    First, using a Linux device as both the route/switch/whatever and the application server. This is extremely useful because of the massive levels of scale that can be accomplished with what is effectively commodity network and server equipment. Instead of purchasing routers, firewalls, servers, and all of that, you can simply connect fabrics of switches together, terminate your transit at whatever layer you like within that fabric, and those Linux devices can pretty much handle the rest. They can do all of the BGP announcements, they can do VRRP between each other if that's your thing, you can use ECMP to each host, you can do clever network tricks to handle dropping a request into a mesh of servers and only having a single one respond, and so on. Basically you limit your equipment spend, eliminate major bottlenecks and chokepoints, and so on. Additionally, you get to use all of your normal config management tools like Ansible, Chef, Puppet, or whatever you like to manage the only devices you actually maintain- Linux servers. This is, at a high level, how Facebook and Google have scaled their networks so well, so quickly, and at a significant cost savings over what would have otherwise been designed in a more traditional setting.

    On the other hand, using Linux inside white and gray box devices like Arista or Cumulus switches is also massively powerful. First, you get the Linux networking stack, which has its positives and negatives. For example, VLAN membership isn't global on the device, so you can have multiple VLANs with the same ID without conflict. For virtual networking, or datacenter networking where tenants might make some changes (SoftLayer, Rackspace, Digital Ocean for example), this means your customers won't have to come up with their own IDs and avoid conflicts, which is a pretty big win. Administrators also understand how the networking gear will work, and the tools to interact with it, which is nice. But none of that compares with the ability to run custom software on purpose-built network hardware running Linux. Imagine you have a Cumulus or Arista switch, and you want to do some random thing on the device. Traditionally, you'd be out of luck, but with these Linux network appliances, you just write your software, load it onto the device, and run it! Services, applications, whatever. And you can get access to the accelerated networking components, which means your service can handle traffic at blazing fast speeds.

    As a final example of where things are converging: XDP, DPDK (and its competitors), af_packet, etc. all serve to massively accelerate packet handling in several situations, and can be extended to various levels of complexity. By spreading data intelligently across buffers, CPUs, and host NUMA architectures, and accelerating it with the various fast-pathing tools that are quickly maturing, devices are seeing orders of magnitude better performance. Hosts that struggled to manage traffic levels over 10Gbit without specialized drivers and hardware can now handle multiple 100G ports without much trouble. With some real focus and attention, companies using these technologies are being rewarded with huge performance increases, which means better customer experiences and fewer hardware dollars spent. And all of that ties back into the previous two examples. With these tools, easily affordable commodity hardware is performing on the same level as purpose-built chipsets and fabrics. And all in an operating system that tens of thousands of people have administrative experience with. That's industry-changing stuff right there.

    [–]yahooguyz 0 points1 point  (0 children)

    I think this gives you much more capability in an operational and monitoring sense - running native packet sniffers/Wireshark - though more complexity as well. I see the trend shifting a bit away from Cisco and the like.

    [–]scritty 0 points1 point  (0 children)

    I'd see merit in this approach for better 'route to the host' support.

    I've had to handle shoving things into multiple namespaces before and it's a PITA to manage.

    Pretty good example: https://stackoverflow.com/questions/28846059/can-i-open-sockets-in-multiple-network-namespaces-from-my-python-code#28865626

    [–]burbankmarc 0 points1 point  (28 children)

    I was really hoping we'd see more native kernel development with the Broadcom/Intel ASICs. If they were natively supported I could then just buy a switch and install Red Hat on it.

    [–]PE1NUTThe best TCP tuning gets you UDP behaviour. So we just use UDP. 5 points6 points  (18 children)

    You can already, if you buy a switch with a supported ASIC and put e.g. Cumulus on it. But because the API to the ASIC is proprietary, this will not make it into more general Linux distributions. I think the API for some Mellanox switches is fully open, but you would still need to put Cumulus, or a more open but switch-centric OS on it to have it work as a switch.

    [–]sryan2k1 2 points3 points  (0 children)

    I mean, this is Arista. It's Fedora Linux, and literally the only custom kernel module is the Broadcom ASIC driver.

    [–]burbankmarc 1 point2 points  (16 children)

    Yeah, I don't want to use Cumulus.

    [–]amaged73[S] -1 points0 points  (15 children)

    is there a better way today?

    [–]burbankmarc 0 points1 point  (14 children)

    No, and that's kind of my point. You can use either Cumulus or the hardware vendor's OS, and that's it. Whereas on a server I can install anything that runs on x86.

    [–]scritty 1 point2 points  (0 children)

    There's more than Cumulus supporting ONIE boot and related functionality these days - http://www.opencompute.org/wiki/Networking/ONIE/NOS_Status



    Not a huge list, I'll admit.

    Programming good features that aren't just poking instructions at broadcom's SDK is a) hard and b) really hard.

    Broadcom doesn't just hand the thing out, either :/

    [–]amaged73[S] 0 points1 point  (11 children)

    I guess (correct me if I'm wrong) you can install Cumulus OS for free on any x86 box, but you'll only get limited performance (~70-80Gb on one box), and they give you the option to purchase special hardware to go higher. This could improve if someone were able to do things faster in the kernel with more features, which is what companies like Cumulus are trying to do.

    [–]burbankmarc 0 points1 point  (7 children)

    To answer your question directly, yes I like what Cumulus is doing. However, I do not like the fact that it's only Cumulus doing it.

    [–]pyvpxobsessed with NetKAT 1 point2 points  (5 children)

    what? cumulus is a control plane that speaks to a switching ASIC -- like literally everyone else.

    cumulus does their own linux to ASIC driver, instead of what many other whitebox switch control plane vendors do, which is just %s/ASIC_SUPPLIER/VENDOR/g on the reference driver.

    [–]burbankmarc 2 points3 points  (4 children)

    Like I said previously, I would rather the drivers be native to the Linux kernel. However, as someone else pointed out the API is proprietary so it most likely isn't compliant with the GPL.

    [–]pyvpxobsessed with NetKAT 0 points1 point  (3 children)

    from a performance perspective, why does that matter at all?

    don't expect any switching ASIC vendor to mainline drivers. it's not gonna happen. NIC drivers just interface with a proprietary firmware API. everything happens in a "black box" there. it's the same thing with switching ASICs from Broadcom, Mellanox, Cavium, etc.


    [–]amaged73[S] 0 points1 point  (0 children)

    Why is it only Cumulus? Is it because they have the talent, or because no other company wants to take that approach? Do Pica8, Big Switch, and Arista follow a similar solution?

    [–]pyvpxobsessed with NetKAT 0 points1 point  (2 children)


    Much of the highest-throughput network stacks are in userland. It's very tough to break 40Gbps per core at line rate, for a number of reasons -- namely, you only get so many instruction cycles per packet, plus PCIe bandwidth.
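
The per-packet cycle budget is easy to put rough numbers on. Assuming minimum-size 64-byte frames (84 bytes on the wire once preamble and inter-frame gap are counted) and a 3 GHz core:

```python
# Back-of-envelope cycle budget at 40 Gbps line rate with min-size frames.
LINE_RATE_BPS = 40e9
WIRE_BYTES = 64 + 20        # min frame + preamble + inter-frame gap
CPU_HZ = 3e9                # one 3 GHz core

pps = LINE_RATE_BPS / (WIRE_BYTES * 8)      # packets per second at line rate
cycles_per_pkt = CPU_HZ / pps               # cycles the core can spend on each
print(f"{pps/1e6:.1f} Mpps, ~{cycles_per_pkt:.0f} cycles per packet")
# → 59.5 Mpps, ~50 cycles per packet
```

At roughly 50 cycles per minimum-size packet there's barely room for a lookup, let alone features, which is why kernel-bypass stacks fight for every cycle.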

    [–]amaged73[S] 0 points1 point  (0 children)

    So it's in userland because of architectural limitations. Does that mean we should only use these open networking solutions for ToR switches and not Clos-type fabrics (bigger purposes)?

    [–]EtherealMind2packetpushers.net 0 points1 point  (0 children)

    You can improve this by using a fancy NIC that has an onboard FPGA/CPU or ASIC to get more than 100G (some up to 200G) today.

    [–]EtherealMind2packetpushers.net 0 points1 point  (0 children)

    There are many people who roll their own Linux.

    Or you can go with OpenSwitch - https://github.com/open-switch

    or Open Network Linux - https://opennetlinux.org/

    [–]kWV0XhdO 2 points3 points  (5 children)

    hoping we'd see more native kernel development with the Broadcom/Intel ASICS

    Not possible. The kernel is open source, while the SDKs from the ASIC vendors come only under NDA. That's why switchd is only available as a binary.

    [–]amaged73[S] 0 points1 point  (4 children)

    What if those SDKs were open sourced? What would it take for a genius to reverse engineer that shit and set networking free? :)

    [–]kWV0XhdO 4 points5 points  (3 children)

    That's an expensive proposition that won't result in a stable/sellable product, but will result in endless lawsuits. I don't see that happening.

    Having said that, I don't understand why the ASIC vendors hide their cards like they do. It's not like making a switching ASIC is that hard... Not that I'd even know where to begin, but Barefoot only needed a few years to crank out a truly revolutionary chip that was (briefly) faster than anything else on the market.

    My guess is that Broadcom hides their SDK for two main reasons:

    • Habit
    • Maintaining this barrier keeps their customers (Cisco, etc...) happy precisely because it doesn't "set networking free"

    [–]banditoitaliano 0 points1 point  (1 child)

    It's not like making a switching ASIC is that hard...

    I don't have any in-depth knowledge of the topic, but I'm going to take a wild guess and say it IS hard but it's probably due to patents more than anything.

    [–]kWV0XhdO 0 points1 point  (0 children)

    Heh. I don't mean to trivialize it, but it's been recently demonstrated that a young company, starting from scratch, can totally do this.

    [–]anothersackofmeatI break things, professionally. 0 points1 point  (0 children)

    I imagine it’s to lock in OS vendors. If you’re Broadcom, and Cavium came along with a cheaper chipset that did the same things and had the same API, there would be pretty much nothing to stop your customers from jumping ship.

    It would be great for the consumer, though.

    [–]EtherealMind2packetpushers.net 1 point2 points  (0 children)

    You can do this today. There are several companies that roll their own Linux for switches and routers. The hardware ASICs do the forwarding, Linux programs the ASICs while handling all the management and operations needed to gather route data, monitor the system etc.

    [–]OverOnTheRock 0 points1 point  (1 child)

    I am in no way associated with them, but I'll suggest that you can buy a Mellanox Spectrum-based switch and, using their ONIE install mechanism, install a plain-jane recent Linux kernel on it and make use of the hardware-based forwarding. You'll need to load the switchdev kernel module.

    Switchdev is integral and is delivered with the kernel; it is not a binary blob such as what is needed to run Broadcom ASICs. In any case, switchdev takes care of syncing the kernel's forwarding information base with the hardware, and the roadmap of implemented features is reasonably substantial.

    You can then use software like Free Range Routing and/or Open vSwitch and/or iproute2 to handle control plane operations, and leave the forwarding plane to the hardware.