iproute2 Traffic Control
Linux has very cool network filtering capabilities, and I’ve written about them in a couple of articles. One of the comments on the first article was this:
“…add advanced routing (iproute2) in order to handle bandwidth resources”
I didn’t know anything about iproute2 then. I did a little research before writing my follow-up iptables article, but I still didn’t know enough to write anything about iproute2. Then I started seeing “abuse” of my web server, and traffic control was the obvious solution. At the beginning of 2009 an organization started polling my webcam repeatedly without pausing. The frame was being read 10 times per minute and my web server traffic went from about 5GB per month to 50GB per month. There are expensive routers that can control bandwidth usage, but Linux has all that I need.
When I realized that the cause of the huge traffic levels was the non-stop webcam polling, I sent an e-mail requesting that the polling rate be reduced – but it was Friday evening. Then I remembered the response to my iptables article – “handle bandwidth resources” – and set about learning more about iproute2 traffic control.
I set up a test network to teach myself enough to squash the webcam “abuse”. iproute2 traffic control is tricky, and it was Sunday evening before I was able to choose the bit rate available for any type, source or destination of traffic crossing my router. I configured the data rate to the problem organization such that it would take a whole minute to read one frame, without impacting performance for other users.
I said that iproute2 traffic control was tricky for me. My goal in this article is to explain how to use iproute2 to solve basic problems like the one above. It still doesn’t satisfy the request in the comment quoted above because it doesn’t get into the combined use of netfilter and iproute2…but it’s a start, and I plan to write that article anyway.
Environment used for investigation
I used three hosts:
- A Linux host running an ssh server on one IP network on one set of cabling;
- An ssh client on a Linux host on a different IP network on different cabling; and
- A Linux host connected to both IP networks via different NICs and configured as an IP router
The ssh server and client hosts were configured with static routes to each other’s networks via the router host.
Test case
I chose to observe the effects of traffic shaping on an ssh session running the following command in a bash shell:
while [ 1 ]; do ls -Rla ~; done
The ls runs non-stop on the ssh server host and the text output is transferred via ssh to be displayed on the ssh client. So, the rate of update on the client display is related to the bandwidth available to the ssh session. It’s a crude test, but sufficient to see whether bandwidth was actually being controlled and to determine which commands to use and in what order.
iproute2 components
You are probably familiar with the ip program already; it’s one of the iproute2 tools. For traffic control there’s the tc program. It has many options, divided into functional groups, and a vast set of configurations are possible:
tc help
Usage: tc [ OPTIONS ] OBJECT { COMMAND | help }
       tc [-force] -batch file
where  OBJECT := { qdisc | class | filter | action | monitor }
       OPTIONS := { -s[tatistics] | -d[etails] | -r[aw] | -b[atch] [file] }
That’s a fraction of the available options. Try putting each of those OBJECT values individually between tc and help:
tc qdisc help
Usage: tc qdisc [ add | del | replace | change | show ] dev STRING
       [ handle QHANDLE ] [ root | ingress | parent CLASSID ]
       [ estimator INTERVAL TIME_CONSTANT ]
       [ [ QDISC_KIND ] [ help | OPTIONS ] ]

       tc qdisc show [ dev STRING ] [ingress]
Where:
QDISC_KIND := { [p|b]fifo | tbf | prio | cbq | red | etc. }
OPTIONS := ... try tc qdisc add <desired QDISC_KIND> help
I’ll leave it to you to look at the rest, along with the man page and the “Linux Advanced Routing & Traffic Control HOWTO” on the website referred to in the man pages:
http://lartc.org/howto/
For this article I’ll describe configuring a simple set of bandwidth classes that you can select from for basic session types. It’s important to note that I’ll only be implementing control over the send rate on each side of the router.
Packet generation and transmission
Programs submit their messages for transmission, the network stack organizes them into packets, and the packets are queued to be sent on the wire. The queuing allows for a number of performance optimizations, asynchronous and overlapped I/O for example.
The best known queue type is probably first-in, first-out (fifo). Messages from all programs are handled in the strict order that they were “sent”. There’s no prioritization and a small number of programs can hog the network by sending lots of large messages without pausing.
That could be avoided if multiple queues are used, each with a different priority. Each queue is fifo, but the highest priority queue with messages in it will be drained before any other queue. Assign the nasty programs to the lower priority queues and some fairness is restored.
In iproute2 traffic control, a queueing discipline (qdisc) provides for selection of queueing models and features.
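As an aside, that multi-queue priority model is available directly in iproute2 as the prio qdisc. A minimal sketch (the device name eth0 is just an example):

# Replace the default fifo with a three-band priority qdisc; band 0 is
# drained before band 1, and so on.
tc qdisc add dev eth0 root handle 1: prio bands 3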
Queuing disciplines (qdisc)
For session based bandwidth control I used iproute2’s class based queuing (cbq) discipline. It let me create priority and bandwidth classes and then filter session types into a class of my choosing. It isn’t the only solution but it is suitable for the task.
I started by adding a class based queue to the public interface on my router (the one that sends packets in the direction of the user I want to control):
tc qdisc add dev eth2 root handle 1: cbq avpkt 1500 bandwidth 10mbit
This cbq qdisc is added as the root qdisc on the eth2 device and given a node ID of 1:. The qdisc is configured to assume that the average packet size will be 1500 bytes and that the link bit rate is 10 Megabits per second (10mbit) – see the note on units at the end of this article.
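You can confirm the qdisc took effect, and later watch its statistics, with the show command (add -s for the statistics):

tc qdisc show dev eth2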
The bandwidth control happens by inserting a delay between packets, and the length of the delay depends on the size of the packets. Every packet on a 10Mbps link is transmitted at 10Mbps, but a 1000 byte (8,000 bit) packet only takes about 0.8 milliseconds to put on the wire. So, inserting a 7.2 millisecond delay between 1000 byte packets in a given session limits that session to one packet every 8 milliseconds – an average bandwidth of 1Mbps. If the packets were 1500 bytes (12,000 bits) long then they would occupy 1.2 milliseconds on the 10Mbps wire and the delay between them would need to be 10.8 milliseconds to maintain an average of 1Mbps.
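To make that arithmetic concrete, here is the same calculation as a small shell snippet – my own restatement of the reasoning above, not tc’s internal algorithm:

pkt=1500 link=10000000 target=1000000  # bytes, link bits/sec, target bits/sec
awk -v p=$pkt -v l=$link -v t=$target 'BEGIN {
    wire = p * 8 / l * 1000        # milliseconds the packet occupies the wire
    gap = p * 8 / t * 1000 - wire  # delay needed to hit the target rate
    printf "wire time %.1f ms, inserted delay %.1f ms\n", wire, gap
}'
# prints: wire time 1.2 ms, inserted delay 10.8 ms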
I suggested an average packet size and link bandwidth so that the cbq can estimate the amount of delay to insert between packets to achieve a specific average bit rate. It’s probably safe to assume the link data rate won’t change, but if the real average packet size varies a lot, or the suggested value is badly chosen, the selected bit rate restrictions will not be very accurate. The configuration above is not the only way to control bandwidth – see the man pages, tc help or the howto.
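For example, the token bucket filter (tbf) qdisc imposes a single rate limit on an interface with a much shorter incantation. A sketch with illustrative parameters, for an interface that doesn’t already have a root qdisc configured (see tc-tbf(8)):

# Limit everything leaving eth2 to 1Mbps, allowing short 10KB bursts.
tc qdisc add dev eth2 root tbf rate 1mbit burst 10kb latency 70ms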
Queue classes
Now that I have a cbq I add “classes” that specify my choice of bit rate restrictions. Each of these classes is a queue, by default a fifo one. Here’s an example:
tc class add dev eth2 parent 1: classid 1:1 cbq rate 1mbit allot 1500 prio 5 bounded isolated
That adds a 1 Megabit per second (1mbit) class to the root qdisc (parent 1:) on eth2. I give the class a handle (classid) of 1:1 (child 1 of the object numbered 1: – the cbq). The cbq is permitted to remove 1500 bytes (allot 1500) at a time from the class. The class has a priority of 5.
Lower priority numbers mean higher operational priority. The class with the lowest numbered priority that has something in its queue will be drained before classes with higher numbered priorities and something in their queues. That permits separate queues for interactive (higher priority) versus background (lower priority) traffic that share the same bit-rate restriction.
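For example, a pair of classes with the same 1Mbps restriction but different drain priorities could look like this (the class IDs 1:10 and 1:11 are arbitrary picks for illustration):

# Same rate, but 1:10 is drained ahead of 1:11 whenever both have packets.
tc class add dev eth2 parent 1: classid 1:10 cbq rate 1mbit allot 1500 prio 1 bounded isolated
tc class add dev eth2 parent 1: classid 1:11 cbq rate 1mbit allot 1500 prio 8 bounded isolated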
The bounded and isolated arguments refer to sharing between classes. For example, on a 10Mbps link with ten 1Mbps classes, the classes could share the available bandwidth equally. But if one of those classes is idle, it’s a shame to leave its bandwidth unused; if an idle class can “lend” its idle capacity to another class then overall utilization is maximized. In this case I specified that the class cannot borrow bandwidth from other classes (bounded) and cannot lend its idle bandwidth to other classes (isolated). I’m ultimately trying to stop abuse of my webcam; the abusive user doesn’t deserve access to spare capacity.
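If I had wanted the generous behavior instead, cbq names the opposite options borrow and sharing. A sketch, with another arbitrary classid:

# This class may borrow spare capacity from siblings and lend its own.
tc class add dev eth2 parent 1: classid 1:12 cbq rate 1mbit allot 1500 prio 5 borrow sharing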
For fun I’ll create a bunch of classes with the behavior of historically popular dial-up modem standards. I’ll keep the 1:1 class above and add these:
tc class add dev eth2 parent 1: classid 1:2 cbq rate 300bit allot 576 prio 5 bounded isolated
tc class add dev eth2 parent 1: classid 1:3 cbq rate 1200bit allot 576 prio 5 bounded isolated
tc class add dev eth2 parent 1: classid 1:4 cbq rate 2400bit allot 576 prio 5 bounded isolated
tc class add dev eth2 parent 1: classid 1:5 cbq rate 9600bit allot 576 prio 5 bounded isolated
tc class add dev eth2 parent 1: classid 1:6 cbq rate 19.2kbit allot 576 prio 5 bounded isolated
tc class add dev eth2 parent 1: classid 1:7 cbq rate 28.8kbit allot 576 prio 5 bounded isolated
tc class add dev eth2 parent 1: classid 1:8 cbq rate 33.6kbit allot 576 prio 5 bounded isolated
tc class add dev eth2 parent 1: classid 1:9 cbq rate 57.6kbit allot 576 prio 5 bounded isolated
They all have the same priority for our purposes, so none will be favored over another.
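A quick sanity check that all nine classes exist:

tc class show dev eth2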
Now I have everything in place to specify what sessions will have which restrictions applied to them.
Filters
Filters are used to specify which class a particular type of traffic belongs to. For example, imagine the abusive user of my webcam was at IP address 10.0.5.57. Just to be mean I’ll put all traffic to that host (the direction of the downloads) on a virtual 300 baud modem:
tc filter add dev eth2 parent 1: protocol ip prio 16 u32 match ip dst 10.0.5.57 flowid 1:2
There I add an IP (protocol ip) filter to the qdisc with ID 1: (parent 1:) on device eth2. The u32 is the filter module name; it allows filtering on named fields, on the numeric values of bytes, and even down to the values of individual bits in a packet. The filter applies to packets going to the IP address 10.0.5.57 (match ip dst 10.0.5.57), and those packets will be queued in the 300bps class (flowid 1:2). The priority for the filter is 16 – this is important, see below.
That prio value is important for several reasons. If a packet might match more than one filter then it is important that the filter you want it to match has the lowest prio value: each packet is tested against each filter in ascending filter prio order.
There’s another quirk to the prio value. It provides the only way I’m aware of to reliably delete a filter, but I am only able to make that work if all filters on a given device have unique prio values. So, take care to ensure there are no filters with the same prio value and that the prio values reflect the order you want the filter match tests applied.
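The installed filters and their prio values can be listed per device, which makes it easy to check for duplicates:

tc filter show dev eth2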
The addition of another filter could make the whole web server appear to be connected via a 9600 baud modem as follows:
tc filter add dev eth2 parent 1: protocol ip prio 17 u32 match ip protocol 0x6 0xff match ip sport 80 0xffff flowid 1:5
It’s a bit complicated; although tc supports a simpler syntax for the same thing, I can’t get it to work as expected. The main difference from the previous filter is the packet “match” settings. The first part (match ip protocol 0x6 0xff) isolates the protocol field in the IP header and tests whether it has the value 6 (tcp); the 0xff masks one byte and the 0x6 is the value the masked field is compared with. The next part (match ip sport 80 0xffff) isolates the source port field in the protocol header and compares its value with 80 (http). If a packet matches both parts then it is queued in class 1:5 (9600 bits per second).
Remember that in this configuration I can only control the flow rate of packets being sent, so here I restricted everything being sent from tcp source port 80. If the web server were mostly used for uploads then that filter wouldn’t help; instead, I’d have to use a filter on a router interface facing the web server (i.e. one sending packets to it). A more careful configuration would filter both directions.
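For illustration, the mirror of that filter would sit on the interface that faces the web server and match the destination port instead. This is only a sketch – it assumes a cbq qdisc and a 1:5 class have also been created on that interface (here called eth0):

# Restrict uploads to the web server: tcp packets with destination port 80.
tc filter add dev eth0 parent 1: protocol ip prio 17 u32 match ip protocol 0x6 0xff match ip dport 80 0xffff flowid 1:5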
Look at the prio value: it’s 17. The previous filter I added, the one limiting the abusive host to 300 baud, had a prio value of 16, so that test is made first. Everything will look like it is on a 300 baud modem to the user at 10.0.5.57, and the web server will look like it is on a 9600 baud modem to all other users. All other traffic types will get as much capacity as the end-to-end link has available.
I can delete both of those filters by their prio number:
tc filter del dev eth2 protocol ip prio 16 u32
tc filter del dev eth2 protocol ip prio 17 u32
Back to my test case
Now that I have the classes in place and have learned the basic filter semantics, I can look at how they affect the speed at which the directory listing appears in the ssh session. Assume I have the qdisc and classes shown above but am back at the point where there are no filters, and that the session is connected with the “while” command already running:
tc filter add dev eth2 parent 1: prio 101 protocol ip u32 match ip protocol 0x6 0xff match ip sport 22 0xffff flowid 1:1
That places tcp traffic with a source port of 22 (ssh) into the 1 megabit per second queue.
I expected the connection to drop, but it didn’t! The update rate of the display just slowed down after a few seconds (to about 1Mbps). Filters can be applied in real time without disturbing existing sessions. That’s the way it should be, but I didn’t expect it and it is very impressive.
It can take a lot of experimentation to figure out the tc commands to achieve a given goal. At this point it would be good to just change the filter, or replace it with one that would move the ssh traffic into another class and change its speed again. As far as I can tell, the tc filter change and tc filter replace commands don’t work. So I’ll have to suffer a moment of full speed ssh while I switch it to a simulated 300 baud modem:
tc filter del dev eth2 protocol ip prio 101 u32
tc filter add dev eth2 parent 1: prio 101 protocol ip u32 match ip protocol 0x6 0xff match ip sport 22 0xffff flowid 1:2
Remember the good old 300 baud days? Or maybe they’re not as old as you think…I still use a 300 baud modem for digital data over HF radio! I haven’t used a 56kbps modem for a long time though.
tc filter del dev eth2 protocol ip prio 101 u32
tc filter add dev eth2 parent 1: prio 101 protocol ip u32 match ip protocol 0x6 0xff match ip sport 22 0xffff flowid 1:9
The speed changes in the output are very clear and with careful research could be made precise enough for many tasks.
Putting it all together
Having experimented with the features of traffic control in iproute2, let’s build something more practical:
- Our organization has a mail server at 10.0.0.25/24
- a web server at 10.0.0.80/24
- a https server at 10.0.0.80/24
- a second web server with different content at 10.0.0.81/24
- a second https server with different content at 10.0.0.81/24
- an ssh server at 10.0.0.22/24
- a router at 10.0.0.1/24 (eth0) on the private side and 10.0.1.1/24 (eth1) on the public side
- we have a dedicated 25Mbps line on the public side
- the private side is Gigabit ethernet
- We expect most of the activity to be around 100 simultaneous http(s) sessions but user activity will be mostly reading content before moving on (e.g. a static wiki)
Pretend these aren’t private addresses and public routes exist from everywhere to them and back. Insert appropriate values for your network. We want to divide up the bandwidth in a fairly crude way:
- We don’t get a lot of e-mail so it can wait, limit it to 512kbps
- The ssh server needs to be interactive and fit for file transfer but it has a small userbase and none has more than a 6Mbps line, allow 6Mbps for ssh
- We want people to prefer the https servers, and we have messages on every page saying https is faster, so limit the http servers to 1Mbps total.
- We expect the https server at .80 to be used much more than the one at .81 so allow up to 10Mbps for .81 and up to 25Mbps for .80
You might wonder why the sum is more than the available 25Mbps. Well, we don’t expect all of these uses to be happening constantly so there will be lots of idle bandwidth. On the other hand, we want to ensure that appropriate priority is given to each service when there is heavy usage. The following configuration should achieve the goals above:
# Queue discipline on the private interface
tc qdisc add dev eth0 root handle 1: cbq avpkt 1500 bandwidth 1gbit

# Performance classes for sending from the router onto the private network.
# The only substantial traffic in that direction is to the mail server and
# to the ssh server; everything else of significance is being sent from the
# private network to the outside.
tc class add dev eth0 parent 1: classid 1:1 cbq rate 6mbit allot 1500 prio 1 bounded isolated
tc class add dev eth0 parent 1: classid 1:5 cbq rate 512kbit allot 1500 prio 8 bounded isolated

# Filters on the private interface
tc filter add dev eth0 parent 1: prio 101 protocol ip u32 match ip protocol 0x6 0xff match ip dport 22 0xffff flowid 1:1
tc filter add dev eth0 parent 1: prio 102 protocol ip u32 match ip protocol 0x6 0xff match ip dport 25 0xffff flowid 1:5

# Queue discipline on the public interface
tc qdisc add dev eth1 root handle 1: cbq avpkt 1500 bandwidth 25mbit

# Performance classes for sending from the router onto the public network.
# All private hosts can generate significant traffic in this direction.
tc class add dev eth1 parent 1: classid 1:1 cbq rate 6mbit allot 1500 prio 1 bounded isolated
tc class add dev eth1 parent 1: classid 1:2 cbq rate 25mbit allot 1500 prio 2 bounded isolated
tc class add dev eth1 parent 1: classid 1:3 cbq rate 10mbit allot 1500 prio 3 bounded isolated
tc class add dev eth1 parent 1: classid 1:4 cbq rate 1mbit allot 1500 prio 7 bounded isolated
tc class add dev eth1 parent 1: classid 1:5 cbq rate 512kbit allot 1500 prio 8 bounded isolated

# Filters on the public interface
tc filter add dev eth1 parent 1: prio 101 protocol ip u32 match ip protocol 0x6 0xff match ip sport 22 0xffff flowid 1:1
tc filter add dev eth1 parent 1: prio 102 protocol ip u32 match ip protocol 0x6 0xff match ip src 10.0.0.80/32 match ip sport 443 0xffff flowid 1:2
tc filter add dev eth1 parent 1: prio 103 protocol ip u32 match ip protocol 0x6 0xff match ip src 10.0.0.81/32 match ip sport 443 0xffff flowid 1:3
tc filter add dev eth1 parent 1: prio 104 protocol ip u32 match ip protocol 0x6 0xff match ip sport 80 0xffff flowid 1:4
tc filter add dev eth1 parent 1: prio 105 protocol ip u32 match ip protocol 0x6 0xff match ip dport 25 0xffff flowid 1:5
Remember that we can only control the rate of data being sent. The only streaming traffic being sent from the router into the private network is incoming e-mail and ssh sessions. So only the classes and filters for those are needed on the private interface. All the hosts on the private network have exposed services that can stream onto the public network so we need classes and filters for them all.
Remember that a packet is tested for a filter match in increasing filter prio value order. I’ve listed the filters in that order but they could have been created in any order.
So, ssh traffic (tcp source port 22) gets first crack at the available bandwidth but can’t consume more than 6Mbps. Next, the higher volume https server (packets from tcp port 443 with source address 10.0.0.80) gets up to 25Mbps. If there is already a constant 6Mbps of ssh traffic then only 19Mbps of the link’s 25Mbps total will be available to that https server and only idle levels to the other services. Next, if there is any spare capacity, the second https server (packets with tcp source port 443 and source address 10.0.0.81) gets whatever is available up to its 10Mbps restriction. Then, the two http servers get whatever is available up to 1Mbps between them (all outgoing from port 80). Finally, whatever is left up to 512kbps is made available for the mail server to deliver messages (tcp to destination port 25).
Anything else is unmanaged and has no restriction. The configuration is quite simple; it does nothing with respect to interactive traffic (like web browsing) from the private network or access to unspecified services on any private hosts. Dealing with that is left as an exercise for the reader, but I will tell you that you can specify a match of the form 10.0.0.0/24 and classify everything on the specified subnet. Make that the last rule (see the sketch below) and you can restrict everything else from the private network to whatever class you created for that traffic.
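As a sketch only (the 1:6 class and its 2Mbps rate are hypothetical choices), the catch-all could look like this, with a prio value high enough that it is tested after all of the specific filters:

# A class for "everything else" from the private subnet, and a last-tested
# filter that sweeps the whole subnet into it.
tc class add dev eth1 parent 1: classid 1:6 cbq rate 2mbit allot 1500 prio 9 bounded isolated
tc filter add dev eth1 parent 1: prio 199 protocol ip u32 match ip src 10.0.0.0/24 flowid 1:6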
The original request to include iproute2 information in my iptables article was asking for a lot more than the information above, and you can see how much effort it took to describe just this. I plan to write a further article showing how you can use the more versatile iptables rules to mark packets so that they can be classified with simpler traffic control filters.
A note on units
Bandwidth and size units for the tc program have a syntax that needs to be used carefully. Byte and bit sized units are supported but their syntax is not consistent. Bytes are specified with “b” and bits with “bit”:
5b - Five bytes
5bit - Five bits
Decimal fractions are supported:
5.5b - Five point five bytes
5.5bit - Five point five bits
Rates can be specified in per second units. For the “byte” sized units, the bytes per second rate is specified as “bps” – note that your usual reading of “bps” may differ from tc’s. Bits per second are specified with “bit”, i.e. the same syntax as bit counts, so context is required to disambiguate:
5bps - Five bytes per second
5bit - Five bits per second (if used in a bandwidth context, otherwise five bits long)
Decimal fractions are supported as shown above:
5.5bps - Five point five bytes per second
5.5bit - Five point five bits per second (if used in a bandwidth context)
If a number is used in a bandwidth context and no units are specified then bytes per second is assumed:
5 - Five bytes per second
Times are seconds based and specified with s:
5s - Five seconds
Decimal fractions are supported in the same way as shown above:
5.5s - Five point five seconds
If a number is used in a time context and no units are specified then microseconds is assumed:
5 - Five microseconds
Scale prefixes may be used. For bytes and bits you can use k (kilo) and m (mega) – note that both are lower case. In each case they are binary powers, so k is a multiplier of 1024 and m is a multiplier of 1048576. For example:
2kbps - 2048 bytes per second
2kbit - 2048 bits per second (in a bandwidth context)
2kbit - 2048 bits (in a data length context)
1mbps - 1048576 bytes per second
1mbit - 1048576 bits per second (in a bandwidth context)
1mbit - 1048576 bits (in a data length context)
Decimal fractions are supported the same way as shown above:
0.5kbps - 512 bytes per second
0.5kbit - 512 bits per second (if used in a bandwidth context)
0.5kbit - 512 bits (if used in a data length context)
Scale prefixes for times are decimal and one of them uses the same character as a bandwidth or size multiplier but has a different scale. They are m (milli) and u (micro):
5ms - Five milliseconds (0.005 seconds)
5us - Five microseconds (0.000005 seconds)