Linux QoS in a Small Network
Introduction
QoS is a very broad topic. I am not going to try to cover all of it; I will limit the scope of this document to one area that interests me - the small and medium enterprise (SME). The reason for this is that the large corporate, ISP and telco aspects of QoS have been covered extensively, as far back as the original Sally Floyd and Van Jacobson work at LBL in the early 1990s. The small and medium corporate aspects are not covered, and some of the cut and paste "advice" on the Internet is way out of context. As a result its usability is quite limited.
If you are fairly familiar with QoS you can skip the rest of the introduction and go directly to the interesting bits in the next sections. If not, you may find the rest of the section interesting and applicable to your problems. I have tried to make this document not only a "HowTo", but a "WhyTo" and a "WhatTo" as well.
Please note that while FreeBSD gets an honourable mention in quite a few places, I no longer use it. After 7 years of running QoS installations based on FreeBSD/ALTQ, I took down the last one in 2005 and replaced it with Linux for reasons beyond my control. System level differences and performance notes for FreeBSD will stay in the document as they are the least likely to change (unless the BSD crowd completely rewrites a large portion of the kernel once again). I can no longer provide cut-n-paste snippets for it as I do not have an environment in which to maintain them.
Similarly, most of the Linux material has been tested and is being used on Debian or Debian-derived diskless/low footprint in-house systems. Most of it will be applicable to RedHat and the other usual suspects with minimal changes.
Differences between QoS in an SME and a LargeCo
A LargeCo is either in control of its backbone or capable of negotiating terms with its telecommunication provider. If worst comes to worst, it can often solve the problem by throwing money at it. An SME has none of these options.
One of the primary aims of QoS is to achieve specific parameters (delay, jitter, loss) for the various types of traffic on the link. In theory (and as described in the literature) this should be done on the egress interface of a QoS capable router just before the traffic enters a bottleneck. There is no point in doing it on ingress after the bottleneck because by that time the congestion has already happened. So, in theory, to have QoS on a link you have to be in control of both sides of it and police the traffic as it enters the link. If the theory is correct, an SME will have to "grow up" so it can control the links for which QoS is of interest to it. The important part here is the "in theory". Let's see if it is possible to achieve QoS in practice for a large portion of the cases which are interesting to many people.
To QoS or not to QoS
This largely depends on the application(s) which need special treatment. These may be transaction systems, games and, most importantly, Voice over IP. It is important to know the traffic requirements of every one of these before designing a QoS scheme. The following usually need to be taken into consideration:
Maximum tolerable delay
Maximum tolerable jitter for a reasonable percentage of the traffic (usually 95%+).
Upper limit on allowed random packet drop
Some minimal values for these are natural for any IP network. There will always be some baseline delay due to router processing and transmission speeds. Routers will always jitter the packets. There is always some percentage of packet loss due to link errors in the network. These are usually quite small. The big delays, jitter and drops come from network congestion.
Going back to the original works by the pioneers of QoS, Sally Floyd and Van Jacobson: they hardly use the word QoS. They usually use the words "Congestion Avoidance". They are correct, as this is what Internet QoS boils down to - avoiding congestion for the traffic classes where it matters and, if congestion cannot be avoided, controlling it in a predictable and well defined manner.
If the link is loaded to the full and the router queue is full to the point where it starts dropping packets the actual delay will be (for the commonly used queue lengths):
| 64K ISDN BRI | queue length of 30+ packets / 5+ seconds delay - unusable for VOIP when congested |
| 256K+ DSL | queue lengths of 30-100 packets / 1.5 - 5+ seconds - unusable for VOIP when congested |
| 2M E1 | queue lengths of 100-250 packets / 0.5 - 2+ seconds - unusable for VOIP when congested |
| 100M Ethernet | queue lengths of 8-30 packets in Layer 2 switches / 0 - 7 ms; 100-250 packets in routers / 7-30 ms - can be used for VOIP provided that the codec can tolerate the packet loss |
| Backbone | Depends on link bandwidth / 125 - 250 ms depending on provider and router type, very rarely more, sometimes less. Usually can be used for VOIP. |
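The delays above follow from simple arithmetic: a packet at the tail of a full queue has to wait until every packet ahead of it has been serialised onto the link. A minimal sketch of that calculation, assuming full-size 1500 byte packets (my assumption, the table does not state a packet size) and a helper name made up for illustration:

```shell
#!/bin/sh
# queue_delay_ms PACKETS PACKET_BYTES LINK_BITS_PER_SECOND
# Prints the time in milliseconds needed to drain a queue of PACKETS packets.
# The helper name and the 1500 byte packet size are illustrative assumptions.
queue_delay_ms() {
    awk -v n="$1" -v size="$2" -v rate="$3" \
        'BEGIN { printf "%.0f\n", n * size * 8 * 1000 / rate }'
}

queue_delay_ms 30 1500 64000      # 30 packets queued on a 64K ISDN BRI
queue_delay_ms 100 1500 2048000   # 100 packets queued on a 2M E1
```

The first call gives 5625 ms, in line with the "5+ seconds" quoted for ISDN; the second gives roughly 0.6 seconds, the lower end of the E1 range.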
Congestion is problematic even when the queue is not fully loaded. If a packet arrives at a router and cannot be transmitted immediately it will be queued. From there on, without QoS, it will be delayed in the queue until all packets before it have been transmitted. This results in a variation in packet delay - jitter. For example, most VOIP applications will attempt to buffer up to the upper bound of jitter (or a preconfigured maximum) before playing sound. As a result, congestion that is considerably lower than in the previous "full queues" case can bring similar problems in voice quality.
- The jitter upper bound is of the same order as the delays listed above, so there is no point in repeating it.
Tail drop due to congestion affects different classes of traffic differently.
- High throughput. Usually TCP. For most TCP applications tail drop is a normal means of traffic management. It is essential to normal functionality. When TCP encounters tail drop it will adjust the window accordingly and keep the transmission rate at a level where drops are minimized in the future. As a result throughput is maximized and drops are kept to a minimum.
- Low latency. Usually UDP. Most UDP applications retransmit regardless of drop rates or do not retransmit at all (most VOIP). As a result tail drop should be avoided for them, as it will not make the application transmit at a lower rate. For most applications it will also cause problems.
In order to get some use out of QoS it is necessary to ensure that uncontrolled congestion is avoided as much as possible for low latency traffic and that controlled congestion is applied to high throughput traffic in the manner prescribed by the QoS policy. All of this has to be done so that the bottleneck is utilised as well as possible, but never congested as a whole.
Where to QoS
In theory it is necessary to do this on the egress to the bottleneck. This is also the place which has been analysed in nearly all papers on the subject.
* Fig 1:
If we look at the bandwidth/delay numbers in the previous subsection it is quite obvious that the worst bottleneck is the access loop between the ISP and the customer. In order to get QoS on it, congestion avoidance has to be applied on the ISP edge towards the customer for any downstream traffic and on the ISP customer presentation router for any upstream traffic (the arrows in Figure 1).
Asking an ISP to get something done is likely to be futile, so we will try to achieve the same results by applying QoS one step later, where we are in control (the arrows in Figure 2).
- Fig 2:
This is the only place that is available to an SME, a home office or a hobbyist in a modern network. This is also one of the few places where it is possible to put a Linux or FreeBSD box to do QoS without running into a religious argument with someone who has been brainwashed into a particular router brand loyalty scheme.
Specifics of controlling QoS after the bottleneck
In order to achieve QoS by policing the traffic after the bottleneck it is necessary to ensure that "natural" congestion at the bottleneck is never reached. This, by the way, is the same method as used by commercial "single box" solutions like Sandvine, Ellacoya, etc.
This is the main difference between doing QoS at the ISP edge and QoS one hop away from it. The ISP edge always knows the current queue depth and the actual recent link utilisation. The QoS conditioner has a reasonably good idea of the recent link utilisation (if there are no drops on the edge). It does not know the queue depth at the ISP. All it can do is act proactively to ensure that the queue length is as low as possible.
If a reasonable proportion of the traffic is TCP, this can be achieved by ensuring that TCP's own congestion control kicks in before the link itself is congested. This can be done successfully, but it will be necessary to sacrifice some bandwidth.
The QoS conditioner has to limit the bandwidth through itself to a value which is slightly smaller than what is available at the ISP edge. From there on, if the QoS parameters are set correctly, most TCP streams on the link will perform their own congestion avoidance and adjust the TCP windows to avoid congestion.
Essentially, this is cheating (like most of QoS anyway). The senders in the TCP conversations do not know that the congestion has been deliberately introduced and they are made to keep their heads down so that they do not congest the bottleneck link.
Bandwidth Sacrifice Specifics
The actual amount of bandwidth sacrifice depends on the following factors: bandwidth estimation and QoS policy precision, bandwidth of the physical link on the bottleneck and traffic shaping on the ISP edge.
Bandwidth estimation and QoS policy precision
These depend on the following factors:
System processing latency. While QoS is not CPU intensive it is a good idea to have a reasonably fast single CPU (see below for why SMP is a bad idea, at least on x86). This is least likely to be a problem on modern PC derived systems. Even something like a 600MHz Via C3 can handle a corporate E1 with 10+ VOIP streams, web browsing, etc and police it correctly. A 600MHz Via C3 is actually a lot of processing power compared to one of the low power MIPS based devices that have recently been hacked to run Linux. I have not tried to run QoS on a "Slug" or a Linksys wireless router. I suspect that at least some of the devices in this category may have insufficient CPU processing power.
System load average. It is important that the system is not loaded. More than 10% CPU load average is likely to start affecting QoS bandwidth estimation. Further to this, heavy IO load will definitely affect correct scheduling of packet transmissions. Man-db, locate and similar jobs that scan large portions of the filesystem must be disabled. It is even better to run the system swapless and/or diskless off NFS, off flash or from a ramdisk. While NFS has lower throughput compared to a disk, the IO latency is also considerably lower. Further to this, if the system is booted diskless from its internal interface, NFS will provide the system with extra scheduling opportunities to run the QoS bandwidth estimator and policy decisions in that direction (see the network cards item below for an explanation of this).
Timers used for QoS. The Unix clones that have QoS provisions usually offer the following choice of timers on a PC platform - gettimeofday(), kernel HZ and CPU performance timers. Some also offer the choice of using the ACPI clock (BSD, Linux after 2.6.21). gettimeofday() is the worst. It is an expensive system call and it is fairly imprecise as well. Kernel HZ is slightly better, but still not good enough. It offers at best 1KHz on Linux 2.6, which is not good enough for anything under 10Mbit. FreeBSD can be pushed further, to HZ values of 2.5-5KHz, which can yield reasonable precision down to under 2Mbit or so (depending on hardware). Usually this is not good enough either. The only timer sources of use at SME bandwidths are the high performance timers available in post-PentiumPro CPUs. These allow bandwidth estimation and policing at speeds lower than 64K on FreeBSD (with HZ raised beyond 2KHz) and lower than 128K on Linux (at 1KHz HZ). I am not an expert in most non-PC architectures, so I do not know whether timer sources of similar precision are available on MIPS, PPC, Sparc and other "usual suspect" platforms. If you intend to run QoS on these, you need to check what the hardware has to offer. Also, timers of this type are not synchronized across CPUs on most SMP platforms, and they do not compensate for clock frequency changes on platforms which have a non-constant CPU frequency. As a result they cannot be used on SMP or cpufreq systems.
Kernel HZ and preemption. These are still important even if gettimeofday() or CPU timers are in use for QoS estimation. The reason for this is that low HZ and non-preemptive systems are much more likely to spend longer periods of time doing other IO. This in turn will decrease the estimator precision.
The last time I checked, all Linux distributions shipped with timer, preemption and/or HZ values that are not well suited for QoS. FreeBSD needs some tuning as well. In all cases the kernel will have to be rebuilt.
Linux prior to 2.6.21 requires the following options:
# (Note - these are only necessary system options, not the actual options for QoS)
# Turn preemption on
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
# provide sane interrupt handling (not strictly necessary, but helps)
CONFIG_X86_UP_APIC=y
CONFIG_X86_UP_IOAPIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# compile in CPU timers
CONFIG_X86_TSC=y
# Set HZ correctly
CONFIG_HZ_1000=y
CONFIG_HZ=1000
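Before rebuilding, it can be worth checking what the running kernel was built with. A minimal sketch, assuming the distribution installs the kernel configuration as /boot/config-<version> (common on Debian, but an assumption here); the helper name is made up for illustration and only a subset of the options above is checked:

```shell
#!/bin/sh
# check_qos_kernel_opts CONFIG_FILE
# Prints each checked option followed by "set" or "MISSING".
# The helper name and the config file location are assumptions.
check_qos_kernel_opts() {
    for opt in CONFIG_PREEMPT CONFIG_X86_LOCAL_APIC CONFIG_X86_IO_APIC \
               CONFIG_X86_TSC CONFIG_HZ_1000; do
        if grep -q "^${opt}=y" "$1" 2>/dev/null; then
            echo "${opt} set"
        else
            echo "${opt} MISSING"
        fi
    done
}

# Check the configuration of the currently running kernel.
check_qos_kernel_opts "/boot/config-$(uname -r)"
```

If the file does not exist every option is reported as MISSING, so a wall of MISSING lines may simply mean that the configuration is stored elsewhere.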
Kernels after 2.6.21 are tickless by default and use internal high resolution timers for QoS. For them the actual timer source can be configured at runtime.
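On such kernels the timer source in use is exposed through sysfs. A minimal sketch for inspecting it; the function name and the CLOCKSOURCE_DIR override are illustrative assumptions, the sysfs files themselves are the standard kernel interface:

```shell
#!/bin/sh
# Directory of the kernel clocksource interface (overridable for testing).
CLOCKSOURCE_DIR="${CLOCKSOURCE_DIR:-/sys/devices/system/clocksource/clocksource0}"

# show_clocksource: print the timer source in use and the available ones.
show_clocksource() {
    if [ -r "${CLOCKSOURCE_DIR}/current_clocksource" ]; then
        echo "current: $(cat "${CLOCKSOURCE_DIR}/current_clocksource")"
        echo "available: $(cat "${CLOCKSOURCE_DIR}/available_clocksource")"
    else
        echo "clocksource interface not found"
    fi
}

show_clocksource
```

A different source can be selected at runtime by writing one of the available names into current_clocksource as root; whether a given source is precise enough for the link in question still has to be checked on the actual hardware.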
On FreeBSD 6.0 and higher it is necessary to adjust HZ:
# (Note - these are only necessary platform options, not the actual options for QoS)
options HZ=2500
# Get it as high as a PC can bear
# FreeBSD 6.0 has Preemption and apic on by default so we do not mess with them
Network Card(s). In addition to global system variables like timer precision and scheduler behaviour, QoS bandwidth estimation depends on the actual network hardware. The effects of this dependency are slightly different on FreeBSD and on Linux.
FreeBSD does ALTQ processing in the actual device driver dequeue procedure (same as the original Van Jacobson proof of concept Solaris work). As a result, server adapters and high end network cards which generate fewer interrupts will call the estimator and transmit scheduling less often compared to "horrid" cards. This results in lower precision when using high performance adapters. At the same time, cards that are considered "horrid", like the Realtek 8139 (rl), provide many more interrupts and many more scheduling opportunities. As a result they provide considerably better estimator precision and policy performance. The difference is especially obvious at speeds under 2Mbit. It is nearly impossible to achieve a working ALTQ hierarchy where some classes are in the sub-64Kbit range for a 2Mbit (E1) link using an Intel EtherExpress Pro (fxp). It is hard, but possible, on a Tulip (dc). It is trivial to do this using a Realtek (rl).
Linux estimator entry points differ from BSD. As a result, the effects of hardware are less pronounced and the system is more dependent on the precision of the timer source. Still, the same rules are valid for Linux as well. QoS on low bandwidth links (sub-2Mbit) cannot be performed on server class network hardware.
Overall, the better the precision of the estimator, the smaller the bandwidth sacrifice. For example, using HZ at the default 2.6 kernel value (1000) on a 600MHz mini-ITX Via C3 running Linux requires sacrificing more than 20% of the bandwidth on an E1 link to get a reasonably complex QoS scheme to work. It will also be impossible to enforce precisely any bandwidths in the sub-256K range. At the same time, the same hardware using CPU timers will be able to enforce allocations as low as 32Kbit while sacrificing less than 5%. There will be no difference in system CPU load between the two cases. It will be under 2%.
Bandwidth of the physical link on the bottleneck and traffic shaping if any
- The lower the bandwidth, the greater the sacrifice. The reason for this is that on low bandwidth links even 2-3 packets can significantly jitter the link. On higher bandwidth links it will take 5-10 packets to do the same damage. It is important to note here that the physical bandwidth is what matters. It may not necessarily be the same as the advertised bandwidth. For example:
- A lot of the DSL hardware synchronises at a rate which is higher than the rate advertised by the provider and written in the contract. From there on, the actual bandwidths are enforced by traffic shaping.
- All cable hardware runs at a much higher rate than advertised by the provider. It depends on the actual HFC, but speeds of up to 32Mbit downstream and 6Mbit upstream are possible on some networks. From there on the actual contracted speed is enforced by traffic shaping using a token bucket algorithm.
- All traffic limits in LAN extension, conferences, colocation and fiber NGA varieties work via traffic shaping.
The presence of traffic shaping on the ISP edge makes the task of a QoS scheduler at an SME considerably easier. The actual physical transmit is faster than on a "real" synchronous link. As a result, even if the queue gets filled up with a few packets once in a while, they will jitter the traffic less than expected. The caveat here is not to trigger the bandwidth shaping algorithm on the broadband access system at the provider.
The best approach for this is to use RED on the TCP/High Throughput classes (see later) to ensure that the bandwidth is kept under the limit proactively.
Overall, with good estimator precision (Linux or FreeBSD with CPU performance counters) and a well configured QoS scheme, a sacrifice of around 32Kbit at 64Kbit bandwidth, going down to around 16Kbit at 2Mbit, has so far worked for me in most cases. Your mileage may vary.
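Whatever sacrifice is chosen, the rate configured on the conditioner is simply the physical rate minus that sacrifice. A small sketch of the calculation with a sanity check; the helper name is made up for illustration, and the figures are the deliberately conservative ones used for the ADSL 256/2048 example later in this document:

```shell
#!/bin/sh
# shaped_rate_kbit LINK_KBIT SACRIFICE_KBIT
# Prints the rate to configure on the limit class, in tc notation.
# The helper name is an illustrative assumption.
shaped_rate_kbit() {
    if [ "$2" -ge "$1" ]; then
        echo "sacrifice must be smaller than the link rate" >&2
        return 1
    fi
    echo "$(( $1 - $2 ))Kbit"
}

DOWNSTREAM_LIMIT=$(shaped_rate_kbit 2048 64)   # downstream: 2048 minus 64
UPSTREAM_LIMIT=$(shaped_rate_kbit 256 32)      # upstream: 256 minus 32
echo "${DOWNSTREAM_LIMIT} ${UPSTREAM_LIMIT}"
```

The two values come out as 1984Kbit and 224Kbit. Start with a generous sacrifice and reduce it only once the scheme is known to work.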
How to QoS
If you are a lover of "cut and paste" this is the section where it will become interesting for you. Please beware that if you are cutting and pasting without reading the "bull" in the previous sections and at least some of the references you are likely to break your network.
Building a QoS scheme
Choosing a qdisc
The choice of Queuing Discipline (qdisc) depends on the actual "legal" relationship between the kinds of traffic that have to be given shares of the link.
We have the following choice of well known classful schemes offered in the Linux (and BSD) schedulers.
- If the relationship is best described as "cohabitation" the best ideology is likely to be link sharing and the best scheme building approach is CBQ. It provides very good link utilisation, but it is based around concepts of mutual benefit which are very hard to express in legalese (even as an internal policy document).
- If the relationship is best described as "subletting" the best ideology is likely to be limiting to a specific bandwidth (with or without burst) and the best scheme building approach is likely to be HTB. It does not provide good link utilisation (classes utilise only up to whatever they are allowed to burst to), but it is based on clear and well defined concepts which can be expressed in legalese (or as a policy document).
- There is no point in trying priority queueing for policing after the bottleneck. It gives nothing unless you are in control of the link.
- There is no point in trying any of the "fair queueing" schemes, including the vendor specific ones, at this point either. They need a queue to operate on. In the absence of a classful qdisc they are useful only when they are on the exit to the bottleneck. Using them after the bottleneck does not provide any benefit.
- There are several other schemes which nobody besides their inventors understands. This is only slightly worse than CBQ, which is understood by its inventors and 5 or so people per country. OK, just kidding...
Building a scheme
Once the relationship between traffic has been clarified the traffic needs to be classified and some minimal amount of bandwidth allocated to each class. This is best done visually. If the only traffic coming up a 2048K DSL line is web browsing and voice the diagram may look like this.
In reality the diagram is likely to contain 16-20 classes for an average company or 5-10 classes for an average home office network.
For CBQ the numbers for all children of a class must add up to a value which is equal to or smaller than that of the parent. When all classes transmit at full blast this is what they will be limited to. If a class is allowed to borrow and is over limit, it will surrender borrowed bandwidth to a class that has started to transmit as long as that class is still under its own limit. Any packets in the class which is over limit will be delayed and packets in the classes which are under limit will be transmitted immediately. As a result voice traffic classified into the VOIP class will coexist peacefully with the web traffic from the WWW class.
If the chosen qdisc is HTB there is a need for a second graph which specifies the maximum bandwidth for each class. These numbers can add up to a value that exceeds the value of the limit class as long as none of them is larger than the limit class itself.
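Both rules are easy to get wrong once the tree grows beyond a handful of classes, so it may be worth checking the numbers before feeding them to tc. A minimal sketch with hypothetical helper names, working on plain Kbit figures:

```shell
#!/bin/sh
# children_fit PARENT_KBIT CHILD_KBIT...
# Succeeds if the guaranteed rates of the children do not exceed the parent.
children_fit() {
    parent="$1"; shift
    total=0
    for rate in "$@"; do
        total=$(( total + rate ))
    done
    [ "${total}" -le "${parent}" ]
}

# ceilings_fit LIMIT_KBIT CEIL_KBIT...
# Succeeds if no single HTB ceiling is larger than the limit class.
ceilings_fit() {
    limit="$1"; shift
    for ceil in "$@"; do
        [ "${ceil}" -le "${limit}" ] || return 1
    done
    return 0
}

# Downstream example: 1024Kbit for WWW and 56Kbit for VOIP under a 1984Kbit limit.
if children_fit 1984 1024 56 && ceilings_fit 1984 1984 128; then
    echo "allocation fits"
else
    echo "over-allocated"
fi
```

Note that the ceilings are checked one by one and not summed, exactly because HTB allows them to add up to more than the limit class.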
Applying a Qdisc
The two qdiscs suitable for QoS in an SME are CBQ and HTB.
Applying a CBQ qdisc
If the qdisc is being applied to an ephemeral interface (ppp, tun, etc) which is deleted and recreated on a casual basis, the qdisc must be applied once the interface is up. It will not be preserved across a ppp or tunneling daemon restart. As a result the qdisc application script will have to be run from /etc/ppp/ip-up.d or its equivalent.
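A hook for this can be very small. The sketch below assumes a Debian-style pppd setup, where the interface name is passed to the hook as the first argument and is also exported as PPP_IFACE; the hook file name and the helper function are made up for illustration, and the actual tc commands are left out:

```shell
#!/bin/sh
# Sketch of a hook such as /etc/ppp/ip-up.d/qos (hypothetical file name).

# is_qos_iface UP_IFACE WANTED_IFACE
# Succeeds only if the interface that came up is the one we shape.
is_qos_iface() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

EXTERNAL_IFACE="${EXTERNAL_IFACE:-ppp0}"

if is_qos_iface "${PPP_IFACE:-${1:-}}" "${EXTERNAL_IFACE}"; then
    # This is where the qdisc application script would be called.
    echo "would apply the qdisc to ${EXTERNAL_IFACE} here"
fi
```

The check matters on systems with more than one ppp or tunnel interface, where the same hook directory is run for every one of them.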
For anything even remotely complex (5-6 classes or more) it is better if all magic numbers are replaced by variables set elsewhere. /etc/default/qos is a good location. Otherwise, reading anything one week later will require a lot of Nurofen.
For example:
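INTERNAL_IFACE='eth0' ; EXTERNAL_IFACE='ppp0'
# the commands below use both spellings of the interface variables, so define both
INTERNAL_INTERFACE="${INTERNAL_IFACE}" ; EXTERNAL_INTERFACE="${EXTERNAL_IFACE}"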
INTERNAL_ROOT='1' ; EXTERNAL_ROOT='2'
LIMIT='2' # we use X:2 for all LIMIT classes
MANAGEMENT='8'
WWW='16' # we use X:16 for WWW traffic
VOIP='20' # we use X:20 for VOIP traffic
It is a good idea to start by killing any old qdiscs.
tc qdisc del dev ${INTERNAL_IFACE} root
tc qdisc del dev ${EXTERNAL_IFACE} root
Creating new qdiscs (this example is for a DSL link as described in the previous section).
# Note that these are the real bandwidths. They are used in the scheduler calculations
/sbin/tc qdisc add dev ${INTERNAL_IFACE} root handle ${INTERNAL_ROOT}:0 cbq avpkt 1000 rate 100Mbit bandwidth 100Mbit
# this is a guesstimate (we do not know the maximum modem rate via PPTP relay)
/sbin/tc qdisc add dev ${EXTERNAL_IFACE} root handle ${EXTERNAL_ROOT}:0 cbq avpkt 1000 rate 10Mbit bandwidth 10Mbit
Next, it is necessary to allocate a management class on the internal interface (if the system is managed inband). It must have enough bandwidth to operate and must be limited so that management traffic does not affect QoS functionality.
/sbin/tc class add dev ${INTERNAL_INTERFACE} parent ${INTERNAL_ROOT}:0 \
classid ${INTERNAL_ROOT}:${MANAGEMENT} cbq allot 1500 rate 10Mbit prio 1 avpkt 1500 bounded
Usually there is no point in allocating a management class on the external interface. If one is necessary it will have to be allocated from under the limit class, not from under the root.
Next is the limiting class for the QoS conditioning on both interfaces. It must be equal to the bandwidth on the bottleneck link (see one of the previous sections) minus an amount of bandwidth we have decided to sacrifice to make the system work (again - see one of the previous sections).
# example CBQ for ADSL 256/2048
# real incoming bandwidth is 2048. We throw away 64Kbit to be sure it works
# we make the limit strict
/sbin/tc class add dev ${INTERNAL_INTERFACE} parent ${INTERNAL_ROOT}:0 \
classid ${INTERNAL_ROOT}:${LIMIT} \
cbq allot 1500 rate 1984Kbit prio 1 avpkt 1500 bounded
# real outgoing bandwidth is 256. We throw away 32Kbit to be sure it works.
# We make the limit strict
/sbin/tc class add dev ${EXTERNAL_INTERFACE} parent ${EXTERNAL_ROOT}:0 \
classid ${EXTERNAL_ROOT}:${LIMIT} \
cbq allot 1500 rate 224Kbit prio 1 avpkt 1500 bounded
Next step is to allocate classes from under the limit classes (or further down). First - www traffic:
# We limit web browsing to 1024Kbit but leave it burstable so it can use spare bandwidth
/sbin/tc class add dev ${INTERNAL_INTERFACE} parent ${INTERNAL_ROOT}:${LIMIT} \
classid ${INTERNAL_ROOT}:${WWW} cbq allot 1500 rate 1024Kbit prio 1 avpkt 1500
# same for the other direction
/sbin/tc class add dev ${EXTERNAL_INTERFACE} parent ${EXTERNAL_ROOT}:${LIMIT} \
classid ${EXTERNAL_ROOT}:${WWW} cbq allot 1500 rate 64Kbit prio 1 avpkt 1500
Next, a VOIP class for one channel of G729 and signalling. 56Kbit is more than enough.
/sbin/tc class add dev ${INTERNAL_INTERFACE} parent ${INTERNAL_ROOT}:${LIMIT} \
classid ${INTERNAL_ROOT}:${VOIP} cbq allot 1500 rate 56Kbit prio 1 avpkt 1500
# same for the other direction
/sbin/tc class add dev ${EXTERNAL_INTERFACE} parent ${EXTERNAL_ROOT}:${LIMIT} \
classid ${EXTERNAL_ROOT}:${VOIP} cbq allot 1500 rate 56Kbit prio 1 avpkt 1500
The process is continued until the entire tree from the previous section has been written out as tc class statements.
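Once the tree is in place it is worth checking that the kernel sees it the way it was intended. The sketch below wraps the standard `tc -s qdisc show` and `tc -s class show` commands; the function name and the TC override (which defaults to a dry run that only prints the commands) are illustrative assumptions:

```shell
#!/bin/sh
# Dry run by default: the commands are only printed.
# Set TC=/sbin/tc to actually query the kernel.
TC="${TC:-echo /sbin/tc}"

# show_qos_state IFACE...
# Shows qdiscs and classes, with statistics, for every interface given.
show_qos_state() {
    for dev in "$@"; do
        ${TC} -s qdisc show dev "${dev}"
        ${TC} -s class show dev "${dev}"
    done
}

show_qos_state "${INTERNAL_IFACE:-eth0}" "${EXTERNAL_IFACE:-ppp0}"
```

The per-class counters in the output are the quickest way to see whether traffic is ending up in the classes it was meant for.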
It is a good idea to leave priorities for a later stage and get a CBQ scheme to work without them first. They take effect only when both classes have queued traffic and are of little use for very delay sensitive traffic anyway.
Applying a HTB qdisc
HTB is not a good choice for an SME or hobby network. Bandwidth will not be utilised fully and the link efficiency is considerably worse. Its only advantage is that its "ease of understanding" and "predictability" are easier to express in a subletting agreement. The only reason for it being here is that people keep looking for it.
This is the same QoS scheme as in the previous section. Once again, if the qdisc is being applied to an ephemeral interface (ppp, tun, etc) which is deleted and recreated on a casual basis, the qdisc must be applied once the interface is up. It will not be preserved across a ppp or tunneling daemon restart. As a result the qdisc application script will have to be run from /etc/ppp/ip-up.d or its equivalent.
For anything even remotely complex (5-6 classes or more) it is better if all magic numbers are replaced by variables set elsewhere. /etc/default/qos is a good location. Otherwise, reading anything one week later will require a lot of Nurofen.
For example:
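INTERNAL_IFACE='eth0' ; EXTERNAL_IFACE='ppp0'
# the commands below use both spellings of the interface variables, so define both
INTERNAL_INTERFACE="${INTERNAL_IFACE}" ; EXTERNAL_INTERFACE="${EXTERNAL_IFACE}"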
INTERNAL_ROOT='1' ; EXTERNAL_ROOT='2'
LIMIT='2' # we use X:2 for all LIMIT classes
MANAGEMENT='8'
WWW='16' # we use X:16 for WWW traffic
VOIP='20' # we use X:20 for VOIP traffic
It is a good idea to start by killing any old qdiscs.
tc qdisc del dev ${INTERNAL_IFACE} root
tc qdisc del dev ${EXTERNAL_IFACE} root
Creating new qdiscs (this example is for a DSL link as described in the previous section).
# Note that the HTB root qdisc takes no bandwidth parameters. The rates are set on the classes
/sbin/tc qdisc add dev ${INTERNAL_IFACE} root handle ${INTERNAL_ROOT}:0 htb
/sbin/tc qdisc add dev ${EXTERNAL_IFACE} root handle ${EXTERNAL_ROOT}:0 htb
Next, it is necessary to allocate a management class on the internal interface (if the system is managed inband). It must have enough bandwidth to operate and must be limited so that management traffic does not affect QoS functionality.
/sbin/tc class add dev ${INTERNAL_INTERFACE} parent ${INTERNAL_ROOT}:0 \
classid ${INTERNAL_ROOT}:${MANAGEMENT} htb rate 10Mbit prio 1
Usually there is no point in allocating a management class on the external interface. If one is necessary it will have to be allocated from under the limit class, not from under the root.
Next is the limiting class for the QoS conditioning on both interfaces. It must be equal to the bandwidth on the bottleneck link (see one of the previous sections) minus an amount of bandwidth we have decided to sacrifice to make the system work (again - see one of the previous sections).
# example HTB for ADSL 256/2048
# real incoming bandwidth is 2048. We throw away 64Kbit to be sure it works
# we make the limit strict (note - no "ceil" here)
/sbin/tc class add dev ${INTERNAL_INTERFACE} parent ${INTERNAL_ROOT}:0 \
classid ${INTERNAL_ROOT}:${LIMIT} \
htb rate 1984Kbit prio 1
# real outgoing bandwidth is 256. We throw away 32Kbit to be sure it works.
# We make the limit strict
/sbin/tc class add dev ${EXTERNAL_INTERFACE} parent ${EXTERNAL_ROOT}:0 \
classid ${EXTERNAL_ROOT}:${LIMIT} \
htb rate 224Kbit prio 1
Next step is to allocate classes from under the limit classes (or further down). First - www traffic:
# We limit web browsing to 1024Kbit but leave it burstable so it can use spare bandwidth
/sbin/tc class add dev ${INTERNAL_INTERFACE} parent ${INTERNAL_ROOT}:${LIMIT} \
classid ${INTERNAL_ROOT}:${WWW} htb rate 1024Kbit prio 2 ceil 1984Kbit
# same for the other direction
/sbin/tc class add dev ${EXTERNAL_INTERFACE} parent ${EXTERNAL_ROOT}:${LIMIT} \
classid ${EXTERNAL_ROOT}:${WWW} htb rate 64Kbit prio 2 ceil 224Kbit
Next, a VOIP class for one channel of G729 and SIP signalling. 56Kbit is more than enough.
/sbin/tc class add dev ${INTERNAL_INTERFACE} parent ${INTERNAL_ROOT}:${LIMIT} \
classid ${INTERNAL_ROOT}:${VOIP} htb rate 56Kbit prio 1 ceil 128Kbit # we should never need the burst here - paranoia...
# same for the other direction
/sbin/tc class add dev ${EXTERNAL_INTERFACE} parent ${EXTERNAL_ROOT}:${LIMIT} \
classid ${EXTERNAL_ROOT}:${VOIP} htb rate 56Kbit prio 1 ceil 128Kbit
The process is continued until the entire tree from the previous section has been written out as tc class statements.
Classifying traffic
Unfortunately the best described Linux classifiers, like u32 or matching on IP addresses, are the least suitable for an SME or a home user. They simply lack the precision necessary to deal with traffic that is NAT-ed to a single (or just a few) IP address. The only classifier usable in an SME is firewall marking. This is the only classifier which will be looked at from here onwards. If you are interested in the others you need to look elsewhere. Alexey Kuznetsov's original papers or the LARTC are a good starting point.
The firewall marking QoS classifier in Linux (and the similar provisions in the pf packet filtering framework in Open/FreeBSD) does not do classification as such. It trusts a mark which has been placed on the packet elsewhere in the kernel by the firewall rules. For example, on Linux you can place a mark as follows (slightly altered example from the LARTC):
iptables -A OUTPUT -t mangle -o ${EXTERNAL_INTERFACE} -p tcp --source-port 80 -j MARK --set-mark ${WWW}
This statement will mark any outgoing web traffic on a webserver with mark ${WWW}. Once we have the traffic marked, we can filter it into a class and apply traffic limits to it.
/sbin/tc filter add dev ${EXTERNAL_INTERFACE} protocol ip \
parent ${EXTERNAL_ROOT}:0 prio 1 handle ${WWW} \
fw flowid ${EXTERNAL_ROOT}:${WWW}
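With more than a couple of classes the filter statements become repetitive, since in this scheme the firewall mark and the class number are the same value. A sketch of a small wrapper; the function name and the TC override (a dry run that only prints the commands by default) are illustrative assumptions, and the interface and root values are the example defaults from the variables above:

```shell
#!/bin/sh
# Dry run by default: the commands are only printed.
# Set TC=/sbin/tc to actually install the filters.
TC="${TC:-echo /sbin/tc}"

# add_fw_filter IFACE ROOT MARK
# Sends packets carrying firewall mark MARK into class ROOT:MARK on IFACE.
add_fw_filter() {
    ${TC} filter add dev "$1" protocol ip \
        parent "${2}:0" prio 1 handle "$3" \
        fw flowid "${2}:${3}"
}

# WWW (16) and VOIP (20) on the external interface, root handle 2
for mark in 16 20; do
    add_fw_filter ppp0 2 "${mark}"
done
```

Keeping the mark equal to the class number is a convention of this document, not a requirement of tc; the wrapper relies on it.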
We can perform filtering hierarchically, but I will not provide examples for this. It is much easier to comprehend what is happening if the filtering models in the firewall filtering rules and the traffic control filters are identical.
Combining QoS and firewalling
One of the problems for a QoS conditioner in an SME is the lack of sufficient information to classify the traffic. This is especially valid for Linux/iptables (BSD/pf is better in that respect) because of the actual order of rules in the firewall tables and the lack of stateful MARKing in Linux prior to version 2.6.12. While kernels after 2.6.12 can use CONNMARK for stateful marking, I do not have a tested configuration with this feature, so the examples in this article are limited to the pre-2.6.12 stateless approach.
The best way around the lack of stateful MARKing is to introduce state artificially by writing all NAT rules in a manner which allows NAT information to be used for MARKing. Essentially, every class of traffic is NAT-ed to specific IP address(es) and specific port ranges. This approach allows packets coming back from the Internet to be matched based on port numbers in the PREROUTING chain before they are NAT-ed back to their internal addresses. It is not very pretty, but it works quite well in practice.
Major Caveat Alert: cut-and-pasting existing firewall NAT rules and altering them into MARK rules even on a well designed firewall will not work. The reason is that the MARK target is a fall-through target. Matching continues instead of terminating on a successful match. As a result a set of rules designed for NAT or forwarding has to be rewritten in the opposite order for MARKing.
It is tedious and time consuming, but not very difficult. If common variables and macro definitions are used for both rule sets, the likelihood of errors is fairly low.
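The reversal described above can be illustrated with a minimal sketch. It assumes two hypothetical marks (BULK and VOIP), an internal interface called eth1 and a made-up phone address; none of these values come from the configuration in this article. By default the script only prints the rules so that they can be inspected without root. Set IPT=iptables to apply them.

```shell
#!/bin/sh
# Sketch only: prints the rules unless IPT is overridden (e.g. IPT=iptables).
IPT="${IPT:-echo iptables}"

INTERNAL_INTERFACE='eth1'         # assumed interface name
VOIP_PHONE_ADDR='192.168.0.50'    # assumed phone address
BULK=30                           # hypothetical mark for "everything else"
VOIP=10                           # hypothetical mark for the phone

mark_rules() {
    # In a NAT or filter rule set the specific rule comes first and the
    # catch-all last, because the first terminating match wins.
    # MARK does not terminate, so the LAST matching rule wins:
    # the catch-all goes first and the specific rule last.
    $IPT -t mangle -A PREROUTING -i "$INTERNAL_INTERFACE" \
        -j MARK --set-mark "$BULK"
    $IPT -t mangle -A PREROUTING -i "$INTERNAL_INTERFACE" -s "$VOIP_PHONE_ADDR" \
        -j MARK --set-mark "$VOIP"
}

mark_rules
```

With the rules in this order a packet from the phone is first marked as BULK and then re-marked as VOIP, which is the intended result. In the opposite order it would end up as BULK.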
Let's continue the marking example, this time for an Asterisk system which uses SIP (signalling and data are both UDP) and has a source IP of ${VOIP_PHONE_ADDR}:
EXTERNAL_NAT_ADDRESS='1.2.3.4'
VOIP_PORTS_RANGE='32800-40000' ; VOIP_PORTS='32800:40000' # need two notations, bloody iptables
# We have to MARK outgoing traffic on the internal interface before it traverses the QoS system.
iptables -A PREROUTING -t mangle -i ${INTERNAL_INTERFACE} -s ${VOIP_PHONE_ADDR} \
-j MARK --set-mark ${VOIP}
# We now NAT it to a predefined set of ports
iptables -A POSTROUTING -t nat -o ${EXTERNAL_INTERFACE} -p udp \
-s ${VOIP_PHONE_ADDR} -j SNAT --to ${EXTERNAL_NAT_ADDRESS}:${VOIP_PORTS_RANGE}
# Any replies will come back on the same ports as sent out
iptables -A PREROUTING -t mangle -i ${EXTERNAL_INTERFACE} -d ${EXTERNAL_NAT_ADDRESS} \
-p udp --destination-port ${VOIP_PORTS} -j MARK --set-mark ${VOIP}
All traffic that is being NAT-ed from the internal network needs to be described and assigned to non-conflicting port allocations in a similar way. This is quite easy for all traffic that can be NAT-ed without a helper module. Helper module cases like FTP are considerably more difficult. Almost none of the netfilter helper modules have arguments to specify a list of allowed port numbers. Some of these cases are hopeless. Some can be solved using a proxy. For example, the frox ftp proxy allows port ranges to be specified for all types of data and command connections. It can perform some protocol conversions, like active to passive, as well.
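Keeping the port allocations consistent and non-conflicting by hand gets error-prone as the number of classes grows. The following is a minimal sketch of two helpers: one derives the colon notation used by port matches from the dash notation used by SNAT, the other checks whether two allocations overlap. The function names and the WWW_PORTS_RANGE value are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# to_match_notation RANGE: "32800-40000" (SNAT --to style) -> "32800:40000" (match style)
to_match_notation() {
    printf '%s:%s\n' "${1%-*}" "${1#*-}"
}

# ranges_overlap LOW1-HIGH1 LOW2-HIGH2: exit status 0 if the two ranges share a port
ranges_overlap() {
    a_lo=${1%-*}; a_hi=${1#*-}
    b_lo=${2%-*}; b_hi=${2#*-}
    [ "$a_lo" -le "$b_hi" ] && [ "$b_lo" -le "$a_hi" ]
}

VOIP_PORTS_RANGE='32800-40000'
WWW_PORTS_RANGE='40001-50000'     # hypothetical allocation for a second class

# Only one notation is typed by hand, the other is derived from it.
VOIP_PORTS=$(to_match_notation "$VOIP_PORTS_RANGE")

if ranges_overlap "$VOIP_PORTS_RANGE" "$WWW_PORTS_RANGE"; then
    echo "port ranges overlap, fix the NAT plan" >&2
    exit 1
fi
echo "VOIP_PORTS=$VOIP_PORTS"
```

Deriving one notation from the other means a typing error in one of the two strings cannot silently break the match on the return traffic.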
Proxy Servers and Services
Proxy servers and services usually cannot be run on the QoS system. There are two reasons for this:
- CBQ or HTB can be applied only to traffic that is exiting the QoS system (forwarded or locally generated). As a result any traffic terminating on the QoS system itself will not be accounted for.
- Most proxies will pull or push traffic at a rate that is different from the actual rate at which they transmit the results to the client. As a result any bandwidth estimation and any policies applied when the material is being transmitted will be invalid.
As a result any services or proxies will have to be run at least one hop away from the QoS system on the corporate network side. We have to amend Figure 2 accordingly:
Figure 3:
Any services have to be forwarded to the relevant service machine using destination address NAT if applicable.
Any transparent proxying has to be done either before the QoS system, or the QoS system will have to destination NAT the traffic back to a proxy. The only additional complication is that the traffic has to be accounted for QoS only once.
The easiest way to do both transparent services and proxies is to assign multiple loopback aliases on the Proxy and/or Services systems which are used on a per service basis. For example:
# /etc/network/interfaces
auto lo:0 lo:1 lo:2 lo:3 lo:4
# Address used by squid for all outgoing traffic
# /etc/squid.conf: tcp_outgoing_address 192.168.0.1
iface lo:0 inet static
address 192.168.0.1
netmask 255.255.255.255
# Address used by exim for all outgoing traffic
# /etc/exim4/exim4.conf,section remote_smtp:
# interface 192.168.0.2
iface lo:1 inet static
address 192.168.0.2
netmask 255.255.255.255
# Address used by frox for all outgoing traffic
# /etc/frox.conf: TcpOutgoingAddr 192.168.0.3
iface lo:2 inet static
address 192.168.0.3
netmask 255.255.255.255
# Address used by frox, exim and squid in transparent proxying
# must be listened on by all of them
iface lo:3 inet static
address lo:3 inet static
netmask 255.255.255.255
As a result, if all NAT-ed traffic for proxies on the first pass goes to the ethernet interface or to lo:3, it will be very easy to isolate it and ignore it for the purposes of QoS policy. The second pass traffic, the one that actually gets onto the Internet, will be coming from distinct, well defined IP addresses, which will make the definition of firewall rules much simpler.
Caveat: It may be necessary to double NAT, both source and destination, to redirect traffic to these addresses. If we change only the destination address so that the traffic hits the proxy, the proxy will reply directly to the unchanged source address. The reply bypasses the NAT and is not translated back, so the connection will be dropped. Alternatively, all transparent proxying can be done one hop before the QoS system on a "firewall proper", leaving the QoS system to do only policing (and possibly NAT).
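A rough sketch of the double NAT for transparent HTTP proxying is shown below. It is untested and rests on several assumptions: the interface name, the proxy port, the address of the QoS system and the shared transparent address are all hypothetical. Only the squid outgoing address is taken from the loopback example above. The rules are printed rather than applied unless IPT is overridden.

```shell
#!/bin/sh
# Sketch only: prints the rules unless IPT is overridden (e.g. IPT=iptables).
IPT="${IPT:-echo iptables}"

INTERNAL_INTERFACE='eth1'              # assumed interface name
PROXY_TRANSPARENT_ADDR='192.168.0.4'   # hypothetical shared transparent address (lo:3 role)
PROXY_OUTGOING_ADDR='192.168.0.1'      # squid tcp_outgoing_address from the example above
QOS_INTERNAL_ADDR='192.168.0.254'      # hypothetical internal address of the QoS system
PROXY_PORT=3128                        # assumed proxy listening port

transparent_http_rules() {
    # Destination NAT: client web traffic is redirected to the proxy.
    # The proxy's own outgoing address is excluded, otherwise its
    # second pass traffic would be redirected back to itself.
    $IPT -t nat -A PREROUTING -i "$INTERNAL_INTERFACE" ! -s "$PROXY_OUTGOING_ADDR" \
        -p tcp --dport 80 -j DNAT --to-destination "$PROXY_TRANSPARENT_ADDR:$PROXY_PORT"
    # Source NAT: the proxy sees the QoS system as the client, so its
    # replies come back through the QoS system and are translated back.
    $IPT -t nat -A POSTROUTING -o "$INTERNAL_INTERFACE" -d "$PROXY_TRANSPARENT_ADDR" \
        -p tcp --dport "$PROXY_PORT" -j SNAT --to-source "$QOS_INTERNAL_ADDR"
}

transparent_http_rules
```

Older iptables releases expect the negation after the option (-s ! address) instead of before it. The price of the source NAT is that the proxy logs show the QoS system instead of the real client.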
Monitoring and Accounting
Monitoring QoS is actually not one, but two problems, each of which has different specifics and different variables to follow. Long term monitoring follows parameters over periods of time which are longer than any particular congestion or usage event. As a result drops and queue length are of little relevance and monitoring needs to look at class packet and byte counters. When monitoring current state and troubleshooting, the situation is the exact opposite. Packet counters are of little value and the best indicators are queue depths and packet drop values.
Unfortunately neither of these is available in a form that can be plugged into an NMS package like OpenView or a statistics package like mrtg or cricket. Class variables are not present in any of the default host MIBs available in Net-SNMP. To the best of my knowledge there is no project to provide proxy or extension access to them either. The only project that may be of some use is the SNMP-iptables project. I have not tried it and I would not recommend trying it, because it has not been updated beyond an early alpha version for a few years now. As a result the only option to monitor QoS is to poll the results for all classes using the command line tools and parse the output.
Here is an example of how to do this using tc on Linux:
/sbin/tc -s class show dev eth0 > /var/local/stats/stats.eth0
If the QoS conditioner is running diskless, the stats can be picked up from the NFS server which holds the /var/local/stats/ directory. I would recommend running diskless for performance reasons, because it leaves more interrupts usable for scheduling (see the bandwidth estimation). If the QoS conditioner is a standalone system, the monitoring system will need a means of logging in and running tc. This traffic also has to be accounted for in the management class.
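For the troubleshooting case, the drop and overlimit counters can be pulled out of the same saved output. The sketch below assumes that each class is reported as a "class ..." line followed by a "Sent ... (dropped N, overlimits M ...)" line, which is what tc -s prints on the systems I am aware of; check it against the output of your own tc version. The function name is made up for this example.

```shell
#!/bin/sh
# class_pressure FILE: print "classid dropped overlimits" for every class
# found in a file holding the output of "tc -s class show dev ...".
class_pressure() {
    awk '
        $1 == "class" { id = $3 }          # e.g. "class cbq 1:10 parent 1:1 ..."
        $1 == "Sent" {
            dropped = 0; over = 0
            for (i = 1; i < NF; i++) {
                # "+ 0" keeps the leading number and sheds "," or ")"
                if ($i == "(dropped")   { dropped = $(i + 1) + 0 }
                if ($i == "overlimits") { over    = $(i + 1) + 0 }
            }
            print id, dropped, over
        }
    ' "$1"
}

# Example: class_pressure /var/local/stats/stats.eth0
```

Run from cron or in a loop while a problem is happening; the classes whose counters keep growing are the ones under pressure.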
Once the values have been collected they can be fed into mrtg using trivial scripting. For example, a small and ugly script can take two files with output from tc as arguments and print the bytes for two classes in the format mrtg expects for in/out. Feeding the values into something more advanced like cricket is not much different.
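The original script is not reproduced here, so the following is only a sketch of what such a script could look like. It assumes the same tc output layout as above (a "class ..." line followed by a "Sent N bytes ..." line) and prints the four lines mrtg expects from an external program: in value, out value, uptime and target name. The function names and the argument order are my own choices for this example.

```shell
#!/bin/sh
# class_bytes FILE CLASSID: print the byte counter of one class, 0 if not found
class_bytes() {
    awk -v id="$2" '
        $1 == "class"                 { current = $3 }
        $1 == "Sent" && current == id { print $2; found = 1; exit }
        END                           { if (!found) print 0 }
    ' "$1"
}

# mrtg_report IN_FILE IN_CLASS OUT_FILE OUT_CLASS NAME
mrtg_report() {
    class_bytes "$1" "$2"                   # line 1: "in" counter
    class_bytes "$3" "$4"                   # line 2: "out" counter
    uptime 2>/dev/null || echo "unknown"    # line 3: uptime string
    echo "$5"                               # line 4: target name
}

# Example:
# mrtg_report /var/local/stats/stats.eth1 1:10 /var/local/stats/stats.eth0 1:10 voip
```

In mrtg.cfg the script is then used as a target by putting the command in backticks in the Target line.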
This results in graphs like these:
Graph: VOIP Class
Graph: VPN Class
These are off my home office network which is optimized to death and rather dull.
It is also important to follow the memory and the CPU load on the QoS system. Once again the reason is the bandwidth estimation precision. While this is possible via SNMP, there is no point in running SNMP just for that if the QoS statistics are being collected differently. It can be added to the scripts which run tc and provide traffic stats.
There is no difference between Linux and FreeBSD as far as QoS stats are concerned. FreeBSD is equally "bare" and any stats will have to be scripted out of the output of the altqstat utility, which provides similar functionality to tc show.
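As a minimal sketch of such an addition, the two helpers below read the 1 minute load average and the free memory from /proc on Linux. The function names and the stats.host file name in the usage comment are hypothetical; the optional file argument exists only so that the helpers can be tried against a saved copy.

```shell
#!/bin/sh
# load_1min [FILE]: the first field of /proc/loadavg is the 1 minute load average
load_1min() {
    cut -d ' ' -f 1 "${1:-/proc/loadavg}"
}

# mem_free_kb [FILE]: the MemFree value (in kB) from /proc/meminfo
mem_free_kb() {
    awk '$1 == "MemFree:" { print $2 }' "${1:-/proc/meminfo}"
}

# Example, next to the tc output collected earlier:
# { load_1min; mem_free_kb; } > /var/local/stats/stats.host
```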
Improving the QoS Schemes
CBQ on its own does not handle a number of cases very well, and these require further improvements to the QoS scheme. More specifically, when working on the SME gateway and trying to predict traffic on the ISP edge it may have problems with:
- Very "aggressive" TCP flows from high speed download sites which have large TCP windows.
- VPNs which require QoS for some, but not all of the VPN traffic (IP phone inside a VPN scenario). The QoS conditioner cannot see inside the VPN and as a result it cannot provide guarantees to traffic in it.
Nested Qdiscs - RED and Fair Queueing
One of the least advertised, but greatest advantages of the Linux QoS implementation is that a qdisc can be attached to a class in any classful qdisc. This allows different qdiscs to be mixed and matched to polish corner cases which are much harder to handle in other implementations.
Nested Qdiscs - RED
Quite often CBQ by itself does not provide good enough traffic management for traffic which contains fairly "aggressive" TCP flows, like downloads from fast download sites. If these are allowed to burst to the full link bandwidth minus the sacrifice, they may from time to time build up enough packets on the ISP Edge to cause congestion. The best way to deal with the problem is to simulate some congestion a bit earlier than the flow expects so that it is kept under the estimated bandwidth.
The best known algorithm for this is RED. It is not necessarily the best one, but it is the one which most people know and understand. The idea of RED is to start dropping packets the moment the queue fills up above a certain level, which is considerably lower than the full queue, and to increase the drop rate as the queue length increases. If the coefficients are correct, this results in much better management of TCP flows than tail drop and much lower congestion in networks which are not heavily congested.
In Linux RED is a classless qdisc. This means that it will subject all traffic passed to it to the selective drop algorithm. As a result, on its own it is mostly useless for the needs of an SME. To be of use it has to be attached to a class in a classful qdisc like CBQ or HTB and applied only to traffic that needs it. The following example attaches RED to the WWW class from one of the previous sections.
/sbin/tc qdisc add dev eth0 parent ${INTERNAL_ROOT}:${WWW} \
handle 10${INTERNAL_ROOT}${WWW}: \
red limit 24000 min 6000 max 12000 avpkt 1500 burst 5 probability 0.02
10${INTERNAL_ROOT}${WWW} may look slightly weird, but it provides a unique handle which does not overlap with any of the ones we have used for CBQ in the previous sections.
What the other values mean:
| min 6000 | queue size at which RED starts to mark. 6000 bytes is 4 packets in an average download stream. |
| max 12000 | queue size at which we drop with maximum probability - 8 packets |
| limit 24000 | hard queue limit - 16 packets |
| probability 0.02 | probability to drop at queue depth = max. |
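The packet counts quoted above, and the way the drop probability ramps up between min and max, can be checked with a little arithmetic. This is a sketch, assuming the usual linear RED ramp, p = probability * (q - min) / (max - min), applied to the average queue size q rather than the instantaneous one. Behaviour above max is not modelled here. The burst guideline (min + min + max) / (3 * avpkt) is the one given in the tc-red manual page.

```shell
#!/bin/sh
# Values from the tc command above
AVPKT=1500; MIN=6000; MAX=12000; LIMIT=24000; MAXP=0.02

echo "min   = $((MIN / AVPKT)) packets"       # 6000 / 1500
echo "max   = $((MAX / AVPKT)) packets"       # 12000 / 1500
echo "limit = $((LIMIT / AVPKT)) packets"     # 24000 / 1500
echo "burst guideline = $(( (2 * MIN + MAX) / (3 * AVPKT) ))"

# red_drop_probability Q: drop probability for an average queue of Q bytes,
# with Q clamped to the [min, max] interval
red_drop_probability() {
    awk -v q="$1" -v min="$MIN" -v max="$MAX" -v maxp="$MAXP" 'BEGIN {
        if (q < min) q = min
        if (q > max) q = max
        printf "%.4f\n", maxp * (q - min) / (max - min)
    }'
}

red_drop_probability 9000     # halfway between min and max
```

Halfway between min and max the drop probability is half of the configured 0.02, which gives an idea of how gently RED starts to push back on the flow.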
These values are likely to raise some eyebrows in someone who has run RED on its own as being too small. They are correct, because RED in this case is attached to a CBQ class which must already have burst to reach the point where RED is activated. This means that the traffic is subjected to at least some conditioning from CBQ delays. Also, this is for fairly low bandwidth, so the usual RED parameters from a backbone setup have to be adjusted down to the point where they are effective when dealing with 1-2 aggressive flows, not the many flows they deal with on a typical backbone. Finally, the goal is not to manage traffic with RED only. The idea is to use RED to decrease the "aggressiveness" of TCP flows so that CBQ can do its job and keep the overall link uncongested. In theory, similar results can be achieved by limiting window sizes on all clients or by passing all client traffic through a proxy which has a limit on the window size. In practice the link utilisation with CBQ/RED is likely to be better because window sizes will not be hammered down before the class reaches its allotted capacity.
Nested Qdiscs - SFQ
Fair queueing (a subset of which, Stochastic Fair Queueing, is present in Linux) has long been promoted as the ultimate solution for QoS by one of the most popular Internet router vendors. In fact, the vendor in question used to require people sitting its certification exams to proclaim this queueing discipline to be the holy grail of QoS and the only QoS necessary. As quite often happens with the vendor in question, it has made a complete U-turn. It still sticks the words fair and weighted fair all over the place in its QoS literature, but it has finally got some QoS clue and has admitted that fair queueing has to be subordinate to a classful qdisc.
Fair Queueing (and SFQ on Linux) provides the following advantages:
- Fair queueing will eliminate the typical zig-zag/up-down TCP throughput behaviour when multiple TCP streams are competing for bandwidth. It can do this reasonably well for a low number of streams and they will reach a nice smooth throughput.
- Fair queueing will provide new streams with some "beginner's luck" which will allow new TCP connections to be initiated faster.
- Fair queueing will provide interactive traffic with lower average latency if the interactive traffic is a relatively low proportion of the overall traffic flow both in terms of packet rates and bandwidth.
Fair queueing (and SFQ on Linux) fails for the following:
- Fair queueing cannot provide sufficient advantage to VOIP traffic on a congested link. In many cases it will increase the jitter compared to a bog standard tail drop. Many of the perceived advantages when using fair queueing come from other factors causing queue perturbation. For example, Voice Activity Detection is capable of introducing similar effects as discussed in the Traffic Dimensioning for GSM over IP paper I did for an IEE QoS conference in 2004 (pdf available on request, email me if interested).
- When used one hop away from the bottleneck, fair queueing cannot provide any advantages without working in conjunction with CBQ or HTB. These limit the traffic to a specific bandwidth and queue it. From there on SFQ can do its work.
Overall fair queueing can be quite beneficial for some common scenarios. It will improve user experience when it is attached to a CBQ or HTB class which fits the following criteria:
- The class is often over limit
- More than one TCP stream is active simultaneously within the class, but the overall number of active streams is considerably smaller than the queue depth (in packets).
- The class does not contain VOIP or any other media streams where software behaviour depends on jitter
- The class is not the primary contributor to the link congestion
Good examples of such traffic are instant messaging, https or any other TCP traffic different from http. The HTTP class is usually the main congestion contributor, so RED, described in the previous section, provides considerably better results for it. Otherwise, attaching an sfq qdisc to a CBQ or HTB class is trivial.
/sbin/tc qdisc add dev eth0 parent ${INTERNAL_ROOT}:${HTTPS} \
handle 10${INTERNAL_ROOT}${HTTPS}: sfq perturb 10
Fair queueing is a major difference between FreeBSD ALTQ and Linux. In FreeBSD fair queueing and RED are options to the normal CBQ qdiscs. There is no need to configure an extra qdisc and attach it to the master class.
Classes Based on Content
This can be done for http and ftp by using different squid outgoing addresses based on an ACL, using the tcp_outgoing_address 192.168.0.X ACL syntax. This type of ACL is processed only prior to connection establishment, so it is not possible to make these truly content dependent, because content is negotiated after the connection has been established. The closest possible approximation is to do this based on a URL regexp and create ACLs with "slow" and/or "fast" extensions.
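A minimal sketch of such a squid configuration fragment is shown below, wrapped in a shell function so that it can be generated together with the firewall rules. The ACL name, the list of extensions and the 192.168.0.5 address are hypothetical. The address would need its own loopback alias and its own class in the QoS configuration. 192.168.0.1 is the general squid outgoing address from the loopback example earlier.

```shell
#!/bin/sh
# squid_class_fragment: print a squid.conf fragment which sends requests for
# "slow" content out from a separate source address.
squid_class_fragment() {
    cat <<'EOF'
# Large downloads, recognised by file extension only (hypothetical list)
acl slow_content urlpath_regex -i \.iso$ \.zip$ \.tar\.gz$
# The first matching tcp_outgoing_address line is used
tcp_outgoing_address 192.168.0.5 slow_content
tcp_outgoing_address 192.168.0.1
EOF
}

squid_class_fragment
```

The firewall then marks traffic from 192.168.0.5 into the "slow" class in exactly the same way as for any other source address.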
Another possible approach may be the sourceforge L7 filter project. I have not used it so I cannot provide any information on it. An example of using it for controlling traffic can be found on the Gentoo Wiki pages.
QoS - VPN and QoS - Security interactions
This section will appear shortly
Conclusion
It is possible to achieve reasonable QoS without being in control of the bottleneck link. It is neither easy nor trivial, but if a person is not greedy it is possible. Some bandwidth will have to be sacrificed in nearly all cases.
References and Links