
How to build an RFC3258 DNS System

Introduction

This is the way some Tier 1 providers, such as Level 3, Verio and MCI, run their DNS. As far as I know, some of the root servers are run this way as well.

Foreword

First of all, this is not the design that was in use at the Tier 1 provider where I used to work. It is a design I wrote later for a presentation at a UK ISP, and it fixes many of the problems in implementing RFC3258 that I have seen first hand.

I came across the old slides by chance in December 2005. After looking at them, I decided to put the design into the public domain, as it does me no good to leave it to bitrot. It is mostly useless in the current UK ISP environment: the ISPs that have the capability to implement RFC3258 have already done it, and the ones that do not, well... do not. I hope it is useful to someone somewhere out there.

All configuration examples assume Debian, Quagga, Bind and Cisco. It should not be a problem to adapt them to an alternative Unix, routing protocol stack, DNS daemon and router vendor.

What is RFC3258

The idea of RFC3258 is fairly simple. Multiple servers are placed in different locations and answer queries directed to the same IP address. The idea itself is quite old. It was originally presented in one of the early RFCs - RFC1546 - Anycasting. The decision about who answers is made entirely by routing and nothing else. There are no cluster controllers, no redirectors, no complex logic, nothing.

3258-1.png

Figure 1

Both Client 1 and Client 2 try to query a nameserver at the same address (1.2.3.4). They both get an answer, but from different nameservers (usually the one closest to each client). If DNS2 is down or unreachable, as in Figure 2, the queries will hit server 1 and both clients will receive a response without having to time out and query secondaries.

3258-2.png

Figure 2
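The behaviour in Figures 1 and 2 can be modelled in a few lines of Python. This is only a toy sketch: the routing costs and the names client1, client2, DNS1 and DNS2 are made up for illustration, and in a real network the choice is made hop by hop by the routers, not by a function call.

```python
# Toy model of anycast: the routing table, not the client, picks the server.

# Hypothetical routing cost from each client to each instance of 1.2.3.4.
COSTS = {
    "client1": {"DNS1": 10, "DNS2": 30},
    "client2": {"DNS1": 30, "DNS2": 10},
}


def pick_server(costs, alive):
    """Return the reachable server with the lowest routing cost, or None."""
    candidates = {name: cost for name, cost in costs.items() if name in alive}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


def answering_servers(alive):
    """Map every client to the server that would answer its queries."""
    return {client: pick_server(costs, alive) for client, costs in COSTS.items()}
```

With both servers announced each client reaches its nearest instance. Once DNS2 stops being announced, both clients end up at DNS1 without changing the address they query.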

The approach results in:

  • Redundancy on a geographical scale. Taking out any single point of presence does not take out DNS from the customer perspective.
  • Low DNS query latency across the entire network
  • If executed correctly it can be done with minimal capital and operational expenditure.

The RFC refers mostly to DNS, but it is applicable to any service that is based on UDP and is stateless. For example some of the implementations can be applied to NTP as well.

How NOT to do RFC3258

There are multiple ways to get RFC3258 wrong. Nearly all of them have the following common underlying cause:

Oh my god, a Unix system running a routing protocol. The world has ended. It may even forward a packet. Run for your lives (or certifications).

Well, the Internet (and ARPANET before that) survived by Unix systems forwarding packets and running routing protocols for a very long time before any of the current network vendors came about. Even now, a large portion of the packets out there have traversed at least one Unix system. Quite a few of these systems run routing protocols as well.

A low end PC based Unix system can easily exceed 20000 packets per second. While it is better from an operational perspective for servers not to forward packets in their normal mode of operation, there is no technical or performance reason to forbid failure modes where they forward some or all of the DNS traffic. Similarly, there is no technical reason for a server not to run a small routing protocol subset. The currently available implementations are mature and reliable, and have no problem running a small, mostly-IGP routing table.

Here are some (but not all) of the well known ways to get RFC3258 wrong.

Layer 2 Link Failure detection with no routing protocol

3258-3.png

Figure 3

This is usually the first idea for RFC3258 suggested by network engineers:

Why bother to run a routing protocol? Let's assign the common unicast address (1.2.3.5 on the diagram) to the upstream end of a /30 link from the router. If the system fails, the router will detect the link failure at Layer2, the interface will be marked as down and the route will be withdrawn from the routing table.

This does not work for most failure modes.

The reason is that a failed modern Unix box (especially based on PC hardware) will not take down Layer 2 on Ethernet. The Ethernet silicon is usually designed to support wake-up on LAN. In order to do this, Layer 2 will remain active even if the system itself is down and for some cards even if the OS does not support wake-up on LAN. As a result, for the majority of failure modes, the router has no means of determining if the box is dead or alive from Layer2 only.
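Since link state says nothing about whether the name server is alive, the only dependable check is to ask it a question. The following is a minimal sketch of such a probe in Python, using only the standard library. The function names are hypothetical, and the zone to query is whatever test zone the server is known to carry.

```python
import socket
import struct


def build_soa_query(zone, query_id=0x1234):
    """Encode a minimal DNS query for the SOA record of `zone` (RFC 1035 wire format)."""
    # Header: ID, flags (plain standard query), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in zone.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 6, 1)  # QTYPE=SOA (6), QCLASS=IN (1)
    return header + question


def server_answers(server, zone, timeout=2.0):
    """True if `server` sends back any DNS response to an SOA query over UDP."""
    query = build_soa_query(zone)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # timeouts and unreachable errors are OSError subclasses
            return False
    # A reply must echo our query ID and have the QR (response) bit set.
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)
```

A box whose Ethernet is still up but whose named has died fails this probe, which is exactly the case Layer 2 detection misses.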

Running the main backbone IGP on the servers (or a leaf of it)

This may have been possible in the pre-MPLS days. It is not realistic now for fully featured IGPs. Stubs or NSSAs are also not applicable because the DNS servers have to announce specifics back into the backbone.

Running an IGP on the servers and redistributing into the core IGP

3258-4.png

Figure 4

This is not a bad idea if it is executed correctly. It is inferior to the design suggested later on, but it is something that can be made to work. In fact, this is the way many people with working RFC3258 do it. It has its fair share of caveats:

If the DNS servers and the core use the same IGP there is a possibility of some very nasty race conditions, standing wave route flaps, and other mayhem. The reason for this is that instabilities in the IGP at the servers have the same time to converge as the backbone. As a result if a route starts flapping in the server zone it forces recomputation of the backbone IGP at regular intervals along a large part of the backbone. This can be solved by altering timers from default values, but the effects will still be unpleasant.

ISP IGPs are usually multipath. While it is the essence of RFC3258 to provide multiple name servers answering for the same address, it is considerably better if the same customer from the same location consistently hits the same nameserver. This makes tracing customer complaints and debugging considerably easier. With multipath this is unlikely to be the case, as there may be multiple routes with the same metric visible at the point where the customer query enters the network.

The previous point about multipath is even more valid for NTP. In the case of NTP there should be no points in the network where there is a multipath to several alternative locations with the common anycast address.
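To see why multipath makes the answering server unpredictable, consider a router that chooses among equal-cost next hops by hashing the flow identifiers. The sketch below is an illustration only: the CRC32 hash and the function names are assumptions, and real routers use their own vendor specific hashing or per-packet schemes.

```python
import zlib


def ecmp_next_hop(src_ip, src_port, dst_ip, dst_port, next_hops):
    """Pick one of several equal-cost next hops by hashing the flow identifiers."""
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode("ascii")
    return next_hops[zlib.crc32(key) % len(next_hops)]


def servers_hit(src_ip, src_ports, next_hops):
    """Set of servers one client reaches when only its source port varies."""
    return {ecmp_next_hop(src_ip, port, "1.2.3.4", 53, next_hops) for port in src_ports}
```

A resolver library normally uses a different source port for each query, so with two equal-cost paths `servers_hit("10.0.0.1", range(1024, 1124), ["DNS1", "DNS2"])` can return both servers for one and the same customer. With a single best path the result is always the same server.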

Running a cluster controller in front of DNS servers and redistributing routes from it

First of all, if there is only one instance of the cluster, this is not RFC3258. This is someone approaching the unfamiliar problem of DNS redundancy from a familiar perspective (usually corporate IT). It does not offer any of the major benefits of RFC3258:

  • There is no redundancy on a geographical scale. Taking out the point of presence where the cluster is located will take out all DNS.
  • The query latency depends on latency to reach the cluster. There is no latency benefit.
  • There is additional capital expenditure for the cluster controller and possibly additional operational expenditure related to keeping the system running.

The Solution - an RFC2270 DNS Server Group

RFC2270 is a common approach for connecting a customer which needs to talk a routing protocol and announce a few networks to an ISP backbone.

3258-6.png

Figure 6

Multiple customers talk BGP to the ISP using the same martian AS (Figure 6). All of them receive only a default route from the ISP and announce some of their own routes. The ISP strips the customer martian AS prior to announcing any routes to peers (or announces aggregates). The benefits of RFC2270 are:

  • A customer can be multiply connected to the ISP and talk a routing protocol to it without having portable address space or an AS proper
  • The resources on the customer router to implement RFC2270 are minimal. It does not receive a full routing table and announces only a limited set of routes. This can be done on low end kit.

RFC2270 is a very good setup for connecting RFC3258 DNS (and not only DNS) servers to the backbone. It is even better when they are grouped into small networks connected to the backbone via RFC2270 (Figure 7).

3258-7.png

Figure 7

Every network consists of several servers. For example: 2+ recursive DNS resolvers to be queried by customers, 2+ authoritative DNS servers, etc. Each of these exports into the IGP (best of all OSPF, though RIP is also possible) at least two addresses:

  • A globally unique address that can be used as a source address for DNS queries by the server. It will also be used for diagnostics and monitoring later on.
  • A single instance for each of the shared unicast addresses used in the network (1.2.3.4, 1.2.3.5 on the diagram). While it is possible to design each group to contain multiple servers with the same shared unicast address, it is better if more groups are deployed instead. The reason is that groups with multiple instances of the same shared unicast address in them will most likely need to have multipath working. This is possible with Linux, and it is even possible to configure sensible route caching. It is also possible (at least with some extra patches) on some BSD derived Unixes. In all cases the server configuration is likely to become too complex to be used in a production environment.

From now on this setup will be referred to as 3258/2270 (after the RFCs on which it is based).

The primary advantages of using 3258/2270 are:

  • Any routing instabilities in the server group will be dampened on announcing the addresses to the backbone via BGP. If the instabilities are excessive the server group will flap itself out and other server groups will take over. As a result any network problems in a server group will have minimal or no impact on the backbone. It will be easy to trace instabilities as well.
  • For every point in the network the common unicast addresses announced to the backbone will be reachable by a single path which is selected using standard BGP selection rules. Any multipath effects will be negligible. This will allow the implementation of other protocols like NTP. The server group is also a natural place to put any services that do not announce common unicast, but have to be dispersed along the backbone nonetheless: mail relays, Internet News servers, network test and performance metrics equipment, etc. All of these will benefit from having a set of redundant name servers nearby as well as any network redundancy designed as a part of a 3258/2270 DNS implementation.
  • The capital expenditure is considerably less than the expenditure for a cluster. The setup can be implemented on something as low as a couple of lower end 26xx or 36xx series Cisco routers along with two 802.1q capable layer 2 switches per server group. Alternatively, a higher end switch capable of talking BGP or a virtual router on high end colocation equipment can do this as well. All of these are cheaper than the head of most clusters.
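The dampening mentioned in the first point can be pictured as a penalty that grows with every flap and decays exponentially over time. The Python sketch below follows that general model. The constants (1000 per flap, suppress above 2000, reuse below 750, 15 minute half-life) are commonly quoted router defaults and are used here as assumptions, not as values taken from this design.

```python
PENALTY_PER_FLAP = 1000   # assumed penalty added on every flap
SUPPRESS_LIMIT = 2000     # assumed: route is suppressed above this penalty
REUSE_LIMIT = 750         # assumed: route is announced again below this penalty
HALF_LIFE_S = 15 * 60     # assumed: penalty halves every 15 minutes


def decayed(penalty, elapsed_s):
    """Penalty left after exponential decay over elapsed_s seconds."""
    return penalty * 0.5 ** (elapsed_s / HALF_LIFE_S)


def simulate(flap_times_s, check_time_s):
    """Return (penalty, suppressed) at check_time_s for a route that flapped at flap_times_s."""
    penalty, suppressed, last = 0.0, False, 0.0
    events = [t for t in sorted(flap_times_s) if t <= check_time_s]
    for t in events + [None]:          # None marks the final check itself
        now = check_time_s if t is None else t
        penalty = decayed(penalty, now - last)
        last = now
        if suppressed and penalty < REUSE_LIMIT:
            suppressed = False
        if t is not None:
            penalty += PENALTY_PER_FLAP
            if penalty > SUPPRESS_LIMIT:
                suppressed = True
    return penalty, suppressed
```

With these numbers a server group that flaps three times within two minutes is suppressed, and it comes back roughly an hour later once the penalty has decayed. During that time the other groups keep answering on the shared address.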

Setting up a RFC3258/RFC2270 DNS Server Group

The following sections assume bind. They also assume a DNS installation which has been designed according to current best practices. More specifically it is assumed that:

  • Any authoritative nameservers do not have recursion turned on
  • Any recursive nameservers do not carry any zones except the ones used for internal purposes

It may be possible to apply this approach to an installation that is not set up according to the current best practice. It will be more difficult to monitor and troubleshoot.

IP Addressing and Network Configuration

Each DNS Server must have at least one ethernet interface and at least 2 loopback alias interfaces. While it is better if all servers in a server group are located at the far end of their own /30s this is not strictly necessary. The ethernet may be shared as well.

#/etc/network/options
ip_forward=yes 
spoofprotect=no 

If OSPF is in use, forwarding must be turned on. It is part of the routing protocol semantics. Every system talking OSPF is presumed to be able to forward packets. While it may not do so in its normal mode of operation, it may do so in some failure modes. There is no point trying to avoid it.

It may be possible to design the setup in a manner where all routing is fully symmetric, but it is best not to bother. Hence the antispoofing is off.

#/etc/network/interfaces
auto eth0 eth1 lo:0 lo:1
iface eth0 inet static
   address 2.3.4.5
   netmask 255.255.255.252
   broadcast 2.3.4.7
iface eth1 inet static
   address 2.3.5.5
   netmask 255.255.255.252
   broadcast 2.3.5.7
# server ID address
iface lo:0 inet static
   address 2.3.6.1
   netmask 255.255.255.255
# common unicast address
iface lo:1 inet static
   address 1.2.3.5
   netmask 255.255.255.255

If multiple ethernets are in use, the server must have more than one loopback alias to function in all failure modes.

  • lo:1 is used as the common unicast address across the network. It is the address which the customers will query.
  • lo:0 is used as the local server unique address. It is the source address for any DNS queries, router-id address, address used for diagnostics, etc.
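The /30 arithmetic behind the ethernet stanzas is easy to get wrong by hand. A small sketch using Python's standard ipaddress module can double check it. The helper name describe_link is hypothetical.

```python
import ipaddress


def describe_link(address, netmask):
    """Derive network, broadcast and peer address for one end of a point-to-point subnet."""
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    net = iface.network
    hosts = [str(h) for h in net.hosts()]
    return {
        "network": str(net),
        "broadcast": str(net.broadcast_address),
        "hosts": hosts,
        # On a /30 the only other usable address belongs to the upstream router.
        "peer": [h for h in hosts if h != str(iface.ip)],
    }
```

For the first ethernet above, `describe_link("2.3.4.5", "255.255.255.252")` gives network 2.3.4.4/30, broadcast 2.3.4.7 and peer 2.3.4.6. These are the values that have to match the network statements and the router interface address in the routing configuration.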

Bind Configuration

I am not going to go into the fine points of running bind in an ISP like resource limits. There are plenty of documents on the net about this. The only items mentioned will be the ones that matter in the context of this article.

Authoritative Nameserver Configuration

The interesting parts of the config are the query address and the listening address.
# /etc/bind/named.conf.options (debian specific)
options {
   listen-on {                 
      1.2.3.5; # shared unicast
      2.3.6.1; # unique loopback
   };
   query-source address 2.3.6.1; 
   notify no; 
   recursion no;
};

The servers listen both on the shared unicast and on at least one more unique address. This is essential for testing the functionality of every server and the network as a whole. This is also a major advantage compared to clusters and other approaches. Every server in the load balance pool can be queried from the Internet and tested to see if it replies correctly.

The notifications are completely turned off. It is not possible to distribute zones using normal DNS means in an RFC3258 environment, so they have to be pushed via rsync, scp or a similar method. Also, if the secondary nameserver is itself an RFC3258 installation, only one secondary server will receive the notification.

Recursion is turned off as it is a dedicated authoritative name server.

The same config is used on nameservers that are declared as secondary by the customers in their zone files. Note, the servers are declared, but not used as such per normal DNS transfer semantics (in fact cannot be). The mechanism for providing secondary name service is explained later on.

Resolver Configuration

# /etc/bind/named.conf.options (debian specific)
options {
   listen-on {                 
      1.2.3.6; # shared unicast
      2.3.6.2; # unique loopback
   };
   query-source address 2.3.6.2; 
   notify no; 
   recursion yes;
};

It is essentially the same as the authoritative name server configuration with one minor difference - recursion is turned on. It is not carrying any zones except the ones used for testing (explained later on).

Staging Server Configuration

# /etc/bind/named.conf.options (debian specific)
options {
   listen-on { 
      2.3.6.254; # unique loopback
   };
   query-source address 2.3.6.254;
   transfer-source 2.3.6.254;
   notify no; 
   recursion no;
};

  • The customers are given the staging server unique loopback as a transfer address which should be allowed to transfer the zone files.
  • The customers are asked to configure their nameserver to send notifies to this address using the also-notify option.
  • This server is polled regularly by external software for changes in serial numbers for all zones to be pushed out (the code using perl Net::DNS is utterly trivial so I will not post it here). If any zone has changed on the server it is dumped locally to a file using a zone transfer and pushed out using whatever method of pushing has been chosen for the authoritative name servers. It is important to get the zone using AXFR and not from the zone files, because zone transfer is atomic. This may not be the case when trying to use zone files stored by bind itself. The query portion of the same software can be reused to verify that the zone has been propagated to all servers.

It is a good idea to load any zone changes on this server(s) as well and use the SOA change/push mechanism from the previous point for internal zones as well as customer slave zones. This will decrease the chance for any operational errors.
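The decision part of the polling software boils down to comparing SOA serials. The original was written in perl with Net::DNS and is not reproduced here. The following is an independent sketch in Python of the comparison only, assuming the serials have already been fetched, and using the wrap-around comparison from RFC 1982. The function names are hypothetical.

```python
def serial_gt(new, old):
    """True if 32-bit SOA serial `new` is newer than `old` (RFC 1982 serial arithmetic)."""
    diff = (new - old) % 2**32
    return 0 < diff < 2**31


def zones_to_push(staging_serials, published_serials):
    """Zones whose staging serial is newer than the published one, or not yet published."""
    return sorted(
        zone
        for zone, serial in staging_serials.items()
        if zone not in published_serials
        or serial_gt(serial, published_serials[zone])
    )
```

For every zone returned, the software would do an AXFR from the staging server into a file and hand it to the push mechanism. The same comparison can be run against each authoritative server afterwards to confirm that the push arrived.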

This nameserver should not be queried by anyone except the customers who have zone files on it. In fact it may be a good idea to generate an ACL (bind or ip filtering) which limits access to port 53 only to legitimate primary nameservers.

Routing Protocol setup for a RFC2270 DNS Server Group

This is the setup for OSPF. It is also possible to use RIP. I am not following ISIS on unix systems but I do not see a reason to use ISIS for 6-8 servers anyway.

Here is an example on how to do it (note - this is quagga syntax, looks like cisco but there are some subtle differences):

interface eth0
  ip ospf authentication message-digest
  ! MD5 authentication takes its key from message-digest-key;
  ! key ID 1 is an assumption and must match the upstream router
  ip ospf message-digest-key 1 md5 ex4mpl3
  ip ospf cost 10
  ! we will set the upstream routers to 1 and 2
  ip ospf priority 3
interface eth1
  ip ospf authentication message-digest
  ip ospf message-digest-key 1 md5 ex4mpl3
  ip ospf cost 20
  ! we will set the upstream routers to 1 and 2
  ip ospf priority 3
router ospf
  ospf router-id 2.3.6.1
  ! globally unique loopback
  network 2.3.6.1/32 area 0.0.0.0
  ! shared unicast
  network 1.2.3.5/32 area 0.0.0.0
  network 2.3.4.4/30 area 0.0.0.0
  network 2.3.5.4/30 area 0.0.0.0
  area 0 authentication message-digest

There are only two minor caveats:

  • Quagga does not like having its router ID changed, hence it should be forced to the globally unique loopback.
  • Multipath will require configuring route caching, hence the interfaces are deliberately weighted asymmetrically.

The configuration on the upstream router is slightly more complex (but not much).

interface Loopback1
   ip address 2.3.6.253 255.255.255.255
! 802.1q to get to each server on a separate /30
! (each subinterface also needs its encapsulation dot1Q line; VLAN IDs are site specific and not shown)
interface FastEthernet0.1
   ip address 2.3.4.6 255.255.255.252
   ip ospf authentication message-digest
   ip ospf message-digest-key 1 md5 ex4mpl3
   ip ospf cost 10
   ip ospf priority 1
interface FastEthernet1.1
   ip address 2.3.5.6 255.255.255.252
   ip ospf authentication message-digest
   ip ospf message-digest-key 1 md5 ex4mpl3
   ip ospf cost 20
   ip ospf priority 1
router ospf 1
   ! we get only default from BGP, but we doublecheck it anyway
   redistribute bgp 64514 route-map default-only
   network 2.3.6.253 0.0.0.0 area 0.0.0.0
   network 2.3.4.4 0.0.0.3 area 0.0.0.0
   network 2.3.5.4 0.0.0.3 area 0.0.0.0
   area 0 authentication message-digest
router bgp 64514
   no synchronization
   neighbor 3.4.5.6 remote-as 64515
   neighbor 3.4.5.6 send-community
   neighbor 3.4.5.6 route-map default-only in
   neighbor 3.4.5.6 route-map rfc3258rfc2270 out
   !
   neighbor 3.4.5.7 remote-as 64515
   neighbor 3.4.5.7 send-community
   neighbor 3.4.5.7 route-map default-only in
   neighbor 3.4.5.7 route-map rfc3258rfc2270 out
   !
   network 2.3.6.253 mask 255.255.255.255
   ! shared unicast addresses for servers (as many as necessary)
   network 1.2.3.4 mask 255.255.255.255
   network 1.2.3.5 mask 255.255.255.255
   ! unique addresses for servers (as many as necessary)
   network 2.3.6.1 mask 255.255.255.255
   network 2.3.6.2 mask 255.255.255.255
route-map default-only permit 10
   match ip address prefix-list default
ip prefix-list default seq 5 permit 0.0.0.0/0
ip prefix-list default seq 100 deny 0.0.0.0/0 le 32
route-map rfc3258rfc2270 permit 10
   ! ouraddresses is an access list covering the prefixes above (definition not shown)
   match ip address ouraddresses
   ! set communities
   ! different community on every unicast export is a good idea
   ! for all practical purposes - optional bells and whistles
This routing configuration is quite simple. Most BGP/OSPF configurations out there are considerably more complex than this.

Zone and configuration propagation

The easiest and by far the most reliable method to propagate zones is rsync over ssh. This is best done from a dedicated staging server as described above. It is also best done inband, not via an out of band management network. The reasons for this are as follows:

  • In order to successfully spoof a TCP connection to a 3258/2270 group, someone needs to be able to inject information into an ISP BGP table. If someone can do this, a spoofed 3258/2270 connection is the least of the ISP's problems.
  • Even if the connection is spoofed, SSH using public keys will immediately pick up a discrepancy in the server keys and an alert will be raised.
  • After spending the effort to build a resilient network for DNS, using a less resilient network for all data transfers does not make a lot of sense.
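A push of this kind can be sketched in Python as below. The directory names and function names are assumptions made up for the example. The rsync and ssh options used (-a, -z, --delete, -e, BatchMode, StrictHostKeyChecking) are standard ones, and the targets are the unique loopback addresses of the servers, never the shared unicast address.

```python
import subprocess


def rsync_command(zone_dir, server, remote_dir):
    """Build the rsync-over-ssh command that mirrors zone_dir to one server."""
    return [
        "rsync", "-az", "--delete",
        # Refuse to talk to a host whose key is unknown or has changed.
        "-e", "ssh -o BatchMode=yes -o StrictHostKeyChecking=yes",
        zone_dir.rstrip("/") + "/",
        f"{server}:{remote_dir}",
    ]


def push_zones(zone_dir, servers, remote_dir, run=subprocess.run):
    """Push to every server and return the list of servers where the push failed."""
    failed = []
    for server in servers:
        result = run(rsync_command(zone_dir, server, remote_dir))
        if result.returncode != 0:
            failed.append(server)
    return failed
```

Calling `push_zones("/var/cache/staging/zones", ["2.3.6.1", "2.3.6.2"], "/etc/bind/zones/")` would return the servers that need attention. A host key mismatch makes ssh exit with an error, so a spoofed endpoint shows up in that list instead of silently receiving data.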

Monitoring Setup

RFC3258 is notorious for the difficulties in monitoring it. 3258/2270 can make a number of the difficulties less prominent:

  • 3258/2270 allows communities to be used to tag routes from a specific server group. From there on, determining which server is answering is trivial. This along with the inherent features of a BGP connection solves most of the routing problems.
  • The software to poll/push customer zones as a secondary as described above in the staging server section can double up for monitoring.
  • In addition to monitoring by the staging software, a dummy zone can be pushed to all recursive servers and another one to all non-recursive servers. The SOA on the zones can be used for checks using the standard DNS monitor plugins for mon, netsaint, nagios and many of their commercial lookalikes. Any server with an out of date SOA can be picked up, isolated and fixed. Alternatively it can be pulled out of the pool by removing its BGP network statement from the routers upstream (or amending the ACLs).
  • A great advantage of 3258/2270 compared to a cluster is that all servers can be checked inband from the perspective of a customer. This is usually not possible for a cluster where the diagnostics of the elements need to be done out of band (or behind the cluster head).
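The dummy zone check from the list above amounts to asking every server, on its unique address, for the SOA serial and comparing it with the expected one. A minimal Python sketch of that loop follows. The lookup itself is passed in as get_serial, a hypothetical callable standing in for whatever DNS library or monitoring plugin does the actual query.

```python
def find_stale_servers(servers, expected_serial, get_serial):
    """Return {server: reason} for every server that is unreachable or out of date."""
    problems = {}
    for server in servers:
        try:
            serial = get_serial(server)
        except OSError as exc:  # timeouts and unreachable errors
            problems[server] = f"no answer ({exc})"
            continue
        if serial != expected_serial:
            problems[server] = f"serial {serial}, expected {expected_serial}"
    return problems
```

Any server that shows up in the result is a candidate for being pulled out of the pool until it is fixed.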

Applicability to other protocols

NTP

3258/2270 is immediately applicable to NTP in non-BGP multipath environments. In fact it can be run on any of the DNS servers themselves provided that they have been tuned accordingly (no locate, man-db, etc jobs or anything of the like). NTP has a number of bugs and problems related to the way it supports aliased interfaces, and the ntp developers do not consider any of them urgent, so they are unlikely to be fixed. These are mostly in the client or peer parts of the code, so this should not affect its function as a server (this is as of version 4.2.0).

SMTP (anycast)

In my opinion anycasting for TCP protocols in today's networks is a form of ritual suicide. It is usually quite messy. 3258/2270 does not remove the suicide part, but it makes it considerably less messy. The usage of BGP and flap dampening will quickly isolate any groups that are having routing problems. The instabilities are considerably less than with IGP redistribution. As a result it will work for many cases where a purely IGP based anycast solution will fail.

SMTP (part of a server group)

It is not necessary to use anycast to extract significant benefits from 3258/2270 for a mail relay. If it is set up in the same manner as the DNS servers and operates strictly from a loopback alias, it will benefit from the extra resilience. In addition, if a server is down, any connection attempts will result in a quick network unreachable error instead of a timeout. As a result a relay talking to another relay will fall back nearly immediately instead of waiting. This will clearly improve the performance of a large MX pool when one of the members is down.
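The fallback behaviour can be illustrated with a small Python sketch of a relay walking through its MX pool. The function name and the host names in the usage note are hypothetical. The point is only that an immediate OS error moves the loop on at once, while a silent host costs the full timeout.

```python
import socket


def connect_first_available(relays, port=25, timeout=30.0,
                            connect=socket.create_connection):
    """Try each relay in order and return (host, connection, errors so far)."""
    errors = {}
    for host in relays:
        try:
            return host, connect((host, port), timeout=timeout), errors
        except OSError as exc:
            # "Network unreachable" arrives immediately; a dead host on a
            # live subnet would instead block here until the timeout expires.
            errors[host] = exc
    return None, None, errors
```

With `connect_first_available(["mx1.example", "mx2.example"])`, a withdrawn loopback route for mx1 makes the first attempt fail at once and the relay proceeds to mx2 without waiting 30 seconds.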

Caveats and FAQ

These include answers to questions I have been asked in the past for this design (mostly the old slideware that preceded it).

Q
Does this thing work?
A
Yes, it does. Your mileage may vary.
Q
Can forwarding on the servers be turned off?
A
Not if you are running OSPF. OSPF assumes any element which participates in the topology to be capable of forwarding packets. A topology which does not have a server forwarding packets in its normal mode of operation may still have it in some failure modes. It is possible to turn off forwarding if RIP is used instead of OSPF.
Q
Can the servers be monitored and data updated over an out-of-band management connection?
A
Monitoring over OOB will miss some conditions where servers are unreachable. DNS updates themselves are an essential part of the system functionality. The system as designed provides redundancy for this. Most OOB and management networks do not have that.


-- AntonIvanov - 22 Nov 2008

Topic revision: r2 - 23 Nov 2008 - 09:11:41 - AntonIvanov

