Back to February 2000 meeting page

Flexible Level 4 switching Applications using Linux Virtual Servers

(aka Building Stupidly Large Servers using Linux Virtual Servers)


Michael Sparks
zath@i.am
http://MichaelSparks.tripod.com/

Level 4 switching & LVS : What is it ?

  • Essentially very similar to IP Masquerading. Just "the other way around".
  • Packets arrive at a load balancer, and are forwarded to private servers, based on the contents of the /proc/net/ip_masquerade table's "FromIP:port ToIP:Port Type" triplet.
  • New entries are added to the /proc/net/ip_masquerade table when SYN packets arrive. The scheduler picks the real server for each new connection, and there is a lot of flexibility in which scheduler you use.
  • Front end is termed a director. Backend servers termed real servers.

Credits: Who wrote it?

  • Lead developer - Wensong Zhang. Other important people:
  • Julian Anastasov - Lots of patches/ideas
  • Peter Kese - Port to 2.2 kernel
  • Joseph Mack - HOWTO Maintainer.
  • Lars Marowsky-Bree - Hosting the primary LVS website amongst other things
  • Many, many others: Mike Wangsmo, Rob Thomas, TC Lewis, Matthew Kellett, Mike Douglas, Horms, VA Linux, Redhat, plus a number of the usual suspects.

Level 4 switching & LVS : What's out there ?

  • Most commercial level 4 switches - eg Alteon, Arrowpoint, Big IP, Foundry, (all bar 1 of those I know) operate using NAT - ie just like IP masquerading. (LVS Term VS-NAT)
  • IBM's Net Dispatcher has the service IP public on the director, and private on the real servers. The director rewrites the destination MAC address of the ethernet frame to that of the real server. The real server can then reply directly to the client. Director and server must be on the same LAN. (LVS Term VS-DR)
  • LVS - has both of the above 2 options, and a third - packets can be forwarded using IPIP tunnelling. (not IP-GRE tunnelling unfortunately) This allows the real servers to be on different networks from the load balancer - very useful for resiliency & failover. (LVS Term VS-TUN)

Level 4 switching & LVS : Pros & Cons

  • Commercial systems
    • pro: Plug in and go. Sometimes already present in higher end switches/routers anyway.
    • cons: Expensive. No general purpose OS available. Normally only one choice of scheduler, and one choice of forwarding.
  • Linux Virtual Servers
    • con: Still under active development.
    • pros:
      • Still under active development :-)
      • FREE
      • You may already have it - built into the RedHat 6.1 kernel.
      • More flexible
      • Choice of forwarding on a per server basis.
      • Full operating system available.

Forwarding & Scheduling

  • Forwarding Mechanism Benefits:
    • VS-NAT - simple to set up; requires no modification to the servers, which can be running any OS.
    • VS-DR - since servers reply directly to clients, more scalable; any OS.
    • VS-TUN - Linux only, because it relies on IPIP, but the most flexible forwarding mechanism.
  • Scheduling:
    • round-robin scheduling
    • weighted round-robin scheduling
    • least-connection scheduling
    • weighted least-connection scheduling
    • Persistence.
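
Weighted least-connection can be pictured as "pick the server with the fewest connections per unit of weight". The sketch below is an illustration only, not the kernel's scheduler: the function name pick_wlc and its "server connections weight" input format are made up for this example, and it counts only a single connection figure per server.

```shell
#!/bin/sh
# pick_wlc: read "server active_connections weight" lines on stdin and print
# the server with the lowest connections/weight ratio (on a tie, the first
# server seen wins). Servers with weight 0 are never picked.
# Hypothetical helper for illustration; the in-kernel scheduler is more involved.
pick_wlc() {
    awk '
        $3 > 0 {
            # Compare c1/w1 < c2/w2 as c1*w2 < c2*w1 to avoid division.
            if (best == "" || $2 * bestw < bestc * $3) {
                best = $1; bestc = $2; bestw = $3
            }
        }
        END { if (best != "") print best }
    '
}

# The second server carries the most connections but also the highest weight,
# so its ratio (30/4) beats the others (10/1 and 8/1).
printf '%s\n' \
    '130.88.203.45 10 1' \
    '130.88.203.46 30 4' \
    '130.88.203.47 8 1' | pick_wlc
# prints 130.88.203.46
```

Giving every server the same weight reduces this to plain least-connection.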

Gotchas !

  • Level 4 switching is only a mechanism. You need to use other tools for monitoring/manipulating system state - eg mon, or home grown tools.
  • VS-NAT - only route to outside world must be through the Director.
  • VS-DR/VS-TUN - ARP. Since the director and all the real servers carry the same service IP address, only one machine (the director) may respond to ARP requests for it. For VS-DR on non-Linux systems, simply specifying -arp to ifconfig works. For 2.2 kernel Linux boxes, you need to tell the kernel the device is private (hidden).

    Eg:

    ifconfig <DEV> up
    echo 1 > /proc/sys/net/ipv4/conf/all/hidden
    echo 1 > /proc/sys/net/ipv4/conf/<DEV>/hidden
    ifconfig <DEV> <VIP> up
    
  • UDP Services - UDP is connectionless, so with VS-TUN/VS-DR nothing forces a reply to leave from the address the request arrived on. You must configure each UDP service either to respond on the same address the request came in on, or to bind to the VIP.
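
Because level 4 switching is only a mechanism, something else has to take dead real servers out of the table. A minimal sketch of one monitoring pass follows. Everything here is an assumption made for illustration: the helper names probe and dead_server_cmds are made up, and the probe relies on an nc that supports -z and -w. It only prints ipvsadm -d commands; it does not run them.

```shell
#!/bin/sh
# probe HOST PORT: succeed if something accepts a TCP connection there.
# Stand-in check (assumes an nc with -z and -w); swap in whatever test
# suits the service being balanced.
probe() {
    nc -z -w 2 "$1" "$2" >/dev/null 2>&1
}

# dead_server_cmds VIP PORT SERVER...
# Print one "ipvsadm -d" command for every real server that fails the probe.
# Set PROBE to the name of another function or command to change the check.
dead_server_cmds() {
    vip=$1 port=$2
    shift 2
    for server in "$@"; do
        if ! "${PROBE:-probe}" "$server" "$port"; then
            echo "ipvsadm -d -t $vip:$port -r $server:$port"
        fi
    done
}

# Typical use, from cron or a sleep loop on the director:
#   dead_server_cmds 130.88.203.3 80 130.88.203.45 130.88.203.46 | sh
```

Putting recovered servers back (ipvsadm -a) is left out of this sketch; tools such as mon handle both directions.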

Kernel Configuration

  • Best approach is to build support for all scheduling methods into the kernel in one go. Key options to select:
      	Prompt for development and/or incomplete code/drivers
      	Network firewalls
      	IP: firewalling
      	IP: masquerading
      	IP: masquerading virtual server support
      	(16) IP masquerading VS table size (the Nth power of 2)
      	IP: aliasing support (optional)
      
  • And as modules:
      	IPVS: round-robin scheduling
      	IPVS: weighted round-robin scheduling
      	IPVS: least-connection scheduling
      	IPVS: weighted least-connection scheduling
      	IP: tunneling
      
  • Add alias tunl0 ipip to your /etc/conf.modules file.
    
    

IPVSADM

    IP Virtual Server ADMinistration tool. Quick and simple access to the mechanism:
    ipvsadm  v1.7 1999/11/28
    Usage: /sbin/ipvsadm -[A|E] -[t|u] service-address [-s scheduler] [-p [timeout]] [-M [netmask]]
           /sbin/ipvsadm -D -[t|u] service-address
           /sbin/ipvsadm -C
           /sbin/ipvsadm -[a|e] -[t|u] service-address -r server-address [options]
           /sbin/ipvsadm -d -[t|u] service-address -r server-address
           /sbin/ipvsadm -[L|l] [-n]
    
    Commands:
    Either long or short options are allowed.
      --add-service     -A        add virtual service with options
      --edit-service    -E        edit virtual service with options
      --delete-service  -D        delete virtual service
      --clear           -C        clear the whole table
      --add-server      -a        add real server with options
      --edit-server     -e        edit real server with options
      --delete-server   -d        delete real server
      --list            -L        list the table
    
    Options:
      --tcp-service  -t service-address  service-address is host and port
      --udp-service  -u service-address  service-address is host and port
      --scheduler    -s <scheduler>      It can be rr|wrr|lc|wlc,
                                         the default scheduler is wlc.
      --persistent   -p [timeout]        persistent port
      --netmask      -M [netmask]        persistent granularity mask
      --real-server  -r server-address   server-address is host (and port)
      --masquerading -m                  masquerading (NAT)
      --ipip         -i                  ipip encapsulation (tunneling)
      --gatewaying   -g                  gatewaying (direct routing) (default)
      --weight:      -w <weight>         capacity of real server
      --numeric      -n                  numeric output of addresses and ports
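
    The -M "persistent granularity mask" is easiest to see with a worked example. As far as I understand it, clients whose addresses are equal after being ANDed with the mask are treated as one client for persistence. The helper below (persist_key is a made-up name, not part of ipvsadm) just does that arithmetic:

```shell
#!/bin/sh
# persist_key CLIENT_IP NETMASK
# Print CLIENT_IP ANDed with NETMASK, one octet at a time.
# Hypothetical helper for illustration only.
persist_key() (
    # The function body runs in a subshell, so changing IFS here does not
    # leak into the caller.
    IFS=.
    set -- $1 $2    # split both dotted quads into eight positional fields
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
)

persist_key 192.168.5.77 255.255.255.0      # prints 192.168.5.0
persist_key 192.168.5.200 255.255.255.0     # prints 192.168.5.0
persist_key 192.168.5.77 255.255.255.255    # prints 192.168.5.77
```

    With a 255.255.255.0 mask the first two clients share a key; with 255.255.255.255 every client address is its own key.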
    

Building a large scale web server : Realserver setup

  • Assuming Linux boxes running a web server - eg Apache, Roxen, etc, and all web servers on same network segment.
  • Decide on a service address - eg 130.88.203.3
  • On the real servers:
    ifconfig dummy0 up
    echo 1 > /proc/sys/net/ipv4/conf/all/hidden
    echo 1 > /proc/sys/net/ipv4/conf/dummy0/hidden
    ifconfig dummy0 130.88.203.3 up
    

Building a large scale web server : Director setup

    On the director - put this IP on a public device - eg an alias of a normal ethernet device:
    ifconfig eth0:0 130.88.203.3 netmask 255.255.255.255 broadcast 130.88.203.3
    
    And create the service:
    ipvsadm -A -t 130.88.203.3:80 -s wlc
    
    Assuming your 200 real servers have IPs 130.88.203.45 - 130.88.203.244, and that you've got sh-utils installed:
    for i in `seq 45 244`; do
       ipvsadm -a -t 130.88.203.3:80 -r 130.88.203.$i:80 -g -w 1000
    done
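
    With 200 real servers it can be worth previewing the table before loading it. A small sketch, assuming the same addressing as above: gen_rules is a made-up helper that prints the ipvsadm commands instead of running them, and it uses a plain while loop so seq is not needed.

```shell
#!/bin/sh
# gen_rules VIP PORT FIRST LAST
# Print (do not run) the ipvsadm commands for one direct-routed TCP service
# whose real servers are 130.88.203.FIRST through 130.88.203.LAST.
# Hypothetical helper; the 130.88.203 prefix is hard-wired to match the slides.
gen_rules() {
    vip=$1 port=$2 i=$3 last=$4
    echo "ipvsadm -A -t $vip:$port -s wlc"
    while [ "$i" -le "$last" ]; do
        echo "ipvsadm -a -t $vip:$port -r 130.88.203.$i:$port -g -w 1000"
        i=$((i + 1))
    done
}

# Small preview: one service line plus two real servers.
gen_rules 130.88.203.3 80 45 46
```

    Once the output looks right, the full range can be piped to sh as root on the director, eg gen_rules 130.88.203.3 80 45 244 | sh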
    

Building a large scale DNS server

    Assume the network device config on all servers (director & real alike) is unchanged. DNS is a good example of a UDP based service, hence the choice. On the real servers, if you're running bind 8 and using VS-DR or VS-TUN, you need to put the following into the options section of the /etc/named.conf file (assuming 130.88.203.3 as the VIP):
            listen-on { 130.88.203.3; };
    

Director configuration.

    create the service:
    ipvsadm -A -u 130.88.203.3:53 -s wrr
    ipvsadm -A -t 130.88.203.3:53 -s wrr
    
    Build the routing table:
    for i in `seq 45 244`; do
       ipvsadm -a -u 130.88.203.3:53 -r 130.88.203.$i:53 -g -w 1000
       ipvsadm -a -t 130.88.203.3:53 -r 130.88.203.$i:53 -g -w 1000
    done
    

Building a large scale Web Proxy server

    Assume the network device config on all servers (director & real alike) is unchanged. Assume software on all boxes is squid. In the squid.conf file on all the real servers add the line:
    udp_incoming_address 130.88.203.3
    

Director configuration.

    Create the TCP based HTTP proxy service:
    ipvsadm -A -t 130.88.203.3:3128 -s wlc
    
    Create the UDP based ICP service:
    ipvsadm -A -u 130.88.203.3:3130 -s wrr
    
    Build the routing table:
    for i in `seq 45 244`; do
       ipvsadm -a -t 130.88.203.3:3128 -r 130.88.203.$i:3128 -g -w 1000
       ipvsadm -a -u 130.88.203.3:3130 -r 130.88.203.$i:3130 -g -w 1000
    done
    

Building a bigger better Web server : BBBWS

  • Assuming you have a large number of ASP, php, cgi requests, you may want machines dedicated to this purpose. eg Allocate servers:
    • 130.88.203.45-100 to be cgi-bin servers.
    • 130.88.203.101-144 to be php3 servers.
    • 130.88.203.145-200 to be asp servers.
    • 130.88.203.201-244 to be normal webservers/frontends.
  • Designate an IP address per partitioning - eg:
    • 130.88.203.245 be www-cgi.mydomain.net
    • 130.88.203.246 be www-php.mydomain.net
    • 130.88.203.247 be www-asp.mydomain.net
    • 130.88.203.3 be www.mydomain.net

BBBWS : Farming off the request types

  • Run squid on all the machines as an http accelerator. Run a redirector that has the following rules:
    • map requests containing .*\.cgi$ or cgi-bin to www-cgi.mydomain.net
    • map requests for .*\.php$ to www-php.mydomain.net
    • map requests for .*\.asp$ to www-asp.mydomain.net
    • otherwise serve from local server.
  • Use squid's ability to hide this from the user.
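
The redirector rules above could be sketched roughly as follows. This is a sketch under assumptions, not the redirector used for the talk: it assumes squid passes one request per line with the URL as the first field and accepts either a rewritten URL or an empty line (meaning "leave it alone") in reply. The function names are made up, and the check that stops a pool from redirecting to itself is my addition.

```shell
#!/bin/sh
# rewrite_url URL
# Print the URL with its host swapped for the matching pool name, or an
# empty line when the request should be served by the local server.
rewrite_url() {
    url=$1
    rest=${url#http://}                            # host/path
    [ "$rest" != "$url" ] || { echo; return; }     # not http://, leave alone
    host=${rest%%/*}
    case $rest in
        */*) path=/${rest#*/} ;;
        *)   path=/ ;;
    esac
    case $url in
        *.cgi|*cgi-bin*) pool=www-cgi.mydomain.net ;;
        *.php)           pool=www-php.mydomain.net ;;
        *.asp)           pool=www-asp.mydomain.net ;;
        *)               echo; return ;;           # otherwise serve locally
    esac
    # Added guard: a request already addressed to its pool is not rewritten
    # again, otherwise the pool machines would redirect to themselves.
    if [ "$host" = "$pool" ]; then
        echo
    else
        echo "http://$pool$path"
    fi
}

# redirector_loop: answer one line per request line read on stdin.
redirector_loop() {
    while read -r url _rest; do
        rewrite_url "$url"
    done
}

# Two sample request lines: the first is rewritten, the second gets an
# empty reply.
printf '%s\n' \
    'http://www.mydomain.net/cgi-bin/search 10.0.0.1/- - GET' \
    'http://www.mydomain.net/index.html 10.0.0.1/- - GET' | redirector_loop
```

A script used as the real redirector would end with a bare call to redirector_loop instead of the printf demonstration.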

BBBWS : Configuring the real servers

    On each group of machines, configure dummy0 with the matching service address:
    Real servers          dummy0 address
    130.88.203.45-100     130.88.203.245
    130.88.203.101-144    130.88.203.246
    130.88.203.145-200    130.88.203.247
    130.88.203.201-244    130.88.203.3
  • On machines 130.88.203.201-244, run the webserver on port 81 rather than the usual port 80. (see next slide :-)
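
    The allocation above can be turned into a per-machine setup with a small lookup. A sketch only: vip_for_octet and print_dummy0_setup are made-up helper names, and the commands are printed for review, not executed.

```shell
#!/bin/sh
# vip_for_octet N
# Print the service address that real server 130.88.203.N should carry on
# dummy0, following the allocation on this slide. Fails for other hosts.
vip_for_octet() {
    if   [ "$1" -ge 45 ]  && [ "$1" -le 100 ]; then echo 130.88.203.245  # cgi-bin
    elif [ "$1" -ge 101 ] && [ "$1" -le 144 ]; then echo 130.88.203.246  # php3
    elif [ "$1" -ge 145 ] && [ "$1" -le 200 ]; then echo 130.88.203.247  # asp
    elif [ "$1" -ge 201 ] && [ "$1" -le 244 ]; then echo 130.88.203.3    # frontends
    else return 1
    fi
}

# print_dummy0_setup N
# Print the hidden-device setup for real server 130.88.203.N.
print_dummy0_setup() {
    vip=$(vip_for_octet "$1") || return 1
    cat <<EOF
ifconfig dummy0 up
echo 1 > /proc/sys/net/ipv4/conf/all/hidden
echo 1 > /proc/sys/net/ipv4/conf/dummy0/hidden
ifconfig dummy0 $vip up
EOF
}

# Example: the commands for the PHP server 130.88.203.120.
print_dummy0_setup 120
```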

BBBWS : Configuring the Director

  • Listen on all the IPs:
    ifconfig eth0:0 130.88.203.3 netmask 255.255.255.255 broadcast 130.88.203.3 up
    ifconfig eth0:1 130.88.203.245 netmask 255.255.255.255 broadcast 130.88.203.245 up 
    ifconfig eth0:2 130.88.203.246 netmask 255.255.255.255 broadcast 130.88.203.246 up
    ifconfig eth0:3 130.88.203.247 netmask 255.255.255.255 broadcast 130.88.203.247 up
    
  • Create the services:
    for i in 3 245 246 247; do
       ipvsadm -A -t 130.88.203.$i:80 -s wlc -p
    done
    
  • Create the routing tables:
    for i in `seq 45 100`; do             # CGI-BIN servers
       ipvsadm -a -t 130.88.203.245:80 -r 130.88.203.$i:80 -g -w 1000
    done
    for i in `seq 101 144`; do            # PHP servers
       ipvsadm -a -t 130.88.203.246:80 -r 130.88.203.$i:80 -g -w 1000
    done
    for i in `seq 145 200`; do            # ASP servers
       ipvsadm -a -t 130.88.203.247:80 -r 130.88.203.$i:80 -g -w 1000
    done
    for i in `seq 201 244`; do            # Normal/frontends servers
       ipvsadm -a -t 130.88.203.3:80 -r 130.88.203.$i:80 -g -w 1000
    done
    

What else can we load balance ?

  • Any TCP or UDP based service.
  • Including SMTP, FTP (specific support for ftp's data/control lines are in place), quake servers, POP3 servers, ssh/telnet connections into a cluster of machines, etc. (eg 30-100 diskless X terminals using a handful of real servers as central config)
  • Your imagination is effectively the limit.

What I didn't cover

  • Monitoring tools - essentially what you use depends on how closely you need to monitor the real servers. If they're flakey, you need very good monitoring. If they're not, you can get by with very simple tools. There's a large number of tools out there including mon, and for simple setups, the LVS tar ball comes with some.

  • Redhat 6.1 is set up to do VS-DR out of the box, and includes a simple admin/monitoring tool called Piranha.

  • Director failover - currently this is best achieved using the software fake to inform the router that the MAC address of the director has changed.

  • Content Synchronisation - one way of doing this is to use a network file system, such as AFS or Coda, if available on the real servers. If the amount of content is fairly small - ie less than 4GB - then it may be easier just to use rsync for the static data.

  • Any errors in this talk are all down to either typos or me being stupid, in all cases, the linux virtual servers website is the canonical documentation. Thanks for listening :-)

Linux Virtual Servers website address

http://www.LinuxVirtualServer.org/