If you are accessing this page, you might also want to see my list of U.K. Beowulf sites.
Mad Dog is a cluster of 12 dual-processor Linux machines. Seven of these are Pentium Pro 166 machines (192 MB of memory each); the remaining five are PII 400 machines with 500 MB of memory each. It is a Beowulf system, based on the original work done by Don Becker and collaborators at NASA, and strongly influenced by the Loki and Naegling machines from Los Alamos and Caltech. It is built from commodity components and free software, which makes it a very cheap alternative to commercial HPC systems.
Each of the original machines that make up Mad Dog is a dual Pentium Pro
166 MHz machine on a Gigabyte motherboard, with 192 MB of memory and a 4.3
GB IDE hard disk. The root machine contains two Ethernet cards (3Com XL),
and all machines are connected through a Fast Ethernet switch (SuperStack
II 3000) for interprocess communication and access to the outside world.
They have a simple video card and are connected to a mechanical
keyboard/monitor switch.
The new nodes (called englishmen, get it?) are dual PII 400 machines
on a Tyan Tiger motherboard, with 500 MB of RAM. The new network cards are DLINK.
Each of the machines runs Red Hat Linux version 6.0. Supported software includes gcc, g77, PVM and mips. The compute nodes boot with their root file systems NFS-mounted from the head (root) machine. The disks on these nodes are at the moment used only for local storage and swap space. IP addresses are assigned using RARP, for ease of reconfiguration.
One of the problems is maintaining such systems. Since I already have
quite a few Linux PCs in my group, that is not a major additional stumbling
block. More important is the speed, and especially the latency, of network
connections using ordinary TCP/IP on Fast Ethernet. There exist off-the-shelf
solutions to this (such as Myrinet and SCI), but these are expensive.
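To make the latency concern concrete, here is a minimal sketch of a TCP "ping-pong" test that times one-byte round trips. It is not the benchmark used on Mad Dog. The function names, the message count and the use of the loopback address are all assumptions made for illustration. As written, both ends run in one process on one machine, so it measures only the software overhead of the TCP/IP stack; to measure the network itself, the echo side would have to run on a second node.

```python
import socket
import threading
import time


def echo_server(listener: socket.socket, n_messages: int) -> None:
    """Accept one connection and echo back up to n_messages one-byte messages."""
    conn, _ = listener.accept()
    with conn:
        # Disable Nagle's algorithm so small messages are sent immediately.
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(n_messages):
            data = conn.recv(1)
            if not data:
                break
            conn.sendall(data)


def measure_round_trip(host: str = "127.0.0.1", n_messages: int = 1000) -> float:
    """Return the mean round-trip time in seconds for one-byte TCP messages."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    server = threading.Thread(target=echo_server, args=(listener, n_messages))
    server.start()

    with socket.create_connection((host, port)) as client:
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(n_messages):
            client.sendall(b"x")  # ping
            client.recv(1)        # wait for the echoed byte (pong)
        elapsed = time.perf_counter() - start

    server.join()
    listener.close()
    return elapsed / n_messages


if __name__ == "__main__":
    rtt = measure_round_trip()
    print(f"mean round trip: {rtt * 1e6:.1f} microseconds")
```

Half the mean round-trip time gives a rough estimate of the one-way latency for small messages, which is the quantity that matters most for fine-grained parallel codes.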
There have been some attempts to write special protocols for exchanging information with lower latency. At the moment it seems that VIA (pushed by Intel and Microsoft, amongst others) is the new way to go. The SCI systems have their own software. If I am not mistaken, there is a cluster at RAL that implements this technology.
It might be interesting to experiment with special protocols for interprocess communication. To that end I will probably add an additional Ethernet card to each machine, and connect these to a fast hub. This link can then be used for standard IP traffic not related to the parallel processing (NFS, telnet, rsh, etc.). It might be smart to look at Linux/Alpha, since Alpha PCs are a lot faster than Intel ones. A final interest is in parallel file systems, which have recently been implemented for Linux PCs.