



Introduction
What type of computational resources are required to fully
model a photonic crystal slab in 3D with FDTD? Let's assume a typical photonic
crystal slab problem. Our photonic crystal slab consists of a slab of silicon
with air holes periodically etched in a triangular array in the x-y plane. The
slab has line defects consisting of a row of holes of different diameter from
the background holes. Two line defects are arranged so that they create a sharp
bend. We wish to compute the transmission losses from the bend.
Such a problem typically requires a computational
domain of up to 60x30 periods in the plane. The slab thickness is about 3/4 of a
period, plus we need several periods of "cladding" material above and below the
slab to provide the TIR confinement. Each period requires 15-25 Yee cells, so
taking 20 cells per period (and roughly eight periods in the vertical direction)
we need approximately 1200x600x160 = 115,200,000 Yee cells in our
computational domain. Each Yee cell requires six double-precision complex field
values (16 bytes each) plus one short integer to hold the cell's material
type (2 bytes). At 98 bytes per cell, our problem requires just over 10.5
gigabytes of memory for the computational domain alone. We will also need some
additional memory for the domain boundaries and the monitor values. Going to
single precision roughly cuts the memory requirement in half.
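As a sanity check on the arithmetic, here is a small back-of-the-envelope
program. The 20-cells-per-period resolution and the 1200x600x160 extents are
the assumptions stated above, not values taken from any actual simulation code:

    /* Back-of-the-envelope memory estimate for the grid described above.
     * Assumes 20 Yee cells per period and 1200x600x160 cell extents. */
    #include <stdio.h>

    int main(void)
    {
        const long nx = 1200, ny = 600, nz = 160;   /* grid extents in Yee cells */
        const long cells = nx * ny * nz;            /* 115,200,000 cells */

        /* 6 double-precision complex field values (16 bytes each) plus a
           2-byte material index per cell = 98 bytes per cell. */
        const long bytes_per_cell = 6 * 16 + 2;
        const double gib = (double)cells * bytes_per_cell
                           / (1024.0 * 1024.0 * 1024.0);

        printf("%ld cells, %.1f GB for the field arrays\n", cells, gib);
        return 0;
    }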
Fortunately, we can employ an inexpensive cluster of
networked computers to tackle the problem. The various nodes in the cluster each
work on a separate "chunk" of the computational domain. By using the
Message-Passing Interface (MPI) protocol in the FDTD code, we can get the
various nodes to communicate the necessary information to piece together the
computation, so that it proceeds as one big FDTD computation.
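The decomposition itself is straightforward. As a rough illustration (a sketch
of the general idea only, not the actual cluster code; the one-dimensional
split along x and the variable names are invented for the example), each node
is handed a contiguous slab of the domain and told who its neighbors are:

    /* Illustrative 1-D domain decomposition: each MPI rank owns a
     * contiguous slab of the x range and knows its left/right neighbors.
     * A sketch only -- the split direction and names are illustrative. */
    #include <mpi.h>
    #include <stdio.h>

    #define NX 1200   /* total cells along x, from the estimate above */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Divide NX cells as evenly as possible among the ranks. */
        int base = NX / nprocs, rem = NX % nprocs;
        int local_nx = base + (rank < rem ? 1 : 0);
        int local_x0 = rank * base + (rank < rem ? rank : rem);

        /* Interior chunks have two neighbors; the end chunks have one. */
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        printf("rank %d owns x = [%d, %d) and talks to ranks %d and %d\n",
               rank, local_x0, local_x0 + local_nx, left, right);

        MPI_Finalize();
        return 0;
    }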
Hardware
When I was first working on this computational problem
at Optical Switch Corporation, we had many spare desktop workstations lying
around unused. So I requisitioned 20 Dell GX110 PCs. These were PIII systems
of various speeds (533-733 MHz), each with 512 MB of memory. At the time, I was
personally using two Dell 620 workstations (900MHz PIII Xeon dual processor
with 2 GB of memory), which I also added to the mix.
The OSC cluster ran remarkably well for
an inhomogeneous cluster, but my choice of components was determined by what
was available in OSC's inventory. When I went back to my Postdoc work at UTD, I had the
opportunity to create a parallel cluster from the ground up. My goal was to
create a cluster with similar capabilities to the OSC cluster.
I decided to go with faster processors, with more memory
per processor, in order to cut down on the number of nodes. That in turn
decreased the number of network connections, which are the "slow" part of
the cluster. Using dual processor machines further cut down on the network
connections. They also lowered the overall cost by reducing the number of
cases, power supplies, motherboards, drives, and so forth, required to
construct the cluster (not to mention the electricity savings). Overall cost
for the cluster hardware, including UPS backup power, was around $15,000.
Software
With parallel computers, the overall cluster speed is
governed by how the memory pool is shared. If all of the processors and
memory are hardwired together, memory access time will be similar to that of
a single processor machine. However, such a parallel machine would be very
expensive. The next best thing is to take individual machines (either single
processor or multi-processor SMP machines) and hook them together with a
high speed interconnect such as Myrinet. Unfortunately, Myrinet is pricey
(each card would cost as much as one of our dual processor nodes). One could
use cheaper Gigabit Ethernet (which is slower than Myrinet and has a
higher latency), but drivers for Linux are hard to come by.
Standard 100 megabit Ethernet is very cheap and
drivers are very stable. Unfortunately, it is orders of magnitude slower
than memory access, so MPI calls across the network incur a large
computational penalty. As a result, I had to design the FDTD software
specifically for standard Ethernet. Since the code handles the domain
decomposition itself, the shared computational boundaries can be kept as
small as possible. Each node has at most two shared boundaries, so it
communicates with at most two other nodes, and it never has to talk to both
at the same time. FDTD requires the exchange of only two field components
along the shared boundary with one neighbor during the E-field update, and
two more with the other neighbor during the H-field update. As a result, the
network traffic can be spread across the time each node spends computing the
rest of its domain. As long as each node's chunk is sufficiently larger than
its shared boundaries, the cluster maintains a linear speed increase as
further nodes are added (i.e., it is computing in parallel very efficiently).
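In rough outline, the pattern looks like the routine below. This is only a
sketch under the assumptions of the one-dimensional split shown earlier (one
field component shown, update coefficients omitted, array names invented for
the example), not the cluster code itself. The point is simply that the
boundary exchange is posted first, the interior cells are updated while the
messages are in flight, and only the last plane has to wait:

    /* One H-field half-step for a rank in the 1-D split: exchange one
     * E-field plane with the neighbors while updating the interior.
     * hy and ez hold local_nx planes of nyz values each; ez_halo holds
     * the single plane received from the right-hand neighbor. */
    #include <mpi.h>

    void h_step(double *hy, double *ez, double *ez_halo,
                int local_nx, int nyz, int left, int right)
    {
        MPI_Request req[2];

        /* Post the exchange first: the H update at our last plane needs
           the right neighbor's first E plane, and our own first E plane
           is needed by the left neighbor. MPI_PROC_NULL neighbors (the
           physical domain edges) turn these calls into no-ops. */
        MPI_Irecv(ez_halo, nyz, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(ez,      nyz, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[1]);

        /* Update every H cell that depends only on locally owned E values;
           this work hides the time the messages spend on the network.
           (Update coefficients omitted for clarity.) */
        for (int i = 0; i < local_nx - 1; ++i)
            for (int j = 0; j < nyz; ++j)
                hy[(long)i * nyz + j] +=
                    ez[(long)(i + 1) * nyz + j] - ez[(long)i * nyz + j];

        /* Only the last plane needs the halo data, so wait and finish it.
           At the physical edge, the boundary-condition code handles this
           plane instead. */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        if (right != MPI_PROC_NULL)
            for (int j = 0; j < nyz; ++j)
                hy[(long)(local_nx - 1) * nyz + j] +=
                    ez_halo[j] - ez[(long)(local_nx - 1) * nyz + j];
    }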
Currently, I'm working to add standard PML and UPML-PC boundaries to the
parallel cluster FDTD code. Unlike simpler absorbing boundaries (such as
Mur's), PML-type boundaries require values from adjacent Yee cells, so MPI
communication must be part of the boundary code. The thick boundary regions
needed by UPML-PC must also be split across multiple nodes in the cluster in
order to maintain efficiency.