



Introduction
What type of computational resources are required to fully
model a photonic crystal slab in 3D with FDTD? Let's assume a typical photonic
crystal slab problem. Our photonic crystal slab consists of a slab of silicon
with air holes periodically etched in a triangular array in the x-y plane. The
slab has line defects consisting of a row of holes of different diameter from
the background holes. Two line defects are arranged so that they create a sharp
bend. We wish to compute the transmission losses from the bend.
Such a problem typically requires a computational
domain of up to 60x30 periods in the plane. The slab thickness is about 3/4 of a
period, plus we need several periods of "cladding" material above and below the
slab to provide the TIR confinement. Each period requires 15-25 Yee cells, so
taking 20 cells per period (and roughly eight periods in the vertical direction)
we need approximately 1200x600x160 = 115,200,000 Yee cells in our
computational domain. Each Yee cell requires six double-precision complex field
values (16 bytes each) plus one short integer to hold the cell's material
type (2 bytes). At 98 bytes per cell, our problem requires just over 10.5
gigabytes of memory for the computational domain alone. We will also need some
additional memory for the domain boundaries and the monitor values. Going to
single precision roughly cuts the memory requirement in half.
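As a sanity check on the arithmetic, here is a small back-of-the-envelope
program. The 20-cells-per-period resolution and the 1200x600x160 extents are
the assumptions stated above, not values taken from any actual simulation code:

    /* Back-of-the-envelope memory estimate for the grid described above.
     * Assumes 20 Yee cells per period and 1200x600x160 cell extents. */
    #include <stdio.h>

    int main(void)
    {
        const long nx = 1200, ny = 600, nz = 160;   /* grid extents in Yee cells */
        const long cells = nx * ny * nz;            /* 115,200,000 cells */

        /* 6 double-precision complex field values (16 bytes each) plus a
           2-byte material index per cell = 98 bytes per cell. */
        const long bytes_per_cell = 6 * 16 + 2;
        const double gib = (double)cells * bytes_per_cell
                           / (1024.0 * 1024.0 * 1024.0);

        printf("%ld cells, %.1f GB for the field arrays\n", cells, gib);
        return 0;
    }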
Fortunately, we can employ an inexpensive cluster of
networked computers to tackle the problem. The various nodes in the cluster each
work on a separate "chunk" of the computational domain. By using the
Message-Passing Interface (MPI) protocol in the FDTD code, we can get the
various nodes to communicate the necessary information to piece together the
computation, so that it proceeds as one big FDTD computation.
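The decomposition itself is straightforward. As a rough illustration (a sketch
of the general idea only, not the actual cluster code; the one-dimensional
split along x and the variable names are invented for the example), each node
is handed a contiguous slab of the domain and told who its neighbors are:

    /* Illustrative 1-D domain decomposition: each MPI rank owns a
     * contiguous slab of the x range and knows its left/right neighbors.
     * A sketch only -- the split direction and names are illustrative. */
    #include <mpi.h>
    #include <stdio.h>

    #define NX 1200   /* total cells along x, from the estimate above */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Divide NX cells as evenly as possible among the ranks. */
        int base = NX / nprocs, rem = NX % nprocs;
        int local_nx = base + (rank < rem ? 1 : 0);
        int local_x0 = rank * base + (rank < rem ? rank : rem);

        /* Interior chunks have two neighbors; the end chunks have one. */
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        printf("rank %d owns x = [%d, %d) and talks to ranks %d and %d\n",
               rank, local_x0, local_x0 + local_nx, left, right);

        MPI_Finalize();
        return 0;
    }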
Hardware
When I was first working on this computational problem
at Optical Switch Corporation, we had many spare desktop workstations lying
around unused. So I requisitioned 20 Dell GX110 PCs. These were PIII systems
of various speeds (533-733 MHz), each with 512 MB of memory. At the time, I was
personally using two Dell 620 workstations (900MHz PIII Xeon dual processor
with 2 GB of memory), which I also added to the mix.
The OSC cluster ran remarkably well for
an inhomogeneous cluster, but my choice of components was determined by what
was available in OSC's inventory. When I went back to my Postdoc work at UTD, I had the
opportunity to create a parallel cluster from the ground up. My goal was to
create a cluster with similar capabilities to the OSC cluster.
I decided to go with faster processors, with more memory
per processor, in order to cut down on the number of nodes. That in turn
decreased the number of network connections, which are the "slow" part of
the cluster. Using dual processor machines further cut down on the network
connections. They also lowered the overall cost by reducing the number of
cases, power supplies, motherboards, drives, and so forth, required to
construct the cluster (not to mention the electricity savings). Overall cost
for the cluster hardware, including UPS backup power, was around $15,000.
Software
With parallel computers, the overall cluster speed is
governed by how the memory pool is shared. If all of the processors and
memory are hardwired together, memory access time will be similar to that of
a single processor machine. However, such a parallel machine would be very
expensive. The next best thing is to take individual machines (either single
processor or multi-processor SMP machines) and hook them together with a
high speed interconnect such as Myrinet. Unfortunately, Myrinet is pricey
(each card would cost as much as one of our dual processor nodes). One could
use cheaper Gigabit Ethernet (which is slower than Myrinet and has a
higher latency), but drivers for Linux are hard to come by.
Standard 100 megabit Ethernet is very cheap and
drivers are very stable. Unfortunately, it is orders of magnitude slower
than memory access, so MPI calls across the network incur a large
computational penalty. As a result, I had to design the FDTD software
specifically for standard Ethernet. Since the code handles the domain
decomposition itself, the shared computational boundaries can be kept as
small as possible. Each node has at most two shared boundaries, so it
communicates with at most two other nodes, and it never has to talk to both
at the same time. FDTD requires the exchange of only two field components
along the shared boundary with one neighbor during the E-field update, and
two more with the other neighbor during the H-field update. As a result, the
network traffic can be spread across the time each node spends computing the
rest of its domain. As long as each node's chunk is sufficiently larger than
its shared boundaries, the cluster maintains a linear speed increase as
further nodes are added (i.e., it is computing in parallel very efficiently).
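In rough outline, the pattern looks like the routine below. This is only a
sketch under the assumptions of the one-dimensional split shown earlier (one
field component shown, update coefficients omitted, array names invented for
the example), not the cluster code itself. The point is simply that the
boundary exchange is posted first, the interior cells are updated while the
messages are in flight, and only the last plane has to wait:

    /* One H-field half-step for a rank in the 1-D split: exchange one
     * E-field plane with the neighbors while updating the interior.
     * hy and ez hold local_nx planes of nyz values each; ez_halo holds
     * the single plane received from the right-hand neighbor. */
    #include <mpi.h>

    void h_step(double *hy, double *ez, double *ez_halo,
                int local_nx, int nyz, int left, int right)
    {
        MPI_Request req[2];

        /* Post the exchange first: the H update at our last plane needs
           the right neighbor's first E plane, and our own first E plane
           is needed by the left neighbor. MPI_PROC_NULL neighbors (the
           physical domain edges) turn these calls into no-ops. */
        MPI_Irecv(ez_halo, nyz, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(ez,      nyz, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[1]);

        /* Update every H cell that depends only on locally owned E values;
           this work hides the time the messages spend on the network.
           (Update coefficients omitted for clarity.) */
        for (int i = 0; i < local_nx - 1; ++i)
            for (int j = 0; j < nyz; ++j)
                hy[(long)i * nyz + j] +=
                    ez[(long)(i + 1) * nyz + j] - ez[(long)i * nyz + j];

        /* Only the last plane needs the halo data, so wait and finish it.
           At the physical edge, the boundary-condition code handles this
           plane instead. */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        if (right != MPI_PROC_NULL)
            for (int j = 0; j < nyz; ++j)
                hy[(long)(local_nx - 1) * nyz + j] +=
                    ez_halo[j] - ez[(long)(local_nx - 1) * nyz + j];
    }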
Currently, I'm working to add standard PML and UPML-PC boundaries to the
parallel cluster FDTD code. Unlike simpler absorbing boundaries (such as
Mur's), PML-type boundaries require values from adjacent Yee cells, so MPI
communication must be part of the boundary code. The thick boundary regions
needed by UPML-PC must also be split across multiple nodes in the cluster in
order to maintain efficiency.