Evaluating supercomputer performance
Pat Worley works at the console of the Compaq AlphaServer SC supercomputer. (Photo by Buddy Bland)
Like buildings, supercomputers have different architectures. Picture four central
processing units (CPUs) and four data-storage units (computer memory). Give each
processor its own memory unit and then connect the processors. If one processor wants to
read data in memory attached to another processor, it must ask the other processor for the
data. This arrangement is called distributed memory, and the collection of processors is
called a cluster. If, instead, each processor is connected to each of the four memory units
and can access data directly, this arrangement is called shared memory. Now, put these
four processors and their memory units in one box in a shared memory arrangement and
call it a node.
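The two arrangements can be illustrated with a toy model. This is only a sketch: the class names, method names, and stored values are hypothetical and chosen for illustration, and real machines implement these ideas in hardware and in message-passing libraries rather than in Python objects.

```python
# Toy model of the two memory arrangements described above.
# All class and method names are hypothetical, chosen for illustration.

class DistributedProcessor:
    """A processor that owns one memory unit; others must ask it for data."""

    def __init__(self, rank, data):
        self.rank = rank
        self.memory = dict(data)   # private memory unit
        self.requests_served = 0   # how often other processors had to ask

    def serve(self, key):
        # Called when another processor asks for data held here.
        self.requests_served += 1
        return self.memory[key]

    def read(self, key, owner):
        # Local data is read directly; remote data requires a request.
        if owner is self:
            return self.memory[key]
        return owner.serve(key)


class SharedMemoryNode:
    """Several processors connected to the same memory: direct access."""

    def __init__(self, n_processors, data):
        self.n_processors = n_processors
        self.memory = dict(data)   # visible to every processor in the node

    def read(self, processor_id, key):
        # Any processor reads any location without involving another one.
        return self.memory[key]


# Distributed memory: processor 0 needs a value held by processor 3.
cluster = [DistributedProcessor(r, {f"x{r}": r * 10}) for r in range(4)]
value = cluster[0].read("x3", owner=cluster[3])
print(value, cluster[3].requests_served)

# Shared memory: the same four values live in one node.
node = SharedMemoryNode(4, {f"x{r}": r * 10 for r in range(4)})
print(node.read(0, "x3"))
```

In the distributed case the owning processor has to do work on behalf of the reader, which is counted in requests_served. In the shared case no other processor is involved.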
Eagle, the IBM RS/6000 SP supercomputer at ORNL, is a cluster of 176 four-processor
nodes, combining both distributed and shared memory in a single system. In 1998, when
DOE’s Center for Computational Sciences (CCS) at ORNL was planning to purchase a
next-generation supercomputer, it signed a contract with IBM that called for
16-processor nodes. At the time, a performance evaluation team led by Pat Worley of
ORNL’s Computer Science and Mathematics Division (CSMD) compared the
4-processor and 16-processor IBM nodes to determine which architecture would work
best for the codes that were to be run on the machine.
“We found that smaller nodes work better for our science applications,” says Buddy
Bland, head of CSMD’s Systems and Operations Group. “So we changed the contract
with IBM from 16-processor nodes to 4-processor nodes. As a result, we obtained Eagle
eight months earlier at a cost 20% less than the total in the original contract. Now, we
must decide which architecture will work best for the 10-teraflop supercomputer we
want built for 2003. Already climate modelers are writing codes that will port to this
future supercomputer.”
Worley and his team have focused their recent performance
evaluation efforts on the Compaq AlphaServer SC machine
at ORNL known as Falcon and on a new IBM machine at
ORNL known as Cheetah. Falcon uses 4-processor nodes,
like Eagle. Cheetah uses the new IBM p690 nodes, which
each have 32 processors. Worley’s team has found that IBM
Power4 processors used in a p690 node are two-and-a-half
times faster than Eagle’s processors and twice as fast as
Falcon’s processors for a variety of application codes.
Unlike the earlier 16-processor IBM nodes, the
32-processor p690 node has up to four times Eagle’s
bandwidth for communication within a node. Hence, a larger
volume of messages and other data can be passed more
quickly among Cheetah’s processors than among Eagle’s. As
Bland puts it, “If you have a really fast water pump, you want
a fire hose, not a straw, to increase the speed and volume of
flow. Cheetah has the bandwidth equivalent of a fire hose.”
As part of their performance evaluation, Worley and his team do “benchmarking.” They
test existing parallel-computing codes to determine whether each code runs faster on, for
example, the IBM or Compaq machines. Then they “diagnose” the performance of the
code.
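The benchmarking step can be sketched as a small timing harness. This is a minimal illustration, not the team's actual tooling: the function names and the stand-in workload are hypothetical, and the two timings passed to speedup are made-up numbers, not measurements from any ORNL machine.

```python
import time


def benchmark(kernel, *args, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def speedup(time_on_a, time_on_b):
    """How many times faster the code ran on machine B than on machine A."""
    return time_on_a / time_on_b


def sample_kernel(n):
    # Stand-in workload (hypothetical); a real benchmark would run
    # an actual application code on each machine.
    return sum(i * i for i in range(n))


t = benchmark(sample_kernel, 100_000)
print(f"best of 5 runs: {t:.6f} s")

# Illustrative timings only: 10.0 s on machine A versus 4.0 s on machine B.
print(f"speedup: {speedup(10.0, 4.0):.1f}x")
```

Taking the best of several runs is one common way to reduce noise from other activity on the machine. Running the same harness on two systems and comparing the times gives the kind of ratio the article quotes for processor speed.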
“We try to determine why a code runs faster on one machine than another,” Worley says.
“We investigate whether a code may run more slowly on one machine because of the
coding style—the way a computer program is written. If so, we can advise code
developers on how to alter their style so the code will run faster on a particular machine.”
ORNL team members also do performance engineering. They can tune a code to improve
its performance on a specific machine. In addition, Worley’s team tells vendors which
problems they need to solve in designing their next-generation machines so that certain
codes will run faster.
“Our customers are code writers and users, vendors, and system administrators,” Worley
says. “We provide advice on how to configure and run their systems and on what
machines they should buy next. We guide the development of both codes and
supercomputers.
“In our most recent efforts we have focused on evaluating the performance of Falcon and
Cheetah in running climate, car crash, computational chemistry, human genome analysis,
and materials codes. We measure how fast each code runs and predict how much time and
how many processors are needed to get the computing job done.”
The ORNL team was the first to show that a supercomputer made in the United States
(Falcon) could exceed a performance goal (5 seconds per model day) for modeling the
global climate. Later the team also showed that Eagle can exceed that goal. Without the
input of the CCS performance evaluation team, ORNL’s supercomputers would not have
nearly as good an output.
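To put the goal of 5 seconds per model day in perspective, the short calculation below converts it into wall-clock time for longer simulations. It is only a sketch: the function name is hypothetical, a model year is assumed to have 365 days, and the run is assumed to proceed at exactly the goal rate with no interruptions.

```python
SECONDS_PER_MODEL_DAY = 5.0   # performance goal quoted in the article
DAYS_PER_MODEL_YEAR = 365     # assumption: no leap days in the model calendar


def wall_clock_hours(model_years, seconds_per_model_day=SECONDS_PER_MODEL_DAY):
    """Hours of computing needed to simulate the given number of model years."""
    model_days = model_years * DAYS_PER_MODEL_YEAR
    return model_days * seconds_per_model_day / 3600.0


for years in (1, 10, 100):
    print(f"{years:>3} model years: {wall_clock_hours(years):.1f} hours")
```

At the goal rate, a simulated century takes 182,500 seconds of computing, or a little over two days. A code that runs faster than the goal shortens this proportionally.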