Networking: making faster connections among supercomputers
ORNL is a hub on DOE’s Energy Sciences Network (ESnet), which connects DOE’s national laboratories. (Illustration by Gail Sweeden)
Some computational scientists’ high-performance computing jobs are getting done even
while they are not working, thanks in part to networking. Networks, particularly
high-speed networks, allow supercomputer nodes to “talk” to each other, send messages
to other nodes asking for data, and transmit large data files across the country. In addition,
networks allow computational scientists to keep tabs on the progress of jobs that often
run for days at a time.
DOE’s Center for Computational Sciences (CCS) at ORNL has supercomputers, as do the
National Science Foundation center in Pittsburgh, the National Center for Supercomputing
Applications facility at San Diego, and DOE’s National Energy Research Scientific
Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL) in
California. But, according to Bill Wing of ORNL’s Computer Science and Mathematics
Division, the similarities stop there.
“Other supercomputer centers slice up their resources on a fine scale and run hundreds of
jobs for thousands of users,” he says. “We are different. We focus our computational
resources on a few high-end users who need massive computing capacity for climate
prediction, human genome analysis, and materials science simulations. Our customer
base, which includes many users outside ORNL, has different needs, including different
network needs, so we use a different model.”
At CCS the computational scientists modeling future climate or exploding stars or
searching for genes in DNA sequences run jobs for days or weeks at a time and generate
huge files of calculated results that are transmitted between ORNL and NERSC’s data
archives. Sometimes climate modeling can produce a run of data amounting to 1 trillion
bytes (1 terabyte). These data are sent between ORNL and NERSC in chunks of 250
million bytes (250 megabytes).
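As a rough illustration of these figures, the sketch below splits a 1-terabyte run into 250-megabyte chunks and counts them. It assumes decimal units (1 terabyte = 10^12 bytes, 1 megabyte = 10^6 bytes), and the function name chunk_sizes is hypothetical, not part of any ORNL or NERSC software.

```python
TERABYTE = 10**12  # bytes, decimal units (assumption)
MEGABYTE = 10**6   # bytes, decimal units (assumption)


def chunk_sizes(total_bytes, chunk_bytes):
    """Return the sizes of the chunks needed to cover total_bytes.

    Every chunk is chunk_bytes long except possibly the last one,
    which holds whatever remains.
    """
    if total_bytes < 0 or chunk_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and chunk_bytes > 0")
    full_chunks, remainder = divmod(total_bytes, chunk_bytes)
    sizes = [chunk_bytes] * full_chunks
    if remainder:
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    sizes = chunk_sizes(TERABYTE, 250 * MEGABYTE)
    print(len(sizes), "chunks")
```

Under these assumptions, a 1-terabyte run divides evenly into 4,000 chunks.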
"We focus on moving large files of data," Wing
says. “ORNL and LBNL are writing computer programs to ensure that these data packages
slide through the network rather than clog it. In addition, we are developing the ability
to allow users to monitor the progress of these long-running jobs, and steer them if
necessary, from a variety of portable access points, including laptops and personal
digital assistants like Palm Pilots or iPAQs.”
Data are sent over the network mostly using the transmission control protocol (TCP), a
predefined protocol that computers use to communicate over a network. LBNL and ORNL
researchers are devising ways to improve the ability to send large files so that
supercomputers are not idle because of delays in data delivery.
To reduce delays in data delivery, Nageswara Rao of ORNL’s Computer Science and
Mathematics Division has developed a computer program called NetLet that is being
tested on 12 free telnet and university sites serving as monitors and routers. “NetLet
allows computers to efficiently talk with each other, ‘predict’ the delay in getting the
message to the receiver, and suitably route the message,” Rao says. “This algorithm
enables the computers to measure connection speeds and the delays of pathways and then
identify the best combination of pathways to get the information delivered efficiently in
the time or at the rate guaranteed.”
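The idea of measuring pathway delays and then choosing a route can be sketched with a standard shortest-path search. This is not NetLet's actual algorithm, which the article does not describe in detail; it is a minimal illustration that assumes each link has a single measured delay in seconds and picks the one path with the lowest total delay. The function name, the site names, and the delay values are all hypothetical.

```python
import heapq


def lowest_delay_path(links, source, target):
    """Find the path with the smallest total measured delay (Dijkstra).

    links maps each node to a dict of {neighbor: delay_in_seconds}.
    Returns (path, total_delay), or (None, inf) if target is unreachable.
    """
    best = {source: 0.0}      # lowest known delay to each node
    previous = {}             # back-pointers for rebuilding the path
    queue = [(0.0, source)]
    while queue:
        delay, node = heapq.heappop(queue)
        if node == target:
            break
        if delay > best.get(node, float("inf")):
            continue  # stale queue entry, a better route was already found
        for neighbor, link_delay in links.get(node, {}).items():
            candidate = delay + link_delay
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if target not in best:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return path, best[target]


if __name__ == "__main__":
    # Hypothetical measured delays between a sender, two relays, and a receiver.
    measured = {
        "sender": {"relay_a": 0.030, "relay_b": 0.010},
        "relay_a": {"receiver": 0.020},
        "relay_b": {"relay_a": 0.005, "receiver": 0.060},
    }
    print(lowest_delay_path(measured, "sender", "receiver"))
```

In this made-up example the direct-looking routes lose to a two-relay route, which is the kind of result that only shows up once delays are actually measured.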
Demonstrations of NetLet have shown that the algorithm reduces data delivery time by
about 40% without any additional support from the Internet routers. “Some of
our data files used to take 10 seconds to get from our computer to a destination computer,”
Rao says. “Those same data files can now get there in 6 seconds. That means that a huge
data file that took 10 hours to arrive at a destination computer can now get there in 6
hours.”
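Rao's figures describe a 40% cut in delivery time, which corresponds to a larger gain in effective throughput because the same data arrive in less time. The short sketch below works through that arithmetic; the function names are hypothetical and the inputs are the article's own numbers.

```python
def time_reduction(old_time, new_time):
    """Fraction by which the delivery time shrinks."""
    return (old_time - new_time) / old_time


def throughput_gain(old_time, new_time):
    """Fractional increase in data delivered per unit time
    when the same file arrives in new_time instead of old_time."""
    return old_time / new_time - 1.0


if __name__ == "__main__":
    # 10 seconds -> 6 seconds (the same ratio as 10 hours -> 6 hours)
    print(f"time reduced by {time_reduction(10, 6):.0%}")
    print(f"throughput up by {throughput_gain(10, 6):.0%}")
```

For the 10-to-6 ratio quoted here, the time falls by 40% while the effective delivery rate rises by roughly two-thirds.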
The data files transmitted from ORNL's Eagle (IBM RS/6000 SP supercomputer) to the
NERSC data archive fly over DOE’s Energy Sciences Network (ESnet), a semiprivate
part of the Internet. Currently, DOE facilities such as ORNL and LBNL are using the new
ESnet (OC12), a high-speed link operated by Qwest that supports data transmission at
622 megabits per second, 4 times faster than the old ESnet.