Feature Story | 31-Dec-2001

Retaining and retrieving data more effectively

DOE/Oak Ridge National Laboratory



Randy Burris examines the tape drives and disks of a High-Performance Storage System (HPSS) archive at ORNL. HPSS, which was developed by ORNL and several partners, is storage-system software that leads the computer industry in data capacity and transfer speeds. At ORNL, HPSS is used for DOE’s Atmospheric Radiation Measurement, or ARM, data archive. This archive contains more than 4 million files representing more than 25 terabytes of data. (Top photo by Tom Cerniglio; bottom photo by Curtis Boles)

A scientist needs data about how different types of clouds reflect, absorb, and transmit the energy of sunlight. The data, based on measurements taken by instruments on the ground and aboard airplanes and satellites, will help the scientist improve the accuracy of a computer model in predicting the influence of industrial emissions of greenhouse gases on global warming.

The scientist accesses a Web-based interface and requests 100 files of data from the Department of Energy’s Atmospheric Radiation Measurement (ARM) data archive, located at ORNL. The archive stores its data on tape drives (for slower but higher-capacity storage) and disks (for high-speed access); together they contain more than 4 million files representing more than 25 terabytes of data. Three robots retrieve the tapes on which the requested files are stored and load them so the files can be copied to the disk drive of the ARM Web site server. Within an hour, the scientist can access the requested files.
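The workflow above comes down to a simple rule: serve files that are already cached on disk right away, and queue tape mounts for everything else. The sketch below illustrates that staging logic in Python; the file names, tape identifiers, and function names are hypothetical illustrations and do not represent the actual HPSS client interface.

```python
# Illustrative sketch only -- not the HPSS API. It models the staging workflow
# described above: files already in the disk cache are served at once, while
# files that live only on tape are grouped by cartridge so a robot can mount
# each tape once and copy the requested files to the Web server's disk.
from dataclasses import dataclass


@dataclass
class ArchiveFile:
    name: str      # hypothetical ARM-style file name
    on_disk: bool  # True if a copy is already in the disk cache
    tape_id: str   # cartridge holding the tape copy


def stage_request(files):
    """Split a request into files ready now and tapes that must be mounted."""
    ready, tapes_to_mount = [], set()
    for f in files:
        if f.on_disk:
            ready.append(f.name)
        else:
            tapes_to_mount.add(f.tape_id)  # mount each cartridge only once
    return ready, tapes_to_mount


# Example: a three-file request where one file is cached and two share a tape.
request = [
    ArchiveFile("sgp_cloud_19990701.cdf", on_disk=True, tape_id="T0412"),
    ArchiveFile("sgp_cloud_19990702.cdf", on_disk=False, tape_id="T0981"),
    ArchiveFile("sgp_cloud_19990703.cdf", on_disk=False, tape_id="T0981"),
]
ready, mounts = stage_request(request)
print("served from disk now:", ready)
print("tapes to mount:", sorted(mounts))
```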

For the past four years, the ARM data archive has used the High-Performance Storage System (HPSS), storage-system software that leads the computer industry in data capacity and transfer speeds and is the standard for storage in the high-performance computing community. The ARM project is one of two large customers for HPSS at DOE’s Center for Computational Sciences (CCS) at ORNL, where Laboratory researchers provide and support the data archive. HPSS manages the hierarchy of devices, which stores more than 3.5 billion measurements. It can place 12,000 new files a day into storage, and it will eventually be able to find and retrieve up to 5,000 files an hour routinely to meet growing requests for information related to global change.
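Taken at face value, the figures quoted above imply some useful back-of-the-envelope numbers: an average ARM file of roughly 6 to 7 megabytes, one new file stored about every seven seconds, and a target retrieval rate of a bit over one file per second. The short Python calculation below works these out; binary (1,024-based) units are assumed, since the article does not specify which convention applies.

```python
# Back-of-envelope check of the archive figures quoted above.
# Binary units assumed (1 TB = 1024**4 bytes); treat results as rough estimates.
TB = 1024 ** 4
MB = 1024 ** 2

archive_bytes = 25 * TB
archive_files = 4_000_000
print(f"average file size: ~{archive_bytes / archive_files / MB:.1f} MB")   # ~6.6 MB

files_per_day = 12_000
print(f"ingest pace: one new file every ~{86_400 / files_per_day:.0f} s")   # ~7 s

files_per_hour = 5_000
print(f"retrieval target: ~{files_per_hour / 3_600:.1f} files per second")  # ~1.4
```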

The other large customer is the group of climate prediction modelers using ORNL’s supercomputers. A single run of calculations can generate 1 terabyte of data that must be stored. These results may also be sent from ORNL to the data archives at DOE’s National Energy Research Scientific Computing Center (NERSC) in California in chunks of 250 megabytes.
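For a sense of scale, moving one such run in 250-megabyte pieces means on the order of 4,000 individual transfers. The snippet below does the arithmetic, again assuming binary units.

```python
# How many 250 MB chunks make up a 1-terabyte model run?
# Binary units assumed (1 TB = 1024 * 1024 MB).
run_mb = 1024 * 1024
chunk_mb = 250
print(f"chunks per run: ~{run_mb / chunk_mb:,.0f}")  # ~4,194
```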

HPSS was developed by a consortium of DOE national laboratories and IBM. The DOE participants are ORNL, Sandia, Lawrence Berkeley (LBNL), Los Alamos, and Lawrence Livermore national laboratories. HPSS, which received an R&D 100 Award in 1997, is marketed by IBM. ORNL researchers Deryl Steinert, Vicky White, and Mark Arnold have been developing the graphical user interface between the operator and the HPSS for running, monitoring, and otherwise managing the system. More than 70 terabytes are now stored in ORNL’s production HPSS installation, managed by Stan White, Nina Hathaway, and Tim Jones.

The ORNL mass-storage program also includes the Probe Storage Research Facility, operated by Dan Million. In one Probe project, researchers Nagiza Samatova and George Ostrouchov are investigating the use of data mining to extract meaningful information from massive scientific datasets.

Probe resources are also used for developing new software to send larger chunks of data more rapidly over the network to such facilities as the CAVE virtual reality theater at ORNL (see Visualization Tools).

“Our Probe staff recently accomplished one of our goals,” says Randy Burris, manager of data storage systems for CCS. “Thanks in part to work by ORNL researchers Tom Dunigan and Florence Fowler on network protocols, we are now using the bandwidth between CCS and NERSC more effectively. We are now transmitting more than 12 megabytes per second over ESnet, DOE’s semiprivate portion of the Internet.”

Probe researchers also have a role in several projects funded by DOE’s Scientific Discovery through Advanced Computing (SciDAC) program. For the Scientific Data Management Integrated Software Infrastructure Center, a SciDAC project led by Arie Shoshani of LBNL, Probe resources will be used to develop ways to improve data access and transfer and to test and implement other concepts. Probe resources are also being used in the DOE Science Grid and the Earth Systems Grid II projects. The SciDAC project on climate prediction, led by John Drake, will use the Probe facility to determine how to transfer bulk amounts of data over the wide-area network. In work for the SciDAC project on astrophysics modeling, led by Tony Mezzacappa, Ross Toedte will use Probe resources as he develops an effective visualization of the details of a stellar explosion. Finally, Net100 project researchers will use Probe resources as they seek to improve computer operating systems so that excellent network throughput can be achieved without extensive application-specific tuning.
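The quoted wide-area rate also puts the climate modelers’ terabyte-scale runs in perspective: at a sustained 12 megabytes per second, a full terabyte takes roughly a day to move. The estimate below is a rough sketch; the sustained rate and binary units are assumptions, and real throughput varies with protocol tuning and competing traffic.

```python
# Rough wide-area transfer time at the quoted rate. A sustained 12 MB/s and
# binary units (1 TB = 1024 * 1024 MB) are assumptions; actual throughput
# depends on protocol tuning and other traffic on the link.
rate_mb_per_s = 12
run_mb = 1024 * 1024

hours = run_mb / rate_mb_per_s / 3_600
print(f"1 TB at {rate_mb_per_s} MB/s: ~{hours:.0f} hours")  # ~24 hours
```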

The production and research elements of ORNL’s mass-storage program are providing valuable services to computational scientists throughout the Laboratory and promise still more to come.

###
