NSF grant triggers wide computing possibilities from BTeV
Everybody talks about crashing
computers, but nobody does anything about it.
But with a $4.98 million National
Science Foundation grant in the area
of Information Technology Research,
Fermilab's B-physics at the Tevatron
experiment (BTeV) just might help
solve the puzzle of "Why don't things
always work as well as we'd like?"
That question represents the theme
for educational outreach components of the effort to build a fault tolerance
system into the BTeV trigger and data acquisition project. BTeV's goal is
assembling as many as 10,000 parallel computers and making them work
together dependably and consistently in the triggering and DAQ
system, despite incorporating different kinds of computers with different
tasks. The BTeV trigger will be challenged to reconstruct 15 million particle
events per second, and to use that reconstruction data in deciding which
events to keep for further analysis. It will be further challenged to perform
the reconstructions around the clock, while spotting and correcting any
problems that arise.
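The general idea of routing work around a failed machine can be sketched in a few lines of Python. This is only an illustration of the concept, not the BTeV design: the `Worker` class, the `dispatch` function, the node names and the simulated `broken` flag are all hypothetical.

```python
class WorkerFault(Exception):
    """Raised by a hypothetical worker that cannot process an event."""


class Worker:
    def __init__(self, name, broken=False):
        self.name = name
        self.broken = broken  # simulated hardware or software fault

    def reconstruct(self, event):
        if self.broken:
            raise WorkerFault(f"{self.name} failed on event {event}")
        return (event, self.name)


def dispatch(events, workers):
    """Process every event, routing around workers that fail."""
    healthy = list(workers)
    results, faults = [], []
    for event in events:
        while healthy:
            worker = healthy[0]
            try:
                results.append(worker.reconstruct(event))
                # Rotate so the load is spread across healthy workers.
                healthy.append(healthy.pop(0))
                break
            except WorkerFault as err:
                faults.append(str(err))
                healthy.pop(0)  # take the faulty worker out of service
        else:
            # Reached only when no healthy worker remains for this event.
            raise RuntimeError("no healthy workers left")
    return results, faults


workers = [Worker("node-a"), Worker("node-b", broken=True), Worker("node-c")]
results, faults = dispatch(range(4), workers)
print(results)
print(faults)
```

In this toy run every event is still processed even though one of the three workers is broken; the fault is recorded and the remaining workers absorb the load. Doing the same across thousands of dissimilar machines, at full trigger rate, is the hard part the project sets out to address.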
The idea of self-awareness or introspection in computers is not new. But the
idea of achieving "fault tolerance" or self-correction at this level of complexity
is new and intriguing.
"People have written fault tolerant systems for smaller numbers of
computers, on the order of hundreds," said BTeV cospokesperson Joel
Butler. "But when you get to the ambitious level we're working at, with
perhaps 10,000 computers, those ideas do not scale up. You can't just
change the number of processors and have it all work out. In this very
self-aware computing system, the software will be expected to solve
problems from the level of the smallest processor in the system all the way
up to the level of whether the whole thing is really behaving as expected."
Imagine the possibilities.
"This is a very hot topic in electrical engineering and computer science right
now, this concept of evolvability and fault tolerance," said BTeV collaborator
Paul Sheldon of Vanderbilt University, the project's principal investigator.
"With thousands of components, you'll always have something going wrong,
somewhere. You want a system to be able to adapt to a fault because if it
doesn't, you'll crash or miss something critically important. It [fault tolerance]
is going to be useful in complex systems such as weather monitoring. Or in
vehicle navigation where you can literally crash. Think of the country's air
traffic control system, which is really old and can't be easily upgraded.
That's why we try these things [in science] first, to make them work without
the agony. Technology like this eventually percolates down, and hopefully it
will someday make your own computer crash less often."
Both the technology and the thinking will also percolate beyond Fermilab,
beyond the four collaborating universities (Vanderbilt, the University of
Illinois, Pittsburgh and Syracuse), and beyond the graduate students who
will be working on the project.
Adapting the QuarkNet model established by Fermilab's Education
Department, the BTeV trigger computing project aims to involve high school
teachers. The QuarkNet method trains high school teachers to train other
teachers, as well as connecting students through the Web to ongoing
particle physics experiments. The BTeV educational adaptation would
include exercises in the concepts of exception handling and fault tolerance: in
other words, how to work around glitches without an entire structure coming
apart, in day-to-day applications. What happens if it rains on graduation
day? How does a baseball league schedule work, especially when games
are canceled? How is the production of a play affected when understudies
Underlying the computing connections is a basic tenet of science: the need
for a methodical way of thinking, of exploring the consequences when things
go wrong, of devising plans to correct or work around those consequences.
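The baseball-schedule question maps naturally onto exception handling in code. The following Python sketch is one way such a classroom exercise might look; it is an assumption for illustration, not material from QuarkNet or BTeV, and the names `RainedOut`, `play` and `run_season` are invented here.

```python
class RainedOut(Exception):
    """Signals that a game could not be played on its scheduled date."""


def play(game, date, rainy_dates):
    if date in rainy_dates:
        raise RainedOut(f"{game} on {date}")
    return f"{game} played on {date}"


def run_season(schedule, rainy_dates, makeup_dates):
    """Play each game, moving rained-out games to the next make-up date."""
    spare = list(makeup_dates)
    log = []
    for game, date in schedule:
        attempt = date
        while True:
            try:
                log.append(play(game, attempt, rainy_dates))
                break
            except RainedOut:
                if not spare:
                    # Nothing left to fall back on: record it and move on.
                    log.append(f"{game} canceled: no make-up date left")
                    break
                attempt = spare.pop(0)  # try the next make-up date
    return log


season = [("Game 1", "May 1"), ("Game 2", "May 2")]
log = run_season(season, rainy_dates={"May 2"}, makeup_dates=["May 9"])
print(log)
```

The point of the exercise is that one rained-out game changes a single entry in the schedule rather than bringing the whole season to a halt.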
"It's an important part of scientific literacy
for the general public," said Marge
Bardeen, head of Fermilab's Education
Department. "Having been through an
experience of how science works, they
would gain a better understanding of basic
research and see its value. They would
gain a better understanding of how to
make careful and responsible decisions
about science, about funding and other
issues. Also, we don't often teach science as an experimental,
research-based, kid-centered discipline. We don't often teach science the
way science is done. First, how do we help teachers understand how
scientists work; and second, how do we figure out how to do that in a
classroom? That's what QuarkNet tries to do, and that's what the BTeV
group will try to do."
The BTeV trigger system distinguishes itself by
essentially merging with the experiment, assuming the role of part of the
apparatus. The trigger system will reconstruct every bunch crossing of the
Tevatron; bunch crossings occur at a rate of 7.6 million per second, or 132
nanoseconds apart. The data system will attempt to find all the tracks and
interaction vertices, looking for evidence that there is a decay downstream
of the interaction vertex which could come from a b-particle. Then it thinks
about which events to keep and which to discard.
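The two figures for the bunch-crossing rate are the same number expressed two ways, as a quick calculation shows. The second function below goes a step further and is purely illustrative: it assumes the crossings were shared evenly among 10,000 processors, an idealization the article does not claim, to give a feel for the average time available per crossing.

```python
BUNCH_SPACING_NS = 132  # time between bunch crossings, from the article


def crossings_per_second(spacing_ns):
    """Convert a spacing in nanoseconds to a rate per second."""
    return 1e9 / spacing_ns


def time_budget_ms(spacing_ns, n_processors):
    """Average time per crossing if work were shared evenly (an idealization)."""
    return spacing_ns * n_processors / 1e6


rate = crossings_per_second(BUNCH_SPACING_NS)
print(f"{rate / 1e6:.1f} million crossings per second")
print(f"{time_budget_ms(BUNCH_SPACING_NS, 10_000):.2f} ms per crossing per processor")
```

The first line prints 7.6 million crossings per second, matching the quoted rate. Under the even-sharing assumption, the second gives an average of about 1.32 milliseconds per crossing for each processor.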
"The trigger must work reliably and quickly, over a long period of time," said
Fermilab physicist Erik Gottschalk, who has worked on designing the trigger
system. "This process is not being done off-line. It's integral to the
experiment itself instead of being a step removed, as it would be if it were
being handled off-line. If it fails, it affects the data. Everything counts on the
trigger."
And that trigger will count on the fault-tolerance software developed with the
help of the NSF grant, approximately $1 million per year for five years,
already effective as of October 1. BTeV applied for the grant after a
Fermilab technical review of the experiment proposal suggested
strengthening the fault-tolerance aspects of the system. Collaborators
reached out to people at their own institutions who were conducting this kind
of research: the Institute for Software Integrated Systems at Vanderbilt, the
Coordinated Science Laboratory at the University of Illinois, the (research group)
at Syracuse and the (research group) at Pittsburgh. Together, the
experiment and university collaborators wrote a proposal that survived
competition with thousands of other entries, emerging with a share of $156
million which NSF has targeted "to preserve America's position as the world
leader of computer science and its applications."
NSF is especially interested in possible applications, scientific and
commercial. The BTeV proposal points to a wide range of uses including
medicine (data acquisition in Positron Emission Tomography), astrophysics
(the Pierre Auger Cosmic Ray Observatory and its 1,600 detector stations),
vehicle navigation, weather monitoring and disaster warning systems,
widely available Internet services, and others yet to be described. In fact,
the collaboration intends to hold a series of workshops, inviting
representatives from these areas of technology, to discuss these
connections and expand the list.
Sheldon, as principal investigator of the project, coordinates the apportioning
of resources. He points out that the funds are directed specifically to
"No physicists are actually being funded by this grant," he said. "The whole
point was to bring in people from other disciplines."
Butler, whose experience dates back to early fixed-target experiments at
Fermilab, is enthusiastic about expanding the formal connections between
high-energy physics and computer science among several institutions.
"You would think it's the most natural of collaborations," Butler said,
"high-energy physics with its complicated computer needs, and university
computer scientists with their resources. But there really haven't been that
many examples. It's exciting that NSF has opened up this possibility."