The ability to collect and analyze massive amounts of data is rapidly transforming science, industry and everyday life, but what we have seen so far is likely just the tip of the iceberg. Many of the benefits of "Big Data" have yet to surface because of a lack of interoperability, missing tools and hardware that is still evolving to meet the diverse needs of scientific communities.
One of the National Science Foundation's (NSF) priority goals is to improve the nation's capacity in data science by investing in the development of infrastructure, building multi-institutional partnerships to increase the number of U.S. data scientists and making data more useful and easier to use.
As part of that effort, NSF today announced $31 million in new funding to support 17 innovative projects under the Data Infrastructure Building Blocks (DIBBs) program. The 2014 awards, made in the program's second year, support research in 22 states and touch on topics in computer science, information technology and nearly every field of science supported by NSF.
"Developed through extensive community input and vetting, NSF has an ambitious vision and strategy for advancing scientific discovery through data," said Irene Qualters, division director for Advanced Cyberinfrastructure at NSF. "This vision requires a collaborative national data infrastructure that is aligned to research priorities and that is efficient, highly interoperable and anticipates emerging data policies."
This year's data cyberinfrastructure awards build capacity and capability across the nation and across research communities and complement previous awards.
"Each project tests a critical component in a future data ecosystem in conjunction with a research community of users," Qualters said. "This assures that solutions will be applied and use-inspired."
NSF sees these building blocks as digital components that can be joined together to develop the foundations for a robust data infrastructure. The building blocks encompass hardware, software and networking tools, as well as the communities and people who manage data and who are the practitioners of data science.
Of the 17 awards, two support early implementations of research projects that are more mature; the others support pilot demonstrations. Each is a partnership between researchers in computer science and other science domains.
One of the two early implementation grants will support a research team led by Geoffrey Fox, a professor of computer science and informatics at Indiana University. Fox's team plans to create middleware and analytics libraries to allow data science to work at large scale on high-performance computing systems (also known as supercomputers).
Fox and his interdisciplinary team plan to test their platform with several different applications, including those used in geospatial information systems (GIS), biomedicine, epidemiology and remote sensing.
"Our innovative architecture integrates key features of open source cloud computing software with supercomputing technology," Fox said. "And our outreach involves 'data analytics as a service' with training and curricula set up in a Massive Open Online Course or MOOC."
Other institutions collaborating on the project include Arizona State University, Emory University, Rutgers University, University of Kansas, University of Utah and Virginia Tech.
The other early implementation project is led by Ken Koedinger, professor of human-computer interaction and psychology at Carnegie Mellon University. Whereas Fox's team focuses on problems in sensing and the life sciences, Koedinger's team concentrates on developing infrastructure that will drive innovation in education.
The team will develop a distributed data infrastructure called LearnSphere that will make more educational data accessible to course developers, while also motivating more researchers and companies to share their data with the greater learning sciences community. LearnSphere will include a graphical user interface, a library of analytical methods and a wide variety of educational data gathered from such sources as interactive tutoring systems, educational games and MOOCs.
"We've seen the power that data has to improve performance in many fields, from medicine to movie recommendations," Koedinger said. "Educational data holds the same potential to guide the development of courses that enhance learning while also generating even more data to give us a deeper understanding of the learning process."
Other institutions collaborating on this project include MIT, Stanford University and the University of Memphis.
The DIBBs program awarded each early implementation project $5 million over 5 years.
The second group of awards supports pilot demonstrations. These projects build on the advanced cyberinfrastructure capabilities of existing research communities to address specific challenges in science and engineering research, and extend those data capabilities to meet broad community needs. The awards provide $1.5 million over 3 years.
Among the projects supported by DIBBs awards are efforts to develop cyberinfrastructure to visualize geochronological data, like carbon testing of corals (College of Charleston); to capture and curate data for materials science research (University of Illinois at Urbana-Champaign); and to manage data emerging from the Laser Interferometer Gravitational-Wave Observatory, or LIGO (Syracuse University).
The DIBBs program is part of a coordinated strategy within NSF to advance data-driven cyberinfrastructure. It complements other major efforts including the DataONE project, the Research Data Alliance and Wrangler, a groundbreaking data analysis and management system for the national open science community.
2014 NSF DIBBs Awards
Geoffrey Fox, Indiana University: Middleware and High Performance Analytics Libraries for Scalable Data Science
Ken Koedinger, Carnegie Mellon University: Building a Scalable Infrastructure for Data-Driven Discovery and Innovation in Education
Victor Pankratius, MIT: An Infrastructure for Computer Aided Discovery in Geoscience
Klara Nahrstedt, University of Illinois at Urbana-Champaign: Timely and Trusted Curator and Coordinator Data Building Blocks
Jerome Reiter, Duke University: An Integrated System for Public/Private Access to Large-scale, Confidential Social Science Data
Hsinchun Chen, University of Arizona: DIBBs for Intelligence and Security Informatics Research and Community
Santiago Pujol, Purdue University: Building a Modular Cyber-Platform for Systematic Collection, Curation, and Preservation of Large Engineering and Science Data--A Pilot Demonstration Project
James Bowring, College of Charleston: Collaborative Research: Cyberinfrastructure for Interpreting and Archiving U-series Geochronologic Data
Stephen Ficklin, Washington State University: Tripal Gateway, a platform for next-generation data analysis and sharing
Feifei Li, University of Utah: STORM: Spatio-Temporal Online Reasoning and Management of Large Data
Duncan Brown, Syracuse University: Domain-aware management of heterogeneous workflows: Active data management for gravitational-wave science workflows
Rafal Angryk, Georgia State University Research Foundation, Inc.: Systematic Data-Driven Analysis and Tools for Spatiotemporal Solar Astronomy Data
Jia Zhang, Carnegie Mellon University: An Infrastructure Supporting Collaborative Data Analytics Workflow Design and Management
Giridhar Manepalli, Corporation for National Research Initiatives (CNRI): User Driven Architecture for Data Discovery
Shaowen Wang, University of Illinois at Urbana-Champaign: Scalable Capabilities for Spatial Data Synthesis
Amit Chourasia, University of California, San Diego: Ubiquitous Access to Transient Data and Preliminary Results via the SeedMe Platform
Christopher Jenkins, University of Colorado at Boulder: Porting Practical NLP and ML Semantics from Biomedicine to the Earth, Ice and Life Sciences