The total amount of data created, captured, copied and consumed globally in 2020 exceeded 64 trillion gigabytes, and German market research firm Statista projects that by 2025 the total data created could surpass 180 trillion gigabytes. To put that in perspective, with just one gigabyte, you could send 350,000 emails, view 600 web pages and stream 200 songs.
This data revolution has transformed scientific research, especially in the physical and life sciences, including health care. The problem is that there is only enough storage capacity for about 10% of the data produced globally. Networked, data-intensive computing needs new algorithms and architectures that account for how data is stored, processed and used.
Farzad Farnoud Hassanzadeh, an assistant professor of electrical and computer engineering and computer science at the University of Virginia School of Engineering and Applied Science, has earned a prestigious National Science Foundation CAREER award to meet this need. He will use his $560,000 five-year award to develop new models and data compression algorithms that will make the storage and analysis of large data sequences more efficient and accurate.
The CAREER program, one of the NSF’s most prestigious awards for early-career faculty, recognizes the recipient’s potential for leadership in research and education. Farnoud leads the information processing and storage lab, whose members solve problems at the intersection of information theory, computational biology and machine learning — a research strength of the Charles L. Brown Department of Electrical and Computer Engineering.
“From an information theory perspective, data is just a sequence of symbols, which could be letters, DNA symbols or bytes,” Farnoud said. “The contents of a book are a sequence of letters. Spelling and grammar rules help us anticipate which letters naturally follow to form a word, and which words naturally follow to form a sentence. If you can predict the next word well, you can compress the sequence very well. We can prove this mathematically.”
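The link between prediction and compression that Farnoud describes can be illustrated with a short Python sketch. It rests on a standard result from information theory: a symbol that a model assigns probability p can be encoded in about -log2(p) bits, so better predictions mean fewer bits. The function name `ideal_code_length_bits`, the sample text and the add-one smoothing are assumptions chosen for illustration, not part of Farnoud's work.

```python
import math
from collections import defaultdict


def ideal_code_length_bits(data: bytes, order: int) -> float:
    """Sum of -log2 p(symbol | previous `order` bytes) under an adaptive model.

    Hypothetical helper for illustration only. The model starts with no
    knowledge and updates its counts after each symbol, so a decoder could
    rebuild the same predictions step by step.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> symbols seen
    bits = 0.0
    for i, symbol in enumerate(data):
        context = data[max(0, i - order):i]
        # Add-one smoothing over the 256 possible byte values keeps the
        # probability of a not-yet-seen symbol above zero.
        p = (counts[context][symbol] + 1) / (totals[context] + 256)
        bits += -math.log2(p)
        counts[context][symbol] += 1
        totals[context] += 1
    return bits


text = b"the cat sat on the mat. " * 200
uniform_bits = 8 * len(text)                         # no prediction at all
order0_bits = ideal_code_length_bits(text, order=0)  # letter frequencies only
order2_bits = ideal_code_length_bits(text, order=2)  # uses the previous 2 bytes

print(f"uniform: {uniform_bits} bits")
print(f"order 0: {order0_bits:.0f} bits")
print(f"order 2: {order2_bits:.0f} bits")
```

On this toy text, the model that looks at the two preceding bytes needs fewer bits than the model that only counts letter frequencies, and both need fewer than the 8 bits per byte of an unpredicted encoding.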
For short sequences of data, models that predict the data that will probably come next, called probabilistic models, can be very helpful. But the models struggle to find and analyze patterns that emerge in long sequences. In the example of words and sentences in a book, what comes next depends on a small number of previous letters or words, not what appeared two pages before.
“I am interested in patterns that emerge at long ranges,” Farnoud said, referring to a data feature called long-range dependence. In a sequence, long-range dependence is a measure of how far back you need to go to describe the probabilistic characteristics of the symbol you are observing.
“Let’s say the sequence itself is a terabyte of data, or 10 to the 12th power bytes,” Farnoud said. “It is normal to assume that a byte’s probabilistic characteristics depend on maybe the previous 10 bytes. But if long-range dependencies exist, the byte’s probabilistic characteristics may depend on a million bytes before it.”
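A scaled-down Python sketch can make the contrast concrete. It builds a synthetic bit sequence in which each bit usually copies the bit 1,000 positions earlier, then checks how often bits agree at a short lag and at the long lag. The lag of 1,000 (standing in for the million bytes in the quote), the copy probability and the function names are assumptions for illustration.

```python
import random


def make_sequence(n: int, lag: int, copy_prob: float, seed: int = 0) -> list:
    """Hypothetical generator: each bit copies the bit `lag` steps back
    with probability `copy_prob`, otherwise it is a fresh random bit."""
    rng = random.Random(seed)
    seq = []
    for i in range(n):
        if i >= lag and rng.random() < copy_prob:
            seq.append(seq[i - lag])        # depends on the distant past
        else:
            seq.append(rng.getrandbits(1))  # independent of everything else
    return seq


def agreement_at_lag(seq: list, lag: int) -> float:
    """Fraction of positions whose bit equals the bit `lag` steps earlier."""
    matches = sum(seq[i] == seq[i - lag] for i in range(lag, len(seq)))
    return matches / (len(seq) - lag)


seq = make_sequence(n=50_000, lag=1_000, copy_prob=0.9)
near = agreement_at_lag(seq, 10)     # what a short-context model can see
far = agreement_at_lag(seq, 1_000)   # where the structure actually lives

print(f"agreement at lag 10:   {near:.3f}")
print(f"agreement at lag 1000: {far:.3f}")
```

In this construction, a model limited to the previous 10 symbols sees what looks like coin flips, while the structure at lag 1,000 makes most bits predictable.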
These long-range dependencies between elements of the sequence are not captured well in current models. Farnoud will use his CAREER award to construct probabilistic models that describe long-range dependence accurately and realistically, leveraging the statistical properties of the models to improve tasks like prediction or data compression.
With data sets this large, it would take a supercomputer to “zip” the file. Determining which patterns are meaningful provides equal insight into which patterns are redundant and can be removed during data compression. “If you want to do data compression effectively at scale, you need to be able to model these long-range dependencies and take advantage of these patterns,” Farnoud said.
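Off-the-shelf compressors already show why the reach of a model matters. The Python sketch below is only an analogy, not Farnoud's method: it compresses a block of random bytes followed by an exact copy of itself. The DEFLATE format behind `zlib` can refer back at most 32 KiB, so it cannot notice a repeat 100,000 bytes away, whereas `lzma` at its default preset uses a dictionary of several megabytes and can. The block size and seed are arbitrary choices.

```python
import lzma
import random
import zlib

rng = random.Random(0)
# 100,000 pseudo-random bytes: essentially incompressible on their own.
block = bytes(rng.getrandbits(8) for _ in range(100_000))
# An exact repeat, but at a distance of 100,000 bytes.
data = block + block

zlib_size = len(zlib.compress(data, 9))  # 32 KiB look-back window
lzma_size = len(lzma.compress(data))     # much larger dictionary

print(f"original: {len(data)} bytes")
print(f"zlib:     {zlib_size} bytes")
print(f"lzma:     {lzma_size} bytes")
```

The compressor with the short window ends up storing both copies nearly in full, while the one that can look far enough back stores the second copy as references to the first.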
In addition to theoretical and scientific advances, Farnoud’s data-compression methods could enable large-scale data storage systems to operate more efficiently, requiring less hardware, computing resources and electrical power.
The same limitations of existing models also create challenges when analyzing data, particularly genomic data, which has been shaped over billions of years by evolutionary processes. For example, repeats are a prevalent feature of genomes and can be better analyzed by models that can handle long-range dependence. Farnoud will develop better statistical algorithms to analyze these sequences.
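To see what a repeat looks like as data, consider the toy Python sketch below. It records where each short substring (a "k-mer") starts and reports the gaps between successive occurrences. The function name `repeat_distances` and the made-up sequence are assumptions for illustration; real genome analysis tools are far more elaborate, and this is not the statistical approach described in the article.

```python
from collections import defaultdict


def repeat_distances(seq: str, k: int) -> dict:
    """Hypothetical helper: map every k-mer that occurs more than once
    to the list of gaps between its successive start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {
        kmer: [b - a for a, b in zip(starts, starts[1:])]
        for kmer, starts in positions.items()
        if len(starts) > 1
    }


# A made-up sequence: the same 7-letter motif on both sides of a run of C's.
toy = "GATTACA" + "C" * 20 + "GATTACA"
gaps = repeat_distances(toy, k=7)

print(gaps["GATTACA"])  # distance between the two copies of the motif
print(gaps["CCCCCCC"])  # the run of C's repeats at distance 1
```

The run of C's is a short-range pattern that a small context can capture, while the two copies of the motif are related only across a longer gap, which is the kind of structure a long-range model is meant to describe.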
“There are statistical tests that biologists and phylogenomic scientists use to determine if two organisms are related and how many mutations are needed for an evolutionary event to happen, for certain diseases to develop,” Farnoud said. “Those types of studies would benefit from having these more accurate models and hypothesis testing and prediction tools.”