We live in the era of big data. The huge volume of information we generate daily has major applications in science and technology, economics and management. For example, more and more companies now collect, store and analyse large-scale data sets from multiple sources to gain business insights or measure risk.
However, as Prof. Yong Zhou, one of the authors of a new study published in the KeAi journal Fundamental Research, notes: “Typically, these large or massive data sets cannot be processed with independent computers, which poses new challenges for traditional data analysis in terms of computational methods and statistical theory.”
Together with colleagues at the Chinese University of Hong Kong, Zhou, a Professor at China’s East China Normal University, has developed a new algorithm that promises to address these computational problems.
He explains: “State-of-the-art numerical algorithms already exist, such as optimal subsampling algorithms and divide and conquer algorithms. In contrast to the optimal subsampling algorithm, which samples small-scale, informative data points, the divide and conquer algorithm divides large data sets randomly into sub-data sets and processes them separately on multiple machines. While the divide and conquer method is effective in using computational resources to provide a big data analysis, a robust and efficient meta-method is usually required when integrating the results.”
In this study, the researchers focused on large-scale inference for a linear expectile regression model, which has wide applications in risk management. They propose a communication-efficient divide and conquer algorithm in which the summary statistics computed on each sub-data set are combined using confidence distributions. Zhou explains: “This is a robust and efficient meta-method for integrating the results. More importantly, we studied the relationship between the number of the machines and the sample size. We found that the requirement for the number of machines is a trade-off between statistical accuracy and computational efficiency.”
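To make the idea concrete, here is a minimal Python sketch of a divide and conquer fit for linear expectile regression. It is an illustration under stated assumptions, not the authors' published algorithm: each simulated "machine" fits the model by iteratively reweighted least squares and returns only a coefficient vector and a sandwich covariance estimate, and the results are merged by inverse-variance weighting, which is one common way of combining approximately normal confidence distributions. The function names (`fit_expectile`, `combine`, `dc_expectile`) and all parameter values are hypothetical.

```python
import numpy as np


def fit_expectile(X, y, tau=0.5, n_iter=100, tol=1e-10):
    """Fit a linear expectile regression by iteratively reweighted least squares.

    Returns the coefficient vector and a sandwich estimate of its covariance.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least squares start
    for _ in range(n_iter):
        resid = y - X @ beta
        # Asymmetric weights: tau for non-negative residuals, 1 - tau otherwise.
        w = np.where(resid >= 0, tau, 1.0 - tau)
        XtW = X.T * w
        new_beta = np.linalg.solve(XtW @ X, XtW @ y)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    resid = y - X @ beta
    w = np.where(resid >= 0, tau, 1.0 - tau)
    H = (X.T * w) @ X                    # sum of w_i * x_i x_i^T
    S = (X.T * (w * resid) ** 2) @ X     # sum of (w_i * r_i)^2 * x_i x_i^T
    H_inv = np.linalg.inv(H)
    return beta, H_inv @ S @ H_inv


def combine(local_results):
    """Merge (beta, cov) pairs by inverse-variance (precision) weighting."""
    precisions = [np.linalg.inv(cov) for _, cov in local_results]
    total = sum(precisions)
    weighted = sum(P @ b for P, (b, _) in zip(precisions, local_results))
    return np.linalg.solve(total, weighted), np.linalg.inv(total)


def dc_expectile(X, y, tau, n_machines, seed=0):
    """Split the rows at random, fit each block separately, then combine."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(len(y)), n_machines)
    # Each block stands in for one machine; only summaries leave the block.
    local = [fit_expectile(X[b], y[b], tau) for b in blocks]
    return combine(local)


# Simulated data (assumed for illustration only).
rng = np.random.default_rng(42)
n = 20000
X_demo = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y_demo = X_demo @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_dc, cov_dc = dc_expectile(X_demo, y_demo, tau=0.8, n_machines=10)
beta_full, _ = fit_expectile(X_demo, y_demo, tau=0.8)
print("divide and conquer:", beta_dc)
print("full data fit:     ", beta_full)
```

In this sketch each machine sends back only a short vector and a small matrix, which is what keeps communication cheap. Increasing `n_machines` shortens each local fit but leaves fewer observations per block, a simple way to see the trade-off between statistical accuracy and computational efficiency that Zhou describes.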
Zhou adds: “We believe the algorithm we have developed can significantly help to address the computational challenges arising from large-scale data.”
Contact the author: Yong Zhou, firstname.lastname@example.org
The publisher KeAi was established by Elsevier and China Science Publishing & Media Ltd to bring quality research to a global audience. In 2013, its focus shifted to open access publishing. It now publishes more than 100 world-class, open access, English language journals, spanning all scientific disciplines. Many of these titles are published in partnership with prestigious societies and academic institutions, such as the National Natural Science Foundation of China (NSFC).
Linear expectile regression under massive data