News Release

Slicing and dictionaries: a new approach to medical big data

Peer-Reviewed Publication

FAR Publishing Limited

Medical databases are expanding rapidly: both the number of observed values and the variety of variable types keep growing. As a result, individual data files are growing in both the number of rows (length) and the number of columns (width). For instance, the chartevents file in the MIMIC 3.0 database contains hundreds of millions of records, and the numeric file in the Amsterdam Critical Care Database version 1.0.2 is similarly large. In contrast, the core file of the first wave (Wave 1) of the ELSA dataset contains only 12,099 records but includes up to 4,484 variables.

Querying, cleaning, and processing databases of this scale presents numerous challenges. Common data access methods, such as SQL queries, conversion to distributed storage systems, and specialized data platforms, each have drawbacks. SQL is versatile and standard, but it has a steep learning curve and becomes inefficient for complex queries. Distributed storage systems such as Hadoop scale well but are costly to deploy and maintain and require specialized technical teams, so most clinical researchers cannot apply them independently.

To address these challenges, this paper proposes an innovative "slicing + dictionary" data processing strategy, based on the theory of data decomposition and restructuring, extended and applied to specific scenarios in medical big data. This method effectively reduces the amount of data processed in a single operation, constructs an efficient indexing system through preprocessing, and maintains clinical relevance during data decomposition and reorganization.

The strategy comprises two core components: data slicing and dictionary construction. Data slicing employs a multi-dimensional strategy to adapt to different clinical research needs, including clinical dimension slicing (dividing by parameter types like vital signs), event dimension slicing (constructing data views around key clinical events such as surgery), and hybrid dimension slicing (creating composite slices by combining multiple features). The slicing granularity can be flexibly adjusted. The approach draws inspiration from distributed database sharding technology while incorporating clinical semantic considerations. Slicing is classified into vertical (for row-dominated data) and horizontal (for column-heavy data), and both can be applied together to extremely large datasets.
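The slicing ideas described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout, the `CATEGORY_OF` mapping, and all function names are assumptions made up for illustration. It shows a long table split by clinical category, a wide table split into column groups, and a view built around a clinical event.

```python
from collections import defaultdict

# Hypothetical mapping from measured item to clinical category
# (an assumption for illustration, not taken from the paper).
CATEGORY_OF = {
    "heart_rate": "vital_signs",
    "resp_rate": "vital_signs",
    "creatinine": "labs",
}


def slice_rows_by_category(records, category_of):
    """Clinical-dimension slice: split a long table into one slice per category."""
    slices = defaultdict(list)
    for rec in records:
        # Items without a known category fall into an "other" slice.
        slices[category_of.get(rec["item"], "other")].append(rec)
    return dict(slices)


def slice_columns(rows, column_groups, key="id"):
    """Split a wide table into narrower slices that all keep the key column."""
    return {
        name: [{key: row[key], **{c: row[c] for c in cols}} for row in rows]
        for name, cols in column_groups.items()
    }


def slice_around_event(records, event_time, window):
    """Event-dimension slice: keep records within +/- window of the event time."""
    return [r for r in records if abs(r["time"] - event_time) <= window]


# Toy data; times are in arbitrary units.
records = [
    {"patient": 1, "item": "heart_rate", "time": 0, "value": 80},
    {"patient": 1, "item": "creatinine", "time": 5, "value": 1.1},
    {"patient": 2, "item": "resp_rate", "time": 12, "value": 18},
    {"patient": 2, "item": "gcs", "time": 30, "value": 15},
]
wide_rows = [{"id": 1, "age": 64, "sex": "F", "bmi": 27.5}]

row_slices = slice_rows_by_category(records, CATEGORY_OF)
column_slices = slice_columns(
    wide_rows, {"demographics": ["age", "sex"], "body": ["bmi"]}
)
event_slice = slice_around_event(records, event_time=10, window=5)
```

A hybrid slice could be sketched by chaining these helpers, for example by applying `slice_around_event` to a single category slice. In practice each slice would be written to its own file so that a query only loads the slice it needs.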

Dictionary construction acts as a bridge between user query intentions and data slices, featuring an encoding-description-location-attribute structure, a multi-level classification system, synonym mapping, and cross-database compatibility. It draws on the experience of the Unified Medical Language System (UMLS) in integrating biomedical terminology, enabling researchers to retrieve data using standardized clinical terms.
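As a rough sketch of how such a dictionary might look in code, the Python example below models one entry with the encoding, description, location, and attribute fields plus a synonym list. The class names, the code `HR001`, and the file path are hypothetical placeholders, not values from the paper or from UMLS.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    code: str                 # encoding (hypothetical local code)
    description: str          # human-readable clinical term
    location: str             # slice that holds the data (hypothetical path)
    attributes: dict = field(default_factory=dict)  # e.g. unit, category path
    synonyms: tuple = ()      # alternative terms a researcher might type


class SliceDictionary:
    """Maps a code, description, or synonym to the entry that locates its slice."""

    def __init__(self, entries):
        self._by_term = {}
        for entry in entries:
            for term in (entry.code, entry.description, *entry.synonyms):
                self._by_term[self._normalize(term)] = entry

    @staticmethod
    def _normalize(term):
        # Case- and whitespace-insensitive matching.
        return term.strip().lower()

    def lookup(self, term):
        """Return the matching entry, or None if the term is unknown."""
        return self._by_term.get(self._normalize(term))


dictionary = SliceDictionary([
    DictionaryEntry(
        code="HR001",
        description="Heart rate",
        location="slices/vital_signs.csv",
        attributes={"unit": "beats/min", "category": ["vital signs", "cardiac"]},
        synonyms=("pulse", "HR"),
    ),
])

entry = dictionary.lookup("Pulse")
```

Here a researcher who searches for "Pulse" is directed to the same slice as one who searches for "Heart rate". The nested `category` list stands in for the multi-level classification, and cross-database use would require one `location` per source database, which this sketch leaves out.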

The core advantages of this method include reduced resource requirements, improved query efficiency, enhanced flexibility, and cross-database universality, directly addressing the limitations of traditional analysis methods, as supported by relevant research.

However, the method has limitations: trade-offs in slice design, update and maintenance costs, limited support for non-standard queries, and the initial setup effort, although long-term benefits may offset that initial investment. Future research will focus on automated slicing optimization, dictionary self-learning, cloud-based deployment models, and integration with AI/ML workflows. The method is expected to become more intelligent and easier to use as these technologies advance.

In conclusion, the "slicing + dictionary" method offers a new paradigm for accessing large-scale medical databases: it lowers technical barriers and resource requirements, improves efficiency and flexibility, and empowers ordinary researchers. It holds promise for advancing medical research, promoting the democratization of medical big data, and optimizing medical resource allocation. Future work will focus on practical implementation, performance validation, and optimization for different scenarios.

