Big Data has become ubiquitous in recent years, especially in disciplines with heterogeneous and complex data patterns. This is particularly true for chemistry. In some ways, chemical compounds may be compared with synonyms in linguistics, because one particular compound can be represented in various ways. To complicate things further, some compounds have no single fixed structure and exist only as a mixture of forms that turn into each other. That's why it's important to know whether we are dealing with different compounds or with different representations of the same one.
Databases also sometimes contain errors that arise from unawareness of software features or from simple inattentiveness. Special software is needed to detect and correct such errors.
In organic chemistry, reactions are notoriously difficult to analyze. That's why the handling of reaction data in chemoinformatics is much less developed than the handling of information about single molecules.
The Laboratory of Chemoinformatics and Molecular Modeling (Kazan Federal University) has been working on this problem since 2013. The efforts have so far been funded by the Government of Russia and the Russian Science Foundation. The group includes researchers from the University of Strasbourg, the University of North Carolina, Moscow State University, Palacky University Olomouc, and the Helmholtz Center in Munich.
The Kazan researchers have learned to predict reaction characteristics, find optimal reaction conditions, and detect and correct data errors. As a result, a unique database of reaction characteristics has emerged. It currently includes 3.5 million entries. KFU is the only Russian member of the Reaxys R&D Collaboration, a collective working on chemical databases.
In this new project, titled CGRtools, KFU researchers solved a number of problems to better handle reaction information. The software library is significantly richer in functionality than the existing tools. CGRtools supports molecules and reactions as objects and is the only tool that supports CGRs. It treats chemical objects similarly to standard Python data types such as integers and strings. Every chemical object is hashable thanks to canonicalization of atom numbering. The objects support transparent class inheritance, which adds new functionality (methods and attributes) without breaking what already exists.
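To illustrate why canonical atom numbering makes a chemical object hashable, here is a minimal, self-contained Python sketch. It does not use the CGRtools API: the classes `ToyMolecule` and `CountingMolecule` and the method `canonical_form` are hypothetical names invented for this example, and the brute-force renumbering is only workable for very small graphs, whereas a real library would rely on a far more efficient algorithm.

```python
from collections import Counter
from itertools import permutations


class ToyMolecule:
    """Hypothetical stand-in for a chemical object; NOT the CGRtools API."""

    def __init__(self, atoms, bonds):
        # atoms: mapping of arbitrary atom number -> element symbol
        # bonds: iterable of (atom number, atom number, bond order)
        self.atoms = dict(atoms)
        self.bonds = list(bonds)

    def canonical_form(self):
        # Brute force: try every renumbering 0..n-1 and keep the smallest
        # description, so the result no longer depends on input numbering.
        keys = sorted(self.atoms)
        best = None
        for perm in permutations(range(len(keys))):
            relabel = dict(zip(keys, perm))
            symbols = [None] * len(keys)
            for key, symbol in self.atoms.items():
                symbols[relabel[key]] = symbol
            edges = sorted(
                (min(relabel[a], relabel[b]), max(relabel[a], relabel[b]), order)
                for a, b, order in self.bonds
            )
            candidate = (tuple(symbols), tuple(edges))
            if best is None or candidate < best:
                best = candidate
        return best

    def __eq__(self, other):
        if not isinstance(other, ToyMolecule):
            return NotImplemented
        return self.canonical_form() == other.canonical_form()

    def __hash__(self):
        # Equal canonical forms give equal hashes, as sets and dicts require.
        return hash(self.canonical_form())


class CountingMolecule(ToyMolecule):
    """Subclass that adds a method without changing inherited behaviour."""

    def element_counts(self):
        return Counter(self.atoms.values())


# Heavy-atom skeleton of ethanol (C-C-O), numbered in two different ways.
ethanol_a = ToyMolecule({1: "C", 2: "C", 3: "O"}, [(1, 2, 1), (2, 3, 1)])
ethanol_b = ToyMolecule({10: "O", 20: "C", 30: "C"}, [(10, 20, 1), (20, 30, 1)])
# Dimethyl ether (C-O-C): same atoms, different connectivity.
ether = ToyMolecule({1: "C", 2: "O", 3: "C"}, [(1, 2, 1), (2, 3, 1)])

unique = {ethanol_a, ethanol_b, ether}  # the two ethanol numberings collapse
counted = CountingMolecule({5: "C", 6: "C", 7: "O"}, [(5, 6, 1), (6, 7, 1)])
```

In this sketch the two ethanol objects compare equal and share a hash, so they behave like ordinary Python values in sets and as dictionary keys, while the subclass gains `element_counts` and keeps the inherited equality and hashing intact.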
Importantly, the library is freely available at https://github.com/cimmkzn/CGRtools.
Journal of Chemical Information and Modeling