An iterative machine learning approach has identified elusive amino acid patterns, more than 800 million years old, that facilitate protein interactions.
Leucine-aspartic acid (LD) motifs are short amino acid sequences embedded within some proteins that link them to cellular molecules controlling cell adhesion, motility and survival. They are also known to play a role in the spread of cancer cells and in cardiovascular and infectious diseases. LD motifs were first identified in 1996 in the paxillin family of proteins. Only three other LD motif-containing proteins have been discovered since then, and scientists still do not know how important LD motifs are or how many other types of proteins contain them.
KAUST structural biologist Stefan Arold and computational bioscientists Xin Gao and Vladimir Bajic combined their teams' efforts to develop a machine learning tool, called LD Motif Finder (LDMF), that scans the human proteome for LD motif patterns. This was no small task given the tiny number of known LD motif-containing proteins that could be used to train the tool.
The team "taught" their computational tool using biophysical and structural data from known LD motifs and their proteins. To improve the accuracy of their algorithm, they included a round of experimental testing of its initial predictions and trained the tool to learn from these results.
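The cycle of training, predicting, testing and retraining can be pictured with a toy sketch in Python. It is not the LDMF algorithm: a simple position-frequency profile stands in for the team's model, the sequences are made-up placeholders, and `validate` is a hypothetical stand-in for the round of experimental testing.

```python
from collections import Counter


def build_profile(motifs):
    """Count residues at each position across equal-length example motifs."""
    width = len(motifs[0])
    return [Counter(m[i] for m in motifs) for i in range(width)]


def score(profile, n_examples, window):
    """Mean fraction of training examples sharing each residue of the window."""
    matches = sum(column[res] for column, res in zip(profile, window))
    return matches / (n_examples * len(profile))


def scan(profile, n_examples, sequence, threshold):
    """Return (offset, window) pairs whose score reaches the threshold."""
    width = len(profile)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        if score(profile, n_examples, window) >= threshold:
            hits.append((i, window))
    return hits


def refine(known_motifs, proteins, validate, threshold=0.6, rounds=2):
    """Alternate prediction and validation, retraining on confirmed hits.

    `validate` is a hypothetical callback standing in for a lab experiment;
    it takes a candidate window and returns True if the motif is confirmed.
    """
    positives = list(known_motifs)
    for _ in range(rounds):
        profile = build_profile(positives)
        n_examples = len(positives)
        for protein in proteins:
            for _, window in scan(profile, n_examples, protein, threshold):
                if window not in positives and validate(window):
                    positives.append(window)
    return positives


# Toy placeholder data, not real LD motifs or proteins.
toy_motifs = ["LDALLADL", "LDELLKEL"]
toy_proteins = ["MKLDSLLEELGG"]
```

Calling `refine(toy_motifs, toy_proteins, validate=lambda w: True)` adds every candidate that clears the threshold to the training set before the next round, while a `validate` that always returns `False` leaves the set unchanged. A fuller version would presumably also learn from rejected candidates, which this sketch ignores.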
A final step, performed in collaboration with KAUST colleagues Mariusz and Lukasz Jaremko, involved three-dimensional structural analyses of the association between newly identified LD motifs and known LD motif-binding proteins.
Using this integrative approach, the researchers were able to identify 12 new human proteins that carry functional LD motifs. "This gives us a good idea of how many of these motifs exist within the human proteome," says Arold. "It seems there are far fewer than researchers initially suggested. Of course, this does not mean that they are biologically irrelevant."
The researchers found that these proteins containing LD motifs had functions related to cell adhesion and morphogenesis, suggesting that LD motifs significantly define the proteins' cellular roles. Indeed, the researchers observed alterations in cell adhesion or spreading when fluorescently labeled LD motifs were injected into cultured human cells.
Given that the machine learning tool made it easy to scan whole proteomes, the team also investigated the genomes of mammals, birds, fish, worms, insects and microbes for LD motifs. This large-scale analysis allowed them to conclude that LD motif signaling evolved more than 800 million years ago in unicellular organisms, possibly by co-opting ancestral interaction sequences that label proteins for export out of the nucleus.
"The model, which is freely available online, is highly accurate and sensitive, but there is still room for improvement," says Ph.D. student Meshari Alazmi, first author of the study.
The team hopes to continue developing their model to study the evolution and prevalence of other short protein-protein interaction motifs across species.