Machine learning reveals hidden dimensions of functional similarity in proteins
PNAS, Jan 2 2026
Large language models trained on biological sequences, rather than natural language, are transforming biology, from predicting human genetic disease (1, 2) to designing new-to-nature proteins (3–5). In this issue of PNAS, Cao et al. (6) extend these applications to detect the molecular underpinnings of phenotypic convergence by decoding patterns invisible to traditional sequence analysis approaches.

Fig.: Detecting molecular convergence using protein language model embeddings.
The Traditional View of Molecular Convergence
Evolution often arrives at strikingly similar solutions to common environmental challenges. Bats and toothed whales independently evolved sonar-like echolocation systems for navigating in hard-to-see environments (7). Distantly related lineages of fish developed antifreeze proteins to survive polar seas (8). Flowering plants from disparate families converged on crassulacean acid metabolism (CAM) as a water-use efficient adaptation of photosynthesis (9). These phenotypic convergences are encoded by genomic changes and therefore act as natural experiments that can illuminate the mapping between genotype and phenotype.
For decades, the molecular basis of convergent evolution has been investigated primarily through the lens of site-by-site convergence: the independent emergence of the same amino acid change at homologous positions in different lineages (10). These approaches have uncovered parallel substitutions in proteins underlying echolocation in bats and cetaceans (11), and the repeated evolution of antifreeze glycoproteins in polar fishes (8). Yet this approach is inherently limited, because proteins are not simple strings of independent amino acids but intricate three-dimensional structures in which distant residues interact, function emerges from collective properties, and multiple molecular routes may lead to the same function. Tools like PSI-BLAST (12) and HMMER (13) began to address this by identifying functional relationships through shared sequence profiles rather than single amino acid changes. However, they still assume that functional convergence must be accompanied by similarity in sequence space, even though proteins can share as little as 20% sequence identity and still adopt similar folds and catalyze the same reactions (14). If the same function can be achieved through highly diverse sequences, then convergent evolution may operate not just at single sites or highly similar sequences but also across some higher-order properties of proteins.
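To make the two sequence-based notions above concrete, here is a minimal Python sketch of a site-by-site convergence check and a pairwise percent-identity calculation. It is an illustration only, not the method of Cao et al. or of any cited tool. It assumes the sequences are already aligned to equal length with `-` as the gap character and that ancestral sequences for each lineage are given (in practice these would have to be inferred). The function names and the toy sequences are hypothetical.

```python
from typing import List, Tuple

GAP = "-"  # assumed gap character in the alignment


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def convergent_sites(
    ancestor_a: str, derived_a: str, ancestor_b: str, derived_b: str
) -> List[Tuple[int, str]]:
    """Return (0-based column, residue) where both lineages changed
    from their own ancestor and ended at the same derived residue."""
    seqs = (ancestor_a, derived_a, ancestor_b, derived_b)
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    sites = []
    for i, (a0, a1, b0, b1) in enumerate(zip(*seqs)):
        if GAP in (a0, a1, b0, b1):
            continue  # skip columns with a gap in any sequence
        if a1 == b1 and a1 != a0 and b1 != b0:
            sites.append((i, a1))
    return sites


# Hypothetical toy alignment: two lineages, each with an ancestor and a descendant.
anc_a, der_a = "MKTLVA", "MKSLVA"
anc_b, der_b = "MRTLIA", "MRSLIA"

sites = convergent_sites(anc_a, der_a, anc_b, der_b)
identity = percent_identity(der_a, der_b)
print(sites, round(identity, 1))
```

In this toy case both lineages independently replace T with S at the third column, which is the kind of signal site-by-site methods look for. Neither function can see convergence that arises through different substitutions at different positions, which is the limitation the paragraph above describes.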


