Patterns of interacting amino acids are so well preserved within protein families that the analysis of evolutionary comutations alone can identify pairs of contacting residues. It is also known that evolution conserves functional dynamics, i.e., the concerted motion or displacement of large protein regions or domains. Is it, therefore, possible to use a purely sequence-based analysis to identify these dynamical domains? To address this question, we introduce here a general coevolutionary coupling analysis strategy and apply it to a curated sequence database of hundreds of protein families. For most families, the sequence-based method partitions amino acids into a few clusters. When viewed in the context of the native structure, these clusters have the signature characteristics of viable protein domains: They are spatially separated but individually compact. They have a direct functional bearing too, as shown for various reference cases. We conclude that even large-scale structural and functionally related properties can be recovered from inference methods applied to evolutionarily related sequences.
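The idea of partitioning residues from sequence covariation alone can be illustrated with a minimal sketch. It is not the coupling analysis of the abstract above, whose details are not given here: mutual information between alignment columns stands in for the coevolutionary couplings, and a spectral bipartition (sign of the Fiedler vector) stands in for the clustering step. The toy alignment and all function names are hypothetical.

```python
import numpy as np

def mutual_information_matrix(msa):
    """Pairwise mutual information between alignment columns (assumed coupling proxy)."""
    n_seq, n_pos = msa.shape
    mi = np.zeros((n_pos, n_pos))
    for i in range(n_pos):
        _, inv_i = np.unique(msa[:, i], return_inverse=True)
        for j in range(i + 1, n_pos):
            _, inv_j = np.unique(msa[:, j], return_inverse=True)
            # Joint frequency table of the symbols seen at columns i and j.
            joint = np.zeros((inv_i.max() + 1, inv_j.max() + 1))
            np.add.at(joint, (inv_i, inv_j), 1.0)
            joint /= n_seq
            # Product of the two marginal distributions.
            indep = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
            mask = joint > 0
            mi[i, j] = mi[j, i] = np.sum(joint[mask] * np.log(joint[mask] / indep[mask]))
    return mi

def spectral_bipartition(weights):
    """Split nodes into two groups by the sign of the Fiedler vector."""
    laplacian = np.diag(weights.sum(axis=1)) - weights
    _, vecs = np.linalg.eigh(laplacian)  # eigenvalues returned in ascending order
    return (vecs[:, 1] > 0).astype(int)

# Hypothetical toy alignment: columns 0-2 covary strongly with each other,
# columns 3-5 do too, and the two blocks are only weakly correlated.
seqs = ["AAADDD"] * 3 + ["AAAEEE"] * 2 + ["CCCDDD"] * 2 + ["CCCEEE"] * 3
msa = np.array([list(s) for s in seqs])
coupling = mutual_information_matrix(msa)
labels = spectral_bipartition(coupling)
```

In this toy case the two blocks of covarying columns receive different labels. Which block is labeled 0 or 1 is arbitrary, since the sign of an eigenvector is not fixed.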
The collective behavior of a large number of degrees of freedom can often be described by a handful of variables. This observation justifies the use of dimensionality reduction approaches to model complex systems and motivates the search for a small set of relevant “collective” variables. Here, we analyze this issue by focusing on the optimal number of variables needed to capture the salient features of a generic dataset and develop a novel estimator for the intrinsic dimension (ID). By approximating geodesics with minimum distance paths on a graph, we analyze the distribution of pairwise distances around the maximum and exploit its dependency on the dimensionality to obtain an ID estimate. We show that the estimator does not depend on the shape of the intrinsic manifold and is highly accurate, even for exceedingly small sample sizes. We apply the method to several relevant datasets from image recognition databases and protein multiple sequence alignments and discuss possible interpretations for the estimated dimension in light of the correlations among input variables and of the information content of the dataset.
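The first step mentioned above, approximating geodesics by minimum distance paths on a graph, can be sketched as follows. This covers only the geodesic approximation, not the ID estimator itself. The k-nearest-neighbor graph, the choice k=5, the Floyd-Warshall shortest-path update, and the half-circle test data are all assumptions made for illustration.

```python
import numpy as np

def knn_geodesic_distances(points, k=5):
    """Approximate geodesic distances by shortest paths on a k-NN graph."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    euclid = np.sqrt((diff ** 2).sum(axis=-1))

    # Start with no edges (infinite distance), zero on the diagonal.
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)

    # Connect every point to its k nearest neighbors (column 0 is the point itself).
    neighbors = np.argsort(euclid, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(n), k)
    cols = neighbors.ravel()
    graph[rows, cols] = euclid[rows, cols]
    graph[cols, rows] = euclid[rows, cols]  # keep the graph symmetric

    # Floyd-Warshall: allow each node in turn as an intermediate stop.
    for mid in range(n):
        graph = np.minimum(graph, graph[:, mid:mid + 1] + graph[mid:mid + 1, :])
    return graph

# Hypothetical test data: 200 points on a half circle of radius 1.
angles = np.linspace(0.0, np.pi, 200)
points = np.stack([np.cos(angles), np.sin(angles)], axis=1)
geodesic = knn_geodesic_distances(points, k=5)
end_to_end = geodesic[0, -1]
```

The straight-line distance between the two ends of the half circle is 2, whereas the graph distance follows the curve and comes out close to π, the arc length.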
Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict the effect of mutations. Despite encouraging results, quantitative characterization and comparison of GPSM-generated probability distributions are still lacking. It is currently unclear whether GPSMs can faithfully reproduce the complex multi-residue mutation patterns observed in natural sequences arising due to epistasis. We develop a set of sequence statistics to assess the "generative capacity" of three GPSMs of recent interest: the pairwise Potts Hamiltonian, the VAE, and the site-independent model, using natural and synthetic datasets. We show that the generative capacity of the Potts Hamiltonian model is the largest, in that the higher-order mutational statistics generated by the model agree with those observed for natural sequences. In contrast, we show that the VAE's generative capacity lies between the pairwise Potts and site-independent models. Importantly, our work measures GPSM generative capacity in terms of higher-order sequence covariation statistics which we have developed, and provides a new framework for evaluating and interpreting GPSM accuracy that emphasizes the role of epistasis.
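A minimal sketch can show why covariation statistics separate these model classes. It compares pairwise covariances in a target alignment with those in samples from a site-independent model fitted to it. It uses only the pairwise level, not the higher-order statistics developed in the work above, and the integer-encoded toy alignment, the function names, and the sample sizes are hypothetical.

```python
import numpy as np

def one_hot(msa, n_states):
    """Encode an integer alignment as an array of shape (n_seq, n_pos, n_states)."""
    return np.eye(n_states)[msa]

def pair_covariances(msa, n_states):
    """Covariance f_ij(a, b) - f_i(a) f_j(b) for all position/state pairs."""
    x = one_hot(msa, n_states).reshape(len(msa), -1)
    single = x.mean(axis=0)
    pair = x.T @ x / len(msa)
    return pair - np.outer(single, single)

def sample_site_independent(msa, n_states, n_samples, rng):
    """Sample each column independently from its observed symbol frequencies."""
    freqs = one_hot(msa, n_states).mean(axis=0)  # shape (n_pos, n_states)
    cols = [rng.choice(n_states, size=n_samples, p=f) for f in freqs]
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)

# Hypothetical target data: two positions with two states each, perfectly coupled.
first = rng.integers(0, 2, size=5000)
target = np.stack([first, first], axis=1)

generated = sample_site_independent(target, n_states=2, n_samples=5000, rng=rng)

# Flattened index = position * n_states + state, so [0, 2] pairs
# state 0 at position 0 with state 0 at position 1.
target_cov = pair_covariances(target, 2)[0, 2]
generated_cov = pair_covariances(generated, 2)[0, 2]
```

The site-independent samples reproduce the single-site frequencies of the target but lose the coupling between the two positions, so their pairwise covariance stays near zero while the target covariance is large.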
The transient receptor potential (TRP) channel superfamily plays a central role in transducing diverse sensory stimuli in eukaryotes. Although dissimilar in sequence and domain organization, all known TRP channels act as polymodal cellular sensors and form tetrameric assemblies similar to those of their distant relatives, the voltage-gated potassium (Kv) channels. Here, we investigated the related questions of whether the allosteric mechanism underlying polymodal gating is common to all TRP channels, and how this mechanism differs from that underpinning Kv channel voltage sensitivity. To provide insight into these questions, we performed comparative sequence analysis on large, comprehensive ensembles of TRP and Kv channel sequences, contextualizing the patterns of conservation and correlation observed in the TRP channel sequences in light of the well-studied Kv channels. We report sequence features that are specific to TRP channels and, based on insight from recent TRPV1 structures, we suggest a model of TRP channel gating that differs substantially from the one mediating voltage sensitivity in Kv channels. The common mechanism underlying polymodal gating involves the displacement of a defect in the H-bond network of S6 that changes the orientation of the pore-lining residues at the hydrophobic gate.
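Per-position conservation of the kind referred to above is commonly quantified with the Shannon entropy of each alignment column. The following is a minimal sketch of that general measure, not necessarily the one used in this study. The four-sequence toy alignment is hypothetical and does not represent real channel sequences.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical toy alignment: one sequence per string, one column per character.
toy_alignment = ["GYGD", "GFGE", "GYGK", "GWGR"]

# zip(*...) transposes the alignment so that each item is one column.
entropies = [column_entropy(col) for col in zip(*toy_alignment)]
```

Fully conserved columns have zero entropy, and a column in which each of the four sequences carries a different residue reaches 2 bits, the maximum for four sequences.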