Data Science

DIMENSIONALITY REDUCTION APPROACHES TO IDENTIFY PATTERNS IN SEQUENCES AND STRUCTURES


Connecting the microscopic structural determinants of proteins’ functional mechanics to the set of residue-residue interactions conserved throughout evolution requires analyzing large datasets of protein sequences and structures and inferring, from the pattern of correlations, the underlying network of interactions. However, this strategy is made impractical by the large number of variables characterizing the datasets, which makes statistical inference a computationally hard and mathematically underdetermined problem. In both cases (sequences and structures), a major conceptual problem is dimensionality reduction: the rich and detailed information contained in sequence databases and molecular dynamics trajectories hides a small set of (possibly collective) variables that are maximally informative. The focus of our investigation is on developing methods to:
(i) estimate the number of relevant variables (intrinsic dimension)
(ii) extract subsequences from protein sequences that maximally explain the functional difference between distinct protein families
(iii) improve clustering algorithms for protein sequences and structures
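
As a rough illustration of aim (i), the sketch below estimates the intrinsic dimension of a point cloud from the ratio of each point's second to first nearest-neighbor distance (the Two-NN maximum-likelihood estimator). This is a stand-in chosen for brevity and is not the graph-distance method of the publication listed on this page; the function name, the synthetic dataset, and all parameters are assumptions made for the example.

```python
import numpy as np


def two_nn_intrinsic_dimension(points):
    """Two-NN maximum-likelihood estimate of the intrinsic dimension.

    Illustrative stand-in estimator (hypothetical helper name), assuming
    the data density is roughly uniform on the scale of the second
    nearest-neighbor distance.
    """
    points = np.asarray(points, dtype=float)

    # Pairwise squared Euclidean distances via the Gram matrix.
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.maximum(sq_dists, 0.0, out=sq_dists)   # clip round-off negatives
    np.fill_diagonal(sq_dists, np.inf)        # exclude self-distances

    # The two smallest distances in each row: first and second neighbors.
    nearest = np.sqrt(np.partition(sq_dists, 1, axis=1)[:, :2])
    r1, r2 = nearest[:, 0], nearest[:, 1]

    # Skip duplicated points, for which the ratio r2 / r1 is undefined.
    valid = r1 > 0
    mu = r2[valid] / r1[valid]

    # For intrinsic dimension d, log(mu) is exponentially distributed
    # with rate d, so the maximum-likelihood estimate is N / sum(log(mu)).
    return valid.sum() / np.sum(np.log(mu))


# Synthetic test case: a 2-dimensional sheet linearly embedded in 10 dimensions.
rng = np.random.default_rng(0)
latent = rng.uniform(size=(1000, 2))
embedding = rng.normal(size=(2, 10))
data = latent @ embedding

print("estimated intrinsic dimension:", two_nn_intrinsic_dimension(data))
```

The dataset has 10 coordinates per point, but only 2 independent degrees of freedom, so the estimate should come out close to 2 rather than 10. Real sequence or trajectory data would also need a suitable distance measure in place of the Euclidean one assumed here.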


D. Granata and V. Carnevale. Accurate Estimation of the Intrinsic Dimension Using Graph Distances: Unraveling the Geometric Complexity of Datasets. Sci. Rep., vol. 6, 31377, 2016.

D. Granata, M. Marsili, M. L. Klein, and V. Carnevale. Sequence signature of voltage sensing detected via dimensionality reduction techniques. Biophysical Journal, vol. 108, p. 426a, 2015.