Data Science

DIMENSIONALITY REDUCTION APPROACHES TO IDENTIFY PATTERNS IN SEQUENCES AND STRUCTURES

Connecting the microscopic structural determinants of proteins' functional mechanics to the
set of residue-residue interactions conserved throughout evolution requires analyzing large datasets of protein
sequences and structures to identify, from the pattern of correlations, the underlying network of interactions.
However, this strategy is made impractical by the large number of variables characterizing the datasets, which makes
statistical inference a computationally hard and mathematically underdetermined problem. In both cases (sequences
and structures), a major conceptual problem to be addressed is dimensionality reduction: the rich and detailed
information contained in sequence databases and molecular dynamics trajectories hides a small set of
(possibly collective) variables that are maximally informative. Our investigation focuses on developing
methods to:

(i) estimate the number of relevant variables (intrinsic dimension)

(ii) extract subsequences from protein sequences that maximally explain the functional difference between
distinct protein families

(iii) improve clustering algorithms for protein sequences and structures
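
To make goal (i) concrete, the sketch below estimates the intrinsic dimension of a toy dataset with the two-nearest-neighbor (TwoNN) maximum-likelihood estimator. This is a deliberately simple stand-in and is not the graph-distance method of Granata and Carnevale. The function names (`two_nn_dimension`, `plane_in_5d`) and the toy data are assumptions introduced only for illustration.

```python
import math
import random


def two_nn_dimension(points):
    """TwoNN maximum-likelihood estimate of the intrinsic dimension.

    For each point, take the ratio mu = r2 / r1 of the distances to its
    second and first nearest neighbors. Under locally uniform density,
    mu follows a Pareto law with exponent d, so d is approximately
    N / sum(log(mu)).
    """
    log_ratios = []
    for i, p in enumerate(points):
        r1 = r2 = math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            dist = math.dist(p, q)
            if dist < r1:
                r1, r2 = dist, r1
            elif dist < r2:
                r2 = dist
        # Skip duplicated points (r1 == 0) and degenerate cases.
        if r1 > 0 and math.isfinite(r2):
            log_ratios.append(math.log(r2 / r1))
    total = sum(log_ratios)
    if total <= 0:
        raise ValueError("not enough distinct points to estimate a dimension")
    return len(log_ratios) / total


def plane_in_5d(n, seed=0):
    """Hypothetical toy data: a 2D plane linearly embedded in 5 dimensions."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        points.append((u, v, u + v, u - v, 2.0 * u))
    return points


if __name__ == "__main__":
    data = plane_in_5d(400)
    print(f"ambient dimension: {len(data[0])}")
    print(f"estimated intrinsic dimension: {two_nn_dimension(data):.2f}")
```

Although each point has five coordinates, the estimate should come out close to two, the number of variables that actually generate the data. The brute-force neighbor search is quadratic in the number of points, so a real sequence or trajectory dataset would call for a spatial index or a precomputed distance matrix.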

D. Granata and V. Carnevale. Accurate Estimation of the Intrinsic Dimension Using Graph Distances: Unraveling the
Geometric Complexity of Datasets. Sci Rep. 2016; 6: 31377.

D. Granata, M. Marsili, M. L. Klein, and V. Carnevale. Sequence signature of voltage sensing detected via dimensionality
reduction techniques. Biophysical Journal. 2015; 108: 426a.