To advance precision medicine, we need techniques that identify significant and meaningful characteristics of individual patients in high-dimensional genomic data such as gene expression or sequence variation. A machine learning approach known as anomaly detection has proven to be a powerful way to find functional pathways whose typical expression relationships are disrupted in individuals. We have developed new anomaly detection algorithms, based on predicting features as functions of other features in the data (Figure 1), and have shown them to be more robust to the challenges presented by genomic data than other leading anomaly detection methods.
Figure 1. Each brown dot represents a “normal” sample. A linear function predicts the expression of gene 1 from the expression of gene 2 in these samples. Because the same linear model predicts the expression of gene 1 well for the blue dot on the right, this feature model does not contribute to calling the new sample represented by that dot anomalous. However, on the central blue dot, the predictive model fails, showing that the relationship between genes 1 and 2 is anomalous in that sample.
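The idea in Figure 1 can be illustrated with a minimal sketch for a single pair of genes. It is not our published algorithm: the expression values are invented for illustration, the function names `fit_line` and `anomaly_score` are hypothetical, and the standardized-residual score is just one simple way to measure how badly a feature model fails on a new sample.

```python
from statistics import mean, stdev

def fit_line(x, y):
    """Least-squares fit of y on x using the 'normal' samples only.

    Returns (slope, intercept, residual_sd).
    """
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    # Spread of the training residuals, used to put new residuals on a common scale.
    return slope, intercept, stdev(residuals)

def anomaly_score(model, x_new, y_new):
    """Absolute prediction error on a new sample, in units of residual_sd."""
    slope, intercept, residual_sd = model
    return abs(y_new - (slope * x_new + intercept)) / residual_sd

# Invented expression values for the "normal" samples (the brown dots).
gene2_normal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene1_normal = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
model = fit_line(gene2_normal, gene1_normal)

# Two new samples (the blue dots): one on the trend, one far off it.
score_right = anomaly_score(model, x_new=8.0, y_new=16.0)
score_central = anomaly_score(model, x_new=3.5, y_new=12.0)
print(f"right sample: {score_right:.2f}, central sample: {score_central:.2f}")
```

In this toy setting the sample that follows the learned relationship gets a small score and the one that breaks it gets a large score. A full method would presumably fit one such model per feature, possibly with many predictors, and combine the per-feature scores into a sample-level call.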
By looking for enrichment of previously defined functional gene sets among the most anomalous variables, we can functionally characterize the gene expression patterns of individual patients in medically informative ways. We are now working to extend these methods to identify patient-specific dysregulation in other types of data, including common sequence variants and clinical variables.
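One common way to test for this kind of enrichment is a one-sided hypergeometric test on the overlap between a gene set and the top-ranked anomalous genes. The sketch below assumes that test purely for illustration; the text above does not say which statistic we use, and the function name and the counts in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_set, n_top, n_overlap):
    """P(overlap >= n_overlap) when n_top genes are drawn at random,
    without replacement, from n_universe genes of which n_set belong
    to the gene set."""
    upper = min(n_set, n_top)
    total = comb(n_universe, n_top)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_top - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 100 measured genes, a 10-gene pathway, and 5 pathway
# members among the 10 most anomalous genes for one patient.
p_enriched = hypergeom_enrichment_p(n_universe=100, n_set=10, n_top=10, n_overlap=5)

# Sanity check: requiring an overlap of at least 0 always succeeds.
p_trivial = hypergeom_enrichment_p(n_universe=100, n_set=10, n_top=10, n_overlap=0)
print(p_enriched, p_trivial)
```

In practice such p-values would be computed for many gene sets per patient, so some correction for multiple testing would be needed before interpreting them.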
We are interested in understanding how the properties of molecular networks allow us to infer functional roles of genes and their relationships to disease. We are also interested in using computational approaches to discover new disease genes and the roles that functional processes or pathways play in disease.
Disease genes identified through GWAS or even through pedigrees often prove inadequate as drug targets, as do genes showing the downstream effects of disease-causing mutations. One approach to finding targetable pathways is to identify those pathways mediating the transcriptional response using graph theoretic analyses of protein-protein interaction networks, disease genes, and transcriptional profiles. We developed an algorithm to do this and applied it to find common mediating pathways in pulmonary disease (Figure 2).
Figure 2. Fc epsilon RI signaling pathway members play a significant role in mediating communication between disease genes and transcriptional consequences of disease in three disorders involving airway reactivity (COPD, asthma, and bronchopulmonary dysplasia (BPD)). Pathway genes are colored dark green, BPD disease genes are red, and genes differentially expressed in BPD are shown in blue.
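To make the notion of a "mediating" gene concrete, here is a rough sketch of one possible graph-theoretic reading: count how often each protein lies on a shortest path between a disease gene and a differentially expressed gene in an interaction network. This is an assumption made for illustration and is not a description of our algorithm; the toy network, the gene labels, and the function names are all hypothetical.

```python
from collections import Counter, deque

def shortest_path(graph, source, target):
    """Breadth-first search; returns one shortest path as a list, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

def mediator_counts(graph, disease_genes, de_genes):
    """Count interior nodes on one shortest path per (disease, DE) gene pair."""
    counts = Counter()
    for d in disease_genes:
        for e in de_genes:
            path = shortest_path(graph, d, e)
            if path:
                counts.update(path[1:-1])  # exclude the two endpoints
    return counts

# Hypothetical undirected interaction network as an adjacency list.
ppi = {
    "D1": ["M"],
    "D2": ["M", "X"],
    "M": ["D1", "D2", "E1", "E2"],
    "X": ["D2"],
    "E1": ["M"],
    "E2": ["M"],
}
counts = mediator_counts(ppi, disease_genes=["D1", "D2"], de_genes=["E1", "E2"])
print(counts.most_common())
```

Genes with high counts could then be tested for enrichment in known pathways. A realistic analysis would also have to account for multiple equally short paths and for highly connected hub proteins, both of which this sketch ignores.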
To link related diseases to gene sets, we have developed computational methods that are robust to ambiguities in the disease gene annotation process. This approach, which relies on disease relationships defined in hierarchies such as MeSH or the Disease Ontology, allows discovery of surprising connections between gene sets and disease processes, suggesting causes of co-morbidities and novel treatment approaches. One obstacle to this process, however, is that most hierarchical representations of disease are not designed to represent genetic contributions to complex disorders, and typically represent them poorly. We have therefore also developed methods of inferring disease ontologies from disease-gene data (Figure 3). We hope that our methods will eventually contribute to new disease ontologies better suited to supporting genetic discoveries.
Figure 3. Disease ontology inference from disease-gene data. A) Dendrogram showing a small subtree of the MeSH disease hierarchy for complications of preterm birth. Hierarchical clustering builds a tree whose internal nodes are hard to interpret. B) We repeatedly find the most general disease term (by citation count) from each cluster and promote it as an internal node. C) Final tree built by our parent promotion method.
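The promotion step in panel B can be sketched as a short recursion over a clustering tree. This is a simplified illustration under stated assumptions, not our implementation: the dendrogram is given as nested tuples of leaf names, the disease labels and citation counts are invented, and the function names are hypothetical.

```python
def leaves(tree):
    """All leaf names under a cluster (a leaf is a plain string)."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def remove_leaf(tree, term):
    """Copy of tree without `term`, collapsing empty or single-child clusters."""
    if isinstance(tree, str):
        return None if tree == term else tree
    kept = [c for c in (remove_leaf(child, term) for child in tree) if c is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)

def promote(tree, citations):
    """Return (term, children), where the most-cited leaf of each cluster
    is promoted to label that cluster's internal node."""
    if isinstance(tree, str):
        return (tree, [])
    parent = max(leaves(tree), key=lambda t: citations[t])
    rest = remove_leaf(tree, parent)
    if rest is None:
        return (parent, [])
    subtrees = [rest] if isinstance(rest, str) else list(rest)
    return (parent, [promote(s, citations) for s in subtrees])

# Hypothetical dendrogram and invented citation counts.
dendrogram = ((("lung_disorder", "BPD"), ("eye_disorder", "ROP")), "preterm_complications")
citations = {
    "preterm_complications": 900,
    "lung_disorder": 500,
    "eye_disorder": 400,
    "BPD": 120,
    "ROP": 80,
}
ontology = promote(dendrogram, citations)
print(ontology)
```

With these invented counts, the broadest term ends up at the root and each intermediate term becomes the parent of its more specific neighbor, which is the kind of interpretable tree shown in panel C.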