Whitepaper

Accelerating and democratising computational drug target identification using Conduit

Ben and Jamie
2023-02-09
ABSTRACT

At Conduit, we believe software, computational biology and knowledge graphs will increase the quality and speed of drug discovery.

Using the well-known "CRAFT" paper as a starting point and rough template, we used our software platform to replicate similar analytical techniques and identify the epilepsy target Csf1R, without requiring the ability to code. This opens up the possibility for diverse biopharma teams, made up of both computational and non-computational biological experts, to contribute, analyse and collaborate to reveal the full potential of their data, ultimately leading to better precision medicines.

Over-representation analysis was used to confirm the genes' involvement in inflammatory processes related to epilepsy, while clustering analysis of protein-protein interactions was used to identify smaller, functionally related clusters. A heatmap signature was then used to identify differentially expressed clusters, and disease linkage was assessed via text mining. Finally, master regulator analysis was used to identify which regulators could be therapeutically manipulated to cause “signal reversion” to a healthy state.

Exploring this paper was a great opportunity for us to showcase some of our platform’s functionality. A non-technical user was able to identify Csf1R in a matter of hours, benefit from the confidence and quality gains that full end-to-end data transparency brings, and communicate their analysis clearly. Although this was an “after the fact” assessment using a subset of the experimental techniques used in the paper, we believe that as our product capabilities mature the possibilities to improve the speed and quality of target identification will only grow.

FULL TEXT

Introduction

At Conduit, we believe software and knowledge graphs will increase the quality and speed of drug discovery. In this blog series, we want to explore whether our platform can produce findings similar to those in the published literature, and to assess the speed and efficiency gains it provides.

The paper affectionately known as CRAFT, “A systems-level framework for drug discovery identifies Csf1R as an anti-epileptic drug target” (Srivastava et al.), presents a computational framework for identifying drug targets in epilepsy. The authors identify the receptor Csf1R as a promising target, which they go on to validate in pre-clinical models. At Conduit, we felt this framework strongly resonated with our thesis and decided to explore whether we could identify Csf1R through our platform.

Importing a Gene List into the Platform (148 genes)

We started with a long list of 148 genes that the paper identifies after an initial prioritisation based on RNA-Seq values from a mouse model of epilepsy and on involvement in inflammatory processes.

Uploading this gene list into the Conduit platform adds the genes to the canvas and automatically generates a network of interactions between them (and their protein products) using trusted source datasets in our knowledge graph. As in the paper, genes with one or fewer connections were removed.
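To make the connectivity filter concrete, here is a minimal sketch of one way such a filter could work. It is not the platform's implementation: the function name, the toy gene identifiers and the choice to apply the filter in a single pass (rather than repeating it after each removal) are all assumptions made for illustration.

```python
from collections import Counter


def filter_low_degree(genes, edges, min_degree=2):
    """Keep genes with at least `min_degree` distinct connections.

    Assumption: the filter is applied once, so removing a gene does not
    trigger a second round of filtering on its neighbours.
    """
    # Deduplicate undirected edges and drop self-loops before counting.
    unique_edges = {frozenset((a, b)) for a, b in edges if a != b}
    degree = Counter()
    for edge in unique_edges:
        for gene in edge:
            degree[gene] += 1
    return [g for g in genes if degree[g] >= min_degree]


# Hypothetical toy network, not data from the paper.
genes = ["G1", "G2", "G3", "G4", "G5"]
edges = [("G1", "G2"), ("G2", "G3"), ("G1", "G3"), ("G3", "G4")]
kept = filter_low_degree(genes, edges)
print(kept)
```

In this toy network G4 has a single connection and G5 has none, so both are dropped, mirroring the "one or fewer connections" rule.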

Our platform colours genes according to their associated experimental values, in this case RNA-Seq differential expression (Log2 fold change, healthy vs epileptic mice). This supports prioritisation in the later steps.

To sense-check the relevance of this gene set, we ran over-representation analysis and confirmed involvement in inflammatory processes, such as neutrophil degranulation, that the paper had also reported.
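For readers unfamiliar with over-representation analysis, it is commonly framed as a one-sided hypergeometric test: given a background of N genes, of which K are annotated to a pathway, how surprising is it to see k or more annotated genes in a list of n? The sketch below shows that calculation under stated assumptions. The function name and toy gene sets are hypothetical, and the statistical test used by the platform is not specified here.

```python
from math import comb


def enrichment_p_value(gene_list, pathway_genes, background):
    """One-sided hypergeometric p-value for over-representation.

    Returns P(X >= k), where X is the number of pathway genes found in a
    random draw of n genes from the background.
    """
    background = set(background)
    annotated = set(pathway_genes) & background   # K pathway genes
    sample = set(gene_list) & background          # n genes of interest
    N, K, n = len(background), len(annotated), len(sample)
    k = len(sample & annotated)                   # observed overlap
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


# Hypothetical toy sets, not data from the paper.
background = [f"G{i}" for i in range(10)]
pathway = background[:5]
p_full_overlap = enrichment_p_value(background[:5], pathway, background)
p_no_overlap = enrichment_p_value(background[5:], pathway, background)
print(p_full_overlap, p_no_overlap)
```

In practice the test is run across many pathways, so the resulting p-values would normally be corrected for multiple testing before interpretation.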

VIDEO: Over-representation analysis to confirm involvement of genes in inflammatory processes

Prioritising the Gene List (148 -> 16 genes)

Once the 148 genes were in the platform, we prioritised the list step by step using Conduit features.

Clustering

Clustering is an approach that uses network information to identify groups of functionally related genes. We ran Louvain modularity clustering, which assigns each gene to the community it is most densely connected to. The Conduit knowledge graph caters for several network data types; for this analysis we used protein-protein interaction (PPI) networks, one of the most functionally relevant for our purposes. A secondary benefit of using PPIs (as opposed to the gene co-expression networks used in the paper) was that it gave some early insight into how far different omic network data types agree on functional relatedness in this sort of analysis, as well as into the platform's robustness to different data types. Clustering highlighted 5 modules of between 3 and 19 genes.
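Louvain clustering works by repeatedly moving genes between communities whenever the move increases a score called modularity, which compares the number of edges inside each community with the number expected by chance. The sketch below does not implement the full Louvain procedure; it only computes the modularity of a given partition, to show what the algorithm is optimising. The function name and toy graph are assumptions for illustration.

```python
from collections import defaultdict


def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2,
    where m is the total number of edges. Assumes at least one edge and
    that every node appears in exactly one community.
    """
    membership = {node: idx for idx, group in enumerate(communities) for node in group}
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the same community
    degree_sum = defaultdict(int)  # total degree of nodes in each community
    for a, b in edges:
        ca, cb = membership[a], membership[b]
        degree_sum[ca] += 1
        degree_sum[cb] += 1
        if ca == cb:
            internal[ca] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Hypothetical toy graph: two triangles joined by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
q_split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_single = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
print(q_split, q_single)
```

Splitting the toy graph into its two triangles scores higher than treating it as one community, which is the kind of improvement Louvain searches for at each step.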

Disease Linkage

To choose one of these clusters to move forward with, we queried the graph for evidence of links between the genes and Epilepsy/Seizure. The platform surfaced gene-disease relationships supported by genetic, text mining and expression evidence, among other evidence types. This information had been pre-ingested into our knowledge graph from Open Targets.

Differential Expression in Epilepsy Models

Using the heatmap signature on the nodes, we observed that some clusters had higher RNA-Seq differential expression values than others, suggesting a greater role in the disease state in epilepsy mouse models.

After evaluating the disease and differential expression evidence for each cluster, we decided to further explore Cluster 2 (16 genes), because of its link to Epilepsy/Seizure (based on text mining evidence) and its high proportion of differentially expressed genes.
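The "proportion of differentially expressed genes" criterion can be expressed as a simple per-cluster score. The sketch below is an illustration only: the cluster names, the fold-change values and the cut-off of an absolute Log2 fold change of 1.0 are all assumptions, and in our analysis this judgement was made visually from the heatmap alongside the disease evidence.

```python
def fraction_differentially_expressed(log2fc_by_gene, threshold=1.0):
    """Fraction of genes whose absolute Log2 fold change meets the threshold."""
    values = list(log2fc_by_gene.values())
    if not values:
        return 0.0
    return sum(abs(v) >= threshold for v in values) / len(values)


# Hypothetical clusters and fold changes, not data from the paper.
clusters = {
    "cluster_a": {"G1": 0.2, "G2": -0.4, "G3": 1.3},
    "cluster_b": {"G4": 2.1, "G5": -1.6, "G6": 0.3},
}
scores = {name: fraction_differentially_expressed(fc) for name, fc in clusters.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Counting both up- and down-regulated genes (via the absolute value) is a design choice; a directional analysis would score the two separately.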

VIDEO: Clustering, disease linkage and differential expression observations

Identifying a Target (16 -> 1 gene)

Master Regulator Analysis

The final stage was to narrow this group of 16 genes to 1 target gene. As in the paper, we used master regulator analysis to identify upstream regulators whose targets matched the genes in Cluster 2. This allowed us to interrogate which genes could be therapeutically manipulated in order to cause “signal reversion” to a healthy state.

Through our in-platform master regulator analysis we were able to reproduce the paper’s result, identifying Csf1R as a significant upstream regulator of our chosen cluster. Because of feedback loops, the analysis was inconclusive as to whether Csf1R activates or inhibits the cluster, but there is a rationale for an activation hypothesis similar to the one stated in the paper.

We also identified other interesting upstream regulators, such as CFI and CFH, which are part of the complement system and have previously been implicated in the literature, and others, such as STK40, that had not previously been identified.
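The question of whether a regulator is activating or inhibiting a cluster is often approached with a sign-consistency score: each target votes +1 if its observed expression change agrees with the regulator being active and -1 if it agrees with the regulator being inhibited. The sketch below illustrates that idea and shows why mixed evidence, such as that produced by feedback loops, gives an inconclusive result near zero. It is not the method used in the platform or the paper; the function name, edge signs and fold changes are hypothetical.

```python
from math import sqrt


def activation_z_score(signed_targets, log2fc_by_gene):
    """Sign-consistency z-score for one regulator.

    `signed_targets` maps target gene -> +1 (regulator activates the target)
    or -1 (regulator inhibits the target). Targets with no measured change
    are skipped. Positive scores suggest activation, negative suggest
    inhibition, and values near zero are inconclusive.
    """
    votes = []
    for gene, effect in signed_targets.items():
        fold_change = log2fc_by_gene.get(gene, 0.0)
        if fold_change == 0 or effect == 0:
            continue
        observed = 1 if fold_change > 0 else -1
        votes.append(effect * observed)
    if not votes:
        return 0.0
    return sum(votes) / sqrt(len(votes))


# Hypothetical regulator-target edges and fold changes, not data from the paper.
log2fc = {"T1": 1.8, "T2": 2.4, "T3": 1.1, "T4": 0.9}
consistent = {"T1": 1, "T2": 1, "T3": 1, "T4": 1}
conflicting = {"T1": 1, "T2": 1, "T3": -1, "T4": -1}
z_consistent = activation_z_score(consistent, log2fc)
z_conflicting = activation_z_score(conflicting, log2fc)
print(z_consistent, z_conflicting)
```

With four concordant targets the score is clearly positive, while two concordant and two discordant targets cancel out, which is the kind of ambiguity described above for Csf1R.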

VIDEO: Master Regulator Analysis

Conclusion

Exploring this paper was a great opportunity for us to showcase some of our platform’s functionality. A non-technical user was able to identify Csf1R in a matter of hours and to benefit from the confidence and quality gains that full end-to-end data transparency brings. Although this was an “after the fact” assessment using a subset of the experimental techniques used in the paper, we believe that as our product capabilities mature the possibilities to improve the speed and quality of target identification will only grow.

We also note that subsequent findings have indicated that inhibition of Csf1R is not well tolerated by patients. We hope that as we increase the number of data types (such as gene novelty, tractability and clinical data, e.g. from NLP ingestion) we can increase the clinical relevance of our outputs, and ultimately better serve the increased demand on computer models from initiatives such as the FDA Modernisation Act. Extending the feature suite further (for example with data auto-ingestion, recommendation systems, pathfinding and ML Ops) and applying these digital twins across the pharmaceutical value chain, such as in clinical development, will increase both platform and scientific value.

N.B. Custom code still plays a huge role in drug discovery, but we hope our approach unlocks currently untapped resources to advance the field.