
LLMs - Another big step towards accelerating data-driven Biotechs of the future

Mingran
2023-12-15
Introduction

Drug discovery is a complex, long, and data-intensive process. At Conduit Bio, our goal is to create core software and analytics infrastructure that allows biopharma to accelerate unbiased, data-driven drug discovery. Central to this strategy is the development of the Conduit platform, which provides automated data ingestion and curation, knowledge graph visualisation, and a suite of analytics and ML tools, leading to identified drug targets and enhanced disease understanding. Data is represented in a way that best captures the underlying biological complexity, in our case in a graphical format, hence the term 'knowledge graph.' Utilising this approach has enabled us to identify drug targets at least twice as fast.

However, knowledge graphs have their limitations. They can struggle to stay current and may lack comprehensive coverage, which can restrict their analytical effectiveness. Another challenge is the nature of biological data, which is often complex and scattered across publications, figures, and data platforms.

The advent of Large Language Models (LLMs) like ChatGPT has been transformative, enhancing our existing knowledge graph systems. LLMs, the intellectual core of this evolution, are renowned for their transformer architecture and adeptness at processing and summarising extensive textual data. While LLMs can sometimes generate inaccurate ('hallucinated') information, grounding them in knowledge graphs is invaluable in exactly those instances. By combining their strengths, LLMs and knowledge graphs excel at complex cognitive tasks such as deduction, induction, and identifying cause-and-effect relationships. This synergy has got us really excited.


Visualisation of attention (i.e. which parts of the input the model considers important) induced by the given input sentence. Source: bertviz

With tools like LangChain and the latest release of OpenAI's Assistants API, hooking up Large Language Models (LLMs) with other systems is easier than ever. A few lines of code like the snippet below can get you started:
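The snippet below is a minimal sketch using LangChain's Neo4j integration, assuming a running Neo4j instance and an OpenAI API key in the environment; the connection details, model, and example question are placeholders rather than Conduit's actual setup.

# Minimal sketch: wire an OpenAI chat model to a Neo4j knowledge graph so that
# questions are answered by generating and running Cypher queries.
# Connection details and model name are placeholders.
from langchain.chat_models import ChatOpenAI
from langchain.chains import GraphCypherQAChain
from langchain.graphs import Neo4jGraph

graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")
llm = ChatOpenAI(model="gpt-4", temperature=0)

# The chain prompts the LLM with the graph schema, generates a Cypher query,
# runs it against the database, and summarises the result.
chain = GraphCypherQAChain.from_llm(llm=llm, graph=graph, verbose=True)
chain.run("Which proteins interact with TP53?")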

Practical Applications of LLMs and Knowledge Graphs in Drug Discovery

Now we will dive into some use cases for LLMs in drug discovery when integrated with knowledge graphs. These use cases match our thesis of how LLMs can be used in diverse ways to enable data-driven biotechs.

Constructing Knowledge Graphs from Text

LLMs are transforming Knowledge Graph construction, leveraging their ability to process diverse textual data. This expands the breadth and depth of these graphs. LLMs excel at identifying key text elements, organising them, and deciphering connections. Instead of manually extracting relationships by reading publications and noting down protein interactions, LLMs can automate this process. The process is illustrated in the diagram below.

Moreover, the potential of these models includes incorporating multimodal proprietary sources for specificity and personalisation. Tools like paperGPT, which allow querying from user-uploaded papers, exemplify this. As LLMs develop into multimodal systems, their capacity to include varied sources like meeting notes and visual materials increases. Looking ahead, the idea of crafting your own custom Knowledge Graph might become an effortlessly achievable task.

An example workflow of employing LLMs to construct a knowledge graph from text (Source: Tomaz Bratanic)
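As a concrete illustration of the extraction step, here is a minimal sketch that asks an LLM to pull (subject, relation, object) triples out of free text; the prompt, model name, and example sentence are illustrative placeholders rather than Conduit's production pipeline.

# Minimal sketch of LLM-based triple extraction for knowledge graph construction.
# The prompt, model name, and input sentence are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

text = (
    "BRCA1 interacts with BARD1 to form a heterodimer that ubiquitinates "
    "downstream targets involved in DNA repair."
)

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": (
                "Extract biological relationships from the user's text. "
                'Respond with JSON: {"triples": [[subject, relation, object], ...]}'
            ),
        },
        {"role": "user", "content": text},
    ],
)

triples = json.loads(response.choices[0].message.content)["triples"]
for subject, relation, obj in triples:
    print(subject, relation, obj)  # e.g. BRCA1 interacts_with BARD1

Each returned triple can then be written into the graph as a node-edge-node pattern.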

Information Retrieval with KG-Enhanced LLMs

Knowledge Graphs are incredibly detailed, containing vast datasets, but they can be overwhelming, especially when navigating through large amounts of data. Here, LLMs play a crucial role. Their integration with Knowledge Graphs makes these complex datasets more accessible to non-experts, enhancing the LLMs' ability to pull relevant information from the graphs. Below, we illustrate how Conduit's pipeline uses an LLM to translate custom questions into Knowledge Graph queries. Simply ask in plain English, and you get precise, useful information extracted from intricate graphs.

An example translation by an LLM from natural language to a Cypher query
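For illustration, the sketch below shows one way such a translation can be prompted: the LLM receives a graph schema and a plain-English question and is asked to return only a Cypher query. The schema and question are simplified stand-ins, not Conduit's actual data model.

# Minimal sketch of natural-language-to-Cypher translation via a schema-aware prompt.
# The schema and question are simplified stand-ins for a real knowledge graph.
from openai import OpenAI

client = OpenAI()

schema = """
Nodes: (:Gene {symbol}), (:Disease {name}), (:Pathway {name})
Relationships: (:Gene)-[:ASSOCIATED_WITH]->(:Disease),
               (:Gene)-[:PARTICIPATES_IN]->(:Pathway)
"""

question = "Which genes are associated with Alzheimer's disease?"

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {
            "role": "system",
            "content": (
                "Translate the user's question into a Cypher query for this schema:\n"
                f"{schema}\nReturn only the query."
            ),
        },
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)
# Expected shape: MATCH (g:Gene)-[:ASSOCIATED_WITH]->(d:Disease {name: "Alzheimer's disease"})
#                 RETURN g.symbol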

Analysis Copiloting

The recent advancements in assistant APIs have significantly enhanced the capabilities of Large Language Models (LLMs), extending their reach beyond traditional Knowledge Graphs to incorporate existing analytical modules. In our Conduit platform, we focus on the fluid integration of these assistant APIs with our recommendation system. This system is designed to assimilate inputs from various modules within the platform, such as user-uploaded proprietary datasets, cluster analysis, pathway enrichment analysis, and more.

Take, for instance, a user asking, "Which genes are the best targets for Alzheimer's disease?" Our platform empowers the LLM to lead the user through a series of insightful follow-up inquiries. These may include determining the most influential experiments for establishing a causal link between genes and Alzheimer's disease, identifying relevant RNA sequencing datasets, analysing them to pinpoint the most upregulated genes, and correlating these with CRISPR knockout datasets to identify key elements in the pathway. Additionally, our pre-existing modules assist users in performing enrichment and clustering analyses on the identified gene targets. All this information is then integrated into our recommendation system as input. The outcome is a detailed recommendation report that offers insights comparable to those provided by a bioinformatician and is available around the clock.
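As a rough sketch of the mechanics, the snippet below registers two hypothetical analysis modules (query_rnaseq_datasets and run_pathway_enrichment) as function tools on an assistant built with OpenAI's Assistants API; the tool names, schemas, and instructions are illustrative stand-ins rather than Conduit's real interfaces.

# Rough sketch: expose analysis modules to an assistant as function tools.
# Tool names and schemas are hypothetical stand-ins for platform modules.
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Analysis copilot",
    model="gpt-4-1106-preview",
    instructions=(
        "Help the user identify drug targets. Use the provided tools to query "
        "datasets and run analyses, then summarise the evidence."
    ),
    tools=[
        {
            "type": "function",
            "function": {
                "name": "query_rnaseq_datasets",
                "description": "Find RNA-seq datasets relevant to a disease.",
                "parameters": {
                    "type": "object",
                    "properties": {"disease": {"type": "string"}},
                    "required": ["disease"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "run_pathway_enrichment",
                "description": "Run pathway enrichment on a list of gene symbols.",
                "parameters": {
                    "type": "object",
                    "properties": {"genes": {"type": "array", "items": {"type": "string"}}},
                    "required": ["genes"],
                },
            },
        },
    ],
)

# A thread holds the conversation. When a run pauses with status "requires_action",
# the application executes the requested tool calls against the platform's modules
# and submits the outputs back so the assistant can continue its reasoning.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Which genes are the best targets for Alzheimer's disease?",
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)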

LLMs Powering the Conduit Platform

At Conduit, we've embarked on an exciting journey by integrating Large Language Models (LLMs) into our system, aiming to significantly enhance user experience. This new feature is set to be released soon, and we're already witnessing notable improvements in both speed and usability. For those interested in innovative AI applications, stay updated on our progress!