Posts

Showing posts from May, 2025

Chemprop: Aggregation and Extrapolation

Context: I have just finished the first version of my Master's Thesis and am waiting for my supervisor's feedback. That's why I once again have tons of time to read my favorite topics and to look for a junior position (😭). Recently, I came across Dr. Pat Walters's blog posts about the extrapolation capacity of Chemprop models. Part 1: https://patwalters.github.io/Why-Dont-Machine-Learning-Models-Extrapolate/ Part 2 (guest post from Dr. Alan Cheng and Jeffery Zhou): https://patwalters.github.io/GNNs-Can-Extrapolate/ In Part 1, Dr. Walters made an interesting point: Chemprop models struggle to extrapolate across molecular weight (MW). Specifically, models trained on compounds with MW below 400 g/mol have difficulty making predictions for compounds with MW above 500 g/mol. I refer to this as extrapolation on the target space, since in this context it does not involve the input space, such as molecule clusters or chemical spaces. In the follow-up blog, Dr. Cheng and Zhou raised an even...
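To make the setup concrete, here is a minimal sketch of an MW-based extrapolation split in the spirit of that experiment. The column names ("smiles", "target") and the 400/500 g/mol cut-offs are illustrative assumptions, not the exact configuration used in the blogs.

```python
# Sketch: train on light molecules, evaluate on heavy ones to probe MW extrapolation.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

def mw_split(df: pd.DataFrame, smiles_col: str = "smiles",
             train_max_mw: float = 400.0, test_min_mw: float = 500.0):
    """Split a dataset by molecular weight to create an out-of-domain test set."""
    mols = df[smiles_col].apply(Chem.MolFromSmiles)
    keep = mols.notna()                      # drop unparseable SMILES
    df, mols = df[keep].copy(), mols[keep]
    df["mw"] = mols.apply(Descriptors.MolWt)

    train = df[df["mw"] < train_max_mw]      # in-domain training set
    test = df[df["mw"] > test_min_mw]        # out-of-domain test set
    return train, test

# Example usage (hypothetical file):
# train_df, test_df = mw_split(pd.read_csv("solubility.csv"))
```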

Integrate cheminformatics data and ClaudeAI — Part 1: ChEMBL

First appearing in November 2024, the Model Context Protocol (MCP) quickly gained popularity within the AI community. This technology allows various resources, such as databases, local documents, and websites, to be integrated into LLMs. As a result, AI applications can access more up-to-date data that might previously have been restricted or unseen. In this post, I will share my experience integrating ClaudeAI with three of the most popular cheminformatics databases: ChEMBL, PubChem, and the Protein Data Bank (PDB). I will also share my view on the pros and cons of MCP. All the code can be found on my GitHub. I hope you find this instruction helpful! Fundamental Concepts: The Model Context Protocol (MCP) is a protocol that allows large language models (LLMs) to connect to external tools, APIs, or environments in order to access resources and task-specific context, and to perform actions outside of their native model boundaries. It essentially augments the model's capabilities...
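As a flavor of what such an integration looks like, here is a minimal sketch of an MCP server exposing a ChEMBL lookup tool, assuming the official MCP Python SDK (FastMCP) and the chembl_webresource_client package. It is illustrative only; the actual servers described in the post may be structured differently.

```python
# Sketch: a tiny MCP server that lets an LLM client (e.g. Claude Desktop) search ChEMBL.
from mcp.server.fastmcp import FastMCP
from chembl_webresource_client.new_client import new_client

mcp = FastMCP("chembl")  # server name shown to the MCP client

@mcp.tool()
def search_molecule(query: str, limit: int = 5) -> list[dict]:
    """Search ChEMBL by name and return basic molecule records."""
    records = []
    for i, rec in enumerate(new_client.molecule.search(query)):
        if i >= limit:
            break
        records.append({
            "chembl_id": rec.get("molecule_chembl_id"),
            "pref_name": rec.get("pref_name"),
            "smiles": (rec.get("molecule_structures") or {}).get("canonical_smiles"),
        })
    return records

if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport used by desktop MCP clients
```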

Large dataset on 8 GB RAM? Let IterableDataset handle it

Context: I am currently working with a directed message passing neural network called Chemprop, a useful model for structure-property prediction. As a student, my resources are quite limited, with only 8 GB or 16 GB of RAM available to me, yet I need to handle a dataset of 1 million records for my project. Chemprop takes as input SMILES strings and target values stored in a CSV file. These SMILES strings are then converted into a MoleculeDataset, which contains the molecular graphs, target values, and more. The MoleculeDataset is subsequently passed to a DataLoader to create batches for the training loop. On my 8 GB RAM laptop, I can load the entire CSV file, but converting all of the data into a MoleculeDataset at once is not feasible. Therefore, I needed an alternative approach that handles this conversion step while remaining compatible with the Chemprop framework. Eventually, I discovered the IterableDataset feature in PyTorch, which effectivel...
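Here is a minimal sketch of the streaming idea, assuming a plain CSV with "smiles" and "target" columns. The featurize() hook stands in for the conversion of a SMILES string into a Chemprop datapoint; it is a placeholder, not Chemprop's actual API.

```python
# Sketch: stream a large CSV in chunks and featurize lazily with a PyTorch IterableDataset.
import pandas as pd
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class StreamingSmilesDataset(IterableDataset):
    def __init__(self, csv_path: str, chunksize: int = 10_000):
        self.csv_path = csv_path
        self.chunksize = chunksize

    def __iter__(self):
        worker = get_worker_info()
        for chunk_idx, chunk in enumerate(
            pd.read_csv(self.csv_path, chunksize=self.chunksize)
        ):
            # With multiple workers, let each worker consume every n-th chunk
            if worker is not None and chunk_idx % worker.num_workers != worker.id:
                continue
            for smiles, target in zip(chunk["smiles"], chunk["target"]):
                yield self.featurize(smiles), torch.tensor([target], dtype=torch.float32)

    def featurize(self, smiles: str):
        # Placeholder: in the real pipeline this is where the SMILES string would be
        # turned into a molecular graph / Chemprop datapoint.
        return smiles

# loader = DataLoader(StreamingSmilesDataset("data.csv"), batch_size=64, num_workers=2)
```

Because the CSV is read chunk by chunk, only one chunk's worth of featurized data lives in memory at any time, which is what makes the 1-million-record dataset tractable on a small laptop.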

The definition of data duplication is flexible. How can we handle it?

Duplication or not duplication? I'm currently enjoying an extended holiday in Sweden, from Easter to Valborg, which gives me plenty of time to relax and read about my favorite topics. Recently, I came across an insightful review article by Srijit Seal et al. A particular section of it reminded me of the many mistakes I made when I first began my cheminformatics journey, namely data leakage. While the article focuses on machine learning for toxicology prediction, the insights and warnings it offers apply to QSAR modeling in general. This inspired me to write a blog post discussing both obvious and hidden forms of data leakage, as well as my approach to addressing the issue. Data leakage — the devil: Data leakage is a real devil in machine learning training protocols. It comes in many forms, but one of the most common problems arises when training data overlaps with validation data. In other words, this occurs when there are duplicates...
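One common first line of defense against this kind of leakage is structure-level deduplication before splitting. Below is a minimal sketch using RDKit canonical SMILES; the column names ("smiles", "target") are illustrative assumptions, and the "keep the first record" policy is just one of several reasonable ways to define a duplicate.

```python
# Sketch: canonicalize SMILES and drop records that collapse to the same structure.
import pandas as pd
from rdkit import Chem

def canonical_smiles(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def deduplicate(df: pd.DataFrame, smiles_col: str = "smiles") -> pd.DataFrame:
    out = df.copy()
    out["canonical_smiles"] = out[smiles_col].apply(canonical_smiles)
    out = out.dropna(subset=["canonical_smiles"])  # drop unparseable SMILES
    # Keep one record per canonical structure; averaging repeated measurements
    # is another reasonable policy, depending on how "duplicate" is defined.
    return out.drop_duplicates(subset="canonical_smiles")

# clean_df = deduplicate(pd.read_csv("tox_data.csv"))
```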