Google Summer of Code - Part 3

GitHub Repository: all work up to and including commit 3761fd7 was done by me during GSoC.

Task 3

With the core package features finished, we were able to replicate the research paper's results with similar metrics. However, those metrics were low and showed little predictive power, so we wanted to try graph neural networks (GNNs) instead. GNNs seemed a natural fit for our data because SBOL is inherently a graph. Typical synthetic biology machine learning models use only the DNA sequence to make predictions, which leaves out experimental metadata that can be important for prediction. Because SBOL captures this metadata, encoding it in a graph would give the model more context and, in theory, allow for higher accuracy.

AutoRDF2GML for Graph Conversion

To begin, we used AutoRDF2GML, a tool designed to convert RDF files into a format suitable for graph neural network training. The main obstacle to using it was the config file: we had to specify our nodes, relationships, and models, all customized to the data. For this workflow to be reproducible on other kinds of data, the config would need to be auto-generated.

Later on, I wrote a script that automates generation of this config by using SPARQL queries to:

Fetch major nodes we cared about (ComponentDefinition, ModuleDefinition, SequenceAnnotation)
Fetch the URIs of connecting relationships needed for the config file
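
The two kinds of query listed above might look roughly like the following sketch. It assumes the files use the SBOL2 vocabulary (`http://sbols.org/v2#`) and runs the queries with rdflib; the helper names are hypothetical, and this is not the script from the repository.

```python
SBOL2 = "http://sbols.org/v2#"  # assumption: the files use the SBOL2 vocabulary


def instances_query(class_name: str) -> str:
    """SPARQL query listing every node of one SBOL class."""
    return (
        "SELECT DISTINCT ?node WHERE { "
        f"?node a <{SBOL2}{class_name}> . "
        "}"
    )


def connecting_predicates_query(source_class: str, target_class: str) -> str:
    """SPARQL query listing the predicate URIs that link two SBOL classes."""
    return (
        "SELECT DISTINCT ?predicate WHERE { "
        f"?source a <{SBOL2}{source_class}> . "
        f"?target a <{SBOL2}{target_class}> . "
        "?source ?predicate ?target . "
        "}"
    )


def run_query(rdf_path: str, query: str) -> list[tuple[str, ...]]:
    """Run a query against an RDF/XML file with rdflib (imported lazily)."""
    from rdflib import Graph  # third-party: pip install rdflib

    graph = Graph()
    graph.parse(rdf_path, format="xml")
    return [tuple(str(value) for value in row) for row in graph.query(query)]
```

For example, `run_query("design.xml", connecting_predicates_query("ComponentDefinition", "SequenceAnnotation"))` would be expected to return the SBOL2 `sequenceAnnotation` predicate for a typical file (`design.xml` is a placeholder path). The returned URIs are what a config generator would then write into the AutoRDF2GML config.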

Once the config was parsed, the tool ran and generated several CSV files containing the edge mappings and node features.

Building Graph Data for PyTorch Geometric

I wrote pandas scripts that took those mappings and features and created a PyTorch Geometric HeteroData object, which is used as training data for graph neural networks in PyTorch. With these scripts in place, I essentially had a pipeline that took SBOL files on my computer, processed them through AutoRDF2GML, and turned them into multiple HeteroData objects.
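
The core of that conversion is mapping node URIs to contiguous integer indices and laying the edges out as a two-row index. Here is a minimal sketch of that step using only the standard library (the same logic maps directly onto pandas). The CSV column names and URIs are hypothetical; the real AutoRDF2GML output may be laid out differently.

```python
import csv
import io

# Hypothetical edge-mapping CSV; the real AutoRDF2GML column names may differ.
EDGE_CSV = """source,target
http://example.org/cd/1,http://example.org/sa/1
http://example.org/cd/1,http://example.org/sa/2
http://example.org/cd/2,http://example.org/sa/3
"""


def index_nodes(uris):
    """Give each distinct URI a contiguous integer index in first-seen order."""
    index = {}
    for uri in uris:
        index.setdefault(uri, len(index))
    return index


def build_edge_index(rows, source_index, target_index):
    """Return [source_row, target_row], a 2 x num_edges edge list."""
    sources = [source_index[row["source"]] for row in rows]
    targets = [target_index[row["target"]] for row in rows]
    return [sources, targets]


rows = list(csv.DictReader(io.StringIO(EDGE_CSV)))
cd_index = index_nodes(row["source"] for row in rows)  # ComponentDefinition nodes
sa_index = index_nodes(row["target"] for row in rows)  # SequenceAnnotation nodes
edge_index = build_edge_index(rows, cd_index, sa_index)
```

In PyTorch Geometric, `torch.tensor(edge_index, dtype=torch.long)` would then be assigned to an edge type on the HeteroData object, for example `data["component_definition", "has_annotation", "sequence_annotation"].edge_index`, where the type names are illustrative.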

To start, I built a straightforward GNN in PyTorch Geometric that took in heterogeneous graphs and applied different convolutions over several stages of message passing. We framed the task as graph-level regression, since we wanted to predict the RNA expression level for each entire SBOL graph. Once I had configured the PyTorch setup and processed all 17K files through AutoRDF2GML (which took days of running), the GNN was ready to be trained.
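
To make "message passing followed by a graph-level prediction" concrete, here is a deliberately simplified sketch in plain Python. It is a homogeneous stand-in with mean aggregation and a mean-pool readout, not the heterogeneous PyTorch Geometric model described above, and every name and number in it is hypothetical.

```python
def message_pass(features, edges):
    """One round: each node averages its own vector with its in-neighbours'."""
    inbox = {node: [vector] for node, vector in features.items()}
    for source, target in edges:
        inbox[target].append(features[source])
    return {
        node: [sum(column) / len(vectors) for column in zip(*vectors)]
        for node, vectors in inbox.items()
    }


def readout(features):
    """Mean-pool all node vectors into a single graph-level vector."""
    vectors = list(features.values())
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def predict(features, edges, weights, bias, rounds=2):
    """Graph-level regression: message passing, pooling, then a linear head."""
    for _ in range(rounds):
        features = message_pass(features, edges)
    pooled = readout(features)
    return sum(w * x for w, x in zip(weights, pooled)) + bias


# Toy graph: two nodes with 2-dimensional features and one directed edge.
toy_features = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
toy_edges = [("a", "b")]
prediction = predict(toy_features, toy_edges, weights=[1.0, 1.0], bias=0.0, rounds=1)
```

A heterogeneous model follows the same outline but keeps a separate feature table per node type and a separate learned convolution per edge type.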

Training Challenges

After running the GNN for several epochs, we noticed that the training and validation losses were erratic, jumping between very high and very low values. I discovered that this was because our target label, the RNA expression level, was heavily skewed, with outliers much larger than the rest of the distribution. To account for this, I applied a log transformation to the target and retrained the model; the training and validation losses became lower and more consistent, which indicated an improvement.
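
The effect of the transformation can be seen on a handful of made-up values. This sketch assumes `log1p` (log of 1 + x), chosen here because it tolerates zeros; the post does not say which variant was actually used, and the numbers are hypothetical.

```python
import math
import statistics

# Hypothetical expression levels with one large outlier.
raw_targets = [12.0, 30.0, 45.0, 80.0, 50_000.0]


def to_log(values):
    """Compress the scale of the targets before training."""
    return [math.log1p(value) for value in values]


def from_log(values):
    """Invert the transform to report predictions in the original units."""
    return [math.expm1(value) for value in values]


log_targets = to_log(raw_targets)

# How far the largest value sits from the median, before and after.
raw_spread = max(raw_targets) / statistics.median(raw_targets)
log_spread = max(log_targets) / statistics.median(log_targets)

restored = from_log(log_targets)
```

After the transform the outlier is only a few times the median instead of roughly a thousand times, so a single sample no longer dominates a squared-error loss. Predictions have to be passed through `from_log` before being compared with the raw measurements.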

However, I noticed that the training and validation losses then stayed relatively constant from epoch to epoch, meaning the model was barely learning anything from the data. After further experimentation and reflection, I realized that one important reason lay in the data itself. Because the data came from a single experiment in which only the DNA sequence changed, every file was encoded as essentially the same graph structure with a different DNA sequence. With so little variation, the model could not learn anything new or generalize.

Reflections

Moreover, I realized that when converting the SBOL files, AutoRDF2GML used SciBERT to vectorize some of the RDF features. Since our features included the DNA sequence (the main changing component of our graphs), SciBERT encoded the sequence as if it were a “scientific term,” even though to that model a DNA sequence is just a jumble of random characters.

We concluded that the GNN was still a very useful application, but we needed to work on:

Getting a better DNA encoding
Using more varied data

Task 4

Reflecting on the previous task, we wanted to test other models that would be good at encoding DNA properties and generating strong results. We decided to try DNABERT, a transformer-based model: it was the most complex option, was pretrained on a very large DNA dataset, and would presumably produce a meaningful encoding of our DNA.
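
To illustrate what a DNA-aware tokenization looks like, the original DNABERT splits each sequence into overlapping k-mers (DNABERT-2 instead uses byte-pair encoding; the choice of k-mers here is an assumption, since the post does not say which version was used). A minimal sketch with a hypothetical helper name:

```python
def kmer_tokens(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with a stride of 1."""
    sequence = sequence.upper()
    if len(sequence) < k:
        return []
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


tokens = kmer_tokens("ATGCGTAC", k=6)
# tokens == ["ATGCGT", "TGCGTA", "GCGTAC"]
```

Each token comes from a fixed vocabulary of DNA fragments, so the model sees recurring biological units rather than one long out-of-vocabulary string.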

DNABERT Setup and Fine-Tuning

I started by getting familiar with the Hugging Face Transformers library in Python for fine-tuning a transformer. I researched how fine-tuning is normally done and looked at several code snippets for fine-tuning DNABERT. Since I wanted to fine-tune DNABERT for a regression task on my dataset, I added a regression head on top of the DNABERT layers. I also learned about LoRA (low-rank adaptation), which reduces the number of weights that need to be fine-tuned so that training is less computationally intensive.
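
A back-of-the-envelope count shows why LoRA is cheaper. Instead of updating a full weight matrix W, LoRA freezes W and learns two small factors, so the effective weight becomes W + (alpha / r) * B A. The sketch below assumes a BERT-base-sized hidden dimension of 768 and a rank of 8; both numbers are illustrative, not the settings used in this project.

```python
def full_trainable_params(d_in: int, d_out: int) -> int:
    """Weights updated when a d_out x d_in matrix is fine-tuned directly."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in the two low-rank factors: A is rank x d_in, B is d_out x rank."""
    return rank * d_in + d_out * rank


hidden = 768  # assumption: BERT-base-sized projection matrix
rank = 8      # assumption: a commonly used small rank

full = full_trainable_params(hidden, hidden)
lora = lora_trainable_params(hidden, hidden, rank)
fraction = lora / full
```

Under these assumptions each adapted matrix trains 12,288 weights instead of 589,824, about 2% of the original.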

For this workflow, I wrote a notebook that took our existing 17K rows of data, tokenized them with the DNABERT tokenizer, split them into training, test, and evaluation datasets, and then used the Hugging Face Trainer class to fine-tune the model with LoRA and some initial hyperparameters. Eventually the notebook ran and the model could be fine-tuned, but because there wasn’t much data or hyperparameter optimization, and because I couldn’t run the fine-tuning on my local computer, we could not finalize our evaluation of DNABERT.

HPC Experiments and Hyperparameter Tuning

Later we gained access to an HPC cluster, which removed the problems I had running training and preprocessing on my local computer. To prepare for running on the HPC, I created several scripts for hyperparameter tuning. With this new access, we also decided to switch to a similar dataset from the same research paper with a total of 600K entries.

I created:

One Python script for hyperparameter tuning a scikit-learn model (RandomForest, LogisticRegression) to test our preprocessing package and aim for better results
Another notebook that took the transformer fine-tuning and added an additional hyperparameter tuning section using Optuna to find parameters
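
The shape of such a tuning loop can be sketched with plain random search, which stands in here for Optuna (whose default sampler, TPE, is more sample-efficient). The search space and the objective below are placeholders: a real run would fine-tune the model for each configuration and return its validation loss.

```python
import math
import random


def sample_params(rng):
    """Draw one configuration: log-uniform learning rate, categorical LoRA rank."""
    log_lr = rng.uniform(math.log10(1e-5), math.log10(1e-3))
    return {"learning_rate": 10 ** log_lr, "lora_rank": rng.choice([4, 8, 16])}


def placeholder_validation_loss(params):
    """Stand-in objective; a real run would fine-tune and evaluate the model."""
    lr_penalty = (math.log10(params["learning_rate"]) + 4.0) ** 2
    rank_penalty = abs(params["lora_rank"] - 8) / 8
    return lr_penalty + rank_penalty


def random_search(n_trials=20, seed=0):
    """Evaluate n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = sample_params(rng)
        trials.append((placeholder_validation_loss(params), params))
    best_loss, best_params = min(trials, key=lambda trial: trial[0])
    return best_loss, best_params, trials


best_loss, best_params, trials = random_search()
```

With Optuna, the same search space would be declared inside an objective function using `trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)` and `trial.suggest_categorical("lora_rank", [4, 8, 16])`, and the loop would be replaced by `study.optimize(objective, n_trials=20)`.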

Conclusion

This was the last task I completed, so we were not able to test these scripts and get results; that would be the next step. In the end, I learned a lot about SBOL, creating novel neural networks, working with transformers, and applying preprocessing methods through SPARQL queries against triple-store databases.

We ended up with a Python package, published on PyPI, for researchers to use. Although we had wanted the package to include a module for building a graph neural network training pipeline, we did establish a working pipeline from SBOL files to standard machine learning training on DNA features. We also built a good framework for exploring whether graph neural networks can improve prediction by incorporating experimental metadata, with some components of that pipeline already automated. Finally, we set up a workflow for fine-tuning a transformer to determine whether it improves prediction quality for DNA features on our dataset, which will also be useful for completing the graph neural network training pipeline.

Thank you to Gonzalo Vidal and Chris Myers for being my mentors in this project, and to the National Resource for Network Biology for accepting this project under the Google Summer of Code program!