Read part one here!

GitHub Repository: all work up to and including commit 3761fd7 was done by me during GSoC.

Task 2 — Building SBOL-to-ML Functionality

Now that we had SBOL data from Task 1, the next step was to build functionality in our package that could transform SBOL into a tabular dataset (similar to a pandas DataFrame). This dataset would then be compatible with standard scikit-learn models such as RandomForestClassifier, LinearRegression, and LogisticRegression.


Initial Approach: pySBOL2 Extraction

My first attempt used pySBOL2 directly to pull SBOL data, extract the individual components I had built in Task 1, and collect information such as:

  • DNA sequence
  • Start and end positions
  • Target value (label)

The problem was that this approach was too dataset-specific and difficult to generalize.


Switching to SPARQL Queries

Since SBOL is built on RDF (i.e., it is inherently triple-structured data), the better option was to use SPARQL queries. SPARQL is the standard language for querying RDF, which makes it ideal for extracting arbitrary information from SBOL.

For most practical synthetic biology ML models:

  • Features come from the DNA sequence
  • The target label can be any SBOL URI the user specifies

Our package function now:

  1. Runs a SPARQL query to extract the DNA sequence
  2. Pulls the target label
  3. Builds a tabular dataset with those two columns

Preprocessing DNA Sequences

Even with a table, the DNA sequence is still a string of characters, not a vector. The next step was to add preprocessing functions to convert DNA into ML-friendly features.

We implemented several methods inspired by the research paper:

  • k-mer features → Count every k-length substring occurrence
  • One-hot encoding → Convert DNA bases into binary vectors
    • A = [0,0,0,1]
    • G = [0,0,1,0]
    • T = [0,1,0,0]
    • C = [1,0,0,0]
  • GC-score → Fraction of G and C bases in the sequence (scalar feature)
  • PWM scores → Not generalized in our package (requires a pre-established position weight matrix for the motif of interest, so we skipped it)

The original paper used R scripts for preprocessing. Instead, we created our own Python functions that replicate the same behavior.


Final Pipeline Flow

With these functions, we completed the SBOL → ML preprocessing pipeline. The typical user workflow became:

  1. Pull data from SynBioHub using pySBOL2
  2. Run our function to generate a tabular dataset
  3. Apply one or more preprocessing functions (k-mer, one-hot, GC, etc.)
  4. Get a vectorized dataset ready for ML training

Note: All DNA sequences are padded or trimmed to the same length, since ML models require fixed-size inputs.
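A minimal sketch of that length normalization (the helper name is illustrative, not the package's actual API; padding with `N`, the IUPAC "any base" symbol, is an assumption):

```python
def fix_length(seq, length, pad_char="N"):
    """Trim sequences longer than `length` and pad shorter ones,
    so every row vectorizes to the same number of features."""
    return seq[:length].ljust(length, pad_char)

fix_length("ATGCGT", 4)  # 'ATGC'  (trimmed)
fix_length("AT", 4)      # 'ATNN'  (padded)
```

This runs before any of the per-base encodings (one-hot in particular), since those multiply the sequence length into the feature count.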

Read part three here!