GitHub Repository: All work up to and including commit 3761fd7 was done by me during GSoC.
Task 2: Building SBOL-to-ML Functionality
Now that we had SBOL data from Task 1, the next step was to build functionality in our package that transforms SBOL into a tabular dataset (similar to a pandas DataFrame). Such a dataset is compatible with standard scikit-learn models such as random forests (RandomForestClassifier or RandomForestRegressor), LinearRegression, and LogisticRegression.
Initial Approach: pySBOL2 Extraction
My first attempt used pySBOL2 directly to pull SBOL data, extract the individual components I had built in Task 1, and collect information such as:
- DNA sequence
- Start and end positions
- Target value (label)
The problem was that this approach was too dataset-specific and difficult to generalize.
Switching to SPARQL Queries
SBOL is built on RDF, so its data is a graph of subject-predicate-object triples, and SPARQL is the natural way to query it. A single SPARQL query can extract arbitrary information from an SBOL document without code written for one particular dataset.
For most practical synthetic biology ML models:
- Features come from the DNA sequence
- The target label can be any SBOL URI the user specifies
Our package function now:
- Runs a SPARQL query to extract the DNA sequence
- Pulls the target label
- Builds a tabular dataset with those two columns
Preprocessing DNA Sequences
Even with a table, the DNA sequence is still a string of characters, not a vector. The next step was to add preprocessing functions to convert DNA into ML-friendly features.
We implemented several methods inspired by the research paper:
- k-mer features → Count every k-length substring occurrence
- One-hot encoding → Convert DNA bases into binary vectors
  - A = [0,0,0,1]
  - G = [0,0,1,0]
  - T = [0,1,0,0]
  - C = [1,0,0,0]
- GC-score → Fraction of G and C bases in the sequence (scalar feature)
- PWM (position weight matrix) scores → Not included in our package, since they require a pre-established genome matrix and do not generalize across datasets
The original paper used R scripts for preprocessing. Instead, we created our own Python functions that replicate the same behavior.
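As an illustration of the three encodings we kept, here is a minimal pure-Python sketch. The function names (`kmer_counts`, `one_hot`, `gc_content`) are hypothetical and may not match the package. The one-hot table is the one listed above. Mapping unknown bases such as N to an all-zero vector, and ordering k-mers lexicographically over ACGT, are assumptions made for this example.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

# One-hot table taken from the list above.
ONE_HOT = {
    "A": [0, 0, 0, 1],
    "G": [0, 0, 1, 0],
    "T": [0, 1, 0, 0],
    "C": [1, 0, 0, 0],
}


def kmer_counts(seq, k):
    """Count overlapping k-mers, returned in lexicographic order over ACGT.

    The output always has 4**k entries. k-mers containing other characters
    (for example N) are ignored in the output (assumption of this sketch).
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(kmer)] for kmer in product(BASES, repeat=k)]


def one_hot(seq):
    """Encode each base as a 4-element binary vector; unknown bases become zeros."""
    return [list(ONE_HOT.get(base, [0, 0, 0, 0])) for base in seq.upper()]


def gc_content(seq):
    """Fraction of G and C bases in the sequence; 0.0 for an empty sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```

For example, `kmer_counts("AAAT", 2)` has 16 entries with a count of 2 for AA and 1 for AT, and `gc_content("ATGC")` is 0.5.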
Final Pipeline Flow
With these functions, we completed the SBOL → ML preprocessing pipeline. The typical user workflow became:
- Pull data from SynBioHub using pySBOL2
- Run our function to generate a tabular dataset
- Apply one or more preprocessing functions (k-mer, one-hot, GC, etc.)
- Get a vectorized dataset ready for ML training
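Note: All DNA sequences are padded or trimmed to the same length, since most standard ML models require fixed-size inputs and position-based encodings such as one-hot would otherwise produce vectors of different lengths.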
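A minimal sketch of that length normalization is shown below. The helper name `pad_or_trim`, the choice of N as the padding character, and padding or trimming at the right-hand end are all assumptions for illustration; the package may make different choices.

```python
def pad_or_trim(seq, length, pad_char="N"):
    """Return seq cut or right-padded to exactly `length` characters."""
    if len(seq) >= length:
        return seq[:length]  # trim from the right-hand end
    return seq + pad_char * (length - len(seq))  # pad on the right


# Normalize a small batch of sequences to a common length of 6.
sequences = ["ATGC", "ATGCGTAA", "TTGACA"]
fixed = [pad_or_trim(s, 6) for s in sequences]
```

Here `fixed` contains three strings of length 6, so any per-position encoding applied afterwards yields vectors of the same size.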