Read part one here!

GitHub Repository: all work up to and including commit 3761fd7 was done by me during GSoC.

Task 2 — Building SBOL-to-ML Functionality

Now that we had SBOL data from Task 1, the next step was to build functionality in our package that could transform SBOL into a tabular dataset (similar to a pandas DataFrame). This dataset would then be compatible with standard scikit-learn models such as RandomForestClassifier, LinearRegression, and LogisticRegression.


Initial Approach: pySBOL2 Extraction

My first attempt used pySBOL2 directly to pull SBOL data, extract the individual components I had built in Task 1, and collect information such as:

  • DNA sequence
  • Start and end positions
  • Target value (label)

The problem was that this approach was too dataset-specific and difficult to generalize.


Switching to SPARQL Queries

Since SBOL is built on RDF (i.e., it is inherently triple-structured data), the better option was to use SPARQL queries. SPARQL is the standard language for querying RDF, which makes it ideal for extracting arbitrary information from SBOL.

For most practical synthetic biology ML models:

  • Features come from the DNA sequence
  • The target label can be any SBOL URI the user specifies

Our package function now:

  1. Runs a SPARQL query to extract the DNA sequence
  2. Pulls the target label
  3. Builds a tabular dataset with those two columns

Preprocessing DNA Sequences

Even with a table, the DNA sequence is still a string of characters, not a vector. The next step was to add preprocessing functions to convert DNA into ML-friendly features.

We implemented several methods inspired by the research paper:

  • k-mer features → Count every k-length substring occurrence
  • One-hot encoding → Convert DNA bases into binary vectors
    • A = [0,0,0,1]
    • G = [0,0,1,0]
    • T = [0,1,0,0]
    • C = [1,0,0,0]
  • GC-score → Fraction of G and C bases in the sequence (scalar feature)
  • PWM scores → Not generalized in our package (requires a pre-established position weight matrix for the motif of interest, so we skipped it)

The original paper used R scripts for preprocessing. Instead, we created our own Python functions that replicate the same behavior.


Final Pipeline Flow

With these functions, we completed the SBOL → ML preprocessing pipeline. The typical user workflow became:

  1. Pull data from SynBioHub using pySBOL2
  2. Run our function to generate a tabular dataset
  3. Apply one or more preprocessing functions (k-mer, one-hot, GC, etc.)
  4. Get a vectorized dataset ready for ML training

Note: All DNA sequences are padded or trimmed to the same length, since ML models require fixed-size inputs.
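A minimal sketch of that length normalization (the helper name is illustrative, not the package's actual API; padding with `N`, the IUPAC "any base" symbol, is an assumption):

```python
def fix_length(seq, length, pad_char="N"):
    """Trim sequences longer than `length` and pad shorter ones,
    so every row vectorizes to the same number of features."""
    return seq[:length].ljust(length, pad_char)

fix_length("ATGCGT", 4)  # 'ATGC'  (trimmed)
fix_length("AT", 4)      # 'ATNN'  (padded)
```

This runs before any of the per-base encodings (one-hot in particular), since those multiply the sequence length into the feature count.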

Read part three here!