
Introducing Helix-mRNA-v0

Helical Team

We proudly introduce the first iteration of our mRNA Bio Foundation Model: Helix-mRNA-v0, the first hybrid architecture for mRNA sequences. It unlocks new frontiers and use cases in mRNA therapy.

It is designed to:

  • Be Efficient, handling long sequence lengths effortlessly
  • Balance Diversity & Specificity, leveraging a 2-step pretraining approach
  • Deliver High Resolution, operating at single-nucleotide resolution

Snippets of Helix-mRNA-v0

  • A small model: 5M parameters.
  • Through our two-step pretraining approach, we pretrain on a diverse set of 26M mRNA sequences, 312B tokens in total, achieving SOTA performance while offering more flexibility.
  • Our hybrid architecture integrates attention mechanisms with state-space models, leveraging the unique advantages of each approach.
  • The model demonstrates remarkable efficiency through its streamlined design, maintaining high performance at a smaller scale.
  • Our implementation of single-nucleotide tokenization enables fine-grained insights at the molecular level.

Why Helix-mRNA?

The intrinsic properties of mRNA molecules, such as translation efficiency, stability, and half-life, play an important role in determining the performance of mRNA-based therapies. Optimizing these properties is therefore essential in preclinical development. However, experimental optimization is often constrained by high costs and extensive timelines. This underscores the importance of robust in silico approaches to accelerate innovation while minimizing costs.

With Helix-mRNA we can optimize mRNA properties for specific functional requirements. It can be fine-tuned to improve translation efficiency, stability, or half-life, depending on the desired application. The model can evaluate existing mRNA sequences, assisting in the prioritization of candidates. Beyond that, Helix-mRNA can be applied to generative mRNA design, enabling the creation of diverse mRNA sequences for applications such as designing mRNA payloads for vaccines targeting SARS-CoV-2.

Pioneering next-generation mRNA models

Helix-mRNA benchmark comparison against the transformer-based models HELM, XE, and CodonBERT.

At Helical, we believe that Bio Foundation Models need three main qualities:

  • They should be efficient on the long contexts of mRNA
  • They should start from a diverse dataset and then be specialised on each niche (e.g. species)
  • They should include multiple pretraining stages that account for data quality

To be efficient on the long context lengths frequently found in mRNA sequences, Helix-mRNA is the first hybrid state-space and attention-based model applied to mRNA sequences, combining the best of both worlds between these two approaches (see An Empirical Study of Mamba-based Language Models). These traits make it particularly suitable for studying full-length transcripts, splice variants, and complex mRNA structural elements.
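
To make the hybrid idea concrete, here is a minimal, illustrative PyTorch sketch that interleaves a state-space layer with an attention layer. The SimpleSSM below is a toy linear recurrence standing in for a Mamba-style kernel, and all dimensions are placeholders; it is not Helix-mRNA's actual architecture.

```python
import torch
import torch.nn as nn

class SimpleSSM(nn.Module):
    """Toy diagonal state-space layer: a stand-in for a Mamba-style block."""
    def __init__(self, dim):
        super().__init__()
        self.log_decay = nn.Parameter(torch.zeros(dim))  # per-channel state decay
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, seq_len, dim)
        u = self.in_proj(x)
        decay = torch.sigmoid(self.log_decay)  # keep the recurrence stable
        h = torch.zeros_like(u[:, 0])
        states = []
        for t in range(u.size(1)):  # linear-time scan over tokens
            h = decay * h + (1 - decay) * u[:, t]
            states.append(h)
        return self.out_proj(torch.stack(states, dim=1))

class HybridBlock(nn.Module):
    """A state-space layer followed by an attention layer, each with a residual."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.ssm = SimpleSSM(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.ssm(self.norm1(x))    # cheap linear-time sequence mixing
        y = self.norm2(x)
        attn_out, _ = self.attn(y, y, y)   # precise token-to-token interactions
        return x + attn_out

x = torch.randn(2, 128, 64)          # (batch, tokens, hidden)
print(HybridBlock(64)(x).shape)      # torch.Size([2, 128, 64])
```

The intuition behind this layout is that the state-space path scans the sequence in linear time, keeping long transcripts cheap, while the attention path recovers the precise pairwise interactions that recurrences tend to blur.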

Processing long mRNA sequences traditionally presents a computational challenge, forcing many methods to sacrifice sequence information through truncation or coarse-grained tokenization. Our efficient processing of long sequences allows us to tokenize full-length transcripts at single-nucleotide resolution by mapping each base (A, C, U, G, N) to a unique integer, preserving complete biological information without data loss. A further special character, E, is incorporated into the sequence to denote the start of each codon. This fine-grained approach maximizes the model's ability to extract patterns from the sequences (see HELM: Hierarchical Encoding for mRNA Language Modeling). Unlike coarser tokenization methods that group nucleotides together or use k-mer based approaches, our single-nucleotide resolution preserves the full sequential information of the mRNA molecule. This simple yet effective encoding scheme ensures that no information is lost during preprocessing, allowing the downstream model to learn directly from the raw sequence composition.
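
As a minimal sketch, the encoding described above might look like the following in Python. The specific integer IDs are assumptions for illustration, not Helix-mRNA's published vocabulary.

```python
# Illustrative single-nucleotide tokenizer: each base maps to an integer and
# 'E' marks the start of every codon. The integer IDs below are assumptions
# for illustration, not Helix-mRNA's published vocabulary.
VOCAB = {"E": 0, "A": 1, "C": 2, "U": 3, "G": 4, "N": 5}

def tokenize(seq: str) -> list[int]:
    """Insert an 'E' before each codon, then map every character to its ID."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    marked = "".join("E" + codon for codon in codons)
    return [VOCAB[base] for base in marked]

print(tokenize("AUGGCCUAA"))  # [0, 1, 3, 4, 0, 4, 2, 2, 0, 3, 1, 1]
```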

Pretraining data

Helix-mRNA was pretrained on a curated genomic dataset consisting of 26 million mRNA sequences from diverse eukaryotic species and 238 clinically relevant viruses, amounting to 312 billion tokens. Eukaryotic sequences included representative taxa from animals (mammals, birds, reptiles, amphibians, and fish), fungi, protists, and other eukaryotic clades, while plant sequences were excluded. The viral component encompassed major human pathogens including respiratory viruses (SARS-CoV-2, influenza A/B/C, RSV), retroviruses (HIV-1/2, HTLV-1/2), hepatotropic viruses (HBV, HCV), herpesviruses (HSV-1/2, EBV, VZV), and emerging arboviruses (Zika, Dengue 1-4). This taxonomically diverse pretraining dataset was specifically curated to capture both deep evolutionary conservation patterns and the distinct nucleotide compositions and structures characteristic of viral genomes. The broad phylogenetic coverage enables the model to learn robust representations of sequence features across different evolutionary distances and genomic architectures.

Data class pretraining percentages

Model performance

We compare Helix-mRNA against leading transformer-based models (CodonBERT, HELM, XE) to evaluate its performance against the current state-of-the-art (SOTA). We are able to outperform these SOTA models on most benchmarks.

Unlike existing models, which specialize in specific mRNA tasks, Helix-mRNA can be applied to a variety of downstream mRNA tasks, both codon-related and codon-unrelated, making it a comprehensive solution for mRNA analysis.

Benchmark comparisons showing our Helix-mRNA-v0 model against CodonBERT and HELM. The evaluation metrics vary by task type: Spearman correlation (S) and accuracy (A). CodonBERT results are taken from HELM: Hierarchical Encoding for mRNA Language Modeling for all tasks excluding Vaccine Degradation and E. coli Proteins due to data availability.

Get started

The model is available in the open source Helical package and the pretrained weights are available on Hugging Face.
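
As a quick-start illustration, a minimal embedding workflow might look like the sketch below. The import path, class names, and config fields are assumptions based on the Helical package's documented conventions at the time of writing; consult the package docs for the current API.

```python
# Hypothetical quick-start sketch; the import path, class names, and config
# fields follow the Helical package's documented conventions and may differ
# in your installed version.
import torch
from helical.models.helix_mrna import HelixmRNA, HelixmRNAConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

# Sequences use single-nucleotide characters, with 'E' marking codon starts.
sequences = ["EAUG" * 20, "EACU" * 20, "EAUU" * 20]

config = HelixmRNAConfig(batch_size=3, device=device, max_length=100)
model = HelixmRNA(configurer=config)

dataset = model.process_data(sequences)     # tokenize and batch the input
embeddings = model.get_embeddings(dataset)  # one embedding per sequence
print(embeddings.shape)
```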

Get in contact with us about the model here.

Acknowledgements

We thank LuxProvide for providing the computational resources needed to train this first version of Helix-mRNA.

About Helical

Helical is an open-core platform for computational biologists and data scientists to effortlessly integrate single-cell & genomics AI Bio Foundation Models in early-stage drug discovery.

Follow or subscribe to stay up-to-date with the latest developments in Bio Foundation Models.
