Introducing Helix-mRNA-v0
We are proud to introduce the first iteration of our mRNA Bio Foundation Model: Helix-mRNA-v0, the first hybrid architecture for mRNA sequences. It unlocks new frontiers and use cases in mRNA therapy.
It is designed to:
- Be Efficient, handling long sequence lengths effortlessly
- Balance Diversity & Specificity, leveraging a two-step pretraining approach
- Deliver High Resolution, operating at the single-nucleotide level
Helix-mRNA-v0 at a glance
- A small model: 5M parameters.
- Through our two-step pretraining approach, we pretrain on a diverse set of 26M mRNA sequences (312B tokens in total), achieving SOTA performance while offering more flexibility.
- Our hybrid architecture integrates attention mechanisms with state-space models, leveraging the unique advantages of each approach.
- The model demonstrates remarkable efficiency through its streamlined design, maintaining high performance at a smaller scale.
- Our implementation of single-nucleotide tokenization enables fine-grained insights at the molecular level.
Why Helix-mRNA?
The intrinsic properties of mRNA molecules, such as translation efficiency, stability, and half-life, play an important role in determining the performance of mRNA-based therapies. Optimizing these properties is therefore essential in preclinical development. However, experimental optimization is often constrained by high costs and extensive timelines. This underscores the importance of robust in silico approaches to accelerate innovation while minimizing costs.
With Helix-mRNA we can optimize mRNA properties for specific functional requirements. It can be fine-tuned to improve translation efficiency, stability, or half-life, depending on the desired application. The model can also evaluate existing mRNA sequences, assisting in the prioritization of candidates. Beyond that, Helix-mRNA can be applied to generative mRNA design, enabling the creation of diverse mRNA sequences for applications such as designing mRNA payloads for vaccines against pathogens like SARS-CoV-2.
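To make the candidate-prioritization idea concrete, here is a minimal sketch that scores sequence embeddings with a small regression head. Everything in it is an illustrative assumption: the embedding width, the head architecture, and the `rank_candidates` helper are not part of Helix-mRNA's API, and the embeddings are stand-ins for whatever the model produces.

```python
import torch
import torch.nn as nn

# Hypothetical setup: embeddings of candidate mRNA sequences (e.g. from
# Helix-mRNA) are scored by a small regression head that has been
# fine-tuned on a measured property such as translation efficiency.
d_model = 256  # assumed embedding width

head = nn.Sequential(
    nn.Linear(d_model, 64),
    nn.ReLU(),
    nn.Linear(64, 1),  # predicted property, e.g. translation efficiency
)

def rank_candidates(embeddings: torch.Tensor) -> torch.Tensor:
    """Return candidate indices sorted from most to least promising."""
    with torch.no_grad():
        scores = head(embeddings).squeeze(-1)
    return scores.argsort(descending=True)

# Stand-in for real embeddings of 8 candidate sequences.
candidate_embeddings = torch.randn(8, d_model)
priority_order = rank_candidates(candidate_embeddings)
print(priority_order)
```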
Pioneering next-generation mRNA models
At Helical we believe that Bio Foundation Models need to have three main qualities:
- Be efficient on the long contexts of mRNA
- Start from a diverse dataset, then specialize on each niche (e.g. species)
- Include multiple pretraining stages that account for data quality
To be efficient on the long context lengths frequently found in mRNA sequences, Helix-mRNA is the first hybrid state-space and attention-based model applied to mRNA, combining the best of both worlds between these two approaches (see An Empirical Study of Mamba-based Language Models). These traits make it particularly suitable for studying full-length transcripts, splice variants, and complex mRNA structural elements.
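To illustrate what such a hybrid looks like, here is a conceptual PyTorch sketch of one state-space-plus-attention block. This is not the actual Helix-mRNA layer layout or its hyperparameters; it only assumes the publicly available `mamba_ssm` package for the Mamba block.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (requires CUDA)

class HybridBlock(nn.Module):
    """Conceptual hybrid block: one state-space layer followed by one
    attention layer. Illustrative only, not the Helix-mRNA architecture."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.ssm = Mamba(d_model=d_model)  # linear-time sequence mixing
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        x = x + self.ssm(self.norm1(x))   # SSM carries long-range context cheaply
        h = self.norm2(x)
        attn_out, _ = self.attn(h, h, h)  # attention adds precise token-to-token recall
        return x + attn_out
```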
Processing long mRNA sequences has traditionally been a computational challenge, forcing many methods to sacrifice sequence information through truncation or coarse-grained tokenization. Our efficient processing of long sequence lengths allows us to tokenize full-length transcripts at single-nucleotide resolution, mapping each base (A, C, U, G, N) to a unique integer. A further special character, E, is incorporated into the sequence to denote the start of each codon. This fine-grained approach maximizes the model's ability to extract patterns from the sequences (see HELM: Hierarchical Encoding for mRNA Language Modeling). Unlike coarser tokenization methods that group nucleotides together or use k-mer based approaches, our single-nucleotide resolution preserves the full sequential information of the mRNA molecule. This simple yet effective encoding scheme ensures that no information is lost during preprocessing, allowing the downstream model to learn directly from the raw sequence composition.
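A minimal sketch of this encoding scheme is shown below. The specific integer IDs are placeholders (the released tokenizer may assign different ones), and the sketch assumes codons are consecutive triplets from a given coding-sequence offset.

```python
# Illustrative single-nucleotide tokenizer following the scheme described
# above. Integer IDs are placeholders, not the model's actual vocabulary.
VOCAB = {"E": 0, "A": 1, "C": 2, "G": 3, "U": 4, "N": 5}

def tokenize_mrna(seq: str, cds_start: int = 0) -> list[int]:
    """Map each base to an integer, inserting the 'E' codon-start marker.

    `cds_start` marks where the coding sequence begins; codons are taken
    as consecutive triplets from that offset (an assumption in this sketch).
    """
    tokens = []
    for i, base in enumerate(seq.upper().replace("T", "U")):
        if i >= cds_start and (i - cds_start) % 3 == 0:
            tokens.append(VOCAB["E"])  # codon-start marker
        tokens.append(VOCAB.get(base, VOCAB["N"]))  # unknown bases map to N
    return tokens

print(tokenize_mrna("AUGGCC"))  # [0, 1, 4, 3, 0, 3, 2, 2]
```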
Pretraining data
Helix-mRNA was pretrained on a curated genomic dataset consisting of 26 million mRNAs from diverse eukaryotic species and 238 clinically relevant viruses (312 billion tokens in total). Eukaryotic sequences included representative taxa from animals (mammals, birds, reptiles, amphibians, fish, and other animal clades), fungi, and protists, while plant sequences were excluded. The viral component encompassed major human pathogens including respiratory viruses (SARS-CoV-2, influenza A/B/C, RSV), retroviruses (HIV-1/2, HTLV-1/2), hepatotropic viruses (HBV, HCV), herpesviruses (HSV-1/2, EBV, VZV), and emerging arboviruses (Zika, Dengue 1-4). This taxonomically diverse pretraining dataset was specifically curated to capture both deep evolutionary conservation patterns and the distinct nucleotide compositions and structures characteristic of viral genomes. The broad phylogenetic coverage enables the model to learn robust representations of sequence features across different evolutionary distances and genomic architectures.
Model performance
We compare Helix-mRNA against leading transformer-based models (CodonBERT, HELM, XE) to evaluate its performance against the current state-of-the-art (SOTA), and it outperforms these models on most benchmarks.
Unlike existing models which specialize in specific mRNA tasks, Helix-mRNA can evaluate various downstream mRNA tasks including codon-related and -unrelated tasks, making it a comprehensive solution for mRNA analysis.
Get started
The model is available in the open-source Helical package and the pretrained weights are available on Hugging Face.
- Download and use the model here!
- Get the model weights on Hugging Face.
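A minimal usage sketch is shown below. The import path, class names, and configuration fields follow the Helical documentation at the time of writing; treat them as assumptions and check the package README for the current API.

```python
import torch
# Import path and class names assumed from the Helical docs; verify
# against the current package release.
from helical.models.helix_mrna import HelixmRNA, HelixmRNAConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

# Configure and load the pretrained model (weights fetched from Hugging Face).
config = HelixmRNAConfig(batch_size=4, device=device, max_length=1024)
helix_mrna = HelixmRNA(configurer=config)

# Example inputs with 'E' tokens marking codon starts in the coding region.
sequences = ["EAUGEGGGECCCEUGA", "EAUGEAAAEUUUEUAG"]
dataset = helix_mrna.process_data(sequences)
embeddings = helix_mrna.get_embeddings(dataset)  # one embedding per sequence
print(embeddings.shape)
```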
Get in contact with us about the model here.
Acknowledgements
We thank LuxProvide for providing the computational resources needed to train this first version of Helix-mRNA.
About Helical
Helical is an open-core platform for computational biologists and data scientists to effortlessly integrate single-cell & genomics AI Bio Foundation Models in early-stage drug discovery.
Check out our open-source library. Follow or subscribe to stay up-to-date with the latest developments in Bio Foundation Models.