Building DeepQuant: My Adventure with Transformers and RNA-seq
So, I had one of those classic 2 AM ideas: "What if I used a transformer to quantify RNA-seq data?" What could possibly go wrong? Thus, DeepQuant was born—a project that has become both my passion and the bane of my debugging existence.
I got started on this because, frankly, I was getting tired of the old-school ways of handling RNA-seq data. Traditional alignment-based tools like STAR are fantastic, but they often stumble when a read could plausibly map to multiple places. Think of reads from genes with tons of different isoforms or from gene families that look nearly identical. The aligner often just shrugs and assigns the read to multiple locations or picks one semi-randomly, which can really muddy the waters when you're trying to get accurate expression counts. Even the faster, alignment-free tools like Salmon, which are super clever, rely on k-mers and can sometimes get confused by repetitive sequences or complex transcript structures, leading to less precise assignments.
This is where I thought DeepQuant could shine. My "magic sauce" is a hierarchical approach that tries to be a bit smarter about where reads come from. Instead of just mapping a sequence, it first uses a transformer model (specifically, DNABERT-2) to figure out the broader gene family. Once it has that context, it drills down to pinpoint the exact isoform. It's like a detective who first identifies the neighborhood before trying to find the specific house address.
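To make the two-stage idea concrete, here's a minimal sketch of how a hierarchical assignment could flow. This is not the actual DeepQuant code: the model scoring is stubbed out as plain score dictionaries, and the family/isoform names are hypothetical examples.

```python
import math

def softmax(scores):
    """Turn raw model scores into probabilities."""
    exps = {k: math.exp(v) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def hierarchical_assign(family_scores, isoform_scores_by_family):
    """Two-stage assignment: pick the gene family first, then search
    for the best isoform *within that family only*."""
    family_probs = softmax(family_scores)
    family = max(family_probs, key=family_probs.get)
    isoform_probs = softmax(isoform_scores_by_family[family])
    isoform = max(isoform_probs, key=isoform_probs.get)
    # Joint confidence is the product of the two stage probabilities.
    return family, isoform, family_probs[family] * isoform_probs[isoform]

# Hypothetical scores a transformer might emit for one read:
family_scores = {"globin": 3.2, "keratin": -0.5}
isoform_scores = {
    "globin": {"HBB-201": 2.1, "HBB-202": 0.4, "HBD-201": -0.3},
    "keratin": {"KRT1-201": 1.0},
}
family, isoform, conf = hierarchical_assign(family_scores, isoform_scores)
```

The payoff of structuring it this way is that the isoform search space shrinks dramatically once the family is fixed, which is exactly the "neighborhood before house address" intuition.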
Now for the reality check: while I still believe this idea has legs, getting it to work has been a struggle. Right now, I think the bottleneck is my read assignment logic. The model is great at understanding the sequence context, but translating that understanding into a confident, unambiguous assignment to a single transcript is proving to be a serious challenge. It’s one thing for the model to say, "I'm pretty sure this read belongs to the Globin family," but it's another for it to definitively say, "...and it's from the HBB gene, isoform variant 2." I spend most of my days tweaking algorithms and staring at output logs, convinced I’m just one small logical fix away from a breakthrough.
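One way to frame that bottleneck in code: only commit to an assignment when the top transcript clearly beats the runner-up, and flag everything else as ambiguous. This is a generic margin rule I'm sketching for illustration, not DeepQuant's actual decision logic, and the transcript IDs are made up.

```python
def assign_or_flag(transcript_probs, margin=0.2):
    """Assign a read only when the best transcript beats the
    runner-up by at least `margin`; otherwise return None so the
    read can be handled as ambiguous downstream."""
    ranked = sorted(transcript_probs.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
        return ranked[0][0]
    return None  # too close to call

clear = assign_or_flag({"HBB-201": 0.70, "HBB-202": 0.20, "HBD-201": 0.10})
murky = assign_or_flag({"HBB-201": 0.45, "HBD-201": 0.40})
```

The hard part, of course, is that a fixed margin is crude: what I actually want is for the model itself to separate "Globin family" certainty from "HBB isoform 2" certainty, which is where the tweaking never ends.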
Despite the hurdles, DeepQuant is alive and kicking on GitHub. It even has some cool features I'm proud of, like uncertainty-aware assignments that tell you how confident the model is in its guess.
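The uncertainty-aware part can be boiled down to a single number per read. Here's one simple, standard way to compute it (normalized Shannon entropy turned into a confidence score); DeepQuant's exact formula may differ, so treat this as an illustrative sketch.

```python
import math

def assignment_confidence(transcript_probs):
    """Confidence in [0, 1]: one minus the normalized Shannon entropy
    of the assignment distribution. 1.0 means one clear winner,
    0.0 means the read is spread evenly across candidates."""
    probs = [p for p in transcript_probs.values() if p > 0]
    if len(probs) <= 1:
        return 1.0  # only one live candidate: trivially confident
    entropy = -sum(p * math.log2(p) for p in probs)
    return 1.0 - entropy / math.log2(len(probs))

confident = assignment_confidence({"HBB-201": 0.98, "HBB-202": 0.01, "HBD-201": 0.01})
torn = assignment_confidence({"HBB-201": 0.5, "HBB-202": 0.5})  # exactly 0.0
```

Reporting this score alongside each assignment means downstream analyses can down-weight or drop the reads the model was only guessing about, instead of silently trusting every call.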
Be warned, though: the entire thing is currently held together by Python scripts, a mountain of caffeine, and a fragile hope that my logic will eventually work. But that’s the fun of it, right? It's messy, it's experimental, and it’s how we push things forward. If you're feeling adventurous, I'd love for you to check it out.