DEVLOG #6
May 22, 2026
Spent the afternoon benchmarking Carbon, a new family of genomic foundation models from HuggingFace trained on 1 trillion tokens of DNA sequences.
What is Carbon
Carbon is a causal language model for DNA. It uses a hybrid tokenizer that switches between BPE for text and 6-mer encoding for DNA sequences, triggered by a <dna> tag. The family has three sizes: 500M, 3B, and 8B. The 500M model is explicitly designed as a draft model for speculative decoding with the 3B.
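A rough sketch of what the 6-mer side of the tokenizer implies for prompt length. This assumes non-overlapping 6-mers with a shorter final chunk, plus one extra token for the tag or a BOS marker. That is an inference from the prompt sizes in the benchmark tables (128bp gives 23 tokens, 512bp gives 87), not something confirmed from Carbon's tokenizer code. The function names are made up for illustration.

```python
import math


def kmer_chunks(seq: str, k: int = 6) -> list[str]:
    """Split a DNA string into non-overlapping k-mers (last chunk may be shorter).

    Assumption: Carbon's DNA mode chunks this way; the real tokenizer may differ.
    """
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def estimated_prompt_tokens(num_bp: int, k: int = 6, extra_tokens: int = 1) -> int:
    """Estimate prompt length in tokens for a DNA context of num_bp base pairs.

    extra_tokens is an assumed allowance for the <dna> tag or a BOS token.
    """
    return math.ceil(num_bp / k) + extra_tokens


print(kmer_chunks("ATCGATCGAT"))
print(estimated_prompt_tokens(128), estimated_prompt_tokens(512))
```

Under these assumptions the estimate matches the 23 and 87 token prompts reported for the 128bp and 512bp contexts.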
Throughput on T4 GPU
I ran inference benchmarks for Carbon-500M on a Colab T4 GPU. The model is 512M parameters in bfloat16 precision.
| Config | Prompt | Generated | Throughput |
|---|---|---|---|
| 128bp context | 23 tokens | 64 tokens | 27.2 tok/s |
| 128bp context | 23 tokens | 256 tokens | 25.0 tok/s |
| 512bp context | 87 tokens | 64 tokens | 26.6 tok/s |
| 512bp context | 87 tokens | 256 tokens | 26.3 tok/s |
Throughput is nearly flat across prompt and generation lengths (25.0 to 27.2 tok/s), which makes sense for a model this small: at these context lengths attention is not a bottleneck on the T4.
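A minimal sketch of how tok/s figures like these can be measured. The helper name and the `generate_fn` interface are assumptions for illustration, not the harness behind the table. With a real model, `generate_fn` would wrap `model.generate(..., max_new_tokens=n, min_new_tokens=n)`, call `torch.cuda.synchronize()` before returning, and return the number of newly generated tokens.

```python
import time


def measure_throughput(generate_fn, num_new_tokens: int,
                       warmup: int = 1, repeats: int = 3) -> float:
    """Return average generated tokens per second for generate_fn.

    generate_fn(num_new_tokens) must block until generation has finished
    and return how many new tokens it actually produced.
    """
    # Warmup runs absorb one-time costs (kernel compilation, cache allocation).
    for _ in range(warmup):
        generate_fn(num_new_tokens)

    total_tokens = 0
    total_seconds = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        produced = generate_fn(num_new_tokens)
        total_seconds += time.perf_counter() - start
        total_tokens += produced
    return total_tokens / total_seconds
```

Averaging over total tokens and total time, rather than averaging per-run rates, keeps one unusually fast or slow run from skewing the result.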
Speculative Decoding
Since Carbon-500M is designed as a draft for Carbon-3B, I tested speculative decoding using HuggingFace’s built-in assisted generation, enabled by passing the assistant_model argument to generate(). Both models together use about 8 GB of VRAM on the T4.
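For reference, a sketch of what the assisted generation call looks like in transformers. `assistant_model`, `max_new_tokens`, and `do_sample` are real `generate()` arguments. The wrapper names, the `<dna>` prompt format, and the repo IDs passed to `run_example` are assumptions, and the models may need extra loading options not shown here.

```python
def generate_with_assistant(target_model, draft_model, inputs, max_new_tokens=64):
    """Greedy generation from target_model with draft_model proposing tokens."""
    return target_model.generate(
        **inputs,
        assistant_model=draft_model,  # turns on assisted (speculative) decoding
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )


def run_example(target_id: str, draft_id: str, dna: str) -> str:
    """Load both models and generate; target_id and draft_id are hypothetical repo IDs."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(target_id)
    target = AutoModelForCausalLM.from_pretrained(
        target_id, torch_dtype=torch.bfloat16).to("cuda")
    draft = AutoModelForCausalLM.from_pretrained(
        draft_id, torch_dtype=torch.bfloat16).to("cuda")

    # Assumed prompt format: the <dna> tag followed by the raw sequence.
    inputs = tokenizer(f"<dna>{dna}", return_tensors="pt").to("cuda")
    output_ids = generate_with_assistant(target, draft, inputs)
    return tokenizer.decode(output_ids[0])
```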
| Config | 3B standalone | 3B with spec decoding | Speedup |
|---|---|---|---|
| 128bp, 64 tokens | 22.3 tok/s | 19.4 tok/s | 0.87x |
| 128bp, 256 tokens | 22.3 tok/s | 20.1 tok/s | 0.90x |
| 512bp, 64 tokens | 21.4 tok/s | 18.6 tok/s | 0.87x |
| 512bp, 256 tokens | 22.0 tok/s | 19.7 tok/s | 0.89x |
Speculative decoding is about 10 to 13 percent slower than running 3B alone. This is not surprising on T4 for a few reasons.
First, the T4 is memory bandwidth bound, not compute bound. Loading weights for two models per round costs more than the verification saves. Second, HuggingFace’s assistant_model path is a simple reference implementation, not the optimized speculative decoding used in production systems like vLLM. The draft model still generates its candidate tokens one at a time before the 3B verifies them, so every round carries several sequential forward passes of overhead. Third, speculative decoding gains are larger on A100 or H100 class hardware, where there is enough spare compute that verifying several draft tokens in parallel costs little more than a single decoding step.
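To make the draft-then-verify loop concrete, here is a toy greedy version. It is only a sketch: the "models" are plain functions that map a token list to the next token, and the function and parameter names are made up for illustration. A real implementation can score all draft positions in one forward pass of the target, while this toy checks them in a Python loop.

```python
def speculative_decode(target_next, draft_next, prompt, num_new_tokens, draft_len=4):
    """Toy greedy speculative decoding.

    Returns (tokens, rounds, acceptance_rate). The output always matches
    plain greedy decoding with target_next; the draft only changes how
    many rounds it takes to get there.
    """
    tokens = list(prompt)
    goal = len(tokens) + num_new_tokens
    rounds = accepted = proposed = 0

    while len(tokens) < goal:
        rounds += 1
        # 1. The draft proposes draft_len tokens, one at a time.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # 2. The target checks each position and keeps the longest prefix
        #    of draft tokens that it would have produced itself.
        n_ok = 0
        for i in range(draft_len):
            if target_next(tokens + draft[:i]) != draft[i]:
                break
            n_ok += 1
        proposed += draft_len
        accepted += n_ok
        tokens.extend(draft[:n_ok])

        # 3. The target adds one token of its own: the correction after a
        #    mismatch, or a bonus token when every draft token was accepted.
        tokens.append(target_next(tokens))

    rate = accepted / proposed if proposed else 0.0
    return tokens[:goal], rounds, rate
```

Each round costs draft_len draft steps plus one target pass, so the loop only pays off when most draft tokens are accepted and the draft steps are much cheaper than the target's.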
The design intent still makes sense, though. A proper implementation such as vLLM’s speculative decoding backend on an A100 should show a meaningful speedup. The 6x size ratio between draft and verifier, the shared tokenizer, and the shared training data all favor high draft acceptance rates, which is where the actual wall clock improvement comes from.
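A back-of-the-envelope model shows how acceptance rate and draft cost interact. This uses the standard expected-speedup formula for speculative decoding, which assumes each draft token is accepted independently with the same probability and that verifying a batch of draft tokens costs about one target step. The cost ratio for the T4 is taken from the tables above (3B at about 22.0 tok/s, 500M at about 26.3 tok/s). The acceptance rates, the draft length of 5, and the 0.2 cost ratio are assumed values for illustration, not measurements.

```python
def expected_speedup(accept_rate: float, draft_len: int, cost_ratio: float) -> float:
    """Expected speedup of speculative decoding over plain target decoding.

    accept_rate: probability that a draft token is accepted (assumed i.i.d.).
    draft_len:   draft tokens proposed per round.
    cost_ratio:  time per draft token divided by time per target token.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        tokens_per_round = draft_len + 1.0
    else:
        tokens_per_round = (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)
    # One round = draft_len draft steps plus one target verification pass.
    round_cost = draft_len * cost_ratio + 1.0
    return tokens_per_round / round_cost


# Draft cost relative to the target on the T4, from the measured tok/s.
T4_COST_RATIO = 22.0 / 26.3

for label, ratio in [("T4, measured ratio", T4_COST_RATIO),
                     ("hypothetical ratio 0.2", 0.2)]:
    for accept in (0.6, 0.8, 0.9):
        speedup = expected_speedup(accept, draft_len=5, cost_ratio=ratio)
        print(f"{label}: accept={accept:.1f} -> {speedup:.2f}x")
```

On the T4 the 500M is only about 16 percent cheaper per token than the 3B, so under this model every one of these acceptance rates comes out below 1x. With a draft that costs a fifth of a target step, the same acceptance rates all come out above 1x.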
What is next
Want to test Carbon on actual biology tasks rather than synthetic ATCG repeats. The evaluation suite includes variant effect prediction and sequence recovery tasks, which would give a better picture of where the model actually stands relative to Evo2 and GENERator.