DEVLOG #6

May 22, 2026

Spent the afternoon benchmarking Carbon, a new family of genomic foundation models from HuggingFace trained on 1 trillion tokens of DNA sequences.

What is Carbon

Carbon is a causal language model for DNA. It uses a hybrid tokenizer that switches between BPE for text and 6-mer encoding for DNA sequences, triggered by a <dna> tag. The family has three sizes: 500M, 3B, and 8B. The 500M model is explicitly designed as a draft model for speculative decoding with the 3B.
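
To make the hybrid scheme concrete, here is a toy sketch of routing text and DNA through different encoders. It is not Carbon's tokenizer: the whitespace split is only a stand-in for BPE, and the closing </dna> tag, the non-overlapping 6-mer chunking, and the name toy_hybrid_tokenize are all assumptions made for illustration.

```python
import re

K = 6  # k-mer size; the devlog says DNA is encoded as 6-mers


def toy_hybrid_tokenize(text: str) -> list[str]:
    """Toy illustration only (hypothetical helper, not Carbon's tokenizer).

    Splits on <dna>...</dna> spans (the closing tag is an assumption),
    chunks DNA into non-overlapping 6-mers (also an assumption), and uses
    a whitespace split as a stand-in for BPE on everything else.
    """
    tokens: list[str] = []
    # re.split with one capturing group puts the DNA payloads at odd indices.
    parts = re.split(r"<dna>(.*?)</dna>", text, flags=re.DOTALL)
    for i, part in enumerate(parts):
        if i % 2 == 1:
            seq = part.strip().upper()
            tokens.append("<dna>")
            # Non-overlapping k-mers; the final chunk may be shorter than K.
            tokens.extend(seq[j:j + K] for j in range(0, len(seq), K))
        else:
            tokens.extend(part.split())
    return tokens


if __name__ == "__main__":
    print(toy_hybrid_tokenize("promoter: <dna>ACGTACGTAC</dna> end"))
```

Under these assumptions a bare 128bp sequence comes out as 23 tokens and a 512bp sequence as 87, which happens to match the prompt sizes in the throughput table, though the real tokenizer may reach those counts differently.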

Throughput on T4 GPU

I ran inference benchmarks for Carbon-500M on a Colab T4 GPU. The model is 512M parameters in bfloat16 precision.

Config          Prompt       Generated    Throughput
128bp context   23 tokens    64 tokens    27.2 tok/s
128bp context   23 tokens    256 tokens   25.0 tok/s
512bp context   87 tokens    64 tokens    26.6 tok/s
512bp context   256 tokens   256 tokens   26.3 tok/s

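Throughput is nearly flat across prompt and generation lengths (25.0 to 27.2 tok/s), which makes sense for a model this small. At a few hundred tokens of context, attention is a negligible share of the per-token cost, so there is no attention bottleneck on the T4.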
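
A minimal sketch of how tok/s figures like these could be measured. It is not the script behind the table; measure_throughput and its parameters are hypothetical names, and the generation call is passed in as a callable so the timing logic stays independent of any particular library.

```python
import time
from typing import Callable, Optional


def measure_throughput(
    generate_fn: Callable[[], int],
    n_warmup: int = 1,
    n_runs: int = 3,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return generated tokens per second, averaged over n_runs timed calls.

    generate_fn runs one generation and returns the number of NEW tokens.
    sync is an optional barrier (for example torch.cuda.synchronize) so that
    asynchronous GPU work has finished before the clock is read.
    """
    for _ in range(n_warmup):
        generate_fn()  # warm-up runs are executed but not timed

    total_tokens = 0
    total_seconds = 0.0
    for _ in range(n_runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        n_new = generate_fn()
        if sync is not None:
            sync()
        total_seconds += time.perf_counter() - start
        total_tokens += n_new
    return total_tokens / total_seconds
```

With transformers, generate_fn would wrap a model.generate(..., max_new_tokens=64) call and return the output length minus the prompt length, with torch.cuda.synchronize passed as sync.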

Speculative Decoding

Since Carbon-500M is designed as a draft for Carbon-3B, I tested speculative decoding using HuggingFace’s built-in assisted generation, which is enabled by passing the draft as the assistant_model argument to generate(). Both models together use about 8 GB of VRAM on the T4.
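
For reference, a minimal sketch of what that call looks like. The assistant_model argument is part of transformers' generate(); the wrapper name generate_with_draft is hypothetical, and the models and tokenizer are assumed to be loaded already, since the repository IDs are not listed in this entry.

```python
def generate_with_draft(target, draft, tokenizer, prompt: str, max_new_tokens: int = 64):
    """Greedy assisted generation: `draft` proposes tokens, `target` verifies.

    `target` and `draft` are assumed to be transformers causal LMs that share
    `tokenizer`. Passing draft=None falls back to plain generation, which
    gives the standalone baseline for comparison.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(target.device)
    extra = {} if draft is None else {"assistant_model": draft}
    output_ids = target.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding, so both paths are comparable
        **extra,
    )
    n_prompt = inputs["input_ids"].shape[1]
    return output_ids[0, n_prompt:]  # only the newly generated token ids
```

Running the same prompt once with the draft and once with draft=None gives the two columns compared in the table below.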

Config              3B standalone   3B with spec decoding   Speedup
128bp, 64 tokens    22.3 tok/s      19.4 tok/s              0.87x
128bp, 256 tokens   22.3 tok/s      20.1 tok/s              0.90x
512bp, 64 tokens    21.4 tok/s      18.6 tok/s              0.87x
512bp, 256 tokens   22.0 tok/s      19.7 tok/s              0.89x

Speculative decoding is about 10 to 13 percent slower than running 3B alone. This is not surprising on T4 for a few reasons.
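
The slowdown range can be recomputed from the table. A short sketch, with the numbers copied from the rows above; slowdown_percent is a hypothetical helper name.

```python
# (config, 3B standalone tok/s, 3B with spec decoding tok/s), copied from the table
rows = [
    ("128bp, 64 tokens", 22.3, 19.4),
    ("128bp, 256 tokens", 22.3, 20.1),
    ("512bp, 64 tokens", 21.4, 18.6),
    ("512bp, 256 tokens", 22.0, 19.7),
]


def slowdown_percent(standalone: float, assisted: float) -> float:
    """Percent by which assisted decoding is slower than the standalone run."""
    return (1.0 - assisted / standalone) * 100.0


slowdowns = [slowdown_percent(base, spec) for _, base, spec in rows]

if __name__ == "__main__":
    for (config, base, spec), pct in zip(rows, slowdowns):
        print(f"{config}: speedup {spec / base:.2f}x, {pct:.1f}% slower")
```

Every row lands between roughly 10 and 13 percent slower, consistent with the speedup column.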

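First, the T4 is memory bandwidth bound, not compute bound, and on this setup the draft is not much cheaper than the verifier. The 500M runs at about 27 tok/s standalone against about 22 tok/s for the 3B, so every drafted token costs nearly as much as a token from the 3B itself, and running both models per round costs more than the verification saves. Second, HuggingFace’s assistant_model path is a simple eager-mode implementation, not the optimized speculative decoding used in production systems like vLLM. The draft still proposes tokens one at a time, and although the 3B checks them in a single forward pass, the per-step overhead is probably not amortized the way it is in a dedicated serving stack. Third, speculative decoding gains are larger on A100 or H100 class hardware, where a small draft runs far faster than the verifier and there is enough spare compute that verifying several drafted tokens in parallel comes essentially for free.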
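
The bandwidth argument can be sanity-checked with a back-of-the-envelope ceiling: if every generated token requires reading all weights once, decode speed cannot exceed memory bandwidth divided by weight size. This is a rough sketch under stated assumptions: 320 GB/s is NVIDIA's published peak bandwidth for the T4, 2 bytes per parameter corresponds to bfloat16, and a flat 3e9 parameters for the 3B is a guess, since the exact count is not given here.

```python
def bandwidth_ceiling_tok_s(n_params: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Upper bound on batch-1 decode speed when each token reads every weight once.

    Ignores KV-cache traffic, activations, and any per-token software overhead,
    so real throughput will always be lower.
    """
    weight_gb = n_params * bytes_per_param / 1e9
    return bandwidth_gb_s / weight_gb


T4_BANDWIDTH_GB_S = 320.0  # published peak for the T4; sustained bandwidth is lower

ceiling_500m = bandwidth_ceiling_tok_s(512e6, 2, T4_BANDWIDTH_GB_S)  # 512M params (from this entry)
ceiling_3b = bandwidth_ceiling_tok_s(3e9, 2, T4_BANDWIDTH_GB_S)      # 3e9 params is an assumption

if __name__ == "__main__":
    print(f"500M ceiling: {ceiling_500m:.1f} tok/s")
    print(f"3B ceiling:   {ceiling_3b:.1f} tok/s")
```

Both measured rates sit well under these ceilings (about 27 against roughly 312 tok/s for the 500M, about 22 against roughly 53 tok/s for the 3B), which hints that fixed per-token overhead, and not only weight traffic, shapes these numbers. Treat that as a rough reading rather than a profile.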

The design intention still makes sense though. A proper implementation with vLLM’s speculative decoding backend on an A100 should show meaningful speedup. The 6x size ratio between draft and verifier, plus the shared tokenizer and training data, are all ideal conditions for high draft acceptance rates, which is where the actual wall clock improvement comes from.
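
How acceptance rate turns into wall clock speedup can be sketched with the simple analytical model commonly used for speculative decoding, which assumes each drafted token is accepted independently with probability alpha. The function name and the example values for alpha and gamma are hypothetical; I did not measure acceptance rates in these runs.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected speedup of speculative decoding over plain decoding.

    alpha:      probability that a drafted token is accepted (assumed i.i.d.)
    gamma:      number of tokens drafted per round
    cost_ratio: time for one draft step divided by time for one verifier step
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # Expected tokens produced per round, including the verifier's own token.
    tokens_per_round = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Round cost in units of one verifier forward pass:
    # gamma draft steps plus one parallel verification pass.
    round_cost = gamma * cost_ratio + 1.0
    return tokens_per_round / round_cost


if __name__ == "__main__":
    t4_ratio = 22.3 / 27.2  # 3B tok/s over 500M tok/s, standalone rates from this entry
    print(f"T4-like cost ratio {t4_ratio:.2f}: {expected_speedup(0.8, 4, t4_ratio):.2f}x")
    print(f"cost ratio 0.10: {expected_speedup(0.8, 4, 0.1):.2f}x")
```

With the standalone rates from this entry the cost ratio is about 0.82, and at a hypothetical acceptance rate of 0.8 with 4 drafted tokens the model predicts a slowdown. The same acceptance rate at a cost ratio of 0.1 predicts better than 2x, which is the regime the 6x size ratio is aiming for.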

What is next

Next I want to test Carbon on actual biology tasks rather than synthetic ATCG repeats. The evaluation suite includes variant effect prediction and sequence recovery tasks, which would give a better picture of where the model actually stands relative to Evo2 and GENERator.