Model Updates
Stay up to date with the latest model support updates for Tenstorrent.
May 2025
Llama 3.1-8B
Added support for Llama 3.1 8B on Blackhole P100, P150, and 2xP150.
Mistral 7B
Added support for Mistral 7B in models/tt_transformers.
Integrated Mistral 7B into the vLLM fork.
Llama 3.2-90B-Vision
Added support for Llama 3.2 90B Vision on QuietBox.
April 2025
TT-Transformers
Added support for non-uniform data format configurations in decoder layers.
Llama 3.1-70B – Galaxy
Achieved 45 t/s/u decode mode on Galaxy with batch size 32 and sequence length 128.
Optimizations: DRAM prefetching, SubDevices, CCL via TT-Fabric.
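For context, t/s/u is decode tokens per second per user; aggregate device throughput is that figure times the batch size. A minimal sketch (hypothetical helper name):

```python
def aggregate_throughput(tokens_per_sec_per_user: float, batch_size: int) -> float:
    """Aggregate decode throughput across all concurrent users."""
    return tokens_per_sec_per_user * batch_size

# 45 t/s/u at batch size 32, as in the Galaxy result above
print(aggregate_throughput(45, 32))  # 1440 tokens/s across the whole device
```

So the figure above corresponds to roughly 1,440 total decode tokens per second on the Galaxy system.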
March 2025
TT-Transformers
Moved llama3 demos to tt_transformers. Added hybrid data/tensor parallelism.
Whisper
Added support for Whisper (distil-large-v3) model on N150.
QwQ-32B
Added support for QwQ-32B on QuietBox.
February 2025
DeepSeek R1 Distill Llama 3.3 70B
Added support for DeepSeek R1 Distill Llama 3.3 70B on QuietBox.
Qwen 2.5
Added support for Qwen2.5-7B on N300 and Qwen2.5-72B on QuietBox.
Llama 3.1/3.2
Overhauled demo script with simplified causal generation. Added support for input overrides.
Llama 3.1/3.2
Enabled HuggingFace model format support.
Llama 3.2-11B-Vision
Added text-only prompt processing and vLLM fork support.
January 2025
Llama 3 series
Integrated into vLLM fork. Enabled max context prefill (131072) on N150/N300.
December 2024
Llama 3.1/3.2
Added support for batch size 32 and the maximum context length (131072 tokens).
Added full hardware compatibility for the 1B/3B/8B/11B/70B models: all are now compatible with N150, N300, QuietBox, and Galaxy, except 70B, which is supported only on QuietBox and Galaxy due to its size.
Llama 3.1/3.2
Improved the decode performance of the 1B/3B/8B/11B text models (for 8B, increased from ~23 t/s/u to ~28 t/s/u) by using BFP4 weights (instead of BFP8) for FF1 and FF3 in the MLP.
Added the option to specify custom model configurations, with two defaults for performance and accuracy already provided.
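To see why BFP4 roughly halves MLP weight storage and bandwidth: in block floating-point (BFP) formats, elements share one exponent per block, so BFP8 costs about 8.5 bits per element and BFP4 about 4.5 (assuming a 16-element block with an 8-bit shared exponent, stated here for illustration). A rough size estimate using Llama 3.1 8B's public MLP dimensions (hidden 4096, intermediate 14336):

```python
def bfp_bytes(num_elements: int, mantissa_bits: int, block_size: int = 16) -> float:
    """Approximate tensor size in a block floating-point format where each
    block of `block_size` values shares one 8-bit exponent (assumed layout)."""
    bits = num_elements * mantissa_bits + (num_elements / block_size) * 8
    return bits / 8

n = 4096 * 14336  # elements in one FF1/FF3 weight matrix
for fmt, mantissa in (("BFP8", 8), ("BFP4", 4)):
    print(fmt, round(bfp_bytes(n, mantissa) / 2**20, 1), "MiB per FF matrix")
```

Per this estimate, each FF matrix drops from about 59.5 MiB to about 31.5 MiB, which reduces the DRAM reads that dominate decode latency.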
November 2024
Llama 3.2 - 1B/3B/11B
Created a new shared codebase for the Llama3 family of models, with newly added support for Llama3.2-1B/3B/11B.
Llama 3/3.1 - 70B
Added support for the ttnn.experimental.rotary_embedding_llama op in decode mode, eliminating unnecessary device transfers of rotation matrices.
October 2024
Llama 3/3.1 - 70B
Enabled prefill workloads to pad to multiples of 1024 instead of powers of 2, improving overall performance for longer sequences.
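The difference is easiest to see numerically. A small sketch (hypothetical helper names) comparing next-power-of-two padding with round-up-to-1024 padding:

```python
def pad_pow2(seq_len: int) -> int:
    """Old scheme: pad the prefill length up to the next power of two."""
    return 1 << (seq_len - 1).bit_length()

def pad_1024(seq_len: int) -> int:
    """New scheme: pad up to the next multiple of 1024."""
    return -(-seq_len // 1024) * 1024  # ceiling division, then scale back up

# For longer sequences, multiple-of-1024 padding wastes far less compute:
for n in (3000, 9000, 33000):
    print(n, pad_pow2(n), pad_1024(n))  # e.g. 9000 -> 16384 vs 9216
```

A 9000-token prompt previously padded to 16384 tokens of prefill work; it now pads to 9216.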
Llama 3.1 - 8B
Added support for continuous batching.
Added paged caching support for PagedAttention.
Added a demo that runs with TT-NN tracing (23 t/s/u decode on main).
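Paged caching stores the KV cache in fixed-size pages that need not be contiguous, with a per-sequence block table mapping logical pages to physical ones. A plain-Python sketch of that lookup (the page size here is assumed for illustration, not the kernel's actual block size):

```python
PAGE_SIZE = 64  # tokens per KV page (illustrative value)

def locate(block_table: list[int], token_pos: int) -> tuple[int, int]:
    """Map a token position in a sequence to (physical page, offset in page)."""
    return block_table[token_pos // PAGE_SIZE], token_pos % PAGE_SIZE

# A sequence whose logical pages 0..2 live in scattered physical pages:
table = [7, 2, 11]
print(locate(table, 130))  # token 130 -> logical page 2 -> (11, 2)
```

Because pages are allocated on demand, memory scales with the tokens actually generated rather than the maximum context length.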
September 2024
Llama 3/3.1 - 70B
Added support for 128K context length using PagedAttention.
Added a continuous batching demo for running multiple batches of users consecutively.
Added the option to enable TT-NN tracing.
Mixtral 8x7B
Note: This feature is available as of release v0.52.0-rc1.
Added support for any user prompt size up to a maximum of 32K tokens.
August 2024
Falcon 7B
Added a data parallel demo for a single Galaxy (32 chips).
Refactored all modules and tests to use ttnn multi-device tensors.
Llama 3.1 - 8B
Note: Release v0.51.0-rc33.
Added multi-batching support to the demo for running multiple batches of users consecutively.
Mixtral 8x7B
Improved end-to-end performance through optimizations to the attention mask in flash decoding.
Llama 3.1 - 8B
Added support for flash decoding.
Mistral 7B
Updated the demo to support multiple batches of users.
Mamba-2.8B
Updated the demo to use the full prefill graph instead of processing the prompt one token at a time with decode.
Mixtral 8x7B
Added support for decode with 32K context length using flash decoding.
Fused the mixture of experts into a single operation using ttnn.moe.
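The fused op replaces the routing-plus-weighted-sum pattern of a mixture-of-experts layer. As a reference for what that pattern computes, here is a plain-Python sketch of top-k expert routing (Mixtral routes each token to 2 of 8 experts; the scalar "experts" below stand in for the real FFNs):

```python
import math

def moe(x, experts, gate_logits, k=2):
    """Top-k mixture of experts: pick the k highest gate logits, softmax over
    just those, and return the weighted sum of the chosen experts' outputs."""
    topk = sorted(range(len(gate_logits)), key=gate_logits.__getitem__, reverse=True)[:k]
    weights = [math.exp(gate_logits[i]) for i in topk]
    total = sum(weights)
    return sum(w / total * experts[i](x) for i, w in zip(topk, weights))

# Three toy experts (real ones are gated FFNs); expert 1 gets the largest gate.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x]
print(moe(4.0, experts, [0.0, math.log(3), 0.0]))  # ~0.75*8 + 0.25*4 = 7.0
```

Fusing these steps into one device op avoids materializing each expert's output separately before the weighted combine.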
July 2024
Llama 3.1 - 8B
Added support for Llama 3.1 - 8B.
Runs fast prefill for sequence lengths of up to 512 tokens.
Supports a maximum context length of 8K tokens.
Llama 3/3.1 - 70B
Added support for Llama 3.1 70B (new scaled rotary position embeddings).
Prefill and decode now support 8K context length with batch size 16.
Mistral 7B
Added prefill support for 4K context length, using scaled dot product attention.