
Model Updates

Stay up to date with the latest model support updates for Tenstorrent.

May 2025

May 26

Llama 3.1-8B
Added support for Llama 3.1 8B on Blackhole P100, P150, and 2xP150.

May 26

Mistral 7B
Added support for Mistral 7B in models/tt_transformers.
Integrated Mistral 7B into the vLLM fork.

May 5

Llama 3.2-90B-Vision
Added support for Llama 3.2 90B Vision on QuietBox.

April 2025

Apr 22

TT-Transformers
Added support for non-uniform data format configurations in decoder layers.

Apr 7

Llama 3.1-70B – Galaxy
Achieved 45 t/s/u in decode mode on Galaxy with batch size 32 and sequence length 128.
Optimizations: DRAM prefetching, SubDevices, CCL via TT-Fabric.

March 2025

Mar 24

TT-Transformers
Moved llama3 demos to tt_transformers. Added hybrid data/tensor parallelism.

Mar 24

Whisper
Added support for Whisper (distil-large-v3) model on N150.

Mar 10

QwQ-32B
Added support for QwQ-32B on QuietBox.

February 2025

Feb 24

DeepSeek R1 Distill Llama 3.3 70B
Added support for DeepSeek R1 Distill Llama 3.3 70B on QuietBox.

Feb 24

Qwen 2.5
Added support for Qwen2.5-7B on N300 and Qwen2.5-72B on QuietBox.

Feb 24

Llama 3.1/3.2
Overhauled demo script with simplified causal generation. Added support for input overrides.

Feb 10

Llama 3.1/3.2
Enabled HuggingFace model format support.

Feb 10

Llama 3.2-11B-Vision
Added text-only prompt processing and vLLM fork support.

January 2025

Jan 13

Llama 3 series
Integrated into vLLM fork. Enabled max context prefill (131072) on N150/N300.

December 2024

Dec 16

Llama 3.1/3.2
Added support for batch size 32 and the maximum context length (131072 tokens).
Added full hardware compatibility for the 1B/3B/8B/11B/70B models: all are now supported on N150, N300, QuietBox, and Galaxy, except for 70B, which is only supported on QuietBox and Galaxy due to its size.

Dec 02

Llama 3.1/3.2
Improved the decode performance of the 1B/3B/8B/11B text models (for 8B, increased from ~23 t/s/u to ~28 t/s/u) by using BFP4 weights (instead of BFP8) for FF1 and FF3 in the MLP.
Added the option to specify custom model configurations, with two defaults for performance and accuracy already provided.
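Decode is memory-bandwidth-bound, so the bytes of weights streamed from DRAM per token roughly bound decode speed; halving the precision of FF1 and FF3 cuts a large share of that traffic. A back-of-envelope sketch using the public Llama 3 8B MLP dimensions (the ~1 B and ~0.5 B per-weight figures approximate BFP8 and BFP4, ignoring shared-exponent overhead):

```python
# Back-of-envelope only: not a model of the actual tt-metal data movement.
# Llama 3 8B MLP dimensions (from the public model config):
hidden, inter, layers = 4096, 14336, 32

def mlp_weight_bytes(ff13_bytes_per_weight, ff2_bytes_per_weight=1.0):
    """Approximate bytes streamed per decoded token for all MLP weights.

    FF1/FF3 project hidden -> inter, FF2 projects inter -> hidden.
    Bytes-per-weight approximates block formats: BFP8 ~1.0, BFP4 ~0.5.
    """
    ff13 = 2 * hidden * inter * ff13_bytes_per_weight  # FF1 + FF3
    ff2 = inter * hidden * ff2_bytes_per_weight
    return layers * (ff13 + ff2)

bfp8_all = mlp_weight_bytes(1.0)
bfp4_ff13 = mlp_weight_bytes(0.5)
print(f"MLP bytes/token, all BFP8:        {bfp8_all / 1e9:.2f} GB")
print(f"MLP bytes/token, FF1/FF3 in BFP4: {bfp4_ff13 / 1e9:.2f} GB")
print(f"reduction: {1 - bfp4_ff13 / bfp8_all:.0%}")
```

The roughly one-third reduction in MLP weight traffic is consistent in direction with the observed ~23 to ~28 t/s/u improvement, though attention weights and KV cache traffic are outside this estimate.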

November 2024

Nov 18

Llama 3.2 - 1B/3B/11B
Created a new shared codebase for the Llama3 family of models, with newly added support for Llama3.2-1B/3B/11B.

Nov 18

Llama 3/3.1 - 70B
Added support for the ttnn.experimental.rotary_embedding_llama op in decode mode, eliminating unnecessary device transfers of rotation matrices.

October 2024

Oct 21

Llama 3/3.1 - 70B
Enabled prefill workloads to pad to multiples of 1024 instead of powers of 2, improving overall performance for longer sequences.
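The effect of the new padding rule is easy to see numerically: rounding up to the next multiple of 1024 wastes far less compute on long sequences than rounding up to the next power of two. A small sketch (helper names are illustrative, not tt-metal APIs):

```python
def pad_pow2(n: int) -> int:
    """Pad a sequence length up to the next power of two (old behavior)."""
    return 1 << (n - 1).bit_length()

def pad_1024(n: int) -> int:
    """Pad a sequence length up to the next multiple of 1024 (new behavior)."""
    return -(-n // 1024) * 1024  # ceiling division

# The longer the sequence, the larger the savings:
for n in (1500, 5000, 33000):
    print(n, pad_pow2(n), pad_1024(n))
```

At 33,000 tokens, for example, power-of-two padding runs prefill over 65,536 positions, while multiple-of-1024 padding needs only 33,792.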

Oct 07

Llama 3.1 - 8B
Added support for continuous batching
Added paged caching support for PagedAttention
Added a demo which runs with TT-NN tracing (23 t/s/u decode on main)
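PagedAttention stores the KV cache in fixed-size blocks, with a per-sequence block table mapping logical token positions to physical blocks in a shared pool; this is what lets continuous batching admit and retire users without large contiguous allocations. A minimal sketch of that bookkeeping (illustrative only; the tt-metal/vLLM implementation differs in detail):

```python
# Toy PagedAttention bookkeeping: block pool + per-sequence block table.
BLOCK_SIZE = 64  # tokens per KV cache block (illustrative value)

class BlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)

class Sequence:
    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.block_table: list[int] = []  # logical block -> physical block
        self.length = 0

    def append_token(self) -> tuple[int, int]:
        """Reserve cache space for one new token; return (block, offset)."""
        if self.length % BLOCK_SIZE == 0:  # current block full, or first token
            self.block_table.append(self.pool.alloc())
        slot = self.length % BLOCK_SIZE
        self.length += 1
        return self.block_table[-1], slot

pool = BlockPool(num_blocks=8)
seq = Sequence(pool)
for _ in range(130):          # 130 tokens -> ceil(130/64) = 3 blocks
    seq.append_token()
print(len(seq.block_table))   # 3
pool.release(seq.block_table) # finished sequence returns its blocks
```

When one user finishes, their blocks go back to the pool and a queued user can start immediately, which is the mechanism behind the continuous batching demo above.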

September 2024

Sep 23

Llama 3/3.1 - 70B
Added support for 128K context length using PagedAttention
Added a continuous batching demo for running multiple batches of users consecutively
Added the option to enable TT-NN tracing

Sep 09

Mixtral7Bx8
Note: This feature is available as of release v0.52.0-rc1
Added support for any user prompt size up to a maximum of 32K tokens

August 2024

Aug 26

Falcon7B
Added data parallel demo for a single Galaxy (32 chips)
Refactored all modules and tests to use ttnn multi-device tensors

Aug 26

Llama 3.1 - 8B
Note: This feature is available as of release v0.51.0-rc33
Added multi-batching support to the demo for running multiple batches of users consecutively

Aug 26

Mixtral7Bx8
Improved end-to-end performance through optimizations to the attention mask in flash decoding

Aug 12

Llama 3.1 - 8B
Added support for flash decoding

Aug 12

Mistral7B
Updated the demo to support multiple batches of users

Aug 12

Mamba-2.8B
Updated the demo to use the full prefill graph instead of processing the prompt one token at a time in decode mode

Aug 12

Mixtral7Bx8
Added support for decode with 32K context length using flash decoding
Fused mixture of experts into a single operation using ttnn.moe
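ttnn.moe fuses the expert routing and expert computation into a single device operation. What that routing computes can be sketched in plain Python; the experts below are toy scalar functions standing in for the real expert MLPs, and the routing shown is the standard softmax top-k scheme (Mixtral uses 8 experts with top-2 routing):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe(token, gate_logits, experts, top_k=2):
    """Route a token to its top_k experts and mix their outputs.

    Weights are the gate's softmax probabilities, renormalized over the
    chosen experts (as in Mixtral's top-2 routing).
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return sum(probs[i] / norm * experts[i](token) for i in top)

# Toy experts: expert k multiplies the (scalar) token by k+1.
experts = [lambda x, k=k: k * x for k in range(1, 9)]
out = moe(2.0, gate_logits=[0, 0, 5, 0, 0, 0, 5, 0], experts=experts)
print(out)  # experts with factors 3 and 7 win, weight 0.5 each -> 10.0
```

Fusing this into one op avoids materializing the routing weights and per-expert activations in separate kernels.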

July 2024

Jul 29

Llama 3.1 - 8B
Added support for Llama 3.1 - 8B
Runs fast prefill for sequence lengths of up to 512 tokens
Supports a maximum context length of 8K tokens

Jul 29

Llama 3/3.1 - 70B
Added support for Llama 3.1 70B (new scaled rotary position embeddings)
Prefill and decode now support 8K context length with batch size 16
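The "scaled rotary position embeddings" are Llama 3.1's rescaling of the RoPE frequencies so a model trained at 8K context extrapolates further. A sketch of that rescaling, with constants taken from the public Llama 3.1 reference code (this is illustrative and not the tt-metal implementation):

```python
import math

# Llama 3.1 RoPE frequency scaling constants (public reference code).
SCALE_FACTOR = 8.0
LOW_FREQ_FACTOR = 1.0
HIGH_FREQ_FACTOR = 4.0
OLD_CONTEXT_LEN = 8192

def scale_freq(freq: float) -> float:
    """Rescale one RoPE frequency per the Llama 3.1 scheme."""
    wavelen = 2 * math.pi / freq
    if wavelen < OLD_CONTEXT_LEN / HIGH_FREQ_FACTOR:
        return freq                 # high-frequency dims: left unchanged
    if wavelen > OLD_CONTEXT_LEN / LOW_FREQ_FACTOR:
        return freq / SCALE_FACTOR  # low-frequency dims: stretched 8x
    # mid band: smooth interpolation between the two regimes
    smooth = (OLD_CONTEXT_LEN / wavelen - LOW_FREQ_FACTOR) / (
        HIGH_FREQ_FACTOR - LOW_FREQ_FACTOR
    )
    return (1 - smooth) * freq / SCALE_FACTOR + smooth * freq
```

Intuitively, fast-rotating dimensions (which encode local order) are untouched, while slow-rotating dimensions (which encode long-range position) are stretched so they span the longer context.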

Jul 29

Mistral7B
Added prefill support for 4K context length, using scaled dot product attention