Taalas serves Llama 3.1 8B at 17,000 tokens/second

By Simon Willison's Weblog February 20, 2026 · Edited February 20, 2026

Taalas serves Llama 3.1 8B at 17,000 tokens/second (https://taalas.com/the-path-to-ubiquitous-ai/)

This new Canadian hardware startup just announced their first product - a custom hardware implementation of the Llama 3.1 8B model (from July 2024 (https://simonwillison.net/2024/Jul/23/introducing-llama-31/)) that can run at a staggering 17,000 tokens/second.

I was going to include a video of their demo but it’s so fast it would look more like a screenshot. You can try it out at chatjimmy.ai (https://chatjimmy.ai).

They describe their Silicon Llama as “aggressively quantized, combining 3-bit and 6-bit parameters.” Their next generation will use 4-bit - presumably they have quite a long lead time for baking out new models!
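To get a rough feel for what those bit widths mean, here is a back-of-the-envelope sketch of how much storage the weights alone would need at different precisions. It assumes a nominal 8 billion parameters and a single uniform bit width per row. Taalas have not said how the 3-bit and 6-bit parameters are split, so the real figure presumably falls somewhere between the 3-bit and 6-bit rows. The function name `weight_storage_gb` is made up for this illustration.

```python
def weight_storage_gb(num_params: float, bits_per_param: float) -> float:
    """Storage for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a nominal 8 billion parameters; the exact count differs slightly.
PARAMS = 8e9

# 16-bit is included only as a familiar unquantized baseline for comparison.
for bits in (3, 4, 6, 16):
    print(f"{bits:>2}-bit: {weight_storage_gb(PARAMS, bits):.1f} GB")
```

This only counts the weights. It says nothing about how the chip actually lays them out in silicon.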

Via Hacker News (https://news.ycombinator.com/item?id=47086181)

Tags: ai (https://simonwillison.net/tags/ai), generative-ai (https://simonwillison.net/tags/generative-ai), llama (https://simonwillison.net/tags/llama), llms (https://simonwillison.net/tags/llms)

Reference: https://simonwillison.net/2026/Feb/20/taalas/#atom-everything