To help those who got a bit confused (like me): this is Groq, the company making accelerators designed specifically for LLMs, which they call LPUs (Language Processing Units) [0]. So they want to sell you their custom machines that, while expensive, will be much more efficient at running LLMs for you. Then there is also Grok [1], which is xAI's series of LLMs and competes with ChatGPT and other models like Claude and DeepSeek.
EDIT - It seems that Groq has stopped selling their chips and will now only partner to fund large build-outs of their cloud [2].
0 - https://groq.com/the-groq-lpu-explained/
1 - https://grok.com/
2 - https://www.eetimes.com/groq-ceo-we-no-longer-sell-hardware
I deeply crave prosumer hardware that can sit on my shelf and handle massive models, like 200-400B at a reasonable quant. Something like Groq or Digits, but at the cost of a high-end gaming PC, say $3k. This has to be a massive market, considering that even ancient Pascal-series GPUs that once went for $50 are going for $500.
I have that irresistible urge too, but I have to keep reminding myself that I could spend $2,000 in credits over the course of a year and get the performance and utility of a $40k server, with scalable capacity, and without any risk that the investment will be obsolete when Llama 5 comes out.
> I deeply crave prosumer hardware that can sit on my shelf and handle massive models, like 200-400B at a reasonable quant.
So, an Apple Mac Studio?
Nvidia's working on it. 200B at $3k
https://www.nvidia.com/en-us/products/workstations/dgx-spark...
> This has to be a massive market
It's not - it's absolutely a vanishingly small market.
The Framework Desktop is one not-absurdly-expensive option. The memory bandwidth isn't great (200-something GB/s), but anything faster with that much memory at least doubles the price (e.g. a Mac Studio, and only the highest-tier M chips have faster memory).
At home people would rather use the cloud.
> So they want to sell you their custom machines
They stopped selling the hardware to the public, and it takes an extraordinary amount of it to run these larger models due to limited RAM.
hi! i work @ groq and just made an account here to answer any questions for anyone who might be confused. groq has been around since 2016 and although we do offer hardware for enterprises in the form of dedicated instances, our goal is to make the models that we host easily accessible via groqcloud and groq api (openai compatible) so you can instantly get access to fast inference. :)
we have a pretty generous free tier and a dev tier you can upgrade to for higher rate limits. also, we deeply value privacy and don't retain your data. you can read more about that here: https://groq.com/privacy-policy/
https://groq.com/hey-elon-its-time-to-cease-de-grok/
Groq was suing Grok at some point, but Elon Musk is basically untouchable now.
Groq's blog post about the issue was a shitpost, not an actual legal document.
Suing for what? The name?
It's live on Groq, Together and Fireworks now.
All three of those can also be accessed via OpenRouter - with both a chat interface and an API:
- Scout: https://openrouter.ai/meta-llama/llama-4-scout
- Maverick: https://openrouter.ai/meta-llama/llama-4-maverick
Scout claims a 10 million input token length, but the available providers currently seem to limit it to 128,000 (Groq and Fireworks) or 328,000 (Together) - I wonder who will win the race to get that full-sized 10 million token window running?
Maverick claims 1 million; Fireworks offers 1.05M while Together offers 524,000. Groq isn't offering Maverick yet.
I'd pump the brakes a bit on the 10M context expectations. It's just another linear attention mechanism with RoPE scaling [1]. They're doing something similar to what Cohere did recently, using a global attention mask and a local chunked attention mask.
Notably, the max sequence length in training was 256k, but the native short context is still just 8k. I'd expect the retrieval performance to be all over the place here. Looking forward to seeing some topic modeling benchmarks run against it (I'll be doing so with some of my local/private datasets).
[1] https://github.com/meta-llama/llama-models/blob/eececc27d275...
EDIT: to be fair/complete, I should note that they do claim perfect NIAH text retrieval performance across all 10M tokens for the Scout model in their blog post: https://ai.meta.com/blog/llama-4-multimodal-intelligence/. There are some serious limitations and caveats to that particular flavor of test, though.
> using a global attention mask and a local chunked attention mask.
Would you mind expanding on this? Or point to a reference or two? Thanks! I am trying to understand it.
The source file I linked in my initial comment is honestly the most succinct way to understand how this works, but the TL;DR is that there is a NoPE layer interval parameter passed to the transformer block implementation. That defines how frequently a "no positional encoding" layer is used. The NoPE layers use the global attention mask, which is a traditional application of attention (attends to all tokens in the context window). The other layers use RoPE (rotary positional encodings) and a chunked local attention mask, which only attends to a fixed set of tokens in each chunk.
There is a wealth of literature to catch up on to understand the performance motivations behind those choices, but you can think of it as essentially a balancing act. They want to extend the context length, which is limited by conventional attention compute scaling. RoPE on the other hand is a trick that helps you to scale attention to longer context, but at the cost of poor retrieval across the entire context window. This approach is a hybrid of those two things. The recent Cohere models employ a similar methodology.
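To make that concrete, here's a minimal Python sketch of the layer/mask layout as I understand it. The interval and chunk sizes are made-up illustration values, not Meta's actual config (that lives in the source file linked above):

    # Illustration only: interleave "NoPE + global attention" layers with
    # "RoPE + chunked local attention" layers, as described above.
    import numpy as np

    def global_causal_mask(seq_len):
        # Standard causal mask: token i attends to every token j <= i.
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))

    def chunked_local_causal_mask(seq_len, chunk):
        # Causal mask restricted to fixed-size chunks: token i only attends
        # to earlier tokens that fall within the same chunk.
        idx = np.arange(seq_len)
        same_chunk = (idx[:, None] // chunk) == (idx[None, :] // chunk)
        return global_causal_mask(seq_len) & same_chunk

    def layer_plan(n_layers, nope_interval=4):
        # Hypothetical interval: every 4th layer skips RoPE and attends globally.
        for i in range(n_layers):
            if (i + 1) % nope_interval == 0:
                yield i, "NoPE + global mask"
            else:
                yield i, "RoPE + chunked local mask"

    print(chunked_local_causal_mask(8, chunk=4).astype(int))
    for layer, kind in layer_plan(8):
        print(layer, kind)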
Thanks. Is there a summary paper out there? I get that I need to read a lot and am willing to do so..
Not sure what you mean by summary paper, it's a pretty dense topic that assumes a fair amount of prior knowledge of the fundamentals. But maybe the Meta blog post will suffice for that?
Otherwise, yes, there are lots of papers on this and related topics, a few dozen in fact. Here are some notable ones; a couple of them are linked in their blog post.
RoFormer: Enhanced Transformer with Rotary Position Embedding - https://arxiv.org/abs/2104.09864
Scaling Laws of RoPE-based Extrapolation - https://arxiv.org/abs/2310.05209
The Impact of Positional Encoding on Length Generalization in Transformers - https://arxiv.org/abs/2305.19466
Scalable-Softmax Is Superior for Attention - https://arxiv.org/abs/2501.19399
Thanks! I am familiar with attention, linear attention, flash attention etc... just not up to speed on how it is scaled to 1M or 10M context windows.
Ah, got it. Yeah, then I'd focus on learning how RoPE works first. That will at least help you understand why retrieval in current long-context implementations is so limited.
A colleague from a discord I spend time in threw together this video a year or so ago, might be helpful as a first watch before a deep dive: https://www.youtube.com/watch?v=IZYx2YFzVNc
Covers positional encoding as a general concept first, then goes into rotary embeddings.
I might be biased by the products I'm building, but it feels to me that function support is table stakes now? Are open source models just missing the dataset to fine-tune one?
Very few of the models supported on Groq/Together/Fireworks support function calling. And rarely the interesting ones (DeepSeek V3, large llamas, etc)
100%. we've found that llama-3.3-70b-versatile and qwen-qwq-32b perform exceptionally well with reliable function calling. we had recognized the need for this and our engineers partnered with glaive ai to create fine tunes of llama 3.0 specifically for better function calling performance until the llama 3.3 models came along and performed even better.
i'd actually love to hear your experience with llama scout and maverick for function calling. i'm going to dig into it with our resident function calling expert rick lamers this week.
Thank you for saying this out loud. I've been losing my mind wondering where the discussion on this was. LLMs without Tool Use/Function Calling is basically a non starter for anything I want to do.
When I was working with LLMs without function calling I made the scaffold put some information in the system prompt that tells it some JSON-ish syntax it can use to invoke function calls.
It places more of a "mental burden" on the model to output tool calls in your custom format, but it worked enough to be useful.
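Roughly like this, as a sketch - the tool name and JSON shape below are my own invention for illustration, not any standard format:

    import json, re

    # Describe the tools and an ad-hoc JSON call syntax in the system prompt.
    SYSTEM_PROMPT = """You can call a tool by replying with exactly one JSON object:
    {"tool": "<tool_name>", "arguments": {...}}
    Available tools:
    - get_weather(city): current weather for a city.
    If no tool is needed, reply in plain text instead."""

    def extract_tool_call(reply):
        # Pull the first JSON object out of the model's reply, if any.
        match = re.search(r"\{.*\}", reply, re.DOTALL)
        if not match:
            return None
        try:
            payload = json.loads(match.group(0))
            return payload["tool"], payload.get("arguments", {})
        except (json.JSONDecodeError, KeyError):
            return None  # model ignored the format; treat as plain text

    print(extract_tool_call('{"tool": "get_weather", "arguments": {"city": "Oslo"}}'))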
Although Llama 4 is too big for mere mortals to run without many caveats, the economics of calling a dedicated-hosted Llama 4 are more interesting than expected.
$0.11 per 1M tokens, a 10 million token context window (not yet implemented in Groq), and faster inference due to fewer activated parameters allow for some specific applications that were not cost-feasible with GPT-4o/Claude 3.7 Sonnet. That's all dependent on whether the quality of Llama 4 is as advertised, of course, particularly around that 10M context window.
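For scale, a quick back-of-envelope on that price (the comparison rate below is a placeholder, not a quote of any provider's actual pricing):

    # $0.11 per 1M input tokens, as quoted above.
    llama4_price_per_m = 0.11
    context_m_tokens = 10  # one full 10M-token context window, in millions of tokens

    print(f"Filling the 10M context once: ${llama4_price_per_m * context_m_tokens:.2f}")

    # Placeholder frontier-model rate; substitute the real per-token price
    # of whatever model you'd otherwise use for the same job.
    frontier_price_per_m = 3.00
    print(f"Same tokens at ${frontier_price_per_m:.2f}/1M: ${frontier_price_per_m * context_m_tokens:.2f}")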
It's possible that we'll see smaller Llama 4-based models in the future, though. Similar to Llama 3.2 1B, which was released later than other Llama 3.x models.
Yeah, I too am looking forward to their small text only models at 3B and 1B.
> Llama 4 is too big for mere mortals to run without many caveats
AMD MI300x has day zero support to run it using vLLM. Easy enough to rent them for decent pricing.
I got an error when passing a prompt with about 20k tokens to the Llama 4 Scout model on groq (despite Llama 4 supporting up to 10M token context). groq responds with a POST https://api.groq.com/openai/v1/chat/completions 413 (Payload Too Large) error.
Is there some technical limitation on the context window size with LPUs or is this a temporary stop-gap measure to avoid overloading groq's resources? Or something else?
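For anyone wanting to reproduce, a minimal call through the OpenAI-compatible endpoint might look something like this (model id taken from the console listing; the API key is a placeholder):

    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.groq.com/openai/v1",
        api_key="YOUR_GROQ_API_KEY",  # placeholder
    )

    resp = client.chat.completions.create(
        model="meta-llama/llama-4-scout-17b-16e-instruct",
        # A prompt of roughly 20k tokens here is what triggers the HTTP 413.
        messages=[{"role": "user", "content": "<long prompt>"}],
    )
    print(resp.choices[0].message.content)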
FYI, the last sentence, "Start building today on GroqCloud – sign up for free access here…" links to https://conosle.groq.com/ (instead of "console")
Fixed. Thanks for the report.
Just tried this thank you. Couple qs - looked like just scout access for now, do you have plans for larger model access? Also, seems like context length is always fairly short with you guys, is that architectural or cost-based decisions?
amazing! and yes, we'll have maverick available today. the reason we limit ctx window is because demand > capacity. we're pretty busy with building out more capacity so we can get to a state where we give everyone access to larger context windows without melting our currently available lpus, haha.
Cool. I would so happily pay you guys for a long-context API that aider could point at -- the speed is just game changing. I know your arch is different, so I understand it's an engineering lift. But I bet you'd find some Pareto-optimal point on the curve where you could charge a lot more for the speed you guys can deliver, if the context is long enough for coding.
Seems to be about 500 tk/s. That's actually significantly less than I expected / hoped for, but fantastic compared to nearly anything else. (specdec when?)
Out of curiosity, the console lets me set max output tokens to 131k, but it errors above 8192. What's the max intended to be? (8192 max output tokens would be rough after getting spoiled by the 128K output of Claude 3.7 Sonnet and the 64K of the Gemini models.)
do you happen to be trying this out on free tier right now? because our rate limits are at 6k tokens per minute on free tier for this model, which might be what you're running into.
When I tried Llama 4 Scout and set the max output tokens above 8192, it told me the max was 8192. Once I set it below that, it worked. This was in the console.
Would it be realistic to buy and self-host the hardware to run, for example, the latest Llama 4 models, assuming a budget of less than $500,000?
Yes - I'm able to run Llama 3.1 405B on 3x A6000 + 3x 4090.
Will have Llama 4 Maverick running in 4-bit quantization (which typically results in only minor quality degradation) once llama.cpp support is merged.
Total hardware cost well under $50,000.
The 2T Behemoth model is tougher, but enough Blackwell 6000 Pro cards (16) should be able to run it for under $200k.
Llama Scout is a 17B x 16 MoE, so that's 17B active parameters, which makes it faster to run. But the memory requirements are still large. They claim it fits on an H100, so under 80GB. A Mac Studio at 96GB could run this - and by run I mean inference; Ollama is easy to use for that. 4x 3090 Nvidia cards would also work, but that's not the easiest PC build. The tinybox https://tinygrad.org/#tinybox is $15k and you can do LoRA fine-tuning on it. You could also do a regular PC with 128GB of RAM, but it would be quite slow.
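Rough weight-memory math behind those numbers (assuming roughly 109B total parameters for Scout, which is Meta's stated figure, and ignoring KV cache and runtime overhead):

    total_params = 109e9  # approx. total parameters for Scout (17B of them active per token)

    for name, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
        gb = total_params * bytes_per_param / 1e9
        print(f"{name}: ~{gb:.0f} GB of weights")

    # int4 lands around ~55 GB, which is why it fits under an 80 GB H100 and
    # (tightly) on a 96 GB Mac Studio once overhead is added back in.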
A box of AMD MI300x (1.5TB of memory) is much less than $500k and AMD made sure to have day zero support with vLLM.
That said, I'm obviously biased but you're probably better off renting it.
You can do it with regular gpus for less
I'm glad I saw this because llama-3.3-70b-versatile just stopped working in my app. I switched it to meta-llama/llama-4-scout-17b-16e-instruct and it started working again. Maybe groq stopped supporting the old one?
All I get is {"error":{"message":"Not Found"}}
can you reach out to us via live chat on console.groq.com with your organization id?