jtrn 2 days ago

The results without the fluff:

Model Architecture
* Type: Mixture-of-Experts (MoE) transformer model.
* Total Parameters: 1 trillion.
* Activated Parameters: 32 billion.
* Experts: 384 total experts, with 8 activated per token.
* Attention Heads: 64.
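
If it helps to picture the routing, here is a rough sketch of 8-of-384 top-k expert selection; the hidden size, names, and router here are illustrative, not taken from the report or Moonshot's code.

  # Illustrative top-k MoE routing: 384 experts, 8 activated per token.
  # d_model and all names are hypothetical, for illustration only.
  import torch

  n_experts, top_k, d_model = 384, 8, 1024

  def route(hidden, router_weight):
      # hidden: [n_tokens, d_model]; router_weight: [d_model, n_experts]
      logits = hidden @ router_weight                   # [n_tokens, n_experts]
      gate = torch.softmax(logits, dim=-1)
      top_w, top_idx = gate.topk(top_k, dim=-1)         # choose 8 experts per token
      top_w = top_w / top_w.sum(dim=-1, keepdim=True)   # renormalize their gate weights
      return top_w, top_idx

  weights, experts = route(torch.randn(4, d_model), torch.randn(d_model, n_experts))
  print(experts.shape)  # torch.Size([4, 8]) -> 8 active experts per token

Only the selected experts' feed-forward blocks run for a given token, which is how a 1-trillion-parameter model keeps activated parameters at 32 billion.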

Pre-training
* Optimizer: A novel optimizer named MuonClip, which integrates the Muon optimizer with a QK-Clip mechanism to address training instability.
* Dataset: The model was pre-trained on 15.5 trillion tokens.
* Training Process: Kimi K2 was trained with zero loss spikes. The initial context window was 4,096 tokens, later extended to 128K tokens using the YaRN method.
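
My rough reading of the QK-Clip part: after an optimizer step, any attention head whose max pre-softmax logit exceeds a threshold gets its query/key projections scaled back down. A minimal sketch, with the threshold and names made up:

  # Hedged sketch of the QK-Clip idea (rescale the Q/K projections of heads whose
  # max attention logit exceeded a cap). tau and all names are illustrative.
  import torch

  tau = 100.0  # hypothetical logit cap

  def qk_clip_(w_q, w_k, max_logit):
      # w_q, w_k: one head's projection weights; max_logit: largest pre-softmax
      # attention score observed for that head during the last step
      if max_logit > tau:
          scale = (tau / max_logit) ** 0.5  # split the correction between Q and K
          w_q.mul_(scale)
          w_k.mul_(scale)

  w_q, w_k = torch.randn(128, 1024), torch.randn(128, 1024)
  qk_clip_(w_q, w_k, max_logit=250.0)  # this head's logits now shrink by tau/250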

Post-training
* The model underwent a multi-stage process featuring a large-scale agentic data synthesis pipeline and a joint reinforcement learning (RL) stage.
* The RL framework combines verifiable rewards with a self-critique rubric reward mechanism.
* The data synthesis pipeline generated tens of thousands of tool-use training examples.
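
I don't know how they actually aggregate the two reward signals, but one plausible blend of a verifiable reward (e.g. unit tests pass) with a rubric-based self-critique score could look like this; the weighting and the [0, 1] rubric scale are assumptions, not from the report.

  # Hypothetical blend of a verifiable reward and a self-critique rubric score.
  # The 0.3 weight is made up for illustration.
  def combined_reward(tests_passed: bool, rubric_score: float, w_rubric: float = 0.3) -> float:
      verifiable = 1.0 if tests_passed else 0.0
      return (1 - w_rubric) * verifiable + w_rubric * rubric_score

  print(combined_reward(True, 0.8))   # 0.94
  print(combined_reward(False, 0.8))  # 0.24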

Performance Benchmarks (non-thinking mode)
* SWE-bench Verified: 65.8%
* SWE-bench Multilingual: 47.3%
* LiveCodeBench v6: 53.7%
* OJBench: 27.1%
* Tau2-Bench micro-average: 66.1
* ACEBench (en): 76.5
* AIME 2025: 49.5
* GPQA-Diamond: 75.1
* LMSYS Arena Leaderboard (July 17, 2025): Ranked 1st among open-source models and 5th overall.

swyx 2 days ago

(hi i'm OP) kimi k2 was released a while ago with headlines like muonclip already discussed*, but the tech report itself is new, so I submitted it here. their own highlights are here: https://x.com/Kimi_Moonshot/status/1947520758760313170

we just covered it today on the latent.space paper club if you want to listen along while reading this paper https://youtu.be/VHwZa7lZhK8

definitely see also sebastian raschka's writeup https://t.co/oEt8XzNxik

*background on muon and muonclip https://www.youtube.com/watch?v=fcTNQLebHb0

chisleu 2 days ago

It looks like qwen3-coder is going to steal K2's thunder in terms of agentic coding use.

  • jadbox 2 days ago

    Maybe so, but I like the sound of K2's writing more than qwen3's (so far in my testing).

OutOfHere 2 days ago

It has a small context length of just 128K.
