Kazuki Fujii
@okoge_kaz
Tokyo Tech CS Master (Rio Yokota Lab → Jun Sakuma Lab) Distributed Training, Systems for Machine Learning, Low Precision Training
Thrilled to see our SwallowProject paper cited in KIMI K2's Technical Report (2.2 Pre-training Data)! 🙏 Thank you for recognizing our work! @Kimi_Moonshot

🚀 Excited to share that the Workshop on Mathematical Reasoning and AI (MATH‑AI) will be at NeurIPS 2025! 📅 Dec 6 or 7 (TBD), 2025 🌴 San Diego, California
🚀 Introducing Qwen3-MT – our most powerful translation model yet! Trained on trillions of multilingual tokens, it supports 92+ languages—covering 95%+ of the world’s population. 🌍✨ 🔑 Why Qwen3-MT? ✅ Top-tier translation quality ✅ Customizable: terminology control, domain…
The strongest!!!!!!!! SoftBank has built the world's largest AI computing infrastructure as an "NVIDIA DGX SuperPOD" equipped with NVIDIA Blackwell GPUs — deployment of over 4,000 NVIDIA Blackwell GPUs is complete | Corporate & IR | SoftBank share.google/xfkRuQnEKKYojM…
Nvidia presents ThinkAct Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
After three intense months of hard work with the team, we made it! We hope this release can help drive the progress of Coding Agents. Looking forward to seeing Qwen3-Coder continue creating new possibilities across the digital world!
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
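The "256K native, 1M with extrapolation" claim relies on some form of long-context extension. As a toy illustration (my sketch — the tweet does not specify Qwen's actual method), position interpolation stretches a RoPE-based model's window by rescaling the rotary frequencies:

```python
def rope_inv_freq(dim, base=10000.0, scale=1.0):
    """Inverse frequencies for rotary position embeddings (RoPE).

    Dividing the frequencies by `scale` is equivalent to position
    interpolation: positions are squeezed so a model trained on one
    window length can address a window `scale` times longer.
    """
    return [1.0 / (base ** (2 * i / dim)) / scale for i in range(dim // 2)]

# Stretching a 256K-token window to 1M tokens needs scale = 4.0.
native, target = 256 * 1024, 1024 * 1024
scale = target / native  # 4.0
freqs = rope_inv_freq(64, scale=scale)
```

This is only the frequency bookkeeping; production implementations (e.g. YaRN-style scaling) also adjust per-frequency-band interpolation and attention temperature.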
This is not a small one! The team spent a lot of time building Qwen3-Coder after Qwen2.5-Coder. It is much bigger, but based on MoE, and way stronger and smarter than before! Not sure we can say it's competitive with Claude Sonnet 4, but it should for sure be a really good coding agent.…
In the Swallow Project, we are working on building math and code datasets of even higher quality than SwallowCode and SwallowMath. Beyond strengthening Japanese-language ability, we will keep researching methods to make open models even stronger at math and code!
Diffusion Beats Autoregressive in Data-Constrained Settings Comparison of diffusion and autoregressive language models from 7M to 2.5B params and up to 80B training tokens. Key findings: 1. Diffusion models surpass autoregressive models given sufficient compute. Across a wide…
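One intuition behind the data-constrained result above: under data repetition, masked-diffusion training re-masks each sequence every epoch, while autoregressive training repeats identical next-token targets. A toy illustration (my sketch, not the paper's code):

```python
import random

def ar_views(tokens, epochs):
    # An autoregressive pass over a sequence always yields the same
    # next-token targets, so repeated epochs add no new training views.
    return {tuple(tokens) for _ in range(epochs)}

def diffusion_views(tokens, epochs, mask_rate=0.5, seed=0):
    # Masked-diffusion training samples a fresh mask each epoch,
    # so the same data keeps producing new prediction problems.
    rng = random.Random(seed)
    return {
        tuple("<mask>" if rng.random() < mask_rate else t for t in tokens)
        for _ in range(epochs)
    }
```

The paper's actual finding concerns compute/loss scaling; this only illustrates why repeated data can stay informative for a diffusion objective when it no longer does for an autoregressive one.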
We were selected for the third cycle of GENIAC. This time, our goal is to build a lightweight, high-accuracy VLM for autonomously operating devices (surveillance cameras, robots, drones, and so on). In parallel with this third GENIAC cycle, PFN will continue developing new LLMs, improving existing models, and building specialized models.
PFN's "Development of a high-accuracy, lightweight VLM for autonomously operating devices" has been selected for GENIAC, a project by METI and NEDO to strengthen generative AI development capabilities. VLM: vision-language model — an AI model that handles both visual and textual information. nedo.go.jp/koubo/CD3_1003…
Bye Qwen3-235B-A22B, hello Qwen3-235B-A22B-2507! After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we’ll train Instruct and Thinking models separately so we can get the best quality possible. Today, we’re releasing…
Kimi K2 tech report just dropped! Quick hits: - MuonClip optimizer: stable + token-efficient pretraining at trillion-parameter scale - 20K+ tools, real & simulated: unlocking scalable agentic data - Joint RL with verifiable + self-critique rubric rewards: alignment that adapts -…
A paper we recently published from the Swallow Project was mentioned in the technical report of the much-discussed KIMI K2! As an open model development project, the Swallow Project aims for both research novelty and models people can actually use, and we will keep pushing our R&D forward! (The next model is also in development.)
TikTok Researchers Introduce SWE-Perf: The First Benchmark for Repository-Level Code Performance Optimization SWE-Perf, introduced by TikTok researchers, is the first benchmark designed to evaluate large language models (LLMs) on repository-level code performance optimization.…
Note that this is a non-thinking model. Thinking model on the way!
A small update on Qwen3-235B-A22B, but a big improvement on its quality! We thought about this decision for a long time, but we believe that providing better-quality performance is more important than the unification at this moment. We are still continuing our research on hybrid…
We've just released 100+ intermediate checkpoints and our training logs from the SmolLM3-3B training run. We hope these can be useful to researchers working on mech interp, training dynamics, RL, and other topics :) Training logs: -> Usual training loss (the gaps in the loss are due…
CUTLASS 4.1 is now available, which adds support for ARM systems (GB200) and block scaled MMAs
🚨🔥 CUTLASS 4.0 is released 🔥🚨 pip install nvidia-cutlass-dsl 4.0 marks a major shift for CUTLASS: towards native GPU programming in Python docs.nvidia.com/cutlass/media/…