<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Llm on Svtter's Blog</title><link>https://svtter.cn/en/tags/llm/</link><description>Recent content in Llm on Svtter's Blog</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Mon, 01 Jun 2026 12:00:00 +0800</lastBuildDate><atom:link href="https://svtter.cn/en/tags/llm/index.xml" rel="self" type="application/rss+xml"/><item><title>OMP M3 Model Patch: Adding MiniMax M3 to pi-ai</title><link>https://svtter.cn/en/p/omp-m3-model-patch-adding-minimax-m3-to-pi-ai/</link><pubDate>Mon, 01 Jun 2026 12:00:00 +0800</pubDate><guid>https://svtter.cn/en/p/omp-m3-model-patch-adding-minimax-m3-to-pi-ai/</guid><description>&lt;img src="https://svtter.cn/p/omp-m3-%E6%A8%A1%E5%9E%8B%E8%A1%A5%E4%B8%81%E4%B8%BA-pi-ai-%E6%B7%BB%E5%8A%A0-minimax-m3-%E6%94%AF%E6%8C%81/cover.png" alt="Featured image of post OMP M3 Model Patch: Adding MiniMax M3 to pi-ai" /&gt;&lt;p&gt;MiniMax released M3 on 2026-06-01 (&lt;code&gt;minimax/minimax-m3-20260531&lt;/code&gt; on OpenRouter), but the upstream &lt;code&gt;models.json&lt;/code&gt; shipped by &lt;code&gt;@oh-my-pi/pi-ai@15.7.3&lt;/code&gt; hadn&amp;rsquo;t been updated to include it. This post documents the patch I applied to add M3 support across all five provider endpoints.&lt;/p&gt;
&lt;h2 id="target-file"&gt;Target File
&lt;/h2&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;~/.bun/install/global/node_modules/@oh-my-pi/pi-ai/src/models.json
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id="provider-entries-added-5"&gt;Provider Entries Added (5)
&lt;/h2&gt;&lt;p&gt;All entries are appended at the end of their respective provider object, mirroring the structure of the existing &lt;code&gt;MiniMax-M2.7&lt;/code&gt; entry.&lt;/p&gt;
&lt;h3 id="1-minimax-official-anthropic-compatible-api"&gt;1. &lt;code&gt;minimax&lt;/code&gt; (Official Anthropic-compatible API)
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;key&lt;/strong&gt;: &lt;code&gt;MiniMax-M3&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;api&lt;/strong&gt;: &lt;code&gt;anthropic-messages&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;baseUrl&lt;/strong&gt;: &lt;code&gt;https://api.minimax.io/anthropic&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;contextWindow&lt;/strong&gt;: 204800, &lt;strong&gt;maxTokens&lt;/strong&gt;: 131072&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;cost&lt;/strong&gt;: input 0.3, output 1.2, cacheRead 0.06, cacheWrite 0.375&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;thinking&lt;/strong&gt;: budget mode, minimal..xhigh&lt;/li&gt;
&lt;/ul&gt;
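&lt;p&gt;For orientation, here is a minimal TypeScript sketch of what such an entry carries. The &lt;code&gt;ModelEntry&lt;/code&gt; and &lt;code&gt;ModelCost&lt;/code&gt; shapes and the &lt;code&gt;estimateCostUsd&lt;/code&gt; helper are hypothetical names for illustration, not the real &lt;code&gt;models.json&lt;/code&gt; schema, and the helper assumes the cost figures are USD per million tokens.&lt;/p&gt;

```typescript
// Hypothetical shapes for illustration; the real models.json schema in pi-ai may differ.
interface ModelCost {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

interface ModelEntry {
  id: string;
  api: string;
  baseUrl: string;
  contextWindow: number;
  maxTokens: number;
  cost: ModelCost;
}

// Values copied from the bullet list for the minimax provider.
const minimaxM3: ModelEntry = {
  id: "MiniMax-M3",
  api: "anthropic-messages",
  baseUrl: "https://api.minimax.io/anthropic",
  contextWindow: 204800,
  maxTokens: 131072,
  cost: { input: 0.3, output: 1.2, cacheRead: 0.06, cacheWrite: 0.375 },
};

// Assumption: cost figures are USD per million tokens; cache pricing is ignored here.
function estimateCostUsd(entry: ModelEntry, inputTokens: number, outputTokens: number): number {
  return (inputTokens * entry.cost.input + outputTokens * entry.cost.output) / 1e6;
}

console.log(estimateCostUsd(minimaxM3, 1000000, 1000000));
```

&lt;p&gt;Under that per-million assumption, one million input tokens plus one million output tokens come to about 1.5 USD.&lt;/p&gt;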
&lt;h3 id="2-minimax-cn-official-anthropic-compatible-api-china"&gt;2. &lt;code&gt;minimax-cn&lt;/code&gt; (Official Anthropic-compatible API, China)
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;key&lt;/strong&gt;: &lt;code&gt;MiniMax-M3&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;api&lt;/strong&gt;: &lt;code&gt;anthropic-messages&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;baseUrl&lt;/strong&gt;: &lt;code&gt;https://api.minimaxi.com/anthropic&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Same context/cost/thinking as &lt;code&gt;minimax&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="3-minimax-code-coding-plan-openai-compatible"&gt;3. &lt;code&gt;minimax-code&lt;/code&gt; (Coding Plan, OpenAI-compatible)
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;key&lt;/strong&gt;: &lt;code&gt;MiniMax-M3&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;api&lt;/strong&gt;: &lt;code&gt;openai-completions&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;baseUrl&lt;/strong&gt;: &lt;code&gt;https://api.minimax.io/v1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;cost&lt;/strong&gt;: all 0 (Coding Plan flat-rate)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;compat&lt;/strong&gt;: &lt;code&gt;supportsStore=false&lt;/code&gt;, &lt;code&gt;supportsDeveloperRole=false&lt;/code&gt;, &lt;code&gt;supportsReasoningEffort=false&lt;/code&gt;, &lt;code&gt;reasoningContentField=reasoning_content&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;thinking&lt;/strong&gt;: effort mode, minimal..high&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="4-minimax-code-cn-coding-plan-cn"&gt;4. &lt;code&gt;minimax-code-cn&lt;/code&gt; (Coding Plan CN)
&lt;/h3&gt;&lt;p&gt;Mirror of &lt;code&gt;minimax-code&lt;/code&gt; with &lt;code&gt;baseUrl: https://api.minimaxi.com/v1&lt;/code&gt; and provider &lt;code&gt;minimax-code-cn&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id="5-openrouter-openrouter-passthrough"&gt;5. &lt;code&gt;openrouter&lt;/code&gt; (OpenRouter Passthrough)
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;key&lt;/strong&gt;: &lt;code&gt;minimax/minimax-m3-20260531&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;api&lt;/strong&gt;: &lt;code&gt;openai-completions&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;baseUrl&lt;/strong&gt;: &lt;code&gt;https://openrouter.ai/api/v1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;cost&lt;/strong&gt;: input 0.3, output 1.2, cacheRead 0.05, cacheWrite 0&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;thinking&lt;/strong&gt;: effort mode, minimal..high&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="verification"&gt;Verification
&lt;/h2&gt;&lt;p&gt;Searching for &lt;code&gt;&amp;quot;MiniMax-M3|minimax-m3&amp;quot;&lt;/code&gt; in the patched file returns exactly 5 hits — one per provider block.&lt;/p&gt;
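&lt;p&gt;The same check can be sketched in TypeScript. The &lt;code&gt;countM3Hits&lt;/code&gt; helper and the &lt;code&gt;sample&lt;/code&gt; string are hypothetical stand-ins; in practice the text would be the contents of the patched &lt;code&gt;models.json&lt;/code&gt;.&lt;/p&gt;

```typescript
// Count occurrences of either spelling of the M3 identifier in a block of text.
function countM3Hits(text: string): number {
  const matches = text.match(/MiniMax-M3|minimax-m3/g);
  return matches === null ? 0 : matches.length;
}

// Hypothetical stand-in for the patched file: one key per provider block.
const sample = [
  '"minimax": { "MiniMax-M3": {} }',
  '"minimax-cn": { "MiniMax-M3": {} }',
  '"minimax-code": { "MiniMax-M3": {} }',
  '"minimax-code-cn": { "MiniMax-M3": {} }',
  '"openrouter": { "minimax/minimax-m3-20260531": {} }',
].join("\n");

console.log(countM3Hits(sample)); // prints 5
```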
&lt;h2 id="caveats"&gt;Caveats
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;omp update&lt;/code&gt; will overwrite the patch&lt;/strong&gt;. Re-apply after updates, or pin the package version.&lt;/li&gt;
&lt;li&gt;If upstream later ships an official M3 entry, our local copy may diverge (custom pricing/context) until the next update.&lt;/li&gt;
&lt;li&gt;Pricing values for M3 were inferred from the M2.7 template and the OpenRouter listing ($0.30 / $1.20). Confirm against the official MiniMax pricing page if cost accuracy matters.&lt;/li&gt;
&lt;li&gt;Context window (204800) and maxTokens (131072) mirror M2.7 — adjust if M3 differs at GA.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="addendum-2026-06-02-the-proper-route-via-omp-user-config"&gt;Addendum (2026-06-02): The proper route via OMP user config
&lt;/h2&gt;&lt;p&gt;The pi-ai patch above is a hack — any &lt;code&gt;omp update&lt;/code&gt; re-pulls the package and the patch is gone.
The proper OMP way is &lt;code&gt;~/.omp/agent/models.yml&lt;/code&gt;: a user-level file that OMP merges on top
of the built-in catalog, with no bun-global dependency, and which &lt;code&gt;omp update&lt;/code&gt; leaves alone.&lt;/p&gt;
&lt;h3 id="final-config"&gt;Final config
&lt;/h3&gt;&lt;p&gt;Append to &lt;code&gt;~/.omp/agent/models.yml&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;span class="lnt"&gt;18
&lt;/span&gt;&lt;span class="lnt"&gt;19
&lt;/span&gt;&lt;span class="lnt"&gt;20
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c"&gt;# MiniMax M3 Code Plan&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="c"&gt;# Set MINIMAX_API_KEY in ~/.zshenv first&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;minimax&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;https://api.minimaxi.com/anthropic&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;MINIMAX_API_KEY&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;api&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;anthropic-messages&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;authHeader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;disableStrictTools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;      &lt;/span&gt;- &lt;span class="nt"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;MiniMax-M3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;MiniMax M3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="l"&gt;text, image]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;contextWindow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1000000&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;16384&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;cacheRead&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;cacheWrite&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;code&gt;apiKey: MINIMAX_API_KEY&lt;/code&gt; follows OMP&amp;rsquo;s resolution rule: try the value as an env-var name
first, then fall back to a literal. I &lt;code&gt;export MINIMAX_API_KEY=$MINIMAX_CODE_PLAN_KEY&lt;/code&gt; in
&lt;code&gt;~/.zshenv&lt;/code&gt;, so the key is sourced at runtime and the dotfile stays clean for git.&lt;/p&gt;
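&lt;p&gt;That resolution rule can be sketched as follows. &lt;code&gt;resolveApiKey&lt;/code&gt; and the &lt;code&gt;Env&lt;/code&gt; type are hypothetical names, not the actual OMP resolver, and treating an empty variable as unset is an assumption of this sketch.&lt;/p&gt;

```typescript
// Hypothetical helper; the real OMP resolver is not shown in this post.
type Env = { [name: string]: string | undefined };

function resolveApiKey(value: string, env: Env): string {
  // First interpret the configured value as the name of an environment variable.
  const fromEnv = env[value];
  // Assumption: an unset or empty variable means the value is a literal key.
  return fromEnv ? fromEnv : value;
}

const env: Env = { MINIMAX_API_KEY: "sk-from-env" };
console.log(resolveApiKey("MINIMAX_API_KEY", env)); // prints sk-from-env
console.log(resolveApiKey("sk-literal-key", env)); // prints sk-literal-key
```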
&lt;h3 id="key-choices"&gt;Key choices
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;anthropic-messages&lt;/code&gt;, not &lt;code&gt;openai-completions&lt;/code&gt;:&lt;/strong&gt;
M3 speaks both protocols. The &lt;code&gt;openai-completions&lt;/code&gt; route had friction that &lt;code&gt;anthropic-messages&lt;/code&gt; avoids:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OMP&amp;rsquo;s &lt;code&gt;openai-completions&lt;/code&gt; transport emits the &lt;code&gt;developer&lt;/code&gt; role and &lt;code&gt;reasoning_effort&lt;/code&gt; for reasoning models. MiniMax&amp;rsquo;s schema check is stricter than OpenAI&amp;rsquo;s, and an empty &lt;code&gt;reasoning&lt;/code&gt; field occasionally triggers a 400 error.&lt;/li&gt;
&lt;li&gt;With &lt;code&gt;anthropic-messages&lt;/code&gt;, tool calls and streaming reasoning go through the Anthropic SDK normalization path, the same one kimi and claude use.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;disableStrictTools: true&lt;/code&gt;:&lt;/strong&gt;
The Anthropic SDK sends &lt;code&gt;strict: true&lt;/code&gt; on every tool definition by default. Third-party
Anthropic-fronted gateways (MiniMax, kimi, etc.) usually don&amp;rsquo;t recognize the field and 400.
The kimi provider in the same file already sets this flag. The trade-off is that tool
schemas are not server-side validated, so prompts have to carry the schema discipline.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context 1M / maxTokens 16K:&lt;/strong&gt;
&lt;code&gt;contextWindow: 1000000&lt;/code&gt; matches OpenRouter&amp;rsquo;s spec for &lt;code&gt;minimax/minimax-m3-20260531&lt;/code&gt;
(M2.7 was 204800, so M3 is roughly 4.9× that). &lt;code&gt;maxTokens: 16384&lt;/code&gt; carries over from M2.7, since I couldn&amp;rsquo;t
find an official M3 number. &lt;code&gt;cost&lt;/code&gt; is all zero because the Coding Plan is flat-rate.&lt;/p&gt;
&lt;h3 id="switching-to-it"&gt;Switching to it
&lt;/h3&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# At launch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;omp --model minimax/MiniMax-M3
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Or in the TUI&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;/model minimax/MiniMax-M3
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After the switch, &lt;code&gt;/status&lt;/code&gt; should show &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; pointing at &lt;code&gt;api.minimaxi.com/anthropic&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id="how-the-two-routes-compose"&gt;How the two routes compose
&lt;/h3&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;pi-ai bundled &lt;code&gt;models.json&lt;/code&gt; patch&lt;/th&gt;
&lt;th&gt;&lt;code&gt;models.yml&lt;/code&gt; custom provider&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Persistence&lt;/td&gt;
&lt;td&gt;&lt;code&gt;omp update&lt;/code&gt; wipes it&lt;/td&gt;
&lt;td&gt;Persistent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-machine sync&lt;/td&gt;
&lt;td&gt;No (bun-global path)&lt;/td&gt;
&lt;td&gt;Yes (dotfile in git)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Upgrade cost&lt;/td&gt;
&lt;td&gt;Re-apply patch&lt;/td&gt;
&lt;td&gt;OMP merges automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Merge with built-in&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes, last-write-wins&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
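&lt;p&gt;The last-write-wins row can be illustrated with a small sketch. The &lt;code&gt;Catalog&lt;/code&gt; type and &lt;code&gt;mergeCatalogs&lt;/code&gt; function are hypothetical, and the sketch assumes the custom layer is applied after the built-in one and replaces whole entries. How OMP merges internally may differ.&lt;/p&gt;

```typescript
// Hypothetical catalog keyed by "provider/model"; the real OMP types may differ.
interface CatalogEntry {
  baseUrl: string;
  source: "built-in" | "custom";
}
type Catalog = { [key: string]: CatalogEntry };

// Assumption: the custom layer is written last, so it wins on key collisions.
function mergeCatalogs(builtIn: Catalog, custom: Catalog): Catalog {
  return { ...builtIn, ...custom };
}

const builtIn: Catalog = {
  "minimax/MiniMax-M3": { baseUrl: "https://api.minimax.io/anthropic", source: "built-in" },
};
const custom: Catalog = {
  "minimax/MiniMax-M3": { baseUrl: "https://api.minimaxi.com/anthropic", source: "custom" },
};

const merged = mergeCatalogs(builtIn, custom);
console.log(merged["minimax/MiniMax-M3"].baseUrl);
```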
&lt;p&gt;The two compose. &lt;code&gt;models.yml&lt;/code&gt; providers enter through OMP&amp;rsquo;s &amp;ldquo;custom&amp;rdquo; channel; whatever
pi-ai later ships in its bundled list (if M3 lands upstream) enters through the &amp;ldquo;built-in&amp;rdquo;
channel. When both define the same &lt;code&gt;provider/model&lt;/code&gt; with different &lt;code&gt;baseUrl&lt;/code&gt;, OMP&amp;rsquo;s
last-write-wins rule means &lt;code&gt;models.yml&lt;/code&gt; always wins — which is exactly what you want
for a CN endpoint override.&lt;/p&gt;</description></item><item><title>How kimi-code Handles kimi-k2.6: A Comparison with OpenCode</title><link>https://svtter.cn/en/p/how-kimi-code-handles-kimi-k2.6-a-comparison-with-opencode/</link><pubDate>Wed, 27 May 2026 10:30:00 +0800</pubDate><guid>https://svtter.cn/en/p/how-kimi-code-handles-kimi-k2.6-a-comparison-with-opencode/</guid><description>&lt;img src="https://svtter.cn/p/kimi-code-%E5%AF%B9-kimi-k2.6-%E7%9A%84%E4%B8%93%E7%94%A8%E5%A4%84%E7%90%86%E4%B8%8E-opencode-%E7%9A%84%E5%AF%B9%E6%AF%94/featured-image.png" alt="Featured image of post How kimi-code Handles kimi-k2.6: A Comparison with OpenCode" /&gt;&lt;p&gt;Recently, kimi-code migrated from Python to TypeScript. Here&amp;rsquo;s a quick analysis.&lt;/p&gt;
&lt;p&gt;Based on my review of the &lt;strong&gt;kimi-code&lt;/strong&gt; source code (particularly &lt;code&gt;packages/kosong/src/providers/kimi.ts&lt;/code&gt;, &lt;code&gt;kimi-schema.ts&lt;/code&gt;, &lt;code&gt;kimi-files.ts&lt;/code&gt;, etc.) and relevant OpenCode compatibility issues, here are the kimi-k2.6-specific optimizations in kimi-code and how they differ from OpenCode.&lt;/p&gt;
&lt;h2 id="1-native-kimi-provider-not-a-generic-openai-compatible-layer"&gt;1. Native Kimi Provider (Not a Generic OpenAI-compatible Layer)
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;kimi-code&lt;/strong&gt; does not treat Kimi as &amp;ldquo;just another OpenAI-compatible endpoint.&amp;rdquo; Instead, it implements a dedicated &lt;code&gt;kimi&lt;/code&gt; provider type:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;kimi-code&lt;/th&gt;
&lt;th&gt;OpenCode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Provider Type&lt;/td&gt;
&lt;td&gt;Dedicated &lt;code&gt;'kimi'&lt;/code&gt; type with independent adapter&lt;/td&gt;
&lt;td&gt;Accessed via generic OpenAI/Anthropic bridge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Proprietary Fields&lt;/td&gt;
&lt;td&gt;Native handling of &lt;code&gt;reasoning_content&lt;/code&gt;, &lt;code&gt;thinking&lt;/code&gt;, &lt;code&gt;generationKwargs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;reasoning_content&lt;/code&gt; often lost in the bridge layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth Headers&lt;/td&gt;
&lt;td&gt;Supports &lt;code&gt;kimiRequestHeaders&lt;/code&gt;, &lt;code&gt;X-Msh-Tool-Call-Id&lt;/code&gt;, and other Moonshot-specific headers&lt;/td&gt;
&lt;td&gt;Generic header forwarding&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="2-full-lifecycle-handling-of-reasoning_content"&gt;2. Full Lifecycle Handling of &lt;code&gt;reasoning_content&lt;/code&gt;
&lt;/h2&gt;&lt;p&gt;kimi-k2.6 has thinking enabled by default and &lt;strong&gt;requires &lt;code&gt;reasoning_content&lt;/code&gt; to be preserved across multi-turn conversation history&lt;/strong&gt;. Otherwise, tool calls will result in a 400 error.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How kimi-code handles it:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;convertMessage&lt;/code&gt;&lt;/strong&gt;: Extracts internal &lt;code&gt;think&lt;/code&gt; content parts and serializes them into the &lt;code&gt;reasoning_content&lt;/code&gt; field, ensuring thinking content is never lost in message history&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streaming Parser&lt;/strong&gt;: Explicitly extracts &lt;code&gt;delta.reasoning_content&lt;/code&gt; / &lt;code&gt;message.reasoning_content&lt;/code&gt; in both &lt;code&gt;_convertStreamResponse&lt;/code&gt; and &lt;code&gt;_convertNonStreamResponse&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TUI Rendering&lt;/strong&gt;: A dedicated &lt;code&gt;ThinkingComponent&lt;/code&gt; renders thinking content in real time, with expand/collapse support and a spinner animation&lt;/li&gt;
&lt;/ul&gt;
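&lt;p&gt;The serialization step can be sketched roughly like this. The message shapes and the &lt;code&gt;toWireMessage&lt;/code&gt; name are hypothetical simplifications for illustration, not the types kimi-code actually uses.&lt;/p&gt;

```typescript
// Hypothetical, simplified message shapes; the real kimi-code types may differ.
interface ContentPart {
  type: "text" | "think";
  text: string;
}
interface InternalMessage {
  role: "assistant";
  content: ContentPart[];
}
interface WireMessage {
  role: "assistant";
  content: string;
  reasoning_content?: string;
}

// Fold internal think parts into reasoning_content so they survive in history.
function toWireMessage(message: InternalMessage): WireMessage {
  const thinking = message.content.filter((p) => p.type === "think").map((p) => p.text).join("");
  const text = message.content.filter((p) => p.type === "text").map((p) => p.text).join("");
  const wire: WireMessage = { role: "assistant", content: text };
  if (thinking !== "") {
    wire.reasoning_content = thinking;
  }
  return wire;
}

const wireExample = toWireMessage({
  role: "assistant",
  content: [
    { type: "think", text: "The user wants a file listing." },
    { type: "text", text: "Listing files now." },
  ],
});
console.log(wireExample);
```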
&lt;p&gt;&lt;strong&gt;OpenCode&amp;rsquo;s Problem:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The OpenCode Go bridge &lt;strong&gt;drops &lt;code&gt;reasoning_content&lt;/code&gt;&lt;/strong&gt; on the second turn, causing the Moonshot API to return:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;thinking is enabled but reasoning_content is missing in assistant tool call message
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id="3-json-schema-normalization-kimi-schemats"&gt;3. JSON Schema Normalization (&lt;code&gt;kimi-schema.ts&lt;/code&gt;)
&lt;/h2&gt;&lt;p&gt;Moonshot&amp;rsquo;s tool parameter validator has strict and unique requirements for JSON Schema. This is one of the primary sources of incompatibility between OpenCode and kimi-k2.6.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What kimi-code&amp;rsquo;s &lt;code&gt;normalizeKimiToolSchema&lt;/code&gt; does:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dereferences &lt;code&gt;$ref&lt;/code&gt;&lt;/strong&gt;: Inlines definitions from &lt;code&gt;$defs&lt;/code&gt; / &lt;code&gt;definitions&lt;/code&gt;, eliminating external references&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fills in missing &lt;code&gt;type&lt;/code&gt;&lt;/strong&gt;: The Kimi validator rejects nested property schemas that omit &lt;code&gt;type&lt;/code&gt; (e.g., MCP-generated enum-only schemas). kimi-code infers and backfills &lt;code&gt;type: string/object/array&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Circular reference detection&lt;/strong&gt;: Preserves the original &lt;code&gt;$ref&lt;/code&gt; when a circular reference is detected, avoiding infinite recursion&lt;/li&gt;
&lt;/ul&gt;
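&lt;p&gt;A simplified sketch of these three steps follows. It is not the code of &lt;code&gt;normalizeKimiToolSchema&lt;/code&gt;: the &lt;code&gt;normalizeSchema&lt;/code&gt; and &lt;code&gt;isSchemaObject&lt;/code&gt; names are hypothetical, only &lt;code&gt;properties&lt;/code&gt; and &lt;code&gt;items&lt;/code&gt; are traversed, and the type inference rules are assumptions chosen for illustration.&lt;/p&gt;

```typescript
// Simplified sketch of the three steps; NOT the actual kimi-code implementation.
type Schema = { [key: string]: any };

function isSchemaObject(value: unknown): boolean {
  return !(value === null || typeof value !== "object" || Array.isArray(value));
}

function normalizeSchema(root: Schema): Schema {
  // Collect definitions from both conventional locations.
  const defs: Schema = { ...(root.definitions || {}), ...(root.$defs || {}) };

  function visit(node: Schema, active: string[]): Schema {
    if (typeof node.$ref === "string") {
      const name = String(node.$ref.split("/").pop());
      // Circular or unknown reference: keep the original $ref instead of recursing.
      if (active.indexOf(name) !== -1 || !isSchemaObject(defs[name])) return node;
      return visit(defs[name], [...active, name]);
    }

    // The definitions block on the root is left in place so a preserved $ref still resolves.
    const out: Schema = { ...node };
    if (isSchemaObject(node.properties)) {
      out.properties = {};
      for (const key of Object.keys(node.properties)) {
        out.properties[key] = visit(node.properties[key], active);
      }
    }
    if (isSchemaObject(node.items)) out.items = visit(node.items, active);

    // Backfill a missing type from whatever structural hints are present.
    if (out.type === undefined) {
      if (out.properties !== undefined) out.type = "object";
      else if (out.items !== undefined) out.type = "array";
      else if (Array.isArray(out.enum)) {
        if (out.enum.every((v: unknown) => typeof v === "string")) out.type = "string";
      }
    }
    return out;
  }

  return visit(root, []);
}

const normalized = normalizeSchema({
  type: "object",
  properties: {
    color: { enum: ["red", "green"] },
    node: { $ref: "#/$defs/Node" },
  },
  $defs: { Node: { properties: { next: { $ref: "#/$defs/Node" } } } },
});
console.log(JSON.stringify(normalized.properties, null, 2));
```

&lt;p&gt;In this example the enum-only &lt;code&gt;color&lt;/code&gt; property gains &lt;code&gt;type: string&lt;/code&gt;, the &lt;code&gt;node&lt;/code&gt; reference is inlined once, and the self-reference inside it is left as a &lt;code&gt;$ref&lt;/code&gt;.&lt;/p&gt;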
&lt;p&gt;&lt;strong&gt;OpenCode&amp;rsquo;s Problem:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenCode&amp;rsquo;s generated schemas use &lt;code&gt;#/definitions/&lt;/code&gt; rather than the &lt;code&gt;#/$defs/&lt;/code&gt; format Moonshot requires, and nothing infers or backfills missing &lt;code&gt;type&lt;/code&gt; fields for Kimi, so complex tool calls fail with a 400 error.&lt;/p&gt;
&lt;h2 id="4-native-thinking-mode-configuration-system"&gt;4. Native Thinking Mode Configuration System
&lt;/h2&gt;&lt;p&gt;kimi-code has built-in support for Kimi&amp;rsquo;s thinking mode from the configuration layer all the way to the UI:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Config Parsing&lt;/strong&gt;: &lt;code&gt;ThinkingConfigSchema&lt;/code&gt; supports &lt;code&gt;mode: auto/on/off&lt;/code&gt; and &lt;code&gt;effort: low/medium/high/xhigh/max&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Model Capability Tags&lt;/strong&gt;: &lt;code&gt;ModelAlias&lt;/code&gt; supports &lt;code&gt;capabilities: ['thinking', 'always_thinking']&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Model Selector UI&lt;/strong&gt;: Press &lt;code&gt;←→&lt;/code&gt; to toggle thinking on/off; &lt;code&gt;always-on&lt;/code&gt; models cannot be turned off&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Provider Method&lt;/strong&gt;: &lt;code&gt;withThinking(effort)&lt;/code&gt; correctly generates:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;reasoning_effort&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;high&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;extra_body&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;thinking&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;type&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;enabled&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Token Budget&lt;/strong&gt;: Automatically normalizes legacy &lt;code&gt;max_tokens&lt;/code&gt; to Kimi&amp;rsquo;s preferred &lt;code&gt;max_completion_tokens&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
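&lt;p&gt;The provider method and the token budget normalization can be sketched as below. The &lt;code&gt;RequestParams&lt;/code&gt; shape and both function bodies are hypothetical reconstructions from the description, not the kimi-code source, and keeping an existing &lt;code&gt;max_completion_tokens&lt;/code&gt; over a legacy &lt;code&gt;max_tokens&lt;/code&gt; is an assumption.&lt;/p&gt;

```typescript
// Hypothetical request shape mirroring the JSON shown in the list; not the actual source.
type Effort = "low" | "medium" | "high" | "xhigh" | "max";

interface RequestParams {
  max_tokens?: number;
  max_completion_tokens?: number;
  reasoning_effort?: Effort;
  extra_body?: { thinking: { type: string } };
}

// Attach the thinking fields without mutating the caller's object.
function withThinking(params: RequestParams, effort: Effort): RequestParams {
  return { ...params, reasoning_effort: effort, extra_body: { thinking: { type: "enabled" } } };
}

// Move a legacy max_tokens value to max_completion_tokens.
// Assumption: an explicitly set max_completion_tokens takes precedence.
function normalizeTokenBudget(params: RequestParams): RequestParams {
  const { max_tokens, ...rest } = params;
  if (max_tokens === undefined) return params;
  if (rest.max_completion_tokens !== undefined) return rest;
  return { ...rest, max_completion_tokens: max_tokens };
}

const request = withThinking(normalizeTokenBudget({ max_tokens: 4096 }), "high");
console.log(JSON.stringify(request));
```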
&lt;p&gt;&lt;strong&gt;OpenCode&amp;rsquo;s Problem:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When using the Anthropic bridge, it hardcodes &lt;code&gt;thinking&lt;/code&gt; content blocks, but the Kimi API only supports &lt;code&gt;text/image_url/video_url/video&lt;/code&gt;, resulting in:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Invalid value: thinking. Supported values are: &amp;#39;text&amp;#39;,&amp;#39;image_url&amp;#39;,&amp;#39;video_url&amp;#39; and &amp;#39;video&amp;#39;.
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id="5-native-moonshot-service-integration"&gt;5. Native Moonshot Service Integration
&lt;/h2&gt;&lt;p&gt;kimi-code includes Moonshot-exclusive services instead of relying on generic local implementations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;MoonshotFetchURLProvider&lt;/code&gt;&lt;/strong&gt;: Prioritizes Moonshot&amp;rsquo;s &lt;code&gt;coding-fetch&lt;/code&gt; service (with built-in page text extraction), falling back to local fetch only on failure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;MoonshotWebSearchProvider&lt;/code&gt;&lt;/strong&gt;: Calls the Moonshot search API directly, supporting &lt;code&gt;enable_page_crawling&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;KimiFiles&lt;/code&gt;&lt;/strong&gt;: Uploads videos to the Moonshot file service, returning &lt;code&gt;video_url&lt;/code&gt; in the &lt;code&gt;ms://&amp;lt;file-id&amp;gt;&lt;/code&gt; format&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="6-tool-call-layer-details"&gt;6. Tool Call Layer Details
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Built-in Functions&lt;/strong&gt;: Tool names starting with &lt;code&gt;$&lt;/code&gt; are recognized as Kimi builtin functions and serialized as &lt;code&gt;type: 'builtin_function'&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Usage Extraction&lt;/strong&gt;: Supports Moonshot&amp;rsquo;s proprietary &lt;code&gt;choices[0].usage&lt;/code&gt; placement, as well as &lt;code&gt;cached_tokens&lt;/code&gt; and other fields&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Finish Reason Mapping&lt;/strong&gt;: Maps OpenAI-style &lt;code&gt;stop&lt;/code&gt;/&lt;code&gt;tool_calls&lt;/code&gt;/&lt;code&gt;length&lt;/code&gt; values to an internal unified enum&lt;/li&gt;
&lt;/ul&gt;
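&lt;p&gt;Two of these details lend themselves to a short sketch. The &lt;code&gt;ToolSpec&lt;/code&gt; shape, the &lt;code&gt;serializeTool&lt;/code&gt; and &lt;code&gt;mapFinishReason&lt;/code&gt; names, and the internal enum values are all hypothetical; only the &lt;code&gt;$&lt;/code&gt; prefix rule, the &lt;code&gt;builtin_function&lt;/code&gt; type, and the three OpenAI-style finish reasons come from the description.&lt;/p&gt;

```typescript
// Hypothetical shapes and names; the real kosong types are not shown in this post.
interface ToolSpec {
  name: string;
  description: string;
}
interface WireTool {
  type: "function" | "builtin_function";
  function: { name: string; description?: string };
}

// Names starting with "$" are treated as Kimi builtin functions.
function serializeTool(tool: ToolSpec): WireTool {
  if (tool.name.startsWith("$")) {
    return { type: "builtin_function", function: { name: tool.name } };
  }
  return { type: "function", function: { name: tool.name, description: tool.description } };
}

// Hypothetical internal enum values for the unified finish reason.
type FinishReason = "completed" | "tool_calls" | "truncated" | "other";

function mapFinishReason(raw: string | null): FinishReason {
  switch (raw) {
    case "stop":
      return "completed";
    case "tool_calls":
      return "tool_calls";
    case "length":
      return "truncated";
    default:
      return "other";
  }
}

console.log(serializeTool({ name: "$web_search", description: "" }).type);
console.log(mapFinishReason("length"));
```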
&lt;h2 id="7-cli-core-and-llm-sdk-architectural-isolation"&gt;7. CLI Core and LLM SDK Architectural Isolation
&lt;/h2&gt;&lt;p&gt;This is an easily overlooked but important architectural difference.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The core CLI of kimi-code (&lt;code&gt;apps/kimi-code&lt;/code&gt;) does not directly depend on any OpenAI or Anthropic TypeScript SDK.&lt;/strong&gt; In its &lt;code&gt;package.json&lt;/code&gt;, the core dependencies are only generic libraries, such as those for TUI rendering (&lt;code&gt;pi-tui&lt;/code&gt;), CLI parsing (&lt;code&gt;commander&lt;/code&gt;), and syntax highlighting (&lt;code&gt;cli-highlight&lt;/code&gt;). All LLM provider interactions are isolated within the in-house &lt;code&gt;kosong&lt;/code&gt; package.&lt;/p&gt;
&lt;p&gt;While &lt;code&gt;packages/kosong&lt;/code&gt; internally uses &lt;code&gt;openai&lt;/code&gt; and &lt;code&gt;@anthropic-ai/sdk&lt;/code&gt; as implementation details (since the Kimi API is OpenAI-compatible), it exposes a unified LLM abstraction interface to the outside. The CLI core only depends on &lt;code&gt;kosong&lt;/code&gt; and has no awareness of underlying vendor SDKs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OpenCode is different.&lt;/strong&gt; Its &lt;code&gt;packages/opencode&lt;/code&gt; core package directly depends on a large number of vendor SDKs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;@ai-sdk/openai&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;@ai-sdk/anthropic&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;@ai-sdk/google&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;@ai-sdk/azure&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;@openrouter/ai-sdk-provider&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&amp;hellip; (more than a dozen provider-specific packages in total)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means OpenCode&amp;rsquo;s core code is deeply coupled with each vendor&amp;rsquo;s SDK, while kimi-code&amp;rsquo;s core CLI stays clean, with all model interactions fully isolated through a self-developed abstraction layer.&lt;/p&gt;
&lt;h2 id="8-what-commit-history-reveals-about-evolution-paths"&gt;8. What Commit History Reveals About Evolution Paths
&lt;/h2&gt;&lt;p&gt;The structural differences above are only a static snapshot. Comparing the two projects&amp;rsquo; commit histories is more revealing: &lt;strong&gt;they are evolving in completely different directions&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id="kimi-code-native-design-continuously-reducing-configuration-burden"&gt;kimi-code: Native Design, Continuously Reducing Configuration Burden
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;842e699&lt;/code&gt; — &amp;ldquo;Kimi For Coding&amp;rdquo; (Initial Commit)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This was the starting point of the entire project. The initial code already included:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;packages/kosong/src/providers/kimi.ts&lt;/code&gt;: Dedicated Kimi provider&lt;/li&gt;
&lt;li&gt;&lt;code&gt;packages/kosong/src/providers/kimi-schema.ts&lt;/code&gt;: Dedicated JSON Schema normalizer&lt;/li&gt;
&lt;li&gt;&lt;code&gt;packages/kosong/src/providers/kimi-files.ts&lt;/code&gt;: Dedicated file upload service&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Conclusion: kimi-code treated the Kimi API as a first-class citizen from day one, not as a later patch.&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;d95b013&lt;/code&gt; fix(catalog): preserve reasoning fields in custom model (&lt;a class="link" href="https://github.com/MoonshotAI/kimi-code/pull/70" target="_blank" rel="noopener"
&gt;#70&lt;/a&gt;)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This commit fixed a very subtle issue. models.dev uses the &lt;code&gt;interleaved&lt;/code&gt; field to mark reasoning support, but early code treated &lt;code&gt;interleaved=true&lt;/code&gt; as undefined, causing models selected via &lt;code&gt;/connect&lt;/code&gt; to silently lose their reasoning capability.&lt;/p&gt;
&lt;p&gt;Fixes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;interleaved=true&lt;/code&gt; is mapped to the default &lt;code&gt;reasoning_content&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;interleaved&lt;/code&gt; is added to the &lt;code&gt;update-catalog.mjs&lt;/code&gt; allowlist; otherwise the offline catalog in release builds would silently drop the field again&lt;/li&gt;
&lt;/ul&gt;
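&lt;p&gt;A minimal sketch of the two fixes, under stated assumptions: the function names, the catalog entry shape, and every allowlist entry other than &lt;code&gt;interleaved&lt;/code&gt; are hypothetical, and the real &lt;code&gt;update-catalog.mjs&lt;/code&gt; may be structured quite differently.&lt;/p&gt;

```typescript
// Hypothetical helpers; they only illustrate the two fixes described above.
const DEFAULT_REASONING_KEY = 'reasoning_content';

// Fix 1: interleaved=true maps to the default key instead of undefined.
function reasoningKeyFromCatalog(interleaved: boolean | undefined): string | undefined {
  return interleaved === true ? DEFAULT_REASONING_KEY : undefined;
}

// Fix 2: the offline catalog keeps only allowlisted fields, so 'interleaved'
// has to be on the list or release builds drop it again.
// Entries other than 'interleaved' are placeholders.
const CATALOG_FIELD_ALLOWLIST = ['id', 'name', 'interleaved'];

type CatalogEntry = { [key: string]: unknown };

function pickAllowedFields(entry: CatalogEntry): CatalogEntry {
  const out: CatalogEntry = {};
  for (const key of CATALOG_FIELD_ALLOWLIST) {
    if (key in entry) {
      out[key] = entry[key];
    }
  }
  return out;
}
```

&lt;p&gt;The point of the second fix is that an allowlist fails silently: a field that is mapped correctly at runtime still disappears if the build step never copies it into the bundled catalog.&lt;/p&gt;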
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;61f7d0e&lt;/code&gt; fix(kosong): make openai-compatible thinking work without reasoning_key (&lt;a class="link" href="https://github.com/MoonshotAI/kimi-code/pull/78" target="_blank" rel="noopener"
&gt;#78&lt;/a&gt;)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This is the core commit for reasoning handling&lt;/strong&gt;, and it shows how much care kimi-code puts into compatibility. The diff reveals a three-layer design:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Inbound Auto-Scan&lt;/strong&gt; (response parsing)&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-ts" data-lang="ts"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;KNOWN_REASONING_KEYS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;reasoning_content&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;reasoning_details&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;reasoning&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="kr"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;const&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;// Auto-scan three fields; first string value wins
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Outbound Default Write-Back&lt;/strong&gt; (request serialization)&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-ts" data-lang="ts"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;DEFAULT_OUTBOUND_REASONING_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;KNOWN_REASONING_KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c1"&gt;// &amp;#39;reasoning_content&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;// Defaults to writing back as reasoning_content, no user config needed
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Auto-Inject &lt;code&gt;reasoning_effort&lt;/code&gt;&lt;/strong&gt; (historical continuity)&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-ts" data-lang="ts"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;// When history contains ThinkPart but caller hasn&amp;#39;t explicitly set reasoning_effort,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;// auto-inject &amp;#39;medium&amp;#39; to prevent strict gateways like One API / DeepSeek from returning 400
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Edge cases are handled meticulously: blank &lt;code&gt;reasoning_key&lt;/code&gt; (&lt;code&gt;&amp;quot;&amp;quot;&lt;/code&gt;) is normalized to &lt;code&gt;undefined&lt;/code&gt;; values explicitly set by the caller via &lt;code&gt;withGenerationKwargs&lt;/code&gt; &lt;strong&gt;are not silently overwritten by auto-injection&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The verification note for this change states:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Manually verified end-to-end against the real DeepSeek API with a hand-written config.toml that does not set reasoning_key: thinking content renders, no 400, multi-turn conversations work.&lt;/p&gt;&lt;/blockquote&gt;
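&lt;p&gt;Putting the three layers and the edge cases together, the logic could look roughly like the sketch below. Only the two constants come from the diff quoted above; the function names, the &lt;code&gt;Dict&lt;/code&gt; type, the whitespace trimming, and the rule that an explicitly configured key takes precedence over the scan are assumptions for illustration.&lt;/p&gt;

```typescript
const KNOWN_REASONING_KEYS = ['reasoning_content', 'reasoning_details', 'reasoning'] as const;
const DEFAULT_OUTBOUND_REASONING_KEY = KNOWN_REASONING_KEYS[0]; // 'reasoning_content'

// Everything below is a hypothetical sketch, not the kosong implementation.
type Dict = { [key: string]: unknown };

// Edge case: a blank reasoning_key ("") counts as "not configured".
function normalizeReasoningKey(key: string | undefined): string | undefined {
  const trimmed = key?.trim();
  return trimmed ? trimmed : undefined;
}

// Layer 1 (inbound): scan the known keys; the first string value wins.
function readReasoning(message: Dict, reasoningKey?: string): string | undefined {
  const explicit = normalizeReasoningKey(reasoningKey);
  const keys: readonly string[] = explicit ? [explicit] : KNOWN_REASONING_KEYS;
  for (const key of keys) {
    const value = message[key];
    if (typeof value === 'string') {
      return value;
    }
  }
  return undefined;
}

// Layer 2 (outbound): write back under the configured key, else the default.
function writeReasoning(message: Dict, text: string, reasoningKey?: string): Dict {
  const key = normalizeReasoningKey(reasoningKey) ?? DEFAULT_OUTBOUND_REASONING_KEY;
  return { ...message, [key]: text };
}

// Layer 3 (effort): inject 'medium' only when history contains thinking
// and the caller has not set a value; an explicit value is never overwritten.
function resolveReasoningEffort(historyHasThinking: boolean, callerEffort?: string): string | undefined {
  if (callerEffort !== undefined) {
    return callerEffort;
  }
  return historyHasThinking ? 'medium' : undefined;
}
```

&lt;p&gt;With this shape, a gateway that returns thinking under &lt;code&gt;reasoning&lt;/code&gt; is read without any configuration, and the same text is sent back as &lt;code&gt;reasoning_content&lt;/code&gt; on the next turn.&lt;/p&gt;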
&lt;hr&gt;
&lt;h3 id="opencode-generic-layer-design-openai-centric"&gt;OpenCode: Generic Layer Design, OpenAI-centric
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;eb84f46&lt;/code&gt; fix(llm): split OpenAI reasoning summary blocks (#29000)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This commit demonstrates OpenCode&amp;rsquo;s completely different approach to reasoning—designed around the &lt;strong&gt;OpenAI Responses API&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Maintains a state machine for &lt;code&gt;encrypted_content&lt;/code&gt; and &lt;code&gt;item_reference&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Folds multiple summary parts by &lt;code&gt;item_id&lt;/code&gt; + &lt;code&gt;summary_index&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;When &lt;code&gt;store:false&lt;/code&gt;, filters out reasoning items lacking &lt;code&gt;encrypted_content&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;This is completely different from Kimi&amp;rsquo;s &lt;code&gt;reasoning_content&lt;/code&gt; mechanism.&lt;/strong&gt; Kimi does not need &lt;code&gt;encrypted_content&lt;/code&gt; or &lt;code&gt;item_reference&lt;/code&gt;; it simply attaches a &lt;code&gt;reasoning_content&lt;/code&gt; field to the message.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="a-hard-fact"&gt;A Hard Fact
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class="link" href="https://github.com/anomalyco/opencode/issues/26331" target="_blank" rel="noopener"
&gt;OpenCode Issue #26331&lt;/a&gt;&lt;/strong&gt; &amp;ldquo;Bug: OpenCode Go bridge layer incompatible with kimi-k2.6 tool calls&amp;rdquo; — &lt;strong&gt;Status: still open&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a class="link" href="https://github.com/anomalyco/opencode/issues/27054" target="_blank" rel="noopener"
&gt;OpenCode Issue #27054&lt;/a&gt;&lt;/strong&gt; &amp;ldquo;KIMI K2.6 showing error in Opencode GO&amp;rdquo; — &lt;strong&gt;Status: closed, but the resolution was to disable MCP (a workaround)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The last comment on #27054:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The workaround is to disable your MCP and then initiate the session&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;That&amp;rsquo;s not a fix. That&amp;rsquo;s avoiding the problem.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="commit-history-comparison-summary"&gt;Commit History Comparison Summary
&lt;/h3&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;kimi-code&lt;/th&gt;
&lt;th&gt;OpenCode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Initial Design&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Initial commit includes full Kimi provider + schema normalizer + file service&lt;/td&gt;
&lt;td&gt;Generic multi-model architecture, adapted later via bridge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reasoning Mechanism&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Designed around &lt;code&gt;reasoning_content&lt;/code&gt; field, with auto-scan / write-back / effort injection&lt;/td&gt;
&lt;td&gt;Designed around OpenAI Responses&amp;rsquo; &lt;code&gt;encrypted_content&lt;/code&gt; + &lt;code&gt;item_reference&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Schema Handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dedicated &lt;code&gt;normalizeKimiToolSchema&lt;/code&gt;, dereferences &lt;code&gt;$ref&lt;/code&gt; + backfills &lt;code&gt;type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generic schema validation, focused on friendly error messages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Config Philosophy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Makes OpenAI-compatible gateways &amp;ldquo;zero-config&amp;rdquo; by auto-inferring all fields&lt;/td&gt;
&lt;td&gt;Relies on users manually adapting via bridge/config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Issue Status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Continuously shipping reasoning-related patches (&lt;a class="link" href="https://github.com/MoonshotAI/kimi-code/pull/70" target="_blank" rel="noopener"
&gt;#70&lt;/a&gt;, &lt;a class="link" href="https://github.com/MoonshotAI/kimi-code/pull/78" target="_blank" rel="noopener"
&gt;#78&lt;/a&gt;)&lt;/td&gt;
&lt;td&gt;kimi-k2.6 compatibility issue &lt;a class="link" href="https://github.com/anomalyco/opencode/issues/26331" target="_blank" rel="noopener"
&gt;#26331&lt;/a&gt; still open&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="summary-core-differences"&gt;Summary: Core Differences
&lt;/h2&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;kimi-code&lt;/th&gt;
&lt;th&gt;OpenCode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architecture Positioning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native design for Kimi/Moonshot, dedicated provider&lt;/td&gt;
&lt;td&gt;Generic multi-model agent, adapted via bridge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Thinking/Reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native support, full lifecycle preservation of &lt;code&gt;reasoning_content&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Easily lost in bridge layer, causing 400 errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON Schema&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dedicated &lt;code&gt;normalizeKimiToolSchema&lt;/code&gt; for dereferencing and type backfilling&lt;/td&gt;
&lt;td&gt;Generic schema generation, does not meet Kimi validator requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Directly generates Moonshot-native format (including &lt;code&gt;thinking&lt;/code&gt; config, &lt;code&gt;$defs&lt;/code&gt; normalization, etc.)&lt;/td&gt;
&lt;td&gt;Transformed through OpenAI/Anthropic protocol conversion, causing format mismatches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Service Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in Moonshot fetch/search/file services&lt;/td&gt;
&lt;td&gt;Uses generic local tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core Dependencies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI core does not directly depend on vendor SDKs; isolated via self-developed &lt;code&gt;kosong&lt;/code&gt; package&lt;/td&gt;
&lt;td&gt;Core package directly coupled with &lt;code&gt;@ai-sdk/openai&lt;/code&gt; and more than a dozen other vendor SDKs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Looking at commit history, kimi-code&amp;rsquo;s evolution is directed at &lt;strong&gt;continuously eliminating user configuration burden&lt;/strong&gt; (&lt;code&gt;reasoning_key&lt;/code&gt; went from required → optional override → auto-inferred; &lt;code&gt;interleaved&lt;/code&gt; went from filtered → correctly mapped), while OpenCode&amp;rsquo;s evolution is directed at &lt;strong&gt;deepening OpenAI ecosystem integration&lt;/strong&gt; (Responses API, encrypted reasoning, item reference), leaving Kimi adaptation stuck at the generic bridge layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;That&amp;rsquo;s the truth at the commit level: one is native evolution, the other is a bridge gap.&lt;/strong&gt;&lt;/p&gt;</description></item><item><title>Does Self-Hosting an LLM Really Let You Use It Without Limits?</title><link>https://svtter.cn/en/p/does-self-hosting-an-llm-really-let-you-use-it-without-limits/</link><pubDate>Thu, 19 Mar 2026 12:30:00 +0800</pubDate><guid>https://svtter.cn/en/p/does-self-hosting-an-llm-really-let-you-use-it-without-limits/</guid><description>&lt;img src="https://svtter.cn/p/%E8%87%AA%E5%B7%B1%E9%83%A8%E7%BD%B2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E7%9C%9F%E7%9A%84%E5%B0%B1%E8%83%BD%E8%82%86%E6%97%A0%E5%BF%8C%E6%83%AE%E5%9C%B0%E7%94%A8%E5%90%97/cover.jpg" alt="Featured image of post Does Self-Hosting an LLM Really Let You Use It Without Limits?" /&gt;&lt;p&gt;Many people start thinking seriously about self-hosting an LLM not because of technical romance, but because API bills, rate limits, or compliance requirements have started to collide with real business constraints.&lt;/p&gt;
&lt;p&gt;So a very natural question shows up: if the model runs on your own machine, does that mean you can finally use it without limits?&lt;/p&gt;
&lt;p&gt;My answer is: &lt;strong&gt;no.&lt;/strong&gt; Self-hosting a model does not mean unlimited freedom. It mostly means that many of the constraints and costs previously absorbed by the platform are now transferred to you.&lt;/p&gt;
&lt;p&gt;But there is a more useful second question: once usage gets large enough, can self-hosting actually become cheaper?&lt;/p&gt;
&lt;p&gt;The answer is: &lt;strong&gt;possibly, but under stricter conditions than many people expect.&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In short: self-hosting an LLM does not mean unlimited freedom.&lt;/p&gt;
&lt;p&gt;It means taking on part of the cost and responsibility that a platform would normally absorb. Self-hosting becomes financially attractive only when load stays high, utilization remains strong, and you can either accept model trade-offs or optimize the stack yourself.&lt;/p&gt;&lt;/blockquote&gt;
&lt;h2 id="local-deployment-does-not-mean-no-limits"&gt;Local deployment does not mean no limits
&lt;/h2&gt;&lt;p&gt;Let us clear up the most common misunderstanding first.&lt;/p&gt;
&lt;p&gt;Many people interpret &amp;ldquo;the model runs on my own machine&amp;rdquo; as &amp;ldquo;I can now use it however I want.&amp;rdquo; In reality, the limits do not disappear. They simply show up in a different form.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The first limit is hardware.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Parameter count, VRAM capacity, quantization level, KV cache, and concurrency are real physical constraints. Even a quantized 70B model still puts serious pressure on memory and bandwidth. Being able to run it does not mean it runs comfortably. Getting output does not mean latency and throughput are acceptable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The second limit is model capability itself.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Hallucinations, knowledge cutoffs, long-context degradation, and unstable reasoning do not vanish just because the model sits on your own server. Deployment location does not change the model&amp;rsquo;s ceiling. More importantly, most so-called self-hosting setups use open-weight models, not the actual closed models behind systems like Claude or GPT.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The third limit is responsibility transfer.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When you use an API, content safety, service stability, rate limiting, and much of the infrastructure burden are partially handled by the provider. Once you self-host, those problems do not go away. They become your monitoring, your operations, your review pipeline, and your incident response.&lt;/p&gt;
&lt;p&gt;So &lt;strong&gt;self-hosting is not &amp;ldquo;use without limits.&amp;rdquo; It is &amp;ldquo;you own the boundaries.&amp;rdquo;&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id="the-real-calculation-is-not-just-the-price-of-a-gpu"&gt;The real calculation is not just the price of a GPU
&lt;/h2&gt;&lt;p&gt;If you want to know whether self-hosting is worth it, the real comparison is not &amp;ldquo;how much does the card cost?&amp;rdquo; but the two full annual totals below.&lt;/p&gt;
&lt;p&gt;The annual cost of self-hosting can be written roughly like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Annual self-hosting cost = hardware depreciation + electricity + network / hosting + operations labor + redundancy for failures
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The annual API cost is more direct:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Annual API cost = (average daily token usage / 1,000,000) * price per million tokens * 365
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;That looks simple, but three details are often ignored.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Self-hosting is not a one-time hardware purchase.&lt;/strong&gt; Electricity, spare parts, hosting conditions, alerting, upgrades, and maintenance all keep happening.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API pricing is not a single fixed number.&lt;/strong&gt; Model choice, input-output ratio, cache hit rate, and tool usage can all change the final bill significantly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Utilization is easy to underestimate.&lt;/strong&gt; If your machine sits idle most of the time, a low per-inference cost means very little. On the other hand, if the workload is stable and the hardware stays busy, the financial case for self-hosting becomes much stronger.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the numbers below should be read as rough order-of-magnitude guidance, not as a procurement quote.&lt;/p&gt;
&lt;h2 id="a-rough-but-useful-breakeven-table"&gt;A rough but useful breakeven table
&lt;/h2&gt;&lt;p&gt;To keep the discussion simple, let us start with a deliberately rough set of assumptions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;API pricing is estimated at roughly CNY 50 per million tokens&lt;/li&gt;
&lt;li&gt;token usage counts both input and output together&lt;/li&gt;
&lt;li&gt;local hardware is depreciated over 3 years&lt;/li&gt;
&lt;li&gt;self-hosting cost includes baseline power and operations overhead&lt;/li&gt;
&lt;li&gt;the local setup mainly assumes open-weight model inference, not strict parity with top closed models&lt;/li&gt;
&lt;li&gt;this does not include training, fine-tuning, or a dedicated platform team&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Under those assumptions, you get a rough picture like this:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: left"&gt;Scenario&lt;/th&gt;
&lt;th style="text-align: left"&gt;Daily token usage&lt;/th&gt;
&lt;th style="text-align: left"&gt;Likely local setup&lt;/th&gt;
&lt;th style="text-align: left"&gt;Annual self-hosting cost&lt;/th&gt;
&lt;th style="text-align: left"&gt;Annual API cost&lt;/th&gt;
&lt;th style="text-align: left"&gt;Rough conclusion&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: left"&gt;Light usage&lt;/td&gt;
&lt;td style="text-align: left"&gt;500K&lt;/td&gt;
&lt;td style="text-align: left"&gt;Single high-end consumer workstation&lt;/td&gt;
&lt;td style="text-align: left"&gt;CNY 20K - 40K&lt;/td&gt;
&lt;td style="text-align: left"&gt;about CNY 9K&lt;/td&gt;
&lt;td style="text-align: left"&gt;API is cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left"&gt;Medium usage&lt;/td&gt;
&lt;td style="text-align: left"&gt;5M&lt;/td&gt;
&lt;td style="text-align: left"&gt;Dual-GPU or small inference workstation&lt;/td&gt;
&lt;td style="text-align: left"&gt;CNY 60K - 120K&lt;/td&gt;
&lt;td style="text-align: left"&gt;about CNY 91K&lt;/td&gt;
&lt;td style="text-align: left"&gt;Near breakeven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left"&gt;Heavy usage&lt;/td&gt;
&lt;td style="text-align: left"&gt;50M&lt;/td&gt;
&lt;td style="text-align: left"&gt;Multi-GPU server or cluster&lt;/td&gt;
&lt;td style="text-align: left"&gt;CNY 400K - 800K&lt;/td&gt;
&lt;td style="text-align: left"&gt;about CNY 912K&lt;/td&gt;
&lt;td style="text-align: left"&gt;Self-hosting may be cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
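&lt;p&gt;The API column in the table follows directly from the formula given earlier. The short sketch below reproduces it and also inverts the formula to estimate a breakeven volume; the function names are made up for the example, and the flat CNY 50 per million tokens is the same rough assumption used for the table.&lt;/p&gt;

```typescript
// Assumption carried over from the table: a flat CNY 50 per million tokens.
const PRICE_PER_MILLION_CNY = 50;

// Annual API cost = (daily tokens / 1,000,000) * price per million * 365
function annualApiCostCny(dailyTokens: number, pricePerMillion: number = PRICE_PER_MILLION_CNY): number {
  return (dailyTokens / 1000000) * pricePerMillion * 365;
}

// Daily token volume at which API spend equals a given annual self-hosting cost.
function breakevenDailyTokens(annualSelfHostCny: number, pricePerMillion: number = PRICE_PER_MILLION_CNY): number {
  return (annualSelfHostCny / (pricePerMillion * 365)) * 1000000;
}

// Light, medium, and heavy usage rows from the table.
for (const dailyTokens of [500000, 5000000, 50000000]) {
  console.log(dailyTokens, annualApiCostCny(dailyTokens));
}
```

&lt;p&gt;Under these assumptions, a setup costing CNY 60K per year breaks even at roughly 3.3 million tokens per day, and one costing CNY 120K at roughly 6.6 million, which is why the medium-usage row sits near breakeven.&lt;/p&gt;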
&lt;p&gt;&lt;img src="https://svtter.cn/p/%E8%87%AA%E5%B7%B1%E9%83%A8%E7%BD%B2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E7%9C%9F%E7%9A%84%E5%B0%B1%E8%83%BD%E8%82%86%E6%97%A0%E5%BF%8C%E6%83%AE%E5%9C%B0%E7%94%A8%E5%90%97/pics/inline-01.jpg"
width="4800"
height="3584"
srcset="https://svtter.cn/p/%E8%87%AA%E5%B7%B1%E9%83%A8%E7%BD%B2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E7%9C%9F%E7%9A%84%E5%B0%B1%E8%83%BD%E8%82%86%E6%97%A0%E5%BF%8C%E6%83%AE%E5%9C%B0%E7%94%A8%E5%90%97/pics/inline-01_hu_e538165957f7c9a8.jpg 480w, https://svtter.cn/p/%E8%87%AA%E5%B7%B1%E9%83%A8%E7%BD%B2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E7%9C%9F%E7%9A%84%E5%B0%B1%E8%83%BD%E8%82%86%E6%97%A0%E5%BF%8C%E6%83%AE%E5%9C%B0%E7%94%A8%E5%90%97/pics/inline-01_hu_c17af6e4e0b01ddc.jpg 1024w"
loading="lazy"
alt="An illustration showing how the balance shifts from API costs to local hardware investment as LLM usage grows from light to heavy"
class="gallery-image"
data-flex-grow="133"
data-flex-basis="321px"
&gt;&lt;/p&gt;
&lt;p&gt;If you want local quality to get as close as possible to top-tier closed models, the self-hosting figures in this table usually move up again, because stronger models, more VRAM, and higher availability targets all push infrastructure and operations costs higher.&lt;/p&gt;
&lt;p&gt;This table points to three things.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Individuals and small teams usually do not save money with self-hosting.&lt;/strong&gt; If your workload is only a few hundred thousand tokens per day, APIs are still usually the more economical option. You spend less on hardware and avoid carrying the operations burden.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The real breakeven point tends to appear only in consistently high-usage scenarios.&lt;/strong&gt; Not one occasional spike, but a workload that stays high day after day. Only then can hardware cost be spread efficiently enough.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The larger the usage, the more attractive self-hosting becomes financially.&lt;/strong&gt; That is why large companies invest seriously in inference platforms. It is not because they enjoy complexity. It is because once the scale is large enough, the math really changes.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="one-critical-condition-you-may-not-be-comparing-the-same-thing"&gt;One critical condition: you may not be comparing the same thing
&lt;/h2&gt;&lt;p&gt;The biggest problem in many &amp;ldquo;self-hosting is cheaper than API&amp;rdquo; discussions is not the arithmetic. It is that the compared products are often not equivalent.&lt;/p&gt;
&lt;p&gt;On the API side, you may be buying access to a top-tier closed model. On the local side, you may be running a quantized open-weight model. Both are called &amp;ldquo;LLMs,&amp;rdquo; but they are not the same product in a strict sense.&lt;/p&gt;
&lt;p&gt;That means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;if open-weight quality is acceptable for your use case, self-hosting may indeed save a lot of money&lt;/li&gt;
&lt;li&gt;if your quality bar is high and you depend on the best closed models, the room for self-hosting becomes much smaller&lt;/li&gt;
&lt;li&gt;if you compare a cheaper model to a more expensive model, the result is not just a deployment conclusion, but also a model-selection conclusion&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Put differently, &lt;strong&gt;many people think they are calculating deployment cost when they are actually accepting a capability downgrade first.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There is nothing wrong with that trade-off, but it should be stated clearly.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://svtter.cn/p/%E8%87%AA%E5%B7%B1%E9%83%A8%E7%BD%B2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E7%9C%9F%E7%9A%84%E5%B0%B1%E8%83%BD%E8%82%86%E6%97%A0%E5%BF%8C%E6%83%AE%E5%9C%B0%E7%94%A8%E5%90%97/pics/inline-02.jpg"
width="4800"
height="3584"
srcset="https://svtter.cn/p/%E8%87%AA%E5%B7%B1%E9%83%A8%E7%BD%B2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E7%9C%9F%E7%9A%84%E5%B0%B1%E8%83%BD%E8%82%86%E6%97%A0%E5%BF%8C%E6%83%AE%E5%9C%B0%E7%94%A8%E5%90%97/pics/inline-02_hu_3afbc14068dd055d.jpg 480w, https://svtter.cn/p/%E8%87%AA%E5%B7%B1%E9%83%A8%E7%BD%B2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E7%9C%9F%E7%9A%84%E5%B0%B1%E8%83%BD%E8%82%86%E6%97%A0%E5%BF%8C%E6%83%AE%E5%9C%B0%E7%94%A8%E5%90%97/pics/inline-02_hu_7f9cead440467875.jpg 1024w"
loading="lazy"
alt="An illustration showing that a closed cloud model and a local open-weight model are not fully equivalent in capability, cost, and operational burden"
class="gallery-image"
data-flex-grow="133"
data-flex-basis="321px"
&gt;&lt;/p&gt;
&lt;h2 id="what-self-hosting-gives-you-besides-cost-savings"&gt;What self-hosting gives you besides cost savings
&lt;/h2&gt;&lt;p&gt;If a company still chooses to self-host after doing the math, it is usually not only about saving API money.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data control.&lt;/strong&gt; Some businesses simply do not want raw data flowing through third-party providers for long-term operational or compliance reasons. Local deployment makes the compliance and audit path easier to manage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Customization.&lt;/strong&gt; You can optimize around your own tasks with quantization, routing, distillation, fine-tuning, and tighter integration into internal systems. Standard APIs usually give you less freedom here.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A more predictable cost ceiling.&lt;/strong&gt; API pricing scales directly with usage. When the business grows, the bill grows with it. Self-hosting has a large upfront investment, but under high and stable load, the cost curve is often easier to predict.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Offline operation and availability.&lt;/strong&gt; If your environment requires internal-only deployment, or if you cannot accept key workflows depending entirely on external services, local deployment may simply fit the engineering requirements better.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="a-more-practical-decision-framework"&gt;A more practical decision framework
&lt;/h2&gt;&lt;p&gt;If you do not want to model every variable from day one, start with these three questions.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Is your workload consistently high over time?&lt;/strong&gt; If you only see occasional spikes rather than sustained token usage every day, APIs are often still the better choice because you are not paying for idle hardware.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Can you accept the gap between a local model and a closed flagship model?&lt;/strong&gt; If your business depends on best-in-class model quality, a large part of the claimed savings may come from lowering model quality rather than from deployment efficiency alone.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Do you actually have the ability to operate an inference service long term?&lt;/strong&gt; What happens when a GPU fails, drivers conflict, service latency spikes, the model version needs to change, or rate limiting and monitoring need to be built? If nobody owns these questions, the issue is no longer just cost. It becomes a delivery problem.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="conclusion"&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;Back to the original question: does self-hosting an LLM really let you use it without limits?&lt;/p&gt;
&lt;p&gt;My answer is still: &lt;strong&gt;no.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It does not remove hardware bottlenecks, erase model capability gaps, or magically solve moderation, reliability, and operations work for you. What it gives you is not absolute freedom, but more control and the responsibility that comes with it.&lt;/p&gt;
&lt;p&gt;At the same time, &lt;strong&gt;self-hosting is absolutely not a fake option.&lt;/strong&gt; It becomes increasingly reasonable when several conditions are true at once:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;your token usage stays high for a long time&lt;/li&gt;
&lt;li&gt;the workload is stable and hardware utilization remains high&lt;/li&gt;
&lt;li&gt;open-weight models are acceptable, or you already have the ability to optimize them well&lt;/li&gt;
&lt;li&gt;data control, internal deployment, or predictable cost ceilings matter to you&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are an individual, a small team, or just an occasional heavy user, APIs are still usually the more practical answer: less effort, less operational burden, and lower cost of experimentation.&lt;/p&gt;
&lt;p&gt;If you are already in the phase where you burn tokens steadily every day, then it is worth calculating the full picture instead of staring only at API unit prices. Very often the answer is not &amp;ldquo;now I can use it without limits,&amp;rdquo; but a more grounded question that matters more: &lt;strong&gt;is this worth owning yourself?&lt;/strong&gt;&lt;/p&gt;</description></item><item><title>The Mathematical Trap of Big Model Coding Plan Packages: Can Promised Usage Be Delivered Under Concurrency Limits?</title><link>https://svtter.cn/en/p/the-mathematical-trap-of-big-model-coding-plan-packages-can-promised-usage-be-delivered-under-concurrency-limits/</link><pubDate>Fri, 23 Jan 2026 11:52:52 +0800</pubDate><guid>https://svtter.cn/en/p/the-mathematical-trap-of-big-model-coding-plan-packages-can-promised-usage-be-delivered-under-concurrency-limits/</guid><description>&lt;img src="https://svtter.cn/p/%E5%A4%A7%E6%A8%A1%E5%9E%8B-coding-plan-%E5%A5%97%E9%A4%90%E7%9A%84%E6%95%B0%E5%AD%A6%E9%99%B7%E9%98%B1%E5%B9%B6%E5%8F%91%E9%99%90%E5%88%B6%E4%B8%8B%E7%9A%84%E6%89%BF%E8%AF%BA%E9%87%8F%E8%83%BD%E5%90%A6%E5%85%91%E7%8E%B0/cover.png" alt="Featured image of post The Mathematical Trap of Big Model Coding Plan Packages: Can Promised Usage Be Delivered Under Concurrency Limits?" /&gt;&lt;h2 id="preface"&gt;Preface
&lt;/h2&gt;&lt;p&gt;Recently, several Chinese large-model vendors have launched Coding Plan subscription packages for developers. The pitch is &amp;ldquo;low prices for massive usage&amp;rdquo;: for just tens to hundreds of RMB per month, you supposedly get a quota of &amp;ldquo;hundreds of billions of tokens.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;It sounds wonderful, but as a developer who prefers to let the data speak, I decided to run the numbers: &lt;strong&gt;Under concurrency limits, can the promised usage actually be consumed?&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id="typical-package-structure"&gt;Typical Package Structure
&lt;/h2&gt;&lt;p&gt;Taking the common three-tier packages on the market as an example:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Package&lt;/th&gt;
&lt;th&gt;Monthly Fee&lt;/th&gt;
&lt;th&gt;Promised Usage (every 5 hours)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lite&lt;/td&gt;
&lt;td&gt;~20 RMB&lt;/td&gt;
&lt;td&gt;About 120 prompts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro&lt;/td&gt;
&lt;td&gt;~100 RMB&lt;/td&gt;
&lt;td&gt;About 600 prompts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max&lt;/td&gt;
&lt;td&gt;~200 RMB&lt;/td&gt;
&lt;td&gt;About 2,400 prompts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The vendors also add: &amp;ldquo;Each prompt is expected to call the model 15-20 times, with a total monthly usage of up to tens to hundreds of billions of tokens.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;It seems like incredible value, but the devil is in the details.&lt;/p&gt;
&lt;h2 id="key-limitation-concurrency"&gt;Key Limitation: Concurrency
&lt;/h2&gt;&lt;p&gt;Most vendors&amp;rsquo; documentation mentions in passing: &amp;ldquo;Package usage is subject to concurrency limits (number of in-flight request tasks).&amp;rdquo;&lt;/p&gt;
&lt;p&gt;But what exactly is the limit? It is often not stated explicitly. According to community feedback and actual measurements, typical concurrency limits are as follows:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Package&lt;/th&gt;
&lt;th&gt;Concurrency (in-flight requests)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lite&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro&lt;/td&gt;
&lt;td&gt;~4-5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max&lt;/td&gt;
&lt;td&gt;~7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This number directly determines your actual throughput ceiling.&lt;/p&gt;
&lt;h2 id="math-time-can-the-max-package-use-2400-prompts"&gt;Math Time: Can the Max Package Use 2,400 Prompts?
&lt;/h2&gt;&lt;p&gt;Let&amp;rsquo;s take the highest-tier Max package as an example and do a simple calculation.&lt;/p&gt;
&lt;h3 id="known-conditions"&gt;Known Conditions
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Promised Usage&lt;/strong&gt;: 2,400 prompts every 5 hours&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Concurrency Limit&lt;/strong&gt;: 7&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model calls triggered per prompt&lt;/strong&gt;: 15-20 times (official data)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model generation speed&lt;/strong&gt;: About 50-60 tokens/second&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;5 hours = 18,000 seconds&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="calculation-process"&gt;Calculation Process
&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Step 1: Estimate single API call time&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A complete API call includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Input processing: ~1 second&lt;/li&gt;
&lt;li&gt;Model inference generation (assuming 500 tokens output): 500 ÷ 55 ≈ 9 seconds&lt;/li&gt;
&lt;li&gt;Network round-trip delay: ~1 second&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Total: About 10-12 seconds/call&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Calculate maximum calls in 5 hours&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Maximum calls = Concurrency × (Total time ÷ Single call time)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; = 7 × (18,000 ÷ 10)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; = 12,600 calls
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Step 3: Convert to prompts&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;According to official claims, each prompt triggers 15-20 calls:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Completable prompts = 12,600 ÷ 17.5 ≈ 720 prompts
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id="conclusion"&gt;Conclusion
&lt;/h3&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Official Promise&lt;/th&gt;
&lt;th&gt;Concurrency Limit&lt;/th&gt;
&lt;th&gt;Achievement Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompts per 5 hours&lt;/td&gt;
&lt;td&gt;2,400&lt;/td&gt;
&lt;td&gt;~720&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Even under ideal conditions, the actual usable amount of the Max package is only about 30% of the promise.&lt;/strong&gt;&lt;/p&gt;
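&lt;p&gt;The three steps above can be sketched in a few lines of Python. This is a minimal sketch under the same assumptions as the text (concurrency of 7, about 10 seconds per call, and 17.5 calls per prompt as the midpoint of 15-20). The function name &lt;code&gt;prompt_ceiling&lt;/code&gt; and its parameters are illustrative and not part of any vendor API.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;def prompt_ceiling(concurrency, window_seconds, seconds_per_call, calls_per_prompt):
    """Upper bound on prompts that fit in one quota window.

    Assumes every concurrency slot stays busy for the whole window,
    which is the best case described in the text.
    """
    max_calls = concurrency * (window_seconds / seconds_per_call)
    return max_calls / calls_per_prompt


# Values taken from the Max package example above.
ceiling = prompt_ceiling(
    concurrency=7,
    window_seconds=5 * 3600,
    seconds_per_call=10,
    calls_per_prompt=17.5,
)
promised = 2400
print(f"ceiling: {ceiling:.0f} prompts, {ceiling / promised:.0%} of the promise")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With these inputs the sketch reports a ceiling of 720 prompts, or 30% of the promise. Raising &lt;code&gt;seconds_per_call&lt;/code&gt; to 12, the upper end of the per-call estimate, lowers the ceiling to 600 prompts (25%).&lt;/p&gt;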
&lt;h2 id="harsher-reality-call-inflation-in-agent-mode"&gt;Harsher Reality: Call Inflation in Agent Mode
&lt;/h2&gt;&lt;p&gt;The above calculation is still based on the official claim of &amp;ldquo;15-20 calls per prompt.&amp;rdquo; But in actual AI Coding Agent scenarios (like Claude Code, Cline, etc.), the situation is much worse.&lt;/p&gt;
&lt;h3 id="how-agent-mode-works"&gt;How Agent Mode Works
&lt;/h3&gt;&lt;p&gt;When you give an AI programming assistant a task, it typically:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Analyzes requirements, creates a plan&lt;/li&gt;
&lt;li&gt;Reads relevant files (each file may trigger a call)&lt;/li&gt;
&lt;li&gt;Writes code&lt;/li&gt;
&lt;li&gt;Runs tests&lt;/li&gt;
&lt;li&gt;Discovers errors, fixes them&lt;/li&gt;
&lt;li&gt;Repeats steps 3-5 until successful&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A seemingly simple prompt may trigger &lt;strong&gt;50-100+ model calls&lt;/strong&gt; in an Agent loop.&lt;/p&gt;
&lt;h3 id="actual-measurement-case"&gt;Actual Measurement Case
&lt;/h3&gt;&lt;p&gt;User feedback:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;2 simple prompts, 80 seconds, consumed 38M Tokens, used up 97% of the 5-hour limit&amp;rdquo;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Reverse calculation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each prompt consumes about 19M tokens&lt;/li&gt;
&lt;li&gt;If calculated at 128K context, equivalent to &lt;strong&gt;~148 model calls/prompt&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is &lt;strong&gt;roughly 7-10 times higher&lt;/strong&gt; than the official &amp;ldquo;15-20 times.&amp;rdquo;&lt;/p&gt;
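&lt;p&gt;The reverse calculation can be reproduced with a short sketch. It assumes, as the text does, that every call fills a 128K context, taken here as 128,000 tokens. The variable names are illustrative, and the result shifts with whatever context size you assume.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Figures from the user report quoted above.
total_tokens = 38_000_000
prompts = 2

# Assumption: each model call consumes a full 128K context (128,000 tokens).
tokens_per_call = 128_000

tokens_per_prompt = total_tokens / prompts
calls_per_prompt = tokens_per_prompt / tokens_per_call

# Official claim: 15-20 calls per prompt.
low_ratio = calls_per_prompt / 20
high_ratio = calls_per_prompt / 15

print(f"{tokens_per_prompt / 1e6:.0f}M tokens per prompt")
print(f"about {calls_per_prompt:.0f} calls per prompt")
print(f"{low_ratio:.1f}x to {high_ratio:.1f}x the official figure")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Real calls rarely fill the whole context window, so the true number of calls behind those 19M tokens is probably higher than this estimate.&lt;/p&gt;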
&lt;h3 id="revised-actual-usable-amount"&gt;Revised Actual Usable Amount
&lt;/h3&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Calls per prompt&lt;/th&gt;
&lt;th&gt;Usable prompts in 5 hours&lt;/th&gt;
&lt;th&gt;Achievement Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Official ideal&lt;/td&gt;
&lt;td&gt;17.5&lt;/td&gt;
&lt;td&gt;720&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Light usage&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;252&lt;/td&gt;
&lt;td&gt;10.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Moderate usage&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;td&gt;168&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Heavy Agent usage&lt;/td&gt;
&lt;td&gt;100+&lt;/td&gt;
&lt;td&gt;&amp;lt;126&lt;/td&gt;
&lt;td&gt;&amp;lt;5.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="why-is-this-happening"&gt;Why Is This Happening?
&lt;/h2&gt;&lt;h3 id="1-token-calculation-includes-context"&gt;1. Token Calculation Includes Context
&lt;/h3&gt;&lt;p&gt;Token consumption counts input as well as output. In coding scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each call must resend the complete conversation history&lt;/li&gt;
&lt;li&gt;Code project context can easily reach tens of thousands of tokens&lt;/li&gt;
&lt;li&gt;With a 128K context window, a single call may consume 100K+ tokens&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="2-concurrency-is-a-hard-constraint"&gt;2. Concurrency is a Hard Constraint
&lt;/h3&gt;&lt;p&gt;Regardless of how large your package quota is, concurrency determines the maximum throughput per unit time. This is a &lt;strong&gt;physical bottleneck&lt;/strong&gt;, not something commercial strategies can bypass.&lt;/p&gt;
&lt;h3 id="3-promises-based-on-ideal-assumptions"&gt;3. Promises Based on Ideal Assumptions
&lt;/h3&gt;&lt;p&gt;Vendors&amp;rsquo; promotional numbers often assume that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each call uses only a small context&lt;/li&gt;
&lt;li&gt;Each prompt triggers only a few calls&lt;/li&gt;
&lt;li&gt;Users won&amp;rsquo;t use the service continuously at high intensity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But these assumptions rarely hold true in real AI Coding scenarios.&lt;/p&gt;
&lt;h2 id="a-table-to-see-the-truth"&gt;A Table to See the Truth
&lt;/h2&gt;&lt;p&gt;Taking the Max package (~200 RMB/month) as an example:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Official Promotion&lt;/th&gt;
&lt;th&gt;Theoretical Limit&lt;/th&gt;
&lt;th&gt;Actual Expectation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompts per 5 hours&lt;/td&gt;
&lt;td&gt;2,400&lt;/td&gt;
&lt;td&gt;720&lt;/td&gt;
&lt;td&gt;150-400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly prompts&lt;/td&gt;
&lt;td&gt;345,600&lt;/td&gt;
&lt;td&gt;103,680&lt;/td&gt;
&lt;td&gt;21,600-57,600&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly tokens&lt;/td&gt;
&lt;td&gt;&amp;ldquo;Hundreds of billions&amp;rdquo;&lt;/td&gt;
&lt;td&gt;~10 billion&lt;/td&gt;
&lt;td&gt;1-3 billion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Achievement Rate&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6-17%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="advice-for-developers"&gt;Advice for Developers
&lt;/h2&gt;&lt;h3 id="1-dont-be-fooled-by-hundreds-of-billions-of-tokens"&gt;1. Don&amp;rsquo;t Be Fooled by &amp;ldquo;Hundreds of Billions of Tokens&amp;rdquo;
&lt;/h3&gt;&lt;p&gt;Token count is a highly misleading metric. In coding-agent scenarios, most tokens are context that is resent with every call; the output tokens that actually do useful work may account for only 1-5% of the total.&lt;/p&gt;
&lt;h3 id="2-focus-on-concurrency"&gt;2. Focus on Concurrency
&lt;/h3&gt;&lt;p&gt;This is the core metric that determines the actual experience. If vendors don&amp;rsquo;t disclose concurrency limits, it&amp;rsquo;s likely because the numbers don&amp;rsquo;t look good.&lt;/p&gt;
&lt;h3 id="3-calculate-cost-per-prompt"&gt;3. Calculate Cost per Prompt
&lt;/h3&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Actual cost per prompt = Monthly fee ÷ Actual usable prompts
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Taking the Max package as an example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Official promotion: 200 ÷ 345,600 = 0.0006 RMB/prompt&lt;/li&gt;
&lt;li&gt;Actual situation: 200 ÷ 30,000 = &lt;strong&gt;0.007 RMB/prompt&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A difference of more than 10x.&lt;/p&gt;
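&lt;p&gt;The comparison can be checked with a short sketch. The 30-day month, the 5-hour quota window, and the figure of 30,000 actual prompts all follow the text; the variable names are illustrative.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;monthly_fee_rmb = 200

# Number of 5-hour quota windows in a 30-day month.
windows_per_month = 30 * 24 / 5

promised_prompts = 2400 * windows_per_month
actual_prompts = 30_000  # estimate used in the text

promised_cost = monthly_fee_rmb / promised_prompts
actual_cost = monthly_fee_rmb / actual_prompts

print(f"promised: {promised_cost:.4f} RMB/prompt")
print(f"actual:   {actual_cost:.4f} RMB/prompt")
print(f"ratio:    {actual_cost / promised_cost:.1f}x")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Replace &lt;code&gt;actual_prompts&lt;/code&gt; with your own measured monthly usage to see what each prompt really costs on your plan.&lt;/p&gt;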
&lt;h3 id="4-consider-pay-as-you-go"&gt;4. Consider Pay-as-You-Go
&lt;/h3&gt;&lt;p&gt;If your usage isn&amp;rsquo;t high, pay-as-you-go may be more cost-effective than monthly packages. At least you won&amp;rsquo;t pay for &amp;ldquo;unusable quotas.&amp;rdquo;&lt;/p&gt;
&lt;h2 id="conclusion-1"&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;The emergence of large-model Coding Plan packages is itself a good thing: it lowers the barrier for developers to use AI programming assistants. But when choosing a package, be sure to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Require vendors to disclose concurrency limits&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Calculate throughput limits yourself&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Don&amp;rsquo;t be misled by the big numbers of &amp;ldquo;hundreds of billions of tokens&amp;rdquo;&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After all, &lt;strong&gt;promised usage that can&amp;rsquo;t be consumed equals a disguised price increase.&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;This article is based on public information and mathematical derivation; specific values may vary due to manufacturer adjustments. Readers are advised to verify through actual measurements.&lt;/em&gt;&lt;/p&gt;</description></item><item><title>Efficient and Cost-Effective: My AI Agent Workflow Choice</title><link>https://svtter.cn/en/p/efficient-and-cost-effective-my-ai-agent-workflow-choice/</link><pubDate>Mon, 05 Jan 2026 16:00:00 +0800</pubDate><guid>https://svtter.cn/en/p/efficient-and-cost-effective-my-ai-agent-workflow-choice/</guid><description>&lt;img src="https://svtter.cn/p/%E9%AB%98%E6%95%88%E7%9C%81%E9%92%B1%E6%88%91%E7%9A%84-ai-agent-%E5%B7%A5%E4%BD%9C%E6%B5%81%E9%80%89%E6%8B%A9/featured-image.jpg" alt="Featured image of post Efficient and Cost-Effective: My AI Agent Workflow Choice" /&gt;&lt;p&gt;Claude Code&amp;rsquo;s $100/month price tag is a bit steep for many. To address this, I&amp;rsquo;ve been experimenting with a more practical and affordable workflow.&lt;/p&gt;
&lt;p&gt;In terms of models, my recommendation is to use &lt;strong&gt;Gemini 3 Flash&lt;/strong&gt; on a pay-as-you-go basis as a replacement.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; Gemini 3 Flash offers incredible value. It&amp;rsquo;s fast, efficient, and costs a fraction of what you&amp;rsquo;d pay for Opus or Sonnet. For the vast majority of tasks, Flash is more than enough.&lt;/p&gt;
&lt;h2 id="the-cost-saving-workflow"&gt;The Cost-Saving Workflow
&lt;/h2&gt;&lt;p&gt;Here is my current &amp;ldquo;budget&amp;rdquo; workflow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Planning &amp;amp; Proposals&lt;/strong&gt;: Use Gemini 3 Flash.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Execution &amp;amp; Building&lt;/strong&gt;: Use the free &lt;strong&gt;GLM 4.7&lt;/strong&gt; (or MiniMax M2.1) via OpenCode. If you have a &lt;a class="link" href="https://svtter.cn/p/2025-10-09-%e6%88%91%e7%8e%b0%e5%9c%a8%e6%9b%b4%e5%a4%9a%e7%9a%84%e4%bd%bf%e7%94%a8-GLM-4.6-%e4%ba%86/" &gt;Zhipu Coding Plan&lt;/a&gt;, that works perfectly too.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Speaking of Gemini 3, we have to talk about &lt;strong&gt;GPT-5.2&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Many engineers still rely on ChatGPT.com directly instead of using a proper coding agent. Regardless of the efficiency debate, the reliability is concerning. From my experience, GPT-5.2&amp;rsquo;s default tone has been tuned to be overly &amp;ldquo;people-pleasing,&amp;rdquo; which might not be ideal for professional developers seeking direct technical feedback.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://svtter.cn/p/%E9%AB%98%E6%95%88%E7%9C%81%E9%92%B1%E6%88%91%E7%9A%84-ai-agent-%E5%B7%A5%E4%BD%9C%E6%B5%81%E9%80%89%E6%8B%A9/pics/image_1767597061665_0.png"
width="1023"
height="930"
srcset="https://svtter.cn/p/%E9%AB%98%E6%95%88%E7%9C%81%E9%92%B1%E6%88%91%E7%9A%84-ai-agent-%E5%B7%A5%E4%BD%9C%E6%B5%81%E9%80%89%E6%8B%A9/pics/image_1767597061665_0_hu_175ada8cb4120ce2.png 480w, https://svtter.cn/p/%E9%AB%98%E6%95%88%E7%9C%81%E9%92%B1%E6%88%91%E7%9A%84-ai-agent-%E5%B7%A5%E4%BD%9C%E6%B5%81%E9%80%89%E6%8B%A9/pics/image_1767597061665_0_hu_c7107e2757a481d7.png 1024w"
loading="lazy"
alt="GPT-5.2 Response Tone"
class="gallery-image"
data-flex-grow="110"
data-flex-basis="264px"
&gt;&lt;/p&gt;
&lt;p&gt;Furthermore, while GPT-5.2 scored impressively on &lt;strong&gt;SWE-bench Verified&lt;/strong&gt;, my real-world experience has been mixed. It&amp;rsquo;s worth looking at the history of SWE-bench:&lt;/p&gt;
&lt;p&gt;Originally proposed by a team from &lt;strong&gt;Princeton University&lt;/strong&gt; (ICLR 2024), it evaluates a model&amp;rsquo;s ability to solve real GitHub issues. However, in August 2024, OpenAI&amp;rsquo;s Preparedness team collaborated with the original authors to create &lt;strong&gt;SWE-bench Verified&lt;/strong&gt; (a subset of 500 manually verified issues). Since OpenAI was involved in the design of this benchmark, their models&amp;rsquo; performance on it should be taken with a grain of salt. While not necessarily a deliberate manipulation, the risk of inherent bias is significant.&lt;/p&gt;
&lt;p&gt;Ultimately, as I often say, &amp;ldquo;Codex&amp;rdquo; models don&amp;rsquo;t always deliver the most practical results in everyday coding.&lt;/p&gt;
&lt;h2 id="opencode-tips"&gt;OpenCode Tips
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Leveraging Agents&lt;/strong&gt;: OpenCode supports launching SubAgents. When debugging complex projects, you can have OpenCode launch agents in different directories to handle front-end and back-end tasks separately, which also helps avoid permission issues.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;OpenSpec: Cross-Agent Collaboration&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;1. OpenCode + Gemini 3 Flash → Generate proposal
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;2. Codex → Code Review
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;3. Claude Code → Secondary Review
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;4. OpenSpec Apply → Final Execution
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;OpenSpec generates reliable specs, but sometimes cheaper models produce lower-quality code. In such cases, you can generate multiple times using the spec and select the best result.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="final-thoughts"&gt;Final Thoughts
&lt;/h2&gt;&lt;p&gt;As AI Agent engineers, we need to adapt to these ongoing trends:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Models are becoming smarter.&lt;/li&gt;
&lt;li&gt;Execution is becoming faster.&lt;/li&gt;
&lt;li&gt;Prices are dropping.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;While these trends are promising, we still need to balance speed, cost, and quality for every task. We might soon see agent systems that automate this balancing act, but for now, it&amp;rsquo;s a crucial part of the engineer&amp;rsquo;s role.&lt;/p&gt;</description></item><item><title>Coding Performance and Model Cost-Effectiveness Analysis</title><link>https://svtter.cn/en/p/coding-performance-and-model-cost-effectiveness-analysis/</link><pubDate>Sat, 03 Jan 2026 00:00:00 +0000</pubDate><guid>https://svtter.cn/en/p/coding-performance-and-model-cost-effectiveness-analysis/</guid><description>&lt;img src="https://svtter.cn/p/%E7%BC%96%E7%A0%81%E6%80%A7%E8%83%BD%E4%B8%8E%E6%A8%A1%E5%9E%8B%E6%80%A7%E4%BB%B7%E6%AF%94%E5%88%86%E6%9E%90/pics/bg-new-v2.jpg" alt="Featured image of post Coding Performance and Model Cost-Effectiveness Analysis" /&gt;&lt;p&gt;This is my analysis report on the coding performance and cost-effectiveness of several models, used to compare the performance and cost efficiency of different models in coding tasks, in order to select the most suitable model.&lt;/p&gt;
&lt;iframe src="model-comparison.pdf" style="width:100%; height:85vh; border:0;"&gt;&lt;/iframe&gt;
&lt;p&gt;For Chinese language tasks, using GLM 4.7 is clearly more cost-effective. The price of 2000 RMB basically covers a year of usage.
The downside is that during peak hours, even the enterprise MAX version can be very slow.&lt;/p&gt;
&lt;p&gt;From my practical experience, the capabilities of minimax m2.1 far exceed those of GLM 4.7.&lt;/p&gt;</description></item><item><title>Third-party Client Performance</title><link>https://svtter.cn/en/p/third-party-client-performance/</link><pubDate>Wed, 19 Nov 2025 17:03:18 +0800</pubDate><guid>https://svtter.cn/en/p/third-party-client-performance/</guid><description>&lt;img src="https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/bg.jpg" alt="Featured image of post Third-party Client Performance" /&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Which is the most expensive model on Silicon Flow?
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;I mean siliconflow.cn
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Help me take a look
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Over the past year, I have attempted to use &lt;a class="link" href="https://github.com/ThinkInAIXYZ/deepchat" target="_blank" rel="noopener"
&gt;deepchat&lt;/a&gt; and large model APIs (such as k2 thinking turbo) to build a relatively private chat tool (or agent assistant) for handling some private data. However, the overall experience has not been great. The large models often provide incorrect answers.&lt;/p&gt;
&lt;p&gt;For search, I used the Bocha API, topped up with 10 credits, to give the large model search functionality.&lt;/p&gt;
&lt;h2 id="test-questions"&gt;Test Questions
&lt;/h2&gt;&lt;p&gt;I feel there are still some issues with context handling (within a single chat window). I briefly tested this question: &lt;code&gt;Which is the most expensive model on Silicon Flow?&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The answer is:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/answ.png"
width="1200"
height="832"
srcset="https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/answ_hu_644beddb6c493bdd.png 480w, https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/answ_hu_b499e0f22dabe05a.png 1024w"
loading="lazy"
class="gallery-image"
data-flex-grow="144"
data-flex-basis="346px"
&gt;&lt;/p&gt;
&lt;h2 id="kimi-k2-thinking-turbo"&gt;Kimi k2 thinking turbo
&lt;/h2&gt;&lt;p&gt;First, deepchat:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/ScreenShot_2025-11-19_171204_032.png"
width="2091"
height="1587"
srcset="https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/ScreenShot_2025-11-19_171204_032_hu_fa430333a9b46287.png 480w, https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/ScreenShot_2025-11-19_171204_032_hu_ce1c8d07e057ac48.png 1024w"
loading="lazy"
class="gallery-image"
data-flex-grow="131"
data-flex-basis="316px"
&gt;&lt;/p&gt;
&lt;p&gt;Hmm, incorrect.&lt;/p&gt;
&lt;p&gt;Then, the official Kimi app:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/ScreenShot_2025-11-19_171256_940.png"
width="2163"
height="1911"
srcset="https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/ScreenShot_2025-11-19_171256_940_hu_556a68f07306dba.png 480w, https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/ScreenShot_2025-11-19_171256_940_hu_3ff5c0eb4ecb44e5.png 1024w"
loading="lazy"
class="gallery-image"
data-flex-grow="113"
data-flex-basis="271px"
&gt;&lt;/p&gt;
&lt;p&gt;Also incorrect.&lt;/p&gt;
&lt;h2 id="trying-deepseek"&gt;Trying deepseek
&lt;/h2&gt;&lt;p&gt;First, let&amp;rsquo;s try the client.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/ds-dc.png"
width="2001"
height="1509"
srcset="https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/ds-dc_hu_b6f304521e91652e.png 480w, https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/ds-dc_hu_1b7a5093b549396.png 1024w"
loading="lazy"
class="gallery-image"
data-flex-grow="132"
data-flex-basis="318px"
&gt;&lt;/p&gt;
&lt;p&gt;Incorrect.&lt;/p&gt;
&lt;p&gt;Then, the official DeepSeek app.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/ds.png"
width="1260"
height="1538"
srcset="https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/ds_hu_520dc23733853495.png 480w, https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/ds_hu_694da5f223dde2dc.png 1024w"
loading="lazy"
class="gallery-image"
data-flex-grow="81"
data-flex-basis="196px"
&gt;&lt;/p&gt;
&lt;p&gt;Very close, and the answer seems reasonable. Unfortunately, it&amp;rsquo;s still incorrect.&lt;/p&gt;
&lt;h2 id="if-we-ask-chatgpt-directly"&gt;If we ask ChatGPT directly
&lt;/h2&gt;&lt;p&gt;Hmm, that&amp;rsquo;s a bit off. Let&amp;rsquo;s try GPT-5.&lt;/p&gt;
&lt;p&gt;Prompt:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/wechat_2025-11-19_171536_131.png"
width="1275"
height="1616"
srcset="https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/wechat_2025-11-19_171536_131_hu_33038f4abee22920.png 480w, https://svtter.cn/p/%E7%AC%AC%E4%B8%89%E6%96%B9%E5%AE%A2%E6%88%B7%E7%AB%AF%E4%B8%8E%E5%A4%A7%E6%A8%A1%E5%9E%8B-api-%E7%BB%93%E5%90%88--%E6%80%A7%E8%83%BD%E5%B0%8F%E6%B5%8B/pics/wechat_2025-11-19_171536_131_hu_81540202970b4134.png 1024w"
loading="lazy"
class="gallery-image"
data-flex-grow="78"
data-flex-basis="189px"
&gt;&lt;/p&gt;
&lt;h2 id="inference---reasons-for-poor-performance"&gt;Speculation: Reasons for Poor Performance
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Insufficient search capability. The Bocha API is the likely culprit.&lt;/li&gt;
&lt;li&gt;Different models may have different optimal hyperparameters. I called the models through the Silicon Flow API, which may not match the settings the official services use.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="conclusion"&gt;Conclusion
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;For this specific problem, ChatGPT still performs better. Compared to before, the official search + model combination also seems to perform better. Therefore, unless the data is particularly sensitive, it&amp;rsquo;s better to use the official service.&lt;/li&gt;
&lt;li&gt;This article is for reference only, just for fun.&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>