Llm on Svtter's Blog

OMP M3 Model Patch: Adding MiniMax M3 to pi-ai

Mon, 01 Jun 2026 12:00:00 +0800

MiniMax released M3 on 2026-06-01 (minimax/minimax-m3-20260531 on OpenRouter), but the upstream models.json shipped by @oh-my-pi/pi-ai@15.7.3 hadn’t been updated to include it. This post documents the patch I applied to add M3 support across all five provider endpoints.

Target File

1

~/.bun/install/global/node_modules/@oh-my-pi/pi-ai/src/models.json

Provider Entries Added (5)

All entries are appended at the end of their respective provider object, mirroring the structure of the existing MiniMax-M2.7 entry.

1. `minimax` (Official Anthropic-compatible API)

key: MiniMax-M3
api: anthropic-messages
baseUrl: https://api.minimax.io/anthropic
contextWindow: 204800, maxTokens: 131072
cost: input 0.3, output 1.2, cacheRead 0.06, cacheWrite 0.375
thinking: budget mode, minimal..xhigh

2. `minimax-cn` (Official Anthropic-compatible API, China)

key: MiniMax-M3
api: anthropic-messages
baseUrl: https://api.minimaxi.com/anthropic
Same context/cost/thinking as minimax

3. `minimax-code` (Coding Plan, OpenAI-compatible)

key: MiniMax-M3
api: openai-completions
baseUrl: https://api.minimax.io/v1
cost: all 0 (Coding Plan flat-rate)
compat: supportsStore=false, supportsDeveloperRole=false, supportsReasoningEffort=false, reasoningContentField=reasoning_content
thinking: effort mode, minimal..high

4. `minimax-code-cn` (Coding Plan CN)

Mirror of minimax-code with baseUrl: https://api.minimaxi.com/v1 and provider minimax-code-cn.

5. `openrouter` (OpenRouter Passthrough)

key: minimax/minimax-m3-20260531
api: openai-completions
baseUrl: https://openrouter.ai/api/v1
cost: input 0.3, output 1.2, cacheRead 0.05, cacheWrite 0
thinking: effort mode, minimal..high

Verification

Searching for "MiniMax-M3|minimax-m3" in the patched file returns exactly 5 hits — one per provider block.

Caveats

omp update will overwrite the patch. Re-apply after updates, or pin the package version.
If upstream later ships an official M3 entry, our local copy may diverge (custom pricing/context) until the next update.
Pricing values for M3 were inferred from the M2.7 template and the OpenRouter listing ($0.30 / $1.20). Confirm against the official MiniMax pricing page if cost accuracy matters.
Context window (204800) and maxTokens (131072) mirror M2.7 — adjust if M3 differs at GA.

Addendum (2026-06-02): The proper route via OMP user config

The pi-ai patch above is a hack — any omp update re-pulls the package and the patch is gone. The proper OMP way is ~/.omp/agent/models.yml: a user-level file that OMP merges on top of the built-in catalog, with no bun-global dependency, and which omp update leaves alone.

Final config

Append to ~/.omp/agent/models.yml:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20


# MiniMax M3 Code Plan
# Set MINIMAX_API_KEY in ~/.zshenv first
 minimax:
 baseUrl: https://api.minimaxi.com/anthropic
 apiKey: MINIMAX_API_KEY
 api: anthropic-messages
 authHeader: true
 disableStrictTools: true
 models:
 - id: MiniMax-M3
 name: MiniMax M3
 reasoning: true
 input: [text, image]
 contextWindow: 1000000
 maxTokens: 16384
 cost:
 input: 0
 output: 0
 cacheRead: 0
 cacheWrite: 0

apiKey: MINIMAX_API_KEY follows OMP’s resolution rule: try the value as an env-var name first, then fall back to a literal. I export MINIMAX_API_KEY=$MINIMAX_CODE_PLAN_KEY in ~/.zshenv, so the key is sourced at runtime and the dotfile stays clean for git.

Key choices

Why anthropic-messages, not openai-completions: M3 speaks both protocols. The openai-completions route had two friction points:

OMP’s openai-completions transport emits developer role + reasoning_effort for reasoning models. MiniMax’s schema check is stricter than OpenAI’s, and an empty reasoning field occasionally 400s
After switching to anthropic-messages, tool calls and streaming reasoning go through the Anthropic SDK normalization path — same as kimi and claude

Why disableStrictTools: true: The Anthropic SDK sends strict: true on every tool definition by default. Third-party Anthropic-fronted gateways (MiniMax, kimi, etc.) usually don’t recognize the field and 400. The kimi provider in the same file already sets this flag. The trade-off is that tool schemas are not server-side validated, so prompts have to carry the schema discipline.

Context 1M / maxTokens 16K: contextWindow: 1000000 matches OpenRouter’s spec for minimax/minimax-m3-20260531 (M2.7 was 204800, M3 is 5× that). maxTokens: 16384 carries over from M2.7 — I couldn’t find an official M3 number. cost is all zero because the Code Plan is flat-rate.

Switching to it

1
2
3
4
5


# At launch
omp --model minimax/MiniMax-M3

# Or in the TUI
/model minimax/MiniMax-M3

After the switch, /status should show ANTHROPIC_BASE_URL pointing at api.minimaxi.com/anthropic.

How the two routes compose

Dimension	pi-ai bundled `models.json` patch	`models.yml` custom provider
Persistence	`omp update` wipes it	Persistent
Cross-machine sync	No (bun-global path)	Yes (dotfile in git)
Upgrade cost	Re-apply patch	OMP merges automatically
Merge with built-in	Yes	Yes, last-write-wins

The two compose. models.yml providers enter through OMP’s “custom” channel; whatever pi-ai later ships in its bundled list (if M3 lands upstream) enters through the “built-in” channel. When both define the same provider/model with different baseUrl, OMP’s last-write-wins rule means models.yml always wins — which is exactly what you want for a CN endpoint override.

How kimi-code Handles kimi-k2.6: A Comparison with OpenCode

Wed, 27 May 2026 10:30:00 +0800

Recently, kimi-code migrated from Python to TypeScript. Here’s a quick analysis.

Based on my review of the kimi-code source code (particularly packages/kosong/src/providers/kimi.ts, kimi-schema.ts, kimi-files.ts, etc.) and relevant OpenCode compatibility issues, here are the kimi-k2.6-specific optimizations in kimi-code and how they differ from OpenCode.

1. Native Kimi Provider (Not a Generic OpenAI-compatible Layer)

kimi-code does not treat Kimi as “just another OpenAI-compatible endpoint.” Instead, it implements a dedicated kimi provider type:

Feature	kimi-code	OpenCode
Provider Type	Dedicated `'kimi'` type with independent adapter	Accessed via generic OpenAI/Anthropic bridge
Proprietary Fields	Native handling of `reasoning_content`, `thinking`, `generationKwargs`	`reasoning_content` often lost in the bridge layer
Auth Headers	Supports `kimiRequestHeaders`, `X-Msh-Tool-Call-Id`, and other Moonshot-specific headers	Generic header forwarding

2. Full Lifecycle Handling of `reasoning_content`

kimi-k2.6 has thinking enabled by default and requires reasoning_content to be preserved across multi-turn conversation history. Otherwise, tool calls will result in a 400 error.

How kimi-code handles it:

convertMessage: Extracts internal think content parts and serializes them into the reasoning_content field, ensuring thinking content is never lost in message history
Streaming Parser: Explicitly extracts delta.reasoning_content / message.reasoning_content in both _convertStreamResponse and _convertNonStreamResponse
TUI Rendering: A dedicated ThinkingComponent renders thinking content in real time, with expand/collapse support and a spinner animation

OpenCode’s Problem:

The OpenCode Go bridge drops reasoning_content on the second turn, causing the Moonshot API to return:

1

thinking is enabled but reasoning_content is missing in assistant tool call message

3. JSON Schema Normalization (`kimi-schema.ts`)

Moonshot’s tool parameter validator has strict and unique requirements for JSON Schema. This is one of the primary sources of incompatibility between OpenCode and kimi-k2.6.

What kimi-code’s normalizeKimiToolSchema does:

Dereferences $ref: Inlines definitions from $defs / definitions, eliminating external references
Fills in missing type: The Kimi validator rejects nested property schemas that omit type (e.g., MCP-generated enum-only schemas). kimi-code infers and backfills type: string/object/array, etc.
Circular reference detection: Preserves the original $ref when a circular reference is detected, avoiding infinite recursion

OpenCode’s Problem:

Generated schemas use #/definitions/ instead of the #/$defs/ format required by Moonshot, and lack schema type inference and backfilling for Kimi, causing complex tool calls to fail with 400.

4. Native Thinking Mode Configuration System

kimi-code has built-in support for Kimi’s thinking mode from the configuration layer all the way to the UI:

Config Parsing: ThinkingConfigSchema supports mode: auto/on/off and effort: low/medium/high/xhigh/max
Model Capability Tags: ModelAlias supports capabilities: ['thinking', 'always_thinking']
Model Selector UI: Press ←→ to toggle thinking on/off; always-on models cannot be turned off

Provider Method: withThinking(effort) correctly generates:

1
2
3
4


{
 "reasoning_effort": "high",
 "extra_body": { "thinking": { "type": "enabled" } }
}

Token Budget: Automatically normalizes legacy max_tokens to Kimi’s preferred max_completion_tokens

OpenCode’s Problem:

When using the Anthropic bridge, it hardcodes thinking content blocks, but the Kimi API only supports text/image_url/video_url/video, resulting in:

1

Invalid value: thinking. Supported values are: 'text','image_url','video_url' and 'video'.

5. Native Moonshot Service Integration

kimi-code includes Moonshot-exclusive services instead of relying on generic local implementations:

MoonshotFetchURLProvider: Prioritizes Moonshot’s coding-fetch service (with built-in page text extraction), falling back to local fetch only on failure
MoonshotWebSearchProvider: Calls the Moonshot search API directly, supporting enable_page_crawling
KimiFiles: Uploads videos to the Moonshot file service, returning video_url in the ms://<file-id> format

6. Tool Call Layer Details

Built-in Functions: Tool names starting with $ are recognized as Kimi builtin functions and serialized as type: 'builtin_function'
Usage Extraction: Supports Moonshot’s proprietary choices[0].usage placement, as well as cached_tokens and other fields
Finish Reason Mapping: Maps OpenAI-style stop/tool_calls/length values to an internal unified enum

7. CLI Core and LLM SDK Architectural Isolation

This is an easily overlooked but important architectural difference.

The core CLI of kimi-code (apps/kimi-code) does not directly depend on any OpenAI or Anthropic TypeScript SDK. Looking at its package.json, the core dependencies are only generic libraries like TUI rendering (pi-tui), CLI parsing (commander), and syntax highlighting (cli-highlight). All LLM provider interactions are isolated within the self-developed kosong package.

While packages/kosong internally uses openai and @anthropic-ai/sdk as implementation details (since the Kimi API is OpenAI-compatible), it exposes a unified LLM abstraction interface to the outside. The CLI core only depends on kosong and has no awareness of underlying vendor SDKs.

OpenCode is different. Its packages/opencode core package directly depends on a large number of vendor SDKs:

@ai-sdk/openai
@ai-sdk/anthropic
@ai-sdk/google
@ai-sdk/azure
@openrouter/ai-sdk-provider
… (more than a dozen provider-specific packages in total)

This means OpenCode’s core code is deeply coupled with each vendor’s SDK, while kimi-code’s core CLI stays clean, with all model interactions fully isolated through a self-developed abstraction layer.

8. What Commit History Reveals About Evolution Paths

The structural code differences above are just a static snapshot. What’s more interesting is comparing the commit histories of the two projects—their dynamic evolution directions are completely different.

kimi-code: Native Design, Continuously Reducing Configuration Burden

842e699 — “Kimi For Coding” (Initial Commit)

This was the starting point of the entire project. The initial code already included:

packages/kosong/src/providers/kimi.ts: Dedicated Kimi provider
packages/kosong/src/providers/kimi-schema.ts: Dedicated JSON Schema normalizer
packages/kosong/src/providers/kimi-files.ts: Dedicated file upload service

Conclusion: kimi-code treated the Kimi API as a first-class citizen from day one, not as a later patch.

d95b013 fix(catalog): preserve reasoning fields in custom model (#70)

This commit fixed a very subtle issue. models.dev uses the interleaved field to mark reasoning support, but early code treated interleaved=true as undefined, causing models selected via /connect to silently lose their reasoning capability.

Fixes:

interleaved=true is mapped to the default reasoning_content
interleaved is added to the update-catalog.mjs allowlist; otherwise the offline catalog in release builds would silently drop the field again

61f7d0e fix(kosong): make openai-compatible thinking work without reasoning_key (#78)

This is the core commit for reasoning handling, showcasing kimi-code’s deep thinking on compatibility. The diff reveals a three-layer design:

Inbound Auto-Scan (response parsing)

1
2


const KNOWN_REASONING_KEYS = ['reasoning_content', 'reasoning_details', 'reasoning'] as const;
// Auto-scan three fields; first string value wins

Outbound Default Write-Back (request serialization)

1
2


const DEFAULT_OUTBOUND_REASONING_KEY = KNOWN_REASONING_KEYS[0]; // 'reasoning_content'
// Defaults to writing back as reasoning_content, no user config needed

Auto-Inject reasoning_effort (historical continuity)

1
2


// When history contains ThinkPart but caller hasn't explicitly set reasoning_effort,
// auto-inject 'medium' to prevent strict gateways like One API / DeepSeek from returning 400

Edge cases are handled meticulously: blank reasoning_key ("") is normalized to undefined; values explicitly set by the caller via withGenerationKwargs are not silently overwritten by auto-injection.

The verification goal explicitly states:

Manually verified end-to-end against the real DeepSeek API with a hand-written config.toml that does not set reasoning_key: thinking content renders, no 400, multi-turn conversations work.

OpenCode: Generic Layer Design, OpenAI-centric

eb84f46 fix(llm): split OpenAI reasoning summary blocks (#29000)

This commit demonstrates OpenCode’s completely different approach to reasoning—designed around the OpenAI Responses API:

Maintains a state machine for encrypted_content and item_reference
Folds multiple summary parts by item_id + summary_index
When store:false, filters out reasoning items lacking encrypted_content

This is completely different from Kimi’s reasoning_content mechanism. Kimi does not need encrypted_content or item_reference; it simply attaches a reasoning_content field to the message.

A Hard Fact

OpenCode Issue #26331 “Bug: OpenCode Go bridge layer incompatible with kimi-k2.6 tool calls” — Status: still open
OpenCode Issue #27054 “KIMI K2.6 showing error in Opencode GO” — Status: closed, but the resolution was to disable MCP (a workaround)

The last comment on #27054:

The workaround is to disable your MCP and then initiate the session

That’s not a fix. That’s avoiding the problem.

Commit History Comparison Summary

Dimension	kimi-code	OpenCode
Initial Design	Initial commit includes full Kimi provider + schema normalizer + file service	Generic multi-model architecture, adapted later via bridge
Reasoning Mechanism	Designed around `reasoning_content` field, with auto-scan / write-back / effort injection	Designed around OpenAI Responses’ `encrypted_content` + `item_reference`
Schema Handling	Dedicated `normalizeKimiToolSchema`, dereferences `$ref` + backfills `type`	Generic schema validation, focused on friendly error messages
Config Philosophy	Makes OpenAI-compatible gateways “zero-config” by auto-inferring all fields	Relies on users manually adapting via bridge/config
Issue Status	Continuously shipping reasoning-related patches (#70, #78)	kimi-k2.6 compatibility issue #26331 still open

Summary: Core Differences

Dimension	kimi-code	OpenCode
Architecture Positioning	Native design for Kimi/Moonshot, dedicated provider	Generic multi-model agent, adapted via bridge
Thinking/Reasoning	Native support, full lifecycle preservation of `reasoning_content`	Easily lost in bridge layer, causing 400 errors
JSON Schema	Dedicated `normalizeKimiToolSchema` for dereferencing and type backfilling	Generic schema generation, does not meet Kimi validator requirements
API Format	Directly generates Moonshot-native format (including `thinking` config, `$defs` normalization, etc.)	Transformed through OpenAI/Anthropic protocol conversion, causing format mismatches
Service Integration	Built-in Moonshot fetch/search/file services	Uses generic local tools
Core Dependencies	CLI core does not directly depend on vendor SDKs; isolated via self-developed `kosong` package	Core package directly coupled with `@ai-sdk/openai` and more than a dozen other vendor SDKs

Looking at commit history, kimi-code’s evolution is directed at continuously eliminating user configuration burden (reasoning_key went from required → optional override → auto-inferred; interleaved went from filtered → correctly mapped), while OpenCode’s evolution is directed at deepening OpenAI ecosystem integration (Responses API, encrypted reasoning, item reference), leaving Kimi adaptation stuck at the generic bridge layer.

That’s the truth at the commit level: one is native evolution, the other is a bridge gap.

Does Self-Hosting an LLM Really Let You Use It Without Limits?

Thu, 19 Mar 2026 12:30:00 +0800

Many people start thinking seriously about self-hosting an LLM not because of technical romance, but because API bills, rate limits, or compliance requirements have started to collide with real business constraints.

So a very natural question shows up: if the model runs on your own machine, does that mean you can finally use it without limits?

My answer is: no. Self-hosting a model does not mean unlimited freedom. It mostly means that many of the constraints and costs previously absorbed by the platform are now transferred to you.

But there is a more useful second question: once usage gets large enough, can self-hosting actually become cheaper?

The answer is: possibly, but under stricter conditions than many people expect.

In short: self-hosting an LLM does not mean unlimited freedom.

It means taking on part of the cost and responsibility that a platform would normally absorb. Self-hosting becomes financially attractive only when load stays high, utilization remains strong, and you can either accept model trade-offs or optimize the stack yourself.

Local deployment does not mean no limits

Let us clear up the most common misunderstanding first.

Many people interpret “the model runs on my own machine” as “I can now use it however I want.” In reality, the limits do not disappear. They simply show up in a different form.

The first limit is hardware.

Parameter count, VRAM capacity, quantization level, KV cache, and concurrency are real physical constraints. Even a quantized 70B model still puts serious pressure on memory and bandwidth. Being able to run it does not mean it runs comfortably. Getting output does not mean latency and throughput are acceptable.

The second limit is model capability itself.

Hallucinations, knowledge cutoffs, long-context degradation, and unstable reasoning do not vanish just because the model sits on your own server. Deployment location does not change the model’s ceiling. More importantly, most so-called self-hosting setups use open-weight models, not the actual closed models behind systems like Claude or GPT.

The third limit is responsibility transfer.

When you use an API, content safety, service stability, rate limiting, and much of the infrastructure burden are partially handled by the provider. Once you self-host, those problems do not go away. They become your monitoring, your operations, your review pipeline, and your incident response.

So self-hosting is not “use without limits.” It is “you own the boundaries.”

The real calculation is not just the price of a GPU

If you want to know whether self-hosting is worth it, the real comparison is not “how much does the card cost?” but these two larger accounts.

The annual cost of self-hosting can be written roughly like this:

1

Annual self-hosting cost = hardware depreciation + electricity + network / hosting + operations labor + redundancy for failures

The annual API cost is more direct:

1

Annual API cost = average daily token usage * price per million tokens * 365

That looks simple, but three details are often ignored.

Self-hosting is not a one-time hardware purchase. Electricity, spare parts, hosting conditions, alerting, upgrades, and maintenance all keep happening.
API pricing is not a single fixed number. Model choice, input-output ratio, cache hit rate, and tool usage can all change the final bill significantly.
Utilization is easy to underestimate. If your machine sits idle most of the time, a low per-inference cost means very little. On the other hand, if the workload is stable and the hardware stays busy, the financial case for self-hosting becomes much stronger.

So the numbers below should be read as rough order-of-magnitude guidance, not as a procurement quote.

A rough but useful breakeven table

To keep the discussion simple, let us start with a deliberately rough set of assumptions:

API pricing is estimated at roughly CNY 50 per million tokens
token usage counts both input and output together
local hardware is depreciated over 3 years
self-hosting cost includes baseline power and operations overhead
the local setup mainly assumes open-weight model inference, not strict parity with top closed models
this does not include training, fine-tuning, or a dedicated platform team

Under those assumptions, you get a rough picture like this:

Scenario	Daily token usage	Likely local setup	Annual self-hosting cost	Annual API cost	Rough conclusion
Light usage	500K	Single high-end consumer workstation	CNY 20K - 40K	about CNY 9K	API is cheaper
Medium usage	5M	Dual-GPU or small inference workstation	CNY 60K - 120K	about CNY 91K	Near breakeven
Heavy usage	50M	Multi-GPU server or cluster	CNY 400K - 800K	about CNY 912K	Self-hosting may be cheaper

If you want local quality to get as close as possible to top-tier closed models, this table usually moves upward again, because stronger models, more VRAM, and higher availability targets all push infrastructure and operations costs higher.

This table points to three things.

Individuals and small teams usually do not save money with self-hosting. If your workload is only a few hundred thousand tokens per day, APIs are still usually the more economical option. You spend less on hardware and avoid carrying the operations burden.
The real breakeven point tends to appear only in consistently high-usage scenarios. Not one occasional spike, but a workload that stays high day after day. Only then can hardware cost be spread efficiently enough.
The larger the usage, the more attractive self-hosting becomes financially. That is why large companies invest seriously in inference platforms. It is not because they enjoy complexity. It is because once the scale is large enough, the math really changes.

One critical condition: you may not be comparing the same thing

The biggest problem in many “self-hosting is cheaper than API” discussions is not the arithmetic. It is that the compared products are often not equivalent.

On the API side, you may be buying access to a top-tier closed model. On the local side, you may be running a quantized open-weight model. Both are called “LLMs,” but they are not the same product in a strict sense.

That means:

if open-weight quality is acceptable for your use case, self-hosting may indeed save a lot of money
if your quality bar is high and you depend on the best closed models, the room for self-hosting becomes much smaller
if you compare a cheaper model to a more expensive model, the result is not just a deployment conclusion, but also a model-selection conclusion

Put differently, many people think they are calculating deployment cost when they are actually accepting a capability downgrade first.

There is nothing wrong with that trade-off, but it should be stated clearly.

What self-hosting gives you besides cost savings

If a company still chooses to self-host after doing the math, it is usually not only about saving API money.

Data control. Some businesses simply do not want raw data flowing through third-party providers for long-term operational or compliance reasons. Local deployment makes the compliance and audit path easier to manage.
Customization. You can optimize around your own tasks with quantization, routing, distillation, fine-tuning, and tighter integration into internal systems. Standard APIs usually give you less freedom here.
A more predictable cost ceiling. API pricing scales directly with usage. When the business grows, the bill grows with it. Self-hosting has a large upfront investment, but under high and stable load, the cost curve is often easier to predict.
Offline operation and availability. If your environment requires internal-only deployment, or if you cannot accept key workflows depending entirely on external services, local deployment may simply fit the engineering requirements better.

A more practical decision framework

If you do not want to model every variable from day one, start with these three questions.

Is your workload consistently high over time? If you only see occasional spikes rather than sustained token usage every day, APIs are often still the better choice because you are not paying for idle hardware.
Can you accept the gap between a local model and a closed flagship model? If your business depends on best-in-class model quality, a large part of the claimed savings may come from lowering model quality rather than from deployment efficiency alone.
Do you actually have the ability to operate an inference service long term? What happens when a GPU fails, drivers conflict, service latency spikes, the model version needs to change, or rate limiting and monitoring need to be built? If nobody owns these questions, the issue is no longer just cost. It becomes a delivery problem.

Conclusion

Back to the original question: does self-hosting an LLM really let you use it without limits?

My answer is still: no.

It does not remove hardware bottlenecks, erase model capability gaps, or magically solve moderation, reliability, and operations work for you. What it gives you is not absolute freedom, but more control and the responsibility that comes with it.

At the same time, self-hosting is absolutely not a fake option. It becomes increasingly reasonable when several conditions are true at once:

your token usage stays high for a long time
the workload is stable and hardware utilization remains high
open-weight models are acceptable, or you already have the ability to optimize them well
data control, internal deployment, or predictable cost ceilings matter to you

If you are an individual, a small team, or just an occasional heavy user, APIs are still usually the more practical answer: less effort, less operational burden, and lower cost of experimentation.

If you are already in the phase where you burn tokens steadily every day, then it is worth calculating the full picture instead of staring only at API unit prices. Very often the answer is not “now I can use it without limits,” but a more grounded question that matters more: is this worth owning yourself?

The Mathematical Trap of Big Model Coding Plan Packages: Can Promised Usage Be Delivered Under Concurrency Limits?

Fri, 23 Jan 2026 11:52:52 +0800

Preface

Recently, several domestic big model manufacturers have launched Coding Plan subscription packages for developers, promoting “low prices for massive usage,” claiming that for just tens to hundreds of RMB per month, you can get “hundreds of billions of tokens” of usage quota.

It sounds wonderful, but as a developer accustomed to speaking with data, I decided to do some calculations: Under concurrency limits, can these promised usage amounts really be consumed?

Typical Package Structure

Taking the common three-tier packages on the market as an example:

Package	Monthly Fee	Promised Usage (every 5 hours)
Lite	~20 RMB	About 120 prompts
Pro	~100 RMB	About 600 prompts
Max	~200 RMB	About 2,400 prompts

Officials will also add: “Each prompt is expected to call the model 15-20 times, with a total monthly usage of up to tens to hundreds of billions of tokens.”

It seems like incredible value, but the devil is in the details.

Key Limitation: Concurrency

Most manufacturers’ documentation will casually mention: “Package usage is subject to concurrency limits (number of in-flight request tasks).”

But what exactly is the limit? Often not explicitly stated. According to community feedback and actual measurements, typical concurrency limits are as follows:

Package	Concurrency (in-flight requests)
Lite	2
Pro	~4-5
Max	~7

This number directly determines your actual throughput ceiling.

Math Time: Can the Max Package Use 2,400 Prompts?

Let’s take the highest-tier Max package as an example and do a simple calculation.

Known Conditions

Promised Usage: 2,400 prompts every 5 hours
Concurrency Limit: 7
Model calls triggered per prompt: 15-20 times (official data)
Model generation speed: About 50-60 tokens/second
5 hours = 18,000 seconds

Calculation Process

Step 1: Estimate single API call time

A complete API call includes:

Input processing: ~1 second
Model inference generation (assuming 500 tokens output): 500 ÷ 55 ≈ 9 seconds
Network round-trip delay: ~1 second

Total: About 10-12 seconds/call

Step 2: Calculate maximum calls in 5 hours

1
2
3


Maximum calls = Concurrency × (Total time ÷ Single call time)
 = 7 × (18,000 ÷ 10)
 = 12,600 calls

Step 3: Convert to prompts

According to official claims, each prompt triggers 15-20 calls:

1

Completable prompts = 12,600 ÷ 17.5 ≈ 720 prompts

Conclusion

Metric	Official Promise	Concurrency Limit	Achievement Rate
Prompts per 5 hours	2,400	~720	30%

Even under ideal conditions, the actual usable amount of the Max package is only about 30% of the promise.

Harsher Reality: Call Inflation in Agent Mode

The above calculation is still based on the official claim of “15-20 calls per prompt.” But in actual AI Coding Agent scenarios (like Claude Code, Cline, etc.), the situation is much worse.

How Agent Mode Works

When you give an AI programming assistant a task, it typically:

Analyzes requirements, creates a plan
Reads relevant files (each file may trigger a call)
Writes code
Runs tests
Discovers errors, fixes them
Repeats 3-5 until successful

A seemingly simple prompt may trigger 50-100+ model calls in an Agent loop.

Actual Measurement Case

User feedback:

“2 simple prompts, 80 seconds, consumed 38M Tokens, used up 97% of the 5-hour limit”

Reverse calculation:

Each prompt consumes about 19M tokens
If calculated at 128K context, equivalent to ~127 model calls/prompt

This is 6-8 times higher than the official “15-20 times.”

Revised Actual Usable Amount

Scenario	Calls per prompt	Usable prompts in 5 hours	Achievement Rate
Official ideal	17.5	720	30%
Light usage	50	252	10.5%
Moderate usage	75	168	7%
Heavy Agent usage	100+	<126	<5%

Why Is This Happening?

1. Token Calculation Includes Context

Big model token consumption isn’t just output, it includes input. In Coding scenarios:

Each call must send complete conversation history
Code project context can easily reach tens of K tokens
128K context window means each call may consume 100K+ tokens

2. Concurrency is a Hard Constraint

Regardless of how large your package quota is, concurrency determines the maximum throughput per unit time. This is a physical bottleneck, not something commercial strategies can bypass.

3. Promises Based on Ideal Assumptions

Manufacturers’ promotional numbers are often based on:

Each call uses only small context
Each prompt triggers only a few calls
Users won’t use continuously at high intensity

But these assumptions rarely hold true in real AI Coding scenarios.

A Table to See the Truth

Taking the Max package (~200 RMB/month) as an example:

Metric	Official Promotion	Theoretical Limit	Actual Expectation
Prompts per 5 hours	2,400	720	150-400
Monthly prompts	345,600	103,680	21,600-57,600
Monthly tokens	“Hundreds of billions”	~10 billion	1-3 billion
Achievement Rate	100%	30%	5-17%

Advice for Developers

1. Don’t Be Fooled by “Hundreds of Billions of Tokens”

Token count is a highly misleading metric. In Coding Agent scenarios, context takes up the majority, with truly effective output tokens possibly only 1-5%.

2. Focus on Concurrency

This is the core metric that determines actual experience. If manufacturers don’t disclose concurrency limits, it’s likely because the numbers don’t look good.

3. Calculate Cost per Prompt

1

Actual cost per prompt = Monthly fee ÷ Actual usable prompts

Taking the Max package as an example:

Official promotion: 200 ÷ 345,600 = 0.0006 RMB/prompt
Actual situation: 200 ÷ 30,000 = 0.007 RMB/prompt

A 10x difference.

4. Consider Pay-as-You-Go

If your usage isn’t high, pay-as-you-go may be more cost-effective than monthly packages. At least you won’t pay for “unusable quotas.”

Conclusion

The emergence of big model Coding Plan packages is itself a good thing, lowering the barrier for developers to use AI programming assistants. But when choosing packages, be sure to:

Require manufacturers to disclose concurrency limits
Calculate throughput limits yourself
Don’t be misled by the big numbers of “hundreds of billions of tokens”

After all, promised usage that can’t be consumed equals a disguised price increase.

This article is based on public information and mathematical derivation; specific values may vary due to manufacturer adjustments. Readers are advised to verify through actual measurements.

Efficient and Cost-Effective: My AI Agent Workflow Choice

Mon, 05 Jan 2026 16:00:00 +0800

Claude Code’s $100/month price tag is a bit steep for many. To address this, I’ve been experimenting with a more practical and affordable workflow.

In terms of models, my recommendation is to use Gemini 3 Flash on an as-needed (pay-as-you-go) basis as a replacement.

Why? Gemini 3 Flash offers incredible value. It’s fast, efficient, and costs a fraction of what you’d pay for Opus or Sonnet. For the vast majority of tasks, Flash is more than enough.

The Cost-Saving Workflow

Here is my current “budget” workflow:

Planning & Proposals: Use Gemini 3 Flash.
Execution & Building: Use the free GLM 4.7 (or MiniMax M2.1) via OpenCode. If you have a Zhipu Coding Plan, that works perfectly too.

Speaking of Gemini 3, we have to talk about GPT-5.2.

Many engineers still rely on ChatGPT.com directly instead of using a proper coding agent. Regardless of the efficiency debate, the reliability is concerning. From my experience, GPT-5.2’s default tone has been tuned to be overly “people-pleasing,” which might not be ideal for professional developers seeking direct technical feedback.

Furthermore, while GPT-5.2 scored impressively on SWE-bench Verified, my real-world experience has been mixed. It’s worth looking at the history of SWE-bench:

Originally proposed by a team from Princeton University (ICLR 2024), it evaluates a model’s ability to solve real GitHub issues. However, in August 2024, OpenAI’s Preparedness team collaborated with the original authors to create SWE-bench Verified (a subset of 500 manually verified issues). Since OpenAI was involved in the design of this benchmark, their models’ performance on it should be taken with a grain of salt. While not necessarily a deliberate manipulation, the risk of inherent bias is significant.

Ultimately, as I often say, “Codex” models don’t always deliver the most practical results in everyday coding.

OpenCode Tips

Leveraging Agents: OpenCode supports launching SubAgents. When debugging complex projects, you can have OpenCode launch agents in different directories to handle front-end and back-end tasks separately, which also helps avoid permission issues.
OpenSpec: Cross-Agent Collaboration:
1 2 3 4

1. OpenCode + Gemini 3 Flash → Generate proposal 2. Codex → Code Review 3. Claude Code → Secondary Review 4. OpenSpec Apply → Final Execution
OpenSpec generates reliable specs, but sometimes cheaper models produce lower-quality code. In such cases, you can generate multiple times using the spec and select the best result.

Final Thoughts

As AI Agent engineers, we need to adapt to these ongoing trends:

Models are becoming smarter.
Execution is becoming faster.
Prices are dropping.

While these trends are promising, we still need to balance speed, cost, and quality for every task. We might soon see agent systems that automate this balancing act, but for now, it’s a crucial part of the engineer’s role.

Coding Performance and Model Cost-Effectiveness Analysis

Sat, 03 Jan 2026 00:00:00 +0000

This is my analysis report on the coding performance and cost-effectiveness of several models, used to compare the performance and cost efficiency of different models in coding tasks, in order to select the most suitable model.

For Chinese language tasks, using GLM 4.7 is clearly more cost-effective. The price of 2000 RMB basically covers a year of usage. The downside is that during peak hours, even the enterprise MAX version can be very slow.

From my practical experience, the capabilities of minimax m2.1 far exceed those of GLM 4.7.

Third-party Client Performance

Wed, 19 Nov 2025 17:03:18 +0800

1
2
3


Which is the most expensive model on Silicon Flow?
I mean siliconflow.cn
Help me take a look

Over the past year, I have attempted to use deepchat and large model APIs (such as k2 thinking turbo) to build a relatively private chat tool (or agent assistant) for handling some private data. However, the overall experience has not been great. The large models often provide incorrect answers.

For search capabilities, I used the bocha API, resetting 10 credits to provide search functionality for the large model.

Test Questions

I feel there are still some issues with context handling (within a single chat window). I briefly tested this question: Which is the most expensive model on Silicon Flow?.

The answer is:

Kimi k2 thinking turbo

First, deepchat:

Hmm, incorrect.

Then, kimi official:

Also incorrect.

Trying deepseek

First, let’s try the client.

Incorrect.

Then, deepseek official.

Very close, and the answer seems reasonable. Unfortunately, it’s still incorrect.

If we ask ChatGPT directly

Hiss, a bit off. Let’s try gpt-5.

Prompt:

Inference - Reasons for Poor Performance

Insufficient search capability. The Bocha API is to blame.
Different models may have different optimal hyperparameters for best performance. I called the large model API from Silicon Flow.

Conclusion

For this specific problem, ChatGPT still performs better. Compared to before, the official search + model combination also seems to perform better. Therefore, unless the data is particularly sensitive, it’s better to use the official service.
This article is for reference only, just for fun.

Llm on Svtter's Blog

OMP M3 Model Patch: Adding MiniMax M3 to pi-ai

Target File

Provider Entries Added (5)

1. minimax (Official Anthropic-compatible API)

2. minimax-cn (Official Anthropic-compatible API, China)

3. minimax-code (Coding Plan, OpenAI-compatible)

4. minimax-code-cn (Coding Plan CN)

5. openrouter (OpenRouter Passthrough)

Verification

Caveats

Addendum (2026-06-02): The proper route via OMP user config

Final config

Key choices

Switching to it

How the two routes compose

How kimi-code Handles kimi-k2.6: A Comparison with OpenCode

1. Native Kimi Provider (Not a Generic OpenAI-compatible Layer)

2. Full Lifecycle Handling of reasoning_content

3. JSON Schema Normalization (kimi-schema.ts)

4. Native Thinking Mode Configuration System

5. Native Moonshot Service Integration

6. Tool Call Layer Details

7. CLI Core and LLM SDK Architectural Isolation

8. What Commit History Reveals About Evolution Paths

kimi-code: Native Design, Continuously Reducing Configuration Burden

OpenCode: Generic Layer Design, OpenAI-centric

A Hard Fact

Commit History Comparison Summary

Summary: Core Differences

Does Self-Hosting an LLM Really Let You Use It Without Limits?

Local deployment does not mean no limits

The real calculation is not just the price of a GPU

A rough but useful breakeven table

One critical condition: you may not be comparing the same thing

What self-hosting gives you besides cost savings

A more practical decision framework

Conclusion

The Mathematical Trap of Big Model Coding Plan Packages: Can Promised Usage Be Delivered Under Concurrency Limits?

Preface

Typical Package Structure

Key Limitation: Concurrency

Math Time: Can the Max Package Use 2,400 Prompts?

Known Conditions

Calculation Process

Conclusion

Harsher Reality: Call Inflation in Agent Mode

How Agent Mode Works

Actual Measurement Case

Revised Actual Usable Amount

Why Is This Happening?

1. Token Calculation Includes Context

2. Concurrency is a Hard Constraint

3. Promises Based on Ideal Assumptions

A Table to See the Truth

Advice for Developers

1. Don’t Be Fooled by “Hundreds of Billions of Tokens”

2. Focus on Concurrency

3. Calculate Cost per Prompt

4. Consider Pay-as-You-Go

Conclusion

Efficient and Cost-Effective: My AI Agent Workflow Choice

The Cost-Saving Workflow

OpenCode Tips

Final Thoughts

Coding Performance and Model Cost-Effectiveness Analysis

Third-party Client Performance

Test Questions

Kimi k2 thinking turbo

Trying deepseek

If we ask ChatGPT directly

Inference - Reasons for Poor Performance

Conclusion

1. `minimax` (Official Anthropic-compatible API)

2. `minimax-cn` (Official Anthropic-compatible API, China)

3. `minimax-code` (Coding Plan, OpenAI-compatible)

4. `minimax-code-cn` (Coding Plan CN)

5. `openrouter` (OpenRouter Passthrough)

2. Full Lifecycle Handling of `reasoning_content`

3. JSON Schema Normalization (`kimi-schema.ts`)