AI/人工智能 on Svtter's Blog

sth: An HTML Preview Server for AI Agents

Sat, 09 May 2026 12:00:00 +0800

I’ve open sourced a small tool: static-html, with the command-line name sth.

What it does is simple: it provides an HTTP service that lets you register locally generated HTML files and preview them in a browser.

Why This Tool Is Needed

The problem stems from AI Agent output.

Nowadays I use agents like Claude Code and OpenCode for my work, and they often need to output complex content—code review summaries, comparative analyses, quotations, architecture design documents. When this content is sent to Telegram as plain text, the formatting gets completely messed up, tables become unreadable, and code syntax highlighting is lost.

In short, it’s just a big mess.

The initial approach was to have agents directly generate HTML files locally and open them in a browser. But the problems were:

The agent runs on a server without a graphical interface
Locally generated file paths are unpredictable and management is chaotic
No history—previously sent content can’t be found

So I needed a service where an agent could “send” an HTML file and get back a URL that could be opened in any device’s browser. The agent would handle mobile and PC compatibility.

What sth Does

sth is a lightweight HTTP service written in Go with just two core commands:

1
2
3
4
5


# Start the service
sth start

# Send an HTML file
sth send ./report.html

sth send packages the target HTML file along with resource files from the same directory (CSS, JS, images, etc.) and uploads them, then returns a URL. Opening this URL displays the complete page effect.

In practice, it runs on my intranet development machine, and agents specify the remote address via the --server parameter:

1

sth send ./report.html --server http://dev-1:3939

My Actual Usage

Currently sth mainly runs on my development server, working in tandem with the Hermes Agent.

Hermes is my daily AI assistant running on Telegram. When it needs to output complex content—such as code review conclusions, technical solution comparisons, project quotations—it calls the html-report skill to generate a beautifully formatted HTML file, then sends it to the preview server via sth send, and finally sends me the URL.

The entire workflow is:

1
2
3
4


User question -> Hermes Agent analysis
 -> Generate HTML report (html-report skill)
 -> sth send to preview server
 -> Return URL -> Send to Telegram

This way I can tap the link on my phone and see a well-formatted report instead of a blob of plain text.

Metadata Management

Beyond basic sending and previewing, sth also supports tagging, categorizing, and associating sessions with projects:

1
2
3
4
5


sth tag <session-id> code-review pricing
sth categorize <session-id> "Technical Review"
sth project <session-id> hydrogen-permeation
sth list --project hydrogen-permeation
sth search "quotation" --tag pricing

This feature solves a practical problem: over time, sent reports accumulate. Through tags and project categorization, you can quickly find previous outputs.

The difference between list and search is: list matches metadata fields exactly, while search performs full-text search. They can be used in combination.

Technical Details

Language: Go 1.24+
Storage: SQLite (github.com/mattn/go-sqlite3, requires CGO)
Deployment: Single binary file, just manage with systemd
Build: go build -o dist/sth ./cmd/html-server

It’s just that simple, no unnecessary dependencies.

Open Source

This tool was previously a private repo, but I just made it public today: sun-praise/static-html.

If you’re also using AI Agents for daily development work and have encountered the problem where “complex agent output can’t be read in chat tools,” give sth a try. It’s lightweight enough and does what it needs to do.

DeepSeek + Claude Code: Thinking Block Compatibility Analysis

Thu, 30 Apr 2026 15:00:00 +0800

Problem Description

When using DeepSeek models (such as deepseek-v4-flash) directly in Claude Code with extended thinking enabled, multi-turn conversations trigger a 400 error:

1

Bad Request: {"error":{"message":"The content[].thinking in the thinking mode must be passed back to the API.","type":"invalid_request_error","param":null,"code":"invalid_request_error"}}

Root Cause Analysis

Call Chain

1

Claude Code → DeepSeek Anthropic Compatible Endpoint (https://api.deepseek.com/anthropic)

Protocol Incompatibility

According to the DeepSeek Anthropic API Compatibility Documentation, the compatibility status is as follows:

Message Field	Support Status
`content[].thinking`	✅ Supported
`content[].redacted_thinking`	❌ Not Supported

In extended thinking mode during multi-turn conversations, Claude Code faithfully passes back all thinking blocks from the previous round (including redacted_thinking types) to the API as-is. DeepSeek does not recognize redacted_thinking, hence the 400 error.

Additionally, DeepSeek’s thinking block format differs from Anthropic’s native protocol, and the replay logic in tool_use scenarios is not fully compatible either.

Core Conflict

Anthropic API requirement: In extended thinking mode, content[].thinking and content[].redacted_thinking must be passed back unchanged
DeepSeek compatibility layer: Only supports thinking, does not support redacted_thinking
Claude Code behavior: Hard-coded according to Anthropic protocol, does not distinguish between target endpoint types

Community Feedback

This is a widespread community issue that almost all CC agent/router projects have encountered:

Issue	Project	Title
#1	cc-use	DeepSeek Thinking Mode Error: `content[].thinking` Must Be Passed Back
#878	openclaude	DeepSeek V4: reasoning_content must be passed back (400) on tool_calls
#1355	claude-code-router	CCR 代理 deepseek V4 思考时返回 400
#4543	new-api	ClaudeCode 接入 DeepSeek V4 遇到 400 reasoning_content 报错
#355	9router	DeepSeek API Error 400 – Missing reasoning_content
#16748	hermes-agent	DeepSeek /anthropic: stripped thinking blocks cause HTTP 400 on replay
#2414	cc-switch	Claude 使用 cc-switch 配置 deepseek-v4-pro，无法识别字段
#174	cc-haha	/compact 命令在使用 DeepSeek API 时无法工作

DeepSeek Official Response

Zero response. Nor is there any need to respond.

First, DeepSeek has no public API issue repository. All feedback occurs in third-party projects without any DeepSeek official personnel participating in any discussions.
Second, whether to use Anthropic as a compatibility standard, I think DeepSeek should be hesitant.

Temporary Workarounds

Disable extended thinking — When using DeepSeek in CC, turn off thinking mode
Use proxy filtering — Add a proxy layer between CC and DeepSeek to filter out redacted_thinking blocks
Switch models — Use DeepSeek for non-thinking scenarios and Anthropic native models for thinking scenarios

Why Doesn’t OpenCode Have This Problem?

OpenCode (opencode-ai/opencode) naturally avoids this problem architecturally, not through a dedicated “fix”.

The key lies in the convertMessages method in internal/llm/provider/anthropic.go (lines 60-119):

When building assistant messages, it only passes back TextContent (text) and ToolCall (tool calls)
Completely ignores ReasoningContent (thinking content), not putting it in messages
thinking content is only displayed in the UI through stream thinking_delta events and is not passed back to the API

Comparison with Claude Code’s behavior:

	Claude Code	OpenCode
thinking replay	✅ Faithfully replay all thinking blocks (including redacted_thinking)	❌ Do not replay thinking blocks
architectural reason	Follow Anthropic API specification, requires unchanged replay	Self-managed conversation state, thinking only for UI display
DeepSeek compatibility	❌ Triggers 400 (redacted_thinking not recognized)	✅ Not affected (doesn’t pass thinking at all)

Conclusion: OpenCode avoids the problem at the cost of not following Anthropic’s extended thinking specification. This approach is friendly to third-party compatible endpoints like DeepSeek, but if Anthropic native thinking context retention capability is needed in the future, re-implementation may be necessary.

Does Not Replay Thinking Blocks Affect DeepSeek Performance?

Basically no, reasons:

thinking blocks are the model’s internal scratchpad, not final output. The text replies and tool calls in the conversation history already retain key decisions and conclusions
DeepSeek’s reasoning is closer to OpenAI’s mode — each round is generated independently, unlike Anthropic’s strong reliance on cross-round replay to maintain reasoning coherence
OpenCode’s extensive actual use also confirms this — community users run multi-turn conversations using DeepSeek thinking mode in OpenCode without feedback about reasoning quality degradation

The truly potentially affected extreme scenario: in ultra-long multi-turn tasks, the model may repeat conclusions it has already reasoned through. However, in most actual use, the impact is negligible.

CC itself has similar thinking block replay bugs on Anthropic models (not DeepSeek-specific):

Issue	Title	Status
#10199	API Error 400 - Thinking Block Modification Error	Open (oncall)
#51985	thinking block missing in multi-turn conversations	Open
#20692	thinking blocks order error on first tool use	Open (oncall)
#54482	Thinking blocks stripped from context every turn (Opus 4.7)	Open

How to Fix DeepSeek Model Reasoning Issues in OpenCode

Fri, 24 Apr 2026 12:23:58 +0800

When using deepseek-reasoner, we often encounter this problem:

1

The reasoning_content' in the thinking mode must be passed back to the API.

Update

Both issues have now been officially resolved by opencode. Users only need to install the latest version of opencode and use it through the deepseek provider, without additional configuration.

1
2
3
4
5
6


Issue 1
The reasoning_content' in the thinking mode must be passed back to the API.

Issue 2
Bad Request: {"error":{"message":"The content[].thinking in the thinking mode must be passed back to the
API.","type":"invalid_request_error","param":null,"code":"invalid_request_error"}}

Both issues have been officially resolved. Install version 1.14.29 or above.

The old solution follows:

How to solve it? It’s straightforward.

How to Configure

Add provider information to your configuration:

.config/opencode/opencode.json or .config/opencode/opencode.jsonc

Modify the provider section to:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40


{
 "provider": {
 "deepseek": {
 "npm": "@ai-sdk/anthropic",
 "name": "DeepSeek",
 "options": {
 "baseURL": "https://api.deepseek.com/anthropic",
 "apiKey": "<apikey>"
 },
 "models": {
 "deepseek-v4-pro": {
 "name": "DeepSeek-V4-Pro",
 "limit": {
 "context": 1048576,
 "output": 262144
 },
 "options": {
 "thinking": {
 "type": "enabled",
 "budgetTokens": 8192
 }
 }
 },
 "deepseek-v4-flash": {
 "name": "DeepSeek-V4-Flash",
 "limit": {
 "context": 1048576,
 "output": 262144
 },
 "options": {
 "thinking": {
 "type": "enabled",
 "budgetTokens": 8192
 }
 }
 }
 }
 }
 }
}

How to Use

Select the deepseek model.

The result.

Supplement

This method cannot solve this problem

Bad Request: {"error":{"message":"The content[].thinking in the thinking mode must be passed back to the API.","type":"invalid_request_error","param":null,"code":"invalid_request_error"}}

If you encounter this problem, you need to wait for opencode to fix it.

Related article: DeepSeek + Claude Code: Thinking Block Compatibility Issue Analysis — Analyzes the root cause of 400 errors triggered by multi-turn conversations in extended thinking mode when using DeepSeek with Claude Code, along with community solutions.

Does Self-Hosting an LLM Really Let You Use It Without Limits?

Thu, 19 Mar 2026 12:30:00 +0800

Many people start thinking seriously about self-hosting an LLM not because of technical romance, but because API bills, rate limits, or compliance requirements have started to collide with real business constraints.

So a very natural question shows up: if the model runs on your own machine, does that mean you can finally use it without limits?

My answer is: no. Self-hosting a model does not mean unlimited freedom. It mostly means that many of the constraints and costs previously absorbed by the platform are now transferred to you.

But there is a more useful second question: once usage gets large enough, can self-hosting actually become cheaper?

The answer is: possibly, but under stricter conditions than many people expect.

In short: self-hosting an LLM does not mean unlimited freedom.

It means taking on part of the cost and responsibility that a platform would normally absorb. Self-hosting becomes financially attractive only when load stays high, utilization remains strong, and you can either accept model trade-offs or optimize the stack yourself.

Local deployment does not mean no limits

Let us clear up the most common misunderstanding first.

Many people interpret “the model runs on my own machine” as “I can now use it however I want.” In reality, the limits do not disappear. They simply show up in a different form.

The first limit is hardware.

Parameter count, VRAM capacity, quantization level, KV cache, and concurrency are real physical constraints. Even a quantized 70B model still puts serious pressure on memory and bandwidth. Being able to run it does not mean it runs comfortably. Getting output does not mean latency and throughput are acceptable.

The second limit is model capability itself.

Hallucinations, knowledge cutoffs, long-context degradation, and unstable reasoning do not vanish just because the model sits on your own server. Deployment location does not change the model’s ceiling. More importantly, most so-called self-hosting setups use open-weight models, not the actual closed models behind systems like Claude or GPT.

The third limit is responsibility transfer.

When you use an API, content safety, service stability, rate limiting, and much of the infrastructure burden are partially handled by the provider. Once you self-host, those problems do not go away. They become your monitoring, your operations, your review pipeline, and your incident response.

So self-hosting is not “use without limits.” It is “you own the boundaries.”

The real calculation is not just the price of a GPU

If you want to know whether self-hosting is worth it, the real comparison is not “how much does the card cost?” but these two larger accounts.

The annual cost of self-hosting can be written roughly like this:

1

Annual self-hosting cost = hardware depreciation + electricity + network / hosting + operations labor + redundancy for failures

The annual API cost is more direct:

1

Annual API cost = average daily token usage * price per million tokens * 365

That looks simple, but three details are often ignored.

Self-hosting is not a one-time hardware purchase. Electricity, spare parts, hosting conditions, alerting, upgrades, and maintenance all keep happening.
API pricing is not a single fixed number. Model choice, input-output ratio, cache hit rate, and tool usage can all change the final bill significantly.
Utilization is easy to underestimate. If your machine sits idle most of the time, a low per-inference cost means very little. On the other hand, if the workload is stable and the hardware stays busy, the financial case for self-hosting becomes much stronger.

So the numbers below should be read as rough order-of-magnitude guidance, not as a procurement quote.

A rough but useful breakeven table

To keep the discussion simple, let us start with a deliberately rough set of assumptions:

API pricing is estimated at roughly CNY 50 per million tokens
token usage counts both input and output together
local hardware is depreciated over 3 years
self-hosting cost includes baseline power and operations overhead
the local setup mainly assumes open-weight model inference, not strict parity with top closed models
this does not include training, fine-tuning, or a dedicated platform team

Under those assumptions, you get a rough picture like this:

Scenario	Daily token usage	Likely local setup	Annual self-hosting cost	Annual API cost	Rough conclusion
Light usage	500K	Single high-end consumer workstation	CNY 20K - 40K	about CNY 9K	API is cheaper
Medium usage	5M	Dual-GPU or small inference workstation	CNY 60K - 120K	about CNY 91K	Near breakeven
Heavy usage	50M	Multi-GPU server or cluster	CNY 400K - 800K	about CNY 912K	Self-hosting may be cheaper

If you want local quality to get as close as possible to top-tier closed models, this table usually moves upward again, because stronger models, more VRAM, and higher availability targets all push infrastructure and operations costs higher.

This table points to three things.

Individuals and small teams usually do not save money with self-hosting. If your workload is only a few hundred thousand tokens per day, APIs are still usually the more economical option. You spend less on hardware and avoid carrying the operations burden.
The real breakeven point tends to appear only in consistently high-usage scenarios. Not one occasional spike, but a workload that stays high day after day. Only then can hardware cost be spread efficiently enough.
The larger the usage, the more attractive self-hosting becomes financially. That is why large companies invest seriously in inference platforms. It is not because they enjoy complexity. It is because once the scale is large enough, the math really changes.

One critical condition: you may not be comparing the same thing

The biggest problem in many “self-hosting is cheaper than API” discussions is not the arithmetic. It is that the compared products are often not equivalent.

On the API side, you may be buying access to a top-tier closed model. On the local side, you may be running a quantized open-weight model. Both are called “LLMs,” but they are not the same product in a strict sense.

That means:

if open-weight quality is acceptable for your use case, self-hosting may indeed save a lot of money
if your quality bar is high and you depend on the best closed models, the room for self-hosting becomes much smaller
if you compare a cheaper model to a more expensive model, the result is not just a deployment conclusion, but also a model-selection conclusion

Put differently, many people think they are calculating deployment cost when they are actually accepting a capability downgrade first.

There is nothing wrong with that trade-off, but it should be stated clearly.

What self-hosting gives you besides cost savings

If a company still chooses to self-host after doing the math, it is usually not only about saving API money.

Data control. Some businesses simply do not want raw data flowing through third-party providers for long-term operational or compliance reasons. Local deployment makes the compliance and audit path easier to manage.
Customization. You can optimize around your own tasks with quantization, routing, distillation, fine-tuning, and tighter integration into internal systems. Standard APIs usually give you less freedom here.
A more predictable cost ceiling. API pricing scales directly with usage. When the business grows, the bill grows with it. Self-hosting has a large upfront investment, but under high and stable load, the cost curve is often easier to predict.
Offline operation and availability. If your environment requires internal-only deployment, or if you cannot accept key workflows depending entirely on external services, local deployment may simply fit the engineering requirements better.

A more practical decision framework

If you do not want to model every variable from day one, start with these three questions.

Is your workload consistently high over time? If you only see occasional spikes rather than sustained token usage every day, APIs are often still the better choice because you are not paying for idle hardware.
Can you accept the gap between a local model and a closed flagship model? If your business depends on best-in-class model quality, a large part of the claimed savings may come from lowering model quality rather than from deployment efficiency alone.
Do you actually have the ability to operate an inference service long term? What happens when a GPU fails, drivers conflict, service latency spikes, the model version needs to change, or rate limiting and monitoring need to be built? If nobody owns these questions, the issue is no longer just cost. It becomes a delivery problem.

Conclusion

Back to the original question: does self-hosting an LLM really let you use it without limits?

My answer is still: no.

It does not remove hardware bottlenecks, erase model capability gaps, or magically solve moderation, reliability, and operations work for you. What it gives you is not absolute freedom, but more control and the responsibility that comes with it.

At the same time, self-hosting is absolutely not a fake option. It becomes increasingly reasonable when several conditions are true at once:

your token usage stays high for a long time
the workload is stable and hardware utilization remains high
open-weight models are acceptable, or you already have the ability to optimize them well
data control, internal deployment, or predictable cost ceilings matter to you

If you are an individual, a small team, or just an occasional heavy user, APIs are still usually the more practical answer: less effort, less operational burden, and lower cost of experimentation.

If you are already in the phase where you burn tokens steadily every day, then it is worth calculating the full picture instead of staring only at API unit prices. Very often the answer is not “now I can use it without limits,” but a more grounded question that matters more: is this worth owning yourself?