Why Agent

Tue, 30 Sep 2025 11:54:06 +0800

I’ve always had a question: Why do we need agent frameworks? Aren’t large models enough on their own? This article reflects my current understanding of the subject.

After using several tools extensively and participating in multiple agent projects recently, I’ve reached some conclusions.

The Limitations of LLMs

The primary reason for using agents is the inherent limitations of LLMs.

First and foremost is the context window, as explicitly mentioned in langchain/subagent. Although many modern models have much larger context windows than before (GPT-4 Turbo 128K, Claude 3.5 Sonnet 200K, Gemini 1.5 Pro up to 2M), these are still insufficient for truly complex tasks. Processing a massive codebase or analyzing hundreds of documents, for example, quickly exhausts these limits. Very long contexts are also expensive and slow to process.
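As a rough illustration of why large inputs have to be split up, here is a minimal sketch that groups documents into batches that each fit a token budget. The 4-characters-per-token ratio and the names estimate_tokens and batch_by_budget are assumptions made for this example; a real agent would use the model's own tokenizer.

```python
from typing import Iterable

CHARS_PER_TOKEN = 4  # rough heuristic (assumption); real tokenizers differ


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def batch_by_budget(docs: Iterable[str], budget: int) -> list[list[str]]:
    """Group documents into batches whose estimated size fits the budget."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for doc in docs:
        cost = estimate_tokens(doc)
        if cost > budget:
            raise ValueError("a single document exceeds the budget; split it first")
        # Start a new batch when adding this document would overflow the budget.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each batch can then be handed to a separate model call or sub-agent, which is one way an agent works around a fixed context window.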

Beyond context, there are other capability gaps:

Vision Capabilities: While modern VLMs (Vision Language Models) are powerful, traditional CV (Computer Vision) models often perform better in specific scenarios. Additionally, some models (like DeepSeek-V3) don’t have native vision capabilities.
Resource Access: LLMs cannot directly interact with databases, file systems, or network services.
Specialized Tools: Tools for code execution, complex mathematics, or data analysis are only available to an LLM when they are exposed through a tool-calling interface or a protocol such as MCP.
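The idea behind exposing tools can be sketched with a small registry: functions are registered by name, and a dispatcher runs whichever one the model asks for. This is a generic illustration, not the MCP wire format; the JSON shape {"name": ..., "arguments": {...}} and the names tool, dispatch, add, and read_text are hypothetical.

```python
import json
from typing import Any, Callable

# Registry of functions the agent is allowed to call, keyed by name.
TOOLS: dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so the agent can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def add(a: float, b: float) -> float:
    """Example math tool."""
    return a + b


@tool
def read_text(path: str) -> str:
    """Example resource-access tool: read a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def dispatch(tool_call_json: str) -> Any:
    """Run a tool call written as JSON (hypothetical format for this sketch)."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))
```

For example, dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}') calls add(2, 3). The model only ever produces the JSON text; the agent code is what actually touches the file system or runs the computation.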

What Agents Can Do

Beyond addressing the limitations above, here are some practical ways agents add value.

Domain-Specific Text Processing

Agents can process different text segments (contexts) independently.

Context Optimization: Agents can compress or selectively provide context, effectively extending the usable context window.
Performance Gains: An LLM within an agent can focus on a single, specific task, leading to better performance. When given too much text, LLMs often struggle to identify key information; smaller, targeted context makes this much easier.
Specialized Knowledge: LLMs are trained on general data. To make an agent a domain expert, we can inject specific knowledge directly into its context.
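Selective context can be sketched very simply: score each chunk by how many words it shares with the query and keep the best ones that fit a size limit. Keyword overlap is a stand-in chosen for brevity (a real system would more likely use embeddings), and the names tokenize and select_context are made up for this example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_context(query: str, chunks: list[str], max_chars: int) -> list[str]:
    """Keep the chunks that overlap most with the query, within max_chars."""
    q = tokenize(query)
    # Stable sort: chunks with equal overlap keep their original order.
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    picked: list[str] = []
    used = 0
    for chunk in ranked:
        if not q & tokenize(chunk):
            break  # everything after this has zero overlap
        if used + len(chunk) > max_chars:
            continue  # too large for the remaining space; try smaller chunks
        picked.append(chunk)
        used += len(chunk)
    return picked
```

The model then sees only the selected chunks, which is what makes a small, targeted context possible.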

Visual Capability Integration

Through agents, we can integrate traditional vision models to handle tasks that LLMs struggle with, for example by using an MCP (Model Context Protocol) server to bridge an agent with vision capabilities.

A notable example is Zhipu’s Vision MCP. Using this MCP in conjunction with an agent significantly enhances visual processing power. This highlights the value of MCP servers that integrate specialized services.

Agent Frameworks

Pydantic AI: I find this particularly useful because it integrates Pydantic models into the agent framework, making it much easier to debug. I’ve tested its integration with Qwen3.
LangChain: I haven’t used this in production, only for basic debugging. The API changes frequently, which can be challenging. One minor issue is prompt handling; I used Jinja to solve this. Alternatively, the “LangChain way” involves using PromptTemplates.
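To show what keeping prompts in a template looks like, here is a minimal sketch. It uses the standard library's string.Template as a dependency-free stand-in for Jinja; it is not the Jinja or LangChain API, and the template text and the name render_prompt are assumptions for this example.

```python
from string import Template

# The prompt lives in one place, separate from the code that fills it in.
PROMPT = Template(
    "You are a $role.\n"
    "Task: $task\n"
    "Context:\n$context"
)


def render_prompt(role: str, task: str, context_items: list[str]) -> str:
    """Fill the template; substitute() raises KeyError if a field is missing."""
    context = "\n".join(f"- {item}" for item in context_items)
    return PROMPT.substitute(role=role, task=task, context=context)
```

Jinja adds loops and conditionals inside the template itself, which is useful once prompts grow beyond simple placeholders.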

Poor Performance of Large Models on Specific Tasks

Thu, 19 Jun 2025 16:34:32 +0800

Large vision models perform poorly on some specific tasks, but do better when the same information is supplied as formatted text. Here I use locating the reading area of a meter as an example to show how the models perform.

Source Code

https://github.com/Svtter/vl-model/pull/4

Test Tasks

Extract text boxes from the image.
Extract the meter reading area from the image.
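Both tasks ask the model for boxes, so a natural way to score an answer is intersection over union (IoU) between the predicted box and a hand-labeled one. The post does not say which metric was used, so treat this as one possible scoring sketch; the Box type and its (x1, y1, x2, y2) pixel layout are assumptions.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box: (x1, y1) is top-left, (x2, y2) is bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float


def area(b: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, b.x2 - b.x1) * max(0.0, b.y2 - b.y1)


def iou(pred: Box, truth: Box) -> float:
    """Intersection over union, from 0.0 (disjoint) to 1.0 (identical)."""
    overlap = Box(
        max(pred.x1, truth.x1), max(pred.y1, truth.y1),
        min(pred.x2, truth.x2), min(pred.y2, truth.y2),
    )
    inter = area(overlap)
    union = area(pred) + area(truth) - inter
    return inter / union if union > 0 else 0.0
```

A threshold such as IoU of at least 0.5 is a common convention for counting a localization as correct, though the right cutoff depends on how tight the reading area needs to be.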

Test File

We can observe the performance differences among various models from these test results:

Test Results Comparison

Results Using Bounding Boxes as Prompts

Detailed Performance of Each Model

Anthropic Claude 3.5 Sonnet

Google Gemini 2.5 Pro

OpenAI GPT-4o

Analysis Summary

From these test results, we can observe:

Differences in Visual Recognition Capabilities: Different models exhibit significant performance variations when handling the same visual task.
Formatted Text Processing: Models give more stable results when processing structured text than when handling the visual task directly.
Model Characteristics: Each model has its unique strengths and limitations.

These results are a reminder to choose an AI model based on how well it suits the specific type of task.
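To make the "bounding boxes as prompts" idea concrete, here is a minimal sketch that turns detected text boxes into formatted text a model can reason over. The exact prompt used in the tests is not shown in the post, so the line format, the wording, and the name boxes_to_prompt are all assumptions.

```python
def boxes_to_prompt(boxes: list[dict]) -> str:
    """Serialize text boxes as one line each: id | text | x1,y1,x2,y2."""
    lines = ["Each line is one text box: id | text | x1,y1,x2,y2 (pixels)."]
    for i, box in enumerate(boxes):
        x1, y1, x2, y2 = box["bbox"]
        lines.append(f"{i} | {box['text']} | {x1},{y1},{x2},{y2}")
    lines.append("Answer with the id of the box that contains the meter reading.")
    return "\n".join(lines)
```

With this approach a traditional detector or OCR model does the localization, and the language model only has to pick among candidates described in text.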
