Why Agents?

I’ve always had a question: Why do we need agent frameworks? Aren’t large models enough on their own? This article reflects my current understanding of the subject.

After using several tools extensively and participating in multiple agent projects recently, I’ve reached some conclusions.

The Limitations of LLMs

The primary reason for using agents is the inherent limitations of LLMs.

The first and most important limitation is the context window, a point that langchain/subagent makes explicitly. Many modern models have much larger context windows than before (GPT-4 Turbo at 128K tokens, Claude 3.5 Sonnet at 200K, Gemini 1.5 Pro at up to 2M), but these are still insufficient for truly complex tasks. Processing a massive codebase or analyzing hundreds of documents, for example, quickly exhausts these limits. Even when the input does fit, processing an extremely long context is both expensive and slow.
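To make the "quickly exhausts these limits" point concrete, here is a minimal sketch that estimates how many separate model calls a set of documents would need. The 4-characters-per-token ratio and the function names `estimate_tokens` and `pack_into_windows` are assumptions for illustration; a real project would count tokens with the model's own tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the characters-per-token ratio is an assumption."""
    return max(1, round(len(text) / chars_per_token))


def pack_into_windows(documents: list[str], window_tokens: int) -> list[list[str]]:
    """Greedily group documents so each group's estimated size fits one window."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for doc in documents:
        cost = estimate_tokens(doc)
        if cost > window_tokens:
            raise ValueError("a single document exceeds the window; split it first")
        if current and used + cost > window_tokens:
            # The current batch is full: close it and start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += cost
    if current:
        batches.append(current)
    return batches


# Five documents of roughly 100K estimated tokens each, against a 200K window.
docs = ["x" * 400_000] * 5
batches = pack_into_windows(docs, window_tokens=200_000)
print(len(batches))  # number of separate model calls needed
```

Even with a 200K window, five such documents no longer fit in one call, which is where splitting the work across agents starts to pay off.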

Beyond context, there are other capability gaps:

Vision Capabilities: While modern VLMs (Vision Language Models) are powerful, traditional CV (Computer Vision) models often perform better in specific scenarios. Additionally, some models (like DeepSeek-V3) don’t have native vision capabilities.
Resource Access: LLMs cannot directly interact with databases, file systems, or network services.
Specialized Tools: Tools for code execution, complex mathematics, or data analysis are only accessible to an LLM through an interface such as function calling or a protocol like MCP.
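The resource-access and tool gaps above are usually closed by letting the model emit a structured tool call that the agent runtime executes on its behalf. The following is a minimal sketch of that dispatch step. The JSON shape (`name` plus `arguments`) and the tools `add` and `count_lines` are hypothetical choices for illustration, not any particular framework's API.

```python
import json
from typing import Callable

# Registry of functions the agent runtime is willing to run for the model.
TOOLS: dict[str, Callable[..., object]] = {}


def tool(fn):
    """Register a function under its own name (hypothetical helper)."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def add(a: float, b: float) -> float:
    return a + b


@tool
def count_lines(text: str) -> int:
    return len(text.splitlines())


def run_tool_call(raw: str) -> str:
    """Execute a tool call the model emitted as JSON and return a JSON reply."""
    call = json.loads(raw)
    name = call.get("name")
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        result = TOOLS[name](**call.get("arguments", {}))
    except TypeError as exc:
        # Wrong or missing arguments are reported back instead of crashing.
        return json.dumps({"error": str(exc)})
    return json.dumps({"result": result})


# Stand-in for text produced by the model.
model_output = '{"name": "add", "arguments": {"a": 2, "b": 3}}'
print(run_tool_call(model_output))
```

The reply string is fed back into the model's context, so errors are returned as data and the model gets a chance to correct its call.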

What Agents Can Do

Beyond addressing the limitations above, here are some practical ways agents add value.

Domain-Specific Text Processing

Agents can process different text segments (contexts) independently.

Context Optimization: Agents can compress or selectively provide context, effectively extending the usable context window.
Performance Gains: An LLM within an agent can focus on a single, specific task, leading to better performance. When given too much text, LLMs often struggle to identify key information; smaller, targeted context makes this much easier.
Specialized Knowledge: LLMs are trained on general data. To make an agent a domain expert, we can inject specific knowledge directly into its context.
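As a rough illustration of selectively providing context, the sketch below picks only the chunks that share words with the question and stops once a character budget is used up. Keyword overlap is a deliberately simple stand-in for whatever retrieval method a real agent would use (embeddings, for instance), and `select_context` and the sample chunks are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercased word set; crude, but enough for an overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_context(question: str, chunks: list[str], budget_chars: int) -> list[str]:
    """Return the most relevant chunks that fit within budget_chars."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    selected: list[str] = []
    used = 0
    for chunk in ranked:
        if not q & tokenize(chunk):
            break  # everything after this point shares no words with the question
        if used + len(chunk) <= budget_chars:
            selected.append(chunk)
            used += len(chunk)
    return selected


chunks = [
    "Billing: invoices are issued on the first day of each month.",
    "Auth: API keys rotate every 90 days.",
    "Billing: refunds for invoices take five business days.",
]
picked = select_context("When are invoices issued?", chunks, budget_chars=80)
prompt = "Answer using only this context:\n" + "\n".join(picked)
print(prompt)
```

The same function covers the domain-knowledge case: the chunks can just as well be passages from an internal handbook that the base model never saw during training.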

Visual Capability Integration

Through agents, we can integrate traditional vision models to handle tasks that LLMs struggle with. For example, an MCP (Model Context Protocol) server can bridge an agent with vision capabilities.

A notable example is Zhipu’s Vision MCP. Using this MCP in conjunction with an agent significantly enhances visual processing power. This highlights the value of MCP servers that integrate specialized services.
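For a sense of what such a bridge looks like on the wire: MCP messages are JSON-RPC 2.0, and a tool is invoked with the `tools/call` method. The sketch below builds one request and reads the text parts of a response. The tool name `describe_image`, its `image_path` argument, and the hand-written response are hypothetical; they are not taken from Zhipu's server, whose actual tool names should be read from its own tool listing.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids


def tools_call_request(tool_name: str, arguments: dict) -> str:
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })


def text_from_result(response: str) -> str:
    """Join the text parts of a tools/call result, or raise on a JSON-RPC error."""
    msg = json.loads(response)
    if "error" in msg:
        raise RuntimeError(msg["error"].get("message", "tool call failed"))
    parts = msg["result"].get("content", [])
    return "\n".join(p["text"] for p in parts if p.get("type") == "text")


# Hypothetical tool name and argument, for illustration only.
request = tools_call_request("describe_image", {"image_path": "chart.png"})

# A hand-written response standing in for what a vision server might return.
fake_response = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "A bar chart with four bars."}]},
})
print(text_from_result(fake_response))
```

The agent only ever sees the returned text, which is how a model without native vision can still reason about an image.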

Agent Frameworks

Pydantic AI: I find this particularly useful because it integrates Pydantic models into the agent framework, making it much easier to debug. I’ve tested its integration with Qwen3.
LangChain: I haven’t used this in production, only for basic debugging. The API changes frequently, which makes it hard to keep code up to date. One minor pain point is prompt handling, which I solved with Jinja templates; the “LangChain way” is to use its PromptTemplate classes instead.