I’ve long had a question: why do we need agent frameworks? Aren’t large models enough on their own? This article reflects my current understanding of the subject.
After using several tools extensively and participating in multiple agent projects recently, I’ve reached some conclusions.
The Limitations of LLMs
The primary reason for using agents is the inherent limitations of LLMs.
First and foremost is the context window, as explicitly mentioned in langchain/subagent. Although many modern models have significantly expanded context windows (GPT-4 Turbo at 128K tokens, Claude 3.5 Sonnet at 200K, Gemini 1.5 Pro at up to 2M), these are still insufficient for truly complex tasks. For example, processing a massive codebase or analyzing hundreds of documents quickly exhausts these limits. Furthermore, processing extremely long contexts is both expensive and slow.
Beyond context, there are other capability gaps:
- Vision Capabilities: While modern VLMs (Vision Language Models) are powerful, traditional CV (Computer Vision) models often perform better in specific scenarios. Additionally, some models (like DeepSeek-V3) don’t have native vision capabilities.
- Resource Access: LLMs cannot directly interact with databases, file systems, or network services.
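- Specialized Tools: Tools for code execution, complex mathematics, or data analysis are not available to an LLM by default. They have to be exposed through a tool-calling interface, such as function calling or a protocol like MCP (Model Context Protocol).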
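To make the resource and tool gaps above concrete, here is a minimal sketch of what the agent side of tool use can look like. It assumes the model emits a tool call as a JSON string with "name" and "arguments" fields; that format, the TOOLS registry, and the two example tools are hypothetical and not taken from MCP or any particular framework.

```python
import json
from typing import Callable

# Hypothetical tool registry: maps a tool name to a plain Python function.
TOOLS: dict[str, Callable[..., object]] = {}


def tool(fn):
    """Register a function so the agent runtime can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def add(a: float, b: float) -> float:
    return a + b


@tool
def word_count(text: str) -> int:
    return len(text.split())


def dispatch(tool_call_json: str) -> str:
    """Execute one tool call emitted by the model and return the result as text."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    result = TOOLS[name](**call.get("arguments", {}))
    return json.dumps({"result": result})


# Stand-in for text a model might produce when it wants to use a tool.
model_output = '{"name": "add", "arguments": {"a": 2, "b": 3}}'
print(dispatch(model_output))
```

The model only ever sees and produces text. The runtime around it is what actually touches the database, file system, or calculator, and then feeds the result back as more text.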
What Agents Can Do
Beyond addressing the limitations above, here are some practical ways agents add value.
Domain-Specific Text Processing
Agents can process different text segments (contexts) independently.
- Context Optimization: Agents can compress or selectively provide context, effectively extending the usable context window.
- Performance Gains: An LLM within an agent can focus on a single, specific task, leading to better performance. When given too much text, LLMs often struggle to identify key information; smaller, targeted context makes this much easier.
- Specialized Knowledge: LLMs are trained on general data. To make an agent a domain expert, we can inject specific knowledge directly into its context.
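The context optimization idea can be sketched in a few lines. This is a deliberately naive version: it scores each chunk by keyword overlap with the query and keeps the best ones that fit a word budget. The function name select_context, the overlap scoring, and the word-based budget are my own assumptions for illustration; real systems usually use embeddings and token counts instead.

```python
def select_context(query: str, chunks: list[str], budget_words: int) -> list[str]:
    """Pick the chunks most relevant to the query without exceeding a word budget."""
    query_terms = set(query.lower().split())

    def score(chunk: str) -> int:
        # Naive relevance: number of distinct query terms that appear in the chunk.
        return len(query_terms & set(chunk.lower().split()))

    selected, used = [], 0
    # sorted() is stable, so chunks with equal scores keep their original order.
    for chunk in sorted(chunks, key=score, reverse=True):
        size = len(chunk.split())
        if score(chunk) > 0 and used + size <= budget_words:
            selected.append(chunk)
            used += size
    return selected


chunks = [
    "The billing service retries failed payments three times.",
    "Office plants are watered on Fridays.",
    "Failed payments are logged to the billing audit table.",
]
print(select_context("why do failed payments retry", chunks, budget_words=12))
```

With a budget of 12 words only the first chunk fits; raising the budget to 20 also admits the third one, and the unrelated chunk is never sent to the model.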
Visual Capability Integration
Through agents, we can integrate traditional vision models to handle tasks that LLMs struggle with. For example, an MCP (Model Context Protocol) server can bridge an agent to a vision model.
A notable example is Zhipu’s Vision MCP. Using this MCP in conjunction with an agent significantly enhances visual processing power. This highlights the value of MCP servers that integrate specialized services.
Further Reading
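What people usually call an Agent is, much of the time, really just a Workflow. Using the two concepts interchangeably leads to a lot of detours in product design and technology selection.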
— 一泽Eze (@eze_is_1) October 27, 2025
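Anthropic gives a very clear distinction. The core difference is:
When the system executes a task, is the path preset by code (Code-Driven), or does the LLM dynamically decide the next step itself (LLM-Driven)? The former is a Workflow, the latter is…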
Agents and workflows allow LLMs to use tools. While the input and output remain text, the nature of what that text represents has changed. The creator of the text is no longer necessarily a human.
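The code-driven versus LLM-driven distinction is easy to see in a small sketch. Everything here is hypothetical: the two steps are toy string functions, and fake_llm is a hard-coded stand-in for a real model call, used only so the example runs without any API.

```python
from typing import Callable


def summarize(text: str) -> str:
    # Toy "summary": keep only the first sentence.
    return text.split(".")[0] + "."


def shout(text: str) -> str:
    return text.upper()


STEPS: dict[str, Callable[[str], str]] = {"summarize": summarize, "shout": shout}


def run_workflow(text: str) -> str:
    """Workflow: the code fixes the path in advance (summarize, then shout)."""
    return shout(summarize(text))


def run_agent(text: str, decide: Callable[[str, list[str]], str], max_steps: int = 5) -> str:
    """Agent: a decision function (the LLM in a real system) picks each next step."""
    history: list[str] = []
    for _ in range(max_steps):
        action = decide(text, history)
        if action == "stop":
            break
        text = STEPS[action](text)
        history.append(action)
    return text


def fake_llm(text: str, history: list[str]) -> str:
    """Hypothetical stand-in for a model call: summarize only if the text is long."""
    if len(text) > 40 and "summarize" not in history:
        return "summarize"
    return "stop"


doc = "Agents choose their own steps. Workflows follow a path fixed in code."
print(run_workflow(doc))
print(run_agent(doc, fake_llm))
```

In run_workflow the sequence of steps never changes, whatever the input. In run_agent the same steps are available, but which ones run, and when to stop, depends on what the decision function returns at each turn.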
Agent Frameworks
- Pydantic AI: I find this particularly useful because it integrates Pydantic models into the agent framework, making it much easier to debug. I’ve tested its integration with Qwen3.
- LangChain: I haven’t used this in production, only for basic debugging. The API changes frequently, which can be challenging. One minor issue is prompt handling; I used Jinja to solve this. Alternatively, the “LangChain way” involves using PromptTemplates.