<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>VLM on Svtter's Blog</title><link>https://svtter.cn/en/tags/vlm/</link><description>Recent content in VLM on Svtter's Blog</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Tue, 30 Sep 2025 11:54:06 +0800</lastBuildDate><atom:link href="https://svtter.cn/en/tags/vlm/index.xml" rel="self" type="application/rss+xml"/><item><title>Why Agent</title><link>https://svtter.cn/en/p/why-agent/</link><pubDate>Tue, 30 Sep 2025 11:54:06 +0800</pubDate><guid>https://svtter.cn/en/p/why-agent/</guid><description>&lt;img src="https://svtter.cn/p/why-agent/pics/why-agent-background.svg" alt="Featured image of post Why Agent" /&gt;&lt;p&gt;I&amp;rsquo;ve always had a question: Why do we need agent frameworks? Aren&amp;rsquo;t large models enough on their own? This article reflects my current understanding of the subject.&lt;/p&gt;
&lt;p&gt;After using several tools extensively and participating in multiple agent projects recently, I&amp;rsquo;ve reached some conclusions.&lt;/p&gt;
&lt;h2 id="the-limitations-of-llms"&gt;The Limitations of LLMs
&lt;/h2&gt;&lt;p&gt;The primary reason to use agents is that LLMs have inherent limitations.&lt;/p&gt;
&lt;p&gt;First and foremost is the &lt;strong&gt;context window&lt;/strong&gt;, as explicitly mentioned in &lt;a class="link" href="https://docs.langchain.com/oss/python/deepagents/subagents#why-use-subagents%3F" target="_blank" rel="noopener"
&gt;langchain/subagent&lt;/a&gt;. Although many modern models have much larger context windows than before (GPT-4 Turbo at 128K tokens, Claude 3.5 Sonnet at 200K, Gemini 1.5 Pro at up to 2M), these windows are still too small for truly complex tasks. For example, processing a massive codebase or analyzing hundreds of documents quickly exhausts these limits. Furthermore, processing extremely long contexts is both expensive and slow.&lt;/p&gt;
&lt;p&gt;Beyond context, there are other capability gaps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Vision Capabilities&lt;/strong&gt;: While modern VLMs (Vision Language Models) are powerful, traditional CV (Computer Vision) models often perform better in specific scenarios. Additionally, some models (like DeepSeek-V3) don&amp;rsquo;t have native vision capabilities.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resource Access&lt;/strong&gt;: LLMs cannot directly interact with databases, file systems, or network services.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Specialized Tools&lt;/strong&gt;: Tools for code execution, complex mathematics, or data analysis are only accessible to an LLM through a tool-calling interface, such as a protocol like MCP (Model Context Protocol).&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="what-agents-can-do"&gt;What Agents Can Do
&lt;/h2&gt;&lt;p&gt;Beyond addressing the limitations above, here are some practical ways agents add value.&lt;/p&gt;
&lt;h3 id="domain-specific-text-processing"&gt;Domain-Specific Text Processing
&lt;/h3&gt;&lt;p&gt;Agents can process different text segments (contexts) independently.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Context Optimization&lt;/strong&gt;: Agents can compress or selectively provide context, effectively extending the usable context window.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance Gains&lt;/strong&gt;: An LLM within an agent can focus on a single, specific task, leading to better performance. When given too much text, LLMs often struggle to identify key information; smaller, targeted context makes this much easier.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Specialized Knowledge&lt;/strong&gt;: LLMs are trained on general data. To make an agent a domain expert, we can inject specific knowledge directly into its context.&lt;/li&gt;
&lt;/ol&gt;
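&lt;p&gt;As a minimal sketch of the context optimization idea above, the snippet below keeps only the text chunks that share words with the question and stops once a word budget is used up. Everything here is an assumption for illustration: the names &lt;code&gt;words&lt;/code&gt; and &lt;code&gt;select_context&lt;/code&gt; are hypothetical, plain word overlap stands in for a real retriever, and the budget counts words as a rough proxy for tokens.&lt;/p&gt;

```python
import re

def words(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def select_context(question, chunks, max_words=40):
    """Return the chunks most relevant to the question within a word budget."""
    query = words(question)
    # Score each chunk by the number of words it shares with the question.
    scored = [(len(query.intersection(words(chunk))), chunk) for chunk in chunks]
    # Highest score first; list.sort() is stable, so ties keep document order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    selected, used = [], 0
    for score, chunk in scored:
        size = len(chunk.split())
        # Drop unrelated chunks and chunks that would overflow the budget.
        if score == 0 or used + size > max_words:
            continue
        selected.append(chunk)
        used += size
    return selected

chunks = [
    "The billing service retries failed payments three times.",
    "Our office plants are watered every Monday.",
    "Failed payments are logged to the payments_audit table.",
]
context = select_context("Where are failed payments logged?", chunks, max_words=16)
print(context)
```

&lt;p&gt;With a budget of 16 words, the two payment-related chunks fit and the unrelated one is left out. Word overlap is crude (common words such as &amp;ldquo;are&amp;rdquo; add noise), so a real agent would more likely use embeddings or a summarization step here.&lt;/p&gt;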
&lt;h3 id="visual-capability-integration"&gt;Visual Capability Integration
&lt;/h3&gt;&lt;p&gt;Through agents, we can integrate traditional vision models to handle tasks that LLMs struggle with. For example, an MCP (Model Context Protocol) server can bridge an agent to a vision model.&lt;/p&gt;
&lt;p&gt;A notable example is &lt;a class="link" href="https://docs.bigmodel.cn/cn/coding-plan/mcp/vision-mcp-server" target="_blank" rel="noopener"
&gt;Zhipu&amp;rsquo;s Vision MCP&lt;/a&gt;. Pairing this MCP server with an agent gives the agent visual processing that its underlying model may lack. This highlights the value of MCP servers that integrate specialized services.&lt;/p&gt;
&lt;h2 id="further-reading"&gt;Further Reading
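&lt;/h2&gt;&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;The &amp;ldquo;Agent&amp;rdquo; people often talk about is, much of the time, really just a Workflow. Conflating the two concepts leads to many detours in product design and technology selection.&lt;br&gt;&lt;br&gt;Anthropic gives a very clear distinction. The core difference is:&lt;br&gt;when the system executes a task, is the path preset by code (Code-Driven), or does the LLM dynamically decide the next step itself (LLM-Driven)? The former is a Workflow; only the latter is…&lt;/p&gt;&amp;mdash; 一泽Eze (@eze_is_1) &lt;a href="https://twitter.com/eze_is_1/status/1982740850070425826?ref_src=twsrc%5Etfw"&gt;October 27, 2025&lt;/a&gt;&lt;/blockquote&gt; &lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;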
&lt;p&gt;Agents and workflows allow LLMs to use tools. While the model&amp;rsquo;s input and output remain text, what that text represents has changed: it may be a tool call or a tool result, so its author is no longer necessarily a human.&lt;/p&gt;
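&lt;p&gt;To make the code-driven versus LLM-driven distinction concrete, here is a small sketch rather than a real framework. All names (&lt;code&gt;run_workflow&lt;/code&gt;, &lt;code&gt;run_agent&lt;/code&gt;, &lt;code&gt;scripted_model&lt;/code&gt;, and the two toy tools) are hypothetical, and &lt;code&gt;scripted_model&lt;/code&gt; is a hard-coded stand-in for an actual LLM call.&lt;/p&gt;

```python
def word_count(text):
    """Tool: count the words in a text and return the count as text."""
    return str(len(text.split()))

def shout(text):
    """Tool: upper-case a text."""
    return text.upper()

TOOLS = {"word_count": word_count, "shout": shout}

def run_workflow(text):
    """Workflow: the code fixes the path, so every run calls the same tools in order."""
    return word_count(shout(text))

def scripted_model(history):
    """Stand-in for an LLM: reads the transcript and returns the next action.

    A real model would generate this decision; it is hard-coded for the demo.
    """
    if not any(entry[0] == "tool" for entry in history):
        return ("call", "word_count", history[0][1])
    return ("finish", "The text has " + history[-1][1] + " words.")

def run_agent(model, text, max_steps=5):
    """Agent: the model chooses the next step until it decides to finish."""
    history = [("user", text)]
    for _ in range(max_steps):
        action = model(history)
        if action[0] == "finish":
            return action[1]
        _, tool_name, argument = action
        # The tool result re-enters the context as text that no human wrote.
        history.append(("tool", TOOLS[tool_name](argument)))
    raise RuntimeError("agent did not finish within max_steps")

print(run_workflow("agents call tools"))
print(run_agent(scripted_model, "agents call tools"))
```

&lt;p&gt;In &lt;code&gt;run_workflow&lt;/code&gt; the order of tool calls lives in the code. In &lt;code&gt;run_agent&lt;/code&gt; the loop only executes whatever the model asks for, and the &lt;code&gt;max_steps&lt;/code&gt; cap is a simple guard against a model that never finishes.&lt;/p&gt;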
&lt;h2 id="agent-frameworks"&gt;Agent Frameworks
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://ai.pydantic.dev/" target="_blank" rel="noopener"
&gt;Pydantic AI&lt;/a&gt;: I find this particularly useful because it integrates Pydantic models into the agent framework, making it much easier to debug. I&amp;rsquo;ve tested its integration with &lt;a class="link" href="https://ai.pydantic.dev/" target="_blank" rel="noopener"
&gt;Qwen3&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.langchain.com/" target="_blank" rel="noopener"
&gt;LangChain&lt;/a&gt;: I haven&amp;rsquo;t used this in production, only for basic debugging. The API changes frequently, which can be challenging. One minor issue is prompt handling; &lt;a class="link" href="https://svtter.cn/p/string-template-in-prompt.md/" &gt;I used Jinja to solve this&lt;/a&gt;. Alternatively, the &amp;ldquo;LangChain way&amp;rdquo; involves using &lt;a class="link" href="https://python.langchain.com/docs/concepts/prompt_templates/#string-prompttemplates" target="_blank" rel="noopener"
&gt;PromptTemplates&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Poor Performance of Large Models on Specific Tasks</title><link>https://svtter.cn/en/p/poor-performance-of-large-models-on-specific-tasks/</link><pubDate>Thu, 19 Jun 2025 16:34:32 +0800</pubDate><guid>https://svtter.cn/en/p/poor-performance-of-large-models-on-specific-tasks/</guid><description>&lt;img src="https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/bg.png" alt="Featured image of post Poor Performance of Large Models on Specific Tasks" /&gt;&lt;p&gt;Large vision models perform poorly on some specific tasks but do better when working with formatted text. Here, I use locating the reading area of a meter as an example to show how large models perform.&lt;/p&gt;
&lt;h2 id="source-code"&gt;Source Code
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://github.com/Svtter/vl-model/pull/4" target="_blank" rel="noopener"
&gt;https://github.com/Svtter/vl-model/pull/4&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="test-tasks"&gt;Test Tasks
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Extract text boxes from the image.&lt;/li&gt;
&lt;li&gt;Extract the meter reading area from the image.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="test-file"&gt;Test File
&lt;/h2&gt;&lt;p&gt;&lt;img src="https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/meter-2.jpg"
width="1280"
height="1707"
srcset="https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/meter-2_hu_771f0f2490b85ed1.jpg 480w, https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/meter-2_hu_8848c0cff3902819.jpg 1024w"
loading="lazy"
alt="Original Meter"
class="gallery-image"
data-flex-grow="74"
data-flex-basis="179px"
&gt;&lt;/p&gt;
&lt;p&gt;The test results below show how performance differs among the models.&lt;/p&gt;
&lt;h2 id="test-results-comparison"&gt;Test Results Comparison
&lt;/h2&gt;&lt;h3 id="results-using-bounding-boxes-as-prompts"&gt;Results Using Bounding Boxes as Prompts
&lt;/h3&gt;&lt;p&gt;&lt;img src="https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/cropped_image.png"
width="1280"
height="1707"
srcset="https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/cropped_image_hu_41dee455ef817364.png 480w, https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/cropped_image_hu_4c0f4ae31905bed5.png 1024w"
loading="lazy"
alt="Overall Test Results"
class="gallery-image"
data-flex-grow="74"
data-flex-basis="179px"
&gt;&lt;/p&gt;
&lt;h3 id="detailed-performance-of-each-model"&gt;Detailed Performance of Each Model
&lt;/h3&gt;&lt;h4 id="anthropic-claude-35-sonnet"&gt;Anthropic Claude 3.5 Sonnet
&lt;/h4&gt;&lt;p&gt;&lt;img src="https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/cropped_image_anthropic_claude-3.5-sonnet.png"
width="187"
height="56"
srcset="https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/cropped_image_anthropic_claude-3.5-sonnet_hu_fef09b134291fdf1.png 480w, https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/cropped_image_anthropic_claude-3.5-sonnet_hu_d1e74a2cb60d339a.png 1024w"
loading="lazy"
alt="Claude 3.5 Sonnet Test Results"
class="gallery-image"
data-flex-grow="333"
data-flex-basis="801px"
&gt;&lt;/p&gt;
&lt;h4 id="google-gemini-25-pro"&gt;Google Gemini 2.5 Pro
&lt;/h4&gt;&lt;p&gt;&lt;img src="https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/cropped_image_google_gemini-2.5-pro.png"
width="690"
height="142"
srcset="https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/cropped_image_google_gemini-2.5-pro_hu_75fca6815db4fee4.png 480w, https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/cropped_image_google_gemini-2.5-pro_hu_50b7c46ce946b5fc.png 1024w"
loading="lazy"
alt="Gemini 2.5 Pro Test Results"
class="gallery-image"
data-flex-grow="485"
data-flex-basis="1166px"
&gt;&lt;/p&gt;
&lt;h4 id="openai-gpt-4o"&gt;OpenAI GPT-4o
&lt;/h4&gt;&lt;p&gt;&lt;img src="https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/cropped_image_openai_gpt-4o.png"
width="120"
height="60"
srcset="https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/cropped_image_openai_gpt-4o_hu_e7a4998fc04bc3f0.png 480w, https://svtter.cn/p/poor-performance-of-large-models-on-specific-tasks/pics/cropped_image_openai_gpt-4o_hu_3305e7a6fcb0125a.png 1024w"
loading="lazy"
alt="GPT-4o Test Results"
class="gallery-image"
data-flex-grow="200"
data-flex-basis="480px"
&gt;&lt;/p&gt;
&lt;h2 id="analysis-summary"&gt;Analysis Summary
&lt;/h2&gt;&lt;p&gt;From these test results, we can observe:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Differences in Visual Recognition Capabilities&lt;/strong&gt;: Different models exhibit significant performance variations when handling the same visual task.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Formatted Text Processing&lt;/strong&gt;: Compared to visual tasks, models perform more consistently when processing structured text.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model Characteristics&lt;/strong&gt;: Each model has its own strengths and limitations.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These results are a reminder to evaluate each AI model against the specific type of task before selecting it.&lt;/p&gt;</description></item></channel></rss>