语速
Vision large models perform poorly on some specific tasks but perform better with formatted text. Here, I use the localization of meter reading areas as an example to demonstrate the performance of large models.
Source Code
https://github.com/Svtter/vl-model/pull/4
Test Tasks
- Extract text boxes from the image.
- Extract the meter reading area from the image.
Test File

We can observe the performance differences among various models from these test results:
Test Results Comparison
Results Using Bounding Boxes as Prompts

Detailed Performance of Each Model
Anthropic Claude 3.5 Sonnet

Google Gemini 2.5 Pro

OpenAI GPT-4o

Analysis Summary
From these test results, we can observe:
- Differences in Visual Recognition Capabilities: Different models exhibit significant performance variations when handling the same visual task.
- Formatted Text Processing: Compared to visual tasks, models perform more stably when processing structured text.
- Model Characteristics: Each model has its unique strengths and limitations.
These results remind us to evaluate the suitability of AI models based on specific task types when making selections.
