Poor Performance of Large Models on Specific Tasks

Thu, 19 Jun 2025 16:34:32 +0800

Vision large models perform poorly on some specific tasks but perform better with formatted text. Here, I use the localization of meter reading areas as an example to demonstrate the performance of large models.

Source Code

https://github.com/Svtter/vl-model/pull/4

Test Tasks

Extract text boxes from the image.
Extract the meter reading area from the image.

Test File

We can observe the performance differences among various models from these test results:

Test Results Comparison

Results Using Bounding Boxes as Prompts

Detailed Performance of Each Model

Anthropic Claude 3.5 Sonnet

Google Gemini 2.5 Pro

OpenAI GPT-4o

Analysis Summary

From these test results, we can observe:

Differences in Visual Recognition Capabilities: Different models exhibit significant performance variations when handling the same visual task.
Formatted Text Processing: Compared to visual tasks, models perform more stably when processing structured text.
Model Characteristics: Each model has its unique strengths and limitations.

These results remind us to evaluate the suitability of AI models based on specific task types when making selections.

Meter Reading on Svtter's Blog