Poor Performance of Large Models on Specific Tasks

语速

Vision large models perform poorly on some specific tasks but perform better with formatted text. Here, I use the localization of meter reading areas as an example to demonstrate the performance of large models.

Source Code

https://github.com/Svtter/vl-model/pull/4

Test Tasks

Extract text boxes from the image.
Extract the meter reading area from the image.

Test File

Original Meter

We can observe the performance differences among various models from these test results:

Test Results Comparison

Results Using Bounding Boxes as Prompts

Overall Test Results

Detailed Performance of Each Model

Anthropic Claude 3.5 Sonnet

Claude 3.5 Sonnet Test Results

Google Gemini 2.5 Pro

Gemini 2.5 Pro Test Results

OpenAI GPT-4o

GPT-4o Test Results

Analysis Summary

From these test results, we can observe:

Differences in Visual Recognition Capabilities: Different models exhibit significant performance variations when handling the same visual task.
Formatted Text Processing: Compared to visual tasks, models perform more stably when processing structured text.
Model Characteristics: Each model has its unique strengths and limitations.

These results remind us to evaluate the suitability of AI models based on specific task types when making selections.