Codesota · Papers · Multimodal · 2024-04-25

Paper

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

§ 01 · Benchmark results

4 results reproduced from this paper.

#	Model	Vendor	Benchmark	Metric	Value	SOTA	Date
01	InternVL2-76B	Shanghai AI Lab	MMBench	accuracy	86.5%	—	2024-04-25
02	InternVL2-76B	Shanghai AI Lab	MMMU	accuracy	67.4%	—	2024-04-25
03	InternVL2-76B	Shanghai AI Lab	TextVQA	accuracy	84.4%	—	2024-04-25
04	InternVL2-76B	Shanghai AI Lab	VQA v2.0	accuracy	87.2%	—	2024-04-25

§ 02 · Models

1 model from this paper.

Shanghai AI Lab

§ 04 · Related papers

Other Multimodal papers tracked on Codesota.

2025-02-19 · 3 results
Qwen2.5-VL Technical Report
2025-01-22 · 2 results
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
2025-01-15 · 1 result
Gemini 2.0 Flash Technical Report
2024-10-25 · 4 results
SWE-bench Verified
2024-10-22 · 1 result
Claude 3.5 Sonnet Model Card
2024-09-18 · 4 results
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
