[Multimodal] · PAP-OTZVFJ · 2023 · April 16, 2026

Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models


Issa Sugiura, Keito Sasagawa, Keisuke Nakao et al.

4 min read · Multimodal · Open Source · Training

Core Insight

Jagle advances Japanese vision-language models with a 9.2M-instance post-training dataset.

By the Numbers

9.2 million dataset instances
2.2B model parameters
10 Japanese evaluation tasks
within 5 points of Qwen3-VL-2B-Instruct

In Plain English

The Jagle dataset provides 9.2 million instances for Japanese multimodal post-training. A 2.2B-parameter VLM trained on it performs strongly across ten Japanese evaluation tasks, outperforming InternVL3.5-2B and coming within 5 points of Qwen3-VL-2B-Instruct; combining Jagle with FineVision also improves English performance.

Knowledge Prerequisites

git blame for knowledge

To fully understand Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models, trace this dependency chain first. Papers in our library are linked — click to read them.

DIRECT PREREQ · IN LIBRARY
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Understanding the foundational techniques of pre-training language models is crucial for comprehending subsequent multimodal model developments.

Bidirectional encoder representations · Self-supervised learning · Language model pre-training
DIRECT PREREQ · IN LIBRARY
Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Knowledge of deliberate thought processes in language models is necessary to understand advancements in vision-language integration.

Problem solving in language models · Complex reasoning · Deliberate chain-of-thought
DIRECT PREREQ · IN LIBRARY
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Comprehending multi-agent systems in language models helps grasp the interaction and synthesis in multimodal models.

Multi-agent conversation · Language model interactions · Application generation
DIRECT PREREQ · IN LIBRARY
Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models

The integration of vision and language through streaming chain-of-thought is critical for understanding Jagle's contributions.

Vision-language streaming · Chain-of-thought reasoning · Multimodal integration

YOU ARE HERE

Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models

How grounded is this content?

Metrics are computed from available source text only: the abstract, summary, and impact fields ingested into this system. The full paper PDF is not ingested, so numerical claims that originate in the paper body will not appear in these scores.

Source Richness: 88%

7 of 8 content fields populated. More fields = better-grounded generation.

Source Depth: ~274 words

Total source text analyzed by the model; includes the extended deep-dive summary, so confidence is high.

Number Grounding: 3 / 4

Key statistics whose numeric values appear verbatim in ingested source text. Unverified stats may originate from the full paper body.

Quote Traceability: 3 / 3

Key passages whose significant vocabulary (words of ≥4 characters) overlaps with source text by ≥35%. Measures lexical traceability, not semantic accuracy.

Methodology: Number grounding uses regex digit extraction against source text. Quote traceability uses token set intersection on content words stripped of stop-words. Neither metric validates semantic correctness or factual accuracy against the original paper. For full verification, cross-reference with the original paper via the arXiv link above.
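To make the two checks above concrete, here is a minimal sketch of how such metrics could be computed. The function names (number_grounding, quote_traceable, content_words), the exact regexes, and the placeholder stop-word list are all illustrative assumptions; the methodology note only specifies regex digit extraction and a ≥35% overlap on stop-word-stripped content words of at least four characters.

```python
import re

# Placeholder stop-word list; the real system's list is not specified.
STOP_WORDS = {"the", "and", "that", "this", "with", "from", "only",
              "into", "have", "been", "will", "which", "their"}

def number_grounding(stats, source_text):
    """Count stats whose numeric values all appear verbatim in the source
    text (regex digit extraction, per the methodology note above)."""
    source_numbers = set(re.findall(r"\d+(?:\.\d+)?", source_text))
    grounded = sum(
        1 for stat in stats
        if (nums := re.findall(r"\d+(?:\.\d+)?", stat))
        and all(n in source_numbers for n in nums)
    )
    return grounded, len(stats)

def content_words(text, min_len=4):
    """Lower-cased words of at least min_len characters, minus stop-words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if len(w) >= min_len and w not in STOP_WORDS}

def quote_traceable(passage, source_text, threshold=0.35):
    """True if at least `threshold` of the passage's content words occur in
    the source (token set intersection, per the methodology note above)."""
    words = content_words(passage)
    if not words:
        return False
    return len(words & content_words(source_text)) / len(words) >= threshold

if __name__ == "__main__":
    # Hypothetical source text, for demonstration only.
    source = "Jagle contains 9.2 million instances; a 2.2B-parameter model was trained on it."
    stats = ["9.2 million dataset instances", "2.2B model parameters", "10 evaluation tasks"]
    print(number_grounding(stats, source))    # (2, 3): "10" never appears in the source
    print(quote_traceable(stats[0], source))  # True: "million" and "instances" overlap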
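```

As the demo illustrates, a check like this scores 3 / 4 on the "By the Numbers" stats exactly when one stat's digits never appear in the ~274 words of ingested text; it says nothing about whether the surrounding claim is true, which is why the note above defers full verification to the original paper.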