UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities

Woongyeong Yeo^*¹ Kangsan Kim^*¹ Soyeong Jeong¹ Jinheon Baek¹ Sung Ju Hwang¹^,²

¹KAIST ²DeepAuto.ai

* : equal contribution


Concept figure — Conceptual illustration comparing existing RAG strategies with our proposed UniversalRAG.

RAG with Single Modality struggles to handle queries that require a modality other than the one in the corpus.

RAG with Single Granularity lacks flexibility in granularity, resulting in overly fine or overly coarse information.

RAG with Single Unified Corpus causes modality gaps that bias retrieval toward the modality of the query.

Our **UniversalRAG** overcomes these limitations via a **modality- and granularity-aware routing** mechanism over diverse corpora.

Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses in external knowledge relevant to queries. However, most existing approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Our design is motivated by the observation that forcing all modalities into a unified representation space derived from a single aggregated corpus causes a modality gap, where retrieval tends to favor items from the same modality as the query. We therefore propose modality-aware routing, which dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it, and we further justify its effectiveness with a theoretical analysis. Moreover, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 10 benchmarks spanning multiple modalities, showing its superiority over various modality-specific and unified baselines.

Why UniversalRAG?

Can multimodal content be effectively retrieved from a single unified embedding space? Multimodal encoders are trained to align semantically similar content across text, images, and videos, yet a persistent modality gap remains: embeddings tend to cluster by modality rather than meaning, as illustrated in Figure 1. This separation limits cross-modal retrieval as queries are implicitly routed by modality instead of true semantic similarity.
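The effect of a modality gap on retrieval can be illustrated with a toy example. The sketch below is not taken from the paper: the 3-dimensional embeddings, the `MODALITY_OFFSET` values, and the item names are all hypothetical, chosen so that one coordinate encodes modality and the other two encode meaning. Under that assumption, a text query about dogs ranks an off-topic text item above the on-topic image in the unified space, while restricting retrieval to the image corpus recovers the relevant item.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Assumption: the last coordinate is a modality-specific offset that
# stands in for the modality gap; it carries no semantic information.
MODALITY_OFFSET = {"text": 1.0, "image": -1.0}

def embed(semantic, modality):
    """Toy embedding: two semantic dimensions plus one modality dimension."""
    return [semantic[0], semantic[1], MODALITY_OFFSET[modality]]

# Hypothetical corpus: (name, modality, embedding).
# Semantic axes: (cat-ness, dog-ness).
CORPUS = [
    ("text about cats", "text", embed((1.0, 0.0), "text")),
    ("photo of a dog", "image", embed((0.0, 1.0), "image")),
    ("photo of a cat", "image", embed((1.0, 0.0), "image")),
]

def retrieve(query_vec, modality=None):
    """Return the top-1 item name, optionally restricted to one modality."""
    candidates = [item for item in CORPUS if modality is None or item[1] == modality]
    best = max(candidates, key=lambda item: cosine(query_vec, item[2]))
    return best[0]

# A text query about dogs.
query = embed((0.0, 1.0), "text")

unified_top = retrieve(query)                   # search the single unified space
routed_top = retrieve(query, modality="image")  # search only the image corpus

print("unified:", unified_top)
print("routed: ", routed_top)
```

In this toy setup the shared modality coordinate outweighs the semantic match, which is the failure mode that routing to a modality-specific corpus avoids.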

UniversalRAG addresses this challenge through modality-aware routing. Rather than forcing heterogeneous data into a single shared space, UniversalRAG dynamically identifies the most relevant modality-specific corpus and performs targeted retrieval within it. This design sidesteps the modality gap while remaining flexible to the introduction of new modalities. Beyond modality, UniversalRAG further organizes each corpus by granularity—from passages to full documents, or from short clips to full videos—so the retrieval process matches both the semantic intent and the scope of the user’s query. As shown in Figure 2, UniversalRAG yields a balanced distribution of retrieved items across modalities, automatically selecting the most appropriate knowledge source for each query.
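The routing step described above can be sketched as a function that maps a query to one corpus, followed by retrieval inside that corpus only. The code below is a minimal illustration under stated assumptions, not the authors' router: the keyword rules, the route names (`paragraph`, `document`, `image`, `clip`, `video`), and the corpus contents are hypothetical stand-ins for the trained or training-free routers evaluated in the paper.

```python
# Hypothetical route names covering modality and granularity.
ROUTES = ("paragraph", "document", "image", "clip", "video")

IMAGE_CUES = ("photo", "picture", "image", "look like")
VIDEO_CUES = ("video", "clip", "footage")
WHOLE_VIDEO_CUES = ("entire", "whole", "throughout")

def route(query: str) -> str:
    """Toy keyword router: pick one modality- and granularity-specific corpus.

    Assumption: simple substring cues stand in for a learned or
    prompt-based router.
    """
    q = query.lower()
    if any(cue in q for cue in IMAGE_CUES):
        return "image"
    if any(cue in q for cue in VIDEO_CUES):
        # Queries about a whole video go to the coarse video corpus,
        # everything else to the finer clip corpus.
        if any(cue in q for cue in WHOLE_VIDEO_CUES):
            return "video"
        return "clip"
    # Text queries: long or conjunctive questions get document-level context.
    if " and " in q or len(q.split()) > 15:
        return "document"
    return "paragraph"

# Hypothetical per-route corpora; the item identifiers are placeholders.
CORPORA = {
    "paragraph": ["paragraph_001", "paragraph_002"],
    "document": ["document_001"],
    "image": ["image_001", "image_002"],
    "clip": ["clip_001"],
    "video": ["video_001"],
}

def candidates_for(query: str):
    """Return the selected route and the candidate pool to retrieve from."""
    selected = route(query)
    return selected, CORPORA[selected]

for q in (
    "Who wrote Hamlet?",
    "What does the Eiffel Tower look like in this photo?",
    "In the video, what happens throughout the whole match?",
):
    print(q, "->", candidates_for(q)[0])
```

Because each route owns its own corpus, supporting a new modality in this sketch only requires adding a route name and a corpus entry, leaving the existing embedding spaces untouched.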

Modality Gap — **Figure 1.** Modality gap in the unified embedding space.

**Figure 2.** Distribution of the retrieved data modalities.

Results

We evaluate UniversalRAG on a comprehensive benchmark covering 10 datasets that span multiple modalities and granularities. The full results in Table 1 and the averaged performance across different LVLMs in Figure 3 show that UniversalRAG outperforms strong unimodal and multimodal RAG baselines, validating the effectiveness of our modality- and granularity-aware routing.

Table 1. Results of diverse RAG methods with Qwen3-VL-8B-Instruct across modalities. Bold denotes the best performance and underline denotes the second-best among UniversalRAG variants, using either trained or training-free routers.

Figure 3. Comparison of averaged evaluation results across different RAG methods and LVLMs.

We further conduct ablation studies to validate the effectiveness of UniversalRAG’s granularity-aware routing and cross-modal retrieval, and additionally demonstrate its efficiency in real-world usage scenarios.

Cross-modal Performance — **Table 2.** Performance comparison of uni-modal and cross-modal approaches across different router models.

Granularity Performance — **Table 3.** Performance across different numbers of granularity levels (#Gn) for training-free router models.

Latency Comparison — **Figure 4.** Retrieval latency per query across corpus sizes.

Case Study

Case study comparing RAG on single modality against UniversalRAG (Ours)

Traditional RAG models are limited by modality-specific retrieval, often missing relevant knowledge from other modalities. UniversalRAG (Ours) dynamically routes each query to the most appropriate modality, enabling more accurate retrieval and response generation.

Case study comparing different levels of text granularity

The user query requires complex retrieval involving multiple entities. ParagraphRAG retrieves limited context centered around a single entity, leading to an incorrect answer. UniversalRAG (Ours) routes the query to the document corpus and retrieves richer document-level information, allowing it to capture both relevant entities and generate the correct response.
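The difference between paragraph-level and document-level retrieval in this case study can be mimicked with a small example. The sketch below is illustrative only: the documents, the names in them, and the token-overlap scoring are hypothetical and stand in for the real corpora and retrievers. It shows how a top-1 paragraph can cover only one of the two entities a query depends on, while the enclosing document covers both.

```python
import re

def tokens(text):
    """Lowercased set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap(query, text):
    """Toy relevance score: number of shared word tokens."""
    return len(tokens(query) & tokens(text))

# Hypothetical two-paragraph documents with fictional names.
DOCUMENTS = {
    "doc_alpha": [
        "The film Alpha was directed by Jane Roe.",
        "Jane Roe was born in Lyon.",
    ],
    "doc_beta": [
        "The film Beta was directed by John Doe.",
        "John Doe was born in Oslo.",
    ],
}

def retrieve_paragraph(query):
    """Top-1 retrieval over the paragraph-level corpus."""
    paragraphs = [p for paras in DOCUMENTS.values() for p in paras]
    return max(paragraphs, key=lambda p: overlap(query, p))

def retrieve_document(query):
    """Top-1 retrieval over the document-level corpus."""
    docs = [" ".join(paras) for paras in DOCUMENTS.values()]
    return max(docs, key=lambda d: overlap(query, d))

query = "Where was the director of the film Alpha born?"

paragraph_hit = retrieve_paragraph(query)
document_hit = retrieve_document(query)

print("paragraph-level:", paragraph_hit)
print("document-level: ", document_hit)
```

Here the paragraph-level hit names the director but omits the birthplace, so answering the query would need a second retrieval step, whereas the document-level hit contains both facts.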

Case study comparing different levels of video granularity

The user query requires only a segment of the video to determine the answer. VideoRAG retrieves a broad range of frames across the video, which may include irrelevant content or miss key frames, leading to an incorrect response. UniversalRAG (Ours) routes the query to the clip-level granularity, retrieving more focused and relevant visual information, enabling it to generate the correct response.

Case study comparing UniversalRAG across uni-modal and cross-modal scenarios

In the uni-modal setup, where only the single most prominent modality is used, the retrieved information can be incomplete when a query requires evidence from several modalities. UniversalRAG (Ours), with its cross-modal capability, gathers evidence from multiple modalities to generate a more comprehensive response.


BibTeX

@article{yeo2025universalrag,
  title={UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities},
  author={Yeo, Woongyeong and Kim, Kangsan and Jeong, Soyeong and Baek, Jinheon and Hwang, Sung Ju},
  journal={arXiv preprint arXiv:2504.20734},
  year={2025}
}