Implementing Multimodal Embedding and Reranker Models

12 min readApr 15, 2026

This article explores the implementation of multimodal embedding and reranker models using sentence transformers to enhance language model capabilities. We will discuss the architecture and trade-offs of retrieval augmented generation systems, focusing on the use of vector stores and different design choices. The Llama 3.2 NeMo Retriever Multimodal Embedding model will be highlighted as a best-in-class solution for multimodal information retrieval.

Introduction to Multimodal Embedding and Reranker Models

Multimodal embedding models extend traditional embedding models by mapping inputs from different modalities, such as text, images, audio, or video, into a shared embedding space. Similarly, traditional reranker models compute relevance scores between pairs of texts, while multimodal rerankers can score pairs where one or both elements are images, combined text-image documents, or other modalities. With a multimodal model loaded, `model.encode()` accepts images alongside text, enabling the computation of similarities between text embeddings and image embeddings.

Retrieval Augmented Generation System Architecture

A simple retrieval augmented generation architecture setup usually works fine with a few documents and a basic retriever, but those setups fall apart quickly once you try to run them in production. In this guide, we’ll break down the RAG system architecture components and the trade-offs to consider when implementing production-ready RAG architecture, challenges, and best practices. RAG architecture refers to how you design your retrieval system: which embedding models and vector types to use, how to chunk and index documents, and whether to add reranking.

Llama 3.2 NeMo Retriever Multimodal Embedding Model

The Llama 3.2 NeMo Retriever Multimodal Embedding model is a small (1.6B parameters) yet powerful vision embedding model. Built as NVIDIA NIM, NeMo Retriever Multimodal Embedding model enables the creation of large scale, efficient multimodal information retrieval systems. In this regard, building on the advantages of the “retrieval in vision space” concept, we adapted a powerful vision-language model and converted it into the Llama 3.2 NeMo Retriever Multimodal Embedding 1B model.

python

from openai import OpenAI
client = OpenAI(base_url="https://<pai-eas-endpoint>/v1", api_key="<your-pai-api-key>")
embedding = client.embeddings.create(input="How should I choose best LLM for the finance industry?", model="qwen3-embedding-8b")

Example code for generating embeddings with Qwen3-Embedding-8B model

Implementing Multimodal Embedding and Reranker Models — Llama 3.2 NeMo Retriever Multimodal Embedding Model — Llama 3.2 NeMo Retriever Multimodal Embedding Model

Implementing Multimodal AI on Databricks

This blog post will guide you through the process of implementing and leveraging multimodal AI effectively on the Databricks platform. It will use Batch Inference on historical claims to classify damage and create embeddings for Vector Search. Using Model Serving’s Batch Inference, we can take a look at our historical claims dataset and the image data associated with these claims and build classifications of the damage type on the cars.

90%

accuracy improvement

50%

reduction in training time

💡 Benefits of Multimodal AI on Databricks

By leveraging Databricks’ advanced GenAI capabilities you can get started building multimodal AI today.

Comparison of Multimodal Embedding Models

Component	Open / This Approach	Proprietary Alternative
Model provider	Any — OpenAI, Anthropic, Ollama	Single vendor lock-in
Model size	1.6B parameters	10B parameters

🔑 Key Takeaway

The Llama 3.2 NeMo Retriever Multimodal Embedding model is a best-in-class solution for multimodal information retrieval, offering a small yet powerful vision embedding model. By leveraging this model and implementing multimodal AI on Databricks, you can improve the accuracy and efficiency of your information retrieval systems.

Key Links

Implementing Multimodal Embedding and Reranker Models

ByAI

Introduction to Multimodal Embedding and Reranker Models

Retrieval Augmented Generation System Architecture

Llama 3.2 NeMo Retriever Multimodal Embedding Model

Implementing Multimodal AI on Databricks

Comparison of Multimodal Embedding Models

Watch: Technical Walkthrough

By AI

Related Post

Reinforcement Fine-Tuning with LLM-as-a-Judge

Unlocking Large Context Understanding with DeepSeek-V4

Beyond Transformers: Exploring New Architectures for NLP

Leave a Reply Cancel reply

You missed

Agent Evaluation and Safety Considerations in AI Development

Exploring Text Diffusion Models for Generative AI

Advancements in AI Model Inference with ONNX

Quantization Techniques for Instruction-Tuned LLMs