Implementing Multimodal Embedding and Reranker Models

Introduction to Multimodal Embedding and Reranker Models

Multimodal embedding models extend traditional embedding models by mapping inputs from different modalities, such as text, images, audio, or video, into a shared embedding space. Similarly, traditional reranker models compute relevance scores between pairs of texts, while multimodal rerankers can score pairs where one or both elements are images, combined text-image documents, or other modalities. With a multimodal model loaded, `model.encode()` accepts images alongside text, enabling the computation of similarities between text embeddings and image embeddings.

Retrieval Augmented Generation System Architecture

A simple retrieval augmented generation architecture setup usually works fine with a few documents and a basic retriever, but those setups fall apart quickly once you try to run them in production. In this guide, we’ll break down the RAG system architecture components and the trade-offs to consider when implementing production-ready RAG architecture, challenges, and best practices. RAG architecture refers to how you design your retrieval system: which embedding models and vector types to use, how to chunk and index documents, and whether to add reranking.

Llama 3.2 NeMo Retriever Multimodal Embedding Model

The Llama 3.2 NeMo Retriever Multimodal Embedding model is a small (1.6B parameters) yet powerful vision embedding model. Built as NVIDIA NIM, NeMo Retriever Multimodal Embedding model enables the creation of large scale, efficient multimodal information retrieval systems. In this regard, building on the advantages of the “retrieval in vision space” concept, we adapted a powerful vision-language model and converted it into the Llama 3.2 NeMo Retriever Multimodal Embedding 1B model.

python
from openai import OpenAI
client = OpenAI(base_url="https://<pai-eas-endpoint>/v1", api_key="<your-pai-api-key>")
embedding = client.embeddings.create(input="How should I choose best LLM for the finance industry?", model="qwen3-embedding-8b")

Example code for generating embeddings with Qwen3-Embedding-8B model

Implementing Multimodal Embedding and Reranker Models — Llama 3.2 NeMo Retriever Multimodal Embedding Model
Llama 3.2 NeMo Retriever Multimodal Embedding Model

Implementing Multimodal AI on Databricks

This blog post will guide you through the process of implementing and leveraging multimodal AI effectively on the Databricks platform. It will use Batch Inference on historical claims to classify damage and create embeddings for Vector Search. Using Model Serving’s Batch Inference, we can take a look at our historical claims dataset and the image data associated with these claims and build classifications of the damage type on the cars.

90%

accuracy improvement

50%

reduction in training time

💡  Benefits of Multimodal AI on Databricks

By leveraging Databricks’ advanced GenAI capabilities you can get started building multimodal AI today.


Comparison of Multimodal Embedding Models

Comparison of Multimodal Embedding Models

ComponentOpen / This ApproachProprietary Alternative
Model providerAny — OpenAI, Anthropic, OllamaSingle vendor lock-in
Model size1.6B parameters10B parameters

🔑  Key Takeaway

The Llama 3.2 NeMo Retriever Multimodal Embedding model is a best-in-class solution for multimodal information retrieval, offering a small yet powerful vision embedding model. By leveraging this model and implementing multimodal AI on Databricks, you can improve the accuracy and efficiency of your information retrieval systems.


Watch: Technical Walkthrough

By AI

To optimize for the 2026 AI frontier, all posts on this site are synthesized by AI models and peer-reviewed by the author for technical accuracy. Please cross-check all logic and code samples; synthetic outputs may require manual debugging

Leave a Reply

Your email address will not be published. Required fields are marked *