Introduction to Multimodal Embedding and Reranker Models
Multimodal embedding models extend traditional embedding models by mapping inputs from different modalities, such as text, images, audio, or video, into a shared embedding space. Traditional reranker models compute relevance scores between pairs of texts; multimodal rerankers can additionally score pairs where one or both elements are images, combined text-image documents, or other modalities. With a multimodal model loaded, `model.encode()` accepts images alongside text, so you can compute similarities between text embeddings and image embeddings directly.
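As a concrete illustration, here is a minimal sketch using the sentence-transformers library with an open CLIP checkpoint; the model name, image path, and captions are illustrative placeholders, not a recommendation.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Load a multimodal (text + image) embedding model; clip-ViT-B-32 is one open option.
model = SentenceTransformer("clip-ViT-B-32")

# encode() accepts PIL images as well as strings, mapping both into one space.
img_emb = model.encode(Image.open("car_damage.jpg"))  # placeholder image path
text_emb = model.encode([
    "a photo of a dented car door",
    "a photo of a cat",
])

# Cosine similarity between the image embedding and each text embedding.
print(util.cos_sim(img_emb, text_emb))
```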
Retrieval Augmented Generation System Architecture
A simple retrieval augmented generation (RAG) setup usually works fine with a few documents and a basic retriever, but such setups fall apart quickly once you try to run them in production. In this guide, we'll break down the components of a RAG system architecture, the trade-offs to weigh when building a production-ready system, and common challenges and best practices. RAG architecture refers to how you design your retrieval system: which embedding models and vector types to use, how to chunk and index documents, and whether to add reranking.
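To make those moving parts concrete, here is a minimal retrieval sketch covering chunking, embedding, indexing, and optional reranking; the model checkpoints, chunk size, and top-k value are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")               # assumed embedding model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed reranker

def chunk(text, size=200):
    # Naive fixed-size word chunking; production systems often split on document structure.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Index: chunk every document and embed each chunk once, up front.
docs = ["... your documents here ..."]
chunks = [c for d in docs for c in chunk(d)]
chunk_embs = embedder.encode(chunks)

# Retrieve: embed the query and take the top-k chunks by cosine similarity.
query = "How do I add reranking to a RAG pipeline?"
scores = util.cos_sim(embedder.encode(query), chunk_embs)[0]
top_k = scores.argsort(descending=True)[:10].tolist()

# Rerank (optional): a cross-encoder re-scores the candidates for higher precision.
candidates = [chunks[i] for i in top_k]
reranked = sorted(zip(reranker.predict([(query, c) for c in candidates]), candidates), reverse=True)
```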
Llama 3.2 NeMo Retriever Multimodal Embedding Model
The Llama 3.2 NeMo Retriever Multimodal Embedding model is a small (1.6B parameters) yet powerful vision embedding model. Packaged as an NVIDIA NIM microservice, it enables the creation of large-scale, efficient multimodal information retrieval systems. Building on the advantages of the "retrieval in vision space" concept, we adapted a powerful vision-language model into the Llama 3.2 NeMo Retriever Multimodal Embedding 1B model.
```python
from openai import OpenAI

client = OpenAI(base_url="https://<pai-eas-endpoint>/v1", api_key="<your-pai-api-key>")
embedding = client.embeddings.create(
    input="How should I choose best LLM for the finance industry?",
    model="qwen3-embedding-8b",
)
```

Example code for generating embeddings with the Qwen3-Embedding-8B model.
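The same OpenAI-compatible pattern applies to the NeMo Retriever model described above. The sketch below assumes NVIDIA's hosted NIM endpoint and catalog model name, and the `input_type` field follows the convention of NVIDIA's retrieval NIMs; check the model card for the exact contract, especially for image inputs.

```python
from openai import OpenAI

# Assumed: NVIDIA's hosted NIM endpoint; a self-hosted NIM exposes the same API.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<your-nvidia-api-key>",
)

resp = client.embeddings.create(
    input=["Quarterly revenue chart for the finance report"],
    model="nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1",  # assumed catalog name
    extra_body={"input_type": "query"},  # retrieval NIMs distinguish query vs. passage
)
print(len(resp.data[0].embedding))
```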

Implementing Multimodal AI on Databricks
This blog post will guide you through implementing and leveraging multimodal AI effectively on the Databricks platform. It uses Model Serving's Batch Inference over our historical claims dataset and its associated car images to classify the type of damage and to create embeddings for Vector Search.
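As one hedged sketch of that workflow, Databricks' `ai_query` SQL function can run batch inference against a Model Serving endpoint; the endpoint name, table names, and prompt below are hypothetical placeholders.

```python
# Hypothetical batch inference over historical claims with Databricks' ai_query;
# 'damage-classifier' and the table names are placeholder examples.
# `spark` is the ambient SparkSession in a Databricks notebook.
classified = spark.sql("""
    SELECT
        claim_id,
        image_path,
        ai_query(
            'damage-classifier',
            CONCAT('Classify the car damage described in this claim: ', claim_description)
        ) AS damage_type
    FROM main.insurance.historical_claims
""")
classified.write.mode("overwrite").saveAsTable("main.insurance.claims_classified")
```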
💡 Benefits of Multimodal AI on Databricks
Reported benefits include a 90% accuracy improvement and a 50% reduction in training time. By leveraging Databricks' advanced GenAI capabilities, you can get started building multimodal AI today.
Comparison of Multimodal Embedding Models
| Component | This Approach (Open) | Proprietary Alternative |
|---|---|---|
| Model provider | Any (e.g., OpenAI, Anthropic, Ollama) | Single-vendor lock-in |
| Model size | 1.6B parameters | 10B parameters |
🔑 Key Takeaway
The Llama 3.2 NeMo Retriever Multimodal Embedding model is a small yet powerful vision embedding model and a best-in-class option for multimodal information retrieval. Pairing it with a multimodal AI pipeline on Databricks can improve both the accuracy and the efficiency of your information retrieval systems.