Revolutionizing video search with multimodal AI

Author: Michael Ryaboy, Developer Advocate

Key Takeaways

  1. Video content is inherently multimodal. Visuals and audio/text must be embedded together to capture their relationship.
  2. Traditional video RAG pipelines, which embed frames individually and rely on cosine similarity, are now considered outdated.
  3. Multimodal models like voyage-multimodal-3 can produce dense vector representations of the entire scene.

Let me be honest: most video RAG pipelines are outdated. You’re probably embedding frames individually with CLIP, storing image and text vectors separately, and praying that cosine similarity is enough to find the moment you’re looking for. While that approach was the best we could do in 2024, things have changed with the arrival of production-ready multimodal AI embedding models such as voyage-multimodal-3.

In this blog, I will lay out a blueprint for building a modern RAG pipeline for video, ingesting visual frames and their corresponding transcript together to produce dense vectors for entire scenes. We will work through video ingestion, semantic scene division (frames + transcript), embedding, storage, and retrieval with a vision LLM such as GPT-4o-mini.

Multimodal Video Embeddings

Note: Video search is still an active research area. I aim to provide a comprehensive technical guide on where we are today. You can follow along from my Colab notebook.

 

Understanding multimodal AI embeddings

You likely already know about embeddings: models that map sentences and images to numerical vectors in a high-dimensional space, where similar items end up close together.

A multimodal embedding model does something more sophisticated: it takes multiple types of input at once, perhaps a chunk of text and several related images, and maps them to a single vector representing their combined meaning. It learns that the words “This formula shows…” belong with the frame displaying E=mc², which lets similarity search find moments where the visual and spoken information align with the query. Voyage AI’s voyage-multimodal-3 is a prime example of this new capability.
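
To make the idea concrete, here is a minimal sketch of embedding one text-plus-image “scene” as a document and scoring it against a plain-text query with cosine similarity. It assumes the voyageai Python client, a VOYAGE_API_KEY in the environment, and a hypothetical local frame file frame0001.png.

Python
import numpy as np
import voyageai
from PIL import Image

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

# One "scene": a sentence of transcript plus the frame it describes
scene = ["This formula shows the mass-energy relationship.", Image.open("frame0001.png")]

# input_type="document" for stored content, input_type="query" for the user's question
doc_vec = np.array(vo.multimodal_embed(
    inputs=[scene], model="voyage-multimodal-3", input_type="document"
).embeddings[0])
query_vec = np.array(vo.multimodal_embed(
    inputs=[["Where is E=mc^2 explained?"]], model="voyage-multimodal-3", input_type="query"
).embeddings[0])

# Cosine similarity between the fused scene vector and the query vector
similarity = doc_vec @ query_vec / (np.linalg.norm(doc_vec) * np.linalg.norm(query_vec))
print(f"cosine similarity: {similarity:.3f}")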

Note: There are only a few truly multimodal embedding models, and the best are closed-source as of this blog’s writing.

[Chart: voyage-multimodal-3 benchmark results]

Image source: https://blog.voyageai.com/2024/11/12/voyage-multimodal-3/

 

Optimizing embeddings with KDB.AI

Built by the kdb+ team (known for low-latency financial systems), KDB.AI is a multimodal vector database for scalable, real-time AI applications. It integrates temporal and semantic relevance into enterprise-scale workflows and works seamlessly with popular LLMs, helping developers keep costs under control.

In our example, we use qHNSW, a variant designed to address the limitations of the traditional HNSW (Hierarchical Navigable Small World) index, which relies heavily on in-memory storage. While not as accurate as an exhaustive index such as Flat, it balances accuracy and efficiency, making it ideal for large-scale datasets where high-speed approximate nearest neighbor (ANN) search matters more than perfect recall (a minimal index-spec sketch follows the lists below).

Index           | Flat      | qFlat     | HNSW      | qHNSW     | IVF      | IVFPQ
Retrieval speed | Low       | Low       | Very high | Very high | Moderate | High
Indexing speed  | Very high | Very high | Low       | Low       | Moderate | Moderate
Accuracy        | Highest   | Highest   | Balanced  | Balanced  | Balanced | Balanced
Memory used     | High      | Very low  | Very high | Low       | High     | Moderate
Storage         | Memory    | Disk      | Memory    | Disk      | Memory   | Memory

Key Features

  • On-disk storage: Unlike its predecessors, qHNSW stores the index on disk with memory-mapped access, drastically reducing memory requirements
  • Incremental disk access: Queries read data incrementally from disk, optimizing memory utilization
  • Cost efficiency: Disk-based storage is generally less expensive and energy-intensive than memory-based storage

When to use

  • Large-scale datasets: The hierarchical graph structure allows for efficient and low-memory indexing and querying
  • Approximate searches with high recall: qHNSW provides a good balance between approximation and accuracy, making it ideal for scenarios where perfect accuracy is not mandatory but speed is critical
  • Memory-constrained environments: qHNSW is an excellent fit if you need fast approximate search on a large dataset, but memory is a bottleneck (think IoT and edge devices)
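
As promised, here is a minimal sketch of how that choice shows up in a KDB.AI index specification (the full schema we use appears in Step 3). The structure mirrors the Step 3 code; the 1024-dimensional embedding size is an assumption about voyage-multimodal-3's output, and swapping the type field is all that changes between index kinds.

Python
# Illustrative index specs: only the "type" field differs
embedding_dim = 1024  # assumed voyage-multimodal-3 output dimension; measure it in practice

qhnsw_index = {
    "type": "qHNSW",      # disk-backed, approximate, low memory
    "name": "idx_emb",
    "column": "embeddings",
    "params": {"dims": embedding_dim, "metric": "CS"},  # CS = cosine similarity
}

qflat_index = {
    "type": "qFlat",      # disk-backed, exhaustive; highest accuracy, slower retrieval at scale
    "name": "idx_emb",
    "column": "embeddings",
    "params": {"dims": embedding_dim, "metric": "CS"},
}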

To find out more, please visit KDB.AI

 

Example

Step 1: Data preparation

Let’s now put this together in a technical example. Remember, you can follow along from my Colab notebook.

First, we will extract the transcript and a video frame every 5 seconds.

Python
# ... [For download video code - see Colab] ...
import os
from openai import OpenAI
from moviepy.editor import VideoFileClip  # on moviepy >= 2.0: from moviepy import VideoFileClip

with VideoFileClip(VIDEO_PATH) as clip:
    # Extract frames at 0.2 FPS. This rate is a tunable hyperparameter.
    # Lower FPS = fewer frames, less cost, potentially less visual detail.
    # Higher FPS = more detail, more cost, diminishing returns after a point.
    clip.write_images_sequence(os.path.join(FRAMES_DIR, "frame%04d.png"), fps=0.2, logger=None)
    clip.audio.write_audiofile(AUDIO_PATH, codec="libmp3lame", bitrate="192k", logger=None)

# Transcribe using Whisper
openai_client = OpenAI()

with open(AUDIO_PATH, "rb") as audio_file:
    # Requesting segments gives us text blocks with rough start/end times
    transcription = openai_client.audio.transcriptions.create(
        model="whisper-1", file=audio_file, response_format="verbose_json",
        timestamp_granularities=["segment"]
    )

What just happened:

  1. We obtained sequentially numbered frames and a structured transcription object containing text segments with timestamps.
  2. We set the value fps=0.2 to balance detail and cost.
  3. We set timestamp_granularities to ‘segment’ to get timestamps and punctuation with the OpenAI Whisper API.
  4. Next, we group frames and align transcript sentences by timestamp using NLTK, producing roughly six frames and about 30 seconds of transcript per ~30-second video chunk, as the code below shows.
Python
# --- 1.d  Create Video Chunks (NLTK, sentence‑aware, true punctuation) ------
import os, re, math, pandas as pd
from IPython.display import display
import nltk

##############################################################################
# Configuration
##############################################################################
FRAMES_DIR       = frames_dir     # from step 1.b
FRAME_FPS        = 0.2            # write_images_sequence fps
TARGET_CHUNK_SEC = 30             # desired chunk length
SLACK_FACTOR     = 1.20           # allow up to 20 % over‑run before forcing cut
##############################################################################

print("\n--- Building sentence‑aligned ~30 s chunks ---")

# ---------------------------------------------------------------------------
# 0. Make sure the Punkt model is present
# ---------------------------------------------------------------------------
try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize

# ---------------------------------------------------------------------------
# 1. Frame paths and helper
# ---------------------------------------------------------------------------
frame_paths = sorted(
    [os.path.join(FRAMES_DIR, f) for f in os.listdir(FRAMES_DIR) if f.endswith(".png")],
    key=lambda p: int(re.search(r"frame(\d+)\.png", os.path.basename(p)).group(1))
)
def idx_from_time(t):   return int(round(t * FRAME_FPS))

# ---------------------------------------------------------------------------
# 2. Build per‑sentence list with *estimated* timestamps
# ---------------------------------------------------------------------------
sentences = []   # list of dict(start, end, sentence)
for seg in transcription.segments:
    seg_start, seg_end, seg_text = seg.start, seg.end, seg.text
    seg_sents = sent_tokenize(seg_text)

    # Distribute the segment’s duration across its sentences by char length
    seg_dur    = seg_end - seg_start
    char_total = sum(len(s) for s in seg_sents)
    running_t  = seg_start
    for s in seg_sents:
        char_frac  = len(s) / char_total
        sent_end   = running_t + char_frac * seg_dur
        sentences.append({"start": running_t, "end": sent_end, "sentence": s})
        running_t = sent_end

print(f"  • {len(sentences):,} sentences total")

# ---------------------------------------------------------------------------
# 3. Pack sentences into chunks  (guarantee ending punctuation)
# ---------------------------------------------------------------------------
chunks, cur_sents = [], []
cur_start, cur_end = None, None

def ends_with_stop(txt): return txt.rstrip().endswith((".", "?", "!"))

for sent in sentences:
    if not cur_sents:                            # start new chunk
        cur_sents  = [sent]
        cur_start, cur_end = sent["start"], sent["end"]
        continue

    prospective_end  = sent["end"]
    prospective_span = prospective_end - cur_start

    # Decide if we should append sentence to current chunk
    if prospective_span <= TARGET_CHUNK_SEC * SLACK_FACTOR or not ends_with_stop(cur_sents[-1]["sentence"]):
        cur_sents.append(sent)
        cur_end = prospective_end
    else:
        chunks.append({"start": cur_start, "end": cur_end, "sentences": cur_sents})
        cur_sents  = [sent]
        cur_start, cur_end = sent["start"], sent["end"]

if cur_sents:
    chunks.append({"start": cur_start, "end": cur_end, "sentences": cur_sents})

print(f"  • {len(chunks)} chunks produced "
      f"(avg {sum(c['end']-c['start'] for c in chunks)/len(chunks):.1f}s)")

# ---------------------------------------------------------------------------
# 4. Attach frames to each chunk
# ---------------------------------------------------------------------------
records = []
total_imgs = 0
for idx, ch in enumerate(chunks):
    start_t, end_t = ch["start"], ch["end"]

    first_idx = idx_from_time(start_t)
    last_idx  = max(idx_from_time(end_t) - 1, first_idx)   # inclusive
    imgs      = frame_paths[first_idx : last_idx + 1]
    total_imgs += len(imgs)

    chunk_text = " ".join(s["sentence"] for s in ch["sentences"]).strip()

    records.append(
        {
            "section":    idx,
            "start_time": round(start_t, 2),
            "end_time":   round(end_t,   2),
            "images":     imgs,
            "text":       chunk_text,
        }
    )

print(f"  • {total_imgs} total images linked")

# ---------------------------------------------------------------------------
# 5. DataFrame
# ---------------------------------------------------------------------------
scene_df = pd.DataFrame(records)
pd.set_option("display.max_colwidth", None)
display(scene_df.head(10))
# ---------------------------------------------------------------------------

Output:

[Output: the first rows of scene_df, showing section, start_time, end_time, images, and text]

This chunking method is more sophisticated than simple fixed-size splitting. It uses Whisper’s segment timestamps and NLTK’s sentence tokenizer to create chunks that are roughly TARGET_CHUNK_SEC long. To improve semantic coherence, it also tries to end on proper sentence boundaries. It then maps each calculated text chunk’s time range (start_t, end_t) back to the corresponding frame files based on the FRAME_FPS, ensuring that each row in scene_df represents a ~30-second multimodal scene.
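
As a quick sanity check of that time-to-frame mapping, here is a minimal sketch using the same idx_from_time logic with illustrative chunk boundaries (31.2 s to 62.8 s):

Python
FRAME_FPS = 0.2  # one frame every 5 seconds

def idx_from_time(t):
    return int(round(t * FRAME_FPS))

# Hypothetical chunk boundaries for illustration
start_t, end_t = 31.2, 62.8
first_idx = idx_from_time(start_t)                    # round(6.24)      -> 6
last_idx  = max(idx_from_time(end_t) - 1, first_idx)  # round(12.56) - 1 -> 12
print(first_idx, last_idx)  # frame0006.png through frame0012.png: 7 frames for a ~32 s chunk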

Step 2: Multimodal embedding

Next, we will pass each scene (text string + list of PIL Images) into Voyage AI.

Python
# Generating the Multimodal Embedding Vector
import os
import voyageai
from PIL import Image

voyage = voyageai.Client() # Reads API key from env
model_name = "voyage-multimodal-3"
def embed_scene(scene_data):
    text = scene_data["text"]
    # Load images associated with this scene
    pil_images = [Image.open(p) for p in scene_data["images"] if os.path.exists(p)]
    if not pil_images: return None # Skip if no valid images
    # Critical: Input must be [text_string, Image_obj1, Image_obj2, ...]
    input_payload = [text] + pil_images
    try:
        response = voyage.multimodal_embed(
            inputs=[input_payload], # Embed one scene at a time
            model=model_name,
            input_type="document", # Optimize for retrieval
            truncation=True        # Auto-handle long inputs
        )
        return response.embeddings[0] # Get the single vector
    except Exception as e:
        print(f"Error embedding scene {scene_data.get('id', 'N/A')}: {e}")
        return None
# Apply the embedding function to each row
scene_df['embedding'] = scene_df.apply(embed_scene, axis=1)
# Drop rows where embedding failed
scene_df.dropna(subset=['embedding'], inplace=True)

How we structure the input_payload is important; Voyage requires the text first, then the images.

  • input_type="document" is crucial because it tells Voyage that you’re embedding content for storage/search, not making a query
  • Each scene is embedded individually for clarity, but Voyage supports batching (inputs=[scene1_payload, scene2_payload, …]) for higher throughput in production, as sketched below
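
A minimal sketch of that batched call, assuming the scene_df, voyage client, and imports from above, and an illustrative BATCH_SIZE chosen to stay within Voyage's request-size limits:

Python
# Hypothetical batched embedding: several scene payloads per API call
import os

BATCH_SIZE = 8  # illustrative; tune to respect request-size and token limits

payloads = [
    [row["text"]] + [Image.open(p) for p in row["images"] if os.path.exists(p)]
    for _, row in scene_df.iterrows()
]

all_embeddings = []
for i in range(0, len(payloads), BATCH_SIZE):
    response = voyage.multimodal_embed(
        inputs=payloads[i:i + BATCH_SIZE],  # one [text, image, ...] payload per scene
        model="voyage-multimodal-3",
        input_type="document",
        truncation=True,
    )
    all_embeddings.extend(response.embeddings)

scene_df["embedding"] = all_embeddings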

Step 3: Store in KDB.AI

Now, we will define the table schema and implement our index.

Python
# SNIPPET 4: KDB.AI Schema and Table Creation
import kdbai_client as kdbai
import json
import numpy as np

session = kdbai.Session(endpoint=os.getenv("KDBAI_ENDPOINT"), api_key=os.getenv("KDBAI_API_KEY"))
db = session.database("default")
TABLE_NAME = "video_multimodal_scenes"
embedding_dim = len(scene_df['embedding'].iloc[0])
schema = [
    {"name": "id",          "type": "str"},
    {"name": "text_bytes",  "type": "bytes"},    # Store text efficiently as bytes
    {"name": "image_paths", "type": "str"},      # JSON string list of frame paths
    {"name": "embeddings",  "type": "float32s", "pytype": f"{embedding_dim}f"}, # Vector
]
# Use qHNSW for large datasets: Approximate search, disk-based storage
indexes = [{
    "type": "qHNSW", # Changed from qFlat for scalability
    "name": "idx_emb",
    "column": "embeddings",
    "params": {"dims": embedding_dim, "metric": "CS"}
}]
table = db.create_table(TABLE_NAME, schema=schema, indexes=indexes)
# Prepare and insert data
table.insert(insert_payload) # insert_payload holds the scene data formatted to match the schema (see the sketch after the list below)

What just happened:

  • Text is stored as (UTF-8 encoded) bytes
  • Image paths are stored as a JSON string
  • The embeddings column uses float32s
  • The index is set as qHNSW
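
The insert_payload itself was elided above; here is a hedged sketch of how it could be assembled from scene_df to match that schema. The column mapping (section used as the string id, embeddings cast to float32) is illustrative rather than the notebook's exact code.

Python
# Hypothetical construction of insert_payload to match the schema above
import json
import numpy as np
import pandas as pd

insert_payload = pd.DataFrame({
    "id":          scene_df["section"].astype(str),                      # scene identifier as a string
    "text_bytes":  scene_df["text"].apply(lambda t: t.encode("utf-8")),  # UTF-8 encoded transcript chunk
    "image_paths": scene_df["images"].apply(json.dumps),                 # frame paths as a JSON string
    "embeddings":  scene_df["embedding"].apply(
        lambda v: np.asarray(v, dtype=np.float32)                        # vector as float32 values
    ),
})

table.insert(insert_payload)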

Step 4: Run a query

Finally, we will embed the query, find similar scenes, prepare context (with merged images), and query the VLM.

Before we do, I’d like to flag that while first developing this, I regularly hit API rate limits due to the number of tokens consumed by the images. To work around this, I create a sprite: a single long merged image composed of many sub-images.

Python
import json
from PIL import Image
from IPython.display import display, Markdown

def merge_images(paths):
    images = [Image.open(p) for p in paths]
    widths, heights = zip(*(img.size for img in images))
    total_width = sum(widths)
    max_height = max(heights)

    merged = Image.new('RGB', (total_width, max_height))
    x_offset = 0
    for img in images:
        merged.paste(img, (x_offset, 0))
        x_offset += img.size[0]

    return merged

def multimodal_rag(query: str, k: int = 3) -> str:
    global table

    q_emb = voyage.multimodal_embed([[query]], model="voyage-multimodal-3", input_type="query", truncation=True).embeddings[0]
    retrieved = table.search(vectors={"idx_emb": [q_emb]}, n=k)[0]

    context = [{"type": "text", "text": f"Answer the query based only on the following video segments.\n\nQuestion: {query}\n"}]
    preview = []

    for i, row in retrieved.iterrows():
        tid = row.get('id', f'Retrieved_{i}')
        txt = row['text_bytes'].decode('utf-8')
        context += [{"type": "text", "text": f"\n--- Segment {tid} Text ---"}, {"type": "text", "text": txt}]
        preview.append({"type": "input_text", "text": f"\n--- Segment {tid} Text ---\n{txt}"})

        img_paths = json.loads(row.get('image_paths', '[]'))
        if img_paths:
            merged_img = merge_images(img_paths)
            merged_img_path = "/tmp/merged_segment_image.jpg"
            merged_img.save(merged_img_path, format="JPEG")

            base64_img = encode_base64(merged_img_path)
            context += [{"type": "image_url", "image_url": {"url": base64_img}}]

            preview.append({"type": "input_image", "image_url": base64_img})
            display(merged_img)

    context.append({"type": "text", "text": "\n--- End of Retrieved Context ---"})

    display_rag_preview(preview)

    res = openai_client.chat.completions.create(model="gpt-4o-mini", messages=[{"role": "user", "content": context}], max_tokens=500)
    out = res.choices[0].message.content

    display(Markdown(out))
    return out
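
The function above also relies on two helpers from the Colab notebook: encode_base64, which turns the saved sprite into something the Chat Completions API accepts, and display_rag_preview, which simply renders the assembled prompt in the notebook. If you are adapting the code elsewhere, a minimal encode_base64 returning a base64 data URL (the format the image_url content type accepts) might look like this:

Python
import base64

def encode_base64(image_path: str) -> str:
    """Return the image as a base64 data URL for the image_url content type."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"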

Let’s test this with a new query.

Python
question1 = "What is the central limit theorem?"
print(f"\nExecuting RAG for query: '{question1}'")
response1 = multimodal_rag(query=question1, k=4)

Result:

[Output: the retrieved scene sprites and GPT-4o-mini's grounded answer to the query]

What just happened:

  • Voyage optimizes query vectors differently from document vectors; we therefore use input_type="query" when embedding the user’s question
  • table.search uses the idx_emb (qHNSW) index for fast retrieval of the top k most similar scene vectors
  • To dramatically cut the number of image inputs, we use merge_images, reducing roughly k * 6 = 24 separate frames (with k=4) down to 4 merged sprites
  • The context list carefully interleaves text entries ({"type": "text", …}) with the corresponding merged image sprite ({"type": "image_url", …}), the structure vision models like GPT-4o expect

The new standard?

A scene-based multimodal RAG approach is a fundamental improvement because it respects video’s inherent nature: the fusion of sight and sound over time.

It ensures:

  1. Context preservation: Embeddings capture the link between frames and the transcript.
  2. Relevant retrieval: Search finds semantically relevant scenes rather than isolated frames or text snippets.
  3. Grounded generation: VLMs can answer questions using both visual and textual evidence.
  4. Cost management: Image merging makes VLM calls affordable.

However, there are opportunities for improvement; finding the optimal FRAME_FPS and TARGET_CHUNK_SEC for different video types, for example, is still an art. Even with these considerations, the advances in multimodal AI transform video search from a cumbersome task into a far more seamless one. By embedding visual and textual data together, we get more accurate, contextually relevant search and can find the exact moments we are looking for. This converts your video archive from a passive storage problem into an active, searchable knowledge base.

If you enjoyed this blog, please check out my others on kx.com and connect with me on Slack. You can also find all of the code used in this blog via my Colab notebook: https://github.com/KxSystems/kdbai-samples/blob/main/video_RAG/video_RAG.ipynb

 
