Key Takeaways
- Traditional keyword (sparse) search struggles with context and meaning, while purely semantic (dense) search can miss exact terms
- Hybrid search combines BM25 precision with semantic embeddings to deliver more accurate answers than either method alone
Traditional keyword searches often fall short when users expect the search engine to “understand” the context and meaning. In contrast, purely semantic (embedding-based) search can surface results that are topically similar but don’t contain the exact words or phrases the user is after.
Hybrid search in KDB-X AI-Libs solves this by blending two approaches:
- BM25 (sparse): A scoring function that rewards documents containing the query terms, penalises overly frequent words, and normalises by document length
- Dense search: Uses embeddings (vector representations of text) and retrieves by similarity (e.g., cosine distance)
In this tutorial, we’ll walk through the mechanics of BM25, dense and hybrid search in KDB-X AI-Libs. You can also explore the entire notebook on GitHub.
Load the dataset and create the tables
Let’s start by creating a table with a BM25 index on text and a dense index on embeddings.
To begin, we will load the necessary libraries.
.ai:use`kx.ai
\l pykx.q
We’ll use NanoMSMARCO, a compact version of the large-scale MS MARCO passage ranking dataset. It contains:
- Corpus: The collection of documents we will be searching through
- Queries: A set of search queries
- Relevance: A “ground truth” mapping of which documents are most relevant for each query
Because it is lightweight while still preserving the structure of a real information retrieval benchmark, it is excellent for experimenting with hybrid search methods. It allows you to test both sparse (BM25 keyword-based) and dense (embedding-based) approaches, and then evaluate hybrid combinations against ground-truth relevance rankings – all without the heavy compute requirements of the full MS MARCO dataset.
// Load data from Hugging Face
.pykx.pyexec "from datasets import load_dataset";
.pykx.pyexec "corpus_ds = load_dataset('zeta-alpha-ai/NanoMSMARCO', 'corpus', split='train')";
.pykx.pyexec "queries_ds = load_dataset('zeta-alpha-ai/NanoMSMARCO', 'queries', split='train')";
.pykx.pyexec "rankings_ds = load_dataset('zeta-alpha-ai/NanoMSMARCO', 'qrels', split='train')";
// Convert Python objects to q dictionaries
corpus:.pykx.qeval "{int(item['_id']): pykx.CharVector(item['text']) for item in corpus_ds}";
queries:.pykx.qeval "{int(item['_id']): pykx.CharVector(item['text']) for item in queries_ds}";
rankings:.pykx.qeval "{int(item['query-id']): int(item['corpus-id']) for item in rankings_ds}";
// Create tables with consistent data types
corpus:([]docId:key corpus;text:value corpus);
queries:([]docId:key queries;text:value queries);
rankings:([]docId:key rankings;rankedIds:value rankings);
rankings:rankings lj 1!queries;
Create an embedding function
For our dense search, we’ll convert our text into vector embeddings: numerical representations of the text that capture its semantic meaning. We’ll use KDB-X Python to import PyTorch and the sentence-transformers library. For better performance, we’ll also apply dynamic quantization to our model.
torch:.pykx.import`torch;
ST:.pykx.import[`sentence_transformers;`:SentenceTransformer];
qd:.pykx.import[`torch.quantization;`:quantize_dynamic];
qconf:.pykx.import[`torch.quantization;`:default_qconfig];
torch[`:set_num_threads][3];
We load paraphrase-MiniLM-L3-v2, a lightweight transformer model that generates high-quality embeddings efficiently.
model:ST[`$"paraphrase-MiniLM-L3-v2";`device pykw `cpu];
tqint8:torch[`:qint8];
qmodel:qd[model;.pykx.eval["lambda x: {x}"] torch[`:nn.Linear];`dtype pykw tqint8];
embed:{x[`:encode][.pykx.eval["lambda x: x.decode('utf-8') if type(x) is bytes else [x.decode('utf-8') for x in x]"] .pykx.topy y]`}[qmodel;];
We now have an embed function that can take a string and return a vector embedding:
embed "Hello World!";Create a tokenizing function
For our sparse search, we need to break down our text into individual tokens (words or sub-words). We’ll use a tokenizer from the transformers library for this.
AT:.pykx.import[`transformers;`:AutoTokenizer];
tokenizer:AT[`$":from_pretrained"][`$"bert-base-uncased"];
tokenize:{if[10h~type y;y:`$y];x[$[":"~first string y;`$1_string y;y]][`$":input_ids"]`}[tokenizer;];
Our new tokenize function takes a string and returns a list of token IDs.
tokenize[`$"abc abc abc"];Update the dataset
Now, we’ll apply our embedding and tokenization functions to our corpus and rankings tables, adding the new vector embeddings and token lists as new columns.
rankings:update tokens:tokenize each text from rankings;
rankings:update embeddings:embed each text from rankings;
corpus:update tokens:tokenize each text from corpus;
corpus:update embeddings:embed text from corpus;
Perform the searches
With our data prepared, we can now perform our dense and sparse searches and combine them to obtain our hybrid result.
Sparse search
For our sparse search, we’ll use the BM25 algorithm, a powerful and widely used ranking function for information retrieval.
The .ai.bm25.put function inserts sparse vectors into a BM25 index. The ck and cb parameters are hyperparameters that can be tuned to optimize the search results.
- k1 (ck): Controls how the term frequency of each word affects the relevance score (term saturation)
- b (cb): Controls how the length of a document affects the relevance score (both parameters appear in the formula sketched below)
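As a rough sketch, the classic BM25 formula (the exact weighting .ai.bm25 uses may differ in detail) scores a document D against a query Q as:
score(D, Q) = sum over query terms q of IDF(q) × f(q, D) × (k1 + 1) / ( f(q, D) + k1 × (1 − b + b × |D| / avgdl) )
Here f(q, D) is the frequency of term q in D, |D| is the document length in tokens, and avgdl is the average document length across the corpus. A larger k1 lets repeated occurrences of a term keep boosting the score before saturating, while a larger b penalises long documents more aggressively.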
The .ai.bm25.search function returns the top k nearest neighbors for sparse search.
Finally, to calculate accuracy, we intersect each query’s top-10 search results with that query’s ground-truth relevant document: the intersection is non-empty exactly when the relevant document was retrieved. Taking avg of the per-query counts then gives the fraction of queries whose relevant document appears in the top 10.
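To make the metric concrete, here is a toy sketch with made-up document IDs (hypothetical data, not from the dataset): three queries, the top-3 results returned for each, and one relevant document per query.
// Toy illustration with made-up doc IDs: top-3 results for each of 3 queries
res:(1 2 3;4 9 6;7 8 0);
// The single relevant doc ID for each query
truth:2 9 5;
// Per-query hit counts: 1 1 0 (relevant doc found for queries 1 and 2, missed for query 3)
hits:count each {inter[x;y]}'[res;truth];
// Overall hit rate: 0.6666667
0N!avg hits;
With the metric clear, here is the full sparse search and evaluation on the real data: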
ck:1.75e;
cb:0.25e;
index:.ai.bm25.put[()!();ck;cb;corpus[`tokens]];
sparse:corpus[`docId]@/:.[;(::;1)].ai.bm25.search[index;;10;ck;cb] each rankings[`tokens];
acc:avg (count each {inter[x;y]}'[sparse;rankings`rankedIds]);
0N!acc
Result: 0.66 (66% accuracy)
Dense search
For our dense search, we’ll use a flat (brute-force) search to find the nearest neighbors to our query embeddings in the corpus embeddings. We’ll use the L2 distance metric (Euclidean distance) to measure similarity.
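For reference, the L2 (Euclidean) distance between two embedding vectors x and y is:
L2(x, y) = sqrt( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )
Smaller distances mean the query and document embeddings are closer, i.e. more semantically similar.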
dense:corpus[`docId]@/:.[;(::;1)].ai.flat.search[corpus[`embeddings];rankings[`embeddings];10;`L2];
acc:avg (count each {inter[x;y]}'[dense;rankings`rankedIds]);
0N!acc
Result: 0.64 (64% accuracy)
Hybrid search
Finally, we combine the results of our sparse and dense searches using Reciprocal Rank Fusion (RRF). RRF is a method that combines multiple ranked lists into a single, more accurate list.
The .ai.hybrid.rrf function takes the two lists of search results and merges them. The 60 here is the RRF constant k (a commonly used default), which can be tuned.
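Conceptually, RRF scores each document purely by its rank in every input list (a sketch of the standard formulation; the library’s exact weighting may differ):
RRF(d) = sum over rankings r of 1 / (k + rank_r(d))
where rank_r(d) is the position of document d in ranking r and k is the constant (60 here). Because only ranks are used, the sparse and dense scores never need to be put on a comparable scale.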
hybrid:10#/:{.ai.hybrid.rrf[(x;y);60]}'[sparse;dense];
acc:avg (count each {inter[x;y]}'[hybrid;rankings`rankedIds]);
0N!acc
Result: 0.7 (70% accuracy)
Hybrid search outperforms sparse and dense searches on their own, offering a powerful middle path that combines the precision of keyword matching with the flexibility of semantic vectors.
In this tutorial, we explored the power of this combined approach and how you can balance precision and meaning for your individual use cases.
If you enjoyed this blog and would like to explore other examples, you can visit our GitHub repository. You can also begin your journey with KDB-X by signing up for the KDB-X Community Edition Public Preview.
