Tutorial: Hybrid search with BM25 in KDB-X AI libraries

Author: Laura Kerr, Community Management Lead

Key Takeaways

  1. Traditional keyword (sparse) search struggles with context and meaning, while purely semantic (dense) search can miss exact terms
  2. Hybrid search combines BM25 precision with semantic embeddings to deliver more accurate answers than either method alone

Traditional keyword searches often fall short when users expect the search engine to “understand” the context and meaning. In contrast, purely semantic (embedding-based) search can surface results that are topically similar but don’t contain the exact words or phrases the user is after.

Hybrid search in KDB-X AI-Libs solves this by blending two approaches:

  • BM25 (sparse): A scoring function that rewards documents containing the query terms, penalises overly frequent words, and normalises by document length (see the formula after this list)
  • Dense search: Uses embeddings (vector representations of text) and retrieves by similarity (e.g., cosine distance)
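
For reference, the standard BM25 score of a document $D$ for a query $Q = \{q_1,\dots,q_n\}$ is:

$$\mathrm{score}(D,Q)=\sum_{i=1}^{n}\mathrm{IDF}(q_i)\cdot\frac{f(q_i,D)\,(k_1+1)}{f(q_i,D)+k_1\left(1-b+b\cdot\frac{|D|}{\mathrm{avgdl}}\right)}$$

where $f(q_i,D)$ is how often $q_i$ occurs in $D$, $|D|$ is the document length, and $\mathrm{avgdl}$ is the average document length across the corpus. The hyperparameters $k_1$ and $b$ reappear later in this tutorial as ck and cb.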

In this tutorial, we’ll walk through the mechanics of BM25, dense and hybrid search in KDB-X AI-Libs. You can also explore the entire notebook on GitHub.

Load the dataset and create the tables

Let’s start by creating a table with a BM25 index on text and a dense index on embeddings.

To begin, we will load the necessary libraries.

q
// Load the KDB-X AI libraries and PyKX (the q/Python interop layer)
.ai:use`kx.ai
\l pykx.q

We’ll use NanoMSMARCO, a compact version of the large-scale MS MARCO passage ranking dataset. It contains:

  • Corpus: The collection of documents we will be searching through
  • Queries: A set of search queries
  • Relevance: A “ground truth” mapping of which documents are most relevant for each query

Because it is lightweight while still preserving the structure of a real information retrieval benchmark, it is excellent for experimenting with hybrid search methods. It allows you to test both sparse (BM25 keyword-based) and dense (embedding-based) approaches, and then evaluate hybrid combinations against ground-truth relevance rankings – all without the heavy compute requirements of the full MS MARCO dataset.

q
// Load data from Hugging Face
.pykx.pyexec "from datasets import load_dataset";
.pykx.pyexec "corpus_ds = load_dataset('zeta-alpha-ai/NanoMSMARCO', 'corpus', split='train')";
.pykx.pyexec "queries_ds = load_dataset('zeta-alpha-ai/NanoMSMARCO', 'queries', split='train')";
.pykx.pyexec "rankings_ds = load_dataset('zeta-alpha-ai/NanoMSMARCO', 'qrels', split='train')";

// Convert Python object to q dictionaries 
corpus:.pykx.qeval "{int(item['_id']): pykx.CharVector(item['text']) for item in corpus_ds}";
queries:.pykx.qeval "{int(item['_id']): pykx.CharVector(item['text']) for item in queries_ds}";
rankings:.pykx.qeval "{int(item['query-id']): int(item['corpus-id']) for item in rankings_ds}";

// Create tables with consistent data types
corpus:([]docId:key corpus;text:value corpus);
queries:([]docId:key queries;text:value queries);
rankings:([]docId:key rankings;rankedIds:value rankings);

// Join each query's text onto the rankings table by docId
rankings:rankings lj 1!queries;
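
As a quick sanity check (illustrative; the exact counts depend on the dataset version), we can inspect the resulting tables:

q
count each (corpus;queries;rankings)   / row counts of the three tables
first corpus                           / one docId/text row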

Create an embedding function

For our dense search, we’ll convert our text into vector embeddings: numerical representations of the text that capture its semantic meaning. We’ll use PyKX to import PyTorch and the sentence-transformers library. For better performance, we’ll also apply dynamic quantization to our model.

q
// Import PyTorch, the SentenceTransformer class, and PyTorch's dynamic quantization utilities
torch:.pykx.import`torch;
ST:.pykx.import[`sentence_transformers;`:SentenceTransformer];
qd:.pykx.import[`torch.quantization;`:quantize_dynamic];
qconf:.pykx.import[`torch.quantization;`:default_qconfig];
torch[`:set_num_threads][3];  / cap PyTorch at 3 CPU threads

We load paraphrase-MiniLM-L3-v2, a lightweight sentence-transformers model that produces good-quality embeddings quickly, which makes it a good fit for this tutorial.

q
model:ST[`$"paraphrase-MiniLM-L3-v2";`device pykw `cpu];
tqint8:torch[`:qint8];
// quantize_dynamic expects a set of layer types, so the lambda wraps nn.Linear in a Python set
qmodel:qd[model;.pykx.eval["lambda x: {x}"] torch[`:nn.Linear];`dtype pykw tqint8];
// embed accepts a single q string or a list of strings, decoding bytes to str before encoding
embed:{x[`:encode][.pykx.eval["lambda x: x.decode('utf-8') if type(x) is bytes else [x.decode('utf-8') for x in x]"] .pykx.topy y]`}[qmodel;];

We now have an embed function that can take a string and return a vector embedding:

q
embed "Hello World!";

Create a tokenizing function

For our sparse search, we need to break down our text into individual tokens (words or sub-words). We’ll use a tokenizer from the transformers library for this.

q
// Use the bert-base-uncased tokenizer from the Hugging Face transformers library
AT:.pykx.import[`transformers;`:AutoTokenizer];
tokenizer:AT[`$":from_pretrained"][`$"bert-base-uncased"];
// Accepts a q string or symbol and returns the tokenizer's input_ids (token IDs)
tokenize:{if[10h~type y;y:`$y];x[$[":"~first string y;`$1_string y;y]][`$":input_ids"]`}[tokenizer;];

Our new tokenize function takes a string and returns a list of token IDs. With bert-base-uncased, the list is bracketed by the special [CLS] and [SEP] token IDs, with the word-piece IDs in between.

q
tokenize[`$"abc abc abc"];

Update the dataset

Now, we’ll apply our embedding and tokenization functions to our corpus and rankings tables, adding the new vector embeddings and token lists as new columns.

q
// Token lists and embeddings for the queries
rankings:update tokens:tokenize each text from rankings;
rankings:update embeddings:embed each text from rankings;

// Token lists and embeddings for the corpus (embed encodes the whole text column in one batched call)
corpus:update tokens:tokenize each text from corpus;
corpus:update embeddings:embed text from corpus;
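
A quick meta confirms the two new columns alongside docId and text:

q
meta corpus   / columns: docId, text, tokens, embeddings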

Perform the searches

With our data prepared, we can now perform our dense and sparse searches and combine them to obtain our hybrid result.

Sparse search

For our sparse search, we’ll use the BM25 algorithm, a powerful and widely used ranking function for information retrieval.

The .ai.bm25.put function inserts sparse vectors into a BM25 index. The ck and cb parameters are hyperparameters that can be tuned to optimize the search results; the toy scorer after this list shows where each one enters the score:

  • k1 (ck): Controls how the term frequency of each word affects the relevance score (term saturation)
  • b (cb): Controls how the length of a document affects the relevance score
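
To make these two knobs concrete, here is a minimal, self-contained BM25 scorer in q. It is purely illustrative (the real index lives behind .ai.bm25.put and .ai.bm25.search), but it shows exactly how k1 and b enter the formula:

q
// Toy corpus of tokenised documents
docs:(`the`cat`sat`on`the`mat;`the`dog`chased`the`cat;`dogs`and`cats`play);
N:count docs; avgdl:avg count each docs;
k1:1.75; b:0.25;
// Inverse document frequency of a term across the toy corpus
idf:{[t] df:sum t in/:docs; log 1+(0.5+N-df)%0.5+df};
// BM25 score of document d for a query (list of terms):
// k1 saturates the term frequency f; b scales the document-length penalty
score:{[d;qs] sum {[d;t] f:sum t=d; idf[t]*(f*k1+1)%f+k1*(1-b)+b*(count d)%avgdl}[d;] each qs};
score[;`the`cat] each docs   / per-document scores for the query "the cat"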

The .ai.bm25.search function returns the top k nearest neighbors for sparse search.

Finally, to calculate accuracy, we intersect each query's top-10 results with its ground-truth relevant document from the rankings table, giving 1 when the relevant document was retrieved and 0 otherwise. avg then averages these per-query scores, so the final number is the fraction of queries whose relevant document appears in the top 10.

q
ck:1.75e;  / k1: term-frequency saturation
cb:0.25e;  / b: document-length normalisation
index:.ai.bm25.put[()!();ck;cb;corpus[`tokens]];

// Search returns a pair per query; .[;(::;1)] extracts the neighbour indices,
// which we map back to corpus docIds
sparse:corpus[`docId]@/:.[;(::;1)].ai.bm25.search[index;;10;ck;cb] each rankings[`tokens];
acc:avg (count each {inter[x;y]}'[sparse;rankings`rankedIds]);
0N!acc
Result: 0.66 (66% accuracy)

Dense search

For our dense search, we’ll use a flat (brute-force) search to find the nearest neighbors to our query embeddings in the corpus embeddings, using the L2 (Euclidean) distance metric: the smaller the distance, the more similar the embeddings.
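
As a side note, L2 distance is just the Euclidean norm of the difference between two vectors; a one-line q equivalent (illustrative, not what .ai.flat.search runs internally):

q
l2:{sqrt sum d*d:x-y};
l2[1 2 3f;4 6 3f]   / 5f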

q
// Brute-force L2 search of each query embedding against all corpus embeddings, top 10 per query
dense:corpus[`docId]@/:.[;(::;1)].ai.flat.search[corpus[`embeddings];rankings[`embeddings];10;`L2];
acc:avg (count each {inter[x;y]}'[dense;rankings`rankedIds]);
0N!acc
Result: 0.64 (64% accuracy)

Hybrid search

Finally, we combine the results of our sparse and dense searches using Reciprocal Rank Fusion (RRF). RRF is a method that combines multiple ranked lists into a single, more accurate list.

The .ai.hybrid.rrf function takes the two lists of search results and merges them. The 60 is the RRF constant k, which damps the contribution of lower-ranked results and can be tuned.
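
Conceptually, each document's fused score is a sum over the input lists of 1/(k+rank). A minimal q sketch of the idea (illustrative only, not the AI-Libs implementation):

q
// Fuse ranked id lists: each list contributes 1%(k+rank) for every id it contains
rrf:{[k;ls] ids:distinct raze ls;
  sc:{[k;ls;i] sum {[k;l;i] $[count r:where l=i;1%k+1+first r;0f]}[k;;i] each ls}[k;ls;] each ids;
  ids idesc sc};
rrf[60;(1 2 3;3 1 4)]   / 1 3 2 4: ids 1 and 3 appear near the top of both lists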

q
// Fuse each query's sparse and dense result lists with RRF (k=60) and keep the top 10
hybrid:10#/:{.ai.hybrid.rrf[(x;y);60]}'[sparse;dense];
acc:avg (count each {inter[x;y]}'[hybrid;rankings`rankedIds]);
0N!acc
Result: 0.7 (70% accuracy)

Hybrid search outperforms sparse and dense searches on their own, offering a powerful middle path that combines the precision of keyword matching with the flexibility of semantic vectors.

In this tutorial, we explored the power of this combined approach and how you can balance precision and meaning for your individual use cases.

If you enjoyed this blog and would like to explore other examples, you can visit our GitHub repository. You can also begin your journey with KDB-X by signing up for the KDB-X Community Edition Public Preview.
