Tutorial: Hybrid search with BM25 in KDB-X AI libraries

Author

Laura Kerr

Community Management Lead

Key points

  1. Traditional keyword (sparse) search struggles with context and meaning, while purely semantic (dense) search can miss exact terms
  2. Hybrid search combines BM25 precision with semantic embeddings to deliver more accurate answers than either method alone

Traditional keyword searches often fall short when users expect the search engine to “understand” the context and meaning. In contrast, purely semantic (embedding-based) search can surface results that are topically similar but don’t contain the exact words or phrases the user is after.

Hybrid search in KDB-X AI-Libs solves this by blending two approaches:

  • BM25 (sparse): A scoring function that rewards documents containing the query terms, penalises overly frequent words, and normalises by document length
  • Dense search: Uses embeddings (vector representations of text) and retrieves by similarity (e.g., cosine distance)

In this tutorial, we’ll walk through the mechanics of BM25, dense and hybrid search in KDB-X AI-Libs. You can also explore the entire notebook on GitHub.

Load the dataset and create the tables

Let’s start by creating the tables that will back our two searches: a BM25 index over the text and a dense search over the embeddings.

To begin, we will load the necessary libraries.

q
.ai:use`kx.ai    / load the KDB-X AI libraries into the .ai namespace
\l pykx.q        / load PyKX for Python interoperability

We’ll use NanoMSMARCO, a compact version of the large-scale MS MARCO passage ranking dataset. It contains:

  • Corpus: The collection of documents we will be searching through
  • Queries: A set of search queries
  • Relevance: A “ground truth” mapping of which documents are most relevant for each query

Because it is lightweight while still preserving the structure of a real information retrieval benchmark, it is excellent for experimenting with hybrid search methods. It allows you to test both sparse (BM25 keyword-based) and dense (embedding-based) approaches, and then evaluate hybrid combinations against ground-truth relevance rankings – all without the heavy compute requirements of the full MS MARCO dataset.

q
// Load data from Hugging Face
.pykx.pyexec "from datasets import load_dataset";
.pykx.pyexec "corpus_ds = load_dataset('zeta-alpha-ai/NanoMSMARCO', 'corpus', split='train')";
.pykx.pyexec "queries_ds = load_dataset('zeta-alpha-ai/NanoMSMARCO', 'queries', split='train')";
.pykx.pyexec "rankings_ds = load_dataset('zeta-alpha-ai/NanoMSMARCO', 'qrels', split='train')";

// Convert Python objects to q dictionaries
corpus:.pykx.qeval "{int(item['_id']): pykx.CharVector(item['text']) for item in corpus_ds}";
queries:.pykx.qeval "{int(item['_id']): pykx.CharVector(item['text']) for item in queries_ds}";
rankings:.pykx.qeval "{int(item['query-id']): int(item['corpus-id']) for item in rankings_ds}";

// Create tables with consistent data types
corpus:([]docId:key corpus;text:value corpus);
queries:([]docId:key queries;text:value queries);
rankings:([]docId:key rankings;rankedIds:value rankings);

// Join the query text onto the rankings table, keyed on docId
rankings:rankings lj 1!queries;
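
At this point, a quick sanity check confirms the tables loaded as expected (the exact counts depend on the dataset snapshot, so treat them as illustrative):

q
count corpus      / number of passages in the corpus
count rankings    / number of queries with a ground-truth document
2#corpus          / peek at the first two rows (docId, text)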

Create an embedding function

For our dense search, we’ll convert our text into vector embeddings: numerical representations of the text that capture its semantic meaning. We’ll use PyKX to import PyTorch and the sentence-transformers library. For better performance, we’ll also apply dynamic quantization to our model.

q
torch:.pykx.import`torch;                                       / PyTorch
ST:.pykx.import[`sentence_transformers;`:SentenceTransformer];  / embedding model class
qd:.pykx.import[`torch.quantization;`:quantize_dynamic];        / dynamic quantization
qconf:.pykx.import[`torch.quantization;`:default_qconfig];
torch[`:set_num_threads][3];                                    / cap PyTorch at 3 CPU threads

We load paraphrase-MiniLM-L3-v2, a lightweight transformer model that produces good-quality sentence embeddings quickly, making it well suited to CPU inference.

q
model:ST[`$"paraphrase-MiniLM-L3-v2";`device pykw `cpu];  / load the model on CPU
tqint8:torch[`:qint8];
// quantize the model's Linear layers to int8 for faster CPU inference;
// the lambda wraps torch.nn.Linear in a Python set (the layer types to quantize)
qmodel:qd[model;.pykx.eval["lambda x: {x}"] torch[`:nn.Linear];`dtype pykw tqint8];
// embed accepts a q string (or list of strings), decoding bytes before encoding
embed:{x[`:encode][.pykx.eval["lambda x: x.decode('utf-8') if type(x) is bytes else [x.decode('utf-8') for x in x]"] .pykx.topy y]`}[qmodel;];

We now have an embed function that can take a string and return a vector embedding:

q
embed "Hello World!";

Create a tokenizing function

For our sparse search, we need to break down our text into individual tokens (words or sub-words). We’ll use a tokenizer from the transformers library for this.

q
AT:.pykx.import[`transformers;`:AutoTokenizer];
tokenizer:AT[`$":from_pretrained"][`$"bert-base-uncased"];
// tokenize accepts a string or symbol; a leading ":" is stripped so pykx
// does not mistake the text for an attribute lookup
tokenize:{if[10h~type y;y:`$y];x[$[":"~first string y;`$1_string y;y]][`$":input_ids"]`}[tokenizer;];

Our new tokenize function takes a string and returns a list of token IDs.

q
tokenize "abc abc abc"    / token IDs, including BERT's [CLS] and [SEP] markers

Update the dataset

Now, we’ll apply our embedding and tokenization functions to our corpus and rankings tables, adding the new vector embeddings and token lists as new columns.

q
// queries: tokenize and embed row by row
rankings:update tokens:tokenize each text from rankings;
rankings:update embeddings:embed each text from rankings;

corpus:update tokens:tokenize each text from corpus;
// corpus: embed all texts in one batched call, which is faster
corpus:update embeddings:embed text from corpus;
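
A quick look at the schema confirms the new columns are in place (the embedding width of 384 assumes the MiniLM model above):

q
meta corpus                      / now includes tokens and embeddings columns
count first corpus`embeddings    / 384, the model's embedding width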

Perform the searches

With our data prepared, we can now perform our dense and sparse searches and combine them to obtain our hybrid result.

Sparse search

For our sparse search, we’ll use the BM25 algorithm, a powerful and widely used ranking function for information retrieval.

The .ai.bm25.put function inserts sparse vectors into a BM25 index. The ck and cb parameters are hyperparameters that can be tuned to optimize the search results.

  • k1 (ck): Controls how the term frequency of each word affects the relevance score (term saturation)
  • b (cb): Controls how the length of a document affects the relevance score
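
For reference, the standard BM25 scoring function makes these two roles concrete. For a query Q and a document D:

\[
\mathrm{score}(D,Q)=\sum_{t\in Q}\mathrm{IDF}(t)\cdot\frac{f(t,D)\,(k_1+1)}{f(t,D)+k_1\left(1-b+b\,\frac{|D|}{\mathrm{avgdl}}\right)}
\]

Here f(t,D) is the frequency of term t in D, |D| is the document length, and avgdl is the average document length in the corpus. Raising k1 lets repeated terms keep adding to the score before saturating; b = 1 fully normalises scores by document length, while b = 0 ignores length entirely.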

The .ai.bm25.search function returns the top k nearest neighbors for sparse search.

Finally, to calculate accuracy, we intersect each query’s top-10 results with its ground-truth document. Because the rankings dictionary maps each query to a single relevant document, each intersection counts 1 if that document appears in the top 10 and 0 otherwise; avg then averages these hits across all queries, giving the top-10 hit rate (recall@10).

q
ck:1.75e;    / k1: term-frequency saturation
cb:0.25e;    / b: document-length normalisation
index:.ai.bm25.put[()!();ck;cb;corpus[`tokens]];    / build the index from the corpus tokens

// search each query, extract the neighbour indices, and map them to docIds
sparse:corpus[`docId]@/:.[;(::;1)].ai.bm25.search[index;;10;ck;cb] each rankings[`tokens];
acc:avg (count each {inter[x;y]}'[sparse;rankings`rankedIds]);
0N!acc
Result: 0.66 (66% accuracy)
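
The one-liner above packs several steps together. A step-by-step version for a single query may be easier to follow; this is a sketch, assuming .ai.bm25.search returns a (scores;indices) pair per query, which is what the .[;(::;1)] indexing implies:

q
res:.ai.bm25.search[index;first rankings`tokens;10;ck;cb];  / search one query
idx:res 1;              / element 1: row indices of the top-10 matches
corpus[`docId] idx      / map row indices back to document ids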

Dense search

For our dense search, we’ll use a flat (brute-force) search to find the nearest neighbors to our query embeddings in the corpus embeddings, using the L2 (Euclidean) distance metric to measure how close two vectors are.

q
// brute-force search: the 10 nearest corpus embeddings to each query embedding
dense:corpus[`docId]@/:.[;(::;1)].ai.flat.search[corpus[`embeddings];rankings[`embeddings];10;`L2];
acc:avg (count each {inter[x;y]}'[dense;rankings`rankedIds]);
0N!acc
Result: 0.64 (64% accuracy)
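
For intuition, flat search is simple enough to express directly in q. The sketch below is illustrative only; .ai.flat.search is the supported (and far faster) interface:

q
l2:{sqrt sum t*t:x-y}                   / Euclidean distance between two vectors
knn:{[data;qv;k] k#iasc data l2\:qv}    / indices of the k closest rows
knn[(1 0e;0 1e;5 5e);0.9 0.1e;2]        / -> 0 1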

Hybrid search

Finally, we combine the results of our sparse and dense searches using Reciprocal Rank Fusion (RRF). RRF merges multiple ranked lists into a single, more accurate list by scoring each document as the sum of 1/(k + rank) over its positions in the input lists, so documents ranked highly by either method rise to the top.

The .ai.hybrid.rrf function takes the two lists of search results and merges them. The 60 here is the constant k, which can be tuned: larger values flatten the difference between high and low ranks.

q
// fuse each query's sparse and dense rankings, then keep the top 10
hybrid:10#/:{.ai.hybrid.rrf[(x;y);60]}'[sparse;dense];
acc:avg (count each {inter[x;y]}'[hybrid;rankings`rankedIds]);
0N!acc
Result: 0.7 (70% accuracy)
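
To see what the fusion is doing, here is a minimal pure-q sketch of reciprocal rank fusion, for intuition only; .ai.hybrid.rrf is the supported interface:

q
rrf:{[k;lists]
  ids:distinct raze lists;                           / candidate documents
  f:{[k;ids;l] p:l?ids; (p<count l)%k+1+p}[k;ids];   / 1%(k+rank) if present, else 0
  ids idesc sum f each lists }                       / fuse scores and sort
rrf[60;(1 2 3;3 4 1)]    / -> 1 3 2 4: docs ranked by both lists rise to the top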

Hybrid search outperforms sparse and dense search on their own, offering a powerful middle path that combines the precision of keyword matching with the semantic flexibility of dense vectors.

In this tutorial, we explored the power of this combined approach and how you can balance precision and meaning for your individual use cases.

If you enjoyed this blog and would like to explore other examples, you can visit our GitHub repository. You can also begin your journey with KDB-X by signing up for the KDB-X Community Edition Public Preview.
