Tutorial: Hybrid search with BM25 in KDB-X AI libraries

Author

Laura Kerr

Community Management Lead

Key points

  1. Traditional keyword (sparse) search struggles with context and meaning, while purely semantic (dense) search can miss exact terms
  2. Hybrid search combines BM25 precision with semantic embeddings to deliver more accurate answers than either method alone

Traditional keyword searches often fall short when users expect the search engine to “understand” the context and meaning. In contrast, purely semantic (embedding-based) search can surface results that are topically similar but don’t contain the exact words or phrases the user is after.

Hybrid search in KDB-X AI-Libs solves this by blending two approaches:

  • BM25 (sparse): A scoring function that rewards documents containing the query terms, penalises overly frequent words, and normalises by document length
  • Dense search: Uses embeddings (vector representations of text) and retrieves by similarity (e.g., cosine distance)

In this tutorial, we’ll walk through the mechanics of BM25, dense and hybrid search in KDB-X AI-Libs. You can also explore the entire notebook on GitHub.

Load the dataset and create the tables

Let’s start by creating the tables that will back our two searches: a BM25 index over the text and a dense search over the embeddings.

To begin, we will load the necessary libraries.

q
.ai:use`kx.ai    / load the KDB-X AI libraries into the .ai namespace
\l pykx.q        / load PyKX for Python interoperability

We’ll use NanoMSMARCO, a compact version of the large-scale MS MARCO passage ranking dataset. It contains:

  • Corpus: The collection of documents we will be searching through
  • Queries: A set of search queries
  • Relevance: A “ground truth” mapping of which documents are most relevant for each query

Because it is lightweight while still preserving the structure of a real information retrieval benchmark, it is excellent for experimenting with hybrid search methods. It allows you to test both sparse (BM25 keyword-based) and dense (embedding-based) approaches, and then evaluate hybrid combinations against ground-truth relevance rankings – all without the heavy compute requirements of the full MS MARCO dataset.

q
// Load data from Hugging Face
.pykx.pyexec "from datasets import load_dataset";
.pykx.pyexec "corpus_ds = load_dataset('zeta-alpha-ai/NanoMSMARCO', 'corpus', split='train')";
.pykx.pyexec "queries_ds = load_dataset('zeta-alpha-ai/NanoMSMARCO', 'queries', split='train')";
.pykx.pyexec "rankings_ds = load_dataset('zeta-alpha-ai/NanoMSMARCO', 'qrels', split='train')";

// Convert Python objects to q dictionaries
corpus:.pykx.qeval "{int(item['_id']): pykx.CharVector(item['text']) for item in corpus_ds}";
queries:.pykx.qeval "{int(item['_id']): pykx.CharVector(item['text']) for item in queries_ds}";
rankings:.pykx.qeval "{int(item['query-id']): int(item['corpus-id']) for item in rankings_ds}";

// Create tables with consistent data types
corpus:([]docId:key corpus;text:value corpus);
queries:([]docId:key queries;text:value queries);
rankings:([]docId:key rankings;rankedIds:value rankings);

// Join the query text onto the rankings table, keyed on docId
rankings:rankings lj 1!queries;
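
At this point, a quick sanity check confirms the tables loaded as expected (the exact counts depend on the dataset snapshot, so treat them as illustrative):

q
count corpus      / number of passages in the corpus
count rankings    / number of queries with a ground-truth document
2#corpus          / peek at the first two rows (docId, text)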

Create an embedding function

For our dense search, we’ll convert our text into vector embeddings: numerical representations of the text that capture its semantic meaning. We’ll use PyKX to import PyTorch and the sentence-transformers library. For better performance, we’ll also apply dynamic quantization to our model.

q
torch:.pykx.import`torch;                                       / PyTorch
ST:.pykx.import[`sentence_transformers;`:SentenceTransformer];  / embedding model class
qd:.pykx.import[`torch.quantization;`:quantize_dynamic];        / dynamic quantization
qconf:.pykx.import[`torch.quantization;`:default_qconfig];
torch[`:set_num_threads][3];                                    / cap PyTorch at 3 CPU threads

We load paraphrase-MiniLM-L3-v2, a lightweight transformer model that produces good-quality sentence embeddings quickly, making it well suited to CPU inference.

q
model:ST[`$"paraphrase-MiniLM-L3-v2";`device pykw `cpu];  / load the model on CPU
tqint8:torch[`:qint8];
// quantize the model's Linear layers to int8 for faster CPU inference;
// the lambda wraps torch.nn.Linear in a Python set (the layer types to quantize)
qmodel:qd[model;.pykx.eval["lambda x: {x}"] torch[`:nn.Linear];`dtype pykw tqint8];
// embed accepts a q string (or list of strings), decoding bytes before encoding
embed:{x[`:encode][.pykx.eval["lambda x: x.decode('utf-8') if type(x) is bytes else [x.decode('utf-8') for x in x]"] .pykx.topy y]`}[qmodel;];

We now have an embed function that can take a string and return a vector embedding:

q
embed "Hello World!";

Create a tokenizing function

For our sparse search, we need to break down our text into individual tokens (words or sub-words). We’ll use a tokenizer from the transformers library for this.

q
AT:.pykx.import[`transformers;`:AutoTokenizer];
tokenizer:AT[`$":from_pretrained"][`$"bert-base-uncased"];
// tokenize accepts a string or symbol; a leading ":" is stripped so pykx
// does not mistake the text for an attribute lookup
tokenize:{if[10h~type y;y:`$y];x[$[":"~first string y;`$1_string y;y]][`$":input_ids"]`}[tokenizer;];

Our new tokenize function takes a string and returns a list of token IDs.

q
tokenize "abc abc abc"    / token IDs, including BERT's [CLS] and [SEP] markers

Update the dataset

Now, we’ll apply our embedding and tokenization functions to our corpus and rankings tables, adding the new vector embeddings and token lists as new columns.

q
// queries: tokenize and embed row by row
rankings:update tokens:tokenize each text from rankings;
rankings:update embeddings:embed each text from rankings;

corpus:update tokens:tokenize each text from corpus;
// corpus: embed all texts in one batched call, which is faster
corpus:update embeddings:embed text from corpus;
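
A quick look at the schema confirms the new columns are in place (the embedding width of 384 assumes the MiniLM model above):

q
meta corpus                      / now includes tokens and embeddings columns
count first corpus`embeddings    / 384, the model's embedding width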

Perform the searches

With our data prepared, we can now perform our dense and sparse searches and combine them to obtain our hybrid result.

Sparse search

For our sparse search, we’ll use the BM25 algorithm, a powerful and widely used ranking function for information retrieval.

The .ai.bm25.put function inserts sparse vectors into a BM25 index. The ck and cb parameters are hyperparameters that can be tuned to optimize the search results.

  • k1 (ck): Controls how the term frequency of each word affects the relevance score (term saturation)
  • b (cb): Controls how the length of a document affects the relevance score
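
For reference, the standard BM25 scoring function makes these two roles concrete. For a query Q and a document D:

\[
\mathrm{score}(D,Q)=\sum_{t\in Q}\mathrm{IDF}(t)\cdot\frac{f(t,D)\,(k_1+1)}{f(t,D)+k_1\left(1-b+b\,\frac{|D|}{\mathrm{avgdl}}\right)}
\]

Here f(t,D) is the frequency of term t in D, |D| is the document length, and avgdl is the average document length in the corpus. Raising k1 lets repeated terms keep adding to the score before saturating; b = 1 fully normalises scores by document length, while b = 0 ignores length entirely.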

The .ai.bm25.search function returns the top k nearest neighbors for sparse search.

Finally, to calculate accuracy, we intersect each query’s top-10 results with its ground-truth document. Because the rankings dictionary maps each query to a single relevant document, each intersection counts 1 if that document appears in the top 10 and 0 otherwise; avg then averages these hits across all queries, giving the top-10 hit rate (recall@10).

q
ck:1.75e;    / k1: term-frequency saturation
cb:0.25e;    / b: document-length normalisation
index:.ai.bm25.put[()!();ck;cb;corpus[`tokens]];    / build the index from the corpus tokens

// search each query, extract the neighbour indices, and map them to docIds
sparse:corpus[`docId]@/:.[;(::;1)].ai.bm25.search[index;;10;ck;cb] each rankings[`tokens];
acc:avg (count each {inter[x;y]}'[sparse;rankings`rankedIds]);
0N!acc
Result: 0.66 (66% accuracy)
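
The one-liner above packs several steps together. A step-by-step version for a single query may be easier to follow; this is a sketch, assuming .ai.bm25.search returns a (scores;indices) pair per query, which is what the .[;(::;1)] indexing implies:

q
res:.ai.bm25.search[index;first rankings`tokens;10;ck;cb];  / search one query
idx:res 1;              / element 1: row indices of the top-10 matches
corpus[`docId] idx      / map row indices back to document ids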

Dense search

For our dense search, we’ll use a flat (brute-force) search to find the nearest neighbors to our query embeddings in the corpus embeddings, using the L2 (Euclidean) distance metric to measure how close two vectors are.

q
// brute-force search: the 10 nearest corpus embeddings to each query embedding
dense:corpus[`docId]@/:.[;(::;1)].ai.flat.search[corpus[`embeddings];rankings[`embeddings];10;`L2];
acc:avg (count each {inter[x;y]}'[dense;rankings`rankedIds]);
0N!acc
Result: 0.64 (64% accuracy)
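
For intuition, flat search is simple enough to express directly in q. The sketch below is illustrative only; .ai.flat.search is the supported (and far faster) interface:

q
l2:{sqrt sum t*t:x-y}                   / Euclidean distance between two vectors
knn:{[data;qv;k] k#iasc data l2\:qv}    / indices of the k closest rows
knn[(1 0e;0 1e;5 5e);0.9 0.1e;2]        / -> 0 1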

Hybrid search

Finally, we combine the results of our sparse and dense searches using Reciprocal Rank Fusion (RRF). RRF merges multiple ranked lists into a single, more accurate list by scoring each document as the sum of 1/(k + rank) over its positions in the input lists, so documents ranked highly by either method rise to the top.

The .ai.hybrid.rrf function takes the two lists of search results and merges them. The 60 here is the constant k, which can be tuned: larger values flatten the difference between high and low ranks.

q
// fuse each query's sparse and dense rankings, then keep the top 10
hybrid:10#/:{.ai.hybrid.rrf[(x;y);60]}'[sparse;dense];
acc:avg (count each {inter[x;y]}'[hybrid;rankings`rankedIds]);
0N!acc
Result: 0.7 (70% accuracy)
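
To see what the fusion is doing, here is a minimal pure-q sketch of reciprocal rank fusion, for intuition only; .ai.hybrid.rrf is the supported interface:

q
rrf:{[k;lists]
  ids:distinct raze lists;                           / candidate documents
  f:{[k;ids;l] p:l?ids; (p<count l)%k+1+p}[k;ids];   / 1%(k+rank) if present, else 0
  ids idesc sum f each lists }                       / fuse scores and sort
rrf[60;(1 2 3;3 4 1)]    / -> 1 3 2 4: docs ranked by both lists rise to the top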

Hybrid search outperforms sparse and dense search on their own, offering a powerful middle path that combines the precision of keyword matching with the semantic flexibility of dense vectors.

In this tutorial, we explored the power of this combined approach and how you can balance precision and meaning for your individual use cases.

If you enjoyed this blog and would like to explore other examples, you can visit our GitHub repository. You can also begin your journey with KDB-X by signing up for the KDB-X Community Edition Public Preview.
