Key Takeaways
- Traditional keyword (sparse) search struggles with context and meaning, while purely semantic (dense) search can miss exact terms
- Hybrid search combines BM25 precision with semantic embeddings to deliver more accurate answers than either method alone
Traditional keyword searches often fall short when users expect the search engine to “understand” the context and meaning. In contrast, purely semantic (embedding-based) search can surface results that are topically similar but don’t contain the exact words or phrases the user is after.
Hybrid search in KDB-X AI-Libs solves this by blending two approaches:
- BM25 (sparse): A scoring function that rewards documents containing the query terms, penalises overly frequent words, and normalises by document length
- Dense search: Uses embeddings (vector representations of text) and retrieves by similarity (e.g., cosine distance)
In this tutorial, we’ll walk through the mechanics of BM25, dense and hybrid search in KDB-X AI-Libs. You can also explore the entire notebook on GitHub.
Load the dataset and create the tables
Let’s start by creating a table with a BM25 index on text and a dense index on embeddings.
To begin, we will load the necessary libraries.
.ai:use`kx.ai
\l pykx.q
We’ll use NanoMSMARCO, a compact version of the large-scale MS MARCO passage ranking dataset. It contains:
- Corpus: The collection of documents we will be searching through
- Queries: A set of search queries
- Relevance: A “ground truth” mapping of which documents are most relevant for each query
Because it is lightweight while still preserving the structure of a real information retrieval benchmark, it is excellent for experimenting with hybrid search methods. It allows you to test both sparse (BM25 keyword-based) and dense (embedding-based) approaches, and then evaluate hybrid combinations against ground-truth relevance rankings – all without the heavy compute requirements of the full MS MARCO dataset.
// Load data from Hugging Face
.pykx.pyexec "from datasets import load_dataset";
.pykx.pyexec "corpus_ds = load_dataset('zeta-alpha-ai/NanoMSMARCO', 'corpus', split='train')";
.pykx.pyexec "queries_ds = load_dataset('zeta-alpha-ai/NanoMSMARCO', 'queries', split='train')";
.pykx.pyexec "rankings_ds = load_dataset('zeta-alpha-ai/NanoMSMARCO', 'qrels', split='train')";
// Convert Python objects to q dictionaries
corpus:.pykx.qeval "{int(item['_id']): pykx.CharVector(item['text']) for item in corpus_ds}";
queries:.pykx.qeval "{int(item['_id']): pykx.CharVector(item['text']) for item in queries_ds}";
rankings:.pykx.qeval "{int(item['query-id']): int(item['corpus-id']) for item in rankings_ds}";
// Create tables with consistent data types
corpus:([]docId:key corpus;text:value corpus);
queries:([]docId:key queries;text:value queries);
rankings:([]docId:key rankings;rankedIds:value rankings);
rankings:rankings lj 1!queries;
Create an embedding function
For our dense search, we’ll convert our text into vector embeddings: numerical representations of the text that capture its semantic meaning. We’ll use KDB-X Python to import PyTorch and the sentence-transformers library. For better performance, we’ll also apply dynamic quantization to our model.
torch:.pykx.import`torch;
ST:.pykx.import[`sentence_transformers;`:SentenceTransformer];
qd:.pykx.import[`torch.quantization;`:quantize_dynamic];
qconf:.pykx.import[`torch.quantization;`:default_qconfig];
torch[`:set_num_threads][3];
We load paraphrase-MiniLM-L3-v2, a lightweight transformer model that generates high-quality embeddings efficiently.
model:ST[`$"paraphrase-MiniLM-L3-v2";`device pykw `cpu];
tqint8:torch[`:qint8];
qmodel:qd[model;.pykx.eval["lambda x: {x}"] torch[`:nn.Linear];`dtype pykw tqint8];
embed:{x[`:encode][.pykx.eval["lambda x: x.decode('utf-8') if type(x) is bytes else [x.decode('utf-8') for x in x]"] .pykx.topy y]`}[qmodel;];
We now have an embed function that can take a string and return a vector embedding:
embed "Hello World!";Create a tokenizing function
For our sparse search, we need to break down our text into individual tokens (words or sub-words). We’ll use a tokenizer from the transformers library for this.
AT:.pykx.import[`transformers;`:AutoTokenizer];
tokenizer:AT[`$":from_pretrained"][`$"bert-base-uncased"];
tokenize:{if[10h~type y;y:`$y];x[$[":"~first string y;`$1_string y;y]][`$":input_ids"]`}[tokenizer;];
Our new tokenize function takes a string and returns a list of token IDs.
tokenize[`$"abc abc abc"];Update the dataset
Now, we’ll apply our embedding and tokenization functions to our corpus and rankings tables, adding the new vector embeddings and token lists as new columns.
rankings:update tokens:tokenize each text from rankings;
rankings:update embeddings:embed each text from rankings;
corpus:update tokens:tokenize each text from corpus;
corpus:update embeddings:embed text from corpus;
Perform the searches
With our data prepared, we can now perform our dense and sparse searches and combine them to obtain our hybrid result.
Sparse search
For our sparse search, we’ll use the BM25 algorithm, a powerful and widely used ranking function for information retrieval.
The .ai.bm25.put function inserts sparse vectors into a BM25 index. The ck and cb parameters are hyperparameters that can be tuned to optimize the search results.
- k1 (ck): Controls how the term frequency of each word affects the relevance score (term saturation)
- b (cb): Controls how the length of a document affects the relevance score (both parameters appear in the formula sketched below)
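As a rough sketch, the classic BM25 formula (the exact weighting .ai.bm25 uses may differ in detail) scores a document D against a query Q as:
score(D, Q) = sum over query terms q of IDF(q) × f(q, D) × (k1 + 1) / ( f(q, D) + k1 × (1 − b + b × |D| / avgdl) )
Here f(q, D) is the frequency of term q in D, |D| is the document length in tokens, and avgdl is the average document length across the corpus. A larger k1 lets repeated occurrences of a term keep boosting the score before saturating, while a larger b penalises long documents more aggressively.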
The .ai.bm25.search function returns the top k nearest neighbors for sparse search.
Finally, to calculate accuracy, we intersect each query’s top-10 search results with that query’s ground-truth relevant document: the intersection is non-empty exactly when the relevant document was retrieved. Taking avg of the per-query counts then gives the fraction of queries whose relevant document appears in the top 10.
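To make the metric concrete, here is a toy sketch with made-up document IDs (hypothetical data, not from the dataset): three queries, the top-3 results returned for each, and one relevant document per query.
// Toy illustration with made-up doc IDs: top-3 results for each of 3 queries
res:(1 2 3;4 9 6;7 8 0);
// The single relevant doc ID for each query
truth:2 9 5;
// Per-query hit counts: 1 1 0 (relevant doc found for queries 1 and 2, missed for query 3)
hits:count each {inter[x;y]}'[res;truth];
// Overall hit rate: 0.6666667
0N!avg hits;
With the metric clear, here is the full sparse search and evaluation on the real data: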
ck:1.75e;
cb:0.25e;
index:.ai.bm25.put[()!();ck;cb;corpus[`tokens]];
sparse:corpus[`docId]@/:.[;(::;1)].ai.bm25.search[index;;10;ck;cb] each rankings[`tokens];
acc:avg (count each {inter[x;y]}'[sparse;rankings`rankedIds]);
0N!acc
Result: 0.66 (66% accuracy)
Dense search
For our dense search, we’ll use a flat (brute-force) search to find the nearest neighbors to our query embeddings in the corpus embeddings. We’ll use the L2 distance metric (Euclidean distance) to measure similarity.
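For reference, the L2 (Euclidean) distance between two embedding vectors x and y is:
L2(x, y) = sqrt( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )
Smaller distances mean the query and document embeddings are closer, i.e. more semantically similar.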
dense:corpus[`docId]@/:.[;(::;1)].ai.flat.search[corpus[`embeddings];rankings[`embeddings];10;`L2];
acc:avg (count each {inter[x;y]}'[dense;rankings`rankedIds]);
0N!acc
Result: 0.64 (64% accuracy)
Hybrid search
Finally, we combine the results of our sparse and dense searches using Reciprocal Rank Fusion (RRF). RRF is a method that combines multiple ranked lists into a single, more accurate list.
The .ai.hybrid.rrf function takes the two lists of search results and merges them. The 60 here is the RRF constant k (a commonly used default), which can be tuned.
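Conceptually, RRF scores each document purely by its rank in every input list (a sketch of the standard formulation; the library’s exact weighting may differ):
RRF(d) = sum over rankings r of 1 / (k + rank_r(d))
where rank_r(d) is the position of document d in ranking r and k is the constant (60 here). Because only ranks are used, the sparse and dense scores never need to be put on a comparable scale.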
hybrid:10#/:{.ai.hybrid.rrf[(x;y);60]}'[sparse;dense];
acc:avg (count each {inter[x;y]}'[hybrid;rankings`rankedIds]);
0N!acc
Result: 0.7 (70% accuracy)
Hybrid search outperforms sparse and dense searches on their own, offering a powerful middle path that combines the precision of keyword matching with the flexibility of semantic vectors.
In this tutorial, we explored the power of this combined approach and how you can balance precision and meaning for your individual use cases.
If you enjoyed this blog and would like to explore other examples, you can visit our GitHub repository. You can also begin your journey with KDB-X by signing up for the KDB-X Community Edition Public Preview.
