ポイント
- Fuzzy filters in KDB-X allow for resilient queries against real-world data, where typos, spelling differences, or symbol changes are common.
- By using appropriate distance metrics like Levenshtein distance, you can retrieve accurate results despite these variations
- By selecting appropriate distance metrics and thresholds, you can tailor fuzzy filters for your specific use case, whether that’s financial data, customer records, or any domain where identifiers evolve.
Real-world data often contains minor variations in symbol/string values, which can be a time-consuming effort to normalize, especially across large volumes. Fuzzy filters using KDB-X AI libs helps align data with small differences, without the need to cleanse your data before querying.
The KDB-X AI libs module provides fuzzy filters with simple functions, such as .ai.fuzzy.dist and .ai.fuzzy.search. In this tutorial, we will walk through examples of how to use both. You can also explore the entire notebook via GitHub.
Let’s begin by loading the AI libraries module and exploring the dataset:
.ai:use`kx.aifirst ohlcdate | 2025.02.12
sym | `ADD
company | "Color Star Technology Co Ltd"
close | 1.23
volume | 495439
open | 1.24
high | 1.28
low | 1.12
This table provides daily OHLC data for multiple stock symbols. Each row contains the opening price (open), the closing price (close), the maximum price (high), and the minimum price (low) for each day.
Example: Handling international spelling differences
Suppose we want to retrieve rows for Color Star Technology Co Ltd. The company name in the dataset uses US English spelling, but what if we query using the UK spelling “Colour”?
select from ohlc where company like "Colour Star Technology Co Ltd"This query fails, returning no rows. To solve this, we can apply fuzzy filters.
Step 1: Get distinct company names
comps:exec distinct company from ohlcStep 2: Compute fuzzy distances
.ai.fuzzy.dist[comps;"Colour Star Technology Co Ltd";`levenshtein]1 26 24 27 24 26 20 27fHere, we use Levenshtein distance, which counts the number of character edits needed to transform one string into another.
Let’s view all available distance metrics:
.ai.fuzzy.utils.fuzzyDistances`levenshtein`levenshteinNorm`indel`hamming`jaro`jaroWinkler`lcs`damerau`osa`prefix`postfixStep 3: Search with a threshold
res:.ai.fuzzy.search[comps;"Colour Star Technology Co Ltd";1;`levenshtein]select from ohlc where company like raze res[2]This successfully returns all rows for Color Star Technology Co Ltd, even though the query used “Colour”.
Example 2: Tracking stock symbol changes
Stock symbols change over time. Let’s see how fuzzy filters help handle this automatically.
Step 1: Count expected rows
select count distinct date from ohlcdate
----
124
The dataset spans ~6 months, so we expect 124 rows per symbol.
Step 2: Query symbol `HSHP`
count select from ohlc where sym=`HSHP47Only 47 rows appear. Why?
On June 3rd, 2025, Himalaya Shipping Ltd changed its symbol from HSHP to HSHIP:
Source: https://www.nasdaq.com/market-activity/stocks/symbol-change-history
Step 3: Use fuzzy distance
syms:exec distinct sym from ohlc;
.ai.fuzzy.dist[syms;`HSHP;`levenshtein]4 4 4 3 1 4 4 4 0fOne match has distance 0 (HSHP itself), and another only 1 character away.
Step 4: Search for close matches
res:.ai.fuzzy.search[syms;`HSHP;2;`levenshtein]res
0 1
8 4
HSHP HSHIP
The filter identifies HSHP and HSHIP.
Step 5: Query both symbols
select from ohlc where sym in res[2]date sym company close volume open high low
------------------------------------------------------------------------
2025.02.12 HSHIP "Himalaya Shipping Ltd" 4.83 205194 4.75 4.85 4.738
2025.02.13 HSHIP "Himalaya Shipping Ltd" 4.94 89261 4.87 4.98 4.8501
2025.02.14 HSHIP "Himalaya Shipping Ltd" 4.99 87297 4.99 5.055 4.95
2025.02.18 HSHIP "Himalaya Shipping Ltd" 5.23 230474 5.13 5.3 5.13
2025.02.19 HSHIP "Himalaya Shipping Ltd" 5.27 279502 5.3 5.37 5.14
2025.02.20 HSHIP "Himalaya Shipping Ltd" 5.5 170577 5.45 5.68 5.44
2025.02.21 HSHIP "Himalaya Shipping Ltd" 5.34 239442 5.6 5.6 5.32
2025.02.24 HSHIP "Himalaya Shipping Ltd" 5.18 278755 5.2 5.345 5.11
2025.02.25 HSHIP "Himalaya Shipping Ltd" 5.42 258100 5.42 5.59 5.34
2025.02.26 HSHIP "Himalaya Shipping Ltd" 5.68 198694 5.58 5.75 5.58
2025.02.27 HSHIP "Himalaya Shipping Ltd" 5.31 198515 5.4 5.45 5.28
2025.02.28 HSHIP "Himalaya Shipping Ltd" 5.45 174216 5.48 5.54 5.36
2025.03.03 HSHIP "Himalaya Shipping Ltd" 5.34 309573 5.55 5.58 5.335
2025.03.04 HSHIP "Himalaya Shipping Ltd" 5.25 147806 5.13 5.31 5.04
2025.03.05 HSHIP "Himalaya Shipping Ltd" 5.44 216607 5.67 5.67 5.41
2025.03.06 HSHIP "Himalaya Shipping Ltd" 5.52 189787 5.51 5.61 5.46
2025.03.07 HSHIP "Himalaya Shipping Ltd" 5.65 202489 5.71 5.75 5.625
..
This retrieves all rows across both symbols.
Step 6: Verify full coverage
count select from ohlc where sym in res[2]124We now have the complete dataset for Himalaya Shipping Ltd, regardless of its symbol change.
Fuzzy Filters in KDB-X allow resilient queries against real-world data, where typos, spelling differences, or symbol changes are common.
- Example 1 demonstrated how fuzzy matching overcomes spelling variations (e.g., Color vs. Colour)
- Example 2 demonstrated how to handle stock symbol changes seamlessly
By selecting appropriate distance metrics and thresholds, you can tailor fuzzy filters for your specific use case, whether that’s financial data, customer records, or any domain where identifiers evolve.
If you enjoyed this blog and would like to explore other examples, you can visit our GitHub repository. You can also begin your journey with KDB-X by signing up for the KDB-X Community Edition Public Preview.





