Key Takeaways
- KDB-X is optimized for high-performance time-series analytics, handling billions of rows efficiently.
- Partitioning and storage format matter for scalability: choosing the right storage design directly impacts query speed.
- KDB-X/q provides a powerful, concise syntax, making time-based aggregations and asof joins significantly faster and more intuitive.
In data-driven industries like finance, every trade, price tick, or quote is accompanied by a precise timestamp, down to the millisecond, capturing the pulse of global markets.
But speed is only part of the story. Historical data plays a crucial role in everything from building trust to shaping strategy and meeting regulatory demands. Firms rely on it to backtest algorithms, model risk, and prove compliance when it matters most.
In this blog, I’ll introduce how to work with KDB-X to analyze large-scale time-series data like this and discuss several key concepts that are fundamental to working with KDB-X. I’ll cover how to create a large time-series dataset, save it to a database on disk, perform time-based aggregations to analyze trends over time, and use asof joins (aj) to combine time-series data.
If you would like to follow along, you can do so by downloading the KDB-X Community Edition public preview from https://kdb-x.kx.com/. You can also follow the full tutorial on GitHub.
Let’s begin.
Create a time series dataset
To begin, we will download stocks.txt and create a sample dataset of 100 random symbols across 20 million rows to simulate trade data over time.
syms:100?`$read0 `:stocks.txt;   / pick 100 random symbols from stocks.txt
n:20000000;                      / 20 million rows
day:2025.01.01;
trade:([] time:asc (`timestamp$day) + n?24:00:00; sym:n?syms; price:n?100f; size:n?1000)   / random second offsets within the day, sorted ascending
The table is now available in memory. Let’s take a quick look at the row count and schema details using meta, then the first 10 rows using sublist.
meta trade
c | t f a
-----| -----
time | p s
sym | s
price| f
size | j
The following columns are produced when we run meta:
- c: column name
- t: column type
- f: foreign keys
- a: attributes (modifiers applied for performance optimization; see the short sketch after this list)
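Attributes drive much of the query performance we will see later, so here is a minimal sketch (not part of the tutorial) of inspecting and applying them; the last line is purely illustrative, showing the grouped (g#) attribute on an in-memory table.
attr trade`time            / `s - sorted, set automatically because we built the column with asc
`s#1 2 3 4                 / apply the sorted attribute explicitly to a list we know is ascending
update `g#sym from trade   / illustrative: grouped attribute for faster sym lookups in memory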
10 sublist trade
time sym price size
-------------------------------------------------
2025.01.01D00:00:00.000000000 MFICL 64.20376 597
2025.01.01D00:00:00.000000000 TEAM 30.63798 172
2025.01.01D00:00:00.000000000 RVYL 40.56048 879
2025.01.01D00:00:00.000000000 SIGI 57.2691 829
2025.01.01D00:00:00.000000000 DVSP 54.74414 658
2025.01.01D00:00:00.000000000 HYDR 61.67117 925
2025.01.01D00:00:00.000000000 ELAB 6.223127 784
2025.01.01D00:00:00.000000000 HYLS 75.65475 755
2025.01.01D00:00:00.000000000 WGMI 78.49312 596
2025.01.01D00:00:00.000000000 NRES 40.66333 747
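As a quick sanity check before saving anything, the row count should match the n we defined above:
count trade   / 20000000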
Save data to disk
With our dataset created, we will now persist it on disk. We will also factor in scaling by partitioning our data by date. By doing so, KDB-X can limit query scope to the relevant partitions, which is significantly faster than scanning an entire dataset.
We will start by defining our file paths:
homeDir:getenv[`HOME];
dbDir:homeDir,"/data";
dbPath:hsym `$dbDir;
Next, we will set default compression parameters via .z.zd: a logical block size of 2^17 bytes, compression algorithm 2 (gzip), and compression level 6.
.z.zd:(17;2;6)
To partition by date, we will use a built-in function, .Q.dpft[d;p;f;t], which saves data to a (d)atabase location, targeting a particular (p)artition and indexing the data on a chosen (f)ield for the specified (t)able.
.Q.dpft[dbPath;day;`sym;`trade]
Once persisted, the table name is returned. We can then test the command by deleting the trade table from memory and reloading from disk.
delete trade from `.;
system"l ",dbDir;
meta trade
c | t f a
-----| -----
date | d
sym | s p
time | p
price| f
size | j
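If you are curious what .Q.dpft actually wrote, and whether compression was applied, here is a small hedged check of the on-disk layout using the dbPath defined earlier:
key dbPath                                   / should list the sym enumeration file and the 2025.01.01 partition directory
key ` sv dbPath,`2025.01.01,`trade           / the splayed column files for that partition
-21!` sv dbPath,`2025.01.01,`trade,`price    / compression stats for one column file (empty dict if uncompressed)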
KDB-X provides several methods for storing tables:
- A flat table is the simplest form, held entirely in memory and suitable for small or temporary datasets
- A splayed table is stored on disk with each column saved as a separate file, enabling efficient column-wise access and better performance for larger datasets
- A partitioned table organizes splayed tables into subdirectories based on a partitioning column, typically a date, which is ideal for time-series data and allows fast access to specific partitions
- A segmented table adds another layer by distributing partitions across multiple root directories, often used in large-scale or distributed systems to support parallel access and high-throughput querying
When deciding which table format to use, consider the rate at which the table will grow, memory constraints, and performance expectations. A short sketch of the two simpler forms follows.
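For comparison with the partitioned save above, here is a minimal hedged sketch of the flat and splayed forms, assuming an in-memory table t shaped like our trade data; the paths are illustrative only:
`:/tmp/flatTrade set t                            / flat: the whole table in a single file
`:/tmp/db/splayedTrade/ set .Q.en[`:/tmp/db;t]    / splayed: trailing slash, one file per column, symbols enumerated via .Q.en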
Note: to make the data more representative of an enterprise workload, we duplicated our previously defined partition across multiple days. This produced roughly 10GB on disk, consisting of approximately 1 billion rows of data. If you are following along and would like to do the same, please refer to the full tutorial on GitHub.
Query data
We will now perform a series of basic time series queries.
Total trade volume/hour
In our first query, we will explore the total trade volume per hour, i.e. the total number of shares, contracts, or units traded in each hour.
This metric is crucial for:
- Market activity analysis: Identifying peak trading hours
- Liquidity assessment: Understanding when the market is most liquid
- Anomaly detection: Spotting unusual spikes or drops in trading volume
- Strategy calibration: Aligning algorithmic trading strategies with high-volume periods
symbol:first syms;
select sum size by date,60 xbar time.minute from trade where sym=symbol
2025.01.01 00:00 | 4168236
2025.01.01 01:00 | 4160249
2025.01.01 02:00 | 4186595
2025.01.01 03:00 | 4187285
2025.01.01 04:00 | 4180584
..
In this example, we are using qSQL, a built-in table query language similar to SQL.
- Just as in SQL, results are retrieved using select and from, and can be filtered by expressions following a where
- Multiple filter criteria, separated by commas, are evaluated starting from the left (see the sketch after this list)
- To group similar values, we can use the by clause. This is particularly useful in combination with an aggregation function, such as sum, max, or min
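To make the left-to-right evaluation concrete, here is a hedged variation of the query above; the extra size filter is purely illustrative:
select sum size by 60 xbar time.minute from trade where date=day, sym=symbol, size>500
Putting the date constraint first restricts the scan to a single partition before the sym and size filters run, which is the usual ordering for partitioned tables.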
Weighted average price and last trade price/15 minutes
In our second query, we will explore volume-weighted averages. Volume-weighted averages give a more accurate reflection of a stock’s price movement by incorporating trading volume at different price levels. This can be especially useful in determining whether price movement is driven by strong market participation or the result of a few trades.
select lastPx:last price, vwapPx:size wavg price by date, 15 xbar time.minute from trade where sym=symbol
date minute| lastPx vwapPx
-----------------| -----------------
2025.01.01 00:00 | 12.02315 49.7027
2025.01.01 00:15 | 89.32436 50.23902
2025.01.01 00:30 | 69.63196 49.84172
2025.01.01 00:45 | 45.60034 49.13936
2025.01.01 01:00 | 76.59549 49.59122
..
We can also test the performance of our query using \t, which reports the elapsed time in milliseconds:
\t select lastPx:last price, vwapPx:size wavg price by date, 15 xbar time.minute from trade where sym=symbol
On a typical desktop, the query should have taken around 1-2 seconds to process a billion records, efficiently aggregating the last price (lastPx) and volume-weighted average price (vwapPx) for these trades.
The use of by date, 15 xbar time.minute optimizes the grouping, keeping the computation fast and highlighting KDB-X’s exceptional performance in high-speed time-series analytics.
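If memory usage is also of interest, \ts reports both the elapsed time in milliseconds and the space used in bytes:
\ts select lastPx:last price, vwapPx:size wavg price by date, 15 xbar time.minute from trade where sym=symbol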
Matching trades with quotes
Like its predecessor, KDB-X features the asof join (aj), designed to match records from two tables based on the most recent timestamp. Unlike a standard SQL join, where records must match exactly on a key, an asof join finds the most recent match. This is particularly important with time-series data, where we often deal with information arriving at different intervals; a tiny standalone example follows the list below.
For example:
- Trade and quote data whereby a trade occurs at a given time, and we want to match it with the latest available quote
- Sensor data in which a sensor records temperature every second, while another logs environmental data every 10 seconds
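To see the most-recent-match behaviour in isolation, here is a small hedged example using hand-made tables; all names and values are illustrative:
t:([] time:09:30 09:31 09:32; sym:3#`AAA; price:10 10.5 10.2);
q0:([] time:09:29 09:31; sym:2#`AAA; bid:9.9 10.4);
aj[`sym`time;t;q0]   / each trade picks up the latest bid at or before its time: 9.9, 10.4, 10.4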
To begin, we will generate synthetic quote data for one day.
n:2000000;
quote:([] time:asc (`timestamp$day) + n?86400000000000; sym:n?syms; bid:n?100f; ask:n?100f)
Because this table was built in memory, we will need to apply the parted (p#) attribute to the sym column of the quote table before joining, matching the attribute our trade table on disk already has.
meta trade
c | t f a
-----| -----
date | d
sym | s p
time | p
price| f
size | j
This is crucial for optimizing asof joins and ensures faster lookups in symbol-based joins. We must also sort the table by sym using xasc before applying the parted attribute to quote.
quote:`sym xasc quote;
quote:update `p#sym from quote
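As a quick hedged check that the attribute has been applied:
attr quote`sym   / `p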
We can now perform an asof join and match each trade with the most recent available quote for today’s date.
aj[`sym`time; select from trade where date=day; quote]
date sym time price size bid ask
-----------------------------------------------------------------------------
2025.01.01 AAME 2025.01.01D00:00:00.000000000 11.13743 579
2025.01.01 AAME 2025.01.01D00:00:01.000000000 25.39669 530
2025.01.01 AAME 2025.01.01D00:00:02.000000000 52.84274 139
2025.01.01 AAME 2025.01.01D00:00:03.000000000 29.17217 227
2025.01.01 AAME 2025.01.01D00:00:03.000000000 95.41841 735
2025.01.01 AAME 2025.01.01D00:00:04.000000000 60.95445 995
2025.01.01 AAME 2025.01.01D00:00:04.000000000 63.20168 324
2025.01.01 AAME 2025.01.01D00:00:04.000000000 78.9044 684
..
This approach ensures that for every trade, we have the best available quote information, allowing traders to analyze trade execution relative to the prevailing bid/ask spread at the time.
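As a hedged next step, one could keep the joined result and measure how far each trade printed from the quote midpoint; the column names follow the tables above, and avgSlippage is an illustrative name:
tq:aj[`sym`time; select from trade where date=day; quote];
select avgSlippage:avg price-(bid+ask)%2 by sym from tq where not null bid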
Whether you’re building a dataset from scratch, performing high-volume aggregations, or joining trade and quote data with precision, KDB-X offers a unified environment for both experimentation and production-ready analytics.
Today, we explored the core steps to get started: creating large time-series datasets, persisting them with efficient partitioning, scaling to billions of records, and running high-performance queries using q.
If you enjoyed this blog and would like to explore other examples, you can visit our GitHub repository. You can also begin your journey with KDB-X by signing up for the KDB-X Community Edition Public Preview, where you can test, experiment, and build high-performance data-intensive applications with exclusive access to continuous feature updates, all at no cost.