Time Series And Historical Analysis

Tutorial: Master time series and historical analysis with KDB-X

Author

Michaela Woods

Developer Advocate

Key takeaways

  1. KDB-X is optimized for high-performance time-series analytics, handling billions of rows efficiently.
  2. Partitioning and storage format matter for scalability: choosing the right storage design directly impacts query speed.
  3. KDB-X/q provides a powerful, concise syntax, making time-based aggregations and asof joins significantly faster and more intuitive.

In data-driven industries like finance, every trade, price tick, or quote is accompanied by a precise timestamp, down to the millisecond, capturing the pulse of global markets.

But speed is only part of the story. Historical data plays a crucial role in everything from building trust to shaping strategy and meeting regulatory demands. Firms rely on it to backtest algorithms, model risk, and prove compliance when it matters most.

In this blog, I’ll introduce how to work with KDB-X to analyze large-scale time-series data like this and cover several concepts that are fundamental to working with it. I’ll show how to create a large time-series dataset, save it to a database on disk, perform time-based aggregations to analyze trends over time, and use asof joins (aj) to combine time-series data.

If you would like to follow along, you can do so by downloading the KDB-X Community Edition public preview from https://kdb-x.kx.com/. You can also follow the full tutorial on GitHub.

Let’s begin.

Create a time series dataset

To begin, we will download stocks.txt and create a sample dataset of 100 random symbols across 20 million rows to simulate trade data over time.

q
syms:100?`$read0 `:stocks.txt;   / draw 100 random symbols from stocks.txt
n:20000000;                      / 20 million rows
day:2025.01.01;
trade:([] time:asc (`timestamp$day) + n?24:00:00; sym:n?syms; price:n?100f; size:n?1000)

The table is now available in memory. Let’s take a quick look at the schema details using meta, then view the first 10 rows using sublist.

q
meta trade
c    | t f a
-----| -----
time | p   s
sym  | s    
price| f    
size | j   

meta returns one row per table column: c is the column name, t is the type character, f is any foreign key, and a is any attribute applied. Next, let’s preview the first 10 rows with sublist:

q
10 sublist trade
time                          sym   price    size
-------------------------------------------------
2025.01.01D00:00:00.000000000 MFICL 64.20376 597 
2025.01.01D00:00:00.000000000 TEAM  30.63798 172 
2025.01.01D00:00:00.000000000 RVYL  40.56048 879 
2025.01.01D00:00:00.000000000 SIGI  57.2691  829 
2025.01.01D00:00:00.000000000 DVSP  54.74414 658 
2025.01.01D00:00:00.000000000 HYDR  61.67117 925 
2025.01.01D00:00:00.000000000 ELAB  6.223127 784 
2025.01.01D00:00:00.000000000 HYLS  75.65475 755 
2025.01.01D00:00:00.000000000 WGMI  78.49312 596 
2025.01.01D00:00:00.000000000 NRES  40.66333 747 

Save data to disk

With our dataset created, we will now persist it on disk. We will also factor in scaling by partitioning our data by date. By doing so, KDB-X can limit query scope to the relevant partitions, which is significantly faster than scanning an entire dataset.

We will start by defining our file paths:

q
homeDir:getenv[`HOME];
dbDir:homeDir,"/data";
dbPath:hsym `$dbDir;

Next, we will set compression parameters via .z.zd, which takes a list of (logical block size as a power of 2; compression algorithm; compression level). Here, 17 means 2^17-byte blocks, algorithm 2 is gzip, and 6 is the gzip compression level.

q
.z.zd:(17;2;6)

To partition by date, we will use the inbuilt function .Q.dpft[d;p;f;t], which saves data to a (d)atabase location, targeting a particular (p)artition and indexing the data on a chosen (f)ield for the specified (t)able.

q
.Q.dpft[dbPath;day;`sym;`trade]

Once the data is persisted, the function returns the table name. We can then test the setup by deleting the trade table from memory and reloading it from disk.

q
delete trade from `.;    / remove the in-memory copy of trade
system"l ",dbDir;        / load the partitioned database from disk
meta trade
c    | t f a
-----| -----
date | d    
sym  | s   p
time | p    
price| f    
size | j    
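
Since we enabled compression via .z.zd before writing, we can check that it took effect. In kdb+/q, the internal function -21! returns compression statistics for a file; here is a quick check, assuming the partition path written by .Q.dpft above:

q
/ compression statistics for one column file in the 2025.01.01 partition
-21! hsym `$dbDir,"/2025.01.01/trade/price"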

KDB-X provides several methods for storing tables:

  • A flat table is the simplest form, held entirely in memory and suitable for small or temporary datasets
  • A splayed table is stored on disk with each column saved as a separate file, enabling efficient column-wise access and better performance for larger datasets
  • A partitioned table organizes splayed tables into subdirectories based on a partitioning column, typically a date, which is ideal for time-series data and allows fast access to specific partitions
  • A segmented table adds another layer by distributing partitions across multiple root directories, often used in large-scale or distributed systems to support parallel access and high-throughput querying

When deciding which storage format to use, consider the rate at which the table will grow, memory constraints, and performance expectations.
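
For comparison, here is a minimal sketch of how a splayed (non-partitioned) table could be written; the trade_splayed directory name is hypothetical, and symbol columns must be enumerated with .Q.en before splaying:

q
/ splay one day of trades to its own directory; the trailing / triggers a column-per-file write
(hsym `$dbDir,"/trade_splayed/") set .Q.en[dbPath] select from trade where date=day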

Note: to make the data more representative of an enterprise workload, we duplicated our previously defined partition across multiple days. This produced roughly 10GB of data on disk, consisting of approximately 1 billion rows. If you are following along and would like to do the same, please refer to the full tutorial on GitHub.
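
If you want a quick, rough way to do this yourself, one option (a hedged sketch, not necessarily the approach used in the GitHub tutorial) is to copy the existing partition directory to additional dates at the OS level and reload the database:

q
/ copy the 2025.01.01 partition to 49 further dates (Linux/macOS cp), then reload the database
{system "cp -r ",dbDir,"/2025.01.01 ",dbDir,"/",string x} each day + 1 + til 49;
system"l ",dbDir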

Query data

We will now perform a series of basic time series queries.

Total trade volume/hour

In our first query, we will compute the total trade volume per hour, i.e. the total number of shares, contracts, or units traded in each hour.

This metric is crucial for:

  • Market activity analysis: Identifying peak trading hours
  • Liquidity assessment: Understanding when the market is most liquid
  • Anomaly detection: Spotting unusual spikes or drops in trading volume
  • Strategy calibration: Aligning algorithmic trading strategies with high-volume periods
q
symbol:first syms;
select sum size by date,60 xbar time.minute from trade where sym=symbol
date       minute| size
-----------------| -------
2025.01.01 00:00 | 4168236
2025.01.01 01:00 | 4160249
2025.01.01 02:00 | 4186595
2025.01.01 03:00 | 4187285
2025.01.01 04:00 | 4180584
..

In this example, we are using qSQL, an inbuilt table query language similar to SQL.

  • Just as in SQL, results are retrieved using select and from, and can be filtered by expressions following a where
  • Multiple filter criteria, separated by commas, are evaluated starting from the left
  • To group similar values, we can use the by clause. This is particularly useful in combination with an aggregation function, such as sum, max, or min
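
For example, the following illustrative query combines several where criteria, evaluated left to right (with the partition column first), with a grouped aggregation:

q
/ maximum trade price per hour for one symbol, considering only trades larger than 500 units
select maxPx:max price by date, 60 xbar time.minute from trade where date=day, sym=symbol, size>500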

Weighted average price and last trade price/15 minutes

In our second query, we will explore volume-weighted averages. Volume-weighted averages give a more accurate reflection of a stock’s price movement by incorporating trading volume at different price levels. This can be especially useful in determining whether price movement is driven by strong market participation or the result of a few trades.
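
Under the hood, wavg is simply the weighted mean: the sum of weights times values, divided by the sum of the weights. A tiny standalone illustration with made-up numbers:

q
/ (100*10 + 200*20 + 300*30) % (100+200+300) = 23.33333
100 200 300 wavg 10 20 30f

Applying this to the trade table, with size as the weight and price as the value: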

q
select lastPx:last price, vwapPx:size wavg price by date, 15 xbar time.minute from trade where sym=symbol
date       minute| lastPx   vwapPx  
-----------------| -----------------
2025.01.01 00:00 | 12.02315 49.7027 
2025.01.01 00:15 | 89.32436 50.23902
2025.01.01 00:30 | 69.63196 49.84172
2025.01.01 00:45 | 45.60034 49.13936
2025.01.01 01:00 | 76.59549 49.59122
..

We can also test the performance of our query using \t

q
\t select lastPx:last price, vwapPx:size wavg price by date, 15 xbar time.minute from trade where sym=symbol

On a typical desktop, this query takes around 1-2 seconds to process a billion records, efficiently aggregating the last price (lastPx) and volume-weighted average price (vwapPx) for these trades.

The use of by date, 15 xbar time.minute optimizes the grouping, keeping the computation fast and highlighting KDB-X’s exceptional performance in high-speed time-series analytics.

Matching trades with quotes

Like its predecessor, KDB-X features the asof join (aj), designed to match records from two tables based on the most recent timestamp. Unlike a standard SQL join, where records must match exactly on a key, an asof join finds the most recent match. This is particularly important with time-series data, where we often deal with information arriving at different intervals.

For example:

  • Trade and quote data whereby a trade occurs at a given time, and we want to match it with the latest available quote
  • Sensor data in which a sensor records temperature every second, while another logs environmental data every 10 seconds
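
To see the semantics in isolation, here is a toy example with made-up values, where each trade picks up the most recent quote at or before its time:

q
toyTrade:([] sym:3#`AAPL; time:09:00 09:05 09:10; price:101 102 103f);
toyQuote:([] sym:2#`AAPL; time:08:59 09:04; bid:100.5 101.5);
aj[`sym`time;toyTrade;toyQuote]    / 09:00 matches the 08:59 bid; 09:05 and 09:10 match the 09:04 bid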

To begin, we will generate synthetic quote data for one day:

q
n:2000000;
quote:([] time:asc (`timestamp$day) + n?86400000000000; sym:n?syms; bid:n?100f; ask:n?100f)

Because this table was just created in memory, we need to apply the parted (p#) attribute to its sym column before joining; our trade table on disk already has the parted attribute.

q
meta trade
c    | t f a
-----| -----
date | d    
sym  | s   p
time | p    
price| f    
size | j    

This is crucial for optimizing the asof join and ensures faster lookups for symbol-based joins. We must also sort the table by sym using xasc before applying the parted attribute to quote.

q
quote:`sym xasc quote;
quote:update `p#sym from quote
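
A quick sanity check confirms the attribute is in place:

q
attr quote[`sym]    / returns `p, confirming the parted attribute on the sym column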

We can now perform an asof join to match each trade with the most recent available quote for today’s date.

q
aj[`sym`time; select from trade where date=day; quote]
date       sym  time                          price    size bid      ask     
-----------------------------------------------------------------------------
2025.01.01 AAME 2025.01.01D00:00:00.000000000 11.13743 579                   
2025.01.01 AAME 2025.01.01D00:00:01.000000000 25.39669 530                   
2025.01.01 AAME 2025.01.01D00:00:02.000000000 52.84274 139                   
2025.01.01 AAME 2025.01.01D00:00:03.000000000 29.17217 227                   
2025.01.01 AAME 2025.01.01D00:00:03.000000000 95.41841 735                   
2025.01.01 AAME 2025.01.01D00:00:04.000000000 60.95445 995                   
2025.01.01 AAME 2025.01.01D00:00:04.000000000 63.20168 324                   
2025.01.01 AAME 2025.01.01D00:00:04.000000000 78.9044  684                   
..

This approach ensures that for every trade, we have the best available quote information, allowing traders to analyze trade execution relative to the prevailing bid/ask spread at the time.

Whether you’re building a dataset from scratch, performing high-volume aggregations, or joining trade and quote data with precision, KDB-X offers a unified environment for both experimentation and production-ready analytics.

Today, we explored the core steps to get started: creating large time-series datasets, persisting them with efficient partitioning, scaling to billions of records, and running high-performance queries using q.

If you enjoyed this blog and would like to explore other examples, you can visit our GitHub repository. You can also begin your journey with KDB-X by signing up for the KDB-X Community Edition Public Preview, where you can test, experiment, and build high-performance data-intensive applications with exclusive access to continuous feature updates, all at no cost.
