Mastering time series and historical query analysis with KDB-X

Author

Michaela Woods

Developer Advocate

Key Takeaways

  1. KDB-X is optimized for high-performance time-series analytics, handling billions of rows efficiently.
  2. Partitioning and storage format matter for scalability: choosing the right storage design directly impacts query speed.
  3. KDB-X/q provides a powerful, concise syntax, making time-based aggregations and asof joins significantly faster and more intuitive.

In data-driven industries like finance, every trade, price tick, or quote is accompanied by a precise timestamp, down to the millisecond, capturing the pulse of global markets.

But speed is only part of the story. Historical data plays a crucial role in everything from building trust to shaping strategy and meeting regulatory demands. Firms rely on it to backtest algorithms, model risk, and prove compliance when it matters most.

In this blog, I’ll introduce how to work with KDB-X to analyze large-scale time-series data like this and discuss several key concepts that are fundamental to working with KDB-X. I’ll cover how to create a large time series dataset, save to a database on disk, perform time-based aggregations to analyze trends over time, and asof joins (aj) to combine time series data.

If you would like to follow along, you can do so by downloading the KDB-X Community Edition public preview from https://kdb-x.kx.com/. You can also follow the full tutorial on GitHub.

Let’s begin.

Create a time series dataset

To begin, we will download stocks.txt and create a sample dataset of 100 random symbols across 20 million rows to simulate trade data over time.

q
syms:100?`$read0 `:stocks.txt;
n:20000000;
day:2025.01.01;
trade:([] time:asc (`timestamp$day) + n?24:00:00; sym:n?syms; price:n?100f; size:n?1000)

The table is now available in memory. Let’s take a quick look at the schema using meta, then at the first 10 rows using sublist.

q
meta trade
c    | t f a
-----| -----
time | p   s
sym  | s    
price| f    
size | j   

In the meta result, c is the column name, t the type, f any foreign key, and a the attribute applied to the column. Here, time is a timestamp (p) with the sorted (s) attribute, sym is a symbol, price a float, and size a long. Next, we preview the first 10 rows with sublist:

q
10 sublist trade
time                          sym   price    size
-------------------------------------------------
2025.01.01D00:00:00.000000000 MFICL 64.20376 597 
2025.01.01D00:00:00.000000000 TEAM  30.63798 172 
2025.01.01D00:00:00.000000000 RVYL  40.56048 879 
2025.01.01D00:00:00.000000000 SIGI  57.2691  829 
2025.01.01D00:00:00.000000000 DVSP  54.74414 658 
2025.01.01D00:00:00.000000000 HYDR  61.67117 925 
2025.01.01D00:00:00.000000000 ELAB  6.223127 784 
2025.01.01D00:00:00.000000000 HYLS  75.65475 755 
2025.01.01D00:00:00.000000000 WGMI  78.49312 596 
2025.01.01D00:00:00.000000000 NRES  40.66333 747 

Save data to disk

With our dataset created, we will now persist it on disk. We will also factor in scaling by partitioning our data by date. By doing so, KDB-X can limit query scope to the relevant partitions, which is significantly faster than scanning an entire dataset.

We will start by defining our file paths:

q
homeDir:getenv[`HOME];
dbDir:homeDir,"/data";
dbPath:hsym `$dbDir;

Next, we will set compression parameters via .z.zd, which takes a list of (logical block size as a power of 2; compression algorithm; compression level). The values below select 128kB blocks, gzip compression, and compression level 6.

q
.z.zd:(17;2;6)

To partition by date, we will use an inbuilt function .Q.dpft[d;p;f;t], which saves data to a (d)atabase location, targeting a particular (p)artition, and applies the parted attribute to a chosen (f)ield for the specified (t)able.

q
.Q.dpft[dbPath;day;`sym;`trade]

Once the data is persisted, .Q.dpft returns the table name. We can then test the save by deleting the trade table from memory and reloading the database from disk.

q
delete trade from `.;           
system"l ",dbDir;    
meta trade
c    | t f a
-----| -----
date | d    
sym  | s   p
time | p    
price| f    
size | j    
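
Since compression was enabled via .z.zd before the write, we can confirm it took effect by inspecting any persisted column file with -21!, which returns a dictionary of compression statistics (algorithm, block size, compressed and uncompressed lengths). A minimal sketch, assuming the directory layout created above:

q
-21! hsym `$dbDir,"/2025.01.01/trade/price"    / empty result would mean the file is not compressed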

KDB-X provides several methods for storing tables:

  • A flat table is the simplest form, held entirely in memory and suitable for small or temporary datasets
  • A splayed table is stored on disk with each column saved as a separate file, enabling efficient column-wise access and better performance for larger datasets
  • A partitioned table organizes splayed tables into subdirectories based on a partitioning column, typically a date, which is ideal for time-series data and allows fast access to specific partitions
  • A segmented table adds another layer by distributing partitions across multiple root directories, often used in large-scale or distributed systems to support parallel access and high-throughput querying

When deciding which storage format to use, consider the rate at which the table will grow, memory constraints, and performance expectations.
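
As a point of comparison with the partitioned layout used above, here is a minimal sketch of splaying a small standalone table; the table name tmp and the path /tmp/splayDemo are purely illustrative. Symbol columns must be enumerated with .Q.en before a table can be splayed:

q
tmp:([] sym:`VOD`IBM`MSFT; px:1.1 2.2 3.3);
`:/tmp/splayDemo/tmp/ set .Q.en[`:/tmp/splayDemo] tmp    / the trailing / saves the table column-by-column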

Note: to make the data more representative of an enterprise workload, we duplicated our previously defined partition across multiple days. This generated roughly 10GB on disk, consisting of approximately 1 billion rows of data. If you are following along and would like to do the same, please refer to the full tutorial on GitHub.
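
If you do the same, a quick hedged check that the expanded database is in place is to reload it and look at the partition list and total mapped row count:

q
system"l ",dbDir;    / reload so the new date partitions are mapped
date                 / the partition domain: one entry per date directory
count trade          / total rows across all partitions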

Query data

We will now perform a series of basic time series queries.

Total trade volume/hour

In our first query, we will explore the total trade volume per hour, which aggregates the number of shares, contracts, or units traded.

This metric is crucial for:

  • Market activity analysis: Identifying peak trading hours
  • Liquidity assessment: Understanding when the market is most liquid
  • Anomaly detection: Spotting unusual spikes or drops in trading volume
  • Strategy calibration: Aligning algorithmic trading strategies with high-volume periods

q
symbol:first syms;
select sum size by date,60 xbar time.minute from trade where sym=symbol
date       minute| size   
-----------------| -------
2025.01.01 00:00 | 4168236
2025.01.01 01:00 | 4160249
2025.01.01 02:00 | 4186595
2025.01.01 03:00 | 4187285
2025.01.01 04:00 | 4180584
..

In this example, we are using qSQL, an inbuilt table query language similar to SQL.

  • Just as in SQL, table results are called using select and from, and can be filtered by expressions following a where
  • Multiple filter criteria, separated by ,, are evaluated starting from the left
  • To group similar values, we can use the by clause. This is particularly useful in combination with an aggregation function, such as sum, max, or min
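
Putting these together, a small illustrative query (not part of the original tutorial) groups by symbol while applying two filter criteria; the leftmost date filter restricts the scan to a single partition before the price filter is applied:

q
select trades:count i, maxSize:max size by sym from trade where date=day, price>90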

Weighted average price and last trade price/15 minutes

In our second query, we will explore volume-weighted averages. Volume-weighted averages give a more accurate reflection of a stock’s price movement by incorporating trading volume at different price levels. This can be especially useful in determining whether price movement is driven by strong market participation or the result of a few trades.

q
select lastPx:last price, vwapPx:size wavg price by date, 15 xbar time.minute from trade where sym=symbol
date       minute| lastPx   vwapPx  
-----------------| -----------------
2025.01.01 00:00 | 12.02315 49.7027 
2025.01.01 00:15 | 89.32436 50.23902
2025.01.01 00:30 | 69.63196 49.84172
2025.01.01 00:45 | 45.60034 49.13936
2025.01.01 01:00 | 76.59549 49.59122
..

We can also test the performance of our query using \t, which returns the elapsed time in milliseconds:

q
\t select lastPx:last price, vwapPx:size wavg price by date, 15 xbar time.minute from trade where sym=symbol

On a typical desktop, the query should take around 1-2 seconds to process a billion records, efficiently aggregating the last price (lastPx) and volume-weighted average price (vwapPx) for these trades.

The grouping by date, 15 xbar time.minute keeps the computation fast and highlights KDB-X’s strength in high-speed time-series analytics.
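
If you also want to see memory usage, the related system command \ts reports both elapsed milliseconds and bytes allocated, for example:

q
\ts select lastPx:last price, vwapPx:size wavg price by date, 15 xbar time.minute from trade where sym=symbol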

Matching trades with quotes

Like its predecessor, KDB-X features the asof join (aj), designed to match records from two tables based on the most recent timestamp. Unlike a standard SQL join, where records must match exactly on a key, an asof join finds the most recent match. This is particularly important with time-series data, where we often deal with information arriving at different intervals.

For example:

  • Trade and quote data whereby a trade occurs at a given time, and we want to match it with the latest available quote
  • Sensor data in which a sensor records temperature every second, while another logs environmental data every 10 seconds
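
The second case can be shown with a toy sketch using two small hypothetical in-memory tables (environmental settings logged every 10 minutes, readings arriving in between); each reading picks up the most recent setting at or before its own time:

q
s:([] time:00:00 00:10 00:20; sym:3#`env; level:1 2 3);
r:([] time:00:05 00:12; sym:2#`env; reading:10 20f);
aj[`sym`time; r; s]    / the 00:05 reading matches level 1, the 00:12 reading matches level 2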

To begin, we will generate synthetic quote data for one day:

q
n:2000000;
quote:([] time:asc (`timestamp$day) + n?86400000000000; sym:n?syms; bid:n?100f; ask:n?100f)

Because this table was built in memory, we need to apply the parted (p#) attribute to its sym column before joining. Our trade table on disk already has this attribute, as the a column of its meta shows:

q
meta trade
c    | t f a
-----| -----
date | d    
sym  | s   p
time | p    
price| f    
size | j    

This is crucial for optimizing the asof join and ensures faster lookups for symbol-based matching. Because the parted attribute can only be applied to a column whose equal values are stored contiguously, we must first sort the quote table by sym using xasc:

q
quote:`sym xasc quote;
quote:update `p#sym from quote
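
As an optional check, attr returns the attribute of a list, so indexing the sym column should now report the parted attribute:

q
attr quote`sym    / returns `p once the parted attribute is in place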

We can now perform an asof join to match each trade with the most recent available quote on our chosen date.

q
aj[`sym`time; select from trade where date=day; quote]
date       sym  time                          price    size bid      ask     
-----------------------------------------------------------------------------
2025.01.01 AAME 2025.01.01D00:00:00.000000000 11.13743 579                   
2025.01.01 AAME 2025.01.01D00:00:01.000000000 25.39669 530                   
2025.01.01 AAME 2025.01.01D00:00:02.000000000 52.84274 139                   
2025.01.01 AAME 2025.01.01D00:00:03.000000000 29.17217 227                   
2025.01.01 AAME 2025.01.01D00:00:03.000000000 95.41841 735                   
2025.01.01 AAME 2025.01.01D00:00:04.000000000 60.95445 995                   
2025.01.01 AAME 2025.01.01D00:00:04.000000000 63.20168 324                   
2025.01.01 AAME 2025.01.01D00:00:04.000000000 78.9044  684                   
..

This approach ensures that for every trade, we have the best available quote information, allowing traders to analyze trade execution relative to the prevailing bid/ask spread at the time.
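
Building on this, a hedged sketch of one follow-up analysis (the names tq and slippage are illustrative, not part of the original tutorial): store the joined result and compare each trade price with the quote midpoint for a rough per-symbol measure of execution relative to the spread. Trades with no matching quote contribute nulls, which avg ignores:

q
tq:aj[`sym`time; select from trade where date=day; quote];
select slippage:avg price-(bid+ask)%2 by sym from tq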

Whether you’re building a dataset from scratch, performing high-volume aggregations, or joining trade and quote data with precision, KDB-X offers a unified environment for both experimentation and production-ready analytics.

Today, we explored the core steps to get started: creating large time-series datasets, persisting them with efficient partitioning, scaling to billions of records, and running high-performance queries using q.

If you enjoyed this blog and would like to explore other examples, you can visit our GitHub repository. You can also begin your journey with KDB-X by signing up for the KDB-X Community Edition Public Preview, where you can test, experiment, and build high-performance data-intensive applications with exclusive access to continuous feature updates, all at no cost.
