GPU acceleration in KDB-X: Supercharging as-of joins and sorting

Author

Ryan Siegler

Data Scientist

Key points

  1. GPU acceleration in KDB-X delivers 4–10× faster performance for core operations like as-of joins and sorting on large-scale time-series data.
  2. NVIDIA CUDA and cuDF integration bring massively parallel processing to KDB-X while minimizing data movement between host and device memory.
  3. End-of-day workloads benefit dramatically, with GPUDirect Storage enabling direct I/O from disk to GPU memory for faster, more efficient pipelines.
  4. Multi-GPU scaling with NVIDIA H100s shows near-linear performance gains, cutting complex risk simulations like VaR from seconds to milliseconds.
  5. The KX incubation team is pioneering GPU-backed extensions that make KDB-X ready for the next generation of high-performance financial and analytical workloads.

End-of-day workloads just got faster.

High-frequency trading, tick-level market data, and time-series analytics are well suited to KDB-X's CPU-based architecture. But two core operations in many market data processing pipelines, as-of joins and sorting, are computationally expensive, especially when datasets scale to billions of rows.

To reduce the bottlenecks of these expensive operations, we can offload them to GPUs, specifically via NVIDIA CUDA kernels. This blog highlights how the KX incubation team is developing GPU-backed extensions that accelerate various table and mathematical operations by multiples, while minimizing data movement between host and device memory.

This technology is currently in development by the KX incubation team and has not yet reached general availability (GA). The article offers a preview of these powerful NVIDIA-accelerated features, and it’s the first in a planned series of blogs detailing our progress in accelerating core financial workloads.

Why GPUs for KDB-X?

CPUs excel at low-latency, branch-heavy, complex operations, but struggle in several areas as data volumes continue to grow:

  • Massive parallelism needs: Operations like per-symbol binary searches or global sorts parallelize poorly across a limited number of CPU cores
  • Memory bandwidth: CPUs are bottlenecked when scanning large tables
  • EOD processing: Tasks such as splayed table sorts and joins during end-of-day workflows can run into hours on CPU

GPUs invert this trade-off. With thousands of cores and high-bandwidth memory, they thrive on repetitive, parallel tasks like "sort millions of rows by key." Moore's Law alone is no longer enough to keep up; NVIDIA GPUs provide the additional acceleration. Paired with cuDF, NVIDIA's GPU dataframe library, we can bring these primitives directly to KDB-X.

Inside the KDB-X GPU acceleration layer: Architecture overview

The NVIDIA acceleration layer for KDB-X adds a .gpu namespace, exposing:

Table movement:

.gpu.tableToGPU → copy table to device memory

.gpu.tableFromGPU → bring results back to host

Join & sort primitives:

.gpu.aj → as-of join on GPU

.gpu.xasc → ascending sort by columns

.gpu.iasc + .gpu.gather → index-based sorting pattern (for top-N or partial sorts)

I/O shortcuts:

.gpu.loadKdbTable and .gpu.saveKdbTable use NVIDIA GPUDirect Storage to stream splayed tables directly into GPU memory, bypassing CPU memory.

The goal is to keep data on the GPU across multiple steps — join, sort, aggregate — and only return the final result to CPU memory when needed.
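
As a hedged sketch of that pattern (the .gpu API shown here is pre-GA, so exact behavior may change), a minimal pipeline moves a table to the device once, keeps it there for the sort, and copies only the final result back:

q
/ Hedged sketch of the on-device pattern (pre-GA API; may change before release)
g:.gpu.tableToGPU t            / t: a host-resident trades table (hypothetical); one copy to device
g:.gpu.xasc[`sym`time] g       / sort runs entirely in GPU memory
res:.gpu.tableFromGPU g        / single copy back to the host at the end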

As-of joins on GPU

As-of joins are a foundational building block in time-series analytics, powering many of the most data-intensive workflows in finance and beyond:

  • Trade–quote alignment – match each trade to the most recent quote to normalize tick data.
  • Order book reconstruction – merge incremental order updates into full book snapshots.
  • Signal and feature alignment – synchronize slower-moving analytics or model features with trade events.
  • Portfolio valuation and risk snapshots – map positions to the latest prices or risk factors for real-time VaR and P&L.
  • IoT and sensor data synchronization – align readings from asynchronous telemetry streams in industrial or energy applications.

How CPU handles it

On a CPU, aj scans backwards or performs a binary search per symbol. With millions of trades and quotes, the CPU quickly saturates.
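
For intuition, the per-symbol lookup inside aj is exactly what q's bin operator computes on a sorted time vector; the toy values below are hypothetical:

q
/ Toy illustration (hypothetical values) of the per-symbol lookup inside aj:
/ bin returns, for each trade time, the index of the last quote time <= it
qt:0 2 5 9 12      / quote times for one symbol, sorted ascending
tt:1 5 11          / trade times for the same symbol
qt bin tt          / -> 0 2 3: the prevailing quote index per trade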

How GPU handles it

On a GPU, the join keys (symbols, times) are transferred once to device memory. Then, thousands of GPU threads perform binary searches in parallel across symbols. Each trade’s “find quote” operation becomes highly parallel.

q
/ Generate trade & quote tables (simplified) 
n:1000000; 
t:([] sym:n?`AAPL`MSFT`GOOG; time:n?1000000; price:n?100f); 
q:([] sym:n?`AAPL`MSFT`GOOG; time:n?1000000; bid:n?100f; ask:n?100f); 
 
/ Move join keys to device 
S:(.gpu.toDevice t`sym;.gpu.toDevice `g#q`sym); 
T:(.gpu.toDevice `long$t`time;.gpu.toDevice `long$q`time); 
 
/ CPU baseline 
\t aj[`sym`time;t;q]; 
 
/ GPU accelerated 
\t .gpu.aj[`sym`time!(S;T);t;q]; 

On an NVIDIA L4 GPU, a 1 million-row join took ~48 ms, versus ~196 ms on a cost-equivalent 48-core CPU, a 4× speedup. Wider quote tables (more columns per row) show even larger gains, as the GPU's bandwidth handles more payload per join.

We also tested the same workload on an NVIDIA A100 GPU (80 GB HBM2e), scaling the dataset to 10 million rows across five symbols. The A100 completed the as-of join in ~31 ms, compared to ~172 ms on CPU — delivering a 5.5× improvement.

At 100 million rows, the GPU maintained sub-second latency while the CPU extended past 4 seconds, showing the scalability of parallel binary search across symbols. These results highlight how the A100’s higher memory bandwidth and larger on-device capacity benefit real-time tick data matching.

Sorting on GPU

Sorting is the backbone of both intraday and EOD workflows:

  • Tick stream ordering – ensure trades and quotes are time-sequenced for replay and backtesting
  • End-of-day (EOD) batch processing – re-sort large splayed tables by sym,time before aggregation or archival
  • Order book reconstruction – maintain bid/ask levels sorted by price or timestamp for efficient state updates
  • Top-of-book extraction – use sorted data to retrieve best bid/ask or top-N orders quickly
  • Trade ranking and leaderboards – sort by trade size, notional, or P&L for analytics dashboards
  • Portfolio performance ranking – order positions by risk, exposure, or return for VaR and P&L reporting
  • Windowed analytics – sort by time to enable rolling or windowed computations (e.g., VWAP, moving averages)
  • Data validation and deduplication – sort incoming feeds to detect out-of-order or duplicate records

Method 1: Full table sort with .gpu.xasc

In standard q, xasc takes a table and returns a new table with its rows fully reordered in ascending order by the specified columns. The GPU version, .gpu.xasc, behaves the same way — but performs the entire sort on the GPU using cuDF’s parallel sort algorithms.

This is the simplest approach: move a table to the GPU, sort it end-to-end by one or more keys, and (optionally) bring the fully sorted result back to the CPU.

q
/ Move table to GPU  
g:.gpu.tableToGPU t 
 
/ 1) Sort the GPU table g on device by sym then time ascending 
gsorted:.gpu.xasc[`sym`time] g 
 
/ 2) Copy the sorted results back from GPU to host as a KDB-X table 
t_sorted:.gpu.tableFromGPU gsorted 

If you want the top-N rows, you can just sublist from gsorted. But note: .gpu.xasc has already reordered the entire table before you slice. That’s fine when you need the full sort anyway (e.g., end-of-day processing), but wasteful if you only need a small subset.
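
As a small sketch continuing from t_sorted above, the top-N slice after a full sort is just a host-side sublist:

q
/ First 1,000 rows of the fully sorted table (host side, continuing from above)
top1000:1000 sublist t_sorted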

Benchmarks show that full-table GPU sorts outperform CPU by 5–10× once table sizes exceed tens of millions of rows.

When we ran the same sorting workflow on an A100 GPU, sorting 10 million ticks by sym,time took ~42 ms, compared to ~320 ms on CPU — an 8× performance improvement.

For large EOD splayed datasets (hundreds of millions of rows), .gpu.xasc on the A100 reduced total sort time from minutes to seconds, thanks to the GPU’s 80 GB of high-bandwidth memory keeping full-day partitions resident on device.

Method 2: Index-based sort with .gpu.iasc

In q, iasc doesn’t sort the data itself — it returns the indexes (grade vector) that would order the list or table ascending. You can then apply those indexes to reorder rows.

On the GPU, .gpu.iasc computes that permutation vector directly on device. The key advantage is that you can slice the permutation before gathering, so you only materialize the rows you need:

q
/ Move table to GPU device 
g: .gpu.tableToGPU t 
 
/ 1) Compute device-side grade (row permutation) for multi-column sort 
idx: .gpu.iasc[`sym`time] g 
 
/ 2) (Optional) Slice the permutation for top-N without full reorder 
idxTop: .gpu.sublist[idx; 1000] 
 
/ 3) Gather rows on device using the permutation 
g_top: .gpu.gather[g; idxTop] 
 
/ 4) (Optional) Bring back to host 
topN: .gpu.tableFromGPU g_top 

This “index-then-gather” approach avoids fully reordering the entire table when you only need part of it. That makes it ideal for low-latency use cases like leaderboards or top-of-book queries, where top-N results are enough.
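
For comparison, here is a minimal plain-q sketch of the same index-then-gather pattern on the CPU, using the trade table t from earlier:

q
/ CPU-side equivalent of index-then-gather in plain q
idx:iasc `sym`time#t        / grade vector for a sym-then-time ordering
idxTop:1000 sublist idx     / keep only the first 1,000 positions
topN:t idxTop               / materialize just those rows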

Summary:

  • Use .gpu.xasc when you need the entire table sorted and available.
  • Use .gpu.iasc when you only need a slice of the sorted data; it saves work by slicing the permutation before gathering rows.

End-of-day workflows: GPUDirect for splayed tables

EOD jobs often involve massive splayed tables stored on disk. Traditionally, the CPU must first load these into memory, then re-sort or join, then write the results back. With GPUDirect Storage (cuFile), we bypass CPU memory entirely:

q
/ Load directly to device memory 
g:.gpu.loadKdbTable `:data/quotes 
 
/ Perform sort & join on GPU 
g_sorted:.gpu.xasc[`sym`time] g 
 
/ Write results back as splayed 
.gpu.saveKdbTable[g_sorted; `:data/sortedQuotes]

This reduces I/O overhead dramatically. The first load incurs some cuFile initialization cost, but subsequent loads/writes are fast.
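
One simple way to see that initialization cost (a hedged sketch using the pre-GA .gpu.loadKdbTable shown above) is to time the same load twice:

q
/ First call pays the one-off cuFile / GPUDirect initialization
\t g1:.gpu.loadKdbTable `:data/quotes
/ Subsequent calls reflect the steady-state load time
\t g2:.gpu.loadKdbTable `:data/quotes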

Benchmarks and observations

  • As-of joins: 4× faster with NVIDIA acceleration vs CPU on the 1M-row test; improvements grow with table width
  • Sorting: 5–10× speedups for large tables; partial sorts with .gpu.sublist avoid full reordering
  • I/O: GPUDirect keeps PCIe transfers minimal, especially when chaining multiple GPU operations

These speedups scale better as data size grows. CPUs flatten out with memory contention, while GPUs maintain throughput by leveraging parallelism and bandwidth.

Multi-GPU acceleration: VaR on five NVIDIA H100s

To push our GPU acceleration benchmarks beyond the midrange L4 configuration, we ran Value-at-Risk (VaR) calculations on a cluster of five NVIDIA H100 GPUs. Each device processes one day of scenario data in parallel, demonstrating how KDB-X can scale across multiple GPUs with minimal code changes.

In this configuration, each GPU loads its partition of the risk dataset directly from disk using .gpu.loadKdbTable, executes the VaR computation on-device with .gpu.aj and .gpu.xasc, and returns only the aggregated percentile results to host memory. The parallelism here is both intra-device (thousands of CUDA cores per GPU) and inter-device (multi-GPU concurrency).

q
/ Creates a list linking each GPU (1–5) to the folder containing that day’s risk data. 
L:{(x;`$":/home/ubuntu/data/risk/db/2025.01.0",string[x],"/wsp/")} each 1+til 5 

/ In parallel: select the GPU with .gpu.sdev x[0], load that day's data, return (deviceId; gpuTable) 
\t D:{.gpu.sdev x[0]; (x[0];.gpu.loadKdbTable x[1])} peach L 

Each GPU handles a full trading day, distributing the computation of the 95% VaR across millions of simulated price paths. Once resident on the device, VaR calculations are performed entirely within GPU memory:

q
/ CPU baseline
\t calcVar[2025.01.01; `scenario_id`id1; 95]

/ GPU accelerated across 5× H100s
\t calcVarGPU[; `scenario_id`id1; 95] peach D

On the CPU-only baseline, a five-day VaR run completed in ~15.8 s. Using five H100s, the same workload finished in ~0.27 s, representing a ~58× speedup. Even single-day VaR calculations saw gains of 4–6× versus the 48-core CPU benchmark, with identical results across both CPU and GPU paths (-433,797.4 at 95% confidence).
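
For readers less familiar with VaR, the percentile step at its core can be written in a few lines of plain q. This is intuition only, over hypothetical data; it is not the calcVar / calcVarGPU implementation used for the benchmark:

q
/ Intuition only: 95% VaR as the 5th percentile of simulated P&L (hypothetical data)
n:1000000
pnl:-1000000f+n?2000000f       / simulated P&L per scenario
var95:(asc pnl)@floor 0.05*n   / P&L level that 95% of scenarios stay above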

Our next article will take a deeper look at this VaR use case — including the data model, GPU memory layout, and the impact of Blackwell’s unified memory on multi-day, multi-GPU simulations.

Hardware matters: Scaling up on NVIDIA GPUs

The performance gains we’ve described are already significant on commodity accelerators like the NVIDIA L4. But the trend toward larger GPUs with more memory and tighter CPU–GPU integration points to even greater opportunities:

  • Memory capacity for full-day data: As demonstrated in our A100 tests, GPUs with large HBM memory — such as the A100 (80 GB HBM2e) and H100 (HBM3) — allow much larger time-series partitions to remain entirely resident on device. This reduces the need for chunked batch processing and repeated transfers, letting whole-day or multi-symbol workloads live in GPU memory at once
  • Next-generation interconnects: Blackwell-class GPUs are designed with shared memory coherence between CPU and GPU. Instead of explicitly copying tables back and forth, a KDB-X table in unified memory can be directly accessed from both host and device. This removes a major bottleneck: developers no longer have to think about staging data for joins or sorts — the GPU can see the same memory space as the CPU
  • Scaling beyond a single device: Multi-GPU setups connected by NVLink or NVSwitch enable distributed joins and sorts without traversing PCIe for every operation. For end-of-day workflows, this could mean scaling KDB-X pipelines across multiple GPUs
  • Architectural fit for market data pipelines: With bandwidths exceeding 4 TB/s on Blackwell HBM3e and lower-latency CPU–GPU coherency, even complex pipelines — tick normalization, as-of joins, re-sorting, and aggregation — could run in GPU memory as first-class citizens. The CPU remains for orchestration and edge logic, while the GPU does the heavy lifting

In practice, this means the same .gpu.aj and .gpu.xasc primitives shown here will just run bigger and faster as hardware improves, with fewer trade-offs around data movement. For firms living in the tens or hundreds of billions of ticks per day, that’s where the real payoff lies.

Participate and learn more

The KDB-X incubation team is actively seeking forward-thinking customers to collaborate with us in this area. By participating, you can help accelerate our research, provide valuable feedback, and shape the path to general availability (GA)—bringing these GPU-powered capabilities directly into production for your business. Please reach out to incubation@kx.com for more information.

Closing thoughts

The combination of KDB-X and NVIDIA GPUs opens new ground for accelerating core financial workloads. As-of joins and sorts — previously bottlenecks for both real-time and EOD systems — can now run 4–10× faster on commodity GPUs like the L4. With GPUDirect and cuDF integration, we also cut out unnecessary CPU copies, enabling data pipelines that remain GPU-native end-to-end.

Looking forward, multi-GPU support and broader type coverage will only expand the applicability. For now, the message is clear: if you’re running large-scale market data pipelines in KDB-X, GPUs can save both time and infrastructure cost — without changing the essence of your q code.

For discussion and feedback, join the conversation on the KX Developer Community forum, our community Slack channel, or open a thread in the repository’s Discussions tab. To explore KDB-X hands-on, visit docs.kx.com or start with the KDB-X Community Edition.
