Introducing KDB-X GPU Acceleration

Nicholas Sansone

Developer Relations Engineer

Key Takeaways

  1. KDB-X GPU acceleration enables firms to run significantly more simulations and scenarios in the same time window, improving the accuracy of trading and risk models.
  2. By parallelizing compute-intensive workloads, GPUs reduce latency and make real-time analytics and intraday recalculations practical at scale.
  3. Running AI and analytics directly where the data resides eliminates costly data movement and preserves performance under production conditions.
  4. Faster backtesting and research cycles increase model throughput, allowing teams to validate more ideas and capture alpha more efficiently.
  5. Consolidating GPU-accelerated analytics into a unified platform reduces infrastructure complexity while improving both performance and cost efficiency.

If you’ve been running high-volume market data pipelines, you already know the story: as-of joins and end-of-day sorts are among the most compute-intensive operations in your stack. They work beautifully at moderate size, but as datasets push into the hundreds of millions of rows, CPU bottlenecks start to bite. KDB-X GPU Acceleration is built to change that.

GPU Acceleration is a new offering in the KDB-X family that ships the GPU module as a first-class, production-ready capability. Everything you know about q stays the same – the same syntax, the same data model, the same table semantics – but now you have the option to offload your most expensive operations to NVIDIA GPUs, keeping data GPU-resident across multiple steps and only returning results to the CPU when you actually need them.

The core idea: Selective GPU residency

To initialize GPU Acceleration, load the GPU module:

q
.gpu: use`kx.gpu 

The design philosophy is deliberately minimal. You don’t rewrite your pipelines. You decide which columns should live on the GPU, push them there with .gpu.xto, and the rest of your table stays exactly where it is.

q
/ Push only the join-key columns to GPU — the rest stays on CPU
T:.gpu.xto[`time`sym;trade]
Q:.gpu.xto[`time`sym;quote]

When you inspect T, it looks like an ordinary KDB-X table — except that time and sym show as foreign, meaning they are GPU-resident pointers rather than in-process memory:

q
time    ex  sym     sale  vol  price
------------------------------------
foreign T   foreign TI    1    132.02
foreign T   foreign TI    1    134.89
This is a powerful pattern. You’re not paying the cost of transferring wide tables across PCIe. You move only the columns that matter for the operation – the join keys, the sort keys – and the GPU does its work on those. The payload columns come along only when you materialize the result. Use .gpu.from to bring GPU-resident data back to CPU when you need it. Note that all attributes are preserved on .gpu.to, but only the sorted attribute persists on .gpu.from.
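A minimal sketch of the round trip, reusing the trade table from the example above:

q
T:.gpu.xto[`time`sym;trade]   / time and sym become GPU-resident foreigns
res:.gpu.from T               / materialize the full table back on CPU
meta res                      / only the sorted attribute survives the trip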

The GPU module APIs at a glance

The GPU module exposes a clean namespace of primitives that compose well:

Function                             Purpose
.gpu.to / .gpu.xto                   Send data or specific columns to GPU
.gpu.from                            Return data from GPU to memory
.gpu.aj                              As-of join on GPU-resident key columns
.gpu.iasc / .gpu.asc                 Sort indices and in-place sort
.gpu.xasc                            Full table sort ascending by column order
.gpu.select                          GPU-accelerated table queries
.gpu.bin                             Binary search
.gpu.append                          Append data on GPU
.gpu.ndev / .gpu.sdev / .gpu.mdev    Device count, selection, and memory introspection

The device management functions are particularly useful when writing multi-GPU code. .gpu.ndev[] returns the number of available devices, .gpu.sdev[n] selects the active device, and .gpu.mdev[] returns available memory — the on-disk sort uses these to automatically compute safe batch sizes.
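As a sketch of how those primitives might combine before a large transfer – this assumes .gpu.mdev[] reports free memory for the currently selected device – you could pick the GPU with the most headroom:

q
/ Hypothetical helper: select the GPU with the most free memory
free:{.gpu.sdev x; .gpu.mdev[]} each til .gpu.ndev[]
.gpu.sdev first idesc free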

 

GPU-accelerated as-of joins

As-of joins are the workhorse of tick data processing: aligning trades to the most recent quote, synchronizing signals to market events, reconstructing order book snapshots. On CPU, aj performs a backward scan or binary search per symbol — and with 19 million trades joined against 194 million quotes, that adds up fast.

With GPU Acceleration, .gpu.aj parallelizes the binary search across every symbol simultaneously, using thousands of GPU threads. Because sym and time are already GPU-resident in both tables, there’s no upfront transfer cost on each invocation.

q
/ Three variants — each progressively faster
aj[`sym`time; trade; quote]        / CPU baseline
.gpu.aj[`sym`time; trade; Q]       / quote keys on GPU
.gpu.aj[`sym`time; T; Q]           / both tables have GPU keys — fastest

The progression matters. When only one table has GPU-resident keys, there is still transfer overhead for the other. When both are already on device, the join runs entirely on GPU — no CPU involvement until results come back.
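Putting the pieces together, a complete device-side join pass might look like the following sketch – one materialization at the end, nothing else crossing PCIe:

q
T:.gpu.xto[`time`sym;trade]
Q:.gpu.xto[`time`sym;quote]
r:.gpu.from .gpu.aj[`sym`time;T;Q]   / join runs on GPU; results return once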

Benchmark results show an NVIDIA L4 GPU outperforming a CPU system with 48 available cores by roughly 4× at 1 million rows. On an A100 at 10 million rows, the gap widens to around 5.5×. At 100 million rows, the GPU maintains sub-second latency while the CPU extends past 4 seconds — the kind of scaling that matters for full-day tick datasets.

GPU-accelerated sorting

Sorting is where GPU Acceleration really shines for end-of-day workflows. Two approaches are available depending on whether you need the full sorted result or just the top-N rows.

Full table sort: .gpu.xasc

The straightforward approach — push a table to GPU, sort it by one or more keys, bring it back:

 

q
tgpu:.gpu.to trade
.gpu.from .gpu.xasc[`sym`time;tgpu]

This is the right choice for EOD batch jobs where you need the entire table re-sorted before aggregation or archival. Benchmarks show 5–10× speedups over CPU once table sizes exceed tens of millions of rows. On an A100, sorting 10 million ticks by sym,time takes ~42ms versus ~320ms on CPU.

Index-based sort: .gpu.iasc

When you only need part of the sorted result — top-of-book, leaderboards, best-N records — .gpu.iasc computes the sort permutation without materializing the full reorder:

q
gt:.gpu.to trade              / send trade table to GPU
idx:.gpu.from .gpu.iasc gt    / bring sort indices back to CPU
topN:trade 1000#idx           / index the original CPU table for the first 1,000 rows

This avoids the cost of reordering millions of rows when you only care about a thousand of them. The rule of thumb: use .gpu.xasc when you need everything, .gpu.iasc when you need a slice.

On-disk sorting

For sorting large splayed tables on disk, the pattern is to load the full table into CPU memory, push only the sort-key columns to the GPU to compute indices, then use those indices to reorder the full table before writing back:

q
/ Load full table, select only sort-key cols for GPU
d:get `:/data/orders50m/
g:.gpu.to ?[d; (); 0b; c!c:`orderId`time]

/ Get sort indices, reorder full table, write back
upsert[`:/data/orders50m_sorted/] d @ .gpu.from .gpu.iasc g

The key insight is that .gpu.iasc only needs the sort-key columns — it returns an index vector, not reordered data. The full table d stays on CPU and gets reordered cheaply using standard q indexing once the indices come back. The honest caveat: at very large scale, this pattern is bottlenecked by disk I/O rather than compute. The .gpu.mdev[] and .gpu.ndev[] functions let you introspect available device memory to decide whether batching is needed.
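A sketch of how that batching decision might look – this assumes .gpu.mdev[] returns free bytes on the active device, and uses -22! (uncompressed serialized length) as a rough proxy for the in-memory size of the key columns:

q
/ Hypothetical batching check – sizes and thresholds are illustrative
keys:?[d; (); 0b; c!c:`orderId`time]    / sort-key columns only
need:-22! keys                          / rough size of the keys in bytes
batch:need > .gpu.mdev[]                / batch if keys exceed free device memory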

Multi-GPU: scaling VaR across H100s

GPU Acceleration’s multi-GPU support is straightforward: use peach to distribute work across devices, with .gpu.sdev to target each one.

A five-device H100 VaR benchmark illustrates the pattern. Each GPU loads one trading day’s risk data directly from disk and computes the 95th percentile VaR across millions of simulated price paths:

q
/ Map each GPU to a day's risk partition
L:{(x; `$":/data/risk/2025.01.0",string[x],"/")} each 1+til 5

/ Load each partition in parallel — each worker selects its GPU, pushes data to device
D:{.gpu.sdev x[0]; (x[0]; .gpu.to get x[1])} peach L

/ Run VaR across all GPUs simultaneously
calcVarGPU[; `scenario_id`id1; 95] peach D

Five H100s completed a five-day VaR run in ~0.27 seconds versus ~15.8 seconds on a CPU system with 48 available cores — a 58× speedup, with identical numerical results across both paths.

Closing thoughts

Optimizing the most CPU-intensive operations in typical KDB-X workloads – especially large-scale market data pipelines – does more than save time and overhead; it accelerates the entire system.

In a world where every millisecond counts, analytics on massive datasets can now run up to 58× faster with KDB-X GPU Acceleration.

When it’s survival of the fastest, every enhancement gives an edge. GPU Acceleration will help bring current KDB-X systems to the next level.

Interested in building it yourself? Learn how to apply GPU Acceleration to your current workloads in our tutorial.


"*" indicates required fields

This field is for validation purposes and should be left unchanged.

By submitting this form, you will also receive sales and/or marketing communications on KX products, services, news and events. You can unsubscribe from receiving communications by visiting our Privacy Policy. You can find further information on how we collect and use your personal data in our Privacy Policy.

// social // social