Key Takeaways
- KDB-X GPU acceleration enables firms to run significantly more simulations and scenarios in the same time window, improving the accuracy of trading and risk models.
- By parallelizing compute-intensive workloads, GPUs reduce latency and make real-time analytics and intraday recalculations practical at scale.
- Running AI and analytics directly where the data resides eliminates costly data movement and preserves performance under production conditions.
- Faster backtesting and research cycles increase model throughput, allowing teams to validate more ideas and capture alpha more efficiently.
- Consolidating GPU-accelerated analytics into a unified platform reduces infrastructure complexity while improving both performance and cost efficiency.
If you’ve been running high-volume market data pipelines, you already know the story: as-of joins and end-of-day sorts are among the most compute-intensive operations in your stack. They work beautifully at moderate size, but as datasets push into the hundreds of millions of rows, CPU bottlenecks start to bite. KDB-X GPU Acceleration is built to change that.
GPU Acceleration is a new offering in the KDB-X family that ships the GPU module as a first-class, production-ready capability. Everything you know about q stays the same – the same syntax, the same data model, the same table semantics – but now you have the option to offload your most expensive operations to NVIDIA GPUs, keeping data GPU-resident across multiple steps and only returning results to the CPU when you actually need them.
The core idea: Selective GPU residency
GPU Acceleration initializes by loading the GPU module:
.gpu: use`kx.gpu
The design philosophy is deliberately minimal. You don’t rewrite your pipelines. You decide which columns should live on the GPU, push them there with .gpu.xto, and the rest of your table stays exactly where it is.
/ Push only the join-key columns to GPU — the rest stays on CPU
T:.gpu.xto[`time`sym;trade]
Q:.gpu.xto[`time`sym;quote]
When you inspect T, it looks like an ordinary KDB-X table — except that time and sym show as foreign, meaning they are GPU-resident pointers rather than in-process memory:
time ex sym sale vol price
-------------------------------------------
foreign T foreign TI 1 132.02
foreign T foreign TI 1 134.89
This is a powerful pattern. You’re not paying the cost of transferring wide tables across PCIe. You move only the columns that matter for the operation — the join keys, the sort keys — and the GPU does its work on those. The payload columns come along only when you materialize the result. Use .gpu.from to bring GPU-resident data back to CPU when you need it. Note that all attributes are preserved on .gpu.to, but only the sorted attribute persists on .gpu.from.
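A minimal sketch of that round trip, using hypothetical data and assuming the .gpu.to/.gpu.from behavior described above:

```q
/ Hypothetical round trip illustrating the attribute note above
t:([]time:`s#09:30:00 09:30:01 09:30:02;sym:`g#`A`A`B;price:101.5 101.6 99.2)
g:.gpu.to t      / all attributes (`s# on time, `g# on sym) preserved on device
r:.gpu.from g    / back in process memory: only the `s# on time survives
```

The grouped attribute can be reapplied on CPU afterwards if a query plan depends on it.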
The GPU module APIs at a glance
The GPU module exposes a clean namespace of primitives that compose well:
| Function | Purpose |
|---|---|
| .gpu.to / .gpu.xto | Send data or specific columns to GPU |
| .gpu.from | Return data from GPU to memory |
| .gpu.aj | As-of join on GPU-resident key columns |
| .gpu.iasc / .gpu.asc | Sort indices and in-place sort |
| .gpu.xasc | Full table sort ascending by column order |
| .gpu.select | GPU-accelerated table queries |
| .gpu.bin | Binary search |
| .gpu.append | Append data on GPU |
| .gpu.ndev / .gpu.sdev / .gpu.mdev | Device count, selection, and memory introspection |
The device management functions are particularly useful when writing multi-GPU code. .gpu.ndev[] returns the number of available devices, .gpu.sdev[n] selects the active device, and .gpu.mdev[] returns available memory — the on-disk sort uses these to automatically compute safe batch sizes.
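A hedged sketch of how that can look in practice; batchRows and the 80% headroom factor are illustrative assumptions, not part of the module:

```q
/ Sketch: derive a safe row batch from free device memory
/ assumes .gpu.mdev[] returns free bytes on the active device
batchRows:{[bytesPerRow] floor 0.8 * .gpu.mdev[] % bytesPerRow}
.gpu.sdev 0       / select the first of .gpu.ndev[] available devices
n:batchRows 16    / e.g. two 8-byte sort-key columns per row
```

Leaving headroom matters because the sort itself allocates scratch space on the device beyond the input columns.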
GPU-accelerated as-of joins
As-of joins are the workhorse of tick data processing: aligning trades to the most recent quote, synchronizing signals to market events, reconstructing order book snapshots. On CPU, aj performs a backward scan or binary search per symbol — and with 19 million trades joined against 194 million quotes, that adds up fast.
With GPU Acceleration, .gpu.aj parallelizes the binary search across every symbol simultaneously, using thousands of GPU threads. Because sym and time are already GPU-resident in both tables, there’s no upfront transfer cost on each invocation.
/ Three variants — each progressively faster
aj[`sym`time; trade; quote] / CPU baseline
.gpu.aj[`sym`time; trade; Q] / Q keys on GPU
.gpu.aj[`sym`time; T; Q] / Both tables have GPU keys — fastest
The progression matters. When only one table has GPU-resident keys, there is still transfer overhead for the other. When both are already on device, the join runs entirely on GPU — no CPU involvement until results come back.
Benchmark results show an NVIDIA L4 GPU outperforming a CPU system with 48 available cores by roughly 4× at 1 million rows. On an A100 at 10 million rows, the gap widens to around 5.5×. At 100 million rows, the GPU maintains sub-second latency while the CPU climbs past 4 seconds — the kind of scaling that matters for full-day tick datasets.
GPU-accelerated sorting
Sorting is where GPU Acceleration really shines for end-of-day workflows. Two approaches are available depending on whether you need the full sorted result or just the top-N rows.
Full table sort: .gpu.xasc
The straightforward approach — push a table to GPU, sort it by one or more keys, bring it back:
tgpu:.gpu.to trade
.gpu.from .gpu.xasc[`sym`time;tgpu]
This is the right choice for EOD batch jobs where you need the entire table re-sorted before aggregation or archival. Benchmarks show 5–10× speedups over CPU once table sizes exceed tens of millions of rows. On an A100, sorting 10 million ticks by sym,time takes ~42ms versus ~320ms on CPU.
Index-based sort: .gpu.iasc
When you only need part of the sorted result — top-of-book, leaderboards, best-N records — .gpu.iasc computes the sort permutation without materializing the full reorder:
gt: .gpu.to trade            / send trade table to GPU
idx: .gpu.from .gpu.iasc gt  / bring sort indices back to CPU
topN: trade 1000#idx         / take the first 1000 indices into the original CPU table
This avoids the cost of reordering millions of rows when you only care about a thousand of them. The rule of thumb: use .gpu.xasc when you need everything, .gpu.iasc when you need a slice.
On-disk sorting
For sorting large splayed tables on disk, the pattern is to load the full table into CPU memory, push only the sort-key columns to the GPU to compute indices, then use those indices to reorder the full table before writing back:
/ Load full table, select only sort-key cols for GPU
d: get `:/data/orders50m/
g: .gpu.to ?[d; (); 0b; c!c:`orderId`time]
/ Get sort indices, reorder full table, write back
upsert[`:/data/orders50m_sorted/] d @ .gpu.from .gpu.iasc g
The key insight is that .gpu.iasc only needs the sort-key columns — it returns an index vector, not reordered data. The full table d stays on CPU and gets reordered cheaply using standard q indexing once the indices come back. The honest caveat: at very large scale, this pattern is bottlenecked by disk I/O rather than compute. The .gpu.mdev[] and .gpu.ndev[] functions let you introspect available device memory to decide whether batching is needed.
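When even the key columns exceed device memory, one illustrative batching pattern is to compute indices chunk by chunk; the chunking scheme and the 0.8 headroom factor here are assumptions, not part of the module:

```q
/ Illustrative sketch: chunked index computation for oversized key columns
k:?[d; (); 0b; c!c:`orderId`time]   / sort-key columns only, on CPU
n:floor 0.8 * .gpu.mdev[] % 16      / rows that fit comfortably on device
chunks:n cut k
/ per-chunk sort indices, shifted back to global row numbers
I:{[o;t] o + .gpu.from .gpu.iasc .gpu.to t}'[n * til count chunks; chunks]
/ each chunk is now internally sorted; a final CPU merge pass yields the global order
```

The merge step is the usual trade-off: GPU does the heavy per-chunk sorting, CPU does a cheap k-way merge over already-sorted runs.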
Multi-GPU: scaling VaR across H100s
GPU Acceleration’s multi-GPU support is straightforward: use peach to distribute work across devices, with .gpu.sdev to target each one.
A five-device H100 VaR benchmark illustrates the pattern. Each GPU loads one trading day’s risk data directly from disk and computes the 95th percentile VaR across millions of simulated price paths:
/ Map each GPU to a day's risk partition
L: {(x; `$":/data/risk/2025.01.0", string[x], "/")} each 1 + til 5
/ Load each partition in parallel — each worker selects its GPU, pushes data to device
D: {.gpu.sdev x[0]; (x[0]; .gpu.to get x[1])} peach L
/ Run VaR across all GPUs simultaneously
calcVarGPU[; `scenario_id`id1; 95] peach D
Five H100s completed a five-day VaR run in ~0.27 seconds versus ~15.8 seconds on a CPU system with 48 available cores — a 58× speedup, with identical numerical results across both paths.
Closing thoughts
Optimizing the most CPU-intensive operations in typical KDB-X workloads – especially large-scale market data pipelines – does more than save time and overhead: it accelerates the entire system.
In a world where every millisecond counts, analytics on massive datasets can now run dramatically faster – up to 58× in the benchmarks above – with KDB-X GPU Acceleration.
When it’s survival of the fastest, every enhancement gives an edge. GPU Acceleration brings existing KDB-X systems to the next level.
Interested in building it yourself? Learn how to apply GPU Acceleration to your current workloads in our tutorial.

