Key takeaways
- KDB.AI integrates NVIDIA cuVS to accelerate vector search and index builds for real-time financial AI workflows.
- GPU-accelerated CAGRA indexes reduce semantic retrieval latency compared with CPU-based qHNSW under recall-aligned benchmark conditions.
- The benchmark shows up to 11.8x faster algorithm-level index builds and up to 40.3x lower algorithm-level search latency.
- End-to-end results show up to 3.0x faster index builds, up to 9.3x lower search latency, and up to 12.7x improved P95 tail latency.
- Faster semantic retrieval helps financial AI agents keep structured market data and unstructured context synchronized at market speed.

This is where KX’s domain leadership in financial time-series meets NVIDIA accelerated computing: continuous embedding updates, rapid index rebuilds, and agentic reasoning grounded in real-time market conditions.
The business challenge: Two domains, one operational clock
Financial agentic systems must reason across two fundamentally different domains to maintain temporal coherence:
- Structured, low-latency time-series: trades, quotes, order books, volatility surfaces, macro indicators. The what of the market.
- Unstructured, semantic context: earnings transcripts, filings, analyst commentary, sentiment feeds, macro event streams. The why.
When a Fed comment lands, volatility surfaces reprice in seconds and order books absorb the shift before the press release is fully parsed. When a 10-K drops mid-session, the market has already started discounting the disclosure by the time most systems finish embedding it. An agent reasoning over yesterday’s index isn’t slow. It’s wrong, and confidently so.
Where latency compounds: Live execution and backtesting
Agentic workflows compound semantic lag. A single decision rarely flows from a single retrieval. The agent detects a structured dislocation, pulls historical context, compares disclosures, reassesses risk, then loops back. Five, ten, sometimes more sequential retrievals before any action is taken. Each one gates the next. Each millisecond of lag becomes a multiplier.
In live execution, this is the difference between catching a volatility shift mid-move and reasoning about it after the fact. A risk recalibration that needs to complete inside the alpha window cannot afford to wait on an index that refreshes overnight.
The same constraint shows up in research. Systematic strategies are validated through backtesting: replaying years of unified market and narrative data to test whether a signal generalizes. When semantic retrieval is slow, the iteration cycle that lets quants refine hypotheses collapses. With end-to-end search latency up to 9.3x lower, a retrieval-bound backtest can finish in roughly a tenth of the wall-clock time, so hypotheses that once waited for a quarterly research cycle can be tested far more often.
The bottleneck: CPU-bound synchronization
Deterministic analytics already execute in microseconds via KDB-X. CPU-bound semantic retrieval doesn’t. As corpora grow into the tens of millions, index refreshes stretch and query latency rises, taxing real-time reasoning at scale. Each millisecond of lag compounds across sequential retrievals into seconds of “semantic lag” that pull the architecture off the market’s cadence. Closing that gap is the foundation of the KX + NVIDIA collaboration: a stack where deterministic time-series and accelerated semantic retrieval execute on the same operational clock.
Solving the bottleneck: CPU vs. GPU at the vector layer
To understand where the gains come from, it is useful to see how vector workloads behave on each architecture.
| Dimension | CPU-based vector search | GPU-accelerated vector search |
|---|---|---|
| Parallelism | Limited by core count and thread-level concurrency | Thousands of lightweight cores executing vector distance computations in parallel |
| Memory bandwidth | Constrained by CPU DRAM bandwidth; memory access can become a bottleneck at scale | High-bandwidth GPU memory reduces data movement constraints for graph traversal and distance calculations |
| Index construction model | Index build time grows with dataset size, often forcing scheduled or overnight refresh cycles | Parallelized index construction reduces build time, enabling more frequent refresh |
| Latency characteristics | Often optimized for batch throughput; single-query latency can degrade under load | Improves both throughput and batch=1 latency, critical when 5-10 sequential agentic queries gate every reasoning step |
| Throughput scaling | Requires horizontal scaling across CPU nodes to increase capacity | Higher query density per node; improved performance efficiency |
| Economic scaling logic | Lower per-hour cost, but performance may require additional nodes | Higher per-hour cost, which can be justified when end-to-end index build and search speedups outweigh the price difference |
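
The economic row of the table comes down to simple arithmetic: a GPU node is cheaper per job whenever its speedup exceeds its price premium. The sketch below illustrates that break-even logic. The hourly rates are hypothetical placeholders chosen for illustration, not benchmark or vendor figures; only the 3.0x build speedup comes from this article.

```python
def gpu_to_cpu_cost_ratio(cpu_rate_per_hour: float,
                          gpu_rate_per_hour: float,
                          speedup: float) -> float:
    """Cost of a job on the GPU node divided by its cost on the CPU node.

    A job that takes T hours on CPU takes T / speedup hours on GPU, so T
    cancels and only the price ratio and the speedup remain.
    """
    if cpu_rate_per_hour <= 0 or gpu_rate_per_hour <= 0 or speedup <= 0:
        raise ValueError("rates and speedup must be positive")
    return (gpu_rate_per_hour / cpu_rate_per_hour) / speedup


# Hypothetical placeholder prices, NOT taken from the benchmark or any vendor.
CPU_RATE = 4.0        # currency units per node-hour
GPU_RATE = 10.0       # currency units per node-hour
BUILD_SPEEDUP = 3.0   # peak end-to-end index build speedup quoted in this article

ratio = gpu_to_cpu_cost_ratio(CPU_RATE, GPU_RATE, BUILD_SPEEDUP)
print(f"GPU cost / CPU cost per index build: {ratio:.2f}")
```

A ratio below 1 means the GPU node costs less per completed job despite the higher hourly price. Substitute your own rates and measured speedups before drawing conclusions.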
Using cuVS for faster search and index builds
To eliminate this lag in the semantic layer, KDB.AI integrates NVIDIA cuVS directly into its vector engine.
Within KDB.AI, cuVS supports GPU-optimized index types such as CAGRA, the index type used throughout this article’s benchmarks. While HNSW is the CPU standard, CAGRA is a GPU-native algorithm purpose-built for CUDA parallelism. It optimizes graph traversal to deliver superior throughput and lower latency for large-scale vector workloads.
This acceleration shortens index build cycles and reduces per-query latency, allowing new embeddings to be incorporated more rapidly and relevant context to be retrieved quickly. To quantify what that means in practice, we benchmarked GPU-accelerated CAGRA against CPU-based qHNSW under recall-aligned conditions.
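"Recall-aligned" means both indexes are tuned until they return a comparable share of the true nearest neighbors, so that speed is compared at equal accuracy. As a minimal sketch of how recall@k is commonly computed (the function name and the toy ids are illustrative assumptions, not the benchmark harness):

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[Sequence[int]],
                ground_truth: Sequence[Sequence[int]],
                k: int) -> float:
    """Mean fraction of the true top-k neighbours found in the retrieved top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(retrieved) != len(ground_truth) or not retrieved:
        raise ValueError("need the same non-zero number of queries in both inputs")
    total = 0.0
    for got, truth in zip(retrieved, ground_truth):
        truth_k = set(truth[:k])
        total += len(truth_k.intersection(got[:k])) / len(truth_k)
    return total / len(retrieved)


# Toy neighbour ids for two queries (illustrative only, not benchmark data).
exact = [[1, 2, 3, 4], [10, 11, 12, 13]]    # e.g. from brute-force search
approx = [[1, 2, 3, 9], [10, 11, 12, 13]]   # e.g. from an approximate index
recall = recall_at_k(approx, exact, k=4)
print(f"recall@4 = {recall:.3f}")
```

In a recall-aligned comparison, build and search parameters for each index would be swept until this metric lands on the same target, and only then are timings compared.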
Methodology and results
To ensure a fair and production-relevant comparison, we evaluated CPU-based vector search against GPU-accelerated cuVS under controlled conditions.
- Hardware configuration (on-prem): GPU: NVIDIA H100; CPU: Intel Xeon, 192 threads.
- Dataset: MIRACL 10M.
- Number of runs: 100 per configuration.
- Configuration and tuning: Both the CPU (qHNSW) and GPU (CAGRA) configurations were tuned to comparable recall targets across a range of build and search parameters. This ensured that performance gains were not achieved by sacrificing accuracy.
- Algorithm-level vs. end-to-end definition: To avoid ambiguity, we distinguish clearly between two measurement scopes.
Algorithm-level performance
Algorithm-level measurements isolate the core indexing and search operations inside the vector engine. They exclude Python client overhead, client-to-database transmission, CPU/GPU transfer latency, and disk persistence, all of which are captured in the end-to-end results.
This isolation matters because it answers a specific question: how much of the speedup is coming from GPU parallelism itself, before any system-level overhead enters the picture?
We observed:
Index construction is up to 11.8x faster, and search latency drops by up to 40.3x. Both reflect the same underlying advantage: high-dimensional distance computation is parallel, and CAGRA’s graph traversal is built to keep thousands of GPU cores busy.
End-to-end performance in the production workflow
End-to-end measurements capture the full workflow as it runs in production: the call into KDB.AI from the Python client, data transmission to CPU or GPU, execution of the index build or search, and persistence of the result. This is the performance the application actually experiences.
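Algorithm-level results show raw acceleration potential. End-to-end results show what survives once real-world overhead enters the picture, and that’s the number that matters for deployment decisions. In this benchmark, that is up to 3.0x faster index builds, up to 9.3x lower search latency, and up to 12.7x better P95 tail latency.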
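To make the two measurement scopes concrete, here is a minimal timing sketch. It times an inner operation on its own and then the same operation wrapped in extra client-side work, and reports the nearest-rank P95 over 100 runs. `core_search` and `end_to_end_search` are hypothetical stand-ins written for this example, not KDB.AI or cuVS calls, and the harness is an assumption about how such a measurement could be structured rather than the one used in the benchmark.

```python
import math
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], runs: int = 100) -> List[float]:
    """Return wall-clock latency in milliseconds for each of `runs` calls."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct% of n) in sorted order."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def core_search() -> int:
    # Hypothetical stand-in for the in-engine search operation.
    return sum(i * i for i in range(2_000))


def end_to_end_search() -> int:
    # Hypothetical stand-in for the full path: build a request, search, handle the result.
    payload = repr(list(range(2_000)))
    return core_search() + len(payload)


algo_ms = time_calls(core_search)
e2e_ms = time_calls(end_to_end_search)
print(f"algorithm-level P95: {percentile(algo_ms, 95):.3f} ms")
print(f"end-to-end P95:      {percentile(e2e_ms, 95):.3f} ms")
```

Reporting a tail percentile rather than a mean matters for agentic loops, because a single slow retrieval delays every step that follows it.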


Conclusion: financial AI at market speed
The benchmarks above don’t just describe faster vector search. They eliminate three specific failure modes in agentic financial systems.
- Index staleness becomes a non-issue. Today, a 99% recall index on qHNSW takes roughly 45 minutes to rebuild. That cost is why most production deployments fall back to overnight refresh, leaving every intraday decision grounded in yesterday’s context. At 15 minutes on CAGRA, the same rebuild fits between events. A 10-K drops at 9:30 AM and is reflected in the semantic layer before the first hour of trading closes. Index refresh stops being a scheduled job and becomes an operational response to new data.
- The agentic loop fits inside the alpha window. An agent issuing 10 sequential retrievals spends roughly 970ms on retrieval alone at qHNSW’s P95 latency. This is long enough that by the time the loop completes, the market has already moved past the signal that triggered it. The same loop on CAGRA completes in under 80ms, leaving the rest of the decision budget for LLM reasoning and action. The reasoning step is no longer competing with the market for time.
- Research iteration cycles compress. With end-to-end search latency up to 9.3x lower, a retrieval-bound backtest that historically took a week of wall-clock time finishes in roughly a day. Quants iterate on hypotheses on demand instead of quarterly. Hypothesis testing stops being budgeted and starts being routine.
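
The figures in the three points above can be cross-checked against the headline end-to-end speedups. The short calculation below does so, assuming (the article does not state this explicitly) that each example applies the peak speedup and, for the backtest, that wall-clock time is dominated by retrieval.

```python
# Figures quoted in this article.
QHNSW_REBUILD_MIN = 45.0     # rebuild time for a 99% recall qHNSW index
E2E_BUILD_SPEEDUP = 3.0      # peak end-to-end index build speedup
QHNSW_LOOP_MS = 970.0        # 10 sequential retrievals at qHNSW P95 latency
RETRIEVALS = 10
E2E_P95_SPEEDUP = 12.7       # peak end-to-end P95 improvement
E2E_SEARCH_SPEEDUP = 9.3     # peak end-to-end search latency improvement
BACKTEST_DAYS = 7.0          # "a week of wall-clock time"

cagra_rebuild_min = QHNSW_REBUILD_MIN / E2E_BUILD_SPEEDUP
qhnsw_per_query_ms = QHNSW_LOOP_MS / RETRIEVALS
cagra_loop_ms = QHNSW_LOOP_MS / E2E_P95_SPEEDUP
# Assumes the backtest is retrieval-bound, so it scales with search latency.
cagra_backtest_days = BACKTEST_DAYS / E2E_SEARCH_SPEEDUP

print(f"CAGRA rebuild:        {cagra_rebuild_min:.0f} min")
print(f"qHNSW P95 per query:  {qhnsw_per_query_ms:.0f} ms")
print(f"CAGRA 10-query loop:  {cagra_loop_ms:.1f} ms")
print(f"CAGRA backtest:       {cagra_backtest_days:.2f} days")
```

Under those assumptions the numbers line up with the text: a 15-minute rebuild, a loop that completes in under 80 ms, and a week-long backtest that finishes in under a day. Workloads that are not retrieval-bound will see smaller gains.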
For firms building agentic systems across equities, derivatives, and crypto markets, this is the shift from semantic retrieval as a scheduled batch process to semantic retrieval as a live signal. The alignment between structured and unstructured intelligence is not an optimization. It is foundational.
Use KDB.AI cuVS today
Spin up KDB.AI with cuVS-accelerated CAGRA indexes using the KX + NVIDIA blueprints. Clone a blueprint, run it on your hardware, and compare against your existing CPU baseline.
This blog was co-authored by Manas Singh, Technical Product Manager, NVIDIA
