Continuous Financial AI: GPU-Accelerated Vector Search in KDB.AI with NVIDIA cuVS

Author

Ryan Siegler

Data Scientist

Key takeaways

  1. KDB.AI integrates NVIDIA cuVS to accelerate vector search and index builds for real-time financial AI workflows.
  2. GPU-accelerated CAGRA indexes reduce semantic retrieval latency compared with CPU-based qHNSW under recall-aligned benchmark conditions.
  3. The benchmark shows up to 11.8x faster algorithm-level index builds and up to 40.3x lower algorithm-level search latency.
  4. End-to-end results show up to 3.0x faster index builds, up to 9.3x lower search latency, and up to 12.7x improved P95 tail latency.
  5. Faster semantic retrieval helps financial AI agents keep structured market data and unstructured context synchronized at market speed.

Most financial AI systems still operate on yesterday’s rhythm: ingest during the day, rebuild overnight, respond in batches.

Markets don’t move in batches. They move continuously. Liquidity shifts in seconds. Volatility reprices in minutes. News propagates instantly. The performance of agentic financial systems is only meaningful if they keep pace with that reality.

That requires synchronizing structured market data with the unstructured context that explains it. When structured time-series analytics already execute at market speed, semantic retrieval cannot become the bottleneck. This article demonstrates how GPU-accelerated vector search and index builds with KDB.AI and NVIDIA cuVS close that gap, delivering the following performance improvements relative to CPU-based qHNSW under recall-aligned configurations. The full methodology and results appear in the “Methodology and results” section below; Table 1 summarizes the headline numbers.

Table 1. Speedup ranges span recall bands tested. Lower latency is better; higher throughput is better.

This is where KX’s domain leadership in financial time-series meets NVIDIA accelerated computing: continuous embedding updates, rapid index rebuilds, and agentic reasoning grounded in real-time market conditions.

The business challenge: Two domains, one operational clock

Financial agentic systems must reason across two fundamentally different domains to maintain temporal coherence:

  • Structured, low-latency time-series: trades, quotes, order books, volatility surfaces, macro indicators. The what of the market.
  • Unstructured, semantic context: earnings transcripts, filings, analyst commentary, sentiment feeds, macro event streams. The why.

When a Fed comment lands, volatility surfaces reprice in seconds and order books absorb the shift before the press release is fully parsed. When a 10-K drops mid-session, the market has already started discounting the disclosure by the time most systems finish embedding it. An agent reasoning over yesterday’s index isn’t slow. It’s wrong, and confidently so.

Where latency compounds: Live execution and backtesting

Agentic workflows compound semantic lag. A single decision rarely flows from a single retrieval. The agent detects a structured dislocation, pulls historical context, compares disclosures, reassesses risk, then loops back. Five, ten, sometimes more sequential retrievals before any action is taken. Each one gates the next. Each millisecond of lag becomes a multiplier.

In live execution, this is the difference between catching a volatility shift mid-move and reasoning about it after the fact. A risk recalibration that needs to complete inside the alpha window cannot afford to wait on an index that refreshes overnight.

The same constraint shows up in research. Systematic strategies are validated through backtesting, replaying years of unified market and narrative data to test whether a signal generalizes. When semantic retrieval is slow, the iteration cycle that lets quants refine hypotheses collapses. With end-to-end search latency reduced by up to 9.3x, a retrieval-bound backtest can finish in roughly a tenth of the wall-clock time, turning a quarterly research cycle into a weekly one.

The bottleneck: CPU-bound synchronization

Deterministic analytics already execute in microseconds via KDB-X. CPU-bound semantic retrieval doesn’t. As corpora grow into the tens of millions, index refreshes stretch and query latency rises, taxing real-time reasoning at scale. Each millisecond of lag compounds across sequential retrievals into seconds of “semantic lag” that pull the architecture off the market’s cadence. Closing that gap is the foundation of the KX + NVIDIA collaboration: a stack where deterministic time-series and accelerated semantic retrieval execute on the same operational clock.

Solving the bottleneck: CPU vs. GPU at the vector layer

To understand where the gains come from, it is useful to see how vector workloads behave on each architecture.

| Dimension | CPU-based vector search | GPU-accelerated vector search |
| --- | --- | --- |
| Parallelism | Limited by core count and thread-level concurrency | Thousands of lightweight cores executing vector distance computations in parallel |
| Memory bandwidth | Constrained by CPU DRAM bandwidth; memory access can become a bottleneck at scale | High-bandwidth GPU memory reduces data movement constraints for graph traversal and distance calculations |
| Index construction model | Index build time grows with dataset size, often forcing scheduled or overnight refresh cycles | Parallelized index construction reduces build time, enabling more frequent refresh |
| Latency characteristics | Often optimized for batch throughput; single-query latency can degrade under load | Improves both throughput and batch=1 latency, critical when 5-10 sequential agentic queries gate every reasoning step |
| Throughput scaling | Requires horizontal scaling across CPU nodes to increase capacity | Higher query density per node; improved performance efficiency |
| Economic scaling logic | Lower per-hour cost, but performance may require additional nodes | Higher per-hour cost, but economically justified by end-to-end index build and search speedups |
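
The economic scaling row in the comparison above comes down to simple arithmetic: the cost of a job is the hourly price multiplied by the hours the node is busy, so a GPU node is cheaper per job whenever its price premium is smaller than its speedup. The following is a minimal sketch of that reasoning. The hourly prices are hypothetical placeholders, not quoted rates; only the roughly 45-minute qHNSW rebuild and the up-to-3.0x end-to-end build speedup come from this article.

```python
def cost_per_job(hourly_price: float, job_hours: float) -> float:
    """Cost of one job: price per hour times the hours the node is busy."""
    return hourly_price * job_hours


def gpu_cost_ratio(cpu_price: float, gpu_price: float, speedup: float) -> float:
    """GPU cost per job divided by CPU cost per job (below 1 means GPU is cheaper)."""
    return (gpu_price / cpu_price) / speedup


# Hypothetical hourly prices, chosen only for illustration.
CPU_PRICE = 4.0
GPU_PRICE = 10.0

# From the article: ~45 minute CPU rebuild, up to 3.0x faster end to end on GPU.
CPU_BUILD_HOURS = 0.75
BUILD_SPEEDUP = 3.0

cpu_cost = cost_per_job(CPU_PRICE, CPU_BUILD_HOURS)
gpu_cost = cost_per_job(GPU_PRICE, CPU_BUILD_HOURS / BUILD_SPEEDUP)
ratio = gpu_cost_ratio(CPU_PRICE, GPU_PRICE, BUILD_SPEEDUP)

print(f"CPU cost per rebuild: {cpu_cost:.2f}")
print(f"GPU cost per rebuild: {gpu_cost:.2f}")
print(f"GPU/CPU cost ratio:   {ratio:.2f}")
```

With these placeholder prices the GPU node costs 2.5 times as much per hour but finishes three times sooner, so each rebuild is cheaper. The break-even point is a price ratio equal to the speedup.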

Using cuVS for faster search and index builds

To eliminate this lag in the semantic layer, KDB.AI integrates NVIDIA cuVS directly into its vector engine.

Within KDB.AI, cuVS supports GPU-optimized index types such as CAGRA, the index type used throughout this article’s benchmarks. While HNSW is the CPU standard, CAGRA is a GPU-native algorithm purpose-built for CUDA parallelism. It optimizes graph traversal to deliver superior throughput and lower latency for large-scale vector workloads.
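
To make “graph traversal” concrete, here is a deliberately simplified, single-threaded sketch of how a graph-based approximate nearest-neighbor search walks toward a query. It is not CAGRA’s or qHNSW’s actual algorithm, and it does not use the KDB.AI or cuVS APIs: the brute-force graph construction and the function names are illustrative assumptions. The distance evaluations inside the inner loop are the independent work that a GPU implementation can spread across many cores.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_knn_graph(vectors, k):
    """Brute-force k-nearest-neighbor graph: node index -> list of neighbor indices."""
    graph = {}
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: distance(v, vectors[j]))
        graph[i] = others[:k]
    return graph


def greedy_search(vectors, graph, query, entry):
    """Walk the graph, always moving to the neighbor closest to the query.

    Stops at a local minimum, so the result is approximate and can miss the
    true nearest neighbor. Returns (node index, number of distance evaluations).
    """
    current = entry
    current_dist = distance(vectors[current], query)
    evaluations = 1
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = distance(vectors[neighbor], query)
            evaluations += 1
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:
            return current, evaluations
        current, current_dist = best, best_dist


# Small synthetic example (random data, illustrative only).
random.seed(0)
data = [[random.random() for _ in range(8)] for _ in range(200)]
knn = build_knn_graph(data, k=8)
q = [random.random() for _ in range(8)]

found, evals = greedy_search(data, knn, q, entry=0)
exact = min(range(len(data)), key=lambda i: distance(data[i], q))

print(f"greedy result distance: {distance(data[found], q):.4f} ({evals} evaluations)")
print(f"exact result distance:  {distance(data[exact], q):.4f} ({len(data)} evaluations)")
```

The gap between the greedy and exact distances is the recall trade-off that build and search parameters control in production indexes.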

This acceleration shortens index build cycles and reduces per-query latency, allowing new embeddings to be incorporated more rapidly and relevant context to be retrieved quickly. To quantify what that means in practice, we benchmarked GPU-accelerated CAGRA against CPU-based qHNSW under recall-aligned conditions.

Methodology and results

To ensure a fair and production-relevant comparison, we evaluated CPU-based vector search against GPU-accelerated cuVS under controlled conditions.

  • Hardware configuration (on-prem): GPU: NVIDIA H100; CPU: Intel Xeon, 192 threads.
  • Dataset: MIRACL 10M dataset.
  • Number of runs: 100 per configuration.
  • Configuration and tuning: Both the CPU (qHNSW) and GPU (CAGRA) configurations were tuned to comparable recall targets across a range of build and search parameters. This ensured that performance gains were not achieved by sacrificing accuracy.
  • Algorithm-level vs. end-to-end definition: To avoid ambiguity, we distinguish clearly between two measurement scopes.
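
“Recall-aligned” in the list above means the two indexes are compared only at settings where they return a similar share of the true nearest neighbors. The article does not state the exact recall definition or the value of k used, so the following is a minimal sketch assuming the common recall@k metric measured against exact (brute-force) results. The function names and sample IDs are illustrative.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    if k <= 0 or len(exact_ids) < k:
        raise ValueError("k must be positive and no larger than the ground truth")
    truth = set(exact_ids[:k])
    hits = len(truth.intersection(approx_ids[:k]))
    return hits / k


def mean_recall(approx_results, exact_results, k):
    """Average recall@k over a batch of queries."""
    scores = [recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)


# Two example queries: the first misses one of five true neighbors.
exact_results = [[1, 2, 3, 4, 5], [10, 11, 12, 13, 14]]
approx_results = [[1, 2, 3, 4, 9], [10, 11, 12, 13, 14]]

print(f"mean recall@5: {mean_recall(approx_results, exact_results, k=5):.2f}")
```

Comparing latency only between configurations with similar mean recall prevents a faster but less accurate index from looking like a win.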

Algorithm-level performance

Algorithm-level measurements isolate the core indexing and search operations inside the vector engine. They exclude Python client overhead, client-to-database transmission, CPU/GPU transfer latency, and disk persistence, all of which we’ll measure separately in the end-to-end results.

This isolation matters because it answers a specific question: how much of the speedup is coming from GPU parallelism itself, before any system-level overhead enters the picture?

We observed:

Figure 1. Algorithm-level index build time.

Figure 2. Algorithm-level search speedup at batch size 1.

Index construction is up to 11.8x faster, and search latency drops by up to 40.3x. Both reflect the same underlying advantage: high-dimensional distance computation is parallel, and CAGRA’s graph traversal is built to keep thousands of GPU cores busy.

End-to-end performance in the production workflow

End-to-end measurements capture the full workflow as it runs in production: the call into KDB.AI from the Python client, data transmission to CPU or GPU, execution of the index build or search, and persistence of the result. This is the performance an application actually experiences, and the level you should expect to see in deployment.

Algorithm-level results show raw acceleration potential. End-to-end results show what survives once real-world overhead enters the picture, and that’s the number that matters for deployment decisions.
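
The difference between the two measurement scopes can be illustrated with a small timing harness. This is a sketch under stated assumptions, not the benchmark code used for this article: the brute-force search and the JSON encode/decode steps are stand-ins for the vector engine and for client and transport overhead, and every name is hypothetical.

```python
import json
import time


def core_search(vectors, query):
    """Stand-in for the engine's search: brute-force nearest neighbor."""
    return min(
        range(len(vectors)),
        key=lambda i: sum((x - y) ** 2 for x, y in zip(vectors[i], query)),
    )


def end_to_end_search(vectors, payload):
    """Stand-in for a full request: decode the request, search, encode the reply.

    Returns (reply, algorithm_seconds, end_to_end_seconds).
    """
    t_request = time.perf_counter()
    query = json.loads(payload)["query"]      # client and transport overhead

    t_algo = time.perf_counter()
    best = core_search(vectors, query)        # algorithm-level scope
    algorithm_seconds = time.perf_counter() - t_algo

    reply = json.dumps({"id": best})          # response overhead
    end_to_end_seconds = time.perf_counter() - t_request
    return reply, algorithm_seconds, end_to_end_seconds


vectors = [[float(i), float(i % 7)] for i in range(1000)]
payload = json.dumps({"query": [3.2, 3.0]})

reply, algo_s, e2e_s = end_to_end_search(vectors, payload)
print(f"reply: {reply}")
print(f"algorithm-level: {algo_s * 1e3:.3f} ms")
print(f"end-to-end:      {e2e_s * 1e3:.3f} ms")
```

Because the end-to-end timer wraps the algorithm timer, accelerating only the inner step leaves the surrounding overhead unchanged, which is why end-to-end speedups are smaller than algorithm-level ones.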

Figure 3. End-to-end index build time.

Figure 4. End-to-end search speedup at batch size 1.

Even with full workflow overhead, GPU acceleration delivers up to 3.0x faster index builds and up to 9.3x lower search latency in the high-recall regime (recall greater than 0.99) that financial agents operate in. The P95 results are worth a closer look: tail latency improves by up to 12.7x, which matters disproportionately for iterative reasoning loops, where one slow retrieval delays every step after it.
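
One way to see why tail latency weighs so heavily on iterative loops: by definition, a single query exceeds the P95 latency 5% of the time, but a loop of several sequential queries is far more likely to contain at least one such slow query. The sketch below assumes query latencies are independent, which is a simplification; real workloads may be correlated.

```python
def prob_loop_hits_tail(n_retrievals: int, percentile: float = 0.95) -> float:
    """Probability that at least one of n independent retrievals
    is slower than the given latency percentile."""
    return 1.0 - percentile ** n_retrievals


for n in (1, 5, 10):
    print(
        f"{n:>2} sequential retrievals: "
        f"{prob_loop_hits_tail(n):.1%} chance of at least one tail-latency query"
    )
```

Under this assumption, a ten-retrieval loop has roughly a 40% chance of hitting the tail at least once, so P95 behaves more like a typical case than a rare one for agentic workloads.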

Conclusion: financial AI at market speed

The benchmarks above don’t just describe faster vector search. They eliminate three specific failure modes in agentic financial systems.

  • Index staleness becomes a non-issue. Today, a 99% recall index on qHNSW takes roughly 45 minutes to rebuild. That cost is why most production deployments fall back to overnight refresh, leaving every intraday decision grounded in yesterday’s context. At 15 minutes on CAGRA, the same rebuild fits between events. A 10-K drops at 9:30 AM and is reflected in the semantic layer before the first hour of trading closes. Index refresh stops being a scheduled job and becomes an operational response to new data.
  • The agentic loop fits inside the alpha window. An agent issuing 10 sequential retrievals spends roughly 970ms on retrieval alone at qHNSW’s P95 latency. This is long enough that by the time the loop completes, the market has already moved past the signal that triggered it. The same loop on CAGRA completes in under 80ms, leaving the rest of the decision budget for LLM reasoning and action. The reasoning step is no longer competing with the market for time.
  • Research iteration cycles compress. A backtest that historically took a week of wall-clock time finishes in roughly a day at 9.3x end-to-end peak throughput. Quants iterate on hypotheses on demand instead of quarterly. Hypothesis testing stops being budgeted and starts being routine.
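
The three figures in the list above follow from the speedups reported earlier. The short calculation below reproduces them. All inputs are the numbers quoted in this article; the variable names are illustrative, and the backtest line assumes the run is entirely retrieval-bound and that the peak speedup applies throughout, which is an optimistic simplification.

```python
# Inputs quoted in the article.
QHNSW_REBUILD_MIN = 45.0     # ~45 minute rebuild at 99% recall on qHNSW
BUILD_SPEEDUP = 3.0          # up to 3.0x faster end-to-end index build
QHNSW_LOOP_MS = 970.0        # 10 sequential retrievals at qHNSW P95 latency
RETRIEVALS_PER_LOOP = 10
P95_SPEEDUP = 12.7           # up to 12.7x improved P95 tail latency
BACKTEST_DAYS = 7.0          # a backtest that took a week
SEARCH_SPEEDUP = 9.3         # up to 9.3x lower end-to-end search latency

# Derived values.
cagra_rebuild_min = QHNSW_REBUILD_MIN / BUILD_SPEEDUP
qhnsw_p95_ms = QHNSW_LOOP_MS / RETRIEVALS_PER_LOOP
cagra_loop_ms = QHNSW_LOOP_MS / P95_SPEEDUP
cagra_backtest_days = BACKTEST_DAYS / SEARCH_SPEEDUP

print(f"Index rebuild on CAGRA:     {cagra_rebuild_min:.0f} min")
print(f"qHNSW P95 per retrieval:    {qhnsw_p95_ms:.0f} ms")
print(f"10-retrieval loop on CAGRA: {cagra_loop_ms:.1f} ms")
print(f"Week-long backtest becomes: {cagra_backtest_days:.2f} days")
```

The loop works out to about 76 ms, consistent with the “under 80ms” figure, and the backtest to about three quarters of a day, consistent with “roughly a day.”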

For firms building agentic systems across equities, derivatives, and crypto markets, this is the shift from semantic retrieval as a scheduled batch process to semantic retrieval as a live signal. The alignment between structured and unstructured intelligence is not an optimization. It is foundational.

Use KDB.AI cuVS today

Spin up KDB.AI with cuVS-accelerated CAGRA indexes using the KX + NVIDIA blueprints. Clone it, run it on your hardware, and compare against your existing CPU baseline.

This blog was co-authored by Manas Singh, Technical Product Manager, NVIDIA.
