Supercharge hardware evaluation with KX Nano: An open-source benchmark tool

Author

Ferenc Bodon Ph.D

Head of benchmarking

Posted

14 July 2025

Key points

KX Nano is an open-source benchmark tool designed to evaluate hardware performance from the perspective of kdb+.
KX Nano focuses on executing fundamental kdb+ operations, providing granular insights into hardware performance, including sequential and random read and write tasks, as well as aggregation, serialization, and vector operations.
KX Nano thoroughly investigates the core components of your system through rigorous tests for storage, memory, and CPU.

KX has a long-standing reputation for prioritizing performance and maximizing the potential of available hardware. This commitment is evident in our constant collaboration with hardware vendors, independent reports, and the world records achieved via STAC-M3™ benchmarking.

STAC™, considered the gold standard in high-speed analytic testing, captures the performance of the entire solution, including database software, compute resources, networking, and storage. However, it is closed-source and challenging to replicate. Fortunately, other tools are readily available, including KX Nano, an open-source toolkit designed to measure raw CPU, memory, and storage I/O capabilities.

Nano features:

Comprehensive hardware testing: Nano investigates the core components of your system. It includes rigorous tests for storage, memory, and CPU, specifically stressing L1, L2, and L3 caches
Low-level kdb+ operations: Nano focuses on executing fundamental kdb+ operations, including sequential/random read/write tasks together with aggregation (e.g., sum and sort), serialization, compression, vector operations, and file metadata operations (e.g., opening a file)
Stress testing capabilities: Nano employs a clever approach to simulate demanding workloads. The main bash script initiates multiple kdb+ worker processes and a kdb+ controller, then directs these workers to execute the same operation simultaneously. Some of these operations are multi-threaded. This places a significant load on the hardware, particularly on the filesystem or on the memory, helping to identify bottlenecks and limitations
Modular and extensible: Recognizing that every testing scenario is unique, KX designed Nano to be highly modular and extensible. Clients and hardware vendors have requested and contributed new tests in the past, making it a continuously evolving tool
Configurable and customizable: Whether you want to compare the performance of flagship AMD CPUs against AWS Graviton and Intel CPUs using the cpuonly mode or are curious about how FSx Lustre or Rook Ceph run hundreds of parallel random reads across thousands of memory-mapped files, Nano offers complete flexibility and test customization
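
The controller-and-workers pattern described in the stress testing item can be illustrated with a minimal Python sketch. This is not how Nano is implemented (Nano uses a bash script and kdb+ processes); threads stand in for worker processes here, and the names `run_workers` and `operation` are hypothetical. A barrier releases all workers at the same moment so that they execute the same operation simultaneously.

```python
import threading
import time


def run_workers(worker_count, operation):
    """Start worker_count workers that all run `operation` at the same moment.

    Returns a list of (result, elapsed_seconds) tuples, one per worker.
    """
    # The barrier plays the controller's role: no worker proceeds until all are ready.
    start_gate = threading.Barrier(worker_count)
    results = [None] * worker_count

    def worker(index):
        start_gate.wait()                      # block until every worker is ready
        started = time.perf_counter()
        value = operation()                    # every worker runs the same operation
        results[index] = (value, time.perf_counter() - started)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                               # the controller waits for all workers
    return results


if __name__ == "__main__":
    for value, elapsed in run_workers(4, lambda: sum(range(1_000_000))):
        print(f"result={value} elapsed={elapsed:.4f}s")
```

Python threads do not run CPU-bound code in parallel, so this sketch only shows the coordination, not the hardware load that separate kdb+ processes would generate.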

Getting started with Nano is straightforward. Specify the data directory where the kdb+ processes will persist data and run the script with the default values.

$ git clone https://github.com/KxSystems/nano.git
$ cd nano
$ echo "/mnt/nvme1/nanotest" > partitions      # specify the data directory
$ source ./config/kdbenv                       # load kdb+ environment
$ source ./config/env                          # load default benchmark values
$ ./nano.sh

You can also adjust parameters, including the number of kdb+ worker processes and the number of threads per worker.

$ THREADNR=8 ./nano.sh --processnr 32 --scope cpuonly

The script nano.sh generates a rich result file in PSV (pipe-separated values) format, where each line captures detailed test metrics and metadata. Every entry includes essential test information, such as the test name and the corresponding q expression, along with specific hardware components being stressed (e.g., CPU, memory, or disk). Additionally, each record contains the measured performance value, allowing for straightforward parsing and analysis. This structured output facilitates performance benchmarking and hardware profiling by organizing key data points in a consistent, machine-readable, and human-readable format.
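
As a rough illustration of how such a pipe-separated file could be loaded for analysis, the sketch below uses Python's standard `csv` module. The column names (`test`, `qexpression`, `component`, `result`) and the sample rows are made up for this example; the real header written by nano.sh may differ, so check an actual result file before relying on them.

```python
import csv
import io

# Hypothetical sample: column names and values are assumptions for illustration,
# not the actual output of nano.sh.
SAMPLE = """test|qexpression|component|result
sum float large|sum x|memory|1200.5
med float large|med x|memory|310.0
"""


def read_psv(stream):
    """Parse pipe-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(stream, delimiter="|"))


if __name__ == "__main__":
    for row in read_psv(io.StringIO(SAMPLE)):
        # Every field arrives as a string; convert the measured value explicitly.
        print(row["test"], float(row["result"]))
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `io.StringIO` wrapper.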

For future reference and reproducibility, nano.sh also captures hardware information, such as the output of lscpu and numactl, and creates a config.yaml that stores the most important hardware and software settings (e.g., the number of CPUs or the kdb+ version used).

Case study

Advanced Micro Devices (AMD) is a leading semiconductor company that designs and develops a range of products, including central processing units (CPUs), graphics processing units (GPUs), and system-on-chip (SoC) solutions, catering to markets such as data centers, gaming, and embedded systems.

KX Nano was tested on two generations of AMD EPYC™-based systems: 4th-generation AMD EPYC processors (codenamed “Genoa”) and 5th-generation AMD EPYC processors (codenamed “Turin”).

The CPU tests in nano.sh measure the performance of various kdb+ operations on vectors of different sizes, targeting different levels of the memory hierarchy. For example:

The test “med float large” benchmarks the median calculation on a large floating-point vector
“-18! int tiny” evaluates how quickly a small integer vector can be serialized and compressed.

These tests are designed to stress different parts of the system: small vectors fit in L1/L2 CPU caches, medium-sized vectors test L3 cache, and large vectors exercise main memory bandwidth. Both integer and floating-point operations are included to assess arithmetic performance across data types. The workload includes operations with different memory access patterns: some, like “reciprocal”, read and generate new vectors (testing read-write throughput), while others, such as “sum”, are read-heavy.

Additionally, a random vector generation test evaluates pure memory write performance. The throughput of each operation is derived from the vector size divided by execution time, providing a measure of elements processed per second.
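
The throughput formula (vector size divided by execution time) can be sketched in Python as follows. This is an illustration of the calculation only, not Nano's q code; the function names `throughput` and `measure` are hypothetical.

```python
import time


def throughput(element_count, seconds):
    """Elements processed per second: vector size divided by execution time."""
    if seconds <= 0:
        raise ValueError("execution time must be positive")
    return element_count / seconds


def measure(operation, vector):
    """Time one call of operation(vector) and return its throughput."""
    started = time.perf_counter()
    operation(vector)
    elapsed = time.perf_counter() - started
    return throughput(len(vector), elapsed)


if __name__ == "__main__":
    vector = [float(i) for i in range(1_000_000)]
    print(f"sum: {measure(sum, vector):,.0f} elements/s")
```

A single timed call is noisy; a real benchmark would typically repeat the operation and aggregate the timings.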

To summarize overall CPU performance, the geometric mean of these results is computed, offering a balanced aggregate metric across different test scenarios.
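
For readers unfamiliar with the geometric mean, here is a minimal Python sketch of the calculation. It computes the mean of the logarithms and exponentiates the result; the function name is hypothetical, and Python's standard library offers an equivalent in `statistics.geometric_mean`.

```python
import math


def geometric_mean(values):
    """Geometric mean via the mean of logarithms; all values must be positive."""
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    # One very large throughput does not dominate the aggregate:
    # the arithmetic mean of 1 and 100 is 50.5, the geometric mean is 10.
    print(geometric_mean([1, 100]))
```

Because it averages ratios rather than absolute values, the geometric mean is a common choice when the individual results span several orders of magnitude.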

Benchmarks

Three benchmarking scenarios were conducted¹:

Scenario 1: One kdb+ worker – One thread/kdb+ worker, which stresses the performance of a single core
Scenario 2: Max kdb+ workers – One thread/kdb+ worker, which tests the performance of the entire system and deploys as many kdb+ workers as there are available threads in the system
Scenario 3: N kdb+ workers – Eight threads/kdb+ worker, which tests the performance of the entire system and allows for variation in the number of threads/kdb+ worker

For all three of the scenarios, the geometric means across all vector sizes were calculated against the following test groups:

Test group	Tests
CPU cache – Tests stress L1 and L2 cache	CPU read CPU cache, CPU read write CPU cache, CPU write CPU cache
Mem – Tests access L3 cache and main memory	CPU read mem, CPU read write mem, CPU write mem

Scenario 1 outcome

A single Turin thread executing a single kdb+ worker outperforms the same configuration on Genoa by up to 32% in the CPU read-write test among the “CPU cache” test group
Turin outperforms Genoa by up to 43% in the CPU read test among the “mem” test group

Given the multi-threaded nature of the Nano benchmarking suite, scenarios 2 and 3 are more representative of system-level production deployments.

Scenario 2 outcome

One thread is allocated per kdb+ worker
Genoa runs 192 kdb+ workers across a two-socket (2P) system (96 per socket)
Turin runs 256 kdb+ workers across a two-socket (2P) system (128 per socket)
All cores were observed to execute at 100% CPU utilization, maximizing the compute capacity of the systems under test

Greater generational performance improvements are observed in Scenario 2 than in Scenario 1.

Scenario 3 outcome

A variable number of threads can be allocated per kdb+ worker. Because both “Zen 4” and “Zen 5” processors group eight cores per CCD (core complex die), eight threads were assigned to each kdb+ worker
Dividing 192 threads of the two-socket Genoa system by eight yields a total of 24 kdb+ workers
Dividing 256 threads of the two-socket Turin system by eight yields a total of 32 kdb+ workers
Although all cores were under load, CPU utilization varied across tests, in contrast to the 100% utilization observed on all cores in Scenario 2
The largest generational improvement is observed in Scenario 3, with a maximum uplift of 1.91x for the CPU read-write test in the “CPU cache” test group
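
The worker counts in Scenarios 2 and 3 follow from dividing the number of hardware threads by the threads assigned to each worker. The Python sketch below reproduces that arithmetic and builds a matching command line from the THREADNR variable and the --processnr and --scope options shown earlier in this article. The function names are hypothetical, and whether the published runs used exactly these invocations is an assumption.

```python
def worker_count(sockets, cores_per_socket, threads_per_worker, smt_threads_per_core=1):
    """Number of kdb+ workers needed to occupy every hardware thread."""
    hardware_threads = sockets * cores_per_socket * smt_threads_per_core
    if hardware_threads % threads_per_worker:
        raise ValueError("threads per worker must divide the hardware thread count")
    return hardware_threads // threads_per_worker


def nano_command(workers, threads_per_worker):
    """Build a cpuonly invocation (assumed form, based on the article's example)."""
    return f"THREADNR={threads_per_worker} ./nano.sh --processnr {workers} --scope cpuonly"


if __name__ == "__main__":
    # Two sockets, SMT off: Genoa has 96 cores per socket, Turin has 128.
    for name, cores in (("Genoa", 96), ("Turin", 128)):
        for threads in (1, 8):                 # Scenario 2 and Scenario 3
            workers = worker_count(2, cores, threads)
            print(name, nano_command(workers, threads))
```

With SMT enabled, passing `smt_threads_per_core=2` would double the hardware thread count used in the division.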

Some test-level details are summarized below. The benchmarks used a large vector filled with semi-random floating-point numbers. The graph displays execution time ratios: a value of 1.6 means that Genoa took 1.6 times as long as Turin to complete the test, so Turin processed 60% more elements per unit of time. This normalization allows for straightforward comparison across different hardware configurations.
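
To make the ratio concrete, the following Python sketch converts an execution time ratio into a throughput gain and a time reduction. It assumes the ratio is the baseline (Genoa) time divided by the candidate (Turin) time; the function names are hypothetical.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Execution time ratio: how many times longer the baseline took."""
    return baseline_seconds / candidate_seconds


def throughput_gain_percent(ratio):
    """Extra work done per unit of time, e.g. a ratio of 1.6 gives 60%."""
    return (ratio - 1.0) * 100.0


def time_saved_percent(ratio):
    """Reduction in execution time, e.g. a ratio of 1.6 gives 37.5%."""
    return (1.0 - 1.0 / ratio) * 100.0


if __name__ == "__main__":
    ratio = speedup(1.6, 1.0)
    print(f"ratio {ratio:.2f}: "
          f"+{throughput_gain_percent(ratio):.1f}% throughput, "
          f"-{time_saved_percent(ratio):.1f}% execution time")
```

The two percentages differ because one is measured relative to the slower system and the other relative to the faster one.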

[Figure: Scenario 3 subtest-level execution time ratios]

Scenario 3 (subtest level performance) outcome

Turin consistently outperforms Genoa across the “float large” subtests.

System under test

	AMD EPYC 9654	AMD EPYC 9755
Server model	AMD CRB “Titanite”	AMD CRB “Volcano”
Processor	9654	9755
Socket	2	2
Cores per socket	96	128
Frequency (base/boost)	2.4 GHz/3.7 GHz	2.7 GHz/4.1 GHz
L1d/L2/L3 cache	6 MiB/192 MiB/768 MiB	12 MiB/256 MiB/1 GiB
NUMA nodes	2	2
Memory	1.5 TB	2.3 TB
Memory module size	64 GB	96 GB
Memory speed	DDR5/4800 MT/s	DDR5/6400 MT/s
Memory channels	24	24
OS	RHEL 9.5	RHEL 9.5
Kernel	5.14.0-503.11.1.el9_5.x86_64	5.14.0-503.40.1.el9_5.x86_64
SMT	OFF	OFF
Determinism	Power	Power
Nano version	6.2	6.2
*CRB = customer reference board

While both AMD EPYC processor families are designed to advance data center and enterprise computing, the 5th Generation AMD EPYC processors, released in October 2024, bring the latest technological advancements, including the following:

Zen 5 and Zen 5c cores are produced using 4nm and 3nm process technology, respectively, featuring up to 17% higher instructions per clock (IPC) for single-threaded tasks.
Increased core density, offering up to 192 cores per processor, up from the 4th Generation AMD EPYC maximums of 96 cores with “Genoa” and 128 cores with “Bergamo”.
Enhanced memory with DDR5 6400 MT/s speeds supported via the 6nm process I/O die.

¹Both simultaneous multithreading (SMT) settings, SMT=OFF and SMT=ON, were tested. With SMT=ON, a single core can execute two threads concurrently, which improved performance in some cases and degraded it in others. SMT=OFF, where each core runs a single thread, was therefore chosen to report the best results across the geomeans.

KX Nano is an open-source tool. We encourage you to explore its capabilities, contribute your enhancements, and add tests. By fostering a collaborative environment, the tool can become even more robust and beneficial for the entire kdb+ community.

Visit our GitHub repository to learn more.