Mastering memory mapping in kdb+

Explore the methodology and benefits of kdb+ memory mapping

Key points

Memory mapping accelerates access to on-disk data by mapping file contents to virtual memory. This avoids the overhead of copying entire files into process buffers. Only the required parts of the file are loaded into memory on demand, allowing large files to be processed with low memory overhead.
kdb+ supports two modes of memory mapping: immediate and deferred. Immediate mode maps files into memory instantly, while deferred mode informs the operating system that the mapping will occur at a later stage.
The anymap feature allows almost all kdb+ structures to be memory-mapped, due to the format kdb+ uses to persist complex data structures to disk.
.Q.MAP keeps all HDB partitions immediately mapped, enabling faster data access by eliminating the overhead of repeated system calls every time a file is accessed. Consideration should be given to system resources and whether data is compressed before using .Q.MAP.

Memory mapping lies at the heart of how kdb+ interacts with on-disk data, contributing to its high-speed data retrieval. Although mapping occurs natively, being aware of how it works, what it applies to, and the various methods available can enable developers to make informed choices when working with databases and APIs, ensuring optimal performance from kdb+.

This blog explores the concept and methods of memory mapping, the benefits of anymap, and memory mapping examples, including .Q.MAP.

Background

Memory mapping enhances speed when accessing data on disk. To access an on-disk file, the data within the file is normally copied into the data buffers of the process. Memory mapping avoids the copying overhead by utilizing an optimized mapping instead.

Memory mapping utilizes virtual memory to treat secondary storage as primary memory, enabling processes to access more memory than is physically available.

When a file is memory-mapped, its contents are mapped to memory instead of being copied. This is achieved by assigning a portion of virtual memory to contain the mapping, which is a byte-for-byte correlation to the memory-mapped resource. When the file data is contained within the process address space, the amount of I/O data movement is decreased by allowing direct access to the file.

When the executing process uses the mapping, it can treat the mapped virtual memory as if it were primary memory. Note that when a kdb+ process updates mapped data, the change affects neither the files on disk nor other processes mapping the same files.
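
A minimal sketch of this behaviour, assuming a scratch directory and a hypothetical file name cowvec that is not part of the original examples: a mapped vector is amended in the process, and the file is then read again to check that it is unchanged.

```q
/ hypothetical file name, used only for this illustration
`:cowvec set til 10;
v:get`:cowvec;             / map the vector from disk
v[0]:100;                  / amend the in-process value only
show v 0;                  / 100
show first get`:cowvec;    / 0, a fresh read of the file is unchanged
```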

Pages (blocks of virtual memory) are loaded on demand: a page is copied into memory only when a process tries to access it and it is not already in memory (an event known as a page fault). Only the pages the executing process needs are loaded, which avoids unnecessary work.

Creating the mapping incurs overhead and may not be advantageous in all situations. Memory mapping can, however, be particularly useful for randomly accessing files and for repeated access to these files.

Memory mapping in kdb+

Kdb+ can map files in two modes:

Immediate mode: The files are mapped into memory immediately.
Deferred mode: The files are not immediately mapped, but the operating system is informed that the mapping will take place at a future time, allowing optimizations to be made. The files are only mapped when the executing process explicitly tries to access them.

The developer can invoke either immediate or deferred mapping with a simple syntax change, and kdb+ conveniently allows us to inspect what is happening via .Q.w, an in-built function to retrieve memory statistics. Given that the used and mapped memory will be inspected frequently throughout, a small helper function called getMem will be defined.

q)getMem:{`used`mmap#.Q.w[]} 
q)getMem[] 
used| 421936 
mmap| 0

Immediate mapping

The simplest structure that can be mapped is a vector, though vectors cannot be mapped in a deferred manner. The difference between modes can be illustrated using a splayed table.

/Creating a sample splayed table 
q)`:tab/ set ([]100000?100;100000?1000f) 
`:tab/

To retrieve a file on disk, the get command is used. Assuming the file is mappable, this maps it into memory rather than loading it. When ‘get’ is used on the file path without a trailing slash, the mapping of the file is immediate.

Explicitly assigning the mapped table to a variable ensures the table remains mapped. Otherwise, it is immediately unmapped.

q)immediate:get`:tab     /No trailing slash

Checking the mmap memory stats confirms that the data has been mapped, as opposed to copied; otherwise, mmap would be zero, and the used memory would increase substantially.

q)getMem[] 
used| 423104 
mmap| 1600032

.Q.s1 allows us to investigate the structure of the table. The result confirms the mapping is immediate as the data values are displayed:

q).Q.s1 immediate
"+`x`x1!(81 96 32 46 99 88 61 54 31 6 25 49 61 76 30 10 37 90 92 69 20 80 62 36 38 68 ..

The table is cleared to reset the memory statistics to compare with deferred mapping.

q)delete immediate from `.
`.
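
The same check can be sketched for a single vector, the simplest mappable structure. The snippet assumes a hypothetical file named vecdemo and redefines getMem so that it stands alone. Byte counts vary by system, so none are shown. It also illustrates that an unassigned get leaves nothing mapped once the expression has finished.

```q
getMem:{`used`mmap#.Q.w[]};   / same helper as defined earlier
`:vecdemo set 1000000?100;    / hypothetical file: one million longs
m0:getMem[];
count get`:vecdemo;           / result not assigned, so the mapping is released at once
m1:getMem[];
vec:get`:vecdemo;             / assignment keeps the vector mapped
m2:getMem[];
show (m1-m0)`mmap;            / expected to be 0: nothing holds the mapping
show (m2-m1)`mmap;            / expected to be close to the file size
show hcount`:vecdemo;         / file size on disk, for comparison
```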

Deferred mapping

One extra character is all that is required to inform kdb+ that a deferred mapping is to be used. By appending a trailing slash to the file path when using the ‘get’ command, the mapping is now deferred.

q)deferred:get`:tab/

q)getMem[] 
used| 421776 
mmap| 0

q).Q.s1 deferred 
"+`x`x1!`:tab/"

The mmap memory is zero, even though kdb+ has been requested to ‘get’ the data. When inspecting the underlying structure, no data is displayed; instead, a special structure indicating a mapping to the required files is shown.

So, is the data there or not? Without explicitly checking the memory/underlying structure, it would be easy to assume the trailing slash has no effect.

Referencing the ‘deferred’ variable:

q)deferred 
x  x1
-----------
80 587.8718
70 780.8416
…

The data is displayed immediately.

The difference is that the deferred table is only mapped when the executing process attempts to access it, and the mapping and unmapping are carried out every time it is accessed. This will result in increased overhead, especially compared to repeatedly accessing a table that has been mapped immediately.

Comparing the times of a simple select statement:

q)immediate:get`:tab
q)(select from immediate)~select from deferred
1b
q)\t:10 select from immediate
0
q)\t:10 select from deferred       /overhead of extra system calls
22

Anymap

Anymap, introduced in kdb+ 3.6, allows almost all structures to become mappable. Prior to 3.6, mappable compound lists had the restriction that elements of the list were of the same type, for example, a compound list of longs.

Non-uniformly typed structures, for example, a list of longs and floats, were copied into memory when accessed.

With anymap this problem is bypassed due to the format in which kdb+ saves the data to disk. The anymap structure is used for compound lists of uniform and non-uniform types. All anymap structures have type 77h.

q)`:a set (til 1000;1000?100f)
`:a
q)type get`:a
77h

Notice that two files have been created.

q)\ls
,"a" 
"a#"

The data is stored within the second file with # appended. The anymap structure within the files provides a format that is mappable, as opposed to previously unmappable non-fixed-width records.
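
To see how the content is split across the two files, their sizes can be compared with hcount. This is a sketch using a hypothetical file name anydemo and assumes kdb+ 3.6 or later. No sizes are shown, since they may differ between versions.

```q
`:anydemo set (til 1000;1000?100f);    / hypothetical file name
files:`:anydemo,`$":anydemo#";         / main file and its # companion
show files!hcount each files;          / size in bytes of each file
show type get`:anydemo;                / 77h for an anymap structure
```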

When retrieving the data, it is reconstructed automatically from both files.

q)getMem[] 
used| 421680 
mmap| 0
 
q)a:get`:a 
q)getMem[] 
used| 422784 
mmap| 24240 

q)a    /both vectors mapped 
0        1        2        3       4        5        6        7   .. 
18.70281 35.95293 48.09078 44.6898 13.16095 63.33324 69.90336 44.18975 ..

Anymap is not restricted to a list of vectors but can also contain dictionaries and tables.

Anymap prevents the entire file from being copied, but individual vectors within the file are still copied to the heap when accessed if the file was written with set.

q)getMem[]
used| 422784
mmap| 24240

q)a1:a 0      /extracting the vector of longs from the compound list
q)getMem[]
used| 430976
mmap| 24240

This increases the used memory while the whole file is still mapped.

1: Write Down

An alternative method of writing the data can prevent this data copy; instead of using set, use 1:

For example,

q)`:b 1: (til 1000;1000?100f)  /replace set with 1:
`:b
q)getMem[]
used| 430976
mmap| 24240
q)b:get`:b
q)getMem[]
used| 431040
mmap| 48480
q)b1:b 0
q)getMem[]         /compared with using set, this time the used memory does not increase
used| 431040
mmap| 48480
q)b
0        1        2        3       4        5        6        7   ..
18.70281 35.95293 48.09078 44.6898 13.16095 63.33324 69.90336 44.18975 ..
q)b1 /able to access the individual vector without using more memory due to 1: write down
0 1 2 3 4 5 6 7 ..

Any vector within the mapped structure is available for use with no extra copying overhead using this method.

Flat file tables

The 1: write down is useful for flat file tables. If written with ‘set’, flat file tables are always copied when accessed.

q)`:t 1: ([]col1:til 10000;col2:10000?300f) 
`:t 
q)\ls
,"t"
"t#"
"t##"

Notice that a t## file is also created in this case. Inspecting the file shows that it contains the table's column names, as they are stored as symbols.

q)get`$":t##" 
`col1`col2
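
The difference between the two write methods for a flat file table can be checked with getMem. The sketch below assumes a kdb+ version in which 1: accepts a table, as in the example above, and uses the hypothetical file names flatset and flatmap. The comments state what the text leads one to expect rather than measured output.

```q
getMem:{`used`mmap#.Q.w[]};                  / same helper as defined earlier
tbl:([]col1:til 10000;col2:10000?300f);
`:flatset set tbl;                           / hypothetical file names
`:flatmap 1: tbl;
m0:getMem[];
viaSet:get`:flatset;  m1:getMem[];
viaMap:get`:flatmap;  m2:getMem[];
show m1-m0;    / per the text, growth is expected under used (table copied)
show m2-m1;    / per the text, growth is expected under mmap (table mapped)
```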

Anymap and symbols

Symbols are interned strings that can vary in length. Because the records are not fixed-width, symbol vectors cannot be mapped to memory.

q)`:syms set 1000000?`3 
`:syms 
q)getMem[] 
used| 422336 
mmap| 0 
q)syms:get`:syms 
q)getMem[] 
used| 8812064 
mmap| 0

The name anymap suggests it can map anything, though symbols remain an exception. Using 1: will have no mapping benefits on symbol vectors. If symbols are present in a compound structure, they are enumerated against a file named file##.

q)n:1000000 
q)`:a set (n?1000;n?100f;n?`3)   /third vector is a symbol vector 
`:a 
q)\ls 
,"a" 
"a#"
"a##" 
q)getMem[] 
used| 422352 
mmap| 0 
q)a:get`:a 
q)getMem[]             /mmap increased but so does used 
used| 489008 
mmap| 24008272
q)get`$":a##"
`jai`eip`ogj`iap`hbf`aff`loh`lgk`ifk`hea`ofd`aaj`deb`pog`iaf`cfg`kck

The ## file is equivalent to the sym file in an HDB: it is a symbol vector used to de-enumerate the symbols and is therefore copied into memory when the file is loaded. Other mappable vectors are mapped as usual.

Symbols are always copied, even with anymap, though thankfully, kdb+ enforces the enumeration of symbols when splaying tables or when using anymap. This means only a single distinct list of all symbols must be copied, and given that the symbol data type should be chosen for highly repeating values, the amount of data kdb+ has to copy is kept to a minimum.
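
The effect of enumeration can be sketched with a single vector. The names symdemo and enumvec are hypothetical, and the enumeration domain is held in a variable rather than a sym file to keep the snippet self-contained. Once enumerated, the vector holds fixed-width indices, so it is expected to map like any other simple vector.

```q
getMem:{`used`mmap#.Q.w[]};     / same helper as defined earlier
symdemo:`symbol$();             / hypothetical enumeration domain
raw:1000000?`3;                 / highly repeating symbols
`:enumvec set `symdemo?raw;     / enumerate, then save the fixed-width indices
m0:getMem[];
en:get`:enumvec;                / enumerated vector read back from disk
m1:getMem[];
show m1-m0;                     / growth expected under mmap rather than used
show count symdemo;             / size of the distinct list that must live in memory
show raw~value en;              / 1b: de-enumerating recovers the original symbols
```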

Memory mapping considerations

Kdb+ is all about speed; understanding how memory mapping is used within q-sql will help ensure the best performance is achieved.

To explore the effects of memory mapping when using q-sql, a sample splayed table called ‘trade’ is created, whose symbol columns are enumerated using .Q.en.

q)n:1000000 
q)trade:([]sym:n?`3;time:n?.z.t;price:n?200f;size:n?300) 
q)`:trade/ set .Q.en[`:.;trade] 
`:trade/ 
q)key`:. 
`s#`sym`trade

The table can be loaded from disk using \l, which loads the table in a deferred manner.

q)\l . 
q)getMem[] 
used| 490016          /increases due to sym file 
mmap| 0

All the columns are simple mappable vectors.

q)meta trade 
c    | t f a
-----| -----
sym  | s
time | t
price| f
size | j

The number of columns specified in a query will affect how much data is mapped.

Selecting all columns:

q)t:select from trade 
q)getMem[] 
used| 490816 
mmap| 28004144

Selecting a subset of columns, for example, dropping the price and size columns.

q)t:select sym, time from trade 
q)getMem[] 
used| 490752 
mmap| 12004112

The value of mmap decreases because the columns of a splayed table are mapped only when a query requires them, which avoids the overhead of mapping data that will not be used. kdb+ maps only the smallest subset of data that the query requires, so include only the columns needed in queries.

Virtual columns

When using the virtual column i, there is a noticeable difference in the used memory.

q)t:update i:i from trade 
q)getMem[] 
used| 8878112 
mmap| 28004144

Virtual columns do not exist on disk and are therefore not mapped; instead, they are created on demand when accessed. Referencing the virtual column, therefore, leads to greater RAM usage. Only include the virtual column ‘i’ when necessary.

Where constraints

When ‘where’ constraints are added while querying splayed tables, the resulting dataset is copied into RAM as opposed to being mapped.

q)\ts select from trade 
8 784 
q)t:select from trade where price>100 
q)getMem[] 
used| 15169568 
mmap| 0

The used memory increases substantially, and at this stage, none of the data is mapped.

The further the constraint reduces the dataset, the less memory is used.

q)\ts select from trade where price > 150 
16 9437984 
q)\ts select from trade where price > 200 
14 1049280
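
Note that price was generated with n?200f, so no row can satisfy price > 200, yet \ts still reports roughly 1 MB. One plausible explanation, offered here as an inference rather than something the measurements prove, is the intermediate boolean mask produced by the constraint, which holds one byte per row. A small in-memory sketch with a hypothetical price vector:

```q
n:1000000;
price:n?200f;           / same distribution as the trade price column
mask:price>200;         / boolean vector, one entry per row
show sum mask;          / 0i: no generated value reaches 200
show -22!mask;          / serialized size in bytes, roughly one byte per row
```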

.Q.MAP

As shown when comparing deferred and immediate mapping, there is an overhead associated with mapping and unmapping the files every time they are accessed. Having all the files permanently mapped would be advantageous to reduce this overhead. .Q.MAP was added to accommodate this. It can be run after loading a database.

q).Q.w[] 
used| 424624 
heap| 67108864 
peak| 67108864 
wmax| 0 
mmap| 0 
mphy| 2083708928 
syms| 934 
symw| 44816 
q).Q.MAP[] 
q).Q.w[] 
used| 434544 
heap| 67108864             /heap is unchanged 
peak| 67108864 
wmax| 0 
mmap| 4165159648           /significant increase in mmap 
mphy| 2083708928 
syms| 976 
symw| 48172

Points to consider:

.Q.MAP can significantly increase performance by removing the mapping overhead, though .Q.MAP itself may take some time to run.
.Q.MAP will use a large amount of the address space, so it should not be used blindly.
It is inadvisable to use .Q.MAP with a compressed database, as decompressed maps will use physical memory and/or swap. See https://code.kx.com/q/ref/dotq/#qmap-maps-partitions
.Q.MAP opens a handle to all files. The limit on the number of open handles on the server may need to be increased to accommodate this. See https://code.kx.com/q/kb/linux-production/#compression
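
As a rough aid for the open-handle consideration, the number of column files in a date-partitioned HDB can be estimated before calling .Q.MAP. The sketch below builds a tiny hypothetical database under ./hdbdemo, and estFiles is an illustrative helper, not a kdb+ built-in. It assumes one file per column per partition, so it will undercount where columns are stored as anymap (which adds # files).

```q
/ build a tiny hypothetical HDB under ./hdbdemo: two date partitions, one table
trade:([]sym:`a`b`c;price:1 2 3f;size:10 20 30);
{.Q.dpft[`:hdbdemo;x;`sym;`trade]} each 2024.01.01 2024.01.02;
system"l hdbdemo";
/ .Q.pt lists the partitioned tables and .Q.pv the partition values;
/ cols includes the virtual partition column, hence the -1
estFiles:{sum {count[.Q.pv]*-1+count cols x} each .Q.pt};
show estFiles[];      / 6 here: 2 partitions x 3 column files
```

The result can then be compared against the per-process open-file limit configured on the server.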

To explore the effects of .Q.MAP, a sample partitioned database is loaded. The database is partitioned by date and contains two tables, trade and quote.

q)\l . 
q)\a 
`s#`quote`trade 
q)getMem[] 
used| 424848  /from the sym file 
mmap| 0

When loading the root HDB directory, the sym file is copied into memory (and potentially flat file tables if written down using set), mappable vectors are mapped immediately, and splayed tables are mapped in a deferred manner.

q)\ts select sym, time, price, size, exchange from trade where date=last date
6 3312
q)\ts select sym, time, price, size, exchange from trade where date=last date
6 2896
q)\\

q)\l .
q)\ts .Q.MAP[]            /has an overhead to run
98 10416
q)\ts select sym, time, price, size, exchange from trade where date=last date
0 2944

The data is returned instantly, outperforming repeated queries in the previous example. When dealing with uncompressed HDBs, .Q.MAP should be considered if increased performance is required.
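
One way to respect the compression caveat is to check a sample column file before calling .Q.MAP. The following is a sketch under stated assumptions: the database under ./mapdemo and the helpers isCompressed and mapIfPlain are hypothetical, and a single sampled file is taken as representative of the whole database. It relies on -21! returning an empty dictionary for a file that is not compressed.

```q
/ build a tiny hypothetical HDB under ./mapdemo and load it
trade:([]sym:`a`b`c;price:1 2 3f;size:10 20 30);
{.Q.dpft[`:mapdemo;x;`sym;`trade]} each 2024.01.01 2024.01.02;
system"l mapdemo";
/ -21! returns compression statistics for a file,
/ and an empty dictionary when the file is not compressed
isCompressed:{0<count -21!x};
mapIfPlain:{[f] $[isCompressed f; -1"compressed data found, skipping .Q.MAP"; .Q.MAP[]];};
mapIfPlain `:2024.01.02/trade/price;   / sample a single column file
show .Q.w[]`mmap;                      / expected to be non-zero once partitions are mapped
```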

To learn more, visit kx.com or sign up for a free personal edition. You can also read our independent benchmarking report and see how we compare against other TSDBs.
