
    Kdb+/q Insights: Parsing data in kdb+

    27 March 2019

    by Rian O’Cuinneagain

    Passing data between systems can often become complex, if not impossible, due to a lack of interoperability. While kdb+ provides many interfaces to simplify integration, its native text-parsing capabilities remain extremely important too, as they can greatly simplify data ingestion and inter-process communication. These requirements are often encountered during the exploration or proof-of-concept phases of projects, but there are two additional areas where they can be critical:

    • For large one-off ingestion of data
    • For ongoing ingestion of data from a provider which does not provide other methods

    In both of these cases it is important that the parsing of data is as efficient as possible. In this blog we offer some tips for implementing clean and efficient parsing of CSV data into kdb+; in a follow-up blog we will look at the increasingly popular alternative format, JSON.


    CSV files offer little flexibility in the types of data structures they can hold, but parsing them is straightforward.

    Take the following example file:

    read0 `:tab.csv

    With one line it is possible to parse it to the exact type of table required:

    tab:("JFS* DT";enlist ",") 0: `:tab.csv
    tab
    longCol floatCol symbolCol stringCol dateCol    timeCol
    1       4        b         "bb"      2018.11.23 00:01:00.000
    2       5        h         "dd"      2018.11.23 00:01:00.003
    meta tab
    c        | t f a
    ---------| -----
    longCol  | j
    floatCol | f
    symbolCol| s
    stringCol| C
    dateCol  | d
    timeCol  | t
    • * reads a column as strings (stringCol)
    • A space ignores a given column (skipCol)
    • Any column to the right which is not given a parse rule is ignored (ignoreCol)

    For more information on these settings and on the loading of fixed-width fields, see the 0: (File Text) reference on code.kx.com.
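    As a brief sketch of the fixed-width form: a list of field widths replaces the delimiter in the left argument of 0:. The 8- and 9-character widths below are an illustrative choice, splitting a combined date-time string:

    //Widths instead of a delimiter split each line into fixed-width fields
    ("DT";8 9) 0: enlist "20130315185937188"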

    Don’t clog your memory

    A common mistake by users loading CSV data is to use S as a default datatype. This is the symbol datatype, which should only be used for columns with a small number of unique values. Instead, use * when uncertain; this will load the data as strings (C in a table meta).
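    One way to follow this advice, sketched below with a hypothetical file and column name: load the doubtful column as strings, check its cardinality, and only then cast it to symbol with `$.

    //Hypothetical file data.csv: load doubtfulCol with * rather than S
    t2:("J*";enlist ",") 0: `:data.csv
    count distinct t2`doubtfulCol             //Check the number of unique values
    update doubtfulCol:`$doubtfulCol from `t2 //Cast to symbol only if it is small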

    This is because symbols are interned strings. Any time a symbol with a new value is created in a process, it is added to an internal lookup table. These entries cannot be freed during garbage collection with .Q.gc and will instead persist in the memory of the process until it exits.

    You can see the number of syms in this internal table and the memory they are using with .Q.w:

    .Q.w[]`syms`symw //Check the starting values
    1206 53636

    r:`$string til 10000 //Create a list of 10 thousand symbols
    .Q.w[]`syms`symw //Both the count and the memory used have grown
    11204 588234

    delete r from `. //Delete r
    .Q.gc[] //Run garbage collection; no memory will be returned
    .Q.w[]`syms`symw //No change in number of syms or memory used
    11204 588234

    The white paper Working With Sym Files covers this topic in greater detail when it comes time to store symbols to disk.

    Complex parsing

    Date format

    //The \z system command controls the expected format: 0 for mm/dd/yyyy, 1 for dd/mm/yyyy
    //You may get an unexpected date
    \z 0
    "D"$"05/06/2010" //Read as May 6th (mm/dd), not June 5th
    2010.05.06

    //Or a null
    "D"$"30/12/2010" //30 is not a valid month
    0Nd

    //Changing \z to 1 gives the intended result
    \z 1
    "D"$"30/12/2010"
    2010.12.30

    //Using \z correctly will perform much better than using a manual parsing method
    manyDates:100000#enlist "30/12/2010"
    \t "D"${"." sv reverse "/" vs x} each manyDates

    \z 1
    \t "D"$manyDates

    Other accepted time and date formats

    Many other formats can be parsed by kdb+.

    A selection:

    "D"$"2018 Jan 30"
    2018.01.30

    "D"$"2018 January 30"
    2018.01.30

    "P"$"1546300800" //Unix epoch seconds
    2019.01.01D00:00:00.000000000

    //Some values only parse to the deprecated datetime Z format
    //We can simply cast them on to timestamps with "P"$"Z"$x


    Speed and efficiency

    Don’t do the same work twice (.Q.fu)

    There are often fields which we cannot parse natively. Parsing with custom string manipulations is a computationally intensive task. One way to reduce the cost is to apply the function once per distinct item and map the results back. This is only suitable when the data has a small number of distinct elements, i.e. for dates but not for unique timestamps.

    .Q.fu simplifies this task

    //Take this example which will not parse natively:
    "D"$"November 30 2018"
    0Nd

    //By reordering the components it will parse
    "D"$" " sv @[;2 0 1] " " vs "November 30 2018"
    2018.11.30

    //This text-based cutting is not efficient
    manyDates:100000#enlist "November 30 2018"
    \t "D"${" " sv @[;2 0 1] " " vs x} each manyDates

    //Switching to .Q.fu sees a huge speed-up
    \t .Q.fu[{"D"${" " sv @[;2 0 1] " " vs x} each x}] manyDates


    Straight line speed (Vectorised operations)

    Sometimes part of parsing data requires mathematical calculations to be performed. A common example of this is differing epochs between languages and systems. When parsing a column one may write functions which iterate through a row at a time rather than operating on the whole column. This is sometimes the only choice. However, kdb+ has native vector based operations which gain huge efficiency by operating on the column as a whole.

    //Assume we are given a field which is seconds since 1900.01.01D00
    //With that information we can extract what is needed from the field


    //It may be tempting to write a function and iterate through the data
    manyTimes:1000000#enlist "3755289600"
    \t {1900.01.01D00+0D00:00:01*"J"$x} each manyTimes
    //But this will perform poorly

    //It serves better to write functions which accept lists
    //This allows you to take advantage of vector-based numeric operators in cases like this
    \t {1900.01.01D00+0D00:00:01*"J"$x} manyTimes


    Skip the middle man (Named pipes)

    Often plain text files will come compressed. This requires them to be:

    1. Decompressed to disk in full
    2. Ingested from disk

    This is an inefficient use of resources, as the uncompressed file will only ever be read once. Named pipes allow the disk to be taken out of the equation by streaming the uncompressed data directly into kdb+.
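    As a sketch (the file names are hypothetical; named pipes require a Unix-like OS and kdb+ 3.4 or later), a gzipped CSV can be streamed through a fifo:

    //Create a named pipe and decompress the file into it in the background
    system"rm -f fifo && mkfifo fifo"
    system"gunzip -cf tab.csv.gz > fifo &"
    //Read from the pipe in chunks with .Q.fps (covered in the next section)
    .Q.fps[{0N!count x}] `:fifo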

    For more information and examples, see the named pipes page on code.kx.com.

    Stream it in (.Q.fs, .Q.fsn & .Q.fps)

    As text files grow, the memory usage of the ingestion process can become a concern. .Q.fs, .Q.fsn and .Q.fps allow control of this by providing the ability to specify how many lines at a time to pull into memory. Each batch can then be published to another process or written to disk before continuing.

    • .Q.fs – Operates on standard files
    • .Q.fsn – Allows the specification of chunk size
    • .Q.fps – Operates on named pipes
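    A minimal sketch of the chunked interface, reusing the earlier tab.csv file (the function here only counts lines per chunk; in practice each chunk, a list of lines, would be parsed with 0: and appended to a table):

    //Apply the function to successive chunks of complete lines
    .Q.fs[{0N!count x}] `:tab.csv
    //.Q.fsn additionally takes a chunk size in bytes (here roughly 50MB)
    .Q.fsn[{0N!count x};`:tab.csv;50000000]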

    As well as managing memory, .Q.fsn allows us to ensure our optimizations using .Q.fu and vectorised operations are supplied with sufficient data on each invocation to see the speed-ups.
