A developers guide to JSON parsing in kdb+

With its ability to hold complex structures, JSON has long offered kdb+ developers significant advantages over traditional CSV. From data serialization, deserialization, and API integration to easier debugging, data exchange, and unstructured data querying. However, ingesting JSON can also introduce complexities, leading to performance and scalability issues if not done correctly.

In this blog, we will explore some practical examples to help developers navigate these issues and optimize data ingestion.

Understanding datatypes

It’s important to understand that kdb+ can extract the following data types from JSON

String: Sequence of characters, typically used to represent text
Float: Real numbers that require fractional precision
Boolean: Binary states such as yes/no, on/off, or 1/0

This means that in addition to parsing the data from JSON, developers may wish to cast to something more suitable.

To demonstrate, let’s convert a long in kdb+ to JSON and then parse back using .j.j and .j.k

6~.j.k .j.j 6

0b

As you can see, the roundtrip fails because the input is no longer equal to the output due to the numerics being converted to floats.

.j.k .j.j 6

6f

JSON table encoding

Let’s now create a table in kdb+ and explore the cast and tok operators.

The cast operator is used to convert data from one type to another. This is particularly useful when you need to change the data type of a value to match the expected type for a specific operation or function. For example, if you have a string representing a number and you need to perform arithmetic operations on it, you can cast it to an integer or float
The tok operator interprets a string as a different data type. This is useful when working with data that comes in as a string (e.g., from a file or API) and must be converted to a more helpful type for analysis or processing. For example, you might receive a date as a string and need to convert it to a date type to perform date calculations

//Create a sample table

tab:([] longCol:1 2;
        floatCol:4 5f;
        symbolCol:`b`h;
        stringCol:("bb";"dd");
        dateCol:2018.11.23 2018.11.23;
        timeCol:00:01:00.000 00:01:00.003)
tab

longCol floatCol symbolCol stringCol dateCol    timeCol     
------------------------------------------------------------ 
1       4        b         "bb"      2018.11.23 00:01:00.000 
2       5        h         "dd"      2018.11.23 00:01:00.003


meta tab

c        | t f a
---------| -----
longCol  | j   
floatCol | f    
symbolCol| s   
stringCol| C   
dateCol  | d   
timeCol  | t

Let’s now record the round trip.

q).j.k .j.j tab
longCol floatCol symbolCol stringCol dateCol      timeCol      
----------------------------------------------------------------
1       4        ,"b"      "bb"      "2018-11-23" "00:01:00.000"
2       5        ,"h"      "dd"      "2018-11-23" "00:01:00.003"

q)meta .j.k .j.j tab
c        | t f a
---------| -----
longCol  | f   
floatCol | f   
symbolCol| C   
stringCol| C   
dateCol  | C   
timeCol  | C

Now, let’s use lowercase casts on numerics and capital case tok on our string datatypes.

//* will leave a column untouched

flip "j*S*DT"$flip .j.k .j.j tab
tab~flip "j*S*DT"$flip .j.k .j.j tab
longCol floatCol symbolCol stringCol dateCol    timeCol    
------------------------------------------------------------
1       4        b         "bb"      2018.11.23 00:01:00.000
2       5        h         "dd"      2018.11.23 00:01:00.003


1b

Instead of using flip and specifying * to leave a column untouched, we can write a simple helper function.

To begin, we will pass it a dictionary containing the rules we need to perform.

helper:{[t;d] ![t;();0b;key[d]!{($;x;y)}'[value d;key d]]}

castRules:`longCol`symbolCol`dateCol`timeCol!"jSDT"

tab~helper[;castRules] .j.k .j.j tab
1b

Similarly, instead of forcing $, we can make a more general helper based on a monodic function per column.

generalHelper:{[t;d] ![t;();0b;key[d]!{(x;y)}'[value d;key d]]}

castRules:`longCol`symbolCol`dateCol`timeCol!({neg "j"$ x};{`$upper x};"D"$;"T"$)

generalHelper[;castRules] .j.k .j.j tab
longCol floatCol symbolCol stringCol dateCol    timeCol    
------------------------------------------------------------
-1      4        B         "bb"      2018.11.23 00:01:00.000
-2      5        H         "dd"      2018.11.23 00:01:00.003

Field-based JSON encoding

One common use of JSON is objects (key/value pairs) that parse in kdb+ as dictionaries. These are useful for storing sparse datasets in which it does not make sense to have each key as a new column.

q)\c 25 200
q)read0 `:sample.json
"{\"data\":\"26cd02c57f9db87b1df9f2e7bb20cc7b\",\"expiry\":1527796725,\"requestID\":[\"b4a566eb-2529-5cf4-1327-857e3d73653e\"]}\"
"{"result":"success","message":"success","receipt":[123154,4646646],"requestID":["b4a566eb-2529-5cf4-1327-857e3d73653e"]}"
"{"receipt":[12345678,98751466],"requestID":["b4a566eb-2529-5cf4-1327-857e3d73653e"]}"
"{"data":"26cd02c57f9db87b1df9f2e7bb20cc7b","requestID":["b4a566eb-2529-5cf4-1327-857e3d73653e"]}"
"{"receipt":[12345678,98751466],"requestID":["b4a566eb-2529-5cf4-1327-857e3d73653e"]}"
"{"listSize":2,"list":"lzplogjxokyetaeflilquziatzpjagsginnajfpbkomfancdmhmumxhazblddhcc"}"
"{"requestID":["b4a566eb-2529-5cf4-1327-857e3d73653e"]}"

A potential way to manage these items may be to create a utility to cast dictionaries using keys to control casting rules, thus allowing more complex parsing rules for each field.

//Converts JSON to q with rules per key
decode:{[j]k:.j.k j;(key k)!j2k[key k]@'value k}

//Converts q to JSON with rules per key
encode:{[k].j.j (key k)!k2j[key k]@'value k}

//Rules for JSON to q conversion
j2k:(enlist `)!enlist (::);

j2k[`expiry]:{0D00:00:01*`long$x};
j2k[`result]:`$;
j2k[`receipt]:`long$;
j2k[`id]:{"G"$first x};
j2k[`listSize]:`long$;
j2k[`data]:cut[32];
j2k[`blockCount]:`long$;
j2k[`blocks]:raze;

//Rules for q to JSON conversion
k2j:(enlist `)!enlist (::);

k2j[`expiry]:{`long$%[x;0D00:00:01]};
k2j[`result]:(::);
k2j[`receipt]:(::);
k2j[`id]:enlist;
k2j[`listSize]:(::);
k2j[`data]:raze;
k2j[`blocks]:(::);


//Using default .j.k our structures are not transferred as we wish
{show .j.k x} each read0 `:sample.json;
data     | "26cd02c57f9db87b1df9f2e7bb20cc7b"
expiry   | 1.527797e+009
requestID| ,"b4a566eb-2529-5cf4-1327-857e3d73653e"
result   | "success"
message  | "success"
receipt  | 123154 4646646f
requestID| ,"b4a566eb-2529-5cf4-1327-857e3d73653e"
receipt  | 1.234568e+007 9.875147e+007
requestID| ,"b4a566eb-2529-5cf4-1327-857e3d73653e"
data     | "26cd02c57f9db87b1df9f2e7bb20cc7b"
requestID| ,"b4a566eb-2529-5cf4-1327-857e3d73653e"
receipt  | 1.234568e+007 9.875147e+007
requestID| ,"b4a566eb-2529-5cf4-1327-857e3d73653e"
listSize| 2f
list    | "lzplogjxokyetaeflilquziatzpjagsginnajfpbkomfancdmhmumxhazblddhcc"
requestID| "b4a566eb-2529-5cf4-1327-857e3d73653e"


//Using decode utility captures complex structures
{show decode x} each read0 `:sample.json;
data     | ,"26cd02c57f9db87b1df9f2e7bb20cc7b"
expiry   | 17682D19:58:45.000000000
requestID| ,"b4a566eb-2529-5cf4-1327-857e3d73653e"
result   | `success
message  | "success"
receipt  | 123154 4646646
requestID| ,"b4a566eb-2529-5cf4-1327-857e3d73653e"
receipt  | 12345678 98751466
requestID| ,"b4a566eb-2529-5cf4-1327-857e3d73653e"
data     | "26cd02c57f9db87b1df9f2e7bb20cc7b"   
requestID| "b4a566eb-2529-5cf4-1327-857e3d73653e"
receipt  | 12345678 98751466
requestID| ,"b4a566eb-2529-5cf4-1327-857e3d73653e"
listSize| 2
list    | "lzplogjxokyetaeflilquziatzpjagsginnajfpbkomfancdmhmumxhazblddhcc"
requestID| "b4a566eb-2529-5cf4-1327-857e3d73653e"


//The encode utility allows us to round trip
{sample~{encode decode x} each sample:read0 x}`:sample.json

1b

Querying unstructured data

It has become much easier to manage unstructured data via Anymap. However, some considerations must be made.

sample:([] data:decode each read0 `:sample.json)
sample
data                                                                                                                         
-----------------------------------------------------------------------------------------------------------------------------
`data`expiry`requestID!(,"26cd02c57f9db87b1df9f2e7bb20cc7b";17682D19:58:45.000000000;,"b4a566eb-2529-5cf4-1327-857e3d73653e")
`result`message`receipt`requestID!(`success;"success";123154 4646646;,"b4a566eb-2529-5cf4-1327-857e3d73653e")               
`receipt`requestID!(12345678 98751466;,"b4a566eb-2529-5cf4-1327-857e3d73653e")                                              
`data`requestID!(,"26cd02c57f9db87b1df9f2e7bb20cc7b";,"b4a566eb-2529-5cf4-1327-857e3d73653e")                               
`receipt`requestID!(12345678 98751466;,"b4a566eb-2529-5cf4-1327-857e3d73653e")                                              
`listSize`list!(2;"lzplogjxokyetaeflilquziatzpjagsginnajfpbkomfancdmhmumxhazblddhcc")                                       
(,`requestID)!,,"b4a566eb-2529-5cf4-1327-857e3d73653e"

Indexing at depth allows the sparse data within the dictionaries to be queried easily.

select data[;`requestID] from sample
x                                     
---------------------------------------
,"b4a566eb-2529-5cf4-1327-857e3d73653e"
,"b4a566eb-2529-5cf4-1327-857e3d73653e"
,"b4a566eb-2529-5cf4-1327-857e3d73653e"
,"b4a566eb-2529-5cf4-1327-857e3d73653e"
,"b4a566eb-2529-5cf4-1327-857e3d73653e"
0N                                    
,"b4a566eb-2529-5cf4-1327-857e3d73653e"

Notice that when a key is missing from a dictionary, kdb+ returns a null value. The type of this null is determined by the first key type within the dictionary.

//Many different nulls are returned
select data[;`expiry] from sample
x                      
------------------------
17682D19:58:45.000000000
`                      
`long$()               
,""                    
`long$()               
0N                     
,""

//Succeeds on first 2 rows as by chance only null returned in a atom null
select from (2#sample) where null data[;`expiry]

//Fails once moving to 3 rows as there is an empty list null
select from (3#sample) where null data[;`expiry]

data                                                                                                         
-------------------------------------------------------------------------------------------------------------
`result`message`receipt`requestID!(`success;"success";123154 4646646;,"b4a566eb-2529-5cf4-1327-857e3d73653e")


evaluation error:

type

  [0]  select from (3#sample) where null data[;`expiry]
       ^

Checking if a given key is in the dictionary will only return rows that do not have the key.

select from sample where `expiry in/:key each data, not null data[;`expiry]
data                                                                                                                        
-----------------------------------------------------------------------------------------------------------------------------
`data`expiry`requestID!(,"26cd02c57f9db87b1df9f2e7bb20cc7b";17682D19:58:45.000000000;,"b4a566eb-2529-5cf4-1327-857e3d73653e")

However, if we prepend each dictionary with the null symbol key “and generic null value(::)), we can query more freely.

updatedata:(enlist[`]!enlist(::))(,)/:datafrom`sample;
sample

data
---------------------------------------------------------------------------------------------------------------------------------
``data`expiry`requestID!(::;,"26cd02c57f9db87b1df9f2e7bb20cc7b";17682D19:58:45.000000000;,"b4a566eb-2529-5cf4-1327-857e3d73653e")
``result`message`receipt`requestID!(::;`success;"success";1231544646646;,"b4a566eb-2529-5cf4-1327-857e3d73653e")
``receipt`requestID!(::;1234567898751466;,"b4a566eb-2529-5cf4-1327-857e3d73653e")
``data`requestID!(::;,"26cd02c57f9db87b1df9f2e7bb20cc7b";,"b4a566eb-2529-5cf4-1327-857e3d73653e")
``receipt`requestID!(::;1234567898751466;,"b4a566eb-2529-5cf4-1327-857e3d73653e")
``listSize`list!(::;2;"lzplogjxokyetaeflilquziatzpjagsginnajfpbkomfancdmhmumxhazblddhcc")
``requestID!(::;,"b4a566eb-2529-5cf4-1327-857e3d73653e")

As you can see, nulls with missing keys are now represented as (::), meaning the previously failed query can now be executed.

select expiry:data[;`expiry] from sample
expiry                 
------------------------
17682D19:58:45.000000000
::                      
::                     
::                     
::                     
::                     
::

select from sample where not null data[;`expiry]
data                                                                                                                            
---------------------------------------------------------------------------------------------------------------------------------
``data`expiry`requestID!(::;,"26cd02c57f9db87b1df9f2e7bb20cc7b";17682D19:58:45.000000000;,"b4a566eb-2529-5cf4-1327-857e3d73653e")

We can also easily replace (::) values, in this case with an infinity value.

fill:{@[y;where null y;:;x]}
select expiry:fill[0Wn]data[;`expiry] from sample
expiry                 
------------------------
17682D19:58:45.000000000
0W                     
0W                     
0W                     
0W                     
0W                     
0W

To learn more about the commands featured in this blog, visit our comprehensive documentation on code.kx. You can also begin your journey with kdb+ by downloading our personal edition, or via one of our many courses on the KX Academy.