Tutorial: Integrating Parquet format data with KDB-X

Author

Laura Kerr

Community Management Lead

Key Takeaways

  1. Interoperability: Using q with the open Parquet file format means users can exchange data with ecosystems such as Spark, Pandas, and Arrow.
  2. Fast Analytics at Scale: Query large Parquet datasets efficiently with virtual tables.
  3. Seamless Integration: Use q queries directly on Parquet files, alongside in-memory or partitioned tables.

In this tutorial, you’ll learn how to integrate Parquet data with KDB-X, build virtual tables across multiple files, and run high-performance time-series analytics using q.

Parquet is a columnar storage format designed for efficient storage and retrieval. KDB-X supports reading and writing Parquet files through the `pq` module. Use cases for integrating Parquet data with KDB-X include:
  • Data interchange: Share Parquet datasets between KDB-X and tools like Spark, Pandas, and Hive
  • Efficient analytics: Run SQL or q queries directly against Parquet files with row group pruning
  • Archival storage: Keep large historical datasets compressed but queryable
  • Hybrid queries: Join or aggregate across in-memory tables and Parquet-backed virtual tables in one query (see the sketch below)
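As a minimal sketch of that last hybrid pattern, assuming the quote_all virtual table built later in this tutorial and a hypothetical in-memory calendar table:

q
/ hypothetical in-memory table; trading_days values are illustrative only
cal:([month:2004.01 2004.02m] trading_days:21 20);
/ aggregate the Parquet-backed virtual table and join it in memory
(0!select ticks:count i by month from quote_all) lj cal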

This tutorial outlines how to integrate KDB-X with tables in Parquet format on disk, and how to seamlessly perform common operations and calculations on both single tables and multiple tables in memory. You can also explore the entire notebook via GitHub.

Download the Data

We will begin by downloading and exploring this dataset from Kaggle: https://www.kaggle.com/datasets/zahinawosaf/forex-tick-data.

bash
#!/bin/bash

mkdir -p ~/parquet

curl -L -o ~/parquet/forex-tick-data.zip https://www.kaggle.com/api/v1/datasets/download/zahinawosaf/forex-tick-data

unzip ~/parquet/forex-tick-data.zip -d ~/parquet

Then, starting a q session with KDB-X, we can locate the data and count how many files we have:

q
system "find parquet/AUDUSD | head"  

system "find parquet/AUDUSD | wc -l" 

"253" 

Now we will load the Parquet module:

q
([pq]):use`kx.pq 

Single Table Example

Let’s load one of the tables to see what we’re working with:

q
quote: pq `$":parquet/AUDUSD/AUDUSD - 2004-04-01 - 2004-04-30.parquet" 

Straight away, we can run some basic functions and aggregations on this table using q:

q
select rows:count i from quote  
rows  
-------  
2213548  
meta quote  
c         | t f a  
----------| -----  
timestamp | p  
symbol    | C  
ask_price | f  
bid_price | f  
ask_volume| i  
bid_volume| i  

As you can see, we can treat this table just like a native q table, running qSQL queries such as select from quote and q keywords such as meta.

We can also query using common q functionality such as xbar, referencing the Parquet column names directly:

q
10#select spread:max ask_price-bid_price by 0D01 xbar timestamp from quote where ask_price>=bid_price, not null ask_volume  

timestamp                    | spread 
-----------------------------| ------- 
2004.04.01D00:00:00.000000000| 0.0006 
2004.04.01D01:00:00.000000000| 0.0005 
2004.04.01D02:00:00.000000000| 0.00056 
2004.04.01D03:00:00.000000000| 0.0006 
2004.04.01D04:00:00.000000000| 0.0005 
2004.04.01D05:00:00.000000000| 0.00056 
2004.04.01D06:00:00.000000000| 0.0005 
2004.04.01D07:00:00.000000000| 0.0005 
2004.04.01D08:00:00.000000000| 0.0004 
2004.04.01D09:00:00.000000000| 0.0004 
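The same approach extends to other aggregations. For example, a volume-weighted mid price per hourly bucket (a sketch using standard qSQL and the column names from meta above):

q
/ weight the mid price by the total quoted volume within each hour
10#select vwmid:(bid_volume+ask_volume) wavg (bid_price+ask_price)%2 by 0D01 xbar timestamp from quote where not null ask_volume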

Multiple Table Example

The virtual table API allows combining multiple Parquet files into a single virtual table using virtual partition columns:

q
tb:use`kx.pq.t 

Let’s have a look at the file structure and then extract the file names as a column:

q
path:`:parquet/AUDUSD; 
files:([]file:` sv'path,/:key path); 
count files 
252 
10#files 
file  
--------------------------------------------------------  
:parquet/AUDUSD/AUDUSD - 2004-01-01 - 2004-01-31.parquet  
:parquet/AUDUSD/AUDUSD - 2004-02-01 - 2004-02-29.parquet  
:parquet/AUDUSD/AUDUSD - 2004-03-01 - 2004-03-31.parquet  
:parquet/AUDUSD/AUDUSD - 2004-04-01 - 2004-04-30.parquet  
:parquet/AUDUSD/AUDUSD - 2004-05-01 - 2004-05-31.parquet  
:parquet/AUDUSD/AUDUSD - 2004-06-01 - 2004-06-30.parquet  
:parquet/AUDUSD/AUDUSD - 2004-07-01 - 2004-07-31.parquet  
:parquet/AUDUSD/AUDUSD - 2004-08-01 - 2004-08-31.parquet  
:parquet/AUDUSD/AUDUSD - 2004-09-01 - 2004-09-30.parquet  
:parquet/AUDUSD/AUDUSD - 2004-10-01 - 2004-10-31.parquet

Next, we create the columns that will act as our virtual partition columns, derived from the file name and the month:

q
virt:pq each files`file; 
part:update month:2004.01m+til count files from files; 
10#part 
file                                                     month  
----------------------------------------------------------------  
:parquet/AUDUSD/AUDUSD - 2004-01-01 - 2004-01-31.parquet 2004.01  
:parquet/AUDUSD/AUDUSD - 2004-02-01 - 2004-02-29.parquet 2004.02  
:parquet/AUDUSD/AUDUSD - 2004-03-01 - 2004-03-31.parquet 2004.03  
:parquet/AUDUSD/AUDUSD - 2004-04-01 - 2004-04-30.parquet 2004.04  
:parquet/AUDUSD/AUDUSD - 2004-05-01 - 2004-05-31.parquet 2004.05  
:parquet/AUDUSD/AUDUSD - 2004-06-01 - 2004-06-30.parquet 2004.06  
:parquet/AUDUSD/AUDUSD - 2004-07-01 - 2004-07-31.parquet 2004.07  
:parquet/AUDUSD/AUDUSD - 2004-08-01 - 2004-08-31.parquet 2004.08  
:parquet/AUDUSD/AUDUSD - 2004-09-01 - 2004-09-30.parquet 2004.09  
:parquet/AUDUSD/AUDUSD - 2004-10-01 - 2004-10-31.parquet 2004.10  
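Note that 2004.01m+til count files assumes the files list in month order with no gaps. A more robust alternative, assuming the file-naming convention shown above, is to parse the month directly from each file name:

q
/ drop the 25-character ":parquet/AUDUSD/AUDUSD - " prefix, take the
/ 10-character date, parse it, and cast to month
part:update month:`month$"D"$10#'25_'string file from files;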

Now that we have our virtual columns, we can create a single virtual table in memory that pulls Parquet data from every file:

q
quote_all:tb.mkP part!virt;  
meta quote_all 
c         | t f a  
----------| -----  
file      | s  
month     | m  
timestamp | p  
symbol    | C  
ask_price | f  
bid_price | f  
ask_volume| i  
bid_volume| i  

The table in memory now contains all the data from each file in the dataset: 

q
select rows:count i from quote_all  
rows  
---------  
353889010  
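Because each file is mapped to a month value, filtering on the virtual month column means only the matching files need to be read. For example, restricting the query to April 2004 should return 2213548 rows, matching the single-file count from earlier:

q
select rows:count i from quote_all where month=2004.04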

We can now start to query this data just like we would query a q table:

q
select rows:count i by file from quote_all  

file                                                    | rows  
--------------------------------------------------------| -------  
:parquet/AUDUSD/AUDUSD - 2004-01-01 - 2004-01-31.parquet| 2027152  
:parquet/AUDUSD/AUDUSD - 2004-02-01 - 2004-02-29.parquet| 1963657  
:parquet/AUDUSD/AUDUSD - 2004-03-01 - 2004-03-31.parquet| 2253977  
:parquet/AUDUSD/AUDUSD - 2004-04-01 - 2004-04-30.parquet| 2213548  
:parquet/AUDUSD/AUDUSD - 2004-05-01 - 2004-05-31.parquet| 1929827  
:parquet/AUDUSD/AUDUSD - 2004-06-01 - 2004-06-30.parquet| 616357  
:parquet/AUDUSD/AUDUSD - 2004-07-01 - 2004-07-31.parquet| 638555  
:parquet/AUDUSD/AUDUSD - 2004-08-01 - 2004-08-31.parquet| 649610  
:parquet/AUDUSD/AUDUSD - 2004-09-01 - 2004-09-30.parquet| 664278  
:parquet/AUDUSD/AUDUSD - 2004-10-01 - 2004-10-31.parquet| 634990  
:parquet/AUDUSD/AUDUSD - 2004-11-01 - 2004-11-30.parquet| 655733  
:parquet/AUDUSD/AUDUSD - 2004-12-01 - 2004-12-31.parquet| 694420  

.. 

We can run substantial calculations, aggregating over years of Parquet data with q. The calc function below computes, per time bucket, an order flow imbalance (ofi), the change in mid price, the average spread, and the average quote size:

q
calc:{[st;et;bkt]
  / per-tick mid price, quote size, spread and OFI increment e_n for all
  / months within [st;et]; the parentheses force the intended grouping
  / (A-B)+(D-C), since q otherwise evaluates right to left
  tmp:ungroup select timestamp,
        mid:(bid_price+ask_price)%2,
        bboSize:(bid_volume+ask_volume)%2,
        spread:ask_price-bid_price,
        e_n:(((bid_price>=prev bid_price)*bid_volume)
            -(bid_price<=prev bid_price)*prev bid_volume)
           +((ask_price>=prev ask_price)*prev ask_volume)
            -(ask_price<=prev ask_price)*ask_volume
      by date:`date$timestamp, `$symbol from quote_all where month within (st;et);
  / roll the tick-level measures up into buckets of width bkt
  select start_time:st, end_time:et,
         ofi:sum e_n,
         dP:last[mid]-first mid,
         n:count e_n,
         avg_spread:avg spread,
         avg_bbo_size:avg bboSize
    by bkt xbar timestamp, symbol from tmp
 };

We can use \ts to time the calculation and see the power of using KDB-X to perform fast analytics at scale:

q
\ts calc[2004.01m;2006.01m;0D01] 

7541 3489661664 

The calculation above took around 7.5 seconds and used less than 3.5 GB of memory.

Conclusion

Integrating Parquet data with KDB-X provides a powerful and flexible foundation for large-scale analytics. By combining Parquet’s efficient columnar storage with q’s expressive querying capabilities, users can seamlessly work across their on-disk and in-memory datasets, interchange data with external ecosystems, and perform fast calculations over massive time-series volumes. The new module management structure makes it straightforward to unlock the full value of Parquet-backed data within KDB-X.

To learn more about KDB-X modules, visit KDB-X Module Management.

If you enjoyed this blog and would like to explore other examples, you can visit our GitHub repository. You can also begin your journey with KDB-X by signing up for the KDB-X Community Edition Public Preview.
