
How to Build and Manage Databases Using PyKX

Conor McCarthy, Technical Lead for PyKX
24 January 2024 | 7 minutes

Introduction

For many years, KX and its partners have looked to provide solutions that simplify database management in kdb+. This has seen the launch of several products and frameworks, including kdb Insights Enterprise, kdb Insights Database, and Data Intellect's TorQ.

With the introduction of PyKX nearly 2 years ago, we delivered a new wave of capability, increasing access to our powerful analytic tooling via a familiar Python interface and opening kdb+ data estates to a new generation of developers. Since then, the library has continued to evolve and its capabilities to harden.

Earlier this week, we released the next phase in this journey as part of PyKX 2.3: the beta release of a new feature centred around database management, providing greater flexibility in the provisioning and management of kdb+ databases via Python.

With the initial release, developers will be able to:

  • Create a database and tables from any historical data.
  • Add new partitions and tables to a database.
  • Add and rename tables and columns.
  • Create copies of table columns.
  • Apply functions to columns.
  • Update data types of columns.
  • Add and fix missing partitions within a database.
  • Interrogate the content/structure of your database.

See the PyKX documentation for a full breakdown of the API.


Walkthrough

Let’s explore some of these features further by walking through example management tasks.

Warning:
Operations on persisted databases modify data in place, and the changes can be difficult to revert. For example, applying a function that changes the values within a column may leave you unable to recover the original data. Before using this functionality for complex use cases, be clear about the intent of your changes and, ideally, keep a backup of your data so you can recover from mistakes.
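One simple mitigation is to snapshot the database directory with Python's standard library before running destructive operations. The paths below match this walkthrough and are purely illustrative; adjust them to your own layout:

```python
import os
import shutil

src = '/tmp/my_database'          # database created in this walkthrough
dst = '/tmp/my_database_backup'   # hypothetical backup location

# Copy the entire database directory before making destructive changes;
# dirs_exist_ok (Python 3.8+) allows re-running over an existing backup.
if os.path.exists(src):
    shutil.copytree(src, dst, dirs_exist_ok=True)
```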

To begin, you will need to enable PyKX beta features:

import os
os.environ['PYKX_BETA_FEATURES'] = 'True'
import pykx as kx

Once initialised, you can validate which features are enabled by running the following commands.

print(kx.config.beta_features)
print(kx.beta_features)
True
['Database Management']

Next, create a new directory to host your database and initialise the DB class.

os.makedirs('/tmp/my_database')
db = kx.DB(path='/tmp/my_database')

From here, you can generate a new dataset which will be saved to the database.

from datetime import date 
N = 100000 
dataset = kx.Table(data={ 
 'date': kx.random.random(N, [date(2020, 1, 1), date(2020, 1, 2)]), 
 'sym': kx.random.random(N, ['AAPL', 'GOOG', 'MSFT']), 
 'price': kx.random.random(N, 10.0)  
 }) 
dataset
date sym price
0 2020.01.01 MSFT 4.802218
1 2020.01.01 GOOG 0.5058034
2 2020.01.01 AAPL 7.055037
3 2020.01.02 AAPL 2.924899
4 2020.01.01 MSFT 1.084145
5 2020.01.02 AAPL 1.1094
6 2020.01.01 GOOG 0.8396528
7 2020.01.02 MSFT 7.527356
8 2020.01.01 MSFT 7.716519
9 2020.01.02 MSFT 0.7148808
10 2020.01.01 MSFT 2.539819
11 2020.01.02 GOOG 9.568963
12 2020.01.02 GOOG 5.546168
13 2020.01.01 GOOG 9.614568
14 2020.01.01 AAPL 3.907669
15 2020.01.02 GOOG 6.349116
16 2020.01.01 MSFT 9.47742
17 2020.01.02 MSFT 3.082102
18 2020.01.01 GOOG 2.703125
19 2020.01.01 MSFT 2.438608
20 2020.01.02 AAPL 0.6931198
21 2020.01.02 MSFT 2.896335
22 2020.01.01 GOOG 2.624367
99999 2020.01.02 MSFT 7.364259
100,000 rows × 3 columns

With the dataset created, save it to disk by partitioning (splitting) the data into sub-directories based on date.

db.create(dataset, table_name='db_table', partition='date')
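Under the hood, a date-partitioned database stores each date's rows in its own sub-directory (here /tmp/my_database/2020.01.01/db_table and /tmp/my_database/2020.01.02/db_table). As a purely illustrative plain-Python sketch of that grouping step, not what PyKX does internally:

```python
from collections import defaultdict

# Toy rows standing in for the dataset above (values are illustrative)
rows = [
    {'date': '2020.01.01', 'sym': 'MSFT', 'price': 4.802218},
    {'date': '2020.01.01', 'sym': 'GOOG', 'price': 0.5058034},
    {'date': '2020.01.02', 'sym': 'AAPL', 'price': 2.924899},
]

# Group rows by partition key; each group maps to one sub-directory on disk
partitions = defaultdict(list)
for row in rows:
    partitions[row['date']].append(row)

for date_key, part in sorted(partitions.items()):
    print(f"/tmp/my_database/{date_key}/db_table -> {len(part)} rows")
```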

Congratulations, you have now created your first kdb+ database. You can access this table as a property of the database.

db.db_table
date sym price
0 2020.01.01 MSFT 4.802218
1 2020.01.01 GOOG 0.5058034
2 2020.01.01 AAPL 7.055037
3 2020.01.01 MSFT 1.084145
4 2020.01.01 GOOG 0.8396528
5 2020.01.01 MSFT 7.716519
6 2020.01.01 MSFT 2.539819
7 2020.01.01 GOOG 9.614568
8 2020.01.01 AAPL 3.907669
9 2020.01.01 MSFT 9.47742
10 2020.01.01 GOOG 2.703125
11 2020.01.01 MSFT 2.438608
12 2020.01.01 GOOG 2.624367
13 2020.01.01 AAPL 2.20948
14 2020.01.01 AAPL 7.839242
15 2020.01.01 AAPL 0.8549648
16 2020.01.01 AAPL 5.564782
17 2020.01.01 AAPL 3.42925
18 2020.01.01 MSFT 6.195387
19 2020.01.01 AAPL 8.137941
20 2020.01.01 MSFT 5.73365
21 2020.01.01 AAPL 7.278849
22 2020.01.01 AAPL 1.252762
99999 2020.01.02 MSFT 7.364259
100,000 rows × 3 columns

Now, let’s explore how to add data into an existing database table by using the db.create function.

new_data = kx.Table(data={ 
 'sym': kx.random.random(3333, ['AAPL', 'GOOG', 'MSFT']), 
 'price': kx.random.random(3333, 10.0)  
 }) 

db.create(new_data, 'db_table', date(2020, 1, 3))
Writing Database Partition 2020-01-03 to table db_table

Next, you can rename your table to ‘finance_data’ and the column ‘sym’ to a more descriptive name.

db.rename_table('db_table', 'finance_data')
db.rename_column('finance_data', 'sym', 'ticker')
2024.01.22 17:53:27 renaming :/tmp/my_database/2020.01.01/db_table to :/tmp/my_database/2020.01.01/finance_data
2024.01.22 17:53:27 renaming :/tmp/my_database/2020.01.02/db_table to :/tmp/my_database/2020.01.02/finance_data
2024.01.22 17:53:27 renaming :/tmp/my_database/2020.01.03/db_table to :/tmp/my_database/2020.01.03/finance_data
2024.01.22 17:53:27 renaming sym to ticker in `:/tmp/my_database/2020.01.01/finance_data
2024.01.22 17:53:27 renaming sym to ticker in `:/tmp/my_database/2020.01.02/finance_data
2024.01.22 17:53:27 renaming sym to ticker in `:/tmp/my_database/2020.01.03/finance_data

With more significant editing tasks, you may want to perform the operation on a practice column and protect your original dataset. To demonstrate, create a copy of the ‘price’ column, then apply a function to double its values. Finally, modify the data type of your new column to one that is less memory intensive.

db.copy_column('finance_data', 'price', 'price_copy') 
db.apply_function('finance_data', 'price_copy', lambda x:2*x) 
db.set_column_type('finance_data', 'price_copy', kx.RealAtom)
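The set_column_type call converts price_copy from a kdb+ float (an 8-byte double) to a real (a 4-byte single-precision value), halving its footprint at the cost of precision. A plain-Python illustration of that trade-off using the standard struct module:

```python
import struct

value = 7.364259  # an example price from the table above

double_bytes = struct.pack('d', value)  # kdb+ float: 8-byte double
single_bytes = struct.pack('f', value)  # kdb+ real: 4-byte single
print(len(double_bytes), len(single_bytes))  # 8 4

# Round-tripping through 4 bytes loses some precision
roundtrip = struct.unpack('f', single_bytes)[0]
print(abs(roundtrip - value))  # small but non-zero error
```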

We can now look at the content of finance_data to see how these changes have modified the database.

db.finance_data
date ticker price price_copy
0 2020.01.01 MSFT 4.802218 9.604437e
1 2020.01.01 GOOG 0.5058034 1.011607e
2 2020.01.01 AAPL 7.055037 14.11007e
3 2020.01.01 MSFT 1.084145 2.16829e
4 2020.01.01 GOOG 0.8396528 1.679306e
5 2020.01.01 MSFT 7.716519 15.43304e
6 2020.01.01 MSFT 2.539819 5.079639e
7 2020.01.01 GOOG 9.614568 19.22914e
8 2020.01.01 AAPL 3.907669 7.815337e
9 2020.01.01 MSFT 9.47742 18.95484e
10 2020.01.01 GOOG 2.703125 5.40625e
11 2020.01.01 MSFT 2.438608 4.877215e
12 2020.01.01 GOOG 2.624367 5.248734e
13 2020.01.01 AAPL 2.20948 4.41896e
14 2020.01.01 AAPL 7.839242 15.67848e
15 2020.01.01 AAPL 0.8549648 1.70993e
16 2020.01.01 AAPL 5.564782 11.12956e
17 2020.01.01 AAPL 3.42925 6.8585e
18 2020.01.01 MSFT 6.195387 12.39077e
19 2020.01.01 AAPL 8.137941 16.27588e
20 2020.01.01 MSFT 5.73365 11.4673e
21 2020.01.01 AAPL 7.278849 14.5577e
22 2020.01.01 AAPL 1.252762 2.505525e
103332 2020.01.03 MSFT 4.152568 8.305137e
103,333 rows × 4 columns

Let’s also check the number of rows per partition within the database.

db.partition_count()
date finance_data
2020.01.01 49859
2020.01.02 50141
2020.01.03 3333
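As a quick cross-check, the per-partition counts should sum to the total row count shown for finance_data earlier (values copied from the outputs above):

```python
# Per-partition row counts from db.partition_count() above
counts = {'2020.01.01': 49859, '2020.01.02': 50141, '2020.01.03': 3333}

total = sum(counts.values())
print(total)  # 103333, matching the 103,333 rows reported for finance_data
```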

Finally, using the create function that you used earlier together with the fill_database function, you can onboard a new table and fill in the partitions where it is missing.

orders = kx.Table(data = { 
    'id': kx.random.random(100, kx.GUIDAtom.null), 
    'type': kx.random.random(100, ['Buy', 'Sell'])}) 

db.create(orders, 'orders', date(2020, 1, 3)) 
db.fill_database()
Writing Database Partition 2020-01-03 to table orders
Successfully filled missing tables to partition: :/tmp/my_database/2020.01.02
Successfully filled missing tables to partition: :/tmp/my_database/2020.01.01

You can see that the orders table has now been onboarded and is accessible.

print(db.tables)
['finance_data', 'orders']

db.orders
date id type
0 2020.01.03 29f1e3e9-dba5-12da-acf8-51154deb3dc9 Buy
1 2020.01.03 5cda0be6-60a5-07fc-f929-a3d03c6106d1 Buy
2 2020.01.03 47d62791-e47a-e8ee-1afa-09cfe0657581 Buy
3 2020.01.03 b15ac75d-4bd9-a6af-5882-fc9f2e0771c1 Sell
4 2020.01.03 2df69841-083e-77d4-1dd6-cf3919e8c19d Sell
5 2020.01.03 9e98c564-1f25-ca7d-9b3d-b270a2acf79f Sell
6 2020.01.03 8b4fe1cb-da0c-8589-f9b5-8faa57f73ab2 Sell
7 2020.01.03 2a47001a-0ffd-8a0e-3b42-45a5f11980a1 Buy
8 2020.01.03 8113ba9a-7065-6471-5b58-4c194c9a2360 Sell
9 2020.01.03 0418a02c-a1c3-db83-e82e-ff5d3b4ef847 Buy
10 2020.01.03 52a1be39-1ceb-87b7-0e4d-f156f32860a0 Buy
11 2020.01.03 1a536bf8-5f1f-5a6e-83de-0c6f8d632c8a Buy
12 2020.01.03 2de7b0aa-67a7-2150-aeb9-d2cd5ac855b8 Buy
13 2020.01.03 1f6bf7de-b72a-4acb-9c80-207f7be4cd96 Buy
14 2020.01.03 c22bcb71-61f7-5431-54e0-a5bb315680e2 Sell
15 2020.01.03 ed925950-bccf-4684-1a7b-09ce2ecd4edc Sell
16 2020.01.03 2fd437b0-0a61-d5b2-6749-c48cd7344561 Sell
17 2020.01.03 5961b504-53d0-1cf5-ace6-234bf7569f0b Buy
18 2020.01.03 8bf69a5e-5225-b873-5759-d52b70bfa40b Buy
19 2020.01.03 898bfe1e-6445-bf9c-932d-d2cbff828d11 Buy
20 2020.01.03 34b6cff5-3a4c-cd4b-2be0-bef15fe4b031 Sell
21 2020.01.03 41958ee9-09a9-45c7-ce71-0c0fcd12555c Sell
22 2020.01.03 520d5ab7-ca9a-ff07-c89e-051f0379908d Buy
99 2020.01.03 1672522c-87d2-ad8f-85d3-f55662b189cb Buy
100 rows × 3 columns

Conclusions

Creating and managing databases at scale is a tough challenge, and one where operational mistakes can have a meaningful impact on business performance and use-case efficiency. While the Database Management feature does not solve every problem in this area, it is intended to provide an initial grounding, allowing you to create and perform basic operations on kdb+ databases from Python and making this task easier.

As mentioned throughout the blog, this is presently a beta feature and subject to updates prior to a general availability release later this year. If you run into difficulties with this feature, or have suggestions for additional functionality, please open an issue on our public GitHub or alternatively open a pull request.
