
How to Build and Manage Databases Using PyKX

Conor McCarthy, Technical Lead for PyKX
24 January 2024 | 7 minutes

Introduction

For many years, KX and its partners have looked to provide solutions that simplify database management in kdb+. This has seen the launch of several products and frameworks, including kdb Insights Enterprise, kdb Insights Database, and Data Intellect's TorQ.

With the introduction of PyKX nearly 2 years ago, we delivered a new wave of capability, increasing access to our powerful analytic tooling via a familiar Python interface and opening kdb+ data estates to a new generation of developers. Since then, the library has continued to evolve and its capabilities to harden.

Earlier this week, we released the next phase in this journey as part of PyKX 2.3: the beta release of a new feature centred around database management, providing greater flexibility in the provisioning and management of kdb+ databases via Python.

With the initial release, developers will be able to:

  • Create a database and tables from any historical data.
  • Add new partitions and tables to a database.
  • Add and rename tables and columns.
  • Create copies of table columns.
  • Apply functions to columns.
  • Update data types of columns.
  • Add and fix missing partitions within a database.
  • Interrogate the content/structure of your database.

See the PyKX documentation for a full breakdown of the API.


Walkthrough

Let’s explore some of these features further by walking through example management tasks.

Warning:
Operations on persisted databases modify data in place, and the changes can be difficult to revert. For example, applying a function that changes the values within a column may leave you unable to recover the original data. Before using this functionality for complex use cases, be clear about the intent of your changes and, ideally, keep a backup of your data so you can recover from mistakes.
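One simple mitigation is to snapshot the database directory with Python's standard library before running destructive operations. The paths below match this walkthrough and are purely illustrative; adjust them to your own layout:

```python
import os
import shutil

src = '/tmp/my_database'          # database created in this walkthrough
dst = '/tmp/my_database_backup'   # hypothetical backup location

# Copy the entire database directory before making destructive changes;
# dirs_exist_ok (Python 3.8+) allows re-running over an existing backup.
if os.path.exists(src):
    shutil.copytree(src, dst, dirs_exist_ok=True)
```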

To begin, you will need to enable PyKX beta features:

import os
os.environ['PYKX_BETA_FEATURES'] = 'True'
import pykx as kx

Once initialised, you can validate which features are enabled by running the following commands.

print(kx.config.beta_features)
print(kx.beta_features)
True
['Database Management']

Next, create a new directory to host your database and initialise the DB class.

os.makedirs('/tmp/my_database')
db = kx.DB(path='/tmp/my_database')

From here, you can generate a new dataset which will be saved to the database.

from datetime import date 
N = 100000 
dataset = kx.Table(data={ 
 'date': kx.random.random(N, [date(2020, 1, 1), date(2020, 1, 2)]), 
 'sym': kx.random.random(N, ['AAPL', 'GOOG', 'MSFT']), 
 'price': kx.random.random(N, 10.0)  
 }) 
dataset
date sym price
0 2020.01.01 MSFT 4.802218
1 2020.01.01 GOOG 0.5058034
2 2020.01.01 AAPL 7.055037
3 2020.01.02 AAPL 2.924899
4 2020.01.01 MSFT 1.084145
5 2020.01.02 AAPL 1.1094
6 2020.01.01 GOOG 0.8396528
7 2020.01.02 MSFT 7.527356
8 2020.01.01 MSFT 7.716519
9 2020.01.02 MSFT 0.7148808
10 2020.01.01 MSFT 2.539819
11 2020.01.02 GOOG 9.568963
12 2020.01.02 GOOG 5.546168
13 2020.01.01 GOOG 9.614568
14 2020.01.01 AAPL 3.907669
15 2020.01.02 GOOG 6.349116
16 2020.01.01 MSFT 9.47742
17 2020.01.02 MSFT 3.082102
18 2020.01.01 GOOG 2.703125
19 2020.01.01 MSFT 2.438608
20 2020.01.02 AAPL 0.6931198
21 2020.01.02 MSFT 2.896335
22 2020.01.01 GOOG 2.624367
99999 2020.01.02 MSFT 7.364259
100,000 rows × 3 columns

With the dataset created, save it to disk by partitioning (splitting) the data into sub-directories based on date.

db.create(dataset, table_name='db_table', partition='date')
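Under the hood, a date-partitioned database stores each date's rows in its own sub-directory (here /tmp/my_database/2020.01.01/db_table and /tmp/my_database/2020.01.02/db_table). As a purely illustrative plain-Python sketch of that grouping step, not what PyKX does internally:

```python
from collections import defaultdict

# Toy rows standing in for the dataset above (values are illustrative)
rows = [
    {'date': '2020.01.01', 'sym': 'MSFT', 'price': 4.802218},
    {'date': '2020.01.01', 'sym': 'GOOG', 'price': 0.5058034},
    {'date': '2020.01.02', 'sym': 'AAPL', 'price': 2.924899},
]

# Group rows by partition key; each group maps to one sub-directory on disk
partitions = defaultdict(list)
for row in rows:
    partitions[row['date']].append(row)

for date_key, part in sorted(partitions.items()):
    print(f"/tmp/my_database/{date_key}/db_table -> {len(part)} rows")
```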

Congratulations, you have now created your first kdb+ database. You can access this table as a property of the database.

db.db_table
date sym price
0 2020.01.01 MSFT 4.802218
1 2020.01.01 GOOG 0.5058034
2 2020.01.01 AAPL 7.055037
3 2020.01.01 MSFT 1.084145
4 2020.01.01 GOOG 0.8396528
5 2020.01.01 MSFT 7.716519
6 2020.01.01 MSFT 2.539819
7 2020.01.01 GOOG 9.614568
8 2020.01.01 AAPL 3.907669
9 2020.01.01 MSFT 9.47742
10 2020.01.01 GOOG 2.703125
11 2020.01.01 MSFT 2.438608
12 2020.01.01 GOOG 2.624367
13 2020.01.01 AAPL 2.20948
14 2020.01.01 AAPL 7.839242
15 2020.01.01 AAPL 0.8549648
16 2020.01.01 AAPL 5.564782
17 2020.01.01 AAPL 3.42925
18 2020.01.01 MSFT 6.195387
19 2020.01.01 AAPL 8.137941
20 2020.01.01 MSFT 5.73365
21 2020.01.01 AAPL 7.278849
22 2020.01.01 AAPL 1.252762
99999 2020.01.02 MSFT 7.364259
100,000 rows × 3 columns

Now, let’s explore how to add data into an existing database table by using the db.create function.

new_data = kx.Table(data={ 
 'sym': kx.random.random(3333, ['AAPL', 'GOOG', 'MSFT']), 
 'price': kx.random.random(3333, 10.0)  
 }) 

db.create(new_data, 'db_table', date(2020, 1, 3))
Writing Database Partition 2020-01-03 to table db_table

Next, you can rename your table to ‘finance_data’ and the column ‘sym’ to a more descriptive name.

db.rename_table('db_table', 'finance_data')
db.rename_column('finance_data', 'sym', 'ticker')
2024.01.22 17:53:27 renaming :/tmp/my_database/2020.01.01/db_table to :/tmp/my_database/2020.01.01/finance_data
2024.01.22 17:53:27 renaming :/tmp/my_database/2020.01.02/db_table to :/tmp/my_database/2020.01.02/finance_data
2024.01.22 17:53:27 renaming :/tmp/my_database/2020.01.03/db_table to :/tmp/my_database/2020.01.03/finance_data
2024.01.22 17:53:27 renaming sym to ticker in `:/tmp/my_database/2020.01.01/finance_data
2024.01.22 17:53:27 renaming sym to ticker in `:/tmp/my_database/2020.01.02/finance_data
2024.01.22 17:53:27 renaming sym to ticker in `:/tmp/my_database/2020.01.03/finance_data

With more significant editing tasks, you may want to perform the operation on a practice column and protect your original dataset. To demonstrate, create a copy of the ‘price’ column, then apply a function to double its values. Finally, modify the data type of your new column to one that is less memory intensive.

db.copy_column('finance_data', 'price', 'price_copy') 
db.apply_function('finance_data', 'price_copy', lambda x:2*x) 
db.set_column_type('finance_data', 'price_copy', kx.RealAtom)
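The set_column_type call converts price_copy from a kdb+ float (an 8-byte double) to a real (a 4-byte single-precision value), halving its footprint at the cost of precision. A plain-Python illustration of that trade-off using the standard struct module:

```python
import struct

value = 7.364259  # an example price from the table above

double_bytes = struct.pack('d', value)  # kdb+ float: 8-byte double
single_bytes = struct.pack('f', value)  # kdb+ real: 4-byte single
print(len(double_bytes), len(single_bytes))  # 8 4

# Round-tripping through 4 bytes loses some precision
roundtrip = struct.unpack('f', single_bytes)[0]
print(abs(roundtrip - value))  # small but non-zero error
```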

We can now look at the content of finance_data to see how these changes have modified the database.

db.finance_data
date ticker price price_copy
0 2020.01.01 MSFT 4.802218 9.604437e
1 2020.01.01 GOOG 0.5058034 1.011607e
2 2020.01.01 AAPL 7.055037 14.11007e
3 2020.01.01 MSFT 1.084145 2.16829e
4 2020.01.01 GOOG 0.8396528 1.679306e
5 2020.01.01 MSFT 7.716519 15.43304e
6 2020.01.01 MSFT 2.539819 5.079639e
7 2020.01.01 GOOG 9.614568 19.22914e
8 2020.01.01 AAPL 3.907669 7.815337e
9 2020.01.01 MSFT 9.47742 18.95484e
10 2020.01.01 GOOG 2.703125 5.40625e
11 2020.01.01 MSFT 2.438608 4.877215e
12 2020.01.01 GOOG 2.624367 5.248734e
13 2020.01.01 AAPL 2.20948 4.41896e
14 2020.01.01 AAPL 7.839242 15.67848e
15 2020.01.01 AAPL 0.8549648 1.70993e
16 2020.01.01 AAPL 5.564782 11.12956e
17 2020.01.01 AAPL 3.42925 6.8585e
18 2020.01.01 MSFT 6.195387 12.39077e
19 2020.01.01 AAPL 8.137941 16.27588e
20 2020.01.01 MSFT 5.73365 11.4673e
21 2020.01.01 AAPL 7.278849 14.5577e
22 2020.01.01 AAPL 1.252762 2.505525e
103332 2020.01.03 MSFT 4.152568 8.305137e
103,333 rows × 4 columns

Let’s also check the number of rows per partition within the database.

db.partition_count()
date finance_data
2020.01.01 49859
2020.01.02 50141
2020.01.03 3333
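As a quick cross-check, the per-partition counts should sum to the total row count shown for finance_data earlier (values copied from the outputs above):

```python
# Per-partition row counts from db.partition_count() above
counts = {'2020.01.01': 49859, '2020.01.02': 50141, '2020.01.03': 3333}

total = sum(counts.values())
print(total)  # 103333, matching the 103,333 rows reported for finance_data
```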

Finally, using the create function that you used earlier together with the fill_database function, you can onboard a new table and fill in the partitions where it is missing.

orders = kx.Table(data = { 
    'id': kx.random.random(100, kx.GUIDAtom.null), 
    'type': kx.random.random(100, ['Buy', 'Sell'])}) 

db.create(orders, 'orders', date(2020, 1, 3)) 
db.fill_database()
Writing Database Partition 2020-01-03 to table orders
Successfully filled missing tables to partition: :/tmp/my_database/2020.01.02
Successfully filled missing tables to partition: :/tmp/my_database/2020.01.01

You can see that the orders table has now been onboarded and is accessible.

print(db.tables)
['finance_data', 'orders']

db.orders
date id type
0 2020.01.03 29f1e3e9-dba5-12da-acf8-51154deb3dc9 Buy
1 2020.01.03 5cda0be6-60a5-07fc-f929-a3d03c6106d1 Buy
2 2020.01.03 47d62791-e47a-e8ee-1afa-09cfe0657581 Buy
3 2020.01.03 b15ac75d-4bd9-a6af-5882-fc9f2e0771c1 Sell
4 2020.01.03 2df69841-083e-77d4-1dd6-cf3919e8c19d Sell
5 2020.01.03 9e98c564-1f25-ca7d-9b3d-b270a2acf79f Sell
6 2020.01.03 8b4fe1cb-da0c-8589-f9b5-8faa57f73ab2 Sell
7 2020.01.03 2a47001a-0ffd-8a0e-3b42-45a5f11980a1 Buy
8 2020.01.03 8113ba9a-7065-6471-5b58-4c194c9a2360 Sell
9 2020.01.03 0418a02c-a1c3-db83-e82e-ff5d3b4ef847 Buy
10 2020.01.03 52a1be39-1ceb-87b7-0e4d-f156f32860a0 Buy
11 2020.01.03 1a536bf8-5f1f-5a6e-83de-0c6f8d632c8a Buy
12 2020.01.03 2de7b0aa-67a7-2150-aeb9-d2cd5ac855b8 Buy
13 2020.01.03 1f6bf7de-b72a-4acb-9c80-207f7be4cd96 Buy
14 2020.01.03 c22bcb71-61f7-5431-54e0-a5bb315680e2 Sell
15 2020.01.03 ed925950-bccf-4684-1a7b-09ce2ecd4edc Sell
16 2020.01.03 2fd437b0-0a61-d5b2-6749-c48cd7344561 Sell
17 2020.01.03 5961b504-53d0-1cf5-ace6-234bf7569f0b Buy
18 2020.01.03 8bf69a5e-5225-b873-5759-d52b70bfa40b Buy
19 2020.01.03 898bfe1e-6445-bf9c-932d-d2cbff828d11 Buy
20 2020.01.03 34b6cff5-3a4c-cd4b-2be0-bef15fe4b031 Sell
21 2020.01.03 41958ee9-09a9-45c7-ce71-0c0fcd12555c Sell
22 2020.01.03 520d5ab7-ca9a-ff07-c89e-051f0379908d Buy
99 2020.01.03 1672522c-87d2-ad8f-85d3-f55662b189cb Buy
100 rows × 3 columns

Conclusions

Creating and managing databases at scale is a tough challenge, and one where operational mistakes can have a meaningful impact on business performance and use-case efficiency. While the Database Management feature does not solve every problem in this area, it is intended to provide an initial grounding, allowing you to create and perform basic operations on kdb+ databases from Python and making this task easier.

As mentioned throughout the blog, this is presently a beta feature and subject to updates prior to a general availability release later this year. If you run into difficulties with this feature, or have suggestions for additional functionality, please open an issue on our public GitHub or alternatively open a pull request.
