Introduction
For many years, KX and its partners have worked to provide solutions that simplify database management in kdb+. This has seen the launch of several products and frameworks, including kdb Insights Enterprise, kdb Insights Database, and Data Intellect's TorQ.
With the introduction of PyKX nearly 2 years ago, we delivered a new wave of capability, increasing access to our powerful analytic tooling via a familiar Python interface and opening kdb+ data estates to a new generation of developers. In the time since, these capabilities have continued to evolve and harden.
Earlier this week, we released the next phase in this journey as part of PyKX 2.3: the beta release of a new Database Management feature, providing greater flexibility in the provisioning and management of kdb+ databases via Python.
With the initial release, developers will be able to:
- Create a database and tables from any historical data.
- Add new partitions and tables to a database.
- Add and rename tables and columns.
- Create copies of table columns.
- Apply functions to columns.
- Update the data types of columns.
- Add and fix missing partitions for tables.
- Interrogate the content/structure of your database.
Click here for a full breakdown of the API.
Walkthrough
Let’s explore some of these features further by walking through example management tasks.
Operations on persisted databases can make changes that are difficult to revert. For example, applying a function that modifies the values within a column may leave you unable to recover the original data. Before using this functionality for complex use cases, be clear about the intention of your changes and, ideally, keep a backup of your data so you can recover from any issues.
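As a concrete precaution, one simple approach is to copy the database directory before making destructive changes. The helper below is a minimal sketch using only the Python standard library; the function name and paths are illustrative, not part of the PyKX API.

```python
import shutil
from datetime import datetime

def backup_database(db_path: str, dest_root: str) -> str:
    """Copy a partitioned database directory tree to a timestamped backup.

    Illustrative helper only; not part of the PyKX API.
    """
    stamp = datetime.now().strftime('%Y%m%d%H%M%S')
    backup_path = f"{dest_root}/db_backup_{stamp}"
    # copytree recursively preserves the full partition directory structure
    shutil.copytree(db_path, backup_path)
    return backup_path

# e.g. backup_database('/tmp/my_database', '/tmp') before applying functions
```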
To begin, you will need to enable PyKX beta features:
import os
os.environ['PYKX_BETA_FEATURES'] = 'True'
import pykx as kx
Once initialised, you can validate which features are enabled by running the following command.
print(kx.beta_features)
['Database Management']
Next, create a new directory to host your database and initialise the DB class.
os.makedirs('/tmp/my_database')
db = kx.DB(path='/tmp/my_database')
From here, you can generate a new dataset which will be saved to the database.
from datetime import date
N = 100000
dataset = kx.Table(data={
    'date': kx.random.random(N, [date(2020, 1, 1), date(2020, 1, 2)]),
    'sym': kx.random.random(N, ['AAPL', 'GOOG', 'MSFT']),
    'price': kx.random.random(N, 10.0)
})
dataset
date | sym | price |
---|---|---|---|
0 | 2020.01.01 | MSFT | 4.802218 |
1 | 2020.01.01 | GOOG | 0.5058034 |
2 | 2020.01.01 | AAPL | 7.055037 |
3 | 2020.01.02 | AAPL | 2.924899 |
4 | 2020.01.01 | MSFT | 1.084145 |
5 | 2020.01.02 | AAPL | 1.1094 |
6 | 2020.01.01 | GOOG | 0.8396528 |
7 | 2020.01.02 | MSFT | 7.527356 |
8 | 2020.01.01 | MSFT | 7.716519 |
9 | 2020.01.02 | MSFT | 0.7148808 |
10 | 2020.01.01 | MSFT | 2.539819 |
11 | 2020.01.02 | GOOG | 9.568963 |
12 | 2020.01.02 | GOOG | 5.546168 |
13 | 2020.01.01 | GOOG | 9.614568 |
14 | 2020.01.01 | AAPL | 3.907669 |
15 | 2020.01.02 | GOOG | 6.349116 |
16 | 2020.01.01 | MSFT | 9.47742 |
17 | 2020.01.02 | MSFT | 3.082102 |
18 | 2020.01.01 | GOOG | 2.703125 |
19 | 2020.01.01 | MSFT | 2.438608 |
20 | 2020.01.02 | AAPL | 0.6931198 |
21 | 2020.01.02 | MSFT | 2.896335 |
22 | 2020.01.01 | GOOG | 2.624367 |
… | … | … | … |
99999 | 2020.01.02 | MSFT | 7.364259 |
100,000 rows × 3 columns
With the dataset created, save it to disk by partitioning (splitting) the data into sub-directories based on date.
db.create(dataset, table_name='db_table', partition='date')
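To see what this has actually written to disk, you can walk the directory with the Python standard library. This is a sketch, assuming the '/tmp/my_database' path used above: a date-partitioned database typically contains one directory per date, each holding a db_table subdirectory of column files, plus a sym enumeration file at the root. The helper function is illustrative, not part of the PyKX API.

```python
import os

def partition_layout(db_path: str) -> list:
    """Return sorted relative paths of everything under a database root.

    Illustrative helper only; not part of the PyKX API.
    """
    paths = []
    for root, dirs, files in os.walk(db_path):
        for name in dirs + files:
            paths.append(os.path.relpath(os.path.join(root, name), db_path))
    return sorted(paths)

# partition_layout('/tmp/my_database') would include entries such as
# '2020.01.01', '2020.01.01/db_table' and 'sym'
```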
Congratulations, you have now created your first kdb+ database. You can access this table as a property of the database.
db.db_table
date | sym | price | |
---|---|---|---|
0 | 2020.01.01 | MSFT | 4.802218 |
1 | 2020.01.01 | GOOG | 0.5058034 |
2 | 2020.01.01 | AAPL | 7.055037 |
3 | 2020.01.01 | MSFT | 1.084145 |
4 | 2020.01.01 | GOOG | 0.8396528 |
5 | 2020.01.01 | MSFT | 7.716519 |
6 | 2020.01.01 | MSFT | 2.539819 |
7 | 2020.01.01 | GOOG | 9.614568 |
8 | 2020.01.01 | AAPL | 3.907669 |
9 | 2020.01.01 | MSFT | 9.47742 |
10 | 2020.01.01 | GOOG | 2.703125 |
11 | 2020.01.01 | MSFT | 2.438608 |
12 | 2020.01.01 | GOOG | 2.624367 |
13 | 2020.01.01 | AAPL | 2.20948 |
14 | 2020.01.01 | AAPL | 7.839242 |
15 | 2020.01.01 | AAPL | 0.8549648 |
16 | 2020.01.01 | AAPL | 5.564782 |
17 | 2020.01.01 | AAPL | 3.42925 |
18 | 2020.01.01 | MSFT | 6.195387 |
19 | 2020.01.01 | AAPL | 8.137941 |
20 | 2020.01.01 | MSFT | 5.73365 |
21 | 2020.01.01 | AAPL | 7.278849 |
22 | 2020.01.01 | AAPL | 1.252762 |
… | … | … | … |
99999 | 2020.01.02 | MSFT | 7.364259 |
100,000 rows × 3 columns
Now, let’s explore how to add data into an existing database table by using the db.create function.
new_data = kx.Table(data={
    'sym': kx.random.random(3333, ['AAPL', 'GOOG', 'MSFT']),
    'price': kx.random.random(3333, 10.0)
})
db.create(new_data, 'db_table', date(2020, 1, 3))
Next, you can rename your table to ‘finance_data’ and the column ‘sym’ to a more descriptive name.
db.rename_table('db_table', 'finance_data')
db.rename_column('finance_data', 'sym', 'ticker')
2024.01.22 17:53:27 renaming :/tmp/my_database/2020.01.02/db_table to :/tmp/my_database/2020.01.02/finance_data
2024.01.22 17:53:27 renaming :/tmp/my_database/2020.01.03/db_table to :/tmp/my_database/2020.01.03/finance_data
2024.01.22 17:53:27 renaming sym to ticker in `:/tmp/my_database/2020.01.01/finance_data
2024.01.22 17:53:27 renaming sym to ticker in `:/tmp/my_database/2020.01.02/finance_data
2024.01.22 17:53:27 renaming sym to ticker in `:/tmp/my_database/2020.01.03/finance_data
For more significant editing tasks, you may want to perform the operation on a practice column to protect your original dataset. To demonstrate, create a copy of the ‘price’ column, then apply a function to double its values. Finally, convert the new column to a less memory-intensive data type.
db.copy_column('finance_data', 'price', 'price_copy')
db.apply_function('finance_data', 'price_copy', lambda x:2*x)
db.set_column_type('finance_data', 'price_copy', kx.RealAtom)
We can now look at the content of finance_data to see how these changes have modified the database.
db.finance_data
date | ticker | price | price_copy | |
---|---|---|---|---|
0 | 2020.01.01 | MSFT | 4.802218 | 9.604437e |
1 | 2020.01.01 | GOOG | 0.5058034 | 1.011607e |
2 | 2020.01.01 | AAPL | 7.055037 | 14.11007e |
3 | 2020.01.01 | MSFT | 1.084145 | 2.16829e |
4 | 2020.01.01 | GOOG | 0.8396528 | 1.679306e |
5 | 2020.01.01 | MSFT | 7.716519 | 15.43304e |
6 | 2020.01.01 | MSFT | 2.539819 | 5.079639e |
7 | 2020.01.01 | GOOG | 9.614568 | 19.22914e |
8 | 2020.01.01 | AAPL | 3.907669 | 7.815337e |
9 | 2020.01.01 | MSFT | 9.47742 | 18.95484e |
10 | 2020.01.01 | GOOG | 2.703125 | 5.40625e |
11 | 2020.01.01 | MSFT | 2.438608 | 4.877215e |
12 | 2020.01.01 | GOOG | 2.624367 | 5.248734e |
13 | 2020.01.01 | AAPL | 2.20948 | 4.41896e |
14 | 2020.01.01 | AAPL | 7.839242 | 15.67848e |
15 | 2020.01.01 | AAPL | 0.8549648 | 1.70993e |
16 | 2020.01.01 | AAPL | 5.564782 | 11.12956e |
17 | 2020.01.01 | AAPL | 3.42925 | 6.8585e |
18 | 2020.01.01 | MSFT | 6.195387 | 12.39077e |
19 | 2020.01.01 | AAPL | 8.137941 | 16.27588e |
20 | 2020.01.01 | MSFT | 5.73365 | 11.4673e |
21 | 2020.01.01 | AAPL | 7.278849 | 14.5577e |
22 | 2020.01.01 | AAPL | 1.252762 | 2.505525e |
… | … | … | … | … |
103332 | 2020.01.03 | MSFT | 4.152568 | 8.305137e |
103,333 rows × 4 columns
Let’s also check the number of data rows per partition within the database.
db.partition_count()
date | finance_data |
---|---|
2020.01.01 | 49859 |
2020.01.02 | 50141 |
2020.01.03 | 3333 |
Finally, using the create function from earlier alongside the fill_database function, you can add a new table to a single partition and then backfill the partitions it is missing from.
orders = kx.Table(data={
    'id': kx.random.random(100, kx.GUIDAtom.null),
    'type': kx.random.random(100, ['Buy', 'Sell'])
})
db.create(orders, 'orders', date(2020, 1, 3))
db.fill_database()
Successfully filled missing tables to partition: :/tmp/my_database/2020.01.02
Successfully filled missing tables to partition: :/tmp/my_database/2020.01.01
You can see that the orders table has now been onboarded and is accessible.
print(db.tables)
db.orders
['finance_data', 'orders']

date | id | type |
---|---|---|---|
0 | 2020.01.03 | 29f1e3e9-dba5-12da-acf8-51154deb3dc9 | Buy |
1 | 2020.01.03 | 5cda0be6-60a5-07fc-f929-a3d03c6106d1 | Buy |
2 | 2020.01.03 | 47d62791-e47a-e8ee-1afa-09cfe0657581 | Buy |
3 | 2020.01.03 | b15ac75d-4bd9-a6af-5882-fc9f2e0771c1 | Sell |
4 | 2020.01.03 | 2df69841-083e-77d4-1dd6-cf3919e8c19d | Sell |
5 | 2020.01.03 | 9e98c564-1f25-ca7d-9b3d-b270a2acf79f | Sell |
6 | 2020.01.03 | 8b4fe1cb-da0c-8589-f9b5-8faa57f73ab2 | Sell |
7 | 2020.01.03 | 2a47001a-0ffd-8a0e-3b42-45a5f11980a1 | Buy |
8 | 2020.01.03 | 8113ba9a-7065-6471-5b58-4c194c9a2360 | Sell |
9 | 2020.01.03 | 0418a02c-a1c3-db83-e82e-ff5d3b4ef847 | Buy |
10 | 2020.01.03 | 52a1be39-1ceb-87b7-0e4d-f156f32860a0 | Buy |
11 | 2020.01.03 | 1a536bf8-5f1f-5a6e-83de-0c6f8d632c8a | Buy |
12 | 2020.01.03 | 2de7b0aa-67a7-2150-aeb9-d2cd5ac855b8 | Buy |
13 | 2020.01.03 | 1f6bf7de-b72a-4acb-9c80-207f7be4cd96 | Buy |
14 | 2020.01.03 | c22bcb71-61f7-5431-54e0-a5bb315680e2 | Sell |
15 | 2020.01.03 | ed925950-bccf-4684-1a7b-09ce2ecd4edc | Sell |
16 | 2020.01.03 | 2fd437b0-0a61-d5b2-6749-c48cd7344561 | Sell |
17 | 2020.01.03 | 5961b504-53d0-1cf5-ace6-234bf7569f0b | Buy |
18 | 2020.01.03 | 8bf69a5e-5225-b873-5759-d52b70bfa40b | Buy |
19 | 2020.01.03 | 898bfe1e-6445-bf9c-932d-d2cbff828d11 | Buy |
20 | 2020.01.03 | 34b6cff5-3a4c-cd4b-2be0-bef15fe4b031 | Sell |
21 | 2020.01.03 | 41958ee9-09a9-45c7-ce71-0c0fcd12555c | Sell |
22 | 2020.01.03 | 520d5ab7-ca9a-ff07-c89e-051f0379908d | Buy |
… | … | … | … |
99 | 2020.01.03 | 1672522c-87d2-ad8f-85d3-f55662b189cb | Buy |
100 rows × 3 columns
Conclusions
Creating and managing databases at scale is a tough challenge, and mistakes in operations can have a meaningful impact on business performance and use-case efficiency. While the Database Management feature does not solve every problem in this area, it is intended to provide an initial grounding, allowing you to create and perform basic operations on kdb+ databases from Python and making this task easier.
As mentioned throughout the blog, this is presently a beta feature and subject to updates prior to a general-availability release later this year. If you run into difficulties with this feature, or have suggestions for additional functionality, please open an issue on our public GitHub or, alternatively, open a pull request.