Kdb+/q Insights: Database Maintenance with q

15 Oct 2019 | ,
Share on:

by David Crossey

Introduction

Databases lie at the heart of kdb+. The ability to quickly process large amounts of on-disk data separates kdb+ from its competitors. Database design evolves over time and any q developer will inevitably have to make schema changes to historical databases. The dbmaint utility makes this easy.

In this blog we will expand upon the concepts discussed in a previous blog, Kdb+/q Insights: Scripting with q, and explore some concepts on how we might set up a framework for running maintenance against tables stored in a partitioned database. [1]

Use Case

Our objective is to build a non-interactive script to run some maintenance tasks against a database, adding new columns to a table, renaming columns and changing column types.

The following section assumes presence of an HDB with some trade data based on  ‘setuphdb.q’ located on GitHub. This script can be used to mock up a database with some trade data for usage in this scenario; based on a Q For Mortals example by Jeffry Borror.

q setuphdb.q -db hdb

With our trade data on disk, we can begin running some maintenance tasks, for example we may want to add a column showing the ticker ‘industry’, rename ‘px’ to ‘price’ for readability and demonstrate the conversion of a column from one type to another.

2#select from trades where date=2015.01.10
date      tm             sym    qty     price      val
--------------------------------------------------------------------------------------------------------
2015.01.01   00:00:00.395   ibm    1050    213        223650   
2015.01.01   00:00:01.059   goog   1430    643.68     920462.4

Preparing wrapper script

We will start by creating a simple shell wrapper script that will ensure our maintenance process is limited to running on a single core; this would be useful in a production environment to prevent our task from impacting other processes on our server. 

This script will run a non-interactive q shell for us, passing in command line args/options (using ‘$*’), including our database and q scripts.

#!/bin/sh
taskset -c 0 q ./dbmaint.q $* -c 2000 2000 <<< '\l hdbmaint.q'
exit 0

Note, we could run tasket directly with q on the command line, however storing the command in a script is useful for reusability (such as scheduling as you’ll see later on). 

See Github for full script.

Preparing maintenance script

To begin we will define some logging functions as seen in the previous blog to help output standardized messages during our script execution:

/// Logging utilities
\d .log
print:{(-1)(" " sv string (.z.D;.z.T)),x;};
out:{[x]print[": INFO : ",x]};
err:{[x]print[": ERROR : ",x]};
errexit:{err x;err"Exiting";exit 1};
sucexit:{out "Maintenance complete"; out "Success. Exiting";exit 0};
usage:{[x] errexit "Missing param(s) Usage: hdbmaint.q "," " sv "-",'string distinct `db`action,x };
\d .

We will add a check to load dbmaint.q, should it not be passed in by our bash wrapper [2]. and the capture and check that a ‘db’ directory has been passed to our script, along with an ‘action’.

We will also add special handling for functions noted as ‘fn’ passed into our scripts which are passed to our script as strings to be evaluated.

/// dbmaint.q check
if[not `addcol in key `.; .log.out "Attempting to load dbmaint.q in current directory"; @[system;"l ./dbmaint.q"; {.log.errexit "Could not load dbmaint.q",x}]];

/// Parameter handling
d:.Q.opt .z.x;
if[not all `db`action in key d; .log.usage `db`action];
d:(first each d),$[`fn in key d;enlist[`fn]!enlist " " sv d[`fn];()];
d:{x[y]:`$x[y];x}[d;key[d] except `db`fn];
d[`db]:hsym `$first system raze "readlink -f ",d[`db];

Next we will define our subroutines (functions), to be called later in our ‘main’ function; this helps provide some level of separation within our script.

A function to load and map our database in memory:

load_db:{
   .log.out "Loading database: ",string x;
   system "l ",1_string x;
}

A function to validate user defined parameters:

param_check:{
    requiredInputs:`addcol`deletecol`renamecol`fncol!(`table`colname`fn;`table`colname;`table`oldname`newname;`table`colname`fn);
    
    params:requiredInputs[y];
    if[not all params in key[x]; .log.usage[params]];

    .log.out "Params: ",.Q.s1 x;
 }

A function that will create a backup directory and copy the ‘sym’ file [3] (using rsync) from our database directory to our backup directory. Note, for demonstration purposes only [4].

backup:{
    backup_path:(first system "dirname ",string[x]),"/";
    backup_dir:"hdb_bak/",{ssr[x;y;"-"]}/["-" sv string each (.z.D;.z.T);(".";":")];

    .log.out "Creating ",backup_dir;
    system "mkdir -p ",1_ backup_path,backup_dir;

    .log.out "Copying sym file...";
    system "rsync -aL ",(1_ string[x]),"/sym ",(1_ backup_path,backup_dir);

    .log.out "Backup complete";
 }

We will then create a namespace for our wrapper functions which we will dynamically call in our main script.

/// Wrapper functions
\d .maint
addcol:{[d] (get[`.] `addcol)[d[`db];d[`table];d[`colname];value d[`fn]];};
deletecol:{[d] (get[`.] `deletecol)[d[`db];d[`table];d[`colname]];};
renamecol:{[d] (get[`.] `renamecol)[d[`db];d[`table];d[`oldname];d[`newname]];};
fncol:{[d] (get[`.] `fncol)[d[`db];d[`table];d[`colname];value d[`fn]];};
\d .

Note, we use get [`.] `function to retrieve the dbmaint defined functions which have been loaded in the global namespace. You can learn more about namespaces here.

The reasons for using a namespace in this manner, and not a switch (case) statement are 1) a ‘cond error can occur in older versions of kdb+ where there are an even number of cases, and 2) namespace provides logical separation of these away from our main function.

To wrap up all of these functions, we will define a ‘main’ function and entry point into our script, which can be outlined as:

  1. Load database
  2. Check user inputs
  3. Backup sym file
  4. Execute maintenance
  5. Exit process
/// Main body
main:{
    load_db d[`db];
    param_check[d;action];
    backup d[`db];
    .maint[d[`action]][d][];
    .log.sucexit[];
 }

/// Entry point
@[main;`;{.log.errexit "Error running main: ",x}];

Utilities

The following section will briefly discuss some high level utilities [5] to provide some level of automation to running our maintenance scripts.

Process logging

Executing the script from the command line will cause all output to be directed to the console under normal circumstances. Whilst this is great for development and testing purposes, in higher environments such as production, having a record of what happened during execution is not only useful, but a requirement from an auditing perspective.

To achieve this we can run our script with redirecting our file descriptors into a file, by chaining redirections as (stderr → stdout ‘2>&1’) → logfile ‘>hdbmaint.log

$ bash hdbmaint.sh -db ./hdb -action renamecol -table trades -oldname px -newname price 2>&1 >hdbmaint.log

Upon executing this command you will not see any information logged to the console but instead will be written to a process log file on disk. 

For long running processes, you can open another terminal window and use the utilities such as less to stream the log for viewing purposes.

Background Execution

Going a step further, knowing that our script will not require any user input and depending upon the number of actions defined, may take a considerable amount of time.

We could start our script, then send the process to the background using ‘CTRL+Z’ as follows:

$ q
1+1
2
[1]+  Stopped                 ~/.scripts/start_q.sh

To see processes running in the background we can use the utility ‘jobs’:

$ jobs
[1]+  Stopped                 ~/.scripts/start_q.sh

We can then return the running process to the foreground using ‘fg’:

$ fg % 1
~/.scripts/start_q.sh
{x*x} 5

See here for more on fg, bg and jobs usage.

A more straightforward method of starting a script for background execution is with the usage of the ampersand ‘&’ symbol.

$ bash hdbmaint.sh -db ./hdb -action renamecol -table trades -oldname px -newname price 2>&1 >hdbmaint.log &

This will run the maintenance process in the background so you can continue utilizing your terminal window.

You can read more about ‘Job Control’ here.

Shell Protection

To help protect our maintenance script during execution from unintentional interruption, we can run with nohup to prevent our shell from interrupting our process, should our terminal session be dropped.

This can be achieved by adding nohup in front of our command:

$ nohup bash hdbmaint.sh -db ./hdb -action renamecol -table trades -oldname px -newname price 2>&1 >hdbmaint.log &

Note, if you don’t redirect output it will be automatically directed to a file called ‘nohup.out’ 

In our example we are already redirecting stderr and stdout into a log file (‘hdbmaint.log’).

Task Scheduling

There may be occasions where we may want to schedule our maintenance scripts to automatically run at a predetermined time, for example during early hours on a Sunday morning when there is little to no expected usage of our database.

The following command is an example of how we can schedule our maintenance script to run on the first Sunday of the month at 03:00 by utilizing the cron daemon and setting up a crontab.

$ crontab -l | grep hdbmaint.sh
0 3 1-7 * 7 bash /home/dcrossey/maint-blog/scripts/hdbmaint.sh -db /home/dcrossey/maint-blog/scripts/hdb -action renamecol -table trades -oldname px -newname price 2>&1 >/home/dcrossey/maint-blog/scripts/hdbmaint.log

Read as 0 mins 3 hours, 1st-7th date, every month, Sunday only.

Note, there is no need to use ‘nohup’ as you will not be running the job from a terminal. 

Similarly ‘&’ is not required as every job run by cron is run in the background automatically.

See The Geek Diary for more examples on how to setup crontab.

Sample execution

To begin, we can use the ‘setup.q’ script provided on Github to generate a trade table with some data

$ q setuphdb.q -db hdb

$ q hdb
2#select from trades where date=2015.01.10

date      tm             sym    qty     price      val
--------------------------------------------------------------------------------------------------------
2015.01.01   00:00:00.395   ibm    1050    213        223650   
2015.01.01   00:00:01.059   goog   1430    643.68     920462.4

Now we can call out ‘demo.q’ script which can be found on GitHub.

For brevity, I’ve used ellipses to hide similar output in the log below.

$ bash demo.sh

$ cat hdbmaint.log
2019.09.29 13:03:17.599: INFO : Loading database: :/home/dcrossey/repos/maint-blog/scripts/hdb
2019.09.29 13:03:17.608: INFO : Params: `db`action`table`oldname`newname`fn!("./hdb";"renamecol";"trades";"px";"price";"")
2019.09.29 13:03:17.660: INFO : Creating hdb_bak/2019-09-29-13-03-17-660
2019.09.29 13:03:17.717: INFO : Copying sym file...
2019.09.29 13:03:17.851: INFO : Backup complete
2019.09.29 13:03:17 renaming px to price in `:/home/dcrossey/repos/maint-blog/scripts/hdb/2015.01.01/trades
...
2019.09.29 13:03:18.623: INFO : Maintenance complete
2019.09.29 13:03:18.623: INFO : Success. Exiting
2019.09.29 13:03:19.550: INFO : Loading database: :/home/dcrossey/repos/maint-blog/scripts/hdb
2019.09.29 13:03:19.556: INFO : Params: `db`action`table`colname`fn!("./hdb";"addcol";"trades";"industry";"enlist \"tech\"")
2019.09.29 13:03:19.596: INFO : Creating hdb_bak/2019-09-29-13-03-19-596
2019.09.29 13:03:19.646: INFO : Copying sym file...
2019.09.29 13:03:19.758: INFO : Backup complete
2019.09.29 13:03:19 adding column industry (type 0) to `:/home/dcrossey/repos/maint-blog/scripts/hdb/2015.01.01/trades
...
2019.09.29 13:03:20.499: INFO : Maintenance complete
2019.09.29 13:03:20.500: INFO : Success. Exiting
2019.09.29 13:03:21.367: INFO : Loading database: :/home/dcrossey/repos/maint-blog/scripts/hdb
2019.09.29 13:03:21.373: INFO : Params: `db`action`table`colname`fn!("./hdb";"fncol";"trades";"industry";"{(` sv x,`sym)?`$y} d[`db]")
2019.09.29 13:03:21.417: INFO : Creating hdb_bak/2019-09-29-13-03-21-417
2019.09.29 13:03:21.470: INFO : Copying sym file...
2019.09.29 13:03:21.562: INFO : Backup complete
2019.09.29 13:03:21 resaving column industry (type 20) in `:/home/dcrossey/repos/maint-blog/scripts/hdb/2015.01.01/trades
...
2019.09.29 13:03:22.814: INFO : Maintenance complete
2019.09.29 13:03:22.814: INFO : Success. Exiting

From our log file we can see three distinct maintenance tasks have occurred:

  1. Renamed ‘px’ to ‘price’
  2. Added column ‘industry’ with an empty string as default value
  3. Converted ‘industry’ from a string type to a symbol (enumerated over ‘sym’)

Finally we can load our HDB, and if we query trades, we should now see these modifications in place:

2#select from trades
date       tm           sym  qty  price  val      industry
----------------------------------------------------------
2015.01.01 00:00:00.395 ibm  1050 213    223650   tech    
2015.01.01 00:00:01.059 goog 1430 643.68 920462.4 tech    

meta trades
c       | t f a
--------| -----
date    | d    
tm      | t    
sym     | s    
qty     | j    
price   | f    
val     | f    
industry| s

Further expansion

These scripts can be used as a base to develop your own maintenance wrapper scripts, or extend this script. 

You may have more niche maintenance commands that cannot be executed simply using the dbmaint framework and may wish to embedded ad-hoc commands. 

As an example, if we wanted to upper case every symbol in the sym file we could add the following to our maint script:

.log.out "Changing symbols to uppercase...";
(` sv (d[`db];`sym)) set `$upper string get (` sv (d`[db];`sym));

You could incorporate more advanced features such as rolling back database changes in the event of an issue; or enhance the scripts modularity by passing in a config file containing table names, function and operations to execute.

Source Code

All source code detailed in this blog can be found on GitHub.

Footnotes


[1] These concepts can also be applied to segmented databases.


[2] This also allows us to run the q script independently of the bash wrapper.


[3] Always backup your sym file prior to any modification! To quote Jeffry Borror: “Corrupting the sym file will almost certainly render your database useless.”


[4] In the real world it is impractical and unfeasible to backup an HDB in its entirety prior to maintenance. These types of modifications should always be tested extensively and validated in lower environments (UAT, Staging) before being executed on a production system.


[5] Utilities discussed in this blog are relevant to Linux based operating systems.

SUGGESTED ARTICLES