Kdb+ Interface to R

2 Aug 2017

By Louise Totten

Kx recently introduced an updated interface to R from kdb+, available on GitHub at github.com/KxSystems/rkdb. What follows is a description of the interface and its capabilities, along with some examples.

R is a programming environment commonly used for data analysis and visualization. When interfacing R with kdb+, we can think of kdb+ as a database sitting behind R: it stores huge amounts of data, which we can easily extract using the q language.

We can utilize the strengths of both R and kdb+/q to perform statistical analyses on large volumes of data. If you have worked with kdb+ and R before, you were probably using one of the earlier interfaces documented on the kdb+ wiki (code.kx.com).

This blog focuses solely on the new R client for kdb+ and gives an overview on the following topics:

  • Set up a q instance
  • Set up the rkdb interface
  • Aggregate a sample dataset in the q instance and return the results to the R client using the rkdb interface

A major benefit of the new rkdb interface is improved ease of installation over previous interfaces: we can now install it from within R with a single command.

The main benefit overall is that we can combine the storage and query performance of kdb+ with the built-in statistical libraries and visualization tools available in R.

kdb+ + R = rkdb

Prerequisites: a working kdb+ installation, and R with the devtools package (used below to install from GitHub).

Setting up the new rkdb interface is straightforward. Within R, just run one command:

devtools::install_github('kxsystems/rkdb')


Establishing a connection to q from R

Start by setting up two q instances to act as servers; they will hold all of the data we want to analyze.

First set up a q process running on port 5000 which will load in a q script to generate a large volume of dummy trade and quote data:

//taqdemodata.q
//define trade schema
trade:([]date:`date$();time:`time$();sym:`symbol$();price:`float$();size:`int$();ex:`symbol$());

//define quote schema
quote:([]date:`date$();time:`time$();sym:`symbol$();bid:`float$();bsize:`int$();ask:`float$();asize:`int$();ex:`symbol$());

//define data
n:1000000; /n is the number of trades
st:09:30:00.000; /start of trading day
et:17:00:00.000; /end of trading day
portfolio:`IBM`MSFT`GOOG`YHOO`VOD`BOI;
exchanges:`N`O`B`L;
dts:.z.D-til 300; /the last 300 dates

/sizes are drawn as ints (1000i) to match the `int$() columns
tdata:(n?dts;st+n?et-st;n?portfolio;n?100f;n?1000i;n?exchanges);
insert[`trade;tdata];

n*:10; /ten times as many quotes as trades
qdata:(n?dts;st+n?et-st;n?portfolio;n?100f;n?1000i;n?100f;n?1000i;n?exchanges);
insert[`quote;qdata];

//sort trade and quote by date and time
xasc[`date`time;`trade];
xasc[`date`time;`quote];

Setting up the first q session:

q)/setting console size
q)\c 20 120
q)
q)/setting port
q)\p 5000
q)
q)/loading taqdemodata.q
q)\l taqdemodata.q
q)
q)\a
`quote`trade
q)5#trade
date       time         sym  price    size ex
---------------------------------------------
2016.09.23 09:30:06.168 IBM  40.76838 203  B
2016.09.23 09:30:09.551 VOD  27.68937 284  N
2016.09.23 09:30:10.989 MSFT 78.95682 452  N
2016.09.23 09:30:13.526 IBM  27.7396  403  O
2016.09.23 09:30:16.421 VOD  1.612398 244  O
q)5#quote
date       time         sym  bid      bsize ask      asize ex
-------------------------------------------------------------
2016.09.23 09:30:00.688 MSFT 81.72594 514   12.58012 980   L
2016.09.23 09:30:01.269 BOI  57.99858 906   77.43472 868   B
2016.09.23 09:30:01.428 GOOG 9.704925 778   95.76563 710   O
2016.09.23 09:30:01.461 GOOG 66.55175 261   55.11257 802   O
2016.09.23 09:30:03.434 IBM  96.09301 185   8.264956 487   L

The second q instance runs on port 6000 and loads a csv file of property price data (available from https://data.gov.uk/dataset/land-registry-monthly-price-paid-data):

q)/setting console size
q)\c 20 120
q)
q)/setting port
q)\p 6000
q)pp:("SFPSSSFSSSSSSSSS";enlist",")0: `:pp2017.csv
q)pp
id                                   price   dateOfTransfer                postcode propertyType oldNew duration PAON SOAN street             locality   townCity    district             county         PPDCatType recordStatus
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
4E95D757-A215-EDA1-E050-A8C0630539E2 983000  2017.03.29D00:00:00.000000000 EN1 2PF  D            N               31        WELLINGTON ROAD               ENFIELD     ENFIELD              GREATER LONDON A          A
4E95D757-A216-EDA1-E050-A8C0630539E2 450000  2017.04.12D00:00:00.000000000 HA1 1PE  T            N               70        HEADSTONE ROAD                HARROW      HARROW               GREATER LONDON A          A
4E95D757-A217-EDA1-E050-A8C0630539E2 415000  2017.03.16D00:00:00.000000000 TW13 6TE S            N               19        CONWAY ROAD                   FELTHAM     HOUNSLOW             GREATER LONDON A          A
4E95D757-A218-EDA1-E050-A8C0630539E2 472000  2017.03.31D00:00:00.000000000 UB6 9NY  S            N               95        CHINNOR CRESCENT              GREENFORD   EALING               GREATER LONDON A          A
4E95D757-A219-EDA1-E050-A8C0630539E2 682500  2017.03.06D00:00:00.000000000 W7 2JS   S            N               17        SEWARD ROAD                   LONDON      EALING               GREATER LONDON A          A
4E95D757-A21A-EDA1-E050-A8C0630539E2 360000  2017.04.05D00:00:00.000000000 UB3 5LY  S            N               122       WEST END LANE      HARLINGTON HAYES       HILLINGDON           GREATER LONDON A          A
4E95D757-A21B-EDA1-E050-A8C0630539E2 384000  2017.03.15D00:00:00.000000000 HA5 2AA  T            N               20        PINNER GREEN                  PINNER      HARROW               GREATER LONDON A          A
4E95D757-A21C-EDA1-E050-A8C0630539E2 445000  2017.04.21D00:00:00.000000000 TW3 4DT  D            N               20        ELMSWORTH AVENUE              HOUNSLOW    HOUNSLOW             GREATER LONDON A          A
4E95D757-A21D-EDA1-E050-A8C0630539E2 415000  2017.03.24D00:00:00.000000000 TW2 6AJ  S            N               11        ALBEMARLE AVENUE              TWICKENHAM  RICHMOND UPON THAMES GREATER LONDON A          A
4E95D757-A21E-EDA1-E050-A8C0630539E2 1525000 2017.03.24D00:00:00.000000000 W13 8ES  S            N               20        ST STEPHENS AVENUE            LONDON      EALING               GREATER LONDON A          A
4E95D757-A21F-EDA1-E050-A8C0630539E2 582000  2017.03.16D00:00:00.000000000 TW12 3EJ T            N               359       HANWORTH ROAD                 HAMPTON     RICHMOND UPON THAMES GREATER LONDON A          A
4E95D757-A220-EDA1-E050-A8C0630539E2 405000  2017.03.10D00:00:00.000000000 EN2 0PY  T            N               69        GORDON ROAD                   ENFIELD     ENFIELD              GREATER LONDON A          A
4E95D757-A221-EDA1-E050-A8C0630539E2 520000  2017.03.28D00:00:00.000000000 EN6 2JN  S            N               61        ABERDALE GARDENS              POTTERS BAR HERTSMERE            HERTFORDSHIRE  A          A
4E95D757-A222-EDA1-E050-A8C0630539E2 587500  2017.04.07D00:00:00.000000000 HA2 7EE  D            N               37        FERNBROOK DRIVE               HARROW      HARROW               GREATER LONDON B          A
4E95D757-A223-EDA1-E050-A8C0630539E2 570000  2017.04.13D00:00:00.000000000 HA5 1JJ  T            N               37        DURLEY AVENUE                 PINNER      HARROW               GREATER LONDON A          A
..
q)meta pp
c             | t f a
--------------| -----
id            | s
price         | f
dateOfTransfer| p
postcode      | s
propertyType  | s
oldNew        | s
duration      | f
PAON          | s
SOAN          | s
street        | s
locality      | s
townCity      | s
district      | s
county        | s
PPDCatType    | s
recordStatus  | s

To track the connections that the R client makes to the q instances, run the following q code on each server:

current_connections:([]time:`time$();handle:`int$();user:`$();host:`$();IP:();number_messages:`int$());
/on connection open, record the handle, user, host and IP of the client
.z.po:{`current_connections insert (.z.T;x;.z.u;.Q.host[.z.a];`int$vs[0x00;.z.a];0);};
/on connection close, remove the client's row
.z.pc:{delete from `current_connections where handle=x;};
/on each synchronous message, increment the message count and evaluate the query
.z.pg:{update number_messages:number_messages+1 from `current_connections where handle=.z.w;value x};
/on each asynchronous message, do the same
.z.ps:{update number_messages:number_messages+1 from `current_connections where handle=.z.w;value x;};

Once a connection has been established from the R client to the q servers using the rkdb interface, the datasets are available for further analysis. First, load the rkdb library in the R session:

library(rkdb)

Then connect to the q instances using the following syntax:

open_connection(hostname, port, "username:password")
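
For example, connecting to the first q instance on port 5000 might look like this (the handle name h1 and the password are illustrative; the user taqUser matches the verification output that follows):

h1<-open_connection("127.0.0.1",5000,"taqUser:password")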

Verify on the first q instance that the connection has been established by checking the current_connections table:

q)\p
5000i
q)current_connections
time         handle user    host            IP       number_messages
--------------------------------------------------------------------
20:58:39.855 544    taqUser desktop-4ksdj7d 127 0 0 1 0
Next, open a connection to the second q instance on port 6000:

h2<-open_connection("127.0.0.1",6000,"ppUser:password")

Again, check the current_connections table on the second q instance to confirm that the connection has been established:

q)\p
6000i
q)current_connections
time         handle user   host            IP        number_messages
--------------------------------------------------------------------
20:59:16.241 552    ppUser desktop-4ksdj7d 127 0 0 1 0
q)

Executing a query

The syntax for executing a query is simply:

execute(connectionhandle, request)

As a simple example, we can query the number of rows per date in the trade table.
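
Such a query might look like the following in the R client (assuming the connection handle h1 to the trade server; the computed column name num is our choice):

counts<-execute(h1,"select num:count i by date from trade")
head(counts)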

Closing a connection

To close a connection the syntax is:

close_connection(connectionhandle)

Executing a query on the closed connection gives an error showing there is no longer a connection to the kdb+ server.
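
A minimal sketch, assuming the connection handle h1 from earlier:

close_connection(h1)
# any subsequent query on h1 fails, as the socket is no longer open
execute(h1,"tables[]")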

Check the current_connections table to confirm that the connection has been closed:

q)\p 
5000i
q)current_connections
time handle user host IP number_messages
----------------------------------------

Example 1: Find average price and size by symbol from the trade table

Save the trade table on the q server to a csv file. Load it into the R client and perform the aggregation using R's built-in aggregate function.

Replicate this 100 times to measure how long it takes; in our test session this took about a minute.
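
A sketch of this benchmark in base R, assuming the trade table was saved to a file named trade.csv with the column names from the schema above:

trade<-read.csv("trade.csv")
# average price and size by symbol, repeated 100 times
system.time(replicate(100, aggregate(cbind(price,size)~sym, data=trade, FUN=mean)))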

One of the major advantages of using R is its ecosystem of libraries that optimize common tasks and queries. By loading the “data.table” library, the time taken to perform the same aggregation 100 times is reduced to 2.58 seconds.

Next we compare the performance of carrying out the aggregation using “data.table” on the R client against sending the query to the q server to execute and return the result. The results returned from running the query on the q server match those from running it in the R client.
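
The two approaches might be sketched as follows (the handle name h1 and the csv filename are assumptions):

library(data.table)
trade_dt<-fread("trade.csv")
# aggregation performed locally in R with data.table
res_r<-trade_dt[, .(price=mean(price), size=mean(size)), by=sym]
# the same aggregation pushed to the q server
res_q<-execute(h1,"select avg price, avg size by sym from trade")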

By repeating both approaches multiple times we can compare the performance.

Sending the query to the q server to perform the aggregation is significantly faster in both cases.

Example 2: Find average price by district for properties within Surrey County

As before, load the csv file into R and convert it to a data.table object.

Using the “data.table” library, the price aggregation takes 0.17 seconds.

Running the same query on the q server returns the same results, but takes only 0.1 seconds.
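
The comparison might be sketched as follows (the handle name h2, the csv filename and the exact county spelling are assumptions):

library(data.table)
pp_dt<-fread("pp2017.csv")
# aggregation performed locally in R
res_r<-pp_dt[county=="SURREY", .(price=mean(price)), by=district]
# the same aggregation on the q server
res_q<-execute(h2,"select avg price by district from pp where county=`SURREY")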

Repeat both approaches 1000 times to compare the performance.

As before, sending the query to the q server is faster. The results returned from the q server can then be plotted using R's graphics libraries.

Summary

The new kdb+ interface to R (rkdb) can be installed with a single command, making it easy for users of R and kdb+ to combine the best features of both tools. As the examples above demonstrate, analytics on large volumes of data can run significantly faster when the aggregation is executed on the q server.

The examples in this blog were produced using:
R 3.4.0
Rtools 3.4
kdb+ version 3.5 2017.05.02

Louise Totten is an expert kdb+ engineer who is a Team Lead at First Derivatives plc in Newry, Northern Ireland.

 
