Kdb+ interface to R

Kdb+ Interface to R

2 Aug 2017 | , , , , ,
Share on:

By Louise Totten

Kx recently introduced an updated interface to R from kdb+. It is available on GitHub here. What follows is a description of the interface and its capabilities, as well as some examples.

R is a development environment commonly used for data analysis and visualizing data. When interfacing R with kdb+ we can think of kdb+ as a database behind R which can be used to store huge amounts of data, and we can easily extract this data using the language q.

We can utilize the strengths of both R and kdb+/q to perform statistical analyses on large volumes of data. If you have previously been working with kdb+ and R, you probably were using one of the prior interfaces found on the kdb+ wiki (code.kx.com).

This blog focuses solely on the new R client for kdb+ and gives an overview on the following topics:

  • Set up a q instance
  • Set up the rkdb interface
  • Aggregating a sample dataset in the q instance and returning the results to the R client using the rkdb interface

A major benefit of the new rkdb interface is improved ease of installation over previous interfaces. We can now install from R using one simple command.

The main benefit here is that we can combine the unrivaled storage and processing power of kdb+ with the inbuilt libraries and visualization tools available in R.

Kdb+ interface to R

Kdb+R=rkdb

Prerequisites:

Setting up the new rkdb interface is straightforward. Within R, just run the one command.

devtools::install_github('kxsystems/rkdb')

kdb+ interface to R

kdb+ interface to R

Establishing a connection to q from R

Start by setting up two q instances which will act as servers, they will hold all of the data we want to analyze.

First set up a q process running on port 5000 which will load in a q script to generate a large volume of dummy trade and quote data:

//taqdemodata.q
//define trade schema
trade:([]date:`date$();time:`time$();sym:`$();price:`float$();size:`int$();ex:`symbol$());

//define quote schema
quote:([]date:`date$();time:`time$();sym:`symbol$();bid:`float$();bsize:`int$();ask:`float$();asize:`int$();ex:`symbol$()); 

//define data 
n:1000000; /n is number of trades
st:09:30t;
et:17t;
portfolio:`IBM`MSFT`GOOG`YHOO`VOD`BOI;
exchanges:(`N;`O;`B;`L);
dts:.z.D-til 300;

tdata:(n?dts;st+n?et-st;n?portfolio;n?100f;n?1000;n?exchanges);
insert[`trade;tdata];

n*:10;
qdata:(n?dts;st+n?et-st;n?portfolio;n?100f;n?1000;n?100f;n?1000;n?exchanges);
insert[`quote;qdata];

//sort trade
xasc[`date`time;`trade];
//sort quote
xasc[`date`time;`quote];

Setting up the first q session:

q)/setting console size
q)\c 20 120
q)
q)/setting port
q)\p 5000
q)
q)/loading taqdemodata.q
q)\l taqdemodata.q
q)
q)\a
`quote`trade
q)5#trade
date       time         sym  price    size ex
---------------------------------------------
2016.09.23 09:30:06.168 IBM  40.76838 203  B
2016.09.23 09:30:09.551 VOD  27.68937 284  N
2016.09.23 09:30:10.989 MSFT 78.95682 452  N
2016.09.23 09:30:13.526 IBM  27.7396  403  O
2016.09.23 09:30:16.421 VOD  1.612398 244  O
q)5#quote
date       time         sym  bid      bsize ask      asize ex
-------------------------------------------------------------
2016.09.23 09:30:00.688 MSFT 81.72594 514   12.58012 980   L
2016.09.23 09:30:01.269 BOI  57.99858 906   77.43472 868   B
2016.09.23 09:30:01.428 GOOG 9.704925 778   95.76563 710   O
2016.09.23 09:30:01.461 GOOG 66.55175 261   55.11257 802   O
2016.09.23 09:30:03.434 IBM  96.09301 185   8.264956 487   L

The second q instance runs on port 6000 and loads in a csv file of property price data (available from https://data.gov.uk/dataset/land-registry-monthly-price-paid-data):

q)/setting console size
q)\c 20 120
q)
q)/setting port
q)\p 6000
q)pp:("SFPSSSFSSSSSSSSS";enlist",")0: `:pp2017.csv
q)pp
id                                   price   dateOfTransfer                postcode propertyType oldNew duration PAON SOAN street             locality   townCity    district             county         PPDCatType recordStatus
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
4E95D757-A215-EDA1-E050-A8C0630539E2 983000  2017.03.29D00:00:00.000000000 EN1 2PF  D            N               31        WELLINGTON ROAD               ENFIELD     ENFIELD              GREATER LONDON A          A
4E95D757-A216-EDA1-E050-A8C0630539E2 450000  2017.04.12D00:00:00.000000000 HA1 1PE  T            N               70        HEADSTONE ROAD                HARROW      HARROW               GREATER LONDON A          A
4E95D757-A217-EDA1-E050-A8C0630539E2 415000  2017.03.16D00:00:00.000000000 TW13 6TE S            N               19        CONWAY ROAD                   FELTHAM     HOUNSLOW             GREATER LONDON A          A
4E95D757-A218-EDA1-E050-A8C0630539E2 472000  2017.03.31D00:00:00.000000000 UB6 9NY  S            N               95        CHINNOR CRESCENT              GREENFORD   EALING               GREATER LONDON A          A
4E95D757-A219-EDA1-E050-A8C0630539E2 682500  2017.03.06D00:00:00.000000000 W7 2JS   S            N               17        SEWARD ROAD                   LONDON      EALING               GREATER LONDON A          A
4E95D757-A21A-EDA1-E050-A8C0630539E2 360000  2017.04.05D00:00:00.000000000 UB3 5LY  S            N               122       WEST END LANE      HARLINGTON HAYES       HILLINGDON           GREATER LONDON A          A
4E95D757-A21B-EDA1-E050-A8C0630539E2 384000  2017.03.15D00:00:00.000000000 HA5 2AA  T            N               20        PINNER GREEN                  PINNER      HARROW               GREATER LONDON A          A
4E95D757-A21C-EDA1-E050-A8C0630539E2 445000  2017.04.21D00:00:00.000000000 TW3 4DT  D            N               20        ELMSWORTH AVENUE              HOUNSLOW    HOUNSLOW             GREATER LONDON A          A
4E95D757-A21D-EDA1-E050-A8C0630539E2 415000  2017.03.24D00:00:00.000000000 TW2 6AJ  S            N               11        ALBEMARLE AVENUE              TWICKENHAM  RICHMOND UPON THAMES GREATER LONDON A          A
4E95D757-A21E-EDA1-E050-A8C0630539E2 1525000 2017.03.24D00:00:00.000000000 W13 8ES  S            N               20        ST STEPHENS AVENUE            LONDON      EALING               GREATER LONDON A          A
4E95D757-A21F-EDA1-E050-A8C0630539E2 582000  2017.03.16D00:00:00.000000000 TW12 3EJ T            N               359       HANWORTH ROAD                 HAMPTON     RICHMOND UPON THAMES GREATER LONDON A          A
4E95D757-A220-EDA1-E050-A8C0630539E2 405000  2017.03.10D00:00:00.000000000 EN2 0PY  T            N               69        GORDON ROAD                   ENFIELD     ENFIELD              GREATER LONDON A          A
4E95D757-A221-EDA1-E050-A8C0630539E2 520000  2017.03.28D00:00:00.000000000 EN6 2JN  S            N               61        ABERDALE GARDENS              POTTERS BAR HERTSMERE            HERTFORDSHIRE  A          A
4E95D757-A222-EDA1-E050-A8C0630539E2 587500  2017.04.07D00:00:00.000000000 HA2 7EE  D            N               37        FERNBROOK DRIVE               HARROW      HARROW               GREATER LONDON B          A
4E95D757-A223-EDA1-E050-A8C0630539E2 570000  2017.04.13D00:00:00.000000000 HA5 1JJ  T            N               37        DURLEY AVENUE                 PINNER      HARROW               GREATER LONDON A          A
..
q)meta pp
c             | t f a
--------------| -----
id            | s
price         | f
dateOfTransfer| p
postcode      | s
propertyType  | s
oldNew        | s
duration      | f
PAON          | s
SOAN          | s
street        | s
locality      | s
townCity      | s
district      | s
county        | s
PPDCatType    | s
recordStatus  | s

In order to track the connections that the R client makes to the q instances, run the following q code:

current_connections:([]time:`time$();handle:`int$();user:`$();host:`$();IP:();number_messages:`int$());
.z.po:{`current_connections insert (.z.T;x;.z.u;.Q.host[.z.a];`int$vs[0x00;.z.a];0);}; 
.z.pc:{delete from `current_connections where handle=x;}; 
.z.pg:{update number_messages+1 from `current_connections where handle=.z.w;res:value x;res}; 
.z.ps:{update number_messages:number_messages+1 from `current_connections where handle=.z.w;value x;};

Once a connection has been established from the R client to the q servers using the rkdb interface the datasets are available for further analysis. To check that the rkdb library is loaded execute:

library(rkdb)

Then connect to the q instances using the same syntax as the previous example:

open_connection(hostname, port, username:password)

Verify on the first q instance that the connection has been established by checking the current_connections table:

q)\p
5000i
q)current_connections
time         handle user    host            IP       number_messages
--------------------------------------------------------------------
20:58:39.855 544    taqUser desktop-4ksdj7d 127 0 0 1 0
h2<-open_connection("127.0.0.1",6000,"taqUser:password")

Again, check on the second q instance in the current_connections table to confirm that the connection has been established:

q)\p
6000i
q)current_connections
time         handle user   host            IP        number_messages
--------------------------------------------------------------------
20:59:16.241 552    ppUser desktop-4ksdj7d 127 0 0 1 0
q)

Executing a query

As usual the syntax for executing a query is simply:

execute(connectionhandle, request)

Here are the results of a simple query to determine the number of rows per date in the trade table:

Closing a connection

To close a connection the syntax is:

close_connection(connectionhandle)

Executing a query on the closed connection will give an error showing there is no connection to the kdb+ server:

Check the current connections table to confirm that the connection has been closed:

q)\p 
5000i
q)current_connections
time handle user host IP number_messages
----------------------------------------

Example 1: Find average price and size by symbol from trade table.

Save the trade table on the q server to a csv file. Load it into the R client and perform an aggregation using the inbuilt aggregation function.

Replicate this 100 times to measure how long it takes.

In the session below it takes about a minute:

One of the major advantages of using R is that it can utilize the libraries written to optimize common tasks and queries. By loading the “data.table” library, the time taken to perform the same aggregation 100 times over is reduced to 2.58 seconds:

Then we compare the performance of carrying out the aggregation using “data.table” on the R client versus sending a query to the q server to execute and return the result. As shown below, the results returned from running the query on the q server match the results from running in the R client:

By repeating both approaches multiple times we can compare the performance:

Sending the query to the q server to perform the aggregation is significantly faster in both cases.

Example 2: Find average price by district for properties within Surrey County

As before, load the csv file into R and convert it to a data.table object:

Using the “data.table” library, the time taken to perform the price aggregation is 0.17 seconds:

Running the same query on the q server returns the same results, but only takes 0.1 seconds:

Repeat both approaches 1000 times to compare the performance:

As before, sending the query to the q server is faster. We can plot the results of the query on the q server:

kdb+ interface to R

Summary

The new kdb+ interface to R (rkdb) can be installed using one single command, making it easy for users of R and kdb+ to combine the best features of both tools. Execution time for analytics run on large volumes of data can be made to run significantly faster as demonstrated in the examples above.

Examples used in this blog were made using:
R 3.4.0
Rtools 3.4
kdb+ version 3.5 2017.05.02

Louise Totten is an expert kdb+ engineer who is a Team Lead at First Derivatives plc in Newry, Northern Ireland.

 

© 2017 Kx Systems
Kx® and kdb+ are registered trademarks of Kx Systems, Inc., a subsidiary of First Derivatives plc.

SUGGESTED ARTICLES

Kx collaborating with Fintech startup chartiq

Collaboration: The Dominant Trend in Finance

13 Dec 2017 | , , , ,

In December we are re-blogging some of our favorite content from Kx partners and affiliated companies, starting with this article on the ChartIQ blog. ChartIQ is an agile FinTech company that sells an advanced HTML5 charting library used in technical data analysis, trading configurations and for charting in the capital markets industry. Kx offers a ChartIQ integration as an addition to our Dashboards. In Collaboration: The Dominant Trend in Finance, ChartIQ’s Hanni Chehak writes about the rise of FinTech companies, and the role collaboration plays as FinTech companies are increasingly disrupting the traditional banking sector.

Water system workers with kdb+ historical database

Kdb+ Use Case: Machine Learning Water System Maintenance Application

6 Dec 2017 | , , , ,

Kdb+ is being used much more widely in machine learning applications today. Its ability to quickly ingest and process data, particularly large, fragmented datasets, is one way that developers are adding kdb+ to their technology stack of artificial intelligence and machine learning tools.
For Australian kdb+ developer Sherief Khorshid, who also develops machine learning systems, incorporating kdb+ into a predictive maintenance application gave him the edge in a hackathon win that landed him a cash prize and a contract with the Water Corporation of Western Australia.

kdb+ FFI

Kdb+ FFI: Access external libraries more easily from q

22 Nov 2017 | , , ,

Following on from the hugely popular Python library and interface embedPy and PyQ, Kx has released an FFI as part of the Fusion for kdb+ interfaces. As with embedPy and PyQ, this FFI is open-sourced under the Apache 2 license.
The kdb+ FFI is a foreign function interface library for loading and calling dynamic libraries from q code. It has been adapted and expanded upon from a library originally written by Alex Belopolsky of Enlightenment Research. With the kdb+ FFI you can now call your favorite C/C++ libraries directly from q without the overhead of having to compile shared objects and load into q using the 2: command.