kdb+ kafka interface

Kdb+ interface to Kafka

20 Jun 2017 | , , , , ,
Share on:

 

By Sergey Vidyuk

Kx recently open-sourced a Kafka interface to kdb+ on GitHub under the Apache2 license that eases integration of applications using Kafka with kdb+. Here is a description of the interface.

Kdb+kafka = kfk library

Libkfk is a comprehensive interface to Apache Kafka for kdb+ based on the librdkafka library – similar to 20+ other language bindings.

This interface provides yet another way of combining the ingest, real-time and historical data processing capabilities of kdb+ with external data streams and consumers.

Notable features include:
* Multiple clients, publishers and subscribers in the same process
* Minimal library overhead to utilize the multi-threaded C API
* Message data decoded and delivered
* Metadata, statistics and error list exposed for full production control and monitoring

Kfk library strives to expose the maximum amount of information to the user to give them control and understanding of the data flow within and beyond the application. To spot optimization possibilities remember that events within your application are a data feed in themselves for storing and facilitating further analysis.

What is Kafka?

Tibco RV, RabbitMQ, MQ, 29West LBM(UMS), JMS….If any of these sound familiar, you can just append another one to the list: Kafka.

At a high level, Kafka is just another messaging system which uses brokers to decouple producers and consumers of data. Such environments have existed in the financial world for a long time. Kafka differs in that it provides production quality, performant systems with free, open-source code.

kdb+ interface to Kafka

illustration by Sergey Vidyuk

 

Kafka Terminology

If you are unfamiliar with enterprise messaging systems, it is useful to review a summary of the main concepts. This might make library usage easier due to almost direct mapping from the API to the main concepts in Kafka. 

  • Maintains feeds of messages in topics
  • Contains processes that publish messages to topics which are called producers
  • Has processes called consumers that subscribe to topics and process the feed of published messages  
  • Runs as a cluster of a lot more servers each of which is called a broker
  • Topics are broken up into ordered commit logs called partitions. Ordering is only guaranteed within the partition for a topic
  • Multiple consumers can subscribe to the topic and are responsible for managing their own offset within the partition
  • Consumers can be organized into consumer groups
  • Delivery is at least once

Examples

Here are a few minimalistic examples of producer/consumer and the kind of messages that a user would expect in their application. 

Simple Producer

\l kfk.q
// setup config dictionary
kfk_cfg:(!) . flip(
  (`metadata.broker.list;`localhost:9092);		// mandatory
  (`statistics.interval.ms;`10000);
  (`queue.buffering.max.ms;`1);
  (`fetch.wait.max.ms;`10)
  );
producer:.kfk.Producer[kfk_cfg]			 // create producer
test_topic:.kfk.Topic[producer;`test;()!()]	 // create topic for producer

show "Publishing on topic:",string .kfk.TopicName test_topic;
.kfk.Pub[test_topic;.kfk.PARTITION_UA;string .z.p;""];
show "Published 1 message";
producer_meta:.kfk.Metadata[producer];
show producer_meta`topics;

Simple Consumer

\l kfk.q

kfk_cfg:(!) . flip(
    	(`metadata.broker.list;`localhost:9092);         // broker connection
        (`group.id;`0);				        // consumer group
        (`queue.buffering.max.ms;`1);
        (`fetch.wait.max.ms;`10);(`statistics.interval.ms;`10000));

// create consumer
client:.kfk.Consumer[kfk_cfg];
data:();
// override data callback. Empty function by default.
// your despatch and processing logic goes here(and runs on main thread)
.kfk.consumecb:{[msg]
    msg[`data]:"c"$msg[`data];
    msg[`rcvtime]:.z.p;
    data,::enlist msg;}
// subscribe to a topic “test” with automatic partitioning
.kfk.Sub[client;`test;enlist .kfk.PARTITION_UA];

client_meta:.kfk.Metadata[client];
show client_meta`topics;

Message
Normal data message

mtype    | `
topic    | `test
partition| 0i
offset   | 1065
msgtime  | 0Np
data     | "2017.06.07D16:08:51.805544000"
key      | `byte$()
rcvtime  | 2017.06.07D16:08:51.866241000

End of batch message

mtype    | `_PARTITION_EOF
topic    | `test
partition| 0i
offset   | 1076
msgtime  | 0Np
data     | ""
key      | `byte$()
rcvtime  | 2017.06.07D16:08:52.881488000

Summary

In this blog post you have learned some of the main features of libkfk to interface with Kafka, a tiny bit about what Kafka is, and the main terminology to feel comfortable using the library. Minimalistic examples of producer and consumer would be a good starting point to try libkfk in your setup. We encourage you to do that and both file issues and suggest changes on GitHub. It is easy to get started with a local test Kafka broker by following these “setup testing instance” instructions.

Code improvements, examples, documentation and steps to make setup easier are very welcome!

Sources
https://github.com/KxSystems/kafka
https://github.com/edenhill/librdkafka
https://kafka.apache.org/
https://www.slideshare.net/ConfluentInc/deep-dive-into-apache-kafka-66821186
https://www.slideshare.net/jhols1/kafka-atlmeetuppublicv2

 

Sergey Vidyuk is a senior software developer and expert in data management. He works in the Kx R&D team on the next generation of kdb+/q. Prior to joining Kx, Sergey was the CTO at Saxo Bank in London and previously worked for many years as a senior developer in other financial institutions.

SUGGESTED ARTICLES

Kx Insights: Machine learning subject matter experts in semiconductor manufacturing

9 Jul 2018 | , ,

Subject matter experts are needed for ML projects since generalist data scientists cannot be expected to be fully conversant with the context, details, and specifics of problems across all industries. The challenges are often domain-specific and require considerable industry background to fully contextualize and address. For that reason, successful projects are typically those that adopt a teamwork approach bringing together the strengths of data scientists and subject matter experts. Where data scientists bring generic analytics and coding capabilities, Subject matter experts provide specialized insights in three crucial areas: identifying the right problem, using the right data, and getting the right answers.

Transitive Comparison

Kdb+ Transitive Comparisons

6 Jun 2018 | , ,

By Hugh Hyndman, Director, Industrial IoT Solutions. A direct comparison of the performance of kdb+ against InfluxData and, by transitivity, against Cassandra, ElasticSearch, MongoDB, and OpenTSDB