Kdb+ for big data applications

ODBMS.org Interview with Kx Data Engineer Jamie O’Mahony

8 Aug 2017

Kx engineer Jamie O’Mahony was recently interviewed by Roberto Zicari of ODBMS.org about design considerations in large-scale database applications.

What lessons did you learn from using shared-nothing scale-out data architectures to support large volumes of data?

At times “Big Data” is not as big as people think, and the scaling potential offered by this architecture is therefore not required. To define the terms, “shared-nothing” applies to distributed systems in which each node is independent and shares neither memory nor disk, while “scale-out” refers to horizontal scaling: increasing the capacity of a system, typically by adding extra storage or CPUs.

In the financial domain, where datasets routinely run to hundreds of billions of records, these volumes have been handled in production systems for decades, often on single large-memory machines or with very simple architectures.

The speed with which data can be retrieved and analyzed, in an environment that understands time series, is critical. Because the ability to join time-series data across multiple tables matters so much in this area, there has been less focus on shared-nothing architectures.
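To make the kind of join in question concrete: in q the canonical time-series join is aj, the asof join, which pairs each record with the most recent matching record from a second table. A minimal sketch with small in-memory tables (the trade/quote schemas and values are illustrative assumptions):

/ illustrative trade and quote tables
trade:([] time:09:30:01 09:30:03 09:30:05; sym:`AAPL`AAPL`MSFT; price:150.1 150.2 70.5)
quote:([] time:09:30:00 09:30:02 09:30:04; sym:`AAPL`MSFT`AAPL; bid:150.0 70.4 150.15)

/ for each trade, attach the most recent quote at or before its time, per symbol
aj[`sym`time;trade;quote]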

What kind of data infrastructure do you use to support applications?

Data architectures supported by kdb+ are easily extensible. Industry benchmarks have repeatedly shown kdb+ querying billions of records very quickly, and these architectures take advantage of the many cores available in production systems.

Lambda architecture is frequently the design paradigm in financial services trading applications due to its simplicity.

This is because, with a Lambda architecture, a system can handle massive quantities of data by combining batch (historical) and stream (real-time) processing. Although the term has only recently regained popularity, Kx has been following this methodology for over 20 years.
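In kdb+ terms, the stream layer is typically an in-memory real-time database and the batch layer an on-disk historical database, with a gateway combining the two to answer a single query. A minimal sketch in q, in which the port numbers and the trade schema are illustrative assumptions:

/ stream layer: today's data in memory; batch layer: partitioned historical data
rdb:hopen `::5010
hdb:hopen `::5012

/ total traded volume per symbol from a start date up to now, across both layers
volBySym:{[syms;start]
  hist:hdb({select sum size by sym from trade where date within (x;.z.D-1), sym in y};start;syms);
  live:rdb({select sum size by sym from trade where sym in x};syms);
  select sum size by sym from (0!hist),0!live }

/ usage: volBySym[`AAPL`MSFT;2017.08.01]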

Using a single, unified programming language further simplifies the data architecture. In applications with more layers in the technology stack, data is moved between layers, adding latency and complexity and causing both performance and maintenance problems. For communication within the data infrastructure, a series of open-source APIs is provided for integration with other programming languages, as well as JDBC/ODBC interfaces for communication with third-party databases.

Do you have some advice for people looking to handle petabyte databases?

It is commonly assumed that a petabyte database requires a massively distributed solution. However, this is not how the problem is solved in the financial services industry.

Using high-end commodity hardware, the largest banks and trading operations build sophisticated systems using very simple, elegant designs on as few machines as possible. When transaction speed is a competitive advantage, only the most efficient solutions survive.

People don’t realize how much can be done with a single machine if a little more thought is expended up front on the design. In my experience, people are too quick to jump on the latest technology bandwagon.

If people rush into implementation too quickly, instead of taking time to really understand the underlying business logic and to design a system that reflects that logic, they pay the price in performance and in problems maintaining and enhancing the system over the life of the application.

What are the typical mistakes made on large scale data projects? In your opinion, how can they be avoided in practice?

The design stage of a large, complex project must be given sufficient importance from the start. An essential part of the process is choosing the appropriate technology. Over the last few years there have been some popular Big Data “solutions” that organizations have rushed to adopt as the basis of their enterprise systems, without understanding the full implications of their choices. Unfortunately, not only have these systems failed to deliver on their promises, they have also created maintenance nightmares.

It is very important to involve a system administrator from the design stage onward. The system administrator should ensure that the hardware and software are optimized in tandem, and their deep understanding of the system will pay off later when they are needed to troubleshoot issues.

Another critical omission I have seen in a number of projects is that, when software choices were being made, the organization neglected to nominate internal staff to become proficient in the chosen technologies.

How do you ensure data quality?

Ensuring data quality is not a process to be “bolted on” later; it should be an integral element of any Big Data project, both for the initial load of data from multiple legacy sources and later for the incremental load of current data.

The database chosen plays a role in ensuring data quality. With a highly performant database, you can do much more extensive checking before allowing the data into your “master copy.” A high-performance database allows in-depth analysis of the new data in the context of the existing system, and makes it much easier to zero in on problems early.
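As an illustration, incoming batches can be validated before they are ever appended to the master tables. A minimal sketch in q, where the trade schema, column names and checks are illustrative assumptions rather than a prescribed process:

/ reject a batch that fails any basic quality check
validate:{[t]
  if[any null t`price;                    '"null prices in batch"];
  if[not all t[`size]>0;                  '"non-positive trade sizes"];
  if[(count t)<>count distinct t`tradeID; '"duplicate trade IDs"];
  t }

/ only a batch that passes every check reaches the master copy
`master upsert validate newBatch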


Jamie O’Mahony is a Senior Kx Solutions Architect who has built a number of complex enterprise database systems over the past four years in financial institutions around the world. Jamie is currently based in New York City.
