Kdb+ for big data applications

ODBMS.org Interview with Kx Data Engineer Jamie O’Mahony

8 Aug 2017

Kx engineer Jamie O’Mahony was recently interviewed by Roberto Zicari of ODBMS.org about design considerations in large-scale database applications.

What lessons did you learn from using shared-nothing scale-out data architectures to support large volumes of data?

At times “Big Data” is not as big as people think, and therefore the potential for scaling offered by this architecture is not required. To define the terms, “shared-nothing” applies to distributed systems in which each node is independent and does not share memory or disk. “Scale-out” refers to horizontal scaling, that is, increasing the power of a system, typically by adding extra storage or CPUs.

In the financial domain, where datasets can typically exceed hundreds of billions of records, these techniques have been used in production systems for decades, often on single large-memory machines or with very simple architectures.

The speed of access to the data, and the ability to analyze it in an environment that understands time-series data, is critical. In this area, the ability to join time-series data across multiple tables has meant there has been less focus on shared-nothing architectures.
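To make the idea concrete, here is a minimal q sketch of a time-series join using aj (as-of join), which matches each trade with the quote prevailing at or before the trade time. The trade and quote tables and their columns are illustrative assumptions, not taken from the interview.

```q
/ illustrative tables; in practice quote would be sorted by sym and time, with a `p# attribute on sym
trade:([] sym:`AAPL`IBM`AAPL; time:09:30:01 09:30:02 09:30:05; px:151.2 142.1 151.3)
quote:([] sym:`AAPL`AAPL`IBM; time:09:30:00 09:30:04 09:30:01; bid:151.1 151.25 142.0; ask:151.3 151.4 142.2)

/ for each trade, pick up the most recent quote at or before the trade time
aj[`sym`time; trade; quote]
```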

What kind of data infrastructure do you use to support applications?

Data architectures supported by kdb+ are easily extensible. Industry benchmarks have shown that kdb+ can query billions of records very quickly, and these architectures take advantage of the many cores available in production systems.
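As a hedged sketch of the kind of query such an architecture supports, the example below aggregates over a date-partitioned historical trade table; kdb+ map-reduces the aggregation across date partitions and, when started with secondary threads (-s), across cores. The table, columns, and date range are assumptions for illustration.

```q
/ query a date-partitioned historical database (HDB); names and dates are illustrative
/ the by-sym aggregation is map-reduced across date partitions and secondary threads
select vwap:size wavg price, trades:count i by sym from trade
 where date within 2017.01.01 2017.06.30
```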

Lambda architecture is frequently the design paradigm in financial services trading applications due to its simplicity.

This is because, with a Lambda architecture, a system can handle massive quantities of data by combining batch (historical) and stream (real-time) processing methods. Although the term has only recently regained popularity, Kx has been following this methodology for over 20 years.
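A minimal sketch of this batch-plus-stream pattern, assuming a real-time database on port 5011 and a historical database on port 5012 (ports and table names are illustrative, not from the interview): the same q expression is run against each process and the results are combined.

```q
/ illustrative ports; rdb holds today's streaming data, hdb holds the on-disk history
rdb:hopen `::5011
hdb:hopen `::5012
qry:"select last price by sym from trade"
(hdb qry) upsert rdb qry    / historical result overlaid with today's latest values
```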

Using a single unified programming language further simplifies the data architecture. In other applications, with more layers in the technology stack, data is moved between layers, adding latency and complexity and causing both performance and maintenance problems. For communication within the data infrastructure, a series of open-source APIs is provided for integration with other programming languages, as well as JDBC/ODBC interfaces for communication with third-party databases.
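On the q side, exposing functionality to those client APIs can be as simple as listening on a port and defining a function that clients call over IPC. The port, function, and trade table below are hypothetical, shown only to illustrate the pattern.

```q
\p 5010                                   / listen on port 5010 (illustrative)
/ a q function that the language APIs can call over the same IPC protocol;
/ assumes a trade table already exists in this process
getTrades:{[s;st;et] select from trade where sym=s, time within (st;et)}
```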

Do you have some advice for people looking to handle petabyte databases?

It is commonly assumed that a petabyte database requires a massively distributed solution. However, this is not how the problem is solved in the financial services industry.

Using high-end commodity hardware, the largest banks and trading operations build sophisticated systems using very simple, elegant designs on as few machines as possible. When transaction speed is a competitive advantage, only the most efficient solutions survive.

People don’t realize how much can be done with a single machine if a little more thought is expended up front on the design. In my experience, people are too quick to jump on the latest technology bandwagon.

If people rush into implementation too quickly, instead of taking the time to really understand the underlying business logic and to design a system that reflects that logic, they pay the price in performance and in problems maintaining and enhancing the system over the life of the whole application.

What are the typical mistakes made on large scale data projects? In your opinion, how can they be avoided in practice?

The design stage of a large, complex project must be given sufficient importance at the start. An essential part of the process is choosing the appropriate technology. Over the last few years, there have been some popular Big Data “solutions” that organizations have jumped to adopt as the basis of their enterprise systems, without understanding the full implications of their choices. Unfortunately, not only have these systems failed to deliver on their promises, they have also created maintenance nightmares.

It is very important to involve a system administrator from the design stage onward. The system administrator should ensure that the hardware and software are optimized in tandem, and their deep understanding of the system will pay off later when they are needed to troubleshoot issues.

Another critical omission that I have seen in a number of projects is that when software choices were being made, the organization neglected to nominate internal staff to become proficient in the technologies chosen.

How do you ensure data quality?

Ensuring data quality is not a process to be “bolted on” later; it should be an integral element of any Big Data project, both for the initial load of data from multiple legacy sources and later for the incremental loads of current data.

The database chosen plays a role in ensuring data quality. When you have a highly performant database, you can do much more extensive checking before allowing the data into your “master copy.” A high performance database allows in-depth analysis of the new data in the context of the existing system – and makes it much easier to zero in on problems early on.
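As a small, hedged illustration of checking an incremental load before it reaches the master copy, the q sketch below applies a few basic quality checks and only then upserts the new rows. All table and column names are assumptions for illustration.

```q
/ illustrative master table and incremental load
master:([] sym:`symbol$(); time:`second$(); price:`float$())
newdata:([] sym:`AAPL`IBM`MSFT; time:09:30:01 09:30:02 09:30:05; price:151.2 142.1 64.3)

/ basic checks: no null symbols, strictly positive prices, no duplicate (sym;time) rows
checks:(not any null newdata`sym;
        all newdata[`price]>0;
        (count newdata)=count select distinct sym,time from newdata)

/ append to the master copy only if every check passes; otherwise signal for investigation
$[all checks; `master upsert newdata; '"incremental load rejected by data-quality checks"]
```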

 

Jamie O’Mahony is a Senior Kx Solutions Architect who has built a number of complex enterprise database systems over the past four years in financial institutions around the world. Jamie is currently based in New York City.

© 2017 Kx Systems
Kx® and kdb+ are registered trademarks of Kx Systems, Inc., a subsidiary of First Derivatives plc.
