SQL isn’t keeping pace with the demand to quickly analyze large amounts of data
by Joe Celko and Arthur Whitney (July 1, 2002) • Inside Market Data Vol 17 No 39 • July 1, 2002 (New York – London)
“Time is what keeps everything from happening all at once.” —George Carlin
In the real world, time is a continuum, not a point. But in the digital world of ones and zeros, we pretend time occurs in a series of points. It’s a convenient lie, one that enables us to apply digital technologies to all kinds of business operations. In the analysis of market data, for example, trends can be uncovered by looking at price changes over time.
SQL has become the industry standard for querying relational data. Unfortunately, however, analyzing price changes over time is exactly what SQL and most other programming languages don’t do well. If Cobol had had temporal data types, we probably would not have had a Y2K crisis. Today, as the volume of trading transactions continues to grow, the limitations of SQL in querying financial events over time is failing to keep pace with the demand for quick analysis of large volumes of data.
SQL’s weaknesses are particularly apparent when it comes to market data, where traders want to capture tick data on every bid, ask, order and cancellation at as many levels as possible. Market data providers often create added value by supplying tools that can assist with these analyses. Typically these tools run up against three limitations.
First, unless the streaming data is written to storage, it can only be analyzed while it resides in memory–even though traders would like to be able to analyze large amounts of historical, relational data as well as streaming intraday data.
Second, SQL-based queries of relational data are inordinately slow, because SQL is not optimized to handle time-series data.
Finally, as the volume of data grows–particularly on the historical side–existing databases and analytic tools often cannot scale to provide adequate analytic performance. Although SQL was the first language to have temporal data types built in, its time capabilities are limited. Standard SQL provides only simple temporal functions–you can add and subtract units of time and you can compare points in time.
However, SQL is a set-oriented language, and the nature of a set is that it has no ordering–that is, no sense of time. It considers the data as a completed whole that exists all at once. This approach provides a tremendous advantage in dealing with infinite sets in mathematics; for example, you can prove a theorem for all the integers without having to deal with them one at a time.
The bad news is that you can’t easily do sequential processing, such as chart a variable–for example, the price of a stock–over time. No single point on the chart can tell you if the trend is up or down. No single point can give you the rate of change. Nor will the entire set of points tell you these things. The information is revealed only in the sequence in which the points appear. In order to handle the sequencing, SQL must use multiple joins of the data, which only expands the amount of data involved in the query and further impedes performance. SQL can do the job, but not well.
What’s the Solution? Given the limitations of SQL, how can market data providers offer their subscribers comprehensive query capabilities for time series analysis? The challenge is to avoid settling for a partial solution. Assisting customers to expand their decision support for trading and to develop and execute sophisticated trading models that result in competitive advantage should be the paramount concerns.
Here are the solution characteristics to look for:
- A solution that does not split the streaming and historical data, enabling the same tool to be used with both.
- Extremely high (near real-time) performance, even on enormous volumes of data.
- Consistent scalability as data volumes grow to multiple gigabytes and into terabytes.
- Maximum flexibility in data types–the ability to receive and store any type of historical or real-time data.
- Support for user-defined queries.
- Support for open standards: JDBC, ODBC, XML, etc.
- Investment protection–you or your customers should be able to continue using any legacy SQL, Java, C and C++ code as needed.
- Easy interfacing to a variety of third-party statistical packages.
- Support for mobile, desktop, mid-range and mainframe systems.
- Support for Windows, UNIX and Linux operating systems. Solutions that offer all of these capabilities in one package do exist, and they enable market data providers to expand and enhance their services to subscribers.
About the authors: Joe Celko, SQL expert and author of Instant SQL Programming, is an independent consultant based in Austin, Texas. Arthur Whitney is CTO of Kx Systems (kx.com), whose kdb database and KSQL language are used by market data providers and financial services firms worldwide.
This article originally appeared in Inside Market Data, Vol 17, No 39 (July 1, 2002). ©2002 Risk Waters Group. All rights reserved. Reprinted with permission.