Technologist Peter Simpson, who is responsible for visual data analytics at Datawatch, has contributed a blog post to kxcommunity.com about the role kdb+ plays in his work. Read his post here, and if you are in the NYC area, stop by the next Kx Community NYC Meetup to hear him talk this Tuesday after work.
When I first think of Kx, I think of tick data, central market data teams and trading use cases.
Under the covers I see that the data commonly analyzed is sparse time series, pretty similar to sensor data.
My recent project involves electricity smart meter data, and given the volume and the analytics I want to perform, kdb+ fits perfectly. As expected, I’m looking at sparse time series, but unlike market and trading data, the time granularity is in minutes rather than nanoseconds. Monitoring usage load gets closer to real time, but still no finer than seconds. On the other hand, we monitor and analyze a much larger collection of data series: rather than a few thousand symbols, I’m looking at hundreds of thousands of individual meters.
As with financial data, I’m not interested only in the data itself, but in combining it with various other datasets: other time series such as weather data, and standing data such as geolocation, residence type, heating & cooling characteristics and occupancy type. I then look at absolute values, and more commonly at relative differences and trends across time.
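To make that concrete, here is a minimal q sketch of the pattern (all table and column names here are my own illustrations, not a real schema): left-join standing data onto the readings, then pick up the most recent weather observation for each reading with an as-of join.

```q
/ hypothetical tables: meter readings, standing data keyed by meter,
/ and weather observations per region (sorted by time within region)
readings:([] meter:`m1`m1`m2; time:2015.06.01D00:00 2015.06.01D00:30 2015.06.01D00:15; usage:1.2 1.4 0.9)
meters:([meter:`m1`m2] region:`ny`nj; restype:`house`flat)
weather:([] region:`nj`ny`ny; time:2015.06.01D00:00 2015.06.01D00:00 2015.06.01D00:30; temp:17.2 18.0 18.5)

/ lj attaches the standing data; aj then joins, per region, the latest
/ weather observation at or before each reading's timestamp
aj[`region`time; readings lj meters; weather]
```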
So very quickly I’m back in the big data world, but with the requirement to perform both monitoring and exploratory data analysis, which is where “Rumsfeldism” and Kx need to combine.
The “known knowns” I can work with easily; it is the “known unknowns” and “unknown unknowns” that require exploratory analysis. I’m not just looking for the top and bottom performers; I’m looking for the pattern, the exceptions to the pattern, and how they relate to their peers.
I use Datawatch to visually analyze datasets from kdb+ because of the speed, power and scalability it offers.
The volumes of data are of course gigantic, so I cannot simply pull them into memory on my laptop. And when I render the interactive dashboards to an iPad, I cannot send over the data; it would take far too long. Instead I need to keep the data in the database, and pull out only the data I need for display purposes.
For the “known knowns” I can use pre-defined dashboards and paths through the displays. Behind the scenes I can use parameterized selects or parameterized pre-defined functions, or subscribe to live streaming data from kdb+tick.
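As a sketch of what such a parameterized pre-defined function might look like on the server (the function name, table and columns are my own assumptions):

```q
/ hypothetical readings table
readings:([] meter:`m1`m1`m1`m2; time:2015.06.01D00:00+0D00:10*til 4; usage:1.2 1.4 1.1 0.9)

/ usage for one meter over a window, conflated to a caller-chosen bar size
usageFor:{[m;s;e;bar]
  select avgUsage:avg usage by bar xbar time
    from readings where meter=m, time within (s;e)}

/ a dashboard would call it with its current selection, e.g. 15-minute bars
usageFor[`m1; 2015.06.01D00:00; 2015.06.01D01:00; 0D00:15]
```

The display tool only ever receives the small, pre-aggregated result, never the raw rows.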
For the rest, it is much harder to have pre-defined paths: when I see a data pattern visually, I will want to drill down into that area of interest.
Consequently I use Datawatch to automatically write the q queries I need, based on what I do on screen (e.g. aggregate, conflate, filter). This applies to geospatial work, traditional BI (slicing & dicing of dimensions & measures) and more heavily statistical work. The functionality came out of multiple customer requests; all of them wanted to perform more exploratory analysis of their huge transactional datasets.
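For illustration, a visual action such as “filter to one region, bin hourly, split by residence type” might generate a query along these lines (the schema is hypothetical):

```q
/ hypothetical enriched table: readings already joined with standing data
enriched:([] meter:`m1`m1`m2`m3; region:`ny`ny`ny`nj;
  restype:`house`house`flat`house;
  time:2015.06.01D00:15 2015.06.01D01:20 2015.06.01D00:45 2015.06.01D00:05;
  usage:1.2 1.4 0.9 2.0)

/ filter, conflate to hourly bars, aggregate per residence type
select totalUsage:sum usage, meters:count distinct meter
  by restype, hour:0D01:00 xbar time
  from enriched where region=`ny
```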
Kx allows me to put the “interactive” into “interactive big data analysis”. As I screen my available data universe, I expect a near-instantaneous response. I cannot wait for a Hadoop batch job to run, and I cannot use something like Cassandra, because I need to aggregate, conflate & bin on the fly. Additionally, a key component of the data is its temporal aspect; here again kdb+ shines, in that I can standardize, conflate and fill time series, something I would struggle with in most other “big data” solutions. And of course, I can go from streaming data, to intra-day conflated history, to long-term historical storage with every underlying record.
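A minimal sketch of that standardize–conflate–fill pattern in q (the 15-minute bar size and the table are assumptions for illustration):

```q
/ hypothetical raw readings at irregular times
readings:([] meter:`m1`m1`m2; time:2015.06.01D00:02 2015.06.01D00:33 2015.06.01D00:07; usage:1.2 1.4 0.9)

/ conflate to 15-minute bars per meter
bars:select avgUsage:avg usage by meter, time:0D00:15 xbar time from readings

/ standardize onto a regular grid, then forward-fill gaps per meter
grid:(select distinct meter from readings) cross ([] time:2015.06.01D00:00+0D00:15*til 4)
update fills avgUsage by meter from (`meter`time xasc grid) lj bars
```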
Our dynamic, automatic querying of kdb+ evolves based on customer demand. The latest work suggests that as we show more statistical displays (e.g. distribution curves), we need to be better at writing frequency distribution queries. I find it interesting that I’m rarely looking at all the underlying data; there is just far too much of it. Instead I view sampled, aggregated views of the data universe, and dynamically filter based on areas of interest, either within the visuals themselves or through separate screening criteria. I only drop down into the underlying data when I’ve visually identified an area of interest.
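A frequency distribution query of that kind can stay entirely on the server; for example (bin width and table are assumed for the sketch):

```q
/ hypothetical readings table
readings:([] meter:`m1`m1`m2`m2; usage:1.2 1.4 0.9 2.7)

/ bucket usage into 0.5-unit bins and count per bin, so only the
/ histogram travels to the client, never the raw rows
select n:count i by bin:0.5 xbar usage from readings
```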
The surprising result is how different our output workbooks are from traditional business intelligence reports. The purpose isn’t to tell you what you already know (my biggest sales region is …). Instead it is about gaining insight from the available data landscape, and that tends to result in highlighting the unusual against prior performance.
As we move out from pure Capital Markets use cases, a whole new world of sensor data is appearing, and we’re trying to keep up with demand. The fun part is seeing the use cases in action, knowing that the data refers to something physical. Probably the most exciting area is location analytics, especially with logistics datasets: here we combine the real-time and historical power of Kx with geo-mapping, defining the exact location of a problem, how we got there, and what to expect next. I have spent the last few days talking to suppliers of parking sensors about how they can optimize parking revenue for a city; previously I looked at fracking, mining and trucking, and before that at ATM transactions. The combination of sensors with geolocation seems to be a common trend, whether in logistics, energy, utilities or, in the case of ATMs, finance.