Machine Learning and the Value of Historical Data

2 August 2018 | 6 minutes

By Przemek Tomczak

Data is being generated at a faster rate now than ever before. IDC has predicted that by 2025, 163 zettabytes of data will be generated each year, roughly ten times the 16.1 zettabytes created in 2016. These high rates of data generation are partially an outcome of the multitude of sensors found on Internet of Things (IoT) devices, the majority of which are capable of recording data many times per second. IHS estimates that the number of IoT devices in use will increase from 15.4 billion in 2015 to 75.4 billion in 2025, indicating that these immense rates of data generation will continue to climb in the years to come.

The challenge and cost of Big Data

Even though the amount of data is increasing, our ability to derive value from all of it is not. For many companies and applications, it is costly to work with and to maintain such large amounts of data.

Additionally, doing so may slow down the systems in which it is stored. As a result, many companies are left with no choice but to summarize, aggregate, archive, delete, or prune their data in order to reduce costs and to meet service level requirements.

There is a significant opportunity cost to this approach, as machine learning algorithms depend on working with a large amount of raw historical data in order to detect and predict events.

For instance, suppose that sensor data from mechanical equipment is thrown away or aggregated. If a data pattern associated with a failure condition is discovered later, it may not be possible to look for that pattern in past records, since the raw data is no longer accessible.
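The effect of aggregation can be illustrated with a minimal Python sketch. It assumes one synthetic vibration reading per second, a five-second abnormal burst, and a hypothetical alert threshold; none of these values come from real equipment.

```python
from statistics import mean

# Synthetic, illustrative data: one vibration reading per second for an hour.
# All values are invented for this sketch.
BASELINE = 1.0   # typical vibration amplitude (arbitrary units)
SPIKE = 6.0      # amplitude during a short abnormal burst
THRESHOLD = 3.0  # hypothetical alert level

raw = [BASELINE] * 3600
for second in range(1800, 1805):   # a 5-second burst half-way through the hour
    raw[second] = SPIKE


def exceeds(values, threshold):
    """Return True if any value is above the threshold."""
    return any(v > threshold for v in values)


# Aggregate to one-minute averages, as a storage-saving policy might do.
minute_means = [mean(raw[i:i + 60]) for i in range(0, len(raw), 60)]

raw_detects = exceeds(raw, THRESHOLD)
aggregated_detects = exceeds(minute_means, THRESHOLD)

print("burst visible in raw data:       ", raw_detects)
print("burst visible in minute averages:", aggregated_detects)
```

In this sketch the burst crosses the threshold in the raw readings, but the one-minute average that contains it stays well below the threshold, so the pattern can no longer be recovered from the summary alone.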

Furthermore, if an analysis is conducted on only a few days or weeks of data, some anomalies may not be detectable in such a short period, even though they would become apparent over several months.
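A slow drift is one kind of anomaly that behaves this way. The sketch below assumes synthetic daily readings with a small upward drift and a deterministic wobble standing in for measurement noise. The detection rule (fitted change across the window greater than twice the scatter around the fitted line) is an arbitrary rule of thumb chosen for illustration, not an established test.

```python
import math


def fit_line(values):
    """Least-squares line through the points (0, values[0]), (1, values[1]), ...

    Returns (slope, residual standard deviation).
    """
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in enumerate(values)]
    resid_std = math.sqrt(sum(r * r for r in residuals) / n)
    return slope, resid_std


def drift_detected(values):
    """Flag a drift when the fitted change across the window exceeds
    twice the scatter around the fitted line (illustrative rule only)."""
    slope, resid_std = fit_line(values)
    return abs(slope) * (len(values) - 1) > 2 * resid_std


# Synthetic daily readings: a drift of 0.02 units per day plus a wobble.
readings = [50 + 0.02 * day + math.sin(1.7 * day) for day in range(180)]

two_weeks = drift_detected(readings[:14])
six_months = drift_detected(readings)

print("drift flagged using 14 days: ", two_weeks)
print("drift flagged using 180 days:", six_months)
```

With these synthetic numbers, the two-week window does not trip the rule while the six-month window does, because the accumulated drift only outgrows the day-to-day scatter over the longer history.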

Machine learning, Big Data and innovation

Although summarizing or aggregating data may be useful for reporting, these summaries cannot be used for machine learning, which requires raw historical data in its entirety. In order to create an algorithm that detects patterns or deviations, a model must first be trained using historical data, and then retrained based on new parameters or new data sets. Thus, throwing away data limits training and, by extension, the innovation of new models as well.

With the advent of Python, R, and proprietary technologies, there is tremendous growth in the ecosystem of machine learning algorithms. This makes it possible for algorithms that were developed for use in a particular industry or application to be applied to new problems and data sets. Some analytics tools can run many algorithms against the same data at the same time in order to determine which one is best for a given use case. However, for any of these approaches to be effective, they require fast access to high-quality raw data.
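The idea of trying several algorithms against the same history and keeping the best one can be sketched in a few lines of Python. The three forecasters below are deliberately naive stand-ins for real machine learning algorithms, the history is synthetic, and the names (persistence, moving_average, overall_mean, backtest) are hypothetical helpers introduced only for this example.

```python
from statistics import mean

# Synthetic history: a steadily rising load curve (invented for illustration).
history = [10 + 0.5 * t for t in range(50)]


# Each candidate takes the values seen so far and forecasts the next one.
def persistence(past):
    return past[-1]


def moving_average(past, window=5):
    return mean(past[-window:])


def overall_mean(past):
    return mean(past)


candidates = {
    "persistence": persistence,
    "moving_average": moving_average,
    "overall_mean": overall_mean,
}


def backtest(model, series, warmup=10):
    """Mean absolute error of one-step-ahead forecasts replayed over history."""
    errors = [abs(model(series[:t]) - series[t]) for t in range(warmup, len(series))]
    return mean(errors)


scores = {name: backtest(model, history) for name, model in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:15s} mean absolute error = {score:.3f}")
print("best candidate on this history:", best)
```

On this purely linear series the persistence forecaster comes out ahead, which says nothing about real workloads. The point is only that every candidate is scored by replaying it over the retained raw history, which is impossible if that history has been discarded.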

In machine learning, it is not sufficient to have a static model. Rather, the goal should be to use a model that is continuously updated based on new information, since there is tremendous value in rerunning and testing different algorithms. For this to be possible, however, the model must be rerun while incorporating historical information. For instance, if a new parameter that indicates a failure condition were discovered, the model would need to be rerun against past data to account for it. This requires the ability to link or correlate multiple data sets. Energy consumption data, for example, can be related to other data sets such as TV viewing habits around major sports events: Canadian utility BC Hydro has measured a 4% drop in electricity usage in the province during the Stanley Cup.
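Linking two data sets on time can be sketched as follows. The consumption readings and the broadcast schedule are invented for illustration and are not BC Hydro data. The lookup assumes the events are sorted by start time and do not overlap.

```python
from bisect import bisect_right
from statistics import mean

# Hypothetical hourly consumption readings: (hour of day, load in MW).
readings = [
    (17, 100.0), (18, 102.0), (19, 96.0), (20, 95.0), (21, 97.0), (22, 101.0),
]

# Hypothetical broadcast schedule: (start hour, end hour), end exclusive.
events = [(19, 22)]
event_starts = [start for start, _ in events]


def during_event(hour):
    """Return True if the hour falls inside any scheduled event."""
    i = bisect_right(event_starts, hour) - 1   # last event starting at or before hour
    return i >= 0 and hour < events[i][1]


# Join the two data sets on time, then compare the groups.
in_event = [load for hour, load in readings if during_event(hour)]
outside = [load for hour, load in readings if not during_event(hour)]

change_pct = (mean(in_event) - mean(outside)) / mean(outside) * 100
print(f"load during event vs. outside: {change_pct:+.1f}%")
```

With these invented figures the load during the event is about 5% lower than outside it. A real analysis would need to control for time of day, weather, and other factors before attributing the difference to the event.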

Machine learning meets Industrial Internet of Things

One application of machine learning is detecting the formation of bubbles, or cavities, in the liquid inside a water pump. When this occurs, it can result in significant damage to the pump and connected equipment. To detect the formation of these bubbles, pumps are typically outfitted with sensors that measure water pressure as well as motor vibration. Historical data from these sensors, depicting both normal and failure pump conditions, is used to train the algorithms. If a failure occurs in the future, the failure prediction model can be retrained on the newly recorded data to help improve its accuracy.
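The train-then-retrain cycle described above can be sketched with a one-nearest-neighbour classifier, chosen here only because it fits in a few lines. It is a stand-in, not the method used in any real pump monitoring system. The pressure and vibration values, their units, and the labels are all invented, and the two features are left unscaled for simplicity.

```python
import math


def nearest_label(training, sample):
    """1-nearest-neighbour: return the label of the closest training point."""
    closest = min(training, key=lambda item: math.dist(item[0], sample))
    return closest[1]


# Invented historical readings: ((pressure, vibration), label).
historical = [
    ((5.0, 1.0), "normal"), ((5.2, 1.2), "normal"), ((4.8, 0.8), "normal"),
    ((3.0, 4.0), "failure"), ((3.2, 4.2), "failure"), ((2.8, 3.8), "failure"),
]

# A reading taken shortly before a failure that the original data never covered.
early_warning = (4.6, 2.0)
before = nearest_label(historical, early_warning)

# Once the failure has been investigated, similar readings are labelled and
# added, and the "model" (here just the stored training set) is rebuilt.
newly_labelled = [((4.5, 2.2), "failure"), ((4.7, 1.9), "failure")]
retrained = historical + newly_labelled
after = nearest_label(retrained, early_warning)

print("prediction before retraining:", before)
print("prediction after retraining: ", after)
```

Before retraining, the early-warning reading is closest to a normal example. After the newly labelled readings are added, it is classified as a failure. Retraining of this kind is only possible if both the original history and the new raw readings have been kept.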

Another example is the application of support vector machine (SVM) models to predict cascading blackouts. This model is trained using historical information regarding past blackouts and transformer outages, as well as grid data such as voltage or power flow measurements (Gupta, Kambli, Wagh & Kazi, 2015). As a result, this model’s predictions are able to assist with proactive grid maintenance and blackout prevention. However, if new information arises, such as a new indication of a power failure or the occurrence of a major blackout, the model would need to be retrained in order to incorporate this new insight.

Enabling continuous machine learning

Continuous machine learning depends on several factors: having timely access to raw historical information; the ability to link or relate disparate data sets based on time; and the integration of popular machine learning libraries and analytics environments.

KX has made it considerably easier to connect common machine learning tools with historical data, using our Fusion interface. When combined with our experience storing and analyzing historical data for the capital markets industry, we are re-training models and executing algorithms faster than ever before. KX is also helping companies to significantly reduce the cost of retaining this abundance of historical data, through innovative data storage technologies, support for low-cost storage media, and advanced compression algorithms.

Przemek Tomczak is Senior Vice-President of Internet of Things and Utilities at KX Systems. For over twenty-five years, KX has been providing the world’s fastest database technology and business intelligence solutions for high velocity and large data sets. Previously, Przemek held senior roles at the Independent Electricity System Operator in Ontario, Canada, and at top-tier consulting firms and systems integrators. Przemek is also a CPA with a background in business, technology, and risk management.


Gupta, S., Kambli, R., Wagh, S., & Kazi, F. (2015). Support-Vector-Machine-Based Proactive Cascade Prediction in Smart Grid Using Probabilistic Framework. IEEE Transactions on Industrial Electronics, 62(4), 2478-2486. doi: 10.1109/tie.2014.2361493

IDC. (2017). Data Age 2025 (p. 3). IDC.

Industrial Internet Consortium. (2017). Analytics Framework (p. 26). Industrial Internet Consortium.

IHS Technology. (2016). IoT Platforms: Enabling the Internet of Things (p. 5). IHS Technology.
