Dave Thomas, Chief Scientist at KX Labs, is an authority on enterprise architecture and software engineering. In this interview by KX’s Simon Garland, Dave discusses key considerations in setting up Big Data and machine learning systems. Dave’s number one piece of advice – live in your data and experiment!
Simon Garland: Over the past three decades you’ve watched the growth of the whole “Big Data” industry, and the hype that has surrounded it. While a number of open-source solutions like Hadoop have come in and out of fashion, what advice would you give somebody who is designing a large complex system today?
Dave Thomas: It is always the same when you start something new: you have to conduct experiments. The lesson of the successful Internet companies is to try things. These companies have been investing in these types of experiments for years. For example, MapReduce was one of Google’s better-known experiments – one which they later moved on from, which is why they have made it available.
You need to try living in your data. My advice is to get a huge bucket of data. Take your trillion-record database and try to store it. See how fast it is to read through the whole dataset any way you want – get some sense of what working with a trillion data records (or terabytes) is like. See first-hand what it takes to clean your data, query it, perform analyses and visualizations, and build and train the ML models. This will quickly give you insight into the resources required.
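As a rough illustration of that first experiment, here is a minimal Python sketch (the synthetic records and the toy row count are assumptions, not anything from the interview) that times a full pass over every record. Scaling the row count toward your real volumes gives a crude first estimate of what a complete scan would cost:

```python
import random
import time

# Hypothetical miniature of the "live in your data" experiment:
# generate some records, then time one full scan over all of them.
N_ROWS = 100_000  # stand-in for the trillion-record case

rows = [(i, random.random()) for i in range(N_ROWS)]

start = time.perf_counter()
total = sum(price for _, price in rows)  # a full pass over every record
elapsed = time.perf_counter() - start

rows_per_sec = N_ROWS / elapsed
print(f"scanned {N_ROWS} rows in {elapsed:.4f}s ({rows_per_sec:,.0f} rows/s)")
```

Extrapolating rows-per-second from a run like this to the real dataset size is exactly the kind of back-of-the-envelope resource estimate the experiment is meant to produce.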
For unstructured data, take a trillion records from CSVs and try to use them. If you want an easy way to do that, try kdb+ on demand. It provides a simple way to start getting a feel for what it is like working with Big Data. Go to a data boot camp. Live with your data for a while. Try some tools – MemSQL, Spark, kdb+, Redshift, etc. – and you will get a sense of what the real problems are.
Talk to other customers who have been down the road before you; they have seen the movie and can tell you about the bad endings or what worked well. I recommend you avoid the pain of installing the Apache Stack – it has many different tools which are constantly changing, with new ones being introduced (which is a good reason to explore such systems as a service rather than trying to install and support them internally).
Simon Garland: Where do you think the problems start for those who are planning Big Data applications?
Dave Thomas: The problems come when you go out and procure solutions you don’t understand. If you don’t have some experience living with your data, you end up with a laundry list of technologies from different vendors or analysts. When this goes out to procurement, critical things like performance, usability, cost of ownership, etc. often get reduced to lowest price, with the usual disappointments. Unless one does controlled experiments and develops clear requirements, one can waste a lot of time and money.
It is well known in the data warehouse world that it takes a year to design your star schema and then another year to extract, transform and load your data into your data warehouse. With Big Data, by the time you’ve imported and cleaned your data, the requirements have quite likely changed under your feet. Unfortunately, while many joke about this, it is a fact that data science is 80% cleaning the data so you can use it and only 10% actual data science. The final 10% is explaining the application to those who need to take some action from the results. Unless you understand where the time and money go, making an informed choice is impossible.
For example, if you have loaded up a lot of financial data, and business analysts come along later expecting to be able to mine the metadata (e.g. people’s names and background information), you will find that bolting that on creates an enormous mess. Financial data is usually relatively clean, thanks to accounting and audits. However, the customer information is often a really messy proposition. One company I talked with recently feels it could take millions of dollars and a couple of years to clean their data so they can use it for machine learning.
Learn how to write queries. Neither business people nor computer types are very experienced at writing the kinds of queries that let you really play in your large data. It takes experience to write queries which make sense and which execute efficiently. Not many people know how to write non-trivial queries. Everyone can write a simple select, but joins or time-series calculations are much harder. In my view the future of programming depends very much on one’s ability to write queries.
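To make the "beyond a simple select" point concrete, here is a small hypothetical example using Python's built-in sqlite3 (the table names, symbols, and prices are invented for illustration): a join plus a grouped aggregate, which is roughly where many query writers start to struggle.

```python
import sqlite3

# Toy schema: a trades table and a reference table to join against.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trades (sym TEXT, price REAL, size INTEGER);
    CREATE TABLE ref (sym TEXT, sector TEXT);
    INSERT INTO trades VALUES ('AAPL', 150.0, 100), ('AAPL', 151.0, 200),
                              ('MSFT', 300.0, 50);
    INSERT INTO ref VALUES ('AAPL', 'tech'), ('MSFT', 'tech');
""")

# Volume-weighted average price per sector: a join plus an aggregate,
# not just a simple select.
rows = conn.execute("""
    SELECT r.sector,
           SUM(t.price * t.size) / SUM(t.size) AS vwap
    FROM trades t
    JOIN ref r ON r.sym = t.sym
    GROUP BY r.sector
""").fetchall()
print(rows)  # → [('tech', 172.0)]
```

Being able to read a query like this back in plain language ("the volume-weighted average price per sector") is exactly the skill being described.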
Set some simple goals. Don’t be surprised if your experiments take two or three months. There is no easy way to get someplace you have never been before. Beware of industry and vendor hype. It will take you longer to read their propaganda than to run the experiments which give you concrete information about your data and needs.
Simon Garland: What about running benchmarks to look at what’s going on? We often see benchmarking done by someone who has no experience doing it, and they do things like just taking the best result of three, or just the last result, which of course simply comes from cache.
Dave Thomas: The white paper industry churns out statistics by the ton. There are lots of cruel and unnatural acts that vendors perform to look better than their competitors. Trying to get meaningful benchmarks is nearly impossible, that is why they are often called “bench lies.” Back to experiments, it is always best to test with your own data sets and explore what storage, IO and processing you need. Be conscious that a poorly expressed query can make a great product look bad, and that a fast parallel engine won’t perform well on an aggressively serial query. Do your own experiments with your own data!
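The "best result of three" trap is easy to demonstrate. This is a hypothetical simulation (the cache, the key name, and the sleep are all invented stand-ins for real query machinery): the first run pays the cold-cache cost, and every later run is artificially fast, so reporting only the fastest run hides the number that matters.

```python
import statistics
import time

# Simulated query engine: the first access to a key is expensive,
# repeated accesses hit a warm cache.
cache = {}

def query(key):
    if key not in cache:          # cold path: simulate expensive work
        time.sleep(0.05)
        cache[key] = key * 2
    return cache[key]

timings = []
for _ in range(5):
    start = time.perf_counter()
    query("hot_key")
    timings.append(time.perf_counter() - start)

print(f"first (cold):  {timings[0]:.4f}s")
print(f"best of rest:  {min(timings[1:]):.6f}s")
print(f"median of all: {statistics.median(timings):.6f}s")
```

Reporting all runs plus a robust statistic such as the median, rather than the single best number, is one small guard against producing "bench lies" of your own.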
Simon Garland: Another thing that is really important, and gets overlooked, is that when someone starts one of these projects there is an expectation it should be up and running in a short time — and they haven’t brought in their business experts early enough. Without building on in-house expertise they often find out – too late – that crazy decisions were made at the design stage.
Dave Thomas: When you are going to experiment, make sure the business-domain people vet your experiment. If you test how long it takes to read the file, the business people will say they don’t need that, they need sliding-window joins, for example. Unless you validate the criteria used you will build something nobody wants. Unfortunately this requires a partnership between business and IT stakeholders which is often missing. The same relationship is needed between development and production.
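A sliding-window (as-of) join of the kind the business people might ask for can be sketched in a few lines. This is a hypothetical toy (the quote and trade data, and the `asof_quote` helper, are invented for illustration): for each trade, find the most recent quote at or before the trade time.

```python
import bisect

# Toy data: (time, quote_price) and (time, trade_size), sorted by time.
quotes = [(1, 99.5), (3, 99.7), (7, 100.1)]
trades = [(2, 100), (5, 300), (8, 50)]

quote_times = [t for t, _ in quotes]

def asof_quote(trade_time):
    # Index of the latest quote at or before trade_time, or None.
    i = bisect.bisect_right(quote_times, trade_time) - 1
    return quotes[i][1] if i >= 0 else None

joined = [(t, size, asof_quote(t)) for t, size in trades]
print(joined)  # → [(2, 100, 99.5), (5, 300, 99.7), (8, 50, 100.1)]
```

Testing how long a raw file scan takes tells you nothing about whether an operation like this is fast or even expressible in a candidate tool, which is exactly why the business-domain people need to vet the experiment.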
Even those who carefully figure things out ahead of time can be caught out when they find that their hardware choices have been dictated by procurement, not by performance requirements. For example, we have often seen people with massive machines with many CPUs get disappointing performance. That is why you hear about sophisticated users running analyses on their laptops rather than on one of the company clusters.
Few people grasp how cheap 10 or 100 TB are now. The value from investing in hardware, versus the cost of development, is huge. If you can put all of your data on one machine in memory, the performance is spectacular. Cloud instances now offer 1 TB or more, and NVMe memory will move that to 10–100 TB in time.
Simon Garland: Do you want to talk some sense about Machine Learning?
Dave Thomas: I lived through the AI summer, and then the AI winter, so I’m cautious when I hear AI or Cognitive Computing. I even owned a Symbolics Lisp machine in the 1980s! Our expectations for AI are still too high – witness smart speakers. It is amazing what companies like Google, who have a lot of data and a lot of processing, can do with ML. But one has to be aware of the investments they have made to get there. Fortunately these companies are sharing many of their assets via open source and/or service offerings. Good ML needs lots of data, and even more compute.
There are many online courses, however I believe the real secret to ML is understanding the domain (as it is for other areas of problem solving). So often, people who know little about the domain just think that with ML we are now faster and the machine will figure it all out. Garbage in, garbage out is as true today as it was 40 years ago. At its heart, most ML is still a modeling problem, and you need good models to get good results.
There is a lot of useful stuff, but most of it is statistical modeling and optimization – much of which has been around for a while (which is a good thing!). The difference is the amount of compute power and memory we can now throw at the problem. The deep-learning side is not really about learning – it is about optimization. It works really well when the data you apply it to looks like the data it was trained on. But if you are looking to predict Black Swan events like the oil crisis or the credit default swap crisis, ML can’t predict those sorts of things.
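The point that much of "learning" is really optimization can be shown in miniature. This is a sketch under stated assumptions (the data points and learning rate are invented): fitting a line y = w·x by gradient descent on squared error, which is numerical optimization, not anything resembling understanding.

```python
# Toy data roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w = 0.0      # single model parameter (the slope)
lr = 0.01    # learning rate

for _ in range(500):
    # Gradient of mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

print(f"fitted slope: {w:.2f}")  # converges near 2.0
```

Deep learning does the same thing with millions of parameters and far more compute, which is why it interpolates well on data that resembles its training set and has nothing to say about Black Swan events outside it.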
Statistics are good; modeling and fitting functions on data is good; matching with the data you have is good. But consider cyber applications, where humans keep changing how they operate: if you base your cyber strategy on how things have happened in your existing historical data sets, you will be making predictions you can’t justify.
Simon Garland: To users it will seem like an eternity, but these ML projects on Big Data can take several years to be truly productive. What issues do you anticipate?
Dave Thomas: The first challenge is having an understanding of the domain so one can build useful and efficient models. The second is getting sufficient clean and labeled data, without bias, to serve as ground truth and train the models. The final challenge is our ability to understand what deep learning is really doing and interpret the accuracy of the results. How do you quantify them? Just labeling data as true or false can be a huge effort, and anything more complex can require a great deal of clever coding, which is a tremendous amount of work. Testing it is a challenge in itself.
Any time you use AI or ML, people’s expectations are that it will read their minds. That is because they attribute some sort of intelligence or intuitive capability to the technology that simply isn’t there. I recently talked to someone in a regulatory environment who was impressed by ML. As long as their ML algorithms agree with their existing calculations they can use them, but they can’t use ML unless they know the system is reasoning correctly. But with many complex systems that’s not always possible – ML may be giving wrong or suboptimal answers and there’s no way to know.
It is very exciting, and the work we are doing is very promising, but ML is something that demands a lot of domain expertise and know-how to apply the ML tools effectively. You can easily burn a lot of resources doing the wrong ML experiments/deployments, or choosing ML approaches which are not suited to your domain. The people who do well are going to have to invest a lot to get on top of it.
Simon Garland: When you think as an educator, what skills do you think we should be teaching the next generation of children?
Dave Thomas: What we want is to teach children creativity as well as the scientific method and problem solving. It’s now so important that people get comfortable working with data, probability and statistics along with some basic computation skills. Anything that works on the creative side is good, and it’s better if it encourages self-esteem and a sense of accomplishment. However, this demands that young people have effective reading, writing, and increasingly, visual communication skills.
The biggest thing I see is that we don’t teach people how to express queries. The future of computing is queries. Most programs will substantially be composed of queries. Even with Google Search, people don’t understand complex queries. Not enough people know how to write queries.
Working with data should be a fundamental skill. Exploring datasets, and then learning how to write queries. It is not about the programming language. You have to understand natural language and how it corresponds to a query language. Too often, even when someone is given an SQL query, they can’t tell you what it is asking. We all need to know how to ask questions well.
As Chief Scientist, Dave is involved in the development of KX products and solutions, as well as envisioning their future direction. Dave has had a long and storied career in software development and is perhaps best known as the founder and past CEO of Object Technology International, formerly OTI, now IBM OTI Labs, a pioneer in Agile Product Development. He was the principal visionary and architect for IBM VisualAge Smalltalk and Java tools and virtual machines including the popular open-source, multi-language Eclipse.org IDE. As a cofounder of Bedarra Research Labs he led the creation of the Ivy visual analytics workbench. Dave is a renowned speaker, adjunct professor and Chairman of the Australian developer YOW! Conferences.