Robert Hill is a director of the Northern Ireland Space Office and chair of the Northern Ireland Space Special Interest Group (NISSIG) hosted by the UK Aerospace, Defence, Security and Space trade membership group ADS. This interview follows from a previous conversation on the impact of big data on space observation and exploration.
Q: Towards the end of our last conversation you said the impact of data was as big on astronomy as on earth observation. Could you explain that further?
A: Possibly the most obvious effect is in the way in which astronomers actually work. Originally they observed the sky directly, via telescopes, and compared what they saw with previous images. Today they view it indirectly, in the form of data – data that may be rendered as images for viewing in the traditional way, but more commonly data in its raw format, and with incredible detail that enables much deeper insight into what forms our universe, and how it may have evolved.
Q: So the fact that the observation has become indirect hasn’t compromised insight?
A: No, quite the opposite. As I said, the data we can gather today is amazingly detailed and allows us to ask – and answer – much deeper questions than the relatively limited ones of old such as “Did it move or did it not move?” As well as that, the data can be shared and correlated more easily across multiple disciplines, so we now have the combined insights of scientists, from cosmologists to geologists and meteorologists, all peering into oceans of not just observational data but all the accompanying metadata as well. All that collaboration is bringing much deeper, cross-discipline insight into what we can learn from data.
Q: You mentioned metadata. Why has that become so important?
A: Simply because the data about the data can be just as revealing as the data itself. It’s like surveillance investigations, where knowing who talks to whom, where, when and for how long can be almost as important as knowing what they actually said. It can have the same value in our domain.
Take the FITS files from the European Space Agency’s Gaia project, for example. By analyzing not only the images they capture but also the metadata like latitude, longitude, time and other fields in their headers, we can begin to make inferences about things like, say, the recessional velocity of what we are observing. In fact, that is something we did in Kx as FITS is an array format that’s perfectly suited to the way kdb+ processes data. It’s that breadth and volume of data, including metadata, that contains so much potential information.
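To make the point about header metadata concrete, here is a minimal Python sketch of reading keyword/value pairs from a FITS header. It relies only on the layout defined by the FITS standard (80-character "cards" packed into 2880-byte blocks, ending with an END card). The helper names, the sample header and its values (DATE-OBS, RA, DEC) are hypothetical and for illustration only; they are not taken from Gaia data, and this is not the kdb+ approach mentioned above. Quoted strings containing escaped quotes and other corner cases of the standard are deliberately ignored.

```python
CARD_LEN = 80     # each header "card" is a fixed 80-character record
BLOCK_LEN = 2880  # headers are padded with spaces to a multiple of 2880 bytes


def make_card(keyword, value):
    """Build one 80-character card (hypothetical helper for the demo header)."""
    return f"{keyword:<8}= {value}".ljust(CARD_LEN)


def convert(token):
    """Turn an unquoted value into a bool, int or float, else keep the text."""
    if token in ("T", "F"):
        return token == "T"
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass
    return token


def parse_header(raw):
    """Return a dict of keyword -> value from raw FITS header bytes."""
    text = raw.decode("ascii")
    header = {}
    for start in range(0, len(text), CARD_LEN):
        card = text[start:start + CARD_LEN]
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":
            continue  # skip COMMENT, HISTORY and blank cards
        field = card[10:]
        if field.lstrip().startswith("'"):
            # Quoted string: take the text between the first pair of quotes.
            header[keyword] = field.split("'")[1].rstrip()
        else:
            # Everything after "/" is a human-readable comment.
            header[keyword] = convert(field.split("/")[0].strip())
    return header


# A synthetic header with made-up observation metadata.
cards = [
    make_card("SIMPLE", "T"),
    make_card("BITPIX", "16"),
    make_card("NAXIS", "0"),
    make_card("DATE-OBS", "'2015-01-01T00:00:00'"),
    make_card("RA", "123.456 / hypothetical right ascension, degrees"),
    make_card("DEC", "-45.5 / hypothetical declination, degrees"),
    "END".ljust(CARD_LEN),
]
raw = "".join(cards).encode("ascii")
raw += b" " * (-len(raw) % BLOCK_LEN)  # pad to a whole 2880-byte block

header = parse_header(raw)
print(header)
```

In practice an established library would be the sensible choice; the sketch is only meant to show that position and time metadata sit in plain, fixed-width text ahead of the image data, which is what makes them easy to extract and correlate.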
Q: And how much of that type of data is available?
A: Lots. In fact, if you want an example of Big Data, this area is probably one of the biggest you can get.
Take the LSST, the Large Synoptic Survey Telescope project, for example. It is expected to be fully operational in 2021 when it will capture a complete image of the southern sky every three days (by way of contrast the Hubble telescope would have taken 120 years to do the same thing). In the process it will accumulate 15TB of data every night.
Then consider the SKA, the Square Kilometer Array. It is expected to produce 160TB of data per second! That will rapidly bring us into the region of exabytes per day, and zettabytes per year! To put that in somewhat more intuitive terms it’s estimated that when it is fully operational it will produce more data per day than the entire planet currently does in a year!
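The step from 160TB per second to exabytes per day and zettabytes per year is simple arithmetic, sketched below in Python. It assumes decimal units (1 TB = 10^12 bytes) and, as a simplification not stated in the interview, that the quoted rate is sustained around the clock for a 365-day year.

```python
# Decimal (SI) storage units, in bytes.
TB, EB, ZB = 10**12, 10**18, 10**21

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365

# Rate quoted for the SKA; continuous operation is an assumption.
rate_bytes_per_second = 160 * TB

bytes_per_day = rate_bytes_per_second * SECONDS_PER_DAY
bytes_per_year = bytes_per_day * DAYS_PER_YEAR

exabytes_per_day = bytes_per_day / EB
zettabytes_per_year = bytes_per_year / ZB

print(f"{exabytes_per_day:.1f} EB per day")       # 13.8 EB per day
print(f"{zettabytes_per_year:.2f} ZB per year")   # 5.05 ZB per year
```

Under those assumptions the quoted rate works out to roughly 14 exabytes per day and about 5 zettabytes per year, consistent with the orders of magnitude mentioned.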
The upside of all this data, and the fact that it is democratized and federated, is that we are in a world of the virtual observatory where the limitations of geography have been eliminated. The challenge that replaces them is how to manage those volumes and, as I said before, turn that data into information and knowledge. This will require a robust capture and compute platform that few technologies today can provide.
Q: So it’s a classic Big Data challenge?
A: Yes, and there is also the complexity to consider, especially at a multispectral level. Go back to the telescope where, by definition, you were dealing with just a subset of the electromagnetic spectrum – the visible part. Now we are capturing the full range from low-frequency microwaves right through to high-frequency gamma rays. So what’s needed is technology, like kdb+, that can cope with those massive volumes, their origin from multiple sources, their many different formats – and then make that data usable. The great thing about kdb+ is that it has been doing this for years, and just continues to improve with new compression techniques, additional data storage options and integration with the latest machine learning technologies for extracting value from that data.
Q: Could you give an example of the diverse types of data challenges you see?
A: Consider what we call “time domain astronomy”, which refers to the fact that there are certain events that occur very infrequently but, when they do, can be very intense. One example is gamma-ray bursts, which may last from milliseconds to hours. There are two things to consider here: one from an observational viewpoint and one from a data perspective.
On the observational side, given their infrequency, it’s vitally important to record as much information as possible about them, so once one is detected it’s a matter of marshaling all available devices to assist in the observation. We did a project like this on the actuators for the Extremely Large Telescope (ELT) to enable better focusing. It’s a great example of where real-time alerting and response are vital, as the second chance may not come for years!
The other aspect is data. While infrequent, time domain events can be very energetic, which translates in our terms to “hugely data-intensive.” Very swiftly we move from recording nothing for long periods (years, recall) to massive amounts in an instant. That introduces both capture and storage challenges, two things that Kx excels at. Its ability to record bursts of data at nanosecond precision means it can manage the capture challenge, and its efficient compression enables it to store the long periods of inactivity compactly.
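The storage side of that argument can be illustrated with a small Python sketch: long stretches of "nothing" compress to almost nothing, while the burst itself does not. The sketch uses zlib from the Python standard library purely as a stand-in; it is not the compression kdb+ uses, and the signal is synthetic (zeros for quiet periods, seeded random bytes for the burst).

```python
import random
import zlib

random.seed(0)  # fixed seed so the synthetic burst is reproducible

# One million zero-valued readings stand in for a long period of inactivity.
quiet = bytes(1_000_000)

# A short, noisy burst: 10,000 random byte values.
burst = bytes(random.randrange(256) for _ in range(10_000))

# Quiet period, sudden burst, quiet period again.
stream = quiet + burst + quiet

compressed = zlib.compress(stream)

# The compression is lossless: the original stream is fully recoverable.
assert zlib.decompress(compressed) == stream

print(f"raw: {len(stream)} bytes, compressed: {len(compressed)} bytes")
print(f"quiet period alone: {len(zlib.compress(quiet))} bytes")
print(f"burst alone: {len(zlib.compress(burst))} bytes")
```

Almost all of the compressed size comes from the burst; the two quiet periods, which make up more than 99% of the raw stream, contribute very little. That is the general property the answer above relies on, whatever the specific compression scheme.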
Q: What areas do you see all of this potential being applied in?
A: One fascinating area is looking for planets in so-called “Goldilocks Zones” where the conditions would seem right for life to flourish as it has on Earth. Since the volumes of data in these searches are so huge, they definitely require machine learning techniques to determine possible candidates that human insight can then study further and make a determination on. Doing it the other way around, by having humans trawl through that data in search of signs, is just not an option any more.
The implicit assumption in these searches is that we are looking for “life as we know it”. More intriguing, and humbling, though, is the fact that the “as we know it” part is so terribly small. It’s very sobering to consider that today, 400 years after Galileo made his remarkable discoveries that helped us understand the solar system better and revolutionized not just astronomy but modern day science, we may still “know” only about 4% of what exists out there. Between dark matter, which we suggest must exist because our equations would blow up otherwise, and dark energy, which we similarly propose to explain our expanding universe, there is so much we simply do not understand.
What helped achieve that understanding in the past was data. Initially it was through human observation, looking to the night sky and recording what was seen. Then it became more sophisticated with the Hubble Space Telescope, Voyager and other deep space probes giving us more advanced data. And as I mentioned earlier, with advancements like the SKA and the LSST we are continuing that process of discovery via data.
Q: So are you optimistic that we can get beyond that 4% knowledge barrier?
A: Very optimistic. We have seen how it can be done. Take the gravitational waves that Einstein had postulated. It was through data from LIGO detectors that ripples in space-time arising from the merger of two black holes were identified, confirming the theory. We are moving to an era where finally we have the data and tools like Kx that can enable us to confirm many theories and advance others. We may be on the cusp of reducing many of our “known unknowns.” It’s an exciting time.