Kx & Hadoop
Many of our customers use both Hadoop and Kx products in production today, whether in combination, independently, or side by side within the same infrastructure. Put simply, there is no “one-size-fits-all” answer to the question “which is best?” But by understanding the original use case for each technology, and by seeing how both are being used today, clarity should ensue.
Hadoop
This technology was derived from a need to run use cases such as web searches and queries significantly faster than ever before, at a time when existing server infrastructures offered only limited capacity. Google, Yahoo and the wider community accelerated these early ideas and ultimately contributed elements back into the Apache open-source community; key among these contributions were MapReduce and GFS from Google.
Kx
Kx is a proprietary technology, with the initial cost of entry being offset by:
- Simplicity of implementation. Efficiency is gained by re-using the same database and functional programming language across many use cases, so fewer resources are required to implement new methodologies. The product itself is remarkably compact and simple, so it scales well and lends itself to being built upon. A single person can rapidly deliver a broad set of work, a capability that stems from its functional programming model.
- Efficiency of implementation. Kx technology gets more out of the same hardware and can reduce infrastructure costs significantly.
- Flexibility. Kx attaches easily to a broad set of use cases and therefore benefits from re-use across different business functions in an enterprise. Within the Kx user community there is a large number of shared examples of integrations with common APIs and data-analytics toolkits, such as Tableau.
Apache Hadoop, by contrast to Kx, is based on an open-source model (the Apache v2 license). The initial cost of entry can therefore seem very low. Apache Hadoop licensing is permissive, allowing extensions of the software to be licensed separately or under the Apache model, as desired by a commercial distributor of those extensions. This has led to divergence of content among the commercially supported distributions derived from Hadoop.
Your first “How do I” Questions?
Each of the main Apache Hadoop components is examined in more detail, later. Before that, here are some answers to typical “pop quiz” questions:
- Can Kx work with HDFS? Yes, although it is unlikely to be chosen as an approach. The reasons the analytics industry is moving away from HDFS as a construct for analytics apply to Kx also: read/write throughput and latency with HDFS are far worse than with embedded storage or a distributed object or file system, even on the same volume of storage equipment. Some of the performance degradation HDFS imposes on Kx can be mitigated by layering traditional file systems under HDFS, such as Lustre, GPFS or the MapR file system. Note that if the HDFS layer is implemented on top of another distributed file system, that file system's own, often more efficient, read/write methods could be used by Kx directly, which makes the HDFS layer somewhat unnecessary.
- Can Kx ingest data directly from HDFS sources? Yes. This is a much more likely scenario for a sophisticated user of the kdb+ database. Kdb+ has interfaces for a wide range of ingest sources and languages, including the ability to ingest HDFS files via the Hadoop utilities. For example, the output of the `hadoop fs` utility could be piped into a FIFO and read using q's named-pipe support.
- What about MapReduce with kdb+? The MapReduce model is inherent within kdb+. It manifests not only across a distributed, networked architecture but also efficiently spans shared memory when running many threads on one server.
- Can Kx work alongside Hive, or Spark? Yes. This is the best use case for Kx/Hadoop interoperation. For example, runtime data being generated and stored in Hive, HBase or Spark can be interoperated with Kx using a number of public interfaces. The operating functions found within Kx are a superset of the functions offered in Spark. We envisage an ETL (batch) process extracting data from a Hive or HBase database into kdb+, followed by data analytics in q syntax. The performance and function of this will depend on the data model and the type of data being transformed.
- Can I port from kdb+ to one of the other toolsets in Hadoop? Nothing prevents this, but you will almost certainly end up with a slower solution in terms of latency, throughput and query-time metrics. If that is acceptable to the user of the application, it could be considered. Any time-series or similarly structured data could be exported and re-imported, but the target system will lack some of the capabilities built into kdb+.
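The named-pipe ingest route mentioned above can be sketched in q. This is a minimal illustration rather than a supported recipe: the pipe path, HDFS path and table schema are all assumptions, and it relies on q's named-pipe support (`.Q.fps`), available from kdb+ 3.4.

```q
/ Illustrative sketch: stream a headerless CSV file out of HDFS into kdb+
/ through a FIFO. All paths and the schema here are assumptions.
system"mkfifo /tmp/hdfspipe"                               / create the named pipe
system"hadoop fs -cat /data/trades.csv > /tmp/hdfspipe &"  / HDFS -> FIFO, in the background
trades:([]time:`timestamp$();sym:`symbol$();price:`float$())
/ .Q.fps reads the FIFO in chunks, applying the function to each chunk;
/ here each chunk is parsed as CSV columns and inserted into the table
.Q.fps[{`trades insert ("PSF";",")0:x}]`:/tmp/hdfspipe
```

Because `.Q.fps` streams in chunks, the HDFS file never needs to fit in memory at once, which suits the large files typically held in HDFS.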
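The in-memory side of kdb+'s map-reduce behavior can be seen with `peach`, which maps a function over data slices on worker threads (start q with `-s 4`, say) before a conventional reduce step. The data and function below are purely illustrative.

```q
/ Map step: compute a partial sum of squares on each of 4 slices, in parallel
parts:{sum x*x} peach 4 1000#til 4000
/ Reduce step: combine the partial results into a single total
total:sum parts
```

The same idiom spans machines when the slices are remote partitions queried over IPC, which is how kdb+ realizes MapReduce across a networked architecture.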
Kx's core technology, the kdb+ time-series database, is renowned for its computational speed and performance, as well as the simplicity of its architecture for large-scale data analytics.