NLP with kdb+

Natural Language Processing in kdb+

30 May 2018

By Fionnuala Carr

As part of Kx25, the international kdb+ user conference held on May 18th, a series of seven JupyterQ notebooks was released and is now available at https://code.kx.com/q/ml/. Each notebook demonstrates how to implement a different machine learning technique in kdb+, primarily using embedPy, covering tasks from feature extraction to fitting and testing a model. These notebooks act as a foundation for our users, allowing them to manipulate the code and get access to the exciting world of machine learning within Kx. (For more about the Kx machine learning team, please watch Andrew Wilson’s presentation at Kx25 on the Kx YouTube channel.)

Another announcement made at Kx25 was the release of the kdb+ natural language processing library, the first in a series of machine learning libraries. Like all of our machine learning libraries, the NLP package is available as open source, Apache 2 software, and is supported for our clients.

Background

Natural language processing (NLP) is an area of artificial intelligence and linguistics devoted to making computers understand statements written in human language. NLP has gained attention as a way of representing and analyzing human language computationally. It has applications in fields such as machine translation, email spam detection, information extraction, document summarization and medical informatics, as well as question answering systems in the form of automated chatbots. NLP can be used to answer a variety of questions about unstructured text and to facilitate open-ended exploration. Although the source is text, transformations are applied to convert this data to vectors, dictionaries and symbols, which can be handled very effectively by q. Many operations, such as searching, clustering and keyword extraction, can be done using very simple data structures such as feature vectors. A complete introduction to NLP is beyond the scope of this article; interested readers should consult the references listed at the end.

Technical Description

The notebook demonstrates the full process, from converting raw text data to kdb+ tables through to applying NLP analysis to that data. First, it introduces the operations that allow us to apply NLP algorithms to the dataset. embedPy allows us to import spaCy, a Python library for NLP. spaCy is used to parse the raw text, carrying out tokenization, sentence detection and part-of-speech recognition. The results are saved in a kdb+ table, and all other functionality of the NLP library is done in q.

To demonstrate some of the functionality of the NLP library, we analyze the novel Moby Dick. We search for all the proper nouns using the part-of-speech tags and identify the most significant words in the corpus. We can extract related words using the .nlp.findRelatedTerms function. This function finds terms that have occurred in the same sentence as the queried word. It returns a dictionary of the related terms and their z-scores, in descending order. A large z-score indicates that the term occurs alongside the queried word more frequently than it would if it were randomly distributed in the corpus. Using this function, we can identify the three major captains in Moby Dick: Peleg, Bildad and Ahab.
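
The idea of scoring co-occurring terms with a z-score can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of .nlp.findRelatedTerms: the function name related_terms, the binomial approximation used for the expected count, and the toy sentences are all hypothetical and are not taken from the notebook.

```python
import math
from collections import Counter

def related_terms(sentences, query):
    """Score terms that share a sentence with `query`.

    Assumption: if a term were spread randomly over the corpus, the number
    of query sentences containing it would be roughly binomial(n, p), where
    n is the number of sentences containing the query and p is the fraction
    of all sentences containing the term.
    """
    total = len(sentences)
    with_query = [s for s in sentences if query in s]
    n = len(with_query)
    # Number of sentences each term appears in, overall and alongside the query.
    overall = Counter(t for s in sentences for t in set(s))
    together = Counter(t for s in with_query for t in set(s) if t != query)
    scores = {}
    for term, observed in together.items():
        p = overall[term] / total
        sd = math.sqrt(n * p * (1 - p))
        if sd == 0:
            continue  # the term appears in every sentence, so it carries no signal
        scores[term] = (observed - n * p) / sd
    # Highest z-score first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

# Made-up tokenized sentences, for illustration only.
sentences = [
    ["captain", "ahab", "ship"],
    ["captain", "peleg", "ship"],
    ["captain", "bildad"],
    ["whale", "sea"],
    ["whale", "ship"],
    ["sea", "whale"],
]
scores = related_terms(sentences, "captain")
for term, z in scores.items():
    print(term, round(z, 3))
```

In this toy corpus the three names only ever appear next to "captain", so they score higher than "ship", which also appears in a sentence without it.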

With the NLP library, we can also import, parse and analyze emails. To demonstrate these features, we import one of the largest publicly available datasets of corporate emails, the Enron dataset. We can compare corpora by separating Jeff Skilling’s emails into two: one containing the emails about his fraternity, and the other containing the rest of his emails. The result of this comparison gives us an insight into the secret fraternity code words used. Additionally, using Jeff’s emails, we can extract one specific email and find the most similar email in the corpus. .nlp.explainSimilarity gives an understanding of why these emails are similar. It calculates how much each shared term contributes to the cosine similarity and sorts the contributions in descending order. Cosine similarity is used as a measure of whether two documents are similar. It is the cosine of the angle between the two documents’ vectors. For example, suppose one document contains the word “dog” 200 times and another contains it only 50 times. The Euclidean distance between the documents will be large, but the angle between them will be small because both vectors point in the same direction, which is what matters when we are comparing documents.
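
The difference between the two measures can be seen in a short Python sketch. It is an illustration only and does not reproduce the q code behind .nlp.explainSimilarity; the function names and the term counts below (including the extra "cat" term) are assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count dictionaries."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between the same two vectors."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in terms))

def explain_similarity(a, b):
    """Contribution of each shared term to the cosine similarity, largest first."""
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    parts = [(t, a[t] * b[t] / (norm_a * norm_b)) for t in set(a) & set(b)]
    return sorted(parts, key=lambda kv: kv[1], reverse=True)

# Hypothetical term counts: doc_b uses the same words as doc_a, a quarter as often.
doc_a = {"dog": 200, "cat": 40}
doc_b = {"dog": 50, "cat": 10}

print(cosine_similarity(doc_a, doc_b))   # close to 1: same direction
print(euclidean_distance(doc_a, doc_b))  # large: very different magnitudes
print(explain_similarity(doc_a, doc_b))  # "dog" contributes most
```

The per-term contributions sum to the cosine similarity, which is why sorting them shows which shared words make two documents look alike.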

The .nlp.compareDocToCentroid function allows you to discover outliers in a corpus. It compares a document with a centroid, which is the sum of the keyword vectors of the documents, and calculates the cosine similarity between the document and that centroid. This is illustrated using the emails of former Enron CEO Ken Lay, which include emails about a petition. Sorting the petition emails by distance from the centroid reveals many angry and threatening emails from Enron stockholders following the scandal.
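
A minimal Python sketch of this outlier check follows. It assumes each document has already been reduced to a dictionary of keyword weights; the helper names and the three toy documents are hypothetical, and the code is not the library’s q implementation.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two keyword-weight dictionaries."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def centroid_of(docs):
    """Centroid taken as the element-wise sum of the keyword vectors."""
    total = Counter()
    for keywords in docs.values():
        total.update(keywords)
    return dict(total)

def rank_by_similarity(docs):
    """Return (name, similarity to centroid) pairs, least similar first."""
    centroid = centroid_of(docs)
    scored = [(name, cosine_similarity(kw, centroid)) for name, kw in docs.items()]
    return sorted(scored, key=lambda kv: kv[1])

# Made-up keyword weights, for illustration only.
docs = {
    "petition_1": {"petition": 3, "sign": 2, "support": 1},
    "petition_2": {"petition": 2, "sign": 1, "support": 2},
    "angry": {"lost": 3, "savings": 2, "petition": 1},
}
ranked = rank_by_similarity(docs)
for name, similarity in ranked:
    print(name, round(similarity, 3))
```

Documents at the top of the ranking are the ones least like the rest of the corpus, which is where outliers such as the angry emails would be expected to surface.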

A number of clustering algorithms are contained in the NLP library, including the summarize, MCL and Radix clustering methods. The .nlp.summarize function is a fast clustering algorithm that produces cohesive and reasonably sized clusters. It finds the n documents that best summarize the n most important keywords in the corpus, then clusters the remaining documents around these centroids. When this clustering algorithm is applied to Jeff’s emails, clusters appear that contain various recurring reports, such as EnronOnline and ENE reports, as well as fraternity emails.

We can extract dates and times from strings using .nlp.findDates and .nlp.findTimes, which allows us to build a timeline of events. This is demonstrated using the IEEE VAST 2014 challenge data, which includes several articles describing a kidnapping.

Another feature of the NLP library is sentiment analysis, which determines whether a writer’s attitude towards a specific topic is positive, negative or neutral. The NLP library uses a pre-built model of the degrees of positive and negative sentiment for English words and emoticons, as well as parsing to account for negation, adverbs and other modifiers. Sentences can be scored for their negative, positive and neutral sentiment. The included model is based on the VADER sentiment analysis paper and has been trained on social-media messages. Using Jeff’s emails, the top ten positive and negative sentences were found. The positive sentences included invitations to speak at an awards ceremony and messages thanking him for making the company such a success and for giving the senders a great opportunity to work at Enron. The negative sentences covered topics such as violence and death tolls.

The MBOX file is the most common format for storing email messages on a hard drive. All the messages in a mailbox are stored in a single, long text file as a string of concatenated email messages. The NLP library allows the user to import these files and create a kdb+ table from them.
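
To make the file layout concrete, the sketch below reads a small MBOX file with Python’s standard mailbox module and turns each message into a row, much as a table would hold one message per record. It is not the NLP library’s importer: the column names, the mbox_to_rows helper and the two sample messages are assumptions made for the example.

```python
import mailbox
import os
import tempfile

# Two made-up messages; in an MBOX file each message starts with a "From " line.
SAMPLE_MBOX = """From alice@example.com Mon Jan  1 10:00:00 2001
From: alice@example.com
To: bob@example.com
Subject: First message

Hello Bob.

From bob@example.com Mon Jan  1 11:00:00 2001
From: bob@example.com
To: alice@example.com
Subject: Re: First message

Hello Alice.
"""

def mbox_to_rows(path):
    """Read an MBOX file into a list of dictionaries, one per message."""
    rows = []
    box = mailbox.mbox(path)
    try:
        for msg in box:
            rows.append({
                "sender": msg["From"],
                "to": msg["To"],
                "subject": msg["Subject"],
                # Plain-text messages only; multipart messages need extra handling.
                "text": msg.get_payload(),
            })
    finally:
        box.close()
    return rows

# Write the sample to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")
    with open(path, "w", newline="\n") as f:
        f.write(SAMPLE_MBOX)
    rows = mbox_to_rows(path)

for row in rows:
    print(row["sender"], "|", row["subject"])
```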

If you would like to investigate the uses of the NLP library further, check out the ML 07 Natural Language Processing notebook on GitHub and visit code.kx.com/q/ml/nlp for the complete list of functions available in the NLP library. You can set up your machine learning environment either by using Anaconda to integrate with your Python installation, or by building your own, which consists of downloading kdb+, embedPy and JupyterQ. You can find the installation steps at code.kx.com/q/ml/setup/.

Don’t hesitate to contact ai@kx.com if you have any suggestions or queries.

Other articles in this JupyterQ series of blogs by the Kx Machine Learning Team:

Neural Networks in kdb+ by Esperanza López Aguilera

Dimensionality Reduction in kdb+ by Conor McCarthy

Classification Using k-Nearest Neighbors in kdb+ by Fionnuala Carr

Feature Engineering in kdb+ by Fionnuala Carr

Decision Trees in kdb+ by Conor McCarthy

Random Forests in kdb+ by Esperanza López Aguilera

References:

Khurana, D., et al. Natural Language Processing: State of The Art, Current Trends and Challenges. Available at https://arxiv.org/ftp/arxiv/papers/1708/1708.05148.pdf

Jurafsky, D. and Martin, J. Speech and Language Processing:  An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Available at https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf

Bird, S., Klein, E. and Loper, E. Natural Language Processing with Python. Available at https://pdfs.semanticscholar.org/3673/bccde93025e05431a2bcac4e8ff18c9c273a.pdf

Nadkarni, P. M., Ohno-Machado, L. and Chapman, W. W. Natural language processing: an introduction. Journal of the American Medical Informatics Association, Volume 18, Issue 5, 1 September 2011, Pages 544–551. Available at https://doi.org/10.1136/amiajnl-2011-000464

Carpena, P., et al. Level statistics of words: Finding keywords in literary texts and symbolic sequences. Physical Review E 79.3 (2009): 035102. Available at https://pdfs.semanticscholar.org/4c95/897633779b20191aa53537ef4190287f29e2.pdf

Rayson, P. and Garside, R. Comparing corpora using frequency profiling. Proceedings of the workshop on Comparing Corpora. Association for Computational Linguistics, 2000. Available at http://ucrel.lancs.ac.uk/people/paul/publications/rg_acl2000.pdf

Hutto, C. J. and Gilbert, E. VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Available at http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf
