by Ben Jeffery.
Given a dataset of login attempts, it is valuable to identify any suspicious successful logins, as they could indicate a threat such as a compromised user account. In this example we apply user and entity behavior analytics (UEBA) to learn the expected behavior of each user and to detect when individual users deviate from it. As of version 1.1.0, the KX Developer IDE (as well as KX Analyst) facilitates q and Python development in the same environment by using embedPy to complement the functionality of kdb+. As a result, in the example that follows, we were able to use Keras and TensorFlow to apply deep learning techniques via embedPy.
Autoencoders are a type of neural network, used in deep learning, that learn to compress, then expand, their inputs. They compress the input by combining related features, then expand the compressed form into an approximation of the original. When a trained model is given an input similar to those it was trained on, the relationships between features are recognizable, and it rebuilds the input more accurately than when it is given an unfamiliar input. The difference between the input and the output is the error.
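To make the notion of error concrete, here is a minimal sketch that measures it as the mean squared difference between an input vector and its reconstruction. The choice of mean squared error is an assumption (the text only says "difference"), and the feature vectors and rebuilt values are made-up illustrative numbers, not output from a real model.

```python
def reconstruction_error(original, rebuilt):
    """Mean squared difference between an input vector and its reconstruction."""
    if len(original) != len(rebuilt):
        raise ValueError("vectors must have the same length")
    return sum((o - r) ** 2 for o, r in zip(original, rebuilt)) / len(original)

# Hypothetical values: a familiar input is rebuilt closely,
# an unfamiliar one is rebuilt poorly.
familiar = reconstruction_error([1.0, 0.0, 0.0], [0.9, 0.1, 0.0])
unfamiliar = reconstruction_error([0.0, 1.0, 0.0], [0.7, 0.3, 0.2])
print(familiar < unfamiliar)
```

A larger value means the model rebuilt the input less faithfully.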
If an autoencoder is trained on past legitimate logins, then when it tries to compress and expand logins that, while successful, are not in fact legitimate, the error will be greater as the relationships between the features will be different from those it has learned. For logins, this error in rebuilding the input can be used to score a login’s suspiciousness.
As an example, if user1 always logs in from location1 on weekdays, and location2 on weekends, the location and isWeekend fields for user1 can be collapsed into a single bit. Assuming this pattern holds, inputs for user1 can be recovered from this single bit, but when the user deviates from their normal behavior, the input will be rebuilt incorrectly.
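The user1 example can be written out by hand as a sketch. A real autoencoder learns its own compression during training; the `encode` and `decode` functions here are hypothetical, hard-coded stand-ins that only illustrate how a single bit recovers the usual pattern and fails on a deviation.

```python
# Hand-written stand-in for what an autoencoder might learn about user1:
# bit 0 means "weekday from location1", bit 1 means "weekend from location2".
LOCATION_BY_BIT = {0: "location1", 1: "location2"}

def encode(login):
    """Compress a login to one bit, taken from the isWeekend field."""
    return 1 if login["isWeekend"] else 0

def decode(bit):
    """Expand the bit back into the location and isWeekend fields."""
    return {"location": LOCATION_BY_BIT[bit], "isWeekend": bool(bit)}

def mismatches(login):
    """Count the fields that differ between a login and its reconstruction."""
    rebuilt = decode(encode(login))
    return sum(login[key] != rebuilt[key] for key in rebuilt)

usual = {"location": "location1", "isWeekend": False}
unusual = {"location": "location2", "isWeekend": False}  # weekday, wrong place

print(mismatches(usual), mismatches(unusual))
```

The usual login is rebuilt exactly, while the weekday login from location2 is rebuilt with the wrong location, which is the kind of error used for scoring.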
The login data in this example is a table with a record for each login attempt to a sample system. It includes the time and date, source IP, destination IP, and username of each login event. We can fill in location data using the date and source IP columns. The accuracy of the location depends on the country the IP address is in and on the level of granularity required. The country and administrative region of an IP address can be identified more accurately than its city. After loading and transforming the table in q, it is sent to Python. Because embedPy lets q and Python share the same memory space, the table is available to Python without any network marshaling.
Variables, or the results of expressions, can be written to Python variables using the context menu. These values can be data or q functions. A dictionary or table sent to Python will be accessible as a Python dictionary.
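On the Python side, a table received this way can be walked like any other dictionary. The sketch below assumes the dictionary is column-oriented (column name mapped to a sequence of values); the column names and values are hypothetical, and `columns_to_rows` is an illustrative helper, not part of embedPy or KX Developer.

```python
def columns_to_rows(table):
    """Turn a column-oriented dict into a list of per-record dicts."""
    names = list(table)
    lengths = {len(table[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    columns = (table[name] for name in names)
    return [dict(zip(names, values)) for values in zip(*columns)]

# Hypothetical login table as it might look after being sent from q.
logins = {
    "username": ["user1", "user2"],
    "sourceIP": ["10.0.0.1", "10.0.0.2"],
    "success": [True, False],
}
rows = columns_to_rows(logins)
print(rows[0])
```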
Values can also be sent to Python programmatically by passing them to Python functions, which can be referenced in q using the embedPy API directly.
Python files are opened in syntax-highlighted Python editors in KX Developer. Selections can be evaluated using the Python interpreter.
The code below takes the table of logins we have sent from q, uses it to train an autoencoder, then runs the autoencoder on the whole table.
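For readers who want to experiment without Keras or TensorFlow, the following is a dependency-free stand-in, not the model used in this example. It assumes logins have already been encoded as numeric feature vectors, and it trains a small linear autoencoder with a one-unit bottleneck by gradient descent on mean squared error. The feature layout, hyperparameters, and function names are all assumptions chosen for illustration.

```python
import random

def encode_decode(model, x):
    """Compress x to a single hidden value, then expand it again."""
    w_enc, b_enc, w_dec, b_dec = model
    h = sum(w * xi for w, xi in zip(w_enc, x)) + b_enc
    return h, [w * h + b for w, b in zip(w_dec, b_dec)]

def train_autoencoder(rows, epochs=2000, lr=0.1, seed=0):
    """Fit a linear autoencoder with a one-unit bottleneck (full-batch descent)."""
    rng = random.Random(seed)
    n = len(rows[0])
    w_enc = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    w_dec = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    b_enc, b_dec = 0.0, [0.0] * n
    scale = 2.0 / (len(rows) * n)  # derivative factor of the mean squared error
    for _ in range(epochs):
        g_w_enc, g_w_dec, g_b_dec, g_b_enc = [0.0] * n, [0.0] * n, [0.0] * n, 0.0
        for x in rows:
            h, rebuilt = encode_decode((w_enc, b_enc, w_dec, b_dec), x)
            err = [r - xi for r, xi in zip(rebuilt, x)]
            g_h = sum(e * w for e, w in zip(err, w_dec))  # gradient reaching h
            for i in range(n):
                g_w_dec[i] += err[i] * h
                g_b_dec[i] += err[i]
                g_w_enc[i] += g_h * x[i]
            g_b_enc += g_h
        for i in range(n):
            w_enc[i] -= lr * scale * g_w_enc[i]
            w_dec[i] -= lr * scale * g_w_dec[i]
            b_dec[i] -= lr * scale * g_b_dec[i]
        b_enc -= lr * scale * g_b_enc
    return w_enc, b_enc, w_dec, b_dec

def suspiciousness(model, x):
    """Score a login by the mean squared error of its reconstruction."""
    _, rebuilt = encode_decode(model, x)
    return sum((r - xi) ** 2 for r, xi in zip(rebuilt, x)) / len(x)

# Hypothetical features: [is_location1, is_location2, is_weekend].
legit_logins = [[1, 0, 0]] * 5 + [[0, 1, 1]] * 2
model = train_autoencoder(legit_logins)
legit_scores = [suspiciousness(model, x) for x in legit_logins]
odd_score = suspiciousness(model, [0, 1, 0])  # weekday login from location2
print(max(legit_scores) < odd_score)
```

Because the two legitimate patterns can be told apart by a single value, the one-unit bottleneck is enough to rebuild them, while the unfamiliar combination cannot be rebuilt well and receives a higher score. A Keras model would typically add nonlinear layers and a wider bottleneck.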
Once we have run the Python script and determined a suspiciousness score for each login, we can send the list of scores back to q using the editor context menu.
Joining the suspiciousness scores to the login table in q, we can inspect the result as a scatter plot in the KX Developer Visual Inspector. This plot shows the suspiciousness of logins over time, where the periodic spikes in suspiciousness are weekends or holidays, when legitimate logins are less likely. While unsuccessful (red) logins can be ignored, successful (blue) logins with a high suspiciousness score represent logins with an unusual combination of locations, IP addresses, time, day of week, and user, and may be indicative of suspicious activity.
The autoencoder can also be set up to work in a streaming fashion. We can write a function in Python that takes a batch of logins and updates the model, then returns the logins annotated with their scores and with whether they should be considered suspicious.
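The shape of such a function might look like the sketch below. Everything in it is an assumption: the fixed `threshold`, the choice to score a batch before the model learns from it, and the toy model, which is a simple seen/unseen check used only so the example runs on its own. In practice `score_login` and `update_model` would wrap the autoencoder.

```python
def make_batch_scorer(score_login, update_model, threshold):
    """Build a function that scores a batch of logins, then updates the model."""
    def process_batch(logins):
        annotated = []
        for login in logins:
            score = score_login(login)
            annotated.append(dict(login, score=score, suspicious=score > threshold))
        update_model(logins)  # learn from the batch only after scoring it
        return annotated
    return process_batch

# Toy stand-in for the model, for demonstration only: a login scores 1.0
# if its (user, location) pair has not been seen before, otherwise 0.0.
seen = set()

def toy_score(login):
    return 0.0 if (login["user"], login["location"]) in seen else 1.0

def toy_update(logins):
    seen.update((login["user"], login["location"]) for login in logins)

process_batch = make_batch_scorer(toy_score, toy_update, threshold=0.5)
first = process_batch([{"user": "user1", "location": "location1"}])
second = process_batch([
    {"user": "user1", "location": "location1"},
    {"user": "user1", "location": "location3"},
])
print([login["suspicious"] for login in second])
```

Scoring before updating keeps a batch from influencing its own scores. Note that the very first batch is flagged here because the toy model has seen nothing yet, so a real deployment would want a model trained on history before streaming begins.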
After sending this function to q, it can be called like any q function.
In this plot the unsuspicious logins have a bimodal distribution, with three outlying logins classified as suspicious due to a combination of unusual locations and times for this user.
Note: To enable the Python-specific menu options in KX Developer, open File > User Settings, and enable the Python Integration option.