Web Scraping – A Kdb+ Use case

24 Jan 2019

By Abin Saju

Web scraping is a method by which human-readable content is extracted from a web page using an automated system. The system can be implemented as a bot or web crawler that traverses domains, or through a web browser that mimics human interaction with a page. There are many use cases for scraped data. One of the biggest users is Google, which indexes, and sometimes archives, websites so that they can be retrieved by users. Beyond search providers, e-commerce is another big sector that relies on scraped data as a valuable tool for determining whether the price of a product matches or beats the one offered by a competitor.

In my spare time I have implemented a rudimentary Python framework using Selenium which accesses a web page, interacts with it, and scrapes the data I am interested in. Selenium makes it quite easy to interact with a page. For example, looking at Wikipedia with Chrome's built-in developer tools:

The developer tools highlight the IDs the web page assigns to its various widgets; these IDs can be used to drive the page programmatically instead of with the mouse. To run a simple search once my session is active, I simply do:

from selenium import webdriver

# Start a Chrome session, then search Wikipedia for "Kx Systems"
driver = webdriver.Chrome()
driver.get("https://wikipedia.org")
driver.find_element_by_name('search').send_keys("Kx Systems")
driver.find_element_by_css_selector('.sprite.svg-search-icon').click()

Using this technique, interaction with virtually any website can be automated. For example, I could create a historical database that stores hotel prices. The system could alert me when prices start to rise, or when there is a sudden drop for a particular listing. This can be done by scraping the HTML source and converting it to a structured format using Selenium and Python's lxml library. It would be trivial to configure the destination, room type, and check-in and check-out dates.
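As a sketch of that parsing step, assuming a simplified, hypothetical listing markup (a real page's structure will differ, and lxml's API is near-identical to the standard-library ElementTree parser used here to keep the example self-contained):

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified hotel-listing markup; the element and
# class names are illustrative only, not taken from any real site.
page_source = """
<body>
  <div class="listing">
    <span class="hotel">Grand Example Hotel</span>
    <span class="price">199</span>
  </div>
  <div class="listing">
    <span class="hotel">Budget Example Inn</span>
    <span class="price">89</span>
  </div>
</body>
"""

tree = ET.fromstring(page_source)
# One (hotel, price) pair per listing block
listings = [
    (div.findtext('./span[@class="hotel"]'),
     int(div.findtext('./span[@class="price"]')))
    for div in tree.findall('.//div[@class="listing"]')
]
print(listings)  # [('Grand Example Hotel', 199), ('Budget Example Inn', 89)]
```

From here it is a short step to comparing each price against a stored history and raising an alert on a move.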

One way to store the scraped data is to use Pandas and save it in CSV or Excel format. However, if in the future I wanted to expand the program to look at hundreds of websites at much shorter intervals, I would need a more efficient method of storing the data. So I am going to use kdb+ along with embedPy. Using this interface means I don't need to rewrite my existing code, and it allows me to set up a database quickly.
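For reference, the interim CSV approach might look like the sketch below (the post mentions Pandas, whose `DataFrame.to_csv` does the same job; the standard-library `csv` module is used here to keep the example dependency-free, and the rows are invented sample data):

```python
import csv
from datetime import datetime, date

# Invented sample rows: (timestamp, destination, hotel, checkin, checkout, price)
rows = [
    [datetime(2019, 1, 24, 12, 0), "Paris", "Grand Example Hotel",
     date(2019, 2, 1), date(2019, 2, 3), 199],
]

# Append-to-file storage: simple, but it scales poorly once many
# sites are polled at short intervals
with open("hotel_prices.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["t", "destination", "hotel", "checkin", "checkout", "price"])
    writer.writerows(rows)
```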

// Load embedPy
\l p.q
// Use European date format
\z 1
//Create folder if it doesn't exist
system "mkdir -p db";
// Load the database (empty on the first run)
\l db
// If table to store data doesn't exist, create it.
if[not `HotelInfo in .Q.pt;
	// Set down empty schema if table doesn't exist
	.Q.dd[hsym `$string .z.d;`HotelInfo] set .Q.en[`:.] flip `t`destination`hotel`checkin`checkout`price!"pssddj"$\:();
	system "l ."
	];
// Load definitions from the python script
pyscript:.p.import[`hotelscrape];
.pyscrape.start:pyscript`:startUp;
.pyscrape.getData:pyscript`:enterDestAndScrape;

The above code sets up a database if one doesn't already exist and loads the embedPy objects. The scraped data is passed back as a list, which can then be converted to a table and upserted to disk. Using embedPy removes intermediate steps such as saving down and reading in CSVs.
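To show the shape of the data handed back to kdb+, here is a hypothetical stub of the `hotelscrape` module the q script imports. The function names (`startUp`, `enterDestAndScrape`) come from the script above; everything inside them is an assumption, with the Selenium work stubbed out so only the return format is visible:

```python
from datetime import datetime, date

def startUp():
    # In the real module this would launch the browser session.
    pass

def enterDestAndScrape(destination, checkin, checkout):
    # In the real module this would navigate the site, enter the
    # destination and dates, and parse the results with lxml.
    # Invented sample results stand in for the scraped page here.
    scraped = [("Grand Example Hotel", 199), ("Budget Example Inn", 89)]
    now = datetime.now()
    # One row per listing, matching the table schema:
    # t, destination, hotel, checkin, checkout, price
    return [[now, destination, hotel, checkin, checkout, price]
            for hotel, price in scraped]
```

embedPy converts the returned nested list into q objects, so on the q side it can be flipped straight into the `HotelInfo` schema.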

The program can then be run at set intervals. For this I use the built-in .z.ts timer callback, with the interval set via \t:

\l scrape.q
.z.ts:{runProgram[hotelInfo]}
\t 14400000 / (4*60*60*1000) run every 4 hours

And voila! The program is now able to scrape data on multiple destinations at regular intervals and store it in a state-of-the-art database.

Further Reading:

A comparison of Python and q for data analysis: https://kx.com/blog/a-comparison-of-python-and-q-for-data-analysis/

embedPy GitHub repo: https://github.com/kxsystems/embedpy

How Google search works: https://www.google.com/intl/en_uk/search/howsearchworks/crawling-indexing/

More on how various elements can be located using Selenium: https://selenium-python.readthedocs.io/locating-elements.html

