Different options to scrape web pages requiring user interactions.
Overview
In several of my previous articles, I mentioned applications and libraries that we can use to scrape data. In this article, let’s explore the libraries and methods that we can use to get data from dynamic web pages.
The Basics
Let’s get started with the basics. In an earlier article, I used Python requests + lxml to scrape stock data. This approach is straightforward, and it should meet our data scraping requirements most of the time.
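As a rough illustration of this static approach, here is a minimal sketch of the parsing step with lxml. The markup, the table id and the `parse_quotes` helper are all made up for the example; the HTML is inlined so the sketch runs without network access, whereas in practice the string would be the body of an HTTP response (for instance `requests.get(url).text`).

```python
from lxml import html

# Hypothetical markup standing in for a downloaded page; in practice this
# string would be the body of an HTTP response, e.g. requests.get(url).text.
SAMPLE_PAGE = """
<html><body>
  <table id="quotes">
    <tr><th>Symbol</th><th>Price</th></tr>
    <tr><td>AAA</td><td>10.50</td></tr>
    <tr><td>BBB</td><td>7.25</td></tr>
  </table>
</body></html>
"""


def parse_quotes(page_source):
    """Return {symbol: price} parsed from the (assumed) quotes table."""
    tree = html.fromstring(page_source)
    quotes = {}
    # Only rows containing <td> cells carry data; the header row uses <th>.
    for row in tree.xpath('//table[@id="quotes"]//tr[td]'):
        symbol, price = [cell.text_content().strip() for cell in row.xpath("./td")]
        quotes[symbol] = float(price)
    return quotes


print(parse_quotes(SAMPLE_PAGE))
```

This only works when the data is already present in the HTML that the server returns, which is exactly the limitation that dynamic pages run into.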
For dynamic websites, this approach requires you to analyze the web pages and find the AJAX APIs they invoke in order to scrape the data you want. This can be complicated, as the URLs can be dynamic and vary according to the user’s selection.
For example, on one particular website that I wanted to scrape, the links are generated dynamically using JavaScript.
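In the simpler cases, where the endpoint and its parameters can be read off the browser’s network tab, the dynamic URL can be rebuilt in code. The sketch below assumes a made-up endpoint (`https://example.com/api/prices`) and made-up parameter names (`symbol`, `period`, `page`); a real site will have its own, and may add tokens or headers that are harder to reproduce.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, as it might appear in the browser's network tab.
API_BASE = "https://example.com/api/prices"


def build_api_url(symbol, period, page=1):
    """Build the request URL for one combination of user selections."""
    # urlencode escapes special characters so the query string stays valid.
    query = urlencode({"symbol": symbol, "period": period, "page": page})
    return f"{API_BASE}?{query}"


print(build_api_url("AAA", "1M"))
print(build_api_url("B&B", "1Y", page=2))
```

The resulting URL can then be fetched with requests like any static page. When the links are produced by JavaScript in a way that cannot easily be reconstructed like this, a library that drives a real browser becomes the more practical option.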
Requests-HTML
requests-html is a Python library that aims to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.
Since it is built on top of Pyppeteer (a Python port of Puppeteer) and pyquery (a jQuery-like library for Python), you can use the pyppeteer APIs (similar to Puppeteer’s) to trigger mouse or keyboard events and scrape the content you want.
Below is an example of scraping a dynamic web page using requests-html.