Different options to scrape web pages requiring user interactions.
In several of my previous articles, I mentioned applications and libraries that we can use to scrape data. In this article, let’s explore the libraries and methods that we can use to get data from dynamic web pages.
Let’s get started with the basics. In an earlier article, I used Python requests + lxml to scrape stock data. This approach is straightforward and should meet our data scraping requirements most of the time.
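As a reminder of what the requests + lxml approach looks like, here is a minimal sketch. It is not the code from the earlier article: the sample markup, the `quotes` table id, the class names, and the helper names `parse_quotes` and `scrape_quotes` are all made up for illustration.

```python
import requests
from lxml import html

# Hypothetical sample markup standing in for a static stock-quote page.
SAMPLE_PAGE = """
<html><body>
  <table id="quotes">
    <tr><td class="symbol">AAA</td><td class="price">10.50</td></tr>
    <tr><td class="symbol">BBB</td><td class="price">7.25</td></tr>
  </table>
</body></html>
"""


def parse_quotes(page_source):
    """Extract (symbol, price) pairs from the quotes table with XPath."""
    tree = html.fromstring(page_source)
    quotes = []
    for row in tree.xpath('//table[@id="quotes"]//tr'):
        symbol = row.xpath('./td[@class="symbol"]/text()')
        price = row.xpath('./td[@class="price"]/text()')
        if symbol and price:
            quotes.append((symbol[0].strip(), float(price[0])))
    return quotes


def scrape_quotes(url):
    """Download a page and parse it; the URL is supplied by the caller."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return parse_quotes(response.text)


# Parse the inline sample so the sketch runs without network access.
print(parse_quotes(SAMPLE_PAGE))
```

This works as long as the data is already present in the HTML that the server returns.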
With this approach, dynamic websites take more work: you need to analyze the page and find the AJAX APIs it calls in order to scrape the data you want. This can be complicated, as the URL can be dynamic and vary according to the user's selection.
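To make this concrete, here is a rough sketch of calling such an AJAX API directly. The endpoint, the `symbol` and `period` parameters, and the helper names are assumptions for illustration; on a real site you would find the actual URL and parameters in the browser's network tab.

```python
import requests

# Hypothetical endpoint: a real one would be found in the browser's
# network tab while interacting with the page.
API_URL = "https://example.com/api/quotes"


def build_request(symbol, period):
    """Map the user's selection onto the query string the page would send."""
    params = {"symbol": symbol, "period": period}
    # Some AJAX endpoints check this header; whether a given site
    # needs it is an assumption.
    headers = {"X-Requested-With": "XMLHttpRequest"}
    request = requests.Request("GET", API_URL, params=params, headers=headers)
    return request.prepare()


def fetch_quotes(symbol, period):
    """Send the prepared request and decode the JSON body."""
    with requests.Session() as session:
        response = session.send(build_request(symbol, period), timeout=10)
        response.raise_for_status()
        return response.json()


# Only build and show the URL here, so the sketch runs offline.
print(build_request("AAA", "1d").url)
```

Each different selection produces a different URL, which is why you have to work out the parameter scheme before you can scrape the data this way.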
requests-html is a Python library to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.
Since it is built on top of Pyppeteer (a Python port of Puppeteer) and PyQuery (a jQuery-like library for Python), you can use the pyppeteer APIs (similar to Puppeteer) to trigger mouse or keyboard events and scrape the content you want.
Below is an example of scraping a dynamic web page using