RPA and Web Scraping using Jupyter

Overview
In my previous article I walked you through how to use Python + requests + lxml
to scrape stock data. In this article let’s explore using Robotic Process Automation (RPA) in a Jupyter Notebook environment to perform web scraping. Personally I find Jupyter Notebook + RPA
a great combination, as the interactive nature of Jupyter Notebook allows for quick iterations and trial and error when developing robots. Another good thing is that all these tools are open source.
I am going to use xeus-robot,
which is a Jupyter kernel for Robot Framework based on xeus, a native implementation of the Jupyter protocol.
Setup
xeus-robot
I assume you already have JupyterLab 3.0 or above installed. To install xeus-robot and its dependencies, just follow the instructions and run the following command:
$ conda install -c conda-forge xeus-robot
xeus-robot depends on Robot Framework which is a generic open source automation framework for acceptance testing, acceptance test driven development (ATDD), and robotic process automation (RPA).
SeleniumLibrary
Since I am going to perform web scraping, I need to install SeleniumLibrary from Robot Framework.
$ pip install --upgrade robotframework-seleniumlibrary
Browser Drivers
I also need to install a web driver for the browser I want to automate. I can use webdrivermanager
to install the browser drivers. In this case I installed the drivers for both Firefox and Chrome.
$ pip install webdrivermanager
$ webdrivermanager firefox chrome --linkpath /usr/local/bin
Note that I install the drivers to /usr/local/bin
. You can definitely install them to another location, but make sure that location is in your environment PATH.
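If you are unsure whether the drivers are discoverable, a quick check with Python’s standard library can confirm it. This is a sketch, assuming the binary names chromedriver and geckodriver, which is what webdrivermanager installs for Chrome and Firefox:

```python
import shutil

# Look up each driver binary on the environment PATH.
# shutil.which returns the full path if found, otherwise None.
for binary in ("chromedriver", "geckodriver"):
    path = shutil.which(binary)
    print(f"{binary}: {path or 'not on PATH'}")
```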
With the setup completed, after I start jupyter lab
I can see the option for a RobotFramework (XRobot) notebook.

Web Scraping using RPA
Let’s start the fun by using RPA to perform web scraping. I am going to scrape the S&P 500 stock names from a website:
- Navigate to the website
- From the drop down list, select S&P 500
- Wait until the page is refreshed
- Scrape the stock names and the links
- Save the information to a flat file

Using Robot Framework and Jupyter Notebook, the above tasks are easy to implement in just a few lines of code.
Robot Framework in Jupyter Notebook
Unlike using a programming language, Robot Framework has easy syntax, utilizing human-readable keywords. Its capabilities can be extended by libraries implemented with Python or Java. The framework has a rich ecosystem around it, consisting of libraries and tools that are developed as separate projects.
As a developer, you may need to change your mindset and program the use case in plain English.
The Code
Settings, Variables and Keywords
Let’s start by defining the settings, variables and keywords. You can find the code listing below.
*** Settings ***
Documentation     Settings with reusable keywords and variables.
...
...               The system specific keywords created here form our own
...               domain specific language. They utilize keywords provided
...               by the imported SeleniumLibrary.
Library           SeleniumLibrary
Library           OperatingSystem
Library           String

*** Variables ***
${SERVER}            https://www.investing.com
${BROWSER}           Chrome
${STOCKS URL}        ${SERVER}/equities/americas
${stocks_filter}     xpath=//*[@id="stocksFilter"]
${stocks_to_grab}    S&P 500
${stock_link}        //tr[starts-with(@id,'pair')]/td/a
${link_count}        0

*** Keywords ***
Open Browser To Stocks Page
    Open Browser    ${STOCKS URL}    ${BROWSER}

Save to File
    [Arguments]    ${value1}    ${value2}
    Append To File    path=${EXECDIR}/stocks.txt    content=${value1},${value2}\n
Settings
- Documentation is like a comment where you can put the description of your test cases.
- Library is used to import the modules that I want to use. Refer here for all available libraries.
- OperatingSystem and String are built-in standard libraries.
Variables
- This is the variable declaration section where I list down the variables I will be using. For more information, refer to the user guide.
- I am using XPath to retrieve the data that I want. If you want to know how to get the XPath using Chrome, refer to my previous article.
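To see what the XPath //tr[starts-with(@id,'pair')]/td/a actually selects, here is a small sketch using Python’s standard library against a hypothetical fragment of the stock table (the element ids and hrefs are illustrative). ElementTree’s XPath subset lacks starts-with(), so that predicate is applied in Python instead:

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical fragment mimicking the stock table on the page.
html = """
<table>
  <tr id="pair_1"><td><a href="/equities/apple">Apple</a></td></tr>
  <tr id="pair_2"><td><a href="/equities/msft">Microsoft</a></td></tr>
  <tr id="header"><td>Name</td></tr>
</table>
"""

root = ET.fromstring(html)
# Keep only <tr> rows whose id starts with "pair", then grab their <a> links,
# just as the starts-with() predicate does in the real XPath.
links = [a for tr in root.iter("tr")
         if tr.get("id", "").startswith("pair")
         for a in tr.iter("a")]
for a in links:
    print(a.text, a.get("href"))
```

The header row is skipped because its id does not start with pair, which is exactly why the XPath filters the rows that way.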
Keywords
The section allows you to define custom keywords using the pre-defined keywords. I defined 2 keywords:
- Open Browser To Stocks Page opens the browser to the web page from which I want to scrape the data.
- Save to File saves the scraped information into a text file.
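As a rough Python equivalent of the Save to File keyword, the sketch below appends one "name,url" line per call (the function name and default path are mine for illustration, not a Robot Framework API):

```python
def save_to_file(value1: str, value2: str, path: str = "stocks.txt") -> None:
    # Append one "name,url" line, mirroring what the
    # Append To File keyword does with content=${value1},${value2}\n.
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{value1},{value2}\n")
```

Appending (mode "a") rather than writing means repeated calls inside the loop accumulate one line per stock.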
Using Jupyter Notebook, you can actually test out the custom keywords individually.

Test Cases
Below is the code to perform the task of scraping the stock information.
*** Test Cases ***
Get All Stocks
    Open Browser To Stocks Page
    Maximize Browser Window
    Wait Until Element Is Visible    ${stocks_filter}
    Select From List By Label    ${stocks_filter}    ${stocks_to_grab}
    Wait Until Element Is Visible    xpath:${stock_link}
    ${link_count}=    Get Element Count    xpath:${stock_link}
    Log Many    link_count    ${link_count}
    Should Be True    ${link_count} > 0
    FOR    ${index}    IN RANGE    1    ${link_count}+1
        ${link_text}=    Get Text    xpath:(${stock_link})[${index}]
        ${link_url}=    Get Element Attribute    xpath:(${stock_link})[${index}]    href
        Log Many    link_text    ${link_text}
        Log Many    link_url    ${link_url}
        Save to File    ${link_text}    ${link_url}
    END
    Close All Browsers
- Open Browser To Stocks Page launches the Chrome browser and navigates to the website. I defined Chrome as the browser I want to use in the variables section. You can definitely use other browsers, e.g. Firefox.
- Maximize Browser Window maximizes the browser window.
- Wait Until Element Is Visible waits until the stock information is loaded.
- Select From List By Label selects S&P 500 from the drop-down list.
- Wait Until Element Is Visible waits until the S&P 500 data is loaded.
- Get Element Count counts the number of stocks.
- Log Many logs the information.
- Should Be True validates the count of the links.
- The FOR loop iterates through all the links, retrieves the stock names and URLs, and saves them to a text file.
- Close All Browsers closes all the browsers.
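One detail worth noting is why the loop runs from 1 to ${link_count}+1 rather than from 0: XPath positions such as (//a)[1] are 1-based. In Python terms:

```python
# Robot Framework's "FOR ${index} IN RANGE 1 ${link_count}+1" is the
# same as Python's range(1, link_count + 1). XPath positions such as
# (//a)[1] are 1-based, so the loop must start at 1, not 0.
link_count = 3
indices = list(range(1, link_count + 1))
print(indices)  # [1, 2, 3]
```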
Results
Log Report
For tracing and debugging purposes, you can read the log report of the execution.

Click the Log
button and you should see a report similar to the following.

Scraped Stock Information
The scraped stock information is saved in stocks.txt
in the execution folder.
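Since each line is "name,url", the file can be read back with Python’s csv module. The sample contents below are illustrative (swap io.StringIO(sample) for open("stocks.txt") to read the real file):

```python
import csv
import io

# Hypothetical sample of what stocks.txt might contain (one name,url per line).
sample = (
    "3M Company,https://www.investing.com/equities/3m-co\n"
    "Apple Inc,https://www.investing.com/equities/apple-computer-inc\n"
)

# csv splits each comma-separated line into [name, url].
rows = list(csv.reader(io.StringIO(sample)))
for name, url in rows:
    print(f"{name} -> {url}")
```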

Summary
I have only scratched the surface of Robot Framework’s capabilities, and there is much more that you can explore on your own. As data scientists or data analysts, xeus-robot
is a tool you should learn if your day-to-day tasks involve any web scraping or process automation.
Do also check out RPA Framework for many more open source libraries that can be used to automate almost anything you can imagine.
The notebook that I used can be found at this repository.