Web Scraping w/ Selenium & BeautifulSoup
Posted in Python
Scenario
I inherited a report at work that required going to a specific url, logging in with a username & password, selecting some prompts, and then copy/pasting the contents of a table into a text file.
First thing that entered my mind while being trained on the process of the new report was...WEB SCRAPING!
So this post will explain how I went about automating this process using Python along with Selenium and BeautifulSoup.
Initial Setup
To start I installed the following packages...
- pip install selenium
- pip install beautifulsoup4
- pip install pandas
I also downloaded chromedriver.exe and saved it to the same path where the Python script was located.
Imports
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
import time
In addition to the selenium and beautifulsoup4 packages, I am importing pandas so I can create a DataFrame with the content I am scraping, and time so I can strategically add pauses to my code while webpages load.
Web Driver
driver = webdriver.Chrome()
url = "Your URL Here"
driver.get(url)
As mentioned before, the browser I am working with is Google Chrome, so with the chromedriver.exe I am using the Selenium webdriver to interact with the browser.
Login
my_username = 'Your Username'
my_password = 'Your Password'
time.sleep(3) #Give the page time to load
username = driver.find_element_by_name("username")
password = driver.find_element_by_name("password")
username.send_keys(my_username)
password.send_keys(my_password)
submitButton = driver.find_element_by_id("login")
submitButton.click()
My situation did require me to login with a username and password. In order to do this, I had to "inspect" the login page to locate the specific form elements and the login button.
I found that the name of the username input box was conveniently called "username", so I grabbed it with 'driver.find_element_by_name("username")' and then utilized 'username.send_keys(my_username)', where 'my_username' was a variable that contained my username.
I did the same thing with my password by using 'password.send_keys(my_password)'.
To submit my credentials I located the button's id and used the driver.find_element_by_id() function and passed in "login".
Dropdown
driver.find_element_by_xpath('Your xpath here').click()
driver.find_element_by_name('find').click()
After logging in, I had to make a selection from a dropdown menu and click another button.
To locate both of these elements, this time I used 'driver.find_element_by_xpath()' and 'driver.find_element_by_name()' after inspecting the page.
I then used '.click()' to accomplish the selection and submission.
New Window
driver.switch_to.window(driver.window_handles[1])
The dropdown selection resulted in a new tab opening up and displaying the report I needed.
This is where I got stuck a little bit...
The code was still referring to the initial tab that opened up, so I needed to use 'driver.switch_to.window(driver.window_handles[1])' in order to point the code to the correct tab.
Parse HTML
soup = bs(driver.page_source, 'html.parser')
table_rows = soup.find_all('tr')
data = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    data.append(row)
Now that I had navigated to the data I needed to scrape, I used BeautifulSoup to grab the page source and parse the HTML. The page source revealed that the data I needed was wrapped in a table.
So after some googling, I found some code that used 'soup.find_all()' to locate the table rows (tr) and table data (td) and append them to an empty list using a loop.
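To see that loop in action, here is a small standalone sketch using a made-up HTML snippet (the table below is purely an illustration, not the actual report's markup):

```python
from bs4 import BeautifulSoup as bs

# Made-up page source standing in for the real report's table
html = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""

soup = bs(html, 'html.parser')
data = []
for tr in soup.find_all('tr'):
    td = tr.find_all('td')      # header rows use <th>, so they yield no <td> cells
    row = [i.text for i in td]
    data.append(row)

print(data)  # [[], ['alpha', '1'], ['beta', '2']]
```

Note that rows built from `<th>` cells come back as empty lists, which is one reason the scraped DataFrame needs some rows stripped afterward.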
Pass Into Pandas DataFrame
data = pd.DataFrame(data)
data = data[3:-4] #Strip unnecessary rows
data.to_csv(r'yourPath\yourFile.txt', header=None, index=None, sep='\t')
After I had my list that contained my table data, I stored it into a Pandas DataFrame and used 'to_csv()' to save the contents as a .txt file.
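As a self-contained sketch of this last step (the rows and the [1:-1] slice below are made up for illustration; my real report needed [3:-4]), writing to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Made-up scraped rows; the first and last stand in for the junk
# rows that the slice strips out
data = [['header junk', ''], ['alpha', '1'], ['beta', '2'], ['footer junk', '']]

df = pd.DataFrame(data)
df = df[1:-1]  # keep only the middle rows

# StringIO stands in for the real file path
buf = io.StringIO()
df.to_csv(buf, header=None, index=None, sep='\t')
print(buf.getvalue().splitlines())  # ['alpha\t1', 'beta\t2']
```

Passing header=None and index=None suppresses the column headers and row index, so the output file contains only the tab-separated table contents.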
Full Code Example
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
import time
driver = webdriver.Chrome()
url = "Your URL Here"
driver.get(url)
#Login
my_username = 'Your Username'
my_password = 'Your Password'
time.sleep(3) #Give the page time to load
username = driver.find_element_by_name("username")
password = driver.find_element_by_name("password")
username.send_keys(my_username)
password.send_keys(my_password)
submitButton = driver.find_element_by_id("login")
submitButton.click()
#Dropdown selection
driver.find_element_by_xpath('Your xpath here').click()
driver.find_element_by_name('find').click()
#New Tab
driver.switch_to.window(driver.window_handles[1])
#Parse HTML with BeautifulSoup
soup = bs(driver.page_source, 'html.parser')
table_rows = soup.find_all('tr')
data = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    data.append(row)
#Pandas to_csv
data = pd.DataFrame(data)
data = data[3:-4] #Strip unnecessary rows
data.to_csv(r'yourPath\yourFile.txt', header=None, index=None, sep='\t')
Happy Scraping!