Web Scraping w/ Selenium & BeautifulSoup
Posted in Python
Scenario
I inherited a report at work that required going to a specific url, logging in with a username & password, selecting some prompts, and then copy/pasting the contents of a table into a text file.
First thing that entered my mind while being trained on the process of the new report was...WEB SCRAPING!
So this post will explain how I went about automating this process using Python along with Selenium and BeautifulSoup.
Initial Setup
To start I installed the following packages...
- pip install selenium
- pip install beautifulsoup4
- pip install pandas
I also downloaded chromedriver.exe and saved it to the same path where the Python script was located.
Imports
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
import time
In addition to the selenium and beautifulsoup4 packages, I am importing pandas so I can create a DataFrame with the content I am scraping, and time so I can strategically add pauses to my code while webpages load.
Web Driver
driver = webdriver.Chrome()
url = "Your URL Here"
driver.get(url)
As mentioned before, the browser I am working with is Google Chrome, so with the chromedriver.exe I am using the Selenium webdriver to interact with the browser.
Login
my_username = 'Your Username'
my_password = 'Your Password'
time.sleep(3) #Give the page time to load
username = driver.find_element_by_name("username")
password = driver.find_element_by_name("password")
username.send_keys(my_username)
password.send_keys(my_password)
submitButton = driver.find_element_by_id("login")
submitButton.click()
My situation did require me to login with a username and password. In order to do this, I had to "inspect" the login page to locate the specific form elements and the login button.
I found that the name of the username input box was conveniently called "username", so I grabbed it with 'driver.find_element_by_name("username")' and then utilized 'username.send_keys(my_username)', where 'my_username' was a variable that contained my username.
I did the same thing with my password by using 'password.send_keys(my_password)'.
To submit my credentials I located the button's id and used the driver.find_element_by_id() function and passed in "login".
Dropdown
driver.find_element_by_xpath('Your xpath here').click()
driver.find_element_by_name('find').click()
After logging in, I had to make a selection from a dropdown menu and click another button.
To locate both of these elements, this time I used 'driver.find_element_by_xpath()' and 'driver.find_element_by_name()' after inspecting the page.
I then used '.click()' to accomplish the selection and submission.
New Window
driver.switch_to.window(driver.window_handles[1])
The dropdown selection resulted in a new tab opening up and displaying the report I needed.
This is where I got stuck a little bit...
The code was still referring to the initial tab that opened up, so I needed to use 'driver.switch_to.window(driver.window_handles[1])' in order to point the code to the correct tab.
Parse HTML
soup = bs(driver.page_source, 'html.parser')
table_rows = soup.find_all('tr')
data = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    data.append(row)
Now that I had navigated to the data I needed to scrape, I used BeautifulSoup to grab the page source and parse the HTML. The page source revealed that the data I needed was wrapped in a table.
So after some googling, I found some code that used 'soup.find_all()' to locate the table rows (tr) and table data (td) and append them to an empty list using a loop.
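To see that loop in action, here is a small standalone sketch using a made-up HTML snippet (the table below is purely an illustration, not the actual report's markup):

```python
from bs4 import BeautifulSoup as bs

# Made-up page source standing in for the real report's table
html = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""

soup = bs(html, 'html.parser')
data = []
for tr in soup.find_all('tr'):
    td = tr.find_all('td')      # header rows use <th>, so they yield no <td> cells
    row = [i.text for i in td]
    data.append(row)

print(data)  # [[], ['alpha', '1'], ['beta', '2']]
```

Note that rows built from `<th>` cells come back as empty lists, which is one reason the scraped DataFrame needs some rows stripped afterward.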
Pass Into Pandas DataFrame
data = pd.DataFrame(data)
data = data[3:-4] #Strip unnecessary rows
data.to_csv(r'yourPath\yourFile.txt', header=None, index=None, sep='\t')
After I had my list that contained my table data, I stored it into a Pandas DataFrame and used 'to_csv()' to save the contents as a .txt file.
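As a self-contained sketch of this last step (the rows and the [1:-1] slice below are made up for illustration; my real report needed [3:-4]), writing to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Made-up scraped rows; the first and last stand in for the junk
# rows that the slice strips out
data = [['header junk', ''], ['alpha', '1'], ['beta', '2'], ['footer junk', '']]

df = pd.DataFrame(data)
df = df[1:-1]  # keep only the middle rows

# StringIO stands in for the real file path
buf = io.StringIO()
df.to_csv(buf, header=None, index=None, sep='\t')
print(buf.getvalue().splitlines())  # ['alpha\t1', 'beta\t2']
```

Passing header=None and index=None suppresses the column headers and row index, so the output file contains only the tab-separated table contents.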
Full Code Example
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
import time
driver = webdriver.Chrome()
url = "Your URL Here"
driver.get(url)
#Login
my_username = 'Your Username'
my_password = 'Your Password'
time.sleep(3) #Give the page time to load
username = driver.find_element_by_name("username")
password = driver.find_element_by_name("password")
username.send_keys(my_username)
password.send_keys(my_password)
submitButton = driver.find_element_by_id("login")
submitButton.click()
#Dropdown selection
driver.find_element_by_xpath('Your xpath here').click()
driver.find_element_by_name('find').click()
#New Tab
driver.switch_to.window(driver.window_handles[1])
#Parse HTML with BeautifulSoup
soup = bs(driver.page_source, 'html.parser')
table_rows = soup.find_all('tr')
data = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    data.append(row)
#Pandas to_csv
data = pd.DataFrame(data)
data = data[3:-4] #Strip unnecessary rows
data.to_csv(r'yourPath\yourFile.txt', header=None, index=None, sep='\t')
Happy Scraping!