Getting Data: Scraping Websites with Python

Scraping Websites

This tutorial will help you learn how to scrape password-protected websites using Python. I’ll be giving you an example of how to scrape your own LinkedIn profile. While your task might vary, the process is the same. Remember to use caution and don’t break any rules while you’re building these automations. Even when scraping is legal, you can still get banned or cause grief for the reliability engineers responsible for the sites you visit.

Installing Dependencies

I’m using Debian, so you may need to adapt the steps to get your installation right.

1. Install Python packages: Selenium and Beautiful Soup

The Selenium docs for python are a good reference if you have trouble with any of this.

I always build my projects in a virtual environment,

python3 -m virtualenv -p python3 .
source bin/activate
pip install selenium bs4

2. Install OS packages: Chromium and ChromeDriver

First install chromium and check the version.

sudo apt-get install chromium
chromium --version

You have to match the Chromium version to the ChromeDriver version or Selenium won’t work. Here is the list of chromedriver versions. Debian ships an older version (90.0.4430.24 as of Jan 2, 2022).
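
For example, to fetch the driver matching Debian's Chromium 90, you can use the URL pattern the chromedriver download site used for these older (pre-115) releases. Swap in whatever version chromium --version reported:

wget https://chromedriver.storage.googleapis.com/90.0.4430.24/chromedriver_linux64.zip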

Once you download the ChromeDriver zip, unzip it, mark the binary as executable, and move it into your path somewhere

cd ~/downloads
unzip chromedriver_linux64.zip
chmod +x chromedriver
sudo mv chromedriver /usr/local/bin

Your download location and choice of bin location may vary. You should be able to check that the executable is on your path using the which command,

which chromedriver

If you don’t get any results, then you need to check your PATH.

echo $PATH

You can either update your PATH to include the directory where chromedriver lives (not recommended) or move the executable somewhere already in your PATH.
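
Once everything is in place, it’s worth double-checking that the two versions actually line up:

chromium --version
chromedriver --version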

Scraping your own LinkedIn Profile

Next you need to get your environment set up. The only thing we’ll need for this is your LinkedIn login credentials. I exported my LinkedIn username and password to the shell. You might do that, or save them to a file and read them in. Neither is great for deployed production systems, but both are fine for local development work. In a professional setting, you might consider encrypting the credentials, then reading them into memory and decrypting them there. If you have another website in mind, feel free to change the environment variable names to something better.
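
As a minimal sketch of the environment-variable approach (the variable names here match what the scripts below expect), you could fail fast when the credentials are missing:

import os

# Read credentials from the environment; exit early with a clear message if unset
username = os.getenv('LINKEDIN_USERNAME')
password = os.getenv('LINKEDIN_PASSWORD')
if not username or not password:
    raise SystemExit('Set LINKEDIN_USERNAME and LINKEDIN_PASSWORD before running')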

Here is a simple script to log in and then print out the page source. I saved it as print_source.py.

import os
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By

driver = Chrome()
driver.get('https://www.linkedin.com/login')

# Fill in the login form and submit it
driver.find_element(By.ID, "username").send_keys(os.getenv('LINKEDIN_USERNAME'))
driver.find_element(By.ID, "password").send_keys(os.getenv('LINKEDIN_PASSWORD'))
driver.find_element(By.CLASS_NAME, "login__form_action_container").click()

page_html_source = driver.page_source
print(page_html_source)

driver.quit()

If you have a different website in mind, you may need to inspect the elements in the browser to find the right ID or CLASS_NAME for the form fill. There are plenty of tutorials out there on inspecting elements for web scraping. The Selenium Locating Elements docs show you the ways you can search for the right container.
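
For example, if your target site’s login form doesn’t use IDs, the same find_element call accepts other locator strategies. The selectors below are hypothetical and assume the driver from the script above:

from selenium.webdriver.common.by import By

driver.find_element(By.NAME, 'session_key')                 # match on the name attribute
driver.find_element(By.CSS_SELECTOR, 'input#username')      # match with a CSS selector
driver.find_element(By.XPATH, "//button[@type='submit']")   # match with an XPath expression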

We can run it like this:

export LINKEDIN_USERNAME='myusername'
export LINKEDIN_PASSWORD='mypassword'
python print_source.py

You should see a whole lot of HTML output.
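
If a terminal full of HTML is hard to work with, a small variation on the end of print_source.py writes it to a file instead (the filename is arbitrary):

# Save the source so you can open it in an editor or browser
with open('page_source.html', 'w') as f:
    f.write(page_html_source)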

Parsing the page source

Next, I want to parse the page source to get some basic info. I’ll save this script as print_contact_info.py.

import os
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

driver = Chrome()
try:
    # Log in, then navigate straight to the contact-info modal on your profile
    driver.get('https://www.linkedin.com/login')
    driver.find_element(By.ID, "username").send_keys(os.getenv('LINKEDIN_USERNAME'))
    driver.find_element(By.ID, "password").send_keys(os.getenv('LINKEDIN_PASSWORD'))
    driver.find_element(By.CLASS_NAME, "login__form_action_container").click()
    driver.get(f"https://www.linkedin.com/in/{os.getenv('LINKEDIN_VANITY_URL')}/detail/contact-info/")

    # Wait up to 15 seconds for the modal content to appear
    WebDriverWait(driver, timeout=15).until(lambda d: d.find_element(By.CLASS_NAME, "pv-profile-section__section-info"))

    # Each contact method lives in its own section, tagged with one of these classes
    contact_methods = [
        "ci-websites",
        "ci-email",
        "ci-phone",
        "ci-twitter",
        "ci-ims"
    ]
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for cm in contact_methods:
        print(cm)
        results = soup.find_all("section", {"class": f"pv-contact-info__contact-type {cm}"})
        for result in results:
            # The section text splits into a type, a value, and an optional detail
            r_array = result.text.split()
            r_type = r_array[0]
            r_info = r_array[1]
            r_detail = None
            if len(r_array) > 2:
                r_detail = ' '.join(r_array[2:])
            print(f"{r_type} {r_info} {r_detail}")
except Exception as e:
    print(e)
finally:
    driver.quit()

We added a new environment variable, LINKEDIN_VANITY_URL, for the vanity name in your LinkedIn profile URL. Now you can run it like this:

export LINKEDIN_USERNAME='myusername'
export LINKEDIN_PASSWORD='mypassword'
export LINKEDIN_VANITY_URL='myvanityurl'
python print_contact_info.py

Here we added a few more bells and whistles:

  • I put the entire body in a try/except/finally block so that the browser quits cleanly even if there is an error.
  • I added a WebDriverWait just in case the modal takes a little longer to load. It’s not strictly necessary here, since the information is already present when the page finishes loading, but it’s useful when something takes a while to show up after the page loads; see the expected_conditions version after this list.
  • We’re using Beautiful Soup to parse the HTML. In general Beautiful Soup is easier to work with when parsing through the page source.
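
If you’d rather use Selenium’s built-in wait conditions than a bare lambda, the expected_conditions module expresses the same wait. This sketch assumes the driver and locator from the script above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Block for up to 15 seconds until the contact-info section is present in the DOM
WebDriverWait(driver, timeout=15).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'pv-profile-section__section-info'))
)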

Conclusion

Hopefully you now have a working example of how to use Selenium, Chromium, and Beautiful Soup in Python to get information from a password-protected account. Remember to be careful when using automations like this, because it’s possible to get banned from the sites you love to use.