Web Scraping 101

This is a simple example of web scraping from Wikipedia using Python. It requires two libraries: beautifulsoup4 and requests. The first step downloads the page and checks the HTTP status code.

import re
import sys

import bs4
import requests

url = "https://en.wikipedia.org/wiki/Jim_Simons_(mathematician)"
r = requests.get(url)

# A 200 status code means the page was retrieved successfully.
if r.status_code == 200:
    html = r.text
    print("Success! Retrieved the page")
else:
    print("Error: downloading failed")
    sys.exit(1)
OUT[1]:
Success! Retrieved the page

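The following code extracts the person's name from the URL and their occupations from the list items inside the first page element with the class "role".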

names = re.findall(r"\w+_", url)  # matches "Jim_Simons_" in the URL
names = names[0].split("_")
names = [i for i in names if i]  # remove empty strings
full_name = " ".join(names)

soup = bs4.BeautifulSoup(html, 'html.parser')
role = soup.find(class_="role")  # first element with the class "role"
role_list = [li.text for li in role.find_all('li')]  # text of each list item
roles = ", ".join(role_list)

print("%s is %s." % (full_name, roles))
OUT[2]:
Jim Simons is mathematician, hedge fund manager, philanthropist.
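
The regular expression above only matches text that is followed by an underscore, so it works here because the title ends in "_(mathematician)". For a URL such as ".../wiki/Alan_Turing" it would return only "Alan". As a minimal sketch of a more general approach, the name could instead be taken from the last path segment of the URL using the standard library. The function name name_from_wiki_url is hypothetical and not part of the original example, and the sketch assumes the article title is the last path segment, with any disambiguation given as a trailing parenthetical.

```python
import re
from urllib.parse import unquote, urlparse


def name_from_wiki_url(url: str) -> str:
    """Return the article title from a Wikipedia-style URL as a readable name.

    Hypothetical helper: assumes the title is the last path segment and that
    any disambiguation appears as a trailing "(...)" part.
    """
    # Take the last path segment, e.g. "Jim_Simons_(mathematician)",
    # and decode percent-escapes such as "%C5%91".
    title = unquote(urlparse(url).path.rsplit("/", 1)[-1])
    # Drop a trailing parenthetical such as "_(mathematician)".
    title = re.sub(r"_?\([^)]*\)$", "", title)
    # Underscores stand in for spaces in article titles.
    return title.replace("_", " ").strip()


print(name_from_wiki_url("https://en.wikipedia.org/wiki/Jim_Simons_(mathematician)"))
```

For the URL used in this post, this prints "Jim Simons", the same value as full_name above. Titles without a parenthetical keep all of their words.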