
How does this web scraper work & why won't it work in Workspaces?

Hi, so I found this web scraper. Could you explain it to me line by line, please? Also, when I run it in Workspaces I get the error no module named urllib2.

import urllib2
from bs4 import BeautifulSoup
import re

opener = urllib2.build_opener()
opener.addheaders = [('User_agent', 'Mozilla 5.0')]

url = ('https://en.wikipedia.org/wiki/List_of_American_comedy_films')
ourUrl = opener.open(url).read()
Soup = BeautifulSoup(ourUrl)
for link in soup.findAll('a',altrs = {'href':re.compile("^/wiki/")}):
  print (link.text)

body = soup.find(text="Origin").findNext('td')
outfile = open('/projects/training/wikipedia.txt', 'w')
outfile.write(body.text)

Hi Marie Lu, I'll go through the code line by line as you asked. :-)

So the first three lines are import statements. The first one, import urllib2, imports the whole urllib2 package, which can send requests and receive responses over HTTP. The third line does the same for the Python regular-expression package re.

import re

In the second line the script imports the class BeautifulSoup from the bs4 package, which parses documents, for example based on their XHTML structure.

from bs4 import BeautifulSoup

Before we move on: you probably receive the error no module named urllib2 because urllib2 only exists in Python 2, where it is part of the standard library. Workspaces most likely runs Python 3, where that functionality was split into urllib.request and urllib.error, so import urllib2 fails. Installing it with pip will not help, because urllib2 was never a separate package on PyPI; pip list only shows installed third-party packages, and urllib2 will never be among them. The fix is to switch to the Python 3 equivalents, as sketched below.
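A minimal sketch of the Python 3 imports, assuming bs4 is installed (if it is not, pip3 install beautifulsoup4 adds it):

# In Python 3, urllib2's request features live in urllib.request
import urllib.request
from bs4 import BeautifulSoup
import re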

Now, let's move on:

With opener = urllib2.build_opener() you create an opener instance from the urllib2 package. This means you will be able to use its functions, such as opening a request to a URL, which happens later.

The following line just adds a user agent header to your opener instance, so a web server will process your request like one coming from a regular browser. Note that the header name is conventionally spelled User-agent with a hyphen, and the usual value is Mozilla/5.0 with a slash.

opener.addheaders = [('User_agent', 'Mozilla 5.0')]

Here you assign your desired URL as a string to the variable url. From my point of view, you do not need the parentheses around the string.

url = ('https://en.wikipedia.org/wiki/List_of_American_comedy_films')

In the next line you pass your URL to your opener instance and read the response. The response body is assigned to the variable ourUrl.

ourUrl = opener.open(url).read()
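For reference, under Python 3 these request steps would look roughly like this; note that read() returns bytes, which BeautifulSoup accepts directly:

import urllib.request

opener = urllib.request.build_opener()
# Conventional header spelling: 'User-agent' with a hyphen, value 'Mozilla/5.0'
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = 'https://en.wikipedia.org/wiki/List_of_American_comedy_films'
ourUrl = opener.open(url).read()  # the response body as bytes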

What you want now is to parse that response, which is why you need to instantiate BeautifulSoup. This is done in the next line, passing in the response stored in the variable ourUrl.

Soup = BeautifulSoup(ourUrl)
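One small improvement: current versions of BeautifulSoup print a warning when no parser is named explicitly, so it is safer to pass one yourself, for example html.parser from the standard library:

from bs4 import BeautifulSoup

# ourUrl is the response body read in the previous step
Soup = BeautifulSoup(ourUrl, 'html.parser')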

The actual parsing happens now. Here, I think, we have two little typos: the loop uses soup.findAll although the variable was defined as Soup in the line above, and the keyword argument is misspelled altrs instead of attrs.

What this line is meant to do is go through all anchor elements in the response whose href attribute matches the regular expression ^/wiki/ and print each link's anchor text.

for link in soup.findAll('a',altrs = {'href':re.compile("^/wiki/")}):
  print (link.text)
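With both typos fixed, the loop would read:

# Print the anchor text of every link whose href starts with /wiki/
for link in Soup.findAll('a', attrs={'href': re.compile("^/wiki/")}):
    print(link.text)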

In the body line we search for the text Origin and grab the following td element via findNext. Note that this line again uses lowercase soup, so it has the same case typo as the loop above.
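Corrected, the line would read:

# Find the string "Origin" and jump to the next td element after it
body = Soup.find(text="Origin").findNext('td')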

The rest of the code opens the file /projects/training/wikipedia.txt for writing and writes the text of your body variable into it.
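One more suggestion: a with block closes the file for you automatically once the write is done. A sketch, assuming the /projects/training directory exists in your Workspace:

# 'w' opens the file for writing, replacing any existing content
with open('/projects/training/wikipedia.txt', 'w') as outfile:
    outfile.write(body.text)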

I hope that helps; just reply if you have any further questions.

Cheers,
Urs