Python

What am I doing wrong?

I'm very new to web scraping, and I'm trying to read some information off a website. It seems like the URL doesn't work.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.battlemetrics.com/servers/ark/2339725")
soup = BeautifulSoup(html.read(), "html.parser")

for th in soup.find_all('th'):
    print(th)

1 Answer

Brendan Whiting
Front End Web Development Techdegree Graduate 84,738 Points

This is the error I get:

HTTP Error 403: Forbidden

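Basically, the problem is that the server can tell the request is coming from a script rather than a browser, and a lot of site owners don't want their pages scraped :P. By default, urllib identifies itself with a User-Agent header along the lines of "Python-urllib/3.x", and some sites answer that with a 403.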
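If you want to see what urllib announces itself as, here is a small sketch that prints the default headers without making any network request. The variable names are just for illustration.

```python
from urllib.request import build_opener

# build_opener() returns the same kind of opener that urlopen() uses by default.
opener = build_opener()

# addheaders is a list of (name, value) tuples sent with every request.
default_headers = dict(opener.addheaders)

# urllib stores the header name as "User-agent" (lowercase "a").
print(default_headers["User-agent"])
```

The printed value starts with "Python-urllib/" followed by the Python version, which is an easy signature for a site to block.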

I found a Stack Overflow answer with a solution and modified your code by adding the headers it describes, and it seems to work. It's as if the bot disguises itself with a header that says "I'm a Mozilla browser!" and the site replies "OK, I believe you".

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

# Wrap the URL in a Request so we can send a browser-like User-Agent header.
req = Request('https://www.battlemetrics.com/servers/ark/2339725', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

# Parse the downloaded HTML and print every table header cell.
soup = BeautifulSoup(webpage, "html.parser")
for th in soup.find_all('th'):
    print(th)
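If the site still refuses the request at some point, it helps to catch the error instead of letting the script crash. Below is a minimal sketch using only the standard library; `build_request` and `fetch_html` are hypothetical helper names I made up, and the 10-second timeout is an arbitrary choice.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 403.
        print(f"Server refused the request: {err.code} {err.reason}")
    except URLError as err:
        # No HTTP response at all (DNS failure, refused connection, ...).
        print(f"Could not reach the server: {err.reason}")
    return None
```

You would call `fetch_html` with the page URL, check that the result is not None, and then pass it to `BeautifulSoup(..., "html.parser")` exactly as in the code above. HTTPError has to be caught before URLError because it is a subclass of it.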