Nathan English
Front End Web Development Techdegree Student 10,817 Points
What am I doing wrong?
I'm very new to web scraping, and I'm trying to read some information off a website. It seems like the URL doesn't work.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.battlemetrics.com/servers/ark/2339725")
soup = BeautifulSoup(html.read(), "html.parser")
for th in soup.find_all('th'):
    print(th)
1 Answer
Brendan Whiting
Front End Web Development Techdegree Graduate 84,738 Points
This is the error I get:
HTTP Error 403: Forbidden
Basically, the problem is that the server can tell the request is coming from a script rather than a browser, and a lot of people don't want their webpages to be scraped :P. By default, urllib identifies itself with a User-Agent header like "Python-urllib/3.x", which is most likely what triggers the 403 here.
I found this Stack Overflow answer with a solution and modified your code by adding the headers it describes, and it seems to work. It's sort of like having the bot disguise itself with a header that says "I'm a Mozilla browser!", and the server says "OK, I believe you".
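If you're curious what the server actually sees, here is a small sketch that prints the User-Agent urllib sends when you don't set one yourself. The helper name default_user_agent is made up for illustration; build_opener and the addheaders attribute come from the standard library.

```python
from urllib.request import build_opener


def default_user_agent():
    """Return the User-Agent string urllib sends when none is set explicitly."""
    # A fresh opener carries urllib's default headers as (name, value) pairs.
    opener = build_opener()
    return dict(opener.addheaders).get("User-agent")


# Prints a value starting with "Python-urllib/", followed by the Python version.
print(default_user_agent())
```

A site that filters on this header can spot the request as a script without looking at anything else.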
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
req = Request('https://www.battlemetrics.com/servers/ark/2339725', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, "html.parser")
for th in soup.find_all('th'):
    print(th)
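If you want the script to report a refusal instead of crashing with a traceback, one option is to wrap the request in a try/except. This is only a sketch: the names BROWSER_HEADERS, build_request, and fetch_html are my own, the 10-second timeout is an arbitrary choice, and I'm assuming the site keeps accepting a browser-like User-Agent.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Assumption: a browser-like User-Agent is enough for the site to accept us.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_request(url):
    """Return a Request that carries the browser-like User-Agent header."""
    return Request(url, headers=BROWSER_HEADERS)


def fetch_html(url):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=10) as response:
            return response.read()
    except HTTPError as err:
        # Raised for error responses such as 403 Forbidden.
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print(f"Server refused the request: {err.code} {err.reason}")
    except URLError as err:
        # Raised when the server can't be reached at all (DNS, connection).
        print(f"Could not reach the server: {err.reason}")
    return None
```

When fetch_html returns bytes rather than None, you can pass them to BeautifulSoup(..., "html.parser") exactly as in the code above.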