

Jonathan Kuhl
26,133 Points

Not getting any data at all from spider

I've been following the video and made my spider with almost no changes to the code, but I'm getting no results from it. The crawl report says zero web pages were crawled, even though the URLs were copied and pasted from the pages Treehouse provided:

import scrapy

class HorseSpider(scrapy.Spider):
    name = 'ike'

    def start_request(self):
        urls = [
            'https://treehouse-projects.github.io/horse-land/index.html',
            'https://treehouse-projects.github.io/horse-land/mustang.html'
        ]
        return [scrapy.Request(url=url, callback=self.parse) for url in urls]

    def parse(self, response):
        url = response.url
        page = url.split('/')[-1]
        filename = 'horses-%s' % page
        print('URL: {}'.format(url))
        with open(filename, 'wb') as file:
            file.write(response.body)
        print('Saved as %s' % filename)

What am I missing?

2 Answers

That's just a convention. Yes, you can call it start_request or start_requests, but you do have to be consistent. Good catch pointing out that detail, though, Trevor!

Actually, now that I've reached the end of the video and @kenalger says "we need start_requests and parse," I'm not so sure...
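
For what it's worth, the name does matter: when a crawl starts, Scrapy calls a method named exactly start_requests on the spider. A method named start_request is never invoked, and since the spider in the question defines no start_urls attribute either, the built-in start_requests has nothing to yield, which matches the "zero pages crawled" report. A minimal sketch of the simpler pattern, reusing the spider name and URLs from the question:

import scrapy

class HorseSpider(scrapy.Spider):
    name = 'ike'

    # With start_urls defined, Scrapy's built-in start_requests
    # yields a Request for each URL and routes responses to parse.
    start_urls = [
        'https://treehouse-projects.github.io/horse-land/index.html',
        'https://treehouse-projects.github.io/horse-land/mustang.html',
    ]

    def parse(self, response):
        # Placeholder; the parse method from the question works here unchanged.
        print('URL: {}'.format(response.url))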

jessechapman
4,088 Points

I made the same error as the original poster (defining a "start_request" function). Changing it to start_requests fixed it for me.
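
For anyone who lands here later, this is the spider from the question with only the method renamed (start_request to start_requests); everything else is as originally posted:

import scrapy

class HorseSpider(scrapy.Spider):
    name = 'ike'

    def start_requests(self):  # note the trailing "s" -- Scrapy looks this name up
        urls = [
            'https://treehouse-projects.github.io/horse-land/index.html',
            'https://treehouse-projects.github.io/horse-land/mustang.html'
        ]
        return [scrapy.Request(url=url, callback=self.parse) for url in urls]

    def parse(self, response):
        url = response.url
        page = url.split('/')[-1]
        filename = 'horses-%s' % page
        print('URL: {}'.format(url))
        with open(filename, 'wb') as file:
            file.write(response.body)
        print('Saved as %s' % filename)

Returning a list works because Scrapy simply iterates whatever start_requests gives back; yielding each Request one at a time is the more common idiom.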