  Let's use the Python library Scrapy to create a spider to crawl the web.
Additional Resources
- Python List Comprehensions
- Scrapy Response object
                      Inside the spiders folder, let's create
a new file to crawl our sample horse site.
                      0:00
                    
                    
                      So Spiders > File > New File > Python,
horse.py.
                      0:06
                    
                    
                      To start with, we need to import scrapy.
                      0:14
                    
                    
                      And then we'll write a class,
we'll call it HorseSpider,
                      0:20
                    
                    
                      which we'll inherit from scrapy.Spider.
                      0:26
                    
                    
                      Now, we need to give our HorseSpider
a name, let's call it ike,
                      0:30
                    
                    
                      after the horse in Charlotte's Web.
                      0:34
                    
                    
                      Spider names must be unique
within a scrapy project.
                      0:39
                    
                    
                      That way, scrapy knows which spider
to run in the project.
                      0:42
                    
                    
                      We'll use it to run our
spider in just a bit.
                      0:46
                    
                    
                      There are two functions
we need to write in here.
                      0:48
                    
                    
                      start_requests, which defines
the initial requests to be made and
                      0:51
                    
                    
                      if applicable, how to follow links.
                      0:55
                    
                    
                      So we'll define start_requests.
                      0:59
                    
                    
                      For now, we'll just pass.
                      1:01
                    
                    
                      The other function is parse which will
tell the spider how extracted data is to
                      1:03
                    
                    
                      be parsed.
                      1:08
                    
                    
                      Inside start_requests, we provide
a list of URLs that we want to process.
                      1:11
                    
                    
                      So urls, and we pass in a list.
                      1:17
                    
                    
                      So we'll pass in our index and
our mustang.html pages.
                      1:21
                    
                    
                      The whole URL is treehouse-projects.
                      1:26
                    
                    
                      github.io/horseland/index.html.
                      1:33
                    
                    
                      We'll paste that in and
change it to mustang.
                      1:43
                    
                    
                      Then we need to return a scrapy.Request.
                      1:49
                    
                    
                      This is a list comprehension.
                      1:52
                    
                    
                      It's going to create a new list of
requests by looping through each of our URLs.
                      1:54
                    
                    
                      More in the teacher's notes.
                      2:00
                    
                    
                      So we wanna return a list
of scrapy.Request.
                      2:01
                    
                    
                      We want our url to be url,
our callback is gonna be self.parse.
                      2:10
                    
                    
                      We want that for url in urls.
                      2:17
                    
                    
                      This line is looping through our urls list
and, for each one, creating a request whose response goes to the parse method.
                      2:21
                    
                    
                      Let's update that method to do something.
                      2:26
                    
                    
                      We could do a lot of
things inside this method.
                      2:31
                    
                    
                      How you parse the data on
a site will be highly dependent
                      2:33
                    
                    
                      on the purpose of your project, since
every use case can be a little different.
                      2:37
                    
                    
                      For now,
let's just save the entire HTML file.
                      2:42
                    
                    
                      So we'll define a url, the response.url.
                      2:46
                    
                    
                      This response object
represents an HTTP response
                      2:50
                    
                    
                      from the request we
made in start_requests.
                      2:54
                    
                    
                      It's usually downloaded by the downloader
and fed to the spiders for processing.
                      2:57
                    
                    
                      See the teacher's notes for additional
documentation on scrapy's response object.
                      3:03
                    
                    
                      So with our url,
we wanna get a specific page.
                      3:08
                    
                    
                      We'll split it on the last slash there,
                      3:13
                    
                    
                      and our file name we'll call it horses.
                      3:20
                    
                    
                      We'll format that with our page and
we'll print out what the URL is,
                      3:25
                    
                    
                      and then we'll save our page.
                      3:36
                    
                    
                      I'm going to just write
the entire response body.
                      3:44
                    
                    
                      Then we'll print out the saved file name.
                      3:49
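The body of parse can be tried out without running a crawl. The sketch below is a standalone function, not the spider method itself: save_page, out_dir, and the "horses-" naming pattern are hypothetical names chosen for illustration, and the response is a stand-in object that only has the url and body attributes the steps above rely on.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_page(response, out_dir="."):
    """Save the raw response body to a file named after the page."""
    url = response.url
    # Everything after the last slash, e.g. "index.html".
    page = url.split("/")[-1]
    # Hypothetical naming pattern: "horses-" plus the page name.
    filename = "horses-%s" % page
    print("URL is: {}".format(url))
    path = Path(out_dir) / filename
    # The response body is bytes, so write it in binary form.
    path.write_bytes(response.body)
    print("Saved file %s" % filename)
    return path


# Stand-in for a Scrapy response: only .url and .body are needed here.
fake_response = SimpleNamespace(
    url="https://treehouse-projects.github.io/horseland/index.html",
    body=b"<html><body>Horse Land</body></html>",
)

with tempfile.TemporaryDirectory() as tmp:
    saved = save_page(fake_response, out_dir=tmp)
    print(saved.name)
```

Inside the spider, the same steps would sit in parse(self, response) and write into the directory the crawl is run from.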
                    
                    
                      Nice, now in a terminal window,
we navigate to our spiders directory.
                      3:57
                    
                    
                      And tell scrapy to crawl
using our spider name.
                      4:11
                    
                    
                      So we do scrapy crawl ike.
                      4:15
                    
                    
                      If we look at output in our terminal,
we can find,
                      4:19
                    
                    
                      come up here a little bit,
to right in here.
                      4:23
                    
                    
                      We see that the spider looked for
                      4:29
                    
                    
                      our robots.txt file, which it didn't
find since the site doesn't have one.
                      4:30
                    
                    
                      See this 404 code here?
                      4:35
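The robots.txt lookup comes from a Scrapy setting. The fragment below assumes the project's settings.py still has the value that scrapy startproject generates; when the file is missing, as the 404 shows, the crawl simply continues.

```python
# settings.py (fragment, assuming the generated default is unchanged)
# When True, Scrapy requests robots.txt first and respects its rules.
ROBOTSTXT_OBEY = True
```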
                    
                    
                      After the robots.txt request, the pages
                      4:37
                    
                    
                      that we included in our URLs list were
found and saved from the parse method.
                      4:41
                    
                    
                      There's the URLs, there's the file names,
                      4:46
                    
                    
                      we'll come back up here,
there they are, very nice.
                      4:50
                    
                    
                      Great work on writing your first spider.
                      4:55
                    
                    
                      We saw the two methods that a scrapy
spider needs, start_requests and parse.
                      4:58
                    
                    
                      We put in a list of URLs in
the start_requests method and
                      5:05
                    
                    
                      had it loop through that list and
process each URL with the parse method.
                      5:08
                    
                    
                      We could have our parse method
do something more powerful
                      5:13
                    
                    
                      than just saving the entire file.
                      5:16
                    
                    
                      But this is a nice start.
                      5:18
                    
                    
                      Next up though,
                      5:20
                    
                    
                      let's see how to write a spider that will
crawl more URLs than what we give it.
                      5:21
                    
              