Installing Scrapy

3:18 with Ken Alger

Getting up and going with the Scrapy library.

Additional Resources

Scrapy web site
Scrapy installation guide

The script we built to read the horse website 0:00

is a basic web crawling bot to scrape data from a site. 0:02

Python has a great module available which provides a more full-featured way 0:07

to quickly extract data you need from websites. 0:12

With Scrapy, we write the rules regarding the data we want extracted and 0:15

let it do the rest. 0:20

Let's get Scrapy installed and then set up our first spider project. 0:21

Let's look at the Scrapy installation guide. 0:26

We see that it will run on Python 2.7 and 3.4 and higher, 0:29

and can be installed using conda or from PyPI with pip. 0:34
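
Before moving on, it can help to confirm that the interpreter your project uses can actually see the package. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration, and the install commands in the comments are the usual pip and conda-forge routes.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("scrapy"):
        print("Scrapy is available.")
    else:
        # Typical install commands for the two routes mentioned above:
        #   pip install scrapy
        #   conda install -c conda-forge scrapy
        print("Scrapy not found; install it with pip or conda.")
```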

Let's add this package to our project in PyCharm. 0:39

We want Scrapy. 0:48

And install the package. 0:53

If you find there are issues with your installation, check the platform-specific 0:54

installation notes in the Scrapy documentation for additional information. 0:59

Once it's finished installing, you can come out of here. 1:03

Go to a Terminal window. 1:08

Let's create a new spider. 1:12

We'll call it AraneaSpider. 1:13

Aranea is one of Charlotte's children's names in the classic children's book, 1:16

Charlotte's Web. 1:21

It's also the genus name of one of my personal favorite spiders, the orb weaver. 1:23

So if we do scrapy startproject AraneaSpider, 1:28

it creates our spider project for us. 1:34

Running this command handles creating the directory structure and setup for 1:37

a Scrapy project. 1:42

Let's see what Scrapy has provided for us. 1:43

We'll minimize this. 1:46

So here, under our folder, there's a scrapy.cfg file, which handles deployment 1:48

configuration, and a project Python module from which we'll import our code. 1:54

And there are some stub files that are generated. 1:59

Their names are pretty descriptive: items, middlewares, 2:02

pipelines, and settings. Each holds the code or configuration its name suggests. 2:06

Next is the spiders directory. 2:11

This is where we'll put our spiders. 2:13
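
The layout just described can be checked with a short script. This is only a sketch: the file list reflects what `scrapy startproject AraneaSpider` typically generates, and the helper name `missing_files` is hypothetical.

```python
from pathlib import Path
from typing import List

# Files that `scrapy startproject AraneaSpider` is expected to generate,
# relative to the outer project folder.
EXPECTED_FILES = [
    "scrapy.cfg",                        # deployment configuration
    "AraneaSpider/__init__.py",          # the project's Python module
    "AraneaSpider/items.py",
    "AraneaSpider/middlewares.py",
    "AraneaSpider/pipelines.py",
    "AraneaSpider/settings.py",
    "AraneaSpider/spiders/__init__.py",  # spiders go in this directory
]


def missing_files(project_root: str) -> List[str]:
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # An empty list means the project skeleton is complete.
    print(missing_files("AraneaSpider"))
```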

Let's talk a little bit about what a couple of these files are used for. 2:16

items.py is used to define a model of data for scraped items. 2:20

Scrapy spiders can return scraped data as Python dicts. 2:25

As you know, dicts lack structure. 2:29

We can use items.py to create containers, 2:33

where we can put the data we get from a site. 2:37
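
Here is a rough sketch of the kind of container that could live in items.py. The stub Scrapy generates subclasses scrapy.Item and declares scrapy.Field attributes; to keep this example runnable with the standard library alone, it uses a dataclass instead, which recent Scrapy versions also accept as an item type. The class name and its fields are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class HorseItem:
    """Hypothetical container for one scraped horse listing."""

    name: str
    breed: Optional[str] = None
    url: Optional[str] = None


# Unlike a plain dict, the container rejects misspelled field names.
item = HorseItem(name="Example Horse", url="https://example.com/horses/1")

# Convert back to a dict when a plain mapping is needed, e.g. for export.
record = asdict(item)
```

Passing a field the class does not declare, such as `HorseItem(nme="x")`, raises a TypeError, which is the structure a bare dict lacks.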

Middlewares allow for custom functionality to be built to customize the responses 2:40

that are sent to spiders. 2:45
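
As an illustration of that idea, here is a minimal sketch of a downloader middleware that inspects each response on its way to the spider. The class name is hypothetical, and the `process_response(self, request, response, spider)` signature is the long-documented form; check the documentation for your Scrapy version before relying on it.

```python
import logging

logger = logging.getLogger(__name__)


class StatusLoggingMiddleware:
    """Hypothetical downloader middleware that logs non-200 responses."""

    def process_response(self, request, response, spider):
        if response.status != 200:
            logger.warning("Got status %s for %s", response.status, request.url)
        # Returning the response passes it along toward the spider.
        return response
```

A middleware like this only takes effect once it is enabled in the project's settings.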

pipelines.py is used to customize the processing of data. 2:46

For example, you could write a pipeline that would cleanse the HTML, 2:51

then move down the processing pipeline to be validated, 2:56

then store the information into a database. 3:00

Steps along the data processing path can be put into the pipeline. 3:03
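
The cleanse-then-validate flow can be sketched as two small pipeline classes. This is an assumption-laden example rather than the course's code: the class names are hypothetical, items are plain dicts, the database step is left out, and a ValueError stands in for the scrapy.exceptions.DropItem a real pipeline would raise. The `process_item(self, item, spider)` method is the hook Scrapy's item pipelines document.

```python
import re

# Naive tag matcher; good enough for a sketch, not for hostile HTML.
TAG_PATTERN = re.compile(r"<[^>]+>")


class CleanHtmlPipeline:
    """Strip HTML tags and surrounding whitespace from every string field."""

    def process_item(self, item, spider):
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = TAG_PATTERN.sub("", value).strip()
        return item


class RequireNamePipeline:
    """Reject items that have no name left after cleaning."""

    def process_item(self, item, spider):
        if not item.get("name"):
            # A real Scrapy pipeline would raise scrapy.exceptions.DropItem.
            raise ValueError("item is missing a name")
        return item


def run_pipelines(item, pipelines, spider=None):
    """Pass the item through each stage in order, mimicking Scrapy's flow."""
    for stage in pipelines:
        item = stage.process_item(item, spider)
    return item


cleaned = run_pipelines(
    {"name": " <b>Aranea</b> ", "age": 3},
    [CleanHtmlPipeline(), RequireNamePipeline()],
)
```

In a real project Scrapy calls each stage itself, in the order given by the ITEM_PIPELINES setting, so the `run_pipelines` helper exists only to make the sketch runnable on its own.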

settings.py allows for the behavior of Scrapy components to be customized. 3:08
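
For a sense of what that customization looks like, here is a short excerpt of the kind of values settings.py holds. The setting names are real Scrapy settings; the specific values and the pipeline path are assumptions chosen for illustration.

```python
# settings.py (excerpt)

BOT_NAME = "AraneaSpider"

# Where Scrapy looks for spiders, and where new ones are generated.
SPIDER_MODULES = ["AraneaSpider.spiders"]
NEWSPIDER_MODULE = "AraneaSpider.spiders"

# Respect each site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Wait one second between requests to the same site (illustrative value).
DOWNLOAD_DELAY = 1.0

# Enable a hypothetical pipeline class; lower numbers run earlier.
ITEM_PIPELINES = {
    "AraneaSpider.pipelines.CleanHtmlPipeline": 300,
}
```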

In our next video, let's write our first spider. 3:13

I'll see you shortly. 3:16
