Installing Scrapy (3:18) with Ken Alger
Getting up and going with the Scrapy library.
The script we built to read the horse website is a basic web crawling bot to scrape data from a site. Python has a great module available which provides a more full-featured way to quickly extract the data you need from websites. With Scrapy, we write the rules regarding the data we want extracted and let it do the rest. Let's get Scrapy installed and then set up our first spider project.

Let's look at the Scrapy installation guide. We see that it will run on Python 2.7 and on Python 3.4 and higher, and can be installed using conda or from PyPI with pip. Let's add this package to our project in PyCharm. We want Scrapy. And install the package. If you find there are issues with your installation, check the platform-specific installation notes in the Scrapy documentation for additional information.

Once it's finished installing, you can come out of here. Go to a Terminal window. Let's create a new spider. We'll call it AraneaSpider. Aranea is one of Charlotte's children's names in the classic children's book, Charlotte's Web. It's also the genus name of one of my personal favorite spiders, the orb weaver. So if we do scrapy startproject AraneaSpider, it creates our spider project for us. Running this command handles creating the directory structure and setup for a Scrapy project.

Let's see what Scrapy has provided for us. We'll minimize this. So here, under our folder, there's a scrapy.cfg file, which handles deployment configuration, and a project Python module from which we'll import our code. And there are some stub files that are generated. Their names are pretty descriptive: items, middlewares, pipelines, and settings each hold what their name suggests.

Next is the spiders directory. This is where we'll put our spiders. Let's talk a little bit about what a couple of these files are used for.
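Before looking at those files, here is a minimal sketch of a script that confirms the layout described above actually exists on disk. It assumes Scrapy's default project template and that the project was named AraneaSpider; the helper name missing_project_files is hypothetical and not part of Scrapy.

```python
from pathlib import Path

# Files that `scrapy startproject AraneaSpider` is expected to generate with
# the default template, relative to the outer project folder (an assumption).
EXPECTED_FILES = [
    "scrapy.cfg",
    "AraneaSpider/__init__.py",
    "AraneaSpider/items.py",
    "AraneaSpider/middlewares.py",
    "AraneaSpider/pipelines.py",
    "AraneaSpider/settings.py",
    "AraneaSpider/spiders/__init__.py",
]


def missing_project_files(project_root):
    """Return the expected files that are not present under project_root."""
    root = Path(project_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumes this is run from the folder where startproject was executed.
    missing = missing_project_files("AraneaSpider")
    if missing:
        print(f"Missing files: {missing}")
    else:
        print("All expected files found.")
```

If the script reports missing files, the platform-specific installation notes mentioned earlier are a reasonable first place to look.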
items.py is used to define a model of data for scraped items. Scrapy spiders can return scraped data as Python dicts. As you know, dicts lack structure. We can use items.py to create containers where we can put the data we get from a site.

Middlewares let you build custom functionality that customizes the responses sent to spiders.

pipelines.py is used to customize the processing of data. For example, you could write a pipeline that cleanses the HTML, passes the item down the processing pipeline to be validated, and then stores the information in a database. Steps along the data processing path can be put into the pipeline.

settings.py allows for the behavior of Scrapy components to be customized.

In our next video, let's write our first spider. I'll see you shortly.
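The cleanse-then-validate idea described above can be sketched in plain Python. A Scrapy item pipeline is an ordinary class with a process_item(self, item, spider) method, so this sketch runs without Scrapy installed. The class names, the "name" field, and the run_pipelines helper are hypothetical, chosen only for illustration, and the database step is left out.

```python
import re

# Naive tag matcher; good enough for a sketch, not a full HTML parser.
TAG_PATTERN = re.compile(r"<[^>]+>")


class CleanHtmlPipeline:
    """Hypothetical first stage: strip HTML tags from every string field."""

    def process_item(self, item, spider):
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = TAG_PATTERN.sub("", value).strip()
        return item


class ValidateNamePipeline:
    """Hypothetical second stage: reject items that have no 'name' value."""

    def process_item(self, item, spider):
        if not item.get("name"):
            # In a real Scrapy project you would typically raise
            # scrapy.exceptions.DropItem here instead of ValueError.
            raise ValueError("item is missing a name")
        return item


def run_pipelines(item, pipelines, spider=None):
    """Stand-in for Scrapy's engine: pass the item through each stage in order."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


if __name__ == "__main__":
    scraped = {"name": "<b>Charlotte</b> ", "age": 3}
    print(run_pipelines(scraped, [CleanHtmlPipeline(), ValidateNamePipeline()]))
```

In an actual project, classes like these would live in pipelines.py and be switched on through the ITEM_PIPELINES setting in settings.py, where the number given to each class controls the order in which the stages run.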