The script we built to read the horse website is a basic web crawling bot that scrapes data from a site. Python has a great module available, Scrapy, which provides a more full-featured way to quickly extract the data you need from websites. With Scrapy, we write the rules for the data we want extracted and let it do the rest. Let's get Scrapy installed, and then set up our first spider project.

Let's look at the Scrapy installation guide. We see that it runs on Python 2.7 and on Python 3.4 and higher, and that it can be installed using conda or from PyPI with pip. Let's add this package to our project in PyCharm: we want Scrapy, and we install the package. If you find there are issues with your installation, check the platform-specific installation notes in the Scrapy documentation for additional information. Once it's finished installing, you can come out of here and go to a Terminal window.

Let's create a new spider. We'll call it AraneaSpider. Aranea is the name of one of Charlotte's children in the classic children's book Charlotte's Web. It's also the genus name of one of my personal favorite spiders, the orb weaver. So if we do scrapy startproject AraneaSpider, it creates our spider for us.
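For reference, the layout that scrapy startproject generates looks like this (the project name here comes from our example; comments are mine):

```
AraneaSpider/
    scrapy.cfg            # deployment configuration
    AraneaSpider/         # project Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # where our spiders will live
            __init__.py
```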
Running this command handles creating the directory structure and setup for a Scrapy project. Let's see what Scrapy has provided for us. We'll minimize this. Here, under our folder, there's a scrapy.cfg file, which handles deployment configuration, and a project Python module from which we'll import our code. There are also some generated stub files with pretty descriptive names: items, middlewares, pipelines, and settings, each holding its respective configuration. Next is the spiders directory. This is where we'll put our spiders.

Let's talk a little bit about what a couple of these files are used for. items.py is used to define a model of the data for scraped items. Scrapy spiders can return scraped data as Python dicts, but as you know, dicts lack structure. We can use items.py to create containers where we can put the data we get from a site. middlewares.py allows custom functionality to be built to customize the responses that are sent to spiders. pipelines.py is used to customize the processing of data. For example, you could write a pipeline stage that cleanses the HTML; the item would then move down the processing pipeline to be validated, and finally the information would be stored in a database.
Steps along the data-processing path can be put into the pipeline. settings.py allows the behavior of Scrapy components to be customized. In our next video, let's write our first spider. I'll see you shortly.
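One footnote on settings.py before we go: it is also where pipeline stages get switched on. A fragment might look like the following; the project name and pipeline path are illustrative, not something Scrapy generates for you:

```python
# settings.py (fragment)
BOT_NAME = "AraneaSpider"

ROBOTSTXT_OBEY = True   # respect robots.txt while crawling
DOWNLOAD_DELAY = 1.0    # be polite: pause between requests

# Register pipeline stages; lower numbers (range 0-1000) run earlier.
ITEM_PIPELINES = {
    "AraneaSpider.pipelines.CleanHtmlPipeline": 300,
}
```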