1 00:00:00,000 --> 00:00:04,703 [MUSIC] 2 00:00:04,703 --> 00:00:07,457 We've managed to make a couple of spiders that were great for 3 00:00:07,457 --> 00:00:10,120 sites that don't require interaction. 4 00:00:10,120 --> 00:00:13,955 But many sites do indeed require some sort of interaction. 5 00:00:13,955 --> 00:00:17,350 For example, logging in to a site with a username and 6 00:00:17,350 --> 00:00:20,310 password requires a form submission. 7 00:00:20,310 --> 00:00:21,850 There are many different reasons for 8 00:00:21,850 --> 00:00:25,520 needing to work with forms when getting and scraping data. 9 00:00:25,520 --> 00:00:28,510 Let's head back into our code to take a look at some techniques. 10 00:00:30,440 --> 00:00:34,010 Our Horse Land site is hosted on GitHub Pages, 11 00:00:34,010 --> 00:00:37,120 which doesn't support backend technologies. 12 00:00:37,120 --> 00:00:43,010 So we'll be using a bit of a workaround from Formspree to handle the form posts. 13 00:00:43,010 --> 00:00:47,530 Check the teacher's notes for additional information about formspree.io and 14 00:00:47,530 --> 00:00:48,780 how to get started with that. 15 00:00:48,780 --> 00:00:53,890 If we look at our form page, we see that it's a pretty simple form 16 00:00:53,890 --> 00:00:57,980 with just a first name, last name, and a job title. 17 00:00:57,980 --> 00:01:02,390 Scrapy has a class called FormRequest, which allows for form processing. 18 00:01:02,390 --> 00:01:05,830 And, hold your horses, it's easy to use. 19 00:01:05,830 --> 00:01:08,850 Let's mosey on over to our code and create a new spider. 20 00:01:10,600 --> 00:01:16,415 So I'll create a new file, gonna be a Python file, and we'll call it formSpider. 21 00:01:18,650 --> 00:01:21,654 First, FormRequest will need to be imported. 22 00:01:21,654 --> 00:01:28,150 So from scrapy.http import FormRequest. 23 00:01:28,150 --> 00:01:29,348 And we need to import Spider. 24 00:01:29,348 --> 00:01:32,483 From scrapy.spiders
25 00:01:32,483 --> 00:01:35,409 import Spider. 26 00:01:37,420 --> 00:01:41,310 We need to create a new class that inherits from Spider as our next step. 27 00:01:43,200 --> 00:01:49,120 Call it FormSpider and, as we've seen, we need to give our spider a name. 28 00:01:51,080 --> 00:01:52,547 We'll just call it horseForm. 29 00:01:55,620 --> 00:01:57,508 And we define our start_urls, 30 00:02:00,302 --> 00:02:01,270 which, again, is a list. 31 00:02:02,370 --> 00:02:04,126 What's the URL for our form? 32 00:02:06,890 --> 00:02:08,189 We'll just cut and paste that in. 33 00:02:11,799 --> 00:02:14,740 This looks pretty familiar thus far, I think. 34 00:02:14,740 --> 00:02:20,030 Next we define our parse method and we'll define the formdata we want to pass in. 35 00:02:21,310 --> 00:02:25,611 So define parse and formdata. 36 00:02:25,611 --> 00:02:28,673 Let's go use the developer tools in the browser to see what the form 37 00:02:28,673 --> 00:02:30,530 fields are called. 38 00:02:30,530 --> 00:02:34,022 Come over here, Developer Tools. 39 00:02:36,627 --> 00:02:38,962 So they're down in here in this form. 40 00:02:46,930 --> 00:02:54,540 So we have firstname, lastname, and jobtitle. 41 00:02:54,540 --> 00:02:56,430 All lowercase and no spaces. 42 00:02:58,260 --> 00:02:59,190 So we want firstname. 43 00:03:00,860 --> 00:03:01,950 My first name is Ken. 44 00:03:04,982 --> 00:03:09,140 lastname, Alger. 45 00:03:09,140 --> 00:03:14,133 And jobtitle is Teacher. 46 00:03:15,720 --> 00:03:19,234 Now we need to return a FormRequest built from the response object. 47 00:03:19,234 --> 00:03:26,940 So return FormRequest.from_response. 48 00:03:26,940 --> 00:03:33,250 We'll pass in the response, the form number on the page we're processing, 49 00:03:33,250 --> 00:03:40,510 and that's zero-based, formnumber, and then the form data we want. 50 00:03:40,510 --> 00:03:43,340 So formdata = formdata. 51 00:03:45,380 --> 00:03:48,370 And then a callback for what to do next.
52 00:03:48,370 --> 00:03:49,120 So callback. 53 00:03:51,880 --> 00:03:54,540 We'll make a method here called after_post. 54 00:03:55,740 --> 00:03:59,560 This passes the data we defined into the form and, 55 00:03:59,560 --> 00:04:04,280 by default, utilizes the submit button to submit our data. 56 00:04:04,280 --> 00:04:08,640 Then it will do whatever we define in the after_post method. 57 00:04:08,640 --> 00:04:14,350 Here we could do data saving or data processing or further scraping tasks. 58 00:04:14,350 --> 00:04:19,470 For now, let's just print out that the form was processed and 59 00:04:19,470 --> 00:04:21,943 the response object itself. 60 00:04:21,943 --> 00:04:26,865 So we'll define after_post, self, and again, that takes a response. 61 00:04:26,865 --> 00:04:31,260 We'll print and we'll do 62 00:04:31,260 --> 00:04:36,254 a little formatting, just so 63 00:04:36,254 --> 00:04:41,260 we can see it in the terminal. 64 00:04:41,260 --> 00:04:45,533 And we'll print the response. 65 00:04:45,533 --> 00:04:47,090 Let's just copy this line here. 66 00:04:50,750 --> 00:04:51,710 There we go. 67 00:04:51,710 --> 00:04:58,039 All right, let's open a terminal window, 68 00:05:00,410 --> 00:05:08,861 go to our spiders folder, and have Scrapy run our crawler. 69 00:05:13,863 --> 00:05:14,580 Let's look up here. 70 00:05:16,080 --> 00:05:19,370 Great, we see that the spider found and submitted our form. 71 00:05:19,370 --> 00:05:24,110 In our case here, it was posted to formspree.io for processing. 72 00:05:24,110 --> 00:05:27,490 Here's our printed information and our 200 response code. 73 00:05:27,490 --> 00:05:30,390 Great, I've included links in the teacher's notes 74 00:05:30,390 --> 00:05:32,310 about FormRequest as well. 75 00:05:32,310 --> 00:05:35,350 I'd encourage you to look at it, as it is a powerful tool for 76 00:05:35,350 --> 00:05:39,708 processing forms and can even be used to handle login forms.