1 00:00:00,300 --> 00:00:04,370 Using web scraping tools doesn't just have to be for gathering data. 2 00:00:04,370 --> 00:00:06,190 It can be used to test a site as well. 3 00:00:07,250 --> 00:00:10,200 Testing your code is a great development practice to get into. 4 00:00:11,255 --> 00:00:12,500 Writing unit tests, and 5 00:00:12,500 --> 00:00:17,560 combining them with a web scraper, can be a powerful way to test a site. 6 00:00:17,560 --> 00:00:20,740 You can check to make sure that a page's title is as expected, 7 00:00:20,740 --> 00:00:26,540 or that all of the content resides in an element with a specific CSS class. 8 00:00:26,540 --> 00:00:29,160 If you need a refresher on testing in Python, 9 00:00:29,160 --> 00:00:31,580 check the teacher's notes for some great resources. 10 00:00:33,160 --> 00:00:34,890 Let's head back to our sample site, and 11 00:00:34,890 --> 00:00:38,510 use unit tests to make sure it has the elements that we expect it to have. 12 00:00:40,230 --> 00:00:41,780 Let's go back to our horse site. 13 00:00:41,780 --> 00:00:44,700 We'll check to see if it's a stable version 14 00:00:44,700 --> 00:00:46,560 of what we're expecting it to be. 15 00:00:46,560 --> 00:00:52,820 Let's go over and create a new Python file. 16 00:00:52,820 --> 00:00:59,070 We'll call it horse_test.py, and we'll bring in our imports. 17 00:00:59,070 --> 00:01:03,600 We need urllib.request here to bring in urlopen. 18 00:01:06,180 --> 00:01:08,412 We'll bring in BeautifulSoup, and 19 00:01:08,412 --> 00:01:12,516 since we're running unit tests, we'll need to import unittest. 20 00:01:14,324 --> 00:01:16,950 Next, we define our class and setup information. 21 00:01:18,100 --> 00:01:26,950 So we'll call the class TestHorseLand, which inherits from unittest.TestCase. 22 00:01:26,950 --> 00:01:32,790 We'll set our soup, to start with, equal to None, and then we define a setUpClass. 23 00:01:34,760 --> 00:01:37,397 And in this case, it won't take self. 
24 00:01:37,397 --> 00:01:41,349 We'll pass in our url, 25 00:01:41,349 --> 00:01:49,388 treehouse-projects.github.io/horse-land/ 26 00:01:49,388 --> 00:01:53,013 index.html. 27 00:01:54,530 --> 00:01:56,297 Then we define our soup object. 28 00:01:58,272 --> 00:02:03,004 It's going to be BeautifulSoup, urlopen, pass in the URL, 29 00:02:03,004 --> 00:02:05,830 and we want the html.parser again. 30 00:02:08,140 --> 00:02:13,855 Now, let's test that the h1 text is what we're expecting it to be. 31 00:02:13,855 --> 00:02:16,762 So we'll define a test for header1. 32 00:02:19,582 --> 00:02:27,925 We want header1 to be equal to our TestHorseLand.soup.find. 33 00:02:27,925 --> 00:02:32,686 We want to grab the h1, and get_text. 34 00:02:32,686 --> 00:02:36,775 Next, we want to make sure that header1, that we're capturing here, 35 00:02:36,775 --> 00:02:39,400 is equal to what our string should be. 36 00:02:39,400 --> 00:02:41,734 In our case, Horse Land. 37 00:02:41,734 --> 00:02:46,984 So we would do self.assertEqual, pass in the string 38 00:02:46,984 --> 00:02:51,659 that we want, Horse Land, and then header1. 39 00:02:51,659 --> 00:03:00,140 And we'll do our dunder main check here, and run unittest.main. 40 00:03:00,140 --> 00:03:06,020 And when we run this, we get an OK, and the test passed, very nice. 41 00:03:06,020 --> 00:03:10,690 Another method to test sites is with a package called selenium, 42 00:03:10,690 --> 00:03:14,130 which is designed specifically for website testing. 43 00:03:14,130 --> 00:03:17,550 It can be installed in PyCharm, the same as BeautifulSoup, or 44 00:03:17,550 --> 00:03:19,760 it can be installed with Pipenv. 45 00:03:19,760 --> 00:03:22,430 I've included a link to the installation information 46 00:03:22,430 --> 00:03:24,100 in the teacher's notes, as well. 47 00:03:24,100 --> 00:03:28,560 One additional step you'll need is the driver for your preferred browser. 
48 00:03:28,560 --> 00:03:31,980 Follow the instructions on the page to get it set up. 49 00:03:31,980 --> 00:03:34,681 Let's create a new file to show off selenium. 50 00:03:34,681 --> 00:03:40,131 So we can close this, and create another new Python 51 00:03:40,131 --> 00:03:45,474 file, horse_test_selenium. 52 00:03:48,269 --> 00:03:50,193 So we'll be using BeautifulSoup again. 53 00:03:52,160 --> 00:03:59,345 And from selenium, we want to import webdriver. 54 00:03:59,345 --> 00:04:03,458 We'll also want to import the time module, to allow the page to fully load. 55 00:04:05,167 --> 00:04:09,330 So next, we want to tell our webdriver which browser to use. 56 00:04:09,330 --> 00:04:11,406 I'm using Chrome, so I'll set that up. 57 00:04:16,480 --> 00:04:18,654 Then we tell the driver to go get our page. 58 00:04:25,522 --> 00:04:30,390 Horse-land, back to index.html. 59 00:04:30,390 --> 00:04:33,963 Let's have our script wait a few seconds before we process anything, 60 00:04:33,963 --> 00:04:39,088 just to give the JavaScript time to run, and load the horse images on the page. 61 00:04:39,088 --> 00:04:44,690 We do time.sleep, pass in 5, that should give us plenty of time. 62 00:04:44,690 --> 00:04:47,931 Now, we can utilize BeautifulSoup to parse the page. 63 00:04:47,931 --> 00:04:51,190 Let's just print out the HTML, to see if we get the images. 64 00:04:51,190 --> 00:04:54,163 Recall from an earlier video, when we did this, 65 00:04:54,163 --> 00:04:56,687 we just got an empty unordered list, 66 00:04:56,687 --> 00:04:59,920 because BeautifulSoup doesn't run JavaScript. 67 00:04:59,920 --> 00:05:03,447 The driver object has a property called page_source, 68 00:05:03,447 --> 00:05:07,133 which gets us the source of the page at the time it was read. 69 00:05:07,133 --> 00:05:11,730 So we'll say page_html equals driver.page_source, and 70 00:05:11,730 --> 00:05:14,706 we can use that with BeautifulSoup. 
71 00:05:16,570 --> 00:05:22,908 We'll pass in the page_html, we'll use our html.parser again, 72 00:05:22,908 --> 00:05:26,303 and we'll pretty-print our soup. 73 00:05:28,680 --> 00:05:30,502 Then, we want to make sure we close the driver. 74 00:05:32,840 --> 00:05:40,423 And let's run our script, and there we go! 75 00:05:42,820 --> 00:05:45,338 We see all of our images and page content. 76 00:05:45,338 --> 00:05:49,146 We could now put our scraping skills to use in many productive ways.
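As a rough sketch, the horse_test_selenium script built up in this video might be organized like the code below. Several things here are assumptions rather than the video's exact code: the https:// scheme on the URL, the helper names fetch_rendered_html and parse_page (both hypothetical), importing selenium inside the function so the parsing helper works without a browser, and calling driver.quit() in a finally block where the video simply closes the driver. It also assumes Chrome and its driver are already set up as described in the installation instructions.

```python
import time

from bs4 import BeautifulSoup

# Host and path are as read out in the video; the https:// scheme is an assumption.
URL = "https://treehouse-projects.github.io/horse-land/index.html"


def fetch_rendered_html(url, wait_seconds=5):
    """Open url in Chrome, wait for its JavaScript, and return the page source."""
    # Imported here (an assumption of this sketch) so parse_page() can be
    # used on its own without Selenium or a browser driver installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Give the page's JavaScript time to load the horse images.
        time.sleep(wait_seconds)
        # page_source is a property: the HTML as it stands right now.
        return driver.page_source
    finally:
        # Shut the browser down even if loading the page failed.
        driver.quit()


def parse_page(page_html):
    """Parse an HTML string with the same parser used throughout the course."""
    return BeautifulSoup(page_html, "html.parser")
```

With a driver installed, `print(parse_page(fetch_rendered_html(URL)).prettify())` would pretty-print the rendered page, including the images that the plain urlopen approach missed.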