1
00:00:00,300 --> 00:00:04,370
Using web scraping tools doesn't
just have to be for gathering data.

2
00:00:04,370 --> 00:00:06,190
It can be used to test a site as well.

3
00:00:07,250 --> 00:00:10,200
Testing your code is a great
development practice to get into.

4
00:00:11,255 --> 00:00:12,500
Writing unit tests, and

5
00:00:12,500 --> 00:00:17,560
combining them with a web scraper,
can be a powerful tool for testing a site.

6
00:00:17,560 --> 00:00:20,740
You can check to make sure that
a page's title is as expected,

7
00:00:20,740 --> 00:00:26,540
or that all of the content resides in
an element with a specific CSS class.
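
As a rough illustration of those two checks, here is a minimal sketch that runs BeautifulSoup over an inline HTML string. The markup, the `content` class name, and both helper names are assumptions made up for this example; they are not taken from the sample site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page.
SAMPLE_HTML = """
<html>
  <head><title>Horse Land</title></head>
  <body>
    <div class="content">
      <h1>Horse Land</h1>
      <p>Welcome to the stable.</p>
    </div>
  </body>
</html>
"""


def page_title(soup):
    """Return the text of the <title> element, or None if it is missing."""
    return soup.title.get_text(strip=True) if soup.title else None


def all_inside_class(soup, tag_name, css_class):
    """Return True if every <tag_name> element has an ancestor with css_class."""
    return all(
        tag.find_parent(class_=css_class) is not None
        for tag in soup.find_all(tag_name)
    )


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(page_title(soup))
print(all_inside_class(soup, "p", "content"))
```

Note that `all_inside_class` returns True when no matching elements exist at all, so a real test would probably also assert that at least one element was found.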

8
00:00:26,540 --> 00:00:29,160
If you need a refresher
on testing in Python,

9
00:00:29,160 --> 00:00:31,580
check the teacher's notes for
some great resources.

10
00:00:33,160 --> 00:00:34,890
Let's head back to our sample site, and

11
00:00:34,890 --> 00:00:38,510
use unit tests to make sure it has
the elements that we expected it to have.

12
00:00:40,230 --> 00:00:41,780
Let's go back to our horse site.

13
00:00:41,780 --> 00:00:44,700
We'll check to see if
it's a stable version

14
00:00:44,700 --> 00:00:46,560
of what we're expecting it to be.

15
00:00:46,560 --> 00:00:52,820
Let's go over and create a new file,
a new Python file.

16
00:00:52,820 --> 00:00:59,070
We'll call it horse_test.py,
and we'll bring in our imports.

17
00:00:59,070 --> 00:01:03,600
We need urllib.request here to bring in urlopen.

18
00:01:06,180 --> 00:01:08,412
We'll bring in BeautifulSoup, and

19
00:01:08,412 --> 00:01:12,516
since we're running the unit test
we'll need to import unittest.

20
00:01:14,324 --> 00:01:16,950
Next, we define our class and
setup information.

21
00:01:18,100 --> 00:01:26,950
So we'll call the class TestHorseLand,
which inherits from unittest.TestCase.

22
00:01:26,950 --> 00:01:32,790
We'll set our soup, to start with, equal
to None, and then we define a setUpClass.

23
00:01:34,760 --> 00:01:37,397
And in this case, it won't take self.

24
00:01:37,397 --> 00:01:41,349
We'll pass in our url,

25
00:01:41,349 --> 00:01:49,388
treehouse-projects.github.io/horse-land/

26
00:01:49,388 --> 00:01:53,013
index.html.

27
00:01:54,530 --> 00:01:56,297
Then we define our soup object.

28
00:01:58,272 --> 00:02:03,004
It's going to be BeautifulSoup,
urlopen, pass in the URL,

29
00:02:03,004 --> 00:02:05,830
and we want the html.parser again.

30
00:02:08,140 --> 00:02:13,855
Now, let's test that the h1 text
is what we're expecting it to be.

31
00:02:13,855 --> 00:02:16,762
So we'll define a test for header1,

32
00:02:19,582 --> 00:02:27,925
We want header1 to be equal to
our TestHorseLand.soup.find.

33
00:02:27,925 --> 00:02:32,686
We want to grab the h1, and get_text.

34
00:02:32,686 --> 00:02:36,775
Next, we want to make sure that header1,
that we're capturing here,

35
00:02:36,775 --> 00:02:39,400
is equal to what our string should be.

36
00:02:39,400 --> 00:02:41,734
In our case, Horse Land.

37
00:02:41,734 --> 00:02:46,984
So we would do self.assertEqual,
pass in our string

38
00:02:46,984 --> 00:02:51,659
that we want, Horse Land,
and then header1.

39
00:02:51,659 --> 00:03:00,140
And do our dunder main check here,
and we'll run unittest.main.

40
00:03:00,140 --> 00:03:06,020
And when we run this, we get an OK,
and the test passed, very nice.
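
Pulling the steps above together, the test file might look roughly like this sketch. It is not the exact file from the video: to keep it runnable without a network connection, it parses a hypothetical inline `SAMPLE_HTML` string instead of calling `urlopen` on the site URL, and it passes `argv` and `exit=False` to `unittest.main` so it can also be run from other tools.

```python
import unittest

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = "<html><body><h1>Horse Land</h1><ul></ul></body></html>"


class TestHorseLand(unittest.TestCase):
    soup = None

    @classmethod
    def setUpClass(cls):
        # Runs once for the whole class; it receives cls rather than self.
        # Against the live site this would be something like:
        #   BeautifulSoup(urlopen(url), "html.parser")
        cls.soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

    def test_header1(self):
        header1 = TestHorseLand.soup.find("h1").get_text()
        self.assertEqual("Horse Land", header1)


if __name__ == "__main__":
    # exit=False and an explicit argv are additions for this sketch;
    # a plain unittest.main() works in a standalone script file.
    unittest.main(argv=["horse_test"], exit=False)
```

Because the soup is built once in `setUpClass`, every test method in the class can reuse it without fetching or parsing the page again.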

41
00:03:06,020 --> 00:03:10,690
Another method to test sites is
with a package called selenium,

42
00:03:10,690 --> 00:03:14,130
which is designed specifically for
website testing.

43
00:03:14,130 --> 00:03:17,550
It can be installed in PyCharm,
the same as BeautifulSoup, or

44
00:03:17,550 --> 00:03:19,760
it can be installed with Pipenv.

45
00:03:19,760 --> 00:03:22,430
I've included a link to
the installation information

46
00:03:22,430 --> 00:03:24,100
in the teacher's notes, as well.

47
00:03:24,100 --> 00:03:28,560
One additional step you'll need is
the driver for your preferred browser.

48
00:03:28,560 --> 00:03:31,980
Follow the instructions on
the page to get it set up.

49
00:03:31,980 --> 00:03:34,681
Let's create a new file
to show off selenium.

50
00:03:34,681 --> 00:03:40,131
So we can close this,
and create another new Python

51
00:03:40,131 --> 00:03:45,474
file, horse_test_selenium.

52
00:03:48,269 --> 00:03:50,193
So we'll be using BeautifulSoup again.

53
00:03:52,160 --> 00:03:59,345
And from selenium,
we want to import webdriver.

54
00:03:59,345 --> 00:04:03,458
We'll also want to import the time module,
to allow the page to fully load.

55
00:04:05,167 --> 00:04:09,330
So next, we want to tell our
webdriver which browser to use.

56
00:04:09,330 --> 00:04:11,406
I'm using Chrome, so I'll set that up.

57
00:04:16,480 --> 00:04:18,654
Then we tell the driver
to go get our page.

58
00:04:25,522 --> 00:04:30,390
Horse-land, back to index.html.

59
00:04:30,390 --> 00:04:33,963
Let's have our script wait a few seconds,
before we process anything.

60
00:04:33,963 --> 00:04:39,088
Just to give the JavaScript time to run,
and load the horse images on the page.

61
00:04:39,088 --> 00:04:44,690
We do time.sleep, pass in 5,
that should give us plenty of time.

62
00:04:44,690 --> 00:04:47,931
Now, we can utilize
BeautifulSoup to parse the page.

63
00:04:47,931 --> 00:04:51,190
Let's just print out the HTML,
to see if we get the images.

64
00:04:51,190 --> 00:04:54,163
Recall from an earlier video,
when we did this,

65
00:04:54,163 --> 00:04:56,687
we just got an empty unordered list.

66
00:04:56,687 --> 00:04:59,920
That's because BeautifulSoup doesn't
run JavaScript.

67
00:04:59,920 --> 00:05:03,447
The driver object has
an attribute called page_source,

68
00:05:03,447 --> 00:05:07,133
which gets us the source of
the page at the time it was read.

69
00:05:07,133 --> 00:05:11,730
So we'll say page_html equals
driver.page_source, and

70
00:05:11,730 --> 00:05:14,706
we can use that with BeautifulSoup.

71
00:05:16,570 --> 00:05:22,908
We'll pass in the page_html,
we'll use our html.parser again,

72
00:05:22,908 --> 00:05:26,303
and we'll pretty-print our soup.

73
00:05:28,680 --> 00:05:30,502
Then, we want to make
sure we close the driver.

74
00:05:32,840 --> 00:05:40,423
And let's run our script, and there we go!

75
00:05:42,820 --> 00:05:45,338
We see all of our images and page content.
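
The Selenium steps above can be sketched along these lines. The function names, the `wait_seconds` parameter, and the before/after markup are assumptions for illustration only. The Selenium import sits inside `fetch_rendered_html` so the parsing helper can be tried without Selenium or a browser driver installed.

```python
import time

from bs4 import BeautifulSoup


def fetch_rendered_html(url, wait_seconds=5):
    """Load url in Chrome and return the page source after a fixed wait."""
    # Imported here so image_sources() below works without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # give the page's JavaScript time to run
        return driver.page_source
    finally:
        # The video closes the driver; driver.quit() would also end
        # the driver process.
        driver.close()


def image_sources(page_html):
    """Return the src attribute of every <img> in the given HTML."""
    soup = BeautifulSoup(page_html, "html.parser")
    return [img.get("src") for img in soup.find_all("img")]


# Hypothetical markup: the list is empty until JavaScript fills it in.
before_js = "<ul id='horses'></ul>"
after_js = (
    "<ul id='horses'>"
    "<li><img src='a.jpg'></li><li><img src='b.jpg'></li>"
    "</ul>"
)

print(image_sources(before_js))
print(image_sources(after_js))
```

With a driver installed, `image_sources(fetch_rendered_html(url))` would run the same parsing step on the live, JavaScript-rendered page.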

76
00:05:45,338 --> 00:05:49,146
We could now put our scraping skills
to use in many productive ways.