1 00:00:00,000 --> 00:00:05,067 [MUSIC] 2 00:00:05,067 --> 00:00:08,854 Toward the end of these lessons, we're going to use Python and 3 00:00:08,854 --> 00:00:12,900 the scikit-learn project to write our own classifier. 4 00:00:12,900 --> 00:00:17,590 But before we continue, we should formally define some of the terms I've been using 5 00:00:17,590 --> 00:00:19,600 to describe machine learning and 6 00:00:19,600 --> 00:00:23,290 then break them down further with more examples. 7 00:00:23,290 --> 00:00:29,840 Speaking of examples, an example is a single element in a dataset. 8 00:00:29,840 --> 00:00:35,610 Sometimes you might hear an example referred to as a sample, 9 00:00:35,610 --> 00:00:37,750 but it means the same thing. 10 00:00:37,750 --> 00:00:40,710 If your data is formatted in a table, 11 00:00:40,710 --> 00:00:45,020 an example might be a single row in the table. 12 00:00:45,020 --> 00:00:48,930 A dataset is comprised of many examples. 13 00:00:48,930 --> 00:00:49,940 And in general, 14 00:00:49,940 --> 00:00:55,000 each example helps improve the confidence of your model's predictions. 15 00:00:55,000 --> 00:01:00,150 Say, for instance, you're running a movie studio and you want to try and forecast 16 00:01:00,150 --> 00:01:04,492 how much money a movie might make, so that you can set a budget. 17 00:01:04,492 --> 00:01:09,660 Your dataset would probably be examples of older movies. 18 00:01:09,660 --> 00:01:13,510 So what about those older movies might you include? 19 00:01:13,510 --> 00:01:17,920 Each part of an example is called a feature. 20 00:01:17,920 --> 00:01:22,180 A feature is one characteristic of an example. 21 00:01:22,180 --> 00:01:25,670 Again, if you formatted your data in a table, 22 00:01:25,670 --> 00:01:29,200 each feature might be a single column.
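To make the table idea a little more concrete, here's a minimal Python sketch of a dataset where each example is a row and each feature is a column. The movie names, the feature names (`genre`, `budget_musd`, `release_month`), and every value are placeholders invented for illustration, not real box office data.

```python
# Hypothetical toy dataset: every name and value below is invented.
# Each dict is one example (a row); each key is one feature (a column).
movies = [
    {"title": "Movie A", "genre": "action", "budget_musd": 100.0, "release_month": 7},
    {"title": "Movie B", "genre": "comedy", "budget_musd": 30.0, "release_month": 2},
    {"title": "Movie C", "genre": "drama", "budget_musd": 15.0, "release_month": 11},
]

# The number of examples is the number of rows.
num_examples = len(movies)

# The feature names are the column headers.
feature_names = list(movies[0].keys())

# Reading one feature across every example gives a single column.
budgets = [movie["budget_musd"] for movie in movies]

print(num_examples)
print(feature_names)
print(budgets)
```

A plain list of dictionaries is just one way to hold this. The same rows-and-columns shape carries over to whatever table format you end up using.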
23 00:01:29,200 --> 00:01:34,220 In the case of predicting a movie's box office performance, your older examples of 24 00:01:34,220 --> 00:01:38,610 movies might include things like their total box office sales, 25 00:01:38,610 --> 00:01:41,760 the budget, the genre, release date, and 26 00:01:41,760 --> 00:01:46,510 maybe more advanced features, like a star power calculation, 27 00:01:46,510 --> 00:01:50,760 which could take all the actors in each movie and calculate a weighted average of 28 00:01:50,760 --> 00:01:54,540 their typical box office performance in other movies they've been in. 29 00:01:56,330 --> 00:02:00,340 A dataset might contain good and bad features, 30 00:02:00,340 --> 00:02:04,070 and some features that are more important than others. 31 00:02:04,070 --> 00:02:06,840 For example, you might find that the genre and 32 00:02:06,840 --> 00:02:10,100 release date are more important than the budget. 33 00:02:10,100 --> 00:02:14,290 So your model could weigh those features more heavily. 34 00:02:14,290 --> 00:02:19,230 A feature that might be completely irrelevant is the movie's title. 35 00:02:19,230 --> 00:02:23,830 Sure, a movie needs a title, and you might be able to come up with a machine learning 36 00:02:23,830 --> 00:02:28,000 model that can determine what makes a good and bad movie title. 37 00:02:28,000 --> 00:02:31,900 But in most cases, it's probably too subjective and 38 00:02:31,900 --> 00:02:36,349 inconsequential to weigh it against other more quantifiable features. 39 00:02:37,550 --> 00:02:42,360 Something like the box office performance of a movie is very difficult to predict, 40 00:02:42,360 --> 00:02:45,740 and it includes a huge number of factors that are nearly 41 00:02:45,740 --> 00:02:48,570 impossible to simulate perfectly. 42 00:02:48,570 --> 00:02:51,830 But that's why a model is nothing more than that: 43 00:02:51,830 --> 00:02:55,710 a model, or a simplification of the problem.
44 00:02:55,710 --> 00:03:00,590 It's just one tool that can be used in combination with other approaches 45 00:03:00,590 --> 00:03:02,130 to arrive at a solution.
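Going back to the star power feature mentioned earlier, here's a rough sketch of what a weighted average like that could look like in Python. The function name `star_power`, the idea of weighting each actor by something like their share of screen time, and all of the numbers are assumptions made for illustration. The lesson doesn't say how the actors would actually be weighted.

```python
def star_power(cast):
    """Weighted average of each actor's typical box office in other movies.

    cast: list of (typical_box_office_musd, weight) pairs. The weight is an
    assumption (for example, share of screen time); how actors should be
    weighted is not specified in the lesson.
    """
    total_weight = sum(weight for _, weight in cast)
    if total_weight == 0:
        raise ValueError("cast weights must not sum to zero")
    return sum(gross * weight for gross, weight in cast) / total_weight


# Invented numbers, for illustration only: a lead actor whose other movies
# typically gross 200 (weight 3) and a supporting actor at 100 (weight 1).
cast = [(200.0, 3.0), (100.0, 1.0)]
print(star_power(cast))  # (200*3 + 100*1) / 4 = 175.0
```

The result is one number per movie, so it can sit in the dataset as one more feature column next to things like budget and genre.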