As we've learned, a data set is made up of examples, and each of those examples has common features that a model can use to perform analysis and comparisons. But what do we want a model to do with that data? Ultimately, we want it to make some kind of prediction, and the prediction it makes is called a label.

Let's go back to the earlier example of a spam filter. Each example, in this case an email, has features, which here might be things like the subject line, body, and sender. The label is whether the message is spam or not spam.

A classifier is a type of algorithm or model that predicts how a piece of data should be categorized. You can think of a classifier like a function: data goes in, and the classifier predicts the correct category for that data. It does this by learning from an existing data set made of examples where the labels are already known. So for a spam filter, you would train the classifier with a data set where lots of emails are already labeled as spam or not spam. Then, when a new email comes in, the classifier can try to assign it a label.

There's one more thing I want to mention before we carry on: cleaning and organizing data in different ways can often produce different results.
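To make that "classifier as a function" idea concrete, here is a minimal toy sketch of the train-then-predict flow: it learns word counts from a handful of made-up labeled emails, then predicts a label for a new one. The emails, words, and scoring rule are all invented for illustration; this is not the scikit-learn classifier we'll build later.

```python
# Toy illustration only: train on emails whose labels are already
# known, then predict a label for a new email. All data is made up.

def train(examples):
    """Count how often each word appears in spam vs. not-spam emails."""
    counts = {"spam": {}, "not spam": {}}
    for text, label in examples:
        for word in text.lower().split():
            counts[label][word] = counts[label].get(word, 0) + 1
    return counts

def classify(counts, text):
    """Predict the label whose training emails share more words with this one."""
    scores = {label: 0 for label in counts}
    for word in text.lower().split():
        for label in counts:
            scores[label] += counts[label].get(word, 0)
    return max(scores, key=scores.get)

labeled_emails = [
    ("free offer click here to win", "spam"),
    ("limited free prize click now", "spam"),
    ("meeting notes attached for review", "not spam"),
    ("lunch tomorrow with the team", "not spam"),
]

model = train(labeled_emails)
print(classify(model, "click here for a free prize"))  # spam
print(classify(model, "notes from the team meeting"))  # not spam
```

Notice the shape of the workflow: training consumes examples whose labels are known, and prediction is just a function call on new, unlabeled data.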
In the case of the emails, you might find that the raw data from the email doesn't make useful features on its own, because further heuristics need to be applied. For a spam filter classifier, you might create a feature that counts the number of spammy phrases from a dictionary, like "free offer" or "click here", or a feature that identifies an attachment as a photo or as an executable program that might be a virus.

There's an old saying in computing: garbage in, garbage out. It means that if you provide the computer with bad information, or if you give your machine learning model a data set that's inaccurate or not representative of the whole truth, then you're going to get a bad result.

And that's it. As you can imagine, there are many more definitions and terms in machine learning, but those are the big ones we'll need in order to continue with our exercise in the next videos, where we're going to write our own classifier in Python using a library called scikit-learn.
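Before we move on, here is a small sketch of what those hand-built features might look like in code. The phrase dictionary, the email fields, and the attachment extensions are all hypothetical choices made for this example, not part of any particular library.

```python
# Hypothetical feature extraction for a spam filter. The phrase
# list, field names, and extensions below are invented examples.

SPAMMY_PHRASES = ["free offer", "click here", "you have won", "act now"]

def extract_features(email):
    """Turn a raw email (a dict of strings) into features a model can use."""
    body = email["body"].lower()
    return {
        # Heuristic 1: how many known spammy phrases appear in the body?
        "spammy_phrase_count": sum(body.count(p) for p in SPAMMY_PHRASES),
        # Heuristic 2: does any attachment look like an executable program?
        "has_executable_attachment": any(
            name.lower().endswith((".exe", ".bat", ".js"))
            for name in email.get("attachments", [])
        ),
    }

email = {
    "body": "Free offer! Click here to claim your prize.",
    "attachments": ["invoice.exe"],
}
print(extract_features(email))
# {'spammy_phrase_count': 2, 'has_executable_attachment': True}
```

The point is that the features a model sees are a design decision: the same raw email can yield very different feature sets, and with them very different results.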