Labels and Classifiers (2:39) with Nick Pettit
Let's continue defining some machine learning terms so that we have the vocabulary to discuss these ideas in more detail.
Vocabulary and Definitions
- Label: A category for data, or a prediction from a classification algorithm
- Classifier: A supervised machine learning model that makes a prediction about how a piece of data should be categorized
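To make the two definitions above a little more concrete, here is a minimal sketch in plain Python. It is only an illustration, not how any particular library works: the class name `NearestNeighborClassifier`, the `fit`/`predict` method names, and the two numeric features (spammy phrase count, executable attachment flag) are all assumptions made up for this example. The classifier simply gives new data the label of the closest example whose label is already known.

```python
from math import dist
from typing import List, Sequence, Tuple

Label = str  # a label is just a category name, e.g. "spam" or "not spam"


class NearestNeighborClassifier:
    """Toy supervised classifier (hypothetical, for illustration only).

    It memorizes labeled examples and predicts the label of whichever
    stored example is closest to the new data.
    """

    def fit(self, features: Sequence[Sequence[float]], labels: Sequence[Label]):
        # Supervised learning: every training example comes with a known label.
        if not features or len(features) != len(labels):
            raise ValueError("need one known label per training example")
        self._features: List[Tuple[float, ...]] = [tuple(row) for row in features]
        self._labels: List[Label] = list(labels)
        return self

    def predict(self, row: Sequence[float]) -> Label:
        # Find the index of the stored example nearest to the new data.
        closest = min(
            range(len(self._features)),
            key=lambda i: dist(self._features[i], row),
        )
        return self._labels[closest]


# Assumed features per email: (spammy phrase count, has executable attachment)
training_features = [(0, 0), (1, 0), (4, 0), (3, 1)]
training_labels = ["not spam", "not spam", "spam", "spam"]

classifier = NearestNeighborClassifier().fit(training_features, training_labels)
print(classifier.predict((5, 1)))  # prints "spam"
```

Data goes in, a label comes out. A real classifier would use a better learning method and far more examples, but the shape is the same.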
As we've learned, a data set is made up of examples, and each of those examples has common features that a model can use to perform analysis and comparisons. But what do we want a model to do with that data? Ultimately, we want it to make some kind of prediction, and the prediction it makes is called a label.

Let's go back to the earlier example of a spam filter. Each example, in this case an email, has features, which might be things like the subject line, body, and sender. Here, the label is whether the message is spam or not spam.

A classifier is a type of algorithm or model that makes a prediction about how a piece of data should be categorized. You can think of a classifier like a function: data goes in, and then the classifier predicts the correct category for that data. It does this by using an existing data set that has examples where the labels are known. So for a spam filter, you would train the classifier with a data set where lots of emails are already labeled as spam or not spam. Then, when a new email comes in, it can try to assign a label.

There's one more thing I want to mention before we carry on. Cleaning and organizing data in different ways can often produce different results. In the case of the emails, you might find that the raw data from the email doesn't make useful features because further heuristics need to be applied. For a spam filter classifier, you might create a feature that counts the number of spammy phrases from a dictionary, like "free offer" or "click here", or a feature that identifies an attachment as a photo or an executable program that might be a virus.

There's an old saying in computing: garbage in, garbage out. It means that if you provide the computer with bad information, or if you give your machine learning model a data set that's inaccurate or not representative of the whole truth, then you're going to get a bad result.

And that's it. As you can imagine, there are many more definitions and terms in machine learning, but those are the big ones we'll need in order to continue with our exercise in the next videos, where we're going to write our own classifier in Python using a library called scikit-learn.
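The feature ideas mentioned for the spam filter (counting spammy phrases from a dictionary, and identifying an attachment as a photo or an executable) could be sketched roughly like this. Everything here is an assumption for illustration: the `Email` class, the function names, the phrase list, and the file extensions are hypothetical, and a real filter would need a much larger dictionary and more careful attachment checks than looking at a file name.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical dictionary of spammy phrases (the two from the example above).
SPAMMY_PHRASES = ("free offer", "click here")

# Assumed file extensions; checking the name alone is only a rough heuristic.
PHOTO_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")
EXECUTABLE_EXTENSIONS = (".exe", ".bat", ".scr")


@dataclass
class Email:
    """Raw example: the fields a mail system might give us."""
    subject: str
    body: str
    sender: str
    attachments: Tuple[str, ...] = ()  # attachment file names


def count_spammy_phrases(email: Email) -> int:
    # Lowercase so "FREE OFFER" and "free offer" are counted alike.
    text = f"{email.subject}\n{email.body}".lower()
    return sum(text.count(phrase) for phrase in SPAMMY_PHRASES)


def attachment_kind(filename: str) -> str:
    name = filename.lower()
    if name.endswith(EXECUTABLE_EXTENSIONS):
        return "executable"
    if name.endswith(PHOTO_EXTENSIONS):
        return "photo"
    return "other"


def extract_features(email: Email) -> Dict[str, object]:
    """Turn a raw email into features a classifier could compare."""
    kinds = [attachment_kind(name) for name in email.attachments]
    return {
        "spammy_phrase_count": count_spammy_phrases(email),
        "has_photo": "photo" in kinds,
        "has_executable": "executable" in kinds,
    }


message = Email(
    subject="FREE OFFER inside",
    body="Click here now. Click here!",
    sender="someone@example.com",
    attachments=("prize.exe", "cat.JPG"),
)
print(extract_features(message))
```

This is also where garbage in, garbage out shows up in practice: if the phrase dictionary or the labeled emails don't reflect real mail, the features and the resulting predictions won't be useful either.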