Web Spam Research: Good Robots vs. Bad Robots (43:20) with Matt Peters
Matt goes on a wild ride to find the interesting features necessary to algorithmically classify a site as spam or non-spam. Will the good robots finally win?
All right, thank you, Lawrence. So today I'm going to talk about some research into web spam that I've been doing lately, and in particular about some algorithmic approaches for detecting web spam.

So why do we care about web spam? Well, first, we care because Google does. Through the Penguin and Panda updates, they're taking an increasingly hard line against web spam and sites that violate their guidelines. And this raises a whole host of practical SEO considerations that you might be thinking about every day in your job. For instance, if you're going out and building links, a common question might be: do I trust this site? Should I get a link from this site, or is that actually going to hurt me? Another common use case might be if you're helping a site clean up from a penalty and you need to remove a whole bunch of spam backlinks that someone else built. Maybe there are thousands and thousands of links; having some spam score that tells you which ones are the most likely to be spam would be a great way of starting.

There are also a lot of engineering challenges that we face every day at SEOMoz. We have the Mozscape index. We have web crawlers. And so here's Roger, our web crawler. He goes out and crawls the web, which means he downloads lots and lots of pages from the Internet, processes them, extracts links from them, and then sends them up to our processing pipeline, which sits in the cloud. It takes several weeks to actually process all of this data into the Mozscape index that we can use. We started with 40 computers; as we scaled up our index size, we've gotten up to 100 or 200 computers running to do the processing.
So all of this takes a significant amount of engineering time, money, and other resources. If spam is a large component of what we're crawling and indexing, it can consume resources we would rather be spending on other things. It can also adversely impact index quality. If we have too much spam in our index, it can skew some of our metrics, like MozRank and Page Authority, and we'd like to remove it. One of the interesting things about this problem is that we operate at very large scale. We crawl billions and billions of pages, and the only practical way to handle that is an algorithmic approach.

Okay, so what are our actual goals in this project? Simply put, our goal is to, at some point in the future, surface a spam metric in the Mozscape index. We're not there yet. This started as a research project, and it's still in the research phase: determining the feasibility and the things we would need to do to actually get this to work. There are some engineering challenges we need to tackle before we can get it into the Mozscape index, but that is our long-term goal.

When we first started this work, Rand put up a post on Google+ announcing that we were working on a spam metric. We got a lot of feedback from that post, over 100 comments, and I want to take a minute to address some of the concerns that were raised. The first thing we heard was: why should we even trust this metric? What is the relevance of SEOMoz's spam metric? That's a great question. So in this talk I'm going to focus directly on comparing our data against sites that Google has banned or penalized.
So for the purposes of this talk, our definition of spam is going to be Google's definition of spam. I don't know what form this will take long term, but I can assure you that whatever we do will make direct comparisons against sites that Google has penalized.

The second thing we heard was: you're working on the wrong problem. You should be focusing on Mozscape itself, making the index bigger, fresher, higher quality. And I can assure you we're working on that very, very hard. As Rand talked about this morning, we have made significant investments in our own private cloud and data center to scale up the size of the index and make it fresher. We have some extremely talented engineers on the big data and Mozscape teams, whom I work with every day, working on that exact problem.

The third concern was that if we produce a spam metric, then everyone will know which sites are spam and which are not, and there was a bunch of concern about this. Our answer is that we believe in transparency. This is data that the search engines already have and use in their algorithms right now, and we would like to make it available to everyone else.

Okay, so I started out saying that we're going to take an algorithmic approach to detecting web spam. What does that actually mean? In practice it means we're going to use a machine learning model. Machine learning, simply put, means we gather some data and then use it to predict something. In our case, since we're predicting whether a site is spam or not, our data is web data. So we start by crawling lots of pages and downloading lots of data. But this comes to us in the form of raw HTML, which by itself is not very useful.
So we need to go through another stage, called feature extraction. "Feature" is a common term used everywhere in machine learning: features are concrete things that we extract from the data and actually use to make predictions. In this case we've got a couple of features; the number of words in the body, or the number of links in the HTML document, might be good indicators of whether a page is spam or not, along with a whole host of other things. We then send these features through a machine learning algorithm, which can sometimes act like a black box. In our case it's actually quite transparent, because I wrote all the code myself, but for the purposes of this talk it's basically a black box. It outputs a prediction as to whether the site or page is spam or not.

In this talk I'm going to focus mainly on the features, and we're going to look for the ones that are most relevant for predicting spam. We do this because features are actual things we can wrap our brains around, and they give us some intuition about what is most useful.

Okay, so spam is an interesting topic. One of the things I find very interesting about it is the split between on-page features and in-link features. This is a hypothetical link graph, and we're going to keep coming back to this example, so I'm going to take some time to explain it. Here we've got a bunch of different sites; you can think of the squares or boxes as sites and the arrows as links between sites. We have two different types of sites. We have spam sites, shown in black. When I say a spam site, I mean a site that actually has spam on it: a link farm, a page completely stuffed with keywords that has no use to a human, scraped content, a fake blog, something like that.
A spam site has spam outlinks coming from it. And then we have legitimate sites: e-commerce sites, blogs like the ones all of you have, sites that don't have spam on them. Spam sites link to other spam sites, but they also link to non-spam sites. So, for instance, here's a spam site with some spam outlinks on it. But if we consider this other site, a legitimate site, the site itself isn't spam, but it has spam pointing to it. Google will penalize or ban both of these sites. It will ban a legitimate site if it has too much spam pointing to it. And so throughout this talk, we're going to be constantly talking about in-link features and on-page features.

When I first started this a couple of months ago, I thought to myself: there's probably been a whole bunch of research done on spam already, so let's do some literature review and see what actually has been done. It turns out there's been a lot of great research on spam over the years. In conferences in 2006, 2007, and later in 2010, there was a large organized effort to collect lots of data around spam. The researchers developed their own spam guidelines; here's a screenshot of some of them. They then gave these guidelines to humans and asked them to annotate whether each site was spam or not. They tabulated several thousand or more labels of whether a site was spam, and with that you can actually go and do some statistics and analysis.

One of the interesting papers that came out of this work was by Ntoulas et al. in 2006, and it produced a number of graphs like this one. It's a complicated graph, but I'm going to show a lot of these, so I'll take some time to go through it.
This is a graph of the number of words in the title tag of a page, and there are two parts to it. The first part is the blue line: a histogram, or probability density, of all the pages in their data set. For instance, if you look at the left-hand side of the graph, you can see that about 12 to 14 percent of the data has around 5 words in the title. As you get out to title tags that are 30 or 35 words long, there's very little data; a very small percentage of the data has a title tag that long. The pink or magenta line is the percentage of spam in each of those buckets. If you look at the lower left-hand side of the graph, you'll see that short titles have a very small spam percentage. As you get to longer and longer title tags, over on the right-hand side, the spam percentage increases to something like 50, 60, 70 percent. This is keyword stuffing in the title tag.

Another thing you might look at is the percentage of the page covered in anchor text. You might think a common spam technique would be to spew out a bunch of pages full of anchor text and links with no real content. So here you can look at the percentage of the page covered in anchor text, and here the trend isn't so clear; it oscillates around. Maybe there's some indication that as larger and larger percentages of the page are anchor text, the spam percentage is higher, but maybe it's not so clear.
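Graphs like these reduce to a simple computation: bucket a feature, then measure per bucket both the share of all pages (the blue histogram) and the share of pages labeled spam (the magenta line). A minimal sketch in Python; the data here is made up, standing in for the thousands of human-labeled pages in the real data set:

```python
from collections import defaultdict

def bucketed_spam_rate(samples, bucket_width):
    """Group (feature_value, is_spam) pairs into buckets and report,
    per bucket, the share of all pages (the histogram) and the share
    of that bucket's pages labeled spam (the spam-percentage line)."""
    counts = defaultdict(int)
    spam = defaultdict(int)
    for value, is_spam in samples:
        b = int(value // bucket_width)
        counts[b] += 1
        spam[b] += int(is_spam)
    total = len(samples)
    return {
        b: {"density": counts[b] / total, "spam_pct": spam[b] / counts[b]}
        for b in counts
    }

# Hypothetical (words-in-title, spam?) pairs.
data = [(4, False), (5, False), (6, False), (5, False),
        (30, True), (32, True), (31, False)]
stats = bucketed_spam_rate(data, bucket_width=10)
# bucket 0 covers titles of 0-9 words, bucket 3 covers 30-39 words
```

With this toy data, short titles land in bucket 0 with a spam rate of zero, while the long-title bucket shows a two-thirds spam rate, which is the keyword-stuffing pattern the 2006 graph displays.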
One interesting thing about this work is that many years later, a very recent paper from 2011 took this data set and applied a very fancy, sophisticated, state-of-the-art machine learning model, and found that you can do quite a good job of predicting whether a page is spam using just a few simple on-page features like these; they got (inaudible) performance by doing this. So this is encouraging: if you have a complicated enough model, you can do a good job of predicting spam with not much data. One of the other interesting things they found is that when they compared on-page content features against in-link features, on-page features did a much better job of predicting spam.

So we've talked about some of the different types of on-page features, and I'll talk about more of them later on. If we think about in-link features, one of the more common ones, and one we have available in Mozscape right now, is called MozTrust. MozTrust starts with a set of high-quality, very trustworthy seed sites: government sites, very reputable university sites, things like that. It then flows trust out from those sites. So if the green site at the bottom of the diagram is a seed site, then the site it links to is going to have high MozTrust; it passes trust along. And as that site links out to other sites on the right-hand side, they get more moderate values of MozTrust. It was found, at least back in 2004, that this kind of trust metric did a pretty decent job of predicting spam. We'll come back to this, too.
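The trust-flow idea just described can be sketched as a TrustRank-style iteration: trust starts at hand-picked seeds and attenuates at each hop along outlinks. This is a toy illustration of the concept, not Moz's actual MozTrust formula; the domains, damping factor, and iteration count are all made up:

```python
def propagate_trust(links, seeds, damping=0.85, iterations=20):
    """TrustRank-style propagation sketch: trust originates at seed
    sites and flows out along links, splitting among a site's
    outlinks and decaying by `damping` at each hop.
    `links` maps each site to the list of sites it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    seed_mass = {n: 1.0 / len(seeds) if n in seeds else 0.0 for n in nodes}
    trust = dict(seed_mass)
    for _ in range(iterations):
        # Each round: seeds keep a base amount, everything else is inherited.
        nxt = {n: (1 - damping) * seed_mass[n] for n in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * trust[src] / len(targets)
                for t in targets:
                    nxt[t] += share
        trust = nxt
    return trust

# Hypothetical graph: a trusted seed links to a blog, which links onward.
links = {"gov.example": ["blog.example"],
         "blog.example": ["shop.example", "forum.example"]}
trust = propagate_trust(links, seeds={"gov.example"})
# trust decreases with distance from the seed:
# gov.example > blog.example > shop.example
```

Spam sites far from any trustworthy seed end up with very little trust, which is why a low value on a metric like this correlates with being penalized.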
One other interesting thing that has come out since the Penguin update is that anchor text might be a very good predictor of spam. The intuition is that natural or organic anchor text is typically either a branded keyword for the domain being linked to, or words like "click here" or other natural, organic text that actually appears on the Internet. In particular, it's not phrases like "Hollywood dentist," "cosmetic dentist," "Los Angeles dentist." If too high a percentage of your anchor text consists of these unbranded commercial keywords, that could potentially be a spam signal.

So it's natural to ask: are these features even still relevant today? A lot of this work was done in 2004 and 2006, and spam tactics have changed remarkably in the last 10 years. I'm going to try to answer that question today, and we're going to do it by gathering a large set of data on sites that are banned or penalized by Google. Curtis has a great blog post on our website where you can read more about the methodology for determining whether a site is banned or penalized, but briefly: a site is banned if it is completely removed from the index. That means if you do a site: query for the domain, it doesn't appear at all. This is a manual (inaudible) applied. A site is penalized if you do a search for the exact domain name and it doesn't appear on the first page. Any site that hasn't been penalized will rank on the first page for a search for its exact domain name.

Okay, so I said we're going to collect some data and do some machine learning on it. Where do we get our data? Well, we've got a somewhat complicated data collection process that I'm going to walk you through now.
We first start with the Mozscape index, where we've got about 200 million different sites that we'd like to classify as spam or not. We take a stratified sample of this data, and we throw in about 3,000 directory sites that we've already labeled, plus a long list of suspected spam sites. That gets us down to about 47,000 sites. We then go out and crawl these sites, about five pages on each. We throw in some additional data from Wikipedia, SEM (inaudible), and some information about language, and we extract a bunch of features from all of that. Then we apply a filter to this data. We filter for sites that are still alive: we only include sites that returned at least one 200 response to us. We remove redirects, remove 301s, things like that. We also apply an English-language filter. We do this because we want to bring in some information about search queries and about language, and it makes things a lot easier if we initially restrict ourselves to English-language documents. This gets us down to about 22,000 sites. From there we send these through Google, and we tabulate whether they have been banned or penalized.

For the rest of this talk, I'm going to focus only on sites that have been penalized. Any site that has been banned has clearly also been penalized, because if you search for its domain name, it's not going to appear. But we have a lot more sites that are penalized, so I'm going to focus mainly on them.

Okay, so what did we actually see? First, here are the overall results. Overall we found that about 17 percent of the sites were penalized, and about 5 percent were banned.
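The liveness and language filters in that pipeline amount to a simple pass over the crawl records. A sketch of how that filter might look, with hypothetical record fields (`status`, `language`, `pages`) standing in for whatever the real crawl data contains:

```python
def filter_sites(crawl_results):
    """Keep a site only if at least one of its crawled pages returned
    HTTP 200 and the site was detected as English.  Redirect-only
    sites (301s and the like) fall out naturally, since none of their
    pages is a 200."""
    kept = []
    for site in crawl_results:
        has_200 = any(page["status"] == 200 for page in site["pages"])
        if has_200 and site["language"] == "en":
            kept.append(site["domain"])
    return kept

# Made-up crawl records illustrating the three cases.
sites = [
    {"domain": "alive.example", "language": "en",
     "pages": [{"status": 200}, {"status": 404}]},
    {"domain": "moved.example", "language": "en",
     "pages": [{"status": 301}]},
    {"domain": "german.example", "language": "de",
     "pages": [{"status": 200}]},
]
filter_sites(sites)  # → ["alive.example"]
```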
If we look at the composition of the suspected spam sites, we see that between 50 and 60 percent of them were penalized, so most of these actually were spam.

On to MozTrust. MozTrust does a great job of predicting spam, and it's one of the biggest predictors we've found. Here is my version of those 2006 graphs. If you look at the left-hand side, you can see that for MozTrust values between 0 and 1, somewhere between 35 and 45 percent of those sites were penalized by Google. If you then look at the bottom right-hand side of the graph, at MozTrust values between 5 and 6, only a few percent or less of the sites are penalized.

We might think MozRank would also be a good predictor, and it actually is a pretty decent one, although not as good as MozTrust. In particular, if you look at the bottom of the MozRank graph on the right, you can see that at higher values of MozRank, 4, 5, 6, it plateaus at about 10 percent penalized, whereas the MozTrust line keeps going down. So MozTrust is the stronger predictor.

What about the number of in-links? Does that do a decent job of predicting spam? There are actually some interesting things in the raw number of in-links. In general, we see the same trend as with MozRank and MozTrust: as the number of links increases, the spam percentage goes down. But we see some interesting bumps when we look at the total number of links on the left-hand side: an increase in spam percentage for sites that have many thousands of links pointing to them. We don't see this big bump as much for external links; it may be there a little bit, but not as much.
So there is a population of sites in our index that are very large and tend to have lots of internal links, within the site itself, that are spam. Domain size shows the same trend, which helps confirm that, yes, the increase in spam we see for internal links, or all links, comes from large sites: sites that have anywhere between 3 and 22,000 different pages.

Linking root domains also tend to do a pretty decent job of predicting spam, although we again see a bump in the middle from this population of heavily linked sites that are spam.

Okay, anchor text. We have this intuition that organic anchor text is branded, or is something like "click here." To actually pull this out and do some statistics on it, we need to translate it into a number. This is a complicated slide, and there are some details here about how I actually did this. I'm not going to go through all of them, but I'll talk through the idea with this example. I went to the Mozscape index and pulled out the top anchor text phrases pointing to each of these domains, along with the number of domains using each phrase. Here are the top 10 anchor text phrases pointing to SEOMoz.org, the entire domain. The number one anchor text is the word "SEOMoz," with 7,700 linking root domains. Number two is "SEOMoz.org." As we go farther down the list, we see phrases like "Rand Fishkin" and "dog snuggie." The "dog snuggie" links are all spam links pointing to us. I have a relatively simple heuristic that essentially just matches against the domain name, with a few other things in there.
I tried to make a (inaudible) heuristic for detecting acronyms and things like that. But we don't pick up things like "Rand Fishkin"; maybe that's arguably a branded keyword, maybe it's arguably not. In my case, we're going to label it as unbranded. "Dog snuggie" is clearly an unbranded keyword. Once we've labeled each anchor text phrase as branded or unbranded, we can compute the percentage of the linking root domains in the top 10 that are unbranded.

And here are the results. This is the graph of the percentage of anchor text in the top 10 that is unbranded: not organic and not branded. On the right-hand side, you can see that at larger and larger percentages of unbranded anchor text, we do see an increase in spam. These are sites where 80 percent or more of their top-10 anchor text is unbranded. On the contrary, in the middle of the graph, where sites have a relatively organic mix of unbranded and branded anchor text, we see the lowest spam percentages. That mix is actually best.

One other interesting thing we can do: we get some information from metrics like MozTrust or the percentage of unbranded anchor text, but each of those is a single number for an entire domain or page. What if instead we look at the entire in-link profile? Here's our hypothetical link graph again, and I've filled in some Domain Authority numbers; each site now has a Domain Authority. Consider the one in the center, which has a mix of spam links and non-spam links pointing to it.
If we look at the Domain Authorities of all the domains linking to it and plot a histogram, a curve of those values, we might get something like the plot in the upper right-hand corner. If instead we look at the site over here with a Domain Authority of 68, which has only non-spam links pointing to it, a very clean in-link profile, we might get a histogram that looks like this. So we can actually do this and see what the results are.

There's one final wrinkle: larger sites with higher domain MozRank or higher Domain Authority naturally tend to have higher Domain Authority links pointing to them, so to make an apples-to-apples comparison we need to segment the sites. Here the blue line is sites that are not penalized and the red line is sites that are penalized; let me focus your attention on the bottom left-hand corner. We do see some indication of differences in these in-link profiles. In particular, for penalized sites, the Domain Authority histogram is more peaked around 20 to 30, and it has fewer higher-quality links above a Domain Authority of about 40. We see that trend, for instance, in the bottom right, for sites with domain MozRank between 6 and 8. It's less apparent for the smaller sites with fewer links pointing to them. So the answer is that maybe we can get some information from this, maybe not. It's actually not clear.

Okay, so what if we look at on-page features? Here is the percentage of anchor text on the page. On the left is our data set, and on the right is the figure from the 2006 paper. Again, it's not very conclusive; things oscillate around a little bit.
Maybe there's some indication that higher percentages of anchor text mean a higher spam percentage, but it's not so clear. If instead we split this into internal and external anchor text, we do see something interesting. On the left-hand side we have the percentage of the page covered in internal anchor text, and what we see is that sites without much, or any, internal anchor text on the page tend to have a higher spam signal. On the right-hand side, we see that higher and higher percentages of external anchor text are a spam signal. This makes sense: spammers aren't interested in spamming their own site with internal anchor text. They're interested in building external links from other sites to pass link value.

We can look at the title tag. This one is actually quite interesting to me, and it shows directly how much spam tactics have changed. Again, the left-hand side is our data set from 2012, and the right-hand side is the data from 2006. What we see is that we no longer get the significant increase in spam percentage as titles get longer. Spammers have stopped stuffing keywords into title tags, because they found that it doesn't work.

We can look at the length of the body, the length of the HTML body: basically the length of the document, the number of words in it. Again we see some differences from the 2006 data. The first interesting thing is that the histograms themselves have changed. Documents overall have gotten longer. Back in 2006, the histogram peaked around documents of 250 words or so; here we see a peak around 400 words. So documents overall have gotten longer.
And the spam signal has changed completely. Whereas back in 2006 shorter documents tended not to be spam at all, here shorter documents tend to have a higher spam signal.

The visible ratio. This one is interesting, though somewhat hard to explain. Here we take the entire HTML document, including all the markup, and figure out what percentage of it actually appears on the page, as a ratio. A higher visible ratio means that most of the HTML document is viewable on the page, and relatively little of it is used for markup to format the page. Such a page probably looks pretty ugly if a person actually looks at it. And so, not surprisingly, as we go to higher and higher visible ratios, meaning more of the HTML document is visible, or less of it is used for formatting to make it look nice to a person, we see a higher spam percentage.

And we can keep playing this game. We can think of anything we want, try to build a feature for it, and extract it. Here are a couple of others that the 2006 paper used. They looked at the average word length in the body. The compression ratio, in the upper right-hand corner, is an interesting one: when you compress a document, you get high compression ratios when it has lots of repeated elements, meaning lots of the same kinds of words. That tends to be a spam signal. You can do some things with precision and recall, which are information retrieval measures of how well the document matches the top 1,000 words in the English language. And you can do lots of other things.
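Both the visible ratio and the compression ratio are cheap to compute from raw HTML. Here's a small sketch using only the Python standard library; the example documents are made up, and a production feature extractor would need to handle far messier markup:

```python
import zlib
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text that would actually render on the page,
    skipping the contents of <script> and <style> tags."""
    def __init__(self):
        super().__init__()
        self.chunks, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def visible_ratio(html):
    """Fraction of the raw HTML document that is visible text."""
    p = VisibleText()
    p.feed(html)
    return len("".join(p.chunks)) / len(html)

def compression_ratio(text):
    """Raw size over compressed size; repetitive (keyword-stuffed)
    text compresses better, so higher values are more suspicious."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

markup_heavy = ("<html><head><style>p{color:red}</style></head>"
                "<body><p>hi</p></body></html>")
bare_text = "<html><body>" + "cheap pills " * 50 + "</body></html>"
# The markup-heavy page has a much lower visible ratio than the
# bare, keyword-stuffed one, and the stuffed text compresses far
# better than a varied sentence.
```

On these toy inputs, `visible_ratio(markup_heavy)` comes out well below `visible_ratio(bare_text)`, matching the talk's observation that high-visible-ratio pages skew toward spam.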
One of the interesting things we tried comes from an idea I had early on. When I first started this, I looked at a fair amount of spam to get some sense of what the spam tactics were. One of the interesting things we saw is that anchor text on spam pages is very carefully chosen. Here's an example from an actual spam page we found, where the anchor text is very obvious: "kitchen equipment," a commercial-intent keyword. We see this pattern in the anchor text: if we look at sites that have been penalized and at the anchor text pointing to them, it tends to be commercial keywords that aren't branded. But to actually use this, we need some way of measuring the commercial intent of the anchor text, or of the words on the page. What we did was go to SEMRush and ask them for a list of the 25,000 highest cost-per-click and highest search-volume keywords. The idea is that this would be a good way of measuring commercial intent: a keyword with a high cost per click is a valuable keyword, so it seems natural that people would spam for it.

Unfortunately, the results were inconclusive. With hindsight, 25,000 keywords was not nearly enough. For instance, to try to quantify this, we took all the anchor text on the page, checked whether it appeared in this list, and computed the sum total of the cost per click of all that anchor text. That's the graph on the right-hand side. What you see is that 16,000 or 17,000 of the 22,000 sites have no overlap with this data set at all. We simply didn't have enough data here.
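The commercial-intent feature is just a lookup-and-sum over the anchor phrases. A sketch with made-up CPC values (the real table came from SEMRush), which also shows why coverage mattered: phrases missing from the table contribute nothing, so a too-small keyword list leaves most sites at zero.

```python
def commercial_intent(anchor_phrases, cpc_table):
    """Sum the cost-per-click of every anchor phrase found in a
    keyword -> CPC table.  Phrases absent from the table add nothing,
    which is exactly the coverage problem a 25,000-keyword list hit."""
    return sum(cpc_table.get(p.lower(), 0.0) for p in anchor_phrases)

# Hypothetical CPC values for illustration only.
cpc_table = {"cosmetic dentist": 12.40, "kitchen equipment": 3.10}
commercial_intent(["Cosmetic Dentist", "click here"], cpc_table)  # → 12.4
```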
So what are some things that are missing? We have basic on-page features and basic in-link features. What didn't we do? Spun content is a very popular spam technique nowadays. What you do is go out and scrape a page from somewhere, and then randomly substitute lots of words with synonyms. That way it won't be duplicate content, but it will still be about the same topic, and then you spam it: you throw up some fake blogs or whatever you're going to do with it. Here's an actual sentence from a spam site we found: "Clean money provides detailed info concerning on-line monetary unfold betting corporations." We read that as English speakers and think, huh, that doesn't make any sense. It would be nice to pull this out algorithmically in some way.

One other thing we found with fake blogs is that there's no user interaction; there's no reason for a human to ever go to the site. So they have no comments, no shares, no tweets. That seems like an obvious spam signal.

Sidebar and footer links. We didn't make any attempt to pull out sidebar and footer links. There are some additional engineering challenges there: first you need some way of determining which links are sidebar and footer links. But we're working on that. This is another common spam technique that it would be nice to pull out into a feature.

Okay, so that said, there's a lot missing from this. We've only pulled out a base set of features. So we can ask: how well can we model spam with just this data? It turns out we can do quite well. Using a logistic regression model, we can get 86 percent accuracy using just 32 features. Well, there are some caveats here.
29:06 We can get 83 percent accuracy by just guessing no 29:09 spam for every single prediction, so accuracy isn't a good measure at all. 29:12 In this case, it doesn't really tell us anything about how well the model does. 29:16 This 0.82 AUC, which is some very complicated way 29:19 of measuring the accuracy of the model, is actually a much better 29:25 way of measuring it. 29:27 You can't game it by doing things like always guessing non-spam. 29:29 And I can assure you that 0.82 AUC is actually pretty good. 29:32 That's actually quite good for a model this simple. 29:36 And I want to say a couple things about this. 29:38 Logistic regression is the simplest machine learning model that we could use for this. 29:40 It's very, very simple. 29:46 Thirty-two features, in terms of practical machine learning— 29:48 most practical machine learning uses hundreds or thousands or even more features. 29:51 Google's algorithm uses over 200 different things. 29:56 But each of those is probably sort of a meta feature. 29:59 Each of those uses its own things, and so using only 32 features is 30:03 really quite small for doing machine learning. 30:07 So we have a very simple model. 30:10 We have very few features, but yet we can get a very good model. 30:12 And the reason is because these features are actually quite predictive. 30:14 We do a really good job. We can go a little bit of a step further. 30:17 We can do something a little bit more complicated. 30:23 So in this case, instead of taking just one model for all of our data 30:25 and sending it through, (inaudible), we can actually split it into two. 30:29 And so here what I did was I set one logistic regression model 30:33 for just on-page features, things that we can get 30:37 just by crawling the page and looking at the page. 30:39 I took another logistic regression model for just in-link features. 30:41 So these are things that are computed just from links to the page itself. 
30:45 We send them through some sort of weighted average, some sort of mixture, 30:49 and this actually predicts whether something is spam or not. 30:53 This model performs the same as the other one. 30:56 We haven't actually increased the model complexity at all. 31:00 We've just split it into two. 31:02 So we don't get any increase in performance like this. 31:04 But we can do something really interesting with this model. 31:06 And what we can do is we can attribute responsibility. 31:09 We can attribute responsibility for whether a site is likely to be penalized or not 31:12 to its links or to how it looks. 31:17 So, for instance, if we go back to our hypothetical link graph here, 31:20 and we consider this page up here or this site up here, 31:22 this is a spam site. It's got lots of spam content on it. It's got lots of spam pointing to it. 31:26 So maybe the model might predict that, yeah, this 31:31 site has a 90 percent chance of being penalized. 31:33 And the responsibility, because it looks like spam and has spam linking to it, 31:35 is maybe 50-50 between in-link and on-page features. 31:41 If we consider this site instead, this is a legitimate site. 31:46 It's not a spam site, but it's got a lot of spam linking to it. 31:51 And so maybe the model might predict that, oh, maybe 31:54 there's a 65 percent chance that this site is actually penalized. 31:57 But we can say, well, of that 65 percent, 85 percent of the responsibility 32:00 for being penalized is due to the links pointing to it, 32:05 and only 15 percent is due to the actual content of the page itself. 32:08 So we can tell you how likely it is that you're penalized and 32:12 whether you're penalized because of who links to you or because of how you look. 32:16 Okay, so what are some takeaways here? 32:20 First takeaway is that when you look at a lot of data, 32:23 unnatural things jump out. They tend to be pretty obvious. 
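The responsibility split described above could be sketched as follows. The 50/50 mixing weight, the example logit values, and the function name are all invented for illustration, not the talk's actual model:

```python
# Toy sketch of attributing responsibility between an on-page model and
# an in-link model: mix the two sub-model logits, then read each one's
# share of the combined evidence as its "responsibility". The mixing
# weight and example logits below are invented for illustration.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_with_attribution(onpage_logit, inlink_logit, w_onpage=0.5):
    """Return (P(penalized), on-page share, in-link share)."""
    contrib_on = w_onpage * onpage_logit
    contrib_in = (1.0 - w_onpage) * inlink_logit
    p = sigmoid(contrib_on + contrib_in)
    total = abs(contrib_on) + abs(contrib_in)
    share_in = abs(contrib_in) / total if total else 0.5
    return p, 1.0 - share_in, share_in

# A clean-looking site with heavy spam linking to it: weak on-page
# signal, strong in-link signal, so most responsibility is on the links.
p, on_share, in_share = predict_with_attribution(0.2, 1.2)
```

For these made-up inputs the sketch would call the site roughly two-thirds likely to be penalized, with about 85 percent of that attributed to its in-links, mirroring the 65-percent/85-percent example above.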
32:26 And then you can actually do a pretty decent job at 32:31 modeling spam with relatively simple—as far as machine learning goes, 32:33 relatively simple models. 32:38 And if you are building spam links, if you are participating in spam, 32:40 and it's not cutting edge, then you're very likely to get whacked, 32:44 especially given the increased scrutiny that Google is placing on spam. 32:48 Second thing is that MozTrust is actually quite a good predictor of spam. 32:54 So we don't have a spam metric in Linkscape right now. 32:59 If you look at only one thing, look at MozTrust. 33:03 It actually does a pretty good job of predicting whether something is spam or not. 33:05 If you're building links from sites that have low MozTrust, 33:08 be very careful. 33:12 Third thing is that for the future of SEOmoz and the tools that we're building, 33:15 we do hope to have some sort of spam score available in the future in Mozscape. 33:21 We're not there yet. There are some engineering challenges that we need to solve first, before we can do that. 33:26 In the nearer term, we have a tool 33:32 that's currently in Labs at SEOmoz, called Freshscape, 33:39 which goes out and crawls the fresh web. 33:43 And we plan to repurpose a lot of this work to improve the quality 33:46 of Freshscape, to remove spam blogs, to remove low-quality content, and 33:50 improve the quality of Freshscape. 33:55 That's it. (applause) 33:57 >>My head's going to explode. (laughter) 34:09 That was awesome. You must have lots of questions, because— 34:13 (laughter) look at all the hands. Okay. You're going to have to run. 34:18 >>I'm not sure that you addressed this, but I have a question. 34:27 You talked about duplicate content. There are some hotels that have vanity sites, but they have a corporate site. 34:31 So, say, Holiday Inn's corporate site has the same information 34:37 as their local site for, say, New Orleans Holiday Inn. 
34:42 They get penalized, because Google looks at it as duplicate site information, 34:46 even though they've changed the words and added more info. 34:52 Do you have any idea how to get around that? 34:55 >>No, I don't. I think that you'd be better off asking someone who does SEO how to do that. 34:57 I actually don't really have any idea. 35:03 I have a little familiarity with the algorithms they use to detect duplicate content. 35:05 And I know that they don't just detect exact duplicates but near duplicates, too. 35:10 But in terms of actually practically doing it, you could probably do some things 35:16 with, like, rel=canonical or something, but I would actually have no idea. 35:20 >>Is a link profile refreshed in the sense that if I have a link from 35:28 a brand new site but the site is not actually a spam site, it just happens to be new, 35:35 over time it may gain MozRank or PageRank, making that 35:40 link more valuable, but do I still run the risk of being penalized 35:44 because it's a new site, or do I hang onto the link with the hope that 35:48 people will latch onto the content of the site and the MozRank will go up? 35:53 >>That depends on the details of how Google's algorithm works, 35:56 which—maybe I should have pointed this out before—I'm not actually— 36:01 everything I've presented in this talk is descriptive statistics. 36:05 I haven't said anything about—or I tried not to say anything about actual causation. 36:08 So I honestly don't know how Google's algorithm works. 36:13 I would hope that if the link came from a page that 36:18 other people link to and it was authoritative and had a higher PageRank, 36:22 then they wouldn't penalize you for that. 36:25 But, again, I'm not Google, so I don't really know the answer to that. 36:28 >>Hi, this is Michael Rotkin, and I own seochampion.com, and 36:33 I love your tools, by the way, and I loved your presentation. 
36:40 One of my questions is what goes into determining, on your side, 36:44 the domain authority, because a lot of my clients are paying attention 36:50 to your domain authority, because Google PageRank, that meter, 36:54 the 0 through 10 meter, seems to kind of go off and— 36:58 you know, we know how Google is. 37:01 They're all about making their own money on pay per click, 37:03 and they could pretty much care less. 37:06 And I do run a lot of sales through Google, because they're my merchant, 37:09 but I respect them, too, but I'm kind of curious 37:13 for clients, too, what really goes into your domain authority, 37:16 and then also I love your tools, love Rand's tools, love everything about this. 37:19 I've been going to Search Engine Strategies since early 2000, 37:24 and I think your conference is the best. 37:27 So I'm kind of interested more in an ETA, too, for your tools, 37:29 because I'm also promoting that quite a bit as well, and we've 37:34 been able to give you over 300 signups myself. 37:36 >>Well, that's great. We really appreciate your support. 37:40 And it makes us feel good when someone tells us they like our tools, 37:43 because we put a lot of hard work into producing them. 37:47 To answer your question about domain authority, 37:50 well, there's a short answer and there's a long answer. 37:53 The short answer is that when we updated the domain authority, 37:56 when I updated the domain authority model back in the fall, 37:59 I put out a long blog post that talks about domain authority and page authority. 38:02 So you can find that on our blog. 38:05 The longer answer is that we take all of our in-link metrics that we have 38:07 available in Mozscape, and then we send them through a machine learning algorithm 38:11 that predicts how likely that page is to rank higher in Google searches 38:15 across a very large data set. 
38:19 So domain authority and page authority are machine learning models that 38:21 take just the link profile of the page or domain 38:25 and predict how likely it is to rank. 38:28 As far as an ETA, when we expect to have this out, 38:32 we don't have an ETA right now for that. 38:36 >>Okay, the question was when you were looking at 38:41 high cost-per-click anchor text on the page, were you— 38:46 I recognize you'd be looking at external links, but I'd also— 38:50 did you look at internal links as well, the ratio? 38:53 >>For—>>High cost-per-click, the ratio— 38:55 >>Yeah, we did, and again, we had the same result. 38:58 Actually, if you go back and look at the slide deck, 39:01 the plot that I actually put in there was of just internal anchor text. 39:05 So we did internal anchor text, external anchor text. 39:08 I even did things like the entire page, like all the words on the page, 39:11 whether they're anchor text or not, and we basically got the same result, 39:14 that we didn't have enough data. We should have gotten 250,000 or 2.5 million keywords or something like that, but— 39:17 >>Hi. When you're analyzing backlinks using OSE, how do you differentiate between penalized sites and banned sites? 39:27 >>We don't have that information available in OSE. 39:37 This is something that we did outside of OSE itself. 39:40 So in this data set, the way that I differentiated 39:45 between banned and penalized sites was that if a site was 39:50 banned, then it was also penalized. 39:52 If a site wasn't banned, then we checked whether it was penalized 39:55 by seeing whether it fell off of the first page. 39:59 >>Hi. So my question is you've been differentiating between the on-page factors and the in-link factors. 40:04 And you've been able to say, like, 50 percent of 40:13 this penalty is due to on-page factors and stuff like that. 
40:17 And that to me seems like Google is able to differentiate between 40:20 the actual spam sites and the legitimate sites that have spam links. 40:26 From your research, does it seem like 40:31 Google treats them differently or the same way? 40:33 >>I'm not sure. All that we see is the end result, 40:38 whether a site is penalized or not, and we don't know why it's penalized or not. 40:42 And what I try to do is infer why it was penalized, 40:46 whether because it has spam links or because it actually had spam on the site. 40:52 Rand showed, for instance, this morning that we got a notice 40:57 because—that we had unnatural links pointing to seomoz.org. 41:00 And that tells me very clearly that they have detected that spam is pointing to you. 41:05 The notice didn't say that you look like you have spam on your site. 41:11 It said you look—you have unnatural links pointing to you. 41:16 So I would guess that they can differentiate this, but, 41:19 again, they're pretty tight-lipped about these things, so I don't really know for sure. 41:22 >>Which features are stronger, the on-page features or the in-link features? 41:27 >>I didn't have a chance to actually do that. 41:33 Honestly, I just did this model last week, so I haven't really had much time to spend with it. 41:35 That's one of the first things I wanted to go and look at. 41:39 >>Great talk, thank you. 41:53 In your estimation, how much of the World Wide Web is spam? 41:56 >>That's a good question. And I don't know the answer to that yet. 42:00 I do hope to estimate that at some point in the future. 42:06 Things like MozRank and MozTrust are logarithmic. 42:13 So that means that as you increase MozRank and MozTrust, 42:17 you get significantly fewer sites. 42:22 And if you just look at, for instance, the number of sites with MozRank 42:24 or MozTrust between 0 and 1 versus 1 and 2 versus 42:30 3 and 4 versus, you know, the rest of it, 42:34 most of the Internet has very low MozRank and MozTrust. 
42:37 So by that measure, if 40 percent of what has low MozRank is spam, 42:41 then at least a significant percentage of the web is spam. 42:48 >>I think we're a little over time, and I know that you guys have a lot of questions. 42:55 Is there a place where people can find you? 42:59 >>Yes, you can send me a tweet, or I'll be outside in the lobby. 43:02 You can come grab me. I'll also be at the various social engagements we have. 43:05 >>You know you're going to be attacked out there right now. (laughter) 43:09 All right, well, thank you very much. (applause) 43:11