matthewshear
9,229 Points
How do I even begin to think about scraping a site? We were not taught how to do this.
So I've reached project 6, the content scraper project. Unfortunately, the only thing we were taught about npm packages is how to install and uninstall them. We were never taught how to use them.
How do I even think about starting to scrape a site? Is there a course I should watch? The ones in the degree have not prepared me for this project AT ALL.
Again, all we were taught was how to install, update or uninstall a package. I have no idea where to begin.
10 Answers
Michael Liendo
15,326 Points
Smithi Milli has a pretty good video on web scraping in Node. There are quite a few ways to go about it, but this is probably one of the easiest. Not sure of the requirements for your assignment, but I hope this helps: https://www.youtube.com/watch?v=LJHpm0J688Y
Thomas Nilsen
14,957 Points
As promised. Here is a fairly complete example (with the exception of writing this to a CSV file).
The result from my code looks like this:
var request = require('request');
var cheerio = require('cheerio');
var baseURI = 'http://www.shirts4mike.com/';
var allShirtsURL = baseURI + 'shirts.php';
// ========= MAIN CALL ==========
getAllShirtDetailsAsync()
    .then(function (data) {
        console.log(data);
    })
    .catch(function (err) {
        console.log(err);
    });
//First, we create an async call for getting all the different shirt urls (where all the details are)
function getShirtEndpointsAsync() {
    return new Promise(function (resolve, reject) {
        request(allShirtsURL, function (error, response, body) {
            //If no errors - we're good to move on..
            if (!error && response.statusCode === 200) {
                //Load the entire body into cheerio
                //Now we can start searching for elements...
                var $ = cheerio.load(body);
                //Storage for the urls
                //e.g. shirt.php?id=101
                var storage = [];
                $('.products li').each(function (i, elem) {
                    var shirt = $(this).find('a').attr('href');
                    storage.push(shirt);
                });
                //Make a complete url out of all the above
                //e.g. we turn this: shirt.php?id=101
                //into this: http://www.shirts4mike.com/shirt.php?id=101
                var specificShirtsURL = storage.map(function (endpoint) {
                    return baseURI + endpoint;
                });
                //Resolve all the urls we made, so we can use them in our next async call
                resolve(specificShirtsURL);
            } else {
                //Something annoying has happened...
                reject('Could not find anything about anything or anyone...');
            }
        });
    });
}
//Here we use the urls from above to extract the things we want/need from each shirt page.
function getAllShirtDetailsAsync() {
    return getShirtEndpointsAsync().then(function (endpoints) {
        //Map every endpoint so we can make a request with each URL
        var promises = endpoints.map(function (endpoint) {
            return new Promise(function (resolve, reject) {
                request(endpoint, function (error, response, body) {
                    //Again - check for no errors...
                    if (!error && response.statusCode === 200) {
                        //Load in our body (containing all the shirt details..)
                        var $ = cheerio.load(body);
                        //Create an object with the info
                        var productDetails = {
                            url: endpoint,
                            imageUrl: $('.shirt-picture span img').attr('src'),
                            price: $('.shirt-details h1 span').text(),
                            title: $('.shirt-details h1').text(),
                            time: new Date()
                        };
                        //Resolve it.
                        resolve(productDetails);
                    } else {
                        //Annoying error
                        reject('Error while getting info about all the different shirts...');
                    }
                });
            });
        });
        //Resolve ALL the promises from above (we are after all making multiple calls to get all the different shirt info)
        return Promise.all(promises);
    })
    .then(function (data) {
        return data;
    })
    .catch(function (err) {
        return Promise.reject(err);
    });
}
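The example above stops short of the CSV step. Below is a rough sketch of how an array of objects could be turned into CSV text without an extra package. The helper names `escapeCsvField` and `toCsv` are made up for this sketch, and the sample data is invented to resemble the scraped objects, so treat it as a starting point rather than the required solution.

```javascript
// Hypothetical helper: quote a value if it contains a comma, a quote, or a line break.
function escapeCsvField(value) {
    var text = value == null ? '' : String(value);
    if (/[",\r\n]/.test(text)) {
        // Inside a quoted CSV field, a double quote is written as two double quotes.
        return '"' + text.replace(/"/g, '""') + '"';
    }
    return text;
}

// Hypothetical helper: turn an array of objects into CSV text.
// `columns` lists the object keys to export, in the order they should appear.
function toCsv(rows, columns) {
    var header = columns.map(escapeCsvField).join(',');
    var lines = rows.map(function (row) {
        return columns.map(function (key) {
            return escapeCsvField(row[key]);
        }).join(',');
    });
    return [header].concat(lines).join('\n');
}

// Invented sample data, shaped roughly like the objects resolved above.
var sample = [
    { title: 'Logo Shirt, Red', url: 'http://www.shirts4mike.com/shirt.php?id=101' },
    { title: 'Mike the Frog Shirt', url: 'http://www.shirts4mike.com/shirt.php?id=102' }
];

console.log(toCsv(sample, ['title', 'url']));
```

The resulting string could then be saved with Node's built-in `fs.writeFile`. A dedicated CSV package would do the same job if the project allows one.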
matthewshear
9,229 Points
It works but I really cannot follow it. Once again, I don't understand why I'm being tested on something that I wasn't taught.
Neil Bircumshaw
Full Stack JavaScript Techdegree Student 14,597 Points
Hi, I understand most of that code except this part:
return getShirtEndpointsAsync().then(function (endpoints) {
    //Map every endpoint so we can make a request with each URL
    var promises = endpoints.map(function (endpoint) {
        return new Promise(function (resolve, reject) {
Could you explain what's going on here? So you're returning the getShirtEndpointsAsync() function to get all 8 URLs. Would you then use the "then" method to say what to do with the result? And what exactly is "endpoints"? I'm totally lost at the "promises = endpoints.map..." part. I just have no clue what this means; I thought the map method duplicated an array.
Cheers if anyone is there to reply to this!
Thomas Nilsen
14,957 Points
Good question. I upvoted you now at least.
Thomas Nilsen
14,957 Points
I'll put together a sample project in order to help you out and post it here when I'm done. Shouldn't take too long.
Thomas Nilsen
14,957 Points
Have none of the videos you have watched thus far been about async programming?
Even so, although you're enrolled in a Techdegree program, I think they also expect you, to some extent, to do some research and figure things out on your own.
I tried to comment the code to make it a little easier to follow. Let me know if you have any questions.
ruhullalam
14,500 Points
Are we allowed to use more than one npm package? Most guides use request and cheerio, but then obviously we have to use a JSON-to-CSV parser. So that's 3.
Are we allowed to do that?
Thomas Nilsen
14,957 Points
This method:
return getShirtEndpointsAsync().then(function (endpoints) {
    //...
});
endpoints is an array of all of the URLs:
- http://www.shirts4mike.com/shirt.php?id=101
- http://www.shirts4mike.com/shirt.php?id=102
- http://www.shirts4mike.com/shirt.php?id=103
- etc...
Then I map each endpoint to a new Promise, like so:
var promises = endpoints.map(function (endpoint) {
    return new Promise(function (resolve, reject) {
        //etc...
    });
});
Map doesn't duplicate arrays like you said. Calling map on an array returns a new array in which every element has been transformed by the function you pass in.
Here is a small example of it:
var arr = [1, 2, 3, 4];
var modifiedArr = arr.map(function (n) {
    return n * 2;
});
//prints [ 2, 4, 6, 8 ]
console.log(modifiedArr);
So essentially, I create an array of Promises, called promises, and return it like so:
return Promise.all(promises);
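To see map and Promise.all working together without touching the network, here is a small sketch. `fakeRequest` is a made-up stand-in for `request`: it only waits a moment and resolves with a string, so the names and the returned text are assumptions for illustration.

```javascript
// Made-up stand-in for a real HTTP request: resolves after a short delay.
function fakeRequest(url) {
    return new Promise(function (resolve) {
        setTimeout(function () {
            resolve('details for ' + url);
        }, 10);
    });
}

var endpoints = ['shirt.php?id=101', 'shirt.php?id=102', 'shirt.php?id=103'];

// map turns every url into a pending Promise...
var promises = endpoints.map(function (endpoint) {
    return fakeRequest(endpoint);
});

// ...and Promise.all combines them into one Promise that resolves
// with an array of results, in the same order as the input array.
var allDetails = Promise.all(promises);

allDetails.then(function (results) {
    console.log(results);
});
```

If any one of the promises rejects, Promise.all rejects as well, which is why a single catch on the main call is enough to report a failed request.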
Neil Bircumshaw
Full Stack JavaScript Techdegree Student 14,597 Points
Thank you for your answer. Where is it that you declare "endpoints" as the array and set it to the list of URLs (http://www.shirts4mike.com/shirt.php?id=101, http://www.shirts4mike.com/shirt.php?id=102, http://www.shirts4mike.com/shirt.php?id=103, etc.)?
I think that was the main thing confusing me haha!
Thomas Nilsen
14,957 Points
I do it in the getShirtEndpointsAsync() method.
//Make a complete url out of all the above
//e.g. we turn this: shirt.php?id=101
//into this: http://www.shirts4mike.com/shirt.php?id=101
var specificShirtsURL = storage.map(function (endpoint) {
    return baseURI + endpoint;
});
//Resolve all the urls we made, so we can use them in our next async call
resolve(specificShirtsURL);
Then I access the data when I call the getAllShirtDetailsAsync() method:
return getShirtEndpointsAsync().then(function (endpoints) {
    //now I have access to the endpoints that were resolved in the method I mentioned before
});
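Stripped of the scraping, the hand-off from resolve to then might look like the sketch below. The names `getNumbersAsync`, `numbers`, and `whatever` are invented for the illustration and do not appear in the project code.

```javascript
// Made-up function: resolves with an array, much like getShirtEndpointsAsync does.
function getNumbersAsync() {
    return new Promise(function (resolve) {
        var numbers = [1, 2, 3];
        // Whatever is passed to resolve()...
        resolve(numbers);
    });
}

// ...arrives as the first argument of the function given to then().
// The parameter name is chosen here, not inside getNumbersAsync.
var received = getNumbersAsync().then(function (whatever) {
    console.log(whatever);
    return whatever;
});
```

So the array is never declared under the name endpoints anywhere else. It is the resolved value, and the callback simply gives it a name when it receives it.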
Neil Bircumshaw
Full Stack JavaScript Techdegree Student 14,597 Points
But one is called "endpoints" and one is called "endpoint". I'm just confused as to where the "endpoints" array is first referenced.
var specificShirtsURL = storage.map(function (endpoint) {
    return baseURI + endpoint;
});
This is where you create the array of full URLs and then resolve it in the getShirtEndpointsAsync() function. But I don't understand how these full URLs came to be referenced by the word "endpoints".
Sorry, I'm just finding it hard to see.
Thomas Nilsen
14,957 Points
When I call
resolve(specificShirtsURL);
in getShirtEndpointsAsync(), and later call the method like so:
getShirtEndpointsAsync().then(function (endpoints) {
});
I do call it endpoints, but in reality, I can call it whatever I want. You may want to read up on the concept of callbacks/promises.
The same goes for the map function:
var promises = endpoints.map(function (endpoint) {});
Here is the same example again:
var arr = [1, 2, 3, 4];
var modifiedArr = arr.map(function (iCanCallThisValueWhateverIWant) {
    return iCanCallThisValueWhateverIWant * 2;
});
//prints [ 2, 4, 6, 8 ]
console.log(modifiedArr);
Michael Liendo
15,326 Points
Why was my post downvoted when it's the same as the post above: a guide that references Cheerio?