Welcome to the Treehouse Community

Want to collaborate on code errors? Have bugs you need feedback on? Looking for an extra set of eyes on your latest project? Get support with fellow developers, designers, and programmers of all backgrounds and skill levels here with the Treehouse Community! While you're at it, check out some resources Treehouse students have shared here.

Looking to learn something new?

Treehouse offers a seven day free trial for new students. Get access to thousands of hours of content and join thousands of Treehouse students and alumni in the community today.

Start your free trial

JavaScript

How do I even begin to think about scraping a site? We were not taught how to do this.

So I've reached project 6. The content scraper project. Unfortunately, the only things we were taught about npms are how to install and uninstall them. We were never taught how to use them.

How do I even think about starting to scrape a site? Is there a course I should watch? The ones in the degree have not prepared me for this project AT ALL.

Again, all we were taught was how to install, update or uninstall a package. I have no idea where to begin.

10 Answers

Michael Liendo
Michael Liendo
15,326 Points

Smithi Milli has a pretty good video on webscraping in node. There's quite a few ways to go about it, but this is probably one of the easiest. Not sure of the requirements for your assignment, but I hope this helps: https://www.youtube.com/watch?v=LJHpm0J688Y

Michael Liendo
Michael Liendo
15,326 Points

Why was my post downvoted when it's the same as the post above: A guide that references Cheerio.

Thomas Nilsen
Thomas Nilsen
14,957 Points

As promised. Here is a fairly complete example (with the exception of writing this to a CSV-file)

The result from my code looks like this:

imgur

var request = require('request');
var cheerio = require('cheerio');

var baseURI = 'http://www.shirts4mike.com/';
var allShirtsURL = baseURI + 'shirts.php';


// ========= MAIN CALL ==========

getAllShirtDetailsAsync()
    .then(function(data) {
        console.log(data);
    })
    .catch(function(err) {
        console.log(err);
    });



//First, we create a async call for getting all the different shirt-urls (where all the details are)
function getShirtEndpointsAsync() {
    return new Promise(function (resolve, reject) {
        request(allShirtsURL, function (error, response, body) {

            //If no errors - we're good to move on..
            if (!error && response.statusCode === 200) {

                //Load the entire body into cheerio
                //Now we can start searching for elements...
                var $ = cheerio.load(body);

                //Storage for the urls
                //e.g. shirt.php?id=101
                var storage = [];
                $('.products li').each(function (i, elem) {
                    var shirt = $(this).find('a').attr('href');
                    storage.push(shirt);
                });

                //Make a complete url out of all the above
                //e.g we turn this: shirt.php?id=101
                //into this: http://www.shirts4mike.com/shirt.php?id=101
                var specificShirtsURL = storage.map(function (endpoint) {
                    return baseURI + endpoint;
                });

                //Resolve all the urls we made, so we can use them in our next async call
                resolve(specificShirtsURL);
            } else {
                //Something annoying has happened...
                reject('Could not find anything about anything or anyone...');
            }
        });
    });
}

//Here we use the urls from above to extract the things we want/need from each shirt page. 

function getAllShirtDetailsAsync() {
    return getShirtEndpointsAsync().then(function (endpoints) {

        //Map every endpoint so we can make a request with each URL
        var promises = endpoints.map(function (endpoint) {
            return new Promise(function (resolve, reject) {

                request(endpoint, function (error, response, body) {

                    //Again - check for no errorr...
                    if (!error && response.statusCode === 200) {

                        //Load in our body (containing all the shitr details..)
                        var $ = cheerio.load(body);

                        //Create a object from with the info
                        var productDetails = {
                            url: endpoint,
                            imageUrl: $('.shirt-picture span img').attr('src'),
                            prize: $('.shirt-details h1 span').text(),
                            title: $('.shirt-details h1').text(),
                            time: new Date()
                        };

                        //Resolve it.
                        resolve(productDetails)


                    } else {

                        //Annying error
                        reject('Error while getting info about all the different shirts...');
                    }
                });
            });
        });

        //Resolve ALL the promises from above (we are after all making multiple call to get all the different shirt info)
        return Promise.all(promises);
    })
    .then(function (data) {
        return data;
    })
    .catch(function (err) {
        return Promise.reject(err);
    });
}

It works but I really cannot follow it. Once again, I don't understand why I'm being testing on something that I wasn't taught.

Neil Bircumshaw
seal-mask
.a{fill-rule:evenodd;}techdegree
Neil Bircumshaw
Full Stack JavaScript Techdegree Student 14,597 Points

Hi, I understand most of that code except this part

return getShirtEndpointsAsync().then(function (endpoints) {

        //Map every endpoint so we can make a request with each URL
        var promises = endpoints.map(function (endpoint) {
            return new Promise(function (resolve, reject) {

Could you explain what's going on here? So you're returning the ShirtEndpointsAsync() function to get all the 8 URLS, would you then use the "then" method to say what to do with this function? and what exactly is "endpoints". I'm totally lost at the "promises = endpoints.map..." - I just have no clue what this means, I thought the map method duplicated an array.

Cheers if anyone is there to reply to this!

Thomas Nilsen
Thomas Nilsen
14,957 Points

This is a pretty good guide

TLDR;

An NPM-module called cheerio should help you do this

Thomas Nilsen
Thomas Nilsen
14,957 Points

Good question. I upvoted you now at least.

Thomas Nilsen
Thomas Nilsen
14,957 Points

I'll put together a sample project in order to help you out and post it here when I'm done. Shouldn't take too long.

Thomas Nilsen
Thomas Nilsen
14,957 Points

Have none of the videos you have watched thus far been about async programming?

Even so, I think that even though you're enrolled in a tech degree-program, they also (to some extent) expect you to do some research and figure out things on your own.

I tried to comment the code to make it a little easier to follow. Let me know if you have any questions.

Are we allowed to use more than one npm package. Most guides use request and cheerio, but then obviously we have to use a json to csv parser. So that's 3.

Are we allowed to do that?

Thomas Nilsen
Thomas Nilsen
14,957 Points

Neil Bircumshaw

This method:

return getShirtEndpointsAsync().then(function (endpoints) {
 //...
}

endpoints is an array of all of the urls

Then i map all the endpoints to a new Promise, like so:

var promises = endpoints.map(function (endpoint) {
            return new Promise(function (resolve, reject) {
                //etc...
            }
            //etc...
}

Map doesn't duplicate arrays like you said. You can call map on an array to change every element.

Here is a small example of it:

var arr = [1, 2, 3, 4]

var modifiedArr = arr.map(function(n) {
    return n*2;
});

//prints [ 2, 4, 6, 8 ]
console.log(modifiedArr);

so essentially, I create and array of Promises, called promises and return it like so:

return Promise.all(promises);
Thomas Nilsen
Thomas Nilsen
14,957 Points

Neil Bircumshaw

I do it in the getShirtEndpointsAsync() method.

//Make a complete url out of all the above
 //e.g we turn this: shirt.php?id=101
//into this: http://www.shirts4mike.com/shirt.php?id=101
var specificShirtsURL = storage.map(function (endpoint) {
    return baseURI + endpoint;
});

//Resolve all the urls we made, so we can use them in our next async call
resolve(specificShirtsURL);

The I access the data when I call getAllShirtDetailsAsync() method:

 return getShirtEndpointsAsync().then(function (endpoints) {
    //now I have access to the endpoints that was resolved in the method i mentioned before
}
Neil Bircumshaw
seal-mask
.a{fill-rule:evenodd;}techdegree
Neil Bircumshaw
Full Stack JavaScript Techdegree Student 14,597 Points

But one is called "endpoints" and one is called "endpoint", I'm just confused as to where the reference to the "endpoints" array is mentioned to start with.

var specificShirtsURL = storage.map(function (endpoint) { return baseURI + endpoint;

This is where you create the array of full URL's. and then resolve this in the getShirtEndpointsAsync() function. But I don't understand how you got these full URL's to be reference to the word "endpoints"

Sorry, I'm just finding it hard to see.

Thomas Nilsen
Thomas Nilsen
14,957 Points

Neil Bircumshaw

When i call

resolve(specificShirtsURL);

in getShirtEndpointsAsync(), and at a later point calls the method like so:

getShirtEndpointsAsync().then(function(endpoints) {
})

I do call it endpoints, but in reality, I can call it whatever I want. You may want to read up on the concept on callbacks/promises.

Same goes for map-function:

 var promises = endpoints.map(function (endpoint) {}

Here is the same example again:

var arr = [1, 2, 3, 4]

var modifiedArr = arr.map(function(iCanCallThisValueWhateverIWant) {
    return iCanCallThisValueWhateverIWant*2;
});

//prints [ 2, 4, 6, 8 ]
console.log(modifiedArr);