Python › Scraping Data From the Web › A World Full of Spiders › Everyone Loves Charlotte

Unable to find and list all external links

Hi, I've written my function to scrape the external links from Treehouse's horse website. However, my function only retrieves the first external link from the webpage (https://en.wikipedia.org/wiki/Horse) and then continues on to find the first external link on the next webpage. For example:

https://treehouse-projects.github.io/horse-land/index.html

===============================================

https://www.biodiversitylibrary.org/page/726976

===============================================

https://about.biodiversitylibrary.org

===============================================

https://biodiversitylibrary.org/

and so on....

How would I go about finding and listing the external links that exist only on the first webpage (in this case, the Treehouse horse page)? For instance, I would like my final site_links list to be the following:

https://en.wikipedia.org/wiki/Horse

=========================================

https://commons.wikimedia.org/wiki/Horse_breeds

=========================================

https://commons.wikimedia.org/wiki/Horse_breeds

=========================================

https://creativecommons.org/licenses/by-sa/3.0/

My code is the following:

from urllib.request import urlopen
from bs4 import BeautifulSoup

import re

site_links = []

def external_links(linkURL):
    linkURL = re.sub('https://', '', linkURL)        
    html = urlopen('https://{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('a', href=re.compile('(^https://)'))

if __name__ == '__main__':
    urls = external_links('treehouse-projects.github.io/horse-land/index.html')
    while len(urls) > 0:
        page = urls.attrs['href']
        if page not in site_links:
            site_links.append(page)
            print(page)
            print('\n===================================\n')
            urls = external_links(page)
        else:
            break

Thanks and greatly appreciated!!

Josh Stephens
12,592 Points

I think what is happening is that you are using find() in your external_links function, which only gets the first anchor tag. If you change it to findAll(), you get back a list of all the anchors on that page. I wrote a recursive version in haste and it scraped a little too much, but if you want to try to correct it, here it is:

def recursive_link_retrieval(linkURL, sub_links=None):
    # Avoid a mutable default argument (sub_links=list() would be shared
    # between top-level calls); create a fresh list when none is passed in.
    links = sub_links if sub_links is not None else []
    linkURL = re.sub('https://', '', linkURL)
    html = urlopen('https://{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')

    new_links = soup.findAll('a', href=re.compile('(^https://)'))

    for link in new_links:
        # Compare hrefs rather than Tag objects, so the same URL
        # wrapped in different anchor text is still skipped.
        if link.attrs['href'] not in links:
            links.append(link.attrs['href'])
            recursive_link_retrieval(link.attrs['href'], links)

    return links
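For comparison, here is a minimal non-recursive sketch that only collects the external links on the first page, which is what you asked for. I split the parsing out from the fetching so the find_all() part is easy to test on its own; the regex and URL are the ones from your original code, and actually running the __main__ block needs network access.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

def external_links(html):
    # find_all returns a list of every matching <a> tag, not just the first.
    soup = BeautifulSoup(html, 'html.parser')
    return [a.attrs['href']
            for a in soup.find_all('a', href=re.compile('^https://'))]

if __name__ == '__main__':
    html = urlopen('https://treehouse-projects.github.io/horse-land/index.html')
    site_links = []
    for page in external_links(html):
        if page not in site_links:
            site_links.append(page)
            print(page)
            print('\n===================================\n')
```

Since there is no recursion, the function never follows a link, so only the links that appear on index.html itself end up in site_links.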

1 Answer

Beau Genereux
4,887 Points

This is my solution but:

  • I did not check to validate whether it parses any external links on mustang.html
  • I am pretty sure there are more straightforward ways to code it, but this is where I am for now
  • I wanted to use a two-dimensional array all_links[[internal_links],[external_links]] but was already a bit challenged.

Hope it helps :)

from urllib.request import urlopen
from bs4 import BeautifulSoup

import re

# CHALLENGE: Fetch all external urls.

# An empty array to store the links.
links_int = []
links_ext = []

# Fetches all links.
def all_links(linkURL):
    #print('## inside def all_links()')
    # The format placeholder was missing here, so linkURL was silently ignored.
    html = urlopen('https://treehouse-projects.github.io/horse-land/{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')
    # Get all the links on the page.
    for link in soup.find_all('a'):
        page = link.attrs['href']
        # A plain membership test is enough; the old while loop only ever
        # ran once per link before breaking.
        if page not in links_int and page not in links_ext:
            links_ext.append(page)
            #print(page)

# Fetches internal links only.
def internal_links(linkURL):
    html = urlopen('https://treehouse-projects.github.io/horse-land/{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')

    # Escape the dot so the pattern matches a literal ".html" suffix.
    return soup.find('a', href=re.compile(r'\.html$'))


# I read about this but don't understand it yet.
if __name__ == '__main__':
    #print('## inside if __name__')
    urls = internal_links('index.html')
    while len(urls) > 0:
        page = urls.attrs['href']
        if page not in links_int:
            links_int.append(page)
            #print(page)
            #print('\n')
            urls = internal_links(page)
        else:
            all_links(page)
            break

print('links_int =', links_int)
print('links_ext =', links_ext)
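As a follow-up, here is a rough sketch of the two-list idea I mentioned: classify every href on one page as internal or external in a single pass. The parsing is separated from the fetching so the classification logic can be checked without a network connection; BASE is just the site root from the course, and the internal/external rule (absolute http(s) URL vs. relative path) is my own assumption.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

BASE = 'https://treehouse-projects.github.io/horse-land/'

def classify_links(html):
    soup = BeautifulSoup(html, 'html.parser')
    links_int, links_ext = [], []
    for a in soup.find_all('a', href=True):
        href = a['href']
        if href.startswith('https://') or href.startswith('http://'):
            target = links_ext
        else:
            # Relative links like mustang.html are treated as internal.
            target = links_int
        if href not in target:
            target.append(href)
    return links_int, links_ext

if __name__ == '__main__':
    html = urlopen(BASE + 'index.html')
    links_int, links_ext = classify_links(html)
    print('links_int =', links_int)
    print('links_ext =', links_ext)
```

Returning the two lists as a tuple stands in for the two-dimensional array: all_links would just be [links_int, links_ext].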