Python › Scraping Data From the Web › A World Full of Spiders › Everyone Loves Charlotte

Unable to find and list all external links

Hi, I've written my function to scrape the external links from Treehouse's horse website. However, my function only retrieves the first external link from the webpage (https://en.wikipedia.org/wiki/Horse) and then continues on to find the first external link on the next webpage. For example:

https://treehouse-projects.github.io/horse-land/index.html

===============================================

https://www.biodiversitylibrary.org/page/726976

===============================================

https://about.biodiversitylibrary.org

===============================================

https://biodiversitylibrary.org/

and so on....

How would I go about finding and listing the external links that exist only on the first webpage (in this case, the Treehouse horse page)? For instance, I would like my final site_links list to be the following:

https://en.wikipedia.org/wiki/Horse

=========================================

https://commons.wikimedia.org/wiki/Horse_breeds

=========================================

https://commons.wikimedia.org/wiki/Horse_breeds

=========================================

https://creativecommons.org/licenses/by-sa/3.0/

My code is the following:

from urllib.request import urlopen
from bs4 import BeautifulSoup

import re

site_links = []

def external_links(linkURL):
    linkURL = re.sub('https://', '', linkURL)        
    html = urlopen('https://{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('a', href=re.compile('(^https://)'))

if __name__ == '__main__':
    urls = external_links('treehouse-projects.github.io/horse-land/index.html')
    while len(urls) > 0:
        page = urls.attrs['href']
        if page not in site_links:
            site_links.append(page)
            print(page)
            print('\n===================================\n')
            urls = external_links(page)
        else:
            break

Thanks and greatly appreciated!!

Josh Stephens
12,592 Points

I think what is happening is that you are using find() in your external_links function, which only gets the first anchor tag. If you change it to findAll(), you get back a list of all the anchors on that page. I wrote a recursive version in haste and it scraped a little too much, but if you want to try to correct it, here it is:

def recursive_link_retrieval(linkURL, sub_links=None):
    # Avoid a mutable default argument (sub_links=list() would be shared
    # between top-level calls); create a fresh list when none is passed in.
    links = sub_links if sub_links is not None else []
    linkURL = re.sub('https://', '', linkURL)
    html = urlopen('https://{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')

    new_links = soup.findAll('a', href=re.compile('(^https://)'))

    for link in new_links:
        # Compare hrefs rather than Tag objects, so the same URL
        # wrapped in different anchor text is still skipped.
        if link.attrs['href'] not in links:
            links.append(link.attrs['href'])
            recursive_link_retrieval(link.attrs['href'], links)

    return links
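For comparison, here is a minimal non-recursive sketch that only collects the external links on the first page, which is what you asked for. I split the parsing out from the fetching so the find_all() part is easy to test on its own; the regex and URL are the ones from your original code, and actually running the __main__ block needs network access.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

def external_links(html):
    # find_all returns a list of every matching <a> tag, not just the first.
    soup = BeautifulSoup(html, 'html.parser')
    return [a.attrs['href']
            for a in soup.find_all('a', href=re.compile('^https://'))]

if __name__ == '__main__':
    html = urlopen('https://treehouse-projects.github.io/horse-land/index.html')
    site_links = []
    for page in external_links(html):
        if page not in site_links:
            site_links.append(page)
            print(page)
            print('\n===================================\n')
```

Since there is no recursion, the function never follows a link, so only the links that appear on index.html itself end up in site_links.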

1 Answer

Beau Genereux
4,887 Points

This is my solution but:

  • I did not check to validate whether it parses any external links on mustang.html
  • I am pretty sure there are more straightforward ways to code it, but this is where I am for now
  • I wanted to use a two-dimensional array all_links[[internal_links],[external_links]] but was already a bit challenged.

Hope it helps :)

from urllib.request import urlopen
from bs4 import BeautifulSoup

import re

# CHALLENGE: Fetch all external urls.

# An empty array to store the links.
links_int = []
links_ext = []

# Fetches all links.
def all_links(linkURL):
    #print('## inside def all_links()')
    # The format placeholder was missing here, so linkURL was silently ignored.
    html = urlopen('https://treehouse-projects.github.io/horse-land/{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')
    # Get all the links on the page.
    for link in soup.find_all('a'):
        page = link.attrs['href']
        # A plain membership test is enough; the old while loop only ever
        # ran once per link before breaking.
        if page not in links_int and page not in links_ext:
            links_ext.append(page)
            #print(page)

# Fetches internal links only.
def internal_links(linkURL):
    html = urlopen('https://treehouse-projects.github.io/horse-land/{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')

    # Escape the dot so the pattern matches a literal ".html" suffix.
    return soup.find('a', href=re.compile(r'\.html$'))


# I read about this but don't understand it yet.
if __name__ == '__main__':
    #print('## inside if __name__')
    urls = internal_links('index.html')
    while len(urls) > 0:
        page = urls.attrs['href']
        if page not in links_int:
            links_int.append(page)
            #print(page)
            #print('\n')
            urls = internal_links(page)
        else:
            all_links(page)
            break

print('links_int =', links_int)
print('links_ext =', links_ext)
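As a follow-up, here is a rough sketch of the two-list idea I mentioned: classify every href on one page as internal or external in a single pass. The parsing is separated from the fetching so the classification logic can be checked without a network connection; BASE is just the site root from the course, and the internal/external rule (absolute http(s) URL vs. relative path) is my own assumption.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

BASE = 'https://treehouse-projects.github.io/horse-land/'

def classify_links(html):
    soup = BeautifulSoup(html, 'html.parser')
    links_int, links_ext = [], []
    for a in soup.find_all('a', href=True):
        href = a['href']
        if href.startswith('https://') or href.startswith('http://'):
            target = links_ext
        else:
            # Relative links like mustang.html are treated as internal.
            target = links_int
        if href not in target:
            target.append(href)
    return links_int, links_ext

if __name__ == '__main__':
    html = urlopen(BASE + 'index.html')
    links_int, links_ext = classify_links(html)
    print('links_int =', links_int)
    print('links_ext =', links_ext)
```

Returning the two lists as a tuple stands in for the two-dimensional array: all_links would just be [links_int, links_ext].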