
Jason Tran
7,208 Points

Unable to find and list all external links

Hi, I've written a function to scrape the external links from Treehouse's horse website. However, my function only retrieves the first external link on the page (https://en.wikipedia.org/wiki/Horse) and then continues on to find the first external link on the next page, and so on. For example:

https://treehouse-projects.github.io/horse-land/index.html

===============================================

https://www.biodiversitylibrary.org/page/726976

===============================================

https://about.biodiversitylibrary.org

===============================================

https://biodiversitylibrary.org/

and so on....

How would I go about finding and listing only the external links that exist on the first webpage (in this case the Treehouse horse webpage)? For instance, I would like my final site_links list to be the following:

https://en.wikipedia.org/wiki/Horse

=========================================

https://commons.wikimedia.org/wiki/Horse_breeds

=========================================

https://commons.wikimedia.org/wiki/Horse_breeds

=========================================

https://creativecommons.org/licenses/by-sa/3.0/

My code is the following:

from urllib.request import urlopen
from bs4 import BeautifulSoup

import re

site_links = []

def external_links(linkURL):
    linkURL = re.sub('https://', '', linkURL)        
    html = urlopen('https://{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('a', href=re.compile('(^https://)'))

if __name__ == '__main__':
    urls = external_links('treehouse-projects.github.io/horse-land/index.html')
    while len(urls) > 0:
        page = urls.attrs['href']
        if page not in site_links:
            site_links.append(page)
            print(page)
            print('\n===================================\n')
            urls = external_links(page)
        else:
            break

Thanks, any help is greatly appreciated!!

Josh Stephens
11,474 Points

I think what is happening is that you are using find() in your external_links function, which only returns the first anchor tag. If you change it to findAll() (find_all() is the current name in BeautifulSoup; findAll() still works as a legacy alias), you get back a list of all matching anchors for that page. I wrote a recursive version in haste and it scraped a little too much, but if you want to try to correct it, here it is:

# Uses the same imports as your original code (urlopen, BeautifulSoup, re)
def recursive_link_retrieval(linkURL, sub_links=list()):
    links = sub_links
    # Strip the scheme, then rebuild the URL, so input with or without https:// works
    linkURL = re.sub('https://', '', linkURL)
    html = urlopen('https://{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')

    # findAll() returns every matching anchor on the page, not just the first one
    new_links = soup.findAll('a', href=re.compile('(^https://)'))

    for link in new_links:
        # This compares whole Tag objects, so the same href with different link
        # text counts as new -- part of why this version scrapes too much
        if link not in links:
            links.append(link)
            # Recurses into every external page it finds, with no depth limit
            recursive_link_retrieval(link.attrs['href'], links)

    return links
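
If the goal is just to list the external links on the starting page (what your desired site_links output shows), you don't need recursion at all. Here is a minimal non-recursive sketch along those lines; the function name list_external_links is my own choice, and it dedupes by comparing href strings rather than Tag objects:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

def list_external_links(link_url):
    # Fetch and parse the starting page only -- no recursion into other sites
    html = urlopen(link_url)
    soup = BeautifulSoup(html, 'html.parser')
    # find_all() returns every anchor whose href starts with https://
    anchors = soup.find_all('a', href=re.compile('^https://'))
    site_links = []
    for anchor in anchors:
        href = anchor.attrs['href']
        # Compare href strings so the same URL isn't listed twice
        if href not in site_links:
            site_links.append(href)
    return site_links

if __name__ == '__main__':
    for link in list_external_links('https://treehouse-projects.github.io/horse-land/index.html'):
        print(link)
        print('\n===================================\n')

I haven't run this against the live page, so treat it as a starting point rather than a drop-in replacement.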