
Michael Strand
10,897 Points

Finding Internal Links Without .html

Hi, how would you go about finding internal links that have a URL structure like https://www.website.com/about-us?

Michael Strand
10,897 Points

This is my starting point, but it only returns one URL.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

site_links = []


def internal_links(linkURL):
    html = urlopen('https://www.website.com/{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')

    return soup.find('a', href=re.compile('(^https://www.website.com/)'))


if __name__ == '__main__':
    urls = internal_links('/')
    while len(urls) > 0:
        page = urls.attrs['href']

        if page not in site_links:
            site_links.append(page)

            print(page)
            print('\n===========\n')
            urls = internal_links(page)

        else:
            break

2 Answers

Steven Parker
177,727 Points

According to the documentation, "find" always returns just one result. To get a list of all matching items, use "find_all" instead.
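
For illustration, here's a minimal sketch of what a find_all version could look like, assuming the same placeholder site (https://www.website.com) and a to-visit list so every link on each page gets followed rather than just the first. The names to_visit and BASE are hypothetical, not from the course:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Placeholder site, as in the original question
BASE = re.compile(r'^https://www\.website\.com/')

site_links = []


def internal_links(link_url):
    html = urlopen(link_url)
    soup = BeautifulSoup(html, 'html.parser')

    # find_all returns a list of every matching <a> tag;
    # find would return only the first one
    return soup.find_all('a', href=BASE)


if __name__ == '__main__':
    to_visit = ['https://www.website.com/']
    while to_visit:
        page = to_visit.pop()
        if page in site_links:
            continue  # skip pages already crawled
        site_links.append(page)
        print(page)

        for link in internal_links(page):
            to_visit.append(link.attrs['href'])

Note that this sketch passes full URLs straight to urlopen; the format call in the original code would prepend the base URL a second time to links that already start with it.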

Michael Strand
10,897 Points

I did try that, and it just returned errors. When I follow the video, which uses find but looks for the .html extension, it returns multiple results.

Steven Parker
177,727 Points

I'm confused. I don't have experience with it myself, but unless I'm reading the docs wrong, they seem to state explicitly that "find" only returns one item.
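
For reference, here's a quick sketch of the difference using an inline HTML string, so it runs without a live site:

from bs4 import BeautifulSoup

html = '''
<a href="https://www.website.com/about-us">About</a>
<a href="https://www.website.com/contact">Contact</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# find returns only the first matching Tag (or None if nothing matches)
print(soup.find('a'))

# find_all returns a ResultSet (a list) of every match
print(len(soup.find_all('a')))  # prints 2

This may also explain how the video appears to get multiple results from find: if find is called once per fetched page inside a crawl loop, each iteration yields one link, so several links are printed over the whole run.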

Michael Strand
10,897 Points

Yes, I was confused as well. I don't quite understand how that worked either.