
Michael Strand
10,897 Points

Finding Internal Links Without .html

Hi, how would you go about finding internal links that have a URL structure like https://www.website.com/about-us?

Michael Strand
10,897 Points

This is my starting point, but it only returns one URL.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

site_links = []


def internal_links(linkURL):
    html = urlopen('https://www.website.com/{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')

    return soup.find('a', href=re.compile('(^https://www.website.com/)'))


if __name__ == '__main__':
    urls = internal_links('/')
    while len(urls) > 0:
        page = urls.attrs['href']

        if page not in site_links:
            site_links.append(page)

            print(page)
            print('\n===========\n')
            urls = internal_links(page)

        else:
            break

2 Answers

Steven Parker
177,727 Points

According to the documentation, "find" always returns just one result. To get a list of all matching items, use "find_all" instead.
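
For illustration, here's a minimal sketch of what a find_all version could look like, assuming the same placeholder site (https://www.website.com) and a to-visit list so every link on each page gets followed rather than just the first. The names to_visit and BASE are hypothetical, not from the course:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Placeholder site, as in the original question
BASE = re.compile(r'^https://www\.website\.com/')

site_links = []


def internal_links(link_url):
    html = urlopen(link_url)
    soup = BeautifulSoup(html, 'html.parser')

    # find_all returns a list of every matching <a> tag;
    # find would return only the first one
    return soup.find_all('a', href=BASE)


if __name__ == '__main__':
    to_visit = ['https://www.website.com/']
    while to_visit:
        page = to_visit.pop()
        if page in site_links:
            continue  # skip pages already crawled
        site_links.append(page)
        print(page)

        for link in internal_links(page):
            to_visit.append(link.attrs['href'])

Note that this sketch passes full URLs straight to urlopen; the format call in the original code would prepend the base URL a second time to links that already start with it.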

Michael Strand
10,897 Points

I did try that, and it just returned errors. When I follow the video, which uses find but looks for the .html extension, it returns multiple results.

Steven Parker
177,727 Points

I'm confused. I don't have experience with it myself, but unless I'm reading the docs wrong, they seem to state explicitly that "find" only returns one item.
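
For reference, here's a quick sketch of the difference using an inline HTML string, so it runs without a live site:

from bs4 import BeautifulSoup

html = '''
<a href="https://www.website.com/about-us">About</a>
<a href="https://www.website.com/contact">Contact</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# find returns only the first matching Tag (or None if nothing matches)
print(soup.find('a'))

# find_all returns a ResultSet (a list) of every match
print(len(soup.find_all('a')))  # prints 2

This may also explain how the video appears to get multiple results from find: if find is called once per fetched page inside a crawl loop, each iteration yields one link, so several links are printed over the whole run.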

Michael Strand
10,897 Points

Yes, I was confused as well. I don't quite understand how that worked either.