Extracting Unique Hyperlinks from HTML Content

2024-12-07 15:57:44

Code introduction

This function extracts the unique hyperlinks from the given HTML content by collecting the value of a specified attribute (href by default) from every occurrence of a specified tag (a by default). Links that match any entry in the optional exclude_domains list are skipped.

Technology Stack : Beautiful Soup, regular expressions (re)

Code Type : HTML parsing and link extraction

Code Difficulty : Intermediate

def find_random_links(html_content, tag='a', attribute='href', exclude_domains=None):
    """
    Extract the unique values of `attribute` from every `tag` element in `html_content`.

    Links containing any string listed in `exclude_domains` are skipped.
    Returns a sorted list so the output order is deterministic.
    """
    from bs4 import BeautifulSoup
    import re

    soup = BeautifulSoup(html_content, 'html.parser')
    links = set()

    # Escape each domain so characters such as '.' are matched literally,
    # then combine them into one pattern that is compiled only once
    exclude_pattern = None
    if exclude_domains:
        exclude_pattern = re.compile('|'.join(re.escape(domain) for domain in exclude_domains))

    # Find all tags of the specified type
    for tag_instance in soup.find_all(tag):
        # Get the attribute value (None if the tag lacks the attribute)
        href = tag_instance.get(attribute)
        if href:
            # Skip links that contain any excluded domain
            if exclude_pattern and exclude_pattern.search(href):
                continue
            links.add(href)

    return sorted(links)
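
The exclusion check in the function above looks for each entry anywhere in the link text, so an entry such as example.com also excludes a link like https://other.org/?ref=example.com, where the domain only appears in the query string. If stricter matching is wanted, one possible approach is to compare the parsed hostname instead. The following is a minimal sketch of that idea using only the standard library. The helper name is_excluded is hypothetical and is not part of the original code, and treating subdomains as excluded is an assumption about the intended behavior.

```python
from urllib.parse import urlparse


def is_excluded(url, exclude_domains):
    """Return True if the host of `url` is an excluded domain or one of its subdomains.

    Hypothetical helper for illustration; not part of the original function.
    """
    # urlparse only fills in the hostname for absolute or scheme-relative URLs;
    # relative links such as "/about" have no hostname and are never excluded
    host = urlparse(url).hostname
    if not host:
        return False
    for domain in exclude_domains or []:
        domain = domain.lower().lstrip('.')
        # Match the domain itself or any subdomain of it
        if host == domain or host.endswith('.' + domain):
            return True
    return False


links = [
    "https://example.com/page",
    "https://blog.example.com/post",
    "https://notexample.com/",
    "https://other.org/?ref=example.com",
    "/about",
]
kept = [link for link in links if not is_excluded(link, ["example.com"])]
print(kept)
```

In this sketch the first two links are dropped because their host is example.com or a subdomain of it, while the remaining three are kept. Whether relative links should be kept, resolved against a base URL, or discarded depends on the use case and is left open here.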