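This function extracts all unique hyperlinks from the given HTML content based on a specified tag and attribute. It skips any link that matches an entry in the exclude_domains list; each entry is applied to the link as a regular expression via re.search, so it can match anywhere in the link, not only in the domain part.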
Technology Stack : Beautiful Soup, regular expressions (re)
Code Type : HTML parsing and link extraction
Code Difficulty : Intermediate
import re

from bs4 import BeautifulSoup


def find_random_links(html_content, tag='a', attribute='href', exclude_domains=None):
    """
    Extract all unique hyperlinks from the given HTML content based on a specified tag and attribute.

    Links that match any entry in exclude_domains are skipped. Each entry is applied
    as a regular expression with re.search, so it can match anywhere in the link.
    """
    soup = BeautifulSoup(html_content, 'html.parser')
    links = set()
    # Find all tags of the specified type
    for tag_instance in soup.find_all(tag):
        # Get the attribute value
        href = tag_instance.get(attribute)
        if href:
            # Exclude links that match any pattern in exclude_domains
            if exclude_domains and any(re.search(domain, href) for domain in exclude_domains):
                continue
            links.add(href)
    # Sets are unordered, so the order of the returned list is arbitrary
    return list(links)
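
Because the function above matches each exclude_domains entry with re.search against the whole link, an entry such as "example.com" also matches "examplexcom.net" (the dot is a regex wildcard) and any link that merely mentions the domain in its path or query string. If stricter, hostname-only filtering is wanted, one option is to post-filter the returned list. The sketch below is only an illustration under that assumption: the helper names is_excluded and filter_by_hostname are hypothetical and not part of the original code, and it uses only urllib.parse from the standard library.

```python
from urllib.parse import urlparse


def is_excluded(url, exclude_domains):
    """Return True if the URL's hostname equals, or is a subdomain of, an excluded domain."""
    try:
        host = (urlparse(url).hostname or "").lower()
    except ValueError:
        # Malformed URLs (for example a broken IPv6 literal) are treated as not excluded
        return False
    for domain in exclude_domains or []:
        domain = domain.lower()
        if host == domain or host.endswith("." + domain):
            return True
    return False


def filter_by_hostname(links, exclude_domains=None):
    """Keep only the links whose hostname is not covered by exclude_domains (hypothetical helper)."""
    return sorted(link for link in links if not is_excluded(link, exclude_domains))


# Sample links standing in for the output of find_random_links
sample_links = [
    "https://example.com/page",            # excluded: exact hostname match
    "https://docs.example.com/guide",      # excluded: subdomain of example.com
    "https://other.org/?ref=example.com",  # kept: domain appears only in the query string
    "https://examplexcom.net/",            # kept: a different hostname
    "/relative/path",                      # kept: relative links have no hostname
]

filtered = filter_by_hostname(sample_links, exclude_domains=["example.com"])
print(filtered)
```

Relative links carry no hostname, so this sketch always keeps them; whether that is the desired behavior depends on how the links are used afterwards. The result is sorted only to make the output order predictable.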