Extract Links from HTML with BeautifulSoup

2024-12-16 12:16:28

Code introduction

This function uses the BeautifulSoup library to extract links from HTML content. It accepts the HTML content, the tag to search for (default 'a'), and an optional class name, and returns a list of the href values found on the matching elements.

Technology Stack : BeautifulSoup

Code Type : Function

Code Difficulty : Intermediate

                
                    
from bs4 import BeautifulSoup, SoupStrainer


def extract_links_from_html(html_content, tag='a', class_name=None):
    """
    Extracts the href values of matching tags from HTML content using BeautifulSoup.

    :param html_content: The HTML content from which to extract links.
    :param tag: The tag to search for (default is 'a').
    :param class_name: The class name to filter the links by (optional).
    :return: A list of href values, in document order.
    """
    # Parse only elements of the specified tag; SoupStrainer skips the rest of the document
    soup = BeautifulSoup(html_content, 'html.parser', parse_only=SoupStrainer(tag))

    # Match only the specified tag, and only elements that carry an href attribute
    filters = {'href': True}
    if class_name:
        # If a class name is provided, also filter the links by this class name
        filters['class_'] = class_name
    links = soup.find_all(tag, **filters)

    # Return the href attribute of each matching element
    return [link['href'] for link in links]
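
Usage example

The following is a minimal sketch of how the function might be called. The sample markup and the variable names (sample_html, all_links, external_links, stylesheet_links) are invented for illustration, and the function is restated in compact form only so that the snippet runs on its own.

```python
from bs4 import BeautifulSoup, SoupStrainer


# Compact restatement of the function above so this snippet is self-contained.
def extract_links_from_html(html_content, tag='a', class_name=None):
    soup = BeautifulSoup(html_content, 'html.parser', parse_only=SoupStrainer(tag))
    filters = {'href': True}
    if class_name:
        filters['class_'] = class_name
    return [link['href'] for link in soup.find_all(tag, **filters)]


# Hypothetical sample markup, made up for this example.
sample_html = """
<html><body>
  <a href="https://example.com/">Home</a>
  <a class="external" href="https://example.org/docs">Docs</a>
  <a name="top">Anchor without href</a>
  <link rel="stylesheet" href="/static/site.css">
</body></html>
"""

# Default call: every <a> element that has an href
all_links = extract_links_from_html(sample_html)

# Only <a> elements carrying the class "external"
external_links = extract_links_from_html(sample_html, class_name='external')

# A different tag: <link> elements also carry href attributes
stylesheet_links = extract_links_from_html(sample_html, tag='link')

print(all_links)         # ['https://example.com/', 'https://example.org/docs']
print(external_links)    # ['https://example.org/docs']
print(stylesheet_links)  # ['/static/site.css']
```

The href values come back exactly as written in the markup, so relative paths such as /static/site.css are not resolved against a base URL.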

Tags: BeautifulSoup