Scrapy Response Link Extraction Function


Code introduction


This custom function extracts all valid links from a Scrapy response object. It defines a helper that checks whether a URL has both a scheme and a network location, uses XPath to collect every href in the response, and returns only the links that pass that check.


Technology Stack: Scrapy, urllib.parse

Code Type: Scrapy custom function

Code Difficulty: Intermediate


def extract_links_from_response(response):
    """Extract all absolute, well-formed links from a Scrapy response object."""
    from urllib.parse import urlparse

    def is_valid_link(url):
        # A link counts as valid only if it has both a scheme and a network
        # location, so relative hrefs (e.g. "/about") are excluded
        parsed_url = urlparse(url)
        return all([parsed_url.scheme, parsed_url.netloc])

    # Collect every href attribute of <a> tags in the response
    links = response.xpath('//a/@href').getall()

    # Keep only the links that pass the validity check
    valid_links = filter(is_valid_link, links)

    return list(valid_links)
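
As a quick usage sketch, the function can be called from a spider's parse callback. The spider name, start URL, and yielded item shape below are hypothetical and shown only to illustrate the call site; note that relative hrefs are filtered out by the helper, so resolve them with response.urljoin first if you need to keep them.

import scrapy


class LinkSpider(scrapy.Spider):
    # Hypothetical spider name and start URL, for illustration only
    name = "link_spider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Collect absolute, well-formed links using the helper above
        for link in extract_links_from_response(response):
            yield {"link": link}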