Scrapy-based HTML Parsing and Crawler Initialization

Code introduction


This function uses the Scrapy library to parse a simulated HTML response with a Selector and extract text from it, then starts a CrawlerProcess to run a spider.


Technology Stack: Scrapy, Selector, HtmlResponse, CrawlerProcess

Code Type: Scrapy custom function

Code Difficulty: Intermediate


def random_scrapy_function(arg1, arg2, arg3):
    import scrapy
    from scrapy import Selector
    from scrapy.crawler import CrawlerProcess
    from scrapy.http import HtmlResponse

    # Create a sample HTML response (HtmlResponse expects the body as str or bytes)
    sample_html = '<html><head><title>Test Page</title></head><body><p>Hello, Scrapy!</p></body></html>'
    response = HtmlResponse(url='http://example.com', body=sample_html, encoding='utf-8')

    # Use Selector to extract text from the HTML response
    selector = Selector(response=response)
    text = selector.xpath('//p/text()').get()

    # Define a minimal spider; the original snippet referenced an undefined MySpider
    class MySpider(scrapy.Spider):
        name = 'my_spider'
        start_urls = ['http://example.com']

        def parse(self, response):
            yield {'title': response.xpath('//title/text()').get()}

    # Run the spider with a CrawlerProcess; note that start() blocks
    # until the crawl finishes, so the function only returns afterwards
    process = CrawlerProcess(settings={
        'USER_AGENT': 'Scrapy/1.0 (+http://www.scrapy.org)'
    })
    process.crawl(MySpider)
    process.start()

    return text, process