HTML to JSON Text Extractor

Code introduction


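This function takes HTML content as input, parses it into an element tree with the lxml library's HTMLParser, collects the text of every element, and returns the result as a JSON string of the form {"texts": [...]}. Only each element's leading text (element.text) is collected; text that follows a child element (element.tail) is not included.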


Technology Stack : lxml, etree, HTMLParser, json

Code Type : Function

Code Difficulty : Intermediate


                
                    
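import json

from lxml import etree


def parse_html_to_json(html_content, encoding='utf-8'):
    # Normalize str input to bytes so the `encoding` argument is actually used
    if isinstance(html_content, str):
        html_content = html_content.encode(encoding)

    # Parse the HTML into an element tree (HTMLParser tolerates malformed markup)
    parser = etree.HTMLParser(encoding=encoding)
    tree = etree.fromstring(html_content, parser)

    # Collect each element's leading text, skipping comments, processing
    # instructions, and elements whose text is missing or whitespace-only
    texts = [
        element.text.strip()
        for element in tree.iter()
        if isinstance(element.tag, str) and element.text and element.text.strip()
    ]

    # Serialize the collected texts as a JSON string
    return json.dumps({'texts': texts})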
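
One limitation of reading element.text is that it only holds the text before an element's first child, so text that trails a child element (stored in element.tail) is dropped. The sketch below is one possible variant, not part of the original snippet: extract_all_text is a hypothetical helper name, and the sample markup is made up for illustration. It uses lxml's itertext() to pick up tail text as well.

```python
import json

from lxml import etree


def extract_all_text(html_content):
    """Hypothetical helper: collect every text chunk, including tail text."""
    # remove_comments=True asks the parser to drop HTML comments up front
    parser = etree.HTMLParser(remove_comments=True)
    tree = etree.fromstring(html_content, parser)
    # itertext() walks the tree in document order, yielding .text and .tail strings
    texts = [chunk.strip() for chunk in tree.itertext() if chunk.strip()]
    return json.dumps({'texts': texts})


# Made-up sample: " world" follows the <b> element, so it lives in b.tail
sample = "<html><body><p>Hello <b>bold</b> world</p></body></html>"
root = etree.fromstring(sample, etree.HTMLParser())

# Reading .text alone finds "Hello " and "bold" but never " world"
text_only = [el.text for el in root.iter()]
print(text_only)

result = json.loads(extract_all_text(sample))
print(result['texts'])  # → ['Hello', 'bold', 'world']
```

Whether tail text should be included depends on the use case; note that itertext() also yields the contents of script and style elements, which may need filtering for real pages.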