You can download this code by clicking the button below.
This code is now available for download.
This function takes HTML content as input, parses it into an XML tree using the lxml library, then extracts text from all elements, and converts these texts into a JSON string.
Technology Stack : lxml, etree, HTMLParser, json
Code Type : Function
Code Difficulty : Intermediate
def parse_html_to_json(html_content, encoding='utf-8'):
from lxml import etree
# Parse HTML content to XML
parser = etree.HTMLParser()
tree = etree.fromstring(html_content, parser)
# Extract text from all elements
texts = [element.text for element in tree.iter()]
# Convert text list to JSON string
import json
json_data = json.dumps({'texts': texts})
return json_data