Text Analysis with Tokenization, Stop Words Removal, and Lemmatization

Code introduction


This function takes a text string and a language name, then uses the nltk library to tokenize the text, remove stop words, and lemmatize the remaining tokens.


Technology Stack : The nltk library, including word_tokenize, stopwords, and WordNetLemmatizer.

Code Type : Function

Code Difficulty : Intermediate


import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Required NLTK data packages (download once if missing):
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')

def analyze_text(text, language='english'):
    # Tokenize the text into words
    tokens = word_tokenize(text, language=language)

    # Remove stopwords. The NLTK stopword lists are lowercase, so compare
    # against the lowercased token; also skip punctuation and other
    # non-alphabetic tokens.
    stop_words = set(stopwords.words(language))
    filtered_tokens = [word for word in tokens
                       if word.isalpha() and word.lower() not in stop_words]

    # Lemmatize the remaining words
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

    return lemmatized_tokens
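The stop-word removal step is a plain list comprehension, and it has one common pitfall: NLTK's stopword lists are all lowercase, so a capitalized token like "The" slips through unless the comparison is lowercased. The sketch below illustrates this with a small hard-coded stopword set standing in for `stopwords.words('english')` (which holds roughly 180 words), so it runs without any NLTK downloads:

```python
# Miniature stand-in for nltk's stopwords.words('english'),
# which is a list of lowercase words.
stop_words = {'the', 'is', 'a', 'of'}

tokens = ['The', 'cat', 'is', 'on', 'a', 'mat']

# Case-sensitive filtering, as in the function above: 'The' survives
# because it does not literally match the lowercase entry 'the'.
filtered = [w for w in tokens if w not in stop_words]

# Case-insensitive filtering: lowercase each token before the lookup,
# so 'The' is removed as well.
filtered_ci = [w for w in tokens if w.lower() not in stop_words]
```

With these inputs, `filtered` keeps `'The'` while `filtered_ci` drops it; only the case-insensitive variant fully removes English stop words from mixed-case text.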