Random Lemmatization with Language Selection


Code introduction


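This function takes a string of text and a language parameter, tokenizes the text, drops non-alphabetic tokens and the stop words for that language, and lemmatizes each remaining word to reduce it to its base form. The lemmatizer is drawn at random from a pool of three entries labeled English, Spanish, and French. All three are NLTK's WordNetLemmatizer, which handles English only, so the language parameter affects stop-word removal but not the lemmatization itself.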


Technology Stack : nltk.tokenize.word_tokenize, nltk.corpus.stopwords, nltk.stem.WordNetLemmatizer

Code Type : Text processing function

Code Difficulty : Intermediate


import random
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def random_lemmatization(text, language='english'):
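    """
    Tokenize `text`, drop non-alphabetic tokens and the stop words for
    `language`, and return the lemmatized words joined by spaces.

    The lemmatizer is picked at random from the pool below, but every entry
    is a WordNetLemmatizer, which only covers English, so the pick does not
    change the result. `language` selects the tokenizer model and the
    stop-word list only.

    Requires the NLTK data packages 'punkt' ('punkt_tab' on recent NLTK
    releases), 'stopwords', and 'wordnet'.
    """
    lemmatizers = {
        'english': WordNetLemmatizer(),
        'spanish': WordNetLemmatizer(),
        'french': WordNetLemmatizer()
    }
    # Random pick over the whole pool; the `language` key is not consulted here.
    lemmatizer = random.choice(list(lemmatizers.values()))

    words = word_tokenize(text, language=language)
    stop_words = set(stopwords.words(language))
    # NLTK stop-word lists are lowercase, so compare the lowercased token.
    lemmatized_words = [
        lemmatizer.lemmatize(word)
        for word in words
        if word.isalpha() and word.lower() not in stop_words
    ]

    return ' '.join(lemmatized_words)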
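
The pool in random_lemmatization is sampled without looking at the language argument. If the random pick is meant to depend on the language, one option is to keep a list of candidates per language and choose only among those. The following is a minimal sketch under that assumption, not part of the original snippet: LEMMATIZER_POOL and pick_lemmatizer are hypothetical names, and the registered callables are crude placeholders standing in for real lemmatizers (for example WordNetLemmatizer().lemmatize for English).

```python
import random

# Hypothetical registry: language -> list of callables mapping a word to a lemma.
# The entries are placeholders for illustration, not real lemmatizers.
LEMMATIZER_POOL = {
    'english': [str.lower, lambda word: word.lower().rstrip('s')],
    'spanish': [str.lower],
}

def pick_lemmatizer(language, pool=LEMMATIZER_POOL, rng=random):
    """Return one randomly chosen callable registered for `language`."""
    try:
        candidates = pool[language]
    except KeyError:
        # Fail loudly instead of silently applying a lemmatizer for another language.
        raise ValueError(f"no lemmatizer registered for {language!r}") from None
    return rng.choice(candidates)

if __name__ == '__main__':
    lemmatize = pick_lemmatizer('spanish')
    print(lemmatize('Gatos'))
```

Passing a seeded random.Random instance as rng makes the pick reproducible, which helps when testing. Raising on an unregistered language is a design choice; falling back to a default lemmatizer would hide the fact that the text is being processed with the wrong language model.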