Fairseq-based Machine Translation Function

Code introduction


This function implements a simple machine translation feature using the Fairseq library, translating a sentence from a source language into a target language. It loads a pre-trained model from a checkpoint directory, runs beam-search translation on the input sentence, and returns the translated sentence.


Technology Stack : Fairseq, PyTorch, Transformer

Code Type : Function

Code Difficulty : Advanced


def translate_sentence(src_lang, tgt_lang, sentence, model_path, beam_size=5):
    """
    Translates a given sentence from a source language to a target language using a Fairseq model.

    Args:
        src_lang (str): Source language code (e.g. "en").
        tgt_lang (str): Target language code (e.g. "fr").
        sentence (str): The sentence to translate.
        model_path (str): Path to the directory containing the Fairseq checkpoint
            (e.g. a "checkpoint_best.pt" file plus the source/target dictionaries).
        beam_size (int): Beam size used during decoding.

    Returns:
        str: The translated sentence.
    """
    import torch
    from fairseq.models.transformer import TransformerModel

    # Load the pre-trained model. from_pretrained() returns a hub interface
    # that bundles the model with its dictionaries and preprocessing settings;
    # the language codes are passed through as task argument overrides.
    model = TransformerModel.from_pretrained(
        model_path,
        checkpoint_file="checkpoint_best.pt",
        source_lang=src_lang,
        target_lang=tgt_lang,
    )
    model.eval()

    # translate() handles tokenization, beam-search decoding, and
    # detokenization in a single call.
    with torch.no_grad():
        translated_sentence = model.translate(sentence, beam=beam_size)

    return translated_sentence
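
A hypothetical invocation might look like the sketch below. The checkpoint directory, language codes, and input sentence are placeholders; the call itself is commented out because it requires Fairseq and a trained checkpoint on disk:

```python
# Placeholder configuration; point model_path at a real Fairseq checkpoint
# directory (containing checkpoint_best.pt and the dictionaries) before running.
src_lang, tgt_lang = "en", "fr"
model_path = "checkpoints/wmt14.en-fr"
sentence = "Hello, how are you?"

# Requires fairseq and a trained model, so it is left commented out here:
# translated = translate_sentence(src_lang, tgt_lang, sentence, model_path, beam_size=5)
# print(translated)
```

Because the returned hub interface caches the model in memory, repeated calls with the same `model_path` would reload the checkpoint each time; for batch translation it is cheaper to load the model once and call its `translate()` method per sentence.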