SabiYarn: Optimizing LLM Pre-training for Low-Resource Languages
Large Language Models (LLMs) have seen significant advancements in recent years, primarily through scaling up model size and training data. This approach is highly resource-intensive, often costing millions of dollars and posing a substantial barrier to the inclusion of low-resource languages, which frequently lack both data and funding for computational resources.
A new paper, “SabiYarn: Advancing Low-Resource Languages with Multi-task NLP Pre-Training,” accepted at the AfricaNLP workshop at ACL 2025, introduces a series of optimization methods for LLM pre-training. These innovations enabled the training of a state-of-the-art multilingual foundation model for Nigerian languages on a single 24 GB GPU. A key technique is a mask-based loss computation strategy that avoids calculating loss on input prompt tokens the model already knows. This ensures the loss function reflects the model’s performance on the tokens that matter, preventing wasted computation on backpropagating irrelevant losses. This article delves into this compute-aware pre-training design and its impact on model performance.
The High Cost of Prompt Tokens in Low-Resource Environments
During pre-training, LLMs are typically trained through a causal language modeling task, predicting the next token in a sequence. This is a computationally demanding process involving trillions of tokens, with the goal of minimizing the cross-entropy loss between predicted and actual tokens through backpropagation. Over this extensive training, models acquire various skills, memorize facts, and build a comprehensive world model.
For cutting-edge models like Meta’s Llama 4 or OpenAI’s GPT-4, this process can involve thousands of GPUs running for months, performing over 10^25 floating-point operations (FLOPs). Consider a translation example: given the sequence “Translate English to Yoruba: I love rice. => Mo fẹ́ràn ìrẹsì,” a standard LLM is trained to predict every token, from the initial prompt (“Translate English to Yoruba:”) to the actual answer (“Mo fẹ́ràn ìrẹsì”). While straightforward to implement, this approach treats all tokens equally, meaning significant computation is spent on learning to predict tokens that are static or already known as part of the prompt. While acceptable in environments with virtually unlimited compute, this becomes problematic under resource constraints. If half the input sequence is an unchanging instruction, half the training compute is potentially wasted on redundant learning.
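To make this concrete, here is a minimal PyTorch sketch of the standard causal language modeling loss. The token ids and logits are illustrative stand-ins (random values rather than the output of a real tokenizer or model); the point is that every position, prompt and answer alike, carries equal weight in the cross-entropy.

```python
import torch
import torch.nn.functional as F

# Illustrative token ids for "Translate English to Yoruba: I love rice. => Mo fẹ́ràn ìrẹsì"
# (placeholder values, not the output of a real tokenizer)
input_ids = torch.tensor([[11, 42, 7, 93, 5, 68, 21, 3, 77, 54, 19, 88]])
vocab_size = 128

# Stand-in for the logits a causal LM would produce: shape (batch, seq_len, vocab_size).
logits = torch.randn(1, input_ids.size(-1), vocab_size, requires_grad=True)

# Standard objective: each position predicts the next token.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = input_ids[:, 1:].reshape(-1)

# Every token is weighted equally, so the static prompt
# ("Translate English to Yoruba:") consumes gradient signal too.
loss = F.cross_entropy(shift_logits, shift_labels)
loss.backward()
```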
Integrating Task Awareness into Pre-training
Due to severe computational limitations, the SabiYarn project could not incorporate a separate post-training stage, where models are typically aligned with user-facing goals using supervised examples and reinforcement learning from human feedback (RLHF). Such post-training stages are crucial for models to generate helpful and aligned responses, for instance, responding to “How are you today?” with “I’m doing good” instead of merely completing the sequence with a question mark.
To compensate for the absence of post-training, the SabiYarn team embedded task awareness directly into the pre-training phase. Their objective was to enable the model to generalize beyond basic next-token prediction towards solving specific tasks like named-entity recognition, sentiment analysis, and translation, entirely through prompt-based conditioning. Inspired by the T5 paper, they designed a task-specific training scheme using XML-like prompt tags. For example, an English-to-Pidgin translation task would be formatted as “<translate> let me call my father </translate>: Make I go call my Papa”.
With this structured format, a critical innovation was to calculate the cross-entropy loss only on the label tokens (“Make I go call my Papa”). This was implemented in PyTorch by masking out the prompt tokens in the label tensor using an ignore index (-100), which PyTorch’s cross_entropy loss function skips by default.
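The sketch below shows one common way to wire this up in PyTorch. The token ids, vocabulary size, and stand-in logits are assumptions made for illustration, not code from the SabiYarn repository.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # the default ignore_index of F.cross_entropy

def build_inputs_and_labels(prompt_ids, target_ids):
    """Concatenate prompt and target, masking the prompt portion of the labels."""
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1)
    labels = input_ids.clone()
    labels[..., : prompt_ids.size(-1)] = IGNORE_INDEX  # no loss on prompt tokens
    return input_ids, labels

# Illustrative ids for "<translate> let me call my father </translate>:" / "Make I go call my Papa"
prompt_ids = torch.tensor([[5, 17, 23, 8, 41, 9, 6, 12]])
target_ids = torch.tensor([[33, 14, 27, 8, 14, 52]])
input_ids, labels = build_inputs_and_labels(prompt_ids, target_ids)

# Stand-in for a causal LM's output logits, shape (batch, seq_len, vocab_size).
vocab_size = 128
logits = torch.randn(1, input_ids.size(-1), vocab_size, requires_grad=True)

# Shift so position t predicts token t+1, then skip masked (prompt) positions.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=IGNORE_INDEX,  # prompt positions contribute neither loss nor gradients
)
loss.backward()
```

Because the masked positions are excluded from the loss, no gradient flows back from the prompt tokens; the prompt still shapes the forward pass through attention, but only the label tokens drive learning.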
Focused Learning: Only What Matters
An unexpected benefit of this masking approach is improved task focus. Because the model does not backpropagate on the input portion of the sequence, its learning signal originates exclusively from task-relevant tokens. In a typical pre-training scenario where loss is computed on every token, the model learns to reproduce the prompt structure and task tags alongside generating outputs, diluting the learning signal across the entire sequence.
Conversely, with loss masking, the model still processes input-output connections through its self-attention mechanism during the forward pass. However, the crucial learning process (backpropagation) occurs only when predicting the output tokens. This can be likened to how humans learn a new language: we receive the full input as context, but our learning occurs when we are corrected on our translation, not on the input sentence itself. By forcing the model to treat prompts as context rather than a prediction target, this method directs training towards input-output mappings and reduces the tendency to overfit on prompt formatting.
Impact on Training Performance
To validate these findings, the researchers conducted an experiment training a model on a complex sentence descrambling task, comparing masked loss with non-masked loss. The task involved transforming grammatically incoherent sentences into coherent forms using the same words, for instance, correcting “The equations expensive. show is optimization computationally that.” to “The equations show that optimization is computationally expensive.” The results showed that the model converged significantly faster when loss on the input prompt was not calculated, and these efficiency gains compound over the entire training run.
Trade-offs of Masking
While masking prompt tokens for loss computation conserves compute and sharpens focus, it does present trade-offs. Excluding prompts from the learning signal increases the risk that the model may not adapt well if the prompt structure or phrasing changes during inference. However, such trade-offs must be weighed against the realities of resource constraints. In low-resource training scenarios, approaches that reduce compute while preserving core task performance are often more practical than fully supervised, resource-intensive alternatives.
The Case for Native African Language LLMs
While much of the African LLM community has focused on adapting open-source pre-trained models, training a foundational model from scratch, as done in SabiYarn, offers distinct advantages. This approach allows for the creation of models that do not inherit the cultural biases embedded in Euro-American corpora. Furthermore, it provides invaluable research insights and data regarding tokenization, transfer learning, linguistic patterns, and training dynamics specifically for African languages.
A frequently overlooked aspect is the tokenizer, which dictates how languages are broken into tokens for LLM processing. Training custom, language-specific tokenizers enables the integration of unique morphological and phonological structures, such as tonal diacritics in Yoruba, which carry semantic meaning. This also enhances efficiency, as the tokenizer can break down each language into tokens that capture useful grammatical structures like affixes and punctuation, which the model can then leverage to build meaningful representations. In contrast, using existing tokenizers not trained on the target languages often leads to poor tokenization, inaccurate grammatical representation, inflated sequence lengths, and ultimately degraded performance, particularly for the smaller models that low-compute settings can support.
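As an illustration, a language-specific tokenizer along these lines could be trained with the Hugging Face tokenizers library. The corpus files, vocabulary size, and special tokens below are assumptions made for this sketch rather than details from the paper.

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
# NFC normalization gives diacritic-bearing characters (e.g. ẹ́, ọ̀) a consistent
# byte representation, so the trainer does not learn duplicate tokens for
# visually identical Yoruba text.
tokenizer.normalizer = normalizers.NFC()
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=32_000,  # assumed budget; the paper's actual vocabulary size may differ
    special_tokens=["<unk>", "<pad>", "<translate>", "</translate>"],  # keep task tags as single tokens
)

# Hypothetical monolingual corpora, one file per language.
tokenizer.train(files=["yoruba.txt", "hausa.txt", "igbo.txt", "pidgin.txt"], trainer=trainer)

print(tokenizer.encode("Mo fẹ́ràn ìrẹsì").tokens)  # inspect how Yoruba text is segmented
```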
Looking ahead, the SabiYarn research group plans to explore modern LLM architectures, incorporating reasoning, instruction following, and test-time compute strategies within resource-constrained pre-training. Their future work also includes hardware-specific optimizations for training and inference, and expanding their efforts to an even wider array of African languages.