We present Team Horizon's approach to BHASHA Shared Task 1: Indic Grammatical Error Correction (IndicGEC). We explore two multilingual transformer models, mT5-small and IndicBART, to correct grammatical and semantic errors across five Indian languages: Bangla, Hindi, Tamil, Telugu, and Malayalam. Because annotated data is limited, we develop a synthetic data augmentation pipeline that injects realistic linguistic errors from ten categories, simulating the natural mistakes found in Indic-script text. We demonstrate that linguistically grounded augmentation significantly improves grammatical correction accuracy in low-resource Indic languages.
What We Bring to the Table
- A linguistically informed synthetic error-injection framework for Indic GEC data augmentation, covering 10 error categories and 42 rules.
- Evaluation and comparison of two multilingual transformer paradigms: mT5-small (general multilingual) and IndicBART (Indic-specific).
- Empirical analysis of dataset scaling, training epochs, and their effects on generalization in low-resource settings.
- Insights into error-type distributions, cross-language transfer across Dravidian languages, and limitations of multilingual setups.
Hybrid Augmentation + Fine-Tuning Pipeline
We combine synthetic data augmentation with multilingual transformer fine-tuning. Clean sentences from the official task data, AI4Bharat IndicCorp v2, and Indic Wikipedia dumps are transformed into noisy versions via controlled error injection, and the resulting (noisy, clean) parallel pairs are used to fine-tune both models, as sketched below.
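As an illustration, the sketch below shows one way such pairs could be assembled. The `corrupt` hook, the column names, and the use of Hugging Face `datasets` are our own assumptions for exposition, not the task's released tooling:

```python
from datasets import Dataset

def build_parallel_dataset(clean_sentences, corrupt):
    """Pair each corrupted variant with its clean source sentence.

    clean_sentences: strings pooled from the task data, IndicCorp v2,
    and Wikipedia dumps. corrupt: hypothetical hook into the rule-based
    error injector (one possible version is sketched in the next section);
    it returns an iterable of noisy copies of the sentence.
    """
    inputs, targets = [], []
    for clean in clean_sentences:
        for noisy in corrupt(clean):
            inputs.append(noisy)    # model input: the corrupted sentence
            targets.append(clean)   # training target: the original sentence
    return Dataset.from_dict({"input_text": inputs, "target_text": targets})
```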
10 Linguistic Error Categories
Each clean sentence is transformed into up to 5 noisy variants. The number of errors injected per variant follows a realistic distribution: 60% of variants contain a single error, 25% contain two, 10% three, and 5% four or more.
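A minimal sketch of this sampling scheme, assuming a hypothetical `apply_one_error` hook that picks one of the 42 injection rules and applies it once:

```python
import random

# Error-count distribution stated above: 60% one error, 25% two,
# 10% three, 5% four or more (capped at four in this sketch).
ERROR_COUNTS = [1, 2, 3, 4]
COUNT_WEIGHTS = [0.60, 0.25, 0.10, 0.05]

def make_noisy_variants(sentence, apply_one_error, max_variants=5):
    """Generate up to `max_variants` corrupted copies of a clean sentence."""
    variants = []
    for _ in range(max_variants):
        n_errors = random.choices(ERROR_COUNTS, weights=COUNT_WEIGHTS, k=1)[0]
        noisy = sentence
        for _ in range(n_errors):
            noisy = apply_one_error(noisy)
        if noisy != sentence:          # drop no-op corruptions
            variants.append(noisy)
    return variants
```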
Data Statistics
Official BHASHA annotated splits are small — ranging from 91 to 599 training examples. Synthetic augmentation expands each language to roughly 10,000–12,000 high-quality parallel pairs.
Model Comparison
mT5-small outperforms IndicBART across all five languages in our setup, with particularly strong gains in Tamil and Malayalam. Both models are lightweight (≤300M parameters) and were trained under identical conditions for a fair comparison.
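A condensed sketch of the shared fine-tuning setup, using the Hugging Face checkpoints google/mt5-small and ai4bharat/IndicBART; the hyperparameter values shown are illustrative placeholders, not our exact configuration:

```python
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

def finetune(checkpoint, train_ds, eval_ds, output_dir):
    """Fine-tune one checkpoint; called with identical arguments for
    both models so the comparison differs only in pretrained weights."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    def tokenize(batch):
        enc = tokenizer(batch["input_text"], max_length=128, truncation=True)
        enc["labels"] = tokenizer(
            text_target=batch["target_text"], max_length=128, truncation=True
        )["input_ids"]
        return enc

    args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=10,               # see "Epoch Sensitivity" below
        learning_rate=3e-4,                # illustrative value
        per_device_train_batch_size=16,    # illustrative value
        evaluation_strategy="epoch",
        save_strategy="epoch",
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_ds.map(tokenize, batched=True),
        eval_dataset=eval_ds.map(tokenize, batched=True),
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()
    return trainer

# Identical conditions for both models:
# for ckpt in ("google/mt5-small", "ai4bharat/IndicBART"):
#     finetune(ckpt, train_ds, eval_ds, f"gec-{ckpt.split('/')[-1]}")
```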
Correction Performance by Error Type
Spelling and punctuation errors are corrected most reliably. Semantic and structural errors remain the hardest, reflecting the difficulty of capturing meaning-level nuance without richer context.
| Error Type | Missed |
|---|---|
| Spelling | 5% |
| Punctuation | 8% |
| Duplication | 10% |
| Grammar | 12% |
| Word Choice | 15% |
| Structural | 20% |
| Semantic | 22% |
Key Takeaways
Augmentation Matters
Training on larger augmented datasets improved GLEU by 4–5 points. Expanding from <1k to ~10k pairs per language was critical for robust performance.
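As a reference point, GLEU can be computed with NLTK's implementation; a minimal sketch, assuming simple whitespace tokenization (the shared task's official scorer may tokenize differently):

```python
from nltk.translate.gleu_score import corpus_gleu

def gleu_points(references, hypotheses):
    """Corpus-level GLEU over whitespace-tokenized sentences.

    references: one gold correction per sentence.
    hypotheses: the model's corrected outputs, in the same order.
    """
    refs = [[ref.split()] for ref in references]   # list of reference lists
    hyps = [hyp.split() for hyp in hypotheses]
    return 100 * corpus_gleu(refs, hyps)           # scaled to "points"
```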
Epoch Sensitivity
Performance plateaued at 8–10 training epochs. Overfitting was consistently observed beyond this range, making early stopping essential.
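Early stopping can be wired into the fine-tuning setup sketched earlier via transformers' EarlyStoppingCallback; the patience and metric below are illustrative choices, not our exact settings:

```python
from transformers import EarlyStoppingCallback, Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="gec-mt5-small",
    num_train_epochs=12,                  # generous cap; stopping fires earlier
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,          # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# Stop if eval loss fails to improve for two consecutive epochs:
early_stop = EarlyStoppingCallback(early_stopping_patience=2)
# trainer = Seq2SeqTrainer(..., args=args, callbacks=[early_stop])
```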
Cross-lingual Transfer
Notable cross-lingual transfer was observed among related Dravidian languages (Tamil, Telugu, Malayalam), suggesting that shared morphological structure benefits multilingual training.
Model Choice
mT5-small outperformed IndicBART in our experiments, likely because its parameters adapted more effectively to our specific augmentation scheme during fine-tuning.