BHASHA 2025 @ ACL · December 23, 2025

Multilingual IndicGEC with Transformer-based Grammatical Error Correction

Manav Dhamecha  ·  Gaurav Damor  ·  Sunil Choudhary  ·  Pruthwik Mishra

Sardar Vallabhbhai National Institute of Technology, Surat  ·  Team Horizon

GLEU Scores (mT5-small)

Tamil 86.03 · 5th rank
Malayalam 84.36 · 8th rank
Bangla 82.69 · 6th rank
Hindi 80.44 · 7th rank
Telugu 72.00 · 6th rank

We present Team Horizon's approach to BHASHA Shared Task 1: Indic Grammatical Error Correction (IndicGEC). We explore transformer-based multilingual models — mT5-small and IndicBART — to correct grammatical and semantic errors across five Indian languages: Bangla, Hindi, Tamil, Telugu, and Malayalam. Due to limited annotated data, we develop a synthetic data augmentation pipeline that introduces realistic linguistic errors under ten categories, simulating natural mistakes found in Indic scripts. We demonstrate that linguistically grounded augmentation significantly improves grammatical correction accuracy in low-resource Indic languages.

What We Bring to the Table

Hybrid Augmentation + Fine-Tuning Pipeline

We combine synthetic data augmentation with multilingual transformer fine-tuning. Clean sentences from official task data, AI4Bharat IndicCorp v2, and Indic Wikipedia dumps are transformed into noisy versions via controlled error injection, then used to fine-tune both models.

📄 Clean Corpora (BHASHA data + IndicCorp v2 + Wikipedia) → ⚙️ Error Injection (42 rules across 10 linguistic categories) → 📦 Augmented Pairs (10k–12k pairs per language) → 🤖 Fine-Tuning (mT5-small & IndicBART) → GEC Output (evaluated with GLEU)
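As a reference for the final evaluation step, here is a minimal sketch of sentence-level GLEU scoring using NLTK's implementation; the shared task's official scorer may differ in tokenization and averaging, so treat this as an approximation.

```python
# Minimal sketch: corpus-average sentence-level GLEU with NLTK, assuming
# whitespace tokenization. The official task scorer may differ.
from nltk.translate.gleu_score import sentence_gleu

def average_gleu(references, hypotheses):
    """Mean sentence-level GLEU (x100) over parallel lists of strings."""
    scores = [
        sentence_gleu([ref.split()], hyp.split())
        for ref, hyp in zip(references, hypotheses)
    ]
    return 100 * sum(scores) / len(scores)

# A perfect correction scores 100.
print(average_gleu(["मैं घर जाता हूँ ।"], ["मैं घर जाता हूँ ।"]))  # -> 100.0
```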

10 Linguistic Error Categories

Each clean sentence is transformed into up to 5 noisy variants. Error count follows a realistic distribution: 60% single-error, 25% two errors, 10% three errors, 5% four or more.

Spelling: non-dictionary & dictionary misspellings, mātrā swaps, homophone substitution
Tense: incorrect verb tense inflection
Person: subject–verb person disagreement
Number: singular/plural agreement errors
Gender: grammatical gender mismatches
Case: incorrect postposition / case markers
Parts-of-speech: wrong word class usage
Missing / Extra: omitted or duplicated tokens and postpositions
Punctuation: wrong or missing markers (।, ?, ,)
Semantic: semantically incorrect adverb or postposition choice
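To make the injection procedure concrete, below is a minimal sketch of the error-count sampling and category selection described above. The per-category corruption functions are hypothetical placeholders; the 42 hand-written rules themselves are not reproduced here.

```python
import random

# Sketch of the noise sampler. "4" stands in for the "four or more" bucket.
ERROR_COUNT_DIST = [(1, 0.60), (2, 0.25), (3, 0.10), (4, 0.05)]
CATEGORIES = [
    "spelling", "tense", "person", "number", "gender",
    "case", "pos", "missing_extra", "punctuation", "semantic",
]

def sample_error_count():
    """Draw how many errors to inject: the 60/25/10/5 split above."""
    counts, weights = zip(*ERROR_COUNT_DIST)
    return random.choices(counts, weights=weights, k=1)[0]

def inject_errors(sentence, corruptors):
    """Apply k randomly chosen category corruptors to a clean sentence.

    `corruptors` maps category name -> (str -> str) corruption function;
    these are placeholders for the paper's rule set.
    """
    noisy = sentence
    for cat in random.sample(CATEGORIES, k=sample_error_count()):
        noisy = corruptors[cat](noisy)
    return noisy

def make_variants(sentence, corruptors, n_variants=5):
    """Up to 5 (noisy, clean) pairs per clean sentence."""
    return [(inject_errors(sentence, corruptors), sentence)
            for _ in range(n_variants)]
```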

Data Statistics

Official BHASHA annotated splits are small — ranging from 91 to 599 training examples. Synthetic augmentation expands each language to roughly 10,000–12,000 high-quality parallel pairs.

Hindi: Train 599 · Dev 107 · Test 236
Bangla: Train 598 · Dev 101 · Test 330
Malayalam: Train 300 · Dev 50 · Test 102
Tamil: Train 91 · Dev 16 · Test 65
Telugu: Train 599 · Dev 100 · Test 310

Model Comparison

mT5-small outperforms IndicBART across all five languages in our setup, with particularly strong gains in Tamil and Malayalam. Both models are lightweight (≤300M parameters) and trained under identical conditions for fair comparison.

mT5-small (300M params · general multilingual · pre-trained on mC4):
Ta 86.03 · Ml 84.36 · Bn 82.69 · Hi 80.44 · Te 72.00

IndicBART (Indic-specific seq2seq · pre-trained on Indic corpora):
Ta 76.45 · Ml 74.84 · Bn 73.50 · Hi 72.33 · Te 66.10
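For readers who want to reproduce the comparison, a minimal fine-tuning sketch with Hugging Face transformers follows. The hyperparameters shown are illustrative assumptions rather than our exact configuration, and `train_ds` is a hypothetical dataset of (noisy, clean) pairs.

```python
# Sketch: fine-tuning on (noisy, clean) pairs with Hugging Face transformers.
# Hyperparameters are assumptions, not the reported configuration.
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "google/mt5-small"  # "ai4bharat/IndicBART" for the comparison
                                 # (IndicBART needs its own tokenizer settings)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def preprocess(batch):
    """Map noisy sentences (input) to clean sentences (target)."""
    enc = tokenizer(batch["noisy"], truncation=True, max_length=128)
    enc["labels"] = tokenizer(text_target=batch["clean"],
                              truncation=True, max_length=128)["input_ids"]
    return enc

args = Seq2SeqTrainingArguments(
    output_dir="indicgec-mt5-small",
    num_train_epochs=10,              # plateau observed at 8-10 epochs
    learning_rate=3e-4,               # assumed; common for T5-family models
    per_device_train_batch_size=16,   # assumed; not reported above
    predict_with_generate=True,
)

# `train_ds` is a hypothetical datasets.Dataset with "noisy"/"clean" columns.
# trainer = Seq2SeqTrainer(
#     model=model, args=args,
#     train_dataset=train_ds.map(preprocess, batched=True),
#     data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
# )
# trainer.train()
```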

Correction Performance by Error Type

Spelling and punctuation errors are corrected most reliably. Semantic and structural errors remain the hardest, reflecting the difficulty of capturing meaning-level nuance without richer context.

Error Type: Corrected / Missed
Spelling: 95% / 5%
Punctuation: 92% / 8%
Duplication: 90% / 10%
Grammar: 88% / 12%
Word Choice: 85% / 15%
Structural: 80% / 20%
Semantic: 78% / 22%
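Rates like those above can be computed by tagging each synthetic test pair with its injected category and scoring exact-match correction per tag. A minimal sketch, assuming tagged examples with `clean` and `category` fields (the field names are our own, for illustration):

```python
from collections import Counter

# Sketch: per-category correction rate, assuming each evaluation example
# carries the error category that was injected into it.
def correction_rates(examples, predictions):
    """examples: dicts with 'clean' and 'category'; predictions: model outputs."""
    hit, total = Counter(), Counter()
    for ex, pred in zip(examples, predictions):
        total[ex["category"]] += 1
        hit[ex["category"]] += int(pred.strip() == ex["clean"].strip())
    return {cat: hit[cat] / total[cat] for cat in total}
```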

Key Takeaways

📈 Augmentation Matters

Training on larger augmented datasets improved GLEU by 4–5 points. Expanding from <1k to ~10k pairs per language was critical for robust performance.

🔄 Epoch Sensitivity

Performance plateaued at 8–10 training epochs. Overfitting was consistently observed beyond this range, making early stopping essential.
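A minimal sketch of how such early stopping can be wired up with Hugging Face's `EarlyStoppingCallback`; the patience value and metric are assumptions, not our reported setup.

```python
# Sketch: early stopping around the 8-10 epoch plateau. Patience and metric
# are assumptions; pass `callbacks=[early_stop]` to the Seq2SeqTrainer.
from transformers import EarlyStoppingCallback, Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="indicgec-mt5-small",
    num_train_epochs=12,               # upper bound; stop earlier on plateau
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
early_stop = EarlyStoppingCallback(early_stopping_patience=2)
```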

🌐 Cross-lingual Transfer

Notable cross-lingual transfer was observed among related Dravidian languages (Tamil, Telugu, Malayalam), suggesting shared morphological structure benefits multilingual training.

⚖️ Model Choice

mT5-small outperformed IndicBART in our experiments, likely because its hyperparameters could be tuned more effectively in combination with our specific augmentation scheme.

Cite This Work

@inproceedings{dhamecha2025indicgec,
  title        = {Team Horizon at BHASHA Task 1: Multilingual IndicGEC with Transformer-based Grammatical Error Correction Models},
  author       = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
  booktitle    = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
  pages        = {142--146},
  year         = {2025},
  organization = {Association for Computational Linguistics}
}