We present Team Horizon's approach to BHASHA Shared Task 1: Indic Grammatical Error Correction (IndicGEC). We explore two multilingual transformer models, mT5-small and IndicBART, to correct grammatical and semantic errors across five Indian languages: Bangla, Hindi, Tamil, Telugu, and Malayalam. Because annotated data is limited, we develop a synthetic data augmentation pipeline that injects realistic linguistic errors from ten categories, simulating the natural mistakes found in Indic-script text. We demonstrate that linguistically grounded augmentation significantly improves grammatical correction accuracy in low-resource Indic languages.
What We Bring to the Table
- A linguistically informed synthetic error-injection framework for Indic GEC data augmentation, covering 10 error categories and 42 rules.
- Evaluation and comparison of two multilingual transformer paradigms: mT5-small (general multilingual) and IndicBART (Indic-specific).
- Empirical analysis of dataset scaling, training epochs, and their effects on generalization in low-resource settings.
- Insights into error-type distributions, cross-language transfer across Dravidian languages, and limitations of multilingual setups.
Hybrid Augmentation + Fine-Tuning Pipeline
We combine synthetic data augmentation with multilingual transformer fine-tuning. Clean sentences from the official task data, AI4Bharat IndicCorp v2, and Indic Wikipedia dumps are transformed into noisy versions via controlled error injection, and the resulting (noisy, clean) parallel pairs are used to fine-tune both models, as sketched below.
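As an illustration, the sketch below shows one way such pairs could be assembled. The `corrupt` hook, the column names, and the use of Hugging Face `datasets` are our own assumptions for exposition, not the task's released tooling:

```python
from datasets import Dataset

def build_parallel_dataset(clean_sentences, corrupt):
    """Pair each corrupted variant with its clean source sentence.

    clean_sentences: strings pooled from the task data, IndicCorp v2,
    and Wikipedia dumps. corrupt: hypothetical hook into the rule-based
    error injector (one possible version is sketched in the next section);
    it returns an iterable of noisy copies of the sentence.
    """
    inputs, targets = [], []
    for clean in clean_sentences:
        for noisy in corrupt(clean):
            inputs.append(noisy)    # model input: the corrupted sentence
            targets.append(clean)   # training target: the original sentence
    return Dataset.from_dict({"input_text": inputs, "target_text": targets})
```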
10 Linguistic Error Categories
Each clean sentence is transformed into up to 5 noisy variants. The number of errors injected per variant follows a realistic distribution: 60% of variants contain a single error, 25% contain two, 10% three, and 5% four or more.
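A minimal sketch of this sampling scheme, assuming a hypothetical `apply_one_error` hook that picks one of the 42 injection rules and applies it once:

```python
import random

# Error-count distribution stated above: 60% one error, 25% two,
# 10% three, 5% four or more (capped at four in this sketch).
ERROR_COUNTS = [1, 2, 3, 4]
COUNT_WEIGHTS = [0.60, 0.25, 0.10, 0.05]

def make_noisy_variants(sentence, apply_one_error, max_variants=5):
    """Generate up to `max_variants` corrupted copies of a clean sentence."""
    variants = []
    for _ in range(max_variants):
        n_errors = random.choices(ERROR_COUNTS, weights=COUNT_WEIGHTS, k=1)[0]
        noisy = sentence
        for _ in range(n_errors):
            noisy = apply_one_error(noisy)
        if noisy != sentence:          # drop no-op corruptions
            variants.append(noisy)
    return variants
```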
Data Statistics
Official BHASHA annotated splits are small — ranging from 91 to 599 training examples. Synthetic augmentation expands each language to roughly 10,000–12,000 high-quality parallel pairs.
Model Comparison
mT5-small outperforms IndicBART across all five languages in our setup, with particularly strong gains in Tamil and Malayalam. Both models are lightweight (≤300M parameters) and were trained under identical conditions for a fair comparison.
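A condensed sketch of the shared fine-tuning setup, using the Hugging Face checkpoints google/mt5-small and ai4bharat/IndicBART; the hyperparameter values shown are illustrative placeholders, not our exact configuration:

```python
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

def finetune(checkpoint, train_ds, eval_ds, output_dir):
    """Fine-tune one checkpoint; called with identical arguments for
    both models so the comparison differs only in pretrained weights."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    def tokenize(batch):
        enc = tokenizer(batch["input_text"], max_length=128, truncation=True)
        enc["labels"] = tokenizer(
            text_target=batch["target_text"], max_length=128, truncation=True
        )["input_ids"]
        return enc

    args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=10,               # see "Epoch Sensitivity" below
        learning_rate=3e-4,                # illustrative value
        per_device_train_batch_size=16,    # illustrative value
        evaluation_strategy="epoch",
        save_strategy="epoch",
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_ds.map(tokenize, batched=True),
        eval_dataset=eval_ds.map(tokenize, batched=True),
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()
    return trainer

# Identical conditions for both models:
# for ckpt in ("google/mt5-small", "ai4bharat/IndicBART"):
#     finetune(ckpt, train_ds, eval_ds, f"gec-{ckpt.split('/')[-1]}")
```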
Correction Performance by Error Type
Spelling and punctuation errors are corrected most reliably. Semantic and structural errors remain the hardest, reflecting the difficulty of capturing meaning-level nuance without richer context.
| Error Type | Missed |
|---|---|
| Spelling | 5% |
| Punctuation | 8% |
| Duplication | 10% |
| Grammar | 12% |
| Word Choice | 15% |
| Structural | 20% |
| Semantic | 22% |
Key Takeaways
Augmentation Matters
Training on larger augmented datasets improved GLEU by 4–5 points. Expanding from <1k to ~10k pairs per language was critical for robust performance.
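As a reference point, GLEU can be computed with NLTK's implementation; a minimal sketch, assuming simple whitespace tokenization (the shared task's official scorer may tokenize differently):

```python
from nltk.translate.gleu_score import corpus_gleu

def gleu_points(references, hypotheses):
    """Corpus-level GLEU over whitespace-tokenized sentences.

    references: one gold correction per sentence.
    hypotheses: the model's corrected outputs, in the same order.
    """
    refs = [[ref.split()] for ref in references]   # list of reference lists
    hyps = [hyp.split() for hyp in hypotheses]
    return 100 * corpus_gleu(refs, hyps)           # scaled to "points"
```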
Epoch Sensitivity
Performance plateaued at 8–10 training epochs. Overfitting was consistently observed beyond this range, making early stopping essential.
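Early stopping can be wired into the fine-tuning setup sketched earlier via transformers' EarlyStoppingCallback; the patience and metric below are illustrative choices, not our exact settings:

```python
from transformers import EarlyStoppingCallback, Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="gec-mt5-small",
    num_train_epochs=12,                  # generous cap; stopping fires earlier
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,          # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# Stop if eval loss fails to improve for two consecutive epochs:
early_stop = EarlyStoppingCallback(early_stopping_patience=2)
# trainer = Seq2SeqTrainer(..., args=args, callbacks=[early_stop])
```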
Cross-lingual Transfer
Notable cross-lingual transfer was observed among related Dravidian languages (Tamil, Telugu, Malayalam), suggesting that shared morphological structure benefits multilingual training.
Model Choice
mT5-small outperformed IndicBART in our experiments, likely because its parameters adapted more effectively to our specific augmentation scheme during fine-tuning.