BHASHA 2025 @ ACL · Task 2 · December 23, 2025
🏆 Ranked 1st Among All Participating Teams

Fine-tuning Multilingual Transformers
for Indic Word Grouping

Manav Dhamecha  ·  Gaurav Damor  ·  Sunil Choudhary  ·  Pruthwik Mishra

Sardar Vallabhbhai National Institute of Technology, Surat  ·  Team Horizon

Best Exact Match (MuRIL)
58.18%
↑ +13.05 pts from post-challenge refinement
Official Submission
45.13%
Test exact match
Test Sentences
226
Hindi word grouping
Models Tested
3
MuRIL · XLM-R · IndicBERT
Data Augmented
5K
Hindi sentences added

Word grouping is the task of merging cohesive sequences of words into semantically meaningful units, a phenomenon characteristic of Indo-Aryan and Dravidian languages. We model word grouping as a token classification task and fine-tune three multilingual Transformer encoder models using the BIO annotation scheme. Our best model, MuRIL, achieves 58.18% exact-match accuracy and ranks 1st among all participating teams. We further analyze class imbalance, decoding strategies, and the impact of data augmentation on this structured prediction task.

What We Contribute

BIO Token Classification Pipeline

We frame word grouping as a sequence labeling problem with three labels: B (beginning of a group), I (inside a group), and O (outside, i.e., a standalone word). Each grouped output sentence is reconstructed by joining every predicted B token with the I tokens that follow it, using the _ boundary marker.

📝
Input Sentence
Raw Hindi text tokenized with subword tokenizer
🔖
BIO Labeling
Word-level labels aligned to subword tokens
⚖️
Weighted Loss
Inverse-frequency class weights upweight B & I
🤖
Encoder Fine-tune
MuRIL / XLM-R / IndicBERT v2
🔧
Reconstruction
Groups joined with __ separator
Exact Match
Full sentence must match gold output
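The label-alignment step can be sketched as follows. Here `word_ids` mimics the word-to-subword mapping that a Hugging Face fast tokenizer returns via `BatchEncoding.word_ids()`; the function name and the subword split for a sentence like राम धीरे धीरे चल रहा है are illustrative, not our exact training code:

```python
def align_labels(word_labels, word_ids, ignore_index=-100):
    """Project word-level BIO labels onto subword tokens.

    Only the first subword of each word keeps the real label; continuation
    subwords and special tokens (word id None) get ignore_index so the loss
    skips them.
    """
    aligned = []
    prev = None
    for wid in word_ids:
        if wid is None:                  # [CLS], [SEP], padding
            aligned.append(ignore_index)
        elif wid != prev:                # first subword of a word
            aligned.append(word_labels[wid])
        else:                            # continuation subword
            aligned.append(ignore_index)
        prev = wid
    return aligned

# Word labels O B I B I I encoded as 0=O, 1=B, 2=I.
labels = [0, 1, 2, 1, 2, 2]
# Hypothetical subword map: word 1 is split into two pieces.
word_ids = [None, 0, 1, 1, 2, 3, 4, 5, None]
print(align_labels(labels, word_ids))
# → [-100, 0, 1, -100, 2, 1, 2, 2, -100]
```

Masking continuation subwords with -100 keeps the loss and the evaluation strictly word-level, which matches the word-level BIO annotation.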
BIO Tagging Example (Hindi)
राम/O   धीरे/B   धीरे/I   चल/B   रहा/I   है/I
→ राम   धीरे_धीरे   चल_रहा_है
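The reconstruction step above can be sketched as a minimal function, assuming clean word-level B/I/O tags and the single-underscore marker shown in the example:

```python
def reconstruct(words, tags):
    """Join each B run with its following I words using _; O words stand alone."""
    groups = []
    for word, tag in zip(words, tags):
        if tag == "I" and groups:
            groups[-1].append(word)   # extend the current group
        else:                         # "B", "O", or a stray leading "I"
            groups.append([word])     # start a new unit
    return " ".join("_".join(g) for g in groups)

words = ["राम", "धीरे", "धीरे", "चल", "रहा", "है"]
tags  = ["O", "B", "I", "B", "I", "I"]
print(reconstruct(words, tags))  # → राम धीरे_धीरे चल_रहा_है
```

Treating a sentence-initial I as a fresh group is a defensive choice for decoding errors; the tagged example round-trips exactly.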

Model Comparison

MuRIL consistently outperforms the other encoders, benefiting from its targeted pretraining on Indian languages and a cased vocabulary that better preserves the morpheme and script cues essential for grouping decisions.

MuRIL
Multilingual Representations for Indian Languages · Strong Indic coverage · Cased vocabulary
Dev
46.58%
Test
58.18%
XLM-R
Cross-lingual RoBERTa · Trained on 100 languages · Large multilingual corpora
Dev
39.00%
Test
53.36%
IndicBERT v2
Indic-specific MLM · Covers 23 Indian languages · IndicCorp pretrained
Dev
35.40%
Test
52.73%

Use Our Models on Hugging Face

All our trained models are publicly available on Hugging Face. You can directly use them for inference, experimentation, or integration into your applications.

Where and Why the Model Fails

Among the mismatched predictions, over-merging is the most common failure mode — the model groups too aggressively. Performance also degrades sharply with sentence length.

Failure Mode | Dev Set (N=100) · 35% | Test Set (N=226) · 45%
Over-merge | 50.8% | 54.8%
Over-split | 29.2% | 31.5%
Wrong boundaries | 20.0% | 13.7%

Length Sensitivity

Exact match accuracy falls steeply as sentences grow longer — a challenge inherent to the strict exact-match evaluation criterion.

Length Range | Dev EM | Test EM | Note
≤ 20 words | 41.67% | 63.27% | Best performance
21–40 words | 40.82% | 45.99% | Moderate drop
> 40 words | 18.52% | 20.00% | Significant degradation
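The strict criterion behind these numbers reduces to a whole-sentence string comparison, bucketed by word count. A minimal sketch (the function name and bucket handling are ours, not the official scorer):

```python
def exact_match_by_length(preds, golds,
                          buckets=((1, 20), (21, 40), (41, 10**9))):
    """Exact-match rate per length bucket; length = word count of the gold."""
    hits = {b: 0 for b in buckets}
    totals = {b: 0 for b in buckets}
    for pred, gold in zip(preds, golds):
        n = len(gold.split())
        for lo, hi in buckets:
            if lo <= n <= hi:
                totals[(lo, hi)] += 1
                hits[(lo, hi)] += int(pred == gold)
                break
    return {b: hits[b] / totals[b] for b in buckets if totals[b]}

golds = ["राम धीरे_धीरे चल_रहा_है", "वह घर गया"]
preds = ["राम धीरे_धीरे चल रहा है", "वह घर गया"]
print(exact_match_by_length(preds, golds))  # → {(1, 20): 0.5}
```

A single misplaced boundary zeroes out the whole sentence, which is why longer sentences, with more boundary decisions, degrade so sharply.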

Gold Label Ambiguity

A key finding: several common multiword expressions appear grouped in some gold sentences but left ungrouped in others. This introduces unavoidable boundary ambiguity that constrains the achievable exact-match ceiling.

Phrase | Tokens | Grouped | Ungrouped
होता है | 2 | 29 | 7
होती है | 2 | 17 | 3
करते हैं | 2 | 15 | 8
होते हैं | 2 | 11 | 3
हो सकता है | 3 | 11 | 1
हो सकते हैं | 3 | 9 | 4
करने की | 2 | 8 | 5
करता है | 2 | 7 | 5

Ablations & Observations

⚖️

Class-Weighted Loss Works

Inverse-frequency class weighting improved B/I token recall and boosted exact match by 1–2 absolute percentage points over the unweighted baseline.
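The inverse-frequency weighting can be sketched as below. The label counts are made up for illustration; in training, the resulting weights would be passed to the loss (e.g. the `weight` argument of PyTorch's `CrossEntropyLoss`):

```python
def inverse_frequency_weights(label_counts):
    """w_c = N / (K * n_c): rare labels (here B and I) get larger weights."""
    total = sum(label_counts.values())
    k = len(label_counts)
    return {lab: total / (k * n) for lab, n in label_counts.items()}

# Illustrative counts: O dominates, as in most word-grouping data.
counts = {"O": 8000, "B": 1200, "I": 800}
for lab, w in inverse_frequency_weights(counts).items():
    print(f"{lab}: {w:.2f}")
# → O: 0.42, B: 2.78, I: 4.17
```

Scaling by N/(K·n_c) keeps the average weight near 1, so the overall loss magnitude stays comparable to the unweighted baseline while B and I errors are penalized more.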

🔤

Cased Vocabulary Matters

MuRIL's cased vocabulary preserves morpheme and script cues better than XLM-R, reducing tokenization-induced boundary errors in Hindi.

📦

Augmentation Backfired

Training on 5K rule-based augmented sentences introduced stylistic mismatch with the official data. The augmented model scored only 30.58% — lower than any fine-tuned baseline.

🔁

Post-Challenge Refinements

After the deadline, improved class weighting and boundary-reconstruction cleanup raised the test score from 45.13% (official) to 58.18%, a gain of 13.05 percentage points.

Cite This Work

@inproceedings{dhamecha2025indicwg,
  title        = {Team Horizon at BHASHA Task 2: Fine-tuning Multilingual Transformers for Indic Word Grouping},
  author       = {Dhamecha, Manav and Damor, Gaurav and Choudhary, Sunil and Mishra, Pruthwik},
  booktitle    = {Proceedings of the 1st Workshop on Benchmarks, Harmonization, Annotation, and Standardization for Human-Centric AI in Indian Languages (BHASHA 2025)},
  pages        = {175--179},
  year         = {2025},
  organization = {Association for Computational Linguistics}
}