Word grouping is the task of segmenting a sentence into cohesive, semantically meaningful multiword units, a structural characteristic of Indo-Aryan and Dravidian languages. We model word grouping as a token classification task and fine-tune three multilingual Transformer encoders using the BIO annotation scheme. Our best model, MuRIL, achieves 58.18% exact-match accuracy and ranks 1st among all participating teams. We further analyze class imbalance, decoding strategies, and the impact of data augmentation on this structured prediction task.
What We Contribute
- A simple, effective BIO token-classification pipeline for Indic word grouping, with careful subword-to-word label alignment using the tokenizer's `word_ids()` helper.
- Fine-tuning and evaluation of three multilingual pretrained encoders (MuRIL, XLM-R, IndicBERT v2) under identical conditions, with a class-weighted loss to address the O-label dominance problem.
- A thorough quantitative error analysis on both dev and test splits, with breakdown by error type and sentence length sensitivity.
- Annotation inconsistency analysis of the gold dataset, highlighting the impact of multiword expression labeling ambiguity on achievable exact-match accuracy.
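The subword-to-word label alignment mentioned in the first contribution can be sketched as follows. Here `word_ids` mimics the output of a Hugging Face fast tokenizer's `word_ids()` helper (None for special tokens, else the source word index); the label mapping and the `-100` masking convention are illustrative, not necessarily the exact configuration used.

```python
# Sketch: propagate word-level BIO labels to subword tokens.
LABEL2ID = {"B": 0, "I": 1, "O": 2}  # illustrative label mapping
IGNORE = -100  # positions the loss should skip

def align_labels(word_ids, word_labels):
    """Return one label id per subword position; only the first subword
    of each word keeps its label, continuations and special tokens are
    masked with IGNORE."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:            # special token ([CLS]/[SEP])
            aligned.append(IGNORE)
        elif wid != prev:          # first subword of a new word
            aligned.append(LABEL2ID[word_labels[wid]])
        else:                      # continuation subword
            aligned.append(IGNORE)
        prev = wid
    return aligned

# e.g. a two-word input split as [CLS] कर ##ता है [SEP]
print(align_labels([None, 0, 0, 1, None], ["B", "I"]))
# -> [-100, 0, -100, 1, -100]
```

Masking continuation subwords with `-100` means the loss is computed once per word, matching the word-level granularity of the gold annotations.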
BIO Token Classification Pipeline
We frame word grouping as a sequence labeling problem with three labels: B (beginning of a group), I (inside a group), and O (outside, i.e., a standalone word). Each grouped output sentence is reconstructed by joining the words of every contiguous B-I span with the `__` boundary marker.
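The reconstruction step can be sketched as below, assuming predictions have already been mapped back to one BIO tag per word (the function and variable names are ours):

```python
def reconstruct(words, tags):
    """Join each contiguous B-I span with the __ boundary marker;
    O-tagged words stand alone."""
    groups = []
    for word, tag in zip(words, tags):
        if tag == "I" and groups:        # continue the current group
            groups[-1] = groups[-1] + "__" + word
        else:                            # "B" or "O" (or a stray leading "I")
            groups.append(word)
    return " ".join(groups)

print(reconstruct(["हो", "सकता", "है", "अच्छा"], ["B", "I", "I", "O"]))
# -> हो__सकता__है अच्छा
```

Treating a stray sentence-initial "I" as a group opener is one simple repair strategy for invalid BIO sequences; stricter decoding could instead forbid such transitions at prediction time.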
Model Comparison
MuRIL consistently outperforms the other encoders, benefiting from its targeted pretraining on Indian languages and its cased vocabulary, which better preserves the morpheme and script cues essential for grouping decisions.
Use Our Models on Hugging Face
All our trained models are publicly available on Hugging Face. You can directly use them for inference, experimentation, or integration into your applications.
Where and Why the Model Fails
Among the mismatched predictions, over-merging is the most common failure mode — the model groups too aggressively. Performance also degrades sharply with sentence length.
- Dev set (N=100): 35%
- Test set (N=226): 45%
Length Sensitivity
Exact match accuracy falls steeply as sentences grow longer — a challenge inherent to the strict exact-match evaluation criterion.
| Length Range | Dev EM | Test EM | Note |
|---|---|---|---|
| ≤ 20 words | 41.67% | 63.27% | Best performance |
| 21–40 words | 40.82% | 45.99% | Moderate drop |
| > 40 words | 18.52% | 20.00% | Significant degradation |
Gold Label Ambiguity
A key finding: several common multiword expressions appear grouped in some gold sentences but left ungrouped in others. This introduces unavoidable boundary ambiguity that constrains the achievable exact-match ceiling.
| Phrase | Tokens | Grouped | Ungrouped |
|---|---|---|---|
| होता है | 2 | 29 | 7 |
| होती है | 2 | 17 | 3 |
| करते हैं | 2 | 15 | 8 |
| होते हैं | 2 | 11 | 3 |
| हो सकता है | 3 | 11 | 1 |
| हो सकते हैं | 3 | 9 | 4 |
| करने की | 2 | 8 | 5 |
| करता है | 2 | 7 | 5 |
Ablations & Observations
Class-Weighted Loss Works
Inverse-frequency class weighting improved B/I token recall and boosted exact match by 1–2 percentage points over the unweighted baseline.
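Inverse-frequency weights can be computed as in this minimal sketch; the normalization N / (K · n_c) is one common convention and not necessarily the exact scheme used here.

```python
from collections import Counter

def inverse_frequency_weights(tags):
    """Weight each class by N / (K * n_c): rare classes (B, I) receive
    weights above 1, the dominant O class a weight below 1."""
    counts = Counter(tags)
    n, k = len(tags), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Toy distribution dominated by O, as in the word-grouping data.
tags = ["O"] * 80 + ["B"] * 12 + ["I"] * 8
print(inverse_frequency_weights(tags))
```

The resulting per-class weights would then be passed to the loss function, e.g. via the `weight` argument of PyTorch's `CrossEntropyLoss`.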
Cased Vocabulary Matters
MuRIL's cased vocabulary preserves morpheme and script cues better than XLM-R, reducing tokenization-induced boundary errors in Hindi.
Augmentation Backfired
Training on 5K rule-based augmented sentences introduced stylistic mismatch with the official data. The augmented model scored only 30.58% — lower than any fine-tuned baseline.
Post-Challenge Refinements
After the deadline, improved class weighting and boundary-reconstruction cleanup boosted the test score from 45.13% (official) to 58.18%, a gain of roughly 13 percentage points.