Word grouping is the task of segmenting a sentence into cohesive, semantically meaningful multiword units, a structural characteristic of Indo-Aryan and Dravidian languages. We model word grouping as a token classification task and fine-tune three multilingual Transformer encoders using the BIO annotation scheme. Our best model, MuRIL, achieves 58.18% exact-match accuracy and ranks 1st among all participating teams. We further analyze class imbalance, decoding strategies, and the impact of data augmentation on this structured prediction task.
What We Contribute
- A simple, effective BIO token-classification pipeline for Indic word grouping, with careful subword-to-word label alignment using the tokenizer's `word_ids()` helper.
- Fine-tuning and evaluation of three multilingual pretrained encoders (MuRIL, XLM-R, IndicBERT v2) under identical conditions, with a class-weighted loss to address the O-label dominance problem.
- A thorough quantitative error analysis on both dev and test splits, with breakdown by error type and sentence length sensitivity.
- Annotation inconsistency analysis of the gold dataset, highlighting the impact of multiword expression labeling ambiguity on achievable exact-match accuracy.
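The subword-to-word label alignment mentioned in the first contribution can be sketched as follows. Here `word_ids` mimics the output of a Hugging Face fast tokenizer's `word_ids()` helper (None for special tokens, else the source word index); the label mapping and the `-100` masking convention are illustrative, not necessarily the exact configuration used.

```python
# Sketch: propagate word-level BIO labels to subword tokens.
LABEL2ID = {"B": 0, "I": 1, "O": 2}  # illustrative label mapping
IGNORE = -100  # positions the loss should skip

def align_labels(word_ids, word_labels):
    """Return one label id per subword position; only the first subword
    of each word keeps its label, continuations and special tokens are
    masked with IGNORE."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:            # special token ([CLS]/[SEP])
            aligned.append(IGNORE)
        elif wid != prev:          # first subword of a new word
            aligned.append(LABEL2ID[word_labels[wid]])
        else:                      # continuation subword
            aligned.append(IGNORE)
        prev = wid
    return aligned

# e.g. a two-word input split as [CLS] कर ##ता है [SEP]
print(align_labels([None, 0, 0, 1, None], ["B", "I"]))
# -> [-100, 0, -100, 1, -100]
```

Masking continuation subwords with `-100` means the loss is computed once per word, matching the word-level granularity of the gold annotations.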
BIO Token Classification Pipeline
We frame word grouping as a sequence labeling problem with three labels: B (beginning of a group), I (inside a group), and O (outside, i.e., a standalone word). Each grouped output sentence is reconstructed by joining the words of every contiguous B-I span with the `__` boundary marker.
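The reconstruction step can be sketched as below, assuming predictions have already been mapped back to one BIO tag per word (the function and variable names are ours):

```python
def reconstruct(words, tags):
    """Join each contiguous B-I span with the __ boundary marker;
    O-tagged words stand alone."""
    groups = []
    for word, tag in zip(words, tags):
        if tag == "I" and groups:        # continue the current group
            groups[-1] = groups[-1] + "__" + word
        else:                            # "B" or "O" (or a stray leading "I")
            groups.append(word)
    return " ".join(groups)

print(reconstruct(["हो", "सकता", "है", "अच्छा"], ["B", "I", "I", "O"]))
# -> हो__सकता__है अच्छा
```

Treating a stray sentence-initial "I" as a group opener is one simple repair strategy for invalid BIO sequences; stricter decoding could instead forbid such transitions at prediction time.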
Model Comparison
MuRIL consistently outperforms the other encoders, benefiting from its targeted pretraining on Indian languages and its cased vocabulary, which better preserves the morpheme and script cues essential for grouping decisions.
Use Our Models on Hugging Face
All our trained models are publicly available on Hugging Face. You can directly use them for inference, experimentation, or integration into your applications.
Where and Why the Model Fails
Among the mismatched predictions, over-merging is the most common failure mode — the model groups too aggressively. Performance also degrades sharply with sentence length.
- Dev set (N=100): 35%
- Test set (N=226): 45%
Length Sensitivity
Exact match accuracy falls steeply as sentences grow longer — a challenge inherent to the strict exact-match evaluation criterion.
| Length Range | Dev EM | Test EM | Note |
|---|---|---|---|
| ≤ 20 words | 41.67% | 63.27% | Best performance |
| 21–40 words | 40.82% | 45.99% | Moderate drop |
| > 40 words | 18.52% | 20.00% | Significant degradation |
Gold Label Ambiguity
A key finding: several common multiword expressions appear grouped in some gold sentences but left ungrouped in others. This introduces unavoidable boundary ambiguity that constrains the achievable exact-match ceiling.
| Phrase | Tokens | Grouped | Ungrouped |
|---|---|---|---|
| होता है | 2 | 29 | 7 |
| होती है | 2 | 17 | 3 |
| करते हैं | 2 | 15 | 8 |
| होते हैं | 2 | 11 | 3 |
| हो सकता है | 3 | 11 | 1 |
| हो सकते हैं | 3 | 9 | 4 |
| करने की | 2 | 8 | 5 |
| करता है | 2 | 7 | 5 |
Ablations & Observations
Class-Weighted Loss Works
Inverse-frequency class weighting improved B/I token recall and boosted exact match by 1–2 percentage points over the unweighted baseline.
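Inverse-frequency weights can be computed as in this minimal sketch; the normalization N / (K · n_c) is one common convention and not necessarily the exact scheme used here.

```python
from collections import Counter

def inverse_frequency_weights(tags):
    """Weight each class by N / (K * n_c): rare classes (B, I) receive
    weights above 1, the dominant O class a weight below 1."""
    counts = Counter(tags)
    n, k = len(tags), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Toy distribution dominated by O, as in the word-grouping data.
tags = ["O"] * 80 + ["B"] * 12 + ["I"] * 8
print(inverse_frequency_weights(tags))
```

The resulting per-class weights would then be passed to the loss function, e.g. via the `weight` argument of PyTorch's `CrossEntropyLoss`.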
Cased Vocabulary Matters
MuRIL's cased vocabulary preserves morpheme and script cues better than XLM-R, reducing tokenization-induced boundary errors in Hindi.
Augmentation Backfired
Training on 5K rule-based augmented sentences introduced stylistic mismatch with the official data. The augmented model scored only 30.58% — lower than any fine-tuned baseline.
Post-Challenge Refinements
After the deadline, improved class weighting and boundary-reconstruction cleanup boosted the test score from 45.13% (official) to 58.18%, a gain of roughly 13 percentage points.