Enhancing Sindhi word segmentation using subword representation learning and position-aware self-attention

Bibliographic details
Main authors: Ali, Wazir (author), Kumar, Jay (author), Tumrani, Saifullah (author), Nour, Redhwan (author), Noor, Adeeb (author), Xu, Zenglin (author)
Document type: Article (Journal)
Language: English
Published: 2025
In: IEEE Access
Year: 2025, Volume: 13, Pages: 183133-183142
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3507382
Online access: Publisher, free of charge, full text: https://doi.org/10.1109/ACCESS.2024.3507382
Publisher, free of charge, full text: https://ieeexplore.ieee.org/document/10769409/authors
Statement of responsibility: Wazir Ali, Jay Kumar, Saifullah Tumrani, Redhwan Nour, Adeeb Noor, and Zenglin Xu (Senior Member, IEEE)
Description
Summary: Sindhi word segmentation is a challenging task due to space omission and insertion issues. The Sindhi language itself adds to this complexity. It’s cursive and consists of characters with inherent joining and non-joining properties, independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. However, these methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness for other languages, and inefficiency with large amounts of noisy or raw text. Neural network-based models, in contrast, can automatically capture word boundary information without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets.
Description: Published online: 27 November 2024, article version: 13 December 2024
Viewed on 4 June 2025
Description: Online resource
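
Note on the summary: it describes word segmentation as a sequence labeling task. The sketch below illustrates that framing only, not the SGNWS model. It assumes a four-tag B/M/E/S scheme (begin, middle, end, single-character word); this record does not say which tag set the paper uses. The function names `words_to_tags` and `tags_to_words` are hypothetical, and Latin letters stand in for Sindhi text.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Turn a gold segmentation into one boundary tag per character.

    Assumed scheme: B = word start, M = word middle, E = word end,
    S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: List[str], tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted boundary tags."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # flush an unterminated final word
        words.append(current)
    return words


if __name__ == "__main__":
    gold = ["the", "cat"]
    tags = words_to_tags(gold)               # one tag per character
    unsegmented = list("".join(gold))        # text with the space omitted
    print(tags_to_words(unsegmented, tags))  # recovers the gold words
```

In a neural segmenter of the kind the summary describes, a model would predict the tag sequence from the unsegmented characters; here the tags are derived from a known segmentation only to show the encoding and decoding steps.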