A Joint Model of Automatic Word Segmentation and Part-Of-Speech Tagging for Ancient Classical Texts Based on Radicals

Bolin Chang, Yiguo Yuan, Bin Li, Zhixing Xu, Minxuan Feng, Dongbo Wang

Abstract

Digitizing ancient books requires automatic word segmentation and part-of-speech tagging, but existing approaches still fall short in efficiency and precision. This study treats word segmentation and part-of-speech tagging as a joint task. It establishes a correlation between fonts and radicals, trains the Radical2Vec radical vector representation model, and integrates it with the SikuRoBERTa word vector representation model. Finally, it connects them to a BiLSTM-CRF neural network. The joint approach is evaluated experimentally on a specific dataset. On the evaluation dataset, the F1 score is 95.75% for word segmentation and 91.65% for part-of-speech tagging. The model improves the efficiency and precision of ancient book processing, supporting the digitization of ancient books and the preservation of ancient book heritage.
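As an illustration of how segmentation and part-of-speech tagging can be combined into a single labeling task, the following minimal sketch encodes each character with a position marker plus a part-of-speech label and scores predictions by span-level F1. The B/M/E/S position scheme, the tag names `nr` and `v`, the example sentence, and all function names are assumptions chosen for illustration; the abstract does not state the tag set or the exact evaluation procedure used in the paper.

```python
from typing import List, Set, Tuple

Word = Tuple[str, str]  # (surface form, part-of-speech label)


def encode_joint_tags(words: List[Word]) -> List[Tuple[str, str]]:
    """Turn (word, pos) pairs into per-character (char, joint tag) pairs.

    Assumed scheme: B/M/E mark the begin/middle/end of a multi-character
    word, S marks a single-character word; the POS label is appended.
    """
    tagged = []
    for word, pos in words:
        if len(word) == 1:
            tagged.append((word, f"S-{pos}"))
            continue
        tagged.append((word[0], f"B-{pos}"))
        for ch in word[1:-1]:
            tagged.append((ch, f"M-{pos}"))
        tagged.append((word[-1], f"E-{pos}"))
    return tagged


def decode_joint_tags(tagged: List[Tuple[str, str]]) -> List[Word]:
    """Rebuild (word, pos) pairs from per-character joint tags."""
    words, buffer = [], ""
    for ch, tag in tagged:
        position, pos = tag.split("-", 1)
        buffer += ch
        if position in ("S", "E"):  # a word closes on S or E
            words.append((buffer, pos))
            buffer = ""
    return words


def _spans(words: List[Word], with_pos: bool) -> Set[tuple]:
    """Character-offset spans, optionally labeled with the POS tag."""
    spans, start = set(), 0
    for word, pos in words:
        end = start + len(word)
        spans.add((start, end, pos) if with_pos else (start, end))
        start = end
    return spans


def span_f1(gold: List[Word], pred: List[Word], with_pos: bool) -> float:
    """Span-level F1: segmentation only, or segmentation plus POS."""
    g, p = _spans(gold, with_pos), _spans(pred, with_pos)
    correct = len(g & p)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(p), correct / len(g)
    return 2 * precision * recall / (precision + recall)


gold = [("孔子", "nr"), ("曰", "v")]
pred = [("孔子", "n"), ("曰", "v")]  # right boundaries, one wrong POS label
joint = encode_joint_tags(gold)
```

Under this encoding a single sequence labeler predicts both tasks at once. Segmentation F1 depends only on word boundaries, while part-of-speech F1 also requires the label on each correctly segmented word to match, which is one common reason the second score is lower than the first.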

Anthology ID: 2023.alp-1.15
Volume: Proceedings of the Ancient Language Processing Workshop
Month: September
Year: 2023
Address: Varna, Bulgaria
Editors: Adam Anderson, Shai Gordin, Bin Li, Yudong Liu, Marco C. Passarotti
Venues: ALP | WS
Publisher: INCOMA Ltd., Shoumen, Bulgaria
Pages: 122–132
URL: https://aclanthology.org/2023.alp-1.15
Cite (ACL): Bolin Chang, Yiguo Yuan, Bin Li, Zhixing Xu, Minxuan Feng, and Dongbo Wang. 2023. A Joint Model of Automatic Word Segmentation and Part-Of-Speech Tagging for Ancient Classical Texts Based on Radicals. In Proceedings of the Ancient Language Processing Workshop, pages 122–132, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal): A Joint Model of Automatic Word Segmentation and Part-Of-Speech Tagging for Ancient Classical Texts Based on Radicals (Chang et al., ALP-WS 2023)
PDF: https://aclanthology.org/2023.alp-1.15.pdf
