Research on the Impact of Text Preprocessing on the Quality of Topic Classification

D.Yu. Podzol; I.A. Kolomoitseva

Authors

D.Yu. Podzol, Donetsk National Technical University
I.A. Kolomoitseva, Donetsk National Technical University

Keywords:

topic classification, text preprocessing, machine learning, neural models, RuBERT

Abstract

The study examines the impact of various text preprocessing strategies on the quality of topic classification for Russian-language documents. The SVM, LSTM, and RuBERT models are compared under three levels of data cleaning. The results show that moderate preprocessing improves the accuracy of classical and recurrent models, while excessive filtering reduces the performance of transformer-based architectures. Based on the findings, an adaptive preprocessing strategy tailored to the characteristics of each model is proposed.
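
The idea of model-specific preprocessing can be illustrated with a minimal Python sketch. The abstract does not define the three cleaning levels, so the level names (`light`, `moderate`, `aggressive`), the operations at each level, the tiny stop-word list, and the model-to-level mapping below are all assumptions chosen for illustration, not the authors' pipeline.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"и", "в", "на", "не", "что", "это", "с", "по"}

LEVELS = ("light", "moderate", "aggressive")

# Assumed mapping that mirrors the abstract's conclusion: moderate cleaning
# for the classical and recurrent models, minimal cleaning for the transformer.
LEVEL_FOR_MODEL = {"svm": "moderate", "lstm": "moderate", "rubert": "light"}


def preprocess(text: str, level: str) -> str:
    """Clean `text` at one of three assumed levels of aggressiveness."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")

    # Light: only normalize whitespace, keeping case and punctuation.
    text = re.sub(r"\s+", " ", text).strip()
    if level == "light":
        return text

    # Moderate: lowercase, then replace punctuation and digits with spaces.
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\d+", " ", text)
    tokens = text.split()
    if level == "moderate":
        return " ".join(tokens)

    # Aggressive: additionally drop stop words and very short tokens.
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
    return " ".join(tokens)


def preprocess_for(model_name: str, text: str) -> str:
    """Pick the cleaning level from the assumed per-model mapping."""
    return preprocess(text, LEVEL_FOR_MODEL[model_name])
```

Under these assumptions, `preprocess_for("rubert", text)` leaves case and punctuation for the model's own subword tokenizer, while `preprocess_for("svm", text)` produces lowercased, punctuation-free text suitable for a bag-of-words vectorizer.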

