Research on the Impact of Text Preprocessing on the Quality of Topic Classification
Keywords:
topic classification, text preprocessing, machine learning, neural models, RuBERT
Abstract
The study examines the impact of various text preprocessing strategies on the quality of topic classification for Russian-language documents. The SVM, LSTM, and RuBERT models are compared under three levels of data cleaning. The results show that moderate preprocessing improves the accuracy of classical and recurrent models, while excessive filtering reduces the performance of transformer-based architectures. Based on the findings, an adaptive preprocessing strategy tailored to the characteristics of each model is proposed.
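The idea of cleaning levels selected per model can be illustrated with a minimal sketch. The abstract does not define the three levels, so the level names (`minimal`, `moderate`, `aggressive`), the steps inside each level, the tiny stop-word list, and the `LEVEL_BY_MODEL` mapping are all assumptions made for illustration and are not the procedure used in the study. In particular, assigning RuBERT to the lightest level is only one reading of the finding that excessive filtering hurts transformer-based models.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"и", "в", "на", "не", "что", "это", "по", "с"}

URL_RE = re.compile(r"https?://\S+")
NON_WORD_RE = re.compile(r"[^\w\s]")  # \w is Unicode-aware for str patterns
SPACE_RE = re.compile(r"\s+")

LEVELS = ("minimal", "moderate", "aggressive")


def clean(text: str, level: str) -> str:
    """Apply one of three illustrative cleaning levels to a document."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")

    # Minimal: only normalize whitespace, keep case and punctuation.
    text = SPACE_RE.sub(" ", text).strip()
    if level == "minimal":
        return text

    # Moderate: lowercase, drop URLs and punctuation.
    text = URL_RE.sub(" ", text.lower())
    text = NON_WORD_RE.sub(" ", text)
    text = SPACE_RE.sub(" ", text).strip()
    if level == "moderate":
        return text

    # Aggressive: additionally drop stop words and very short tokens.
    tokens = [t for t in text.split() if t not in STOP_WORDS and len(t) > 2]
    return " ".join(tokens)


# Assumed mapping: moderate cleaning for the classical and recurrent models,
# light cleaning for the transformer.
LEVEL_BY_MODEL = {"svm": "moderate", "lstm": "moderate", "rubert": "minimal"}


def preprocess_for(model: str, text: str) -> str:
    """Pick the cleaning level by model name and apply it."""
    return clean(text, LEVEL_BY_MODEL[model.lower()])
```

Calling `preprocess_for("svm", doc)` and `preprocess_for("RuBERT", doc)` on the same document then yields differently cleaned inputs, which is the kind of per-model adaptation the abstract proposes.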