Language models pre-trained on large unlabeled corpora have proven highly effective at improving many downstream NLP tasks. However, existing language models are designed primarily for English, and little consideration has been given to the rich semantic information carried by Chinese characters. The semantically related stroke sequences and the polyphony unique to Chinese offer opportunities to enhance Chinese language representation models. Masked language models such as BERT are also plagued by inefficient use of training data, requiring more iterations to complete training. In light of these shortcomings, we propose an improved, customized Chinese pre-trained language model based on the Transformer, called SPCLM (Stroke-encoding and Pinyin-learning enhanced Chinese pre-trained Language representation Model). SPCLM incorporates stroke encoders and an auxiliary pronunciation prediction task. Moreover, an autoregressive objective and masked prediction jointly drive model training. Experimental results demonstrate that SPCLM outperforms baseline methods, achieving competitive results even with insufficient pre-training on five Chinese NLP tasks: natural language inference, semantic similarity, named entity recognition, sentiment analysis, and question answering.
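To make the joint multi-task formulation concrete, the following is a minimal sketch of how the three training signals named above (masked prediction, the autoregressive objective, and pinyin prediction) could be combined into one loss. This is not the authors' implementation; the function name, loss weights, and label conventions are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def spclm_joint_loss(mlm_logits, mlm_labels, ar_logits, ar_labels,
                         pinyin_logits, pinyin_labels,
                         w_mlm=1.0, w_ar=1.0, w_pinyin=1.0):
        # Hypothetical sketch: each *_logits tensor is (batch, seq_len,
        # num_classes); each *_labels tensor is (batch, seq_len), with -100
        # marking positions excluded from that loss (PyTorch's default
        # ignore_index for cross_entropy).
        loss_mlm = F.cross_entropy(mlm_logits.transpose(1, 2), mlm_labels)        # masked prediction
        loss_ar = F.cross_entropy(ar_logits.transpose(1, 2), ar_labels)           # autoregressive objective
        loss_py = F.cross_entropy(pinyin_logits.transpose(1, 2), pinyin_labels)   # pinyin prediction
        # Weighted sum of the three signals; equal weights are an assumption.
        return w_mlm * loss_mlm + w_ar * loss_ar + w_pinyin * loss_py

    # Usage with random tensors (token vocabulary of 50, 40 pinyin classes):
    B, T = 2, 8
    loss = spclm_joint_loss(
        torch.randn(B, T, 50), torch.randint(0, 50, (B, T)),
        torch.randn(B, T, 50), torch.randint(0, 50, (B, T)),
        torch.randn(B, T, 40), torch.randint(0, 40, (B, T)),
    )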
Keywords: Software Engineering; Language Model; Multi-Task Learning