CN113901210A - Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism - Google Patents

Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism

Info

Publication number
CN113901210A
CN113901210A (application CN202111078804.XA)
Authority
CN
China
Prior art keywords
word
syllable
thai
speech
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111078804.XA
Other languages
Chinese (zh)
Other versions
CN113901210B (en)
Inventor
线岩团
王悦寒
余正涛
相艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202111078804.XA
Publication of CN113901210A
Application granted
Publication of CN113901210B
Active legal status (Current)
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for part-of-speech tagging of Thai and Burmese that fuses word-syllable pairs with a local multi-head attention mechanism, and belongs to the field of natural language processing. The invention comprises the following steps: preprocessing a Thai or Burmese text data set; selecting word-syllable pair features as model input in a windowed manner; learning context features from the word-syllable pair sequence with a local multi-head attention mechanism; and finally modeling part-of-speech dependencies with a conditional random field and predicting part-of-speech tags. Experimental results on Thai and Burmese part-of-speech tagging data sets show that, compared with the current best models, fusing syllables as morphological features of words helps the model learn the context features of unknown words and mitigates the impact of mislabeled unknown words on model performance. In addition, the invention adopts a local multi-head self-attention mechanism, so the model obtains richer local dependency features and achieves better labeling results on the part-of-speech tagging task.

Description

Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism
Technical Field
The invention relates to a method for part-of-speech tagging of Thai and Burmese that fuses word-syllable pairs with a local multi-head attention mechanism, and belongs to the technical field of natural language processing.
Background
Part-of-speech tagging, i.e. determining the part of speech of each word in a given sentence, is one of the basic tasks in the field of natural language processing (NLP); it can improve the accuracy of syntactic analysis and thereby benefit many downstream NLP tasks.
Early part-of-speech tagging methods were mainly rule-based or based on statistical machine learning. Rule-based methods suffer from incomplete rule sets and rule conflicts. Current statistical machine learning methods mainly include Support Vector Machines (SVM), Hidden Markov Models (HMM) and Conditional Random Fields (CRFs). These methods usually require manually constructed feature extraction functions, a laborious process that often yields insufficient features.
Neural network methods avoid manual feature construction through word embeddings or embedding layers and at the same time improve part-of-speech tagging performance. Mainstream part-of-speech tagging models usually adopt BiLSTM (Bidirectional Long Short-Term Memory) networks or pre-trained language models to obtain context features for part-of-speech tagging. BiLSTM-based methods effectively reduce the model's dependence on feature engineering and pass information between words along the time dimension, but from the perspective of part-of-speech tagging, very long sequences contribute little to predicting a word's part of speech. Pre-trained language models such as BERT can learn high-quality character, syllable and word embeddings that capture latent syntactic and semantic similarity between words, but pre-training increases the number of model parameters, which severely slows down prediction with the trained part-of-speech tagging model. In addition, these methods ignore the word-formation characteristics and internal structure information of Thai and Burmese, so morphological information is lost and part-of-speech tagging performance suffers.
Syllables are the basic building blocks of words in these languages and are correlated with word part of speech to some extent. For example, BiLSTM-CRF has been used to predict word part-of-speech tags by learning shared features from word and syllable embeddings of Burmese text. However, the contribution of syllables is not significant when modeling long-distance context features. Fusing the morphological features of unknown words into a part-of-speech tagging model with a local multi-head attention mechanism helps learn the context features of unknown words and mitigates the impact of mislabeled unknown words on model performance.
Disclosure of Invention
The invention provides a method for part-of-speech tagging of Thai and Burmese that fuses word-syllable pairs with a local multi-head attention mechanism. It is used for part-of-speech tagging of Southeast Asian languages written in alphasyllabic scripts, such as Thai and Burmese, and addresses the poor tagging performance on low-frequency parts of speech and unknown words.
The technical scheme of the invention is as follows: the method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with local multi-head attention comprises the following specific steps:
Step1, text preprocessing is performed on the Thai LST20 data set or the Burmese ALT data set. For example, given a Thai sentence with m words, each word in the sentence is segmented into syllables to find potential affix information inside the word, so that the word sequence is expanded into a word-syllable pair sequence.
Step2, inputs containing n word-syllable pairs are obtained sequentially from the data preprocessed in Step1 with a sliding window, the word-syllable pairs are feature-encoded with a local multi-head attention mechanism, and the predicted shared features of the input n-gram are then obtained by concatenating the output features of the Transformer encoder with the syllable embeddings.
Step3, finally, part-of-speech dependencies are modeled with a conditional random field and part-of-speech tags are predicted.
The specific steps of Step1 are as follows:
Step1.1, construct the word alphabet and the part-of-speech tag alphabet for the training set from the vocabulary entries separated by '\n' in the Thai text;
Step1.2, call a state-of-the-art Thai or Burmese syllable segmenter to segment the words in the text into syllables and construct the syllable alphabet;
Step1.3, then, for each word, the invention assigns to it the syllables it contains. For the syllables, the invention only keeps the prefix syllable and the suffix syllable of each word as input. If a word consists of a single syllable, the invention supplements it with a "<PAD>" symbol to complete the input syllable vector.
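The following Python sketch illustrates the word-syllable pair construction of Step1.3. It is only an illustration under stated assumptions: the function segment_syllables is a hypothetical stand-in for the Thai or Burmese syllable segmenter mentioned above (here a word is assumed to arrive pre-segmented with "|" between syllables), and is not the segmenter actually used by the invention.

```python
# Minimal sketch of Step 1.3: pair each word with its prefix and suffix syllables.
# `segment_syllables` is a hypothetical placeholder, not the real segmenter.

PAD = "<PAD>"

def segment_syllables(word):
    # Placeholder: assume syllables are already marked with "|" inside the word.
    return word.split("|")

def word_to_syllable_pair(word):
    """Return (prefix_syllable, suffix_syllable) for a word.

    Only the first and last syllables are kept as input; a single-syllable
    word is supplemented with "<PAD>" so every word yields a fixed-length
    syllable vector, as described in Step 1.3.
    """
    syllables = segment_syllables(word)
    if len(syllables) >= 2:
        return syllables[0], syllables[-1]
    return syllables[0], PAD  # single-syllable word: pad the missing slot

def sentence_to_word_syllable_pairs(words):
    """Expand a word sequence into a word-syllable pair sequence."""
    return [(w, *word_to_syllable_pair(w)) for w in words]

# Example with hypothetical placeholder "words":
print(sentence_to_word_syllable_pairs(["ka|la|ma", "tha"]))
# [('ka|la|ma', 'ka', 'ma'), ('tha', 'tha', '<PAD>')]
```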
In a preferred embodiment of the present invention, Step2 is:
Step2.1, the coding layer takes the word embeddings of the n-gram, $w_{1:n} = (w_1, w_2, \dots, w_n)$, together with the corresponding syllable embeddings, $s_{1:n} = (s_1, s_2, \dots, s_n)$, as the input to the encoder; the input n-gram matrix can be expressed as:

$$X = [w_{1:n}; s_{1:n}] \qquad (1)$$
Step2.2, the multi-head attention layer of the encoder maps a query and a set of key-value pairs to an output. Given a vector sequence $X = (x_1, x_2, \dots, x_n)$, single-head attention projects $X$ onto three different matrices: the query matrix $Q$, the key matrix $K$ and the value matrix $V$. The attention weights are obtained by computing the dot-product attention between the words in the sentence, and the final score is the weighted sum of the values;
$$Q, K, V = XW^{Q}, XW^{K}, XW^{V} \qquad (2)$$

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (3)$$
where the matrices $W^{Q}$, $W^{K}$, $W^{V}$ are learnable parameters and $d_k$ is the dimension of the output vectors of the model's embedding layer; this factor rescales the inner product of $Q$ and $K^{T}$ so that an overly large inner product does not lead to an uneven distribution after softmax. softmax(·) normalizes the scores.
Step2.3, the Multi-Head attention layer is formed by concatenating several attention heads;

$$\mathrm{MultiHead}(Q, K, V) = [\mathrm{Att}_1, \mathrm{Att}_2, \dots, \mathrm{Att}_h] \qquad (4)$$
Step2.4, the feed-forward neural network layer consists of two linear layers in series; they have independent weights and biases and different dimensionalities, which allows further semantic information to be extracted;
$$Z = \mathrm{layer\text{-}norm}(X + \mathrm{MultiHead}(X)) \qquad (5)$$

$$\mathrm{FFN}(Z) = \mathrm{ReLU}(W_1 Z + b_1)W_2 + b_2 \qquad (6)$$
where layer-norm(·) denotes the normalization layer, FFN denotes the feed-forward network layer, $W_1$, $b_1$, $W_2$, $b_2$ are projection parameters, and $Z$ denotes the output of the normalization layer.
Step2.5, a normalization layer after the feed-forward neural network layer yields the output $o_i$ of the encoder block. The predicted shared features of the input n-gram are then obtained by concatenating the Transformer encoder output features with the syllable embeddings, and finally a vector is obtained through a multilayer perceptron (MLP):

$$h_i = \mathrm{MLP}([o_i; s_i]) \qquad (7)$$
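As an illustrative aid, the following numpy sketch runs a single local multi-head self-attention encoder block over one n-gram window of word-syllable pair embeddings, following equations (1) to (7) above. The dimensions, random inputs and weight initialization are assumptions chosen only for demonstration and do not reflect the actual parameter settings or trained weights of the invention.

```python
import numpy as np

# Illustrative sketch of the Step 2 encoder over an n-gram window of
# word-syllable pair embeddings. All sizes and weights are assumptions.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def multi_head_attention(X, heads, rng):
    n, d = X.shape
    d_k = d // heads
    outs = []
    for _ in range(heads):                        # one scaled dot-product head at a time
        Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv          # eq. (2)
        A = softmax(Q @ K.T / np.sqrt(d_k))       # eq. (3), scaled dot-product attention
        outs.append(A @ V)
    return np.concatenate(outs, axis=-1)          # eq. (4), concatenation of heads

def encoder_block(X, heads, rng):
    Z = layer_norm(X + multi_head_attention(X, heads, rng))   # eq. (5)
    d = X.shape[1]
    W1, b1 = rng.normal(scale=0.1, size=(d, 4 * d)), np.zeros(4 * d)
    W2, b2 = rng.normal(scale=0.1, size=(4 * d, d)), np.zeros(d)
    ffn = np.maximum(0, Z @ W1 + b1) @ W2 + b2                 # eq. (6), ReLU feed-forward
    return layer_norm(Z + ffn)                                  # encoder output o_i

rng = np.random.default_rng(0)
n, d_word, d_syll = 5, 64, 32                      # assumed window size and embedding dims
word_emb = rng.normal(size=(n, d_word))            # word embeddings (placeholder values)
syll_emb = rng.normal(size=(n, d_syll))            # syllable embeddings (placeholder values)
X = np.concatenate([word_emb, syll_emb], axis=-1)  # eq. (1): fuse word and syllable embeddings
O = encoder_block(X, heads=4, rng=rng)
shared = np.concatenate([O, syll_emb], axis=-1)    # eq. (7): concat with syllables; an MLP would follow
print(shared.shape)                                # (5, 128)
```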
In a preferred embodiment of the present invention, Step3 is:
Step3.1, in the decoding process, the invention uses the Viterbi algorithm to obtain the highest-scoring tag sequence. Formally, the invention uses $x = (x_1, x_2, \dots, x_n)$ to represent an input sequence and $y = (y_1, y_2, \dots, y_n)$ to represent the model output tags, and defines the score as:

$$\mathrm{score}(x, y; \theta) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \qquad (8)$$
where $A$ is the transition score matrix, e.g. $A_{i,j}$ denotes the score of transitioning from the $i$-th state to the $j$-th state in a pair of consecutive time steps; $y_0$ and $y_{n+1}$ are the tags at the beginning and end of the sentence that the model adds to the set of possible tags; $P$ is the score matrix output by the Transformer encoder; and $\theta$ denotes the parameters of the neural network.
Step3.2, a softmax over all possible tag sequences yields the probability of the sequence $y$;

$$p(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y' \in Y_x} \exp(\mathrm{score}(x, y'))} \qquad (9)$$
where $Y_x$ denotes the set of possible tag sequences for $x$.
Step3.3, during training, maximum likelihood estimation is used to maximize the log-probability of the correct tag sequence;

$$\log p(y \mid x) = \mathrm{score}(x, y) - \log \sum_{y' \in Y_x} \exp(\mathrm{score}(x, y')) \qquad (10)$$
where $Y_x$ denotes the set of possible tag sequences for the sentence $x$. It is apparent from the above formula that training continually drives the model to produce valid output tag sequences.
Step3.4, at decoding time, the output sequence of the model is predicted by taking the sequence $y^{*}$ with the maximum score:

$$y^{*} = \underset{y' \in Y_x}{\arg\max}\ \mathrm{score}(x, y') \qquad (11)$$
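For illustration, the following numpy sketch implements the Viterbi decoding of Step3.4 and equation (11) over an emission score matrix P and a transition matrix A as described above. The values of P and A here are random placeholders, not outputs of the actual model.

```python
import numpy as np

# Minimal sketch of Viterbi decoding (equations (8) and (11)): find the tag
# sequence y* with the maximum score given emission scores P and transitions A.

def viterbi_decode(P, A):
    """P: (n, T) emission scores; A: (T, T) transition scores A[i, j] = score(i -> j)."""
    n, T = P.shape
    score = P[0].copy()                            # best score ending in each tag at step 0
    back = np.zeros((n, T), dtype=int)             # backpointers
    for t in range(1, n):
        cand = score[:, None] + A + P[t][None, :]  # all (previous tag, current tag) extensions
        back[t] = cand.argmax(axis=0)              # best previous tag for each current tag
        score = cand.max(axis=0)
    path = [int(score.argmax())]                   # recover the highest-scoring path y*
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return list(reversed(path)), float(score.max())

rng = np.random.default_rng(0)
n_words, n_tags = 6, 4                             # assumed sizes for the example
P = rng.normal(size=(n_words, n_tags))             # emission scores from the encoder (placeholder)
A = rng.normal(size=(n_tags, n_tags))              # learned transition score matrix (placeholder)
tags, best = viterbi_decode(P, A)
print(tags, best)
```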
The invention has the beneficial effects that:
(1) the invention provides a Thai and Burmese part-of-speech tagging model that utilizes a local multi-head attention mechanism and a conditional random field; the model structure is simpler and more effective than existing BiLSTM and pre-trained models;
(2) the invention provides a method for learning context features from a local word-syllable pair sequence, which effectively alleviates the ambiguity problem and improves part-of-speech prediction for low-frequency words and unknown words;
(3) experimental results on the Thai and Burmese part-of-speech tagging data sets show that the labeling performance of the proposed model is superior to the current best models while its parameter scale is smaller.
Drawings
FIG. 1 is a schematic diagram of the specific structure of the model according to the present invention;
FIG. 2 is a diagram of word-syllable pair sequences in the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-2, a method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism comprises the following steps:
Step1, text preprocessing is performed on the Thai LST20 data set or the Burmese ALT data set. For example, given a Thai sentence with m words, each word in the sentence is segmented into syllables to find potential affix information inside the word, so that the word sequence is expanded into a word-syllable pair sequence.
Step1.1, construct the word alphabet and the part-of-speech tag alphabet for the training set from the vocabulary entries separated by '\n' in the Thai text;
Step1.2, call a state-of-the-art Thai or Burmese syllable segmenter to segment the words in the text into syllables and construct the syllable alphabet;
Step1.3, then, for each word, the invention assigns to it the syllables it contains, as shown in FIG. 1. For the syllables, the invention only keeps the prefix syllable and the suffix syllable of each word as input. If a word consists of a single syllable, the invention supplements it with a "<PAD>" symbol to complete the input syllable vector.
Step2, inputs containing n word-syllable pairs are obtained sequentially from the data preprocessed in Step1 with a sliding window, the word-syllable pairs are feature-encoded with a local multi-head attention mechanism, and the predicted shared features of the input n-gram are then obtained by concatenating the output features of the Transformer encoder with the syllable embeddings.
Step2.1, the coding layer takes the word embeddings of the n-gram, $w_{1:n} = (w_1, w_2, \dots, w_n)$, together with the corresponding syllable embeddings, $s_{1:n} = (s_1, s_2, \dots, s_n)$, as the input to the encoder; the input n-gram matrix can be expressed as:

$$X = [w_{1:n}; s_{1:n}] \qquad (1)$$
Step2.2, the multi-head attention layer of the encoder maps a query and a set of key-value pairs to an output. Given a vector sequence $X = (x_1, x_2, \dots, x_n)$, single-head attention projects $X$ onto three different matrices: the query matrix $Q$, the key matrix $K$ and the value matrix $V$. The attention weights are obtained by computing the dot-product attention between the words in the sentence, and the final score is the weighted sum of the values;
$$Q, K, V = XW^{Q}, XW^{K}, XW^{V} \qquad (2)$$

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (3)$$
where the matrices $W^{Q}$, $W^{K}$, $W^{V}$ are learnable parameters and $d_k$ is the dimension of the output vectors of the model's embedding layer; this factor rescales the inner product of $Q$ and $K^{T}$ so that an overly large inner product does not lead to an uneven distribution after softmax. softmax(·) normalizes the scores.
Step2.3, Multi-Head attention layer Multi-Head is formed by splicing a plurality of attention layers;
MultiHead(Q,K,V)=[Att1,Att2,...,Atth], (4)
Step2.4, the feed-forward neural network layer consists of two linear layers in series; they have independent weights and biases and different dimensionalities, which allows further semantic information to be extracted;
$$Z = \mathrm{layer\text{-}norm}(X + \mathrm{MultiHead}(X)) \qquad (5)$$

$$\mathrm{FFN}(Z) = \mathrm{ReLU}(W_1 Z + b_1)W_2 + b_2 \qquad (6)$$
where layer-norm(·) denotes the normalization layer, FFN denotes the feed-forward network layer, $W_1$, $b_1$, $W_2$, $b_2$ are projection parameters, and $Z$ denotes the output of the normalization layer.
Step2.5, a normalization layer after the feed-forward neural network layer yields the output $o_i$ of the encoder block. The predicted shared features of the input n-gram are then obtained by concatenating the Transformer encoder output features with the syllable embeddings, and finally a vector is obtained through a multilayer perceptron (MLP):

$$h_i = \mathrm{MLP}([o_i; s_i]) \qquad (7)$$
Step3.1, in the decoding process, the invention uses the Viterbi algorithm to obtain the highest-scoring tag sequence. Formally, the invention uses $x = (x_1, x_2, \dots, x_n)$ to represent an input sequence and $y = (y_1, y_2, \dots, y_n)$ to represent the model output tags, and defines the score as:

$$\mathrm{score}(x, y; \theta) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \qquad (8)$$
where $A$ is the transition score matrix, e.g. $A_{i,j}$ denotes the score of transitioning from the $i$-th state to the $j$-th state in a pair of consecutive time steps; $y_0$ and $y_{n+1}$ are the tags at the beginning and end of the sentence that the model adds to the set of possible tags; $P$ is the score matrix output by the Transformer encoder; and $\theta$ denotes the parameters of the neural network.
Step3.2, a softmax over all possible tag sequences yields the probability of the sequence $y$;

$$p(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y' \in Y_x} \exp(\mathrm{score}(x, y'))} \qquad (9)$$
where $Y_x$ denotes the set of possible tag sequences for $x$.
Step3.3, during training, maximum likelihood estimation is used to maximize the log-probability of the correct tag sequence;

$$\log p(y \mid x) = \mathrm{score}(x, y) - \log \sum_{y' \in Y_x} \exp(\mathrm{score}(x, y')) \qquad (10)$$
where $Y_x$ denotes the set of possible tag sequences for the sentence $x$. It is apparent from the above formula that training continually drives the model to produce valid output tag sequences.
Step3.4, at decoding time, the output sequence of the model is predicted by taking the sequence $y^{*}$ with the maximum score:

$$y^{*} = \underset{y' \in Y_x}{\arg\max}\ \mathrm{score}(x, y') \qquad (11)$$
To illustrate the effect of the present invention, three groups of comparative experiments were set up. The first group verifies the effectiveness on Thai part-of-speech tagging, the second group verifies the effectiveness on Burmese part-of-speech tagging, and the third group verifies the effectiveness of low-frequency part-of-speech prediction in the part-of-speech tagging task.
(1) Effectiveness on Thai part-of-speech tagging
The invention uses a MultiHead-Att + Syllable + CRF model structure that enriches semantic information by fusing affix-syllable features on top of the word vectors. To verify the effectiveness of the MultiHead-Att + Syllable + CRF model on Thai part-of-speech tagging, it is compared with several related models on the Thai LST20 corpus. The experimental results are shown in Table 1.
TABLE 1 Experimental results on the LST20 data set
[Table 1 is reproduced as an image in the original document; its values are not shown here.]
As can be seen from Table 1, both the Micro F1 and Macro F1 values of the proposed model exceed all mainstream models on the LST20 data set, demonstrating the effectiveness of the invention for the Thai part-of-speech tagging task. The CRF-based method has the lowest F1 value, which shows that neural network models can capture deeper information from the input language than traditional machine-learning-based methods. The F1 value of the proposed model is also higher than that of the other three neural-network-based pre-trained language models, probably because those models lack textual morphological information and internal structure information, which biases their part-of-speech predictions. The proposed model combines the word-formation characteristics of Thai and exploits local word-syllable pair sequence features, thereby improving the recognition of each word's part of speech. Meanwhile, Table 1 shows a large gap between the Micro F1 and Macro F1 values: the Micro F1 value is dominated by the common part-of-speech categories, which the model learns well, whereas the Macro F1 value weights every part-of-speech category equally and is therefore pulled down by the rare categories, which is why it is much lower.
(2) Effectiveness on Burmese part-of-speech tagging
To verify the effectiveness of the MultiHead-Att + Syllable + CRF model on Burmese part-of-speech tagging, the invention carries out comparison experiments with several related models on the Burmese corpus. The experimental results are shown in Table 2.
TABLE 2 Experimental results on the Burmese ALT data set
[Table 2 is reproduced as an image in the original document; its values are not shown here.]
Table 2 compares this method with models such as CNN, Seq2Seq and BiLSTM-CRF. The experimental results show that the proposed model achieves the best performance on the Burmese POS tagging task: its local attention mechanism handles the information interaction of the input features more efficiently, so the word vector features and syllable vector features are better fused and exploited, which also proves the effectiveness on the Burmese part-of-speech tagging task. Among the baselines, the CNN-based model performs worst and the Seq2Seq model is second worst, probably because the CNN-based method relies on convolution over selected local features, while Seq2Seq is an encoding-decoding method in which an encoding error directly causes a decoding error. Compared with the BiLSTM-CRF model and the BERT + BiLSTM-CRF + Fine-tune model, the F1 value of the proposed model improves by nearly 4% and 1% respectively, indicating that the Transformer conveys feature information more effectively than BiLSTM. The reason may be that the Transformer encoder can model information directly for every word-syllable pair, whereas BiLSTM has to model step by step from both ends of the input sentence toward the middle, so information in long sentences may be lost. Moreover, BERT + BiLSTM-CRF + Fine-tune is a joint word segmentation and part-of-speech tagging model, so the model is complex and additionally requires Burmese character, syllable and word vectors pre-trained with BERT.
(3) Effectiveness of low-frequency part-of-speech prediction
This experiment verifies the effectiveness of the MultiHead-Att + Syllable + CRF model on low-frequency part-of-speech prediction in the part-of-speech tagging task. For the Thai LST20 corpus, part-of-speech tags that occur fewer than 4000 times in the test set are treated as low-frequency parts of speech and are compared with the currently most advanced Thai part-of-speech tagging model, the WangchanBERTa model. The experimental results are shown in Table 3.
TABLE 3 Comparison results on low-frequency parts of speech
[Table 3 is reproduced as an image in the original document; its values are not shown here.]
As can be seen from Table 3, the proposed model outperforms the most advanced model on low-frequency part-of-speech prediction. The improvement is most obvious for the low-frequency parts of speech CL and IJ, which shows that local word-syllable pair information can supplement word features, and that the local multi-head attention mechanism can establish dependencies between words and exchange information in real time, thereby helping the model predict low-frequency parts of speech. Performance on XX is slightly lower, possibly because XX denotes an unknown part of speech: the proposed model supplements words labeled XX according to their own word-formation characteristics, so the model may misjudge an XX label as the part of speech of a word with the same syllables.
The above experimental data prove that the part-of-speech tagging model obtains better performance by using syllable information, because syllable-level information takes affix characteristics into account, and many affixes in Thai and Burmese convey part-of-speech information, which is a non-negligible factor in part-of-speech tagging. Meanwhile, the local multi-head attention mechanism improves the parallelism of the model and effectively captures the context features around each word. Experiments prove that the proposed method achieves the best results compared with several baseline models. For the part-of-speech tagging task of Southeast Asian languages written in alphasyllabic scripts, such as Thai and Burmese, the invention provides a method that captures word-syllable sequence information with a local multi-head attention mechanism to improve part-of-speech tagging.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (4)

1. A method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism, characterized by comprising the following specific steps:
Step1, performing text preprocessing on a Thai or Burmese text data set, and performing syllable segmentation on each word in a sentence to find potential affix information in the word, so that the word sequence is expanded into a word-syllable pair sequence;
Step2, sequentially obtaining inputs containing n word-syllable pairs from the data preprocessed in Step1 with a sliding window, performing feature coding on the word-syllable pairs with a local multi-head attention mechanism, and then obtaining the predicted shared features of the input n-gram by concatenating the output features of the Transformer encoder with the syllable embeddings;
Step3, finally, modeling part-of-speech dependencies through a conditional random field and predicting part-of-speech tags.
2. The method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism according to claim 1, characterized in that the specific steps of Step1 are as follows:
Step1.1, constructing the word alphabet and the part-of-speech tag alphabet for the training set from the vocabulary entries separated by '\n' in the Thai text;
Step1.2, calling a Thai or Burmese syllable segmenter to segment the words in the text into syllables and construct a syllable alphabet;
Step1.3, then, for each word, assigning to it the syllables it contains; for the syllables, only the prefix syllable and the suffix syllable of each word are kept as input; if a word consists of a single syllable, the syllable is supplemented with a "<PAD>" symbol to complete the input syllable vector.
3. The method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism according to claim 1, characterized in that the specific steps of Step2 are as follows:
Step2.1, the coding layer takes the word embeddings of the n-gram, $w_{1:n} = (w_1, w_2, \dots, w_n)$, together with the corresponding syllable embeddings, $s_{1:n} = (s_1, s_2, \dots, s_n)$, as the input to the encoder; the input n-gram matrix is represented as:

$$X = [w_{1:n}; s_{1:n}] \qquad (1)$$
Step2.2, the multi-head attention layer of the encoder maps a query and a set of key-value pairs to an output; given a vector sequence $X = (x_1, x_2, \dots, x_n)$, single-head attention projects $X$ onto three different matrices: the query matrix $Q$, the key matrix $K$ and the value matrix $V$; the attention weights are obtained by computing the dot-product attention between the words in the sentence, and the final score is the weighted sum of the values;
$$Q, K, V = XW^{Q}, XW^{K}, XW^{V} \qquad (2)$$

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (3)$$
where the matrices $W^{Q}$, $W^{K}$, $W^{V}$ are learnable parameters and $d_k$ is the dimension of the output vectors of the model's embedding layer; this factor rescales the inner product of $Q$ and $K^{T}$ so that an overly large inner product does not lead to an uneven distribution after softmax; softmax(·) normalizes the scores;
Step2.3, the Multi-Head attention layer is formed by concatenating several attention heads;

$$\mathrm{MultiHead}(Q, K, V) = [\mathrm{Att}_1, \mathrm{Att}_2, \dots, \mathrm{Att}_h] \qquad (4)$$
Step2.4, the feed-forward neural network layer consists of two linear layers in series; the linear layers have independent weights and biases and different dimensionalities, so that semantic information can be further extracted;
$$Z = \mathrm{layer\text{-}norm}(X + \mathrm{MultiHead}(X)) \qquad (5)$$

$$\mathrm{FFN}(Z) = \mathrm{ReLU}(W_1 Z + b_1)W_2 + b_2 \qquad (6)$$
where layer-norm(·) denotes the normalization layer, FFN denotes the feed-forward network layer, $W_1$, $b_1$, $W_2$, $b_2$ are projection parameters, and $Z$ denotes the output of the normalization layer;
Step2.5, a normalization layer after the feed-forward neural network layer yields the output $o_i$ of the encoder block; then the Transformer encoder output features and the syllable embeddings are concatenated to obtain the predicted shared features of the input n-gram, and finally a vector is obtained through a multilayer perceptron MLP:

$$h_i = \mathrm{MLP}([o_i; s_i]) \qquad (7)$$
4. The method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism according to claim 1, characterized in that the specific steps of Step3 are as follows:
Step3.1, in the decoding process, the highest-scoring tag sequence is obtained with the Viterbi algorithm; formally, $x = (x_1, x_2, \dots, x_n)$ represents an input sequence and $y = (y_1, y_2, \dots, y_n)$ represents the model output tags, so a score is defined as:

$$\mathrm{score}(x, y; \theta) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \qquad (8)$$
where $A$ is the transition score matrix, $A_{i,j}$ denotes the score of transitioning from the $i$-th state to the $j$-th state in a pair of consecutive time steps, $y_0$ and $y_{n+1}$ are the tags added at the beginning and end of the sentence, $P$ is the score matrix output by the Transformer encoder, and $\theta$ denotes the parameters of the neural network;
Step3.2, a softmax over all possible tag sequences yields the probability of the sequence $y$;

$$p(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y' \in Y_x} \exp(\mathrm{score}(x, y'))} \qquad (9)$$
where $Y_x$ denotes the set of possible tag sequences for $x$;
Step3.3, during training, maximum likelihood estimation is used to maximize the log-probability of the correct tag sequence;

$$\log p(y \mid x) = \mathrm{score}(x, y) - \log \sum_{y' \in Y_x} \exp(\mathrm{score}(x, y')) \qquad (10)$$
where $Y_x$ denotes the set of possible tag sequences for the sentence $x$;
Step3.4, at decoding time, the output sequence of the model is predicted by taking the sequence $y^{*}$ with the maximum score;

$$y^{*} = \underset{y' \in Y_x}{\arg\max}\ \mathrm{score}(x, y') \qquad (11)$$
CN202111078804.XA 2021-09-15 2021-09-15 Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism Active CN113901210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111078804.XA CN113901210B (en) 2021-09-15 2021-09-15 Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111078804.XA CN113901210B (en) 2021-09-15 2021-09-15 Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism

Publications (2)

Publication Number Publication Date
CN113901210A true CN113901210A (en) 2022-01-07
CN113901210B CN113901210B (en) 2022-12-13

Family

ID=79028510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111078804.XA Active CN113901210B (en) 2021-09-15 2021-09-15 Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism

Country Status (1)

Country Link
CN (1) CN113901210B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116977436A (en) * 2023-09-21 2023-10-31 小语智能信息科技(云南)有限公司 Burmese text image recognition method and device based on Burmese character cluster characteristics

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134757A (en) * 2019-04-19 2019-08-16 杭州电子科技大学 A kind of event argument roles abstracting method based on bull attention mechanism
US20190355270A1 (en) * 2018-05-18 2019-11-21 Salesforce.Com, Inc. Multitask Learning As Question Answering
CN110489750A (en) * 2019-08-12 2019-11-22 昆明理工大学 Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium
CN111581474A (en) * 2020-04-02 2020-08-25 昆明理工大学 Evaluation object extraction method of case-related microblog comments based on multi-head attention system
CN111581396A (en) * 2020-05-06 2020-08-25 西安交通大学 Event graph construction system and method based on multi-dimensional feature fusion and dependency syntax
CN112270193A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Chinese named entity identification method based on BERT-FLAT
CN112487796A (en) * 2020-11-27 2021-03-12 北京智源人工智能研究院 Method and device for sequence labeling and electronic equipment
CN112784532A (en) * 2021-01-29 2021-05-11 电子科技大学 Multi-head attention memory network for short text sentiment classification
CN112883726A (en) * 2021-01-21 2021-06-01 昆明理工大学 Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning
US20210279576A1 (en) * 2020-03-03 2021-09-09 Google Llc Attention neural networks with talking heads attention

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190355270A1 (en) * 2018-05-18 2019-11-21 Salesforce.Com, Inc. Multitask Learning As Question Answering
CN110134757A (en) * 2019-04-19 2019-08-16 杭州电子科技大学 A kind of event argument roles abstracting method based on bull attention mechanism
CN110489750A (en) * 2019-08-12 2019-11-22 昆明理工大学 Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium
US20210279576A1 (en) * 2020-03-03 2021-09-09 Google Llc Attention neural networks with talking heads attention
CN111581474A (en) * 2020-04-02 2020-08-25 昆明理工大学 Evaluation object extraction method of case-related microblog comments based on multi-head attention system
CN111581396A (en) * 2020-05-06 2020-08-25 西安交通大学 Event graph construction system and method based on multi-dimensional feature fusion and dependency syntax
CN112270193A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Chinese named entity identification method based on BERT-FLAT
CN112487796A (en) * 2020-11-27 2021-03-12 北京智源人工智能研究院 Method and device for sequence labeling and electronic equipment
CN112883726A (en) * 2021-01-21 2021-06-01 昆明理工大学 Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning
CN112784532A (en) * 2021-01-29 2021-05-11 电子科技大学 Multi-head attention memory network for short text sentiment classification

Non-Patent Citations (13)

* Cited by examiner, † Cited by third party
Title
JIE HAO 等: "Multi-Granularity Self-Attention for Neural Machine Translation", 《ARXIV》 *
LINHAO ZHANG 等: "Using Bidirectional Transformer-CRF for Spoken Language Understanding", 《 NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING》 *
LOPEZ-GAZPIO 等: "Word n-gram attention models for sentence similarity and inference", 《EXPERT SYSTEMS WITH APPLICATIONS》 *
SHIZHE DIAO 等: "ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations", 《EMNLP》 *
WEIZHEN QI 等: "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training", 《ARXIV》 *
丘心颖 et al.: "Research on an automatic recognition method for Indonesian compound nouns fusing a Self-Attention mechanism and n-gram convolution kernels", Journal of Hunan University of Technology *
唐国强 et al.: "Named entity recognition in clinical electronic medical records incorporating a language model and an attention mechanism", Computer Science *
姚锦玮: "Research and design of speech recognition and speech synthesis for human-computer interaction on an intelligent pumping station platform", China Master's Theses Full-text Database, Engineering Science and Technology II *
李洺宇: "Research and implementation of automatic annotation *** for quasi-written Korean speech corpora", China Master's Theses Full-text Database, Philosophy and Humanities *
杨敬闻: "Research on Chinese named entity recognition based on XLNet and fused character-word encoding", China Master's Theses Full-text Database, Information Science and Technology *
王旭强 et al.: "A feature-fusion sequence labeling model based on an attention mechanism", Journal of Shandong University of Science and Technology (Natural Science Edition) *
邓钰 et al.: "Multi-head attention memory network for short text sentiment classification", Journal of Computer Applications *
陶广奉 et al.: "A Thai neural network word segmentation method fusing contextual character information", Computer Engineering & Science *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116977436A (en) * 2023-09-21 2023-10-31 小语智能信息科技(云南)有限公司 Burmese text image recognition method and device based on Burmese character cluster characteristics
CN116977436B (en) * 2023-09-21 2023-12-05 小语智能信息科技(云南)有限公司 Burmese text image recognition method and device based on Burmese character cluster characteristics

Also Published As

Publication number Publication date
CN113901210B (en) 2022-12-13

Similar Documents

Publication Publication Date Title
Sun et al. Token-level ensemble distillation for grapheme-to-phoneme conversion
CN110046350B (en) Grammar error recognition method, device, computer equipment and storage medium
US7966173B2 (en) System and method for diacritization of text
CN111694924A (en) Event extraction method and system
CN111339750B (en) Spoken language text processing method for removing stop words and predicting sentence boundaries
Simonnet et al. ASR error management for improving spoken language understanding
CN111563375B (en) Text generation method and device
CN116127952A (en) Multi-granularity Chinese text error correction method and device
CN116416480B (en) Visual classification method and device based on multi-template prompt learning
Wooters et al. Multiple-pronunciation lexical modeling in a speaker independent speech understanding system.
CN116306600B (en) MacBert-based Chinese text error correction method
CN114153971A (en) Error-containing Chinese text error correction, identification and classification equipment
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory
CN115510863A (en) Question matching task oriented data enhancement method
Romero et al. Modern vs diplomatic transcripts for historical handwritten text recognition
CN113901210B (en) Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism
CN114937465A (en) Speech emotion recognition method based on self-supervision learning and computer equipment
CN110942767A (en) Recognition labeling and optimization method and device for ASR language model
CN112883726B (en) Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning
CN114970537B (en) Cross-border ethnic cultural entity relation extraction method and device based on multi-layer labeling strategy
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
CN115759102A (en) Chinese poetry wine culture named entity recognition method
CN112634878B (en) Speech recognition post-processing method and system and related equipment
CN115223549A (en) Vietnamese speech recognition corpus construction method
Wana et al. A multi-view approach for Mandarin non-native mispronunciation verification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant