CN113901210A - Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism
- Publication number
- CN113901210A (application CN202111078804.XA)
- Authority
- CN
- China
- Prior art keywords
- word
- syllable
- thai
- speech
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a method for part-of-speech tagging of Thai and Burmese that fuses word-syllable pairs using a local multi-head attention mechanism, and belongs to the field of natural language processing. The method comprises the following steps: preprocessing a Thai or Burmese text data set; selecting word-syllable pair features as model input using a sliding window; learning context features from the word-syllable pair sequence with a local multi-head attention mechanism; and finally modeling the dependencies between part-of-speech tags with a conditional random field and predicting the tags. Experimental results on Thai and Burmese part-of-speech tagging data sets show that, compared with the current best models, fusing syllables as morphological features of words helps the model learn context features of unknown words and reduces the impact of mis-tagged unknown words on model performance. In addition, the local multi-head self-attention mechanism lets the model capture richer local dependency features, which yields better results on the part-of-speech tagging task.
Description
Technical Field
The invention relates to a method for part-of-speech tagging of Thai and Burmese that fuses word-syllable pairs using a local multi-head attention mechanism, and belongs to the technical field of natural language processing.
Background
Part-of-speech tagging, which determines the part of speech of each word in a given sentence, is one of the basic tasks in natural language processing (NLP). It can improve the accuracy of syntactic analysis and thereby benefits many other NLP tasks.
Early part-of-speech tagging methods were mainly rule-based or based on statistical machine learning. Rule-based methods suffer from incomplete rule sets and rule conflicts. Methods based on statistical machine learning mainly include the support vector machine (SVM), the hidden Markov model (HMM), and conditional random fields (CRFs). Such methods usually require manually constructed feature extraction functions, which is laborious and yields insufficient features.
Neural network methods avoid manual feature construction through word embeddings or embedding layers, and also improve part-of-speech tagging performance. Mainstream part-of-speech tagging models usually use BiLSTM (Bidirectional Long Short-Term Memory) and pre-trained language models to obtain context features. BiLSTM-based methods reduce the model's dependence on feature engineering and pass information between words along the time dimension, but from the perspective of part-of-speech tagging, an overly long sequence contributes little to predicting a word's part of speech. Pre-trained language models such as BERT can learn high-quality character, syllable, and word embeddings that capture latent syntactic and semantic similarity between words, but pre-training increases the number of model parameters, which seriously slows the prediction speed of the resulting part-of-speech tagging model. In addition, these methods ignore the word-formation characteristics and internal structure of Thai and Burmese words, so morphological information is lost and tagging performance suffers.
Syllables are the basic constituent units of words and are associated with a word's part of speech. For example, BiLSTM-CRF has been used to predict part-of-speech tags by learning shared features from the word and syllable embeddings of Burmese text. However, syllables contribute little when long-distance context features are modeled. Integrating the morphological features of unknown words into a part-of-speech tagging model based on a local multi-head attention mechanism helps the model learn context features of unknown words and reduces the impact of mis-tagged unknown words on model performance.
Disclosure of Invention
The invention provides a method for part-of-speech tagging of Thai and Burmese that fuses word-syllable pairs using a local multi-head attention mechanism. It targets part-of-speech tagging for Southeast Asian languages with phonemic-syllabic scripts, such as Thai and Burmese, and addresses the poor tagging performance on low-frequency parts of speech and unknown words.
The technical scheme of the invention is as follows: the method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with local multi-head attention comprises the following specific steps:
Step1, preprocess the Thai LST20 data set or the Burmese ALT data set. For example, given a Thai sentence of m words, each word in the sentence is segmented into syllables to uncover potential affix information, so that the word sequence is expanded into a word-syllable pair sequence.
Step2, from the data preprocessed in Step1, obtain inputs containing n word-syllable pairs in turn using a sliding window, encode the word-syllable pairs with a local multi-head attention mechanism, and then concatenate the output features of the Transformer encoder with the syllable embeddings to obtain the shared features used to predict the input n-gram.
Step3, finally, model the dependencies between part-of-speech tags with a conditional random field and predict the part-of-speech tags.
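The sliding-window input of Step2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the source does not say how windows are aligned or how sentence boundaries are handled, so the centred window, the odd window size, and the `PAD_PAIR` boundary padding are all assumptions, and the names `Pair`, `PAD_PAIR`, and `sliding_windows` are hypothetical.

```python
from typing import List, Sequence, Tuple

# (word, prefix syllable, suffix syllable); this layout is an assumption.
Pair = Tuple[str, str, str]
PAD_PAIR: Pair = ("<PAD>", "<PAD>", "<PAD>")


def sliding_windows(pairs: Sequence[Pair], n: int) -> List[List[Pair]]:
    """Return one window of n word-syllable pairs centred on each position.

    Assumption: n is odd and sentence boundaries are filled with PAD_PAIR.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number")
    half = n // 2
    padded = [PAD_PAIR] * half + list(pairs) + [PAD_PAIR] * half
    return [padded[i:i + n] for i in range(len(pairs))]


# Usage with placeholder tokens (not real Thai or Burmese data).
sentence = [("w1", "s1a", "s1b"), ("w2", "s2a", "<PAD>"), ("w3", "s3a", "s3b")]
windows = sliding_windows(sentence, n=3)
```

Each window is then encoded independently, which is what keeps the attention "local" in this sketch.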
The specific steps of Step1 are as follows:
Step1.1, build a word alphabet and a part-of-speech tag alphabet from the training set, using the entries separated by '\n' in the Thai text;
Step1.2, call a state-of-the-art Thai or Burmese syllable segmenter to segment the words in the text into syllables and build a syllable alphabet;
Step1.3, then assign to each word the syllables it contains. Only the prefix and suffix syllables of each word are kept as input. If a word consists of a single syllable, a "<PAD>" token is added to complete the input syllable vector.
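Step1.3 can be sketched as follows. The source does not say which slot receives the "<PAD>" token for a single-syllable word, so this sketch assumes the syllable fills the prefix slot and the suffix slot is padded. The function names and the hyphen-splitting `toy_segment` are hypothetical stand-ins for a real Thai or Burmese syllable segmenter.

```python
from typing import Callable, List, Tuple

PAD = "<PAD>"


def word_syllable_pair(word: str, syllables: List[str]) -> Tuple[str, str, str]:
    """Keep only the first (prefix) and last (suffix) syllable of a word."""
    if not syllables:
        raise ValueError("a word must contain at least one syllable")
    prefix = syllables[0]
    # Assumption: a single-syllable word pads the suffix slot.
    suffix = syllables[-1] if len(syllables) > 1 else PAD
    return (word, prefix, suffix)


def expand_sentence(
    words: List[str], segment: Callable[[str], List[str]]
) -> List[Tuple[str, str, str]]:
    """Expand a word sequence into a word-syllable pair sequence."""
    return [word_syllable_pair(w, segment(w)) for w in words]


def toy_segment(word: str) -> List[str]:
    """Hypothetical segmenter: syllables are pre-marked with hyphens."""
    return word.split("-")


pairs = expand_sentence(["kan-rian-ru", "dek"], toy_segment)
```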
In a preferred embodiment of the present invention, Step2 is:
Step2.1, the encoding layer takes the word embeddings of the n-gram and their corresponding syllable embeddings as input to the encoder; the input n-gram matrix can be expressed as:
Step2.2, the multi-head attention layer of the encoder maps a query and a set of key-value pairs to an output. Given a sequence of vectors $X$, single-head attention projects $X$ into three different matrices: the query matrix $Q$, the key matrix $K$, and the value matrix $V$. The attention weights are obtained by computing the dot-product attention of each word in the sentence, and the final score is the weighted sum of the values;
$$Q, K, V = XW^{Q}, XW^{K}, XW^{V}, \tag{2}$$
$$\mathrm{Att}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \tag{3}$$
where the matrices $W^{Q}$, $W^{K}$, $W^{V}$ are learnable parameters and $d_k$ is the dimension of the output vectors of the model's embedding layer. The scaling factor adjusts the inner product of $Q$ and $K^{T}$ so that overly large inner products do not produce an uneven distribution after softmax; softmax normalizes the scaled values.
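The single-head attention of Step2.2 can be sketched in plain Python as below. It assumes the standard scaled dot-product form, softmax(QK^T / sqrt(d_k)) V, applied to one window of vectors. The helper names are hypothetical, and a practical implementation would use a tensor library rather than nested lists.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (r x m) and b (m x c)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Matrix:
    """Single-head scaled dot-product attention over one window x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    # Scale by sqrt(d_k) before softmax so large dot products do not saturate it.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


# Usage: a window of 3 positions with 2-dimensional features, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
window = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = attention(window, identity, identity, identity)
```

Multi-head attention repeats this computation with separate projection matrices per head and concatenates the results.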
Step2.3, the multi-head attention layer MultiHead is formed by concatenating several attention heads;
$$\mathrm{MultiHead}(Q,K,V)=[\mathrm{Att}_1,\mathrm{Att}_2,\ldots,\mathrm{Att}_h], \tag{4}$$
Step2.4, the feed-forward network layer consists of two linear layers connected in series; they have independent weights and biases and different dimensions, and can further extract semantic information;
$$Z=\text{layer-norm}(X+\mathrm{MultiHead}(X)), \tag{5}$$
$$\mathrm{FFN}(Z)=W_2\,\mathrm{ReLU}(W_1 Z+b_1)+b_2, \tag{6}$$
where layer-norm(·) denotes the normalization layer, FFN denotes the feed-forward network layer, $W_1$, $b_1$, $W_2$, $b_2$ are the projection parameters, and $Z$ is the output of the normalization layer.
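Equations (5) and (6) can be sketched for a single position vector as follows. The sketch assumes the two linear layers are applied in series, W2 · ReLU(W1 z + b1) + b2. It also omits the learnable gain and bias that layer normalization usually carries, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def layer_norm(v: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def linear(w: Matrix, b: Vector, z: Vector) -> Vector:
    """Compute W z + b for one position vector z."""
    return [sum(wi * zi for wi, zi in zip(row, z)) + bi for row, bi in zip(w, b)]


def ffn(z: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Two linear layers in series with a ReLU in between, as in eq. (6)."""
    hidden = [max(0.0, h) for h in linear(w1, b1, z)]
    return linear(w2, b2, hidden)


def add_and_norm(x: Vector, attn_out: Vector) -> Vector:
    """Residual connection followed by layer normalization, as in eq. (5)."""
    return layer_norm([a + b for a, b in zip(x, attn_out)])


# Usage: the hidden layer (2 units) and the output layer (1 unit) differ in size.
z = add_and_norm([1.0, 2.0], [0.5, -0.5])
y = ffn(z, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [[1.0, 1.0]], [0.5])
```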
Step2.5, a normalization layer after the feed-forward network layer yields the output $o_i$ of the encoder block. The output features of the Transformer encoder are then concatenated with the syllable embeddings to obtain the shared features used to predict the input n-gram, and finally a vector is obtained through a multilayer perceptron (MLP).
In a preferred embodiment of the present invention, Step3 is:
Step3.1, in the decoding process, the Viterbi algorithm is used to obtain the highest-scoring tag sequence. Formally, let $x=(x_1,x_2,\ldots,x_n)$ denote an input sequence and $y=(y_1,y_2,\ldots,y_n)$ the tags output by the model; the score is defined as:
where $A$ is the transition score matrix, with $A_{i,j}$ the score of a transition from the $i$-th state to the $j$-th state between consecutive time steps; $y_0$ and $y_n$ are the tags at the beginning and end of a sentence, which the model adds to the set of possible tags; $P$ is the score matrix output by the Transformer encoder; and $\theta$ denotes the parameters of the neural network.
Step3.2, a softmax over all possible tag sequences yields the probability of the sequence $y$;
where $Y_x$ denotes the set of possible tag sequences for $x$.
Step3.3, during training, maximum likelihood estimation is used to maximize the log-probability of the correct tag sequence;
where $Y_x$ denotes the set of possible tag sequences for sentence $x$. As the formula above shows, training continually pushes the model toward producing valid output tag sequences.
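The score, the sequence-level softmax, and the training objective of Step3.1 to Step3.3 can be sketched together as below. The score formula did not survive in this text, so the sketch assumes the standard linear-chain CRF score: emission scores from `P` plus transition scores from `A`, with explicit `start` and `end` score vectors standing in for the sentence-boundary tags. The log-partition term is computed with the forward algorithm instead of enumerating every sequence. All names are hypothetical.

```python
import math
from typing import List, Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(P: List[List[float]], A: List[List[float]],
               start: List[float], end: List[float], y: Sequence[int]) -> float:
    """Assumed score: boundary + emission + transition scores along the path y."""
    s = start[y[0]] + P[0][y[0]]
    for i in range(1, len(y)):
        s += A[y[i - 1]][y[i]] + P[i][y[i]]
    return s + end[y[-1]]


def log_partition(P: List[List[float]], A: List[List[float]],
                  start: List[float], end: List[float]) -> float:
    """Forward algorithm: log of the summed exp-scores of all tag sequences."""
    k = len(start)
    alpha = [start[j] + P[0][j] for j in range(k)]
    for i in range(1, len(P)):
        alpha = [logsumexp([alpha[a] + A[a][j] for a in range(k)]) + P[i][j]
                 for j in range(k)]
    return logsumexp([alpha[j] + end[j] for j in range(k)])


def log_likelihood(P: List[List[float]], A: List[List[float]],
                   start: List[float], end: List[float], y: Sequence[int]) -> float:
    """log p(y | x): the quantity maximized during training."""
    return path_score(P, A, start, end, y) - log_partition(P, A, start, end)
```

Training would minimize the negative of `log_likelihood` over the gold tag sequences.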
Step3.4, at decoding time, the output sequence of the model is predicted by finding the sequence $y^{*}$ with the maximum score.
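Viterbi decoding for Step3.4 can be sketched as follows, under the same assumed linear-chain score (emissions `P`, transitions `A`, and hypothetical `start` and `end` boundary scores). The function name and the toy numbers are illustrative only.

```python
from typing import List, Tuple


def viterbi(P: List[List[float]], A: List[List[float]],
            start: List[float], end: List[float]) -> Tuple[List[int], float]:
    """Return the highest-scoring tag sequence and its score."""
    k = len(start)
    score = [start[j] + P[0][j] for j in range(k)]
    back: List[List[int]] = []
    for i in range(1, len(P)):
        prev = score
        pointers, score = [], []
        for j in range(k):
            # Best previous tag for reaching tag j at position i.
            best = max(range(k), key=lambda a: prev[a] + A[a][j])
            pointers.append(best)
            score.append(prev[best] + A[best][j] + P[i][j])
        back.append(pointers)
    last = max(range(k), key=lambda j: score[j] + end[j])
    best_score = score[last] + end[last]
    # Follow the back-pointers from the last position to the first.
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best_score


# Usage with two tags and three positions (toy scores).
emissions = [[2.0, 0.0], [0.0, 1.0], [1.5, 0.0]]
transitions = [[0.5, -1.0], [-1.0, 0.5]]
best_path, best = viterbi(emissions, transitions, [0.0, 0.0], [0.0, 0.0])
# best_path == [0, 0, 0], best == 4.5
```

In this toy case the transition scores outweigh the emission preference for tag 1 at the middle position, so the decoder keeps tag 0 throughout.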
The invention has the beneficial effects that:
(1) the invention provides a Thai and Burmese part-of-speech tagging model that uses a local multi-head attention mechanism and a conditional random field; the model structure is simpler and more effective than existing BiLSTM and pre-trained models;
(2) the invention provides a method for learning context features from a local word-syllable pair sequence, which effectively alleviates the ambiguity problem and improves part-of-speech prediction for low-frequency words and unknown words;
(3) experimental results on Thai and Burmese part-of-speech tagging data sets show that the proposed model tags better than the current best models while having fewer parameters.
Drawings
FIG. 1 is a schematic diagram of a specific structure of a recognition model according to the present invention;
FIG. 2 is a diagram of word-syllable pair sequences in the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-2, a method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism includes the following steps:
Step1, preprocess the Thai LST20 data set or the Burmese ALT data set. For example, given a Thai sentence of m words, each word in the sentence is segmented into syllables to uncover potential affix information, so that the word sequence is expanded into a word-syllable pair sequence.
Step1.1, build a word alphabet and a part-of-speech tag alphabet from the training set, using the entries separated by '\n' in the Thai text;
Step1.2, call a state-of-the-art Thai or Burmese syllable segmenter to segment the words in the text into syllables and build a syllable alphabet;
Step1.3, then assign to each word the syllables it contains, as shown in FIG. 1. Only the prefix and suffix syllables of each word are kept as input. If a word consists of a single syllable, a "<PAD>" token is added to complete the input syllable vector.
Step2, from the data preprocessed in Step1, obtain inputs containing n word-syllable pairs in turn using a sliding window, encode the word-syllable pairs with a local multi-head attention mechanism, and then concatenate the output features of the Transformer encoder with the syllable embeddings to obtain the shared features used to predict the input n-gram.
Step2.1, the encoding layer takes the word embeddings of the n-gram and their corresponding syllable embeddings as input to the encoder; the input n-gram matrix can be expressed as:
Step2.2, the multi-head attention layer of the encoder maps a query and a set of key-value pairs to an output. Given a sequence of vectors $X$, single-head attention projects $X$ into three different matrices: the query matrix $Q$, the key matrix $K$, and the value matrix $V$. The attention weights are obtained by computing the dot-product attention of each word in the sentence, and the final score is the weighted sum of the values;
$$Q, K, V = XW^{Q}, XW^{K}, XW^{V}, \tag{2}$$
$$\mathrm{Att}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \tag{3}$$
where the matrices $W^{Q}$, $W^{K}$, $W^{V}$ are learnable parameters and $d_k$ is the dimension of the output vectors of the model's embedding layer. The scaling factor adjusts the inner product of $Q$ and $K^{T}$ so that overly large inner products do not produce an uneven distribution after softmax; softmax normalizes the scaled values.
Step2.3, the multi-head attention layer MultiHead is formed by concatenating several attention heads;
$$\mathrm{MultiHead}(Q,K,V)=[\mathrm{Att}_1,\mathrm{Att}_2,\ldots,\mathrm{Att}_h], \tag{4}$$
Step2.4, the feed-forward network layer consists of two linear layers connected in series; they have independent weights and biases and different dimensions, and can further extract semantic information;
$$Z=\text{layer-norm}(X+\mathrm{MultiHead}(X)), \tag{5}$$
$$\mathrm{FFN}(Z)=W_2\,\mathrm{ReLU}(W_1 Z+b_1)+b_2, \tag{6}$$
where layer-norm(·) denotes the normalization layer, FFN denotes the feed-forward network layer, $W_1$, $b_1$, $W_2$, $b_2$ are the projection parameters, and $Z$ is the output of the normalization layer.
Step2.5, a normalization layer after the feed-forward network layer yields the output $o_i$ of the encoder block. The output features of the Transformer encoder are then concatenated with the syllable embeddings to obtain the shared features used to predict the input n-gram, and finally a vector is obtained through a multilayer perceptron (MLP).
Step3.1, in the decoding process, the Viterbi algorithm is used to obtain the highest-scoring tag sequence. Formally, let $x=(x_1,x_2,\ldots,x_n)$ denote an input sequence and $y=(y_1,y_2,\ldots,y_n)$ the tags output by the model; the score is defined as:
where $A$ is the transition score matrix, with $A_{i,j}$ the score of a transition from the $i$-th state to the $j$-th state between consecutive time steps; $y_0$ and $y_n$ are the tags at the beginning and end of a sentence, which the model adds to the set of possible tags; $P$ is the score matrix output by the Transformer encoder; and $\theta$ denotes the parameters of the neural network.
Step3.2, a softmax over all possible tag sequences yields the probability of the sequence $y$;
where $Y_x$ denotes the set of possible tag sequences for $x$.
Step3.3, during training, maximum likelihood estimation is used to maximize the log-probability of the correct tag sequence;
where $Y_x$ denotes the set of possible tag sequences for sentence $x$. As the formula above shows, training continually pushes the model toward producing valid output tag sequences.
Step3.4, at decoding time, the output sequence of the model is predicted by finding the sequence $y^{*}$ with the maximum score.
To illustrate the effect of the invention, three groups of comparative experiments were set up. The first group verifies the effectiveness of the method on Thai part-of-speech tagging, the second on Burmese part-of-speech tagging, and the third on predicting low-frequency parts of speech during part-of-speech tagging.
(1) Effectiveness of Thai part-of-speech tagging
The invention uses a MultiHead-Att + Syllable + CRF model structure, which enriches semantic information by fusing affix syllable features on top of word vectors. To verify the effectiveness of the MultiHead-Att + Syllable + CRF model on Thai part-of-speech tagging, it was compared with several related models on the Thai LST20 corpus. The experimental results are shown in Table 1.
TABLE 1 Experimental results for LST20 data
Table 1 shows that both the Micro F1 and Macro F1 values of the model exceed those of all mainstream models on the LST20 data set, demonstrating its effectiveness for Thai part-of-speech tagging. The CRF-based method has the lowest F1, which shows that neural network models capture deeper information from the input than traditional machine-learning methods. The model's F1 is also higher than that of the other three neural pre-trained language models, probably because those models lack morphological and internal-structure information, which biases their part-of-speech predictions. The model of the invention incorporates the word-formation characteristics of Thai and exploits local word-syllable pair sequence features, which improves part-of-speech recognition for each word. Table 1 also shows a large gap between the Micro F1 and Macro F1 values. Micro F1 is dominated by the common part-of-speech categories, which the model learns well; Macro F1 weights every category equally, so it is pulled down by the rare categories and is much lower.
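The gap between Micro F1 and Macro F1 can be illustrated with a short sketch. The toy gold and predicted tags below are invented for illustration and are not results from the patent. The function name is hypothetical, and averaging Macro F1 over every tag that appears in either the gold or the predicted sequence is one possible convention.

```python
from collections import Counter
from typing import Sequence, Tuple


def micro_macro_f1(gold: Sequence[str], pred: Sequence[str]) -> Tuple[float, float]:
    """Compute (micro F1, macro F1) for single-label tagging."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p was wrong here
            fn[g] += 1  # gold tag g was missed here

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    labels = set(gold) | set(pred)
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Toy data: a frequent tag predicted well, two rare tags always missed.
gold = ["NN"] * 8 + ["CL", "IJ"]
pred = ["NN"] * 10
micro, macro = micro_macro_f1(gold, pred)
```

Here the frequent tag keeps Micro F1 high, while the two missed rare tags each contribute a zero to the Macro F1 average.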
(2) Effectiveness of Burmese part-of-speech tagging
To verify the effectiveness of the MultiHead-Att + Syllable + CRF model on Burmese part-of-speech tagging, comparison experiments were carried out against the related models below on the Burmese corpus. The experimental results are shown in Table 2.
TABLE 2 Experimental results for the Burmese ALT dataset
Table 2 compares the method with models such as CNN, Seq2Seq, and BiLSTM-CRF. The results show that the model of the invention achieves the best performance on the Burmese POS tagging task. Its local attention mechanism handles the interaction between input features more efficiently, so that word vector features and syllable vector features are better fused and used, which also confirms the method's effectiveness for Burmese part-of-speech tagging. The CNN-based model performs worst, followed by the Seq2Seq model, probably because the CNN-based method depends on feature selection and convolution, while Seq2Seq is an encoder-decoder method in which encoding errors directly cause decoding errors. Compared with the BiLSTM-CRF model and the BERT + BiLSTM-CRF + Fine-tune model, the F1 value of the model improves by nearly 4% and 1% respectively, which suggests that the features produced by the Transformer are more effective than those produced by BiLSTM. A possible reason is that the Transformer encoder can model information for each word-syllable pair directly, whereas BiLSTM must model step by step from both ends of the input sentence toward the middle, so information in long sentences may be lost. In addition, BERT + BiLSTM-CRF + Fine-tune is a joint model for word segmentation and part-of-speech tagging; it is complex and also requires Burmese character, syllable, and word vectors pre-trained with BERT.
(3) Effectiveness of low-frequency part-of-speech prediction
This experiment verifies the effectiveness of the MultiHead-Att + Syllable + CRF model at predicting low-frequency parts of speech during part-of-speech tagging. On the Thai LST20 corpus, part-of-speech tags that occur fewer than 4000 times in the test set are treated as low-frequency parts of speech, and the model is compared with WangchanBERTa, currently the most advanced Thai part-of-speech tagging model. The experimental results are shown in Table 3.
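Selecting the low-frequency tags by the threshold described above can be sketched as follows. Only the threshold of 4000 occurrences comes from the text; the function name and the flat list of test-set tags are assumptions.

```python
from collections import Counter
from typing import Iterable, Set


def low_frequency_tags(test_tags: Iterable[str], threshold: int = 4000) -> Set[str]:
    """Return the tags that occur fewer than `threshold` times in the test set."""
    counts = Counter(test_tags)
    return {tag for tag, count in counts.items() if count < threshold}


# Usage with a tiny toy list and a lowered threshold.
rare = low_frequency_tags(["NN"] * 5 + ["CL"] * 2, threshold=3)
```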
TABLE 3 Low frequency parts of speech comparison test results
As Table 3 shows, the model of the invention outperforms the most advanced model at predicting low-frequency parts of speech. The improvements for the low-frequency tags CL and IJ are the most pronounced, which indicates that local word-syllable pair information supplements word features and that the local multi-head attention mechanism can establish dependencies among words and let them exchange information, thereby helping the model predict low-frequency parts of speech. Performance on XX is slightly lower, possibly because XX denotes an unknown part of speech: the model supplements words tagged XX with features derived from their word formation, so it misjudges the XX tag as the part of speech of words that share the same syllables.
The experimental data above show that the part-of-speech tagging model performs better when it uses syllable information, because syllable-level information captures affixes, and many affixes in Thai and Burmese convey part-of-speech information, which makes them a factor that part-of-speech tagging cannot ignore. At the same time, the local multi-head attention mechanism improves the parallelism of the model and effectively captures the context features around each word. The experiments show that the proposed method achieves the best results among the baseline models compared. For part-of-speech tagging of Southeast Asian languages with phonemic-syllabic scripts, such as Thai and Burmese, the invention provides a method that captures word-syllable sequence information with a local multi-head attention mechanism to improve tagging performance.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (4)
1. A method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with local multi-head attention, characterized in that the method comprises the following specific steps:
Step1, preprocess the Thai text data set or the Burmese text data set, and segment each word in a sentence into syllables to uncover potential affix information in the word, so that the word sequence is expanded into a word-syllable pair sequence;
Step2, from the data preprocessed in Step1, obtain inputs containing n word-syllable pairs in turn using a sliding window, encode the word-syllable pairs with a local multi-head attention mechanism, and then concatenate the output features of the Transformer encoder with the syllable embeddings to obtain the shared features used to predict the input n-gram;
Step3, finally, model the dependencies between part-of-speech tags with a conditional random field and predict the part-of-speech tags.
2. The method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism according to claim 1, wherein the specific steps of Step1 are as follows:
Step1.1, build a word alphabet and a part-of-speech tag alphabet from the training set, using the entries separated by '\n' in the Thai text;
Step1.2, call a Thai or Burmese syllable segmenter to segment the words in the text into syllables and build a syllable alphabet;
Step1.3, then assign to each word the syllables it contains; only the prefix and suffix syllables of each word are kept as input; if a word consists of a single syllable, a "<PAD>" token is added to complete the input syllable vector.
3. The method for verbalization annotation of Thai and Myanmar characters using local multi-head attention mechanism to fuse word-syllable pairs according to claim 1, wherein: the specific steps of Step2 are as follows:
step2.1, the coding layer embeds words of n-gramsSyllable embedding corresponding to itAs input to the encoder, the input n-gram matrix is represented as:
step2.2, the multi-head attention layer of the encoder maps the query and a set of key-value pairs to the output; given a sequence of vectorsSingle-headed attention projects X onto three different matrices: q matrix isK matrix isV matrix isThe attention weight is obtained by calculating the dot product attention of each word in the sentence, and the final score is the weighted sum of the values;
Q,K,V=XWQ,XWK,XWV (2)
wherein the matrixIs a learnable parameter, dkThe size of the dimension of the output vector for the embedding layer of the model, this factor being to adjust Q and KTThe size of the inner product is used for preventing the overlarge inner product from being unevenly distributed after passing through softmax; softmax normalizes the scale values;
step2.3, the multi-head attention layer MultiHead is formed by concatenating several attention heads;
$\mathrm{MultiHead}(Q, K, V) = [\mathrm{Att}_1, \mathrm{Att}_2, \ldots, \mathrm{Att}_h]$, (4)
step2.4, the feedforward neural network layer consists of two linear layers connected in series; the linear layers have independent weights and biases and different dimensions, so that semantic information can be further extracted;
$Z = \text{layer-norm}(X + \mathrm{MultiHead}(X))$, (5)
$\mathrm{FFN}(Z) = \mathrm{ReLU}(W_1 Z + b_1) W_2 Z + b_2$, (6)
wherein layer-norm(·) denotes the normalization layer, FFN denotes the feed-forward network layer, $W_1$, $b_1$, $W_2$, $b_2$ are projection parameters, and $Z$ denotes the output of the normalization layer;
step2.5, a normalization layer after the feedforward neural network layer yields the output $o_i$ of the encoder block; the output features of the Transformer encoder are then concatenated with the syllable embedding to obtain the shared features used for predicting the input n-gram, and finally a vector is obtained through a multi-layer perceptron (MLP).
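The encoder steps of Step2 can be illustrated with a small pure-Python sketch. It is a sketch under stated assumptions rather than the patented model: all function names are hypothetical, the attention scale uses the key dimension of each head, the feed-forward layer is written in the common two-layer form (a linear layer, ReLU, then a second linear layer), the residual connection around the feed-forward layer is assumed, and layer normalization has no learned gain or bias. The final MLP is omitted.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention (step2.2)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])  # assumption: scale by the key dimension of this head
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)

def multi_head(x, heads):
    """Concatenate the per-head outputs along the feature axis (step2.3)."""
    outs = [attention(x, wq, wk, wv) for wq, wk, wv in heads]
    return [[v for out in outs for v in out[i]] for i in range(len(x))]

def add(a, b):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(rows, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def ffn(z, w1, b1, w2, b2):
    """Two linear layers in series with ReLU in between (assumed standard form)."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(z, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]

def encoder_block(x, heads, w1, b1, w2, b2):
    """Attention sub-layer, then feed-forward sub-layer, each followed by layer norm."""
    z = layer_norm(add(x, multi_head(x, heads)))       # Z = layer-norm(X + MultiHead(X))
    return layer_norm(add(z, ffn(z, w1, b1, w2, b2)))  # residual here is an assumption

def fuse(encoded, syllable_emb):
    """Concatenate each encoder output row with its syllable embedding row (step2.5)."""
    return [o + s for o, s in zip(encoded, syllable_emb)]
```

For the residual sum in `encoder_block` to be well defined, the concatenated head outputs must have the same width as the rows of `x`; with `h` heads this means each head projects to a width of `len(x[0]) / h`.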
4. The method for part-of-speech tagging of Thai and Myanmar text using a local multi-head attention mechanism to fuse word-syllable pairs according to claim 1, wherein the specific steps of Step3 are as follows:
step3.1, in the decoding process, the highest-scoring tag sequence is obtained using the Viterbi algorithm; formally, $x = (x_1, x_2, \ldots, x_n)$ denotes the input sequence and $y = (y_1, y_2, \ldots, y_n)$ denotes the tags output by the model, and a score is defined accordingly;
wherein $A$ is the transition score matrix, $A_{i,j}$ denotes the score of transitioning from the $i$-th state to the $j$-th state in a pair of consecutive time steps, $y_0$ and $y_n$ are the labels at the beginning and the end of the sentence, $P$ is the score matrix output by the Transformer encoder, and $\theta$ is the set of neural network parameters;
step3.2, a softmax operation over all possible tag sequences yields a probability for the sequence $y$;
wherein $Y_x$ denotes the set of possible tag sequences for $x$;
step3.3, during training, maximum likelihood estimation is used to maximize the log-probability of the correct tag sequence;
wherein $Y_x$ denotes the set of possible tag sequences for a sentence $x$;
step3.4, at decoding time, the output sequence of the model is predicted by finding the sequence $y^*$ with the maximum score;
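The scoring, normalization, and decoding in Step3 follow the usual linear-chain CRF pattern, which can be sketched as below. The exact score formula is not reproduced in this text, so the sketch assumes the common form (sum of transition scores plus sum of emission scores). The function names and the layout of `A`, with two extra indices for the sentence-start and sentence-end labels, are assumptions made for illustration.

```python
import math

def logsumexp(values):
    """Stable log(sum(exp(v))) over a list of scores."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def sequence_score(P, A, y):
    """Score of tag sequence y: transition scores plus emission scores (step3.1).

    P[i][t] is the encoder score of tag t at position i.
    A is (k+2) x (k+2); index k is the start label, k+1 the end label (assumed layout).
    """
    k = len(P[0])
    path = [k] + list(y) + [k + 1]
    trans = sum(A[a][b] for a, b in zip(path, path[1:]))
    emit = sum(P[i][t] for i, t in enumerate(y))
    return trans + emit

def log_partition(P, A):
    """Log of the summed exp-scores of all tag sequences (forward algorithm, step3.2)."""
    k = len(P[0])
    alpha = [A[k][t] + P[0][t] for t in range(k)]
    for i in range(1, len(P)):
        alpha = [logsumexp([alpha[s] + A[s][t] for s in range(k)]) + P[i][t]
                 for t in range(k)]
    return logsumexp([alpha[t] + A[t][k + 1] for t in range(k)])

def log_likelihood(P, A, y):
    """Log-probability of the gold sequence; training maximizes this (step3.3)."""
    return sequence_score(P, A, y) - log_partition(P, A)

def viterbi(P, A):
    """Return the highest-scoring tag sequence and its score (step3.4)."""
    k = len(P[0])
    score = [A[k][t] + P[0][t] for t in range(k)]
    back = []
    for i in range(1, len(P)):
        step, ptr = [], []
        for t in range(k):
            best = max(range(k), key=lambda s: score[s] + A[s][t])
            ptr.append(best)
            step.append(score[best] + A[best][t] + P[i][t])
        score = step
        back.append(ptr)
    last = max(range(k), key=lambda t: score[t] + A[t][k + 1])
    best_score = score[last] + A[last][k + 1]
    path = [last]
    for ptr in reversed(back):  # follow back-pointers from the last position
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best_score
```

Both `log_partition` and `viterbi` run in time proportional to n·k² for n positions and k tags, which avoids enumerating all kⁿ candidate sequences.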
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111078804.XA CN113901210B (en) | 2021-09-15 | 2021-09-15 | Method for marking verbosity of Thai and Burma characters by using local multi-head attention to mechanism fused word-syllable pair |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113901210A (en) | 2022-01-07 |
CN113901210B (en) | 2022-12-13 |
Family
ID=79028510
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111078804.XA Active CN113901210B (en) | 2021-09-15 | 2021-09-15 | Method for marking verbosity of Thai and Burma characters by using local multi-head attention to mechanism fused word-syllable pair |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113901210B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134757A (en) * | 2019-04-19 | 2019-08-16 | 杭州电子科技大学 | Event argument role extraction method based on multi-head attention mechanism |
US20190355270A1 (en) * | 2018-05-18 | 2019-11-21 | Salesforce.Com, Inc. | Multitask Learning As Question Answering |
CN110489750A (en) * | 2019-08-12 | 2019-11-22 | 昆明理工大学 | Burmese word segmentation and part-of-speech tagging method and device based on bidirectional LSTM-CRF |
CN110534087A (en) * | 2019-09-04 | 2019-12-03 | 清华大学深圳研究生院 | Text prosody hierarchy structure prediction method, device, equipment and storage medium |
CN111581474A (en) * | 2020-04-02 | 2020-08-25 | 昆明理工大学 | Evaluation object extraction method for case-related microblog comments based on multi-head attention mechanism |
CN111581396A (en) * | 2020-05-06 | 2020-08-25 | 西安交通大学 | Event graph construction system and method based on multi-dimensional feature fusion and dependency syntax |
CN112270193A (en) * | 2020-11-02 | 2021-01-26 | 重庆邮电大学 | Chinese named entity identification method based on BERT-FLAT |
CN112487796A (en) * | 2020-11-27 | 2021-03-12 | 北京智源人工智能研究院 | Method and device for sequence labeling and electronic equipment |
CN112784532A (en) * | 2021-01-29 | 2021-05-11 | 电子科技大学 | Multi-head attention memory network for short text sentiment classification |
CN112883726A (en) * | 2021-01-21 | 2021-06-01 | 昆明理工大学 | Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning |
US20210279576A1 (en) * | 2020-03-03 | 2021-09-09 | Google Llc | Attention neural networks with talking heads attention |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116977436A (en) * | 2023-09-21 | 2023-10-31 | 小语智能信息科技(云南)有限公司 | Burmese text image recognition method and device based on Burmese character cluster characteristics |
CN116977436B (en) * | 2023-09-21 | 2023-12-05 | 小语智能信息科技(云南)有限公司 | Burmese text image recognition method and device based on Burmese character cluster characteristics |
Also Published As
Publication number | Publication date |
---|---|
CN113901210B (en) | 2022-12-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sun et al. | Token-level ensemble distillation for grapheme-to-phoneme conversion | |
CN110046350B (en) | Grammar error recognition method, device, computer equipment and storage medium | |
US7966173B2 (en) | System and method for diacritization of text | |
CN111694924A (en) | Event extraction method and system | |
CN111339750B (en) | Spoken language text processing method for removing stop words and predicting sentence boundaries | |
Simonnet et al. | ASR error management for improving spoken language understanding | |
CN111563375B (en) | Text generation method and device | |
CN116127952A (en) | Multi-granularity Chinese text error correction method and device | |
CN116416480B (en) | Visual classification method and device based on multi-template prompt learning | |
Wooters et al. | Multiple-pronunciation lexical modeling in a speaker independent speech understanding system. | |
CN116306600B (en) | MacBert-based Chinese text error correction method | |
CN114153971A (en) | Error-containing Chinese text error correction, identification and classification equipment | |
CN112905736A (en) | Unsupervised text emotion analysis method based on quantum theory | |
CN115510863A (en) | Question matching task oriented data enhancement method | |
Romero et al. | Modern vs diplomatic transcripts for historical handwritten text recognition | |
CN113901210B (en) | Method for marking verbosity of Thai and Burma characters by using local multi-head attention to mechanism fused word-syllable pair | |
CN114937465A (en) | Speech emotion recognition method based on self-supervision learning and computer equipment | |
CN110942767A (en) | Recognition labeling and optimization method and device for ASR language model | |
CN112883726B (en) | Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning | |
CN114970537B (en) | Cross-border ethnic cultural entity relation extraction method and device based on multi-layer labeling strategy | |
CN112989839A (en) | Keyword feature-based intent recognition method and system embedded in language model | |
CN115759102A (en) | Chinese poetry wine culture named entity recognition method | |
CN112634878B (en) | Speech recognition post-processing method and system and related equipment | |
CN115223549A (en) | Vietnamese speech recognition corpus construction method | |
Wana et al. | A multi-view approach for Mandarin non-native mispronunciation verification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||