CN113901210A - Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism - Google Patents

Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism

Info

Publication number
CN113901210A
CN113901210A (application CN202111078804.XA)
Authority
CN
China
Prior art keywords
word
syllable
thai
speech
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111078804.XA
Other languages
Chinese (zh)
Other versions
CN113901210B (en)
Inventor
线岩团
王悦寒
余正涛
相艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202111078804.XA
Publication of CN113901210A
Application granted
Publication of CN113901210B
Active legal status (Current)
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for part-of-speech tagging of Thai and Burmese that fuses word-syllable pairs with a local multi-head attention mechanism, and belongs to the field of natural language processing. The invention comprises the following steps: preprocessing a Thai or Burmese text data set; selecting word-syllable pair features as model input in a windowed manner; learning context features from the word-syllable pair sequence with a local multi-head attention mechanism; and finally modeling part-of-speech dependencies with a conditional random field and predicting part-of-speech tags. Experimental results on Thai and Burmese part-of-speech tagging data sets show that, compared with the current best models, fusing syllables as morphological features of words helps the model learn the context features of unknown words and mitigates the impact of mislabeled unknown words on model performance. In addition, the invention adopts a local multi-head self-attention mechanism, so the model obtains richer local dependency features and achieves better labeling results on the part-of-speech tagging task.

Description

Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism
Technical Field
The invention relates to a method for part-of-speech tagging of Thai and Burmese that fuses word-syllable pairs with a local multi-head attention mechanism, and belongs to the technical field of natural language processing.
Background
Part-of-speech tagging, i.e. determining the part of speech of each word in a given sentence, is one of the basic tasks in the field of natural language processing (NLP); it can improve the accuracy of syntactic analysis and thereby benefit many downstream NLP tasks.
Early part-of-speech tagging methods were mainly rule-based or based on statistical machine learning. Rule-based methods suffer from incomplete rule sets and rule conflicts. Current statistical machine learning methods mainly include Support Vector Machines (SVM), Hidden Markov Models (HMM) and Conditional Random Fields (CRFs). These methods usually require manually constructed feature extraction functions, a laborious process that often yields insufficient features.
Neural network methods avoid manual feature construction through word embeddings or embedding layers and at the same time improve part-of-speech tagging performance. Mainstream part-of-speech tagging models usually adopt BiLSTM (Bidirectional Long Short-Term Memory) networks or pre-trained language models to obtain context features for part-of-speech tagging. BiLSTM-based methods effectively reduce the model's dependence on feature engineering and pass information between words along the time dimension, but from the perspective of part-of-speech tagging, very long sequences contribute little to predicting a word's part of speech. Pre-trained language models such as BERT can learn high-quality character, syllable and word embeddings that capture latent syntactic and semantic similarity between words, but pre-training increases the number of model parameters, which severely slows down prediction with the trained part-of-speech tagging model. In addition, these methods ignore the word-formation characteristics and internal structure information of Thai and Burmese, so morphological information is lost and part-of-speech tagging performance suffers.
Syllables are the basic building blocks of words in these languages and are correlated with word part of speech to some extent. For example, BiLSTM-CRF has been used to predict word part-of-speech tags by learning shared features from word and syllable embeddings of Burmese text. However, the contribution of syllables is not significant when modeling long-distance context features. Fusing the morphological features of unknown words into a part-of-speech tagging model with a local multi-head attention mechanism helps learn the context features of unknown words and mitigates the impact of mislabeled unknown words on model performance.
Disclosure of Invention
The invention provides a method for part-of-speech tagging of Thai and Burmese that fuses word-syllable pairs with a local multi-head attention mechanism. It is used for part-of-speech tagging of Southeast Asian languages written in alphasyllabic scripts, such as Thai and Burmese, and addresses the poor tagging performance on low-frequency parts of speech and unknown words.
The technical scheme of the invention is as follows: the method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with local multi-head attention comprises the following specific steps:
Step1, text preprocessing is performed on the Thai LST20 data set or the Burmese ALT data set. For example, given a Thai sentence with m words, each word in the sentence is segmented into syllables to find potential affix information inside the word, so that the word sequence is expanded into a word-syllable pair sequence.
Step2, inputs containing n word-syllable pairs are obtained sequentially from the data preprocessed in Step1 with a sliding window, the word-syllable pairs are feature-encoded with a local multi-head attention mechanism, and the predicted shared features of the input n-gram are then obtained by concatenating the output features of the Transformer encoder with the syllable embeddings.
Step3, finally, part-of-speech dependencies are modeled with a conditional random field and part-of-speech tags are predicted.
The specific steps of Step1 are as follows:
Step1.1, construct the word alphabet and the part-of-speech tag alphabet for the training set from the vocabulary entries separated by '\n' in the Thai text;
Step1.2, call a state-of-the-art Thai or Burmese syllable segmenter to segment the words in the text into syllables and construct the syllable alphabet;
Step1.3, then, for each word, the invention assigns to it the syllables it contains. For the syllables, the invention only keeps the prefix syllable and the suffix syllable of each word as input. If a word consists of a single syllable, the invention supplements it with a "<PAD>" symbol to complete the input syllable vector.
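The following Python sketch illustrates the word-syllable pair construction of Step1.3. It is only an illustration under stated assumptions: the function segment_syllables is a hypothetical stand-in for the Thai or Burmese syllable segmenter mentioned above (here a word is assumed to arrive pre-segmented with "|" between syllables), and is not the segmenter actually used by the invention.

```python
# Minimal sketch of Step 1.3: pair each word with its prefix and suffix syllables.
# `segment_syllables` is a hypothetical placeholder, not the real segmenter.

PAD = "<PAD>"

def segment_syllables(word):
    # Placeholder: assume syllables are already marked with "|" inside the word.
    return word.split("|")

def word_to_syllable_pair(word):
    """Return (prefix_syllable, suffix_syllable) for a word.

    Only the first and last syllables are kept as input; a single-syllable
    word is supplemented with "<PAD>" so every word yields a fixed-length
    syllable vector, as described in Step 1.3.
    """
    syllables = segment_syllables(word)
    if len(syllables) >= 2:
        return syllables[0], syllables[-1]
    return syllables[0], PAD  # single-syllable word: pad the missing slot

def sentence_to_word_syllable_pairs(words):
    """Expand a word sequence into a word-syllable pair sequence."""
    return [(w, *word_to_syllable_pair(w)) for w in words]

# Example with hypothetical placeholder "words":
print(sentence_to_word_syllable_pairs(["ka|la|ma", "tha"]))
# [('ka|la|ma', 'ka', 'ma'), ('tha', 'tha', '<PAD>')]
```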
In a preferred embodiment of the present invention, Step2 is:
Step2.1, the coding layer takes the word embeddings of the n-gram, $w_{1:n} = (w_1, w_2, \dots, w_n)$, together with the corresponding syllable embeddings, $s_{1:n} = (s_1, s_2, \dots, s_n)$, as the input to the encoder; the input n-gram matrix can be expressed as:

$$X = [w_{1:n}; s_{1:n}] \qquad (1)$$
Step2.2, the multi-head attention layer of the encoder maps a query and a set of key-value pairs to an output. Given a vector sequence $X = (x_1, x_2, \dots, x_n)$, single-head attention projects $X$ onto three different matrices: the query matrix $Q$, the key matrix $K$ and the value matrix $V$. The attention weights are obtained by computing the dot-product attention between the words in the sentence, and the final score is the weighted sum of the values;
$$Q, K, V = XW^{Q}, XW^{K}, XW^{V} \qquad (2)$$

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (3)$$
where the matrices $W^{Q}$, $W^{K}$, $W^{V}$ are learnable parameters and $d_k$ is the dimension of the output vectors of the model's embedding layer; this factor rescales the inner product of $Q$ and $K^{T}$ so that an overly large inner product does not lead to an uneven distribution after softmax. softmax(·) normalizes the scores.
Step2.3, the Multi-Head attention layer is formed by concatenating several attention heads;

$$\mathrm{MultiHead}(Q, K, V) = [\mathrm{Att}_1, \mathrm{Att}_2, \dots, \mathrm{Att}_h] \qquad (4)$$
Step2.4, the feed-forward neural network layer consists of two linear layers in series; they have independent weights and biases and different dimensionalities, which allows further semantic information to be extracted;
$$Z = \mathrm{layer\text{-}norm}(X + \mathrm{MultiHead}(X)) \qquad (5)$$

$$\mathrm{FFN}(Z) = \mathrm{ReLU}(W_1 Z + b_1)W_2 + b_2 \qquad (6)$$
where layer-norm(·) denotes the normalization layer, FFN denotes the feed-forward network layer, $W_1$, $b_1$, $W_2$, $b_2$ are projection parameters, and $Z$ denotes the output of the normalization layer.
Step2.5, a normalization layer after the feed-forward neural network layer yields the output $o_i$ of the encoder block. The predicted shared features of the input n-gram are then obtained by concatenating the Transformer encoder output features with the syllable embeddings, and finally a vector is obtained through a multilayer perceptron (MLP):

$$h_i = \mathrm{MLP}([o_i; s_i]) \qquad (7)$$
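As an illustrative aid, the following numpy sketch runs a single local multi-head self-attention encoder block over one n-gram window of word-syllable pair embeddings, following equations (1) to (7) above. The dimensions, random inputs and weight initialization are assumptions chosen only for demonstration and do not reflect the actual parameter settings or trained weights of the invention.

```python
import numpy as np

# Illustrative sketch of the Step 2 encoder over an n-gram window of
# word-syllable pair embeddings. All sizes and weights are assumptions.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def multi_head_attention(X, heads, rng):
    n, d = X.shape
    d_k = d // heads
    outs = []
    for _ in range(heads):                        # one scaled dot-product head at a time
        Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv          # eq. (2)
        A = softmax(Q @ K.T / np.sqrt(d_k))       # eq. (3), scaled dot-product attention
        outs.append(A @ V)
    return np.concatenate(outs, axis=-1)          # eq. (4), concatenation of heads

def encoder_block(X, heads, rng):
    Z = layer_norm(X + multi_head_attention(X, heads, rng))   # eq. (5)
    d = X.shape[1]
    W1, b1 = rng.normal(scale=0.1, size=(d, 4 * d)), np.zeros(4 * d)
    W2, b2 = rng.normal(scale=0.1, size=(4 * d, d)), np.zeros(d)
    ffn = np.maximum(0, Z @ W1 + b1) @ W2 + b2                 # eq. (6), ReLU feed-forward
    return layer_norm(Z + ffn)                                  # encoder output o_i

rng = np.random.default_rng(0)
n, d_word, d_syll = 5, 64, 32                      # assumed window size and embedding dims
word_emb = rng.normal(size=(n, d_word))            # word embeddings (placeholder values)
syll_emb = rng.normal(size=(n, d_syll))            # syllable embeddings (placeholder values)
X = np.concatenate([word_emb, syll_emb], axis=-1)  # eq. (1): fuse word and syllable embeddings
O = encoder_block(X, heads=4, rng=rng)
shared = np.concatenate([O, syll_emb], axis=-1)    # eq. (7): concat with syllables; an MLP would follow
print(shared.shape)                                # (5, 128)
```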
In a preferred embodiment of the present invention, Step3 is:
Step3.1, in the decoding process, the invention uses the Viterbi algorithm to obtain the highest-scoring tag sequence. Formally, the invention uses $x = (x_1, x_2, \dots, x_n)$ to represent an input sequence and $y = (y_1, y_2, \dots, y_n)$ to represent the model output tags, and defines the score as:

$$\mathrm{score}(x, y; \theta) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \qquad (8)$$
where $A$ is the transition score matrix, e.g. $A_{i,j}$ denotes the score of transitioning from the $i$-th state to the $j$-th state in a pair of consecutive time steps; $y_0$ and $y_{n+1}$ are the tags at the beginning and end of the sentence that the model adds to the set of possible tags; $P$ is the score matrix output by the Transformer encoder; and $\theta$ denotes the parameters of the neural network.
Step3.2, a softmax over all possible tag sequences yields the probability of the sequence $y$;

$$p(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y' \in Y_x} \exp(\mathrm{score}(x, y'))} \qquad (9)$$
where $Y_x$ denotes the set of possible tag sequences for $x$.
Step3.3, during training, maximum likelihood estimation is used to maximize the log-probability of the correct tag sequence;

$$\log p(y \mid x) = \mathrm{score}(x, y) - \log \sum_{y' \in Y_x} \exp(\mathrm{score}(x, y')) \qquad (10)$$
where $Y_x$ denotes the set of possible tag sequences for the sentence $x$. It is apparent from the above formula that training continually drives the model to produce valid output tag sequences.
Step3.4, at decoding time, the output sequence of the model is predicted by taking the sequence $y^{*}$ with the maximum score:

$$y^{*} = \underset{y' \in Y_x}{\arg\max}\ \mathrm{score}(x, y') \qquad (11)$$
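For illustration, the following numpy sketch implements the Viterbi decoding of Step3.4 and equation (11) over an emission score matrix P and a transition matrix A as described above. The values of P and A here are random placeholders, not outputs of the actual model.

```python
import numpy as np

# Minimal sketch of Viterbi decoding (equations (8) and (11)): find the tag
# sequence y* with the maximum score given emission scores P and transitions A.

def viterbi_decode(P, A):
    """P: (n, T) emission scores; A: (T, T) transition scores A[i, j] = score(i -> j)."""
    n, T = P.shape
    score = P[0].copy()                            # best score ending in each tag at step 0
    back = np.zeros((n, T), dtype=int)             # backpointers
    for t in range(1, n):
        cand = score[:, None] + A + P[t][None, :]  # all (previous tag, current tag) extensions
        back[t] = cand.argmax(axis=0)              # best previous tag for each current tag
        score = cand.max(axis=0)
    path = [int(score.argmax())]                   # recover the highest-scoring path y*
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return list(reversed(path)), float(score.max())

rng = np.random.default_rng(0)
n_words, n_tags = 6, 4                             # assumed sizes for the example
P = rng.normal(size=(n_words, n_tags))             # emission scores from the encoder (placeholder)
A = rng.normal(size=(n_tags, n_tags))              # learned transition score matrix (placeholder)
tags, best = viterbi_decode(P, A)
print(tags, best)
```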
The invention has the beneficial effects that:
(1) the invention provides a Thai and Burmese part-of-speech tagging model that utilizes a local multi-head attention mechanism and a conditional random field; the model structure is simpler and more effective than existing BiLSTM and pre-trained models;
(2) the invention provides a method for learning context features from a local word-syllable pair sequence, which effectively alleviates the ambiguity problem and improves part-of-speech prediction for low-frequency words and unknown words;
(3) experimental results on the Thai and Burmese part-of-speech tagging data sets show that the labeling performance of the proposed model is superior to the current best models while its parameter scale is smaller.
Drawings
FIG. 1 is a schematic diagram of the specific structure of the model according to the present invention;
FIG. 2 is a diagram of word-syllable pair sequences in the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-2, a method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism comprises the following steps:
Step1, text preprocessing is performed on the Thai LST20 data set or the Burmese ALT data set. For example, given a Thai sentence with m words, each word in the sentence is segmented into syllables to find potential affix information inside the word, so that the word sequence is expanded into a word-syllable pair sequence.
Step1.1, construct the word alphabet and the part-of-speech tag alphabet for the training set from the vocabulary entries separated by '\n' in the Thai text;
Step1.2, call a state-of-the-art Thai or Burmese syllable segmenter to segment the words in the text into syllables and construct the syllable alphabet;
Step1.3, then, for each word, the invention assigns to it the syllables it contains, as shown in FIG. 1. For the syllables, the invention only keeps the prefix syllable and the suffix syllable of each word as input. If a word consists of a single syllable, the invention supplements it with a "<PAD>" symbol to complete the input syllable vector.
Step2, inputs containing n word-syllable pairs are obtained sequentially from the data preprocessed in Step1 with a sliding window, the word-syllable pairs are feature-encoded with a local multi-head attention mechanism, and the predicted shared features of the input n-gram are then obtained by concatenating the output features of the Transformer encoder with the syllable embeddings.
Step2.1, the coding layer takes the word embeddings of the n-gram, $w_{1:n} = (w_1, w_2, \dots, w_n)$, together with the corresponding syllable embeddings, $s_{1:n} = (s_1, s_2, \dots, s_n)$, as the input to the encoder; the input n-gram matrix can be expressed as:

$$X = [w_{1:n}; s_{1:n}] \qquad (1)$$
Step2.2, the multi-head attention layer of the encoder maps a query and a set of key-value pairs to an output. Given a vector sequence $X = (x_1, x_2, \dots, x_n)$, single-head attention projects $X$ onto three different matrices: the query matrix $Q$, the key matrix $K$ and the value matrix $V$. The attention weights are obtained by computing the dot-product attention between the words in the sentence, and the final score is the weighted sum of the values;
$$Q, K, V = XW^{Q}, XW^{K}, XW^{V} \qquad (2)$$

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (3)$$
where the matrices $W^{Q}$, $W^{K}$, $W^{V}$ are learnable parameters and $d_k$ is the dimension of the output vectors of the model's embedding layer; this factor rescales the inner product of $Q$ and $K^{T}$ so that an overly large inner product does not lead to an uneven distribution after softmax. softmax(·) normalizes the scores.
Step2.3, Multi-Head attention layer Multi-Head is formed by splicing a plurality of attention layers;
MultiHead(Q,K,V)=[Att1,Att2,...,Atth], (4)
Step2.4, the feed-forward neural network layer consists of two linear layers in series; they have independent weights and biases and different dimensionalities, which allows further semantic information to be extracted;
$$Z = \mathrm{layer\text{-}norm}(X + \mathrm{MultiHead}(X)) \qquad (5)$$

$$\mathrm{FFN}(Z) = \mathrm{ReLU}(W_1 Z + b_1)W_2 + b_2 \qquad (6)$$
where layer-norm(·) denotes the normalization layer, FFN denotes the feed-forward network layer, $W_1$, $b_1$, $W_2$, $b_2$ are projection parameters, and $Z$ denotes the output of the normalization layer.
Step2.5, a normalization layer after the feed-forward neural network layer yields the output $o_i$ of the encoder block. The predicted shared features of the input n-gram are then obtained by concatenating the Transformer encoder output features with the syllable embeddings, and finally a vector is obtained through a multilayer perceptron (MLP):

$$h_i = \mathrm{MLP}([o_i; s_i]) \qquad (7)$$
Step3.1, in the decoding process, the invention uses the Viterbi algorithm to obtain the highest-scoring tag sequence. Formally, the invention uses $x = (x_1, x_2, \dots, x_n)$ to represent an input sequence and $y = (y_1, y_2, \dots, y_n)$ to represent the model output tags, and defines the score as:

$$\mathrm{score}(x, y; \theta) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \qquad (8)$$
where $A$ is the transition score matrix, e.g. $A_{i,j}$ denotes the score of transitioning from the $i$-th state to the $j$-th state in a pair of consecutive time steps; $y_0$ and $y_{n+1}$ are the tags at the beginning and end of the sentence that the model adds to the set of possible tags; $P$ is the score matrix output by the Transformer encoder; and $\theta$ denotes the parameters of the neural network.
Step3.2, a softmax over all possible tag sequences yields the probability of the sequence $y$;

$$p(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y' \in Y_x} \exp(\mathrm{score}(x, y'))} \qquad (9)$$
where $Y_x$ denotes the set of possible tag sequences for $x$.
Step3.3, during training, maximum likelihood estimation is used to maximize the log-probability of the correct tag sequence;

$$\log p(y \mid x) = \mathrm{score}(x, y) - \log \sum_{y' \in Y_x} \exp(\mathrm{score}(x, y')) \qquad (10)$$
where $Y_x$ denotes the set of possible tag sequences for the sentence $x$. It is apparent from the above formula that training continually drives the model to produce valid output tag sequences.
Step3.4, at decoding time, the output sequence of the model is predicted by taking the sequence $y^{*}$ with the maximum score:

$$y^{*} = \underset{y' \in Y_x}{\arg\max}\ \mathrm{score}(x, y') \qquad (11)$$
To illustrate the effect of the present invention, three groups of comparative experiments were set up. The first group verifies the effectiveness on Thai part-of-speech tagging, the second group verifies the effectiveness on Burmese part-of-speech tagging, and the third group verifies the effectiveness of low-frequency part-of-speech prediction in the part-of-speech tagging task.
(1) Effectiveness on Thai part-of-speech tagging
The invention uses a MultiHead-Att + Syllable + CRF model structure that enriches semantic information by fusing affix-syllable features on top of the word vectors. To verify the effectiveness of the MultiHead-Att + Syllable + CRF model on Thai part-of-speech tagging, it is compared with several related models on the Thai LST20 corpus. The experimental results are shown in Table 1.
TABLE 1 Experimental results on the LST20 data set
[Table 1 is reproduced as an image in the original document; its values are not shown here.]
As can be seen from Table 1, both the Micro F1 and Macro F1 values of the proposed model exceed all mainstream models on the LST20 data set, demonstrating the effectiveness of the invention for the Thai part-of-speech tagging task. The CRF-based method has the lowest F1 value, which shows that neural network models can capture deeper information from the input language than traditional machine-learning-based methods. The F1 value of the proposed model is also higher than that of the other three neural-network-based pre-trained language models, probably because those models lack textual morphological information and internal structure information, which biases their part-of-speech predictions. The proposed model combines the word-formation characteristics of Thai and exploits local word-syllable pair sequence features, thereby improving the recognition of each word's part of speech. Meanwhile, Table 1 shows a large gap between the Micro F1 and Macro F1 values: the Micro F1 value is dominated by the common part-of-speech categories, which the model learns well, whereas the Macro F1 value weights every part-of-speech category equally and is therefore pulled down by the rare categories, which is why it is much lower.
(2) Effectiveness on Burmese part-of-speech tagging
To verify the effectiveness of the MultiHead-Att + Syllable + CRF model on Burmese part-of-speech tagging, the invention carries out comparison experiments with several related models on the Burmese corpus. The experimental results are shown in Table 2.
TABLE 2 Experimental results on the Burmese ALT data set
[Table 2 is reproduced as an image in the original document; its values are not shown here.]
Table 2 compares this method with models such as CNN, Seq2Seq and BiLSTM-CRF. The experimental results show that the proposed model achieves the best performance on the Burmese POS tagging task: its local attention mechanism handles the information interaction of the input features more efficiently, so the word vector features and syllable vector features are better fused and exploited, which also proves the effectiveness on the Burmese part-of-speech tagging task. Among the baselines, the CNN-based model performs worst and the Seq2Seq model is second worst, probably because the CNN-based method relies on convolution over selected local features, while Seq2Seq is an encoding-decoding method in which an encoding error directly causes a decoding error. Compared with the BiLSTM-CRF model and the BERT + BiLSTM-CRF + Fine-tune model, the F1 value of the proposed model improves by nearly 4% and 1% respectively, indicating that the Transformer conveys feature information more effectively than BiLSTM. The reason may be that the Transformer encoder can model information directly for every word-syllable pair, whereas BiLSTM has to model step by step from both ends of the input sentence toward the middle, so information in long sentences may be lost. Moreover, BERT + BiLSTM-CRF + Fine-tune is a joint word segmentation and part-of-speech tagging model, so the model is complex and additionally requires Burmese character, syllable and word vectors pre-trained with BERT.
(3) Effectiveness of low-frequency part-of-speech prediction
This experiment verifies the effectiveness of the MultiHead-Att + Syllable + CRF model on low-frequency part-of-speech prediction in the part-of-speech tagging task. For the Thai LST20 corpus, part-of-speech tags that occur fewer than 4000 times in the test set are treated as low-frequency parts of speech and are compared with the currently most advanced Thai part-of-speech tagging model, the WangchanBERTa model. The experimental results are shown in Table 3.
TABLE 3 Comparison results on low-frequency parts of speech
[Table 3 is reproduced as an image in the original document; its values are not shown here.]
As can be seen from Table 3, the proposed model outperforms the most advanced model on low-frequency part-of-speech prediction. The improvement is most obvious for the low-frequency parts of speech CL and IJ, which shows that local word-syllable pair information can supplement word features, and that the local multi-head attention mechanism can establish dependencies between words and exchange information in real time, thereby helping the model predict low-frequency parts of speech. Performance on XX is slightly lower, possibly because XX denotes an unknown part of speech: the proposed model supplements words labeled XX according to their own word-formation characteristics, so the model may misjudge an XX label as the part of speech of a word with the same syllables.
The above experimental data prove that the part-of-speech tagging model obtains better performance by using syllable information, because syllable-level information takes affix characteristics into account, and many affixes in Thai and Burmese convey part-of-speech information, which is a non-negligible factor in part-of-speech tagging. Meanwhile, the local multi-head attention mechanism improves the parallelism of the model and effectively captures the context features around each word. Experiments prove that the proposed method achieves the best results compared with several baseline models. For the part-of-speech tagging task of Southeast Asian languages written in alphasyllabic scripts, such as Thai and Burmese, the invention provides a method that captures word-syllable sequence information with a local multi-head attention mechanism to improve part-of-speech tagging.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (4)

1. A method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism, characterized by comprising the following specific steps:
Step1, performing text preprocessing on a Thai or Burmese text data set, and performing syllable segmentation on each word in a sentence to find potential affix information in the word, so that the word sequence is expanded into a word-syllable pair sequence;
Step2, sequentially obtaining inputs containing n word-syllable pairs from the data preprocessed in Step1 with a sliding window, performing feature coding on the word-syllable pairs with a local multi-head attention mechanism, and then obtaining the predicted shared features of the input n-gram by concatenating the output features of the Transformer encoder with the syllable embeddings;
Step3, finally, modeling part-of-speech dependencies through a conditional random field and predicting part-of-speech tags.
2. The method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism according to claim 1, characterized in that the specific steps of Step1 are as follows:
Step1.1, constructing the word alphabet and the part-of-speech tag alphabet for the training set from the vocabulary entries separated by '\n' in the Thai text;
Step1.2, calling a Thai or Burmese syllable segmenter to segment the words in the text into syllables and construct a syllable alphabet;
Step1.3, then, for each word, assigning to it the syllables it contains; for the syllables, only the prefix syllable and the suffix syllable of each word are kept as input; if a word consists of a single syllable, the syllable is supplemented with a "<PAD>" symbol to complete the input syllable vector.
3. The method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism according to claim 1, characterized in that the specific steps of Step2 are as follows:
Step2.1, the coding layer takes the word embeddings of the n-gram, $w_{1:n} = (w_1, w_2, \dots, w_n)$, together with the corresponding syllable embeddings, $s_{1:n} = (s_1, s_2, \dots, s_n)$, as the input to the encoder; the input n-gram matrix is represented as:

$$X = [w_{1:n}; s_{1:n}] \qquad (1)$$
Step2.2, the multi-head attention layer of the encoder maps a query and a set of key-value pairs to an output; given a vector sequence $X = (x_1, x_2, \dots, x_n)$, single-head attention projects $X$ onto three different matrices: the query matrix $Q$, the key matrix $K$ and the value matrix $V$; the attention weights are obtained by computing the dot-product attention between the words in the sentence, and the final score is the weighted sum of the values;
$$Q, K, V = XW^{Q}, XW^{K}, XW^{V} \qquad (2)$$

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (3)$$
where the matrices $W^{Q}$, $W^{K}$, $W^{V}$ are learnable parameters and $d_k$ is the dimension of the output vectors of the model's embedding layer; this factor rescales the inner product of $Q$ and $K^{T}$ so that an overly large inner product does not lead to an uneven distribution after softmax; softmax(·) normalizes the scores;
Step2.3, the Multi-Head attention layer is formed by concatenating several attention heads;

$$\mathrm{MultiHead}(Q, K, V) = [\mathrm{Att}_1, \mathrm{Att}_2, \dots, \mathrm{Att}_h] \qquad (4)$$
Step2.4, the feed-forward neural network layer consists of two linear layers in series; the linear layers have independent weights and biases and different dimensionalities, so that semantic information can be further extracted;
$$Z = \mathrm{layer\text{-}norm}(X + \mathrm{MultiHead}(X)) \qquad (5)$$

$$\mathrm{FFN}(Z) = \mathrm{ReLU}(W_1 Z + b_1)W_2 + b_2 \qquad (6)$$
where layer-norm(·) denotes the normalization layer, FFN denotes the feed-forward network layer, $W_1$, $b_1$, $W_2$, $b_2$ are projection parameters, and $Z$ denotes the output of the normalization layer;
Step2.5, a normalization layer after the feed-forward neural network layer yields the output $o_i$ of the encoder block; then the Transformer encoder output features and the syllable embeddings are concatenated to obtain the predicted shared features of the input n-gram, and finally a vector is obtained through a multilayer perceptron MLP:

$$h_i = \mathrm{MLP}([o_i; s_i]) \qquad (7)$$
4. The method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism according to claim 1, characterized in that the specific steps of Step3 are as follows:
Step3.1, in the decoding process, the highest-scoring tag sequence is obtained with the Viterbi algorithm; formally, $x = (x_1, x_2, \dots, x_n)$ represents an input sequence and $y = (y_1, y_2, \dots, y_n)$ represents the model output tags, so a score is defined as:

$$\mathrm{score}(x, y; \theta) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \qquad (8)$$
where $A$ is the transition score matrix, $A_{i,j}$ denotes the score of transitioning from the $i$-th state to the $j$-th state in a pair of consecutive time steps, $y_0$ and $y_{n+1}$ are the tags added at the beginning and end of the sentence, $P$ is the score matrix output by the Transformer encoder, and $\theta$ denotes the parameters of the neural network;
Step3.2, a softmax over all possible tag sequences yields the probability of the sequence $y$;

$$p(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y' \in Y_x} \exp(\mathrm{score}(x, y'))} \qquad (9)$$
where $Y_x$ denotes the set of possible tag sequences for $x$;
Step3.3, during training, maximum likelihood estimation is used to maximize the log-probability of the correct tag sequence;

$$\log p(y \mid x) = \mathrm{score}(x, y) - \log \sum_{y' \in Y_x} \exp(\mathrm{score}(x, y')) \qquad (10)$$
where $Y_x$ denotes the set of possible tag sequences for the sentence $x$;
Step3.4, at decoding time, the output sequence of the model is predicted by taking the sequence $y^{*}$ with the maximum score;

$$y^{*} = \underset{y' \in Y_x}{\arg\max}\ \mathrm{score}(x, y') \qquad (11)$$
CN202111078804.XA 2021-09-15 2021-09-15 Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism Active CN113901210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111078804.XA CN113901210B (en) 2021-09-15 2021-09-15 Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111078804.XA CN113901210B (en) 2021-09-15 2021-09-15 Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism

Publications (2)

Publication Number Publication Date
CN113901210A true CN113901210A (en) 2022-01-07
CN113901210B CN113901210B (en) 2022-12-13

Family

ID=79028510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111078804.XA Active CN113901210B (en) 2021-09-15 2021-09-15 Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism

Country Status (1)

Country Link
CN (1) CN113901210B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116977436A (en) * 2023-09-21 2023-10-31 小语智能信息科技(云南)有限公司 Burmese text image recognition method and device based on Burmese character cluster characteristics

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134757A (en) * 2019-04-19 2019-08-16 杭州电子科技大学 A kind of event argument roles abstracting method based on bull attention mechanism
US20190355270A1 (en) * 2018-05-18 2019-11-21 Salesforce.Com, Inc. Multitask Learning As Question Answering
CN110489750A (en) * 2019-08-12 2019-11-22 昆明理工大学 Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium
CN111581474A (en) * 2020-04-02 2020-08-25 昆明理工大学 Evaluation object extraction method of case-related microblog comments based on multi-head attention system
CN111581396A (en) * 2020-05-06 2020-08-25 西安交通大学 Event graph construction system and method based on multi-dimensional feature fusion and dependency syntax
CN112270193A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Chinese named entity identification method based on BERT-FLAT
CN112487796A (en) * 2020-11-27 2021-03-12 北京智源人工智能研究院 Method and device for sequence labeling and electronic equipment
CN112784532A (en) * 2021-01-29 2021-05-11 电子科技大学 Multi-head attention memory network for short text sentiment classification
CN112883726A (en) * 2021-01-21 2021-06-01 昆明理工大学 Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning
US20210279576A1 (en) * 2020-03-03 2021-09-09 Google Llc Attention neural networks with talking heads attention

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190355270A1 (en) * 2018-05-18 2019-11-21 Salesforce.Com, Inc. Multitask Learning As Question Answering
CN110134757A (en) * 2019-04-19 2019-08-16 杭州电子科技大学 A kind of event argument roles abstracting method based on bull attention mechanism
CN110489750A (en) * 2019-08-12 2019-11-22 昆明理工大学 Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium
US20210279576A1 (en) * 2020-03-03 2021-09-09 Google Llc Attention neural networks with talking heads attention
CN111581474A (en) * 2020-04-02 2020-08-25 昆明理工大学 Evaluation object extraction method of case-related microblog comments based on multi-head attention system
CN111581396A (en) * 2020-05-06 2020-08-25 西安交通大学 Event graph construction system and method based on multi-dimensional feature fusion and dependency syntax
CN112270193A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Chinese named entity identification method based on BERT-FLAT
CN112487796A (en) * 2020-11-27 2021-03-12 北京智源人工智能研究院 Method and device for sequence labeling and electronic equipment
CN112883726A (en) * 2021-01-21 2021-06-01 昆明理工大学 Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning
CN112784532A (en) * 2021-01-29 2021-05-11 电子科技大学 Multi-head attention memory network for short text sentiment classification

Non-Patent Citations (13)

* Cited by examiner, † Cited by third party
Title
JIE HAO 等: "Multi-Granularity Self-Attention for Neural Machine Translation", 《ARXIV》 *
LINHAO ZHANG 等: "Using Bidirectional Transformer-CRF for Spoken Language Understanding", 《 NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING》 *
LOPEZ-GAZPIO 等: "Word n-gram attention models for sentence similarity and inference", 《EXPERT SYSTEMS WITH APPLICATIONS》 *
SHIZHE DIAO 等: "ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations", 《EMNLP》 *
WEIZHEN QI 等: "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training", 《ARXIV》 *
丘心颖 et al.: "Research on an automatic recognition method for Indonesian compound nouns fusing a Self-Attention mechanism and n-gram convolution kernels", Journal of Hunan University of Technology *
唐国强 et al.: "Named entity recognition in clinical electronic medical records incorporating a language model and an attention mechanism", Computer Science *
姚锦玮: "Research and design of speech recognition and speech synthesis for human-computer interaction on an intelligent pumping station platform", China Master's Theses Full-text Database, Engineering Science and Technology II *
李洺宇: "Research and implementation of automatic annotation *** for quasi-written Korean speech corpora", China Master's Theses Full-text Database, Philosophy and Humanities *
杨敬闻: "Research on Chinese named entity recognition based on XLNet and fused character-word encoding", China Master's Theses Full-text Database, Information Science and Technology *
王旭强 et al.: "A feature-fusion sequence labeling model based on an attention mechanism", Journal of Shandong University of Science and Technology (Natural Science Edition) *
邓钰 et al.: "Multi-head attention memory network for short text sentiment classification", Journal of Computer Applications *
陶广奉 et al.: "A Thai neural network word segmentation method fusing contextual character information", Computer Engineering & Science *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116977436A (en) * 2023-09-21 2023-10-31 小语智能信息科技(云南)有限公司 Burmese text image recognition method and device based on Burmese character cluster characteristics
CN116977436B (en) * 2023-09-21 2023-12-05 小语智能信息科技(云南)有限公司 Burmese text image recognition method and device based on Burmese character cluster characteristics

Also Published As

Publication number Publication date
CN113901210B (en) 2022-12-13

Similar Documents

Publication Publication Date Title
Sun et al. Token-level ensemble distillation for grapheme-to-phoneme conversion
CN110046350B (en) Grammar error recognition method, device, computer equipment and storage medium
US7966173B2 (en) System and method for diacritization of text
CN111694924A (en) Event extraction method and system
CN111339750B (en) Spoken language text processing method for removing stop words and predicting sentence boundaries
Simonnet et al. ASR error management for improving spoken language understanding
CN111563375B (en) Text generation method and device
CN116127952A (en) Multi-granularity Chinese text error correction method and device
CN116416480B (en) Visual classification method and device based on multi-template prompt learning
Wooters et al. Multiple-pronunciation lexical modeling in a speaker independent speech understanding system.
CN116306600B (en) MacBert-based Chinese text error correction method
CN114153971A (en) Error-containing Chinese text error correction, identification and classification equipment
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory
CN115510863A (en) Question matching task oriented data enhancement method
Romero et al. Modern vs diplomatic transcripts for historical handwritten text recognition
CN113901210B (en) Method for part-of-speech tagging of Thai and Burmese by fusing word-syllable pairs with a local multi-head attention mechanism
CN114937465A (en) Speech emotion recognition method based on self-supervision learning and computer equipment
CN110942767A (en) Recognition labeling and optimization method and device for ASR language model
CN112883726B (en) Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning
CN114970537B (en) Cross-border ethnic cultural entity relation extraction method and device based on multi-layer labeling strategy
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
CN115759102A (en) Chinese poetry wine culture named entity recognition method
CN112634878B (en) Speech recognition post-processing method and system and related equipment
CN115223549A (en) Vietnamese speech recognition corpus construction method
Wana et al. A multi-view approach for Mandarin non-native mispronunciation verification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant