CN112183106B - Semantic understanding method and device based on phoneme association and deep learning - Google Patents

Semantic understanding method and device based on phoneme association and deep learning

Info

Publication number
CN112183106B
CN112183106B (application number CN202010919954.8A)
Authority
CN
China
Prior art keywords
phoneme
text
regular
association
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010919954.8A
Other languages
Chinese (zh)
Other versions
CN112183106A (en)
Inventor
赖文波
林康
谭则涛
方伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gf Securities Co ltd
Original Assignee
Gf Securities Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gf Securities Co ltd filed Critical Gf Securities Co ltd
Priority to CN202010919954.8A priority Critical patent/CN112183106B/en
Publication of CN112183106A publication Critical patent/CN112183106A/en
Application granted granted Critical
Publication of CN112183106B publication Critical patent/CN112183106B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a semantic understanding method based on phoneme association and deep learning. The method converts acquired speech into text through an ASR model; segments the text and feeds it sequentially into a word2vec model and a bi-lstm model, outputting a morpheme understanding vector of the text; combines the obtained keyword phoneme features, original phoneme features and regular phoneme features into a feature triplet of text phoneme associations; inputs the feature triplet into a text-cnn convolutional neural network for phoneme-association understanding, extracting local combination features by convolution and outputting a phoneme association vector; and weights the morpheme understanding vector and the phoneme association vector, combines the weighted result with the regular phoneme features to obtain a joint vector, and inputs the joint vector into a classifier for semantic understanding and classification of the speech. The method improves the accuracy of natural-language understanding of the text and reduces the development cost of the semantic understanding model.

Description

Semantic understanding method and device based on phoneme association and deep learning
Technical Field
The present invention relates to the field of computer natural language processing technologies, and in particular, to a semantic understanding method, apparatus, terminal device and computer readable storage medium based on phoneme association and deep learning.
Background
With the rapid development of ASR (Automatic Speech Recognition), semantic understanding built on ASR speech-to-text conversion is being applied ever more widely. Although ASR has matured, speech carries obvious domain features (medical, biological, chemical, etc.) and regional features (dialect, accent, colloquialism, etc.) in specific fields and regions. For such diversified speech input, on the one hand the quality and accuracy of ASR-recognized text can be poor, so that the text literally loses its original natural-language meaning and syntactic structure and a computer cannot give an accurate semantic interpretation; on the other hand, even when the input speech itself meets quality standards, the ASR-converted text can deviate considerably, so that the semantics can be understood correctly only by means of human natural-language understanding, combining pronunciation characteristics, scene, context and association.
At present, the accuracy of computer semantic understanding of natural language can be improved only indirectly, by unilaterally improving ASR recognition capability. Separate ASR models can be developed and trained to different standards for the speech of each field and each region, so that ASR becomes applicable to every region in every field, but the development cost is extremely high, and for text that still deviates, the computer's semantic-understanding performance remains unsatisfactory.
Disclosure of Invention
The invention aims to provide a semantic understanding method, a semantic understanding device, terminal equipment and a computer readable storage medium based on phoneme association and deep learning.
In order to overcome the above-mentioned drawbacks in the prior art, an embodiment of the present invention provides a semantic understanding method based on phoneme association and deep learning, including:
converting the acquired voice into text through an ASR model;
The text is segmented and then sequentially input into a word2vec model and a bi-lstm model, and a morpheme understanding vector of the text is output;
Obtaining a feature triplet of text phoneme associations, comprising: identifying and separating word phonemes after word segmentation, and adding phoneme association keywords to obtain keyword phoneme characteristics; identifying and separating sentence phonemes from the text to obtain original phoneme features; acquiring regular phoneme characteristics after sentence phoneme recognition and separation; combining the key word phoneme features, the original phoneme features and the regular phoneme features;
Inputting the feature triplet into a text-cnn convolutional neural network for phoneme association understanding, convoluting and extracting local combined features, and outputting a phoneme association vector;
and weighting the morpheme understanding vector and the phoneme association vector, combining the weighted result with the regular phoneme features to obtain a joint vector, and inputting the joint vector into a classifier for semantic understanding and classification of the speech.
Further, the identifying and separating word phonemes after word segmentation includes:
converting the text into pinyin with a pinyin toolkit;
when the pinyin is a whole syllable, setting the initial to NA and the final to the whole syllable;
when the pinyin is not a whole syllable and starts with a double initial, setting the initial to the double initial and the remaining phonemes to the final;
when the pinyin is not a whole syllable and does not start with a double initial, setting the initial to the single initial and the remaining phonemes to the final;
and identifying the initial and the final, and separating them with a symbol to obtain a phoneme recognition and separation result.
Further, after the recognizing and separating the word phonemes after word segmentation, the method further includes:
Importing a keyword library, and comparing the phoneme recognition and separation result with the keywords word by word;
Judging, for each word, whether there is a keyword with the same initial or the same final and whether there are multiple such matches at the same time; if yes, adding a keyword association identifier after the phrase corresponding to the closest match; if not, leaving the separation result unprocessed.
Further, the acquiring regular phoneme features after sentence phoneme recognition and separation includes:
acquiring the sentence phoneme recognition and separation result and an original regular library, wherein the original regular library is the system's default library of regular expressions;
converting the original regular library into a phoneme partial-match regular library, and numbering the regular expressions in the phoneme regular library;
and performing regular matching on the phonemes against the phoneme regular library, and generating a one-hot feature from the semantics corresponding to each hit regular-expression number, to serve as the regular phoneme features.
Further, the converting of the original regular library into the phoneme partial-match regular library comprises:
judging the attribute and structure of each character in the original regular expression, performing the corresponding phoneme-regular processing, and integrating the processing results to obtain the phoneme regular library, wherein the regular processing comprises:
when the character is a Chinese character within a phrase, adding a regular expression that partially matches the initials or finals of the phrase into the new phoneme-regular string;
when the character is a single Chinese character, adding the character's phoneme separation result into the new phoneme-regular string;
when the character is not Chinese, adding the character directly into the new phoneme-regular string.
Further, the word segmentation is performed on the text, and then the text is sequentially input into a word2vec model and a bi-lstm model, and a morpheme understanding vector of the text is output, which comprises the following steps:
inputting the text into a word2vec model, and performing word vector embedding;
inputting the embedded word vectors into a bi-lstm model for semantic understanding; and inputting the result into a self-attention model to obtain a sentence vector of the text, which serves as the morpheme understanding vector of the text.
Further, the converting of the acquired speech into text further comprises performing the conversion using a neural-network algorithm together with a CTC algorithm; the segmenting of the text comprises segmenting it with the jieba word-segmentation module in python; and the weighting of the morpheme understanding vector and the phoneme association vector uses attention techniques.
The embodiment of the invention also provides a semantic understanding device based on phoneme association and deep learning, which is characterized by comprising the following components:
the voice conversion module is used for converting the acquired voice into text through an ASR model;
The morpheme understanding vector output module is used for segmenting the text, inputting it sequentially into a word2vec model and a bi-lstm model, and outputting the morpheme understanding vector of the text;
the feature triplet obtaining module is used for obtaining feature triples associated with text phonemes, and comprises the following steps: identifying and separating word phonemes after word segmentation, and adding phoneme association keywords to obtain keyword phoneme characteristics; identifying and separating sentence phonemes from the text to obtain original phoneme features; acquiring regular phoneme characteristics after sentence phoneme recognition and separation; combining the key word phoneme features, the original phoneme features and the regular phoneme features;
The phoneme association vector output module is used for inputting the feature triplet into the text-cnn convolutional neural network to perform phoneme association understanding, convoluting and extracting local combination features, and outputting a phoneme association vector;
and the total joint understanding module is used for weighting the morpheme understanding vector and the phoneme association vector, combining the weighted result with the regular phoneme features to obtain a joint vector, and inputting the joint vector into the classifier for semantic understanding and classification of the speech.
The embodiment of the invention also provides a terminal device, which comprises:
one or more processors;
a memory coupled to the processor for storing one or more programs;
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the phoneme association and deep learning based semantic understanding method as described in any of the above.
Embodiments of the present invention also provide a computer-readable storage medium having stored thereon a computer program for execution by a processor to implement the phoneme association and deep learning based semantic understanding method as described in any of the above.
Compared with the prior art, the semantic understanding method based on phoneme association and deep learning provided by the embodiments of the invention adds association capability for homophones and similar-sounding words to the speech-recognition text, strengthens the semantic-understanding capability of artificial intelligence over speech-to-text conversion, improves accuracy in business scenarios such as text auditing and text classification, and reduces the development cost of the semantic understanding model.
Drawings
FIG. 1 is a flow chart of a semantic understanding method based on phoneme association and deep learning according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for obtaining a phoneme association feature triplet according to an embodiment of the present invention;
FIG. 3 is a flow chart of a method for phoneme recognition and separation according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for adding associated keyword identification according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a method for converting original regularization into phoneme partial matching regularization according to an embodiment of the invention;
FIG. 6 is a flow chart of a general joint understanding method provided by an embodiment of the present invention;
FIG. 7 is a diagram of the overall architecture of a semantic understanding device based on phone association and deep learning according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a semantic understanding device based on phoneme association and deep learning according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the step numbers used herein are for convenience of description only and are not limiting as to the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
First aspect:
Referring to fig. 1, an embodiment of the present invention provides a semantic understanding method based on phoneme association and deep learning, including:
S10, converting the acquired voice into a text through an ASR model;
In this step, a segment of speech is first acquired, and the speech uttered by a human is then converted into text by an ASR model. This is the process of converting sound into text, corresponding to the human ear. ASR is a speech recognition technology, also called automatic speech recognition, whose goal is to convert the lexical content of human speech into computer-readable input such as key presses, binary codes or character sequences. It is unlike speaker recognition and speaker verification, which attempt to identify or verify the speaker rather than the lexical content of the speech. The drawback of ASR is the accuracy of the recognized text: especially when clients come from all over the country and pronunciation carries different regional characteristics, the quality of the ASR-converted text can be very low, to the point of literally losing the original semantic and syntactic structure, so that the machine cannot perform natural-language understanding and NLP cannot correctly recover the syntactic features, which causes great trouble in practical applications. In actual production, the quality of many recordings is itself unproblematic and meets the standard, yet the ASR-converted text deviates considerably, so that a computer cannot correctly understand the natural language of the ASR text literally; a human, however, still can, because humans have associative ability: from the context and the pronunciation characteristics they can recall the correct words and understand the correct semantics. Previously, therefore, such text could be judged and understood only manually. On this basis, starting from associativity, an association mechanism similar to the human one is added to the model, giving the machine associative ability over speech; from the perspective of combining phonemes with semantics, computer natural-language understanding of the ASR text is performed with the help of phoneme features. Converting the acquired speech into text through the ASR model is thus only the first step of the method. The conversion can be implemented with CNN- or RNN-based neural networks, CTC and similar techniques, and comprises: speech input, data encoding, data decoding and text output.
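As a hedged illustration of the decoding stage of this pipeline (the label set, the random logits and the greedy strategy here are invented for the example; production ASR systems typically use beam search with a language model), a minimal CTC greedy decoder might look as follows:

```python
import torch

# Toy per-frame log-probabilities over {blank='-', 'a'..'e'}
labels = ["-", "a", "b", "c", "d", "e"]
logits = torch.randn(20, len(labels))            # 20 frames, 6 classes

def greedy_ctc_decode(frame_logits: torch.Tensor) -> str:
    """Collapse repeated symbols, then drop blanks (CTC's mapping function)."""
    best = frame_logits.argmax(dim=1).tolist()   # best class per frame
    collapsed = [c for i, c in enumerate(best)
                 if i == 0 or c != best[i - 1]]  # merge consecutive repeats
    return "".join(labels[c] for c in collapsed if c != 0)

print(greedy_ctc_decode(logits))                 # e.g. "cade"
```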
Here, CNN refers to the convolutional neural network (Convolutional Neural Networks, CNN), a class of feedforward neural networks (Feedforward Neural Networks) containing convolution computations and having a deep structure, and one of the representative algorithms of deep learning. Convolutional neural networks have representation-learning capability and can perform shift-invariant classification of input information according to their hierarchical structure, for which reason they are also called shift-invariant artificial neural networks (SIANN). They are built in imitation of the visual perception mechanism of living organisms and can perform both supervised and unsupervised learning; the parameter sharing of convolution kernels in their hidden layers and the sparsity of inter-layer connections allow them to learn grid-like features (such as pixels and audio) with a small amount of computation, with stable effect and no additional feature-engineering requirement on the data. The input layer of a convolutional neural network can process multidimensional data: the input layer of a one-dimensional convolutional neural network commonly receives a one- or two-dimensional array, where a one-dimensional array is usually a time or spectrum sampling and a two-dimensional array may contain multiple channels; the input layer of a two-dimensional convolutional neural network receives a two- or three-dimensional array; and the input layer of a three-dimensional convolutional neural network receives a four-dimensional array. Since convolutional neural networks are widely applied in computer vision, many studies assume three-dimensional input data when introducing their structure, i.e., two-dimensional pixels in a plane plus RGB channels. The hidden layers comprise three common structures, the convolutional layer, the pooling layer and the fully connected layer, and some more modern algorithms contain complex structures such as Inception modules and residual blocks. In common architectures, the convolutional and pooling layers are specific to convolutional neural networks; the convolution kernels of a convolutional layer contain weight coefficients while the pooling layer does not, so the pooling layer may not be counted as an independent layer. Taking LeNet-5 as an example, the order in which the three classes of layers are commonly built into the hidden part is typically: input - convolutional layer - pooling layer - fully connected layer - output;
it should be noted that RNN refers to the recurrent neural network (Recurrent Neural Network, RNN), a class of recursive neural networks that take sequence data as input, recurse along the evolution direction of the sequence, and connect all nodes (recurrent units) in a chain. Bidirectional recurrent neural networks (Bidirectional RNN, Bi-RNN) and long short-term memory networks (Long Short-Term Memory networks, LSTM) are common recurrent neural networks. Recurrent neural networks possess memory, parameter sharing and Turing completeness, and therefore have a certain advantage in learning the nonlinear characteristics of sequences. They are applied in natural language processing (Natural Language Processing, NLP) fields such as speech recognition, language modeling and machine translation, and are also used for various kinds of time-series prediction. A recurrent neural network that incorporates a convolutional neural network (CNN) can address computer-vision problems involving sequence input.
CTC is a loss function for sequence-labeling problems. Conventional sequence-labeling algorithms require the input and output symbols to be perfectly aligned at every time step. CTC instead extends the label set by adding a blank element; once sequences are labeled with the extended label set, every predicted sequence that can be converted into the true sequence by the mapping function counts as a correct prediction, i.e., the predicted sequence is obtained without any data-alignment preprocessing. The objective function is to maximize the sum of the probabilities of all correct predicted sequences. To enumerate all correct predicted sequences, a forward-backward algorithm is used: the forward pass computes α_t(s), the probability of predicting a correct prefix up to time t; the backward pass computes β_t(s), the probability of predicting a correct suffix from time t to the final time T. Then, for any time t, α_t(s)·β_t(s) divided by the probability of emitting symbol s at time t gives the probability of all correct predicted sequences passing through s at time t, and summing over s gives the total probability of the correct label sequence. Dynamic programming reduces the time complexity: a correct prediction at the current time is possible only if one of certain specific symbols was predicted at the previous time, so the probability of predicting a correct prefix up to time t equals (the sum of the probabilities of all correct sub-sequences up to time t-1) multiplied by the probability of predicting the current label.
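As a concrete illustration, the sketch below applies CTC loss to a toy acoustic-model output using PyTorch's built-in torch.nn.CTCLoss; the tensor shapes and label values are invented for the example:

```python
import torch
import torch.nn as nn

T, N, C = 50, 2, 20   # time steps, batch size, classes (index 0 is blank)

# Log-probabilities from a hypothetical acoustic model, shape (T, N, C)
log_probs = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()

# Two concatenated target sequences (labels 1..C-1; 0 is reserved for blank)
targets = torch.tensor([3, 7, 7, 12, 5, 9, 14], dtype=torch.long)
target_lengths = torch.tensor([4, 3], dtype=torch.long)   # lengths 4 and 3
input_lengths = torch.full((N,), T, dtype=torch.long)

# CTCLoss sums the probability of every alignment that collapses to the
# target, via the forward-backward recursion described above
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(float(loss))
```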
S20, after word segmentation is carried out on the text, the text is sequentially input into a word2vec model and a bi-lstm model, and a morpheme understanding vector of the text is output;
In this step, semantic understanding based on the text's morphemes (as opposed to phoneme association) is implemented, in the following two sub-steps:
S201, inputting the text into a word2vec model to obtain the word vectors corresponding to the text, and performing word-vector embedding;
It should be noted that word2vec is a group of related models used to produce word vectors. These models are shallow two-layer neural networks trained to reconstruct the linguistic context of words: the network takes words as input and guesses the words at adjacent positions, and under word2vec's bag-of-words assumption the order of the words is unimportant. After training, the word2vec model can map every word to a vector representing word-to-word relations; this vector is the hidden layer of the neural network. Embedding is one way of converting discrete variables into a continuous vector representation.
In neural networks, embedding plays a very important role: it not only reduces the spatial dimension of a discrete variable, but also represents that variable meaningfully. The entry point of this concept into the field of deep learning is the so-called Manifold Hypothesis, which states that natural raw data form a low-dimensional manifold embedded in the high-dimensional space in which the raw data live. The task of deep learning is then to map high-dimensional raw data (images, sentences) onto that low-dimensional manifold, so that the data become separable after the mapping; this mapping is called embedding (Embedding). Word Embedding, for instance, maps sentences composed of words to representation vectors. As now commonly understood in the deep-learning community, an embedding is a feature extracted from the raw data, i.e., the low-dimensional vector obtained after mapping through the neural network.
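As a minimal sketch (the toy corpus and the parameter values are invented for illustration), training word vectors with the gensim library's Word2Vec and looking one up might look as follows:

```python
from gensim.models import Word2Vec

# A toy corpus of already-segmented sentences (e.g., output of jieba)
sentences = [
    ["他", "在", "开会"],
    ["他", "在", "开车"],
    ["人", "不", "方便"],
]

# Shallow two-layer network; vector_size is the embedding dimension
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

vec = model.wv["开会"]          # 100-dimensional word vector
print(vec.shape)                # (100,)
print(model.wv.most_similar("开会", topn=2))
```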
S202, inputting the embedded word vectors into a bi-lstm model for semantic understanding; inputting the result into a self-attention model to obtain the sentence vector of the text, which serves as the morpheme understanding vector of the text;
In this step, it should be added that LSTM, in full Long Short-Term Memory, is one kind of RNN (Recurrent Neural Network). Owing to its design characteristics, LSTM is well suited to modeling time-series data such as text. BiLSTM, short for Bi-directional Long Short-Term Memory, combines a forward LSTM and a backward LSTM; both are often used in natural language processing tasks to model context information. The LSTM model captures longer-distance dependencies better, because through training it can learn which information to remember and which to forget. Modeling sentences with a unidirectional LSTM has one problem: back-to-front information cannot be encoded. In finer-grained classification, such as five-way classification into strongly positive, weakly positive, neutral, weakly negative and strongly negative, attention must be paid to the interactions among sentiment words, degree words and negation words. For example, in a sentence like "this restaurant is unbearably dirty", the intensifier "unbearably" modifies the degree of "dirty"; bi-lstm can better capture such bidirectional semantic dependencies.
In addition, the attention mechanism imitates the internal process of biological observation behavior: a mechanism that aligns internal experience with external sensation to increase the observation fineness of a partial region. Attention can quickly extract the important features of sparse data and is therefore widely used in natural language processing tasks, particularly machine translation. The self-attention mechanism is an improvement of the attention mechanism that reduces dependence on external information and is better at capturing the internal dependencies of data or features. Applied to the word representations of sparse text, as in text sentiment analysis, self-attention weights the word vectors and effectively improves model efficiency.
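A minimal sketch of this sub-step (the dimensions and the single learned attention-pooling head standing in for the self-attention model are assumptions for illustration; the patent does not fix these choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MorphemeEncoder(nn.Module):
    """word2vec embeddings -> bi-LSTM -> attention pooling -> sentence vector."""
    def __init__(self, vocab_size=5000, emb_dim=100, hidden=128):
        super().__init__()
        # In practice this embedding would be initialized from word2vec weights
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, bidirectional=True,
                              batch_first=True)
        self.att = nn.Linear(2 * hidden, 1)        # scores each time step

    def forward(self, token_ids):                  # (B, T)
        h, _ = self.bilstm(self.emb(token_ids))    # (B, T, 2*hidden)
        weights = F.softmax(self.att(h), dim=1)    # (B, T, 1) attention weights
        return (weights * h).sum(dim=1)            # (B, 2*hidden) sentence vec

enc = MorphemeEncoder()
sent_vec = enc(torch.randint(0, 5000, (2, 12)))    # batch of 2 sentences
print(sent_vec.shape)                              # torch.Size([2, 256])
```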
S30, acquiring a text phoneme association feature triplet, which comprises the following steps: identifying and separating word phonemes after word segmentation, and adding phoneme association keywords to obtain keyword phoneme characteristics; identifying and separating sentence phonemes from the text to obtain original phoneme features; acquiring regular phoneme characteristics after sentence phoneme recognition and separation; combining the key word phoneme features, the original phoneme features and the regular phoneme features;
Referring to fig. 2, a flow of obtaining feature triples of phoneme association in the present scheme is provided:
First), obtaining keyword phoneme features:
Firstly, the input text is segmented by the jieba module in Python and a word list is output. Python is a cross-platform computer programming language, a high-level scripting language combining interpretation, compilation, interactivity and object orientation; originally designed for writing automation scripts (shell), it is increasingly used for independent, large-scale project development as versions are updated and new language features are added. The jieba module is a Chinese-text processing module (complementing the regular-expression re module), used because extracting specific information from strings with slices alone is impractical; its characteristics include: i) accurate mode: cuts sentences most precisely and is suitable for text analysis; ii) full mode: quickly scans out all the words in a sentence that can form words, but cannot resolve ambiguity; iii) search-engine mode: on the basis of accurate mode, re-segments long words to improve recall, suitable for search-engine word segmentation. After segmentation with the jieba module, a word list is obtained, and the following steps can then be performed: 1) the word phonemes are recognized and separated; the specific flow is shown in fig. 3:
1.1 Converting the text into pinyin by pinyin tool bags;
1.2 When the pinyin belongs to the whole syllable, setting the initial consonant as NA and the final as the whole syllable; wherein the whole syllable library comprises syllables of: "zhi, chi, shi, ri, zi, ci, si, yi, wu, yu, ye, yue, yin, yun, yuan, ying" and the like;
1.3 When the pinyin does not belong to the whole syllable and starts with the double-initial consonant, the initial consonant is set as the double-initial consonant, and the rest phonemes are vowels; wherein the double initial library comprises: "zh, ch, sh" and the like;
1.4 When the pinyin is not an integral syllable and does not start with double initials, the initial consonants are set as single initials, and the rest phonemes are vowels; wherein the single initial consonant library comprises "b, p, m, f, d, t, n, l, g, k, h, j, q, x, z, c, s, y, w, r";
1.5 The initial consonant and the final are identified, and the initial consonant and the final are separated by a symbol "#", so that a phoneme identification and separation result is obtained.
After obtaining the phoneme recognition and separation result, step 2) is then performed: keyword identification adding operation is performed on the phrases separated from the initial and the final, as shown in fig. 4, mainly including:
2.1 Importing a key word stock, and comparing the phoneme recognition and separation result with the key words word by word; the keyword library is a default keyword library of the system;
2.2 Judging whether each character has the same initial consonant or vowel, if so, further judging whether the conditions of the same initial consonant and vowel are multiple, if so, selecting the nearest result, and adding a keyword association identifier behind the corresponding original word group; the format is 'onium (associative keyword)'; if it is initially determined that there is no word with the same initial consonant or the same final, then identification may not be performed because no ambiguity is created. Finally, after all the phrases are processed, the key word phoneme characteristics are obtained in a summarizing mode.
In this step, the text is refined to a finer phoneme granularity; keywords with similar pronunciation are identified in the original text according to the existing keyword library, and the information of the associated keywords is supplemented into the original information, giving the artificial intelligence association capability.
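A sketch of this comparison, reusing the separate() helper from the sketch above (the two-entry keyword library and the "most matching syllables" nearness rule are illustrative assumptions):

```python
# Hypothetical keyword library mapping each keyword to its phoneme string
KEYWORD_LIB = {kw: separate(kw) for kw in ("开会", "开车")}

def add_association(phrase: str) -> str:
    """Append '(keyword)' after a phrase whose phonemes resemble a keyword."""
    ph = separate(phrase)
    candidates = []
    for kw, kw_ph in KEYWORD_LIB.items():
        same = sum(1 for a, b in zip(ph.split(), kw_ph.split())
                   if a.split("#")[0] == b.split("#")[0]    # same initial
                   or a.split("#")[1] == b.split("#")[1])   # same final
        if same:
            candidates.append((same, kw))
    if not candidates:
        return ph                    # no ambiguity: leave the result untouched
    _, best = max(candidates)        # closest match wins
    return f"{ph}({best})"           # keyword association identifier

print(add_association("开挥"))       # k#ai h#ui(开会): near-homophone recovered
```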
Second), original phoneme features are obtained:
In fig. 2, after the text data is input, jieba word segmentation is required only on the branch that obtains the keyword phoneme features; to obtain the original phoneme features and the regular phoneme features, no segmentation is needed: the whole sentence is simply subjected to phoneme recognition and separation, and the processed result is used directly as the original phoneme features.
It should be noted that a regular expression describes a pattern for matching strings, and can be used to check whether a string contains a certain substring, to replace the matched substring, or to extract from a string the substrings meeting a certain condition. It standardizes the processing of strings, similar to building a mathematical expression.
Third), acquiring regular phoneme features:
3.1 Obtaining a sentence phoneme recognition and separation result and a primary regular library, wherein the primary regular library is a default regular library of the system; wherein the original regular library is a manually-arranged regular library list of text intents applied by the original system.
3.2 Converting the original regular base into a phoneme part matching regular base, and numbering a plurality of regularities in the phoneme regular base; fig. 5 is a schematic flow chart of converting an original regular database into a phoneme part matching regular database:
Firstly, an original regular expression is input. Then, in a loop, each character in it is judged: if the character is not a Chinese character, it is added directly to the new phoneme-regular string; if it is a Chinese character, it is further judged whether it belongs to a phrase; if it is a single character, the character's phoneme separation result is added to the new phoneme-regular string; if it belongs to a phrase, a regular expression that partially matches the initials or finals of the phrase is added to the new phoneme-regular string.
The implementation of this step is illustrated with an example. Suppose the original regular expression ".*(他|人).*(开会|开车|在忙|忙|不方便|没空).*" (roughly: "he/person" followed by "in a meeting / driving / busy / inconvenient / unavailable") is input. Following the above flow, each character is first judged to be Chinese or not, and each Chinese part is judged to be a phrase or a single character. "他" (he) and "人" (person) are single characters, so only their phoneme separation results are added to the new phoneme-regular string, giving ((\s|^)t#a($|\s)|(\s|^)r#en($|\s)); the words "开会" (in a meeting), "开车" (driving), "在忙"/"忙" (busy), "不方便" (inconvenient) and "没空" (unavailable) are handled with the phrase rule, so regular expressions partially matching their initials or finals are added to the new phoneme-regular string, giving:
((\s|^)k#ai\sh#ui($|\s)|(\s|^)k#ai\sch#e($|\s)|(\s|^)z#ai\sm#ang($|\s)|(\s|^)m#ang($|\s)|(\s|^)b#u\sf#ang\sb#ian($|\s)|(\s|^)m#ei\sk#ong($|\s))
Finally, through this procedure, the original regular expression is converted into the phoneme partial-match regular expression, namely:
".*((\s|^)t#a($|\s)|(\s|^)r#en($|\s)).*((\s|^)k#ai\sh#ui($|\s)|(\s|^)k#ai\sch#e($|\s)|(\s|^)z#ai\sm#ang($|\s)|(\s|^)m#ang($|\s)|(\s|^)b#u\sf#ang\sb#ian($|\s)|(\s|^)m#ei\sk#ong($|\s)).*"
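A sketch of this conversion, reusing the separate() helper from the earlier sketch (the phrase lookup, the CJK character-range test and the anchor pattern are assumptions consistent with the example above):

```python
def to_phoneme_regex(rule: str, phrases: set) -> str:
    """Convert an original Chinese regex into a phoneme partial-match regex."""
    out, i = [], 0
    while i < len(rule):
        # longest known phrase starting at position i, if any
        phrase = next((p for p in sorted(phrases, key=len, reverse=True)
                       if rule.startswith(p, i)), None)
        if phrase:                                # phrase Chinese characters
            body = separate(phrase).replace(" ", r"\s")
            out.append(r"(\s|^)" + body + r"($|\s)")
            i += len(phrase)
        elif "\u4e00" <= rule[i] <= "\u9fff":     # single Chinese character
            out.append(r"(\s|^)" + separate(rule[i]) + r"($|\s)")
            i += 1
        else:                                     # non-Chinese: copy as-is
            out.append(rule[i])
            i += 1
    return "".join(out)

print(to_phoneme_regex(".*(他|人).*(开会|开车).*", phrases={"开会", "开车"}))
```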
3.3 And performing regular phoneme matching on the result according to a regular phoneme library, and generating a single-heat feature according to semantics corresponding to the hit regular number to serve as regular phoneme features.
In this step it should be noted that, since the phoneme regular expressions correspond to more than one kind of semantics (commonly several to more than ten different kinds), the regular expressions need to be numbered; the sentence phoneme recognition and separation result is then matched against the numbered expressions one by one. If an expression hits, the semantics corresponding to its number are encoded as a one-hot feature; if it does not hit, no one-hot feature needs to be extracted. Finally, all one-hot features are collected to serve as the regular phoneme features.
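A sketch of the matching and one-hot encoding (the two-rule library and its semantics labels are invented for illustration):

```python
import re

# Numbered phoneme regular library: number -> (phoneme regex, semantics label)
PHONEME_RULES = {
    0: (r".*((\s|^)t#a($|\s)).*((\s|^)k#ai\sh#ui($|\s)).*", "he_is_in_meeting"),
    1: (r".*((\s|^)m#ei\sk#ong($|\s)).*",                   "unavailable"),
}

def regular_phoneme_features(sentence_phonemes: str) -> list:
    """One-hot vector over rule numbers: 1 where the numbered regex hits."""
    return [1 if re.match(rx, sentence_phonemes) else 0
            for rx, _label in PHONEME_RULES.values()]

s = "t#a z#ai k#ai h#ui"                  # phoneme string of an ASR sentence
print(regular_phoneme_features(s))        # [1, 0]
```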
In this step, the original library of text-level rules is turned into the concrete phoneme-level logic of partial matching on initials and finals; the semantic result of phoneme-level sentence regular matching is supplied to the model input as a feature, effectively exploiting existing expert experience to give the artificial intelligence sentence-level association capability.
Furthermore, it should be noted that one-hot features are discrete, unordered features; in this step, the semantics contained in each phoneme regular expression are unrelated to each other, hence the one-hot encoding. In machine-learning algorithms, however, a discrete, unordered one-hot categorical feature cannot be fed directly into many classifiers, which usually expect continuous, ordered data. The present method therefore also combines the feature triplet to obtain continuous features.
S40, inputting the feature triplets into a text-cnn convolutional neural network to perform phoneme association understanding, convoluting and extracting local combined features, and outputting phoneme association vectors;
In this step, text-cnn is used to perform local feature extraction on the phoneme string after original-sentence phoneme separation and on the phoneme string after keyword association has been added, extracting the local combination features of each, to obtain the phoneme association vector.
The text-cnn model used here for classifying the text comprises:
an embedding layer, whose aim is to obtain word vectors;
a convolution layer: the word vectors pass through a one-dimensional convolution layer to obtain two output channels;
a MaxPooling layer: the third layer is a 1-max pooling layer, so that sentences of different lengths all become fixed-length representations after the pooling layer;
a FullConnection and Softmax layer: i.e., the fully connected layer, which outputs the probability of each category. In this step, after text-cnn training, the semantic understanding vector based on phoneme association can be obtained.
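A minimal text-cnn sketch over phoneme-token ids (the kernel sizes, the two output channels per kernel and the vocabulary size are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """embedding -> 1-D convolutions -> 1-max pooling -> phoneme vector."""
    def __init__(self, vocab=2000, emb=64, channels=2, kernels=(2, 3, 4)):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb, channels, k) for k in kernels)

    def forward(self, ids):                        # (B, T) phoneme-token ids
        x = self.emb(ids).transpose(1, 2)          # (B, emb, T)
        # 1-max pooling makes every sentence a fixed-length representation
        pooled = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        return torch.cat(pooled, dim=1)            # (B, channels*len(kernels))

cnn = TextCNN()
phoneme_vec = cnn(torch.randint(0, 2000, (2, 30)))
print(phoneme_vec.shape)                           # torch.Size([2, 6])
```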
S50, weighting the morpheme understanding vector and the phoneme association vector, combining the weighted result with the regular phoneme features to obtain a joint vector, and inputting the joint vector into a classifier for semantic understanding and classification of the speech.
Referring to fig. 6, fig. 6 is a flow chart of a general combined understanding method according to an embodiment of the invention;
In this step, the vectors from the morpheme branch and the phoneme branch are weighted using attention techniques. A general definition of attention: given a set of value vectors and a query vector, the attention mechanism computes a weighted sum of the values according to the query; that is, attention computes a "weight" for every value in the set, and the output focuses on (attends to) different parts of the original text (the query attends to the values). Because the weights occupied by the morpheme understanding vector and the phoneme association vector are what is wanted, the weighted result is combined with the one-hot semantic features corresponding to the hit regular-expression numbers to form a joint vector. At this point, the semantic understanding vector is no longer a single vector based on morpheme understanding or phoneme understanding alone, but a joint vector of the morphemes, the phonemes, their weights, and the one-hot semantic features of the hit regular numbers. This vector is input into ESIM (Enhanced Sequential Inference Model, an enhanced LSTM model) for morpheme-phoneme feature interaction, and finally input into the classifier for model semantic-understanding classification.
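A sketch of the joint-vector construction (the scalar attention gate and the final linear classifier are simplifying assumptions; the patent's pipeline passes the joint vector through ESIM before classification):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointUnderstanding(nn.Module):
    """Attention-weight morpheme/phoneme vectors, append regex one-hots."""
    def __init__(self, dim=256, n_rules=10, n_classes=5):
        super().__init__()
        self.score = nn.Linear(dim, 1)                 # query-style scoring
        self.clf = nn.Linear(dim + n_rules, n_classes)

    def forward(self, morpheme_vec, phoneme_vec, rule_onehot):
        stack = torch.stack([morpheme_vec, phoneme_vec], 1)  # (B, 2, dim)
        w = F.softmax(self.score(stack), dim=1)        # weight of each branch
        weighted = (w * stack).sum(dim=1)              # (B, dim)
        joint = torch.cat([weighted, rule_onehot], 1)  # joint vector
        return self.clf(joint)                         # class logits

m = JointUnderstanding()
logits = m(torch.randn(2, 256), torch.randn(2, 256), torch.zeros(2, 10))
print(logits.shape)                                    # torch.Size([2, 5])
```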
It should be appreciated that classification is a very important data-mining method. The concept of classification is to learn a classification function or construct a classification model (what is commonly called a classifier) from existing data. The function or model maps data records in a database to one of a set of given classes and can therefore be applied to data prediction. In short, classifier is the general term for the methods that classify samples in data mining, including algorithms such as decision trees, logistic regression, naive Bayes and neural networks.
Common classifiers include the following three types:
1) Decision tree classifier:
A set of attributes is provided, and a decision tree classifies data by making a series of decisions based on the set of attributes. This process is similar to identifying a plant by its characteristics. Such a classifier may be applied to determine the credit level of a person, e.g., a decision tree may conclude that "a person with a home, a car with a value between $1.5 and $2.3, two children" has good credit. The decision tree generator generates a decision tree from a "training set". The visualization tool provided by the SGI company's data mining tool MineSet uses a tree graph in which each decision is represented by a node of the tree to display the structure of the decision tree classifier. The graphical representation may help the user understand the classification algorithm, providing a valuable viewing perspective for the data. The generated classifier may be used to classify the data.
2) Selecting a tree classifier:
The selection tree classifier classifies the data using techniques similar to decision tree classifiers. Unlike a decision tree, a selection tree contains special selection nodes, which have multiple branches. For example, a selection node in a selection tree for distinguishing the place of origin of automobiles may select horsepower, the number of cylinders, or the weight of the automobile as the information attribute. In a decision tree, one node can choose at most one attribute at a time as the consideration object; when classifying in a selection tree, various situations can be considered together. A selection tree is typically more accurate than a decision tree, but also much larger. The selection tree generator generates the selection tree from the training set, using the same algorithm with which the decision tree generator generates the decision tree. The MineSet visualization tool displays the selection tree using a selection-tree graph, which can help the user understand the classifier and discover which attributes are more important in deciding the label attribute value. It may also be used to classify data.
3) Evidence classifier:
An evidence classifier classifies data by examining the likelihood that a particular outcome occurs given the attributes. For example, it may determine that a person with a car valued between $1.5 and $2.3 has a 70% likelihood of good credit and a 30% likelihood of poor credit. The classifier classifies and predicts data using the maximum probability value, on the basis of a simple probability model. As with the decision tree classifier, the generator produces the evidence classifier from the training set. The MineSet visualization tool displays the classifier with an evidence graph composed of a series of pie charts describing the different probability values. The evidence graph can help the user understand the classification algorithm, provide insight into the data, and help the user answer "what if ..." style questions. It may also be used to classify data.
In this step, the combination of the morphemes, the phonemes and the one-hot features of the hit regular numbers is input into the ESIM model for training and feature interaction, understanding the speech from the semantic angle and the semantics from the speech angle, thereby obtaining good speech semantic-understanding capability.
According to the semantic understanding method based on phoneme association and deep learning described above, keyword association, sentence-level phoneme matching and joint morpheme-phoneme modeling are performed on the converted text; association capability for homophones and similar-sounding words is added to the speech-recognition text, the semantic-understanding capability of artificial intelligence over speech-converted text is strengthened, the accuracy of natural-language understanding of the text is improved, and the development cost of the semantic understanding model is reduced.
Second aspect:
referring to fig. 7-8, an embodiment of the present invention further provides a semantic understanding device based on phoneme association and deep learning, including:
a speech conversion module 01, configured to convert the acquired speech into text through an ASR model;
the morpheme understanding vector output module 02, used for segmenting the text, inputting it sequentially into a word2vec model and a bi-lstm model, and outputting the morpheme understanding vector of the text;
A feature triplet obtaining module 03, configured to obtain a feature triplet associated with a text phoneme, including: identifying and separating word phonemes after word segmentation, and adding phoneme association keywords to obtain keyword phoneme characteristics; identifying and separating sentence phonemes from the text to obtain original phoneme features; acquiring regular phoneme characteristics after sentence phoneme recognition and separation; combining the key word phoneme features, the original phoneme features and the regular phoneme features;
The phoneme association vector output module 04 is used for inputting the feature triplet into a text-cnn convolutional neural network to perform phoneme association understanding, convoluting and extracting local combination features, and outputting a phoneme association vector;
the total joint understanding module 05 is used for inputting the morpheme understanding vector and the phoneme association vector into the ESIM model for feature interaction, carrying out morpheme and phoneme joint understanding through the bi-lstm model, and outputting a space vector for semantic understanding.
Third aspect:
The embodiment of the invention also provides a terminal device, which comprises:
one or more processors;
a memory coupled to the processor for storing one or more programs;
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the phoneme association and deep learning based semantic understanding method as described in any of the above.
The processor is used to control the overall operation of the computer terminal device so as to complete all or part of the steps of the semantic understanding method described above. The memory is used to store various types of data to support operation at the computer terminal device; such data may include, for example, instructions for any application or method operating on the device, as well as application-related data. The memory may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
The terminal device may be implemented by one or more Application Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Digital Signal Processing Devices (DSPD), Programmable Logic Devices (PLD), Field Programmable Gate Arrays (FPGA), controllers, microcontrollers, microprocessors or other electronic components, for executing the semantic understanding method based on phoneme association and deep learning according to any one of the above embodiments and achieving technical effects consistent with the method described above.
Fourth aspect:
An embodiment of the present invention also provides a computer readable storage medium including program instructions which, when executed by a processor, implement the steps of the phoneme association and deep learning based semantic understanding method as described in any of the embodiments above. For example, the computer readable storage medium may be the above memory including program instructions executable by a processor of the computer terminal device to perform the semantic understanding method based on phoneme association and deep learning according to any of the above embodiments, and achieve technical effects consistent with the method as described above.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention, such changes and modifications are also intended to be within the scope of the invention.

Claims (8)

1. A semantic understanding method based on phoneme association and deep learning, comprising:
converting the acquired voice into text through an ASR model;
The text is segmented and then sequentially input into a word2vec model and a bi-lstm model, and a morpheme understanding vector of the text is output; comprising: inputting the text into the word2vec model to obtain the word vectors corresponding to the text, and performing word-vector embedding; inputting the embedded word vectors into the bi-lstm model for semantic understanding; and inputting the result into a self-attention model to obtain a sentence vector of the text, wherein the sentence vector serves as the morpheme understanding vector of the text;
Obtaining a feature triplet of text phoneme associations, comprising: identifying and separating word phonemes after word segmentation, and adding phoneme-association keywords to obtain keyword phoneme features; identifying and separating sentence phonemes from the text to obtain original phoneme features; and obtaining regular phoneme features after sentence phoneme recognition and separation, comprising: acquiring the sentence phoneme recognition and separation result and an original regular library, wherein the original regular library is a default regular library of the system, converting the original regular library into a phoneme partial-match regular library, numbering the regular expressions in the phoneme partial-match regular library, performing phoneme regular matching on the result according to the phoneme partial-match regular library, and generating a one-hot feature according to the semantics corresponding to each hit regular-expression number, to serve as the regular phoneme features; and combining the keyword phoneme features, the original phoneme features and the regular phoneme features;
inputting the feature triplet into a text-cnn convolutional neural network for phoneme association understanding, convolving to extract local combination features, and outputting a phoneme association vector; wherein the text-cnn performs local feature extraction on both the phoneme string after original-sentence phoneme separation and the phoneme string after keyword association is added, extracting the local combination features of each to obtain the phoneme association vector;
and weighting the morpheme understanding vector and the phoneme association vector, combining the weighted result with the regular phoneme features to obtain a joint vector, and inputting the joint vector into a classifier for semantic understanding and classification of the voice.
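By way of illustration only (not part of the claims): a minimal PyTorch sketch of the final fusion step of claim 1, assuming an attention weighting driven by a learned scalar score per vector, 128-dimensional understanding vectors, 20 regular-expression one-hot features, and 10 semantic classes; all of these dimensions, and the scoring form itself, are assumptions rather than values fixed by the patent.

    import torch
    import torch.nn as nn

    class JointClassifier(nn.Module):
        """Fuse the two understanding vectors and classify (sketch only)."""
        def __init__(self, dim: int, n_regex: int, n_classes: int):
            super().__init__()
            self.score = nn.Linear(dim, 1)                 # scalar attention score
            self.out = nn.Linear(dim + n_regex, n_classes)

        def forward(self, morpheme_vec, phoneme_assoc_vec, regex_onehot):
            # stack the two vectors and softmax their attention scores
            stacked = torch.stack([morpheme_vec, phoneme_assoc_vec], dim=1)  # (B, 2, D)
            weights = torch.softmax(self.score(stacked), dim=1)              # (B, 2, 1)
            fused = (weights * stacked).sum(dim=1)                           # (B, D)
            # append the regular (regex) phoneme one-hot features -> joint vector
            joint = torch.cat([fused, regex_onehot], dim=-1)                 # (B, D + R)
            return self.out(joint)                                           # class logits

    model = JointClassifier(dim=128, n_regex=20, n_classes=10)
    logits = model(torch.randn(4, 128), torch.randn(4, 128), torch.zeros(4, 20))
    print(logits.shape)  # torch.Size([4, 10])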
2. The semantic understanding method based on phoneme association and deep learning as claimed in claim 1, wherein the recognizing and separating word phonemes after the word segmentation comprises:
converting the text into pinyin through a pinyin toolkit;
when the pinyin is a whole syllable, setting the initial to NA and the final to the whole syllable;
when the pinyin is not a whole syllable and starts with a double initial, setting the initial to the double initial and the remaining phonemes to the final;
when the pinyin is not a whole syllable and does not start with a double initial, setting the initial to a single initial and the remaining phonemes to the final;
and identifying the initials and finals, and separating them with symbols to obtain a phoneme recognition and separation result.
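For illustration only (not part of the claims): the three separation rules of claim 2 can be sketched in Python, assuming the third-party pypinyin package as the "pinyin toolkit", the standard list of whole syllables, and "-" as the separator symbol; the toolkit choice and separator are assumptions.

    from pypinyin import lazy_pinyin

    # whole syllables (zhengti rendu yinjie) of standard pinyin
    WHOLE_SYLLABLES = {
        "zhi", "chi", "shi", "ri", "zi", "ci", "si",
        "yi", "wu", "yu", "ye", "yue", "yuan", "yin", "yun", "ying",
    }
    DOUBLE_INITIALS = ("zh", "ch", "sh")  # two-letter initials

    def split_syllable(py: str) -> tuple:
        """Return (initial, final) following the three rules of claim 2."""
        if py in WHOLE_SYLLABLES:
            return "NA", py                # whole syllable: initial is NA
        if py.startswith(DOUBLE_INITIALS):
            return py[:2], py[2:]          # double initial + remaining final
        return py[:1], py[1:]              # single initial + remaining final

    def separate_phonemes(text: str, sep: str = "-") -> str:
        """Convert text to pinyin and separate initials and finals with a symbol."""
        return sep.join(
            f"{initial}{sep}{final}"
            for initial, final in (split_syllable(py) for py in lazy_pinyin(text))
        )

    print(separate_phonemes("证券"))  # zh-eng-q-uan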
3. The semantic understanding method based on phoneme association and deep learning as claimed in claim 2, further comprising, after the recognition and separation of the word phonemes following word segmentation:
importing a keyword library, and comparing the phoneme recognition and separation result with the keywords word by word; and
judging whether each word shares the same initial or final with a keyword and yields a plurality of matching results; if so, adding a keyword association identifier after the phrase corresponding to the nearest result; if not, leaving the separation result unprocessed.
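A loose illustrative sketch of this association step follows; it reuses split_syllable() from the claim-2 sketch above, and the keyword library contents, the "nearest result" metric (highest phoneme overlap), and the identifier format #ASSOC(...) are all assumptions made for illustration, not the patent's exact scheme.

    from pypinyin import lazy_pinyin
    # reuses split_syllable() from the claim-2 sketch above

    def word_phonemes(word: str) -> list:
        return [split_syllable(py) for py in lazy_pinyin(word)]

    def associate(word: str, keywords: list) -> str:
        """Append an association identifier when the word shares initials or
        finals with several keywords; keep the nearest (highest-overlap) one."""
        w = word_phonemes(word)
        scores = {}
        for kw in keywords:
            overlap = sum(
                (wi == ki) + (wf == kf)
                for (wi, wf), (ki, kf) in zip(w, word_phonemes(kw))
            )
            if overlap:
                scores[kw] = overlap
        if len(scores) >= 2:                       # "a plurality of results"
            nearest = max(scores, key=scores.get)  # nearest = most shared phonemes
            return f"{word}#ASSOC({nearest})"      # assumed identifier format
        return word                                # otherwise leave unprocessed

    print(associate("政权", ["证券", "正常"]))  # 政权#ASSOC(证券)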
4. The semantic understanding method based on phoneme association and deep learning as claimed in claim 1, wherein the converting the original regular-expression library into the phoneme partial-matching regular-expression library comprises:
judging the attribute and structure of each character in each original regular expression, performing the corresponding phoneme regularization processing, and integrating the processing results to obtain the phoneme partial-matching regular-expression library, wherein the processing comprises:
when the character is a Chinese character belonging to a phrase, adding a regular expression matching the initial or final part of the phrase to the new phoneme regular-expression string; when the character is a single Chinese character, adding the phoneme separation result of the character to the new phoneme regular-expression string; and when the character is not Chinese, adding the character directly to the new phoneme regular-expression string.
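For illustration only: a simplified sketch of this conversion, where treating any run of two or more Chinese characters as a "phrase", the (?:initial|final) alternation syntax, and the "-" separator are assumptions chosen for the example; split_syllable() from the claim-2 sketch is assumed to be in scope.

    import re
    from pypinyin import lazy_pinyin
    # reuses split_syllable() from the claim-2 sketch above

    def to_phoneme_regex(pattern: str) -> str:
        """Rewrite a plain regular expression for phoneme partial matching."""
        out = []
        for run in re.findall(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+", pattern):
            if not re.match(r"[\u4e00-\u9fff]", run):
                out.append(run)                    # non-Chinese: copy verbatim
                continue
            pairs = [split_syllable(py) for py in lazy_pinyin(run)]
            if len(run) == 1:                      # single Chinese character:
                initial, final = pairs[0]          # add its full phoneme split
                out.append(f"{initial}-{final}")
            else:                                  # phrase: match the initial OR
                out.append("-".join(               # the final part per syllable
                    f"(?:{i}|{f})" for i, f in pairs
                ))
        return "".join(out)

    print(to_phoneme_regex("买入证券"))  # (?:m|ai)-(?:r|u)-(?:zh|eng)-(?:q|uan)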
5. The semantic understanding method based on phoneme association and deep learning as claimed in claim 1, wherein the converting the acquired voice into text further comprises performing the voice conversion using a neural network algorithm and a CTC algorithm; the segmenting of the text comprises segmenting the text through the jieba word segmentation module in Python; and the weighting of the morpheme understanding vector and the phoneme association vector uses an attention mechanism.
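Two of the utilities named in claim 5 can be sketched briefly for illustration: jieba word segmentation and an attention-style weighting of the two understanding vectors. The scalar-score softmax form of the weighting and the 8-dimensional toy vectors are assumptions for the example.

    import jieba
    import numpy as np

    print(jieba.lcut("今天买入证券"))  # e.g. ['今天', '买入', '证券']

    def attention_weight(u: np.ndarray, v: np.ndarray, w: np.ndarray) -> np.ndarray:
        """Softmax-weight two vectors u, v with scores from a learned query w."""
        scores = np.array([w @ u, w @ v])
        a = np.exp(scores - scores.max())   # numerically stable softmax
        a /= a.sum()
        return a[0] * u + a[1] * v

    rng = np.random.default_rng(0)
    u, v, w = rng.random(8), rng.random(8), rng.random(8)
    print(attention_weight(u, v, w))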
6. A semantic understanding device based on phoneme association and deep learning, comprising:
the voice conversion module is used for converting the acquired voice into text through an ASR model;
the morpheme understanding vector output module is used for segmenting the text, inputting it sequentially into a word2vec model and a bi-lstm model, and outputting a morpheme understanding vector of the text, comprising: inputting the text into the word2vec model to obtain word vectors corresponding to the text and performing word vector embedding; inputting the embedded word vectors into the bi-lstm model for semantic understanding; and inputting the understood result into a self-attention model to obtain a sentence vector of the text, the sentence vector serving as the morpheme understanding vector of the text;
the feature triplet obtaining module is used for obtaining a feature triplet of text phoneme associations, comprising: recognizing and separating word phonemes after word segmentation, and adding phoneme association keywords to obtain keyword phoneme features; recognizing and separating sentence phonemes from the text to obtain original phoneme features; and obtaining regular phoneme features after the sentence phoneme recognition and separation, comprising: acquiring the sentence phoneme recognition and separation result and an original regular-expression library, the original regular-expression library being the system default, converting the original regular-expression library into a phoneme partial-matching regular-expression library, numbering the regular expressions in the phoneme partial-matching regular-expression library, performing phoneme regular-expression matching on the result according to that library, and generating a one-hot feature from the semantics corresponding to the numbers of the hit regular expressions to serve as the regular phoneme features; and combining the keyword phoneme features, the original phoneme features, and the regular phoneme features;
the phoneme association vector output module is used for inputting the feature triplet into a text-cnn convolutional neural network for phoneme association understanding, convolving to extract local combination features, and outputting a phoneme association vector; wherein the text-cnn performs local feature extraction on both the phoneme string after original-sentence phoneme separation and the phoneme string after keyword association is added, extracting the local combination features of each to obtain the phoneme association vector;
and the total joint understanding module is used for weighting the morpheme understanding vector and the phoneme association vector, combining the weighted result with the regular phoneme features to obtain a joint vector, and inputting the joint vector into a classifier for semantic understanding and classification of the voice.
7. A terminal device, comprising:
one or more processors;
a memory coupled to the processor for storing one or more programs;
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the phoneme association and deep learning based semantic understanding method as recited in any one of claims 1 to 5.
8. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program is executed by a processor to implement the phoneme association and deep learning based semantic understanding method according to any one of claims 1 to 5.
CN202010919954.8A 2020-09-03 2020-09-03 Semantic understanding method and device based on phoneme association and deep learning Active CN112183106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010919954.8A CN112183106B (en) 2020-09-03 2020-09-03 Semantic understanding method and device based on phoneme association and deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010919954.8A CN112183106B (en) 2020-09-03 2020-09-03 Semantic understanding method and device based on phoneme association and deep learning

Publications (2)

Publication Number Publication Date
CN112183106A CN112183106A (en) 2021-01-05
CN112183106B true CN112183106B (en) 2024-05-14

Family

ID=73925298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010919954.8A Active CN112183106B (en) 2020-09-03 2020-09-03 Semantic understanding method and device based on phoneme association and deep learning

Country Status (1)

Country Link
CN (1) CN112183106B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347262B (en) * 2021-01-11 2021-04-13 北京江融信科技有限公司 Text classification method and system, intention classification system and robot
CN115329785B (en) * 2022-10-15 2023-01-20 小语智能信息科技(云南)有限公司 English-Tai-old multi-language neural machine translation method and device integrated with phoneme characteristics
CN117744787B (en) * 2024-02-20 2024-05-07 中国电子科技集团公司第十研究所 Intelligent measurement method for first-order research rule knowledge rationality

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006343405A (en) * 2005-06-07 2006-12-21 Nippon Telegr & Teleph Corp <Ntt> Speech-understanding device, speech-understanding method, method for preparing word/semantic expression merge database, its program and storage medium
CN103714048A (en) * 2012-09-29 2014-04-09 国际商业机器公司 Method and system used for revising text
CN107016994A (en) * 2016-01-27 2017-08-04 阿里巴巴集团控股有限公司 The method and device of speech recognition
CN107577662A (en) * 2017-08-08 2018-01-12 上海交通大学 Towards the semantic understanding system and method for Chinese text

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006343405A (en) * 2005-06-07 2006-12-21 Nippon Telegr & Teleph Corp <Ntt> Speech-understanding device, speech-understanding method, method for preparing word/semantic expression merge database, its program and storage medium
CN103714048A (en) * 2012-09-29 2014-04-09 国际商业机器公司 Method and system used for revising text
CN107016994A (en) * 2016-01-27 2017-08-04 阿里巴巴集团控股有限公司 The method and device of speech recognition
CN107577662A (en) * 2017-08-08 2018-01-12 上海交通大学 Towards the semantic understanding system and method for Chinese text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research Progress in Speech Recognition and Understanding; Jiang Minghu, Zhu Xiaoyan, Yuan Baozong; Journal of Circuits and Systems; Vol. 4, No. 2; pp. 53-59 *

Also Published As

Publication number Publication date
CN112183106A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN111368996B (en) Retraining projection network capable of transmitting natural language representation
CN112183106B (en) Semantic understanding method and device based on phoneme association and deep learning
CN109887484B (en) Dual learning-based voice recognition and voice synthesis method and device
CN113987179B (en) Dialogue emotion recognition network model based on knowledge enhancement and backtracking loss, construction method, electronic equipment and storage medium
CN113094578B (en) Deep learning-based content recommendation method, device, equipment and storage medium
US7412093B2 (en) Hybrid apparatus for recognizing answer type
CN111145718A (en) Chinese mandarin character-voice conversion method based on self-attention mechanism
Jacob Modelling speech emotion recognition using logistic regression and decision trees
CN114676234A (en) Model training method and related equipment
CN112861524A (en) Deep learning-based multilevel Chinese fine-grained emotion analysis method
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN113837299B (en) Network training method and device based on artificial intelligence and electronic equipment
CN115662435B (en) Virtual teacher simulation voice generation method and terminal
CN111241820A (en) Bad phrase recognition method, device, electronic device, and storage medium
CN113435211A (en) Text implicit emotion analysis method combined with external knowledge
CN116521882A (en) Domain length text classification method and system based on knowledge graph
JP2022145623A (en) Method and device for presenting hint information and computer program
CN112101042A (en) Text emotion recognition method and device, terminal device and storage medium
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN116956835A (en) Document generation method based on pre-training language model
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN116757195A (en) Implicit emotion recognition method based on prompt learning
US20240037335A1 (en) Methods, systems, and media for bi-modal generation of natural languages and neural architectures
CN116524915A (en) Weak supervision voice-video positioning method and system based on semantic interaction
CN114626529B (en) Natural language reasoning fine tuning method, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant