WO2021244099A1 - Voice editing method, electronic device and computer readable storage medium - Google Patents


Info

Publication number
WO2021244099A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
word
vector
sentences
context
Prior art date
Application number
PCT/CN2021/080772
Other languages
French (fr)
Chinese (zh)
Inventor
晏小辉
左利鹏
皮特
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2021244099A1 publication Critical patent/WO2021244099A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Definitions

  • This application relates to the field of Artificial Intelligence (AI), and in particular to a voice editing method, an electronic device, and a computer-readable storage medium.
  • In a typical voice interaction, the electronic device recognizes the voice data input by the user and converts it into text in order to perform the corresponding operation.
  • When the user finds that the voice recognition is wrong, or wants to actively change a statement, the input voice data needs to be modified.
  • Existing methods for modifying input voice data generally require manually switching the input mode, for example switching from voice input to text input in order to modify the text data converted from the voice data, or they modify the text data according to a modification instruction only when a predefined prefix word is detected in the user's input. Either way, the interaction cost increases, the operation is complicated, and the user experience suffers.
  • In view of this, the present application provides a voice editing method, an electronic device, and a computer-readable storage medium that edit text data without additional interaction cost, are easy to operate, and improve the user experience.
  • In a first aspect, a voice editing method is provided, including: acquiring input voice data; converting the voice data into text data and dividing the text data into t sentences, where t is an integer greater than 1; calculating the semantic consistency confidence between the t-th sentence of the t sentences and the c sentences before the t-th sentence, where the semantic consistency confidence describes the degree of semantic relevance between the t-th sentence and the c sentences, and c is an integer greater than 0; and, if the semantic consistency confidence is less than a preset value, recognizing the t-th sentence and using the recognition result as an editing instruction to edit the text data.
  • In this method, after the voice data is obtained and converted into text data divided into t sentences, the semantic consistency confidence between the t-th sentence and the c sentences before it is calculated. If the confidence is less than the preset value, the t-th sentence has a low degree of semantic relevance to the c sentences, that is, it is not a coherent continuation of them, which indicates a topic change relative to the c sentences.
  • In that case the t-th sentence is recognized and the recognition result is used as an editing command to edit the text data, so that the editing operation on the text data is realized without any additional interaction by the user, which is easy to operate and improves the user experience.
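The decision flow of the first aspect can be sketched in Python. The function names and the toy stand-ins for the consistency and intent models below are hypothetical illustrations, not the patent's implementation:

```python
def process_utterance(context, sentence, consistency_fn, intent_fn, threshold=0.5):
    """context: the c sentences before the t-th sentence;
    sentence: the t-th sentence just transcribed from voice data.
    consistency_fn and intent_fn stand in for the patent's trained models."""
    confidence = consistency_fn(context, sentence)
    if confidence < threshold:
        # Low semantic relevance to the context: topic change, so treat
        # the sentence as an editing instruction rather than dictation.
        return ("edit", intent_fn(sentence))
    # Coherent continuation: store it as ordinary text.
    return ("store", sentence)

# Toy stand-ins (assumptions) for the two models.
toy_confidence = lambda ctx, s: 0.1 if s.startswith("replace") else 0.9
toy_intent = lambda s: ("replace",) + tuple(s.split('"')[1::2])

result = process_utterance(
    ["what are the arrangements for the hotel tomorrow"],
    'replace "hotel" with "nine o\'clock"',
    toy_confidence, toy_intent)
# result -> ("edit", ("replace", "hotel", "nine o'clock"))
```

A sentence that coheres with the context takes the "store" branch instead, matching the "greater than or equal to the preset value" case described later.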
  • In one implementation, calculating the semantic consistency confidence between the t-th sentence of the t sentences and the c sentences before the t-th sentence includes:
  • inputting the t sentences into a preset semantic consistency model, and obtaining the semantic consistency confidence, output by the model, between the t-th sentence of the t sentences and the c sentences before it. Since the preset semantic consistency model is trained on a large number of training samples, computing the semantic consistency confidence with it improves the accuracy and stability of the result.
  • In one implementation, the preset semantic consistency model is used to calculate a comprehensive representation vector of the t-th sentence according to the t-th sentence and the c sentences, where:
  • the comprehensive representation vector of the t-th sentence describes both the semantic association between the t-th sentence and the c sentences as a whole and the semantic association between the t-th sentence and each sentence of the c sentences, so it can represent more of the related information between the t-th sentence and the c sentences; the semantic consistency confidence is then determined from the comprehensive representation vector of the t-th sentence, which improves the accuracy of the semantic consistency confidence.
  • In one implementation, calculating the comprehensive representation vector of the t-th sentence according to the t-th sentence and the c sentences includes: determining, according to the t-th sentence and the c sentences, the context vector of each word of the t-th sentence and the context vector of each word of the c sentences; and calculating the comprehensive representation vector of the t-th sentence according to those context vectors.
  • In one implementation, determining the context vectors according to the t-th sentence and the c sentences includes: performing an attention operation between the t-th sentence and the c sentences, so as to capture more of the internal features between them, and obtaining the attention between the t-th sentence and the context; and then calculating the context vector of each word of the t-th sentence and the context vector of each word of the c sentences according to that attention.
  • In one implementation, performing the attention operation includes: performing word segmentation on the t-th sentence and determining, from the segmented t-th sentence, the hidden vector corresponding to each word of the t-th sentence; performing word segmentation on the c sentences and determining, from the segmented c sentences, the hidden vectors corresponding to the words of the c sentences; and performing an attention operation between the hidden vectors of the words of the t-th sentence and the hidden vectors of the words of the c sentences to obtain the attention between the t-th sentence and the context.
  • In one implementation, calculating the context vectors according to the attention includes: calculating the context representation of each word of the t-th sentence according to the attention and the hidden vectors of the words of the c sentences; performing a residual connection operation between those context representations and the hidden vectors of the words of the t-th sentence to obtain the context vector of each word of the t-th sentence; calculating the context representation of each word of the c sentences according to the attention and the hidden vectors of the words of the t-th sentence; and performing a residual connection operation between those context representations and the hidden vectors of the words of the c sentences to obtain the context vector of each word of the c sentences. The residual connections reduce signal loss and improve the accuracy of the calculation.
  • In one implementation, calculating the comprehensive representation vector of the t-th sentence from the context vectors includes: performing an attention operation between the context vectors of the words of the t-th sentence and the context vectors of the words of the c sentences to obtain the attention between the t-th sentence and the c sentences; and calculating the comprehensive representation vector of the t-th sentence according to that attention.
  • In one implementation, this calculation includes: computing, from the attention between the t-th sentence and the c sentences and the context vectors of the words of the c sentences, the context representation of each word of the t-th sentence with respect to the c sentences; and performing a residual connection operation between those context representations and the context vectors of the words of the t-th sentence to obtain the comprehensive representation vector of the t-th sentence, which reduces signal loss and improves the accuracy of the calculation.
  • In one implementation, determining the semantic consistency confidence from the comprehensive representation vector of the t-th sentence includes: determining the comprehensive representation vector of the c sentences from the context vectors of the words of the c sentences; splicing the comprehensive representation vector of the t-th sentence with the comprehensive representation vector of the c sentences; and determining the semantic consistency confidence from the spliced vector.
  • In one implementation, recognizing the t-th sentence includes: inputting the t-th sentence into a preset intent recognition model and obtaining the recognition result output by that model.
  • In one implementation, the voice editing method further includes: if the semantic consistency confidence is greater than or equal to the preset value, storing the text data.
  • a voice editing device including:
  • the acquisition module is used to acquire the input voice data
  • the sentence segmentation module is used to convert the voice data into text data and divide the text data into t sentences, where t is an integer greater than 1;
  • the calculation module is used to calculate the semantic consistency confidence of the tth sentence in the t sentences and the c sentences before the tth sentence, wherein the semantic consistency confidence is used to describe the The degree of semantic relevance between t sentences and said c sentences, where c is an integer greater than 0;
  • the recognition module is configured to recognize the t-th sentence if the semantic consistency confidence is less than a preset value, and use the recognition result as an editing instruction to edit the text data.
  • the calculation module is specifically configured to:
  • the calculation module includes:
  • the first calculation unit is configured to calculate a comprehensive representation vector of the t-th sentence according to the t-th sentence and the c sentences, wherein the comprehensive representation vector of the t-th sentence is used to describe the semantic association between the t-th sentence and the c sentences;
  • the second calculation unit is configured to determine the semantic consistency confidence level according to the comprehensive representation vector of the t-th sentence.
  • the first calculation unit is specifically configured to:
  • the first calculation unit is specifically further configured to:
  • the first calculation unit is specifically further configured to:
  • Attention operations are performed between the hidden vectors corresponding to the words of the t-th sentence and the hidden vectors corresponding to the words of the c sentences, to obtain the attention between the t-th sentence and the context.
  • the first calculation unit is specifically further configured to:
  • a residual connection operation is performed on the context representations of the words of the c sentences and the hidden vectors corresponding to the words of the c sentences to obtain the context vectors of the words of the c sentences.
  • the first calculation unit is specifically further configured to:
  • the first calculation unit is specifically further configured to:
  • a residual connection operation is performed between the context representation, with respect to the c sentences, of each word of the t-th sentence and the context vector of each word of the t-th sentence, to obtain the comprehensive representation vector of the t-th sentence.
  • the second calculation unit is specifically configured to:
  • the comprehensive representation vector of the t-th sentence and the comprehensive representation vector of the c sentences are spliced, and the semantic consistency confidence is determined according to the spliced vector.
  • the identification module is specifically configured to:
  • the t-th sentence is input into a preset intent recognition model, and a recognition result output by the preset intent recognition model is obtained.
  • the identification module is specifically further configured to:
  • the text data is stored.
  • In another aspect, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor.
  • When the processor executes the computer program, the voice editing method described in the first aspect is implemented.
  • a computer-readable storage medium stores a computer program that, when executed by a processor, implements the voice editing method described in the first aspect.
  • In another aspect, a computer program product is provided.
  • When the computer program product runs on a terminal device, the terminal device executes the voice editing method described in the first aspect.
  • FIG. 1 is an application scenario diagram of a voice editing method provided by an embodiment of this application
  • FIG. 2 is a diagram of another application scenario of the voice editing method provided by an embodiment of the application.
  • FIG. 3 is a schematic flowchart of a voice editing method provided by an embodiment of the application.
  • FIG. 4 is a schematic diagram of clause processing provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of calculating the confidence of semantic consistency provided by an embodiment of the application.
  • Fig. 6 is a schematic diagram of a semantic consistency model provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a specific implementation process of a voice editing method provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of an electronic device provided by an embodiment of the application.
  • The voice editing method provided in the embodiments of this application is applied to electronic equipment, where the electronic equipment can be a terminal such as a mobile phone, a tablet, a computer, a smart speaker, or a vehicle-mounted device, or a server; the embodiments impose no special restriction on the type of electronic equipment.
  • the methods provided in the embodiments of the present application may be executed entirely on the terminal, or entirely executed on the server, or partly executed on the terminal and partly executed on the server.
  • An electronic device, such as a smart speaker, acquires voice data input by a user, converts the voice data into text data, divides the text data into t sentences, and calculates the semantic consistency confidence between the t-th sentence and the c sentences before the t-th sentence, where the semantic consistency confidence describes the degree of semantic relevance between the t-th sentence and the c sentences. If the semantic consistency confidence is less than the preset value, the t-th sentence is recognized, and the recognition result is used as an editing instruction to edit the text data and obtain the updated text.
  • For example, the t-th sentence is: replace "hotel" with "nine o'clock";
  • the sentence before the t-th sentence, that is, the c sentences, is: What are the arrangements for the hotel tomorrow.
  • If the semantic consistency confidence between the t-th sentence and the c sentences is less than the preset value, the t-th sentence is recognized and the text data is edited according to the recognition result, and the updated text obtained is "What are the arrangements for tomorrow at nine o'clock". If the semantic consistency confidence is greater than or equal to the preset value, the text data is recorded, and the text cached in the electronic device is updated according to the recorded text data to obtain the updated text.
  • After obtaining the updated text, the electronic device recognizes the intent of the updated text, generates a corresponding reply text according to that intent, converts the reply text into voice data, and finally outputs the voice data. For example, if the intent of the updated text is to obtain schedule information, the electronic device obtains the reply text corresponding to the schedule information, converts it into voice data, and plays it; if the intent of the updated text is to play the song XX, the electronic device searches for the song XX, and if the corresponding song is found, obtains its audio, generates the reply text "The song XX will be played for you", converts the reply text into voice data and plays it, and finally plays the found audio of the song XX.
  • In this way, voice editing of text data is achieved without any additional interaction by the user, so that the electronic device can obtain the user's true intention in time and respond, which improves the user experience.
  • FIG. 2 shows another application scenario of the voice editing method provided by an embodiment of this application.
  • The user inputs voice data at a voice input terminal, such as the application software of a mobile phone or a web page on a computer, and the voice input terminal sends the user's voice data to a server.
  • The server converts the voice data into text data, divides the text data into t sentences, and calculates the semantic consistency confidence between the t-th sentence and the c sentences before the t-th sentence. If the semantic consistency confidence is less than the preset value, the t-th sentence is recognized and the recognition result is used as an editing instruction to edit the text data and obtain the updated text; if the semantic consistency confidence is greater than or equal to the preset value, the text data is recorded and the text cached in the server is updated according to the recorded text data to obtain the updated text, so that further operations can be performed on the basis of the updated text. For example, the server recognizes the intent of the updated text and sends the corresponding resource to the voice input terminal.
  • the voice editing method provided by the embodiment of the present application includes:
  • the electronic device collects voice data input by the user through a microphone.
  • S102 Convert the voice data into text data, and divide the text data into t sentences, where t is an integer greater than 1.
  • the collected voice data is input into a preset voice recognition model to obtain text data output by the preset voice recognition model.
  • The speech recognition model is obtained by using speech data and the corresponding text data as training samples and training a preset algorithm model with a machine learning algorithm.
  • The text data is segmented into t sentences, where the t-th sentence is converted from the voice data currently input by the user,
  • and the sentences before the t-th sentence are converted from the historical voice data input by the user; the sentences before the t-th sentence constitute the context (the preceding text) of the t-th sentence.
  • Punctuation marks or spaces are generated according to the pause intervals in the user's voice input. For example, if, while the user is inputting voice data, the pause between two words is longer than a preset duration, a punctuation mark or space is inserted between the two words during conversion; the punctuation mark can be a comma.
  • The text data is then divided into sentences at these punctuation marks or spaces. It should be noted that the sentences before the t-th sentence may be divided before the user inputs the current voice data, or all the voice data may first be converted into text data after the user finishes inputting, and the text data divided into sentences afterwards.
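As an illustration of the pause-based punctuation and sentence division described above, the following sketch inserts a comma wherever the silence between two timed words exceeds a preset duration, then splits at punctuation. The timing format and the 0.5 s threshold are assumptions, not values from the patent:

```python
import re

def punctuate(words, gap=0.5):
    """words: list of (token, start_s, end_s) from the recognizer.
    Insert a comma where the pause between adjacent words exceeds `gap`."""
    out = []
    for i, (tok, start, end) in enumerate(words):
        if i and start - words[i - 1][2] > gap:
            out[-1] += ","      # pause longer than `gap`: add punctuation
        out.append(tok)
    return " ".join(out)

def split_sentences(text):
    """Divide the text at the punctuation marks generated for pauses."""
    return [p.strip() for p in re.split(r"[,.!?;]+", text) if p.strip()]

text = punctuate([("what", 0.0, 0.2), ("time", 0.25, 0.5), ("replace", 1.4, 1.8)])
# text -> "what time, replace"
sentences = split_sentences(text)  # the last element is the t-th sentence
```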
  • S103 Calculate the semantic consistency confidence of the t-th sentence in the t sentences and the c sentences before the t-th sentence, where the semantic consistency confidence is used to describe the t-th sentence and the c The degree of semantic relevance of sentences, where c is an integer greater than 0, and c ⁇ t-1.
  • Specifically, the t-th sentence and the c sentences before the t-th sentence are input into the preset semantic consistency model to obtain the semantic consistency confidence output by the preset semantic consistency model.
  • the semantic consistency model takes the text data and the semantic consistency confidence between each sentence of the text data as the training sample, and is obtained after training the preset algorithm model with a machine learning algorithm.
  • the calculation principle of the semantic consistency model in the training process is the same as the application principle of the semantic consistency model.
  • the following takes the application of the semantic consistency model as an example to introduce the calculation process of the semantic consistency model when calculating the semantic consistency confidence.
  • The semantic consistency model is used to calculate the semantic association between the t-th sentence of the t sentences and the c sentences, and to determine the semantic consistency confidence according to that semantic association.
  • The semantic association can be the association between the t-th sentence and the c sentences as a whole, or the association between the t-th sentence and each sentence of the c sentences, or it can include both the association between the t-th sentence and the c sentences as a whole
  • and the association between the t-th sentence and each sentence of the c sentences.
  • The semantic association between the t-th sentence and the c sentences as a whole represents the semantic connection between the t-th sentence and the context taken as a whole,
  • while the semantic association between the t-th sentence and each sentence of the c sentences represents semantic relevance at the sentence level.
  • the semantic consistency model is used to calculate the semantic association between the tth sentence and the c sentence, and the semantic association between the tth sentence and each sentence in the c sentence, so that the tth sentence can be extracted More effective association information with c sentences, which in turn makes the output semantic consistency confidence more robust.
  • the comprehensive representation vector of the t-th sentence is used to describe the semantic association between the t-th sentence and the c sentence, and the semantic association between the t-th sentence and each sentence in the c sentence, that is, the semantic consistency model Firstly, the comprehensive representation vector of the t-th sentence is calculated according to the t-th sentence and the c-th sentence, and then the semantic consistency confidence is determined according to the comprehensive representation vector of the t-th sentence.
  • Specifically, word segmentation processing is performed on each sentence of the t sentences to obtain the words of the c sentences and the words of the t-th sentence.
  • If the text data is in English, each sentence is segmented according to the English words (that is, at the spaces between them);
  • if the text data is in Chinese, the text is segmented by comparing the text data against a set vocabulary database.
  • Here t denotes the index of the t-th sentence,
  • L_t denotes the number of words in the t-th sentence,
  • and L_i denotes the number of words in the i-th of the c sentences.
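A minimal sketch of the two segmentation paths just described. Greedy forward maximum matching is one common way to realise the "comparison against a set vocabulary database"; the vocabulary here is a hypothetical toy:

```python
def segment(text, vocab=frozenset(), max_len=4):
    """English text (with spaces) is split at the spaces; text without
    spaces is segmented by greedy forward maximum matching against `vocab`,
    falling back to single characters."""
    if " " in text or text.isascii():
        return text.split()
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if length == 1 or text[i:i + length] in vocab:
                words.append(text[i:i + length])
                i += length
                break
    return words

print(segment("replace hotel with nine"))
print(segment("明天酒店", vocab={"明天", "酒店"}))
```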
  • The preset semantic consistency model includes an embedding layer, a context encoder, a pooling layer, and a fully connected layer. The words of the t-th sentence are input into the embedding layer to obtain the semantic embedding representation of each word of the t-th sentence, where d_e denotes the dimension of the embedding vectors; likewise, the words of the c sentences are input into the embedding layer to obtain the semantic embedding representation of each word of the c sentences.
  • The semantic embedding representation of each word of the t-th sentence is input into the context encoder to obtain the hidden vector of each word of the t-th sentence,
  • where d_h denotes the dimension of the hidden vectors.
  • The context encoder may be, for example, a Recurrent Neural Network (RNN).
  • An attention operation is performed between the hidden vector of each word of the t-th sentence and the hidden vector of each word of the c sentences; that is, the hidden vector of each word of the t-th sentence attends to the hidden vector of each word of the c sentences, giving the attention between the t-th sentence and the context.
  • The attention between the t-th sentence and the context is computed with an attention weight function g applied to pairs of hidden vectors, where g takes a bilinear form of roughly g(a, b) = aᵀWb, in which a and b denote the two input vectors of the attention weight function, aᵀ denotes the transpose of a, and W is a parameter to be learned.
  • Optionally, the t-th sentence and the c sentences can each be represented by a single vector, and the attention operation is performed between the vector of the t-th sentence and the vector of the c sentences to obtain the attention between the t-th sentence and the context.
  • The context representation of each word of the t-th sentence and the context representation of each word of the c sentences are then calculated by applying a softmax operation (a normalizing, logistic-regression-style operation) to the attention and using the normalized weights to combine the hidden vectors: the context representation of each word of the t-th sentence is the attention-weighted sum of the hidden vectors of the words of the c sentences, and the context representation of each word of the c sentences is the attention-weighted sum of the hidden vectors of the words of the t-th sentence.
  • It follows from this construction that the context representations describe the semantically related information between each word of the t-th sentence and the c sentences, that is, the related information between the t-th sentence and the c sentences as whole sentences.
  • After the context representations are obtained, a residual connection operation is performed between the context representation of each word of the t-th sentence and the hidden vector corresponding to that word, giving the context vector of each word of the t-th sentence; similarly, a residual connection operation between the context representation of each word of the c sentences and the corresponding hidden vector gives the context vector of each word of the c sentences.
  • Optionally, the context representation of each word of the t-th sentence can be used directly as the context vector of that word,
  • and the context representation of each word of the c sentences can be used directly as the context vector of that word, omitting the residual connection.
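The attention, softmax weighting, and residual-connection steps above can be sketched with NumPy. Plain dot-product scoring stands in for the learned weight function g, which is an assumption of this sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_vectors(h_t, h_ctx):
    """h_t: (L_t, d_h) hidden vectors of the t-th sentence's words;
    h_ctx: (L_c, d_h) hidden vectors of the c sentences' words.
    Returns the context vectors of both sides after attention + residual."""
    attn = h_t @ h_ctx.T                    # attention of the t-th sentence and context
    c_t = softmax(attn, axis=1) @ h_ctx     # context representation of each word in t
    c_ctx = softmax(attn.T, axis=1) @ h_t   # context representation of each context word
    return c_t + h_t, c_ctx + h_ctx         # residual connections reduce signal loss

u_t, u_ctx = context_vectors(np.ones((2, 3)), np.ones((4, 3)))
# u_t has shape (2, 3) and u_ctx has shape (4, 3)
```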
  • Next, an attention operation is performed between the context vector of each word of the t-th sentence and the context vector of each word of the c sentences,
  • to obtain the attention between the t-th sentence and the c sentences.
  • A pooling operation is performed on the correlation vectors between each word of the t-th sentence and each sentence of the c sentences, giving the context representation of each word of the t-th sentence with respect to the c sentences. A residual connection operation is then performed between this context representation and the context vector of each word of the t-th sentence, giving the comprehensive representation vector of each word of the t-th sentence: the comprehensive representation vector of a word is the sum of its pooled context representation and its context vector.
  • The comprehensive representation vectors of the words of the t-th sentence are integrated into the comprehensive representation vector of the t-th sentence, and the set formed by the context vectors of the words of the c sentences
  • is integrated into the comprehensive representation vector of the c sentences. The comprehensive representation vector of the t-th sentence and the comprehensive representation vector of the c sentences are spliced, and the semantic consistency confidence is determined from the spliced vector; specifically, the spliced vector is passed through the fully connected layer, which outputs the semantic consistency confidence.
  • Optionally, the comprehensive representation vector of the t-th sentence and the comprehensive representation vector of the c sentences are first input into the pooling layer and pooled separately before splicing, to reduce errors introduced during the operation.
  • r represents the comprehensive representation vector of the t-th sentence after the pooling operation
  • r ctx represents the comprehensive representation vector of the c sentences after the pooling operation.
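As a concrete illustration of the pooling, splicing, and fully connected steps above, the following sketch computes a confidence from word-level representation vectors. The max-pooling choice, the sigmoid output, and the random weights are assumptions for illustration; the patent does not fix these details.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def consistency_confidence(rep_t, rep_ctx, W, b):
    """rep_t: comprehensive representation vectors of the words of the
    t-th sentence, shape (n_t, d); rep_ctx: representation vectors of
    the words of the c context sentences, shape (n_c, d). Pool each
    separately, splice the pooled vectors, and pass the result through
    a fully connected layer to get a confidence in (0, 1)."""
    r = rep_t.max(axis=0)                  # pooled representation of the t-th sentence
    r_ctx = rep_ctx.max(axis=0)            # pooled representation of the c sentences
    spliced = np.concatenate([r, r_ctx])   # spliced vector, shape (2d,)
    return float(sigmoid(spliced @ W + b)) # fully connected layer -> scalar confidence

# toy usage with random weights (illustration only)
rng = np.random.default_rng(0)
d = 8
conf = consistency_confidence(rng.normal(size=(5, d)),
                              rng.normal(size=(20, d)),
                              rng.normal(size=2 * d), 0.0)
assert 0.0 < conf < 1.0
```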
  • S104: determine whether the semantic consistency confidence is less than a preset value.
  • the semantic consistency confidence is a number between 0 and 1.
  • the preset value is an index to determine the degree of semantic relevance between the t-th sentence and the c-th sentence.
  • if the semantic consistency confidence is less than the preset value, the semantic correlation between the t-th sentence and the c sentences is low and the t-th sentence is not coherent with them, which further indicates that the t-th sentence involves a topic change relative to the c sentences and is a voice command different from them.
  • the t-th sentence is input into the preset intention recognition model, the intention of the t-th sentence is recognized, and the intention of the t-th sentence is used as an editing instruction to edit the text data.
  • the editing instruction may be an instruction to move the cursor, replace a word, or delete a word, for example, "move the cursor forward by N characters", "move the cursor to the j-th character of the i-th sentence", "move the cursor behind ⁇", "replace ⁇ with ⁇", "delete ⁇", etc.
  • the intent recognition model used to recognize the t-th sentence can extract feature words or keywords from the t-th sentence and determine the editing instruction from them. For example, if the extracted feature words or keywords include "cursor", "move", "forward", etc., the cursor is moved according to the recognition result; if the extracted keywords include "replace", the replacement word and the word to be replaced are determined from the recognition result, and the word replacement is then performed.
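A minimal sketch of keyword/pattern-based edit-intent extraction as described above. The English phrasings and the returned tuple format are illustrative assumptions (the original commands are Chinese):

```python
import re

def parse_edit_intent(sentence):
    """Heuristic keyword matching for the editing intents described above.
    The patterns and return format are illustrative assumptions."""
    m = re.search(r"move the cursor forward by (\d+) characters", sentence)
    if m:
        return ("move_cursor", int(m.group(1)))
    m = re.search(r"replace (\S+) with (\S+)", sentence)
    if m:
        return ("replace", m.group(1), m.group(2))
    m = re.search(r"delete (\S+)", sentence)
    if m:
        return ("delete", m.group(1))
    return ("none",)

assert parse_edit_intent("move the cursor forward by 3 characters") == ("move_cursor", 3)
assert parse_edit_intent("replace cat with hat") == ("replace", "cat", "hat")
assert parse_edit_intent("delete hat") == ("delete", "hat")
```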
  • the intent recognition model can also be used to match the t-th sentence with a preset template, and determine the editing instructions according to the matching result.
  • the templates include "move cursor to the left of ⁇", "move cursor to ⁇", "replace ⁇ with ⁇", etc.
  • each template corresponds to an editing method; the corresponding editing method is determined according to the result of matching the t-th sentence against the templates, and the corresponding editing instruction is executed according to that editing method.
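The template-matching variant can be sketched with regular expressions; the concrete templates and method names below are assumed English stand-ins for the templates listed above:

```python
import re

# Illustrative templates (assumed): each maps a pattern with slots to an
# editing method name. Order matters: the more specific pattern comes first.
TEMPLATES = [
    (re.compile(r"^move cursor to the left of (.+)$"), "cursor_left_of"),
    (re.compile(r"^move cursor to (.+)$"),             "cursor_to"),
    (re.compile(r"^change (.+) to (.+)$"),             "replace_word"),
]

def match_template(sentence):
    """Match the t-th sentence against the preset templates and return
    the editing method plus the captured slot values."""
    for pattern, method in TEMPLATES:
        m = pattern.match(sentence)
        if m:
            return method, m.groups()
    return None, ()

assert match_template("change cat to hat") == ("replace_word", ("cat", "hat"))
assert match_template("move cursor to the left of cat") == ("cursor_left_of", ("cat",))
```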
  • when the electronic device recognizes an editing command for word replacement, it determines the target description word and the target homophone in the editing command, where the target description word is the word in which the target homophone occurs. For example, when "⁇" is replaced with the corresponding character of "purple", the target description word is "purple" and the target homophone is "⁇"; for the "ji" of "computer", the target description word is "computer" and the target homophone is "ji".
  • the pinyin sequence of the target description word is input into the pinyin-to-Chinese-character sequence labeling model, which outputs the candidate Chinese characters corresponding to the pinyin together with their prior probability distribution. The candidate Chinese characters corresponding to the pinyin of the target homophone and the word to be replaced before the t-th sentence are then input into the homophone classification model to obtain the correlation probability of each candidate character. The prior probability and the correlation probability are combined by weighted averaging to obtain the final probability of each candidate character, and the candidate character with the largest final probability is output as the replacement character for the character to be replaced before the t-th sentence.
  • the prior probability is the probability that, in the pinyin sequence corresponding to the target description word, the candidate Chinese character is the target homophone; the correlation probability characterizes the semantic association between the candidate Chinese character and the word to be replaced.
  • for example, if the editing instruction is to replace the "ji" of "enter" with the "ji" of "computer", "jisuanji" is input into the pinyin-to-Chinese-character sequence labeling model, giving one candidate "ji" a prior probability of 0.3 and the other a prior probability of 0.7; inputting the candidate "ji" characters and "enter" into the homophone classification model gives correlation probabilities of 0.9 and 0.1 respectively. Taking the weighted average of the prior and correlation probabilities yields final probabilities of 0.6 and 0.4, and the candidate with the final probability 0.6 is selected as the replacement.
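The weighted average in this example can be reproduced with an equal 0.5/0.5 weighting, which is an assumption consistent with the numbers above:

```python
def final_probability(prior, correlation, w=0.5):
    """Weighted average of the prior probability and the correlation
    probability. The equal 0.5/0.5 weighting is an assumption that
    reproduces the numbers in the example above."""
    return w * prior + (1 - w) * correlation

# one candidate "ji": prior 0.3, correlation 0.9
# other candidate "ji": prior 0.7, correlation 0.1
p_a = final_probability(0.3, 0.9)  # 0.6
p_b = final_probability(0.7, 0.1)  # 0.4
assert abs(p_a - 0.6) < 1e-9
assert abs(p_b - 0.4) < 1e-9
assert p_a > p_b  # the candidate with final probability 0.6 is output
```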
  • if the semantic consistency confidence is greater than or equal to the preset value, the t-th sentence and the c sentences before it have a higher semantic association, and the t-th sentence and the c sentences form coherent text data.
  • the text data is stored, the stored text data is input into the intent recognition model, the intent of the text data is recognized, and the corresponding operation is performed according to the recognized intent.
  • the training corpus is first collected.
  • the training corpus includes text data.
  • the text data includes at least two sentences, where the last sentence of some text data is an editing instruction and the last sentence of other text data is not. The semantic consistency confidence of the text data is annotated to generate training samples, and the training samples are used to train the semantic consistency model.
  • the voice data input by the user is converted into text data, the text data is divided into t sentences, the t-th sentence and the c sentences are input into the semantic consistency model, and whether the t-th sentence is an editing instruction is judged from the semantic consistency confidence output by the model. If it is an editing instruction, the instruction is executed to obtain the updated text; if not, the text data is recorded and the text stored in the electronic device is updated to obtain the updated text. After the updated text is obtained, the corresponding operations are performed on it, and any incorrectly recognized text data is stored as new training corpus, labeled, and added to the training samples to optimize the semantic consistency model.
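The overall inference flow described above can be sketched as follows; the two model callables, the threshold value, and the edit-tuple format are assumed interfaces for illustration, not the patent's actual implementation:

```python
PRESET_VALUE = 0.5  # assumed threshold

def apply_edit(text, edit):
    """Apply a recognized editing instruction (only 'delete' sketched)."""
    op = edit[0]
    if op == "delete":
        return [s.replace(edit[1], "") for s in text]
    return text

def process_utterance(sentences, consistency_model, intent_model, stored_text):
    """End-to-end flow: score the t-th sentence against its context;
    below the threshold it is treated as an edit command, otherwise
    it is appended as ordinary dictation."""
    confidence = consistency_model(sentences)   # t-th sentence vs. c context sentences
    if confidence < PRESET_VALUE:
        edit = intent_model(sentences[-1])      # recognize the t-th sentence as an edit
        return apply_edit(stored_text, edit)
    return stored_text + [sentences[-1]]        # coherent dictation: record it

# toy usage with stub models
out = process_utterance(["hello world", "delete world"],
                        lambda s: 0.1,                  # low confidence -> edit
                        lambda s: ("delete", "world"),
                        ["hello world"])
assert out == ["hello "]
```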
  • the text data is divided into t sentences, the t sentences are input into the semantic consistency model, and the semantic consistency confidence between the t-th sentence and the c sentences is calculated. Because the semantic consistency model describes both the semantic association between the t-th sentence and the c sentences as a whole and the association between the t-th sentence and each of the c sentences, it can extract the related information between the t-th sentence and the c sentences more effectively, which in turn makes the output semantic consistency confidence more robust.
  • if the semantic consistency confidence is less than the preset value, the semantic relevance between the t-th sentence and the c sentences is low, that is, the t-th sentence involves a topic change relative to the c sentences and differs from them. The t-th sentence is then used as an editing instruction and the instruction is executed, so that the c sentences can be edited without additional interaction, which is easy to operate and improves the user experience.
  • an embodiment of the present application also provides an electronic device.
  • the electronic device provided by the embodiment of the present application may include: a processor 210, a memory 220, and a network interface 230.
  • the processor 210, the memory 220, and the network interface 230 are connected through a communication bus 240.
  • the processor 210 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the processor 210 may include one or more processing units.
  • the memory 220 may be an internal storage unit of the electronic device, such as a hard disk or a memory of the electronic device.
  • the memory 220 may also be an external storage device of the electronic device, such as a plug-in hard disk equipped on the electronic device, a smart media card (SMC), a secure digital (SD) card, a flash card, and so on.
  • the memory 220 may also include both an internal storage unit of an electronic device and an external storage device.
  • the memory 220 is used to store computer programs and other programs and data required by the electronic device.
  • the memory 220 can also be used to temporarily store data that has been output or will be output.
  • the network interface 230 may be used to send and receive information, and may include a wired interface and/or a wireless interface, and is generally used to establish a communication connection between the electronic device and other electronic devices.
  • the electronic device may further include a user interface 250.
  • the user interface 250 may include a display (Display) and an input unit such as a keyboard (Keyboard).
  • optionally, the user interface 250 may also include a standard wired interface and a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, etc.
  • the display can also be appropriately called a display screen or a display unit, which is used to display information processed in the electronic device and to display a visualized user interface.
  • FIG. 8 is only an example of an electronic device, and does not constitute a limitation on the electronic device. It may include more or fewer components than those shown in the figure, or a combination of certain components, or different component arrangements.
  • the electronic device provided in this embodiment can execute the foregoing method embodiments, and its implementation principles and technical effects are similar, and will not be repeated here.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the above method embodiments can be implemented by a computer program instructing the relevant hardware.
  • the computer program can be stored in a computer-readable storage medium. When executed by the processor, the steps of the foregoing method embodiments can be implemented.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate form.
  • the computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing device/electronic device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunications signal, and a software distribution medium.
  • for example, a USB flash drive, a mobile hard disk, a floppy disk, a CD-ROM, etc.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the disclosed apparatus/network equipment and method may be implemented in other ways.
  • the device/network device embodiments described above are only illustrative.
  • the division of the modules or units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A voice editing method, an electronic device and a computer-readable storage medium, relating to the field of AI. The voice editing method comprises: acquiring input voice data; converting the voice data into text data and dividing the text data into t sentences; and calculating the semantic consistency confidence of the t-th sentence of the t sentences with respect to the c sentences before it, the semantic consistency confidence describing the degree of semantic association between the t-th sentence and the c sentences. If the semantic consistency confidence is less than a preset value, the degree of semantic association between the t-th sentence and the c sentences is low, indicating that the t-th sentence involves a topic change with respect to the c sentences; in this case the t-th sentence is recognized, and the recognition result is taken as an editing instruction to edit the text data. The user therefore does not need to perform additional interaction in order to edit the text data, and the operation is simple and convenient, improving the user experience.

Description

Voice editing method, electronic device and computer-readable storage medium
This application claims priority to Chinese patent application No. 202010484871.0, filed with the State Intellectual Property Office on June 1, 2020 and entitled "Voice editing method, electronic device and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field

This application relates to the field of artificial intelligence (AI), and in particular to a voice editing method, an electronic device, and a computer-readable storage medium.
Background

The development of artificial intelligence technology has profoundly changed the way humans and machines interact: from the keyboard, mouse, and touch-screen interaction represented by PCs and smartphones to the voice interaction represented by intelligent dialogue systems (such as mobile-phone voice assistants, smart speakers, and smart in-vehicle systems). Voice dialogue is more convenient and flexible than traditional clicking and touching, and is being applied in more and more fields.

During a voice dialogue, the electronic device recognizes the voice data input by the user and converts it into text in order to perform the corresponding operation. When the user finds that speech recognition has gone wrong, or wants to actively change the wording, the input voice data needs to be modified. Existing methods for modifying input voice data generally require manually switching the input mode, for example switching from voice input to text input in order to modify the text data converted from the voice data, or modifying the text data according to a modification instruction input by the user when a corresponding prefix word is detected in the user input. This increases the interaction cost, complicates the operation, and degrades the user experience.
Summary of the Invention

The present application provides a voice editing method, an electronic device, and a computer-readable storage medium that realize the editing of text data without additional interaction cost, are easy to operate, and improve the user experience.
In a first aspect, a voice editing method is provided, including: acquiring input voice data; converting the voice data into text data and dividing the text data into t sentences, where t is an integer greater than 1; calculating the semantic consistency confidence of the t-th sentence of the t sentences with respect to the c sentences before the t-th sentence, where the semantic consistency confidence describes the degree of semantic association between the t-th sentence and the c sentences and c is an integer greater than 0; and, if the semantic consistency confidence is less than a preset value, recognizing the t-th sentence and editing the text data using the recognition result as an editing instruction.
In the above embodiment, voice data is acquired, converted into text data, and divided into t sentences, and the semantic consistency confidence of the t-th sentence with respect to the c sentences before it is calculated. If the semantic consistency confidence is less than the preset value, the degree of semantic association between the t-th sentence and the c sentences is low, that is, the t-th sentence is not coherent with the c sentences, which further indicates that the t-th sentence involves a topic change relative to the c sentences and is an instruction different from them. In this case, the t-th sentence is recognized and the recognition result is used as an editing instruction to edit the text data, so that the text data can be edited without additional user interaction, which is easy to operate and improves the user experience.
In a possible implementation of the first aspect, calculating the semantic consistency confidence of the t-th sentence of the t sentences with respect to the c sentences before the t-th sentence includes: inputting the t sentences into a preset semantic consistency model and obtaining, from the output of the model, the semantic consistency confidence of the t-th sentence with respect to the c sentences before it. Since the preset semantic consistency model is trained on a large number of training samples, calculating the semantic consistency confidence with the preset model improves the accuracy and stability of the result.
In a possible implementation of the first aspect, the preset semantic consistency model is used to calculate a comprehensive representation vector of the t-th sentence from the t-th sentence and the c sentences, where the comprehensive representation vector describes both the semantic association between the t-th sentence and the c sentences as a whole and the semantic association between the t-th sentence and each sentence of the c sentences, so that it can represent more related information between the t-th sentence and the c sentences. The semantic consistency confidence is then determined from the comprehensive representation vector of the t-th sentence, which improves the accuracy of the semantic consistency confidence.
In a possible implementation of the first aspect, calculating the comprehensive representation vector of the t-th sentence from the t-th sentence and the c sentences includes: determining the context vector of each word of the t-th sentence and the context vector of each word of the c sentences from the t-th sentence and the c sentences; and calculating the comprehensive representation vector of the t-th sentence from the context vectors of the words of the t-th sentence and of the c sentences.
In a possible implementation of the first aspect, determining the context vectors includes: performing an attention operation on the t-th sentence and the c sentences, so that more internal features between the t-th sentence and the c sentences can be captured, to obtain the attention of the t-th sentence with respect to the preceding text; and calculating the context vector of each word of the t-th sentence and of each word of the c sentences from that attention.
In a possible implementation of the first aspect, performing the attention operation on the t-th sentence and the c sentences to obtain the attention of the t-th sentence with respect to the preceding text includes: performing word segmentation on the t-th sentence and determining the hidden vector corresponding to each word of the t-th sentence from the segmented sentence; performing word segmentation on the c sentences and determining the hidden vector corresponding to each word of the c sentences from the segmented sentences; and performing the attention operation on the hidden vectors corresponding to the words of the t-th sentence and of the c sentences to obtain the attention of the t-th sentence with respect to the preceding text.
In a possible implementation of the first aspect, calculating the context vectors from the attention of the t-th sentence with respect to the preceding text includes: calculating the context representation of each word of the t-th sentence from that attention and the hidden vectors corresponding to the words of the c sentences; performing a residual connection operation on these context representations and the hidden vectors corresponding to the words of the t-th sentence to obtain the context vectors of the words of the t-th sentence; calculating the context representation of each word of the c sentences from the attention and the hidden vectors corresponding to the words of the t-th sentence; and performing a residual connection operation on these context representations and the hidden vectors corresponding to the words of the c sentences to obtain the context vectors of the words of the c sentences. This reduces signal loss and improves the accuracy of the calculation.
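A minimal sketch of this attention-plus-residual step, with one row per word. Scaled dot-product attention is an assumption here; the text only specifies "an attention operation":

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_vectors(h_t, h_c):
    """h_t: hidden vectors of the words of the t-th sentence, (n_t, d);
    h_c: hidden vectors of the words of the c context sentences, (n_c, d).
    Each side attends to the other; a residual connection with the
    original hidden vectors then yields the context vectors."""
    d = h_t.shape[1]
    a_t = softmax(h_t @ h_c.T / np.sqrt(d))  # each word of the t-th sentence attends to the context
    a_c = softmax(h_c @ h_t.T / np.sqrt(d))  # each context word attends to the t-th sentence
    u_t = a_t @ h_c + h_t                    # context representation + residual connection
    u_c = a_c @ h_t + h_c
    return u_t, u_c

u_t, u_c = context_vectors(np.ones((3, 4)), np.ones((6, 4)))
assert u_t.shape == (3, 4) and u_c.shape == (6, 4)
```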
In a possible implementation of the first aspect, calculating the comprehensive representation vector of the t-th sentence from the context vectors includes: performing an attention operation on the context vectors of the words of the t-th sentence and the context vectors of the words of the c sentences to obtain the attention corresponding to the t-th sentence and the c sentences; and calculating the comprehensive representation vector of the t-th sentence from that attention.
In a possible implementation of the first aspect, calculating the comprehensive representation vector of the t-th sentence from the attention corresponding to the t-th sentence and the c sentences includes: calculating the context representation of each word of the t-th sentence with respect to the c sentences from that attention and the context vectors of the words of the c sentences; and performing a residual connection operation on these context representations and the context vectors of the words of the t-th sentence to obtain the comprehensive representation vector of the t-th sentence, which reduces signal loss and improves the accuracy of the calculation.
In a possible implementation of the first aspect, determining the semantic consistency confidence from the comprehensive representation vector of the t-th sentence includes: determining the comprehensive representation vector of the c sentences from the context vectors of the words of the c sentences; splicing the comprehensive representation vector of the t-th sentence with the comprehensive representation vector of the c sentences; and determining the semantic consistency confidence from the spliced vector.

In a possible implementation of the first aspect, recognizing the t-th sentence includes: inputting the t-th sentence into a preset intent recognition model and obtaining the recognition result output by the model.

In a possible implementation of the first aspect, after calculating the semantic consistency confidence of the t-th sentence of the t sentences with respect to the c sentences before the t-th sentence, the voice editing method further includes: storing the text data if the semantic consistency confidence is greater than or equal to the preset value.
In a second aspect, a voice editing apparatus is provided, including:

an acquisition module, configured to acquire input voice data;

a sentence segmentation module, configured to convert the voice data into text data and divide the text data into t sentences, where t is an integer greater than 1;

a calculation module, configured to calculate the semantic consistency confidence of the t-th sentence of the t sentences with respect to the c sentences before the t-th sentence, where the semantic consistency confidence describes the degree of semantic association between the t-th sentence and the c sentences and c is an integer greater than 0;

a recognition module, configured to recognize the t-th sentence if the semantic consistency confidence is less than a preset value, and to edit the text data using the recognition result as an editing instruction.
In a possible implementation of the second aspect, the calculation module is specifically configured to:

input the t sentences into a preset semantic consistency model and obtain, from the output of the model, the semantic consistency confidence of the t-th sentence of the t sentences with respect to the c sentences before the t-th sentence.

In a possible implementation of the second aspect, the calculation module includes:

a first calculation unit, configured to calculate a comprehensive representation vector of the t-th sentence from the t-th sentence and the c sentences, where the comprehensive representation vector describes the semantic association between the t-th sentence and the c sentences, and between the t-th sentence and each sentence of the c sentences;

a second calculation unit, configured to determine the semantic consistency confidence from the comprehensive representation vector of the t-th sentence.
In a possible implementation of the second aspect, the first calculation unit is specifically configured to:
determine the context vector of each word of the t-th sentence and the context vector of each word of the c sentences according to the t-th sentence and the c sentences;
calculate the comprehensive representation vector of the t-th sentence according to the context vectors of the words of the t-th sentence and the context vectors of the words of the c sentences.
In a possible implementation of the second aspect, the first calculation unit is further specifically configured to:
perform an attention operation on the t-th sentence and the c sentences to obtain the attention between the t-th sentence and the preceding context;
calculate the context vector of each word of the t-th sentence and the context vector of each word of the c sentences according to the attention between the t-th sentence and the preceding context.
In a possible implementation of the second aspect, the first calculation unit is further specifically configured to:
perform word segmentation on the t-th sentence, and determine the hidden vector corresponding to each word of the t-th sentence according to the segmented t-th sentence;
perform word segmentation on the c sentences, and determine the hidden vector corresponding to each word of the c sentences according to the segmented c sentences;
perform an attention operation on the hidden vectors corresponding to the words of the t-th sentence and the hidden vectors corresponding to the words of the c sentences to obtain the attention between the t-th sentence and the preceding context.
In a possible implementation of the second aspect, the first calculation unit is further specifically configured to:
calculate the context representation of each word of the t-th sentence according to the attention between the t-th sentence and the preceding context and the hidden vectors corresponding to the words of the c sentences;
perform a residual connection operation on the context representations of the words of the t-th sentence and the hidden vectors corresponding to the words of the t-th sentence to obtain the context vector of each word of the t-th sentence;
calculate the context representation of each word of the c sentences according to the attention between the t-th sentence and the preceding context and the hidden vectors corresponding to the words of the t-th sentence;
perform a residual connection operation on the context representations of the words of the c sentences and the hidden vectors corresponding to the words of the c sentences to obtain the context vector of each word of the c sentences.
In a possible implementation of the second aspect, the first calculation unit is further specifically configured to:
perform an attention operation on the context vectors of the words of the t-th sentence and the context vectors of the words of the c sentences to obtain the attention between the t-th sentence and the c sentences;
calculate the comprehensive representation vector of the t-th sentence according to the attention between the t-th sentence and the c sentences.
In a possible implementation of the second aspect, the first calculation unit is further specifically configured to:
calculate, for each word of the t-th sentence, a context representation with respect to the c sentences according to the attention between the t-th sentence and the c sentences and the context vectors of the words of the c sentences;
perform a residual connection operation on the context representations of the words of the t-th sentence with respect to the c sentences and the context vectors of the words of the t-th sentence to obtain the comprehensive representation vector of the t-th sentence.
In a possible implementation of the second aspect, the second calculation unit is specifically configured to:
determine a comprehensive representation vector of the c sentences according to the context vectors of the words of the c sentences;
concatenate the comprehensive representation vector of the t-th sentence and the comprehensive representation vector of the c sentences, and determine the semantic consistency confidence according to the concatenated vector.
In a possible implementation of the second aspect, the recognition module is specifically configured to:
input the t-th sentence into a preset intent recognition model to obtain the recognition result output by the preset intent recognition model.
In a possible implementation of the second aspect, the recognition module is further specifically configured to:
store the text data if the semantic consistency confidence is greater than or equal to the preset value.
In a third aspect, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the voice editing method described in the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, implements the voice editing method described in the first aspect.
In a fifth aspect, a computer program product is provided which, when run on a terminal device, causes the terminal device to execute the voice editing method described in the first aspect.
It can be understood that, for the beneficial effects of the second to fifth aspects, reference may be made to the relevant description of the first aspect, which is not repeated here.
Description of the Drawings
FIG. 1 is a diagram of an application scenario of the voice editing method provided by an embodiment of this application;
FIG. 2 is a diagram of another application scenario of the voice editing method provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of the voice editing method provided by an embodiment of this application;
FIG. 4 is a schematic diagram of the sentence-splitting process provided by an embodiment of this application;
FIG. 5 is a schematic diagram of calculating the semantic consistency confidence provided by an embodiment of this application;
FIG. 6 is a schematic diagram of the semantic consistency model provided by an embodiment of this application;
FIG. 7 is a schematic diagram of a specific implementation flow of the voice editing method provided by an embodiment of this application;
FIG. 8 is a schematic diagram of an electronic device provided by an embodiment of this application.
Detailed Description
In the following description, for the purpose of explanation rather than limitation, specific details such as particular system structures and technologies are set forth in order to provide a thorough understanding of the embodiments of this application. However, it should be clear to those skilled in the art that this application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of this application.
It should be understood that, when used in the specification and appended claims of this application, the term "comprising" indicates the presence of the described features, wholes, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the term "and/or" used in the specification and appended claims of this application refers to, and includes, any and all possible combinations of one or more of the associated listed items.
Reference to "one embodiment" or "some embodiments" in this specification means that a particular feature, structure, or characteristic described in connection with that embodiment is included in one or more embodiments of this application. Thus, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments", etc. appearing in different places in this specification do not necessarily all refer to the same embodiment, but rather mean "one or more but not all embodiments", unless otherwise specifically emphasized. The terms "including", "comprising", "having" and their variants all mean "including but not limited to", unless otherwise specifically emphasized.
The voice editing method provided in the embodiments of this application is applied to an electronic device, where the electronic device may be a terminal such as a mobile phone, a tablet, a computer, a smart speaker, or a vehicle-mounted device, or may be a server; the embodiments of this application place no special restriction on the specific form or type of the electronic device. The method provided in the embodiments of this application may be executed entirely on the terminal, entirely on the server, or partly on the terminal and partly on the server.
As shown in FIG. 1, a diagram of an application scenario of the voice editing method provided by an embodiment of this application, an electronic device such as a smart speaker acquires voice data input by a user, converts the voice data into text data, divides the text data into t sentences, and calculates the semantic consistency confidence between the t-th sentence and the c sentences before the t-th sentence, where the semantic consistency confidence describes the degree of semantic association between the t-th sentence and the c sentences. If the semantic consistency confidence is less than a preset value, the t-th sentence is recognized, and the recognition result is used as an editing instruction to edit the text data, obtaining updated text. For example, in the converted text data, the t-th sentence is: replace "酒店" (hotel) with "九点" (nine o'clock); the sentence before the t-th sentence, i.e. the c-th sentence, is: "What are the hotel arrangements for tomorrow?". The semantic consistency confidence between the t-th sentence and the c-th sentence is less than the preset value, so the t-th sentence is recognized, the text data is edited according to the recognition result, and the updated text "What are the arrangements for tomorrow at nine o'clock?" is obtained.
If the semantic consistency confidence is greater than or equal to the preset value, the text data is recorded, and the text cached in the electronic device is updated according to the recorded text data to obtain the updated text. After the updated text is obtained, the electronic device recognizes the intent of the updated text, generates the corresponding reply text according to that intent, converts the reply text into voice data, and finally outputs the voice data. For example, if the intent of the updated text is to obtain schedule information, the electronic device obtains the reply text corresponding to the schedule information, converts it into voice data, and plays it; if the intent of the updated text is to play song ××, the electronic device searches for song ××, and, if the corresponding song is found, obtains its audio, generates the reply text "Song ×× will now be played for you", converts that reply text into voice data and plays it, and finally plays the audio of the found song ××. In this way, voice editing of the text data is achieved without requiring additional interaction from the user, so that the electronic device can promptly obtain the user's true intent and respond to it, improving the user experience.
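The "replace X with Y" editing step in the example above can be sketched as a simple string operation. This is only an illustrative sketch: the patent does not prescribe a concrete string-editing routine, and the function name is hypothetical.

```python
def apply_edit(cached_text: str, target: str, replacement: str) -> str:
    """Apply a 'replace X with Y' editing instruction to the cached text.

    Illustrative only; the patent leaves the concrete editing routine open.
    """
    return cached_text.replace(target, replacement)

# The running example: the ASR transcript confuses the homophones
# "酒店" (jiudian, hotel) and "九点" (jiu dian, nine o'clock).
cached = "明天酒店有什么安排"          # "What are the hotel arrangements for tomorrow?"
updated = apply_edit(cached, "酒店", "九点")
print(updated)  # 明天九点有什么安排
```

In practice the target and replacement strings would come from the intent recognition model applied to the t-th sentence, not from hard-coded arguments.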
As shown in FIG. 2, a diagram of another application scenario of the voice editing method provided by an embodiment of this application, a user inputs voice data at a voice input terminal, such as application software on a mobile phone or a web page on a computer, and the voice input terminal sends the voice data input by the user to a server. The server converts the voice data into text data, divides the text data into t sentences, and calculates the semantic consistency confidence between the t-th sentence and the c sentences before the t-th sentence. If the semantic consistency confidence is less than a preset value, the t-th sentence is recognized, and the recognition result is used as an editing instruction to edit the text data, obtaining updated text; if the semantic consistency confidence is greater than or equal to the preset value, the text data is recorded, and the text cached in the server is updated according to the recorded text data to obtain the updated text, so that further operations can be performed based on the updated text. For example, the server recognizes the intent of the updated text and, according to that intent, sends the corresponding resource to the voice input terminal, or sends the updated text to the voice input terminal for display. The editing operation on the text data is thus achieved without requiring additional interaction from the user.
The voice editing method provided by the embodiments of this application is introduced below, taking as an example the case where the method is executed entirely on a single electronic device.
As shown in FIG. 3, the voice editing method provided by an embodiment of this application includes:
S101: Acquire voice data.
Specifically, the electronic device collects the voice data input by the user through a microphone.
S102: Convert the voice data into text data, and divide the text data into t sentences, where t is an integer greater than 1.
Specifically, after noise reduction and filtering, the collected voice data is input into a preset speech recognition model to obtain the text data output by the preset speech recognition model. The speech recognition model is a model obtained by training a preset algorithm model with a machine-learning algorithm, using voice data and the corresponding text data as training samples.
As shown in FIG. 4, after the text data is obtained, the text data is split into sentences, so that the text data is divided into t sentences, where the t-th sentence is obtained by converting the voice data currently input by the user, the sentences before the t-th sentence are obtained by converting the historical voice data input by the user, and the sentences before the t-th sentence constitute the preceding context of the t-th sentence.
In a possible implementation, punctuation marks or spaces are generated according to the pause intervals in the user's voice input. For example, if, while the user is inputting voice data, the pause between two words is longer than a preset duration, a punctuation mark or a space is added between the two words during speech-to-text conversion, where the punctuation mark may be a comma. After the punctuation marks or spaces have been generated according to the pause intervals, the text data is divided into sentences according to those punctuation marks or spaces. It should be noted that the sentences before the t-th sentence may be divided before the user inputs the current voice data, or the sentences of the text data may be divided after the user inputs the current voice data and all the voice data has been converted into text data.
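The pause-based sentence splitting described above can be sketched as follows. The function name, data layout, and the 0.5-second threshold are illustrative assumptions, not values from the patent.

```python
def split_sentences(words, pauses, pause_threshold=0.5):
    """Group recognized words into sentences using inter-word pause lengths.

    words:  recognized words in order
    pauses: pauses[i] is the silence (in seconds) before words[i]; pauses[0] is 0
    A pause longer than pause_threshold starts a new sentence, mirroring the
    punctuation/space insertion described above (threshold is illustrative).
    """
    sentences, current = [], []
    for word, pause in zip(words, pauses):
        if current and pause > pause_threshold:
            sentences.append(current)
            current = []
        current.append(word)
    if current:
        sentences.append(current)
    return sentences

# A long pause before "将" separates the query from the editing utterance.
sents = split_sentences(
    ["明天", "酒店", "有", "什么", "安排", "将", "酒店", "替换为", "九点"],
    [0.0, 0.1, 0.1, 0.1, 0.1, 1.2, 0.1, 0.1, 0.1],
)
print(len(sents))  # 2
```

A production system would take word timings directly from the speech recognizer rather than from a separate pause list.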
S103: Calculate the semantic consistency confidence between the t-th sentence of the t sentences and the c sentences before the t-th sentence, where the semantic consistency confidence describes the degree of semantic association between the t-th sentence and the c sentences, c is an integer greater than 0, and c ≤ t−1.
In a possible implementation, as shown in FIG. 5, the t-th sentence and the c sentences before the t-th sentence are input into a preset semantic consistency model to obtain the semantic consistency confidence output by the preset semantic consistency model. The semantic consistency model is obtained by training a preset algorithm model with a machine-learning algorithm, using text data and the semantic consistency confidence between the sentences of the text data as training samples.
The calculation principle of the semantic consistency model during training is the same as when the model is applied. Taking the application of the semantic consistency model as an example, its calculation process when computing the semantic consistency confidence is introduced below. In the embodiments of this application, the semantic consistency model is used to calculate the semantic association between the t-th sentence of the t sentences and the c sentences, and to determine the semantic consistency confidence according to that semantic association. The semantic association may be the association between the t-th sentence and the c sentences as a whole, the association between the t-th sentence and each of the c sentences, or both. The association between the t-th sentence and the c sentences represents the semantic association between the t-th sentence and the preceding context as a whole, while the association between the t-th sentence and each of the c sentences represents their sentence-level semantic association.
In the embodiments of this application, the semantic consistency model calculates both the semantic association between the t-th sentence and the c sentences and the semantic association between the t-th sentence and each of the c sentences, so that more effective association information between the t-th sentence and the c sentences can be extracted, which in turn makes the output semantic consistency confidence more robust.
In the embodiments of this application, the comprehensive representation vector of the t-th sentence is used to describe the semantic association between the t-th sentence and the c sentences, as well as the semantic association between the t-th sentence and each of the c sentences; that is, the semantic consistency model first calculates the comprehensive representation vector of the t-th sentence according to the t-th sentence and the c sentences, and then determines the semantic consistency confidence according to that comprehensive representation vector.
Specifically, word segmentation is first performed on each of the t sentences to obtain the words of the c sentences and the words of the t-th sentence. If the text data is in English, each sentence is segmented into English words; if the text data is in Chinese, the text is segmented according to the result of comparing the text data against a preset word dictionary. Illustratively, denote the t-th sentence by S_t = {w_1^t, w_2^t, ..., w_{L_t}^t}, where t is the index of the t-th sentence, w_i^t denotes the i-th word of the t-th sentence, and L_t is the number of words in the t-th sentence. Denote the c sentences before the t-th sentence by S_τ, τ ∈ {t−1, ..., t−c}, with S_τ = {w_1^τ, w_2^τ, ..., w_{L_τ}^τ}, where w_j^τ denotes the j-th word of sentence S_τ and L_τ is the number of words in sentence S_τ. After the words of the c sentences and the words of the t-th sentence are obtained, they are input into the preset semantic consistency model, which calculates the comprehensive representation vector of the t-th sentence and then determines the semantic consistency confidence according to that vector.
As shown in FIG. 6, in a possible implementation, the preset semantic consistency model includes an embedding layer, a context encoder, a pooling layer, and a fully connected layer. The words of the t-th sentence are input into the embedding layer to obtain the semantic embedding representation of each word of the t-th sentence, E_t = {e_1^t, ..., e_{L_t}^t}, where e_i^t ∈ R^{d_e} and d_e is the dimension of the embedding vectors; the words of the c sentences are input into the embedding layer to obtain the semantic embedding representation of each word of the c sentences, E_τ = {e_1^τ, ..., e_{L_τ}^τ}, where e_j^τ ∈ R^{d_e}.
After the embedding representations of the words of the t-th sentence and of the words of the c sentences are obtained, the embeddings of the words of the t-th sentence are input into the context encoder to obtain the hidden vector of each word of the t-th sentence, H_t = {h_1^t, ..., h_{L_t}^t}, where h_i^t ∈ R^{d_h} and d_h is the dimension of the hidden vectors; the embeddings of the words of the c sentences are input into the context encoder to obtain the hidden vector of each word of the c sentences, H_τ = {h_1^τ, ..., h_{L_τ}^τ}, where h_j^τ ∈ R^{d_h}.
In a possible implementation, the context encoder encodes with a recurrent neural network (RNN), whose encoding formula is h_i = tanh(U e_i + W h_{i−1} + b), i ∈ {1, ..., L}, where {U, W, b} are the parameters of the encoder, e_i is the embedding representation of the i-th word, and h_i is the hidden vector corresponding to the i-th word.
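The RNN recurrence h_i = tanh(U e_i + W h_{i−1} + b) described above can be sketched in numpy. The parameter shapes and random initialization are illustrative assumptions; the patent fixes only the recurrence itself.

```python
import numpy as np

def rnn_encode(embeddings, U, W, b):
    """Encode a sentence with the simple RNN h_i = tanh(U e_i + W h_{i-1} + b).

    embeddings: array of shape (L, d_e); U: (d_h, d_e); W: (d_h, d_h); b: (d_h,).
    Returns the hidden states as an array of shape (L, d_h).
    """
    d_h = W.shape[0]
    h = np.zeros(d_h)          # h_0 is taken as the zero vector (an assumption)
    hidden = []
    for e_i in embeddings:
        h = np.tanh(U @ e_i + W @ h + b)
        hidden.append(h)
    return np.stack(hidden)

rng = np.random.default_rng(0)
d_e, d_h, L = 4, 3, 5
H = rnn_encode(rng.normal(size=(L, d_e)),
               rng.normal(size=(d_h, d_e)),
               rng.normal(size=(d_h, d_h)),
               rng.normal(size=d_h))
print(H.shape)  # (5, 3)
```

In practice a trained RNN (or a gated variant) would be used; the point here is only the shape of the computation.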
An attention operation is performed on the hidden vectors H_t = {h_1^t, ..., h_{L_t}^t} corresponding to the words of the t-th sentence and the hidden vectors H_τ = {h_1^τ, ..., h_{L_τ}^τ} corresponding to the words of the c sentences; that is, the hidden vector of each word of the t-th sentence is combined, in turn, with the hidden vector of each word of the c sentences in an attention operation, yielding the attention between the t-th sentence and the preceding context. In a possible implementation, the attention between the t-th sentence and the preceding context is computed as
α_{ij}^τ = g_α(h_i^t, h_j^τ), i ∈ {1, ..., L_t}, j ∈ {1, ..., L_τ}, τ ∈ {t−1, ..., t−c},
where α_{ij}^τ represents the attention between the t-th sentence and the preceding context, and g_α is the attention weight function, for example a bilinear form
g_α(a, b) = a^T W_α b,
where a and b are the two vectors input into the attention weight function, a^T is the transpose of a, and W_α is a parameter to be learned. It should be noted that, in other feasible implementations, the t-th sentence and the c sentences may each be represented by a single vector, and the attention operation performed on the vector of the t-th sentence and the vectors of the c sentences to obtain the attention between the t-th sentence and the preceding context.
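The word-pair attention scoring just described can be sketched as follows. The bilinear weight function is one plausible choice consistent with the description (two input vectors, a transpose, and a learnable parameter); the exact g_α appears only as an image in the published document, so both the form and the shapes below are assumptions.

```python
import numpy as np

def bilinear_attention(H_t, H_ctx, W_a):
    """Score every word pair with the bilinear form g(a, b) = a^T W b.

    H_t:   (L_t, d_h) hidden vectors of the current sentence's words
    H_ctx: (L_c, d_h) hidden vectors of the context words
    W_a:   (d_h, d_h) learnable parameter (assumed shape)
    Returns the (L_t, L_c) matrix of attention scores alpha_ij.
    """
    return H_t @ W_a @ H_ctx.T

rng = np.random.default_rng(1)
d_h = 3
scores = bilinear_attention(rng.normal(size=(4, d_h)),   # 4 words in sentence t
                            rng.normal(size=(6, d_h)),   # 6 context words
                            rng.normal(size=(d_h, d_h)))
print(scores.shape)  # (4, 6)
```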
After the attention between the t-th sentence and the preceding context is obtained, the context representation of each word of the t-th sentence is calculated according to that attention and the hidden vectors {h_j^τ} corresponding to the words of the c sentences, and the context representation of each word of the c sentences is calculated according to the same attention and the hidden vectors {h_i^t} corresponding to the words of the t-th sentence. In a possible implementation, the context representations of the words of the t-th sentence and of the words of the c sentences are calculated according to the following formulas:
β_{ij}^τ = softmax_j(α_{ij}^τ),
q_i^t = Σ_τ Σ_j β_{ij}^τ h_j^τ,
γ_{ij}^τ = softmax_i(α_{ij}^τ),
q_j^τ = Σ_i γ_{ij}^τ h_i^t,
where softmax is the logistic regression (normalization) operation, q_i^t denotes the context representation of the i-th word of the t-th sentence, and q_j^τ denotes the context representation of the j-th word of sentence S_τ. As can be seen from the formulas, the context representations of the words of the t-th sentence and of the words of the c sentences describe the word-level semantic association information between the t-th sentence and the c sentences, i.e. the association information between the t-th sentence and the preceding context as a whole.
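The softmax-weighted sums just described can be sketched for a single context sentence as follows. Function names and the single-sentence simplification are assumptions for illustration; the patent sums over all c context sentences.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_representations(scores, H_t, H_ctx):
    """Turn raw attention scores into the two context representations.

    scores: (L_t, L_c) attention between current-sentence and context words.
    q_t[i]   = sum_j softmax_j(scores[i, :])_j * H_ctx[j]   (attends over context)
    q_ctx[j] = sum_i softmax_i(scores[:, j])_i * H_t[i]     (attends over sentence t)
    """
    q_t = softmax(scores, axis=1) @ H_ctx        # (L_t, d_h)
    q_ctx = softmax(scores, axis=0).T @ H_t      # (L_c, d_h)
    return q_t, q_ctx

rng = np.random.default_rng(2)
H_t, H_ctx = rng.normal(size=(4, 3)), rng.normal(size=(6, 3))
q_t, q_ctx = context_representations(H_t @ H_ctx.T, H_t, H_ctx)
print(q_t.shape, q_ctx.shape)  # (4, 3) (6, 3)
```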
As shown in Figure 6, in a possible implementation, to reduce information loss, after the context representation of each word in the t-th sentence
Figure PCTCN2021080772-appb-000029
and the context representation of each word in the c sentences
Figure PCTCN2021080772-appb-000030
are obtained, a residual connection operation is performed on the context representation of each word in the t-th sentence and the hidden vector corresponding to each word in the t-th sentence, yielding the context vector of each word in the t-th sentence, namely
Figure PCTCN2021080772-appb-000031
where
Figure PCTCN2021080772-appb-000032
denotes the context vector of each word in the t-th sentence. Likewise, a residual connection operation is performed on the context representation of each word in the c sentences and the hidden vectors corresponding to those words, yielding the context vector of each word in the c sentences, namely
Figure PCTCN2021080772-appb-000033
Figure PCTCN2021080772-appb-000034
where
Figure PCTCN2021080772-appb-000035
denotes the context vector of each word in the c sentences. Note that in other feasible implementations, the context representation of each word in the t-th sentence may be used directly as its context vector, and the context representation of each word in the c sentences directly as the context vector of each word in the c sentences.
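The cross-attention and residual-connection steps described above can be sketched as follows. This is a minimal NumPy illustration that assumes dot-product attention scores; the patent's exact score function appears only in the formula images and is not reproduced here:

```python
import numpy as np

def cross_attention_with_residual(H_t, H_c):
    """Sketch of the cross-attention step: H_t holds the hidden vectors of
    the words of the t-th sentence (n_t x d), H_c those of the words of the
    preceding c sentences (n_c x d)."""
    # Word-level attention scores between the t-th sentence and the context
    # (dot-product scores are an assumption).
    scores = H_t @ H_c.T                      # (n_t, n_c)

    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    # Context representation of each word of the t-th sentence, built from
    # the context words, and vice versa.
    C_t = softmax(scores, axis=1) @ H_c       # (n_t, d)
    C_c = softmax(scores, axis=0).T @ H_t     # (n_c, d)

    # Residual connection to reduce information loss: the context vector is
    # the context representation plus the original hidden vector.
    U_t = C_t + H_t
    U_c = C_c + H_c
    return U_t, U_c
```

The residual sum requires the context representation and the hidden vectors to share the same dimensionality, which matches the description of adding the two term by term.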
After the context vector of each word in the t-th sentence
Figure PCTCN2021080772-appb-000036
and the context vector of each word in the c sentences
Figure PCTCN2021080772-appb-000037
are obtained, an attention operation is performed on them: the context vector of each word in the t-th sentence is attended, in turn, against the context vectors of the words in each of the c sentences, yielding the attention between the t-th sentence and the c sentences. In a possible implementation, this attention is calculated as:
Figure PCTCN2021080772-appb-000038
where
Figure PCTCN2021080772-appb-000039
denotes the attention between the t-th sentence and the c sentences, g_β is the attention weight function,
Figure PCTCN2021080772-appb-000040
Θ_β = W_β, a and b denote the two vectors input to the attention weight function, a^t denotes the transpose of a, and
Figure PCTCN2021080772-appb-000041
is a parameter to be learned.
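A minimal sketch of the attention weight function; the bilinear form g_β(a, b) = a^t · W_β · b is an assumption suggested by the description that Θ_β = W_β is the learned parameter (the exact form appears only in the formula images):

```python
import numpy as np

def attention_weight(a, b, W_beta):
    """Assumed bilinear attention weight g_beta(a, b) = a^t W_beta b,
    with W_beta a learned parameter matrix."""
    return float(a.T @ W_beta @ b)

def sentence_attention(U_t, U_c, W_beta):
    """Attention between each word of the t-th sentence (rows of U_t) and
    each context word (rows of U_c), normalized over the context words
    with a softmax."""
    scores = np.array([[attention_weight(a, b, W_beta) for b in U_c]
                       for a in U_t])
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Each row of the returned matrix is a probability distribution over the context words for one word of the t-th sentence.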
After the attention between the t-th sentence and the c sentences is obtained, the context representation of each word of the t-th sentence with respect to the c sentences is calculated from that attention and the context vectors of the words in the c sentences. In a possible implementation, it is calculated with the following formulas:
Figure PCTCN2021080772-appb-000042
Figure PCTCN2021080772-appb-000043
Figure PCTCN2021080772-appb-000044
where
Figure PCTCN2021080772-appb-000045
denotes the context representation of each word of the t-th sentence with respect to the c sentences, and
Figure PCTCN2021080772-appb-000046
denotes the association vector between each word of the t-th sentence and each of the c sentences. As the formulas show, this context representation describes the sentence-level semantic association between the t-th sentence and the c sentences. Note that in other feasible implementations, the attention between the t-th sentence and the c sentences may also be obtained by performing the attention operation on the context representations of the words of the t-th sentence and of the c sentences.
As shown in Figure 6, in a possible implementation, after a pooling operation is performed on the association vectors
Figure PCTCN2021080772-appb-000047
between each word of the t-th sentence and each of the c sentences, yielding the context representation of each word of the t-th sentence with respect to the c sentences
Figure PCTCN2021080772-appb-000048
a residual connection operation is then performed on that context representation
Figure PCTCN2021080772-appb-000049
and the context vector of each word in the t-th sentence
Figure PCTCN2021080772-appb-000050
yielding the comprehensive representation vector of each word in the t-th sentence. Specifically, the comprehensive representation vector of each word in the t-th sentence is calculated according to the formula
Figure PCTCN2021080772-appb-000051
where
Figure PCTCN2021080772-appb-000052
denotes the comprehensive representation vector of each word in the t-th sentence.
After the comprehensive representation vector of each word in the t-th sentence
Figure PCTCN2021080772-appb-000053
is obtained, the set formed by the comprehensive representation vectors of the words of the t-th sentence
Figure PCTCN2021080772-appb-000054
is integrated into the comprehensive representation vector of the t-th sentence, and the set formed by the context vectors of the words of the c sentences
Figure PCTCN2021080772-appb-000055
is integrated into the comprehensive representation vector of the c sentences. The comprehensive representation vector of the t-th sentence and that of the c sentences are then concatenated, and the semantic consistency confidence is determined from the concatenated vector. Specifically, the concatenated vector is passed through a fully connected layer, which outputs the semantic consistency confidence.
In a possible implementation, as shown in Figure 6, the comprehensive representation vector of the t-th sentence and that of the c sentences are input to a pooling layer and pooled separately before concatenation, so as to reduce the error introduced during the computation. Specifically, the pooling operations are performed according to the formulas
Figure PCTCN2021080772-appb-000056
and
Figure PCTCN2021080772-appb-000057
where r denotes the comprehensive representation vector of the t-th sentence after pooling, and r_ctx denotes the comprehensive representation vector of the c sentences after pooling. After pooling, r and r_ctx are concatenated, the concatenated vector is input to the fully connected layer, and the semantic consistency confidence is output. That is, the semantic consistency confidence is calculated according to the formula Coh(S_t) = MLP([r_ctx; r]), where Coh(S_t) denotes the semantic consistency confidence and MLP denotes the fully connected operation.
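The pooling, concatenation, and fully connected scoring step can be sketched as follows, assuming max-pooling and a single sigmoid-activated fully connected layer (the patent does not fix the pooling type or the layer sizes):

```python
import numpy as np

def confidence_head(R_t, U_c, W, b):
    """Sketch of the final scoring step: max-pool the word-level vectors of
    the t-th sentence (R_t, n_t x d) and of the context (U_c, n_c x d),
    concatenate the two pooled vectors as [r_ctx ; r], and pass the result
    through a fully connected layer with a sigmoid to obtain Coh(S_t)
    in [0, 1]."""
    r = R_t.max(axis=0)                  # pooled vector of the t-th sentence
    r_ctx = U_c.max(axis=0)              # pooled vector of the c sentences
    z = np.concatenate([r_ctx, r])       # [r_ctx ; r]
    logit = float(W @ z + b)             # fully connected layer (MLP)
    return 1.0 / (1.0 + np.exp(-logit))  # semantic consistency confidence
```

W and b stand in for the learned parameters of the fully connected layer; a real implementation would typically stack more than one layer.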
Note that in other feasible implementations, the similarity between the t-th sentence and the c sentences may instead be computed with a preset similarity rule, for example an edit-distance calculation or a Euclidean-distance calculation, and that similarity used as the semantic consistency confidence between the t-th sentence and the c sentences.
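As an illustration of the edit-distance alternative, a standard Levenshtein distance normalized into [0, 1] could serve as such a similarity score:

```python
def edit_distance(s1, s2):
    """Standard Levenshtein distance via single-row dynamic programming."""
    m, n = len(s1), len(s2)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                         # deletion
                        dp[j - 1] + 1,                     # insertion
                        prev + (s1[i - 1] != s2[j - 1]))   # substitution
            prev = cur
    return dp[n]

def similarity(s1, s2):
    """Edit distance normalized into [0, 1]; 1.0 means identical strings.
    Usable as a rough semantic consistency confidence."""
    if not s1 and not s2:
        return 1.0
    return 1.0 - edit_distance(s1, s2) / max(len(s1), len(s2))
```

Surface-form similarity is of course a much weaker signal than the learned model; the patent presents it only as a fallback rule.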
S104: Determine whether the semantic consistency confidence is less than a preset value.
In a possible implementation, the semantic consistency confidence is a number between 0 and 1. The preset value is the threshold used to judge the degree of semantic association between the t-th sentence and the c sentences.
S105: If the semantic consistency confidence is less than the preset value, recognize the t-th sentence, and use the recognition result as an editing instruction to edit the text data.
Specifically, if the semantic consistency confidence is less than the preset value, the semantic association between the t-th sentence and the c sentences is low and the t-th sentence is not coherent with them, which indicates that the t-th sentence changes topic relative to the c sentences and is a voice command distinct from the c sentences. In this case, the t-th sentence is input to a preset intent recognition model, the intent of the t-th sentence is recognized, and that intent is used as an editing instruction to edit the text data. For example, the editing instruction may be an instruction to move the cursor, replace a word, or delete a word, such as "move the cursor forward N characters", "move the cursor to the j-th character of the i-th sentence", "move the cursor after ××", "replace ×× with ××", or "delete ××".
In a possible implementation, the intent recognition model used to recognize the t-th sentence may extract feature words or keywords from the t-th sentence and determine the editing instruction from them. For example, if the extracted feature words or keywords include "cursor", "move", "forward", etc., the cursor is moved according to the recognition result; if the extracted keywords include "replace", the replacement word and the word to be replaced are determined from the recognition result, and the word replacement is then performed. The intent recognition model may also match the t-th sentence against preset templates and determine the editing instruction from the matching result. For example, the templates may include "move the cursor left by ××", "move the cursor to ××", "replace ×× with ××", etc.; each template corresponds to an editing mode, the corresponding editing mode is determined by matching the t-th sentence against the templates, and the corresponding editing instruction is executed according to that mode.
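A minimal sketch of the template-matching variant; the concrete regular expressions and editing-mode names below are hypothetical, standing in for the "move the cursor ...", "replace ... with ...", and "delete ..." templates described above:

```python
import re

# Hypothetical editing-instruction templates; each pattern maps to one
# editing mode, mirroring the one-template-per-editing-mode matching
# described in the text.
TEMPLATES = [
    (re.compile(r"move the cursor forward (\d+) characters"), "move_cursor"),
    (re.compile(r"replace (\S+) with (\S+)"), "replace_word"),
    (re.compile(r"delete (\S+)"), "delete_word"),
]

def match_edit_intent(sentence):
    """Return (editing mode, captured arguments) for the first matching
    template, or (None, ()) if the sentence matches no template."""
    for pattern, mode in TEMPLATES:
        m = pattern.search(sentence)
        if m:
            return mode, m.groups()
    return None, ()
```

A production system would back this up with a learned intent classifier, as the text notes; the templates only cover the closed set of editing commands.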
Because homophones are very common in Chinese, for a word-replacement editing instruction users generally specify the intended character by naming a word that contains it, for example, replacing "子" with the "紫" of "紫色" (purple), or replacing the "记" of "记入" with the "计" of "计算机" (computer). Because of these homophones, in the process of recognizing the t-th sentence the replacement character may still fail to represent the user's intent; for example, the "紫" of "紫色" may be recognized as the "姿" of "姿色", and the "计" of "计算机" as the "机" of "计算机".
In this embodiment of the present application, when the electronic device recognizes a word-replacement editing command, it determines the target description word and the target homophone in the command, where the target description word is the word that contains the target homophone. For example, in replacing "子" with the "紫" of "紫色", the target description word is "紫色" and the target homophone is "紫"; in replacing the "记" of "记入" with the "计" of "计算机", the target description word is "计算机" and the target homophone is "计". After the target description word and the target homophone are determined, the pinyin sequence of the target description word is input to a pinyin-to-Chinese-character sequence labeling model, which outputs the candidate Chinese characters corresponding to the pinyin of the target homophone together with their prior probability distribution. The candidate characters, together with the word to be replaced in the text preceding the t-th sentence, are then input to a homophone classification model, which outputs an association probability for each candidate character. The prior probability and the association probability are combined by weighted averaging to obtain each candidate character's final probability, and the candidate character with the largest final probability is output as the replacement character, used to replace the character in the text preceding the t-th sentence.
Here, the prior probability is the probability that, within the pinyin sequence of the target description word, a candidate character is the target homophone; the association probability characterizes the semantic association between a candidate character and the word to be replaced. By combining the two models, an accurate replacement character can be output.
For example, suppose the editing instruction is to replace the "记" of "记入" with the "计" of "计算机". Inputting "jisuanji" to the pinyin-to-Chinese-character sequence labeling model yields a prior probability of 0.3 for "计" and 0.7 for "机"; inputting "计", "机", and "记入" to the homophone classification model yields an association probability of 0.9 for "计" and 0.1 for "机". Taking the weighted average of the prior and association probabilities gives a final probability of 0.6 for "计" and a final probability of 0.4 for "机".
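The weighted average in this example can be reproduced as follows; equal weights for the prior and association probabilities are an assumption, since the patent does not specify the weights:

```python
def final_probabilities(prior, assoc, w_prior=0.5, w_assoc=0.5):
    """Weighted average of the prior probability (from the pinyin-to-character
    sequence labeling model) and the association probability (from the
    homophone classification model). Equal weights are assumed here."""
    return {ch: w_prior * prior[ch] + w_assoc * assoc[ch] for ch in prior}

# The example from the text: priors 0.3 / 0.7, association 0.9 / 0.1.
probs = final_probabilities({"计": 0.3, "机": 0.7}, {"计": 0.9, "机": 0.1})
best = max(probs, key=probs.get)  # "计", final probability 0.6
```

With equal weights, "计" wins (0.6 vs. 0.4) even though its prior was lower, showing how the context-aware association probability corrects the sequence labeling model.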
S106: If the semantic consistency confidence is greater than or equal to the preset value, store the text data.
Specifically, if the semantic consistency confidence is greater than or equal to the preset value, the semantic association between the t-th sentence and the preceding c sentences is high, and the t-th sentence and the c sentences form coherent text data. The t-th sentence is recorded after the c sentences, the text data is stored, the stored text data is input to the intent recognition model, the intent of the text data is recognized, and the corresponding operation is performed according to the recognized intent.
The specific implementation flow of the voice editing method provided by the embodiments of this application is further described below with reference to a concrete application scenario. As shown in Figure 7, a training corpus is first collected. The training corpus includes text data, each piece containing at least two sentences; in part of the text data the last sentence is an editing instruction, and in the rest the last sentence is not an editing instruction. The semantic consistency confidence of each text with respect to its preceding context is annotated to generate training samples, and the samples are trained to produce the semantic consistency model. The voice data input by the user is then converted into text data, the text data is divided into t sentences, the t-th sentence and the c sentences are input to the semantic consistency model, and whether the t-th sentence is an editing instruction is judged from the semantic consistency confidence output by the model. If it is an editing instruction, the instruction is executed to obtain the updated text; if not, the text data is recorded and the text stored in the electronic device is updated to obtain the updated text. After the updated text is obtained, the corresponding operation is performed according to it; meanwhile, incorrectly recognized text data is stored as new training corpus, annotated, added to the training samples, and used to optimize the semantic consistency model.
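The decision flow of Figure 7 can be sketched as the following loop body; the model interface, threshold value, and return values are illustrative assumptions:

```python
def handle_utterance(sentences, new_sentence, coherence_model, threshold=0.5):
    """Sketch of the per-sentence decision: score the new sentence against
    the preceding context; below the threshold it is treated as an editing
    instruction, otherwise it is appended as dictated text."""
    confidence = coherence_model(sentences, new_sentence)
    if confidence < threshold:
        # Low coherence: topic change, so recognize and execute as a command.
        return ("edit", new_sentence)
    # High coherence: a continuation of the dictation, so store it.
    sentences.append(new_sentence)
    return ("append", new_sentence)
```

`coherence_model` stands in for the trained semantic consistency model; in the patent's flow the "edit" branch then goes through intent recognition before the instruction is executed.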
In the above embodiments, the voice data input by the user is converted into text data, the text data is divided into t sentences, the t sentences are input to the semantic consistency model, and the semantic consistency confidence between the t-th sentence and the c sentences is calculated. Because the semantic consistency model describes both the semantic association between the t-th sentence and the c sentences as a whole and the association between the t-th sentence and each of the c sentences, it can extract more effective association information between the t-th sentence and the c sentences, which makes the output semantic consistency confidence more robust. In the above embodiments, if the semantic consistency confidence is less than the preset value, the semantic relevance between the t-th sentence and the c sentences is low; that is, the t-th sentence changes topic relative to the c sentences and is an instruction distinct from the preceding text. In this case, the t-th sentence is taken as an editing instruction and that instruction is executed, so that the c sentences can be edited without any additional interaction, which simplifies operation and improves the user experience.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
Based on the same inventive concept, an embodiment of the present application further provides an electronic device. As shown in Figure 8, the electronic device provided by the embodiment of the present application may include a processor 210, a memory 220, and a network interface 230, connected through a communication bus 240.
The processor 210 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. Optionally, the processor 210 may include one or more processing units.
The memory 220 may be an internal storage unit of the electronic device, such as its hard disk or internal memory. The memory 220 may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device. Further, the memory 220 may include both an internal storage unit of the electronic device and an external storage device. The memory 220 is used to store the computer program and other programs and data required by the electronic device, and may also be used to temporarily store data that has been or will be output.
The network interface 230 may be used to send and receive information, may include a wired interface and/or a wireless interface, and is generally used to establish a communication connection between this electronic device and other electronic devices.
Optionally, the electronic device may further include a user interface 250. The user interface 250 may include a display and an input unit such as a keyboard, and optionally may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, etc. The display may also suitably be called a display screen or display unit, used to display information processed in the electronic device and to present a visual user interface.
Those skilled in the art can understand that Figure 8 is merely an example of an electronic device and does not constitute a limitation on the electronic device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The electronic device provided in this embodiment can execute the foregoing method embodiments; its implementation principles and technical effects are similar and are not repeated here.
In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
Those skilled in the art can clearly understand that, for convenience and conciseness of description, the division into the above functional units and modules is merely illustrative. In practical applications, the above functions may be allocated to different functional units and modules as required; that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for ease of mutual distinction and are not used to limit the protection scope of this application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application may be implemented by instructing relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, implements the steps of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to the photographing apparatus/electronic device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In the embodiments provided in this application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the apparatus/network device embodiments described above are merely illustrative. For example, the division into modules or units is only a division by logical function; in actual implementation there may be other divisions, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered to go beyond the scope of this application.
Finally, it should be noted that the above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any change or replacement within the technical scope disclosed in this application shall be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (14)

  1. A voice editing method, characterized by comprising:
    obtaining input voice data;
    converting the voice data into text data, and dividing the text data into t sentences, where t is an integer greater than 1;
    calculating a semantic consistency confidence between the t-th sentence of the t sentences and the c sentences preceding the t-th sentence, where the semantic consistency confidence describes a degree of semantic relevance between the t-th sentence and the c sentences, and c is an integer greater than 0;
    if the semantic consistency confidence is less than a preset value, recognizing the t-th sentence, and using the recognition result as an editing instruction to edit the text data.
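The control flow of claim 1 can be sketched in a few lines. This is a minimal, runnable illustration only: the consistency model is replaced by a toy word-overlap score, intent recognition by a keyword match, and all function names and the threshold are hypothetical stand-ins, not from the patent.

```python
PRESET_VALUE = 0.2  # assumed confidence threshold


def toy_consistency(sentence, context):
    """Toy stand-in for the semantic consistency model:
    word overlap between the candidate sentence and the c preceding sentences."""
    words = set(sentence.lower().split())
    ctx_words = set(w for s in context for w in s.lower().split())
    return len(words & ctx_words) / max(len(words), 1)


def toy_intent(sentence):
    """Toy stand-in for the intent recognition model of claim 11."""
    return "delete_last" if "delete" in sentence.lower() else "unknown"


def process(sentences, c=2):
    """Decide whether the last of t sentences is dictated content or an edit command."""
    context = sentences[-1 - c:-1]            # the c sentences before the t-th
    conf = toy_consistency(sentences[-1], context)
    if conf < PRESET_VALUE:                   # low coherence: likely a spoken command
        intent = toy_intent(sentences[-1])
        if intent == "delete_last":
            return sentences[:-2]             # drop the command and the last content sentence
    return sentences                          # coherent continuation: keep as dictated text


dictation = ["the meeting starts at nine",
             "the meeting room is on the second floor",
             "delete that last one please"]
print(process(dictation))  # → ['the meeting starts at nine']
```

The key design point mirrors the claim: the edit command is detected not by a wake word but by the final utterance being semantically inconsistent with the preceding context.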
  2. The voice editing method according to claim 1, wherein the calculating a semantic consistency confidence between the t-th sentence of the t sentences and the c sentences preceding the t-th sentence comprises:
    inputting the t sentences into a preset semantic consistency model to obtain the semantic consistency confidence, output by the semantic consistency model, between the t-th sentence of the t sentences and the c sentences preceding the t-th sentence.
  3. The voice editing method according to claim 2, wherein the preset semantic consistency model is configured to:
    calculate a comprehensive representation vector of the t-th sentence according to the t-th sentence and the c sentences, where the comprehensive representation vector of the t-th sentence describes the semantic association between the t-th sentence and the c sentences as a whole, and the semantic association between the t-th sentence and each of the c sentences;
    determine the semantic consistency confidence according to the comprehensive representation vector of the t-th sentence.
  4. The voice editing method according to claim 3, wherein the calculating a comprehensive representation vector of the t-th sentence according to the t-th sentence and the c sentences comprises:
    determining a context vector of each word of the t-th sentence and a context vector of each word of the c sentences according to the t-th sentence and the c sentences;
    calculating the comprehensive representation vector of the t-th sentence according to the context vectors of the words of the t-th sentence and the context vectors of the words of the c sentences.
  5. The voice editing method according to claim 4, wherein the determining a context vector of each word of the t-th sentence and a context vector of each word of the c sentences according to the t-th sentence and the c sentences comprises:
    performing an attention operation on the t-th sentence and the c sentences to obtain the attention of the t-th sentence with respect to the preceding context;
    calculating the context vector of each word of the t-th sentence and the context vector of each word of the c sentences according to the attention of the t-th sentence with respect to the preceding context.
  6. The voice editing method according to claim 5, wherein the performing an attention operation on the t-th sentence and the c sentences to obtain the attention of the t-th sentence with respect to the preceding context comprises:
    performing word segmentation on the t-th sentence, and determining hidden vectors corresponding to the words of the t-th sentence according to the segmented t-th sentence;
    performing word segmentation on the c sentences, and determining hidden vectors corresponding to the words of the c sentences according to the segmented c sentences;
    performing an attention operation on the hidden vectors corresponding to the words of the t-th sentence and the hidden vectors corresponding to the words of the c sentences, to obtain the attention of the t-th sentence with respect to the preceding context.
  7. The voice editing method according to claim 6, wherein the calculating the context vector of each word of the t-th sentence and the context vector of each word of the c sentences according to the attention of the t-th sentence with respect to the preceding context comprises:
    calculating a context representation of each word of the t-th sentence according to the attention of the t-th sentence with respect to the preceding context and the hidden vectors corresponding to the words of the c sentences;
    performing a residual connection operation on the context representations of the words of the t-th sentence and the hidden vectors corresponding to the words of the t-th sentence, to obtain the context vector of each word of the t-th sentence;
    calculating a context representation of each word of the c sentences according to the attention of the t-th sentence with respect to the preceding context and the hidden vectors corresponding to the words of the t-th sentence;
    performing a residual connection operation on the context representations of the words of the c sentences and the hidden vectors corresponding to the words of the c sentences, to obtain the context vector of each word of the c sentences.
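The attention and residual steps of claims 5 to 7 can be sketched as follows. This is an illustration under assumed conventions, not the patent's actual model: the hidden vectors come from some unspecified encoder (here random toy values), the attention operation is taken to be standard scaled dot-product attention, the residual connection is a plain addition, and reusing the transposed attention matrix for the context words' representations is an assumption.

```python
import numpy as np


def attention_weights(H_t, H_c):
    """Attention of each word of the t-th sentence over the context words."""
    d = H_t.shape[-1]
    scores = H_t @ H_c.T / np.sqrt(d)                  # (n_t, n_c)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)           # row-wise softmax


def context_vectors(H_t, H_c):
    """Context vectors in the spirit of claim 7: attended representation + residual."""
    A = attention_weights(H_t, H_c)   # attention of the t-th sentence w.r.t. the context
    rep_t = A @ H_c                   # context representation of each word of sentence t
    ctx_t = rep_t + H_t               # residual connection
    rep_c = A.T @ H_t                 # context representation of each context word (reuse of A.T is an assumption)
    ctx_c = rep_c + H_c               # residual connection
    return ctx_t, ctx_c


rng = np.random.default_rng(0)
H_t = rng.standard_normal((4, 8))    # 4 words in the t-th sentence, hidden dim 8
H_c = rng.standard_normal((10, 8))   # 10 words across the c preceding sentences
ctx_t, ctx_c = context_vectors(H_t, H_c)
print(ctx_t.shape, ctx_c.shape)      # → (4, 8) (10, 8)
```

Note that the residual addition keeps each word's own hidden vector in its context vector, so downstream layers see both the word itself and what it attended to.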
  8. The voice editing method according to claim 4, wherein the calculating the comprehensive representation vector of the t-th sentence according to the context vectors of the words of the t-th sentence and the context vectors of the words of the c sentences comprises:
    performing an attention operation on the context vectors of the words of the t-th sentence and the context vectors of the words of the c sentences, to obtain the attention between the t-th sentence and the c sentences;
    calculating the comprehensive representation vector of the t-th sentence according to the attention between the t-th sentence and the c sentences.
  9. The voice editing method according to claim 8, wherein the calculating the comprehensive representation vector of the t-th sentence according to the attention between the t-th sentence and the c sentences comprises:
    calculating, according to the attention between the t-th sentence and the c sentences and the context vectors of the words of the c sentences, a context representation of each word of the t-th sentence with respect to the c sentences;
    performing a residual connection operation on the context representations of the words of the t-th sentence with respect to the c sentences and the context vectors of the words of the t-th sentence, to obtain the comprehensive representation vector of the t-th sentence.
  10. The voice editing method according to claim 4, wherein the determining the semantic consistency confidence according to the comprehensive representation vector of the t-th sentence comprises:
    determining a comprehensive representation vector of the c sentences according to the context vectors of the words of the c sentences;
    concatenating the comprehensive representation vector of the t-th sentence and the comprehensive representation vector of the c sentences, and determining the semantic consistency confidence according to the concatenated vector.
  11. The voice editing method according to claim 1, wherein the recognizing the t-th sentence comprises:
    inputting the t-th sentence into a preset intent recognition model to obtain a recognition result output by the preset intent recognition model.
  12. The voice editing method according to claim 1, wherein after the calculating a semantic consistency confidence between the t-th sentence of the t sentences and the c sentences preceding the t-th sentence, the voice editing method further comprises:
    if the semantic consistency confidence is greater than or equal to the preset value, storing the text data.
  13. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the voice editing method according to any one of claims 1 to 12 is implemented.
  14. A computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the voice editing method according to any one of claims 1 to 12 is implemented.
PCT/CN2021/080772 2020-06-01 2021-03-15 Voice editing method, electronic device and computer readable storage medium WO2021244099A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010484871.0A CN113761843B (en) 2020-06-01 2020-06-01 Voice editing method, electronic device and computer readable storage medium
CN202010484871.0 2020-06-01

Publications (1)

Publication Number Publication Date
WO2021244099A1

Family

ID=78782605

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/080772 WO2021244099A1 (en) 2020-06-01 2021-03-15 Voice editing method, electronic device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN113761843B (en)
WO (1) WO2021244099A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114238566A (en) * 2021-12-10 2022-03-25 零犀(北京)科技有限公司 Data enhancement method and device for voice or text data
CN114416928A (en) * 2022-01-25 2022-04-29 阿里巴巴达摩院(杭州)科技有限公司 Text determination method and device and electronic equipment
US11995394B1 (en) * 2023-02-07 2024-05-28 Adobe Inc. Language-guided document editing

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103645876A (en) * 2013-12-06 2014-03-19 百度在线网络技术(北京)有限公司 Voice inputting method and device
CN103885743A (en) * 2012-12-24 2014-06-25 大陆汽车投资(上海)有限公司 Voice text input method and system combining with gaze tracking technology
CN106933561A (en) * 2015-12-31 2017-07-07 北京搜狗科技发展有限公司 Pronunciation inputting method and terminal device
US20170263248A1 (en) * 2016-03-14 2017-09-14 Apple Inc. Dictation that allows editing
CN109119079A (en) * 2018-07-25 2019-01-01 天津字节跳动科技有限公司 voice input processing method and device
CN111161735A (en) * 2019-12-31 2020-05-15 安信通科技(澳门)有限公司 Voice editing method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5044783B2 (en) * 2007-01-23 2012-10-10 国立大学法人九州工業大学 Automatic answering apparatus and method
KR20150024188A (en) * 2013-08-26 2015-03-06 삼성전자주식회사 A method for modifiying text data corresponding to voice data and an electronic device therefor
CN109994105A (en) * 2017-12-29 2019-07-09 宝马股份公司 Data inputting method, device, system, vehicle and readable storage medium storing program for executing
CN108984529B (en) * 2018-07-16 2022-06-03 北京华宇信息技术有限公司 Real-time court trial voice recognition automatic error correction method, storage medium and computing device
CN110738997B (en) * 2019-10-25 2022-06-17 百度在线网络技术(北京)有限公司 Information correction method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113761843A (en) 2021-12-07
CN113761843B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
JP7317791B2 (en) Entity linking method, device, apparatus and storage medium
CN109241524B (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
WO2021244099A1 (en) Voice editing method, electronic device and computer readable storage medium
CN107818781B (en) Intelligent interaction method, equipment and storage medium
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
CN108846077B (en) Semantic matching method, device, medium and electronic equipment for question and answer text
CN110223695B (en) Task creation method and mobile terminal
CN111985240B (en) Named entity recognition model training method, named entity recognition method and named entity recognition device
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
WO2022142011A1 (en) Method and device for address recognition, computer device, and storage medium
WO2020233131A1 (en) Question-and-answer processing method and apparatus, computer device and storage medium
CN112784696B (en) Lip language identification method, device, equipment and storage medium based on image identification
US12008336B2 (en) Multimodal translation method, apparatus, electronic device and computer-readable storage medium
CN107112009B (en) Method, system and computer-readable storage device for generating a confusion network
WO2022174496A1 (en) Data annotation method and apparatus based on generative model, and device and storage medium
CN111460115A (en) Intelligent man-machine conversation model training method, model training device and electronic equipment
WO2024099037A1 (en) Data processing method and apparatus, entity linking method and apparatus, and computer device
Dhole Resolving intent ambiguities by retrieving discriminative clarifying questions
CN109271624B (en) Target word determination method, device and storage medium
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
CN116303537A (en) Data query method and device, electronic equipment and storage medium
CN110827799B (en) Method, apparatus, device and medium for processing voice signal
CN113220828B (en) Method, device, computer equipment and storage medium for processing intention recognition model
CN117591663B (en) Knowledge graph-based large model promt generation method
CN114722832A (en) Abstract extraction method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21817248

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21817248

Country of ref document: EP

Kind code of ref document: A1