CN117437915A - Reply content generation method and device, electronic equipment and readable medium - Google Patents

Info

Publication number
CN117437915A
Authority
CN
China
Prior art keywords
voice
vector
reply
text
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311261688.4A
Other languages
Chinese (zh)
Inventor
马春春
方康
冯敏
闵天磊
李国忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Qiangtong Intelligent Technology Co ltd
Original Assignee
Shanghai Qiangtong Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Qiangtong Intelligent Technology Co ltd filed Critical Shanghai Qiangtong Intelligent Technology Co ltd
Priority to CN202311261688.4A
Publication of CN117437915A
Legal status: Pending

Classifications

    • G10L15/22 — Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06F16/35 — Information retrieval of unstructured textual data; clustering; classification
    • G06F18/253 — Pattern recognition; fusion techniques of extracted features
    • G06F40/30 — Handling natural language data; semantic analysis
    • G06N3/044 — Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/0455 — Neural networks; auto-encoder networks; encoder-decoder networks
    • G06N3/0464 — Neural networks; convolutional networks [CNN, ConvNet]
    • G06N3/084 — Neural network learning methods; backpropagation, e.g. using gradient descent
    • G06N5/041 — Inference or reasoning models; abduction
    • G10L15/1822 — Speech classification or search using natural language modelling; parsing for meaning understanding
    • G10L15/26 — Speech to text systems
    • G10L19/16 — Vocoder architecture
    • G06N3/048 — Neural networks; activation functions
    • G10L2015/225 — Feedback of the input speech

Abstract

The embodiments of the disclosure disclose a reply content generation method and device, an electronic device, and a computer-readable medium. One embodiment of the method comprises the following steps: acquiring user voice; determining at least one piece of reply text corresponding to the user voice to obtain a reply text sequence; generating a voice vector according to the user voice; generating a text vector sequence according to pre-configured prompt information and the reply text sequence; and generating reply content corresponding to the user voice according to the voice vector and the text vector sequence. This implementation improves the accuracy and efficiency of speech recognition and natural language understanding, provides the user with a more convenient and intelligent voice interaction experience, shortens the processing pipeline, reduces processing time, and enables smoother interaction between machine and human.

Description

Reply content generation method and device, electronic equipment and readable medium
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular, to a reply content generation method, apparatus, electronic device, and computer readable medium.
Background
In the field of natural language processing, conventional speech recognition and text generation techniques typically require a combination of modules, such as speech signal processing, feature extraction, acoustic models, language models, and the like. These modules require manual design and adjustment, and require a large amount of manual annotation data to train the model. This approach has many problems such as coupling between modules, data scarcity, and model complexity.
Most existing models take text as input and produce text as output: in a conversation with a human, speech is first converted into text by speech recognition, the model generates an answer from that text, and the answer is then played back through text-to-speech technology. Such a pipeline is structurally complex and slow, and the emotional information in human expression is lost during speech recognition; the same words spoken in different tones can mean entirely different things. As a result, existing models are prone to two problems when communicating with humans: they cannot accurately understand what the human expresses, and they respond slowly.
Disclosure of Invention
This portion of the disclosure is intended to introduce concepts in a simplified form that are described in further detail in the detailed description below. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose a reply content generation method, apparatus, electronic device, and computer-readable medium to solve the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a reply content generation method, the method including: acquiring user voice; determining at least one piece of reply text corresponding to the user voice to obtain a reply text sequence; generating a voice vector according to the user voice; generating a text vector sequence according to the pre-configured prompt information and the reply text sequence; and generating reply content corresponding to the user voice according to the voice vector and the text vector sequence.
In a second aspect, some embodiments of the present disclosure provide a reply content generation apparatus, the apparatus including: an acquisition unit configured to acquire user voice; a determining unit configured to determine at least one piece of reply text corresponding to the user voice and obtain a reply text sequence; a first generation unit configured to generate a voice vector from the user voice; a second generation unit configured to generate a text vector sequence according to pre-configured prompt information and the reply text sequence; and a third generation unit configured to generate reply content corresponding to the user voice according to the voice vector and the text vector sequence.
In a third aspect, an embodiment of the present application provides an electronic device, the electronic device including: one or more processors; and a storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
One of the above embodiments of the present disclosure has the following beneficial effects: language rules and semantic information are learned automatically from a large-scale corpus by the speech decoder, which improves the accuracy and efficiency of speech generation. The speech decoder can generate output directly from the speech signal without manually designing and tuning multiple modules; because it learns language rules and semantic information automatically, the accuracy and efficiency of speech recognition and natural language understanding are improved, speech input is recognized more accurately, and different speech input styles and language environments are learned and adapted to automatically, realizing more accurate and efficient speech recognition. At the same time, various speech processing functions such as speech synthesis and speech translation can be realized, providing the user with a more convenient and intelligent voice interaction experience. In addition, the speech decoder remedies the shortcomings of the traditional model: speech is input directly into the speech decoder to obtain speech output, which shortens the pipeline and speeds up processing, while the directly input audio stream fully preserves the emotional information of human speech, so the speech decoder can accurately understand human expression and achieve smoother interaction between machine and human.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a schematic illustration of one application scenario of a reply content generation method according to some embodiments of the present disclosure;
FIG. 2 is a flow chart of some embodiments of a reply content generation method according to the present disclosure;
FIG. 3 is a schematic diagram of the structure of some embodiments of reply content generation devices according to the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings. Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "a plurality" in this disclosure are illustrative rather than limiting; those of ordinary skill in the art will understand them to mean "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is a schematic diagram of one application scenario of a reply content generation method according to some embodiments of the present disclosure.
As shown in fig. 1, a server 101 may obtain a user voice 102, determine at least one piece of reply text corresponding to the user voice 102 to obtain a reply text sequence 103, generate a voice vector 104 according to the user voice 102, generate a text vector sequence 106 according to preconfigured prompt information 105 and the reply text sequence 103, and generate reply content 107 corresponding to the user voice 102 according to the voice vector 104 and the text vector sequence 106.
It is to be understood that the reply content generation method may be executed by a terminal device or by the server 101; the execution subject of the method may also include a device formed by integrating the terminal device and the server 101 via a network, or the method may be executed by various software programs. The terminal device may be any of various electronic devices with information processing capabilities, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, and desktop computers. The execution subject may also be embodied as the server 101, software, etc. When the execution subject is software, it can be installed in the electronic devices enumerated above. It may be implemented as a plurality of software programs or software modules, for example to provide distributed services, or as a single software program or software module. No specific limitation is made here.
It should be understood that the number of servers in fig. 1 is merely illustrative. There may be any number of servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of some embodiments of reply content generation methods according to the present disclosure is shown. The reply content generation method comprises the following steps:
Step 201, user speech is acquired.
In some embodiments, the execution subject of the reply content generation method (e.g., the server shown in fig. 1) may acquire the user voice through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G, WiFi, Bluetooth, WiMAX, ZigBee, UWB (ultra-wideband), and other now known or later developed wireless connection means.
Specifically, after the user voice is acquired, data preprocessing and voice denoising are typically applied:
1. Data preprocessing: first, a large amount of voice data needs to be preprocessed, including sampling, filtering, and feature extraction of the voice signal, to facilitate subsequent speech recognition.
2. Voice denoising: creating a voice denoiser helps eliminate noise in speech and improves the quality and clarity of the speech signal. In practical applications, the voice denoiser can be used in fields such as speech recognition, speech synthesis, and voice communication, effectively improving the accuracy and reliability of speech processing. By analyzing and processing the speech signal, the denoiser can identify and eliminate interference factors such as environmental noise, electronic noise, and speech distortion, yielding a clearer and more natural speech signal. In implementing the voice denoiser, a series of signal processing algorithms and techniques (time-domain filtering, frequency-domain filtering, and wavelet transforms) are adopted to achieve the best denoising effect.
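As an illustrative sketch only (not part of the claimed method), a minimal frequency-domain denoiser of the kind listed above might look as follows; the spectral-subtraction strategy and the availability of a noise-only reference segment are assumptions made here for demonstration:

```python
import numpy as np

def denoise_spectral_subtraction(signal: np.ndarray, noise_sample: np.ndarray,
                                 frame_len: int = 512) -> np.ndarray:
    """Frequency-domain denoising by spectral subtraction (illustrative sketch).

    `noise_sample` is assumed to contain at least `frame_len` samples of
    background noise only, from which the noise magnitude spectrum is estimated.
    """
    # Estimate the noise floor from the noise-only segment.
    noise_mag = np.abs(np.fft.rfft(noise_sample[:frame_len]))
    out = np.zeros_like(signal, dtype=float)
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        # Subtract the estimated noise magnitude, clipping at zero.
        clean_mag = np.maximum(mag - noise_mag, 0.0)
        out[start:start + frame_len] = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len)
    return out
```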
Step 202, determining at least one piece of reply text corresponding to the user voice to obtain a reply text sequence.
In some embodiments, the executing entity may determine at least one piece of reply text corresponding to the user voice, to obtain a reply text sequence.
Step 203, generating a speech vector according to the user speech.
In some embodiments, the execution body may generate a speech vector from the user speech.
In some optional implementations of some embodiments, the executing entity may input the user speech to a pre-trained speech encoder to obtain a speech vector, where a residual vector quantizer is disposed in the speech encoder, and the residual vector quantizer is used to compress data.
Specifically, the speech encoder: creating a speech encoder converts the denoised audio information into speech representations for processing and transmission in fields such as digital communications, speech recognition, and speech synthesis. The main task of a speech encoder is to convert a high-dimensional speech signal into a low-dimensional speech representation for transmission and storage within limited bandwidth and storage space. In implementing the speech encoder, a series of signal processing algorithms and techniques (linear predictive coding, vector quantization, and wavelet transforms) are adopted to achieve the best coding effect. By means of the speech encoder, the speech signal can be converted into a digital signal, thereby realizing digitized speech processing and transmission.
The main components used are the speech encoder described above and an intermediate multi-layer residual vector quantizer (RVQ). The encoder converts the input audio x with sampling rate fs into an embedding sequence and outputs it to the quantizer.
The reason for using a residual vector quantizer (RVQ): the vectors generated by the speech encoder can take an unlimited number of values. To transmit them to the receiver using a limited number of bits, they must be replaced with approximation vectors drawn from a finite codebook.
This approach works well at bit rates of about 1 kbps or less but quickly reaches its limit at higher bit rates. For example, even at a bit rate as low as 3 kbps, assuming the encoder produces 100 vectors per second, a codebook of more than one billion vectors would need to be stored, which is impractical. With a residual vector quantizer, the first layer quantizes the code vector at medium resolution, and each subsequent layer processes the residual of the previous layer. Dividing the quantization process into several layers greatly reduces the codebook size. For example, at 3 kbps with 100 vectors per second, using 5 quantization levels reduces the codebook size from one billion to 320.
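The codebook arithmetic behind these figures can be checked directly: at 3 kbps and 100 vectors per second, each vector carries 30 bits, so a single-stage codebook needs 2^30 (about one billion) entries, while 5 residual stages of 6 bits each need only 5 × 64 = 320:

```python
bits_per_vector = 3000 // 100          # 3 kbps, 100 vectors/s -> 30 bits per vector
single_stage = 2 ** bits_per_vector    # 1,073,741,824 entries (~1 billion)
stages, bits_per_stage = 5, 30 // 5    # 5 residual stages, 6 bits each
multi_stage = stages * 2 ** bits_per_stage  # 5 * 64 = 320 entries
print(single_stage, multi_stage)       # 1073741824 320
```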
The core idea of RVQ is to exploit redundant information in the data to increase the compression ratio while maintaining high coding efficiency, providing similar quality at a much lower bandwidth.
The conventional linear prediction approach decomposes a speech signal into several linear components plus a residual component; it achieves high compression but is very sensitive to noise, so compression and denoising in the decomposition require special processing and optimization. The compression algorithm used here is similar to vector quantization, and it solves the problem of excessive codebook storage in the 3 kbps scenario.
The present disclosure compresses data using a residual vector quantizer, implementing a hierarchical, progressive quantization process in which each layer processes the residual of the previous layer. This better removes the effect of noise.
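A minimal numpy sketch of this layered residual quantization; the random codebooks and dimensions are stand-in assumptions (a real codec would learn its codebooks, e.g. with k-means), but the encode/decode logic is the stage-by-stage residual scheme described above:

```python
import numpy as np

def rvq_encode(x: np.ndarray, codebooks: list) -> list:
    """Residual vector quantization: each stage quantizes the residual
    left by the previous stage with a nearest-neighbour codebook lookup."""
    residual, indices = x.astype(float), []
    for cb in codebooks:                         # cb: (codebook_size, dim)
        i = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(i)
        residual = residual - cb[i]              # pass the residual to the next stage
    return indices

def rvq_decode(indices: list, codebooks: list) -> np.ndarray:
    # Reconstruction is the sum of the selected codeword from every stage.
    return sum(cb[i] for i, cb in zip(indices, codebooks))

rng = np.random.default_rng(0)
dim, stages, size = 8, 5, 64                     # 5 stages x 64 entries = 320 codewords
books = [rng.standard_normal((size, dim)) for _ in range(stages)]
x = rng.standard_normal(dim)
x_hat = rvq_decode(rvq_encode(x, books), books)  # coarse-to-fine approximation of x
```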
Step 204, generating a text vector sequence according to the pre-configured prompt information and the reply text sequence.
In some embodiments, the execution body may generate the text vector sequence according to the pre-configured prompt information and the reply text sequence.
Specifically, the aforementioned prompt refers to input text, provided to the speech decoder through a serializer and deserializer, that helps the speech decoder better process the input data. The present disclosure preserves the emotional information of human speech by supporting prompt input, which can customize the timbre, emotion, etc. of the speech output by the speech decoder.
Specifically, the execution body may generate the text vector sequence through a text encoder. The text encoder uses a SpeechT5 model, which is also a Transformer-based model.
Text encoder: creating a text encoder converts the user-supplied prompt information into text representations for processing and analysis in fields such as natural language processing, machine translation, and text classification. The main task of the text encoder is to convert high-dimensional text information into low-dimensional text representations for storage and processing in limited memory space. In implementing the text encoder, a series of natural language processing algorithms and techniques (word embedding, convolutional neural networks, and recurrent neural networks) are adopted to achieve the best encoding effect. The text encoder converts text information into vector representations, thereby realizing digital processing and analysis of the text.
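Since the disclosure names word embeddings, convolutional networks, and recurrent networks as the text encoder's building blocks, a minimal recurrent sketch is given below; the module choices and sizes are illustrative assumptions, not the actual SpeechT5 architecture:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Maps a token-id sequence to a sequence of low-dimensional text vectors."""
    def __init__(self, vocab_size: int = 10000, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)             # word embedding
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # recurrent layer

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> text vectors: (batch, seq_len, hidden_dim)
        out, _ = self.rnn(self.embed(token_ids))
        return out

encoder = TextEncoder()
prompt_and_reply = torch.randint(0, 10000, (1, 12))  # tokenized prompt + reply text
text_vectors = encoder(prompt_and_reply)             # -> (1, 12, 256)
```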
Step 205, generating reply content corresponding to the user voice according to the voice vector and the text vector sequence.
In some embodiments, the executing body may generate the reply content corresponding to the user voice according to the voice vector and the text vector sequence.
In some optional implementations of some embodiments, the execution body may input the text vector sequence and the speech vector to a pre-trained speech decoder to obtain a reply vector corresponding to the user speech; generating reply voice and/or reply text according to the reply vector; and taking the reply voice and/or the reply text as reply content.
As an example, the user voice may be "What is the weather like in Shanghai today?" and the prompt information may be a model instruction such as "reply in a gentle female voice". The speech segment and the text segment are parsed into vectors carrying feature information (a text vector and a speech vector) by the two encoders (the text encoder and the speech encoder); these vectors are input to the speech decoder, which outputs a reply vector, and a segment of gentle female speech is generated as the reply content according to the reply vector, for example "The weather in Shanghai is sunny today."
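Putting the example together with steps 201–205, the runtime flow can be sketched as below; `speech_encoder`, `text_encoder`, `speech_decoder`, and `retrieve_reply_texts` are hypothetical placeholders for the components described in this disclosure, not named APIs:

```python
def generate_reply(user_audio, prompt, speech_encoder, text_encoder,
                   speech_decoder, retrieve_reply_texts):
    # Step 202: candidate reply texts for the utterance -> reply text sequence.
    reply_texts = retrieve_reply_texts(user_audio)
    # Step 203: speech vector straight from audio, keeping tone/emotion cues.
    speech_vec = speech_encoder(user_audio)
    # Step 204: prompt + reply texts -> text vector sequence.
    text_vecs = [text_encoder(prompt + " " + t) for t in reply_texts]
    # Step 205: the decoder fuses both modalities into a reply vector,
    # from which reply speech and/or reply text is generated.
    reply_vec = speech_decoder(speech_vec, text_vecs)
    return reply_vec
```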
In some optional implementations of some embodiments, the speech decoder is composed of an input layer, at least two convolution layers, at least two pooling layers, a plurality of activation layers, and a fully connected layer, where the convolution layers include a first convolution layer composed of five convolution kernels and a second convolution layer composed of ten convolution kernels, and the first and second convolution layers are sequentially connected to the pooling layers and the activation layers.
In some alternative implementations of some embodiments, the speech decoder is trained according to the following steps: obtaining a training sample set, wherein the training sample set comprises a sample voice vector and a sample text vector sequence, and sample reply vectors corresponding to the sample voice vector and the sample text vector sequence; inputting the sample voice vector and the sample text vector sequence into a model to be trained to obtain a reply vector; comparing the answer vector with the sample answer vector to obtain a comparison result; and responding to the fact that the comparison result meets the preset condition, and taking the model to be trained as a voice decoder.
Specifically, the sample speech vector may be obtained by inputting speech uttered by a person into the speech encoder, and the sample text vector is usually generated from the input sample prompt. Suppose the content of the sample speech vector is "What is the weather like in Shanghai today?" and the sample prompt information is a model instruction such as "reply in a gentle female voice"; the sample reply vector corresponding to the sample speech vector and sample text vector may then be speech in a gentle female voice (the expected output), for example "The weather in Shanghai is sunny today". The sample speech vector and sample text vector sequence are fed to the model to be trained, which outputs a reply vector (the actual output); the actual output is compared with the expected output, and the speech decoder is optimized according to the comparison result, completing one training iteration.
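A hedged sketch of one such training iteration; the mean-squared-error loss, the optimizer interface, and the tolerance threshold are assumptions standing in for the disclosure's unspecified comparison and preset condition:

```python
import torch.nn.functional as F

def train_step(model, optimizer, sample_speech_vec, sample_text_vecs,
               sample_reply_vec, tolerance: float = 1e-3) -> bool:
    """One training iteration: returns True if the comparison result
    already meets the preset condition (training can stop)."""
    reply_vec = model(sample_speech_vec, sample_text_vecs)   # actual output
    loss = F.mse_loss(reply_vec, sample_reply_vec)           # compare with expected output
    if loss.item() < tolerance:                              # preset condition met
        return True
    optimizer.zero_grad()
    loss.backward()        # back propagation
    optimizer.step()       # forward/backward alternation continues
    return False
```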
In some optional implementations of some embodiments, the executing body may optimize the model to be trained by using forward propagation and backward propagation in an alternating manner in response to determining that the comparison result does not meet a preset condition.
In some optional implementations of some embodiments, the executing entity may calculate the output result using forward propagation through the training data and weight parameters; the gradient of the loss function with respect to each parameter is calculated using back propagation through the chain rule, a momentum term simulated from gradients is used in place of the raw gradient, and the model to be trained is optimized according to the following formulas:

$$\nu_t = \gamma \nu_{t-1} + \eta_t g_t$$

$$\chi_t = \chi_{t-1} - \nu_t$$

where $\gamma$ denotes the momentum hyperparameter, satisfying $0 \le \gamma < 1$; $\nu_t$ denotes the velocity at time step $t$; $g_t$ denotes the gradient at time step $t$; and $\eta_t$ denotes the learning rate.
Compared with the conventional approach, the update of the independent variable at each time step $t$ approximates an exponentially weighted moving average of the most recent $1/(1-\gamma)$ time steps' updates, divided by $1-\gamma$. The magnitude of the independent variable's movement in each direction therefore depends not only on the current gradient but also on how consistent the past gradients are in that direction. In this way the optimal solution can be found more easily, and descent is faster when successive gradients agree, even where the gradient itself is not large.
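The two update formulas translate directly into code. A minimal sketch (the quadratic test function and the hyperparameter values are illustrative assumptions):

```python
import numpy as np

def momentum_step(x, v, grad, lr: float = 0.01, gamma: float = 0.9):
    """One momentum-SGD update: v_t = gamma*v_{t-1} + lr*g_t; x_t = x_{t-1} - v_t."""
    v = gamma * v + lr * grad   # velocity: EWMA over ~1/(1-gamma) past gradients
    x = x - v                   # parameter update follows the velocity
    return x, v

x, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    grad = 2 * x                # gradient of f(x) = ||x||^2
    x, v = momentum_step(x, v, grad)
# x converges toward the minimizer [0, 0]
```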
Specifically, in the training process of the neural network, forward propagation and back propagation are performed alternately: forward propagation calculates the output result from the training data and weight parameters, and back propagation calculates the gradient of the loss function with respect to each parameter through the chain rule, after which the parameters are updated according to the gradient.
Forward propagation process formula:

$$z^{(l)} = \omega^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma(z^{(l)})$$

where $z^{(l)}$ denotes the input vector of layer $l$, $\omega^{(l)}$ denotes the weight matrix from layer $l-1$ to layer $l$, $a^{(l)}$ denotes the activation output vector of layer $l$, $b^{(l)}$ denotes the bias vector, $\sigma(\cdot)$ is the activation function, and $l = 2, 3, \ldots, L$ ($L$ denotes the last layer). The output of each layer becomes the input of the next layer, finally yielding a composition of functions.
The back propagation process formula:

$$\delta^{(l)} = \left( (\omega^{(l+1)})^{T} \delta^{(l+1)} \right) \odot \sigma'(z^{(l)})$$

where $\delta^{(l)}$ denotes the error vector of the loss function at layer $l$, and $\odot$ denotes the element-wise (Hadamard) product.
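A numpy sketch of these two formulas for a small fully connected network; the sigmoid activation is chosen here only so that $\sigma'$ is easy to write (an assumption — the decoder's own activation layers use ReLU):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass: z(l) = W(l) a(l-1) + b(l),  a(l) = sigma(z(l))."""
    a, zs, activations = x, [], [x]
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

def backward(delta_L, zs, weights):
    """Backward pass: delta(l) = (W(l+1)^T delta(l+1)) ⊙ sigma'(z(l))."""
    deltas = [delta_L]                                        # error at the output layer
    for l in range(len(weights) - 1, 0, -1):
        sp = sigmoid(zs[l - 1]) * (1 - sigmoid(zs[l - 1]))    # sigma'(z(l))
        deltas.insert(0, (weights[l].T @ deltas[0]) * sp)     # element-wise product
    return deltas
```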
Specifically, the speech decoder: the speech decoder takes the speech representation and the text representation as input and produces a speech output. Its main task is to convert low-dimensional speech and text representations into a high-dimensional speech signal for processing and generation in fields such as speech synthesis and voice conversion. In implementing the speech decoder, a series of signal processing algorithms and techniques (deconvolution neural networks and recurrent neural networks) can be adopted to achieve the best decoding effect. Through the decoder, the speech and text representations are converted into speech signals, enabling applications such as speech synthesis and voice conversion.
The neural network structure adopted by the speech decoder is similar to a CNN language model (CNNLM) and consists of an input layer, at least two convolution layers, at least two pooling layers, several activation layers, and a fully connected layer (a code sketch follows the layer descriptions below):
Input layer: the raw input data is given preliminary processing: the mean of the data is first removed to center the data near the origin, and normalization then brings the data features of every dimension to the same scale. Finally, a word-embedding algorithm maps the data into a set of vectors.
Convolution layer: the input data is convolved with a set of learnable convolution kernels, and a certain number of padding values are added around the edges of the input so that the output keeps the same size as the input during the convolution operation. When the stride of the convolution is one unit, the relationship between filter size and padding is P = (F − 1)/2, where P is the number of padding layers and F is the filter size. The convolution can be regarded as a filtering operation in which the values produced by each filter's matrix operations are combined into a new output matrix. In short, the convolution layer extracts various features of the input data.
Activation layer: the output of the convolution layer is activated through a ReLU function, enhancing the model's recognition capability.
Pooling layer: the pooling layer typically sits between convolution layers and serves to compress the data, reduce overfitting, and strengthen the characteristic information of the sound: even if a piece of music is partially distorted, one can still tell which piece it is.
Fully connected layer: these layers, typically at the end of the neural network, are connected by different weights. The output of the pooling layer is connected to one or more fully connected layers for feature fusion and classification.
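A hedged PyTorch sketch of the layer stack just listed: the 5-kernel and 10-kernel convolution layers follow the disclosure, while the input length, filter size, and output dimension are illustrative assumptions; padding follows P = (F − 1)/2 so that convolution preserves sequence length:

```python
import torch
import torch.nn as nn

class SpeechDecoderCNN(nn.Module):
    """Input -> conv(5 kernels) -> pool -> ReLU -> conv(10 kernels) -> pool -> ReLU -> FC."""
    def __init__(self, in_channels: int = 1, seq_len: int = 128, out_dim: int = 64):
        super().__init__()
        F_size = 3                               # filter size; P = (F-1)/2 keeps length
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 5, F_size, padding=(F_size - 1) // 2),
            nn.MaxPool1d(2),                     # pooling compresses the data
            nn.ReLU(),
            nn.Conv1d(5, 10, F_size, padding=(F_size - 1) // 2),
            nn.MaxPool1d(2),
            nn.ReLU(),
        )
        self.fc = nn.Linear(10 * (seq_len // 4), out_dim)  # feature fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, seq_len); mean removal/normalization assumed upstream
        h = self.features(x)
        return self.fc(h.flatten(1))

decoder = SpeechDecoderCNN()
reply_vec = decoder(torch.randn(1, 1, 128))      # -> (1, 64)
```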
Training algorithm and optimizing:
the algorithm used for training the model is BP (back propagation). The basic idea is to calculate the gradient of the loss function with respect to the network parameters by means of the chain rule. Forward propagation starts from the input layer and calculates the output value of each layer in turn according to the network structure until the final output of the network is obtained; during forward propagation, the input and output values of each layer must be saved for the subsequent gradient computation. Back propagation computes the gradient values of each layer in turn, starting from the output layer, until the gradients of the network parameters are obtained; the gradient of each layer is calculated according to the chain rule, and the network parameters are updated with an optimization algorithm such as gradient descent. In short, the body of the back propagation algorithm is to compute the gradients of the output layer and the hidden layers, and from them to compute and optimize the weight matrices.
The optimization algorithm used is the SGD algorithm, with a momentum term built from gradients used in place of the raw gradient:

$$\nu_t = \gamma \nu_{t-1} + \eta_t g_t, \qquad \chi_t = \chi_{t-1} - \nu_t$$

where the momentum hyperparameter $\gamma$ satisfies $0 \le \gamma < 1$. Compared with the conventional SGD algorithm, this optimization finds the optimal solution more reliably, and when the descent angle is small, momentum descends faster.
Training samples:
the audio data samples are typically represented in the form of a time-frequency plot, the frequency and intensity at each instant typically being converted into the form of a matrix of pixels. The training samples typically include audio data and corresponding labels or classifications.
In addition, the present disclosure may also implement the functions of speech recognition, speech synthesis, and speech translation:
and (3) voice recognition: in the voice recognition process, an input voice signal is converted into text output and then is input into a voice decoder, and the corresponding text output is obtained through the reasoning process of the voice decoder. In this process, factors such as noise, accent, speech speed and the like of the voice signal need to be considered, so as to improve accuracy and robustness of recognition.
And (3) speech synthesis: in the speech synthesis process, text is required to be input into a speech decoder, and corresponding speech output is obtained through the reasoning process of the speech decoder. In this process, factors such as smoothness and naturalness of the speech are required to be considered, so as to improve the quality and readability of the synthesized speech.
And (3) speech translation: in the speech translation process, an input speech signal needs to be converted into a text output in a target language. The specific implementation mode comprises the steps of inputting a voice signal into a voice decoder, obtaining corresponding source language text output through an reasoning process of a model, inputting the source language text into a translation model, and obtaining corresponding target language text output through the reasoning process of the model. In the process, the corresponding relation between the voice signal and the text needs to be considered, so that the accuracy and fluency of translation are improved.
In summary, the above description of the embodiments of the speech decoder relates to a number of aspects, including data preprocessing, model training, speech recognition, speech synthesis, and speech translation. Through the application of the technical means, more accurate and efficient voice interaction experience can be realized, and more convenient and intelligent voice service is provided for users.
One of the above embodiments of the present disclosure has the following beneficial effects: language rules and semantic information are learned automatically from a large-scale corpus by the speech decoder, which improves the accuracy and efficiency of speech generation. The speech decoder can generate output directly from the speech signal without manually designing and tuning multiple modules; because it learns language rules and semantic information automatically, the accuracy and efficiency of speech recognition and natural language understanding are improved, speech input is recognized more accurately, and different speech input styles and language environments are learned and adapted to automatically, realizing more accurate and efficient speech recognition. At the same time, various speech processing functions such as speech synthesis and speech translation can be realized, providing the user with a more convenient and intelligent voice interaction experience. In addition, the speech decoder remedies the shortcomings of the traditional model: speech is input directly into the speech decoder to obtain speech output, which shortens the pipeline and speeds up processing, while the directly input audio stream fully preserves the emotional information of human speech, so the speech decoder can accurately understand human expression and achieve smoother interaction between machine and human.
With further reference to fig. 3, as an implementation of the method shown in the above figures, the present disclosure provides some embodiments of a reply content generation apparatus, which correspond to those method embodiments shown in fig. 2, and which are particularly applicable in various electronic devices.
As shown in fig. 3, the reply content generation apparatus 300 of some embodiments includes: an acquisition unit 301, a determination unit 302, a first generation unit 303, a second generation unit 304, and a third generation unit 305. The acquisition unit 301 is configured to acquire user voice; the determination unit 302 is configured to determine at least one piece of reply text corresponding to the user voice and obtain a reply text sequence; the first generation unit 303 is configured to generate a voice vector from the user voice; the second generation unit 304 is configured to generate a text vector sequence according to pre-configured prompt information and the reply text sequence; and the third generation unit 305 is configured to generate reply content corresponding to the user voice according to the voice vector and the text vector sequence.
In an alternative implementation of some embodiments, the third generating unit 305 is further configured to: inputting the text vector sequence and the voice vector to a pre-trained voice decoder to obtain a reply vector corresponding to the user voice; generating reply voice and/or reply text according to the reply vector; and taking the reply voice and/or the reply text as reply content.
In an alternative implementation manner of some embodiments, the voice decoder is composed of an input layer, at least two convolution layers, at least two pooling layers, a plurality of activation layers and a full connection layer, wherein the convolution layers comprise a first convolution layer composed of five convolution kernels and a second convolution layer composed of ten convolution kernels, and the first convolution layer and the second convolution layer are sequentially connected with the pooling layers and the activation layers.
In an alternative implementation of some embodiments, the above-mentioned speech decoder is trained according to the following steps: obtaining a training sample set, wherein the training sample set comprises a sample voice vector and a sample text vector sequence, and sample reply vectors corresponding to the sample voice vector and the sample text vector sequence; inputting the sample voice vector and the sample text vector sequence into a model to be trained to obtain a reply vector; comparing the answer vector with the sample answer vector to obtain a comparison result; and responding to the fact that the comparison result meets the preset condition, and taking the model to be trained as a voice decoder.
In an alternative implementation of some embodiments, the apparatus further includes an optimizing unit configured to: and in response to determining that the comparison result does not meet the preset condition, adopting forward propagation and reverse propagation to alternately perform optimization on the model to be trained.
In an alternative implementation of some embodiments, the optimization unit is further configured to calculate the output result using forward propagation through the training data and weight parameters, and to calculate the gradient of the loss function with respect to each parameter using back propagation through the chain rule, use a momentum term in place of the raw gradient, and optimize the model to be trained according to the following formulas:

$$\nu_t = \gamma \nu_{t-1} + \eta_t g_t, \qquad \chi_t = \chi_{t-1} - \nu_t$$

where $\gamma$ denotes the momentum hyperparameter satisfying $0 \le \gamma < 1$, $\nu_t$ denotes the velocity at time step $t$, $\eta_t$ denotes the learning rate, and $t$ denotes the time step.
In an alternative implementation of some embodiments, the first generating unit 303 is further configured to input the user speech to a pre-trained speech encoder, where a residual vector quantizer is provided in the speech encoder, and the residual vector quantizer is used for compressing data.
It will be appreciated that the elements described in the apparatus 300 correspond to the various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting benefits described above with respect to the method are equally applicable to the apparatus 300 and the units contained therein, and are not described in detail herein.
One of the above embodiments of the present disclosure has the following beneficial effects: language rules and semantic information are learned automatically from a large-scale corpus by the speech decoder, which improves the accuracy and efficiency of speech generation. The speech decoder can generate output directly from the speech signal without manually designing and tuning multiple modules; because it learns language rules and semantic information automatically, the accuracy and efficiency of speech recognition and natural language understanding are improved, speech input is recognized more accurately, and different speech input styles and language environments are learned and adapted to automatically, realizing more accurate and efficient speech recognition. At the same time, various speech processing functions such as speech synthesis and speech translation can be realized, providing the user with a more convenient and intelligent voice interaction experience. In addition, the speech decoder remedies the shortcomings of the traditional model: speech is input directly into the speech decoder to obtain speech output, which shortens the pipeline and speeds up processing, while the directly input audio stream fully preserves the emotional information of human speech, so the speech decoder can accurately understand human expression and achieve smoother interaction between machine and human.
Referring now to fig. 4, a schematic diagram of an electronic device (e.g., server in fig. 1) 400 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 4 is merely an example and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 4, the electronic device 400 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 401, which may perform various suitable actions and processes according to a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage means 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the electronic device 400 are also stored. The processing device 401, the ROM 402, and the RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
In general, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 408 including, for example, magnetic tape, hard disk, etc.; and a communication device 409. The communication means 409 may allow the electronic device 400 to communicate with other devices wirelessly or by wire to exchange data. While fig. 4 shows an electronic device 400 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 4 may represent one device or a plurality of devices as needed.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via communications device 409, or from storage 408, or from ROM 402. The above-described functions defined in the methods of some embodiments of the present disclosure are performed when the computer program is executed by the processing device 401.
It should be noted that, in some embodiments of the present disclosure, the computer readable medium may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients, servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol ), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), the internet (e.g., the internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring user voice; determining at least one piece of reply text corresponding to the user voice to obtain a reply text sequence; generating a voice vector according to the user voice; generating a text vector sequence according to the pre-configured prompt information and the reply text sequence; and generating reply content corresponding to the user voice according to the voice vector and the text vector sequence.
Computer program code for carrying out operations of some embodiments of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The described units may also be provided in a processor, for example, described as: a processor includes an acquisition unit, a determination unit, a first generation unit, a second generation unit, and a third generation unit. The names of these units do not constitute a limitation on the unit itself in some cases, and the acquisition unit may also be described as "a unit that acquires user voice", for example.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
The foregoing description is only of preferred embodiments of the present disclosure and an explanation of the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above technical features, and also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure.

Claims (10)

1. A reply content generation method, comprising:
acquiring user voice;
determining at least one piece of reply text corresponding to the user voice to obtain a reply text sequence;
generating a voice vector according to the user voice;
generating a text vector sequence according to the pre-configured prompt information and the reply text sequence;
and generating reply content corresponding to the user voice according to the voice vector and the text vector sequence.
2. The method of claim 1, wherein the generating reply content corresponding to the user voice from the voice vector and the text vector sequence comprises:
inputting the text vector sequence and the voice vector to a pre-trained speech decoder to obtain a reply vector corresponding to the user voice;
generating reply voice and/or reply text according to the reply vector;
and taking the reply voice and/or the reply text as reply content.
3. The method of claim 2, wherein the speech decoder comprises an input layer, at least two convolution layers, at least two pooling layers, a plurality of activation layers, and a fully-connected layer, the convolution layers comprising a first convolution layer having five convolution kernels and a second convolution layer having ten convolution kernels, each of the first and second convolution layers being connected in sequence with a pooling layer and an activation layer.
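One possible PyTorch reading of this architecture is sketched below; the claim fixes only the kinds of layers and the kernel counts (five, then ten), so the 1-D layout, kernel sizes, pooling type, and output width here are assumptions:

import torch.nn as nn

class SpeechDecoderSketch(nn.Module):
    def __init__(self, in_channels=1, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 5, kernel_size=3, padding=1),   # first conv layer: five kernels
            nn.MaxPool1d(2),                                       # pooling layer
            nn.ReLU(),                                             # activation layer
            nn.Conv1d(5, 10, kernel_size=3, padding=1),            # second conv layer: ten kernels
            nn.MaxPool1d(2),                                       # pooling layer
            nn.ReLU(),                                             # activation layer
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(10, out_dim),                                # fully-connected layer -> reply vector
        )

    def forward(self, x):  # x: (batch, in_channels, time)
        return self.net(x)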
4. The method according to claim 2 or 3, wherein the speech decoder is trained according to the following steps:
obtaining a training sample set, wherein the training sample set comprises a sample voice vector and a sample text vector sequence, and sample reply vectors corresponding to the sample voice vector and the sample text vector sequence;
inputting the sample voice vector and the sample text vector sequence to a model to be trained to obtain a reply vector;
comparing the reply vector with the sample reply vector to obtain a comparison result;
and in response to the comparison result meeting a preset condition, taking the model to be trained as the speech decoder.
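As a hedged illustration only, the training loop of claims 4-6 might look as follows in Python/PyTorch; the mean-squared-error comparison, the threshold, the SGD hyperparameters, and the model's call signature are assumptions, since the claims fix none of them:

import torch
import torch.nn.functional as F

def train_speech_decoder(model, training_samples, threshold=1e-3):
    # momentum SGD realizes the update rule of claim 6
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    for voice_vec, text_vec_seq, sample_reply_vec in training_samples:
        reply_vec = model(voice_vec, text_vec_seq)        # forward propagation
        loss = F.mse_loss(reply_vec, sample_reply_vec)    # comparison result
        if loss.item() <= threshold:                      # preset condition met
            break                                         # model serves as the speech decoder
        optimizer.zero_grad()
        loss.backward()                                   # back propagation (claim 5)
        optimizer.step()
    return model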
5. The method of claim 4, wherein the method further comprises:
in response to determining that the comparison result does not meet the preset condition, optimizing the model to be trained by alternating forward propagation and back propagation.
6. The method of claim 5, wherein the optimizing the model to be trained by alternating forward propagation and back propagation comprises:
calculating an output result from the training data and the weight parameters by forward propagation;
calculating, by back propagation using the chain rule of derivatives, the gradient of the loss function with respect to each parameter, replacing the gradient with a momentum term that accumulates past gradients, and optimizing the model to be trained according to the following formulas:

v_t = γ·v_{t-1} + η_t·g_t

x_t = x_{t-1} − v_t

wherein γ denotes the momentum hyperparameter, satisfying 0 ≤ γ < 1; v_t denotes the velocity at time step t; η_t denotes the learning rate at time step t; g_t denotes the gradient of the loss function at time step t; and x_t denotes the model parameters at time step t.
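Read as standard momentum SGD, the update can be written out in a few lines of Python; the concrete values of γ and η_t below are illustrative only:

def momentum_step(x_prev, v_prev, g_t, gamma=0.9, eta_t=0.01):
    # velocity: v_t = γ·v_{t-1} + η_t·g_t  (accumulates past gradients)
    v_t = gamma * v_prev + eta_t * g_t
    # parameter update: x_t = x_{t-1} − v_t
    x_t = x_prev - v_t
    return x_t, v_t

A caller would thread the state through successive steps, e.g. x, v = momentum_step(x, v, grad), starting from v = 0.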
7. The method of claim 1, wherein the generating a voice vector according to the user voice comprises:
inputting the user voice into a pre-trained speech encoder to obtain a voice vector, wherein a residual vector quantizer arranged in the speech encoder is used for data compression.
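A residual vector quantizer compresses a vector in stages, each stage quantizing the residual the previous stage left behind. The NumPy sketch below shows one common design; the codebook shapes and the nearest-neighbour search are assumptions, as the claim states only that a residual vector quantizer is present for data compression:

import numpy as np

class ResidualVectorQuantizer:
    def __init__(self, codebooks):
        # codebooks: list of arrays, each of shape (num_codes, dim)
        self.codebooks = codebooks

    def quantize(self, vec):
        residual, codes = np.asarray(vec, dtype=float), []
        for codebook in self.codebooks:
            # pick the code word closest to the current residual
            idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
            codes.append(idx)                     # only small integer indices are stored
            residual = residual - codebook[idx]   # next stage quantizes what was missed
        return codes

    def dequantize(self, codes):
        # reconstruct by summing the selected code words of every stage
        return sum(cb[i] for cb, i in zip(self.codebooks, codes))

For example, four codebooks of 256 entries each would compress a 64-dimensional float vector down to four 8-bit indices.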
8. A reply content generation device comprising:
an acquisition unit configured to acquire user voice;
a determining unit configured to determine at least one piece of reply text corresponding to the user voice, and obtain a reply text sequence;
a first generation unit configured to generate a speech vector from the user speech;
a second generating unit configured to generate a text vector sequence according to the pre-configured prompt information and the reply text sequence;
And a third generating unit configured to generate reply content corresponding to the user voice according to the voice vector and the text vector sequence.
9. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer readable medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of any of claims 1-7.
CN202311261688.4A 2023-09-27 2023-09-27 Reply content generation method and device, electronic equipment and readable medium Pending CN117437915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311261688.4A CN117437915A (en) 2023-09-27 2023-09-27 Reply content generation method and device, electronic equipment and readable medium


Publications (1)

Publication Number Publication Date
CN117437915A true CN117437915A (en) 2024-01-23

Family

ID=89547066

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311261688.4A Pending CN117437915A (en) 2023-09-27 2023-09-27 Reply content generation method and device, electronic equipment and readable medium

Country Status (1)

Country Link
CN (1) CN117437915A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination