WO2019213965A1 - Method for processing a voice signal and mobile device

Method for processing a voice signal and mobile device

Info

Publication number
WO2019213965A1
WO2019213965A1 · PCT/CN2018/086596 · CN2018086596W
Authority
WO
WIPO (PCT)
Prior art keywords
speech
neural network
frames
frequency speech
mobile device
Prior art date
Application number
PCT/CN2018/086596
Other languages
English (en)
Chinese (zh)
Inventor
赵月娇
***
杨霖
尹朝阳
于雪松
张晶
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN201880092454.2A (CN112005300B)
Priority to PCT/CN2018/086596
Publication of WO2019213965A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation

Definitions

  • the present application relates to the field of signal processing technologies, and in particular, to a method and a mobile device for processing a voice signal.
  • voice is the most intuitive and concise communication method.
  • the bandwidth of natural speech is between 50 Hz and 8000 Hz.
  • the frequency band of speech is limited to between 300 Hz and 3400 Hz, and the speech signal between 300 Hz and 3400 Hz is called a narrowband voice signal.
  • the main energy of speech is contained in the low-frequency speech signal, and the lack of high-frequency signal causes the clarity and naturalness of the speech signal to be affected to some extent.
  • Some information representing the characteristics of the speaker is lost, such as the sound color;
  • the speech distortion is more serious, especially in noisy environments, the distortion is often not accepted by the user.
  • speech bandwidth extension methods mainly fall into two categories: methods based on network mapping and methods based on statistical models. With methods based on network mapping, the noise in the resulting wideband speech is large; with methods based on statistical models, the resulting wideband speech cannot retain the emotion of the original speech.
  • the present application provides a method for processing a voice signal and a mobile device; the obtained wideband speech contains little noise, retains the emotion of the original voice, and can reproduce the original voice well.
  • the first aspect provides a method for processing a voice signal, including:
  • the mobile device decodes the received encoded speech signal to obtain m sets of low frequency speech parameters; the m sets of low frequency speech parameters are low frequency speech parameters of m speech frames of the speech signal, and m is an integer greater than 1;
  • the mobile device obtains n high-frequency speech signals corresponding to the n unvoiced frames according to the low-frequency speech parameters of the n unvoiced frames and a mixed Gaussian model algorithm, and obtains k high-frequency speech signals corresponding to the k voiced frames according to the low-frequency speech parameters of the k voiced frames and a neural network algorithm; n and k are integers greater than 1, and the sum of n and k is equal to m;
  • the mobile device synthesizes a low frequency speech signal and a high frequency speech signal of each of the m speech frames to obtain a wideband speech signal.
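  • Purely as an illustration of the decode → classify → per-frame extension → synthesis flow of the first aspect, a minimal Python sketch follows. All helper names passed in (decode_low_freq_params, classify_frame, gmm_predict_high, nn_predict_high, construct_signal, synthesize) are hypothetical placeholders, not functions defined in this application.

```python
import numpy as np

def extend_bandwidth(encoded_speech,
                     decode_low_freq_params,   # hypothetical decoder producing m parameter sets
                     classify_frame,           # hypothetical SAE-based voiced/unvoiced classifier
                     gmm_predict_high,         # hypothetical mixed-Gaussian-model mapping
                     nn_predict_high,          # hypothetical neural-network mapping
                     construct_signal,         # hypothetical parameter-to-waveform synthesis
                     synthesize):              # hypothetical low+high band combiner
    """Sketch of the first-aspect method: one wideband frame per narrowband frame."""
    low_params = decode_low_freq_params(encoded_speech)    # m sets of low-frequency parameters
    wideband_frames = []
    for params in low_params:
        low_sig = construct_signal(params, band="low")      # reconstruct the low-frequency signal
        if classify_frame(params) == "voiced":
            high_params = nn_predict_high(params)           # neural network for voiced frames
        else:
            high_params = gmm_predict_high(params)          # mixed Gaussian model for unvoiced frames
        high_sig = construct_signal(high_params, band="high")
        wideband_frames.append(synthesize(low_sig, high_sig))
    return np.concatenate(wideband_frames)                  # the wideband speech signal
```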
  • the scheme is performed on the mobile device side without changing the original communication system; only the corresponding device or program needs to be provided on the mobile device side. Voiced frames and unvoiced frames are distinguished according to the voice parameters, with high accuracy. According to the properties of unvoiced and voiced frames, the mixed Gaussian model algorithm is used to obtain the high-frequency speech signal corresponding to the unvoiced frame, which reduces the probability of introducing noise.
  • the neural network algorithm is used to obtain the high frequency speech signal corresponding to the voiced frame, which preserves the sentiment of the original speech.
  • the original voice can be accurately reproduced, which enhances the user's hearing experience.
  • each set of low frequency speech parameters includes: a pitch period, a subband signal strength, a gain value, a line spectrum frequency, or at least two of these.
  • the mobile device determines the types of the m voice frames based on the m sets of low frequency voice parameters, including:
  • the mobile device uses the SAE algorithm to obtain m labels according to the m sets of low frequency speech parameters and the Stacked AutoEncoder (SAE) model; the m labels are used to indicate the types of the m speech frames corresponding to the m sets of low frequency speech parameters.
  • SAE Stacked AutoEncoder
  • the SAE model is obtained by the mobile device or another mobile device by training with the SAE algorithm based on a plurality of first training samples; each first training sample includes the low-frequency speech parameters corresponding to the low-frequency speech signal of one speech frame of another speech signal.
  • the mobile device obtains n high frequency speech signals corresponding to the n unvoiced frames according to the low frequency speech parameters and the mixed Gaussian model algorithm of the n unvoiced frames, including:
  • the mobile device obtains high frequency speech parameters of n unvoiced frames according to low frequency speech parameters and mixed Gaussian model algorithms of n unvoiced frames;
  • the mobile device constructs the n high frequency speech signals according to high frequency speech parameters of the n unvoiced frames.
  • the hybrid Gaussian model algorithm is used to predict the high-frequency speech signal of the unvoiced frame, and no noise is introduced, which improves the user's auditory feeling.
  • the mobile device obtains k high frequency speech signals corresponding to the k voiced frames according to the low frequency speech parameters and the neural network algorithm of the k voiced frames, including:
  • the mobile device uses a neural network algorithm to obtain high frequency speech parameters of k voiced frames according to low frequency speech parameters and neural network models of k voiced frames;
  • the mobile device constructs the k high frequency speech signals according to high frequency speech parameters of the k voiced frames;
  • the neural network model is obtained by the mobile device or another mobile device by training with the neural network algorithm based on a plurality of second training samples; one second training sample includes h sets of low frequency speech parameters of h voiced frames of another speech signal, where h is an integer greater than one.
  • the neural network algorithm is used to predict the high-frequency speech signal of the voiced frame with almost no noise, and the emotion of the original speech can be preserved.
  • the neural network algorithm is a long short-term memory (LSTM) neural network algorithm
  • the neural network model is an LSTM neural network model
  • the neural network algorithm is a bidirectional recurrent neural network (BRNN) algorithm
  • the neural network model is a BRNN model
  • the neural network algorithm is a recurrent neural network (RNN) algorithm
  • the neural network model is an RNN model
  • the BRNN algorithm can greatly improve the accuracy of the acquired high-frequency speech signal, so that the original speech can be accurately reproduced.
  • the second aspect provides a mobile device, including:
  • a decoding module configured to decode the received encoded speech signal to obtain m sets of low frequency speech parameters; the m sets of low frequency speech parameters are low frequency speech parameters of m speech frames of the speech signal, where m is an integer greater than 1;
  • a processing module configured to determine, according to the m sets of low frequency speech parameters, a type of the m speech frames, and reconstruct a low frequency speech signal corresponding to the m speech frames, where the type includes an unvoiced frame or a voiced frame;
  • an obtaining module configured to obtain n high frequency speech signals corresponding to the n unvoiced frames according to the low frequency speech parameters of n unvoiced frames and a mixed Gaussian model algorithm, and to obtain k high frequency speech signals corresponding to the k voiced frames according to the low frequency speech parameters of k voiced frames and a neural network algorithm; n and k are integers greater than 1, and the sum of n and k is equal to m;
  • the synthesizing module is configured to synthesize the low frequency speech signal and the high frequency speech signal of each of the m speech frames to obtain a wideband speech signal.
  • the relevant extension device or extension program can be provided on the voice processing device side without changing the original communication system; voiced frames and unvoiced frames are distinguished according to the voice parameters, with high accuracy. According to the properties of unvoiced and voiced frames, the mixed Gaussian model algorithm is used to obtain the high-frequency speech signal corresponding to the unvoiced frame, which reduces the probability of introducing noise.
  • the neural network algorithm is used to obtain the high-frequency speech signal corresponding to the voiced frame, which preserves the sentiment of the original speech. Accurate reproduction of the original voice enhances the user's listening experience.
  • each set of low frequency speech parameters includes: a pitch period, a subband signal strength, a gain value, a line spectrum frequency, or at least two of these.
  • the processing module is specifically configured to:
  • the SAE algorithm is used to obtain m labels, and m labels are used to indicate the types of m speech frames corresponding to the m group low frequency speech parameters;
  • the SAE model is obtained by the mobile device or another mobile device by training with the SAE algorithm based on a plurality of first training samples; each first training sample includes the low-frequency speech parameters corresponding to the low-frequency speech signal of one speech frame of another speech signal.
  • the obtaining module is specifically configured to:
  • the obtaining module is specifically configured to:
  • the neural network algorithm is used to obtain the high frequency speech parameters of k voiced frames.
  • the neural network model is obtained by the mobile device or another mobile device by training with the neural network algorithm based on a plurality of second training samples; one second training sample includes h sets of low frequency speech parameters of h voiced frames of another speech signal, where h is an integer greater than one.
  • the neural network algorithm is a long short-term memory (LSTM) neural network algorithm
  • the neural network model is an LSTM neural network model
  • the neural network algorithm is a bidirectional recurrent neural network (BRNN) algorithm
  • the neural network model is a BRNN model
  • the neural network algorithm is a recurrent neural network (RNN) algorithm
  • the neural network model is an RNN model
  • a third aspect provides a computer-readable storage medium having a computer program stored thereon; when the computer program is executed by a processor, the method described in the first aspect and any of the possible designs of the first aspect is performed.
  • a fourth aspect provides a mobile device, including a processor
  • the processor is configured to couple with a memory, read and execute instructions in the memory, and perform the method of the first aspect and any of the possible designs of the first aspect.
  • the mobile device further includes the memory.
  • the method for processing a voice signal in the present application is performed on the mobile device side without changing the original communication system; only the corresponding device or program needs to be provided on the mobile device side. Voiced frames and unvoiced frames are distinguished according to the voice parameters, and the accuracy of the distinction is high.
  • the mixed Gaussian model algorithm is used to obtain the high frequency speech signal corresponding to the unvoiced frame
  • the neural network algorithm is used to obtain the high frequency speech signal corresponding to the voiced frame, which reduces the probability of noise introduction.
  • the broadband speech retains the sentiment of the voice, and the original voice can be accurately reproduced, thereby improving the user's hearing experience.
  • FIG. 1 is a schematic structural diagram of an SAE according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an automatic encoder corresponding to an SAE according to an embodiment of the present application
  • FIG. 3 is a schematic diagram of an LSTM neural network algorithm according to an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an RNN provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an RNN algorithm according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a BRNN algorithm according to an embodiment of the present application.
  • Figure 7 is a system architecture diagram of an embodiment of the present application.
  • FIG. 8 is a flowchart of a method for processing a voice signal according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram 1 of a mobile device according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic structural diagram 2 of a mobile device according to an embodiment of the present disclosure.
  • the bandwidth of human natural speech is generally between 50 Hz and 8000 Hz.
  • the speech signal between 300 Hz and 3400 Hz is called a narrowband speech signal.
  • the speech signal can be divided into two types: unvoiced and voiced according to whether the vocal cord vibrates.
  • Voiced sounds also known as voiced languages, carry most of the energy in the language, and voiced sounds show significant periodicity in the time domain; while unvoiced sounds are similar to white noise, with no obvious periodicity.
  • the airflow passes through the glottis and causes the vocal cords to vibrate, producing a quasi-periodic signal.
  • FIG. 1 is a schematic structural diagram of an SAE according to an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of an SAE-compatible automatic encoder according to an embodiment of the present disclosure
  • the SAE includes an input layer, two hidden layers, and an output layer.
  • the number of neurons 11 in the input layer is equal to the dimension of the input vector plus one; the extra node is a bias node 12 whose input is fixed to 1. The output layer can be a softmax classifier layer.
  • the number of hidden-layer neurons 21 and the number of neurons in the output layer are set as needed. It can be understood that the two hidden layers are merely exemplary, and the number of hidden layers can be changed according to actual needs.
  • the SAE algorithm is as follows:
  • the n-dimensional vector X is an input vector, and the number of neurons in the input layer 100 is equal to n+1.
  • x n is the input of the nth neuron of the input layer; the connection weights between each neuron of the input layer (including the bias node) and each neuron of the first hidden layer 200 are initialized to form the weight matrix W 1 and the offset vector b 1 ; the output h 1 of the first hidden layer is then:
  • h 1 = (h 1 , h 2 , h 3 , ..., h m-1 , h m )
  • h m is the output of the mth neuron of the first hidden layer
  • W km is the connection weight between the kth neuron of the output layer and the mth neuron of the first hidden layer.
  • the above process is called the encoding process of the input vector X; the decoding of h 1 is then performed by an automatic encoder to obtain the reconstructed input vector.
  • b 2 is the offset vector.
  • W 1 and b 1 are then updated accordingly to obtain the updated W 1 and the updated b 1 , where η is the learning rate.
  • connection weights between the neurons included in the first layer hidden layer 200, the bias nodes, and the second layer hidden layer 300 are initialized to form a weight matrix W 3 , which can refer to W 1 .
  • the above process is called the encoding process of h 1 , and then the decoding of h 2 is performed by an automatic encoder to obtain the reconstructed h 1 .
  • b 4 is the offset vector.
  • each of the neurons included in the second layer hidden layer 300, the offset node, and the connection weight between the neurons included in the output layer 400 are initialized to form a weight matrix W 5 , and the initialization b 5 is an offset vector.
  • the above process is a complete unsupervised learning process for sample X.
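  • For illustration, a minimal numpy sketch of the unsupervised encode/decode/update step for one hidden layer is given below; the sigmoid activation, squared-error reconstruction loss, and plain gradient descent are assumptions chosen for simplicity and are not prescribed by the application.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(X, hidden_dim, lr=0.1, epochs=50, seed=0):
    """Greedy unsupervised pretraining of one autoencoder layer.

    X: (num_samples, input_dim) matrix of input vectors.
    Returns the encoder weights W1, b1 and the hidden representation H.
    """
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = rng.normal(0, 0.1, (n_in, hidden_dim))   # encoder weights (input -> hidden)
    b1 = np.zeros(hidden_dim)                     # encoder offset (the bias node's contribution)
    W2 = rng.normal(0, 0.1, (hidden_dim, n_in))   # decoder weights (hidden -> reconstruction)
    b2 = np.zeros(n_in)                           # decoder offset
    for _ in range(epochs):
        for x in X:
            h = sigmoid(x @ W1 + b1)              # encoding: hidden-layer output h1
            x_hat = sigmoid(h @ W2 + b2)          # decoding: reconstructed input vector
            # gradient of the squared reconstruction error, back-propagated one layer
            d_out = (x_hat - x) * x_hat * (1 - x_hat)
            d_hid = (d_out @ W2.T) * h * (1 - h)
            W2 -= lr * np.outer(h, d_out); b2 -= lr * d_out
            W1 -= lr * np.outer(x, d_hid); b1 -= lr * d_hid
    H = sigmoid(X @ W1 + b1)                      # representation fed to the next hidden layer
    return W1, b1, H
```

  • In a stacked setting, the returned H would be used as the input matrix for pretraining the second hidden layer, after which a supervised BP pass fine-tunes all layers, mirroring the procedure described in the following paragraphs.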
  • next, based on the pretrained weights up to W 5 , a supervised learning process is performed on sample X using a Back Propagation (BP) neural network:
  • BP Back Propagation
  • the output vector H 1 of the first hidden layer 200 is calculated from the neurons of the input layer 100 (including the bias node), their connection weights to the neurons of the first hidden layer 200, and the corresponding offset vector;
  • the output vector H 2 of the second hidden layer 300 is calculated from the neurons of the first hidden layer 200 (including the bias node), their connection weights to the neurons of the second hidden layer 300, and the corresponding offset vector;
  • W 5 is used as the connection weight matrix between the neurons of the second hidden layer 300 (including the bias node) and the neurons of the output layer 400, b 5 is the offset vector corresponding to the neurons of the output layer 400, and the output vector Y is calculated with the BP neural network algorithm.
  • at this point, sample X has completed a complete SAE-based learning process.
  • the finally updated weight matrices and offset vectors are used as the initial weight matrices and offset vectors for unsupervised learning of the next sample X 1 ; the next training sample X 1 then completes a complete SAE-based learning process in the same steps as sample X.
  • the finally updated values obtained from X 1 are in turn used as the initial weight matrices and offset vectors for unsupervised learning of the next sample X 2 ; the next training sample X 2 completes a complete SAE-based learning process in the same steps as sample X.
  • in other words, the matrix of connection weights between the neurons of the input layer 100 (including the bias node) and the neurons of the first hidden layer 200, the offset vector corresponding to the neurons of the first hidden layer 200, the matrix of connection weights between the neurons of the first hidden layer 200 (including the bias node) and the neurons of the second hidden layer 300, the offset vector corresponding to the neurons of the second hidden layer 300, and the connection weights and offset vector corresponding to the output layer 400 are all updated after each sample.
  • the connection weights and corresponding offset values between the neurons of each layer are updated, and the updated values are used as the initial weight matrices and initial offset vectors for unsupervised learning of the next sample.
  • the LSTM neural network consists of an input layer, at least one hidden layer, and one output layer. The difference is that there are no offset nodes in the input layer and hidden layer of the LSTM neural network.
  • the number of neurons in the input layer is equal to the dimension of the input vector, and the number of neurons in the hidden layer and the number of neurons in the output layer are set as needed.
  • the LSTM neural network algorithm differs from the SAE algorithm or the BP neural network algorithm in how it obtains the output of each neuron of the hidden layer and the output of each neuron of the output layer.
  • FIG. 3 is a schematic diagram of an LSTM neural network according to an embodiment of the present application.
  • X t-1 is the input of a neuron S at time t-1
  • h t-1 is the output of neuron S when the input is X t-1
  • C t-1 is the state of the neuron S corresponding to time t-1
  • X t is the input of the neuron S at time t
  • h t is the output of the neuron S when the input is X t
  • C t is the state of the neuron S corresponding to the time t
  • X t+ 1 is the input of the neuron S at time t+1
  • h t+1 is the output of the neuron S when the input is X t+1
  • C t+1 is the state of the neuron S corresponding to the time t+1.
  • neuron S has three inputs: C t-1 , X t , h t-1 , and the corresponding outputs are h t and C t .
  • X t is calculated according to the output of each neuron in the upper layer and the connection weight between the neurons and neuron S in the upper layer and the corresponding offset vector (refer to the above-mentioned BP neural network).
  • h t-1 can also be called the output of the neuron S at the previous moment
  • C t- 1 can also be referred to as the state of the neuron S at the previous moment.
  • f t is the forgetting gate
  • W f is the weighting matrix of the forgetting gate
  • b f is the bias term of the forgetting gate
  • σ is the sigmoid function
  • i t is the input gate
  • W i is the weight matrix of the input gate
  • b i is the bias term of the input gate
  • C t is the new state of the neuron corresponding to time t
  • O t is the output gate
  • W O is the weight matrix of the output gate
  • b O is the bias term of the output gate
  • h t is the final output corresponding to neuron S at time t.
  • the LSTM neural network algorithm combines the current memory with the long-term memory to form a new unit state C t . Owing to the control of the forgetting gate, the LSTM neural network can retain information from long ago; owing to the control of the input gate, it can prevent currently insignificant content from entering the memory; and the output gate controls the influence of the long-term memory on the current output.
  • the output of each neuron of the LSTM neural network can be calculated according to Equations 1 through 6 above.
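  • The following numpy sketch implements one time step of a standard LSTM cell along the lines of the gate roles just described; concatenating h t-1 with x t and the exact gate formulas follow the common LSTM formulation and are an assumption, since equations 1 through 6 of the application are not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    """One LSTM time step: inputs C_{t-1}, x_t, h_{t-1}; outputs h_t, C_t."""
    z = np.concatenate([h_prev, x_t])       # joint input to every gate
    f_t = sigmoid(Wf @ z + bf)              # forget gate: what of C_{t-1} to keep
    i_t = sigmoid(Wi @ z + bi)              # input gate: what of the new content to admit
    c_tilde = np.tanh(Wc @ z + bc)          # candidate (current) memory content
    c_t = f_t * c_prev + i_t * c_tilde      # new unit state C_t
    o_t = sigmoid(Wo @ z + bo)              # output gate: influence of memory on the output
    h_t = o_t * np.tanh(c_t)                # final output h_t
    return h_t, c_t

# toy usage with random weights: 20-dimensional input, 8 hidden units
rng = np.random.default_rng(0)
n_in, n_hid = 20, 8
shape = (n_hid, n_hid + n_in)
Wf, Wi, Wc, Wo = (rng.normal(0, 0.1, shape) for _ in range(4))
bf = bi = bc = bo = np.zeros(n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(3, n_in)):        # three consecutive frames
    h, c = lstm_step(x, h, c, Wf, bf, Wi, bi, Wc, bc, Wo, bo)
```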
  • the connection weights and offset values in the LSTM neural network algorithm are also updated by the back propagation algorithm and the gradient descent method.
  • each time a sample goes through the learning process of the LSTM neural network algorithm, the connection weights between the neurons of each layer, the corresponding offset values, the weight matrix of the forgetting gate, the weight matrix of the input gate, and the weight matrix of the output gate are updated once, and the updated values are used to learn the next sample.
  • Each sample contains multiple subsequences, one for each input of the input layer at the time of LSTM learning.
  • RNN Recurrent Neural Networks
  • BRNN Bidirectional Recurrent Neural Networks
  • FIG. 4 is a schematic structural diagram of an RNN provided by an embodiment of the present application
  • FIG. 5 is a schematic diagram of an RNN algorithm according to an embodiment of the present disclosure
  • FIG. 6 is a schematic diagram of a BRNN algorithm provided by an embodiment of the present application.
  • the neurons of the hidden layer in the RNN are no longer isolated but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment.
  • h t is the output of the hidden layer at time t
  • h t-1 is the output of the hidden layer at time t-1
  • x t is the input of the input layer at time t
  • Z t is the output of the output layer at time t
  • W xh is the weight matrix composed of the connection weights between the neurons of the input layer and the neurons of the hidden layer at time t
  • W hh is the weight matrix applied to the output h t-1 of the hidden layer at time t-1 when it serves as an input to the hidden layer at time t
  • W hz is the weight matrix composed of the connection weights between the neurons of the hidden layer and the neurons of the output layer at time t
  • b h is The offset vector corresponding to the hidden layer at time t
  • b z is the offset vector corresponding to the output layer at time t.
  • the input corresponding to one sample can be called a sequence, and in the RNN algorithm one sample corresponds to multiple subsequences, such as subsequence x t-1 , subsequence x t , and subsequence x t+1 . Since the output of the hidden layer at time t-1 is obtained from the input x t-1 of the input layer at time t-1, and x t and x t-1 correspond to different subsequences, there is a sequential relationship between the subsequences in the RNN algorithm: each subsequence is associated with the subsequence before it, and the network is spread out in time order.
  • each connection weight does not change, that is, all subsequences of a sequence share the connection weights: the connection weights used to obtain the output Z t-1 from the input x t-1 , the output Z t from the input x t , and the output Z t+1 from the input x t+1 are the same.
  • the RNN updates each connection weight and offset value in a learning process based on the back propagation through time algorithm, and the updated values are used for the learning process of the next sample.
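  • As a minimal illustration of the shared-weight recurrence described above (the hidden state fed back, with the same W xh , W hh , W hz reused for every subsequence), the tanh activation and linear output in the sketch below are assumptions:

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hz, b_h, b_z):
    """Unroll a single-hidden-layer RNN over the subsequences xs (one vector per time step)."""
    h = np.zeros(W_hh.shape[0])                    # hidden state before the first subsequence
    outputs = []
    for x_t in xs:                                 # the same weights are shared by every subsequence
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # hidden-layer output h_t at time t
        outputs.append(W_hz @ h + b_z)             # output-layer output Z_t at time t
    return outputs
```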
  • a deep recurrent neural network is a recurrent neural network with multiple hidden layers; its algorithm can refer to the above algorithm with one hidden layer and is not described here.
  • the improvement of the BRNN algorithm over the RNN algorithm is based on the assumption that the current output is related not only to the previous inputs but also to the subsequent inputs. It can be understood that the reverse layer and the forward layer shown in FIG. 6 do not refer to two hidden layers; rather, the same hidden layer needs to produce two output values, which is the difference between the BRNN algorithm and the RNN algorithm.
  • h t1 is the positive time direction output of the hidden layer at time t
  • h t2 is the negative time direction output of the hidden layer at time t
  • h t-1 is the output of the hidden layer at time t-1, and h t+1 is the output of the hidden layer at time t+1
  • x t is the input of the input layer at time t
  • a first weight matrix is applied to the output h t-1 of the hidden layer at time t-1 when it serves as an input to the hidden layer at time t, and is composed of the connection weights between the neurons of the hidden layer at time t-1 and the neurons of the hidden layer at time t;
  • a second weight matrix is applied to the output h t+1 of the hidden layer at time t+1 when it serves as an input to the hidden layer at time t, and is composed of the connection weights between the neurons of the hidden layer at time t+1 and the neurons of the hidden layer at time t.
  • the input corresponding to one sample can be called a sequence, and one sample corresponds to multiple subsequences, such as subsequence x t-1 , subsequence x t , and subsequence x t+1 . Since the output h t-1 of the hidden layer at time t-1 is obtained from the input x t-1 of the input layer at time t-1, and the output h t+1 of the hidden layer at time t+1 is obtained from the input x t+1 of the input layer at time t+1, and x t-1 , x t , x t+1 correspond to different subsequences, there is a sequential relationship between the subsequences in the BRNN algorithm: each subsequence is associated both with the subsequence before it and with the subsequence after it.
  • each connection weight does not change, that is, all subsequences of a sequence share the connection weights: the connection weights used to obtain the output y t-1 from the input x t-1 , the output y t from the input x t , and the output y t+1 from the input x t+1 are the same.
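  • A sketch of the bidirectional recurrence: the same hidden layer produces a positive-time-direction output h t1 (depending on past inputs) and a negative-time-direction output h t2 (depending on future inputs), and the output layer combines both. Combining the two by concatenation and the tanh/linear activations are assumptions made for this sketch.

```python
import numpy as np

def brnn_forward(xs, W_xh_f, W_hh_f, b_h_f, W_xh_b, W_hh_b, b_h_b, W_hy, b_y):
    """Bidirectional RNN over subsequences xs; returns one output y_t per time step."""
    T = len(xs)
    n_hid = W_hh_f.shape[0]
    h_fwd = [None] * T
    h_bwd = [None] * T
    h = np.zeros(n_hid)
    for t in range(T):                       # positive time direction: h_t1 depends on x_1..x_t
        h = np.tanh(W_xh_f @ xs[t] + W_hh_f @ h + b_h_f)
        h_fwd[t] = h
    h = np.zeros(n_hid)
    for t in reversed(range(T)):             # negative time direction: h_t2 depends on x_t..x_T
        h = np.tanh(W_xh_b @ xs[t] + W_hh_b @ h + b_h_b)
        h_bwd[t] = h
    # the output at time t uses both direction outputs of the same hidden layer
    return [W_hy @ np.concatenate([h_fwd[t], h_bwd[t]]) + b_y for t in range(T)]
```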
  • a deep bidirectional recurrent neural network is a bidirectional recurrent neural network with multiple hidden layers; its algorithm can refer to the above algorithm with one hidden layer and is not described here.
  • the mixed Gaussian model is a combination of probability density functions of multiple Gaussian distributions.
  • a mixed Gaussian model with L mixture components can be expressed as a weighted sum of the component densities.
  • G(x, μ l , V l ) denotes the l-th mixture component of the mixed Gaussian model, which is represented by a b-dimensional multivariate single Gaussian probability density function with a mean of μ l and a covariance of V l (a positive definite matrix).
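  • The mixture density and the multivariate Gaussian component can be written out as in the following sketch; the mixture weights c l (summing to 1) follow the usual GMM formulation and are an assumption, as the application's own equations are not reproduced here.

```python
import numpy as np

def gaussian_pdf(x, mu, V):
    """b-dimensional single Gaussian density with mean mu and positive definite covariance V."""
    b = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt(((2 * np.pi) ** b) * np.linalg.det(V))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(V, diff))

def gmm_pdf(x, weights, means, covs):
    """Mixed Gaussian model with L components: p(x) = sum_l c_l * G(x, mu_l, V_l)."""
    return sum(c * gaussian_pdf(x, mu, V) for c, mu, V in zip(weights, means, covs))
```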
  • Figure 7 is a system architecture diagram of an embodiment of the present application.
  • the system includes a mobile device 10 and a network device 20;
  • the network device is a device with a wireless transceiver function, or a chipset with the necessary software and hardware that can be disposed in such a device, including but not limited to: an evolved NodeB (eNB), a radio network controller (RNC), a NodeB (NB), a base station controller (BSC), a base transceiver station (BTS), a home base station (for example, a home evolved NodeB or home NodeB, HNB), a baseband unit (BBU), an access point (AP) in a wireless fidelity (WiFi) system, a wireless relay node, a wireless backhaul node, a transmission and reception point (TRP) or transmission point (TP); it can also be a gNB in a 5G system such as NR, or a transmission point (TRP or TP), one base station or a group of base stations (including multiple antenna panels) in a 5G system, or an antenna panel, or a network node constituting a gNB or a transmission point, such as a baseband unit (BBU) or a distributed unit (DU).
  • the gNB may include a centralized unit (CU) and a DU.
  • the gNB may also include a radio unit (RU).
  • the CU implements some functions of the gNB, and the DU implements some functions of the gNB.
  • the CU implements the functions of the radio resource control (RRC) and packet data convergence protocol (PDCP) layers, and the DU implements the functions of the radio link control (RLC), media access control (MAC), and physical (PHY) layers. Since the information of the RRC layer eventually becomes information of the PHY layer, or is transformed from information of the PHY layer, higher-layer signaling, such as RRC layer signaling or PDCP layer signaling, can also be used in this architecture.
  • the network device can be a CU node, or a DU node, or a device including a CU node and a DU node.
  • the CU may be classified as a network device in the access network (RAN), or the CU may be classified as a network device in the core network (CN), which is not limited herein.
  • a mobile device may also be called a user equipment (UE), an access terminal, a subscriber unit, a subscriber station, a mobile station, a remote station, a remote terminal, a user terminal, a terminal, a wireless communication device, a user agent, or a user apparatus.
  • the mobile device involved in the present application may be a mobile phone, a tablet, a computer with a wireless transceiver function, a virtual reality (VR) device, an augmented reality (AR) device, a wireless terminal in industrial control, a wireless terminal in self driving, a wireless terminal in remote medical, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, and the like.
  • the embodiment of the present application does not limit the application scenario.
  • the foregoing terminal device and a chip that can be disposed in the foregoing terminal device are collectively referred to as a terminal device.
  • network devices 20 can each communicate with a plurality of mobile devices, such as mobile device 10 shown in the figures.
  • Network device 20 can communicate with any number of mobile devices similar to mobile device 10.
  • FIG. 7 is merely a simplified schematic diagram for ease of understanding.
  • the communication system may also include other network devices or may also include other mobile devices, which are not shown in FIG.
  • FIG. 8 is a flowchart of a method for processing a voice signal according to an embodiment of the present disclosure. Referring to FIG. 8, the method in this embodiment includes:
  • Step S101: The mobile device decodes the received encoded speech signal to obtain m sets of low frequency speech parameters; the m sets of low frequency speech parameters are the low frequency speech parameters of m speech frames of the speech signal, and m is an integer greater than 1.
  • Step S102 The mobile device determines, according to the m sets of low frequency speech parameters, the types of the m speech frames, and reconstructs the low frequency speech signals corresponding to the m speech frames; wherein the type of the speech frames includes an unvoiced frame or a voiced frame;
  • Step S103: The mobile device obtains n high-frequency speech signals corresponding to the n unvoiced frames according to the low-frequency speech parameters of the n unvoiced frames and the mixed Gaussian model algorithm, and obtains k high-frequency speech signals corresponding to the k voiced frames according to the low-frequency speech parameters of the k voiced frames and the neural network algorithm; n and k are integers greater than 1, and the sum of n and k is equal to m;
  • Step S104 The mobile device synthesizes the low frequency speech signal and the high frequency speech signal of each of the m speech frames to obtain a wideband speech signal.
  • since the voice signal exhibits short-term stationarity, that is, the voice signal remains relatively stable and consistent within a short time interval, generally 5 ms to 50 ms, the analysis of the voice signal must be performed over such short time segments.
  • the "speech signal" referred to in this embodiment refers to a voice signal corresponding to a short time interval that can be analyzed.
  • the mobile device decodes the received encoded speech signal to obtain m sets of low frequency speech parameters; the m sets of low frequency speech parameters are low frequency speech parameters of m speech frames of the speech signal, and m is an integer greater than 1. It can be understood that each speech frame corresponds to a set of low frequency speech parameters.
  • the speech signal involved in step S101 may be referred to as a speech signal a in the following description.
  • the network device may perform parameter encoding on the m sets of low frequency speech parameters of the m speech frames of the speech signal a by using a parameter encoding method to obtain the encoded speech signal a.
  • the network device may use a mixed excitation linear prediction (MELP) algorithm to extract the low-frequency voice parameters of the voice signal a.
  • MELP Mixed Excitation Linear Prediction
  • the low frequency speech parameters obtained by the MELP algorithm include: a pitch period, a subband signal strength, a gain value, a line spectrum frequency, or at least two of these.
  • the low frequency speech parameters including the pitch period, the subband signal strength, the gain value, or the line spectrum frequency have the following meanings: the low frequency speech parameters include the pitch period and the subband signal strength; or, the pitch period and the gain value; or, the pitch Period and line spectrum frequency; or, subband signal strength and gain value; or, subband signal strength and line spectrum frequency; or, line spectrum frequency and gain value; or, pitch period and subband signal strength and gain values; Pitch period and subband signal strength and line spectrum frequency; or, gain value and subband signal strength and line spectrum frequency; or, pitch period and gain value and line spectrum frequency; or, pitch period and subband signal strength and gain value and line spectrum frequency.
  • the low frequency speech parameters in this embodiment include a pitch period and a subband signal strength and a gain value and a line spectrum frequency.
  • the low frequency speech parameters may include more than the above parameters, and may also include other parameters. Different parameter extraction algorithms are used, and the corresponding low frequency speech parameters have certain differences.
  • the speech signal a is sampled to obtain digital speech, and high-pass filtering is performed on the digital speech to remove the low-frequency energy and the 50 Hz power frequency interference that may exist in the digital speech, for example using a 4th-order Chebyshev high-pass filter; the high-pass filtered digital speech is used as the speech signal to be processed.
  • every N sampling points of the speech signal to be processed constitute one speech frame.
  • N may be 160
  • the frame shift is 80 sampling points
  • the speech signal to be processed is divided into m speech frames, and then the low-frequency speech parameters of the m speech frames are extracted.
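  • A sketch of the pre-processing just described (high-pass filtering, then framing with a 160-sample frame length and an 80-sample frame shift), assuming an 8 kHz sampling rate and a SciPy Chebyshev type-I design as a stand-in for the filter; the 60 Hz cutoff and 0.5 dB ripple are illustrative assumptions not specified above.

```python
import numpy as np
from scipy.signal import cheby1, lfilter

def preprocess(speech, fs=8000, frame_len=160, frame_shift=80):
    """High-pass filter the sampled speech and split it into overlapping frames."""
    # 4th-order Chebyshev type-I high-pass filter; cutoff and ripple are illustrative choices
    b, a = cheby1(N=4, rp=0.5, Wn=60, btype="highpass", fs=fs)
    filtered = lfilter(b, a, speech)
    frames = [filtered[start:start + frame_len]
              for start in range(0, len(filtered) - frame_len + 1, frame_shift)]
    return np.array(frames)     # shape: (m, frame_len), one row per speech frame
```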
  • the low frequency speech parameters of the speech frame are extracted: pitch period, subband signal strength, gain value, line spectrum frequency.
  • each voice frame includes a low-frequency voice signal and a high-frequency voice signal; due to the limitation of the transmission bandwidth, the range of the transmitted voice frequency band is limited. The extracted low-frequency voice parameters of a voice frame are the low-frequency voice parameters corresponding to the low-frequency voice signal in that voice frame; correspondingly, the high-frequency voice parameters appearing later in this embodiment are the high-frequency voice parameters corresponding to the high-frequency voice signal in the voice frame.
  • the low frequency speech signal is defined relative to the high frequency speech signal. It can be understood that if the frequency corresponding to the low frequency speech signal is 300 Hz to 3400 Hz, the frequency corresponding to the high frequency speech signal may be 3400 Hz to 8000 Hz.
  • the frequency range corresponding to the low-frequency speech signal in this embodiment may be a frequency range corresponding to the narrow-band speech signal in the prior art, that is, 300 Hz to 3400 Hz, and may also be other frequency ranges.
  • the acquisition of the pitch period includes the acquisition of the integer pitch period, the acquisition of the fractional pitch period, and the acquisition of the final pitch period.
  • Each speech frame corresponds to one pitch period.
  • the 0–4 kHz speech band (corresponding to the low-frequency speech signal) can first be divided into 5 fixed frequency bands (0–500 Hz, 500–1000 Hz, 1000–2000 Hz, 2000–3000 Hz, 3000–4000 Hz) using a sixth-order Butterworth bandpass filter bank. Such division is merely exemplary and need not be employed.
  • the sub-band sound intensity of the first sub-band (0–500 Hz) is the normalized autocorrelation value corresponding to the fractional pitch period of the speech frame, and the sound intensity of the remaining four sub-bands is the maximum value of the autocorrelation function. For an unstable speech frame, that is, a speech frame with a large pitch period variation, the autocorrelation function of the sub-band signal envelope is used: the envelope is obtained by full-wave rectification and smoothing, its normalized autocorrelation function value is calculated and reduced by 0.1, and the result is used as the sound intensity of the corresponding sub-band.
  • each speech frame corresponds to a plurality of sub-band sound intensities, such as five.
  • the pitch adaptive window length is used in the calculation.
  • the window length is determined as follows: when Vbp1 > 0.6 (that is, the speech frame is a voiced frame), the window length is the smallest multiple of the fractional pitch period that exceeds 120 sample points; if the window length exceeds 320 sampling points, it is divided by 2. When Vbp1 ≤ 0.6 (that is, the speech frame is an unvoiced frame), the window length is 120 sampling points.
  • the center of the first gain G 1 window is located at 90 samples before the last sample point of the current speech frame; the center of the second gain G 2 window is located at the last sample point of the current frame.
  • the gain value is the root mean square value of the windowed signal S n and the result is converted to a decibel form:
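  • A sketch of the gain computation: the root-mean-square of the windowed signal expressed in decibels. The small floor added before the logarithm is an assumption to avoid log(0) and is not specified above.

```python
import numpy as np

def frame_gain_db(windowed_signal, eps=1e-8):
    """Gain value: RMS of the windowed signal s_n, converted to decibels."""
    rms = np.sqrt(np.mean(np.square(windowed_signal)))
    return 20.0 * np.log10(rms + eps)
```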
  • the input speech signal is weighted with a Hamming window of 200 sample points long (25 ms), and then a 10th order linear prediction analysis is performed, the center of the window being located at the last sampling point of the current frame.
  • the MELP algorithm uses the Chebyshev polynomial to recursively convert to the line spectrum frequency, which reduces the computational complexity.
  • Each speech frame corresponds to a line spectrum frequency
  • the line spectrum frequency is a vector having a plurality of components, such as a vector having 12 components.
  • the network device uses the MELP algorithm to extract the low-frequency speech parameters of the m speech frames of the speech signal, and each speech frame correspondingly yields a set of low-frequency speech parameters; the set of low-frequency speech parameters may include a pitch period, multiple sub-band sound intensities, two gain values, and a line spectrum frequency vector.
  • the network device encodes the m sets of low frequency speech parameters of the m speech frames of the speech signal a to obtain the encoded speech signal a and sends it to the mobile device; the mobile device decodes the received encoded speech signal a to obtain m sets of low frequency speech parameters, and each set of low frequency speech parameters corresponds to the low frequency speech signal of one speech frame of the speech signal a.
  • the mobile device determines the types of the m speech frames based on the m sets of low frequency speech parameters, and reconstructs the low frequency speech signals corresponding to the m speech frames; wherein the type of the speech frames includes an unvoiced frame or a voiced frame;
  • after obtaining the m sets of low frequency speech parameters corresponding to the speech signal a, the mobile device reconstructs the low frequency speech signals corresponding to the m speech frames according to the m sets of low frequency speech parameters.
  • the mobile device reconstructs the low-frequency voice signal corresponding to the m voice frames according to the m-group low-frequency voice parameters, which is a mature technology in the prior art, and is not described in this embodiment.
  • the mobile device determines the type of m speech frames based on the m sets of low frequency speech parameters, that is, determines whether each speech frame is an unvoiced frame or a voiced frame.
  • the mobile device determines the types of the m voice frames based on the m sets of low frequency voice parameters, including:
  • the mobile device uses the SAE algorithm to obtain m labels according to the m-group low-frequency voice parameters and the stack automatic encoder SAE model, and the m labels are used to indicate the types of m voice frames corresponding to the m-group low-frequency voice parameters;
  • the SAE model is obtained by training with the SAE algorithm based on a plurality of first training samples; each first training sample includes the low-frequency speech parameters corresponding to the low-frequency speech signal of one speech frame of another speech signal, and the other speech signals are different from the speech signal a in this embodiment.
  • the SAE model may be trained by the mobile device in this embodiment using the SAE algorithm based on multiple first training samples, or it may be trained by another device based on multiple first training samples, in which case the mobile device of this embodiment directly acquires the trained SAE model from the other device.
  • the SAE algorithm is used to determine the type of the speech frame according to the low-frequency speech parameters of the speech frame. Compared with the method for determining the type of the speech frame in the prior art, the accuracy can be greatly improved.
  • since a set of low-frequency speech parameters consists of the pitch period, sub-band signal strengths, gain values, and line spectrum frequency, and includes 1 pitch period, 5 sub-band signal strengths, 2 gain values, and a line spectrum frequency vector of 12 components, the dimension of the input vector X is 20, that is, X has 20 components. The input vector X is used as the input of the SAE shown in FIG. 1; using the SAE algorithm described above, a label indicating the type of the voice frame is output, and the SAE algorithm uses the SAE model trained based on the plurality of first training samples.
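  • Purely as an illustration of how the 20-dimensional input vector would be assembled and the frame labeled, the sketch below assumes a hypothetical sae_model object exposing a predict_label method (trained as described in the following steps); the per-vector z-score normalization and the 0.5 decision threshold are assumptions.

```python
import numpy as np

def classify_speech_frame(params, sae_model, threshold=0.5):
    """Assemble the 20-dimensional input vector X and label the frame voiced/unvoiced."""
    x = np.concatenate([
        [params["pitch_period"]],        # 1 component
        params["subband_strengths"],     # 5 components
        params["gains"],                 # 2 components
        params["line_spectrum_freqs"],   # 12 components -> 20 in total
    ])
    x = (x - x.mean()) / (x.std() + 1e-8)          # normalization (illustrative choice)
    label = sae_model.predict_label(x)             # hypothetical SAE model interface
    return "voiced" if label >= threshold else "unvoiced"
```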
  • A1 obtaining a plurality of first training samples
  • all the first training samples are trained by using the SAE algorithm to obtain a SAE model.
  • each first training sample includes the low-frequency speech parameters corresponding to the low-frequency speech signal of one speech frame of another speech signal. It can be understood that the frequency range corresponding to the low-frequency speech signal here is the same as the frequency range corresponding to the low-frequency speech signal whose low-frequency speech parameters the network device extracts and encodes, and the low-frequency speech parameters here are of the same kind as the low-frequency speech parameters extracted by the network device or decoded by the mobile device, extracted in the same way.
  • the speech signal b is one of the other speech signals; for each of the l speech frames of the speech signal b, a set of low-frequency speech parameters corresponding to the low-frequency speech signal of that speech frame is extracted, and each set of low-frequency speech parameters constitutes one first training sample.
  • the number of first training samples is large enough, and other voice signals may include multiple voice signals, and the number of natural persons corresponding to the plurality of voice signals is as large as possible.
  • the normalized vector of the low-frequency speech parameters included in the first training sample 1 is used as the input vector of the SAE, and the label of the first training sample 1 is used as the expected output; the connection weights between the SAE neurons and the corresponding offset values are assigned initial values. Using the SAE algorithm described above, the actual output corresponding to the first training sample 1 is obtained, and according to the actual output and the expected output, the connection weights between the SAE neurons and the corresponding offset values are adjusted by the back propagation algorithm with the minimum mean square error criterion and the gradient descent method, yielding the updated connection weights and offset values between the neurons.
  • the normalized vector of the low-frequency speech parameters included in the first training sample 2 is used as the input vector of the SAE, and the label of the first training sample 2 is used as the expected output; the connection weights and offset values used initially are those obtained after the first training sample 1 was trained, that is, the updated connection weights and offset values between the neurons. Using the SAE algorithm described above, the actual output corresponding to the first training sample 2 is obtained, and according to the actual output and the expected output, the connection weights between the SAE neurons and the corresponding offset values are adjusted again by the back propagation algorithm with the minimum mean square error criterion and the gradient descent method, yielding the updated connection weights and offset values between the neurons.
  • the normalized vector of the low-frequency speech parameters included in the first training sample 3 is used as the input vector of the SAE, and the label of the first training sample 3 is used as the expected output; the connection weights and offset values used initially are those obtained after the first training sample 2 was trained, that is, the updated connection weights and offset values between the neurons. Using the SAE algorithm described above, the actual output corresponding to the first training sample 3 is obtained, and according to the actual output and the expected output, the connection weights between the SAE neurons and the corresponding offset values are adjusted again by the back propagation algorithm with the minimum mean square error criterion and the gradient descent method, yielding the updated connection weights and offset values between the neurons.
  • the above training process is repeatedly performed until the error function converges, that is, after the training accuracy meets the requirements, the training process is stopped, and each training sample is trained at least once.
  • the neural network corresponding to the last training and the connection weights and corresponding offset values between the neurons in each layer are the SAE model.
  • using the SAE model and the m sets of low-frequency speech parameters decoded by the mobile device, m labels can be obtained with the SAE algorithm; the m labels are used to indicate the types of the m speech frames corresponding to the m sets of low-frequency speech parameters. It can be understood that if, during training, a first training sample whose low-frequency speech parameters were extracted from the low-frequency speech signal of a voiced frame was labeled 1, then for the low-frequency speech parameters corresponding to voiced frames among the m sets decoded by the mobile device, the label obtained with the SAE algorithm based on the SAE model should be 1 or close to 1; similarly, if, during training, a first training sample whose low-frequency speech parameters were extracted from the low-frequency speech signal of an unvoiced frame was labeled 0, then for the low-frequency speech parameters corresponding to unvoiced frames among the m sets decoded by the mobile device, the label obtained with the SAE algorithm based on the SAE model should be 0 or close to 0.
  • the mobile device obtains n high-frequency speech signals corresponding to the n unvoiced frames according to the low-frequency speech parameters of the n unvoiced frames and the mixed Gaussian model algorithm, and obtains k high-frequency speech signals corresponding to the k voiced frames according to the low-frequency speech parameters of the k voiced frames and the neural network algorithm; n and k are integers greater than 1, and the sum of n and k is equal to m.
  • using a neural network algorithm to predict the high-frequency speech parameters corresponding to an unvoiced frame from its low-frequency speech parameters introduces artificial noise, which causes the user to hear an audible buzzing noise and affects the user's auditory experience.
  • therefore, the neural network algorithm is not used for the high-frequency speech signal corresponding to the unvoiced frame, and the mixed Gaussian model algorithm may be adopted instead.
  • the neural network algorithm is used to predict the high-frequency speech parameters corresponding to the voiced frames according to the low-frequency speech parameters corresponding to the voiced frames, and almost no artificial noise is introduced and the sentiment of the original speech can be preserved.
  • therefore, the neural network algorithm can be used to obtain the high-frequency speech signals corresponding to the voiced frames according to the low-frequency speech parameters of the voiced frames. This is the purpose of determining the type of the speech frame in step S102: according to the natures of unvoiced and voiced frames, different machine learning algorithms are adopted so that as little artificial noise as possible is introduced and the emotion of the original voice is retained, thereby achieving accurate reproduction of the original speech.
  • the mobile device obtains n high-frequency voice signals corresponding to the n unvoiced frames according to the low-frequency speech parameters of the n unvoiced frames and the mixed Gaussian model algorithm, including:
  • the mobile device obtains high frequency speech parameters of n unvoiced frames according to low frequency speech parameters and mixed Gaussian model algorithms of n unvoiced frames;
  • the mobile device constructs n high frequency speech signals corresponding to n unvoiced frames according to the high frequency speech parameters of the n unvoiced frames.
  • the mixed Gaussian model algorithm refers to the algorithm in the prior art, and details are not described herein again.
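  • One common prior-art way to realize such a mixed-Gaussian-model mapping is joint-GMM regression: a GMM is trained on concatenated [low-frequency, high-frequency] parameter vectors, and the high-frequency parameters are predicted as the posterior-weighted conditional means. This concrete formulation is offered only as an assumed sketch, not as the application's prescribed algorithm.

```python
import numpy as np

def gmm_regression(x, weights, means, covs, d_low):
    """Predict high-frequency parameters y from low-frequency parameters x with a joint GMM.

    means[l] and covs[l] describe the joint vector z = [x; y]; d_low = len(x).
    """
    L = len(weights)
    post = np.zeros(L)
    cond_means = []
    for l in range(L):
        mu_x, mu_y = means[l][:d_low], means[l][d_low:]
        Vxx = covs[l][:d_low, :d_low]
        Vyx = covs[l][d_low:, :d_low]
        diff = x - mu_x
        # marginal likelihood of x under component l (for the posterior weight)
        norm = 1.0 / np.sqrt(((2 * np.pi) ** d_low) * np.linalg.det(Vxx))
        post[l] = weights[l] * norm * np.exp(-0.5 * diff @ np.linalg.solve(Vxx, diff))
        # conditional mean of y given x under component l
        cond_means.append(mu_y + Vyx @ np.linalg.solve(Vxx, diff))
    post /= post.sum()
    return sum(p * m for p, m in zip(post, cond_means))   # expected high-frequency parameters
```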
  • the mobile device obtains k high-frequency voice signals corresponding to k voiced frames according to the low-frequency voice parameters of the k voiced frames and the neural network algorithm, including:
  • the mobile device uses the neural network algorithm to obtain the high-frequency speech parameters of the k voiced frames according to the low-frequency speech parameters of the k voiced frames and the neural network model;
  • the mobile device constructs the k high-frequency speech signals corresponding to the k voiced frames according to the high-frequency speech parameters of the k voiced frames;
  • the neural network model is obtained by this mobile device or another mobile device training with the neural network algorithm based on a plurality of second training samples, where one second training sample includes h sets of low-frequency speech parameters of h voiced frames of another speech signal, h is an integer greater than 1, and the other speech signal is different from the speech signal a in this embodiment.
  • h may be the number of all voiced frames included in the other speech signal, or less than the number of all voiced frames included in the other speech signals.
  • the values of h may be different for different speech signals.
  • the neural network algorithm herein may be an LSTM neural network algorithm, with the neural network model being an LSTM neural network model; or the neural network algorithm may be a BRNN algorithm, with the neural network model being a BRNN model; or the neural network algorithm may be an RNN algorithm, with the neural network model being an RNN model.
  • in the following description, the neural network algorithm is taken to be the BRNN algorithm and the neural network model to be the BRNN model as an example.
  • the following describes the specific process in which the mobile device obtains the k high-frequency speech signals corresponding to the k voiced frames according to the low-frequency speech parameters of the k voiced frames and the neural network model.
  • the mobile device normalizes the k decoded sets of low-frequency speech parameters corresponding to the k voiced frames to obtain respective vectors; the plurality of vectors obtained by normalizing the k sets of speech parameters may be referred to as a sequence, and the vector obtained by normalizing one set of low-frequency speech parameters may be referred to as a subsequence.
  • the subsequences are input into the bidirectional recurrent neural network in the chronological order of their corresponding speech frames, that is, each subsequence corresponds to the input at one moment.
  • for example, according to the chronological order of the voiced frames, there are subsequence 1, subsequence 2, and subsequence 3; subsequence 2 corresponds to X_t shown in FIG. 6, subsequence 1 corresponds to X_{t-1} shown in FIG. 6, and subsequence 3 corresponds to X_{t+1} shown in FIG. 6.
  • the vectors obtained by normalizing the k sets of speech parameters are used as the input of the bidirectional recurrent neural network, and the bidirectional recurrent neural network algorithm described above is used, based on the bidirectional recurrent neural network model, to obtain the output corresponding to each of the k sets of low-frequency speech parameters; each output indicates the high-frequency speech parameters of the corresponding voiced frame and can be converted into high-frequency speech parameters, that is, the k sets of high-frequency speech parameters of the k voiced frames.
  • for example, according to the time sequence of the voiced frames, there are subsequence 1, subsequence 2, and subsequence 3; if the output corresponding to subsequence 2 is y_t shown in FIG. 6, then the output corresponding to subsequence 1 is y_{t-1} shown in FIG. 6 and the output corresponding to subsequence 3 is y_{t+1} shown in FIG. 6.
  • each subsequence shares the same bidirectional recurrent neural network model, and the bidirectional recurrent neural network algorithm is used to obtain the corresponding outputs.
  • after the mobile device obtains the k sets of high-frequency speech parameters of the k voiced frames according to the BRNN model and the BRNN algorithm, the mobile device constructs the k high-frequency speech signals corresponding to the k voiced frames according to those k sets of high-frequency speech parameters.
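  • A minimal sketch of this per-frame mapping is given below; the framework (PyTorch), the plain bidirectional RNN cell, and the parameter dimensions are assumptions, since the application does not commit to a particular network structure. The normalized low-frequency parameter vectors of the k voiced frames form one input sequence, and each time step yields the (normalized) high-frequency parameters of the corresponding frame, which would then be de-normalized.

```python
import torch
import torch.nn as nn

class BRNNMapper(nn.Module):
    """Sketch: a bidirectional RNN that maps the sequence of normalized
    low-frequency parameter vectors of the k voiced frames to one
    high-frequency parameter vector per frame."""
    def __init__(self, low_dim=10, high_dim=10, hidden=64):
        super().__init__()
        self.brnn = nn.RNN(low_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, high_dim)   # concatenated forward/backward states

    def forward(self, x):            # x: (batch, k, low_dim), frames in time order
        h, _ = self.brnn(x)          # h: (batch, k, 2 * hidden)
        return self.out(h)           # (batch, k, high_dim): one y_t per voiced frame
```

  • For a (k, low_dim) tensor of normalized parameters, calling the model on the unsqueezed batch of size 1 returns one output vector per voiced frame, matching the y_{t-1}, y_t, y_{t+1} correspondence described above.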
  • the following describes how the bidirectional recurrent neural network (BRNN) model is obtained.
  • B2: obtain a label of each second training sample, where the label is the h sets of high-frequency speech parameters corresponding to the h sets of low-frequency speech parameters included in that second training sample; the h sets of low-frequency speech parameters included in a second training sample and the h sets of high-frequency speech parameters included in its label are speech parameters of the same speech signal;
  • the second training samples are then trained with the bidirectional recurrent neural network algorithm to obtain the bidirectional recurrent neural network model.
  • a second training sample includes h sets of low-frequency speech parameters corresponding to the low-frequency speech signals of h voiced frames of another speech signal. It should be understood that the frequency range of the low-frequency speech signals here is the same as the frequency range of the low-frequency speech signals corresponding to the low-frequency speech parameters encoded by the network device, and that the low-frequency speech parameters here are of the same kind as the low-frequency speech parameters extracted by the network device or decoded by the mobile device.
  • for example, the h1 sets of low-frequency speech parameters of the h1 voiced frames of speech signal 1 are extracted to obtain a second training sample 1; that is, the second training sample 1 includes a plurality of sets of low-frequency speech parameters, and each voiced frame corresponds to one set of low-frequency speech parameters.
  • similarly, the h2 sets of low-frequency speech parameters of the h2 voiced frames of speech signal 2 are extracted to obtain a second training sample 2.
  • h1 and h2 may be the same or different; speech signal 1 and speech signal 2 are both among the other speech signals.
  • for the second training sample 1, the h1 sets of high-frequency speech parameters are extracted from the high-frequency speech signals corresponding to the h1 voiced frames of speech signal 1; these h1 sets of high-frequency speech parameters are the label of the second training sample 1.
  • likewise, for the second training sample 2, the h2 sets of high-frequency speech parameters extracted from the high-frequency speech signals corresponding to the h2 voiced frames of speech signal 2 are the label of the second training sample 2.
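  • A small sketch of how such sample/label pairs could be assembled is shown below (NumPy; the per-frame parameter extraction itself is assumed to exist elsewhere, and the dimensions in the comments are illustrative).

```python
import numpy as np

def make_second_training_sample(voiced_low_params, voiced_high_params):
    """Stack per-voiced-frame low-frequency parameter vectors into one training
    sample and the matching high-frequency parameter vectors into its label."""
    sample = np.stack(voiced_low_params)    # shape (h_i, LOW_DIM)
    label = np.stack(voiced_high_params)    # shape (h_i, HIGH_DIM)
    return sample, label

# e.g. second training sample 1 built from the voiced frames of speech signal 1,
# second training sample 2 from those of speech signal 2, and so on.
```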
  • the plurality of vectors obtained by normalizing the h1 sets of low-frequency speech parameters of the second training sample 1 are used as the input of the bidirectional recurrent neural network; these vectors may be referred to as a sequence, and the normalized vector of each of the h1 sets of low-frequency speech parameters may be referred to as a subsequence. The subsequences are input into the bidirectional recurrent neural network in the chronological order of their corresponding speech frames, that is, each subsequence corresponds to the input at one moment.
  • for example, the second training sample 1 has subsequence 1, subsequence 2, and subsequence 3 according to the time sequence of the speech frames; if subsequence 2 corresponds to X_t shown in FIG. 6, then subsequence 1 corresponds to X_{t-1} shown in FIG. 6 and subsequence 3 corresponds to X_{t+1} shown in FIG. 6.
  • the connection weights and offset values involved in the bidirectional recurrent neural network are assigned initial values, and all subsequences share the same connection weights and offset values;
  • based on these connection weights and offset values, the bidirectional recurrent neural network algorithm is used to obtain the actual output of the second training sample 1; it can be understood that each subsequence corresponds to one output, and the outputs of all subsequences together constitute the actual output of the second training sample 1.
  • for example, the second training sample 1 has subsequence 1, subsequence 2, and subsequence 3 according to the time sequence of the speech frames; if the output corresponding to subsequence 2 is y_t shown in FIG. 6, then the output corresponding to subsequence 1 is y_{t-1} shown in FIG. 6 and the output corresponding to subsequence 3 is y_{t+1} shown in FIG. 6.
  • the initial connection weights and offset values are adjusted according to the result of processing the actual output against the label, and adjusted connection weights and offset values are obtained.
  • next, the normalized vectors of the h2 sets of low-frequency speech parameters of the second training sample 2 are used as the input of the bidirectional recurrent neural network;
  • the connection weights and offset values used in this training pass are the adjusted connection weights and offset values obtained after training with the second training sample 1;
  • these connection weights and offset values are then adjusted according to the processing result, and newly adjusted connection weights and offset values are obtained.
  • next, the normalized vectors of the h3 sets of low-frequency speech parameters of the second training sample 3 are used as the input of the bidirectional recurrent neural network;
  • the connection weights and offset values used in this training pass are the adjusted connection weights and offset values obtained after training with the second training sample 2;
  • these connection weights and offset values are then adjusted according to the processing result, and newly adjusted connection weights and offset values are obtained.
  • the above training process is repeated until a preset training precision is reached or a preset number of training iterations is reached, at which point training stops; each training sample is trained at least once.
  • the bidirectional recurrent neural network used in the last training pass, together with its connection weights and offset values, constitutes the BRNN model.
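  • A compact sketch of this sample-by-sample training loop follows, reusing the BRNNMapper class from the inference sketch above; the optimizer, learning rate, and stopping thresholds are assumptions standing in for the unspecified preset precision and preset number of iterations.

```python
import torch
import torch.nn as nn

def train_brnn(model, samples, max_epochs=100, target_loss=1e-3, lr=1e-3):
    """samples: list of (low_seq, high_seq) float tensors of shape (h_i, LOW_DIM)
    and (h_i, HIGH_DIM); all samples share the same connection weights/offsets.
    Assumes `model` is an instance of the BRNNMapper sketched earlier."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for epoch in range(max_epochs):              # preset number of training passes
        total = 0.0
        for low_seq, high_seq in samples:        # sample 1, sample 2, ... in turn
            pred = model(low_seq.unsqueeze(0))   # actual output, one y_t per frame
            loss = loss_fn(pred, high_seq.unsqueeze(0))   # compare with the label
            opt.zero_grad()
            loss.backward()
            opt.step()                           # adjust connection weights / offsets
            total += loss.item()
        if total / len(samples) < target_loss:   # preset training precision reached
            break
    return model
```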
  • using the bidirectional recurrent neural network algorithm to obtain the high-frequency speech parameters corresponding to the voiced frames has the following beneficial effect:
  • the output y_t is related not only to the input x_{t-1} at time t-1 (h_{t-1} is obtained from x_{t-1}) but also to the input x_{t+1} at time t+1 (h_{t+1} is obtained from x_{t+1}).
  • in the embodiment of the present application, x_t corresponds to a set of low-frequency speech parameters of a voiced frame a, and the output y_t corresponds to the set of high-frequency speech parameters of the voiced frame a; x_{t-1} corresponds to a set of low-frequency speech parameters of the preceding voiced frame b of the voiced frame a, and x_{t+1} corresponds to a set of low-frequency speech parameters of the following voiced frame c of the voiced frame a. That is, the high-frequency speech parameters of the voiced frame a are predicted while taking into account the information of the voiced frames before and after it.
  • in this way, the accuracy of predicting the high-frequency speech parameters can be improved, and thus the accuracy of predicting the high-frequency speech signal from the low-frequency speech signal can be improved.
  • in other words, using the bidirectional recurrent neural network algorithm to obtain the high-frequency speech parameters corresponding to the voiced frames improves the accuracy of predicting, from the low-frequency speech signals of the voiced frames, the high-frequency speech signals of the corresponding frames.
  • at this point, the mobile device has obtained the m high-frequency speech signals and the m low-frequency speech signals of the m speech frames of the speech signal a.
  • for step S104, the mobile device synthesizes the low-frequency speech signal and the high-frequency speech signal of each of the m speech frames to obtain a wideband speech signal.
  • after the low-frequency speech signal and the high-frequency speech signal of each of the m speech frames are synthesized, a complete wideband speech signal is obtained.
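  • As a simple illustration of the per-frame synthesis, the sketch below assumes that the reconstructed low-frequency and high-frequency signals of each frame are already time-domain arrays at the same wideband sampling rate and are sample-aligned; how the two bands are brought to a common rate is not shown and is not specified here.

```python
import numpy as np

def synthesize_wideband(low_frames, high_frames):
    """Add the low- and high-frequency signals of each frame and concatenate the
    m frames into one wideband speech signal."""
    assert len(low_frames) == len(high_frames)
    return np.concatenate([lo + hi for lo, hi in zip(low_frames, high_frames)])
```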
  • the processing method of the speech signal in this embodiment is performed on the mobile device side without changing the original communication system; only the relevant extension device or extension program needs to be set on the mobile device side. Voiced frames and unvoiced frames are distinguished according to the speech parameters;
  • the mixed Gaussian model algorithm is used to obtain the high-frequency speech signals corresponding to the unvoiced frames, which reduces the probability of introducing noise, and the neural network algorithm is used to obtain the high-frequency speech signals corresponding to the voiced frames, which preserves the sentiment of the original speech, so that the original speech can be accurately reproduced and the user's listening experience is improved.
  • the foregoing describes the solutions provided by the embodiments of the present application mainly in terms of the functions implemented by the mobile device. It can be understood that, to implement the above functions, the device includes corresponding hardware structures and/or software modules for performing the various functions.
  • the embodiments of the present application can be implemented in the form of hardware, or a combination of hardware and computer software, in combination with the example elements and algorithm steps described in the embodiments disclosed in this application. Whether a function is implemented by hardware or by computer software driving hardware depends on the specific application and the design constraints of the solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of the technical solutions of the embodiments of the present application.
  • the embodiment of the present application may perform the division of the function module on the mobile device according to the foregoing method example.
  • each function module may be divided according to each function, or two or more functions may be integrated into one processing unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software function module.
  • the division of modules in the embodiments of the present application is schematic and is only a logical function division. In actual implementation, there may be another division manner; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • FIG. 9 is a schematic structural diagram of a mobile device according to an embodiment of the present disclosure. Referring to FIG. 9, the mobile device of this embodiment includes: a decoding module 31, a processing module 32, an obtaining module 33, and a synthesizing module 34.
  • the decoding module 31 is configured to decode the received encoded speech signal to obtain m sets of low-frequency speech parameters; the m sets of low-frequency speech parameters are the low-frequency speech parameters of m speech frames of the speech signal, where m is an integer greater than 1;
  • the processing module 32 is configured to determine a type of the m voice frames based on the m sets of low frequency voice parameters, and reconstruct a low frequency voice signal corresponding to the m voice frames, where the type includes an unvoiced frame or a voiced frame;
  • the obtaining module 33 is configured to obtain, according to the low-frequency speech parameters of the n unvoiced frames and the mixed Gaussian model algorithm, n high-frequency speech signals corresponding to the n unvoiced frames, and obtain, according to the low-frequency speech parameters of the k voiced frames and the neural network algorithm, k high-frequency speech signals corresponding to the k voiced frames; n and k are integers greater than 1, and the sum of n and k is equal to m;
  • the synthesizing module 34 is configured to synthesize the low-frequency speech signal and the high-frequency speech signal of each of the m speech frames to obtain a wideband speech signal.
  • each set of low frequency speech parameters includes: a pitch period; or, a subband signal strength; or a gain value; or a line spectrum frequency.
  • the mobile device of this embodiment may be used to implement the technical solution of the foregoing method embodiment, and the implementation principle and the technical effect are similar, and details are not described herein again.
  • the processing module 32 is specifically configured to:
  • use the SAE algorithm, based on the SAE model and the m sets of low-frequency speech parameters, to obtain m labels, where the m labels are used to indicate the types of the m speech frames corresponding to the m sets of low-frequency speech parameters;
  • the SAE model is obtained by this mobile device or another mobile device training with the SAE algorithm based on a plurality of first training samples, and each first training sample includes the low-frequency speech parameters corresponding to the low-frequency speech signal of one speech frame of another speech signal.
  • the obtaining module 33 is specifically configured to:
  • use the neural network algorithm, based on the neural network model and the low-frequency speech parameters of the k voiced frames, to obtain the high-frequency speech parameters of the k voiced frames;
  • the neural network model is obtained by this mobile device or another mobile device training with the neural network algorithm based on a plurality of second training samples, and one second training sample includes the low-frequency speech parameters of h voiced frames of another speech signal, where h is an integer greater than 1.
  • the neural network algorithm may be a long short-term memory (LSTM) neural network algorithm and the neural network model an LSTM neural network model; or the neural network algorithm may be a bidirectional recurrent neural network (BRNN) algorithm and the neural network model a BRNN model; or the neural network algorithm may be a recurrent neural network (RNN) algorithm and the neural network model an RNN model.
  • the mobile device of this embodiment may be used to implement the technical solution of the foregoing method embodiment, and the implementation principle and the technical effect are similar, and details are not described herein again.
  • FIG. 10 is a schematic structural diagram of a mobile device according to an embodiment of the present disclosure, including a processor 41, a memory 42, and a communication bus 43.
  • the processor 41 is configured to read and execute instructions in the memory 42 to implement the foregoing method embodiment.
  • alternatively, the processor 41 is configured to read and invoke, through the memory 42, instructions stored in another memory to implement the methods in the foregoing method embodiments.
  • the mobile device shown in FIG. 10 may be a device, or may be a chip or a chipset.
  • the device or the chip in the device has the function of implementing the method in the foregoing method embodiment.
  • the functions may be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more units corresponding to the functions described above.
  • the processor mentioned above may be a central processing unit (CPU), a microprocessor, or an application-specific integrated circuit (ASIC), or may be one or more integrated circuits for controlling the execution of programs of the solutions in the above aspects or any of their possible designs.
  • the present application also provides a computer storage medium comprising instructions that, when executed on a mobile device, cause the mobile device to perform a corresponding method in the above method embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

Provided are a speech signal processing method and a mobile device. The method comprises: decoding a received encoded speech signal to obtain m sets of low-frequency speech parameters, the m sets of low-frequency speech parameters being the low-frequency speech parameters of m speech frames of the speech signal; determining the types of the m speech frames according to the m sets of low-frequency speech parameters, and reconstructing the low-frequency speech signals corresponding to the m speech frames; obtaining n high-frequency speech signals corresponding to n unvoiced frames according to the low-frequency speech parameters of the n unvoiced frames and a mixed Gaussian model algorithm, and obtaining k high-frequency speech signals corresponding to k voiced frames according to the low-frequency speech parameters of the k voiced frames and a neural network algorithm, the sum of n and k being equal to m; and synthesizing the low-frequency speech signal and the high-frequency speech signal of each speech frame to obtain a wideband speech signal. The probability of introducing noise is reduced, the emotions in the original speech are preserved, and the original speech is accurately reproduced.
PCT/CN2018/086596 2018-05-11 2018-05-11 Procédé de traitement de signal vocal et dispositif mobile WO2019213965A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880092454.2A CN112005300B (zh) 2018-05-11 2018-05-11 语音信号的处理方法和移动设备
PCT/CN2018/086596 WO2019213965A1 (fr) 2018-05-11 2018-05-11 Procédé de traitement de signal vocal et dispositif mobile

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/086596 WO2019213965A1 (fr) 2018-05-11 2018-05-11 Procédé de traitement de signal vocal et dispositif mobile

Publications (1)

Publication Number Publication Date
WO2019213965A1 true WO2019213965A1 (fr) 2019-11-14

Family

ID=68466641

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/086596 WO2019213965A1 (fr) 2018-05-11 2018-05-11 Procédé de traitement de signal vocal et dispositif mobile

Country Status (2)

Country Link
CN (1) CN112005300B (fr)
WO (1) WO2019213965A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111415674A (zh) * 2020-05-07 2020-07-14 北京声智科技有限公司 语音降噪方法及电子设备
CN111710327A (zh) * 2020-06-12 2020-09-25 百度在线网络技术(北京)有限公司 用于模型训练和声音数据处理的方法、装置、设备和介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992167A (zh) * 2021-02-08 2021-06-18 歌尔科技有限公司 音频信号的处理方法、装置及电子设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996640A (zh) * 2009-08-31 2011-03-30 华为技术有限公司 频带扩展方法及装置
CN103026408A (zh) * 2010-07-19 2013-04-03 华为技术有限公司 音频信号产生装置
US20130151255A1 (en) * 2011-12-07 2013-06-13 Gwangju Institute Of Science And Technology Method and device for extending bandwidth of speech signal
CN104517610A (zh) * 2013-09-26 2015-04-15 华为技术有限公司 频带扩展的方法及装置
CN104637489A (zh) * 2015-01-21 2015-05-20 华为技术有限公司 声音信号处理的方法和装置
US20170194013A1 (en) * 2016-01-06 2017-07-06 JVC Kenwood Corporation Band expander, reception device, band expanding method for expanding signal band

Also Published As

Publication number Publication date
CN112005300B (zh) 2024-04-09
CN112005300A (zh) 2020-11-27

Similar Documents

Publication Publication Date Title
US20220172708A1 (en) Speech separation model training method and apparatus, storage medium and computer device
CN107358966B (zh) 基于深度学习语音增强的无参考语音质量客观评估方法
WO2020042707A1 (fr) Procédé de réduction de bruit en temps réel à canal unique basé sur un réseau neuronal récurrent convolutif
JP7258182B2 (ja) 音声処理方法、装置、電子機器及びコンピュータプログラム
WO2020177371A1 (fr) Procédé et système de réduction de bruit de réseau neuronal adaptatif d'environnement pour aides auditives numériques et support de stockage
CN110867181B (zh) 基于scnn和tcnn联合估计的多目标语音增强方法
CN108447495B (zh) 一种基于综合特征集的深度学习语音增强方法
CN1750124B (zh) 带限音频信号的带宽扩展
US20130024191A1 (en) Audio communication device, method for outputting an audio signal, and communication system
EP1995723B1 (fr) Système d'entraînement d'une neuroevolution
CN106782497B (zh) 一种基于便携式智能终端的智能语音降噪算法
CN110085245B (zh) 一种基于声学特征转换的语音清晰度增强方法
WO2019213965A1 (fr) Procédé de traitement de signal vocal et dispositif mobile
JP2022547525A (ja) 音声信号を生成するためのシステム及び方法
WO2015154397A1 (fr) Procédé de traitement et de génération de signal de bruit, codeur/décodeur, et système de codage/décodage
CN114338623B (zh) 音频的处理方法、装置、设备及介质
CN114267372A (zh) 语音降噪方法、***、电子设备和存储介质
US6701291B2 (en) Automatic speech recognition with psychoacoustically-based feature extraction, using easily-tunable single-shape filters along logarithmic-frequency axis
CN110970044B (zh) 一种面向语音识别的语音增强方法
CN114203154A (zh) 语音风格迁移模型的训练、语音风格迁移方法及装置
WO2022213825A1 (fr) Procédé et appareil d'amélioration de la parole de bout en bout basés sur un réseau neuronal
JP2006521576A (ja) 基本周波数情報を分析する方法、ならびに、この分析方法を実装した音声変換方法及びシステム
WO2024055752A1 (fr) Procédé d'apprentissage de modèle de synthèse vocale, procédé de synthèse vocale et appareils associés
CN109215635B (zh) 用于语音清晰度增强的宽带语音频谱倾斜度特征参数重建方法
CN114708876B (zh) 音频处理方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18917872

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18917872

Country of ref document: EP

Kind code of ref document: A1