WO2019213965A1 - Speech signal processing method and mobile device - Google Patents

Speech signal processing method and mobile device

Info

Publication number
WO2019213965A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
neural network
frames
frequency speech
mobile device
Prior art date
Application number
PCT/CN2018/086596
Other languages
English (en)
French (fr)
Inventor
赵月娇
***
杨霖
尹朝阳
于雪松
张晶
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2018/086596 priority Critical patent/WO2019213965A1/zh
Priority to CN201880092454.2A priority patent/CN112005300B/zh
Publication of WO2019213965A1 publication Critical patent/WO2019213965A1/zh

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation

Definitions

  • the present application relates to the field of signal processing technologies, and in particular, to a method and a mobile device for processing a voice signal.
  • voice is the most intuitive and concise communication method.
  • the bandwidth of natural speech is between 50 Hz and 8000 Hz.
  • the frequency band of transmitted speech is limited to between 300 Hz and 3400 Hz, and a speech signal between 300 Hz and 3400 Hz is called a narrowband speech signal.
  • the main energy of speech is contained in the low-frequency speech signal, and the lack of high-frequency signal causes the clarity and naturalness of the speech signal to be affected to some extent.
  • some information representing the characteristics of the speaker, such as timbre, is lost;
  • the speech distortion is more serious, especially in noisy environments, and such distortion is often unacceptable to the user.
  • existing speech bandwidth extension methods mainly fall into two categories: network-mapping-based methods and statistical-model-based methods. With network-mapping-based methods, the resulting wideband speech contains significant noise; with statistical-model-based methods, the resulting wideband speech cannot retain the emotion of the original speech.
  • the present application provides a method for processing a speech signal and a mobile device; the obtained wideband speech has little noise and retains the emotion of the original speech, so the original speech can be reproduced well.
  • the first aspect provides a method for processing a voice signal, including:
  • the mobile device decodes the received encoded speech signal to obtain m sets of low-frequency speech parameters; the m sets of low-frequency speech parameters are the low-frequency speech parameters of m speech frames of the speech signal, and m is an integer greater than 1;
  • the mobile device determines the types of the m speech frames based on the m sets of low-frequency speech parameters and reconstructs the low-frequency speech signals corresponding to the m speech frames, where the type includes an unvoiced frame or a voiced frame;
  • the mobile device obtains n high-frequency speech signals corresponding to the n unvoiced frames according to the low-frequency speech parameters of the n unvoiced frames and a mixed Gaussian model algorithm, and obtains k high-frequency speech signals corresponding to the k voiced frames according to the low-frequency speech parameters of the k voiced frames and a neural network algorithm, n and k being integers greater than 1 and the sum of n and k being equal to m;
  • the mobile device synthesizes the low-frequency speech signal and the high-frequency speech signal of each of the m speech frames to obtain a wideband speech signal.
  • the scheme is performed on the mobile device side and does not change the original communication system; only a corresponding device or program needs to be set on the mobile device side. Voiced frames and unvoiced frames are distinguished according to the speech parameters with high accuracy. According to the properties of unvoiced and voiced frames, the mixed Gaussian model algorithm is used to obtain the high-frequency speech signals corresponding to unvoiced frames, which reduces the probability of introducing noise, and the neural network algorithm is used to obtain the high-frequency speech signals corresponding to voiced frames, which preserves the emotion of the original speech. The original speech can thus be accurately reproduced, enhancing the user's listening experience.
  • each set of low-frequency speech parameters includes: a pitch period; or a sub-band signal strength; or a gain value; or a line spectrum frequency; or at least two of the pitch period, sub-band signal strength, gain value, and line spectrum frequency.
  • the mobile device determines the types of the m voice frames based on the m sets of low frequency voice parameters, including:
  • the mobile device uses the stacked autoencoder (SAE) algorithm to obtain m labels according to the m sets of low-frequency speech parameters and an SAE model, and the m labels are used to indicate the types of the m speech frames corresponding to the m sets of low-frequency speech parameters.
  • the SAE model is obtained by the mobile device or another device by training with the SAE algorithm based on a plurality of first training samples, and each first training sample includes the low-frequency speech parameters corresponding to the low-frequency speech signal of one speech frame of another speech signal.
  • the mobile device obtains n high frequency speech signals corresponding to the n unvoiced frames according to the low frequency speech parameters and the mixed Gaussian model algorithm of the n unvoiced frames, including:
  • the mobile device obtains high frequency speech parameters of n unvoiced frames according to low frequency speech parameters and mixed Gaussian model algorithms of n unvoiced frames;
  • the mobile device constructs the n high frequency speech signals according to high frequency speech parameters of the n unvoiced frames.
  • the mixed Gaussian model algorithm is used to predict the high-frequency speech signal of the unvoiced frame, and no noise is introduced, which improves the user's listening experience.
  • the mobile device obtains k high frequency speech signals corresponding to the k voiced frames according to the low frequency speech parameters and the neural network algorithm of the k voiced frames, including:
  • the mobile device uses a neural network algorithm to obtain high frequency speech parameters of k voiced frames according to low frequency speech parameters and neural network models of k voiced frames;
  • the mobile device constructs the k high frequency speech signals according to high frequency speech parameters of the k voiced frames;
  • the neural network model is obtained by the mobile device or another device by training with the neural network algorithm based on a plurality of second training samples, and one second training sample includes the h sets of low-frequency speech parameters of h voiced frames of another speech signal, where h is an integer greater than 1.
  • the neural network algorithm is used to predict the high-frequency speech signal of the voiced frame with almost no noise, and the emotion of the original speech can be preserved.
  • the neural network algorithm is a long short-term memory (LSTM) neural network algorithm
  • the neural network model is an LSTM neural network model
  • the neural network algorithm is a bidirectional recurrent neural network (BRNN) algorithm
  • the neural network model is a BRNN model
  • the neural network algorithm is a recurrent neural network (RNN) algorithm
  • the neural network model is an RNN model
  • the BRNN algorithm can greatly improve the accuracy of the acquired high-frequency speech signal, so that the original speech can be accurately reproduced.
  • the second aspect provides a mobile device, including:
  • a decoding module, configured to decode the received encoded speech signal to obtain m sets of low-frequency speech parameters; the m sets of low-frequency speech parameters are the low-frequency speech parameters of m speech frames of the speech signal, where m is an integer greater than 1;
  • a processing module configured to determine, according to the m sets of low frequency speech parameters, a type of the m speech frames, and reconstruct a low frequency speech signal corresponding to the m speech frames, where the type includes an unvoiced frame or a voiced frame;
  • An obtaining module configured to obtain n high frequency speech signals corresponding to the n unvoiced frames according to low frequency speech parameters and mixed Gaussian model algorithms of n unvoiced frames, and according to low frequency speech parameters and neural network algorithms of k voiced frames Obtaining k high frequency speech signals corresponding to the k voiced frames, n and k being integers greater than 1, and sum of n and k is equal to m;
  • the synthesizing module is configured to synthesize the low frequency speech signal and the high frequency speech signal of each of the m speech frames to obtain a wideband speech signal.
  • the relevant extension device or extension program can be set on the speech-processing device side without changing the original communication system; voiced frames and unvoiced frames are distinguished according to the speech parameters with high accuracy. According to the properties of unvoiced and voiced frames, the mixed Gaussian model algorithm is used to obtain the high-frequency speech signals corresponding to unvoiced frames, which reduces the probability of introducing noise, and the neural network algorithm is used to obtain the high-frequency speech signals corresponding to voiced frames, which preserves the emotion of the original speech. Accurate reproduction of the original speech enhances the user's listening experience.
  • each set of low-frequency speech parameters includes: a pitch period; or a sub-band signal strength; or a gain value; or a line spectrum frequency; or at least two of the pitch period, sub-band signal strength, gain value, and line spectrum frequency.
  • the processing module is specifically configured to:
  • the SAE algorithm is used to obtain m labels according to the m sets of low-frequency speech parameters and the SAE model, and the m labels are used to indicate the types of the m speech frames corresponding to the m sets of low-frequency speech parameters;
  • the SAE model is obtained by the mobile device or another device by training with the SAE algorithm based on a plurality of first training samples, and each first training sample includes the low-frequency speech parameters corresponding to the low-frequency speech signal of one speech frame of another speech signal.
  • the obtaining module is specifically configured to: obtain the high-frequency speech parameters of the n unvoiced frames according to the low-frequency speech parameters of the n unvoiced frames and the mixed Gaussian model algorithm, and construct the n high-frequency speech signals according to the high-frequency speech parameters of the n unvoiced frames.
  • the obtaining module is further specifically configured to: use the neural network algorithm to obtain the high-frequency speech parameters of the k voiced frames according to the low-frequency speech parameters of the k voiced frames and the neural network model, and construct the k high-frequency speech signals according to the high-frequency speech parameters of the k voiced frames.
  • the neural network model is obtained by the mobile device or another device by training with the neural network algorithm based on a plurality of second training samples, and one second training sample includes the h sets of low-frequency speech parameters of h voiced frames of another speech signal, where h is an integer greater than 1.
  • the neural network algorithm is a long short-term memory (LSTM) neural network algorithm
  • the neural network model is an LSTM neural network model
  • the neural network algorithm is a bidirectional recurrent neural network (BRNN) algorithm
  • the neural network model is a BRNN model
  • the neural network algorithm is a recurrent neural network (RNN) algorithm
  • the neural network model is an RNN model
  • a third aspect provides a computer-readable storage medium having a computer program stored thereon; when the computer program is executed by a processor, the method of the first aspect and any possible design of the first aspect is performed.
  • a fourth aspect provides a mobile device, including a processor
  • the processor is configured to couple with a memory, read and execute instructions in the memory, and perform the method of the first aspect and any of the possible designs of the first aspect.
  • the mobile device further includes the memory.
  • the method for processing a speech signal in the present application is performed on the mobile device side without changing the original communication system; only a corresponding device or program needs to be set on the mobile device side. Voiced frames and unvoiced frames are distinguished according to the speech parameters with high accuracy. The mixed Gaussian model algorithm is used to obtain the high-frequency speech signals corresponding to unvoiced frames, and the neural network algorithm is used to obtain the high-frequency speech signals corresponding to voiced frames, which reduces the probability of introducing noise. The resulting wideband speech retains the emotion of the original speech, and the original speech can be accurately reproduced, thereby improving the user's listening experience.
  • FIG. 1 is a schematic structural diagram of an SAE according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an automatic encoder corresponding to an SAE according to an embodiment of the present application
  • FIG. 3 is a schematic diagram of an LSTM neural network algorithm according to an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an RNN provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an RNN algorithm according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a BRNN algorithm according to an embodiment of the present application.
  • Figure 7 is a system architecture diagram of an embodiment of the present application.
  • FIG. 8 is a flowchart of a method for processing a voice signal according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram 1 of a mobile device according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic structural diagram 2 of a mobile device according to an embodiment of the present disclosure.
  • the bandwidth of human natural speech is generally between 50 Hz and 8000 Hz.
  • the speech signal between 300 Hz and 3400 Hz is called a narrowband speech signal.
  • the speech signal can be divided into two types: unvoiced and voiced according to whether the vocal cord vibrates.
  • voiced sounds, also known as voiced speech, carry most of the energy in speech and show significant periodicity in the time domain, while unvoiced sounds are similar to white noise and have no obvious periodicity.
  • when voiced sounds are produced, the airflow passes through the glottis and causes the vocal cords to vibrate in an oscillating manner, producing a quasi-periodic excitation.
  • FIG. 1 is a schematic structural diagram of an SAE according to an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of an SAE-compatible automatic encoder according to an embodiment of the present disclosure
  • the SAE includes an input layer, two hidden layers, and an output layer. The number of neurons 11 in the input layer is equal to the dimension of the input vector plus one; the extra node is a bias node 12 whose input is fixed to 1. The output layer can be a softmax classifier layer. The number of hidden-layer neurons 21 and the number of neurons in the output layer are set as needed. It can be understood that the two hidden layers are merely exemplary, and the number of hidden layers can be changed as required.
  • the SAE algorithm is as follows:
  • the n-dimensional vector X is the input vector, and the number of neurons in the input layer 100 is equal to n+1; x n is the input of the nth neuron of the input layer. The connection weights between each neuron of the input layer, the bias node, and each neuron of the first hidden layer 200 are initialized to form the weight matrix W 1 and the offset vector b 1 ; the output h 1 of the first hidden layer is then:
  • h 1 = (h 1 , h 2 , h 3 , ..., h m-1 , h m ), where h m is the output of the mth neuron of the first hidden layer
  • W km is the connection weight between the kth neuron of the output layer and the mth neuron of the first hidden layer.
  • the above process is called the encoding process of the input vector X; h 1 is then decoded by an automatic encoder to obtain the reconstructed input vector.
  • b 2 is the offset vector.
  • W 1 and b 1 are then updated by gradient descent on the reconstruction error, with W 1 replaced by W 1 - η·∂E/∂W 1 and b 1 replaced by b 1 - η·∂E/∂b 1 , where η is the learning rate.
  • the connection weights between the neurons of the first hidden layer 200, the bias node, and the neurons of the second hidden layer 300 are initialized to form a weight matrix W 3 , whose form can refer to W 1 .
  • the above process is called the encoding process of h 1 ; h 2 is then decoded by an automatic encoder to obtain the reconstructed h 1 .
  • b 4 is the offset vector.
  • the connection weights between each neuron of the second hidden layer 300, the bias node, and each neuron of the output layer 400 are initialized to form a weight matrix W 5 , and b 5 is initialized as the corresponding offset vector.
  • the above process is a complete unsupervised learning process for sample X.
  • next, based on the weights obtained above, a supervised learning process is performed on sample X using a back propagation (BP) neural network:
  • the output vector H 1 of the first hidden layer 200 is calculated using the connection weights between each neuron of the input layer 100 (including the bias node) and each neuron of the first hidden layer 200 and the corresponding offset vector;
  • the output vector H 2 of the second hidden layer 300 is calculated using the connection weights between each neuron of the first hidden layer 200 (including the bias node) and each neuron of the second hidden layer 300 and the corresponding offset vector;
  • W 5 is used as the connection weight matrix between each neuron of the second hidden layer 300 (including the bias node) and each neuron of the output layer 400, b 5 is the offset vector corresponding to the neurons of the output layer 400, and the output vector Y is then calculated by the BP neural network algorithm.
  • sample X has then completed one full SAE-based learning process.
  • the updated weight matrices and offset vectors are used as the initial values for the unsupervised learning of the next sample X 1 ; the next training sample X 1 then completes a full SAE-based learning process following the same steps as sample X, yielding further updated values.
  • these in turn serve as the initial values for the unsupervised learning of the next sample X 2 , which completes a full SAE-based learning process in the same steps as sample X.
  • that is, after each sample is learned, the connection-weight matrix between the neurons and bias node of the input layer 100 and the neurons of the first hidden layer 200, the offset vector corresponding to the neurons of the first hidden layer 200, the connection-weight matrix between the neurons and bias node of the first hidden layer 200 and the neurons of the second hidden layer 300, the offset vector corresponding to the neurons of the second hidden layer 300, and the offset vector corresponding to the neurons of the output layer 400 are all updated, and the updated quantities are used as the initial weight matrices and initial offset vectors for the unsupervised learning of the next sample.
  • in short, the connection weights and corresponding offset values between the neurons of each layer are updated, and the updated values are used for the unsupervised learning of the next sample.
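  • the per-sample flow described above (layer-wise unsupervised pre-training followed by supervised BP adjustment) can be sketched in Python/NumPy as follows; sigmoid activations, the squared-error criterion, and the layer sizes are illustrative assumptions, and the supervised step shown updates only the output layer, whereas the text back-propagates through all layers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class AutoEncoderLayer:
    """One layer of the stack: encode x -> h (W1, b1), decode h -> x_hat (W2, b2)."""
    def __init__(self, n_in, n_hidden, lr=0.01, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W = rng.normal(0, 0.1, (n_hidden, n_in))    # encoder weights (W1)
        self.b = np.zeros(n_hidden)                       # encoder offsets (b1)
        self.Wd = rng.normal(0, 0.1, (n_in, n_hidden))    # decoder weights (W2)
        self.bd = np.zeros(n_in)                          # decoder offsets (b2)
        self.lr = lr

    def encode(self, x):
        return sigmoid(self.W @ x + self.b)

    def pretrain_step(self, x):
        """Unsupervised step: reconstruct x and adjust weights by gradient descent."""
        h = self.encode(x)
        x_hat = sigmoid(self.Wd @ h + self.bd)
        err = x_hat - x                                   # reconstruction error
        d_out = err * x_hat * (1 - x_hat)                 # gradient at the decoder output
        d_hid = (self.Wd.T @ d_out) * h * (1 - h)         # gradient at the hidden layer
        self.Wd -= self.lr * np.outer(d_out, h)
        self.bd -= self.lr * d_out
        self.W  -= self.lr * np.outer(d_hid, x)
        self.b  -= self.lr * d_hid

def train_sae(samples, labels, sizes=(20, 40, 40), n_out=1, lr=0.01):
    """Per-sample flow: unsupervised pre-training of each hidden layer, then a supervised pass."""
    layers = [AutoEncoderLayer(sizes[i], sizes[i + 1], lr) for i in range(len(sizes) - 1)]
    rng = np.random.default_rng(0)
    Wo, bo = rng.normal(0, 0.1, (n_out, sizes[-1])), np.zeros(n_out)   # output layer (W5, b5)
    for x, y in zip(samples, labels):
        inp = x
        for layer in layers:                   # 1) unsupervised learning on this sample
            layer.pretrain_step(inp)
            inp = layer.encode(inp)
        h = x
        for layer in layers:                   # 2) supervised forward pass
            h = layer.encode(h)
        y_hat = sigmoid(Wo @ h + bo)
        d = (y_hat - y) * y_hat * (1 - y_hat)  # error against the expected output (label)
        Wo -= lr * np.outer(d, h)              # output-layer update (full BP omitted for brevity)
        bo -= lr * d
    return layers, Wo, bo
```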
  • the LSTM neural network consists of an input layer, at least one hidden layer, and one output layer. The difference is that there are no offset nodes in the input layer and hidden layer of the LSTM neural network.
  • the number of neurons in the input layer is equal to the dimension of the input vector, and the number of neurons in the hidden layer and the number of neurons in the output layer are set as needed.
  • the LSTM neural network algorithm differs from the SAE algorithm or the BP neural network algorithm in the way it obtains the output of each hidden-layer neuron and each output-layer neuron.
  • FIG. 3 is a schematic diagram of an LSTM neural network according to an embodiment of the present application.
  • X t-1 is the input of a neuron S at time t-1
  • h t-1 is the output of neuron S when the input is X t-1
  • C t-1 is the state of the neuron S at time t-1
  • X t is the input of the neuron S at time t
  • h t is the output of the neuron S when the input is X t
  • C t is the state of the neuron S corresponding to the time t
  • X t+ 1 is the input of the neuron S at time t+1
  • h t+1 is the output of the neuron S when the input is X t+1
  • C t+1 is the state of the neuron S corresponding to the time t+1.
  • neuron S has three inputs: C t-1 , X t , and h t-1 ; the corresponding outputs are h t and C t .
  • X t is calculated according to the output of each neuron in the upper layer and the connection weight between the neurons and neuron S in the upper layer and the corresponding offset vector (refer to the above-mentioned BP neural network).
  • h t-1 can also be called the output of the neuron S at the previous moment
  • C t- 1 can also be referred to as the state of the neuron S at the previous moment.
  • f t is the forgetting gate
  • W f is the weighting matrix of the forgetting gate
  • b f is the bias term of the forgetting gate
  • σ is the sigmoid function
  • i t is the input gate
  • W i is the weight matrix of the input gate
  • b i is the bias term of the input gate
  • C t is the new state of the neuron corresponding to time t
  • O t is the output gate
  • W O is the weight matrix of the output gate
  • b O is the bias term of the output gate
  • h t is the final output corresponding to neuron S at time t.
  • the LSTM neural network algorithm combines the current memory with the long-term memory to form a new unit state C t . Due to the control of the forget gate, the LSTM neural network can retain information from long ago; due to the control of the input gate, irrelevant current content is prevented from entering the memory; and the output gate controls the influence of the long-term memory on the current output.
  • the output of each neuron of the LSTM neural network can be calculated according to Equations 1 through 6 above.
  • the connection weights and offset values in the LSTM neural network algorithm are also updated by the error back-propagation algorithm and the gradient descent method.
  • each time a sample goes through the learning process of the LSTM neural network algorithm, the connection weights between the neurons of each layer, the corresponding offset values, the weight matrix of the forget gate, the weight matrix of the input gate, and the weight matrix of the output gate are updated once, and the updated values are used to learn the next sample.
  • each sample contains multiple subsequences, and each subsequence serves as the input of the input layer at one time step during LSTM learning.
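  • a minimal NumPy sketch of one LSTM neuron step implementing the forget gate, input gate, cell state, and output gate described by Equations 1 to 6; the weight shapes, the initialization, and the choice of acting on the concatenation of h t-1 and X t are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: returns (h_t, C_t) from the previous output/state and the current input."""
    z = np.concatenate([h_prev, x_t])                        # gates act on [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])         # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])         # input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])     # current memory (candidate state)
    c_t = f_t * c_prev + i_t * c_tilde                       # new unit state C_t
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])         # output gate
    h_t = o_t * np.tanh(c_t)                                 # final output h_t
    return h_t, c_t

# toy usage with illustrative sizes
n_in, n_hid = 20, 32
rng = np.random.default_rng(0)
params = {k: rng.normal(0, 0.1, (n_hid, n_hid + n_in)) for k in ("W_f", "W_i", "W_c", "W_o")}
params.update({k: np.zeros(n_hid) for k in ("b_f", "b_i", "b_c", "b_o")})
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):                         # a sequence of 5 input vectors
    h, c = lstm_step(x, h, c, params)
```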
  • RNN: Recurrent Neural Network
  • BRNN: Bidirectional Recurrent Neural Network
  • FIG. 4 is a schematic structural diagram of an RNN provided by an embodiment of the present application
  • FIG. 5 is a schematic diagram of an RNN algorithm according to an embodiment of the present disclosure
  • FIG. 6 is a schematic diagram of a BRNN algorithm provided by an embodiment of the present application.
  • the neurons between the hidden layers in the RNN are no longer isolated, but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. .
  • h t is the output of the hidden layer at time t
  • h t-1 is the output of the hidden layer at time t-1
  • x t is the input of the input layer at time t
  • Z t is the output of the output layer at time t
  • W xh is the weight matrix composed of the connection weights between the neurons of the input layer and the neurons of the hidden layer at time t
  • W hh is the weight matrix applied to the output h t-1 of the hidden layer at time t-1 when h t-1 is used as an input to the hidden layer at time t
  • W hz is the weight matrix composed of the connection weights between the neurons of the hidden layer and the neurons of the output layer at time t
  • b h is the offset vector corresponding to the hidden layer at time t
  • b z is the offset vector corresponding to the output layer at time t.
  • the input corresponding to one sample can be called a sequence, and in the RNN algorithm one sample corresponds to multiple subsequences, such as subsequence x t-1 , subsequence x t , and subsequence x t+1 . Since the output of the hidden layer at time t-1 is obtained from the input x t-1 of the input layer at time t-1, and x t and x t-1 correspond to different subsequences, there is a sequential relationship between subsequences in the RNN algorithm: each subsequence is associated with the subsequence before it, and the sequence is unrolled in time through the neural network.
  • each connection weight does not change across time steps; that is, all subsequences of a sequence share the same connection weights: the connection weights used to obtain the output Z t-1 from the input x t-1 , the output Z t from the input x t , and the output Z t+1 from the input x t+1 are the same.
  • after the learning process for one sample, the RNN updates each connection weight and offset value based on the back-propagation-through-time algorithm, and the updated values are used for the learning process of the next sample.
  • a deep recurrent neural network is a recurrent neural network with multiple hidden layers; its algorithm can refer to the above algorithm with one hidden layer and is not described here.
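  • the recurrent computation just described can be sketched as follows, assuming h t = tanh(W xh ·x t + W hh ·h t-1 + b h ) and Z t = W hz ·h t + b z ; the tanh activation is an illustrative assumption:

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hz, b_h, b_z):
    """Run a single-hidden-layer RNN over the subsequences xs (in time order).
    All time steps share the same connection weights."""
    h = np.zeros(W_hh.shape[0])                  # hidden output before the first time step
    outputs = []
    for x_t in xs:
        # the hidden layer sees both the current input and its own output at the previous moment
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        outputs.append(W_hz @ h + b_z)           # output Z_t of the output layer at time t
    return outputs
```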
  • the improvement of the BRNN algorithm over the RNN algorithm is based on the idea that the current output is related not only to the previous inputs but also to the subsequent inputs. It can be understood that the reverse layer and the forward layer shown in FIG. 6 do not refer to two hidden layers; rather, two output values need to be obtained for the same hidden layer, which is the difference between the BRNN algorithm and the RNN algorithm.
  • h t1 is the positive time direction output of the hidden layer at time t
  • h t2 is the negative time direction output of the hidden layer at time t
  • h t-1 is the output of the hidden layer at time t-1, and h t+1 is the output of the hidden layer at time t+1
  • x t is the input of the input layer at time t
  • the output h t-1 of the hidden layer at time t-1 is used as an input to the hidden layer at time t through a first weight matrix composed of the connection weights between each neuron of the hidden layer at time t-1 and each neuron of the hidden layer at time t; the output h t+1 of the hidden layer at time t+1 is used as an input to the hidden layer at time t through a second weight matrix composed of the connection weights between each neuron of the hidden layer at time t+1 and each neuron of the hidden layer at time t.
  • the input corresponding to one sample can be called a sequence, and one sample corresponds to multiple subsequences, such as subsequence x t-1 , subsequence x t , and subsequence x t+1 . The output h t-1 of the hidden layer at time t-1 is obtained from the input x t-1 of the input layer at time t-1, and the output h t+1 of the hidden layer at time t+1 is obtained from the input x t+1 of the input layer at time t+1; x t-1 , x t , and x t+1 correspond to different subsequences. That is to say, in the BRNN algorithm there is a sequential relationship between the subsequences: each subsequence is associated with the subsequences before it and also with the subsequences after it.
  • each connection weight does not change across time steps; that is, all subsequences of a sequence share the same connection weights: the connection weights used to obtain the output y t-1 from the input x t-1 , the output y t from the input x t , and the output y t+1 from the input x t+1 are the same.
  • a deep bidirectional recurrent neural network is a bidirectional recurrent neural network with multiple hidden layers; its algorithm can refer to the above algorithm with one hidden layer and is not described here.
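  • a sketch of a single-hidden-layer BRNN forward pass under the same illustrative assumptions: one pass runs over the subsequences in the positive time direction, a second pass in the negative time direction, and the two hidden outputs at each time are combined to form the output y t:

```python
import numpy as np

def brnn_forward(xs, p):
    """p holds forward (f) and backward (b) weights; both directions share weights across time."""
    T, n_h = len(xs), p["W_hh_f"].shape[0]
    h_f = np.zeros((T, n_h))
    h_b = np.zeros((T, n_h))
    h = np.zeros(n_h)
    for t in range(T):                                    # positive time direction: h_t1
        h = np.tanh(p["W_xh_f"] @ xs[t] + p["W_hh_f"] @ h + p["b_h_f"])
        h_f[t] = h
    h = np.zeros(n_h)
    for t in reversed(range(T)):                          # negative time direction: h_t2
        h = np.tanh(p["W_xh_b"] @ xs[t] + p["W_hh_b"] @ h + p["b_h_b"])
        h_b[t] = h
    # each output y_t uses both directional hidden outputs at time t
    # (p["W_hy"] therefore has twice as many columns as one hidden state has components)
    return [p["W_hy"] @ np.concatenate([h_f[t], h_b[t]]) + p["b_y"] for t in range(T)]
```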
  • the mixed Gaussian model is a combination of probability density functions of multiple Gaussian distributions.
  • a mixed Gaussian model with L mixture components can be expressed as p(x) = w 1 G(x, μ 1 , V 1 ) + w 2 G(x, μ 2 , V 2 ) + ... + w L G(x, μ L , V L ), where w l is the weight of the lth component and G(x, μ l , V l ) denotes the lth mixture component of the mixed Gaussian model, a b-dimensional multivariate single Gaussian probability density function with mean μ l and covariance V l (a positive definite matrix).
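  • as an illustrative aid (not part of the patent text), the following Python sketch evaluates the mixture density defined above for given weights w l, means μ l, and covariances V l:

```python
import numpy as np

def gmm_density(x, weights, means, covs):
    """p(x) = sum_l w_l * G(x, mu_l, V_l) for a b-dimensional mixed Gaussian model."""
    b = x.shape[0]
    p = 0.0
    for w, mu, V in zip(weights, means, covs):
        diff = x - mu
        norm = 1.0 / np.sqrt(((2 * np.pi) ** b) * np.linalg.det(V))   # Gaussian normalization
        p += w * norm * np.exp(-0.5 * diff @ np.linalg.solve(V, diff))
    return p
```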
  • Figure 7 is a system architecture diagram of an embodiment of the present application.
  • the system includes a mobile device 10 and a network device 20;
  • the network device is a device with a wireless transceiver function, or a chipset and necessary software and hardware that can be disposed in such a device, including but not limited to: an evolved NodeB (eNB), a radio network controller (RNC), a NodeB (NB), a base station controller (BSC), a base transceiver station (BTS), a home base station (for example, a home evolved NodeB or home NodeB, HNB), a baseband unit (BBU), an access point (AP) in a wireless fidelity (WIFI) system, a wireless relay node, a wireless backhaul node, a transmission and reception point (TRP or transmission point, TP), and the like. It may also be a gNB or a transmission point (TRP or TP) in a 5G system such as NR, one antenna panel or a group of antenna panels (including multiple antenna panels) of a base station in a 5G system, or a network node constituting a gNB or a transmission point, such as a baseband unit (BBU) or a distributed unit (DU).
  • the gNB may include a centralized unit (CU) and a DU.
  • the gNB may also include a radio unit (RU).
  • the CU implements some functions of the gNB, and the DU implements some functions of the gNB.
  • the CU implements the functions of the radio resource control (RRC) and packet data convergence protocol (PDCP) layers, and the DU implements the functions of the radio link control (RLC), media access control (MAC), and physical (PHY) layers. Since the information of the RRC layer eventually becomes information of the PHY layer, or is transformed from information of the PHY layer, higher-layer signaling, such as RRC layer signaling or PDCP layer signaling, can also be used in this architecture.
  • the network device can be a CU node, or a DU node, or a device including a CU node and a DU node.
  • the CU may be classified as a network device in the access network (RAN), or the CU may be classified as a network device in the core network (CN), which is not limited herein.
  • a mobile device may also be called a user equipment (UE), an access terminal, a subscriber unit, a subscriber station, a mobile station, a remote station, a remote terminal, a user terminal, a terminal, a wireless communication device, a user agent, or a user apparatus.
  • the mobile device involved in the present application may be a mobile phone, a tablet, a computer with a wireless transceiver function, a virtual reality (VR) device, an augmented reality (AR) device, a wireless terminal in industrial control, a wireless terminal in self driving, a wireless terminal in remote medical care, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, and the like.
  • the embodiment of the present application does not limit the application scenario.
  • the foregoing terminal device and a chip that can be disposed in the foregoing terminal device are collectively referred to as a terminal device.
  • network devices 20 can each communicate with a plurality of mobile devices, such as mobile device 10 shown in the figures.
  • Network device 20 can communicate with any number of mobile devices similar to mobile device 10.
  • FIG. 7 is merely a simplified schematic diagram for ease of understanding.
  • the communication system may also include other network devices or may also include other mobile devices, which are not shown in FIG.
  • FIG. 8 is a flowchart of a method for processing a voice signal according to an embodiment of the present disclosure. Referring to FIG. 8, the method in this embodiment includes:
  • Step S101: The mobile device decodes the received encoded speech signal to obtain m sets of low-frequency speech parameters; the m sets of low-frequency speech parameters are the low-frequency speech parameters of m speech frames of the speech signal, and m is an integer greater than 1.
  • Step S102 The mobile device determines, according to the m sets of low frequency speech parameters, the types of the m speech frames, and reconstructs the low frequency speech signals corresponding to the m speech frames; wherein the type of the speech frames includes an unvoiced frame or a voiced frame;
  • Step S103: The mobile device obtains n high-frequency speech signals corresponding to the n unvoiced frames according to the low-frequency speech parameters of the n unvoiced frames and the mixed Gaussian model algorithm, and obtains k high-frequency speech signals corresponding to the k voiced frames according to the low-frequency speech parameters of the k voiced frames and the neural network algorithm; n and k are integers greater than 1, and the sum of n and k is equal to m;
  • Step S104 The mobile device synthesizes the low frequency speech signal and the high frequency speech signal of each of the m speech frames to obtain a wideband speech signal.
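  • the following Python sketch summarizes steps S101 to S104; every function passed in (decode, classify, reconstruct, extend_unvoiced, extend_voiced, synthesize) is a hypothetical placeholder for the corresponding operation described in this embodiment, not an API defined by the patent:

```python
UNVOICED, VOICED = 0, 1

def process_speech(encoded_signal, decode, classify, reconstruct,
                   extend_unvoiced, extend_voiced, synthesize):
    """Steps S101-S104 expressed over injected (hypothetical) building blocks."""
    low_params = decode(encoded_signal)        # S101: m sets of low-frequency parameters
    frame_types = classify(low_params)         # S102: SAE-based unvoiced/voiced labels
    low_signals = reconstruct(low_params)      # S102: low-frequency signal per frame
    high_signals = [extend_unvoiced(p) if t == UNVOICED else extend_voiced(p)   # S103
                    for p, t in zip(low_params, frame_types)]
    return synthesize(low_signals, high_signals)   # S104: per-frame synthesis to wideband speech
```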
  • since the speech signal is short-time stationary, that is, it remains relatively stable and consistent over a short time interval, generally 5 ms to 50 ms, the analysis of the speech signal must be performed over such short intervals.
  • the "speech signal" referred to in this embodiment refers to a voice signal corresponding to a short time interval that can be analyzed.
  • the mobile device decodes the received encoded speech signal to obtain m sets of low frequency speech parameters; the m sets of low frequency speech parameters are low frequency speech parameters of m speech frames of the speech signal, and m is an integer greater than 1. It can be understood that each speech frame corresponds to a set of low frequency speech parameters.
  • the speech signal involved in step S101 may be referred to as a speech signal a in the following description.
  • the network device may perform parameter encoding on the m sets of low frequency speech parameters of the m speech frames of the speech signal a by using a parameter encoding method to obtain the encoded speech signal a.
  • the network device may use a mixed excitation linear prediction (MELP) algorithm to extract the low-frequency speech parameters of the speech signal a.
  • the low-frequency speech parameters obtained by the MELP algorithm include: a pitch period; or a sub-band signal strength; or a gain value; or a line spectrum frequency; or at least two of the pitch period, sub-band signal strength, gain value, and line spectrum frequency.
  • the low frequency speech parameters including the pitch period, the subband signal strength, the gain value, or the line spectrum frequency have the following meanings: the low frequency speech parameters include the pitch period and the subband signal strength; or, the pitch period and the gain value; or, the pitch Period and line spectrum frequency; or, subband signal strength and gain value; or, subband signal strength and line spectrum frequency; or, line spectrum frequency and gain value; or, pitch period and subband signal strength and gain values; Pitch period and subband signal strength and line spectrum frequency; or, gain value and subband signal strength and line spectrum frequency; or, pitch period and gain value and line spectrum frequency; or, pitch period and subband signal strength and gain value and line spectrum frequency.
  • the low frequency speech parameters in this embodiment include a pitch period and a subband signal strength and a gain value and a line spectrum frequency.
  • the low frequency speech parameters may include more than the above parameters, and may also include other parameters. Different parameter extraction algorithms are used, and the corresponding low frequency speech parameters have certain differences.
  • the speech signal a is sampled to obtain digital speech, and high-pass filtering is performed on the digital speech to remove the low-frequency energy and the 50 Hz power-frequency interference that may exist in the digital speech; for example, a 4th-order Chebyshev high-pass filter may be used for the high-pass filtering, and the high-pass-filtered digital speech is used as the speech signal to be processed.
  • every N sampling points of the speech signal to be processed form one speech frame; N may be 160, with a frame shift of 80 sampling points. The speech signal to be processed is thus divided into m speech frames, and the low-frequency speech parameters of each of the m speech frames are then extracted.
  • the low-frequency speech parameters extracted for each speech frame are: the pitch period, the sub-band signal strengths, the gain values, and the line spectrum frequency.
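  • the following Python/SciPy sketch illustrates the pre-processing and framing just described, assuming an 8 kHz sampling rate and a 4th-order Chebyshev type-I high-pass filter; the cutoff and ripple values are illustrative assumptions:

```python
import numpy as np
from scipy.signal import cheby1, lfilter

def preprocess_and_frame(speech, fs=8000, frame_len=160, frame_shift=80):
    """High-pass filter to remove low-frequency energy / 50 Hz interference, then split into frames."""
    b, a = cheby1(N=4, rp=0.5, Wn=60.0, btype="highpass", fs=fs)   # cutoff near 60 Hz (assumed)
    filtered = lfilter(b, a, speech)
    frames = []
    for start in range(0, len(filtered) - frame_len + 1, frame_shift):
        frames.append(filtered[start:start + frame_len])           # N = 160 samples, shift = 80
    return np.array(frames)
```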
  • each voice frame includes a low frequency voice signal and a high frequency voice signal. Due to the limitation of the transmission bandwidth, the range of the voice frequency band is limited.
  • the extracted low-frequency speech parameters of a speech frame are the low-frequency speech parameters corresponding to the low-frequency speech signal in that speech frame; correspondingly, the high-frequency speech parameters appearing later in this embodiment are the high-frequency speech parameters corresponding to the high-frequency speech signal in the speech frame.
  • the low frequency speech signal is opposite to the high frequency speech signal. It can be understood that if the frequency corresponding to the low frequency speech signal is 300 Hz to 3400 Hz, the frequency corresponding to the high frequency speech signal may be 3400 Hz to 8000 Hz.
  • the frequency range corresponding to the low-frequency speech signal in this embodiment may be a frequency range corresponding to the narrow-band speech signal in the prior art, that is, 300 Hz to 3400 Hz, and may also be other frequency ranges.
  • the acquisition of the pitch period includes the acquisition of the integer pitch period, the acquisition of the fractional pitch period, and the acquisition of the final pitch period.
  • Each speech frame corresponds to one pitch period.
  • to obtain the sub-band signal strengths, the 0-4 kHz speech band (corresponding to the low-frequency speech signal) can first be divided into 5 fixed frequency bands (0-500 Hz, 500-1000 Hz, 1000-2000 Hz, 2000-3000 Hz, 3000-4000 Hz) using a 6th-order Butterworth band-pass filter bank. Such a division is merely exemplary and need not be employed.
  • the sub-band sound intensity of the first sub-band (0-500 Hz) is the normalized autocorrelation value corresponding to the fractional pitch period of the speech frame.
  • the sound intensity of the remaining four sub-bands is the maximum value of the autocorrelation function of the sub-band signal; for an unstable speech frame, that is, a speech frame whose pitch period varies greatly, the envelope of the sub-band signal is used instead: the envelope is full-wave rectified and smoothed, the normalized autocorrelation function value is calculated and 0.1 is subtracted from it, and the result is used as the sound intensity of the corresponding sub-band.
  • each speech frame corresponds to a plurality of sub-band sound intensities, such as five.
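  • the following Python/SciPy sketch illustrates such a 5-band decomposition with 6th-order Butterworth filters; the first band is realized as a low-pass filter, and the exact design details are illustrative assumptions rather than the patent's specification:

```python
import numpy as np
from scipy.signal import butter, lfilter

BAND_EDGES = [(0, 500), (500, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]  # Hz

def split_subbands(frame, fs=8000):
    """Return the 5 fixed sub-band signals of one speech frame."""
    subbands = []
    for lo, hi in BAND_EDGES:
        hi = min(hi, fs / 2 - 1)                           # keep the top edge below Nyquist
        if lo == 0:
            b, a = butter(6, hi, btype="lowpass", fs=fs)   # first band: 0-500 Hz low-pass
        else:
            b, a = butter(6, [lo, hi], btype="bandpass", fs=fs)
        subbands.append(lfilter(b, a, frame))
    return np.array(subbands)
```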
  • in calculating the gain values, a pitch-adaptive window length is used.
  • the window length is determined as follows: when Vbp1 > 0.6 (the speech frame is a voiced frame), the window length is the smallest multiple of the fractional pitch period that exceeds 120 sampling points, and if this window length exceeds 320 sampling points it is divided by 2; when Vbp1 ≤ 0.6 (the speech frame is an unvoiced frame), the window length is 120 sampling points.
  • the center of the window for the first gain G 1 is located 90 sampling points before the last sampling point of the current speech frame; the center of the window for the second gain G 2 is located at the last sampling point of the current frame.
  • each gain value is the root mean square value of the windowed signal S n , converted to decibels.
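  • the gain computation therefore reduces to the root mean square of the windowed signal expressed in decibels; a minimal sketch (the small floor that guards against the logarithm of zero is an added assumption):

```python
import numpy as np

def frame_gain_db(s_windowed):
    """Gain value: RMS of the windowed signal S_n converted to decibels."""
    rms = np.sqrt(np.mean(np.square(s_windowed)))
    return 20.0 * np.log10(max(rms, 1e-10))   # small floor avoids log10(0)
```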
  • the input speech signal is weighted with a Hamming window of 200 sample points long (25 ms), and then a 10th order linear prediction analysis is performed, the center of the window being located at the last sampling point of the current frame.
  • the MELP algorithm uses Chebyshev polynomials to recursively convert the linear prediction coefficients to the line spectrum frequency, which reduces the computational complexity.
  • each speech frame corresponds to one line spectrum frequency; the line spectrum frequency is a vector having a plurality of components, for example 12 components.
  • in summary, the network device uses the MELP algorithm to extract the low-frequency speech parameters of the m speech frames of the speech signal, and each speech frame correspondingly yields one set of low-frequency speech parameters; the set of low-frequency speech parameters may include one pitch period, multiple sub-band sound intensities, two gains, and one line spectrum frequency vector.
  • the network device encodes the m sets of low-frequency speech parameters of the m speech frames of the speech signal a to obtain the encoded speech signal a and sends it to the mobile device; the mobile device decodes the received encoded speech signal a to obtain the m sets of low-frequency speech parameters, and each set of low-frequency speech parameters corresponds to the low-frequency speech signal of one speech frame of the speech signal a.
  • the mobile device determines the types of the m speech frames based on the m sets of low frequency speech parameters, and reconstructs the low frequency speech signals corresponding to the m speech frames; wherein the type of the speech frames includes an unvoiced frame or a voiced frame;
  • after obtaining the m sets of low-frequency speech parameters corresponding to the speech signal a, the mobile device reconstructs the low-frequency speech signals corresponding to the m speech frames according to the m sets of low-frequency speech parameters.
  • the mobile device reconstructs the low-frequency voice signal corresponding to the m voice frames according to the m-group low-frequency voice parameters, which is a mature technology in the prior art, and is not described in this embodiment.
  • the mobile device determines the type of m speech frames based on the m sets of low frequency speech parameters, that is, determines whether each speech frame is an unvoiced frame or a voiced frame.
  • the mobile device determines the types of the m voice frames based on the m sets of low frequency voice parameters, including:
  • the mobile device uses the SAE algorithm to obtain m labels according to the m-group low-frequency voice parameters and the stack automatic encoder SAE model, and the m labels are used to indicate the types of m voice frames corresponding to the m-group low-frequency voice parameters;
  • the SAE model is obtained by training with the SAE algorithm based on a plurality of first training samples; each first training sample includes the low-frequency speech parameters corresponding to the low-frequency speech signal of one speech frame of another speech signal, and these other speech signals are different from the speech signal a in this embodiment.
  • the SAE model may be obtained by the mobile device in this embodiment by training with the SAE algorithm based on multiple first training samples, or it may be trained by another device using the SAE algorithm based on multiple first training samples, in which case the mobile device of this embodiment directly acquires the trained SAE model from the other device.
  • the SAE algorithm is used to determine the type of the speech frame according to the low-frequency speech parameters of the speech frame. Compared with the method for determining the type of the speech frame in the prior art, the accuracy can be greatly improved.
  • for example, if a set of low-frequency speech parameters consists of the pitch period, sub-band signal strengths, gain values, and line spectrum frequency, and includes 1 pitch period, 5 sub-band signal strengths, 2 gain values, and a line spectrum frequency vector with 12 components, then the dimension of the input vector X is 20, that is, X has 20 components. The input vector X is used as the input of the SAE shown in FIG. 1, and, using the SAE algorithm as described above, a label indicating the type of the speech frame is output; the SAE algorithm uses the SAE model trained based on the plurality of first training samples.
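  • the following sketch shows how such a 20-dimensional input vector could be assembled and mapped to an unvoiced/voiced label; the normalization scheme and the 0.5 decision threshold are illustrative assumptions, and sae_forward stands for the trained SAE's forward pass:

```python
import numpy as np

def frame_feature_vector(pitch, subband_strengths, gains, lsf, scale):
    """Concatenate one frame's low-frequency parameters into the 20-dim SAE input X."""
    x = np.concatenate([[pitch], subband_strengths, gains, lsf])   # 1 + 5 + 2 + 12 = 20
    return x / scale                                               # per-component normalization (assumed)

def classify_frame(x, sae_forward):
    """sae_forward is the trained SAE's forward pass; label 1 = voiced frame, 0 = unvoiced frame."""
    label = sae_forward(x)
    return 1 if label >= 0.5 else 0
```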
  • A1: obtain a plurality of first training samples;
  • A2: train with all the first training samples by using the SAE algorithm to obtain the SAE model.
  • each first training sample includes the low-frequency speech parameters corresponding to the low-frequency speech signal of one speech frame of another speech signal. It can be understood that the frequency range corresponding to the low-frequency speech signal here is the same as the frequency range corresponding to the low-frequency speech signal from which the network device extracts and encodes the low-frequency speech parameters, and the low-frequency speech parameters here are of the same kind as the low-frequency speech parameters extracted by the network device or decoded by the mobile device, with the same extraction method.
  • for example, the speech signal b is one of the other speech signals; for each of the l speech frames of the speech signal b, the set of low-frequency speech parameters corresponding to the low-frequency speech signal of that speech frame is extracted, so that l sets of low-frequency speech parameters are obtained, and each set of low-frequency speech parameters is one first training sample.
  • the number of first training samples should be large enough; the other speech signals may include multiple speech signals, and the number of natural persons corresponding to the multiple speech signals should be as large as possible.
  • for the first training sample 1, the normalized vector of the low-frequency speech parameters included in the first training sample 1 is used as the input vector of the SAE, the label of the first training sample 1 is used as the expected output, and the connection weights between the SAE neurons and the corresponding offset values are assigned initial values. Using the SAE algorithm as described above, the actual output corresponding to the first training sample 1 is obtained, and, according to the actual output and the expected output, the error back-propagation algorithm under the minimum mean square error criterion and the gradient descent method are used to adjust the connection weights between the SAE neurons and the corresponding offset values, yielding updated connection weights and corresponding offset values between the neurons.
  • for the first training sample 2, the normalized vector of the low-frequency speech parameters included in the first training sample 2 is used as the input vector of the SAE, and the label of the first training sample 2 is used as the expected output; the initially used connection weights and corresponding offset values between the SAE neurons are the updated connection weights and corresponding offset values obtained after the first training sample 1 was trained. Using the SAE algorithm as described above, the actual output corresponding to the first training sample 2 is obtained, and, according to the actual output and the expected output, the error back-propagation algorithm under the minimum mean square error criterion and the gradient descent method are used to adjust the connection weights between the SAE neurons and the corresponding offset values again, yielding updated connection weights and corresponding offset values between the neurons.
  • for the first training sample 3, the normalized vector of the low-frequency speech parameters included in the first training sample 3 is used as the input vector of the SAE, and the label of the first training sample 3 is used as the expected output; the initially used connection weights and corresponding offset values between the SAE neurons are the updated connection weights and corresponding offset values obtained after the first training sample 2 was trained. Using the SAE algorithm as described above, the actual output corresponding to the first training sample 3 is obtained, and, according to the actual output and the expected output, the error back-propagation algorithm under the minimum mean square error criterion and the gradient descent method are used to adjust the connection weights between the SAE neurons and the corresponding offset values again, yielding updated connection weights and corresponding offset values between the neurons.
  • the above training process is repeatedly performed until the error function converges, that is, after the training accuracy meets the requirements, the training process is stopped, and each training sample is trained at least once.
  • the neural network corresponding to the last round of training, together with the connection weights and corresponding offset values between the neurons of each layer, constitutes the SAE model.
  • using the SAE model and the m sets of low-frequency speech parameters decoded by the mobile device, m labels can be obtained with the SAE algorithm, and the m labels are used to indicate the types of the m speech frames corresponding to the m sets of low-frequency speech parameters. It can be understood that if, during training, the first training samples whose low-frequency speech parameters were extracted from the low-frequency speech signals of voiced frames had the label 1, then for the sets of low-frequency speech parameters corresponding to voiced frames among the m decoded sets, the labels obtained with the SAE model and the SAE algorithm should be 1 or close to 1; similarly, if, during training, the first training samples whose low-frequency speech parameters were extracted from the low-frequency speech signals of unvoiced frames had the label 0, then for the sets of low-frequency speech parameters corresponding to unvoiced frames among the m decoded sets, the resulting labels should be 0 or close to 0.
  • the mobile device obtains the n high-frequency speech signals corresponding to the n unvoiced frames according to the low-frequency speech parameters of the n unvoiced frames and the mixed Gaussian model algorithm, and obtains the k high-frequency speech signals corresponding to the k voiced frames according to the low-frequency speech parameters of the k voiced frames and the neural network algorithm; n and k are integers greater than 1, and the sum of n and k is equal to m.
  • using a neural network algorithm to predict the high-frequency speech parameters corresponding to an unvoiced frame from its low-frequency speech parameters introduces artificial noise, which causes the user to hear audible noise and affects the user's listening experience.
  • therefore, the neural network algorithm is not used to obtain the high-frequency speech signals corresponding to the unvoiced frames; the mixed Gaussian model algorithm is adopted instead.
  • using the neural network algorithm to predict the high-frequency speech parameters corresponding to the voiced frames from their low-frequency speech parameters introduces almost no artificial noise and can preserve the emotion of the original speech.
  • therefore, the neural network algorithm can be used to obtain the high-frequency speech signals corresponding to the voiced frames according to the low-frequency speech parameters of the voiced frames. This is the purpose of determining the type of each speech frame in step S102: according to the different natures of unvoiced and voiced frames, different machine learning algorithms are adopted, so that as little artificial noise as possible is introduced and the emotion of the original speech is retained, thereby achieving accurate reproduction of the original speech.
  • the mobile device obtains n high-frequency voice signals corresponding to the n unvoiced frames according to the low-frequency speech parameters of the n unvoiced frames and the mixed Gaussian model algorithm, including:
  • the mobile device obtains high frequency speech parameters of n unvoiced frames according to low frequency speech parameters and mixed Gaussian model algorithms of n unvoiced frames;
  • the mobile device constructs n high frequency speech signals corresponding to n unvoiced frames according to the high frequency speech parameters of the n unvoiced frames.
  • the mixed Gaussian model algorithm refers to the algorithm in the prior art, and details are not described herein again.
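  • the patent leaves the mixed Gaussian model algorithm to the prior art; one common prior-art form of such a mapping, given here only as an illustrative sketch, is the minimum mean square error estimate of the high-frequency parameters under a joint Gaussian mixture trained on paired low- and high-frequency parameter vectors:

```python
import numpy as np

def gmm_predict_high(x_low, weights, means, covs, d_low):
    """MMSE estimate E[high | low] under a joint GMM over [low; high] parameter vectors."""
    num, den = 0.0, 0.0
    for w, mu, V in zip(weights, means, covs):
        mu_l, mu_h = mu[:d_low], mu[d_low:]
        V_ll = V[:d_low, :d_low]
        V_hl = V[d_low:, :d_low]
        diff = x_low - mu_l
        # component responsibility given the low-frequency parameters
        # (the common (2*pi)^(-d/2) factor cancels between numerator and denominator)
        g = w * np.exp(-0.5 * diff @ np.linalg.solve(V_ll, diff)) / np.sqrt(np.linalg.det(V_ll))
        num = num + g * (mu_h + V_hl @ np.linalg.solve(V_ll, diff))
        den += g
    return num / den
```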
  • The mobile device obtains the k high-frequency speech signals corresponding to the k voiced frames according to the low-frequency speech parameters of the k voiced frames and the neural network algorithm as follows:
  • the mobile device obtains the high-frequency speech parameters of the k voiced frames according to the low-frequency speech parameters of the k voiced frames and the neural network model, using the neural network algorithm;
  • the mobile device constructs the k high-frequency speech signals corresponding to the k voiced frames according to the high-frequency speech parameters of the k voiced frames.
  • The neural network model is obtained by the mobile device of this embodiment, or another mobile device, training on a plurality of second training samples with the neural network algorithm; one second training sample includes h groups of low-frequency speech parameters of h voiced frames of another speech signal, where h is an integer greater than 1, and the other speech signal is different from speech signal a in this embodiment.
  • For a given other speech signal, h may equal the number of all voiced frames it contains, or be smaller than that number; the value of h may differ for different speech signals.
  • The neural network algorithm here may be an LSTM neural network algorithm, with the neural network model being an LSTM neural network model; or
  • the neural network algorithm may be a BRNN algorithm, with the neural network model being a BRNN model; or
  • the neural network algorithm may be an RNN algorithm, with the neural network model being an RNN model.
  • The following takes the BRNN algorithm as the neural network algorithm and the BRNN model as the neural network model as an example to describe the specific process by which the mobile device obtains the k high-frequency speech signals corresponding to the k voiced frames according to the low-frequency speech parameters of the k voiced frames and the neural network model.
  • The mobile device normalizes the decoded k groups of speech parameters corresponding to the k voiced frames to obtain the corresponding vectors; the set of vectors obtained by normalizing the k groups of speech parameters may be called a sequence, and the vector obtained by normalizing one group of low-frequency speech parameters among the k groups may be called a subsequence.
  • The subsequences are input into the bidirectional recurrent neural network in the chronological order of their corresponding speech frames, that is, each subsequence corresponds to the input at one time step.
  • For example, in the chronological order of the voiced frames there are subsequence 1, subsequence 2 and subsequence 3; if subsequence 2 corresponds to X t shown in FIG. 6, then subsequence 1 corresponds to X t-1 shown in FIG. 6 and subsequence 3 corresponds to X t+1 shown in FIG. 6.
  • The vectors obtained by normalizing the k groups of speech parameters are used as the input of the bidirectional recurrent neural network, and the bidirectional recurrent neural network algorithm described above is applied, based on the bidirectional recurrent neural network model, to obtain an output for each of the k groups of low-frequency speech parameters; each output indicates the high-frequency speech parameters of the corresponding voiced frame and can be converted into high-frequency speech parameters, that is, the k groups of high-frequency speech parameters of the k voiced frames are obtained.
  • For example, in the chronological order of the voiced frames there are subsequence 1, subsequence 2 and subsequence 3; if the output corresponding to subsequence 2 is y t shown in FIG. 6, then the output corresponding to subsequence 1 is y t-1 shown in FIG. 6 and the output corresponding to subsequence 3 is y t+1 shown in FIG. 6.
  • In the bidirectional recurrent neural network algorithm, all subsequences share the same bidirectional recurrent neural network model, and the algorithm yields the output corresponding to each subsequence.
  • After the mobile device obtains the k groups of high-frequency speech parameters of the k voiced frames according to the BRNN model and the BRNN algorithm, the mobile device constructs the k high-frequency speech signals corresponding to the k voiced frames according to the k groups of high-frequency speech parameters of the k voiced frames.
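  • A minimal sketch of this inference step follows: each normalized group of low-frequency parameters is one time-step input, and the output at that step is read as the high-frequency parameter estimate for the corresponding voiced frame. The weight-matrix names mirror the forward/backward recurrences of a BRNN; the activation choices, the zero initial states and the absence of bias terms are simplifying assumptions, not details given in the patent.

```python
import numpy as np

def brnn_forward(xs, W_xh_f, W_hh_f, W_xh_b, W_hh_b, W_hy_f, W_hy_b,
                 f=np.tanh, g=lambda z: z):
    """xs: list of k normalized low-frequency parameter vectors, ordered by the
    time of their voiced frames (one subsequence per time step)."""
    k, hidden = len(xs), W_hh_f.shape[0]
    h_fwd = [np.zeros(hidden)] * k
    h_bwd = [np.zeros(hidden)] * k
    for t in range(k):                                   # forward-in-time pass
        prev = h_fwd[t - 1] if t > 0 else np.zeros(hidden)
        h_fwd[t] = f(W_xh_f @ xs[t] + W_hh_f @ prev)
    for t in reversed(range(k)):                         # backward-in-time pass
        nxt = h_bwd[t + 1] if t < k - 1 else np.zeros(hidden)
        h_bwd[t] = f(W_xh_b @ xs[t] + W_hh_b @ nxt)
    # y_t combines the forward and backward hidden states of the same layer
    return [g(W_hy_f @ h_fwd[t] + W_hy_b @ h_bwd[t]) for t in range(k)]
```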
  • The following describes how the bidirectional recurrent neural network (BRNN) model is obtained.
  • B1: obtain a plurality of second training samples.
  • B2: obtain a label for each second training sample, the label being the h groups of high-frequency speech parameters corresponding to the h groups of low-frequency speech parameters included in that second training sample; the h groups of low-frequency speech parameters included in a second training sample and the h groups of high-frequency speech parameters included in the corresponding label are speech parameters of the same speech signal.
  • B3: according to each second training sample and its label, the second training samples are trained with the bidirectional recurrent neural network algorithm to obtain the bidirectional recurrent neural network model.
  • For B1: a plurality of second training samples are obtained, where one second training sample includes the h groups of low-frequency speech parameters corresponding to the low-frequency speech signals of h voiced frames of another speech signal. It should be understood that the frequency range of the low-frequency speech signals here is the same as the frequency range of the low-frequency speech signals to which the low-frequency speech parameters encoded by the network device correspond, and the low-frequency speech parameters here are of the same kind as the low-frequency speech parameters extracted by the network device or decoded by the mobile device.
  • For example, for speech signal 1, the h1 groups of low-frequency speech parameters of its h1 voiced frames are extracted to obtain a second training sample 1; that is, second training sample 1 includes multiple groups of low-frequency speech parameters, and each voiced frame corresponds to one group of low-frequency speech parameters.
  • For speech signal 2, the h2 groups of low-frequency speech parameters of its h2 voiced frames are extracted to obtain a second training sample 2.
  • h1 and h2 may be the same or different; speech signal 1 and speech signal 2 are both speech signals among the other speech signals.
  • For B2: for the above second training sample 1, the h1 groups of high-frequency speech parameters corresponding to the high-frequency speech signals of the h1 voiced frames of speech signal 1 are extracted; these h1 groups of high-frequency speech parameters are the label of second training sample 1.
  • For the above second training sample 2, the h2 groups of high-frequency speech parameters corresponding to the high-frequency speech signals of the h2 voiced frames of speech signal 2 are extracted; these h2 groups of high-frequency speech parameters are the label of second training sample 2.
  • For B3: for the first sample to be trained, second training sample 1, the vectors obtained by normalizing each of its h1 groups of low-frequency speech parameters are used as the input of the bidirectional recurrent neural network; this set of normalized vectors may be called a sequence, and the normalized vector of each group among the h1 groups of low-frequency speech parameters may be called a subsequence. The subsequences are input into the bidirectional recurrent neural network in the chronological order of their corresponding speech frames, that is, each subsequence corresponds to the input at one time step.
  • For example, second training sample 1 has subsequence 1, subsequence 2 and subsequence 3 in the chronological order of the speech frames; if subsequence 2 corresponds to X t shown in FIG. 6, then subsequence 1 corresponds to X t-1 shown in FIG. 6 and subsequence 3 corresponds to X t+1 shown in FIG. 6.
  • The vector obtained by normalizing the label of second training sample 1 is used as the expected output.
  • The connection weights and bias values involved in the bidirectional recurrent neural network are assigned initial values, and all subsequences share the connection weights and bias values.
  • According to the above input and the connection weights and bias values, the bidirectional recurrent neural network algorithm is used to obtain the actual output of second training sample 1; it can be understood that each subsequence corresponds to one output, and the outputs of all subsequences together form the actual output of second training sample 1.
  • For example, second training sample 1 has subsequence 1, subsequence 2 and subsequence 3 in the chronological order of the speech frames; if the output corresponding to subsequence 2 is y t shown in FIG. 6, then the output corresponding to subsequence 1 is y t-1 shown in FIG. 6 and the output corresponding to subsequence 3 is y t+1 shown in FIG. 6.
  • After the actual output and the expected output are processed, the initial connection weights and bias values are adjusted according to the processing result to obtain the adjusted connection weights and bias values.
  • For the second sample to be trained, second training sample 2, the normalized vectors of its h2 groups of low-frequency speech parameters are used as the input of the bidirectional recurrent neural network, and the normalized vector of its label is used as the expected output;
  • the connection weights and bias values involved in this training round are the adjusted connection weights and bias values obtained after training on second training sample 1;
  • the bidirectional recurrent neural network algorithm is applied to obtain the actual output of second training sample 2; after the actual output and the expected output are processed, the connection weights and bias values involved in this training round are adjusted according to the processing result to obtain the adjusted connection weights and bias values.
  • For the third sample to be trained, second training sample 3, the normalized vectors of its h3 groups of low-frequency speech parameters are used as the input of the bidirectional recurrent neural network, and the normalized vector of its label is used as the expected output;
  • the connection weights and bias values involved in this training round are the adjusted connection weights and bias values obtained after training on second training sample 2;
  • the bidirectional recurrent neural network algorithm is applied to obtain the actual output of second training sample 3; after the actual output and the expected output are processed, the connection weights and bias values involved in this training round are adjusted according to the processing result to obtain the adjusted connection weights and bias values.
  • The above training process is repeated until the preset training precision or the preset number of training iterations is reached, at which point training stops; each training sample is trained at least once.
  • The bidirectional recurrent neural network corresponding to the last training round, together with its connection weights and bias values, is the BRNN model.
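  • The sketch below outlines this sample-by-sample training loop under stated assumptions: it reuses brnn_forward from the earlier sketch, the gradient step is delegated to a hypothetical helper bptt_update() (backpropagation through time with gradient descent), and the mean-squared-error measure and stopping rule are illustrative stand-ins for the "preset training precision" of the text.

```python
import numpy as np

def train_brnn(samples, labels, params, lr=0.01, target_mse=1e-3, max_epochs=100):
    """samples[i]: list of normalized low-frequency parameter vectors (one sequence).
    labels[i]:  list of normalized high-frequency parameter vectors (expected output).
    params: tuple of the six BRNN weight matrices used by brnn_forward."""
    for epoch in range(max_epochs):
        worst = 0.0
        for seq, expected in zip(samples, labels):        # one sample per update
            actual = brnn_forward(seq, *params)           # shared weights for all subsequences
            mse = np.mean([np.mean((a - e) ** 2) for a, e in zip(actual, expected)])
            worst = max(worst, mse)
            params = bptt_update(params, seq, expected, lr)   # hypothetical helper
        if worst <= target_mse:                           # preset training precision reached
            break
    return params                                         # final weights form the BRNN model
```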
  • Obtaining the high-frequency speech parameters corresponding to the voiced frames with the bidirectional recurrent network algorithm has the following beneficial effects.
  • As introduced above for the bidirectional recurrent neural network algorithm, for the input x t at time t, the corresponding output y t of the network is related not only to the input x t-1 at time t-1 (h t-1 is obtained from x t-1) but also to the input x t+1 at time t+1 (h t+1 is obtained from x t+1).
  • From the foregoing description, when x t corresponds to a group of low-frequency speech parameters of voiced frame a in the embodiment of the present application, the output y t corresponds to a group of high-frequency speech parameters of voiced frame a; x t-1 corresponds to a group of low-frequency speech parameters of the preceding voiced frame b of voiced frame a, and x t+1 corresponds to a group of low-frequency speech parameters of the following voiced frame c of voiced frame a. That is to say, when the bidirectional recurrent neural network algorithm is used to predict high-frequency speech parameters from low-frequency speech parameters, it considers not only the preceding voiced frame b of voiced frame a but also the following voiced frame c. Combined with the semantic continuity of speech (the current speech signal is related to both the previous frame and the next frame), predicting the high-frequency speech parameters of voiced frame a while considering the information of the voiced frames before and after it improves the accuracy of predicting the high-frequency speech parameters, and thus the accuracy of predicting the high-frequency speech signal from the low-frequency speech signal.
  • In summary, obtaining the high-frequency speech parameters corresponding to the voiced frames with the bidirectional recurrent network algorithm improves the accuracy of predicting the high-frequency speech signal of a frame from the low-frequency speech signal of that voiced frame.
  • Through the above steps, the mobile device obtains the m groups of high-frequency speech signals and the m groups of low-frequency speech signals of the m speech frames of speech signal a.
  • For step S104, the mobile device synthesizes the low-frequency speech signal and the high-frequency speech signal of each of the m speech frames to obtain a wideband speech signal.
  • After the mobile device synthesizes the low-frequency speech signal and the high-frequency speech signal of each of the m speech frames, the complete wideband speech is obtained.
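  • The patent does not spell out the synthesis filter bank for step S104, so the sketch below is only one plausible realization: it assumes both band signals are already available at the wideband sampling rate (e.g. 16 kHz) and simply band-limits and sums them frame by frame; the 4 kHz cutoff, the filter order and the absence of overlap-add handling are illustrative choices.

```python
import numpy as np
from scipy.signal import butter, lfilter

def synthesize_wideband(low_frames, high_frames, fs=16000, cutoff=4000):
    """Combine per-frame low-band and high-band signals into wideband speech."""
    b_lo, a_lo = butter(6, cutoff / (fs / 2), btype="low")
    b_hi, a_hi = butter(6, cutoff / (fs / 2), btype="high")
    out = []
    for lo, hi in zip(low_frames, high_frames):           # one pair per speech frame
        lo_band = lfilter(b_lo, a_lo, lo)                 # keep 0..cutoff from the low band
        hi_band = lfilter(b_hi, a_hi, hi)                 # keep cutoff..fs/2 from the high band
        out.append(lo_band + hi_band)
    return np.concatenate(out)                            # complete wideband speech signal
```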
  • The speech signal processing method of this embodiment is performed on the mobile device side and does not change the original communication system; only the relevant extension apparatus or extension program needs to be provided on the mobile device side. Voiced frames and unvoiced frames are distinguished according to the speech parameters, with high classification accuracy. According to the different nature of unvoiced and voiced frames, the mixed Gaussian model algorithm is used to obtain the high-frequency speech signals corresponding to the unvoiced frames, which reduces the probability of introducing noise, and the neural network algorithm is used to obtain the high-frequency speech signals corresponding to the voiced frames, which preserves the sentiment of the original speech, so that the original speech can be accurately reproduced and the user's listening experience is improved.
  • The foregoing describes the solution provided by the embodiments of the present application in terms of the functions implemented by the mobile device. It can be understood that, to implement the above functions, the device includes corresponding hardware structures and/or software modules for performing the respective functions.
  • In combination with the units and algorithm steps of the examples described in the embodiments disclosed in this application, the embodiments of the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered to go beyond the scope of the technical solutions of the embodiments of the present application.
  • In the embodiments of the present application, the mobile device may be divided into function modules according to the foregoing method example; for example, each function module may be divided according to each function, or two or more functions may be integrated into one processing unit.
  • The integrated unit may be implemented in the form of hardware or in the form of a software function module.
  • It should be noted that the division of modules in the embodiments of the present application is schematic and is only a logical function division; in actual implementation there may be another division manner, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
  • FIG. 9 is a schematic structural diagram of a mobile device according to an embodiment of the present application. Referring to FIG. 9, the mobile device of this embodiment includes: a decoding module 31, a processing module 32, an obtaining module 33, and a synthesizing module 34.
  • The decoding module 31 is configured to decode the received encoded speech signal to obtain m groups of low-frequency speech parameters; the m groups of low-frequency speech parameters are the low-frequency speech parameters of the m speech frames of the speech signal, where m is an integer greater than 1.
  • The processing module 32 is configured to determine the types of the m speech frames based on the m groups of low-frequency speech parameters, and to reconstruct the low-frequency speech signals corresponding to the m speech frames, where the type includes an unvoiced frame or a voiced frame.
  • The obtaining module 33 is configured to obtain the n high-frequency speech signals corresponding to the n unvoiced frames according to the low-frequency speech parameters of the n unvoiced frames and the mixed Gaussian model algorithm, and to obtain the k high-frequency speech signals corresponding to the k voiced frames according to the low-frequency speech parameters of the k voiced frames and the neural network algorithm; n and k are integers greater than 1, and the sum of n and k equals m.
  • The synthesizing module 34 is configured to synthesize the low-frequency speech signal and the high-frequency speech signal of each of the m speech frames to obtain a wideband speech signal.
  • Optionally, each group of low-frequency speech parameters includes: a pitch period; or a subband signal strength; or a gain value; or a line spectrum frequency.
  • the mobile device of this embodiment may be used to implement the technical solution of the foregoing method embodiment, and the implementation principle and the technical effect are similar, and details are not described herein again.
  • In a possible design, the processing module 32 is specifically configured to:
  • obtain m labels according to the m groups of low-frequency speech parameters and the stacked autoencoder (SAE) model, using the SAE algorithm, where the m labels are used to indicate the types of the m speech frames corresponding to the m groups of low-frequency speech parameters;
  • where the SAE model is obtained by the mobile device or another mobile device training on a plurality of first training samples with the SAE algorithm, and each first training sample includes the low-frequency speech parameters corresponding to the low-frequency speech signal of one speech frame of another speech signal.
  • In a possible design, the obtaining module 33 is specifically configured to:
  • obtain the high-frequency speech parameters of the n unvoiced frames according to the low-frequency speech parameters of the n unvoiced frames and the mixed Gaussian model algorithm; and
  • construct the n high-frequency speech signals according to the high-frequency speech parameters of the n unvoiced frames.
  • In a possible design, the obtaining module 33 is specifically configured to:
  • obtain the high-frequency speech parameters of the k voiced frames according to the low-frequency speech parameters of the k voiced frames and the neural network model, using the neural network algorithm; and
  • construct the k high-frequency speech signals according to the high-frequency speech parameters of the k voiced frames;
  • where the neural network model is obtained by the mobile device or another mobile device training on a plurality of second training samples with the neural network algorithm, and one second training sample includes the low-frequency speech parameters of h voiced frames of another speech signal, h being an integer greater than 1.
  • Optionally, the neural network algorithm is a long short-term memory (LSTM) neural network algorithm and the neural network model is an LSTM neural network model; or the neural network algorithm is a bidirectional recurrent neural network (BRNN) algorithm and the neural network model is a BRNN model; or the neural network algorithm is a recurrent neural network (RNN) algorithm and the neural network model is an RNN model.
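  • As a purely illustrative skeleton (not from the patent), the modules of FIG. 9 could be wired together in software as below; it reuses the helpers sketched earlier (classify_frames, normalize, gmm_regress_high, brnn_forward, synthesize_wideband), while the decoder interface and the build_signal() step (turning predicted high-frequency parameters into a waveform) are hypothetical placeholders.

```python
class WidebandExtender:
    def __init__(self, decoder, sae_model, gmm, brnn_params):
        self.decoder = decoder            # decoding module 31 (assumed interface)
        self.sae_model = sae_model        # processing module 32: frame typing
        self.gmm = gmm                    # obtaining module 33, unvoiced branch
        self.brnn_params = brnn_params    # obtaining module 33, voiced branch

    def process(self, encoded_speech):
        params = self.decoder.decode(encoded_speech)             # m parameter groups
        types = classify_frames(params, self.sae_model)          # unvoiced / voiced
        low = [self.decoder.reconstruct(p) for p in params]      # low-frequency signals
        high = self._predict_high(params, types)                 # high-frequency signals
        return synthesize_wideband(low, high)                    # synthesizing module 34

    def _predict_high(self, params, types):
        voiced_idx = [i for i, t in enumerate(types) if t == "voiced"]
        voiced_out = brnn_forward([normalize(params[i]) for i in voiced_idx],
                                  *self.brnn_params)
        high, v = [None] * len(params), 0
        for i, t in enumerate(types):
            if t == "voiced":
                high[i] = build_signal(voiced_out[v])             # hypothetical helper
                v += 1
            else:
                high[i] = build_signal(
                    gmm_regress_high(normalize(params[i]), *self.gmm))
        return high
```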
  • the mobile device of this embodiment may be used to implement the technical solution of the foregoing method embodiment, and the implementation principle and the technical effect are similar, and details are not described herein again.
  • FIG. 10 is a second schematic structural diagram of a mobile device according to an embodiment of the present application; the mobile device includes a processor 41, a memory 42, and a communication bus 43.
  • The processor 41 is configured to read and execute instructions in the memory 42 to implement the methods in the foregoing method embodiments; or the processor 41 is configured to read and invoke, through the memory 42, instructions in another memory to implement the methods in the foregoing method embodiments.
  • The mobile device shown in FIG. 10 may be a device, or may be a chip or a chipset; the device, or the chip within the device, has the function of implementing the methods in the foregoing method embodiments.
  • The functions may be implemented by hardware, or by hardware executing corresponding software.
  • The hardware or software includes one or more units corresponding to the above functions.
  • The processor mentioned above may be a central processing unit (CPU), a microprocessor or an application-specific integrated circuit (ASIC), or may be one or more integrated circuits for controlling the program execution of the above aspects or any of their possible designs.
  • the present application also provides a computer storage medium comprising instructions that, when executed on a mobile device, cause the mobile device to perform a corresponding method in the above method embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

一种语音信号的处理方法和移动设备,方法包括:对接收到的编码后的语音信号解码后得到m组低频语音参数;m组低频语音参数为语音信号的m个语音帧的低频语音参数;基于m组低频语音参数确定m个语音帧的类型,并重构m个语音帧对应的低频语音信号;根据n个清音帧的低频语音参数和混合高斯模型算法,得到n个清音帧对应的n个高频语音信号,并根据k个浊音帧的低频语音参数和神经网络算法,得到k个浊音帧对应的k个高频语音信号,n和k的和等于m;对每个语音帧的低频语音信号和高频语音信号进行合成,得到宽带语音信号。降低了噪声引入的概率,保留了原始语音的情感度,可精确的再现原始语音。

Description

语音信号的处理方法和移动设备 技术领域
本申请涉及信号处理技术领域,尤其涉及一种语音信号的处理方法和移动设备。
背景技术
在信息传输中,语音是最直观简洁的通信方式。通常自然语音的带宽在50Hz~8000Hz之间,然而在现代通信***中,由于受传输带宽的限制,语音的频带范围被限制在300Hz~3400Hz之间,300Hz~3400Hz之间的语音信号称为窄带语音信号。语音的主要能量包含在低频语音信号中,而高频信号的缺失使得语音信号的清晰度与自然度在一定程度上受到影响,声色等一些代表说话者特性部分的信息被丢失;如打电话过程中语音失真较为严重,特别是在嘈杂的环境中,失真度往往不被用户接受。随着移动设备对语音质量的要求越来越高,仅仅是能听懂移动设备发出的声音已经远远不满足人们的需求。高清晰度,高保真度的语音信号是各种移动设备的新要求。因此相关研究者越来越多的技术投入到语音的带宽扩展中,以得到宽带语音。
目前语音扩展的方法主要有基于网络映射的方法和基于统计学模型的方法两种。基于网络映射的方法,最终得到的宽带语音中的噪声较大;基于统计学模型的方法,最终得到的宽带语音不能保留原始语音的情感度。
发明内容
本申请提供一种语音信号的处理方法和移动设备,得到的宽带语音噪声小且保留了原始语音的情感度,能够很好的再现原始语音。
第一方面提供一种语音信号的处理方法,包括:
移动设备对接收到的编码后的语音信号解码后得到m组低频语音参数;所述m组低频语音参数为所述语音信号的m个语音帧的低频语音参数,m为大于1的整数;
所述移动设备基于所述m组低频语音参数确定所述m个语音帧的类型,并重构m个语音帧对应的低频语音信号,所述类型包括清音帧或浊音帧;
所述移动设备根据n个清音帧的低频语音参数和混合高斯模型算法,得到所述n个清音帧对应的n个高频语音信号,并根据k个浊音帧的低频语音参数和神经网络算法,得到所述k个浊音帧对应的k个高频语音信号,n和k为大于1的整数,n和k的和等于m;
所述移动设备对m个语音帧中每个语音帧的低频语音信号和高频语音信号进行合成,得到宽带语音信号。
该方案在移动设备侧进行,不改变原有的通信***,只需在移动设备侧设置相应装置或者相应程序即可;根据语音参数区分浊音帧和清音帧,区分准确率高;根据清音帧和浊音帧性质的不同,采用混合高斯模型算法获取清音帧对应的高频语音信号, 降低了噪声引入的概率,采用神经网络算法获取浊音帧对应的高频语音信号,保留了原始语音的情感度,可精确的再现原始语音,提升了用户的听觉感受。
可选地,每组低频语音参数包括:基音周期;或者,子带信号强度;或者,增益值;或者,线谱频率;或者,基音周期,子带信号强度,增益值,或者线谱频率中的至少两个。
在一种可能的设计中,所述移动设备基于所述m组低频语音参数确定所述m个语音帧的类型,包括:
所述移动设备根据所述m组低频语音参数和栈自动编码机(Stacked AutoEncoder,简称SAE)模型,采用SAE算法,得到m个标签,m个标签用于指示m组低频语音参数对应的m个语音帧的类型;
其中,所述SAE模型是所述移动设备或其它移动设备采用所述SAE算法,基于多个第一训练样本训练得到的,每个第一训练样本包括其它语音信号的一个语音帧的低频语音信号对应的低频语音参数。
在一种可能的设计中,所述移动设备根据n个清音帧的低频语音参数和混合高斯模型算法,得到所述n个清音帧对应的n个高频语音信号,包括:
所述移动设备根据n个清音帧的低频语音参数和混合高斯模型算法,得到n个清音帧的高频语音参数;
所述移动设备根据所述n个清音帧的高频语音参数,构建所述n个高频语音信号。
采用混合高斯模型算法预测清音帧的高频语音信号几乎不会引入噪声,提升了用户的听觉感受。
在一种可能的设计中,所述移动设备根据k个浊音帧的低频语音参数和神经网络算法,得到所述k个浊音帧对应的k个高频语音信号,包括:
所述移动设备根据k个浊音帧的低频语音参数和神经网络模型,采用神经网络算法,得到k个浊音帧的高频语音参数;
所述移动设备根据所述k个浊音帧的高频语音参数,构建所述k个高频语音信号;
其中,所述神经网络模型是所述移动设备或其它移动设备采用所述神经网络算法,基于多个第二训练样本训练得到的,一个所述第二训练样本包括一个其它语音信号的h个浊音帧的h组低频语音参数,h为大于1的整数。
采用神经网络算法预测浊音帧的高频语音信号几乎不会引入噪声,且可保留原始语音的情感度。
可选地,所述神经网络算法为长短期记忆(LSTM)神经网络算法,所述神经网络模型为LSTM神经网络模型;
可选地,所述神经网络算法为双向循环神经网络(BRNN)算法,所述神经网络模型为BRNN模型;
可选地,所述神经网络算法为循环神经网络(RNN)算法,所述神经网络模型为RNN模型。
其中,采用BRNN算法可大大提高获取的高频语音信号的准确度,从而可精确的再现原始语音。
第二方面提供一种移动设备,包括:
解码模块,用于对接收到的编码后的语音信号解码后得到m组低频语音参数;所述m组低频语音参数为所述语音信号的m个语音帧的低频语音参数,m为大于1的整数;
处理模块,用于基于所述m组低频语音参数确定所述m个语音帧的类型,并重构m个语音帧对应的低频语音信号,所述类型包括清音帧或浊音帧;
获取模块,用于根据n个清音帧的低频语音参数和混合高斯模型算法,得到所述n个清音帧对应的n个高频语音信号,并根据k个浊音帧的低频语音参数和神经网络算法,得到所述k个浊音帧对应的k个高频语音信号,n和k为大于1的整数,n和k的和等于m;
合成模块,用于对m个语音帧中每个语音帧的低频语音信号和高频语音信号进行合成,得到宽带语音信号。
该方案中,只需在语音处理装置侧设置相关的扩展装置或者扩展程序即可,不改变原有的通信***;根据语音参数区分浊音帧和清音帧,区分准确率高;根据清音帧和浊音帧性质的不同,采用混合高斯模型算法获取清音帧对应的高频语音信号,降低了噪声引入的概率,采用神经网络算法获取浊音帧对应的高频语音信号,保留了原始语音的情感度,可精确的再现原始语音,提升了用户的听觉感受。
可选地,每组低频语音参数包括:基音周期;或者,子带信号强度;或者,增益值;或者,线谱频率;或者,基音周期,子带信号强度,增益值,或者线谱频率中的至少两个。
在一种可能的设计中,所述处理模块,具体用于:
根据所述m组低频语音参数和栈自动编码机(SAE)神经网络模型,采用SAE算法,得到m个标签,m个标签用于指示m组低频语音参数对应的m个语音帧的类型;
其中,所述SAE模型是所述移动设备或其它移动设备采用所述SAE算法,基于多个第一训练样本训练得到的,每个第一训练样本包括其它语音信号的一个语音帧的低频语音信号对应的低频语音参数。
在一种可能的设计中,所述获取模块,具体用于:
根据n个清音帧的低频语音参数和混合高斯模型算法,得到n个清音帧的高频语音参数;
根据所述n个清音帧的高频语音参数,构建所述n个高频语音信号。
在一种可能的设计中,所述获取模块,具体用于:
根据k个浊音帧的低频语音参数和神经网络模型,采用神经网络算法,得到k个浊音帧的高频语音参数;
根据所述k个浊音帧的高频语音参数,构建所述k个高频语音信号;
其中,所述神经网络模型是所述移动设备或其它移动设备采用所述神经网络算法,基于多个第二训练样本训练得到的,一个所述第二训练样本包括一个其它语音信号的h个浊音帧的低频语音参数,h为大于1的整数。
可选地,所述神经网络算法为长短期记忆(LSTM)神经网络算法,所述神经网络模型为LSTM神经网络模型;
可选地,所述神经网络算法为双向循环神经网络(BRNN)算法,所述神经网络模型为BRNN模型;或者,
可选地,所述神经网络算法为循环神经网络(RNN)算法,所述神经网络模型为RNN模型。
第三方面提供一种计算机可读存储介质,计算机可读存储介质上存储有计算机程序,在所述计算机程序被处理器执行时,执行权利要求第一方面以及第一方面任一可 能的设计所述的方法。
第四方面提供一种移动设备,包括处理器;
所述处理器用于与存储器耦合,读取并执行所述存储器中的指令,执行第一方面以及第一方面任一可能的设计所述的方法。
在一种可能的设计中,所述的移动设备,还包括所述存储器。
本申请中的语音信号的处理方法在移动设备侧进行,不改变原有的通信***,只需在移动设备侧设置相应装置或者相应程序即可;根据语音参数区分浊音帧和清音帧,区分准确率高;根据清音帧和浊音帧性质的不同,采用混合高斯模型算法获取清音帧对应的高频语音信号,采用神经网络算法获取浊音帧对应的高频语音信号,降低了噪声引入的概率,且得到宽带语音保留了语音的情感度,可精确的再现原始语音,提升了用户的听觉感受。
附图说明
图1为本申请实施例提供的SAE的结构示意图;
图2为本申请实施例提供的SAE对应的自动编码机示意图;
图3为本申请实施例提供的LSTM神经网络算法示意图;
图4为本申请实施例提供的RNN的结构示意图;
图5为本申请实施例提供的RNN算法的示意图;
图6为本申请实施例提供的BRNN算法的示意图;
图7为本申请实施例提供的***架构图;
图8为本申请实施例提供的语音信号的处理方法的流程图;
图9为本申请实施例提供的移动设备的结构示意图一;
图10为本申请实施例提供的移动设备的结构示意图二。
具体实施方式
首先对本申请涉及的技术名词进行解释。
1、语音:人类的自然语音的带宽一般在50Hz~8000Hz之间,其中,300Hz~3400Hz之间的语音信号称为窄带语音信号。其中,人在发音时,根据声带是否震动可以将语音信号分为清音跟浊音两种。浊音又称有声语言,携带者语言中大部分的能量,浊音在时域上呈现出明显的周期性;而清音类似于白噪声,没有明显的周期性。发浊音时,气流通过声门使声带产生张弛震荡式振动,产生准周期的激励脉冲串,这种声带振动的频率称为基音频率,相应的周期为基音周期。
2、栈自动编码机(Stacked AutoEncoder,简称SAE)算法:
图1为本申请实施例提供的SAE的结构示意图,图2为本申请实施例提供的SAE对应的自动编码机示意图;参见图1和图2,SAE包括一层输入层,2层隐含层,一层输出层;其中,输入层的神经元11的个数等于输入向量的维数加1,其中一个偏置节点12 为1,也就是偏置节点的输入为1,输出层可为softmax分类器层,隐含层神经元21的个数和输出层的神经元的个数根据需要设定。可以理解的是,此处2层隐含层只是示例性的,隐含层的层数可以根据实际数量变更。
SAE算法具体如下:
对应一个样本X=(x 1、x 2、x 3、……、x n-1、x n),n维向量X为输入向量,则输入层100的神经元的个数等于n+1,如图2所示,x n为输入层第n个神经元的输入;初始化输入层的各神经元、偏置节点与第一层隐含层200的各神经元之间的连接权值,组成权值矩阵W 1,以及偏置向量b 1;则第一层隐含层的输出h 1为:
h 1=f(W 1X+b 1)
其中,h 1=(h 1、h 2、h 3、……、h m-1、h m),h m为第一层隐含层第m个神经元的输出,f(x)=1/(1+exp(-x))为非线性激励函数,
Figure PCTCN2018086596-appb-000001
k=n+1,m为第一层隐含层中除了偏置节点的神经元的个数,W km为输出层第k个神经元与第一层隐含层第m个神经元之间的连接权值。
上述过程称为输入向量X的编码过程,接着采用自动编码机进行h 1解码的过程,得到重构的输入向量
Figure PCTCN2018086596-appb-000002
Figure PCTCN2018086596-appb-000003
其中,
Figure PCTCN2018086596-appb-000004
b 2为偏置向量。
定义代价函数:
Figure PCTCN2018086596-appb-000005
按照以下公式更新W 1,和b 1
Figure PCTCN2018086596-appb-000006
Figure PCTCN2018086596-appb-000007
其中,
Figure PCTCN2018086596-appb-000008
为更新后的W 1
Figure PCTCN2018086596-appb-000009
更新后的b 1,α为学习速率。
其次,初始化第一层隐含层200包括的各神经元、偏置节点和第二层隐含层300包括的各神经元之间的连接权值,组成权值矩阵W 3,可参照W 1,根据h 1计算第二层隐含层300的神经元的输出向量h 2
h 2=f(W 3h 1+b 3)
其中,b 3为偏置向量。
上述过程称为h 1的编码过程,接着采用自动编码机进行h 2解码的过程,得到重构
Figure PCTCN2018086596-appb-000010
Figure PCTCN2018086596-appb-000011
其中,
Figure PCTCN2018086596-appb-000012
b 4为偏置向量。
定义代价函数:
Figure PCTCN2018086596-appb-000013
按照以下公式更新W 3,和b 3
Figure PCTCN2018086596-appb-000014
Figure PCTCN2018086596-appb-000015
接着,初始化第二层隐含层300包括的各神经元、偏置节点和输出层400包括的各神经元之间的连接权值,组成权值矩阵W 5,初始化b 5为偏置向量。
上述过程为样本X进行的一次完整的无监督学习的过程。
下面样本X根据
Figure PCTCN2018086596-appb-000016
W 5,采用反向传播(Back Propagation,简称BP)神经网络,对样本X进行一次有监督的学习过程:如下:
Figure PCTCN2018086596-appb-000017
作为输入层100包括的各神经元、偏置节点与第一层隐含层200包括的各神经元之间的连接权值矩阵,
Figure PCTCN2018086596-appb-000018
为输入层100包括的各神经元、偏置节点与第一层隐含层200包括的各神经元对应的偏置向量,计算第一层隐含层200的输出向量H 1
Figure PCTCN2018086596-appb-000019
其中,
Figure PCTCN2018086596-appb-000020
Figure PCTCN2018086596-appb-000021
作为第一层隐含层200包括的各神经元、偏置节点与第二层隐含层300包括的各神经元之间的连接权值矩阵,
Figure PCTCN2018086596-appb-000022
为第一层隐含层200包括的各神经元、偏置节点与第二层隐含层300包括的各神经元对应的偏置向量,计算第二层隐含层300的输出向量H 2
Figure PCTCN2018086596-appb-000023
以W 5作为第二层隐含层300包括的各神经元、偏置节点与输出层400包括的各神经元之间的连接权值矩阵,b 5为第二层隐含层300包括的各神经元、偏置节点与输出层400包括的各神经元对应的偏置向量,采用BP神经网络算法,计算得到输出向量Y。
Y=σ(W 5H 2+b 5)
最后,根据样本X的期望输出
Figure PCTCN2018086596-appb-000024
以及样本X的实际输出Y,采用最小均方误差准则的反向误差传播算法和梯度下降法更新
Figure PCTCN2018086596-appb-000025
Figure PCTCN2018086596-appb-000026
Figure PCTCN2018086596-appb-000027
Figure PCTCN2018086596-appb-000028
W 5
Figure PCTCN2018086596-appb-000029
经过以上所有的步骤,样本X完成了一次完整的基于SAE算法的学习过程。
Figure PCTCN2018086596-appb-000030
作为下一个样本X 1进行无监督学习时对应的初始权值矩阵;下一个训练样本X 1按照与样本X相同的步骤,得到最终更新后的
Figure PCTCN2018086596-appb-000031
完成一次完整的基于SAE的学习过程。
最终更新后的
Figure PCTCN2018086596-appb-000032
作为下一个样本X 2进行无监督学习时对应的初始权值矩阵;下一个训练样本X 2按照与样本X相同的步骤,完成一次完整的基于SAE的学习过程。
也就是每一个样本进行一次完整的基于SAE的学习过程后,输入层100包括的各神经元、偏置节点与第一层隐含层200包括的各神经元之间的连接权值矩阵,第一层隐含层200包括的各神经元、偏置节点对应的偏置向量,第一层隐含层200包括的各神经元、偏置节点与第二层隐含层300包括的各神经元之间的连接权值矩阵,第二层隐含层300包括的各神经元、偏置节点对应的偏置向量,第二层隐含层300包括的各神经元、偏置节点与输出层400包括的各神经元之间的连接权值矩阵,输出层400包括的各神经元对应的偏置向量均被更新,更新后的上述物理量作为下一样本进行无监督学习时对应的初始权值矩阵、初始偏置向量。
综上所述,每一个样本进行一次完整的基于SAE的学习过程后,各层神经元之间的连 接权值以及对应的偏置值均被更新,更新后的值作为下一样本进行无监督学习时对应的初始权值、初始偏置值。
3、长短期记忆(Long Short Term Memory,LSTM)神经网络算法:
LSTM神经网络与SAE一样,包括一层输入层,至少一层隐含层,一层输出层;不同的是LSTM神经网络的输入层和隐含层中没有偏置节点。输入层的神经元的个数等于输入向量的维数,隐含层神经元的个数和输出层的神经元的个数根据需要设定。
LSTM神经网络算法与SAE算法或者BP神经网络算法不相同之处在于,获取隐含层的每个神经元的输出以及输出层的每个神经元的输出的方法。
下面以获取一个神经元S的输出为例来说明LSTM神经网络算法:
图3为本申请实施例提供的LSTM神经网络示意图。
参见图3,X t-1为t-1时刻某一神经元S的输入,h t-1为当输入为X t-1时神经元S的输出,C t-1为与t-1时刻对应的神经元S的状态,X t为t时刻神经元S的输入,h t为当输入为X t时神经元S的输出,C t为t时刻对应的神经元S的状态,X t+1为t+1时刻的神经元S的输入,h t+1为当输入为X t+1时神经元S的输出,C t+1为t+1时刻对应的神经元S的状态。
也就是说在t时刻,神经元S具有三个输入:C t-1,X t,h t-1,对应的输出具有h t、C t-1
在LSTM神经网络算法中,对于LSTM神经网络中某一神经元S来讲在不同的时刻具有不同的输入和输出。对于t时刻,X t是根据上一层各神经元的输出以及上一层各神经元和神经元S之间的连接权值以及对应的偏置向量计算得到的(参照上述对BP神经网络中获取隐含层或者输出层的输出方法的描述,与BP神经网络中获取隐含层或者输出层的输出方法),h t-1也可以称为上一时刻神经元S的输出,C t-1也可以称为上一时刻神经元S的状态,现在需要做的是计算神经元S在t时刻输入X t后的输出h t。可通过公式一至公式六计算:
f t=σ(W f·[h t-1,x t]+b f)               公式一;
i t=σ(W i·[h t-1,x t]+b i)                  公式二;
Figure PCTCN2018086596-appb-000033
Figure PCTCN2018086596-appb-000034
O t=σ(W O·[h t-1,x t]+b O)            公式五;
h t=O t·tanh(C t)            公式六;
其中,f t为遗忘门,W f为遗忘门的权重矩阵,b f为遗忘门的偏置项,σ为sigmoid函数,i t为输入门,W i为输入门的权重矩阵,b i为输入门的偏置项,
Figure PCTCN2018086596-appb-000035
为用于描述当前输入的状态,C t为与t时刻对应的神经元新的状态,O t为输出门,W O为输出门的权重矩阵,b O为输出门的偏置项,h t为神经元S在t时刻对应的最终输出。
通过上述过程,LSTM神经网络算法将关于当前的记忆和长期的记忆组合在一起,形成了新的单元状态C t。由于遗忘门的控制,LSTM神经网络可以保存很久很久之前的信息,由于输入门的控制,它又可以避免当前无关紧要的内容进入记忆;输出门控 制了长期记忆对当前输出的影响。
LSTM神经网络的每个神经元的输出均可按照上述公式一至公式六计算得到。
同样的,LSTM神经网络算法中更新各连接权值和偏置值的方法,也是采用反向误差传播算法和梯度下降法来更新。
可以说,每一个样本进行一次LSTM神经网络算法的学习过程后,各层神经元之间的连接权值、对应的偏置值、遗忘门的权重矩阵、输入门的权重矩阵、输出门的权重矩阵均被更新一次,更新后的值用于学习下一样本。每一个样本包含多个子序列,分别对应一次LSTM学习中输入层不同时刻的输入。
可以理解的是,上述LSTM神经网络算法只是一种经典的LSTM神经网络算法,在该经典的LSTM神经网络算法的基础上,具有很多的变体,分别对应不同的LSTM神经网络算法,本实施例中不再一一赘述。
4、循环神经网络(Recurrent Neural Networks,简称RNN)算法和双向循环神经网络(Bidirections Recurrent Neural Networks,简称BRNN)算法:
图4为本申请实施例提供的RNN的结构示意图,图5为本申请实施例提供的RNN算法的示意图,图6为本申请实施例提供的BRNN算法的示意图。
参见图4,在RNN中隐含层之间的神经元不再是孤立存在的,而是有连接的,且隐含层的输入不仅包括输入层的输出还包括上一时刻隐含层的输出。
参见图5,对应的算法如下:
h t=f(W xhx t+W hhh t-1+b h)
Z t=g(W hzh t+b z)
其中,h t为隐含层在t时刻的输出,h t-1为隐含层在t-1时刻的输出,x t为在t时刻输入层的输入,Z t为在t时刻输出层的输出,W xh为在t时刻输入层的各神经元和隐含层的各神经元之间的连接权值组成的权值矩阵,W hh为t-1时刻的隐含层的输出h t-1作为t时刻隐含层的输入对应的权值矩阵,W hz为在t时刻隐含层的各神经元和输出层的各神经元之间的连接权值组成的权值矩阵,b h为t时刻隐含层对应的偏置向量、b z为t时刻输出层对应的偏置向量。
一个样本对应的输入可称为一个序列,而在RNN算法中,一个样本对应多个子序列,比如子序列x t-1,子序列x t,子序列x t+1;由于隐含层在t-1时刻的输出是根据t-1时刻输入层的输入x t-1得到的,x t与x t-1分别对应不同的子序列,也就是说在RNN算法中,子序列之间存在顺序关系,每个子序列和它之前的子序列存在关联,通过神经网络在时序上展开。
在时域上,各连接权值不变,即一个序列的各子序列共享连接权值,即根据输入x t-1得到的输出Z t-1所使用的连接权值,根据输入x t得到的输出Z t所使用的连接权值,根据输入x t+1得到的输出Z t+1所使用的连接权值,是一致的。
RNN基于误差随时间反向传播算法更新一次学习过程中的各连接权值和偏置值,用于下一个样本的学习过程。
深度循环神经网络就是具有多层隐含层的循环神经网络,其算法可参照上述具有一层隐含层的算法,此处不再赘述。
参见图6,BRNN算法相对于RNN算法的改进之处,在于假设当前的输出不仅仅与前 面的输入有关,还与后面的输入有关。可以理解的是,图6中所示的反向层和正向层并不是指两个隐含层,而是为了表示同一个隐含层需要得到两个输出值,这是BRNN算法与RNN算法的不同之处。
图6中的对应的算法如下:
Figure PCTCN2018086596-appb-000036
Figure PCTCN2018086596-appb-000037
Figure PCTCN2018086596-appb-000038
其中,f、g为激活函数,h t1为在隐含层在t时刻的正时间方向输出,h t2为在隐含层在t时刻的负时间方向输出,h t-1为隐含层在t-1时刻的输出,h t+1为隐含层在t+1时刻的输出;x t为在t时刻输入层的输入;
Figure PCTCN2018086596-appb-000039
为t-1时刻的隐含层的输出h t-1作为t时刻隐含层的输入对应的权值矩阵,
Figure PCTCN2018086596-appb-000040
为在t时刻输入层的各神经元和隐含层的各神经元之间的连接权值组成的第一权值矩阵;
Figure PCTCN2018086596-appb-000041
为t+1时刻的隐含层的输出h t+1作为t时刻隐含层的输入对应的权值矩阵,
Figure PCTCN2018086596-appb-000042
为在t+1时刻输入层的各神经元和隐含层的各神经元之间的连接权值组成的第二权值矩阵;
Figure PCTCN2018086596-appb-000043
为在t时刻隐含层的各神经元和输出层的各神经元之间的连接权值组成的第一权值矩阵,
Figure PCTCN2018086596-appb-000044
为在t时刻隐含层的各神经元和输出层的各神经元之间的连接权值组成的第二权值矩阵,y t为输出层在t时刻的输出。
同样的,在BRNN算法中,一个样本对应的输入可称为一个序列,一个样本对应多个子序列,比如子序列x t-1,子序列x t,子序列x t+1;由于隐含层在t-1时刻的输出h t-1是根据t-1时刻输入层的输入x t-1得到的,隐含层在t+1时刻的输出h t+1是根据t+1时刻输入层的输入x t+1得到的,x t、x t-1、x t+1分别对应不同的子序列,也就是说在BRNN算法中,子序列之间存在顺序关系,每个子序列和它之前的子序列存在关联,也和它之后的子序列存在关联。
在时域上,各连接权值不变,即一个序列的各子序列共享连接权值,即根据输入x t-1得到的输出y t-1所使用的连接权值,根据输入x t得到的输出y t所使用的连接权值,根据输入x t+1得到的输出y t+1所使用的连接权值,是一致的。
深度双向循环神经网络就是具有多层隐含层的循环神经网络,其算法可参照上述具有一层隐含层的算法,此处不再赘述。
5、混合高斯模型
混合高斯模型为多个高斯分布的概率密度函数的组合,一个具有L个混合数的高斯模型可以表示为:
Figure PCTCN2018086596-appb-000045
其中,x表示观察矢量,Θ=(θ 1,θ 2,......,θ L)为参数向量集合,Θ k=(μ k,V k)是高斯分布参数,ρ l为混合高斯模型中每个高斯分量的加权系数,并且加权系数满足:
Figure PCTCN2018086596-appb-000046
G(x,μ l,V l)表示混合高斯模型的第l个混合分量,其通过均值为μ l、协方差 为V l(正定矩阵)的b维多元单一高斯概率密度函数表示:
Figure PCTCN2018086596-appb-000047
上面为本申请实施例涉及到的基础知识和相关算法的说明。下面对本申请实施例的语音信号的处理方法进行说明。
图7为本申请实施例提供的***架构图,参见图7,该***包括移动设备10和网络设备20;
其中,网络设备为具有无线收发功能的设备或可设置于该设备的芯片组及必要的软硬件,该设备包括但不限于:演进型节点B(evolved Node B,eNB)、无线网络控制器(radio network controller,RNC)、节点B(Node B,NB)、基站控制器(base station controller,BSC)、基站收发台(base transceiver station,BTS)、家庭基站(例如,home evolved NodeB,或home Node B,HNB)、基带单元(baseband unit,BBU),无线保真(wireless fidelity,WIFI)***中的接入点(access point,AP)、无线中继节点、无线回传节点、传输点(transmission and reception point,TRP或者transmission point,TP)等,还可以为5G,如,NR,***中的gNB,或,传输点(TRP或TP),5G***中的基站的一个或一组(包括多个天线面板)天线面板,或者,还可以为构成gNB或传输点的网络节点,如基带单元(BBU),或,分布式单元(DU,distributed unit)等。
在一些部署中,gNB可以包括集中式单元(centralized unit,CU)和DU。gNB还可以包括射频单元(radio unit,RU)。CU实现gNB的部分功能,DU实现gNB的部分功能,比如,CU实现无线资源控制(radio resource control,RRC),分组数据汇聚层协议(packet data convergence protocol,PDCP)层的功能,DU实现无线链路控制(radio link control,RLC)、媒体接入控制(media access control,MAC)和物理(physical,PHY)层的功能。由于RRC层的信息最终会变成PHY层的信息,或者,由PHY层的信息转变而来,因而,在这种架构下,高层信令,如RRC层信令或PHCP层信令,也可以认为是由DU发送的,或者,由DU+RU发送的。可以理解的是,网络设备可以为CU节点、或DU节点、或包括CU节点和DU节点的设备。此外,CU可以划分为接入网RAN中的网络设备,也可以将CU划分为核心网CN中的网络设备,在此不做限制。
移动设备也可以称为用户设备(user equipment,UE)、接入终端、用户单元、用户站、移动站、移动台、远方站、远程终端、用户终端、终端、无线通信设备、用户代理或用户装置。本申请涉及的移动设备可以是手机(mobile phone)、平板电脑(Pad)、带无线收发功能的电脑、虚拟现实(virtual reality,VR)设备、增强现实(augmented reality,AR)设备、工业控制(industrial control)中的无线终端、无人驾驶(self driving)中的无线终端、远程医疗(remote medical)中的无线终端、智能电网(smart grid)中的无线终端、运输安全(transportation safety)中的无线终端、智慧城市(smart city)中的无线终端、智慧家庭(smart home)中的无线终端等等。本申请的实施例对应用场景不做限定。本申请中将前述终端设备及可设置于前述终端设备的芯片统称为终端 设备。
在该***中,网络设备20均可以与多个移动设备(例如图中示出的移动设备10)通信。网络设备20可以与类似于移动设备10的任意数目的移动设备进行通信。
应理解,图7仅为便于理解而示例的简化示意图,该通信***中还可以包括其他网络设备或者还可以包括其他移动设备,图7中未予以画出。
下面结合具体的实施例对本申请的语音信号的处理方法进行说明。图8为本申请实施例提供的语音信号的处理方法的流程图,参见图8,本实施例的方法包括:
步骤S101、移动设备对接收到的编码后的语音信号解码后得到m组低频语音参数m组低频语音参数为该语音信号的m个语音帧的低频语音参数,m为大于1的整数;
步骤S102、移动设备基于m组低频语音参数确定m个语音帧的类型,并重构m个语音帧对应的低频语音信号;其中,语音帧的类型包括清音帧或浊音帧;
步骤S103、移动设备根据n个清音帧的低频语音参数和混合高斯模型算法,得到n个清音帧对应的n个高频语音信号,并根据k个浊音帧的低频语音参数和神经网络算法,得到k个浊音帧对应的k个高频语音信号,n和k为大于1的整数,n和k的和等于m;
步骤S104、移动设备对m个语音帧中每个语音帧的低频语音信号和高频语音信号进行合成,得到宽带语音信号。
具体地,由于语音信号具有短时性,即在一个较短的时间间隔内,语音信号保持相对稳定一致,这段时间一般可取为5ms~50ms,因此,对于语音信号的分析必须建立在短时的基础上。也就是说本实施例中涉及的“语音信号”指的是可以分析的较短时间间隔对应的语音信号。
对于步骤S101、移动设备对接收到的编码后的语音信号解码后得到m组低频语音参数;m组低频语音参数为该语音信号的m个语音帧的低频语音参数,m为大于1的整数,可以理解的是,每个语音帧对应一组低频语音参数。
为了便于理解,步骤S101中涉及的语音信号在后续的描述中可称为语音信号a。
对于网络设备,网络设备可采用参数编码的方法,对语音信号a的m个语音帧的m组低频语音参数进行参数编码,得到编码后的语音信号a。
具体地,网络设备可采用混合线性激励预测(Mixed linear incentive prediction,简称MELP)算法提取语音信号a的低频语音参数,下面对MELP算法提取语音信号的低频语音参数的方法进行简要的介绍。
采用MELP算法得到的低频语音参数包括:基音周期;或者,子带信号强度;或者,增益值;或者,线谱频率;或者,基音周期,子带信号强度,增益值,或者线谱频率中的至少两个。
低频语音参数包括基音周期,子带信号强度,增益值,或者线谱频率中的至少两个的含义如下:低频语音参数包括基音周期和子带信号强度;或,基音周期和增益值;或,基音周期和线谱频率;或,子带信号强度和增益值;或,子带信号强度和线谱频率;或,线谱频率和增益值;或,基音周期和子带信号强度和增益值;或,基音周期和子带信号强度和线谱频率;或,增益值和子带信号强度和线谱频率;或,基音周期和增益值和线谱频率;或,基音周期和子带信号强度和增益值和线谱频率。
可选地,本实施例中的低频语音参数包括基音周期和子带信号强度和增益值和线谱频率。
可以理解的是,低频语音参数可以不止包括上述的参数,还可以包括其它的参数。采用不同的参数提取算法,对应得到低频语音参数具有一定的差异。
网络设备采用MELP算法提取低频语音参数时,对语音信号a进行采样,得到数字语音,对数字语音进行高通滤波,去除数字语音中的低频能量,以及可能存在的50Hz工频干扰,比如可采用4阶切比雪夫高通滤波器进行高通滤波,高通滤波后的数字语音作为待处理的语音信号。
以待处理的语音信号对应的N个采样点为一个语音帧,比如,N可为160,帧移为80个采样点,将待处理的语音信号分成m个语音帧,然后提取m个语音帧的低频语音参数。
对于每个语音帧,提取语音帧的低频语音参数:基音周期,子带信号强度,增益值,线谱频率。
可以理解的是,每个语音帧包括低频语音信号和高频语音信号,由于传输带宽的限制,语音频带的范围被限制,在本实施例中,提取的语音帧的低频语音参数是语音帧中的低频语音信号对应的低频语音参数,相应地,本实施例中后续出现的高频语音参数为语音帧中的高频语音信号对应的高频语音参数。低频语音信号与高频语音信号是相对的,可以理解的是,若低频语音信号对应的频率为300Hz~3400Hz,则高频语音信号对应的频率可为3400Hz~8000Hz。
其中,本实施例中的低频语音信号对应的频率范围可为现有技术中的窄带语音信号对应的频率范围,即300Hz~3400Hz,也可为其它频率范围。
对于基音周期的获取:基音周期的获取包括整数基音周期的获取、分数基音周期的获取和最终基站周期的获取。具体算法,参照现有的MELP算法,本实施例中不再赘述。
每个语音帧对应一个基音周期。
对于子带声音强度的获取:可先使用六阶巴特沃兹带通滤波器组将0-4KHz的语音频带(低频语音信号对应的)分成5个固定的频段(0~500Hz,500~1000Hz,1000~2000Hz,2000~3000Hz,3000~4000Hz)。这样的划分只是示例性的,也可以不采用这样的划分。
第一子带(0~500Hz)的子带声音强度为语音帧的分数基音周期对应的归一化自相关值。
对于稳定的语音帧,其余的四个子带的声音强度为自相关函数的最大值;对于不稳定的语音帧,也就是基音周期变化较大的语音帧,采用子带信号包络的自相关函数减去0.1,再做全波整流和平滑滤波,计算归一化的自相关函数值,归一化的自相关函数值作为相应子带的声音强度。
即每个语音帧对应多个子带声音强度,比如5个。
对于增益的获取:每个语音帧可计算2个增益值。计算时使用基音自适应窗长,窗长由以下的方法决定:当Vbp1>0.6时(Vbp1>0.6,说明语音帧为浊音帧),窗长为大于120个采样点的分数基音周期的最小倍数,如果窗长超过320个采样点,则将其除以2;当Vbp1<0.6(Vbp1≤0.6,说明语音帧为清音帧),窗长为120个采样 点。第一个增益G 1窗的中心位于当前语音帧的最后一个采样点之前90个采样点;第二个增益G 2窗的中心位于当前帧的最后一个采样点。增益值为加窗信号S n的均方根值,结果转化为分贝形式为:
Figure PCTCN2018086596-appb-000048
其中,L是窗长,0.01为修正因子。如果增益计算出来的值为负,就将增益的值设为零。
对于线谱频率的获取:用200个采样点长(25ms)的汉明窗对输入语音信号进行加权,然后进行10阶的线性预测分析,窗的中心位于当前帧的最后一个采样点。第一步先采用传统的Levinson-Durbin算法求解线性预测系数a i(i=1,2,……,10);第二步对a i作15Hz的带宽扩展,即第i个预测系数乘以0.94 i(i=1,2,……,10),进行宽带扩展有助于改善共振峰结构和便于线谱频率量化。
MELP算法在得到线性预测系数后,利用Chebyshev多项式递推转换为线谱频率降低了计算复杂度。
每个语音帧对应一个线谱频率,线谱频率为具有多个分量的向量,比如具有12个分量的向量。
综上所述,网络设备采用MELP算法对语音信号的m个语音帧进行低频语音参数提取后,每个语音帧对应得到一组低频语音参数,一组低频语音参数可包括:一个基音周期,多个子带声音强度、两个增益,一个线谱频率向量。
接着,网络设备对语音信号a的m个语音帧的m组低频语音参数进行编码,得到编码后的语音信号a,将编码后的语音信号a发送至移动设备,移动设备对接收到的编码后的语音信号a解码后便会得到m组低频语音参数,每组低频语音参数与语音信号a的一个语音帧的低频语音信号对应。
对于步骤S102、移动设备基于m组低频语音参数确定m个语音帧的类型,并重构m个语音帧对应的低频语音信号;其中,语音帧的类型包括清音帧或浊音帧;
在得到语音信号a对应的m组低频语音参数后,移动设备根据m组低频语音参数,重构m个语音帧对应的低频语音信号。
其中,移动设备根据m组低频语音参数,重构m个语音帧对应的低频语音信号是现有技术中十分成熟的技术,本实施例中不再赘述。
此外,移动设备还基于m组低频语音参数确定m个语音帧的类型,也就是确定每个语音帧为清音帧还是浊音帧。
具体地,移动设备基于m组低频语音参数确定m个语音帧的类型,包括:
移动设备根据m组低频语音参数和栈自动编码机SAE模型,采用SAE算法,得到m个标签,m个标签用于指示m组低频语音参数对应的m个语音帧的类型;
其中,SAE模型是采用SAE算法,基于多个第一训练样本训练得到的,每个第一训练样本包括其它语音信号的一个语音帧的低频语音信号对应的低频语音参数,其它语音信号不同于本实施例中的语音信号a。
其中,SAE模型可为本实施例中的移动设备采用SAE算法,基于多个第一训练样本训练得到的,也可为其它的设备采用SAE算法,基于多个第一训练样本训练得到的, 然后本实施例的移动设备从其它的设备中直接获取训练好的SAE模型。
采用SAE算法,根据语音帧的低频语音参数确定语音帧的类型,相对于现有技术中确定语音帧的类型的方法,准确率可大大的提高。
具体地,对于每组低频语音参数均进行以下的操作,便可得到每个语音帧的类型:
将一组低频语音参数做归一化处理,得到输入向量X,比如,若一组低频语音参数由基音周期,子带信号强度,增益值,线谱频率组成,且包括1个基音周期,5个子带信号强度、2个增益值、包括12个分量的线谱频率向量,则输入向量X的维数为20维,也就是具有20个分量,将输入向量X作为图1所示的SAE的输入,采用如上所述的SAE算法,输出一标签,该标签用于指示语音帧的类型,SAE算法中采用基于多个第一训练样本训练得到的SAE模型。
下面对SAE模型的获取方法进行说明。
a1、获取多个第一训练样本;
a2、获取各第一训练样本各自的标签,标签用于指示第一训练样本对应的语音帧的类型;
a3、根据各第一训练样本包括的低频语音参数,采用SAE算法对所有第一训练样本进行训练,得到SAE模型。
对于a1:获取多个第一训练样本,每个第一训练样本包括其它语音信号的一个语音帧的低频语音信号对应的低频语音参数,可以理解的是,此处的低频语音信号对应的频率范围与网络设备编码的低频语音参数来自的低频语音信号对应的频率范围相同,此处的低频语音参数与网络设备提取的低频语音参数或者移动设备解码得到的低频语音参数的种类相同,且提取方法相同。
比如,语音信号b属于其它的语音信号中的一个语音信号,对于语音信号b的l个语音帧,分别提取l个语音帧的低频语音信号对应的l组低频语音参数,l组低频语音参数中的一组低频语音参数就是一个第一训练样本。
可以理解的是,第一训练样本的数量要足够大,其它的语音信号中可包括多个语音信号,且多个语音信号对应的自然人的数量尽可能的大。
对于a2:根据第一训练样本包括的低频语音参数对应的语音帧的类型,为每个第一训练样本分配一个标签,比如,若第一训练样本1包括的低频语音参数是从清音帧的低频语音信号中提取的,那么第一训练样本1的标签可为0;若第一训练样本2包括的低频语音参数是从浊音帧的低频语音信号中提取的,那么第一训练样本2的标签可为1。
对于a3:对于第一个进行训练的第一训练样本1,将第一训练样本1包括的低频语音参数进行归一化后的向量作为SAE的输入向量,将第一训练样本1的标签作为期望输出,SAE各神经元之间的连接权值和对应的偏置值赋予初始值;采用如上所述的SAE算法,得到第一训练样本1对应的实际输出,根据实际输出和期望输出,采用最小均方误差准则的反向误差传播算法和梯度下降法,调整SAE各神经元之间的连接权值和对应的偏置值,得到更新后的各神经元之间的连接权值和对应的偏置值。
对于第二个进行训练的第一训练样本2,将第一训练样本2包括的低频语音参数进行归一化后的向量作为SAE的输入向量,将第一训练样本2的标签作为期望输出, 此次训练过程或者学习过程,初始采用的SAE各层神经元之间的连接权值和对应的偏置值为第一训练样本1训练完毕后,得到的更新后的各神经元之间的连接权值和对应的偏置值;采用如上所述的SAE算法,得到第一训练样本2对应的实际输出,根据实际输出和期望输出,采用最小均方误差准则的反向误差传播算法和梯度下降法,再次调整SAE各神经元之间的连接权值和对应的偏置值,得到更新后的各神经元之间的连接权值和对应的偏置值。
对于第三个进行训练的第一训练样本3,将第一训练样本3包括的低频语音参数进行归一化后的向量作为SAE的输入向量,将第一训练样本3的标签作为期望输出,此次训练过程或者学习过程,初始采用的SAE各层神经元之间的连接权值和对应的偏置值为第二训练样本2训练完毕后,得到的更新后的各神经元之间的连接权值和对应的偏置值;采用如上所述的SAE算法,得到第一训练样本3对应的实际输出,根据实际输出和期望输出,采用最小均方误差准则的反向误差传播算法和梯度下降法,再次调整SAE各神经元之间的连接权值和对应的偏置值,得到更新后的各神经元之间的连接权值和对应的偏置值。
重复执行上述训练过程,直至误差函数收敛,也就是训练的精度满足要求后,停止训练过程,每个训练样本至少被训练一次。
最后一次训练对应的神经网络以及各层神经元之间的连接权值和对应的偏置值即为SAE模型。
在得到SAE模型后,便可根据SAE模型和移动设备解码得到的m组低频语音参数,采用SAE算法,得到m个标签,m个标签用于指示m组低频语音参数对应的m个语音帧的类型。可以理解的是,若在训练过程中,对于包括的低频语音参数是从浊音帧的低频语音信号提取的这样的第一训练样本,对应的标签为1,则移动设备解码得到的m组低频语音参数中与浊音帧对应的各组低频语音参数,根据SAE模型,采用SAE算法后,得到的标签应该接近1或者为1;同样的,若在训练过程中,对于包括的低频语音参数是从清音帧的低频语音信号提取的这样的第一训练样本,对应的标签为0,则移动设备解码得到的m组低频语音参数中与清音帧对应的各组低频语音参数,根据SAE模型,采用SAE算法后,得到的标签应该接近0或者为0。
对于步骤S103,移动设备根据n个清音帧的低频语音参数和混合高斯模型算法,得到n个清音帧对应的n个高频语音信号,并根据k个浊音帧的低频语音参数和神经网络算法,得到k个浊音帧对应的k个高频语音信号,n和k为大于1的整数,n和k的和等于m。
具体地,由于采用神经网络算法根据清音帧对应的低频语音参数预测清音帧对应的高频语音参数会引入人工噪声,会使得用户听到“哧哧”的噪声,影响了用户的听觉感受,因此,为了使得最终得到的语音信号中不引入人工噪声,本实施例中根据清音帧的低频语音参数,获取清音帧对应的高频语音信号不采用神经网络算法,可采用混合高斯模型算法。而采用神经网络算法根据浊音帧对应的低频语音参数预测浊音帧对应的高频语音参数,几乎不会引入人工噪声且可保留原始语音的情感度,因此,根据浊音帧的低频语音参数,获取浊音帧对应的高频语音信号,可采用神经网络算法。这就是步骤S102中确定语音帧类型的意义所在,也就是说根据清音帧和浊音帧的性质 的不同,采用不同的机器学习算法,可尽可能少的引入工噪声且保留原始语音的情感度,从而实现精确的再现原始语音。
具体地,移动设备根据n个清音帧的低频语音参数和混合高斯模型算法,得到n个清音帧对应的n个高频语音信号,包括:
移动设备根据n个清音帧的低频语音参数和混合高斯模型算法,得到n个清音帧的高频语音参数;
移动设备根据n个清音帧的高频语音参数,构建n个清音帧对应的n个高频语音信号。
其中,混合高斯模型算法参照现有技术中的算法,此处不再赘述。
移动设备根据k个浊音帧的低频语音参数和神经网络算法,得到k个浊音帧对应的k个高频语音信号,包括:
移动设备根据k个浊音帧的低频语音参数和神经网络模型,采用神经网络算法,得到k个浊音帧的高频语音参数;
移动设备根据k个浊音帧的高频语音参数,构建k个浊音帧对应的k个高频语音信号;
其中,神经网络模型是采用神经网络算法,本实施例中的移动设备或其它移动设备基于多个第二训练样本训练得到的,一个第二训练样本包括一个其它语音信号的h个浊音帧的h组低频语音参数,h为大于1的整数;其它语音信号不同于本实施例中的语音信号a。
对于一个其它语音信号而言,h可为该其它语音信号包括的所有浊音帧的数量,也可小于该其它语音信号包括的所有浊音帧的数量。对于不同的语音信号,h的值可不相同。
其中,此处的神经网络算法可为LSTM神经网络算法,神经网络模型为LSTM神经网络模型;或者,
神经网络算法可为BRNN算法,神经网络模型为BRNN模型;或者,
神经网络算法为RNN算法,神经网络模型为RNN模型。
下面以神经网络算法为BRNN算法,神经网络模型为BRNN模型为例,说明移动设备根据k个浊音帧的低频语音参数和神经网络模型,采用神经网络模型,得到k个浊音帧对应的k个高频语音信号的具体过程。
移动设备将解码得到的与k个浊音帧对应的k组频语音参数做归一化处理,得到各自对应的向量,k组频语音参数做归一化处理后得到的多个向量可以称为一个序列,k组频语音参数中的一组低频语音参数做归一化处理后得到的向量可以称为一个子序列。各子序列输入双向循环神经网络的顺序,是按照各子序列各自对应的语音帧的时间顺序输入的,也就是每个子序列对应一个时刻上的输入。
比如,按照浊音帧的时间顺序具有子序列1、子序列2、子序列3,若子序列2对应图6所示的X t,则子序列1对应图6所示的X t-1,子序列3对应图6所示的X t+1
将k组频语音参数做归一化处理后得到的多个向量作为双向循环神经网络的输入,采用如上所述的双向循环神经网络算法,基于双向循环神经网络模型,得到k组低频语音参数中每组低频语音参数对应的输出,每个输出用于指示相应浊音帧的高频语音参数,可转化为高频语音参数,也就是得到k个浊音帧的k组高频语音参数。
比如,按照浊音帧的时间顺序具有子序列1、子序列2、子序列3,若子序列2对应的输出为图6所示的y t,则子序列1对应的输出为图6所示的y t-1,子序列3对应的输出为图6所示的y t+1
在双向循环神经网络算法中,每个子序列共享同一个双向循环神经网络模型,采用双向循环神经网络算法,得到各自对应的输出。
在移动设备根据BRNN模型,采用BRNN算法,得到k个浊音帧的k组高频语音参数后,移动设备根据k个浊音帧的k组高频语音参数,构建k个浊音帧对应的k个高频语音信号。
下面对双向循环神经网络BRNN模型的获取方法进行说明。
b1、获取多个第二训练样本;
b2、获取每个第二训练样本的标签,标签为第二训练样本包括的h组低频语音参数对应的h组高频语音参数;其中,第二训练样本包括的h组低频语音参数和相应标签包括的h组高频语音参数为同一语音信号的语音参数;
b3、根据各第二训练样本和对应的标签,采用双向循环神经网络算法对第二训练样本进行训练,得到双向循环神经网络模型。
对于b1、获取多个第二训练样本,一个第二训练样本包括一个其它语音信号的h个浊音帧的低频语音信号对应的h组低频语音参数,可以理解的是,此处的低频语音信号对应的频率范围与网络设备编码的低频语音参数对应的低频语音信号对应的频率范围相同,此处的低频语音参数与网络设备提取的低频语音参数或者移动设备解码得到的低频语音参数的种类相同。
比如:对于语音信号1,提取语音信号1的h 1个浊音帧的h 1组低频语音参数,得到一个第二训练样本1,也就是说第二训练样本1包括多组低频语音参数,每个浊音帧对应一组低频语音参数。
对于语音信号2,提取语音信号2的h 2个浊音帧的h 2组低频语音参数,得到一个第二训练样本2。
其中,h 1和h 2可相同,可不相同;语音信号1和语音信号2均为其它语音信号中的语音信号。
可以理解的是,第二训练样本的数量要足够大。
对于b2、获取每个第二训练样本的标签;
比如上述的第二训练样本1,提取语音信号1的h 1个浊音帧的高频语音信号对应的h 1组高频语音参数,语音信号1的h 1个浊音帧的h 1组高频语音参数即为第二训练样本1的标签。
比如上述的第二训练样本2,提取语音信号2的h 2个浊音帧的高频语音信号对应的h 2组高频语音参数,语音信号2的h 2个浊音帧的h 2组高频语音参数即为第二训练样本2的标签。
对于b3、对于第一个进行训练的第二训练样本1,将第二训练样本1的h 1组低频语音参数各自归一化后的多个向量作为双向循环神经网络的输入,第二训练样本1的多组低频语音参数各自归一化后的多个向量可以称为一个序列,h 1组低频语音参数中的每组低频语音参数归一化后的向量可以称为子序列,各子序列输入双向循环神经网 络的顺序,是按照各子序列各自对应的语音帧的时间顺序输入的,也就是每个子序列对应一个时刻上的输入。
比如,第二训练样本1按照语音帧的时间顺序具有子序列1、子序列2、子序列3,若子序列2对应图6所示的X t,则子序列1对应图6所示的X t-1,子序列3对应图6所示的X t+1
将第二训练样本1的标签归一化后的向量作为期望输出;
双向循环神经网络涉及的各连接权值以及偏置值赋予初始值,所有的子序列共享连接权值和偏置值;
根据上述的输入、各连接权值以及偏置值,采用双向循环神经网络算法,得到第二训练样本1的实际输出;可以理解的是,每一个子序列对应一个输出,所有子序列的输出组成第二训练样本1的实际输出;
比如,第二训练样本1按照语音帧的时间顺序具有子序列1、子序列2、子序列3,若子序列2对应的输出为图6所示的y t,则子序列1对应的输出为图6所示的y t-1,子序列3对应的输出为图6所示的y t+1
对实际输出和期望输出进行处理后,根据处理结果调整初始的各连接权值以及偏置值,得到调整后的各连接权值以及偏置值。
对于第二个进行训练的第二训练样本2,将第二训练样本2的h 2组低频语音参数各自归一化后的向量作为双向循环神经网络的输入;
将第二训练样本2的标签归一化后的向量作为期望输出;
此次训练过程涉及的各连接权值以及偏置值采用第二训练样本1训练完毕后得到的调整后的各连接权值以及偏置值;
根据上述的输入、此次训练过程涉及的各连接权值以及偏置值,采用双向循环神经网络算法,得到第二训练样本2的实际输出;
对实际输出和期望输出进行处理后,根据处理结果调整此次训练过程涉及的各连接权值以及偏置值,得到调整后的各连接权值以及偏置值。
对于第三个进行训练的第二训练样本3,将第二训练样本3的h 3组低频语音参数各自归一化后的向量作为双向循环神经网络的输入;
将第二训练样本3的标签归一化后的向量作为期望输出;
此次训练过程涉及的各连接权值以及偏置值采用第二训练样本2训练完毕后得到的调整后的各连接权值以及偏置值;
根据上述的输入、此次训练过程涉及的各连接权值以及偏置值,采用双向循环神经网络算法,得到第二训练样本3的实际输出;
对实际输出和期望输出进行处理后,根据处理结果调整此次训练过程涉及的各连接权值以及偏置值,得到调整后的各连接权值以及偏置值。
重复执行上述训练过程,直至达到预设的训练精度或者达到预设的训练次数,停止训练过程,每个训练样本至少被训练一次。
最后一次训练对应的双向循环神经网络以及各连接权值和偏置值即为BRNN模型。
其中,采用双向循环网络算法得到浊音帧对应的高频语音参数具有如下的有益效果:
如上对双向循环神经网络算法的介绍,可知对于t时刻的输入x t,其经过双向循环神经网络后对应输出y t可通过如下公式得到:
Figure PCTCN2018086596-appb-000049
Figure PCTCN2018086596-appb-000050
Figure PCTCN2018086596-appb-000051
可知,y t不仅与t-1时刻的输入x t-1相关(h t-1是通过x t-1得到的),还与t+1时刻的输入x t+1相关(h t+1是通过x t+1得到的)。根据前述的介绍可知,当x t对应本申请实施例中的浊音帧a的一组低频语音参数时,其输出y t对应浊音帧a的一组高频语音参数,则x t-1对应本申请实施例中的浊音帧a的前一个浊音帧b的一组低频语音参数,x t+1对应本申请实施例中的浊音帧a的后一个浊音帧c的一组低频语音参数,也就是说当采用双向循环神经网络算法根据低频语音参数预测高频语音参数时,其不仅考虑了浊音帧a的前一个浊音帧b,也考虑了浊音帧a的后一个浊音帧c,结合语音的语义前后连贯性(即当前的语音信号不仅与上一帧语音信号相关,也与下一帧语音信号相关),可知,预测浊音帧a的高频语音参数时同时考虑其前后的浊音帧的信息,可提高对高频语音参数预测的准确度,即可提高通过低频语音信号预测高频语音信号的准确度。
综上所述,采用双向循环网络算法得到浊音帧对应的高频语音参数,可提高通过浊音帧的低频语音信号预测相应帧的高频语音信号的准确度。
通过上述步骤,移动设备得到了语音信号a的m个语音帧的m组高频语音信号和m组低频语音信号。
对于步骤S104、移动设备对m个语音帧中每个语音帧的低频语音信号和高频语音信号进行合成,得到宽带语音信号。
移动设备在将对m个语音帧中每个语音帧的低频语音信号和高频语音信号进行合成后,变得到了完整的宽带语音。
本实施例中的语音信号的处理方法在移动设备侧进行,不改变原有的通信***,只需在移动设备侧设置相关的扩展装置或者扩展程序即可;根据语音参数区分浊音帧和清音帧,区分准确率高;根据清音帧和浊音帧性质的不同,采用混合高斯模型算法获取清音帧对应的高频语音信号,降低了噪声引入的概率,采用神经网络算法获取浊音帧对应的高频语音信号,保留了原始语音的情感度,从而可精确的再现原始语音,提升了用户的听觉感受。
应理解,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
上述针对移动设备所实现的功能,对本申请实施例提供的方案进行了介绍。可以理解的是,设备为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。结合本申请中所公开的实施例描述的各示例的单元及算法步骤,本申请实施例能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。本领域技术人员可以对每个特定的应用来使用不同的方法来实现所描述的功能,但是这种实现不应认为超出本申请实施例的技术方案的范围。
本申请实施例可以根据上述方法示例对移动设备进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。需要说明的是,本申请实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个***,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
图9为本申请实施例提供的移动设备的结构示意图;参见图9,本实施例的移动设备包括:解码模块31、处理模块32、获取模块33和合成模块34
解码模块31,用于对接收到的编码后的语音信号解码后得到m组低频语音参数;所述m组低频语音参数为所述语音信号的m个语音帧的低频语音参数,m为大于1的整数;
处理模块32,用于基于所述m组低频语音参数确定所述m个语音帧的类型,并重构m个语音帧对应的低频语音信号,所述类型包括清音帧或浊音帧;
获取模块33,用于根据n个清音帧的低频语音参数和混合高斯模型算法,得到所述n个清音帧对应的n个高频语音信号,并根据k个浊音帧的低频语音参数和神经网络算法,得到所述k个浊音帧对应的k个高频语音信号,n和k为大于1的整数,n和k的和等于m;
合成模块34,用于所述移动设备对m个语音帧中每个语音帧的低频语音信号和高频语音信号进行合成,得到宽带语音信号。
可选地,每组低频语音参数包括:基音周期;或者,子带信号强度;或者,增益值;或者,线谱频率。
本实施例的移动设备,可以用于执行上述方法实施例的技术方案,其实现原理和技术效果类似,此处不再赘述。
在一种可能的设计中,所述处理模块32,具体用于:
根据所述m组低频语音参数和栈自动编码机SAE模型,采用SAE算法,得到m个标签,m个标签用于指示m组低频语音参数对应的m个语音帧的类型;
其中,所述SAE模型是所述移动设备或其它移动设备采用所述SAE算法,基于多个第一训练样本训练得到的,每个第一训练样本包括其它语音信号的一个语音帧的低频语音信号对应的低频语音参数。
在一种可能的设计中,所述获取模块33,具体用于:
根据n个清音帧的低频语音参数和混合高斯模型算法,得到n个清音帧的高频语音参数;
根据所述n个清音帧的高频语音参数,构建所述n个高频语音信号。
在一种可能的设计中,所述获取模块33,具体用于:
根据k个浊音帧的低频语音参数和神经网络模型,采用神经网络算法,得到k个浊音帧的高频语音参数;
根据所述k个浊音帧的高频语音参数,构建所述k个高频语音信号;
其中,所述神经网络模型是所述移动设备或其它移动设备采用所述神经网络算法, 基于多个第二训练样本训练得到的,一个所述第二训练样本包括一个其它语音信号的h个浊音帧的低频语音参数,h为大于1的整数。
可选地,所述神经网络算法为长短期记忆(LSTM)神经网络算法,所述神经网络模型为LSTM神经网络模型;
可选地,所述神经网络算法为双向循环神经网络(BRNN)算法,所述神经网络模型为BRNN模型;或者,
可选地,所述神经网络算法为循环神经网络(RNN)算法,所述神经网络模型为RNN模型。
本实施例的移动设备,可以用于执行上述方法实施例的技术方案,其实现原理和技术效果类似,此处不再赘述。
图10为本申请实施例提供的移动设备的结构示意图二,包括处理器41、存储器42、通信总线43,处理器41用于读取并执行存储器42中的指令以实现上述方法实施例中的方法,或者,处理器41用于通过存储器42读取并调用另一个存储器中的指令以实现上述方法实施例中的方法。
图10所示的移动设备可以是一个设备,也可以是一个芯片或芯片组,设备或设备内的芯片具有实现上述方法实施例中的方法的功能。所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的单元。
上述提到的处理器可以是一个中央处理器(central processing unit,CPU)、微处理器或专用集成电路(application specific integrated circuit,ASIC),也可以是一个或多个用于控制上述各方面或其任意可能的设计的上行信息的传输方法的程序执行的集成电路。
本申请还提供一种计算机存储介质,包括指令,当所述指令在移动设备上运行时,使得移动设备执行上述方法实施例中相应的方法。
以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以所述权利要求的保护范围为准。

Claims (12)

  1. 一种语音信号的处理方法,其特征在于,包括:
    移动设备对接收到的编码后的语音信号解码后得到m组低频语音参数;所述m组低频语音参数为所述语音信号的m个语音帧的低频语音参数,m为大于1的整数;
    所述移动设备基于所述m组低频语音参数确定所述m个语音帧的类型,并重构所述m个语音帧对应的低频语音信号,所述类型包括清音帧或浊音帧;
    所述移动设备根据n个清音帧的低频语音参数和混合高斯模型算法,得到所述n个清音帧对应的n个高频语音信号,并根据k个浊音帧的低频语音参数和神经网络算法,得到所述k个浊音帧对应的k个高频语音信号,n和k为大于1的整数,n和k的和等于m;
    所述移动设备对所述m个语音帧中每个语音帧的低频语音信号和高频语音信号进行合成,得到宽带语音信号。
  2. 根据权利要求1所述的方法,其特征在于,所述移动设备基于所述m组低频语音参数确定所述m个语音帧的类型,包括:
    所述移动设备根据所述m组低频语音参数和栈自动编码机SAE模型,采用SAE算法,得到m个标签,所述m个标签用于指示所述m组低频语音参数对应的所述m个语音帧的类型;
    其中,所述SAE模型是所述移动设备或其它移动设备采用所述SAE算法,基于多个第一训练样本训练得到的,每个第一训练样本包括其它语音信号的一个语音帧的低频语音信号对应的低频语音参数。
  3. 根据权利要求1或2所述的方法,其特征在于,所述移动设备根据n个清音帧的低频语音参数和混合高斯模型算法,得到所述n个清音帧对应的n个高频语音信号,包括:
    所述移动设备根据n个清音帧的低频语音参数和混合高斯模型算法,得到所述n个清音帧的高频语音参数;
    所述移动设备根据所述n个清音帧的高频语音参数,构建所述n个高频语音信号。
  4. 根据权利要求1或2所述的方法,其特征在于,所述移动设备根据k个浊音帧的低频语音参数和神经网络算法,得到所述k个浊音帧对应的k个高频语音信号,包括:
    所述移动设备根据k个浊音帧的低频语音参数和神经网络模型,采用神经网络算法,得到所述k个浊音帧的高频语音参数;
    所述移动设备根据所述k个浊音帧的高频语音参数,构建所述k个高频语音信号;
    其中,所述神经网络模型是所述移动设备或其它移动设备采用所述神经网络算法,基于多个第二训练样本训练得到的,一个所述第二训练样本包括一个其它语音信号的h个浊音帧的h组低频语音参数,h为大于1的整数。
  5. 根据权利要求4所述的方法,其特征在于,所述神经网络算法为长短期记忆LSTM神经网络算法,所述神经网络模型为LSTM神经网络模型;或者,
    所述神经网络算法为双向循环神经网络BRNN算法,所述神经网络模型为BRNN模型;或者,
    所述神经网络算法为循环神经网络RNN算法,所述神经网络模型为RNN模型。
  6. 一种移动设备,其特征在于,包括:
    解码模块,用于对接收到的编码后的语音信号解码后得到m组低频语音参数;所述m组低频语音参数为所述语音信号的m个语音帧的低频语音参数,m为大于1的整数;
    处理模块,用于基于所述m组低频语音参数确定所述m个语音帧的类型,并重构m个语音帧对应的低频语音信号,所述类型包括清音帧或浊音帧;
    获取模块,用于根据n个清音帧的低频语音参数和混合高斯模型算法,得到所述n个清音帧对应的n个高频语音信号,并根据k个浊音帧的低频语音参数和神经网络算法,得到所述k个浊音帧对应的k个高频语音信号,n和k为大于1的整数,n和k的和等于m;
    合成模块,用于对m个语音帧中每个语音帧的低频语音信号和高频语音信号进行合成,得到宽带语音信号。
  7. 根据权利要求6所述移动设备,其特征在于,所述处理模块,具体用于:
    根据所述m组低频语音参数和栈自动编码机SAE模型,采用SAE算法,得到m个标签,m个标签用于指示m组低频语音参数对应的m个语音帧的类型;
    其中,所述SAE模型是所述移动设备或其它移动设备采用所述SAE算法,基于多个第一训练样本训练得到的,每个第一训练样本包括其它语音信号的一个语音帧的低频语音信号对应的低频语音参数。
  8. 根据权利要求6或7所述移动设备,其特征在于,所述获取模块,具体用于:
    根据n个清音帧的低频语音参数和混合高斯模型算法,得到所述n个清音帧的高频语音参数;
    根据所述n个清音帧的高频语音参数,构建所述n个高频语音信号。
  9. 根据权利要求6或7所述移动设备,其特征在于,所述获取模块,具体用于:
    根据k个浊音帧的低频语音参数和神经网络模型,采用神经网络算法,得到所述k个浊音帧的高频语音参数;
    根据所述k个浊音帧的高频语音参数,构建所述k个高频语音信号;
    其中,所述神经网络模型是所述移动设备或其它移动设备采用所述神经网络算法,基于多个第二训练样本训练得到的,一个所述第二训练样本包括一个其它语音信号的h个浊音帧的低频语音参数,h为大于1的整数。
  10. 根据权利要求9所述移动设备,其特征在于,所述神经网络算法为长短期记忆LSTM神经网络算法,所述神经网络模型为LSTM神经网络模型;或者,
    所述神经网络算法为双向循环神经网络BRNN算法,所述神经网络模型为BRNN模型;或者,
    所述神经网络算法为循环神经网络RNN算法,所述神经网络模型为RNN模型。
  11. 一种计算机可读存储介质,其特征在于,计算机可读存储介质上存储有计算机程序,在所述计算机程序被处理器执行时,执行权利要求1至5中任一项所述的方法。
  12. 一种移动设备,其特征在于,包括处理器和存储器;
    所述处理器用于与所述存储器耦合,读取并执行所述存储器中的指令,以实现如权1-5任一所述的方法。
PCT/CN2018/086596 2018-05-11 2018-05-11 语音信号的处理方法和移动设备 WO2019213965A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/086596 WO2019213965A1 (zh) 2018-05-11 2018-05-11 语音信号的处理方法和移动设备
CN201880092454.2A CN112005300B (zh) 2018-05-11 2018-05-11 语音信号的处理方法和移动设备

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/086596 WO2019213965A1 (zh) 2018-05-11 2018-05-11 语音信号的处理方法和移动设备

Publications (1)

Publication Number Publication Date
WO2019213965A1 true WO2019213965A1 (zh) 2019-11-14

Family

ID=68466641

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/086596 WO2019213965A1 (zh) 2018-05-11 2018-05-11 语音信号的处理方法和移动设备

Country Status (2)

Country Link
CN (1) CN112005300B (zh)
WO (1) WO2019213965A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111415674A (zh) * 2020-05-07 2020-07-14 北京声智科技有限公司 语音降噪方法及电子设备
CN111710327A (zh) * 2020-06-12 2020-09-25 百度在线网络技术(北京)有限公司 用于模型训练和声音数据处理的方法、装置、设备和介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992167A (zh) * 2021-02-08 2021-06-18 歌尔科技有限公司 音频信号的处理方法、装置及电子设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996640A (zh) * 2009-08-31 2011-03-30 华为技术有限公司 频带扩展方法及装置
CN103026408A (zh) * 2010-07-19 2013-04-03 华为技术有限公司 音频信号产生装置
US20130151255A1 (en) * 2011-12-07 2013-06-13 Gwangju Institute Of Science And Technology Method and device for extending bandwidth of speech signal
CN104517610A (zh) * 2013-09-26 2015-04-15 华为技术有限公司 频带扩展的方法及装置
CN104637489A (zh) * 2015-01-21 2015-05-20 华为技术有限公司 声音信号处理的方法和装置
US20170194013A1 (en) * 2016-01-06 2017-07-06 JVC Kenwood Corporation Band expander, reception device, band expanding method for expanding signal band



Also Published As

Publication number Publication date
CN112005300B (zh) 2024-04-09
CN112005300A (zh) 2020-11-27

Similar Documents

Publication Publication Date Title
CN107358966B (zh) 基于深度学习语音增强的无参考语音质量客观评估方法
US20220172708A1 (en) Speech separation model training method and apparatus, storage medium and computer device
WO2020042707A1 (zh) 一种基于卷积递归神经网络的单通道实时降噪方法
JP7258182B2 (ja) 音声処理方法、装置、電子機器及びコンピュータプログラム
WO2020177371A1 (zh) 一种用于数字助听器的环境自适应神经网络降噪方法、***及存储介质
CN110867181B (zh) 基于scnn和tcnn联合估计的多目标语音增强方法
CN108447495B (zh) 一种基于综合特征集的深度学习语音增强方法
CN1750124B (zh) 带限音频信号的带宽扩展
US20130024191A1 (en) Audio communication device, method for outputting an audio signal, and communication system
CN106782497B (zh) 一种基于便携式智能终端的智能语音降噪算法
CN110085245B (zh) 一种基于声学特征转换的语音清晰度增强方法
WO2019213965A1 (zh) 语音信号的处理方法和移动设备
EP1995723A1 (en) Neuroevolution training system
JP2022547525A (ja) 音声信号を生成するためのシステム及び方法
CN114338623B (zh) 音频的处理方法、装置、设备及介质
US6701291B2 (en) Automatic speech recognition with psychoacoustically-based feature extraction, using easily-tunable single-shape filters along logarithmic-frequency axis
CN110970044B (zh) 一种面向语音识别的语音增强方法
CN114203154A (zh) 语音风格迁移模型的训练、语音风格迁移方法及装置
WO2022213825A1 (zh) 基于神经网络的端到端语音增强方法、装置
JP2006521576A (ja) 基本周波数情報を分析する方法、ならびに、この分析方法を実装した音声変換方法及びシステム
CN103971697B (zh) 基于非局部均值滤波的语音增强方法
CN114708876B (zh) 音频处理方法、装置、电子设备及存储介质
Mamun et al. CFTNet: Complex-valued frequency transformation network for speech enhancement
CN114783455A (zh) 用于语音降噪的方法、装置、电子设备和计算机可读介质
Shifas et al. End-to-end neural based modification of noisy speech for speech-in-noise intelligibility improvement

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18917872

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18917872

Country of ref document: EP

Kind code of ref document: A1