CN117649848A - Speech signal processing apparatus and method - Google Patents


Info

Publication number
CN117649848A
Authority
CN
China
Prior art keywords
voice
voice signal
features
frame
signal
Prior art date
Legal status
Pending
Application number
CN202211632447.1A
Other languages
Chinese (zh)
Inventor
翟正元
杨善松
刘韶
付爱国
成刚
Current Assignee
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Application filed by Hisense Visual Technology Co Ltd
Priority to CN202211632447.1A
Publication of CN117649848A


Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the application provides a voice signal processing apparatus and method, and relates to the technical field of voice processing. The voice signal processing apparatus comprises: a detector configured to acquire a voice signal; and a controller configured to: extract semantic features and emotional features of each audio frame of the voice signal, perform multi-modal bilinear pooling on the semantic features and emotional features of each voice frame of the voice signal to obtain fusion features of each voice frame, and finally merge the fusion features of each voice frame of the voice signal to obtain the acoustic features of the voice signal.

Description

Speech signal processing apparatus and method
Technical Field
The embodiments of the application relate to the technical field of voice processing, and more particularly to a voice signal processing apparatus and method.
Background
At present, in the field of speech recognition, with the development of deep learning speech recognition architectures based on the Transformer (a neural network) model, prosodic information and spectral information of speech can be extracted by converting and analyzing the original speech signal, so that speech features carrying semantics and emotion are obtained from the speech signal. However, because the descriptive capability of low-level speech descriptors or statistics is limited, good system performance cannot be guaranteed when such speech features are extracted: global speech information is not captured well, and the expressive power of the extracted speech features is insufficient.
Disclosure of Invention
The exemplary embodiments of the application provide a method for processing a voice signal, which is used for obtaining richer and more expressive voice features.
The technical scheme provided by the embodiment of the application is as follows:
in a first aspect, an embodiment of the present application provides a processing apparatus for a speech signal, including:
a detector configured to acquire a voice signal;
a controller configured to:
extracting semantic features and emotional features of each audio frame of the voice signal;
carrying out multi-mode bilinear pooling on semantic features and emotion features of each voice frame of the voice signal to obtain fusion features of each voice frame of the voice signal;
and merging the fusion characteristics of each voice frame of the voice signal to obtain the acoustic characteristics of the voice signal.
In a second aspect, an embodiment of the present application provides a method for processing a voice signal, including:
acquiring a voice signal;
extracting semantic features and emotional features of each audio frame of the voice signal;
carrying out multi-mode bilinear pooling on semantic features and emotion features of each voice frame of the voice signal to obtain fusion features of each voice frame of the voice signal;
And merging the fusion characteristics of each voice frame of the voice signal to obtain the acoustic characteristics of the voice signal.
In a third aspect, embodiments of the present application provide a computer readable storage medium in which a computer program is stored, where the computer program, when executed by a computing device, causes the computing device to implement a method for processing a speech signal according to any one of the embodiments of the second aspect.
In a fourth aspect, the present application provides a computer program product for, when run on a computer, causing the computer to implement a method of processing a speech signal as shown in the second aspect.
As can be seen from the above technical solutions, in the voice signal processing method provided in the embodiments of the present application, a voice signal is first obtained through a detector; a controller then extracts semantic features and emotional features of each audio frame of the voice signal, performs multi-modal bilinear pooling on the semantic features and emotional features of each voice frame of the voice signal to obtain fusion features of each voice frame, and finally merges the fusion features of each voice frame of the voice signal to obtain the acoustic features of the voice signal. Compared with the prior art, in which the global acoustic features extracted from the voice signal lack expressive power, the technical scheme of the application fully fuses and merges the semantic features and the emotional features of the voice frames, so that local detail information and global semantic information are fully fused, and richer semantic content and emotional content are obtained from the voice signal.
Drawings
In order to more clearly illustrate the embodiments of the present application or the implementations in the related art, the drawings required for the description of the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and other drawings may be obtained from these drawings by those of ordinary skill in the art.
FIG. 1 illustrates a scene architecture diagram of a method of processing a speech signal in some embodiments;
FIG. 2 illustrates a hardware configuration block diagram of a control device in some embodiments;
FIG. 3 illustrates a hardware configuration block diagram of the processing of speech signals in some embodiments;
FIG. 4 illustrates a software configuration diagram in the processing of a speech signal in some embodiments;
FIG. 5 illustrates a flow chart of steps of a method of processing a speech signal in some embodiments;
FIG. 6 is a flow chart illustrating steps of a method of processing a speech signal in other embodiments;
FIG. 7 is a flow chart showing steps of a method of processing a speech signal in other embodiments;
FIG. 8 is a flow chart illustrating steps of a method of processing a speech signal in other embodiments;
FIG. 9 illustrates a schematic diagram of a driving model in some embodiments;
FIG. 10 illustrates a schematic diagram of a driving model in some embodiments;
FIG. 11 shows a schematic diagram of a driving model in other embodiments;
FIG. 12 illustrates a schematic diagram of a driving model in some embodiments;
FIG. 13 illustrates a schematic diagram of a method of processing a speech signal in some embodiments;
fig. 14 illustrates a block diagram of a method of processing a speech signal in some embodiments.
Detailed Description
For purposes of clarity and implementation of the present application, exemplary implementations of the present application will be described below clearly and completely with reference to the accompanying drawings, in which exemplary implementations of the present application are illustrated. It is apparent that the described exemplary implementations are only some, but not all, of the examples of the present application.
It should be noted that the brief description of the terms in the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
Fig. 1 is a schematic view of a scenario architecture of a method for processing a speech signal according to an embodiment of the present application. As shown in fig. 1, a scenario architecture provided in an embodiment of the present application includes: a server 400 and a processing device 200 for speech signals.
The processing device 200 for voice signals provided in this embodiment of the present application may have various implementation forms, for example, an intelligent speaker, a television, a refrigerator, a washing machine, an air conditioner, an intelligent curtain, a router, a set top box, a mobile phone, a personal computer (PC), a smart television, a laser projection device, a display (monitor), an electronic whiteboard (electronic bulletin board), a wearable device, an in-vehicle device, an electronic desktop (electronic table), and so on.
In some embodiments, a user may operate the voice signal processing device 200 through the smart device 300 or the control apparatus 100. When the processing device 200 for a voice signal receives an audio signal, it may perform data communication with the server 400, where the processing device 200 for a voice signal may establish a communication connection with the server 400 through a local area network (LAN) or a wireless local area network (WLAN).
The server 400 may be a server providing various services, such as a server providing support for audio data collected by the terminal device 200. The server may perform analysis and other processing on the received data such as audio, and feed back the processing result (e.g., endpoint information) to the terminal device. The server 400 may be a server cluster, or may be a plurality of server clusters, and may include one or more types of servers.
The processing device 200 for speech signals may be hardware or software. When the processing device 200 for voice signals is hardware, it may be various electronic devices with sound collection functions, including but not limited to a smart speaker, a smart phone, a television, a tablet computer, an electronic book reader, a smart watch, a player, a computer, an AI device, a robot, a smart vehicle, etc. When the processing apparatus 200 for voice signals is software, it may be installed in the above-listed electronic apparatus. Which may be implemented as a plurality of software or software modules (e.g. for providing sound collection services) or as a single software or software module. The present invention is not particularly limited herein.
It should be noted that, the method for processing a voice signal provided in the embodiment of the present application may be executed by the server 400, may also be executed by the processing device 200 for a voice signal, or may be executed by both the server 400 and the processing device 200 for a voice signal, which is not limited in this application.
Fig. 2 shows a hardware configuration block diagram of a processing apparatus 200 for a voice signal in accordance with an exemplary embodiment. The processing apparatus 200 for voice signals as shown in fig. 2 includes at least one of a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface 280. The controller includes a central processing unit, an audio processor, a RAM, a ROM, and first to nth interfaces for input/output.
The communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example, the communicator may include at least one of a Wifi module, a Bluetooth module, a wired Ethernet module, another network communication protocol chip or near field communication protocol chip, and an infrared receiver. The processing apparatus 200 of the voice signal can transmit and receive control signals and data signals with the server 400 through the communicator 220.
The user interface 280 may be used to receive external control signals.
The detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for capturing the intensity of ambient light; alternatively, the detector 230 includes an image collector such as a camera, which may be used to collect external environmental scenes, user attributes, or user interaction gestures, or alternatively, the detector 230 includes a sound collector such as a microphone, or the like, which is used to receive external sounds.
The sound collector may be a microphone, also called "microphone", which may be used to receive the sound of a user and to convert the sound signal into an electrical signal. The processing device 200 for speech signals may be provided with at least one microphone. In other embodiments, the voice signal processing apparatus 200 may be provided with two microphones, and may implement a noise reduction function in addition to collecting the voice signal. In other embodiments, the voice signal processing device 200 may further be provided with three, four or more microphones to collect voice signals, reduce noise, identify a source of voice, implement a directional recording function, and the like.
Further, the microphone may be built in the processing apparatus 200 for voice signals, or the microphone may be connected to the processing apparatus 200 for voice signals by a wired or wireless means. Of course, the position of the microphone on the processing device 200 for voice signals is not limited in the embodiment of the present application. Alternatively, the processing apparatus 200 for voice signals may not include a microphone, i.e., the microphone is not provided in the processing apparatus 200 for voice signals. The processing device 200 for voice signals may be coupled to a microphone (also referred to as a microphone) via an interface, such as the USB interface 130. The external microphone may be secured to the speech signal processing device 200 by external fasteners such as a camera mount with a clip.
The controller 250 controls the operation of the voice signal processing device 200 and responds to the user's operations through various software control programs stored in the memory. The controller 250 controls the overall operation of the processing apparatus 200 for voice signals.
In some embodiments the controller includes at least one of a central processing unit (Central Processing Unit, CPU), a video processor, an audio processor, a RAM (Random Access Memory), a ROM (Read-Only Memory), first to nth interfaces for input/output, a communication bus (Bus), and the like.
In some examples, the operating system of the smart device is an Android system, and as shown in fig. 3, the processing device 200 for voice signals may be logically divided into an application layer 21, a kernel layer 22 and a hardware layer 23.
Wherein, as shown in fig. 3, the hardware layers may include the controller 250, the communicator 220, the detector 230, etc. shown in fig. 2. The application layer 21 includes one or more applications. The application may be a system application or a third party application. For example, the application layer 21 includes a voice recognition application that can provide a processing interface and services for voice signals, and the processing device 200 for voice signals is connected to the server 400.
The kernel layer 22 acts as software middleware between the hardware layer and the application layer 21 for managing and controlling hardware and software resources.
In some examples, the kernel layer 22 includes a detector driver for sending voice data collected by the detector 230 to a voice recognition application. Illustratively, the voice recognition application in the voice signal processing apparatus 200 is started, and in the case where the voice signal processing apparatus 200 establishes a communication connection with the server 400, the detector driver is configured to send the voice data input by the user and collected by the detector 230 to the voice recognition application. The speech recognition application then sends the query information containing the speech data to the intent recognition module 102 in the server. The intention recognition module 102 is used to input voice data transmitted by the processing device 200 of the voice signal to the intention recognition model.
In order to clearly illustrate the embodiments of the present application, a voice recognition network architecture provided in the embodiments of the present application is described below with reference to fig. 4.
Referring to fig. 4, fig. 4 is a schematic diagram of a processing network architecture of a voice signal according to an embodiment of the present application. In fig. 4, a processing device for a voice signal is used to receive input information and output a processing result of the information. The voice recognition module is provided with a voice recognition service for recognizing the audio as a text; the semantic understanding module is provided with semantic understanding service for carrying out semantic analysis on the text; the business management module is deployed with business instruction management service for providing business instructions; the language generation module is provided with a language generation service (NLG) for converting instructions executed by processing equipment for indicating the voice signals into text language; the voice synthesis module is provided with a voice synthesis (TTS) service, and is used for processing the text language corresponding to the instruction and then sending the processed text language to a loudspeaker for broadcasting. In one embodiment, there may be multiple entity service devices deployed with different service services in the architecture shown in fig. 4, and one or more entity service devices may also aggregate one or more functional services.
In some embodiments, the following describes an example of a process of processing information of a processing device for an input voice signal based on the architecture shown in fig. 4, taking the information of the processing device for an input voice signal as a voice instruction input through voice as an example:
[ Speech recognition ]
The processing device of the voice signal may perform noise reduction processing and feature extraction on the audio of the voice instruction after receiving the voice instruction input through voice, where the noise reduction processing may include steps of removing echo and environmental noise.
Semantic understanding
Natural language understanding is performed on the recognized candidate text and the associated context information using acoustic and language models, and the text is parsed into structured, machine-readable information such as business fields, intents and word slots to express the semantics. Actionable intents and intent confidence scores are derived, and the semantic understanding module selects one or more candidate actionable intents based on the determined intent confidence scores.
[ business management ]
The semantic understanding module issues an execution instruction to the corresponding business management module according to the semantic analysis result of the text of the voice instruction, so as to execute the operation corresponding to the voice instruction and requested by the user, and feeds back the execution result of the operation corresponding to the voice instruction.
In some embodiments, when the processing device 200 of the voice signal obtains the voice signal through the detector 230, the processing device 200 of the voice signal extracts the semantic features and the emotion features of each audio frame of the voice signal through the controller 250, then performs multi-mode bilinear pooling on the semantic features and the emotion features of each voice frame of the voice signal, obtains the fusion features of each voice frame of the voice signal, and finally merges the fusion features of each voice frame of the voice signal to obtain the acoustic features of the voice signal.
In some embodiments, the manner in which controller 250 extracts semantic features and emotional features of individual audio frames of the speech signal may be: extracting mel-spectrum features of each audio frame of the speech signal as semantic features of each audio frame of the speech signal, and extracting short-time average energy features of each audio frame of the speech signal as emotional features of each audio frame of the speech signal.
In some embodiments, the manner in which the controller 250 performs multi-modal bilinear pooling on the semantic features and emotional features of each voice frame of the voice signal to obtain the fusion features may be: calculating the outer product of the Mel-spectrum features and the short-time average energy features of each voice frame of the voice signal to obtain a first matrix of each voice frame of the voice signal; converting the first matrix of each voice frame of the voice signal into a vector to obtain a vector expression of each voice frame of the voice signal; and carrying out an L1 normalization operation and an L2 normalization operation on the vector expressions of each voice frame of the voice signal to obtain the fusion features of each voice frame of the voice signal.
In some embodiments, the controller 250 is further configured to: acquiring driving information corresponding to the voice signal according to the acoustic characteristics and the driving model; wherein the driving model includes: the encoder component is used for encoding the acoustic characteristics of the voice signal to obtain the voice characteristics of the voice signal; the decoder component is used for decoding the voice characteristics of the voice signals to obtain driving information corresponding to the voice signals; and driving the target three-dimensional model according to the driving information corresponding to the voice signal.
In some embodiments, the encoder assembly in the controller 250 may include: the feature extraction layer is used for carrying out convolution processing on the acoustic features of the voice signals to obtain first features; the first linear layer is used for carrying out linear transformation on the first characteristic to obtain a second characteristic; a random inactivation layer for performing a random inactivation operation on the encoder assembly; a convolution enhancement unit comprising a multi-level convolution enhancement layer, any one of the convolution enhancement layers comprising: the first forward propagation network module FFM, the non-attention module FAM, the convolution module CM, the second forward propagation network module FFM and the normalization module LN are connected in a residual way; the convolution enhancement unit is configured to process the second feature through the multi-stage convolution enhancement layer to obtain a voice feature of the voice signal.
In some embodiments, the decoder component in the controller 250 comprises: a multi-stage decoder; any decoder includes: a second linear layer, used for performing a linear transformation on the voice features of the voice signal to obtain a third feature; and an activation function layer, used for processing the third feature to acquire the driving information of the voice signal.
In some embodiments, controller 250 trains a machine learning model comprising the encoder component and the decoder component based on a sample data set to obtain the driving model; wherein the sample data set comprises: acoustic characteristics of a plurality of sample voice signals and driving information of the target three-dimensional model corresponding to each sample voice signal.
In some embodiments, the loss function employed in training a machine learning model comprising the encoder component and the decoder component based on the sample data set comprises:
where Loss_i is the loss value for audio frame i in the sample speech signal, y_i is the driving information in the sample data set corresponding to audio frame i, f_i is the driving information output by the machine learning model for audio frame i, y_{i-1} is the driving information in the sample data set corresponding to the audio frame preceding audio frame i, and f_{i-1} is the driving information output by the machine learning model for the audio frame preceding audio frame i.
In some embodiments, the controller is further configured to:
and training the machine learning model based on the sample data set using a dynamic chunk training mode to acquire the driving model.
Fig. 5 is a schematic flow chart illustrating a processing method of a voice signal according to an embodiment of the present application, and as shown in fig. 5, the processing method of a voice signal according to an embodiment of the present application includes the following steps S501 to S504:
s501, acquiring a voice signal.
In some embodiments, the voice signal may be a human voice signal collected through a microphone, downloaded from a network, or read from a disk. In other embodiments, the speech signal may also be speech information synthesized by speech synthesis software.
S502, extracting semantic features and emotion features of each audio frame of the voice signal.
In this embodiment of the present application, before the semantic features and emotional features of each audio frame of the speech signal are extracted, the speech signal first needs to be preprocessed to obtain each audio frame of the speech signal, where preprocessing the speech signal includes: pre-emphasis processing, framing processing and windowing processing.
Pre-emphasis processing: the main purpose of the pre-emphasis process is to boost the high-frequency components in the speech signal. The human speech production system starts from the lungs, which act as the energy source; the airflow passes through the vocal cords and induces periodic vibration (vowels), and the energy passes through the pharynx, mouth, lips and tongue to form the final sound. Vowel energy is concentrated mainly below 1 kHz and drops at a rate of about 6 dB per octave. Consonants generally do not cause vocal cord vibration and lie at higher frequencies. Lip radiation has a small effect on low frequencies but a larger effect on the high-frequency range; pre-emphasis is used to compensate for this effect and boost the high-frequency components.
In speech signals, the purpose of boosting the high-frequency components is that the high-frequency components contain more information, whereas the frequencies of vowels are generally low and the power spectrum decreases with increasing frequency, so most of the energy is concentrated in the low-frequency range. The signal-to-noise ratio at the high-frequency end of the speech signal may therefore drop to an intolerable level. Pre-emphasis processing keeps the low-frequency part of the signal unchanged while boosting the high-frequency part; de-emphasis is the inverse operation, attenuating the boosted high-frequency part and keeping the low-frequency part. Pre-emphasis boosts the energy of the high-frequency part of the signal to compensate for the channel attenuating the high frequencies too much.
In some embodiments, the pre-emphasis process typically employs a first order high pass filter with a transfer function as follows:
H(z) = 1 - μz⁻¹
where μ is a preset filter coefficient whose value is generally close to 1.
Framing: framing divides the signal into segments of a specified length (a time period or a number of samples) and organizes them into a structured data format so that a program can process them in batches. The speech signal is non-stationary macroscopically but stationary microscopically, exhibiting short-term stationarity (the speech signal can be considered approximately unchanged within 10 ms-30 ms), so the speech signal can be processed by dividing it into short segments, each of which is called a frame.
Windowing: after the speech signal is divided into frames, in order to make the transitions between frames smooth and keep continuity, i.e. to eliminate the signal discontinuity (spectral leakage) that may occur at the two ends of each frame, it is necessary to apply a window to the truncated, framed signal. Truncation causes frequency-domain energy leakage, and the window function can reduce the influence of truncation.
In some embodiments, each audio frame is windowed. For the speech signal s(t), a length-limited, movable window function w(t) is set, and the windowed speech signal is then s(t)×w(t). Rectangular windows and Hamming windows are commonly used in speech processing; the window function of the rectangular window is denoted f_1(n) and the window function of the Hamming window is denoted f_2(n), and their expressions are as follows:
Window function expression for the rectangular window: f_1(n) = 1 for 0 ≤ n ≤ N-1, and f_1(n) = 0 otherwise.
Window function expression for the Hamming window: f_2(n) = 0.54 - 0.46cos(2πn/(N-1)) for 0 ≤ n ≤ N-1, and f_2(n) = 0 otherwise.
where N represents the length of the window function.
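As an illustration of the preprocessing described above, the following is a minimal Python sketch of pre-emphasis, framing and Hamming windowing. The coefficient 0.97 and the 25 ms / 10 ms frame parameters are illustrative assumptions, not values specified by the application.

```python
import numpy as np

def preprocess(signal, sample_rate, mu=0.97, frame_ms=25, hop_ms=10):
    """Pre-emphasis, framing and Hamming windowing (illustrative parameter values)."""
    # Pre-emphasis: H(z) = 1 - mu * z^-1
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])

    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 25 ms frames
    hop_len = int(sample_rate * hop_ms / 1000)       # e.g. 10 ms hop
    num_frames = 1 + (len(emphasized) - frame_len) // hop_len   # assumes signal >= one frame

    window = np.hamming(frame_len)                   # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([
        emphasized[i * hop_len: i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    return frames                                    # shape: (num_frames, frame_len)
```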
S503, carrying out multi-mode bilinear pooling on semantic features and emotion features of each voice frame of the voice signal to obtain fusion features of each voice frame of the voice signal.
In the embodiment of the present application, the method of multi-modal bilinear pooling (Multimodal Bilinear Pooling, MBP) is mainly used for feature fusion, and for two different features extracted from the same sample, a vector after the fusion of the two features can be obtained through multi-modal bilinear pooling, so as to be used for classification.
It should be noted that, the purpose of fusing the semantic features and the emotion features of each voice frame of the voice signal is to fuse multiple features, so that the expression capability of the subsequent voice features can be improved.
S504, merging the fusion characteristics of each voice frame of the voice signal to obtain the acoustic characteristics of the voice signal.
In a specific embodiment, the fusion features of the audio frames obtained through S501 to S503 are Z_1, Z_2, Z_3, …, Z_T, where T represents the length of the speech signal. For each frame t (t < T), merging the fusion features yields the acoustic features: (Z_1, Z_2, Z_3, …, Z_T).
As can be seen from the above technical solutions, in the processing method for a speech signal provided by the embodiments of the present application, firstly, a speech signal is obtained, and then semantic features and emotional features of each audio frame of the speech signal are extracted; the method comprises the steps of carrying out multi-mode bilinear pooling on semantic features and emotion features of each voice frame of a voice signal, obtaining fusion features of each voice frame of the voice signal, and finally merging the fusion features of each voice frame of the voice signal to obtain acoustic features of the voice signal.
As an extension and refinement of the foregoing embodiment, the embodiment of the present application provides a method for processing a voice signal, referring to fig. 6, the method for processing a voice signal includes the following steps S601 to S607:
S601, acquiring a voice signal.
S602, extracting the Mel spectrum characteristics of each audio frame of the voice signal as the semantic characteristics of each audio frame of the voice signal.
The Mel-spectrum features are Mel-scale frequency cepstral coefficients (MFCC). In some embodiments, referring to fig. 7, the steps for obtaining the Mel-spectrum coefficients are as follows:
s6021, calculating the frequency spectrum of the current audio frame data to obtain a short-time spectrum and taking an absolute value or a square value.
In some embodiments, the spectrum of the current audio frame data may be calculated by a fast fourier transform (Fast Fourier transform, FFT) to obtain a short-term spectrum, where the fast fourier transform uses a generic term of an efficient and fast computing method for computing the discrete fourier transform by a computer, and the number of multiplications required for computing the discrete fourier transform by the computer can be greatly reduced by using such an algorithm, and in particular, the more the number of sampling points to be transformed, the more significant the saving in the calculation amount of the FFT algorithm.
S6022, filtering the short-time spectrum with a Mel filter bank to obtain a filtering result.
S6023, carrying out logarithmic operation on the filtering result to obtain a logarithmic Mel energy spectrum.
S6024, performing discrete cosine transform decorrelation on the energy spectrum to obtain a Mel spectrum characteristic.
In some embodiments, the discrete cosine transform (Discrete Cosine Transform, DCT) is a transform related to the Fourier transform and is similar to the discrete Fourier transform (Discrete Fourier Transform, DFT), but uses only real numbers. The discrete cosine transform is equivalent to a discrete Fourier transform of approximately twice its length performed on a real even function (since the Fourier transform of a real even function is still a real even function); in some variants the positions of the input or output need to be shifted by half a unit (the DCT has 8 standard types, of which 4 are common).
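To make steps S6021-S6024 concrete, the following is an illustrative Python sketch operating on the windowed frames produced by the preprocessing sketch above. The FFT size, the number of Mel filters and the number of retained coefficients are assumptions, and the Mel filter bank is taken from librosa rather than from the application.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_frames(frames, sample_rate, n_fft=512, n_mels=26, n_mfcc=13):
    """Illustrative Mel-spectrum (MFCC) extraction following S6021-S6024."""
    # S6021: short-time spectrum of each frame and its squared magnitude
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    power = np.abs(spectrum) ** 2

    # S6022: filter the short-time spectrum with a Mel filter bank
    mel_fb = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels)
    mel_energy = power @ mel_fb.T

    # S6023: logarithm -> logarithmic Mel energy spectrum
    log_mel = np.log(mel_energy + 1e-10)

    # S6024: DCT decorrelation -> Mel-spectrum (cepstral) features
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_mfcc]   # (num_frames, n_mfcc)
```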
S603, extracting short-time average energy characteristics of each audio frame of the voice signal as emotion characteristics of each audio frame of the voice signal.
In the present embodiment, the short-time average energy (Short Time Average Energy, STAE) feature is related to the amplitude of the acoustic vibration and describes the energy of the speech signal over a relatively short time. The short-time average energy (En) can be used to distinguish voiced sounds from unvoiced sounds (En of voiced sounds is much larger than that of unvoiced sounds), to determine the boundaries between initials and finals, between silence and speech, and between connected sounds, and can also be used as a kind of suprasegmental information for speech recognition. In general, if a speaker speaks loudly, the energy consumed is relatively large; if the speaker speaks softly, the energy consumed is relatively small. When the speaker is emotionally aroused, the volume of the voice is high, i.e. the energy of the voice is large; when the speaker is dejected or calm, the volume of the voice is low, i.e. the energy of the voice is small. In some embodiments, the calculation formula for obtaining the short-time average energy is as follows:
where E_m represents the short-time energy value of the m-th frame of the speech signal, w(m) represents the window function, N represents the window length, and x(n) represents the speech signal.
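Continuing the sketches above, the short-time average energy of each (already windowed) frame can be computed as follows; averaging over the frame length is an assumed normalization.

```python
import numpy as np

def short_time_average_energy(frames):
    """Short-time average energy per windowed frame, used here as the emotional feature."""
    # Sum of squared windowed samples of each frame, averaged over the window length
    return np.sum(frames ** 2, axis=1) / frames.shape[1]   # shape: (num_frames,)
```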
In the embodiment of the application, the Mel-spectrum features focus on the frequency spectrum of the acoustic signal, while the energy features focus on the emotion of the spoken content; they describe the semantic information of the speech from different perspectives and lie in different feature spaces. Fusing the two types of features therefore effectively improves the expressive power of the subsequent speech features, and in particular improves the subsequent driving effect of the target three-dimensional model.
S604, calculating the outer product of the Mel spectrum characteristic and the short-time average energy characteristic of each voice frame of the voice signal, and obtaining a first matrix of each voice frame of the voice signal.
In some embodiments, all interactions between vector elements can be exploited by computing the outer product of the Mel-spectrum feature and the short-time average energy feature of each speech frame of the speech signal, fusing the Mel-spectrum feature vector and the short-time average energy feature vector to obtain a joint representation space (i.e., the first matrix).
In some embodiments, the calculation formula of the first matrix for obtaining each speech frame of the speech signal is as follows:
where the two factors of the outer product are the Mel-spectrum feature of the speech frame and the short-time average energy feature of the speech frame, f^T(·, t) denotes the transpose of the matrix f(·, t), T denotes the length of the speech signal, t denotes the time of the current audio frame, M denotes the number of rows of the matrix, and N denotes the number of columns of the matrix.
S605, converting the first matrix of each voice frame of the voice signal into a vector to obtain a vector expression of each voice frame of the voice signal.
In some embodiments, the expressive power of the feature may be improved by converting the first matrix generated by the outer product into a vector expression. The calculation formula for converting the first matrix of each speech frame of the speech signal into a vector in S605 is as follows:
wherein x (t) represents a vector expression of the first matrix.
S606, carrying out L1 normalization operation and L2 normalization operation on vector expressions of each voice frame of the voice signal, and obtaining fusion characteristics of each voice frame of the voice signal.
Here, L1 normalization operates on the sum of the absolute values of all elements of the vector expression and tends to produce sparse features, while L2 normalization operates on the sum of the squares of all elements and retains more features, with the corresponding weights close to zero. In some embodiments, a first normalization operation is performed on the vector expression of the first matrix obtained in S605, and the calculation formula for obtaining the first normalization result is as follows:
Wherein y (t) represents the first normalization result.
And performing second normalization operation on the first normalization result to obtain fusion characteristics, wherein the calculation formula is as follows:
wherein z (t) represents the fusion feature.
S607, merging the fusion characteristics of each voice frame of the voice signal to obtain the acoustic characteristics of the voice signal.
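The following sketch ties steps S604 through S607 together. The signed-square-root form is assumed for the L1-style normalization, as commonly used in bilinear pooling; the application's exact normalization formulas are not reproduced in this text.

```python
import numpy as np

def fuse_frame_features(mel_feats, energy_feats):
    """Multi-modal bilinear pooling per frame (S604-S606), then merging (S607).

    mel_feats:    (num_frames, M) Mel-spectrum features (semantic features)
    energy_feats: (num_frames,) or (num_frames, N) short-time energy features (emotional features)
    """
    fused = []
    for mel, en in zip(mel_feats, energy_feats):
        first_matrix = np.outer(mel, en)           # S604: outer product -> M x N first matrix
        x = first_matrix.reshape(-1)               # S605: flatten into a vector expression
        y = np.sign(x) * np.sqrt(np.abs(x))        # S606: L1-style normalization (assumed signed sqrt)
        z = y / (np.linalg.norm(y) + 1e-12)        # S606: L2 normalization -> fusion feature
        fused.append(z)
    return np.stack(fused)                         # S607: merge into the acoustic features

With the earlier sketches, calling fuse_frame_features(mfcc_from_frames(frames, sr), short_time_average_energy(frames)) yields the per-frame fusion features merged into one acoustic feature matrix.
```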
Referring to fig. 8, an embodiment of the present application provides a method for processing a voice signal, where the method for processing a voice signal includes the following steps S801 to S808:
s801, a voice signal is acquired.
S802, extracting the Mel spectrum characteristics of each audio frame of the voice signal as the semantic characteristics of each audio frame of the voice signal.
S803, extracting short-time average energy characteristics of each audio frame of the voice signal as emotion characteristics of each audio frame of the voice signal.
S804, calculating the outer product of the Mel spectrum characteristic and the short-time average energy characteristic of each voice frame of the voice signal, and obtaining a first matrix of each voice frame of the voice signal.
S805, converting the first matrix of each voice frame of the voice signal into a vector, and obtaining a vector expression of each voice frame of the voice signal.
S806, carrying out L1 normalization operation and L2 normalization operation on vector expressions of each voice frame of the voice signal, and obtaining fusion characteristics of each voice frame of the voice signal.
S807, driving information corresponding to the voice signal is obtained according to the acoustic characteristics and the driving model.
Wherein the driving model includes: the encoder component is used for encoding the acoustic characteristics of the voice signal to obtain the voice characteristics of the voice signal; the decoder component is used for decoding the voice characteristics of the voice signals to obtain the driving information corresponding to the voice signals.
In some embodiments, referring to fig. 9, the driving model is built on the Encoder-Decoder structure of the Transformer shown in fig. 9: the data is input to the Encoder, processed by the Encoder, and then passed to the Decoder for decoding to obtain the output.
The driving model includes an encoder assembly and a decoder assembly, and referring to fig. 10, the encoder assembly includes:
the feature extraction layer 101 is configured to perform convolution processing on the acoustic feature of the speech signal, so as to obtain a first feature.
And the first linear layer 102 is used for performing linear transformation on the first feature to obtain a second feature.
Wherein the first linear layer is a fully connected layer (Fully Connected layers, FC) which functions as a "classifier" in the overall convolutional neural network, e.g., in one convolutional neural network, the convolutional layer, the pooling layer, the activation function layer, etc. operate to map raw data to hidden layer feature space, and the fully connected layer functions to map learned "distributed feature representation" to sample label space.
A random inactivation layer 103 for performing a random inactivation operation on the encoder assembly.
In the learning process, random inactivation (dropout) reduces the interdependence among nodes by randomly zeroing part of the weights or outputs of the hidden layer, thereby regularizing the neural network and reducing the structural risk.
A convolution enhancement unit 104 comprising a multi-level convolution enhancement layer, any one of the convolution enhancement layers comprising: the first forward propagation network module FFM, the inattention module FAM, the convolution module CM, the second forward propagation network module FFM and the normalization module LN are connected in a residual way; the convolution enhancement unit is configured to process the second feature through the multi-stage convolution enhancement layer to obtain a voice feature of the voice signal.
As shown in fig. 10, the convolution enhancement unit is composed of N serial convolution enhancement layers, and any one convolution enhancement layer includes: the first forward propagation network module 1041, the inattention module 1042, the convolution module 1043, the second forward propagation network module 1044 and the normalization operation module 1045, which are connected by residuals. By way of example, the input R1 and R1', which is R1 processed by the first forward propagation network module 1041, are added to obtain R2; R2 is then used as the input of the inattention module 1042, which outputs R2'; the sum R3 of R2 and R2' is used as the input of the convolution module 1043, which outputs R3'; the sum R4 of R3 and R3' is used as the input of the second forward propagation network module 1044, which outputs R4'; finally, the sum R5 of R4 and R4' is used as the input of the normalization operation module 1045, which produces the output Y.
It should be noted that the first and second forward propagation network modules may include: two linear layers, a nonlinear activation layer (located between the two linear layers) and a normalization operation layer.
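The residual data flow R1 → Y described above can be sketched in PyTorch as follows; the ffm1, fam, cm and ffm2 arguments are placeholder submodules standing in for the components named in the text.

```python
import torch.nn as nn

class ConvolutionEnhancementLayer(nn.Module):
    """One convolution enhancement layer: FFM -> FAM -> CM -> FFM -> LN, with residual sums."""
    def __init__(self, ffm1, fam, cm, ffm2, dim):
        super().__init__()
        self.ffm1, self.fam, self.cm, self.ffm2 = ffm1, fam, cm, ffm2
        self.ln = nn.LayerNorm(dim)

    def forward(self, r1):
        r2 = r1 + self.ffm1(r1)     # R2 = R1 + R1'
        r3 = r2 + self.fam(r2)      # R3 = R2 + R2'
        r4 = r3 + self.cm(r3)       # R4 = R3 + R3'
        r5 = r4 + self.ffm2(r4)     # R5 = R4 + R4'
        return self.ln(r5)          # Y = LN(R5)
```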
For example, referring to fig. 11, the convolution module 1043 of the convolution enhancement unit 104 may include 8 network layers: a normalization operation (LayerNorm) layer 111, a point-wise convolution (Pointwise Conv) layer 112, a GLU activation (GLU Activation) layer 113, a depthwise separable convolution (Depthwise Conv) layer 114, a batch normalization (BatchNorm) layer 115, a Swish activation (Swish Activation) layer 116, a point-wise convolution (Pointwise Conv) layer 117, and a random inactivation (Dropout) layer 118, followed by a residual summing operation 119.
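An illustrative PyTorch sketch of this 8-layer convolution module follows; the kernel size, dropout rate and channel handling are assumptions rather than values given in the application.

```python
import torch.nn as nn

class ConvolutionModule(nn.Module):
    """LayerNorm -> pointwise conv -> GLU -> depthwise conv -> BatchNorm -> Swish -> pointwise conv -> dropout, plus residual sum."""
    def __init__(self, dim, kernel_size=15, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                                   # layer 111
        self.pw1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)               # layer 112 (doubles channels for GLU)
        self.glu = nn.GLU(dim=1)                                        # layer 113 (halves channels back to dim)
        self.dw = nn.Conv1d(dim, dim, kernel_size,
                            padding=kernel_size // 2, groups=dim)       # layer 114, depthwise conv
        self.bn = nn.BatchNorm1d(dim)                                   # layer 115
        self.swish = nn.SiLU()                                          # layer 116, Swish activation
        self.pw2 = nn.Conv1d(dim, dim, kernel_size=1)                   # layer 117
        self.drop = nn.Dropout(dropout)                                 # layer 118

    def forward(self, x):                                               # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)                                # to (batch, dim, time) for Conv1d
        y = self.glu(self.pw1(y))
        y = self.swish(self.bn(self.dw(y)))
        y = self.drop(self.pw2(y))
        return x + y.transpose(1, 2)                                    # residual summing operation 119
```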
In some embodiments, the decoder component comprises: a multi-stage decoder; any decoder includes:
and the second linear layer is used for carrying out linear transformation on the voice characteristics of the voice signal to obtain third characteristics.
The second linear layer is a fully connected layer, and plays a role of a classifier in the whole convolutional neural network, for example, in one convolutional neural network, the convolutional layer, the pooling layer, the activation function layer and the like operate to map original data to a hidden layer feature space, and the fully connected layer plays a role of mapping learned distributed feature representation to a sample mark space.
And the activation function layer is used for processing the third characteristic and acquiring the driving information of the voice signal.
Wherein the activation function (Activation Function), which is a function running on neurons of the artificial neural network, is responsible for mapping the inputs of the neurons to the outputs. The activation function plays a very important role in learning and understanding very complex and nonlinear functions of the artificial neural network model, and is introduced to increase the nonlinearity of the neural network model, for example, the activation function may be: sigmoid function, tanh function, relu activation function, etc.
S808, driving the target three-dimensional model according to the driving information corresponding to the voice signal.
The target three-dimensional model may be a virtual three-dimensional face model, and the obtained driving information may drive the virtual three-dimensional face model to speak, generate a corresponding facial expression, facial action, and the like.
Referring to fig. 12, fig. 12 is a schematic structural diagram of the driving model 120. The input of the driving model 120 is a voice signal, and the output of the driving model 120 is driving information; the driving information may be a lattice (vertex) combination {P_1, P_2, …, P_n} as shown in fig. 12, used for driving a target three-dimensional model 121.
In some embodiments, the driving model is based on an attention-free Transformer (Attention Free Transformer, AFT) encoder, which eliminates the need for dot-product self-attention in the machine learning process. When the driving model is trained, a large number of voice signals are first acquired and converted into acoustic features, and the acoustic features are combined with the driving model to acquire the driving information corresponding to the voice signals; then a sample data set of a large number of voice signals and the driving information corresponding to them is input into the machine learning model to train it, so as to obtain the driving model. For example, voice is input into the driving model frame by frame, and through model calculation the driving information is output as 5023 groups of three-dimensional vertex data representing the vertex displacements of the face, so as to drive the target three-dimensional model.
It should be noted that the driving model is trained in a distributed manner, so that hardware resources can be utilized more fully. The training strategy uses an Adam optimizer, which converges faster, and training is iterated for more than 100,000 steps until the model fully converges.
In some embodiments, three quantities interact in the AFT encoder: a query vector (Q), a key vector (K) and a value vector (V), each of length 64. K is first combined with V together with a set of learned positional biases, and the result is multiplied by Q in an element-wise manner. This operation has linear memory complexity in both context size and feature dimension, so it can accommodate large inputs and model sizes. Q, K and V are obtained by multiplying the embedded vector X by three different weight matrices W_Q, W_K, W_V of the same size; the specific calculation formulas are Q = XW_Q, K = XW_K, V = XW_V, and a calculation schematic of Q, K, V is shown in fig. 13.
The output is then calculated according to the following formula:
Y_t = σ_q(Q_t) ⊙ ( Σ_{t'=1}^{T} exp(K_{t'} + w_{t,t'}) ⊙ V_{t'} ) / ( Σ_{t'=1}^{T} exp(K_{t'} + w_{t,t'}) )
where the symbol ⊙ denotes the element-wise product, σ_q is a nonlinear mapping applied to Q, defaulting to Sigmoid (an activation function), and w_{t,t'} are the learned pairwise positional biases. For each target position t, AFT combines the result of the weighted average with Q by element-wise multiplication, where the weighting is composed of K and the set of learned pairwise positional biases. This provides a direct advantage in that it does not require computing and storing a large attention matrix, while still being able to model the global interactions between Q and V like a multi-head self-attention mechanism.
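For concreteness, a small NumPy sketch of the AFT operation as reconstructed above; it is an illustrative sketch rather than the claimed implementation, and no numerical-stability tricks are included.

```python
import numpy as np

def aft_attention(Q, K, V, w):
    """Attention-free transformer operation for one head.

    Q, K, V: (T, d) query / key / value matrices (e.g. d = 64)
    w:       (T, T) learned pairwise positional biases w[t, t']
    """
    sigma_q = 1.0 / (1.0 + np.exp(-Q))                   # Sigmoid applied to Q
    # exp(K_{t'} + w_{t,t'}) for every target position t and source position t'
    weights = np.exp(K[None, :, :] + w[:, :, None])      # (T, T, d)
    num = np.einsum('tsd,sd->td', weights, V)            # weighted sum over source positions
    den = weights.sum(axis=1)                            # normalization term
    return sigma_q * num / den                           # (T, d)
```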
In some embodiments, the method of obtaining the driving model includes the following steps 1 and 2:
step 1, training a machine learning model comprising the encoder component and the decoder component based on a sample data set to obtain the driving model;
wherein the sample data set comprises: acoustic characteristics of a plurality of sample voice signals and driving information of the target three-dimensional model corresponding to each sample voice signal.
In some embodiments, the preset machine learning model may be: convolutional neural network (Convolutional Neural Networks, CNN) model, recurrent neural network (Recurrent Neural Networks, RNN), recurrent neural network (Recursive Neural Network, RNN) and other machine learning models.
In some embodiments, the driving model is a model for extracting the features obtained by training a preset machine learning model based on acoustic features of a plurality of sample voice signals and driving information of the target three-dimensional model corresponding to each sample voice signal.
In step 1, the loss function employed in training the machine learning model including the encoder component and the decoder component based on the sample data set is as follows:
where Loss_i is the loss value for audio frame i in the sample speech signal, y_i is the driving information in the sample data set corresponding to audio frame i, f_i is the driving information output by the machine learning model for audio frame i, y_{i-1} is the driving information in the sample data set corresponding to the audio frame preceding audio frame i, and f_{i-1} is the driving information output by the machine learning model for the audio frame preceding audio frame i.
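The loss formula itself is not reproduced in this text. Since it involves both the current frame (y_i, f_i) and the preceding frame (y_{i-1}, f_{i-1}), one plausible reading is a position term plus a frame-difference (motion) term; the sketch below is only that assumption, not the application's actual loss.

```python
import numpy as np

def frame_loss(y_i, f_i, y_prev, f_prev):
    """Assumed per-frame loss: position error plus frame-to-frame motion error."""
    position_term = np.sum((y_i - f_i) ** 2)                         # match the driving info of frame i
    motion_term = np.sum(((y_i - y_prev) - (f_i - f_prev)) ** 2)     # match the change from the previous frame
    return position_term + motion_term
```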
And step 2, training the machine learning model based on the sample data set using a dynamic chunk training mode to obtain the driving model.
The positional bias value is adjusted according to a dynamic window; the calculation formula is as follows:
w_{t,t'} = w_{t,t'}, if |t - t'| < s; otherwise w_{t,t'} = 0
where w_{t,t'} represents the positional bias value, t is the current time, t' is the target time corresponding to the bias value, and s represents the dynamic window size; s changes randomly according to the dynamic chunk (Chunk) size during the training process.
In some embodiments, dynamic chunk training (Dynamic Chunk Training, DCT) can be combined to obtain better local context information in speech signals. Dynamic chunk sizes are drawn from a uniform distribution ranging from 1 to the maximum utterance length, i.e. the attention varies from left-context attention to full-context attention. Different batches use different chunk sizes, so the model captures information at various chunk sizes and learns how to make accurate predictions when different, limited right contexts are provided. Chunks with sizes from 1 to 25 are used as the streaming chunks of the streaming model, while the maximum utterance length is used for the non-streaming, full-context case. The distribution of chunk sizes is changed during training using the following calculation formula:
where l_max represents the maximum utterance length of the audio in the current batch, U represents a uniform distribution, x is sampled from 0 to 1.0 for each batch during training and represents the dynamic chunk variable, and s(x) represents the window size of the dynamic chunk x during training. Through this formula, the window size can be dynamically adjusted according to the dynamic chunk size, so that different local context information can be obtained.
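The sampling formula is likewise not reproduced here. The sketch below follows a commonly used dynamic-chunk sampling scheme (full context for some batches, otherwise a chunk size between 1 and 25), which is consistent with the description above but is an assumption.

```python
import random

def sample_chunk_size(l_max, full_context_prob=0.5, max_streaming_chunk=25):
    """Sample a dynamic chunk size for one batch (assumed scheme)."""
    x = random.random()                             # x ~ U(0, 1.0), drawn once per batch
    if x < full_context_prob:
        return l_max                                # full-context (non-streaming) chunk
    return random.randint(1, max_streaming_chunk)   # streaming chunk between 1 and 25
```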
In some embodiments, referring to fig. 14 in combination with the above steps, an overall frame diagram of a processing method for a speech signal includes:
the data preprocessing module 141 is configured to perform pre-emphasis, framing and windowing on the voice signal, so as to obtain a plurality of audio frames of the voice signal.
The acoustic feature fusion and merging module 142 is configured to obtain mel spectrum features and short-time average energy features of an audio frame, fuse the mel spectrum features and the short-time average energy features of the audio frame to obtain fusion features, and merge the fusion features of a plurality of audio frames of the speech signal to obtain acoustic features.
A voice feature extraction module 143, configured to obtain voice features of the voice signal.
The facial feature point extraction module 144 is configured to extract feature points of a face of a person to obtain feature points of the face of the person.
The driving model building module 145 is configured to build the driving model.
The driving target three-dimensional model module 146 drives the target three-dimensional model with the driving information.
In some embodiments, embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, which when executed by a computing device, causes the computing device to implement a method for processing a speech signal according to any of the embodiments above.
In some embodiments, the present application provides a computer program product which, when run on a computer, causes the computer to implement a method of processing speech signals as shown in the second aspect.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A processing apparatus for a speech signal, comprising:
a detector configured to acquire a voice signal;
a controller configured to:
extracting semantic features and emotional features of each audio frame of the voice signal;
carrying out multi-mode bilinear pooling on semantic features and emotion features of each voice frame of the voice signal to obtain fusion features of each voice frame of the voice signal;
and merging the fusion characteristics of each voice frame of the voice signal to obtain the acoustic characteristics of the voice signal.
2. The apparatus for processing a speech signal according to claim 1, wherein the controller is specifically configured to:
extracting mel spectrum features of each audio frame of the speech signal as semantic features of each audio frame of the speech signal;

extracting short-time average energy features of each audio frame of the speech signal as emotional features of each audio frame of the speech signal.
3. The apparatus for processing a speech signal according to claim 2, wherein the controller is specifically configured to:
calculating the outer product of the Mel spectrum characteristics and the short-time average energy characteristics of each voice frame of the voice signal, and obtaining a first matrix of each voice frame of the voice signal;
converting a first matrix of each voice frame of the voice signal into a vector to obtain a vector expression of each voice frame of the voice signal;
and carrying out L1 normalization operation and L2 normalization operation on vector expressions of each voice frame of the voice signal to obtain fusion characteristics of each voice frame of the voice signal.
4. The apparatus for processing a speech signal according to claim 1, wherein the controller is further configured to:
acquiring driving information corresponding to the voice signal according to the acoustic characteristics and the driving model; wherein the driving model includes: the encoder component is used for encoding the acoustic characteristics of the voice signal to obtain the voice characteristics of the voice signal; the decoder component is used for decoding the voice characteristics of the voice signals to obtain driving information corresponding to the voice signals;
And driving the target three-dimensional model according to the driving information corresponding to the voice signal.
5. The apparatus for processing a speech signal according to claim 4, wherein the encoder component comprises:
the feature extraction layer is used for carrying out convolution processing on the acoustic features of the voice signals to obtain first features;
the first linear layer is used for carrying out linear transformation on the first characteristic to obtain a second characteristic;
a random inactivation layer for performing a random inactivation operation on the encoder assembly;
a convolution enhancement unit comprising a multi-level convolution enhancement layer, any one of the convolution enhancement layers comprising: the first forward propagation network module FFM, the non-attention module FAM, the convolution module CM, the second forward propagation network module FFM and the normalization module LN are connected in a residual way; the convolution enhancement unit is configured to process the second feature through the multi-stage convolution enhancement layer to obtain a voice feature of the voice signal.
6. The apparatus for processing a speech signal according to claim 4, wherein the decoder component comprises: a multi-stage decoder; any decoder includes:
the second linear layer is used for carrying out linear transformation on the voice characteristics of the voice signals to obtain third characteristics;
And the activation function layer is used for processing the third characteristic and acquiring the driving information of the voice signal.
7. The apparatus for processing a speech signal according to claim 4, wherein the controller is further configured to:
training a machine learning model comprising the encoder component and the decoder component based on a sample data set to obtain the driving model;
wherein the sample data set comprises: acoustic characteristics of a plurality of sample voice signals and driving information of the target three-dimensional model corresponding to each sample voice signal.
8. The apparatus according to claim 7, wherein the loss function employed in training a machine learning model including the encoder component and the decoder component based on the sample data set comprises:
wherein loss_i is the loss value of audio frame i in the sample speech signal, y_i is the driving information in the sample data set corresponding to audio frame i, f_i is the driving information output by the machine learning model for audio frame i, y_{i-1} is the driving information in the sample data set corresponding to the audio frame preceding audio frame i, and f_{i-1} is the driving information output by the machine learning model for the audio frame preceding audio frame i.
9. The apparatus for processing a speech signal according to claim 7, wherein the controller is further configured to:
and training the machine learning model based on the sample data set by adopting a dynamic block training mode, so as to acquire the driving model.
10. A method for processing a speech signal, comprising:
acquiring a voice signal;
extracting semantic features and emotional features of each audio frame of the voice signal;
carrying out multi-mode bilinear pooling on semantic features and emotion features of each voice frame of the voice signal to obtain fusion features of each voice frame of the voice signal;
and merging the fusion characteristics of each voice frame of the voice signal to obtain the acoustic characteristics of the voice signal.
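For reference, one possible realization of the encoder and decoder components recited in claims 5 and 6 is sketched below in PyTorch. All layer sizes, the approximation of the FAM module with multi-head self-attention, the SiLU activations, the dropout rate and the Sigmoid output activation are assumptions made only for illustration; this sketch is not the claimed implementation.

import torch.nn as nn

class ConvEnhanceLayer(nn.Module):
    # One convolution-enhancement layer: FFM -> attention -> convolution -> FFM -> LayerNorm,
    # each sub-module wrapped with a residual connection.
    def __init__(self, d_model=256, n_heads=4, kernel_size=15, dropout=0.1):
        super().__init__()
        self.ffm1 = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                  nn.Linear(4 * d_model, d_model), nn.Dropout(dropout))
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)  # depthwise conv over time
        self.ffm2 = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                  nn.Linear(4 * d_model, d_model), nn.Dropout(dropout))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                                      # x: (batch, frames, d_model)
        x = x + 0.5 * self.ffm1(x)                             # first forward propagation network module
        x = x + self.attn(x, x, x, need_weights=False)[0]      # attention stand-in for the FAM module (assumed)
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)   # convolution module
        x = x + 0.5 * self.ffm2(x)                             # second forward propagation network module
        return self.norm(x)                                    # normalization module

class Encoder(nn.Module):
    def __init__(self, in_dim, d_model=256, n_layers=4):
        super().__init__()
        self.feat = nn.Conv1d(in_dim, d_model, kernel_size=3, padding=1)  # feature extraction layer
        self.linear = nn.Linear(d_model, d_model)                         # first linear layer
        self.dropout = nn.Dropout(0.1)                                    # random inactivation layer
        self.layers = nn.ModuleList([ConvEnhanceLayer(d_model) for _ in range(n_layers)])

    def forward(self, acoustic):                               # acoustic: (batch, frames, in_dim)
        x = self.feat(acoustic.transpose(1, 2)).transpose(1, 2)
        x = self.dropout(self.linear(x))
        for layer in self.layers:
            x = layer(x)
        return x                                               # voice features of the voice signal

class Decoder(nn.Module):
    def __init__(self, d_model=256, out_dim=52):               # out_dim: e.g. blendshape count (assumed)
        super().__init__()
        self.linear = nn.Linear(d_model, out_dim)              # second linear layer
        self.act = nn.Sigmoid()                                # activation function layer (assumed)

    def forward(self, voice_features):
        return self.act(self.linear(voice_features))           # per-frame driving information

In use, the merged acoustic features of shape (batch, frames, feature_dim) would be passed through Encoder and then Decoder to obtain per-frame driving information for the target three-dimensional model.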
CN202211632447.1A 2022-12-19 2022-12-19 Speech signal processing apparatus and method Pending CN117649848A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211632447.1A CN117649848A (en) 2022-12-19 2022-12-19 Speech signal processing apparatus and method

Publications (1)

Publication Number Publication Date
CN117649848A true CN117649848A (en) 2024-03-05

Family

ID=90043866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211632447.1A Pending CN117649848A (en) 2022-12-19 2022-12-19 Speech signal processing apparatus and method

Country Status (1)

Country Link
CN (1) CN117649848A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination