CN114694685A - Voice quality evaluation method, device and storage medium - Google Patents

Voice quality evaluation method, device and storage medium

Info

Publication number
CN114694685A
CN114694685A
Authority
CN
China
Prior art keywords
voice
speech
frames
module
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210382372.XA
Other languages
Chinese (zh)
Inventor
秦萌萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202210382372.XA priority Critical patent/CN114694685A/en
Publication of CN114694685A publication Critical patent/CN114694685A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/45 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present disclosure relates to a voice quality evaluation method, apparatus, and storage medium. The method includes: acquiring spectrum information of a plurality of speech frames corresponding to a first speech signal; extracting, from the spectrum information of the plurality of speech frames, channel attention feature information of the speech frames by using a pre-trained speech quality assessment network model, and extracting deep speech feature information of the speech frames based on the channel attention feature information of the speech frames; extracting timing-related feature information among the plurality of speech frames based on the deep speech feature information; and predicting a quality score of the first speech signal according to the timing-related feature information among the plurality of speech frames.

Description

Voice quality evaluation method, device and storage medium
Technical Field
The present disclosure relates to the field of speech processing, and in particular, to a method and an apparatus for evaluating speech quality and a storage medium.
Background
Quality evaluation of voice signals can assist in verifying voice algorithms and voice systems, is one of the main bases for judging the performance of voice algorithms and voice systems, and is widely applied in fields such as voice transmission, voice communication, voice enhancement, voice synthesis, voice recognition, and audio encoding and decoding.
With the rapid development of neural network technology, deep-learning-based speech quality assessment network models have a broad application market and development prospects as a research direction. However, although the speech quality assessment network models in the related art achieve objective quality assessment of speech signals, they require a large amount of original clean speech to participate in model training, and such training samples are difficult to obtain. In addition, their network structures are relatively complicated, the model parameters involved are numerous, training is slow, and model optimization is difficult.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a voice quality assessment method, apparatus, and storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a voice quality assessment method, including:
acquiring spectrum information of a plurality of speech frames corresponding to a first speech signal;
extracting, for the spectrum information of the plurality of speech frames, channel attention feature information of the speech frames by using a pre-trained speech quality assessment network model, and extracting deep speech feature information of the speech frames based on the channel attention feature information of the speech frames;
extracting timing-related feature information among the plurality of speech frames based on the deep speech feature information; and
predicting a quality score of the first speech signal according to the timing-related feature information among the plurality of speech frames.
Optionally, the voice quality assessment network model includes: an Efficient Channel Attention (ECA) module, a multi-level residual convolution module, and a bidirectional Gated Recurrent Unit (GRU) module;
the extracting, by using a pre-trained speech quality assessment network model, channel attention feature information of the plurality of speech frames for the spectrum information of the plurality of speech frames, and extracting deep speech feature information of the plurality of speech frames based on the channel attention feature information of the plurality of speech frames, includes:
aiming at the frequency spectrum information of the plurality of voice frames, extracting channel attention feature information of each channel of the plurality of voice frames by using the ECA module;
extracting deep voice feature information based on the channel attention feature information by using the multi-stage residual convolution module;
the extracting, based on the deep speech feature information, timing sequence related feature information between the speech frames includes:
and extracting time sequence related characteristic information corresponding to the deep voice characteristic information by utilizing the bidirectional GRU module.
Optionally, the multi-stage residual convolution module includes N residual convolution layers;
the residual convolution layer includes: a first convolutional layer and a second convolutional layer which are cascaded; wherein an input of the first convolution layer is passed to an output of the first convolution layer by a residual unit;
and the N residual convolution layers are used for performing depthwise separable convolution processing on the channel attention feature information to obtain the deep speech feature information of the plurality of speech frames.
Optionally, the bidirectional GRU module comprises: a forward GRU sub-module and a backward GRU sub-module;
the extracting, by using the bidirectional GRU module, the time sequence related feature information corresponding to the deep speech feature information includes:
forward feature extraction is carried out on the deep voice feature information by utilizing the forward GRU sub-module to obtain forward time sequence feature information;
carrying out reverse feature extraction on the deep voice feature information by using the backward GRU submodule to obtain reverse time sequence feature information;
and obtaining time sequence related characteristic information corresponding to the deep voice characteristic information based on the forward time sequence characteristic information and the reverse time sequence characteristic information.
Optionally, the voice quality assessment network model includes: a fully connected module and a Global Average Pooling (GAP) module;
predicting a quality score of the first speech signal according to timing-related feature information between the plurality of speech frames, including:
performing full connection processing on the time sequence related characteristic information of a plurality of voice frames of the first voice signal by using the full connection module to obtain quality scores of the plurality of voice frames;
and carrying out global average processing on the quality scores of the voice frames by utilizing the GAP module to obtain the quality score of the first voice signal.
Optionally, the obtaining of the spectrum information of the plurality of speech frames corresponding to the first speech signal includes:
preprocessing the first voice signal to obtain a plurality of voice frames of the first voice signal;
and performing time-frequency conversion processing on the plurality of voice frames of the first voice signal to obtain the amplitude spectrum information of the plurality of voice frames.
Optionally, the preprocessing the first speech signal to obtain a plurality of speech frames of the first speech signal includes:
pre-emphasis processing is carried out on the first voice signal;
and performing frame division and windowing processing on the pre-emphasized first voice signal to obtain a plurality of voice frames of the first voice signal.
Optionally, before obtaining the spectrum information of a plurality of speech frames corresponding to the first speech signal, the method includes:
acquiring a training sample set of voice with noise and a quality scoring label of the training sample set;
inputting the frequency spectrum information of the voice frames corresponding to the plurality of voice signals with noise in the training sample set to an initial network model to be trained to obtain a prediction quality score output by the initial network model;
determining a loss function value of the initial network model according to the predicted quality score and the quality score label;
and adjusting the parameters to be trained of the initial network model based on the loss function values to obtain the voice quality evaluation network model.
Optionally, the inputting the spectrum information of the speech frames corresponding to the plurality of noisy speech signals in the training sample set into the initial network model to be trained to obtain the prediction quality score output by the initial network model includes:
determining a maximum frame number of the plurality of noisy speech signals based on the spectrum information of the speech frames of the noisy speech signals;
determining whether the frame number of the spectrum information of each noisy speech signal is the maximum frame number;
if the frame number of the spectrum information of a noisy speech signal is not the maximum frame number, performing zero-padding on the spectrum information so that the frame number of the zero-padded spectrum information equals the maximum frame number; and
inputting the zero-padded spectrum information into the initial network model to obtain the prediction quality score output by the initial network model.
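As an illustration of the zero-padding step described above, the following sketch pads a batch of per-utterance spectrum matrices to a shared maximum frame count; the function name, shapes, and the use of NumPy are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def pad_spectra_to_max(spectra_list):
    """Zero-pad per-utterance spectrum matrices to the largest frame count.

    spectra_list holds one (frames_i, freq_bins) array per noisy training
    utterance; shorter utterances receive trailing zero frames so that the
    whole batch shares the maximum frame number.
    """
    max_frames = max(s.shape[0] for s in spectra_list)
    padded = [np.pad(s, ((0, max_frames - s.shape[0]), (0, 0))) for s in spectra_list]
    return np.stack(padded)  # (batch, max_frames, freq_bins)
```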
According to a second aspect of the embodiments of the present disclosure, there is provided a voice quality evaluation apparatus including:
the acquisition module is used for acquiring the frequency spectrum information of a plurality of voice frames corresponding to the first voice signal;
a first feature extraction module, configured to extract, by using a pre-trained speech quality assessment network model, channel attention feature information of the plurality of speech frames for the spectrum information of the plurality of speech frames, and extract deep speech feature information of the plurality of speech frames based on the channel attention feature information of the plurality of speech frames;
the second characteristic extraction module is used for extracting time sequence related characteristic information among the plurality of voice frames based on the deep voice characteristic information;
and the predicting module is used for predicting the quality score of the first voice signal according to the time sequence related characteristic information among the plurality of voice frames.
Optionally, the voice quality assessment network model includes: an Efficient Channel Attention (ECA) module, a multi-level residual convolution module, and a bidirectional Gated Recurrent Unit (GRU) module;
the first feature extraction module is configured to:
aiming at the frequency spectrum information of the plurality of voice frames, extracting channel attention characteristic information of each channel of the plurality of voice frames by using the ECA module;
extracting deep voice feature information based on the channel attention feature information by using the multi-stage residual convolution module;
the extracting, based on the deep speech feature information, timing-related feature information between the plurality of speech frames includes:
and extracting time sequence related characteristic information corresponding to the deep voice characteristic information by utilizing the bidirectional GRU module.
Optionally, the multi-stage residual convolution module includes N residual convolution layers;
the residual convolutional layer includes: a first convolutional layer and a second convolutional layer which are cascaded; wherein an input of the first convolution layer is passed to an output of the first convolution layer by a residual unit;
and the N residual convolution layers are used for performing depthwise separable convolution processing on the channel attention feature information to obtain the deep speech feature information of the plurality of speech frames.
Optionally, the bidirectional GRU module comprises: a forward GRU sub-module and a backward GRU sub-module;
the second feature extraction module is configured to:
forward feature extraction is carried out on the deep voice feature information by utilizing the forward GRU sub-module to obtain forward time sequence feature information;
performing reverse feature extraction on the deep voice feature information by using the backward GRU sub-module to obtain reverse time sequence feature information;
and obtaining time sequence related characteristic information corresponding to the deep voice characteristic information based on the forward time sequence characteristic information and the reverse time sequence characteristic information.
Optionally, the voice quality assessment network model includes: a full connection module and a global average pooling GAP module;
the prediction module is configured to:
performing full connection processing on the time sequence related characteristic information of a plurality of voice frames of the first voice signal by using the full connection module to obtain quality scores of the plurality of voice frames;
and carrying out global average processing on the quality scores of the voice frames by utilizing the GAP module to obtain the quality score of the first voice signal.
Optionally, the obtaining module is configured to:
preprocessing the first voice signal to obtain a plurality of voice frames of the first voice signal;
and performing time-frequency conversion processing on the plurality of voice frames of the first voice signal to obtain the amplitude spectrum information of the plurality of voice frames.
Optionally, the obtaining module is configured to:
pre-emphasis processing is carried out on the first voice signal;
and performing frame division and windowing processing on the pre-emphasized first voice signal to obtain a plurality of voice frames of the first voice signal.
Optionally, the apparatus, comprising: a network training module to:
acquiring a training sample set of voice with noise and a quality scoring label of the training sample set;
inputting the frequency spectrum information of the voice frames corresponding to the plurality of voice signals with noise in the training sample set to an initial network model to be trained to obtain a prediction quality score output by the initial network model;
determining a loss function value of the initial network model according to the predicted quality score and the quality score label;
and adjusting the parameters to be trained of the initial network model based on the loss function values to obtain the voice quality evaluation network model.
Optionally, the network training module is configured to:
determining a maximum frame number of the plurality of noisy speech signals based on the spectrum information of the speech frames of the noisy speech signals;
determining whether the frame number of the frequency spectrum information of the plurality of voice signals with noise is the maximum frame number;
if the frame number of the frequency spectrum information of the voice signals with noise is not the maximum frame number, zero filling processing is carried out on the frequency spectrum information; the frame number of the frequency spectrum information after zero filling processing is the maximum frame number;
and inputting the frequency spectrum information after zero padding processing into the initial network model to obtain a prediction quality score output by the initial network model.
According to a third aspect of the embodiments of the present disclosure, there is provided a speech quality evaluation device including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the steps of the method according to the first aspect of the embodiments of the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium, wherein instructions, when executed by a processor of a speech quality assessment apparatus, enable the speech quality assessment apparatus to perform the steps of the method according to the first aspect of embodiments of the present disclosure.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
according to the method and the device, the first voice signal with noise and the frequency spectrum information of the voice frames corresponding to the first voice signal are obtained, the frequency spectrum information of the voice frames corresponding to the first voice signal is input to a pre-trained voice quality assessment network model, and the channel attention feature information of the voice frames is extracted by using the voice quality assessment network model, so that more important deep voice feature information of the voice frames can be extracted more effectively based on the channel attention feature information after channel attention is strengthened;
and based on the deep voice characteristic information of the voice frames, extracting time sequence related characteristic information among the voice frames, and considering the mutual influence among the voice frames with different time sequences of the first voice signal by utilizing the time sequence related characteristic information capable of reflecting the change condition of the deep voice characteristic information of the voice frames, the quality score of the first voice signal is more accurately predicted, a pure voice signal does not need to be obtained, and the application difficulty of voice quality evaluation is reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a flow chart illustrating a method for objectively evaluating speech quality without reference based on speech enhancement in the related art.
FIG. 2 is a first flowchart illustrating a method for speech quality assessment, according to an example embodiment.
Fig. 3 is a network diagram illustrating a residual convolutional layer according to an exemplary embodiment.
Fig. 4 is a second flowchart illustrating a speech quality assessment method according to an exemplary embodiment.
Fig. 5 is a third flowchart illustrating a speech quality assessment method according to an exemplary embodiment.
FIG. 6 is a block diagram illustrating a voice quality assessment network model in accordance with an exemplary embodiment.
Fig. 7 is a schematic structural diagram illustrating a speech quality assessment apparatus according to an exemplary embodiment.
Fig. 8 is a block diagram illustrating a speech quality assessment apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
With the development of information and communication technologies, intelligent voice technology based on voice signal transmission has become an important tool for information exchange and gradually permeates various scenarios in daily life, such as VR games, karaoke, online teaching, and online conferences. The quality of the voice signal directly affects semantic transmission, sound judgment, and the listening experience, and the loss of useful information or the interference of redundant information affects information exchange and voice signal processing.
In a communication system, technical problems such as network instability often cause packet loss, jitter, and other phenomena that degrade voice quality during transmission. Since voice packets carry a large amount of important information, poor voice quality greatly reduces communication quality and the user's listening experience, resulting in invalid or inefficient communication.
In the related art, speech quality assessment is generally divided into subjective evaluation methods and objective evaluation methods. A subjective evaluation method scores audio according to how it actually sounds to human listeners, so its results are realistic and reliable. However, because the subject of evaluation is a human listener, the results are easily affected by subjective awareness, the evaluation consumes considerable human and material resources, and its stability and repeatability are limited.
An objective evaluation method is a speech quality evaluation method implemented by a computer, such as ANIQUE, E-Model, and P.563. Compared with a subjective evaluation method, an objective evaluation method is highly repeatable and stable, can be applied in a variety of environments, and evaluates quickly.
Illustratively, as shown in fig. 1, fig. 1 is a flow chart of a no-reference objective speech quality assessment method based on speech enhancement provided in the related art.
Firstly, the speech to be evaluated is input into a trained speech enhancement model based on a Deep Belief Network (DBN) to obtain an enhanced speech signal; then, Mel cepstral coefficients of the speech signal before enhancement and after enhancement are extracted, and the coefficient difference is determined; finally, the coefficient difference is input into a Back Propagation (BP) network model to obtain the final objective score.
However, this method requires the log power spectrum information of the original clean speech to participate in training and scoring, which is unavailable in many cases; in addition, using a DBN network and a BP network for training and scoring involves many parameters to be trained, a slow training speed, and a low evaluation correlation.
An embodiment of the present disclosure provides a speech quality assessment method, as shown in fig. 2, fig. 2 is a first flowchart illustrating a speech quality assessment method according to an exemplary embodiment. The method comprises the following steps:
step S101, obtaining frequency spectrum information of a plurality of voice frames corresponding to a first voice signal;
step S102, aiming at the frequency spectrum information of the plurality of voice frames, extracting the channel attention feature information of the plurality of voice frames by utilizing a pre-trained voice quality evaluation network model, and extracting the deep voice feature information of the plurality of voice frames based on the channel attention feature information of the plurality of voice frames;
step S103, extracting time sequence related characteristic information among the plurality of voice frames based on the deep voice characteristic information;
step S104, predicting the quality score of the first voice signal according to the time sequence related characteristic information among the plurality of voice frames.
The voice quality evaluation method related in the embodiments of the present disclosure may be applied to an electronic device; here, the electronic device includes a terminal or a server, and the terminal may be a mobile phone, a tablet computer, a notebook computer, or the like; the server may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers.
In step S101, a first speech signal and spectrum information of a plurality of speech frames corresponding to the first speech signal are obtained.
The electronic device may be connected to an audio acquisition device to obtain a noisy first speech signal collected by the audio acquisition device, and then obtain the spectrum information of a plurality of speech frames of the first speech signal.
It should be noted that, in the related art, the evaluation and training of a speech quality assessment network model usually require clean original speech signals to guarantee the evaluation accuracy of the network model. In many application scenarios, however, original clean speech is difficult or even impossible to obtain, so the network model cannot be trained or evaluated, which limits its application scenarios.
In the embodiment of the present disclosure, the noisy first speech signal can be acquired directly, the spectrum information of its plurality of speech frames can be extracted, and the speech quality of the first speech signal can be evaluated based on that spectrum information; no clean speech signal is required, which broadens the application scenarios of the speech quality assessment network model.
In step S102, the spectrum information of the plurality of speech frames corresponding to the first speech signal is input into a pre-trained speech quality assessment network model, and the model performs feature extraction on the spectrum information of the plurality of speech frames to obtain the channel attention feature information of the plurality of speech frames.
It should be noted that the speech quality assessment network model includes a plurality of convolution kernels, each corresponding to one feature channel. The channel dependency between any two feature channels is learned through a channel attention mechanism, and different feature information is adjusted based on this dependency, so that different features are strengthened or suppressed to different degrees according to the importance of their channels.
After the channel attention feature information with strengthened channel attention is obtained, the deep speech feature information of the speech frames is extracted based on it, so that the speech quality assessment network model has better feature extraction performance and stronger generalization capability.
In step S103, based on the deep speech feature information of the speech frames, a speech quality assessment network model is used to extract timing-related feature information between the speech frames.
Here, the timing-related feature information is at least used to indicate a correlation between deep speech feature information of a plurality of speech frames at different timings.
The speech quality evaluation network model can comprise a recurrent neural network module, and the recurrent neural network module can be utilized to extract the time sequence related characteristic information among the plurality of speech frames.
It will be appreciated that the recurrent neural network is suitable for processing and predicting significant events in a time series; and extracting time sequence related characteristic information of a plurality of voice frames by utilizing a recurrent neural network module according to deep voice characteristic information of the voice frames with different time sequences.
In step S104, a quality score of the first speech signal is predicted by using a speech quality assessment network model based on timing-related feature information between a plurality of speech frames of the first speech signal.
The variation of the deep speech feature information of the plurality of speech frames can be determined from the timing-related feature information, so that the trend of the deep speech feature information over time is effectively predicted and the speech quality of the first speech signal is determined, improving the accuracy of the speech quality assessment of the first speech signal.
According to the present disclosure, a noisy first speech signal and the spectrum information of a plurality of speech frames corresponding to the first speech signal are acquired, the spectrum information of the plurality of speech frames is input into a pre-trained speech quality assessment network model, and the channel attention feature information of the speech frames is extracted by the model, so that the more important deep speech feature information of the speech frames can be extracted more effectively based on the channel attention feature information in which channel attention has been strengthened;
timing-related feature information among the speech frames is then extracted based on the deep speech feature information of the speech frames. Because the timing-related feature information reflects how the deep speech feature information of the speech frames changes over time, the mutual influence among speech frames of the first speech signal at different times is taken into account, and the quality score of the first speech signal is predicted more accurately; no clean speech signal needs to be obtained, which reduces the difficulty of applying speech quality evaluation.
Optionally, the voice quality assessment network model includes: an Efficient Channel Attention (ECA) module, a multi-level residual convolution module, and a bidirectional Gated Recurrent Unit (GRU) module;
in the step S102, extracting channel attention feature information of the plurality of speech frames by using a pre-trained speech quality assessment network model for the spectrum information of the plurality of speech frames, and extracting deep speech feature information of the plurality of speech frames based on the channel attention feature information of the plurality of speech frames, including:
aiming at the frequency spectrum information of the plurality of voice frames, extracting channel attention characteristic information of each channel of the plurality of voice frames by using the ECA module;
extracting deep voice feature information based on the channel attention feature information by using the multi-stage residual convolution module;
in step S103, extracting time-series relevant feature information between the plurality of speech frames based on the deep speech feature information includes:
and extracting time sequence related characteristic information corresponding to the deep voice characteristic information by utilizing the bidirectional GRU module.
In an embodiment of the present disclosure, the speech quality assessment network model includes an ECA module, a multi-level residual convolution module, and a bidirectional GRU module.
The spectrum information of the plurality of speech frames corresponding to the first speech signal is input into the ECA module of the speech quality assessment network model. The ECA module extracts channel attention information for the channels of the speech frames based on the spectrum information of the speech frames, and determines the channel attention feature information of the channels of the speech frames according to the channel attention information and the spectrum information.
Here, the channel attention information is at least used to indicate the degree of importance of the channels.
It should be noted that the ECA module implements local cross-channel information interaction through a one-dimensional convolution with an adaptively chosen kernel size. The kernel size k represents the coverage of local cross-channel interaction: each channel interacts with its k neighbors, the correlation among different channels is modeled, the importance of each channel is learned automatically by the network, and finally each channel is given a different weight coefficient, so that important features are strengthened and unimportant features are suppressed.
It is to be understood that the channel attention information may include weight coefficients of the channels; the ECA module performs channel attention enhancement on the spectrum information of the speech frames based on the weight coefficients of the channels, obtaining the channel attention feature information of the channels of the speech frames.
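A minimal PyTorch sketch of the ECA mechanism described above is given below; the adaptive kernel-size rule and the tensor layout follow the common ECA-Net formulation and are assumptions, not details taken from the patent.

```python
import math
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """Efficient Channel Attention over a (batch, channels, frames, bins) feature map."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Adaptive kernel size: grows with log2(channels), forced to be odd.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                  # x: (B, C, T, F)
        y = self.pool(x)                                   # (B, C, 1, 1) channel descriptors
        y = y.squeeze(-1).transpose(1, 2)                  # (B, 1, C)
        y = self.conv(y)                                   # local cross-channel interaction over k neighbors
        y = self.sigmoid(y).transpose(1, 2).unsqueeze(-1)  # (B, C, 1, 1) per-channel weights
        return x * y                                       # channel attention feature information
```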
The channel attention feature information of the channels of the speech frames output by the ECA module is input into the multi-level residual convolution module, which performs depth feature extraction on it to obtain the deep speech feature information of the plurality of speech frames.
By using the multi-level residual convolution module, the network depth of the speech quality assessment network model is increased and its evaluation performance is improved; the module also provides skip connections in the network structure, which alleviates the vanishing-gradient problem caused by an overly deep network and effectively improves the feature extraction performance of the speech quality assessment network model.
It should be noted that the performance of a convolutional neural network is strongly related to its depth, and a deeper structure can improve recognition. In practice, however, once a convolutional neural network reaches a certain depth, model performance stops improving and may even degrade, a phenomenon attributed to vanishing gradients. The multi-level residual convolution module therefore provides skip connections: in a deep convolutional neural network, the output of some layers can be passed directly across intermediate layers to later layers, which alleviates the vanishing-gradient problem caused by excessive depth and effectively improves network performance.
The deep speech feature information of the speech frames output by the multi-level residual convolution module is input into the bidirectional GRU module, which extracts the timing-related feature information of the plurality of speech frames.
It should be noted that the hidden state of a unidirectional GRU module propagates in a single direction, from front to back, so the current state is related only to the preceding context (i.e., the forward timing correlation). In speech quality assessment, however, the current state is often more informative when combined with both preceding and following context.
The basic idea of the bidirectional GRU module is to stack two unidirectional GRU sub-modules so as to obtain, for the current state, both the preceding-context information (i.e., the forward timing feature) and the following-context information (i.e., the backward timing feature), and to obtain the complete context of the current state through information fusion.
The output dimension of a bidirectional GRU module is twice that of a unidirectional GRU module, so the bidirectional GRU module has stronger expressive power than the unidirectional GRU module.
Optionally, the multi-stage residual convolution module includes N residual convolution layers; wherein the residual convolutional layer comprises: a first convolutional layer and a second convolutional layer which are cascaded; the input of the first convolution layer is passed to the output of the first convolution layer by a residual unit;
the multi-stage residual convolution module for evaluating the network model by using the voice quality extracts deep voice characteristic information based on the channel attention characteristic information, and comprises the following steps:
and performing deep separable convolution processing on the channel attention characteristic information by using the N residual convolution layers to obtain deep voice characteristic information of the voice frames.
In an embodiment of the present disclosure, the multi-level residual convolution module includes N residual convolution layers arranged in cascade.
Each residual convolution layer includes a first convolutional layer and a second convolutional layer which are cascaded.
It can be understood that the output of the first convolutional layer serves as the input of the second convolutional layer. Because features may be lost in this process, the input of the first convolutional layer (i.e., the original feature information) is passed to its output through a residual unit, and the input and the output of the first convolutional layer together serve as the input of the second convolutional layer, thereby supplementing the features.
In some embodiments, as shown in fig. 3, fig. 3 is a network structure diagram of a residual convolutional layer according to an exemplary embodiment. The first convolutional layer includes two depthwise separable convolutional layers, and the second convolutional layer includes one depthwise separable convolutional layer.
The output of the first convolutional layer is added to the input of the first convolutional layer through the residual unit, and the sum is input into the second convolutional layer.
Here, each depthwise separable convolutional layer consists of one depthwise convolutional layer and one 1×1 pointwise convolutional layer; the kernels of the depthwise convolutional layers are all 3×3, their strides are (1,1), (1,2), and (1,3) in sequence, and ReLU is used as the activation function of each convolutional layer.
A depthwise separable convolutional layer performs the spatial convolution while keeping the channels independent, i.e., a separate convolution is applied to each channel to obtain the feature information of the different channels; the pointwise convolution then recombines the feature information of the different channels and stacks it together again. It can be understood that the depthwise separable convolutional layer uses the depthwise convolution to realize the spatial convolution while keeping the channels independent.
In the embodiment of the present disclosure, the N residual convolution layers determine attention-enhanced spectrum information based on the channel attention feature information, and apply depthwise separable convolution processing to the attention-enhanced spectrum information through the depthwise separable convolutional layers in the residual convolution layers, obtaining the deep speech feature information of the plurality of speech frames.
On the one hand, the depthwise separable convolutional layers map cross-channel correlation and spatial correlation separately, which reduces complexity and the number of training parameters, improves computational efficiency, and keeps the network model lightweight. On the other hand, the residual unit in the residual convolution layer passes the input of the first convolutional layer to the second convolutional layer, combining it with the direct mapping and improving the feature extraction capability.
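The sketch below illustrates one residual convolution layer built from depthwise separable convolutions, as described above; channel counts are illustrative, strides are fixed to 1 so the skip addition needs no projection, and any departure from the patent's exact figure (which is not reproduced here) should be treated as an assumption.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution, with ReLU."""
    def __init__(self, in_ch, out_ch, stride=(1, 1)):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class ResidualConvLayer(nn.Module):
    """First conv layer (two depthwise separable convs) with a skip connection, then a second conv layer."""
    def __init__(self, channels):
        super().__init__()
        self.first = nn.Sequential(
            DepthwiseSeparableConv(channels, channels),
            DepthwiseSeparableConv(channels, channels),
        )
        self.second = DepthwiseSeparableConv(channels, channels)

    def forward(self, x):
        y = self.first(x) + x   # residual unit: input of the first layer added to its output
        return self.second(y)   # the sum feeds the second convolutional layer
```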
Optionally, the bidirectional GRU module comprises: a forward GRU sub-module and a backward GRU sub-module;
the extracting, by using the bidirectional GRU module, the time sequence related feature information corresponding to the deep speech feature information includes:
forward feature extraction is carried out on the deep voice feature information by utilizing the forward GRU sub-module to obtain forward time sequence feature information;
performing reverse feature extraction on the deep voice feature information by using the backward GRU sub-module to obtain reverse time sequence feature information;
and obtaining time sequence related characteristic information corresponding to the deep voice characteristic information based on the forward time sequence characteristic information and the reverse time sequence characteristic information.
In an embodiment of the present disclosure, the bidirectional GRU module includes a forward GRU sub-module and a backward GRU sub-module. It can be understood that, at each time step, the input of the bidirectional GRU module is provided to both the forward GRU sub-module and the backward GRU sub-module, and the output of the bidirectional GRU module is jointly determined by the forward GRU sub-module and the backward GRU sub-module.
It should be noted that a GRU sub-module consists of an update gate and a reset gate. The update gate controls how much of the state information from the previous time step is brought into the current state; the larger the value of the update gate, the more previous state information is brought in. It can be understood that the update gate helps capture long-term dependencies in the time series.
The reset gate controls how much information from the previous state is written into the current candidate set; the smaller the reset gate, the less previous state information is written. It can be understood that the reset gate helps capture short-term dependencies in the time series.
The deep speech feature information output by the multi-level residual convolution module is input into the bidirectional GRU module, and the forward GRU sub-module performs forward feature extraction on the deep speech features to obtain forward timing feature information.
It can be understood that the forward GRU sub-module performs a forward GRU computation on the deep speech feature information to obtain the forward timing feature information; here, the forward timing feature information indicates the relation of a speech frame to its preceding context.
The backward GRU sub-module performs backward feature extraction on the deep speech features to obtain backward timing feature information.
It can be understood that the backward GRU sub-module performs a backward GRU computation on the deep speech feature information to obtain the backward timing feature information; here, the backward timing feature information indicates the relation of a speech frame to its following context.
The forward timing feature information and the backward timing feature information are fused to obtain the timing-related feature information of the deep speech feature information.
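A minimal sketch of this step, assuming a PyTorch nn.GRU over the frame axis with illustrative dimensions, is shown below; PyTorch concatenates the forward and backward hidden states, which is one way to realize the fusion described above and doubles the output dimension.

```python
import torch
import torch.nn as nn

class BiGRUTemporal(nn.Module):
    """Bidirectional GRU extracting timing-related features from frame-wise deep speech features."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):        # x: (batch, frames, input_dim)
        out, _ = self.gru(x)     # out: (batch, frames, 2 * hidden_dim), forward + backward states
        return out
```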
Optionally, the voice quality assessment network model includes: a full connection module and a global average pooling GAP module;
the predicting the quality score of the first speech signal according to the timing-related characteristic information between the speech frames in the step S104 includes:
performing full connection processing on the time sequence related characteristic information of a plurality of voice frames of the first voice signal by using the full connection module to obtain quality scores of the plurality of voice frames;
and carrying out global average processing on the quality scores of the voice frames by utilizing the GAP module to obtain the quality score of the first voice signal.
In an embodiment of the present disclosure, the speech quality assessment network model further includes a fully connected module and a global average pooling (GAP) module.
After the bidirectional GRU module outputs the timing-related feature information, the fully connected module performs full-connection processing on the timing-related feature information to obtain the quality scores corresponding to the plurality of speech frames.
The GAP module then pools the quality scores corresponding to the plurality of speech frames to obtain the quality score of the first speech signal.
It can be understood that the GAP module performs global averaging over the quality scores corresponding to the plurality of speech frames of the first speech signal, and outputs the resulting quality score of the first speech signal.
According to the embodiment of the disclosure, applying full-connection processing to the timing-related feature information through the fully connected module adds frame-level quality scoring, and determining the quality score of the speech signal based on the frame-level scores makes the evaluation process more stable and reduces the error between the predicted score and the true score.
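The following sketch shows frame-level scoring followed by global averaging, assuming a single linear layer as the fully connected module; when training batches are zero-padded, the average would in practice need to exclude the padded frames, a detail omitted here.

```python
import torch
import torch.nn as nn

class FrameScoreHead(nn.Module):
    """Per-frame quality scores via a fully connected layer, averaged into an utterance score."""
    def __init__(self, feature_dim):
        super().__init__()
        self.fc = nn.Linear(feature_dim, 1)

    def forward(self, x):                           # x: (batch, frames, feature_dim)
        frame_scores = self.fc(x).squeeze(-1)       # (batch, frames) frame-level quality scores
        utterance_score = frame_scores.mean(dim=1)  # global average pooling over frames
        return frame_scores, utterance_score
```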
Optionally, the obtaining of the spectrum information of the plurality of speech frames corresponding to the first speech signal includes:
preprocessing the first voice signal to obtain a plurality of voice frames of the first voice signal;
and performing time-frequency conversion processing on the plurality of voice frames of the first voice signal to obtain the amplitude spectrum information of the plurality of voice frames.
In the embodiment of the present disclosure, a noisy first speech signal may be acquired by an audio acquisition device, and the first speech signal may be preprocessed to obtain a plurality of speech frames of the first speech signal.
Here, the preprocessing includes at least framing. It can be understood that framing divides the speech signal to be evaluated into a plurality of speech frames of fixed size; in general, since a speech signal is short-time stationary, the frame length of a speech frame can be set within the short-time stationary period of the speech signal, i.e., the frame length is set to 10 ms to 30 ms.
It should be noted that a speech signal is a non-stationary signal that varies with time but exhibits short-time stationarity, so the speech signal can be divided into short speech frames within which it exhibits some characteristics of a periodic function.
To ensure a smooth transition between adjacent speech frames, i.e., to eliminate the signal discontinuities that may arise at the two ends of each speech frame, an overlapping framing method may be used.
After the plurality of speech frames of the first speech signal are obtained, time-frequency conversion is performed on the plurality of speech frames to obtain the amplitude spectrum information of the plurality of speech frames.
Here, the time-frequency conversion may be a Fourier transform: by applying a Fourier transform to the plurality of speech frames, frames in the time domain are converted into frames in the frequency domain. The Fourier transform may be a discrete Fourier transform, a fast Fourier transform, a short-time Fourier transform, or the like, chosen according to the specific scenario; the embodiments of the disclosure are not limited in this respect.
In the embodiment of the present disclosure, the magnitude spectrum information of a plurality of speech frames of the first speech signal may be obtained by performing discrete fourier transform on the speech frames; the transformation formula is as follows:
X_m(k) = | Σ_{n=0}^{N-1} x_m(n)·e^(-j2πnk/N) |, 0 ≤ k ≤ N-1;
where X_m(k) is the amplitude spectrum feature information of the mth speech frame, x_m(n) is the mth speech frame, and N is the frame length.
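A small NumPy sketch of this per-frame transform is shown below; the real FFT is used as an equivalent of the DFT for real-valued frames, and the frame length of 512 samples is only an illustrative value.

```python
import numpy as np

def magnitude_spectrum(frame: np.ndarray) -> np.ndarray:
    """Amplitude spectrum |X_m(k)| of one windowed speech frame via the DFT (real FFT)."""
    return np.abs(np.fft.rfft(frame))

# Example: 100 dummy frames of N = 512 samples -> (100, 257) amplitude spectra.
frames = np.random.randn(100, 512)
spectra = np.stack([magnitude_spectrum(f) for f in frames])
```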
In some embodiments, the method further comprises:
filtering the amplitude spectrum information of the plurality of voice frames to obtain the amplitude spectrum characteristic information of the plurality of voice frames; and inputting the amplitude spectrum characteristic information of the plurality of voice frames into the voice quality evaluation network model.
In the embodiment of the present disclosure, after the Fourier transform of each speech frame of the first speech signal, filtering may be performed with a Mel filter bank, thereby obtaining the amplitude spectrum feature information corresponding to each speech frame.
It should be noted that Mel filtering applies the spectrum information to a set of Mel-scale triangular filters for filtering; it can be understood that Mel filtering the spectrum information of a speech frame makes the filtered spectrum information better match the auditory characteristics of the human ear.
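A sketch of this Mel filtering step follows, using librosa to build the triangular filter bank; the sampling rate, FFT size, and number of Mel bands are illustrative assumptions rather than values from the patent.

```python
import numpy as np
import librosa

sr, n_fft, n_mels = 16000, 512, 40                                # illustrative parameters
frames = np.random.randn(100, n_fft)                              # dummy windowed speech frames
spectra = np.abs(np.fft.rfft(frames, axis=-1))                    # (100, n_fft // 2 + 1) amplitude spectra
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)   # Mel-scale triangular filter bank
mel_features = spectra @ mel_fb.T                                 # (100, n_mels) Mel-filtered spectrum features
```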
Optionally, the preprocessing the first speech signal to obtain a plurality of speech frames of the first speech signal includes:
pre-emphasis processing is carried out on the first voice signal;
and performing frame division and windowing processing on the pre-emphasized first voice signal to obtain a plurality of voice frames of the first voice signal.
In an embodiment of the present disclosure, the preprocessing includes pre-emphasis and framing-and-windowing.
Pre-emphasizing the first speech signal compensates the amplitude of its high-frequency part, enhancing the high-frequency components of the first speech signal.
It should be noted that most of the energy of a speech signal is concentrated in the low-frequency part, which lowers the signal-to-noise ratio of the high-frequency part of the speech signal; pre-emphasizing the speech signal compensates the suppressed high-frequency part, flattens the spectrum of the speech signal, and removes the spectral tilt.
In some embodiments, the first speech signal may be pre-emphasized by a high pass filter having a functional expression:
H(z) = 1 - μz^(-1)
where z is the z-transform variable of the first speech signal and μ is the pre-emphasis coefficient, which typically ranges from 0.9 to 1.
Pre-emphasizing the first speech signal highlights its high-frequency part, which helps reduce the attenuation loss of the first speech signal.
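A one-line NumPy sketch of this pre-emphasis filter is given below; the coefficient 0.97 is a commonly used illustrative value within the 0.9 to 1 range mentioned above.

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """Apply y[n] = x[n] - mu * x[n-1], i.e. the high-pass filter H(z) = 1 - mu * z^(-1)."""
    return np.append(signal[0], signal[1:] - mu * signal[:-1])
```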
The pre-emphasized first speech signal is then divided into frames and windowed to obtain a plurality of consecutive speech frames.
Framing and windowing can be realized by weighting the speech signal with a sliding fixed-length window function. To keep the processed speech signal short-time stationary, windowing is used to reduce spectral leakage of the speech signal, i.e., to prevent side lobes from appearing outside the original main lobe and being mistaken for spurious peaks, so that frequency components that do not exist in the original signal do not appear.
In some embodiments, the frame windowed speech signal may be represented as:
x_m(n) = w(n) · x(m + n),  0 ≤ n ≤ N−1;
wherein x_m(n) is the mth speech frame, w(n) is the window function, x(n) is the speech signal, and N is the frame length.
It should be noted that the narrower the main lobe of the window and the smaller and faster-decaying its side lobes, the better the windowing suppresses spectral energy leakage. The rectangular window commonly used in signal processing has a narrow main lobe but large, slowly decaying side lobes; therefore, the embodiment of the present disclosure may adopt a Hamming window, whose main lobe is slightly wider but whose side lobes are small and decay quickly, to perform framing and windowing on the voice signal.
The formula for the Hamming window is as follows:
w(n) = 0.54 − 0.46 · cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1.
the Hamming window is used for carrying out frame division and window adding processing on the first voice signal, so that a plurality of continuous voice frames can be obtained, and the influence of signal discontinuity possibly caused by two ends of each voice frame can be eliminated.
Optionally, the voice quality assessment network model includes: a rearrangement module, configured to perform dimension transformation processing on the input spectrum information and/or deep voice feature information;
the extracting, by the ECA module, channel attention feature information of each channel of the plurality of speech frames with respect to the spectrum information of the plurality of speech frames includes:
aiming at the frequency spectrum information of the plurality of voice frames, carrying out dimension transformation processing on the frequency spectrum information by using the rearrangement module, inputting the frequency spectrum information after the dimension transformation processing into the ECA module, and extracting channel attention characteristic information of each channel of the plurality of voice frames; the dimensionality of the spectrum information of the plurality of voice frames after the dimensionality transformation processing is matched with the input dimensionality of the ECA module;
the extracting, by using the bidirectional GRU module, the time sequence related feature information corresponding to the deep speech feature information includes:
performing dimension conversion processing on the deep voice feature information by using the rearrangement module, inputting the deep voice feature information subjected to the dimension conversion processing into the bidirectional GRU module, and extracting time sequence related feature information corresponding to the deep voice feature information subjected to the dimension conversion processing; and the dimension of the depth voice feature information after the dimension transformation processing is matched with the input dimension of the bidirectional GRU module.
In an embodiment of the present disclosure, the speech quality assessment network model may include 2 rearrangement modules; one rearrangement module may be applied before the ECA module, and the other between the multi-stage residual convolution module and the bidirectional GRU module.
The rearrangement modules perform dimension transformation processing on the input spectrum information and/or deep voice feature information, so that the output dimension of the preceding stage matches the input dimension of the following module (namely, the ECA module or the bidirectional GRU module).
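A rearrangement module of this kind can be sketched as a simple permutation layer in PyTorch; the specific permutation used by the disclosure is not given, so the order below is only an illustrative assumption.

```python
import torch
import torch.nn as nn

class Rearrange(nn.Module):
    """Dimension-transform ("rearrangement") layer: permutes tensor dimensions so that
    the output of the previous stage matches the input layout of the next module."""

    def __init__(self, dims):
        super().__init__()
        self.dims = dims  # e.g. (0, 2, 1) to swap the channel and frame axes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.permute(*self.dims).contiguous()

# Example: map (batch, channels, frames) to (batch, frames, channels) before the GRU stage.
to_gru_layout = Rearrange((0, 2, 1))
```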
Optionally, before obtaining the spectrum information of a plurality of speech frames corresponding to the first speech signal, the method includes:
acquiring a training sample set of voice with noise and a quality scoring label of the training sample set;
inputting the frequency spectrum information of the voice frames corresponding to the plurality of voice signals with noise in the training sample set to an initial network model to be trained to obtain a prediction quality score output by the initial network model;
determining a loss function value of the initial network model according to the predicted quality score and the quality score label;
and adjusting the parameters to be trained of the initial network model based on the loss function values to obtain the voice quality evaluation network model.
In the embodiment of the present disclosure, the training sample set includes a plurality of noisy speech signals for adjusting the parameters to be trained in the initial network model; the quality scoring label at least includes the subjective quality score corresponding to each noisy speech signal.
After the training sample set is obtained, preprocessing and time-frequency conversion processing can be performed on the plurality of noisy speech signals in the training sample set, so as to obtain the frequency spectrum information of the voice frames corresponding to the noisy speech signals; the initial network model is then trained in a fully supervised manner by using the frequency spectrum information of the voice frames corresponding to the plurality of noisy speech signals and the quality score labels of the plurality of noisy speech signals.
The frequency spectrum information of the voice frames corresponding to the plurality of noisy speech signals can be input into the initial network model; the ECA module of the initial network model is used to extract channel attention feature information, and the multi-stage residual convolution module of the initial network model is used to extract the deep voice feature information of the voice frames of the noisy speech signals based on the channel attention feature information; the deep voice feature information is input into the bidirectional GRU module, which extracts the time sequence related feature information among the voice frames; finally, the time sequence related feature information is subjected to full connection processing and average pooling through the full connection module and the GAP module, so as to obtain the prediction quality scores of the plurality of noisy speech signals.
The loss function value of the initial network model is determined according to the difference between the predicted quality scores of the noisy speech signals and their quality score labels; the back-propagation gradient is calculated from the loss function value, the parameters to be trained in the initial network model are updated, and the above steps are repeated until the set number of iterations is reached or the loss function converges, so as to obtain the voice quality evaluation network model.
It can be understood that, in the embodiment of the present disclosure, a network model is trained by using a noisy speech signal including a quality score tag of a subjective quality score, and the speech quality assessment network model obtained by training performs speech quality assessment on the speech signal, so that the correlation between the obtained prediction quality score and the subjective quality score is high, and the speech quality assessment performance is good.
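The training loop described above can be summarized by the following PyTorch sketch; the use of a mean-squared-error loss between the predicted and labelled scores is an assumption, since the disclosure only states that a loss function value is computed from the two.

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               spectra: torch.Tensor, score_labels: torch.Tensor) -> float:
    """One supervised training step of the initial network model."""
    optimizer.zero_grad()
    pred_scores = model(spectra)                     # predicted quality scores, shape (batch,)
    loss = nn.functional.mse_loss(pred_scores, score_labels)
    loss.backward()                                  # back-propagate the gradient
    optimizer.step()                                 # update the parameters to be trained
    return loss.item()
```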
Optionally, the inputting the spectrum information of the speech frames corresponding to the plurality of noisy speech signals in the training sample set to the initial network model to be trained to obtain the prediction quality score output by the initial network model includes:
determining the maximum frame number of the voice signals with noise based on the frequency spectrum information of the voice frames of the voice signals with noise;
determining whether the frame number of the frequency spectrum information of the plurality of voice signals with noise is the maximum frame number;
if the frame number of the frequency spectrum information of the voice signals with noise is not the maximum frame number, zero filling processing is carried out on the frequency spectrum information; the frame number of the frequency spectrum information after zero filling processing is the maximum frame number;
and inputting the frequency spectrum information after zero filling processing into the initial network model to obtain the prediction quality score output by the initial network model.
In the embodiment of the disclosure, in the training stage of the network model, a plurality of noisy speech signals in a training sample set are trained in batch; considering that the lengths of a plurality of voice signals with noise may be different, in order to match the dimensionality of the frequency spectrum information of the voice frames corresponding to the plurality of voice signals with noise, the maximum frame number can be determined from the plurality of voice signals with noise; determining whether the frame number of the frequency spectrum information of the plurality of voice signals with noise is the maximum frame number; and if the frame number of the frequency spectrum information of a certain voice signal with noise is smaller than the maximum frame number, performing zero filling processing on the frequency spectrum information of the voice signal with noise, so that the dimensionality of the frequency spectrum information of the voice frames corresponding to the voice signal with noise after the zero filling processing is the same.
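The zero-padding of a training batch to its maximum frame number can be sketched as follows (the helper name and the (frames, bins) layout are assumptions for illustration):

```python
import numpy as np

def pad_batch(spectra):
    """Zero-pad per-utterance spectrograms of shape (num_frames, num_bins)
    to the maximum frame number found in the batch."""
    max_frames = max(s.shape[0] for s in spectra)
    padded = [np.pad(s, ((0, max_frames - s.shape[0]), (0, 0))) for s in spectra]
    return np.stack(padded)  # shape: (batch, max_frames, num_bins)
```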
The present disclosure also provides the following embodiments:
fig. 4 is a flowchart illustrating a second speech quality assessment method according to an exemplary embodiment, where as shown in fig. 4, the method includes:
step S201, obtaining a training sample set of a voice with noise and a quality scoring label of the training sample set;
in this example, the training sample set includes a plurality of noisy speech signals, and each of the noisy speech signals corresponds to a quality score label.
Here, the quality score label is at least used for indicating a subjective quality average score corresponding to the noisy speech signal.
The method comprises the steps of obtaining a training sample set containing a plurality of noisy voice signals and quality scoring labels corresponding to the training sample set, and training a network model in a fully supervised mode by using the training sample set and the quality scoring labels.
Step S202, determining the maximum frame number of a plurality of voice signals with noise based on the frequency spectrum information of the voice frames corresponding to the voice signals with noise in the training sample set; determining whether the frame number of the frequency spectrum information of the plurality of voice signals with noise is the maximum frame number; if the frame number of the frequency spectrum information of the plurality of voice signals with noise is not the maximum frame number, zero filling processing is carried out on the frequency spectrum information; the frame number of the frequency spectrum information after zero filling processing is the maximum frame number;
in this example, after a training sample set is obtained, batch preprocessing needs to be performed on a plurality of noisy speech signals in the training sample set to obtain speech frames corresponding to the plurality of noisy speech signals; and extracting the frequency spectrum information of the voice frames corresponding to the plurality of voice signals with noise based on the voice frames corresponding to the plurality of voice signals with noise.
Considering that the lengths of a plurality of voice signals with noise may be different, when training the network model, batch training is performed on the plurality of voice signals with noise in a training sample set, so in order to make the dimensionality of the frequency spectrum information of the voice frames corresponding to the plurality of voice signals with noise uniform, the maximum frame number can be determined from the plurality of voice signals with noise in the same training batch; and performing zero filling processing on the voice signals which do not meet the maximum frame number in the plurality of voice signals with noises, so that the dimensionality of the frequency spectrum information of the voice frames corresponding to the plurality of voice signals with noises after the zero filling processing is the same.
Step S203, inputting the frequency spectrum information after zero filling processing into an initial network model to obtain a prediction quality score output by the initial network model;
in this example, the spectrum information of the plurality of noisy speech signals after zero padding processing is input into an initial network model, and the predicted quality scores corresponding to the plurality of noisy speech signals output by the initial network model are obtained.
Step S204, determining a loss function value of the initial network model according to the prediction quality score and the quality score label; adjusting the parameters to be trained of the initial network model based on the loss function values to obtain the voice quality evaluation network model;
in this example, the loss function value of the initial network model may be determined according to the predicted quality scores of the plurality of noisy speech signals output by the initial network model and the quality score labels corresponding to the plurality of noisy speech signals; optimizing the parameters to be trained of the initial network model by judging whether the loss function values meet the training stopping conditions or not, if not, repeating the training steps until the loss function values meet the training stopping conditions, and thus obtaining the voice quality evaluation network model; and subsequently, the trained voice quality assessment network model can be directly utilized to carry out voice quality assessment on the voice signals.
Step S205, acquiring a first voice signal, and preprocessing the first voice signal to obtain a plurality of voice frames of the first voice signal; performing time-frequency conversion processing on a plurality of voice frames of the first voice signal to obtain amplitude spectrum information of the plurality of voice frames;
in some embodiments, the preprocessing the first speech signal to obtain a plurality of speech frames of the first speech signal includes:
pre-emphasis processing is carried out on the first voice signal;
and performing frame division and windowing processing on the pre-emphasized first voice signal to obtain a plurality of voice frames of the first voice signal.
Step S206, inputting the amplitude spectrum information of the plurality of voice frames into a pre-trained voice quality evaluation network model, and extracting channel attention feature information of each channel of the plurality of voice frames by utilizing an effective channel attention (ECA) module of the voice quality evaluation network model;
in this example, the ECA module completes cross-channel information interaction in a very lightweight manner, determines the importance degree of each feature channel, and obtains channel attention feature information after enhancing channel attention, so that the network model has better feature extraction performance and stronger generalization capability.
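A typical lightweight ECA block of this kind can be sketched in PyTorch as follows; the kernel size of the 1-D convolution is an assumption (ECA-style designs often derive it from the channel count), and the sketch is not the disclosure's exact module.

```python
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """Effective/efficient channel attention: lightweight cross-channel interaction."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, bins)
        y = self.avg_pool(x)                                  # (B, C, 1, 1) per-channel descriptor
        y = y.squeeze(-1).transpose(-1, -2)                   # (B, 1, C)
        y = self.conv(y)                                      # local cross-channel interaction
        y = self.sigmoid(y).transpose(-1, -2).unsqueeze(-1)   # (B, C, 1, 1) channel weights
        return x * y                                          # re-weight each channel
```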
In some embodiments, the speech quality assessment network model further comprises: a rearrangement module located before the ECA module; the rearrangement module is used for performing dimension transformation processing on the input spectrum information, and the dimension of the spectrum information of the plurality of voice frames after the dimension transformation processing is matched with the input dimension of the ECA module.
Step S207, extracting deep voice characteristic information based on the channel attention characteristic information by utilizing a multi-stage residual convolution module of the voice quality evaluation network model;
in this example, the multi-stage residual convolution module includes N residual convolution layers;
the residual convolutional layer includes: a first convolutional layer and a second convolutional layer which are cascaded; wherein an input of the first convolution layer is passed to an output of the first convolution layer by a residual unit;
and the N residual convolution layers are used for carrying out deep separable convolution processing on the channel attention characteristic information to obtain deep voice characteristic information of the voice frames.
In this example, the channel attention feature information is subjected to deep separable convolution processing by using a residual convolution layer, and the correlation and the spatial correlation between feature channels are mapped separately, so that the operation complexity of the network model is reduced, and the network model is lighter.
Considering that the gradient may vanish when the network depth of the multi-stage residual convolution module becomes too large, the residual unit passes the input of the first convolutional layer directly to its output (i.e., to the input of the second convolutional layer), realizing a direct (identity) mapping and improving the feature extraction capability of the network model.
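One way to read this residual layer is sketched below: two cascaded depthwise-separable convolutions with a skip connection from the input of the first convolutional layer to its output. The channel count, kernel size, and the exact point where the skip re-enters are assumptions based on the description, not definitive design choices.

```python
import torch
import torch.nn as nn

class ResidualDSConvLayer(nn.Module):
    """Residual layer built from two cascaded depthwise-separable 2-D convolutions."""

    def __init__(self, channels: int = 32, kernel_size: int = 3):
        super().__init__()

        def ds_conv():
            return nn.Sequential(
                # Depthwise convolution: spatial correlation, one filter per channel.
                nn.Conv2d(channels, channels, kernel_size,
                          padding=kernel_size // 2, groups=channels),
                # Pointwise convolution: cross-channel correlation.
                nn.Conv2d(channels, channels, 1),
                nn.BatchNorm2d(channels),
                nn.ReLU(),
            )

        self.conv1 = ds_conv()
        self.conv2 = ds_conv()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv1(x) + x   # residual unit: input of conv1 passed to its output
        return self.conv2(h)
```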
Step S208, extracting time sequence related characteristic information corresponding to the deep voice characteristic information by using a bidirectional GRU module of the voice quality evaluation network model;
in this example, the bidirectional GRU module comprises: a forward GRU sub-module and a backward GRU sub-module;
forward feature extraction can be carried out on the deep voice feature information by utilizing the forward GRU sub-module to obtain forward time sequence feature information; performing reverse feature extraction on the deep voice feature information by using the backward GRU sub-module to obtain reverse time sequence feature information; and obtaining time sequence related characteristic information corresponding to the deep voice characteristic information based on the forward time sequence characteristic information and the reverse time sequence characteristic information.
Here, the dropout parameter of both the forward GRU sub-module and the backward GRU sub-module is set to 0.35.
In this example, the bidirectional GRU module is formed by combining a forward GRU sub-module and a backward GRU sub-module; forward time sequence feature information (i.e., information from the preceding context) is acquired by the forward GRU sub-module, and reverse time sequence feature information (i.e., information from the following context) is acquired by the backward GRU sub-module; complete context information of the plurality of voice frames is obtained by fusing the forward and reverse time sequence feature information.
It will be appreciated that the use of a bi-directional GRU module enables a simpler and more efficient learning of the bi-directional dependency at time steps, modelling the bi-directional temporal dependency of perceived speech quality.
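A compact PyTorch sketch of such a bidirectional GRU stage is given below; the hidden size is an assumption, and the 0.35 dropout mentioned in the text is applied here to the GRU outputs as one possible placement.

```python
import torch
import torch.nn as nn

class BiGRUStage(nn.Module):
    """Bidirectional GRU over the frame axis; concatenates forward and backward states."""

    def __init__(self, input_size: int, hidden_size: int = 128):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size,
                          batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.35)  # dropout value taken from the text

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, features)
        out, _ = self.gru(x)             # (batch, frames, 2 * hidden_size)
        return self.dropout(out)
```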
In some embodiments, the rearrangement module is further located between the multi-stage residual convolution module and the bidirectional GRU module, and configured to perform dimension transformation processing on the input deep speech feature information, where a dimension of the depth speech feature information after the dimension transformation processing is matched with an input dimension of the bidirectional GRU module.
Step S209, using the full-connection module of the voice quality assessment network model to perform full-connection processing on the time sequence related characteristic information of the multiple voice frames of the first voice signal, so as to obtain quality scores of the multiple voice frames.
In this example, the quality score is determined on a frame-by-frame basis using a fully connected module, resulting in a quality score for each speech frame in the first speech signal.
Step S210, performing global average processing on the quality scores of the plurality of speech frames by using a GAP module of the speech quality assessment network model, to obtain the quality score of the first speech signal.
It should be noted that the GAP module is generally used to perform averaging processing on elements on the entire feature map; in this example, the GAP module performs global average processing on the quality scores of the multiple voice frames of the first voice signal to obtain an overall quality score of the first voice signal, and outputs the overall quality score.
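Steps S209 and S210 together amount to a frame-wise linear scoring head followed by averaging over frames, as in the sketch below; the feature dimension is an assumption for illustration.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Fully-connected frame scoring followed by global average pooling over frames."""

    def __init__(self, feature_dim: int = 256):
        super().__init__()
        self.fc = nn.Linear(feature_dim, 1)    # one quality score per frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feature_dim)
        frame_scores = self.fc(x).squeeze(-1)  # (batch, frames) frame-level scores
        return frame_scores.mean(dim=-1)       # GAP over frames -> utterance-level score
```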
It can be understood that the method for performing speech quality assessment using the speech quality assessment network model can be divided into two stages: a network model training stage and a voice quality evaluation stage; each stage needs to perform preprocessing and time-frequency conversion processing on the noisy speech signal to acquire the required spectrum information.
As shown in fig. 5, fig. 5 is a third flowchart illustrating a speech quality assessment method according to an exemplary embodiment. In the network model training stage, learning and training are carried out on the network model according to the training sample set and the corresponding quality scoring labels, and model parameters of the network model enabling the loss function to be converged are obtained; and determining a voice quality assessment network model in a voice quality assessment stage based on the model parameters, as shown in fig. 6, where fig. 6 is a schematic structural diagram of a voice quality assessment network model according to an exemplary embodiment.
In the speech quality evaluation stage, the speech signal with noise to be evaluated is input to the trained speech quality evaluation network model, and the quality score output by the speech quality evaluation network model is obtained.
An embodiment of the present disclosure further provides a voice quality assessment apparatus, as shown in fig. 7, fig. 7 is a schematic structural diagram of a voice quality assessment apparatus according to an exemplary embodiment. The apparatus 100, comprising:
an obtaining module 101, configured to obtain spectrum information of a plurality of voice frames corresponding to a first voice signal;
a first feature extraction module 102, configured to, for the spectrum information of the multiple speech frames, extract channel attention feature information of the multiple speech frames by using a pre-trained speech quality assessment network model, and extract deep speech feature information of the multiple speech frames based on the channel attention feature information of the multiple speech frames;
a second feature extraction module 103, configured to extract time sequence related feature information between the multiple speech frames based on the deep speech feature information;
a predicting module 104, configured to predict a quality score of the first speech signal according to the time-sequence related feature information between the multiple speech frames.
Optionally, the voice quality assessment network model includes: an effective channel attention (ECA) module, a multi-stage residual convolution module and a bidirectional gated recurrent unit (GRU) module;
the first feature extraction module 102 is configured to:
aiming at the frequency spectrum information of the plurality of voice frames, extracting channel attention characteristic information of each channel of the plurality of voice frames by using the ECA module;
extracting deep voice feature information based on the channel attention feature information by using the multi-stage residual convolution module;
the extracting, based on the deep speech feature information, timing-related feature information between the plurality of speech frames includes:
and extracting time sequence related characteristic information corresponding to the deep voice characteristic information by utilizing the bidirectional GRU module.
Optionally, the multi-stage residual convolution module includes N residual convolution layers;
the residual convolutional layer includes: a first convolutional layer and a second convolutional layer which are cascaded; wherein an input of the first convolution layer is passed to an output of the first convolution layer by a residual unit;
and the N residual convolution layers are used for carrying out deep separable convolution processing on the channel attention characteristic information to obtain deep voice characteristic information of the voice frames.
Optionally, the bidirectional GRU module comprises: a forward GRU sub-module and a backward GRU sub-module;
the second feature extraction module 103 is configured to:
forward feature extraction is carried out on the deep voice feature information by utilizing the forward GRU sub-module to obtain forward time sequence feature information;
carrying out reverse feature extraction on the deep voice feature information by using the backward GRU submodule to obtain reverse time sequence feature information;
and obtaining time sequence related characteristic information corresponding to the deep voice characteristic information based on the forward time sequence characteristic information and the reverse time sequence characteristic information.
Optionally, the voice quality assessment network model includes: a full connection module and a global average pooling GAP module;
the prediction module 104 is configured to:
performing full connection processing on the time sequence related characteristic information of a plurality of voice frames of the first voice signal by using the full connection module to obtain quality scores of the plurality of voice frames;
and utilizing the GAP module to perform global average processing on the quality scores of the plurality of voice frames to obtain the quality score of the first voice signal.
Optionally, the obtaining module 101 is configured to:
preprocessing the first voice signal to obtain a plurality of voice frames of the first voice signal;
and performing time-frequency conversion processing on the plurality of voice frames of the first voice signal to obtain the amplitude spectrum information of the plurality of voice frames.
Optionally, the obtaining module 101 is configured to:
pre-emphasis processing is carried out on the first voice signal;
and performing frame division and windowing processing on the pre-emphasized first voice signal to obtain a plurality of voice frames of the first voice signal.
Optionally, the apparatus, comprising: a network training module 105 configured to:
acquiring a training sample set of voice with noise and a quality scoring label of the training sample set;
inputting the frequency spectrum information of the voice frames corresponding to the plurality of voice signals with noise in the training sample set to an initial network model to be trained to obtain a prediction quality score output by the initial network model;
determining a loss function value of the initial network model according to the predicted quality score and the quality score label;
and adjusting the parameters to be trained of the initial network model based on the loss function values to obtain the voice quality evaluation network model.
Optionally, the network training module 105 is configured to:
determining the maximum frame number of the voice signals with noise based on the frequency spectrum information of the voice frames of the voice signals with noise;
determining whether the frame number of the frequency spectrum information of the plurality of voice signals with noise is the maximum frame number;
if the frame number of the frequency spectrum information of the voice signals with noise is not the maximum frame number, zero filling processing is carried out on the frequency spectrum information; the frame number of the frequency spectrum information after zero filling processing is the maximum frame number;
and inputting the frequency spectrum information after zero filling processing into the initial network model to obtain the prediction quality score output by the initial network model.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
Fig. 8 is a block diagram illustrating a speech quality assessment apparatus according to an exemplary embodiment. For example, the device 200 may be a mobile phone, a mobile computer, or the like.
Referring to fig. 8, the apparatus 200 may include one or more of the following components: a processing component 202, a memory 204, a power component 206, a multimedia component 208, an audio component 210, an input/output (I/O) interface 212, a sensor component 214, and a communication component 216.
The processing component 202 generally controls overall operation of the device 200, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 202 may include one or more processors 220 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 202 can include one or more modules that facilitate interaction between the processing component 202 and other components. For example, the processing component 202 can include a multimedia module to facilitate interaction between the multimedia component 208 and the processing component 202.
The memory 204 is configured to store various types of data to support operation at the device 200. Examples of such data include instructions for any application or method operating on the device 200, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 204 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 206 provides power to the various components of the device 200. The power components 206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 200.
The multimedia component 208 includes a screen that provides an output interface between the device 200 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 208 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 200 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 210 is configured to output and/or input audio signals. For example, audio component 210 includes a Microphone (MIC) configured to receive external audio signals when apparatus 200 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 204 or transmitted via the communication component 216. In some embodiments, audio component 210 also includes a speaker for outputting audio signals.
The I/O interface 212 provides an interface between the processing component 202 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 214 includes one or more sensors for providing various aspects of status assessment for the device 200. For example, the sensor component 214 may detect the open/closed state of the device 200 and the relative positioning of components, such as the display and keypad of the apparatus 200; it may also detect a change in position of the apparatus 200 or of a component of the apparatus 200, the presence or absence of user contact with the apparatus 200, the orientation or acceleration/deceleration of the apparatus 200, and a change in temperature of the apparatus 200. The sensor assembly 214 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 216 is configured to facilitate wired or wireless communication between the apparatus 200 and other devices. The device 200 may access a wireless network based on a communication standard, such as Wi-Fi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 216 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 216 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 200 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as memory 204, comprising instructions executable by processor 220 of device 200 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (12)

1. A method for evaluating speech quality, the method comprising:
acquiring frequency spectrum information of a plurality of voice frames corresponding to a first voice signal;
aiming at the frequency spectrum information of the voice frames, extracting channel attention characteristic information of the voice frames by utilizing a pre-trained voice quality evaluation network model, and extracting deep voice characteristic information of the voice frames based on the channel attention characteristic information of the voice frames;
extracting time sequence related characteristic information among the plurality of voice frames based on the deep voice characteristic information;
and predicting the quality score of the first voice signal according to the time sequence related characteristic information among the plurality of voice frames.
2. The method of claim 1, wherein the speech quality assessment network model comprises: an effective channel attention (ECA) module, a multi-stage residual convolution module and a bidirectional gated recurrent unit (GRU) module;
the extracting, by using a pre-trained speech quality assessment network model, channel attention feature information of the plurality of speech frames for the spectrum information of the plurality of speech frames, and extracting deep speech feature information of the plurality of speech frames based on the channel attention feature information of the plurality of speech frames, includes:
aiming at the frequency spectrum information of the plurality of voice frames, extracting channel attention feature information of each channel of the plurality of voice frames by using the ECA module;
extracting deep voice feature information based on the channel attention feature information by using the multi-stage residual convolution module;
the extracting, based on the deep speech feature information, timing-related feature information between the plurality of speech frames includes:
and extracting time sequence related characteristic information corresponding to the deep voice characteristic information by utilizing the bidirectional GRU module.
3. The method of claim 2, wherein the multi-stage residual convolution module includes N residual convolution layers;
the residual convolutional layer includes: a first convolutional layer and a second convolutional layer which are cascaded; wherein an input of the first convolution layer is transferred to an output of the first convolution layer through a residual error unit;
and the N residual convolution layers are used for carrying out deep separable convolution processing on the channel attention characteristic information to obtain deep voice characteristic information of the voice frames.
4. The method of claim 2, wherein the bidirectional GRU module comprises: a forward GRU sub-module and a backward GRU sub-module;
the extracting, by using the bidirectional GRU module, the time sequence related feature information corresponding to the deep speech feature information includes:
forward feature extraction is carried out on the deep voice feature information by utilizing the forward GRU sub-module to obtain forward time sequence feature information;
carrying out reverse feature extraction on the deep voice feature information by using the backward GRU submodule to obtain reverse time sequence feature information;
and obtaining time sequence related characteristic information corresponding to the deep voice characteristic information based on the forward time sequence characteristic information and the reverse time sequence characteristic information.
5. The method of claim 1, wherein the speech quality assessment network model comprises: a full connection module and a global average pooling GAP module;
predicting a quality score of the first speech signal according to timing-related feature information between the plurality of speech frames, including:
performing full connection processing on the time sequence related characteristic information of a plurality of voice frames of the first voice signal by using the full connection module to obtain quality scores of the plurality of voice frames;
and carrying out global average processing on the quality scores of the voice frames by utilizing the GAP module to obtain the quality score of the first voice signal.
6. The method of claim 1, wherein the obtaining the spectrum information of the plurality of speech frames corresponding to the first speech signal comprises:
preprocessing the first voice signal to obtain a plurality of voice frames of the first voice signal;
and performing time-frequency conversion processing on the plurality of voice frames of the first voice signal to obtain the amplitude spectrum information of the plurality of voice frames.
7. The method of claim 6, wherein the pre-processing the first speech signal to obtain a plurality of speech frames of the first speech signal comprises:
carrying out pre-emphasis processing on the first voice signal;
and performing frame division and windowing processing on the pre-emphasized first voice signal to obtain a plurality of voice frames of the first voice signal.
8. The method of claim 1, wherein before obtaining the spectrum information of the plurality of speech frames corresponding to the first speech signal, the method comprises:
acquiring a training sample set of voice with noise and a quality scoring label of the training sample set;
inputting the frequency spectrum information of the voice frames corresponding to the plurality of voice signals with noise in the training sample set to an initial network model to be trained to obtain a prediction quality score output by the initial network model;
determining a loss function value of the initial network model according to the predicted quality score and the quality score label;
and adjusting the parameters to be trained of the initial network model based on the loss function values to obtain the voice quality evaluation network model.
9. The method according to claim 8, wherein the inputting the spectrum information of the speech frames corresponding to the plurality of noisy speech signals in the training sample set to an initial network model to be trained to obtain a prediction quality score output by the initial network model comprises:
determining the maximum frame number of the plurality of voice signals with noise based on the frequency spectrum information of the voice frames of the plurality of voice signals with noise;
determining whether the frame number of the frequency spectrum information of the plurality of voice signals with noise is the maximum frame number;
if the frame number of the frequency spectrum information of the voice signals with noise is not the maximum frame number, zero filling processing is carried out on the frequency spectrum information; the frame number of the frequency spectrum information after zero filling processing is the maximum frame number;
and inputting the frequency spectrum information after zero padding processing into the initial network model to obtain a prediction quality score output by the initial network model.
10. A speech quality assessment apparatus, comprising:
the acquisition module is used for acquiring the frequency spectrum information of a plurality of voice frames corresponding to the first voice signal;
a first feature extraction module, configured to extract, by using a pre-trained speech quality assessment network model, channel attention feature information of the plurality of speech frames for the spectrum information of the plurality of speech frames, and extract deep speech feature information of the plurality of speech frames based on the channel attention feature information of the plurality of speech frames;
the second characteristic extraction module is used for extracting time sequence related characteristic information among the plurality of voice frames based on the deep voice characteristic information;
and the predicting module is used for predicting the quality score of the first voice signal according to the time sequence related characteristic information among the plurality of voice frames.
11. A speech quality assessment apparatus, comprising:
a processor;
a memory for storing executable instructions;
wherein the processor is configured to implement the voice quality assessment method of any one of claims 1 to 9 when executing the executable instructions stored in the memory.
12. A non-transitory computer-readable storage medium in which instructions, when executed by a processor of a voice quality assessment apparatus, enable the voice quality assessment apparatus to perform the voice quality assessment method of any one of claims 1 to 9.
CN202210382372.XA 2022-04-12 2022-04-12 Voice quality evaluation method, device and storage medium Pending CN114694685A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210382372.XA CN114694685A (en) 2022-04-12 2022-04-12 Voice quality evaluation method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210382372.XA CN114694685A (en) 2022-04-12 2022-04-12 Voice quality evaluation method, device and storage medium

Publications (1)

Publication Number Publication Date
CN114694685A true CN114694685A (en) 2022-07-01

Family

ID=82143875

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210382372.XA Pending CN114694685A (en) 2022-04-12 2022-04-12 Voice quality evaluation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN114694685A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010230814A (en) * 2009-03-26 2010-10-14 Fujitsu Ltd Speech signal evaluation program, speech signal evaluation apparatus, and speech signal evaluation method
WO2021159902A1 (en) * 2020-02-12 2021-08-19 深圳壹账通智能科技有限公司 Age recognition method, apparatus and device, and computer-readable storage medium
US20220044022A1 (en) * 2020-08-10 2022-02-10 International Business Machines Corporation Dual-modality relation networks for audio-visual event localization
CN112216271A (en) * 2020-10-11 2021-01-12 哈尔滨工程大学 Audio-visual dual-mode speech recognition method based on convolution block attention mechanism
CN113012714A (en) * 2021-02-22 2021-06-22 哈尔滨工程大学 Acoustic event detection method based on pixel attention mechanism capsule network model
CN113782051A (en) * 2021-07-28 2021-12-10 北京中科模识科技有限公司 Broadcast effect classification method and system, electronic device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117238298A (en) * 2023-11-13 2023-12-15 四川师范大学 Method and system for identifying and positioning animals based on sound event
CN117238298B (en) * 2023-11-13 2024-02-06 四川师范大学 Method and system for identifying and positioning animals based on sound event

Similar Documents

Publication Publication Date Title
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN108198569B (en) Audio processing method, device and equipment and readable storage medium
CN110808063A (en) Voice processing method and device for processing voice
CN111009237B (en) Voice recognition method and device, electronic equipment and storage medium
CN113362812B (en) Voice recognition method and device and electronic equipment
CN110097890B (en) Voice processing method and device for voice processing
CN109410973B (en) Sound changing processing method, device and computer readable storage medium
CN111508519B (en) Method and device for enhancing voice of audio signal
CN112185389A (en) Voice generation method and device, storage medium and electronic equipment
CN113707134B (en) Model training method and device for model training
CN110970046A (en) Audio data processing method and device, electronic equipment and storage medium
CN106024002A (en) Time zero convergence single microphone noise reduction
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
CN110930978A (en) Language identification method and device and language identification device
CN110889489A (en) Neural network training method, image recognition method and device
CN115472174A (en) Sound noise reduction method and device, electronic equipment and storage medium
CN114694685A (en) Voice quality evaluation method, device and storage medium
CN110970015B (en) Voice processing method and device and electronic equipment
CN112447184B (en) Voice signal processing method and device, electronic equipment and storage medium
CN109784537A (en) Predictor method, device and the server and storage medium of ad click rate
CN116403599B (en) Efficient voice separation method and model building method thereof
CN112201267A (en) Audio processing method and device, electronic equipment and storage medium
CN110580910B (en) Audio processing method, device, equipment and readable storage medium
CN114095817B (en) Noise reduction method and device for earphone, earphone and storage medium
CN111667842B (en) Audio signal processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination