CN114333796A - Audio and video voice enhancement method, device, equipment, medium and smart television


Info

Publication number
CN114333796A
CN114333796A
Authority
CN
China
Prior art keywords
audio
signal
processed
speech
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111614722.2A
Other languages
Chinese (zh)
Inventor
秦宇
陈俊彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen TCL Digital Technology Co Ltd
Original Assignee
Shenzhen TCL Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen TCL Digital Technology Co Ltd
Priority to CN202111614722.2A
Publication of CN114333796A
Legal status: Pending

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application relates to an audio and video speech enhancement method, apparatus, device, medium, and smart television. The method comprises the following steps: acquiring an audio signal to be processed from an audio/video to be played; using the audio signal to be processed as the input of a pre-trained speech recognition model, so that the speech recognition model outputs the speech signal in the audio signal to be processed; and playing the audio/video to be played based on the speech signal. With this method, the audio playback of a smart television can be optimized and the user's listening experience ensured.

Description

Audio and video voice enhancement method, device, equipment, medium and smart television
Technical Field
The application relates to the technical field of smart televisions, in particular to a method, a device, equipment, a medium and a smart television for audio and video voice enhancement.
Background
With the development of smart television technology, smart televisions play an increasingly prominent role in people's lives. However, hearing degrades with age, and hearing impairment is especially common among the elderly. In the current technical scheme, hearing curve data of a user is measured through a hearing test APP installed on the smart television and compared with a standard healthy hearing curve, and a filter is computed to apply frequency compensation to the sound. However, hearing tests are cumbersome, and accurate measurement requires a headset, which raises the cost. Therefore, how to optimize the audio playback of a smart television and ensure the user's listening experience has become an urgent technical problem.
Disclosure of Invention
Therefore, in view of the above technical problems, it is necessary to provide an audio and video speech enhancement method, apparatus, device, medium, and smart television that can optimize the audio playback of a smart television and ensure the user's listening experience.
An audio and video speech enhancement method, the method comprising:
acquiring an audio signal to be processed in an audio and video to be played;
the audio signal to be processed is used as the input of a pre-trained speech recognition model, so that the speech recognition model outputs the speech signal in the audio signal to be processed;
and playing the audio and video to be played based on the voice signal.
In one embodiment, using the audio signal to be processed as the input of a pre-trained speech recognition model so that the speech recognition model outputs the speech signal in the audio signal to be processed includes: performing framing processing on the audio signal to be processed to obtain an audio frame set corresponding to the audio signal to be processed; inputting the audio frames contained in the audio frame set into the pre-trained speech recognition model in time order, so that the speech recognition model outputs the speech signal corresponding to each audio frame; and obtaining the speech signal corresponding to the audio signal to be processed from the speech signals corresponding to the audio frames.
In one embodiment, the speech recognition model is a densely connected convolutional neural network including a plurality of convolutional layers, a plurality of Rnn layers, and a plurality of dilated convolutional layers connected in series, and the output of each convolutional layer is channel-concatenated with the input of its corresponding dilated convolutional layer.
In one embodiment, the method further comprises: acquiring an audio frame training sample set, wherein each audio frame training sample in the audio frame training sample set comprises a voice signal and background noise; respectively inputting the training samples of the audio frames into a speech recognition model to be trained according to the time sequence; and adjusting parameters in the speech recognition model so that the speech recognition model outputs speech signals contained in the training samples of each audio frame.
In one embodiment, the acquiring an audio signal to be processed in an audio/video to be played includes: and if an enhancement request aiming at the voice signal is received, acquiring the audio signal to be processed in the audio and video to be played.
A television set, comprising:
display means for displaying a video signal;
the audio playing device is used for displaying the audio signal;
and the processor is respectively and electrically connected with the display device and the audio playing device, and can execute the audio and video voice enhancement method according to the embodiment.
An audio and video speech enhancement apparatus, the apparatus comprising:
the acquisition module is used for acquiring audio signals to be processed in the audio and video to be played;
the output module is used for taking the audio signal to be processed as the input of a pre-trained voice recognition model so as to enable the voice recognition model to output the voice signal in the audio signal to be processed;
and the playing module is used for playing the audio and video to be played based on the voice signal.
In one embodiment, the output module comprises:
the framing unit is used for framing the audio signal to be processed to obtain an audio frame set corresponding to the audio signal to be processed;
the input unit is used for inputting each audio frame contained in the audio frame set into a pre-trained speech recognition model according to a time sequence so that the speech recognition model outputs a speech signal corresponding to each audio frame;
and the processing unit is used for obtaining the voice signal corresponding to the audio signal to be processed according to the voice signal corresponding to each audio frame.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring an audio signal to be processed in an audio and video to be played;
the audio signal to be processed is used as the input of a pre-trained speech recognition model, so that the speech recognition model outputs the speech signal in the audio signal to be processed;
and playing the audio and video to be played based on the voice signal.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring an audio signal to be processed in an audio and video to be played;
the audio signal to be processed is used as the input of a pre-trained speech recognition model, so that the speech recognition model outputs the speech signal in the audio signal to be processed;
and playing the audio and video to be played based on the voice signal.
One of the above technical solutions has the following advantages and beneficial effects:
according to the audio and video voice enhancement method, device, equipment, medium and smart television, the audio signal to be processed in the audio and video to be played is obtained, and the audio signal to be processed is used as the input of the pre-trained voice recognition model, so that the voice recognition model outputs the voice signal in the audio signal to be processed, and the audio and video to be played is played based on the voice signal. Therefore, background noise in the audio signal to be processed is removed through the voice recognition model, the voice signal in the audio signal to be processed, namely pure voice, is obtained, audio and video are played based on the pure voice, the influence of the background noise on the user listening voice can be avoided, the audio playing of the smart television is optimized, and the user listening experience is guaranteed.
Drawings
Fig. 1 shows a flow diagram of a speech enhancement method of audio and video according to an embodiment of the present application.
Fig. 2 shows a schematic flowchart of step S120 in the audio-video speech enhancement method of fig. 1 according to an embodiment of the present application.
FIG. 3 shows a schematic structural diagram of a speech recognition model according to an embodiment of the present application.
Fig. 4 shows a block diagram of a speech enhancement apparatus for audio and video according to an embodiment of the present application.
FIG. 5 illustrates an internal block diagram of a computer device according to one embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Fig. 1 shows a flow diagram of a voice enhancement method for audio and video according to an embodiment of the present application, where the method may be applied to a terminal, which may include but is not limited to one or more of a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart television, a smart wearable device, a vehicle-mounted computer, a server, or a cloud server.
Referring to fig. 1, the audio-video speech enhancement method at least includes steps S110 to S130, which are described in detail as follows:
in step S110, an audio signal to be processed in the audio/video to be played is acquired.
The audio/video to be played may be an audio/video that has not started playing, or the not-yet-played portion of an audio/video that is currently playing. Taking a smart television as an example: before a television program starts playing (i.e., an audio/video not currently playing), the user can select the smart television's speech enhancement function to enhance the speech in the program to be played later; the user may also select speech enhancement while a program is playing, to enhance the speech in the portion that has not yet been played, and so on.
In this embodiment, the terminal may analyze the audio signal included in the audio/video to be played, that is, the audio signal to be processed, from the audio/video to be played, so as to prepare for subsequent processing.
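The patent does not spell out how the audio track is separated from the audio/video container; the sketch below is one minimal way to do it, assuming the ffmpeg command-line tool is available (the file names and the 16 kHz sample rate are hypothetical). The -vn flag drops the video stream, -ac 1 downmixes to mono, and -ar sets the sample rate.

```python
import subprocess

def extract_audio(av_path: str, wav_path: str, sample_rate: int = 16000) -> None:
    """Demux the to-be-processed audio signal from an A/V file into a
    mono WAV file using the ffmpeg CLI."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", av_path,   # -y: overwrite an existing output
         "-vn",                           # drop the video stream
         "-ac", "1",                      # downmix to mono
         "-ar", str(sample_rate),         # resample
         wav_path],
        check=True,
    )

# Hypothetical usage: extract_audio("program.mp4", "program.wav")
```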
In step S120, the audio signal to be processed is used as an input of a pre-trained speech recognition model, so that the speech recognition model outputs a speech signal in the audio signal to be processed.
The speech recognition model may be a neural network model for recognizing a speech signal included in the audio signal. The speech signal may be a voice of a person included in the audio-video, such as a singing voice, a speaking voice, or the like. A person skilled in the art may construct a speech recognition model in advance, and train the speech recognition model so that the speech recognition model can accurately output a speech signal included in an audio signal, so as to prevent background noise included in the audio signal from affecting the listening of a user to the speech signal.
In an exemplary embodiment of the present application, the pre-trained speech recognition model may be stored locally in the terminal, and when speech enhancement is required, the terminal may obtain the locally stored speech recognition model to recognize a speech signal included in the audio signal to be processed. In another example, the speech recognition model may also be stored in a server or a cloud server, and when speech enhancement is required, the terminal may remotely obtain the speech recognition model and use the speech recognition model to recognize a speech signal included in the audio signal to be processed. Those skilled in the art can select a storage manner of the corresponding speech recognition model according to actual needs, and the present application is not limited to this.
In step S130, the audio/video to be played is played based on the voice signal.
In an exemplary embodiment of the application, after the speech recognition model outputs the speech signal contained in the audio to be processed, the terminal may play the audio/video to be played based on that speech signal. Specifically, the terminal can play the speech signal together with the video signal of the audio/video to be played, ensuring that the two stay synchronized while the user clearly hears the speech in the audio/video, thereby ensuring the user experience.
In the embodiment shown in fig. 1, the speech recognition model removes the background noise from the audio signal to be processed, yielding the speech signal in the audio signal, i.e., pure speech; playing the audio and video based on the pure speech prevents background noise from interfering with the user's listening to the speech, optimizes the audio playback of the smart television, and ensures the user's listening experience.
Based on the embodiment shown in fig. 1, fig. 2 shows a schematic flowchart of step S120 in the audio-video speech enhancement method of fig. 1 according to an embodiment of the present application. Referring to fig. 2, step S120 at least includes steps S210 to S230, which are described in detail as follows:
in step S210, the audio signal to be processed is subjected to framing processing, so as to obtain an audio frame set corresponding to the audio signal to be processed.
In this embodiment, the terminal may perform framing processing on the audio signal to be processed to obtain the audio frames corresponding to it, and arrange these audio frames in time order (i.e., playing order) to obtain the audio frame set corresponding to the audio signal to be processed.
In step S220, the audio frames included in the audio frame set are input into a pre-trained speech recognition model in a time sequence, so that the speech recognition model outputs speech signals corresponding to the audio frames.
In an exemplary embodiment of the present application, the terminal may input the audio frames contained in the audio frame set into the pre-trained speech recognition model in time order, and the speech recognition model outputs the speech signal corresponding to each frame one by one. It should be understood that framing the audio to be processed and then recognizing the speech signal of each frame with the speech recognition model helps ensure the accuracy of speech signal recognition.
In step S230, a speech signal corresponding to the audio signal to be processed is obtained according to the speech signal corresponding to each audio frame.
In an exemplary embodiment of the present application, after the speech recognition model outputs the speech signal corresponding to each audio frame, the terminal may integrate the speech signal corresponding to each audio frame according to the time sequence of each audio frame, so as to obtain the speech signal corresponding to the audio signal to be processed.
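As a rough illustration of steps S210 to S230, the sketch below splits the signal into non-overlapping frames, feeds them to a stand-in model one by one in time order, and joins the per-frame speech estimates back in the same order. Here `model` is a hypothetical callable standing in for the pre-trained speech recognition model; as the description of fig. 3 below makes clear, a real pipeline works on overlapping windowed frames rather than this simplified split.

```python
import numpy as np

def enhance_by_frames(audio: np.ndarray, model, frame_len: int = 512) -> np.ndarray:
    """Frame the audio, run per-frame inference in time order, and
    reassemble the speech signal (simplified sketch of S210-S230)."""
    n = len(audio) // frame_len
    frames = audio[:n * frame_len].reshape(n, frame_len)  # audio frame set, in time order
    speech_frames = [model(frame) for frame in frames]    # per-frame speech estimates
    return np.concatenate(speech_frames)                  # speech for the whole signal
```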
In an exemplary embodiment of the present application, the speech recognition model is a densely connected convolutional neural network including a plurality of convolutional layers, a plurality of Rnn layers, and a plurality of dilated convolutional layers connected in series, and the output of each convolutional layer is channel-concatenated with the input of its corresponding dilated convolutional layer.
Specifically, fig. 3 shows a schematic structural diagram of a speech recognition model according to an embodiment of the present application. Referring to FIG. 3:
input: the method is used for example, one-dimensional audio signals are subjected to short-time Fourier transform to change time signals into frequency domain signals. Specifically, the audio signal may be framed by using a panning window function and divided into 512-point frames, and each frame has 256-point overlapping, and fast fourier transform is performed on each frame of signal, so that a time signal is transformed into a frequency domain signal, and 257-point data of a half-band frequency spectrum is obtained, where the 257-point data of each frame is a frame feature.
In order to model the time sequence of the audio signal, the speech recognition model needs to input 1 frame of features, i.e. 1 × 257 two-dimensional arrays, for each operation to estimate the pure speech data of the current frame, i.e. the speech signal. The speech recognition model of the application adds rnn structural modeling audio time sequence, so audio history information can be kept, and only 1 frame of data needs to be input each time.
Output: and outputting the characteristics, namely a model output result, and the pure speech estimated by the model, namely a speech signal. Is the half-band spectrum data of 257 points, the output spectrum can be restored into pure voice by utilizing the classical reverse short-time Fourier transform and the superposition addition method.
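A minimal numpy sketch of this analysis/synthesis front end, using the parameters named above (512-point Hanning-windowed frames with a 256-point overlap, giving 257-point half-band spectra); the exact window and normalization used in the patent may differ.

```python
import numpy as np

FRAME = 512   # points per frame
HOP = 256     # 256-point overlap between adjacent frames
WIN = np.hanning(FRAME)

def stft_frames(audio: np.ndarray) -> np.ndarray:
    """Model input: one row of 257 half-band spectrum points per frame."""
    n = 1 + (len(audio) - FRAME) // HOP
    return np.stack([np.fft.rfft(WIN * audio[i * HOP:i * HOP + FRAME])
                     for i in range(n)])                 # shape (n_frames, 257)

def istft_frames(spec: np.ndarray, length: int) -> np.ndarray:
    """Model output back to time domain: inverse STFT plus overlap-add."""
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, row in enumerate(spec):
        s = i * HOP
        out[s:s + FRAME] += WIN * np.fft.irfft(row, n=FRAME)
        norm[s:s + FRAME] += WIN ** 2
    return out / np.maximum(norm, 1e-8)                  # normalize window overlap
```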
cnn layer (convolutional layer): comprises a convolution operation (conv), a batch normalization operation (bn), and an activation function, such as the ReLU activation function.
dcnn layer (dilated convolution layer): comprises a dilated convolution operation (dconv), a batch normalization operation (bn), and an activation function, such as the ReLU activation function.
Multi-layer Rnn: the light-colored long rectangles in fig. 3, formed by one or more rnn layers connected in series, with each layer's output fed to the next rnn layer as input.
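A PyTorch sketch of these three building blocks, under stated assumptions: the kernel sizes and strides are chosen so that a cnn layer roughly halves the feature dimension (e.g. 257 to 129) and a dcnn layer expands it back (e.g. 65 to 129); since the patent describes the dilated layer as expanding the features, a transposed convolution is used here as a stand-in for that up-sampling, and the GRU cell is an assumption.

```python
import torch
import torch.nn as nn

class CnnLayer(nn.Module):
    """cnn layer: convolution (conv) + batch normalization (bn) + ReLU.
    stride=2 roughly halves the feature dimension, e.g. 257 -> 129."""
    def __init__(self, ch_in: int, ch_out: int):
        super().__init__()
        self.conv = nn.Conv1d(ch_in, ch_out, kernel_size=3, stride=2, padding=1)
        self.bn = nn.BatchNorm1d(ch_out)
        self.act = nn.ReLU()

    def forward(self, x):                      # x: (batch, channels, features)
        return self.act(self.bn(self.conv(x)))

class DcnnLayer(nn.Module):
    """dcnn layer: feature-expanding convolution (dconv) + bn + ReLU,
    e.g. 65 -> 129 (transposed convolution assumed for the expansion)."""
    def __init__(self, ch_in: int, ch_out: int):
        super().__init__()
        self.conv = nn.ConvTranspose1d(ch_in, ch_out, kernel_size=3,
                                       stride=2, padding=1)
        self.bn = nn.BatchNorm1d(ch_out)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class RnnStack(nn.Module):
    """Multi-layer rnn: each layer's output feeds the next layer; the
    recurrent state keeps audio history across frames (GRU assumed)."""
    def __init__(self, feat: int, layers: int = 2):
        super().__init__()
        self.rnn = nn.GRU(feat, feat, num_layers=layers, batch_first=True)

    def forward(self, x, state=None):          # x: (batch, time, feat)
        return self.rnn(x, state)
```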
In the speech recognition model shown in fig. 3, the first cnn convolutional layer is denoted c_1_1, the second c_1_2, and so on. It should be understood that each cnn layer reduces the feature dimension: for example, the first layer takes 257 points and outputs 129 points, the second takes 129 points and outputs 65 points, and the output of the last cnn layer serves as the input of the multi-layer Rnn. The Rnn output is passed to the subsequent dcnn layers; a dcnn layer is a dilated convolution layer that expands the features, for example from a 65-point input to a 129-point output, so the features pass through several dcnn layers in turn. The dcnn layers are numbered in reverse order, the final one being d_1_1, and the number of feature points output by the last dcnn layer equals the model input, e.g. 257 points.
Since the task of the method of the present application is to restore pure speech, channel concatenation (concat) is performed between the output of a cnn layer and the input of the corresponding later dcnn layer so that no information is lost. For example, if the output of c_1_1 is (129, n) points and the output of d_1_2 is (129, n) points, the two are concatenated along the channel axis into (129, 2n) data input to d_1_1. This cross-layer concatenation preserves the information that would otherwise be lost as convolution reduces the number of feature points.
Thus, the arrangement of the convolutional layers reduces the feature dimension and, in turn, the amount of data input to the Rnn layers; to restore the pure speech, the down-sampled data from the convolutional layers are up-sampled back to the original feature scale by the dilated convolutional layers, and the output of each convolutional layer is channel-concatenated with the input of its corresponding later dilated convolutional layer, so that information lost during down-sampling can be perceived by the subsequent dilated convolutional layers, guaranteeing the restoration of pure speech.
It should be noted that, because the speech recognition model adopts this structure of cross-layer connections between layers of matching dimensions, the neural network has a u shape and is therefore called a u-net. Unlike the original u-net, a cascade of several u-nets in series is adopted here: for example, the first cnn layer of the first u-net is coded c_1_1, the first cnn layer of the second u-net is coded c_2_1, and likewise for the dcnn layers. To transfer features better, dense connections are used: an earlier cnn layer is concatenated not only with the dcnn layer of its own u-net but also with the corresponding dcnn layers that follow; for example, the c_1_1 output is concatenated with the d_1_2, d_2_2, d_3_2, ..., d_n_2 outputs and passed on to the next layer. Compared with the traditional u-net, in which a cnn layer's output is concatenated with only one dcnn output rather than several, this speech recognition model is easier to train and achieves better performance.
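The sketch below wires the blocks above into a single, minimal two-scale u-net with the cross-layer channel concatenation just described; it assumes the CnnLayer and DcnnLayer definitions from the previous sketch are in scope, and it is far smaller than the patent's dense multi-u-net cascade, so the layer sizes and the sigmoid mask output are assumptions. Multiplying the mask by the noisy input spectrum gives the per-frame speech estimate, matching the filter view introduced in the equations that follow.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Two-scale u-net sketch: cnn layers shrink the features
    (257 -> 129 -> 65), an rnn models time, dcnn layers expand them
    back, and the c1 output is channel-concatenated into d1's input."""
    def __init__(self, ch: int = 16):
        super().__init__()
        self.c1 = CnnLayer(1, ch)                     # features 257 -> 129
        self.c2 = CnnLayer(ch, ch)                    # features 129 -> 65
        self.rnn = nn.GRU(ch * 65, ch * 65, batch_first=True)
        self.d2 = DcnnLayer(ch, ch)                   # features 65 -> 129
        self.d1 = nn.ConvTranspose1d(2 * ch, 1,       # concat doubles the channels
                                     kernel_size=3, stride=2, padding=1)
        self.out = nn.Sigmoid()                       # mask in [0, 1], cf. H below

    def forward(self, mag):                           # mag: (batch, 1, 257)
        e1 = self.c1(mag)
        e2 = self.c2(e1)
        b, c, f = e2.shape
        r, _ = self.rnn(e2.reshape(b, 1, c * f))      # one frame per step
        u2 = self.d2(r.reshape(b, c, f))
        u1 = self.d1(torch.cat([e1, u2], dim=1))      # cross-layer channel concat
        return self.out(u1)                           # (batch, 1, 257) mask
```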
Based on the illustration in fig. 3, it should be understood that the Fourier transform of speech containing background noise is:

y = x + n    (1)

where x is the Fourier transform of the clean speech and n is the Fourier transform of the background noise.

A speech enhancement system F is sought such that:

\hat{x} = F(y)    (2)

where \hat{x} is the Fourier transform of the speech estimated by system F; ideally, \hat{x} = x.

If F is a trainable system, the minimization goal of system training is:

\min \| F(y) - x \|^2    (3)

i.e., mean-square-error minimization.

If the system is considered as a filter, then:

\hat{x} = H \cdot y    (4)

System F can then be viewed as a system that estimates an ideal filter with transfer function H, with the optimization objective:

\min \| H \cdot y - x \|^2    (5)

H is equivalent to an a-priori-signal-to-noise-ratio Wiener filter; it is the actual training target of the system and can simply be regarded as data with values between 0 and 1.
Based on the foregoing embodiment, in an exemplary embodiment of the present application, the method for speech enhancement of audio and video further includes:
acquiring an audio frame training sample set, wherein each audio frame training sample in the audio frame training sample set comprises a voice signal and background noise;
respectively inputting the training samples of the audio frames into a speech recognition model to be trained according to the time sequence;
and adjusting parameters in the speech recognition model so that the speech recognition model outputs speech signals contained in the training samples of each audio frame.
In this embodiment, the terminal may obtain a preset audio frame training sample set, where each audio frame training sample in the audio frame training sample set includes a speech signal and background noise. The terminal can respectively input the audio frame training samples into the speech recognition model to be trained according to the time sequence. The training of the speech recognition model may be accomplished by adjusting parameters in the speech recognition model such that the speech recognition model outputs speech signals contained in training samples of each audio frame.
In an exemplary embodiment of the present application, when recording the training sample set, real environment noise and speech samples may be superimposed at certain signal-to-noise ratios, for example -5, 0, 5, 10, 15 dB, to obtain multiple audio frame training sample sets with different signal-to-noise ratios. These serve as the input of the speech recognition model to be trained and are called noisy, i.e., y in equation (1). The clean, noise-free speech before mixing is called clean, i.e., x in equation (1). During training, noisy is the model input and clean is the model's target output. Equation (5) is optimized by gradient back-propagation to obtain the optimal model parameters.
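A minimal sketch of this mixing step; the noise scaling follows directly from the definition of signal-to-noise ratio, and the array names are hypothetical.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Superimpose real environment noise on a clean speech sample at the
    given SNR (e.g. -5, 0, 5, 10, 15 dB): returns the noisy signal y of
    eq. (1); `clean` is the training target x."""
    noise = noise[:len(clean)]                 # assume noise is at least as long
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Hypothetical usage: noisy = mix_at_snr(clean_speech, env_noise, snr_db=5)
```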
In an exemplary embodiment of the present application, the acquiring an audio signal to be processed in an audio/video to be played includes:
and if an enhancement request aiming at the voice signal is received, acquiring the audio signal to be processed in the audio and video to be played.
In one example, a user may click a specific area on the terminal interface to generate an enhancement request for the voice signal; in another example, the user may press a specific key on an input device configured for the terminal, such as a keyboard, a mouse, or a remote controller, to generate and send an enhancement request for the voice signal to the terminal.
After receiving the enhancement request for the voice signal, the terminal can acquire the audio signal to be processed in the audio/video to be played, so that the voice enhancement method for the audio/video provided by the embodiment of the application is executed, and the listening experience of a user is ensured.
It should be understood that although the various steps in the flowcharts of figs. 1-2 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in figs. 1-2 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and the order of performing these sub-steps or stages is not necessarily sequential; they may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.
According to an embodiment of the present application, there is also provided a smart tv, including:
display means for displaying a video signal;
the audio playing device is used for displaying the audio signal;
and the processor is respectively and electrically connected with the display device and the audio playing device, and can execute the audio and video voice enhancement method according to the embodiment.
In one embodiment, as shown in fig. 4, there is provided an audio-visual speech enhancement apparatus, including:
the acquiring module 410 is used for acquiring audio signals to be processed in the audio and video to be played;
an output module 420, configured to use the audio signal to be processed as an input of a pre-trained speech recognition model, so that the speech recognition model outputs a speech signal in the audio signal to be processed;
and the playing module 430 is configured to play the audio and video to be played based on the voice signal.
In one embodiment, the output module 420 comprises:
the framing unit is used for framing the audio signal to be processed to obtain an audio frame set corresponding to the audio signal to be processed;
the input unit is used for inputting each audio frame contained in the audio frame set into a pre-trained speech recognition model according to a time sequence so that the speech recognition model outputs a speech signal corresponding to each audio frame;
and the processing unit is used for obtaining the voice signal corresponding to the audio signal to be processed according to the voice signal corresponding to each audio frame.
In one embodiment, the speech recognition model is a densely connected convolutional neural network including a plurality of convolutional layers, a plurality of Rnn layers, and a plurality of dilated convolutional layers connected in series, and the output of each convolutional layer is channel-concatenated with the input of its corresponding dilated convolutional layer.
In one embodiment, the output module 420 is further configured to: acquiring an audio frame training sample set, wherein each audio frame training sample in the audio frame training sample set comprises a voice signal and background noise; respectively inputting the training samples of the audio frames into a speech recognition model to be trained according to the time sequence; and adjusting parameters in the speech recognition model so that the speech recognition model outputs speech signals contained in the training samples of each audio frame.
In one embodiment, the obtaining module 410 is configured to: and if an enhancement request aiming at the voice signal is received, acquiring the audio signal to be processed in the audio and video to be played.
For the specific limitation of the audio-video speech enhancement device, reference may be made to the above limitation on the audio-video speech enhancement method, and details are not described here. All or part of the modules in the audio and video speech enhancement device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of speech enhancement of audio-visual. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring an audio signal to be processed in an audio and video to be played;
the audio signal to be processed is used as the input of a pre-trained speech recognition model, so that the speech recognition model outputs the speech signal in the audio signal to be processed;
and playing the audio and video to be played based on the voice signal.
In one embodiment, the processor, when executing the computer program, further performs the steps of: performing framing processing on the audio signal to be processed to obtain an audio frame set corresponding to the audio signal to be processed; inputting each audio frame contained in the audio frame set into a pre-trained speech recognition model according to a time sequence so that the speech recognition model outputs a speech signal corresponding to each audio frame; and obtaining a voice signal corresponding to the audio signal to be processed according to the voice signal corresponding to each audio frame.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring an audio frame training sample set, wherein each audio frame training sample in the audio frame training sample set comprises a voice signal and background noise; respectively inputting the training samples of the audio frames into a speech recognition model to be trained according to the time sequence; and adjusting parameters in the speech recognition model so that the speech recognition model outputs speech signals contained in the training samples of each audio frame.
In one embodiment, the processor, when executing the computer program, further performs the steps of: and if an enhancement request aiming at the voice signal is received, acquiring the audio signal to be processed in the audio and video to be played.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring an audio signal to be processed in an audio and video to be played;
the audio signal to be processed is used as the input of a pre-trained speech recognition model, so that the speech recognition model outputs the speech signal in the audio signal to be processed;
and playing the audio and video to be played based on the voice signal.
In one embodiment, the computer program when executed by the processor further performs the steps of:
performing framing processing on the audio signal to be processed to obtain an audio frame set corresponding to the audio signal to be processed; inputting each audio frame contained in the audio frame set into a pre-trained speech recognition model according to a time sequence so that the speech recognition model outputs a speech signal corresponding to each audio frame; and obtaining a voice signal corresponding to the audio signal to be processed according to the voice signal corresponding to each audio frame.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring an audio frame training sample set, wherein each audio frame training sample in the audio frame training sample set comprises a voice signal and background noise; respectively inputting the training samples of the audio frames into a speech recognition model to be trained according to the time sequence; and adjusting parameters in the speech recognition model so that the speech recognition model outputs speech signals contained in the training samples of each audio frame.
In one embodiment, the computer program when executed by the processor further performs the steps of: and if an enhancement request aiming at the voice signal is received, acquiring the audio signal to be processed in the audio and video to be played.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the claims. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for enhancing audio and video speech is characterized by comprising the following steps:
acquiring an audio signal to be processed in an audio and video to be played;
the audio signal to be processed is used as the input of a pre-trained speech recognition model, so that the speech recognition model outputs the speech signal in the audio signal to be processed;
and playing the audio and video to be played based on the voice signal.
2. The audio and video speech enhancement method according to claim 1, wherein the using the audio signal to be processed as an input of a pre-trained speech recognition model to make the speech recognition model output a speech signal in the audio signal to be processed comprises:
performing framing processing on the audio signal to be processed to obtain an audio frame set corresponding to the audio signal to be processed;
inputting each audio frame contained in the audio frame set into a pre-trained speech recognition model according to a time sequence so that the speech recognition model outputs a speech signal corresponding to each audio frame;
and obtaining a voice signal corresponding to the audio signal to be processed according to the voice signal corresponding to each audio frame.
3. The audio and video speech enhancement method of claim 2, wherein the speech recognition model is a densely connected convolutional neural network comprising a plurality of convolutional layers, a plurality of Rnn layers, and a plurality of dilated convolutional layers connected in series, and the output of each convolutional layer is channel-concatenated with the input of its corresponding dilated convolutional layer.
4. The audio-visual speech enhancement method of claim 1, characterized in that the method further comprises:
acquiring an audio frame training sample set, wherein each audio frame training sample in the audio frame training sample set comprises a voice signal and background noise;
respectively inputting the training samples of the audio frames into a speech recognition model to be trained according to the time sequence;
and adjusting parameters in the speech recognition model so that the speech recognition model outputs speech signals contained in the training samples of each audio frame.
5. The audio and video speech enhancement method according to claim 1, wherein the obtaining of the audio signal to be processed in the audio and video to be played comprises:
and if an enhancement request aiming at the voice signal is received, acquiring the audio signal to be processed in the audio and video to be played.
6. A smart television, comprising:
display means for displaying a video signal;
the audio playing device is used for playing the audio signal;
a processor electrically connected to the display device and the audio playing device, respectively, wherein the processor can execute the audio and video speech enhancement method as claimed in any one of claims 1 to 5.
7. An apparatus for speech enhancement of audio and video, the apparatus comprising:
the acquisition module is used for acquiring audio signals to be processed in the audio and video to be played;
the output module is used for taking the audio signal to be processed as the input of a pre-trained voice recognition model so as to enable the voice recognition model to output the voice signal in the audio signal to be processed;
and the playing module is used for playing the audio and video to be played based on the voice signal.
8. The audio-visual speech enhancement device of claim 7, wherein the output module comprises:
the framing unit is used for framing the audio signal to be processed to obtain an audio frame set corresponding to the audio signal to be processed;
the input unit is used for inputting each audio frame contained in the audio frame set into a pre-trained speech recognition model according to a time sequence so that the speech recognition model outputs a speech signal corresponding to each audio frame;
and the processing unit is used for obtaining the voice signal corresponding to the audio signal to be processed according to the voice signal corresponding to each audio frame.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 5.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
CN202111614722.2A 2021-12-27 2021-12-27 Audio and video voice enhancement method, device, equipment, medium and smart television Pending CN114333796A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111614722.2A CN114333796A (en) 2021-12-27 2021-12-27 Audio and video voice enhancement method, device, equipment, medium and smart television

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111614722.2A CN114333796A (en) 2021-12-27 2021-12-27 Audio and video voice enhancement method, device, equipment, medium and smart television

Publications (1)

Publication Number Publication Date
CN114333796A (en) 2022-04-12

Family

ID=81012187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111614722.2A Pending CN114333796A (en) 2021-12-27 2021-12-27 Audio and video voice enhancement method, device, equipment, medium and smart television

Country Status (1)

Country Link
CN (1) CN114333796A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115278352A (en) * 2022-06-22 2022-11-01 北京字跳网络技术有限公司 Video playing method, device, equipment and storage medium
WO2023246823A1 (en) * 2022-06-22 2023-12-28 北京字跳网络技术有限公司 Video playing method, apparatus and device, and storage medium
CN115331691A (en) * 2022-10-13 2022-11-11 广州成至智能机器科技有限公司 Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination