CN116453539A - Voice separation method, device, equipment and storage medium for multiple speakers

Voice separation method, device, equipment and storage medium for multiple speakers

Info

Publication number
CN116453539A
Authority
CN
China
Prior art keywords
speaker
embedded representation
data
voice
speech
Prior art date
Legal status
Pending
Application number
CN202310318722.0A
Other languages
Chinese (zh)
Inventor
杨毅
Current Assignee
Zeku Technology Shanghai Corp Ltd
Original Assignee
Zeku Technology Shanghai Corp Ltd
Priority date
Filing date
Publication date
Application filed by Zeku Technology Shanghai Corp Ltd filed Critical Zeku Technology Shanghai Corp Ltd
Priority to CN202310318722.0A
Publication of CN116453539A
Legal status: Pending

Classifications

    • G06V 40/168: Feature extraction; face representation (human faces)
    • G06N 3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06V 10/82: Image or video recognition using neural networks
    • G06V 20/46: Extracting features or characteristics from video content, e.g. video fingerprints, representative shots or key frames
    • G10L 21/0272: Voice signal separating
    • G10L 21/0308: Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L 25/18: Extracted parameters being spectral information of each sub-band
    • G10L 25/30: Analysis technique using neural networks
    • G10L 25/57: Specially adapted for comparison or discrimination for processing of video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Stereophonic System (AREA)

Abstract

The application discloses a voice separation method, device and equipment for multiple speakers and a storage medium, and relates to the field of artificial intelligence. The method comprises the following steps: acquiring video frame data and audio frame data in a video; acquiring face region data from the video frame data; acquiring mouth shape variation data from the face region data, and extracting a semantic embedded representation based on the mouth shape variation data; extracting a speech embedded representation based on the audio frame data; and inputting the semantic embedded representation and the speech embedded representation into a multi-modal speech separation model to separate out the voice data of a target speaker. By extracting the semantic embedded representation and the speech embedded representation separately and inputting both into the multi-modal speech separation model, the method separates out the voice data of the target speaker and thereby achieves a speech enhancement effect for the target speaker.

Description

Voice separation method, device, equipment and storage medium for multiple speakers
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, apparatus, device, and storage medium for speech separation for multiple speakers.
Background
Speech separation refers to the extraction of one or more target speech signals from mixed speech produced by multiple speakers.
With the development of deep learning, neural networks have also been applied in the field of speech separation. After a speech separation model is trained on audio samples of the target speaker, the mixed speech of multiple speakers is input into the speech separation model, and the model outputs the speech signal of the target speaker.
However, the purity of the target speaker's speech signal extracted in this way is still not high.
Disclosure of Invention
The embodiment of the application provides a voice separation method, device and equipment for multiple speakers and a storage medium. The technical scheme is as follows:
According to an aspect of the present application, there is provided a speech separation method for multiple speakers, the method comprising: acquiring video frame data and audio frame data in a video;
acquiring face region data from the video frame data; acquiring mouth shape variation data from the face region data, and extracting a semantic embedded representation based on the mouth shape variation data;
extracting a speech embedded representation based on the audio frame data;
and inputting the semantic embedded representation and the speech embedded representation into a multi-modal speech separation model to separate out the voice data of a target speaker.
According to another aspect of the present application, there is provided a speech separation apparatus for multiple speakers, the apparatus comprising:
the acquisition module is used for acquiring video frame data and audio frame data in the video;
the acquisition module is further used for acquiring face region data from the video frame data, acquiring mouth shape variation data from the face region data, and extracting a semantic embedded representation based on the mouth shape variation data;
the acquisition module is further used for extracting a speech embedded representation based on the audio frame data;
and the separation module is used for inputting the semantic embedded representation and the speech embedded representation into a multi-modal speech separation model to separate out the voice data of the target speaker.
According to another aspect of the present application, there is provided a computer device comprising: a processor and a memory, the memory storing at least one program that is loaded and executed by the processor to cause the computer device to implement the speech separation method for multiple speakers as described in the above aspects.
According to another aspect of the present application, there is provided a computer-readable storage medium having stored therein at least one program that is loaded and executed by a processor to implement the speech separation method for multiple speakers as described in the above aspect.
According to another aspect of the present application, there is provided a computer program product comprising at least one program stored in a computer readable storage medium; the processor of the communication device reads the at least one program from the computer-readable storage medium, and the processor executes the at least one program to cause the communication device to perform the speech separation method for multiple speakers as described in the above aspect.
According to another aspect of the present application, there is provided a computer program comprising at least one program stored in a computer readable storage medium; the processor of the communication device reads the at least one program from the computer-readable storage medium, and the processor executes the at least one program to cause the communication device to perform the speech separation method for multiple speakers as described in the above aspect.
The technical solutions provided in the embodiments of the present application have at least the following beneficial effects:
By separately acquiring the video frame data and the audio frame data in a video, face region data and then mouth shape variation data are acquired from the video frame data, from which a semantic embedded representation is extracted; a speech embedded representation is extracted from the audio frame data; and the voice data of the target speaker is obtained based on both the semantic embedded representation and the speech embedded representation, which improves the accuracy of target speaker recognition and achieves a speech enhancement effect for the target speaker.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a method for multi-speaker speech separation provided in another exemplary embodiment of the present application;
FIG. 2 is a flow chart of a method for multi-speaker speech separation provided in another exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of data processing provided by another exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of a method for multi-speaker speech separation provided in another exemplary embodiment of the present application;
FIG. 5 is a flowchart of a method for multi-speaker speech separation provided in another exemplary embodiment of the present application;
FIG. 6 is a flowchart of a method for multi-speaker speech separation provided in another exemplary embodiment of the present application;
FIG. 7 is a flowchart of a method for multi-speaker speech separation provided in another exemplary embodiment of the present application;
FIG. 8 is a schematic illustration of speech processing provided by another exemplary embodiment of the present application;
FIG. 9 is a flowchart of a method for multi-speaker speech separation provided in another exemplary embodiment of the present application;
FIG. 10 is a block diagram of a speech separation apparatus for multiple speakers provided in accordance with another exemplary embodiment of the present application;
FIG. 11 is a block diagram of a computer device provided by an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, the terms involved in the embodiments of the present application will be briefly described:
Neural network model: an algorithmic mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. Such a network depends on the complexity of the system and processes information by adjusting the interconnection relationships among a large number of internal nodes.
Embedded representation: a distributed representation of an input object generated by a neural network model. Its main function is to transform the high-dimensional sparse vector of the original object into a low-dimensional dense vector, so that the low-dimensional dense vector can express one or more features of the corresponding object; meanwhile, the distances between different vectors reflect the similarity between the corresponding objects.
Fourier transform (Fourier Transform, FT): refers to a mathematical method capable of converting a signal in the time domain into a signal in the frequency domain.
In the related art, in a scene including a plurality of speakers, it is necessary to separate the voice data of the plurality of speakers so as to distinguish the voice of a target speaker from the voices of other people. Therefore, the present application proposes a speech separation method for multiple speakers; specific implementations are as follows:
fig. 1 is a flowchart of a method for speech separation for multiple speakers according to an exemplary embodiment of the present application. The method comprises the following steps:
step 120: acquiring video frame data and audio frame data in a video;
in some embodiments, where only one speaker is included in the video, the speaker's voice data may be obtained directly by way of voice extraction.
In the embodiments of the present application, a case where a video includes multiple speakers is mainly described, that is, at least two speakers are included in the video. Illustratively, taking the video containing m speakers as an example, the value of m is a positive integer greater than or equal to 2.
And respectively acquiring video frame data and audio frame data in the video. The video frame data specifically refers to image frame data obtained from a time series of the video. Illustratively, the video includes N image frames in total according to a time sequence, that is, the video is composed of N image frames, where N is a positive integer greater than or equal to 2.
In one example, all N image frame data in a video is acquired.
In one example, since processing all N image frames involves a large amount of computation, the N image frames are extracted at intervals. The extraction may be performed at regular intervals, e.g. every 5 seconds, with one or more image frames extracted in each 5-second interval. The extraction may also be performed at random times, i.e. at arbitrary intervals, which may or may not be the same each time. In one example, the N image frames are extracted selectively based on recognition of the N image frames, e.g., the i-th image frame is extracted when it is recognized to contain a speaker.
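As an illustration of the interval-based extraction described above, a minimal sketch is given below; the frame rate and the sampling interval are assumptions for the example only, not values fixed by the patent.

```python
# Minimal sketch of interval-based frame decimation, assuming a 30 fps video
# and one sampled frame every 5 seconds (both values are illustrative).
def sample_frame_indices(num_frames: int, fps: int = 30, interval_s: int = 5):
    step = fps * interval_s
    return list(range(0, num_frames, step))  # indices of the extracted image frames
```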
The audio frame data specifically refers to audio frame information in video, that is, information generated by sound signals generated when a plurality of speakers speak.
Step 142: acquiring face area data from video frame data;
face region data is obtained from video frame data, and one or more face regions may be included in the same image frame, i.e., the face region data is all or part of the video frame data. The face region data is face data corresponding to a speaker in an i-th image frame in the video frame data.
Illustratively, the image in the ith image frame is subjected to face recognition through an artificial intelligence (Artificial Intelligence, AI) model for face recognition, one or more faces are recognized from the ith image frame, and the region data of the recognized one or more faces is the face region data. The AI model for face recognition is obtained by taking face data of a plurality of speakers as training samples and training the face data in advance before implementing the embodiment of the application.
In one example, only face region data is contained in the video frame data, and the face region data is all data in the video frame data.
In one example, the video frame data includes at least two characteristic regions of a speaker, such as a face region and a hand region, and the face region data is part of the video frame data.
In some embodiments, when m speakers are included in the video, the face region data acquired from the video frame data includes from the face region data of the 1 st speaker to the face region data of the m-th speaker, where the value of m is a positive integer greater than or equal to 2.
Step 144: acquiring mouth shape variation data from the face region data;
In some embodiments, the face region data includes at least lip region data, and the mouth shape variation data is obtained based on changes of the lip region data. The lip region data is one kind of face region data. For example, according to the different facial features, the face region data may include forehead region data, eye region data, nose region data, ear region data, and lip region data.
Illustratively, the lip recognition is performed on each face region by an AI model for lip recognition, so as to obtain the mouth shape change of each face region. For example, it is recognized that the speaker's lip state changes from an open state to a closed state. The AI model for lip recognition is obtained by taking lip data of a plurality of speakers as training samples and training the lip data in advance before implementing the embodiment of the application.
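As an illustration of how a mouth shape change could be detected from consecutive face crops, a minimal sketch is given below; lip_landmarks() is a hypothetical helper returning lip key points for one face crop, and the movement threshold is an arbitrary assumption rather than part of the patent.

```python
import numpy as np

# Sketch of detecting mouth shape variation between two consecutive face crops;
# lip_landmarks() is a hypothetical helper returning an (N, 2) array of lip
# key points, and the movement threshold is an arbitrary assumption.
def mouth_is_moving(prev_face, cur_face, lip_landmarks, threshold: float = 2.0) -> bool:
    prev_pts = lip_landmarks(prev_face)
    cur_pts = lip_landmarks(cur_face)
    mean_shift = float(np.mean(np.linalg.norm(cur_pts - prev_pts, axis=1)))
    return mean_shift > threshold  # large lip movement is treated as "speaking"
```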
In some embodiments, the mouth-shaped variation data is one-to-one with the speaker. Under the condition that m speakers are included in the video, mouth shape variation data corresponding to the 1 st speaker is obtained from face area data of the 1 st speaker, mouth shape variation data corresponding to the m th speaker is obtained from face area data of the m th speaker, and the value of m is a positive integer greater than or equal to 2. The m speakers include the target speaker.
Step 146: extracting a semantic embedded representation based on the mouth shape variation data;
The semantic embedded representation is semantic information, obtained by processing the video frame data, that corresponds to the speech information in the audio frame data. The semantic embedded representation is derived from the mouth shape variation, for example by lip recognition, which can identify that the speaker is saying "hello" while speaking. In the case where the video contains m speakers, the semantic embedded representation corresponding to the 1st speaker is obtained based on the mouth shape variation of the 1st speaker, and the semantic embedded representation corresponding to the m-th speaker is obtained based on the mouth shape variation of the m-th speaker. The m speakers include the target speaker.
Step 160: extracting a speech embedded representation based on the audio frame data;
the speech embedded representation is derived based on speech information in the audio frame data, such as by voiceprint extraction, speech recognition, etc. The audio frame data is mainly audio data including the speaking sounds of a plurality of speakers in a video.
Step 180: inputting the semantic embedded representation and the speech embedded representation into a multi-modal speech separation model, and separating to obtain the voice data of the target speaker.
The semantic embedded representation extracted based on the video frame data and the speech embedded representation extracted based on the audio frame data are input into the multi-modal speech separation model, which combines the semantic embedded representation and the speech embedded representation to separate out the voice data of the target speaker.
In summary, in the method provided by this embodiment, the video frame data and the audio frame data in the video are acquired separately, the face region data and then the mouth shape variation data are acquired from the video frame data, and the semantic embedded representation is extracted therefrom; the speech embedded representation is extracted from the audio frame data; and the voice data of the target speaker is obtained based on both the semantic embedded representation and the speech embedded representation, which improves the accuracy of target speaker recognition and achieves a speech enhancement effect for the target speaker.
Fig. 2 is a flowchart of a method for speech separation for multiple speakers according to an exemplary embodiment of the present application. The method mainly comprises a face recognition network 21 for recognizing face area data in video frame data, a network 22 for mouth shape detection, a voice extraction network 23 for voice separation, and a voice evaluation network 24 for voice evaluation.
In some embodiments, each of the N image frames corresponds to time information (e.g., a timestamp) of that image frame, and each audio frame also corresponds to time information of that audio frame; the corresponding image frames and audio frames can be aligned based on the time information of the image frames and the time information of the audio frames. It should be understood that the timestamps of the image frames do not correspond one-to-one with the timestamps of the audio frames: for example, the video may contain 60 image frames per second, each with its own timestamp, while every 10 ms of sound is encoded into one audio frame.
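For illustration only, the following sketch aligns one image frame with its overlapping audio frames, using the figures from the example above (60 image frames per second, one audio frame per 10 ms); it is an assumption about how the alignment could be computed, not the patent's concrete procedure.

```python
# Sketch of aligning one image frame with its audio frames, assuming 60 image
# frames per second and one audio frame per 10 ms, as in the example above.
def video_frame_to_audio_frames(frame_idx: int, fps: int = 60, audio_frame_ms: int = 10):
    t_start = frame_idx / fps                      # start time of the image frame (s)
    t_end = (frame_idx + 1) / fps                  # end time of the image frame (s)
    first = int(t_start * 1000) // audio_frame_ms  # first overlapping audio frame index
    last = int(t_end * 1000) // audio_frame_ms     # last overlapping audio frame index
    return list(range(first, last + 1))
```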
In some embodiments, in the case where mouth shape variation data is acquired from the face region data, the speech embedded representation is extracted based on the audio frame data corresponding to the mouth shape variation time period, where the mouth shape variation time period is the time period corresponding to the mouth shape variation data. For example, as shown in fig. 3, the length of a video is 150 s, and the mouth shape variation data in the face region data of the video frame data is acquired from the 50th to the 100th second; the speech embedded representation is then extracted from the audio frame data of the 50th to the 100th second.
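A minimal sketch of selecting the audio samples inside the mouth shape variation period (50 s to 100 s in the example above) follows; the sample rate is an assumed value.

```python
import numpy as np

# Sketch of cutting out the audio samples inside the mouth shape variation
# period; the sample rate is an assumption for the example.
def slice_audio_by_period(audio: np.ndarray, sample_rate: int, t_start: float, t_end: float) -> np.ndarray:
    return audio[int(t_start * sample_rate): int(t_end * sample_rate)]

# e.g. segment = slice_audio_by_period(audio, 16000, 50.0, 100.0)
```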
In some embodiments, as shown in fig. 4, the step 160 may optionally include the following sub-steps:
Step 162: performing Fourier transform on the audio frame data to obtain spectrogram frequency data of multiple speakers;
The audio frame data of the plurality of speakers is Fourier-transformed and converted into spectrogram frequency data of the plurality of speakers. Illustratively, as shown in FIG. 2, the audio frame data is converted into spectrogram frequency data by a short-time Fourier transform (STFT). The spectrogram is a frequency distribution diagram of the speech signals of the plurality of speakers and represents the energy of those speech signals at the corresponding frequencies.
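For illustration, a sketch of the STFT step is shown below using scipy as an assumed implementation; the window and hop sizes are illustrative choices, not parameters fixed by the patent.

```python
import numpy as np
from scipy.signal import stft

# Sketch of the short-time Fourier transform step; window and hop sizes are
# illustrative assumptions.
def mixture_spectrogram(audio: np.ndarray, sample_rate: int = 16000):
    freqs, times, Z = stft(audio, fs=sample_rate, nperseg=400, noverlap=240)
    magnitude = np.abs(Z)  # energy of the mixed speech per frequency bin and time frame
    phase = np.angle(Z)    # kept for reconstruction by the inverse STFT later
    return magnitude, phase
```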
Step 164: the spectrogram frequency data of multiple speakers is input into a feature extraction network to extract the speech embedded representation.
In some embodiments, the feature extraction network extracts deeper data features primarily by convolving the input data.
The spectrogram frequency data of multiple speakers are input into a feature extraction network, and speech embedding representation is obtained through convolution processing and feature extraction on the spectrogram frequency data of the multiple speakers.
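A minimal PyTorch sketch of such a convolutional feature extraction network is given below; the layer sizes and embedding dimension are assumptions rather than the patent's concrete architecture.

```python
import torch
import torch.nn as nn

# Sketch of a convolutional feature extractor over the spectrogram; all
# dimensions are illustrative assumptions.
class AudioFeatureNet(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),   # collapse the frequency axis
        )
        self.proj = nn.Conv1d(64, embed_dim, kernel_size=1)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, freq_bins, time_frames)
        x = self.conv(spec).squeeze(2)         # (batch, 64, time_frames)
        return self.proj(x).transpose(1, 2)    # (batch, time_frames, embed_dim)
```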
In summary, in the method provided by this embodiment, the speech signal in the audio frames is converted into frequency-domain data through the Fourier transform, which simplifies the subsequent processing of the speech signal, and the speech embedded representation is obtained through this processing.
In some embodiments, as shown in fig. 5, the step 180 may optionally include the following sub-steps:
step 182: acquiring a first speaker embedded representation of a target speaker;
The first speaker embedded representation is the embedded representation corresponding to the voice data of the target speaker, or can be understood as the true speech embedded representation of the target speaker. The first speaker embedded representation carries information representing the identity of the speaker and is also called the speaker identity (Speaker ID). Each speaker corresponds to a Speaker ID; the Speaker IDs of different speakers may be similar but always differ at least slightly, so different speakers correspond to different Speaker IDs.
In some embodiments, the first speaker-embedded representation is derived from a neural network model trained in advance. The neural network model is trained from the speech data of a clean (noiseless) target speaker, and the output of the neural network model is used as the first speaker embedded representation.
Step 184: the first speaker embedded representation, the semantic embedded representation and the voice embedded representation are input into a multi-mode voice separation model, and voice data of a target speaker are obtained through separation.
Inputting the acquired first speaker embedded representation of the target speaker, the semantic embedded representation extracted based on the video frame data and the speech embedded representation extracted based on the audio frame data into a multi-mode speech separation model, and separating the speech data of the target speaker by combining the first speaker embedded representation, the semantic embedded representation and the speech embedded representation to obtain the speech data of the target speaker.
Wherein the first speaker embedded representation is a true voice embedded representation of the target speaker, which serves as a reference for the voice data of the target speaker when separating the voice data of the target speaker. The semantic embedded representation is mainly used for assisting the speech embedded representation to separate the speech data of the target speaker, for example, the semantic of the word "hello" is used for assisting in separating the speech data of the word "hello" expressed by the target speaker.
In summary, the method provided in this embodiment, by combining the first speaker embedded representation, the semantic embedded representation and the speech embedded representation, can effectively improve the accuracy of identifying the target speaker when identifying the speech data of the target speaker.
In some embodiments, the multi-modal speech separation model includes a Long Short-Term Memory (LSTM) network and a fully connected layer. As shown in fig. 6, step 184 may optionally include the following sub-steps:
step 1842: inputting the first speaker embedded representation, the semantic embedded representation and the voice embedded representation into an LSTM network to obtain a fused high-dimensional output;
The LSTM network is a recurrent neural network over time and mainly comprises three stages: (1) a forget stage, in which the input to the LSTM network is selectively forgotten, e.g., data such as environmental noise in the speech embedded representation is forgotten; (2) a select-and-memorize stage, in which the input to the LSTM network is selectively memorized, e.g., the data in the speech embedded representation corresponding to the first speaker embedded representation is memorized; (3) an output stage, in which the results of selective forgetting and selective memorization are fused and output.
In some embodiments, the first speaker embedded representation, the semantic embedded representation and the speech embedded representation are input into the LSTM network, and a fused high-dimensional output is obtained through the selective forgetting and selective memorization stages. The fused high-dimensional output is the output obtained by fusing the first speaker embedded representation, the semantic embedded representation and the speech embedded representation in the LSTM network after selective forgetting and memorization.
Step 1844: inputting the fused high-dimensional output to a full-connection layer to obtain spectrogram frequency data of a target speaker;
The fused high-dimensional output is input to the fully connected layer for classification: the fully connected layer performs feature extraction on the fused high-dimensional output and maps it to the spectrogram frequency data of the target speaker. That is, the spectrogram frequency data is the result of the fully connected layer mapping the fused high-dimensional output to the spectrogram frequency dimensions.
Step 1846: and carrying out inverse Fourier transform on the spectrogram frequency data of the target speaker to obtain the voice data of the target speaker.
And carrying out inverse Fourier transform on the spectrogram frequency data of the target speaker, and converting the spectrogram frequency data of the target speaker into voice data of the target speaker.
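The following PyTorch sketch illustrates the fusion stage of steps 1842 to 1846: the concatenated embeddings pass through an LSTM, and a fully connected layer maps the fused high-dimensional output to spectrogram frequency bins. All dimensions are illustrative assumptions, not the patent's concrete network.

```python
import torch
import torch.nn as nn

# Sketch of the LSTM + fully connected fusion stage; in_dim must equal the sum
# of the speech, semantic and speaker embedding dimensions.
class FusionSeparator(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 400, freq_bins: int = 201):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, freq_bins)

    def forward(self, speech_emb, semantic_emb, speaker_emb):
        # speech_emb, semantic_emb: (batch, time, dim); speaker_emb: (batch, dim)
        spk = speaker_emb.unsqueeze(1).expand(-1, speech_emb.size(1), -1)
        fused, _ = self.lstm(torch.cat([speech_emb, semantic_emb, spk], dim=-1))
        return self.fc(fused)  # per-frame spectrogram frequency data of the target speaker
```

The returned spectrogram frequency data would then be converted back to a waveform by the inverse Fourier transform, as described in step 1846.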
In summary, the method provided in this embodiment, by combining the first speaker embedded representation, the semantic embedded representation and the speech embedded representation, can effectively improve the accuracy of identifying the target speaker when identifying the speech data of the target speaker.
In some embodiments, the voice characteristics of the target speaker do not remain unchanged all the time. For example, the target speaker's voice may change from young to mature with age, and such changes are typically irreversible. As another example, the target speaker may experience a transient voice change for physical reasons (e.g., a cold); such changes are usually reversible, i.e. temporary, and when the physical state of the target speaker recovers, the voice characteristics of the target speaker recover with it.
Aiming at the situation that the voice characteristics of the target speaker may change, the voice separation method for multiple speakers provided by the embodiment of the application can also adaptively update the embedded representation of the target speaker.
Fig. 7 is a flowchart of a method for speech separation for multiple speakers according to an exemplary embodiment of the present application. The method comprises the following steps:
step 220: inputting the voice data of the target speaker to a voice evaluation model to obtain voice quality evaluation;
the voice data of the target speaker is the voice data obtained by separating from the video containing a plurality of speakers, and is the voice data corresponding to the target speaker in the current video.
The voice evaluation model is used for detecting voice quality of voice data, voice data of a target speaker separated from the multi-mode voice separation model is input into the voice evaluation model, and voice quality detection is carried out on the voice data of the target speaker, so that voice quality evaluation is obtained.
Step 240: generating a second speaker-embedded representation of the target speaker if the speech quality assessment is greater than the assessment threshold;
In some embodiments, an evaluation threshold is set for the speech quality evaluation. The evaluation threshold reflects the output signal-to-noise ratio (the ratio of signal to noise) of the speech evaluation model. When the speech quality evaluation is greater than the evaluation threshold, the output signal-to-noise ratio of the separated speech is relatively high, i.e., the noise in the voice data of the target speaker is relatively low, and at this time a second speaker embedded representation of the target speaker is generated.
In some embodiments, the second speaker-embedded representation is obtained in a similar manner as the first speaker-embedded representation described above. The voice data of the target speaker separated from the video containing a plurality of speakers is input into a neural network model for extracting a voice embedded representation, and a second speaker embedded representation is output.
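A minimal sketch of this quality-gated regeneration is given below; evaluate_quality() and extract_embedding() are placeholders for the speech evaluation model and the embedding network, and the threshold is an arbitrary assumption.

```python
# Sketch of the quality-gated regeneration of the speaker embedding.
def maybe_generate_second_embedding(separated_audio, evaluate_quality, extract_embedding,
                                    eval_threshold: float = 0.8):
    quality = evaluate_quality(separated_audio)    # speech quality evaluation
    if quality > eval_threshold:                   # separated speech is clean enough
        return extract_embedding(separated_audio)  # second speaker embedded representation
    return None                                    # otherwise keep the first representation
```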
Step 260: the first speaker embedded representation is updated based on the second speaker embedded representation.
In some embodiments, the second speaker-embedded representation is a speaker-embedded representation obtained based on speech data of a noisy target speaker, and the first speaker-embedded representation is a speaker-embedded representation obtained based on speech data of a clean (noiseless) target speaker. In some embodiments, the first speaker embedded representation is updated based on the second speaker embedded representation.
Specifically, the first speaker embedded representation is obtained by sampling the voice data of a clean (noiseless) target speaker in advance, whereas the second speaker embedded representation is obtained from the (noisy) voice data of the target speaker separated from the video containing multiple speakers by the speech separation method for multiple speakers described above.
In some embodiments, a distance between the first speaker embedded representation and the second speaker embedded representation is calculated; the distance can be understood as a difference value describing the similarity of the two representations. Two thresholds are set for the distance: the first threshold distinguishes the case where the first and second speaker embedded representations are highly similar from the case where they are moderately similar, and the second threshold distinguishes the case where they are moderately similar from the case where they are not very similar. The first threshold and the second threshold may be preset in advance, or may be dynamically adjusted according to the actual situation.
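As an illustration, one possible distance is the cosine distance sketched below; the patent does not fix a particular metric, so this is an assumption.

```python
import numpy as np

# Sketch of a cosine distance between the two speaker embedded representations;
# 0 means identical, larger values mean less similar.
def embedding_distance(first_emb: np.ndarray, second_emb: np.ndarray) -> float:
    cos = np.dot(first_emb, second_emb) / (
        np.linalg.norm(first_emb) * np.linalg.norm(second_emb) + 1e-8)
    return 1.0 - float(cos)
```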
In summary, by combining the speech quality evaluation, the method provided by this embodiment adaptively updates the embedded representation of the target speaker so that it stays close to the speaker's current state, avoiding the situation where the original embedded representation is still used after the target speaker's embedded representation has changed, which would make the output voice data of the target speaker inaccurate.
In some embodiments, the distance between the first speaker-embedded representation and the second speaker-embedded representation may be divided into the following three cases based on the first and second thresholds for distance:
First case: the distance between the first speaker embedded representation and the second speaker embedded representation is less than or equal to the first threshold, which can be understood as the two representations being highly similar.
In some embodiments, in this case the first speaker embedded representation is either not updated or is updated with the second speaker embedded representation. The distance between the two representations can be regarded as negligibly small, so the effect is the same whether the first speaker embedded representation is updated with the second speaker embedded representation or not.
It is worth noting that when the distance between the first and second speaker embedded representations is less than or equal to the first threshold, it is preferable not to update the first speaker embedded representation: the first speaker embedded representation is less affected by noise than the second, and is therefore generally considered closer to the true voice of the target speaker.
Second case: the distance between the first speaker embedded representation and the second speaker embedded representation is greater than the first threshold and less than the second threshold, which can be understood as the two representations being moderately similar.
In some embodiments, in this case a weighted sum of the first speaker embedded representation and the second speaker embedded representation is calculated, resulting in a third speaker embedded representation.
In some embodiments, the weight of the first speaker embedded representation and the weight of the second speaker embedded representation are preset in advance. Illustratively, assume that the first speaker embedded representation is denoted A and the second speaker embedded representation is denoted B; the weight of the first speaker embedded representation is preset to 80% and the weight of the second speaker embedded representation to 20%; then the third speaker embedded representation = 80% A + 20% B. It should be noted that 80% and 20% are only reference values, and the actual weights can be adjusted according to the knowledge of those skilled in the art. In some embodiments, the first speaker embedded representation is updated using the third speaker embedded representation; illustratively, the first speaker embedded representation A is replaced with the calculated third speaker embedded representation 80% A + 20% B.
In some embodiments, the above weights are dynamically adjusted. Dynamically adjusting the weight of the first speaker embedded representation and the weight of the second speaker embedded representation under the condition that the sound state of the target speaker is changed; a weighted sum of the first speaker embedded representation and the second speaker embedded representation is calculated based on the adjusted weights of the first speaker embedded representation and the second speaker embedded representation. The weight is dynamically adjusted based on the actual condition of the different targeted speakers in different states, which are related to different factors, such as based on age changes of the targeted speaker, such as based on transient body changes (e.g., cold) of the targeted speaker, such as based on long-term body changes of the targeted speaker.
In some embodiments, the type of change in the voice state of the target speaker is identified by a voice state identification model that is trained in advance, before implementing the embodiments of the present application, using voice states of target speakers as training samples. The change in the voice state of the target speaker includes at least one of a long-term change, such as the target speaker's voice changing from childhood to adulthood, and a short-term change, such as the target speaker's voice changing because of a cold.
In some embodiments, when the type of change in the voice state of the target speaker is a long-term change, the first speaker embedded representation is less similar to the target speaker's current embedded representation than the second speaker embedded representation is, and the weight of the first speaker embedded representation is therefore adjusted to be less than the weight of the second speaker embedded representation.
In some embodiments, when the type of change in the voice state of the target speaker is a short-term change, the first speaker embedded representation is more similar to the target speaker's current embedded representation than the second speaker embedded representation is, and the weight of the first speaker embedded representation is therefore adjusted to be greater than the weight of the second speaker embedded representation.
A weighted sum of the first speaker embedded representation and the second speaker embedded representation is calculated based on the adjusted weights of the first speaker embedded representation and the second speaker embedded representation.
In one example, the weight is dynamically adjusted based on the age of the target speaker. Typically, the age-related change of the target speaker is irreversible, so during the period in which the target speaker's voice is changing, the second speaker embedded representation has greater reference value than the first speaker embedded representation, and the weight of the second speaker embedded representation is dynamically adjusted to be greater than the weight of the first speaker embedded representation.
In one example, the weight is dynamically adjusted based on a transient physical change of the target speaker (e.g., a cold). Typically, such a transient change reverses as the body recovers, but while it lasts the voice characteristics of the target speaker change briefly, so the weights of the first speaker embedded representation and the second speaker embedded representation need to be dynamically adjusted during the period of physical discomfort. When the target speaker is physically unwell, the weight of the second speaker embedded representation is dynamically adjusted to be greater than that of the first speaker embedded representation; when the target speaker's body recovers, the weight of the first speaker embedded representation is dynamically adjusted to be greater than that of the second speaker embedded representation. In some embodiments, the weights are further adjusted according to the degree of physical discomfort of the target speaker: for example, when the target speaker's condition is poor (the voice change is significant), the weight of the first speaker embedded representation is adjusted to 80% and that of the second speaker embedded representation to 20%; when the condition improves (the voice change is small), the weight of the first speaker embedded representation is adjusted to 60% and that of the second speaker embedded representation to 40%.
In some embodiments, the weights are dynamically adjusted stepwise according to the period. In one example, the weight is dynamically adjusted on a one week (7 days) cycle, specifically considering that the targeted speaker may experience a brief physical discomfort (brief sound change) during the week, such as from cold to cure. In one example, the weight is dynamically adjusted with one month (30 days) as one period. In one example, the weight is dynamically adjusted with a period of half a year (180 days). In some embodiments, the period is preset or dynamically adjusted according to the actual situation.
Third case: the distance between the first speaker embedded representation and the second speaker embedded representation is greater than or equal to the second threshold, which can be understood as the two representations being not very similar.
In some embodiments, when the distance between the first speaker embedded representation and the second speaker embedded representation is greater than or equal to the second threshold, a prompt is output indicating that the separated speech data is not the speech data of the target speaker. In this case the distance between the two representations is large, i.e. there is a large gap between the separated speech data (corresponding to the second speaker embedded representation) and the speech data of the target speaker (corresponding to the first speaker embedded representation).
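A minimal sketch of the three-case update strategy is given below; the threshold values are arbitrary assumptions, and the 80%/20% weights are the illustrative values from the second case above.

```python
import numpy as np

# Sketch of the three-case update strategy; thresholds are assumed values.
def update_speaker_embedding(first_emb: np.ndarray, second_emb: np.ndarray, dist: float,
                             first_threshold: float = 0.2, second_threshold: float = 0.6,
                             w_first: float = 0.8, w_second: float = 0.2):
    if dist <= first_threshold:
        return first_emb, "kept"                                        # case 1: no update
    if dist < second_threshold:
        return w_first * first_emb + w_second * second_emb, "weighted"  # case 2: weighted sum
    return first_emb, "not the target speaker"                          # case 3: prompt the user
```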
In summary, according to the method provided by the embodiment, the voice data of the target speaker is input to the voice evaluation model, so that the embedded representation of the target speaker is adaptively updated based on the distance between the first speaker embedded representation and the second speaker embedded representation, and the voice recognition accuracy of the target speaker under multiple scenes is further improved.
The embodiment of the application provides a speech separation method for multiple speakers. Speaker separation generally takes mixed voice data as input; because different speakers have different speech embedded representations, multiple speakers can be separated after training a neural network. If the information of the target speaker is stored in the neural network in advance, the voice information of the target speaker can be obtained through multi-speaker speech extraction. The method provided by the embodiment of the application combines a pure speech-data separation method with an image-based face recognition method, and uses multi-modal joint biometric recognition to achieve a higher recognition rate. Meanwhile, the method takes into account that the characteristics of the target speaker's voice may also change to some extent with time and environment, for example a child's voice changing with age or the speaker being in poor condition because of a cold, in which case the speech embedded representation of the target speaker differs from that under normal conditions; the method achieves a speech enhancement effect by adaptively updating the speech embedded representation of the target speaker.
In the method provided by the embodiment of the application, the main data of the acquired video is divided into two parts, video frame data and audio frame data, for processing. The pictures in the video frame data are input into a face recognition network, mouth shape detection is performed on the output of the face recognition network, and when a speaking mouth shape change is detected across multiple frames, the semantic embedded representation of the speaker is obtained. The audio frame data is processed by STFT to obtain the speech embedded representation of the speaker. At this point, processing of the audio signal starts at the same time: the semantic embedded representation, the speech embedded representation and the first speaker embedded representation obtained by prior registration are input into the speech extraction network, and, as shown in fig. 8, speech separation is performed on the mixed voice data to obtain the voice data of the target speaker. The quality of the output voice data is evaluated by a speech evaluation model; when the speech quality is high, a second speaker embedded representation is generated from the voice data, the distance between the second speaker embedded representation and the first speaker embedded representation is calculated, and the embedded representation of the target speaker is dynamically updated according to an update strategy. Through this adaptive update strategy, the voiceprint characteristics of the target speaker are dynamically updated with the speaker's state, which improves the speech recognition rate and separation effect for the target speaker. In this process, the video frame data is mainly used to detect whether a person is speaking and to assist multi-speaker speech separation with the semantics contained in the mouth shape variation.
As shown in fig. 9, the method provided in the embodiment of the present application mainly includes the following modules:
1. The face extraction module 31, based on a convolutional neural network (Convolutional Neural Network, CNN) and a face recognition network. Before the original face image is input into the face recognition network, the CNN is used for face detection, because a CNN performs better in certain scenes, such as partial occlusion or unclear contours. Face regions are then cropped out, and all faces are resized to a fixed size for face embedding extraction; the extracted face features are analyzed for mouth shape changes, and if it is judged that someone in the scene is speaking, the semantic information contained in the mouth shape changes is formed into a semantic embedded representation, and the speech extraction module is started at the same time.
2. The speech extraction module 32. First, a voiceprint library of the target speaker is established in advance, using a clean speech signal of the target speaker as the reference input. One method for building the voiceprint library of the target speaker is to input the speech signals of the target speaker into a neural network, label the speaker, train the network, and take the output of the network as the deep vector (d-vector), that is, the speaker's Speaker ID.
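A minimal sketch of registering the target speaker's d-vector from clean speech is given below; speaker_net stands in for the pre-trained speaker network, and averaging frame-level outputs is one common choice assumed here rather than taken from the patent.

```python
import torch
import torch.nn as nn

# Sketch of building the reference d-vector (Speaker ID) from clean speech:
# frame-level outputs of a speaker network are averaged and normalised.
@torch.no_grad()
def register_speaker(clean_frames: torch.Tensor, speaker_net: nn.Module) -> torch.Tensor:
    frame_embeddings = speaker_net(clean_frames)  # (num_frames, embed_dim)
    d_vector = frame_embeddings.mean(dim=0)
    return d_vector / d_vector.norm()             # stored as the target speaker's reference
```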
3. The mask estimation module 33. The target speaker mask is predicted in the time-frequency domain. The network starts with a multi-layer CNN that captures variations in time and frequency; the output of the CNN layers is combined with the outputs of the speech embedding module and the face extraction module as the input to the LSTM layer, which is followed by a fully connected (Fully Connected, FC) layer that maps the high-dimensional output of the LSTM to the dimensions of the spectrogram frequencies.
4. The adaptive update module 34. Because the speech characteristics of the target speaker are not always constant, environmental and temporal changes, such as the target speaker growing from a child to an adult and the speaker being ill, can lead to poor results if the original Speaker ID is still used as the recognition identity under these conditions. In the method provided by the embodiment of the application, the separated speech signal undergoes speech quality detection by a speech evaluation system; for signals with a sufficiently high signal-to-noise ratio according to a judgment criterion, a new Speaker ID is regenerated and the distance to the original Speaker ID is calculated. One update strategy is as follows: when the distance between the original and the new Speaker ID is smaller than a set first threshold (distance minimum), no update is performed; when the distance is larger than the first threshold (distance minimum) and smaller than a second threshold (distance maximum), the original Speaker ID and the new Speaker ID are weighted; when the distance is greater than the second threshold (distance maximum), the user is prompted that the recorded voice is not from the same person and is asked to re-record. Through this update strategy, the speech recognition accuracy of the target speaker under multiple scenarios is improved.
5. The multi-modal target speaker separation and speech enhancement module. First, a clean speech signal of the target speaker is registered as a Speaker ID and embedded into the neural network. Then, in a video mixed with multiple speakers, the mixed speech signal is processed according to the mouth shape variation: a short-time Fourier transform is applied to the speech signal to obtain the spectrum of the mixed signal, the spectrum of the mixed signal is input into a convolutional neural network, the outputs of the speech embedding module and the face extraction module are combined as the input of the LSTM layer, the estimated speaker speech signal, namely the spectrum of the separated target speaker, is output through the fully connected layer, and the speech signal of the separated target speaker is finally obtained through the inverse Fourier transform.
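For illustration, the final reconstruction step can be sketched as below; treating the network output as a mask applied to the mixture magnitude and reusing the mixture phase is an assumption, and the STFT parameters must match those of the forward transform.

```python
import numpy as np
from scipy.signal import istft

# Sketch of reconstructing the target speaker's waveform from the estimated
# spectrum by the inverse short-time Fourier transform.
def reconstruct_target(mask: np.ndarray, mixture_mag: np.ndarray, mixture_phase: np.ndarray,
                       sample_rate: int = 16000) -> np.ndarray:
    target_spec = mask * mixture_mag * np.exp(1j * mixture_phase)
    _, target_audio = istft(target_spec, fs=sample_rate, nperseg=400, noverlap=240)
    return target_audio
```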
The embodiment of the application provides a multi-modal target speaker separation and adaptive speech enhancement method, which has at least the following beneficial effects:
1. The method can extract the voice data of the target speaker from a video, and adopts a multi-modal joint processing method, thereby remarkably improving the accuracy of target speaker recognition in both speech data extraction and image extraction and improving the output signal-to-distortion ratio of the voice data.
2. When watching or listening to audio and video, the user can hear the sound they want to hear, thereby obtaining an excellent audio and video experience.
3. The voiceprint characteristics of the target speaker can be dynamically adjusted according to the voice data of the target speaker in different states, thereby remarkably improving the recognition rate of the target speaker's voice.
Fig. 10 is a block diagram of a speech separation apparatus for multiple speakers according to another exemplary embodiment of the present application. The device comprises:
an acquisition module 1020 is configured to acquire video frame data and audio frame data in the video.
The acquisition module 1020 is further configured to acquire face region data from the video frame data; acquire mouth shape variation data from the face region data, and extract a semantic embedded representation based on the mouth shape variation data.
The acquisition module 1020 is further configured to extract a speech embedded representation based on the audio frame data.
The separation module 1040 is configured to input the semantic embedded representation and the speech embedded representation into a multi-modal speech separation model, and separate to obtain voice data of the target speaker.
The acquisition module 1020 is further configured to obtain a first speaker-embedded representation of the target speaker.
The separation module 1040 is further configured to input the first speaker embedded representation, the semantic embedded representation, and the speech embedded representation into a multi-modal speech separation model, and separate to obtain speech data of the target speaker.
The multi-modal speech separation model includes: a long short-term memory (LSTM) network and a fully connected layer.
The separation module 1040 is further configured to input the first speaker-embedded representation, the semantic embedded representation, and the speech embedded representation into the LSTM network to obtain a fused high-dimensional output.
The separation module 1040 is further configured to input the fused high-dimensional output to the fully connected layer to obtain spectrogram frequency data of the target speaker.
The separation module 1040 is further configured to perform an inverse Fourier transform on the spectrogram frequency data of the target speaker, so as to obtain the voice data of the target speaker.
The acquisition module 1020 is further configured to extract, in a case where the mouth shape variation data is acquired from the face region data, the speech embedded representation based on the audio frame data corresponding to the mouth shape variation time period.
The mouth shape variation time period is the time period corresponding to the mouth shape variation data.
The acquisition module 1020 is further configured to perform a Fourier transform on the audio frame data to obtain spectrogram frequency data of multiple speakers.
The acquisition module 1020 is further configured to input the spectrogram frequency data of the multiple speakers into the feature extraction network to extract the speech embedded representation.
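A minimal sketch of this audio branch is given below, assuming an arbitrary feed-forward feature extraction network on top of the magnitude spectrogram; the FFT parameters and layer sizes are illustrative assumptions only.

import torch
import torch.nn as nn

class AudioEmbedder(nn.Module):
    # Assumed feature extraction network: two linear layers that turn each
    # spectrogram frame of the multi-speaker mixture into a speech embedding.
    def __init__(self, n_freq=257, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_freq, 512), nn.ReLU(),
                                 nn.Linear(512, emb_dim), nn.ReLU())

    def forward(self, spec_mag):           # (B, T, F) magnitude spectrogram
        return self.net(spec_mag)          # (B, T, emb_dim) speech embedded representation

def speech_embedding(mixture: torch.Tensor, embedder: AudioEmbedder) -> torch.Tensor:
    # mixture: (B, samples) waveform containing multiple speakers.
    spec = torch.stft(mixture, n_fft=512, hop_length=128,
                      window=torch.hann_window(512), return_complex=True)
    mag = spec.abs().transpose(1, 2)       # (B, T, F) with F = 512 // 2 + 1 = 257
    return embedder(mag)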
The apparatus further comprises:
the evaluation module 1060 is configured to input the voice data of the target speaker to the voice evaluation model, and obtain a voice quality evaluation.
The evaluation module 1060 is further configured to generate a second speaker-embedded representation of the target speaker if the speech quality evaluation is greater than the evaluation threshold.
The evaluation module 1060 is further configured to update the first speaker-embedded representation based on the second speaker-embedded representation.
The evaluation module 1060 is further configured to calculate a distance between the first speaker-embedded representation and the second speaker-embedded representation.
The evaluation module 1060 is further configured to calculate a weighted sum of the first speaker-embedded representation and the second speaker-embedded representation to obtain a third speaker-embedded representation if the distance is greater than the first threshold and less than the second threshold.
The evaluation module 1060 is further configured to update the first speaker-embedded representation with the third speaker-embedded representation.
The evaluation module 1060 is further configured to identify a type of change of the voice status of the target speaker, where the type of change includes at least one of a long-term change and a short-term change.
The evaluation module 1060 is further configured to adjust the weight of the first speaker-embedded representation to be smaller than the weight of the second speaker-embedded representation in case the change type is a long-term change.
The evaluation module 1060 is further configured to adjust the weight of the first speaker-embedded representation to be greater than the weight of the second speaker-embedded representation in case the change type is a short-term change.
The evaluation module 1060 is further configured to calculate a weighted sum of the first speaker-embedded representation and the second speaker-embedded representation based on the adjusted weights of the first speaker-embedded representation and the second speaker-embedded representation.
The evaluation module 1060 is further configured to not update the first speaker-embedded representation, or update the first speaker-embedded representation using the second speaker-embedded representation, if the distance is less than or equal to the first threshold.
The evaluation module 1060 is further configured to output a prompt message that the separated voice data is not the voice data of the target speaker if the distance is greater than or equal to the second threshold.
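The change-type-dependent weighting performed by the evaluation module could, for example, be realized as sketched below; the concrete weight values are assumptions for illustration, and the application itself only requires that a long-term change weights the new embedding more heavily while a short-term change weights the original embedding more heavily.

import numpy as np

def fuse_speaker_embeddings(first_emb: np.ndarray, second_emb: np.ndarray,
                            change_type: str) -> np.ndarray:
    # Long-term changes (e.g. ageing) weight the newly generated embedding more;
    # short-term changes (e.g. a cold) weight the original embedding more.
    if change_type == "long_term":
        w_first, w_second = 0.3, 0.7       # assumed weight values
    else:                                  # "short_term"
        w_first, w_second = 0.7, 0.3
    fused = w_first * first_emb + w_second * second_emb
    return fused / np.linalg.norm(fused)   # renormalized third speaker embedding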
Fig. 11 shows a schematic structural diagram of a computer device according to an exemplary embodiment of the present application. Illustratively, the computer apparatus 1100 includes a central processing unit (Central Processing Unit, CPU) 1101, a system memory 1104 including a random access memory (random access memory, RAM) 1102 and a read-only memory (ROM) 1103, and a system bus 1105 connecting the system memory 1104 and the central processing unit 1101. The computer device 1100 also includes a basic input/output system 1106, which helps to transfer information between various devices within the computer, and a mass storage device 1107 for storing an operating system 1113, clients 1114, and other program modules 1115.
The basic input/output system 1106 includes a display 1108 for displaying information and an input device 1109, such as a mouse, keyboard, etc., for a user to input information. Wherein the display 1108 and input device 1109 are coupled to the central processing unit 1101 through an input/output controller 1110 coupled to the system bus 1105. The basic input/output system 1106 may also include an input/output controller 1110 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 1110 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1107 is connected to the central processing unit 1101 through a mass storage controller (not shown) connected to the system bus 1105. The mass storage device 1107 and its associated computer-readable media provide non-volatile storage for the computer device 1100. That is, the mass storage device 1107 may include a computer-readable medium (not shown) such as a hard disk or a compact disc read-only memory (CD-ROM) drive.
The computer readable medium may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid state memory technology, CD-ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to those described above. The system memory 1104 and mass storage device 1107 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 1100 may also operate by being connected to a remote computer on a network, such as the Internet. That is, the computer device 1100 may connect to the network 1112 through the network interface unit 1111 connected to the system bus 1105, or the network interface unit 1111 may be used to connect to other types of networks or remote computer systems (not shown).
An exemplary embodiment of the present application further provides a computer readable storage medium having at least one program stored therein, where the at least one program is loaded and executed by a processor to implement the voice separation method for multiple speakers provided in the above respective method embodiments.
An exemplary embodiment of the present application also provides a computer program product comprising at least one program, the at least one program being stored in a readable storage medium; a processor of a communication device reads the at least one program from the readable storage medium and executes it to cause the communication device to implement the voice separation method for multiple speakers provided by the above method embodiments.
An exemplary embodiment of the present application also provides a computer program comprising at least one program, the at least one program being stored in a readable storage medium; a processor of a communication device reads the at least one program from the readable storage medium and executes it to cause the communication device to implement the voice separation method for multiple speakers provided by the above method embodiments.
It should be understood that references herein to "a plurality" mean two or more. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that A exists alone, that A and B exist together, or that B exists alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the above storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing is merely illustrative of the present application and is not to be construed as limiting it; modifications made within the spirit and principles of the present application are intended to fall within its scope.

Claims (15)

1. A method for multi-speaker speech separation, the method comprising:
acquiring video frame data and audio frame data in a video;
acquiring face region data from the video frame data; acquiring mouth shape variation data from the face region data, and extracting a semantic embedded representation based on the mouth shape variation data;
extracting a speech embedded representation based on the audio frame data;
inputting the semantic embedded representation and the speech embedded representation into a multi-modal speech separation model, and separating to obtain voice data of a target speaker.
2. The method of claim 1, wherein the inputting the semantic embedded representation and the speech embedded representation into the multi-modal speech separation model, and separating to obtain the voice data of the target speaker, comprises:
acquiring a first speaker embedded representation of the target speaker;
inputting the first speaker embedded representation, the semantic embedded representation, and the speech embedded representation into the multi-modal speech separation model, and separating to obtain the voice data of the target speaker.
3. The method of claim 2, wherein the multi-modal speech separation model comprises: a long short-term memory (LSTM) network and a fully connected layer;
the inputting the first speaker embedded representation, the semantic embedded representation, and the speech embedded representation into the multi-modal speech separation model, and separating to obtain the voice data of the target speaker, comprises:
inputting the first speaker embedded representation, the semantic embedded representation, and the speech embedded representation into the LSTM network to obtain a fused high-dimensional output;
inputting the fused high-dimensional output to the fully connected layer to obtain spectrogram frequency data of the target speaker;
and performing an inverse Fourier transform on the spectrogram frequency data of the target speaker to obtain the voice data of the target speaker.
4. The method according to any one of claims 1 to 3, wherein the extracting a speech embedded representation based on the audio frame data comprises:
extracting the speech embedded representation based on audio frame data corresponding to a mouth shape variation time period in a case where the mouth shape variation data is acquired from the face region data;
wherein the mouth shape variation time period is a time period corresponding to the mouth shape variation data.
5. The method according to any one of claims 1 to 3, wherein the extracting a speech embedded representation based on the audio frame data comprises:
performing a Fourier transform on the audio frame data to obtain spectrogram frequency data of multiple speakers;
inputting the spectrogram frequency data of the multiple speakers into a feature extraction network to extract the speech embedded representation.
6. The method according to claim 2 or 3, wherein the method further comprises:
inputting the voice data of the target speaker to a voice evaluation model to obtain a voice quality evaluation;
generating a second speaker embedded representation of the target speaker if the voice quality evaluation is greater than an evaluation threshold;
updating the first speaker embedded representation based on the second speaker embedded representation.
7. The method of claim 6, wherein the updating the first speaker embedded representation based on the second speaker embedded representation comprises:
calculating a distance between the first speaker embedded representation and the second speaker embedded representation;
calculating a weighted sum of the first speaker embedded representation and the second speaker embedded representation to obtain a third speaker embedded representation when the distance is greater than a first threshold and less than a second threshold;
updating the first speaker embedded representation using the third speaker embedded representation.
8. The method of claim 7, wherein the calculating a weighted sum of the first speaker embedded representation and the second speaker embedded representation comprises:
identifying a type of change in the voice status of the target speaker, the type of change including at least one of a long-term change and a short-term change;
adjusting the weight of the first speaker embedded representation to be less than the weight of the second speaker embedded representation if the type of change is a long-term change;
adjusting the weight of the first speaker embedded representation to be greater than the weight of the second speaker embedded representation if the type of change is a short-term change;
and calculating a weighted sum of the first speaker embedded representation and the second speaker embedded representation based on the adjusted weight of the first speaker embedded representation and the adjusted weight of the second speaker embedded representation.
9. The method of claim 7, wherein the method further comprises:
and if the distance is less than or equal to the first threshold, not updating the first speaker embedded representation, or updating the first speaker embedded representation using the second speaker embedded representation.
10. The method of claim 7, wherein the method further comprises:
outputting prompt information that the separated voice data is not the voice data of the target speaker under the condition that the distance is larger than or equal to the second threshold value.
11. A speech separation apparatus for multiple speakers, the apparatus comprising:
the acquisition module is used for acquiring video frame data and audio frame data in the video;
the acquisition module is further used for acquiring face region data from the video frame data; acquiring mouth shape variation data from the face region data, and extracting a semantic embedded representation based on the mouth shape variation data;
the acquisition module is further used for extracting a speech embedded representation based on the audio frame data;
and the separation module is used for inputting the semantic embedded representation and the speech embedded representation into a multi-modal speech separation model, and separating to obtain voice data of the target speaker.
12. A computer device, the computer device comprising: a processor and a memory, the memory having stored therein at least one program loaded and executed by the processor to implement the speech separation method for multiple speakers according to any of claims 1 to 10.
13. A computer-readable storage medium, wherein at least one program is stored in the storage medium, the at least one program being loaded and executed by a processor to implement the speech separation method for multiple speakers according to any one of claims 1 to 10.
14. A computer program product, characterized in that the computer program product comprises at least one program, the at least one program being stored in a computer readable storage medium; a processor of a communication device reads the at least one program from the computer-readable storage medium, the processor executing the at least one program to cause the communication device to perform the speech separation method for multiple speakers according to any one of claims 1 to 10.
15. A computer program, characterized in that the computer program comprises at least one program, the at least one program being stored in a computer readable storage medium; a processor of a communication device reads the at least one program from the computer-readable storage medium, the processor executing the at least one program to cause the communication device to perform the speech separation method for multiple speakers according to any one of claims 1 to 10.
CN202310318722.0A 2023-03-28 2023-03-28 Voice separation method, device, equipment and storage medium for multiple speakers Pending CN116453539A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310318722.0A CN116453539A (en) 2023-03-28 2023-03-28 Voice separation method, device, equipment and storage medium for multiple speakers

Publications (1)

Publication Number Publication Date
CN116453539A true CN116453539A (en) 2023-07-18

Family

ID=87124846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310318722.0A Pending CN116453539A (en) 2023-03-28 2023-03-28 Voice separation method, device, equipment and storage medium for multiple speakers

Country Status (1)

Country Link
CN (1) CN116453539A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination