CN116863953A - Voice separation method, device, computer equipment and storage medium

Voice separation method, device, computer equipment and storage medium

Info

Publication number
CN116863953A
CN116863953A (application CN202310761319.5A)
Authority
CN
China
Prior art keywords
voice
audio
speaker
original audio
fragments
Prior art date
Legal status
Pending
Application number
CN202310761319.5A
Other languages
Chinese (zh)
Inventor
吕惟宁
李�昊
方帅
戴桢锦
Current Assignee
PICC Information Technology Co., Ltd.
Original Assignee
PICC Information Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by PICC Information Technology Co., Ltd.
Priority to CN202310761319.5A
Publication of CN116863953A


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/06: Decision making techniques; Pattern matching strategies

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to a voice separation method, apparatus, computer device and storage medium, and relates to the technical field of communication. The method comprises the following steps: acquiring original audio, wherein the original audio comprises voice information of at least two speakers; performing voice detection on the original audio to obtain voice-class segments in the original audio; performing voiceprint recognition and voice segmentation on the voice-class segments to obtain at least two voice segments in the original audio; and clustering the at least two voice segments to obtain the voice segments belonging to each speaker. By this method, the voice segments belonging to each speaker can be obtained even when the number of speakers is unknown, thereby achieving a better voice separation effect.

Description

Voice separation method, device, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of communication, in particular to a voice separation method, a voice separation device, computer equipment and a storage medium.
Background
In a multi-person voice interaction scenario, voice information of each speaker can be obtained by distinguishing the speakers.
In the related art, speakers are generally distinguished with the number of speakers known: adjacent voice segments are obtained according to dividing points determined by a set sliding window, the adjacent voice segments are input into a neural network to obtain their corresponding vector features, the distance value between the adjacent voice vectors is calculated, the sliding window is moved, and the above process is repeated; the voice signal is then divided at the dividing points corresponding to the local maxima between adjacent voice vectors, so as to obtain the voice segments of two speakers.
However, in the related art, the method can only realize the segmentation of the speech segments of two speakers, and has poor processing effect on the problem of "cocktail party", which refers to the problems of unknown number of speakers, overlapping of voices of multiple speakers, background noise, and the like.
Disclosure of Invention
The embodiment of the application provides a voice separation method, a device, computer equipment and a storage medium, which can obtain voice fragments belonging to each speaker under the condition of unknown speaker number, thereby achieving better voice separation effect.
In one aspect, a method for voice separation is provided, the method comprising:
acquiring original audio, wherein the original audio comprises voice information of at least two speakers;
performing voice detection on the original audio to obtain voice-class segments in the original audio;
performing voiceprint recognition and voice segmentation on the voice-class segments to obtain at least two voice segments in the original audio;
and clustering the at least two voice segments to obtain the voice segments belonging to each speaker.
In another aspect, there is provided a voice separation apparatus, the apparatus comprising:
The audio acquisition module is used for acquiring original audio, wherein the original audio comprises voice information of at least two speakers;
the voice detection module is used for performing voice detection on the original audio to obtain voice-class segments in the original audio;
the voice segmentation module is used for performing voiceprint recognition and voice segmentation on the voice-class segments to obtain at least two voice segments in the original audio;
and the clustering processing module is used for clustering the at least two voice segments to obtain the voice segments belonging to each speaker.
In one possible implementation, the apparatus further includes:
the voting module is used for inputting a target voice segment into at least two voting models, and obtaining voting results output by the at least two voting models respectively, wherein the voting results are used for indicating a predicted speaker of the target voice segment, and the target voice segment is any one of the at least two voice segments;
the identification determining module is used for determining speaker identification corresponding to the target voice segment based on voting results output by at least two voting models.
In one possible implementation, the apparatus further includes:
The first voice separation module is used for determining voice fragments belonging to each speaker based on a clustering result of clustering at least two voice fragments and speaker identifications corresponding to the at least two voice fragments.
In one possible implementation, the apparatus further includes:
the denoising module is used for denoising the original audio to obtain denoised original audio before voice detection is performed on the original audio to obtain voice-class segments in the original audio;
the voice detection module is used for performing voice detection on the denoised original audio to obtain the voice-class segments in the original audio.
In one possible implementation, the apparatus further includes:
the video acquisition module is used for acquiring an original video, wherein the original video is video information in an audio and video corresponding to the original audio;
the information extraction module is used for extracting acoustic information of the original audio and mouth shape amplitude information of the target speaker, wherein the acoustic information comprises audio phase information and audio amplitude information; the target speaker is a speaker with mouth shape information contained in the original video;
The amplitude denoising module is used for performing audio amplitude denoising on the audio amplitude information of the original audio based on the mouth shape amplitude information of the target speaker to obtain denoised audio amplitude information; the denoised audio amplitude information comprises the audio amplitude information of each voice segment of the target speaker;
the phase denoising module is used for performing audio phase denoising on the audio phase information based on the denoised audio amplitude information to obtain denoised audio phase information; the denoised audio phase information comprises the audio phase information of each voice segment of the target speaker;
and the second voice separation module is used for obtaining the voice segment of the target speaker based on the denoised audio amplitude information and the denoised audio phase information.
In one possible implementation manner, the second voice separation module is configured to determine voice segments belonging to each speaker based on a clustering result of clustering at least two voice segments and the voice segments of the target speaker; the number of the speakers corresponding to the original audio is greater than or equal to the number of the target speakers.
In one possible implementation, the apparatus further includes:
And the text generation module is used for generating text content corresponding to each speaker based on the voice segments belonging to each speaker and the target corpus corresponding to the original audio, after the voice segments belonging to each speaker are obtained.
In another aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory storing at least one computer program, the at least one computer program loaded and executed by the processor to implement the above-described speech separation method.
In another aspect, a computer readable storage medium having at least one computer program stored therein is provided, the computer program being loaded and executed by a processor to implement the above-described speech separation method.
In another aspect, a computer program product is provided, comprising at least one computer program loaded and executed by a processor to implement the speech separation method provided in the various alternative implementations described above.
The technical scheme provided by the application can comprise the following beneficial effects:
according to the voice separation method provided by the embodiment of the application, voice fragments are extracted from original audio comprising voice information of at least two speakers, voiceprint recognition and voice segmentation are carried out on the voice fragments to obtain at least two voice fragments, and clustering processing is carried out on the at least two voice fragments, so that the voice fragments belonging to the same speaker are obtained; by the method, the voice fragments belonging to each speaker can be obtained under the condition of unknown speaker number, so that a better voice separation effect is achieved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 illustrates a flow chart of a method of speech separation provided by an exemplary embodiment of the present application;
FIG. 2 illustrates a frame diagram of an x-vector algorithm provided by an exemplary embodiment of the present application;
FIG. 3 illustrates a flowchart of a voice separation method provided by an exemplary embodiment of the present application;
FIG. 4 is a flow chart illustrating a method of speech separation according to an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of voice separation over a deep network according to an exemplary embodiment of the present application;
fig. 6 shows a block diagram of a speech separation device according to an exemplary embodiment of the present application;
FIG. 7 is a block diagram of a computer device shown in accordance with an exemplary embodiment;
fig. 8 is a block diagram of a computer device, according to an example embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
As described above, the existing speech separation method can only realize the segmentation of the speech segments of two speakers, and cannot cope with the problem of "cocktail party", that is, cannot adapt to the speech separation scene where multiple speakers overlap and the number of speakers is unknown, resulting in poor speech separation effect. In view of this, an embodiment of the present application provides a voice separation method, fig. 1 shows a flowchart of the voice separation method according to an exemplary embodiment of the present application, where the voice separation method may be performed by a computer device, and as shown in fig. 1, the voice separation method may include the following steps:
step 110, obtaining original audio, wherein the original audio comprises voice information of at least two speakers.
In the embodiment of the application, the original audio can be audio information collected in a multi-person voice interaction scene, and illustratively, the original audio can be audio information collected from a multi-person online conference scene; the application does not limit the multi-person voice interaction scene.
Step 120, performing voice detection on the original audio to obtain a voice class segment in the original audio.
In the embodiment of the application, the computer device can extract voice-class segments and non-voice-class segments from the original audio by performing voice activity detection on the original audio; a voice-class segment refers to an audio segment containing a speaker's voice information; a non-voice-class segment refers to an audio segment that does not contain a speaker's voice information, such as background noise; voice activity detection (Voice Activity Detection, VAD, also called Speech Activity Detection, SAD) aims at detecting the presence or absence of a voice signal.
Illustratively, the computer device may perform voice activity detection through a voice activity detection module. The voice activity detection module can be constructed based on an LSTM (Long Short-Term Memory) network; it acquires the voice characteristics of each signal frame of the original audio and outputs a voice activity detection result for each signal frame, where the voice activity detection result indicates whether the frame is a voice frame or a non-voice frame, so that voice-class segments and non-voice-class segments are determined by combining the voice activity detection results of the signal frames; the voice activity detection module may be trained in advance based on voice frame samples and the voice activity labels corresponding to the voice frame samples.
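Illustratively, one possible implementation of such a frame-level voice activity detection module can be sketched as follows; this is only a minimal sketch, and the layer sizes, feature dimensions and frame hop are exemplary assumptions rather than values taken from the application:

```python
import torch
import torch.nn as nn

class LSTMVad(nn.Module):
    """Frame-level VAD: maps per-frame acoustic features to speech/non-speech scores."""
    def __init__(self, feat_dim=40, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 2)       # 2 classes: non-speech / speech

    def forward(self, feats):                  # feats: (batch, frames, feat_dim)
        out, _ = self.lstm(feats)
        return self.head(out)                  # (batch, frames, 2) per-frame logits

def frames_to_segments(labels, hop_s=0.01):
    """Merge consecutive speech frames (label 1) into (start, end) segments in seconds."""
    segments, start = [], None
    for i, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = i
        elif lab != 1 and start is not None:
            segments.append((start * hop_s, i * hop_s))
            start = None
    if start is not None:
        segments.append((start * hop_s, len(labels) * hop_s))
    return segments
```

The per-frame labels produced by the classifier are merged into voice-class segments by `frames_to_segments`, which mirrors the step of combining the voice activity detection results of the signal frames.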
Step 130, performing voiceprint recognition and voice segmentation on the voice-class segments to obtain at least two voice segments in the original audio.
Of the at least two speech segments, two adjacent speech segments belong to different speakers.
After the voice-class segments in the original audio are obtained, the feature vector of each signal frame in the voice-class segments is extracted, and voiceprint recognition and voice segmentation are performed on the voice-class segments based on the feature vectors of the signal frames.
The voice-class segments are a collection of voice segments corresponding to the voice information of the respective speakers in the original audio, and the voice segments of different speakers may be contiguous, overlapping, and so on. In the embodiment of the application, the computer device can perform voiceprint recognition and voice segmentation on the voice-class segments through a speaker embedding and segmentation module, where the speaker embedding and segmentation module can map voice signals into high-dimensional feature vectors and segment the input voice into a plurality of voice segments; illustratively, the speaker embedding and segmentation module may be constructed based on the x-vector algorithm, a voiceprint recognition algorithm.
The x-vector algorithm can divide the input voice-class segment into multiple sections, each of which is processed by a TDNN (Time-Delay Neural Network) to obtain its output vector; the output vectors of the sections are aggregated, and the voice segmentation result of the input segment is then predicted based on the aggregated feature vector. Schematically, fig. 2 shows a framework diagram of the x-vector algorithm provided by an exemplary embodiment of the present application; as shown in fig. 2, the x-vector algorithm is divided into three stages:
Stage one: the feature vectors of a first voice segment are input into a multi-layer frame-level TDNN network to obtain the output vector of each signal frame in the first voice segment; the first voice segment is a voice segment of any number of frames within a voice-class segment.
Stage two: the statistics pooling layer (statistics pooling) computes the mean and the standard deviation of the output vectors of the signal frames of the input voice segment, and then concatenates the mean and the standard deviation to obtain a sentence-level feature representation as the output of the statistics pooling layer.
Two fully connected layers and a softmax layer follow the statistics pooling layer, where the number of nodes in the softmax layer equals the number of speakers in the training set.
Stage three: an embedding operation is performed on the feature vector processed by the softmax layer, and the embedding result is input into a PLDA (Probabilistic Linear Discriminant Analysis) model to obtain the voice segmentation result output by the PLDA model.
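Illustratively, the frame-level TDNN layers, the statistics pooling layer and the segment-level layers can be sketched as follows; this is a minimal sketch in which the TDNN layers are approximated by dilated 1-D convolutions and the layer widths and number of training-set speakers are exemplary assumptions:

```python
import torch
import torch.nn as nn

class XVector(nn.Module):
    """Sketch of the x-vector pipeline: frame-level TDNN -> statistics pooling -> embedding."""
    def __init__(self, feat_dim=30, num_speakers=512):
        super().__init__()
        # Frame-level TDNN layers, realised here as dilated 1-D convolutions.
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Segment-level layers after statistics pooling (mean + std -> 3000 dims).
        self.fc1 = nn.Linear(3000, 512)          # the embedding is read here
        self.fc2 = nn.Linear(512, 512)
        self.out = nn.Linear(512, num_speakers)  # softmax over training-set speakers

    def forward(self, feats):                    # feats: (batch, frames, feat_dim)
        h = self.tdnn(feats.transpose(1, 2))     # (batch, 1500, frames')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # statistics pooling
        emb = self.fc1(stats)                    # segment-level embedding
        logits = self.out(torch.relu(self.fc2(torch.relu(emb))))
        return logits, emb
```

At inference the embedding `emb` is retained and scored with a PLDA back-end; the speaker classification head is only needed during training.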
Step 140, clustering the at least two voice segments to obtain the voice segments belonging to each speaker.
After a plurality of voice fragments are obtained, the voice fragments of the same category (namely the same speaker) can be classified into one category by clustering the plurality of voice fragments, so as to obtain the voice fragments belonging to each speaker.
In an embodiment of the present application, the computer device may cluster the at least two voice segments by a bottom-up agglomerative hierarchical clustering (Agglomerative Hierarchical Clustering, AHC) algorithm. In the clustering process, the computer device initially treats each voice segment as a cluster, calculates the similarity between clusters, and merges clusters according to their similarity, i.e., merges the two most similar clusters into one cluster; the similarity between the merged clusters is then computed again and clusters are merged again, and this process is repeated until all voice segments belong to one cluster. The inter-cluster similarity in the application can be obtained by calculating the feature similarity between voice segments, and the feature similarity can be computed with a Euclidean distance function; the application does not limit the similarity calculation method.
After the hierarchical clustering result is obtained, the computer device can use the clustering result corresponding to each hierarchy as a clustering result of at least two voice fragments based on a preset condition to determine the voice fragments belonging to each speaker; illustratively, the preset condition may be a setting of a hierarchy, for example, a clustering result of a target hierarchy is used as a clustering result of at least two voice segments; alternatively, the preset condition may be set for the number of clusters, for example, a clustering result when the difference between the number of clusters and the target number is smaller than the difference threshold is used as a clustering result for at least two voice segments, and the preset condition is not limited by the present application. Alternatively, the computer device may also send the hierarchical clustering result to the relevant person, who determines the speech segments belonging to the respective speakers based on the hierarchical clustering result.
Alternatively, the computer device may also cluster the at least two voice segments by other clustering means, such as K-means clustering, Gaussian mixture model (GMM) clustering, and the like.
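Illustratively, the bottom-up merging procedure described above can be sketched as follows; this is a minimal sketch assuming cosine similarity and average linkage, and the stopping threshold is an exemplary assumption rather than a value from the application:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def ahc(embeddings, stop_threshold=0.5):
    """Bottom-up clustering: each segment starts as its own cluster, and the two
    most similar clusters are merged until no pair exceeds stop_threshold."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean pairwise similarity between the two clusters.
                sim = np.mean([cosine(embeddings[a], embeddings[b])
                               for a in clusters[i] for b in clusters[j]])
                if sim > best:
                    best, pair = sim, (i, j)
        if best < stop_threshold:          # no sufficiently similar pair remains
            break
        i, j = pair
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters   # each inner list holds the segment indices of one speaker
```

Stopping at a similarity threshold is one concrete realisation of the preset condition on the hierarchy or the number of clusters mentioned above.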
In the embodiment of the application, the computer equipment can label each voice segment belonging to the same speaker through speaker identification so as to facilitate voice distinction.
In summary, according to the voice separation method provided by the embodiment of the application, voice segments are extracted from the original audio including the voice information of at least two speakers, voiceprint recognition and voice segmentation are performed on the voice segments to obtain at least two voice segments, and clustering is performed on the at least two voice segments, so that the voice segments belonging to the same speaker are obtained; by the method, the voice fragments belonging to each speaker can be obtained under the condition of unknown speaker number, so that a better voice separation effect is achieved.
In a possible implementation manner, in order to improve the voice separation effect, the embodiment of the application adds a voice separation auxiliary manner on the basis of the voice separation method shown in fig. 1, so as to improve the accuracy of voice separation. Fig. 3 shows a flowchart of a method for voice separation according to an exemplary embodiment of the present application, which may be performed by a computer device, which may be implemented as a server or a terminal, as shown in fig. 3, the method for voice separation comprising:
In step 310, original audio is obtained, where the original audio includes voice information of at least two speakers.
Step 320, performing voice detection on the original audio to obtain a voice class segment in the original audio.
Optionally, in order to improve the accuracy of the voice detection, before voice detection is performed on the original audio, the computer device may denoise the original audio to obtain denoised original audio, and then perform voice detection on the denoised original audio to obtain the voice-class segments in the original audio. Denoising the original audio may refer to noise reduction and dereverberation of the original audio to improve its quality. Illustratively, the computer device can perform voice noise reduction through a Transformer model and remove reverberation in the voice through the WPE (Weighted Prediction Error) algorithm. It should be noted that the computer device may also achieve the effect of voice noise reduction or dereverberation based on other possible noise reduction methods or dereverberation methods, which is not limited by the present application.
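Purely to illustrate where the denoising step sits before voice detection, the following sketch substitutes a much simpler spectral-subtraction front end for the Transformer-based noise reduction and WPE dereverberation named above; the window length and the assumption that the first half second is noise are illustrative only:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(wave, sr=16000, noise_seconds=0.5):
    """Crude denoising stand-in: estimate a noise spectrum from the first
    noise_seconds of audio and subtract it from the magnitude spectrogram."""
    f, t, spec = stft(wave, fs=sr, nperseg=512)
    mag, phase = np.abs(spec), np.angle(spec)
    hop = 512 // 2                                   # default hop is nperseg // 2
    noise_frames = max(1, int(noise_seconds * sr / hop))
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, 0.0)     # subtract the noise floor
    _, clean = istft(clean_mag * np.exp(1j * phase), fs=sr, nperseg=512)
    return clean
```

The denoised waveform would then be passed to the voice activity detection module in place of the original audio.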
Step 330, performing voiceprint recognition and voice segmentation on the voice-class segments to obtain at least two voice segments in the original audio.
Step 340, clustering the at least two voice segments to obtain the voice segments belonging to each speaker.
The relevant content of steps 310 to 340 may refer to the corresponding description of the embodiment shown in fig. 1, and will not be repeated here.
Step 350, inputting the target voice segment into at least two voting models, and obtaining voting results output by each of the at least two voting models, wherein the voting results are used for indicating a predicted speaker of the target voice segment, and the target voice segment is any one of the at least two voice segments.
In the embodiment of the application, in order to further improve the accuracy of voice separation, the computer equipment can also determine the speaker to which the voice fragment belongs based on the voting results of the voting models by inputting the voice fragment into the voting models.
The voting model may be obtained based on training of a training set, which may include a speech segment sample and a speaker tag corresponding to the speech segment sample; the corpus to which the training sets of different voting models belong can be different, that is, the sources of the training sets of different voting models can be different, so that the diversity and the applicability of the voting models are improved.
Step 360, determining speaker identification corresponding to the target voice segment based on the voting results output by the at least two voting models.
The speaker identification of the target voice segment is the speaker identification corresponding to the predicted speaker that receives the most votes.
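Illustratively, the majority-vote combination of the voting models can be sketched as follows; this is a minimal sketch assuming each voting model exposes a hypothetical `predict` method that returns a speaker label, and ties are broken arbitrarily:

```python
from collections import Counter

def vote_speaker(segment_features, voting_models):
    """Each model predicts a speaker for the segment; the label with the most
    votes becomes the segment's speaker identification."""
    votes = [model.predict(segment_features) for model in voting_models]
    label, _ = Counter(votes).most_common(1)[0]
    return label, votes

# Example: label, votes = vote_speaker(target_segment_features, [model_a, model_b, model_c])
```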
Step 370, determining the speech segments belonging to each speaker based on the clustering result of the clustering process for the at least two speech segments and based on the speaker identifications corresponding to the at least two speech segments.
And taking speaker identification determined based on voting results obtained by voting of at least two voice fragments through a plurality of voting models as auxiliary information, and integrating clustering results of the at least two voice fragments to jointly determine voice fragments belonging to each speaker so as to increase the reliability of voice separation and improve the accuracy of voice separation.
In summary, the embodiment of the present application provides a method for separating speech, which extracts speech segments from original audio including speech information of at least two speakers, performs voiceprint recognition and speech segmentation on the speech segments to obtain at least two speech segments, and performs clustering on the at least two speech segments to obtain speech segments belonging to the same speaker; by the method, the voice fragments belonging to each speaker can be obtained under the condition of unknown speaker number, so that a better voice separation effect is achieved.
Simultaneously, a plurality of voting models are introduced to predict the speaker of each voice segment, the speaker identification of each voice segment is determined based on the voting results of the plurality of voting models, and the voice segments belonging to each speaker are jointly determined by combining the clustering results of the voice segments, so that the accuracy and the reliability of voice separation are further improved.
In a possible application scenario, the voice separation method can be applied to an audio-video interaction scenario, at the moment, the original audio can be obtained, meanwhile, the original video can be obtained, based on the fact, the voice fragments of all speakers can be determined based on the audio information, and meanwhile, the accuracy of voice separation can be further improved by introducing the video information. Fig. 4 shows a flowchart of a voice separation method according to an exemplary embodiment of the present application, which may be performed by a computer device, which may be implemented as a server or a terminal, as shown in fig. 4, and includes:
in step 410, an original video is obtained, where the original video is video information in an audio/video corresponding to the original audio.
All or part of the speakers can be contained in the original video; when all speakers are contained in the original video, the number of clusters when the voice fragments are clustered in the embodiment shown in fig. 1 or fig. 3 can be determined in an auxiliary manner based on the number of speakers contained in the original video, so that the voice separation effect is improved; when the original video contains partial speakers, the lower limit of the number of clusters when the voice fragments are clustered in the embodiment shown in fig. 1 or fig. 3 can be determined in an auxiliary manner based on the number of speakers contained in the original video, so that the clustering result screening is facilitated.
Step 420, extracting acoustic information of the original audio and mouth shape amplitude information of the target speaker, wherein the acoustic information comprises audio phase information and audio amplitude information; the target speaker is a speaker with mouth shape information contained in the original video.
Since the original video may contain all or part of the speaker's mouth shape information, the computer device may extract the speaker's mouth shape amplitude information from the speaker's mouth shape information contained in the original video. The mouth shape amplitude information is used to indicate the mouth shape variation amplitude of the speaker in the adjacent video frame. In the embodiment of the application, the computer equipment can extract the mouth shape amplitude information of the speaker from the original video based on the 3D residual error network, and in the extraction process, the original video can be sampled through a sampling window according to the preset jump length so as to obtain the mouth shape amplitude information of the speaker; illustratively, if the frame rate of the video is 25fps (40 ms per frame), the preset skip length is 10ms, and the original video is sampled through a sampling window with a window length of 40ms at a sampling rate of 16KHz, so as to obtain the mouth shape amplitude information of the speaker in the original video.
In an embodiment of the application, the computer device may extract acoustic information from the original audio by short-time fourier transform (Short Time Fourier Transform, STFT), which may be characterized as amplitude and phase spectrograms.
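Illustratively, extracting the amplitude and phase spectrograms and aligning them with the video frames can be sketched as follows; this is a minimal sketch assuming 16 kHz audio, and the 40 ms window with a 10 ms hop mirrors the exemplary sampling described above:

```python
import numpy as np
from scipy.signal import stft

def acoustic_features(wave, sr=16000, win_ms=40, hop_ms=10):
    """Return the (magnitude, phase) spectrograms of the original audio."""
    nperseg = int(sr * win_ms / 1000)                # 640 samples = 40 ms window
    hop = int(sr * hop_ms / 1000)                    # 160 samples = 10 ms hop
    _, _, spec = stft(wave, fs=sr, nperseg=nperseg, noverlap=nperseg - hop)
    return np.abs(spec), np.angle(spec)

# With 25 fps video (one frame every 40 ms) and a 10 ms audio hop,
# each video frame spans 4 audio frames:
audio_frames_per_video_frame = 40 // 10              # = 4
```

The magnitude spectrogram is the audio amplitude information and the angle spectrogram is the audio phase information referred to above.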
Step 430, performing audio amplitude denoising on the audio amplitude information of the original audio based on the mouth shape amplitude information of the target speaker to obtain denoised audio amplitude information; the denoised audio amplitude information includes audio amplitude information for each speech segment of the target speaker.
Optionally, when the mouth shape amplitude information of the target speaker indicates that the mouth shape variation amplitude of the target speaker is greater than the amplitude threshold, the speaker is indicated to be in a speaking state, and at this time, the audio amplitude information of the corresponding original audio can be screened based on the mouth shape amplitude information of the target speaker, so as to obtain the audio amplitude information corresponding to the mouth shape amplitude information of the target speaker.
In the embodiment of the application, the computer device can implement the audio amplitude denoising based on the mouth shape amplitude information of the target speaker through an amplitude sub-network in the deep processing network, where the amplitude sub-network comprises a first residual network layer, a first feature fusion layer and a second residual network; the first residual network layer comprises a residual sub-network corresponding to the mouth shape amplitude information and a residual sub-network corresponding to the audio amplitude information; the first feature fusion layer fuses the two inputs together through linear projection and concatenation, and the fusion result of the first feature fusion layer is then convolved by the second residual network to obtain the predicted denoised audio amplitude information. A residual network layer may be a residual network composed of a plurality of convolution blocks; optionally, each convolution block contains a temporal convolution, the kernel width of the residual network layer may be 5, and the stride is 1; the number of convolution blocks in each residual network layer can be set based on actual requirements, which is not limited by the present application.
Step 440, performing audio phase denoising on the audio phase information based on the denoised audio amplitude information to obtain denoised audio phase information; the denoised audio phase information includes audio phase information for each speech segment of the target speaker.
After the denoised audio amplitude information is obtained, the computer device can input the denoised audio amplitude information and the audio phase information into a phase sub-network in the deep network to obtain the denoised audio phase information processed by the phase sub-network. The phase sub-network comprises a second feature fusion layer and a third residual network layer; after the denoised audio amplitude information and the audio phase information are input into the phase sub-network, the second feature fusion layer fuses the two inputs together through linear projection and concatenation, and the fusion result of the second feature fusion layer is then convolved by the third residual network to obtain the predicted denoised audio phase information.
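Illustratively, the fusion pattern shared by the amplitude and phase sub-networks (linear projection, concatenation, then residual temporal convolutions with kernel width 5 and stride 1) can be sketched as follows; channel sizes, the number of blocks and the output dimension are exemplary assumptions:

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """Temporal convolution block with a residual connection (kernel width 5, stride 1)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=5, padding=2)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.act(self.conv(x))

class FusionSubNetwork(nn.Module):
    """Projects two time-aligned feature streams to a common width, concatenates
    them, and refines the result with residual temporal convolutions."""
    def __init__(self, dim_a, dim_b, hidden=256, out_dim=257, blocks=3):
        super().__init__()
        self.proj_a = nn.Conv1d(dim_a, hidden, kernel_size=1)   # linear projection
        self.proj_b = nn.Conv1d(dim_b, hidden, kernel_size=1)
        self.blocks = nn.Sequential(*[ResidualConvBlock(2 * hidden) for _ in range(blocks)])
        self.out = nn.Conv1d(2 * hidden, out_dim, kernel_size=1)

    def forward(self, a, b):            # a, b: (batch, dim, frames), time-aligned
        fused = torch.cat([self.proj_a(a), self.proj_b(b)], dim=1)
        return self.out(self.blocks(fused))
```

In this sketch the amplitude sub-network would be a `FusionSubNetwork` fed with the mouth shape amplitude stream and the magnitude spectrogram, while the phase sub-network would be a second instance fed with the denoised magnitude and the phase spectrogram.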
In the embodiment of the present application, the depth network may be obtained based on an audio-video sample and an audio clip tag corresponding to the audio-video sample, where the audio clip tag is used to indicate an audio clip of each speaker in the audio-video sample.
Step 450, obtaining the voice segment of the target speaker based on the denoised audio amplitude information and the denoised audio phase information.
After the denoised audio amplitude information and the denoised audio phase information are obtained, the time-frequency spectrum formed by the denoised audio amplitude information and the denoised audio phase information is subjected to the inverse short-time Fourier transform (Inverse Short-Time Fourier Transform, ISTFT) to obtain denoised audio segments; at this time, the denoised audio segments are the audio segments corresponding to each speaker having mouth shape amplitude information in the original video, and the correspondence between the audio segments and the speakers having mouth shape amplitude information is established. In this way, the voice segments of some of the speakers in the audio-video scene can be extracted and voice separation performed.
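Illustratively, reassembling the denoised magnitude and phase into a complex spectrogram and inverting it can be sketched as follows; the window and hop parameters are exemplary assumptions and must match those used for the forward short-time Fourier transform:

```python
import numpy as np
from scipy.signal import istft

def reconstruct(denoised_mag, denoised_phase, sr=16000, nperseg=640, hop=160):
    """Rebuild the target speaker's waveform from denoised magnitude and phase."""
    spec = denoised_mag * np.exp(1j * denoised_phase)          # complex spectrogram
    _, wave = istft(spec, fs=sr, nperseg=nperseg, noverlap=nperseg - hop)
    return wave
```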
Fig. 5 shows a schematic diagram of voice separation through a deep network according to an exemplary embodiment of the present application. As shown in fig. 5, the deep network includes an amplitude sub-network 510 and a phase sub-network 520. The inputs of the amplitude sub-network 510 are the speaker's mouth shape amplitude information, extracted from the original video by a 3D residual network, and the audio amplitude information, extracted from the original audio by the short-time Fourier transform; its output is the denoised audio amplitude information. The inputs of the phase sub-network 520 are the denoised audio amplitude information output by the amplitude sub-network 510 and the audio phase information extracted from the original audio by the short-time Fourier transform; its output is the denoised audio phase information. Then, by performing the inverse short-time Fourier transform on the denoised audio amplitude information and the denoised audio phase information, the audio segments corresponding to the speakers having mouth shape amplitude information in the original video are obtained.
Step 460, determining the speech segments belonging to each speaker based on the clustering result of the clustering process of the at least two speech segments and the speech segments of the target speaker; the number of speakers corresponding to the original audio is greater than or equal to the number of target speakers.
The audio segments of a part of the speakers in the audio-video scene can be extracted through the depth network, and in one possible implementation manner, the computer device can use the prediction result of the depth network as auxiliary information to correct or supplement the clustering result of the clustering of at least two voice segments so as to jointly determine the voice segments of each speaker in the audio-video scene.
Furthermore, the computer device may also combine the prediction result of the deep network, the clustering result of the clustering process on at least two voice segments, and the speaker identification corresponding to each of the at least two voice segments determined based on the voting model, to comprehensively determine the voice segments of each speaker, so that the accuracy of voice separation can be improved through the combination of multiple voice separation modes.
In addition, the computer device may use the prediction results of the depth network as training samples for the voting model, i.e., when determining speech segments of some speakers from the prediction results of the depth network, a training set of the depth network may be constructed based on the speech segments of the speakers to train the voting model.
Furthermore, the computer device can recognise the human faces in the original video through face recognition technology to determine the number of speakers participating in the audio-video interaction, so as to assist in setting the number of clusters when voice separation is performed by clustering voice segments. In addition, the identity information of each speaker can be determined by combining the face recognition result, and the speaker identification can be replaced by the corresponding speaker identity information. Illustratively, the voice segments of speaker 1, the voice segments of speaker 2, and so on can be determined after voice classification is performed by clustering voice segments; after combining the face recognition result, the identity of speaker 1 can be confirmed to be the class monitor and the identity of speaker 2 to be the study committee member, and at this time the voice separation result can be determined as the voice segments of the class monitor, the voice segments of the study committee member, and so on.
In one possible implementation, after obtaining the speech segments of the respective speakers, the method further includes: text content corresponding to each speaker is generated based on the speech segments belonging to each speaker and the target corpus corresponding to the original audio.
Schematically, in some scenarios, in order to record audio or audio-video interaction content, the audio content needs to be converted into text content; at this time, the text content of each speaker can be obtained correspondingly by means of speech-to-text conversion. In the embodiment of the application, in order to improve the accuracy of speech-to-text conversion, a target corpus corresponding to the original audio can be consulted during the conversion; common terms in the business scenario corresponding to the original audio can be recorded in the target corpus. Similarity is calculated between the text content obtained by speech-to-text conversion and the target terms in the target corpus, and when the similarity is greater than a similarity threshold, the corresponding content in the text is replaced with the target term, thereby realising intelligent error correction of the speech-to-text output and improving the accuracy of speech-to-text conversion.
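Illustratively, the correction step can be sketched with a plain string-similarity measure from the standard library; the similarity threshold and the contents of the target corpus are exemplary assumptions:

```python
import difflib

def correct_with_corpus(words, target_corpus, threshold=0.8):
    """Replace a transcribed word with the closest corpus term when the
    similarity exceeds the threshold; otherwise keep the original word."""
    corrected = []
    for word in words:
        best_term, best_sim = word, 0.0
        for term in target_corpus:
            sim = difflib.SequenceMatcher(None, word, term).ratio()
            if sim > best_sim:
                best_term, best_sim = term, sim
        corrected.append(best_term if best_sim >= threshold else word)
    return corrected
```

In practice the similarity measure could equally be computed on embeddings of the transcribed text and the corpus terms, as the application does not limit the similarity calculation method.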
In summary, in the voice separation method provided by the embodiment of the application, in the audio-video interaction scene, the voice separation is assisted by combining the mouth shape amplitude information of the speaker in the video content, the voice amplitude information and the audio phase information are denoised based on the mouth shape amplitude information of the speaker, and the corresponding audio fragment of the speaker with the mouth shape amplitude information in the video content is determined based on the denoised audio amplitude information and the denoised audio phase information, so that the voice separation is realized, and the accuracy of the voice separation is improved. Furthermore, the audio fragments corresponding to the speakers are determined together by combining the clustering mode of the voice fragments, and the accuracy of voice separation under the condition of unknown number of the speakers can be further improved by combining multiple voice separation modes.
Fig. 6 shows a block diagram of a speech separation device according to an exemplary embodiment of the present application, which may implement all or part of the steps of the embodiments shown in fig. 1, 3 or 4, and which may include, as shown in fig. 6:
an audio obtaining module 610, configured to obtain an original audio, where the original audio includes voice information of at least two speakers;
The voice detection module 620 is configured to perform voice detection on the original audio to obtain a voice class segment in the original audio;
the voice segmentation module 630 is configured to perform voiceprint recognition and voice segmentation on the voice class segment to obtain at least two voice segments in the original audio;
the clustering module 640 is configured to perform clustering processing on at least two speech segments to obtain speech segments belonging to each speaker.
In one possible implementation, the apparatus further includes:
the voting module is used for inputting a target voice segment into at least two voting models, and obtaining voting results output by the at least two voting models respectively, wherein the voting results are used for indicating a predicted speaker of the target voice segment, and the target voice segment is any one of the at least two voice segments;
the identification determining module is used for determining speaker identification corresponding to the target voice segment based on voting results output by at least two voting models.
In one possible implementation, the apparatus further includes:
the first voice separation module is used for determining voice fragments belonging to each speaker based on a clustering result of clustering at least two voice fragments and speaker identifications corresponding to the at least two voice fragments.
In one possible implementation, the apparatus further includes:
the denoising module is used for denoising the original audio to obtain denoised original audio before voice detection is performed on the original audio to obtain voice-class segments in the original audio;
the voice detection module is used for performing voice detection on the denoised original audio to obtain the voice-class segments in the original audio.
In one possible implementation, the apparatus further includes:
the video acquisition module is used for acquiring an original video, wherein the original video is video information in an audio and video corresponding to the original audio;
the information extraction module is used for extracting acoustic information of the original audio and mouth shape amplitude information of the target speaker, wherein the acoustic information comprises audio phase information and audio amplitude information; the target speaker is a speaker with mouth shape information contained in the original video;
the amplitude denoising module is used for performing audio amplitude denoising on the audio amplitude information of the original audio based on the mouth shape amplitude information of the target speaker to obtain denoised audio amplitude information; the denoised audio amplitude information comprises the audio amplitude information of each voice segment of the target speaker;
The phase denoising module is used for performing audio phase denoising on the audio phase information based on the denoised audio amplitude information to obtain denoised audio phase information; the denoised audio phase information comprises the audio phase information of each voice segment of the target speaker;
and the second voice separation module is used for obtaining the voice segment of the target speaker based on the denoised audio amplitude information and the denoised audio phase information.
In one possible implementation manner, the second voice separation module is configured to determine voice segments belonging to each speaker based on a clustering result of clustering at least two voice segments and the voice segments of the target speaker; the number of the speakers corresponding to the original audio is greater than or equal to the number of the target speakers.
In one possible implementation, the apparatus further includes:
and the text generation module is used for generating text content corresponding to each speaker based on the voice segments belonging to each speaker and the target corpus corresponding to the original audio, after the voice segments belonging to each speaker are obtained.
In summary, according to the voice separation device provided by the embodiment of the application, the voice segments are extracted from the original audio including the voice information of at least two speakers, the voice segments are subjected to voiceprint recognition and voice segmentation to obtain at least two voice segments, and then the at least two voice segments are clustered, so that the voice segments belonging to the same speaker are obtained; by the method, the voice fragments belonging to each speaker can be obtained under the condition of unknown speaker number, so that a better voice separation effect is achieved.
Fig. 7 illustrates a block diagram of a computer device 700 in accordance with an exemplary embodiment of the present application. The computer device may be implemented as a call center or a back-end service node in the above-described scheme of the present application. The computer apparatus 700 includes a central processing unit (Central Processing Unit, CPU) 701, a system Memory 704 including a random access Memory (Random Access Memory, RAM) 702 and a Read-Only Memory (ROM) 703, and a system bus 705 connecting the system Memory 704 and the central processing unit 701. The computer device 700 also includes a mass storage device 706 for storing an operating system 709, application programs 710, and other program modules 711.
The computer readable medium may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid state memory technology, CD-ROM, digital versatile disks (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the ones described above. The system memory 704 and mass storage device 706 described above may be collectively referred to as memory.
According to various embodiments of the present disclosure, the computer device 700 may also operate through a network, such as the Internet, to a remote computer on the network. I.e. the computer device 700 may be connected to the network 708 through a network interface unit 707 connected to the system bus 705, or alternatively, the network interface unit 707 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further includes at least one instruction, at least one program, a code set, or an instruction set, where the at least one instruction, the at least one program, the code set, or the instruction set is stored in the memory, and the central processor 701 implements all or part of the steps in the speech separation method shown in the foregoing embodiments by executing the at least one instruction, the at least one program, the code set, or the instruction set.
Fig. 8 illustrates a block diagram of a computer device 800 in accordance with an exemplary embodiment of the present application. The computer device 800 may be implemented as the back-end service node described above, such as: smart phones, tablet computers, notebook computers, desktop computers, etc. The computer device 800 may also be referred to by other names of user devices, portable terminals, laptop terminals, desktop terminals, and the like.
In general, the computer device 800 includes: a processor 801 and a memory 802.
Processor 801 may include one or more processing cores, such as a 4-core processor, a 6-core processor, and the like. The processor 801 may be implemented in at least one hardware form of DSP (Digital Signal Processing ), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array ). The processor 801 may also include a main processor, which is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit ), and a coprocessor; a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 801 may integrate a GPU (Graphics Processing Unit, image processor) for taking care of rendering and rendering of the content that the display screen is required to display. In some embodiments, the processor 801 may also include an AI (Artificial Intelligence ) processor for processing computing operations related to machine learning.
Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 802 is used to store at least one instruction for execution by processor 801 to implement all or part of the steps in the speech separation method shown in the method embodiments of the present application.
In some embodiments, the computer device 800 may optionally further include: a peripheral interface 803, and at least one peripheral. The processor 801, the memory 802, and the peripheral interface 803 may be connected by a bus or signal line. Individual peripheral devices may be connected to the peripheral device interface 803 by buses, signal lines, or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 804, a display 805, a camera assembly 806, audio circuitry 807, and a power supply 808.
In some embodiments, the computer device 800 also includes one or more sensors 809. The one or more sensors 809 include, but are not limited to: acceleration sensor 810, gyro sensor 811, pressure sensor 812, optical sensor 813, and proximity sensor 814.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is not limiting and that more or fewer components than shown may be included or that certain components may be combined or that a different arrangement of components may be employed.
In an exemplary embodiment, there is also provided a computer readable storage medium having stored therein at least one computer program loaded and executed by a processor to implement all or part of the steps in the above-described speech separation method. For example, the computer readable storage medium may be Read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), compact disc Read-Only Memory (CD-ROM), magnetic tape, floppy disk, optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which comprises at least one computer program, which is loaded by a processor and which performs all or part of the steps of the speech separation method shown in any of the embodiments of fig. 1, 3 or 4 described above.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method of speech separation, the method comprising:
acquiring original audio, wherein the original audio comprises voice information of at least two speakers;
performing voice detection on the original audio to obtain voice-class segments in the original audio;
performing voiceprint recognition and voice segmentation on the voice-class segments to obtain at least two voice segments in the original audio;
and clustering the at least two voice segments to obtain the voice segments belonging to each speaker.
2. The method according to claim 1, wherein the method further comprises:
inputting a target voice segment into at least two voting models, and obtaining voting results output by the at least two voting models respectively, wherein the voting results are used for indicating a predicted speaker of the target voice segment, and the target voice segment is any one of the at least two voice segments;
And determining speaker identification corresponding to the target voice segment based on voting results output by at least two voting models.
3. The method according to claim 2, wherein the method further comprises:
based on the clustering result of the clustering processing of the at least two voice fragments, and based on the speaker identifications corresponding to the at least two voice fragments, determining the voice fragments belonging to each speaker.
4. The method of claim 1, wherein prior to performing voice detection on the original audio to obtain voice-class segments in the original audio, the method further comprises:
denoising the original audio to obtain denoised original audio;
the performing voice detection on the original audio to obtain voice-class segments in the original audio comprises:
performing voice detection on the denoised original audio to obtain the voice-class segments in the original audio.
5. A method according to any one of claims 1 to 3, wherein the method further comprises:
acquiring an original video, wherein the original video is video information in an audio and video corresponding to the original audio;
Extracting acoustic information of the original audio and mouth shape amplitude information of a target speaker, wherein the acoustic information comprises audio phase information and audio amplitude information; the target speaker is a speaker with mouth shape information contained in the original video;
performing audio amplitude denoising on the audio amplitude information of the original audio based on the mouth shape amplitude information of the target speaker to obtain denoised audio amplitude information; the denoised audio amplitude information comprises the audio amplitude information of each voice segment of the target speaker;
performing audio phase denoising on the audio phase information based on the denoised audio amplitude information to obtain denoised audio phase information; the denoised audio phase information comprises the audio phase information of each voice segment of the target speaker;
and obtaining the voice segment of the target speaker based on the denoised audio amplitude information and the denoised audio phase information.
6. The method of claim 5, wherein the method further comprises:
determining the voice fragments belonging to each speaker based on the clustering result of the clustering processing of at least two voice fragments and the voice fragments of the target speaker; the number of the speakers corresponding to the original audio is greater than or equal to the number of the target speakers.
7. The method of claim 1, wherein after obtaining the voice segments belonging to each speaker, the method further comprises:
text content corresponding to each speaker is generated based on the speech segments belonging to each speaker and the target corpus corresponding to the original audio.
8. A speech separation device, the device comprising:
the audio acquisition module is used for acquiring original audio, wherein the original audio comprises voice information of at least two speakers;
the voice detection module is used for performing voice detection on the original audio to obtain voice-class segments in the original audio;
the voice segmentation module is used for performing voiceprint recognition and voice segmentation on the voice-class segments to obtain at least two voice segments in the original audio;
and the clustering processing module is used for clustering the at least two voice segments to obtain the voice segments belonging to each speaker.
9. A computer device, characterized in that it comprises a processor and a memory, said memory storing at least one computer program, said at least one computer program being loaded and executed by said processor to implement the speech separation method according to any of claims 1 to 7.
10. A computer readable storage medium, characterized in that at least one computer program is stored in the computer readable storage medium, which computer program is loaded and executed by a processor to implement the speech separation method according to any of claims 1 to 7.
CN202310761319.5A, filed 2023-06-26: Voice separation method, device, computer equipment and storage medium. Status: Pending. Published as CN116863953A (en).

Priority Applications (1)

CN202310761319.5A: Voice separation method, device, computer equipment and storage medium (published as CN116863953A)

Applications Claiming Priority (1)

CN202310761319.5A: Voice separation method, device, computer equipment and storage medium (published as CN116863953A)

Publications (1)

CN116863953A (en), published 2023-10-10

Family

ID=88231430

Family Applications (1)

CN202310761319.5A, filed 2023-06-26: Voice separation method, device, computer equipment and storage medium (Pending; published as CN116863953A)

Country Status (1)

CN: CN116863953A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination