CN116364102A - Data processing method and device, equipment and storage medium - Google Patents

Data processing method and device, equipment and storage medium Download PDF

Info

Publication number
CN116364102A
Authority
CN
China
Prior art keywords
speakers
speaker
network
predicted
separation model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111627928.9A
Other languages
Chinese (zh)
Inventor
荣玉军
陈铭
刘辉
徐家伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Hangzhou Information Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202111627928.9A priority Critical patent/CN116364102A/en
Publication of CN116364102A publication Critical patent/CN116364102A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application discloses a data processing method, a device, equipment and a storage medium, which relate to the technical field of data processing and comprise the following steps: preprocessing a first audio frame in audio data to obtain a spectral feature corresponding to the first audio frame; inputting the spectral feature into a speaker separation model, and separating the first predicted speakers corresponding to the spectral feature through the speaker separation model to obtain L first predicted speakers; calculating a first loss value between the L first predicted speakers and the M real speakers by using a permutation invariance loss function; and reversely adjusting parameters in the speaker separation model at least based on the first loss value to obtain a converged speaker separation model. According to the method, the speaker separation model is trained with the permutation invariance loss function, so that the resulting speaker separation model can separate the voice data of any number of speakers, which broadens the application scenarios of the speaker separation model.

Description

Data processing method and device, equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, but not limited to, a data processing method and apparatus, a device, and a storage medium.
Background
In the related art, an end-to-end speaker separation model is mainly implemented as follows: the number of speakers is configured in the speaker separation model, and the parameters of the speaker separation model are established based on that number of speakers; the speaker separation model is then trained in an end-to-end manner, so that when voice data containing the configured number of speakers is input, the speaker separation model outputs the voice data corresponding to each speaker.
Therefore, the end-to-end speaker separation model in the related art can only be established when the number of speakers contained in the audio data is known in advance, and cannot process audio data containing an arbitrary number of speakers.
Disclosure of Invention
The application provides a data processing method, a device, equipment and a storage medium. The method trains a speaker separation model with a permutation invariance loss function, so that the resulting speaker separation model can separate the voice data of any number of speakers, which broadens the application scenarios of the speaker separation model.
The technical scheme of the application is realized as follows:
the embodiment of the application provides a data processing method, which comprises the following steps:
preprocessing a first audio frame in audio data to obtain a spectral feature corresponding to the first audio frame; the first audio frame includes voice data of M speakers; M is greater than or equal to 2;
inputting the spectral feature into a speaker separation model, and separating the first predicted speakers corresponding to the spectral feature through the speaker separation model to obtain L first predicted speakers; L is greater than or equal to 1;
calculating a first loss value between the L first predicted speakers and the M real speakers by using a permutation invariance loss function; if the first set is the same as the second set, the first loss value is equal to zero; if the first set is different from the second set, the first loss value is not equal to zero; the first set includes the L first predicted speakers, and the second set includes the M real speakers;
and reversely adjusting parameters in the speaker separation model at least based on the first loss value to obtain a converged speaker separation model.
The embodiment of the application provides a data processing device, which comprises:
the preprocessing unit is used for preprocessing a first audio frame in the audio data to obtain a spectral feature corresponding to the first audio frame; the first audio frame includes voice data of M speakers; M is greater than or equal to 2;
the speaker separation unit is used for inputting the spectral feature into a speaker separation model, and separating the first predicted speakers corresponding to the spectral feature through the speaker separation model to obtain L first predicted speakers; L is greater than or equal to 1;
a calculating unit for calculating a first loss value between the L first predicted speakers and the M real speakers using a permutation invariance loss function; if the first set is the same as the second set, the first loss value is equal to zero; if the first set is different from the second set, the first loss value is not equal to zero; the first set includes the L first predicted speakers, and the second set includes the M real speakers;
and the adjusting unit is used for reversely adjusting parameters in the speaker separation model at least based on the first loss value so as to obtain a converged speaker separation model.
The embodiment of the application also provides an electronic device, which comprises: a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor implements the above data processing method when executing the program.
The embodiment of the application also provides a storage medium, on which a computer program is stored, which when executed by a processor, implements the above-mentioned data processing method.
The data processing method, device, equipment and storage medium provided by the embodiment of the application comprise the following steps: preprocessing a first audio frame in audio data to obtain a spectral feature corresponding to the first audio frame; the first audio frame includes voice data of M speakers; M is greater than or equal to 2; inputting the spectral feature into a speaker separation model, and separating the first predicted speakers corresponding to the spectral feature through the speaker separation model to obtain L first predicted speakers; L is greater than or equal to 1; calculating a first loss value between the L first predicted speakers and the M real speakers by using a permutation invariance loss function; if the first set is the same as the second set, the first loss value is equal to zero; if the first set is different from the second set, the first loss value is not equal to zero; the first set includes the L first predicted speakers, and the second set includes the M real speakers; and reversely adjusting parameters in the speaker separation model at least based on the first loss value to obtain a converged speaker separation model. The speaker separation model is trained with the permutation invariance loss function. Because the permutation invariance loss only concerns the content of the output results, for example whether speaker A and speaker B are both present in the output, and does not concern the order among the speakers, that is, whether the output is speaker A then speaker B or speaker B then speaker A, the trained speaker separation model focuses on distinguishing different speakers. The speaker separation model is therefore not limited to a particular application scenario and can separate the voice data of any number of speakers.
Drawings
FIG. 1 is a schematic diagram of an alternative architecture of a data processing system according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of an alternative method for processing data according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of an alternative data processing method according to an embodiment of the present disclosure;
FIG. 4 is a schematic flow chart of an alternative data processing method according to an embodiment of the present disclosure;
FIG. 5 is a schematic flow chart of an alternative data processing method according to an embodiment of the present disclosure;
FIG. 6 is a schematic flow chart of an alternative data processing method according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of an alternative structure of a depth residual time dilation convolutional network according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an alternative configuration of a time-expanded convolution block provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of an alternative architecture of an attention network provided by embodiments of the present application;
FIG. 10 is a schematic diagram of an alternative architecture of a linear approximation global attention network provided by embodiments of the present application;
FIG. 11 is a schematic diagram of an alternative structure of a data processing apparatus according to an embodiment of the present application;
Fig. 12 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the specific technical solutions of the present application will be described in further detail below with reference to the accompanying drawings in the embodiments of the present application. The following examples are illustrative of the present application, but are not intended to limit the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second" and "third" are merely used to distinguish different objects and do not denote a specific ordering or precedence of the objects. It will be appreciated that "first", "second" and "third" may be interchanged in a specific order or sequence where appropriate, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
The embodiments of the application provide a data processing method, a data processing device, data processing equipment and a storage medium. In practical applications, the data processing method may be implemented by a data processing device, and each functional entity in the data processing device may be implemented cooperatively by the hardware resources of an electronic device (the data processing end), such as computing resources of a processor and communication resources (for example, resources supporting communication over optical cable, cellular networks, and other modes).
The data processing method provided by the embodiment of the application is applied to a data processing system, and the data processing system comprises a data processing end.
The data processing end is used for executing: preprocessing a first audio frame in audio data to obtain a spectral feature corresponding to the first audio frame; the first audio frame includes voice data of M speakers; M is greater than or equal to 2; inputting the spectral feature into a speaker separation model, and separating the first predicted speakers corresponding to the spectral feature through the speaker separation model to obtain L first predicted speakers; L is greater than or equal to 1; calculating a first loss value between the L first predicted speakers and the M real speakers by using a permutation invariance loss function; if the first set is the same as the second set, the first loss value is equal to zero; if the first set is different from the second set, the first loss value is not equal to zero; the first set includes the L first predicted speakers, and the second set includes the M real speakers; and reversely adjusting parameters in the speaker separation model at least based on the first loss value to obtain a converged speaker separation model.
Optionally, the data processing system may further comprise a client. The client is used for acquiring the audio data.
As an example, the structure of a data processing system may be as shown in FIG. 1, comprising: a data processing end 10 and a client end 20. Communication between the data processing side 10 and the client side 20 may be via a network 30.
Here, the data processing terminal 10 is configured to perform: preprocessing a first audio frame in audio data to obtain a spectral feature corresponding to the first audio frame; the first audio frame includes voice data of M speakers; M is greater than or equal to 2; inputting the spectral feature into a speaker separation model, and separating the first predicted speakers corresponding to the spectral feature through the speaker separation model to obtain L first predicted speakers; L is greater than or equal to 1; calculating a first loss value between the L first predicted speakers and the M real speakers by using a permutation invariance loss function; if the first set is the same as the second set, the first loss value is equal to zero; if the first set is different from the second set, the first loss value is not equal to zero; the first set includes the L first predicted speakers, and the second set includes the M real speakers; and reversely adjusting parameters in the speaker separation model at least based on the first loss value to obtain a converged speaker separation model.
The data processing terminal 10 may include a physical machine (e.g., a server, etc.), or an electronic device with a virtual machine (e.g., a cloud platform, etc.) having related data processing capabilities.
The client 20 is used to obtain audio data. In one example, a client user collects audio data for a plurality of speakers.
The client 20 may include electronic devices with audio processing capabilities, such as a microphone, a mobile phone, and the like.
The network 30 is used for communication between the data processing side 10 and the client side 20. In an example, the network 30 is used to send audio data collected by the client 20 to the data processing terminal 10. Wherein the network 30 may be a wired network, or a wireless network, etc.
Embodiments of a data processing method, apparatus, device, and storage medium according to the embodiments of the present application are described below with reference to the schematic diagram of a data processing system shown in fig. 1.
The embodiment of the application provides a data processing method, which is applied to a data processing device, wherein the data processing device can be deployed on an electronic device serving as a data processing end 10.
Fig. 2 is a schematic flow chart of an alternative data processing method, where the data processing method provided in the embodiment of the present application is used to create an end-to-end speaker separation model.
The data processing method may include, but is not limited to, S201 to S204 described below as shown in fig. 2.
S201, the data processing end preprocesses a first audio frame in the audio data to obtain a frequency spectrum characteristic corresponding to the first audio frame.
The first audio frame is a processing unit within the audio data. When the amount of audio data is large, the audio data can be divided into a plurality of first audio frames, and each of them is processed with the method for the first audio frame described in the embodiments of the application.
The number of frames in the first audio frame and the content of the first audio frame are not particularly limited and may be configured according to actual requirements. Illustratively, the number of frames in the first audio frame may be configured to be 20, i.e., the first audio frame consists of 20 consecutive frames of the audio data.
In one example, the first audio frame includes speech data of M speakers; m is greater than or equal to 2.
S201 may be implemented as: the data processing end converts the one-dimensional time domain information of the first audio frame into high-dimensional frequency domain information; and taking the high-dimensional frequency domain information as the frequency spectrum characteristic corresponding to the first audio frame.
The embodiment of the application does not limit the specific conversion mode for converting the one-dimensional time domain information into the high-dimensional frequency domain information, and can be configured according to actual requirements.
In one possible implementation, in the case that the number of frames in the first audio frame is 20, S201 may be implemented as follows: the data processing end performs the following processing on each audio frame in the first audio frame to obtain twenty 64-dimensional log-mel spectral features, and then splices the twenty 64-dimensional log-mel spectral features to obtain high-dimensional frequency domain information serving as the spectral feature.
Here, the processing performed by the data processing end on each audio frame in the first audio frame may be implemented as follows: the data processing end processes the one-dimensional time domain information of the audio frame using the Fourier transform and mel cepstral coefficient calculation, and extracts the 64-dimensional log-mel spectral feature of the audio frame.
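For illustration, a minimal Python sketch of this preprocessing step is given below. The sampling rate, FFT size and hop length are assumptions made for the example and are not specified by the embodiment; the use of torchaudio is likewise only one possible implementation.

```python
# Hypothetical sketch of S201: per-frame 64-dimensional log-mel features,
# spliced over 20 consecutive frames. Sampling rate, FFT size and hop length
# are assumptions, not values taken from the embodiment.
import torch
import torchaudio

SAMPLE_RATE = 16000          # assumed sampling rate
N_MELS = 64                  # 64-dimensional log-mel features, as in the text
FRAMES_PER_CHUNK = 20        # 20 consecutive frames form one "first audio frame"

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=400,               # assumed 25 ms analysis window
    hop_length=160,          # assumed 10 ms frame hop
    n_mels=N_MELS,
)

def spectral_features(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (num_samples,) mono audio -> (num_chunks, 20 * 64) spliced features."""
    spec = mel(waveform)                     # (64, T) mel power spectrum
    logmel = torch.log(spec + 1e-6).T        # (T, 64) log-mel, frames first
    T = (logmel.shape[0] // FRAMES_PER_CHUNK) * FRAMES_PER_CHUNK
    return logmel[:T].reshape(-1, FRAMES_PER_CHUNK * N_MELS)   # splice 20 frames

features = spectral_features(torch.randn(SAMPLE_RATE * 3))     # 3 s of dummy audio
print(features.shape)                                          # (num_chunks, 1280)
```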
S202, the data processing end inputs the spectral features to a speaker separation model, and the speaker separation model is used for separating first predicted speakers corresponding to the spectral features to obtain L first predicted speakers.
Wherein L is greater than or equal to 1. The L first predicted speakers are speakers in the first audio frame predicted by the speaker separation model.
S202 may be implemented as: the data processing end inputs the spectral features to a speaker separation model, analyzes the spectral features through the speaker separation model, predicts first predicted speakers corresponding to the spectral features through the analysis, and obtains L first predicted speakers.
It should be noted that L may be the same as M, or L may be different from M.
S203, the data processing end calculates a first loss value between the L first predicted speakers and the M real speakers by using a permutation invariance loss function.
The M real speakers are the speakers actually present in the first audio frame. The embodiment of the application does not limit the concrete representation of the M real speakers, which can be configured according to actual requirements. In one possible implementation, the M real speakers may be characterized by pre-configured labels. For example, u_{m,t} = 1 may be defined to indicate that speaker m speaks in the t-th frame (corresponding to the first audio frame), and u_{m,t} = 0 to indicate that speaker m does not speak in the t-th frame; the M real speakers can then be determined from the set of defined labels.
The permutation invariance loss function characterizes the loss between two sets of objects computed in an order-independent (combinatorial) manner. The specific permutation invariance loss function is not limited and can be configured according to actual requirements. Specifically, if the first set is the same as the second set, the first loss value is equal to zero; if the first set is different from the second set, the first loss value is not equal to zero; the first set includes the L first predicted speakers and the second set includes the M real speakers.
For example, if the first set is the same as the second set, the first loss value is equal to zero; if the first set contains the second set, the first loss value is greater than zero; and if the second set contains the first set, the first loss value is less than zero.
In an example, the permutation invariance loss function may be a permutation-invariant cross entropy loss function.
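As an illustration, the sketch below shows one common realization of such a permutation-invariant objective: the binary cross entropy between predicted and reference per-speaker activities is minimized over all speaker alignments. It assumes the number of predicted speakers equals the number of labeled speakers; the embodiment only requires the loss to vanish when the predicted and real speaker sets coincide, so the exact function used may differ.

```python
# Minimal sketch of a permutation-invariant (PIT-style) cross entropy loss.
# Assumption: L == M, i.e. the prediction and the labels cover the same number
# of speaker slots; only the best alignment contributes to the loss.
import itertools
import torch
import torch.nn.functional as F

def permutation_invariant_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred:   (T, S) predicted speaking probabilities per speaker slot
       target: (T, S) 0/1 reference labels (u_{m,t}) per real speaker"""
    num_spk = target.shape[1]
    losses = []
    for perm in itertools.permutations(range(num_spk)):
        losses.append(F.binary_cross_entropy(pred[:, list(perm)], target))
    return torch.stack(losses).min()         # order of speakers does not matter

pred = torch.rand(20, 2)                     # 20 frames, 2 predicted speakers
target = torch.randint(0, 2, (20, 2)).float()
print(permutation_invariant_loss(pred, target))
```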
S204, the data processing end reversely adjusts parameters in the speaker separation model at least based on the first loss value so as to obtain a converged speaker separation model.
The embodiment of the application does not limit the specific adjustment algorithm, which can be configured according to actual requirements. For example, it may be a basic back-propagation algorithm or a neural network back-propagation algorithm, etc.
Implementation of S204 may include, but is not limited to, embodiment 1 or embodiment 2 described below.
In embodiment 1, the data processing end reversely adjusts parameters in the speaker separation model based on the first loss value to obtain a converged speaker separation model.
Illustratively, in the case where the first loss value is greater than zero, the parameters in the speaker separation model are reversely adjusted along a first direction; in the case where the first loss value is smaller than zero, the parameters in the speaker separation model are reversely adjusted along a second direction; and in the case where the first loss value is equal to zero, no adjustment is made. The first direction is opposite to the second direction; for example, if the first direction is increasing, the second direction is decreasing.
In embodiment 2, the data processing end reversely adjusts parameters in the speaker separation model based on the first loss value and the second loss value to obtain a converged speaker separation model.
The specific implementation of embodiment 2 is described below in SB06 to SB08 and is not repeated here.
The data processing method provided by the embodiment of the application comprises the following steps: preprocessing a first audio frame in audio data to obtain a spectral feature corresponding to the first audio frame; the first audio frame includes voice data of M speakers; M is greater than or equal to 2; inputting the spectral feature into a speaker separation model, and separating the first predicted speakers corresponding to the spectral feature through the speaker separation model to obtain L first predicted speakers; L is greater than or equal to 1; calculating a first loss value between the L first predicted speakers and the M real speakers by using a permutation invariance loss function; if the first set is the same as the second set, the first loss value is equal to zero; if the first set is different from the second set, the first loss value is not equal to zero; the first set includes the L first predicted speakers, and the second set includes the M real speakers; and reversely adjusting parameters in the speaker separation model at least based on the first loss value to obtain a converged speaker separation model. The speaker separation model is trained with the permutation invariance loss function. Because the permutation invariance loss only concerns the content of the output results, for example whether speaker A and speaker B are both present in the output, and does not concern the order among the speakers, that is, whether the output is speaker A then speaker B or speaker B then speaker A, the trained speaker separation model focuses on distinguishing different speakers. The speaker separation model is therefore not limited to a particular application scenario and can separate the voice data of any number of speakers.
Next, the process in S202 of inputting the spectral feature into the speaker separation model and separating, through the speaker separation model, the first predicted speakers corresponding to the spectral feature to obtain the L first predicted speakers is described; it may include, but is not limited to, any one of the following embodiments A to D.
In embodiment A, the speaker separation model comprises a first convolution network and a first speaker recognition network, and the L first predicted speakers are obtained based on the first convolution network and the first speaker recognition network;
the speaker separation model in the embodiment B comprises a first convolution network, a first speaker identification network and a second convolution network, and L first predicted speakers are obtained based on the first convolution network, the first speaker identification network and the second convolution network;
the speaker separation model in the embodiment C comprises a first convolution network, a first speaker identification network and a global attention network, and L first predicted speakers are obtained based on the first convolution network, the first speaker identification network and the global attention network;
embodiment D, the speaker separation model includes a first convolution network, a first speaker identification network, a second convolution network, and a global attention network, based on which L first predicted speakers are obtained.
Next, a description will be given of the procedure in which the speaker separation model of embodiment A includes a first convolution network and a first speaker recognition network, and the L first predicted speakers are obtained based on the first convolution network and the first speaker recognition network. This process may include, but is not limited to, SA01 and SA02 shown in FIG. 3.
SA01, the data processing end inputs the spectrum characteristics to the first convolution network; and processing the spectrum characteristic through the first convolution network to obtain a first characteristic.
The specific network type and network structure of the first convolutional network in the embodiment of the present application are not limited, and may be configured according to actual requirements.
In an example, the first convolution network may include a depth residual time expansion convolution network. The depth residual time expansion convolution network comprises a plurality of time expansion convolution layers, and each time expansion convolution layer comprises a plurality of time expansion convolution blocks.
By way of example, SA01 may be implemented as: the data processing end inputs the spectrum characteristics to a first convolution network; and carrying out convolution processing on the spectrum characteristic through time expansion convolution blocks in a plurality of time expansion convolution layers in the first convolution network, so as to obtain a first characteristic.
The first feature is an abstract feature, and the embodiment of the application does not limit the concrete expression form of the first feature. In one example, the first feature is a multi-dimensional tensor.
And SA02, the data processing end inputs the first characteristics to the first speaker identification network, and the first characteristics are processed through the first speaker identification network to obtain the L first predicted speakers.
The specific network type and network structure of the first speaker identification network are not particularly limited, and the configuration can be performed according to actual requirements. In an example, the first speaker recognition network may include: full connectivity layer and sigmoid activation function.
SA02 may be implemented as: the data processing end inputs the first feature to a first speaker recognition network, processes the first feature through the first speaker recognition network, predicts first predicted speakers corresponding to the first feature, and accordingly obtains L first predicted speakers.
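A minimal sketch of embodiment A is given below: a small convolutional trunk stands in for the first convolution network and a fully connected layer with a sigmoid activation stands in for the first speaker recognition network. All layer sizes, the maximum number of speaker slots and the activation choices are assumptions for illustration only.

```python
# Illustrative sketch of embodiment A: a convolutional trunk (stand-in for the
# depth residual time expansion network) followed by a speaker recognition head
# made of a fully connected layer and a sigmoid. All layer sizes are assumptions.
import torch
import torch.nn as nn

class SpeakerSeparationSketch(nn.Module):
    def __init__(self, feat_dim: int = 1280, hidden: int = 256, max_speakers: int = 4):
        super().__init__()
        # first convolution network (simplified 1-D conv trunk over the chunk axis)
        self.trunk = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=2, dilation=2),
            nn.PReLU(),
        )
        # first speaker recognition network: fully connected layer + sigmoid
        self.head = nn.Sequential(nn.Linear(hidden, max_speakers), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, num_chunks, feat_dim) -> (batch, num_chunks, max_speakers)"""
        h = self.trunk(x.transpose(1, 2)).transpose(1, 2)   # first feature
        return self.head(h)                                 # speaking probabilities

model = SpeakerSeparationSketch()
probs = model(torch.randn(1, 50, 1280))
print(probs.shape)          # (1, 50, 4)
```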
Optionally, before executing SA01, the data processing method provided in the embodiment of the present application may further execute SA03 described below, adjust a network structure of the first convolutional network through SA03, and then execute SA01 based on the first convolutional network after the structure is adjusted.
SA03, the data processing end performs the following processing on each of the P time expansion convolution layers.
Wherein P is greater than or equal to 2.
The data processing end acquires the at least two time expansion convolution blocks included in the time expansion convolution layer, and performs a first process on each of the at least two time expansion convolution blocks. The first process includes: replacing the standard convolution in the time expansion convolution block with a piecewise convolution and a point convolution. In other words, the first process replaces the standard convolution calculation in the time expansion convolution block with a piecewise convolution calculation followed by a point convolution calculation.
Next, a description will be given of a procedure in which the speaker separation model in embodiment B includes a first convolution network, a first speaker identification network, and a second convolution network, and L first predicted speakers are obtained based on the first convolution network, the first speaker identification network, and the second convolution network.
Compared with embodiment A, embodiment B additionally includes the following SB01 and SB02, as shown in fig. 4.
SB01, the data processing end inputs the spectrum characteristic to the second convolution network; and processing the spectrum characteristic through the second convolution network to obtain a second characteristic.
For the specific implementation of SB01, reference may be made to the detailed description of SA01, in which the data processing end inputs the spectral feature into the first convolution network and processes the spectral feature through the first convolution network to obtain the first feature.
It should be noted that the second convolution network may be the same as the first convolution network, or the second convolution network may be different from the first convolution network.
SB02, the data processing end adds the second characteristic to the first characteristic, and the first characteristic after superposition is obtained.
The embodiment of the application does not limit a specific stacking mode, and can be configured according to actual requirements.
In one possible implementation manner, the data processing end may directly add the second feature and the first feature according to the corresponding bits, so as to obtain the first feature after superposition.
In another possible implementation manner, the data processing end may multiply the second feature by the third weight value to obtain a third product, multiply the first feature by the fourth weight value to obtain a fourth product, and then add the third product and the fourth product according to the corresponding bits to obtain the first feature after superposition.
The third weight value and the fourth weight value can be configured according to actual requirements. In an example, the third weight value may be 0.5 and the fourth weight value may be 0.5; in another example, the third weight value may be 0.6 and the fourth weight value may be 0.4.
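For illustration, the weighted superposition of SB02 can be sketched as follows; the 0.5/0.5 weights follow the example above and are configurable.

```python
# Sketch of SB02: weighted superposition of the second feature and the first
# feature. The 0.5 / 0.5 weights follow the example above; they are assumptions.
import torch

def superpose(first_feature: torch.Tensor, second_feature: torch.Tensor,
              w_first: float = 0.5, w_second: float = 0.5) -> torch.Tensor:
    return w_first * first_feature + w_second * second_feature   # element-wise add

superposed_first_feature = superpose(torch.randn(50, 256), torch.randn(50, 256))
```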
Correspondingly, in SA02, the implementation in which the data processing end inputs the first feature into the first speaker recognition network and processes the first feature through the first speaker recognition network to obtain the L first predicted speakers may include: inputting the superimposed first feature into the first speaker recognition network, and processing the superimposed first feature through the first speaker recognition network to obtain the L first predicted speakers.
In the data processing method provided for embodiment B, before SB01 is performed, that is, before the data processing end inputs the spectral feature into the second convolution network and processes it through the second convolution network to obtain the second feature, the second convolution network may be trained, and SB01 is then executed based on the trained second convolution network. As shown in fig. 4, the training process may include, but is not limited to, SB03 to SB05 described below.
SB03, the data processing end inputs the second characteristic to a second speaker recognition network, processes the second characteristic through the second speaker recognition network, and outputs N second predicted speakers.
Wherein the second speaker identification network may be the same as the first speaker identification network; alternatively, the second speaker identification network may be different from the first speaker identification network.
For the specific implementation of SB03, reference may be made to the detailed description of SA02, in which the data processing end inputs the first feature into the first speaker recognition network and processes the first feature through the first speaker recognition network to obtain the L first predicted speakers; details are not repeated here.
SB04, the data processing end calculates the second loss value between the N second predicted speakers and the M real speakers.
The calculation mode of the second loss value in the embodiment of the present application is not specifically limited, and may be configured according to actual requirements. In one example, second loss values between the N second predicted speakers and the M real speakers may be calculated by a binary cross entropy loss function.
SB05, the data processing end reversely adjusts the parameters in the second convolution network based on the second loss value, such that the first distance is greater than or equal to a first distance threshold.
The first distance is a distance between the first feature and the second feature; the speaker corresponding to the first characteristic is different from the speaker corresponding to the second characteristic.
The value of the first distance threshold is not limited, and the configuration can be performed according to actual requirements.
For the specific implementation of SB05, reference may be made to the detailed description of S204, in which the data processing end reversely adjusts the parameters in the speaker separation model based on at least the first loss value; details are not repeated here.
In addition, unlike S204, the purpose of the adjustment in SB05 is to make the first distance greater than or equal to the first distance threshold; in brief, the second convolution network is adjusted so that the distance between the second features it outputs for different speakers becomes larger.
Thus, the differences between the second features output by the trained second convolution network for different speakers are more pronounced, which improves the accuracy of the speaker separation model.
For embodiment B, where the speaker separation model includes a second convolutional network, the speaker separation model may also be co-trained by a permutation invariance loss and a speaker recognition loss, which may include, but is not limited to, SB06 to SB08 described below.
SB06, the data processing end obtains the second loss value.
The second loss value is the loss value between the N second predicted speakers and the M real speakers; wherein the N second predicted speakers are derived from the second characteristics output by the second convolutional network.
For the process of obtaining the N second predicted speakers, reference may be made to the detailed descriptions of SB01, in which the data processing end inputs the spectral feature into the second convolution network and processes it to obtain the second feature, and of SB03, in which the data processing end inputs the second feature into the second speaker recognition network and processes it to output the N second predicted speakers; details are not repeated here.
SB07, the data processing end determines the third loss value as the sum of the first product and the second product.
The first product is the result of multiplying the first loss value by the first weight value; the second product is the result of multiplying the second loss value by the second weight value.
The specific sizes of the first weight value and the second weight value are not limited, and the configuration can be carried out according to actual requirements. For example, the first weight value may be 0.5 and the second weight value may be 0.5; for another example, the first weight value may be 0.6; the second weight value may be 0.4.
SB07 may be implemented as: the data processing end multiplies the first loss value by a first weight value to obtain a first product; multiplying the second loss value by the second weight value to obtain a second product, and taking the sum of the first product and the second product as a third loss value.
SB08, the data processing end adjusts the parameter in the speaker separation model reversely based on the third loss value.
For the implementation of SB08, reference may be made to the detailed description of S204, in which the data processing end reversely adjusts the parameters in the speaker separation model based on at least the first loss value to obtain the converged speaker separation model; details are not repeated here.
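For illustration, SB06 to SB08 can be sketched as the weighted sum below, followed by an ordinary back-propagation step; the 0.5/0.5 weights follow the example above, and the optimizer call is an assumption about the training loop, not part of the embodiment.

```python
# Sketch of SB06-SB08: combine the permutation invariance loss (first loss value)
# and the speaker recognition loss (second loss value) into the third loss value.
import torch

def third_loss(first_loss: torch.Tensor, second_loss: torch.Tensor,
               w1: float = 0.5, w2: float = 0.5) -> torch.Tensor:
    return w1 * first_loss + w2 * second_loss        # weighted joint objective

# assumed training step (standard back-propagation):
# loss = third_loss(permutation_invariant_loss(pred, target), recognition_loss)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```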
Next, a description will be given of a procedure in which the speaker separation model of embodiment C includes a first convolution network, a first speaker recognition network, and a global attention network, and L first predicted speakers are obtained based on the first convolution network, the first speaker recognition network, and the global attention network. As shown in fig. 5, the process may include, but is not limited to, SC01 to SC03 described below.
SC01, the data processing end inputs M first audio frames into the first convolution network, and the M first audio frames are processed through the first convolution network to obtain M first features.
For the specific implementation of SC01, reference may be made to the detailed description of SA01, in which the data processing end inputs the spectral feature into the first convolution network and processes the spectral feature through the first convolution network to obtain the first feature; details are not repeated here.
SC02, the data processing end applies the following first formula to the M first features through the global attention network to obtain M new first features.
The first formula is: O = f(Q)(f(K)^T V);
wherein O represents the M new first features, Q represents the M first features, and K and V represent preset parameter tensors; f(·) is a nonlinear mapping parameterized by β, where β is greater than or equal to 0 and less than or equal to 1.
The data processing end optimizes the first characteristics through a global attention network by adopting a first formula to obtain new first characteristics; this can improve the accuracy of the speaker separation model.
SC03, the data processing end performs, for each new first feature of the M new first features: inputting the new first features to the first speaker recognition network, and processing the new first features through the first speaker recognition network to obtain the L first predicted speakers.
For the specific implementation of SC03, reference may be made to the detailed description of SA02, in which the data processing end inputs the first feature into the first speaker recognition network and processes the first feature through the first speaker recognition network to obtain the L first predicted speakers; details are not repeated here.
It should be noted that, for the specific implementation procedure of embodiment D, reference may be made to the specific descriptions of embodiments a to C, which are not described herein in detail.
The following describes a data processing method provided in the embodiment of the present application by taking a speaker separation model in a conference scenario as an example.
Speaker separation is an important content for voice collection and analysis in scenes such as conferences. At present, speaker recognition and separation methods mainly comprise a traditional method and a deep learning method.
The traditional method mainly comprises the following steps: the method comprises the steps of inputting voice data into a Gaussian mixture model-global background model (Gaussian Mixed Model-Universal Background Model, GMM-UBM), calculating to obtain voice features in the voice data through the GMM-UBM model, inputting the voice features into a clustering algorithm, and obtaining a speaker corresponding to the voice features through the clustering algorithm.
The deep learning-based method is specifically divided into two types:
the first implementation mainly includes: training a large amount of audio data to obtain a voiceprint sub-model, wherein the voiceprint sub-model can extract voiceprint characteristics according to the audio data; establishing a speaker recognition sub-model based on a clustering algorithm; the speaker recognition sub-model can obtain a corresponding speaker according to different voiceprint characteristics, and the voiceprint sub-model and the speaker recognition sub-model are used as speaker separation models.
The second implementation mainly comprises: configuring the number of speakers in the speaker separation model, and establishing parameters of the speaker separation model based on the number of speakers; and then training the speaker separation model by adopting an end-to-end training mode, so as to realize that the speaker separation model is output corresponding voice data of each speaker when the voice data of the speakers with the same number as the number of the speakers is input.
The related art has the following disadvantages:
for the conventional method, interference from the voice channel cannot be overcome, and its performance degrades as the amount of data increases;
for the first deep learning method, the result depends on both the voiceprint sub-model and the speaker recognition sub-model, so the stability is poor; moreover, the voiceprint sub-model and the speaker recognition sub-model have to be modeled independently, and the two separate modeling passes complicate the implementation;
for the second method of deep learning, a speaker separation model can be built only in the case that the number of people is known to be contained in audio, and audio data containing any number of people cannot be processed.
The embodiment of the application provides a data processing method, which has the following characteristics when separating speakers from audio data:
1. the end-to-end speaker separation of the audio data of the variable number of people can be realized;
2. constructing a speaker separation model by utilizing local and global context information, and improving the robustness and separation performance of the model;
3. the complexity of calculation is further reduced through the attention mechanism of linear approximation, and the processing capacity of long-time audio data is improved.
The data processing method provided in the embodiment of the present application will be described below by taking an end-to-end speaker separation model based on deep learning as an example. The method may include, but is not limited to, S1 to S6 described below.
S1, collecting original audio (corresponding to audio data) and extracting frequency domain features (corresponding to frequency spectrum features).
S2, extracting abstract features by using a residual time expansion convolution network (equivalent to the first convolution network or the second convolution network).
And S3, adding speaker recognition loss (corresponding to a second loss value) and performing joint training with the speaker separation model.
S4, inputting the abstract feature (corresponding to the first feature) with the local context information into a global attention module (corresponding to a global attention network) with lower computational complexity.
S5, jointly training the speaker separation model by using the permutation invariance loss (corresponding to the first loss value) and the speaker recognition loss (corresponding to the second loss value).
S6, performing network learning optimization with a cascade architecture using the permutation invariance loss, and finely optimizing the prediction result.
As shown in fig. 6, the implementation details of each part of the data processing method provided in the embodiment of the present application are as follows:
And step P10, collecting original audio data.
Wherein the original audio data is voice data of a plurality of persons.
And P20, extracting frequency domain features.
Because the collected original audio belongs to one-dimensional original time domain information, the one-dimensional original time domain information needs to be converted into high-dimensional frequency domain information as frequency domain characteristics.
Specifically: each frame of audio in the original audio data is first processed using the Fourier transform and mel cepstral coefficient calculation to extract its 64-dimensional log-mel spectral feature, and the spectral features of 20 adjacent frames of short-time speech are spliced to obtain high-dimensional frequency domain information serving as one frequency domain feature. The window is then shifted with a step of 4 frames to obtain a plurality of frequency domain features.
The frequency domain features extracted in step P20 are used as inputs for steps P30 and P40.
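For illustration, the splicing and 4-frame shift of step P20 can be sketched as a sliding window over the per-frame log-mel vectors, as below; the per-frame feature extraction itself follows the sketch given after S201.

```python
# Sketch of the splicing in step P20: stack 20 adjacent 64-dimensional log-mel
# frames into one high-dimensional feature and slide the window by 4 frames.
import numpy as np

def splice_frames(logmel: np.ndarray, context: int = 20, shift: int = 4) -> np.ndarray:
    """logmel: (T, 64) per-frame features -> (num_windows, 20 * 64) spliced features."""
    windows = [logmel[t:t + context].reshape(-1)
               for t in range(0, logmel.shape[0] - context + 1, shift)]
    return np.stack(windows)

spliced = splice_frames(np.random.randn(200, 64))
print(spliced.shape)         # (46, 1280)
```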
And step P30, using the depth residual time expansion convolution network to process the frequency domain features and extract the corresponding abstract features.
The structure of the depth residual time expansion convolution network (equivalent to the first convolution network) is shown in fig. 7. After the input is obtained, the depth residual time expansion convolution network performs layer batch normalization, then a point convolution operation, then processes the data through each time expansion convolution layer, and finally adds the outputs of the time expansion convolution layers and outputs the sum. Each layer of the depth residual time expansion convolution network is formed by time expansion convolution blocks, and the dilation factor D grows exponentially across the convolution blocks of each layer so as to ensure a sufficiently large temporal context window. The temporal context window is related to the number of preceding and following related frames.
N in fig. 7 represents the N time expansion convolution blocks within each group of convolution blocks.
It should be noted that the input of each time expansion convolution block is zero-padded correspondingly, so as to ensure that the output length is the same as the input length.
Fig. 8 is a block diagram of a time-expanded convolution block. After the input is obtained, the time expansion convolution block respectively performs processing of point convolution, nonlinear activation function, normalization, time expansion convolution, nonlinear activation function, normalization, point convolution and point convolution to generate output information and jump connection information.
Referring to fig. 7 and 8, a first output of one time-expanded convolution block (the output in fig. 8) serves as the input to the next time-expanded convolution block, and a second output of one time-expanded convolution block (the jump connection in fig. 8) serves as part of the lateral output in fig. 7.
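A minimal PyTorch sketch of one time expansion convolution block, following the sequence described for fig. 8 (point convolution, nonlinear activation, normalization, dilated depthwise convolution, activation, normalization, and two point convolutions producing the output and the skip connection), is given below. Channel sizes and the PReLU/GroupNorm choices are assumptions.

```python
# Illustrative sketch of the time expansion convolution block of fig. 8:
# 1x1 conv -> PReLU -> norm -> dilated depthwise conv -> PReLU -> norm ->
# two 1x1 convs giving the residual output and the skip connection.
# Channel sizes and the choice of PReLU/GroupNorm are assumptions.
import torch
import torch.nn as nn

class TimeDilatedBlock(nn.Module):
    def __init__(self, channels: int = 128, hidden: int = 256,
                 kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2          # keep output length == input length
        self.pre = nn.Sequential(
            nn.Conv1d(channels, hidden, 1),              # point convolution
            nn.PReLU(), nn.GroupNorm(1, hidden),
        )
        self.dconv = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size, padding=pad,
                      dilation=dilation, groups=hidden), # depthwise (piecewise) conv
            nn.PReLU(), nn.GroupNorm(1, hidden),
        )
        self.res_out = nn.Conv1d(hidden, channels, 1)    # residual output
        self.skip_out = nn.Conv1d(hidden, channels, 1)   # skip connection

    def forward(self, x: torch.Tensor):
        h = self.dconv(self.pre(x))
        return x + self.res_out(h), self.skip_out(h)     # (output, skip)

block = TimeDilatedBlock(dilation=4)
out, skip = block(torch.randn(1, 128, 100))
print(out.shape, skip.shape)                             # both (1, 128, 100)
```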
To further reduce the number of parameters, the standard convolution inside each time-expanded convolution block is replaced with a piecewise separable convolution (F-conv ()).
The piecewise separable convolution decouples the standard convolution into two sequential operations, namely a piecewise convolution (D-conv ()) and a point convolution (1 x 1-conv ()).
The piecewise convolution can be expressed as the following equation (1):
D-conv(Y, K) = concat(y_j ⊛ k_j), j = 1, …, G    equation (1);
wherein Y represents the input of the piecewise convolution, Y ∈ R^(G×M); K represents the convolution kernel, K ∈ R^(G×P); concat represents the splicing (concatenation) calculation; y_j ∈ R^(1×M) and k_j ∈ R^(1×P) are the j-th rows of Y and K, respectively; M and P are the numbers of columns of Y and K, and G is the number of rows; ⊛ denotes the convolution operation.
The piecewise separable convolution can be expressed as the following equation (2):
F-conv(Y, K, L) = D-conv(Y, K) ⊗ L    equation (2);
wherein Y represents the input of the piecewise separable convolution, Y ∈ R^(G×M); K represents the convolution kernel, K ∈ R^(G×P); ⊗ represents the tensor product calculation; and L represents a parameter vector of sequence length L.
In brief, the piecewise convolution (D-conv()) operation convolves each row of the input with the corresponding row of the kernel matrix, while the point convolution (1×1-conv()) linearly transforms the feature space, i.e. it raises or lowers the number of feature channels and computes the linear combination of information across channels.
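The parameter saving brought by the decomposition of equations (1) and (2) can be illustrated with the short sketch below, which compares a standard 1-D convolution with a depthwise (piecewise) convolution followed by a 1×1 point convolution; the channel sizes are arbitrary assumptions.

```python
# Sketch of the decomposition in equations (1)-(2): a standard convolution is
# replaced by a depthwise ("piecewise") convolution followed by a 1x1 point
# convolution. Channel sizes are arbitrary; only the parameter ratio matters.
import torch.nn as nn

G, H, P = 256, 512, 3                                   # in/out channels, kernel size

standard = nn.Conv1d(G, H, P)                           # full convolution
depthwise = nn.Conv1d(G, G, P, groups=G)                # D-conv: one kernel per channel/row
pointwise = nn.Conv1d(G, H, 1)                          # 1x1-conv: linear channel mixing

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))                                  # 256*512*3 + 512 = 393728
print(count(depthwise) + count(pointwise))              # 1024 + 131584   = 132608
```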
And step P40, using the depth residual time expansion convolution network to process the frequency domain features and extract the corresponding abstract features.
Unlike P30, in P40, the depth residual time-expanded convolutional network (corresponding to the second convolutional network) needs to be trained through the speaker recognition network and speaker recognition loss, so that the difference between the extracted abstract features is more obvious for different speakers.
Wherein the speaker recognition loss can be obtained by the formula (3).
Loss = Σ_t Σ_m BCE(u_{m,t}, û_{m,t})    equation (3);
where Loss represents the speaker recognition loss; BCE represents the binary cross entropy loss calculation; u_{m,t} is the true value indicating whether speaker m speaks in the t-th frame; and û_{m,t} is the predicted value indicating whether speaker m speaks in the t-th frame.
The training process may include: for each input abstract feature, identifying the corresponding speaker through the speaker recognition network; then calculating the speaker recognition loss through the speaker recognition loss function and adjusting the deep residual time-expanded convolutional network with the loss value, so that the abstract features extracted for different speakers differ more markedly.
A speaker label vector u_t ∈ R^N is defined, where R^N indicates that the label vector is N-dimensional and N is the number of speakers in the training set. If speaker m speaks in the t-th frame, u_{m,t} = 1; if speaker m does not speak in the t-th frame, u_{m,t} = 0. If several speakers speak within the same abstract feature, the corresponding entries u_{m,t} are all set to 1.
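A minimal sketch of how the label vectors u_t and the speaker recognition loss of equation (3) could be assembled; the tensor shapes, the random stand-in predictions, and the summed binary cross-entropy are assumptions.

```python
import torch
import torch.nn.functional as F

N, T = 4, 200                      # speakers in the training set and frames (illustrative)

# Speaker label matrix: u[m, t] = 1 if speaker m speaks in frame t, otherwise 0.
u = torch.zeros(N, T)
u[0, 10:80] = 1.0
u[1, 60:150] = 1.0                 # frames 60-79 therefore contain overlapped speech

# Stand-in for the network's per-frame, per-speaker speaking probabilities.
u_hat = torch.sigmoid(torch.randn(N, T))

# Equation (3): binary cross-entropy accumulated over speakers and frames.
speaker_recognition_loss = F.binary_cross_entropy(u_hat, u, reduction="sum")
```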
Step P50: the outputs of steps P30 and P40 are combined and then used as the input of the linear-approximation global attention network.
In one possible implementation, the abstract features extracted in P30 and the abstract features extracted in P40 are combined by addition, and the combined result (corresponding to the superimposed first feature) is used as the input of the linear-approximation global attention network.
The structure of the attention network can be as shown in Fig. 9: for Q, K and V it mainly performs a tensor dot product, scaling, a tensor mask, a softmax calculation and a second tensor dot product; specifically, it can be expressed as equation (4):

$$O = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (4)$$

where O is the output of the attention network; softmax denotes the softmax calculation; Q is the given query vector (corresponding to the combined feature vector), K is the key vector (corresponding to the score of the feature vector), and V is the value vector (corresponding to a mapping of the combined feature vector); d_k is the dimension of the key vector; and T denotes the transpose operation.
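A minimal sketch of equation (4); the shapes of Q, K and V, the boolean mask convention, and the unbatched layout are assumptions.

```python
import math
import torch

def softmax_attention(Q, K, V, mask=None):
    """Equation (4): O = softmax(Q K^T / sqrt(d_k)) V.
    Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v), mask: optional (T_q, T_k) booleans."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # tensor dot product and scaling
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))  # tensor mask
    weights = torch.softmax(scores, dim=-1)                # softmax calculation
    return weights @ V                                     # second tensor dot product
```

The explicit T_q × T_k score matrix built here is what ties the memory and computation of this form of attention to the input sequence length.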
Because the memory footprint and computational complexity of the self-attention model depend on the length of the input sequence, the softmax-based calculation becomes increasingly expensive and complex to implement as the input sequence grows longer.
To avoid excessive resource usage, embodiments of the present application employ a linearly approximated full-attention calculation. The structure of the linear-approximation global attention network is shown in Fig. 10: for Q, K and V it mainly performs linear mapping, scaling, a tensor mask, a first tensor dot product and a second tensor dot product; specifically, it can be expressed as equation (5).
$$O = f(Q)\big(f(K)^{T} V\big) \qquad (5)$$

where O is the output of the attention network; Q is the given query vector (corresponding to the combined feature vector), K is the key vector (corresponding to the score of the feature vector), and V is the value vector (corresponding to a mapping of the combined feature vector).
f(x) can be expressed as equation (6): an element-wise nonlinear mapping with a parameter β, 0 ≤ β ≤ 1, where β is used to control the saturation value of the portion where the input x is less than zero.
The linearly approximated full-attention network occupies less memory and fewer computing resources and is simpler to implement.
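A sketch of the linearly approximated computation of equation (5). Only the role of β is specified above (0 ≤ β ≤ 1, controlling the negative-side saturation), so the feature map f below uses an assumed ELU-style form, f(x) = x for x ≥ 0 and β(e^x − 1) for x < 0; the actual mapping is the one given by equation (6).

```python
import torch

def f(x: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
    """Assumed ELU-style mapping: linear for x >= 0, saturating towards -beta for x < 0."""
    return torch.where(x >= 0, x, beta * (torch.exp(x) - 1.0))

def linear_attention(Q, K, V, beta: float = 0.5) -> torch.Tensor:
    """Equation (5): O = f(Q) (f(K)^T V)."""
    context = f(K, beta).transpose(-2, -1) @ V   # first tensor dot product: (d_k, d_v)
    return f(Q, beta) @ context                  # second tensor dot product: (T_q, d_v)
```

Because f(K)^T V is computed first, the intermediate result has shape (d_k, d_v) and its size does not grow with the square of the sequence length.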
Step P60: the output result is obtained through the speaker separation layer.
A fully connected layer and a sigmoid activation function are used to construct the speaker separation layer (equivalent to a speaker recognition network).
The speaker separation layer processes the output of the linear-approximation full-attention network to obtain the final output Y ∈ {0,1}^{N×T}, where N is the maximum number of speakers and T is the speaking time. y_{n,t} = 1 indicates that speaker n speaks at time t, and y_{n,t} = 0 indicates that speaker n does not speak at time t.
In segments where speech overlap occurs, Σ_n y_{n,t} > 1.
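A minimal sketch of the speaker separation layer described above; the 0.5 decision threshold and the (time, feature) input layout are assumptions.

```python
import torch
import torch.nn as nn

class SpeakerSeparationLayer(nn.Module):
    """Fully connected layer + sigmoid, thresholded to Y in {0,1}^(N x T)."""

    def __init__(self, feature_dim: int, max_speakers: int):
        super().__init__()
        self.fc = nn.Linear(feature_dim, max_speakers)

    def forward(self, attention_out: torch.Tensor) -> torch.Tensor:
        # attention_out: (T, feature_dim) output of the linear-approximation attention network
        probs = torch.sigmoid(self.fc(attention_out))   # (T, N) speaking probabilities
        return (probs > 0.5).int().t()                  # (N, T); overlapped frames sum to > 1
```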
Step P70: the speaker separation model is trained with the permutation-invariance loss.
To enable the end-to-end speaker separation model used in embodiments of the present application to handle different arrangements between the prediction results and the real labels, the speaker separation model is trained using a permutation-invariant cross-entropy loss.
The permutation-invariant cross-entropy loss can be obtained by equation (7):

$$\mathrm{Loss} = \min_{\pi}\sum_{n=1}^{N}\sum_{t=1}^{T} \mathrm{CE}\big(y_{n,t},\, \hat{y}_{\pi(n),t}\big) \qquad (7)$$

where CE(y_{n,t}, ŷ_{π(n),t}) represents the loss between the real label and the predicted label, and π(n) represents the combined index, i.e., π(n) refers to a specific speaker irrespective of the speaking order.

Denoting ŷ_{π(n),t} in equation (7) by ŷ and y_{n,t} by y, the term CE(y, ŷ) in equation (7) can be obtained by equation (8):

$$\mathrm{CE}(y, \hat{y}) = -\big(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big) \qquad (8)$$

where CE represents the cross-entropy loss and log represents the logarithm operation.
The specific training process includes: calculating the loss value between the real labels and the predicted labels, and optimizing the parameters in the speaker separation model with the goal of reducing the loss through the standard back-propagation algorithm.
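A sketch of the training criterion of equations (7)–(8): every permutation of the predicted speaker channels is scored with binary cross-entropy against the real labels and the minimum is kept. Enumerating all N! permutations, as done here, is an assumption that is practical only for a small maximum number of speakers.

```python
import itertools
import torch
import torch.nn.functional as F

def permutation_invariant_loss(y_hat: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    """Equations (7)-(8): minimum over speaker permutations of the summed cross-entropy.
    y_hat: (N, T) predicted speaking probabilities; y_true: (N, T) binary real labels."""
    n = y_true.shape[0]
    losses = []
    for perm in itertools.permutations(range(n)):
        permuted = y_hat[list(perm)]                      # \hat{y}_{pi(n), t}
        losses.append(F.binary_cross_entropy(permuted, y_true, reduction="sum"))
    return torch.stack(losses).min()

# One optimization step, given predictions and labels from the surrounding training code:
# loss = permutation_invariant_loss(predictions, labels)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The returned value can be back-propagated directly, since the minimum selects a single differentiable permutation term.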
The embodiment of the application has the following characteristics:
First, a deep residual time-expanded convolutional network is used to extract abstract features from local context information, the standard convolution is optimized into a piecewise separable convolution, and the abstract features are fed to a speaker recognition network; the speaker recognition network and the speaker separation model are trained jointly, so that the speaker separation model can distinguish different speakers.
Second, a linearly approximated full-attention computation is proposed, which reduces the memory and computing resources consumed by the attention network, improves the speaker separation model's ability to process long-duration voice data, and allows a larger attention model to be adopted.
Third, an end-to-end speaker separation modeling method is proposed that does not require the number of speakers in the audio data to be known in advance, so voice data containing any number of speakers can be processed.
The embodiment of the application has the following technical effects:
First, the end-to-end model can process voice data containing any number of speakers without an additional clustering step.
Second, the linearly approximated full-attention network can process long-duration voice data such as conference recordings while reducing the memory and computation load.
Third, joint training of the speaker recognition network and speaker separation is added to the model, which improves the speaker separation model's ability to distinguish the characteristics of different speakers and improves the robustness of the model.
Fig. 11 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, and as shown in fig. 11, the data processing apparatus 110 may include a preprocessing unit 1101, a speaker separation unit 1102, a calculation unit 1103, and an adjustment unit 1104. Wherein:
A preprocessing unit 1101, configured to preprocess a first audio frame in audio data, so as to obtain a spectral feature corresponding to the first audio frame; the first audio frame includes voice data of M speakers; m is greater than or equal to 2;
the speaker separation unit 1102 is configured to input the spectral features to a speaker separation model, and separate first predicted speakers corresponding to the spectral features by using the speaker separation model to obtain L first predicted speakers; the L is greater than or equal to 1;
a calculating unit 1103 for calculating first loss values between the L first predicted speakers and the M real speakers by using a permutation invariance loss function; if the first set is the same as the second set, the first loss value is equal to zero; if the first set is different from the second set, the first loss value is not equal to zero; the first set includes the L first predicted speakers, and the second set includes the M real speakers;
an adjusting unit 1104 is configured to reversely adjust parameters in the speaker separation model based on at least the first loss value, so as to obtain a converged speaker separation model.
In some embodiments, the speaker separation model includes a first convolution network and a first speaker recognition network, the speaker separation unit 1102 further configured to perform:
inputting the spectral features into the first convolutional network; processing the spectrum characteristics through the first convolution network to obtain first characteristics;
and inputting the first features into the first speaker recognition network, and processing the first features through the first speaker recognition network to obtain the L first predicted speakers.
In some embodiments, the first convolutional network comprises P time-expanded convolutional layers; the data processing apparatus 110 may further include a first processing unit; wherein the first processing unit is configured to perform, before said inputting the spectral feature into the first convolutional network:
for each of the P time-expanded convolutional layers, performing the following:
acquiring at least two time convolution blocks included in the time convolution layer;
performing a first process on each of the at least two temporal convolution blocks; the first process includes: replacing the standard convolution in the time convolution block with a segmented convolution and a point convolution; wherein, P is greater than or equal to 2.
In some embodiments, the data processing apparatus 110 may further include a second processing unit; wherein the second processing unit is configured to perform, if the speaker separation model further comprises a second convolutional network:
inputting the spectral features into the second convolutional network; processing the spectrum characteristics through the second convolution network to obtain second characteristics;
superposing the second feature on the first feature to obtain a superposed first feature;
correspondingly, the inputting the first feature into the first speaker recognition network, and processing the first feature through the first speaker recognition network to obtain the L first predicted speakers includes:
and inputting the superimposed first features to the first speaker recognition network, and processing the superimposed first features through the first speaker recognition network to obtain the L first predicted speakers.
In some embodiments, the data processing apparatus 110 may further include a third processing unit; wherein the third processing unit is configured to perform, before the inputting the spectral feature into the second convolutional network:
inputting the second features to a second speaker recognition network, processing the second features through the second speaker recognition network, and outputting N second predicted speakers;
Calculating second loss values between the N second predicted speakers and the M real speakers;
reversely adjusting parameters in the second convolution network based on the second loss value; such that the first distance is greater than or equal to the first distance threshold; the first distance is a distance between the first feature and the second feature; the speaker corresponding to the first feature is different from the speaker corresponding to the second feature.
In some embodiments, where the speaker separation model further includes a second convolutional network, the adjusting unit 1104 is further configured to perform:
acquiring a second loss value; the second loss values are loss values between N second predicted speakers and the M real speakers; the N second predicted speakers are obtained through second characteristics output by the second convolution network;
determining a third loss value as a sum of the first product and the second product; the first product is the result of multiplying the first loss value by the first weight value; the second product is the result of multiplying the second loss value by a second weight value;
and reversely adjusting parameters in the speaker separation model based on the third loss value.
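A short sketch of the joint adjustment described above; the weight values w1 and w2 are hyper-parameters that this application does not fix.

```python
import torch

def joint_adjustment_step(first_loss: torch.Tensor, second_loss: torch.Tensor,
                          optimizer: torch.optim.Optimizer,
                          w1: float = 1.0, w2: float = 0.5) -> torch.Tensor:
    """Third loss value = first loss * first weight + second loss * second weight,
    followed by one reverse-adjustment (back-propagation) step."""
    third_loss = w1 * first_loss + w2 * second_loss
    optimizer.zero_grad()
    third_loss.backward()
    optimizer.step()
    return third_loss.detach()
```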
In some embodiments, where the audio data includes M first audio frames, the speaker separation model includes a first convolution network, a first speaker recognition network, and a global attention network, the speaker separation unit 1102 is further configured to perform:
Inputting the M first audio frames into the first convolution network, and processing the M audio frames through the first convolution network to obtain M first features;
calculating the following first formula for the M first features through the global attention network to obtain M new first features; the first formula includes: O = f(Q)(f(K)^T V); wherein O represents the M new first features, Q represents the M first features, K and V represent preset parameter tensors, and f(·) is an element-wise mapping with a parameter β, where 0 ≤ β ≤ 1;
For each new first feature of the M new first features: inputting the new first features to the first speaker recognition network, and processing the new first features through the first speaker recognition network to obtain the L first predicted speakers.
It should be noted that each unit included in the data processing apparatus provided in the embodiments of the present application may be implemented by a processor in an electronic device, or, of course, by a specific logic circuit; in practice, the processor may be a Central Processing Unit (CPU), a Micro Processor Unit (MPU), a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), or the like.
The description of the apparatus embodiments above is similar to that of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the device embodiments of the present application, please refer to the description of the method embodiments of the present application for understanding.
It should be noted that, in the embodiment of the present application, if the above-mentioned data processing method is implemented in the form of a software functional module, and sold or used as a separate product, the data processing method may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partially contributing to the related art, and the computer software product may be stored in a storage medium, and include several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, an optical disk, or other various media capable of storing program codes. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, the embodiment of the application provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor realizes the steps in the data processing method provided in the embodiment when executing the program.
Accordingly, the present embodiments provide a storage medium, that is, a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the data processing method provided in the above embodiments.
It should be noted here that: the description of the storage medium and apparatus embodiments above is similar to that of the method embodiments described above, with similar benefits as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and the apparatus of the present application, please refer to the description of the method embodiments of the present application for understanding.
It should be noted that fig. 12 is a schematic diagram of a hardware entity of the electronic device according to the embodiment of the present application, as shown in fig. 12, the electronic device 120 includes: a processor 1201, at least one communication bus 1202, a user interface 1203, at least one external communication interface 1204, and a memory 1205. Wherein the communication bus 1202 is configured to enable connected communications between these components. The user interface 1203 may include a display screen, among other things, and the external communication interface 1204 may include standard wired and wireless interfaces.
The memory 1205 is configured to store instructions and applications executable by the processor 1201, and may also cache data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or processed by various modules in the processor 1201 and the electronic device, and may be implemented by a FLASH memory (FLASH) or a random access memory (Random Access Memory, RAM).
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in some embodiments" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application. The foregoing embodiment numbers of the present application are merely for describing, and do not represent advantages or disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above described device embodiments are only illustrative, e.g. the division of the units is only one logical function division, and there may be other divisions in practice, such as: multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. In addition, the various components shown or discussed may be coupled or directly coupled or communicatively coupled to each other via some interface, whether indirectly coupled or communicatively coupled to devices or units, whether electrically, mechanically, or otherwise.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; can be located in one place or distributed to a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware related to program instructions, and the foregoing program may be stored in a computer readable storage medium, where the program, when executed, performs steps including the above method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read Only Memory (ROM), a magnetic disk or an optical disk, or the like, which can store program codes.
Alternatively, the integrated units described above may be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partially contributing to the related art, and the computer software product may be stored in a storage medium, and include several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
The foregoing is merely an embodiment of the present application, but the protection scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of data processing, the method comprising:
preprocessing a first audio frame in audio data to obtain a frequency spectrum characteristic corresponding to the first audio frame; the first audio frame includes voice data of M speakers; m is greater than or equal to 2;
inputting the spectral features into a speaker separation model, and separating first predicted speakers corresponding to the spectral features through the speaker separation model to obtain L first predicted speakers; the L is greater than or equal to 1;
calculating first loss values between the L first predicted speakers and the M real speakers by using a permutation invariance loss function; if the first set is the same as the second set, the first loss value is equal to zero; if the first set is different from the second set, the first loss value is not equal to zero; the first set includes the L first predicted speakers, and the second set includes the M real speakers;
and reversely adjusting parameters in the speaker separation model at least based on the first loss value to obtain a converged speaker separation model.
2. The method of claim 1, wherein the speaker separation model includes a first convolution network and a first speaker recognition network, wherein the inputting the spectral features into the speaker separation model, and separating the first predicted speakers corresponding to the spectral features by the speaker separation model, obtaining L first predicted speakers, includes:
Inputting the spectral features into the first convolutional network; processing the spectrum characteristics through the first convolution network to obtain first characteristics;
and inputting the first features into the first speaker recognition network, and processing the first features through the first speaker recognition network to obtain the L first predicted speakers.
3. The method of claim 2, wherein the first convolutional network comprises P time-expanded convolutional layers; before said inputting the spectral features into the first convolutional network, the method further comprises:
for each of the P time-expanded convolutional layers, performing the following:
acquiring at least two time convolution blocks included in the time convolution layer;
performing a first process on each of the at least two temporal convolution blocks; the first process includes: replacing the standard convolution in the time convolution block with a segmented convolution and a point convolution; wherein, P is greater than or equal to 2.
4. The method of claim 2, wherein the speaker separation model further comprises a second convolutional network, the method further comprising:
Inputting the spectral features into the second convolutional network; processing the spectrum characteristics through the second convolution network to obtain second characteristics;
superposing the second feature on the first feature to obtain a superposed first feature;
correspondingly, the inputting the first feature into the first speaker recognition network, and processing the first feature through the first speaker recognition network to obtain the L first predicted speakers includes:
and inputting the superimposed first features to the first speaker recognition network, and processing the superimposed first features through the first speaker recognition network to obtain the L first predicted speakers.
5. The method of claim 4, wherein prior to said inputting the spectral features into the second convolutional network, the method further comprises:
inputting the second features to a second speaker recognition network, processing the second features through the second speaker recognition network, and outputting N second predicted speakers;
calculating second loss values between the N second predicted speakers and the M real speakers;
Reversely adjusting parameters in the second convolution network based on the second loss value; such that the first distance is greater than or equal to the first distance threshold; the first distance is a distance between the first feature and the second feature; the speaker corresponding to the first feature is different from the speaker corresponding to the second feature.
6. The method of claim 1, wherein in the case where the speaker separation model further comprises a second convolutional network; the inversely adjusting parameters in the speaker separation model based at least on the first loss value includes:
acquiring a second loss value; the second loss values are loss values between N second predicted speakers and the M real speakers; the N second predicted speakers are obtained through second characteristics output by the second convolution network;
determining a third loss value as a sum of the first product and the second product; the first product is the result of multiplying the first loss value by the first weight value; the second product is the result of multiplying the second loss value by a second weight value;
and reversely adjusting parameters in the speaker separation model based on the third loss value.
7. The method of claim 1, wherein in the case where the audio data comprises M first audio frames, the speaker separation model comprises a first convolutional network, a first speaker recognition network, and a global attention network; the spectral features are input into a speaker separation model, and first predicted speakers corresponding to the spectral features are separated through the speaker separation model to obtain L first predicted speakers, wherein the method comprises the following steps:
inputting the M first audio frames into the first convolution network, and processing the M audio frames through the first convolution network to obtain M first features;
calculating the following first formula for the M first features through the global attention network to obtain M new first features; the first formula includes: O = f(Q)(f(K)^T V); wherein O represents the M new first features, Q represents the M first features, K and V represent preset parameter tensors, and f(·) is an element-wise mapping with a parameter β, where 0 ≤ β ≤ 1;
for each new first feature of the M new first features: inputting the new first features to the first speaker recognition network, and processing the new first features through the first speaker recognition network to obtain the L first predicted speakers.
8. A data processing apparatus, the apparatus comprising:
the preprocessing unit is used for preprocessing a first audio frame in the audio data to obtain spectrum characteristics corresponding to the first audio frame; the first audio frame includes voice data of M speakers; m is greater than or equal to 2;
the speaker separation unit is used for inputting the frequency spectrum characteristics into a speaker separation model, and separating first predicted speakers corresponding to the frequency spectrum characteristics through the speaker separation model to obtain L first predicted speakers; the L is greater than or equal to 1;
a calculating unit for calculating first loss values between the L first predicted speakers and the M real speakers using a permutation invariance loss function; if the first set is the same as the second set, the first loss value is equal to zero; if the first set is different from the second set, the first loss value is not equal to zero; the first set includes the L first predicted speakers, and the second set includes the M real speakers;
and the adjusting unit is used for reversely adjusting parameters in the speaker separation model at least based on the first loss value so as to obtain a converged speaker separation model.
9. An electronic device comprising a memory and a processor, the memory storing a computer program executable on the processor, the processor implementing the data processing method of any one of claims 1 to 7 when the program is executed.
10. A storage medium having stored thereon a computer program which, when executed by a processor, implements the data processing method of any of claims 1 to 7.
CN202111627928.9A 2021-12-28 2021-12-28 Data processing method and device, equipment and storage medium Pending CN116364102A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111627928.9A CN116364102A (en) 2021-12-28 2021-12-28 Data processing method and device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111627928.9A CN116364102A (en) 2021-12-28 2021-12-28 Data processing method and device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116364102A true CN116364102A (en) 2023-06-30

Family

ID=86939287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111627928.9A Pending CN116364102A (en) 2021-12-28 2021-12-28 Data processing method and device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116364102A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117198272A (en) * 2023-11-07 2023-12-08 浙江同花顺智能科技有限公司 Voice processing method and device, electronic equipment and storage medium
CN117198272B (en) * 2023-11-07 2024-01-30 浙江同花顺智能科技有限公司 Voice processing method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination