WO2020221059A1 - Audio signal processing method and related products - Google Patents

Audio signal processing method and related products

Info

Publication number
WO2020221059A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
speaker
matrix
target
signals
Prior art date
Application number
PCT/CN2020/085800
Other languages
English (en)
French (fr)
Inventor
黎椿键
施栋
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP20799277.7A (EP3944238B1)
Priority to US17/605,121 (US20220199099A1)
Publication of WO2020221059A1

Classifications

    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G10L21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G06F18/23 Clustering techniques
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering
    • H04R2201/401 2D or 3D arrays of transducers

Definitions

  • This application relates to the technical field of audio signal processing, and in particular to an audio processing method and related products.
  • In a conference scenario, the audio signal is usually partitioned by speaker (speaker diarization): the entire audio signal is divided into segments, and each segment is labeled with its corresponding speaker, so that the speaker at each moment is clearly known and a meeting summary can be generated quickly.
  • The embodiments of the present application provide an audio signal processing method, which helps improve the accuracy of speaker segmentation, thereby facilitating the generation of meeting records and improving user experience.
  • In a first aspect, an embodiment of the present application provides an audio signal processing method, including: receiving N observation signals collected by a microphone array, and performing blind source separation on the N observation signals to obtain M source signals and M separation matrices, where the M source signals correspond to the M separation matrices one to one, N is an integer greater than or equal to 2, and M is an integer greater than or equal to 1; obtaining a spatial feature matrix of the N observation signals, where the spatial feature matrix is used to represent the correlation between the N observation signals; obtaining preset audio features of each of the M source signals; and determining, according to the preset audio features of each source signal, the M separation matrices, and the spatial feature matrix, the number of speakers and the speaker identities corresponding to the N observation signals.
  • The solution of the embodiments of the present application is a speaker segmentation technique for a multi-microphone system. It introduces a spatial feature matrix and preset audio features, and clusters speakers through the spatial feature matrix, the preset audio features, and the separation matrices.
  • Speaker segmentation can therefore be achieved without knowing the arrangement of the microphone array in advance, which solves the prior-art problem that device aging reduces segmentation accuracy; and, with the participation of the audio features, scenes in which speakers are at similar angles or a speaker is moving can be recognized, further improving speaker segmentation accuracy.
  • In one possible implementation, obtaining the preset audio features of each of the M source signals includes: dividing each of the M source signals into Q audio frames, where Q is an integer greater than 1, and obtaining the preset audio features of each audio frame of each source signal.
  • In this way, each source signal is framed so that the preset audio features can be used for subsequent clustering.
  • In one possible implementation, acquiring the spatial feature matrix of the N observation signals includes: dividing each of the N observation signals into Q audio frames; determining, according to the N audio frames corresponding to each first audio frame group, the spatial feature matrix corresponding to each first audio frame group, to obtain Q spatial feature matrices, where the N audio frames corresponding to each first audio frame group are the N audio frames of the N observation signals in the same time window; and obtaining the spatial feature matrix of the N observation signals according to the Q spatial feature matrices. In the formula for each spatial feature matrix:
  • c_F(k, n) represents the spatial feature matrix corresponding to each first audio frame group;
  • n represents the frame index among the Q audio frames, where n is an integer and 1 ≤ n ≤ Q;
  • k represents the frequency-bin index of the n-th audio frame;
  • X_F(k, n) is a column vector composed of the frequency-domain representations of the k-th frequency bin of the n-th audio frame of each observation signal;
  • X_F^H(k, n) is the transpose of X_F(k, n).
  • In one possible implementation, determining the number of speakers and the speaker identities corresponding to the N observation signals according to the preset audio features of each source signal, the M separation matrices, and the spatial feature matrix includes: performing a first clustering on the spatial feature matrix to obtain P initial clusters, where each initial cluster corresponds to an initial cluster center matrix used to indicate the spatial position of the speaker corresponding to that initial cluster, and P is an integer greater than or equal to 1; determining M similarities, where the M similarities are the similarities between the initial cluster center matrix corresponding to each initial cluster and the M separation matrices; determining, according to the M similarities, the source signal corresponding to each initial cluster; and performing a second clustering on the preset audio features of the source signal corresponding to each initial cluster to obtain the number of speakers and the speaker identities corresponding to the N observation signals.
  • It can be seen that the first clustering on the spatial feature matrix determines the positions from which speakers are speaking in the current scene and yields an estimated number of speakers; the second clustering on the preset audio features then splits or merges the initial clusters obtained by the first clustering to obtain the true number of speakers in the current scene, which improves speaker segmentation accuracy.
  • In one possible implementation, determining the source signal corresponding to each initial cluster according to the M similarities includes: determining the maximum similarity among the M similarities, and determining, among the M separation matrices, the separation matrix corresponding to the maximum similarity as the target separation matrix; and determining the source signal corresponding to the target separation matrix as the source signal corresponding to that initial cluster. It can be seen that the first clustering on the spatial feature matrix determines the positions from which speakers are speaking in the current scene, and the similarity between the spatial feature matrix and the separation matrices is then used to quickly determine the source signal corresponding to each speaker.
  • In one possible implementation, performing the second clustering on the preset audio features of the source signal corresponding to each initial cluster to obtain the number of speakers and the speaker identities corresponding to the N observation signals includes: performing the second clustering on the preset audio features of the source signal corresponding to each initial cluster to obtain H target clusters, where the H target clusters represent the number of speakers corresponding to the N observation signals and each target cluster corresponds to a target cluster center.
  • Each target cluster center is composed of a preset audio feature and at least one initial cluster center matrix.
  • The preset audio feature corresponding to each target cluster is used to represent the speaker identity of the speaker corresponding to that target cluster, and the at least one initial cluster center matrix corresponding to each target cluster is used to represent the spatial position of that speaker.
  • It can be seen that the preset audio features corresponding to each source signal are used for clustering, and the initial clusters corresponding to each source signal are split or merged to obtain the target clusters corresponding to the M source signals.
  • In this way, two source signals that were separated because of a speaker's movement are gathered into the same target cluster, and a source signal containing two speakers at similar angles is split into two target clusters, which improves speaker segmentation accuracy.
  • In one possible implementation, the method further includes: obtaining output audio including a speaker tag according to the number of speakers and the speaker identities corresponding to the N observation signals. It can be seen that, based on the clustered speaker identities and number, the audio signal is segmented to determine the speaker identities and number corresponding to each audio frame, which facilitates generating a meeting summary in a meeting-room environment.
  • In one possible implementation, obtaining the output audio including the speaker tag according to the number of speakers and the speaker identities corresponding to the N observation signals includes: determining K distances, where the K distances are the distances between the spatial feature matrix corresponding to each first audio frame group and the at least one initial cluster center matrix corresponding to each target cluster, and each first audio frame group is composed of the N audio frames of the N observation signals in the same time window; determining, according to the K distances, the target clusters corresponding to each first audio frame group; and then using the preset audio features of the corresponding audio frames of the source signals to determine the speaker corresponding to each audio frame, thereby obtaining the output audio in which the audio is segmented and labeled.
  • It can be seen that the audio is segmented and labeled in two steps: the spatial feature matrix is first used to determine the number of speakers corresponding to each audio frame group, and the preset audio features of each audio frame of the source signals are then used to determine the source signal corresponding to each speaker, which improves speaker segmentation accuracy.
  • In one possible implementation, obtaining the output audio including the speaker tag according to the number of speakers and the speaker identities corresponding to the N observation signals includes: determining H similarities, where the H similarities are the similarities between the preset audio feature of each audio frame in each second audio frame group and the preset audio feature of each target cluster center in the H target clusters, and each second audio frame group is composed of the audio frames of the M source signals in the same time window; determining, according to the H similarities, the target cluster corresponding to each audio frame in each second audio frame group; and obtaining, according to the target cluster corresponding to each audio frame, the output audio including the speaker tag, where the speaker tag is used to label the number of speakers and/or the speaker identity of each audio frame in the output audio. It can be seen that segmenting and labeling the audio directly based on the audio features improves the speed of speaker segmentation.
  • In a second aspect, an embodiment of the present application provides an audio processing device, including:
  • an audio separation unit, configured to receive N observation signals collected by a microphone array and perform blind source separation on the N observation signals to obtain M source signals and M separation matrices, where the M source signals correspond to the M separation matrices one to one, N is an integer greater than or equal to 2, and M is an integer greater than or equal to 1;
  • a spatial feature extraction unit, configured to obtain a spatial feature matrix of the N observation signals, where the spatial feature matrix is used to represent the correlation between the N observation signals;
  • an audio feature extraction unit, configured to obtain preset audio features of each of the M source signals; and
  • a determining unit, configured to determine, according to the preset audio features of each source signal, the M separation matrices, and the spatial feature matrix, the number of speakers and the speaker identities corresponding to the N observation signals.
  • The solution of the embodiments of the present application is a speaker segmentation technique for a multi-microphone system. It introduces a spatial feature matrix and preset audio features, and clusters speakers through the spatial feature matrix, the preset audio features, and the separation matrices.
  • Speaker segmentation can therefore be achieved without knowing the arrangement of the microphone array in advance, which solves the prior-art problem that device aging reduces segmentation accuracy; and, with the participation of the audio features, scenes in which speakers are at similar angles or a speaker is moving can be recognized, further improving speaker segmentation accuracy.
  • In one possible implementation, when acquiring the preset audio features of each of the M source signals, the audio feature extraction unit is specifically configured to: divide each of the M source signals into Q audio frames, where Q is an integer greater than 1, and obtain the preset audio features of each audio frame of each source signal.
  • In one possible implementation, when acquiring the spatial feature matrix of the N observation signals, the spatial feature extraction unit is specifically configured to: divide each of the N observation signals into Q audio frames; determine, according to the N audio frames corresponding to each first audio frame group, the spatial feature matrix corresponding to each first audio frame group, to obtain Q spatial feature matrices, where the N audio frames corresponding to each first audio frame group are the N audio frames of the N observation signals in the same time window; and obtain the spatial feature matrix of the N observation signals according to the Q spatial feature matrices. In the formula for each spatial feature matrix:
  • c_F(k, n) represents the spatial feature matrix corresponding to each first audio frame group;
  • n represents the frame index among the Q audio frames, where n is an integer and 1 ≤ n ≤ Q;
  • k represents the frequency-bin index of the n-th audio frame;
  • X_F(k, n) is a column vector composed of the frequency-domain representations of the k-th frequency bin of the n-th audio frame of each observation signal;
  • X_F^H(k, n) is the transpose of X_F(k, n).
  • In one possible implementation, when determining the number of speakers and the speaker identities corresponding to the N observation signals according to the preset audio features of each source signal, the M separation matrices, and the spatial feature matrix, the determining unit is specifically configured to: perform a first clustering on the spatial feature matrix to obtain P initial clusters, where each initial cluster corresponds to an initial cluster center matrix used to represent the spatial position of the speaker corresponding to that initial cluster, and P is an integer greater than or equal to 1; determine M similarities, where the M similarities are the similarities between the initial cluster center matrix corresponding to each initial cluster and the M separation matrices; determine, according to the M similarities, the source signal corresponding to each initial cluster; and perform a second clustering on the preset audio features of the source signal corresponding to each initial cluster to obtain the number of speakers and the speaker identities corresponding to the N observation signals.
  • In one possible implementation, when determining the source signal corresponding to each initial cluster according to the M similarities, the determining unit is specifically configured to: determine the maximum similarity among the M similarities, and determine, among the M separation matrices, the separation matrix corresponding to the maximum similarity as the target separation matrix; and determine the source signal corresponding to the target separation matrix as the source signal corresponding to that initial cluster.
  • In one possible implementation, when performing the second clustering on the preset audio features of the source signal corresponding to each initial cluster to obtain the number of speakers and the speaker identities corresponding to the N observation signals, the determining unit is specifically configured to: perform the second clustering on the preset audio features of the source signal corresponding to each initial cluster to obtain H target clusters, where the H target clusters represent the number of speakers corresponding to the N observation signals, each target cluster corresponds to a target cluster center, each target cluster center is composed of a preset audio feature and at least one initial cluster center matrix, the preset audio feature corresponding to each target cluster is used to represent the speaker identity of the speaker corresponding to that target cluster, and the at least one initial cluster center matrix corresponding to each target cluster is used to represent the spatial position of that speaker.
  • In one possible implementation, the device further includes an audio segmentation unit.
  • The audio segmentation unit is configured to obtain output audio including the speaker tag according to the number of speakers and the speaker identities corresponding to the N observation signals.
  • In one possible implementation, when obtaining the output audio including the speaker tag according to the number of speakers and the speaker identities corresponding to the N observation signals, the audio segmentation unit is specifically configured to: determine K distances, where the K distances are the distances between the spatial feature matrix corresponding to each first audio frame group and the at least one initial cluster center matrix corresponding to each target cluster, each first audio frame group is composed of the N audio frames of the N observation signals in the same time window, and K ≥ H; determine, according to the K distances, L target clusters corresponding to each first audio frame group, where L ≤ H; extract, from the M source signals, L audio frames corresponding to each first audio frame group, where the L audio frames are in the same time window as that first audio frame group; determine L similarities, where the L similarities are the similarities between the preset audio feature of each of the L audio frames and the preset audio feature corresponding to each of the L target clusters; determine, according to the L similarities, the target cluster corresponding to each of the L audio frames; and obtain the output audio including the speaker tag according to the target cluster corresponding to each audio frame.
  • In one possible implementation, when obtaining the output audio including the speaker tag according to the number of speakers and the speaker identities corresponding to the N observation signals, the audio segmentation unit is specifically configured to: determine H similarities, where the H similarities are the similarities between the preset audio feature of each audio frame in each second audio frame group and the preset audio feature of each target cluster center in the H target clusters, and each second audio frame group is composed of the audio frames of the M source signals in the same time window; determine, according to the H similarities, the target cluster corresponding to each audio frame in each second audio frame group; and obtain, according to the target cluster corresponding to each audio frame, the output audio including the speaker tag, where the speaker tag is used to label the number of speakers and/or the speaker identity of each audio frame in the output audio.
  • In a third aspect, an embodiment of the present application provides an audio processing device, including:
  • a communication interface, configured to receive N observation signals collected by a microphone array, where N is an integer greater than or equal to 2; and
  • a processor, configured to: perform blind source separation on the N observation signals to obtain M source signals and M separation matrices, where the M source signals correspond to the M separation matrices one to one, and M is an integer greater than or equal to 1; obtain a spatial feature matrix of the N observation signals, where the spatial feature matrix is used to represent the correlation between the N observation signals; obtain preset audio features of each of the M source signals; and determine, according to the preset audio features of each source signal, the M separation matrices, and the spatial feature matrix, the number of speakers and the speaker identities corresponding to the N observation signals.
  • In addition, an embodiment of the present application further provides a computer-readable storage medium that stores a computer program, where the computer program is executed by hardware (for example, a processor) to perform some or all of the steps of any method executed by the audio processing device.
  • The embodiments of the present application further provide a computer program product including instructions which, when run on an audio processing device, cause the audio processing device to perform some or all of the steps of the audio signal processing method of the above aspects.
  • FIG. 1A is a schematic diagram of a process provided by an embodiment of this application.
  • FIG. 1B is a schematic flowchart of an audio signal processing method provided by an embodiment of this application.
  • FIG. 2A is a schematic flowchart of another audio signal processing method provided by an embodiment of this application.
  • FIG. 2B is a schematic diagram of the characterization of frequency points in the frequency domain according to an embodiment of this application.
  • FIG. 2C is a schematic diagram of a speaking scene provided by an embodiment of the application.
  • FIG. 2D is a schematic diagram of another speaking scene provided by an embodiment of the application.
  • FIG. 3 is a schematic flowchart of another audio signal processing method provided by an embodiment of the application.
  • FIG. 4 is a schematic flowchart of another audio signal processing method provided by an embodiment of the application.
  • FIG. 5A is a schematic diagram of displaying output audio on an interface according to an embodiment of the application.
  • FIG. 5B is another schematic diagram of displaying output audio on an interface provided by an embodiment of the application.
  • FIG. 5C is another schematic diagram of displaying output audio on an interface provided by an embodiment of the application.
  • FIG. 6 is a schematic diagram of an audio processing device provided by an embodiment of the application.
  • FIG. 7 is a schematic diagram of an audio processing device provided by an embodiment of the application.
  • BSS: Blind Source Separation.
  • BSS technology mainly solves the "cocktail party" problem, that is, separating each person's independent speech signal from a given mixed signal.
  • For M source signals, it is usually assumed that there are also M observation signals, that is, that there are M microphones in the microphone array. For example, two microphones are placed at different positions in a room while two people talk at the same time; each microphone collects the audio of both people talking and outputs one observation signal.
  • BSS technology mainly solves how to separate the M source signals s_1, ..., s_M from the observation signals x_1, ..., x_M.
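  • As a rough illustration of the idea only (not the separation algorithm of this application), blind source separation of two simulated instantaneous mixtures can be sketched with FastICA; the signal shapes, sampling rate, and mixing matrix below are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two illustrative source signals (a sine tone and a sawtooth), 2 seconds at 8 kHz.
fs = 8000
t = np.arange(0, 2.0, 1.0 / fs)
s1 = np.sin(2 * np.pi * 220 * t)                 # source signal 1
s2 = 2 * (t * 3 - np.floor(t * 3 + 0.5))         # source signal 2 (sawtooth)
S = np.c_[s1, s2]

# Simulated microphone array: each microphone outputs one observation signal
# that is a different mixture of the two sources.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                       # mixing matrix (unknown in practice)
X = S @ A.T                                      # N = 2 observation signals

# Blind source separation: estimate M = 2 source signals and the unmixing matrix.
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                     # estimated source signals
W = ica.components_                              # estimated separation (unmixing) matrix
print(S_est.shape, W.shape)                      # (16000, 2) (2, 2)
```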
  • In the prior art, a speaker segmentation system that mainly uses the speakers' audio features for segmentation cannot distinguish similar speakers (speakers with similar audio features), so its segmentation accuracy is low; a multi-microphone speaker segmentation system needs to obtain the angle and position of each speaker and use them for segmentation, so it must know the arrangement and spatial position information of the microphone array in advance.
  • Therefore, the audio signal processing method of this application is proposed to improve speaker segmentation accuracy.
  • FIG. 1A is a scene architecture diagram of an audio signal processing method.
  • the scene architecture diagram includes a sound source, a microphone array, and an audio processing device.
  • The audio processing device includes a spatial feature extraction module, a blind source separation module, an audio feature extraction module, a first clustering module, a second clustering module, and an audio segmentation module.
  • The microphone array is used to collect the speakers' speech audio to obtain the observation signals.
  • The spatial feature extraction module is used to determine the spatial feature matrix corresponding to the observation signals.
  • The blind source separation module is used to perform blind source separation on the observation signals to obtain the source signals.
  • The first clustering module is used to perform the first clustering on the spatial feature matrix to obtain the initial clusters.
  • The audio feature extraction module is used to perform feature extraction on the source signals to obtain the preset audio features corresponding to the source signals.
  • The second clustering module is used to perform the second clustering according to the preset audio features corresponding to the source signals and the initial clusters to obtain the target clusters.
  • The solution of the embodiments of the present application is a speaker segmentation technique for a multi-microphone system. It introduces a spatial feature matrix and preset audio features, and clusters speakers through the spatial feature matrix, the preset audio features, and the separation matrices.
  • Speaker segmentation can therefore be achieved without knowing the arrangement of the microphone array in advance, which solves the prior-art problem that device aging reduces segmentation accuracy; and, with the participation of the audio features, scenes in which speakers are at similar angles or a speaker is moving can be recognized, further improving speaker segmentation accuracy.
  • FIG. 1B is a schematic flowchart of an audio signal processing method provided by an embodiment of the application. This method may include but is not limited to the following steps:
  • Step 101: The audio processing device receives N observation signals collected by the microphone array, and performs blind source separation on the N observation signals to obtain M source signals and M separation matrices, where the M source signals correspond to the M separation matrices one to one, N is an integer greater than or equal to 2, and M is an integer greater than or equal to 1.
  • Among other things, methods for performing blind source separation on the N observation signals include a time-domain separation method and a frequency-domain separation method.
  • Step 102 The audio processing device obtains a spatial feature matrix of the N channels of observation signals, where the spatial feature matrix is used to represent the correlation between the N channels of observation signals.
  • The correlation between the N observation signals is caused by the different spatial positions of the speakers relative to the microphones; that is, the spatial feature matrix reflects the spatial position information of the speakers.
  • Step 103 The audio processing device acquires the preset audio characteristics of each source signal in the M source signals.
  • The preset audio features include but are not limited to one or more of the following: zero-crossing rate (ZCR), short-term energy, fundamental frequency, and Mel-frequency cepstral coefficients (MFCC).
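  • As a rough sketch of how such frame-level features could be computed (librosa is used for MFCC only as an illustrative choice; the frame length and hop size are assumed values):

```python
import numpy as np
import librosa

def frame_features(source, fs, frame_len=1024, hop=512, n_mfcc=13):
    """Per-frame preset audio features of one source signal: ZCR, short-term energy, MFCC."""
    frames = librosa.util.frame(source, frame_length=frame_len, hop_length=hop)  # (frame_len, Q)
    zcr = 0.5 * np.mean(np.abs(np.diff(np.sign(frames), axis=0)), axis=0)        # zero-crossing rate per frame
    energy = np.sum(frames ** 2, axis=0)                                         # short-term energy per frame
    mfcc = librosa.feature.mfcc(y=source, sr=fs, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop)                 # (n_mfcc, ~Q)
    return zcr, energy, mfcc

# Example: features of one separated source signal (noise stands in for speech here).
fs = 16000
source = np.random.randn(fs * 2).astype(np.float32)
zcr, energy, mfcc = frame_features(source, fs)
print(zcr.shape, energy.shape, mfcc.shape)
```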
  • Step 104 The audio processing device determines the number of speakers and speaker identities corresponding to the N channels of observation signals according to the preset audio features of each source signal, the M separation matrices, and the spatial feature matrix.
  • the preset audio features, separation matrix, and spatial feature matrix are used for clustering to obtain the identity and number of speakers.
  • Compared with the prior art, in which only audio features are used for speaker segmentation, the speaker segmentation accuracy is improved; moreover, the multi-microphone speaker segmentation technology in this application introduces a spatial feature matrix, so speaker segmentation can be performed without knowing the arrangement of the microphone array in advance, and there is no problem of segmentation accuracy being reduced because device aging changes the arrangement information.
  • FIG. 2A is a schematic flowchart of another audio signal processing method provided by an embodiment of the application. This method may include but is not limited to the following steps:
  • Step 201: The audio processing device receives N observation signals collected by the microphone array, and performs blind source separation on the N observation signals to obtain M source signals and M separation matrices, where the M source signals correspond to the M separation matrices one to one, N is an integer greater than or equal to 2, and M is an integer greater than or equal to 1.
  • the N observation signals are audio signals collected by the microphone array in a period of time.
  • When the number of observation signals collected by the microphone array is equal to the dimension of the source signals, the model is called a standard independent component analysis (ICA) model.
  • In the BSS separation formula for separating the M source signals from the N observation signals, p is an integer with 1 ≤ p ≤ M, and the separation is performed through a convolution operation.
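  • A common convolutive separation form that is consistent with this description, offered only as an assumed reconstruction of the formula, is:

$$ \hat{s}_p(t) = \sum_{q=1}^{N} w_{pq}(t) * x_q(t), \qquad 1 \le p \le M, $$

where w_{pq}(t) is the separation (unmixing) filter applied to the q-th observation signal x_q(t) to estimate the p-th source signal, and * denotes convolution.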
  • Step 202 The audio processing device obtains a spatial characteristic matrix of the N observation signals, where the spatial characteristic matrix is used to represent the correlation between the N observation signals.
  • The process of obtaining the spatial feature matrix of the N observation signals may be: dividing each of the N observation signals into Q audio frames; and determining, according to the N audio frames corresponding to each first audio frame group, the spatial feature matrix corresponding to each first audio frame group, to obtain Q spatial feature matrices, where the N audio frames corresponding to each first audio frame group are the N audio frames of the N observation signals in the same time window. In the formula for each spatial feature matrix:
  • c_F(k, n) represents the spatial feature matrix corresponding to each first audio frame group;
  • n represents the frame index among the Q audio frames;
  • k represents the frequency-bin index of the n-th audio frame;
  • X_F(k, n) is a column vector composed of the frequency-domain representations of the k-th frequency bin of the n-th audio frame of each observation signal;
  • X_F^H(k, n) is the transpose of X_F(k, n);
  • ||X_F(k, n)|| is the norm of X_F(k, n).
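  • Given these definitions, a plausible reconstruction of the spatial feature matrix formula, assuming the usual normalized spatial covariance form (this is an assumption, not a reproduction of the figure in the application), is:

$$ c_F(k, n) = \frac{X_F(k, n)\, X_F^{H}(k, n)}{\lVert X_F(k, n) \rVert^{2}} $$

which yields an N x N matrix whose diagonal entries are per-microphone energies and whose off-diagonal entries capture the cross-correlation between microphones.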
  • the diagonal elements in the spatial feature matrix represent the energy of the observation signals collected by each microphone in the microphone array
  • the off-diagonal elements represent the correlation between the observation signals collected by different microphones in the microphone array.
  • For example, the diagonal element C_11 of the spatial feature matrix represents the energy of the observation signal collected by the first microphone in the microphone array, and the off-diagonal element C_12 represents the correlation between the observation signals collected by the first microphone and the second microphone in the microphone array.
  • This correlation is caused by the different spatial positions of the speech relative to the first microphone and the second microphone; therefore, the spatial position of the speaker corresponding to each first audio frame group can be reflected through the spatial feature matrix.
  • FIG. 2B is a schematic diagram, provided by an embodiment of the application, of the frequency-domain representation of the audio frame of each of the N observation signals in any time window, assuming that each audio frame contains several frequency bins. As can be seen from FIG. 2B, the column vector corresponding to the first frequency bin of that time window across the N observation signals is [a_11 + b_11*j, a_21 + b_21*j, ..., a_N1 + b_N1*j]^T. The N audio frames corresponding to each time window are regarded as one first audio frame group; since each observation signal is divided into Q audio frames, Q first audio frame groups are obtained. The representations of the other frequency bins in the frequency domain under the time window shown in FIG. 2B are obtained in the same way, giving the spatial feature matrix corresponding to the first audio frame group under that time window.
  • The spatial feature matrix corresponding to each first audio frame group is calculated separately in this way to obtain Q spatial feature matrices, and the Q spatial feature matrices are spliced in the order of their time windows to obtain the spatial feature matrix corresponding to the N observation signals.
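  • A minimal sketch of this computation, assuming the per-frame matrices are obtained from an STFT and that the frequency bins of a frame are simply averaged (the aggregation over bins is an assumption not spelled out in the text):

```python
import numpy as np
from scipy.signal import stft

def spatial_feature_matrices(observations, fs, frame_len=1024):
    """Per-frame spatial feature matrices for N observation signals.

    observations: array of shape (N, num_samples), one row per microphone.
    Returns an array of shape (Q, N, N): one matrix per first audio frame group.
    """
    _, _, X = stft(observations, fs=fs, nperseg=frame_len)   # X: (N, K_bins, Q_frames)
    N, K, Q = X.shape
    C = np.zeros((Q, N, N), dtype=complex)
    for n in range(Q):
        for k in range(K):
            x = X[:, k, n].reshape(N, 1)            # column vector X_F(k, n)
            norm2 = np.vdot(x, x).real + 1e-12      # ||X_F(k, n)||^2
            C[n] += (x @ x.conj().T) / norm2        # normalized spatial covariance
        C[n] /= K                                   # average over frequency bins (assumed)
    return C

# Example with two channels of random data standing in for microphone signals.
C = spatial_feature_matrices(np.random.randn(2, 16000), fs=16000)
print(C.shape)   # (Q, 2, 2)
```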
  • Step 203 The audio processing device obtains the preset audio characteristics of each source signal in the M source signals.
  • The step of obtaining the preset audio features of each of the M source signals includes: dividing each of the M source signals into Q audio frames, and obtaining the preset audio features of each audio frame of each source signal.
  • The preset audio features include but are not limited to one or more of the following: zero-crossing rate (ZCR), short-term energy, fundamental frequency, and Mel-frequency cepstral coefficients (MFCC).
  • the following specifically introduces the process of obtaining the zero-crossing rate ZCR and short-term energy.
  • Z_n is the zero-crossing rate corresponding to the n-th audio frame of the Q audio frames;
  • sgn[] is the sign function;
  • N is the frame length of the n-th audio frame;
  • n is the frame index of the audio frame;
  • E_n is the short-term energy of the n-th audio frame, where N is again the frame length of the n-th audio frame.
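  • Standard forms of these two quantities that are consistent with the variables defined above (offered as an assumed reconstruction) are:

$$ Z_n = \frac{1}{2}\sum_{m=1}^{N-1}\bigl|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\bigr|, \qquad E_n = \sum_{m=0}^{N-1} x_n(m)^2, $$

where x_n(m) denotes the m-th sample of the n-th audio frame.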
  • Step 204 The audio processing device determines the number of speakers and speaker identities corresponding to the N channels of observation signals according to the preset audio features of each source signal, the M separation matrices, and the spatial feature matrix.
  • Specifically, a first clustering is performed on the spatial feature matrix to obtain P initial clusters, where each initial cluster corresponds to an initial cluster center matrix used to indicate the spatial position of the speaker corresponding to that initial cluster, and P is an integer greater than or equal to 1; M similarities are determined, where the M similarities are the similarities between the initial cluster center matrix corresponding to each initial cluster and the M separation matrices; the source signal corresponding to each initial cluster is determined according to the M similarities; and a second clustering is performed on the preset audio features of the source signal corresponding to each initial cluster to obtain the number of speakers and/or the speaker identities corresponding to the N observation signals.
  • Since the spatial feature matrix reflects the speaker's spatial position, the spatial feature matrix corresponding to each first audio frame group is used as one sample, giving Q samples, and the Q samples are used for the first clustering.
  • Each initial cluster corresponds to an initial cluster center matrix; the initial cluster center matrix indicates the speaker's spatial position, and the initial cluster center is expressed in the form of a spatial feature matrix. After the clustering is completed, P initial clusters are obtained, and it is determined that the N observation signals are generated by speakers speaking at P spatial positions.
  • The clustering algorithms that can be used for the first clustering and the second clustering include but are not limited to the following: the expectation-maximization (EM) clustering algorithm, the K-means clustering algorithm, and the hierarchical agglomerative clustering (HAC) algorithm.
  • Since the separation matrices represent spatial positions, the number of separation matrices reflects the number of speakers to a certain extent. Therefore, when the K-means algorithm is used for the first clustering, the number of initial clusters is estimated from the number of separation matrices: the value of k in the K-means algorithm is set to the number of separation matrices M, and the cluster centers corresponding to M initial clusters are preset before performing the first clustering. Estimating the number of initial clusters from the number of separation matrices reduces the number of iterations and speeds up clustering.
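  • A minimal sketch of this first clustering step, assuming the Q spatial feature matrices are flattened into real-valued vectors and that k is preset to the number of separation matrices M:

```python
import numpy as np
from sklearn.cluster import KMeans

def first_clustering(C, M):
    """C: (Q, N, N) spatial feature matrices; M: number of separation matrices."""
    Q = C.shape[0]
    flat = C.reshape(Q, -1)
    samples = np.concatenate([flat.real, flat.imag], axis=1)           # one sample per frame group
    km = KMeans(n_clusters=M, n_init=10, random_state=0).fit(samples)  # k preset to M
    return km.labels_, km.cluster_centers_   # initial cluster of each frame group, cluster center matrices

labels, centers = first_clustering(np.random.randn(50, 2, 2) + 0j, M=2)
print(labels.shape, centers.shape)           # (50,) (2, 8)
```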
  • The step of determining the source signal corresponding to each initial cluster according to the M similarities includes: determining the maximum similarity among the M similarities, determining the separation matrix corresponding to the maximum similarity among the M separation matrices as the target separation matrix, and determining the source signal corresponding to the target separation matrix as the source signal corresponding to that initial cluster.
  • In this way, the source signal corresponding to each of the P spatial positions is determined, that is, the source signal corresponding to each initial cluster is determined.
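  • The similarity measure between an initial cluster center matrix and a separation matrix is not specified here; a sketch using cosine similarity between flattened matrices, purely as an assumed choice, could look like this:

```python
import numpy as np

def match_clusters_to_sources(centers, separation_matrices):
    """centers: (P, d) flattened initial cluster center matrices;
    separation_matrices: (M, d) flattened separation matrices.
    Returns, for each initial cluster, the index of the most similar separation matrix,
    i.e. the source signal corresponding to that initial cluster."""
    c = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-12)
    w = separation_matrices / (np.linalg.norm(separation_matrices, axis=1, keepdims=True) + 1e-12)
    similarities = c @ w.T                 # (P, M) cosine similarities
    return similarities.argmax(axis=1)     # maximum similarity picks the target separation matrix

print(match_clusters_to_sources(np.random.randn(3, 8), np.random.randn(2, 8)))
```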
  • The process of performing the second clustering on the preset audio features of the source signal corresponding to each initial cluster to obtain the number of speakers and/or the speaker identities corresponding to the N observation signals may be: performing the second clustering on the preset audio features of the source signal corresponding to each initial cluster to obtain H target clusters, where the H target clusters represent the number of speakers corresponding to the N observation signals, each target cluster corresponds to a target cluster center, each target cluster center is composed of a preset audio feature and at least one initial cluster center matrix, the preset audio feature corresponding to each target cluster is used to represent the speaker identity of the speaker corresponding to that target cluster, and the at least one initial cluster center matrix corresponding to each target cluster is used to represent the spatial position of that speaker.
  • The process of performing the second clustering on the preset audio features of the source signal corresponding to each initial cluster to obtain the H target clusters may be: performing the second clustering on the preset audio features of the source signal corresponding to each initial cluster to obtain at least one target cluster corresponding to each initial cluster, and obtaining the H target clusters according to the at least one target cluster corresponding to each initial cluster.
  • Specifically, the preset audio features of each audio frame of the source signal corresponding to each initial cluster are composed into a feature vector that serves as one sample, so that several samples are obtained for the source signal corresponding to each initial cluster. These samples are clustered, samples with similar audio features are gathered into one class, and the target cluster(s) corresponding to that initial cluster are obtained.
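  • A minimal sketch of this second clustering, assuming per-frame feature vectors (for example MFCCs) and an agglomerative clustering with a distance threshold; both the feature layout and the threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def second_clustering(frame_features, distance_threshold=12.0):
    """frame_features: (num_frames, feature_dim) audio features of one initial cluster's source signal.
    Returns per-frame labels and the number of target clusters found, so an initial cluster
    can later be split (several speakers) or merged with another one (same speaker)."""
    agg = AgglomerativeClustering(n_clusters=None,
                                  distance_threshold=distance_threshold).fit(frame_features)
    return agg.labels_, agg.n_clusters_

feats = np.vstack([np.random.randn(40, 13), np.random.randn(40, 13) + 8.0])  # two synthetic "speakers"
labels, n_clusters = second_clustering(feats)
print(n_clusters)
```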
  • If the source signal corresponding to an initial cluster is the audio signal of a single speaker, the corresponding samples converge to one target cluster center after several clustering iterations; the target cluster center is expressed in the form of a feature vector and represents that speaker's identity information (audio features).
  • If the source signal corresponding to an initial cluster corresponds to multiple speakers, the samples of that source signal correspond to multiple target cluster centers, each representing the identity information of one speaker, so the source signal corresponding to that initial cluster is split into multiple target clusters.
  • Conversely, if the speakers corresponding to, for example, a first source signal and a second source signal are the same speaker, the target cluster centers corresponding to the two source signals are the same target cluster center or the two cluster centers are similar, so the two initial clusters corresponding to the two source signals are regarded as the same target cluster. Since the second clustering is performed on the basis of the first clustering, the target cluster center obtained by the second clustering contains both the spatial position information obtained from the first clustering and the audio features obtained from the second clustering.
  • In addition, each source signal is separated according to the speaker's spatial position; therefore, when a speaker moves, multiple source signals corresponding to that speaker are separated from the observation signals and correspond to different initial clusters.
  • For example, a speaker speaks at position W_1 during the time period from 0 to t_1 and at position W_2 during the time period from t_2 to t_3, where t_3 > t_2 > t_1. It is determined that the source signals corresponding to the speaker at W_1 and W_2 are s_1 and s_2, respectively, where s_1 corresponds to initial cluster A and s_2 corresponds to initial cluster B.
  • Because the preset audio features in 0 to t_1 are consistent with the preset audio features in t_2 to t_3, it can be determined after the second clustering that s_1 and s_2 correspond to the same target cluster center. Since t_2 > t_1, it can be determined that s_2 is the audio signal generated after the speaker walked to position W_2, so the two initial clusters A and B can be merged into one target cluster. The target cluster center of that target cluster therefore contains the spatial positions W_1 and W_2 obtained from the first clustering and the speaker's preset audio features obtained from the second clustering.
  • For another example, a source signal s_3 corresponding to position W_3 is separated based on the separation matrix, and the source signal s_3 includes the audio signals of both speaker A and speaker B.
  • In practice, it is impossible for speaker A and speaker B to keep talking at the same position all the time: for example, speaker A speaks at position W_3 during 0 to t_1 while speaker B does not speak, and speaker B speaks at position W_3 during t_2 to t_3. Because different speakers are speaking in these two time periods, the preset audio features in the two time periods are inconsistent.
  • After the second clustering, the source signal therefore corresponds to two target cluster centers: the first target cluster center contains the position information W_3 obtained from the first clustering and the audio features of speaker A obtained from the second clustering, and the second target cluster center contains the position information W_3 obtained from the first clustering and the audio features of speaker B obtained from the second clustering.
  • In a possible implementation, before the second clustering is performed on the preset audio features of the source signal corresponding to each initial cluster, the method further includes: performing human-voice analysis on each source signal to remove, from the M source signals, the source signals that are not human voice.
  • The process of performing human-voice analysis on each source signal may be: comparing the preset audio features of each audio frame of each source signal with the audio features of the human voice to determine whether each source signal contains a human voice.
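  • The comparison criterion for human-voice analysis is not detailed here; a crude per-frame energy and zero-crossing-rate screen, given only as an assumed illustration of removing non-voice source signals, might look like:

```python
import numpy as np

def is_human_voice(zcr, energy, zcr_max=0.3, energy_ratio=0.1, min_voiced_fraction=0.2):
    """Keep a source signal as 'voice' if enough frames are energetic with a moderate
    zero-crossing rate (all thresholds are assumptions, not values from the application)."""
    voiced = (energy > energy_ratio * energy.max()) & (zcr < zcr_max)
    return voiced.mean() >= min_voiced_fraction

# zcr and energy per frame, e.g. from the frame_features() sketch above.
zcr = np.random.uniform(0.0, 0.5, size=200)
energy = np.random.uniform(0.0, 1.0, size=200)
print(is_human_voice(zcr, energy))
```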
  • Step 205: The audio processing device obtains output audio including a first speaker tag according to the number of speakers and the speaker identities corresponding to the N observation signals, where the first speaker tag is used to label the number of speakers corresponding to each audio frame in the output audio.
  • The step of obtaining the output audio including the first speaker tag according to the number of speakers and the speaker identities corresponding to the N observation signals includes: determining K distances, where each distance is the distance between the spatial feature matrix corresponding to a first audio frame group and the at least one initial cluster center matrix corresponding to a target cluster, each first audio frame group is composed of the N audio frames of the N observation signals in the same time window, and K ≥ H; determining, according to the K distances, the number of speakers corresponding to each first audio frame group, that is, determining, among the H distances corresponding to that first audio frame group, L distances greater than the distance threshold and using L as the number of speakers corresponding to that first audio frame group; then determining the time window corresponding to that first audio frame group and labeling the number of speakers of the audio frames of the output audio in that time window as L; and finally determining, in turn, the number of speakers corresponding to each first audio frame group, thereby obtaining the first speaker tag.
  • the distance threshold may be 80%, 90%, 95% or other values.
  • the audio frame of the output audio in each time window may include multiple channels of audio, or may be a mixed audio of the multiple channels of audio.
  • For example, if speaker A and speaker B speak at the same time from 0 to t_1 and are located at different spatial positions, a first speaking audio of speaker A within 0 to t_1 is extracted from the source signal corresponding to speaker A, and a second speaking audio of speaker B within 0 to t_1 is likewise extracted from the source signal corresponding to speaker B. The output audio may retain the first speaking audio and the second speaking audio separately, that is, the output audio corresponds to two channels of speaking audio within 0 to t_1 and is labeled as having two speakers speaking at the same time within 0 to t_1; alternatively, the output audio within 0 to t_1 may correspond to one channel of mixed audio of the first speaking audio and the second speaking audio, likewise labeled as having two speakers speaking at the same time within 0 to t_1.
  • The embodiment of this application is a speaker segmentation method for a multi-microphone system. It introduces a spatial feature matrix and preset audio features and confirms speakers through the spatial feature matrix, the preset audio features, and the separation matrices, so speaker segmentation can be achieved without knowing the arrangement of the microphone array in advance, which solves the prior-art problem that device aging reduces segmentation accuracy.
  • In addition, performing a second clustering based on audio features allows an initial cluster containing speakers at similar angles to be split into two target clusters, and allows two initial clusters generated by a speaker's movement to be merged into one target cluster, which solves the prior-art problem of low speaker segmentation accuracy.
  • FIG. 3 is a schematic flowchart of an audio signal processing method provided by an embodiment of the application. This method may include but is not limited to the following steps:
  • Step 301: The audio processing device receives N observation signals collected by the microphone array, and performs blind source separation on the N observation signals to obtain M source signals and M separation matrices, where the M source signals correspond to the M separation matrices one to one, and both M and N are integers greater than or equal to 1.
  • Step 302 The audio processing device obtains a spatial characteristic matrix of the N observation signals, where the spatial characteristic matrix is used to represent the correlation between the N observation signals.
  • Step 303 The audio processing device obtains the preset audio characteristics of each source signal in the M source signals.
  • Step 304 The audio processing device determines the number of speakers and speaker identities corresponding to the N channels of observation signals according to the preset audio features of each source signal, the M separation matrices, and the spatial feature matrix.
  • Step 305: The audio processing device obtains output audio including a second speaker tag according to the number of speakers and the speaker identities corresponding to the N observation signals, where the second speaker tag is used to label the speaker identity corresponding to each audio frame in the output audio.
  • The step of obtaining the output audio including the second speaker tag according to the number of speakers and the speaker identities corresponding to the N observation signals includes: determining K distances, where each distance is the distance between the spatial feature matrix corresponding to a first audio frame group and the at least one initial cluster center matrix corresponding to a target cluster, each first audio frame group is composed of the N audio frames of the N observation signals in the same time window, and K ≥ H; determining, according to the K distances, the speaker identities corresponding to each first audio frame group, that is, determining, among the H distances corresponding to that first audio frame group, L distances greater than the distance threshold (L ≤ H), obtaining the L target clusters corresponding to the L distances, and regarding the L target clusters as the speaker identities corresponding to that first audio frame group; then determining the time window corresponding to that first audio frame group and determining that the speakers of the M source signals in that time window are the L target clusters; and finally determining, in turn, the speaker identities corresponding to each first audio frame group, thereby obtaining the second speaker tag.
  • the distance threshold may be 80%, 90%, 95% or other values.
  • the audio frame of the output audio in each time window may include multiple channels of audio, or may be a mixed audio of the multiple channels of audio.
  • For example, the output audio may retain the first speaking audio and the second speaking audio separately, or the output audio may correspond to one channel of mixed audio within 0 to t_1; in either case, the second speaker tag also labels that speaker A and speaker B are speaking at the same time within 0 to t_1.
  • The embodiment of this application is a speaker segmentation method for a multi-microphone system. It introduces a spatial feature matrix and preset audio features and confirms speakers through the spatial feature matrix, the preset audio features, and the separation matrices, so speaker segmentation can be achieved without knowing the arrangement of the microphone array in advance, which solves the prior-art problem that device aging reduces segmentation accuracy.
  • In addition, performing a second clustering based on audio features allows an initial cluster containing speakers at similar angles to be split into two target clusters, and allows two initial clusters generated by a speaker's movement to be merged into one target cluster, which solves the prior-art problem of low speaker segmentation accuracy.
  • FIG. 4 is a schematic flowchart of an audio signal processing method provided by an embodiment of the application. This method may include but is not limited to the following steps:
  • Step 401: The audio processing device receives N observation signals collected by the microphone array, and performs blind source separation on the N observation signals to obtain M source signals and M separation matrices, where the M source signals correspond to the M separation matrices one to one, and both M and N are integers greater than or equal to 1.
  • Step 402 The audio processing device obtains a spatial characteristic matrix of the N observation signals, where the spatial characteristic matrix is used to represent the correlation between the N observation signals.
  • Step 403 The audio processing device obtains the preset audio characteristics of each source signal in the M source signals.
  • Step 404 The audio processing device determines the number of speakers and speaker identities corresponding to the N channels of observation signals according to the preset audio features of each source signal, the M separation matrices, and the spatial feature matrix.
  • Step 405: The audio processing device obtains output audio including a third speaker tag according to the number of speakers and the speaker identities corresponding to the N observation signals, where the third speaker tag is used to mark the number of speakers and the speaker identity corresponding to each audio frame of the output audio.
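  • The five steps above can be prototyped with off-the-shelf signal-processing tools. The following sketch is purely illustrative and not the implementation of this application: FastICA stands in for the blind source separation of step 401, the spatial feature matrices of step 402 are normalized outer products of the per-channel spectra, frame energy stands in for the preset audio features of step 403, and two K-means passes stand in for the clustering of step 404; all function names, frame sizes, and cluster-count choices are assumptions.

```python
# Hypothetical end-to-end sketch of steps 401-405 (illustrative only, not the patented method).
import numpy as np
from scipy.signal import stft
from sklearn.cluster import KMeans
from sklearn.decomposition import FastICA


def normalized_outer(x):
    """Normalized outer product of one frequency bin's per-channel spectra."""
    outer = np.outer(x, x.conj())
    return outer / (np.linalg.norm(outer) + 1e-12)


def diarize(observations, fs, n_speakers_hint=2):
    """observations: (N, samples) multichannel recording from the microphone array."""
    n_ch = observations.shape[0]

    # Step 401: blind source separation (FastICA as a stand-in for the BSS step).
    ica = FastICA(n_components=n_ch, random_state=0)
    sources = ica.fit_transform(observations.T).T          # (M, samples)
    separation_matrices = ica.components_                  # rows act as separation filters

    # Step 402: spatial feature matrix per time window, averaged over frequency bins.
    _, _, X = stft(observations, fs=fs, nperseg=512)        # X: (N, freq, frames)
    spatial = np.array([
        np.mean([normalized_outer(X[:, k, t]) for k in range(X.shape[1])], axis=0)
        for t in range(X.shape[2])
    ])                                                      # (frames, N, N), complex

    # Step 403: preset audio features of each source signal (frame energy as a toy feature).
    frame_len = 512
    n_frames = sources.shape[1] // frame_len
    energy = np.array([[np.sum(s[i * frame_len:(i + 1) * frame_len] ** 2)
                        for i in range(n_frames)] for s in sources])   # (M, n_frames)

    # Step 404: first clustering on spatial features, second clustering on audio features.
    flat = spatial.reshape(len(spatial), -1)
    flat = np.concatenate([flat.real, flat.imag], axis=1)
    first = KMeans(n_clusters=n_ch, n_init=10, random_state=0).fit(flat)
    second = KMeans(n_clusters=n_speakers_hint, n_init=10, random_state=0).fit(energy.T)

    # Step 405: the per-window labels can now be attached to the output audio as speaker tags.
    return {"sources": sources, "separation_matrices": separation_matrices,
            "window_labels": first.labels_, "frame_speaker_labels": second.labels_}
```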
  • the step of obtaining the output audio including the third speaker tag according to the number of speakers and the speaker identities corresponding to the N observation signals includes: determining K distances, where each of the K distances is the distance between the spatial feature matrix corresponding to a first audio frame group and the at least one initial cluster center matrix corresponding to each target cluster, each first audio frame group consists of the N audio frames of the N observation signals within the same time window, and K ≥ H; determining the speaker identity corresponding to each first audio frame group according to the K distances, that is, determining the L distances that are greater than the distance threshold, L ≤ H, obtaining the L target clusters corresponding to these L distances, and taking the L target clusters as the speaker identities corresponding to the first audio frame group; then determining the time window corresponding to the first audio frame group, and determining that the speakers of the M source signals within that time window are the L target clusters; extracting, from the M source signals, the L audio frames corresponding to each first audio frame group, where the L audio frames are in the same time window as the first audio frame group; determining L similarities, where the L similarities are the similarities between the preset audio feature of each of the L audio frames and the preset audio feature corresponding to each of the L target clusters; determining, according to the L similarities, the target cluster corresponding to each of the L audio frames, that is, taking the target cluster with the maximum similarity as the target cluster of each audio frame, so that the number of speakers in the time window and the source audio frame of each speaker are determined; and finally obtaining the output audio containing the third speaker tag according to the target cluster corresponding to each audio frame. The spatial feature matrices are compared first to determine the number of speakers in each time window, and the speakers' audio features are then compared to determine the speaker corresponding to each source audio frame, which improves speaker segmentation accuracy.
  • the distance threshold may be 80%, 90%, 95% or other values.
  • For example, if speaker A and speaker B speak at the same time from 0 to t1 and are located at different spatial positions, target cluster A and target cluster B are first determined for 0 to t1 through the spatial feature matrices of the first audio frame groups; two source audio frames within 0 to t1 are then extracted from the M source signals, but it cannot be determined which source audio frame belongs to speaker A and which belongs to speaker B. Therefore, the preset audio features of the two source audio frames are compared with the preset audio features corresponding to target cluster A and target cluster B to obtain the similarities, and the target cluster corresponding to the maximum similarity is regarded as the speaker corresponding to each source audio frame.
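  • A minimal sketch of this two-step assignment, under assumed data structures (per-window spatial feature matrices, per-cluster lists of initial cluster center matrices, and per-cluster audio-feature centers; none of these names come from this application):

```python
import numpy as np


def matrix_similarity(a, b):
    """Cosine-style similarity between two (N, N) matrices."""
    return abs(np.vdot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)


def assign_window(spatial_mat, frame_feats, clusters, threshold=0.9):
    """
    spatial_mat: (N, N) spatial feature matrix of one first audio frame group.
    frame_feats: list of audio-feature vectors, one per source audio frame in this window.
    clusters:    list of dicts with 'centers' (list of (N, N) initial cluster center
                 matrices) and 'audio_feat' (the cluster's audio-feature center).
    Returns (candidate speaker clusters, per-frame cluster assignment).
    """
    # Step 1: shortlist the L candidate speakers from the window's spatial feature matrix.
    scores = [max(matrix_similarity(spatial_mat, c) for c in cl["centers"]) for cl in clusters]
    candidates = [i for i, s in enumerate(scores) if s >= threshold]
    if not candidates:                                   # keep at least the closest cluster
        candidates = [int(np.argmax(scores))]

    # Step 2: attribute each extracted source audio frame to one candidate speaker
    # by the similarity between its preset audio feature and the cluster's feature.
    assignment = []
    for feat in frame_feats:
        sims = [np.dot(feat, clusters[i]["audio_feat"]) /
                (np.linalg.norm(feat) * np.linalg.norm(clusters[i]["audio_feat"]) + 1e-12)
                for i in candidates]
        assignment.append(candidates[int(np.argmax(sims))])
    return candidates, assignment
```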
  • the step of obtaining the output audio containing the third speaker tag according to the number of speakers and the speaker identities corresponding to the N observation signals includes: determining H similarities, where the H similarities are the similarities between the preset audio feature of each audio frame in each second audio frame group and the preset audio feature of each target cluster center in the H target clusters, and each second audio frame group consists of the audio frames of the M source signals within the same time window; determining, according to the H similarities, the target cluster corresponding to each audio frame in each second audio frame group; and obtaining, according to the target cluster corresponding to each audio frame, the output audio containing the speaker tag, where the speaker tag is used to mark the number of speakers and/or the speaker identity of each audio frame in the output audio. Comparing speakers directly through the audio features speeds up speaker segmentation.
  • For example, if speaker A and speaker B speak at the same time from 0 to t1 and are located at different spatial positions, the two corresponding source audio frames within 0 to t1 can be extracted from the M source signals; however, it cannot be determined which source audio frame belongs to speaker A and which belongs to speaker B. The preset audio features of the two source audio frames are therefore compared directly with the H target clusters obtained after the second clustering, and the target cluster with the greatest similarity is regarded as the speaker corresponding to each source audio frame.
  • the audio frames of the output audio in each time window may include multiple channels of audio, or may be a mix of those channels. For example, if speaker A and speaker B speak at the same time from 0 to t1 and are located at different spatial positions, the first speaking audio of speaker A within 0 to t1 is extracted from the source signal corresponding to speaker A, and the second speaking audio of speaker B within 0 to t1 is likewise extracted from the source signal corresponding to speaker B. The first and second speaking audio may be kept as separate channels, that is, the output audio corresponds to two channels of speaking audio within 0 to t1, and the third speaker tag marks that speaker A and speaker B speak at the same time within 0 to t1. Since the speaker corresponding to each source audio frame has been determined, a separate play button can be provided when the audio of A and B is not mixed, and when the play button of speaker A is clicked, the speaking audio of A can be played on its own. Alternatively, the first and second speaking audio may be mixed, so that the output audio corresponds to one mixed audio within 0 to t1, and the third speaker tag likewise marks that speaker A and speaker B speak at the same time within 0 to t1.
  • It can be seen that this embodiment of the application is a speaker segmentation method for a multi-microphone system. It introduces a spatial feature matrix and preset audio features, and confirms the speakers through the spatial feature matrix, the preset audio features, and the separation matrices, so speaker segmentation can be achieved without knowing the arrangement information of the microphone array in advance, which solves the problem in the prior art that device aging reduces segmentation accuracy. Moreover, performing the second clustering based on audio features can split one initial cluster that contains speakers at similar angles into two target clusters, and can merge the two initial clusters produced by speaker movement into one target cluster, which solves the problem of low speaker segmentation accuracy in the prior art.
  • if the N observation signals are audio signals obtained within a first preset time period, the H cluster centers of the H target clusters corresponding to the N observation signals may be passed to the next time window and used as the initial clustering values for the observation signals obtained within a second preset time period, which realizes parameter sharing across the two time periods, speeds up clustering, and improves speaker segmentation efficiency.
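  • One way to realize such parameter sharing, sketched under the assumption that the clustering is K-means over vectorized features (an illustrative choice rather than anything mandated here), is to pass the cluster centers of the previous time period as the initialization of the next period's clustering:

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_with_warm_start(features, prev_centers=None, n_clusters=3):
    """features: (n_samples, d) vectorized features of the current preset time period."""
    if prev_centers is not None:
        # Reuse the H cluster centers of the previous time period as initial values.
        km = KMeans(n_clusters=len(prev_centers), init=prev_centers, n_init=1)
    else:
        km = KMeans(n_clusters=n_clusters, n_init=10)
    km.fit(features)
    return km.labels_, km.cluster_centers_   # hand the centers to the next time window

# usage:
# labels1, centers = cluster_with_warm_start(first_period_features)
# labels2, centers = cluster_with_warm_start(second_period_features, prev_centers=centers)
```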
  • the output audio and speaker tag may be presented in the following forms on the interface of the audio processing device.
  • FIG. 5A is a schematic diagram of displaying output audio on an interface according to an embodiment of the application.
  • the display mode shown in FIG. 5A corresponds to the speaker segmentation method described in FIG. 2A.
  • As shown in FIG. 5A, a first speaker tag is added to each audio frame of the output audio, and the number of speakers corresponding to each time window is marked by the first speaker tag. It is understandable that, if the audio of each speaker speaking separately is retained in the output audio, that is, the speakers' audio is not mixed before output, then when a time window of the output audio corresponds to multiple speakers, clicking the "Click" button next to the tag plays the independent audio signal of each speaker in that time window in turn. Of course, when the first speaker tag is added, it does not need to be attached to the output audio itself; the first speaker tag can be associated with the output audio and output together with it. Because the first speaker tag marks the number of speakers corresponding to each audio frame of the output audio, the number of speakers corresponding to each audio frame can be determined by reading the first speaker tag.
  • FIG. 5B is another schematic diagram of displaying output audio on the interface provided by an embodiment of the application.
  • the display mode shown in FIG. 5B corresponds to the speaker segmentation method described in FIG. 3. When the speaker identity corresponding to each audio frame of the output audio has been determined, a second speaker tag is added to the output audio frames to mark the speaker identity corresponding to each time window; as shown in FIG. 5B, the first audio frame and the third audio frame are marked as corresponding to speaker A. It is understandable that, if the audio of each speaker speaking separately is retained in the output audio and the speakers' audio is not mixed before output, then when a time window of the output audio corresponds to multiple speakers, clicking the "Click" button next to the tag plays the audio of each speaker in turn, although it cannot be determined which speaker each played audio frame belongs to. Of course, when the second speaker tag is added, it does not need to be attached to the output audio itself; the second speaker tag can be associated with the output audio and output together with it. Because the second speaker tag marks the speaker identity corresponding to each audio frame of the output audio, the speaker identity corresponding to each audio frame can be determined by reading the second speaker tag.
  • FIG. 5C is another schematic diagram of displaying output audio on the interface provided by an embodiment of the application.
  • the display mode shown in FIG. 5C corresponds to the speaker segmentation method described in FIG. 4. After the number of speakers and the speaker identity corresponding to each audio frame of the output audio have been determined, a third speaker tag is added to the output audio to mark the number of speakers and the speaker identities corresponding to each time window; moreover, the speakers' audio in the output audio is not mixed before output. When a time window of the output audio corresponds to multiple speakers, the identity of each speaker and that speaker's source signal within the time window can be determined; by analyzing all time windows of the output audio, all audio frames of each speaker in the output audio can be determined, and by clicking each speaker's "Click" button, the audio of each speaker can be played separately, which is helpful for generating meeting records. Of course, when the third speaker tag is added, it does not need to be attached to the output audio itself; the third speaker tag can be associated with the output audio and output together with it, and the number of speakers and the speaker identities corresponding to each time window of the output audio can be determined by reading the third speaker tag.
  • Referring to FIG. 6, an embodiment of this application provides an audio processing device 600, which may include:
  • the audio separation unit 610 is configured to receive N observation signals collected by the microphone array and perform blind source separation on the N observation signals to obtain M source signals and M separation matrices, where the M source signals are in one-to-one correspondence with the M separation matrices, N is an integer greater than or equal to 2, and M is an integer greater than or equal to 1;
  • the spatial feature extraction unit 620 is configured to obtain a spatial feature matrix of the N observation signals, where the spatial feature matrix is used to represent the correlation between the N observation signals;
  • the audio feature extraction unit 630 is configured to obtain the preset audio feature of each source signal in the M source signals;
  • the determining unit 640 is configured to determine the number of speakers and speaker identities corresponding to the N channels of observation signals according to the preset audio features of each source signal, the M separation matrices, and the spatial feature matrix.
  • It can be seen that the solution of this embodiment of the application is a speaker segmentation technology for a multi-microphone system. It introduces a spatial feature matrix and preset audio features, and clusters the speakers through the spatial feature matrix, the preset audio features, and the separation matrices, so speaker segmentation can be achieved without knowing the arrangement information of the microphone array in advance, which solves the problem in the prior art that device aging reduces segmentation accuracy; moreover, with the participation of audio features, scenes in which speakers are at similar angles or a speaker is moving can be recognized, further improving speaker segmentation accuracy.
  • the audio feature extraction unit 630, when acquiring the preset audio feature of each source signal in the M source signals, is specifically configured to: divide each source signal in the M source signals into Q audio frames, where Q is an integer greater than 1; and obtain the preset audio feature of each audio frame of each source signal.
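  • For illustration, assuming the preset audio features include the zero-crossing rate and the short-time energy of each frame (two of the features named in this application; the frame length and hop below are arbitrary illustrative values), the per-frame feature extraction might look like:

```python
import numpy as np


def frame_signal(x, frame_len=512, hop=256):
    """Split one source signal into Q (possibly overlapping) audio frames."""
    n_frames = max(1, 1 + (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])


def preset_audio_features(x, frame_len=512, hop=256):
    """Zero-crossing rate and short-time energy for every audio frame of one source signal."""
    frames = frame_signal(x, frame_len, hop)
    zcr = 0.5 * np.mean(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)
    energy = np.sum(frames.astype(np.float64) ** 2, axis=1)
    return np.column_stack([zcr, energy])      # shape (Q, 2): one feature row per audio frame
```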
  • the spatial feature extraction unit 620, when acquiring the spatial feature matrix of the N observation signals, is specifically configured to: divide each observation signal of the N observation signals into Q audio frames; determine the spatial feature matrix corresponding to each first audio frame group according to the N audio frames corresponding to each first audio frame group to obtain Q spatial feature matrices, where the N audio frames corresponding to each first audio frame group are the N audio frames of the N observation signals within the same time window; and obtain the spatial feature matrix of the N observation signals according to the Q spatial feature matrices;
  • where c F(k,n) = X F(k,n)·X FH(k,n) / ||X F(k,n)·X FH(k,n)||; c F(k,n) represents the spatial feature matrix corresponding to each first audio frame group, n represents the frame number among the Q audio frames, k represents the frequency-bin index of the n-th audio frame, X F(k,n) is the column vector composed of the frequency-domain representations of the k-th frequency bin of the n-th audio frame of each observation signal, X FH(k,n) is the transpose of X F(k,n), and n is an integer with 1 ≤ n ≤ Q.
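  • Read literally, c F(k,n) is the normalized outer product of the stacked per-channel spectra at frequency bin k of frame n. A minimal numerical sketch, assuming the N observation signals have already been transformed to the frequency domain by an STFT of the reader's choice (the conjugate transpose is used in the sketch, a common convention for complex spectra):

```python
import numpy as np


def spatial_feature(X_f):
    """
    X_f: (N, K) complex matrix: the K frequency bins of the n-th audio frame of each
         of the N observation signals (one row per microphone channel).
    Returns the (N, N) spatial feature matrix of this first audio frame group,
    averaged here over the K frequency bins of the frame.
    """
    mats = []
    for k in range(X_f.shape[1]):
        x = X_f[:, k][:, None]                 # column vector X_F(k, n)
        outer = x @ x.conj().T                 # X_F(k, n) X_F^H(k, n)
        mats.append(outer / (np.linalg.norm(outer) + 1e-12))
    return np.mean(mats, axis=0)
```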
  • the determining unit 640, when determining the number of speakers and the speaker identities corresponding to the N channels of observation signals according to the preset audio features of each source signal, the M separation matrices, and the spatial feature matrix, is specifically configured to: perform first clustering on the spatial feature matrix to obtain P initial clusters, where each initial cluster corresponds to one initial cluster center matrix, the initial cluster center matrix is used to represent the spatial position of the speaker corresponding to each initial cluster, and P is an integer greater than or equal to 1; determine M similarities, where the M similarities are the similarities between the initial cluster center matrix corresponding to each initial cluster and the M separation matrices; determine the source signal corresponding to each initial cluster according to the M similarities; and perform second clustering on the preset audio features of the source signal corresponding to each initial cluster to obtain the number of speakers and the speaker identities corresponding to the N observation signals.
  • the determining unit 640, when determining the source signal corresponding to each initial cluster according to the M similarities, is specifically configured to: determine the maximum similarity among the M similarities; determine the separation matrix corresponding to the maximum similarity among the M separation matrices as the target separation matrix; and determine the source signal corresponding to the target separation matrix as the source signal corresponding to each initial cluster.
  • the determining unit 640, when performing the second clustering on the preset audio features of the source signal corresponding to each initial cluster to obtain the number of speakers and the speaker identities corresponding to the N observation signals, is specifically configured to: perform the second clustering on the preset audio features of the source signal corresponding to each initial cluster to obtain H target clusters, where the H target clusters represent the number of speakers corresponding to the N channels of observation signals, each target cluster corresponds to one target cluster center, each target cluster center consists of one preset audio feature and at least one initial cluster center matrix, the preset audio feature corresponding to each target cluster is used to represent the speaker identity of the speaker corresponding to that target cluster, and the at least one initial cluster center matrix corresponding to each target cluster is used to represent the spatial position of the speaker.
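  • A compact, hypothetical sketch of this two-stage procedure (the choice of K-means, the cosine-style similarity used to match initial cluster centers to separation matrices, and the assumption that P and H are supplied rather than estimated are all simplifications for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans


def cosine(a, b):
    a, b = np.ravel(a), np.ravel(b)
    return abs(np.vdot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)


def determine_speakers(spatial_mats, separation_mats, source_feats, p_init, h_target):
    """
    spatial_mats:    (Q, N, N) complex spatial feature matrices, one per first audio frame group
    separation_mats: list of M (N, N) separation matrices from blind source separation
    source_feats:    list of M arrays, each (Q, d): preset audio features of one source signal
    p_init, h_target: number of initial clusters P and of target clusters H (supplied here)
    """
    q, n, _ = spatial_mats.shape

    # First clustering: group the spatial feature matrices into P initial clusters.
    flat = spatial_mats.reshape(q, -1)
    flat = np.concatenate([flat.real, flat.imag], axis=1)
    first = KMeans(n_clusters=p_init, n_init=10, random_state=0).fit(flat)
    centers = [(c[:n * n] + 1j * c[n * n:]).reshape(n, n) for c in first.cluster_centers_]

    # Match every initial cluster center matrix to its most similar separation matrix,
    # and hence to the source signal separated by that matrix.
    cluster_to_source = [int(np.argmax([cosine(c, w) for w in separation_mats])) for c in centers]

    # Second clustering: cluster the matched source signals' preset audio features
    # into H target clusters, i.e. the speaker identities.
    feats = np.vstack([source_feats[s] for s in cluster_to_source])
    second = KMeans(n_clusters=h_target, n_init=10, random_state=0).fit(feats)
    return first.labels_, cluster_to_source, second.cluster_centers_
```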
  • the audio processing device 600 further includes an audio segmentation unit 650;
  • the audio segmentation unit 650 is configured to obtain the output audio including the speaker tag according to the number of speakers corresponding to the N observation signals and the speaker identity.
  • the audio segmentation unit 650, when obtaining the output audio containing the speaker tag according to the number of speakers and the speaker identities corresponding to the N observation signals, is specifically configured to: determine K distances, where the K distances are the distances between the spatial feature matrix corresponding to each first audio frame group and the at least one initial cluster center matrix corresponding to each target cluster, each first audio frame group consists of the N audio frames of the N observation signals within the same time window, and K ≥ H; determine, according to the K distances, the L target clusters corresponding to each first audio frame group, L ≤ H; extract, from the M source signals, the L audio frames corresponding to each first audio frame group, where the L audio frames are in the same time window as each first audio frame group; determine L similarities, where the L similarities are the similarities between the preset audio feature of each of the L audio frames and the preset audio feature corresponding to each of the L target clusters; determine, according to the L similarities, the target cluster corresponding to each of the L audio frames; and obtain, according to the target cluster corresponding to each audio frame, the output audio containing the speaker tag, where the speaker tag is used to mark the number of speakers and/or the speaker identity of each audio frame in the output audio.
  • the audio segmentation unit 650, when obtaining the output audio containing the speaker tag according to the number of speakers and the speaker identities corresponding to the N observation signals, may alternatively be specifically configured to: determine H similarities, where the H similarities are the similarities between the preset audio feature of each audio frame in each second audio frame group and the preset audio feature of each target cluster center in the H target clusters, and each second audio frame group consists of the audio frames of the M source signals within the same time window; determine, according to the H similarities, the target cluster corresponding to each audio frame in each second audio frame group; and obtain, according to the target cluster corresponding to each audio frame, the output audio containing the speaker tag, where the speaker tag is used to mark the number of speakers and/or the speaker identity of each audio frame in the output audio.
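  • Purely as an organizational illustration, the units 610 to 650 can be pictured as the interface of a single class; the method bodies below are placeholders, not the logic of this application:

```python
class AudioProcessingDevice:
    """Illustrative skeleton only; the method bodies are placeholders, not this application's logic."""

    def separate(self, observations):
        """Audio separation unit 610: blind source separation into M source signals and matrices."""
        raise NotImplementedError

    def spatial_features(self, observations):
        """Spatial feature extraction unit 620: spatial feature matrix per first audio frame group."""
        raise NotImplementedError

    def audio_features(self, sources):
        """Audio feature extraction unit 630: preset audio features per frame of each source signal."""
        raise NotImplementedError

    def determine(self, audio_feats, separation_matrices, spatial_feats):
        """Determining unit 640: number of speakers and speaker identities."""
        raise NotImplementedError

    def segment(self, speaker_info):
        """Audio segmentation unit 650: output audio carrying speaker tags."""
        raise NotImplementedError
```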
  • Referring to FIG. 7, an embodiment of this application provides an audio processing device 700, including:
  • the processor 730, the communication interface 720, and the memory 710 are coupled to each other; for example, the processor 730, the communication interface 720, and the memory 710 are coupled through the bus 740.
  • the memory 710 may include, but is not limited to, a random access memory (RAM), an erasable programmable read-only memory (EPROM), a read-only memory (ROM), or a compact disc read-only memory (CD-ROM); the memory 710 is used for related instructions and data.
  • the processor 730 may be one or more central processing units (CPUs).
  • the CPU may be a single-core CPU or a multi-core CPU.
  • the processor 730 is configured to read the program code stored in the memory 710 and cooperate with the communication interface 720 to execute some or all of the steps of the method performed by the audio processing device in the foregoing embodiments of this application.
  • the communication interface 720 is used to receive N observation signals collected by the microphone array, and N is an integer greater than or equal to 2.
  • the processor 730 is configured to: perform blind source separation on the N observation signals to obtain M source signals and M separation matrices, where the M source signals are in one-to-one correspondence with the M separation matrices and M is an integer greater than or equal to 1; obtain the spatial feature matrix of the N observation signals, where the spatial feature matrix is used to represent the correlation between the N observation signals; obtain the preset audio feature of each source signal in the M source signals; and determine the number of speakers and the speaker identities corresponding to the N observation signals according to the preset audio features of each source signal, the M separation matrices, and the spatial feature matrix.
  • It can be seen that the solution of this embodiment of the application is a speaker segmentation technology for a multi-microphone system. It introduces a spatial feature matrix and preset audio features, and clusters the speakers through the spatial feature matrix, the preset audio features, and the separation matrices, so speaker segmentation can be achieved without knowing the arrangement information of the microphone array in advance, which solves the problem in the prior art that device aging reduces segmentation accuracy; moreover, with the participation of audio features, scenes in which speakers are at similar angles or a speaker is moving can be recognized, further improving speaker segmentation accuracy.
  • The foregoing embodiments may be implemented entirely or partly by software, hardware, firmware, or any combination thereof. When software is used for implementation, the embodiments may be implemented entirely or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are generated entirely or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, an optical disc), or a semiconductor medium (for example, a solid-state drive).
  • An embodiment of this application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program is executed by related hardware to perform any audio signal processing method provided in the embodiments of the present invention. In addition, an embodiment of this application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program is executed by related hardware to perform any method provided in the embodiments of the present invention.
  • the embodiments of the present application also provide a computer program product, wherein when the computer program product runs on a computer, the computer is caused to execute any audio signal processing method provided in the embodiments of the present invention.
  • the embodiments of the present application also provide a computer program product, which when the computer program product runs on a computer, causes the computer to execute any method provided in the embodiments of the present invention.
  • the disclosed device may also be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be another division manner in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the indirect couplings or direct couplings or communication connections displayed or discussed between one another may be implemented through some interfaces, and the indirect couplings or communication connections between the devices or units may be in electrical or other forms. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may also be implemented in the form of software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, for example, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Otolaryngology (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Health & Medical Sciences (AREA)

Abstract

An audio signal processing method and related products, the method including: receiving N observation signals collected by a microphone array, and performing blind source separation on the N observation signals to obtain M source signals and M separation matrices, where the M source signals are in one-to-one correspondence with the M separation matrices, N is an integer greater than or equal to 2, and M is an integer greater than or equal to 1 (S101); obtaining a spatial feature matrix of the N observation signals, where the spatial feature matrix is used to represent the correlation between the N observation signals (S102); obtaining a preset audio feature of each source signal in the M source signals (S103); and determining, according to the preset audio feature of each source signal, the M separation matrices, and the spatial feature matrix, the number of speakers and the speaker identities corresponding to the N observation signals (S104). This helps improve speaker segmentation accuracy.

Description

音频信号处理方法及相关产品
本申请要求在2019年4月30日提交中国国家知识产权局、申请号为201910369726.5的中国专利申请的优先权,发明名称为“音频信号处理方法及相关产品”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及音频信号处理技术领域,尤其涉及一种音频处理方法及相关产品。
背景技术
随着网络和通讯技术的发展,利用音视频技术、网络与通讯技术等可以实现复杂声学环境场景下的多人多方通话。在很多的应用场景中通话一方包含多人参加:比如大型的会议室。为了便于后期的文本和会议摘要的生成,通常对音频信号进行说话人分割(英文:Speaker Diarization),将整个音频信号分割成不同的片段,将说话人和音频片段对应标注出来,这样可以清楚知道在每个时刻的说话人,可快速生成会议摘要。
现有技术中,单麦克风的说话人分割技术,难以区分声音类似的说话人;多麦克风的说话人分割***,难以区分角度接近的说话人、受房间混响影响较大,分割精度低。所以现有技术中对说话人的分割精度低。
发明内容
本申请实施例提供了一种音频信号处理方法,有利于提高说话人分割的精度,进而便于会议室记录的生成,进而提高用户体验。
第一方面,本申请实施例提供了一种音频信号处理方法,包括:
接收麦克风阵列采集的N路观测信号,对所述N路观测信号进行盲源分离以得到M路源信号和M个分离矩阵,所述M路源信号和所述M个分离矩阵一一对应,N为大于或者等于2的整数,M为大于或者等于1的整数;
获取所述N路观测信号的空间特征矩阵,所述空间特征矩阵用于表示所述N路观测信号之间的相关性;
获取所述M路源信号中每路源信号的预设音频特征;
根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份。
可以看出,本申请实施例方案为多麦***下的说话人分割技术,引入了空间特征矩阵和预设音频特征,通过空间特征矩阵、预设音频特征以及分离矩阵进行说话人的聚类,无需提前知道麦克风阵列的排列信息,即可实现说话人分割,解决了现有技术中器件老化降低分割精度的问题,而且有音频特征的参与,可以识别出说话人角度相近以及说话人移动的场景,进一步提高说话人的分割精度。
在一些可能的实施方式中,所述获取所述M路源信号中每路源信号的预设音频特征,包括:将所述M路源信号中每路源信号分割为Q个音频帧,Q为大于1的整数;获取每路源信号的每个音频帧的预设音频特征。对源信号进行分割,以便后续利用预设音频特征进行聚类。
在一些可能的实施方式中,所述获取所述N路观测信号的空间特征矩阵,包括:将所述 N路观测信号中每路观测信号分割为Q个音频帧;根据每个音频帧组对应的N个音频帧确定每个第一音频帧组对应的空间特征矩阵,得到Q个空间特征矩阵,每个第一音频帧组对应的N个音频帧为所述N路观测信号在同一时间窗口下的N个音频帧;根据所述Q个空间特征矩阵得到所述N路观测信号的空间特征矩阵;
其中,
Figure PCTCN2020085800-appb-000001
c F(k,n)表示每个第一音频组对应的空间特征矩阵,n表示所述Q个音频帧的帧序号,k表示第n个音频帧的频点索引,X F(k,n)是由每路观测信号的第n个音频帧的第k个频点在频域中的表征组成的列向量,X FH(k,n)为X F(k,n)的转置,n为整数,1≤n≤Q。。可以看出,由于空间特征矩阵反映了说话人相对于麦克风的位置信息,所以通过引入空间特征矩阵,在无需提前知道麦克风阵列的排列信息时,即可确定出当前场景中有多少个位置存在说话人。
在一些可能的实施方式中,所述根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份,包括:对所述空间特征矩阵进行第一聚类,得到P个初始聚类,每个初始聚类对应一个初始聚类中心矩阵,所述初始聚类中心矩阵用于表示每个初始聚类对应的说话人的空间位置,P为大于或者等于1的整数;确定M个相似度,所述M个相似度为每个初始聚类对应的初始聚类中心矩阵与所述M个分离矩阵之间的相似度;根据所述M个相似度确定每个初始聚类对应的源信号;对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到所述N路观测信号对应的说话人数量和说话人身份。可以看出,先利用空间特征矩阵进行第一聚类,确定当前场景中说话人在哪些位置说话,得到说话人的预估数量,然后,再利用预设音频特征进行第二聚类,对第一聚类得到的初始聚类进行拆分或者合并,得到当前场景中说话人的真实数量,提高了说话人的分割精度。
在一些可能的实施方式中,所述根据所述M个相似度确定每个初始聚类对应的源信号,包括:确定所述M个相似度中的最大相似度,确定所述M个分离矩阵中与最大相似度对应的分离矩阵为目标分离矩阵;确定与所述目标分离矩阵对应的源信号为每个初始聚类对应的源信号。可以看出,通过空间特征矩阵进行第一聚类,确定当前场景中说话人在哪些位置说话,然后利用空间特征矩阵和分离矩阵的相似度确定出每个说话人对应的源信号,实现快速确定出每个说话人对应的的源信号。
在一些可能的实施方式中,所述对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到所述N路观测信号对应的说话人数量和说话人身份,包括:对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到H个目标聚类,所述H个目标聚类表示所述N路观测信号对应的说话人数量,每个目标聚类对应一个目标聚类中心,每个目标聚类中心是由一个预设音频特征和至少一个初始聚类中心矩阵组成,每个目标聚类对应的预设音频特征用于表示每个目标聚类对应的说话人的说话人身份,每个目标聚类对应的至少一个初始聚类中心矩阵用于表示所述说话人的空间位置。可以看出,利用每路源信号对应的预设音频特征进行聚类,将每路源信号对应的初始聚类进行拆分或者合并,得到所述M路源信号对应的目标聚类,将说话人移动而分离出的两路源信号聚为同一个目标聚类以及将角度相近的两个说话人拆分为两个目标聚类,将角度相近的两个说话人分割出来,提高了说话人的分割精度。
在一些可能的实施方式中,所述方法还包括:根据所述N路观测信号对应的说话人数量和说话人身份得到包含有说话人标签的输出音频。可以看出,基于聚类后的说话人身份和数 量,对音频信号切割,确定每个音频帧对应的说话人身份和数量,方便会议室环境下生成会议室摘要。
在一些可能的实施方式中,所述根据所述N路观测信号对应的说话人数量和说话人身份得到包含有说话人标签的输出音频,包括:确定K个距离,所述K个距离为每个第一音频帧组对应的空间特征矩阵与每个目标聚类对应的至少一个初始聚类中心矩阵的距离,每个第一音频帧组由所述N路观测信号在同一时间窗口下的N个音频帧组成,K≥H;根据所述K个距离确定每个第一音频帧组对应的L个目标聚类,L≤H;从所述M路源信号中提取与每个第一音频帧组对应的L个音频帧,所述L个音频帧与每个第一音频帧组所在时间窗口相同;确定L个相似度,所述L个相似度为所述L个音频帧中每个音频帧的预设音频特征与所述L个目标聚类中每个目标聚类对应的预设音频特征的相似度;根据所述L个相似度确定所述L个音频帧中每个音频帧对应的目标聚类;根据每个音频帧对应的目标聚类得到包含有说话人标签的输出音频,所述说话人标签用于标注所述输出音频中每个音频帧的说话人数量和/或说话人身份。可以看出,基于聚类后的说话人身份和数量,对音频信号进行切割和标注,先利用空间特征矩阵确定出每个音频帧组对应的说话人数量,然后,再利用源信号中每个音频帧的预设音频特征确定出每个说话人对应的源信号,通过对音频进行两步切割和标注,提高了说话人的切割精度。
在一些可能的实施方式中,所述根据所述N路观测信号对应的说话人数量和说话人身份得到包含有说话人标签的输出音频,包括:确定H个相似度,所述H个相似度为每个第二音频帧组中每个音频帧的预设音频特征与所述H个目标聚类中每个目标聚类中心的预设音频特征之间的相似度,所述每个第二音频帧组由所述M路源信号在同一时间窗口下的音频帧组成;根据所述H个相似度确定每个第二音频帧组中每个音频帧对应的目标聚类;根据每个音频帧对应的目标聚类得到包含有说话人标签的输出音频,所述说话人标签用于标注所述输出音频中每个音频帧的说话人数量和/或说话人身份。可以看出,通过音频特征直接进行音频的分割和标注,提高了说话人的分割速度。
第二方面,本申请实施例提供了一种音频处理装置,其特征在于,包括:
音频分离单元,用于接收麦克风阵列采集的N路观测信号,对所述N路观测信号进行盲源分离以得到M路源信号和M个分离矩阵,所述M路源信号和所述M个分离矩阵一一对应,N为大于或者等于2的整数,M为大于或者等于1的整数;
空间特征提取单元,用于获取所述N路观测信号的空间特征矩阵,所述空间特征矩阵用于表示所述N路观测信号之间的相关性;
音频特征提取单元,用于获取所述M路源信号中每路源信号的预设音频特征;
确定单元,用于根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份。
可以看出,本申请实施例方案为多麦***下的说话人分割技术,引入了空间特征矩阵和预设音频特征,通过空间特征矩阵、预设音频特征以及分离矩阵进行说话人的聚类,无需提前知道麦克风阵列的排列信息,即可实现说话人分割,解决了现有技术中器件老化降低分割精度的问题,而且有音频特征的参与,可以识别出说话人角度相近以及说话人移动的场景,进一步提高说话人的分割精度。
在一些可能的实施方式中,所述音频特征提取单元,在获取所述M路源信号中每路源信号的预设音频特征时,具体用于:将所述M路源信号中每路源信号分割为Q个音频帧,Q为 大于1的整数;获取每路源信号的每个音频帧的预设音频特征。
在一些可能的实施方式中,所述空间特征提取单元,在获取所述N路观测信号的空间特征矩阵时,具体用于:将所述N路观测信号中每路观测信号分割为Q个音频帧;根据每个音频帧组对应的N个音频帧确定每个第一音频帧组对应的空间特征矩阵,得到Q个空间特征矩阵,每个第一音频帧组对应的N个音频帧为所述N路观测信号在同一时间窗口下的N个音频帧;根据所述Q个空间特征矩阵得到所述N路观测信号的空间特征矩阵;
其中,
Figure PCTCN2020085800-appb-000002
c F(k,n)表示每个第一音频组对应的空间特征矩阵,n表示所述Q个音频帧的帧序号,k表示第n个音频帧的频点索引,X F(k,n)是由每路观测信号的第n个音频帧的第k个频点在频域中的表征组成的列向量,X FH(k,n)为X F(k,n)的转置,n为整数,1≤n≤Q。
在一些可能的实施方式中,所述确定单元,在根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份时,具体用于:对所述空间特征矩阵进行第一聚类,得到P个初始聚类,每个初始聚类对应一个初始聚类中心矩阵,所述初始聚类中心矩阵用于表示每个初始聚类对应的说话人的空间位置,P为大于或者等于1的整数;确定M个相似度,所述M个相似度为每个初始聚类对应的初始聚类中心矩阵与所述M个分离矩阵之间的相似度;根据所述M个相似度确定每个初始聚类对应的源信号;对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到所述N路观测信号对应的说话人数量和说话人身份。
在一些可能的实施方式中,所述确定单元,在根据所述M个相似度确定每个初始聚类对应的源信号时,具体用于:确定所述M个相似度中的最大相似度,确定所述M个分离矩阵中与最大相似度对应的分离矩阵为目标分离矩阵;确定与所述目标分离矩阵对应的源信号为每个初始聚类对应的源信号。
在一些可能的实施方式中,所述确定单元,在对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到所述N路观测信号对应的说话人数量和说话人身份时,具体用于:对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到H个目标聚类,所述H个目标聚类表示所述N路观测信号对应的说话人数量,每个目标聚类对应一个目标聚类中心,每个目标聚类中心是由一个预设音频特征和至少一个初始聚类中心矩阵组成,每个目标聚类对应的预设音频特征用于表示每个目标聚类对应的说话人的说话人身份,每个目标聚类对应的至少一个初始聚类中心矩阵用于表示所述说话人的空间位置。
在一些可能的实施方式中,所述装置还包括音频分割单元;
所述音频分割单元,用于根据所述N路观测信号对应的说话人数量和说话人身份得到包含有说话人标签的输出音频。
在一些可能的实施方式中,所述音频分割单元,在根据所述N路观测信号对应的说话人数量和说话人身份得到包含有说话人标签的输出音频时,具体用于:确定K个距离,所述K个距离为每个第一音频帧组对应的空间特征矩阵与每个目标聚类对应的至少一个初始聚类中心矩阵的距离,每个第一音频帧组由所述N路观测信号在同一时间窗口下的N个音频帧组成,K≥H;根据所述K个距离确定每个第一音频帧组对应的L个目标聚类,L≤H;从所述M路源信号中提取与每个第一音频帧组对应的L个音频帧,所述L个音频帧与每个第一音频帧组所在时间窗口相同;确定L个相似度,所述L个相似度为所述L个音频帧中每个音频帧的预设 音频特征与所述L个目标聚类中每个目标聚类对应的预设音频特征的相似度;根据所述L个相似度确定所述L个音频帧中每个音频帧对应的目标聚类;根据每个音频帧对应的目标聚类得到包含有说话人标签的输出音频,所述说话人标签用于标注所述输出音频中每个音频帧的说话人数量和/或说话人身份。
在一些可能的实施方式中,所述音频分割单元,在根据所述N路观测信号对应的说话人数量和说话人身份得到包含有说话人标签的输出音频时,具体用于:确定H个相似度,所述H个相似度为每个第二音频帧组中每个音频帧的预设音频特征与所述H个目标聚类中每个目标聚类中心的预设音频特征之间的相似度,所述每个第二音频帧组由所述M路源信号在同一时间窗口下的音频帧组成;根据所述H个相似度确定每个第二音频帧组中每个音频帧对应的目标聚类;根据每个音频帧对应的目标聚类得到包含有说话人标签的输出音频,所述说话人标签用于标注所述输出音频中每个音频帧的说话人数量和/或说话人身份。
第三方面、本申请实施例提供一种音频处理装置,其特征在于,包括:
相互耦合的处理器、通信接口和存储器;
其中,所述通信接口,用于收麦克风阵列采集的N路观测信号,N为大于或者等于2的整数;
所述处理器,用于对所述N路观测信号进行盲源分离以得到M路源信号和M个分离矩阵,所述M路源信号和所述M个分离矩阵一一对应,M为大于或者等于1的整数;获取所述N路观测信号的空间特征矩阵,所述空间特征矩阵用于表示所述N路观测信号之间的相关性;获取所述M路源信号中每路源信号的预设音频特征;根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份。
第四方面,本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被硬件(例如处理器等)执行,以本申请实施例中由音频处理装置执行的任意一种方法的部分或全部步骤。
第五方面,本申请实施例提供了一种包括指令的计算机程序产品,当所述计算机程序产品在音频处理装置上运行时,使得所述音频处理装置执行以上各方面的音频信号处理方法的部分或全部步骤。
附图说明
图1A为本申请实施例提供的一种的流程示意图;
图1B为本申请实施例提供的一种音频信号处理方法的流程示意图
图2A为本申请实施例提供的另一种音频信号处理方法的流程示意图;
图2B为本申请实施例提供的一种频点在频域中的表征的示意图;
图2C为本申请实施例提供的一种说话场景的示意图;
图2D为本申请实施例提供的另一种说话场景的示意图;
图3为本申请实施例提供的另一种音频信号处理方法的流程示意图;
图4为本申请实施例提供的另一种音频信号处理方法的流程示意图;
图5A为本申请实施例提供的一种在界面显示输出音频的示意图;
图5B为本申请实施例提供的另一种在界面显示输出音频的示意图;
图5C为本申请实施例提供的另一种在界面显示输出音频的的示意图;
图6为本申请实施例提供的一种音频处理装置的示意图;
图7为本申请实施例提供的一种音频处理装置的示意图。
具体实施方式
本申请的说明书和权利要求书及所述附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、***、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结果或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
下面先介绍一下盲源分离BSS(Blind Source Separation,简称:BSS)技术。
BSS技术主要解决“鸡尾酒会”问题,即从给定的混合信号中分离出每个人说话的独立信号。当有M个源信号时,通常假设观察信号也有M个,即假设麦克风阵列中有M个麦克风。举例来说,在一个房间的不同位置放着两个麦克风,同时有两个人说话,每个麦克风都能采集到两个人说话的音频信号,输出一路观测信号,假设两个麦克风输出的两个观测信号为x 1,x 2,两路源信号为s 1,s 2,则x 1,x 2分别由s 1、s 2混合而成,即x 1=a 11*s 1+a 12*s 2,x 2=a 21*s 1+a 22* s2,BSS技术主要解决从x 1,x 2中分离出s 1,s 2的问题。
当存在M路观测信号x 1,…,x M时,BSS技术主要解决如何从x 1,…,x M中分离出M路源信号s 1,…,s M、。由上述举例可知,X=AS,X=[x 1,…,x M],S=[s 1,…,s M],A为混合矩阵;假设有Y=WX,Y是对S的估计,W为分离矩阵,W通过自然梯度法得到。所以,在BSS时,先得到分离矩阵W,然后利用分离矩阵W对观测信号X进行分离,得到源信号S,其中,通过自然梯度法得到分离矩阵W。
现有技术中,单麦克风的说话人分割时,主要利用说话人的音频特征进行分割,无法分割说话类似的说话人(音频特征相似的说话人),分割精度低;多麦克风的说话人分割***,需要获取说话人的角度和位置,利用说话人的角度和位置对说话人进行分割,所以多麦克风的说话人分割***需要预先知道麦克风阵列的排列信息和空间位置信息,但是,随着器件的老化,导致麦克风阵列的排布信息和空间位置信息发生变化,导致分割精度降低,而且利用说话人的角度和位置对说话人进行分割,难以区分角度接近的说话人,且在分割时受房间混响影响较大,分割精度低。为了解决现有技术中说话人分割精度低的问题,特提出本申请中的音频信号处理方法,以期提高说话人的分割精度。
参阅图1A,图1A为一种音频信号处理方法的场景架构图,该场景架构图包括声源、麦克风阵列、音频处理装置,音频处理装置包括:空间特征提取模块、盲源分离模块、音频特征提取模块、第一聚类模块、第二聚类模块、音频分割模块;麦克风阵列用于采集说话人的说话音频,得到观测信号;空间特征提取模块用于确定观测信号对应的空间特征矩阵;盲源分离模块用于对观测信号进行盲源分离,得到源信号;第一聚类模块用于对空间特征矩阵进行第一聚类,得到初始聚类;音频特征提取模块用于对源信号进行特征提取,得到源信号对 应的预设音频特征;第二聚类模块用于根据源信号对应的预设音频特征以及初始聚类进行第二聚类,得到目标聚类;音频分割模块用于根据目标聚类对源信号进行音频分割,输出音频信号和说话人标签,所述说话人标签用于标注输出的音频信号中每个音频帧对应的说话人数量和/或说话人身份。
可以看出,本申请实施例方案为多麦***下的说话人分割技术,引入了空间特征矩阵和预设音频特征,通过空间特征矩阵、预设音频特征以及分离矩阵进行说话人的聚类,无需提前知道麦克风阵列的排列信息,即可实现说话人分割,解决了现有技术中器件老化降低分割精度的问题,而且有音频特征的参与,可以识别出说话人角度相近以及说话人移动的场景,进一步提高说话人的分割精度。
本申请实施例的技术方案可以基于图1A举例所示的场景架构图来具体实施。
参阅图1B,图1B为本申请实施例提供的一种音频信号处理方法的流程示意图,这种方法可包括但不限于如下步骤:
步骤101:音频处理装置接收麦克风阵列采集的N路观测信号,对所述N路观测信号进行盲源分离以得到M路源信号和M个分离矩阵,所述M路源信号和所述M个分离矩阵一一对应,N为大于或者等于2的整数,M为大于或者等于1的整数。
其中,对所述N路观测信号进行盲源分离包括时域分离法和频域分离法。
步骤102:音频处理装置获取所述N路观测信号的空间特征矩阵,所述空间特征矩阵用于表示所述N路观测信号之间的相关性。
其中,所述N路观测信号之间的相关性是由于说话人相对于麦克风的空间位置不同造成的,即空间特征矩阵反映了说话人的空间位置信息。
步骤103:音频处理装置获取所述M路源信号中每路源信号的预设音频特征。
其中,所述预设音频特征包括但不限于以下一种或几种:过零率ZCR、短时能量、基频、梅尔倒谱系数MFCC。
步骤104:音频处理装置根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份。
可以看出,在本申请实施例中,利用预设音频特征、分离矩阵和空间特征矩阵进行聚类,得到说话人的身份和数量,相比现有技术仅利用音频特征进行说话人分割,提高了说话人的分割精度;而且,本申请多麦的说话人分割技术,引入了空间特征矩阵,无需预先知道麦克风阵列的排列信息即可进行说话人分割,不会产生由于器件老化导致排列信息变化而降低分割精度的问题。
参阅图2A,图2A为本申请实施例提供的另一种音频信号处理方法的流程示意图,这种方法可包括但不限于如下步骤:
步骤201:音频处理装置接收麦克风阵列采集的N路观测信号,对所述N路观测信号进行盲源分离以得到M路源信号和M个分离矩阵,所述M路源信号和所述M个分离矩阵一一对应,N为大于或者等于2的整数,M为大于或者等于1的整数。
其中,所述N路观测信号为麦克风阵列在一段时间内采集到的音频信号。
在盲源分离时,如有D个源信号时,通常假设观察信号也有D个,以便确定混合矩阵为方阵,此时称麦克风阵列为标准的独立成分分析ICA模型,对于源信号的维数和麦克风阵列 的维数不同时,这种情况称非方阵ICA模型non-square ICA。本申请以标准的ICA模型为例做具体说明,即N=M。
可选的,通过时域法对所述N路观测信号进行盲源分离时,具体包括以下步骤:假设N路观测信号分别为x 1,x 2,…,x N;将N路观测信号组成输入信号X=[x 1,x 2,…,x N],假设经BSS后的输出信号为Y,Y=[s 1,s 2,…,s M],基于BSS技术可知:Y=XW,W为由M个分离矩阵组成的矩阵,假设W=[w 11,w 12,…w 1M,w 21,w 22,…w 2M,…,w M1,w M2,…,w MM],每M列w为一个分离矩阵,每个分离矩阵用于分离该N路观测信号,得到一路源信号,基于BSS从所述N观测信号中分离出M路源信号的分离公式为:
Figure PCTCN2020085800-appb-000003
其中,p为整数,1≤p≤M,
Figure PCTCN2020085800-appb-000004
为卷积运算。
可选的,通过频域法对对所述N路观测信号进行盲源分离时,上述分离公式变换为:
Figure PCTCN2020085800-appb-000005
其中,
Figure PCTCN2020085800-appb-000006
分别为频域下的输出信号、输入信号和分离矩阵。
步骤202:音频处理装置获取所述N路观测信号的空间特征矩阵,所述空间特征矩阵用于表示所述N路观测信号之间的相关性。
可选的,获取所述N路观测信号的空间特征矩阵的实现过程可以为:将所述N路观测信号中每路观测信号分割为Q个音频帧;
根据每个音频帧组对应的N个音频帧确定每个第一音频帧组对应的空间特征矩阵,得到Q个空间特征矩阵,每个第一音频帧组对应的N个音频帧为所述N路观测信号在同一时间窗口下的N个音频帧;
根据所述Q个空间特征矩阵得到所述N路观测信号的空间特征矩阵;
其中,
Figure PCTCN2020085800-appb-000007
c F(k,n)表示每个第一音频组对应的空间特征矩阵,n表示所述Q个音频帧的帧序号,k表示第n个音频帧的频点索引,X F(k,n)是由每路观测信号的第n个音频帧的第k个频点在频域中的表征组成的列向量,X FH(k,n)为X F(k,n)的转置,n为整数,1≤n≤Q,||X F(k,n)*X FH(k,n)||为X FH(k,n)为X F(k,n)的范数。
其中,空间特征矩阵中的对角线元素代表麦克风阵列中每个麦克风采集到的观测信号的能量,非对角线元素代表麦克风阵列中不同麦克风采集到的观测信号之间的相关性。例如,空间特征矩阵的对角线元素C 11代表了麦克风阵列中第1个麦克风采集到的观测信号的能量,非对角线元素C 12代表了麦克风阵列中第1个麦克风与第2个麦克风采集到的观测信号之间的相关性,该相关性是由于说话相对于第1个麦克风和第2个麦克风的空间位置不同造成的。所以通过空间特征矩阵可以反映出每个第一音频帧组对应的说话人的空间位置。
参阅图2B,图2B为本申请实施例提供的一种N路观测信号中每路观测信号在任意一个时间窗口下的音频帧在频域中的表征的示意图,假设每个音频帧中包含s个频点,则从图2B中可以看出N路观测信号中在该时间窗口下的所有第一个频点对应的列向量为[a 11+b 11*j,a 21+b 21*j,…a N1+b N1*j] T,将每个时间窗口对应的N个音频帧作为一个第一音频帧组,由于将每路观测信号分割为Q个音频帧,故可得到Q个第一音频帧组;获取图2B所示的时 间窗口下其他频点在频域中的表征,得到该时间窗口下的第一音频帧组对应的
Figure PCTCN2020085800-appb-000008
基于上述的空间特征矩阵的计算方法,分别计算每个第一音频组对应的空间特征矩阵,得到Q个空间特征矩阵,Q个空间特征矩阵按照其所在的时间窗口的先后顺序进行拼接,得到所述N路观测信号对应的空间特征矩阵。
步骤203:音频处理装置获取所述M路源信号中每路源信号的预设音频特征。
可选的,获取所述M路源信号中每路源信号的预设音频特征的步骤包括:将所述M路源信号中每路源信号分割为Q个音频帧;获取每路源信号的每个音频帧的预设音频特征。
其中,所述预设音频特征包括但不限于以下一种或几种:过零率ZCR、短时能量、基频、梅尔倒谱系数MFCC。
下面具体介绍获取过零率ZCR、短时能量的过程。
Figure PCTCN2020085800-appb-000009
其中,Z n为Q个音频帧的第n个音频帧对应的过零率,sgn[]为符号函数,N为第n帧音频帧的帧长,n为音频帧的帧索引。
Figure PCTCN2020085800-appb-000010
其中,E n为第n个音频帧的短时能量,N为第n个音频帧的帧长。
步骤204:音频处理装置根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份。
首先,根据所述空间特征矩阵进行第一聚类,得到P个初始聚类,每个初始聚类对应一个初始聚类中心矩阵,所述初始聚类中心矩阵用于表示每个初始聚类对应的说话人的空间位置,P为大于或者等于1的整数;确定M个相似度,所述M个相似度为每个初始聚类对应的初始聚类中心矩阵与所述M个分离矩阵之间的相似度;根据所述M个相似度确定每个初始聚类对应的源信号,对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到所述N路观测信号对应的说话人数量和/或说话人身份。
具体来讲,由于空间特征矩阵反映了说话人的空间位置,故利用每个第一音频组对应的空间特征矩阵作为样本数据,则得到Q个样本数据,利用该Q个样本数据进行第一聚类,将空间特征矩阵距离相近的聚为一类,得到一个初始聚类,每个初始聚类对应一个初始聚类中心矩阵,初始聚类中心矩阵表示说话人的空间位置,该初始聚类中心以空间特征矩阵的形式表示,在聚类完成后,得到P个初始聚类,确定出所述N路观测信号是由说话人在P个空间位置说话产生。
其中,第一聚类和第二聚类可以采用的聚类算法包括但不限于以下几种:最大期望聚类算法EM(英文:Expectation Maximization)、K-means聚类算法、层次聚类算法HAC(英文:Hierarchical Agglomerative Clustering)。
在一些可能实施方式中,由于分离矩阵代表了空间位置,故分离矩阵在一定程度上反映说话人的数量,所以在采用K-means算法进行第一聚类时,根据分离矩阵的数量估计初始聚 类的数量,将K-means算法中的k值赋值为分离矩阵的数量M,然后,预设与M个初始聚类对应的聚类中心进行第一聚类,通过分离矩阵的数量估计初始聚类的数量,减少迭代次数,加快聚类速度。
可选的,根据所述M个相似度确定每个初始聚类对应的源信号的步骤包括:根据所述M个相似度确定每个初始聚类对应的源信号确定所述M个相似度中的最大相似度,确定所述M个分离矩阵中与最大相似度对应的分离矩阵为目标分离矩阵;确定与所述目标分离矩阵对应的源信号为每个初始聚类对应的源信号。通过求取初始聚类中心与分离矩阵的相似度,确定出P个空间位置中每个空间位置对应的源信号,即确定出每个初始聚类对应的源信号。
可选的,对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到所述N路观测信号对应的说话人数量和/或说话人身份的实现过程可以为:对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到H个目标聚类,所述H个目标聚类表示所述N路观测信号对应的说话人数量,每个目标聚类对应一个目标聚类中心,每个目标聚类中心是由一个预设音频特征和至少一个初始聚类中心矩阵组成,每个目标聚类对应的预设音频特征用于表示每个目标聚类对应的说话人的说话人身份,每个目标聚类对应的至少一个初始聚类中心矩阵用于表示所述说话人的空间位置。
可选的,对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到H个目标聚类的实现过程可以为:对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到每个初始聚类对应的至少一个目标聚类;根据每个初始聚类对应的至少一个目标聚类,得到所述H个目标聚类
具体来讲,将每个初始聚类对应的源信号的每个音频帧的预设音频特征组成特征向量作为一个样本数据,得到与每个初始聚类对应的源信号对应的若干个样本数据,对该若干个样本数据进行聚类,将音频特征相似的样本数据聚为一类,得到该初始聚类对应的目标聚类,如每个初始聚类对应的源信号为一个说话人的音频信号,则在多次聚类迭代后该若干个样本数据对应一个目标聚类中心,该目标聚类中心以特征向量的形式表现,该目标聚类中心表示该说话人的的身份信息(音频特征),如每个初始聚类对应的源信号对应多个说话人,则在多次聚类迭代后该初始聚类对应的源信号的若干个样本数据对应多个目标聚类中心,每个目标聚类中心表示每个说话人的身份信息,故将该初始聚类对应的源信号拆分成了多个目标聚类;如第一路源信号和第二源信号对应的说话人为同一个说话人,在第二聚类后,两路源信号对应的目标聚类中心为同一个目标聚类中心或者两者的聚类中心相近,则将两路源信号对应的两个初始聚类作为同一个目标聚类;由于第二聚类是在第一聚类基础上进行的,所以第二聚类得到的目标聚类中心还包含第一聚类得到说话人的空间位置。
举例来说,如图2C所示,由于分离矩阵表示了说话人的空间位置信息,故每路源信号是根据说话人所在的空间位置分离得到的,当同一个说话人在不同位置说话时,在第一聚类时,会从观测信号中分离出与该说话人对应的多路源信号,且对应不同的初始聚类。如该说话人0~t 1时间段内在位置W 1说话,t 2~t 3时间段内在位置W 2说话,t 3>t 2>t 1,且确定出该说话人在W 1和W 2对应的源信号分别为s 1和s 2,如s 1对应初始聚类A,s 2对应初始聚类B,由于s 1和s 2对应同一个说话人,0~t 1内的预设音频特征与t 2~t 3内的预设音频特征一致,所以在第二聚类后,可确定s 1和s 2对应同一个目标聚类中心,由于t 2>t 1,可确定出s 2是由说话人走到位置W 2产生的音频信号,故可将两个初始聚类A和B合并为一个目标聚类,故该目标聚类的目标聚类中心包含了第一聚类得到的空间位置W 1和W 2以及第二聚类得到说话人的预 设音频特征。
再例如,如图2D所示,如说话人A和说话人B在同一个位置W 3说话,由于说话人的位置相同,则基于分离矩阵分离出一路与位置W 3对应的源信号s 3,但源信号s 3中同时包括了说话人A和说话人B的音频信号,一般来讲,说话人A和说话人B不可能在同一个位置一直保持同时说话,我们假设0~t 1时间段内,说话人A在位置W 3说话,说话人B未说话,t 2~t 3时间段内说话人B在位置W 3说话,由于这两个时间段内为不同的说话人在说话,所以,该两个时间段内的预设音频特征不一致,在进行第二聚类后,该路源信号会对应两个目标聚类中心,第一个目标聚类中心包含了第一聚类得到的位置信息W 3以及第二聚类得到说话人A的音频特征,第二个目标聚类中心包含了第一聚类得到的位置信息W 3以及第二聚类得到说话人B的音频特征。
可选的,在对每个初始聚类对应的源信号的预设音频特征进行第二聚类之前,所述方法还包括:对每路源信号进行人声分析,以移除该M路源信号中为非人声的源信号,其中,对每路源信号进行人声分析的实现过程可以为:将每路源信号的每个音频帧的预设音频特征与人声的音频特征进行比对,确定每路源信号中是否包含人声。
步骤205:音频处理装置根据所述N路观测信号对应的说话人数量和说话人身份输出包含有第一说话人标签的音频信号,所述第一说话人标签用于标注所述音频信号的每个帧音频帧对应的说话人数量。
可选的,根据所述N路观测信号对应的说话人数量和说话人身份得到包含有第一说话人标签的输出音频的步骤包括:确定K个距离,所述K个距离为每个第一音频帧组对应的空间特征矩阵与每个目标聚类对应的至少一个初始聚类中心矩阵的距离,每个第一音频帧组由所述N路观测信号在同一时间窗口下的N个音频帧组成,K≥H;根据所述K个距离确定每个第一音频帧组对应的说话人数量,即确定所述H个距离中大于距离阈值的L个距离,将L作为该第一音频帧组对应的说话人数量;然后,确定该第一音频帧组对应的时间窗口,将该输出音频在该时间窗口的音频帧的说话人数量标记为L;最后,依次确定每个第一音频帧组对应的说话人数量,从而得到该第一说话人标签。
其中,该距离阈值可以为80%、90%、95%或者其他值。
可选的,该输出音频在每个时间窗口下的音频帧可以包含多路音频,也可以为该多路音频的混合音频。举例来说,如在0~t 1为说话人A和说话人B同时说话,且说话人A和说话人B位于不同的空间位置,则从说话人A对应的源信号中提取出0~t 1内说话人A的第一说话音频,同样从说话人B对应的源信号中提取出0~t 1内说话人B的第二说话音频,可以单独保留第一说话音频和第二说话音频,即该输出音频在0~t 1内对应两路说话音频,且标注0~t 1有说话人2个人同时说话,在也可以将第一说话音频和第二说话音频,则该输出音频在0~t 1对应一路混合音频,同样标注0~t 1内有2个说话人同时说话。
可以看出,本申请实施例为多麦***下的说话人分割方法,引入了空间特征矩阵以及预设音频特征,通过空间特征矩阵、预设音频特征以及分离矩阵进行说话人的确认,无需提前知道麦克风阵列的排列信息,即可实现说话人分割,解决了现有技术中由于器件老化降低分割精度问题,而且基于音频特征进行第二聚类,可以将角度相近的说话人对应的一个初始聚类拆分为两个目标聚类,将由于说话人移动产生的两个初始聚类合并为一个目标聚类,解决了现有技术中说话人的分割精度低的问题。
参阅图3,图3为本申请实施例提供的一种音频信号处理方法的流程示意图,这种方法可包括但不限于如下步骤:
步骤301:音频处理装置接收麦克风阵列采集的N路观测信号,对所述N路观测信号进行盲源分离以得到M路源信号和M个分离矩阵,所述M路源信号和所述M个分离矩阵一一对应,M和N均为大于或者等于1的整数。
步骤302:音频处理装置获取所述N路观测信号的空间特征矩阵,所述空间特征矩阵用于表示所述N路观测信号之间的相关性。
步骤303:音频处理装置获取所述M路源信号中每路源信号的预设音频特征。
步骤304:音频处理装置根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份。
步骤305:音频处理装置根据所述N路观测信号对应的说话人数量和说话人身份得到包含有第二说话人标签的输出音频,所述第二说话人标签用于标注所述输出音频的每个帧音频帧对应的说话人身份。
可选的,根据所述N路观测信号对应的说话人数量和说话人身份得到包含有第二说话人标签的输出音频的步骤包括:确定K个距离,所述K个距离为每个第一音频帧组对应的空间特征矩阵与每个目标聚类对应的至少一个初始聚类中心矩阵的距离,每个第一音频帧组由所述N路观测信号在同一时间窗口下的N个音频帧组成,K≥H;根据所述K个距离确定每个第一音频帧组对应的说话人身份,即确定所述H个距离中大于距离阈值的L个距离,L≤H,获取与该L距离对应的L个目标聚类,将该L个目标聚类作为该第一音频帧组对应的说话人身份;然后,确定该第一音频帧组对应的时间窗口,确定该M路源信号在该时间窗口下的说话人为该L个目标聚类;最后,依次确定每个音频帧组对应的说话人数量,即确定出该M路源信号在每个时间窗口下的说话人数量,将该M路源信号在每个时间窗口下的音频帧组成该输出音频,并基于每个时间窗口下的说话人身份确定第二说话人标签,第二说话人标签则标注出该输出音频在每个时间窗口下的说话人身份。
其中,该距离阈值可以为80%、90%、95%或者其他值。
可选的,该输出音频在每个时间窗口下的音频帧可以包含多路音频,也可以为该多路音频的混合音频。举例来说,如在0~t 1为说话人A和说话人B同时说话,且说话人A和说话人B位于不同的空间位置,则从说话人A对应的源信号中提取出0~t 1内说话人A的第一说话音频,同样从说话人B对应的源信号中提取出0~t 1内说话人B的第二说话音频,可以单独保留第一说话音频和第二说话音频,即该输出音频在0~t 1内对应两路说话音频,且通过第二说话人标签标注0~t 1有说话人A和说话人B同时说话,在也可以将第一说话音频和第二说话音频,则该输出音频在0~t 1对应一路混合音频,同样通过第二说话人标签标注0~t 1有说话人A和说话人B同时说话。
可以看出,本申请实施例为多麦***下的说话人分割方法,引入了空间特征矩阵以及预设音频特征,通过空间特征矩阵、预设音频特征以及分离矩阵进行说话人的确认,无需提前知道麦克风阵列的排列信息,即可实现说话人分割,解决了现有技术中由于器件老化降低分割精度问题,而且基于音频特征进行第二聚类,可以将角度相近的说话人对应的一个初始聚类拆分为两个目标聚类,将由于说话人移动产生的两个初始聚类合并为一个目标聚类,解决了现有技术中说话人的分割精度低的问题。
参阅图4,图4为本申请实施例提供的一种音频信号处理方法的流程示意图,这种方法可包括但不限于如下步骤:
步骤401:音频处理装置接收麦克风阵列采集的N路观测信号,对所述N路观测信号进行盲源分离以得到M路源信号和M个分离矩阵,所述M路源信号和所述M个分离矩阵一一对应,M和N均为大于或者等于1的整数。
步骤402:音频处理装置获取所述N路观测信号的空间特征矩阵,所述空间特征矩阵用于表示所述N路观测信号之间的相关性。
步骤403:音频处理装置获取所述M路源信号中每路源信号的预设音频特征。
步骤404:音频处理装置根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份。
步骤405:音频处理装置根据所述N路观测信号对应的说话人数量和说话人身份得到包含有第三说话人标签的输出音频,所述第三说话人标签用于标注所述输出音频的每个帧音频帧对应的说话人数量和说话人身份。
可选的,根据所述N路观测信号对应的说话人数量和说话人身份得到包含有第三说话人标签的输出音频的步骤包括:确定K个距离,所述K个距离为每个第一音频帧组对应的空间特征矩阵与每个目标聚类对应的至少一个初始聚类中心矩阵的距离,每个第一音频帧组由所述N路观测信号在同一时间窗口下的N个音频帧组成,K≥H;根据所述K个距离确定每个第一音频帧组对应的说话人身份,即确定所述H个距离中大于距离阈值的L个距离,L≤H,获取与该L距离对应的L个目标聚类,将该L个目标聚类作为该第一音频帧组对应的说话人身份;然后,确定该第一音频帧组对应的时间窗口,确定该M路源信号在该时间窗口下的说话人为该L个目标聚类;从所述M路源信号中提取与每个第一音频帧组对应的L个音频帧,所述L个音频帧与每个第一音频帧组所在时间窗口相同;确定L个相似度,所述L个相似度为所述L个音频帧中每个音频帧的预设音频特征与所述L个目标聚类中每个目标聚类对应的预设音频特征的相似度;根据所述L个相似度确定所述L个音频帧中每个音频帧对应的目标聚类,即将该L个相似度中最大相似度对应的目标聚类作为每个音频帧的目标聚类,即确定该时间窗口下对应的说话人数量以及每个说话人对应的源音频帧;最后,根据每个音频帧对应的目标聚类得到包含有第三说话人标签的输出音频。先通过空间特征矩阵进行比对,确定出每个时间窗下的说话人数量,然后,在通过说话人的音频特征进行比对,确定每个源音频帧对应的说话人,提高了说话人的分割精度。
其中,该距离阈值可以为80%、90%、95%或者其他值。
举例来说,如在0~t 1为说话人A和说话人B同时说话,且说话人A和说话人B位于不同的空间位置,则通过第一音频组的空间特征矩阵确定出0~t 1内对应目标聚类A和目标聚类B,然后,从M路源信号中在0~t 1内提取出两路源音频帧,但是,无法确定出哪个源音频帧是说话人A的,哪个源音频帧是说话人B的,故将该两路源音频帧的预设音频特征分别与目标聚类A对应的预设音频特征进行比对,获取相似度,得到两个相似度,将相似度最大时对应的目标聚类作为每路源音频帧对应的说话人。
可选的,根据所述N路观测信号对应的说话人数量和说话人身份得到包含有第三说话人标签的输出音频的步骤包括:确定H个相似度,所述H个相似度为每个第二音频帧组中每个音频帧的预设音频特征与所述H个目标聚类中每个目标聚类中心的预设音频特征之间的相似度,所述每个第二音频帧组由所述M路源信号在同一时间窗口下的音频帧组成;根据所述H 个相似度确定每个第二音频帧组中每个音频帧对应的目标聚类;根据每个音频帧对应的目标聚类得到包含有说话人标签的输出音频,所述说话人标签用于标注所述输出音频中每个音频帧的说话人数量和/或说话人身份。通过音频特征直接进行说话人比对,加快了说话人分割速度。
举例来说,如在0~t 1为说话人A和说话人B同时说话,且说话人A和说话人B位于不同的空间位置,可从M路源信号中提取0~t 1内对应的两路源音频帧,但是,无法确定出哪个源音频帧是说话人A的,哪个源音频帧是说话人B的,然后,直接将该两路源音频帧的预设音频特征分别与第二聚类后得到的H个目标聚类进行比对,将相似度最大的目标聚类作为每路源音频帧对应的说话人。
可选的,该输出音频在每个时间窗口下的音频帧可以包含多路音频,也可以为该多路音频的混合音频。举例来说,如在0~t 1为说话人A和说话人B同时说话,且说话人A和说话人B位于不同的空间位置,则从说话人A对应的源信号中提取出0~t 1内说话人A的第一说话音频,同样从说话人B对应的源信号中提取出0~t 1内说话人B的第二说话音频,可以单独保留第一说话音频和第二说话音频,即该输出音频在0~t 1内对应两路说话音频,且通过第三说话人标签标注0~t 1有说话人A和说话人B同时说话,当然,由于确定出了每路源音频帧对应的说话人,在不将A和B的音频进行混合时,可设置单独播放按钮,在点击说话人A的播放按钮时,可单独播放A的说话音频;在也可以将第一说话音频和第二说话音频,则该输出音频在0~t 1对应一路混合音频,同样通过第二说话人标签标注0~t 1有说话人A和说话人B同时说话。
可以看出,本申请实施例为多麦***下的说话人分割方法,引入了空间特征矩阵以及预设音频特征,通过空间特征矩阵、预设音频特征以及分离矩阵进行说话人的确认,无需提前知道麦克风阵列的排列信息,即可实现说话人分割,解决了现有技术中由于器件老化降低分割精度问题,而且基于音频特征进行第二聚类,可以将角度相近的说话人对应的一个初始聚类拆分为两个目标聚类,将由于说话人移动产生的两个初始聚类合并为一个目标聚类,解决了现有技术中说话人的分割精度低的问题。
在一些可能实施方式中,如所述N路观测信号为在第一预设时间段内获得的音频信号,将所述N路观测信号对应的H个目标聚类的H个聚类中心输入到下一个时间窗口,将所述H个聚类中心作为第二预设时间内获得的观测信号的聚类初值,实现两个时间段内的参数共享,加快聚类速度,提高说话人分割效率。
在一些可能实施方式中,基于图2A、图3、图4所示的说话人分割方法,可在音频处理装置的界面以下几种形式呈现该输出音频和说话人标签。
可选的,图5A为本申请实施例提供的一种在界面显示输出音频的示意图,图5A所示的显示方式对应图2A中所述的说话人分割方法,如图5A所示,在输出音频的每个音频帧上添加第一说话人标签,通过第一说话人标签标注时间窗口对应的说话人数量。可以理解的是,如果输出音频中保留每个说话人单独说话的音频,即未对说话人的音频混合输出,当输出音频的一时间窗口对应的多个说话人时,通过点击标签旁的“点击”按钮,可依次播放该时间窗口下的每个说话人的独立音频信号。当然,在添加第一说话人标签时,无需将第一说话人标签添加到输出音频上,可将第一说话人标签和输出音频关联输出,该第一说话人标签标注了该输出音频中每个音频帧对应的说话人数量,可通过读取该第一说话人标签,确定出输出 音频中每个音频帧对应的说话人数量。
可选的,图5B为本申请实施例提供的另一种在界面显示输出音频的示意图,图5B所示的显示方式对应图3中所述的说话人分割方法,在确定出输出音频中每个音频帧对应的说话人身份时,在输出音频帧上添加第二说话人标签,标注每个时间窗口对应的说话人身份,如图5B所示,标记出第一个音频帧和第三个音频帧对应的说话人为为说话人A。可以理解的是,如果输出音频中保留每个说话人单独说话的音频,未对说话人的音频混合输出,当输出音频的一时间窗口对应的多个说话人时,点击标签旁的“点击”按钮,依次播放每个说话人的音频,但无法确定出每次播放的音频帧属于哪一个说话人。当然,在添加第二说话人标签时,无需将第二说话人标签添加到输出音频上,可将第二说话人标签和输出音频关联输出,该第一说话人标签标注了该输出音频中每个音频帧对应的说话人数量,可通过读取该第二说话人标签,确定出输出音频中每个音频帧对应的说话人身份。
可选的,图5C为本申请实施例提供的另一种在界面显示输出音频的示意图,图5C所示的显示方式对应图4中所述的说话人分割方法,在确定出输出音频中每个音频帧对应的说话人数量和说话人身份后,在该输出音频上添加第三说话人标签,标记每个时间窗口对应的说话人数量和说话人身份;而且,输出音频中在未对说话人的音频进行混合输出,当输出音频的一时间窗口对应的多个说话人时,可确定每个说话人的身份以及该说话人在该时间窗口下的源信号;对输出音频的所有时间窗口进行分析,可确定出每个说话人在该输出音频上对应的所有音频帧,通过点击每个说话人的“点击”按钮,则可单独播放每个人说话的音频,有利于生成会议记录。当然,在添加第三说话人标签时,无需将第三说话人标签添加到输出音频上,可将第三说话人标签和输出音频关联输出,通过读取该第一说话人标签,确定出输出音频中每个时间窗口对应的说话人数量和说话人身份。
参阅图6,本申请实施例提供了一种音频处理装置600,可包括:
音频分离单元610,用于接收麦克风阵列采集的N路观测信号,对所述N路观测信号进行盲源分离以得到M路源信号和M个分离矩阵,所述M路源信号和所述M个分离矩阵一一对应,N为大于或者等于2的整数,M为大于或者等于1的整数;
空间特征提取单元620,用于获取所述N路观测信号的空间特征矩阵,所述空间特征矩阵用于表示所述N路观测信号之间的相关性;
音频特征提取单元630,用于获取所述M路源信号中每路源信号的预设音频特征;
确定单元640,用于根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份。
可以看出,本申请实施例方案为多麦***下的说话人分割技术,引入了空间特征矩阵和预设音频特征,通过空间特征矩阵、预设音频特征以及分离矩阵进行说话人的聚类,无需提前知道麦克风阵列的排列信息,即可实现说话人分割,解决了现有技术中器件老化降低分割精度的问题,而且有音频特征的参与,可以识别出说话人角度相近以及说话人移动的场景,进一步提高说话人的分割精度。
在一些可能的实施方式中,音频特征提取单元630,在获取所述M路源信号中每路源信号的预设音频特征时,具体用于:将所述M路源信号中每路源信号分割为Q个音频帧,Q为大于1的整数;获取每路源信号的每个音频帧的预设音频特征。
在一些可能的实施方式中,空间特征提取单元620,在获取所述N路观测信号的空间特 征矩阵时,具体用于:将所述N路观测信号中每路观测信号分割为Q个音频帧;根据每个音频帧组对应的N个音频帧确定每个第一音频帧组对应的空间特征矩阵,得到Q个空间特征矩阵,每个第一音频帧组对应的N个音频帧为所述N路观测信号在同一时间窗口下的N个音频帧;根据所述Q个空间特征矩阵得到所述N路观测信号的空间特征矩阵;
其中,
Figure PCTCN2020085800-appb-000011
c F(k,n)表示每个第一音频组对应的空间特征矩阵,n表示所述Q个音频帧的帧序号,k表示第n个音频帧的频点索引,X F(k,n)是由每路观测信号的第n个音频帧的第k个频点在频域中的表征组成的列向量,X FH(k,n)为X F(k,n)的转置,n为整数,1≤n≤Q。。
在一些可能的实施方式中,确定单元640,在根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份时,具体用于:对所述空间特征矩阵进行第一聚类,得到P个初始聚类,每个初始聚类对应一个初始聚类中心矩阵,所述初始聚类中心矩阵用于表示每个初始聚类对应的说话人的空间位置,P为大于或者等于1的整数;确定M个相似度,所述M个相似度为每个初始聚类对应的初始聚类中心矩阵与所述M个分离矩阵之间的相似度;根据所述M个相似度确定每个初始聚类对应的源信号;对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到所述N路观测信号对应的说话人数量和说话人身份。
在一些可能的实施方式中,所述确定单元,在根据所述M个相似度确定每个初始聚类对应的源信号时,具体用于:确定所述M个相似度中的最大相似度,确定所述M个分离矩阵中与最大相似度对应的分离矩阵为目标分离矩阵;确定与所述目标分离矩阵对应的源信号为每个初始聚类对应的源信号。
在一些可能的实施方式中,确定单元640,在对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到所述N路观测信号对应的说话人数量和说话人身份时,具体用于:对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到H个目标聚类,所述H个目标聚类表示所述N路观测信号对应的说话人数量,每个目标聚类对应一个目标聚类中心,每个目标聚类中心是由一个预设音频特征和至少一个初始聚类中心矩阵组成,每个目标聚类对应的预设音频特征用于表示每个目标聚类对应的说话人的说话人身份,每个目标聚类对应的至少一个初始聚类中心矩阵用于表示所述说话人的空间位置。
在一些可能的实施方式中,音频处理装置100还包括音频分割单元650;
音频分割单元650,用于根据所述N路观测信号对应的说话人数量和说话人身份得到包含有说话人标签的输出音频。
在一些可能的实施方式中,音频分割单元650,在根据所述N路观测信号对应的说话人数量和说话人身份得到包含有说话人标签的输出音频时,具体用于:确定K个距离,所述K个距离为每个第一音频帧组对应的空间特征矩阵与每个目标聚类对应的至少一个初始聚类中心矩阵的距离,每个第一音频帧组由所述N路观测信号在同一时间窗口下的N个音频帧组成,K≥H;根据所述K个距离确定每个第一音频帧组对应的L个目标聚类,L≤H;从所述M路源信号中提取与每个第一音频帧组对应的L个音频帧,所述L个音频帧与每个第一音频帧组所在时间窗口相同;确定L个相似度,所述L个相似度为所述L个音频帧中每个音频帧的预设音频特征与所述L个目标聚类中每个目标聚类对应的预设音频特征的相似度;根据所述L个相似度确定所述L个音频帧中每个音频帧对应的目标聚类;根据每个音频帧对应的目标聚类 得到包含有说话人标签的输出音频,所述说话人标签用于标注所述输出音频中每个音频帧的说话人数量和/或说话人身份。
在一些可能的实施方式中,音频分割单元650,在根据所述N路观测信号对应的说话人数量和说话人身份得到包含有说话人标签的输出音频时,具体用于:确定H个相似度,所述H个相似度为每个第二音频帧组中每个音频帧的预设音频特征与所述H个目标聚类中每个目标聚类中心的预设音频特征之间的相似度,所述每个第二音频帧组由所述M路源信号在同一时间窗口下的音频帧组成;根据所述H个相似度确定每个第二音频帧组中每个音频帧对应的目标聚类;根据每个音频帧对应的目标聚类得到包含有说话人标签的输出音频,所述说话人标签用于标注所述输出音频中每个音频帧的说话人数量和/或说话人身份。
参见图7,本申请实施例提供了一种音频处理装置700,包括:
相互耦合的处理器730、通信接口720和存储器710;例如处理器730、通信接口720和存储器710通过总线740耦合。
存储器710可包括但不限于随机存储记忆体(Random Access Memory,RAM)、可擦除可编程只读存储器(Erasable Programmable ROM,EPROM)、只读存储器(Read-Only Memory,ROM)或便携式只读存储器(Compact Disc Read-Only Memory,CD-ROM)等等,该存储器810用于相关指令及数据。
处理器730可以是一个或多个中央处理器(Central Processing Unit,CPU),在处理器730是一个CPU的情况下,该CPU可以是单核CPU,也可以是多核CPU。
处理器730用于读取所述存储器710中存储的程序代码,与通信接口740配合执行本申请上述实施例中由音频处理装置执行的方法的部分或全部步骤。
举例来说,所述通信接口720用于收麦克风阵列采集的N路观测信号,N为大于或者等于2的整数。
所述处理器730,所述处理器,用于对所述N路观测信号进行盲源分离以得到M路源信号和M个分离矩阵,所述M路源信号和所述M个分离矩阵一一对应,M为大于或者等于1的整数;获取所述N路观测信号的空间特征矩阵,所述空间特征矩阵用于表示所述N路观测信号之间的相关性;获取所述M路源信号中每路源信号的预设音频特征;根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份。
可以看出,本申请实施例方案为多麦***下的说话人分割技术,引入了空间特征矩阵和预设音频特征,通过空间特征矩阵、预设音频特征以及分离矩阵进行说话人的聚类,无需提前知道麦克风阵列的排列信息,即可实现说话人分割,解决了现有技术中器件老化降低分割精度的问题,而且有音频特征的参与,可以识别出说话人角度相近以及说话人移动的场景,进一步提高说话人的分割精度。
在上述实施例中,可全部或部分地通过软件、硬件、固件、或其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一 个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线)或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如软盘、硬盘、磁带)、光介质(例如光盘)、或者半导体介质(例如固态硬盘)等。在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其中,所述计算机程序被相关硬件执行,以完成执行本发明实施例提供的任意一种音频信号处理方法。此外,本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被相关硬件执行,以完成执行本发明实施例提供的任意一种方法。
本申请实施例还提供一种计算机程序产品,其中,当所述计算机程序产品在计算机上运行时,使得所述计算机执行本发明实施例提供的任意一种音频信号处理方法。此外,本申请实施例还提供一种计算机程序产品,当所述计算机程序产品在计算机上运行时,使得所述计算机执行本发明实施例提供的任意一种方法。
在上述实施例中对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置,也可以通过其它的方式实现。例如以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可结合或者可以集成到另一个***,或一些特征可以忽略或不执行。另一点,所显示或讨论的相互之间的间接耦合或者直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者,也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例的方案的目的。
另外,在本申请各实施例中的各功能单元可集成在一个处理单元中,也可以是各单元单独物理存在,也可两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,或者也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质例如可包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或光盘等各种可存储程序代码的介质。

Claims (20)

  1. 一种音频信号处理方法,其特征在于,包括:
    接收麦克风阵列采集的N路观测信号,对所述N路观测信号进行盲源分离以得到M路源信号和M个分离矩阵,所述M路源信号和所述M个分离矩阵一一对应,N为大于或者等于2的整数,M为大于或者等于1的整数;
    获取所述N路观测信号的空间特征矩阵,所述空间特征矩阵用于表示所述N路观测信号之间的相关性;
    获取所述M路源信号中每路源信号的预设音频特征;
    根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份。
  2. 根据权利要求1所述的方法,其特征在于,所述获取所述M路源信号中每路源信号的预设音频特征,包括:
    将所述M路源信号中每路源信号分割为Q个音频帧,Q为大于1的整数;
    获取每路源信号的每个音频帧的预设音频特征。
  3. 根据权利要求1或2所述的方法,其特征在于,所述获取所述N路观测信号的空间特征矩阵,包括:
    将所述N路观测信号中每路观测信号分割为Q个音频帧;
    根据每个音频帧组对应的N个音频帧确定每个第一音频帧组对应的空间特征矩阵,得到Q个空间特征矩阵,每个第一音频帧组对应的N个音频帧为所述N路观测信号在同一时间窗口下的N个音频帧;
    根据所述Q个空间特征矩阵得到所述N路观测信号的空间特征矩阵;
    其中,
    Figure PCTCN2020085800-appb-100001
    c F(k,n)表示每个第一音频组对应的空间特征矩阵,n表示所述Q个音频帧的帧序号,k表示第n个音频帧的频点索引,X F(k,n)是由每路观测信号的第n个音频帧的第k个频点在频域中的表征组成的列向量,X FH(k,n)为X F(k,n)的转置,n为整数,1≤n≤Q。
  4. 根据权利要求1-3任一项所述的方法,其特征在于,所述根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份,包括:
    对所述空间特征矩阵进行第一聚类,得到P个初始聚类,每个初始聚类对应一个初始聚类中心矩阵,所述初始聚类中心矩阵用于表示每个初始聚类对应的说话人的空间位置,P为大于或者等于1的整数;
    确定M个相似度,所述M个相似度为每个初始聚类对应的初始聚类中心矩阵与所述M个分离矩阵之间的相似度;
    根据所述M个相似度确定每个初始聚类对应的源信号;
    对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到所述N路观测信号对应的说话人数量和说话人身份。
  5. 根据权利要求4所述的方法,其特征在于,所述根据所述M个相似度确定每个初始聚类对应的源信号,包括:
    确定所述M个相似度中的最大相似度,
    确定所述M个分离矩阵中与最大相似度对应的分离矩阵为目标分离矩阵;
    确定与所述目标分离矩阵对应的源信号为每个初始聚类对应的源信号。
  6. 根据权利要求4或5所述的方法,其特征在于,所述对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到所述N路观测信号对应的说话人数量和说话人身份,包括:
    对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到H个目标聚类,所述H个目标聚类表示所述N路观测信号对应的说话人数量,每个目标聚类对应一个目标聚类中心,每个目标聚类中心是由一个预设音频特征和至少一个初始聚类中心矩阵组成,每个目标聚类对应的预设音频特征用于表示每个目标聚类对应的说话人的说话人身份,每个目标聚类对应的至少一个初始聚类中心矩阵用于表示所述说话人的空间位置。
  7. 根据权利要求6所述的方法,其特征在于,所述方法还包括:
    根据所述N路观测信号对应的说话人数量和说话人身份得到包含有说话人标签的输出音频。
  8. 根据权利要求7所述的方法,其特征在于,所述根据所述N路观测信号对应的说话人数量和说话人身份得到包含有说话人标签的输出音频,包括:
    确定K个距离,所述K个距离为每个第一音频帧组对应的空间特征矩阵与每个目标聚类对应的至少一个初始聚类中心矩阵的距离,每个第一音频帧组由所述N路观测信号在同一时间窗口下的N个音频帧组成,K≥H;
    根据所述K个距离确定每个第一音频帧组对应的L个目标聚类,L≤H;
    从所述M路源信号中提取与每个第一音频帧组对应的L个音频帧,所述L个音频帧与每个第一音频帧组所在时间窗口相同;
    确定L个相似度,所述L个相似度为所述L个音频帧中每个音频帧的预设音频特征与所述L个目标聚类中每个目标聚类对应的预设音频特征的相似度;
    根据所述L个相似度确定所述L个音频帧中每个音频帧对应的目标聚类;
    根据每个音频帧对应的目标聚类得到包含有说话人标签的输出音频,所述说话人标签用于标注所述输出音频中每个音频帧的说话人数量和/或说话人身份。
  9. 根据权利要求7所述的方法,其特征在于,所述根据所述N路观测信号对应的说话人数量和说话人身份得到包含有说话人标签的输出音频,包括:
    确定H个相似度,所述H个相似度为每个第二音频帧组中每个音频帧的预设音频特征与所述H个目标聚类中每个目标聚类中心的预设音频特征之间的相似度,所述每个第二音频帧组由所述M路源信号在同一时间窗口下的音频帧组成;
    根据所述H个相似度确定每个第二音频帧组中每个音频帧对应的目标聚类;
    根据每个音频帧对应的目标聚类得到包含有说话人标签的输出音频,所述说话人标签用于标注所述输出音频中每个音频帧的说话人数量和/或说话人身份。
  10. 一种音频处理装置,其特征在于,包括:
    音频分离单元,用于接收麦克风阵列采集的N路观测信号,对所述N路观测信号进行盲源分离以得到M路源信号和M个分离矩阵,所述M路源信号和所述M个分离矩阵一一对应,N为大于或者等于2的整数,M为大于或者等于1的整数;
    空间特征提取单元,用于获取所述N路观测信号的空间特征矩阵,所述空间特征矩阵用于表示所述N路观测信号之间的相关性;
    音频特征提取单元,用于获取所述M路源信号中每路源信号的预设音频特征;
    确定单元,用于根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份。
  11. 根据权利要求10所述的装置,其特征在于,
    所述音频特征提取单元,在获取所述M路源信号中每路源信号的预设音频特征时,具体用于:将所述M路源信号中每路源信号分割为Q个音频帧,Q为大于1的整数;获取每路源信号的每个音频帧的预设音频特征。
  12. 根据权利要求10或11所述的装置,其特征在于,
    所述空间特征提取单元,在获取所述N路观测信号的空间特征矩阵时,具体用于:将所述N路观测信号中每路观测信号分割为Q个音频帧;根据每个音频帧组对应的N个音频帧确定每个第一音频帧组对应的空间特征矩阵,得到Q个空间特征矩阵,每个第一音频帧组对应的N个音频帧为所述N路观测信号在同一时间窗口下的N个音频帧;根据所述Q个空间特征矩阵得到所述N路观测信号的空间特征矩阵;
    其中,
    Figure PCTCN2020085800-appb-100002
    c F(k,n)表示每个第一音频组对应的空间特征矩阵,n表示所述Q个音频帧的帧序号,k表示第n个音频帧的频点索引,X F(k,n)是由每路观测信号的第n个音频帧的第k个频点在频域中的表征组成的列向量,X FH(k,n)为X F(k,n)的转置,n为整数,1≤n≤Q。
  13. 根据权利要求10-12任一项所述的装置,其特征在于,
    所述确定单元,在根据每路源信号的预设音频特征、所述M个分离矩阵和所述空间特征矩阵确定所述N路观测信号对应的说话人数量和说话人身份时,具体用于:对所述空间特征矩阵进行第一聚类,得到P个初始聚类,每个初始聚类对应一个初始聚类中心矩阵,所述初始聚类中心矩阵用于表示每个初始聚类对应的说话人的空间位置,P为大于或者等于1的整数;确定M个相似度,所述M个相似度为每个初始聚类对应的初始聚类中心矩阵与所述M个分离矩阵之间的相似度;根据所述M个相似度确定每个初始聚类对应的源信号;对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到所述N路观测信号对应的说话人数量和说话人身份。
  14. 根据权利要求13所述的装置,其特征在于,
    所述确定单元,在根据所述M个相似度确定每个初始聚类对应的源信号时,具体用于:确定所述M个相似度中的最大相似度,确定所述M个分离矩阵中与最大相似度对应的分离矩阵为目标分离矩阵;确定与所述目标分离矩阵对应的源信号为每个初始聚类对应的源信号。
  15. 根据权利要求13或14所述的装置,其特征在于,
    所述确定单元,在对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到所述N路观测信号对应的说话人数量和说话人身份时,具体用于:对每个初始聚类对应的源信号的预设音频特征进行第二聚类,得到H个目标聚类,所述H个目标聚类表示所述N路观测信号对应的说话人数量,每个目标聚类对应一个目标聚类中心,每个目标聚类中心是由一个预设音频特征和至少一个初始聚类中心矩阵组成,每个目标聚类对应的预设音频特征用于表示每个目标聚类对应的说话人的说话人身份,每个目标聚类对应的至少一个初始聚类中心矩阵用于表示所述说话人的空间位置。
  16. The apparatus according to claim 15, wherein
    the apparatus further comprises an audio segmentation unit; and
    the audio segmentation unit is configured to obtain, according to the number of speakers and speaker identities corresponding to the N observed signals, output audio that contains speaker labels.
  17. The apparatus according to claim 16, wherein
    the audio segmentation unit, when obtaining, according to the number of speakers and speaker identities corresponding to the N observed signals, the output audio that contains speaker labels, is specifically configured to: determine K distances, wherein the K distances are distances between the spatial feature matrix corresponding to each first audio frame group and the at least one initial cluster center matrix corresponding to each target cluster, each first audio frame group is composed of the N audio frames of the N observed signals in a same time window, and K ≥ H; determine, according to the K distances, L target clusters corresponding to each first audio frame group, wherein L ≤ H; extract, from the M source signals, L audio frames corresponding to each first audio frame group, wherein the L audio frames are in the same time window as each first audio frame group; determine L similarities, wherein the L similarities are similarities between the preset audio feature of each of the L audio frames and the preset audio feature corresponding to each of the L target clusters; determine, according to the L similarities, the target cluster corresponding to each of the L audio frames; and obtain, according to the target cluster corresponding to each audio frame, the output audio that contains the speaker labels, wherein the speaker labels are used to annotate the number of speakers and/or the speaker identity of each audio frame in the output audio.
  18. The apparatus according to claim 16, wherein
    the audio segmentation unit, when obtaining, according to the number of speakers and speaker identities corresponding to the N observed signals, the output audio that contains speaker labels, is specifically configured to: determine H similarities, wherein the H similarities are similarities between the preset audio feature of each audio frame in each second audio frame group and the preset audio feature of each of the H target cluster centers, and each second audio frame group is composed of the audio frames of the M source signals in a same time window; determine, according to the H similarities, the target cluster corresponding to each audio frame in each second audio frame group; and obtain, according to the target cluster corresponding to each audio frame, the output audio that contains the speaker labels, wherein the speaker labels are used to annotate the number of speakers and/or the speaker identity of each audio frame in the output audio.
  19. An audio processing apparatus, comprising:
    a processor, a communication interface, and a memory that are coupled to one another;
    wherein the communication interface is configured to receive N observed signals collected by a microphone array, and N is an integer greater than or equal to 2; and
    the processor is configured to: perform blind source separation on the N observed signals to obtain M source signals and M separation matrices, wherein the M source signals are in one-to-one correspondence with the M separation matrices, and M is an integer greater than or equal to 1; obtain a spatial feature matrix of the N observed signals, wherein the spatial feature matrix is used to represent the correlation among the N observed signals; obtain a preset audio feature of each of the M source signals; and determine, according to the preset audio feature of each source signal, the M separation matrices, and the spatial feature matrix, the number of speakers and speaker identities corresponding to the N observed signals.
  20. A computer-readable storage medium, storing a computer program, wherein the computer program is executed by hardware to implement the method performed by the audio processing apparatus according to any one of claims 1 to 9.
PCT/CN2020/085800 2019-04-30 2020-04-21 Audio signal processing method and related product WO2020221059A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20799277.7A EP3944238B1 (en) 2019-04-30 2020-04-21 Audio signal processing method and related product
US17/605,121 US20220199099A1 (en) 2019-04-30 2020-04-21 Audio Signal Processing Method and Related Product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910369726.5A CN110111808B (zh) 2019-04-30 2019-04-30 Audio signal processing method and related product
CN201910369726.5 2019-04-30

Publications (1)

Publication Number Publication Date
WO2020221059A1 true WO2020221059A1 (zh) 2020-11-05

Family

ID=67488086

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/085800 WO2020221059A1 (zh) 2019-04-30 2020-04-21 Audio signal processing method and related product

Country Status (4)

Country Link
US (1) US20220199099A1 (zh)
EP (1) EP3944238B1 (zh)
CN (1) CN110111808B (zh)
WO (1) WO2020221059A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992181A (zh) * 2021-02-08 2021-06-18 上海哔哩哔哩科技有限公司 Audio classification method and apparatus
US11842747B2 (en) 2021-10-22 2023-12-12 International Business Machines Corporation Calculating numbers of clusters in data sets using eigen response analysis

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110111808B (zh) 2019-04-30 2021-06-15 华为技术有限公司 Audio signal processing method and related product
CN110491412B (zh) 2019-08-23 2022-02-25 北京市商汤科技开发有限公司 Sound separation method and apparatus, and electronic device
CN110930984A (zh) 2019-12-04 2020-03-27 北京搜狗科技发展有限公司 Speech processing method and apparatus, and electronic device
CN111883168B (zh) 2020-08-04 2023-12-22 上海明略人工智能(集团)有限公司 Speech processing method and apparatus

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007156300A (ja) * 2005-12-08 2007-06-21 Kobe Steel Ltd Sound source separation apparatus, sound source separation program, and sound source separation method
JP2012173584A (ja) * 2011-02-23 2012-09-10 Nippon Telegr & Teleph Corp <Ntt> Sound source separation apparatus, method, and program therefor
CN105989851A (zh) * 2015-02-15 2016-10-05 杜比实验室特许公司 Audio source separation
CN107919133A (zh) * 2016-10-09 2018-04-17 赛谛听股份有限公司 Speech enhancement system and speech enhancement method for a target object
CN108877831A (zh) * 2018-08-28 2018-11-23 山东大学 Fast blind source separation method and system based on multi-criterion fused frequency-bin screening
CN110111808A (zh) * 2019-04-30 2019-08-09 华为技术有限公司 Audio signal processing method and related product

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2115743A1 (en) * 2007-02-26 2009-11-11 QUALCOMM Incorporated Systems, methods, and apparatus for signal separation
JP5195652B2 (ja) * 2008-06-11 2013-05-08 ソニー株式会社 Signal processing apparatus, signal processing method, and program
JP4952698B2 (ja) * 2008-11-04 2012-06-13 ソニー株式会社 Audio processing apparatus, audio processing method, and program
US20120102066A1 (en) * 2009-06-30 2012-04-26 Nokia Corporation Method, Devices and a Service for Searching
JP5706782B2 (ja) * 2010-08-17 2015-04-22 本田技研工業株式会社 Sound source separation apparatus and sound source separation method
JP6005443B2 (ja) * 2012-08-23 2016-10-12 株式会社東芝 Signal processing apparatus, method, and program
JP6472824B2 (ja) * 2017-03-21 2019-02-20 株式会社東芝 Signal processing apparatus, signal processing method, and speech correspondence presentation apparatus
JP6591477B2 (ja) * 2017-03-21 2019-10-16 株式会社東芝 Signal processing system, signal processing method, and signal processing program
JP6859235B2 (ja) * 2017-09-07 2021-04-14 本田技研工業株式会社 Acoustic processing apparatus, acoustic processing method, and program
US10089994B1 (en) * 2018-01-15 2018-10-02 Alex Radzishevsky Acoustic fingerprint extraction and matching
FR3081641A1 (fr) * 2018-06-13 2019-11-29 Orange Localization of sound sources in a given acoustic environment.

Also Published As

Publication number Publication date
US20220199099A1 (en) 2022-06-23
CN110111808B (zh) 2021-06-15
EP3944238B1 (en) 2023-11-15
EP3944238A1 (en) 2022-01-26
CN110111808A (zh) 2019-08-09
EP3944238A4 (en) 2022-05-04

Similar Documents

Publication Publication Date Title
WO2020221059A1 (zh) 音频信号处理方法及相关产品
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
EP3254435B1 (en) Post-conference playback system having higher perceived quality than originally heard in the conference
CN107211062B (zh) 虚拟声学空间中的音频回放调度
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
US9672829B2 (en) Extracting and displaying key points of a video conference
JP2021009701A (ja) インターフェイススマートインタラクティブ制御方法、装置、システム及びプログラム
US7636662B2 (en) System and method for audio-visual content synthesis
EP3754961A1 (en) Post-teleconference playback using non-destructive audio transport
WO2020147407A1 (zh) 一种会议记录生成方法、装置、存储介质及计算机设备
CN107210036B (zh) 会议词语云
CN111048064B (zh) 基于单说话人语音合成数据集的声音克隆方法及装置
WO2020238209A1 (zh) 音频处理的方法、***及相关设备
CN111462733B (zh) 多模态语音识别模型训练方法、装置、设备及存储介质
CN107211061A (zh) 用于空间会议回放的优化虚拟场景布局
WO2022062800A1 (zh) 语音分离方法、电子设备、芯片及计算机可读存储介质
CN108073572B (zh) 信息处理方法及其装置、同声翻译***
KR20200027331A (ko) 음성 합성 장치
WO2023048746A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
CN107358947A (zh) 说话人重识别方法及***
WO2021072893A1 (zh) 一种声纹聚类方法、装置、处理设备以及计算机存储介质
CN113053361B (zh) 语音识别方法、模型训练方法、装置、设备及介质
CN111429916B (zh) 一种声音信号记录***
TW202211077A (zh) 多國語言語音辨識及翻譯方法與相關的系統
WO2021127975A1 (zh) 一种声音采集对象声纹检测方法、装置和设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20799277

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020799277

Country of ref document: EP

Effective date: 20211020

NENP Non-entry into the national phase

Ref country code: DE