WO2021151310A1 - Noise cancellation method and apparatus for voice calls, electronic device, and storage medium - Google Patents

Noise cancellation method and apparatus for voice calls, electronic device, and storage medium

Info

Publication number
WO2021151310A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speaker
category
detected
speech
Prior art date
Application number
PCT/CN2020/121571
Other languages
English (en)
French (fr)
Inventor
孙岩丹
王瑞璋
马骏
王少军
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021151310A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/05: Word boundary detection
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10L 21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232: Processing in the frequency domain
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24: Speech or voice analysis techniques where the extracted parameters are the cepstrum
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques using neural networks
    • G10L 25/45: Speech or voice analysis techniques characterised by the type of analysis window

Definitions

  • This application relates to the technical field of voiceprint recognition, and in particular to a method, device, electronic equipment, and computer-readable storage medium for noise cancellation in voice calls.
  • The inventor realizes that current noise cancellation technology mainly eliminates background noise that is not human voice; its cancellation of background human voices is poor, resulting in poor voice call quality.
  • A noise cancellation method for voice calls includes:
  • performing voice endpoint detection on call audio to obtain a human voice set;
  • performing voice feature extraction on the human voice set to obtain a voice feature set;
  • intercepting, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain multiple voice feature sets to be detected; performing clustering on each voice feature set to be detected; and scoring the resulting clusters with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected;
  • dividing the human voice set into a first speaker voice and a second speaker voice according to the score values;
  • calculating the durations of the first speaker voice and the second speaker voice, determining the background voice between them according to those durations, and deleting the background voice from the human voice set.
  • A noise cancellation device for voice calls includes:
  • a voice endpoint detection module, configured to perform voice endpoint detection on call audio to obtain a human voice set;
  • a voice feature extraction module, configured to perform voice feature extraction on the human voice set to obtain a voice feature set;
  • a clustering scoring module, configured to intercept, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain multiple voice feature sets to be detected, perform clustering on each voice feature set to be detected, and score the resulting clusters with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected;
  • a human voice classification module, configured to divide the human voice set into a first speaker voice and a second speaker voice according to the score values;
  • a background voice removal module, configured to calculate the durations of the first speaker voice and the second speaker voice, determine the background voice between them according to those durations, and delete the background voice from the human voice set.
  • An electronic device includes:
  • a memory storing at least one instruction; and
  • a processor that executes the instructions stored in the memory to implement the following steps:
  • performing voice endpoint detection on call audio to obtain a human voice set;
  • performing voice feature extraction on the human voice set to obtain a voice feature set;
  • intercepting, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain multiple voice feature sets to be detected; performing clustering on each voice feature set to be detected; and scoring the resulting clusters with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected;
  • dividing the human voice set into a first speaker voice and a second speaker voice according to the score values;
  • calculating the durations of the first speaker voice and the second speaker voice, determining the background voice between them according to those durations, and deleting the background voice from the human voice set.
  • A computer-readable storage medium includes a data storage area and a program storage area, where the data storage area stores created data and the program storage area stores a computer program that, when executed by a processor, implements the following steps:
  • performing voice endpoint detection on call audio to obtain a human voice set;
  • performing voice feature extraction on the human voice set to obtain a voice feature set;
  • intercepting, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain multiple voice feature sets to be detected; performing clustering on each voice feature set to be detected; and scoring the resulting clusters with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected;
  • dividing the human voice set into a first speaker voice and a second speaker voice according to the score values;
  • calculating the durations of the first speaker voice and the second speaker voice, determining the background voice between them according to those durations, and deleting the background voice from the human voice set.
  • The noise cancellation method, device, and computer-readable storage medium for voice calls proposed in this application can delete background voices from voice calls and improve the success rate of the dialogue system.
  • FIG. 1 is a schematic flowchart of a noise cancellation method for voice calls according to an embodiment of this application.
  • FIG. 2 is a schematic flowchart of a voice feature extraction method according to an embodiment of this application.
  • FIG. 3 is a schematic flowchart of a human voice separation method according to an embodiment of this application.
  • FIG. 4 is a schematic module diagram of a noise cancellation device for voice calls according to an embodiment of this application.
  • FIG. 5 is a schematic diagram of the internal structure of an electronic device implementing a noise cancellation method for voice calls according to an embodiment of this application.
  • The execution subject of the noise cancellation method for voice calls provided in the embodiments of the present application includes, but is not limited to, at least one of the electronic devices, such as a server or a terminal, that can be configured to execute the method provided in the embodiments of the present application.
  • In other words, the noise cancellation method for voice calls can be executed by software or hardware installed in a terminal device or a server device, and the software can be a blockchain platform.
  • The server includes, but is not limited to, a single server, a server cluster, a cloud server, a cloud server cluster, and the like.
  • Referring to FIG. 1, it is a schematic flowchart of a noise cancellation method for voice calls according to an embodiment of this application.
  • The noise cancellation method for voice calls includes:
  • S1: Perform voice endpoint detection on call audio to obtain a human voice set.
  • The call audio described in the embodiments of the present application includes audio produced by conversations held in a crowd or in an environment with many human voices, for example, audio produced during a call made through a communication system such as a telephone or instant-messaging software in an environment full of background human voices.
  • The call audio may be obtained directly from the communication system or retrieved from a database used to store voice dialogue information. It should be emphasized that, to further ensure the privacy and security of the call audio, it can also be stored in a node of a blockchain.
  • Voice endpoint detection refers to distinguishing, in a noisy or otherwise interfered environment, the human voice data from the non-human data (silence and environmental noise) in the call audio, and determining the start and end points of the human voice data, so that the non-human noise in the call audio can be deleted. This reduces the subsequent processing load on the computer, improves efficiency, and provides the necessary support for subsequent signal processing.
  • In a preferred embodiment of the present application, the voice endpoint detection model may be a voice activity detection (VAD) model based on a deep neural network (DNN).
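The patent specifies a DNN-based VAD model but does not disclose its architecture or input features. As a minimal sketch of this step's contract (call audio in, human-voice segments out), the following uses a simple frame-energy threshold in place of a DNN; the function name, frame sizes, and threshold are illustrative assumptions.

```python
# A minimal, illustrative voice-activity detector. The patent uses a DNN-based
# VAD; this energy-threshold stand-in only demonstrates the step's contract:
# call audio in, a list of (start_sample, end_sample) human-voice segments out.
import numpy as np

def energy_vad(signal, sr, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Assumes a float waveform in [-1, 1]; all parameters are assumptions."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    # Per-frame log energy relative to the loudest frame in the call.
    energies = np.array([np.sum(signal[i * hop:i * hop + frame] ** 2)
                         for i in range(n_frames)])
    db = 10.0 * np.log10(energies / (energies.max() + 1e-12) + 1e-12)
    voiced = db > threshold_db
    # Collapse consecutive voiced frames into contiguous segments.
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * hop
        elif not v and start is not None:
            segments.append((start, i * hop + frame))
            start = None
    if start is not None:
        segments.append((start, len(signal)))
    return segments  # the "human voice set" passed to feature extraction
```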
  • In detail, referring to FIG. 2, S2 includes:
  • S21: Perform pre-emphasis, framing, and windowing on the human voice set to obtain a speech frame sequence.
  • Pre-emphasis uses a high-pass filter to boost the high-frequency part of the speech signal in the human voice set so that the spectrum of the signal becomes flatter. Framing applies a movable window of limited length as a weighting to divide the speech signal into short segments, so that the signal within each segment can be treated as stationary. Windowing makes the non-periodic speech signal exhibit some characteristics of a periodic function, which facilitates the subsequent Fourier analysis.
  • S22: For each frame in the speech frame sequence, obtain the corresponding spectrum through a fast Fourier transform.
  • Preferably, because the characteristics of a speech signal are usually hard to see from its time-domain waveform, the signal is converted into an energy distribution in the frequency domain; different energy distributions can represent the characteristics of different speech.
  • S23: Convert the spectrum into a Mel spectrum through a Mel filter bank.
  • The Mel filter bank is a set of triangular filters on the Mel scale; it converts the spectrum into a Mel spectrum, and the Mel frequency accurately reflects the auditory characteristics of the human ear.
  • S24: Perform cepstral analysis on the Mel spectrum to obtain the voice feature set corresponding to the human voice set.
  • Further, the cepstral analysis consists of taking the logarithm and applying a discrete cosine transform, and outputs a feature vector. The voice feature set consists of the feature vectors output for the speech frame sequence after cepstral analysis.
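For concreteness, here is a sketch of the S21-S24 pipeline (pre-emphasis, framing and windowing, FFT, Mel filter bank, log plus DCT). The 25 ms/10 ms frame sizes, the 0.97 pre-emphasis coefficient, the Hamming window, and the use of 13 cepstral coefficients are conventional choices assumed here; the patent does not fix these parameters.

```python
# Sketch of S21-S24: waveform in, one cepstral feature vector per frame out.
import numpy as np
import librosa
from scipy.fftpack import dct

def extract_voice_features(y, sr, frame_ms=25, hop_ms=10, n_mels=26, n_mfcc=13):
    # S21: pre-emphasis (a first-order high-pass) flattens the spectrum.
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    window = np.hamming(frame)
    n_frames = 1 + (len(y) - frame) // hop  # assumes len(y) >= one frame
    # S21: framing + windowing -> short, quasi-stationary segments.
    frames = np.stack([y[i * hop:i * hop + frame] * window
                       for i in range(n_frames)])
    # S22: per-frame power spectrum via the fast Fourier transform.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # S23: triangular Mel-scale filter bank converts spectrum -> Mel spectrum.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=frame, n_mels=n_mels)
    mel_spec = power @ mel_fb.T
    # S24: cepstral analysis = logarithm + discrete cosine transform.
    log_mel = np.log(mel_spec + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```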
  • In the embodiment of the present application, each time the cumulative duration of the voice feature set reaches the preset duration threshold, one round of detection is performed, and the voice features accumulated in that round are referred to as a voice feature set to be detected.
  • In detail, performing clustering on each voice feature set to be detected includes:
  • Step a: Randomly select two feature vectors in the voice feature set to be detected as category centers.
  • Step b: For each feature vector in the voice feature set to be detected, calculate its distance to each category center and cluster the feature vector with the nearest category center, obtaining two initial categories.
  • In detail, the embodiment of the present application uses a preset distance algorithm to calculate the distance between a feature vector and each category center, where L(X, Y) is the distance value, X is the category center, and Y_i is a feature vector in the voice feature set to be detected (the formula itself appears only as an image in the source).
  • Step c: Update the category centers of the two initial categories.
  • Preferably, the embodiment of the present application calculates the mean of all feature vectors in each initial category and takes that mean as the new category center of the category.
  • Step d: Repeat steps b and c until the number of iterations reaches a preset threshold, obtaining two standard categories.
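Steps a through d amount to a two-cluster k-means. A sketch follows; since the patent's distance formula is given only as an image in the source, Euclidean distance is assumed here, and the fixed iteration count mirrors step d.

```python
# Two-cluster k-means matching steps a-d. Euclidean distance is an assumption;
# the source renders the actual distance formula only as an image.
import numpy as np

def cluster_two_categories(features, n_iters=10, seed=0):
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Step a: pick two random feature vectors as the initial category centers.
    centers = features[rng.choice(len(features), size=2, replace=False)]
    for _ in range(n_iters):  # step d: stop after a preset number of iterations
        # Step b: assign each vector to the nearest category center.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step c: move each center to the mean of its assigned vectors.
        for k in range(2):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    return labels, centers  # two "standard categories" and their centers
```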
  • Further, this application uses a preset evaluation algorithm to score the two standard categories obtained, yielding the score value of the standard categories.
  • Preferably, in the embodiment of the present application, the evaluation algorithm is the likelihood ratio
  • score(n_1, n_2) = P(n_1, n_2 | H_s) / ( P(n_1 | H_d) · P(n_2 | H_d) )
  • where n_1 and n_2 are the category centers of the two standard categories; H_s is the hypothesis that the standard categories belong to the same category, and H_d is the hypothesis that they belong to different categories; P(n_1, n_2 | H_s) is the likelihood function of n_1 and n_2 coming from the same space; and P(n_1 | H_d) and P(n_2 | H_d) are the likelihood functions of n_1 and n_2 coming from different spaces, respectively. The likelihood function is a function of the parameters of a statistical model and is used to test whether a given hypothesis holds.
  • The higher the score value, the more likely the voices corresponding to the two standard categories belong to the same speaker; the lower the score value, the less likely they belong to the same speaker.
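The patent does not disclose the statistical model behind P(·|H_s) and P(·|H_d); systems of this kind typically train a PLDA model for the purpose. Purely to make the ratio concrete, the toy sketch below models both hypotheses with isotropic Gaussians; the zero population mean and the variances are assumptions, not the patent's model.

```python
# Toy H_s / H_d likelihood-ratio score. The Gaussian model, zero mean, and
# variances below are illustrative assumptions, not the patent's model.
import numpy as np
from scipy.stats import multivariate_normal as mvn

def same_speaker_score(n1, n2, sigma_speaker=1.0, sigma_noise=0.5):
    d = len(n1)
    # H_d: n1 and n2 are independent draws from the speaker population.
    cov = (sigma_speaker**2 + sigma_noise**2) * np.eye(d)
    p_n1 = mvn.pdf(n1, mean=np.zeros(d), cov=cov)
    p_n2 = mvn.pdf(n2, mean=np.zeros(d), cov=cov)
    # H_s: n1 and n2 share one latent speaker, so they are jointly Gaussian
    # with cross-covariance sigma_speaker^2 between the two blocks.
    cross = sigma_speaker**2 * np.eye(d)
    joint_cov = np.block([[cov, cross], [cross, cov]])
    p_joint = mvn.pdf(np.concatenate([n1, n2]), mean=np.zeros(2 * d), cov=joint_cov)
    return p_joint / (p_n1 * p_n2)  # higher -> more likely the same speaker
```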
  • In detail, referring to FIG. 3, S4 includes:
  • S40: Select one of the voice feature sets to be detected and obtain its corresponding score value.
  • S41: Compare the score value with a preset score threshold.
  • When the score value is greater than the preset score threshold, execute S42: Merge the two standard categories of the selected voice feature set to be detected into a single voice category, calculate the category center of the single voice category, and generate a first speaker voice from the single voice category and its category center.
  • The first speaker voice includes a voice feature and a duration; the voice feature includes the single voice category and its category center, and the duration is the number of frames of the single voice category.
  • When the score value is less than or equal to the preset score threshold, execute S43: Generate a first speaker voice and a second speaker voice from the two standard categories.
  • Similarly, the first speaker voice and the second speaker voice each include a voice feature and a duration; the voice feature includes the standard category and its category center, and the duration is the number of frames of the standard category.
  • S44: Select the next voice feature set to be detected, obtain its corresponding score value, and classify the two standard categories of that voice feature set to be detected into the first speaker voice or the second speaker voice according to the score value.
  • S45: Determine whether every voice feature set to be detected has been selected, and repeat S44 until every voice feature set to be detected has been selected, obtaining the final first speaker voice and second speaker voice.
  • In detail, classifying the two standard categories of a voice feature set to be detected into the first speaker voice or the second speaker voice includes:
  • In one embodiment of the present application, if the score value of the voice feature set to be detected is greater than the score threshold, the two standard categories are merged into a single voice category and its category center is calculated; the cosine distances between that category center and the category centers of the first and second speaker voices are then calculated, and the single voice category is classified into the first speaker voice or the second speaker voice according to the cosine distances.
  • Specifically, if the category center of the single voice category is closer, in cosine distance, to the category center of the first speaker voice, the single voice category is classified into the first speaker voice; if it is closer to the category center of the second speaker voice, it is classified into the second speaker voice.
  • The classification includes: merging the single voice category with the first or second speaker voice and recalculating the merged category center; and adding the number of frames of the single voice category to the duration of the first or second speaker voice.
  • In another embodiment of the present application, if the score value is less than or equal to the score threshold, the cosine distances between the category center of each standard category in the voice feature set to be detected and the category centers of the first and second speaker voices are calculated, and the two standard categories are classified into the first speaker voice and the second speaker voice respectively according to those cosine distances.
  • For example, suppose the voice feature set to be detected includes standard category A and standard category B. If the category center of standard category A is closer, in cosine distance, to the category center of the first speaker voice, and the category center of standard category B is closer to the category center of the second speaker voice, then standard category A is classified into the first speaker voice and standard category B into the second speaker voice.
  • Likewise, the classification includes: merging standard categories A and B with the first or second speaker voice respectively and recalculating the merged category centers; and adding the frame counts of standard categories A and B to the durations of the first or second speaker voice.
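A sketch of this cosine-distance bookkeeping follows. Keeping each speaker as a (category center, frame count) pair and merging with a frame-weighted mean are illustrative choices; the patent only says the merged category center is recalculated and the durations accumulated.

```python
# Attach one category (center, frame count) to the nearer of two speakers,
# then merge. The frame-weighted mean used for the merge is an assumption.
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def assign_category(center, n_frames, speakers):
    """speakers: e.g. {"first": [center, frames], "second": [center, frames]}."""
    key = min(speakers, key=lambda k: cosine_distance(center, speakers[k][0]))
    old_center, old_frames = speakers[key]
    total = old_frames + n_frames
    # Merge: recalculate the category center and accumulate the duration.
    speakers[key][0] = (old_center * old_frames + center * n_frames) / total
    speakers[key][1] = total
    return key  # which speaker voice the category was classified into
```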
  • Preferably, in the call audio, the duration of the target speaker is generally greater than that of the background speaker. Therefore, the embodiment of the present application takes whichever of the first and second speaker voices has the longer duration as the target speaker and the remaining speaker voice as the background speaker.
  • In detail, deleting the background voice from the human voice set includes:
  • calculating the duration ratio of the background voice in the current call using a preset duration algorithm;
  • comparing the duration ratio with a preset ratio threshold; and
  • when the duration ratio is greater than the ratio threshold, deleting the background voice from the human voice set, thereby removing the background voice from the call audio.
  • The duration algorithm is as follows:
  • R = t / T
  • where R is the duration ratio of the background voice in the current call, t is the duration of the background speaker, and T is the total call duration, that is, the sum of the durations of the target speaker and the background speaker.
  • Preferably, when the duration ratio is less than the ratio threshold, it means the background voice interferes little with the current call and the call audio does not need to be processed. When the duration ratio is greater than the ratio threshold, it means the current call suffers from relatively serious background voice interference; deleting the background voice from the human voice set can reduce misrecognition caused by the background voice and improve voice call quality.
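The final test is exactly R = t / T. A minimal sketch, assuming an illustrative 0.3 ratio threshold (the patent leaves the threshold unspecified):

```python
# Duration-ratio test from R = t / T; ratio_threshold=0.3 is an assumption.
def should_remove_background(t_background, t_target, ratio_threshold=0.3):
    T = t_background + t_target              # total call duration
    R = t_background / T if T > 0 else 0.0   # background-voice duration ratio
    return R > ratio_threshold               # True -> delete background voice
```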
  • The embodiment of the present application performs voice endpoint detection on the call audio, deleting the non-human noise in the call audio and reducing the subsequent processing load on the computer; performs voice feature extraction on the human voice set to obtain a voice feature set, making it easier to later separate out the background voice in the call audio; intercepts, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals the preset duration threshold, obtaining multiple voice feature sets to be detected; performs clustering on each of them; and scores the clustering results with a preset evaluation algorithm to obtain a score value for each set. Combining clustering with scoring in this way can detect fragmented, blurred, and low-volume background voices. According to the score values, the human voice set is divided into a first speaker voice and a second speaker voice, which allows the audio characteristics of the speaker and of the background voice to be saved and dynamically updated in real time. Finally, the durations of the first and second speaker voices are calculated, the background voice in the human voice set is determined from those durations, and the background voice is deleted from the human voice set to improve voice call quality. Therefore, the noise cancellation method, device, and computer-readable storage medium for voice calls proposed in this application can delete background voices from voice calls and improve the success rate of the dialogue system.
  • As shown in FIG. 4, it is a functional module diagram of the noise cancellation device for voice calls of the present application.
  • The noise cancellation device 100 for voice calls described in this application can be installed in an electronic device.
  • Depending on the implemented functions, the noise cancellation device for voice calls may include a voice endpoint detection module 101, a voice feature extraction module 102, a clustering scoring module 103, a human voice classification module 104, and a background voice removal module 105.
  • The modules described in this application may also be referred to as units, meaning a series of computer program segments that can be executed by the processor of an electronic device, can complete fixed functions, and are stored in the memory of the electronic device.
  • In this embodiment, the functions of the modules/units are as follows:
  • The voice endpoint detection module 101 is configured to perform voice endpoint detection on call audio to obtain a human voice set.
  • The call audio described in the embodiments of the present application includes audio produced by conversations held in a crowd or in an environment with many human voices, for example, audio produced during a call made through a communication system such as a telephone or instant-messaging software in an environment full of background human voices.
  • The call audio may be obtained directly from the communication system or retrieved from a database used to store voice dialogue information. It should be emphasized that, to further ensure the privacy and security of the call audio, it can also be stored in a node of a blockchain.
  • Voice endpoint detection refers to distinguishing, in a noisy or otherwise interfered environment, the human voice data from the non-human data (silence and environmental noise) in the call audio, and determining the start and end points of the human voice data, so that the non-human noise in the call audio can be deleted. This reduces the subsequent processing load on the computer, improves efficiency, and provides the necessary support for subsequent signal processing.
  • In a preferred embodiment of the present application, the voice endpoint detection model may be a voice activity detection (VAD) model based on a deep neural network (DNN).
  • The voice feature extraction module 102 is configured to perform voice feature extraction on the human voice set to obtain a voice feature set.
  • In detail, the voice feature extraction module 102 specifically executes:
  • performing pre-emphasis, framing, and windowing on the human voice set to obtain a speech frame sequence;
  • for each frame in the speech frame sequence, obtaining the corresponding spectrum through a fast Fourier transform;
  • converting the spectrum into a Mel spectrum through a Mel filter bank; and
  • performing cepstral analysis on the Mel spectrum to obtain the voice feature set corresponding to the human voice set.
  • Pre-emphasis uses a high-pass filter to boost the high-frequency part of the speech signal in the human voice set so that the spectrum of the signal becomes flatter. Framing applies a movable window of limited length as a weighting to divide the speech signal into short segments, so that the signal within each segment can be treated as stationary. Windowing makes the non-periodic speech signal exhibit some characteristics of a periodic function, which facilitates the subsequent Fourier analysis.
  • Preferably, because the characteristics of a speech signal are usually hard to see from its time-domain waveform, the signal is converted into an energy distribution in the frequency domain; different energy distributions can represent the characteristics of different speech.
  • The Mel filter bank is a set of triangular filters on the Mel scale; it converts the spectrum into a Mel spectrum, and the Mel frequency accurately reflects the auditory characteristics of the human ear.
  • Further, the cepstral analysis consists of taking the logarithm and applying a discrete cosine transform, and outputs a feature vector. The voice feature set consists of the feature vectors output for the speech frame sequence after cepstral analysis.
  • The clustering scoring module 103 is configured to intercept, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain multiple voice feature sets to be detected, perform clustering on each voice feature set to be detected, and score the resulting clusters with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected.
  • In the embodiment of the present application, each time the cumulative duration of the voice feature set reaches the preset duration threshold, one round of detection is performed, and the voice features accumulated in that round are referred to as a voice feature set to be detected.
  • In detail, performing clustering on each voice feature set to be detected includes:
  • Step a: Randomly select two feature vectors in the voice feature set to be detected as category centers.
  • Step b: For each feature vector in the voice feature set to be detected, calculate its distance to each category center and cluster the feature vector with the nearest category center, obtaining two initial categories.
  • In detail, the embodiment of the present application uses a preset distance algorithm to calculate the distance between a feature vector and each category center, where L(X, Y) is the distance value, X is the category center, and Y_i is a feature vector in the voice feature set to be detected (the formula itself appears only as an image in the source).
  • Step c: Update the category centers of the two initial categories.
  • Preferably, the embodiment of the present application calculates the mean of all feature vectors in each initial category and takes that mean as the new category center of the category.
  • Step d: Repeat steps b and c until the number of iterations reaches a preset threshold, obtaining two standard categories.
  • Further, this application uses a preset evaluation algorithm to score the two standard categories obtained, yielding the score value of the standard categories.
  • Preferably, in the embodiment of the present application, the evaluation algorithm is the likelihood ratio
  • score(n_1, n_2) = P(n_1, n_2 | H_s) / ( P(n_1 | H_d) · P(n_2 | H_d) )
  • where n_1 and n_2 are the category centers of the two standard categories; H_s is the hypothesis that the standard categories belong to the same category, and H_d is the hypothesis that they belong to different categories; P(n_1, n_2 | H_s) is the likelihood function of n_1 and n_2 coming from the same space; and P(n_1 | H_d) and P(n_2 | H_d) are the likelihood functions of n_1 and n_2 coming from different spaces, respectively. The likelihood function is a function of the parameters of a statistical model and is used to test whether a given hypothesis holds.
  • The higher the score value, the more likely the voices corresponding to the two standard categories belong to the same speaker; the lower the score value, the less likely they belong to the same speaker.
  • The human voice classification module 104 is configured to divide the human voice set into a first speaker voice and a second speaker voice according to the score values.
  • In detail, the human voice classification module 104 is specifically configured to:
  • select one of the voice feature sets to be detected and obtain its corresponding score value;
  • compare the score value with a preset score threshold;
  • when the score value is less than or equal to the preset score threshold, generate a first speaker voice and a second speaker voice from the two standard categories;
  • select the next voice feature set to be detected, obtain its corresponding score value, and classify the two standard categories of that voice feature set to be detected into the first speaker voice or the second speaker voice according to the score value; and
  • repeat until every voice feature set to be detected has been selected, obtaining the final first speaker voice and second speaker voice.
  • When the score value is greater than the preset score threshold, the human voice classification module 104 merges the two standard categories of the selected voice feature set to be detected into a single voice category, calculates the category center of the single voice category, and generates a first speaker voice from the single voice category and its category center.
  • The first speaker voice includes a voice feature and a duration; the voice feature includes the single voice category and its category center, and the duration is the number of frames of the single voice category.
  • Similarly, the first speaker voice and the second speaker voice each include a voice feature and a duration; the voice feature includes the standard category and its category center, and the duration is the number of frames of the standard category.
  • In detail, classifying the two standard categories of a voice feature set to be detected into the first speaker voice or the second speaker voice includes:
  • In one embodiment of the present application, if the score value of the voice feature set to be detected is greater than the score threshold, the two standard categories are merged into a single voice category and its category center is calculated; the cosine distances between that category center and the category centers of the first and second speaker voices are then calculated, and the single voice category is classified into the first speaker voice or the second speaker voice according to the cosine distances.
  • Specifically, if the category center of the single voice category is closer, in cosine distance, to the category center of the first speaker voice, the single voice category is classified into the first speaker voice; if it is closer to the category center of the second speaker voice, it is classified into the second speaker voice.
  • The classification includes: merging the single voice category with the first or second speaker voice and recalculating the merged category center; and adding the number of frames of the single voice category to the duration of the first or second speaker voice.
  • In another embodiment of the present application, if the score value is less than or equal to the score threshold, the cosine distances between the category center of each standard category in the voice feature set to be detected and the category centers of the first and second speaker voices are calculated, and the two standard categories are classified into the first speaker voice and the second speaker voice respectively according to those cosine distances.
  • For example, suppose the voice feature set to be detected includes standard category A and standard category B. If the category center of standard category A is closer, in cosine distance, to the category center of the first speaker voice, and the category center of standard category B is closer to the category center of the second speaker voice, then standard category A is classified into the first speaker voice and standard category B into the second speaker voice.
  • Likewise, the classification includes: merging standard categories A and B with the first or second speaker voice respectively and recalculating the merged category centers; and adding the frame counts of standard categories A and B to the durations of the first or second speaker voice.
  • The background voice removal module 105 is configured to calculate the durations of the first speaker voice and the second speaker voice, determine the background voice in the human voice set according to those durations, and delete the background voice from the human voice set.
  • Preferably, in the call audio, the duration of the target speaker is generally greater than that of the background speaker. Therefore, the embodiment of the present application takes whichever of the first and second speaker voices has the longer duration as the target speaker and the remaining speaker voice as the background speaker.
  • In detail, the background voice removal module 105 deletes the background voice from the human voice set as follows:
  • calculating the duration ratio of the background voice in the current call using a preset duration algorithm;
  • comparing the duration ratio with a preset ratio threshold; and
  • when the duration ratio is greater than the ratio threshold, deleting the background voice from the human voice set, thereby removing the background voice from the call audio.
  • The duration algorithm is as follows:
  • R = t / T
  • where R is the duration ratio of the background voice in the current call, t is the duration of the background speaker, and T is the total call duration, that is, the sum of the durations of the target speaker and the background speaker.
  • Preferably, when the duration ratio is less than the ratio threshold, it means the background voice interferes little with the current call and the call audio does not need to be processed. When the duration ratio is greater than the ratio threshold, it means the current call suffers from relatively serious background voice interference; deleting the background voice from the human voice set can reduce misrecognition caused by the background voice and improve voice call quality.
  • As shown in FIG. 5, it is a schematic structural diagram of an electronic device implementing the noise cancellation method for voice calls of the present application.
  • The electronic device 1 may include a processor 10, a memory 11, and a bus, and may also include a computer program stored in the memory 11 and runnable on the processor 10, such as a noise cancellation program 12 for voice calls.
  • The memory 11 includes at least one type of readable storage medium, including flash memory, removable hard disk, multimedia card, card-type memory (such as SD or DX memory), magnetic memory, magnetic disk, optical disc, and the like.
  • In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, for example, a removable hard disk of the electronic device 1.
  • In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in removable hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 1.
  • Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1.
  • The memory 11 can be used not only to store application software installed in the electronic device 1 and various data, such as the code of the noise cancellation program 12 for voice calls, but also to temporarily store data that has been or will be output.
  • In some embodiments, the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple integrated circuits packaged with the same or different functions, including combinations of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, various control chips, and the like.
  • The processor 10 is the control core (Control Unit) of the electronic device; it connects the components of the entire electronic device through various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing programs or modules stored in the memory 11 (for example, executing the noise cancellation program for voice calls) and calling data stored in the memory 11.
  • The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
  • The bus can be divided into an address bus, a data bus, a control bus, and so on.
  • The bus is configured to implement connection and communication between the memory 11, the at least one processor 10, and other components.
  • FIG. 5 shows only an electronic device with certain components. Those skilled in the art will understand that the structure shown in FIG. 5 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • For example, although not shown, the electronic device 1 may also include a power supply (such as a battery) that powers the components. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device.
  • The power supply may also include any components such as one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, and power status indicators.
  • The electronic device 1 may also include various sensors, a Bluetooth module, a Wi-Fi module, and so on, which are not described in detail here.
  • Further, the electronic device 1 may also include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • Optionally, the electronic device 1 may also include a user interface. The user interface may be a display (Display) or an input unit (such as a keyboard); optionally, the user interface may also be a standard wired or wireless interface.
  • Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like.
  • The display may also appropriately be called a display screen or display unit, and is used to display information processed in the electronic device 1 and to display a visualized user interface.
  • The noise cancellation program 12 for voice calls stored in the memory 11 of the electronic device 1 is a combination of multiple instructions. When run by the processor 10, it can realize:
  • performing voice endpoint detection on call audio to obtain a human voice set;
  • performing voice feature extraction on the human voice set to obtain a voice feature set;
  • intercepting, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain multiple voice feature sets to be detected; performing clustering on each voice feature set to be detected; and scoring the resulting clusters with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected;
  • dividing the human voice set into a first speaker voice and a second speaker voice according to the score values;
  • calculating the durations of the first speaker voice and the second speaker voice, determining the background voice in the human voice set according to those durations, and deleting the background voice from the human voice set.
  • Further, if the integrated modules/units of the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, or a read-only memory (ROM).
  • The computer-readable storage medium may be volatile or non-volatile.
  • The computer-readable medium includes a data storage area and a program storage area; the data storage area stores created data, and the program storage area stores a computer program that, when executed by a processor, implements the following steps:
  • performing voice endpoint detection on call audio to obtain a human voice set;
  • performing voice feature extraction on the human voice set to obtain a voice feature set;
  • intercepting, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain multiple voice feature sets to be detected; performing clustering on each voice feature set to be detected; and scoring the resulting clusters with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected;
  • dividing the human voice set into a first speaker voice and a second speaker voice according to the score values;
  • calculating the durations of the first speaker voice and the second speaker voice, determining the background voice in the human voice set according to those durations, and deleting the background voice from the human voice set.
  • Modules described as separate components may or may not be physically separate, and components displayed as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • In addition, the functional modules in the various embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A noise cancellation method for voice calls, relating to voiceprint recognition technology, comprising: performing voice endpoint detection on call audio to obtain a human voice set (S1); performing voice feature extraction on the human voice set to obtain a voice feature set (S2); intercepting, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, obtaining multiple voice feature sets to be detected, clustering each voice feature set to be detected, and scoring the clusters (S3); dividing the human voice set into a first speaker voice and a second speaker voice according to the scores (S4), distinguishing the background voice between the first speaker voice and the second speaker voice, and deleting the background voice from the human voice set (S5). The method also involves blockchain technology: the call audio can be stored in a blockchain. The method can delete the background voices in a voice call and thereby improve voice call quality.

Description

Noise cancellation method and apparatus for voice calls, electronic device, and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on June 19, 2020, with application number CN202010570483.4 and invention title "Noise cancellation method and apparatus for voice calls, electronic device, and storage medium", the entire contents of which are incorporated into this application by reference.
Technical Field
This application relates to the technical field of voiceprint recognition, and in particular to a noise cancellation method and apparatus for voice calls, an electronic device, and a computer-readable storage medium.
Background
Customer service systems, and intelligent outbound-call systems in particular, often have to cope with background noise from the environment the customer is in. Among all noises, background human voices interfere most strongly: the automatic speech recognition of an intelligent outbound-call system will recognize the background voices as well and treat them as targets of the dialogue, greatly reducing the success rate of the overall dialogue.
However, the inventor realized that current noise cancellation technology mainly eliminates non-human background noise; its effect on background human voices is poor, resulting in poor voice call quality.
Summary
A noise cancellation method for voice calls includes:
performing voice endpoint detection on call audio to obtain a human voice set;
performing voice feature extraction on the human voice set to obtain a voice feature set;
intercepting, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain multiple voice feature sets to be detected; performing clustering on each voice feature set to be detected; and scoring the resulting clusters with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected;
dividing the human voice set into a first speaker voice and a second speaker voice according to the score values;
calculating the durations of the first speaker voice and the second speaker voice, determining the background voice in the human voice set according to those durations, and deleting the background voice from the human voice set.
A noise cancellation device for voice calls, the device comprising:
a voice endpoint detection module, configured to perform voice endpoint detection on call audio to obtain a human voice set;
a voice feature extraction module, configured to perform voice feature extraction on the human voice set to obtain a voice feature set;
a clustering scoring module, configured to intercept, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain multiple voice feature sets to be detected, perform clustering on each voice feature set to be detected, and score the resulting clusters with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected;
a human voice classification module, configured to divide the human voice set into a first speaker voice and a second speaker voice according to the score values; and
a background voice removal module, configured to calculate the durations of the first speaker voice and the second speaker voice, determine the background voice in the human voice set according to those durations, and delete the background voice from the human voice set.
An electronic device, comprising:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the following steps:
performing voice endpoint detection on call audio to obtain a human voice set;
performing voice feature extraction on the human voice set to obtain a voice feature set;
intercepting, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain multiple voice feature sets to be detected; performing clustering on each voice feature set to be detected; and scoring the resulting clusters with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected;
dividing the human voice set into a first speaker voice and a second speaker voice according to the score values;
calculating the durations of the first speaker voice and the second speaker voice, determining the background voice in the human voice set according to those durations, and deleting the background voice from the human voice set.
A computer-readable storage medium, comprising a data storage area and a program storage area, the data storage area storing created data and the program storage area storing a computer program which, when executed by a processor, implements the following steps:
performing voice endpoint detection on call audio to obtain a human voice set;
performing voice feature extraction on the human voice set to obtain a voice feature set;
intercepting, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain multiple voice feature sets to be detected; performing clustering on each voice feature set to be detected; and scoring the resulting clusters with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected;
dividing the human voice set into a first speaker voice and a second speaker voice according to the score values;
calculating the durations of the first speaker voice and the second speaker voice, determining the background voice in the human voice set according to those durations, and deleting the background voice from the human voice set.
The noise cancellation method and apparatus for voice calls and the computer-readable storage medium proposed in this application can delete the background voices in a voice call and improve the success rate of the dialogue system.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a noise cancellation method for voice calls according to an embodiment of this application;
FIG. 2 is a schematic flowchart of a voice feature extraction method according to an embodiment of this application;
FIG. 3 is a schematic flowchart of a human voice separation method according to an embodiment of this application;
FIG. 4 is a schematic module diagram of a noise cancellation device for voice calls according to an embodiment of this application;
FIG. 5 is a schematic diagram of the internal structure of an electronic device implementing a noise cancellation method for voice calls according to an embodiment of this application;
The realization of the objectives, functional features, and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description of the Embodiments
It should be understood that the specific embodiments described here are only used to explain this application and are not used to limit it.
The execution subject of the noise cancellation method for voice calls provided in the embodiments of this application includes, but is not limited to, at least one of the electronic devices, such as a server or a terminal, that can be configured to execute the method provided in the embodiments of this application. In other words, the noise cancellation method for voice calls can be executed by software or hardware installed in a terminal device or a server device, and the software can be a blockchain platform. The server includes, but is not limited to, a single server, a server cluster, a cloud server, a cloud server cluster, and the like.
This application provides a noise cancellation method for voice calls. Referring to FIG. 1, it is a schematic flowchart of a noise cancellation method for voice calls according to an embodiment of this application.
In this embodiment, the noise cancellation method for voice calls includes:
S1: Perform voice endpoint detection on call audio to obtain a human voice set.
In detail, the call audio described in the embodiments of this application includes audio produced by conversations held in a crowd or in an environment with many human voices, for example, audio produced during a call made through a communication system such as a telephone or instant-messaging software in an environment full of background human voices. The call audio may be obtained directly from the communication system or retrieved from a database used to store voice dialogue information. It should be emphasized that, to further ensure the privacy and security of the call audio, it can also be stored in a node of a blockchain.
Voice endpoint detection refers to distinguishing, in a noisy or otherwise interfered environment, the human voice data from the non-human data (silence and environmental noise) in the call audio, and determining the start and end points of the human voice data, so that the non-human noise in the call audio can be deleted, reducing the subsequent processing load on the computer, improving efficiency, and providing the necessary support for subsequent signal processing.
In a preferred embodiment of this application, the voice endpoint detection model may be a voice activity detection (VAD) model based on a deep neural network (DNN).
S2: Perform voice feature extraction on the human voice set to obtain a voice feature set.
In detail, referring to FIG. 2, S2 includes:
S21: Perform pre-emphasis, framing, and windowing on the human voice set to obtain a speech frame sequence.
Pre-emphasis uses a high-pass filter to boost the high-frequency part of the speech signal in the human voice set so that the spectrum of the signal becomes flatter. Framing applies a movable window of limited length as a weighting to divide the speech signal into short segments, so that the signal within each segment can be treated as stationary. Windowing makes the non-periodic speech signal exhibit some characteristics of a periodic function, which facilitates the subsequent Fourier analysis.
S22: For each frame in the speech frame sequence, obtain the corresponding spectrum through a fast Fourier transform.
Preferably, because the characteristics of a speech signal are usually hard to see from its time-domain waveform, the signal is converted into an energy distribution in the frequency domain; different energy distributions can represent the characteristics of different speech.
S23: Convert the spectrum into a Mel spectrum through a Mel filter bank.
The Mel filter bank is a set of triangular filters on the Mel scale; it converts the spectrum into a Mel spectrum, and the Mel frequency accurately reflects the auditory characteristics of the human ear.
S24: Perform cepstral analysis on the Mel spectrum to obtain the voice feature set corresponding to the human voice set.
Further, the cepstral analysis consists of taking the logarithm and applying a discrete cosine transform, and outputs a feature vector. The voice feature set consists of the feature vectors output for the speech frame sequence after cepstral analysis.
S3: Intercept, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain multiple voice feature sets to be detected; perform clustering on each voice feature set to be detected; and score the resulting clusters with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected.
In the embodiment of this application, each time the cumulative duration of the voice feature set reaches the preset duration threshold, one round of detection is performed, and the voice features accumulated in that round are referred to as a voice feature set to be detected.
In detail, performing clustering on each voice feature set to be detected includes:
Step a: Randomly select two feature vectors in the voice feature set to be detected as category centers.
Step b: For each feature vector in the voice feature set to be detected, calculate its distance to each category center and cluster the feature vector with the nearest category center, obtaining two initial categories.
In detail, the embodiment of this application uses the following distance algorithm to calculate the distance between a feature vector and each category center:
Figure PCTCN2020121571-appb-000001
where L(X, Y) is the distance value, X is the category center, and Y_i is a feature vector in the voice feature set to be detected.
Step c: Update the category centers of the two initial categories.
Preferably, the embodiment of this application calculates the mean of all feature vectors in each initial category and takes that mean as the new category center of the category.
Step d: Repeat steps b and c until the number of iterations reaches a preset threshold, obtaining two standard categories.
Further, this application uses a preset evaluation algorithm to score the two standard categories obtained, yielding the score value of the standard categories. Preferably, in the embodiment of this application, the evaluation algorithm is as follows:
Figure PCTCN2020121571-appb-000002
where n_1 and n_2 are the category centers of the two standard categories; H_s is the hypothesis that the standard categories belong to the same category, and H_d is the hypothesis that they belong to different categories; P(n_1, n_2|H_s) is the likelihood function of n_1 and n_2 coming from the same space; and P(n_1|H_d) and P(n_2|H_d) are the likelihood functions of n_1 and n_2 coming from different spaces, respectively. The likelihood function is a function of the parameters of a statistical model and is used to test whether a given hypothesis holds.
Preferably, the higher the score value, the more likely the voices corresponding to the two standard categories belong to the same speaker; the lower the score value, the less likely they belong to the same speaker.
S4: Divide the human voice set into a first speaker voice and a second speaker voice according to the score values.
In detail, referring to FIG. 3, S4 includes:
S40: Select one of the voice feature sets to be detected and obtain its corresponding score value.
S41: Compare the score value with a preset score threshold.
When the score value is greater than the preset score threshold, execute S42: Merge the two standard categories of the selected voice feature set to be detected into a single voice category, calculate the category center of the single voice category, and generate a first speaker voice from the single voice category and its category center.
The first speaker voice includes a voice feature and a duration; the voice feature includes the single voice category and its category center, and the duration is the number of frames of the single voice category.
When the score value is less than or equal to the preset score threshold, execute S43: Generate a first speaker voice and a second speaker voice from the two standard categories.
Similarly, the first speaker voice and the second speaker voice each include a voice feature and a duration; the voice feature includes the standard category and its category center, and the duration is the number of frames of the standard category.
S44: Select the next voice feature set to be detected, obtain its corresponding score value, and classify the two standard categories of that voice feature set to be detected into the first speaker voice or the second speaker voice according to the score value.
S45: Determine whether every voice feature set to be detected has been selected, and repeat S44 until every voice feature set to be detected has been selected, obtaining the final first speaker voice and second speaker voice.
In detail, classifying the two standard categories of a voice feature set to be detected into the first speaker voice or the second speaker voice includes:
In one embodiment of this application, if the score value of the voice feature set to be detected is greater than the score threshold, the two standard categories of the voice feature set to be detected are merged into a single voice category and the category center of the single voice category is calculated; the cosine distances between that category center and the category centers of the first and second speaker voices are calculated, and the single voice category is classified into the first speaker voice or the second speaker voice according to the cosine distances.
Specifically, if the category center of the single voice category is closer, in cosine distance, to the category center of the first speaker voice, the single voice category is classified into the first speaker voice; if it is closer to the category center of the second speaker voice, it is classified into the second speaker voice.
The classification includes: merging the single voice category with the first or second speaker voice and recalculating the merged category center; and adding the number of frames of the single voice category to the duration of the first or second speaker voice.
In another embodiment of this application, if the score value is less than or equal to the score threshold, the cosine distances between the category center of each standard category in the voice feature set to be detected and the category centers of the first and second speaker voices are calculated, and the two standard categories are classified into the first speaker voice and the second speaker voice respectively according to those cosine distances.
For example, suppose the voice feature set to be detected includes standard category A and standard category B. If the category center of standard category A is closer, in cosine distance, to the category center of the first speaker voice, and the category center of standard category B is closer to the category center of the second speaker voice, then standard category A is classified into the first speaker voice and standard category B into the second speaker voice.
Likewise, the classification includes: merging standard category A and standard category B with the first or second speaker voice respectively and recalculating the merged category centers; and adding the frame counts of standard categories A and B to the durations of the first or second speaker voice.
S5: Calculate the durations of the first speaker voice and the second speaker voice, determine the background voice in the human voice set according to those durations, and delete the background voice from the human voice set.
Preferably, in the call audio, the duration of the target speaker is generally greater than that of the background speaker. Therefore, the embodiment of this application takes whichever of the first and second speaker voices has the longer duration as the target speaker and the remaining speaker voice as the background speaker.
In detail, deleting the background voice from the human voice set includes:
calculating the duration ratio of the background voice in the current call using a preset duration algorithm;
comparing the duration ratio with a preset ratio threshold; and
when the duration ratio is greater than the ratio threshold, deleting the background voice from the human voice set, thereby removing the background voice from the call audio.
The duration algorithm is as follows:
R = t / T
where R is the duration ratio of the background voice in the current call, t is the duration of the background speaker, and T is the total call duration, that is, the sum of the durations of the target speaker and the background speaker.
Preferably, when the duration ratio is less than the ratio threshold, it means the background voice interferes little with the current call and the call audio does not need to be processed. When the duration ratio is greater than the ratio threshold, it means the current call suffers from relatively serious background voice interference; deleting the background voice from the human voice set can reduce misrecognition caused by the background voice and improve voice call quality.
The embodiment of this application performs voice endpoint detection on the call audio, deleting the non-human noise in the call audio and reducing the subsequent processing load on the computer; performs voice feature extraction on the human voice set to obtain a voice feature set, making it easier to later separate out the background voice in the call audio; intercepts, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals the preset duration threshold, obtaining multiple voice feature sets to be detected; performs clustering on each of them; and scores the clustering results with a preset evaluation algorithm to obtain a score value for each set. Combining clustering with scoring in this way can detect fragmented, blurred, and low-volume background voices. According to the score values, the human voice set is divided into a first speaker voice and a second speaker voice, which allows the audio characteristics of the speaker and of the background voice to be saved and dynamically updated in real time. Finally, the durations of the first and second speaker voices are calculated, the background voice in the human voice set is determined from those durations, and the background voice is deleted from the human voice set to improve voice call quality. Therefore, the noise cancellation method, device, and computer-readable storage medium for voice calls proposed in this application can delete background voices from voice calls and improve the success rate of the dialogue system.
As shown in FIG. 4, it is a functional module diagram of the noise cancellation device for voice calls of this application.
The noise cancellation device 100 for voice calls described in this application can be installed in an electronic device. Depending on the implemented functions, the noise cancellation device for voice calls may include a voice endpoint detection module 101, a voice feature extraction module 102, a clustering scoring module 103, a human voice classification module 104, and a background voice removal module 105. The modules described in this application may also be referred to as units, meaning a series of computer program segments that can be executed by the processor of an electronic device, can complete fixed functions, and are stored in the memory of the electronic device.
In this embodiment, the functions of the modules/units are as follows:
The voice endpoint detection module 101 is configured to perform voice endpoint detection on call audio to obtain a human voice set.
In detail, the call audio described in the embodiments of this application includes audio produced by conversations held in a crowd or in an environment with many human voices, for example, audio produced during a call made through a communication system such as a telephone or instant-messaging software in an environment full of background human voices. The call audio may be obtained directly from the communication system or retrieved from a database used to store voice dialogue information. It should be emphasized that, to further ensure the privacy and security of the call audio, it can also be stored in a node of a blockchain.
Voice endpoint detection refers to distinguishing, in a noisy or otherwise interfered environment, the human voice data from the non-human data (silence and environmental noise) in the call audio, and determining the start and end points of the human voice data, so that the non-human noise in the call audio can be deleted, reducing the subsequent processing load on the computer, improving efficiency, and providing the necessary support for subsequent signal processing.
In a preferred embodiment of this application, the voice endpoint detection model may be a voice activity detection (VAD) model based on a deep neural network (DNN).
The voice feature extraction module 102 is configured to perform voice feature extraction on the human voice set to obtain a voice feature set.
In detail, the voice feature extraction module 102 specifically executes:
performing pre-emphasis, framing, and windowing on the human voice set to obtain a speech frame sequence;
for each frame in the speech frame sequence, obtaining the corresponding spectrum through a fast Fourier transform;
converting the spectrum into a Mel spectrum through a Mel filter bank; and
performing cepstral analysis on the Mel spectrum to obtain the voice feature set corresponding to the human voice set.
Pre-emphasis uses a high-pass filter to boost the high-frequency part of the speech signal in the human voice set so that the spectrum of the signal becomes flatter. Framing applies a movable window of limited length as a weighting to divide the speech signal into short segments, so that the signal within each segment can be treated as stationary. Windowing makes the non-periodic speech signal exhibit some characteristics of a periodic function, which facilitates the subsequent Fourier analysis.
Preferably, because the characteristics of a speech signal are usually hard to see from its time-domain waveform, the signal is converted into an energy distribution in the frequency domain; different energy distributions can represent the characteristics of different speech.
The Mel filter bank is a set of triangular filters on the Mel scale; it converts the spectrum into a Mel spectrum, and the Mel frequency accurately reflects the auditory characteristics of the human ear.
Further, the cepstral analysis consists of taking the logarithm and applying a discrete cosine transform, and outputs a feature vector. The voice feature set consists of the feature vectors output for the speech frame sequence after cepstral analysis.
The clustering scoring module 103 is configured to intercept, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain multiple voice feature sets to be detected, perform clustering on each voice feature set to be detected, and score the resulting clusters with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected.
In the embodiment of this application, each time the cumulative duration of the voice feature set reaches the preset duration threshold, one round of detection is performed, and the voice features accumulated in that round are referred to as a voice feature set to be detected.
In detail, performing clustering on each voice feature set to be detected includes:
Step a: Randomly select two feature vectors in the voice feature set to be detected as category centers.
Step b: For each feature vector in the voice feature set to be detected, calculate its distance to each category center and cluster the feature vector with the nearest category center, obtaining two initial categories.
In detail, the embodiment of this application uses the following distance algorithm to calculate the distance between a feature vector and each category center:
Figure PCTCN2020121571-appb-000003
where L(X, Y) is the distance value, X is the category center, and Y_i is a feature vector in the voice feature set to be detected.
Step c: Update the category centers of the two initial categories.
Preferably, the embodiment of this application calculates the mean of all feature vectors in each initial category and takes that mean as the new category center of the category.
Step d: Repeat steps b and c until the number of iterations reaches a preset threshold, obtaining two standard categories.
Further, this application uses a preset evaluation algorithm to score the two standard categories obtained, yielding the score value of the standard categories. Preferably, in the embodiment of this application, the evaluation algorithm is as follows:
Figure PCTCN2020121571-appb-000004
where n_1 and n_2 are the category centers of the two standard categories; H_s is the hypothesis that the standard categories belong to the same category, and H_d is the hypothesis that they belong to different categories; P(n_1, n_2|H_s) is the likelihood function of n_1 and n_2 coming from the same space; and P(n_1|H_d) and P(n_2|H_d) are the likelihood functions of n_1 and n_2 coming from different spaces, respectively. The likelihood function is a function of the parameters of a statistical model and is used to test whether a given hypothesis holds.
Preferably, the higher the score value, the more likely the voices corresponding to the two standard categories belong to the same speaker; the lower the score value, the less likely they belong to the same speaker.
The human voice classification module 104 is configured to divide the human voice set into a first speaker voice and a second speaker voice according to the score values.
In detail, the human voice classification module 104 is specifically configured to:
select one of the voice feature sets to be detected and obtain its corresponding score value;
compare the score value with a preset score threshold;
when the score value is less than or equal to the preset score threshold, generate a first speaker voice and a second speaker voice from the two standard categories;
select the next voice feature set to be detected, obtain its corresponding score value, and classify the two standard categories of that voice feature set to be detected into the first speaker voice or the second speaker voice according to the score value; and
determine whether every voice feature set to be detected has been selected, until every voice feature set to be detected has been selected, obtaining the final first speaker voice and second speaker voice.
When the score value is greater than the preset score threshold, the human voice classification module 104 merges the two standard categories of the selected voice feature set to be detected into a single voice category, calculates the category center of the single voice category, and generates a first speaker voice from the single voice category and its category center.
The first speaker voice includes a voice feature and a duration; the voice feature includes the single voice category and its category center, and the duration is the number of frames of the single voice category.
Similarly, the first speaker voice and the second speaker voice each include a voice feature and a duration; the voice feature includes the standard category and its category center, and the duration is the number of frames of the standard category.
In detail, classifying the two standard categories of a voice feature set to be detected into the first speaker voice or the second speaker voice includes:
In one embodiment of this application, if the score value of the voice feature set to be detected is greater than the score threshold, the two standard categories of the voice feature set to be detected are merged into a single voice category and the category center of the single voice category is calculated; the cosine distances between that category center and the category centers of the first and second speaker voices are calculated, and the single voice category is classified into the first speaker voice or the second speaker voice according to the cosine distances.
Specifically, if the category center of the single voice category is closer, in cosine distance, to the category center of the first speaker voice, the single voice category is classified into the first speaker voice; if it is closer to the category center of the second speaker voice, it is classified into the second speaker voice.
The classification includes: merging the single voice category with the first or second speaker voice and recalculating the merged category center; and adding the number of frames of the single voice category to the duration of the first or second speaker voice.
In another embodiment of this application, if the score value is less than or equal to the score threshold, the cosine distances between the category center of each standard category in the voice feature set to be detected and the category centers of the first and second speaker voices are calculated, and the two standard categories are classified into the first speaker voice and the second speaker voice respectively according to those cosine distances.
For example, suppose the voice feature set to be detected includes standard category A and standard category B. If the category center of standard category A is closer, in cosine distance, to the category center of the first speaker voice, and the category center of standard category B is closer to the category center of the second speaker voice, then standard category A is classified into the first speaker voice and standard category B into the second speaker voice.
Likewise, the classification includes: merging standard category A and standard category B with the first or second speaker voice respectively and recalculating the merged category centers; and adding the frame counts of standard categories A and B to the durations of the first or second speaker voice.
The background voice removal module 105 is configured to calculate the durations of the first speaker voice and the second speaker voice, determine the background voice in the human voice set according to those durations, and delete the background voice from the human voice set.
Preferably, in the call audio, the duration of the target speaker is generally greater than that of the background speaker. Therefore, the embodiment of this application takes whichever of the first and second speaker voices has the longer duration as the target speaker and the remaining speaker voice as the background speaker.
In detail, the background voice removal module 105 deletes the background voice from the human voice set as follows:
calculating the duration ratio of the background voice in the current call using a preset duration algorithm;
comparing the duration ratio with a preset ratio threshold; and
when the duration ratio is greater than the ratio threshold, deleting the background voice from the human voice set, thereby removing the background voice from the call audio.
The duration algorithm is as follows:
R = t / T
where R is the duration ratio of the background voice in the current call, t is the duration of the background speaker, and T is the total call duration, that is, the sum of the durations of the target speaker and the background speaker.
Preferably, when the duration ratio is less than the ratio threshold, it means the background voice interferes little with the current call and the call audio does not need to be processed. When the duration ratio is greater than the ratio threshold, it means the current call suffers from relatively serious background voice interference; deleting the background voice from the human voice set can reduce misrecognition caused by the background voice and improve voice call quality.
As shown in FIG. 5, it is a schematic structural diagram of an electronic device implementing the noise cancellation method for voice calls of this application.
The electronic device 1 may include a processor 10, a memory 11, and a bus, and may also include a computer program stored in the memory 11 and runnable on the processor 10, such as a noise cancellation program 12 for voice calls.
The memory 11 includes at least one type of readable storage medium, including flash memory, removable hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc, and the like. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, for example, a removable hard disk of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in removable hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 can be used not only to store application software installed in the electronic device 1 and various data, such as the code of the noise cancellation program 12 for voice calls, but also to temporarily store data that has been or will be output.
In some embodiments, the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple integrated circuits packaged with the same or different functions, including combinations of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, various control chips, and the like. The processor 10 is the control core (Control Unit) of the electronic device; it connects the components of the entire electronic device through various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing programs or modules stored in the memory 11 (for example, executing the noise cancellation program for voice calls) and calling data stored in the memory 11.
The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on. The bus is configured to implement connection and communication between the memory 11, the at least one processor 10, and other components.
FIG. 5 shows only an electronic device with certain components. Those skilled in the art will understand that the structure shown in FIG. 5 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may also include a power supply (such as a battery) that powers the components. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power supply may also include any components such as one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, and power status indicators. The electronic device 1 may also include various sensors, a Bluetooth module, a Wi-Fi module, and so on, which are not described in detail here.
Further, the electronic device 1 may also include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may also include a user interface. The user interface may be a display (Display) or an input unit (such as a keyboard); optionally, the user interface may also be a standard wired or wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also appropriately be called a display screen or display unit, and is used to display information processed in the electronic device 1 and to display a visualized user interface.
It should be understood that the embodiments are for illustration only, and the scope of the patent application is not limited by this structure.
The noise cancellation program 12 for voice calls stored in the memory 11 of the electronic device 1 is a combination of multiple instructions. When run in the processor 10, it can realize:
performing voice endpoint detection on call audio to obtain a human voice set;
performing voice feature extraction on the human voice set to obtain a voice feature set;
intercepting, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain multiple voice feature sets to be detected; performing clustering on each voice feature set to be detected; and scoring the resulting clusters with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected;
dividing the human voice set into a first speaker voice and a second speaker voice according to the score values;
calculating the durations of the first speaker voice and the second speaker voice, determining the background voice in the human voice set according to those durations, and deleting the background voice from the human voice set.
Further, if the integrated modules/units of the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, or a read-only memory (ROM). The computer-readable storage medium may be volatile or non-volatile; it includes a data storage area and a program storage area, the data storage area storing created data and the program storage area storing a computer program which, when executed by a processor, implements the following steps:
performing voice endpoint detection on call audio to obtain a human voice set;
performing voice feature extraction on the human voice set to obtain a voice feature set;
intercepting, in chronological order from the voice feature set, voice feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain multiple voice feature sets to be detected; performing clustering on each voice feature set to be detected; and scoring the resulting clusters with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected;
dividing the human voice set into a first speaker voice and a second speaker voice according to the score values;
calculating the durations of the first speaker voice and the second speaker voice, determining the background voice in the human voice set according to those durations, and deleting the background voice from the human voice set.
In the several embodiments provided in this application, it should be understood that the disclosed device, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division into modules is only a division by logical function, and other divisions are possible in actual implementation.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is evident to those skilled in the art that this application is not limited to the details of the above exemplary embodiments, and that this application can be implemented in other specific forms without departing from its spirit or essential characteristics.
Therefore, from every point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than by the above description, and it is therefore intended that all changes falling within the meaning and scope of equivalents of the claims be embraced in this application. No reference sign in the claims shall be construed as limiting the claim concerned.
Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or apparatuses recited in the system claims may also be implemented by one unit or apparatus through software or hardware. Words such as "first" and "second" denote names only and do not denote any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of this application. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application may be modified or equivalently substituted without departing from the spirit and scope of those technical solutions.

Claims (20)

  1. A noise cancellation method for a voice call, the method comprising:
    performing voice endpoint detection on call audio to obtain a human voice set;
    performing speech feature extraction on the human voice set to obtain a speech feature set;
    extracting, in chronological order from the speech feature set, speech feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain a plurality of speech feature sets to be detected, clustering each of the speech feature sets to be detected, and scoring the resulting clustering results with a preset evaluation algorithm to obtain a score value for each of the speech feature sets to be detected;
    dividing the human voice set into a first speaker voice and a second speaker voice according to the score values; and
    computing durations of the first speaker voice and the second speaker voice, determining a background voice in the human voice set according to the durations of the first speaker voice and the second speaker voice, and deleting the background voice from the human voice set.
  2. The noise cancellation method for a voice call according to claim 1, wherein the performing speech feature extraction on the human voice set to obtain a speech feature set comprises:
    performing pre-emphasis, framing, and windowing on the human voice set to obtain a speech frame sequence;
    obtaining, for each speech frame in the speech frame sequence, a corresponding spectrum through a fast Fourier transform;
    converting the spectrum into a Mel spectrum through a Mel filterbank; and
    performing cepstral analysis on the Mel spectrum to obtain the speech feature set corresponding to the human voice set.
  3. The noise cancellation method for a voice call according to claim 1, wherein the clustering each of the speech feature sets to be detected comprises:
    step a: randomly selecting two feature vectors in the speech feature set to be detected as category centers;
    step b: for each feature vector in the speech feature set to be detected, computing the distance from the feature vector to each of the category centers and clustering the feature vector with the nearest category center, to obtain two initial categories;
    step c: updating the category centers of the two initial categories; and
    step d: repeating steps b and c until the number of iterations reaches a preset iteration threshold, to obtain two standard categories.
  4. The noise cancellation method for a voice call according to claim 1, wherein the dividing the human voice set into a first speaker voice and a second speaker voice according to the score values comprises:
    selecting one of the speech feature sets to be detected and obtaining its score value;
    comparing the score value with a preset score threshold;
    when the score value is greater than the preset score threshold, merging the two standard categories of the selected speech feature set to be detected into a single speech category, computing a category center of the single speech category, and generating the first speaker voice from the single speech category and the category center;
    when the score value is less than or equal to the preset score threshold, generating the first speaker voice and the second speaker voice from the two standard categories; and
    selecting a next speech feature set to be detected, obtaining its score value, and assigning the two standard categories of that speech feature set to be detected to the first speaker voice or the second speaker voice according to the score value.
  5. The noise cancellation method for a voice call according to claim 4, wherein the assigning the two standard categories of the speech feature set to be detected to the first speaker voice or the second speaker voice according to the score value comprises:
    if the score value is greater than the score threshold, merging the two standard categories of the speech feature set to be detected into a single speech category, computing a category center of the single speech category, and assigning the single speech category to the first speaker voice or the second speaker voice according to the cosine distances between the category center of the single speech category and the category centers of the first speaker voice and the second speaker voice; and
    if the score value is less than or equal to the score threshold, assigning the two standard categories to the first speaker voice and the second speaker voice, respectively, according to the cosine distances between the category centers of the two standard categories in the speech feature set to be detected and the category centers of the first speaker voice and the second speaker voice.
  6. The noise cancellation method for a voice call according to claim 5, wherein the assigning comprises:
    merging the single speech category with the first speaker voice or the second speaker voice, recomputing the merged category center, and adding the frame count of the single speech category to the duration of the first speaker voice or the second speaker voice; or
    merging the two standard categories with the first speaker voice and the second speaker voice, respectively, recomputing the merged category centers, and adding the frame counts of the standard categories to the durations of the first speaker voice or the second speaker voice.
  7. The noise cancellation method for a voice call according to any one of claims 1 to 6, wherein the deleting the background voice from the human voice set comprises:
    computing, with a preset duration algorithm, the proportion of the current call occupied by the background voice; and
    when the duration proportion is greater than a preset proportion threshold, deleting the background voice from the human voice set, thereby removing the background voice from the call audio.
  8. A noise cancellation apparatus for a voice call, the apparatus comprising:
    a voice endpoint detection module configured to perform voice endpoint detection on call audio to obtain a human voice set;
    a speech feature extraction module configured to perform speech feature extraction on the human voice set to obtain a speech feature set;
    a clustering and scoring module configured to extract, in chronological order from the speech feature set, speech feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain a plurality of speech feature sets to be detected, to cluster each of the speech feature sets to be detected, and to score the resulting clustering results with a preset evaluation algorithm to obtain a score value for each of the speech feature sets to be detected;
    a voice classification module configured to divide the human voice set into a first speaker voice and a second speaker voice according to the score values; and
    a background voice removal module configured to compute durations of the first speaker voice and the second speaker voice, determine a background voice in the human voice set according to the durations, and delete the background voice from the human voice set.
  9. An electronic device, comprising:
    a memory storing at least one instruction; and
    a processor that executes the instructions stored in the memory to perform the following steps:
    performing voice endpoint detection on call audio to obtain a human voice set;
    performing speech feature extraction on the human voice set to obtain a speech feature set;
    extracting, in chronological order from the speech feature set, speech feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain a plurality of speech feature sets to be detected, clustering each of the speech feature sets to be detected, and scoring the resulting clustering results with a preset evaluation algorithm to obtain a score value for each of the speech feature sets to be detected;
    dividing the human voice set into a first speaker voice and a second speaker voice according to the score values; and
    computing durations of the first speaker voice and the second speaker voice, determining a background voice in the human voice set according to the durations of the first speaker voice and the second speaker voice, and deleting the background voice from the human voice set.
  10. The electronic device according to claim 9, wherein the performing speech feature extraction on the human voice set to obtain a speech feature set comprises:
    performing pre-emphasis, framing, and windowing on the human voice set to obtain a speech frame sequence;
    obtaining, for each speech frame in the speech frame sequence, a corresponding spectrum through a fast Fourier transform;
    converting the spectrum into a Mel spectrum through a Mel filterbank; and
    performing cepstral analysis on the Mel spectrum to obtain the speech feature set corresponding to the human voice set.
  11. The electronic device according to claim 9, wherein the clustering each of the speech feature sets to be detected comprises:
    step a: randomly selecting two feature vectors in the speech feature set to be detected as category centers;
    step b: for each feature vector in the speech feature set to be detected, computing the distance from the feature vector to each of the category centers and clustering the feature vector with the nearest category center, to obtain two initial categories;
    step c: updating the category centers of the two initial categories; and
    step d: repeating steps b and c until the number of iterations reaches a preset iteration threshold, to obtain two standard categories.
  12. The electronic device according to claim 9, wherein the dividing the human voice set into a first speaker voice and a second speaker voice according to the score values comprises:
    selecting one of the speech feature sets to be detected and obtaining its score value;
    comparing the score value with a preset score threshold;
    when the score value is greater than the preset score threshold, merging the two standard categories of the selected speech feature set to be detected into a single speech category, computing a category center of the single speech category, and generating the first speaker voice from the single speech category and the category center;
    when the score value is less than or equal to the preset score threshold, generating the first speaker voice and the second speaker voice from the two standard categories; and
    selecting a next speech feature set to be detected, obtaining its score value, and assigning the two standard categories of that speech feature set to be detected to the first speaker voice or the second speaker voice according to the score value.
  13. The electronic device according to claim 12, wherein the assigning the two standard categories of the speech feature set to be detected to the first speaker voice or the second speaker voice according to the score value comprises:
    if the score value is greater than the score threshold, merging the two standard categories of the speech feature set to be detected into a single speech category, computing a category center of the single speech category, and assigning the single speech category to the first speaker voice or the second speaker voice according to the cosine distances between the category center of the single speech category and the category centers of the first speaker voice and the second speaker voice; and
    if the score value is less than or equal to the score threshold, assigning the two standard categories to the first speaker voice and the second speaker voice, respectively, according to the cosine distances between the category centers of the two standard categories in the speech feature set to be detected and the category centers of the first speaker voice and the second speaker voice.
  14. The electronic device according to claim 13, wherein the assigning comprises:
    merging the single speech category with the first speaker voice or the second speaker voice, recomputing the merged category center, and adding the frame count of the single speech category to the duration of the first speaker voice or the second speaker voice; or
    merging the two standard categories with the first speaker voice and the second speaker voice, respectively, recomputing the merged category centers, and adding the frame counts of the standard categories to the durations of the first speaker voice or the second speaker voice.
  15. The electronic device according to any one of claims 9 to 14, wherein the deleting the background voice from the human voice set comprises:
    computing, with a preset duration algorithm, the proportion of the current call occupied by the background voice; and
    when the duration proportion is greater than a preset proportion threshold, deleting the background voice from the human voice set, thereby removing the background voice from the call audio.
  16. A computer-readable storage medium comprising a data storage area and a program storage area, the data storage area storing created data and the program storage area storing a computer program, wherein the computer program, when executed by a processor, implements the following steps:
    performing voice endpoint detection on call audio to obtain a human voice set;
    performing speech feature extraction on the human voice set to obtain a speech feature set;
    extracting, in chronological order from the speech feature set, speech feature sets to be detected whose cumulative duration equals a preset duration threshold, to obtain a plurality of speech feature sets to be detected, clustering each of the speech feature sets to be detected, and scoring the resulting clustering results with a preset evaluation algorithm to obtain a score value for each of the speech feature sets to be detected;
    dividing the human voice set into a first speaker voice and a second speaker voice according to the score values; and
    computing durations of the first speaker voice and the second speaker voice, determining a background voice in the human voice set according to the durations of the first speaker voice and the second speaker voice, and deleting the background voice from the human voice set.
  17. The computer-readable storage medium according to claim 16, wherein the performing speech feature extraction on the human voice set to obtain a speech feature set comprises:
    performing pre-emphasis, framing, and windowing on the human voice set to obtain a speech frame sequence;
    obtaining, for each speech frame in the speech frame sequence, a corresponding spectrum through a fast Fourier transform;
    converting the spectrum into a Mel spectrum through a Mel filterbank; and
    performing cepstral analysis on the Mel spectrum to obtain the speech feature set corresponding to the human voice set.
  18. The computer-readable storage medium according to claim 16, wherein the clustering each of the speech feature sets to be detected comprises:
    step a: randomly selecting two feature vectors in the speech feature set to be detected as category centers;
    step b: for each feature vector in the speech feature set to be detected, computing the distance from the feature vector to each of the category centers and clustering the feature vector with the nearest category center, to obtain two initial categories;
    step c: updating the category centers of the two initial categories; and
    step d: repeating steps b and c until the number of iterations reaches a preset iteration threshold, to obtain two standard categories.
  19. The computer-readable storage medium according to claim 16, wherein the dividing the human voice set into a first speaker voice and a second speaker voice according to the score values comprises:
    selecting one of the speech feature sets to be detected and obtaining its score value;
    comparing the score value with a preset score threshold;
    when the score value is greater than the preset score threshold, merging the two standard categories of the selected speech feature set to be detected into a single speech category, computing a category center of the single speech category, and generating the first speaker voice from the single speech category and the category center;
    when the score value is less than or equal to the preset score threshold, generating the first speaker voice and the second speaker voice from the two standard categories; and
    selecting a next speech feature set to be detected, obtaining its score value, and assigning the two standard categories of that speech feature set to be detected to the first speaker voice or the second speaker voice according to the score value.
  20. The computer-readable storage medium according to claim 19, wherein the assigning the two standard categories of the speech feature set to be detected to the first speaker voice or the second speaker voice according to the score value comprises:
    if the score value is greater than the score threshold, merging the two standard categories of the speech feature set to be detected into a single speech category, computing a category center of the single speech category, and assigning the single speech category to the first speaker voice or the second speaker voice according to the cosine distances between the category center of the single speech category and the category centers of the first speaker voice and the second speaker voice; and
    if the score value is less than or equal to the score threshold, assigning the two standard categories to the first speaker voice and the second speaker voice, respectively, according to the cosine distances between the category centers of the two standard categories in the speech feature set to be detected and the category centers of the first speaker voice and the second speaker voice.
PCT/CN2020/121571 2020-06-19 2020-10-16 Noise cancellation method and apparatus for voice call, electronic device, and storage medium WO2021151310A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010570483.4 2020-06-19
CN202010570483.4A CN111754982A (zh) 2020-06-19 2020-06-19 Noise cancellation method and apparatus for voice call, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021151310A1 true WO2021151310A1 (zh) 2021-08-05

Family

ID=72675687

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/121571 WO2021151310A1 (zh) Noise cancellation method and apparatus for voice call, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN111754982A (zh)
WO (1) WO2021151310A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111754982A (zh) 2020-06-19 2020-10-09 平安科技(深圳)有限公司 Noise cancellation method and apparatus for voice call, electronic device, and storage medium
CN112700790A (zh) 2020-12-11 2021-04-23 广州市申迪计算机***有限公司 Sound processing method, system and device for an IDC machine room, and computer storage medium
CN113255362B (zh) 2021-05-19 2024-02-02 平安科技(深圳)有限公司 Human voice filtering and recognition method and apparatus, electronic device, and storage medium
CN113572908A (zh) 2021-06-16 2021-10-29 云茂互联智能科技(厦门)有限公司 Noise reduction method, apparatus and system for VoIP calls
CN114070935B (zh) 2022-01-12 2022-04-15 百融至信(北京)征信有限公司 Intelligent outbound-call interruption method and system
CN115394310B (zh) 2022-08-19 2023-04-07 中邮消费金融有限公司 Background voice removal method and system based on a neural network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070033027A1 (en) * 2005-08-03 2007-02-08 Texas Instruments, Incorporated Systems and methods employing stochastic bias compensation and bayesian joint additive/convolutive compensation in automatic speech recognition
CN109065028A (zh) * 2018-06-11 2018-12-21 平安科技(深圳)有限公司 Speaker clustering method and apparatus, computer device, and storage medium
CN109147798A (zh) * 2018-07-27 2019-01-04 北京三快在线科技有限公司 Speech recognition method and apparatus, electronic device, and readable storage medium
CN110136749A (zh) * 2019-06-14 2019-08-16 苏州思必驰信息科技有限公司 Speaker-dependent end-to-end voice endpoint detection method and apparatus
CN110797021A (zh) * 2018-05-24 2020-02-14 腾讯科技(深圳)有限公司 Hybrid speech recognition network training method, hybrid speech recognition method and apparatus, and storage medium
CN111199741A (zh) * 2018-11-20 2020-05-26 阿里巴巴集团控股有限公司 Voiceprint recognition method, voiceprint verification method, apparatus, computing device, and medium
CN111754982A (zh) * 2020-06-19 2020-10-09 平安科技(深圳)有限公司 Noise cancellation method and apparatus for voice call, electronic device, and storage medium


Also Published As

Publication number Publication date
CN111754982A (zh) 2020-10-09

Similar Documents

Publication Publication Date Title
WO2021151310A1 (zh) Noise cancellation method and apparatus for voice call, electronic device, and storage medium
WO2021208287A1 (zh) Voice endpoint detection method and apparatus for emotion recognition, electronic device, and storage medium
WO2019227583A1 (zh) Voiceprint recognition method and apparatus, terminal device, and storage medium
US9202462B2 (en) Key phrase detection
US9230550B2 (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
WO2018113243A1 (zh) Speech segmentation method, apparatus and device, and computer storage medium
WO2021082420A1 (zh) Voiceprint authentication method and apparatus, medium, and electronic device
WO2022116420A1 (zh) Voice event detection method and apparatus, electronic device, and computer storage medium
JP3584458B2 (ja) Pattern recognition apparatus and pattern recognition method
CN105096941A (zh) Speech recognition method and apparatus
CN110634472B (zh) Speech recognition method, server, and computer-readable storage medium
CN109801646B (zh) Voice endpoint detection method and apparatus based on fused features
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
CN110070859B (zh) Speech recognition method and apparatus
WO2021159902A1 (zh) Age recognition method, apparatus and device, and computer-readable storage medium
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN113327586B (zh) Speech recognition method and apparatus, electronic device, and storage medium
CN116547752A (zh) Fake audio detection
CN113112992B (zh) Speech recognition method and apparatus, storage medium, and server
CN113593597B (zh) Voice noise filtering method and apparatus, electronic device, and medium
US10910000B2 (en) Method and device for audio recognition using a voting matrix
WO2021196477A1 (zh) Risk user identification method and apparatus based on voiceprint features and association graph data
CN112735432B (zh) Audio recognition method and apparatus, electronic device, and storage medium
JP2011191542A (ja) Speech classification apparatus, speech classification method, and speech classification program
CN112992175B (zh) Voice distinguishing method and voice recording apparatus thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20916341

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20916341

Country of ref document: EP

Kind code of ref document: A1