CN113571090A - Voiceprint feature validity detection method and device and electronic equipment - Google Patents


Info

Publication number
CN113571090A
Authority
CN
China
Prior art keywords
voice data
voiceprint
processed
target cluster
stationary noise
Prior art date
Legal status
Pending
Application number
CN202110833612.9A
Other languages
Chinese (zh)
Inventor
郭俊龙
贺亚运
陈戈
李美玲
Current Assignee
China Citic Bank Corp Ltd
Original Assignee
China Citic Bank Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Citic Bank Corp Ltd filed Critical China Citic Bank Corp Ltd
Priority claimed from application CN202110833612.9A
Publication of CN113571090A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/27: characterised by the analysis technique
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L25/48: specially adapted for particular use
    • G10L25/51: for comparison or discrimination
    • G10L25/60: for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiments of the present application provide a voiceprint feature validity detection method and device, and an electronic device. The method comprises the following steps: clustering the voiceprint features to be processed to obtain a target cluster, where the proportion of the voiceprint features to be processed that fall in the target cluster is greater than a preset value; and determining the validity of the voiceprint features to be processed based on whether the target cluster contains a voiceprint feature whose distance from the centroid of the target cluster is not less than a preset distance. With this scheme, the validity of voiceprint features can be detected and the validity of the voiceprint features used for identity recognition can be ensured, providing a basis for improving the voiceprint recognition effect.

Description

Voiceprint feature validity detection method and device and electronic equipment
Technical Field
The application relates to the technical field of voice processing, in particular to a voiceprint feature validity detection method and device and electronic equipment.
Background
Voiceprint recognition, i.e., identifying a speaker by his or her voice, is an important application in the field of intelligent speech. It identifies a person mainly by extracting the voiceprint features of the speaker's voice through a voiceprint model.
In actual use, the validity of the voiceprint features may be degraded for various reasons, which in turn affects how well the user's identity is recognized. If the validity of the voiceprint features can be detected, the validity of the features used for identity recognition can be ensured, and the voiceprint recognition effect can be improved.
Disclosure of Invention
The present application aims to solve at least one of the above technical drawbacks. The technical solution adopted by the present application is as follows:
in a first aspect, an embodiment of the present application provides a method for detecting validity of a voiceprint feature, where the method includes:
clustering the voiceprint features to be processed to obtain a target cluster, where the proportion of the voiceprint features to be processed that fall in the target cluster is greater than a preset value;
and determining the validity of the voiceprint features to be processed based on whether the target cluster contains a voiceprint feature whose distance from the centroid of the target cluster is not less than a preset distance.
Optionally, determining the validity of the voiceprint features to be processed based on whether the target cluster contains a voiceprint feature whose distance from the centroid of the target cluster is not less than a preset distance includes:
if the target cluster contains a voiceprint feature whose distance from the centroid of the target cluster is not less than the preset distance, determining that the voiceprint features to be processed are invalid;
and if the target cluster contains no voiceprint feature whose distance from the centroid of the target cluster is not less than the preset distance, determining that the voiceprint features to be processed are valid.
Optionally, the method further includes:
determining whether stationary noise is contained in the first voice data based on a pre-trained stationary noise detection model;
and processing the first voice data to obtain second voice data based on whether the first voice data contains stationary noise.
Optionally, the stationary noise detection model is obtained by model training using pre-configured human voice data as negative samples and sample voice data as positive samples, where the sample voice data includes fused voice data and pre-configured noise data, and the fused voice data is obtained by fusing the human voice data and the noise data.
Optionally, processing the first voice data based on whether the first voice data contains stationary noise, to obtain second voice data, includes:
if the first voice data contains stationary noise, processing the stationary noise in the first voice data to obtain second voice data;
and if the first voice data does not contain stationary noise, taking the first voice data as second voice data.
Optionally, the method further includes:
and processing the second voice data to obtain third voice data based on whether the second voice data contains silence or not.
Optionally, processing the second voice data to obtain third voice data based on whether the second voice data includes silence includes:
if the second voice data contains silence, processing the silence in the second voice data to obtain third voice data;
and if the second voice data does not contain silence, taking the second voice data as third voice data.
Optionally, the method further includes:
dividing the third voice data into at least one piece of fourth voice data based on a preset step size and a preset sliding window;
and extracting the voiceprint features to be processed from the fourth voice data.
Optionally, if it is determined that the voiceprint to be processed is valid, the method further includes:
and removing the voice data corresponding to outlier voiceprint features from the fourth voice data, where the outlier voiceprint features are the voiceprint features outside the target cluster in the clustering result of the voiceprint features to be processed.
In a second aspect, an embodiment of the present application provides an apparatus for detecting validity of a voiceprint feature, where the apparatus includes:
the clustering module is used for clustering the voiceprint features to be processed to obtain a target cluster, where the proportion of the voiceprint features to be processed that fall in the target cluster is greater than a preset value;
and the validity detection module is used for determining the validity of the voiceprint features to be processed based on whether the target cluster contains a voiceprint feature whose distance from the centroid of the target cluster is not less than a preset distance.
Optionally, the validity detection module is specifically configured to:
if the target cluster contains a voiceprint feature whose distance from the centroid of the target cluster is not less than the preset distance, determining that the voiceprint features to be processed are invalid;
and if the target cluster contains no voiceprint feature whose distance from the centroid of the target cluster is not less than the preset distance, determining that the voiceprint features to be processed are valid.
Optionally, the above apparatus further comprises a stationary noise processing module, the stationary noise processing module is specifically configured to:
determining whether stationary noise is contained in the first voice data based on a pre-trained stationary noise detection model;
and processing the first voice data to obtain second voice data based on whether the first voice data contains stationary noise.
Optionally, the stationary noise detection model is obtained by model training using pre-configured human voice data as negative samples and sample voice data as positive samples, where the sample voice data includes fused voice data and pre-configured noise data, and the fused voice data is obtained by fusing the human voice data and the noise data.
Optionally, when processing the first voice data to obtain the second voice data based on whether the first voice data contains stationary noise, the stationary noise processing module is specifically configured to:
if the first voice data contains stationary noise, processing the stationary noise in the first voice data to obtain second voice data;
and if the first voice data does not contain stationary noise, taking the first voice data as second voice data.
Optionally, the apparatus further includes a mute processing module, where the mute processing module is configured to:
and processing the second voice data to obtain third voice data based on whether the second voice data contains silence or not.
Optionally, the mute processing module is specifically configured to:
if the second voice data contains silence, processing the silence in the second voice data to obtain third voice data;
and if the second voice data does not contain silence, taking the second voice data as third voice data.
Optionally, the apparatus further includes a voiceprint feature extraction module, where the voiceprint feature extraction module is configured to:
dividing the third voice data into at least one fourth voice data based on a preset step length and a preset sliding window;
and extracting the voiceprint features to be processed through the fourth voice data.
Optionally, the apparatus further comprises:
and the voice cleaning module is used for removing the voice data corresponding to outlier voiceprint features from the fourth voice data when the voiceprint features to be processed are determined to be valid, where the outlier voiceprint features are the voiceprint features outside the target cluster in the clustering result of the voiceprint features to be processed.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor and a memory;
a memory for storing operating instructions;
a processor, configured to execute the voiceprint feature validity detection method shown in any one of the embodiments of the first aspect of the present application by calling the operating instructions.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for detecting validity of a voiceprint feature shown in any one of the implementation manners of the first aspect of the present application.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
according to the scheme provided by the embodiment of the application, the target cluster is obtained by clustering the voiceprint features to be processed, and the effectiveness of the voiceprint to be processed is determined based on whether the voiceprint features with the distance from the centroid of the target cluster not less than the preset distance exist in the target cluster. Based on the scheme, the voiceprint feature effectiveness can be detected, and the voiceprint feature effectiveness for identity recognition is guaranteed, so that a basis can be provided for improving the voiceprint recognition effect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a method for detecting validity of a voiceprint feature according to an embodiment of the present application;
fig. 2 is a schematic flowchart illustrating a process of processing stationary noise and silence that may be included in first voice data according to an embodiment of the present disclosure;
FIG. 3 is a diagram illustrating voiceprint feature extraction through a sliding window in an embodiment of the present application;
fig. 4 is a schematic flowchart of a specific implementation of a method for detecting validity of a voiceprint feature according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an apparatus for detecting validity of a voiceprint feature according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or to elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present application, and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The process of extracting voiceprint features mainly comprises the following three steps: voice signal preprocessing, valid voice extraction and detection, and voiceprint model feature extraction. Generally, a voiceprint model performs well on a test set, but its performance drops sharply once it is deployed to a production environment. The main reason is that production conditions are complex: the voice signal contains large amounts of randomly occurring silence and is superposed with various stationary and non-stationary noises. Extraction and detection of the valid voice signal is therefore important; a clean and valid voice signal is obtained by removing silence and noise from the voice signal.
In addition, when the voiceprint features contain features representing multiple speakers, the voiceprint features are also invalid, so an accurate recognition result cannot be obtained.
The voiceprint feature validity detection method and device and the electronic device provided by the embodiment of the application aim to solve at least one of the above technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 shows a schematic flow chart of a method for detecting validity of a voiceprint feature provided in an embodiment of the present application, and as shown in fig. 1, the method mainly includes:
step S110: and clustering the voiceprint features to be processed to obtain target clusters, wherein the ratio of the voiceprint features in the target clusters in the voiceprint features to be processed is larger than a preset value.
The voiceprint features to be processed may be a group of voiceprint feature sequences extracted from the voice to be detected. Clustering the voiceprint features to be processed yields a clustering result, the largest class in the result may be determined as the target cluster, and the preset value for the ratio may be specified according to actual needs.
As an example, a spectral clustering algorithm may be used to cluster the voiceprint features to be processed; spectral clustering is a widely used clustering algorithm with stronger adaptability than the traditional k-means algorithm. When clustering the voiceprint features to be processed with spectral clustering, the distance metric may be the cosine distance, and the parameters may be tuned so that the number of voiceprint features in the largest class (i.e., the target cluster) accounts for more than 80% of the total number of voiceprint features to be processed.
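The ratio check in step S110 can be sketched as follows. This is a minimal illustration, assuming the cluster labels come from some clustering run (e.g. spectral clustering); the function name and the 0.8 default are illustrative, not prescribed by the patent:

```python
import numpy as np

def target_cluster(labels, preset_ratio=0.8):
    """Pick the largest cluster and check whether its share of all
    voiceprint features exceeds the preset value (0.8 here)."""
    ids, counts = np.unique(labels, return_counts=True)
    biggest = int(ids[np.argmax(counts)])
    share = counts.max() / len(labels)
    return biggest, bool(share > preset_ratio)
```

For example, 9 of 10 features in cluster 0 gives a share of 0.9, so cluster 0 qualifies as the target cluster; a 6-of-10 split does not.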
Step S120: determining the validity of the voiceprint features to be processed based on whether the target cluster contains a voiceprint feature whose distance from the centroid of the target cluster is not less than a preset distance.
The distance and the preset distance may be cosine distances.
In the embodiment of the present application, the centroid of the target cluster can be determined, the cosine distance between each voiceprint feature in the target cluster and the centroid can be calculated, and each calculated cosine distance can be compared with the preset distance.
In practice, the cosine distance ranges from 0 to 1. If the cosine distance is 0, the voiceprint feature and the centroid of the target cluster can be considered to represent the same speaker; if the cosine distance is 1, they can be considered to represent two different speakers. In this example, the preset distance may be set to 0.4.
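A minimal sketch of this validity check, assuming the voiceprint features of the target cluster are given as row vectors in a NumPy array (the function names and the 0.4 default mirror the example above but are otherwise illustrative):

```python
import numpy as np

def cosine_distance(a, b):
    # cosine distance = 1 - cosine similarity
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def is_valid(cluster_features, preset_distance=0.4):
    """Step S120 sketch: the voiceprint features are valid only if no
    feature in the target cluster lies at a cosine distance of at least
    `preset_distance` from the cluster centroid."""
    centroid = cluster_features.mean(axis=0)
    return all(cosine_distance(f, centroid) < preset_distance
               for f in cluster_features)
```

A cluster of identical features is valid (all distances are 0); a cluster containing a feature far from the centroid is not.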
According to the method provided by the embodiment of the present application, a target cluster is obtained by clustering the voiceprint features to be processed, and the validity of the voiceprint features to be processed is determined based on whether the target cluster contains a voiceprint feature whose distance from the centroid of the target cluster is not less than the preset distance. With this scheme, the validity of voiceprint features can be detected and the validity of the features used for identity recognition can be ensured, providing a basis for improving the voiceprint recognition effect.
In an optional mode of the embodiment of the present application, determining the validity of the voiceprint features to be processed based on whether the target cluster contains a voiceprint feature whose distance from the centroid of the target cluster is not less than a preset distance includes:
if the target cluster contains a voiceprint feature whose distance from the centroid of the target cluster is not less than the preset distance, determining that the voiceprint features to be processed are invalid;
and if the target cluster contains no such voiceprint feature, determining that the voiceprint features to be processed are valid.
In an optional manner of the embodiment of the present application, the method further includes:
determining whether stationary noise is contained in the first voice data based on a pre-trained stationary noise detection model;
and processing the first voice data to obtain second voice data based on whether the first voice data contains stationary noise.
The first voice data may be recorded user voice data, and in an actual situation, stationary noise may exist in the first voice data.
In the embodiment of the application, whether stationary noise is contained in the first voice data or not can be detected through the stationary noise detection model, and the first voice data is processed according to whether stationary noise is contained in the first voice data or not, so that the second voice data is obtained.
In an optional mode of the embodiment of the present application, the stationary noise detection model is obtained by model training using pre-configured human voice data as negative samples and sample voice data as positive samples, where the sample voice data includes fused voice data and pre-configured noise data, and the fused voice data is obtained by fusing the human voice data and the noise data.
In the embodiment of the application, a training set is constructed for the training of the stationary noise detection model.
Specifically, the NOISEX-92 noise data set, which contains 15 different noises in total, can be used. By arranging and combining any subset of the 15 different noises, 32767 (i.e., 2^15 - 1) different pure noises can be synthesized, with the component noises mixed at a power ratio of 0 dB, i.e., at equal power. These 32767 different pure noises are the pre-configured noise data.
The pre-configured human voice data can be clean voice labeled 0, used as the negative samples in the training set. The 32767 different pure noises are fused with the pre-configured human voice data to generate noisy voice at signal-to-noise ratios of 3 dB, 0 dB, -5 dB, -10 dB, -15 dB and -20 dB, i.e., the fused voice data. The sample voice data includes the fused voice data as well as the 32767 different pure noises described above; it is labeled 1 and used as the positive samples in the training set.
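The fusion at a target signal-to-noise ratio can be sketched as follows. This is a common way to mix speech and noise at a given SNR; the patent does not prescribe a specific formula, so the scaling rule here is an assumption:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio of the mix
    equals `snr_db`, then add it to the clean speech."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # after scaling: 10 * log10(p_speech / p_scaled_noise) == snr_db
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Running this over the 32767 pure noises at each of the six SNR levels would yield the fused voice data.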
After the training set is constructed, the stationary noise detection model may be trained using the training set.
Specifically, the stationary noise detection model may be a binary-classification deep learning model, and the model structure may use VGG, ResNet, LSTM, etc., which is not limited here.
In an optional mode of the embodiment of the present application, processing the first voice data to obtain the second voice data based on whether the first voice data contains stationary noise includes:
if the first voice data contains stationary noise, processing the stationary noise in the first voice data to obtain second voice data;
and if the first voice data does not contain stationary noise, taking the first voice data as second voice data.
In the embodiment of the present application, when the first voice data is detected to contain stationary noise, the stationary noise in the first voice data may be processed, for example with a wiener filter, a Gaussian filter, a wavelet filter or the like; the type of filter is not limited in this example.
In an optional manner of the embodiment of the present application, the method further includes:
and processing the second voice data to obtain third voice data based on whether the second voice data contains silence or not.
In the embodiment of the present application, the second voice data obtained after stationary-noise detection and processing may still contain long stretches of silence, which affect the subsequent voiceprint feature extraction and therefore need to be processed.
In an optional manner of the embodiment of the application, processing the second voice data to obtain third voice data based on whether the second voice data includes silence includes:
if the second voice data contains silence, processing the silence in the second voice data to obtain third voice data;
and if the second voice data does not contain silence, taking the second voice data as third voice data.
In this embodiment, when the second voice data is detected to contain silence, the silence may be processed, for example removed using a voice activity detection (VAD) algorithm.
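A minimal energy-based stand-in for the VAD step might look like this. The patent does not fix a particular VAD algorithm, so this frame-energy approach, the frame length, and the threshold are all illustrative:

```python
import numpy as np

def remove_silence(samples, frame_len=160, threshold=1e-3):
    """Drop frames whose mean energy falls below `threshold`;
    the surviving frames are concatenated in order."""
    n = len(samples) // frame_len
    frames = samples[: n * frame_len].reshape(n, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    return frames[energy >= threshold].reshape(-1)
```

Applied to one loud frame followed by one silent frame, only the loud frame survives.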
By processing stationary noise and silence that may be contained in the first voice data, the extracted voiceprint features can be made cleaner and more effective.
As an example, fig. 2 is a schematic flow chart illustrating processing of stationary noise and silence that may be included in the first voice data in a specific implementation manner provided by the embodiment of the present application. As shown in fig. 2, it may be detected whether the first voice data includes stationary noise, and if so, the stationary noise is processed, and then possible silence is removed. If stationary noise is not included, the possible silence can be removed.
In an optional manner of the embodiment of the present application, the method further includes:
dividing the third voice data into at least one fourth voice data based on a preset step length and a preset sliding window;
and extracting the voiceprint features to be processed through the fourth voice data.
In the embodiment of the present application, after the stationary noise and silence possibly contained in the first voice data have been processed to obtain the third voice data, a sliding-window approach can be used to extract the fourth voice data from the third voice data, and the fourth voice data is then input into a voiceprint model to extract the voiceprint features.
As an example, the window size of the preset sliding window may be 1400 ms to 1800 ms, and the preset step size may be half the window length. For the last sliding window, if the remaining voice data occupies less than 75% of the window, that window is discarded; if it occupies 75% or more, the window is zero-padded. Each segment of fourth voice data produced by the sliding window is input into the trained neural-network voiceprint model to obtain a group of voiceprint feature sequences, i.e., the voiceprint features to be processed.
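The windowing rule above can be sketched as follows; `win_ms=1600` is an assumed value within the stated 1400-1800 ms range, and the handling of trailing partial windows follows the 75% rule:

```python
import numpy as np

def sliding_windows(samples, sr=16000, win_ms=1600, min_fill=0.75):
    """Cut speech into windows of `win_ms` with a hop of half a window.
    A trailing partial window is zero-padded when at least `min_fill`
    of it is real speech and discarded otherwise."""
    win = int(sr * win_ms / 1000)
    hop = win // 2
    out, start = [], 0
    while start < len(samples):
        chunk = samples[start : start + win]
        if len(chunk) == win:
            out.append(chunk)
        elif len(chunk) / win >= min_fill:
            out.append(np.pad(chunk, (0, win - len(chunk))))
        start += hop
    return out
```

Each returned window would then be fed to the voiceprint model to produce one feature in the sequence.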
Fig. 3 is a schematic diagram illustrating extraction of voiceprint features through a sliding window in a specific implementation manner of the embodiment of the present application.
As shown in fig. 3, a sliding window of the preset size is slid with a step of half the window length to divide out a group of fourth voice data, and each divided segment of fourth voice data is input into the trained neural-network voiceprint model to obtain a group of voiceprint feature sequences.
In an optional manner of the embodiment of the present application, if it is determined that the voiceprint to be processed is valid, the method further includes:
and removing the voice data corresponding to outlier voiceprint features from the fourth voice data, where the outlier voiceprint features are the voiceprint features outside the target cluster in the clustering result of the voiceprint features to be processed.
In the embodiment of the present application, besides the target cluster, outlier voiceprint features may exist in the clustering result of the voiceprint features to be processed. An outlier voiceprint feature may correspond to a short segment within a long stretch of otherwise stationary voice data, which makes the cosine distance between the feature extracted from that segment and the centroid large. Since the fourth voice data is extracted using a sliding window, the voice segment corresponding to an outlier voiceprint feature may overlap the voice segments corresponding to the target cluster; to preserve as much of the voice signal as possible, only the non-overlapping portion of the segment corresponding to the outlier feature is removed. Clean and valid voice data is thereby obtained.
As an example, fig. 4 is a flowchart illustrating a specific implementation manner of a method for detecting validity of a voiceprint feature provided in an embodiment of the present application.
As shown in fig. 4, the voiceprint features can be extracted through a sliding window, the features are spectrally clustered to obtain the largest class, and validity detection is then performed by calculating the cosine distance between each voiceprint feature in the largest class and the centroid. When the extracted voiceprint features are determined to be valid, the voice segments corresponding to outlier voiceprint features can be removed and the valid voice output. When the extracted voiceprint features are determined to be invalid, the process ends.
Based on the same principle as the method shown in fig. 1, fig. 5 shows a schematic structural diagram of an apparatus for detecting validity of a voiceprint feature provided by an embodiment of the present application, and as shown in fig. 5, the apparatus 20 for detecting validity of a voiceprint feature may include:
the clustering module 210 is configured to perform clustering processing on the voiceprint features to be processed to obtain a target cluster, where the proportion of the voiceprint features in the target cluster among the voiceprint features to be processed is greater than a preset value;
and the validity detection module 220 is configured to determine validity of the voiceprint to be processed based on whether a voiceprint feature whose distance from the centroid of the target cluster is not less than a preset distance exists in the target cluster.
The device provided by the embodiment of the application obtains the target cluster by clustering the voiceprint features to be processed, and determines the validity of the voiceprint to be processed based on whether the target cluster contains a voiceprint feature whose distance from the centroid of the target cluster is not less than a preset distance. Based on this scheme, the validity of voiceprint features can be detected and the validity of the voiceprint features used for identity recognition can be guaranteed, thereby providing a basis for improving the voiceprint recognition effect.
Optionally, the validity detection module is specifically configured to:
if a voiceprint feature whose distance from the centroid of the target cluster is not less than the preset distance exists in the target cluster, determining that the voiceprint to be processed is invalid;
and if no voiceprint feature whose distance from the centroid of the target cluster is not less than the preset distance exists in the target cluster, determining that the voiceprint to be processed is valid.
Optionally, the above apparatus further comprises a stationary noise processing module, which is specifically configured to:
determining whether stationary noise is contained in the first voice data based on a pre-trained stationary noise detection model;
and processing the first voice data to obtain second voice data based on whether the first voice data contains stationary noise.
Optionally, the stationary noise detection model is obtained by performing model training using pre-configured human voice data as a positive sample and using sample voice data as a negative sample, where the sample voice data includes fused voice data and pre-configured noise data, and the fused voice data is obtained by fusing human voice data and noise data.
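The sample construction described above can be sketched as follows. The function names and the SNR value are illustrative assumptions, since the patent does not specify how the human voice data and noise data are fused:

```python
import numpy as np


def mix_at_snr(voice, noise, snr_db):
    """Fuse a voice clip with a noise clip at the requested SNR (in dB)."""
    voice = np.asarray(voice, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(voice)]
    p_voice = np.mean(voice ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # scale the noise so that 10*log10(p_voice / p_scaled_noise) == snr_db
    scale = np.sqrt(p_voice / (p_noise * 10 ** (snr_db / 10)))
    return voice + scale * noise


def build_samples(voice_clips, noise_clips, snr_db=10.0):
    positives = list(voice_clips)                         # pre-configured human voice
    fused = [mix_at_snr(v, n, snr_db) for v, n in zip(voice_clips, noise_clips)]
    negatives = fused + list(noise_clips)                 # fused voice plus pure noise
    return positives, negatives
```

The positive/negative split mirrors the description above: clean human voice trains the "no stationary noise" side, while fused and pure-noise clips train the "contains stationary noise" side.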
Optionally, the stationary noise processing module is configured to, based on whether the stationary noise is included in the first voice data, process the first voice data to obtain second voice data, specifically:
if the first voice data contains stationary noise, processing the stationary noise in the first voice data to obtain second voice data;
and if the first voice data does not contain stationary noise, taking the first voice data as second voice data.
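The patent does not specify how the stationary noise in the first voice data is processed; one common baseline, shown here purely as an illustrative assumption, is spectral subtraction, which estimates the stationary-noise spectrum from the first few frames and subtracts it from every frame:

```python
import numpy as np


def spectral_subtract(x, frame=256, noise_frames=5, floor=0.1):
    """Naive spectral subtraction for roughly stationary noise.

    Estimates the noise magnitude spectrum from the first `noise_frames`
    frames, subtracts it from all frames, and keeps a spectral floor so
    magnitudes never go negative. A sketch only; production systems use
    overlap-add windowing and smoother noise tracking.
    """
    x = np.asarray(x, dtype=float)
    n = len(x) // frame * frame                     # drop the short tail
    spec = np.fft.rfft(x[:n].reshape(-1, frame), axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:noise_frames].mean(axis=0)     # stationary-noise estimate
    clean = np.maximum(mag - noise_mag, floor * mag)
    out = np.fft.irfft(clean * np.exp(1j * phase), n=frame, axis=1)
    return out.ravel()
```

Because the noise is assumed stationary, a spectrum estimated once at the start remains a usable estimate for the rest of the clip, which is what makes this simple scheme workable here.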
Optionally, the apparatus further includes a mute processing module, where the mute processing module is configured to:
and processing the second voice data to obtain third voice data based on whether the second voice data contains silence or not.
Optionally, the mute processing module is specifically configured to:
if the second voice data contains silence, processing the silence in the second voice data to obtain third voice data;
and if the second voice data does not contain silence, taking the second voice data as third voice data.
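A minimal energy-based sketch of the silence-removal step above (the frame size and relative threshold are illustrative assumptions; the patent does not prescribe a particular silence detector):

```python
import numpy as np


def drop_silence(x, frame=160, rel_thresh=0.05):
    """Drop frames whose RMS energy falls below a fraction of the peak RMS."""
    x = np.asarray(x, dtype=float)
    n = len(x) // frame * frame                 # drop the short tail
    frames = x[:n].reshape(-1, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    voiced = rms > rel_thresh * rms.max()       # all-zero input keeps nothing
    return frames[voiced].ravel()
```

If no frame falls below the threshold, the output equals the (tail-trimmed) input, matching the "does not contain silence" branch above.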
Optionally, the apparatus further includes a voiceprint feature extraction module, where the voiceprint feature extraction module is configured to:
dividing the third voice data into at least one fourth voice data based on a preset step length and a preset sliding window;
and extracting the voiceprint features to be processed through the fourth voice data.
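The division described by the two steps above can be sketched as follows; the window and step sizes are illustrative, and a tail shorter than one window is simply dropped here:

```python
def sliding_windows(samples, win, step):
    """Split samples into fixed-length segments with a preset step length.

    Each returned segment is one piece of 'fourth voice data' from which a
    voiceprint feature would be extracted; adjacent segments overlap
    whenever step < win.
    """
    return [samples[i:i + win] for i in range(0, len(samples) - win + 1, step)]
```

For example, for 16 kHz audio a 2 s window with a 0.5 s step would correspond to `win=32000, step=8000`, giving 75% overlap between adjacent segments.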
Optionally, the apparatus further comprises:
and the voice cleaning module is used for removing the voice data corresponding to the outlier voiceprint features from the fourth voice data when the voiceprint to be processed is determined to be valid, wherein the outlier voiceprint features are the voiceprint features other than those in the target cluster in the clustering result of the voiceprint features to be processed.
It can be understood that the modules of the voiceprint feature validity detection apparatus in this embodiment have the functions of implementing the corresponding steps of the voiceprint feature validity detection method in the embodiment shown in fig. 1. These functions may be implemented by hardware, or by hardware executing corresponding software, and the hardware or software includes one or more modules corresponding to the functions described above. The modules may be software and/or hardware, and each module may be implemented independently, or multiple modules may be integrated. For the functional description of each module of the voiceprint feature validity detection apparatus, reference may be made to the corresponding description of the voiceprint feature validity detection method in the embodiment shown in fig. 1, which is not repeated here.
The embodiment of the application provides an electronic device, which comprises a processor and a memory;
a memory for storing operating instructions;
and the processor is used for executing the voiceprint feature validity detection method provided by any embodiment of the application by calling the operation instruction.
As an example, fig. 6 shows a schematic structural diagram of an electronic device to which an embodiment of the present application is applicable. As shown in fig. 6, the electronic device 2000 includes a processor 2001 and a memory 2003, where the processor 2001 is coupled to the memory 2003, for example via a bus 2002. Optionally, the electronic device 2000 may also include a transceiver 2004. It should be noted that, in practical applications, the transceiver 2004 is not limited to one, and the structure of the electronic device 2000 does not constitute a limitation on the embodiments of the present application.
The processor 2001 is applied in the embodiment of the present application to implement the method shown in the above method embodiment. The transceiver 2004 may include a receiver and a transmitter, and is applied in the embodiment of the present application to implement the function of the electronic device communicating with other devices.
The processor 2001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 2001 may also be a combination implementing computing functions, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 2002 may include a path that conveys information between the aforementioned components. The bus 2002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 2002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
The memory 2003 may be a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
Optionally, the memory 2003 is used for storing application program code for performing the disclosed aspects, and is controlled in execution by the processor 2001. The processor 2001 is configured to execute the application program code stored in the memory 2003 to implement the method for detecting the validity of the voiceprint feature provided in any of the embodiments of the present application.
The electronic device provided by the embodiment of the application is applicable to any embodiment of the method, and is not described herein again.
Compared with the prior art, the electronic device obtains the target cluster by clustering the voiceprint features to be processed, and determines the validity of the voiceprint to be processed based on whether the target cluster contains a voiceprint feature whose distance from the centroid of the target cluster is not less than a preset distance. Based on this scheme, the validity of voiceprint features can be detected and the validity of the voiceprint features used for identity recognition can be guaranteed, thereby providing a basis for improving the voiceprint recognition effect.
The embodiment of the present application provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the method for detecting validity of a voiceprint feature shown in the above method embodiment.
The computer-readable storage medium provided in the embodiments of the present application is applicable to any of the embodiments of the foregoing method, and is not described herein again.
Compared with the prior art, the computer-readable storage medium provided by the embodiment of the application obtains the target cluster by clustering the voiceprint features to be processed, and determines the validity of the voiceprint to be processed based on whether the target cluster contains a voiceprint feature whose distance from the centroid of the target cluster is not less than a preset distance. Based on this scheme, the validity of voiceprint features can be detected and the validity of the voiceprint features used for identity recognition can be guaranteed, thereby providing a basis for improving the voiceprint recognition effect.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (12)

1. A voiceprint feature validity detection method is characterized by comprising the following steps:
clustering the voiceprint features to be processed to obtain a target cluster, wherein the proportion of the voiceprint features in the target cluster among the voiceprint features to be processed is greater than a preset value;
and determining the effectiveness of the voiceprint to be processed based on whether the voiceprint features with the distance from the centroid of the target cluster not less than the preset distance exist in the target cluster.
2. The method according to claim 1, wherein the determining the validity of the voiceprint to be processed based on whether a voiceprint feature whose distance from the centroid of the target cluster is not less than a preset distance exists in the target cluster comprises:
if a voiceprint feature whose distance from the centroid of the target cluster is not less than the preset distance exists in the target cluster, determining that the voiceprint to be processed is invalid;
and if no voiceprint feature whose distance from the centroid of the target cluster is not less than the preset distance exists in the target cluster, determining that the voiceprint to be processed is valid.
3. The method of claim 1 or 2, further comprising:
determining whether stationary noise is contained in the first voice data based on a pre-trained stationary noise detection model;
and processing the first voice data to obtain second voice data based on whether the first voice data contains stationary noise or not.
4. The method of claim 3, wherein the stationary noise detection model is obtained by model training with pre-configured human voice data as a positive sample and sample voice data as a negative sample, wherein the sample voice data comprises fused voice data and pre-configured noise data, and the fused voice data is obtained by fusing the human voice data and the noise data.
5. The method of claim 4, wherein processing the first speech data based on whether stationary noise is included in the first speech data to obtain second speech data comprises:
if the first voice data contains stationary noise, processing the stationary noise in the first voice data to obtain second voice data;
and if the first voice data does not contain stationary noise, taking the first voice data as second voice data.
6. The method of claim 5, further comprising:
and processing the second voice data to obtain third voice data based on whether the second voice data contains silence or not.
7. The method of claim 6, wherein the processing the second voice data to obtain third voice data based on whether silence is included in the second voice data comprises:
if the second voice data contains silence, processing the silence in the second voice data to obtain third voice data;
and if the second voice data does not contain silence, taking the second voice data as third voice data.
8. The method of claim 7, further comprising:
dividing the third voice data into at least one fourth voice data based on a preset step length and a preset sliding window;
and extracting the voiceprint features to be processed through the fourth voice data.
9. The method of claim 8, wherein if it is determined that the voiceprint to be processed is valid, the method further comprises:
and removing the voice data corresponding to the outlier voiceprint feature from the fourth voice data, wherein the outlier voiceprint feature is a voiceprint feature except the target cluster in the clustering result of the to-be-processed voiceprint feature.
10. An apparatus for detecting validity of a voiceprint feature, comprising:
the clustering module is used for clustering the voiceprint features to be processed to obtain a target cluster, wherein the proportion of the voiceprint features in the target cluster among the voiceprint features to be processed is greater than a preset value;
and the validity detection module is used for determining the validity of the voiceprint to be processed based on whether a voiceprint feature whose distance from the centroid of the target cluster is not less than a preset distance exists in the target cluster.
11. An electronic device comprising a processor and a memory;
the memory is used for storing operation instructions;
the processor is used for executing the method of any one of claims 1-9 by calling the operation instruction.
12. A computer-readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when being executed by a processor, carries out the method of any one of claims 1-9.
CN202110833612.9A 2021-07-23 2021-07-23 Voiceprint feature validity detection method and device and electronic equipment Pending CN113571090A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110833612.9A CN113571090A (en) 2021-07-23 2021-07-23 Voiceprint feature validity detection method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110833612.9A CN113571090A (en) 2021-07-23 2021-07-23 Voiceprint feature validity detection method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113571090A true CN113571090A (en) 2021-10-29

Family

ID=78166524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110833612.9A Pending CN113571090A (en) 2021-07-23 2021-07-23 Voiceprint feature validity detection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113571090A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109256137A (en) * 2018-10-09 2019-01-22 深圳市声扬科技有限公司 Voice acquisition method, device, computer equipment and storage medium
US20190341068A1 (en) * 2018-05-02 2019-11-07 Melo Inc. Systems and methods for processing meeting information obtained from multiple sources
CN110648670A (en) * 2019-10-22 2020-01-03 中信银行股份有限公司 Fraud identification method and device, electronic equipment and computer-readable storage medium
CN111429920A (en) * 2020-03-30 2020-07-17 北京奇艺世纪科技有限公司 User distinguishing method, user behavior library determining method, device and equipment
CN111462761A (en) * 2020-03-03 2020-07-28 深圳壹账通智能科技有限公司 Voiceprint data generation method and device, computer device and storage medium
CN111489262A (en) * 2020-06-15 2020-08-04 太平金融科技服务(上海)有限公司 Policy information detection method and device, computer equipment and storage medium

Non-Patent Citations (1)

Title
Abu Ela Hassani (阿布•埃拉•哈桑尼), National Defense Industry Press *

Similar Documents

Publication Publication Date Title
CN108010515B (en) Voice endpoint detection and awakening method and device
CN109473123B (en) Voice activity detection method and device
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN105741838B (en) Voice awakening method and device
CN110265037B (en) Identity verification method and device, electronic equipment and computer readable storage medium
CN109448746B (en) Voice noise reduction method and device
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN106504768A (en) Phone testing audio frequency classification method and device based on artificial intelligence
CN110265035B (en) Speaker recognition method based on deep learning
CN111540342B (en) Energy threshold adjusting method, device, equipment and medium
CN110808030B (en) Voice awakening method, system, storage medium and electronic equipment
CN113646833A (en) Voice confrontation sample detection method, device, equipment and computer readable storage medium
CN110797031A (en) Voice change detection method, system, mobile terminal and storage medium
CN110570871A (en) TristouNet-based voiceprint recognition method, device and equipment
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
KR20150105847A (en) Method and Apparatus for detecting speech segment
CN109377984B (en) ArcFace-based voice recognition method and device
CN109065026B (en) Recording control method and device
US10910000B2 (en) Method and device for audio recognition using a voting matrix
CN112289311B (en) Voice wakeup method and device, electronic equipment and storage medium
CN116564315A (en) Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium
CN113571090A (en) Voiceprint feature validity detection method and device and electronic equipment
CN114420136A (en) Method and device for training voiceprint recognition model and storage medium
CN112509556B (en) Voice awakening method and device
CN115457973A (en) Speaker segmentation method, system, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination