CN110880325B - Identity recognition method and equipment - Google Patents


Info

Publication number: CN110880325B
Application number: CN201811031757.1A
Authority: CN (China)
Prior art keywords: data segment, sub-dimension, speaker, voice data
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN110880325A (en)
Inventor: 张立斌
Original and current assignee: Huawei Technologies Co Ltd

Application filed by Huawei Technologies Co Ltd
Priority to CN201811031757.1A
Publication of CN110880325A
Application granted
Publication of CN110880325B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L 17/22: Interactive procedures; Man-machine interfaces

Abstract

The application discloses an identity recognition method and identity recognition equipment. The method comprises the following steps: obtaining at least one initial voice data segment of a speaker; performing data transformation on each initial voice data segment from a sound source dimension and/or a channel dimension to obtain at least one extended voice data segment; performing voiceprint feature extraction on the at least one initial voice data segment and the at least one extended voice data segment respectively to obtain at least one initial voiceprint feature data segment and at least one extended voiceprint feature data segment; and storing the at least one initial voiceprint feature data segment and the at least one extended voiceprint feature data segment in the identity recognition device in correspondence with the identity of the speaker. With this technical scheme, the coverage of the stored voiceprint feature data segments is enlarged, which improves the fault tolerance of voiceprint feature data comparison and thus the accuracy of identity recognition.

Description

Identity recognition method and equipment
Technical Field
The embodiment of the application relates to the technical field of identity recognition, in particular to an identity recognition method and equipment.
Background
A voiceprint (voice print) is a biometric attribute that, like a fingerprint, is unique to the speaker, so the identity of a speaker can be determined from the voiceprint. Identifying identity from a voiceprint mainly comprises two parts: voiceprint feature data (voice print feature data) registration and voiceprint feature data comparison. Voiceprint feature data registration is the creation of a voiceprint feature database: voice data of each candidate speaker is collected, voiceprint feature data is extracted from each voice data segment, and the voiceprint feature data is stored as the voiceprint feature data of the corresponding candidate speaker. Voiceprint feature data comparison receives voice data input by a speaker to be recognized and extracts voiceprint feature data from it, obtaining the voiceprint feature data to be recognized. If the difference or distance between the voiceprint feature data to be recognized and some registered voiceprint feature data is smaller than a preset threshold, the two belong to the same speaker, and the speaker to be recognized can be identified as the speaker corresponding to the registered voiceprint feature data.
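The comparison step can be pictured with a minimal Python sketch; it assumes voiceprint feature data are fixed-length NumPy vectors, that the distance is Euclidean, and that the threshold value is purely illustrative:

```python
import numpy as np

def is_same_speaker(probe: np.ndarray, enrolled: np.ndarray,
                    threshold: float = 5.0) -> bool:
    """Decide whether two voiceprint feature vectors belong to the same
    speaker by thresholding their Euclidean distance.
    The threshold value is illustrative, not from the patent."""
    return np.linalg.norm(probe - enrolled) < threshold
```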
Because it is difficult to ensure that the speaking state and/or the collection environment of the speaker during registration are exactly the same as those during identification, the voiceprint feature data to be recognized and the registered voiceprint feature data of the same speaker may differ greatly, causing identification errors and reducing the accuracy of identification.
Disclosure of Invention
The embodiments of the application provide an identity recognition method and identity recognition equipment, which can correctly recognize the identity of a speaker even when the speaking state and/or the collection environment during registration differ from those during identification, thereby improving the accuracy of identity recognition.
In a first aspect, an embodiment of the present application provides an identity recognition method, which includes,
obtaining at least one initial voice data segment of a first speaker;
performing data transformation on each initial voice data segment in the at least one initial voice data segment from a sound source dimension and/or a channel dimension to obtain at least one extended voice data segment;
performing voiceprint feature extraction on the at least one initial voice data segment and the at least one extended voice data segment respectively to obtain at least one initial voiceprint feature data segment and at least one extended voiceprint feature data segment;
storing the at least one initial voiceprint feature data segment and the at least one extended voiceprint feature data segment in a storage device in correspondence with the identity of the first speaker.
In a typical identification method, in the voiceprint feature data registration stage, the identification device acquires only one voice data segment for one speaker. In the technical scheme of the application, the identity recognition equipment can obtain one or more voice data segments of the speaker, and each of the at least one initial voice data segment corresponds to one speaking state of the speaker and one collection environment state.
Factors that can change the voiceprint feature data include multiple items of attribute information of the sound source and the channel. Based on this, in the embodiment of the application, each initial voice data segment in the at least one initial voice data segment undergoes data transformation from a sound source dimension and/or a channel dimension, yielding at least one extended voice data segment that reflects the influence of the sound source dimension and/or the channel dimension, so as to obtain voice data of the speaker under various possible conditions.
With this implementation, the identity recognition device takes the voice data segment captured in the speaking state (such as a normal speech-rate state) and/or channel environment (such as a quiet channel environment) in which the speaker input it as the initial voice data segment, and simulates, by data transformation, at least one extended voice data segment under a plurality of different speaking states (such as a fast speech-rate state or a high-pitch state) and/or a plurality of different channel environments (such as a noisy environment). In this way, voiceprint feature data segments of the speaker under various speaking states and/or various channel environments can be simulated, the coverage of the voiceprint feature data segments registered for the speaker is expanded, and the accuracy of identity recognition is ultimately improved.
In an optional design, after storing the at least one initial voiceprint feature data segment and the at least one extended voiceprint feature data segment in correspondence with the identity of the speaker in an identification device, the method further comprises:
acquiring a voice data segment to be recognized input by a second speaker;
performing voiceprint feature extraction on the voice data segment to be recognized to obtain a voiceprint feature data segment to be recognized;
calculating a feature distance value, wherein the feature distance value is a distance value between the voiceprint feature data segment to be recognized and the at least one initial voiceprint feature data segment and at least one extended voiceprint feature data segment that are stored in the storage device in correspondence with the identity of the first speaker;
determining whether the second speaker is the same as the first speaker according to the feature distance value.
After storing the plurality of voiceprint feature data segments of the first speaker, the identity recognition device may receive a voice data segment of a second speaker (a speaker to be recognized), and further extract a voiceprint feature data segment to be recognized corresponding to the voice data segment. Then, the identity recognition device calculates a distance value between the voiceprint feature data segment to be recognized and each stored voiceprint feature data segment, and determines whether the second speaker (the speaker to be recognized) is the same as the speaker (the first speaker) corresponding to the corresponding stored voiceprint feature data segment according to a relation between the distance value and a preset threshold.
With this implementation, the stored voiceprint feature data segments include voiceprint feature data segments of the speaker under various speaking states and/or various channel environments, so the identity recognition equipment can correctly recognize the identity of the speaker to be recognized, improving the accuracy of identity recognition.
In an alternative design, the calculating the characteristic distance value includes:
for each feature data segment in the at least one initial voiceprint feature data segment and the at least one extended voiceprint feature data segment, calculating the distance value between that feature data segment and the voiceprint feature data segment to be recognized, to obtain a plurality of distance values;
selecting a minimum value of the plurality of distance values as the characteristic distance value.
In an alternative design, the calculating the characteristic distance value includes:
calculating the average value of the at least one initial voiceprint feature data segment and the at least one extended voiceprint feature data segment to obtain an average feature data segment;
and calculating the distance value between the average feature data segment and the voiceprint feature data segment to be recognized, and taking that distance value as the feature distance value.
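Both feature-distance strategies above can be sketched as follows; this is an illustrative reading that assumes the voiceprint feature data segments are fixed-length NumPy vectors:

```python
import numpy as np

def feature_distance_min(probe, enrolled_segments):
    """Design 1: distance to every stored segment, keep the minimum."""
    return min(np.linalg.norm(probe - seg) for seg in enrolled_segments)

def feature_distance_mean(probe, enrolled_segments):
    """Design 2: distance to the element-wise average of the stored segments."""
    average_segment = np.mean(np.stack(enrolled_segments), axis=0)
    return np.linalg.norm(probe - average_segment)
```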
In an alternative design, the determining whether the second speaker is the same as the first speaker based on the feature distance values includes:
If the characteristic distance value is smaller than a preset threshold value, determining that the second speaker is the same as the first speaker;
and if the feature distance value is greater than or equal to the preset threshold, determining that the second speaker is different from the first speaker.
In an optional design, the data transformation of each initial speech data segment in the at least one initial speech data segment from a sound source dimension and/or a channel dimension to obtain at least one extended speech data segment specifically includes:
selecting M sub-dimensions from the sound source dimension and/or the channel dimension, wherein M is a positive integer;
and performing data transformation on each initial voice data segment in the at least one initial voice data segment from the M sub-dimensions to obtain the at least one extended voice data segment.
Each item of attribute information of the sound source and the channel corresponds to a sub-dimension. Based on this, the identification device may select M different sub-dimensions from the sound source dimension and/or the channel dimension, and then perform data transformation on each initial voice data segment in the at least one initial voice data segment from the M sub-dimensions to obtain the at least one extended voice data segment.
With this implementation, the identity recognition equipment can simulate the voiceprint feature data segment of the speaker under the influence of any one of these sub-dimensions. Therefore, when the speaker is identified, even if the voice data segment to be recognized input by the speaker is inconsistent with the voice data segment input when the speaker registered the voiceprint feature data segment, in at least one aspect of speech rate, pitch, bandwidth, codec, noise and reverberation, the identity recognition device can still accurately identify the speaker, improving the accuracy of identity recognition.
In an alternative design, the data transforming each initial speech data segment of the at least one initial speech data segment from the M sub-dimensions to obtain the at least one extended speech data segment includes:
performing data transformation on each initial voice data segment in the at least one initial voice data segment from the M sub-dimensions respectively to obtain at least M transformed data segments corresponding to each initial voice data segment;
and determining at least M conversion data segments corresponding to each initial voice data segment as an extended voice data segment in the at least one extended voice data segment.
By adopting the implementation mode, the identity recognition equipment can simulate at least one voiceprint characteristic data segment after the speaker is influenced by any one sub-dimension. Therefore, when the speaker is identified, even if the voice data segment to be identified input by the speaker is different from the voice data segment input when the voiceprint characteristic data segment is registered by the speaker in any one aspect of speech speed, tone, bandwidth, coding and decoding, noise and reverberation, the identity identification device can still accurately identify the speaker, and the accuracy of identity identification is improved.
In an optional design, the data transforming each initial speech data segment of the at least one initial speech data segment from the M sub-dimensions to obtain the at least one extended speech data segment further includes:
determining each of the M sub-dimensions as a target sub-dimension;
performing data transformation on each target transformed data segment from the target sub-dimension to obtain a plurality of combined transformed data segments, wherein the target transformed data segments refer to voice data segments subjected to data transformation from M-1 sub-dimensions, and the M-1 sub-dimensions refer to sub-dimensions except the target sub-dimension in the M sub-dimensions;
determining the plurality of combined transformed data segments as an extended speech data segment of the at least one extended speech data segment.
By adopting the implementation mode of the embodiment, the identity recognition equipment can perform data transformation once on each initial voice data segment in the at least one initial voice data segment from each sub-dimension in the M sub-dimensions, and simulate the extended voice data segment influenced by any plurality of sub-dimensions. Therefore, when the speaker is identified, even if the voice data segment to be identified input by the speaker is different from the voice data segment input when the voiceprint characteristic data segment is registered by the speaker in any multiple aspects of speech speed, tone, bandwidth, coding and decoding, noise and reverberation, the identity identification device can still accurately identify the speaker, and the accuracy of identity identification is improved.
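One way to realize the single-sub-dimension and combined transformations described above is to apply every non-empty combination of the M sub-dimension transforms in sequence. The sketch below assumes each transform is a pure function on a voice data segment; the composition order within a combination is an assumption, since the text does not fix it:

```python
from itertools import combinations

def expand_segment(segment, transforms):
    """Generate extended segments from every non-empty combination of
    sub-dimension transforms. `transforms` maps a sub-dimension name to
    a function taking and returning a voice data segment."""
    expanded = []
    names = list(transforms)
    for r in range(1, len(names) + 1):        # r = 1 gives the M single
        for combo in combinations(names, r):  # transforms; r > 1, combos
            data = segment
            for name in combo:
                data = transforms[name](data)  # compose in listed order
            expanded.append((combo, data))
    return expanded
```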
In an alternative design, data transformation of a data segment to be transformed from any of the M sub-dimensions includes:
acquiring at least one transformation parameter corresponding to the sub-dimension, wherein the transformation parameter indicates a data transformation amount corresponding to the sub-dimension;
and performing data transformation on the data segment to be transformed according to each transformation parameter in the at least one transformation parameter.
The attribute parameters corresponding to some sub-dimensions of the sound source dimension and the channel dimension may fluctuate. To widen the coverage of the extended voice data segments, the identity recognition equipment sets a plurality of transformation parameters for each sub-dimension whose attribute parameter fluctuates. When the identity recognition device performs data transformation on a data segment to be transformed from any one of the M sub-dimensions, it may transform the data segment according to each transformation parameter corresponding to that sub-dimension.
By adopting the implementation mode of the embodiment, the identity recognition equipment can respectively simulate the voice data segments under various influence degrees of one sub-dimension. Therefore, when the speaker is identified, even if the voice data segment to be identified input by the speaker is different from the voice data segment input when the voiceprint characteristic data segment is registered by the speaker in terms of any one of the aspects of speech speed, tone, bandwidth, coding and decoding, noise and reverberation to different degrees, the identity identification device can still accurately identify the speaker, and the accuracy of identity identification is improved.
In an alternative design, the sound source dimensions include a speech rate sub-dimension and a pitch sub-dimension, and the channel dimensions include a bandwidth sub-dimension, a codec sub-dimension, a noise sub-dimension, and a reverberation sub-dimension.
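A transformation-parameter table for these six sub-dimensions might look like the following sketch. Every name and value here is an illustrative assumption (the text gives no concrete parameter values); it is shown only to illustrate that fluctuating sub-dimensions carry several parameters each:

```python
# Hypothetical per-sub-dimension transformation parameters.
TRANSFORM_PARAMS = {
    "speech_rate": [0.8, 0.9, 1.1, 1.2],  # time-stretch factors
    "pitch":       [-2, -1, 1, 2],        # fundamental-frequency shifts (semitones)
    "bandwidth":   ["narrow_to_wide"],    # one preset direction per device
    "codec":       ["amr", "opus"],       # encode/decode round trips
    "noise":       [10, 20, 30],          # SNR (dB) of added noise
    "reverb":      [0.3, 0.6],            # simulated reverberation times (s)
}
```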
In a second aspect, an embodiment of the present application provides an identity recognition apparatus, which includes a module configured to execute the method steps in the first aspect and the implementation manners of the first aspect.
In a third aspect, an embodiment of the present application provides an identity recognition device, which includes a transceiver, a processor, and a memory. The transceiver, the processor and the memory can be connected through a bus system. The memory is for storing a program, instructions or code, and the processor is for executing the program, instructions or code in the memory to perform the method of the first aspect, or any one of the possible designs of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the method of the first aspect or any possible design of the first aspect.
In order to solve the problem of low identification accuracy, in the voiceprint feature data registration stage of the embodiments of the application, the identification device performs data transformation on each initial voice data segment in at least one initial voice data segment input by a speaker, from a sound source dimension and/or a channel dimension, to obtain extended voice data segments that each correspond to one speaking state of the speaker and/or one acquisition environment. Based on this, the identification device can extract and store the voiceprint feature data segments corresponding to the speaker in at least one speaking state and/or acquisition environment. Therefore, in the voiceprint feature data comparison stage, even if the voiceprint feature data segment of the speaker to be recognized is influenced by the speaking state and/or the collection environment, the identity recognition equipment can still accurately recognize the identity of the speaker to be recognized from the stored voiceprint feature data segments. In other words, in the technical scheme of the embodiments of the application, the registration stage simulates and stores the voiceprint feature data segments of the speaker in various speaking states and/or various acquisition environments, which enlarges the coverage of the stored voiceprint feature data segments, improves the fault tolerance of voiceprint feature data comparison, and improves the accuracy of identity recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It is apparent that those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a method flowchart of an identity recognition method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of the spectrum of a narrowband voice data segment provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of the spectrum of the narrowband voice data segment shown in FIG. 2 after bandwidth sub-dimension transformation;
FIG. 4 is a scene schematic diagram of a real-time conference scene provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a connection structure of a wideband codec provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of one implementation of an identification device provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of another implementation of an identification device provided in an embodiment of the present application.
Detailed Description
The technical scheme of the embodiments of the application provides an identity recognition method and equipment. The identity recognition method identifies the identity of a user through voiceprints, and can be applied to scenarios such as real-time conferences, smart homes, access control, blacklists, and remote monitoring and retrieval. When the identity recognition method is applied to different scenarios, the identity recognition equipment can be arranged adaptively in the system of the corresponding scenario. For example, when the identity recognition method is applied to a real-time conference scenario, the identity recognition device may be arranged in a conference system; when applied to a smart home scenario, in the smart system of the smart home; when applied to an access control scenario, in the access control system; and when applied to a blacklist, remote monitoring or retrieval scenario, in a communication system.
Taking a real-time conference scenario as an example, the real-time conference scenario includes 3 participants. And the identity identification equipment correspondingly stores the identity of each participant in the 3 participants and the voiceprint feature data of the participants. In the process of a conference, the identity recognition equipment acquires the voice data of any one participant, determines whether the participant is one of the 3 participants or not according to the distance between the voiceprint feature data corresponding to the voice data and each voiceprint feature data stored in the identity recognition equipment, and if the participant is one of the 3 participants, the information of the corresponding participant can be determined and displayed.
In this embodiment of the present application, the process of storing the identity of each of the 3 participants and the voiceprint feature data of the participants includes: the identity recognition equipment expands the initial voice data segment input by the participant from the sound source dimension and/or the channel dimension to obtain at least one extended voice data segment. Then, the identity recognition equipment extracts a corresponding voiceprint feature data segment from each voice data segment, and stores all the extracted voiceprint feature data segments in correspondence with the identity of the participant.
Therefore, in the technical scheme of the embodiment of the application, in the voiceprint feature data registration stage, the voiceprint feature data segments of the speaker in various speaking states and/or in various acquisition environments are simulated, and the voiceprint feature data segments of the speaker in various speaking states and/or in various acquisition environments can be stored, so that the fault tolerance rate of voiceprint feature data comparison can be improved, and the accuracy of identity identification can be improved.
In a typical identification method, by contrast, the identity of a speaker corresponds to only one voiceprint feature data segment in the identification device. Since voiceprint feature data changes under the influence of the speaker's speaking state and the voice data collection environment, the voiceprint feature data segment corresponding to each speaker's identity in such a device can only be regarded as the voiceprint feature data under a single speaking state and a single acquisition environment.
The embodiments of the present application will be described below with reference to the accompanying drawings.
Example one
The embodiment one provides an identity recognition method. Fig. 1 is a flowchart of a method of an identity recognition method according to an embodiment of the present application. The identity recognition method shown in fig. 1 includes the following steps:
Step S101, the identification device obtains at least one initial speech data segment of the first speaker.
The initial voice data segment (voice segment) refers to a voice data segment obtained by the identity recognition device from the first speaker. The voice data segment may be collected directly from the first speaker by the identification device, or collected from the first speaker by a device of the first speaker (e.g., a mobile phone, a tablet computer, a personal computer, a conference terminal, etc.) and transmitted via the Internet.
It should be appreciated that in a typical identification method, the identification device only obtains one voice data segment for one speaker during the registration phase of the voiceprint feature data. According to the technical scheme of the embodiment of the application, the identity recognition equipment can acquire one or more voice data sections of the first speaker. The process of the identification device obtaining at least one initial speech data segment of said first speaker may comprise steps S1011 and S1012.
In step S1011, the identification apparatus collects at least one piece of speech information of the first speaker.
Wherein the identification device may collect voice information of the first speaker through at least one of a mobile phone, a fixed phone, a Personal Computer (PC), and a microphone. Generally, the voice information collected by the identity recognition device through mobile phones, fixed phones, microphones and other devices is analog voice information.
In order to obtain voice data segments under as many speaking states and collection environments as possible, in this embodiment the speech rate and intonation adopted by the first speaker differ between the pieces of voice information, and the hardware devices the first speaker uses to input the pieces of voice information also differ.
Step S1012, the identification device digitizes the at least one piece of speech information to obtain at least one initial speech data segment of the first speaker.
After the at least one piece of voice information is acquired, the identification device needs to transmit each piece of voice information in the at least one piece of voice information from the acquisition function module to other function modules in the identification device to execute subsequent operations.
According to step S1011, the at least one piece of voice information is analog information, and analog information is inconvenient for storage, transmission, voiceprint feature extraction, and other processing. Based on this, the identity recognition device performs digitization (digitize) processing on each piece of voice information in the at least one piece of voice information to obtain the at least one initial voice data segment.
According to the description of step S1011, the at least one initial speech data segment corresponds to a speaking status and an acquisition environment status of the first speaker, respectively.
In an alternative embodiment of the present application, the digitization process refers to converting each piece of voice information into a voice data segment by means of sampling quantization.
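The quantization half of this digitization step can be sketched as follows, assuming the sampled waveform arrives as floats in [-1, 1] and that 16-bit PCM is the target format (an assumption; the text does not specify the sample format):

```python
import numpy as np

def quantize_to_pcm16(waveform: np.ndarray) -> np.ndarray:
    """Quantize a sampled waveform in [-1.0, 1.0] to 16-bit PCM samples."""
    clipped = np.clip(waveform, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)
```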
Step S102, the identity recognition device carries out data transformation on each initial voice data segment in the at least one initial voice data segment from a sound source dimension and/or a channel dimension to obtain at least one expanded voice data segment.
Wherein, the factors capable of changing the voiceprint characteristics comprise a plurality of factors in sound source dimension and channel dimension. Based on this, in the embodiment of the present application, the identification device performs data transformation on each initial voice data segment in the at least one initial voice data segment from the sound source dimension and/or the channel dimension, so as to simulate the voice data segment affected by the sound source dimension and/or the channel dimension according to the initial voice data segment.
Specifically, each item of attribute information of the sound source and the channel corresponds to a sub-dimension. The identification device may select M sub-dimensions from the sound source dimension and/or the channel dimension, and then perform data transformation on each initial voice data segment of the at least one initial voice data segment from the M sub-dimensions to obtain the at least one extended voice data segment. M is a positive integer, and each of the M sub-dimensions can trigger a change of the voiceprint features. In an alternative example of the present application, the sound source dimension includes a speech rate sub-dimension and a pitch sub-dimension, and the channel dimension includes a bandwidth sub-dimension, a codec sub-dimension, a noise sub-dimension, and a reverberation sub-dimension.
It should be understood that the identification device is provided with an initial parameter corresponding to each of the M sub-dimensions. The identification device may predetermine the data transformation method of the corresponding sub-dimension according to the initial parameter of each of the M sub-dimensions. Based on this, the identification device performs data transformation on each of the at least one initial voice data segment from M sub-dimensions, specifically, performs data transformation on each of the at least one initial voice data segment according to a preset data transformation method.
Taking the bandwidth sub-dimension as an example: in an optional example, if the sampling bandwidth of the identification device is wideband, the identification device determines wideband as the initial parameter of the bandwidth sub-dimension, and accordingly determines a method of transforming an initial voice data segment from wideband to narrowband as the transformation method of the bandwidth sub-dimension. In another optional example, the sampling bandwidth of the identification device is narrowband, and a method of transforming an initial voice data segment from narrowband to wideband is determined as the transformation method of the bandwidth sub-dimension. The determination of the transformation methods of the other sub-dimensions is similar and is not described in detail here.
In addition, it should be noted that attribute parameters corresponding to a part of the sub-dimensions in the sound source dimension and/or the channel dimension may fluctuate. For example, in the noise sub-dimension, the noise may become larger or smaller. In the speech rate sub-dimension, the speech rate may be faster or slower. Based on this, in order to simulate the extended voice data segment in more scenes, the identity recognition device may set a plurality of transformation parameters corresponding to the sub-dimensions of the up-and-down floating attribute parameters. Wherein the transformation parameters indicate the data transformation amount corresponding to the respective sub-dimensions.
Step S103, the identity recognition device respectively extracts the voiceprint characteristics of the at least one initial voice data segment and the at least one expanded voice data segment to obtain at least one initial voiceprint characteristic data segment and at least one expanded voiceprint characteristic data segment.
Each voiceprint feature data segment is a voiceprint feature vector.
Each voice data segment carries a corresponding voiceprint feature. The voiceprint feature includes multiple items of feature data, such as sound intensity, formant frequency, trend, waveform, wavelength, and several spectrograms of the same character, word or sentence. Some of these feature data, such as formant frequencies, waveforms, and wavelengths, can be read by the identification device directly from the corresponding voice data segment. Feature data that cannot be read directly from the voice data segment, for example the spectrograms of the same character, word or sentence, can be determined by the identity recognition device from the corresponding data in the voice data segment; this is not described in detail here.
Each item of feature data in the voiceprint feature indicates one feature, and it is difficult for the identity recognition device to judge how close two voiceprint features are directly from such feature data. For this reason, the identity recognition device can convert the multiple items of feature data of a voiceprint feature into a feature vector corresponding to that voiceprint feature, and then calculate a distance value between two voiceprint features from their corresponding feature vectors. In this scheme, the feature vector corresponding to a voiceprint feature is called a voiceprint feature vector.
In an optional example of the present application, the identity recognition device may convert the multiple items of feature data of a voiceprint feature into a voiceprint feature vector through a preset voiceprint feature vector model. The voiceprint feature vector model can be obtained by training in advance on the voiceprint features of a large number of different speakers collected in different acquisition environments.
It should be understood that in one possible implementation of the present application, only one voiceprint feature model, namely the voiceprint feature vector model, is provided in the identity recognition device. The identity recognition device therefore performs voiceprint feature extraction on every acquired voice data segment and obtains voiceprint feature data segments with the same parameter attributes.
For example, in an alternative example of the present application, a voiceprint feature vector model is provided in the identification device. In this step, the identity recognition device processes each of the at least one initial voice data segment and the at least one extended voice data segment through the voiceprint feature vector model, so the resulting at least one initial voiceprint feature data segment and at least one extended voiceprint feature data segment are all voiceprint feature vectors.
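As a toy stand-in for such a voiceprint feature vector model (the real model would be trained on many speakers, as described above), the following sketch maps any voice data segment to a fixed-length vector; the spectral features and random projection are illustrative only and have no recognition power:

```python
import numpy as np

class ToyVoiceprintModel:
    """Illustrative stand-in for a trained voiceprint feature vector model."""

    def __init__(self, dim: int = 8, n_bins: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.projection = rng.standard_normal((n_bins, dim))

    def embed(self, segment: np.ndarray) -> np.ndarray:
        # Log-magnitude spectrum as crude features, then a fixed linear
        # projection down to a fixed-length voiceprint feature vector.
        n_bins = self.projection.shape[0]
        spectrum = np.abs(np.fft.rfft(segment, n=2 * (n_bins - 1)))
        return np.log1p(spectrum) @ self.projection
```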
Step S104, the identification device stores the at least one initial voiceprint characteristic data segment and the at least one expanded voiceprint characteristic data segment in a storage device corresponding to the identification of the first speaker.
The identification device can set an identity for the first speaker to identify the first speaker. The identity can be any one of an identity card number, a mobile phone number, and an email address, or other information capable of identifying the speaker; the embodiments of the present application are not limited in this respect.
The storage device may be deployed on the same physical device as the identification device, or may be deployed on a different physical device. When the storage device and the identification device are deployed on different physical devices, it is necessary to enable the identification device to access the storage device.
After the at least one initial voiceprint feature data segment and the at least one extended voiceprint feature data segment are obtained in step S103, the identity recognition device stores the at least one initial voiceprint feature data segment and the at least one extended voiceprint feature data segment in correspondence with the identity of the first speaker, and thus, the voiceprint feature data of the first speaker is registered.
In one embodiment of the present application, the identity of the speaker and the corresponding storage of the voiceprint feature data fields may be as shown in table 1.
TABLE 1
Speaker identity    Voiceprint feature data segment
UID001              Voiceprint feature data segment 1
UID001              Voiceprint feature data segment 2
UID001              Voiceprint feature data segment 3
UID002              Voiceprint feature data segment 4
……                  ……
UIDxyz              Voiceprint feature data segment n
The "xyz" in "UIDxyz" increases sequentially according to the requirements of the practical application, following the same rule as "001" and "002". n is a positive integer, set according to the requirements of the practical application. "UIDxyz" represents the identity of a corresponding registered user, such as the identity card number described above.
In an alternative example of the present application, the voiceprint feature data segment 1 through the voiceprint feature data segment n are each a voiceprint feature vector.
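Seen as code, the store behind Table 1 is just a mapping from speaker identity to the list of initial and extended voiceprint feature vectors; the sketch below is illustrative:

```python
from collections import defaultdict

voiceprint_store = defaultdict(list)  # speaker identity -> feature vectors

def register(speaker_id, feature_vectors):
    """Store the initial and extended voiceprint feature vectors of one
    speaker under that speaker's identity, as in Table 1."""
    voiceprint_store[speaker_id].extend(feature_vectors)
```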
According to this embodiment, the identity recognition device can take a voice data segment input by a speaker in a single speaking state (such as a normal speech-rate state) and/or a single channel environment (such as a quiet channel environment) as an initial voice data segment, and simulate, through data transformation, at least one extended voice data segment of the speaker in a plurality of different speaking states (such as a fast speech-rate state or a high-pitch state) and/or a plurality of different channel environments (such as a noisy environment). In this way, voiceprint feature data segments of the speaker under a plurality of different speaking states and/or channel environments can be simulated, the coverage of the voiceprint feature data segments registered for the speaker is expanded, and the accuracy of identity recognition is improved. How the extended voice data segments are used for identification is described further in embodiment two.
Example two
By performing the first embodiment, the identification device stores the voiceprint feature data segment of the first speaker (including at least one initial voiceprint feature data segment and at least one extended voiceprint feature data segment) and the identification of the first speaker. Based on this, the second embodiment describes an implementation process of the identity recognition device comparing the voiceprint feature data segments to identify whether the second speaker is the same as the first speaker. The method comprises the following specific steps:
And after receiving the voice information to be recognized input by the second speaker, the identity recognition equipment carries out digital processing on the voice information to be recognized to obtain a voice data segment to be recognized. And then, the identity recognition equipment performs voiceprint feature extraction on the voice data segment to be recognized to obtain a voiceprint feature data segment to be recognized.
For the method by which the identity recognition device converts the voice information to be recognized into the voice data segment to be recognized, and the method by which it extracts the voiceprint feature data segment from the voice data segment to be recognized, reference may be made to the descriptions of steps S101 (including S1011 and S1012) and S103 in embodiment one; they are not repeated in embodiment two. In embodiment two, the voiceprint feature data segment to be recognized is a voiceprint feature vector.
Further, the identity recognition device calculates a feature distance value, where the feature distance value is a distance value between the voiceprint feature data segment to be recognized and the at least one initial voiceprint feature data segment and at least one extended voiceprint feature data segment stored in correspondence with the identity of the first speaker. The identity recognition device then determines whether the second speaker is the same as the first speaker according to the feature distance value.
Specifically, the identification device may calculate the characteristic distance value by:
for each feature data segment in the at least one initial voiceprint feature data segment and the at least one extended voiceprint feature data segment stored in correspondence with the first speaker, calculating the distance value between that feature data segment and the voiceprint feature data segment to be recognized, to obtain a plurality of distance values;
selecting a minimum value of the plurality of distance values as the characteristic distance value.
Specifically, the identification device may further calculate the characteristic distance value by:
calculating the average value of the at least one initial voiceprint feature data segment and the at least one extended voiceprint feature data segment to obtain an average feature data segment;
and calculating the distance value between the average feature data segment and the voiceprint feature data segment to be recognized, and taking that distance value as the feature distance value.
It should be understood that, in the embodiments of the present application, the distance value between two feature data segments refers to the distance between the feature vectors corresponding to the two feature data segments, for example, the Euclidean distance. This is not repeated in the rest of the embodiments.
Specifically, the determining, by the identity recognition device, whether the second speaker is the same as the first speaker according to the characteristic distance value may be:
If the characteristic distance value is smaller than a preset threshold value, determining that the second speaker is the same as the first speaker (i.e. the same person or the same identity);
if the feature distance value is greater than or equal to the preset threshold, determining that the second speaker is different from the first speaker (i.e., a different person or a different identity).
The preset threshold is a threshold preset and stored in the identity recognition device. The identity recognition equipment can also dynamically adjust the preset threshold according to the identity recognition accuracy, so as to improve the recognition accuracy.
It should be understood that the smaller the distance value between the voiceprint feature data segment to be recognized of the second speaker and the at least one initial voiceprint feature data segment and at least one extended voiceprint feature data segment stored for the first speaker, the closer the corresponding feature vectors are, and the more likely the second speaker and the first speaker are the same person. Based on this, if the feature distance value is smaller than the preset threshold, the identification device may determine that the second speaker is the same as the first speaker; otherwise, if the distance value is greater than or equal to the preset threshold, the identification device may determine that the second speaker is different from the first speaker.
It should be understood that a plurality of voiceprint feature data segments (at least one initial voiceprint feature data segment and at least one extended voiceprint feature data segment) are stored under the identity of the first speaker. As long as the distance value between the voiceprint feature data segment to be recognized of the second speaker and any one of those stored voiceprint feature data segments is smaller than the preset threshold, the identity recognition device can determine that the second speaker is the same as the first speaker.
The following is an example. Assuming that the preset threshold is 5; the identity recognition equipment respectively calculates the distance value between the voiceprint characteristic data segment to be recognized of the second speaker and each voiceprint characteristic data segment corresponding to the identity identifier UID001 in the table 1; assume that the distance values are:
the distance value between the voiceprint characteristic data segment to be identified and the voiceprint characteristic data segment 1 in the table 1 is 2,
the distance value between the voiceprint characteristic data segment to be identified and the voiceprint characteristic data segment 2 in the table 1 is 5,
the distance value between the voiceprint characteristic data segment to be identified and the voiceprint characteristic data segment 3 in the table 1 is 6.
Thus, the identification device obtains a feature distance value of 2 (since 2 is the smallest of the distance values 2, 5 and 6 in this example), compares the feature distance value 2 with the assumed preset threshold 5, finds that it is smaller than the preset threshold, and determines that the second speaker is the same as the speaker identified by "UID001".
The identity recognition device may also calculate an average value of the voiceprint characteristic data segment 1, the voiceprint characteristic data segment 2, and the voiceprint characteristic data segment 3 to obtain an average voiceprint characteristic data segment, and then calculate a distance value between the voiceprint characteristic data segment to be recognized and the average voiceprint characteristic data segment, as the characteristic distance value.
It should be understood that the above process assumes that only the voiceprint feature data segments of one speaker (the first speaker) are stored. In practical applications, the voiceprint feature data segments (including extended voiceprint feature data segments) of a plurality of speakers are stored. In that case, a feature distance value should be calculated separately for each speaker according to the above method, yielding a plurality of feature distance values; the minimum of these feature distance values is compared with the preset threshold to determine whether the second speaker (the speaker to be recognized) is the same as the speaker corresponding to that minimum feature distance value.
Continuing with the example of Table 1, assume that the threshold is preset at 5, and:
the feature distance between the speaker with the identity "UID001" and the voiceprint feature data segment to be recognized of the second speaker is 2;
the feature distance between the speaker with the identity "UID002" and the voiceprint feature data segment to be recognized of the second speaker is 3;
……
the feature distance between the speaker with the identity "UIDxyz" and the voiceprint feature data segment to be recognized of the second speaker is 6.
It can be concluded that the minimum feature distance is 2, which is less than the preset threshold 5, so the corresponding speaker, i.e. the speaker whose identity is "UID001", can be determined to be the same as the second speaker.
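Put together, the whole comparison over multiple registered speakers can be sketched as follows, using the minimum-distance strategy and NumPy vectors as before; the threshold of 5 matches the running example:

```python
import numpy as np

def identify(probe, store, threshold=5.0):
    """Return the registered identity closest to the probe voiceprint
    feature vector, or None if even the best match is at or above
    the preset threshold."""
    best_id, best_dist = None, float("inf")
    for speaker_id, segments in store.items():
        dist = min(np.linalg.norm(probe - seg) for seg in segments)
        if dist < best_dist:
            best_id, best_dist = speaker_id, dist
    return best_id if best_dist < threshold else None
```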
The second embodiment enables the identity recognition device to perform identity recognition by utilizing the voiceprint feature data segment of the simulated speaker under various speaking states and/or various channel environments, and can reduce the degree of influence of the speaking states and the speaking environment factors, thereby improving the accuracy of identity recognition.
EXAMPLE III
Embodiment three describes a specific implementation process of step S102 of embodiment one from the sound source dimension.
It should be understood that embodiment three details step S102 only; the other steps of the voiceprint feature data registration process are the same as described in embodiment one and are not repeated here.
According to the description of the first embodiment, the sound source dimension includes a speech rate sub-dimension and a pitch sub-dimension. The specific operations for performing data transformations from the speech rate sub-dimension and the pitch sub-dimension are described below, respectively.
Data transformation is performed from the speech rate sub-dimension:
In embodiment three, the identification device may preset at least one voice duration, the durations being different from one another. In an alternative implementation, the identification device determines one of the at least one voice duration as a target voice duration, which is greater than the voice duration of the voice data segment to be transformed. The identity recognition equipment performs data transformation from the speech rate sub-dimension as follows: it performs a framing operation on the voice data segment to be transformed to obtain fn speech frames; it interpolates the fn speech frames up to fn1 speech frames, where fn1 is the number of speech frames corresponding to the target voice duration and fn1 is greater than fn; it adjusts the spacing between every two adjacent speech frames of the fn1 speech frames; it smooths the transition between every two adjacent speech frames of the fn1 speech frames; finally, it synthesizes the smoothed fn1 speech frames to obtain an extended voice data segment.
In another alternative implementation, the target voice duration is less than the voice duration of the voice data segment to be transformed. In that case, instead of interpolating, the identification device removes some of the fn speech frames in the above process to obtain fn2 speech frames, where fn2 is the number of speech frames corresponding to the target voice duration and fn2 is less than fn. The other processing steps of the identity recognition device are the same as above and are not repeated here.
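In practice, an off-the-shelf time-stretch routine can stand in for this frame interpolation/removal procedure. The sketch below uses librosa's phase-vocoder stretch, a different implementation of the same speech-rate transformation (assumes librosa is installed):

```python
import numpy as np
import librosa

def stretch_speech_rate(segment: np.ndarray, factor: float) -> np.ndarray:
    """Simulate a different speech rate; factor > 1 speeds speech up
    (fewer frames), factor < 1 slows it down (more frames)."""
    return librosa.effects.time_stretch(segment, rate=factor)
```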
Data transformation is performed from the tone sub-dimension:
The pitch of the voice is essentially determined by the fundamental frequency of the voice data. The fundamental frequency is the frequency of vocal cord vibration: the higher the fundamental frequency, the higher the pitch; the lower the fundamental frequency, the lower the pitch. Based on this, in this embodiment the identity recognition device may preset at least one fundamental frequency parameter.
The identity recognition device performs data transformation from the pitch sub-dimension as follows: it performs Linear Predictive Coding (LPC) analysis on the voice data segment to be transformed to obtain the LPC coefficients and the fundamental frequency parameter of the voice data segment; it adjusts the fundamental frequency parameter of the voice data segment to be transformed to a preset fundamental frequency parameter; it then obtains the excitation parameters of the voice data segment according to the adjusted fundamental frequency parameter; finally, it uses the excitation parameters to re-excite the LPC coefficients of the voice data segment to be transformed (LPC synthesis), obtaining an extended voice data segment corresponding to the preset fundamental frequency parameter.
It should be understood that, in the above adjustment, if the preset fundamental frequency parameter is greater than the fundamental frequency parameter of the voice data segment to be transformed, the pitch of the corresponding extended voice data segment is higher than that of the segment to be transformed; if it is smaller, the pitch of the extended segment is lower.
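To make the LPC-based tone transformation concrete, the sketch below uses librosa's LPC routine and re-excites the all-pole filter with an impulse train at a preset fundamental frequency. Treating the whole segment with a single filter and a pure impulse-train excitation is a simplification of the procedure described above; the helper name, filter order and energy matching are assumptions.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def shift_pitch_lpc(x, sr, target_f0, order=16):
    """Toy LPC analysis/synthesis: keep the vocal-tract filter, replace
    the excitation with an impulse train at target_f0 (Hz)."""
    a = librosa.lpc(x.astype(float), order=order)  # prediction filter A(z), a[0] == 1
    residual = lfilter(a, [1.0], x)                # inverse filtering yields the excitation

    # New excitation: impulse train at the preset fundamental frequency,
    # scaled to roughly match the energy of the original residual.
    period = max(1, int(sr / target_f0))
    excitation = np.zeros(len(x))
    excitation[::period] = 1.0
    excitation *= np.std(residual) / (np.std(excitation) + 1e-12)

    return lfilter([1.0], a, excitation)           # re-excite the LPC filter
```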
It should be understood that the voice data segments to be transformed described in this embodiment include each of the at least one initial voice data segment, as well as voice data segments that have already been transformed from other sub-dimensions.
Therefore, by adopting this embodiment, the identification device can simulate the voiceprint feature data of the speaker in different speaking states and store it as voiceprint feature data segments of the speaker. When the speaker's identity is recognized later, even if the voice data segment input by the speaker differs in speech rate and/or tone from the segment input during registration, the identification device can still correctly recognize the speaker, improving the accuracy of identity recognition.
Example four
Embodiment four describes a specific implementation of step S102 of the first embodiment from the channel dimension.
It should be understood that the fourth embodiment is a detailed description of step S102 in the first embodiment; the other steps of the voiceprint feature data registration process in the fourth embodiment are the same as described in the first embodiment and are not repeated here.
According to the description of the first embodiment, the channel dimension includes a bandwidth sub-dimension, a codec sub-dimension, a noise sub-dimension and a reverberation sub-dimension. The specific operation of performing data transformation from each sub-dimension of the channel dimension is described below.
Data transformation is performed from the bandwidth sub-dimension:
when the frequency bandwidth of the voice data segment to be transformed is 0-4 kHz (narrowband), the identification device transforms it into a voice data segment with a frequency bandwidth of 0-8 kHz (wideband).
An alternative way to transform a narrowband voice data segment into a wideband one is as follows. The identification device transforms the narrowband segment from the time domain to the frequency domain; the spectrum of the frequency-domain narrowband segment is shown as the solid line in fig. 2, which occupies the low-frequency part (0-4 kHz). The identification device then copies the narrowband spectrum directly from the low-frequency part to the high-frequency part (4-8 kHz); the spectrum copied to the high-frequency part is shown as the dotted line in fig. 2. The identification device shapes the spectrum of the high-frequency part according to the spectrum of the narrowband segment to obtain the frequency-domain wideband voice data segment (0-8 kHz) shown in fig. 3. Finally, the identification device transforms the frequency-domain wideband segment back to the time domain to obtain the band-extended wideband voice data segment.
When the frequency bandwidth of the voice data segment to be transformed is 0-8 kHz (wideband), the identification device transforms it into a voice data segment with a frequency bandwidth of 0-4 kHz (narrowband). An alternative way to do so is low-pass filtering: the identification device passes the wideband segment through a low-pass filter with a passband of 0-4 kHz to obtain the transformed narrowband voice data segment.
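A rough numpy/scipy sketch of both bandwidth transformations, assuming float waveforms sampled at 8 kHz (narrowband) and 16 kHz (wideband); the flat 0.3 attenuation stands in for the shaping step, which is not specified in detail above.

```python
import numpy as np
from scipy.signal import butter, lfilter, resample_poly

def narrowband_to_wideband(x_nb, sr_nb=8000):
    """Spectral-copy band extension: upsample to 16 kHz, copy the
    0-4 kHz band into 4-8 kHz with attenuation."""
    x = resample_poly(x_nb, 2, 1)                 # 8 kHz -> 16 kHz sampling rate
    spec = np.fft.rfft(x)
    n = len(spec)
    low = spec[: n // 2].copy()                   # 0-4 kHz bins
    spec[n // 2 : n // 2 + len(low)] = 0.3 * low  # crude copy + shaping into 4-8 kHz
    return np.fft.irfft(spec, len(x))

def wideband_to_narrowband(x_wb, sr_wb=16000):
    """Low-pass filtering at 4 kHz collapses a wideband segment to
    narrowband content."""
    b, a = butter(8, 4000 / (sr_wb / 2))
    return lfilter(b, a, x_wb)
```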
Data transformation is performed from the codec sub-dimension:
the identification device first encodes the voice data segment to be transformed in a target coding mode to obtain coded data. The identification device then decodes the coded data in a target decoding mode, which corresponds to the target coding mode, to obtain the transformed extended voice data segment.
It should be understood that the device performing the codec operation is a codec, and codecs include narrowband voice codecs and wideband voice codecs. When the voice data segment to be transformed is a narrowband segment, the identification device should use a narrowband voice codec, such as the G.729, G.723.1 and G.726 codecs; when it is a wideband segment, the identification device should use a wideband voice codec, such as the G.718 and G.722 codecs.
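The standards listed above (G.729, G.718, and so on) require external codec implementations, so as a self-contained stand-in the sketch below runs a G.711-style mu-law companding round trip, which introduces the same kind of quantization distortion as an encode-decode pass; it is an illustrative substitute, not one of the codecs named in this embodiment.

```python
import numpy as np

MU = 255.0  # mu-law companding constant (G.711)

def mulaw_roundtrip(x):
    """Compress, quantize to 8-bit levels, expand: the quantization
    noise mimics codec distortion on a waveform in [-1, 1]."""
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    quantized = np.round((compressed + 1) / 2 * 255) / 255 * 2 - 1
    return np.sign(quantized) * ((1 + MU) ** np.abs(quantized) - 1) / MU
```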
Data transformation is performed from the noise sub-dimension:
in this embodiment, the identification device may preset at least one signal-to-noise ratio of voice to noise, where the preset signal-to-noise ratios are different from each other. Taking one of the preset signal-to-noise ratios as an example, when transforming the voice data segment from the noise sub-dimension, the identification device adjusts the volume parameter of the reference noise data according to that signal-to-noise ratio to obtain a target noise data segment, and then mixes the target noise data segment with the voice data segment to be transformed to obtain the extended voice data segment corresponding to that signal-to-noise ratio. The identification device performs this operation for each of the at least one signal-to-noise ratio to obtain the extended voice data segment corresponding to each.
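A minimal sketch of the SNR-controlled mixing step: the reference noise is looped or trimmed to the speech length and scaled so the speech-to-noise power ratio equals the preset value. The function name and the power-based SNR definition are assumptions.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the reference noise to hit snr_db, then mix it with the
    voice data segment to be transformed."""
    noise = np.resize(noise, len(speech))         # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```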
In an alternative embodiment, the identification device acquires the reference noise data segment in advance; in another alternative embodiment, it acquires the reference noise data segment in real time. The identification device acquires the reference noise data segment in a manner similar to that of acquiring the initial voice data segment, which is not described in detail here.
In addition, the sound content of the reference noise data segment differs across scenes. For example, in a conference scene, it includes keyboard sounds, mobile phone ringtones, the sound of moving tables and chairs, other people speaking in the conference venue, and the like; in a home scene, it includes background music, street noise from outside, children speaking, and the like.
Data transformation is performed from the reverb sub-dimension:
during indoor propagation, sound waves are reflected by obstacles such as walls, ceilings and floors, and part of their energy is absorbed at each reflection. Therefore, after the sound source stops sounding, the sound waves disappear only after multiple reflections and absorptions. The phenomenon whereby multiple reflected sound waves mix together after the source stops sounding is called reverberation, and the duration of the mixed sound waves is called the reverberation time.
In this embodiment, the identification device may preset at least one reverberation data segment, where the volumes and reverberation durations of the preset reverberation data segments are different from each other. Taking one preset reverberation data segment as an example, performing data transformation from the reverberation sub-dimension specifically means that the identification device mixes that reverberation data segment with the voice data segment to be transformed to obtain the extended voice data segment corresponding to that reverberation data segment. The identification device performs this operation for each of the at least one reverberation data segment to obtain the extended voice data segment corresponding to each.
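Since the mixing algorithm is not spelled out above, the sketch below models a reverberation data segment as a synthetic, exponentially decaying noise impulse response parameterized by a reverberation time, and convolves it with the voice segment; the RT60-style decay constant and the normalization are assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def add_reverb(x, sr, rt60=0.5):
    """Convolve with a crude synthetic room impulse response whose
    amplitude decays by ~60 dB over rt60 seconds."""
    n = int(rt60 * sr)
    t = np.arange(n) / sr
    rir = np.random.randn(n) * np.exp(-6.9 * t / rt60)
    rir[0] = 1.0                                  # direct-path component
    y = fftconvolve(x, rir)[: len(x)]
    return y / (np.max(np.abs(y)) + 1e-12)        # avoid clipping
```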
It should be understood that the voice data segments to be transformed described in this embodiment include each of the at least one initial voice data segment, as well as voice data segments that have already been transformed from other sub-dimensions.
Therefore, by adopting this embodiment, the identification device can simulate the voiceprint feature data of the speaker in different acquisition environments and store it as voiceprint feature data segments of the speaker. When the speaker's identity is recognized later, even if the voice data segment input by the speaker differs from the segment input during registration in at least one of bandwidth, coding and decoding, noise and reverberation, the identification device can still correctly recognize the speaker, improving the accuracy of identity recognition.
Example five
Embodiment five describes one implementation of "performing data transformation on each of the at least one initial voice data segment from M sub-dimensions".
It should be understood that the fifth embodiment is a detailed description of step S102 in the first embodiment; the other steps of the voiceprint feature data registration process in the fifth embodiment are the same as described in the first embodiment and are not repeated here.
In the fifth embodiment, the identity recognition device performs data transformation on each initial voice data segment in the at least one initial voice data segment from M sub-dimensions, respectively, to obtain at least M transformed data segments corresponding to each initial voice data segment. Furthermore, the identity recognition device determines at least M transformation data segments corresponding to each initial voice data segment as an extended voice data segment in the at least one extended voice data segment.
In a possible implementation manner, taking any initial voice data segment of the at least one initial voice data segment as an example, the M sub-dimensions are, for example, a speech rate sub-dimension, a bandwidth sub-dimension, and a coding/decoding sub-dimension. The identity recognition equipment carries out data transformation on the initial voice data segment from the speech speed sub-dimension to obtain a first transformation data segment; performing data transformation on the initial voice data segment from the frequency width sub-dimension to obtain a second transformed data segment; and performing data transformation on the initial voice data segment from the encoding and decoding sub-dimension to obtain a third transformed data segment. The first, second and third transformed data segments are three extended speech data segments, respectively.
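A sketch of this per-sub-dimension expansion logic, assuming the hypothetical helpers from the earlier sketches (change_speech_rate, wideband_to_narrowband, mulaw_roundtrip) are in scope:

```python
def expand_independently(segment, transforms):
    """Embodiment-five expansion: apply each sub-dimension transform
    directly to the initial segment, one extended segment each."""
    return {name: fn(segment) for name, fn in transforms.items()}

# Example wiring for M = 3 (speech rate, bandwidth, codec), all hypothetical:
# expanded = expand_independently(x, {
#     "speech_rate": lambda s: change_speech_rate(s, 16000, 1.2 * len(s) / 16000),
#     "bandwidth":   wideband_to_narrowband,
#     "codec":       mulaw_roundtrip,
# })
```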
It should be understood that the above description uses only the speech rate sub-dimension, the bandwidth sub-dimension and the codec sub-dimension as an example to describe the extension of one of the at least one initial voice data segment. The solution described in embodiment five is applicable to any of the sub-dimensions described in embodiments three and four, and is applied to each of the at least one initial voice data segment.
It should be understood that, for the data transformation from sub-dimensions of the sound source dimension involved in the fifth embodiment, refer to the description of the third embodiment; for the data transformation from sub-dimensions of the channel dimension, refer to the description of the fourth embodiment. Details are not repeated here.
By adopting this implementation, the identification device can simulate at least one voiceprint feature data segment of the speaker as affected by any single sub-dimension. Therefore, when the speaker is identified, even if the voice data segment to be recognized differs from the segment input when the speaker registered the voiceprint feature data in any one of speech rate, tone, bandwidth, coding and decoding, noise and reverberation, the identification device can still accurately recognize the speaker, improving the accuracy of identity recognition.
Example six
Embodiment six describes another implementation of the data transformation of each of the at least one initial speech data segment from the M sub-dimensions.
It should be understood that the sixth embodiment is a detailed description of step S102 in the first embodiment; the other steps of the voiceprint feature data registration process in the sixth embodiment are the same as described in the first embodiment and are not repeated here.
In the sixth embodiment, the identification device takes each of the M sub-dimensions in turn as a target sub-dimension and performs data transformation on each target transformed data segment from that target sub-dimension to obtain a plurality of combined transformed data segments. The identification device then determines each of the plurality of combined transformed data segments as an extended voice data segment of the at least one extended voice data segment. Here, a target transformed data segment is a voice data segment that has already been data-transformed from the M-1 sub-dimensions, i.e., the sub-dimensions of the M sub-dimensions other than the target sub-dimension.
It will be appreciated that in one possible implementation, the target transformed data segment is each of the at least one initial voice data segment after data transformation from all M-1 sub-dimensions. In another possible implementation, the target transformed data segment is each of the at least one initial voice data segment after data transformation performed sequentially from any k of the M-1 sub-dimensions, where k is any integer from 2 to M-1 and the number of groups of k sub-dimensions satisfies the combination relationship

$$\binom{M-1}{k} = \frac{(M-1)!}{k!\,(M-1-k)!}$$
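The subset enumeration can be sketched directly with itertools. The helper below assumes a mapping from sub-dimension names to the hypothetical transform functions sketched earlier and applies one ordering per subset, since (as noted below) the result does not depend on the order within a combination.

```python
from itertools import combinations

def expand_with_combinations(segment, transforms, k_values=(2, 3)):
    """Chain the transforms of every k-element subset of sub-dimensions,
    producing one combined transformed data segment per subset."""
    expanded = []
    for k in k_values:
        for subset in combinations(transforms, k):  # C(M-1, k) subsets per k
            y = segment
            for name in subset:
                y = transforms[name](y)             # sequential transformation
            expanded.append(y)
    return expanded
```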
The M sub-dimensions are, for example, the speech rate sub-dimension, the tone sub-dimension, the bandwidth sub-dimension and the codec sub-dimension. In an optional embodiment, the identification device takes the speech rate sub-dimension as the target sub-dimension, and the target transformed data segment is a voice data segment that has been data-transformed from the tone sub-dimension, the bandwidth sub-dimension and the codec sub-dimension. In another optional embodiment, the identification device takes the tone sub-dimension as the target sub-dimension, and the target transformed data segment is a voice data segment that has been data-transformed from the speech rate sub-dimension, the bandwidth sub-dimension and the codec sub-dimension. Other implementation scenarios of the present application are similar to these two optional embodiments and are not described again here.
Further, in an optional embodiment, the identification device takes the speech rate sub-dimension as the target sub-dimension. In this case, the target transformed data segment includes the voice data segment obtained after the identification device transforms each of the at least one initial voice data segment from the tone sub-dimension, the bandwidth sub-dimension and the codec sub-dimension. And/or, the identification device forms $\binom{3}{k}$ combinations from the tone sub-dimension, the bandwidth sub-dimension and the codec sub-dimension, and then sequentially performs data transformation on each of the at least one initial voice data segment from the k sub-dimensions corresponding to each combination, where k is 2 or 3.
It should be understood that when the identification device performs data transformation on the initial voice data segment from k sub-dimensions in sequence, the result is not affected by the order of the k sub-dimensions.
When k is 2, the identification device obtains three combinations: the tone sub-dimension with the bandwidth sub-dimension, the bandwidth sub-dimension with the codec sub-dimension, and the tone sub-dimension with the codec sub-dimension. The identification device sequentially performs data transformation on each of the at least one initial voice data segment according to the two sub-dimensions of each of the three combinations to obtain a target transformed data segment.
Taking the combination of the bandwidth sub-dimension and the codec sub-dimension as an example: in one embodiment, the identification device may first transform the initial voice data segment from the bandwidth sub-dimension to obtain a transformed voice data segment, and then transform that segment from the codec sub-dimension to obtain a target transformed data segment. In another embodiment, the identification device may first transform the initial voice data segment from the codec sub-dimension to obtain a transformed voice data segment, and then transform that segment from the bandwidth sub-dimension to obtain a target transformed data segment. The identification device performs sequential data transformation according to the two sub-dimensions of each of the other two combinations in a similar way, which is not described again here.
When k is 3, the identification device obtains one combination, consisting of the tone sub-dimension, the bandwidth sub-dimension and the codec sub-dimension.
In a first embodiment, the identification device performs data transformation on one of the at least one initial voice data segment from the tone sub-dimension to obtain a first transformed voice data segment, then transforms the first transformed voice data segment from the bandwidth sub-dimension to obtain a second transformed voice data segment, and finally transforms the second transformed voice data segment from the codec sub-dimension to obtain a target transformed data segment.
In a second embodiment, the identification device performs data transformation on one of the at least one initial voice data segment from the bandwidth sub-dimension to obtain a first transformed voice data segment, then transforms the first transformed voice data segment from the tone sub-dimension to obtain a second transformed voice data segment, and finally transforms the second transformed voice data segment from the codec sub-dimension to obtain a target transformed data segment.
In a third embodiment, the identification device performs data transformation on one of the at least one initial voice data segment from the codec sub-dimension to obtain a first transformed voice data segment, then transforms the first transformed voice data segment from the tone sub-dimension to obtain a second transformed voice data segment, and finally transforms the second transformed voice data segment from the bandwidth sub-dimension to obtain a target transformed data segment.
It should be understood that when k is 3, the identification device may also perform data transformation on the initial voice data segment sequentially in other orders of the three sub-dimensions, which are not described one by one here.
It should be understood that, for the data transformation from sub-dimensions of the sound source dimension involved in the sixth embodiment, refer to the description of the third embodiment; for the data transformation from sub-dimensions of the channel dimension, refer to the description of the fourth embodiment. Details are not repeated here.
By adopting the implementation of this embodiment, the identification device can perform data transformation on each of the at least one initial voice data segment once from each of the M sub-dimensions, and can thus simulate extended voice data segments affected by any several sub-dimensions. Therefore, when the speaker is identified, even if the voice data segment to be recognized differs from the segment input when the speaker registered the voiceprint feature data in any several of speech rate, tone, bandwidth, coding and decoding, noise and reverberation, the identification device can still accurately recognize the speaker, improving the accuracy of identity recognition.
Example seven
Embodiment seven describes a specific operation of performing data transformation from any one of the sub-dimensions in embodiments three to six.
It should be understood that, the seventh embodiment is a more detailed description of the third to sixth embodiments, and therefore, other steps performed in the voiceprint feature data registration process related to the seventh embodiment are the same as the first embodiment, and reference may be made to the description of the first embodiment.
According to the description of the first embodiment, the attribute parameters corresponding to some sub-dimensions of the sound source dimension and/or the channel dimension may fluctuate, and the identification device may set a plurality of transformation parameters for each sub-dimension whose attribute parameter fluctuates. In view of this, in the third to sixth embodiments, when the identification device performs data transformation on the data segment to be transformed from any one of the M sub-dimensions, it obtains at least one transformation parameter corresponding to that sub-dimension and then transforms the data segment according to each of those transformation parameters.
In an alternative embodiment, the data transformation of the data section to be transformed is performed from the noise sub-dimension. The identity recognition equipment takes the signal-to-noise ratio of the initial voice data segment as a reference parameter, and two signal-to-noise ratios are preset. The signal-to-noise ratio of the voice data segment to be converted is A, the first signal-to-noise ratio set by the identity recognition equipment after data conversion is A + a, and the second signal-to-noise ratio set by the identity recognition equipment after data conversion is A-a. The identity recognition device performs data transformation on the voice data segment to be transformed from the noise sub-dimension, and the identity recognition device performs data transformation on the voice data segment to be transformed to obtain an extended voice data segment with the signal-to-noise ratio of A + a. And the identity recognition equipment performs data transformation on the voice data segment to be transformed to obtain an extended voice data segment with the signal-to-noise ratio of A-a. Wherein, A, A + a and A-a are reasonable SNR values of speakers, and can be flexibly set according to actual use environment.
It should be appreciated that the above-described implementation of setting two signal-to-noise ratios corresponding to the noise sub-dimension is only one alternative embodiment of the present application. In implementation, the identification device may set other numbers of signal-to-noise ratios corresponding to the noise sub-dimensions. For example, in another alternative embodiment, 3 signal-to-noise ratios are set.
It should be understood that the foregoing merely illustrates the embodiments of the present application using the noise sub-dimension as an example, and the embodiments are not limited to it. The identification device can likewise set a plurality of transformation parameters for each of the tone sub-dimension, the speech rate sub-dimension and the reverberation sub-dimension, and can determine the values and the setting mode of those transformation parameters according to the attributes of the corresponding sub-dimension.
It should be understood that the embodiments of the present application are applicable to various sound production environments of a speaker, and the voice data obtained by the identification device is the speaker's voice data under different speaking states and acquisition environments. Accordingly, the plurality of transformation parameters set for each sub-dimension must conform to realistic speaking scenarios of the speaker. For example, the signal-to-noise ratios set by the identification device are those that occur for a speaker's voice, not signal-to-noise ratios from other domains such as electromagnetic wave transmission.
Therefore, by adopting the implementation of this embodiment, the identification device can simulate voice data segments under multiple degrees of influence of a single sub-dimension. When the speaker is identified, even if the voice data segment to be recognized differs to a varying degree from the segment input when the speaker registered the voiceprint feature data in any one of speech rate, tone, bandwidth, coding and decoding, noise and reverberation, the identification device can still accurately recognize the speaker, improving the accuracy of identity recognition.
Example eight
The eighth embodiment describes the present solution with reference to a specific implementation scenario.
Referring to fig. 4, fig. 4 is a scene schematic diagram of a real-time conference scene provided in an embodiment of the present application. In the conference scene provided by this embodiment, the conference participants are participant 1, participant 2, participant 3, participant 4 and participant 5. The conference system includes a conference server (which includes an identification device), a mobile phone used by participant 1 to access the conference, a desktop computer (PC) used by participant 2, a tablet computer used by participant 3, a teleconference terminal used by participant 4, and a fixed-line telephone used by participant 5. To ensure the security of the conference, the conference server is provided with the identification device so that the identity of each participant can be recognized through voiceprint features as each participant accesses the conference.
The identification process of the identity recognition equipment for the participants through voiceprint features comprises the following steps:
voiceprint feature data registration part: the five participants register their voiceprint features with the identification device in advance through the devices they use to access the conference or through other devices;
voiceprint feature data comparison part: at the beginning of the conference, the five participants each provide a piece of their own voice (for example, a sentence) to the identification device through their respective conference-access devices; the identification device extracts the voiceprint features of each participant's voice data and determines the identity of each participant by comparing the extracted voiceprint feature data with the registered voiceprint feature data.
The identification device is preset with a voiceprint feature vector model. The identification device extracts the voiceprint features from each voice data segment, then computes the feature vector of the corresponding voiceprint features through the voiceprint feature vector model and uses that feature vector as the voiceprint feature data segment corresponding to the voice data segment.
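The voiceprint feature vector model itself is not disclosed here; as a toy stand-in, the sketch below pools MFCC statistics into a fixed-length vector. A production system would use a trained speaker-embedding model (for example, i-vectors or x-vectors) instead.

```python
import numpy as np
import librosa

def voiceprint_embedding(x, sr=16000, n_mfcc=20):
    """Toy voiceprint feature data segment: per-coefficient mean and
    standard deviation of the MFCCs over time."""
    mfcc = librosa.feature.mfcc(y=x.astype(float), sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```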
Based on this, the process of identifying the unknown conference participants by the identification device is described below from two parts, namely voiceprint feature data registration and voiceprint feature data comparison.
Voiceprint feature data registration section:
the method of registering voiceprint feature data of each of the conference participants 1, 2, 3, 4 and 5 by the identification device is similar. The voiceprint feature data of the participant 1 registered by the identification device will be described as an example. The process of registering the voiceprint feature data of the four other participants by the identification device may refer to the process of registering the voiceprint feature data of the participant 1 by the identification device.
The identification device receives the first voice information and the second voice information of participant 1 through the mobile phone. Both the first voice information and the second voice information are analog signals. The identification device samples and quantizes the first voice information to obtain a first voice data segment, and samples and quantizes the second voice information to obtain a second voice data segment. The identification device then performs data transformation on the first voice data segment and the second voice data segment respectively to obtain a plurality of extended voice data segments.
It should be understood that the use of the ordinal numbers such as "first" and "second" are used to distinguish between the various objects and are not used to limit the order in which the objects are recited.
The data transformation method applied by the identification device to the first voice data segment is the same as that applied to the second voice data segment. The transformation of the first voice data segment is described as an example; the transformation of the second voice data segment may refer to the following process.
When participant 1 registers voiceprint feature data, the scene in which the voice information is input to the identification device is, for example, a conference room. Assume that participant 1 later joins a conference from somewhere other than a conference room. In this case, the speech rate and tone of participant 1 during the conference differ from those at registration, and the bandwidth, coding and decoding, noise and reverberation of the voice input during the conference differ from those at registration as well. Based on this, the identification device performs data transformation on the first voice data segment from the speech rate sub-dimension, the tone sub-dimension, the bandwidth sub-dimension, the codec sub-dimension, the noise sub-dimension and the reverberation sub-dimension.
Based on this, the identification device acquires a noise data segment and a reverberation data segment in advance. The sound content of the noise data segment and the reverberation data segment includes, for example, keyboard sounds, mobile phone ringtones and the sound of moving tables and chairs.
The frequency bandwidth supported by the identification device is, for example, 0-8 kHz. Based on this, the identification device sets up a low-pass filter with a passband of 0-4 kHz; when transforming the first voice data segment from the bandwidth sub-dimension, the identification device transforms it from wideband into a narrowband extended voice data segment. Correspondingly, the identification device is provided with a wideband codec to transform the first voice data segment from the codec sub-dimension. The connection structure of the wideband codec is shown in fig. 5: the input of the wideband encoder receives the first voice data, the output of the wideband encoder is connected to the input of the wideband decoder, and the output of the wideband decoder produces the transformed extended voice data segment. In addition, the identification device is preset with two signal-to-noise ratios and a signal noise-adding algorithm. Corresponding to the reverberation sub-dimension, the identification device is preset with a reverberation data segment and an algorithm for mixing reverberation data with voice data. Corresponding to the speech rate sub-dimension, the identification device is preset with a voice duration difference and a speech rate adjustment algorithm. Corresponding to the tone sub-dimension, the identification device is preset with a fundamental frequency parameter difference and a tone adjustment algorithm.
The identification device performs data transformation on the first voice data segment from a speech rate sub-dimension, a tone sub-dimension, a bandwidth sub-dimension, a coding/decoding sub-dimension, a noise sub-dimension and a reverberation sub-dimension respectively to obtain, for example, 8 extended voice data segments.
The data transformation performed by the identification device from the bandwidth sub-dimension specifically includes: the identification device inputs the first voice data segment into the low-pass filter, which outputs an extended voice data segment with a frequency bandwidth of 0-4 kHz.
The data transformation performed by the identity recognition device from the encoding and decoding sub-dimension specifically comprises: the identification device inputs the first speech data segment into the wideband encoder shown in fig. 5. The encoded data segments output via the wideband encoder are input to a wideband decoder. The wideband decoder outputs an extended speech data segment.
The data transformation performed by the identification device from the speech rate sub-dimension specifically includes: in an optional embodiment, the identification device adds the preset voice duration difference to the voice duration corresponding to the first voice data segment to obtain the target voice duration, and lengthens the first voice data segment to the target voice duration by interpolating voice frames to obtain an extended voice data segment. In another optional embodiment, the identification device subtracts the voice duration difference from the voice duration corresponding to the first voice data segment to obtain the target voice duration, and shortens the first voice data segment to the target voice duration by deleting voice frames to obtain an extended voice data segment.
The data transformation performed by the identification device from the tone sub-dimension specifically includes: the identification device performs LPC analysis on the first voice data segment to obtain the initial fundamental frequency parameter, adds the preset fundamental frequency parameter difference to the initial fundamental frequency parameter, and then adjusts the related parameters according to the raised fundamental frequency parameter to obtain a transformed extended voice data segment.
The data transformation performed by the identification device from the noise sub-dimension specifically includes: in an optional embodiment, the identification device calculates the volume at which the pre-acquired noise data segment makes the signal-to-noise ratio of the first voice data segment to the noise reach the first preset signal-to-noise ratio, adjusts the noise data segment to that volume to obtain a first target noise data segment, and mixes the first target noise data segment with the first voice data segment according to the preset noise-adding algorithm to obtain the extended voice data segment corresponding to the first signal-to-noise ratio. In another optional embodiment, the identification device calculates the volume at which the noise data segment makes the signal-to-noise ratio of the first voice data segment to the noise reach the second preset signal-to-noise ratio, adjusts the noise data segment to that volume to obtain a second target noise data segment, and mixes the second target noise data segment with the first voice data segment according to the preset noise-adding algorithm to obtain the extended voice data segment corresponding to the second signal-to-noise ratio.
The data transformation performed by the identification device from the reverberation sub-dimension specifically includes: the identification device mixes the preset reverberation data segment with the first voice data segment according to the preset algorithm for mixing reverberation data segments with voice data segments to obtain an extended voice data segment.
It should be appreciated that the above data transformation process is only one alternative implementation of this embodiment. On this basis, the identification device can also combine the sub-dimensions pairwise at will to obtain a plurality of combinations each containing two sub-dimensions, and then perform data transformation sequentially from the two sub-dimensions in each combination to obtain a plurality of extended voice data segments. Of course, the identification device may also combine the sub-dimensions in any other way and perform data transformation sequentially from the sub-dimensions in each combination to obtain a plurality of extended voice data segments. Details are not repeated here.
Correspondingly, the identification device performs data transformation on the second voice data segment to obtain 10 extended voice data segments. The process of the identification device performing data transformation on the second speech data segment is similar to the above description and will not be described in detail here.
It should be understood that the transformation parameters of the different sub-dimensions and the corresponding hardware settings described above are specific to the scenario of this optional implementation. In scenarios corresponding to other optional implementations, the identification device can adapt the transformation parameters of each sub-dimension and the corresponding hardware to the implementation scenario. For example, if the data transformed from the codec sub-dimension is a voice data segment with a bandwidth of 0-4 kHz, the codec should be a narrowband codec.
Further, the identity recognition device performs voiceprint feature extraction on the first voice data segment, the second voice data segment and the 18 extended voice data segments respectively to obtain 20 voiceprint feature data segments corresponding to the participant 1. Then, the identification device stores the identification of the participant 1 corresponding to the 20 voiceprint feature data segments. At this point, the identification device completes registration of voiceprint feature data of the conference participant 1.
For the conference participants 2, 3, 4 and 5, the identification device obtains the voice information of the conference participants 3 and 4 through a mobile phone, and the identification device obtains the voice information of the conference participants 2 and 5 through a PC. The identification device performs the further process of voiceprint feature data registration for each of participant 2, participant 3, participant 4 and participant 5, which is described above and not described in detail here.
In the identification device, the correspondence between the identification of the conference participant 1, the conference participant 2, the conference participant 3, the conference participant 4, and the conference participant 5 and the voiceprint feature data segment is as shown in table 1 in the first embodiment. The description is not repeated here.
And a voiceprint characteristic data comparison part:
during the conference, the identification device recognizes the identity of each unknown participant in real time according to the unknown participant's voiceprint features. The following takes one unknown participant as an example.
The identification device receives a voice message of an unknown participant. Then, the identity recognition device digitizes the voice information to obtain a voice data segment to be recognized. And the identity recognition equipment performs voiceprint feature extraction on the voice data segment to be recognized to obtain the voiceprint feature data segment to be recognized of the unknown conference participant.
The method by which the identification device converts the voice information of the unknown participant into the voice data segment to be recognized, and the method by which it extracts the voiceprint feature data segment from that segment, are described above and are not repeated here.
Further, the identification device calculates the distance value between the voiceprint feature data segment to be recognized and each voiceprint feature data segment obtained during voiceprint feature data registration. The identification device then determines the identity of the unknown participant according to the relation between each distance value and a preset threshold.
In an optional implementation, the voiceprint feature data segment X corresponds to the identity of participant 3; take the distance value between the voiceprint feature data segment to be recognized and segment X as an example. The identification device judges whether the distance value is smaller than the preset threshold. If it is, the identification device determines that the unknown participant is the participant identified by the identity corresponding to segment X, i.e., participant 3. If the distance value is greater than the preset threshold, the identification device determines that the unknown participant is not the participant identified by that identity, i.e., not participant 3.
It should be understood that each participant has multiple registered voiceprint feature data segments. Based on this, if the unknown participant is one of the above five participants, the identification device can obtain a plurality of distance values smaller than the preset threshold, and the voiceprint feature data segments corresponding to those distance values should all correspond to the same identity. Correspondingly, if the unknown participant is not one of the five participants, all distance values calculated by the identification device are greater than the preset threshold.
For example, if the unknown participant is participant 1, who has 20 registered voiceprint feature data segments, the identification device may calculate, for example, 8 distance values smaller than the preset threshold, and the voiceprint feature data segments corresponding to those 8 distance values all correspond to the identity of participant 1. As another example, if the unknown participant is a participant 6 whose voiceprint feature data segments are not stored in the identification device, then every distance value the identification device calculates between the voiceprint feature data segment to be recognized and the stored voiceprint feature data segments is greater than the preset threshold.
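A sketch of the comparison logic, under the assumptions that voiceprint feature data segments are numpy vectors, that the distance value is Euclidean, and that the final identity is the registered speaker who accumulates below-threshold matches (the text above specifies only the threshold test itself):

```python
import numpy as np

def identify(probe, registry, threshold):
    """Compare a probe embedding against every registered voiceprint
    feature data segment; return the matching identity or None."""
    hits = {}
    for identity, embeddings in registry.items():   # e.g. 20 segments per speaker
        for emb in embeddings:
            if np.linalg.norm(probe - emb) < threshold:
                hits[identity] = hits.get(identity, 0) + 1
    return max(hits, key=hits.get) if hits else None
```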
It should be understood that the eighth embodiment describes the identity recognition method of the present application only by way of a real-time conference scenario. The identity recognition method is also suitable for other scenarios in which a speaker's identity is recognized through voiceprint features, such as smart home, door access, blacklisting, remote monitoring and retrieval scenarios, which are not described in detail here.
It should be understood that the first to eighth embodiments are provided as exemplary introductions to the present application and do not limit the technical solutions of the embodiments of the present application. As those skilled in the art can see, with the evolution of voiceprint recognition technology and the emergence of new dimensions, the technical solutions provided by the embodiments of the present application remain applicable to similar technical problems.
In summary, to solve the problem of low identification accuracy, in the voiceprint feature data registration stage the identification device of the embodiments of the present application performs data transformation on each of the at least one initial voice data segment input by a speaker from the sound source dimension and/or the channel dimension, obtaining extended voice data segments corresponding to at least one speaking state and/or acquisition environment of the speaker. On this basis, the identification device can extract and store the voiceprint feature data segments of the speaker in at least one speaking state and/or acquisition environment. In the voiceprint feature data comparison stage, even if the voiceprint feature data segment of the speaker to be identified is affected by the speaking state and/or the acquisition environment, the identification device can still accurately recognize that speaker according to the stored voiceprint feature data segments. In other words, by simulating and storing voiceprint feature data segments of the speaker in various speaking states and/or acquisition environments during registration, the technical solution of the embodiments of the present application enlarges the coverage of the stored voiceprint feature data segments, improves the fault tolerance of voiceprint feature data comparison, and improves the accuracy of identity recognition.
Example nine
The embodiment provides an identity recognition device.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an implementation manner of an identification device provided in an embodiment of the present application. The identity recognition device is used for executing the voiceprint feature data registration and voiceprint feature data comparison methods described in the first to eighth embodiments. As shown in fig. 6, the identification device includes an obtaining module 501, a data transformation module 502, an extraction module 503, a storage module 504, a distance calculation module 505, and a comparison and determination module 506. Wherein:
the obtaining module 501 may be specifically configured to perform the operation of obtaining speaker voice data in the voiceprint feature data registration stage or the voiceprint feature data comparison stage in the first to eighth embodiments;
the data transformation module 502 may be specifically configured to perform an operation of transforming the initial voice data segment in the registration phase of the voiceprint feature data (to obtain the extended voice data segment) in the first to eighth embodiments;
the extracting module 503 may be specifically configured to perform operations of extracting a voiceprint feature from a voice data segment (an initial voice data segment or an extended voice data segment) in a voiceprint feature data registration stage or a voiceprint feature data comparison stage (to obtain the initial voiceprint feature data segment and the extended voiceprint feature data segment) in the first to eighth embodiments;
The storage module 504 may be specifically configured to store, in the voiceprint feature data registration stage or the voiceprint feature data comparison stage of the first to eighth embodiments, the voiceprint feature data segments of the speaker, which may be stored locally in the identification device or in a remote device that the identification device can access;
the distance calculating module 505 may be specifically configured to calculate, in the voiceprint feature data comparison stage of the first to eighth embodiments, the feature distance value between the voiceprint feature data segment of the speaker to be identified (the second speaker in the second embodiment) and the voiceprint feature data segments (including the initial voiceprint feature data segments and the extended voiceprint feature data segments) of the registered speaker (the first speaker in the second embodiment) stored in the storage module 504;
the comparison determining module 506 may be specifically configured to determine, in the voiceprint feature data comparison stage of the first to eighth embodiments, whether the speaker to be identified (the second speaker in the second embodiment) is the same person as the registered speaker (the first speaker in the second embodiment) according to the feature distance value calculated by the distance calculating module 505.
For example, the obtaining module 501 may be configured to obtain at least one initial speech data segment of a speaker. The data transformation module 502 may be configured to perform data transformation on each initial voice data segment of the at least one initial voice data segment from a sound source dimension and/or a channel dimension to obtain at least one extended voice data segment. The extracting module 503 may be configured to perform voiceprint feature extraction on the at least one initial voice data segment and the at least one extended voice data segment, respectively, to obtain at least one initial voiceprint feature data segment and at least one extended voiceprint feature data segment. The storage module 504 may be configured to store the at least one initial voiceprint feature data segment and the at least one extended voiceprint feature data segment in correspondence with the identity of the speaker in the identification device.
For specific contents, reference may be made to the description of relevant parts in the first to eighth embodiments, and details are not described herein again.
Example ten
The division of the modules described in the ninth embodiment is only a division by logical function; in an actual implementation, all or some of them may be integrated into one physical entity or may be physically separate. In this embodiment, the obtaining module 501 may be implemented by a transceiver, and the data transformation module 502, the extraction module 503 and the storage module 504 may be implemented by a processor. As shown in fig. 7, fig. 7 is a schematic structural diagram of another implementation of an identification device provided in an embodiment of the present application. The identification device may include a processor 601, a transceiver 602 and a memory 603. The memory 603 may store programs/code preinstalled when the identification device ships from the factory, or code used by the processor 601 at execution time.
It should be understood that the identification device of this embodiment may correspond to the identification device described in the first to eighth embodiments, where the transceiver 602 is configured to perform the voice information collection described in those embodiments, and the processor 601 is configured to perform the processing other than data transmission and reception. Details are not repeated here.
In this embodiment, the transceiver may be a wired transceiver, a wireless transceiver, or a combination thereof. The wired transceiver may be, for example, an ethernet interface. The ethernet interface may be an optical interface, an electrical interface, or a combination thereof. The wireless transceiver may be, for example, a wireless local area network transceiver, a cellular network transceiver, or a combination thereof. The processor may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP. The processor may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof. The memory may include volatile memory (volatile memory), such as random-access memory (RAM); the memory may also include a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory (flash memory), a Hard Disk Drive (HDD), or a solid-state drive (SSD); the memory may also comprise a combination of the above kinds of memories.
Fig. 7 also includes a bus interface, which may include any number of interconnected buses and bridges linking together one or more processors, represented by the processor, and memory, represented by the memory. The bus interface may also link together various other circuits, such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore not described further here. The bus interface provides an interface, and the transceiver provides a means for communicating with various other apparatuses over a transmission medium. The processor is responsible for managing the bus architecture and the usual processing, and the memory may store data used by the processor when performing operations.
Those skilled in the art will also appreciate that the various illustrative logical blocks and steps set forth in the embodiments of the present application may be implemented in electronic hardware, computer software, or a combination of both. Whether such functionality is implemented as hardware or software depends upon the particular application and the design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The various illustrative logical units and circuits described in this application may be implemented or performed with a general-purpose processor, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other such configuration.
The steps of a method or algorithm described in the embodiments herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may be stored in RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a UE. In the alternative, the processor and the storage medium may reside in different components of the UE.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the processes do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments.
EXAMPLE eleven
Corresponding to the identification device described in the tenth embodiment, this embodiment provides a computer storage medium. The computer storage medium provided in the identification device may store a program, and when the program is executed, some or all of the steps of the identity recognition method provided in embodiments one to eight may be implemented. The storage medium in the identification device may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
EXAMPLE twelve
In embodiments nine through eleven, all or part of the functionality may be implemented by software, hardware, firmware, or any combination thereof. When software is used, the implementation may take, in whole or in part, the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (e.g., infrared, radio, or microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more available media. The available media may be magnetic media (e.g., a floppy disk, a hard disk, or magnetic tape), optical media (e.g., a digital video disc (DVD)), or semiconductor media (e.g., a solid-state drive), among others.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
In addition, unless stated to the contrary, the ordinal numbers "first" and "second" in the embodiments of the present application are used to distinguish a plurality of objects and do not limit the order of those objects.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the present application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (14)

1. An identity recognition method, the method comprising:
obtaining at least one initial voice data segment of a first speaker;
performing data transformation on each initial voice data segment in the at least one initial voice data segment from a sound source dimension to obtain at least one expanded voice data segment; or, performing data transformation on each initial voice data segment in the at least one initial voice data segment from a sound source dimension and a channel dimension to obtain at least one expanded voice data segment;
respectively carrying out voiceprint feature extraction on the at least one initial voice data segment and the at least one expanded voice data segment to obtain at least one initial voiceprint feature data segment and at least one expanded voiceprint feature data segment;
storing the at least one initial voiceprint feature data segment and the at least one expanded voiceprint feature data segment in correspondence with the identity of the first speaker in a storage device.
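For illustration only, the following Python sketch shows one possible realization of the enrollment flow of claim 1. It is a minimal sketch, assuming fixed-dimension voiceprint features; `extract_voiceprint` and the entries of `transforms` are hypothetical callables, since the claim does not prescribe a particular feature extractor or transform implementation.

```python
import numpy as np

def enroll_speaker(storage, speaker_id, initial_segments,
                   transforms, extract_voiceprint):
    """Claim 1 flow: expand each initial voice data segment with sound-source
    (and optionally channel) transforms, extract voiceprint features from the
    initial and expanded segments, and store them under the speaker identity."""
    features = [extract_voiceprint(seg) for seg in initial_segments]
    for seg in initial_segments:
        for transform in transforms:          # sound-source / channel transforms
            expanded_seg = transform(seg)     # one expanded voice data segment
            features.append(extract_voiceprint(expanded_seg))
    storage[speaker_id] = np.stack(features)  # keyed by the first speaker's identity
    return storage[speaker_id]
```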
2. The method of claim 1, further comprising, after storing the at least one initial voiceprint feature data segment and the at least one expanded voiceprint feature data segment in correspondence with the identity of the first speaker in a storage device:
acquiring a voice data segment to be recognized that is input by a second speaker;
performing voiceprint feature extraction on the voice data segment to be recognized to obtain a voiceprint feature data segment to be recognized;
calculating a feature distance value, wherein the feature distance value is a distance value between the voiceprint feature data segment to be recognized and the at least one initial voiceprint feature data segment and the at least one expanded voiceprint feature data segment that are stored in the storage device in correspondence with the identity of the first speaker;
determining whether the second speaker is the same as the first speaker according to the feature distance value.
3. The method of claim 2, wherein calculating the feature distance value comprises:
for the at least one initial voiceprint feature data segment and the at least one expanded voiceprint feature data segment, respectively calculating a distance value between each feature data segment and the voiceprint feature data segment to be recognized, to obtain a plurality of distance values;
selecting a minimum value of the plurality of distance values as the feature distance value.
4. The method of claim 2, wherein calculating the feature distance value comprises:
calculating an average of the at least one initial voiceprint feature data segment and the at least one expanded voiceprint feature data segment to obtain an average feature data segment;
and calculating a distance value between the average feature data segment and the voiceprint feature data segment to be recognized, and using that distance value as the feature distance value.
5. The method of claim 2, wherein determining whether the second speaker is the same as the first speaker based on the feature distance value comprises:
if the feature distance value is smaller than a preset threshold value, determining that the second speaker is the same as the first speaker;
and if the feature distance value is greater than or equal to the preset threshold value, determining that the second speaker is different from the first speaker.
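For illustration only, a minimal Python sketch of the recognition flow in claims 2 to 5, assuming voiceprint features are fixed-dimension vectors stored as in the enrollment sketch above; Euclidean distance and the `extract_voiceprint` callable are illustrative stand-ins, as the claims do not fix a distance metric or extractor.

```python
import numpy as np

def identify(storage, speaker_id, probe_segment, extract_voiceprint,
             threshold, strategy="min"):
    """Claims 2-5 flow: extract the probe voiceprint, compute the feature
    distance against the stored features of the claimed identity, and
    decide by comparing with a preset threshold."""
    probe = extract_voiceprint(probe_segment)       # voiceprint of second speaker
    stored = storage[speaker_id]                    # initial + expanded features
    if strategy == "min":                           # claim 3: minimum distance
        feature_distance = np.linalg.norm(stored - probe, axis=1).min()
    else:                                           # claim 4: distance to the mean
        feature_distance = np.linalg.norm(stored.mean(axis=0) - probe)
    return feature_distance < threshold             # claim 5: same speaker iff below
```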
6. The method according to claim 1, wherein performing data transformation on each initial voice data segment of the at least one initial voice data segment from a sound source dimension to obtain at least one expanded voice data segment comprises:
selecting M sub-dimensions from the sound source dimensions, wherein M is a positive integer;
and performing data transformation on each initial voice data segment in the at least one initial voice data segment from the M sub-dimensions to obtain the at least one expanded voice data segment.
7. The method according to claim 1, wherein performing data transformation on each initial voice data segment of the at least one initial voice data segment from a sound source dimension and a channel dimension to obtain at least one expanded voice data segment comprises:
selecting M sub-dimensions from the sound source dimension and the channel dimension, wherein M is a positive integer;
and performing data transformation on each initial voice data segment in the at least one initial voice data segment from the M sub-dimensions to obtain the at least one expanded voice data segment.
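For illustration only, a sketch of the sub-dimension selection in claims 6 and 7. The claims leave the selection policy open, so random sampling is used as a stand-in; the sub-dimension names follow claim 11.

```python
import random

SOUND_SOURCE_SUBDIMS = ["speech_rate", "pitch"]                     # claim 11
CHANNEL_SUBDIMS = ["bandwidth", "codec", "noise", "reverberation"]  # claim 11

def select_subdimensions(m, include_channel=True):
    """Claim 6: pick M sub-dimensions from the sound source dimension alone.
    Claim 7: pick M sub-dimensions from the sound source and channel dimensions."""
    pool = SOUND_SOURCE_SUBDIMS + (CHANNEL_SUBDIMS if include_channel else [])
    return random.sample(pool, m)   # M must not exceed the pool size
```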
8. The method according to claim 6 or 7, wherein performing data transformation on each initial voice data segment of the at least one initial voice data segment from the M sub-dimensions to obtain the at least one expanded voice data segment comprises:
performing data transformation on each initial voice data segment of the at least one initial voice data segment from the M sub-dimensions respectively, to obtain at least M transformed data segments corresponding to each initial voice data segment;
and determining the at least M transformed data segments corresponding to each initial voice data segment as expanded voice data segments in the at least one expanded voice data segment.
9. The method of claim 8, wherein performing data transformation on each initial voice data segment of the at least one initial voice data segment from the M sub-dimensions to obtain the at least one expanded voice data segment further comprises:
determining each of the M sub-dimensions in turn as a target sub-dimension;
performing data transformation on each target transformed data segment from the target sub-dimension to obtain a plurality of combined transformed data segments, wherein a target transformed data segment refers to a voice data segment that has undergone data transformation from M-1 sub-dimensions, and the M-1 sub-dimensions refer to the sub-dimensions other than the target sub-dimension among the M sub-dimensions;
determining the plurality of combined transformed data segments as expanded voice data segments in the at least one expanded voice data segment.
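For illustration only, a sketch of the expansion in claims 8 and 9: each initial segment is first transformed from each of the M sub-dimensions separately, and then, for each target sub-dimension, a segment already transformed from the other M-1 sub-dimensions is transformed again from the target sub-dimension. Composing the M-1 transforms sequentially is one reading of the claim, not the only possible one.

```python
def expand_with_subdimensions(segment, subdim_transforms):
    """subdim_transforms maps a sub-dimension name to a transform function.
    Returns the expanded voice data segments of claims 8 and 9."""
    subdims = list(subdim_transforms)
    # Claim 8: one transformed data segment per sub-dimension.
    expanded = [subdim_transforms[d](segment) for d in subdims]
    # Claim 9: combined transformed data segments.
    for target in subdims:
        combined = segment
        for other in subdims:
            if other != target:
                combined = subdim_transforms[other](combined)  # M-1 transforms
        expanded.append(subdim_transforms[target](combined))   # plus the target
    return expanded
```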
10. The method of claim 9, wherein performing data transformation on a data segment to be transformed from any one of the M sub-dimensions comprises:
acquiring at least one transformation parameter corresponding to the sub-dimension, wherein the transformation parameter indicates a data transformation amount corresponding to the sub-dimension;
and performing data transformation on the data segment to be transformed according to each transformation parameter of the at least one transformation parameter.
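For illustration only, a sketch of claim 10: each sub-dimension carries one or more transformation parameters, each indicating a transformation amount, and one transformed segment is produced per parameter. The parameter values and the `apply_transform` callback are illustrative; the patent does not specify them.

```python
# Illustrative transformation parameters per sub-dimension (claim 10).
TRANSFORM_PARAMS = {
    "speech_rate": [0.9, 1.1],   # time-stretch factors
    "pitch": [-2.0, 2.0],        # semitone shifts
    "noise": [20.0, 10.0],       # target signal-to-noise ratios in dB
}

def transform_by_params(segment, subdim, apply_transform):
    """Apply every transformation parameter registered for one sub-dimension,
    yielding one transformed data segment per parameter."""
    return [apply_transform(segment, subdim, p) for p in TRANSFORM_PARAMS[subdim]]
```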
11. The method according to any one of claims 1 to 7, wherein the sound source dimension comprises a speech rate sub-dimension and a pitch sub-dimension, and the channel dimension comprises a frequency bandwidth sub-dimension, a codec sub-dimension, a noise sub-dimension, and a reverberation sub-dimension.
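For illustration only, possible realizations of the sub-dimension transforms named in claim 11, sketched with librosa, NumPy, and SciPy; these are common signal-processing choices, not implementations mandated by the patent, and a codec transform (e.g., re-encoding through AMR or Opus) is omitted because it requires an external codec.

```python
import numpy as np
import librosa
from scipy.signal import butter, lfilter, fftconvolve

def speech_rate_transform(y, sr, rate=1.1):
    # speech-rate sub-dimension: time-stretch without changing pitch
    return librosa.effects.time_stretch(y, rate=rate)

def pitch_transform(y, sr, n_steps=2.0):
    # pitch sub-dimension: shift the pitch by n_steps semitones
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def bandwidth_transform(y, sr, cutoff_hz=3400.0):
    # frequency-bandwidth sub-dimension: band-limit as a narrowband channel would
    b, a = butter(4, cutoff_hz / (sr / 2), btype="low")
    return lfilter(b, a, y)

def noise_transform(y, sr, snr_db=15.0):
    # noise sub-dimension: add white noise at a target SNR
    noise = np.random.randn(len(y))
    scale = np.sqrt(np.mean(y ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return y + scale * noise

def reverberation_transform(y, sr, rt60=0.3):
    # reverberation sub-dimension: convolve with a synthetic decaying impulse response
    t = np.arange(int(rt60 * sr)) / sr
    ir = np.random.randn(len(t)) * np.exp(-6.9 * t / rt60)
    return fftconvolve(y, ir)[: len(y)]
```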
12. An identification device comprising means for performing the identification method of any of claims 1 to 11.
13. An identification device comprising a processor and a memory, wherein:
the memory to store program instructions;
the processor is configured to call and execute program instructions stored in the memory to cause the identification device to perform the identification method of any one of claims 1 to 11.
14. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the identification method of any of claims 1 to 11.
CN201811031757.1A 2018-09-05 2018-09-05 Identity recognition method and equipment Active CN110880325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811031757.1A CN110880325B (en) 2018-09-05 2018-09-05 Identity recognition method and equipment

Publications (2)

Publication Number Publication Date
CN110880325A CN110880325A (en) 2020-03-13
CN110880325B true CN110880325B (en) 2022-06-28

Family

ID=69727653

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104616655B (en) * 2015-02-05 2018-01-16 北京得意音通技术有限责任公司 The method and apparatus of sound-groove model automatic Reconstruction

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760434A (en) * 2012-07-09 2012-10-31 华为终端有限公司 Method for updating voiceprint feature model and terminal
CN103546622A (en) * 2012-07-12 2014-01-29 百度在线网络技术(北京)有限公司 Control method, device and system for identifying login on basis of voiceprint
CN105810200A (en) * 2016-02-04 2016-07-27 深圳前海勇艺达机器人有限公司 Man-machine dialogue apparatus and method based on voiceprint identification
CN107395352A (en) * 2016-05-16 2017-11-24 腾讯科技(深圳)有限公司 Personal identification method and device based on vocal print
CN107705791A (en) * 2016-08-08 2018-02-16 中国电信股份有限公司 Caller identity confirmation method, device and Voiceprint Recognition System based on Application on Voiceprint Recognition
CN107068154A (en) * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 The method and system of authentication based on Application on Voiceprint Recognition
CN108492830A (en) * 2018-03-28 2018-09-04 深圳市声扬科技有限公司 Method for recognizing sound-groove, device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Voiceprint analysis for Parkinson's disease using MFCC, GMM, and instance based learning and multilayer perceptron; Soham Dasgupta; 2017 IEEE International Conference on Power, Control, Signals and Instrumentation Engineering (ICPCSI); 2018-06-21 *
Application of voiceprint recognition technology in identifying speakers by voice (声纹识别技术在听音知人中的应用); Liu Su (刘苏); Communications World (通讯世界); 2017-01-31 (No. 1); pp. 227-228 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant