CN112201256A - Voiceprint segmentation method, apparatus, device, and readable storage medium

Info

Publication number: CN112201256A (granted as CN112201256B)
Application number: CN202011072850.4A
Authority: CN (China)
Prior art keywords: voiceprint, segmentation, voice, frame, granularity
Other languages: Chinese (zh)
Inventor: 谭聪慧
Assignee (current and original): WeBank Co Ltd
Legal status: Granted; Active
Events: application filed by WeBank Co Ltd; priority to CN202011072850.4A; publication of CN112201256A; application granted; publication of CN112201256B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/10 Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voiceprint segmentation method, apparatus, device, and readable storage medium. The method comprises: obtaining a voice to be segmented and performing coarse-grained frame division on it to obtain the first segmentation-granularity voice frames corresponding to the voice; performing voiceprint recognition on each first segmentation-granularity voice frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented; performing fine-grained frame division on the boundary regions of the first segmentation-granularity voice frames, based on the first voiceprint recognition result, to obtain the second segmentation-granularity voice frames; performing voiceprint recognition on each second segmentation-granularity voice frame to obtain a second voiceprint recognition result; and performing voiceprint segmentation on the voice to be segmented based on the first and second voiceprint recognition results to obtain a target voiceprint segmentation result. The method and the apparatus address the technical problem of low voiceprint segmentation accuracy.

Description

Voiceprint segmentation method, apparatus, device, and readable storage medium
Technical Field
The present application relates to the field of artificial intelligence in financial technology (Fintech), and in particular, to a voiceprint segmentation method, apparatus, device, and readable storage medium.
Background
With the continuous development of financial technologies, especially internet technology and finance, more and more technologies (such as distributed computing, blockchain, and artificial intelligence) are being applied in the financial field; at the same time, the financial industry places ever higher requirements on these technologies, such as higher requirements on the distribution of the industry's pending workloads.
With the continuous development of computer software and artificial intelligence, AI is being applied ever more widely. In the field of speech recognition, it is often necessary to perform voiceprint segmentation on speech, that is, to divide the speech into multiple segments such that each segment is a stretch of continuous speech from a single speaker. At present, the speech is usually divided into voice frames of a fixed size, and voiceprint recognition is then performed on each fixed-size frame to identify which speaker it belongs to, thereby achieving voiceprint segmentation of the speech.
Disclosure of Invention
The present application mainly aims to provide a voiceprint segmentation method, apparatus, device, and readable storage medium, with the goal of solving the technical problem of low voiceprint segmentation accuracy in the prior art.
To achieve the above object, the present application provides a voiceprint segmentation method, applied to a voiceprint segmentation device, the method comprising:
acquiring a voice to be segmented, and performing coarse-grained frame division on the voice to be segmented to obtain each first segmentation-granularity voice frame corresponding to the voice to be segmented;
performing voiceprint recognition on each first segmentation-granularity voice frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented;
performing fine-grained frame division on the boundary region of each first segmentation-granularity voice frame based on the first voiceprint recognition result to obtain each second segmentation-granularity voice frame;
performing voiceprint recognition on each second segmentation-granularity voice frame to obtain a second voiceprint recognition result;
and performing voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result to obtain a target voiceprint segmentation result.
The present application further provides a voiceprint segmentation apparatus. The apparatus is a virtual apparatus applied to a voiceprint segmentation device, and comprises:
a first frame division module, configured to acquire a voice to be segmented and perform coarse-grained frame division on it to obtain each first segmentation-granularity voice frame corresponding to the voice to be segmented;
a first voiceprint recognition module, configured to perform voiceprint recognition on each first segmentation-granularity voice frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented;
a second frame division module, configured to perform fine-grained frame division on the boundary region of each first segmentation-granularity voice frame based on the first voiceprint recognition result, to obtain each second segmentation-granularity voice frame;
a second voiceprint recognition module, configured to perform voiceprint recognition on each second segmentation-granularity voice frame to obtain a second voiceprint recognition result;
and a voiceprint segmentation module, configured to perform voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result, to obtain a target voiceprint segmentation result.
The present application further provides a voiceprint segmentation device. The device is a physical device comprising a memory, a processor, and a program for the voiceprint segmentation method stored on the memory and executable on the processor; when executed by the processor, the program implements the steps of the voiceprint segmentation method described above.
The present application also provides a readable storage medium having stored thereon a program for implementing a voiceprint segmentation method, which when executed by a processor, implements the steps of the voiceprint segmentation method as described above.
The application provides a voiceprint segmentation method, apparatus, device, and readable storage medium. Compared with the prior-art technique of dividing speech into fixed-size voice frames and then performing voiceprint recognition on each fixed-size frame, the present application first performs coarse-grained frame division on the acquired voice to be segmented, obtaining first segmentation-granularity voice frames with a larger frame size. Here it should be noted that if a voice frame is too small, it contains too little voice feature information and voiceprint recognition accuracy drops. Voiceprint recognition is therefore performed on each first segmentation-granularity voice frame to obtain the first voiceprint recognition result, so that recognition rests on frames containing sufficient voice feature information, which improves recognition accuracy in the non-boundary regions of the voice to be segmented. Next, based on the first voiceprint recognition result, fine-grained frame division is applied to the boundary region of each first segmentation-granularity voice frame, obtaining second segmentation-granularity voice frames with a smaller frame size; that is, the segmentation granularity is refined exactly where it matters. Conversely, if a voice frame is too large, it covers voice feature information from multiple speakers, so that the feature information of several speakers is mixed within the same frame; this confused portion is usually the boundary region where two speakers' voices join, and it lowers recognition accuracy. Performing voiceprint recognition on each second segmentation-granularity voice frame therefore yields a second voiceprint recognition result that accurately distinguishes the confusable portion of each boundary region. Finally, voiceprint segmentation is performed on the voice to be segmented based on both recognition results: the first voiceprint recognition result is used in the non-boundary regions and the second voiceprint recognition result in the boundary regions, producing the target voiceprint segmentation result. This overcomes the technical defect in the prior art whereby frames that are too large or too small reduce voiceprint recognition accuracy and, in turn, voiceprint segmentation accuracy; the accuracy of voiceprint segmentation is thus improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below; it is obvious that those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a first embodiment of the voiceprint segmentation method of the present application;
FIG. 2 is a flowchart illustrating a voiceprint segmentation method according to a second embodiment of the present application;
FIG. 3 is a schematic view of an overall process flow of voiceprint segmentation performed in an embodiment of the voiceprint segmentation method of the present application;
FIG. 4 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present application.
The objectives, features, and advantages of the present application will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In a first embodiment of the voiceprint segmentation method of the present application, referring to fig. 1, the voiceprint segmentation method includes:
step S10, obtaining a voice to be segmented, and performing coarse-grained frame division on the voice to be segmented to obtain each first segmentation-scale sound frame corresponding to the voice to be segmented;
in this embodiment, it should be noted that the speech to be segmented is speech collected in a multi-person conversation scene, the speech to be segmented includes speech uttered by multiple speakers, and the purpose of voiceprint recognition of the speech to be segmented is to recognize the correspondence between the speech to be segmented and each speaker, that is, the purpose of voiceprint recognition of the voice to be segmented is to identify which speaker each segment of voice in the voice to be segmented belongs to respectively, the coarse-grained frame is divided into voiceprint segmentations with segmentation granularity larger than a first preset segmentation granularity threshold, that is, the first cut-size sound frame obtained by the coarse-granularity frame division is larger, so that enough sound characteristic information is contained in the first cut-size sound frame, therefore, the accuracy rate of the speaker to which the first cut-sound-granularity sound frame belongs can be identified based on the sound characteristic information in the first cut-sound-granularity sound frame, and is greater than or equal to the preset accuracy rate threshold value.
The voice to be segmented is obtained, and coarse-grained frame division is performed on it to obtain each corresponding first segmentation-granularity voice frame. Specifically, the voice to be segmented is obtained and divided into equidistant frames based on a preset first segmentation granularity, where the preset first segmentation granularity is a preset frame size, so that each first segmentation-granularity voice frame is a voice frame of that preset size.
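As an illustrative sketch only (the patent does not fix concrete values), equidistant coarse-grained framing of a waveform might look like the following Python, assuming 16 kHz mono audio and a hypothetical 2-second first segmentation granularity:

```python
import numpy as np

def split_into_frames(samples: np.ndarray, frame_size: int) -> list[np.ndarray]:
    """Split a 1-D waveform into consecutive, equally sized frames;
    the final, possibly shorter remainder is kept as its own frame."""
    return [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]

SAMPLE_RATE = 16000                   # assumed sample rate
FIRST_GRANULARITY = 2 * SAMPLE_RATE   # hypothetical preset first segmentation granularity
# first_frames = split_into_frames(speech_to_segment, FIRST_GRANULARITY)
```

The same helper is reused later with a smaller frame size for the fine-grained division of step S30.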
Step S20, performing voiceprint recognition on each first segmentation-granularity voice frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented;
In this embodiment, voiceprint recognition is performed on each first segmentation-granularity voice frame to obtain the first voiceprint recognition result. Specifically, based on the voice feature information in each first segmentation-granularity voice frame, the voice attribution of each frame is identified, and the set of attribution results is taken as the first voiceprint recognition result. Here, the voice feature information includes, for example, a spectrogram of the voice, and a voice attribution is the speaker who produced the voice.
The step of performing voiceprint recognition on each first segmentation-granularity voice frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented comprises:
Step S21, scoring each first segmentation-granularity voice frame to obtain a target voiceprint attribution recognition score corresponding to each first segmentation-granularity voice frame;
in this embodiment, it should be noted that the voiceprint segmentation device includes a preset voiceprint recognition model, and the preset voiceprint recognition model is a preset machine learning model and is used for voiceprint recognition.
Each first segmentation-granularity voice frame is scored to obtain its target voiceprint attribution recognition score. Specifically, the following steps are performed for each first segmentation-granularity voice frame:
The frame is input into a preset voiceprint recognition model, feature extraction is performed on the frame's voice-frame matrix representation to obtain a feature extraction result, and the frame is scored based on that result to obtain a scoring vector. The scoring vector contains at least one voiceprint recognition score, where a voiceprint recognition score is a probability estimate that the frame belongs to a given speaker, and the voice-frame matrix representation is a matrix form of the frame used to represent its voice feature information. The largest voiceprint recognition score in the scoring vector is then selected as the target voiceprint attribution recognition score.
the step of scoring each of the first cut-level sound frames to obtain a target voiceprint belonging identification score corresponding to each of the first cut-level sound frames includes:
step S211, based on preset user voiceprint information, respectively performing similarity scoring on each first cut-out-degree-of-granularity voice frame to obtain voiceprint similarity scoring information respectively corresponding to each first cut-out-degree-of-granularity voice frame;
in this embodiment, it should be noted that the preset user voiceprint information is voice feature information obtained by extracting pre-collected speaker features, and the preset user voiceprint information at least includes a preset voice feature expression vector of a speaker, where the preset voice feature expression vector is a preset feature extraction vector used for expressing the voice feature information of the speaker.
Based on the preset user voiceprint information, similarity scoring is performed on each first segmentation-granularity voice frame to obtain its voiceprint similarity score information. Specifically, the following steps are performed for each first segmentation-granularity voice frame:
The frame is input into the preset voiceprint recognition model, and feature extraction is performed on its voice-frame matrix representation to obtain a feature extraction vector. The similarity between this feature extraction vector and each preset voice feature representation vector is then computed, yielding one similarity score per enrolled speaker, where a similarity score is an evaluation of how similar the feature extraction vector is to a preset voice feature representation vector. The set of similarity scores is taken as the voiceprint similarity score information.
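A minimal sketch of this similarity scoring, under the assumptions that speaker voiceprints are fixed-length embedding vectors and that cosine similarity is the similarity measure (the patent does not prescribe a specific measure; all names here are hypothetical):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_frame(frame_embedding: np.ndarray,
                enrolled_voiceprints: dict[str, np.ndarray]) -> dict[str, float]:
    """One similarity score per enrolled speaker; frame_embedding is the
    feature extraction vector produced by the voiceprint recognition model."""
    return {speaker: cosine_similarity(frame_embedding, voiceprint)
            for speaker, voiceprint in enrolled_voiceprints.items()}
```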
Step S212, generating each target voiceprint attribution recognition score based on each piece of voiceprint similarity score information.
In this embodiment, each target voiceprint attribution recognition score is generated from the corresponding voiceprint similarity score information. Specifically, for each piece of voiceprint similarity score information, the maximum similarity score among its similarity scores is selected as the target voiceprint attribution recognition score.
Step S22, generating the first voiceprint recognition result based on each target voiceprint attribution recognition score.
In this embodiment, the first voiceprint recognition result is generated from the target voiceprint attribution recognition scores. Specifically, the target speaker corresponding to each target voiceprint attribution recognition score is determined, and each target speaker is taken as the voiceprint recognition attribution of the first segmentation-granularity voice frame that produced that score, with voiceprint recognition attributions corresponding one-to-one to target speakers. The one-to-one correspondence between voiceprint recognition attributions and first segmentation-granularity voice frames is then taken as the first voiceprint recognition result.
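Continuing the sketch above (reusing the hypothetical score_frame helper), the first voiceprint recognition result reduces to a per-frame speaker label chosen by maximum score:

```python
def first_voiceprint_recognition(frame_embeddings, enrolled_voiceprints):
    """One speaker attribution per first segmentation-granularity frame,
    taken as the speaker with the maximum similarity score."""
    attributions = []
    for embedding in frame_embeddings:
        scores = score_frame(embedding, enrolled_voiceprints)
        attributions.append(max(scores, key=scores.get))
    return attributions
```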
Step S30, performing fine-grained frame division on the boundary region of each first segmentation-granularity voice frame based on the first voiceprint recognition result, to obtain each second segmentation-granularity voice frame;
In this embodiment, fine-grained frame division is performed on the boundary regions of the first segmentation-granularity voice frames based on the first voiceprint recognition result. Specifically, using the one-to-one correspondence between voiceprint recognition attributions and first segmentation-granularity voice frames, every pair of adjacent first segmentation-granularity voice frames attributed to two different target speakers is taken together as a boundary region. A boundary region contains at least one target boundary voice frame, where a target boundary voice frame is the combination of two adjacent first segmentation-granularity voice frames attributed to two different target speakers. Each target boundary voice frame is then divided into equidistant frames based on a preset second segmentation granularity, yielding the second segmentation-granularity voice frames corresponding to each target boundary voice frame. The preset second segmentation granularity is smaller than the preset first segmentation granularity, and fine-grained frame division is frame division with a segmentation granularity smaller than a second preset segmentation-granularity threshold, which is itself smaller than or equal to the first preset segmentation-granularity threshold. In other words, the second segmentation-granularity voice frames obtained by fine-grained division are small, so the divided frames are unlikely to be confused; that is, it should rarely happen that the same second segmentation-granularity voice frame plausibly belongs to two target speakers.
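As a sketch, locating boundary regions from the per-frame attributions amounts to finding adjacent frames with differing labels (hypothetical helper, continuing the earlier sketches):

```python
def find_boundary_pairs(attributions: list[str]) -> list[tuple[int, int]]:
    """Index pairs (i, i + 1) of adjacent first segmentation-granularity
    frames attributed to different target speakers; each such pair forms
    one target boundary voice frame."""
    return [(i, i + 1)
            for i in range(len(attributions) - 1)
            if attributions[i] != attributions[i + 1]]
```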
The boundary region includes at least one target boundary voice frame, and the step of performing fine-grained frame division on the boundary region of each first segmentation-granularity voice frame based on the first voiceprint recognition result to obtain each second segmentation-granularity voice frame includes:
step S31, determining the target boundary sound frame corresponding to each of the first cut-level sound frames based on the first voiceprint recognition result;
in this embodiment, the target boundary sound frame corresponding to each first cut-out-of-speech sound frame is determined based on the first voiceprint recognition result, and specifically, two adjacent first cut-out-of-speech sound frames belonging to different target speakers are selected as the target boundary sound frame in each first cut-out-of-speech sound frame based on a one-to-one correspondence between each voiceprint recognition attribution and each first cut-out-of-speech sound frame.
Wherein the determining of the target boundary voice frame based on the first voiceprint recognition result includes:
Step S311, determining a first attribution part and a second attribution part among the first segmentation-granularity voice frames based on the first voiceprint recognition result;
In this embodiment, it should be noted that the voice to be segmented is a conversation between two target speakers: the first half of the voice belongs to a first target speaker and the second half belongs to a second target speaker.
The first attribution part and the second attribution part are determined from the first voiceprint recognition result. Specifically, based on the one-to-one correspondence between voiceprint recognition attributions and first segmentation-granularity voice frames, the frames are divided into a first attribution part and a second attribution part, where each first segmentation-granularity voice frame in the first attribution part belongs to the first target speaker and each first segmentation-granularity voice frame in the second attribution part belongs to the second target speaker.
Step S312, acquiring a first boundary-region voice frame in the first attribution part and a second boundary-region voice frame in the second attribution part;
In this embodiment, the target separation point between the first attribution part and the second attribution part is determined; the first segmentation-granularity voice frame immediately before the separation point is taken as the first boundary-region voice frame of the first attribution part, and the first segmentation-granularity voice frame immediately after the separation point is taken as the second boundary-region voice frame of the second attribution part.
Step S313, combining the first boundary-region voice frame and the second boundary-region voice frame to obtain the target boundary voice frame.
In this embodiment, the first boundary-region voice frame and the second boundary-region voice frame are joined into a single connected frame spanning both, and this connected frame is taken as the target boundary voice frame.
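For waveforms held as numpy arrays, steps S312 and S313 together reduce to locating the attribution change point and concatenating the frames on either side of it. A sketch under that assumption (two speakers, a single change point, hypothetical names):

```python
import numpy as np

def target_boundary_frame(frames: list[np.ndarray],
                          attributions: list[str]) -> np.ndarray:
    """Join the frame just before the target separation point (first
    boundary-region frame) with the frame just after it (second
    boundary-region frame) into one target boundary frame."""
    sep = next(i for i in range(1, len(attributions))
               if attributions[i] != attributions[i - 1])
    return np.concatenate([frames[sep - 1], frames[sep]])
```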
Step S32, performing low-granularity segmentation on the target boundary voice frame to obtain each second segmentation-granularity voice frame.
In this embodiment, the target boundary voice frame is divided equidistantly based on a preset second segmentation granularity, cutting it into voice frames whose frame size equals the preset second segmentation granularity; these frames are taken as the second segmentation-granularity voice frames. The preset second segmentation granularity is smaller than the preset first segmentation granularity.
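Reusing the split_into_frames sketch from step S10 with a smaller, likewise assumed frame size:

```python
SECOND_GRANULARITY = SAMPLE_RATE // 4   # hypothetical 0.25 s; must be < FIRST_GRANULARITY
# second_frames = split_into_frames(boundary_frame, SECOND_GRANULARITY)
```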
Step S40, performing voiceprint recognition on each second segmentation-granularity voice frame to obtain a second voiceprint recognition result;
In this embodiment, voiceprint recognition is performed on each second segmentation-granularity voice frame. Specifically, based on the voice feature information of each second segmentation-granularity voice frame, the second voice attribution of each frame is identified, that is, whether the frame belongs to the first target speaker or the second target speaker, and the set of second voice attributions is taken as the second voiceprint recognition result.
In an implementation manner, step S40 includes:
Each second segmentation-granularity voice frame is scored based on a preset voiceprint recognition model to obtain a first-attribution voiceprint recognition score and a second-attribution voiceprint recognition score for that frame. Specifically, the following steps are performed for each second segmentation-granularity voice frame:
The frame is input into the preset voiceprint recognition model, and feature extraction is performed on it to obtain a second feature extraction vector. The bit similarity between this vector and the preset voice feature representation vector of the first target speaker is computed to obtain a first bit-similarity evaluation value, which is taken as the first-attribution voiceprint recognition score. Here, bit similarity is the ratio of positions at which two vectors hold the same value; for example, for vector A = (a, b, c, d) and vector B = (a, b, c, e), three positions agree, so the bit similarity is 75%. The first-attribution voiceprint recognition score is a probability score used to evaluate the probability that the frame belongs to the first target speaker, and the second-attribution voiceprint recognition score is the corresponding probability score that the frame belongs to the second target speaker. The bit similarity between the second feature extraction vector and the preset voice feature representation vector of the second target speaker is likewise computed to obtain a second bit-similarity evaluation value, which is taken as the second-attribution voiceprint recognition score.
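The bit similarity described here (the fraction of positions at which two equal-length vectors agree) can be sketched directly:

```python
def bit_similarity(a: list, b: list) -> float:
    """Ratio of identical positions between two equal-length vectors,
    e.g. (a, b, c, d) vs (a, b, c, e) -> 3/4 = 75%."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)
```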
Step S50, performing voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result, to obtain a target voiceprint segmentation result.
In this embodiment, the voice to be segmented is voiceprint-segmented based on both recognition results. Specifically, the non-boundary region of the voice is segmented based on the first voiceprint recognition result to obtain a first voiceprint segmentation result; the boundary region is segmented based on the second voiceprint recognition result to obtain a second voiceprint segmentation result; and the two are fused into the voiceprint segmentation result of the voice to be segmented, that is, the target voiceprint segmentation result. For example, suppose the voice to be segmented consists of first segmentation-granularity voice frames A, B, and C, where frames A and C lie in non-boundary regions and frame B is the boundary region, and frame B is subdivided into second segmentation-granularity voice frames E and F. From the first voiceprint recognition result, frame A belongs to speaker X and frame C belongs to speaker Y, so when segmenting the non-boundary regions, frame A is placed in the first speech segment (speaker X) and frame C in the second speech segment (speaker Y). From the second voiceprint recognition result, frame E belongs to speaker X and frame F to speaker Y, so frame E is placed in the first speech segment and frame F in the second. The first speech segment of speaker X thus comprises first segmentation-granularity voice frame A and second segmentation-granularity voice frame E, and the second speech segment of speaker Y comprises first segmentation-granularity voice frame C and second segmentation-granularity voice frame F.
It should be noted that, because the non-boundary region of the voice to be segmented is divided at a high segmentation granularity and each first segmentation-granularity voice frame in a non-boundary region corresponds uniquely to one target speaker, using the first voiceprint recognition result for voiceprint segmentation of the non-boundary region avoids the situation in which an oversized frame covers the voice feature information of multiple speakers and mixes their features within one frame; the frames there are large, so voiceprint recognition accuracy in the non-boundary region is extremely high, and so is segmentation accuracy. The boundary region, by contrast, is divided at a low segmentation granularity, refining the granularity exactly where confusion occurs, so that voiceprint recognition can be performed on smaller frames to obtain the second voiceprint recognition result. The boundary region, which is prone to confusion of voice feature information, therefore uses the second voiceprint recognition result obtained from smaller frames, achieving accurate recognition there and avoiding the drop in accuracy caused by mixing multiple speakers' voice feature information within one frame. In short, voiceprint recognition uses frames as large as possible in non-boundary regions, where confusion is unlikely, to guarantee accuracy there, and smaller frames in boundary regions, where confusion is likely, to guarantee accuracy there; overall voiceprint recognition accuracy improves, and with it voiceprint segmentation accuracy.
In summary, compared with the prior-art technique of dividing speech into fixed-size voice frames and performing voiceprint recognition on each fixed-size frame, this embodiment performs coarse-grained frame division to obtain large first segmentation-granularity voice frames, so that recognition in non-boundary regions rests on frames with sufficient voice feature information, and then performs fine-grained frame division on the boundary regions to obtain small second segmentation-granularity voice frames, so that the confusable portions where two speakers' voices join can be distinguished accurately. Voiceprint segmentation based on the first voiceprint recognition result in the non-boundary regions and the second voiceprint recognition result in the boundary regions thus overcomes the technical defect that frames that are too large or too small reduce voiceprint recognition accuracy, and improves the accuracy of voiceprint segmentation.
Further, referring to fig. 2, based on the first embodiment of the present application, in another embodiment of the present application, the second voiceprint recognition result includes at least a first-attribution voiceprint recognition score and a second-attribution voiceprint recognition score for each second segmentation-granularity voice frame,
and the step of performing voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result to obtain a target voiceprint segmentation result comprises:
step S41, determining a target segmentation point corresponding to each second segmentation-granularity sound frame based on each first home voiceprint recognition score and each second home voiceprint recognition score;
in this embodiment, a target segmentation point corresponding to each second-granularity segmented sound frame is determined based on each first home voiceprint recognition score and each second home voiceprint recognition score, and specifically, a maximum voiceprint score sum corresponding to each second-granularity segmented sound frame is calculated based on each first home voiceprint recognition score and each second home voiceprint recognition score, where the maximum voiceprint score sum is a maximum sum of target voiceprint scores corresponding to each second-granularity segmented sound frame, where the target score is one of a first home voiceprint recognition score and a corresponding first home voiceprint recognition score corresponding to the second-granularity segmented sound frame, and then a voiceprint score sequence corresponding to each second-granularity segmented sound frame is determined based on the maximum voiceprint score sum, where the voiceprint score sequence is a sequence consisting of the maximum voiceprint score and each voiceprint score, the sequence of the voiceprint scoring sequence is consistent with the time sequence of each second segmentation-granularity voice frame, and then a voiceprint scoring mutation point is inquired in the voiceprint scoring sequence, wherein the voiceprint scoring mutation point is a boundary point of the voiceprint scoring sequence, which is mutated from a first attribution voiceprint recognition score to a second attribution voiceprint recognition score, or is a boundary point of the voiceprint scoring sequence, which is mutated from a second attribution voiceprint recognition score to a first attribution voiceprint recognition score, and then a corresponding voiceprint frame boundary point of the voiceprint scoring mutation point in a sequence composed of each second segmentation-granularity voice frame is used as the target segmentation point, wherein the voiceprint boundary point is a boundary point of two adjacent second segmentation-granularity voice frames.
The step of determining the target segmentation point based on each first-attribution voiceprint recognition score and each second-attribution voiceprint recognition score includes:
Step S411, determining a target voiceprint recognition score sum corresponding to the second segmentation-granularity voice frames based on each first-attribution voiceprint recognition score and each second-attribution voiceprint recognition score;
In this embodiment, for each second segmentation-granularity voice frame, one of its first-attribution and second-attribution voiceprint recognition scores is selected as that frame's to-be-summed voiceprint score; the sum of the selected scores over all frames is computed as a voiceprint score sum; this is repeated until all possible voiceprint score sums have been computed; and the maximum voiceprint score sum is selected as the target voiceprint recognition score sum.
Step S412, determining the voiceprint score mutation point corresponding to the second segmentation-granularity voice frames based on the target voiceprint recognition score sum;
In this embodiment, based on the target voiceprint recognition score sum, the selected scores that make up the sum are ordered according to the time order of the second segmentation-granularity voice frames to obtain a voiceprint score sequence, and the voiceprint score mutation point is located in this sequence: the boundary point at which the sequence switches from first-attribution voiceprint recognition scores to second-attribution voiceprint recognition scores, or vice versa.
Step S413, generating the target segmentation point based on the voiceprint score mutation point.
In this embodiment, the two abruptly differing voiceprint scores immediately before and after the mutation point are determined, and the boundary point between the two second segmentation-granularity voice frames corresponding to those scores is taken as the target segmentation point.
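Since each second segmentation-granularity frame contributes either its first-attribution or its second-attribution score, and the chosen attribution switches exactly once at the mutation point (in the two-speaker case), maximizing the score sum reduces to a search over candidate split indices. A sketch with hypothetical names:

```python
def find_target_split(first_scores: list[float],
                      second_scores: list[float]) -> int:
    """Number k of leading frames attributed to the first target speaker
    that maximizes the voiceprint score sum when frames [0, k) take the
    first speaker's scores and frames [k, n) take the second speaker's;
    the target segmentation point lies between frames k - 1 and k."""
    n = len(first_scores)
    best_k, best_sum = 0, float("-inf")
    for k in range(n + 1):
        total = sum(first_scores[:k]) + sum(second_scores[k:])
        if total > best_sum:
            best_k, best_sum = k, total
    return best_k
```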
Step S43, segmenting the voice to be segmented based on the target segmentation point and the first voiceprint recognition result, to obtain the target voiceprint segmentation result.
In this embodiment, the non-boundary region of the voice to be segmented is first segmented based on the first voiceprint recognition result, yielding a first non-boundary-region speech segment belonging to the first target speaker and a second non-boundary-region speech segment belonging to the second target speaker. The second segmentation-granularity voice frames, that is, the boundary region of the voice, are then segmented at the target segmentation point, yielding a first boundary-region speech segment belonging to the first target speaker and a second boundary-region speech segment belonging to the second target speaker. The first non-boundary-region segment and the first boundary-region segment are fused into a first segmented speech section belonging to the first target speaker, the second non-boundary-region segment and the second boundary-region segment are fused into a second segmented speech section belonging to the second target speaker, and the two segmented speech sections are taken as the target voiceprint segmentation result.
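Putting the earlier sketches together for the two-speaker case (all helper names hypothetical; assumes non-empty parts on both sides of the boundary):

```python
import numpy as np

def voiceprint_segment(frames, attributions, second_frames, split_k):
    """Fuse the non-boundary attributions with the boundary split point
    into one speech section per target speaker. frames/attributions are
    the coarse frames and their speaker labels; second_frames are the
    fine frames of the target boundary frame; split_k comes from
    find_target_split."""
    sep = next(i for i in range(1, len(attributions))
               if attributions[i] != attributions[i - 1])
    # frames[sep - 1] and frames[sep] formed the boundary frame and are
    # replaced here by the fine frames split at split_k.
    first_section = frames[:sep - 1] + second_frames[:split_k]
    second_section = second_frames[split_k:] + frames[sep + 1:]
    return np.concatenate(first_section), np.concatenate(second_section)
```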
In one implementable scheme, fig. 3 shows an overall flow diagram of voiceprint segmentation, in which speaker A is the first target speaker and speaker B is the second target speaker: dividing frames by a fixed size corresponds to step S10; performing voiceprint recognition on each frame to determine its speaker corresponds to step S20; further dividing the frames at the speaker transition boundary into small blocks corresponds to step S30; classifying all the small blocks as A or B corresponds to step S40; and finding the optimal division point and segmenting corresponds to step S50.
This embodiment provides a voiceprint segmentation method in which the target segmentation point corresponding to the second segmentation-granularity voice frames is determined from the first-attribution and second-attribution voiceprint recognition scores; the target segmentation point is the optimal segmentation point of the voice to be segmented, so the voiceprint segmentation effect is optimal. The voice is then segmented based on the target segmentation point and the first voiceprint recognition result to obtain the target voiceprint segmentation result: the target segmentation point segments the boundary region, and the first voiceprint recognition result segments the non-boundary region. Because the first segmentation-granularity frames are large, the first voiceprint recognition result is obtained from frames with sufficient voice feature information, so its accuracy is extremely high and the non-boundary region is segmented very accurately. The target segmentation point, in turn, is determined from the second voiceprint recognition result, obtained from the smaller second segmentation-granularity frames. It should be noted that if a frame is too large it covers the voice feature information of multiple speakers, mixing their features within one frame; the confused portion is usually the boundary region where two speakers' voices join, which lowers recognition accuracy. Because the second segmentation-granularity frames are small, the probability of such confusion in the boundary region is extremely small, so recognition accuracy there is extremely high and the boundary region is segmented very accurately. The accuracy of voiceprint segmentation is thus improved.
Referring to fig. 4, fig. 4 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present application.
As shown in fig. 4, the voiceprint segmentation apparatus may include: a processor 1001, such as a CPU, a memory 1005, and a communication bus 1002. The communication bus 1002 is used for realizing connection communication between the processor 1001 and the memory 1005. The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a memory device separate from the processor 1001 described above.
Optionally, the voiceprint segmentation device may further include a user interface, a network interface, a camera, RF (Radio Frequency) circuitry, sensors, audio circuitry, a WiFi module, and the like. The user interface may include a display screen (Display) and an input sub-module such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface).
Those skilled in the art will appreciate that the voiceprint segmentation device structure shown in fig. 4 does not constitute a limitation of the device, which may include more or fewer components than shown, combine some components, or arrange the components differently.
As shown in fig. 4, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, and a voiceprint recognition program. The operating system is a program that manages and controls the hardware and software resources of the voiceprint segmentation device, supporting the operation of the voiceprint recognition program, as well as other software and/or programs. The network communication module is used to enable communication between the various components within the memory 1005, as well as with other hardware and software in the voiceprint recognition system.
In the voiceprint segmentation apparatus shown in fig. 4, the processor 1001 is configured to execute a voiceprint recognition program stored in the memory 1005 to implement the steps of the voiceprint segmentation method described in any one of the above.
The specific implementation of the voiceprint segmentation apparatus of the present application is substantially the same as that of each embodiment of the voiceprint segmentation method described above, and is not described herein again.
The embodiment of the present application further provides a voiceprint segmentation apparatus, where the voiceprint segmentation apparatus is applied to a voiceprint segmentation device, the voiceprint segmentation apparatus includes:
a first frame division module, configured to acquire a voice to be segmented and perform coarse-grained frame division on it to obtain each first segmentation-granularity voice frame corresponding to the voice to be segmented;
a first voiceprint recognition module, configured to perform voiceprint recognition on each first segmentation-granularity voice frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented;
a second frame division module, configured to perform fine-grained frame division on the boundary region of each first segmentation-granularity voice frame based on the first voiceprint recognition result, to obtain each second segmentation-granularity voice frame;
a second voiceprint recognition module, configured to perform voiceprint recognition on each second segmentation-granularity voice frame to obtain a second voiceprint recognition result;
and a voiceprint segmentation module, configured to perform voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result, to obtain a target voiceprint segmentation result.
Optionally, the second frame division module includes:
a first determining unit, configured to determine the target boundary voice frame corresponding to each first segmentation-granularity voice frame based on the first voiceprint recognition result;
and a low-granularity segmentation unit, configured to perform low-granularity segmentation on the target boundary voice frame to obtain each second segmentation-granularity voice frame.
Optionally, the first determining unit includes:
a first determining subunit, configured to determine, in each first segmentation-granularity sound frame, a first attribution part and a second attribution part based on the first voiceprint recognition result;
an acquiring subunit, configured to acquire a first boundary region sound frame in the first attribution part and a second boundary region sound frame in the second attribution part;
and a combining subunit, configured to combine the first boundary region sound frame and the second boundary region sound frame to obtain the target boundary sound frame (see the sketch below).
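The following sketch illustrates one plausible combination step, assuming the boundary region sound frames are simply the trailing and leading margins of the two attribution parts; the 0.5 s margin and the function name are hypothetical.

```python
import numpy as np

def target_boundary_frame(first_part: np.ndarray, second_part: np.ndarray,
                          sr: int, margin_s: float = 0.5) -> np.ndarray:
    """Combine the tail of the first attribution part with the head of the
    second attribution part (margin_s = 0.5 s is an assumed width)."""
    n = int(sr * margin_s)
    first_boundary = first_part[-n:]   # first boundary region sound frame
    second_boundary = second_part[:n]  # second boundary region sound frame
    return np.concatenate([first_boundary, second_boundary])
```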
Optionally, the voiceprint segmentation module includes:
a second determining unit, configured to determine, based on each first home voiceprint recognition score and each second home voiceprint recognition score, a target segmentation point corresponding to each second segmentation-granularity sound frame;
and the segmentation unit is used for segmenting the voice to be segmented based on the target segmentation point and the first voiceprint recognition result to obtain the target voiceprint segmentation result.
Optionally, the second determining unit includes:
a second determining subunit, configured to determine, based on each first home voiceprint recognition score and each second home voiceprint recognition score, a target voiceprint recognition score sum corresponding to each second segmentation-granularity sound frame;
a third determining subunit, configured to determine, based on each target voiceprint recognition score sum, a voiceprint score mutation point corresponding to each second segmentation-granularity sound frame;
and a first generating subunit, configured to generate the target segmentation point based on the voiceprint score mutation point (one way to locate such a mutation point is sketched below).
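One plausible reading of these subunits is sketched below: the score sum per fine frame is taken as the sum of its two home voiceprint recognition scores, and the mutation point is the frame at which that sum changes most abruptly. Both the summing rule and the argmax-of-differences criterion are assumptions, not formulas given by the disclosure.

```python
import numpy as np

def score_mutation_point(first_scores: list[float],
                         second_scores: list[float]) -> int:
    """Return the index of the second segmentation-granularity frame whose
    recognition score sum jumps most sharply (the assumed mutation point)."""
    sums = np.asarray(first_scores) + np.asarray(second_scores)
    return int(np.argmax(np.abs(np.diff(sums)))) + 1
```

The returned frame index would then be mapped back to a time offset within the voice to be segmented to yield the target segmentation point.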
Optionally, the first voiceprint recognition module includes:
the scoring unit is used for scoring each first segmentation-granularity sound frame respectively to obtain a target voiceprint attribution recognition score corresponding to each first segmentation-granularity sound frame;
and the generating unit is used for generating the first voiceprint recognition result based on each target voiceprint attribution recognition score.
Optionally, the scoring unit includes:
the similarity scoring subunit is configured to perform similarity scoring on each first segmentation-granularity sound frame based on preset user voiceprint information, to obtain voiceprint similarity scoring information corresponding to each first segmentation-granularity sound frame;
and the second generating subunit is configured to generate each target voiceprint attribution recognition score based on each piece of voiceprint similarity scoring information (see the sketch below).
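As an illustration of such similarity scoring, the sketch below compares a frame embedding against preset user voiceprint embeddings using cosine similarity; the metric, the embedding source (e.g., an external x-vector extractor), and all names are assumptions rather than requirements of the disclosure.

```python
import numpy as np

def similarity_scores(frame_emb: np.ndarray,
                      user_voiceprints: dict[str, np.ndarray]) -> dict[str, float]:
    """Cosine-similarity score of one frame embedding against each preset
    user voiceprint embedding (cosine is an assumed choice of metric)."""
    out = {}
    for user, vp in user_voiceprints.items():
        denom = np.linalg.norm(frame_emb) * np.linalg.norm(vp) + 1e-9
        out[user] = float(np.dot(frame_emb, vp) / denom)
    return out
```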
The specific implementation of the voiceprint segmentation apparatus of the present application is substantially the same as that of each embodiment of the voiceprint segmentation method described above, and is not described herein again.
The present application provides a readable storage medium, and the readable storage medium stores one or more programs, which can be executed by one or more processors for implementing the steps of the voiceprint segmentation method described in any one of the above.
The specific implementation of the readable storage medium of the present application is substantially the same as that of the above-mentioned embodiments of the voiceprint segmentation method, and is not described herein again.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims (10)

1. A voiceprint segmentation method, characterized by comprising:
acquiring a voice to be segmented, and performing coarse-grained frame division on the voice to be segmented to obtain each first segmentation-granularity sound frame corresponding to the voice to be segmented;
performing voiceprint recognition on each first segmentation-granularity sound frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented;
performing fine-grained frame division on the boundary area of each first segmentation-granularity sound frame based on the first voiceprint recognition result to obtain each second segmentation-granularity sound frame;
performing voiceprint recognition on each second segmentation-granularity sound frame to obtain a second voiceprint recognition result;
and carrying out voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result to obtain a target voiceprint segmentation result.
2. The voiceprint segmentation method of claim 1, wherein the boundary area includes at least one target boundary sound frame,
the step of performing fine-grained frame division on the boundary area of each first segmentation-granularity sound frame based on the first voiceprint recognition result to obtain each second segmentation-granularity sound frame includes:
determining the target boundary sound frame corresponding to each first segmentation-granularity sound frame based on the first voiceprint recognition result;
and performing low-granularity segmentation on the target boundary sound frame to obtain each second segmentation-granularity sound frame.
3. The voiceprint segmentation method according to claim 2, wherein the step of determining the target boundary sound frame corresponding to each first segmentation-granularity sound frame based on the first voiceprint recognition result includes:
determining a first attribution part and a second attribution part in each first segmentation-granularity sound frame based on the first voiceprint recognition result;
acquiring a first boundary region sound frame in the first attribution part and a second boundary region sound frame in the second attribution part;
and combining the first boundary region sound frame and the second boundary region sound frame to obtain the target boundary sound frame.
4. The voiceprint segmentation method of claim 1, wherein the second voiceprint recognition result comprises at least a first home voiceprint recognition score and a second home voiceprint recognition score corresponding to each second segmentation-granularity sound frame,
the step of performing voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result to obtain a target voiceprint segmentation result comprises:
determining a target segmentation point corresponding to each second segmentation-granularity sound frame based on each first home voiceprint recognition score and each second home voiceprint recognition score;
and segmenting the voice to be segmented based on the target segmentation point and the first voiceprint recognition result to obtain the target voiceprint segmentation result.
5. The voiceprint segmentation method according to claim 4, wherein the step of determining the target segmentation point corresponding to each second segmentation-granularity sound frame based on each first home voiceprint recognition score and each second home voiceprint recognition score comprises:
determining a target voiceprint recognition score sum corresponding to each second segmentation-granularity sound frame based on each first home voiceprint recognition score and each second home voiceprint recognition score;
determining a voiceprint score mutation point corresponding to each second segmentation-granularity sound frame based on each target voiceprint recognition score sum;
and generating the target segmentation point based on the voiceprint score mutation point.
6. The voiceprint segmentation method according to claim 1, wherein the step of performing voiceprint recognition on each first segmentation-granularity sound frame to obtain the first voiceprint recognition result corresponding to the voice to be segmented comprises:
scoring each first segmentation-granularity sound frame respectively to obtain a target voiceprint attribution recognition score corresponding to each first segmentation-granularity sound frame;
and generating the first voiceprint recognition result based on each target voiceprint attribution recognition score.
7. The voiceprint segmentation method according to claim 6, wherein the step of scoring each first segmentation-granularity sound frame respectively to obtain the target voiceprint attribution recognition score corresponding to each first segmentation-granularity sound frame comprises:
performing similarity scoring on each first segmentation-granularity sound frame respectively based on preset user voiceprint information, to obtain voiceprint similarity scoring information corresponding to each first segmentation-granularity sound frame;
and generating each target voiceprint attribution recognition score based on each piece of voiceprint similarity scoring information.
8. A voiceprint segmentation apparatus, wherein the voiceprint segmentation apparatus comprises:
the first frame division module is used for acquiring a voice to be segmented, and performing coarse-grained frame division on the voice to be segmented to obtain each first segmentation-granularity sound frame corresponding to the voice to be segmented;
the first voiceprint recognition module is used for performing voiceprint recognition on each first segmentation-granularity sound frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented;
the second frame division module is used for performing fine-grained frame division on the boundary area of each first segmentation-granularity sound frame based on the first voiceprint recognition result to obtain each second segmentation-granularity sound frame;
the second voiceprint recognition module is used for performing voiceprint recognition on each second segmentation-granularity sound frame to obtain a second voiceprint recognition result;
and the voiceprint segmentation module is used for performing voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result to obtain a target voiceprint segmentation result.
9. A voiceprint segmentation device, characterized in that the voiceprint segmentation device comprises: a memory, a processor, and a program that is stored on the memory and implements the voiceprint segmentation method, wherein:
the memory is used for storing the program implementing the voiceprint segmentation method;
and the processor is configured to execute the program implementing the voiceprint segmentation method, to implement the steps of the voiceprint segmentation method according to any one of claims 1 to 7.
10. A readable storage medium having stored thereon a program for implementing a voiceprint segmentation method, the program being executable by a processor for implementing the steps of the voiceprint segmentation method as claimed in any one of claims 1 to 7.
CN202011072850.4A 2020-10-09 2020-10-09 Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium Active CN112201256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011072850.4A CN112201256B (en) 2020-10-09 2020-10-09 Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112201256A true CN112201256A (en) 2021-01-08
CN112201256B CN112201256B (en) 2023-09-19

Family

ID=74012624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011072850.4A Active CN112201256B (en) 2020-10-09 2020-10-09 Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112201256B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150051912A1 (en) * 2013-08-15 2015-02-19 Chunghwa Telecom Co., Ltd. Method for Segmenting Videos and Audios into Clips Using Speaker Recognition
US20160111112A1 (en) * 2014-10-17 2016-04-21 Fujitsu Limited Speaker change detection device and speaker change detection method
US20170365259A1 (en) * 2015-02-05 2017-12-21 Beijing D-Ear Technologies Co., Ltd. Dynamic password voice based identity authentication system and method having self-learning function
CN106782507A (en) * 2016-12-19 2017-05-31 平安科技(深圳)有限公司 The method and device of voice segmentation
WO2018113243A1 (en) * 2016-12-19 2018-06-28 平安科技(深圳)有限公司 Speech segmentation method, device and apparatus, and computer storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MICHAEL A. CARLIN ET AL.: "Detection of Speaker Change Points in Conversational Speech", 2007 IEEE AEROSPACE CONFERENCE *
ZHOU XIAODONG ET AL.: "Research on Single-Channel Two-Speaker Speech Separation Based on Attention Mechanism", COMMUNICATIONS TECHNOLOGY *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488058A (en) * 2021-06-23 2021-10-08 武汉理工大学 Voiceprint recognition method based on short voice

Also Published As

Publication number Publication date
CN112201256B (en) 2023-09-19

Similar Documents

Publication Publication Date Title
US10810748B2 (en) Multiple targets—tracking method and apparatus, device and storage medium
CN112533051B (en) Barrage information display method, barrage information display device, computer equipment and storage medium
US20210224598A1 (en) Method for training deep learning model, electronic equipment, and storage medium
EP3792818A1 (en) Video processing method and device, and storage medium
CN110298415A (en) A kind of training method of semi-supervised learning, system and computer readable storage medium
US11449706B2 (en) Information processing method and information processing system
US10002296B2 (en) Video classification method and apparatus
EP3872652A2 (en) Method and apparatus for processing video, electronic device, medium and product
JP2011013732A (en) Information processing apparatus, information processing method, and program
CN109740752B (en) Deep model training method and device, electronic equipment and storage medium
CN109815938A (en) Multi-modal affective characteristics recognition methods based on multiclass kernel canonical correlation analysis
JP2015176175A (en) Information processing apparatus, information processing method and program
CN110232331B (en) Online face clustering method and system
CN111241745A (en) Stepwise model selection method, apparatus and readable storage medium
CN112926621B (en) Data labeling method, device, electronic equipment and storage medium
WO2022268182A1 (en) Method and device for detecting standardization of wearing mask
CN114549557A (en) Portrait segmentation network training method, device, equipment and medium
US20220358658A1 (en) Semi Supervised Training from Coarse Labels of Image Segmentation
CN112201256B (en) Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN110110143A (en) A kind of video classification methods and device
CN111385659B (en) Video recommendation method, device, equipment and storage medium
CN111210022A (en) Backward model selection method, device and readable storage medium
CN111985488B (en) Target detection segmentation method and system based on offline Gaussian model
CN115439700B (en) Image processing method and device and machine-readable storage medium
CN113516739A (en) Animation processing method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant