CN112201256A - Voiceprint segmentation method, apparatus, device, and readable storage medium

Info

Publication number: CN112201256A (granted as CN112201256B)
Application number: CN202011072850.4A
Authority: CN (China)
Prior art keywords: voiceprint, segmentation, voice, frame, granularity
Other languages: Chinese (zh)
Inventor: 谭聪慧
Assignee (current and original): WeBank Co Ltd
Legal status: Granted; Active
Events: application filed by WeBank Co Ltd; priority to CN202011072850.4A; publication of CN112201256A; application granted; publication of CN112201256B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/10 Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voiceprint segmentation method, apparatus, device, and readable storage medium. The method comprises: obtaining a voice to be segmented and performing coarse-grained frame division on it to obtain the first segmentation-granularity voice frames corresponding to the voice; performing voiceprint recognition on each first segmentation-granularity voice frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented; performing fine-grained frame division on the boundary regions of the first segmentation-granularity voice frames, based on the first voiceprint recognition result, to obtain the second segmentation-granularity voice frames; performing voiceprint recognition on each second segmentation-granularity voice frame to obtain a second voiceprint recognition result; and performing voiceprint segmentation on the voice to be segmented based on the first and second voiceprint recognition results to obtain a target voiceprint segmentation result. The method and the apparatus address the technical problem of low voiceprint segmentation accuracy.

Description

Voiceprint segmentation method, apparatus, device, and readable storage medium
Technical Field
The present application relates to the field of artificial intelligence in financial technology (Fintech), and in particular, to a voiceprint segmentation method, apparatus, device, and readable storage medium.
Background
With the continuous development of financial technologies, especially internet technology and finance, more and more technologies (such as distributed computing, blockchain, and artificial intelligence) are being applied in the financial field; at the same time, the financial industry places ever higher requirements on these technologies, such as higher requirements on the distribution of the industry's pending workloads.
With the continuous development of computer software and artificial intelligence, AI is being applied ever more widely. In the field of speech recognition, it is often necessary to perform voiceprint segmentation on speech, that is, to divide the speech into multiple segments such that each segment is a stretch of continuous speech from a single speaker. At present, the speech is usually divided into voice frames of a fixed size, and voiceprint recognition is then performed on each fixed-size frame to identify which speaker it belongs to, thereby achieving voiceprint segmentation of the speech.
Disclosure of Invention
The present application mainly aims to provide a voiceprint segmentation method, apparatus, device, and readable storage medium, with the goal of solving the technical problem of low voiceprint segmentation accuracy in the prior art.
To achieve the above object, the present application provides a voiceprint segmentation method, applied to a voiceprint segmentation device, the method comprising:
acquiring a voice to be segmented, and performing coarse-grained frame division on the voice to be segmented to obtain each first segmentation-granularity voice frame corresponding to the voice to be segmented;
performing voiceprint recognition on each first segmentation-granularity voice frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented;
performing fine-grained frame division on the boundary region of each first segmentation-granularity voice frame based on the first voiceprint recognition result to obtain each second segmentation-granularity voice frame;
performing voiceprint recognition on each second segmentation-granularity voice frame to obtain a second voiceprint recognition result;
and performing voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result to obtain a target voiceprint segmentation result.
The present application further provides a voiceprint segmentation apparatus. The apparatus is a virtual apparatus applied to a voiceprint segmentation device, and comprises:
a first frame division module, configured to acquire a voice to be segmented and perform coarse-grained frame division on it to obtain each first segmentation-granularity voice frame corresponding to the voice to be segmented;
a first voiceprint recognition module, configured to perform voiceprint recognition on each first segmentation-granularity voice frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented;
a second frame division module, configured to perform fine-grained frame division on the boundary region of each first segmentation-granularity voice frame based on the first voiceprint recognition result, to obtain each second segmentation-granularity voice frame;
a second voiceprint recognition module, configured to perform voiceprint recognition on each second segmentation-granularity voice frame to obtain a second voiceprint recognition result;
and a voiceprint segmentation module, configured to perform voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result, to obtain a target voiceprint segmentation result.
The present application further provides a voiceprint segmentation device. The device is a physical device comprising a memory, a processor, and a program for the voiceprint segmentation method stored on the memory and executable on the processor; when executed by the processor, the program implements the steps of the voiceprint segmentation method described above.
The present application also provides a readable storage medium having stored thereon a program for implementing a voiceprint segmentation method, which when executed by a processor, implements the steps of the voiceprint segmentation method as described above.
The application provides a voiceprint segmentation method, apparatus, device, and readable storage medium. Compared with the prior-art technique of dividing speech into fixed-size voice frames and then performing voiceprint recognition on each fixed-size frame, the present application first performs coarse-grained frame division on the acquired voice to be segmented, obtaining first segmentation-granularity voice frames with a larger frame size. Here it should be noted that if a voice frame is too small, it contains too little voice feature information and voiceprint recognition accuracy drops. Voiceprint recognition is therefore performed on each first segmentation-granularity voice frame to obtain the first voiceprint recognition result, so that recognition rests on frames containing sufficient voice feature information, which improves recognition accuracy in the non-boundary regions of the voice to be segmented. Next, based on the first voiceprint recognition result, fine-grained frame division is applied to the boundary region of each first segmentation-granularity voice frame, obtaining second segmentation-granularity voice frames with a smaller frame size; that is, the segmentation granularity is refined exactly where it matters. Conversely, if a voice frame is too large, it covers voice feature information from multiple speakers, so that the feature information of several speakers is mixed within the same frame; this confused portion is usually the boundary region where two speakers' voices join, and it lowers recognition accuracy. Performing voiceprint recognition on each second segmentation-granularity voice frame therefore yields a second voiceprint recognition result that accurately distinguishes the confusable portion of each boundary region. Finally, voiceprint segmentation is performed on the voice to be segmented based on both recognition results: the first voiceprint recognition result is used in the non-boundary regions and the second voiceprint recognition result in the boundary regions, producing the target voiceprint segmentation result. This overcomes the technical defect in the prior art whereby frames that are too large or too small reduce voiceprint recognition accuracy and, in turn, voiceprint segmentation accuracy; the accuracy of voiceprint segmentation is thus improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below; it is obvious that those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a first embodiment of the voiceprint segmentation method of the present application;
FIG. 2 is a flowchart illustrating a voiceprint segmentation method according to a second embodiment of the present application;
FIG. 3 is a schematic view of an overall process flow of voiceprint segmentation performed in an embodiment of the voiceprint segmentation method of the present application;
FIG. 4 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present application.
The objectives, features, and advantages of the present application will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In a first embodiment of the voiceprint segmentation method of the present application, referring to fig. 1, the voiceprint segmentation method includes:
step S10, obtaining a voice to be segmented, and performing coarse-grained frame division on the voice to be segmented to obtain each first segmentation-scale sound frame corresponding to the voice to be segmented;
in this embodiment, it should be noted that the speech to be segmented is speech collected in a multi-person conversation scene, the speech to be segmented includes speech uttered by multiple speakers, and the purpose of voiceprint recognition of the speech to be segmented is to recognize the correspondence between the speech to be segmented and each speaker, that is, the purpose of voiceprint recognition of the voice to be segmented is to identify which speaker each segment of voice in the voice to be segmented belongs to respectively, the coarse-grained frame is divided into voiceprint segmentations with segmentation granularity larger than a first preset segmentation granularity threshold, that is, the first cut-size sound frame obtained by the coarse-granularity frame division is larger, so that enough sound characteristic information is contained in the first cut-size sound frame, therefore, the accuracy rate of the speaker to which the first cut-sound-granularity sound frame belongs can be identified based on the sound characteristic information in the first cut-sound-granularity sound frame, and is greater than or equal to the preset accuracy rate threshold value.
The voice to be segmented is obtained, and coarse-grained frame division is performed on it to obtain each corresponding first segmentation-granularity voice frame. Specifically, the voice to be segmented is obtained and divided into equidistant frames based on a preset first segmentation granularity, where the preset first segmentation granularity is a preset frame size, so that each first segmentation-granularity voice frame is a voice frame of that preset size.
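As an illustrative sketch only (the patent does not fix concrete values), equidistant coarse-grained framing of a waveform might look like the following Python, assuming 16 kHz mono audio and a hypothetical 2-second first segmentation granularity:

```python
import numpy as np

def split_into_frames(samples: np.ndarray, frame_size: int) -> list[np.ndarray]:
    """Split a 1-D waveform into consecutive, equally sized frames;
    the final, possibly shorter remainder is kept as its own frame."""
    return [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]

SAMPLE_RATE = 16000                   # assumed sample rate
FIRST_GRANULARITY = 2 * SAMPLE_RATE   # hypothetical preset first segmentation granularity
# first_frames = split_into_frames(speech_to_segment, FIRST_GRANULARITY)
```

The same helper is reused later with a smaller frame size for the fine-grained division of step S30.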
Step S20, performing voiceprint recognition on each first segmentation-granularity voice frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented;
In this embodiment, voiceprint recognition is performed on each first segmentation-granularity voice frame to obtain the first voiceprint recognition result. Specifically, based on the voice feature information in each first segmentation-granularity voice frame, the voice attribution of each frame is identified, and the set of attribution results is taken as the first voiceprint recognition result. Here, the voice feature information includes, for example, a spectrogram of the voice, and a voice attribution is the speaker who produced the voice.
The step of performing voiceprint recognition on each first segmentation-granularity voice frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented comprises:
Step S21, scoring each first segmentation-granularity voice frame to obtain a target voiceprint attribution recognition score corresponding to each first segmentation-granularity voice frame;
in this embodiment, it should be noted that the voiceprint segmentation device includes a preset voiceprint recognition model, and the preset voiceprint recognition model is a preset machine learning model and is used for voiceprint recognition.
Each first segmentation-granularity voice frame is scored to obtain its target voiceprint attribution recognition score. Specifically, the following steps are performed for each first segmentation-granularity voice frame:
The frame is input into a preset voiceprint recognition model, feature extraction is performed on the frame's voice-frame matrix representation to obtain a feature extraction result, and the frame is scored based on that result to obtain a scoring vector. The scoring vector contains at least one voiceprint recognition score, where a voiceprint recognition score is a probability estimate that the frame belongs to a given speaker, and the voice-frame matrix representation is a matrix form of the frame used to represent its voice feature information. The largest voiceprint recognition score in the scoring vector is then selected as the target voiceprint attribution recognition score.
the step of scoring each of the first cut-level sound frames to obtain a target voiceprint belonging identification score corresponding to each of the first cut-level sound frames includes:
step S211, based on preset user voiceprint information, respectively performing similarity scoring on each first cut-out-degree-of-granularity voice frame to obtain voiceprint similarity scoring information respectively corresponding to each first cut-out-degree-of-granularity voice frame;
in this embodiment, it should be noted that the preset user voiceprint information is voice feature information obtained by extracting pre-collected speaker features, and the preset user voiceprint information at least includes a preset voice feature expression vector of a speaker, where the preset voice feature expression vector is a preset feature extraction vector used for expressing the voice feature information of the speaker.
Based on the preset user voiceprint information, similarity scoring is performed on each first segmentation-granularity voice frame to obtain its voiceprint similarity score information. Specifically, the following steps are performed for each first segmentation-granularity voice frame:
The frame is input into the preset voiceprint recognition model, and feature extraction is performed on its voice-frame matrix representation to obtain a feature extraction vector. The similarity between this feature extraction vector and each preset voice feature representation vector is then computed, yielding one similarity score per enrolled speaker, where a similarity score is an evaluation of how similar the feature extraction vector is to a preset voice feature representation vector. The set of similarity scores is taken as the voiceprint similarity score information.
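A minimal sketch of this similarity scoring, under the assumptions that speaker voiceprints are fixed-length embedding vectors and that cosine similarity is the similarity measure (the patent does not prescribe a specific measure; all names here are hypothetical):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_frame(frame_embedding: np.ndarray,
                enrolled_voiceprints: dict[str, np.ndarray]) -> dict[str, float]:
    """One similarity score per enrolled speaker; frame_embedding is the
    feature extraction vector produced by the voiceprint recognition model."""
    return {speaker: cosine_similarity(frame_embedding, voiceprint)
            for speaker, voiceprint in enrolled_voiceprints.items()}
```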
Step S212, generating each target voiceprint attribution recognition score based on each piece of voiceprint similarity score information.
In this embodiment, each target voiceprint attribution recognition score is generated from the corresponding voiceprint similarity score information. Specifically, for each piece of voiceprint similarity score information, the maximum similarity score among its similarity scores is selected as the target voiceprint attribution recognition score.
Step S22, generating the first voiceprint recognition result based on each target voiceprint attribution recognition score.
In this embodiment, the first voiceprint recognition result is generated from the target voiceprint attribution recognition scores. Specifically, the target speaker corresponding to each target voiceprint attribution recognition score is determined, and each target speaker is taken as the voiceprint recognition attribution of the first segmentation-granularity voice frame that produced that score, with voiceprint recognition attributions corresponding one-to-one to target speakers. The one-to-one correspondence between voiceprint recognition attributions and first segmentation-granularity voice frames is then taken as the first voiceprint recognition result.
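Continuing the sketch above (reusing the hypothetical score_frame helper), the first voiceprint recognition result reduces to a per-frame speaker label chosen by maximum score:

```python
def first_voiceprint_recognition(frame_embeddings, enrolled_voiceprints):
    """One speaker attribution per first segmentation-granularity frame,
    taken as the speaker with the maximum similarity score."""
    attributions = []
    for embedding in frame_embeddings:
        scores = score_frame(embedding, enrolled_voiceprints)
        attributions.append(max(scores, key=scores.get))
    return attributions
```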
Step S30, performing fine-grained frame division on the boundary region of each first segmentation-granularity voice frame based on the first voiceprint recognition result, to obtain each second segmentation-granularity voice frame;
In this embodiment, fine-grained frame division is performed on the boundary regions of the first segmentation-granularity voice frames based on the first voiceprint recognition result. Specifically, using the one-to-one correspondence between voiceprint recognition attributions and first segmentation-granularity voice frames, every pair of adjacent first segmentation-granularity voice frames attributed to two different target speakers is taken together as a boundary region. A boundary region contains at least one target boundary voice frame, where a target boundary voice frame is the combination of two adjacent first segmentation-granularity voice frames attributed to two different target speakers. Each target boundary voice frame is then divided into equidistant frames based on a preset second segmentation granularity, yielding the second segmentation-granularity voice frames corresponding to each target boundary voice frame. The preset second segmentation granularity is smaller than the preset first segmentation granularity, and fine-grained frame division is frame division with a segmentation granularity smaller than a second preset segmentation-granularity threshold, which is itself smaller than or equal to the first preset segmentation-granularity threshold. In other words, the second segmentation-granularity voice frames obtained by fine-grained division are small, so the divided frames are unlikely to be confused; that is, it should rarely happen that the same second segmentation-granularity voice frame plausibly belongs to two target speakers.
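As a sketch, locating boundary regions from the per-frame attributions amounts to finding adjacent frames with differing labels (hypothetical helper, continuing the earlier sketches):

```python
def find_boundary_pairs(attributions: list[str]) -> list[tuple[int, int]]:
    """Index pairs (i, i + 1) of adjacent first segmentation-granularity
    frames attributed to different target speakers; each such pair forms
    one target boundary voice frame."""
    return [(i, i + 1)
            for i in range(len(attributions) - 1)
            if attributions[i] != attributions[i + 1]]
```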
The boundary region includes at least one target boundary voice frame, and the step of performing fine-grained frame division on the boundary region of each first segmentation-granularity voice frame based on the first voiceprint recognition result to obtain each second segmentation-granularity voice frame includes:
step S31, determining the target boundary sound frame corresponding to each of the first cut-level sound frames based on the first voiceprint recognition result;
in this embodiment, the target boundary sound frame corresponding to each first cut-out-of-speech sound frame is determined based on the first voiceprint recognition result, and specifically, two adjacent first cut-out-of-speech sound frames belonging to different target speakers are selected as the target boundary sound frame in each first cut-out-of-speech sound frame based on a one-to-one correspondence between each voiceprint recognition attribution and each first cut-out-of-speech sound frame.
Wherein the determining of the target boundary voice frame based on the first voiceprint recognition result includes:
Step S311, determining a first attribution part and a second attribution part among the first segmentation-granularity voice frames based on the first voiceprint recognition result;
In this embodiment, it should be noted that the voice to be segmented is a conversation between two target speakers: the first half of the voice belongs to a first target speaker and the second half belongs to a second target speaker.
The first attribution part and the second attribution part are determined from the first voiceprint recognition result. Specifically, based on the one-to-one correspondence between voiceprint recognition attributions and first segmentation-granularity voice frames, the frames are divided into a first attribution part and a second attribution part, where each first segmentation-granularity voice frame in the first attribution part belongs to the first target speaker and each first segmentation-granularity voice frame in the second attribution part belongs to the second target speaker.
Step S312, acquiring a first boundary-region voice frame in the first attribution part and a second boundary-region voice frame in the second attribution part;
In this embodiment, the target separation point between the first attribution part and the second attribution part is determined; the first segmentation-granularity voice frame immediately before the separation point is taken as the first boundary-region voice frame of the first attribution part, and the first segmentation-granularity voice frame immediately after the separation point is taken as the second boundary-region voice frame of the second attribution part.
Step S313, combining the first boundary-region voice frame and the second boundary-region voice frame to obtain the target boundary voice frame.
In this embodiment, the first boundary-region voice frame and the second boundary-region voice frame are joined into a single connected frame spanning both, and this connected frame is taken as the target boundary voice frame.
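For waveforms held as numpy arrays, steps S312 and S313 together reduce to locating the attribution change point and concatenating the frames on either side of it. A sketch under that assumption (two speakers, a single change point, hypothetical names):

```python
import numpy as np

def target_boundary_frame(frames: list[np.ndarray],
                          attributions: list[str]) -> np.ndarray:
    """Join the frame just before the target separation point (first
    boundary-region frame) with the frame just after it (second
    boundary-region frame) into one target boundary frame."""
    sep = next(i for i in range(1, len(attributions))
               if attributions[i] != attributions[i - 1])
    return np.concatenate([frames[sep - 1], frames[sep]])
```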
Step S32, performing low-granularity segmentation on the target boundary voice frame to obtain each second segmentation-granularity voice frame.
In this embodiment, the target boundary voice frame is divided equidistantly based on a preset second segmentation granularity, cutting it into voice frames whose frame size equals the preset second segmentation granularity; these frames are taken as the second segmentation-granularity voice frames. The preset second segmentation granularity is smaller than the preset first segmentation granularity.
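Reusing the split_into_frames sketch from step S10 with a smaller, likewise assumed frame size:

```python
SECOND_GRANULARITY = SAMPLE_RATE // 4   # hypothetical 0.25 s; must be < FIRST_GRANULARITY
# second_frames = split_into_frames(boundary_frame, SECOND_GRANULARITY)
```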
Step S40, performing voiceprint recognition on each second segmentation-granularity voice frame to obtain a second voiceprint recognition result;
In this embodiment, voiceprint recognition is performed on each second segmentation-granularity voice frame. Specifically, based on the voice feature information of each second segmentation-granularity voice frame, the second voice attribution of each frame is identified, that is, whether the frame belongs to the first target speaker or the second target speaker, and the set of second voice attributions is taken as the second voiceprint recognition result.
In an implementation manner, step S40 includes:
Each second segmentation-granularity voice frame is scored based on a preset voiceprint recognition model to obtain a first-attribution voiceprint recognition score and a second-attribution voiceprint recognition score for that frame. Specifically, the following steps are performed for each second segmentation-granularity voice frame:
The frame is input into the preset voiceprint recognition model, and feature extraction is performed on it to obtain a second feature extraction vector. The bit similarity between this vector and the preset voice feature representation vector of the first target speaker is computed to obtain a first bit-similarity evaluation value, which is taken as the first-attribution voiceprint recognition score. Here, bit similarity is the ratio of positions at which two vectors hold the same value; for example, for vector A = (a, b, c, d) and vector B = (a, b, c, e), three positions agree, so the bit similarity is 75%. The first-attribution voiceprint recognition score is a probability score used to evaluate the probability that the frame belongs to the first target speaker, and the second-attribution voiceprint recognition score is the corresponding probability score that the frame belongs to the second target speaker. The bit similarity between the second feature extraction vector and the preset voice feature representation vector of the second target speaker is likewise computed to obtain a second bit-similarity evaluation value, which is taken as the second-attribution voiceprint recognition score.
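The bit similarity described here (the fraction of positions at which two equal-length vectors agree) can be sketched directly:

```python
def bit_similarity(a: list, b: list) -> float:
    """Ratio of identical positions between two equal-length vectors,
    e.g. (a, b, c, d) vs (a, b, c, e) -> 3/4 = 75%."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)
```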
Step S50, performing voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result, to obtain a target voiceprint segmentation result.
In this embodiment, the voice to be segmented is voiceprint-segmented based on both recognition results. Specifically, the non-boundary region of the voice is segmented based on the first voiceprint recognition result to obtain a first voiceprint segmentation result; the boundary region is segmented based on the second voiceprint recognition result to obtain a second voiceprint segmentation result; and the two are fused into the voiceprint segmentation result of the voice to be segmented, that is, the target voiceprint segmentation result. For example, suppose the voice to be segmented consists of first segmentation-granularity voice frames A, B, and C, where frames A and C lie in non-boundary regions and frame B is the boundary region, and frame B is subdivided into second segmentation-granularity voice frames E and F. From the first voiceprint recognition result, frame A belongs to speaker X and frame C belongs to speaker Y, so when segmenting the non-boundary regions, frame A is placed in the first speech segment (speaker X) and frame C in the second speech segment (speaker Y). From the second voiceprint recognition result, frame E belongs to speaker X and frame F to speaker Y, so frame E is placed in the first speech segment and frame F in the second. The first speech segment of speaker X thus comprises first segmentation-granularity voice frame A and second segmentation-granularity voice frame E, and the second speech segment of speaker Y comprises first segmentation-granularity voice frame C and second segmentation-granularity voice frame F.
It should be noted that, because the non-boundary region of the voice to be segmented is divided at a high segmentation granularity and each first segmentation-granularity voice frame in a non-boundary region corresponds uniquely to one target speaker, using the first voiceprint recognition result for voiceprint segmentation of the non-boundary region avoids the situation in which an oversized frame covers the voice feature information of multiple speakers and mixes their features within one frame; the frames there are large, so voiceprint recognition accuracy in the non-boundary region is extremely high, and so is segmentation accuracy. The boundary region, by contrast, is divided at a low segmentation granularity, refining the granularity exactly where confusion occurs, so that voiceprint recognition can be performed on smaller frames to obtain the second voiceprint recognition result. The boundary region, which is prone to confusion of voice feature information, therefore uses the second voiceprint recognition result obtained from smaller frames, achieving accurate recognition there and avoiding the drop in accuracy caused by mixing multiple speakers' voice feature information within one frame. In short, voiceprint recognition uses frames as large as possible in non-boundary regions, where confusion is unlikely, to guarantee accuracy there, and smaller frames in boundary regions, where confusion is likely, to guarantee accuracy there; overall voiceprint recognition accuracy improves, and with it voiceprint segmentation accuracy.
In summary, compared with the prior-art technique of dividing speech into fixed-size voice frames and performing voiceprint recognition on each fixed-size frame, this embodiment performs coarse-grained frame division to obtain large first segmentation-granularity voice frames, so that recognition in non-boundary regions rests on frames with sufficient voice feature information, and then performs fine-grained frame division on the boundary regions to obtain small second segmentation-granularity voice frames, so that the confusable portions where two speakers' voices join can be distinguished accurately. Voiceprint segmentation based on the first voiceprint recognition result in the non-boundary regions and the second voiceprint recognition result in the boundary regions thus overcomes the technical defect that frames that are too large or too small reduce voiceprint recognition accuracy, and improves the accuracy of voiceprint segmentation.
Further, referring to fig. 2, based on the first embodiment of the present application, in another embodiment of the present application, the second voiceprint recognition result includes at least a first-attribution voiceprint recognition score and a second-attribution voiceprint recognition score for each second segmentation-granularity voice frame,
and the step of performing voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result to obtain a target voiceprint segmentation result comprises:
step S41, determining a target segmentation point corresponding to each second segmentation-granularity sound frame based on each first home voiceprint recognition score and each second home voiceprint recognition score;
in this embodiment, a target segmentation point corresponding to each second-granularity segmented sound frame is determined based on each first home voiceprint recognition score and each second home voiceprint recognition score, and specifically, a maximum voiceprint score sum corresponding to each second-granularity segmented sound frame is calculated based on each first home voiceprint recognition score and each second home voiceprint recognition score, where the maximum voiceprint score sum is a maximum sum of target voiceprint scores corresponding to each second-granularity segmented sound frame, where the target score is one of a first home voiceprint recognition score and a corresponding first home voiceprint recognition score corresponding to the second-granularity segmented sound frame, and then a voiceprint score sequence corresponding to each second-granularity segmented sound frame is determined based on the maximum voiceprint score sum, where the voiceprint score sequence is a sequence consisting of the maximum voiceprint score and each voiceprint score, the sequence of the voiceprint scoring sequence is consistent with the time sequence of each second segmentation-granularity voice frame, and then a voiceprint scoring mutation point is inquired in the voiceprint scoring sequence, wherein the voiceprint scoring mutation point is a boundary point of the voiceprint scoring sequence, which is mutated from a first attribution voiceprint recognition score to a second attribution voiceprint recognition score, or is a boundary point of the voiceprint scoring sequence, which is mutated from a second attribution voiceprint recognition score to a first attribution voiceprint recognition score, and then a corresponding voiceprint frame boundary point of the voiceprint scoring mutation point in a sequence composed of each second segmentation-granularity voice frame is used as the target segmentation point, wherein the voiceprint boundary point is a boundary point of two adjacent second segmentation-granularity voice frames.
The step of determining the target segmentation point based on each first-attribution voiceprint recognition score and each second-attribution voiceprint recognition score includes:
Step S411, determining a target voiceprint recognition score sum corresponding to the second segmentation-granularity voice frames based on each first-attribution voiceprint recognition score and each second-attribution voiceprint recognition score;
In this embodiment, for each second segmentation-granularity voice frame, one of its first-attribution and second-attribution voiceprint recognition scores is selected as that frame's to-be-summed voiceprint score; the sum of the selected scores over all frames is computed as a voiceprint score sum; this is repeated until all possible voiceprint score sums have been computed; and the maximum voiceprint score sum is selected as the target voiceprint recognition score sum.
Step S412, determining the voiceprint score mutation point corresponding to the second segmentation-granularity voice frames based on the target voiceprint recognition score sum;
In this embodiment, based on the target voiceprint recognition score sum, the selected scores that make up the sum are ordered according to the time order of the second segmentation-granularity voice frames to obtain a voiceprint score sequence, and the voiceprint score mutation point is located in this sequence: the boundary point at which the sequence switches from first-attribution voiceprint recognition scores to second-attribution voiceprint recognition scores, or vice versa.
Step S413, generating the target segmentation point based on the voiceprint score mutation point.
In this embodiment, the two abruptly differing voiceprint scores immediately before and after the mutation point are determined, and the boundary point between the two second segmentation-granularity voice frames corresponding to those scores is taken as the target segmentation point.
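Since each second segmentation-granularity frame contributes either its first-attribution or its second-attribution score, and the chosen attribution switches exactly once at the mutation point (in the two-speaker case), maximizing the score sum reduces to a search over candidate split indices. A sketch with hypothetical names:

```python
def find_target_split(first_scores: list[float],
                      second_scores: list[float]) -> int:
    """Number k of leading frames attributed to the first target speaker
    that maximizes the voiceprint score sum when frames [0, k) take the
    first speaker's scores and frames [k, n) take the second speaker's;
    the target segmentation point lies between frames k - 1 and k."""
    n = len(first_scores)
    best_k, best_sum = 0, float("-inf")
    for k in range(n + 1):
        total = sum(first_scores[:k]) + sum(second_scores[k:])
        if total > best_sum:
            best_k, best_sum = k, total
    return best_k
```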
Step S43, segmenting the voice to be segmented based on the target segmentation point and the first voiceprint recognition result, to obtain the target voiceprint segmentation result.
In this embodiment, the non-boundary region of the voice to be segmented is first segmented based on the first voiceprint recognition result, yielding a first non-boundary-region speech segment belonging to the first target speaker and a second non-boundary-region speech segment belonging to the second target speaker. The second segmentation-granularity voice frames, that is, the boundary region of the voice, are then segmented at the target segmentation point, yielding a first boundary-region speech segment belonging to the first target speaker and a second boundary-region speech segment belonging to the second target speaker. The first non-boundary-region segment and the first boundary-region segment are fused into a first segmented speech section belonging to the first target speaker, the second non-boundary-region segment and the second boundary-region segment are fused into a second segmented speech section belonging to the second target speaker, and the two segmented speech sections are taken as the target voiceprint segmentation result.
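Putting the earlier sketches together for the two-speaker case (all helper names hypothetical; assumes non-empty parts on both sides of the boundary):

```python
import numpy as np

def voiceprint_segment(frames, attributions, second_frames, split_k):
    """Fuse the non-boundary attributions with the boundary split point
    into one speech section per target speaker. frames/attributions are
    the coarse frames and their speaker labels; second_frames are the
    fine frames of the target boundary frame; split_k comes from
    find_target_split."""
    sep = next(i for i in range(1, len(attributions))
               if attributions[i] != attributions[i - 1])
    # frames[sep - 1] and frames[sep] formed the boundary frame and are
    # replaced here by the fine frames split at split_k.
    first_section = frames[:sep - 1] + second_frames[:split_k]
    second_section = second_frames[split_k:] + frames[sep + 1:]
    return np.concatenate(first_section), np.concatenate(second_section)
```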
In one implementable scheme, fig. 3 shows an overall flow diagram of voiceprint segmentation, in which speaker A is the first target speaker and speaker B is the second target speaker: dividing frames by a fixed size corresponds to step S10; performing voiceprint recognition on each frame to determine its speaker corresponds to step S20; further dividing the frames at the speaker transition boundary into small blocks corresponds to step S30; classifying all the small blocks as A or B corresponds to step S40; and finding the optimal division point and segmenting corresponds to step S50.
This embodiment provides a voiceprint segmentation method in which the target segmentation point corresponding to the second segmentation-granularity voice frames is determined from the first-attribution and second-attribution voiceprint recognition scores; the target segmentation point is the optimal segmentation point of the voice to be segmented, so the voiceprint segmentation effect is optimal. The voice is then segmented based on the target segmentation point and the first voiceprint recognition result to obtain the target voiceprint segmentation result: the target segmentation point segments the boundary region, and the first voiceprint recognition result segments the non-boundary region. Because the first segmentation-granularity frames are large, the first voiceprint recognition result is obtained from frames with sufficient voice feature information, so its accuracy is extremely high and the non-boundary region is segmented very accurately. The target segmentation point, in turn, is determined from the second voiceprint recognition result, obtained from the smaller second segmentation-granularity frames. It should be noted that if a frame is too large it covers the voice feature information of multiple speakers, mixing their features within one frame; the confused portion is usually the boundary region where two speakers' voices join, which lowers recognition accuracy. Because the second segmentation-granularity frames are small, the probability of such confusion in the boundary region is extremely small, so recognition accuracy there is extremely high and the boundary region is segmented very accurately. The accuracy of voiceprint segmentation is thus improved.
Referring to fig. 4, fig. 4 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present application.
As shown in fig. 4, the voiceprint segmentation apparatus may include: a processor 1001, such as a CPU, a memory 1005, and a communication bus 1002. The communication bus 1002 is used for realizing connection communication between the processor 1001 and the memory 1005. The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a memory device separate from the processor 1001 described above.
Optionally, the voiceprint segmentation device may further include a user interface, a network interface, a camera, RF (Radio Frequency) circuitry, sensors, audio circuitry, a WiFi module, and the like. The user interface may include a display screen (Display) and an input sub-module such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface).
Those skilled in the art will appreciate that the voiceprint segmentation device structure shown in fig. 4 does not constitute a limitation of the device, which may include more or fewer components than shown, combine some components, or arrange the components differently.
As shown in fig. 4, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, and a voiceprint recognition program. The operating system is a program that manages and controls the hardware and software resources of the voiceprint segmentation device, supporting the operation of the voiceprint recognition program, as well as other software and/or programs. The network communication module is used to enable communication between the various components within the memory 1005, as well as with other hardware and software in the voiceprint recognition system.
In the voiceprint segmentation apparatus shown in fig. 4, the processor 1001 is configured to execute a voiceprint recognition program stored in the memory 1005 to implement the steps of the voiceprint segmentation method described in any one of the above.
The specific implementation of the voiceprint segmentation apparatus of the present application is substantially the same as that of each embodiment of the voiceprint segmentation method described above, and is not described herein again.
The embodiment of the present application further provides a voiceprint segmentation apparatus, where the voiceprint segmentation apparatus is applied to a voiceprint segmentation device, the voiceprint segmentation apparatus includes:
a first frame division module, configured to acquire a voice to be segmented and perform coarse-grained frame division on it to obtain each first segmentation-granularity voice frame corresponding to the voice to be segmented;
a first voiceprint recognition module, configured to perform voiceprint recognition on each first segmentation-granularity voice frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented;
a second frame division module, configured to perform fine-grained frame division on the boundary region of each first segmentation-granularity voice frame based on the first voiceprint recognition result, to obtain each second segmentation-granularity voice frame;
a second voiceprint recognition module, configured to perform voiceprint recognition on each second segmentation-granularity voice frame to obtain a second voiceprint recognition result;
and a voiceprint segmentation module, configured to perform voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result, to obtain a target voiceprint segmentation result.
Optionally, the second frame division module includes:
a first determining unit, configured to determine the target boundary voice frame corresponding to each first segmentation-granularity voice frame based on the first voiceprint recognition result;
and a low-granularity segmentation unit, configured to perform low-granularity segmentation on the target boundary voice frame to obtain each second segmentation-granularity voice frame.
Optionally, the first determining unit includes:
a first determining subunit, configured to determine, in each first segmentation-granularity sound frame, a first attribution part and a second attribution part based on the first voiceprint recognition result;
an acquiring subunit, configured to acquire a first boundary region sound frame in the first attribution part and a second boundary region sound frame in the second attribution part;
and a combining subunit, configured to combine the first boundary region sound frame and the second boundary region sound frame to obtain the target boundary sound frame (see the sketch below).
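The following sketch illustrates one plausible combination step, assuming the boundary region sound frames are simply the trailing and leading margins of the two attribution parts; the 0.5 s margin and the function name are hypothetical.

```python
import numpy as np

def target_boundary_frame(first_part: np.ndarray, second_part: np.ndarray,
                          sr: int, margin_s: float = 0.5) -> np.ndarray:
    """Combine the tail of the first attribution part with the head of the
    second attribution part (margin_s = 0.5 s is an assumed width)."""
    n = int(sr * margin_s)
    first_boundary = first_part[-n:]   # first boundary region sound frame
    second_boundary = second_part[:n]  # second boundary region sound frame
    return np.concatenate([first_boundary, second_boundary])
```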
Optionally, the voiceprint segmentation module includes:
a second determining unit, configured to determine, based on each first home voiceprint recognition score and each second home voiceprint recognition score, a target segmentation point corresponding to each second segmentation-granularity sound frame;
and the segmentation unit is used for segmenting the voice to be segmented based on the target segmentation point and the first voiceprint recognition result to obtain the target voiceprint segmentation result.
Optionally, the second determining unit includes:
a second determining subunit, configured to determine, based on each first home voiceprint recognition score and each second home voiceprint recognition score, a target voiceprint recognition score sum corresponding to each second segmentation-granularity sound frame;
a third determining subunit, configured to determine, based on each target voiceprint recognition score sum, a voiceprint score mutation point corresponding to each second segmentation-granularity sound frame;
and a first generating subunit, configured to generate the target segmentation point based on the voiceprint score mutation point (one way to locate such a mutation point is sketched below).
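One plausible reading of these subunits is sketched below: the score sum per fine frame is taken as the sum of its two home voiceprint recognition scores, and the mutation point is the frame at which that sum changes most abruptly. Both the summing rule and the argmax-of-differences criterion are assumptions, not formulas given by the disclosure.

```python
import numpy as np

def score_mutation_point(first_scores: list[float],
                         second_scores: list[float]) -> int:
    """Return the index of the second segmentation-granularity frame whose
    recognition score sum jumps most sharply (the assumed mutation point)."""
    sums = np.asarray(first_scores) + np.asarray(second_scores)
    return int(np.argmax(np.abs(np.diff(sums)))) + 1
```

The returned frame index would then be mapped back to a time offset within the voice to be segmented to yield the target segmentation point.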
Optionally, the first voiceprint recognition module includes:
the scoring unit is used for scoring each first segmentation-granularity sound frame respectively to obtain a target voiceprint attribution recognition score corresponding to each first segmentation-granularity sound frame;
and the generating unit is used for generating the first voiceprint recognition result based on each target voiceprint attribution recognition score.
Optionally, the scoring unit includes:
the similarity scoring subunit is configured to perform similarity scoring on each first segmentation-granularity sound frame based on preset user voiceprint information, to obtain voiceprint similarity scoring information corresponding to each first segmentation-granularity sound frame;
and the second generating subunit is configured to generate each target voiceprint attribution recognition score based on each piece of voiceprint similarity scoring information (see the sketch below).
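As an illustration of such similarity scoring, the sketch below compares a frame embedding against preset user voiceprint embeddings using cosine similarity; the metric, the embedding source (e.g., an external x-vector extractor), and all names are assumptions rather than requirements of the disclosure.

```python
import numpy as np

def similarity_scores(frame_emb: np.ndarray,
                      user_voiceprints: dict[str, np.ndarray]) -> dict[str, float]:
    """Cosine-similarity score of one frame embedding against each preset
    user voiceprint embedding (cosine is an assumed choice of metric)."""
    out = {}
    for user, vp in user_voiceprints.items():
        denom = np.linalg.norm(frame_emb) * np.linalg.norm(vp) + 1e-9
        out[user] = float(np.dot(frame_emb, vp) / denom)
    return out
```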
The specific implementation of the voiceprint segmentation apparatus of the present application is substantially the same as that of each embodiment of the voiceprint segmentation method described above, and is not described herein again.
The present application provides a readable storage medium, and the readable storage medium stores one or more programs, which can be executed by one or more processors for implementing the steps of the voiceprint segmentation method described in any one of the above.
The specific implementation of the readable storage medium of the present application is substantially the same as that of the above-mentioned embodiments of the voiceprint segmentation method, and is not described herein again.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims (10)

1. A voiceprint segmentation method, characterized by comprising:
acquiring a voice to be segmented, and performing coarse-grained frame division on the voice to be segmented to obtain each first segmentation-granularity sound frame corresponding to the voice to be segmented;
performing voiceprint recognition on each first segmentation-granularity sound frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented;
performing fine-grained frame division on the boundary area of each first segmentation-granularity sound frame based on the first voiceprint recognition result to obtain each second segmentation-granularity sound frame;
performing voiceprint recognition on each second segmentation-granularity sound frame to obtain a second voiceprint recognition result;
and carrying out voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result to obtain a target voiceprint segmentation result.
2. The voiceprint segmentation method of claim 1, wherein the boundary area includes at least one target boundary sound frame,
the step of performing fine-grained frame division on the boundary area of each first segmentation-granularity sound frame based on the first voiceprint recognition result to obtain each second segmentation-granularity sound frame includes:
determining the target boundary sound frame corresponding to each first segmentation-granularity sound frame based on the first voiceprint recognition result;
and performing low-granularity segmentation on the target boundary sound frame to obtain each second segmentation-granularity sound frame.
3. The voiceprint segmentation method according to claim 2, wherein the step of determining the target boundary sound frame corresponding to each first segmentation-granularity sound frame based on the first voiceprint recognition result includes:
determining a first attribution part and a second attribution part in each first segmentation-granularity sound frame based on the first voiceprint recognition result;
acquiring a first boundary region sound frame in the first attribution part and a second boundary region sound frame in the second attribution part;
and combining the first boundary region sound frame and the second boundary region sound frame to obtain the target boundary sound frame.
4. The voiceprint segmentation method of claim 1, wherein the second voiceprint recognition result comprises at least a first home voiceprint recognition score and a second home voiceprint recognition score corresponding to each second segmentation-granularity sound frame,
the step of performing voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result to obtain a target voiceprint segmentation result comprises:
determining a target segmentation point corresponding to each second segmentation-granularity sound frame based on each first home voiceprint recognition score and each second home voiceprint recognition score;
and segmenting the voice to be segmented based on the target segmentation point and the first voiceprint recognition result to obtain the target voiceprint segmentation result.
5. The voiceprint segmentation method according to claim 4, wherein the step of determining the target segmentation point corresponding to each second segmentation-granularity sound frame based on each first home voiceprint recognition score and each second home voiceprint recognition score comprises:
determining a target voiceprint recognition score sum corresponding to each second segmentation-granularity sound frame based on each first home voiceprint recognition score and each second home voiceprint recognition score;
determining a voiceprint score mutation point corresponding to each second segmentation-granularity sound frame based on each target voiceprint recognition score sum;
and generating the target segmentation point based on the voiceprint score mutation point.
6. The voiceprint segmentation method according to claim 1, wherein the step of performing voiceprint recognition on each first segmentation-granularity sound frame to obtain the first voiceprint recognition result corresponding to the voice to be segmented comprises:
scoring each first segmentation-granularity sound frame respectively to obtain a target voiceprint attribution recognition score corresponding to each first segmentation-granularity sound frame;
and generating the first voiceprint recognition result based on each target voiceprint attribution recognition score.
7. The voiceprint segmentation method according to claim 6, wherein the step of scoring each first segmentation-granularity sound frame respectively to obtain the target voiceprint attribution recognition score corresponding to each first segmentation-granularity sound frame comprises:
performing similarity scoring on each first segmentation-granularity sound frame respectively based on preset user voiceprint information, to obtain voiceprint similarity scoring information corresponding to each first segmentation-granularity sound frame;
and generating each target voiceprint attribution recognition score based on each piece of voiceprint similarity scoring information.
8. A voiceprint segmentation apparatus, wherein the voiceprint segmentation apparatus comprises:
the first frame division module is used for acquiring a voice to be segmented, and performing coarse-grained frame division on the voice to be segmented to obtain each first segmentation-granularity sound frame corresponding to the voice to be segmented;
the first voiceprint recognition module is used for performing voiceprint recognition on each first segmentation-granularity sound frame to obtain a first voiceprint recognition result corresponding to the voice to be segmented;
the second frame division module is used for performing fine-grained frame division on the boundary area of each first segmentation-granularity sound frame based on the first voiceprint recognition result to obtain each second segmentation-granularity sound frame;
the second voiceprint recognition module is used for performing voiceprint recognition on each second segmentation-granularity sound frame to obtain a second voiceprint recognition result;
and the voiceprint segmentation module is used for performing voiceprint segmentation on the voice to be segmented based on the first voiceprint recognition result and the second voiceprint recognition result to obtain a target voiceprint segmentation result.
9. A voiceprint segmentation device, characterized in that the voiceprint segmentation device comprises: a memory, a processor, and a program that is stored on the memory and implements the voiceprint segmentation method, wherein:
the memory is used for storing the program implementing the voiceprint segmentation method;
and the processor is configured to execute the program implementing the voiceprint segmentation method, to implement the steps of the voiceprint segmentation method according to any one of claims 1 to 7.
10. A readable storage medium having stored thereon a program for implementing a voiceprint segmentation method, the program being executable by a processor for implementing the steps of the voiceprint segmentation method as claimed in any one of claims 1 to 7.
CN202011072850.4A 2020-10-09 2020-10-09 Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium Active CN112201256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011072850.4A CN112201256B (en) 2020-10-09 2020-10-09 Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112201256A true CN112201256A (en) 2021-01-08
CN112201256B CN112201256B (en) 2023-09-19

Family

ID=74012624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011072850.4A Active CN112201256B (en) 2020-10-09 2020-10-09 Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112201256B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150051912A1 (en) * 2013-08-15 2015-02-19 Chunghwa Telecom Co., Ltd. Method for Segmenting Videos and Audios into Clips Using Speaker Recognition
US20160111112A1 (en) * 2014-10-17 2016-04-21 Fujitsu Limited Speaker change detection device and speaker change detection method
US20170365259A1 (en) * 2015-02-05 2017-12-21 Beijing D-Ear Technologies Co., Ltd. Dynamic password voice based identity authentication system and method having self-learning function
CN106782507A (en) * 2016-12-19 2017-05-31 平安科技(深圳)有限公司 The method and device of voice segmentation
WO2018113243A1 (en) * 2016-12-19 2018-06-28 平安科技(深圳)有限公司 Speech segmentation method, device and apparatus, and computer storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MICHAEL A. CARLIN ET AL.: "Detection of Speaker Change Points in Conversational Speech", 2007 IEEE AEROSPACE CONFERENCE *
ZHOU XIAODONG ET AL.: "Research on Single-Channel Two-Speaker Speech Separation Based on Attention Mechanism", COMMUNICATIONS TECHNOLOGY *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488058A (en) * 2021-06-23 2021-10-08 武汉理工大学 Voiceprint recognition method based on short voice

Also Published As

Publication number Publication date
CN112201256B (en) 2023-09-19

Similar Documents

Publication Publication Date Title
US10810748B2 (en) Multiple targets—tracking method and apparatus, device and storage medium
CN112533051B (en) Barrage information display method, barrage information display device, computer equipment and storage medium
US20210224598A1 (en) Method for training deep learning model, electronic equipment, and storage medium
EP3792818A1 (en) Video processing method and device, and storage medium
CN110298415A (en) A kind of training method of semi-supervised learning, system and computer readable storage medium
US11449706B2 (en) Information processing method and information processing system
US10002296B2 (en) Video classification method and apparatus
EP3872652A2 (en) Method and apparatus for processing video, electronic device, medium and product
JP2011013732A (en) Information processing apparatus, information processing method, and program
CN109740752B (en) Deep model training method and device, electronic equipment and storage medium
CN109815938A (en) Multi-modal affective characteristics recognition methods based on multiclass kernel canonical correlation analysis
JP2015176175A (en) Information processing apparatus, information processing method and program
CN110232331B (en) Online face clustering method and system
CN111241745A (en) Stepwise model selection method, apparatus and readable storage medium
CN112926621B (en) Data labeling method, device, electronic equipment and storage medium
WO2022268182A1 (en) Method and device for detecting standardization of wearing mask
CN114549557A (en) Portrait segmentation network training method, device, equipment and medium
US20220358658A1 (en) Semi Supervised Training from Coarse Labels of Image Segmentation
CN112201256B (en) Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN110110143A (en) A kind of video classification methods and device
CN111385659B (en) Video recommendation method, device, equipment and storage medium
CN111210022A (en) Backward model selection method, device and readable storage medium
CN111985488B (en) Target detection segmentation method and system based on offline Gaussian model
CN115439700B (en) Image processing method and device and machine-readable storage medium
CN113516739A (en) Animation processing method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant