CN106971734B - Method and system for training and identifying model according to extraction frequency of model - Google Patents


Info

Publication number
CN106971734B
CN106971734B (application CN201610025278.3A)
Authority
CN
China
Prior art keywords: initial, model, signal stream, recognition, identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610025278.3A
Other languages
Chinese (zh)
Other versions
CN106971734A (en)
Inventor
祝铭明 (Zhu Mingming)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yutou Technology Hangzhou Co Ltd
Original Assignee
Yutou Technology Hangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yutou Technology Hangzhou Co Ltd filed Critical Yutou Technology Hangzhou Co Ltd
Priority to CN201610025278.3A
Publication of CN106971734A
Application granted
Publication of CN106971734B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a system for training a recognition model according to the extraction frequency of the model, belonging to the technical field of voice recognition. The server and the clients carry out data communication over remote connections; by comparing extraction frequencies at the client, initial recognition models that are not commonly used can be deleted from the client, and these models are then updated on the server using sentence training samples. This reduces the operation burden of the client and improves working efficiency, so the method can be applied to a common intelligent terminal while achieving both the practicability required of the recognition model and the accuracy required for voiceprint recognition.

Description

Method and system for training and identifying model according to extraction frequency of model
Technical Field
The invention relates to the technical field of voice recognition, in particular to a method and a system for training a recognition model according to the extraction frequency of the model.
Background
Voiceprint recognition is a recognition technology based on the human voice. Because the vocal organs that different people use during speech differ to some degree, the voiceprint maps of any two voices are different, so the voiceprint can serve as a biological feature characterizing individual differences: recognition models can be established to represent different individuals and then used to recognize them. At present, applying such recognition models involves a dilemma, mainly reflected in the choice of training-corpus length. Generally, the longer the voiceprint training corpus, the more accurate the established feature model and the higher the recognition accuracy, but such a model-building approach is not very practical; conversely, a shorter voiceprint training corpus ensures better practicability, but the recognition accuracy of the resulting model is relatively low. In practical applications, for example voiceprint recognition for voice operation in some intelligent devices, high recognition accuracy is required while the training corpus must not be too long, so that good practicability is preserved; this goal is difficult to achieve with the voiceprint recognition modeling schemes of the prior art.
Moreover, in the prior art a user must manually input training corpora for a certain time to help establish the recognition model, which gives the user a poor experience and is not very practical; the length of such training corpora is still limited, so a more accurate feature model cannot be generated and the recognition accuracy cannot be further improved; changes in speech rate and intonation, emotional fluctuation and the like also affect the accuracy of model building. In addition, a voiceprint recognition modeling system is usually an independent client whose stored sentence training samples are limited and whose training speed is slow. How to improve recognition-model accuracy, and thereby recognition accuracy, while preserving good practicability is therefore an urgent problem to be solved.
Disclosure of Invention
In view of the above problems in the prior art, a technical solution for a method and a system for training a recognition model according to the extraction frequency of the model is provided, which specifically includes:
a method for training a recognition model according to the extraction frequency of the model, which provides a plurality of clients and a server, wherein the server is remotely connected with the clients respectively, and the method comprises the following steps:
the client acquires an initial voice signal stream of a speaker;
the client acquires a voice signal stream in the initial voice signal stream according to a preset speaker segmentation algorithm and a speaker clustering algorithm;
the client judges whether the voice signal stream capable of being used as a recognition object exists in all the voice signal streams or not, and outputs the voice signal stream capable of being used as the recognition object as a recognition signal stream;
the client matches the identification signal stream with a plurality of pre-formed initial identification models to obtain the initial identification models which are successfully matched;
the recognition signal stream of the client serves as a sentence training sample of an additional recognition signal stream, and the extraction frequency of the initial recognition model is obtained;
the client judges whether the extraction frequency is greater than a preset extraction threshold value or not, and deletes the initial identification model which is less than or equal to the extraction threshold value from the client;
the client updates the initial recognition model which is larger than the extraction threshold value according to the sentence training sample;
and the server stores all the initial recognition models, updates the initial recognition models which are smaller than or equal to the extraction threshold value in each client according to the sentence training samples, and finally forms a plurality of recognition models, wherein each recognition model corresponds to one speaker.
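As a rough orientation only, the interplay of the above steps can be sketched in Python; the data shapes (models as dicts with a "samples" list and an "extractions" counter), the helper names (segment_and_cluster, matching_degree) and both threshold values are assumptions made for illustration, not structures the patent defines:

EXTRACTION_THRESHOLD = 3   # assumed preset extraction threshold
MATCH_THRESHOLD = 0.7      # assumed preset matching threshold

def segment_and_cluster(initial_stream):
    # Stand-in for the speaker segmentation and clustering algorithms;
    # here a "stream" is simply a (speaker_label, audio) pair.
    return initial_stream

def matching_degree(model, stream):
    # Stand-in for the voiceprint matching degree.
    return 1.0 if model["speaker"] == stream[0] else 0.0

def client_round(client_models, server_models, initial_stream):
    for stream in segment_and_cluster(initial_stream):
        # match the recognition signal stream against the initial models
        scored = [(matching_degree(m, stream), m) for m in client_models]
        score, best = max(scored, key=lambda p: p[0], default=(0.0, None))
        if best is None or score <= MATCH_THRESHOLD:
            continue
        # the matched stream becomes an additional sentence training sample
        best["samples"].append(stream)
        best["extractions"] += 1
    # prune rarely extracted models from the client; the server, which
    # stores every model, continues updating the pruned ones remotely
    for m in list(client_models):
        if m["extractions"] <= EXTRACTION_THRESHOLD:
            client_models.remove(m)
            server_models[m["speaker"]]["samples"].extend(m["samples"])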
Preferably, before the client obtains an initial speech signal stream of a speaker, the method further includes:
and the client establishes a plurality of initial recognition models according to a plurality of preset sentence training samples.
Preferably, the method for the client to respectively obtain the speech signal stream in the initial speech signal stream according to the speaker segmentation algorithm and the speaker clustering algorithm specifically includes:
segmenting the initial speech signal stream into a plurality of speech segments according to the speaker segmentation algorithm;
and clustering the voice segments according to the speaker clustering algorithm to generate the voice signal stream.
Preferably, the method for the client to match the identification signal stream with the plurality of initial identification models respectively to obtain the initial identification models successfully matched includes:
matching the identification signal stream with a plurality of initial identification models respectively to obtain the matching degree of each initial identification model and the identification signal stream;
and selecting the initial identification model corresponding to the highest matching degree in the multiple matching degrees which are greater than a preset matching threshold value as the initial identification model which is successfully matched.
Preferably, the method for the client to update the initial recognition model larger than the extraction threshold according to the sentence training sample specifically includes:
generating a modified recognition model according to the successfully matched initial recognition model and a preset sentence training sample, wherein the preset sentence training sample is the recognition signal stream for generating the initial recognition model;
updating the initial recognition model with the revised recognition model.
A system for training a recognition model based on the frequency of extraction of the model, comprising: the system comprises a server and a plurality of clients, wherein the server is remotely connected with the plurality of clients, and each client comprises an acquisition unit, a processing unit, a judgment unit, a matching unit, a comparison unit and a model updating unit;
the acquisition unit is used for acquiring an initial voice signal stream of a speaker and sending the initial voice signal stream to the processing unit connected with the acquisition unit;
the processing unit is used for receiving the initial voice signal stream sent by the obtaining unit, obtaining a voice signal stream in the initial voice signal stream according to a preset speaker segmentation algorithm and a speaker clustering algorithm, and sending the voice signal stream to the judging unit connected with the processing unit;
the judging unit is used for judging whether the voice signal stream capable of being used as a recognition object exists in all the voice signal streams sent by the processing unit and outputting the voice signal stream capable of being used as the recognition object to the matching unit connected with the judging unit as a recognition signal stream;
the matching unit is used for receiving the identification signal streams sent by the judging unit, matching each identification signal stream with a plurality of pre-formed initial identification models respectively to obtain the successfully matched initial identification models, taking each identification signal stream as an additional sentence training sample, obtaining the extraction frequency of the initial identification models, and sending the extraction frequency to the comparing unit connected with the matching unit;
the comparison unit is used for receiving the extraction frequency of the successfully matched initial identification model sent by the matching unit, comparing whether the extraction frequency is greater than a preset extraction threshold value, deleting the initial identification models whose extraction frequency is less than or equal to the extraction threshold value from the client, and sending the comparison result to the model updating unit connected with the comparison unit, and
the model updating unit is used for receiving the comparison result sent by the comparison unit and updating the initial recognition model larger than the extraction threshold value according to the sentence training sample;
the server provides all the initial recognition models, and is used for updating the initial recognition models smaller than or equal to the extraction threshold value in each client according to the sentence training samples to finally form a plurality of recognition models, wherein each recognition model corresponds to one speaker.
Preferably, the client further includes:
the system comprises a sample acquisition unit, a model establishing unit and a data processing unit, wherein the sample acquisition unit is used for acquiring a plurality of preset sentence training samples and sending the sentence training samples to the model establishing unit connected with the sample acquisition unit; and
the model establishing unit is used for receiving a plurality of preset sentence training samples sent by the sample obtaining unit and establishing a plurality of initial recognition models according to the preset sentence training samples.
Preferably, the processing unit specifically includes:
the segmentation module is used for segmenting the initial voice signal stream into a plurality of voice segments according to a preset speaker segmentation algorithm and sending all the voice segments to the clustering module connected with the segmentation module; and
the clustering module is used for receiving the voice segments sent by the segmentation module and clustering the voice segments according to a preset speaker clustering algorithm to generate the voice signal stream.
Preferably, the matching unit specifically includes:
the matching degree acquisition module is used for respectively matching the identification signal stream with a plurality of initial identification models to acquire the matching degree of each initial identification model and the identification signal stream, and sending all the matching degrees to the signal stream acquisition module connected with the matching degree acquisition module; and
the signal flow acquisition module is configured to receive all the matching degrees sent by the matching degree acquisition module, and select the initial identification model corresponding to the highest matching degree of the multiple matching degrees that is greater than a preset matching threshold as the initial identification model successfully matched.
Preferably, the model updating unit specifically includes:
the correction module is used for generating a corrected recognition model according to the initial recognition model successfully matched and a preset sentence training sample and sending the corrected recognition model to the updating module connected with the correction module; and
and the updating module is used for receiving the corrected identification model sent by the correcting module and updating the initial identification model by using the corrected identification model.
The method and system for training a recognition model according to the extraction frequency of the model have the following beneficial effects:
1) In the method for training the recognition model according to the extraction frequency of the model, data communication is carried out through a remote connection between the server and the client; by comparing the extraction frequencies of the initial recognition models in the client, initial recognition models that are not commonly used can be deleted from the client and updated on the server using the sentence training samples, which reduces the operation burden of the client and improves working efficiency, so that the method can be applied in a general intelligent terminal while achieving both the practicability required of the recognition model and the accuracy required for voiceprint recognition.
2) The system for training the recognition model according to the extraction frequency of the model can support the realization of the method for training the recognition model according to the extraction frequency of the model.
Drawings
FIG. 1 is a flowchart illustrating a method for training a recognition model according to an extraction frequency of the model according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for training a recognition model according to the extraction frequency of the model according to a second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a system for training a recognition model according to the extraction frequency of the model in the third embodiment of the present invention;
FIG. 4 is a schematic diagram of a processing unit in a system for training a recognition model according to the extraction frequency of the model according to a fourth embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a matching unit in a system for training a recognition model according to the extraction frequency of the model according to a fifth embodiment of the present invention;
fig. 6 is a schematic structural diagram of a model updating unit in a system that can train a recognition model according to the extraction frequency of the model in the sixth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
According to the embodiments of the invention, an initial voice signal stream of at least one speaker is obtained; the voice signal stream of each speaker within the initial voice signal stream is obtained according to preset speaker segmentation and clustering algorithms; the voice signal stream that matches an initial recognition model is then obtained; and the matched voice signal stream is used as an additional sentence training sample for the voice signal stream that generated the initial recognition model, so as to update the initial recognition model. This improves the accuracy of the recognition model, the user experience, and so on.
The following detailed description of specific implementations of the present invention is provided in conjunction with specific embodiments:
the first embodiment is as follows:
fig. 1 shows a flow of implementing the method for training a recognition model according to the extraction frequency of the model according to the first embodiment of the present invention, which provides a plurality of clients and a server, and the server is remotely connected to the plurality of clients respectively, as detailed below:
in step S1, the client obtains an initial speech signal stream of a speaker.
In this embodiment, the method for training the recognition model according to the extraction frequency of the model may be applied to an intelligent terminal located in a private space, such as an intelligent robot, so the initial voice signal stream may be generated when a user chats by voice or issues voice commands through the intelligent terminal, or may be acquired by recording or the like. The method may also be applied in a more open space, where the sources of the initial voice signal stream are not limited to one person, so the initial voice signal stream may contain the voices of several persons. In addition, a switch may be provided in the intelligent terminal for deciding whether to automatically start the voiceprint learning function during voice interaction, which the user sets as needed; alternatively, a voiceprint learning function may be provided in the intelligent terminal with which the user records the voice signal stream manually. The initial voice signal stream is typically an audio stream.
And step S2, the client acquires the voice signal stream in the initial voice signal stream according to a preset speaker segmentation algorithm and a speaker clustering algorithm.
Specifically, because the initial voice signal stream includes a voice signal stream of at least one speaker, the initial voice signal stream needs to be segmented into a plurality of voice segments according to a preset speaker segmentation algorithm, each voice segment of the plurality of voice segments only includes voice information of the same speaker, and then all voice segments only including the same speaker are clustered according to a preset speaker clustering algorithm, so as to finally generate a voice signal stream only including voice information of the same speaker.
In other words, in this embodiment, the obtained initial speech signal stream is first processed by a speaker segmentation algorithm to obtain a plurality of speech segments, and each speech segment only includes speech information related to the same speaker;
subsequently, speech segments of speech information associated with the same speaker are processed by a speaker clustering algorithm to obtain a stream of speech signals associated with each speaker, respectively.
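As a rough sketch only (the patent does not prescribe particular algorithms), segmentation followed by clustering might proceed as below; is_boundary and similarity are hypothetical callables standing in for real speaker change-point detection and speaker-similarity scoring:

def segment(initial_stream, is_boundary):
    # Cut the stream so that each segment holds speech of only one speaker.
    segments, current = [], []
    for frame in initial_stream:
        if current and is_boundary(current[-1], frame):
            segments.append(current)   # change point: close the segment
            current = []
        current.append(frame)
    if current:
        segments.append(current)
    return segments

def cluster(segments, similarity, threshold=0.8):
    # Greedily merge segments of the same speaker into one signal stream.
    streams = []                       # one list of segments per speaker
    for seg in segments:
        for stream in streams:
            if similarity(stream[0], seg) > threshold:
                stream.append(seg)     # same speaker: extend this stream
                break
        else:
            streams.append([seg])      # new speaker: open a new stream
    return streams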
In step S3, the client determines whether or not there is a speech signal stream that can be a recognition target among all speech signal streams, and outputs the speech signal stream that can be a recognition target as a recognition signal stream.
Whether each voice signal stream can be used as a recognition object is judged respectively, using one or a combination of the following:
1) Setting a standard sound intensity, and respectively judging whether the sound intensity corresponding to each voice signal stream is greater than the standard sound intensity: if so, the voice signal stream can be used as the recognition signal stream of the recognition object, otherwise, the voice signal stream is ignored.
2) Setting a standard audio time length, and respectively judging whether the continuous time length corresponding to each voice signal stream is greater than the standard audio time length: if so, the voice signal stream can be used as the recognition signal stream of the recognition object, otherwise, the voice signal stream is ignored.
3) Setting a standard frequency band, and respectively judging whether the receiving frequency corresponding to each voice signal stream is in the standard frequency band: if so, the voice signal stream can be used as the recognition signal stream of the recognition object, otherwise, the voice signal stream is ignored.
4) One or more speakers as trainers are set in advance through voiceprint matching, and the voice signal streams of the one or more speakers are determined according to a preset fuzzy voiceprint matching mode to serve as identification signal streams of identification objects.
The step S3 can screen the obtained multiple speech signal streams before updating the recognition model, and exclude some speech signal streams that are not originally required to be used as sentence training samples, thereby ensuring the accuracy of the source of the sentence training samples that can train the recognition model according to the extraction frequency of the model, and further ensuring the accuracy of voiceprint recognition according to the recognition model.
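A compact sketch of checks 1) to 3) follows; the three standards and the measurement callables (intensity, duration, dominant_freq) are placeholder assumptions, and check 4), the fuzzy voiceprint pre-matching, is omitted here:

STANDARD_INTENSITY_DB = 40.0      # assumed standard sound intensity
STANDARD_DURATION_S = 2.0         # assumed standard audio time length
STANDARD_BAND_HZ = (85.0, 255.0)  # assumed standard frequency band

def is_recognition_stream(stream, intensity, duration, dominant_freq):
    # Return True if the voice signal stream can serve as a recognition
    # object; the three callables measure the stream's properties.
    if intensity(stream) <= STANDARD_INTENSITY_DB:
        return False   # check 1): too quiet, ignore the stream
    if duration(stream) <= STANDARD_DURATION_S:
        return False   # check 2): too short, ignore the stream
    low, high = STANDARD_BAND_HZ
    if not (low <= dominant_freq(stream) <= high):
        return False   # check 3): outside the standard band, ignore
    return True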
In step S4, the client matches the identification signal stream with a plurality of pre-formed initial identification models to obtain successfully matched initial identification models.
The initial recognition model is a recognition model pre-established according to sentence training samples of a preset voice signal stream, namely, a plurality of sentence training samples related to the preset voice signal stream are provided in advance, and the initial recognition model is formed according to the sentence training samples. The initial recognition model is a feature model formed after a voiceprint registration process is completed for a certain person or a plurality of persons, and the registration process has no requirement on the length of a sentence training sample of a training corpus or a speech signal stream. At this time, the identification signal stream with successful matching can be selected according to the matching degree of the identification signal stream of each speaker and the initial identification model (described in detail below).
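A minimal sketch of this matching step, assuming a scoring callable matching_degree and a placeholder threshold value (neither is defined by the patent):

MATCH_THRESHOLD = 0.7  # assumed preset matching threshold

def best_match(recognition_stream, initial_models, matching_degree):
    # Among initial models whose matching degree with the recognition
    # stream exceeds the preset threshold, pick the highest-scoring one.
    scored = [
        (matching_degree(model, recognition_stream), model)
        for model in initial_models
    ]
    qualified = [(s, m) for s, m in scored if s > MATCH_THRESHOLD]
    if not qualified:
        return None    # no successful match; see the null-model case below
    return max(qualified, key=lambda pair: pair[0])[1]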
Step S5, the client takes the recognition signal stream as an additional sentence training sample, and obtains the extraction frequency of the successfully matched initial recognition model;
Step S6, the client judges whether the extraction frequency is greater than a preset extraction threshold, and deletes from the client the initial recognition models whose extraction frequency is less than or equal to the extraction threshold;
Step S7, the client updates, according to the sentence training samples, the initial recognition models whose extraction frequency is greater than the extraction threshold;
Step S8, the server stores all the initial recognition models and updates, according to the sentence training samples, the initial recognition models whose extraction frequency is less than or equal to the extraction threshold in each client, finally forming a plurality of recognition models, wherein each recognition model corresponds to one speaker.
Specifically, after the successfully matched recognition signal stream is obtained, a voiceprint registration algorithm interface is called according to the successfully matched recognition signal stream and a preset sentence training sample of the recognition signal stream, and a modified recognition model is generated. The preset sentence training sample is also the sentence training sample used for generating the initial recognition model. The corrected recognition model is a more accurate recognition model, and the initial recognition model is updated by using the corrected recognition model (that is, the corrected recognition model is stored as the initial recognition model to replace the previous initial recognition model), so that the purposes of model adaptation and intellectualization can be achieved.
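Under the assumption that register_voiceprint stands in for the voiceprint registration algorithm interface (whose real signature the patent does not give), the update just described can be sketched as:

def update_with_corrected_model(models, speaker, new_stream, register_voiceprint):
    # Build a corrected model from the preset sentence training sample plus
    # the newly matched recognition stream, then store it in place of the
    # previous initial model.
    old = models[speaker]
    corpus = old["samples"] + [new_stream]   # preset sample + matched stream
    corrected = register_voiceprint(corpus)  # generate the corrected model
    corrected["samples"] = corpus
    models[speaker] = corrected              # replaces the old initial model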
In the preferred embodiment of the present invention, when a speaker's recognition signal stream cannot be matched with any initial recognition model, a new recognition model can be created and recorded according to the user's presets. For example, for a first-time-use intelligent terminal, the initial recognition model is null, so no newly acquired recognition signal stream can be matched against it. In this case, according to the user's setting, the recognition signal stream of one of the speakers can be selected, the voiceprint registration algorithm interface called to create a new recognition model, and the new model saved as an initial recognition model.
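A matching fallback for this null case, again with register_voiceprint as a placeholder, might look like:

def enroll_if_unmatched(models, chosen_speaker, stream, register_voiceprint):
    # First use: the model store is empty (or nothing matched), so the
    # user-designated speaker's recognition stream is enrolled as a new
    # initial recognition model.
    if chosen_speaker in models:
        return models[chosen_speaker]
    new_model = register_voiceprint([stream])
    new_model["samples"] = [stream]
    models[chosen_speaker] = new_model       # saved as an initial model
    return new_model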
In the preferred embodiment of the invention, an initial voice signal stream of at least one speaker is obtained; through the above judgment, the recognition signal stream of each speaker in the initial voice signal stream is obtained according to the preset speaker segmentation and clustering algorithms; the recognition signal stream that matches an initial recognition model is further obtained; and the matched recognition signal stream is used as an additional sentence training sample for the recognition signal stream that generated the initial recognition model, so as to update that model. The recognition model is thus continuously corrected and updated, its accuracy is continuously improved, and the user experience is improved.
The method for training the recognition model according to the extraction frequency of the model carries out data communication through a remote connection between the server and the client; by comparing the extraction frequencies of the initial recognition models in the client, initial recognition models that are not commonly used can be deleted from the client and updated on the server using the sentence training samples, thereby reducing the operation burden of the client and improving working efficiency, while preserving in a general intelligent terminal both the practicability required for forming the recognition model and the accuracy required for voiceprint recognition.
Example two:
fig. 2 shows a flow of implementing the method for training the recognition model according to the extraction frequency of the model, which is provided by the second embodiment of the present invention, and is detailed as follows:
in step S21, the client establishes a plurality of initial recognition models according to a plurality of preset sentence training samples.
The initial recognition model is a recognition model established by calling a voiceprint registration algorithm interface with a preset sentence training sample of a voice signal stream; it is the recognition model formed after the voiceprint registration process is completed for one or more persons, and the registration process places no requirement on the length of the training corpus or of the sentence training sample of the voice signal stream. In addition, because the method provided by the embodiment of the present invention can continuously and dynamically correct the model, the initial recognition model may be a recognition model obtained by an existing method, or a recognition model already corrected by the method provided by the embodiment of the present invention.
In step S22, the client obtains an initial speech signal stream of a speaker.
In a specific embodiment, a user's speech rate, intonation, emotion and the like usually vary considerably while speaking or during multi-person conversation. Continuously collecting corpora during conversation can therefore largely eliminate the deviations that factors such as speech rate and emotion introduce into the recognition model, greatly reducing the influence of intonation, speech rate and emotion on the accuracy of the recognition model and, in turn, on the accuracy of voiceprint recognition.
In step S23, the client splits the initial speech signal stream into a plurality of speech segments according to a speaker splitting algorithm.
And step S24, the client end clusters the voice segments according to the speaker clustering algorithm to generate a voice signal stream.
Specifically, assume the current speakers are user A, user B and user C. After the users agree to be recorded, the recording module can be started to record the initial voice signal stream produced while the users interact with the intelligent terminal by voice. The intelligent terminal segments the initial voice signal stream into a plurality of voice segments based on a preset speaker segmentation algorithm, where each voice segment contains the voice information of only one speaker. For example, after segmentation the obtained segments might be voice segment A, voice segment B, voice segment A, voice segment C, voice segment A and voice segment C, where segments A, B and C are the different portions of speech from users A, B and C respectively. The voice segments of the same speaker are then clustered using a preset speaker clustering algorithm to generate voice signal stream A, voice signal stream B and voice signal stream C; for example, voice signal stream A contains all voice segments of user A. In this way the voice signal streams of different persons can be distinguished and the effective voice signal stream belonging to one person extracted. The speaker segmentation algorithm and the speaker clustering algorithm may each be any existing algorithm of their kind, and are not limited here.
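The A/B/C walkthrough above reduces to a small grouping exercise; in the toy snippet below the segment labels stand in for the speaker identity that a real clustering algorithm must infer acoustically:

segments = ["A", "B", "A", "C", "A", "C"]  # segment order after segmentation

streams = {}
for seg in segments:
    streams.setdefault(seg, []).append(seg)

print(streams)  # {'A': ['A', 'A', 'A'], 'B': ['B'], 'C': ['C', 'C']}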
After the above-described step S24 is executed, it is first determined whether each speech signal stream can be regarded as a recognition signal stream, and all the recognition signal streams are retained and output.
In step S25, the client matches the identification signal stream with a plurality of pre-formed initial identification models to obtain successfully matched initial identification models.
The step S25 specifically includes:
respectively acquiring the matching degree of each initial recognition model and the recognition signal stream according to the recognition signal stream and the plurality of initial recognition models;
and selecting the initial recognition model which is in accordance with the preset condition and is related to the matching degree as the initial recognition model which is successfully matched. The preset conditions include: 1) the relevant matching degree is greater than a preset matching threshold; 2) the associated degree of match has the highest value among all degrees of match.
Step S26, the client takes the recognition signal stream as an additional sentence training sample, and obtains the extraction frequency of the successfully matched initial recognition model;
Step S27, the client judges whether the extraction frequency is greater than a preset extraction threshold, and deletes from the client the initial recognition models whose extraction frequency is less than or equal to the extraction threshold;
and step S28, the client updates, according to the sentence training samples, the initial recognition models whose extraction frequency is greater than the extraction threshold.
The step S28 specifically includes:
generating a modified recognition model according to the successfully matched initial recognition model and a preset sentence training sample of the voice signal stream; the sentence training sample of the preset voice signal stream is a voice signal stream for generating an initial recognition model;
updating the initial recognition model with the revised recognition model.
Specifically, taking the recognition signal stream as an additional sentence training sample means that a voiceprint registration algorithm interface is called according to the successfully matched initial recognition model and the preset sentence training sample of the voice signal stream to generate a modified recognition model; the modified recognition model is a more accurate recognition model (as above), so the purposes of model adaptation and intellectualization are achieved.
Furthermore, the updated recognition model can be used as an initial recognition model, the steps are repeated, the recognition model is continuously corrected and updated, and the accuracy of the recognition model is continuously improved.
In a preferred embodiment of the present invention, there may be a plurality of initial recognition models, and the above steps may be performed for each initial recognition model, that is, different recognition signal streams are obtained through a speaker segmentation algorithm and a speaker clustering algorithm, and a best matching recognition signal stream is selected according to the matching degree to generate a modified recognition model associated with the initial recognition model, and the initial recognition model is updated. The initial recognition models correspond to different speakers respectively, that is, the recognition signal stream with the highest matching degree corresponding to the different initial recognition models can be originated from different speakers.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by hardware instructed by a program, and the program may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc.
Example three:
fig. 3 shows a structure of a system capable of training a recognition model according to a model extraction frequency according to a third embodiment of the present invention, and a terminal provided in the third embodiment of the present invention can be used to implement the methods implemented in the first to second embodiments of the present invention.
The system capable of training the recognition model according to the extraction frequency of the model may be an intelligent terminal, such as an intelligent robot, which is applied in a private space or a semi-open space and supports voice operation, in this embodiment, the system capable of training the recognition model according to the extraction frequency of the model is applied in the intelligent robot as an example, and fig. 3 is a block diagram illustrating a structure related to the system capable of training the recognition model according to the extraction frequency of the model according to the embodiment of the present invention.
As shown in fig. 3, the system for training the recognition model according to the extraction frequency of the model specifically includes: the server is remotely connected with the plurality of clients, and each client comprises:
the device comprises an acquisition unit 1, a processing unit 2 and a processing unit, wherein the acquisition unit 1 is used for acquiring an initial voice signal stream of a speaker and sending the initial voice signal stream to the processing unit 2 connected with the acquisition unit 1;
the processing unit 2 is used for receiving the initial voice signal stream sent by the obtaining unit 1, obtaining the voice signal stream in the initial voice signal stream according to a preset speaker segmentation algorithm and a speaker clustering algorithm, and sending the voice signal stream to a judging unit 3 connected with the processing unit 2;
the judging unit 3 is used for judging whether a voice signal stream capable of being used as a recognition object exists in all the voice signal streams sent by the processing unit 2, and outputting the voice signal stream capable of being used as the recognition object to the matching unit 4 connected with the judging unit 3 as a recognition signal stream;
the matching unit 4 is configured to receive the identification signal stream sent by the determining unit 3, match the identification signal stream with a plurality of initial identification models formed in advance to obtain the successfully matched initial identification model, take the identification signal stream as an additional sentence training sample, obtain the extraction frequency of the initial identification model, and send the extraction frequency to the comparing unit 5 connected to the matching unit 4;
the comparison unit 5 is used for receiving the extraction frequency of the successfully matched initial identification model sent by the matching unit 4, comparing whether the extraction frequency is greater than a preset extraction threshold value, deleting the initial identification models whose extraction frequency is less than or equal to the extraction threshold value from the client, and sending the comparison result to a model updating unit 8 connected with the comparison unit 5, and
the model updating unit 8 is used for receiving the comparison result sent by the comparison unit 5 and updating the initial recognition model larger than the extraction threshold value according to the sentence training sample;
the server provides all the initial recognition models, and the initial recognition models which are smaller than or equal to the extraction threshold value in each client are updated according to the sentence training samples, and finally a plurality of recognition models are formed, wherein each recognition model corresponds to one speaker.
In this embodiment, the client of the system for training the recognition model according to the extraction frequency of the model further includes:
the system comprises a sample acquisition unit 6, a model establishing unit 7 and a control unit, wherein the sample acquisition unit 6 is used for acquiring a plurality of preset sentence training samples and sending the sentence training samples to the model establishing unit 7 connected with the sample acquisition unit; and
the model establishing unit 7 is configured to receive a plurality of preset sentence training samples sent by the sample obtaining unit and establish a plurality of initial recognition models according to the preset sentence training samples.
Example four:
fig. 4 shows a structure of a system for training a recognition model according to an extraction frequency of the model according to a fourth embodiment of the present invention. As shown in fig. 4, the processing unit 2 in the system for training the recognition model according to the extraction frequency of the model specifically includes:
the segmentation module 21 is used for segmenting the initial voice signal stream into a plurality of voice segments according to a preset speaker segmentation algorithm, and sending all the voice segments to the clustering module 22 connected with the segmentation module 21; and
the clustering module 22 is configured to receive the speech segments sent by the segmentation module 21, and cluster the speech segments according to a preset speaker clustering algorithm to generate a speech signal stream.
Example five:
fig. 5 shows a structure of a system that can train a recognition model according to an extraction frequency of the model according to a fifth embodiment of the present invention. As shown in fig. 5, the matching unit 4 in the system for training the recognition model according to the extraction frequency of the model specifically includes:
a matching degree obtaining module 41, configured to match the identification signal stream with the plurality of initial identification models, respectively, obtain a matching degree between each initial identification model and the identification signal stream, and send all the matching degrees to a signal stream obtaining module 42 connected to the matching degree obtaining module; and
the signal flow obtaining module 42 is configured to receive all the matching degrees sent by the matching degree obtaining module 41, and select an initial identification model corresponding to the highest matching degree of the multiple matching degrees that are greater than a preset matching threshold as an initial identification model successfully matched.
Example six:
fig. 6 shows a structure of a system for training a recognition model according to an extraction frequency of the model according to a sixth embodiment of the present invention. As shown in fig. 6, the model updating unit 8 in the system for training the recognition model according to the extraction frequency of the model specifically includes:
the correction module 81 is used for generating a corrected recognition model according to the successfully matched initial recognition model and a preset sentence training sample and sending the corrected recognition model to the updating module 82 connected with the correction module; and
the updating module 82 is configured to receive the modified recognition model sent by the modifying module 81, and update the initial recognition model with the modified recognition model.
It should be noted that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two; to clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above in general functional terms. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (16)

1. A method for training a recognition model according to the extraction frequency of the model, wherein a plurality of clients and a server are provided, and the server is remotely connected with the plurality of clients respectively, the method comprising:
the client acquires an initial voice signal stream of a speaker;
the client acquires a voice signal stream in the initial voice signal stream according to a preset speaker segmentation algorithm and a speaker clustering algorithm;
the client judges whether the voice signal stream capable of being used as a recognition object exists in all the voice signal streams or not, and outputs the voice signal stream capable of being used as the recognition object as a recognition signal stream;
the client matches the identification signal stream with a plurality of pre-formed initial identification models to obtain the initial identification models which are successfully matched;
the client takes the recognition signal stream as an additional sentence training sample, and obtains the extraction frequency of the successfully matched initial recognition model;
the client judges whether the extraction frequency is greater than a preset extraction threshold value, and deletes from the client the initial recognition models whose extraction frequency is less than or equal to the extraction threshold value;
the client updates, according to the sentence training samples, the initial recognition models whose extraction frequency is greater than the extraction threshold value;
and the server stores all the initial recognition models, updates according to the sentence training samples the initial recognition models whose extraction frequency is less than or equal to the extraction threshold value in each client, and finally forms a plurality of recognition models, wherein each recognition model corresponds to one speaker.
2. The method of claim 1, wherein before the client obtains an initial speech signal stream of a speaker, the method further comprises:
and the client establishes a plurality of initial recognition models according to a plurality of preset sentence training samples.
3. The method as claimed in claim 1 or 2, wherein the method for the client to obtain the speech signal stream in the initial speech signal stream according to the speaker segmentation algorithm and the speaker clustering algorithm specifically comprises:
segmenting the initial speech signal stream into a plurality of speech segments according to the speaker segmentation algorithm;
and clustering the voice segments according to the speaker clustering algorithm to generate the voice signal stream.
4. The method according to claim 1 or 2, wherein the method for the client to match the recognition signal stream with the plurality of initial recognition models respectively to obtain the initial recognition models successfully matched comprises:
matching the identification signal stream with a plurality of initial identification models respectively to obtain the matching degree of each initial identification model and the identification signal stream;
and selecting the initial identification model corresponding to the highest matching degree in the multiple matching degrees which are greater than a preset matching threshold value as the initial identification model which is successfully matched.
5. The method as claimed in claim 3, wherein the step of the client matching the recognition signal stream with the plurality of initial recognition models respectively comprises:
matching the identification signal stream with a plurality of initial identification models respectively to obtain the matching degree of each initial identification model and the identification signal stream;
and selecting the initial identification model corresponding to the highest matching degree in the multiple matching degrees which are greater than a preset matching threshold value as the initial identification model which is successfully matched.
6. The method according to any one of claims 1, 2 and 5, wherein the method for updating, by the client, the initial recognition model larger than the extraction threshold according to the sentence training sample specifically comprises:
generating a modified recognition model according to the successfully matched initial recognition model and a preset sentence training sample, wherein the preset sentence training sample is the recognition signal stream for generating the initial recognition model;
updating the initial recognition model with the revised recognition model.
7. The method of claim 3, wherein the method for the client to update the initial recognition model larger than the extraction threshold according to the sentence training samples comprises:
generating a modified recognition model according to the successfully matched initial recognition model and a preset sentence training sample, wherein the preset sentence training sample is the recognition signal stream for generating the initial recognition model;
updating the initial recognition model with the revised recognition model.
8. The method of claim 4, wherein the method for the client to update the initial recognition model larger than the extraction threshold according to the sentence training samples comprises:
generating a modified recognition model according to the successfully matched initial recognition model and a preset sentence training sample, wherein the preset sentence training sample is the recognition signal stream for generating the initial recognition model;
updating the initial recognition model with the revised recognition model.
9. A system for training a recognition model according to the extraction frequency of the model, comprising a server and a plurality of clients, wherein the server is remotely connected with the plurality of clients, and each client comprises an acquisition unit, a processing unit, a judging unit, a matching unit, a comparison unit and a model updating unit;
the acquisition unit is used for acquiring an initial voice signal stream of a speaker and sending the initial voice signal stream to the processing unit connected with the acquisition unit;
the processing unit is used for receiving the initial voice signal stream sent by the acquisition unit, obtaining voice signal streams from the initial voice signal stream according to a preset speaker segmentation algorithm and a preset speaker clustering algorithm, and sending the voice signal streams to the judging unit connected with the processing unit;
the judging unit is used for judging whether any of the voice signal streams sent by the processing unit can serve as a recognition object, and for outputting such a voice signal stream, as a recognition signal stream, to the matching unit connected with the judging unit;
the matching unit is used for receiving the recognition signal stream sent by the judging unit, matching the recognition signal stream with each of a plurality of pre-formed initial recognition models, obtaining the successfully matched initial recognition model, taking the recognition signal stream as an added sentence training sample of that model, obtaining the extraction frequency of the initial recognition model, and sending the extraction frequency to the comparison unit connected with the matching unit;
the comparison unit is used for receiving the extraction frequency of the successfully matched initial recognition model sent by the matching unit, comparing whether the extraction frequency is greater than a preset extraction threshold, deleting from the client any initial recognition model whose extraction frequency is less than or equal to the extraction threshold, and sending the comparison result to the model updating unit connected with the comparison unit; and
the model updating unit is used for receiving the comparison result sent by the comparison unit and updating, according to the sentence training sample, the initial recognition model whose extraction frequency is greater than the extraction threshold;
the server provides all the initial recognition models, and is used for updating, according to the sentence training samples, the initial recognition models whose extraction frequencies are less than or equal to the extraction threshold in each client, so as to finally form a plurality of recognition models, wherein each recognition model corresponds to one speaker.
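To make the division of labour in claim 9 concrete, the client-side flow might look as follows, reusing select_matched_model and update_model from the sketches above. The dict-based model records and the server_update callback standing in for the remote server connection are assumptions of the sketch, not features recited in the claim.

    class Client:
        def __init__(self, initial_models, extraction_threshold, server_update):
            self.models = list(initial_models)
            self.extraction_threshold = extraction_threshold
            self.server_update = server_update  # stands in for the remote server link

        def handle_stream(self, recognition_stream):
            # Matching unit: match the stream against the initial models,
            # record it as an added sentence training sample of the matched
            # model, and bump that model's extraction frequency.
            matched = select_matched_model(recognition_stream, self.models)
            if matched is None:
                return
            matched["extractions"] = matched.get("extractions", 0) + 1
            matched.setdefault("training_samples", []).append(recognition_stream)

        def compare_and_update(self):
            # Comparison and model-updating units: models extracted more often
            # than the threshold are updated locally from their stored samples;
            # the rest are deleted from the client and handed to the server.
            kept, deferred = [], []
            for model in self.models:
                if model.get("extractions", 0) > self.extraction_threshold:
                    for sample in model.get("training_samples", []):
                        update_model(model, sample)
                    kept.append(model)
                else:
                    deferred.append(model)
            self.models = kept
            self.server_update(deferred)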
10. The system for training a recognition model according to the extraction frequency of the model as claimed in claim 9, wherein the client further comprises:
the system comprises a sample acquisition unit, a model establishing unit and a data processing unit, wherein the sample acquisition unit is used for acquiring a plurality of preset sentence training samples and sending the sentence training samples to the model establishing unit connected with the sample acquisition unit; and
the model establishing unit is used for receiving a plurality of preset sentence training samples sent by the sample obtaining unit and establishing a plurality of initial recognition models according to the preset sentence training samples.
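A sketch of the model-establishing step in claim 10, assuming each preset sentence training sample is already a feature vector and keeping the model layout used above; a production voiceprint system would typically fit a richer per-speaker model (for example a GMM) at this point.

    def establish_initial_models(sentence_training_samples):
        # Build one initial recognition model per preset sentence training
        # sample; the mean-vector voiceprint and bookkeeping fields are the
        # assumed layout used throughout these sketches.
        return [
            {
                "mean": list(sample),       # voiceprint stand-in for one speaker
                "extractions": 0,           # extraction frequency counter
                "training_samples": [],     # added sentence training samples
            }
            for sample in sentence_training_samples
        ]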
11. The system according to claim 9 or 10, wherein the processing unit specifically comprises:
the segmentation module is used for segmenting the initial voice signal stream into a plurality of voice segments according to a preset speaker segmentation algorithm and sending all the voice segments to the clustering module connected with the segmentation module; and
the clustering module is used for receiving the voice segments sent by the segmentation module and clustering the voice segments according to a preset speaker clustering algorithm to generate the voice signal stream.
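Claim 11 leaves the speaker segmentation and speaker clustering algorithms unspecified. The sketch below substitutes fixed-length segmentation and a crude mean-energy grouping purely to show the two module boundaries; frame_len and num_speakers are assumed parameters, not anything taken from the claims.

    def segment_and_cluster(initial_stream, frame_len=160, num_speakers=2):
        # Segmentation module: cut the initial voice signal stream into voice
        # segments (fixed-length frames stand in for a real speaker
        # segmentation algorithm).
        segments = [initial_stream[i:i + frame_len]
                    for i in range(0, len(initial_stream), frame_len)]
        # Clustering module: rank segments by mean energy and split the ranking
        # into num_speakers groups (a crude stand-in for speaker clustering).
        energies = [sum(abs(x) for x in seg) / len(seg) for seg in segments]
        ranked = sorted(range(len(segments)), key=energies.__getitem__)
        clusters = [[] for _ in range(num_speakers)]
        for rank, idx in enumerate(ranked):
            clusters[rank * num_speakers // len(ranked)].append(segments[idx])
        # Concatenate each cluster into one per-speaker voice signal stream.
        return [[x for seg in cluster for x in seg] for cluster in clusters]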
12. The system according to claim 9 or 10, wherein the matching unit specifically comprises:
the matching degree acquisition module is used for matching the recognition signal stream with each of the plurality of initial recognition models to obtain the matching degree between each initial recognition model and the recognition signal stream, and for sending all the matching degrees to the signal stream acquisition module connected with the matching degree acquisition module; and
the signal stream acquisition module is used for receiving all the matching degrees sent by the matching degree acquisition module, and for selecting, from among the matching degrees that are greater than a preset matching threshold, the initial recognition model corresponding to the highest matching degree as the successfully matched initial recognition model.
13. The system for training a recognition model according to the extraction frequency of the model as claimed in claim 11, wherein the matching unit specifically comprises:
the matching degree acquisition module is used for matching the recognition signal stream with each of the plurality of initial recognition models to obtain the matching degree between each initial recognition model and the recognition signal stream, and for sending all the matching degrees to the signal stream acquisition module connected with the matching degree acquisition module; and
the signal stream acquisition module is used for receiving all the matching degrees sent by the matching degree acquisition module, and for selecting, from among the matching degrees that are greater than a preset matching threshold, the initial recognition model corresponding to the highest matching degree as the successfully matched initial recognition model.
14. The system according to any one of claims 9, 10 and 13, wherein the model updating unit specifically comprises:
the correction module is used for generating a corrected recognition model according to the initial recognition model successfully matched and a preset sentence training sample and sending the corrected recognition model to the updating module connected with the correction module; and
the updating module is used for receiving the corrected recognition model sent by the correction module and updating the initial recognition model with the corrected recognition model.
15. The system for training a recognition model according to the extraction frequency of the model as claimed in claim 11, wherein the model updating unit specifically comprises:
the correction module is used for generating a corrected recognition model according to the initial recognition model successfully matched and a preset sentence training sample and sending the corrected recognition model to the updating module connected with the correction module; and
the updating module is used for receiving the corrected recognition model sent by the correction module and updating the initial recognition model with the corrected recognition model.
16. The system for training a recognition model according to the extraction frequency of the model as claimed in claim 12, wherein the model updating unit specifically comprises:
the correction module is used for generating a corrected recognition model according to the initial recognition model successfully matched and a preset sentence training sample and sending the corrected recognition model to the updating module connected with the correction module; and
the updating module is used for receiving the corrected recognition model sent by the correction module and updating the initial recognition model with the corrected recognition model.
CN201610025278.3A 2016-01-14 2016-01-14 Method and system for training and identifying model according to extraction frequency of model Active CN106971734B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610025278.3A CN106971734B (en) 2016-01-14 2016-01-14 Method and system for training and identifying model according to extraction frequency of model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610025278.3A CN106971734B (en) 2016-01-14 2016-01-14 Method and system for training and identifying model according to extraction frequency of model

Publications (2)

Publication Number Publication Date
CN106971734A CN106971734A (en) 2017-07-21
CN106971734B 2020-10-23

Family

ID=59334924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610025278.3A Active CN106971734B (en) 2016-01-14 2016-01-14 Method and system for training and identifying model according to extraction frequency of model

Country Status (1)

Country Link
CN (1) CN106971734B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065026B (en) * 2018-09-14 2021-08-31 海信集团有限公司 Recording control method and device
CN111462761A (en) * 2020-03-03 2020-07-28 深圳壹账通智能科技有限公司 Voiceprint data generation method and device, computer device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101114449A (en) * 2006-07-26 2008-01-30 大连三曦智能科技有限公司 Model training method for unspecified person alone word, recognition system and recognition method
CN101458816B (en) * 2008-12-19 2011-04-27 西安电子科技大学 Target matching method in digital video target tracking
CN102543063B (en) * 2011-12-07 2013-07-24 华南理工大学 Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers
CN104103272B (en) * 2014-07-15 2017-10-10 无锡中感微电子股份有限公司 Audio recognition method, device and bluetooth earphone

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0817170B1 (en) * 1996-07-01 2003-05-02 Telia Ab Method and apparatus for adaption of models of speaker verification
CN101546557A (en) * 2008-03-28 2009-09-30 展讯通信(上海)有限公司 Method for updating classifier parameters for identifying audio content
CN102282608A (en) * 2008-12-09 2011-12-14 诺基亚公司 Adaptation of automatic speech recognition acoustic models
CN102237084A (en) * 2010-04-22 2011-11-09 松下电器产业株式会社 Method, device and equipment for adaptively adjusting sound space benchmark model online
US8805684B1 (en) * 2012-05-31 2014-08-12 Google Inc. Distributed speaker adaptation
CN102760434A (en) * 2012-07-09 2012-10-31 华为终端有限公司 Method for updating voiceprint feature model and terminal
CN103811002A (en) * 2012-11-13 2014-05-21 通用汽车环球科技运作有限责任公司 Adaptation methods and systems for speech systems
CN102968989A (en) * 2012-12-10 2013-03-13 中国科学院自动化研究所 Improvement method of Ngram model for voice recognition

Also Published As

Publication number Publication date
CN106971734A (en) 2017-07-21

Similar Documents

Publication Publication Date Title
US9769296B2 (en) Techniques for voice controlling bluetooth headset
WO2017054122A1 (en) Speech recognition system and method, client device and cloud server
WO2021159688A1 (en) Voiceprint recognition method and apparatus, and storage medium and electronic apparatus
JP4369132B2 (en) Background learning of speaker voice
AU2016277548A1 (en) A smart home control method based on emotion recognition and the system thereof
WO2016150001A1 (en) Speech recognition method, device and computer storage medium
JP2018072650A (en) Voice interactive device and voice interactive method
KR101618512B1 (en) Gaussian mixture model based speaker recognition system and the selection method of additional training utterance
US11462219B2 (en) Voice filtering other speakers from calls and audio messages
CN110634472A (en) Voice recognition method, server and computer readable storage medium
CN112562681B (en) Speech recognition method and apparatus, and storage medium
CN106981289A (en) A kind of identification model training method and system and intelligent terminal
JP2019040123A (en) Learning method of conversion model and learning device of conversion model
CN109074809B (en) Information processing apparatus, information processing method, and computer-readable storage medium
CN106971734B (en) Method and system for training and identifying model according to extraction frequency of model
CN110931018A (en) Intelligent voice interaction method and device and computer readable storage medium
CN109065026B (en) Recording control method and device
US20220059080A1 (en) Realistic artificial intelligence-based voice assistant system using relationship setting
CN108806691B (en) Voice recognition method and system
CN112185422A (en) Prompt message generation method and voice robot thereof
US11587554B2 (en) Control apparatus, voice interaction apparatus, voice recognition server, and program
CN109087651B (en) Voiceprint identification method, system and equipment based on video and spectrogram
CN106971731B (en) Correction method for voiceprint recognition
US20200168221A1 (en) Voice recognition apparatus and method of voice recognition
US20230238002A1 (en) Signal processing device, signal processing method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant