CN111209429A - Unsupervised model training method and unsupervised model training device for measuring coverage of voice database - Google Patents

Publication number
CN111209429A
Authority
CN
China
Prior art keywords
evaluation
training
factor
training data
coverage
Prior art date
Legal status
Granted
Application number
CN202010309303.7A
Other languages
Chinese (zh)
Other versions
CN111209429B (en)
Inventor
李科
张卫强
黄宇凯
郝玉峰
宋琼
Current Assignee
Beijing Speechocean Technology Co ltd
Tsinghua University
Original Assignee
Beijing Speechocean Technology Co ltd
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Beijing Speechocean Technology Co ltd, Tsinghua University
Priority to CN202010309303.7A
Publication of CN111209429A
Application granted
Publication of CN111209429B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65Clustering; Classification


Abstract

The present disclosure relates to an unsupervised model training method for measuring speech database coverage, the method comprising: acquiring training data, wherein the training data is voice; determining one or more evaluation factors of voice database coverage; dividing each evaluation factor into an adjustable factor or a non-adjustable factor based on whether the training data, with respect to that factor, can be controlled by parameter adjustment; determining a clustering algorithm corresponding to each divided evaluation factor; classifying the training data through the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses; and training an evaluation model according to the plurality of subclasses of each evaluation factor. With this method, the evaluation factors measured for a voice database can be set according to user needs; by distinguishing the evaluation factors, different features can be extracted purposefully and suitable algorithms selected; and model training can be carried out with unsupervised data, reducing the cost introduced by data annotation.

Description

Unsupervised model training method and unsupervised model training device for measuring coverage of voice database
Technical Field
The present disclosure relates to the field of speech signal processing, and in particular, to an unsupervised model training method and apparatus, an electronic device, and a computer-readable storage medium for measuring speech database coverage.
Background
The coverage of a voice database is an important index for measuring its quality, and refers to the degree to which the voice database covers the evaluation factors, for example, the gender, language, and voice content of the speakers. When training a speech recognition system, for instance, the voices of a large number of speakers need to be collected; the better the coverage of the selected voice database, the more extensive the speech space it spans, and the more effectively bias in the spatial distribution of the samples can be reduced.
Traditionally, the coverage of a voice database depends on expert experience at the design stage: when the acquisition plan is made, the voices in the database are distributed as comprehensively as possible over the various evaluation factors. For an already collected database, however, only indirect feedback can be obtained, from indexes such as the recognition rate, after the speech signals have been processed and modeled. In the process of training a speech data evaluation model, the division of the training data is not comprehensive, and an accurate evaluation model is difficult to construct owing to the lack of manually labeled sample data.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides an unsupervised model training method and apparatus, an electronic device, and a computer-readable storage medium for measuring coverage of a speech database.
According to a first aspect of the embodiments of the present disclosure, there is provided an unsupervised model training method for measuring the coverage of a voice database, the method including: acquiring training data, wherein the training data is voice; determining one or more evaluation factors of voice database coverage; dividing each evaluation factor into an adjustable factor or a non-adjustable factor based on whether the training data, with respect to that evaluation factor, can be controlled through parameter adjustment; determining a clustering algorithm corresponding to each divided evaluation factor; classifying the training data through the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses; and training an evaluation model according to the plurality of subclasses of each evaluation factor.
In an embodiment, determining the clustering algorithm corresponding to each divided evaluation factor includes: if the evaluation factor is a non-adjustable factor, determining that the corresponding clustering algorithm is a distance-based clustering algorithm; and if the evaluation factor is an adjustable factor, determining that the corresponding clustering algorithm is an adaptive training algorithm.
In an embodiment, classifying the training data through the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses includes: if the evaluation factor is a non-adjustable factor, extracting a feature vector of the training data; and dividing the training data into a plurality of subclasses with a distance-based clustering algorithm according to the feature vector.
In one embodiment, the distance-based clustering algorithm is a K-means clustering algorithm.
In an embodiment, the training data is classified by a clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses, including: if the evaluation factor is an adjustable factor, extracting a feature vector of the training data; training a Gaussian mixture model through the feature vectors, and labeling training data; and according to the labeled training data, dividing the training data into a plurality of subclasses.
In one embodiment, training the Gaussian mixture model through the feature vectors and labeling the training data includes: training a Gaussian mixture model through the feature vectors; determining a control parameter according to the evaluation factor, wherein the control parameter can transform the training data; traversing all values of the control parameter and transforming the training data; obtaining the parameter value at which the feature vector of the transformed training data maximizes the likelihood of the Gaussian mixture model; accumulating the likelihood according to the parameter value; transforming the training data according to the parameter value to obtain new training data, and retraining until a stopping condition is reached; and taking, for each training datum, the parameter value that maximizes the likelihood of the Gaussian mixture model as its label value.
In one embodiment, the stopping condition includes: the number of iterations reaches a preset threshold, or the rate of change of the cumulative likelihood relative to the cumulative likelihood of the previous iteration is smaller than a preset threshold.
In one embodiment, training an evaluation model according to the plurality of subclasses of each evaluation factor includes: training one or more evaluation models for each subclass separately, or training one evaluation model on a plurality of subclasses as a whole.
In one embodiment, the factors for assessing voice database coverage include one or more of: gender of the speaker, age of the speaker, accent of the speaker, speed of speech, pitch, language, capture device, capture environment, pronunciation factors, or content subject.
According to a second aspect of the embodiments of the present disclosure, there is provided a method for measuring the coverage of a voice database, the method including: obtaining an evaluation model of each evaluation factor by using the unsupervised model training method for measuring voice database coverage according to the first aspect; acquiring a voice database to be evaluated, wherein the voice database includes at least one piece of voice; detecting each voice in the voice database through the evaluation model of an evaluation factor to obtain the single-factor information entropy of the voice database corresponding to the evaluation factor; and determining the coverage of the voice database according to the single-factor information entropy.
According to a third aspect of the embodiments of the present disclosure, there is provided an unsupervised model training apparatus for measuring the coverage of a voice database, the apparatus including: a data acquisition unit, configured to acquire training data, the training data being voice; an evaluation factor determining unit, configured to determine one or more evaluation factors of voice database coverage; a dividing unit, configured to divide each evaluation factor into an adjustable factor or a non-adjustable factor based on whether the training data, with respect to that evaluation factor, can be controlled through parameter adjustment; an algorithm determining unit, configured to determine a clustering algorithm corresponding to each divided evaluation factor; a classification unit, configured to classify the training data through the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses; and a model training unit, configured to train an evaluation model according to the plurality of subclasses of each evaluation factor.
In an embodiment, the algorithm determining unit is further configured to: when the evaluation factor is a non-adjustable factor, determine the corresponding clustering algorithm to be a distance-based clustering algorithm; and when the evaluation factor is an adjustable factor, determine the corresponding clustering algorithm to be an adaptive training algorithm.
In an embodiment, the classification unit is further configured to: when the evaluation factor is a non-adjustable factor, extract a feature vector of the training data; and divide the training data into a plurality of subclasses with a distance-based clustering algorithm according to the feature vector.
In one embodiment, the distance-based clustering algorithm is a K-means clustering algorithm.
In an embodiment, the classification unit is further configured to: when the evaluation factor is an adjustable factor, extracting a feature vector of the training data; training a Gaussian mixture model through the feature vectors, and labeling training data; and according to the labeled training data, dividing the training data into a plurality of subclasses.
In one embodiment, training the Gaussian mixture model through the feature vectors and labeling the training data includes: training a Gaussian mixture model through the feature vectors; determining a control parameter according to the evaluation factor, wherein the control parameter can transform the training data; traversing all values of the control parameter and transforming the training data; obtaining the parameter value at which the feature vector of the transformed training data maximizes the likelihood of the Gaussian mixture model; accumulating the likelihood according to the parameter value; transforming the training data according to the parameter value to obtain new training data, and retraining until a stopping condition is reached; and taking, for each training datum, the parameter value that maximizes the likelihood of the Gaussian mixture model as its label value.
In one embodiment, the stopping condition includes: the number of iterations reaches a preset threshold, or the rate of change of the cumulative likelihood relative to the cumulative likelihood of the previous iteration is smaller than a preset threshold.
In an embodiment, the model training unit is further configured to: train one or more evaluation models for each subclass separately, or train one evaluation model on a plurality of subclasses as a whole.
In one embodiment, the factors for assessing voice database coverage include one or more of: gender of the speaker, age of the speaker, accent of the speaker, speed of speech, pitch, language, capture device, capture environment, pronunciation factors, or content subject.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an apparatus for measuring the coverage of a voice database, the apparatus including: an evaluation model obtaining unit, configured to obtain an evaluation model of each evaluation factor by using the unsupervised model training method for measuring voice database coverage according to the first aspect; a voice database obtaining unit, configured to obtain a voice database to be evaluated, wherein the voice database includes at least one piece of voice; a detection unit, configured to detect each voice in the voice database through the evaluation model of an evaluation factor to obtain the single-factor information entropy of the voice database corresponding to the evaluation factor; and an evaluation unit, configured to determine the coverage of the voice database according to the single-factor information entropy.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: a memory to store instructions; and a processor for invoking the memory-stored instructions to perform the unsupervised model training method for measuring speech database coverage of the first aspect.
According to a sixth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by a processor, perform the unsupervised model training method for measuring speech database coverage of the first aspect.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects. First, the evaluation factors measured for a database can be set according to user requirements; for example, when accent recognition and classification are performed, evaluation factors such as accent, pitch, and speech rate can be set to construct the evaluation model, so that the voice database evaluation model is more targeted in specific application scenarios. Second, distinguishing the different evaluation factors helps to extract the corresponding features and select a suitable algorithm in subsequent processing, so that a more appropriate and accurate classification model can be constructed for each evaluation factor, which in turn helps to judge the coverage of the voice database accurately. Third, unsupervised data, i.e., data without labels, can be used for model training, realizing quantitative evaluation of voice database coverage and reducing the cost introduced by data annotation.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic flow diagram illustrating an unsupervised model training method for measuring speech database coverage in accordance with an exemplary embodiment;
FIG. 2 is a schematic flow diagram illustrating another unsupervised model training method for measuring speech database coverage in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method of measuring speech database coverage in accordance with an exemplary embodiment;
FIG. 4 is a schematic block diagram illustrating an unsupervised model training apparatus for measuring speech database coverage in accordance with an exemplary embodiment;
FIG. 5 is a schematic block diagram illustrating an apparatus in accordance with an exemplary embodiment;
FIG. 6 is a schematic block diagram illustrating an electronic device in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Existing voice database coverage evaluation systems commonly depend on expert-labeled data, and their division of the training data is neither comprehensive nor diverse. Some related technologies rely on supervised data, i.e., data containing annotations: the data must be divided into several subclasses according to the annotations, and a model is then trained according to the label of each subclass. Such supervised data is obtained by manual labeling, which consumes a large amount of manpower, material resources, and time.
The present disclosure provides an unsupervised model training method 10 for measuring voice database coverage, which includes steps S11-S16, as shown in fig. 1, and is described in detail below:
step S11, training data is acquired, the training data being speech.
The training data can come from one or more voice databases; the range of voices covered is thus wide, admitting many possible classifications and ensuring the diversity of the data used in model training.
At step S12, one or more assessment factors for voice database coverage are determined.
In an embodiment of the present disclosure, the evaluation factors of voice database coverage include one or more of the following: gender of the speaker, age of the speaker, accent of the speaker, speed of speech, pitch, language, pronunciation factors, or content subject. In the evaluation process, a user can determine the evaluation factors relevant to the voice database according to its intended use; for example, when accent recognition and classification are performed, evaluation factors such as accent, pitch, and speech rate can be set to construct the evaluation model. Through the steps of this embodiment, the evaluation factors can be set independently by the user, so that the voice database evaluation model is more targeted in a specific application scenario and the obtained evaluation result better matches the user's needs.
In step S13, the evaluation factors are divided into adjustable factors or non-adjustable factors based on whether the training data, with respect to each evaluation factor, can be controlled through parameter adjustment.
Specifically, for each evaluation factor, it is first determined whether speech is adjustable with respect to that factor, that is, whether there exists a transformation that can change it from one value to another. If such a transformation exists, the evaluation factor is considered adjustable: for example, speech rate can be increased or decreased through time-domain scaling, and pitch can be lowered or raised through frequency-domain scaling. If no such transformation exists, the evaluation factor is considered non-adjustable; in general, with current technical means, there is no simple transformation that can convert factors such as accent or language from one value to another. Distinguishing whether an evaluation factor is adjustable helps the developer to select an appropriate processing method for it, to extract different features for different factors, and to make the processing algorithm better suited to each factor. When the overall evaluation model is constructed later, the distinguished factors make the model architecture clearer, further improving the accuracy of the evaluation model.
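As an illustration of such a transformation, the time-domain scaling mentioned above can be sketched as follows. This is a minimal sketch: the linear-interpolation resampler and the function name `time_scale` are illustrative stand-ins, not from the disclosure, for a production speech-rate transform.

```python
import numpy as np

def time_scale(signal: np.ndarray, alpha: float) -> np.ndarray:
    """Resample a waveform by factor alpha (>1 = faster speech, shorter signal).

    Linear interpolation stands in for a proper resampler (hypothetical helper).
    """
    n_out = max(1, int(round(len(signal) / alpha)))
    old_t = np.arange(len(signal))
    new_t = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_t, old_t, signal)

# A 5 Hz test tone; doubling the rate halves the number of samples.
tone = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 1000))
faster = time_scale(tone, 2.0)
slower = time_scale(tone, 0.5)
```

A frequency-domain analogue of the same idea (scaling the spectrum rather than the time axis) would serve as the pitch transformation.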
And step S14, determining a clustering algorithm corresponding to each divided evaluation factor.
In an embodiment of the disclosure, if the evaluation factor is a non-adjustable factor, the corresponding clustering algorithm is determined to be a distance-based clustering algorithm; if the evaluation factor is an adjustable factor, the corresponding clustering algorithm is determined to be an adaptive training algorithm. When no transformation exists for an evaluation factor, its categories are clear-cut: for example, speaker gender comprises the male and female categories, and the set of languages is relatively fixed. Where the categories are clear and the speech-related features of different categories are markedly distinct, a distance-based clustering algorithm is the more intuitive and convenient choice, easy to implement and highly accurate. When a transformation does exist for the evaluation factor, the category information is not directly reflected in the feature vector and can only be captured through the measured feature parameters, so an adaptive training algorithm is used instead.
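The selection rule described above amounts to a simple dispatch on the factor type; the enum and the returned algorithm names below are hypothetical labels, not identifiers from the disclosure.

```python
from enum import Enum

class FactorType(Enum):
    NON_ADJUSTABLE = "non_adjustable"   # e.g. gender, language, accent
    ADJUSTABLE = "adjustable"           # e.g. speech rate, pitch

def select_algorithm(factor_type: FactorType) -> str:
    # Non-adjustable factors have clear, fixed categories -> distance-based
    # clustering; adjustable factors need the adaptive (GMM-based) loop.
    if factor_type is FactorType.NON_ADJUSTABLE:
        return "k-means"
    return "adaptive-gmm"
```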
And step S15, classifying the training data through the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses.
In an embodiment of the present disclosure, classifying the training data through the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses includes: if the evaluation factor is a non-adjustable factor, extracting a feature vector of the training data; and dividing the training data into a plurality of subclasses with a distance-based clustering algorithm according to the feature vector.
Specifically, if the evaluation factor is a non-adjustable factor, a vector representation of the relevant features of the training data is extracted. For example, for factors such as accent, language, and speaker, an i-vector or x-vector can be extracted; for factors such as environment, device, and phoneme content, MFCC (mel-frequency cepstral coefficient) feature vectors can be extracted. According to the extracted features, the data are divided into a plurality of subclasses. When extracting features, feature vectors that effectively distinguish different voices under the given evaluation factor should be selected, so as to improve clustering accuracy.
In an embodiment of the present disclosure, the distance-based clustering algorithm is the K-means clustering algorithm. The K-means algorithm is easy to implement, and is efficient and practical for clustering problems.
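A minimal sketch of the non-adjustable-factor branch, assuming scikit-learn is available; the synthetic Gaussian blobs stand in for i-vector/MFCC features of two hypothetical speaker groups.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-ins for 13-dimensional feature vectors of 100 utterances:
# two artificial groups (e.g. two speaker genders, hypothetical data).
feats = np.vstack([rng.normal(0.0, 0.3, (50, 13)),
                   rng.normal(3.0, 0.3, (50, 13))])

# Cluster the feature vectors; each cluster becomes one subclass.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(feats)
subclasses = [feats[km.labels_ == k] for k in range(2)]
```

In practice the number of clusters would be chosen per evaluation factor (e.g. two for gender, more for languages or accents).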
In an embodiment of the present disclosure, classifying the training data through the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses further includes: if the evaluation factor is an adjustable factor, extracting a feature vector of the training data; training a Gaussian mixture model through the feature vectors and labeling the training data; and dividing the training data into a plurality of subclasses according to the labeled training data. The Gaussian mixture model is a probabilistic model that assumes all samples are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. A linear combination of Gaussian probability density functions can approximate an arbitrary density function. Speech features usually have smooth probability density functions, so a finite number of Gaussian components can form a smooth approximation to the density function of the speech features, and the speech can be effectively distinguished according to the adjustable factor.
In an embodiment of the present disclosure, training the Gaussian mixture model through the feature vectors and labeling the training data includes: training a Gaussian mixture model through the feature vectors; determining a control parameter according to the evaluation factor, wherein the control parameter can transform the training data; traversing all values of the control parameter and transforming the training data; obtaining the parameter value at which the feature vector of the transformed training data maximizes the likelihood of the Gaussian mixture model; accumulating the likelihood according to the parameter value; transforming the training data according to the parameter value to obtain new training data, and retraining until a stopping condition is reached; and taking, for each training datum, the parameter value that maximizes the likelihood of the Gaussian mixture model as its label value.
In an embodiment of the present disclosure, the stopping condition includes: the number of iterations reaches a preset threshold, or the rate of change of the cumulative likelihood relative to the cumulative likelihood of the previous iteration is smaller than a preset threshold.
Specifically, MFCC feature vectors of all the data can be collected, and a GMM (Gaussian mixture model) Λ is trained.

Let the likelihood accumulator L = 0. Then the following operations are performed on all the data, one datum x at a time:

The data are transformed with a transformation control parameter α to obtain the transformed data x(α). For example, for a speech-rate transformation, α refers to a time-domain scaling factor; for a pitch transformation, α refers to a frequency-domain scaling factor.

The values of α are discretized, and all values are traversed to find the value at which the MFCC features of x(α) maximize the likelihood of the GMM Λ:

α* = argmax over α of p(MFCC(x(α)) | Λ)

The value α* at which the maximum is reached is recorded, and the maximum likelihood is accumulated:

L = L + p(MFCC(x(α*)) | Λ)

The data are then transformed by the value α* to obtain new data:

x ← x(α*)

MFCC feature vectors of all the data are collected again and the GMM is retrained, iterating until a stopping condition is reached; in an embodiment, the stopping condition is that the change in the likelihood accumulator L over the last iteration is smaller than a threshold ε, or that the number of iterations reaches 8.

According to the recorded maximizing value α* of each training datum, the training data are divided into M subclasses.
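The iterative procedure above can be sketched as follows, assuming scikit-learn's `GaussianMixture` for the GMM. The frame-statistics features and the `time_scale` resampler are simplified, hypothetical stand-ins for real MFCC extraction and the speech-rate transformation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def time_scale(x, alpha):
    # Hypothetical speech-rate transform: resample by factor alpha.
    n = max(20, int(len(x) / alpha))
    return np.interp(np.linspace(0, len(x) - 1, n), np.arange(len(x)), x)

def extract_feats(x):
    # Stand-in for MFCC extraction: 10 frames, per-frame mean and std.
    frames = x[: len(x) // 10 * 10].reshape(10, -1)
    return np.stack([frames.mean(axis=1), frames.std(axis=1)], axis=1)

def adaptive_label(signals, alphas, n_iter=8, tol=1e-3):
    """Sketch of the iterative loop: retrain a GMM, find for each signal the
    control-parameter value maximizing its likelihood, accumulate the
    likelihood, transform the data, and repeat until the stop condition."""
    data = [np.asarray(s, dtype=float) for s in signals]
    labels = np.ones(len(data))              # composed alpha per signal
    prev_L = None
    for _ in range(n_iter):
        gmm = GaussianMixture(n_components=2, random_state=0).fit(
            np.vstack([extract_feats(x) for x in data]))
        L = 0.0
        for i, x in enumerate(data):
            # Traverse the discretized control-parameter values.
            scores = [gmm.score(extract_feats(time_scale(x, a))) for a in alphas]
            best = int(np.argmax(scores))
            labels[i] *= alphas[best]        # record the maximizing alpha
            L += scores[best]                # accumulate likelihood
            data[i] = time_scale(x, alphas[best])
        if prev_L is not None and abs(L - prev_L) <= tol * abs(prev_L):
            break                            # likelihood change small enough
        prev_L = L
    return labels

rng = np.random.default_rng(0)
signals = [np.sin(np.linspace(0, 2 * np.pi * f, 400)) + rng.normal(0, 0.05, 400)
           for f in (3, 3, 12, 12)]
labels = adaptive_label(signals, alphas=[0.5, 1.0, 2.0], n_iter=2)
```

The final composed scaling factor per signal plays the role of the label value α*, and grouping signals by that value yields the M subclasses.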
For speech classification under adjustable factors, the feature vectors do not directly reflect the adjustability of the speech, so clustering adjustable factors by measuring vector distances directly is not very accurate. Describing the feature vectors through the likelihood of a Gaussian mixture model allows more speech feature information to be obtained during the feature transformation, achieving a more accurate and effective description of the distribution of the speech data. Meanwhile, choosing a suitable stopping condition helps to improve the running speed of the algorithm while preserving its accuracy.
In step S16, an evaluation model is trained based on the plurality of subclasses of each evaluation factor.
In an embodiment of the present disclosure, one or more evaluation models are trained for each subclass separately, or one evaluation model is trained on a plurality of subclasses as a whole. For the plurality of subclasses of each evaluation factor, a corresponding evaluation model can be trained as required, which facilitates the subsequent judgment of the information entropy of each voice in the voice database with respect to the evaluation factors, and thus of the coverage of the voice database.
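One plausible realization of "one evaluation model per subclass" (an illustrative choice, not mandated by the disclosure) fits a small Gaussian mixture to each subclass and converts per-model likelihoods into subclass probabilities:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_subclass_models(subclasses):
    # One single-component GMM per subclass; together they form the
    # evaluation model for this factor (illustrative design choice).
    return [GaussianMixture(n_components=1, random_state=0).fit(feats)
            for feats in subclasses]

def subclass_posteriors(models, feat):
    # Convert per-model log-likelihoods into subclass probabilities
    # (softmax over log-likelihoods, assuming equal priors).
    log_l = np.array([m.score_samples(feat[None, :])[0] for m in models])
    p = np.exp(log_l - log_l.max())
    return p / p.sum()

rng = np.random.default_rng(1)
# Two synthetic subclasses of 3-D feature vectors (hypothetical data).
subclasses = [rng.normal(0.0, 0.2, (40, 3)), rng.normal(4.0, 0.2, (40, 3))]
models = train_subclass_models(subclasses)
probs = subclass_posteriors(models, np.array([4.0, 4.0, 4.0]))
```

These per-voice subclass probabilities are exactly what the entropy computation in the second aspect consumes.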
The flow of an embodiment of the present disclosure is shown in fig. 2. First, the evaluation factors of voice database coverage are determined, and the evaluation factors are divided according to whether a transformation exists for the voice with respect to each factor. If no transformation exists, the feature vectors are extracted directly and clustered with the k-means algorithm to obtain a plurality of subclasses under the evaluation factor, and a model of the evaluation factor is constructed from the obtained subclasses. If a transformation exists, MFCC vectors are extracted and a GMM is constructed; the parameter value maximizing the likelihood is determined through the parameter transformation, the transformed vectors are fed back into the GMM to continue the iteration, and the data are labeled with the likelihood-maximizing parameter values to obtain a plurality of subclasses, from which a model of the evaluation factor is constructed. In this embodiment, different feature vectors are extracted and corresponding clustering algorithms are selected for different evaluation factors, making the evaluation model more targeted; meanwhile, the algorithm trains the model on unsupervised data throughout, reducing the cost introduced by data annotation.
The present disclosure also provides a method 20 for measuring coverage of a voice database, as shown in fig. 3, the method 20 for measuring coverage of a voice database includes steps S21-S24, which are described in detail as follows:
step S21: an evaluation model of each evaluation factor is obtained by using the unsupervised model training method 10 for measuring the coverage of the voice database of any one of the preceding embodiments.
Step S22: and acquiring a voice database to be evaluated, wherein the voice database comprises at least one voice.
Step S23: and detecting each voice in the voice database through an evaluation model of the evaluation factor to obtain a single-factor information entropy of the voice database corresponding to the evaluation factor.
In the embodiment of the disclosure, each voice in the voice database is detected separately for the evaluation factors that the voice database needs to cover. The evaluation model may detect one or more evaluation factors. Classification detection through the evaluation model determines the single-factor information entropy of the voice database with respect to each evaluation factor, yielding the likelihood that the database covers each evaluation factor and making it convenient to determine whether the voice database covers the required evaluation factors and with what probability. The more evenly the voices are distributed across the subclass factors, the higher the resulting single-factor information entropy; conversely, the more concentrated the distribution, the lower the entropy.
In one embodiment, based on the evaluation factors, each voice is classified and detected through an evaluation model, and subclass conditional probabilities of each voice in a voice database and corresponding to a plurality of subclass factors in the evaluation factors are obtained; and obtaining single-factor information entropy corresponding to the evaluation factors and the voice database based on the subclass conditional probability.
Specifically, according to the evaluation factors determined to be required, each voice in the voice database is classified and detected by the evaluation model, determining the probability of each voice with respect to each required evaluation factor. This detection yields the conditional probability of each voice in the voice database corresponding to each subclass factor under the current evaluation factor, making it convenient to determine the probability of each voice under each subclass factor; the subclass conditional probabilities of all voices under each subclass factor are then integrated to obtain the single-factor information entropy of the voice database under the evaluation factor.
In an embodiment, for the current evaluation factor, detecting each voice yields the subclass conditional probability of each voice corresponding to each subclass factor under the current evaluation factor. These probabilities are collected and averaged to obtain the subclass average conditional probability of the voice database under each subclass factor; the single-factor information entropy of the voice database under the current evaluation factor is then obtained from these subclass average conditional probabilities.
In one implementation scenario, let D = {x_1, x_2, ..., x_K} denote the speech database, where K is the number of pieces of speech in the database and x_1 through x_K denote the individual voices. Let c_1, c_2, ..., c_M denote the subclass factors under the current evaluation factor, where M is the number of subclasses. The evaluation model yields the subclass conditional probability of each voice under each subclass factor:

p(c_m | x_k), for k = 1, ..., K and m = 1, ..., M.

These probabilities are integrated by averaging over the K voices to obtain the subclass average conditional probability of the voice database under each subclass factor:

p_m = (1/K) * sum_{k=1}^{K} p(c_m | x_k), for m = 1, ..., M.

The single-factor information entropy of the speech database under the current evaluation factor is then obtained as:

H = - sum_{m=1}^{M} p_m * log p_m,

where log is the natural logarithm.
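The averaging and entropy computation described above can be sketched directly. The probability matrix below is invented example data standing in for the evaluation model's output; the computation itself follows the formulas in this scenario.

```python
import numpy as np

# p[k][m] = conditional probability of voice k under subclass factor m,
# as produced by the evaluation model (each row sums to 1). Example values.
p = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1],
              [0.8, 0.1, 0.1]])

p_avg = p.mean(axis=0)                     # subclass average conditional probability
entropy = -np.sum(p_avg * np.log(p_avg))   # single-factor information entropy (nats)

# Perfectly even coverage of M subclasses would give the maximum entropy log(M).
max_entropy = np.log(p.shape[1])
print(entropy, max_entropy)                # entropy ≈ 0.80, maximum ≈ 1.10
```

A database whose voices spread evenly over the subclasses approaches the maximum log(M), while a database concentrated in one subclass approaches zero, which is exactly the coverage signal used in step S24.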
Step S24: and determining the coverage of the voice database according to the single-factor information entropy.
Specifically, from the single-factor information entropies obtained for the voice database under each evaluation factor, the number of classification factors covered by the voices in the database and the size of the entropy value can be quickly determined, the coverage of all voices over the subclass factors of each evaluation factor can be judged, and it can then be evaluated whether the voices in the voice database meet the requirements, that is, whether the quality of the voice database is qualified. A qualified voice database needs to cover each subclass factor of each evaluation factor as comprehensively and evenly as possible, so as to guarantee the results of model training or other subsequent processing based on the database. Using information entropy to judge the classification factors covered by the voice database quantifies the uncertainty, provides a uniform standard for measuring each voice against each classification factor, turns abstract judgment information into something concrete, and allows the quality of the voice database to be evaluated directly and quickly.
Based on the same inventive concept, fig. 4 shows an unsupervised model training apparatus 100 for measuring coverage of a speech database, comprising: a data obtaining unit 110, configured to obtain training data, where the training data is speech; an evaluation factor determination unit 120, configured to determine one or more evaluation factors of the voice database coverage; a dividing unit 130, configured to divide the evaluation factor into an adjustable factor or an unadjustable factor based on whether the training data corresponding to the evaluation factor is controllable through parameter adjustment; an algorithm determining unit 140, configured to determine a clustering algorithm corresponding to each divided evaluation factor; a classification unit 150, configured to classify the training data through the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses; and a model training unit 160, configured to train the evaluation model according to the plurality of subclasses of each evaluation factor.
In an embodiment, the algorithm determination unit 140 is further configured to: when the evaluation factor is an unadjustable factor, determining the corresponding clustering algorithm as a distance-based clustering algorithm; and when the evaluation factor is an adjustable factor, determining the corresponding clustering algorithm as the self-adaptive training algorithm.
In an embodiment, the classification unit 150 is further configured to: when the evaluation factor is an unadjustable factor, extracting a feature vector of the training data; and according to the feature vector, adopting a clustering algorithm based on distance to divide the training data into a plurality of subclasses.
In one embodiment, the distance-based clustering algorithm is a K-means clustering algorithm.
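For unadjustable factors, the classification unit's behavior (extract feature vectors, partition them with distance-based K-means) can be sketched as follows. The feature extraction is stubbed with synthetic vectors standing in for utterance-level MFCC-derived embeddings; the two-cluster setup is an illustrative assumption, not part of the patent.

```python
# Minimal sketch: cluster utterance-level feature vectors with K-means to form
# subclasses for an unadjustable factor (e.g. speaker gender). Synthetic features
# stand in for real MFCC-derived embeddings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# two well-separated synthetic groups of utterance embeddings
feats = np.vstack([rng.normal(0.0, 0.5, size=(40, 8)),
                   rng.normal(3.0, 0.5, size=(40, 8))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(feats)
subclasses = kmeans.labels_          # subclass index assigned to each utterance
print(np.bincount(subclasses))       # cluster sizes
```

Each resulting cluster becomes one subclass of the evaluation factor, on which an evaluation model can then be trained.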
In an embodiment, the classification unit 150 is further configured to: when the evaluation factor is an adjustable factor, extracting a feature vector of the training data; training a Gaussian mixture model through the feature vectors, and labeling training data; and according to the labeled training data, dividing the training data into a plurality of subclasses.
In one embodiment, training the gaussian mixture model by the feature vector, and labeling the training data comprises: training a Gaussian mixture model through the feature vectors; determining control parameters according to the evaluation factors, wherein the control parameters can adjust and control training data; traversing all values of the control parameters, and transforming the training data; obtaining a parameter value when the feature vector of the transformed training data enables the likelihood of the Gaussian mixture model to be maximum; accumulating the likelihood according to the parameter values; converting the training data according to the parameter values to obtain new training data, and retraining until a stopping condition is reached; and taking the parameter value corresponding to each training data and enabling the Gaussian mixture model to have the maximum likelihood as the marking value of the training data.
In one embodiment, the stop condition includes: the number of iterations reaches a preset threshold, or the rate of change between the cumulative likelihood and the cumulative likelihood of the previous iteration is smaller than a preset threshold.
In an embodiment, the model training unit 160 is further configured to: respectively train one or more evaluation models for each piece of subclass data, or train one evaluation model for a plurality of pieces of subclass data as a whole.
In one embodiment, the factors for assessing voice database coverage include one or more of: gender of the speaker, age of the speaker, accent of the speaker, speed of speech of the speaker, pitch of the speaker, pronunciation factors, or content subject.
The present disclosure further provides a device for measuring coverage of a voice database, the device including: an evaluation model obtaining unit, configured to obtain an evaluation model of each evaluation factor by using the unsupervised model training method 10 for measuring coverage of a voice database according to any of the foregoing embodiments; a voice database obtaining unit, configured to obtain a voice database to be evaluated, where the voice database includes at least one voice; a detection unit, configured to detect each voice in the voice database through the evaluation model of the evaluation factor to obtain the single-factor information entropy of the voice database corresponding to the evaluation factor; and an evaluation unit, configured to determine the coverage of the voice database according to the single-factor information entropy.
Fig. 5 is a schematic block diagram illustrating an apparatus of any of the previous embodiments in accordance with an exemplary embodiment. For example, the apparatus 200 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, the apparatus 200 may include one or more of the following components: a processing component 202, a memory 204, a power component 206, a multimedia component 208, an audio component 210, an input/output (I/O) interface 212, a sensor component 214, and a communication component 216.
The processing component 202 generally controls overall operation of the device 200, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 202 may include one or more processors 220 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 202 can include one or more modules that facilitate interaction between the processing component 202 and other components. For example, the processing component 202 can include a multimedia module to facilitate interaction between the multimedia component 208 and the processing component 202.
The memory 204 is configured to store various types of data to support operations at the apparatus 200. Examples of such data include instructions for any application or method operating on the device 200, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 204 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 206 provides power to the various components of the device 200. The power components 206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 200.
The multimedia component 208 includes a screen that provides an output interface between the device 200 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 208 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 200 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 210 is configured to output and/or input audio signals. For example, audio component 210 includes a Microphone (MIC) configured to receive external audio signals when apparatus 200 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 204 or transmitted via the communication component 216. In some embodiments, audio component 210 also includes a speaker for outputting audio signals.
The I/O interface 212 provides an interface between the processing component 202 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 214 includes one or more sensors for providing various aspects of status assessment for the device 200. For example, the sensor assembly 214 may detect an open/closed state of the device 200, the relative positioning of components, such as a display and keypad of the device 200, the sensor assembly 214 may also detect a change in the position of the device 200 or a component of the device 200, the presence or absence of user contact with the device 200, the orientation or acceleration/deceleration of the device 200, and a change in the temperature of the device 200. The sensor assembly 214 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 216 is configured to facilitate wired or wireless communication between the apparatus 200 and other devices. The device 200 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 216 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 216 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 200 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as memory 204 comprising instructions, executable by processor 220 of apparatus 200 to perform the above-described method is also provided. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 6 is a block diagram illustrating an electronic device 300 according to an example embodiment. For example, the apparatus 300 may be provided as a server. Referring to FIG. 6, apparatus 300 includes a processing component 322 that further includes one or more processors and memory resources, represented by memory 342, for storing instructions, such as application programs, that are executable by processing component 322. The application programs stored in memory 342 may include one or more modules that each correspond to a set of instructions. Further, the processing component 322 is configured to execute instructions to perform the above-described methods.
The apparatus 300 may also include a power component 326 configured to perform power management of the apparatus 300, a wired or wireless network interface 350 configured to connect the apparatus 300 to a network, and an input/output (I/O) interface 358. The apparatus 300 may operate based on an operating system stored in the memory 342, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (22)

1. An unsupervised model training method for measuring speech database coverage, the method comprising:
acquiring training data, wherein the training data is voice;
determining one or more evaluation factors for voice database coverage;
dividing the evaluation factor into an adjustable factor or an unadjustable factor based on whether the training data is controllable by parameter adjustment corresponding to the evaluation factor;
determining a clustering algorithm corresponding to each divided evaluation factor;
classifying the training data respectively through the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses;
training an evaluation model according to the plurality of subclasses of each of the evaluation factors.
2. The unsupervised model training method for measuring coverage of a speech database according to claim 1, wherein the determining a clustering algorithm corresponding to each of the divided evaluation factors comprises:
if the evaluation factor is an unadjustable factor, determining that the corresponding clustering algorithm is a distance-based clustering algorithm;
and if the evaluation factor is an adjustable factor, determining that the corresponding clustering algorithm is the self-adaptive training algorithm.
3. The unsupervised model training method for measuring coverage of a speech database according to claim 2, wherein the classifying the training data by the clustering algorithm corresponding to each of the evaluation factors to obtain a plurality of subclasses comprises:
if the evaluation factor is an unadjustable factor, extracting a feature vector of the training data;
and according to the feature vector, dividing the training data into a plurality of subclasses by adopting the distance-based clustering algorithm.
4. The unsupervised model training method for measuring speech database coverage as in claim 3, wherein the distance-based clustering algorithm is a K-means clustering algorithm.
5. The unsupervised model training method for measuring coverage of a speech database according to claim 2, wherein the classifying the training data by the clustering algorithm corresponding to each of the evaluation factors to obtain a plurality of subclasses comprises:
if the evaluation factor is an adjustable factor, extracting a feature vector of the training data;
training a Gaussian mixture model through the feature vectors, and labeling the training data;
and according to the marked training data, dividing the training data into a plurality of subclasses.
6. The unsupervised model training method for measuring coverage of a speech database according to claim 5, wherein the training a Gaussian mixture model by the feature vector and labeling the training data comprises:
training a Gaussian mixture model through the feature vectors;
determining control parameters according to the evaluation factors, wherein the control parameters can adjust and control the training data;
traversing all values of the control parameters, and transforming the training data;
obtaining a parameter value when the feature vector of the transformed training data enables the likelihood of the Gaussian mixture model to be maximum;
accumulating likelihood according to the parameter values;
converting the training data according to the parameter values to obtain new training data, and retraining until a stopping condition is reached;
and taking the parameter value corresponding to each training data and enabling the Gaussian mixture model to have the maximum likelihood as the labeled value of the training data.
7. The unsupervised model training method for measuring speech database coverage of claim 6, wherein the stopping condition comprises: the number of iterations reaches a preset threshold, or the rate of change between the cumulative likelihood and the cumulative likelihood of the previous iteration is smaller than a preset threshold.
8. The unsupervised model training method for measuring speech database coverage of claim 1, wherein said training an assessment model according to said plurality of sub-classes of each of said assessment factors comprises: and respectively training one or more evaluation models for each piece of sub-data, or training one evaluation model for a plurality of pieces of sub-data as a whole.
9. The unsupervised model training method for measuring speech database coverage as in claim 1, wherein the evaluation factors for speech database coverage include one or more of: gender of the speaker, age of the speaker, accent of the speaker, speed of speech, pitch, language, capture device, capture environment, pronunciation factors, or content subject.
10. A method for measuring coverage of a voice database, the method comprising obtaining an evaluation model of each evaluation factor by using the unsupervised model training method for measuring coverage of a voice database according to any one of claims 1 to 9;
acquiring a voice database to be evaluated, wherein the voice database comprises at least one voice;
detecting each voice in the voice database through the evaluation model of the evaluation factor to obtain a single-factor information entropy of the voice database corresponding to the evaluation factor;
and determining the coverage degree of the voice database according to the single-factor information entropy.
11. An unsupervised model training apparatus for measuring speech database coverage, the apparatus comprising:
the data acquisition unit is used for acquiring training data, and the training data is voice;
the evaluation factor determining unit is used for determining one or more evaluation factors of the coverage of the voice database;
the dividing unit is used for dividing the evaluation factors into adjustable factors or non-adjustable factors based on whether the training data corresponding to the evaluation factors can be controlled through parameter adjustment;
the algorithm determining unit is used for determining a clustering algorithm corresponding to each divided evaluation factor;
the classification unit is used for classifying the training data through the clustering algorithm corresponding to each evaluation factor to obtain a plurality of subclasses;
and the model training unit is used for training an evaluation model according to the plurality of subclasses of each evaluation factor.
12. The unsupervised model training device for measuring speech database coverage of claim 11, wherein the algorithm determination unit is further configured to:
when the evaluation factor is an unadjustable factor, determining that the corresponding clustering algorithm is a distance-based clustering algorithm;
and when the evaluation factor is an adjustable factor, determining the corresponding clustering algorithm as the self-adaptive training algorithm.
13. The unsupervised model training device for measuring coverage of a speech database of claim 12, wherein the classification unit is further configured to:
when the evaluation factor is an unadjustable factor, extracting a feature vector of the training data;
and according to the feature vector, dividing the training data into a plurality of subclasses by adopting the distance-based clustering algorithm.
14. The unsupervised model training device for measuring speech database coverage of claim 13, wherein the distance-based clustering algorithm is a K-means clustering algorithm.
15. The unsupervised model training device for measuring coverage of a speech database of claim 12, wherein the classification unit is further configured to:
when the evaluation factor is an adjustable factor, extracting a feature vector of the training data;
training a Gaussian mixture model through the feature vectors, and labeling the training data;
and according to the marked training data, dividing the training data into a plurality of subclasses.
16. The unsupervised model training device for measuring coverage of a speech database according to claim 15, wherein said training a gaussian mixture model by using the feature vectors and labeling the training data comprises:
training a Gaussian mixture model through the feature vectors;
determining control parameters according to the evaluation factors, wherein the control parameters can adjust and control the training data;
traversing all values of the control parameters, and transforming the training data;
obtaining a parameter value when the feature vector of the transformed training data enables the likelihood of the Gaussian mixture model to be maximum;
accumulating likelihood according to the parameter values;
converting the training data according to the parameter values to obtain new training data, and retraining until a stopping condition is reached;
and taking the parameter value corresponding to each training data and enabling the Gaussian mixture model to have the maximum likelihood as the labeled value of the training data.
17. The unsupervised model training device for measuring speech database coverage of claim 16, wherein the stopping condition comprises: the number of iterations reaches a preset threshold, or the rate of change between the cumulative likelihood and the cumulative likelihood of the previous iteration is smaller than a preset threshold.
18. The unsupervised model training device for measuring speech database coverage of claim 11, wherein the model training unit is further configured to: respectively train one or more evaluation models for each piece of sub-data, or train one evaluation model for a plurality of pieces of sub-data as a whole.
19. The unsupervised model training device for measuring coverage of a speech database of claim 11, wherein the evaluation factors for coverage of the speech database include one or more of: gender of the speaker, age of the speaker, accent of the speaker, speed of speech, pitch, language, capture device, capture environment, pronunciation factors, or content subject.
20. An apparatus for measuring coverage of a voice database, the apparatus comprising an evaluation model obtaining unit, configured to obtain an evaluation model of each evaluation factor by using the unsupervised model training method for measuring coverage of a voice database according to any one of claims 1 to 9;
the system comprises a voice database acquisition unit, a voice database evaluation unit and a voice evaluation unit, wherein the voice database acquisition unit is used for acquiring a voice database to be evaluated, and the voice database comprises at least one voice;
the detection unit is used for detecting each voice in the voice database through the evaluation model of the evaluation factor to obtain the single-factor information entropy of the voice database corresponding to the evaluation factor;
and the evaluation unit is used for determining the coverage of the voice database according to the single-factor information entropy.
21. An electronic device, comprising:
a memory to store instructions; and
a processor for invoking the memory-stored instructions to perform the unsupervised model training method for measuring speech database coverage of any of claims 1-9.
22. A computer-readable storage medium having stored thereon instructions which, when executed by a processor, perform the unsupervised model training method for measuring speech database coverage of any of claims 1 to 9.
CN202010309303.7A 2020-04-20 2020-04-20 Unsupervised model training method and unsupervised model training device for measuring coverage of voice database Active CN111209429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010309303.7A CN111209429B (en) 2020-04-20 2020-04-20 Unsupervised model training method and unsupervised model training device for measuring coverage of voice database

Publications (2)

Publication Number Publication Date
CN111209429A true CN111209429A (en) 2020-05-29
CN111209429B CN111209429B (en) 2020-07-28

Family

ID=70787759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010309303.7A Active CN111209429B (en) 2020-04-20 2020-04-20 Unsupervised model training method and unsupervised model training device for measuring coverage of voice database

Country Status (1)

Country Link
CN (1) CN111209429B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862934A (en) * 2020-07-24 2020-10-30 苏州思必驰信息科技有限公司 Method for improving speech synthesis model and speech synthesis method and device
CN114360523A (en) * 2022-03-21 2022-04-15 深圳亿智时代科技有限公司 Keyword dataset acquisition and model training methods, devices, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100020980A1 (en) * 2008-07-22 2010-01-28 Samsung Electronics Co., Ltd Apparatus and method for removing noise
CN107301858A (en) * 2017-05-31 2017-10-27 华南理工大学 Audio frequency classification method based on audio feature space hierarchical description
CN109830246A (en) * 2019-01-25 2019-05-31 北京海天瑞声科技股份有限公司 Audio quality appraisal procedure, device, electronic equipment and storage medium
CN111008299A (en) * 2020-03-11 2020-04-14 北京海天瑞声科技股份有限公司 Quality evaluation method and device of voice database and computer storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ABRAHAM WOUBIE et al.: "Using Voice-quality Measurements with Prosodic and Spectral Features for Speaker Diarization", INTERSPEECH 2015 *
SONG Jing et al.: "Construction of an Emotional Speech Database Based on the Fuzzy Comprehensive Evaluation Method", Modern Electronics Technique *

Also Published As

Publication number Publication date
CN111209429B (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN111460150B (en) Classification model training method, classification method, device and storage medium
CN109871896B (en) Data classification method and device, electronic equipment and storage medium
CN111583907A (en) Information processing method, device and storage medium
CN111210844B (en) Method, device and equipment for determining speech emotion recognition model and storage medium
CN110781305A (en) Text classification method and device based on classification model and model training method
CN111428032B (en) Content quality evaluation method and device, electronic equipment and storage medium
CN111160448A (en) Training method and device for image classification model
CN110648656A (en) Voice endpoint detection method and device, electronic equipment and storage medium
CN112150457A (en) Video detection method, device and computer readable storage medium
CN110889489A (en) Neural network training method, image recognition method and device
CN111209429B (en) Unsupervised model training method and unsupervised model training device for measuring coverage of voice database
CN111583919A (en) Information processing method, device and storage medium
CN111862995A (en) Code rate determination model training method, code rate determination method and device
CN109819288A (en) Determination method, apparatus, electronic equipment and the storage medium of advertisement dispensing video
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
CN112464083A (en) Model training method, work pushing method, device, electronic equipment and storage medium
CN111814538B (en) Method and device for identifying category of target object, electronic equipment and storage medium
CN109409414B (en) Sample image determines method and apparatus, electronic equipment and storage medium
CN112884040B (en) Training sample data optimization method, system, storage medium and electronic equipment
CN109102813B (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN112712385B (en) Advertisement recommendation method and device, electronic equipment and storage medium
CN109102812B (en) Voiceprint recognition method and system and electronic equipment
CN114722238B (en) Video recommendation method and device, electronic equipment, storage medium and program product
CN112308588A (en) Advertisement putting method and device and storage medium
CN112863499B (en) Speech recognition method and device, storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant