CN115273832B - Training method of wake optimization model, wake optimization method and related equipment - Google Patents
- Publication number: CN115273832B
- Application number: CN202211158719.9A
- Authority: CN (China)
- Prior art keywords: model, embedding, awakening, classification model, template
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications (G10L — speech analysis, synthesis, recognition and processing)
- G10L15/063 — Speech recognition; training (creation of reference templates, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/065, G10L15/07 — Speech recognition; adaptation to the speaker
- G10L15/08 — Speech recognition; speech classification or search
Abstract
The present application relates to the field of speech recognition technologies, and in particular, to a training method for a wake-up optimization model, a wake-up optimization method, and related devices. A classification model and an embedding model are trained, and the template of the embedding model is set from the weights of the classification model, so that the trained embedding model clusters the wake-up words more tightly. After the two models are trained, they are deployed on the terminal device; after the terminal is activated, the classification model is first mainly used to judge whether to wake the terminal, and the decision then gradually transitions to the embedding model as a user template is generated. After the user template is obtained, the relevant parameters of the current wake-up voice are calculated on each wake-up, and whether to wake the terminal is judged according to the user template. The wake-up effect can thus be adaptively optimized through the user's continued use, each user obtains a consistent experience, and the insufficient adaptability of a single model to different scenarios is effectively overcome.
Description
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a training method for a wake-up optimization model, a wake-up optimization method, and related devices.
Background
When a wake-up word or command word model is applied, the user's voice is detected in real time, and feedback is given when a specific word is detected. In practice the wake-up model is generally trained in advance, but the pronunciation habits of end users differ, so a consistent experience for every user cannot be guaranteed. The usual remedy is to add as many types of positive sample data as possible to the training data to improve the model's recognition of different accents and scenarios, but this requires too much data, corpus collection is expensive, and model training takes too long. Although this method can improve the model, it cannot exhaust all accent data, and therefore cannot fundamentally solve the problem of inconsistent experience across users.
Disclosure of Invention
The main objective of the present application is to provide a training method for a wake-up optimization model, a wake-up optimization method and related devices, which aim to solve the problem in the prior art that a voice wake-up effect cannot be adaptively optimized according to different users.
In order to achieve the above object, the present application provides a training method for a wake-up optimization model, including:
acquiring annotation data, wherein the annotation data comprises a positive sample and a negative sample;
training a classification model by using the annotation data to obtain a template of the embedding model, wherein the template of the embedding model is the first column of the weights of the penultimate layer of the classification model;
training the embedding model by using the annotation data and the template of the embedding model;
and obtaining an awakening optimization model according to the classification model and the embedding model.
The application also provides a method for awakening optimization, which comprises the following steps:
when the terminal is detected to be activated and voice is received, inputting the voice into a classification model, and judging whether to awaken the terminal or not according to the output of the classification model and a first awakening threshold value;
if the terminal is awakened successfully, extracting an embedding vector of the voice by using an embedding model;
when the number of times the terminal is successfully awakened reaches a designated number of times, averaging the embedding vectors corresponding to the designated number of times to obtain a user-specific template;
after the user-specific template is obtained, reducing the awakening threshold of the classification model to a second awakening threshold;
when voice is received, judging whether the classification model is awakened or not according to the output of the classification model and the second awakening threshold value;
and after the classification model is successfully awakened, calculating the smoothing coefficient and the final judgment score of the current awakening, and judging whether to awaken the terminal according to the smoothing coefficient and the final judgment score of the current awakening.
The application also provides a training device for a wake-up optimization model, the device including:
the data acquisition module is used for acquiring marking data, and the marking data comprises positive samples and negative samples;
a classification model training module, configured to train the classification model by using the annotation data to obtain a template of the embedding model, wherein the template of the embedding model is the first column of the weights of the penultimate layer of the classification model;
an embedding model training module, configured to train the embedding model by using the annotation data and the template of the embedding model;
and the awakening optimization model generation module is used for obtaining an awakening optimization model according to the classification model and the embedding model.
The present application further provides a device for wake-up optimization, the device comprising:
the first terminal awakening module is used for inputting the voice into the classification model when detecting that the terminal is activated and receiving the voice, and judging whether to awaken the terminal or not according to the output of the classification model and a first awakening threshold value;
the user template determining module is used for extracting the embedding vector of the voice by using an embedding model if the terminal is awakened successfully; when the number of times of successful awakening of the terminal reaches the designated number of times, calculating the average of the embedding vectors corresponding to the designated number of times to obtain a user specific template;
The awakening threshold adjusting module is used for reducing the awakening threshold of the classification model to be a second awakening threshold after the user specific template is obtained;
the classification model awakening module is used for judging whether to awaken the classification model or not according to the output of the classification model and the second awakening threshold value when receiving the voice;
and the second terminal awakening module is used for calculating a smooth coefficient and a final judgment score of the current awakening after the classification model is awakened successfully, and judging whether to awaken the terminal according to the smooth coefficient and the final judgment score of the current awakening.
The present application further provides a computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of any one of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any of the above.
The application provides a training method for a wake-up optimization model, a wake-up optimization method and related equipment. A classification model and an embedding model are trained, and the template of the embedding model is set from the weights of the classification model, so that the trained embedding model clusters the wake-up words more tightly and the interval between wake-up words and non-wake-up words is larger. After the two models are trained, they are deployed on the terminal equipment; after the end user activates the terminal, the classification model is first mainly used to judge whether to wake the terminal, and the decision then gradually transitions to the embedding model as a user template is generated. After the user template is obtained, the relevant parameters of the current wake-up voice are calculated on each wake-up, and whether to wake the terminal is judged according to the user template. The wake-up effect can thus be adaptively optimized through the user's continued use, each user obtains a consistent experience, and the insufficient adaptability of a single model to different scenarios is effectively overcome.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a training method for a wake optimization model according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating steps of a wake-up optimization method according to an embodiment of the present application;
FIG. 3 is a block diagram of an overall structure of a training apparatus for waking up an optimization model according to an embodiment of the present application;
FIG. 4 is a block diagram of the overall structure of a wake-up optimized device according to an embodiment of the present application;
fig. 5 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a training method for a wake-up optimization model, including steps S1 to S4, specifically:
s1, obtaining annotation data, wherein the annotation data comprises a positive sample and a negative sample.
Specifically, for step S1, pre-entered annotation data is obtained during training of the system, where the annotation data includes positive samples, negative samples, and the text corresponding to each audio clip. A positive sample is audio data containing the wake-up word; the negative samples include the AISHELL corpus and the DSN-Challenge noise corpus. The AISHELL corpus is recorded by 400 speakers from different accent areas in China; its content covers finance, science and technology, sports, entertainment and current-affairs news, and it is a basic database designed for artificial-intelligence Mandarin speech recognition. Using the AISHELL corpus as accent data for training enables the trained model to recognize and handle different accents, improving the wake-up experience of users with different accents.
S2, training the classification model by using the annotation data to obtain the template c of the embedding model, wherein the template c is the first column of the weights of the penultimate layer of the classification model.
Specifically, for step S2, the classification network in the classification model uses TC-ResNet; other networks such as TDNN or RNN-Attention can also be used. The dimension of the penultimate layer of the classification model must be consistent with the output dimension of the embedding model, for example 48; the hyper-parameters of the other layers are not limited and are chosen according to the training effect. After the classification model training is finished, the template c of the embedding model can be obtained from its weights. For a word in the input audio data, the classification model has two output nodes, namely a keyword node and an unknown node. The keyword node carries the decision score for the word, a probability value between 0 and 1, from which it can be judged whether the word is the wake-up word to be recognized. The unknown node covers audio content the classification model fails to identify.
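As an illustration of how the template is taken from the trained classifier, the following sketch extracts the first column of the penultimate-layer weight matrix. The 48-dimensional layer and the two output nodes follow the example in the text; the weight values themselves are random stand-ins, since no trained model is available here.

```python
import numpy as np

# Stand-in for the trained classification model's penultimate-layer weight
# matrix, shape (embedding_dim, num_output_nodes): 48-dim layer, 2 nodes
# (index 0 = keyword node, index 1 = unknown node).
rng = np.random.default_rng(0)
W = rng.standard_normal((48, 2))

# Template c of the embedding model: the first column of these weights,
# i.e. the weight vector feeding the keyword output node.
c = W[:, 0]
print(c.shape)  # (48,)
```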
S3, training the embedding model by using the annotation data and the template c.
S4, obtaining the wake-up optimization model according to the classification model and the embedding model.
Specifically, for steps S3 and S4, the embedding model is trained by using the annotation data and the template c. Compared with the conventional method of obtaining the embedding template as the average embedding of the positive samples of the training set, experiments prove that the embedding model trained in this way clusters the wake-up words more tightly, with a larger interval between wake-up words and non-wake-up words. Embedding maps a high-dimensional raw-data object (an image, a sentence, a word, a commodity, a movie, etc.) onto a low-dimensional manifold, representing the object as a low-dimensional vector. Low-dimensional embedding vectors have the property that vectors close to each other correspond to objects with similar meanings, so representing objects in the embedding space can reveal latent relations among them. The embedding model places the embedding vectors of similar pronunciations close together in the embedding space; therefore, a user-specific template obtained from the embedding model can better represent the user's pronunciation habits for the wake-up word. After the classification model and the embedding model are trained, the wake-up optimization model is obtained.
In an embodiment, after obtaining the annotation data, the method includes:
S101, selecting a fixed length according to the length range of the positive samples;
S102, adjusting the length of the annotation data to the fixed length, and determining the number of frames;
S103, extracting features from the annotation data at the fixed length to obtain data of size (number of frames × feature dimension), used as the model input data.
Specifically, for steps S101, S102, and S103, a fixed length is selected for all audio data according to the length distribution of the positive samples, and the number of frames is determined; for example, a fixed length of 1.5 s corresponds to 151 frames. Features are then extracted from the annotation data to obtain data of size (number of frames × feature dimension), used as the model input data: for 1.5 s of audio, the model input size is 151 × 40. The feature dimension is the number of features per frame.
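The fixed-length framing above can be sketched as follows. The 10 ms hop and 40 filterbank coefficients are assumed values not stated in the text, chosen because they reproduce the 151-frame, 151 × 40 example:

```python
import numpy as np

HOP_LEN = 0.010    # 10 ms frame hop (assumed)
FIXED_LEN = 1.5    # fixed clip length chosen from the positive-sample range
FEAT_DIM = 40      # e.g. 40 log-mel filterbank coefficients (assumed)

# Number of frames for a 1.5 s clip with a 10 ms hop: 1 + 1.5/0.01 = 151.
# round() guards against floating-point error in the division.
num_frames = 1 + round(FIXED_LEN / HOP_LEN)
print(num_frames)  # 151

# Model input for one clip is a (frames x feature-dim) matrix, here 151 x 40.
features = np.zeros((num_frames, FEAT_DIM))
print(features.shape)  # (151, 40)
```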
In one embodiment, the step of training the classification model with the labeling data includes:
s201, inputting the model input data into a classification model to obtain the output of the classification model;
s202, calculating a Loss function according to the output of the classification model and a preset target value, and optimizing the parameters of the classification model according to the Loss function.
Specifically, for steps S201 and S202, the classification model has two output nodes, a keyword node and an unknown node; if the model input data contains several words, the classification model outputs a corresponding keyword node for each. The keyword node carries the classification model's decision score for each word in the model input data, a probability value between 0 and 1, from which it can be judged whether the word is the wake-up word to be recognized. The unknown node covers audio content the classification model fails to identify. The Loss function describes the difference between the model's prediction and the true value and guides the model toward convergence during training. The Loss function used here is the cross entropy, which quantifies the difference between two probability distributions and is commonly used in classification. Using the cross entropy together with the sigmoid function avoids the slowdown in learning that the mean-square-error loss suffers as gradients shrink, because the learning rate is then controlled by the output error. The Loss is calculated from the output of the classification model and a preset target value, and the parameters of the classification model are updated by back-propagating the Loss.
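A minimal sketch of the cross-entropy computation over the two output nodes; the probability values are made up for illustration:

```python
import numpy as np

def cross_entropy(probs, target_idx):
    """Cross entropy between a predicted distribution and a one-hot target:
    -log of the probability assigned to the correct class."""
    return -np.log(probs[target_idx])

# Two output nodes: index 0 = keyword, index 1 = unknown.
probs = np.array([0.9, 0.1])              # illustrative model output
loss_positive = cross_entropy(probs, 0)   # positive sample: target is keyword
loss_negative = cross_entropy(probs, 1)   # negative sample: target is unknown
print(loss_positive, loss_negative)       # small loss vs. large loss
```

With these outputs, a positive sample is penalized little (the model is confident in the keyword) while a negative sample with the same output is penalized heavily, which is exactly the gradient signal that pushes the parameters toward convergence.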
In one embodiment, the above step S3 of training the embedding model by using the annotation data and the template c comprises the following steps:
s301, inputting the model input data into an embedding model, and obtaining an embedding vector according to the weight of the classification model.
S302, calculating the embedding vector and the template of the embedding modelAnd optimizing parameters of the embedding model according to the cosine similarity.
Specifically, for steps S301 and S302, the embedding model maps the input feature sequence into an embedding feature space; for example, audio data of size 151 × 40, after passing through the embedding model, yields a 48-dimensional embedding vector. The dimensionality of the embedding vector must match the dimensionality of the classification model weights from which the template is taken. The cosine similarity score is calculated as S = w · cos(e, c) + b, where c is the embedding template, here the template taken from the classification model weights; e is the embedding vector of the current wake-up voice; and w and b are trainable parameters. The parameters of the embedding model are updated by back-propagating this similarity, so that embedding vectors with similar cosine values are gathered together in the embedding space, i.e. the wake-up words cluster together. Experiments prove that the embedding model trained in this way clusters the wake-up words more tightly, with a larger interval between wake-up words and non-wake-up words.
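The scaled cosine similarity can be sketched as follows, assuming the common form S = w·cos(e, c) + b suggested by the description of w and b as trainable parameters (in training they would be learned; fixed values are used here for illustration):

```python
import numpy as np

def scored_cosine(e, c, w=1.0, b=0.0):
    """Scaled cosine similarity S = w * cos(e, c) + b; w and b are trainable
    in the model, fixed here for illustration."""
    cos = np.dot(e, c) / (np.linalg.norm(e) * np.linalg.norm(c))
    return w * cos + b

e = np.array([1.0, 0.0, 1.0])  # embedding vector of the current wake-up voice
c = np.array([2.0, 0.0, 2.0])  # embedding template (same direction as e)
print(scored_cosine(e, c))     # identical directions -> cosine of 1.0
```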
Referring to fig. 2, an embodiment of the present application provides a wake-up optimization method, including steps A1 to A6, specifically:
a1, when detecting that a terminal is activated and receiving voice, inputting the voice into a classification model, and judging whether to awaken the terminal according to the output of the classification model and a first awakening threshold value;
a2, if the terminal is awakened successfully, extracting an embedding vector of the voice by using an embedding model;
a3, when the number of times of successful awakening of the terminal reaches the specified number of times, calculating the average of the embedding vectors corresponding to the specified number of times to obtain the user specific template;
A4, after the user specific template is obtained, reducing the awakening threshold of the classification model to be a second awakening threshold;
a5, when voice is received, judging whether the classification model is awakened or not according to the output of the classification model and the second awakening threshold;
and A6, after the classification model is successfully awakened, calculating a smoothing coefficient and a final judgment score of the current awakening, and judging whether to awaken the terminal according to the smoothing coefficient and the final judgment score of the current awakening.
Specifically, for step A1, after the terminal is activated, the classification model is used to determine whether to wake up the terminal. And setting the awakening threshold value at the moment as a first threshold value, obtaining corresponding output by the classification model when receiving voice, and awakening the terminal when the output is greater than the first threshold value.
Specifically, for steps A2 and A3, each time the terminal is successfully awakened according to the output of the classification model, the embedding model extracts the embedding vector of the current speech; this vector represents the user's pronunciation habits for the wake-up word. Since the output of the classification model is not necessarily greater than the first threshold each time speech is received, the terminal may sometimes not be awakened. Such speech cannot be used for obtaining the user-specific template ĉ, so the embedding model does not extract its embedding vector, which avoids unnecessary computation; only speech that successfully wakes the terminal contributes an embedding vector to the calculation of ĉ. When the number of successful wake-ups reaches the specified number, the average of the embedding vectors from those wake-ups is computed to obtain the user-specific template ĉ. A user template obtained in this way expresses the pronunciation habits of individual users well; compared with the conventional method of adding many types of positive sample data to improve the model's recognition of different accents and scenarios, it greatly reduces the amount of training data required, yields a template better suited to the current user, and gives users with different pronunciation habits a good wake-up experience.
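The template construction described above amounts to a simple element-wise average. A minimal sketch, with the number of required wake-ups and the 48-dimensional random embeddings assumed purely for illustration:

```python
import numpy as np

N_WAKEUPS = 5  # assumed "designated number of times"
rng = np.random.default_rng(1)

# Embedding vectors collected from the first N successful wake-ups (48-dim,
# random stand-ins for vectors the embedding model would actually produce).
embeddings = [rng.standard_normal(48) for _ in range(N_WAKEUPS)]

# User-specific template: the element-wise average of the collected vectors.
user_template = np.mean(embeddings, axis=0)
print(user_template.shape)  # (48,)
```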
Specifically, for step A4, after the user-specific template ĉ is obtained, the classification model and ĉ jointly determine whether to wake the terminal. The classification model is then no longer the sole basis for the decision, so its awakening threshold is reduced to the second awakening threshold; that is, the classification model becomes easier to wake, and the user-specific template ĉ subsequently makes the further determination.
Specifically, for step A5, when the speech is received, if the output of the classification model is greater than the second wake-up threshold, the classification model is woken up, that is, as long as the probability that the speech includes a wake-up word is greater than the second threshold, the classification model is woken up. And after the classification model is awakened, judging whether to awaken the terminal or not according to the smoothing coefficient of the current awakening and the calculation result of the final judgment score.
Specifically, for step A6, after the classification model is successfully awakened, the smoothing coefficient is calculated, and the final decision score is calculated from it. The user-specific template ĉ is one of the parameters of the final decision score. When the calculated final decision score exceeds a third threshold, the terminal is awakened. This step realizes the decision of whether to wake the terminal according to the user's pronunciation habits.
In an embodiment, the step of calculating the smoothing factor and the final decision score of the current wake-up includes:
a101, calculating a smoothing coefficient: a1=0.9 × n/(1 + β × n), where β is an adjustable parameter and n is the current number of awakenings;
A102, calculating the final decision score from: the decision score p of the classification model for the current wake-up; the smoothing coefficient a1; the cosine similarity between the embedding vector e of the current wake-up speech and the template c of the embedding model; and the cosine similarity between e and the user-specific template ĉ.
Specifically, for steps A101 and A102, the smoothing coefficient is a1 = 0.9 × n / (1 + β × n), where β is an adjustable parameter selected according to the estimated user wake-up frequency, and n is the current number of wake-ups; after the user template ĉ is obtained, counting restarts from zero and is not superimposed on the earlier count during which only the classification model decided whether to wake the terminal. The final decision score is computed from the decision score p of the classification model for the current wake-up, the smoothing coefficient a1, the cosine similarity between the embedding vector of the current wake-up speech and the template c of the embedding model, and the cosine similarity between that embedding vector and the user-specific template ĉ. The cosine similarity is computed in the same way as during training of the embedding model.
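A sketch of this decision computation. The smoothing-coefficient formula follows the text; the way the three quantities are combined into the final score is an assumption (the exact formula is not recoverable from the translation), shown here as a convex mix of the two cosine similarities gated by the classification score, with β = 1 assumed so that a1 rises from 0 toward 0.9 as wake-ups accumulate:

```python
import numpy as np

def cosine(a, b):
    """Plain cosine similarity (trainable scale/offset omitted for clarity)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def smoothing_coefficient(n, beta=1.0):
    """a1 = 0.9 * n / (1 + beta * n); beta = 1 is an assumed value, giving
    a1 in [0, 0.9) so the user template gains weight with each wake-up."""
    return 0.9 * n / (1.0 + beta * n)

def final_score(p_cls, a1, e, c_model, c_user):
    """ASSUMED combination: classification score gated by a mix of the two
    cosine similarities, shifting from the model template c to the
    user-specific template as a1 grows."""
    return p_cls * ((1.0 - a1) * cosine(e, c_model) + a1 * cosine(e, c_user))

print(smoothing_coefficient(1))  # 0.45 on the first counted wake-up
```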
Referring to fig. 3, a block diagram of a training apparatus for a wake-up optimization model in an embodiment of the present application is shown, where the apparatus includes:
a data obtaining module 100, configured to obtain annotation data, where the annotation data includes a positive sample and a negative sample;
a classification model training module 200, configured to train the classification model with the annotation data to obtain the template c of the embedding model, wherein the template c is the first column of the weights of the penultimate layer of the classification model;
an embedding model training module 300, configured to train the embedding model with the annotation data and the template c;
and the awakening optimization model generation module 400 is used for obtaining an awakening optimization model according to the classification model and the embedding model.
In an embodiment, the training apparatus for the wake-up optimization model further includes:
the model input data adjusting module is used for selecting a fixed length according to the length range of the positive sample; adjusting the length of the labeled data according to the fixed length, and determining the frame number; and extracting features from the labeled data according to the fixed length to obtain data of the frame number and the feature dimension size, and using the data as model input data.
In an embodiment, the training apparatus for the wake-up optimization model further includes:
the classification model training submodule is used for inputting the model input data into a classification model to obtain the output of the classification model; and calculating a Loss function according to the output of the classification model and a preset target value, and optimizing the parameters of the classification model according to the Loss function.
In an embodiment, the training apparatus for the wake-up optimization model further includes:
the embedding model training sub-module is used for inputting the model input data into an embedding model and obtaining an embedding vector according to the weight of the classification model; calculating the embedding vector and the template of the embedding modelAnd optimizing parameters of the embedding model according to the cosine similarity.
Referring to fig. 4, which is a block diagram of a wake-up optimization apparatus in an embodiment of the present application, the apparatus includes:
a first terminal awakening module 500, configured to, when detecting that the terminal is activated and receiving a voice, input the voice into a classification model, and determine whether to awaken the terminal according to an output of the classification model and a first awakening threshold;
a user template determining module 600, configured to, if the terminal is successfully awakened, extract an embedding vector of the voice by using an embedding model; when the number of times of successful awakening of the terminal reaches the designated number of times, calculating the average of the embedding vectors corresponding to the designated number of times to obtain a user specific template;
A wake-up threshold adjustment module 700, configured to reduce the wake-up threshold of the classification model to a second wake-up threshold after obtaining the user-specific template;
a classification model awakening module 800, configured to determine whether to awaken the classification model according to the output of the classification model and the second awakening threshold when receiving the voice;
and a second terminal awakening module 900, configured to calculate a smoothing coefficient and a final decision score of the current awakening after the classification model is successfully awakened, and determine whether to awaken the terminal according to the smoothing coefficient and the final decision score of the current awakening.
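The five modules above describe a staged decision flow: the classifier alone gates wake-ups at first, successful wake-up embeddings are averaged into a user-specific template, and the classifier threshold is then lowered. A condensed sketch follows; the threshold values and required wake count are illustrative assumptions, and the template-stage decision is simplified here to the lowered classifier threshold (the full smoothing-coefficient scoring is omitted):

```python
import numpy as np

class WakeOptimizer:
    """Sketch of the staged wake flow. first_thr/second_thr/n_required
    are illustrative assumptions, not values from the patent."""
    def __init__(self, first_thr=0.8, second_thr=0.5, n_required=5):
        self.first_thr = first_thr       # strict threshold before template exists
        self.second_thr = second_thr     # lowered threshold afterwards
        self.n_required = n_required     # wake-ups needed to form the template
        self.embeddings = []
        self.user_template = None

    def on_wake_attempt(self, cls_score, embedding):
        if self.user_template is None:
            if cls_score < self.first_thr:
                return False             # classifier alone decides at this stage
            self.embeddings.append(embedding)
            if len(self.embeddings) >= self.n_required:
                # average the stored embeddings into the user-specific template
                self.user_template = np.mean(self.embeddings, axis=0)
            return True
        # template stage: the lowered threshold gates the classifier first;
        # the final decision would additionally use the smoothed similarity score
        return cls_score >= self.second_thr
```

The design intent, per the summary, is that the strict classifier carries the early stage while the user template accumulates, after which the cheaper-to-pass classifier check hands off to template matching.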
In an embodiment, the wake-up optimization apparatus further includes:
a calculation module, used for calculating the smoothing coefficient a1 = 0.9 × n / (1 + β × n), where β is an adjustable parameter and n is the current number of awakenings; and for calculating the final decision score according to the decision score of the classification model for the current awakening, the smoothing coefficient a1, the cosine similarity between the embedding vector of the current awakening voice and the template T of the embedding model, and the cosine similarity between the embedding vector and the user-specific template.
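The smoothing coefficient can be computed directly from the formula above. The default β below is an assumption; with β = 1 the coefficient rises from 0.45 at n = 1 toward an upper bound of 0.9 as the wake count grows, so later terms in the final score gain weight with use:

```python
def smoothing_coefficient(n, beta=1.0):
    """a1 = 0.9*n / (1 + beta*n), per the description. beta is an
    adjustable parameter and n is the current number of awakenings;
    the default beta here is an assumption for illustration."""
    return 0.9 * n / (1.0 + beta * n)
```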
Referring to fig. 5, an embodiment of the present application also provides a computer device, which may be a server and whose internal structure may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored on the non-volatile storage medium. The database of the computer device is used to store operation data of the training method of the wake-up optimization model, operation data of the wake-up optimization method, and the like. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements the method for adaptive optimization of the wake-up effect of any of the above embodiments.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is only a block diagram of some of the structures associated with the present solution and is not intended to limit the scope of the present solution as applied to computer devices.
An embodiment of the present application further provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements a method for wake effect adaptive optimization. It is to be understood that the computer readable storage medium in this embodiment may be a volatile readable storage medium or a non-volatile readable storage medium.
The application provides a training method for a wake-up optimization model, a wake-up optimization method, and related equipment. A classification model and an embedding model are trained, and the template of the embedding model is set from the weights of the classification model, so that the trained embedding model clusters wake-up words more tightly and enlarges the margin between wake-up words and non-wake-up words. After training, both models are deployed on the terminal device. Once the end user activates the device, the classification model is used first to decide whether to wake the terminal, and the system gradually transitions to the embedding model to generate a user template. After the user template is obtained, the relevant parameters of the current wake-up voice are calculated at each wake-up, and whether to wake the terminal is decided against the user template. The wake-up effect is thus optimized adaptively through continued use, every user obtains a consistent experience, and the insufficient adaptability of a single model to different scenarios is effectively addressed.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium provided herein and used in the examples may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (SSRDRAM), enhanced SDRAM (ESDRAM), synchronous Link (Synchlink) DRAM (SLDRAM), rambus (Rambus) direct RAM (RDRAM), direct bused dynamic RAM (DRDRAM), and bused dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another identical element in the process, apparatus, article, or method that comprises the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all the equivalent structures or equivalent processes that can be directly or indirectly applied to other related technical fields by using the contents of the specification and the drawings of the present application are also included in the scope of the present application.
Claims (10)
1. A training method for awakening an optimization model is characterized by comprising the following steps:
acquiring annotation data, wherein the annotation data comprises a positive sample and a negative sample;
training a classification model by using the labeled data to obtain a template T of the embedding model, wherein the template T of the embedding model is a first column of the weights of a penultimate layer output by the classification model;
training an embedding model by using the labeling data and a first column of the weight of the penultimate layer output by the classification model;
and obtaining an awakening optimization model according to the classification model and the embedding model.
2. The method for training the wake-up optimization model according to claim 1, wherein after the obtaining the annotation data, the method comprises:
selecting a fixed length according to the length range of the positive sample;
adjusting the length of the labeled data according to the fixed length, and determining the frame number;
and extracting features from the labeled data according to the fixed length to obtain data of the frame number and the feature dimension size, and using the data as model input data.
3. The method for training the wake-up optimization model according to claim 2, wherein the training the classification model with the labeled data comprises:
inputting the model input data into a classification model to obtain the output of the classification model;
and calculating a loss function according to the output of the classification model and a preset target value, and optimizing the parameters of the classification model according to the loss function.
4. The method for training the wake-up optimization model of claim 2, wherein the training an embedding model by using the labeling data and the template T of the embedding model comprises:
inputting the model input data into an embedding model to obtain an embedding vector;
5. A method of wake optimization, comprising:
when the terminal is detected to be activated and voice is received, inputting the voice into a classification model, and judging whether to awaken the terminal or not according to the output of the classification model and a first awakening threshold value;
if the terminal is awakened successfully, extracting an embedding vector of the voice by using an embedding model;
when the number of times of successful awakening of the terminal reaches a designated number of times, calculating the average of the embedding vectors corresponding to the designated number of times to obtain a user-specific template;
after the user-specific template is obtained, reducing the awakening threshold of the classification model to a second awakening threshold;
when voice is received, judging whether the classification model is awakened or not according to the output of the classification model and the second awakening threshold value;
and after the classification model is successfully awakened, calculating the smoothing coefficient and the final judgment score of the current awakening, and judging whether to awaken the terminal according to the smoothing coefficient and the final judgment score of the current awakening.
6. The wake-up optimization method according to claim 5, wherein the calculating the smoothing factor and the final decision score for the current wake-up comprises:
calculating a smoothing coefficient: a1=0.9 × n/(1 + β × n), where β is an adjustable parameter and n is the number of current awakenings;
calculating a final decision score according to the decision score of the classification model for the current awakening, the smoothing coefficient a1, the cosine similarity between the embedding vector of the current awakening voice and the template T of the embedding model, and the cosine similarity between the embedding vector and the user-specific template.
7. A training apparatus for waking up an optimization model, the apparatus comprising:
the data acquisition module is used for acquiring marking data, and the marking data comprises positive samples and negative samples;
a classification model training module, used for training a classification model by using the labeling data to obtain a template T of the embedding model, wherein the template T of the embedding model is a first column of the weights of a penultimate layer output by the classification model;
the embedding model training module is used for training an embedding model by using the labeling data and a first column of the weight of the penultimate layer output by the classification model;
and the awakening optimization model generation module is used for obtaining an awakening optimization model according to the classification model and the embedding model.
8. An apparatus for wake optimization, the apparatus comprising:
the first terminal awakening module is used for inputting the voice into the classification model when detecting that the terminal is activated and receiving the voice, and judging whether to awaken the terminal or not according to the output of the classification model and a first awakening threshold value;
the user template determining module is used for extracting the embedding vector of the voice by using the embedding model if the terminal is awakened successfully; and, when the number of times of successful awakening of the terminal reaches a designated number of times, calculating the average of the embedding vectors corresponding to the designated number of times to obtain a user-specific template;
the awakening threshold adjusting module is used for reducing the awakening threshold of the classification model to a second awakening threshold after the user-specific template is obtained;
the classification model awakening module is used for judging whether to awaken the classification model or not according to the output of the classification model and the second awakening threshold value when receiving the voice;
and the second terminal awakening module is used for calculating the smoothing coefficient and the final decision score of the current awakening after the classification model is awakened successfully, and determining whether to awaken the terminal according to the smoothing coefficient and the final decision score of the current awakening.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211158719.9A CN115273832B (en) | 2022-09-22 | 2022-09-22 | Training method of wake optimization model, wake optimization method and related equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115273832A CN115273832A (en) | 2022-11-01 |
CN115273832B true CN115273832B (en) | 2023-02-28 |
Family
ID=83756079
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211158719.9A Active CN115273832B (en) | 2022-09-22 | 2022-09-22 | Training method of wake optimization model, wake optimization method and related equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115273832B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111933124A (en) * | 2020-09-18 | 2020-11-13 | 电子科技大学 | Keyword detection method capable of supporting self-defined awakening words |
CN114360521A (en) * | 2022-03-09 | 2022-04-15 | 深圳市友杰智新科技有限公司 | Training method of voice recognition model, and detection method and equipment of voice false recognition |
CN114360522A (en) * | 2022-03-09 | 2022-04-15 | 深圳市友杰智新科技有限公司 | Training method of voice awakening model, and detection method and equipment of voice false awakening |
CN114420098A (en) * | 2022-01-20 | 2022-04-29 | 思必驰科技股份有限公司 | Wake-up word detection model training method, electronic device and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11749267B2 (en) * | 2020-11-20 | 2023-09-05 | Google Llc | Adapting hotword recognition based on personalized negatives |
Non-Patent Citations (2)
Title |
---|
Interlayer selective attention network for robust personalized wake-up word detection;H Lim等;《IEEE Signal Processing Letters》;20191231;全文 * |
Research on speech keyword detection technology based on neural networks; Liu Li; China Master's Theses Full-text Database (Information Science and Technology); 20220115 (No. 01); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6444530B2 (en) | Spoken language understanding system | |
CN110853666B (en) | Speaker separation method, device, equipment and storage medium | |
CN108346428B (en) | Voice activity detection and model building method, device, equipment and storage medium thereof | |
CN109448719B (en) | Neural network model establishing method, voice awakening method, device, medium and equipment | |
CN109903750B (en) | Voice recognition method and device | |
CN110147806B (en) | Training method and device of image description model and storage medium | |
CN111429923B (en) | Training method and device of speaker information extraction model and computer equipment | |
CN113506574A (en) | Method and device for recognizing user-defined command words and computer equipment | |
CN111833845A (en) | Multi-language speech recognition model training method, device, equipment and storage medium | |
CN112509560B (en) | Voice recognition self-adaption method and system based on cache language model | |
CN114550703A (en) | Training method and device of voice recognition system, and voice recognition method and device | |
CN113870844A (en) | Training method and device of speech recognition model and computer equipment | |
CN111223476A (en) | Method and device for extracting voice feature vector, computer equipment and storage medium | |
CN114360522B (en) | Training method of voice awakening model, and detection method and equipment of voice false awakening | |
CN114360521B (en) | Training method of voice recognition model, and detection method and equipment of voice misrecognition | |
WO2022121188A1 (en) | Keyword detection method and apparatus, device and storage medium | |
CN113569021B (en) | Method for classifying users, computer device and readable storage medium | |
CN112364993B (en) | Model joint training method and device, computer equipment and storage medium | |
CN112712099B (en) | Double-layer knowledge-based speaker model compression system and method by distillation | |
CN115273832B (en) | Training method of wake optimization model, wake optimization method and related equipment | |
CN113223504A (en) | Acoustic model training method, device, equipment and storage medium | |
CN115101063B (en) | Low-computation-power voice recognition method, device, equipment and medium | |
CN112669836B (en) | Command recognition method and device and computer readable storage medium | |
CN113450800A (en) | Method and device for determining activation probability of awakening words and intelligent voice product | |
CN113360644A (en) | Method, device and equipment for retraining text model and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||