CN115273832B - Training method of wake-up optimization model, wake-up optimization method and related equipment

Info

Publication number: CN115273832B
Application number: CN202211158719.9A
Authority: CN (China)
Original language: Chinese (zh)
Earlier publication: CN115273832A
Inventors: 王维 (Wang Wei), 王广新 (Wang Guangxin), 杨汉丹 (Yang Handan)
Assignee (original and current): Shenzhen Youjie Zhixin Technology Co., Ltd.
Legal status: Active (application granted)

The application was filed by Shenzhen Youjie Zhixin Technology Co., Ltd. under number CN202211158719.9A, published as CN115273832A, and granted as CN115273832B.

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/065 - Adaptation
    • G10L 15/07 - Adaptation to the speaker
    • G10L 15/08 - Speech classification or search


Abstract

The present application relates to the field of speech recognition, and in particular to a training method for a wake-up optimization model, a wake-up optimization method, and related devices. A classification model and an embedding model are trained, and the template of the embedding model is set from the weights of the classification model, so that the trained embedding model clusters wake-word utterances more tightly. After training, both models are deployed on the terminal device. Once the terminal is activated, the classification model alone first decides whether to wake up the terminal; the system then gradually transitions to the embedding model and generates a user-specific template. After the user template is obtained, the relevant parameters of the current wake-up voice are computed on every wake-up attempt, and the user template is used to decide whether to wake up the terminal. The wake-up behavior is thus adaptively optimized through continued use, every user obtains a consistent experience, and the poor adaptability of a single model to different scenarios is effectively remedied.

Description

Training method of wake optimization model, wake optimization method and related equipment
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a training method for a wake-up optimization model, a wake-up optimization method, and related devices.
Background
When a wake-word or command-word model is applied, the user's voice is monitored in real time and feedback is produced when a specific word is detected. In practice, the wake-up model is trained in advance, yet the pronunciation habits of end users differ, so a consistent experience cannot be guaranteed for every user. The usual remedy is to add as many kinds of positive-sample data as possible to the training set to improve the model's recognition of different accents and scenarios, but this requires too much data, corpus collection is expensive, and training takes too long. Although this can improve the model, all accent data can never be exhausted, so the inconsistency of the experience across users cannot be fundamentally resolved.
Disclosure of Invention
The main objective of the present application is to provide a training method for a wake-up optimization model, a wake-up optimization method and related devices, which aim to solve the prior-art problem that the voice wake-up behavior cannot be adaptively optimized for different users.
In order to achieve the above object, the present application provides a training method for a wake-up optimization model, including:
acquiring annotation data, wherein the annotation data comprises positive samples and negative samples;
training the classification model with the annotation data to obtain the template c of the embedding model, wherein the template c is the first column of the weights of the penultimate layer of the classification model;
training the embedding model with the annotation data and the template c;
and obtaining the wake-up optimization model from the classification model and the embedding model.
The application also provides a wake-up optimization method, comprising:
when the terminal is detected to be activated and voice is received, inputting the voice into a classification model, and deciding whether to wake up the terminal according to the output of the classification model and a first wake-up threshold;
if the terminal is woken up successfully, extracting the embedding vector of the voice with an embedding model;
when the number of successful wake-ups reaches a specified count, averaging the embedding vectors collected over those wake-ups to obtain the user-specific template c_u;
after the user-specific template is obtained, lowering the wake-up threshold of the classification model to a second wake-up threshold;
when voice is received, deciding whether the classification model is triggered according to its output and the second wake-up threshold;
and after the classification model is triggered, calculating the smoothing coefficient and the final decision score of the current wake-up attempt, and deciding whether to wake up the terminal accordingly.
The application also provides a training apparatus for a wake-up optimization model, the apparatus comprising:
a data acquisition module for acquiring annotation data, the annotation data comprising positive samples and negative samples;
a classification model training module for training the classification model with the annotation data to obtain the template c of the embedding model, wherein the template c is the first column of the weights of the penultimate layer of the classification model;
an embedding model training module for training the embedding model with the annotation data and the template c;
and a wake-up optimization model generation module for obtaining the wake-up optimization model from the classification model and the embedding model.
The present application further provides a wake-up optimization apparatus, the apparatus comprising:
a first terminal wake-up module for, when the terminal is detected to be activated and voice is received, inputting the voice into a classification model and deciding whether to wake up the terminal according to the output of the classification model and a first wake-up threshold;
a user template determination module for, if the terminal is woken up successfully, extracting the embedding vector of the voice with an embedding model, and, when the number of successful wake-ups reaches a specified count, averaging the embedding vectors collected over those wake-ups to obtain the user-specific template c_u;
a wake-up threshold adjustment module for lowering the wake-up threshold of the classification model to a second wake-up threshold after the user-specific template is obtained;
a classification model wake-up module for deciding, when voice is received, whether the classification model is triggered according to its output and the second wake-up threshold;
and a second terminal wake-up module for calculating the smoothing coefficient and the final decision score of the current wake-up attempt after the classification model is triggered, and deciding whether to wake up the terminal accordingly.
The present application further provides a computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of any one of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any of the above.
The application provides a training method for a wake-up optimization model, a wake-up optimization method and related devices. A classification model and an embedding model are trained, and the template of the embedding model is set from the weights of the classification model, so that the trained embedding model clusters wake-word utterances more tightly and separates them more widely from non-wake-word utterances. After training, both models are deployed on the terminal device. Once the end user activates the terminal, the classification model alone first decides whether to wake up; the system then gradually transitions to the embedding model and generates a user template. After the user template is obtained, the relevant parameters of the current wake-up voice are computed on every wake-up attempt, and the user template is used to decide whether to wake up the terminal. The wake-up behavior is thus adaptively optimized through continued use, every user obtains a consistent experience, and the poor adaptability of a single model to different scenarios is effectively remedied.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a training method for a wake optimization model according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating steps of a wake-up optimization method according to an embodiment of the present application;
FIG. 3 is a block diagram of an overall structure of a training apparatus for waking up an optimization model according to an embodiment of the present application;
FIG. 4 is a block diagram of the overall structure of a wake-up optimized device according to an embodiment of the present application;
fig. 5 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a training method for a wake-up optimization model, including steps S1 to S4, specifically:
s1, obtaining annotation data, wherein the annotation data comprises a positive sample and a negative sample.
Specifically, in step S1, annotation data entered in advance is acquired during training; the annotation data includes positive samples, negative samples, and the text corresponding to each audio clip. A positive sample is audio data containing the wake-up word; the negative samples include the AISHELL corpus and the DNS-Challenge noise corpus. The AISHELL corpus was recorded by 400 speakers from different accent regions of China, covers finance, technology, sports, entertainment and current-affairs news, and is a basic database designed for Mandarin Chinese speech recognition. Using the AISHELL corpus as accent data lets the trained model recognize and handle different accents, improving the wake-up experience for users with different accents.
S2, training the classification model with the annotation data to obtain the template c of the embedding model, wherein the template c is the first column of the weights of the penultimate layer of the classification model.
Specifically, for step S2, the classification network uses TC-ResNet; other networks such as TDNN or RNN-Attention can also be used. The output dimension of the penultimate layer of the classification model must match the output dimension of the embedding model, for example 48; the hyper-parameters of the other layers are not restricted and are chosen according to the training results. After the classification model is trained, the template c of the embedding model is obtained. For a word in the input audio data, the classification model has two output nodes: a wake-word node and an unknown node. The wake-word node carries the decision score for the word, a probability value between 0 and 1, from which it can be judged whether the word is the wake-up word to be recognized. The unknown node covers audio content that the classification model cannot identify.
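As a rough illustration (not code from the patent), extracting the template from the trained classifier can be pictured as follows; the weight shape and the random values are assumptions for the sketch, and only the indexing step, taking the first column of the penultimate-layer weights as the template c, reflects the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical final-layer weight matrix of the classification model:
# it maps the 48-dim penultimate output to the 2 output nodes
# (column 0: wake-word node, column 1: unknown node).
W_final = rng.standard_normal((48, 2))

# Template c of the embedding model: the first column of these weights,
# i.e. the direction the classifier associates with the wake word.
c = W_final[:, 0]

assert c.shape == (48,)
```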
S3, training the embedding model with the annotation data and the template c.
S4, obtaining the wake-up optimization model from the classification model and the embedding model.
Specifically, for steps S3 and S4, the embedding model is trained with the annotation data against the template c. Compared with the conventional approach of taking the template as the average embedding of the positive training samples, experiments show that an embedding model trained this way clusters wake-word utterances more tightly and separates them more widely from non-wake-word utterances. An embedding maps a high-dimensional raw object (an image, a sentence, a word, a product, a film, and so on) to a low-dimensional vector that represents it. Such vectors have the property that objects whose vectors lie close together have similar meanings, so representing objects in the embedding space can reveal latent relations between them. The embedding model places the embedding vectors of similarly pronounced content close together in the embedding space; the user-specific template obtained from it therefore captures the user's pronunciation of the wake-up word well. Once the classification model and the embedding model are trained, the wake-up optimization model is obtained.
In an embodiment, after the acquiring of the annotation data, the method comprises:
S101, selecting a fixed length according to the length range of the positive samples;
S102, adjusting the length of the annotation data to the fixed length, and determining the number of frames;
S103, extracting features from the annotation data at the fixed length to obtain data of size (number of frames) x (feature dimension), used as the model input data.
Specifically, for steps S101, S102 and S103, a fixed length is chosen for all audio data according to the length distribution of the positive samples, and the number of frames is determined accordingly; for example, a fixed length of 1.5 s gives 151 frames. Features are extracted from the annotation data to obtain data of size (number of frames) x (feature dimension), used as the model input data. For 1.5 s of audio, the model input size is 151 x 40. The feature dimension is the number of features per frame.
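The fixed-length preparation above can be sketched as follows. The 16 kHz sample rate, the 25 ms window with 10 ms hop, and the crude FFT-based filterbank are assumptions standing in for an unspecified front end, chosen so that 1.5 s of audio yields the 151 x 40 input mentioned in the text.

```python
import numpy as np

SR = 16_000                  # assumed sample rate
FIXED_LEN = int(1.5 * SR)    # fixed length: 1.5 s -> 24000 samples
WIN, HOP = 400, 160          # assumed 25 ms window, 10 ms hop
N_FEATS = 40                 # feature dimension from the patent's example

def to_fixed_length(audio: np.ndarray) -> np.ndarray:
    """Pad with zeros or truncate so every clip is exactly 1.5 s (S101/S102)."""
    if len(audio) >= FIXED_LEN:
        return audio[:FIXED_LEN]
    return np.pad(audio, (0, FIXED_LEN - len(audio)))

def dummy_fbank(audio: np.ndarray) -> np.ndarray:
    """Stand-in for a 40-dim log-filterbank front end (S103): frames the
    signal (centered, so 1.5 s at a 10 ms hop yields 151 frames) and bins
    the FFT magnitude of each frame into 40 bands."""
    x = np.pad(audio, (WIN // 2, WIN // 2))   # center framing
    n_frames = 1 + len(audio) // HOP          # 151 for 1.5 s
    feats = np.empty((n_frames, N_FEATS))
    for t in range(n_frames):
        frame = x[t * HOP : t * HOP + WIN]
        mag = np.abs(np.fft.rfft(frame, n=512))[: N_FEATS * 5]
        feats[t] = np.log(mag.reshape(N_FEATS, -1).mean(axis=1) + 1e-8)
    return feats

feats = dummy_fbank(to_fixed_length(np.random.randn(SR)))  # a 1 s clip
assert feats.shape == (151, 40)                            # frames x features
```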
In one embodiment, the step of training the classification model with the annotation data comprises:
S201, inputting the model input data into the classification model to obtain the output of the classification model;
S202, calculating a loss function from the output of the classification model and a preset target value, and optimizing the parameters of the classification model according to the loss function.
Specifically, for steps S201 and S202, the classification model has two kinds of output nodes, a wake-word node and an unknown node; if the model input data contains several words, the classification model outputs a wake-word node for each of them accordingly. The wake-word node carries the decision score of the classification model for each word in the model input data, a probability value between 0 and 1, from which it can be judged whether the word is the wake-up word to be recognized. The unknown node refers to audio content that the classification model cannot identify. The loss function describes the gap between the model's prediction and the true value and steers the model toward convergence during training. The loss used here is the cross-entropy, a loss commonly used in classification to quantify the difference between two probability distributions, measuring the accuracy of the model on the test set. Cross-entropy is used together with the sigmoid function to avoid the slowdown of learning that the mean-squared-error loss suffers under gradient descent, because the learning rate is then controlled by the output error. The loss is computed from the output of the classification model and the preset target value, and the parameters of the classification model are updated by back-propagating it.
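A toy version of steps S201 and S202, with a single linear unit plus sigmoid standing in for the TC-ResNet classifier (an assumption for illustration only), shows the cross-entropy loss driving the parameters toward the target:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    """Binary cross-entropy between predicted probability p and target y."""
    eps = 1e-12
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Toy stand-in for the classifier head: a linear layer on a 48-dim
# penultimate feature, producing the wake-word decision score.
rng = np.random.default_rng(1)
w = rng.standard_normal(48) * 0.01
feat = rng.standard_normal(48)   # penultimate-layer output for one clip
y = 1.0                          # target: this clip contains the wake word

lr = 0.5
losses = []
for _ in range(50):
    p = sigmoid(w @ feat)        # decision score in (0, 1)
    losses.append(bce(p, y))
    grad_w = (p - y) * feat      # gradient of BCE-with-sigmoid w.r.t. w
    w -= lr * grad_w             # gradient-descent update

assert losses[-1] < losses[0]    # loss decreases as training proceeds
```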
In one embodiment, the step S3 of training the embedding model with the annotation data and the template c comprises:
S301, inputting the model input data into the embedding model to obtain an embedding vector, in accordance with the weights of the classification model;
S302, calculating the cosine similarity between the embedding vector and the template c, and optimizing the parameters of the embedding model according to the cosine similarity.
Specifically, for steps S301 and S302, the embedding model maps the input feature sequence into an embedding feature space; for example, audio data of size 151 x 40 fed into the embedding model yields a 48-dimensional embedding vector. The dimension of the embedding vector must match the dimension used in the weights of the classification model. The cosine similarity is computed as
cos_sim(e, c) = w * cos(e, c) + b,
where c is the embedding template, here the template c obtained from the classification model; e is the embedding vector of the current wake-up voice; and w and b are trainable parameters. The parameters of the embedding model are updated by back-propagating the cosine similarity, so that embedding vectors with similar cosine-similarity values are gathered as closely as possible in the embedding space, i.e. wake-word utterances cluster together. Experiments show that the embedding model trained this way clusters wake-word utterances more tightly and separates them more widely from non-wake-word utterances.
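The surrounding description names the template c, the embedding e, and trainable parameters w and b, which suggests a scaled cosine similarity of the form w * cos(e, c) + b; a minimal sketch under that assumption:

```python
import numpy as np

def scaled_cosine(e: np.ndarray, c: np.ndarray, w: float, b: float) -> float:
    """Scaled cosine similarity w*cos(e, c) + b; w and b are the trainable
    scalars named in the text (the scaled-cosine form is an assumption)."""
    cos = float(e @ c / (np.linalg.norm(e) * np.linalg.norm(c)))
    return w * cos + b

c = np.ones(48)             # hypothetical 48-dim embedding template
e_same = np.ones(48) * 3.0  # same direction as the template
e_opp = -np.ones(48)        # opposite direction

assert abs(scaled_cosine(e_same, c, w=1.0, b=0.0) - 1.0) < 1e-9
assert abs(scaled_cosine(e_opp, c, w=1.0, b=0.0) + 1.0) < 1e-9
```

Cosine similarity ignores vector magnitude, so utterances of the same word at different loudness still score near each other; w and b rescale the score before it enters the loss.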
Referring to fig. 2, an embodiment of the present application provides a wake-up optimization method, comprising steps A1 to A6, specifically:
A1, when the terminal is detected to be activated and voice is received, inputting the voice into the classification model, and deciding whether to wake up the terminal according to the output of the classification model and a first wake-up threshold;
A2, if the terminal is woken up successfully, extracting the embedding vector of the voice with the embedding model;
A3, when the number of successful wake-ups reaches a specified count, averaging the embedding vectors collected over those wake-ups to obtain the user-specific template c_u;
A4, after the user-specific template is obtained, lowering the wake-up threshold of the classification model to a second wake-up threshold;
A5, when voice is received, deciding whether the classification model is triggered according to its output and the second wake-up threshold;
A6, after the classification model is triggered, calculating the smoothing coefficient and the final decision score of the current wake-up attempt, and deciding whether to wake up the terminal accordingly.
Specifically, for step A1, after the terminal is activated, the classification model alone decides whether to wake up the terminal. The wake-up threshold at this stage is the first threshold: when voice is received, the classification model produces an output, and the terminal is woken up when that output exceeds the first threshold.
Specifically, for steps A2 and A3, each time the terminal is woken up successfully on the basis of the classification model's output, the embedding model extracts the embedding vector of the current speech; these vectors capture the user's pronunciation of the wake-up word. Since the output of the classification model does not exceed the first threshold for every received utterance, the terminal is sometimes not woken up. Such speech cannot serve as data for the user-specific template c_u, so the embedding model does not extract its embedding vector, avoiding unnecessary computation; only speech that successfully wakes the terminal contributes an embedding vector to the computation of c_u. When the number of successful wake-ups reaches the specified count, the average of those embedding vectors is computed to obtain the user-specific template c_u. A user template obtained this way expresses the pronunciation habits of different users well. Compared with the conventional approach of using many kinds of positive-sample data to improve the model's recognition of different accents and scenarios, it greatly reduces the amount of training data required, yields a template better suited to the current user, and gives users with different pronunciation habits a good wake-up experience.
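The accumulation of the user-specific template in steps A2 and A3 can be sketched as follows; the required count of 3 and the 48-dimensional embeddings are illustrative stand-ins for the unspecified "specified number of times".

```python
import numpy as np

class UserTemplate:
    """Accumulates embeddings of successful wake-ups; once the specified
    number have been collected, their mean becomes the user template c_u."""

    def __init__(self, required: int = 3):
        self.required = required   # illustrative count, not from the patent
        self.vectors = []
        self.c_u = None

    def add_successful_wakeup(self, emb: np.ndarray) -> None:
        if self.c_u is not None:
            return                 # template already fixed
        self.vectors.append(emb)
        if len(self.vectors) == self.required:
            self.c_u = np.mean(self.vectors, axis=0)

tpl = UserTemplate(required=3)
rng = np.random.default_rng(2)
for _ in range(3):
    tpl.add_successful_wakeup(rng.standard_normal(48))

assert tpl.c_u is not None and tpl.c_u.shape == (48,)
```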
Specifically, for step A4, after the user-specific template c_u is obtained, the classification model and the template c_u jointly decide whether to wake up the terminal. Since the classification model is no longer the sole basis for the decision, its wake-up threshold is lowered to the second wake-up threshold, making it easier to trigger; the user-specific template c_u then makes the further determination.
Specifically, for step A5, when speech is received, the classification model is triggered if its output exceeds the second wake-up threshold; that is, it is triggered as soon as the probability that the speech contains the wake-up word exceeds the second threshold. Once the classification model is triggered, whether to wake up the terminal is decided from the smoothing coefficient of the current attempt and the computed final decision score.
Specifically, for step A6, after the classification model is triggered, the smoothing coefficient is calculated, and the final decision score is calculated from it; the user-specific template c_u enters as one of the parameters of the final decision score. When the final decision score exceeds a third threshold, the terminal is woken up. This step realizes the decision of whether to wake up the terminal according to the user's pronunciation habits.
In an embodiment, the step of calculating the smoothing coefficient and the final decision score of the current wake-up comprises:
A101, calculating the smoothing coefficient: a1 = 0.9 × n/(1 + β × n), where β is an adjustable parameter and n is the current number of wake-ups;
A102, calculating the final decision score from: s_cls, the decision score of the classification model for the current wake-up; a1, the smoothing coefficient; sim_c, the cosine similarity between the embedding vector of the current wake-up voice and the template c of the embedding model; and sim_u, the cosine similarity between the embedding vector of the current wake-up voice and the user-specific template c_u.
Specifically, for steps A101 and A102, the smoothing coefficient is computed as a1 = 0.9 × n/(1 + β × n), where β is an adjustable parameter chosen according to the user's estimated wake-up frequency, and n is the current number of wake-ups; the count restarts once the user template c_u has been obtained and is not added to the earlier count from the stage in which only the classification model decided whether to wake up the terminal. The final decision score is computed from s_cls, the decision score of the classification model for the current wake-up; a1, the smoothing coefficient; sim_c, the cosine similarity between the embedding vector of the current wake-up voice and the template c of the embedding model; and sim_u, the cosine similarity between the embedding vector of the current wake-up voice and the user-specific template c_u. The cosine similarity is computed in the same way as when training the embedding model.
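The two-stage decision flow of steps A1 to A6 can be sketched as follows. Only the smoothing coefficient a1 = 0.9 × n/(1 + β × n) is taken from the text; the `final_score` combination, the value of β, and the thresholds are hypothetical stand-ins for illustration, not the patent's actual formula.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def smoothing_coefficient(n: int, beta: float = 1.0) -> float:
    """a1 = 0.9*n / (1 + beta*n), from the patent; beta here is illustrative."""
    return 0.9 * n / (1 + beta * n)

def final_score(s_cls, a1, sim_c, sim_u):
    """Hypothetical stand-in for the patent's final decision score:
    weights the user-template similarity more heavily as wake-ups accumulate."""
    return s_cls * ((1 - a1) * sim_c + a1 * sim_u)

def should_wake(s_cls, emb, c, c_u, n,
                thr2=0.4, thr3=0.5):       # illustrative thresholds
    if s_cls <= thr2:                      # A5: classifier not triggered
        return False
    a1 = smoothing_coefficient(n)          # A6: smoothing coefficient
    return final_score(s_cls, a1, cos(emb, c), cos(emb, c_u)) > thr3

c = np.ones(48)
c_u = np.ones(48)                          # toy templates
assert should_wake(0.9, np.ones(48), c, c_u, n=3)      # close match wakes
assert not should_wake(0.2, np.ones(48), c, c_u, n=3)  # below second threshold
```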
Referring to fig. 3, a block diagram of a training apparatus for a wake-up optimization model in an embodiment of the present application is shown; the apparatus comprises:
a data acquisition module 100 for acquiring annotation data, the annotation data comprising positive samples and negative samples;
a classification model training module 200 for training the classification model with the annotation data to obtain the template c of the embedding model, wherein the template c is the first column of the weights of the penultimate layer of the classification model;
an embedding model training module 300 for training the embedding model with the annotation data and the template c;
and a wake-up optimization model generation module 400 for obtaining the wake-up optimization model from the classification model and the embedding model.
In an embodiment, the training apparatus for the wake-up optimization model further includes:
the model input data adjusting module is used for selecting a fixed length according to the length range of the positive sample; adjusting the length of the labeled data according to the fixed length, and determining the frame number; and extracting features from the labeled data according to the fixed length to obtain data of the frame number and the feature dimension size, and using the data as model input data.
In an embodiment, the training apparatus for the wake-up optimization model further includes:
the classification model training submodule is used for inputting the model input data into a classification model to obtain the output of the classification model; and calculating a Loss function according to the output of the classification model and a preset target value, and optimizing the parameters of the classification model according to the Loss function.
In an embodiment, the training apparatus for the wake-up optimization model further includes:
the embedding model training sub-module is used for inputting the model input data into the embedding model and obtaining an embedding vector according to the weight of the classification model; calculating the cosine similarity between the embedding vector and the template T of the embedding model, and optimizing the parameters of the embedding model according to the cosine similarity.
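The cosine-similarity objective can be sketched as below. Treating 1 − cos(e, T) as the quantity to minimize for positive samples is our illustrative reading; the patent only states that the parameters are optimized according to the cosine similarity.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The embedding of a positive sample is pulled toward the template T:
# the objective 1 - cos(e, T) shrinks as the embedding aligns with T.
T = np.array([1.0, 0.0, 0.0])
emb = np.array([0.9, 0.1, 0.0])
loss = 1.0 - cosine_similarity(emb, T)
```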
Referring to fig. 4, which is a block diagram of a wake-up optimization apparatus in an embodiment of the present application, the apparatus includes:
a first terminal awakening module 500, configured to, when detecting that the terminal is activated and receiving a voice, input the voice into a classification model, and determine whether to awaken the terminal according to an output of the classification model and a first awakening threshold;
a user template determining module 600, configured to, if the terminal is awakened successfully, extract the embedding vector of the voice by using the embedding model; and when the number of times of successful awakening of the terminal reaches a designated number of times, calculate the average of the corresponding embedding vectors to obtain a user-specific template T_user;
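Forming the user-specific template by averaging can be sketched as follows; the function name is ours, not from the patent.

```python
import numpy as np

def user_template(embeddings: list) -> np.ndarray:
    """Average the embedding vectors of the first K successful wake-ups
    to obtain the user-specific template T_user."""
    return np.mean(np.stack(embeddings), axis=0)

# Two successful wake-ups yield a template halfway between their embeddings.
T_user = user_template([np.array([1.0, 0.0]), np.array([0.0, 1.0])])
```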
A wake-up threshold adjustment module 700, configured to reduce the wake-up threshold of the classification model to a second wake-up threshold after obtaining the user-specific template;
a classification model awakening module 800, configured to determine whether to awaken the classification model according to the output of the classification model and the second awakening threshold when receiving the voice;
and a second terminal awakening module 900, configured to calculate a smoothing coefficient and a final decision score of the current awakening after the classification model is successfully awakened, and determine whether to awaken the terminal according to the smoothing coefficient and the final decision score of the current awakening.
In an embodiment, the wake-up optimization apparatus further includes:
a calculation module, configured to calculate the smoothing coefficient a1 = 0.9 × n / (1 + β × n), where β is an adjustable parameter and n is the current number of wake-ups, and to calculate the final decision score from s1, a1, s2 and s3, where s1 is the decision score of the classification model for the current wake-up, a1 is the smoothing coefficient, s2 is the cosine similarity between the embedding vector of the current wake-up speech and the template T of the embedding model, and s3 is the cosine similarity between the embedding vector of the current wake-up speech and the user-specific template T_user.
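The smoothing coefficient a1 = 0.9 × n / (1 + β × n) described above can be sketched directly; the default β = 1 is an illustrative choice of ours, since the patent only says β is adjustable.

```python
def smoothing_coefficient(n: int, beta: float = 1.0) -> float:
    """a1 = 0.9*n / (1 + beta*n). With beta = 1 this rises from 0 toward 0.9
    as the wake-up count n grows, gradually shifting weight from the
    classification score toward the embedding-similarity scores."""
    return 0.9 * n / (1 + beta * n)
```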
Referring to fig. 5, an embodiment of the present application also provides a computer device, which may be a server and whose internal structure may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operating system and the computer program in the non-volatile storage medium to run. The database of the computer device is used for storing operation data of the training method of the wake-up optimization model, operation data of the wake-up optimization method, and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. When executed by the processor, the computer program implements the wake-effect adaptive optimization method of any of the above embodiments.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is only a block diagram of some of the structures associated with the present solution and is not intended to limit the scope of the present solution as applied to computer devices.
An embodiment of the present application further provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements a method for wake effect adaptive optimization. It is to be understood that the computer readable storage medium in this embodiment may be a volatile readable storage medium or a non-volatile readable storage medium.
The present application provides a training method for a wake-up optimization model, a wake-up optimization method, and related devices. A classification model and an embedding model are trained, and the template of the embedding model is set according to the weights of the classification model, so that the trained embedding model clusters wake-up words more tightly and separates wake-up words from non-wake-up words by a larger margin. After the two models are trained, they are deployed on the terminal device. After the end user activates the device, the classification model is mainly used at first to decide whether to wake the terminal, and the decision then gradually transitions to the embedding model, which generates a user template. Once the user template is obtained, the relevant parameters of the current wake-up speech are calculated at each wake-up, and whether to wake the terminal is decided according to the user template. The wake-up effect is thus optimized adaptively through continuous iteration as the user uses the device, each user obtains a consistent experience, and the insufficient adaptability of a single model to different scenarios is effectively remedied.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other identical elements in the process, apparatus, article, or method that comprises the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all the equivalent structures or equivalent processes that can be directly or indirectly applied to other related technical fields by using the contents of the specification and the drawings of the present application are also included in the scope of the present application.

Claims (10)

1. A training method for awakening an optimization model is characterized by comprising the following steps:
acquiring annotation data, wherein the annotation data comprises a positive sample and a negative sample;
training a classification model by using the labeled data to obtain a template T of an embedding model, wherein the template T is a first column of weights of a penultimate layer output by the classification model;
training an embedding model by using the labeling data and a first column of the weight of the penultimate layer output by the classification model;
and obtaining an awakening optimization model according to the classification model and the embedding model.
2. The method for training the wake-up optimization model according to claim 1, wherein after the obtaining the annotation data, the method comprises:
selecting a fixed length according to the length range of the positive sample;
adjusting the length of the labeled data according to the fixed length, and determining the frame number;
and extracting features from the labeled data according to the fixed length to obtain data of the frame number and the feature dimension size, and using the data as model input data.
3. The method for training the wake-up optimization model according to claim 2, wherein the training the classification model with the labeled data comprises:
inputting the model input data into a classification model to obtain the output of the classification model;
and calculating a Loss function according to the output of the classification model and a preset target value, and optimizing the parameters of the classification model according to the Loss function.
4. The method for training the wake-up optimization model according to claim 2, wherein the training an embedding model by using the labeling data and the template T of the embedding model comprises:
inputting the model input data into an embedding model to obtain an embedding vector;
calculating a cosine similarity between the embedding vector and the template T of the embedding model, and optimizing parameters of the embedding model according to the cosine similarity.
5. A method of wake optimization, comprising:
when the terminal is detected to be activated and voice is received, inputting the voice into a classification model, and judging whether to awaken the terminal or not according to the output of the classification model and a first awakening threshold value;
if the terminal is awakened successfully, extracting an embedding vector of the voice by using an embedding model;
when the number of times of successful awakening of the terminal reaches a designated number of times, calculating the average of the corresponding embedding vectors to obtain a user-specific template T_user;
After the user specific template is obtained, reducing the awakening threshold of the classification model to be a second awakening threshold;
when voice is received, judging whether the classification model is awakened or not according to the output of the classification model and the second awakening threshold value;
and after the classification model is successfully awakened, calculating the smoothing coefficient and the final judgment score of the current awakening, and judging whether to awaken the terminal according to the smoothing coefficient and the final judgment score of the current awakening.
6. The wake-up optimization method according to claim 5, wherein the calculating the smoothing factor and the final decision score for the current wake-up comprises:
calculating a smoothing coefficient: a1=0.9 × n/(1 + β × n), where β is an adjustable parameter and n is the number of current awakenings;
calculating a final decision score according to s1, a1, s2 and s3, wherein s1 is the decision score of the classification model for the current wake-up, a1 is the smoothing coefficient, s2 is the cosine similarity between the embedding vector of the current wake-up speech and the template T of the embedding model, and s3 is the cosine similarity between the embedding vector of the current wake-up speech and the user-specific template T_user.
7. A training apparatus for waking up an optimization model, the apparatus comprising:
the data acquisition module is used for acquiring marking data, and the marking data comprises positive samples and negative samples;
a classification model training module, configured to train a classification model by using the labeled data to obtain a template T of an embedding model, wherein the template T is a first column of weights of a penultimate layer output by the classification model;
the embedding model training module is used for training an embedding model by using the labeling data and a first column of the weight of the penultimate layer output by the classification model;
and the awakening optimization model generation module is used for obtaining an awakening optimization model according to the classification model and the embedding model.
8. An apparatus for wake optimization, the apparatus comprising:
the first terminal awakening module is used for inputting the voice into the classification model when detecting that the terminal is activated and receiving the voice, and judging whether to awaken the terminal or not according to the output of the classification model and a first awakening threshold value;
the user template determining module is used for extracting the embedding vector of the voice by using the embedding model if the terminal is awakened successfully, and, when the number of times of successful awakening of the terminal reaches a designated number of times, calculating the average of the corresponding embedding vectors to obtain a user-specific template T_user;
The awakening threshold adjusting module is used for reducing the awakening threshold of the classification model to be a second awakening threshold after the user specific template is obtained;
the classification model awakening module is used for judging whether to awaken the classification model or not according to the output of the classification model and the second awakening threshold value when receiving the voice;
and the second terminal awakening module is used for calculating the smooth coefficient and the final judgment score of the current awakening after the classification model is awakened successfully, and judging whether to awaken the terminal according to the smooth coefficient and the final judgment score of the current awakening.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202211158719.9A 2022-09-22 2022-09-22 Training method of wake optimization model, wake optimization method and related equipment Active CN115273832B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211158719.9A CN115273832B (en) 2022-09-22 2022-09-22 Training method of wake optimization model, wake optimization method and related equipment

Publications (2)

Publication Number Publication Date
CN115273832A CN115273832A (en) 2022-11-01
CN115273832B true CN115273832B (en) 2023-02-28

Family

ID=83756079

Country Status (1)

Country Link
CN (1) CN115273832B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933124A (en) * 2020-09-18 2020-11-13 电子科技大学 Keyword detection method capable of supporting self-defined awakening words
CN114360521A (en) * 2022-03-09 2022-04-15 深圳市友杰智新科技有限公司 Training method of voice recognition model, and detection method and equipment of voice false recognition
CN114360522A (en) * 2022-03-09 2022-04-15 深圳市友杰智新科技有限公司 Training method of voice awakening model, and detection method and equipment of voice false awakening
CN114420098A (en) * 2022-01-20 2022-04-29 思必驰科技股份有限公司 Wake-up word detection model training method, electronic device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11749267B2 (en) * 2020-11-20 2023-09-05 Google Llc Adapting hotword recognition based on personalized negatives

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Interlayer selective attention network for robust personalized wake-up word detection;H Lim等;《IEEE Signal Processing Letters》;20191231;全文 *
基于神经网络的语音关键词检测技术研究;刘力;《中国优秀硕士学位论文全文数据库 信息科技辑》;20220115(第01期);全文 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant