CN113450800A - Method and device for determining activation probability of wake-up words and intelligent voice product - Google Patents

Method and device for determining activation probability of wake-up words and intelligent voice product

Info

Publication number
CN113450800A
CN113450800A (application CN202110759228.9A)
Authority
CN
China
Prior art keywords
acoustic
neural network
sequence
training
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110759228.9A
Other languages
Chinese (zh)
Other versions
CN113450800B (en)
Inventor
赵亚东
金忠孝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SAIC Motor Corp Ltd
Original Assignee
SAIC Motor Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SAIC Motor Corp Ltd filed Critical SAIC Motor Corp Ltd
Priority to CN202110759228.9A
Publication of CN113450800A
Application granted
Publication of CN113450800B
Legal status: Active (granted)


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 - Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 - Combinations of networks
    • G06N 3/047 - Probabilistic or stochastic networks
    • G06N 3/048 - Activation functions
    • G06N 3/08 - Learning methods
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and a device for determining the activation probability of a wake-up word, and an intelligent voice device. Applied to an intelligent voice device, the method acquires the audio signal received by the device; processes the audio signal based on an acoustic classification model to obtain an acoustic classification sequence of the audio signal; and inputs the acoustic classification sequence and the acoustic characterization sequence of the wake-up word into a neural network model to obtain the activation probability of the wake-up word. By accurately determining the activation probability, the scheme allows the intelligent voice device to decide whether it should be activated, providing a key technical link for its normal operation.

Description

Method and device for determining activation probability of wake-up words and intelligent voice product
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a method and a device for determining the activation probability of a wake-up word, and an intelligent voice product.
Background
When an intelligent voice device such as a smart speaker, a smart car head unit, or a smartphone implements voice control, the voice interaction process is divided into five links: wake-up, response, input, understanding, and feedback. The wake-up link is the first point of contact between the user and the intelligent voice product; its experience is crucial to the whole voice interaction process and directly shapes the user's first impression of the product. Although today's intelligent voice products are called "intelligent", they do not yet possess human intelligence and cannot be woken by gaze or gesture, so a word that switches the product from the standby state to the working state, the so-called wake-up word, needs to be defined.
The traditional wake-word-based voice wake-up scheme comprises three parts. In the first part, an acoustic classification sequence is generated from the audio by an acoustic classification algorithm. In the second part, the activation probability of the wake-up word is calculated from the acoustic classification sequence by methods such as distance calculation, generative modeling, or probability computation. In the third part, an activation probability threshold is determined, and whether the device is activated is finally judged by comparing the activation probability against that threshold; a sketch of this comparison follows. When the device is determined to be activated, the intelligent voice device is controlled to carry out voice interaction with the user. Accurately determining the activation probability of the wake-up word is therefore a key link in controlling the intelligent voice device.
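As a minimal illustration of the third part, the following Python sketch shows the threshold decision. It is illustrative only; the threshold value and the names used are assumptions, not taken from any cited scheme.

    # Hypothetical sketch of the threshold decision in part three.
    ACTIVATION_THRESHOLD = 0.8  # assumed value; tuned per product in practice

    def should_activate(activation_probability: float) -> bool:
        # Activate the device only when the wake-up word's activation
        # probability reaches the configured threshold.
        return activation_probability >= ACTIVATION_THRESHOLD

    if should_activate(0.93):
        print("Device activated: start voice interaction")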
Disclosure of Invention
In view of this, the present application provides a method and an apparatus for determining the activation probability of a wake-up word, and an intelligent voice device, which are used to determine the activation probability of the wake-up word and provide a key technical link for the normal operation of the intelligent voice device.
In order to achieve the above object, the following solutions are proposed:
a method for determining activation probability of a wake-up word is applied to intelligent voice equipment, and comprises the following steps:
acquiring an audio signal received by the intelligent voice equipment;
processing the audio signal based on an acoustic classification model to obtain an acoustic classification sequence of the audio signal;
and inputting the acoustic classification sequence and the acoustic characterization sequence of the awakening word into a neural network model to obtain the activation probability of the awakening word.
Optionally, the acoustic classification sequence is x, where:
x = [x_1, x_2, …, x_n] is a two-dimensional real tensor of shape N×n;
n is the length of the audio frame sequence;
N is the total number of acoustic classification categories;
the i-th column vector of x is x_i = [x_i1, x_i2, …, x_iN], so x_i is a one-dimensional real tensor of length N, where 1 ≤ i ≤ n; x_ik is the value of the k-th element of x_i and represents the probability that the i-th audio frame belongs to the k-th acoustic classification, where 1 ≤ k ≤ N and 0 ≤ x_ik ≤ 1.
Optionally, the acoustic characterization sequence is s^keyword, where:
s^keyword = [s^keyword_1, s^keyword_2, …, s^keyword_m] is a two-dimensional real tensor of shape N×m;
keyword represents the wake-up word;
N is the total number of acoustic classification categories;
m is the length of the acoustic characterization sequence of the specific wake-up word;
the j-th column vector of s^keyword, denoted s^keyword_j, is a one-dimensional real tensor of length N, where 1 ≤ j ≤ m; s^keyword_jh is the value of its h-th element, where 1 ≤ h ≤ N and 0 ≤ s^keyword_jh ≤ 1.
For any given keyword, a unique acoustic characterization s^keyword that represents the keyword can always be constructed manually.
Optionally, the method further comprises the steps of:
performing model pre-training based on labeled data of a large amount of general text audio;
and performing fine-tuning training based on labeled data of a small amount of wake-up word audio to obtain the neural network model, wherein general text audio refers to audio whose text is unrelated to the wake-up word, and wake-up word audio refers to audio whose text is the same as or similar to the wake-up word.
Optionally, the model pre-training based on labeled data of a large amount of general text audio includes the steps of:
determining an acoustic classification model;
building a main-body neural network according to the network structure of the acoustic classification model;
training the main-body neural network with the labeled data;
and saving the neural network model parameters.
Optionally, the fine-tuning training based on labeled data of a small amount of wake-up word audio includes the steps of:
determining the fine-tuning wake-up word and its corresponding acoustic characterization sequence;
splicing the acoustic classification model with the main-body neural network to form an end-to-end network;
loading the activation-probability neural network model parameters and the acoustic classification model parameters, and performing training based on a cross-entropy loss function on the end-to-end network;
and saving the adjusted neural network model parameters and acoustic classification model parameters.
A device for determining the activation probability of a wake-up word, applied to an intelligent voice device, comprises:
a signal acquisition module configured to acquire an audio signal received by the intelligent voice device;
a first processing module configured to process the audio signal based on an acoustic classification model to obtain an acoustic classification sequence of the audio signal;
and a second processing module configured to input the acoustic classification sequence and the acoustic characterization sequence of the wake-up word into a neural network model to obtain the activation probability of the wake-up word.
Optionally, the acoustic classification sequence is x, where:
x = [x_1, x_2, …, x_n] is a two-dimensional real tensor of shape N×n;
n is the length of the audio frame sequence;
N is the total number of acoustic classification categories;
the i-th column vector of x is x_i = [x_i1, x_i2, …, x_iN], so x_i is a one-dimensional real tensor of length N, where 1 ≤ i ≤ n; x_ik is the value of the k-th element of x_i and represents the probability that the i-th audio frame belongs to the k-th acoustic classification, where 1 ≤ k ≤ N and 0 ≤ x_ik ≤ 1.
Optionally, the acoustic characterization sequence is s^keyword, where:
s^keyword = [s^keyword_1, s^keyword_2, …, s^keyword_m] is a two-dimensional real tensor of shape N×m;
keyword represents the wake-up word;
N is the total number of acoustic classification categories;
m is the length of the acoustic characterization sequence of the specific wake-up word;
the j-th column vector of s^keyword, denoted s^keyword_j, is a one-dimensional real tensor of length N, where 1 ≤ j ≤ m; s^keyword_jh is the value of its h-th element, where 1 ≤ h ≤ N and 0 ≤ s^keyword_jh ≤ 1.
For any given keyword, a unique acoustic characterization s^keyword that represents the keyword can always be constructed manually.
Optionally, the device further includes:
a first training module configured to perform model pre-training based on labeled data of a large amount of general text audio;
and a second training module configured to perform fine-tuning training based on labeled data of a small amount of wake-up word audio to obtain the neural network model, wherein general text audio refers to audio whose text is unrelated to the wake-up word, and wake-up word audio refers to audio whose text is the same as or similar to the wake-up word.
Optionally, the first training module is specifically configured to:
determine an acoustic classification model;
build a main-body neural network according to the network structure of the acoustic classification model;
train the main-body neural network with the labeled data;
and save the neural network model parameters.
Optionally, the second training module is specifically configured to:
determine the fine-tuning wake-up word and its corresponding acoustic characterization sequence;
splice the acoustic classification model with the main-body neural network to form an end-to-end network;
load the activation-probability neural network model parameters and the acoustic classification model parameters, and perform training based on a cross-entropy loss function on the end-to-end network;
and save the adjusted neural network model parameters and acoustic classification model parameters.
An intelligent voice device is provided with the determination device described above.
An intelligent voice device comprises at least one processor and a memory connected to the processor, wherein:
the memory is configured to store a computer program or instructions;
and the processor is configured to execute the computer program or instructions to cause the intelligent voice device to perform the determination method described above.
According to the above technical solutions, the present application provides a method and a device for determining the activation probability of a wake-up word, and an intelligent voice device, applied to an intelligent voice device. Specifically, the audio signal received by the intelligent voice device is acquired; the audio signal is processed based on the acoustic classification model to obtain the acoustic classification sequence of the audio signal; and the acoustic classification sequence and the acoustic characterization sequence of the wake-up word are input into the neural network model to obtain the activation probability of the wake-up word. By accurately determining the activation probability, the scheme allows the intelligent voice device to decide whether it should be activated, providing a key technical link for its normal operation.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a method for determining the activation probability of a wake-up word according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a neural network model according to an embodiment of the present application;
FIG. 3 is a flow chart of a training process of a neural network model according to an embodiment of the present application;
fig. 4 is a block diagram of an apparatus for determining the activation probability of a wake-up word according to an embodiment of the present application;
fig. 5 is a block diagram of another apparatus for determining the activation probability of a wake-up word according to an embodiment of the present application;
fig. 6 is a block diagram of an intelligent voice device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present application.
Example one
Fig. 1 is a flowchart of a method for determining the activation probability of a wake-up word according to an embodiment of the present application.
As shown in fig. 1, the determination method provided in this embodiment is applied to an intelligent voice device, such as a smart speaker, a smart car head unit, or a smartphone. The method processes audio received by the intelligent voice device to obtain the activation probability of the corresponding wake-up word, and comprises the following steps:
and S1, acquiring the audio signal received by the intelligent voice equipment.
That is, when the sound collection device of the intelligent voice device collects the sound emitted by the user and converts the sound into an audio signal in a standby state, the execution subject executing the determination method in the application acquires the audio signal so as to further process the audio signal.
In addition, the acoustic characterization sequence of the corresponding wake-up word can be acquired at the same time as, or after, the audio signal. The acoustic characterization sequence can be expressed as s^keyword, where:
s^keyword = [s^keyword_1, s^keyword_2, …, s^keyword_m] is a two-dimensional real tensor of shape N×m;
keyword represents the wake-up word; N is the total number of acoustic classification categories; m is the length of the acoustic characterization sequence of the specific wake-up word;
the j-th column vector of s^keyword, denoted s^keyword_j, is a one-dimensional real tensor of length N, where 1 ≤ j ≤ m; s^keyword_jh is the value of its h-th element, where 1 ≤ h ≤ N and 0 ≤ s^keyword_jh ≤ 1.
For any given keyword, a unique acoustic characterization s^keyword that represents the keyword can always be constructed manually.
S2, processing the audio signal based on the acoustic classification model.
When the audio signal is acquired, it is processed with the pre-trained acoustic classification model to obtain the acoustic classification sequence of the audio signal. The acoustic classification sequence is denoted x, where:
x = [x_1, x_2, …, x_n] is a two-dimensional real tensor of shape N×n; n is the length of the audio frame sequence; N is the total number of acoustic classification categories; the i-th column vector of x is x_i = [x_i1, x_i2, …, x_iN], so x_i is a one-dimensional real tensor of length N, where 1 ≤ i ≤ n; x_ik, the value of the k-th element of x_i, represents the probability that the i-th audio frame belongs to the k-th acoustic classification, where 1 ≤ k ≤ N and 0 ≤ x_ik ≤ 1.
S3, processing the acoustic classification sequence and the acoustic characterization sequence based on the neural network model.
The neural network model is imported into the corresponding processing body of the intelligent voice device before processing begins. Here the acoustic classification sequence is that of the audio signal, and the acoustic characterization sequence is that of the wake-up word; the activation probability of the wake-up word is obtained through the processing of the neural network model.
The neural network model may comprise a long short-term memory (LSTM) network layer, convolutional layers, a max-pooling layer, and a linear transformation layer, combined as follows (see fig. 2):
emb = Linear(LSTM(s^keyword))
P_keyword(activation | x) = Sigmoid(Conv_emb(Tanh(Conv(x))))
Here LSTM denotes the long short-term memory network layer, Linear the linear transformation layer, Conv a convolutional layer, Tanh the tanh activation function, MaxPool the max-pooling layer, Conv_emb a convolutional layer whose convolution kernel parameters are emb, emb the acoustic characterization embedding of the wake-up word, and Sigmoid the sigmoid activation function.
The activation probability of the wake-up word is denoted P_keyword(activation | x), where keyword is the specific wake-up word and x is the acoustic classification sequence of the corresponding audio frame sequence.
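For concreteness, the following is a minimal, illustrative Python (PyTorch) sketch of such a model. It is not the patent's implementation: the layer sizes, the use of the LSTM's final hidden state, and the placement of the max pooling (which the formula above does not pin down) are all assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WakeWordScorer(nn.Module):
        # Illustrative sketch of the described model; sizes are assumptions.
        def __init__(self, n_classes: int, emb_channels: int = 32, kernel: int = 5):
            super().__init__()
            self.lstm = nn.LSTM(input_size=n_classes, hidden_size=64, batch_first=True)
            # Linear maps the LSTM summary to one conv kernel per embedding channel.
            self.linear = nn.Linear(64, emb_channels * emb_channels * kernel)
            self.conv = nn.Conv1d(n_classes, emb_channels, kernel_size=kernel)
            self.emb_channels, self.kernel = emb_channels, kernel

        def forward(self, x, s_keyword):
            # x: (batch, N, n) acoustic classification sequence.
            # s_keyword: (batch, m, N), i.e. the transpose of the N x m layout
            # in the text, for LSTM convenience.
            _, (h, _) = self.lstm(s_keyword)            # summarize the wake-up word
            emb = self.linear(h[-1])                    # emb = Linear(LSTM(s^keyword))
            weight = emb.view(-1, self.emb_channels, self.emb_channels, self.kernel)

            feat = torch.tanh(self.conv(x))             # Tanh(Conv(x))
            # Conv_emb: convolve the features with the keyword-derived kernel.
            # (Looped per batch element because each sample has its own kernel.)
            scores = []
            for b in range(feat.size(0)):
                z = F.conv1d(feat[b:b + 1], weight[b])  # (1, emb_channels, n')
                z = F.max_pool1d(z, z.size(-1))         # MaxPool over time (assumed placement)
                scores.append(torch.sigmoid(z.mean()))  # Sigmoid -> activation probability
            return torch.stack(scores)

A call such as WakeWordScorer(n_classes=64)(torch.rand(2, 64, 120), torch.rand(2, 4, 64)) would return one activation probability per sample.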
As can be seen from the foregoing, this embodiment provides a method for determining the activation probability of a wake-up word. Applied to an intelligent voice device, the method acquires the audio signal received by the device; processes the audio signal based on the acoustic classification model to obtain the acoustic classification sequence of the audio signal; and inputs the acoustic classification sequence and the acoustic characterization sequence of the wake-up word into the neural network model to obtain the activation probability of the wake-up word. By accurately determining the activation probability, the scheme allows the intelligent voice device to decide whether it should be activated, providing a key technical link for its normal operation.
The above technical solution is implemented on the basis of a corresponding neural network model. Where no such model exists, the method further includes the following model training scheme, which provides the neural network model for the above solution; as shown in fig. 3, the specific scheme comprises the following steps:
and S101, model pre-training is carried out based on a large amount of labeled data of the general text audio.
The first step: determine the acoustic classification model adopted by the intelligent voice device and save its network parameters, denoted M_c, thereby determining the number N of acoustic classifications and the acoustic classification set. Using the acoustic classification set, express the labeled text of the general-text-audio labeled data as acoustic characterization sequences, and, on the basis of the acoustic classification model and each audio's labeled text, force-align the general text audio (i.e., find the acoustic classification sequence with the highest probability for the audio), thereby obtaining force-aligned acoustic classification sequences.
Segment the acoustic classification sequences at uniformly distributed cut intervals of 30 to 50 frames; denote each segment x_cut. Combine the classifications of each x_cut to obtain its corresponding acoustic characterization sequence s_cut, and put (x_cut, s_cut) into the training data set T as a training sample point. After the data set T is built, find all distinct acoustic characterization sequence elements in T to form the acoustic characterization sequence set K_s, then expand T to T_a based on K_s as follows:
    # Expansion of T into Ta: pair every segment with every candidate
    # characterization; label 1 when the candidate matches the segment's own
    # characterization, else 0 (negative entries carry the mismatched s_K).
    Ta = []
    for s_K in Ks:                       # each distinct characterization in K_s
        for x_cut, s_cut in T:           # each training sample point in T
            label = 1 if s_K == s_cut else 0
            Ta.append(((x_cut, s_K), label))
The second step: build the main-body neural network according to the network structure of the acoustic classification model.
The third step: train the main-body neural network with the training data set T_a; select a suitable parameter initialization method, learning rate, and adaptive gradient algorithm, train with a cross-entropy loss function, and reserve part of the training data for cross-validation during training to avoid overfitting. A sketch of these two steps follows.
The fourth step: save the neural network model parameters, denoted M_g.
S102, performing fine-tuning training based on labeled data of a small amount of specific wake-up word audio.
The fine-tuning uses the labeled data of the specific wake-up word audio to tune the whole intelligent voice pipeline end to end. It requires the acoustic classification model to be a neural network model; if a non-neural-network acoustic classification model is adopted, the following steps can be skipped. The specific process is as follows:
the first step is as follows: determining the acoustic characterization sequence s of the training awakening words and the corresponding awakening wordsf
The second step is that: splicing an acoustic classification model (neural network) with the main body neural network, namely directly connecting an output layer of the acoustic classification model (neural network) with an acoustic classification sequence input layer (Conv layer) of the main body neural network to form an end-to-end network;
the third step: loading activation probability calculation neural network parameters MgAnd acoustic classification model parameters McPerforming cross entropy loss function-based training on the end-to-end network in the second step by using the audio annotation data of the specific awakening word, and reserving a part of training data for cross check in the training process to avoid overfitting;
the fourth step: saving the adjusted acoustic classification model parameters McfAnd masterSomatic neural network parameter Mgf
According to the above steps, once the corresponding parameters are obtained, they can be loaded into the main-body neural network to obtain the neural network model.
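A rough sketch of this end-to-end fine-tuning step, reusing the WakeWordScorer sketch above. The acoustic model stand-in, feature dimensions, and optimizer here are assumptions for illustration, not the patent's architecture.

    import torch
    import torch.nn as nn

    # Hypothetical stand-in for a neural-network acoustic classification model:
    # maps framed audio features (batch, 40, frames) to posteriors (batch, N, frames).
    acoustic_model = nn.Sequential(nn.Conv1d(40, 64, 3, padding=1), nn.Softmax(dim=1))
    scorer = WakeWordScorer(n_classes=64)      # main-body network; M_g loaded in practice

    # Splice the two into one end-to-end trainable pipeline.
    params = list(acoustic_model.parameters()) + list(scorer.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    loss_fn = nn.BCELoss()

    def finetune_step(features, s_f, label):
        # features: audio features of a wake-word sample; s_f: its acoustic
        # characterization sequence (batch, m, N); label: (batch,) 1/0 target.
        x = acoustic_model(features)           # acoustic classification sequence
        p = scorer(x, s_f)                     # end-to-end activation probability
        loss = loss_fn(p, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # e.g. finetune_step(torch.rand(2, 40, 120), torch.rand(2, 4, 64), torch.tensor([1.0, 0.0]))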
Example two
Fig. 4 is a block diagram of a device for determining the activation probability of a wake-up word according to an embodiment of the present application.
As shown in fig. 4, the determination device provided in this embodiment is applied to an intelligent voice device, such as a smart speaker, a smart car head unit, or a smartphone. The device processes audio received by the intelligent voice device to obtain the activation probability of the corresponding wake-up word, and specifically includes a signal acquisition module 10, a first processing module 20, and a second processing module 30.
The signal acquisition module is configured to acquire the audio signal received by the intelligent voice device.
That is, when, in the standby state, the sound collection device of the intelligent voice device collects the sound emitted by the user and converts it into an audio signal, this module acquires the audio signal for further processing.
In addition, this module may also acquire the acoustic characterization sequence of the corresponding wake-up word at the same time as, or after, the audio signal. The acoustic characterization sequence can be expressed as s^keyword, where:
s^keyword = [s^keyword_1, s^keyword_2, …, s^keyword_m] is a two-dimensional real tensor of shape N×m;
keyword represents the wake-up word; N is the total number of acoustic classification categories; m is the length of the acoustic characterization sequence of the specific wake-up word;
the j-th column vector of s^keyword, denoted s^keyword_j, is a one-dimensional real tensor of length N, where 1 ≤ j ≤ m; s^keyword_jh is the value of its h-th element, where 1 ≤ h ≤ N and 0 ≤ s^keyword_jh ≤ 1.
For any given keyword, a unique acoustic characterization s^keyword that represents the keyword can always be constructed manually.
The first processing module is configured to process the audio signal based on the acoustic classification model.
When the audio signal is acquired, it is processed with the pre-trained acoustic classification model to obtain the acoustic classification sequence of the audio signal. The acoustic classification sequence is denoted x, where:
x = [x_1, x_2, …, x_n] is a two-dimensional real tensor of shape N×n; n is the length of the audio frame sequence; N is the total number of acoustic classification categories; the i-th column vector of x is x_i = [x_i1, x_i2, …, x_iN], so x_i is a one-dimensional real tensor of length N, where 1 ≤ i ≤ n; x_ik, the value of the k-th element of x_i, represents the probability that the i-th audio frame belongs to the k-th acoustic classification, where 1 ≤ k ≤ N and 0 ≤ x_ik ≤ 1.
The second processing module is configured to process the acoustic classification sequence and the acoustic characterization sequence based on the neural network model.
The neural network model is imported into the corresponding processing body of the intelligent voice device before processing begins. Here the acoustic classification sequence is that of the audio signal, and the acoustic characterization sequence is that of the wake-up word; the activation probability of the wake-up word is obtained through the processing of the neural network model.
The neural network model may comprise a long short-term memory (LSTM) network layer, convolutional layers, a max-pooling layer, and a linear transformation layer, combined as follows (see fig. 2):
emb = Linear(LSTM(s^keyword))
P_keyword(activation | x) = Sigmoid(Conv_emb(Tanh(Conv(x))))
Here LSTM denotes the long short-term memory network layer, Linear the linear transformation layer, Conv a convolutional layer, Tanh the tanh activation function, MaxPool the max-pooling layer, Conv_emb a convolutional layer whose convolution kernel parameters are emb, emb the acoustic characterization embedding of the wake-up word, and Sigmoid the sigmoid activation function.
The activation probability of the wake-up word is denoted P_keyword(activation | x), where keyword is the specific wake-up word and x is the acoustic classification sequence of the corresponding audio frame sequence.
As can be seen from the foregoing, this embodiment provides a device for determining the activation probability of a wake-up word. Applied to an intelligent voice device, the device acquires the audio signal received by the device; processes the audio signal based on the acoustic classification model to obtain the acoustic classification sequence of the audio signal; and inputs the acoustic classification sequence and the acoustic characterization sequence of the wake-up word into the neural network model to obtain the activation probability of the wake-up word. By accurately determining the activation probability, the scheme allows the intelligent voice device to decide whether it should be activated, providing a key technical link for its normal operation.
The above technical solution is implemented on the basis of a corresponding neural network model. Where no such model exists, the device further includes a first training module 40 and a second training module 50, which provide the neural network model for the above solution, as shown in fig. 5.
The first training module is configured to perform model pre-training based on labeled data of a large amount of general text audio.
Specifically, the first training module performs the following operations:
The first step: determine the acoustic classification model adopted by the intelligent voice device and save its network parameters, denoted M_c, thereby determining the number N of acoustic classifications and the acoustic classification set. Using the acoustic classification set, express the labeled text of the general-text-audio labeled data as acoustic characterization sequences, and, on the basis of the acoustic classification model and each audio's labeled text, force-align the general text audio (i.e., find the acoustic classification sequence with the highest probability for the audio), thereby obtaining force-aligned acoustic classification sequences.
Segment the acoustic classification sequences at uniformly distributed cut intervals of 30 to 50 frames; denote each segment x_cut. Combine the classifications of each x_cut to obtain its corresponding acoustic characterization sequence s_cut, and put (x_cut, s_cut) into the training data set T as a training sample point. After the data set T is built, find all distinct acoustic characterization sequence elements in T to form the acoustic characterization sequence set K_s, then expand T to T_a based on K_s as follows:
    # Expansion of T into Ta: pair every segment with every candidate
    # characterization; label 1 when the candidate matches the segment's own
    # characterization, else 0 (negative entries carry the mismatched s_K).
    Ta = []
    for s_K in Ks:                       # each distinct characterization in K_s
        for x_cut, s_cut in T:           # each training sample point in T
            label = 1 if s_K == s_cut else 0
            Ta.append(((x_cut, s_K), label))
The second step: build the main-body neural network according to the network structure of the acoustic classification model.
The third step: train the main-body neural network with the training data set T_a; select a suitable parameter initialization method, learning rate, and adaptive gradient algorithm, train with a cross-entropy loss function, and reserve part of the training data for cross-validation during training to avoid overfitting.
The fourth step: save the neural network model parameters, denoted M_g.
The second training module is configured to perform fine-tuning training based on labeled data of a small amount of specific wake-up word audio.
The fine-tuning uses the labeled data of the specific wake-up word audio to tune the whole intelligent voice pipeline end to end. It requires the acoustic classification model to be a neural network model; if a non-neural-network acoustic classification model is adopted, the following steps can be skipped. Specifically, this module performs the following operations:
the first step is as follows: determining the acoustic characterization sequence s of the training awakening words and the corresponding awakening wordsf
The second step is that: splicing an acoustic classification model (neural network) with the main body neural network, namely directly connecting an output layer of the acoustic classification model (neural network) with an acoustic classification sequence input layer (Conv layer) of the main body neural network to form an end-to-end network;
the third step: loading activation probability calculation neural network parameters MgAnd acoustic classification model parameters McPerforming cross entropy loss function-based training on the end-to-end network in the second step by using the audio annotation data of the specific awakening word, and reserving a part of training data for cross check in the training process to avoid overfitting;
the fourth step: saving the adjusted acoustic classification model parameters McfAnd subject neural network parameter Mgf
According to the above operations, once the corresponding parameters are obtained, they can be loaded into the main-body neural network to obtain the neural network model.
EXAMPLE III
This embodiment further provides an intelligent voice device, including but not limited to a smart speaker, a smart car head unit, a smartphone, and the like, provided with the device for determining the activation probability of a wake-up word described in the above embodiment. Applied to the intelligent voice device, the device acquires the audio signal received by the intelligent voice device; processes the audio signal based on the acoustic classification model to obtain the acoustic classification sequence of the audio signal; and inputs the acoustic classification sequence and the acoustic characterization sequence of the wake-up word into the neural network model to obtain the activation probability of the wake-up word. By accurately determining the activation probability, the scheme allows the intelligent voice device to decide whether it should be activated, providing a key technical link for its normal operation.
Example four
Fig. 6 is a block diagram of an intelligent speech device according to an embodiment of the present application.
As shown in fig. 6, the intelligent voice device provided in this embodiment, including but not limited to a smart speaker, a smart car head unit, a smartphone, and the like, comprises at least one processor 101 and a memory 102 connected through a data bus 103. The memory stores a computer program or instructions; the processor acquires and executes the corresponding computer program or instructions, causing the intelligent voice device to perform the method for determining the activation probability of a wake-up word provided in the above embodiment.
The method for determining the activation probability of a wake-up word comprises: acquiring the audio signal received by the intelligent voice device; processing the audio signal based on the acoustic classification model to obtain the acoustic classification sequence of the audio signal; and inputting the acoustic classification sequence and the acoustic characterization sequence of the wake-up word into the neural network model to obtain the activation probability of the wake-up word. By accurately determining the activation probability, the scheme allows the intelligent voice device to decide whether it should be activated, providing a key technical link for its normal operation.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The technical solutions provided by the present invention are described in detail above, and the principle and the implementation of the present invention are explained in this document by applying specific examples, and the descriptions of the above examples are only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (14)

1. A method for determining the activation probability of a wake-up word, applied to an intelligent voice device, characterized in that the method comprises the following steps:
acquiring an audio signal received by the intelligent voice device;
processing the audio signal based on an acoustic classification model to obtain an acoustic classification sequence of the audio signal;
and inputting the acoustic classification sequence and the acoustic characterization sequence of the wake-up word into a neural network model to obtain the activation probability of the wake-up word.
2. The determination method of claim 1, wherein the acoustic classification sequence is x, where:
x = [x_1, x_2, …, x_n] is a two-dimensional real tensor of shape N×n;
n is the length of the audio frame sequence;
N is the total number of acoustic classification categories;
the i-th column vector of x is x_i = [x_i1, x_i2, …, x_iN], so x_i is a one-dimensional real tensor of length N, where 1 ≤ i ≤ n; x_ik is the value of the k-th element of x_i and represents the probability that the i-th audio frame belongs to the k-th acoustic classification, where 1 ≤ k ≤ N and 0 ≤ x_ik ≤ 1.
3. The determination method of claim 1, wherein the acoustic characterization sequence is s^keyword, where:
s^keyword = [s^keyword_1, s^keyword_2, …, s^keyword_m] is a two-dimensional real tensor of shape N×m;
keyword represents the wake-up word;
N is the total number of acoustic classification categories;
m is the length of the acoustic characterization sequence of the specific wake-up word;
the j-th column vector of s^keyword, denoted s^keyword_j, is a one-dimensional real tensor of length N, where 1 ≤ j ≤ m; s^keyword_jh is the value of its h-th element, where 1 ≤ h ≤ N and 0 ≤ s^keyword_jh ≤ 1;
for any given keyword, a unique acoustic characterization s^keyword that represents the keyword can always be constructed manually.
4. The determination method of any one of claims 1 to 3, further comprising the steps of:
performing model pre-training based on labeled data of a large amount of general text audio;
and performing fine-tuning training based on labeled data of a small amount of wake-up word audio to obtain the neural network model, wherein general text audio refers to audio whose text is unrelated to the wake-up word, and wake-up word audio refers to audio whose text is the same as or similar to the wake-up word.
5. The determination method of claim 4, wherein the model pre-training based on labeled data of a large amount of general text audio comprises the steps of:
determining an acoustic classification model;
building a main-body neural network according to the network structure of the acoustic classification model;
training the main-body neural network with the labeled data;
and saving the neural network model parameters.
6. The determination method of claim 5, wherein the fine-tuning training based on labeled data of a small amount of wake-up word audio comprises the steps of:
determining the fine-tuning wake-up word and its corresponding acoustic characterization sequence;
splicing the acoustic classification model with the main-body neural network to form an end-to-end network;
loading the activation-probability neural network model parameters and the acoustic classification model parameters, and performing training based on a cross-entropy loss function on the end-to-end network;
and saving the adjusted neural network model parameters and acoustic classification model parameters.
7. A device for determining the activation probability of a wake-up word, applied to an intelligent voice device, characterized in that the device comprises:
a signal acquisition module configured to acquire an audio signal received by the intelligent voice device;
a first processing module configured to process the audio signal based on an acoustic classification model to obtain an acoustic classification sequence of the audio signal;
and a second processing module configured to input the acoustic classification sequence and the acoustic characterization sequence of the wake-up word into a neural network model to obtain the activation probability of the wake-up word.
8. The determination device of claim 7, wherein the acoustic classification sequence is x, where:
x = [x_1, x_2, …, x_n] is a two-dimensional real tensor of shape N×n;
n is the length of the audio frame sequence;
N is the total number of acoustic classification categories;
the i-th column vector of x is x_i = [x_i1, x_i2, …, x_iN], so x_i is a one-dimensional real tensor of length N, where 1 ≤ i ≤ n; x_ik is the value of the k-th element of x_i and represents the probability that the i-th audio frame belongs to the k-th acoustic classification, where 1 ≤ k ≤ N and 0 ≤ x_ik ≤ 1.
9. The determination device of claim 7, wherein the acoustic characterization sequence is s^keyword, where:
s^keyword = [s^keyword_1, s^keyword_2, …, s^keyword_m] is a two-dimensional real tensor of shape N×m;
keyword represents the wake-up word;
N is the total number of acoustic classification categories;
m is the length of the acoustic characterization sequence of the specific wake-up word;
the j-th column vector of s^keyword, denoted s^keyword_j, is a one-dimensional real tensor of length N, where 1 ≤ j ≤ m; s^keyword_jh is the value of its h-th element, where 1 ≤ h ≤ N and 0 ≤ s^keyword_jh ≤ 1;
for any given keyword, a unique acoustic characterization s^keyword that represents the keyword can always be constructed manually.
10. The determination device of any one of claims 7 to 9, further comprising:
a first training module configured to perform model pre-training based on labeled data of a large amount of general text audio;
and a second training module configured to perform fine-tuning training based on labeled data of a small amount of wake-up word audio to obtain the neural network model, wherein general text audio refers to audio whose text is unrelated to the wake-up word, and wake-up word audio refers to audio whose text is the same as or similar to the wake-up word.
11. The determination device of claim 10, wherein the first training module is specifically configured to:
determine an acoustic classification model;
build a main-body neural network according to the network structure of the acoustic classification model;
train the main-body neural network with the labeled data;
and save the neural network model parameters.
12. The determination device of claim 11, wherein the second training module is specifically configured to:
determine the fine-tuning wake-up word and its corresponding acoustic characterization sequence;
splice the acoustic classification model with the main-body neural network to form an end-to-end network;
load the activation-probability neural network model parameters and the acoustic classification model parameters, and perform training based on a cross-entropy loss function on the end-to-end network;
and save the adjusted neural network model parameters and acoustic classification model parameters.
13. An intelligent voice device, characterized in that it is provided with a determination device according to any one of claims 7 to 12.
14. An intelligent voice device comprising at least one processor and a memory connected to the processor, wherein:
the memory is configured to store a computer program or instructions;
and the processor is configured to execute the computer program or instructions to cause the intelligent voice device to perform the determination method according to any one of claims 1 to 6.
CN202110759228.9A 2021-07-05 2021-07-05 Method and device for determining activation probability of wake-up word and intelligent voice product Active CN113450800B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110759228.9A CN113450800B (en) 2021-07-05 2021-07-05 Method and device for determining activation probability of wake-up word and intelligent voice product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110759228.9A CN113450800B (en) 2021-07-05 2021-07-05 Method and device for determining activation probability of wake-up word and intelligent voice product

Publications (2)

Publication Number Publication Date
CN113450800A (en) 2021-09-28
CN113450800B (en) 2024-06-21

Family

ID=77815110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110759228.9A Active CN113450800B (en) 2021-07-05 2021-07-05 Method and device for determining activation probability of wake-up word and intelligent voice product

Country Status (1)

Country Link
CN (1) CN113450800B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023168713A1 (en) * 2022-03-11 2023-09-14 华为技术有限公司 Interactive speech signal processing method, related device and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140226838A1 (en) * 2013-02-13 2014-08-14 Analog Devices, Inc. Signal source separation
CN110288979A (en) * 2018-10-25 2019-09-27 腾讯科技(深圳)有限公司 A kind of audio recognition method and device
CN110444210A (en) * 2018-10-25 2019-11-12 腾讯科技(深圳)有限公司 A kind of method of speech recognition, the method and device for waking up word detection
CN110838289A (en) * 2019-11-14 2020-02-25 腾讯科技(深圳)有限公司 Awakening word detection method, device, equipment and medium based on artificial intelligence
CN111223488A (en) * 2019-12-30 2020-06-02 Oppo广东移动通信有限公司 Voice wake-up method, device, equipment and storage medium
CN111933124A (en) * 2020-09-18 2020-11-13 电子科技大学 Keyword detection method capable of supporting self-defined awakening words
WO2021094607A1 (en) * 2019-11-15 2021-05-20 Sonos Vox France Sas System and method for on-device open-vocabulary keyword spotting

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140226838A1 (en) * 2013-02-13 2014-08-14 Analog Devices, Inc. Signal source separation
CN110288979A (en) * 2018-10-25 2019-09-27 腾讯科技(深圳)有限公司 A kind of audio recognition method and device
CN110444210A (en) * 2018-10-25 2019-11-12 腾讯科技(深圳)有限公司 A kind of method of speech recognition, the method and device for waking up word detection
CN110838289A (en) * 2019-11-14 2020-02-25 腾讯科技(深圳)有限公司 Awakening word detection method, device, equipment and medium based on artificial intelligence
WO2021094607A1 (en) * 2019-11-15 2021-05-20 Sonos Vox France Sas System and method for on-device open-vocabulary keyword spotting
CN111223488A (en) * 2019-12-30 2020-06-02 Oppo广东移动通信有限公司 Voice wake-up method, device, equipment and storage medium
CN111933124A (en) * 2020-09-18 2020-11-13 电子科技大学 Keyword detection method capable of supporting self-defined awakening words

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023168713A1 (en) * 2022-03-11 2023-09-14 华为技术有限公司 Interactive speech signal processing method, related device and system

Also Published As

Publication number Publication date
CN113450800B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
CN110838289B (en) Wake-up word detection method, device, equipment and medium based on artificial intelligence
CN108520741B (en) Method, device and equipment for restoring ear voice and readable storage medium
JP6453917B2 (en) Voice wakeup method and apparatus
CN107767863B (en) Voice awakening method and system and intelligent terminal
CN110570873B (en) Voiceprint wake-up method and device, computer equipment and storage medium
CN110288980A (en) Audio recognition method, the training method of model, device, equipment and storage medium
CN109754789B (en) Method and device for recognizing voice phonemes
CN111710337B (en) Voice data processing method and device, computer readable medium and electronic equipment
CN113314119B (en) Voice recognition intelligent household control method and device
KR20190136578A (en) Method and apparatus for speech recognition
CN111179944B (en) Voice awakening and age detection method and device and computer readable storage medium
CN113450800A (en) Method and device for determining activation probability of awakening words and intelligent voice product
CN114242065A (en) Voice wake-up method and device and training method and device of voice wake-up module
WO2024114303A1 (en) Phoneme recognition method and apparatus, electronic device and storage medium
CN110930997B (en) Method for labeling audio by using deep learning model
CN113178200A (en) Voice conversion method, device, server and storage medium
CN115132195B (en) Voice wakeup method, device, equipment, storage medium and program product
KR20180065761A (en) System and Method of speech recognition based upon digital voice genetic code user-adaptive
CN113314099B (en) Method and device for determining confidence coefficient of speech recognition
US11538474B2 (en) Electronic device and method for controlling the electronic device thereof
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN116705013B (en) Voice wake-up word detection method and device, storage medium and electronic equipment
KR102684936B1 (en) Electronic device and method for controlling the electronic device thereof
CN112951235B (en) Voice recognition method and device
CN117831540A (en) Course learning-based speaker identification method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant