CN116256981A

CN116256981A - Intelligent home control method and system based on Internet of things and electronic equipment

Info

Publication number: CN116256981A
Application number: CN202111498566.8A
Authority: CN
Inventors: 黎璨
Original assignee: Hangzhou Mituo Technology Co ltd
Current assignee: Hangzhou Mituo Technology Co ltd
Priority date: 2021-12-09
Filing date: 2021-12-09
Publication date: 2023-06-13

Abstract

The application discloses an intelligent home control method, system and electronic equipment based on the Internet of Things. In the technical scheme of the application, smart home devices connected to the Internet of Things are controlled intelligently by voice. Compared with the prior art, the intelligent home control method uses not only the semantic information in the control voice but also the emotion patterns in the control voice, realizing more intelligent control of the smart home. For example, when the smart home device is a smart speaker, a user who issues a voice command to increase the volume expects a different degree of increase under different emotion patterns, and the technical scheme of the application can meet this expectation.

Description

Intelligent home control method and system based on Internet of things and electronic equipment

Technical Field

The invention relates to the field of intelligent home control, in particular to an intelligent home control method, system and electronic equipment based on the Internet of things.

Background

The Internet of Things is defined as connecting any article through the Internet of Things domain name, according to an agreed protocol, by means of information sensing equipment such as radio frequency identification, infrared sensors, global positioning systems, and laser scanners, so as to realize intelligent identification, positioning, tracking, monitoring, and management. With the rapid development of the economy and the continuous improvement of living standards, electronic equipment represented by household appliances is increasingly entering ordinary homes. These appliances meet people's various needs, enrich their lives, and greatly improve their quality of life.

However, operating these appliances requires frequently switching them on and off by hand, which is inconvenient and does not meet the requirements of a smart home. When remote controllers are provided for these appliances, different types of appliances need different remote controllers, and the range of a remote controller is small, which greatly limits the smart home. In addition, the appliances in the home cannot be controlled when people are out.

Currently, some manufacturers solve the control problem of the smart home through voice control, that is, by recognizing a voice command issued by a user and then performing control. However, this control manner has some problems. For example, different users express themselves differently; that is, the control purpose is the same but the wording differs, so the smart home may fail to recognize the user's voice, which results in a poor intelligent-control effect.

Moreover, existing voice control uses only the semantic information in the control voice and does not use the pattern information contained in it, such as emotion information. For example, when the user wants to increase the volume of a smart speaker, the user says so, but the smart speaker increases the volume only by the amount set in its default program and cannot adjust the increase based on the emotion in the user's speech, so the smart home is not very intelligent.

Therefore, in order to realize a more intelligent control function for the smart home, a smart home control scheme based on the internet of things is desired.

Disclosure of Invention

The present application has been made in order to solve the above technical problems. The embodiments of the application provide an intelligent home control method, an intelligent home control system and electronic equipment based on the Internet of Things, which use not only the semantic information in the control voice but also the emotion patterns in the control voice to realize more intelligent control of the smart home. For example, when the smart home device is a smart speaker, a user who issues a voice command to increase the volume expects a different degree of increase under different emotion patterns, and the technical scheme of the application can meet this expectation.

According to one aspect of the present application, there is provided an intelligent home control method based on the internet of things, including:

obtaining control voice of a user;

performing word-based speech segmentation on the control speech to obtain a speech word sequence consisting of a plurality of speech words;

converting each voice word in the voice word sequence into a voice vector to obtain a voice vector sequence consisting of a plurality of voice vectors;

passing each speech vector in the speech vector sequence through a semantic understanding model to obtain a speech feature vector sequence consisting of a plurality of speech feature vectors;

arranging a plurality of voice feature vectors in the voice feature vector sequence into a voice input matrix and obtaining a voice feature map through a convolutional neural network;

carrying out global pooling processing based on channel dimension on the voice feature map to obtain a voice feature matrix;

multiplying each voice feature vector in the voice feature vector sequence as a query vector by the voice feature matrix to map the voice feature vector into a high-dimensional feature space of the voice feature matrix to obtain a tag feature vector of each voice feature vector so as to obtain a plurality of tag feature vectors;

estimating a tag score of the plurality of tag feature vectors as a whole based on a maximum conditional likelihood estimation score, wherein the maximum conditional likelihood estimation score is generated by dividing a weighted sum of natural exponential function values whose exponents are the negatives of the feature values at the respective positions of each tag feature vector by a weighted sum of natural exponential function values whose exponents are the negatives of the feature values at the respective positions of all of the plurality of tag feature vectors;

performing mode classification on the control voice based on the tag score to generate a control instruction; and

controlling the smart home device based on the control instruction.

In the above intelligent home control method based on the Internet of Things, performing word-based voice segmentation on the control voice to obtain a voice word sequence composed of a plurality of voice words includes: performing syllable-sequence-based voice segmentation on the control voice to obtain the voice word sequence composed of a plurality of voice words.

In the above intelligent home control method based on the Internet of Things, converting each voice word in the voice word sequence into a voice vector to obtain a voice vector sequence composed of a plurality of voice vectors includes: preprocessing each voice word; performing a Fourier transform on the preprocessed voice word; performing Mel filtering on the Fourier-transformed voice word; performing cepstrum analysis on the Mel-filtered voice word to extract the first preset number of Mel-frequency cepstral coefficients from the Mel-frequency cepstral coefficients of the voice word; and arranging the first preset number of Mel-frequency cepstral coefficients into the voice vector.

In the intelligent home control method based on the Internet of Things, performing global pooling processing based on channel dimension on the voice feature map to obtain a voice feature matrix includes: performing global average pooling or global max pooling based on channel dimension on the voice feature map to obtain the voice feature matrix.

In the intelligent home control method based on the Internet of Things, estimating the tag score of the plurality of tag feature vectors as a whole based on the maximum conditional likelihood estimation score includes: estimating the tag score of the plurality of tag feature vectors as a whole based on the maximum conditional likelihood estimation score with the following formula; the formula is:

wherein Softmax(v_i) refers to the Softmax-like function of each tag feature vector v_i relative to the whole, expressed as:

here, Softmax(v_i) represents the tag score of each speech word in the control speech relative to the whole, wherein x_j is the tag value at each position in each tag feature vector v_i, and bias is a bias term used to adjust the likelihood function.

In the intelligent home control method based on the Internet of Things, performing mode classification on the control voice based on the tag score to generate a control instruction includes: querying a control instruction lookup table for the matching result corresponding to the tag score, wherein the matching result is the control instruction.

According to another aspect of the present application, there is provided an intelligent home control system based on the internet of things, which includes:

a voice acquisition unit for acquiring a control voice of a user;

a voice segmentation unit for performing word-based voice segmentation on the control voice obtained by the voice acquisition unit to obtain a voice word sequence composed of a plurality of voice words;

a speech vector sequence generating unit configured to convert each speech word in the speech word sequence obtained by the speech dividing unit into speech vectors to obtain a speech vector sequence composed of a plurality of speech vectors;

a semantic understanding unit configured to pass each speech vector in the speech vector sequence obtained by the speech vector sequence generating unit through a semantic understanding model to obtain a speech feature vector sequence composed of a plurality of speech feature vectors;

a convolutional neural network processing unit configured to arrange a plurality of voice feature vectors in the voice feature vector sequence obtained by the semantic understanding unit into a voice input matrix and obtain a voice feature map through a convolutional neural network;

a global pooling unit configured to perform global pooling processing based on channel dimension on the voice feature map obtained by the convolutional neural network processing unit to obtain a voice feature matrix;

a mapping unit, configured to matrix multiply each speech feature vector in the speech feature vector sequence obtained by the semantic understanding unit as a query vector with the speech feature matrix obtained by the global pooling unit, and map the speech feature vector into a high-dimensional feature space of the speech feature matrix to obtain a tag feature vector of each speech feature vector, so as to obtain a plurality of tag feature vectors;

a maximum conditional likelihood estimation unit configured to estimate a tag score of the plurality of tag feature vectors obtained by the mapping unit as a whole based on a maximum conditional likelihood estimation score, wherein the maximum conditional likelihood estimation score is generated by dividing a weighted sum of natural exponential function values whose exponents are the negatives of the feature values at the respective positions of each tag feature vector by a weighted sum of natural exponential function values whose exponents are the negatives of the feature values at the respective positions of all of the plurality of tag feature vectors;

a classification unit configured to perform mode classification on the control voice obtained by the voice acquisition unit based on the tag score obtained by the maximum conditional likelihood estimation unit to generate a control instruction; and

a control unit configured to control the smart home device based on the control instruction obtained by the classification unit.

In the intelligent home control system based on the Internet of Things, the voice segmentation unit is further configured to: perform syllable-sequence-based voice segmentation on the control voice to obtain the voice word sequence composed of a plurality of voice words.

In the above intelligent home control system based on the Internet of Things, the voice vector sequence generating unit includes: a preprocessing subunit configured to preprocess each voice word; a Fourier transform subunit configured to perform a Fourier transform on the preprocessed voice word obtained by the preprocessing subunit; a Mel filtering subunit configured to perform Mel filtering on the Fourier-transformed voice word obtained by the Fourier transform subunit; a cepstrum analysis subunit configured to perform cepstrum analysis on the Mel-filtered voice word obtained by the Mel filtering subunit to extract the first preset number of Mel-frequency cepstral coefficients from the Mel-frequency cepstral coefficients of the voice word; and an arrangement subunit configured to arrange the first preset number of Mel-frequency cepstral coefficients obtained by the cepstrum analysis subunit into the voice vector.

In the intelligent home control system based on the Internet of Things, the global pooling unit is further configured to: perform global average pooling or global max pooling based on channel dimension on the voice feature map to obtain the voice feature matrix.

In the intelligent home control system based on the Internet of Things, the maximum conditional likelihood estimation unit is further configured to: estimate the tag score of the plurality of tag feature vectors as a whole based on the maximum conditional likelihood estimation score with the following formula; the formula is:

here, Softmax(v_i) represents the tag score of each speech word in the control speech relative to the whole, where x_j is the tag value at each position in each tag feature vector v_i, and bias is a bias term used to adjust the likelihood function.

In the intelligent home control system based on the Internet of Things, the classification unit is further configured to: query a control instruction lookup table for the matching result corresponding to the tag score, wherein the matching result is the control instruction.

According to still another aspect of the present application, there is provided an electronic apparatus including: a processor; and a memory in which computer program instructions are stored which, when executed by the processor, cause the processor to perform the intelligent home control method based on the internet of things as described above.

According to yet another aspect of the present application, there is provided a computer readable medium having stored thereon computer program instructions, which when executed by a processor, cause the processor to perform the intelligent home control method based on the internet of things as described above.

Compared with the prior art, the intelligent home control method, system and electronic equipment based on the Internet of Things use not only the semantic information in the control voice but also the emotion patterns in the control voice to realize more intelligent control of the smart home. For example, when the smart home device is a smart speaker, a user who issues a voice command to increase the volume expects a different degree of increase under different emotion patterns, and the technical scheme of the application can meet this expectation.

Drawings

The foregoing and other objects, features and advantages of the present application will become more apparent from the following more particular description of embodiments of the present application, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification; together with the embodiments they serve to explain the application and do not constitute a limitation of the application. In the drawings, like reference numerals generally refer to like parts or steps.

Fig. 1 is an application scenario diagram of an intelligent home control method based on the internet of things according to an embodiment of the present application;

Fig. 2 is a flowchart of an intelligent home control method based on the Internet of Things according to an embodiment of the present application;

Fig. 3 is a schematic system architecture diagram of an intelligent home control method based on the Internet of Things according to an embodiment of the present application;

Fig. 4 is a flowchart of converting each voice word in the voice word sequence into a voice vector to obtain a voice vector sequence composed of a plurality of voice vectors in the intelligent home control method based on the Internet of Things according to an embodiment of the present application;

Fig. 5 is a block diagram of an intelligent home control system based on the Internet of Things according to an embodiment of the present application;

Fig. 6 is a block diagram of a speech vector sequence generating unit in an intelligent home control system based on the Internet of Things according to an embodiment of the present application;

Fig. 7 is a block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.

Scene overview

As described above, some manufacturers currently solve the control problem of the smart home through voice control, that is, by recognizing a voice command issued by a user and then performing control. However, this control manner has some problems. For example, different users express themselves differently; that is, the control purpose is the same but the wording differs, so the smart home may fail to recognize the user's voice, which results in a poor intelligent-control effect.

Based on this, in the technical solution of the present application, after the control speech of the user is obtained, word segmentation is performed on the control speech to convert it into a plurality of speech vectors corresponding to a plurality of speech words. The speech vectors are then input into a semantic understanding model, such as a BERT model, to obtain a sequence of speech feature vectors, that is, a plurality of speech feature vectors, and semantic recognition is then performed based on the plurality of speech feature vectors.

In order to perform mode classification on the plurality of speech feature vectors, in the technical solution of the present application, a label of each speech feature vector is first determined; specifically, this is achieved by mining the association relations among the plurality of speech feature vectors. That is, the plurality of speech feature vectors are arranged into a speech input matrix, the speech input matrix is input into a convolutional neural network to obtain a speech feature map, and the speech feature map is then subjected to global pooling along the channel dimension to obtain a speech feature matrix. The tag feature vector corresponding to each speech feature vector is then obtained by multiplying each of the plurality of speech feature vectors, as a query vector, with the speech feature matrix.
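The arrange, convolve, pool, and query steps described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the implementation of the application: the function names, the naive same-padded convolution with random untrained kernels, the ReLU activation, and the orientation of the query product (each vector multiplied by the transpose of the speech feature matrix) are all hypothetical choices made so the shapes line up.

```python
import numpy as np

def conv2d_same(x, kernels):
    """Naive 'same'-padded 2D cross-correlation: (H, W) input -> (C, H, W) feature map.

    Assumes odd kernel sizes so that the output keeps the input's height and width.
    """
    c, kh, kw = kernels.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    h, w = x.shape
    out = np.empty((c, h, w))
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                out[ch, i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernels[ch])
    return np.maximum(out, 0.0)  # ReLU activation (an assumption)

def tag_feature_vectors(speech_features, kernels, pooling="mean"):
    """speech_features: (N, D) speech input matrix, one row per speech word."""
    feature_map = conv2d_same(speech_features, kernels)   # (C, N, D)
    if pooling == "mean":
        feature_matrix = feature_map.mean(axis=0)          # pool over channels -> (N, D)
    elif pooling == "max":
        feature_matrix = feature_map.max(axis=0)
    else:
        raise ValueError("pooling must be 'mean' or 'max'")
    # Each row of speech_features is used as a query vector:
    # (N, D) @ (D, N) -> (N, N), one N-dimensional tag feature vector per word.
    return speech_features @ feature_matrix.T

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))    # 4 speech words, 8-dim features (illustrative sizes)
kernels = rng.normal(size=(3, 3, 3))  # 3 channels of 3x3 kernels (untrained, illustrative)
tags = tag_feature_vectors(features, kernels)
print(tags.shape)  # (4, 4)
```

Pooling over the channel axis collapses the C-channel feature map into a single matrix with the same height and width as the input, which is what allows each original speech feature vector to be multiplied against it.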

Here, the tag feature vector corresponding to each speech feature vector is essentially the tag feature vector corresponding to each speech word in the original speech. Further, based on the calculation rule of the maximum conditional likelihood estimation score, the tag score of the speech words of the original speech as a whole, that is, the tag score of the original speech as a whole, is expressed as:

wherein Softmax(v_i) refers to the Softmax-like function of each tag feature vector v_i relative to the whole, which can be expressed as:

Here, Softmax(v_i) represents the tag score of each speech word in the original speech relative to the whole, where x_j is the tag value at each position in each tag feature vector v_i, and bias is a bias term used to adjust the likelihood function, which can be obtained as a hyperparameter during training of the neural network model.
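As a rough illustration of the verbal description of the Softmax-like function (a weighted sum of exp(-x_j) over the positions of one tag feature vector, divided by the same weighted sum taken over all tag feature vectors), the following Python sketch computes a per-word score. The function name, the uniform default weights, and the placement of bias as an additive term in the denominator are assumptions made for illustration only; they are not taken from the application, and the sketch does not attempt to reproduce how the per-word scores are combined into the overall tag score.

```python
import numpy as np

def softmax_like_scores(tag_vectors, weights=None, bias=0.0):
    """Per-word Softmax-like scores from tag feature vectors.

    tag_vectors: (N, K) array, one tag feature vector v_i per row.
    weights:     (K,) position weights; uniform if omitted (assumption).
    bias:        additive term in the denominator (assumed placement).
    """
    tag_vectors = np.asarray(tag_vectors, dtype=float)
    if weights is None:
        weights = np.ones(tag_vectors.shape[1])
    # Numerator for each v_i: weighted sum over positions j of exp(-x_j).
    per_vector = (np.exp(-tag_vectors) * weights).sum(axis=1)
    # Denominator: the same weighted sum taken over all tag feature vectors.
    return per_vector / (per_vector.sum() + bias)

scores = softmax_like_scores([[0.2, 1.5, -0.3], [1.0, 0.0, 2.0]])
print(scores, scores.sum())
```

With bias set to zero the per-word scores sum to one, so each score can be read as the share of the whole utterance attributed to that speech word; a positive bias scales all scores down.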

Thus, after the tag score of the original voice as a whole is obtained, the original voice can be subjected to mode classification based on the tag score, or other values, such as a volume adjustment amount, can be obtained directly from a lookup table by using the tag score.
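A lookup from tag score to control instruction could be sketched as below. The thresholds and instruction strings are hypothetical placeholders, since the application does not give the contents of the lookup table; only the idea of mapping a score range to an instruction comes from the text.

```python
from bisect import bisect_right

# Hypothetical table: score boundaries and the instruction for each range.
THRESHOLDS = [0.25, 0.5, 0.75]
INSTRUCTIONS = ["volume +1", "volume +2", "volume +4", "volume +8"]

def lookup_instruction(tag_score):
    """Return the instruction whose score range contains tag_score."""
    return INSTRUCTIONS[bisect_right(THRESHOLDS, tag_score)]

print(lookup_instruction(0.1))  # volume +1
print(lookup_instruction(0.9))  # volume +8
```

Keeping the boundaries in a sorted list makes the lookup a binary search, and the table can be retuned without touching the classification code.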

Based on this, the application provides an intelligent home control method based on the Internet of Things, which comprises the following steps: obtaining control voice of a user; performing word-based speech segmentation on the control speech to obtain a speech word sequence consisting of a plurality of speech words; converting each voice word in the voice word sequence into a voice vector to obtain a voice vector sequence consisting of a plurality of voice vectors; passing each speech vector in the speech vector sequence through a semantic understanding model to obtain a speech feature vector sequence consisting of a plurality of speech feature vectors; arranging a plurality of voice feature vectors in the voice feature vector sequence into a voice input matrix and obtaining a voice feature map through a convolutional neural network; carrying out global pooling processing based on channel dimension on the voice feature map to obtain a voice feature matrix; multiplying each voice feature vector in the voice feature vector sequence as a query vector by the voice feature matrix to map the voice feature vector into a high-dimensional feature space of the voice feature matrix to obtain a tag feature vector of each voice feature vector so as to obtain a plurality of tag feature vectors; estimating a tag score of the plurality of tag feature vectors as a whole based on a maximum conditional likelihood estimation score, wherein the maximum conditional likelihood estimation score is generated by dividing a weighted sum of natural exponential function values whose exponents are the negatives of the feature values at the respective positions of each tag feature vector by a weighted sum of natural exponential function values whose exponents are the negatives of the feature values at the respective positions of all of the plurality of tag feature vectors; performing mode classification on the control voice based on the tag score to generate a control instruction; and controlling the smart home device based on the control instruction.

Fig. 1 illustrates an application scenario diagram of an intelligent home control method based on the Internet of Things according to an embodiment of the application. As shown in Fig. 1, in this application scenario, a user issues control voice to smart home devices (e.g., H as illustrated in Fig. 1) deployed indoors. The smart home devices include, but are not limited to, smart speakers, smart refrigerators, smart televisions, and smart microwave ovens, and each smart home device is interconnected with the other smart home devices over the Internet to form a smart home system. Then, the obtained control voice of the user is input into a server (e.g., S as illustrated in Fig. 1) on which an intelligent home control algorithm based on the Internet of Things is deployed, and the server processes the control voice of the user with this algorithm to generate a control instruction. The smart home device is then controlled based on the control instruction.

Having described the basic principles of the present application, various non-limiting embodiments of the present application will now be described in detail with reference to the accompanying drawings.

Exemplary method

Fig. 2 illustrates a flowchart of an intelligent home control method based on the Internet of Things. As shown in Fig. 2, the intelligent home control method based on the Internet of Things according to an embodiment of the present application includes: S110, obtaining control voice of a user; S120, performing word-based voice segmentation on the control voice to obtain a voice word sequence composed of a plurality of voice words; S130, converting each voice word in the voice word sequence into a voice vector to obtain a voice vector sequence composed of a plurality of voice vectors; S140, passing each voice vector in the voice vector sequence through a semantic understanding model to obtain a voice feature vector sequence composed of a plurality of voice feature vectors; S150, arranging a plurality of voice feature vectors in the voice feature vector sequence into a voice input matrix and obtaining a voice feature map through a convolutional neural network; S160, carrying out global pooling processing based on channel dimension on the voice feature map to obtain a voice feature matrix; S170, matrix-multiplying each voice feature vector in the voice feature vector sequence, as a query vector, with the voice feature matrix, so as to map the voice feature vector into a high-dimensional feature space of the voice feature matrix and obtain a tag feature vector of each voice feature vector, thereby obtaining a plurality of tag feature vectors; S180, estimating a tag score of the plurality of tag feature vectors as a whole based on a maximum conditional likelihood estimation score, wherein the maximum conditional likelihood estimation score is generated by dividing a weighted sum of natural exponential function values whose exponents are the negatives of the feature values at the respective positions of each tag feature vector by a weighted sum of natural exponential function values whose exponents are the negatives of the feature values at the respective positions of all of the plurality of tag feature vectors; S190, performing mode classification on the control voice based on the tag score to generate a control instruction; and S200, controlling the smart home device based on the control instruction.

Fig. 3 illustrates an architecture schematic diagram of an intelligent home control method based on the Internet of Things according to an embodiment of the application. As shown in Fig. 3, in the network architecture of the intelligent home control method based on the Internet of Things, first, word-based voice segmentation is performed on the obtained control voice (e.g., P1 as illustrated in Fig. 3) to obtain a voice word sequence (e.g., P2 as illustrated in Fig. 3) composed of a plurality of voice words; next, each speech word in the sequence of speech words is converted into a speech vector to obtain a sequence of speech vectors consisting of a plurality of speech vectors (e.g., V1 as illustrated in Fig. 3); then, each speech vector in the sequence of speech vectors is passed through a semantic understanding model (e.g., SUM as illustrated in Fig. 3) to obtain a sequence of speech feature vectors (e.g., VF1 as illustrated in Fig. 3) composed of a plurality of speech feature vectors; next, the plurality of speech feature vectors in the sequence of speech feature vectors are arranged into a speech input matrix (e.g., M1 as illustrated in Fig. 3) and passed through a convolutional neural network (e.g., CNN as illustrated in Fig. 3) to obtain a speech feature map (e.g., F1 as illustrated in Fig. 3); then, the speech feature map is subjected to global pooling processing based on channel dimension to obtain a speech feature matrix (e.g., MF as illustrated in Fig. 3); next, each speech feature vector in the sequence of speech feature vectors is matrix-multiplied, as a query vector, with the speech feature matrix to map the speech feature vector into a high-dimensional feature space of the speech feature matrix and obtain a tag feature vector for each speech feature vector, thereby obtaining a plurality of tag feature vectors (e.g., VF2 as illustrated in Fig. 3); then, a tag score (e.g., LS as illustrated in Fig. 3) of the plurality of tag feature vectors as a whole is estimated based on the maximum conditional likelihood estimation score; then, mode classification is performed on the control speech based on the tag score to generate a control instruction (e.g., C as illustrated in Fig. 3); and finally, the smart home device is controlled based on the control instruction.

In step S110 and step S120, a control voice of the user is obtained, and word-based voice segmentation is performed on the control voice to obtain a voice word sequence composed of a plurality of voice words. As described above, controlling a smart home system by voice raises several problems. For example, when users express the same control purpose in different ways, the smart home may fail to recognize the user's voice. In addition, existing voice control uses only the semantic information of the voice and does not use the pattern information expressed in the voice, such as emotion information. As a result, if the user wants to increase the volume of a smart speaker, the smart speaker increases the volume only by the amount set in its default program and cannot adjust the increase based on the emotion in the user's voice, so the smart home is not very intelligent. Therefore, in the technical solution of the present application, in addition to recognizing the semantics of the speech through the semantic understanding model, it is desirable to further mine the patterns in the speech so that the control is more intelligent.

That is, in the technical solution of the present application, the control voice of the user is first acquired. In a specific example, the control voice may be acquired through a voice receiver in a terminal device; it should be understood that the terminal device includes, but is not limited to, electronic devices such as a smartphone, a tablet, a notebook computer, and a smart bracelet. Correspondingly, smart home devices are arranged indoors, including, but not limited to, smart speakers, smart refrigerators, smart televisions, and smart microwave ovens, and the terminal device and the smart home devices are interconnected through the Internet of Things. Preferably, in this application scenario, the user issues the control voice directly to the corresponding smart home device; for example, the user addresses the corresponding smart home device by its name and speaks the control command.

Then, word-based speech segmentation is performed on the control speech to obtain a speech word sequence composed of a plurality of speech words, so that the control speech is subsequently converted into a plurality of speech vectors corresponding to the plurality of speech words. In one particular example, the control speech may be subjected to syllable-sequence-based speech segmentation to obtain a sequence of speech words consisting of a plurality of speech words.

In step S130 and step S140, each speech word in the speech word sequence is converted into a speech vector to obtain a speech vector sequence composed of a plurality of speech vectors, and each speech vector in the speech vector sequence is passed through a semantic understanding model to obtain a speech feature vector sequence composed of a plurality of speech feature vectors. That is, first, each speech word in the speech word sequence is converted into a speech vector to obtain a speech vector sequence; then, each speech vector in the speech vector sequence is processed by a semantic understanding model to extract the semantic features in the speech information, thereby obtaining a sequence of speech feature vectors. In one specific example, a BERT model may be used to process each speech vector in the sequence of speech vectors to obtain the sequence of speech feature vectors.

Specifically, in the embodiment of the present application, converting each speech word in the speech word sequence into a speech vector to obtain a speech vector sequence composed of a plurality of speech vectors includes the following. First, each of the speech words is preprocessed. That is, each speech word is passed through an analog-to-digital converter and converted into a digital signal so that it can subsequently be processed by a computer. The conversion includes two steps, sampling and quantization, i.e., converting the continuous waveform of the sound into discrete data points at a certain sampling rate and number of sampling bits.
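By way of illustration only, the sampling and quantization described above can be sketched as follows. The function name `sample_and_quantize`, the 16 kHz sampling rate and the 16-bit depth are assumptions chosen for the example and are not specified in the present application; a real analog-to-digital converter performs this step in hardware.

```python
import numpy as np

def sample_and_quantize(signal_fn, duration_s, sample_rate=16000, bit_depth=16):
    """Sample a continuous waveform and quantize it to signed integers.

    signal_fn maps an array of times (seconds) to amplitudes in [-1, 1].
    sample_rate and bit_depth are illustrative defaults (assumptions).
    """
    n_samples = int(round(duration_s * sample_rate))
    t = np.arange(n_samples) / sample_rate           # sampling instants
    x = np.clip(signal_fn(t), -1.0, 1.0)             # guard against overshoot
    max_int = 2 ** (bit_depth - 1) - 1               # e.g. 32767 for 16 bits
    dtype = np.int16 if bit_depth <= 16 else np.int32
    return np.round(x * max_int).astype(dtype)       # quantization step

# Usage: 10 ms of a 440 Hz tone becomes 160 discrete 16-bit samples.
pcm = sample_and_quantize(lambda t: np.sin(2 * np.pi * 440.0 * t), 0.01)
```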

Then, the preprocessed speech word is subjected to a Fourier transform to transform the time domain features of the digitized speech word into the frequency domain. It should be understood that sound is an analog signal, and its time domain waveform only represents how the sound pressure changes with time and cannot represent the characteristics of the sound well. Therefore, in the technical solution of the present application, the sound waveform is subjected to a discrete Fourier transform to extract the spectrum over discrete frequency bands from the discrete signal.

Then, Mel filtering is performed on the speech word after the Fourier transform. It will be appreciated that the sensitivity of human hearing differs across frequency bands: the human ear is less sensitive to high frequencies than to low frequencies, with the dividing line at approximately 1000 Hz. Simulating this property of human hearing when extracting sound features can therefore improve recognition performance. It is worth mentioning that the correspondence between frequency (in Hz) and the Mel scale is approximately linear below 1000 Hz and approximately logarithmic above 1000 Hz, and the calculation formula is as follows: Mel(f) = 1127 ln(1 + f/700).
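As a small worked example of the formula above, the conversion between Hz and the Mel scale can be written as follows. The function names `hz_to_mel` and `mel_to_hz` are chosen for illustration; the inverse is obtained by solving the formula for f.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Mel(f) = 1127 ln(1 + f/700), as given in the description."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse mapping, obtained by solving the formula above for f."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)

# With these constants, 1000 Hz maps to approximately 1000 Mel.
mel_at_1khz = hz_to_mel(1000.0)
```

The constants are such that 1000 Hz corresponds to approximately 1000 Mel, which is consistent with the dividing line mentioned above.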

Then, cepstrum analysis is performed on the speech word after the Mel filtering to extract the first preset number of Mel frequency cepstrum coefficients from the Mel frequency cepstrum coefficients of the speech word. It should be understood that the spectrum transforms the time domain signal into a frequency domain signal, and the cepstrum transforms the frequency domain signal back into a time-like domain. The advantage of cepstrum coefficients is that the variations of different coefficients are uncorrelated, which means that a Gaussian acoustic model does not need to model the covariance between all Mel frequency cepstrum coefficient features, thus greatly reducing the number of parameters and improving the recognition performance. It should be noted that, given the logarithmic energy output by each filter, the cepstrum coefficients can be obtained by a discrete cosine transform:

wherein, L refers to the order of the Mel frequency cepstrum coefficient, and 12 orders can represent acoustic characteristics; m refers to the number of triangular filters.
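For reference, a conventional form of this discrete cosine transform, written with the symbols L and m as defined above, is given below. The symbol s(k) for the logarithmic energy output by the k-th triangular filter, the summation index k and the coefficient index n are assumptions introduced here for illustration; this is the textbook form and may differ in detail from the formula of the present application.

$$C(n) = \sum_{k=1}^{m} s(k)\,\cos\!\left(\frac{\pi n\,(k - 0.5)}{m}\right), \qquad n = 1, 2, \ldots, L$$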

Finally, the first preset number of Mel frequency cepstrum coefficients are arranged as the speech vector sequence.

Fig. 4 illustrates a flowchart of converting each speech word in the speech word sequence into a speech vector to obtain a speech vector sequence composed of a plurality of speech vectors in the intelligent home control method based on the internet of things according to the embodiment of the application. As shown in fig. 4, in the embodiment of the present application, converting each speech word in the speech word sequence into a speech vector to obtain a speech vector sequence composed of a plurality of speech vectors includes: S210, preprocessing each speech word; S220, performing a Fourier transform on the preprocessed speech word; S230, performing Mel filtering on the speech word after the Fourier transform; S240, performing cepstrum analysis on the speech word after the Mel filtering to extract the first preset number of Mel frequency cepstrum coefficients from the Mel frequency cepstrum coefficients of the speech word; and S250, arranging the first preset number of Mel frequency cepstrum coefficients into the speech vector sequence.
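Steps S210 to S250 can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the implementation of the present application: each speech word is reduced to a single windowed frame, and the frame length (512), the number of triangular filters (26), the number of retained coefficients (12), the Hamming window and all function names are illustrative choices.

```python
import numpy as np

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are equally spaced on the Mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                 # rising edge
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                # falling edge
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc_for_word(samples, sample_rate=16000, n_fft=512, n_filters=26, n_coeffs=12):
    # S210: preprocessing (assumption: pad or truncate to one frame, then window)
    x = np.asarray(samples, dtype=np.float64)
    frame = np.zeros(n_fft)
    frame[:min(len(x), n_fft)] = x[:n_fft]
    frame *= np.hamming(n_fft)
    # S220: Fourier transform, keeping the power spectrum
    power = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2 / n_fft
    # S230: Mel filtering, followed by the logarithm of each filter's energy
    log_energy = np.log(mel_filterbank(n_filters, n_fft, sample_rate) @ power + 1e-10)
    # S240: cepstrum analysis via a discrete cosine transform, keeping the
    # first n_coeffs coefficients
    k = np.arange(1, n_filters + 1)
    n = np.arange(1, n_coeffs + 1)[:, None]
    basis = np.cos(np.pi * n * (k - 0.5) / n_filters)
    return basis @ log_energy

def speech_vector_sequence(words, **kwargs):
    # S250: arrange the retained coefficients of every word into a sequence
    return np.stack([mfcc_for_word(w, **kwargs) for w in words])

# Usage with two synthetic "speech words" (pure tones stand in for real audio).
t = np.arange(800) / 16000.0
words = [np.sin(2 * np.pi * 300.0 * t), np.sin(2 * np.pi * 1200.0 * t)]
seq = speech_vector_sequence(words)   # one 12-dimensional vector per word
```

In practice a speech word would span several overlapping frames; collapsing it to one frame keeps the sketch short and is not implied by the description.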

In step S150 and step S160, the plurality of speech feature vectors in the speech feature vector sequence are arranged into a speech input matrix, which is passed through a convolutional neural network to obtain a speech feature map, and global pooling processing based on the channel dimension is performed on the speech feature map to obtain a speech feature matrix. It should be understood that, in order to classify the pattern of the plurality of speech feature vectors, in the technical solution of the present application, the label of each speech feature vector is first determined; in particular, this may be achieved by mining the association relationships among the plurality of speech feature vectors. That is, first, the plurality of speech feature vectors are arranged into a speech input matrix; then, the speech input matrix is input into a convolutional neural network, which processes it to extract high-dimensional implicit association features among the speech feature vectors, thereby obtaining a speech feature map; finally, global pooling processing is performed on the speech feature map along the channel dimension to obtain a speech feature matrix. It should be appreciated that the global pooling processing reduces the number of parameters and thereby helps prevent overfitting. In a specific example, global average pooling or global maximum pooling based on the channel dimension may be performed on the speech feature map to obtain the speech feature matrix.
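The pooling step can be illustrated with the following sketch. It assumes that the speech feature map is stored as an array of shape (channels, height, width) and reads "global pooling based on the channel dimension" as collapsing the channel axis so that a two-dimensional matrix remains; both the layout and the function name `global_pool_over_channels` are assumptions for this example.

```python
import numpy as np

def global_pool_over_channels(feature_map, mode="mean"):
    """Collapse the channel axis of a (channels, height, width) feature map."""
    fm = np.asarray(feature_map, dtype=np.float64)
    if mode == "mean":
        return fm.mean(axis=0)     # global average pooling over channels
    if mode == "max":
        return fm.max(axis=0)      # global maximum pooling over channels
    raise ValueError("mode must be 'mean' or 'max'")

# Usage: a toy feature map with 2 channels of size 3 x 4.
toy_map = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
mean_matrix = global_pool_over_channels(toy_map, "mean")
max_matrix = global_pool_over_channels(toy_map, "max")
```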

In step S170, each speech feature vector in the speech feature vector sequence is matrix-multiplied with the speech feature matrix as a query vector, and the speech feature vector is mapped into a high-dimensional feature space of the speech feature matrix to obtain a tag feature vector of each speech feature vector, so as to obtain a plurality of tag feature vectors. That is, the plurality of speech feature vectors are multiplied by the speech feature matrix as query vectors, respectively, to obtain a tag feature vector corresponding to each of the speech feature vectors. It should be noted that, here, the tag feature vector corresponding to each speech feature vector is essentially the tag feature vector corresponding to each speech word in the original speech. The tag feature vector representation fuses the implicit association information of the voice feature vector and the voice feature matrix.
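A minimal sketch of this mapping is given below. It assumes that the speech feature vectors are stacked row-wise into an array of shape (number of words, d) and that the speech feature matrix has shape (d, k), so that each query vector is mapped to a k-dimensional tag feature vector; the shapes and the function name `tag_feature_vectors` are illustrative assumptions.

```python
import numpy as np

def tag_feature_vectors(speech_features, feature_matrix):
    """Multiply every speech feature vector, as a query, by the feature matrix."""
    queries = np.asarray(speech_features, dtype=np.float64)   # (n_words, d)
    matrix = np.asarray(feature_matrix, dtype=np.float64)     # (d, k)
    if queries.shape[1] != matrix.shape[0]:
        raise ValueError("vector length must match the matrix row count")
    return queries @ matrix                                   # (n_words, k)

# Usage: 3 words with 4-dimensional features mapped into a 5-dimensional space.
rng = np.random.default_rng(0)
features = rng.normal(size=(3, 4))
matrix = rng.normal(size=(4, 5))
tags = tag_feature_vectors(features, matrix)
```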

In step S180, a tag score of the plurality of tag feature vectors as a whole is estimated based on a maximum conditional likelihood estimation score, where the maximum conditional likelihood estimation score is generated from the result of dividing a weighted sum of natural exponential function values, each taking as its exponent the negative of the feature value at a position of a given tag feature vector, by the weighted sum of the corresponding natural exponential function values over every position of all of the plurality of tag feature vectors. It should be understood that, because the tag feature vector corresponding to each speech feature vector is essentially the tag feature vector corresponding to each speech word in the original speech, in the technical solution of the present application, the tag score of the speech words of the original speech as a whole, that is, the tag score of the original speech as a whole, can further be obtained based on the calculation rule of the maximum conditional likelihood estimation score.

Specifically, in the embodiment of the present application, the process of estimating the tag score of the plurality of tag feature vectors as a whole based on the maximum conditional likelihood estimation score includes: estimating the tag score of the plurality of tag feature vectors as a whole based on the maximum conditional likelihood estimation score with the following formula.

the formula is:

Here, softmax(v_i) represents the tag score of each speech word in the control speech as a whole, where x_j is the feature value at each position of each tag feature vector v_i, and bias is a bias term used to adjust the likelihood function, which can be obtained as a hyperparameter during training of the neural network model.
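One possible reading of the verbal description of step S180 can be sketched as follows. The sketch rests on several assumptions that are not confirmed by the text: the weights of the weighted sums default to 1, the bias term is simply added to the ratio, and one score is produced per tag feature vector. The function name `tag_scores` is likewise hypothetical.

```python
import numpy as np

def tag_scores(tag_vectors, weights=None, bias=0.0):
    """Score each tag feature vector v_i by sum_j w * exp(-x_j) over its own
    positions, divided by the same sum over all positions of all vectors.

    Uniform weights and an additive bias are assumptions of this sketch.
    """
    v = np.asarray(tag_vectors, dtype=np.float64)          # (n_words, k)
    w = np.ones_like(v) if weights is None else np.asarray(weights, dtype=np.float64)
    # Subtracting the minimum leaves the ratio unchanged and avoids overflow.
    e = w * np.exp(-(v - v.min()))
    return e.sum(axis=1) / e.sum() + bias

# Usage: with zero bias the scores of all words sum to 1.
scores = tag_scores([[1.0, 2.0], [3.0, 0.5]])
```

Because the feature values enter with a negative sign, a tag feature vector with smaller feature values receives a larger score under this reading.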

In step S190 and step S200, pattern classification is performed on the control voice based on the tag score to generate a control instruction, and the smart home device is controlled based on the control instruction. In a specific example, after the tag score of the original speech as a whole is obtained, pattern classification may be performed on the original control voice based on the tag score; that is, a control instruction lookup table is queried for the matching result corresponding to the tag score, and the matching result is the control instruction. It should be noted that other values, such as the amount of volume adjustment, may also be obtained directly from the tag score via a lookup table. Finally, the smart home device is controlled based on the control instruction.
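The lookup described above can be sketched as a table of score ranges. The thresholds and the instruction names below are hypothetical values invented for this example; the present application does not specify the contents of the control instruction lookup table.

```python
from bisect import bisect_right

# Hypothetical lookup table: upper bounds of score ranges and the instruction
# assigned to each range (one more instruction than there are bounds).
SCORE_UPPER_BOUNDS = [0.25, 0.50, 0.75]
INSTRUCTIONS = [
    "volume_up_small",
    "volume_up_medium",
    "volume_up_large",
    "volume_up_max",
]

def lookup_instruction(tag_score: float) -> str:
    """Return the instruction whose score range contains tag_score."""
    return INSTRUCTIONS[bisect_right(SCORE_UPPER_BOUNDS, tag_score)]

# Usage: a score of 0.6 falls in the third range.
instruction = lookup_instruction(0.6)
```

A table of ranges is used because the tag score is continuous; a table keyed on exact score values would rarely produce a match.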

In summary, the intelligent home control method based on the internet of things is explained, which can not only utilize semantic information in control voice, but also utilize emotion modes in the control voice to realize more intelligent control of the intelligent home. For example, when the smart home is a smart speaker, the user's voice command for increasing the volume in different emotion modes expects different degrees of volume increase, and the technical scheme of the application can achieve the technical purpose.

Exemplary System

Fig. 5 illustrates a block diagram of an intelligent home control system based on the internet of things according to an embodiment of the present application. As shown in fig. 5, an intelligent home control system 500 based on the internet of things according to an embodiment of the present application includes: a voice acquisition unit 510 for acquiring a control voice of a user; a voice segmentation unit 520 for performing word-based voice segmentation on the control voice obtained by the voice acquisition unit 510 to obtain a speech word sequence composed of a plurality of speech words; a speech vector sequence generating unit 530 for converting each speech word in the speech word sequence obtained by the voice segmentation unit 520 into a speech vector to obtain a speech vector sequence composed of a plurality of speech vectors; a semantic understanding unit 540 for passing each of the speech vectors in the speech vector sequence obtained by the speech vector sequence generating unit 530 through a semantic understanding model to obtain a speech feature vector sequence composed of a plurality of speech feature vectors; a convolutional neural network processing unit 550 for arranging the plurality of speech feature vectors in the speech feature vector sequence obtained by the semantic understanding unit 540 into a speech input matrix and obtaining a speech feature map through a convolutional neural network; a global pooling unit 560 for performing global pooling processing based on the channel dimension on the speech feature map obtained by the convolutional neural network processing unit 550 to obtain a speech feature matrix; a mapping unit 570 for matrix-multiplying each speech feature vector in the speech feature vector sequence obtained by the semantic understanding unit 540, as a query vector, with the speech feature matrix obtained by the global pooling unit 560, so as to map the speech feature vector into the high-dimensional feature space of the speech feature matrix and obtain a tag feature vector of each speech feature vector, thereby obtaining a plurality of tag feature vectors; a maximum conditional likelihood estimating unit 580 for estimating a tag score of the plurality of tag feature vectors obtained by the mapping unit 570 as a whole based on a maximum conditional likelihood estimation score, where the maximum conditional likelihood estimation score is generated from the result of dividing a weighted sum of natural exponential function values, each taking as its exponent the negative of the feature value at a position of a given tag feature vector, by the weighted sum of the corresponding natural exponential function values over every position of all of the plurality of tag feature vectors; a classification unit 590 for performing pattern classification on the control voice obtained by the voice acquisition unit 510 based on the tag score obtained by the maximum conditional likelihood estimating unit 580 to generate a control instruction; and a control unit 600 for controlling the smart home device based on the control instruction obtained by the classification unit 590.

In one example, in the intelligent home control system 500 based on the internet of things, the voice segmentation unit 520 is further configured to: and performing syllable sequence-based voice segmentation on the control voice to obtain a voice word sequence composed of a plurality of voice words.

In one example, in the intelligent home control system 500 based on the internet of things, as shown in fig. 6, the speech vector sequence generating unit 530 includes: a preprocessing subunit 531 for preprocessing each of the speech words; a Fourier transform subunit 532 for performing a Fourier transform on the preprocessed speech word obtained by the preprocessing subunit 531; a Mel filtering subunit 533 for performing Mel filtering on the Fourier-transformed speech word obtained by the Fourier transform subunit 532; a cepstrum analysis subunit 534 for performing cepstrum analysis on the Mel-filtered speech word obtained by the Mel filtering subunit 533, so as to extract the first preset number of Mel frequency cepstrum coefficients from the Mel frequency cepstrum coefficients of the speech word; and an arrangement subunit 535 for arranging the first preset number of Mel frequency cepstrum coefficients obtained by the cepstrum analysis subunit 534 into the speech vector sequence.

In one example, in the intelligent home control system 500 based on the internet of things, the global pooling unit 560 is further configured to: and carrying out global average value pooling processing or global maximum value pooling processing based on channel dimensions on the voice feature map to obtain the voice feature matrix.

In one example, in the intelligent home control system 500 based on the internet of things, the maximum condition likelihood estimating unit 580 is further configured to: estimating a tag score for the plurality of tag feature vectors as a whole based on the maximum conditional likelihood estimation score with the following formula; the formula is:

In one example, in the above-mentioned intelligent home control system 500 based on the internet of things, the classification unit 590 is further configured to: and inquiring a matching result corresponding to the label score in a control instruction inquiry table, wherein the matching result is the control instruction.

Here, it will be understood by those skilled in the art that the specific functions and operations of the respective units and modules in the above-described intelligent home control system 500 based on the internet of things have been described in detail in the above description of the intelligent home control method based on the internet of things with reference to figs. 1 to 4, and thus repetitive descriptions thereof are omitted.

As described above, the intelligent home control system 500 based on the internet of things according to the embodiment of the present application may be implemented in various terminal devices, for example, a server of an intelligent home control algorithm based on the internet of things, and the like. In one example, the internet of things-based smart home control system 500 according to embodiments of the present application may be integrated into a terminal device as one software module and/or hardware module. For example, the intelligent home control system 500 based on the internet of things may be a software module in the operating system of the terminal device, or may be an application program developed for the terminal device; of course, the intelligent home control system 500 based on the internet of things can be one of numerous hardware modules of the terminal device.

Alternatively, in another example, the intelligent home control system 500 based on the internet of things and the terminal device may be separate devices, and the intelligent home control system 500 based on the internet of things may be connected to the terminal device through a wired and/or wireless network and transmit the interaction information according to the agreed data format.

Exemplary Electronic Device

Next, an electronic device according to an embodiment of the present application is described with reference to fig. 7. As shown in fig. 7, the electronic device 10 includes one or more processors 11 and a memory 12. The processor 11 may be a Central Processing Unit (CPU) or another form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.

The memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory, and the like. The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 11 may execute them to implement the functions of the intelligent home control method based on the internet of things of the various embodiments of the present application described above and/or other desired functions. Various contents such as a speech feature map, a tag feature vector, and the like may also be stored in the computer-readable storage medium.

In one example, the electronic device 10 may further include: an input system 13 and an output system 14, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown).

The input system 13 may comprise, for example, a keyboard, a mouse, etc.

The output system 14 can output various information including control instructions and the like to the outside. The output system 14 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, etc.

Of course, only some of the components of the electronic device 10 that are relevant to the present application are shown in fig. 7 for simplicity, components such as buses, input/output interfaces, etc. are omitted. In addition, the electronic device 10 may include any other suitable components depending on the particular application.

Exemplary Computer Program Product and Computer-Readable Storage Medium

In addition to the methods and apparatus described above, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform steps in the functions of the intelligent home control method based on the internet of things described in the "exemplary method" section of the present specification according to various embodiments of the present application.

Program code for performing the operations of embodiments of the present application may be written for the computer program product in any combination of one or more programming languages, including object-oriented programming languages such as Java, C++ or the like and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.

Furthermore, embodiments of the present application may also be a computer-readable storage medium, on which computer program instructions are stored, which, when executed by a processor, cause the processor to perform the steps in the intelligent home control method based on internet of things described in the above "exemplary method" section of the present specification.

The computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, Random Access Memory (RAM), Read-Only Memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The basic principles of the present application have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not intended to be limited to the details disclosed herein as such.

The block diagrams of the devices, apparatuses, equipment, and systems referred to in this application are only illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, equipment, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including but not limited to," and are used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or" unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."

It is also noted that in the apparatuses, devices and methods of the present application, the components or steps may be decomposed and/or recombined. Such decomposition and/or recombination should be considered equivalent solutions of the present application.

The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims

1. The intelligent home control method based on the Internet of things is characterized by comprising the following steps of:

obtaining control voice of a user;

2. The internet of things-based smart home control method of claim 1, wherein word-based speech segmentation is performed on the control speech to obtain a speech word sequence consisting of a plurality of speech words, comprising:

and performing syllable sequence-based voice segmentation on the control voice to obtain a voice word sequence composed of a plurality of voice words.

3. The internet of things-based smart home control method of claim 2, wherein converting each speech word in the sequence of speech words into a speech vector to obtain a sequence of speech vectors consisting of a plurality of speech vectors, comprising:

preprocessing each voice word;

performing Fourier transform on the preprocessed voice words;

performing Mel filtering on the voice words after Fourier transformation;

carrying out cepstrum analysis on the voice word after the Mel filtering to extract the first preset number of Mel frequency cepstrum coefficients from the Mel frequency cepstrum coefficients of the voice word; and

arranging the first preset number of Mel frequency cepstrum coefficients into the voice vector sequence.

4. The internet of things-based smart home control method of claim 3, wherein performing global pooling processing on the voice feature map based on channel dimensions to obtain a voice feature matrix comprises:

and carrying out global average value pooling processing or global maximum value pooling processing based on channel dimensions on the voice feature map to obtain the voice feature matrix.

5. The internet of things-based smart home control method of claim 4, wherein estimating the tag score of the plurality of tag feature vectors as a whole based on the maximum conditional likelihood estimation score comprises:

estimating a tag score for the plurality of tag feature vectors as a whole based on the maximum conditional likelihood estimation score with the following formula;

the formula is:

6. The internet of things-based smart home control method of claim 1, wherein pattern classifying the control speech based on the tag score to generate a control instruction comprises:

and inquiring a matching result corresponding to the label score in a control instruction inquiry table, wherein the matching result is the control instruction.

7. An intelligent home control system based on the Internet of things, characterized by comprising:

a voice acquisition unit for acquiring a control voice of a user;

8. The internet of things-based smart home control system of claim 7, wherein the speech vector sequence generation unit comprises:

a preprocessing subunit, configured to preprocess each of the speech words;

a fourier transform subunit, configured to perform fourier transform on the speech word obtained by the preprocessing subunit after preprocessing;

a mel filtering subunit, configured to perform mel filtering on the speech word obtained by the fourier transform subunit after fourier transform;

a cepstrum analysis subunit, configured to perform cepstrum analysis on the speech word obtained by the Mel filtering subunit after Mel filtering, so as to extract the first preset number of Mel frequency cepstrum coefficients from the Mel frequency cepstrum coefficients of the speech word; and

an arrangement subunit, configured to arrange the first preset number of Mel frequency cepstrum coefficients obtained by the cepstrum analysis subunit into the speech vector sequence.

9. The internet of things-based smart home control system of claim 7, wherein the maximum condition likelihood estimation unit is further configured to:

the formula is:

10. An electronic device, comprising:

a processor; and

a memory having stored therein computer program instructions that, when executed by the processor, cause the processor to perform the internet of things-based smart home control method of any one of claims 1-7.