CN117917729A - Voice recognition ambiguity eliminating method and device - Google Patents

Voice recognition ambiguity eliminating method and device

Info

Publication number
CN117917729A
CN117917729A (application CN202211290249.1A)
Authority
CN
China
Prior art keywords
data
image
user
text
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211290249.1A
Other languages
Chinese (zh)
Inventor
马坚
李敏
曾谁飞
刘卫强
孔令磊
张景瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Refrigerator Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Refrigerator Co Ltd
Haier Smart Home Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Haier Refrigerator Co Ltd and Haier Smart Home Co Ltd
Priority to CN202211290249.1A
Publication of CN117917729A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a voice recognition disambiguation method and device. The method comprises the following steps: querying an ambiguity dictionary to determine whether an ambiguous word exists in the text data; if so, acquiring image data; inputting the image data into an image model to obtain an image recognition result; obtaining quantized data corresponding to intelligent decision parameters; inputting the text data, the image recognition result and the quantized data into an intelligent decision model to obtain a decision text; and inputting the decision text into a language model to obtain a user intention recognition result. When the voice contains ambiguous content, a decision can be made directly on that content by combining the image recognition result of the image data with the fused data of a plurality of intelligent decision parameters, so that the user's real need is clearly understood and the user's experience is improved.

Description

Voice recognition ambiguity eliminating method and device
Technical Field
The invention relates to the technical field of voice recognition of refrigeration equipment, in particular to a voice recognition disambiguation method and device of refrigeration equipment.
Background
With the progress of technology, users have put forward new demands on the intelligence of refrigeration equipment. For example, in a scenario where a user uses a refrigerator, the user speaks to the refrigerator, the refrigerator performs voice recognition, and then carries out the corresponding operation according to the voice command.
Although voice recognition technology is now mature, it cannot achieve 100% accuracy. Moreover, even when every word is recognized accurately, everyday utterances tend to be short and elliptical, and the user's true intention cannot be judged from the elliptical sentence alone. For example, when the user says "put apples", it cannot be determined whether the user wants to play apple-related music or to add apples as food material in the refrigerator.
Therefore, voice control of refrigeration equipment still falls short of users' expectations, and there is a long way to go before the user's intention is fully understood.
Disclosure of Invention
In order to solve at least one of the above problems, an object of the present invention is to provide a voice recognition disambiguation method and device capable of accurately understanding user intention in the use scenario of refrigeration equipment.
In order to achieve the above object, an embodiment of the present invention provides a method for disambiguating speech recognition, including the steps of:
Translating the voice data into text data;
querying an ambiguity dictionary for the presence of an ambiguity word in the text data;
If yes, the following steps are continued:
Acquiring image data, wherein the image data comprises an image of an item held by a user or an item not held by the user;
inputting the image data into an image model to obtain an image recognition result;
Acquiring quantized data corresponding to intelligent decision parameters, wherein the intelligent decision parameters are a plurality of parameters in an intelligent decision model that are matched to the text data, and the intelligent decision model is a decision model obtained by machine learning training using a large amount of training text data, training image recognition results and a plurality of training judgment parameters;
Inputting the text data, the image recognition result and the quantized data into the intelligent decision model to obtain decision text;
inputting the decision text into a language model to obtain a user intention recognition result, wherein the language model is a deep learning model which is trained by a large amount of texts and is used for recognizing intention.
As a further improvement of the invention, the image model is a classification model, and the image recognition result comprises two outcomes: yes and no;
if yes, the image data contains the content of the ambiguous word;
if no, the image data does not contain the content of the ambiguous word.
As a further improvement of the present invention, the method further comprises the steps of:
And if the ambiguous word is a food-material word, determining the food-material word in the decision text to be a food-material intention when the image recognition result is yes.
As a further improvement of the invention, the intelligent decision model may be trained using a decision tree or GBDT algorithm.
As a further improvement of the present invention, the method further comprises the steps of:
querying an ambiguity dictionary for the presence of an ambiguity word in the text data;
if not, inputting the text data into the language model to obtain a user intention recognition result.
As a further improvement of the present invention, the step of translating the voice data into text data includes:
Noise reduction is carried out on the voice data to obtain user voice enhancement data;
intercepting the user voice enhancement data to obtain user voice data;
and recognizing the user voice data to obtain text data.
As a further improvement of the present invention, said step of recognizing said user speech data, obtaining text data comprises:
identifying the user voice data to obtain text data to be checked;
checking whether the text to be checked has wrongly written characters or not;
If yes, correcting wrongly written characters in the text data to be checked to obtain the text data;
and if not, the text data to be checked is the text data.
As a further improvement of the invention, the plurality of parameters includes a user habit parameter, a user characteristic parameter, a weekend-or-not parameter, a time parameter and an air temperature parameter.
To achieve one of the above objects, an embodiment of the present invention provides a voice recognition disambiguation device, including:
a translation module for translating the voice data into text data;
the query module is used for querying whether an ambiguous word exists in the text data in an ambiguous dictionary;
The image acquisition module is used for acquiring image data, wherein the image data comprises images of articles held by a user or articles not held by the user;
the image model module is used for inputting the image data into an image model to obtain an image recognition result;
The quantitative data acquisition module is used for acquiring quantitative data corresponding to intelligent decision parameters, wherein the intelligent decision parameters are a plurality of parameters in an intelligent decision model, and the intelligent decision model is a decision model obtained through machine learning training by using a large amount of training text data, training image recognition results and training judgment parameters;
The decision module is used for inputting the text data, the image recognition result and the quantized data into the intelligent decision model to obtain decision text;
The intention recognition module is used for inputting the decision text into a language model to obtain a user intention recognition result, wherein the language model is a deep learning model which is trained by a large amount of texts and is used for recognizing intention.
To achieve one of the above objects, an embodiment of the present invention provides an electronic device including:
a storage module storing a computer program;
a processing module which, when executing the computer program, implements the steps of the above voice recognition disambiguation method.
To achieve one of the above objects, an embodiment of the present invention provides a readable storage medium storing a computer program which, when executed by a processing module, performs the steps of the above-described speech recognition disambiguation method.
Compared with the prior art, the invention has the following beneficial effects: by using the voice recognition disambiguation method and device, when the voice contains ambiguous content, a decision can be made directly on that content by combining the image recognition result of the image data with the fused data of a plurality of intelligent decision parameters; that is, the ambiguity in the user's voice is eliminated, the user's real need is understood more clearly, and the user's experience is improved.
Drawings
FIG. 1 is a flow chart of a method of speech recognition disambiguation in accordance with an embodiment of the present invention;
FIG. 2 is a flowchart showing the steps of step S10 according to an embodiment of the present invention;
FIG. 3 is a flowchart showing the steps of step S60 according to an embodiment of the present invention;
FIG. 4 is a block diagram of a speech recognition disambiguation device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments shown in the drawings. These embodiments are not intended to limit the invention and structural, methodological, or functional modifications of these embodiments that may be made by one of ordinary skill in the art are included within the scope of the invention.
An embodiment of the invention provides a method and a device for voice recognition disambiguation capable of accurately understanding user intention under the use scene of refrigeration equipment.
The refrigeration device of the present embodiment may be a refrigerator, a freezer, an upright freezer, a wine cabinet or the like; the following embodiments take the refrigerator as an example. The refrigerator comprises a cabinet, a storage space with an opening formed in the cabinet, and a door body covering the opening. Food materials can be placed in the storage space. A camera and a microphone can be arranged inside and/or outside the door body; the camera is used to capture whether the user is holding an item and what that item is, and the microphone is used to receive the user's voice data.
After the microphone receives the voice data, the following voice recognition disambiguation method can be run by a processing module on the refrigeration equipment, or the voice data can be uploaded to a server and processed by the server or by the user's mobile phone or computer. After processing, the recognized intention is acted on by the refrigeration equipment. Taking the example from the background section, if the intention recognized for "put apples" is to add apples in the refrigerator's food-material management, the food-material management module of the refrigeration equipment adds apples.
A voice recognition disambiguation method of a refrigeration apparatus according to an embodiment of the present application is described below with reference to figs. 1 to 3. Although the present application presents the method steps in the following embodiments and flowcharts, the execution order of these steps is not limited to the order given here. For example, steps S30 and S50 below may be performed in any order or simultaneously, with no required chronological order between them.
Specifically, as shown in fig. 1, the voice recognition disambiguation method of the present embodiment includes the following steps:
Step S10: the voice data is translated into text data.
As shown in fig. 2, step S10 specifically further includes the following steps:
Step S11: noise reduction is carried out on the voice data to obtain user voice enhancement data; step S11 can reduce environmental noise and strengthen the voice intensity of the user;
Step S12: intercepting the user voice enhancement data to obtain user voice data; step S12 keeps only the content of the user's voice and excludes other irrelevant content;
Step S13: identifying the user voice data to obtain text data to be checked;
Step S14: checking whether the text to be checked contains wrongly written characters. Step S14 may be performed against a dedicated dictionary of the refrigeration equipment. For example, if the user says "food" but step S13 recognizes it as "stone", then after the dedicated dictionary is queried it is determined that "stone" has a small probability of being said in this environment and "food" a large probability, so "stone" is corrected to "food".
If yes, step S15: correcting wrongly written characters in the text data to be checked to obtain text data;
if not, step S16: the text data to be checked is the text data.
The voice data is converted into clean text data for subsequent processing, via step S10.
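Steps S11 to S16 can be sketched as a small pipeline. This is a minimal, hypothetical sketch: the function names, the correction table, and the stubbed recognizer are all illustrative, since the patent does not name any concrete library.

```python
# Sketch of step S10: speech data -> clean text data.

def correct_typos(text_to_check, corrections):
    """Steps S14-S16: replace likely mis-recognized words using a
    domain-specific correction table (e.g. 'stone' -> 'food')."""
    return " ".join(corrections.get(w, w) for w in text_to_check.split())

def speech_to_text(audio, asr, corrections):
    """Steps S11-S12 (noise reduction, interception) are assumed to happen
    inside the `asr` stand-in; step S13 is its raw text output."""
    return correct_typos(asr(audio), corrections)

# usage with a stubbed recognizer
fake_asr = lambda audio: "put the stone in"
print(speech_to_text(b"\x00", fake_asr, {"stone": "food"}))  # put the food in
```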
Step S20: querying an ambiguity dictionary for the presence of an ambiguity word in the text data;
the step S20 specifically includes:
Step S21: querying the text data in an ambiguity dictionary; here, the translated text is segmented into words, and the segmented words are then looked up in the ambiguity dictionary.
Step S22: judging whether ambiguous words exist in the text data;
The ambiguity dictionary can store content that is specifically ambiguous in the refrigeration-equipment environment. For example, "Little Apple" can refer to a food material or to a song, and "put" can be understood as either placing or playing. That is, step S20 queries the ambiguity dictionary to determine whether the current text data contains ambiguous content.
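Steps S21 and S22 amount to segmenting the text and testing membership in the dictionary. In this illustrative sketch, whitespace splitting stands in for a real Chinese word segmenter, and the dictionary entries are invented:

```python
# Sketch of step S20: look each segmented token up in an ambiguity dictionary.

AMBIGUITY_DICT = {
    "apple": ("food material", "song"),  # 'Little Apple' is also a song title
    "put": ("place", "play"),
}

def find_ambiguous_words(text_data, ambiguity_dict=AMBIGUITY_DICT):
    tokens = text_data.lower().split()                  # step S21: segmentation
    return [t for t in tokens if t in ambiguity_dict]   # step S22: lookup

hits = find_ambiguous_words("put apple")
print(hits)  # ['put', 'apple'] -> ambiguity present, continue with steps S30-S70
```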
If so, there is content that may be ambiguous, and the following intelligent decision model is needed to assist the language model in intention recognition, as follows:
Step S30: image data is acquired, wherein the image data comprises an image of an item held by a user or an item not held.
The image data may be obtained by the camera. The camera may capture the entire environment, including the user's body, the surroundings and whether the user holds an item, or it may capture only the user's hand or the item being held.
Step S40: and inputting the image data into an image model to obtain an image recognition result.
The image model is a classification model, and the image recognition result comprises two outcomes: yes and no;
if yes, the image data contains the content of the ambiguous word;
if no, the image data does not contain the content of the ambiguous word.
Here, for example, if the ambiguous word is "Little Apple", it is judged whether an apple is in the user's hand in the image; if so, the image recognition result is yes, and if not, the image recognition result is no.
In addition, the image recognition result may also identify the item in the user's hand, such as a mobile phone.
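The yes/no mapping of step S40 can be sketched as follows. The classifier itself is replaced by a set of labels assumed to come from an image model; only the final decision from labels to the yes/no result is shown, and all names are hypothetical:

```python
# Sketch of step S40: does the image contain the ambiguous word's content?

def image_recognition_result(detected_labels, ambiguous_word):
    """Return 'yes' if the image contains the content of the ambiguous word,
    'no' otherwise (binary classification outcome from the text)."""
    return "yes" if ambiguous_word in detected_labels else "no"

# e.g. the image model reports what it sees in the user's hand
print(image_recognition_result({"apple", "hand"}, "apple"))  # yes
print(image_recognition_result({"phone", "hand"}, "apple"))  # no
```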
Step S50: obtaining quantized data intelligent decision parameters corresponding to the intelligent decision parameters, wherein the intelligent decision parameters are a plurality of parameters in an intelligent decision model, the text data are matched with the parameters in the intelligent decision model, and the intelligent decision model is a decision model obtained through machine learning training by using a large number of text data for training, image recognition results for training and a plurality of judgment parameters for training.
Further, the intelligent decision parameters may include a user habit parameter, a user characteristic parameter, a weekend-or-not parameter, a time parameter and an air temperature parameter.
In addition, the user habit parameters may include the user's personality, preferences and the like; the user characteristic parameters may include the user's age, occupation and physical state, even heart rate, sleep conditions, body temperature and the like.
When the image recognition result is no, the user's intention is harder to determine. At this point the intelligent decision model, trained by machine learning on a large amount of training text data, training image recognition results and training judgment parameters, can connect the user's intention with parameters whose relation to that intention an ordinary person could not analyze intuitively.
Because the internal mechanism of a decision model trained on large amounts of data is a black box, which specific parameters correlate strongly with the user's intention, and through what decision path the connection is established, depend on the result of the learning.
The process of building the intelligent decision model is described here. The intelligent decision model can be trained using a decision tree or the GBDT algorithm: a large amount of training text data, training image recognition results and a plurality of training judgment parameters are input into the model, and the decision result is adjusted according to the accuracy of each round until judgment parameters, and the data corresponding to them, are found that correspond well to the real intentions of the texts.
If the intelligent decision parameter is a time parameter, the quantized data is a specific time; if the intelligent decision parameter is an air temperature parameter, the quantized data is a specific temperature; if the intelligent decision parameter is whether the user is wearing earphones, the quantized data is yes or no.
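The quantization above can be sketched as a small mapping function. The parameter set follows the examples in the text (weekend, time, air temperature, earphones); the exact numeric encoding is an assumption:

```python
# Sketch of step S50: reduce intelligent decision parameters to quantized data.
from datetime import datetime

def quantize_parameters(now, air_temp_c, wearing_earphones):
    return {
        "is_weekend": int(now.weekday() >= 5),  # weekend-or-not -> 1/0
        "hour": now.hour,                       # time parameter -> specific time
        "air_temp": air_temp_c,                 # air temperature -> specific value
        "earphones": int(wearing_earphones),    # yes/no -> 1/0
    }

# a Saturday at 9 a.m.
q = quantize_parameters(datetime(2022, 10, 22, 9, 0), 21.5, False)
print(q)  # {'is_weekend': 1, 'hour': 9, 'air_temp': 21.5, 'earphones': 0}
```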
Step S60: and inputting the text data, the image recognition result and the quantized data into the intelligent decision model to obtain a decision text.
Here, taking the case where the ambiguous word is a food-material word as an example, when the image recognition result is yes, the food-material word in the decision text is determined to be a food-material intention; for example, an annotation is added to the text indicating that the voice content refers to the food material.
In general, the presence of the food material in the user's hand means that the user's voice content is related to food-material management; therefore, the judgment is made directly when the image recognition result is yes.
Further, the decision process of step S60 will be described by taking the example shown in fig. 3 as an example:
Let it be assumed that the intelligent decision parameters are the weekend-or-not parameter, the time parameter, the air temperature parameter, and so on; that is, the node parameters of the intelligent decision model at this time include the weekend-or-not parameter, the time parameter, the air temperature parameter, and so on.
Firstly, the image recognition result is judged; if it is yes, the food-material word in the decision text is directly determined to be a food-material intention.
Then, if the image recognition result is no (for example, the user says "Little Apple" but no apple is held in the user's hand), the weekend-or-not parameter is introduced for judgment.
The weekend-or-not parameter has only two possible values: yes and no.
In the non-weekend case, suppose the intelligent decision model has learned a strong correlation between non-weekends and food materials, for example because users generally purchase more food materials then; when the weekend-or-not parameter is no, the food-material word in the decision text is therefore determined to be a food-material intention.
If the weekend-or-not parameter is yes, the time parameter is introduced for judgment.
Assume the time parameter is 9 a.m.
Then, assuming the intelligent decision model has learned a strong correlation between 9 a.m. on weekends and music, the food-material word in the decision text is determined to be a music intention.
If the weekend-or-not parameter and 9 a.m. alone cannot decide, the air temperature parameter is further combined for judgment, and so on, until it is determined which intention the ambiguous word relates to.
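The decision walk above can be written out as a hand-coded tree. A trained decision tree or GBDT would learn these splits from data; the particular correlations (non-weekend implies food material, weekend 9 a.m. implies music) are only the example's assumptions:

```python
# Sketch of step S60's decision path from fig. 3 (hand-written, not learned).

def decide_intent(image_yes, is_weekend, hour):
    if image_yes:          # item in hand: food-material intention
        return "food material"
    if not is_weekend:     # assumed non-weekend / food-material correlation
        return "food material"
    if hour == 9:          # assumed weekend-9-a.m. / music correlation
        return "music"
    return "undecided"     # continue with air temperature, etc.

print(decide_intent(False, True, 9))  # music
```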
Step S70: inputting the decision text into a language model to obtain a user intention recognition result, wherein the language model is a deep learning model which is trained by a large amount of texts and is used for recognizing intention.
The main task of the language model is intent recognition: the text content is fed into the trained deep learning model, which infers the intent of the current text, i.e. the user's intent.
When a food-material intent is included in the decision text, the language model analyzes the voice's intent as a food-material-related operation; that is, the intent relates to the handling of food materials.
In addition, when no ambiguous word in the text data is found in the ambiguity dictionary, the operation is as follows:
Step S80: if no ambiguous word in the text data is found in the ambiguity dictionary, inputting the text data into the language model to obtain the user intention recognition result.
That is, after the voice data is translated into text data, the text data is directly transferred into the language model to recognize the intention of the user, and the result is output.
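The overall flow of figs. 1 and 3 can be sketched end to end. Every callable here is a stand-in for a trained model from the text; the wiring shows only the branch at step S20 that decides whether the intelligent decision model is consulted before the language model:

```python
# End-to-end sketch of the disambiguation pipeline (all models are stubs).

def disambiguate(speech, asr, has_ambiguity, image_model,
                 decision_model, language_model, quantized_data):
    text = asr(speech)                            # step S10
    if not has_ambiguity(text):                   # step S20, "no" branch
        return language_model(text)               # step S80
    img = image_model()                           # steps S30-S40
    decision_text = decision_model(text, img, quantized_data)  # steps S50-S60
    return language_model(decision_text)          # step S70

# usage with trivial stand-ins
intent = disambiguate(
    b"", lambda a: "put apple", lambda t: "apple" in t,
    lambda: "yes",
    lambda t, i, q: t + " [food-material intent]" if i == "yes" else t,
    lambda t: "add food material" if "food-material" in t else "play music",
    {"is_weekend": 1, "hour": 9},
)
print(intent)  # add food material
```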
Compared with the prior art, the embodiment has the following beneficial effects:
By using the voice recognition disambiguation method and device, when the voice contains ambiguous content, a decision can be made directly on that content by combining the image recognition result of the image data with the fused data of a plurality of intelligent decision parameters; that is, the ambiguity in the user's voice is eliminated, the user's real need is understood more clearly, and the user's experience is improved.
In one embodiment, a speech recognition disambiguation device is provided, as shown in FIG. 4. The voice recognition disambiguation device can comprise the following modules, wherein the specific functions of each module are as follows:
a translation module for translating the voice data into text data;
the query module is used for querying whether an ambiguous word exists in the text data in an ambiguous dictionary;
The image acquisition module is used for acquiring image data, wherein the image data comprises images of articles held by a user or articles not held by the user;
the image model module is used for inputting the image data into an image model to obtain an image recognition result;
The quantitative data acquisition module is used for acquiring quantitative data corresponding to intelligent decision parameters, wherein the intelligent decision parameters are a plurality of parameters in an intelligent decision model, and the intelligent decision model is a decision model obtained through machine learning training by using a large amount of training text data, training image recognition results and training judgment parameters;
The decision module is used for inputting the text data, the image recognition result and the quantized data into the intelligent decision model to obtain decision text;
The intention recognition module is used for inputting the decision text into a language model to obtain a user intention recognition result, wherein the language model is a deep learning model which is trained by a large amount of texts and is used for recognizing intention.
It should be noted that, details not disclosed in the voice recognition disambiguation device in the embodiment of the present invention are referred to details disclosed in the voice recognition disambiguation method in the embodiment of the present invention.
It will be appreciated by those skilled in the art that the block diagram is merely an example of a voice recognition disambiguation device and does not limit the terminal device embodying it; the device may include more or fewer components than illustrated, combine certain components, or use different components. For example, the voice recognition disambiguation device may also include input and output devices, network access devices, a bus, and the like.
The speech recognition disambiguation device may also include a computing device such as a computer, a notebook, a palm top computer, and a cloud server, and include, but are not limited to, a processing module, a storage module, and a computer program stored in the storage module and executable on the processing module, such as the speech recognition disambiguation method program described above. The processing module, when executing the computer program, implements the steps of the various embodiments of the speech recognition disambiguation method described above, such as the steps shown in fig. 1-3.
The voice recognition disambiguation device may further comprise a signal transmission module and a communication bus. The signal transmission module is used to send data to other external processing modules or servers; an external processing module such as a mobile phone can connect wirelessly to the refrigeration equipment to transmit data, for example via Bluetooth, WiFi or ZigBee. The communication bus establishes connections among the camera, the microphone, the signal transmission module, the processing module and the storage module, and may comprise a path for transmitting information among them.
In addition, the invention also provides an electronic device, which comprises a storage module and a processing module, wherein the processing module can realize the steps in the voice recognition ambiguity eliminating method when executing the computer program, that is, realize the steps in any technical scheme in the voice recognition ambiguity eliminating method.
The electronic device may be part of a voice recognition disambiguation apparatus integrated into the apparatus, or may be a local terminal device, or may be part of a cloud server.
The processing module may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or, alternatively, any conventional processor. The processing module is the control center of the voice recognition disambiguation device, connecting the parts of the whole device through various interfaces and lines.
The storage module may be used to store the computer program and/or modules, and the processing module implements the various functions of the voice recognition disambiguation device by running or executing the computer program and/or modules stored in the storage module and invoking the data stored therein. The storage module may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, application programs required for at least one function, and the like. In addition, the storage module may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one magnetic disk storage device, flash memory device, or other solid-state storage device.
The computer program may be divided into one or more modules/units, which are stored in the storage module and executed by the processing module to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program in the voice recognition disambiguation device.
Further, an embodiment of the present invention provides a readable storage medium storing a computer program, where the computer program is executed by a processing module to implement steps in the above-mentioned speech recognition disambiguation method, that is, to implement steps in any one of the technical solutions of the above-mentioned speech recognition disambiguation method.
The modules integrated in the voice recognition disambiguation device may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as stand-alone products. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment by instructing related hardware through a computer program, where the computer program may be stored in a computer readable storage medium, and the computer program may implement the steps of each of the method embodiments described above when executed by the processing module.
The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be appropriately adjusted according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, under legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. This manner of description is adopted for clarity only; those skilled in the art should regard the specification as a whole, and the technical solutions in the embodiments may be appropriately combined to form other embodiments that can be understood by those skilled in the art.
The above detailed description sets forth only feasible embodiments of the present invention and is not intended to limit the scope of protection thereof; all equivalent embodiments or modifications that do not depart from the spirit of the present invention shall fall within its scope of protection.
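The pipeline described in the abstract and the claims below can be sketched in a few lines; this is a minimal illustrative sketch only, not the patented implementation. Every function here (`recognize_speech`, `classify_image`, `quantize_parameters`, `decide`, `recognize_intent`) is a hypothetical stand-in for the corresponding trained model, reduced to simple lookups:

```python
# Hypothetical sketch of the claimed disambiguation pipeline.
# All model calls are stand-ins (simple lookups), not the patent's actual models.

AMBIGUITY_DICT = {"apple"}  # assumed example entry, e.g. fruit vs. brand


def recognize_speech(audio):
    """Stand-in for the speech-to-text step (claim 1)."""
    return audio["transcript"]


def classify_image(image):
    """Stand-in for the binary image classification model (claim 2)."""
    return "yes" if image.get("contains_ambiguous_item") else "no"


def quantize_parameters(context):
    """Stand-in for quantizing the decision parameters of claim 8."""
    return {
        "user_habit": context.get("user_habit", 0.0),
        "is_weekend": 1.0 if context.get("weekend") else 0.0,
        "temperature": context.get("temperature", 20.0),
    }


def decide(text, image_result, params):
    """Stand-in for the intelligent decision model (claim 1)."""
    if image_result == "yes":
        # Simplified version of claim 3's food-material rule.
        return text + " [food-material intent]"
    return text


def recognize_intent(decision_text):
    """Stand-in for the intent-recognizing language model (claim 1)."""
    intent = "food" if "[food-material intent]" in decision_text else "general"
    return {"intent": intent}


def disambiguate(audio, image, context):
    text = recognize_speech(audio)
    # Claim 5: if no ambiguous word is found, skip straight to intent recognition.
    if not any(w in AMBIGUITY_DICT for w in text.lower().split()):
        return recognize_intent(text)
    image_result = classify_image(image)
    params = quantize_parameters(context)
    return recognize_intent(decide(text, image_result, params))
```

Under these assumptions, an ambiguous utterance accompanied by a matching image resolves to a food intent, while an unambiguous one bypasses the image and decision models entirely.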

Claims (11)

1. A speech recognition disambiguation method, comprising the steps of:
translating voice data into text data;
querying an ambiguity dictionary for whether an ambiguous word exists in the text data;
if yes, continuing with the following steps:
acquiring image data, wherein the image data comprises an image of an item held by a user or of an item not held by the user;
inputting the image data into an image model to obtain an image recognition result;
obtaining quantized data corresponding to intelligent decision parameters, wherein the intelligent decision parameters are a plurality of parameters in an intelligent decision model, and the intelligent decision model is a decision model obtained through machine learning training on a large amount of training text data, training image recognition results, and training judgment parameters;
inputting the text data, the image recognition result, and the quantized data into the intelligent decision model to obtain a decision text; and
inputting the decision text into a language model to obtain a user intention recognition result, wherein the language model is a deep learning model trained on a large amount of text and used for recognizing intention.
2. The speech recognition disambiguation method of claim 1, wherein the image model is a classification model, and the image recognition result includes two possible results, yes and no;
if yes, the image data includes the content of the ambiguous word;
if no, the image data does not include the content of the ambiguous word.
3. The speech recognition disambiguation method of claim 2, further comprising the step of:
if the content of the ambiguous word is a food material vocabulary and the image recognition result is yes, determining that the food material vocabulary in the decision text is the food material intention.
4. The speech recognition disambiguation method of claim 1, wherein the intelligent decision model is trained using a decision tree or gradient boosting decision tree (GBDT) algorithm.
5. The speech recognition disambiguation method of claim 1, further comprising the steps of:
querying the ambiguity dictionary for whether an ambiguous word exists in the text data;
if not, inputting the text data into the language model to obtain the user intention recognition result.
6. The speech recognition disambiguation method of claim 1, wherein the step of translating voice data into text data comprises:
performing noise reduction on the voice data to obtain user voice enhancement data;
intercepting the user voice enhancement data to obtain user voice data; and
recognizing the user voice data to obtain the text data.
7. The speech recognition disambiguation method of claim 6, wherein the step of recognizing the user voice data to obtain the text data comprises:
recognizing the user voice data to obtain text data to be checked;
checking whether the text data to be checked contains miswritten characters;
if yes, correcting the miswritten characters in the text data to be checked to obtain the text data; and
if not, taking the text data to be checked as the text data.
8. The method of claim 1, wherein the plurality of parameters includes a user habit parameter, a user characteristic parameter, a weekend parameter, a time parameter, and a temperature parameter.
9. A speech recognition disambiguation device, comprising:
a translation module for translating voice data into text data;
a query module for querying an ambiguity dictionary for whether an ambiguous word exists in the text data;
an image acquisition module for acquiring image data, wherein the image data comprises an image of an item held by a user or of an item not held by the user;
an image model module for inputting the image data into an image model to obtain an image recognition result;
a quantized data acquisition module for obtaining quantized data corresponding to intelligent decision parameters, wherein the intelligent decision parameters are a plurality of parameters in an intelligent decision model, and the intelligent decision model is a decision model obtained through machine learning training on a large amount of training text data, training image recognition results, and training judgment parameters;
a decision module for inputting the text data, the image recognition result, and the quantized data into the intelligent decision model to obtain a decision text; and
an intention recognition module for inputting the decision text into a language model to obtain a user intention recognition result, wherein the language model is a deep learning model trained on a large amount of text and used for recognizing intention.
10. An electronic device, comprising:
a storage module storing a computer program; and
a processing module which, when executing the computer program, implements the steps of the speech recognition disambiguation method according to any one of claims 1 to 8.
11. A readable storage medium storing a computer program which, when executed by a processing module, implements the steps of the speech recognition disambiguation method according to any one of claims 1 to 8.
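The text-checking step of claims 6 and 7 (recognize, check for miswritten characters, correct if any are found) can be sketched as follows. The patent does not specify how miswritten characters are detected or corrected, so the correction table here is a hypothetical stand-in:

```python
# Hypothetical sketch of the claim 6/7 check-and-correct step.
# The correction table is an assumed example, not part of the patent.

CORRECTIONS = {"tomatoe": "tomato"}  # maps miswritten words to corrections


def check_and_correct(text_to_check: str) -> str:
    """Check recognized text for miswritten words; correct them if present."""
    words = text_to_check.split()
    corrected = [CORRECTIONS.get(w, w) for w in words]
    if corrected == words:
        # Claim 7: no miswritten characters, so the text to be checked
        # is taken as the text data unchanged.
        return text_to_check
    # Claim 7: miswritten characters corrected to obtain the text data.
    return " ".join(corrected)
```

A production system would presumably use a language-model-based spell checker rather than a static table; this sketch only mirrors the yes/no branching structure of claim 7.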
CN202211290249.1A 2022-10-21 2022-10-21 Voice recognition ambiguity eliminating method and device Pending CN117917729A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211290249.1A CN117917729A (en) 2022-10-21 2022-10-21 Voice recognition ambiguity eliminating method and device


Publications (1)

Publication Number Publication Date
CN117917729A 2024-04-23

Family

ID=90729574


Country Status (1)

Country Link
CN (1) CN117917729A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination