CN113889102A - Instruction receiving method, system, electronic device, cloud server and storage medium - Google Patents

Instruction receiving method, system, electronic device, cloud server and storage medium

Info

Publication number
CN113889102A
Authority
CN
China
Prior art keywords
voice recognition
recognition result
voice
instruction
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111115408.XA
Other languages
Chinese (zh)
Inventor
高斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cloudminds Beijing Technologies Co Ltd
Original Assignee
Cloudminds Beijing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cloudminds Beijing Technologies Co Ltd filed Critical Cloudminds Beijing Technologies Co Ltd
Priority to CN202111115408.XA priority Critical patent/CN113889102A/en
Publication of CN113889102A publication Critical patent/CN113889102A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech to text systems
    • G10L 2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiments of the present application relate to the technical field of artificial intelligence and disclose an instruction receiving method, a system, an electronic device, a cloud server and a storage medium. The method includes the following steps: picking up voice information of a user; performing voice recognition on the voice information to generate a first voice recognition result; acquiring a second voice recognition result generated by a second device, where the second device is a device with a sound pickup function and a voice recognition function, and the second voice recognition result is generated by the second device performing voice recognition on the voice information of the user picked up by the second device; and generating an instruction for the first device to execute according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each device. The instruction receiving method provided by the embodiments of the present application can break through the limitation of the sound pickup range and receive the voice instruction issued by the user accurately, completely and quickly, thereby improving the use experience of the user.

Description

Instruction receiving method, system, electronic device, cloud server and storage medium
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to an instruction receiving method, a system, electronic equipment, a cloud server and a storage medium.
Background
With the rapid development of artificial intelligence technology, more and more intelligent devices have entered people's lives, such as smart home robots of various types and functions. A smart home robot is a special robot that provides services for users, mainly engaging in work such as home services, maintenance, repair, transportation, monitoring and children's education. A smart home robot has flexible multi-joint arms, can listen to the user's voice instructions and can recognize three-dimensional objects by means of various sensors; it performs different tasks at different positions in the user's home and can receive voice instructions issued by the user in real time.
However, the inventors of the present application have found that the robot has a limited sound pickup range, and when the user and the robot are located in different rooms, it is difficult for the robot to accurately acquire a voice instruction given by the user.
Disclosure of Invention
An object of the embodiment of the application is to provide an instruction receiving method, system, electronic device, cloud server and storage medium, which can break through the limitation of pickup range, and accurately, completely and quickly receive the voice instruction issued by the user, thereby improving the use experience of the user.
In order to solve the above technical problem, an embodiment of the present application provides an instruction receiving method, including the following steps: picking up voice information of a user; performing voice recognition on the voice information to generate a first voice recognition result; acquiring a second voice recognition result generated by second equipment; the second device is a device with a sound collecting function and a voice recognition function, and the second voice recognition result is generated by performing voice recognition on voice information of a user picked up by the second device through the second device; and generating an instruction for the first equipment to execute according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each equipment.
An embodiment of the present application further provides an instruction receiving system, including: the voice recognition system comprises a first device and a second device, wherein the second device has a voice pickup function and a voice recognition function; the first equipment is used for acquiring voice information of a user, performing voice recognition on the voice information and generating a first voice recognition result; the second equipment is used for picking up voice information of a user, performing voice recognition on the voice information and generating a second voice recognition result; and the first equipment is also used for acquiring the second voice recognition result, generating an instruction according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each equipment, and executing the instruction.
An embodiment of the present application further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the above-mentioned instruction receiving method.
An embodiment of the present application further provides a cloud server, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the above-mentioned instruction receiving method.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program, which when executed by a processor implements the above-mentioned instruction receiving method.
The embodiments of the present application provide an instruction receiving method, a system, an electronic device, a cloud server and a storage medium. The first device picks up voice information of a user and performs voice recognition on the picked-up voice information to generate a first voice recognition result; it then acquires a second voice recognition result generated by a second device, the second device being a device with a sound pickup function and a voice recognition function which generates the second voice recognition result by performing voice recognition on the voice information of the user picked up by itself. The first device then generates an instruction for the first device to execute according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each device. Considering that the sound pickup range of a device is limited, when the user and the first device are located in different rooms the first device cannot accurately acquire the voice instruction issued by the user. In the embodiments of the present application, when the first device receives an instruction, it does not rely only on the recognition result obtained from the voice information it picks up itself, but also obtains the recognition result generated by the second device from the voice information of the user picked up by the second device; that is, multiple devices in the home jointly acquire the voice instruction issued by the user, which breaks through the limitation of the sound pickup range. When the instruction is generated, the voice recognition credibility of each device is also taken into account, which ensures that the generated instruction is credible and accords with the intention of the user, so that the voice instruction issued by the user is received accurately, completely and quickly, improving the use experience of the user.
In addition, generating an instruction according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each device includes: performing word segmentation on the first voice recognition result to obtain a plurality of first word-segmentation segments; performing word segmentation on the second voice recognition results to obtain a plurality of second word-segmentation segments; determining the voice recognition credibility of the first word-segmentation segments and of the second word-segmentation segments according to the preset voice recognition credibility of each device; and generating an instruction according to the first word-segmentation segments and their voice recognition credibility and the second word-segmentation segments and their voice recognition credibility. In the process of generating the instruction, the first device performs word segmentation on the first voice recognition result and the second voice recognition results respectively to obtain the first and second word-segmentation segments, then determines the voice recognition credibility of each segment in combination with the preset voice recognition credibility of each device, and generates the instruction according to each segment and its credibility, which further improves the accuracy of the obtained instruction and thus the use experience of the user.
In addition, before generating an instruction according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each device, the method includes: acquiring the position of the user, the position of the first device and the positions of a plurality of second devices; determining the relative position between the first device and the user according to the position of the user and the position of the first device; and determining the relative positions between the plurality of second devices and the user according to the position of the user and the positions of the plurality of second devices. Generating an instruction according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each device then includes: generating an instruction according to the first voice recognition result, the second voice recognition result, the relative position between the first device and the user, the relative positions between the plurality of second devices and the user, and the preset voice recognition credibility of each device. In actual life the second devices are distributed at various positions in the home and their sound pickup range is limited, so in this embodiment the first device also takes the relative positions between the devices and the user into account when receiving the instruction: when the first device is close to the user, the voice recognition results of the second devices need not be referred to, and when a certain second device is close to the user, its voice recognition result is mainly referred to.
In addition, the preset voice recognition credibility of each device includes the voice recognition credibility of the first device and the voice recognition credibility of the second device, and the voice recognition credibility of the second device is acquired by the first device through the following steps: carrying out several dialogues with the second device by voice to obtain the second device's voice recognition results for the dialogues; and comparing the second device's voice recognition results for the dialogues with the text content of the dialogues to determine the voice recognition credibility of the second device. Since the voice recognition credibility of the second device is obtained by the first device in advance through several dialogues, the obtained credibility is scientific, real and reliable, which further improves the accuracy of the obtained instruction.
In addition, the voice recognition credibility of each device includes any combination of: the voice recognition credibility of each device for dialects, for different languages, for different speaking volumes, in environments with different noise levels, and for users with different identity information. The first device comprehensively considers these aspects of voice recognition credibility when generating the instruction, so that the accuracy of the acquired instruction can be further improved.
In addition, before generating an instruction according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each device, the method includes: acquiring voice information of the user picked up by a third device, the third device having a sound pickup function but no voice recognition function; and performing voice recognition on the voice information of the user picked up by the third device to generate a third voice recognition result. Generating an instruction then includes: generating an instruction according to the first voice recognition result, the second voice recognition result, the third voice recognition result and the preset voice recognition credibility of each device. Although the third device has a sound pickup function but no voice recognition function, in this embodiment the first device can also refer to the voice information of the user picked up by the third device, which further extends the sound pickup range and allows the first device to obtain a more accurate and complete voice instruction.
In addition, after generating the instruction, the method further includes: performing a voice reply to the instruction directly; or having the second device perform a voice reply to the instruction. The first device can choose to reply to the user's instruction by voice itself, or have any second device reply to it, which better meets the actual needs of the user and further improves the use experience of the user.
In addition, acquiring the second voice recognition result generated by the second device includes: acquiring screening information, where the screening information includes any combination of the identity information of the user, the position information of the user and the position information of each second device; determining a target second device according to the screening information and a preset correspondence, where the correspondence is the correspondence between the screening information and the target second device; and acquiring the second voice recognition result generated by the target second device. The target second device is determined among the second devices according to one or more of the identity information of the user, the position information of the user and the position information of each second device, and only the second voice recognition result generated by the target second device is acquired; that is, only the target second device is taken into consideration when receiving the instruction, which effectively improves the efficiency of instruction receiving.
Drawings
One or more embodiments are illustrated by the corresponding figures in the drawings, which are not meant to be limiting.
FIG. 1 is a first flowchart of an instruction receiving method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a distribution of a user, a first device and a second device in a home provided in an embodiment according to the application;
FIG. 3 is a flow chart of generating an instruction based on a first speech recognition result, a second speech recognition result, and a preset speech recognition confidence level of a device according to an embodiment of the present application;
FIG. 4 is a flow chart two of an instruction receiving method according to another embodiment of the present application;
FIG. 5 is a flow diagram for obtaining speech recognition confidence for a second device, according to an embodiment of the present application;
FIG. 6 is a flow chart three of an instruction receiving method according to another embodiment of the present application;
FIG. 7 is a flow diagram of obtaining a second speech recognition result generated by a second device in accordance with an embodiment of the present application;
FIG. 8 is a schematic diagram of an instruction receiving system according to another embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device according to another embodiment of the present application;
FIG. 10 is a schematic structural diagram of a cloud server according to another embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the embodiments of the present application are described in detail below with reference to the accompanying drawings. It will be appreciated by those of ordinary skill in the art that numerous technical details are set forth in the embodiments of the present application in order to provide a better understanding of the present application; however, the technical solutions claimed in the present application can be implemented without these technical details and with various changes and modifications based on the following embodiments. The following embodiments are divided for convenience of description and should not constitute any limitation on the specific implementation of the present application; the embodiments may be combined with and refer to each other without contradiction.
An embodiment of the present application relates to an instruction receiving method applied to a first device. Implementation details of the instruction receiving method of this embodiment are described below; the following is provided only to facilitate understanding and is not necessary for implementing the solution.
The specific flow of the instruction receiving method of this embodiment may be as shown in fig. 1, and includes:
step 101, picking up voice information of a user.
And 102, performing voice recognition on the voice information to generate a first voice recognition result.
Specifically, the first device is a device that can receive, understand and execute voice instructions issued by the user, such as a smart home robot, a smart speaker, a smart refrigerator, a smart washing machine or a smart air conditioner.
In a specific implementation, the first device has a sound pickup function and a voice recognition function. The first device may monitor the voice information in the environment in real time; when a preset wake-up word is recognized, it picks up the voice information of the user and performs voice recognition on the picked-up voice information to generate a first voice recognition result. The preset wake-up word is used to wake up the first device, that is, to instruct the first device to start picking up sound, and may be set by a person skilled in the art according to actual needs.
In one example, the first device is a smart home robot and the preset wake-up word is "small A, small A". When the first device recognizes "small A, small A" in the voice information in the environment, it immediately starts picking up the voice information of the user and performs voice recognition on the picked-up voice information to obtain a first voice recognition result.
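By way of a non-limiting illustration, the following minimal Python sketch shows the wake-up-word gating described in steps 101 and 102; the `transcribe` helper is only a placeholder standing in for whatever speech recognition engine the first device actually uses, and the wake-up word, function names and end-of-utterance heuristic are assumptions rather than part of the disclosure.

```python
# Illustrative sketch (assumed, not from the disclosure) of steps 101-102:
# monitor audio in real time, and once the preset wake-up word is recognized,
# pick up the user's utterance and run local speech recognition on it.

WAKE_WORD = "small a small a"  # illustrative preset wake-up word

def transcribe(audio_chunk: bytes) -> str:
    """Placeholder for the first device's own speech recognition engine."""
    return ""

def listen(audio_stream):
    """Yield a first voice recognition result each time the wake-up word fires."""
    awake = False
    buffered = []
    for chunk in audio_stream:
        text = transcribe(chunk)
        if not awake:
            if WAKE_WORD in text.lower():
                awake = True        # wake-up word recognized: start picking up
                buffered.clear()
        elif text.strip():
            buffered.append(chunk)  # still speaking: keep buffering audio
        else:
            awake = False           # silence taken as the end of the utterance
            yield transcribe(b"".join(buffered))  # first voice recognition result
```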
And 103, acquiring a second voice recognition result generated by the second equipment.
Specifically, after picking up the voice information of the user, performing voice recognition on the picked-up voice information and generating the first voice recognition result, the first device may acquire a second voice recognition result generated by a second device, where the second device is a device with a sound pickup function and a voice recognition function, and the second voice recognition result is generated by the second device performing voice recognition on the voice information of the user picked up by the second device.
In a specific implementation, the second device may also monitor the voice information in the environment in real time, pick up the voice information of the user, perform voice recognition on the picked-up voice information of the user, and generate a second voice recognition result.
In one example, the second device is a device with a sound pickup function and a voice recognition function, such as a smart speaker, a home background music system or a smart TV.
In an example, the roles of the first device and the second device are interchangeable. For example, a plurality of smart home robots exist in the user's home, including robot A and robot B; the preset wake-up word of robot A is "small A, small A" and that of robot B is "small B, small B". Both robot A and robot B can monitor the voice information in the environment in real time. If robot A recognizes the preset wake-up word "small A, small A", robot A is the first device and robot B may serve as a second device; if robot B recognizes the preset wake-up word "small B, small B", robot B is the first device and robot A may serve as a second device.
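As a hedged sketch of this role interchange, the snippet below assigns the first-device role to whichever robot recognized its own wake-up word and treats the remaining robots as second devices; the device names and wake-up words are invented for illustration.

```python
# Illustrative only: whichever robot recognizes its own wake-up word becomes
# the first device; the other robots act as second devices.

ROBOTS = {"robot_A": "small a small a", "robot_B": "small b small b"}

def assign_roles(heard_wake_word: str):
    first = next((name for name, word in ROBOTS.items() if word == heard_wake_word), None)
    seconds = [name for name in ROBOTS if name != first]
    return first, seconds

print(assign_roles("small b small b"))  # ('robot_B', ['robot_A'])
```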
In one example, the first device and the second device may be wirelessly connected by any one or any combination of the following: a Wi-Fi connection, an IEEE 802.11b/g/n connection, a 2.4 GHz connection, a smart gateway connection, a Bluetooth Mesh gateway connection, a ZigBee protocol gateway connection, a multi-mode gateway connection, a smart socket connection, a smart wireless switch connection, and the like.
In an example, the number of the second devices is several, and the distribution schematic diagram of the user, the first device and the second device in the home may be as shown in fig. 2, where wireless connections are established between the first device and the second devices a, b, c and d.
And 104, generating an instruction for the first equipment to execute according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each equipment.
In a specific implementation, after the first device obtains the first voice recognition result and the second voice recognition result, an instruction may be generated according to the first voice recognition result, the second voice recognition result, and the preset voice recognition credibility of each device, for the first device to execute, where the preset voice recognition credibility of each device may be set by a person skilled in the art according to actual needs.
In one example, there are several second devices and several second recognition results. The preset voice recognition credibility of each device may represent the voice recognition capability and quality of that device, which are generally proportional to the performance of the device: the higher the performance of a device, the higher its voice recognition capability and quality, the higher its voice recognition credibility, and the more reliable its voice recognition result. The server may select the most reliable voice recognition result from the first voice recognition result and the several second voice recognition results and use it as the instruction for the first device to execute.
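A minimal sketch of the selection strategy just described is shown below; the device names, recognition texts and credibility values are assumptions taken loosely from the later worked example, not a definitive implementation.

```python
# Illustrative: pick the recognition result produced by the device with the
# highest preset voice recognition credibility and use it as the instruction.

results = {
    "first_device": ("answer bank leaflet", 0.9),
    "second_device_J": ("print bank leaflet", 0.6),
    "second_device_M": ("print hidden sheet", 0.8),
}

def most_credible(results: dict) -> str:
    best_device = max(results, key=lambda device: results[device][1])
    return results[best_device][0]

print(most_credible(results))  # 'answer bank leaflet' (credibility 0.9)
```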
In one example, the preset voice recognition credibility of each device may be set by the user, including credibility actively entered by the user in the background and credibility set inadvertently by the user. For example, the user says to a certain second device: "Where are the flower shops nearby?", and the second device replies: "Sorry, I did not hear you clearly"; the user then says to this second device: "You are so hard to use." This indicates that the user does not trust this second device, and the first device automatically sets the voice recognition credibility of this second device to a low value.
In this embodiment, compared with a technical solution in which only the first device picks up the voice instruction issued by the user and executes it, the first device picks up the voice information of the user, performs voice recognition on the picked-up voice information to generate a first voice recognition result, and then obtains a second voice recognition result generated by the second device, where the second device is a device with a sound pickup function and a voice recognition function and generates the second voice recognition result by performing voice recognition on the voice information it picks up. The first device then generates an instruction for itself to execute according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each device. Considering that the sound pickup range of a device is limited, when the user and the first device are located in different rooms it is difficult for the first device to accurately obtain the voice instruction issued by the user. In the embodiment of the present application, when the first device receives the instruction, it relies not only on the recognition result obtained from the voice information it picks up itself, but also on the recognition result obtained from the voice information of the user picked up by the second device; that is, multiple devices in the home jointly obtain the voice instruction issued by the user, which breaks through the limitation of the sound pickup range. When the instruction is generated, the voice recognition credibility of each device is also taken into account, which ensures that the generated instruction is credible and accords with the intention of the user, so that the voice instruction issued by the user is received accurately, completely and quickly, and the use experience of the user is improved.
In one embodiment, the number of the second devices is several, and the number of the second speech recognition results is also several.
In an embodiment, the number of the second devices is several, the number of the second speech recognition results is also several, and the first device generates the instruction according to the first speech recognition result, the second speech recognition result, and the preset speech recognition reliability of the device, which may be implemented by the steps shown in fig. 3, and specifically includes:
step 201, performing word segmentation on the first voice recognition result to obtain a plurality of first word segmentation segments.
Step 202, performing word segmentation on the plurality of second voice recognition results to obtain a plurality of second word segmentation segments.
In a specific implementation, after the first device generates the first voice recognition result and obtains the several second voice recognition results, it may segment the first voice recognition result according to a preset word-segmentation dictionary to obtain a plurality of first word-segmentation segments, and segment the several second voice recognition results to obtain a plurality of second word-segmentation segments, where the preset word-segmentation dictionary may be set by a person skilled in the art according to actual needs and is not specifically limited in the embodiments of the present application.
In one example, the first voice recognition result is "answer bank leaflet", and the first device segments the first voice recognition result according to the preset word-segmentation dictionary to obtain three first word-segmentation segments: "answer", "bank" and "leaflet". The second voice recognition result J is "print bank leaflet", the second voice recognition result K is "print bank leaflet", the second voice recognition result L is "print english-chinese sheet", the second voice recognition result M is "print hidden sheet", and the second voice recognition result N is "print bank leaflet". The first device segments the second voice recognition results J, K, L, M and N according to the preset word-segmentation dictionary to obtain the second word-segmentation segments "print", "bank", "english-chinese", "hidden", "leaflet" and "sheet".
And step 203, determining the voice recognition credibility of the plurality of first segmentation segments and the voice recognition credibility of the plurality of second segmentation segments according to the preset voice recognition credibility of each device.
In a specific implementation, after obtaining a plurality of first segmentation segments and a plurality of second segmentation segments, the first device may determine the speech recognition credibility of the plurality of first segmentation segments and the speech recognition credibility of the plurality of second segmentation segments according to the preset speech recognition credibility of each device.
In one example, the voice recognition credibility of the first device is 0.9, that of second device J is 0.6, that of second device K is 0.6, that of second device L is 0.3, that of second device M is 0.8 and that of second device N is 0.4. The server determines the voice recognition credibility of each word-segmentation segment according to the voice recognition credibility of each device as follows: the credibility of "answer" is 0.9; the credibility of "print" is 0.6 + 0.6 + 0.3 + 0.8 + 0.4 = 2.7; the credibility of "bank" is 0.9 + 0.6 + 0.6 + 0.4 = 2.5; the credibility of "english-chinese" is 0.3; the credibility of "hidden" is 0.8; the credibility of "leaflet" is 0.9 + 0.6 + 0.6 + 0.4 = 2.5; and the credibility of "sheet" is 0.3 + 0.8 = 1.1.
And step 204, generating an instruction according to the plurality of first word-segmentation segments, the voice recognition credibility of the plurality of first word-segmentation segments, the plurality of second word-segmentation segments and the voice recognition credibility of the plurality of second word-segmentation segments.
In a specific implementation, after determining the voice recognition credibility of the plurality of first word-segmentation segments and the voice recognition credibility of the plurality of second word-segmentation segments, the first device may generate the instruction according to the first word-segmentation segments, the credibility of the first word-segmentation segments, the second word-segmentation segments and the credibility of the second word-segmentation segments.
In one example, the first device determines that the voice recognition credibility of "answer" is 0.9, that of "print" is 2.7, that of "bank" is 2.5, that of "english-chinese" is 0.3, that of "hidden" is 0.8, that of "leaflet" is 2.5 and that of "sheet" is 1.1. The first device may then compose a sentence from the word-segmentation segments with the highest credibility, namely "print bank leaflet". This is a credible sentence, and the first device takes "print bank leaflet" as the instruction.
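The worked example above can be reproduced with the following hedged Python sketch; it assumes the word-segmentation segments of each result are aligned by position and uses a plain whitespace split in place of the preset word-segmentation dictionary, which the disclosure does not specify.

```python
# Sketch of steps 201-204: segment every recognition result, give each segment
# a credibility equal to the sum of the credibilities of the devices that
# produced it, and compose the instruction from the best segment per position.

from collections import defaultdict

def build_instruction(recognitions):
    """recognitions: list of (segments, device_credibility), segments aligned by position."""
    scores = defaultdict(lambda: defaultdict(float))
    for segments, credibility in recognitions:
        for position, segment in enumerate(segments):
            scores[position][segment] += credibility
    best = [max(candidates, key=candidates.get) for _, candidates in sorted(scores.items())]
    return " ".join(best)

# Recognition results and device credibilities from the worked example above.
recognitions = [
    ("answer bank leaflet".split(), 0.9),          # first device
    ("print bank leaflet".split(), 0.6),           # second device J
    ("print bank leaflet".split(), 0.6),           # second device K
    ("print english-chinese sheet".split(), 0.3),  # second device L
    ("print hidden sheet".split(), 0.8),           # second device M
    ("print bank leaflet".split(), 0.4),           # second device N
]

print(build_instruction(recognitions))  # 'print bank leaflet'
```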
In this embodiment, generating an instruction according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each device includes: performing word segmentation on the first voice recognition result to obtain a plurality of first word-segmentation segments; performing word segmentation on the second voice recognition results to obtain a plurality of second word-segmentation segments; determining the voice recognition credibility of the first word-segmentation segments and of the second word-segmentation segments according to the preset voice recognition credibility of each device; and generating an instruction according to the first word-segmentation segments and their credibility and the second word-segmentation segments and their credibility. In the process of generating the instruction, the first device performs word segmentation on the first voice recognition result and the second voice recognition results respectively, determines the voice recognition credibility of each segment in combination with the preset voice recognition credibility of each device, and generates the instruction according to each segment and its credibility, which further improves the accuracy of the obtained instruction and thus the use experience of the user.
Another embodiment of the present application relates to an instruction receiving method. In this embodiment there are a plurality of second devices and a plurality of second voice recognition results. Implementation details of the instruction receiving method of this embodiment are described below; these details are provided only to facilitate understanding and are not necessary for implementing this embodiment. A specific flow of the instruction receiving method of this embodiment may be as shown in fig. 4, and includes:
step 301, picking up voice information of a user.
Step 302, performing voice recognition on the voice information to generate a first voice recognition result.
Step 303, obtaining a second speech recognition result generated by the second device.
Steps 301 to 303 are substantially the same as steps 101 to 103, and are not described herein again.
Step 304, the position of the user, the position of the first device and the positions of the plurality of second devices are obtained.
Step 305, determining a relative position between the first device and the user according to the position of the user and the position of the first device.
And step 306, determining the relative positions between the plurality of second devices and the user according to the position of the user and the positions of the plurality of second devices.
In a specific implementation, the first device may obtain the location of the user, the location of the first device, and the locations of the plurality of second devices, and after obtaining the location of the user, the location of the first device, and the locations of the plurality of second devices, the first device may determine the relative location between the first device and the user according to the location of the user and the location of the first device, and determine the relative locations between the plurality of second devices and the user according to the location of the user and the locations of the plurality of second devices.
In an example, the first device may determine the position of the user, the position of the first device and the positions of the plurality of second devices through its camera, Bluetooth positioning and the like, and may also determine these positions through the Global Positioning System (GPS), a 3D semantic map and other functions.
In one example, the relative position between a device and the user may include, but is not limited to: the distance between the device and the user, the number of walls between the device and the user, the number of rooms between the device and the user, the number of floors between the device and the user, and the like.
And 307, generating an instruction according to the first voice recognition result, the second voice recognition result, the relative position between the first device and the user, the relative positions between the second devices and the user and the preset voice recognition credibility of each device.
In a specific implementation, after determining the relative position between the first device and the user and the relative positions between the plurality of second devices and the user, the first device may generate the instruction according to the first voice recognition result, the second voice recognition result, the relative position between the first device and the user, the relative positions between the plurality of second devices and the user, and the preset voice recognition reliability of each device.
In one example, the first device assigns weights to the first voice recognition result and the second voice recognition results according to the distance between each device and the user: the voice recognition result of a device close to the user is considered more credible, and that of a device far from the user less credible.
In another example, the first device assigns weights to the first voice recognition result and the second voice recognition results according to the number of walls between each device and the user: the voice recognition result of a device separated from the user by few walls is considered more credible, and that of a device separated from the user by many walls less credible.
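One possible weighting scheme consistent with these examples is sketched below; the decay formula, distances and wall counts are assumptions for illustration and are not prescribed by the disclosure.

```python
# Illustrative position-based weighting: credibility is discounted the further
# a device is from the user and the more walls separate them.

def position_weight(distance_m: float, walls_between: int) -> float:
    """Toy weight that decays with distance and halves per intervening wall."""
    return 1.0 / (1.0 + distance_m) * (0.5 ** walls_between)

def weighted_credibility(base_credibility: float, distance_m: float, walls_between: int) -> float:
    return base_credibility * position_weight(distance_m, walls_between)

# A nearby second device can outweigh a more capable first device two rooms away.
print(round(weighted_credibility(0.9, 8.0, 2), 3))  # first device, far from the user
print(round(weighted_credibility(0.6, 1.0, 0), 3))  # second device next to the user
```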
In this embodiment, before generating an instruction according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each device, the method includes: acquiring the position of the user, the position of the first device and the positions of the plurality of second devices; determining the relative position between the first device and the user according to the position of the user and the position of the first device; and determining the relative positions between the plurality of second devices and the user according to the position of the user and the positions of the plurality of second devices. Generating an instruction according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each device then includes: generating an instruction according to the first voice recognition result, the second voice recognition result, the relative position between the first device and the user, the relative positions between the plurality of second devices and the user, and the preset voice recognition credibility of each device. In actual life the second devices are distributed at various positions in the home and their sound pickup range is limited, so the first device of this embodiment also comprehensively considers the relative position between the first device and the user and the relative positions between each second device and the user when receiving the instruction: when the first device is close to the user, the voice recognition results of the second devices need not be referred to, and when a certain second device is close to the user, its voice recognition result is mainly referred to.
In an embodiment, the preset speech recognition reliability of each device includes speech recognition reliability of the first device and speech recognition reliability of the second device, where the speech recognition reliability of the second device is obtained by the first device through the steps as shown in fig. 5, and specifically includes:
step 401, performing several dialogs with the second device through voice, and obtaining a voice recognition result of the second device for several dialogs.
Step 402, comparing the voice recognition result of the plurality of conversations with the text content of the plurality of conversations by the second device, and determining the voice recognition reliability of the second device.
Specifically, the first device may carry out several dialogues with the second device by voice, obtain the second device's voice recognition results for those dialogues, and compare those voice recognition results with the text content of the dialogues, thereby determining the voice recognition credibility of the second device.
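A minimal sketch of this calibration dialogue is given below; it scores the match between the known dialogue text and the second device's recognition results with a simple text-similarity ratio, which is an assumed metric rather than the one used in the disclosure.

```python
# Illustrative estimation of a second device's voice recognition credibility:
# speak dialogues whose text is known, then compare what the second device
# recognized against that text.

from difflib import SequenceMatcher

def estimate_credibility(spoken_texts, recognized_texts):
    """Average similarity between the spoken dialogue text and the recognition results."""
    scores = [
        SequenceMatcher(None, spoken, recognized).ratio()
        for spoken, recognized in zip(spoken_texts, recognized_texts)
    ]
    return sum(scores) / len(scores) if scores else 0.0

spoken = ["turn on the living room light", "what is the weather tomorrow"]
heard = ["turn on the living room light", "what is the letter tomorrow"]
print(round(estimate_credibility(spoken, heard), 2))  # close to 1.0 when mostly recognized correctly
```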
In a specific implementation, the first device and the second device may be interchanged, and any device in the home that can serve as the first device may carry out several dialogues with the other devices by voice to obtain the voice recognition credibility of those devices.
In one example, the first device is a robot. When a second device is newly added to the home, the robot may start its camera to determine the location of the newly added second device, move to its vicinity and carry out several dialogues with it, thereby determining the voice recognition credibility of the newly added second device.
In one example, a first device may obtain a speech recognition confidence level for the first device from a second device.
In this embodiment, the preset voice recognition credibility of each device includes the voice recognition credibility of the first device and the voice recognition credibility of the second device, and the voice recognition credibility of the second device is acquired by the first device through the following steps: carrying out several dialogues with the second device by voice to obtain the second device's voice recognition results for the dialogues; and comparing the second device's voice recognition results for the dialogues with the text content of the dialogues to determine the voice recognition credibility of the second device. Since the voice recognition credibility of the second device is obtained by the first device in advance through several dialogues, the obtained credibility is scientific, real and reliable, which further improves the accuracy of the obtained instruction.
In one embodiment, the voice recognition credibility of each device includes any combination of: the voice recognition credibility of each device for dialects, for different languages, for different speaking volumes, in environments with different noise levels, and for users with different identity information. Different users have different living habits; for example, some users often use a dialect, some often use English, some speak quietly, and some have children at home. The actual environments of the second devices also differ; for example, some second devices are located near a window close to a road, while others are located in a quiet study. The first device comprehensively considers these aspects of voice recognition credibility when generating the instruction, which further improves the accuracy of the acquired instruction.
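One possible way to organize these per-aspect credibilities is sketched below; the dictionary layout, the particular aspect values and the multiplicative combination are assumptions for illustration only.

```python
# Illustrative per-aspect credibility profile for one second device, looked up
# against the speaker and environment of the current utterance.

second_device_credibility = {
    "dialect": {"mandarin": 0.9, "cantonese": 0.5},
    "language": {"zh": 0.9, "en": 0.7},
    "volume": {"low": 0.4, "normal": 0.8, "high": 0.9},
    "noise": {"quiet": 0.9, "noisy": 0.5},
    "speaker": {"adult": 0.9, "child": 0.6},
}

def lookup_credibility(profile: dict, context: dict) -> float:
    """Combine (here: multiply) the aspect credibilities that apply to the context."""
    value = 1.0
    for aspect, condition in context.items():
        value *= profile.get(aspect, {}).get(condition, 1.0)
    return value

print(round(lookup_credibility(second_device_credibility,
                               {"language": "en", "noise": "noisy", "speaker": "child"}), 2))  # 0.21
```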
Another embodiment of the present application relates to an instruction receiving method. Implementation details of the instruction receiving method of this embodiment are described below; these details are provided only to facilitate understanding and are not necessary for implementing this embodiment. A specific flow of the instruction receiving method of this embodiment may be as shown in fig. 6, and includes:
step 501, picking up voice information of a user.
Step 502, performing speech recognition on the speech information to generate a first speech recognition result.
Step 503, obtaining a second speech recognition result generated by the second device.
Steps 501 to 503 are substantially the same as steps 101 to 103, and are not described herein again.
Step 504, voice information of the user picked up by the third device is acquired.
And 505, performing voice recognition on the voice information of the user picked up by the third device to generate a third voice recognition result.
In a specific implementation, after generating the first voice recognition result and obtaining the second voice recognition result, the first device may further obtain the voice information of the user picked up by a third device, perform voice recognition on it and generate a third voice recognition result. The third device is a device with a sound pickup function but without a voice recognition function; by also including the third device in its reference range, the first device effectively further expands its sound pickup range.
In an example, the first device may also perform step 504 and step 505 first and then perform step 503; step 502, step 503 and step 504 may also be performed simultaneously.
Step 506, generating an instruction according to the first voice recognition result, the second voice recognition result, the third voice recognition result and the preset voice recognition credibility of each device.
In a specific implementation, after the first device generates the first voice recognition result and the third voice recognition result and obtains the second voice recognition result, it may generate an instruction according to the first voice recognition result, the second voice recognition result, the third voice recognition result and the preset voice recognition credibility of each device, so that the first device obtains a more accurate and complete voice instruction.
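The sketch below illustrates how the first device might fold in a pickup-only third device as described in steps 504 to 506; `transcribe` again stands in for the first device's own recognition engine, and the data values are invented.

```python
# Illustrative handling of a third device that only forwards raw audio: the
# first device runs its own speech recognition on that audio and merges the
# result with the first and second voice recognition results.

def transcribe(audio: bytes) -> str:
    """Placeholder for the first device's own speech recognition."""
    return "print bank leaflet"

def collect_results(first_audio, second_results, third_audio_list):
    results = {"first_device": transcribe(first_audio)}
    results.update(second_results)                     # text already recognized by second devices
    for index, audio in enumerate(third_audio_list):   # pickup-only third devices
        results[f"third_device_{index}"] = transcribe(audio)
    return results

print(collect_results(b"...", {"second_device_J": "print bank leaflet"}, [b"..."]))
```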
In this embodiment, before generating an instruction according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each device, the method includes: acquiring voice information of the user picked up by a third device, the third device having a sound pickup function but no voice recognition function; and performing voice recognition on the voice information of the user picked up by the third device to generate a third voice recognition result. Generating an instruction then includes: generating an instruction according to the first voice recognition result, the second voice recognition result, the third voice recognition result and the preset voice recognition credibility of each device. Although the third device has a sound pickup function but no voice recognition function, in this embodiment the first device can also refer to the voice information of the user picked up by the third device, which further extends the sound pickup range and allows the first device to obtain a more accurate and complete voice instruction.
In one embodiment, after the first device generates the instruction, the first device itself may directly reply to the instruction by voice, or the second device may reply to the instruction by voice, so as to better meet the actual needs of the user and further improve the use experience of the user.
In one example, the first device may decide, based on the relative position between the first device and the user and the relative positions between the several second devices and the user, whether to reply to the instruction by voice itself or to have the second device closest to the user reply by voice.
In one example, the first device may reply to the instruction by voice both through the first device itself and through all of the second devices.
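As a small illustration of choosing the replying device by proximity, the sketch below picks the device with the smallest distance to the user; the device names and distances are assumed values.

```python
# Illustrative only: the device closest to the user performs the voice reply.

def choose_reply_device(distances_to_user: dict) -> str:
    return min(distances_to_user, key=distances_to_user.get)

print(choose_reply_device({"first_device": 6.0, "second_a": 1.5, "second_b": 4.0}))
# -> 'second_a'
```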
In an embodiment, the obtaining, by the first device, the second speech recognition result generated by the second device may be implemented by the steps shown in fig. 7, which specifically include:
step 601, obtaining screening information, wherein the screening information includes the following arbitrary combinations: identity information of the user, location information of the user, and location information of each second device.
Step 602, determining the target second device according to the screening information and the preset corresponding relationship.
In a specific implementation, after generating the first voice recognition result, the first device may first obtain the screening information, which includes one or more of the identity information of the user, the position information of the user and the position information of each second device. After obtaining the screening information, the first device may determine the target second device according to the screening information and a preset correspondence, where the preset correspondence includes the correspondence between the identity information of the user and the target second device, the correspondence between the position information of the user and the target second device, and the correspondence between the position information of each second device and the target second device. The target second device is the second device that the first device needs to take into consideration.
In one example, the first device determines that the location information of the user is a kitchen, and the first device may determine, according to the location information of the kitchen, each second device in the kitchen as a target device.
In one example, the first device determines that the identity information of the user is a child, and the first device may use a second device corresponding to the child in the second devices as a target second device.
Step 603, obtaining a second speech recognition result generated by the target second device.
In a specific implementation, the first device only obtains the second speech recognition result generated by the target second device, so that the efficiency of instruction receiving can be effectively improved.
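A hedged sketch of the screening flow in steps 601 to 603 follows; the correspondence table, device names and screening values are invented for illustration and only stand in for the preset correspondence described above.

```python
# Illustrative screening: match the screening information against a preset
# correspondence table, determine the target second devices, and keep only
# their recognition results.

correspondence = {
    ("location", "kitchen"): ["second_c", "second_d"],
    ("identity", "child"): ["second_b"],
}

def select_targets(screening: dict) -> set:
    targets = set()
    for key, value in screening.items():
        targets.update(correspondence.get((key, value), []))
    return targets

def fetch_second_results(screening: dict, available_results: dict) -> dict:
    targets = select_targets(screening)
    return {device: text for device, text in available_results.items() if device in targets}

available_results = {"second_b": "print bank leaflet", "second_c": "print hidden sheet"}
print(fetch_second_results({"location": "kitchen"}, available_results))
# {'second_c': 'print hidden sheet'}
```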
In this embodiment, acquiring the second voice recognition result generated by the second device includes: acquiring screening information, where the screening information includes any combination of the identity information of the user, the position information of the user and the position information of each second device; determining a target second device according to the screening information and a preset correspondence, where the correspondence is the correspondence between the screening information and the target second device; and acquiring the second voice recognition result generated by the target second device. The target second device is determined among the second devices according to one or more of the identity information of the user, the position information of the user and the position information of each second device, and only the second voice recognition result generated by the target second device is acquired; that is, only the target second device is taken into consideration when receiving the instruction, which effectively improves the efficiency of instruction receiving.
The steps of the above methods are divided only for clarity of description. In implementation, steps may be combined into one step or a single step may be split into several steps; as long as the same logical relationship is preserved, such variants fall within the protection scope of this patent. Adding insignificant modifications to the algorithms or processes, or introducing insignificant design changes, without changing the core design of the algorithms or processes also falls within the protection scope of this patent.
Another embodiment of the present application relates to an instruction receiving system. Details of the instruction receiving system of this embodiment are described below; the implementation details that follow are provided only to facilitate understanding and are not necessary for implementing this embodiment. Fig. 8 is a schematic diagram of the instruction receiving system of this embodiment, which includes a first device 701 and a second device 702, where the second device 702 has a sound pickup function and a voice recognition function.
The first device 701 is configured to obtain voice information of a user, perform voice recognition on the voice information, and generate a first voice recognition result.
The second device 702 is configured to pick up voice information of the user, perform voice recognition on the voice information, and generate a second voice recognition result.
The first device 701 is further configured to obtain a second voice recognition result, generate an instruction according to the first voice recognition result, the second voice recognition result, and preset voice recognition reliability of each device, and execute the generated instruction.
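To make the division of work concrete, the following Python sketch shows one way the first device 701 could combine its own recognition result with those of the second devices 702 using the preset per-device voice recognition credibility. This is a minimal reading of the scheme, not the claimed algorithm: the whitespace word segmentation, the per-position selection rule, and all names and numbers are assumptions for illustration.

```python
# Illustrative sketch only: merging the first and second voice recognition
# results segment by segment, preferring the device with the higher preset
# voice recognition credibility.

from dataclasses import dataclass

@dataclass(frozen=True)
class RecognitionResult:
    device_id: str
    text: str

def merge_results(results, device_credibility):
    """Pick, position by position, the segment reported by the most credible device."""
    segmented = {
        r.device_id: r.text.split()  # stand-in for a real word segmenter
        for r in results
    }
    length = max(len(segments) for segments in segmented.values())

    merged = []
    for i in range(length):
        best_segment, best_score = "", float("-inf")
        for device_id, segments in segmented.items():
            if i >= len(segments):
                continue  # this device heard a shorter utterance
            score = device_credibility.get(device_id, 0.0)
            if score > best_score:
                best_segment, best_score = segments[i], score
        merged.append(best_segment)
    return " ".join(merged)

# Invented example: the kitchen speaker is preset as more credible than the
# first device's far-field pickup, so its segments win at every position.
results = [
    RecognitionResult("first_device", "turn on the hmm light"),
    RecognitionResult("kitchen_speaker", "turn on the kitchen light"),
]
credibility = {"first_device": 0.6, "kitchen_speaker": 0.9}
print(merge_results(results, credibility))  # -> turn on the kitchen light
```

In the claimed method the credibility may itself depend on dialect, language, volume, noise level, or user identity (see claim 6), which in this sketch would amount to making device_credibility a function of the segment's context rather than a constant per device.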
It should be understood that this embodiment is a system embodiment corresponding to the above method embodiment, and can be implemented in cooperation with the above method embodiment. The related technical details and technical effects mentioned in the above embodiments remain valid in this embodiment and are not repeated here. Correspondingly, the related technical details mentioned in this embodiment can also be applied to the above embodiments.
It should be noted that all the modules involved in this embodiment are logical modules. In practical applications, a logical unit may be a single physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present application, units that are not closely related to solving the technical problem proposed by the present application are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
Another embodiment of the present application relates to an electronic device, as shown in fig. 9, including: at least one processor 801; and a memory 802 communicatively coupled to the at least one processor 801; the memory 802 stores instructions executable by the at least one processor 801, and the instructions are executed by the at least one processor 801, so that the at least one processor 801 can execute the instruction receiving method in the above embodiments.
Where the memory and the processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges that link together various circuits of the one or more processors and the memory. The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and are therefore not described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatuses over a transmission medium. Data processed by the processor is transmitted over a wireless medium via an antenna, and the antenna also receives data and passes it to the processor.
The processor is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory may be used to store data used by the processor in performing operations.
Another embodiment of the present application relates to a cloud server, as shown in fig. 10, including: at least one processor 901; and a memory 902 communicatively coupled to the at least one processor 901; the memory 902 stores instructions executable by the at least one processor 901, and the instructions are executed by the at least one processor 901, so that the at least one processor 901 can execute the instruction receiving method in the foregoing embodiments.
Where the memory and the processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges that link together various circuits of the one or more processors and the memory. The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and are therefore not described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatuses over a transmission medium. Data processed by the processor is transmitted over a wireless medium via an antenna, and the antenna also receives data and passes it to the processor.
The processor is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory may be used to store data used by the processor in performing operations.
Another embodiment of the present application relates to a computer-readable storage medium storing a computer program. The computer program, when executed by a processor, implements the above method embodiments.
That is, as those skilled in the art can understand, all or part of the steps of the methods in the above embodiments may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions to enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the present application, and that various changes in form and details may be made therein without departing from the spirit and scope of the present application in practice.

Claims (13)

1. An instruction receiving method applied to a first device, comprising:
picking up voice information of a user;
performing voice recognition on the voice information to generate a first voice recognition result;
acquiring a second voice recognition result generated by second equipment; the second device is a device with a sound collecting function and a voice recognition function, and the second voice recognition result is generated by performing voice recognition on voice information of a user picked up by the second device through the second device;
and generating an instruction for the first equipment to execute according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each equipment.
2. The instruction receiving method according to claim 1, wherein there are a plurality of second devices and a plurality of second voice recognition results.
3. The instruction receiving method according to claim 2, wherein the generating an instruction according to the first speech recognition result, the second speech recognition result, and a preset speech recognition reliability of the device includes:
performing word segmentation on the first voice recognition result to obtain a plurality of first word segmentation segments;
performing word segmentation on the plurality of second voice recognition results to obtain a plurality of second word segmentation segments;
determining the voice recognition credibility of a plurality of first segmentation segments and the voice recognition credibility of a plurality of second segmentation segments according to the preset voice recognition credibility of each device;
and generating an instruction according to the plurality of first word segmentation segments, the voice recognition credibility of the plurality of first word segmentation segments, the plurality of second word segmentation segments, and the voice recognition credibility of the plurality of second word segmentation segments.
4. The instruction receiving method according to claim 2 or 3, wherein before the generating an instruction based on the first speech recognition result, the second speech recognition result, and a preset speech recognition reliability of each device, the instruction receiving method comprises:
acquiring the position of the user, the position of the first device and the positions of a plurality of second devices;
determining a relative position between the first device and the user according to the position of the user and the position of the first device;
determining the relative positions between the plurality of second devices and the user according to the position of the user and the positions of the plurality of second devices;
generating an instruction according to the first voice recognition result, the second voice recognition result and preset voice recognition credibility of each device, wherein the instruction comprises:
and generating an instruction according to the first voice recognition result, the second voice recognition result, the relative position between the first equipment and the user, the relative positions between the second equipment and the user and the preset voice recognition credibility of each equipment.
5. The instruction receiving method according to claim 1, wherein the preset speech recognition credibility of each device includes a speech recognition credibility of the first device and a speech recognition credibility of the second device, and the speech recognition credibility of the second device is obtained by the first device through the following steps:
carrying out a plurality of conversations with the second device by voice to obtain the voice recognition results of the second device for the plurality of conversations;
and comparing the voice recognition results of the second device for the plurality of conversations with the text contents of the plurality of conversations, and determining the voice recognition credibility of the second device.
6. The instruction receiving method according to claim 1 or 5, wherein the voice recognition credibility of each device comprises any combination of the following: the voice recognition credibility of the device for dialects, the voice recognition credibility of the device for different languages, the voice recognition credibility of the device for different volumes, the voice recognition credibility of the device in environments with different noise levels, and the voice recognition credibility of the device for users with different identity information.
7. The instruction receiving method according to claim 1, wherein before the generating an instruction according to the first voice recognition result, the second voice recognition result, and the preset voice recognition credibility of each device, the method comprises:
acquiring voice information of a user picked up by third equipment; the third equipment has a sound pickup function and does not have a voice recognition function;
performing voice recognition on the voice information of the user picked up by the third equipment to generate a third voice recognition result;
generating an instruction according to the first voice recognition result, the second voice recognition result and preset voice recognition credibility of each device, wherein the instruction comprises:
and generating an instruction according to the first voice recognition result, the second voice recognition result, the third voice recognition result and the preset voice recognition credibility of each device.
8. The instruction receiving method according to claim 1, wherein after the generating the instruction, the method further comprises:
directly replying to the instruction by voice;
or replying to the instruction by voice through the second device.
9. The instruction receiving method according to claim 1, wherein the obtaining of the second speech recognition result generated by the second device comprises:
acquiring screening information; wherein the screening information comprises any combination of: identity information of the user, location information of the user and location information of each second device;
determining target second equipment according to the screening information and a preset corresponding relation; wherein, the corresponding relationship is the corresponding relationship between the screening information and the target second device;
and acquiring a second voice recognition result generated by the target second equipment.
10. An instruction receiving system is characterized by comprising a first device and a second device, wherein the second device has a sound pickup function and a voice recognition function;
the first equipment is used for acquiring voice information of a user, performing voice recognition on the voice information and generating a first voice recognition result;
the second equipment is used for picking up voice information of a user, performing voice recognition on the voice information and generating a second voice recognition result;
and the first equipment is also used for acquiring the second voice recognition result, generating an instruction according to the first voice recognition result, the second voice recognition result and the preset voice recognition credibility of each equipment, and executing the instruction.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the instruction receiving method of any one of claims 1 to 9.
12. A cloud server, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the instruction receiving method of any one of claims 1 to 9.
13. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, implements the instruction receiving method of any one of claims 1 to 9.
CN202111115408.XA 2021-09-23 2021-09-23 Instruction receiving method, system, electronic device, cloud server and storage medium Pending CN113889102A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111115408.XA CN113889102A (en) 2021-09-23 2021-09-23 Instruction receiving method, system, electronic device, cloud server and storage medium

Publications (1)

Publication Number Publication Date
CN113889102A true CN113889102A (en) 2022-01-04

Family

ID=79010340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111115408.XA Pending CN113889102A (en) 2021-09-23 2021-09-23 Instruction receiving method, system, electronic device, cloud server and storage medium

Country Status (1)

Country Link
CN (1) CN113889102A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030139924A1 (en) * 2001-12-29 2003-07-24 Senaka Balasuriya Method and apparatus for multi-level distributed speech recognition
US20150228274A1 (en) * 2012-10-26 2015-08-13 Nokia Technologies Oy Multi-Device Speech Recognition
CN108461084A (en) * 2018-03-01 2018-08-28 广东美的制冷设备有限公司 Speech recognition system control method, control device and computer readable storage medium
CN110288997A (en) * 2019-07-22 2019-09-27 苏州思必驰信息科技有限公司 Equipment awakening method and system for acoustics networking
CN111696562A (en) * 2020-04-29 2020-09-22 华为技术有限公司 Voice wake-up method, device and storage medium

Similar Documents

Publication Publication Date Title
US20200286482A1 (en) Processing voice commands based on device topology
CN107644638B (en) Audio recognition method, device, terminal and computer readable storage medium
JP6947852B2 (en) Intercom communication using multiple computing devices
CN108319599B (en) Man-machine conversation method and device
CN108664472B (en) Natural language processing method, device and equipment
CN112074898A (en) Machine generation of context-free grammars for intent inference
US10860289B2 (en) Flexible voice-based information retrieval system for virtual assistant
US11087763B2 (en) Voice recognition method, apparatus, device and storage medium
US20190221208A1 (en) Method, user interface, and device for audio-based emoji input
US20140195233A1 (en) Distributed Speech Recognition System
CN108269567A (en) For generating the method, apparatus of far field voice data, computing device and computer readable storage medium
CN110444206A (en) Voice interactive method and device, computer equipment and readable medium
CN110349575A (en) Method, apparatus, electronic equipment and the storage medium of speech recognition
US10952075B2 (en) Electronic apparatus and WiFi connecting method thereof
CN111508472A (en) Language switching method and device and storage medium
CN112579031A (en) Voice interaction method and system and electronic equipment
CN111522937B (en) Speaking recommendation method and device and electronic equipment
CN113889102A (en) Instruction receiving method, system, electronic device, cloud server and storage medium
CN111161718A (en) Voice recognition method, device, equipment, storage medium and air conditioner
CN111414760B (en) Natural language processing method, related equipment, system and storage device
CN112652073A (en) Autonomous navigation method and system based on cloud network end robot
CN114179083B (en) Leading robot voice information generation method and device and leading robot
GB2567067A (en) Processing voice commands based on device topology
US11978458B2 (en) Electronic apparatus and method for recognizing speech thereof
CN115328321B (en) Man-machine interaction method based on identity conversion and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination