CN113409793A - Voice recognition method, intelligent home system, conference device and computing device - Google Patents

Voice recognition method, intelligent home system, conference device and computing device

Info

Publication number
CN113409793A
CN113409793A (application CN202010129820.6A)
Authority
CN
China
Prior art keywords
machine learning
learning model
voice
information
target object
Prior art date
Legal status
Granted
Application number
CN202010129820.6A
Other languages
Chinese (zh)
Other versions
CN113409793B (en)
Inventor
郑斯奇
雷赟
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010129820.6A priority Critical patent/CN113409793B/en
Publication of CN113409793A publication Critical patent/CN113409793A/en
Application granted granted Critical
Publication of CN113409793B publication Critical patent/CN113409793B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06N 20/00: Machine learning (G: Physics; G06: Computing, calculating or counting; G06N: Computing arrangements based on specific computational models)
    • G06N 3/02: Neural networks (G06N 3/00: Computing arrangements based on biological models)
    • G06N 3/08: Learning methods
    • G10L 17/18: Artificial neural networks; connectionist approaches (G10L 17/00: Speaker identification or verification techniques)
    • G10L 17/22: Interactive procedures; man-machine interfaces (G10L 17/00: Speaker identification or verification techniques)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voice recognition method, an intelligent home system, a conference device, and a computing device. The method comprises: collecting voice information of at least one target object; inputting the voice information of the at least one target object into a first machine learning model and a second machine learning model, and jointly inputting the output results of each network layer of the first and second machine learning models into a target machine learning model for analysis, so as to obtain identity information of the target object and the voice content corresponding to that identity information; and outputting the voice content. The method solves the technical problem that voice recognition schemes for short-time text-independent tasks have low accuracy.

Description

Voice recognition method, intelligent home system, conference device and computing device
Technical Field
The application relates to the field of voice recognition, in particular to a voice recognition method, an intelligent home system, conference equipment and computing equipment.
Background
Speaker recognition is a technology for identifying the identity of a speaker by voice. Existing speaker recognition technology, as deployed in industry, mainly targets two scenarios: short-time text-dependent, in which the text spoken by the speaker is fixed (for example, the wake-up word of a smart home device); and long-time text-independent, in which the spoken content is unconstrained but the speech must be relatively long. For short-time text-independent tasks, conventional speaker recognition yields low accuracy and falls short of commercial requirements.
For the problem that voice recognition schemes for short-time text-independent tasks have low accuracy, no effective solution has been proposed so far.
Disclosure of Invention
The embodiments of the application provide a voice recognition method, an intelligent home system, a conference device and a computing device, so as to at least solve the technical problem that voice recognition schemes for short-time text-independent tasks have low accuracy.
According to an aspect of an embodiment of the present application, there is provided a speech recognition method including: collecting voice information of at least one target object; inputting the voice information of at least one target object into a first machine learning model and a second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer in the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information; and outputting the voice content.
According to another aspect of the embodiments of the present application, there is provided another speech recognition method, including: receiving voice information of a target object; inputting the voice information of the target object into a corresponding network layer of the target machine learning model for analysis to obtain identity information of the target object and voice content corresponding to the identity information, wherein the input of each network layer in the target machine learning model is the output result of each network layer in the first machine learning model and the second machine learning model; and verifying the identity information, and executing operation corresponding to the voice content when the verification is passed.
According to another aspect of the embodiments of the present application, an intelligent home system is further provided, which includes at least one home device and a control device, where the at least one home device is configured to collect voice information of a target object in a space where the at least one home device is located, and receive a control instruction from the control device; the control equipment is used for receiving voice information, inputting the voice information of at least one target object into the first machine learning model and the second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer of the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information; and generating a control instruction based on the voice content, and sending the control instruction to at least one household appliance.
According to another aspect of the embodiments of the present application, there is also provided a conference device, including: the voice acquisition equipment is used for acquiring voice information of at least one target object in a space where the voice acquisition equipment is located; the controller is used for acquiring voice information, inputting the voice information into the first machine learning model, and inputting the output result of each network layer in the first machine learning model into a corresponding network layer in the target machine learning model, wherein the target machine learning model is used for recognizing identity information of a target object and voice content corresponding to the identity information, and the first machine learning model is a model for recognizing acoustic features of at least one target object.
According to another aspect of the embodiments of the present application, there is also provided another conference device, including: the voice acquisition equipment is used for acquiring voice information of at least one target object in a space where the voice acquisition equipment is located; and the controller is used for acquiring voice information, inputting the voice information into the second machine learning model, and inputting the output result of each network layer in the second machine learning model into a corresponding network layer in the target machine learning model, wherein the target machine learning model is used for recognizing the identity information of the target object and the voice content corresponding to the identity information, and the second machine learning model is a model for performing content recognition on the voice information of at least one target object.
According to another aspect of the embodiments of the present application, there is provided another speech recognition method, including: collecting voice information of at least one target object; inputting voice information of at least one target object into a second machine learning model, inputting an output result of a network layer in the second machine learning model into a first machine learning model, and inputting an output result of the network layer in the first machine learning model into a corresponding network layer in the target machine learning model for analysis to obtain identity information of the target object and voice content corresponding to the identity information; and outputting the voice content.
According to still another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program, wherein when the program runs, an apparatus in which the storage medium is located is controlled to execute the above speech recognition method.
According to still another aspect of embodiments of the present application, there is also provided a computing device, including: a processor; and a memory coupled to the processor for providing instructions to the processor for processing the following processing steps: collecting voice information of at least one target object; inputting the voice information of at least one target object into a first machine learning model and a second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer of the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information; and outputting the voice content.
In the embodiments of the application, voice information of at least one target object is collected; the voice information of the at least one target object is input into a first machine learning model and a second machine learning model, and the output results of each network layer of the first and second machine learning models are jointly input into the corresponding network layer of a target machine learning model for analysis, obtaining the identity information of the target object and the voice content corresponding to that identity information; and the voice content is output. By combining three mutually associated neural networks to recognize the voice information of the target object, this method achieves the technical effect of improving the accuracy of voice recognition for short-time text-independent tasks, thereby solving the technical problem that voice recognition schemes for such tasks have low accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 shows a block diagram of a hardware structure of a computer terminal (or mobile device) for implementing a voice recognition method;
FIG. 2 is a flow chart of a method of speech recognition according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a neural network model according to an embodiment of the present application;
FIG. 4 is a flow diagram of another speech recognition method according to an embodiment of the present application;
fig. 5 is a structural diagram of an intelligent home system according to an embodiment of the present application;
FIG. 6 is a block diagram of a computer terminal according to an embodiment of the present application;
fig. 7 is a schematic view of an application scenario in which a user wakes up an intelligent home appliance by voice according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a conference device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of another conference device according to an embodiment of the present application;
FIG. 10 is a flow chart of another speech recognition method according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. Evidently, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some terms or terms appearing in the description of the embodiments of the present application are applicable to the following explanations:
speaker recognition: and the identity of the speaker is identified and confirmed through voice.
Short-time text-independent tasks: the content spoken by the speaker is not limited, and the voice is short (e.g., less than 5 seconds).
Example 1
In accordance with an embodiment of the present application, an embodiment of a speech recognition method is provided. It should be noted that the steps illustrated in the flowchart of the figure may be performed in a computer system, such as one executing a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.
The method provided by the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a hardware configuration block diagram of a computer terminal (or mobile device) for implementing the voice recognition method. As shown in fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and does not limit the structure of the electronic device. For example, the computer terminal 10 may include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuit may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuit acts as a processor control (e.g. selection of a variable resistance termination path connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as the program instructions/data storage devices corresponding to the speech recognition method in the embodiments of the present application. The processor 102 executes various functional applications and data processing, that is, implements the above speech recognition method, by running the software programs and modules stored in the memory 104. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission module 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission module 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission module 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
Under the above operating environment, the present application provides a speech recognition method as shown in fig. 2. Fig. 2 is a flow chart of a speech recognition method according to an embodiment of the present application, as shown in fig. 2, the method comprising the steps of:
step S202, collecting voice information of at least one target object.
According to an alternative embodiment of the present application, the voice information in step S202 is voice information corresponding to a short-time text-independent task. Voice recognition for short-time text-independent tasks is mainly applied to wake-up words of home appliances, such as "hello TV" and the like.
Step S204, inputting the voice information of at least one target object into the first machine learning model and the second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer of the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information.
In an alternative embodiment of the present application, the first machine learning model is a model for recognizing acoustic features of the at least one target object; the second machine learning model is a model for performing content recognition on speech information of at least one target object.
In an alternative embodiment of the present application, the first machine learning model, the second machine learning model and the target machine learning model are three independent and mutually linked neural network learning models. Fig. 3 is a schematic diagram of a neural network model according to an embodiment of the present application; as shown in fig. 3, the overall model is composed of three independent and interconnected neural networks: a left, a middle and a right neural network. The leftmost network is similar to a conventional voiceprint recognition network, with acoustic features as input. The rightmost network takes as input hidden features extracted from the speech recognition network, which represent the content of the speech. The intermediate network is formed by cross-linking the two side networks.
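As an illustrative sketch only (the patent does not specify layer types, sizes, or a framework; the module names and dimensions below are assumptions), the cross-linked three-network structure of fig. 3 might be organized as follows:

```python
import torch
import torch.nn as nn

class CrossLinkedSpeakerNet(nn.Module):
    """A sketch of the three cross-linked networks of fig. 3 (all sizes assumed).

    - voiceprint: the left network, consuming acoustic features.
    - content: the right network, consuming hidden features from a speech
      recognition (ASR) model.
    - fusion: the middle (target) network; each of its layers receives the
      concatenated outputs of the corresponding layers of the two side networks.
    """

    def __init__(self, acoustic_dim=80, asr_hidden_dim=256, width=256, depth=3):
        super().__init__()
        self.voiceprint = nn.ModuleList(
            [nn.Linear(acoustic_dim if i == 0 else width, width) for i in range(depth)])
        self.content = nn.ModuleList(
            [nn.Linear(asr_hidden_dim if i == 0 else width, width) for i in range(depth)])
        # Fusion layer i takes [voiceprint_i ; content_i], plus its own previous state.
        self.fusion = nn.ModuleList(
            [nn.Linear(2 * width + (0 if i == 0 else width), width) for i in range(depth)])
        self.embed = nn.Linear(width, width)  # final speaker-embedding head

    def forward(self, acoustic_feats, asr_hidden_feats):
        # acoustic_feats: (batch, time, acoustic_dim); asr_hidden_feats: (batch, time, asr_hidden_dim)
        v, c, f = acoustic_feats, asr_hidden_feats, None
        for i in range(len(self.fusion)):
            v = torch.relu(self.voiceprint[i](v))
            c = torch.relu(self.content[i](c))
            fused_in = torch.cat([v, c] if f is None else [v, c, f], dim=-1)
            f = torch.relu(self.fusion[i](fused_in))
        return self.embed(f.mean(dim=1))  # pool over time -> utterance embedding
```

Embeddings from the `embed` head would then feed the triplet loss described below.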
In step S206, the speech content is output.
By the method, the three neural networks are combined, and the three neural networks which are mutually associated are utilized to identify the voice information of the target object, so that the technical effect of improving the accuracy of the voice identification of the short-time text-independent task is achieved.
According to an alternative embodiment of the present application, the target machine learning model is trained as follows: acquiring multiple groups of training data for training the target machine learning model, wherein each group of training data comprises triplet information, and the triplet information comprises: different pieces of voice information of a first sample object, and voice information of a second sample object; and respectively inputting the multiple groups of training data into the target machine learning model for training until the prediction result of the target machine learning model meets a preset condition.
The first sample object and the second sample object refer to two different speakers.
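For illustration, under the assumption that each triplet pairs two different utterances of one speaker (an anchor and a positive example) with one utterance of another speaker (a negative example), consistent with the anchor/positive/negative terminology used later in this embodiment, a training triplet could be assembled as follows (a sketch; the data layout and helper name are hypothetical):

```python
import random

def build_triplet(utterances_by_speaker):
    """Sample one training triplet. Assumed layout: a dict mapping each speaker
    to a list (>= 2 entries) of feature tensors for that speaker's utterances."""
    spk_a, spk_b = random.sample(list(utterances_by_speaker), 2)
    anchor, positive = random.sample(utterances_by_speaker[spk_a], 2)  # same speaker, different speech
    negative = random.choice(utterances_by_speaker[spk_b])             # a different speaker
    return anchor, positive, negative
```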
In an alternative embodiment of the present application, respectively inputting the multiple groups of training data into the target machine learning model for training includes: when the prediction result does not meet the preset condition, adjusting the weights of the different pieces of voice information of the first sample object and of the voice information of the second sample object, until the prediction result of the target machine learning model meets the preset condition.
According to an alternative embodiment of the present application, adjusting the weights of the different pieces of voice information of the first sample object and of the voice information of the second sample object may include: increasing the weight of the different pieces of voice information of the first sample object; and/or reducing the weight of the voice information of the second sample object. Performing end-to-end recognition through the triplet loss allows different speech contents to be normalized more effectively. For example, when the same person speaks different content, more weight may be given to information related to the voice characteristics, so that the two samples are pulled as close together as possible. If different people speak the same content, the target model can discard content-related information as far as possible and find the differences between the voiceprints of different speakers.
In another alternative embodiment of the present application, respectively inputting the multiple groups of training data into the target machine learning model for training includes: when the prediction result does not meet the preset condition, adjusting the loss function of the target machine learning model until the sample distance between the feature vectors of the different pieces of voice information of the first sample object is smaller than the sample distance between the feature vector of the second sample object and the feature vector of specified voice information, wherein the specified voice information is any one of the different pieces of voice information of the first sample object.
Unlike mainstream voiceprint recognition frameworks, the neural network of the target machine learning model adopts an end-to-end triplet loss function. Triplet loss is a deep learning loss function suited to training samples with small differences between classes. A triplet comprises an anchor example, a positive example and a negative example; similarity between samples is learned by optimizing the distance from the anchor to the positive example to be smaller than the distance from the anchor to the negative example.
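A minimal sketch of this end-to-end triplet loss (the margin value and the distance metric are assumed hyperparameters, not specified by the patent; PyTorch also provides the equivalent nn.TripletMarginLoss):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor_emb, positive_emb, negative_emb, margin=0.2):
    """Push d(anchor, positive) below d(anchor, negative) by at least `margin`."""
    d_pos = F.pairwise_distance(anchor_emb, positive_emb)  # same-speaker distance
    d_neg = F.pairwise_distance(anchor_emb, negative_emb)  # cross-speaker distance
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```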
The target machine learning model eliminates the conventional back-end Linear Discriminant Analysis (LDA) and PLDA scoring (PLDA is used to handle speaker and channel variability), and instead computes the loss by directly comparing different triplets.
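At inference time this means two utterances can be scored by directly comparing their embeddings, with no LDA/PLDA back end; one common choice, shown here as an assumption rather than the patent's stated metric, is cosine similarity against a tuned threshold:

```python
import torch.nn.functional as F

def same_speaker(emb_a, emb_b, threshold=0.7):
    """Direct embedding comparison without an LDA/PLDA back end (threshold assumed)."""
    return F.cosine_similarity(emb_a, emb_b, dim=-1) > threshold
```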
By the method, the end-to-end speaker recognition is carried out based on the triplet loss function and in combination with the voice recognition information, and compared with the traditional speaker recognition technology, the accuracy of the voice recognition of the short-time text irrelevant tasks can be greatly improved.
In some optional embodiments of the present application, the speech recognition method further includes: and verifying the identity information, and executing operation corresponding to the voice content when the verification is passed.
After the voice content corresponding to the identity information of the target object is recognized, the identity of the target object can be further verified, and the operation corresponding to the voice content is executed only when the verification passes. For example, in a specific application where a user controls an intelligent household appliance by voice through a wake-up word, after the user issues the voice control instruction "turn on the air conditioner" and the controller on the air conditioner recognizes the instruction through the above voice recognition method, the controller further determines whether the user has control authority over the air conditioner; the air conditioner performs the operation corresponding to the instruction only after that authority is confirmed. In this way, user permissions can be restricted and the operational security of the household appliance improved.
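A schematic sketch of this verify-then-execute pattern (the function and device interfaces are hypothetical illustrations, not an API defined by the patent):

```python
def handle_voice_command(audio, model, authorized_speakers, device):
    """Recognize speaker and content jointly, then act only if the speaker is authorized."""
    identity, content = model.recognize(audio)  # joint identity + content recognition
    if identity in authorized_speakers:         # identity verification / permission check
        device.execute(content)                 # e.g. "turn on the air conditioner"
```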
Fig. 7 is a schematic view of an application scenario in which a user wakes an intelligent home appliance by voice according to an embodiment of the present application. As shown in fig. 7, the processor of the intelligent home appliance runs a neural network system composed of the first machine learning model, the second machine learning model and the target machine learning model. After a user issues a voice control instruction to wake the appliance, the appliance collects the instruction through its microphone and sends it to the processor, which processes the corresponding voice information by executing the following method:
step S702, collecting voice information of at least one target object;
step S704, inputting the voice information of at least one target object into a first machine learning model and a second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer of the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information;
step S706, the voice content is output.
Specifically, if the same user speaks different content, the neural network system may give more weight to information related to the voice characteristics. If different users speak the same content, the neural network system can discard content-related information as far as possible and find the differences between the users' voiceprints.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the speech recognition method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
Example 2
Fig. 4 is a flow chart of another speech recognition method according to an embodiment of the present application, as shown in fig. 4, the method including the steps of:
step S402, receiving the voice information of the target object.
According to an optional embodiment of the present application, the voice information is voice information corresponding to a short-time text-independent task, and the voice recognition of the short-time text-independent task is mainly applied to a wakeup word of a home appliance.
Step S404, inputting the voice information of the target object into a corresponding network layer of the target machine learning model for analysis, and obtaining the identity information of the target object and the voice content corresponding to the identity information, wherein the input of each network layer in the target machine learning model is the output result of each network layer in the first machine learning model and the second machine learning model.
According to an alternative embodiment of the present application, the first machine learning model is a model that identifies acoustic features of the target object; the second machine learning model is a model for performing content recognition on the voice information of the target object.
The first machine learning model, the second machine learning model and the target machine learning model are three independent and mutually linked neural networks: a left, a middle and a right neural network. Viewed on its own, the leftmost network is similar to a conventional voiceprint recognition network, with acoustic features as input. The rightmost network takes as input hidden features extracted from the speech recognition network, which represent the content of the speech. The intermediate network is formed by cross-linking the two side networks.
And step S406, verifying the identity information, and executing the operation corresponding to the voice content when the verification is passed.
After the voice content corresponding to the identity information of the target object is recognized, the identity of the target object can be further verified, and the operation corresponding to the voice content is executed only when the verification passes. For example, in a specific application where a user controls an intelligent household appliance by voice through a wake-up word, after the user issues the voice control instruction "turn on the air conditioner" and the controller on the air conditioner recognizes the instruction through the above voice recognition method, the controller further determines whether the user has control authority over the air conditioner; the air conditioner performs the operation corresponding to the instruction only after that authority is confirmed. In this way, user permissions can be restricted and the operational security of the household appliance improved.
It should be noted that, reference may be made to the description related to the embodiment shown in fig. 1 for a preferred implementation of the embodiment shown in fig. 4, and details are not described here again.
Example 3
Fig. 5 is a block diagram of a smart home system according to an embodiment of the present application, and as shown in fig. 5, the system includes at least one home device 50 and a control device 52, wherein,
the household appliance 50 is used for acquiring voice information of a target object in a space where the household appliance 50 is located and receiving a control instruction from the control device 52;
according to an optional embodiment of the present application, the voice information is voice information corresponding to a short-time text-independent task, and voice recognition for short-time text-independent tasks is mainly applied to wake-up words of home appliances, such as "hello TV" and the like. The home appliance 50 includes, but is not limited to, devices such as smart air conditioners, smart televisions and smart speakers.
The control device 52 is configured to receive voice information, input the voice information of at least one target object into the first machine learning model and the second machine learning model, and input an output result of each network layer in the first machine learning model and the second machine learning model together into a corresponding network layer of the target machine learning model for analysis, so as to obtain identity information of the target object and voice content corresponding to the identity information; and generating a control instruction based on the voice content, and sending the control instruction to at least one household appliance.
In an alternative embodiment of the present application, the first machine learning model is a model for recognizing acoustic features of the at least one target object; the second machine learning model is a model for performing content recognition on speech information of at least one target object.
In an alternative embodiment of the present application, the first machine learning model, the second machine learning model and the target machine learning model are three independent and mutually linked neural networks. Fig. 3 is a schematic diagram of a neural network model according to an embodiment of the present application; as shown in fig. 3, the overall model is composed of three independent and interconnected neural networks: a left, a middle and a right neural network. Viewed on its own, the leftmost network is similar to a conventional voiceprint recognition network, with acoustic features as input. The rightmost network takes as input hidden features extracted from the speech recognition network, which represent the content of the speech. The intermediate network is formed by cross-linking the two side networks.
Preferably, after recognizing the voice content corresponding to the identity information of the target object, the control device 52 may further verify the identity of the target object, and perform the operation corresponding to the voice content only when the verification passes. For example, in a specific application where a user controls an intelligent air conditioning device by voice through a wake-up word, after the user issues the voice control instruction "turn on the air conditioner" and the control device 52 recognizes the instruction through the above voice recognition method, it further determines whether the user has control authority over the air conditioning device, and controls the device to perform the operation corresponding to the instruction only after that authority is confirmed. In this way, user permissions can be restricted and the operational security of the household appliance improved.
Through the system, the technical effect of improving the accuracy of the voice recognition of the short-time text-independent task can be achieved.
It should be noted that, reference may be made to the description related to the embodiment shown in fig. 2 for a preferred implementation of the embodiment shown in fig. 5, and details are not described here again.
Example 4
The embodiment of the application can provide a computer terminal, and the computer terminal can be any one computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.
In this embodiment, the computer terminal may execute program codes of the following steps in the speech recognition method of the application program: collecting voice information of at least one target object; inputting the voice information of at least one target object into a first machine learning model and a second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer of the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information; and outputting the voice content.
Optionally, fig. 6 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in fig. 6, the computer terminal 60 may include: one or more processors 602 (only one shown), memory 604, and a radio frequency module, audio module, and display screen.
The memory 604 may be used to store software programs and modules, such as program instructions/modules corresponding to the voice recognition method and apparatus in the embodiments of the present application, and the processor 602 executes various functional applications and data processing by running the software programs and modules stored in the memory, so as to implement the voice recognition method. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, which may be connected to the computer terminal 60 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: collecting voice information of at least one target object; inputting the voice information of at least one target object into a first machine learning model and a second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer of the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information; and outputting the voice content.
Optionally, the processor may further execute the program code of the following steps: acquiring a plurality of groups of training data for training a target machine learning model, wherein each group of training data comprises a triplet information, and the triplet information comprises: different speech information of the first sample object; speech information of the second sample object; and respectively inputting the multiple groups of training data into corresponding network layers of the target machine learning model for training until the prediction result of the target machine learning model meets the preset condition.
Optionally, the processor may further execute the program code of the following steps: and when the prediction result does not meet the preset condition, adjusting the weights of the different voice information of the first sample object and the voice information of the second sample object until the prediction result of the target machine learning model meets the preset condition.
Optionally, the processor may further execute the program code of the following steps: increasing the weight of the different speech information of the first sample object; and/or reducing the weight of the speech information of the second sample object.
Optionally, the processor may further execute the program code of the following steps: and when the prediction result does not meet the preset condition, adjusting the loss function of the target machine learning model until the sample distance between the feature vectors of different voice information of the first sample object is smaller than the sample distance between the second sample object and the feature vector of the specified voice information, wherein the specified voice information is any one feature vector in the feature vectors of different voice information of the first sample object.
Optionally, the processor may further execute the program code of the following steps: and verifying the identity information, and executing operation corresponding to the voice content when the verification is passed.
According to an alternative embodiment of the present application, the processor may further call the information and the application stored in the memory through the transmission device to perform the following steps: receiving voice information of a target object; inputting the voice information of the target object into a corresponding network layer of the target machine learning model for analysis to obtain identity information of the target object and voice content corresponding to the identity information, wherein the input of each network layer in the target machine learning model is the output result of each network layer in the first machine learning model and the second machine learning model; and verifying the identity information, and executing operation corresponding to the voice content when the verification is passed.
According to an alternative embodiment of the present application, the processor may further call the information and the application stored in the memory through the transmission device to perform the following steps: collecting voice information of at least one target object; inputting voice information of at least one target object into a second machine learning model, inputting an output result of a network layer in the second machine learning model into a first machine learning model, and inputting an output result of the network layer in the first machine learning model into a corresponding network layer in the target machine learning model for analysis to obtain identity information of the target object and voice content corresponding to the identity information; and outputting the voice content.
By adopting the embodiment of the application, a scheme of voice recognition is provided. The three neural networks are combined, and the three neural networks which are correlated with each other are used for recognizing the voice information of the target object, so that the aim of improving the accuracy of the voice recognition of the short-time text-independent task is fulfilled, and the technical problem of low accuracy of the voice recognition scheme corresponding to the short-time text-independent task is solved.
It can be understood by those skilled in the art that the structure shown in fig. 6 is only an illustration, and the computer terminal may also be a terminal device such as a smartphone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 6 does not limit the structure of the electronic device; for example, the computer terminal 60 may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 6, or have a different configuration than shown in fig. 6.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and the like.
Embodiments of the present application also provide a storage medium. Optionally, in this embodiment, the storage medium may be configured to store a program code executed by the speech recognition method provided in embodiment 1.
Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: collecting voice information of at least one target object; inputting the voice information of at least one target object into a first machine learning model and a second machine learning model, and inputting the output result of each network layer in the first machine learning model and the second machine learning model into the corresponding network layer of the target machine learning model together for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information; and outputting the voice content.
Optionally, the storage medium is further configured to store program code for performing the following steps: acquiring a plurality of groups of training data for training a target machine learning model, wherein each group of training data comprises a triplet information, and the triplet information comprises: different speech information of the first sample object; speech information of the second sample object; and respectively inputting the multiple groups of training data into corresponding network layers of the target machine learning model for training until the prediction result of the target machine learning model meets the preset condition.
Optionally, the storage medium is further configured to store program code for performing the following steps: and when the prediction result does not meet the preset condition, adjusting the weights of the different voice information of the first sample object and the voice information of the second sample object until the prediction result of the target machine learning model meets the preset condition.
Optionally, the storage medium is further configured to store program code for performing the following steps: increasing the weight of the different speech information of the first sample object; and/or reducing the weight of the speech information of the second sample object.
Optionally, the storage medium is further configured to store program code for performing the following steps: and when the prediction result does not meet the preset condition, adjusting the loss function of the target machine learning model until the sample distance between the feature vectors of different voice information of the first sample object is smaller than the sample distance between the second sample object and the feature vector of the specified voice information, wherein the specified voice information is any one feature vector in the feature vectors of different voice information of the first sample object.
Optionally, the storage medium is further configured to store program code for performing the following steps: and verifying the identity information, and executing operation corresponding to the voice content when the verification is passed.
Optionally, in this embodiment, the storage medium is further configured to store program code for performing the following steps: receiving voice information of a target object; inputting the voice information of the target object into a corresponding network layer of the target machine learning model for analysis to obtain identity information of the target object and voice content corresponding to the identity information, wherein the input of each network layer in the target machine learning model is the output result of each network layer in the first machine learning model and the second machine learning model; and verifying the identity information, and executing operation corresponding to the voice content when the verification is passed.
Optionally, in this embodiment, the storage medium is further configured to store program code for performing the following steps: collecting voice information of at least one target object; inputting voice information of at least one target object into a second machine learning model, inputting an output result of a network layer in the second machine learning model into a first machine learning model, and inputting an output result of the network layer in the first machine learning model into a corresponding network layer in the target machine learning model for analysis to obtain identity information of the target object and voice content corresponding to the identity information; and outputting the voice content.
Example 5
Fig. 8 is a schematic structural diagram of a conference device according to an embodiment of the present application, and as shown in fig. 8, the conference device includes:
The at least one voice acquisition device 80 is used for collecting voice information of at least one target object in the space where it is located. The voice acquisition device 80 may be a microphone array.
The controller 82 is configured to acquire the voice information, input the voice information into a first machine learning model, and input the output result of each network layer in the first machine learning model into a corresponding network layer in a target machine learning model, where the target machine learning model is used to identify identity information of the target object and the voice content corresponding to the identity information, and the first machine learning model is a model that recognizes acoustic features of the at least one target object.
According to an optional embodiment of the present application, the first machine learning model may recognize a voiceprint feature of the target object, and after the first machine learning model is used to recognize the voice information of the target object, the voiceprint recognition result is input to the target machine learning model for further processing.
It should be noted that, reference may be made to the description related to the embodiment shown in fig. 2 for a preferred implementation of the embodiment shown in fig. 8, and details are not repeated here.
Example 6
Fig. 9 is a schematic structural diagram of another conference device according to an embodiment of the present application, and as shown in fig. 9, the conference device includes:
The at least one voice acquisition device 90 is used for collecting voice information of at least one target object in the space where it is located. The voice acquisition device 90 may be a microphone array.
The controller 92 is configured to acquire the voice information, input the voice information into a second machine learning model, and input the output result of each network layer in the second machine learning model into a corresponding network layer in the target machine learning model, where the target machine learning model is used to identify the identity information of the target object and the voice content corresponding to the identity information, and the second machine learning model is a model that performs content recognition on the voice information of the at least one target object.
According to an optional embodiment of the present application, the second machine learning model may recognize a speech content of the speech information of the target object, and after the second machine learning model is used to recognize the speech information of the target object, a speech content recognition result is input to the target machine learning model for further processing.
It should be noted that, reference may be made to the description related to the embodiment shown in fig. 2 for a preferred implementation of the embodiment shown in fig. 9, and details are not repeated here.
Example 7
Fig. 10 is a flow chart of another speech recognition method according to an embodiment of the present application, as shown in fig. 10, the method includes the steps of:
step S1002, collecting voice information of at least one target object.
According to an alternative embodiment of the present application, the voice information in step S1002 is voice information corresponding to a short-time text-independent task. Voice recognition for short-time text-independent tasks is mainly applied to wake-up words of home appliances, such as "hello TV" and the like.
Step S1004, inputting the voice information of at least one target object into the second machine learning model, inputting the output result of the network layer in the second machine learning model into the first machine learning model, and inputting the output result of the network layer in the first machine learning model into the corresponding network layer in the target machine learning model for analysis, so as to obtain the identity information of the target object and the voice content corresponding to the identity information.
According to an alternative embodiment of the present application, the first machine learning model is a model for recognizing acoustic features of at least one target object; the second machine learning model is a model for performing content recognition on speech information of at least one target object.
In an alternative embodiment of the present application, the first machine learning model, the second machine learning model, and the target machine learning model are three independent but interconnected neural network models. In this embodiment, the collected voice information is first input to the voice content recognition model (the second machine learning model) for recognition; after the voice content recognition is completed, the result is passed to the acoustic feature recognition model (the first machine learning model) for recognition; the output result of the first machine learning model is then input to the corresponding network layer of the target machine learning model for analysis, finally obtaining the identity information of the target object and the voice content corresponding to the identity information.
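For illustration, a minimal sketch of this serial chaining follows, under the same assumptions as the earlier sketch (toy dimensions; single layers standing in for whole networks; all names are assumptions, not taken from the patent):

    import torch
    import torch.nn as nn

    dim = 64
    second_model = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # content recognition
    first_model = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())   # acoustic features
    target_layer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # one target-model layer
    identity_head = nn.Linear(dim, 8)    # identity information of the target object
    content_head = nn.Linear(dim, 100)   # voice content

    speech = torch.randn(1, dim)  # step S1002: collected voice information
    h = second_model(speech)      # the content recognition model runs first
    h = first_model(h)            # its network-layer output feeds the acoustic model
    h = target_layer(h)           # which feeds the target model's corresponding layer
    identity_logits, content_logits = identity_head(h), content_head(h)  # step S1004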
Step S1006, outputting the voice content.
It should be noted that, for a preferred implementation of the embodiment shown in fig. 10, reference may be made to the related description of the embodiment shown in fig. 2; details are not repeated here.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in whole or in part in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (16)

1. A speech recognition method, comprising:
collecting voice information of at least one target object;
inputting the voice information of the at least one target object into a first machine learning model and a second machine learning model, and inputting the output results of each network layer in the first machine learning model and the second machine learning model together into a corresponding network layer in a target machine learning model for analysis, to obtain the identity information of the target object and the voice content corresponding to the identity information;
and outputting the voice content.
2. The method of claim 1, wherein the target machine learning model is trained by:
acquiring a plurality of groups of training data for training the target machine learning model, wherein each group of training data comprises triplet information, and the triplet information comprises: different voice information of a first sample object, and voice information of a second sample object;
and inputting the plurality of groups of training data into the target machine learning model respectively for training until the prediction result of the target machine learning model meets a preset condition.
3. The method of claim 2, wherein inputting the plurality of groups of training data into the target machine learning model respectively for training comprises:
and when the prediction result does not meet the preset condition, adjusting the weights of the different voice information of the first sample object and the voice information of the second sample object until the prediction result of the target machine learning model meets the preset condition.
4. The method of claim 3, wherein adjusting the weights of the different voice information of the first sample object and the voice information of the second sample object comprises:
increasing the weight of the different voice information of the first sample object; and/or
reducing the weight of the voice information of the second sample object.
5. The method of claim 1, wherein the first machine learning model is a model that identifies acoustic features of the at least one target object; the second machine learning model is a model for performing content recognition on the voice information of the at least one target object.
6. The method of claim 2, wherein inputting the plurality of groups of training data into the target machine learning model respectively for training comprises:
and when the prediction result does not meet the preset condition, adjusting a loss function of the target machine learning model until a sample distance between the feature vectors of the different voice information of the first sample object is smaller than a sample distance between the feature vector of the voice information of the second sample object and the feature vector of specified voice information, wherein the specified voice information is any one of the different voice information of the first sample object.
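For illustration only (not part of the claims): claims 2 to 6 describe triplet-style training, in which the distance between feature vectors of two utterances of the first sample object must become smaller than the distance to the second sample object's utterance. A minimal sketch of one such margin loss follows; the margin value, feature size, and the hinge form are assumptions, since the patent does not fix the exact loss function.

    import torch
    import torch.nn.functional as F

    def triplet_margin_loss(anchor, positive, negative, margin=0.2):
        # anchor, positive: feature vectors of two different utterances of the
        # first sample object; negative: the second sample object's utterance.
        # The loss reaches zero once the same-speaker distance is smaller than
        # the cross-speaker distance by at least the margin.
        d_same = F.pairwise_distance(anchor, positive)
        d_diff = F.pairwise_distance(anchor, negative)
        return torch.clamp(d_same - d_diff + margin, min=0.0).mean()

    a, p, n = (torch.randn(4, 64) for _ in range(3))  # toy feature vectors
    loss = triplet_margin_loss(a, p, n)  # minimize until the condition above holds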
7. The method of claim 1, further comprising:
and verifying the identity information, and executing an operation corresponding to the voice content when the identity information passes the verification.
8. A speech recognition method, comprising:
receiving voice information of a target object;
inputting the voice information of the target object into a corresponding network layer of a target machine learning model for analysis to obtain identity information of the target object and voice content corresponding to the identity information, wherein the input of each network layer in the target machine learning model is the output result of each network layer in a first machine learning model and a second machine learning model;
and verifying the identity information, and executing an operation corresponding to the voice content when the identity information passes the verification.
9. The method of claim 8, wherein the first machine learning model is a model that identifies acoustic features of the target object; the second machine learning model is a model for performing content recognition on the voice information of the target object.
10. An intelligent home system is characterized by comprising at least one home appliance device and a control device, wherein,
the at least one home appliance device is used for acquiring voice information of a target object in the space where the at least one home appliance device is located, and for receiving a control instruction from the control device;
the control device is used for receiving the voice information, inputting the voice information of the target object into a first machine learning model and a second machine learning model, and inputting the output results of each network layer in the first machine learning model and the second machine learning model together into a corresponding network layer of a target machine learning model for analysis, to obtain the identity information of the target object and the voice content corresponding to the identity information; and for generating the control instruction based on the voice content and sending the control instruction to the at least one home appliance device.
11. A storage medium comprising a stored program, wherein the program, when executed, controls an apparatus in which the storage medium is located to perform the speech recognition method according to any one of claims 1 to 9.
12. A computing device, comprising:
a processor; and
a memory, coupled to the processor, for providing the processor with instructions to process the following steps: collecting voice information of at least one target object; inputting the voice information of the at least one target object into a first machine learning model and a second machine learning model, and inputting the output results of each network layer in the first machine learning model and the second machine learning model together into a corresponding network layer of a target machine learning model for analysis, to obtain the identity information of the target object and the voice content corresponding to the identity information; and outputting the voice content.
13. A conferencing device, comprising:
the system comprises at least one voice acquisition device, at least one processing device and a control device, wherein the voice acquisition device is used for acquiring voice information of at least one target object in a space where the at least one voice acquisition device is located;
the controller is used for acquiring the voice information, inputting the voice information into a first machine learning model, and inputting an output result of each network layer in the first machine learning model into a corresponding network layer in a target machine learning model, wherein the target machine learning model is used for identifying the identity information of the target object and the voice content corresponding to the identity information, and the first machine learning model is a model for identifying the acoustic characteristics of at least one target object.
14. A conferencing device, comprising:
the system comprises at least one voice acquisition device, at least one processing device and a control device, wherein the voice acquisition device is used for acquiring voice information of at least one target object in a space where the at least one voice acquisition device is located;
the controller is used for acquiring the voice information, inputting the voice information into a second machine learning model, and inputting an output result of each network layer in the second machine learning model into a corresponding network layer in a target machine learning model, wherein the target machine learning model is used for identifying the identity information of the target object and the voice content corresponding to the identity information, and the second machine learning model is a model for performing content identification on the voice information of at least one target object.
15. A speech recognition method, comprising:
collecting voice information of at least one target object;
inputting the voice information of the at least one target object into a second machine learning model, inputting the output result of a network layer in the second machine learning model into a first machine learning model, and inputting the output result of the network layer in the first machine learning model into a corresponding network layer in a target machine learning model for analysis to obtain the identity information of the target object and the voice content corresponding to the identity information;
and outputting the voice content.
16. The method of claim 15, wherein the first machine learning model is a model that identifies acoustic features of the at least one target object; the second machine learning model is a model for performing content recognition on the voice information of the at least one target object.
CN202010129820.6A 2020-02-28 2020-02-28 Speech recognition method, intelligent home system, conference equipment and computing equipment Active CN113409793B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010129820.6A CN113409793B (en) 2020-02-28 2020-02-28 Speech recognition method, intelligent home system, conference equipment and computing equipment

Publications (2)

Publication Number Publication Date
CN113409793A true CN113409793A (en) 2021-09-17
CN113409793B CN113409793B (en) 2024-05-17

Family

ID=77675609

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010129820.6A Active CN113409793B (en) 2020-02-28 2020-02-28 Speech recognition method, intelligent home system, conference equipment and computing equipment

Country Status (1)

Country Link
CN (1) CN113409793B (en)


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120029907A (en) * 2010-09-17 2012-03-27 김용제 English learning system and its process
CN107633842A (en) * 2017-06-12 2018-01-26 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
US20190005947A1 (en) * 2017-06-30 2019-01-03 Samsung Sds Co., Ltd. Speech recognition method and apparatus therefor
CN108320753A (en) * 2018-01-22 2018-07-24 珠海格力电器股份有限公司 Control method, the device and system of electrical equipment
CN110379411A (en) * 2018-04-11 2019-10-25 阿里巴巴集团控股有限公司 For the phoneme synthesizing method and device of target speaker
US20190341058A1 (en) * 2018-05-06 2019-11-07 Microsoft Technology Licensing, Llc Joint neural network for speaker recognition
WO2019214047A1 (en) * 2018-05-08 2019-11-14 平安科技(深圳)有限公司 Method and apparatus for establishing voice print model, computer device, and storage medium
CN108806696A (en) * 2018-05-08 2018-11-13 平安科技(深圳)有限公司 Establish method, apparatus, computer equipment and the storage medium of sound-groove model
WO2020029495A1 (en) * 2018-08-10 2020-02-13 珠海格力电器股份有限公司 Information pushing method and home appliance
CN110047462A (en) * 2019-01-31 2019-07-23 北京捷通华声科技股份有限公司 A kind of phoneme synthesizing method, device and electronic equipment
CN110047463A (en) * 2019-01-31 2019-07-23 北京捷通华声科技股份有限公司 A kind of phoneme synthesizing method, device and electronic equipment
CN110472008A (en) * 2019-07-04 2019-11-19 阿里巴巴集团控股有限公司 Intelligent interactive method and device
US20200005766A1 (en) * 2019-08-15 2020-01-02 Lg Electronics Inc. Deeplearning method for voice recognition model and voice recognition device based on artificial neural network
CN110648672A (en) * 2019-09-05 2020-01-03 深圳追一科技有限公司 Character image generation method, interaction method, device and terminal equipment

Also Published As

Publication number Publication date
CN113409793B (en) 2024-05-17

Similar Documents

Publication Publication Date Title
US10708423B2 (en) Method and apparatus for processing voice information to determine emotion based on volume and pacing of the voice
WO2017012496A1 (en) User voiceprint model construction method, apparatus, and system
US20170110125A1 (en) Method and apparatus for initiating an operation using voice data
CN102890776B (en) The method that expression figure explanation is transferred by facial expression
CN107886943A (en) A kind of method for recognizing sound-groove and device
CN109032039A (en) A kind of method and device of voice control
CN111145736B (en) Speech recognition method and related equipment
CN109065051B (en) Voice recognition processing method and device
CN111343028A (en) Distribution network control method and device
CN109462482B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and computer readable storage medium
CN106020449A (en) A virtual reality interaction method and device
CN110896352B (en) Identity recognition method, device and system
CN104751847A (en) Data acquisition method and system based on overprint recognition
CN111785291A (en) Voice separation method and voice separation device
CN110149618B (en) Intelligent device access method, device, equipment and medium based on voiceprint authorization
CN110491409B (en) Method and device for separating mixed voice signal, storage medium and electronic device
CN110489519B (en) Session method based on session prediction model and related products
CN112365895B (en) Audio processing method, device, computing equipment and storage medium
CN113409793B (en) Speech recognition method, intelligent home system, conference equipment and computing equipment
KR20180052858A (en) Intelligent doll and operating method thereof
CN108630208B (en) Server, voiceprint-based identity authentication method and storage medium
CN115497456A (en) Speech emotion recognition method and device for financial conversation scene and storage medium
CN112820302A (en) Voiceprint recognition method and device, electronic equipment and readable storage medium
CN115798520A (en) Voice detection method and device, electronic equipment and storage medium
CN106971735B (en) A kind of method and system regularly updating the Application on Voiceprint Recognition of training sentence in caching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant