CN111508482A

CN111508482A - Semantic understanding and voice interaction method, device, equipment and storage medium

Info

Publication number: CN111508482A
Application number: CN201910027477.1A
Authority: CN
Inventors: 徐嘉南
Original assignee: Alibaba Group Holding Ltd
Current assignee: Banma Zhixing Network Hongkong Co Ltd
Priority date: 2019-01-11
Filing date: 2019-01-11
Publication date: 2020-08-07

Abstract

The invention discloses a semantic understanding and voice interaction method, a semantic understanding and voice interaction device, semantic understanding and voice interaction equipment and a storage medium. The method comprises the steps of obtaining text information of voice input, obtaining scene information, and analyzing the intention of the voice input based on the text information and the scene information to obtain an intention recognition result. The intention of the voice input of the user is analyzed by combining the scene information, so that more accurate semantic understanding can be realized, and support is provided for providing more accurate voice interaction service for the user.

Description

Semantic understanding and voice interaction method, device, equipment and storage medium

Technical Field

The present invention relates to the field of voice interaction, and in particular, to a semantic understanding and voice interaction method, apparatus, device, and storage medium.

Background

The voice interaction belongs to the category of human-computer interaction, and is a leading-edge interaction mode developed by the human-computer interaction to the present. Voice interaction is the process by which a user gives instructions to a machine through natural language to achieve his or her own objectives. Semantic understanding is an important part of enabling voice interaction. The main work of semantic understanding is to understand what the user's voice input is intended to express, i.e., to recognize intention information of the user's voice input.

The conventional semantic understanding scheme mainly recognizes intention information of a user's voice input based on a fixed semantic recognition model or expert experience. But since the user's expression environment (living environment, time, space, interests, personal status, etc.) is constantly changing, text that is out of context is usually meaningless or meaningless, so that existing semantic understanding schemes can result in incorrect semantic understanding, or even an unnoticed or even completely unintelligible answer.

Accordingly, there is a need for an improved semantic understanding scheme to accurately understand the user's speech intent.

Disclosure of Invention

An object of the present invention is to provide a semantic understanding and voice interaction scheme capable of accurately understanding a user's voice intention.

According to a first aspect of the present invention, there is provided a semantic understanding method, comprising: acquiring text information input by voice; acquiring scene information; based on the text information and the scene information, the intention of the voice input is analyzed to obtain an intention recognition result.

Optionally, the context information comprises at least one of: time information; location information; ambient environment information; receiving state information of a device of voice input; receiving application information on a device for voice input; user information; context information.

Optionally, the step of analyzing the intent of the speech input comprises: analyzing the text information to obtain text semantic information; analyzing the scene information to obtain scene semantic information; and determining an intention recognition result based on the text semantic information and the scene semantic information.

Optionally, the step of determining the intention recognition result based on the text semantic information and the scene semantic information includes: comparing the text semantic information with the scene semantic information to determine an available part in the scene semantic information; based on the text semantic information and the available portions, an intent recognition result is determined.

Optionally, the step of determining the intention recognition result based on the text semantic information and the scene semantic information includes: comparing the field related to the text semantic information with the field related to the scene semantic information to determine an intention field for the voice input; and/or comparing the intention related to the text semantic information with the intention related to the scene semantic information to determine intention information for the voice input; and/or determining second slot information aimed at by the voice input based on the first slot information related to the text semantic information and the field information related to the scene semantic information, wherein the intention identification result comprises an intention field and/or intention information and/or the second slot information.

Optionally, the step of determining the intention recognition result based on the text semantic information and the scene semantic information further comprises: under the condition that the intention related to the text semantic information and the intention related to the scene semantic information are not matched, the step of comparing the field related to the text semantic information and the field related to the scene semantic information is executed again to re-determine the intention field for the voice input; and/or under the condition that the first slot position information related to the text semantic information and the field information related to the scene semantic information are not matched, the step of comparing the field related to the text semantic information and the field related to the scene semantic information is executed again so as to determine the intention field for the voice input again.

Optionally, the text semantic information includes a domain, and/or an intention, and/or first slot information; and/or the scene semantic information includes one or more scene states and state information corresponding to each scene state.

Optionally, the scene state comprises at least one of: a navigation state; a parking state; a cruising state; an entertainment state; a commute state; a point of interest status.

Optionally, the method further comprises: and integrating information of the intention recognition result, and sending the integrated data to the server side so that the server side performs interaction on the integrated data.

According to the second aspect of the invention, a semantic understanding method applied to the vehicle-mounted system is further provided, and the semantic understanding method comprises the following steps: acquiring text information of voice input by a vehicle-mounted user; acquiring vehicle-mounted scene information; and analyzing the intention of the voice input by the vehicle-mounted user based on the text information and the vehicle-mounted scene information to obtain an intention recognition result.

Optionally, the in-vehicle scene information includes at least one of: time information; spatial information; vehicle system status information; vehicle-mounted application information; personal information of the vehicle-mounted user; context information; ambient environment information.

According to a third aspect of the present invention, there is also provided a voice interaction method, including: receiving a voice input; obtaining an intention recognition result using a semantic understanding method according to the first aspect or the second aspect of the present invention; and executing corresponding operation according to the intention recognition result.

According to a fourth aspect of the present invention, there is also provided a semantic understanding apparatus, including: the first acquisition module is used for acquiring text information input by voice; the second acquisition module is used for acquiring scene information; and the intention analysis module is used for analyzing the intention of the voice input based on the text information and the scene information to obtain an intention recognition result.

According to the fifth aspect of the present invention, there is also provided a semantic understanding apparatus applied to an in-vehicle system, including: the first acquisition module is used for acquiring text information of voice input by a vehicle-mounted user; the second acquisition module is used for acquiring vehicle-mounted scene information; and the intention analysis module is used for analyzing the intention of the voice input by the vehicle-mounted user based on the text information and the vehicle-mounted scene information to obtain an intention recognition result.

According to a sixth aspect of the present invention, there is also provided a voice interaction apparatus, including: a receiving module for receiving a voice input; a semantic understanding module, configured to obtain an intention recognition result by using the semantic understanding method according to the first aspect or the second aspect of the present invention; and the execution module is used for executing corresponding operation according to the intention recognition result.

According to a seventh aspect of the present invention, there is also provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as set forth in any one of the first to third aspects of the invention.

According to an eighth aspect of the present invention, there is also provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform a method as recited in any one of the first to third aspects of the present invention.

The invention analyzes the intention of the voice input of the user by combining the scene information, can realize more accurate semantic understanding, and thus provides support for providing more accurate voice interaction service for the user.

Drawings

The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.

Fig. 1 is a schematic diagram showing an application scenario of the present invention.

FIG. 2 is a schematic flow chart diagram illustrating a semantic understanding method according to an embodiment of the present invention.

Fig. 3 is a schematic flowchart illustrating a semantic understanding method applied to an in-vehicle system according to an embodiment of the present invention.

Fig. 4 is a schematic block diagram showing the structure of a semantic understanding apparatus according to an embodiment of the present invention.

Fig. 5 is a schematic block diagram showing the structure of a voice interaction apparatus according to an embodiment of the present invention.

FIG. 6 is a schematic structural diagram of a computing device that can be used to implement the semantic understanding and voice interaction method according to an embodiment of the present invention.

Detailed Description

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

[ term interpretation ]

Semantics: the meanings of concepts represented by objects in the real world to which data corresponds, and the relationships between these meanings, are explanations and logical representations of data in a certain field.

Semantic understanding: and a process of converting the meaning of the concept represented by the object in the real world corresponding to the data into a computer-understandable mark and relationship. In the present invention, semantic understanding refers to a process of analyzing the intention of a user's voice input.

Domain (domain): the domain refers to the same type of data or resources, and services provided around these data or resources, such as weather, music, and the like. In the present invention, a domain (i.e., an intention domain referred to by the present invention) may be used to characterize an intention category or an intention range of a user. Taking the application of the invention to vehicle-mounted scenes as an example, the fields can be the navigation field, the music field and the like.

Intent (intent): intent refers to operations on domain data, typically named in verb phrases, such as asking for weather, looking for music. The intent may relate to one or more slots, such as the intent "find music," slots relating to singers, song titles, and the like.

Slot (slot): for storing attributes of the fields such as the date of the weather field, weather, singer of the music field, song title, album, etc.

Slot position information: attribute information corresponding to the slot position. For example, for the slot of "singer", the slot information may be the name of singer such as Zhang schoite, Wangfei, Lijian, etc. In the present invention, the slot information may be used to further supplement and limit the intention, for example, when the intention is "search music", the intention may be supplemented according to the slot information such as singer name (royal phenanthrene), song name (but volunteer has a long time), and the like, for example, the intention may be supplemented as "search royal phenanthrene singing but volunteer has a long time".

And (3) voice recognition: the process of converting speech signals into text is a precondition for semantic understanding.

L BS services and information based on geographical location.

[ scheme overview ]

In order to accurately identify the intention of the voice input of the user, the invention provides a semantic understanding scheme, and the intention of the voice input of the user is identified by combining scene information so as to obtain an intention identification result capable of accurately representing the real intention of the user.

Context information referred to herein may be information of various dimensions, such as may include, but is not limited to, time information, location information, ambient environment information, state information of a device receiving voice input, application information on a device receiving voice input, user information, context information, and the like.

The semantic understanding scheme of the invention can be applied to various voice interaction scenes. For example, the method can be applied to a vehicle-mounted voice interaction scene, and can also be applied to an outdoor sport scene, a home scene and the like.

Taking the application to the car-mounted voice interaction scenario as an example, as shown in fig. 1, the user says "how long there is". Analyzing this sentence alone, it is difficult to understand the true intention of the user. The invention can accurately identify the intention of the user by utilizing the vehicle-mounted scene information. For example, if the in-vehicle scene information indicates that the current scene state is the navigation state, the user may want to ask "how long it is for driving to the destination", and if the in-vehicle scene information indicates that the current scene state is the listening state for the audio program, the user may want to ask "how long it is for the audio program".

In particular, if it is recognized that the scene state in which the user is located is a navigation state, the user's intention may be complemented with more refined navigation state information. The following navigation state information can be acquired: 1) the current driving destination information can be acquired through map navigation application in a navigation mode, and the destination can be predicted based on the driving habits of a user in a non-navigation mode; 2) position information of the current vehicle (real-time positioning sensor information). Based on the navigation state information, for the received voice input of "how long there is" of the user, the intention of the user can be complemented as follows: "how long it takes to get from the current location (e.g., five mouths) to the destination (e.g., company").

If it is recognized that the scene state where the user is located is a sound program state, the intention of the user can be complemented with more detailed sound program state information. The following audio program status information can be obtained: 1) currently played song information, such as program name, total program duration and other information; 2) and if the current playing state information is played, the current playing progress information is obtained. Based on the above-mentioned audio program status information, for the received voice input of "how long there is still" of the user, the intention of the user can be complemented is: "how long a sound program (like the sound of the opposite of the guo de) ends".

Therefore, the analysis is performed on the intention of the voice input of the user by combining the scene information, the scene information can be used as a decision basis for understanding the intention of the voice input of the user, and the method can be used for filling and/or adjusting the intention represented by the text information of the voice input so as to realize more accurate semantic understanding, thereby providing support for providing more accurate voice interaction service for the user.

The aspects of the invention are further described below.

[ semantic understanding method ]

Referring to fig. 2, in step S210, text information of a voice input is acquired.

The text information of the voice input is also the text recognition result obtained by performing voice recognition on the received voice input of the user. After receiving the voice input of the user, the text information of the voice input can be acquired through a voice recognition technology. The present invention is not described in detail with respect to the principles of speech recognition technology.

In step S220, scene information is acquired.

The scene information may include various information that helps to understand the user's real intention. In the present invention, scene information can be obtained by acquiring all information that is helpful for understanding the true intention of the user. For example, the obtained context information may include, but is not limited to, time information, location information, ambient environment information, state information of a device receiving the voice input, application information on the device receiving the voice input, user information, context information, and the like.

The application scenes of the invention are different, and the specific contents of the scene information are different. Taking the application of the present invention to a vehicle-mounted scene as an example, the time information mentioned above may include, but is not limited to, information related to time such as a current date, a holiday, a limit number, etc.; the position information mentioned above may include, but is not limited to, current position information of the vehicle, destination information in a navigation mode, destination information based on prediction in a non-navigation mode, and the like; the above-mentioned ambient information may include, but is not limited to, road condition information; the status information of the above-mentioned device receiving the voice input may include, but is not limited to, vehicle system status information, such as may include all active application information, application status (e.g., on/off status), and the like; the above-mentioned application information on the device receiving the voice input may include, but is not limited to, a frequency of use of all applications installed on the device, a current running application state, application data, and the like; the user information mentioned above may include, but is not limited to, vehicle driver personal information, such as may include, but is not limited to, driving habits, application usage habits, travel habits, and the like; the context information mentioned above may refer to the user's previous speech input.

In step S230, the intention of the voice input is analyzed based on the text information and the scene information to obtain an intention recognition result.

The intent recognition result may characterize the intent of the user's speech input. The intention of the voice input is analyzed by combining the text information and the scene information, so that the finally obtained intention identification result can accurately represent the real intention of the user. As described above in conjunction with the description of fig. 1, in the case where the text information does not clarify the intention of the user, the intention of the user may be determined using the scene information and complemented to obtain a complete intention recognition result.

After the intention recognition result is obtained, information integration can be performed on the intention recognition result, and the integrated data is sent to the server side so that interaction can be performed by the server side based on the integrated data. For example, the integrated information can be transmitted to the server in an agreed information format. The information integration of the intention recognition result may be to integrate the intention recognition result into a complete intention statement, so that the server performs corresponding interactive operation according to the statement.

While one possible implementation of analyzing the intent of the speech input based on the text information and the context information is described below, it should be understood that the intent analysis may be performed in other ways after the text information and the context information are obtained.

As an example, after obtaining the scene information and the text information of the voice input, the text information and the scene information may be parsed, respectively, to obtain text semantic information and scene semantic information, and then the intention recognition result may be determined based on the text semantic information and the scene semantic information.

The text semantic information may be regarded as a semantic recognition result of the text information of the voice input. Alternatively, the text semantic information may be text intention information obtained by parsing the text information, such as may include an intention field and/or intention and/or slot information (for convenience of distinction, may be referred to as "first slot information"). The process of parsing the text information may refer to an existing semantic understanding manner, and is not described herein again. For example, a pre-trained semantic understanding model may be called to perform intent classification on the text information of the speech input to obtain text intent information.

In the case that the text semantic information is text intention information obtained by parsing the text information, the obtained text intention information may be intention information capable of representing a complete intention of the user, for example, the text intention information may be text intention information composed of a field, an intention, and first slot information. In addition, the obtained text intention information can also be intention information which can only represent partial intention of the user, for example, the text intention information can comprise intention information which is obtained by identifying one intention action but does not relate to the field and/or the aimed object (namely the first slot position information). For example, for the text information of "how long there is, the recognized text semantic information is text intention information indicating ambiguity, that is, the related field and the targeted object are not unambiguous.

The scene semantic information can be regarded as a semantic recognition result of the scene information. Optionally, the scene semantic information may include one or more scene states and state information thereof. The different scene states may be regarded as different fields, and the state information corresponding to each scene state may be regarded as the scene information in the field corresponding to the scene state.

Taking the application of the present invention to an onboard scene as an example, the scene state may include, but is not limited to, a navigation state, a parking state, a cruise state, an entertainment state, a commute state, and a point of interest state. The navigation state may refer to a state in which the navigation mode is started, the parking state may refer to a state in which the vehicle stops running, the cruising state may refer to a free driving state in the non-navigation mode, the entertainment state may refer to a state in which an entertainment application such as a vehicle-mounted video is started, the commuting state may refer to a state in which the vehicle travels from home to a work place, and the point-of-interest state may refer to a state in which a user opens a favorite application (or opens an application with a high frequency of use).

A plurality of scene states may be preset, and each scene state may correspond to a preset determination condition. The scene information may be compared with a preset scene state, and when information satisfying a certain scene state exists in the scene information, the current scene may be identified as the scene state, and the information satisfying the scene state may be identified as state information in the scene state. Optionally, a model supporting identification of several scene states may be created in advance through expert knowledge rules and/or a machine learning manner, and used to determine the scene state corresponding to the scene information. Taking the application of the invention to the vehicle-mounted scene as an example, a state identification model supporting identification of a plurality of scene states such as a navigation state, a parking state, a cruising state, an entertainment state, a commuting state, an interest point state and the like can be created for identifying the scene state corresponding to the scene information.

In parsing the scene information, information satisfying a predetermined scene state among the scene information may be identified as state information in the scene state. As such, the scene information may be divided into one or more scene states and state information thereof. Optionally, different scene states and/or state information may be assigned different weight values, so that intent decisions may be made subsequently with reference to the weight values.

In an embodiment of the invention, after the text semantic information and the scene semantic information are obtained, the text semantic information and the scene semantic information may be compared to determine an available part in the scene semantic information, and then the intention identification result is determined based on the text semantic information and the available part. The usable portion refers to a portion of the scene semantic information that is helpful in determining the user's intention, i.e., a portion that is helpful in determining the final intention recognition result. The available part can be used for filling the intention of the user and can also be used for adjusting the intention recognition result obtained based on the text semantic information.

For example, in the case where the content of text information input by the user's voice is small, the intention of the user cannot be clarified only from the text semantic information, and in this case, the intention of the user may be supplemented based on an available portion in the scene semantic information to obtain an intention recognition result that can clearly characterize the complete intention of the user. For a specific embodiment, see fig. 1, which is not described herein again.

For another example, the text semantic information may also be regarded as a preliminary intention recognition result, and the intention recognition result may be checked according to the scene semantic information to determine whether the intention recognition result can represent the real intention of the user. If the intention identification result obtained based on the text semantic information is matched with the current scene semantic information, under the condition of no match, the intention identification result obtained based on the text semantic information can not represent the real intention of the user, and at the moment, intention adjustment can be carried out based on the scene semantic information to obtain the intention identification result capable of representing the real intention of the user. If it is determined that the user's intention is "open cheerful singing" type of music "based on the text semantic information, but the scene semantic information (such as driving habit information) indicates that the user often goes (or has previously gone) to a KTV named cheerful singing, the previously determined intention recognition result may be adjusted to" open navigation application to navigate with the KTV named cheerful singing as the destination ". Therefore, the final intention recognition result can accurately represent the real intention of the user.

As an example, the scene semantic information may include one or more scene states and state information thereof, and when comparing the text semantic information with the scene semantic information, the scene state related to the scene semantic information may be compared with the text semantic information, a scene state with a matching degree higher than a predetermined threshold (or with the highest matching degree) with the text semantic information is found, and then the found scene state and the corresponding state information thereof are used as the available part. Alternatively, different scene states may correspond to different weight values. When comparing the text semantic information and the scene semantic information, the weight value of the scene state can be referred to. The specific implementation process is not described herein again.

In determining the intention recognition result based on the text semantic information and the available portion, the text semantic information and the available portion may be considered together to determine a final intention recognition result, e.g., an intention represented by the text semantic information may be filled or adjusted based on the available portion to obtain the final intention recognition result. For example, in the case where the intention characterized by the text semantic information is ambiguous, the intention of the user may be supplemented based on the available parts to obtain a clear and complete intention recognition result. For another example, in a case where the intention recognition result determined based on the text semantic information contradicts the available part, the intention recognition result determined based on the text semantic information may be adjusted based on the available part so that the finally obtained intention recognition result more conforms to the real intention of the user.

In another embodiment of the present invention, after the text semantic information and the scene semantic information are obtained, a domain decision, an intention decision, and/or a slot position information decision may be performed based on the text semantic information and the scene semantic information to obtain a final intention recognition result. Wherein the final intention recognition result may include an intention field, and/or intention information, and/or slot information (which may be referred to as "second slot information" for convenience of distinction).

1. Domain decision making

The domain to which the text semantic information relates and the domain to which the scene semantic information relates may be compared to determine an intended domain for which the speech input is directed. For example, the scene semantic information may include one or more scene states, and the field related to the text semantic information and the scene state corresponding to the scene semantic information may be compared to find the scene state with a matching degree higher than a predetermined threshold (or with the highest matching degree) as the intended field for the voice input.

Alternatively, when the field to which the text semantic information relates is empty, a field to which a scene state with the highest weight value is directed may be selected from scene states corresponding to the scene semantic information as an intention field to which the voice input is directed.

Under the condition that the field related to the text semantic information is not empty, the field related to the text semantic information can be screened according to the scene state corresponding to the scene semantic information so as to determine the intention field aimed at by the voice input. For example, when the field to which the text semantic information relates includes a navigation field and a vocal program field, when the current scene state is the navigation state, it may be determined that the intended field to which the voice input is directed is the navigation field, and when the current state is the vocal program field, it may be determined that the intended field to which the voice input is directed is the vocal program field.

2. Intent decision

Comparing the intention related to the text semantic information with the intention related to the scene semantic information to determine intention information for the voice input. The intent information may be used to characterize the intended operation under the determined intent field, e.g., for a user input of "how long there is" in case the determined intent field is a voiced program field, it may be determined from information related to the voiced program in the scene semantic information, such as the vocal sounds of the guo, whether the intent information for the speech input is "how long or vocal in the guo.

Optionally, the intention related to the text semantic information and information corresponding to the determined intention field in the scene semantic information may be compared to decide the intention information. In the case that the decided intention information is disputed, for example, when the intention related to the text semantic information and the intention related to the scene semantic information are not matched (or the matching degree is based on the confidence threshold and the non-confidence threshold), the domain decision may be performed again, that is, the step of comparing the domain related to the text semantic information and the domain related to the scene semantic information may be performed again to determine the intention domain to which the voice input is directed again.

3. Slot information decision

Second slot information for the voice input may be determined based on first slot information to which the text semantic information relates and field information to which the scene semantic information relates.

As described above, the slot information may be used to supplement the intent, and for the decided intent information, when there is less first slot information related to the text semantic information and the intent information needs to be further supplemented, the slot information may be supplemented based on the field information related to the scene semantic information to obtain the second slot information targeted by the voice input. Wherein the second slot information may include the first slot information and field information related to the accepted scene semantic information.

For example, for the voice input of "how long there is", in the case where it is determined that the intention field is the vocal program field, and the intention information is "how long there is an end of the vocal program", the slots of the vocal program, the progress, and the like may be filled based on the fields of the currently played program information (such as the guo germane phase sound), the progress (having heard 5 minutes and 35 seconds), and the like in the scene semantic information to obtain the second slot information including "the guo germane phase sound", "having heard 5 minutes and 35 seconds".

In the case that the first slot information related to the text semantic information and the field information related to the scene semantic information are not matched, the step of comparing the field related to the text semantic information and the field related to the scene semantic information may be performed again to redetermine the intended field for the voice input.

Alternatively, in the case that the decided intention information is disputed, for example, in the case that the intention related to the text semantic information and the intention related to the scene semantic information do not match (or the matching degree is based on the confidence threshold and the non-confidence threshold), the slot information to which the voice input is directed may be determined based on the slot related to the text semantic information and the field information related to the scene semantic information. And under the condition that the slot position related to the text semantic information and the field information related to the scene semantic information are not matched, the step of comparing the field related to the text semantic information and the field related to the scene semantic information is executed again so as to determine the intention field for the voice input again.

Thus, the final intent recognition result may include an intent field, intent information, and slot information.

[ application example ]

The semantic understanding method can be suitable for vehicle-mounted scenes, motion scenes (such as outdoor motion scenes), home scenes and the like.

Fig. 3 is a schematic flowchart illustrating a semantic understanding method applied to an in-vehicle system according to an embodiment of the present invention. Among other things, the method shown in fig. 3 may be performed by an in-vehicle system, such as the information acquisition module 310, the intention understanding module 320, and the intention output module 330 in the in-vehicle system. Details related to fig. 3 can be found in the description above in conjunction with fig. 2, and are not repeated here.

Referring to fig. 3, in step S311, text information of a voice input by the in-vehicle user is acquired.

The information acquisition module 310 may include a text information acquisition module and a scene information acquisition module. The text information of the voice input by the vehicle-mounted user can be acquired by the text information acquisition module.

The text information of the voice input by the vehicle-mounted user is also the text recognition result obtained by performing voice recognition on the received voice input of the user. After receiving the voice input of the vehicle-mounted user, the text information of the voice input can be acquired through a voice recognition technology. The present invention is not described in detail with respect to the principles of speech recognition technology.

In step S313, the in-vehicle scene information may be acquired by the scene information acquisition module in the information acquisition module 310.

The scene information acquisition module can be arranged at a vehicle-mounted system terminal and can be used for acquiring current vehicle-mounted scene information in real time before and after voice awakening. The in-vehicle context information may include, but is not limited to, temporal information, spatial information, in-vehicle system status information, in-vehicle application information, personal information of in-vehicle users, context information, and ambient environment information. The time information may include, but is not limited to, time related information such as current date, holidays, limit numbers, etc.; spatial information may include, but is not limited to, current driving location, predicted destination information; the in-vehicle system status information may include, but is not limited to, all active application information, status, etc.; the in-vehicle application state information may include, but is not limited to, frequency of use of the application, current application state, application data, and the like; the driver personal information may include, but is not limited to, driving habits, application usage habits, travel habits, and the like.

In step S321, semantic understanding and conversion into intention information may be started by the intention understanding module 320.

As shown in fig. 3, a pre-trained semantic understanding model may be called to perform semantic understanding on the text information to obtain text semantic information. The text semantic information may be regarded as a semantic recognition result of the text information of the voice input. Alternatively, the text semantic information may be text intention information obtained by parsing the text information, for example, the obtained text information may be subjected to intention classification, where the semantic understanding model may support classification of intention categories such as a navigation field, a music field, a vocal field, and system control, so that text intention information under a specific intention category may be obtained.

The scene information understanding model can be called to analyze the scene information so as to obtain the scene semantic information. For example, the scene information understanding model may be invoked to perform state classification on the acquired scene information, and acquire a corresponding state sub-item (i.e., the state information mentioned above) of each scene state.

The scene information understanding model is mainly used for completing state understanding based on scene information, and a model supporting recognition of a plurality of scene states can be created through expert knowledge rules and a machine learning mode, for example, a state recognition model supporting a plurality of scene states such as a navigation state, a parking state, a cruise state, an entertainment state, a commute state and an interest point state can be created and used for recognizing the scene state corresponding to the scene information. The current scene state is identified as a preset state when one or more pieces of scene information meet the model rule. Alternatively, each state and state information may be assigned a different weight value, respectively, the different weight values having different importance in the intent decision, and the design of the weight values may be defined by expert experience.

Starting the arbitration processing of semantic understanding, and comprehensively matching and comparing the acquired text intention and the scene intention. And determining whether to utilize the scene intention information by referring to a comparison threshold preset by the expert, and finally generating an intention understanding result, wherein the finally generated intention understanding result can comprise an intention field, an intention, slot position information and accepted scene information.

In the present invention, the intention understanding module 320 may be configured to integrate all understood information (such as the above-mentioned text semantic information and scene semantic information), and determine the final intention by comparing the text semantic information and the scene semantic information.

For example, the final field information may be decided by comparing the field of the text semantic information with the field of the scene information, and then the intention of the text semantic information with the scene semantic information may be compared to decide the intention information, and when the intention information is disputed (between the credible threshold and the incredible threshold), the slot information and the scene field information parsed out by the text semantic may be referred to for decision making. If there is still a dispute, further domain information re-decisions may be made. Finally, the intention field, intention information and slot position information for output are generated.

In step S331, the intention recognition result may be output by the intention output module 330.

The intention output module 330 may perform information integration and transmit the information to the backward server in the agreed information format.

According to the invention, the vehicle-mounted scene information (such as space information, time information, vehicle-mounted system state information, vehicle-mounted application state information and driver personal information) is utilized to help semantic understanding, the semantic understanding and the real-time scene state are taken as strongly-correlated understanding bases, the semantic understanding accuracy in the automobile field and the trip field can be effectively improved, and the voice use experience in the vehicle-mounted scene is improved.

[ VOICE INTERACTION METHOD ]

The present invention further provides a voice interaction method, which can receive a voice input, then obtain an intention recognition result of the voice input by using the above-mentioned semantic understanding method (such as the semantic understanding method described in fig. 2 and fig. 3), and execute a corresponding operation according to the intention recognition result. Such as may instruct a particular application to perform the corresponding operation. For the implementation flow of the semantic understanding method, reference may be made to the description above in conjunction with fig. 1 to fig. 3, which is not described herein again.

The voice interaction method can be applied to various application scenes such as vehicle-mounted scenes, motion scenes, home scenes and the like, and is applied to the vehicle-mounted scenes as an example, the driving state of a vehicle, L BS positioning information, driving preference of a vehicle owner and environmental information are integrated in the voice interaction process to serve as training parameters and decision bases for semantic understanding, so that more accurate semantic understanding can be achieved, and more accurate service information can be provided.

[ semanteme understanding device ]

Fig. 4 is a schematic block diagram showing the structure of a semantic understanding apparatus according to an embodiment of the present invention. Wherein the functional blocks of the semantic understanding apparatus may be implemented by hardware, software, or a combination of hardware and software which embody the principles of the present invention. It will be appreciated by those skilled in the art that the functional blocks described in fig. 4 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.

The functional modules that the semantic understanding apparatus may have and the operations that each functional module may perform are briefly described, and for the details related thereto, reference may be made to the above-mentioned related description, which is not repeated herein.

Referring to fig. 4, the semantic understanding apparatus 400 includes a first obtaining module 410, a second obtaining module 420, and an intention analyzing module 430.

The first obtaining module 410 is used for obtaining text information of the voice input. The second obtaining module 420 is configured to obtain scene information, and the scene information may refer to the above related description, which is not described herein again. The intention analysis module 430 is used for analyzing the intention of the voice input based on the text information and the scene information to obtain an intention recognition result.

The intent analysis module 430 can include a first parsing module, a second parsing module, and an intent recognition module. The first analysis module is used for analyzing the text information to obtain text semantic information; the second analysis module is used for analyzing the scene information to obtain scene semantic information; the intention recognition module is used for determining an intention recognition result based on the text semantic information and the scene semantic information.

In one embodiment of the invention, the intention recognition module may compare the text semantic information and the scene semantic information to determine an available portion of the scene semantic information, and determine the intention recognition result based on the text semantic information and the available portion.

In another embodiment of the invention, the intention recognition module may compare the field to which the text semantic information relates and the field to which the scene semantic information relates to determine an intention field for which the voice input is directed; and/or comparing the intention related to the text semantic information with the intention related to the scene semantic information to determine intention information for the voice input; and/or determining second slot information aimed at by the voice input based on the first slot information related to the text semantic information and the field information related to the scene semantic information, wherein the intention identification result comprises an intention field and/or intention information and/or the second slot information.

Optionally, in a case that the intention related to the text semantic information and the intention related to the scene semantic information do not match, the step of comparing the domain related to the text semantic information and the domain related to the scene semantic information may be performed again to re-determine the intention domain to which the voice input is directed.

In an embodiment of the present invention, the semantic understanding apparatus 400 may further include an integration module, configured to perform information integration on the intention recognition result, and send the integrated data to the server, so that the server performs interaction on the integrated data.

The semantic understanding apparatus 400 of the present invention may be implemented as a semantic understanding apparatus applied to an in-vehicle system. The first obtaining module 410 may be configured to obtain text information of a voice input by the vehicle-mounted user. The second obtaining module 420 may be configured to obtain the in-vehicle scene information. The intention analysis module 430 may be used to analyze the intention of the voice input by the in-vehicle user based on the text information and the in-vehicle scene information to obtain an intention recognition result.

It should be understood that the specific implementation manner of the semantic understanding apparatus according to the exemplary embodiment of the present invention may be implemented with reference to the related specific implementation manner described in conjunction with fig. 1 to 3, and will not be described in detail herein.

[ VOICE INTERACTION APPARATUS ]

Fig. 5 is a schematic block diagram showing the structure of a voice interaction apparatus according to an embodiment of the present invention. Wherein the functional blocks of the voice interaction apparatus can be implemented by hardware, software or a combination of hardware and software which implement the principles of the present invention. It will be appreciated by those skilled in the art that the functional blocks described in fig. 5 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.

The functional modules that the voice interaction apparatus may have and the operations that each functional module may perform are briefly described, and for the details related thereto, reference may be made to the above-mentioned related description, which is not repeated herein.

Referring to fig. 5, the voice interaction apparatus 500 includes a receiving module 510, a semantic understanding module 520, and an executing module 530.

The receiving module 510 is used for receiving a voice input, the semantic understanding module 520 is used for obtaining an intention recognition result by using the semantic understanding method mentioned in the present invention, and the executing module 530 is used for executing a corresponding operation according to the intention recognition result. The semantic understanding module 520 may have the same functional modules as the semantic understanding apparatus shown in fig. 4, and details of operations that the semantic understanding module 520 may perform may refer to the above description, which is not described herein again.

[ calculating device ]

Referring to fig. 6, computing device 600 includes memory 610 and processor 620.

The processor 620 may be a multi-core processor or may include a plurality of processors. In some embodiments, processor 620 may include a general-purpose host processor and one or more special coprocessors such as a Graphics Processor (GPU), a Digital Signal Processor (DSP), or the like. In some embodiments, processor 620 may be implemented using custom circuits, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).

The memory 610 may include various types of storage units, such as system memory, Read Only Memory (ROM), and permanent storage. Wherein the ROM may store static data or instructions that are required by the processor 620 or other modules of the computer. The persistent storage device may be a read-write storage device. The persistent storage may be a non-volatile storage device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, the persistent storage device employs a mass storage device (e.g., magnetic or optical disk, flash memory) as the persistent storage device. In other embodiments, the permanent storage may be a removable storage device (e.g., floppy disk, optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as a dynamic random access memory. The system memory may store instructions and data that some or all of the processors require at runtime. In addition, the memory 610 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), magnetic and/or optical disks, may also be employed. In some embodiments, memory 610 may include a removable storage device that is readable and/or writable, such as a Compact Disc (CD), a digital versatile disc read only (e.g., DVD-ROM, dual layer DVD-ROM), a Blu-ray disc read only, an ultra-dense disc, a flash memory card (e.g., SD card, min SD card, Micro-SD card, etc.), a magnetic floppy disk, or the like. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.

The memory 610 has stored thereon executable code that, when processed by the processor 620, can cause the processor 620 to perform the semantic understanding method or the voice interaction method described above.

The semantic understanding and voice interaction method, apparatus and computing device according to the present invention have been described in detail above with reference to the accompanying drawings.

Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.

Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A method of semantic understanding, comprising:

acquiring text information input by voice;

acquiring scene information;

analyzing the intention of the voice input based on the text information and the scene information to obtain an intention recognition result.

2. The semantic understanding method according to claim 1, wherein the scene information includes at least one of:

time information;

location information;

ambient environment information;

receiving state information of the voice input device;

receiving application information on the device of the voice input;

user information;

context information.

3. The semantic understanding method according to claim 1, wherein the step of analyzing the intention of the voice input comprises:

analyzing the text information to obtain text semantic information;

analyzing the scene information to obtain scene semantic information;

determining the intention recognition result based on the text semantic information and the scene semantic information.

4. The semantic understanding method according to claim 3, wherein the step of determining the intention recognition result based on the text semantic information and the scene semantic information includes:

comparing the text semantic information with the scene semantic information to determine an available part in the scene semantic information;

determining the intent recognition result based on the text semantic information and the available portion.

5. The semantic understanding method according to claim 3, wherein the step of determining the intention recognition result based on the text semantic information and the scene semantic information includes:

comparing the field related to the text semantic information with the field related to the scene semantic information to determine an intention field for the voice input; and/or

Comparing the intention related to the text semantic information with the intention related to the scene semantic information to determine intention information aimed at by the voice input; and/or

Determining second slot information aimed at by the voice input based on first slot information related to the text semantic information and field information related to the scene semantic information, wherein the intention identification result comprises the intention field and/or the intention information and/or the second slot information.

6. The semantic understanding method according to claim 5, wherein the step of determining the intention recognition result based on the text semantic information and the scene semantic information further comprises:

when the intention related to the text semantic information and the intention related to the scene semantic information do not match, re-performing the step of comparing the field related to the text semantic information and the field related to the scene semantic information to re-determine the intention field for the voice input; and/or

And under the condition that the first slot position information related to the text semantic information and the field information related to the scene semantic information are not matched, re-performing the step of comparing the field related to the text semantic information and the field related to the scene semantic information so as to re-determine the intention field aimed at by the voice input.

7. The semantic understanding method according to claim 3, wherein the text semantic information includes a domain, and/or an intention, and/or first slot information; and/or

The scene semantic information includes one or more scene states and state information corresponding to each of the scene states.

8. The semantic understanding method according to claim 7, wherein the scene state comprises at least one of:

a navigation state;

a parking state;

a cruising state;

an entertainment state;

a commute state;

a point of interest status.

9. The semantic understanding method according to claim 1, further comprising:

and integrating the information of the intention recognition result, and sending the integrated data to a server side so that the server side performs interaction based on the integrated data.

10. A semantic understanding method applied to a vehicle-mounted system is characterized by comprising the following steps:

acquiring text information of voice input by a vehicle-mounted user;

acquiring vehicle-mounted scene information;

and analyzing the intention of the voice input by the vehicle-mounted user based on the text information and the vehicle-mounted scene information to obtain an intention recognition result.

11. The semantic understanding method according to claim 10, wherein the in-vehicle scene information includes at least one of:

time information;

spatial information;

vehicle system status information;

vehicle-mounted application information;

personal information of the vehicle-mounted user;

context information;

ambient environment information.

12. A method of voice interaction, comprising:

receiving a voice input;

obtaining an intention recognition result using the semantic understanding method according to any one of claims 1-11;

and executing corresponding operation according to the intention recognition result.

13. A semantic understanding apparatus, comprising:

the first acquisition module is used for acquiring text information input by voice;

the second acquisition module is used for acquiring scene information;

and the intention analysis module is used for analyzing the intention of the voice input based on the text information and the scene information to obtain an intention recognition result.

14. A semantic understanding device applied to an on-board system is characterized by comprising:

the first acquisition module is used for acquiring text information of voice input by a vehicle-mounted user;

the second acquisition module is used for acquiring vehicle-mounted scene information;

and the intention analysis module is used for analyzing the intention of the voice input by the vehicle-mounted user based on the text information and the vehicle-mounted scene information so as to obtain an intention recognition result.

15. A voice interaction apparatus, comprising:

a receiving module for receiving a voice input;

a semantic understanding module for obtaining an intention recognition result using the semantic understanding method according to any one of claims 1-11;

and the execution module is used for executing corresponding operation according to the intention recognition result.

16. A computing device, comprising:

a processor; and

a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1 to 12.

17. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-12.