WO2019169536A1 - Speech recognition method for electronic device, and electronic device - Google Patents

Speech recognition method for electronic device, and electronic device

Info

Publication number
WO2019169536A1
WO2019169536A1 (application PCT/CN2018/078056)
Authority
WO
WIPO (PCT)
Prior art keywords
domain
text
sub
classifier
identification
Prior art date
Application number
PCT/CN2018/078056
Other languages
English (en)
French (fr)
Inventor
隋志成
李艳明
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to PCT/CN2018/078056 priority Critical patent/WO2019169536A1/zh
Priority to CN201880074893.0A priority patent/CN111373473B/zh
Publication of WO2019169536A1 publication Critical patent/WO2019169536A1/zh

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Definitions

  • The present application relates to the field of terminal technologies, and in particular, to a voice recognition method for an electronic device and to an electronic device.
  • With the popularity of terminals, and of voice recognition technology in particular, users can invoke corresponding functions of a terminal by inputting voice commands to it.
  • Taking a mobile phone as an example of the above terminal, the user can input a voice through the mobile phone; the mobile phone then sends the voice to the cloud, and the cloud converts the voice into text and processes the text to obtain a processing result.
  • the cloud then returns the processing result to the mobile phone, so that the mobile phone performs a function matching the processing result according to the processing result.
  • The above implementation process depends mainly on the processing capability of the cloud. When the terminal cannot exchange data with the cloud, it is difficult for the terminal to perform the corresponding function according to the input voice instruction.
  • For this reason, the function of recognizing and processing voice instructions is added to the terminal.
  • After the terminal converts the voice into text through voice recognition technology, it can process the text by template matching to determine the function that the terminal needs to call, that is, the processing result described above.
  • Template matching means that the terminal matches the obtained text against existing templates and determines a template that completely matches the text.
  • the terminal can then determine the function corresponding to the template according to the correspondence between the template and the function, and the terminal performs the function.
  • For example, if a template stipulates that the structure of the text is "time + place + do", the terminal can determine that the text matches the template when the structure of the text satisfies the structure "time + place + do".
  • However, if the structure of the text is "place + time + do", the template structure and the text structure cannot be completely matched; because no matching template can be found, the terminal cannot determine the function matching the text, and the user cannot invoke that function by inputting a voice instruction.
  • Embodiments of the present application provide a voice recognition method for an electronic device, and an electronic device, so as to improve the flexibility of the terminal when performing voice command recognition locally.
  • an embodiment of the present application provides a method for performing voice recognition on an electronic device.
  • the method includes converting the received voice command into text.
  • Parallel domain identification is then performed on the text by at least two sub-domain classifiers to obtain a domain identification result, where the domain identification result represents the domain to which the text belongs.
  • The text is then processed by the dialogue engine corresponding to the domain to which it belongs, and the function that the electronic device needs to perform for the text is determined.
  • Implementing the speech recognition process in this way effectively distinguishes the domain of the text, so that the domain-based text recognition can be completed in a more targeted manner, thereby determining the functions that the electronic device needs to perform and improving the accuracy of the speech recognition.
  • The above implementation process can be performed locally on the electronic device. In other words, even when the electronic device cannot access the network, recognition of the voice instruction can be realized without relying on cloud processing capability, thereby increasing the flexibility of the voice recognition.
  • Before domain identification, the text can be matched against pre-stored text. This pre-matching can reduce the resources consumed by subsequent domain identification by the sub-domain classifiers: it performs a preliminary screening of the text, and if the converted text conforms to a common sentence pattern, the domain to which the text belongs can be identified accurately from the existing correspondence between pre-stored texts and domains, without involving the sub-domain classifiers, thereby completing the domain recognition for the voice instruction.
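The pre-matching step above can be sketched as a direct table lookup. This is an illustrative assumption: the patent does not specify the matching mechanism, and the table entries and function name below are hypothetical.

```python
# Minimal sketch of pre-matching: if the converted text exactly matches a
# pre-stored text, its domain is returned directly and the sub-domain
# classifiers are skipped. Entries and names are illustrative assumptions.
PRESTORED_TEXT_TO_DOMAIN = {
    "open eye protection mode": "setting",
    "what is the weather tomorrow": "weather",
}

def prematch_domain(text):
    """Return the domain for a common sentence pattern, or None on a miss."""
    return PRESTORED_TEXT_TO_DOMAIN.get(text)
```

A miss (return value None) is the case in which the text is handed on to the sub-domain classifiers.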
  • Performing parallel domain identification on the text by at least two sub-domain classifiers to obtain the domain identification result may be implemented as: when the text fails to match the pre-stored text, performing parallel domain identification on the text by at least two sub-domain classifiers to obtain the domain identification result. Considering that the text may not conform to a common sentence pattern, after the preliminary screening the text can be identified by the sub-domain classifiers. It should be noted that in this process multiple sub-domain classifiers can perform domain identification on the text in parallel, that is, at least two sub-domain classifiers identify the domain of the text simultaneously, so as to save the time taken by domain identification.
  • the electronic device includes N sub-domain classifier groups, wherein each group has a different priority, and N is a positive integer greater than or equal to 2.
  • Performing parallel domain identification on the text by at least two sub-domain classifiers to obtain the domain identification result may be specifically implemented as: performing domain identification by the sub-domain classifiers in the highest-priority group among the N sub-domain classifier groups. If a sub-domain classifier in the highest-priority group identifies the domain to which the text belongs, that domain is taken as the domain identification result; if the sub-domain classifiers in the highest-priority group do not recognize the domain to which the text belongs, domain identification is performed by the sub-domain classifiers in the next-priority group among the N groups, and so on, until either the domain to which the text belongs is identified and taken as the domain identification result, or the text has been identified by all sub-domain classifiers in the N sub-domain classifier groups.
  • At least one of the N sub-domain classifier groups includes at least two sub-domain classifiers.
  • The sub-domain classifiers in each priority group identify the text in a certain order. Once a domain recognition result is obtained, it can be returned without passing the text to the next-priority group, so that domain identification uses as few sub-domain classifiers as possible while still ensuring an accurate domain identification result.
  • At least two sub-domain classifiers in at least one of the N sub-domain classifier groups perform domain identification on the text in parallel.
  • Not all of the priority groups need to include multiple sub-domain classifiers; rather, at least one priority group includes multiple sub-domain classifiers. It should be noted that the more sub-domain classifiers perform domain identification on the text in parallel, the more accurate the domain recognition result is.
  • the domain identification accuracy of the sub-domain classifier in the low priority group is lower than the domain identification accuracy of the sub-domain classifier in the high priority group.
  • That is, the accuracy of domain identification by the sub-domain classifiers in a high-priority group is higher than that of the sub-domain classifiers in a low-priority group. Therefore, the above progressive domain identification process can effectively reduce the workload of the sub-domain classifiers with low domain recognition accuracy, and further improve the accuracy of the overall domain identification process.
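The priority-group cascade described above can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the function names are assumptions, classifiers are modeled as callables returning a domain string or None, and the sketch returns the first recognized domain within a group (the patent also allows combining multiple results).

```python
from concurrent.futures import ThreadPoolExecutor

def identify_domain(text, classifier_groups):
    """classifier_groups is ordered from highest to lowest priority; each
    group is a non-empty list of callables that return a domain string, or
    None when the classifier does not recognize the text."""
    for group in classifier_groups:
        # Classifiers within one group run on the text in parallel.
        with ThreadPoolExecutor(max_workers=len(group)) as pool:
            results = list(pool.map(lambda clf: clf(text), group))
        hits = [d for d in results if d is not None]
        if hits:
            return hits[0]  # recognized in this group: stop cascading
    return None  # no sub-domain classifier in any group recognized the text
```

Only when every classifier in the current group returns None does the text fall through to the next, lower-accuracy group, matching the progressive process in the text.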
  • At least one of the N sub-domain classifier groups includes a first sub-domain classifier and a second sub-domain classifier.
  • The first sub-domain classifier obtains a first domain recognition result by performing domain identification on the text, and the second sub-domain classifier obtains a second domain recognition result by performing domain identification on the text. At least one of the first domain identification result and the second domain identification result is determined as the domain identification result; or both the first domain recognition result and the second domain recognition result are determined as the domain recognition result. The domain identification result may be selected based on a preset rule or an already configured summary decision mode; for example, one of the results may be selected as the final domain identification result, or multiple or all of them may be selected as the final domain identification result, and the rules or decision methods are not limited here.
  • Each of the at least two sub-domain classifiers performs domain identification on the text, which may be implemented by: performing named entity recognition (NER) on the text and determining common features in the identified content.
  • the public features are then replaced according to a preset rule.
  • the preset rule includes replacement content corresponding to different categories of common features.
  • Feature extraction is then performed on the text after replacement, the weight of each feature is determined, and the value of the text is calculated according to the weight of each feature.
  • The common feature replacement can reduce the computing resources consumed when calculating the value of the text, and can effectively reduce the influence of such features on the domain recognition process, thereby improving the accuracy of domain recognition for the text.
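The replacement-then-scoring step above can be sketched as follows. Regex patterns stand in for NER here purely for illustration (a real classifier would use a trained NER model), and the patterns, category symbols, weights, and function names are all assumptions.

```python
import re

# Toy stand-in for NER: recognize a few time/place words and replace them
# with category symbols, so the score does not depend on the specific word.
COMMON_FEATURE_PATTERNS = {
    "TIME": re.compile(r"\b(today|tomorrow|yesterday)\b"),
    "PLACE": re.compile(r"\b(beijing|shanghai)\b"),
}

def replace_common_features(text):
    """Replace recognized common features with their category symbols."""
    for symbol, pattern in COMMON_FEATURE_PATTERNS.items():
        text = pattern.sub(symbol, text)
    return text

def text_value(text, feature_weights):
    """Sum the weights of the features present after replacement."""
    replaced = replace_common_features(text.lower())
    return sum(w for feat, w in feature_weights.items() if feat in replaced)
```

Because "beijing tomorrow" and "shanghai yesterday" both reduce to "PLACE TIME", the specific time or place no longer shifts the value of the text, which is the effect the bullet above describes.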
  • At least two sub-domain classifiers may be trained in advance before the text is subjected to parallel domain recognition by the at least two sub-domain classifiers.
  • the process of training each sub-field classifier is as follows:
  • each sub-domain classifier may have its own independent positive and negative samples, wherein the positive and negative samples include a positive example training sample set and a negative example training sample set.
  • the samples in the positive training sample set are samples belonging to the corresponding domain of the sub-domain classifier, and the samples in the negative training sample set are samples that do not belong to the corresponding domain of the sub-domain classifier.
  • NER and rule extraction are performed on positive and negative samples, and common feature replacement is performed on positive and negative samples processed by NER.
  • A common feature refers to content that affects the value of the text when that value is calculated, but whose presence does not affect the domain to which the text belongs.
  • Common features include, but are not limited to, words such as time and place.
  • the common feature may be replaced with a symbol or the like, which is not limited herein.
  • Rules include, but are not limited to, sentence patterns such as "search for pictures of ".
  • The NER on the positive and negative samples can be the premise of rule extraction and common feature replacement. That is, NER identifies the place, time, sentence pattern, and the like in the positive and negative samples; the sentence pattern is then taken as a rule, time, place, and the like are taken as common features, and the replacement between the common features and symbols is completed.
  • Stop words are also handled. In the process of training the sub-domain classifier on the positive and negative samples, in order to reduce the interference caused to the recognition process by modal particles such as "ah" and "yeah" and symbols such as ";" and "," in the positive and negative samples, these stop words need to be identified and ignored during domain identification.
  • Features are then generated to build a training corpus feature library, and the value corresponding to the text is calculated according to the weights. The training corpus feature library is used to store the correspondence between features and weights.
  • The sub-domain classifier is trained, the impact of erroneous domain recognition results is evaluated, and the positive and negative samples are then revised. This training process can dynamically adjust the distribution of positive and negative samples, thereby improving the recognition accuracy of the sub-domain classifier.
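The training loop described above (tokenize and drop stop words, build a feature library of weights, then use recognition errors to revise the samples) can be sketched as a toy bag-of-words trainer. The weighting scheme, threshold, and all names below are illustrative assumptions, not the patent's method.

```python
from collections import Counter

STOP_WORDS = {"ah", "yeah", ";", ","}

def tokenize(sample):
    """Split into words and drop stop words, as described above."""
    return [t for t in sample.lower().split() if t not in STOP_WORDS]

def train_feature_weights(positive, negative):
    """Toy training-corpus feature library: each feature's weight is its
    count in positive samples minus its count in negative samples."""
    pos, neg = Counter(), Counter()
    for s in positive:
        pos.update(tokenize(s))
    for s in negative:
        neg.update(tokenize(s))
    return {f: pos[f] - neg[f] for f in set(pos) | set(neg)}

def in_domain(sample, weights, threshold=0):
    """The sample belongs to the domain when its value exceeds the threshold."""
    return sum(weights.get(t, 0) for t in tokenize(sample)) > threshold

def misrecognized_negatives(negative, weights):
    """Collect negatives the classifier wrongly accepts, mimicking the
    error-driven revision of positive and negative samples."""
    return [s for s in negative if in_domain(s, weights)]
```

Samples returned by `misrecognized_negatives` are the ones whose distribution would be adjusted before retraining, which is the dynamic adjustment the bullet above describes.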
  • an embodiment of the present application provides an electronic device.
  • the electronic device can implement the functions implemented in the foregoing method embodiments, and the functions can be implemented by using hardware or by executing corresponding software by hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • an embodiment of the present application provides an electronic device.
  • the structure of the electronic device includes a memory, one or more processors.
  • the memory is for storing computer program code, the computer program code comprising computer instructions.
  • The one or more processors described above, when reading and executing the computer instructions, cause the electronic device to implement the method of any of the first aspect and its various exemplary implementations.
  • an embodiment of the present application provides a readable storage medium, including instructions.
  • When the instruction is run on an electronic device, the electronic device is caused to perform the method of any of the first aspects above and various exemplary implementations thereof.
  • an embodiment of the present application provides a computer program product, comprising: software code for performing the method of any of the above first aspects and various exemplary implementations thereof.
  • FIG. 1 is a schematic structural diagram of a terminal according to an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of an exemplary method provided by an embodiment of the present application;
  • FIG. 3 is a flowchart of a method for processing a voice command by an exemplary mobile phone according to an embodiment of the present disclosure;
  • FIG. 4 is a schematic diagram of an exemplary domain-identification multi-classification system according to an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of an implementation process of text domain recognition using the system shown in FIG. 4 according to an embodiment of the present application;
  • FIG. 6 is a flowchart of a method for training a sub-domain classifier in the case where the domain to which a text belongs is known, according to an embodiment of the present application;
  • FIG. 7 is a flowchart of a training method for adjusting the positive and negative samples of a sub-domain classifier according to an embodiment of the present application;
  • FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
  • FIG. 9 is a schematic structural diagram of another electronic device according to an embodiment of the present disclosure.
  • The embodiments of the present application can be applied to an electronic device, which can be a terminal such as a notebook computer, a smart phone, a virtual reality (VR) device, an augmented reality (AR) device, an in-vehicle device, or a smart wearable device.
  • the terminal can be configured with at least a display screen, an input device, and a processor.
  • The terminal 100 is taken as an example. As shown in FIG. 1, the terminal 100 includes components such as a processor 101, a memory 102, a camera 103, an RF circuit 104, an audio circuit 105, a speaker 106, a microphone 107, an input device 108, other input devices 109, a display screen 110, a touch panel 111, a display panel 112, an output device 113, and a power source 114.
  • the display screen 110 is composed of at least a touch panel 111 as an input device and a display panel 112 as an output device.
  • The terminal structure shown in FIG. 1 does not constitute a limitation on the terminal; the terminal may include more or fewer components than those illustrated, combine some components, split some components, or use a different component arrangement, which is not limited herein.
  • the components of the terminal 100 will be specifically described below with reference to FIG. 1 :
  • The radio frequency (RF) circuit 104 can be used for receiving and transmitting signals during the transmission or reception of information or during a call. For example, if the terminal 100 is a mobile phone, the terminal 100 can receive downlink information sent by a base station through the RF circuit 104 and then deliver it to the processor 101 for processing; in addition, uplink data is transmitted to the base station.
  • RF circuits include, but are not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like.
  • RF circuitry 104 can also communicate with the network and other devices via wireless communication.
  • The wireless communication can use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and the like.
  • the memory 102 can be used to store software programs and modules, and the processor 101 executes various functional applications and data processing of the terminal 100 by running software programs and modules stored in the memory 102.
  • The memory 102 may mainly include a storage program area and a storage data area. The storage program area may store an operating system, an application required for at least one function (for example, a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, video data, etc.) created according to the use of the terminal 100, and the like.
  • The memory 102 can include high-speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • Other input devices 109 can be used to receive input numeric or character information, as well as to generate key signal inputs related to user settings and function control of terminal 100.
  • Other input devices 109 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control buttons, switch buttons, etc.), a trackball, a mouse, a joystick, and a light mouse (a light mouse is a touch-sensitive surface that does not display visual output, or an extension of a touch-sensitive surface formed by a touch screen). Other input devices 109 may also include sensors built into the terminal 100, such as gravity sensors and acceleration sensors, and the terminal 100 may also use parameters detected by the sensors as input data.
  • the display screen 110 can be used to display information input by the user or information provided to the user as well as various menus of the terminal 100, and can also accept user input.
  • the display panel 112 can be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
  • the touch panel 111 is also called a touch screen or a touch sensitive screen.
  • The touch panel 111 can collect contact or non-contact operations of the user on or near it (for example, operations performed by the user on or near the touch panel 111 using a finger, a stylus, or any other suitable object or accessory, which may also include somatosensory operations; the operations include single-point control operations, multi-point control operations, and the like), and drive the corresponding connection device according to a preset program.
  • the touch panel 111 may further include two parts: a touch detection device and a touch controller.
  • The touch detection device detects the touch position and gesture of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into information that the processor 101 can process, transmits that information to the processor 101, and can also receive and execute commands sent from the processor 101.
  • In addition, the touch panel 111 can be implemented using various types such as resistive, capacitive, infrared, and surface acoustic wave, or by any technique developed in the future.
  • The touch panel 111 can cover the display panel 112, and the user can operate on or near the touch panel 111 according to the content displayed by the display panel 112 (including but not limited to a soft keyboard, a virtual mouse, virtual buttons, icons, etc.). After detecting an operation on or near it, the touch panel 111 transmits the operation to the processor 101 to determine the user input, and the processor 101 then provides corresponding visual output on the display panel 112 according to the user input.
  • Although the touch panel 111 and the display panel 112 are shown in FIG. 1 as two independent components to implement the input and output functions of the terminal 100, in some embodiments the touch panel 111 may be integrated with the display panel 112 to implement the input and output functions of the terminal 100.
  • The audio circuit 105, the speaker 106, and the microphone 107 provide an audio interface between the user and the terminal 100.
  • The audio circuit 105 can transmit converted audio data to the speaker 106, which converts it into a sound signal for output. The microphone 107 can convert a collected sound signal into an electrical signal, which is received by the audio circuit 105 and converted into audio data; the audio data is then output to the RF circuit 104 for transmission to a device such as another terminal, or output to the memory 102 so that the processor 101 can perform further processing in conjunction with the content stored in the memory 102.
  • the camera 103 can acquire image frames in real time and transmit them to the processor 101 for processing, and store the processed results to the memory 102 and/or present the processed results to the user via the display panel 112.
  • The processor 101 is the control center of the terminal 100. It connects the various parts of the entire terminal 100 using various interfaces and lines, and performs the various functions of the terminal 100 and processes data by running or executing software programs and/or modules stored in the memory 102 and recalling data stored in the memory 102, thereby performing overall monitoring of the terminal 100.
  • The processor 101 may include one or more processing units; the processor 101 may further integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface (UI), applications, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 101.
  • The terminal 100 may further include a power source 114 (for example, a battery) for supplying power to the respective components. The power source 114 may be logically connected to the processor 101 through a power management system, so that functions such as charging, discharging, and power consumption management are handled through the power management system.
  • the terminal 100 may further include a Bluetooth module and the like, and details are not described herein.
  • the following takes the terminal 100 as a mobile phone as an example, and describes an embodiment of the present application.
  • The cloud converts the voice command into text through voice recognition technology, and then processes the text to determine the function that the mobile phone needs to perform corresponding to the text, that is, the processing result.
  • The process of processing the text in the cloud may be implemented by matching the text one by one against the content set in templates to finally obtain the processing result; alternatively, the cloud extracts keywords from the text and then obtains the processing result based on those keywords.
  • the cloud returns the processing result to the mobile phone, and the mobile phone implements a function corresponding to the processing result.
  • That is, both the process of converting the voice command into text and the subsequent processing of the text can take place in the cloud; the mobile phone only needs to send the received voice command to the cloud, receive the processing result returned by the cloud, and perform the corresponding function. However, data transmission between the mobile phone and the cloud relies on the network; therefore, when the mobile phone cannot connect to the network, it cannot accurately and efficiently perform the function corresponding to the voice instruction.
  • In addition, templates are mostly produced manually, so generating more templates consumes considerable manpower and material resources. Moreover, a template is fixed once generated; when the structure of the voice instruction cannot be completely matched to the template structure, the failure rate of the processing increases. That is, text processing using templates is not flexible, and matching the text against many templates takes a long time. Similar problems arise when the text is processed by extracting keywords.
  • The cloud recognizes one such text as "setting the language"; for the text "translate to open the eye protection mode", which involves translation and also involves enabling a mode, the cloud recognizes the text as "open eye protection mode"; for the text "help me remember the geographical location of the restaurant", which involves location information and a record, the cloud recognizes the text as involving the Global Positioning System.
  • The domain refers to the type of the text; the types may be divided based on the context in which the text is used, and the domain may be used as a text category. One domain corresponds to one type of task; that is, texts belonging to the same domain correspond to the same type of task and are handled by the same dialogue engine.
  • The process of processing the text by the dialogue engine may include parsing the text through natural language processing (NLP) technology to output the processing result.
  • The processing result may include code for a function that the mobile phone is required to implement, so that the mobile phone invokes the corresponding function.
  • A specific implementation may be that the dialogue engine of the domain parses the text and determines the function to be executed corresponding to the text. The dialogue engine can also generate instruction code corresponding to the function to be executed; the instruction code is code that the machine can recognize to execute the corresponding function. The code can be binary code or high-level language code such as <play><wangfei><song> (the instruction code generated for the user's input voice command "I want to listen to Faye Wong's music"), which is not limited herein.
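The tag-style instruction code above can be sketched as a simple rendering of a parsed processing result. The slot names (action, artist, object) are assumptions for illustration; only the `<play><wangfei><song>` example itself comes from the text.

```python
# Hypothetical mapping from a parsed processing result to the tag-style
# instruction code shown above; slot names are illustrative assumptions.
def to_instruction_code(action, artist, obj):
    """Render an action and its arguments as <action><artist><object>."""
    return "<{}><{}><{}>".format(action, artist, obj)
```

For the voice command "I want to listen to Faye Wong's music", a dialogue engine filling these slots would produce the example string from the text.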
  • FIG. 2 is a schematic flowchart of an exemplary method provided by an embodiment of the present application.
  • After receiving the voice command, the terminal converts it into text using voice recognition technology; the terminal then performs domain identification on the text, the domain recognition result obtained through domain identification is fed to the dialogue engine corresponding to that result for processing, and the obtained processing result is finally fed back to the mobile phone.
  • The dialogue engine may feed the processing result back to the application providing the entry point, so that this application implements functions such as interface switching.
  • When the user inputs a voice command, the mobile phone may be in the main interface or in a system-level display interface such as the settings interface, rather than in the running interface of an application; therefore, the dialogue engine can feed the processing result back to the system of the mobile phone, so that the system implements functions such as running an application or adjusting the font size of the display interface.
  • For example, the interface presented to the user by the mobile phone is the main interface of the mobile phone, and the user runs a game application by inputting the voice instruction "open the game application".
  • the mobile phone feeds back the processing result to the mobile phone system, and the game application is started by the mobile phone system.
  • the system application includes, but is not limited to, an application that is pre-installed when the mobile phone is shipped, and has the function of receiving a voice command.
• The third-party application includes, but is not limited to, an application with a voice command function that the user downloads and installs from a platform such as an application store, an application that implements a calling function through other applications in the mobile phone, and the like; this is not limited herein.
• The system-level display interface refers to an interface of the mobile phone other than the running interface of an application, for example, the main interface and the settings interface; the running interface of an application includes, but is not limited to, an interface presented to the user during or after application startup, for example, the application loading interface and the application setting interface.
• The fields referred to include, but are not limited to, setting, do-not-disturb (noDisturb), gallery, translate, stock, weather, calculator, and encyclopedia (baike).
• The text converted by voice recognition can be classified into a preset category by identifying keywords in the text, by using a template matching method, or by processing with the sub-domain classifiers.
  • the preset categories include, but are not limited to, the above-exemplified fields.
• The implementation of keyword recognition and template matching may follow the existing manner of domain identification for text, and details are not described herein; the sub-domain classifiers may be set in the domain-identification multi-classification system shown in FIG. 2, and their role will be presented later and is not described here.
• The purpose of the domain-identification multi-classification system shown in FIG. 2 is to perform domain identification on the text that the mobile phone has finished converting and to output the corresponding domain recognition result. According to the domain identification result, the mobile phone then hands the text to the dialogue engine corresponding to that result for processing and obtains the processing result, so that the mobile phone invokes the corresponding function according to the instruction in the processing result.
  • the user voice portal may be a general portal such as a voice assistant, and may also be a local entry of a system application or a third party application in the mobile phone. For example, taking the system application as an example, the user inputs a voice in the gallery to enable the mobile phone to complete the function of searching for pictures in the gallery.
  • FIG. 3 is a flowchart of a method for processing a voice command by an exemplary mobile phone according to an embodiment of the present application. For example, if the user turns on the Wi-Fi function of the mobile phone by using a voice command, the voice command input by the user is “turn on Wi-Fi”, and after the voice command is voice-recognized, the text with the content “turn on Wi-Fi” is obtained.
• The mobile phone performs domain identification on the text, confirms that the domain to which the text belongs is the setting field, and sends the text to the dialogue engine corresponding to the setting field, that is, the local multi-domain semantic-understanding dialogue engine, for processing; the mobile phone then performs the corresponding function according to the processing result.
  • the dialogue engine may further display the execution result of the mobile phone after performing the corresponding function through voice playback, or pop up a dialog box, etc., prompting the user that the mobile phone has completed the execution of the corresponding function according to the voice instruction input by the user.
  • the mobile phone can play "Wi-Fi is turned on” by voice play, or pop up a message including "Wi-Fi is turned on” to prompt the user.
• The manner of performing domain identification in the embodiments of the present application differs from the way domain identification is realized in the prior art. In the prior art, the processing of a voice instruction mainly depends on template matching; therefore, when the text structure differs from the template structure, the mobile phone cannot obtain an accurate processing result.
• Here, a domain identification stage and dialogue engines are introduced. The domain identification process not only considers template matching and keyword extraction, but also lets parallel multi-sub-domain classifiers work together across multiple fields: one or more fields corresponding to the text are output, and the text is processed by the dialogue engine of each selected field.
• The mobile phone can thus still further analyze and process the text even when the text structure and the template structure cannot be completely matched. Moreover, the mobile phone can process the text from a multi-domain perspective, rather than merely pushing the text to a single corresponding dialogue engine for processing.
• The speech recognition process provided by the embodiment of the present application can effectively distinguish the domain of the text and then complete the domain-based text recognition process in a more targeted way, thereby determining the functions that the electronic device needs to perform and enhancing the accuracy of the voice recognition.
  • the implementation process can be performed locally on the electronic device. In other words, even in the process that the electronic device cannot access the network, the recognition of the voice instruction can be realized without using the cloud processing capability, thereby increasing the flexibility of the voice recognition.
• The example shown in FIG. 3 is an interaction process of a mobile phone dialogue system: the user inputs a voice instruction, the mobile phone performs the processing corresponding to the voice instruction, and the result of performing the function is fed back to the user by voice playback or display. The voice command input by the user and the voice playback content or display result output by the mobile phone interact to form the dialogue system.
• The voice playback or display result output by the mobile phone is an exemplary response that the mobile phone provides to the user, during or after executing the corresponding function, in response to the voice command input by the user.
• FIG. 4 shows an exemplary domain-identification multi-classification system provided by an embodiment of the present application.
  • the purpose of the multi-classification system for domain identification is to complete the domain identification of the text based on the text.
  • the system can be divided into three layers, which are the control layer, the classifier layer and the algorithm layer.
• The control layer includes the following components: text fast full-precision matching, domain scheduling, classification decision, and a data loader.
• Text fast full-precision matching covers commonly used phrases, sentence patterns, and the like, for example, common fixed statements; for these, the control layer can directly determine the field of the text without passing the text to the classifier layer for further processing.
  • the template for fast full-precision matching of the text may be preset.
• The specific setting manner may refer to existing manual templates, for example, the templates involved in the template matching manner described in the background art; details are not repeated here.
• The function of domain scheduling includes scheduling the sub-domain classifiers within each group of the classifier layer. For example, when text fast full-precision matching fails to determine the domain of the text, domain scheduling can schedule the sub-domain classifiers in priority 1 to process the text; if the sub-domain classifiers in priority 1 cannot determine the domain either, it schedules the sub-domain classifiers in priority 2 to continue processing, and so on, until the domain to which the text belongs is determined, or every sub-domain classifier of the classifier layer has processed the text without determining the domain. In addition, for a single sub-domain classifier, domain scheduling can also be used to invoke the algorithms, rules, patterns, and so on that the sub-domain classifier involves.
• Domain scheduling connects the control layer and the classifier layer; the sub-domain classifiers in each priority group are invoked sequentially according to the priority of the classifier layer, from high to low.
• The classification decision, that is, the summary decision, mainly serves to determine the domain to which the text belongs, or to determine that the text has no domain, when the control layer has not determined the domain after text fast full-precision matching.
• For example, when the text has been processed by all the sub-domain classifiers in the same priority and the obtained domain recognition result includes multiple fields, the classification decision determines the domain to which the text belongs: it may specify that the text belongs to one or more fields, that is, that the text belongs to domain 1, belongs to domain 2, or belongs to both domain 1 and domain 2.
• As another example, the classification decision can determine the domain to which the text belongs by summarizing the domain recognition results obtained by the sub-domain classifiers in priority 1 and priority 2: if priority 1 yields no domain for the text while priority 2 yields domain 1, the classification decision finally determines domain 1 as the domain recognition result of the text.
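As a sketch, the summary decision over one priority group's results might look like the following; the convention that a classifier returns the string "other" when it cannot assign a domain is an assumption for illustration, not part of the embodiment.

```python
def classification_decision(results: list[str]) -> list[str]:
    """Summarize the domain recognition results of the parallel
    sub-domain classifiers in one priority group. Returns every
    distinct valid domain (the text may belong to several fields),
    or an empty list when no classifier produced a valid result."""
    return sorted({r for r in results if r != "other"})
```

An empty return value signals the scheduler to hand the text to the next, lower-priority group.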
  • the mobile phone may generate an instance in the process of recognizing the voice instruction, and the instance may be a task to be processed, and the task is to identify the domain of the text converted by the voice instruction by the mobile phone.
  • multiple sub-domain classifiers can process the same instance at the same time, that is, the mobile phone performs multiple tasks at the same time to realize domain recognition of the text.
  • the data loader is configured to acquire data of various libraries required by the algorithm layer, a model of the sub-domain classifier in the classifier layer, and configuration information from a local device, a network side, or a third-party device such as a server.
  • the sub-domain classifier refers to a classifier corresponding to each domain; the configuration information includes but is not limited to initialization parameters of each model.
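As a minimal sketch of the data loader, the following reads the libraries and classifier configuration from local JSON files; the file names and local-storage layout are assumptions for illustration, since the loader could equally pull the same data from the network side or a server.

```python
import json
from pathlib import Path

def load_resources(base_dir: str) -> dict:
    """Load the rule library, named-entity (NE) library, feature
    library, and sub-domain classifier configuration from local
    files. Missing files yield empty entries rather than errors."""
    resources = {}
    for name in ("rules", "named_entities", "features", "classifier_config"):
        path = Path(base_dir) / f"{name}.json"
        resources[name] = json.loads(path.read_text()) if path.exists() else {}
    return resources
```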
• The control layer acts as the layer in the system that interacts with other components of the mobile phone: it obtains the text produced by voice recognition from the mobile phone, and after the system processes the text, it feeds the domain identification result, that is, the classification result, back to the phone. The control layer is responsible for the external business interaction interface, loading of initialization data and models, scheduling of domain classification tasks, distribution of classification tasks to the sub-domain classifiers, and the summary decision over all returned classification results.
  • the classifier layer includes a plurality of priorities, such as priority 1 (priority 1), priority 2 (priority 2), and priority 3 (priority 3).
• Priority 1 is higher than priority 2, and priority 2 is higher than priority 3.
• The classifier layer implements the classification of the text. It supports multi-level, multi-instance task classification: as described above, the classifier layer includes multiple priority classifier groups, and within each priority group there are multiple parallel sub-domain classifiers that can be executed simultaneously, enabling the domain classification of the text to reach a summary decision.
• Within each sub-domain classifier, sub-domain feature extraction, for example by Named Entity Recognition (NER), and domain identification are implemented. It should be noted that for the same text, different sub-domain classifiers in the same priority can extract the same sub-domain features and still obtain different domain recognition results.
• The sub-domain features include, but are not limited to, keywords in the text: in different fields the same keyword may carry the same or different meanings, so a keyword can affect the domain identification result. The domain recognition result refers to the output of a sub-domain classifier's processing of the text, a preliminary prediction of the domain to which the text may belong. For example, after two sub-domain classifiers of the same priority process the same text, one may determine that the text belongs to domain 1 and the other that it belongs to domain 2; the two sub-domain classifiers then obtain different domain recognition results, namely that the text belongs to domain 1 and that the text belongs to domain 2.
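The simultaneous execution of one priority group's sub-domain classifiers can be sketched with a thread pool. Each classifier is modeled as a callable returning a domain name or "other"; that return convention is an assumption for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def classify_priority_group(text: str, classifiers: list) -> list:
    """Run every sub-domain classifier of one priority group on the
    same text concurrently (one task instance per classifier) and
    collect their domain recognition results in classifier order."""
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        return list(pool.map(lambda clf: clf(text), classifiers))
```

`pool.map` preserves input order, so result i belongs to classifier i, which keeps the summary decision straightforward.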
  • the classifier groups of two adjacent priorities are in a serial relationship.
• If a valid domain recognition result is obtained in priority 1, it may be fed back to the mobile phone through the control layer; if the sub-domain classifiers in priority 1 obtain no valid domain recognition result, the text can be passed to each sub-domain classifier in the serially connected priority 2 for processing, and so on, until a valid domain recognition result is obtained. If the text traverses every priority in the classifier layer without obtaining a domain recognition result, a classification result indicating that no domain was recognized is fed back to the mobile phone. Here, a valid domain recognition result means that within some priority the domain to which the text belongs has been determined; that determined domain is the valid domain recognition result.
  • the number of priorities in the classifier layer and the number of sub-domain classifiers in the same priority are not limited in the embodiment of the present application.
  • the fields corresponding to each sub-domain classifier can be defined in advance.
• The fields corresponding to the sub-domain classifiers may also be adjusted, where the foregoing adjustments include, but are not limited to, adjusting the priority of a sub-domain classifier, adjusting the field a sub-domain classifier corresponds to, and increasing or decreasing the number of sub-domain classifiers, for example, moving a sub-domain classifier from one priority to another, or swapping sub-domain classifiers between different priorities.
• Fields with high domain-identification accuracy can be assigned to high-priority sub-domain classifiers, and better-performing models can likewise correspond to high-priority sub-domain classifiers, so that the text is first processed by classifiers with higher accuracy and better time performance.
• Once a valid result is obtained, the system can return the domain identification result to the handset. In other words, when the high-priority sub-domain classifiers have not obtained a valid domain recognition result after domain identification, the text can be submitted to the next-level sub-domain classifiers for domain identification, until a valid result is obtained. Here, a valid domain identification result refers to the domain of the text as determined by the system, and the highest-priority sub-domain classifiers refer to the sub-domain classifiers in priority 1 of FIG. 4, that is, sub-domain classifier 11, sub-domain classifier 12, and sub-domain classifier 13.
  • the text can be sequentially identified according to the priority of the group in which the sub-domain classifier is located from high to low.
• Once a valid domain recognition result is obtained, the domain recognition process for the text can be ended.
• The algorithm layer is used to provide algorithms and models.
  • the model refers to a database such as a rule library, a named entity (Named Entity, NE) library, and a feature library.
  • the algorithm provided by the algorithm layer can also be embodied in the form of a database, for example, an algorithm model library.
• A variety of algorithms are included in the algorithm model library. Before the system runs, the data loader of the control layer needs to load the algorithm-related content into the system so that it can be flexibly called by the sub-domain classifiers of the classifier layer.
• FIG. 5 is a schematic diagram of an implementation flow for text field recognition using the system shown in FIG. 4.
  • the system After the mobile phone inputs the text into the system shown in Figure 4, the system first performs text fast full-precision matching on the text at the control layer. When the field can be successfully determined for the text, the obtained field is directly used as the recognition result; In the case where the field is successfully determined for the text, the text can be further processed through the classifier layer.
• The above processing of the text may be implemented as performing field identification on the text in descending order of the priority groups of the sub-domain classifiers in the classifier layer. In this process, no matter in which priority group the field of the text is identified, as soon as the field corresponding to the text is recognized, that field is fed back to the mobile phone as the domain recognition result, and the text is no longer handed to the next priority group for processing.
  • the system performs classification task scheduling for priority 1, that is, calls the sub-domain classifier 11, the sub-domain classifier 12, and the sub-domain classifier 13 to perform parallel domain identification on the text input into the system.
• Parallel domain identification means that sub-domain classifier 11, sub-domain classifier 12, and sub-domain classifier 13 identify the domain of the text simultaneously, or in a certain time sequence; each sub-domain classifier then outputs a domain identification result, and the classification decision corresponding to priority 1 makes a decision over the three output results, either determining the domain recognition result fed back to the mobile phone or passing the text to the next priority group.
• If priority 1 yields no valid result, the system continues to perform classification task scheduling for priority 2, that is, it calls sub-domain classifier 21, sub-domain classifier 22, and sub-domain classifier 23 to perform parallel domain recognition on the text input into the system.
  • the implementation of the parallel domain identification process can refer to the description of the above paragraph.
  • the system can input the text into each sub-domain classifier corresponding to the priority level 3, perform domain identification, and directly feed back the obtained effective domain recognition result to the mobile phone.
• Here, the domain recognition result fed back to the phone is: the field of the text obtained through the control layer's text full-precision matching; or the field of the text obtained through the processing of one or more priority groups in the classifier layer; or a result indicating that no field was obtained after the text has been processed by the control layer and every priority group of the classifier layer.
• The process of recognizing the field of the text by the system ends either when a valid domain identification result is obtained, or when every priority group in the classifier layer has performed field identification on the text without obtaining a valid domain recognition result.
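Putting the steps of FIG. 5 together, the control flow can be sketched as follows. The helper names are illustrative, and the real system runs each group's classifiers in parallel rather than in the sequential loop shown here.

```python
def recognize_domain(text, fast_match, priority_groups, decide):
    """FIG. 5 flow sketch: fast full-precision matching first, then the
    priority groups from high to low, stopping at the first valid
    domain recognition result; [] means no domain was recognized."""
    domain = fast_match(text)
    if domain is not None:
        return [domain]
    for classifiers in priority_groups:       # priority 1, 2, 3, ...
        results = [clf(text) for clf in classifiers]
        domains = decide(results)
        if domains:                           # valid recognition result
            return domains
    return []                                 # fed back as "no domain"
```

The early return on each priority group is what spares the lower-priority classifiers from processing text that has already been resolved.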
• When each sub-domain classifier in the same priority group performs domain identification on the text simultaneously, the discrimination is carried out by multiple sub-domain classifiers within the same time period, which can effectively save the time spent on the discrimination process. When the sub-domain classifiers in the same priority group instead perform domain identification in a certain time sequence, only one sub-domain classifier runs during any given period, so the system resources occupied in that period are small, ensuring that enough resources remain in the phone for other systems or programs to call.
• High scalability means that the system can support arbitrary extension to future new vertical classes without re-establishing the existing model, that is, the system proposed in the embodiment of the present application.
  • the above-mentioned vertical class refers to the categories of different fields involved in the embodiments of the present application, such as settings, DND, Gallery, Translation, Stock, Weather, Computing, and Encyclopedia.
  • the sub-domain classifiers corresponding to other fields can be added in the classifier layer in combination with the needs of different application scenarios.
• Strong flexibility means that, according to current and future classification needs, the different priority groups can be flexibly adjusted: sub-domain classifiers within a single priority group can be added or removed, sub-domain classifiers can be exchanged among multiple priority groups, and so on; this is not limited herein. Such adjustment ensures that the classifier layer obtains a relatively accurate domain identification result after the summary decision.
• Higher accuracy means that, for a single sub-domain classifier, the characteristics of its corresponding domain can be combined and the text processed with specific analysis and calculation methods, for example, number handling, stop-word processing, bi-gram or tri-gram selection, and the choice of feature extraction range and method. Since the same or different processing methods are used in different sub-domain classifiers, each can be more targeted, and therefore the accuracy is relatively high.
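For instance, the per-classifier preprocessing just mentioned (number handling, stop-word filtering, bi-gram selection) could be configured along these lines; the stop-word list is a made-up placeholder, and each sub-domain classifier might tune every step differently.

```python
STOPWORDS = {"the", "a", "an", "to", "of"}  # illustrative only

def extract_features(text: str, use_bigrams: bool = True) -> set:
    """One possible feature-extraction recipe: lowercase, drop
    numbers and stop words, keep uni-grams, optionally add bi-grams."""
    tokens = [t for t in text.lower().split()
              if t not in STOPWORDS and not t.isdigit()]
    features = set(tokens)                    # uni-gram features
    if use_bigrams:                           # bi-gram selection
        features |= {f"{a}_{b}" for a, b in zip(tokens, tokens[1:])}
    return features
```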
• More refined refers to the screening of training data, as well as more targeted training and optimization of the sub-domain classifiers, enabling each sub-domain classifier to fine-tune its field recognition of the text and achieve more accurate field identification.
  • the domain corresponding to each sub-domain classifier in the classifier layer may be pre-configured.
• In descending order of identification accuracy, sub-domain classifiers with high field-identification accuracy can be placed in higher-priority groups, such as priority 1, and sub-domain classifiers with lower recognition accuracy in lower-priority groups, such as priority 3.
• For example, the mobile phone can put its own classification tasks, which have the highest classification accuracy, into priority 1, put the classification tasks docked with applications into priority 2, and place the hardest-to-identify fallback classification tasks into priority 3.
• In some cases, no matter which sub-domain classifier in a priority group determines the domain of the text, the processing of the voice instruction is not affected, or is almost not affected.
• If the sub-domain classifiers set in priority 2 correspond to applications, the text processing of different domains under the same application is usually handled by that application's dialogue engine. In other words, whichever of the application's fields the text belongs to, it will eventually be processed by the same dialogue engine. Therefore, in an implementation manner of the embodiment of the present application, the classification tasks docked with applications can be regarded as tasks that require less accuracy in field identification: as long as a valid domain recognition result is generated in priority 2, the text will eventually be processed by the same dialogue engine and the processing result will not be affected.
• Priority 2 includes sub-domain classifier 21, sub-domain classifier 22, and sub-domain classifier 23, where sub-domain classifier 21 corresponds to the stock field, sub-domain classifier 22 to the translation field, and sub-domain classifier 23 to the calculation field; stock, translation, and calculation correspond to the same application and therefore to the same dialogue engine.
• In priority 2, regardless of whether the text is determined to belong to stock, translation, or calculation, the phone will push the text to the same dialogue engine for processing. It can be seen that whichever of these fields the domain recognition result obtained in priority 2 indicates, it does not affect the result of the subsequent dialogue engine's processing of the text.
• In other implementations, the sub-domain classifiers in priority 2 may also correspond to two or three dialogue engines, but there are often multiple sub-domain classifiers whose corresponding fields map to the same dialogue engine.
• The sub-domain classifiers corresponding to the phone's own classification tasks may include, but are not limited to, sub-domain classifiers for the mobile phone's own functions, for example, those corresponding to the setting, do-not-disturb, and gallery fields. The sub-domain classifiers corresponding to third-party docking classification tasks may include, but are not limited to, those for applications installed in the mobile phone, or programs such as mini-programs. The fallback sub-domain classifiers may include, but are not limited to, sub-domain classifiers for fields in which it is difficult to determine a domain identification result from keywords, for example, fields with a search function such as the encyclopedia.
• For example, the fields corresponding to the sub-domain classifiers in the classifier layer are as follows:
• Priority 1: sub-domain classifiers corresponding to the setting, DND, and gallery fields;
• Priority 2: sub-domain classifiers corresponding to the stock, translation, calculation, and weather fields;
• Priority 3: sub-domain classifier corresponding to the encyclopedia field.
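The grouping above could be captured in a configuration table like the following sketch; the domain names follow the fields listed earlier, while the table structure itself is an assumption for illustration.

```python
# Hypothetical priority configuration: priority level -> the domains
# whose sub-domain classifiers belong to that group.
PRIORITY_GROUPS = {
    1: ["setting", "noDisturb", "gallery"],             # phone's own tasks
    2: ["stock", "translate", "calculator", "weather"], # app-docking tasks
    3: ["baike"],                                       # fallback tasks
}

def domains_for_priority(level: int) -> list:
    """Return the domains handled at the given priority level."""
    return PRIORITY_GROUPS.get(level, [])
```

Keeping the mapping in data rather than code is what makes the adjustments described above (moving or swapping classifiers between priorities) cheap.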
  • the control layer does not get valid domain recognition results through text fast full precision matching.
  • the parallel processing of each sub-domain classifier in priority 1 results in an effective domain recognition result, that is, the field corresponding to the text is DND.
  • the system then returns the obtained domain identification result to the mobile phone.
  • the control layer does not obtain valid domain recognition results by text fast full precision matching.
  • the domain recognition result of each sub-domain classifier in priority 1 is obtained, and the obtained domain recognition result is other.
  • the text is handed over to the next priority sub-domain classifier for processing.
  • an effective domain recognition result is obtained, that is, the field corresponding to the text is a stock.
  • the system then returns the obtained domain identification result to the mobile phone.
• Consider the text corresponding to the voice input by the user, "look at the stock market". If the domain identification result obtained from the sub-domain classifiers in priority 2 could include both stocks and encyclopedia, or waver between stocks and encyclopedia, the domain recognition result returned to the mobile phone would have a large probability of error.
• It can be seen that the priority division of the sub-domain classifiers is very important. For fields that are ambiguous or difficult to distinguish, the corresponding sub-domain classifiers can be placed into lower-priority groups. In this way, once the high-priority groups obtain a valid domain identification result, the text need not be input into the low-priority groups for domain identification, which reduces the recognition pressure on the low-priority groups.
  • the control layer does not obtain valid domain recognition results through text fast full precision matching.
  • the domain recognition result of each sub-domain classifier in priority 1 is obtained, and the obtained domain recognition result is other.
  • the text is processed by the next-priority sub-domain classifier, that is, the text is subjected to parallel processing by each sub-domain classifier in the priority 2, and the obtained domain recognition result is still other.
  • the text is then processed by the next-priority sub-domain classifier, and processed in parallel by each sub-domain classifier in priority 3 to obtain an effective domain recognition result, that is, the field corresponding to the text is encyclopedia.
  • the system then returns the obtained domain identification result to the mobile phone.
• No valid domain identification result is obtained from the sub-domain classifiers in priority 2; the obtained domain recognition result is other, and the text is handed over to the sub-domain classifiers in priority 3.
  • the above domain identification process involves the controller layer and the processing of each sub-domain classifier in the classifier layer priority 1, priority 2, and priority 3.
  • this text is highly ambiguous and has a certain chance of being identified as stocks, encyclopedias and others.
• The field into which text most easily falls is placed in the lowest priority of the classifier layer, which can effectively reduce conflicts between the priority groups and relieve the recognition pressure on the higher-priority sub-domain classifiers.
  • the mobile phone can fully utilize the local user data, and the mobile phone can effectively perform domain identification when the mobile phone does not interact with the cloud.
  • the local user data refers to data stored locally in the mobile phone, for example, data stored in the mobile phone memory. This data includes, but is not limited to, the content contained in the various libraries involved in the system. It can be seen that the mobile phone saves the time spent on data interaction with the cloud, and in the domain identification process, multiple sub-domain classifiers of the same priority can simultaneously complete the identification operation, and can effectively save the domain identification process. Time spent.
  • the priority of the classifier layer may be divided according to the characteristics of different domain categories and the accuracy and performance of the corresponding model of each sub-domain classifier.
• Commonly used statements such as the above can be preset in the control layer's text full-precision matching, which effectively improves the processing efficiency of domain recognition and saves the time taken by the domain recognition process.
  • the sub-domain classifiers of different priorities can enter the multi-domain parallel recognition process sequentially in order of priority, thereby further improving the processing efficiency of the domain identification process and saving processing time. It should be noted that the above prioritization also makes effective use of sub-domain classifiers with poorer classification performance, that is, they are placed into the groups with lower priority.
  • the recognition ability of the sub-domain classifier affects the domain recognition result
  • the training of the sub-domain classifier affects the recognition ability of the sub-domain classifier. Therefore, the training of the sub-domain classifier is particularly important.
  • FIG. 6 is a flowchart of a method for training a sub-domain classifier in the case of a domain in which a known text belongs, according to an embodiment of the present application.
  • the method flow includes S201 to S208.
  • for example, the rule may be a sentence pattern of the form [^(search|check|see|tell|open).{1,12}(stock)$].
  • "^" is the starting character of the rule, indicating that "search", "check", "see", "tell" or "open" is the starting keyword, followed by 1 to 12 words, followed by the word "stock" as the ending keyword, and "$" is the terminator of the rule, indicating that the text ends with "stock".
  • the starting keyword means that the first word in the text is "search", "check", "see", "tell" or "open"; the ending keyword means that the last word in the text is "stock".
  • the start and end characters are used as optional symbols in the sentence pattern, which is not limited in the embodiments of the present application.
  • alternatively, the rule can be a sentence pattern of the form [(search|check|see|tell|open).{1,12}(stock)], that is, without the start and end characters.
  • in this case, the first word in the text is not necessarily "search", "check", "see", "tell" or "open"; it is only required that "search", "check", "see", "tell" or "open" appears somewhere in the text.
  • likewise, 1 to 12 words after "search", "check", "see", "tell" or "open" there is a "stock", and this "stock" is not necessarily the last word appearing in the text.
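The anchored and unanchored rule forms described above behave like ordinary regular expressions. The following is a minimal sketch, not the patented implementation: it assumes English keywords in place of the originals and treats `.{1,12}` as matching 1 to 12 characters.

```python
import re

# Anchored rule: the text must START with one of the keywords and END with
# "stock", with 1 to 12 characters in between, mirroring the
# [^(search|check|see|tell|open).{1,12}(stock)$] form described above.
anchored = re.compile(r"^(search|check|see|tell|open).{1,12}stock$")

# Unanchored rule: the keywords and "stock" may appear anywhere in the text.
unanchored = re.compile(r"(search|check|see|tell|open).{1,12}stock")

assert anchored.match("search Vanke stock")
assert not anchored.match("please search Vanke stock")  # prefix breaks the anchor
assert unanchored.search("please search Vanke stock today")
```

This illustrates why omitting "^" and "$" relaxes the rule: the unanchored form still fires when extra words surround the keywords.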
  • the field to which the text belongs can be directly determined, thereby determining the domain recognition result and returning.
  • the common feature refers to content that contributes to the value of the text when that value is calculated, but whose presence does not affect the domain to which the text belongs.
  • Public features including but not limited to words such as time, place, etc., can be preset.
  • the common feature may be replaced with a symbol or the like, which is not limited herein.
  • NER is performed on the text so that words such as the time and place in the text can be recognized; the recognized content is then used as a common feature, and the common feature is replaced with a preset symbol or the like.
  • feature extraction refers to extracting features from the replaced text according to the binary (bigram) method and the ternary (trigram) method.
  • for example, when feature extraction is performed on the replaced text according to the bigram method, multiple features of two units each are obtained, where a feature may be a combination of two words, a combination of one word and one symbol, or a combination of two symbols.
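Bigram extraction as described above can be sketched as follows; this is a simplified illustration that assumes the replaced text has already been split into word/symbol units (a character-level variant works the same way).

```python
def bigrams(units):
    """Binary (bigram) extraction: every two adjacent units form one feature.

    A unit is one word or one replacement symbol such as "#", so a feature can
    pair two words, a word and a symbol, or two symbols.
    """
    return [units[i] + " " + units[i + 1] for i in range(len(units) - 1)]

# Replaced text "search # photo", pre-split into units.
feats = bigrams(["search", "#", "photo"])
assert feats == ["search #", "# photo"]
```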
  • the manner of calculating the weight according to the feature may refer to an implementation manner of the prior art such as the binary method and the ternary method.
  • the values corresponding to the features obtained by the bigram method are input into the model, the model performs a calculation using an algorithm such as Linear Regression (LR), and the output is one weight per feature, that is, each feature corresponds to a weight.
  • the values corresponding to different features are different, and may be preset.
  • the specific setting manner is not limited herein. In the embodiment of the present application, for the manner of the model calculation, reference may be made to the algorithm provided in the prior art, for example, the foregoing LR algorithm, and details are not described herein.
  • the value of the replaced text is calculated based on the weight of the feature.
  • for example, the weights of all the features in the text may be summed to obtain the value corresponding to the text, or the weight of each feature may first be processed and then summed to obtain the value corresponding to the text, which is not limited herein.
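The simplest of the value calculations above (summing the feature weights and comparing against a threshold) can be sketched as below; the weight values and the threshold are hypothetical placeholders, not trained values.

```python
def text_score(features, weights):
    """Value of a text = sum of the weights of its features; a feature absent
    from the weight table contributes nothing (weight null / 0)."""
    return sum(weights.get(f, 0.0) for f in features)

# Hypothetical weights; in the embodiment they come from model (LR) training.
weights = {"query Vanke": 0.75, "Vanke stock": 1.0}
THRESHOLD = 1.5  # hypothetical decision threshold

score = text_score(["query Vanke", "Vanke stock"], weights)
assert score == 1.75 and score > THRESHOLD  # text assigned to this sub-domain
```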
  • the sub-domain classifier can be adjusted according to the recognition result obtained by the sub-domain classifier and the field to which the text actually belongs.
  • the manner of adjusting the sub-domain classifier includes, but is not limited to, adjusting the positive and negative samples in the sub-domain classifier. It should be noted that adjusting the positive and negative samples will affect the weight of the feature, and ultimately affect the value corresponding to the calculated text, thereby affecting the domain identification result.
  • the phone can continue to use the same text and pass it through the same sub-domain classifier again until the correct domain identification result is obtained. That is, in the training process of the sub-domain classifier, the contents shown in S201 to S208 above are repeated until the training objective is reached.
  • FIG. 7 is a flowchart of a training method for adjusting positive and negative samples of a sub-domain classifier according to an embodiment of the present application.
  • the method flow includes S301 to S310.
  • each sub-domain classifier may have its own independent positive and negative samples, wherein the positive and negative samples include a positive example training sample set and a negative example training sample set.
  • the samples in the positive training sample set are samples belonging to the corresponding domain of the sub-domain classifier, and the samples in the negative training sample set are samples that do not belong to the corresponding domain of the sub-domain classifier.
  • for example, the text content of the positive and negative samples is "search for a photo of Tiananmen".
  • "Tiananmen" is recognized, and rule extraction yields a rule of the form [^(...).{1,10}(...)$].
  • here, [^(search).{1,10}(photo)$] is taken as the rule.
  • a place name such as Tiananmen may be predefined to be replaced with #, after which the text content after common-feature replacement is "search for a photo of #".
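Common-feature replacement can be sketched as below. The entity lists and the "%" symbol for times are illustrative assumptions; the embodiment fixes only "#" for place names and leaves other symbols open, and a real system would use an NER model rather than lookup lists.

```python
# Hypothetical entity lists; in the embodiment NER identifies these spans.
PLACES = ["Tiananmen", "Beijing"]
TIMES = ["yesterday", "today"]

def replace_common_features(text):
    """Replace recognized common features with preset symbols so that they no
    longer influence which domain the text belongs to."""
    for place in PLACES:
        text = text.replace(place, "#")
    for time_word in TIMES:
        text = text.replace(time_word, "%")  # "%" for times is an assumption
    return text

assert replace_common_features("search photo of Tiananmen") == "search photo of #"
```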
  • S302 and S303 may refer to the descriptions of S202 to S205 above, and details are not described herein.
  • the NER performed on the positive and negative samples can be the premise of rule extraction and common feature replacement. That is, NER identifies the place, time, sentence pattern, and the like in the positive and negative samples; the sentence pattern is then taken as a rule, the time, place, and the like are taken as common features, and the replacement between the common features and the symbols is completed.
  • a stop word refers to a word, character or symbol that plays no decisive role in domain identification but whose presence can often affect the accuracy of the domain recognition result, for example, ";", ",", and the like. These stop words are identified and then ignored in the domain identification process.
  • the training corpus feature library is used to record the correspondence between the calculated features and the weights in S206.
  • S308 and S309 are similar in purpose to S208.
  • modifying the positive and negative samples may be an exemplary implementation of S208.
  • for example, the training samples and the domain recognition results obtained after processing by the system include the following contents:
  • the first-round processing results of training sample 1 and training sample 2 are as follows:
  • "Straight Flush" is not only the name of a listed company, but also the name of a certain application, where the application is used for stock trading.
  • in training sample 1, the user tries to query the stock of Straight Flush; in training sample 2, the user wants to open the application named Straight Flush and conduct stock trading. Therefore, the domain recognition result obtained for training sample 1 is accurate, and the domain recognition result obtained for training sample 2 is wrong.
  • the text obtained by converting the voice command input by the user is divided according to the bigram method, and multiple features of two words each are obtained.
  • each feature and the weight corresponding to each feature are as follows:
  • each feature and the weight corresponding to each feature are as follows:
  • weight of the feature can be positive, negative or zero.
  • the greater the weight of a feature, the greater its contribution to the text being identified as belonging to the sub-domain (in this example, the "stock" field).
  • when the sum of the weights of all the features in the text is greater than or equal to 1.5, the text is confirmed to belong to the stock field; when the sum is less than 1.5, it is confirmed that the text does not belong to the stock field. Since the weight of each feature involved in training sample 1 is a positive number, the correct domain recognition result can be calculated according to the weights, that is, the field is stock. In training sample 2, the absolute values of the negative weights of the features derived from "Straight Flush" combined with "trading" are too small, so that after the weights of the features of training sample 2 are summed, a positive number greater than the threshold is still obtained, and the text is therefore still misidentified as belonging to the stock field.
  • the system adjusts the positive and negative samples according to the fact that training sample 1 was identified correctly and training sample 2 was identified incorrectly: the content containing [Straight Flush] is deleted from the positive samples, and [Straight Flush] and [stock trading] are added to the negative samples.
  • in general, the weights of the features involved in content added to the positive samples increase, and the weights of the features involved in content deleted from the positive samples decrease; similarly, the weights of the features involved in content added to the negative samples decrease, and the weights of the features involved in content deleted from the negative samples increase.
  • in this example, however, the weights corresponding to the features derived from "stock trading" are not affected.
  • the reason they are unaffected may be that the amount of content containing [stock trading] in the positive and negative samples is already large, so that a small number of newly added negative samples has little influence on the negative samples; for example,
  • if the number of samples containing [stock trading] in the positive samples is 20,000 and the number of samples containing [stock trading] in the negative samples is 10,000, adding one negative sample containing [stock trading] will not have a noticeable impact on such a huge amount of data.
  • each feature and the weight corresponding to each feature in the training sample 1 are as follows:
  • each feature and the weight corresponding to each feature are as follows:
  • the change of the positive and negative samples leads to a change in the weights of some or all of the features, and therefore the processing result is affected to some extent. That is, the values of the text in training sample 1 and training sample 2 are now both less than 1.5, which means that both training samples are identified as not belonging to the stock field. It should be noted that, for the same feature, when the feature appears in the positive samples, the weight corresponding to the feature is larger; when the feature appears in the negative samples, the weight corresponding to the feature is smaller; when the feature appears in both the positive samples and the negative samples, the weight of the feature is weighed according to the number of positive and negative samples containing the feature.
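One simple way to realize this trade-off between positive and negative sample counts, shown only as an illustration (the embodiment does not prescribe a formula), is a smoothed log-odds weight:

```python
import math

def feature_weight(pos_count, neg_count, smoothing=1.0):
    """Smoothed log-odds: positive when the feature occurs mostly in positive
    samples, negative when it occurs mostly in negative samples."""
    return math.log((pos_count + smoothing) / (neg_count + smoothing))

# Before adjustment the feature occurs mostly in positive samples...
w_before = feature_weight(pos_count=100, neg_count=10)
# ...after moving content to the negative samples, its weight turns negative.
w_after = feature_weight(pos_count=20, neg_count=90)
assert w_before > 0 > w_after
```

The counts here are invented; the point is only that deleting a feature's content from the positive samples and adding it to the negative samples drives its weight down, exactly the effect the adjustment above relies on.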
  • the second round of processing results of the training sample 1 and the training sample 2 is as follows:
  • this time, the domain identification result obtained for training sample 1 is wrong, and the domain recognition result obtained for training sample 2 is correct.
  • the correct domain identification result corresponding to a training sample can be input together with the training sample, so that the mobile phone can automatically adjust the positive and negative samples according to the known correct domain and the output domain identification result; alternatively, after the domain identification result is output, a human judges whether the recognition result is correct and, if the result is wrong, triggers the mobile phone to automatically adjust the positive and negative samples.
  • the system will automatically adjust the positive and negative samples again, that is, the system automatically adjusts the positive and negative samples for the second time.
  • for example, the system re-adjusts the [Straight Flush] content in the positive samples on the basis of the first adjustment of the positive and negative samples, for example, by increasing the amount of positive-sample content containing [Straight Flush]. In this way, the weights corresponding to the features derived from "Straight Flush" can be effectively increased.
  • each feature and the weight corresponding to each feature are as follows:
  • each feature and the weight corresponding to each feature are as follows:
  • the third round of processing results of the training sample 1 and training sample 2 is as follows:
  • the system combines the correctness or incorrectness of each round of domain identification results and adjusts the positive and negative samples until both training sample 1 and training sample 2 obtain the correct domain identification result. It can be seen that the greater the number of training samples, the higher the accuracy of the adjusted positive and negative sample sets.
  • the domain identification result may be determined by recognizing a sentence pattern, or the domain identification process may be simplified by replacing interference terms. This not only improves the accuracy of the domain identification process, but also further saves the time taken by the domain identification process.
  • for example, a rule based on a sentence pattern may be set in advance for use in the full-precision text matching process of the control layer.
  • the text content obtained by the speech recognition of the examples 1 to 3 and the domain recognition result obtained by the domain identification by the system are as follows:
  • the text corresponding to the voice command input by the user is "search for the picture taken in Beijing yesterday".
  • the sentence pattern can be pre-set to "query", so that even when the user's input contains an error or part of the speech recognition result is missing, as long as the text includes the sentence pattern, the system can accurately identify it and, according to the sentence pattern, determine the domain described by the text, thereby obtaining an accurate domain recognition result.
  • for example, the rule can be pre-set to [^(search|check|see|tell|open).{1,12}(stock)$]; reference may be made to the above description, and details are not repeated here.
  • the classifier layer is still required to process text that cannot be matched by the above rules.
  • for example, "search ... picture" can be used as a pattern.
  • the rules involved in the sub-domain classifier corresponding to the gallery field may include this pattern, which means that, in the process of text recognition, when the sub-domain classifier recognizes the pattern it can effectively feed back the domain recognition result, that is, the domain of the text is the gallery field.
  • a common feature can be set in advance for the system to prevent the problem of inaccurate domain identification caused by the common feature.
  • the system can then call NER to extract the named-entity information in the text as common features. For example, [Nanjing Port] is defined as a "general company name" entity and the entity is replaced with #; [Jiangtong CWB1] is defined as a "listed company name code" entity and the entity is replaced with @. The content of the text then becomes "Query the # and @".
  • each feature in Example 2 and the weight corresponding to each feature are as follows:
  • here, null indicates that the feature "# and" has no effect on the domain recognition result, that is, the feature has a weight of zero.
  • each feature in Example 3 and the weight corresponding to each feature are as follows:
  • the feature "+" indicates that the replaced text satisfies the pattern defined in the sub-domain classifier. Therefore, when the value corresponding to the text is calculated and the replaced text satisfies the pattern defined in the sub-domain classifier, the corresponding weight is added, which improves the accuracy of domain identification.
  • the content that the mobile phone will obtain after the speech recognition is "the font is adjusted a little larger", and the domain identification is performed, and the obtained domain recognition result is that the text belongs to the setting field.
  • the phone then sends the text to the dialog engine corresponding to the set field for processing. It should be noted that when the mobile phone gives the user a response, the mobile phone has increased the font size at the request of the user.
  • the text corresponding to the voice command input by the user is "Please help me set do-not-disturb from 2 to 3 o'clock this afternoon, except Lao Wang".
  • the content obtained by the mobile phone after voice recognition is the text "Please help me set do-not-disturb from 2 to 3 o'clock this afternoon, except Lao Wang"; domain identification is performed, and the domain recognition result is that the text belongs to the do-not-disturb field.
  • the phone then sends the text to the dialog engine corresponding to the do-not-disturb field for processing. It should be noted that when the mobile phone gives the user a response, it has already set the do-not-disturb period as the user requested, and ensures that the user is still prompted for calls from Lao Wang during the do-not-disturb period.
  • the content that the mobile phone will receive after speech recognition is “Please help me Baidu Fan Bingbing's picture”, and the domain identification is obtained.
  • the obtained domain recognition result is that the text belongs to the gallery field.
  • the phone then sends the text to the dialog engine corresponding to the gallery field for processing. It should be noted that when the mobile phone gives the user a response, the mobile phone has completed the image search at the request of the user, that is, the relevant photo has been presented to the user through Baidu.
  • the content obtained by the mobile phone after speech recognition is the text "how to say chopsticks in English"; domain identification is performed, and the obtained domain recognition result is that the text belongs to the translation field.
  • the phone then sends the text to the dialog engine corresponding to the translation domain for processing. It should be noted that when the mobile phone gives the user a response, the mobile phone has completed the translation of the word "chopsticks" according to the user's needs.
  • the content that the mobile phone will receive after the speech recognition is "how is the weather today", and the domain identification is performed, and the obtained domain recognition result is that the text belongs to the weather field.
  • the phone then sends the text to the dialog engine corresponding to the weather domain for processing. It should be noted that when the mobile phone gives the user a response, the mobile phone has determined the weather condition of the geographical location in combination with the current geographic location of the user.
  • Vanke A is now 39.42 yuan, down 0.86%, has been closed
  • the content that the mobile phone will receive after speech recognition is the text of "Vanke's stock", and the domain identification is performed.
  • the obtained domain recognition result is that the text belongs to the stock field.
  • the phone then sends the text to the dialogue engine corresponding to the stock field for processing. It should be noted that when the mobile phone gives the user a response, the mobile phone has determined the stock situation that the user asks for.
  • the content obtained by the mobile phone after speech recognition is "2 of the 13th power equals how much" text, and the domain identification is performed, and the obtained domain recognition result is that the text belongs to the calculation field.
  • the phone then sends the text to the dialog engine corresponding to the computing domain for processing. It should be noted that when the mobile phone gives the user a response, the mobile phone has determined the calculation result that the user desires by calculation.
  • Yao Ming, born on September 12, 1980 in Xuhui District, Shanghai, with ancestral roots in Zhenze Town, Wujiang District, Suzhou City, Jiangsu Province, is a former Chinese professional basketball player who played center. He is currently the chairman of the Chinese Basketball Association. In April 1998, Yao Ming was selected for the national team coached by Wang Fei and began his basketball career. In 2001, he won the CBA regular season MVP. In 2002, he won the CBA championship and the finals MVP. He was elected CBA rebounding champion and blocks champion three times, and was elected CBA dunk king twice.
  • the content that the mobile phone will receive after the speech recognition is the text of "Yao Ming's height", and the domain identification is performed, and the obtained domain recognition result is that the text belongs to the encyclopedic field.
  • the mobile phone can then search for the extracted keywords in the encyclopedia and present the search results to the user. At the same time, the mobile phone can selectively present the searched related content to the user. It should be noted that when the mobile phone gives the user a response, the mobile phone has searched Yao Ming's height and other related information.
  • the manner of the response in the foregoing examples includes, but is not limited to, a manner of prompting a text or a voice prompt, which is not limited herein.
  • the embodiments of the present application may divide the functional modules of the electronic device according to the foregoing method embodiments.
  • each functional module may be divided according to each function, or two or more functions may be integrated into one processing module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules. It should be noted that the division of the module in the embodiment of the present application is schematic, and is only a logical function division, and the actual implementation may have another division manner.
  • the apparatus 400 for performing speech recognition by the electronic device includes a receiving module 401, a conversion module 402, a first domain identification module 403, a processing module 404, a second domain identification module 405, a control module 406, and a sub-domain classifier 407.
  • the sub-domain classifier 407 includes a named entity identification module 4071, a replacement module 4072, an extraction module 4073, a calculation module 4074, and a domain determination module 4075. It should be noted that at least one sub-domain classifier 407 is included in the electronic device 400, which is not limited herein.
  • the receiving module 401 is configured to support the electronic device 400 in receiving a voice instruction, for example, the voice command input by the user through the electronic device and corresponding to the text, that is, the voice input as shown in FIG.
  • the conversion module 402 is configured to support the electronic device 400 to convert the voice instruction into text, for example, converting the input voice into text by voice recognition as shown in FIG.
  • the first domain identification module 403 is configured to support the electronic device 400 to identify the text by using at least two sub-domain classifiers to obtain a domain identification result. For example, the sub-domain classifier in each priority level (ie, the sub-domain classifier group) in the classifier layer shown in FIG.
  • the processing module 404 is configured to support the electronic device 400 to process text through a dialog engine corresponding to the domain to which the text belongs to determine functions that the electronic device corresponding to the text needs to perform, and other processes for supporting the electronic device 400 to implement the techniques described herein. Wait.
  • the second domain identification module 405 is configured to support the electronic device 400 in matching the text against pre-stored text, for example, the fast full-precision matching of the text in the control layer as shown in FIG. 4; when the text and the pre-stored text are successfully matched, the domain corresponding to the pre-stored text is determined to be the domain recognition result of the text; when the matching of the text and the pre-stored text fails, the text is input to the classifier layer, and the first domain identification module 403 performs domain identification by using at least two sub-domain classifiers to obtain the domain identification result.
  • the first domain identification module includes N sub-domain classifier groups, wherein each group has a different priority, and N is a positive integer greater than or equal to 2. At least one of the N sub-domain classifier groups includes at least two sub-domain classifiers. Each sub-domain classifier is used to confirm whether the text belongs to the domain corresponding to the sub-domain classifier.
  • the control module 406 is configured to support the electronic device 400 to control the sub-domain classifier in the highest priority group of the N sub-domain classifier groups to perform domain identification on the text. For example, as shown in FIG. 4, the highest priority group in the classifier layer is controlled. , that is, the sub-domain classifier in priority 1 performs domain identification on the text.
  • if the sub-domain classifier in the highest priority group identifies the domain to which the text belongs, the domain identified by the sub-domain classifier in the highest priority group is taken as the domain identification result; if the sub-domain classifier in the highest priority group does not recognize the domain to which the text belongs, domain identification is performed by the sub-domain classifiers in the next priority group of the N sub-domain classifier groups. For example, as shown in FIG. 4, if the text does not obtain a domain identification result after domain identification by the sub-domain classifiers in priority 1, the sub-domain classifiers in priority 2 perform domain identification on the text, and so on, until the domain to which the text belongs is identified, and the identified domain is taken as the domain identification result.
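The priority-cascade control described above can be sketched as follows; the keyword functions are trivial stand-ins for trained sub-domain classifiers, and are assumptions for illustration only.

```python
def identify_domain(text, classifier_groups):
    """Run sub-domain classifier groups in priority order.

    classifier_groups: list of groups, highest priority first; each group is a
    list of callables returning a domain name or None. Classifiers inside one
    group could run in parallel; here they run sequentially for simplicity.
    """
    for group in classifier_groups:
        hits = [h for h in (clf(text) for clf in group) if h is not None]
        if hits:
            return hits      # one or more domains recognized at this level
    return ["other"]         # no group recognized the text

def stock(t):                # trivial keyword stand-in for a trained classifier
    return "stock" if "stock" in t else None

def gallery(t):
    return "gallery" if "photo" in t else None

assert identify_domain("Vanke stock", [[gallery], [stock]]) == ["stock"]
assert identify_domain("hello", [[gallery], [stock]]) == ["other"]
```

Returning the whole `hits` list mirrors the case where several classifiers in one priority group each produce a valid result and the control module keeps at least one of them.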
  • the control module 406 is further configured to: when the first sub-domain classifier performs domain identification on the text to obtain the first domain identification result, and the second sub-domain classifier obtains the second domain identification result after performing domain identification on the text, determining the first At least one of the domain identification result and the second domain identification result is a domain identification result, or determining that the first domain recognition result and the second domain recognition result are both domain recognition results. Taking the priority 1 shown in FIG. 4 as an example, when the sub-domain classifier 11 obtains the first domain identification result and the sub-domain classifier 12 obtains the second domain identification result, the control module 406 performs the above process.
  • the control module 406 determines that at least one of the first domain identification result, the second domain identification result, and the third domain identification result is the domain recognition result of the text.
  • the named entity identification module 4071 is used to support the electronic device 400 in performing NER on the text and determining the common features in the identified content.
  • the replacement module 4072 is configured to support the electronic device 400 in replacing the common features in the text according to a preset rule.
  • the extraction module 4073 is configured to support the electronic device 400 in performing feature extraction on the replaced text and determining the weight of each feature.
  • the calculation module 4074 is configured to support the electronic device 400 to calculate the value of the text according to the weight of each feature.
  • the domain determining module 4075 is configured to support the electronic device 400 to determine that the text belongs to a domain corresponding to the sub-domain classifier when the value of the text is greater than a threshold.
  • the sub-domain classifier 407 may be any sub-domain classifier involved in the classifier layer as shown in FIG. 4.
  • the electronic device 400 may further include at least one of a storage module 408, a communication module 409, and a display module 410.
  • the storage module 408 is configured to support the electronic device 400 to store program codes and data of the electronic device;
  • the communication module 409 can support data interaction between the various modules in the electronic device 400, and/or support the electronic device 400 and such as a server, other electronic The communication between the devices and the like;
  • the display module 410 can support the electronic device 400 to present the processing result of the voice instruction to the user through text, graphics, or the like, or selectively present the voice recognition process to the user during the voice recognition process. It is not limited here.
  • the receiving module 401 and the communication module 409 can be implemented as a transceiver; the conversion module 402, the first domain identification module 403, the processing module 404, the second domain identification module 405, the control module 406, and the sub-domain classifier 407 can be implemented as a processor.
  • the storage module 408 can be implemented as a memory; the display module 410 can be implemented as a display.
  • the processor may also be a controller, such as a CPU, a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and it can implement or carry out the various illustrative logical blocks, modules and circuits described in connection with the present disclosure.
  • the processor may also be a combination of computing functions, for example, including one or more microprocessor combinations, a combination of a DSP and a microprocessor, and the like.
  • the above transceiver can also be implemented as a transceiver circuit or a communication interface.
  • the electronic device 50 can include a processor 51, a transceiver 52, a memory 53, a display 54, and a bus 55.
  • the transceiver 52, the memory 53 and the display 54 are optional components, that is, the electronic device 50 may include one or more of the above optional components.
  • the processor 51, the transceiver 52, the memory 53, and the display 54 are connected to each other through a bus 55; the bus 55 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is shown in Figure 9, but it does not mean that there is only one bus or one type of bus.
  • the steps of a method or algorithm described in connection with the present disclosure may be implemented in hardware, or may be implemented by a processor executing software instructions.
  • the software instructions may be composed of corresponding software modules, which may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a register, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium well known in the art.
  • An exemplary storage medium is coupled to the processor to enable the processor to read information from, and write information to, the storage medium.
  • Embodiments of the present application provide a readable storage medium, including instructions.
  • When the instruction is run on the electronic device, the electronic device is caused to perform the method described above.
  • An embodiment of the present application provides a computer program product, the computer program product comprising software code for performing the method described above.


Abstract

一种电子设备进行语音识别方法及电子设备,涉及终端技术领域,能够提升终端在本地进行语音指令识别时的灵活性。方法包括:将接收的语音指令转换为文本,之后通过至少两个子领域分类器对文本进行领域识别,得到领域识别结果,其中,领域识别结果用于表示文本所属的领域,再通过文本所属的领域对应的对话引擎对文本进行处理,确定文本对应的电子设备需要执行的功能。适用于语音识别过程。

Description

一种电子设备进行语音识别方法及电子设备 技术领域
本申请涉及终端技术领域,尤其涉及一种电子设备进行语音识别方法及电子设备。
背景技术
随着终端技术的发展,尤其是语音识别技术的普及,目前,用户可以通过向终端输入语音指令以调用终端执行相应功能。以上述终端为手机为例,用户可以通过手机输入一段语音,之后手机将这一段语音发送至云端,以使云端将这段语音转成文本,并对该文本进行处理,得到处理结果。之后云端将处理结果返回至手机,以使手机根据该处理结果执行与该处理结果匹配的功能。
由此可见,上述实现过程主要依赖于云端的处理能力。也就意味着,对于终端无法与云端实现数据交互的情况而言,终端难以依据输入的语音指令执行相应功能。为了解决上述问题,目前,终端中增加了对语音指令进行识别、处理的功能,在终端通过语音识别技术将语音转换成文本后,终端可以通过模板匹配的方式对该文本进行处理,以确定终端需要调用的功能,即上述处理结果。其中,模板匹配指的是,终端将得到的文本与已有模板进行匹配,并确定出能够完全匹配该文本的模板。之后终端可以根据模板与功能之间的对应关系,确定与该模板对应的功能,并由终端执行该功能。
但对于上述实现方式而言,需要确保得到的文本与模板完全匹配。比如,模板规定了文本的结构为“时间+地点+做什么”,那么在该文本的结构满足“时间+地点+做什么”的结构时,终端才能确定该文本与模板匹配。对于该文本的结构为“地点+时间+做什么”的结构时,由于模板结构与文本结构无法完全匹配,导致终端因无法找到与该文本匹配的模板,而无法确定与该文本匹配的功能,也就导致用户无法通过输入语音指令的方式调用终端执行该功能。
发明内容
本申请实施例提供一种电子设备进行语音识别方法及电子设备,以提升终端在本地进行语音指令识别时的灵活性。
为达到上述目的,本申请实施例采用如下技术方案:
第一方面,本申请实施例提供一种电子设备进行语音识别方法。该方法包括:将接收的语音指令转换为文本。之后通过至少两个子领域分类器对文本进行领域识别,得到领域识别结果。其中,领域识别结果用于表示文本所属的领域。再通过文本所属的领域对应的对话引擎对文本进行处理,确定文本对应的电子设备需要执行的功能。采用上述方式实现语音识别过程,能够有效对文本进行领域的区分,之后更有针对性的完成基于领域的文本识别过程,从而确定电子设备需要执行的功能,加强了语音识别的准确性。并且,上述实现过程可以在电子设备的本地进行。也就意味着,即便是在电子设备无法接入网络的过程中,也能够在不借助云端处理能力的基础上,实现针对语音指令的识别,从而增加了语音识别的灵活性。
在一种示例性的实现方式中,在将语音指令转换为文本之后,可以将文本与预存文本进行匹配。当文本与预存文本匹配成功时,确定预存文本对应的领域为文本的领域识别结果。在上述实现方式中,预先匹配能够减少后续通过子领域分类器进行领域识别所消耗的资源。上述匹配过程可以对文本进行初步筛选,若转换后的文本符合常用句式,那么可以直接基于已有的预存文本和领域之间的对应关系,在不需要子领域分类器参与的情况下,准确识别出该文本所属的领域,从而完成基于语音指令的领域识别过程。
在一种示例性的实现方式中,通过至少两个子领域分类器对文本进行并行领域识别,得到领域识别结果,具体可以实现为:当文本与预存文本匹配失败时,通过至少两个子领域分类器对文本进行并行领域识别,得到领域识别结果。考虑到文本也会存在不符合常用句式的情况,那么在该文本经过初步筛选后,就可以由子领域分类器对该文本进行领域识别。需要说明的是,子领域分类器对文本进行领域识别的过程可以实现为多个子领域分类器对文本进行并行的领域识别,即至少存在两个子领域分类器同时对文本进行领域识别,以节省领域识别所占用的时间。
在一种示例性的实现方式中,电子设备包括N个子领域分类器组,其中,每个组有不同的优先级,N为大于或等于2的正整数。通过至少两个子领域分类器对文本进行并行领域识别,得到领域识别结果,可以具体实现为:通过N个子领域分类器组中最高优先级组中的子领域分类器对文本进行领域识别。若最高优先级组中的子领域分类器识别出文本所属的领域,则将最高优先级组中的子领域分类器识别出文本所属的领域作为领域识别结果;若最高优先级组中的子领域分类器未识别出文本所属的领域,则通过N个子领域分类器组中下一优先级组中的子领域分类器对文本进行领域识别,直至:识别出文本所属的领域,并将识别出的领域作为领域识别结果;或文本已经过N个子领域分类器组中所有子领域分类器进行领域识别。其中,N个子领域分类器组中的至少一个组中包括至少两个子领域分类器。
在上述实现过程中,各个优先级组中的子领域分类器是按照一定先后顺序对文本进行识别的。上述实现过程中,一旦经过某一优先级组中的子领域分类器的领域识别后得到领域识别结果,可以将得到的领域识别结果返回,而无需将文本交由下一级优先级组中的子领域分类器进行领域识别,从而在能够确保得到准确的领域识别结果的基础上,动用较少的子领域分类器。
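上述按优先级分组、逐组调度并在得到结果后立即返回的流程,可用如下 Python 代码示意。其中,子领域分类器以"返回领域名或 None 的函数"表示,分组与判定逻辑均为为说明而假设的示例,并非对实现方式的限定:

```python
def classify_cascade(text, priority_groups):
    """按优先级从高到低依次调度各子领域分类器组:
    某一组识别出文本所属的领域时立即返回,不再交给下一优先级组。"""
    for group in priority_groups:
        results = [clf(text) for clf in group]       # 组内各子领域分类器的识别结果
        hits = [r for r in results if r is not None]
        if hits:
            return hits                              # 该组得到的领域识别结果
    return None                                      # 所有组均未识别出所属领域

# 示例:优先级1与优先级2两个子领域分类器组(判定条件为假设)
priority1 = [lambda t: "setting" if "设置" in t else None,
             lambda t: "nodisturb" if "免打扰" in t else None]
priority2 = [lambda t: "stock" if "股" in t else None]
```

例如,文本"看一下股市"在优先级1中未被识别,继而由优先级2识别为 stock;而一旦优先级1识别出领域,优先级2不会被调用。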
在一种示例性的实现方式中,N个子领域分类器组的至少一个组中的至少两个子领域分类器对文本并行进行领域识别。在本申请实施例的一种示例性的实现方式中,各个优先级组中不一定全部都包括多个子领域分类器,即至少存在一个优先级组中包括多个子领域分类器即可。需要说明的是,并行对文本进行领域识别的子领域分类器的数量越多,得出的领域识别结果越准确。
在一种示例性的实现方式中,N个子领域分类器组中,低优先级组中的子领域分类器的领域识别准确率低于高优先级组中的子领域分类器的领域识别准确率。由于高优先级组中的子领域分类器进行领域识别的准确率,高于低优先级组中子领域分类器进行领域识别的准确率。因此,上述层层递进的领域识别过程,能够有效降低领域识别准确率较低的子领域分类器的工作压力,且进一步提升了领域识别整体过程的准确 率。
在一种示例性的实现方式中,N个子领域分类器组中的至少一个组包括第一子领域分类器和第二子领域分类器。当第一子领域分类器对文本进行领域识别后得到第一领域识别结果,且第二子领域分类器对文本进行领域识别后得到第二领域识别结果时,确定第一领域识别结果和第二领域识别结果中的至少一项为领域识别结果;或确定第一领域识别结果和第二领域识别结果均为领域识别结果。由此可见,对于同一优先级组中多个子领域分类器均得到领域识别结果的情况而言,可以基于预先设置的规则或是已经配置好的汇总决策方式,对领域识别结果进行选择,比如,选其中一个领域识别结果作为最终的领域识别结果,或是选择其中的多个或是全部的领域识别结果作为最终的领域识别结果,在此对于规则或是决策的方式不予限定。
在一种示例性的实现方式中,至少两个子领域分类器中的每一个对文本进行领域识别,可以实现为:对文本进行命名实体识别NER,并确定识别出的内容中的公用特征。之后按照预设规则,将公用特征进行替换。其中,预设规则包括不同类别的公用特征对应的替换内容。再对完成替换的文本进行特征提取,并确定每个特征的权重,根据每个特征的权重,计算文本的值。并且,当文本的值大于阈值时,确定文本属于本子领域分类器对应的领域。需要说明的是,采用公用特征替换的方式,能够减少计算文本的值时所占用的计算资源,且能够有效减少公用特征对领域识别过程产生的影响,从而提升对文本进行领域识别的准确率。
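本段所述"公用特征替换 → 特征提取 → 权重求和 → 阈值判定"的流程,可用如下代码示意。其中,NER 以简单的词表替换代替,替换规则、特征权重与阈值均为为说明而假设的数据:

```python
def score_text(text, replacements, weights, threshold):
    """按预设规则替换公用特征后,以二元法提取特征并对权重求和,
    文本的值大于或等于阈值时,判定文本属于本子领域分类器对应的领域。"""
    for word, symbol in replacements.items():
        text = text.replace(word, symbol)                      # 公用特征替换,如地名 -> '#'
    features = [text[i:i + 2] for i in range(len(text) - 1)]   # 二元法提取特征
    value = sum(weights.get(f, 0.0) for f in features)         # 未登录特征的权重记为 0
    return value, value >= threshold

replacements = {"天安门": "#"}                                 # 假设的公用特征替换规则
weights = {"搜#": 0.6, "#的": 0.2, "的照": 0.5, "照片": 0.4}   # 假设的特征权重
```

例如,文本"搜天安门的照片"经替换后为"搜#的照片",其二元特征的权重之和与阈值比较即得到领域判定。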
在一种示例性的实现方式中,在将文本通过至少两个子领域分类器对文本进行并行领域识别之前,可以预先对至少两个子领域分类器进行训练。其中,对每个子领域分类器进行训练的过程如下:
生成子领域分类器的正负样本。需要说明的是,每个子领域分类器均可以有自己独立的正负样本,其中,正负样本包括正例训练样本集合和负例训练样本集合。正例训练样本集合中的样本为属于该子领域分类器对应领域的样本,负例训练样本集合中的样本为不属于该子领域分类器对应领域的样本。
对正负样本进行NER和规则提取,并对经NER处理过的正负样本进行公用特征替换。其中,公用特征指的是在计算文本的值时,会对该值产生影响的内容,但该特征的存在并不会对文本所属的领域产生影响,在本申请实施例的一种实现方式中,公用特征包括但不限于时间、地点等词语,可以预先设置。在本申请实施例中,可以将公用特征替换为符号等,在此不予限定。规则包括但不限于诸如“搜索……的图片”等句式。需要说明的是,对正负样本进行NER,可以为规则提取和公用特征替换的前提。即通过NER识别出正负样本中的地点、时间、句式等,之后将句式作为规则,将时间、地点等作为公用特征,并完成公用特征与符号之间的替换。
停用词等去噪。也就意味着,在对子领域分类器进行训练的过程中,对于正负样本而言,为了降低正负样本中诸如“啊”、“呀”等语气词以及“;”、“、”等符号对识别过程产生的干扰,需要将这些停用词识别出来,并在领域识别过程中忽略这些停用词。
提取特征生成训练语料特征库,并根据权重计算文本对应的值。其中,训练语料特征库用于存储特征与权重之间的对应关系。
子领域分类器训练,并对错误领域识别结果的影响评估,之后修改正负样本。
上述训练过程能够动态调整正负样本的分布情况,从而提升子领域分类器的识别准确率。
第二方面,本申请实施例提供一种电子设备。该电子设备可以实现上述方法实施例中所实现的功能,所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个上述功能相应的模块。
第三方面,本申请实施例提供一种电子设备。该电子设备的结构中包括存储器,一个或多个处理器。其中,存储器用于存储计算机程序代码,该计算机程序代码包括计算机指令。上述一个或多个处理器在读取并执行所述计算机指令的过程中,使得该电子设备实现第一方面及其各种示例性的实现方式任一项所述的方法。
第四方面,本申请实施例提供一种可读存储介质,包括指令。当该指令在电子设备上运行时,使得该电子设备执行上述第一方面及其各种示例性的实现方式任一项所述的方法。
第五方面,本申请实施例提供一种计算机程序产品,该计算机程序产品包括软件代码,该软件代码用于执行上述第一方面及其各种示例性的实现方式任一项所述的方法。
附图说明
图1为本申请实施例提供的一种终端的结构示意图;
图2为本申请实施例提供的一种示例性的方法流程示意图;
图3为本申请实施例提供的一种示例性的手机处理语音指令的方法流程图;
图4为本申请实施例提供的一种示例性的领域识别的多分类***的示意图;
图5为本申请实施例提供的一种采用如图4所示的***进行文本领域识别的实现流程示意图;
图6为本申请实施例提供的一种在已知文本所属领域的情况下,对子领域分类器进行训练的方法流程图;
图7为本申请实施例提供的一种对子领域分类器的正负样本进行调整的训练方法流程图;
图8为本申请实施例提供的一种电子设备的结构示意图;
图9为本申请实施例提供的另一种电子设备的结构示意图。
具体实施方式
本申请实施例可以用于一种电子设备,该电子设备可以为终端,比如,笔记本电脑、智能手机、虚拟现实(Virtual Reality,VR)设备、增强现实技术(Augmented Reality,AR)、车载设备、智能可穿戴设备等设备。该终端可以至少设置有显示屏、输入设备和处理器,以终端100为例,如图1所示,该终端100中包括处理器101、存储器102、摄像头103、RF电路104、音频电路105、扬声器106、话筒107、输入设备108、其他输入设备109、显示屏110、触控面板111、显示面板112、输出设备113、以及电源114等部件。其中,显示屏110至少由作为输入设备的触控面板111和作为输出设备的显示面板112组成。需要说明的是,图1中示出的终端结构并不构成对终端的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或 者不同的部件布置,在此不做限定。
下面结合图1对终端100的各个构成部件进行具体的介绍:
射频(Radio Frequency,RF)电路104可用于收发信息或通话过程中,信号的接收和发送,比如,若该终端100为手机,那么该终端100可以通过RF电路104,将基站发送的下行信息接收后,传送给处理器101处理;另外,将涉及上行的数据发送给基站。通常,RF电路包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器(Low Noise Amplifier,LNA)、双工器等。此外,RF电路104还可以通过无线通信与网络和其他设备通信。该无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯***(Global System for Mobile communication,GSM)、通用分组无线服务(General Packet Radio Service,GPRS)、码分多址(Code Division Multiple Access,CDMA)、宽带码分多址(Wideband Code Divi sion Multiple Access,WCDMA)、长期演进(Long Term Evolution,LTE)、电子邮件、短消息服务(Short Messaging Service,SMS)等。
存储器102可用于存储软件程序以及模块,处理器101通过运行存储在存储器102的软件程序以及模块,从而执行终端100的各种功能应用以及数据处理。存储器102可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作***、至少一个功能所需的应用程序(比如,声音播放功能、图像播放功能等)等;存储数据区可存储根据终端100的使用所创建的数据(比如,音频数据、视频数据等)等。此外,存储器102可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。
其他输入设备109可用于接收输入的数字或字符信息,以及产生与终端100的用户设置以及功能控制有关的键信号输入。具体地,其他输入设备109可包括但不限于物理键盘、功能键(比如,音量控制按键、开关按键等)、轨迹球、鼠标、操作杆、光鼠(光鼠是不显示可视输出的触摸敏感表面,或者是由触摸屏形成的触摸敏感表面的延伸)等中的一种或多种。其他输入设备109还可以包括终端100内置的传感器,比如,重力传感器、加速度传感器等,终端100还可以将传感器所检测到的参数作为输入数据。
显示屏110可用于显示由用户输入的信息或提供给用户的信息以及终端100的各种菜单,还可以接受用户输入。此外,显示面板112可以采用液晶显示器(Liquid Crystal Display,LCD)、有机发光二极管(Organic Light-Emitting Diode,OLED)等形式来配置显示面板112;触控面板111,也称为触摸屏、触敏屏等,可收集用户在其上或附近的接触或者非接触操作(比如,用户使用手指、触笔等任何适合的物体或附件在触控面板111上或在触控面板111附近的操作,也可以包括体感操作;该操作包括单点控制操作、多点控制操作等操作类型),并根据预先设定的程式驱动相应的连接装置。需要说明的是,触控面板111还可以包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位、姿势,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成处理器101能够处理的信息,再传送给处理器101,并且,还能接收处理器101发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种 类型实现触控面板111,也可以采用未来发展的任何技术实现触控面板111。一般情况下,触控面板111可覆盖显示面板112,用户可以根据显示面板112显示的内容(该显示内容包括但不限于软键盘、虚拟鼠标、虚拟按键、图标等),在显示面板112上覆盖的触控面板111上或者附近进行操作,触控面板111检测到在其上或附近的操作后,传送给处理器101以确定用户输入,随后处理器101根据用户输入,在显示面板112上提供相应的视觉输出。虽然在图1中,触控面板111与显示面板112是作为两个独立的部件来实现终端100的输入和输出功能,但是在某些实施例中,可以将触控面板111与显示面板112集成,以实现终端100的输入和输出功能。
RF电路104、扬声器106,话筒107可提供用户与终端100之间的音频接口。音频电路105可将接收到的音频数据转换后的信号,传输到扬声器106,由扬声器106转换为声音信号输出;另一方面,话筒107可以将收集的声音信号转换为信号,由音频电路105接收后转换为音频数据,再将音频数据输出至RF电路104以发送给诸如另一终端的设备,或者将音频数据输出至存储器102,以便处理器101结合存储器102中存储的内容进行进一步的处理。另外,摄像头103可以实时采集图像帧,并传送给处理器101处理,并将处理后的结果存储至存储器102和/或将处理后的结果通过显示面板112呈现给用户。
处理器101是终端100的控制中心,利用各种接口和线路连接整个终端100的各个部分,通过运行或执行存储在存储器102内的软件程序和/或模块,以及调用存储在存储器102内的数据,执行终端100的各种功能和处理数据,从而对终端100进行整体监控。需要说明的是,处理器101可以包括一个或多个处理单元;处理器101还可以集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作***、用户界面(User Interface,UI)和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器101中。
终端100还可以包括给各个部件供电的电源114(比如,电池),在本申请实施例中,电源114可以通过电源管理系统与处理器101逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗等功能。
此外,图1中还存在未示出的部件,比如,终端100还可以包括蓝牙模块等,在此不予赘述。
下面以上述终端100为手机为例,对本申请实施例进行阐述。
目前,在手机将接收到的语音指令发送至云端后,云端通过语音识别技术,将语音指令转换成文本,之后对文本进行处理,以确定手机需要执行的与该文本对应的功能,即处理结果。其中,云端对文本进行处理的过程,可以实现为云端将文本与模板中设置好的内容进行逐条匹配,最终得到处理结果;或是,云端提取文本中的关键字、关键词,之后基于该关键字、关键词来得出处理结果。之后云端将该处理结果返回给手机,由手机实现与该处理结果对应的功能。
由此可见,语音指令转换成文本的过程,以及后续针对文本的处理过程,可以发生在云端,而手机只需要将接收到的语音指令发送至云端,并在云端完成语音指令的处理后,接收云端发送的处理结果,并针对该处理结果执行相应功能。
在上述实现过程中,手机与云端之间可以通过网络实现数据传输,那么对于手机 无法连网的情况而言,也就无法保证手机能够准确、有效地执行与语音指令对应的功能。
此外,无论手机是否需要联网来实现对语音指令的处理,对于云端采用模板匹配的方式完成文本处理的情况,以及手机在本地采用模板匹配的方式完成文本处理的情况而言,考虑到模板大部分是人工数据得到的,生成较多的模板往往会占用大量的人力、物力;并且,模板在生成后是固定不变的,当语音指令的结构与模板结构不能完全匹配时,会增加处理过程的失败率,即采用模板进行文本处理的过程,灵活度较差,且由于需要将文本与较多的模板进行匹配,往往耗费的时间较长。同样的,对于采用提取关键字、关键词的方式对文本进行处理,也会出现类似的问题。
在对文本进行模板匹配的过程中,当文本内容涉及多个领域时,容易产生歧义,即识别率较低。比如,文本为“你好用英语怎么说”,涉及到翻译,还涉及到语言设置,云端将文本识别为“设置语言”;文本为“翻译一下打开护眼模式”,涉及到翻译,还涉及到模式启动,云端将文本识别为“打开护眼模式”;文本为“帮我记一下餐厅的地理位置”,涉及到位置信息,还涉及到记录,云端将文本识别为“全球定位***(Global Positioning System,GPS)”;文本为“发微博说字体很大”,涉及到字体,还涉及到字体调整,云端将文本识别为“字体”;文本为“提醒我明天下午开飞行模式”,涉及到时间,还涉及到模式启动,云端将文本识别为“打开飞行模式”等。由此可见,在文本中涉及多个领域时,云端或是手机很难准确确定出文本对应的功能。
其中,领域指的是文本的类型。该类型的划分可以依据文本所处的语言环境,在本申请实施例中,可以将领域作为文本类别,在对文本进行处理的过程中,一个领域对应一个类型的任务,即属于同一领域的文本作为同一类型的任务交由同一对话引擎进行处理。上述对话引擎对文本进行处理的过程,可以包括通过自然语言处理(Natural Language Process,NLP)技术对文本进行解析,从而输出处理结果。其中,该处理结果可以包括需要手机实现的功能的代码,以使手机调用相应功能,具体实现方式可以为该对话引擎用于对该领域的文本进行解析,确定该文本对应的需执行的功能,该对话引擎也可以生成与该需执行的功能相应的指令代码;所述指令代码为机器可以识别并执行对应功能的代码。
该代码可以是二进制形式的代码,也可以是诸如<play><wangfei><song>(对用户输入的语音指令“我要听王菲的音乐”生成的指令代码)的高级语言代码,本申请不做限定。
为了解决上述问题,本申请实施例提供一种语音识别方法。如图2所示,为本申请实施例提供的一种示例性的方法流程示意图。在手机通过用户语音入口接收到语音指令后,采用语音识别技术,将语音指令转换为文本,之后由终端对该文本进行领域识别,再将经过领域识别得到的领域识别结果反馈给与该领域识别结果对应的对话引擎进行处理,最终将得到的处理结果反馈给手机。
需要说明的是,对于用户通过第三方应用或是***应用提供的入口输入语音指令的情况而言,对话引擎可以将处理结果反馈给提供入口的应用,以使该提供入口的应用实现诸如界面切换等功能;对于用户通过手机的***级显示界面提供的入口输入语 音指令的情况而言,由于用户输入语音指令时,手机可以处于主界面或是诸如设置界面等***级显示界面,而并非手机中应用的运行界面,因此,对话引擎可以将处理结果反馈给手机的***,以使该手机的***实现诸如运行某一应用、调整显示界面字体大小等功能。比如,手机呈现给用户的界面为手机的主界面,用户通过输入“打开游戏应用”的语音指令来运行该游戏应用。手机在对该语音指令完成语音识别及后续处理后,将处理结果反馈给手机的***,由手机的***启动该游戏应用。
其中,上述***应用包括但不限于手机出厂时预装的具备接收语音指令功能的应用程序;上述第三方应用包括但不限于用户从诸如应用供应等平台上下载、安装的具备接收语音指令功能的应用程序,以及通过手机中其他应用实现调用功能的应用程序等,在此不予限定。在本申请实施例中,***级显示界面指的是手机中除应用的运行界面以外的界面,比如,手机的主界面、设置界面等;应用的运行界面包括但不限于应用启动过程中或是启动后通过手机呈现给用户的界面,比如,应用的加载界面、应用的设置界面等。
上述领域识别的过程中,所指的领域包括但不限于设置(setting)、免打扰(noDisturb)、图库(gallery)、翻译(translate)、股票(stock)、天气(weather)、计算(calculator)及百科(baike)等领域。
在本申请实施例中,可以通过对文本中关键词的识别,或是采用模板匹配的方式,或是通过子领域分类器的处理,将通过语音识别转换过来的文本归类于预设的类别中,而该预设的类别包括但不限于上述例举的领域。其中,关键词识别、模板匹配的实现方式可以参考现有技术中针对文本进行领域识别的实现方式,在此不予赘述;子领域分类器可以被设置在图2所示的领域识别的多分类***中,该子领域分类器的作用等,会在后文提出,在此不予赘述。
图2中所示的领域识别的多分类系统的目的在于,对手机已经完成转换的文本,进行领域识别,并输出对应的领域识别结果。之后手机会按照领域识别结果,将该文本交由与该领域识别结果对应的对话引擎进行处理,并得到处理结果,以使手机按照处理结果的指示调用相应功能。
其中,用户语音入口可以为诸如语音助手等通用入口,还可以为手机中系统应用或是第三方应用的局部入口。比如,以系统应用为例,用户在图库内输入语音,以使手机在图库中完成搜索图片的功能。
如图3所示,为本申请实施例提供的一种示例性的手机处理语音指令的方法流程图。以用户通过语音指令打开手机的Wi-Fi功能为例,用户输入的语音指令为“打开Wi-Fi”,该语音指令经过语音识别后,得到内容为“打开Wi-Fi”的文本。手机对该文本进行领域识别,确认该文本所属的领域为设置领域,将该文本发送至设置领域对应的对话引擎进行处理,即手机将经过领域识别后得到的领域识别结果,发送至设置领域对应的本地多领域语义理解对话引擎,由该对话引擎进行处理,之后手机依据处理结果执行相应功能。在本申请实施例中,对话引擎还可以将手机执行相应功能后的执行结果通过语音播放,或是弹出对话框等方式,提示用户,手机已依据用户输入的语音指令完成相应功能的执行。比如,在图3所示的示例中,手机可以通过语音播放的方式播放“Wi-Fi已打开”,或是弹出包括诸如“Wi-Fi已打开”的字样来提示用户。
与现有技术中手机在本地实现领域识别的过程相比,本申请实施例中进行领域识别的方式与现有技术不同。现有技术中,对语音指令的处理过程,主要依赖于模板匹配,因此,当文本结构与模板结构存在差异时,手机就无法得到准确的处理结果。而在本申请实施例中,引入了领域识别及对话引擎,其中,领域识别过程不仅考虑到了模板匹配、关键词提取,还可以采用并行多子领域分类器共同工作的方式,在多领域中筛选出一个或是多个与文本对应的领域,并将该文本交由筛选出的各个领域对应的对话引擎进行处理。这样对于文本结构与模板结果不能完全匹配的情况而言,手机仍然能够对该文本进行进一步的分析、处理。需要说明的是,对于涉及到多领域的文本而言,手机能够从多领域的角度对该文本进行处理,而不仅仅是将该文本推送至一个领域对应的对话引擎进行处理。由此可见,采用本申请实施例提供的语音识别过程,能够有效对文本进行领域的区分,之后更有针对性的完成基于领域的文本识别过程,从而确定电子设备需要执行的功能,加强了语音识别的准确性。并且,实现过程可以在电子设备的本地进行。也就意味着,即便是在电子设备无法接入网络的过程中,也能够在不借助云端处理能力的基础上,实现针对语音指令的识别,从而增加了语音识别的灵活性。
需要说明的是,图3所示的示例为手机对话系统交互过程,即用户输入语音指令,经过手机的处理,执行与该语音指令对应的功能,并且通过语音播放或是显示的方式,将手机执行该功能的结果反馈给用户。其中,用户输入的语音指令与手机输出的语音播放内容或是显示结果,形成对话系统的交互。也就意味着,手机输出的语音播放或是显示结果,为手机对用户提供的一种示例性的应答方式,以在手机执行相应功能的过程中或是之后,回应用户输入的语音指令。
如图4所示,为本申请实施例提供的一种示例性的领域识别的多分类系统。领域识别的多分类系统的目的在于,依据文本完成对该文本的领域识别。在领域识别过程中,该系统可以被划分为三层,这三层分别为控制层、分类器层和算法层。
下面针对该***中涉及的各个层,对每一层的功能、作用等进行说明。
在一种示例性的实例中,控制层包括如下几个部分:文本快速全精度匹配、域调度、分类决策,以及数据装载机。
其中,文本快速全精度匹配,指的是针对常用的短语、句式等,比如,常用及歧义的固定说法,控制层可以直接将文本的领域进行划分,而无需通过分类器层对该文本进行进一步的处理。在本申请实施例中,用于文本快速全精度匹配的模板,可以预先设置,具体设置方式可以参考现有的人工模板,比如,背景技术中描述的模板匹配方式中所涉及的模板,在此不予赘述。
域调度的功能,包括针对分类器层各个优先级内子领域分类器的调度,比如,经过文本快速全精度匹配后未成功确定文本所属的领域,那么域调度可以调度优先级1中各个子领域分类器对该文本进行处理,并在优先级1中各个子领域分类器无法确定文本所属领域的情况下,继续调度优先级2中各个子领域分类器对该文本进行处理,直至确定出文本所属领域或是分类器层所有子领域分类器均已对该文本进行处理且未确定文本所属领域。此外,对于单个子领域分类器而言,域调度还可以用于调用子领域分类器所涉及的算法、规则、模式等。
其中,算法、规则和模式等均供子领域分类器在对文本进行处理的过程中使用。在本申请实施例中,当文本与该规则匹配时,或是文本满足规则时,分类器层返回领域识别结果;当文本与该模式匹配时,或是文本满足模式时,可以确定该文本会存在较大几率属于该模式对应的领域。在本申请的一个实施方式中,规则与文本的对应,对于确定文本所属领域起着决定性作用,而模式与文本的对应,则是增加了确定文本所属领域的准确性,具体实现方式会在后文提及的具体实例中说明,在此不予赘述。
也就意味着,域调度用于衔接控制层与分类器层,在文本经过全文本精度匹配未得到结果后,按照分类器层的优先级由高到低的顺序,依次实现各个优先级中子领域分类器的调度,且在调度子领域分类器进行文本处理的过程中,根据子领域分类器的需求,调度相应的算法、规则、模式等。
分类决策,即汇总决策,主要目的在于当控制层经过文本快速全精度匹配后未确定文本所属领域的情况下,结合分类器层各优先级得到的处理结果,确定出文本所属领域或是确定文本不存在所属的领域。
比如,在优先级1中各个子领域分类器对文本进行处理后,确定文本属于领域1和领域2,那么分类决策就可以用于规定文本经过同一优先级中的所有子领域分类器处理后,得到的领域识别结果中包括多个领域时,如何确定文本所属领域。在本申请实施例中,分类决策可以规定此时文本属于其中一个领域或是多个领域,即分类决策可以规定文本属于领域1、属于领域2,或是属于领域1和领域2。
再比如,在优先级1中各个子领域分类器对文本进行处理后,未确定出文本所属的领域,而在优先级2中各个子领域分类器对文本进行处理后,确定出文本属于领域1,那么分类决策就可以通过汇总优先级1和优先级2中各个子领域分类器得到的领域识别结果来确定文本所属领域,即分类决策汇总优先级1中不存在文本所属领域,而优先级2中存在文本所属的领域1,最终确定文本属于领域1为文本的领域识别结果。
在本申请实施例中,手机在语音指令的识别过程中,可以生成一个实例,这个实例可以为待处理的任务,该任务为手机对语音指令转换为的文本进行领域识别。在分类器层的同一优先级中,多个子领域分类器可以同时处理相同的实例,即手机同时执行多个任务,以实现对文本的领域识别。
数据装载机,用于从手机本地、网络侧或是诸如服务器等第三方设备,获取算法层所需的各种库的数据、分类器层中子领域分类器的模型,以及配置信息。其中,子领域分类器,指的是各个领域对应的分类器;配置信息包括但不限于各个模型的初始化参数等。
此外,控制层作为该系统中与手机其他部件进行交互的层,控制层可以从手机获取完成语音识别得到的文本,且在系统对该文本进行处理后,能够将领域识别结果,即分类结果,反馈给手机。
由此可见,控制层负责外部业务交互接口、初始化数据和模型加载、领域分类任务调度、子领域分类器分类任务的分发,以及最终返回的所有分类结果的汇总决策。
在一种示例性的实例中,分类器层包括多个优先级,比如,优先级1(priority1)、优先级2(priority2)、优先级3(priority3)。其中,优先级1大于优先级2,大于优先级3。在每个优先级中,可以设置一个或是多个类的实例,即子领域分类器。 比如,优先级1中的类的实例11、类的实例12,以及类的实例13。
分类器层,用于实现文本的分类。在实际分类过程中,分类器层支持多级多实例任务分类,即如上段描述,在分类器层包括多个优先级的分类器组,在不同优先级的分类器组中,存在多个并列的子领域分类器,这多个并列的子领域分类器能够同时执行,以使文本的领域分类过程实现汇总决策。
在单个子领域分类器中,包括规则、模式(pattern)、命名实体识别(Named Entity Recognition,NER),以及预测部分,从而实现子领域特征的提取和领域识别。需要说明的是,同一条文本,在同一优先级中的不同子领域分类器中,可以具备相同的子领域特征,还可以得到不同的领域识别结果。
其中,子领域特征包括但不限于文本中的关键词,也就意味着,在不同领域中,同一关键词可以表示相同或是不同含义,在本申请实施例中,该关键词可以对领域识别结果产生影响;领域识别结果指的是经过子领域分类器对文本进行的处理,可以初步预测文本可能属于的领域,比如,属于同一优先级的两个子领域分类器在经过对同一文本进行处理后,一个子领域分类器确定该文本属于领域1,而另一个子领域分类器确定该文本属于领域2,那么这两个子领域分类器得到了不同的领域识别结果,即文本属于领域1,以及文本属于领域2。
另外,对于不同优先级的分类器组而言,相邻两个优先级的分类器组之间是串行关系。对于分类器层中优先级较高的优先级1而言,在优先级1中的子领域分类器得到有效的领域识别结果的情况下,可以将该领域识别结果通过控制层反馈给手机;在优先级1中的子领域分类器未得到有效的领域识别结果的情况下,可以将该文本传递至串行的优先级2中的各个子领域分类器进行处理,依次类推,直至得到有效的领域识别结果为止。若该文本在分类器层中遍历各个优先级之后仍未得到领域识别结果,那么可以将未得到领域识别结果这一分类结果,反馈给手机。其中,有效的领域识别结果,指的是在某一优先级中,能够确定出该文本所属的领域,那么所确定出的领域就是有效的领域识别结果。
需要说明的是,位于分类器层的优先级数量,以及处于同一优先级中子领域分类器的数量,在本申请实施例中不予限定。另外,可以预先定义每个子领域分类器对应的领域。并且,在后续***的使用过程中,可以对各个子领域分类器对应的领域进行调整,其中,上述调整包括但不限于子领域分类器所在优先级的调整,子领域分类器对应领域的调整,子领域分类器数量的增减等。比如,将一个子领域分类器从一个优先级移动至另一个优先级,将位于不同优先级中的子领域分类器进行对调等。
在实际配置过程中,可以将领域识别精度较高的领域,对应到高优先级的子领域分类器中;将性能较好的模型,对应到高优先级的子领域分类器中。对于需要被识别领域的文本而言,由于高优先级的子领域分类器的识别精度较高,且使用的模型性能较好,就能使文本优先通过精准度较高且时效性较高的识别。当该文本在最高优先级的子领域分类器中就能够识别到领域时,该***可以将该领域识别结果返回给手机。也就意味着,在高优先级的子领域分类器经过领域识别后,未得到有效的领域识别结果,那么可以将该文本交由下一级的子领域分类器进行领域识别,直至得到有效的领域识别结果或是该文本已经经过了每一级子领域分类器的处理。其中,有效的领域识 别结果,指的是该***确定出文本对应的领域;最高优先级的子领域分类器,指的是诸如图4优先级1中的各个子领域分类器,即子领域分类器11、子领域分类器12和子领域分类器13。需要说明的是,在分类器层,可以按照子领域分类器所在组的优先级从高到低的顺序,依次对文本进行识别。当然,当该文本在某一子领域分类器所在组中识别出领域,就可以结束本次针对文本的领域识别过程。
在一种示例性的实例中,算法层,用于提供算法、模型。其中,模型指的是诸如规则(rule)库、命名实体(Named Entity,NE)库,以及特征(feature)库的数据库。算法层所提供的算法,同样可以以数据库的形式体现,比如,算法模型库。在该算法模型库中,包括多种算法。
需要说明的是,在调用上述各种算法之前,控制层的数据装载机需要将该算法相关的内容加载到系统中,从而供分类器层各个子领域分类器的灵活调用。
如图5所示,为采用如图4所示的系统进行文本领域识别的实现流程示意图。
手机将文本输入图4所示的系统后,系统先在控制层对该文本进行文本快速全精度匹配,在能够成功为文本确定出领域的情况下,直接将得到的领域作为识别结果;在无法成功为文本确定出领域的情况下,可以继续通过分类器层对文本进行进一步的处理。
上述对文本进行进一步处理,可以实现为按照分类器层各个子领域分类器所属优先级组从高到低的顺序,依次对文本进行领域识别。在对文本进行识别的过程中,无论该文本在哪个优先级组中进行领域识别,只要识别到与文本对应的领域,就将该领域作为领域识别结果反馈给手机,且不再将文本交由下一优先级组进行处理。
如图5所示,***针对优先级1进行分类任务调度,即调用子领域分类器11、子领域分类器12及子领域分类器13,对输入到***中的文本进行并行的领域识别。其中,并行的领域识别,指的是子领域分类器11、子领域分类器12及子领域分类器13同时对文本进行领域识别,或是按照一定时间顺序对文本进行领域识别,之后每个子领域分类器输出一个领域识别结果,并由优先级1对应的分类决策实现针对输出的3个领域识别结果,进行判决,以确定反馈给手机的领域识别结果,或是将文本输入下一优先级组。
对于将文本输入下一优先级组的情况而言,***会继续针对优先级2进行分类任务调度,即调用子领域分类器21、子领域分类器22及子领域分类器23,对输入到***中的文本进行并行的领域识别。其中,并行的领域识别过程的实现可以参考上段的描述。同理,***在优先级2针对文本完成领域识别后,可以将文本输入优先级3对应的各个子领域分类器,进行领域识别,还可以直接将得到的有效的领域识别结果反馈给手机。其中,有效的领域识别结果,指的是通过控制层的文本全精度匹配,得到文本的领域;或是通过分类器层中某一个或是多个优先级组的处理,得到文本的领域;或是在文本通过控制层和分类器层每个优先级的处理后,未得到文本的领域的结果。
在本申请实施例中,系统进行文本领域识别的过程,会在得到有效的领域识别结果时结束,或是,在分类器层中每个优先级组对文本进行领域识别后,均无法得到有效的领域识别结果时结束。
需要说明的是,当同一优先级组中的各个子领域分类器同时对文本进行领域识别 时,由于领域确定的判别过程在同一时间段内,由多个子领域分类器共同运行,因此,可以有效节省判别过程所占用的时间。而当同一优先级组中各个子领域分类器按照一定时间顺序对文本进行领域识别时,由于在一段时间内,有一个子领域分类器运行,因此,能够确保***在该一段时间内,占用较少的资源供单个子领域分类器运行,从而保证手机中存在足够的资源供其他***或是程序调用。
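同一优先级组内多个子领域分类器同时(并行)对同一文本进行领域识别的方式,可借助线程池示意如下。其中各分类器函数仅为说明而假设:

```python
from concurrent.futures import ThreadPoolExecutor

def classify_group_parallel(text, classifiers):
    """同一优先级组内的各子领域分类器并行处理同一文本,
    汇总各自的领域识别结果,供该组的分类决策使用。"""
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        results = list(pool.map(lambda clf: clf(text), classifiers))
    return [r for r in results if r is not None]

# 假设的一组子领域分类器(对应图4中同一优先级组内的多个类的实例)
group = [lambda t: "setting" if "设置" in t else None,
         lambda t: "nodisturb" if "免打扰" in t else None,
         lambda t: "gallery" if "图片" in t else None]
```

由于 `ThreadPoolExecutor.map` 按输入顺序返回结果,同一组内多个子领域分类器的领域识别结果可以稳定地交由分类决策处理。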
参照图4所示的系统及图5所示的方法流程,可以得知,相比较于现有技术中云端实现语音识别的方案而言,本申请实施例提供的系统的可扩展性较高、灵活性较强、准确率较高,且更精细化。
扩展性较高,指的是该系统能够支持未来新垂类的任意扩展,而已有的模型,也就是本申请实施例提出的系统,不需要重新建立。上述垂类,指的是本申请实施例中涉及的不同领域的类别,比如,设置、免打扰、图库、翻译、股票、天气、计算及百科等领域。也就意味着,后续使用过程中,可以结合不同应用场景的需要,在分类器层增加其他领域对应的子领域分类器。
灵活性较强,指的是根据当前及未来垂类的特点,不同优先级组可以灵活调整,比如,单个优先级组中子领域分类器的增减,多个优先级组之间子领域分类器的调换等,在此不予限定。这样能够保证分类器层经过汇总决策后得到相对准确的领域识别结果。
准确率较高,指的是针对单个子领域分类器而言,可以结合该子领域分类器对应领域的特征,采用特定的分析、计算方式对文本进行处理,比如,数字、停用词(stopword)的处理,二元法(bi-gram)、三元法(tri-gram)的选择,特征提取范围、方式等。由于在不同子领域分类器中采用相同或是不同的处理方式,能够更具有针对性,因此,准确率相对较高。
更精细化,指的是对训练数据的筛选,以及对子领域分类器进行更有针对性的训练、优化,能够使子领域分类器对文本的领域识别过程更加精细,以达到更准确的领域识别。
下面结合示例性的实例,对上述系统对文本进行领域识别的过程进行阐述。
在本申请实施例中,可以预先配置分类器层中各个子领域分类器对应的领域。比如,可以按照子领域分类器的领域识别准确率从高到低的顺序,将领域识别准确率较高的子领域分类器放置在优先级较高的分组中,比如优先级1中,将领域识别准确率较低的子领域分类器放置在优先级较低的分组中,比如优先级3中。在一种示例性的实现方式中,手机可以将分类准确度最高的自有垂类分类任务放入优先级1,将属于与应用对接的垂类分类任务放入优先级2,将最难识别的垂类任务放入优先级3。
其中,与应用对接的任务经过优先级组内的子领域分类器处理后,无论得出的领域划分结果是该优先级组内的哪个子领域分类器,对于语音指令的处理过程影响不大,或是几乎不会产生影响。由于设置于优先级2中的子领域分类器对应于应用,而同一应用下的不同领域的文本处理过程,通常是交由该应用对应的对话引擎进行处理。也就意味着,文本归属于该应用对应的任意一个领域,最终都会交由相同的对话引擎进行处理。因此,在本申请实施例的一个实现方式中,与应用对接的垂类分类任务可以被视为对领域识别准确率要求较低的任务,因为无论最终得到的领域识别结果为哪个 领域,只要有效的领域识别结果产生在优先级2中,那么该文本最终都会交由同一对话引擎进行处理,并不会影响处理结果。
比如,优先级2中包括子领域分类器21、子领域分类器22和子领域分类器23,其中,子领域分类器21对应领域为股票、子领域分类器22对应领域为翻译、子领域分类器23对应领域为计算,而股票、翻译和计算对应同一个应用,也就对应着同一个对话引擎。也就意味着,在优先级2中,无论文本被确定为属于股票、翻译和计算中的哪个领域,最终手机均会将文本推送至同一对话引擎进行处理。由此可见,优先级2对文本进行领域识别得到的领域识别结果无论在优先级2中涉及的哪个领域,都不会影响后续对话引擎进行文本处理的处理结果。
需要说明的是,优先级2中的各个子领域分类器对应的领域,也可以对应两个或是三个对话引擎,但往往存在多个子领域分类器对应的领域,对应同一个对话引擎的情况。
其中,垂类分类任务,由领域对应的子领域分类器来执行;自有垂类分类任务对应的子领域分类器,可以包括但不限于手机自带功能所属功能对应的子领域分类器,比如,设置、免打扰及图库领域对应的子领域分类器;第三方对接的垂类分类任务对应的子领域分类器,可以包括但不限于手机中已安装的应用程序,或是诸如小程序等无需下载、安装即可以直接调用的程序等所能实现功能对应的子领域分类器,比如,股票、翻译、计算及天气领域对应的子领域分类器;最难识别的垂类任务对应的子领域分类器,可以包括但不限于根据关键词难以确定领域识别结果的领域对应的子领域分类器,比如,百科等具有搜索功能的领域对应的子领域分类器。
由此可见,在一种示例性的实现方式中,分类器层中各个子领域分类器的分布情况如下:
优先级1:设置、免打扰及图库领域各自对应的子领域分类器;
优先级2:股票、翻译、计算及天气领域各自对应的子领域分类器;
优先级3:百科领域对应的子领域分类器。
比如,当输入到上述系统的文本为“我要设置一个免打扰”时,控制层通过文本快速全精度匹配未得到有效的领域识别结果。在将该文本交由分类器层处理时,经过优先级1中各个子领域分类器的并行处理,得到有效的领域识别结果,即该文本对应的领域为免打扰。之后系统将得到的领域识别结果反馈给手机。上述示例的表现形式如下:
用户输入的语音对应的文本:我要设置一个免打扰
过程:[我要设置一个免打扰]<nodisturb,priority1>
从优先级1中的子领域分类器得到有效的领域识别结果后,直接返回给手机。
领域识别结果:[nodisturb]return from<priority1>
需要说明的是,上述领域识别过程,涉及到控制器层,以及分类器层优先级1中各个子领域分类器的处理。
再比如,当输入到上述系统的文本为“看一下股市”时,控制层通过文本快速全精度匹配未得到有效的领域识别结果。在将该文本交由分类器层处理时,经过优先级1中各个子领域分类器的并行处理,得到的领域识别结果为其他(other)。将该文本交由下一个优先级的子领域分类器进行处理。之后经过优先级2中各个子领域分类器的并行处理,得到有效的领域识别结果,即该文本对应的领域为股票。之后系统将得到的领域识别结果反馈给手机。上述示例的表现形式如下:
用户输入的语音对应的文本:看一下股市
过程:[看一下股市]<other,priority1>
从优先级1中的子领域分类器未得到有效的领域识别结果,得到的领域识别结果为other,将文本交由优先级2中的子领域分类器进行处理。
[看一下股市]<stock,priority2>
从优先级2中的子领域分类器得到有效的领域识别结果后,直接返回给手机。
领域识别结果:[stock]return from<priority2>
需要说明的是,上述领域识别过程,涉及到控制器层,以及分类器层优先级1、优先级2中各个子领域分类器的处理。
在本申请实施例中,若优先级2中包括百科对应的子领域分类器,那么从优先级2中的子领域分类器得到的领域识别结果,包括股票和百科,或是股票和百科中的一项,那么返回给手机的领域识别结果存在较大的出错几率。由此可见,分类器层中,对子领域分类器的优先级划分十分重要。对于容易产生歧义,或是难以分辨的领域而言,可以将该领域对应的子领域分类器放置到优先级较低的分组。这样在高优先级的分组得到有效的领域识别结果后,无需将文本输入到低优先级的分组中进行领域识别,也就降低了低优先级的领域识别压力。
再比如,当输入到上述系统的文本为“查询五粮液”时,控制层通过文本快速全精度匹配未得到有效的领域识别结果。在将该文本交由分类器层处理时,经过优先级1中各个子领域分类器的并行处理,得到的领域识别结果为other。将该文本交由下一个优先级的子领域分类器进行处理,即将该文本交由优先级2中各个子领域分类器进行并行处理,得到的领域识别结果仍为other。之后将该文本再交由下一个优先级的子领域分类器进行处理,经过优先级3中各个子领域分类器的并行处理,得到有效的领域识别结果,即该文本对应的领域为百科。之后系统将得到的领域识别结果反馈给手机。上述示例的表现形式如下:
用户输入的语音对应的文本:查询五粮液
过程:[查询五粮液]<other,priority1>
从优先级1中的子领域分类器未得到有效的领域识别结果,得到的领域识别结果为other,将文本交由优先级2中的子领域分类器进行处理。
[查询五粮液]<other,priority2>
从优先级2中的子领域分类器未得到有效的领域识别结果,得到的领域识别结果为other,将文本交由优先级3中的子领域分类器进行处理。
[查询五粮液]<baike,priority3>
从优先级3中的子领域分类器得到有效的领域识别结果后,直接返回给手机。
领域识别结果:[baike]return from<priority3>
需要说明的是,上述领域识别过程,涉及到控制器层,以及分类器层优先级1、优先级2、优先级3中各个子领域分类器的处理。另外,这条文本的歧义较大,存在 一定几率被识别为股票、百科及other。在本申请实施例中,将文本最容易识别到的领域放置在分类器层的最低优先级中,这样能够有效减小各个优先级组之间的抵触,且降低了上层高优先级的子领域分类器的识别压力。
在上述实现过程中,手机可以充分利用本地用户数据,在手机不与云端进行数据交互的情况下,手机可以有效的进行领域识别。其中,本地用户数据,指的是存储在手机本地的数据,比如,存储在手机存储器中的数据。该数据包括但不限于系统中涉及的各个库中包含的内容。由此可见,手机节省了与云端之间进行数据交互所耗费的时间,并且,在领域识别过程中,处于同一优先级的多个子领域分类器可以同时完成识别操作,也能有效节省领域识别过程耗费的时间。
在本申请实施例中,可以依据不同领域类别的特点,以及各个子领域分类器对应模型的准确率和性能,对分类器层的优先级进行划分。对于常用的说法或是容易产生歧义的固定说法,可以将上述说法设置在控制层的文本全精度匹配过程,有效提高领域识别的处理效率,节省领域识别过程占用的时间。除上述说法以外的说法,可以按照优先级从高到低的顺序,依次进入不同优先级的子领域识别分类器进行多领域并行的领域识别过程,进一步提高了领域识别过程的处理效率,节省了处理时间。需要说明的是,上述优先级的划分,还能够使分类效果较差的子领域分类器得到有效的利用,即放置到优先级较低的组中。
在上述系统中,子领域分类器的识别能力会影响领域识别结果,而子领域分类器的训练又会影响到子领域分类器的识别能力,因此,子领域分类器的训练显得尤为重要。
如图6所示,为本申请实施例提供的一种示例性的在已知文本所属领域的情况下,对子领域分类器进行训练的方法流程图。其中,该方法流程包括S201至S208。
S201、输入文本。
S202、通过规则对文本进行筛选。
在本申请实施例中,规则可以为诸如[^(搜|查|看|告诉|打开).{1,12}(的股)$]形式的句式。其中,“^”作为规则的起始符,表示以“搜”、“查”、“看”、“告诉”或是“打开”为起始关键词,在间隔1至12个字后,接着作为结束关键词的“的股”二字的文本,而“$”作为规则的结束符,表示以“的股”结束。
起始关键词指的是,在文本中的第一个字为“搜”、“查”、“看”,或是文本中的第一个词为“告诉”或是“打开”,就认为被搜到的内容为起始关键词;结束关键词指的是,在文本中的最后一个词为“的股”。
需要说明的是,起始符与结束符作为句式中可选的符号出现,并不作为对本申请实施例的限定。比如,规则可以为[(搜|查|看|告诉|打开).{1,12}(的股)]形式的句式。那么,该句式表示以“搜”、“查”、“看”、“告诉”或是“打开”为起始关键词,在间隔1至12个字后,接着作为结束关键词的“的股”二字的文本。其中,在该文本中的第一个字不一定为“搜”、“查”、“看”,或是在该文本中的第一个词不一定为“告诉”或是“打开”,而是在该文本中存在“搜”、“查”、“看”、“告诉”或是“打开”。并且,在文本中,“搜”、“查”、“看”、“告诉”或是“打开”之后间隔1至12个字后,存在“的股”二字,而“的股”并不一定作为该文本中出现 的最后一个词。
也就意味着,在规则中,可以包括起始符,或是结束符,或是同时包括起始符和结束符,在此不予限定。
S203、当文本满足规则时,返回领域识别结果。
参照上文描述,对于能够匹配上规则的文本,可以直接确定该文本所属领域,从而确定领域识别结果,并返回。
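S202 至 S203 所述"按规则筛选、匹配即返回"的过程,若以 Python 正则表达式示意上文的规则(写法仅为示例,并非对实现方式的限定):

```python
import re

# 规则:以“搜/查/看/告诉/打开”为起始关键词,
# 间隔 1 至 12 个字后,以“的股”作为结束关键词
RULE = re.compile(r"^(搜|查|看|告诉|打开).{1,12}的股$")

def match_rule(text):
    """文本满足规则时返回 True,可直接确定其所属领域并返回领域识别结果。"""
    return RULE.search(text) is not None
```

其中 `^` 与 `$` 分别对应上文所述规则的起始符与结束符;去掉二者即得到"文本中存在该句式即可"的宽松形式。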
S204、当文本不满足规则时,对文本进行NER,并完成公用特征替换。
其中,公用特征指的是在计算文本的值时,会对该值产生影响的内容,但该特征的存在并不会对文本所属的领域产生影响,在本申请实施例的一种实现方式中,公用特征包括但不限于时间、地点等词语,可以预先设置。在本申请实施例中,可以将公用特征替换为符号等,在此不予限定。
在本申请实施例的一个实现方式中,对文本进行NER,可以将文本中诸如时间、地点等词语识别出来,之后将识别出的内容作为公用特征,并使用预先设置的符号等对公用特征进行调换。
S205、对完成替换的文本进行特征提取。
特征提取指的是,按照二元法、三元法等方式对完成替换的文本进行词语提取,比如,按照二元法对完成替换的文本进行词语提取,得到多组两个字组成的词语,或是一个字一个符号组成的组合,或是两个符号组成的组合等。
S206、计算每个特征的权重。
需要说明的是,根据特征计算权重的方式可以参考现有技术中诸如二元法、三元法的实现方式。
比如,以二元法为例,将经二元法拆分后得到的各个特征对应的数值输入到模型中,模型经过诸如线性回归(Linear Regression,LR)等算法的计算,输出与特征数量对应的权重,即一个特征对应一个权重。其中,对于输入模型的参数而言,不同特征对应的数值不同,可以预先设置,具体设置方式在此不予限定。在本申请实施例中,对于模型计算的方式可以参考现有技术中提供的算法,比如,上述LR算法,在此不予赘述。
S207、根据权重计算文本对应的值。
根据特征的权重计算出经过替换的文本的值,即输入的文本的值。具体计算方式可以参考现有技术的实现方式,比如,可以采用将文本中所有特征的权重求和的方式,得到与文本对应的值,或是通过对各个特征的权重进行处理后进行求和的方式,得到与文本对应的值等,在此不予限定。
S208、根据计算得到的值,以及已知的领域识别结果,对子领域分类器进行调整。
由于上述S201至S208是根据已知领域的文本对子领域分类器进行训练,因此,可以依据子领域分类器得到的识别结果,以及文本实际所属领域,对子领域分类器进行调整。其中,调整子领域分类器的方式包括但不限于对子领域分类器中正负样本进行调整。需要说明的是,调整正负样本会影响到特征的权重,最终影响计算出的文本对应的值,从而影响领域识别结果。
在完成调整过程后,手机可以继续使用相同的文本,再次经过同一子领域分类器 的处理,直至得到正确的领域识别结果为止。即在子领域分类器的训练过程中,上述S201至S208所示内容为重复的过程,直至达到训练目的为止。
如图7所示,为本申请实施例提供的一种示例性的对子领域分类器的正负样本进行调整的训练方法流程图。其中,该方法流程包括S301至S310。
S301、生成子领域分类器的正负样本。
在本申请实施例中,每个子领域分类器均可以有自己独立的正负样本,其中,正负样本包括正例训练样本集合和负例训练样本集合。正例训练样本集合中的样本为属于该子领域分类器对应领域的样本,负例训练样本集合中的样本为不属于该子领域分类器对应领域的样本。
S302、对正负样本进行NER和规则提取。
比如,正负样本的文本内容为“搜天安门的照片”,经过NER后,识别到“天安门”,并且通过规则提取,提取到的规则为[^(搜).{1,10}(的照片)$]形式的句式。这样,可以将经过NER后得到的“天安门”作为公用特征,而将[^(搜).{1,10}(的照片)$]作为规则。
S303、完成公用特征替换。
在本申请实施例的一个实现方式中,可以预先定义将诸如天安门等地名替换为#,那么完成公用特征替换的文本内容为“搜#的照片”。
其中,S302和S303可以参考上文S202至S205的描述,在此不予赘述。
需要说明的是,对正负样本进行NER,可以为规则提取和公用特征替换的前提。即通过NER识别出正负样本中的地点、时间、句式等,之后将句式作为规则,将时间、地点等作为公用特征,并完成公用特征与符号之间的替换。
S304、停用词等去噪。
在本申请实施例中,停用词指的是对领域识别不存在决定性作用的字、词或是符号,但这些字、词或是符号的存在,往往可以对领域识别结果的准确性产生影响,比如,“;”、“,”等。将这些停用词识别出来,并在领域识别过程中忽略这些停用词。
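上述去噪步骤相当于在特征提取前忽略停用词,可示意如下(停用词表为假设的示例):

```python
STOPWORDS = {"啊", "呀", ";", "、", ",", "?"}   # 假设的停用词表

def remove_stopwords(text):
    """忽略对领域识别不起决定性作用、但会干扰识别结果的字词与符号。"""
    return "".join(ch for ch in text if ch not in STOPWORDS)
```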
S305、提取特征生成训练语料特征库。
其中,训练语料特征库用于记载S206中经计算得到的特征与权重之间的对应关系。
S306、根据权重计算文本对应的值。
S307、子领域分类器训练。
具体训练过程可以参考S201至S207的实现过程,在此不予赘述。
S308、错误领域识别结果的影响评估。
S309、修改正负样本。
上述S308和S309,与S208的目的相似,在本申请实施例中,修改正负样本可以作为S208的一种示例性的实现方式。
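上述"训练、错误影响评估、修改正负样本"的迭代过程,可示意如下。其中,直接对特征权重做增减是对"修改正负样本从而改变权重"这一机制的简化假设,初始权重与样本亦为示例数据:

```python
def bigrams(text):
    """按二元法提取特征。"""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def train_until_correct(samples, weights, threshold=1.5, step=0.5, max_rounds=10):
    """对已知领域的训练样本反复评估:识别错误时调整相关特征的权重
    (对应于修改正负样本后权重的变化),直至全部样本识别正确。"""
    score = lambda text: sum(weights.get(f, 0.0) for f in bigrams(text))
    for round_no in range(1, max_rounds + 1):
        wrong = [(t, label) for t, label in samples
                 if (score(t) >= threshold) != label]
        if not wrong:
            return round_no                   # 全部样本识别正确,训练结束
        for t, label in wrong:                # 错误领域识别结果的影响评估
            delta = step if label else -step  # 正例权重上调,负例权重下调
            for f in bigrams(t):
                weights[f] = weights.get(f, 0.0) + delta
    return None                               # 未在限定轮数内收敛

# 示例数据:文本及其“是否属于股票领域”的已知标注,初始权重为假设值
weights = {"同花": 0.33, "花顺": 0.23, "顺股": 0.30, "股市": 1.57,
           "顺炒": -0.34, "炒股": -0.34, "股票": 1.99}
samples = [("同花顺股市", True), ("同花顺炒股票", False)]
```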
下面结合示例性的实例,针对上述系统对子领域分类器的训练过程,对正负样本的调整过程进行阐述。
在一种示例性的实现方式中,以股票领域对应的子领域分类器为例,训练样本及该样本经过系统处理后得到的领域识别结果,包括如下内容:
系统对训练样本1及训练样本2的第一轮处理结果如下:
训练样本1
用户输入的语音指令对应的文本:同花顺股市
领域识别结果:[stock]return from<priority2>
训练样本2
用户输入的语音指令对应的文本:同花顺炒股票
领域识别结果:[stock]return from<priority2>
在本申请实施例中,“同花顺”不仅为上市公司的名称,还为某一款应用的名称,其中,这款应用是用于炒股的。训练样本1中,用户试图查询同花顺的股票;而训练样本2中,用户是想打开名称为同花顺的应用,进行炒股。因此,训练样本1得到的领域识别结果是准确的,而训练样本2得到的领域识别结果是错误的。
下面以二元法为例,将用户输入的语音指令转换后得到的文本按照二元法进行划分,得到多组由两个字组成的特征。
训练样本1中,各个特征及与每个特征对应的权重,如下:
同花:0.33474357
花顺:0.23474357
顺股:0.30918131
股市:1.57149447
文本的值:0.33474357+0.23474357+0.30918131+1.57149447=2.45016292
训练样本2中,各个特征及与每个特征对应的权重,如下:
同花:0.33474357
花顺:0.23474357
顺炒:-0.34392488
炒股:-0.34392488
股票:1.99415611
文本的值:
0.33474357+0.23474357-0.34392488-0.34392488+1.99415611=1.87579349
需要说明的是,特征的权重可以是正数、负数或是0。在本申请实施例中,特征的权重取值越大,则表示其对其所在的文本识别成所在子领域(本例中为“股票”领域)的贡献就越大。
在本申请实施例的一个实现方式中,以1.5作为文本的值的阈值,当文本中所有特征的权重之和大于或是等于1.5时,确认该文本属于股票领域;当本文中所有特征的权重之和小于1.5时,确认该文本不属于股票领域。由于训练样本1中涉及到的每个特征的权重均为正数,因此,能够根据权重计算出正确的领域识别结果,即领域为股票。而在训练样本2中,由于噪声“炒”的存在,且与“炒”结合的特征“顺炒”和“炒股”的负的权重值的绝对值太小,导致训练样本2经过各个特征的权重求和后,得到大于阈值的正数,因此,该文本仍然会被误识别为股票领域。
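以训练样本1为例,上述"各特征权重求和并与阈值 1.5 比较"的判定可直接验证如下(权重数值取自上文):

```python
# 训练样本1“同花顺股市”经二元法拆分后各特征对应的权重(取自上文)
weights = {"同花": 0.33474357, "花顺": 0.23474357,
           "顺股": 0.30918131, "股市": 1.57149447}

THRESHOLD = 1.5                        # 文本的值的阈值

value = sum(weights.values())          # 文本的值
belongs_to_stock = value >= THRESHOLD  # 大于或等于阈值时,属于股票领域
```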
为了纠正第一轮处理过程中的误识别的问题,在进行第二轮处理过程之前,系统依据训练样本1识别正确而训练样本2识别错误的结果,对正负样本进行调整。删除正样本中带[同花顺]的内容,并在负样本中增加[同花顺]和[炒股票]。
需要说明的是,通常情况下,在正样本中增加的内容所涉及到的特征对应的权重的取值会有所增加,而在正样本中删除的内容所涉及到的特征对应的权重的取值会有所减少;同样的,在负样本中增加的内容所涉及到的特征对应的权重的取值会有所减少,而在负样本中删除的内容所涉及到的特征对应的权重的取值会有所增加。
比如,在正样本中删除带[同花顺]的内容,那么特征“同花”、“花顺”分别对应的权重的取值会有所减少。而在负样本中增加[同花顺],会使特征“同花”、“花顺”分别对应的权重的取值进一步减少。
但是对于在负样本中增加[炒股票]而言,在本申请实施例的一种实现方式中,并未影响特征“炒股”和“股票”分别对应的权重。未产生影响的原因可以为正样本及负样本中带[炒股票]的内容的样本容量较大,导致在负样本中增加一个[炒股票]的样本后对负样本的影响较小,比如,正样本中带[炒股票]的内容的样本数量为两万个,负样本中带[炒股票]的内容的样本数量为一万个,而增加一个[炒股票]的负样本后,并不会对庞大数据量的正样本及负样本产生影响,因此,对特征“炒股”、“股票”分别对应的权重产生的影响几乎为零,也就不会改变“炒股”、“股票”分别对应的权重的取值。因此,在本申请实施例的一个实现方式中,上述删除正样本中带[同花顺]的内容,并在负样本中增加[同花顺]和[炒股票],会使特征“同花”、“花顺”分别对应的权重的取值减小,而不会影响特征“炒股”、“股票”分别对应的权重的取值。
需要说明的是,上述情况为一种示例性的实现方式,不作为对本申请实施例的限定。
经第一次正负样本的调整后,训练样本1中,各个特征及与每个特征对应的权重,如下:
同花:-0.34743574
花顺:-0.34743574
顺股:0.30918131
股市:1.57149447
文本的值:-0.34743574-0.34743574+0.30918131+1.57149447=1.1858043
经第一次正负样本的调整后,训练样本2中,各个特征及与每个特征对应的权重,如下:
同花:-0.34743574
花顺:-0.34743574
顺炒:-0.34392488
炒股:-0.34392488
股票:1.99415611
文本的值:
-0.34743574-0.34743574-0.34392488-0.34392488+1.99415611=0.61143487
经过第一次正负样本的调整后,由于正负样本的改变,导致了部分或是全部特征的权重的变化,因此,也会在一定程度上影响处理结果。即训练样本1和训练样本2中文本的值均小于1.5,也就意味着,两个训练样本均被识别为不属于股票领域。需 要说明的是,对于同一个特征而言,在该特征属于正样本时,该特征对应的权重越大;在该特征属于负样本时,该特征对应的权重越小;在该特征同时属于正样本和负样本时,则根据包含该特征的正样本、负样本的数量,来权衡该特征的权重取值。
在经过第一次正负样本的调整后,系统对训练样本1及训练样本2的第二轮处理结果如下:
训练样本1
用户输入的语音指令对应的文本:同花顺股市
领域识别结果:[other]return from<priority3>
训练样本2
用户输入的语音指令对应的文本:同花顺炒股票
领域识别结果:[other]return from<priority3>
其中,训练样本1得到的领域识别结果是错误的,而训练样本2得到的领域识别结果是正确的。需要说明的是,在训练样本输入时可以将训练样本对应的正确的领域识别结果输入,这样手机可以根据已知的正确的领域,结合输出的领域识别结果,自动调整正负样本;或者,在输出领域识别结果后,人为判断该领域识别结果是否正确,并在结果产生错误的情况下,触发手机自动调整正负样本。
因此,系统会再次自动调整正负样本,即系统第二次自动调整正负样本。系统在第一次调整正负样本的基础上,重新调整正样本中的[同花顺],比如,增加正样本中包括[同花顺]的内容。这样就能有效提高特征“同花”、“花顺”分别对应的权重的取值。
经第二次正负样本的调整后,训练样本1中,各个特征及与每个特征对应的权重,如下:
同花:-0.03474357
花顺:-0.03474357
顺股:0.30918131
股市:1.57149447
文本的值:-0.03474357-0.03474357+0.30918131+1.57149447=1.81118864
经第二次正负样本的调整后,训练样本2中,各个特征及与每个特征对应的权重,如下:
同花:-0.03474357
花顺:-0.03474357
顺炒:-0.34392488
炒股:-0.34392488
股票:1.99415611
文本的值:
-0.03474357-0.03474357-0.34392488-0.34392488+1.99415611=1.23681921
经过第二次正负样本的调整后,系统对训练样本1及训练样本2的第三轮处理结果如下:
训练样本1
用户输入的语音对应的文本:同花顺股市
领域识别结果:[stock]return from<priority2>
训练样本2
用户输入的语音对应的文本:同花顺炒股票
领域识别结果:[other]return from<priority3>
在本申请实施例中,系统结合每一轮领域识别结果的正确或是错误,对正负样本进行调整,直至训练样本1和训练样本2都得到正确的领域识别结果为止。由此可见,训练样本的数量越多,那么经过调整后的正负样本集合的准确性越高。
为了减少停用词、数字、地点名称等内容对领域识别过程带来的干扰,在本申请实施例中,还可以通过识别句式的方式来确定领域识别结果,或是通过干扰项替代的方式,简化领域识别过程。这样不仅可以提升领域识别过程的准确度,还可以进一步节省领域识别过程占用的时间。
在一种示例性的实现方式中,对于子领域分类器难以识别的句式,或是容易对领域识别结果产生较大影响的句式,可以预先基于句式设置规则,供控制层的文本全精度匹配过程使用。
比如,实例1至3经过语音识别得到的文本内容及通过系统进行领域识别后得到的领域识别结果如下:
实例1:
用户输入的语音指令对应的文本:查询一下方大炭素的股
领域识别结果:股票
实例2:
用户输入的语音指令对应的文本:查询一下南京港和江铜CWB1的股票600160
领域识别结果:股票
实例3:
用户输入的语音指令对应的文本:搜一下昨天在北京拍的图片
领域识别结果:图库
针对实例1所示的文本,可以预先设置句式为“查询……的股”,这样在用户输入错误或是语音识别产生遗漏时,只要文本中包括该句式,就能够使系统准确识别出该句式,并依据该句式对文本所属的领域进行区分,从而得到准确的领域识别结果。
比如,规则可以预先设置为[^(搜|查|看|告诉|打开).{1,12}(的股)$],供系统对文本进行快速的识别、匹配,并将得到的领域识别结果反馈给手机。其中,句式“[^(搜|查|看|告诉|打开).{1,12}(的股)$]”的含义可以参考上文描述,在此不予赘述。
在一种示例性的实现方式中,对于上述规则无法匹配的文本,仍然需要分类器层进行处理。
以实例3为例,在本申请实施例中,可以将“搜……的图片”作为模式。这样对于图库领域而言,该图库领域对应的子领域分类器中涉及的规则可以包括该模式,也就意味着在文本识别的过程中,当子领域分类器识别到该模式,就可以反馈有效的领域识别结果,即该文本的领域为图库。
以实例2为例,可以预先为系统设置公用特征,以防止该公用特征造成的领域识别不准确的问题。在系统对该文本进行处理时,可以先将文本中连续6位的数字进行替换,比如,将[600160]替换为@,这样该文本的内容为“查询一下南京港和江铜CWB1的股票@”。之后系统可以调用NER提取该文本中的NE信息作为公用特征,比如,将[南京港]定义为一个“普通公司名称”实体,将该实体替换为#;将[江铜CWB1]定义为一个“上市公司名称代号”实体,将该实体替换为@。那么该文本的内容为“查询一下#和@的股票@”。
同理,可以将文本中的时间替换为$,将地点替换为#,那么实例3中的文本的内容为“搜一下$在#拍的图片”。
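上述两步替换(先替换连续 6 位数字,再按 NER 结果将实体替换为对应符号)可示意如下。其中,实体表为替代 NER 输出的假设数据:

```python
import re

def replace_common_features(text, entities):
    """先将连续 6 位数字替换为 @,再将 NER 识别出的实体替换为对应符号。"""
    text = re.sub(r"\d{6}", "@", text)       # 连续 6 位数字 -> @
    for word, symbol in entities.items():    # 实体 -> 预先设置的符号
        text = text.replace(word, symbol)
    return text

# 假设 NER 已识别出的实体及其替换符号
entities = {"南京港": "#", "江铜CWB1": "@", "昨天": "$", "北京": "#"}
```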
在完成上述替换过程之后,实例2中各个特征及与每个特征对应的权重,如下:
查询:0.1067646020633481
询一:-0.10021895439172483
一下:-0.215034710246020433
下#:0.1067646020633481
#和:null
其中,null表示特征“#和”对领域识别结果没有影响,或是该特征的权重为0。
和@:0.12009207293891772
@股:0.304457783445201952
股票:1.1114948005328673
票@:0.3067646020633481
在完成上述替换过程之后,实例3中各个特征及与每个特征对应的权重,如下:
搜一:0.3835541240544907
一下:-0.2517062504931636
下$:0.14542119078470123
$在:0.094333521958256
在#:0.19608161704432386
#拍:-0.006875871484002316
拍的:0.5827998208565368
的图:0.26154773801450293
图片:0.17497209951796067
+:1.4622835953886275
其中,特征“+”表示经过替换后的文本满足子领域分类器中定义的模式,因此,在计算文本对应的值时,经过替换后的文本满足子领域分类器中定义的模式,而得到权重加成,以提高领域识别的准确率。
经过替换后的文本对应的值:3.04241158392
由此可见,由于上述替换过程,替换了诸如时间、地点等公用特征,而公用特征往往包括至少两个字,也就意味着,在完成替换之后,得到的特征与权重的对应关系的数量有所减少。尤其是对于文本中涉及较多公用特征的情况而言,这样的替换方式,能够有效简化子领域分类器的计算过程,从而提升子领域分类器的工作效率。并且,这样的替换方式,能够有效降低公用特征对于领域识别带来的干扰。
下面举例几个不同领域中,基于语音指令得到的语音播放内容或是显示结果。
用户输入的语音指令对应的文本:把字体调大一点
应答:好的,已为您调整
手机将经过语音识别后得到的内容为“把字体调大一点”的文本,进行领域识别,得到的领域识别结果为,该文本属于设置领域。之后手机将该文本发送至与设置领域对应的对话引擎进行处理。需要说明的是,在手机给予用户应答时,手机已应用户的要求将字体调大。
用户输入的语音指令对应的文本:请帮我设一个今天下午两点到三点的免打扰,老王除外
应答:免打扰已开启,从14:00到15:00,老王除外
手机将经过语音识别后得到的内容为“请帮我设一个今天下午两点到三点的免打扰,老王除外”的文本,进行领域识别,得到的领域识别结果为,该文本属于免打扰领域。之后手机将该文本发送至与免打扰领域对应的对话引擎进行处理。需要说明的是,在手机给予用户应答时,手机已应用户的要求设置了免打扰的开启时间,且确保了在免打扰的时间段内,针对老王的来电等,依旧会提示用户。
用户输入的语音指令对应的文本:请帮我百度一下范冰冰的图片
应答:跳转到百度显示范冰冰的相关照片
手机将经过语音识别后得到的内容为“请帮我百度一下范冰冰的图片”的文本,进行领域识别,得到的领域识别结果为,该文本属于图库领域。之后手机将该文本发送至与图库领域对应的对话引擎进行处理。需要说明的是,在手机给予用户应答时,手机已应用户的要求完成图片搜索,即当前已经通过百度向用户呈现相关照片。
用户输入的语音指令对应的文本:筷子用英语怎么说
应答:chopsticks
手机将经过语音识别后得到的内容为“筷子用英语怎么说”的文本,进行领域识别,得到的领域识别结果为,该文本属于翻译领域。之后手机将该文本发送至与翻译领域对应的对话引擎进行处理。需要说明的是,在手机给予用户应答时,手机已经按照用户的需求完成“筷子”一词的翻译。
用户输入的语音指令对应的文本:今天天气怎么样
应答:上海今天小雨,7℃~10℃,空气质量指数88,空气还可以,请记得带伞
手机将经过语音识别后得到的内容为“今天天气怎么样”的文本,进行领域识别,得到的领域识别结果为,该文本属于天气领域。之后手机将该文本发送至与天气领域对应的对话引擎进行处理。需要说明的是,在手机给予用户应答时,手机已经结合用户当前所处的地理位置,确定了该地理位置的天气情况。
用户输入的语音指令对应的文本:万科的股票
应答:万科A现在39.42元,跌幅0.86%,已休市
手机将经过语音识别后得到的内容为“万科的股票”的文本,进行领域识别,得到的领域识别结果为,该文本属于股票领域。之后手机将该文本发送至与股票领域对应的对话引擎进行处理。需要说明的是,在手机给予用户应答时,手机已确定用户所要了解的股票情况。
用户输入的语音指令对应的文本:2的13次方等于多少
应答:2的13次方等于8192
手机将经过语音识别后得到的内容为“2的13次方等于多少”的文本,进行领域识别,得到的领域识别结果为,该文本属于计算领域。之后手机将该文本发送至与计算领域对应的对话引擎进行处理。需要说明的是,在手机给予用户应答时,手机已通过计算的方式,确定出用户期望得到的计算结果。
用户输入的语音指令对应的文本:姚明的身高
应答:姚明,1980年9月12日出生于上海市徐汇区,祖籍江苏省苏州市吴江区震泽镇,前中国职业篮球运动员,司职中锋,现任中职联公司董事长兼总经理。1998年4月,姚明入选王非执教的国家队,开始篮球生涯。2001夺得CBA常规赛MVP,2002年夺得CBA总冠军以及总决赛MVP,分别3次当选CBA篮板王以及盖帽王,2次当选CBA扣篮王。
手机将经过语音识别后得到的内容为“姚明的身高”的文本,进行领域识别,得到的领域识别结果为,该文本属于百科领域。之后手机可以将提取到的关键字在百科中搜索,并将搜索到的结果呈现给用户,与此同时,手机可以将搜索到的相关内容选择性地呈现给用户。需要说明的是,在手机给予用户应答时,手机已搜索到姚明的身高及其他相关信息。
其中,上述各个实例中,应答的方式包括但不限于通过文字提示的方式或是语音提示的方式,在此不予限定。
本申请实施例可以根据上述方法实施例对电子设备进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。需要说明的是,本申请实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
如图8所示,为上述实施例中所涉及的电子设备进行语音识别装置的一种示例性的结构示意图。电子设备进行语音识别装置400包括:接收模块401、转换模块402、第一领域识别模块403、处理模块404、第二领域识别模块405、控制模块406、子领域分类器407。其中,子领域分类器407包括命名实体识别模块4071、替换模块4072、提取模块4073、计算模块4074,以及领域确定模块4075。需要说明的是,在该电子设备400中包括至少一个子领域分类器407,在此不予限定。
其中,接收模块401用于支持电子设备400接收语音指令。比如,用户通过电子设备输入的语音指令,即如图4所示的语音输入。转换模块402用于支持电子设备400将语音指令转换为文本,比如,如图4所示的将输入的语音通过语音识别的方式转换为文本。第一领域识别模块403用于支持电子设备400通过至少两个子领域分类器对文本进行识别,得到领域识别结果。比如,如图4所示的分类器层中各个优先级(即子领域分类器组)中的子领域分类器对文本进行识别,比如,优先级1中子领域分类器11、子领域分类器12和子领域分类器13对文本进行并行的识别。处理模块404用于支持电子设备400通过文本所属领域对应的对话引擎对文本进行处理,以确定文本对应的电子设备需要执行的功能,以及用于支持电子设备400实现本文所描述的技术的其它过程等。第二领域识别模块405用于支持电子设备400将文本与预存文本进行匹配,比如,如图4所示的控制层中文本快速全精度匹配,在文本与预存文本匹配成功时,确定预存文本对应的领域为文本的领域识别结果;在文本和预存文本的匹配失败时,再将文本输入到分类器层,由第一领域识别模块403通过至少两个子领域分类器对文本进行领域识别,以得到领域识别结果。
在本申请实施例的一个实现方式中,第一领域识别模块包括N个子领域分类器组,其中,每个组有不同的优先级,N为大于或等于2的正整数。N个子领域分类器组中的至少一个组中包括至少两个子领域分类器。每个子领域分类器用于确认所述文本是否属于本子领域分类器对应的领域。控制模块406用于支持电子设备400控制N个子领域分类器组中最高优先级组中的子领域分类器对文本进行领域识别,比如,如图4所示,控制分类器层中最高优先级组,即优先级1中的子领域分类器对文本进行领域识别。若最高优先级组中的子领域分类器识别出文本所属的领域,则将最高优先级组中的子领域分类器识别出文本所属的领域作为领域识别结果;若最高优先级组中的子领域分类器未识别出文本所属的领域,则通过N个子领域分类器组中下一优先级组中的子领域分类器对文本进行领域识别,比如,如图4所示,在文本经过优先级1中的子领域分类器进行领域识别后未得到领域识别结果,那么将由优先级2中的子领域分类器对文本进行领域识别,直至:识别出文本所属的领域,并将识别出的领域作为领域识别结果;或文本已经过N个子领域分类器组中所有子领域分类器进行领域识别。比如,如图4所示,文本经过分类器层的优先级1、优先级2和优先级3中的所有子领域分类器的领域识别后未得到领域识别结果,那么结束本次对语音指令的处理过程。
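上述“按优先级从高到低逐组进行领域识别,直至得到领域识别结果或遍历完所有子领域分类器组”的控制流程,可用如下示意代码概括。分类器的函数化抽象、示例中的判定规则与兜底分类器均为假设,仅用于说明控制逻辑:

```python
from typing import Callable, List, Optional

# 每个子领域分类器抽象为一个函数:识别成功时返回领域名称,否则返回 None
Classifier = Callable[[str], Optional[str]]

def recognize_domain(text: str, groups: List[List[Classifier]]) -> Optional[str]:
    """groups 按优先级从高到低排列;同组内的分类器可并行执行,这里顺序模拟。"""
    for group in groups:
        # 收集本优先级组中所有有效的领域识别结果
        results = [r for clf in group if (r := clf(text)) is not None]
        if results:
            # 同组得到多个结果时,可确定其中至少一项为领域识别结果;这里取第一项
            return results[0]
    # 所有组均未识别出文本所属的领域,结束本次处理
    return None

groups = [
    [lambda t: "图库" if "图片" in t else None,
     lambda t: "股票" if "股票" in t else None],  # 优先级1
    [lambda t: "百科"],                            # 优先级2(兜底)
]
print(recognize_domain("搜一下去年在北京拍的图片", groups))  # 图库
```

该流程也体现了低优先级组仅在高优先级组未给出结果时才被调用这一特点。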
控制模块406还用于当第一子领域分类器对文本进行领域识别后得到第一领域识别结果,且第二子领域分类器对文本进行领域识别后得到第二领域识别结果时,确定第一领域识别结果和第二领域识别结果中的至少一项为领域识别结果,或是确定第一领域识别结果和第二领域识别结果均为领域识别结果。以图4所示的优先级1为例,当子领域分类器11得到第一领域识别结果,子领域分类器12得到第二领域识别结果时,控制模块406执行上述过程。需要说明的是,此时,若子领域分类器13得到第三领域识别结果,那么控制模块406确定第一领域识别结果、第二领域识别结果和第三领域识别结果中的至少一个领域识别结果为文本的领域识别结果。
在子领域分类器407中,命名实体识别模块4071用于支持电子设备400对文本进行NER,并确定识别出的内容中的公用特征。替换模块4072用于支持电子设备400按照预设规则对文本中的公用特征进行替换。提取模块4073用于支持电子设备400对完成替换的文本进行特征提取,并确定每个特征的权重。计算模块4074用于支持电子设备400根据每个特征的权重,计算文本的值。领域确定模块4075用于支持电子设备400当文本的值大于阈值时,确定文本属于本子领域分类器对应的领域。其中,子领域分类器407可以为如图4所示的分类器层中涉及的任意一个子领域分类器。
在本申请实施例的一个实现方式中,电子设备400还可以包括存储模块408、通信模块409以及显示模块410中的至少一项。其中,存储模块408用于支持电子设备400存储电子设备的程序代码和数据;通信模块409可以支持电子设备400中各个模块之间进行数据交互,和/或支持电子设备400与诸如服务器、其他电子设备等之间的通信;显示模块410可以支持电子设备400将语音指令的处理结果通过文字、图形等方式呈现给用户,或是在语音识别过程中,选择性地向用户呈现语音识别的过程等,在此不予限定。
其中,接收模块401、通信模块409可以实现为收发器;转换模块402、第一领域识别模块403、处理模块404、第二领域识别模块405、控制模块406和子领域分类器407可以实现为处理器;存储模块408可以实现为存储器;显示模块410可以实现为显示器。
在本申请实施例的一个实现方式中,上述处理器也可以为控制器,例如可以是CPU,通用处理器,数字信号处理器(Digital Signal Processor,DSP),专用集成电路(Application-Specific Integrated Circuit,ASIC),现场可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、晶体管逻辑器件、硬件部件或者其任意组合。其可以实现或执行结合本申请公开内容所描述的各种示例性的逻辑方框,模块和电路。所述处理器也可以是实现计算功能的组合,例如包含一个或多个微处理器组合,DSP和微处理器的组合等等。上述收发器还可以实现为收发电路或通信接口等。
如图9所示,电子设备50可以包括:处理器51、收发器52、存储器53、显示器54,以及总线55。其中,收发器52、存储器53及显示器54为可选部件,即电子设备50可以包括上述可选部件中的一项或多项。处理器51、收发器52、存储器53、显示器54通过总线55相互连接;总线55可以是外设部件互连标准(Peripheral Component Interconnect,PCI)总线或扩展工业标准结构(Extended Industry Standard Architecture,EISA)总线等。所述总线可以分为地址总线、数据总线、控制总线等。为便于表示,图9中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
结合本申请公开内容所描述的方法或者算法的步骤可以以硬件的方式来实现,也可以是由处理器执行软件指令的方式来实现。软件指令可以由相应的软件模块组成,软件模块可以被存放于随机存取存储器(Random Access Memory,RAM)、闪存、只读存储器(Read Only Memory,ROM)、可擦除可编程只读存储器(Erasable Programmable ROM,EPROM)、电可擦可编程只读存储器(Electrically EPROM,EEPROM)、寄存器、硬盘、移动硬盘、只读光盘(Compact Disc Read-Only Memory,CD-ROM)或者本领域熟知的任何其它形式的存储介质中。一种示例性的存储介质耦合至处理器,从而使处理器能够从该存储介质读取信息,且可向该存储介质写入信息。当然,存储介质也可以是处理器的组成部分。处理器和存储介质可以部署在同一设备中,或者,处理器和存储介质也可以作为分立组件部署在不同的设备中。
本申请实施例提供一种可读存储介质,包括指令。当该指令在电子设备上运行时,使得该电子设备执行上述的方法。
本申请实施例提供一种计算机程序产品,该计算机程序产品包括软件代码,该软件代码用于执行上述的方法。
以上所述的具体实施方式,对本申请实施例的目的、技术方案和有益效果进行了进一步详细说明,所应理解的是,以上所述仅为本申请的具体实施方式而已,并不用于限定本申请的保护范围,凡在本申请实施例的技术方案的基础之上,所做的任何修改、等同替换、改进等,均应包括在本申请实施例的保护范围之内。

Claims (19)

  1. 一种电子设备进行语音识别的方法,其特征在于,所述方法包括:
    将接收的语音指令转换为文本;
    通过至少两个子领域分类器对所述文本进行领域识别,得到领域识别结果,所述领域识别结果用于表示所述文本所属的领域;
    通过所述文本所属的领域对应的对话引擎对所述文本进行处理,确定所述文本对应的所述电子设备需要执行的功能。
  2. 根据权利要求1所述的方法,其特征在于,在所述将接收的语音指令转换为文本之后,所述方法还包括:
    将所述文本与预存文本进行匹配;
    当所述文本与所述预存文本匹配成功时,确定所述预存文本对应的领域为所述文本的领域识别结果。
  3. 根据权利要求2所述的方法,其特征在于,所述通过至少两个子领域分类器对所述文本进行领域识别,得到领域识别结果,具体为:
    当所述文本与所述预存文本匹配失败时,通过至少两个子领域分类器对所述文本进行领域识别,得到领域识别结果。
  4. 根据权利要求1至3中任意一项所述的方法,其特征在于,所述电子设备包括N个子领域分类器组,其中,每个组有不同的优先级,N为大于或等于2的正整数;
    所述通过至少两个子领域分类器对所述文本进行领域识别,得到领域识别结果,具体为:
    通过所述N个子领域分类器组中最高优先级组中的子领域分类器对所述文本进行领域识别;
    若所述最高优先级组中的子领域分类器识别出所述文本所属的领域,则将所述最高优先级组中的子领域分类器识别出所述文本所属的领域作为所述领域识别结果;
    若所述最高优先级组中的子领域分类器未识别出所述文本所属的领域,则通过所述N个子领域分类器组中下一优先级组中的子领域分类器对所述文本进行领域识别,直至:
    识别出所述文本所属的领域,并将识别出的领域作为所述领域识别结果;或
    所述文本已经过所述N个子领域分类器组中所有子领域分类器进行领域识别;
    所述N个子领域分类器组中的至少一个组中包括至少两个子领域分类器。
  5. 根据权利要求4所述的方法,其特征在于,所述N个子领域分类器组的至少一个组中的至少两个子领域分类器对所述文本并行进行领域识别。
  6. 根据权利要求4或5所述的方法,其特征在于,所述N个子领域分类器组中,低优先级组中的子领域分类器的领域识别准确率低于高优先级组中的子领域分类器的领域识别准确率。
  7. 根据权利要求4至6中任意一项所述的方法,其特征在于,所述N个子领域分类器组中的至少一个组包括第一子领域分类器和第二子领域分类器,所述方法还包括:
    当所述第一子领域分类器对所述文本进行领域识别后得到第一领域识别结果,且所述第二子领域分类器对所述文本进行领域识别后得到第二领域识别结果时,
    确定所述第一领域识别结果和所述第二领域识别结果中的至少一项为所述领域识别结果;或
    确定所述第一领域识别结果和所述第二领域识别结果均为所述领域识别结果。
  8. 根据权利要求1至5中任意一项所述的方法,其特征在于,所述至少两个子领域分类器中的至少一个对所述文本进行领域识别,包括:
    对所述文本进行命名实体识别NER,并确定识别出的内容中的公用特征;
    按照预设规则,将所述公用特征进行替换,所述预设规则包括不同类别的公用特征对应的替换内容;
    对完成替换的文本进行特征提取,并确定每个特征的权重;
    根据所述每个特征的权重,计算所述文本的值;
    当所述文本的值大于阈值时,确定所述文本属于本子领域分类器对应的领域。
  9. 一种电子设备,其特征在于,所述电子设备包括:
    接收模块,用于接收语音指令;
    转换模块,将所述接收模块接收的所述语音指令转换为文本;
    第一领域识别模块,用于通过至少两个子领域分类器对所述转换模块经转换得到的所述文本进行领域识别,得到领域识别结果,所述领域识别结果用于表示所述文本所属的领域;
    处理模块,用于通过所述第一领域识别模块确定的所述文本所属的领域对应的对话引擎对所述文本进行处理,确定所述文本对应的所述电子设备需要执行的功能。
  10. 根据权利要求9所述的电子设备,其特征在于,所述电子设备还包括:
    第二领域识别模块,用于将所述文本与预存文本进行匹配,在所述文本与所述预存文本匹配成功时,确定所述预存文本对应的领域为所述文本的领域识别结果。
  11. 根据权利要求10所述的电子设备,其特征在于,所述第一领域识别模块,具体用于:
    当所述第二领域识别模块对所述文本和所述预存文本的匹配失败时,通过至少两个子领域分类器对所述文本进行领域识别,得到领域识别结果。
  12. 根据权利要求9至11中任意一项所述的电子设备,其特征在于,所述第一领域识别模块包括:
    N个子领域分类器组,其中,每个组有不同的优先级,N为大于或等于2的正整数;所述N个子领域分类器组中的至少一个组中包括至少两个子领域分类器;每个子领域分类器用于确认所述文本是否属于本子领域分类器对应的领域;
    控制模块,用于:
    控制所述N个子领域分类器组中最高优先级组中的子领域分类器对所述文本进行领域识别;
    若所述最高优先级组中的子领域分类器识别出所述文本所属的领域,则将所述最高优先级组中的子领域分类器识别出所述文本所属的领域作为所述领域识别结果;
    若所述最高优先级组中的子领域分类器未识别出所述文本所属的领域,则通过所述N个子领域分类器组中下一优先级组中的子领域分类器对所述文本进行领域识别,直至:
    识别出所述文本所属的领域,并将识别出的领域作为所述领域识别结果;或
    所述文本已经过所述N个子领域分类器组中所有子领域分类器进行领域识别。
  13. 根据权利要求12所述的电子设备,其特征在于,所述N个子领域分类器组的至少一个组中的至少两个子领域分类器对所述文本并行进行领域识别。
  14. 根据权利要求12或13所述的电子设备,其特征在于,所述N个子领域分类器组中,低优先级组中的子领域分类器的领域识别准确率低于高优先级组中的子领域分类器的领域识别准确率。
  15. 根据权利要求12至14中任意一项所述的电子设备,其特征在于,所述N个子领域分类器组中的至少一个组包括第一子领域分类器和第二子领域分类器,
    当所述第一子领域分类器对所述文本进行领域识别后得到第一领域识别结果,且所述第二子领域分类器对所述文本进行领域识别后得到第二领域识别结果时,
    所述控制模块,还用于:
    确定所述第一领域识别结果和所述第二领域识别结果中的至少一项为所述领域识别结果;或
    确定所述第一领域识别结果和所述第二领域识别结果均为所述领域识别结果。
  16. 根据权利要求9至13中任意一项所述的电子设备,其特征在于,所述子领域分类器包括:
    命名实体识别NER模块,用于对所述文本进行NER,并确定识别出的内容中的公用特征;
    替换模块,用于按照预设规则,将所述识别模块确定的所述公用特征进行替换,所述预设规则包括不同类别的公用特征对应的替换内容;
    提取模块,用于对所述替换模块完成替换的文本进行特征提取,并确定每个特征的权重;
    计算模块,用于根据所述提取模块确定的所述每个特征的权重,计算所述文本的值;
    领域确定模块,用于当所述文本的值大于阈值时,确定所述文本属于本子领域分类器对应的领域。
  17. 一种电子设备,包括存储器,一个或多个处理器,多个应用程序,以及一个或多个程序;其中,所述一个或多个程序被存储在所述存储器中;其特征在于,所述一个或多个处理器在执行所述一个或多个程序时,使得所述电子设备实现如权利要求1至8中任意一项所述的方法。
  18. 一种可读存储介质,其特征在于,所述可读存储介质中存储有指令,当所述指令在电子设备上运行时,使得所述电子设备执行上述权利要求1至8中任意一项所述的方法。
  19. 一种计算机程序产品,其特征在于,所述计算机程序产品包括软件代码,所述软件代码用于执行上述权利要求1至8中任意一项所述的方法。
PCT/CN2018/078056 2018-03-05 2018-03-05 一种电子设备进行语音识别方法及电子设备 WO2019169536A1 (zh)
