WO2021120876A1 - Corpus identification method, device, terminal apparatus, and medium - Google Patents

Corpus identification method, device, terminal apparatus, and medium

Info

Publication number
WO2021120876A1
WO2021120876A1 PCT/CN2020/124764 CN2020124764W WO2021120876A1 WO 2021120876 A1 WO2021120876 A1 WO 2021120876A1 CN 2020124764 W CN2020124764 W CN 2020124764W WO 2021120876 A1 WO2021120876 A1 WO 2021120876A1
Authority
WO
WIPO (PCT)
Prior art keywords
corpus
intent
category
recognition
nlu
Prior art date
Application number
PCT/CN2020/124764
Other languages
French (fr)
Chinese (zh)
Inventor
刘志强
李前国
叶筠
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2021120876A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • This application relates to the field of information technology, in particular to a corpus recognition method, device, terminal device and medium.
  • Recall ratio, also known as recall rate, refers to the ratio of the amount of relevant information retrieved from the database to the total amount of relevant information.
  • AI Artificial Intelligence
  • Increasing the service recall rate helps to enhance the user's service experience. For example, when a user uses a voice assistant service on a terminal such as a mobile phone, whether the voice assistant can accurately understand what the user says and then complete the corresponding task or return the corresponding information greatly affects normal use.
  • Terminal manufacturers choose to access multiple third-party content providers (CP) with natural language understanding (NLU) capabilities simultaneously in their terminals.
  • The NLU system of each CP recognizes the corpus input by the user, and one of the results is then selected and returned to the user.
  • The NLU systems of the multiple CPs that the terminal accesses at the same time each have their own strengths and weaknesses, and the CPs do not share a unified standard for defining fields or intents, so their results are difficult to compare accurately.
  • Although the introduction of multiple NLU systems has increased the recall rate, it has also increased the number of awkward chats and reduced the accuracy of corpus recognition.
  • The embodiments of this application provide a corpus recognition method, device, terminal equipment and medium, which can improve the service recall rate and the accuracy of corpus recognition by performing fine-grained credibility processing on the domains or intents that multiple external NLU systems recognize for a corpus.
  • an embodiment of the present application provides a corpus recognition method, including:
  • the original corpus is recognized according to the credibility of the intention.
  • the use of multiple natural language understanding NLU engines to recognize the original corpus and obtain the intent category corresponding to each NLU engine respectively includes:
  • the determining the intent credibility of the original corpus according to the intent category of each NLU engine includes:
  • Determining the intent score corresponding to the intent category of each NLU engine, and the intent score corresponding to each intent category is obtained by using each NLU engine to test a sample corpus;
  • the intent credibility of the original corpus is calculated.
  • the calculating the intent credibility of the original corpus according to each intent category and its corresponding intent score includes:
  • the weight value is used to perform a weighted summation on the intent score corresponding to each intent category to obtain the intent credibility of the original corpus.
  • the recognizing the original corpus according to the intent credibility includes:
  • the original corpus is recognized as a valid corpus
  • the original corpus is identified as an invalid corpus.
  • the method further includes:
  • the invalid corpus is divided into multiple corpus categories according to the intent category, and the multiple NLU engines are used again to analyze each corpus category. If the intent category recognized by each NLU engine remains unchanged, the invalid corpus in the corpus category is recognized as a valid corpus.
  • the method further includes:
  • the effective corpus, the initial category of the effective corpus, and the intent category recognized by each NLU engine are associated and stored in a corpus.
  • the method further includes:
  • a whitelist of the corpus is generated.
  • the dividing the multiple valid corpora into multiple recognition categories according to the initial categories and intent categories of the multiple stored valid corpora includes:
  • the valid corpus with the same initial category and intent category are classified into the same recognition category.
  • the generating the whitelist of the corpus according to the number of valid corpora contained in each recognition category includes:
  • the recognition classes in the preset sorting interval are extracted as the whitelist of the corpus.
  • an embodiment of the present application provides a corpus recognition method, including:
  • the whitelist is generated through the following steps:
  • a whitelist of the corpus is generated.
  • the dividing the multiple valid corpora into multiple recognition categories according to the initial categories and intent categories of the multiple valid corpora stored in the corpus includes:
  • the valid corpus with the same initial category and intent category are classified into the same recognition category.
  • generating the whitelist of the corpus according to the number of valid corpora contained in each recognition category includes:
  • the recognition classes in the preset sorting interval are extracted as the whitelist of the corpus.
  • the extraction of a whitelist matching the intent category from a preset corpus includes:
  • a whitelist containing the initial category and at least one intent category recognized by the NLU engine is extracted from the corpus.
  • the identifying the corpus according to the intention categories included in the white list includes:
  • Determining an intent score corresponding to each intent category included in the whitelist, and the intent score corresponding to each intent category is obtained by using each NLU engine to test a sample corpus;
  • an embodiment of the present application provides a corpus recognition device, including:
  • the original corpus acquisition module is used to obtain the original corpus to be recognized
  • the intent category recognition module is configured to use multiple natural language understanding NLU engines to recognize the original corpus, and obtain the intent category corresponding to each NLU engine respectively;
  • the intent credibility determination module is used to determine the intent credibility of the original corpus according to the intent category of each NLU engine;
  • the original corpus recognition module is used to recognize the original corpus according to the intent credibility.
  • an embodiment of the present application provides a corpus recognition device, including:
  • the intent category recognition module is configured to use multiple NLU engines to recognize the target corpus when receiving the target corpus to be recognized, and obtain the intent category corresponding to each NLU engine respectively;
  • the whitelist extraction module is used to extract a whitelist that matches the intent category from a preset corpus
  • the target corpus recognition module is used to recognize the corpus according to the intention categories included in the whitelist.
  • an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and running on the processor.
  • When the processor executes the computer program, the corpus recognition method described in any one of the first aspect or the second aspect is implemented.
  • An embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor of a terminal device, the corpus recognition method according to any one of the first aspect or the second aspect is implemented.
  • The embodiments of the present application provide a computer program product which, when run on a terminal device, causes the terminal device to execute the corpus recognition method according to any one of the first aspect or the second aspect above.
  • the embodiments of the present application include the following beneficial effects:
  • Multiple NLU engines are used to recognize the original corpus, yielding the intent category corresponding to each NLU engine. The intent credibility of the original corpus is then determined according to the intent category of each NLU engine, so that the original corpus can be recognized according to the intent credibility.
  • This embodiment can perform fine-grained scoring and credibility processing on external NLU services according to domains or intentions, so as to realize the recognition of massive corpus, thereby generating corpus.
  • When the terminal performs corpus recognition based on the above-mentioned corpus, it can effectively improve the service recall rate and at the same time improve the accuracy of corpus recognition.
  • This embodiment can be widely used in artificial intelligence and other fields, especially in various application scenarios where a service recall rate needs to be realized based on natural language understanding.
  • FIG. 1 is a schematic step flowchart of a corpus recognition method provided by an embodiment of the present application
  • FIG. 2 is a schematic step flowchart of a corpus recognition method provided by another embodiment of the present application.
  • Fig. 3 is a schematic diagram of a corpus tagging process provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a corpus whitelist generation process provided by an embodiment of the present application.
  • FIG. 5 is a schematic step flowchart of a corpus recognition method provided by another embodiment of the present application.
  • FIG. 6 is a schematic diagram of the hardware structure of a mobile phone to which the corpus recognition method provided by an embodiment of the present application is applicable;
  • FIG. 7 is a schematic diagram of the software structure of a mobile phone to which the corpus recognition method provided by an embodiment of the present application is applicable;
  • FIG. 8 is a schematic flow chart of steps for generating a corpus whitelist provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a corpus whitelist application process provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the relationship between original corpus and intent categories provided by an embodiment of the present application.
  • FIG. 11 is a structural block diagram of a corpus recognition device provided by an embodiment of the present application.
  • FIG. 12 is a structural block diagram of a corpus recognition device provided by another embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • This method can be applied to fields such as artificial intelligence, especially in scenarios involving natural language understanding and other scenarios that require speech and text recognition.
  • the service recall rate can be significantly improved.
  • the original corpus to be recognized may be obtained by capturing various text information input by the user on the network, or may be obtained by performing text conversion on the user's voice.
  • the words or sentences spoken to the voice assistant, and the text information obtained after conversion can be used as the original corpus.
  • This embodiment does not limit the source of the original corpus.
  • the number of original corpus is very large, which may reach the order of millions. Since the process of processing each piece of original corpus according to the method provided in this embodiment is basically similar, in order to facilitate understanding, this embodiment only introduces the method with the process of processing one piece of original corpus.
  • multiple NLU engines may refer to NLU systems provided by multiple CPs for recognizing corpus.
  • Each NLU system can individually recognize the received original corpus and output the corresponding recognition result.
  • no less than three CPs that provide NLU services can be accessed, so that for the same original corpus, at least three NLU systems can obtain recognition results.
  • the recognition result in this embodiment may refer to the field or intent information to which the original corpus recognized by each NLU system belongs, that is, the intent category. For example, for a certain original corpus, after recognition, you can first determine whether the corpus belongs to a general encyclopedia or a practical tool.
  • NLU systems may be different in their areas of expertise. For example, a certain NLU system may have a higher recognition rate for sub-fields such as food and food delivery, while another NLU system may be better at knowledge of sub-fields such as maps and travel.
  • The credibility of the intent category can also be determined. For example, when an NLU system that has a high recognition rate for sub-domains such as food and food delivery is used for corpus recognition, the recognized intent category can be considered credible if it belongs to those sub-domains. Conversely, if the recognized intent category belongs to sub-domains such as maps and travel, its credibility is relatively low.
  • The credibility of the different recognized intent categories can be given by the CP that provides the NLU service, obtained by the terminal manufacturer that accesses each CP by testing each NLU system in advance to produce evaluation information for it, or obtained by combining the evaluations of both the CP and the terminal manufacturer; this is not limited in this embodiment.
  • S103 Determine the intent credibility of the original corpus according to the intent category of each NLU engine
  • the intent credibility of the original corpus can be jointly determined.
  • the intent credibility of the original corpus can refer to whether the original corpus is recognizable and can be understood by the NLU system.
  • If every NLU system's recognition result for a certain original corpus is general encyclopedia, the credibility of the corpus can be considered high. If the NLU systems produce different recognition results for a certain original corpus, the credibility of the corpus can be considered low.
  • the original corpus with higher credibility of intent can be marked as effective corpus, and the original corpus with lower credibility of intent can be marked as invalid corpus.
  • the above is the process of recognizing a single original corpus. After recognizing the massive corpus according to the above process, the massive corpus can be marked to generate the corpus.
  • When the terminal performs corpus recognition based on the above-mentioned corpus, it can effectively improve the service recall rate and at the same time improve the accuracy of corpus recognition.
  • this embodiment introduces the method from the perspective of constructing a whitelist of the corpus on the basis of annotating the original corpus to obtain a corpus containing a large number of valid corpora.
  • the original corpus to be recognized in this embodiment can also be obtained by collecting massive amounts of voice-to-text and manually inputted texts of live network users.
  • the original corpus to be recognized can reach a million or more.
  • the NLU services provided by different CPs can be implemented by different NLU engines, and the NLU engine is the NLU system corresponding to the CP.
  • the terminal can realize data interaction with each NLU engine through the corresponding processing interface.
  • The processing interfaces of the multiple NLU engines can be called first, and the aforementioned original corpus is then input into the processing interface of each NLU engine to instruct each NLU engine to recognize the original corpus. After each NLU engine recognizes the original corpus, it can return the corresponding result through the above-mentioned processing interface.
  • the terminal can receive the intent category output by each NLU engine as the recognition result of each NLU system on the original corpus.
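  • The fan-out described above can be sketched as follows. This is a minimal illustration only: the three engine functions are hypothetical stand-ins for each CP's processing interface, and the returned labels are made up for the example.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the processing interfaces of three CPs' NLU engines.
# Each takes the original corpus text and returns an intent category label.
def cp1_engine(text: str) -> str:
    return "sub-intent 1.1"

def cp2_engine(text: str) -> str:
    return "sub-intent 2.2"

def cp3_engine(text: str) -> str:
    return "sub-intent 3.1"

ENGINES: Dict[str, Callable[[str], str]] = {
    "CP1": cp1_engine,
    "CP2": cp2_engine,
    "CP3": cp3_engine,
}

def recognize_with_all(text: str, engines: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Send the same original corpus to every engine and collect the intent categories."""
    if len(engines) < 3:
        # The text suggests accessing no fewer than three CPs.
        raise ValueError("at least three NLU engines are expected")
    return {name: engine(text) for name, engine in engines.items()}

results = recognize_with_all("how is the weather tomorrow", ENGINES)
```

  • Keeping the engines behind a common callable signature means a CP can be added or removed without touching the aggregation logic.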
  • S203 Determine the intent score corresponding to the intent category of each NLU engine, and the intent score corresponding to each intent category is obtained by testing the sample corpus by using each NLU engine;
  • the intent score corresponding to the intent category can be determined.
  • the intent score corresponding to each intent category of the NLU engine can be obtained by using the NLU engine to test the sample corpus.
  • a part of the corpus can be collected in advance as the sample corpus for testing, and then each NLU engine connected to the terminal is used to identify the sample corpus to obtain the corresponding recognition results.
  • Each intent category of the NLU engine is then scored. When scoring, the intent categories that a given NLU engine is good at can be assigned a higher score, and the intent categories that it is not good at, or for which its recognition accuracy is relatively low, can be assigned a relatively low score.
  • Table 1 shows an example of the intent scores corresponding to the intent categories of each CP in this embodiment.
  • Table 1 gives examples of the scores of each intent category corresponding to the NLU engines provided by the three CPs, namely CP1, CP2, and CP3.
  • For CP1, the intent category it is good at is the one labeled sub-intent 1.2, so the intent score corresponding to this category is relatively high at 1.5 points. The sub-intent 1.1 category belongs to a field that the NLU engine provided by CP1 is not good at, so its intent score is relatively low at 0.5 points. For other categories, such as sub-intent 1.3 and sub-intent 1.4, the recognition results obtained with the NLU engine provided by CP1 are average, and the corresponding intent score is 1.0 point.
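  • A score table of this kind can be held as a nested mapping. The sketch below contains only the CP1 values stated in the text; the scores for CP2 and CP3 are omitted because Table 1 is not reproduced here, and the default score for unknown categories is an assumption.

```python
# Intent scores for CP1 as described in the text. Scores for CP2 and CP3 are
# left out because Table 1 itself is not available in this extract.
INTENT_SCORES = {
    "CP1": {
        "sub-intent 1.1": 0.5,  # category CP1 is not good at
        "sub-intent 1.2": 1.5,  # category CP1 is good at
        "sub-intent 1.3": 1.0,  # average
        "sub-intent 1.4": 1.0,  # average
    },
}

def intent_score(cp: str, category: str, default: float = 0.0) -> float:
    """Look up the pre-tested score for an engine's intent category.

    The default for unknown engines or categories is an assumption of this sketch.
    """
    return INTENT_SCORES.get(cp, {}).get(category, default)
```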
  • the intent credibility of the original corpus to be recognized can be calculated according to each intent category and its corresponding intent score.
  • the intent credibility of the original corpus can be calculated according to the following formula:
  • n is the number of CPs accessed by the terminal.
  • n ≥ 3, that is, the NLU systems provided by at least three CPs are accessed.
  • the intent category output by CP1 is sub-intent 1.1
  • the intent category output by CP2 is sub-intent 2.2
  • the intent category output by CP3 is sub-intent 3.1
  • When calculating the intent credibility of the original corpus, the weight value of each intent category can be determined first, and the weight values can then be used to perform a weighted summation of the intent scores corresponding to each intent category to obtain the intent credibility of the original corpus.
  • the aforementioned weight value may be set when scoring each intent category, for example, a higher weight is set for the intent category that the NLU system is good at.
  • This embodiment does not limit the specific method for calculating the intent credibility of the original corpus according to each intent category and its corresponding intent score.
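  • One possible reading of the weighted summation is sketched below. The function name, the equal default weights, and the numeric values are assumptions chosen for illustration; the embodiment does not fix a specific formula.

```python
from typing import Dict, Optional

def intent_credibility(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of the per-engine intent scores.

    If no weights are given, every engine gets weight 1.0 (an assumption).
    """
    if weights is None:
        weights = {cp: 1.0 for cp in scores}
    return sum(weights[cp] * score for cp, score in scores.items())

# Illustrative values only: one intent score per engine for a single corpus.
scores = {"CP1": 0.5, "CP2": 1.0, "CP3": 1.5}
weights = {"CP1": 0.5, "CP2": 1.0, "CP3": 2.0}

unweighted = intent_credibility(scores)          # 0.5 + 1.0 + 1.5
weighted = intent_credibility(scores, weights)   # 0.25 + 1.0 + 3.0
```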
  • The intent credibility can be compared with a preset credibility threshold. If the intent credibility is greater than or equal to the credibility threshold, the original corpus can be identified as a valid corpus; if the intent credibility is less than the credibility threshold, the original corpus can be identified as an invalid corpus.
  • the credibility threshold can be set according to actual needs.
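  • The comparison against the threshold is a single branch. In this sketch the function name and the label strings are hypothetical; the threshold itself is left as a parameter because the text says it is set according to actual needs.

```python
def label_corpus(credibility: float, threshold: float) -> str:
    """Return "valid" when credibility >= threshold, otherwise "invalid"."""
    return "valid" if credibility >= threshold else "invalid"
```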
  • the invalid corpus can be deleted.
  • If at least one of the multiple intent categories corresponding to the invalid corpus is not empty, it means that at least one NLU engine can recognize the corpus, but the NLU engines classify it inconsistently.
  • all invalid corpora can be divided into multiple corpus classes according to the identified intention categories, and the multiple NLU engines described above can be used again to identify the invalid corpus in each corpus class.
  • If the intent category recognized by each NLU engine remains unchanged, it means that the classification result each NLU engine obtains for the invalid corpora in the corpus class is stable, so the invalid corpora in the corpus class can be recognized as valid corpora.
  • For some original corpora, the calculated intent credibility is less than the above credibility threshold. If the above steps are followed, these original corpora should be marked as invalid corpora. However, if individual NLU engines can recognize these original corpora and output corresponding classification results, the invalid corpora with the same classification results can be classified into the same corpus class. For example, all corpora that CP1 and CP2 both fail to recognize but that CP3 recognizes as sub-intent 3.3 are classified into the same corpus class. The above three NLU engines are then used again to recognize each corpus in the corpus class.
  • If the recognition results remain unchanged, the invalid corpora in the corpus class can be marked as valid corpora and added to the corpus, with the sub-intent 3.3 recognized by CP3 used as the intent category corresponding to these corpora.
  • The original corpora below the credibility threshold are aggregated in batches, and if the results recognized by each NLU engine remain fixed after multiple recognitions, these corpora can be marked as valid corpora and stored in the corpus.
  • This embodiment uses the above-mentioned auxiliary means to re-recognize part of the corpora recognized as invalid, which reduces the large-scale deletion of original corpora caused by differences in the recognition accuracy of the NLU engines and can effectively expand the number of corpora.
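  • The re-recognition step might look like the sketch below. It assumes that an engine returns an empty string when it cannot recognize a corpus, and it checks stability per corpus with one extra pass; both choices, along with all names, are assumptions of this illustration rather than details given in the text.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Engine = Callable[[str], str]  # returns an intent category, or "" if unrecognized

def recognize(text: str, engines: Dict[str, Engine]) -> Tuple[str, ...]:
    """Intent categories from every engine, in a fixed engine order."""
    return tuple(engines[name](text) for name in sorted(engines))

def recover_stable(
    invalid: List[str], engines: Dict[str, Engine]
) -> Dict[Tuple[str, ...], List[str]]:
    """Group invalid corpora by first-pass categories, then keep those whose
    second-pass categories are unchanged."""
    classes: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for text in invalid:
        cats = recognize(text, engines)
        if any(cats):  # at least one engine recognized the corpus
            classes[cats].append(text)
    recovered: Dict[Tuple[str, ...], List[str]] = {}
    for cats, texts in classes.items():
        stable = [t for t in texts if recognize(t, engines) == cats]
        if stable:
            recovered[cats] = stable
    return recovered

# Hypothetical engines: CP1 and CP2 recognize nothing, CP3 recognizes one topic.
engines = {
    "CP1": lambda text: "",
    "CP2": lambda text: "",
    "CP3": lambda text: "sub-intent 3.3" if "lyrics" in text else "",
}
recovered = recover_stable(["lyrics of a song", "asdf qwer"], engines)
```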
  • S206 Obtain the initial category of the effective corpus, and store the effective corpus, the initial category of the effective corpus, and the intent category recognized by each NLU engine in a corpus in association;
  • the intent category of the corpus can be stored at the same time.
  • the initial category of the effective corpus can also be obtained, and the initial category can be obtained by roughly identifying the original corpus when the original corpus is collected.
  • the rough classification NLU system can be used to initially screen the original corpus, and the preliminary screening intention classification of each original corpus can be obtained as the initial category of the original corpus.
  • The original corpora marked as valid corpora are stored in the corpus together with their initial categories and the intent categories output by each NLU engine.
  • S207 According to the stored initial categories and intent categories of the multiple valid corpora, divide the multiple valid corpora into multiple recognition categories;
  • this embodiment can also perform aggregate statistics on the large amount of effective corpus in the corpus to construct a corpus whitelist.
  • the corresponding effective corpus with the same initial category and intent category can be divided into the same recognition category.
  • A category string corresponding to each valid corpus can be generated from its initial category and intent categories, such as [Initial category_0.1, CP1 intent category 1.1, CP2 intent category 2.1, CP3 intent category 3.2]. This category string indicates that the initial category of the valid corpus is sub-intent 0.1, and that the intent categories obtained by recognition with the three different NLU engines are sub-intent 1.1, sub-intent 2.1 and sub-intent 3.2 respectively.
  • After the category string of each valid corpus is determined in the above manner, all corpora with the same category string can be aggregated into the same recognition category.
  • the initial categories of the valid corpus in each recognition category are the same, and the intent categories obtained by using three different NLU engines to recognize the corpus in this recognition category are also the same.
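  • The aggregation by category string can be sketched with a tuple key in place of the string. The tuple layout, the function names, and the sample texts are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One stored record: (corpus text, initial category, {engine name: intent category}).
Entry = Tuple[str, str, Dict[str, str]]

def category_key(initial: str, intents: Dict[str, str]) -> Tuple[str, ...]:
    """Build the grouping key: initial category followed by each engine's
    intent category in a fixed engine order."""
    return (initial,) + tuple(intents[cp] for cp in sorted(intents))

def group_by_category(entries: Iterable[Entry]) -> Dict[Tuple[str, ...], List[str]]:
    """Aggregate corpora that share the same key into one recognition category."""
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for text, initial, intents in entries:
        groups[category_key(initial, intents)].append(text)
    return dict(groups)

intents_a = {"CP1": "1.1", "CP2": "2.1", "CP3": "3.2"}
intents_b = {"CP1": "1.2", "CP2": "2.1", "CP3": "3.2"}
groups = group_by_category([
    ("text one", "0.1", intents_a),
    ("text two", "0.1", intents_a),
    ("text three", "0.1", intents_b),
])
```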
  • S208 Count the number of valid corpora included in each recognition category, and generate a whitelist of the corpus according to the number of valid corpora included in each recognition category.
  • the number of corpus contained in each recognition category can be counted separately. For example, a certain recognition class contains 10,000 corpora, a certain recognition class contains 500 corpora, and so on.
  • a corpus whitelist can be constructed according to the number of corpora included.
  • The recognition classes that contain more corpora than a certain threshold can be selected as the corpus whitelist. For example, among all recognition categories obtained by aggregation, those that contain more than 2000 corpora can be taken as the corpus whitelist.
  • the above threshold can be determined according to actual needs, which is not limited in this embodiment.
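  • Count-based selection can be sketched as a filter over the grouped recognition categories. The function name and the small sample counts are hypothetical; the strict "more than" comparison follows the wording of the example.

```python
from typing import Dict, Hashable, List, Set

def whitelist_by_count(groups: Dict[Hashable, List[str]], min_count: int) -> Set[Hashable]:
    """Keep recognition categories that contain more than min_count corpora."""
    return {key for key, texts in groups.items() if len(texts) > min_count}

# Tiny illustrative groups; a real threshold might be 2000 as in the example above.
sample_groups = {
    "class A": ["a1", "a2", "a3"],
    "class B": ["b1"],
    "class C": ["c1", "c2"],
}
selected = whitelist_by_count(sample_groups, min_count=2)
```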
  • The recognition categories can be sorted according to the number of valid corpora each contains. For example, all recognition classes can be sorted in descending order of the number of corpora they contain. The recognition classes in a preset sorting interval are then extracted as the whitelist of the corpus. For example, the recognition classes ranked in the top 80% can be selected as the whitelist of the corpus.
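  • Rank-based selection can be sketched as a sort followed by a slice. Rounding the cut-off down with `math.floor` is an assumption of this sketch, as are the function name and the sample data.

```python
import math
from typing import Dict, Hashable, List

def whitelist_by_rank(groups: Dict[Hashable, List[str]], top_fraction: float = 0.8) -> List[Hashable]:
    """Sort recognition categories by corpus count (descending) and keep the top fraction."""
    ranked = sorted(groups, key=lambda key: len(groups[key]), reverse=True)
    keep = math.floor(len(ranked) * top_fraction)  # rounding down is an assumption
    return ranked[:keep]

ranked_groups = {
    "e": ["x"],
    "a": ["x"] * 5,
    "c": ["x"] * 3,
    "b": ["x"] * 4,
    "d": ["x"] * 2,
}
top = whitelist_by_rank(ranked_groups)
```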
  • The corpus whitelist can also be corrected by manual labeling, which is not limited in this embodiment.
  • The intent category corresponding to each NLU engine can be output, and the intent credibility of the original corpus can then be obtained according to each intent category and its corresponding intent score. The original corpora whose intent credibility exceeds the preset credibility threshold can therefore be marked as valid corpora, and a corpus containing a large number of valid corpora can then be generated.
  • the corresponding corpus whitelist can be obtained by performing aggregation statistics on the corpus in the corpus, which can be used for subsequent corpus recognition.
  • Each original corpus is recognized as a subcategory by each of the different NLU systems, and the subcategories are then further processed according to a unified standard to identify the intent category that best matches the original corpus. This can effectively increase the service recall rate while ensuring a certain recognition accuracy, and improves the efficiency and accuracy of corpus recognition.
  • This embodiment can also realize automatic labeling of massive corpora, solve the problem of relying on manual sorting and labeling to generate massive corpora, improve the efficiency of corpus generation, and help to obtain a richer corpus.
  • the generated corpus can continue to influence subsequent corpus recognition, making more corpora available for comparison and matching, and further improving the service recall rate.
  • FIG. 3 is a schematic diagram of a corpus tagging process in this embodiment.
  • A first NLU preliminary screening can be performed on the collected text to obtain the original corpus that requires further recognition by external NLUs, such as encyclopedia and small-talk corpus, together with the preliminary screening intent classification of the original corpus.
  • the first NLU preliminary screening can be to roughly process the text to be processed, and the intention classification obtained by the preliminary screening can be a larger range of classification categories.
  • the NLU processing interface of n CPs can be called, the original corpus is used as input information, and each NLU is used for identification, and the corresponding intent category and intent score are output.
  • no less than 3 CPs should be called.
  • From the intent category and intent score output by each NLU, the corresponding intent credibility can be calculated according to the set formula.
  • The original corpus can be marked as a valid or invalid corpus.
  • the initial screening intent classification of the corpus and the intent category output by each NLU can be recorded at the same time to complete the labeling of the original corpus.
  • FIG. 4 shows a schematic diagram of a corpus whitelist generation process of this embodiment.
  • the labeling process shown in Figure 3 can be repeated to obtain the initial screening intent classification of the large amount of corpus and the intent category output by each NLU.
  • the mass corpus can be aggregated and counted according to the initial screening intent classification and the intent category output by each NLU, and the top 80% of the recognition categories obtained by the statistics can be used as the corpus whitelist.
  • the whitelist can also be modified manually.
  • FIG. 5 shows a schematic step flowchart of a corpus recognition method provided by another embodiment of the present application.
  • the method may specifically include the following steps:
  • The corpus recognition method provided in this embodiment can be applied to terminal devices such as mobile phones, tablets, wearable devices, vehicle-mounted devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, and personal digital assistants (PDA).
  • Fig. 6 shows a block diagram of a part of the structure of a mobile phone provided by an embodiment of the present application.
  • The mobile phone includes components such as a radio frequency (RF) circuit 610, a memory 620, an input unit 630, a display unit 640, a sensor 650, an audio circuit 660, a wireless fidelity (Wi-Fi) module 670, a processor 680, and a power supply 690.
  • The RF circuit 610 can be used for receiving and sending signals during information transmission or communication. In particular, after downlink information from the base station is received, it is passed to the processor 680 for processing; in addition, uplink data is sent to the base station.
  • the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like.
  • the RF circuit 610 can also communicate with the network and other devices through wireless communication.
  • the above-mentioned wireless communication can use any communication standard or protocol, including but not limited to Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), etc.
  • the memory 620 may be used to store software programs and modules.
  • the processor 680 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 620.
  • the memory 620 may mainly include a program storage area and a data storage area.
  • the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through use of the mobile phone (such as audio data or a phone book).
  • the memory 620 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
  • the input unit 630 may be used to receive inputted digital or character information, and generate key signal input related to user settings and function control of the mobile phone 600.
  • the input unit 630 may include a touch panel 631 and other input devices 632.
  • the touch panel 631, also called a touch screen, can collect the user's touch operations on or near it (for example, operations performed on or near the touch panel 631 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connected device according to a preset program.
  • the touch panel 631 may include two parts: a touch detection device and a touch controller.
  • the touch detection device detects the user's touch position and the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends the coordinates to the processor 680, and can receive and execute commands sent by the processor 680.
  • the touch panel 631 can be implemented in multiple types such as resistive, capacitive, infrared, and surface acoustic wave.
  • the input unit 630 may also include other input devices 632.
  • the other input device 632 may include, but is not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackball, mouse, and joystick.
  • the display unit 640 may be used to display information input by the user or information provided to the user and various menus of the mobile phone.
  • the display unit 640 may include a display panel 641.
  • the display panel 641 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, etc.
  • the touch panel 631 can cover the display panel 641. When the touch panel 631 detects a touch operation on or near it, it passes the operation to the processor 680 to determine the type of the touch event, and the processor 680 then provides the corresponding visual output on the display panel 641 according to that type.
  • although the touch panel 631 and the display panel 641 are described here as two independent components that implement the input and output functions of the mobile phone, in some embodiments the touch panel 631 and the display panel 641 can be integrated to implement the input and output functions of the mobile phone.
  • the mobile phone 600 may also include at least one sensor 650, such as a light sensor, a motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor.
  • the ambient light sensor can adjust the brightness of the display panel 641 according to the brightness of the ambient light.
  • the proximity sensor can turn off the display panel 641 and/or the backlight when the mobile phone is moved to the ear.
  • the accelerometer sensor can detect the magnitude of acceleration in various directions (usually three-axis), and can detect the magnitude and direction of gravity when it is stationary.
  • the audio circuit 660, the speaker 661, and the microphone 662 can provide an audio interface between the user and the mobile phone.
  • on the one hand, the audio circuit 660 can convert received audio data into an electrical signal and transmit it to the speaker 661, which converts it into a sound signal for output; on the other hand, the microphone 662 converts the collected sound signal into an electrical signal, which the audio circuit 660 receives and converts into audio data. The audio data is then output to the processor 680 for processing and sent, for example, to another mobile phone via the RF circuit 610, or output to the memory 620 for further processing.
  • Wi-Fi is a short-distance wireless transmission technology. Through the Wi-Fi module 670, mobile phones can help users send and receive emails, browse web pages, and access streaming media. It provides users with wireless broadband Internet access. Although FIG. 6 shows the Wi-Fi module 670, it is understandable that it is not a necessary component of the mobile phone 600 and can be omitted as needed without changing the essence of the invention.
  • the processor 680 is the control center of the mobile phone. It uses various interfaces and lines to connect the parts of the entire mobile phone, and it performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 620 and calling the data stored in the memory 620, thereby monitoring the mobile phone as a whole.
  • the processor 680 may include one or more processing units; preferably, the processor 680 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and so on, and the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may not be integrated into the processor 680.
  • the mobile phone 600 also includes a power source 690 (such as a battery) for supplying power to various components.
  • the power source may be logically connected to the processor 680 through a power management system, so that functions such as charging, discharging, and power management are realized through the power management system.
  • the mobile phone 600 may also include a camera.
  • the position of the camera on the mobile phone 600 may be front-mounted or rear-mounted, which is not limited in the embodiment of the present application.
  • the mobile phone 600 may include a single camera, a dual camera, or a triple camera, etc., which is not limited in the embodiment of the present application.
  • the mobile phone 600 may include three cameras, of which one is a main camera, one is a wide-angle camera, and one is a telephoto camera.
  • the multiple cameras may be all front-mounted, or all rear-mounted, or partly front-mounted and some rear-mounted, which is not limited in the embodiment of the present application.
  • the mobile phone 600 may also include a Bluetooth module, etc., which will not be repeated here.
  • FIG. 7 is a schematic diagram of the software structure of a mobile phone 600 according to an embodiment of the present application.
  • the Android system is divided into four layers, namely the application layer, the application framework layer (framework, FWK), the system layer, and the hardware abstraction layer. The layers communicate with each other through software interfaces.
  • the application layer may include a series of application packages, and the application packages may include applications such as short message, calendar, camera, video, navigation, gallery, and call.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer can include some predefined functions, such as functions for receiving events sent by the application framework layer.
  • the application framework layer can include a window manager, a resource manager, and a notification manager.
  • the window manager is used to manage window programs.
  • the window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take a screenshot, etc.
  • the content provider is used to store and retrieve data and make these data accessible to applications.
  • the data may include videos, images, audios, phone calls made and received, browsing history and bookmarks, phone book, etc.
  • the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
  • the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and it can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify download completion, message reminders, and so on.
  • the notification manager can also present a notification in the status bar at the top of the system in the form of a chart or scroll bar text, such as a notification of an application running in the background, or present a notification on the screen in the form of a dialog window. For example, a text message is shown in the status bar, a prompt tone sounds, the electronic device vibrates, or an indicator light flashes.
  • the application framework layer can also include:
  • a view system which includes visual controls, such as controls that display text, controls that display pictures, and so on.
  • the view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • a display interface that includes a short message notification icon may include a view that displays text and a view that displays pictures.
  • the phone manager is used to provide the communication function of the mobile phone 600, for example, management of the call status (including connecting, hanging up, etc.).
  • the system layer can include multiple functional modules. For example: sensor service module, physical state recognition module, 3D graphics processing library (for example: OpenGL ES), etc.
  • the sensor service module is used to monitor the sensor data uploaded by various sensors at the hardware layer and determine the physical state of the mobile phone 600;
  • the physical state recognition module is used to analyze and recognize user gestures, faces, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, synthesis, and layer processing.
  • the system layer can also include:
  • the surface manager is used to manage the display subsystem and provides a combination of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support multiple audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the hardware abstraction layer is the layer between hardware and software.
  • the hardware abstraction layer can include display drivers, camera drivers, sensor drivers, etc., used to drive related hardware at the hardware layer, such as display screens, cameras, sensors, and so on.
  • the following embodiments can be implemented on the mobile phone 600 having the above hardware structure/software structure.
  • the following embodiment will take the mobile phone 600 as an example to describe the corpus recognition method provided in this embodiment.
  • the target corpus to be recognized may refer to words or sentences spoken by the user using the voice service on the mobile phone.
  • when the user invokes the intelligent voice assistant on the mobile phone, the user can speak a sentence to the voice assistant to instruct it to complete a certain task or output certain information.
  • the user can say “Who is Andy Lau” to the voice assistant, and the voice assistant can convert this sentence into text, and the obtained text information is the target corpus to be recognized.
  • the user can also input the above-mentioned target corpus into the mobile phone by directly inputting text, which is not limited in this embodiment.
  • after receiving the target corpus, the mobile phone can call the NLU services provided by multiple CPs to recognize the target corpus and output the corresponding intent categories.
  • the corpus may be obtained after annotating a large amount of original corpus.
  • the corpus can also store information such as intent classification obtained when the corpus is recognized by multiple NLU systems.
  • the corpus whitelist that matches the above intent category can be extracted from the corpus.
  • the corpus whitelist can be generated through the following steps:
  • S801 According to the initial categories and intent categories of the multiple valid corpora stored in the corpus, divide the multiple valid corpora into multiple recognition categories;
  • the initial category of a valid corpus stored in the corpus may be obtained during the preliminary screening of the original corpus, and the intent categories may be obtained by recognizing the corpus with multiple NLU systems separately.
  • the valid corpus stored in the corpus should generally include the initial category of the corpus and at least one intent category output by the NLU system.
  • a category string corresponding to each valid corpus can be generated according to the difference between its initial category and intent category, such as [Initial category_0.1, CP1 intent category 1.1, CP2 intent category 2.1, CP3 intent category 3.2 ], the category string indicates that the initial category of a valid corpus is sub-intent 0.1, and the intent categories obtained by using three different NLU engines for recognition are sub-intent 1.1, sub-intent 2.1 and sub-intent 3.2 respectively.
  • the number of corpus contained in each recognition category can be counted separately. For example, a certain recognition class contains 10,000 corpora, a certain recognition class contains 500 corpora, and so on.
  • S803 Generate a whitelist of the corpus according to the number of valid corpora included in each recognition category.
  • each recognition category can be sorted according to the number of valid corpora it contains. For example, all recognition classes can be sorted in descending order of the number of corpora they contain. Then, the recognition classes in a preset sorting interval are extracted as the whitelist of the corpus. For example, the recognition classes ranked in the top 80% can be selected as the whitelist of the corpus.
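  • the whitelist generation described above can be sketched as follows. This is a minimal illustration, not the implementation of this application: the function name build_whitelist, the parameter top_percent, and the sample data are hypothetical, and "top 80%" is assumed here to mean the top 80% of recognition classes by rank.

```python
from collections import Counter
from math import ceil


def build_whitelist(valid_corpora, top_percent=80):
    """Return the recognition classes ranked in the top `top_percent` percent.

    valid_corpora: iterable of (initial_category, intent_categories) pairs,
    where intent_categories holds one entry per NLU engine.
    """
    # Corpora with the same initial category and the same intent categories
    # fall into the same recognition class; Counter also counts each class.
    counts = Counter((initial, tuple(intents)) for initial, intents in valid_corpora)
    # Sort the recognition classes by the number of corpora, largest first.
    ranked = [cls for cls, _ in counts.most_common()]
    # Assumption: the preset sorting interval is a rank-based percentage.
    keep = ceil(len(ranked) * top_percent / 100)
    return ranked[:keep]


# Hypothetical sample: five recognition classes with 5, 4, 3, 2 and 1 corpora.
corpora = (
    [("0.1", ("1.1", "2.1", "3.2"))] * 5
    + [("0.1", ("1.2", "2.1", "3.1"))] * 4
    + [("0.2", ("1.3", "2.2", "3.3"))] * 3
    + [("0.2", ("1.3", "2.4", "3.3"))] * 2
    + [("0.3", ("1.9", "2.9", "3.9"))] * 1
)
whitelist = build_whitelist(corpora)
```

  • with this sample, four of the five recognition classes are kept and the class containing a single corpus is left out of the whitelist.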
  • the initial category and part of the intent categories can be matched, so as to find the whitelist with the same initial category and the same intent category.
  • the initial category of the target corpus can be obtained first, and then a whitelist containing the initial category and at least one intent category identified by the above-mentioned NLU engine can be extracted from the corpus.
  • the intent categories identified by the three NLUs are sub-intent 1.2, sub-intent 2.1, and sub-intent 3.1
  • FIG. 9 shows a schematic diagram of the corpus whitelist application process of this embodiment.
  • for the target corpus to be recognized, matching can be performed in the corpus according to the initial screening intent classification of the target corpus and the intent category output by each NLU. If the initial screening intent classification and some of the intent categories output by the NLUs match a whitelist, the most suitable recognition result can be returned according to the set sorting rule, based on the intent categories in the matched whitelist.
  • the set sorting rule may use the intent scores corresponding to the intent categories output by different NLUs, with the higher-scoring category preferentially selected and returned to the user; or, according to pre-designed rules, it may preferentially route different intent categories to a certain CP; or, without distinguishing specific intent categories, it may directly compare the priorities of the CPs and return to the user the intent category of the CP with the higher priority, which is not limited in this embodiment.
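  • two of the sorting rules above can be sketched as follows, assuming the matched intent categories are held in a dictionary keyed by CP name. The function names, the priority list, and the data layout are hypothetical; per-category routing could reuse select_by_priority with a separate priority list for each category.

```python
def select_by_score(candidates, intent_scores):
    """Pick the intent category with the highest intent score.

    candidates: dict mapping CP name -> matched intent category.
    intent_scores: dict mapping intent category -> score.
    """
    if not candidates:
        return None
    # Assumption: a category with no recorded score counts as 0.0.
    return max(candidates.values(), key=lambda c: intent_scores.get(c, 0.0))


def select_by_priority(candidates, cp_priority):
    """Pick the intent category of the highest-priority CP that has a match.

    cp_priority: list of CP names, highest priority first.
    """
    for cp in cp_priority:
        if cp in candidates:
            return candidates[cp]
    return None


candidates = {"CP1": "sub-intent 1.2", "CP2": "sub-intent 2.1"}
scores = {"sub-intent 1.2": 1.5, "sub-intent 2.1": 0.5}
by_score = select_by_score(candidates, scores)
by_priority = select_by_priority(candidates, ["CP3", "CP2", "CP1"])
```

  • the two rules can disagree: here the score rule selects the result of CP1, while a priority list that ranks CP2 above CP1 selects the result of CP2.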
  • the intent score corresponding to each intent category included in the whitelist can be determined first. It should be noted that the intent score corresponding to each intent category can be obtained by using each NLU engine to test the sample corpus. For obtaining the intention score, reference may be made to the introduction of step S203 in the foregoing embodiment, which will not be repeated in this embodiment.
  • the intent category corresponding to the maximum value of the intent score in the whitelist can be identified as the target intent category of the target corpus currently to be recognized.
  • the whitelist obtained by matching the target corpus contains the intent category recognition results C1 and C2 of CP1 and CP2.
  • the intent scores of C1 and C2 can be compared, and the intent category corresponding to the larger score can be selected as the final recognition result and returned to the user.
  • multiple NLU engines are used to recognize the received target corpus, and after the intent category corresponding to each NLU engine is obtained, the whitelist matching the above-mentioned intent categories can be extracted from the preset corpus.
  • the target corpus can then be recognized according to the intent categories contained in the matched whitelist.
  • after the corpus and the corresponding whitelist are generated, a corpus to be recognized can be identified according to the labeled category information and the whitelist, which helps to improve the service recall rate for end users.
  • the original corpus in this embodiment may be obtained by capturing various text information input by the user on the network, or may be obtained by performing text conversion on the user's voice.
  • the number of original corpus is very large, which may reach the order of millions.
  • n (n ≥ 3) CPs that provide NLU services can be selected as the aggregation objects, and the NLU system of each CP can be used to recognize each original corpus to obtain the corresponding intent category.
  • FIG. 10 is a schematic diagram of the relationship between the original corpus and the intent category in this embodiment.
  • each original corpus needs to be recognized by the NLU systems of three CPs, CP1, CP2, and CP3, and each NLU system will output the corresponding recognition results.
  • the classification result of any CP can be obtained, and the original corpus can be marked according to the result.
  • the credibility threshold can be set to more than half the number of CPs
  • the current corpus can be automatically marked, and the initial screening intent classification of the corpus and the intent category output by each CP can be recorded.
  • the intent credibility can be calculated according to the following formula.
  • each CP's output intention is lower than the credibility threshold
  • the corpora below the threshold can be aggregated in batches and recognized multiple times; if the recognition results of each CP remain fixed and unchanged, these corpora can be marked as valid corpora, and the recognition results of each CP can be mapped to the same overall classification (many-to-one).
  • Table 2 is an example of the recognition results obtained by recognizing the original corpus.
  • in Table 2, there are cases where a corpus cannot be recognized and cases where the recognition results of multiple CPs are inconsistent.
  • a large number of effective corpora can be obtained to form a corpus, and a corpus whitelist can be output by performing aggregation statistics on the corpus in the corpus.
  • the massive effective corpus and the whitelist in the corpus can be used for subsequent identification of the target corpus.
  • the user enters the target corpus "Who is Andy Lau". After preliminary screening, it can be known that the preliminary screening intent of the corpus is classified as "General Encyclopedia". After the NLU systems of multiple CPs are used for recognition, CP1 returns sub-intent 1.2, CP2 returns sub-intent 2.1, and CP3 returns sub-intent 3.1. By matching the whitelist shown in Table 3, it can be known that the sub-intent 3.1 returned by CP3 is not in the whitelist.
  • the intention score of sub-intent 1.2 is 1.5 points, and the intention score of sub-intent 2.1 is 0.5 points. Therefore, the target corpus "Who is Andy Lau" can match the sub-intent 1.2 and the sub-intent 2.1 in the whitelist, but since the sub-intent 1.2 scores higher, the terminal will return the recognition result of CP1, namely the sub-intent 1.2, to the user.
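  • the worked example above can be traced with the following sketch. The whitelist layout (one tuple of per-CP intent categories for each initial category) and the function names are assumptions; the whitelist entry stands in for Table 3, and its third element, "sub-intent 3.2", is a hypothetical placeholder.

```python
def matched_intents(initial_category, engine_outputs, whitelist):
    """Return the engine outputs that agree with a whitelist entry.

    engine_outputs: tuple with one intent category per CP (None if empty).
    whitelist: list of (initial_category, per-CP intent categories) entries.
    """
    hits = []
    for wl_initial, wl_intents in whitelist:
        if wl_initial != initial_category:
            continue
        # A partial match is enough: compare CP by CP and keep what agrees.
        for output, allowed in zip(engine_outputs, wl_intents):
            if output is not None and output == allowed and output not in hits:
                hits.append(output)
    return hits


def recognize_target(initial_category, engine_outputs, whitelist, intent_scores):
    """Return the matched intent category with the highest intent score."""
    hits = matched_intents(initial_category, engine_outputs, whitelist)
    if not hits:
        return None
    return max(hits, key=lambda c: intent_scores.get(c, 0.0))


# Stand-in for Table 3; "sub-intent 3.2" is a hypothetical placeholder.
whitelist = [("General Encyclopedia", ("sub-intent 1.2", "sub-intent 2.1", "sub-intent 3.2"))]
scores = {"sub-intent 1.2": 1.5, "sub-intent 2.1": 0.5}
outputs = ("sub-intent 1.2", "sub-intent 2.1", "sub-intent 3.1")

hits = matched_intents("General Encyclopedia", outputs, whitelist)
result = recognize_target("General Encyclopedia", outputs, whitelist, scores)
```

  • sub-intent 3.1 is filtered out because it does not appear in the whitelist entry, and sub-intent 1.2 is returned because its intent score (1.5) exceeds that of sub-intent 2.1 (0.5).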
  • the service recall rate for end users can be effectively improved, and in the process, the automatic recognition and labeling of the massive original corpus is realized.
  • the service recall rate of the terminal can be increased from 59.5% to 81.3%, and the accuracy rate does not decrease significantly.
  • FIG. 11 shows a structural block diagram of a corpus recognition device provided in an embodiment of the present application. For ease of description, only the parts related to the embodiment of the present application are shown.
  • the device can be applied to terminal equipment, and specifically can include the following modules:
  • the original corpus acquisition module 1101 is used to obtain the original corpus to be recognized
  • the intent category recognition module 1102 is configured to use multiple natural language understanding NLU engines to recognize the original corpus, and obtain the intent category corresponding to each NLU engine respectively;
  • the intent credibility determination module 1103 is configured to determine the intent credibility of the original corpus according to the intent category of each NLU engine;
  • the original corpus recognition module 1104 is configured to recognize the original corpus according to the intent credibility.
  • the intent category recognition module 1102 may specifically include the following sub-modules:
  • the processing interface calling sub-module is used to call the processing interfaces of multiple NLU engines
  • the original corpus input sub-module is used to input the original corpus into the processing interface of each NLU engine to instruct each NLU engine to recognize the original corpus;
  • the intention category receiving sub-module is used to receive the intention category output by each NLU engine.
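  • the three sub-modules above amount to fanning one corpus out to several engines and collecting their answers. The sketch below assumes each NLU engine is exposed as a plain callable that takes text and returns an intent category or None; the type alias, function name, and stub engines are all hypothetical and do not represent any real CP's interface.

```python
from typing import Callable, Dict, Optional

# Hypothetical processing interface: text in, intent category (or None) out.
NluEngine = Callable[[str], Optional[str]]


def recognize_with_engines(corpus: str, engines: Dict[str, NluEngine]) -> Dict[str, Optional[str]]:
    """Send the corpus to every engine and collect one intent category per CP."""
    results: Dict[str, Optional[str]] = {}
    for cp_name, engine in engines.items():
        try:
            results[cp_name] = engine(corpus)
        except Exception:
            # Assumption: a failed call is treated as an empty intent category.
            results[cp_name] = None
    return results


def _unavailable(text: str) -> Optional[str]:
    # Stub that simulates an engine whose processing interface call fails.
    raise RuntimeError("engine unavailable")


engines = {
    "CP1": lambda text: "sub-intent 1.2",
    "CP2": lambda text: None,
    "CP3": _unavailable,
}
result = recognize_with_engines("Who is Andy Lau", engines)
```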
  • the intent credibility determination module 1103 may specifically include the following submodules:
  • the intent score determination sub-module is used to determine the intent score corresponding to the intent category of each NLU engine, and the intent score corresponding to each intent category is obtained by using each NLU engine to test a sample corpus;
  • the intent credibility calculation sub-module is used to calculate the intent credibility of the original corpus according to each intent category and its corresponding intent score.
  • the intention credibility calculation sub-module may specifically include the following units:
  • a weight value determining unit configured to determine the weight value of each intention category
  • the intent credibility calculation unit is configured to use the weight value to perform a weighted summation of the intent scores corresponding to each intent category to obtain the intent credibility of the original corpus.
  • the original corpus recognition module 1104 may specifically include the following sub-modules:
  • the valid corpus recognition sub-module is configured to recognize the original corpus as a valid corpus if the intent credibility is greater than or equal to a preset credibility threshold;
  • the invalid corpus identification sub-module is used to identify the original corpus as an invalid corpus if the intent credibility is less than the credibility threshold.
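  • the weighted summation and the threshold comparison described by these sub-modules can be sketched as follows. The exact formula is not reproduced here, so this is only one reading of "weighted summation": the default weight of 1.0, the default score of 0.0, and the threshold of 1.5 are assumptions made for illustration.

```python
def intent_credibility(intent_categories, intent_scores, weights):
    """Weighted sum of the intent scores of the recognized intent categories.

    intent_categories: dict mapping engine name -> intent category or None.
    intent_scores: dict mapping intent category -> intent score.
    weights: dict mapping intent category -> weight value.
    """
    total = 0.0
    for category in intent_categories.values():
        if category is None:
            continue  # an engine that recognized nothing contributes nothing
        # Assumption: missing weights default to 1.0, missing scores to 0.0.
        total += weights.get(category, 1.0) * intent_scores.get(category, 0.0)
    return total


def classify_corpus(credibility, threshold):
    """Valid if the credibility reaches the preset threshold, else invalid."""
    return "valid" if credibility >= threshold else "invalid"


scores = {"sub-intent 1.2": 1.5, "sub-intent 2.1": 0.5}
weights = {"sub-intent 1.2": 1.0, "sub-intent 2.1": 1.0}
recognized = {"CP1": "sub-intent 1.2", "CP2": "sub-intent 2.1", "CP3": None}

credibility = intent_credibility(recognized, scores, weights)
label = classify_corpus(credibility, threshold=1.5)  # hypothetical threshold
```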
  • the original corpus recognition module 1104 may further include the following sub-modules:
  • the intention category judgment sub-module is used to judge whether the multiple intention categories corresponding to the invalid corpus are all empty;
  • the invalid corpus deletion sub-module is used to delete the invalid corpus if the multiple intent categories corresponding to the invalid corpus are all empty;
  • the invalid corpus re-identification sub-module is used to, if at least one of the multiple intent categories corresponding to the invalid corpus is not empty, divide the invalid corpus into multiple corpus classes according to the intent category, and use the multiple NLU engines again to recognize the invalid corpus in each corpus class; if the intent category recognized by each NLU engine remains unchanged, the invalid corpus in the corpus class is recognized as a valid corpus.
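  • the handling of invalid corpora described above can be sketched as follows. The function name, the recognize callable, the number of re-recognition rounds, and the sample intent labels are hypothetical; the sketch only illustrates the delete / regroup / re-recognize flow.

```python
from collections import defaultdict


def reprocess_invalid(invalid, recognize, rounds=2):
    """Delete, promote, or keep invalid corpora.

    invalid: dict mapping corpus text -> tuple of intent categories from the
             first pass (None marks an empty intent category).
    recognize: hypothetical callable that re-runs all NLU engines on a text
               and returns a tuple in the same layout.
    Returns (promoted, still_invalid, deleted).
    """
    # Corpora for which every engine returned an empty category are deleted.
    deleted = [text for text, cats in invalid.items() if all(c is None for c in cats)]

    # The rest are grouped into corpus classes by their intent categories.
    classes = defaultdict(list)
    for text, cats in invalid.items():
        if any(c is not None for c in cats):
            classes[cats].append(text)

    promoted, still_invalid = [], []
    for cats, texts in classes.items():
        for text in texts:
            # A corpus becomes valid only if every round repeats the result.
            stable = all(recognize(text) == cats for _ in range(rounds))
            (promoted if stable else still_invalid).append(text)
    return promoted, still_invalid, deleted


first_pass = {
    "stable query": ("intent A", None, None),
    "unstable query": ("intent B", "intent C", None),
    "noise": (None, None, None),
}
second_pass = {
    "stable query": ("intent A", None, None),
    "unstable query": ("intent D", "intent C", None),
}
promoted, still_invalid, deleted = reprocess_invalid(first_pass, second_pass.__getitem__)
```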
  • the device may further include the following modules:
  • the initial category acquisition module is used to acquire the initial category of the valid corpus
  • the effective corpus association storage module is used for storing the effective corpus, the initial category of the effective corpus, and the intent category recognized by each NLU engine in a corpus.
  • the device may further include the following modules:
  • the recognition class division module is used to classify the multiple valid corpora into multiple recognition classes according to the stored initial categories and intent categories of the multiple valid corpora;
  • the corpus quantity statistics module is used to count the number of valid corpora contained in each recognition category
  • the whitelist generating module is used to generate the whitelist of the corpus according to the number of valid corpora contained in each recognition category.
  • the recognition class division module may specifically include the following sub-modules:
  • the recognition class division sub-module is used to classify the corresponding valid corpus with the same initial category and intent category into the same recognition category.
  • the whitelist generating module may specifically include the following submodules:
  • the recognition category ranking sub-module is used to sort each recognition category according to the number of valid corpus contained in each recognition category
  • the whitelist generating sub-module is used to extract the recognition classes in the preset sorting interval as the whitelist of the corpus.
  • FIG. 12 shows a structural block diagram of a corpus recognition apparatus provided by another embodiment of the present application.
  • the apparatus can be applied to a terminal device, and specifically may include the following modules:
  • the intent category recognition module 1201 is configured to use multiple NLU engines to recognize the target corpus when receiving the target corpus to be recognized, and obtain the intent category corresponding to each NLU engine respectively;
  • the whitelist extraction module 1202 is configured to extract a whitelist matching the intent category from a preset corpus
  • the target corpus recognition module 1203 is configured to recognize the target corpus according to the intent categories included in the whitelist.
  • the whitelist can be generated by the following modules:
  • the recognition class division module is configured to classify the multiple valid corpora into multiple recognition classes according to the initial categories and intent categories of the multiple valid corpora stored in the corpus;
  • the corpus quantity statistics module is used to count the number of valid corpora contained in each recognition category
  • the whitelist generating module is used to generate the whitelist of the corpus according to the number of valid corpora contained in each recognition category.
  • the recognition class division module may specifically include the following sub-modules:
  • the recognition class division sub-module is used to classify the corresponding valid corpus with the same initial category and intent category into the same recognition category.
  • the whitelist generating module may specifically include the following submodules:
  • the recognition category ranking sub-module is used to sort each recognition category according to the number of valid corpus contained in each recognition category
  • the whitelist generating sub-module is used to extract the recognition classes in the preset sorting interval as the whitelist of the corpus.
  • the whitelist extraction module 1202 may specifically include the following submodules:
  • the initial category acquisition sub-module is used to acquire the initial category of the target corpus
  • the whitelist extraction submodule is used to extract a whitelist from the corpus that includes the initial category and at least one intent category recognized by the NLU engine.
  • the target corpus recognition module 1203 may specifically include the following sub-modules:
  • the intention score determination sub-module is used to determine the intention score corresponding to each intention category included in the whitelist, and the intention score corresponding to each intention category is obtained by using each NLU engine to test a sample corpus;
  • the target intent category recognition sub-module is used to recognize the intent category corresponding to the maximum value of the intent score as the target intent category of the target corpus.
  • the description of the device embodiments is relatively brief; for related parts, please refer to the description of the method embodiments.
  • the terminal device 1300 of this embodiment includes a processor 1310, a memory 1320, and a computer program 1321 stored in the memory 1320 and running on the processor 1310.
  • when the processor 1310 executes the computer program 1321, the steps in each embodiment of the above-mentioned corpus recognition method are implemented, for example, steps S101 to S104 shown in FIG. 1.
  • alternatively, when the processor 1310 executes the computer program 1321, the functions of the modules/units in the foregoing device embodiments are implemented, for example, the functions of the modules 1101 to 1104 shown in FIG. 11.
  • the computer program 1321 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 1320 and executed by the processor 1310 to complete this application.
  • the one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments may be used to describe the execution process of the computer program 1321 in the terminal device 1300.
  • the computer program 1321 can be divided into an original corpus acquisition module, an intent category recognition module, an intent credibility determination module, and an original corpus recognition module.
  • the specific functions of each module are as follows:
  • the original corpus acquisition module is used to obtain the original corpus to be recognized
  • the intent category recognition module is configured to use multiple natural language understanding NLU engines to recognize the original corpus, and obtain the intent category corresponding to each NLU engine respectively;
  • the intent credibility determination module is used to determine the intent credibility of the original corpus according to the intent category of each NLU engine;
  • the original corpus recognition module is used to recognize the original corpus according to the intent credibility.
  • the computer program 1321 can also be divided into an intent category recognition module, a whitelist extraction module, and a target corpus recognition module.
  • the specific functions of each module are as follows:
  • the intent category recognition module is configured to use multiple NLU engines to recognize the target corpus when receiving the target corpus to be recognized, and obtain the intent category corresponding to each NLU engine respectively;
  • the whitelist extraction module is used to extract a whitelist that matches the intent category from a preset corpus
  • the target corpus recognition module is used to recognize the target corpus according to the intent categories included in the whitelist.
  • the terminal device 1300 may be a computing device such as a desktop computer, a notebook, or a palmtop computer.
  • the terminal device 1300 may include, but is not limited to, a processor 1310 and a memory 1320.
  • FIG. 13 is only an example of the terminal device 1300, and does not constitute a limitation on the terminal device 1300. It may include more or fewer components than those shown in the figure, combine certain components, or use different components.
  • the terminal device 1300 may also include input and output devices, network access devices, buses, and so on.
  • the processor 1310 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 1320 may be an internal storage unit of the terminal device 1300, such as a hard disk or a memory of the terminal device 1300.
  • the memory 1320 may also be an external storage device of the terminal device 1300, such as a plug-in hard disk equipped on the terminal device 1300, a smart memory card (Smart Media Card, SMC), or a Secure Digital (SD) Card, Flash Card, etc.
  • the memory 1320 may also include both an internal storage unit of the terminal device 1300 and an external storage device.
  • the memory 1320 is used to store the computer program 1321 and other programs and data required by the terminal device 1300.
  • the memory 1320 can also be used to temporarily store data that has been output or will be output.
  • the disclosed corpus recognition method, device, terminal device, and medium can be implemented in other ways.
  • the division of the modules or units is only a logical function division, and there may be other divisions in actual implementation.
  • multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the computer program can be stored in a computer-readable storage medium.
  • when the computer program is executed by a processor, the steps of the foregoing method embodiments can be implemented.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
  • the computer-readable medium may at least include: any entity or device capable of carrying the computer program code to the corpus recognition device or terminal device, a recording medium, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example, a USB flash drive, a removable hard disk, a floppy disk, or a CD-ROM.
  • computer-readable media cannot be electrical carrier signals and telecommunication signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

A corpus identification method, a device, a terminal apparatus, and a medium. The method comprises: acquiring a raw corpus to be identified (S101); identifying the raw corpus by using multiple natural language understanding (NLU) engines, and acquiring the intent type corresponding to each of the NLU engines (S102); determining, according to the respective intent types of the NLU engines, an intent confidence level of the raw corpus (S103); and identifying the raw corpus according to the intent confidence level (S104). The method enables fine-grained confidence processing for external NLU services by domain or intent, so that a huge number of corpora can be identified and a corpus generated from them. When a terminal performs corpus identification on the basis of the generated corpus, the service recall rate is effectively increased while the accuracy of corpus identification is improved. The method is broadly applicable to fields such as artificial intelligence, and particularly to application scenarios that need to achieve a service recall rate on the basis of natural language understanding.

Description

Corpus recognition method, device, terminal equipment and medium
This application claims priority to Chinese patent application No. 201911307187.9, entitled "Corpus recognition method, device, terminal equipment and medium" and filed with the State Intellectual Property Office on December 18, 2019, the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of information technology, and in particular to a corpus recognition method, device, terminal device, and medium.
Background
Recall ratio, also known as recall rate, is the ratio of the amount of relevant information retrieved from a database to the total amount of relevant information in that database. In fields such as artificial intelligence (AI), raising the service recall rate helps improve the user's service experience. For example, when a user uses a voice assistant on a terminal such as a mobile phone, whether the assistant can accurately understand what the user says and then complete the corresponding task or return the corresponding information greatly affects normal use.
At present, to raise the service recall rate, terminal manufacturers connect a terminal to several third-party content providers (CPs) with natural language understanding (NLU) capabilities at the same time. The NLU system of each CP recognizes the corpus input by the user, and one of the results is then selected and returned to the user. However, the NLU systems of the connected CPs differ in capability, and the CPs do not share a unified standard for defining domains or intents, so their results are difficult to compare accurately. In addition, although introducing multiple NLU systems raises the recall rate, it also produces more awkward, off-target replies and lowers the accuracy of corpus recognition.
Summary of the invention
The embodiments of this application provide a corpus recognition method, device, terminal device, and medium that perform fine-grained credibility processing on the domains or intents that multiple external NLU systems recognize for a corpus, thereby raising the service recall rate and improving the accuracy of corpus recognition.
In a first aspect, an embodiment of this application provides a corpus recognition method, including:
acquiring an original corpus to be recognized;
recognizing the original corpus with multiple natural language understanding (NLU) engines, and obtaining the intent category corresponding to each NLU engine;
determining the intent credibility of the original corpus according to the intent category of each NLU engine;
recognizing the original corpus according to the intent credibility.
Exemplarily, recognizing the original corpus with multiple NLU engines and obtaining the intent category corresponding to each NLU engine includes:
calling the processing interfaces of the multiple NLU engines;
inputting the original corpus into the processing interface of each NLU engine to instruct that NLU engine to recognize the original corpus;
receiving the intent category output by each NLU engine.
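The three steps above can be sketched as a simple fan-out. This is a minimal illustration, not the application's implementation: the `NluEngine` callable type, the engine names `cp_a`, `cp_b`, `cp_c`, and the stand-in engines are all hypothetical, and treating a failing engine as "no category" is an assumption.

```python
from typing import Callable, Dict, Optional

# Hypothetical engine interface: takes the corpus text and returns an intent
# category, or None when the engine cannot classify the text.
NluEngine = Callable[[str], Optional[str]]

def recognize_with_engines(corpus: str, engines: Dict[str, NluEngine]) -> Dict[str, Optional[str]]:
    """Send the same corpus to every engine and collect one intent category per engine."""
    results: Dict[str, Optional[str]] = {}
    for name, engine in engines.items():
        try:
            results[name] = engine(corpus)
        except Exception:
            # Assumption: an engine that fails is treated as returning no category.
            results[name] = None
    return results

# Stand-in engines for illustration only.
def engine_a(text: str) -> Optional[str]:
    return "encyclopedia" if "who" in text.lower() else None

def engine_b(text: str) -> Optional[str]:
    return "encyclopedia"

def engine_c(text: str) -> Optional[str]:
    return "weather" if "rain" in text.lower() else "encyclopedia"

ENGINES = {"cp_a": engine_a, "cp_b": engine_b, "cp_c": engine_c}
print(recognize_with_engines("Who wrote Hamlet?", ENGINES))
```

Keeping the result keyed by engine name preserves which engine produced which category, which the later scoring steps rely on.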
Exemplarily, determining the intent credibility of the original corpus according to the intent category of each NLU engine includes:
determining the intent score corresponding to the intent category of each NLU engine, where the intent score corresponding to each intent category is obtained by testing each NLU engine on a sample corpus;
calculating the intent credibility of the original corpus according to each intent category and its corresponding intent score.
Exemplarily, calculating the intent credibility of the original corpus according to each intent category and its corresponding intent score includes:
determining the weight value of each intent category;
using the weight values to compute a weighted sum of the intent scores corresponding to the intent categories, thereby obtaining the intent credibility of the original corpus.
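A minimal sketch of the weighted sum follows. The data layout is an assumption made for illustration: scores are keyed by a hypothetical (engine, intent category) pair, and each engine's output carries one weight keyed by engine name. The numbers are invented.

```python
from typing import Dict, Optional, Tuple

def intent_credibility(
    engine_intents: Dict[str, Optional[str]],
    intent_scores: Dict[Tuple[str, str], float],
    engine_weights: Dict[str, float],
) -> float:
    """Weighted sum of the per-engine intent scores for one original corpus."""
    total = 0.0
    for engine, intent in engine_intents.items():
        if intent is None:
            continue  # an engine with no result contributes nothing
        score = intent_scores.get((engine, intent), 0.0)
        total += engine_weights.get(engine, 0.0) * score
    return total

# Illustrative values only.
intents = {"cp_a": "encyclopedia", "cp_b": "encyclopedia", "cp_c": "weather"}
scores = {
    ("cp_a", "encyclopedia"): 0.9,
    ("cp_b", "encyclopedia"): 0.8,
    ("cp_c", "weather"): 0.4,
}
weights = {"cp_a": 0.5, "cp_b": 0.3, "cp_c": 0.2}
credibility = intent_credibility(intents, scores, weights)
print(f"{credibility:.2f}")  # 0.5*0.9 + 0.3*0.8 + 0.2*0.4
```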
Exemplarily, recognizing the original corpus according to the intent credibility includes:
if the intent credibility is greater than or equal to a preset credibility threshold, recognizing the original corpus as a valid corpus;
if the intent credibility is less than the credibility threshold, recognizing the original corpus as an invalid corpus.
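The threshold rule is small enough to state directly in code. The function name and the label strings below are illustrative choices; the one detail taken from the text is that a credibility exactly equal to the threshold counts as valid.

```python
def label_corpus(credibility: float, threshold: float) -> str:
    """Label a corpus 'valid' when its credibility reaches the preset threshold."""
    return "valid" if credibility >= threshold else "invalid"

# The boundary value itself is treated as valid.
print(label_corpus(0.6, 0.6), label_corpus(0.59, 0.6))
```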
Exemplarily, after the original corpus is recognized as an invalid corpus, the method further includes:
judging whether the multiple intent categories corresponding to the invalid corpus are all empty;
if the multiple intent categories corresponding to the invalid corpus are all empty, deleting the invalid corpus;
if at least one of the multiple intent categories corresponding to the invalid corpus is not empty, dividing the invalid corpora into multiple corpus classes according to their intent categories, and using the multiple NLU engines again to recognize the invalid corpora in each corpus class; if the intent category recognized by each NLU engine remains unchanged, recognizing the invalid corpus in that corpus class as a valid corpus.
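The handling of invalid corpora can be sketched as follows. Several things here are assumptions: the `Intents` tuple layout (one entry per engine, `None` meaning an empty result), the `recognize_again` callback standing in for the second pass through the engines, and the choice to compare results corpus by corpus inside each class, which is one reading of the text.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Intents = Tuple[Optional[str], ...]  # one entry per engine, None = no result

def recheck_invalid(
    invalid: Dict[str, Intents],
    recognize_again: Callable[[str], Intents],
) -> Tuple[List[str], List[str]]:
    """Return (promoted, deleted) corpora after the second recognition pass."""
    # Corpora for which every engine returned nothing are deleted.
    deleted = [c for c, intents in invalid.items() if all(i is None for i in intents)]
    # The rest are grouped into corpus classes by their first-pass intent categories.
    classes: Dict[Intents, List[str]] = defaultdict(list)
    for corpus, intents in invalid.items():
        if any(i is not None for i in intents):
            classes[intents].append(corpus)
    # A corpus is promoted to valid when every engine repeats its first-pass result.
    promoted: List[str] = []
    for first_pass, members in classes.items():
        for corpus in members:
            if recognize_again(corpus) == first_pass:
                promoted.append(corpus)
    return promoted, deleted

# Illustrative data: first-pass results and a canned second pass.
invalid = {
    "asdf": (None, None, None),
    "play jazz": ("music", None, "music"),
    "book it": ("travel", "food", None),
}
second_pass = {
    "play jazz": ("music", None, "music"),
    "book it": ("travel", "travel", None),
}
promoted, deleted = recheck_invalid(invalid, lambda c: second_pass[c])
print(promoted, deleted)
```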
Exemplarily, after the original corpus is recognized as a valid corpus, the method further includes:
acquiring the initial category of the valid corpus;
storing the valid corpus, the initial category of the valid corpus, and the intent category recognized by each NLU engine in association with one another in a corpus.
Exemplarily, the method further includes:
dividing the multiple stored valid corpora into multiple recognition classes according to their initial categories and intent categories;
counting the number of valid corpora contained in each recognition class;
generating a whitelist of the corpus according to the number of valid corpora contained in each recognition class.
Exemplarily, dividing the multiple stored valid corpora into multiple recognition classes according to their initial categories and intent categories includes:
placing valid corpora whose initial categories and intent categories are all the same into the same recognition class.
Exemplarily, generating the whitelist of the corpus according to the number of valid corpora contained in each recognition class includes:
sorting the recognition classes according to the number of valid corpora each contains;
extracting the recognition classes that fall within a preset sorting interval as the whitelist of the corpus.
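The whitelist generation steps (group, count, sort, extract) can be sketched with a counter. The `RecognitionClass` layout and the interpretation of the "preset sorting interval" as a half-open rank range `[rank_start, rank_end)` over a descending sort are assumptions for illustration; the sample records are invented.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# A recognition class: (initial category, intent category from each engine).
RecognitionClass = Tuple[str, Tuple[Optional[str], ...]]

def build_whitelist(
    records: Iterable[RecognitionClass],
    rank_start: int,
    rank_end: int,
) -> List[RecognitionClass]:
    """Count valid corpora per recognition class and keep the classes ranked in [rank_start, rank_end)."""
    # Corpora with identical initial category and intent categories share one class.
    counts = Counter(records)
    # most_common() sorts by count, largest first.
    ranked = [cls for cls, _ in counts.most_common()]
    return ranked[rank_start:rank_end]

# One record per stored valid corpus (illustrative data).
records = (
    [("voice", ("music", "music", "music"))] * 3
    + [("voice", ("weather", "weather", None))] * 2
    + [("text", ("travel", "food", None))]
)
whitelist = build_whitelist(records, 0, 2)
print(whitelist)
```

Ranking by count favors combinations of engine outputs that recur across many valid corpora, which is the property the whitelist is meant to capture.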
In a second aspect, an embodiment of this application provides a corpus recognition method, including:
when a target corpus to be recognized is received, recognizing the target corpus with multiple NLU engines, and obtaining the intent category corresponding to each NLU engine;
extracting, from a preset corpus, a whitelist that matches the intent categories;
recognizing the target corpus according to the intent categories included in the whitelist.
Exemplarily, the whitelist is generated through the following steps:
dividing the multiple valid corpora stored in the corpus into multiple recognition classes according to their initial categories and intent categories;
counting the number of valid corpora contained in each recognition class;
generating the whitelist of the corpus according to the number of valid corpora contained in each recognition class.
Exemplarily, dividing the multiple valid corpora stored in the corpus into multiple recognition classes according to their initial categories and intent categories includes:
placing valid corpora whose initial categories and intent categories are all the same into the same recognition class.
Exemplarily, generating the whitelist of the corpus according to the number of valid corpora contained in each recognition class includes:
sorting the recognition classes according to the number of valid corpora each contains;
extracting the recognition classes that fall within a preset sorting interval as the whitelist of the corpus.
Exemplarily, extracting the whitelist that matches the intent categories from the preset corpus includes:
acquiring the initial category of the target corpus;
extracting, from the corpus, a whitelist that contains the initial category and at least one intent category recognized by the NLU engines.
Exemplarily, the whitelist contains multiple intent categories, and recognizing the target corpus according to the intent categories included in the whitelist includes:
determining the intent score corresponding to each intent category included in the whitelist, where the intent score corresponding to each intent category is obtained by testing each NLU engine on a sample corpus;
recognizing the intent category with the highest intent score as the target intent category of the target corpus.
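Whitelist matching followed by picking the highest-scoring intent can be sketched as below. This is a simplified illustration under stated assumptions: whitelist entries are hypothetical (initial category, intent categories) pairs, scores are keyed by intent category alone rather than per engine, and ties are broken alphabetically so the result is deterministic.

```python
from typing import Dict, Iterable, List, Optional, Tuple

WhitelistEntry = Tuple[str, Tuple[str, ...]]  # (initial category, intent categories)

def pick_target_intent(
    initial: str,
    recognized: Iterable[Optional[str]],
    whitelist: List[WhitelistEntry],
    scores: Dict[str, float],
) -> Optional[str]:
    """Return the best-scoring intent among whitelist entries that match the target corpus."""
    recognized_set = {r for r in recognized if r is not None}
    candidates = set()
    for entry_initial, entry_intents in whitelist:
        # An entry matches when it has the same initial category and shares
        # at least one intent category with the engines' output.
        if entry_initial == initial and recognized_set & set(entry_intents):
            candidates.update(entry_intents)
    if not candidates:
        return None  # no whitelist entry matches this corpus
    # sorted() makes the tie-break deterministic (alphabetical).
    return max(sorted(candidates), key=lambda intent: scores.get(intent, 0.0))

# Illustrative data.
whitelist = [
    ("voice", ("music", "radio")),
    ("voice", ("weather",)),
    ("text", ("music",)),
]
scores = {"music": 0.9, "radio": 0.7, "weather": 0.8}
print(pick_target_intent("voice", ["music", None, "radio"], whitelist, scores))
```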
In a third aspect, an embodiment of this application provides a corpus recognition device, including:
an original corpus acquisition module, configured to acquire an original corpus to be recognized;
an intent category recognition module, configured to recognize the original corpus with multiple natural language understanding (NLU) engines and obtain the intent category corresponding to each NLU engine;
an intent credibility determination module, configured to determine the intent credibility of the original corpus according to the intent category of each NLU engine;
an original corpus recognition module, configured to recognize the original corpus according to the intent credibility.
In a fourth aspect, an embodiment of this application provides a corpus recognition device, including:
an intent category recognition module, configured to, when a target corpus to be recognized is received, recognize the target corpus with multiple NLU engines and obtain the intent category corresponding to each NLU engine;
a whitelist extraction module, configured to extract, from a preset corpus, a whitelist that matches the intent categories;
a target corpus recognition module, configured to recognize the target corpus according to the intent categories included in the whitelist.
In a fifth aspect, an embodiment of this application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the corpus recognition method according to any one of the first aspect or the second aspect.
In a sixth aspect, an embodiment of this application provides a computer-readable storage medium that stores a computer program, where the computer program, when executed by a processor of a terminal device, implements the corpus recognition method according to any one of the first aspect or the second aspect.
In a seventh aspect, an embodiment of this application provides a computer program product that, when run on a terminal device, causes the terminal device to execute the corpus recognition method according to any one of the first aspect or the second aspect.
Compared with the prior art, the embodiments of this application have the following beneficial effects:
In the embodiments of this application, multiple NLU engines recognize the original corpus to be recognized, yielding the intent category corresponding to each NLU engine. The intent credibility of the original corpus is then determined according to the intent category of each NLU engine, so that the original corpus can be recognized according to the intent credibility. This embodiment can score external NLU services and process their credibility at a fine granularity by domain or intent, which makes it possible to recognize massive amounts of corpora and thereby generate a corpus. When a terminal performs corpus recognition based on that corpus, it can effectively raise the service recall rate while improving the accuracy of corpus recognition. This embodiment can be widely applied in fields such as artificial intelligence, especially in application scenarios that need to achieve a service recall rate based on natural language understanding.
Description of the drawings
FIG. 1 is a schematic flowchart of the steps of a corpus recognition method provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of the steps of a corpus recognition method provided by another embodiment of this application;
FIG. 3 is a schematic diagram of a corpus labeling process provided by an embodiment of this application;
FIG. 4 is a schematic diagram of a corpus whitelist generation process provided by an embodiment of this application;
FIG. 5 is a schematic flowchart of the steps of a corpus recognition method provided by another embodiment of this application;
FIG. 6 is a schematic diagram of the hardware structure of a mobile phone to which the corpus recognition method provided by an embodiment of this application is applicable;
FIG. 7 is a schematic diagram of the software structure of a mobile phone to which the corpus recognition method provided by an embodiment of this application is applicable;
FIG. 8 is a schematic flowchart of the steps for generating a corpus whitelist provided by an embodiment of this application;
FIG. 9 is a schematic diagram of a corpus whitelist application process provided by an embodiment of this application;
FIG. 10 is a schematic diagram of the relationship between original corpora and intent categories provided by an embodiment of this application;
FIG. 11 is a structural block diagram of a corpus recognition device provided by an embodiment of this application;
FIG. 12 is a structural block diagram of a corpus recognition device provided by another embodiment of this application;
FIG. 13 is a schematic structural diagram of a terminal device provided by an embodiment of this application.
Detailed description
In the following description, specific details such as particular system structures and technologies are set out for the purpose of illustration rather than limitation, to give a thorough understanding of the embodiments of this application. However, it should be clear to those skilled in the art that this application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of this application.
The terms used in the following embodiments are only for the purpose of describing particular embodiments and are not intended to limit this application. As used in the specification and the appended claims of this application, the singular forms "a", "an", "said", "the above", "the", and "this" are intended to also include expressions such as "one or more", unless the context clearly indicates otherwise. It should also be understood that, in the embodiments of this application, "one or more" means one, two, or more than two; "and/or" describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or B" may mean that A exists alone, that both A and B exist, or that B exists alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it.
Referring to FIG. 1, a schematic flowchart of the steps of a corpus recognition method provided by an embodiment of this application is shown. The method may specifically include the following steps:
S101. Acquire the original corpus to be recognized.
This method is applicable to fields such as artificial intelligence, especially scenarios such as natural language understanding in which speech or text must be recognized. Adopting the corpus recognition method provided in this embodiment can significantly raise the service recall rate.
It should be noted that this embodiment introduces the method from the perspective of labeling the original corpus.
In this embodiment, the original corpus to be recognized may be obtained by crawling text entered by users on the network, or by converting a user's speech to text. For example, when a user uses the voice assistant on a terminal, the text obtained by converting the words or sentences spoken to the assistant can serve as the original corpus. This embodiment does not limit the source of the original corpus.
The number of original corpora is usually very large and may reach the order of millions. Because each original corpus is processed in essentially the same way under the method provided in this embodiment, the method is described here, for ease of understanding, through the processing of a single original corpus.
S102. Recognize the original corpus with multiple natural language understanding (NLU) engines, and obtain the intent category corresponding to each NLU engine.
In this embodiment, the multiple NLU engines may be the NLU systems that multiple CPs provide for recognizing corpora. Each NLU system can recognize the received original corpus independently and output its own recognition result.
As an example of this embodiment, no fewer than three CPs that provide NLU services may be connected, so that recognition results from at least three NLU systems are available for the same original corpus.
The recognition result in this embodiment may be the information about the domain or intent to which each NLU system assigns the original corpus, that is, the intent category. For example, recognition can first determine whether a given original corpus belongs to general encyclopedia, utility tools, or another category.
Different NLU systems usually differ in the domains they are good at. For example, one NLU system may have a high recognition rate for sub-domains such as food and food delivery, while another may be stronger in sub-domains such as maps and travel.
Therefore, in this embodiment, when different NLU systems recognize an intent category, the credibility of that intent category can also be determined. For example, when an NLU system with a high recognition rate for sub-domains such as food and food delivery is used for corpus recognition, an output intent category that falls within those sub-domains can be considered relatively credible; conversely, if the recognized intent category falls within sub-domains such as maps and travel, its credibility is relatively low.
Of course, the credibility of the different recognized intent categories may also be provided by the CP that supplies the NLU service, may be derived by the terminal manufacturer that connects to the CPs by testing each NLU system in advance to obtain evaluation information for it, or may be derived from the evaluations of both the CP and the terminal manufacturer. This embodiment does not limit this.
S103: Determine the intent credibility of the original corpus according to the intent category of each NLU engine;
In this embodiment, the intent credibility of the original corpus can be determined jointly from the intent categories recognized by the NLU systems. The intent credibility of the original corpus may indicate whether the original corpus is recognizable, that is, whether the NLU systems can understand it.
In a specific implementation, if the intent categories recognized by the NLU systems are largely consistent, for example if every NLU system recognizes a given original corpus as general encyclopedia, the credibility of that corpus can be considered high. If every NLU system produces a different recognition result for a given original corpus, the credibility of that corpus can be considered low.
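The consistency idea described above can be illustrated with a minimal sketch. The text does not specify how consistency is measured; the function name `agreement_ratio` and the choice of "share of engines that agree with the most common category" are assumptions made only for illustration, and the sketch further assumes that the engines' outputs have been mapped to a shared set of category names.

```python
from collections import Counter
from typing import List, Optional


def agreement_ratio(categories: List[Optional[str]]) -> float:
    """Share of engines whose output equals the most common non-empty category.

    Hypothetical measure: the source only says that consistent results imply
    high credibility and divergent results imply low credibility.
    """
    recognized = [c for c in categories if c is not None]
    if not recognized:
        return 0.0  # no engine recognized the corpus
    _, top_count = Counter(recognized).most_common(1)[0]
    return top_count / len(categories)


# All three engines agree, so the ratio is at its maximum.
consistent = agreement_ratio(["general encyclopedia"] * 3)
# Every engine gives a different category, so the ratio is at its minimum for three outputs.
divergent = agreement_ratio(["general encyclopedia", "utility tools", "maps"])
```

A higher ratio corresponds to the "largely consistent" case in the text, and a lower ratio to the case where every engine disagrees.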
S104: Recognize the original corpus according to the intent credibility.
When the original corpus is recognized according to the intent credibility, original corpus with higher intent credibility can be marked as valid corpus, and original corpus with lower intent credibility can be marked as invalid corpus.
The above describes the process of recognizing a single original corpus. After a massive amount of corpus has been recognized in this way, the labeling of that corpus is complete and a corpus database can be generated. When the terminal performs corpus recognition based on this corpus database, it can effectively increase the service recall rate while also improving the accuracy of corpus recognition.
Referring to FIG. 2, a schematic flowchart of the steps of a corpus recognition method provided by another embodiment of this application is shown. The method may specifically include the following steps:
S201: Obtain the original corpus to be recognized;
It should be noted that this embodiment describes the method from the perspective of constructing a whitelist for the corpus database, on the basis of labeling the original corpus to obtain a corpus database containing a massive amount of valid corpus.
Similar to the foregoing embodiment, the original corpus to be recognized in this embodiment may also be obtained by collecting massive amounts of speech-to-text output and manually entered text from live-network users. The original corpus to be recognized may amount to millions of items or more.
S202: Use multiple natural language understanding (NLU) engines to recognize the original corpus, and obtain an intent category corresponding to each NLU engine;
In this embodiment, the NLU services provided by different CPs can be implemented by different NLU engines; each NLU engine is the NLU system of the corresponding CP. The terminal can exchange data with each NLU engine through the corresponding processing interface.
Therefore, in a specific implementation, when the original corpus to be recognized needs to be processed, the processing interfaces of the multiple NLU engines can be called first, and the original corpus can then be input into the processing interface of each NLU engine to instruct that engine to recognize it. After recognizing the original corpus, each NLU engine can return the corresponding result through the processing interface. The terminal can receive the intent category output by each NLU engine as that NLU system's recognition result for the original corpus.
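The calling pattern described above can be sketched as follows. This is only an illustration under stated assumptions: the real processing interfaces of the CPs are not described in the text, so each engine is modeled as a hypothetical Python callable that takes the text and returns an intent category, or `None` when it cannot recognize the input.

```python
from typing import Callable, Dict, Optional

# Hypothetical engine type: raw text in, intent category (or None) out.
NluEngine = Callable[[str], Optional[str]]


def recognize_with_all(text: str, engines: Dict[str, NluEngine]) -> Dict[str, Optional[str]]:
    """Send the same original corpus to every engine and collect the intent categories."""
    results: Dict[str, Optional[str]] = {}
    for cp_name, engine in engines.items():
        try:
            results[cp_name] = engine(text)
        except Exception:
            # Assumption: a failed call is treated like an empty (unrecognized) result.
            results[cp_name] = None
    return results


# Stub engines standing in for the CPs' processing interfaces.
engines = {
    "CP1": lambda text: "sub-intent 1.1",
    "CP2": lambda text: "sub-intent 2.2",
    "CP3": lambda text: None,  # this engine cannot recognize the text
}
result = recognize_with_all("how tall is Mount Everest", engines)
```

Keeping the result keyed by CP makes it straightforward to look up a per-CP intent score in a later step.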
S203: Determine the intent score corresponding to the intent category of each NLU engine, where the intent score corresponding to each intent category is obtained by testing sample corpus with each NLU engine;
After the intent category recognized by each NLU engine is obtained, the intent score corresponding to that intent category can be determined. In this embodiment, the intent score corresponding to each intent category of an NLU engine can be obtained by testing sample corpus with that NLU engine.
In a specific implementation, part of the corpus can be collected in advance as sample corpus for testing. Each NLU engine accessed by the terminal is then used to recognize the sample corpus and obtain the corresponding recognition results. By manually analyzing these results, each intent category of each NLU engine can be scored. When scoring intent categories, the intents that an NLU engine is good at can be given a higher score, while intent categories that it is not good at, or for which its recognition accuracy is relatively low, can be given a relatively low score.
Table 1 shows an example of the intent scores corresponding to the intent categories of each CP in this embodiment. Table 1 gives example scores for the intent categories of the NLU engines provided by three CPs, namely CP1, CP2, and CP3. Taking CP1 as an example, the intent category it is good at is the one labeled sub-intent 1.2; accordingly, the intent score for this category is relatively high, at 1.5 points. The category sub-intent 1.1 belongs to a domain that the NLU engine provided by CP1 is not good at, so its intent score is relatively low, at 0.5 points. For other categories, such as sub-intent 1.3 and sub-intent 1.4, the recognition results obtained with the NLU engine provided by CP1 are average, and the corresponding intent score is 1.0 point.
Table 1: [Figure PCTCN2020124764-appb-000001]
S204: Calculate the intent credibility of the original corpus according to each intent category and its corresponding intent score;
In this embodiment, the intent credibility of the original corpus to be recognized can be calculated according to each intent category and its corresponding intent score.
In a specific implementation, the intent credibility of the original corpus can be calculated according to the following formula:
$$\text{intent credibility} = \sum_{i=1}^{n} \left( \text{intent category output by } CP_i \times \text{intent score corresponding to the intent category output by } CP_i \right)$$
Here, n is the number of CPs accessed by the terminal. Generally, n≥3, that is, the NLU systems provided by at least three CPs are accessed.
For a given original corpus, if the intent category output by CP1 is sub-intent 1.1, the intent category output by CP2 is sub-intent 2.2, and the intent category output by CP3 is sub-intent 3.1, then according to the intent scores shown in Table 1 above, the intent credibility of the original corpus = 0.5 + 1.5 + 0.5 = 2.5 points.
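The worked example above can be reproduced with a short sketch. The score table below contains only the values quoted in the text; the remaining entries of Table 1 are not available here. The names `INTENT_SCORES` and `intent_credibility` are hypothetical, and treating an empty or unknown category as contributing 0 is an assumption about how the "intent category" factor in the formula behaves.

```python
from typing import Dict, Optional

# Scores quoted in the text for Table 1; other table entries are not reproduced.
INTENT_SCORES: Dict[str, Dict[str, float]] = {
    "CP1": {
        "sub-intent 1.1": 0.5,
        "sub-intent 1.2": 1.5,
        "sub-intent 1.3": 1.0,
        "sub-intent 1.4": 1.0,
    },
    "CP2": {"sub-intent 2.2": 1.5},
    "CP3": {"sub-intent 3.1": 0.5},
}


def intent_credibility(outputs: Dict[str, Optional[str]]) -> float:
    """Sum the intent scores of the categories output by each CP."""
    total = 0.0
    for cp, category in outputs.items():
        if category is None:
            continue  # assumption: an empty category contributes nothing
        total += INTENT_SCORES.get(cp, {}).get(category, 0.0)
    return total


credibility = intent_credibility(
    {"CP1": "sub-intent 1.1", "CP2": "sub-intent 2.2", "CP3": "sub-intent 3.1"}
)
print(credibility)  # 2.5, matching the example in the text
```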
Of course, as an example of this embodiment, when calculating the intent credibility of the original corpus, a weight value can first be determined for each intent category, and these weight values can then be used to compute a weighted sum of the intent scores corresponding to the intent categories, thereby obtaining the intent credibility of the original corpus. The weight values may be set when each intent category is scored; for example, a higher weight can be set for the intent categories that an NLU system is good at. This embodiment does not limit the specific way in which the intent credibility of the original corpus is calculated from each intent category and its corresponding intent score.
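The weighted variant can be sketched in the same spirit. The text gives no concrete weight values, so the weights used here (1.2 for a category an engine is good at, 0.8 otherwise) are purely illustrative assumptions, as is the function name `weighted_credibility`.

```python
from typing import List, Tuple


def weighted_credibility(scored: List[Tuple[float, float]]) -> float:
    """Weighted sum over (intent score, weight) pairs, one pair per CP."""
    return sum(score * weight for score, weight in scored)


# Scores 0.5, 1.5, 0.5 from the earlier example, with illustrative weights.
value = weighted_credibility([(0.5, 0.8), (1.5, 1.2), (0.5, 0.8)])
```

With all weights set to 1.0, this reduces to the plain sum of intent scores.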
S205: If the intent credibility is greater than or equal to a preset credibility threshold, recognize the original corpus as valid corpus;
When the original corpus is recognized according to the intent credibility, the intent credibility can be compared with a preset credibility threshold. If the intent credibility is greater than or equal to the credibility threshold, the original corpus can be recognized as valid corpus; if the intent credibility is less than the credibility threshold, the original corpus can be recognized as invalid corpus.
In a specific implementation, the credibility threshold can be set according to actual needs. For example, the threshold can be determined from the number of CPs accessed, with the credibility threshold set to half of that number, that is, credibility threshold = n*50%. This embodiment does not limit this.
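The threshold comparison can be sketched as follows, using the example rule credibility threshold = n*50% from the text. The function name `label_corpus` and the string labels are hypothetical.

```python
def label_corpus(credibility: float, num_cps: int) -> str:
    """Mark a corpus as valid when its credibility reaches half the number of CPs."""
    threshold = num_cps * 0.5  # example rule from the text: n * 50%
    return "valid" if credibility >= threshold else "invalid"


# With three CPs the threshold is 1.5.
high = label_corpus(2.5, 3)
boundary = label_corpus(1.5, 3)  # equal to the threshold still counts as valid
low = label_corpus(1.0, 3)
```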
It should be noted that, for corpus currently recognized as invalid, it is possible to further determine whether the multiple intent categories corresponding to the invalid corpus are all empty, that is, whether none of the NLU engines can recognize the corpus. If the multiple intent categories corresponding to the invalid corpus are all empty, no NLU engine can recognize the corpus, and the invalid corpus can be deleted.
If at least one of the intent categories corresponding to the invalid corpus is not empty, at least one NLU engine can recognize the corpus, and the NLU engines simply classify it inconsistently. In this case, all invalid corpus can be divided into multiple corpus classes according to the intent categories already recognized, and the multiple NLU engines can be used again to recognize the invalid corpus in each corpus class. If the intent category recognized by each NLU engine remains unchanged, the classification results that the NLU engines produce for the invalid corpus in that corpus class are stable, so the invalid corpus in that corpus class can be recognized as valid corpus.
For example, for some original corpus, suppose that after recognition by the NLU engines provided by CP1, CP2, and CP3, the calculated intent credibility is less than the credibility threshold, so that according to the foregoing steps this original corpus should be marked as invalid corpus. However, if individual NLU engines can recognize this original corpus and output corresponding classification results, the invalid corpus with the same classification results can be placed in the same corpus class. For example, all corpus that CP1 and CP2 both fail to recognize but that CP3 recognizes as sub-intent 3.3 is placed in the same corpus class. The three NLU engines are then used again to recognize each corpus in that class. If the result is still that CP1 and CP2 both fail to recognize the corpus and CP3 recognizes it as sub-intent 3.3, the invalid corpus in that class can be marked as valid corpus and added to the corpus database, with sub-intent 3.3, as recognized by CP3, used as the intent category of this corpus.
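The re-recognition procedure described above can be sketched as follows. Several details are assumptions made for illustration: each corpus's recognition result is modeled as a tuple with one entry per engine (`None` for an empty category), stability is checked per corpus after a single re-run, and the first non-empty category is taken as the stored intent category. The names `recheck_invalid` and `recognize` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

# One intent category per engine, None where the engine recognized nothing.
Outputs = Tuple[Optional[str], ...]


def recheck_invalid(
    invalid: Dict[str, Outputs],
    recognize: Callable[[str], Outputs],
) -> Tuple[Dict[str, str], List[str]]:
    """Delete corpus that no engine recognizes; promote corpus with stable results."""
    corpus_classes: Dict[Outputs, List[str]] = defaultdict(list)
    deleted: List[str] = []
    for text, outputs in invalid.items():
        if all(category is None for category in outputs):
            deleted.append(text)  # no engine can recognize this corpus
        else:
            corpus_classes[outputs].append(text)  # same outputs -> same corpus class

    promoted: Dict[str, str] = {}
    for outputs, texts in corpus_classes.items():
        for text in texts:
            if recognize(text) == outputs:  # results unchanged on re-recognition
                promoted[text] = next(c for c in outputs if c is not None)
    return promoted, deleted


first_pass = {
    "text a": (None, None, "sub-intent 3.3"),
    "text b": (None, None, None),
    "text c": (None, None, "sub-intent 3.3"),
}
# Stub for the second recognition pass: "text c" changes, so it stays invalid.
second_pass = {
    "text a": (None, None, "sub-intent 3.3"),
    "text c": (None, "sub-intent 2.1", "sub-intent 3.3"),
}
promoted, deleted = recheck_invalid(first_pass, lambda text: second_pass[text])
```

In this example only "text a" is promoted, with sub-intent 3.3 as its intent category, and "text b" is deleted because every engine returned an empty category.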
In this embodiment, original corpus below the credibility threshold is gathered in batches and then recognized multiple times. If the result recognized by each NLU engine remains fixed, this corpus can be marked as valid corpus and stored in the corpus database. By using this auxiliary measure to re-recognize part of the corpus that was recognized as invalid, this embodiment can reduce the large-scale deletion of original corpus caused by differences in the recognition accuracy of the NLU engines, and can effectively expand the size of the corpus database.
S206: Obtain the initial category of the valid corpus, and store the valid corpus, the initial category of the valid corpus, and the intent category recognized by each NLU engine in the corpus database in association with one another;
In this embodiment, when valid corpus is stored in the corpus database, the intent categories of that corpus can be stored together with it.
In a specific implementation, the initial category of the valid corpus can also be obtained. The initial category may be obtained by roughly recognizing the original corpus when it is collected.
For example, when a massive amount of original corpus is collected, an NLU system that performs coarse classification can be used to pre-screen the original corpus and obtain a pre-screening intent classification for each original corpus, which serves as the initial category of that original corpus.
Then, the original corpus marked as valid corpus is stored in the corpus database together with its initial category and the intent category output by each NLU engine.
S207: Divide the stored valid corpus into multiple recognition classes according to the initial categories and intent categories of the stored valid corpus;
After a corpus database containing a massive amount of valid corpus is obtained, this embodiment can also perform aggregate statistics on the valid corpus in the corpus database to construct a corpus whitelist.
In a specific implementation, when aggregate statistics are performed on the valid corpus, the valid corpus whose initial categories and intent categories are all the same can first be placed in the same recognition class.
For example, for each valid corpus, a category string can be generated from its initial category and intent categories, such as [initial category_0.1, CP1 intent category 1.1, CP2 intent category 2.1, CP3 intent category 3.2]. This category string indicates that the initial category of the valid corpus is sub-intent 0.1, and that the intent categories obtained from three different NLU engines are sub-intent 1.1, sub-intent 2.1, and sub-intent 3.2, respectively.
After the category string of each valid corpus is determined in this way, all corpus with the same category string can be aggregated into the same recognition class. The valid corpus in each recognition class has the same initial category, and the intent categories obtained by recognizing the corpus in that class with the three different NLU engines are also respectively the same.
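The aggregation by category string can be sketched as follows. The string layout follows the example in the text; the function names `category_string` and `group_by_category_string`, and the choice to order engines by CP name, are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def category_string(initial: str, engine_categories: Dict[str, str]) -> str:
    """Build a key such as [initial category_0.1, CP1 intent category 1.1, ...]."""
    parts = [f"initial category_{initial}"]
    # Sort by CP name so the same combination always yields the same string.
    for cp, category in sorted(engine_categories.items()):
        parts.append(f"{cp} intent category {category}")
    return "[" + ", ".join(parts) + "]"


def group_by_category_string(
    items: List[Tuple[str, str, Dict[str, str]]],
) -> Dict[str, List[str]]:
    """Aggregate valid corpus with identical category strings into one recognition class."""
    classes: Dict[str, List[str]] = defaultdict(list)
    for text, initial, engine_categories in items:
        classes[category_string(initial, engine_categories)].append(text)
    return dict(classes)


items = [
    ("text a", "0.1", {"CP1": "1.1", "CP2": "2.1", "CP3": "3.2"}),
    ("text b", "0.1", {"CP1": "1.1", "CP2": "2.1", "CP3": "3.2"}),
    ("text c", "0.1", {"CP1": "1.2", "CP2": "2.1", "CP3": "3.2"}),
]
classes = group_by_category_string(items)
```

Here "text a" and "text b" fall into the same recognition class, while "text c" forms a class of its own because CP1 output a different intent category.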
S208: Count the number of valid corpus items contained in each recognition class, and generate the whitelist of the corpus database according to the number of valid corpus items contained in each recognition class.
In this embodiment, after the massive corpus has been aggregated and divided into different recognition classes according to the foregoing steps, the number of corpus items contained in each recognition class can be counted. For example, one recognition class may contain 10,000 corpus items, another may contain 500, and so on.
Then, the corpus whitelist can be constructed according to the number of corpus items contained in each class.
In a specific implementation, the recognition classes that contain more corpus items than a certain threshold can be selected as the corpus whitelist. For example, among all the recognition classes obtained by aggregation, those containing more than 2000 corpus items can be taken as the corpus whitelist. The threshold can be determined according to actual needs, and this embodiment does not limit this.
As an example of this embodiment, after the number of valid corpus items contained in each recognition class has been counted, the recognition classes can be sorted by that number. For example, all recognition classes can be sorted in descending order of the number of corpus items they contain. The recognition classes within a preset ranking interval are then extracted as the whitelist of the corpus database. For example, the recognition classes ranked in the top 80% can be selected as the whitelist.
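Both whitelist selection rules can be sketched briefly. The class names and sizes below are made-up illustration data, the function names are hypothetical, and rounding the top-80% cut-off down to a whole number of classes is an assumption, since the text does not say how fractional counts are handled.

```python
import math
from typing import Dict, List


def whitelist_by_count(class_sizes: Dict[str, int], min_count: int) -> List[str]:
    """Keep recognition classes that contain more than min_count corpus items."""
    return [name for name, size in class_sizes.items() if size > min_count]


def whitelist_by_rank(class_sizes: Dict[str, int], top_fraction: float = 0.8) -> List[str]:
    """Keep the top fraction of recognition classes, ranked by size in descending order."""
    ranked = sorted(class_sizes, key=lambda name: class_sizes[name], reverse=True)
    keep = math.floor(len(ranked) * top_fraction)  # assumption: round down
    return ranked[:keep]


# Illustrative sizes only.
sizes = {"class A": 10000, "class B": 500, "class C": 2500, "class D": 3000, "class E": 100}
by_count = whitelist_by_count(sizes, 2000)
by_rank = whitelist_by_rank(sizes, 0.8)
```

With these sizes, the count rule keeps classes A, C, and D, while the rank rule keeps the four largest classes and drops class E.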
Of course, in individual cases, the corpus whitelist can also be corrected by manual labeling, which is not limited in this embodiment.
In the embodiments of this application, multiple NLU engines are used to recognize the original corpus, so that an intent category corresponding to each NLU engine can be output. The intent credibility of the original corpus can then be obtained from each intent category and its corresponding intent score, so that original corpus whose intent credibility reaches the preset credibility threshold can be marked as valid corpus, and a corpus database containing a massive amount of valid corpus can be generated. On this basis, aggregate statistics on the corpus in the corpus database yield the corresponding corpus whitelist, which is used for subsequent corpus recognition. This embodiment performs fine-grained credibility processing on the domains or intents understood by external NLU services: each original corpus is recognized as a sub-category of the different NLU systems, and each sub-category is then further processed according to a unified standard to identify the intent category that best matches the original corpus. This can effectively increase the service recall rate while maintaining a certain recognition accuracy, improving both the efficiency and the accuracy of corpus recognition. At the same time, this embodiment can automatically label massive corpus, which removes the dependence on manual sorting and tagging when generating a massive corpus database, improves the efficiency of corpus database generation, and helps to obtain a richer corpus database. The generated corpus database can in turn influence subsequent corpus recognition by making more corpus available for comparison and matching, further improving the service recall rate.
For ease of understanding, the corpus recognition method of this application is described below with reference to a specific example.
FIG. 3 is a schematic diagram of a corpus labeling process of this embodiment. According to the labeling process shown in FIG. 3, a first NLU pre-screening can initially be performed on the collected text to obtain the original corpus, such as encyclopedia queries and small talk, that requires further recognition by external NLU, together with the pre-screening intent classification of that original corpus. It should be noted that the first NLU pre-screening may be a rough processing of the text, and the intent classification obtained by pre-screening may be a relatively broad category.
For the original corpus obtained by pre-screening, the NLU processing interfaces of n CPs can be called with the original corpus as input. Each NLU performs recognition, and the corresponding intent category and intent score are output. In this embodiment, no fewer than 3 CPs should be called.
From the intent category and intent score output by each NLU, the corresponding intent credibility can be calculated according to the set formula. By comparing the intent credibility with the credibility threshold, the original corpus can be marked as valid corpus or invalid corpus. In this embodiment, the credibility threshold may be half of the number of CPs called, that is, credibility threshold = n*50%.
For original corpus marked as valid corpus, the pre-screening intent classification of the corpus and the intent category output by each NLU can be recorded at the same time, completing the labeling of the original corpus.
On the basis of FIG. 3, FIG. 4 shows a schematic diagram of a corpus whitelist generation process of this embodiment. For massive original corpus, the labeling process shown in FIG. 3 can be executed repeatedly to obtain the pre-screening intent classification of the corpus and the intent category output by each NLU. Aggregate statistics can then be performed on the corpus according to the pre-screening intent classification and the intent category output by each NLU, and the top 80% of the resulting recognition classes can be taken as the corpus whitelist. The whitelist can also be corrected manually to some extent.
The foregoing embodiments describe original corpus labeling, corpus database generation, and whitelist construction in detail. The following describes the process of recognizing corpus on the basis of the whitelist constructed in the foregoing embodiments, that is, the application of the whitelist.
Referring to FIG. 5, a schematic flowchart of the steps of a corpus recognition method provided by another embodiment of this application is shown. The method may specifically include the following steps:
S501: When a target corpus to be recognized is received, use multiple NLU engines to recognize the target corpus, and obtain an intent category corresponding to each NLU engine;
It should be noted that the corpus recognition method provided in this embodiment can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, vehicle-mounted devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, and personal digital assistants (PDA). The embodiments of this application do not impose any restriction on the specific type of terminal device.
Take a mobile phone as an example of the terminal device. FIG. 6 shows a block diagram of part of the structure of a mobile phone provided by an embodiment of this application. Referring to FIG. 6, the mobile phone includes components such as a radio frequency (RF) circuit 610, a memory 620, an input unit 630, a display unit 640, a sensor 650, an audio circuit 660, a wireless fidelity (Wi-Fi) module 670, a processor 680, and a power supply 690. Those skilled in the art can understand that the mobile phone structure shown in FIG. 6 does not constitute a limitation on the mobile phone, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
The components of the mobile phone are described in detail below with reference to FIG. 6:
The RF circuit 610 can be used to receive and send signals while sending and receiving information or during a call. In particular, after receiving downlink information from the base station, it passes the information to the processor 680 for processing; in addition, it sends uplink-related data to the base station. Generally, the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 610 can also communicate with networks and other devices through wireless communication. The wireless communication can use any communication standard or protocol, including but not limited to Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and the like.
The memory 620 can be used to store software programs and modules. The processor 680 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 620. The memory 620 may mainly include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (such as a sound playback function or an image playback function), and the like; the data storage area may store data created through use of the mobile phone (such as audio data or a phone book), and the like. In addition, the memory 620 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
The input unit 630 can be used to receive input numeric or character information, and to generate key signal input related to the user settings and function control of the mobile phone 600. Specifically, the input unit 630 may include a touch panel 631 and other input devices 632. The touch panel 631, also called a touch screen, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 631 with a finger, a stylus, or any other suitable object or accessory), and drive the corresponding connection device according to a preset program. Optionally, the touch panel 631 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal produced by the touch operation, and transmits the signal to the touch controller. The touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and sends them to the processor 680; it can also receive and execute commands sent by the processor 680. In addition, the touch panel 631 can be implemented in multiple types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 631, the input unit 630 may also include other input devices 632. Specifically, the other input devices 632 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and a power key), a trackball, a mouse, and a joystick.
The display unit 640 may be used to display information input by the user, information provided to the user, and the various menus of the mobile phone. The display unit 640 may include a display panel 641. Optionally, the display panel 641 may be configured as a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 631 may cover the display panel 641. When the touch panel 631 detects a touch operation on or near it, it passes the operation to the processor 680 to determine the type of the touch event, and the processor 680 then provides the corresponding visual output on the display panel 641 according to the type of the touch event. Although in FIG. 6 the touch panel 631 and the display panel 641 are shown as two independent components that implement the input and output functions of the mobile phone, in some embodiments the touch panel 631 and the display panel 641 may be integrated to implement those input and output functions.
The mobile phone 600 may also include at least one sensor 650, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor. The ambient light sensor can adjust the brightness of the display panel 641 according to the ambient light, and the proximity sensor can turn off the display panel 641 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, the accelerometer can detect the magnitude of acceleration in all directions (generally three axes), and can detect the magnitude and direction of gravity when stationary. It can be used in applications that recognize the posture of the mobile phone (such as switching between landscape and portrait, related games, and magnetometer posture calibration) and in vibration-recognition functions (such as a pedometer and tap detection). Other sensors that may also be configured on the mobile phone, such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, are not described further here.
The audio circuit 660, the speaker 661, and the microphone 662 can provide an audio interface between the user and the mobile phone. The audio circuit 660 can convert received audio data into an electrical signal and transmit it to the speaker 661, which converts it into a sound signal for output. In the other direction, the microphone 662 converts collected sound signals into electrical signals, which the audio circuit 660 receives and converts into audio data. The audio data is then output to the processor 680 for processing and sent via the RF circuit 610 to, for example, another mobile phone, or output to the memory 620 for further processing.
Wi-Fi is a short-range wireless transmission technology. Through the Wi-Fi module 670, the mobile phone can help users send and receive emails, browse web pages, and access streaming media, providing users with wireless broadband Internet access. Although FIG. 6 shows the Wi-Fi module 670, it is understood that the module is not an essential component of the mobile phone 600 and may be omitted as needed without changing the essence of the invention.
The processor 680 is the control center of the mobile phone. It connects the various parts of the entire mobile phone through various interfaces and lines, and performs the functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 620 and calling the data stored in the memory 620, thereby monitoring the mobile phone as a whole. Optionally, the processor 680 may include one or more processing units. Preferably, the processor 680 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, and application programs, and the modem processor mainly handles wireless communication. It is understood that the modem processor may also not be integrated into the processor 680.
The mobile phone 600 also includes a power source 690 (such as a battery) that supplies power to the various components. Preferably, the power source may be logically connected to the processor 680 through a power management system, so that charging, discharging, and power consumption are managed through the power management system.
Although not shown, the mobile phone 600 may also include a camera. Optionally, the camera may be front-facing or rear-facing on the mobile phone 600, which is not limited in the embodiments of the present application.
Optionally, the mobile phone 600 may include a single camera, dual cameras, triple cameras, or the like, which is not limited in the embodiments of the present application.
For example, the mobile phone 600 may include three cameras: a main camera, a wide-angle camera, and a telephoto camera.
Optionally, when the mobile phone 600 includes multiple cameras, they may all be front-facing, all be rear-facing, or be partly front-facing and partly rear-facing, which is not limited in the embodiments of the present application.
In addition, although not shown, the mobile phone 600 may also include a Bluetooth module and the like, which are not described further here.
FIG. 7 is a schematic diagram of the software structure of the mobile phone 600 according to an embodiment of the present application. Taking the case where the operating system of the mobile phone 600 is Android as an example, in some embodiments the Android system is divided into four layers: the application layer, the application framework layer (framework, FWK), the system layer, and the hardware abstraction layer. The layers communicate with one another through software interfaces.
As shown in FIG. 7, the application layer may include a series of application packages, which may include applications such as short messages, calendar, camera, video, navigation, gallery, and calls.
The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer may include some predefined functions, such as functions for receiving events sent by the application framework layer.
As shown in FIG. 7, the application framework layer may include a window manager, a resource manager, a notification manager, and the like.
The window manager is used to manage window programs. The window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, and so on. The content provider is used to store and retrieve data and make the data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, the phone book, and the like.
The resource manager provides applications with various resources, such as localized strings, icons, pictures, layout files, and video files.
The notification manager enables applications to display notification information in the status bar. It can be used to convey informational messages that disappear automatically after a short stay without user interaction; for example, the notification manager is used to announce a completed download or a message reminder. Notifications may also appear in the status bar at the top of the system as a chart or scroll-bar text, such as notifications from applications running in the background, or appear on the screen as a dialog window. Examples include prompting text in the status bar, playing a prompt sound, vibrating the electronic device, and flashing the indicator light.
The application framework layer may also include:
A view system, which includes visual controls, such as controls that display text and controls that display pictures. The view system can be used to build applications. A display interface may be composed of one or more views. For example, a display interface that includes a short message notification icon may include a view that displays text and a view that displays pictures.
A phone manager, which is used to provide the communication functions of the mobile phone 600, such as management of the call state (including connecting and hanging up).
The system layer may include multiple functional modules, for example a sensor service module, a physical state recognition module, and a three-dimensional graphics processing library (for example, OpenGL ES).
The sensor service module is used to monitor the sensor data uploaded by the various sensors at the hardware layer and to determine the physical state of the mobile phone 600;
The physical state recognition module is used to analyze and recognize user gestures, faces, and the like;
The three-dimensional graphics processing library is used to implement three-dimensional graphics drawing, image rendering, compositing, and layer processing.
The system layer may also include:
A surface manager, which is used to manage the display subsystem and provides the blending of 2D and 3D layers for multiple applications.
A media library, which supports playback and recording of a variety of commonly used audio and video formats, as well as still image files. The media library can support multiple audio, video, and image encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.
The hardware abstraction layer is the layer between hardware and software. The hardware abstraction layer may include a display driver, a camera driver, a sensor driver, and the like, which are used to drive the related hardware at the hardware layer, such as the display screen, camera, and sensors.
The following embodiments can be implemented on the mobile phone 600 having the above hardware structure and software structure. The following embodiments take the mobile phone 600 as an example to describe the corpus recognition method provided in this embodiment.
In this embodiment, the target corpus to be recognized may be a word or sentence spoken by the user when using a voice service on the mobile phone. For example, when invoking the intelligent voice assistant on the mobile phone, the user may speak a sentence to the voice assistant to instruct it to complete a certain task or output certain information.
For example, the user may say "Who is Andy Lau" to the voice assistant, and the voice assistant may convert this sentence into text. The resulting text is the target corpus to be recognized.
Of course, the user may also provide the target corpus to the mobile phone by typing text directly, which is not limited in this embodiment.
After receiving the target corpus, the mobile phone can call the NLU services provided by multiple CPs to recognize the target corpus separately, and each service outputs its corresponding intent category.
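The fan-out to several NLU services described above could be sketched as follows. This is only an illustration under stated assumptions: the `NluEngine` type, the `recognize_with_all` helper, and the three stand-in engines are hypothetical names, not interfaces defined by the application, and real CP services would be remote calls.

```python
from typing import Callable, Dict, Optional

# Hypothetical engine type: takes the utterance text and returns an intent
# category label, or None when the engine cannot understand the utterance.
NluEngine = Callable[[str], Optional[str]]


def recognize_with_all(text: str, engines: Dict[str, NluEngine]) -> Dict[str, Optional[str]]:
    """Send the same target corpus to every engine and collect one result per CP."""
    results: Dict[str, Optional[str]] = {}
    for cp_name, engine in engines.items():
        try:
            results[cp_name] = engine(text)
        except Exception:
            # Assumption: a failing engine counts as "not recognized",
            # so one CP cannot block the others.
            results[cp_name] = None
    return results


# Stand-in engines for illustration only.
engines = {
    "CP1": lambda text: "sub-intent 1.2",
    "CP2": lambda text: "sub-intent 2.1",
    "CP3": lambda text: "sub-intent 3.1",
}
intents = recognize_with_all("Who is Andy Lau", engines)
```

Keeping one entry per CP, including `None` for engines that return nothing, preserves the information needed later to match the results against the whitelist.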
S502: Extract a whitelist matching the intent categories from a preset corpus;
In this embodiment, the corpus may be obtained by annotating a large amount of original corpus. Besides each utterance itself, the corpus may also store information such as the intent classifications obtained when that utterance is recognized by multiple NLU systems.
After multiple NLUs have recognized the target corpus and produced their intent categories, the corpus whitelist that matches those intent categories can be extracted from the corpus.
FIG. 8 is a flowchart of the steps for generating a corpus whitelist in this embodiment. The corpus whitelist can be generated through the following steps:
S801: Divide the multiple valid corpora stored in the corpus into multiple recognition classes according to their initial categories and intent categories;
In this embodiment, the initial category of a valid corpus stored in the corpus may be obtained during preliminary screening of the original corpus, and its intent categories may be obtained through separate recognition by multiple NLU systems.
It should be noted that, for a corpus marked as valid and stored in the corpus, at least one of the NLU systems used for recognition should be able to understand it and output a corresponding intent category. Therefore, a valid corpus stored in the corpus should generally include its initial category and the intent category output by at least one NLU system.
To construct the corpus whitelist, valid corpora whose initial categories and intent categories are all the same can first be placed in the same recognition class.
For example, a category string can be generated for each valid corpus from its initial category and intent categories, such as [initial category_0.1, CP1 intent category 1.1, CP2 intent category 2.1, CP3 intent category 3.2]. This category string indicates that the initial category of the valid corpus is sub-intent 0.1, and that the intent categories obtained from three different NLU engines are sub-intent 1.1, sub-intent 2.1, and sub-intent 3.2 respectively.
After the category string of each valid corpus has been determined in this way, all corpora with the same category string can be aggregated into the same recognition class.
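The grouping in step S801 can be illustrated with a short sketch. The record layout (`text`, `initial`, `intents`) and the function names are assumptions made for this example; the application only specifies that corpora sharing the same category string fall into the same recognition class.

```python
from collections import defaultdict


def category_key(initial, cp_intents):
    """Build a hashable key from the initial category and each CP's intent category."""
    # Sort by CP name so the key does not depend on dictionary insertion order.
    per_cp = tuple(sorted(cp_intents.items(), key=lambda kv: kv[0]))
    return (initial, per_cp)


def group_into_recognition_classes(corpora):
    """Aggregate valid corpora that share the same category key."""
    classes = defaultdict(list)
    for item in corpora:
        classes[category_key(item["initial"], item["intents"])].append(item["text"])
    return dict(classes)


# Illustrative records only; the category values echo the example in the text.
corpora = [
    {"text": "utterance a", "initial": "0.1", "intents": {"CP1": "1.1", "CP2": "2.1", "CP3": "3.2"}},
    {"text": "utterance b", "initial": "0.1", "intents": {"CP3": "3.2", "CP1": "1.1", "CP2": "2.1"}},
    {"text": "utterance c", "initial": "0.1", "intents": {"CP1": "1.2", "CP2": "2.1", "CP3": "3.2"}},
]
classes = group_into_recognition_classes(corpora)
```

Here the first two utterances land in one recognition class and the third in another, because only its CP1 intent category differs.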
S802: Count the number of valid corpora contained in each recognition class;
After the massive set of valid corpora has been aggregated and divided into recognition classes according to the preceding step, the number of corpora in each recognition class can be counted. For example, one recognition class may contain 10,000 corpora, another may contain 500, and so on.
S803: Generate the whitelist of the corpus according to the number of valid corpora contained in each recognition class.
As an example of this embodiment, after the number of valid corpora in each recognition class has been counted, the recognition classes can be sorted by that number, for example from the most corpora to the fewest. The recognition classes that fall within a preset ranking interval are then extracted as the whitelist of the corpus. For example, the recognition classes ranked in the top 80% may be selected as the whitelist.
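Steps S802 and S803 amount to ranking the recognition classes by size and keeping a leading share of them. The sketch below assumes the 80% refers to the share of ranked classes and rounds the cut-off up; both the rounding rule and the names `build_whitelist` and `keep_percent` are choices made for this example.

```python
def build_whitelist(class_counts, keep_percent=80):
    """Rank recognition classes by corpus count and keep the top share of classes."""
    # Sort from the most corpora to the fewest; ties keep their original order.
    ranked = sorted(class_counts.items(), key=lambda kv: kv[1], reverse=True)
    # Integer ceiling of len(ranked) * keep_percent / 100 (assumed rounding rule).
    keep = (len(ranked) * keep_percent + 99) // 100
    return [name for name, _ in ranked[:keep]]


# Hypothetical counts per recognition class.
class_counts = {"class_A": 10000, "class_B": 500, "class_C": 40, "class_D": 7, "class_E": 1}
whitelist = build_whitelist(class_counts)
```

With five classes, the top 80% keeps four of them, so the smallest class is left out of the whitelist.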
It should be noted that the process of generating the corpus whitelist in steps S801-S803 of this embodiment is similar to steps S207-S208 of the foregoing embodiment, and the two may be read together. This embodiment describes the process only briefly; for details, refer to the description of the foregoing embodiment.
In this embodiment, when extracting from the corpus the whitelist that matches the intent categories of the target corpus, the initial category and some of the intent categories may be matched, so as to find a whitelist whose initial category is the same and whose intent categories are partly the same.
In a specific implementation, the initial category of the target corpus can be obtained first, and then a whitelist containing that initial category and at least one of the intent categories identified by the NLU engines can be extracted from the corpus.
For example, for the target corpus "Who is Andy Lau", if the intent categories identified by the three NLUs are sub-intent 1.2, sub-intent 2.1, and sub-intent 3.1, then when extracting the whitelist, the initial category of the corpus can be determined first, and a whitelist that contains that initial category and some of sub-intent 1.2, sub-intent 2.1, and sub-intent 3.1 can then be found in the corpus.
Building on FIG. 3, FIG. 9 shows a schematic diagram of how the corpus whitelist is applied in this embodiment. For the target corpus to be recognized, matching can be performed in the corpus according to the initial screening intent classification of the target corpus and the intent category output by each NLU. If the initial screening intent classification and some of the intent categories output by the NLUs are matched by a whitelist, the most suitable recognition result can be returned, according to a set ranking rule, from the intent categories in the matched whitelist.
Here, matching some of the intent categories output by the NLUs means the following: if a whitelist entry is initial screening intent classification (initial category) = C, CP1 intent category = C1, CP2 intent category = C2, then a target corpus whose recognition result is initial screening intent classification = C, CP1 intent category = C1, CP2 intent category = C2, CP3 intent category = C3 can be considered to conform to that whitelist.
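The partial-match rule can be read as a subset test: a whitelist entry matches when its initial category equals that of the target corpus and every CP intent category it lists agrees with what that CP returned. The following sketch assumes this reading; the entry layout and the names `entry_matches` and `find_matching_entries` are hypothetical.

```python
def entry_matches(entry, initial, cp_intents):
    """True if the entry's initial category and all of its listed CP intents agree with the target."""
    if entry["initial"] != initial or not entry["intents"]:
        # Assumption: an entry must list at least one CP intent category to match.
        return False
    return all(cp_intents.get(cp) == intent for cp, intent in entry["intents"].items())


def find_matching_entries(whitelist, initial, cp_intents):
    """Return every whitelist entry that the target corpus conforms to."""
    return [entry for entry in whitelist if entry_matches(entry, initial, cp_intents)]


# Illustrative whitelist using the symbols from the text (C, C1, C2, C3).
whitelist_entries = [
    {"initial": "C", "intents": {"CP1": "C1", "CP2": "C2"}},
    {"initial": "C", "intents": {"CP1": "C1", "CP3": "C9"}},
]
target_intents = {"CP1": "C1", "CP2": "C2", "CP3": "C3"}
matched = find_matching_entries(whitelist_entries, "C", target_intents)
```

The first entry matches even though the target also carries a CP3 result, while the second is rejected because its CP3 intent category differs from the one returned for the target.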
In this embodiment, the ranking rule may be based on the intent scores corresponding to the intent categories output by the different NLUs, with the higher-scoring category preferentially returned to the user. Alternatively, pre-designed rules may route each intent category preferentially to a particular CP. As a further alternative, without distinguishing specific intent categories, the ranking priorities of the CPs may be compared directly, and the intent category corresponding to the CP with the higher priority is returned to the user. This embodiment does not limit the rule used.
S503: Recognize the target corpus according to the intent categories contained in the whitelist.
For the extracted whitelist, the intent score corresponding to each intent category contained in the whitelist can be determined first. It should be noted that the intent score for each intent category can be obtained by testing each NLU engine on sample corpora. For how the intent score is obtained, refer to the description of step S203 in the foregoing embodiment, which is not repeated here.
Then, the intent category with the highest intent score in the whitelist can be taken as the target intent category of the target corpus currently being recognized.
For example, in the example above, the whitelist matched by the target corpus contains the intent category recognition results C1 and C2 of CP1 and CP2. The intent scores of C1 and C2 can be compared, and the intent category with the larger score is selected as the final recognition result and returned to the user.
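Choosing the final result by intent score reduces to taking a maximum over the intent categories in the matched whitelist. In the sketch below, the score values and the name `pick_target_intent` are invented for illustration; in the application the scores come from testing each NLU engine on sample corpora.

```python
def pick_target_intent(entry_intents, intent_scores):
    """Return the (CP, intent category) pair whose intent score is highest."""
    # Assumption: an intent category with no recorded score is treated as 0.0.
    return max(entry_intents.items(), key=lambda kv: intent_scores.get(kv[1], 0.0))


# Hypothetical scores for the two categories in the matched whitelist.
intent_scores = {"C1": 1.2, "C2": 0.7}
matched_intents = {"CP1": "C1", "CP2": "C2"}
best_cp, best_intent = pick_target_intent(matched_intents, intent_scores)
```

Returning the CP together with the category makes it possible to route the request to the service that produced the winning result.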
In the embodiments of the present application, multiple NLU engines recognize the received target corpus, and the intent category corresponding to each NLU engine is obtained. A whitelist matching those intent categories can then be extracted from the preset corpus, and the target corpus can be recognized according to the intent categories contained in the whitelist. In this embodiment, after the original corpus has been annotated and the corpus and its whitelist have been generated, corpora can be recognized according to the annotated category information and the whitelist, which helps improve the service recall rate for end users.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
To facilitate understanding, the corpus recognition method of this embodiment is described in full below with a specific example, which may include the following steps:
1. Obtain the original corpus. The original corpus in this embodiment may be obtained by crawling the various text information entered by users on the network, or by converting users' speech into text. The volume of original corpus is very large and may reach the order of millions.
2. Select n (n ≥ 3) CPs that provide NLU services as the aggregation targets, and use each CP's NLU system to recognize every original corpus and obtain the corresponding intent category. FIG. 10 is a schematic diagram of the relationship between the original corpus and the intent categories in this embodiment. As shown in FIG. 10, each original corpus is recognized by the NLU systems of three CPs (CP1, CP2, and CP3), and each NLU system outputs its own recognition result.
2.1. If all CPs produce the same classification, the classification result of any one CP can be taken, and the original corpus is labeled with that result.
2.2. If none of the CPs can recognize or classify the corpus, the original corpus can be judged invalid and removed from the original corpus.
2.3. If the intent credibility corresponding to the intent categories identified by the CPs is high (the credibility threshold may be set to more than half the number of CPs), the current corpus can be labeled automatically, and its initial screening intent classification and the intent category output by each CP are recorded. The intent credibility can be calculated according to the following formula.
Figure PCTCN2020124764-appb-000003
(Intention category output by CPi*Intent score corresponding to the intention category output by CPi)
2.4. If the intent credibility of the CPs' outputs is below the credibility threshold, the corpora below the threshold can be aggregated in batches and recognized several more times. If the result recognized by each CP remains fixed, these corpora can be marked as valid corpora, and the recognition results of the CPs are mapped to the same overall classification (many-to-one).
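The triage in steps 2.1 to 2.4 can be sketched as a small decision function. Because the formula image is not reproduced in the text, this sketch assumes the intent credibility is the sum, over all CPs, of an indicator (1 if the CP output an intent category, otherwise 0) multiplied by the intent score of that category, and that the threshold is half the number of CPs. It also assumes that "same classification" can be tested by comparing labels directly. The function names, return labels, and score values are hypothetical.

```python
def intent_credibility(cp_intents, intent_scores):
    """Assumed formula: sum of intent scores over the CPs that returned a category."""
    return sum(
        intent_scores.get(intent, 0.0)
        for intent in cp_intents.values()
        if intent is not None
    )


def triage(cp_intents, intent_scores):
    """Classify one original corpus according to steps 2.1 to 2.4."""
    recognized = [intent for intent in cp_intents.values() if intent is not None]
    if not recognized:
        return "invalid"            # step 2.2: no CP could recognize it
    if len(recognized) == len(cp_intents) and len(set(recognized)) == 1:
        return "label_consistent"   # step 2.1: all CPs agree
    threshold = len(cp_intents) / 2  # assumed reading of "more than half the number of CPs"
    if intent_credibility(cp_intents, intent_scores) > threshold:
        return "label_confident"    # step 2.3: label automatically
    return "recheck"                # step 2.4: batch and recognize again


# Hypothetical intent scores per intent category.
scores = {"sub-intent 1.2": 1.5, "sub-intent 2.1": 0.5, "sub-intent 3.1": 0.2}
decision = triage({"CP1": "sub-intent 1.2", "CP2": "sub-intent 2.1", "CP3": None}, scores)
```

In this example the credibility is 1.5 + 0.5 = 2.0, which exceeds the threshold of 1.5 for three CPs, so the corpus would be labeled automatically.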
Table 2 shows an example of the recognition results obtained by recognizing the original corpus. In this example, some corpora cannot be recognized and some are recognized inconsistently by the CPs.
Table 2:
Figure PCTCN2020124764-appb-000004
3. On the basis of the preceding steps, a massive set of valid corpora is obtained to form the corpus, and a corpus whitelist can be output by aggregating and counting the corpora in it. The valid corpora and the whitelist in the corpus can be used for subsequent recognition of target corpora.
4. The user inputs the target corpus "Who is Andy Lau". Preliminary screening shows that its initial screening intent classification is "General Encyclopedia". After recognition by the NLU systems of multiple CPs, CP1 returns sub-intent 1.2, CP2 returns sub-intent 2.1, and CP3 returns sub-intent 3.1. Matching against the whitelist shown in Table 3 shows that sub-intent 3.1 returned by CP3 is not in the whitelist.
Table 3:
Figure PCTCN2020124764-appb-000005
5. According to the intent scores shown in Table 1, the intent score of sub-intent 1.2 is 1.5 points and the intent score of sub-intent 2.1 is 0.5 points. The target corpus "Who is Andy Lau" therefore matches sub-intent 1.2 and sub-intent 2.1 in the whitelist, and because sub-intent 1.2 has the higher score, the terminal returns the recognition result of CP1, namely sub-intent 1.2, to the user.
In this embodiment, multiple NLU services are processed at a fine granularity by domain or intent and given credibility scores, and are then integrated and aggregated for access by the terminal device. This can effectively improve the service recall rate for end users, and the process also achieves automatic recognition and labeling of the massive original corpus. In tests, the corpus recognition method provided in this embodiment raised the service recall rate of the terminal from 59.5% to 81.3% with no significant drop in accuracy.
Corresponding to the corpus recognition method described in the foregoing embodiments, FIG. 11 shows a structural block diagram of a corpus recognition apparatus provided in an embodiment of the present application. For ease of description, only the parts related to this embodiment are shown.
Referring to FIG. 11, the apparatus can be applied to a terminal device and may specifically include the following modules:
An original corpus acquisition module 1101, configured to obtain an original corpus to be recognized;
An intent category recognition module 1102, configured to recognize the original corpus by using multiple natural language understanding (NLU) engines, and to obtain the intent category corresponding to each NLU engine;
An intent credibility determination module 1103, configured to determine the intent credibility of the original corpus according to the intent category of each NLU engine;
An original corpus recognition module 1104, configured to recognize the original corpus according to the intent credibility.
In this embodiment of the present application, the intent category recognition module 1102 may specifically include the following sub-modules:
A processing interface calling sub-module, configured to call the processing interfaces of multiple NLU engines;
An original corpus input sub-module, configured to input the original corpus into the processing interface of each NLU engine, so as to instruct each NLU engine to recognize the original corpus;
An intent category receiving sub-module, configured to receive the intent category output by each NLU engine.
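The three sub-modules above amount to a fan-out over the engines. The sketch below is only an illustration under stated assumptions: the source does not specify the processing interface, so each engine is modeled as a plain callable that returns an intent category or `None`, and `cp1_engine` and `cp2_engine` are hypothetical stand-ins rather than real NLU services.

```python
from typing import Callable, Dict, Optional

# Hypothetical processing interface: corpus text in, intent category (or None) out.
NluEngine = Callable[[str], Optional[str]]


def recognize_with_engines(
    corpus: str, engines: Dict[str, NluEngine]
) -> Dict[str, Optional[str]]:
    """Input the same corpus into every engine and collect one intent category per engine."""
    return {name: engine(corpus) for name, engine in engines.items()}


# Stand-in engines used only for this example.
def cp1_engine(text: str) -> Optional[str]:
    return "sub-intent 1.2" if "Andy Lau" in text else None


def cp2_engine(text: str) -> Optional[str]:
    return "sub-intent 2.1" if text.startswith("Who is") else None


intents = recognize_with_engines(
    "Who is Andy Lau", {"CP1": cp1_engine, "CP2": cp2_engine}
)
print(intents)
```

An engine that recognizes nothing returns `None` here, which corresponds to an empty intent category in the text.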
In this embodiment of the present application, the intent credibility determination module 1103 may specifically include the following sub-modules:
An intent score determination sub-module, configured to determine the intent score corresponding to the intent category of each NLU engine, where the intent score corresponding to each intent category is obtained by testing each NLU engine on a sample corpus;
An intent credibility calculation sub-module, configured to calculate the intent credibility of the original corpus according to each intent category and its corresponding intent score.
In this embodiment of the present application, the intent credibility calculation sub-module may specifically include the following units:
A weight value determination unit, configured to determine the weight value of each intent category;
An intent credibility calculation unit, configured to use the weight values to perform a weighted summation of the intent scores corresponding to the intent categories, so as to obtain the intent credibility of the original corpus.
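The weighted summation performed by these two units can be sketched as below. The source does not state how the weight values are chosen, so the equal weights of 0.5 are purely illustrative assumptions; the two intent scores reuse the values quoted for sub-intent 1.2 and sub-intent 2.1.

```python
def intent_credibility(scores, weights):
    """Weighted sum of intent scores: sum over categories of weight * score."""
    return sum(weights[category] * score for category, score in scores.items())


# Intent scores of the categories returned by the engines (values from the source example).
scores = {"sub-intent 1.2": 1.5, "sub-intent 2.1": 0.5}
# Illustrative weights; how they are determined is not specified in the source.
weights = {"sub-intent 1.2": 0.5, "sub-intent 2.1": 0.5}

credibility = intent_credibility(scores, weights)
print(credibility)  # 1.0
```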
In this embodiment of the present application, the original corpus recognition module 1104 may specifically include the following sub-modules:
A valid corpus recognition sub-module, configured to recognize the original corpus as a valid corpus if the intent credibility is greater than or equal to a preset credibility threshold;
An invalid corpus recognition sub-module, configured to recognize the original corpus as an invalid corpus if the intent credibility is less than the credibility threshold.
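The threshold rule is simple enough to state directly in code. In this sketch the threshold value 1.0 is an arbitrary placeholder, since the source only says the threshold is preset.

```python
def classify_corpus(credibility: float, threshold: float) -> str:
    """Label a corpus 'valid' when its credibility reaches the threshold, else 'invalid'."""
    return "valid" if credibility >= threshold else "invalid"


THRESHOLD = 1.0  # placeholder value; the source does not give a concrete threshold

print(classify_corpus(1.0, THRESHOLD))  # valid (equality counts as valid)
print(classify_corpus(0.9, THRESHOLD))  # invalid
```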
In this embodiment of the present application, the original corpus recognition module 1104 may further include the following sub-modules:
An intent category judgment sub-module, configured to judge whether the multiple intent categories corresponding to the invalid corpus are all empty;
An invalid corpus deletion sub-module, configured to delete the invalid corpus if the multiple intent categories corresponding to the invalid corpus are all empty;
An invalid corpus re-recognition sub-module, configured to, if at least one of the multiple intent categories corresponding to the invalid corpus is not empty, divide the invalid corpus into multiple corpus classes according to the intent categories, and use the multiple NLU engines again to recognize the invalid corpus in each corpus class; if the intent category recognized by each NLU engine remains unchanged, the invalid corpus in the corpus class is recognized as a valid corpus.
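One possible reading of this delete, group, and re-recognize flow is sketched below. Several details are assumptions: empty intent categories are represented as `None`, the engines are deterministic stand-in callables, and the "remains unchanged" check is applied per corpus item, which the source does not spell out.

```python
from collections import defaultdict


def reprocess_invalid(invalid, engines):
    """Drop corpora with no intent at all, group the rest, and re-run the engines.

    invalid: dict mapping corpus text to a tuple of intent categories, one per
             engine, with None standing for an empty category.
    engines: list of callables in the same order as the tuples.
    Returns the corpora promoted to valid.
    """
    corpus_classes = defaultdict(list)
    for text, intents in invalid.items():
        if all(intent is None for intent in intents):
            continue  # every category is empty: the corpus is deleted
        corpus_classes[intents].append(text)  # same categories, same corpus class

    promoted = []
    for intents, texts in corpus_classes.items():
        for text in texts:
            second_pass = tuple(engine(text) for engine in engines)
            if second_pass == intents:  # every engine repeats its first result
                promoted.append(text)
    return promoted


# Stand-in engines for the example only.
engine_a = lambda text: "weather" if "weather" in text else None
engine_b = lambda text: None

invalid = {
    "weather today": ("weather", None),  # stable on re-recognition
    "asdf": (None, None),                # all empty, deleted
    "play music": ("music", None),       # changes on re-recognition
}
promoted = reprocess_invalid(invalid, [engine_a, engine_b])
print(promoted)  # ['weather today']
```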
In this embodiment of the present application, the apparatus may further include the following modules:
An initial category acquisition module, configured to obtain the initial category of the valid corpus;
A valid corpus association storage module, configured to store the valid corpus, the initial category of the valid corpus, and the intent category recognized by each NLU engine in a corpus database in association with one another.
In this embodiment of the present application, the apparatus may further include the following modules:
A recognition class division module, configured to divide the multiple stored valid corpora into multiple recognition classes according to their initial categories and intent categories;
A corpus quantity statistics module, configured to count the number of valid corpora contained in each recognition class;
A whitelist generation module, configured to generate a whitelist of the corpus database according to the number of valid corpora contained in each recognition class.
In this embodiment of the present application, the recognition class division module may specifically include the following sub-module:
A recognition class division sub-module, configured to place valid corpora whose initial categories and intent categories are all the same into the same recognition class.
In this embodiment of the present application, the whitelist generation module may specifically include the following sub-modules:
A recognition class ranking sub-module, configured to rank the recognition classes according to the number of valid corpora contained in each recognition class;
A whitelist generation sub-module, configured to extract the recognition classes that fall within a preset ranking interval as the whitelist of the corpus database.
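The division, counting, ranking, and extraction steps can be sketched with a counter. The record layout `(text, initial_category, intent_categories)` and the reading of the "preset ranking interval" as the top `top_n` classes are assumptions of this sketch, not details given in the source.

```python
from collections import Counter


def build_whitelist(records, top_n):
    """Group valid corpora into recognition classes and keep the most populous ones.

    records: iterable of (text, initial_category, intent_categories) tuples, where
             intent_categories is a tuple with one entry per NLU engine.
    top_n:   size of the ranking interval, interpreted here as "the first top_n classes".
    """
    # Corpora with identical initial and intent categories share a recognition class.
    counts = Counter((initial, intents) for _, initial, intents in records)
    # most_common() sorts the classes by descending corpus count.
    ranked = [recognition_class for recognition_class, _ in counts.most_common()]
    return ranked[:top_n]


records = [
    ("play a song", "music", ("play",)),
    ("play some jazz", "music", ("play",)),
    ("will it rain", "weather", ("query",)),
]
whitelist = build_whitelist(records, top_n=1)
print(whitelist)  # [('music', ('play',))]
```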
Referring to FIG. 12, a structural block diagram of a corpus recognition apparatus provided by another embodiment of the present application is shown. The apparatus can be applied to a terminal device and may specifically include the following modules:
An intent category recognition module 1201, configured to, when a target corpus to be recognized is received, recognize the target corpus by using multiple NLU engines, and obtain the intent category corresponding to each NLU engine;
A whitelist extraction module 1202, configured to extract a whitelist matching the intent categories from a preset corpus database;
A target corpus recognition module 1203, configured to recognize the corpus according to the intent categories included in the whitelist.
In this embodiment of the present application, the whitelist may be generated by the following modules:
A recognition class division module, configured to divide the multiple valid corpora stored in the corpus database into multiple recognition classes according to their initial categories and intent categories;
A corpus quantity statistics module, configured to count the number of valid corpora contained in each recognition class;
A whitelist generation module, configured to generate the whitelist of the corpus database according to the number of valid corpora contained in each recognition class.
In this embodiment of the present application, the recognition class division module may specifically include the following sub-module:
A recognition class division sub-module, configured to place valid corpora whose initial categories and intent categories are all the same into the same recognition class.
In this embodiment of the present application, the whitelist generation module may specifically include the following sub-modules:
A recognition class ranking sub-module, configured to rank the recognition classes according to the number of valid corpora contained in each recognition class;
A whitelist generation sub-module, configured to extract the recognition classes that fall within a preset ranking interval as the whitelist of the corpus database.
In this embodiment of the present application, the whitelist extraction module 1202 may specifically include the following sub-modules:
An initial category acquisition sub-module, configured to obtain the initial category of the target corpus;
A whitelist extraction sub-module, configured to extract from the corpus database a whitelist that contains the initial category and at least one intent category recognized by the NLU engines.
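The matching condition (same initial category, and at least one engine-recognized intent category) can be sketched as a filter. The whitelist entry layout `(initial_category, intent_categories)` and the category name "person" are hypothetical choices made for illustration.

```python
def extract_whitelist(whitelist, initial_category, engine_intents):
    """Keep whitelist entries that share the initial category and at least one intent.

    whitelist:      list of (initial_category, intent_categories) entries.
    engine_intents: intent categories returned by the NLU engines (None if empty).
    """
    recognized = {intent for intent in engine_intents if intent is not None}
    return [
        (initial, intents)
        for initial, intents in whitelist
        if initial == initial_category and recognized & set(intents)
    ]


stored = [
    ("person", ("sub-intent 1.2", "sub-intent 2.1")),
    ("music", ("play",)),
]
matched = extract_whitelist(stored, "person", ["sub-intent 1.2", None])
print(matched)  # [('person', ('sub-intent 1.2', 'sub-intent 2.1'))]
```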
In this embodiment of the present application, the whitelist includes multiple intent categories, and the target corpus recognition module 1203 may specifically include the following sub-modules:
An intent score determination sub-module, configured to determine the intent score corresponding to each intent category included in the whitelist, where the intent score corresponding to each intent category is obtained by testing each NLU engine on a sample corpus;
A target intent category recognition sub-module, configured to recognize the intent category with the highest intent score as the target intent category of the target corpus.
Because the apparatus embodiments are essentially similar to the method embodiments, they are described only briefly; for related details, refer to the description of the method embodiments.
Referring to FIG. 13, a schematic diagram of a terminal device according to an embodiment of the present application is shown. As shown in FIG. 13, the terminal device 1300 of this embodiment includes a processor 1310, a memory 1320, and a computer program 1321 that is stored in the memory 1320 and executable on the processor 1310. When the processor 1310 executes the computer program 1321, the steps in the foregoing embodiments of the corpus recognition method are implemented, for example steps S101 to S104 shown in FIG. 1. Alternatively, when the processor 1310 executes the computer program 1321, the functions of the modules/units in the foregoing apparatus embodiments are implemented, for example the functions of modules 1101 to 1104 shown in FIG. 11.
For example, the computer program 1321 may be divided into one or more modules/units, which are stored in the memory 1320 and executed by the processor 1310 to carry out the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 1321 in the terminal device 1300. For example, the computer program 1321 may be divided into an original corpus acquisition module, an intent category recognition module, an intent credibility determination module, and an original corpus recognition module, whose specific functions are as follows:
The original corpus acquisition module is configured to obtain an original corpus to be recognized;
The intent category recognition module is configured to recognize the original corpus by using multiple natural language understanding (NLU) engines, and to obtain the intent category corresponding to each NLU engine;
The intent credibility determination module is configured to determine the intent credibility of the original corpus according to the intent category of each NLU engine;
The original corpus recognition module is configured to recognize the original corpus according to the intent credibility.
Alternatively, the computer program 1321 may be divided into an intent category recognition module, a whitelist extraction module, and a target corpus recognition module, whose specific functions are as follows:
The intent category recognition module is configured to, when a target corpus to be recognized is received, recognize the target corpus by using multiple NLU engines, and obtain the intent category corresponding to each NLU engine;
The whitelist extraction module is configured to extract a whitelist matching the intent categories from a preset corpus database;
The target corpus recognition module is configured to recognize the corpus according to the intent categories included in the whitelist.
The terminal device 1300 may be a computing device such as a desktop computer, a notebook computer, or a palmtop computer. The terminal device 1300 may include, but is not limited to, the processor 1310 and the memory 1320. Those skilled in the art will understand that FIG. 13 is only an example of the terminal device 1300 and does not limit it; the terminal device 1300 may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal device 1300 may also include input and output devices, network access devices, buses, and so on.
The processor 1310 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or it may be any conventional processor.
The memory 1320 may be an internal storage unit of the terminal device 1300, such as a hard disk or memory of the terminal device 1300. The memory 1320 may also be an external storage device of the terminal device 1300, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the terminal device 1300. Further, the memory 1320 may include both an internal storage unit and an external storage device of the terminal device 1300. The memory 1320 is used to store the computer program 1321 and other programs and data required by the terminal device 1300. The memory 1320 may also be used to temporarily store data that has been output or is about to be output.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not detailed or recorded in one embodiment, refer to the related descriptions of the other embodiments.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed corpus recognition method, apparatus, terminal device, and medium may be implemented in other ways. For example, the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes of the foregoing method embodiments may be implemented in the present application by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor it can implement the steps of the foregoing method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to the corpus recognition apparatus, the terminal device, and the medium, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, according to legislation and patent practice, a computer-readable medium may not be an electrical carrier signal or a telecommunication signal.
Finally, it should be noted that the above are only specific implementations of the present application, and the protection scope of the present application is not limited to them. Any change or substitution within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A corpus recognition method, characterized by comprising:
    obtaining an original corpus to be recognized;
    recognizing the original corpus by using multiple natural language understanding (NLU) engines, to obtain an intent category corresponding to each NLU engine;
    determining an intent credibility of the original corpus according to the intent category of each NLU engine; and
    recognizing the original corpus according to the intent credibility.
  2. The method according to claim 1, characterized in that the recognizing the original corpus by using multiple natural language understanding (NLU) engines, to obtain an intent category corresponding to each NLU engine, comprises:
    calling processing interfaces of the multiple NLU engines;
    inputting the original corpus into the processing interface of each NLU engine, to instruct each NLU engine to recognize the original corpus; and
    receiving the intent category output by each NLU engine.
  3. The method according to claim 1, characterized in that the determining an intent credibility of the original corpus according to the intent category of each NLU engine comprises:
    determining an intent score corresponding to the intent category of each NLU engine, wherein the intent score corresponding to each intent category is obtained by testing each NLU engine on a sample corpus; and
    calculating the intent credibility of the original corpus according to each intent category and its corresponding intent score.
  4. The method according to claim 3, characterized in that the calculating the intent credibility of the original corpus according to each intent category and its corresponding intent score comprises:
    determining a weight value of each intent category; and
    performing, by using the weight values, a weighted summation of the intent scores corresponding to the intent categories, to obtain the intent credibility of the original corpus.
  5. The method according to claim 1, characterized in that the recognizing the original corpus according to the intent credibility comprises:
    if the intent credibility is greater than or equal to a preset credibility threshold, recognizing the original corpus as a valid corpus; and
    if the intent credibility is less than the credibility threshold, recognizing the original corpus as an invalid corpus.
  6. The method according to claim 5, characterized in that, after the original corpus is recognized as an invalid corpus, the method further comprises:
    judging whether the multiple intent categories corresponding to the invalid corpus are all empty;
    if the multiple intent categories corresponding to the invalid corpus are all empty, deleting the invalid corpus; and
    if at least one of the multiple intent categories corresponding to the invalid corpus is not empty, dividing the invalid corpus into multiple corpus classes according to the intent categories, and using the multiple NLU engines again to recognize the invalid corpus in each corpus class; and if the intent category recognized by each NLU engine remains unchanged, recognizing the invalid corpus in the corpus class as a valid corpus.
  7. The method according to claim 5 or 6, characterized in that, after the original corpus is recognized as a valid corpus, the method further comprises:
    obtaining an initial category of the valid corpus; and
    storing the valid corpus, the initial category of the valid corpus, and the intent category recognized by each NLU engine in a corpus database in association with one another.
  8. The method according to claim 7, characterized by further comprising:
    dividing multiple stored valid corpora into multiple recognition classes according to the initial categories and intent categories of the multiple valid corpora;
    counting the number of valid corpora contained in each recognition class; and
    generating a whitelist of the corpus database according to the number of valid corpora contained in each recognition class.
  9. The method according to claim 8, characterized in that the dividing multiple stored valid corpora into multiple recognition classes according to the initial categories and intent categories of the multiple valid corpora comprises:
    placing valid corpora whose initial categories and intent categories are all the same into the same recognition class.
  10. The method according to claim 8, characterized in that the generating a whitelist of the corpus database according to the number of valid corpora contained in each recognition class comprises:
    ranking the recognition classes according to the number of valid corpora contained in each recognition class; and
    extracting the recognition classes that fall within a preset ranking interval as the whitelist of the corpus database.
  11. A corpus recognition method, characterized by comprising:
    when a target corpus to be recognized is received, recognizing the target corpus by using multiple NLU engines, to obtain an intent category corresponding to each NLU engine;
    extracting a whitelist matching the intent categories from a preset corpus database; and
    recognizing the target corpus according to the intent categories included in the whitelist.
  12. The method according to claim 11, characterized in that the whitelist is generated by the following steps:
    dividing multiple valid corpora stored in the corpus database into multiple recognition classes according to the initial categories and intent categories of the multiple valid corpora;
    counting the number of valid corpora contained in each recognition class; and
    generating the whitelist of the corpus database according to the number of valid corpora contained in each recognition class.
  13. The method according to claim 12, characterized in that the dividing multiple valid corpora stored in the corpus database into multiple recognition classes according to the initial categories and intent categories of the multiple valid corpora comprises:
    placing valid corpora whose initial categories and intent categories are all the same into the same recognition class.
  14. The method according to claim 12, characterized in that the generating the whitelist of the corpus database according to the number of valid corpora contained in each recognition class comprises:
    ranking the recognition classes according to the number of valid corpora contained in each recognition class; and
    extracting the recognition classes that fall within a preset ranking interval as the whitelist of the corpus database.
  15. The method according to any one of claims 11 to 14, characterized in that the extracting a whitelist matching the intent categories from a preset corpus database comprises:
    obtaining an initial category of the target corpus; and
    extracting from the corpus database a whitelist that contains the initial category and at least one intent category recognized by the NLU engines.
  16. The method according to claim 15, characterized in that the whitelist includes multiple intent categories, and the recognizing the corpus according to the intent categories included in the whitelist comprises:
    determining an intent score corresponding to each intent category included in the whitelist, wherein the intent score corresponding to each intent category is obtained by testing each NLU engine on a sample corpus; and
    recognizing the intent category with the highest intent score as the target intent category of the target corpus.
  17. A corpus recognition device, comprising:
    an original corpus acquisition module, configured to acquire an original corpus to be recognized;
    an intent category recognition module, configured to recognize the original corpus by using multiple natural language understanding (NLU) engines, to obtain an intent category corresponding to each NLU engine;
    an intent credibility determination module, configured to determine an intent credibility of the original corpus according to the intent category of each NLU engine; and
    an original corpus recognition module, configured to recognize the original corpus according to the intent credibility.
  18. A corpus recognition device, comprising:
    an intent category recognition module, configured to, upon receiving a target corpus to be recognized, recognize the target corpus by using multiple NLU engines, to obtain an intent category corresponding to each NLU engine;
    a whitelist extraction module, configured to extract, from a preset corpus, a whitelist matching the intent category; and
    a target corpus recognition module, configured to recognize the corpus according to the intent categories contained in the whitelist.
  19. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the corpus recognition method according to any one of claims 1 to 16.
  20. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the corpus recognition method according to any one of claims 1 to 16.
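
Illustrative sketch (not part of the claims): the whitelist steps recited in claims 13 to 16 could be prototyped roughly as follows. Everything named here is hypothetical and chosen only for illustration: the `ValidCorpus` record, the assumption that each valid corpus stores one intent label per NLU engine, the modelling of the "preset sorting interval" as a slice of ranks, and the sample scores. The claims do not prescribe any of these details.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Optional, Sequence, Tuple


class ValidCorpus(NamedTuple):
    """Hypothetical record for one stored valid corpus."""
    text: str
    initial_category: str
    intent_categories: Tuple[str, ...]  # assumed: one label per NLU engine


# A recognition class is keyed by (initial category, intent categories).
ClassKey = Tuple[str, Tuple[str, ...]]


def build_whitelist(corpora: Sequence[ValidCorpus],
                    rank_range: Tuple[int, int] = (0, 2)) -> List[ClassKey]:
    """Claims 13-14: group corpora with identical initial and intent
    categories, rank the groups by size, keep those inside a rank slice."""
    counts = Counter((c.initial_category, c.intent_categories) for c in corpora)
    ranked = [key for key, _ in counts.most_common()]  # largest class first
    start, stop = rank_range
    return ranked[start:stop]


def match_whitelist(whitelist: Sequence[ClassKey],
                    initial_category: str,
                    engine_intents: Sequence[str]) -> List[ClassKey]:
    """Claim 15: keep entries that contain the target's initial category
    and at least one intent category returned by the NLU engines."""
    recognized = set(engine_intents)
    return [(init, intents) for init, intents in whitelist
            if init == initial_category and recognized & set(intents)]


def pick_target_intent(entries: Sequence[ClassKey],
                       intent_scores: Dict[str, float]) -> Optional[str]:
    """Claim 16: among the intent categories in the matched entries,
    return the one with the highest pre-computed intent score."""
    candidates = {intent for _, intents in entries for intent in intents}
    if not candidates:
        return None
    return max(candidates, key=lambda i: intent_scores.get(i, 0.0))


corpora = [
    ValidCorpus("play the radio", "music", ("play_music", "play_radio")),
    ValidCorpus("play jazz radio", "music", ("play_music", "play_radio")),
    ValidCorpus("tune in to FM", "music", ("play_music", "play_radio")),
    ValidCorpus("play some jazz", "music", ("play_music", "play_music")),
    ValidCorpus("put on a song", "music", ("play_music", "play_music")),
    ValidCorpus("what's the weather", "weather", ("query_weather", "query_weather")),
]
# Assumed scores, standing in for results of testing engines on sample corpora.
scores = {"play_music": 0.7, "play_radio": 0.9}

whitelist = build_whitelist(corpora, rank_range=(0, 2))
matched = match_whitelist(whitelist, "music", ["play_radio", "open_app"])
target = pick_target_intent(matched, scores)
print(whitelist, matched, target, sep="\n")
```

In this sketch the smallest recognition class (the single weather query) falls outside the rank slice and is excluded from the whitelist; a real system would choose the interval and the score source according to its own data.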
PCT/CN2020/124764 2019-12-18 2020-10-29 Corpus identification method, device, terminal apparatus, and medium WO2021120876A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911307187.9A CN111178055B (en) 2019-12-18 2019-12-18 Corpus identification method, apparatus, terminal device and medium
CN201911307187.9 2019-12-18

Publications (1)

Publication Number Publication Date
WO2021120876A1 (en) 2021-06-24

Family

ID=70655565

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/124764 WO2021120876A1 (en) 2019-12-18 2020-10-29 Corpus identification method, device, terminal apparatus, and medium

Country Status (2)

Country Link
CN (1) CN111178055B (en)
WO (1) WO2021120876A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178055B (en) * 2019-12-18 2022-07-29 华为技术有限公司 Corpus identification method, apparatus, terminal device and medium
CN111611366B (en) * 2020-05-20 2023-08-11 北京百度网讯科技有限公司 Method, device, equipment and storage medium for optimizing intention recognition

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160196499A1 (en) * 2015-01-07 2016-07-07 Microsoft Technology Licensing, Llc Managing user interaction for input understanding determinations
CN107170446A (en) * 2017-05-19 2017-09-15 深圳市优必选科技有限公司 Semantic processes server and the method for semantic processes
CN109671421A (en) * 2018-12-25 2019-04-23 苏州思必驰信息科技有限公司 The customization and implementation method navigated offline and device
CN110147551A (en) * 2019-05-14 2019-08-20 腾讯科技(深圳)有限公司 Multi-class entity recognition model training, entity recognition method, server and terminal
CN111178055A (en) * 2019-12-18 2020-05-19 华为技术有限公司 Corpus identification method, apparatus, terminal device and medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020103837A1 (en) * 2001-01-31 2002-08-01 International Business Machines Corporation Method for handling requests for information in a natural language understanding system
KR101932264B1 (en) * 2018-03-02 2018-12-26 주식회사 머니브레인 Method, interactive ai agent system and computer readable recoding medium for providing intent determination based on analysis of a plurality of same type entity information
CN109033378A (en) * 2018-07-27 2018-12-18 北京中关村科金技术有限公司 A kind of application method of Zero-shot Learning in intelligent customer service system
CN109284392B (en) * 2018-12-07 2021-04-06 达闼机器人有限公司 Text classification method, device, terminal and storage medium
CN110147445A (en) * 2019-04-09 2019-08-20 平安科技(深圳)有限公司 Intension recognizing method, device, equipment and storage medium based on text classification
CN110321437B (en) * 2019-05-27 2024-03-15 腾讯科技(深圳)有限公司 Corpus data processing method and device, electronic equipment and medium

Also Published As

Publication number Publication date
CN111178055A (en) 2020-05-19
CN111178055B (en) 2022-07-29

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20901428; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 20901428; Country of ref document: EP; Kind code of ref document: A1)