CN111428019A - Data processing method and equipment for knowledge base question answering - Google Patents

Data processing method and equipment for knowledge base question answering

Info

Publication number
CN111428019A
Authority
CN
China
Prior art keywords
user
knowledge
knowledge base
utterances
descriptions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010255287.8A
Other languages
Chinese (zh)
Other versions
CN111428019B (en
Inventor
谷博
雷欣
李志飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mobvoi Information Technology Co Ltd
Original Assignee
Mobvoi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mobvoi Information Technology Co Ltd filed Critical Mobvoi Information Technology Co Ltd
Priority to CN202010255287.8A priority Critical patent/CN111428019B/en
Publication of CN111428019A publication Critical patent/CN111428019A/en
Application granted granted Critical
Publication of CN111428019B publication Critical patent/CN111428019B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a data processing method and device for knowledge base question answering. The data processing method comprises the following steps: acquiring any knowledge item from a knowledge base; selecting, from the dialogue records, user utterances matched with the knowledge item to form a set of user utterances; associating the set of user utterances with the knowledge item; and training the knowledge base question-answering model by taking the associated set of user utterances and the knowledge item as training samples, so that subsequently input user utterances are answered according to the training result. The data processing method improves the timeliness of model optimization based on real online data and ensures the best model performance; it makes the work of operators more convenient and more efficient; and it speeds up the discovery of defects in the knowledge items, promoting continuous improvement of the knowledge base.

Description

Data processing method and equipment for knowledge base question answering
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method and device for knowledge base question answering.
Background
Question-answering systems have evolved from template-based expert systems, to question answering based on information retrieval, then to community-based question answering, and finally to the current knowledge-base-based question answering. Question-answering algorithms based on information retrieval combine information extraction and shallow semantic analysis on top of keyword matching. Community-based question answering depends on contributions from internet users, and its answering process relies on keyword retrieval. Knowledge-base-based question answering relies on semantic analysis and a knowledge base: a knowledge base question-answering model performs semantic analysis on the question input by the user and selects from the knowledge base the knowledge items that match it. Existing optimization of knowledge base question-answering models usually has to be carried out offline, and operators cannot adjust and optimize the model online in real time. Online labeling for knowledge base question answering is not sufficiently automated, and the large amount of real online data is not effectively screened, clustered, and recommended, so the labeling work of operators is inefficient, heavy, and highly repetitive. In addition, the many user utterances produced by online users are not used effectively by the model.
Disclosure of Invention
To solve or at least alleviate at least one of the above technical problems, the present disclosure provides a data processing method and device for knowledge base question answering.
According to one aspect of the present disclosure, a data processing method for knowledge base question answering is provided, the data processing method comprising:
acquiring any knowledge item from a knowledge base;
selecting, from the dialogue records, user utterances matched with the knowledge item to form a set of user utterances;
associating the set of user utterances with the knowledge item; and
training a knowledge base question-answering model by taking the associated set of user utterances and the knowledge item as training samples, so that subsequently input user utterances are answered according to the training result.
According to at least one embodiment of the present disclosure, the selecting, from the dialogue records, user utterances matched with the knowledge item to form a set of user utterances includes:
if the knowledge item was provided to the user by the knowledge base question-answering model as an approximate answer and the user replied to it or clicked to select it, setting the corresponding user utterance in the dialogue record to level A;
if the knowledge item was provided to the user by the knowledge base question-answering model as an approximate answer and the user neither replied to it nor clicked to select it, setting the corresponding user utterance in the dialogue record to level B;
if the knowledge item was not provided to the user as the best answer or an approximate answer by the knowledge base question-answering model, but its confidence is greater than or equal to a preset value, setting the corresponding user utterance in the dialogue record to level C; and
sorting and de-duplicating the user utterances in the priority order level A > level B > level C to form the set of user utterances.
According to another aspect of the present disclosure, a data processing method for knowledge base question answering is provided, the data processing method comprising:
clustering the user utterances in the dialogue records to form sets of at least one class of user utterances;
for the set of each class of user utterances, selecting from the knowledge base a set of knowledge items matched with that set of user utterances;
associating the set of that class of user utterances with one knowledge item in the set of knowledge items; and
training a knowledge base question-answering model by taking the associated set of user utterances and the one knowledge item as a training sample, so that subsequently input user utterances are answered according to the training result.
According to at least one embodiment of the present disclosure, the clustering the user utterances in the dialogue records to form sets of at least one class of user utterances includes:
in the dialogue records, grouping into one class the user utterances for which the feedback of the knowledge base question-answering model was an approximate answer or no answer; or, in the dialogue records, grouping into one class the user utterances whose confidence given by the knowledge base question-answering model is smaller than the preset value.
According to at least one embodiment of the present disclosure, the clustering the user utterances in the dialogue records to form sets of at least one class of user utterances includes:
sorting the sets of at least one class of user utterances obtained by clustering.
According to at least one embodiment of the present disclosure, the sorting the sets of at least one class of user utterances obtained by clustering includes:
sorting the sets of at least one class of user utterances obtained by clustering in descending order of the number of times asked; the number of times asked is the total number of user utterances in the set of each class, counted without de-duplication.
According to at least one embodiment of the present disclosure, the sorting the sets of at least one class of user utterances obtained by clustering includes:
sorting the sets of at least one class of user utterances that have the same number of times asked in ascending order of the number of clustered questions; the number of clustered questions is the total number of user utterances in the set of each class after de-duplication.
According to at least one embodiment of the present disclosure, the sorting the sets of at least one class of user utterances obtained by clustering includes:
sorting the sets of at least one class of user utterances that have the same number of clustered questions by time, from most recent to least recent.
According to at least one embodiment of the present disclosure, the selecting, for the set of each class of user utterances, a set of knowledge items from the knowledge base matched with that set of user utterances includes:
matching the knowledge items in the knowledge base one by one against the user utterances in the set of each class of user utterances;
selecting the knowledge items whose confidence given by the knowledge base question-answering model is greater than or equal to a preset value to form a set of knowledge items; and
within the set of knowledge items, sorting the knowledge items in descending order of their cumulative number of occurrences and de-duplicating them.
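The three steps above (match one by one, keep matches at or above a preset confidence, then order by cumulative occurrences and de-duplicate) could be sketched as follows. This is only an illustration under stated assumptions: the `score` callable stands in for the confidence given by the knowledge base question-answering model, and the names `rank_knowledge_items`, `score`, and the default preset of 0.5 are hypothetical, not taken from the disclosure.

```python
from collections import Counter
from typing import Callable, Iterable, List


def rank_knowledge_items(
    utterances: Iterable[str],
    item_ids: Iterable[str],
    score: Callable[[str, str], float],  # hypothetical stand-in for the model's confidence
    preset: float = 0.5,                 # assumed preset value
) -> List[str]:
    """Return knowledge item ids ordered by how often they match the class."""
    item_ids = list(item_ids)
    counts: Counter = Counter()
    # Match every utterance of the class against every knowledge item.
    for utterance in utterances:
        for item_id in item_ids:
            # Keep only matches whose confidence reaches the preset value.
            if score(utterance, item_id) >= preset:
                counts[item_id] += 1
    # Counting per id already de-duplicates; most_common() sorts by
    # cumulative occurrences in descending order.
    return [item_id for item_id, _ in counts.most_common()]
```

With a toy score table in which item `k2` matches two utterances and `k1` matches one, the function returns `["k2", "k1"]`.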
According to another aspect of the disclosure, a processing device for knowledge base question answering is provided, the device comprising:
a memory storing execution instructions; and
a processor executing the execution instructions stored in the memory, so that the processor performs any of the methods described above.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
FIG. 1 is a schematic flow chart diagram of one exemplary embodiment of the data processing method for knowledge base question answering according to the present disclosure.
FIG. 2 is a schematic flow chart diagram illustrating another exemplary embodiment of the data processing method for knowledge base question answering according to the present disclosure.
FIG. 3 is a flow diagram illustrating another exemplary embodiment of the data processing method for knowledge base question answering according to the present disclosure.
FIG. 4 is a block diagram of an exemplary embodiment of a data processing device for knowledge base question answering according to the present disclosure.
Detailed Description
The present disclosure will be described in further detail with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the present disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
A knowledge base question-answering system comprises a knowledge base question-answering model and an established knowledge base. The knowledge base comprises a plurality of knowledge items, and a knowledge item is the smallest unit of which the knowledge base is composed. When the knowledge base question-answering model receives a user utterance (a question asked by the user), it performs similarity calculation through a semantic model, obtains an answer from the knowledge base, and feeds it back to the user, usually in the form of FAQ-style question answering. The knowledge base question-answering system can be implemented in various forms, for example as an intelligent dialogue robot. During online use of the system, a user utterance and the knowledge items fed back for it form a corresponding dialogue record.
In one application scenario, a user inputs a user utterance into the knowledge base question-answering model, and the model finds the matched knowledge items in the knowledge base according to the user utterance and feeds them back. The model may feed back more than one knowledge item for the same user utterance. For some user utterances, no knowledge item with a suitable degree of matching can be found in the knowledge base, so no answer can be obtained; in this case the model feeds back a response such as "answer cannot be provided". When the model matches the user utterance against each knowledge item in the knowledge base, it gives a confidence for each knowledge item. The confidence is the degree to which the model, after evaluating a user utterance against a knowledge item, believes that the two match. According to the confidence of each knowledge item, the model feeds back the best answer, approximate answers, or no answer to the user. The best answer means: when a user initiates a conversation, the model can obtain a knowledge item whose confidence is the highest and is above a certain specified value to answer the question, and that knowledge item is provided to the user as the best answer. The approximate answer means: when a user initiates a conversation, the model obtains a plurality of knowledge items whose confidences fall within a specified range to answer the question, and those knowledge items are provided to the user as approximate answers.
No answer means: when a user initiates a conversation, the model cannot obtain any knowledge item whose confidence falls within the specified range to answer the question, and it feeds back "no answer" or similar wording.
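The choice between best answer, approximate answers, and no answer can be illustrated with a short sketch. The disclosure speaks only of "a certain specified value" and "a specified range", so the two thresholds below, the function name `choose_feedback`, and the string labels are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical thresholds: the text gives no concrete numbers.
BEST_THRESHOLD = 0.9   # assumed "specified value" for a best answer
APPROX_LOWER = 0.6     # assumed lower bound of the "specified range"


def choose_feedback(confidences: Dict[str, float]) -> Tuple[str, List[str]]:
    """Map per-item confidences for one utterance to a feedback type."""
    if not confidences:
        return "no_answer", []
    best_id = max(confidences, key=lambda k: confidences[k])
    # Best answer: the highest-confidence item, if above the specified value.
    if confidences[best_id] > BEST_THRESHOLD:
        return "best", [best_id]
    # Approximate answers: every item whose confidence lies in the range,
    # listed from highest to lowest confidence.
    approx = sorted(
        (k for k, c in confidences.items() if APPROX_LOWER <= c <= BEST_THRESHOLD),
        key=lambda k: confidences[k],
        reverse=True,
    )
    if approx:
        return "approximate", approx
    # Otherwise nothing is confident enough to show.
    return "no_answer", []
```

For example, `choose_feedback({"k1": 0.8, "k2": 0.7, "k3": 0.2})` yields the approximate answers `k1` and `k2`, while a single item at 0.3 yields no answer.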
During online operation of the knowledge base question-answering system, a large number of dialogue records are generated and stored in the system as a data set, but the prior art does not use these online dialogue records to optimize the model. That is, existing knowledge base question-answering models do not make effective use of the many user utterances of online users, and online model optimization is not possible. Optimization of existing models usually has to be carried out offline and cannot support real-time online model optimization by operators.
According to one aspect of the present disclosure, FIG. 1 shows a schematic flow chart of an exemplary embodiment of the data processing method for knowledge base question answering. The data processing method processes data generated during knowledge base question answering (such as dialogue records) so as to optimize the knowledge base question-answering model online. The data processing method comprises the following steps:
S10, acquiring any knowledge item from the knowledge base. For example, the system automatically selects the knowledge items of a certain knowledge field from the knowledge base and processes them one by one with the data processing method of the present disclosure. Alternatively, the system may use the confidences in the dialogue records to select knowledge items with lower confidence and process them one by one.
S20, selecting, from the dialogue records, the user utterances matched with the knowledge item to form a set of user utterances. Each dialogue record contains a user utterance. If the knowledge base question-answering model found knowledge items in the knowledge base whose confidence met the requirement, the dialogue record also contains those knowledge items; if it did not, the dialogue record contains no knowledge item. From all the dialogue records, the user utterances of those records that contain the knowledge item selected in step S10 are picked out, and together they form the set of user utterances.
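Step S20 amounts to filtering the dialogue records by the knowledge item chosen in S10. A minimal sketch, assuming a dialogue record can be modeled as an utterance plus the ids of the knowledge items that were fed back (the `DialogueRecord` type and its field names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class DialogueRecord:
    """Hypothetical record layout: one utterance and the items fed back for it."""
    utterance: str
    knowledge_item_ids: List[str] = field(default_factory=list)  # empty when no item qualified


def select_utterances(records: Iterable[DialogueRecord], item_id: str) -> List[str]:
    """Collect the utterances of all records that contain the given knowledge item."""
    return [r.utterance for r in records if item_id in r.knowledge_item_ids]
```

Records with an empty `knowledge_item_ids` list, which correspond to dialogue records containing no knowledge item, are never selected.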
S30, associating the set of user utterances with the knowledge item. Those skilled in the art will appreciate that a user utterance found to match the knowledge item in the dialogue records generated during system operation is related to that knowledge item; otherwise the knowledge base question-answering model would not have fed the knowledge item back for that user utterance. The set of user utterances matched with the knowledge item is selected automatically through the above steps, and the knowledge base question-answering model associates the set with the knowledge item. During association, the model may associate all user utterances in the set with the knowledge item, or only a part of them; the selected part may be the user utterances with a clearer conversational intention and more standard wording, or those that occur more frequently.
S40, training the knowledge base question-answering model by taking the associated set of user utterances and the knowledge item as training samples, so that subsequently input user utterances are answered according to the training result. After a set of user utterances related to the knowledge item has been selected from a large number of dialogue records and associated with it, the set is input into the model as a training sample. Once the model has been trained, the user utterances in the sample and the knowledge item in the sample are associated, so that the next time a user inputs a user utterance that is the same as or similar to one in the sample, the model can directly feed back the associated knowledge item as the answer. This achieves online optimization of the knowledge base question-answering model. An input user utterance can be considered similar to a user utterance in the sample when the two have more than one keyword in common.
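The keyword-based notion of similarity mentioned at the end of S40 might look like the sketch below. The disclosure does not say how keywords are extracted, so whitespace tokenization and the small stop-word list are assumptions, and "more than one" is read literally as at least two shared keywords (exposed as the `min_shared` parameter in case a different reading is intended).

```python
from typing import Set

# Hypothetical stop-word list; keyword extraction is not specified in the text.
STOP_WORDS = {"a", "the", "to", "my", "how", "do", "i", "please"}


def keywords(utterance: str) -> Set[str]:
    """Naive keyword extraction: lowercase whitespace tokens minus stop words."""
    return {w for w in utterance.lower().split() if w not in STOP_WORDS}


def is_similar(input_utterance: str, sample_utterance: str, min_shared: int = 2) -> bool:
    """True when the two utterances share at least `min_shared` keywords."""
    shared = keywords(input_utterance) & keywords(sample_utterance)
    return len(shared) >= min_shared
```

Under these assumptions, "how to reset my password" and "reset password please" count as similar because they share the keywords "reset" and "password".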
The data processing method uses the user dialogue records generated in the production environment (that is, during online operation of the knowledge base question-answering system). The system automatically screens out the user utterances matched with the knowledge item to be optimized to form a set of user utterances, associates the set with the knowledge item, automatically imports the associated set and knowledge item into the knowledge base question-answering model, and trains the model online, thereby optimizing the model through online learning. This solves the prior-art problem that the many user utterances of online users are not used effectively by the model. At the same time, because the system automatically screens and forms the set of user utterances from a large number of online dialogue records, the low efficiency, heavy workload, and high repetitiveness of manual selection are avoided.
In an embodiment of the present disclosure, step S20 of selecting, from the dialogue records, the user utterances matched with the knowledge item to form a set of user utterances may include:
If the knowledge item was provided to the user by the knowledge base question-answering model as an approximate answer and the user replied to it or clicked to select it, the corresponding user utterance in the dialogue record is set to level A. That the knowledge item was provided as an approximate answer and was then chosen by the user's reply or click shows that it matches the corresponding user utterance well and with high confidence, so the priority level of that user utterance is set to A.
If the knowledge item was provided to the user by the knowledge base question-answering model as an approximate answer and the user neither replied to it nor clicked to select it, the corresponding user utterance in the dialogue record is set to level B. Although the knowledge item was provided as an approximate answer, it was not chosen by the user, which indicates that it matches the corresponding user utterance to some degree but with lower confidence than level A, so the priority level is set to B.
If the knowledge item was not provided to the user as the best answer or an approximate answer by the knowledge base question-answering model, but the confidence of its match with the user utterance is greater than or equal to a preset value (for example, the preset value can be set to 0.5), the corresponding user utterance in the dialogue record is set to level C. That is, a selected user utterance must be one whose confidence of matching the knowledge item in the dialogue record is greater than or equal to the preset value; user utterances whose confidence is less than the preset value are not added to the set of user utterances, so as to avoid introducing noise that does not match the knowledge item.
The screened user utterances are sorted in the priority order level A, level B, level C and de-duplicated to form the set of user utterances. That is, in the resulting set, the user utterances with priority level A are ranked first, followed by those with level B, and those with level C are ranked last.
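The level assignment and the final sort-and-deduplicate step can be sketched as follows. The `Match` type, its field names, and the `"none"` label for items that were not shown to the user are hypothetical; the preset of 0.5 is the example value given in the text. The text assigns no level to utterances answered with a best answer, so this sketch leaves them out, which is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

PRESET_CONFIDENCE = 0.5  # example preset value from the text


@dataclass
class Match:
    """Hypothetical view of one dialogue record relative to the chosen knowledge item."""
    utterance: str
    answer_type: str   # "best", "approximate", or "none" (item not shown to the user)
    selected: bool     # user replied to or clicked the item
    confidence: float  # model confidence for this utterance/item pair


def level_of(m: Match) -> Optional[str]:
    """Return "A", "B", "C", or None when the utterance is not selected."""
    if m.answer_type == "approximate":
        return "A" if m.selected else "B"
    if m.answer_type == "none" and m.confidence >= PRESET_CONFIDENCE:
        return "C"
    return None  # best answers and low-confidence matches are skipped (assumption)


def build_utterance_set(matches: Iterable[Match]) -> List[str]:
    """Order utterances A > B > C and drop repeats, keeping the highest level."""
    ranked = []
    for m in matches:
        level = level_of(m)
        if level is not None:
            ranked.append((level, m.utterance))
    ranked.sort(key=lambda pair: pair[0])  # stable sort; "A" < "B" < "C"
    seen, result = set(), []
    for _, utterance in ranked:
        if utterance not in seen:
            seen.add(utterance)
            result.append(utterance)
    return result
```

Because de-duplication runs after sorting, an utterance that appears at both level A and level B keeps its level A position.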
The above embodiments are applicable to the initial stage after the knowledge base question-answering system goes online, that is, when the system has run online for a short time, a certain number of dialogue records have accumulated, and the knowledge base question-answering model needs to be optimized to improve system performance.
According to another aspect of the present disclosure, FIG. 2 shows a schematic flow chart of another exemplary embodiment of the data processing method for knowledge base question answering. The data processing method processes data generated during knowledge base question answering (such as dialogue records) so as to optimize the knowledge base question-answering model online. The data processing method comprises the following steps:
S110, clustering the user utterances in the dialogue records to form sets of at least one class of user utterances. As the knowledge base question-answering system runs online for longer, a large number of similar user utterances about the same question accumulate in the system. The knowledge base question-answering model automatically clusters the user utterances in the dialogue records according to their similarity: the similar user utterances related to one question are grouped into one class, those related to another question are grouped into another class, and sets of different classes of user utterances are formed. There may be only one class, or two, three, or more different classes of user utterances.
S220, for the set of each class of user utterances, the knowledge base question-answering model selects from the knowledge base a set of knowledge items matched with that set of user utterances. That is, a set of user utterances and a set of knowledge items are formed, and each knowledge item in the set of knowledge items matches each user utterance in the set of user utterances to some degree.
S330, associating the set of that class of user utterances with one knowledge item in the set of knowledge items. Although each knowledge item in the set of knowledge items matches the user utterances in the set of user utterances to some degree, the knowledge base question-answering model selects only one knowledge item from the set of knowledge items for association, in order to optimize the model while avoiding the introduction of unnecessary noise.
S440, training the knowledge base question-answering model by taking the associated set of that class of user utterances and the one knowledge item as a training sample, so that subsequently input user utterances are answered according to the training result. User utterances are selected from a large number of dialogue records and clustered; for each resulting set of user utterances, the matched knowledge items are screened and one is associated with the set; and the result is input into the model as a training sample. After training, the model has formed an association between the set of user utterances in the sample and the knowledge item in the sample, so that the next time a user inputs a user utterance that is the same as or similar to one in a class of user utterances in the sample, the model can directly feed back the associated knowledge item as the answer. This achieves online optimization of the knowledge base question-answering model. An input user utterance can be considered similar to a user utterance in the sample when the two have more than one keyword in common.
In an embodiment of the present disclosure, step S110 of clustering the user utterances in the dialogue records to form sets of at least one class of user utterances may include:
In the dialogue records, the user utterances for which the feedback of the knowledge base question-answering model was an approximate answer or no answer are grouped into one class. Alternatively, in the dialogue records, the user utterances whose confidence given by the knowledge base question-answering model is smaller than the preset value are grouped into one class. Both clustering modes screen out and cluster the user utterances in the dialogue records whose initial matching was poor, so that they can be matched with better knowledge items in the subsequent steps to optimize the model.
That is, either of these two clustering modes can be adopted, and one of them can be chosen according to the production environment.
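The two screening modes can be illustrated with two small filters. The dictionary keys (`utterance`, `answer_type`, `confidence`), the label strings, and the default preset of 0.5 are assumptions for the example; the grouping of the screened utterances by similarity is not shown.

```python
from typing import Dict, Iterable, List


def candidates_by_feedback(records: Iterable[Dict]) -> List[str]:
    """Mode 1: utterances whose feedback was an approximate answer or no answer."""
    return [r["utterance"] for r in records if r["answer_type"] in ("approximate", "none")]


def candidates_by_confidence(records: Iterable[Dict], preset: float = 0.5) -> List[str]:
    """Mode 2: utterances whose model confidence is below the preset value."""
    return [r["utterance"] for r in records if r["confidence"] < preset]
```

Either function would be applied to the dialogue records first, and the returned utterances would then be grouped by similarity into classes.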
In an embodiment of the present disclosure, the step S110 of clustering the user descriptions in the dialog record to form a set of at least one type of user descriptions may include:
and S111, sequencing the set of at least one type of user descriptions obtained by clustering. Referring to fig. 3, a schematic flow chart diagram of another exemplary embodiment of the data processing method for knowledge base question answering according to the present disclosure is shown. If the clusters form a set of more than one category of user utterances, the sets of user utterances of each category are sorted according to a certain rule to facilitate subsequent further processing of the data.
In an embodiment of the present disclosure, the step S111 of sorting the set of at least one class of user utterances obtained by clustering may include:
sorting the set of at least one type of user descriptions obtained by clustering in a descending order according to the number of questioning times; the number of questioning times is the total number of unrepeated user utterances in the set of each type of user utterances. If the clustering obtains more than two sets of user's utterances, the set with the larger number of user's utterances included therein is arranged in the front, and the set with the smaller number of user's utterances included therein is arranged in the rear.
In an embodiment of the present disclosure, the step S111 of sorting the set of at least one class of user utterances obtained by clustering may include:
sorting the sets of at least one class of user utterances that have the same question count in ascending order of cluster question count; the cluster question count is the total number of user utterances in each set after de-duplication. If two sets contain the same number of user utterances, their de-duplicated counts are compared: the set with the smaller cluster question count (that is, the set with more repeated user utterances) is ranked first, and the set with the larger cluster question count (that is, the set with fewer repeated user utterances) is ranked after it.
In an embodiment of the present disclosure, the step S111 of sorting the set of at least one class of user utterances obtained by clustering may include:
sorting the sets of at least one class of user utterances that have the same cluster question count in chronological order from most recent to earliest. If two sets contain the same number of user utterances and the same number of repeated user utterances, the set closer to the current time is ranked ahead and the set farther from the current time is ranked behind. The time referred to here is the time at which a user utterance was input to the knowledge base question-answer model, that is, the time at which the user asked the question. In other words, it is determined which of the two sets contains the user utterance closest to the current time, and that set is ranked ahead.
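The three sorting rules above can be combined into a single sort key, as in the following sketch. The `UtteranceSet` structure and its field names are hypothetical, and each set is assumed to be non-empty; the disclosure does not prescribe any particular data layout.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class UtteranceSet:
    """One cluster of similar user utterances (hypothetical structure)."""
    utterances: List[str]     # every utterance, repeats included
    asked_at: List[datetime]  # time each utterance was put to the model


def sort_utterance_sets(sets: List[UtteranceSet]) -> List[UtteranceSet]:
    """Sort by question count (descending), then cluster question count
    (ascending), then most recent question time (newest first)."""
    def key(s: UtteranceSet):
        question_count = len(s.utterances)               # before de-duplication
        cluster_question_count = len(set(s.utterances))  # after de-duplication
        latest = max(s.asked_at)                         # assumes a non-empty set
        return (-question_count, cluster_question_count, -latest.timestamp())
    return sorted(sets, key=key)
```

Negating the question count and the timestamp lets one ascending sort express all three rules at once, so later rules only break ties left by earlier ones.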
In one embodiment of the present disclosure, the step S220 of selecting, for each class of user utterances, a set of knowledge items from the knowledge base that matches the set of the class of user utterances may include:
matching the knowledge items in the knowledge base one by one with the user utterances in the set of each class of user utterances. Each set comprises a plurality of user utterances, and for each user utterance the knowledge base is searched for matching knowledge items (more than one knowledge item may match, or none may match). That is, every user utterance in each set is traversed and matched against the knowledge items in the knowledge base. This step thus re-matches each user utterance in the knowledge base in order to screen out better-matching knowledge items for association. A better match here means that the confidence between the user utterance and the knowledge item after re-matching in this step is higher than the confidence between the user utterance and the knowledge item initially matched in the dialogue record.
selecting the knowledge items whose confidence, as given by the knowledge base question-answer model, is greater than or equal to a preset value to form the set of knowledge items. During the one-by-one matching in the above step, the knowledge base question-answer model gives a confidence for each match. A confidence threshold may be preset, for example 0.5, so that only knowledge items with a matching confidence greater than or equal to 0.5 are screened into the set of knowledge items, which avoids introducing noise.
in the set of knowledge items, arranging the knowledge items in descending order of their cumulative number of occurrences and de-duplicating them. The user utterances in each set are similar to one another; some have a clear topic and purpose and can be matched with appropriate knowledge items in the knowledge base, while others are vague and cannot. If a knowledge item occurs many times in the set of knowledge items, it is likely a better match and is therefore ranked near the front; conversely, a knowledge item that occurs fewer times is ranked toward the back, which facilitates further processing of the data.
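The three sub-steps above (one-by-one matching, confidence screening, and ranking by cumulative occurrences with de-duplication) might be sketched as below. The `match` callable stands in for the knowledge base question-answer model and is a hypothetical interface; the 0.5 default mirrors the example threshold given in the text.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# Hypothetical model interface: for one utterance, return candidate
# (knowledge_item_id, confidence) pairs; the result may be empty.
Matcher = Callable[[str], Iterable[Tuple[str, float]]]


def rank_knowledge_items(utterances: Iterable[str],
                         match: Matcher,
                         threshold: float = 0.5) -> List[str]:
    """Return de-duplicated knowledge item ids, most frequently matched first."""
    counts: Counter = Counter()
    for utterance in utterances:                 # traverse every utterance
        for item_id, confidence in match(utterance):
            if confidence >= threshold:          # screen out low-confidence noise
                counts[item_id] += 1             # accumulate occurrences
    # most_common() sorts by count in descending order; the Counter keys
    # are unique, so the result is already de-duplicated.
    return [item_id for item_id, _ in counts.most_common()]
```

An empty result corresponds to the case in which no related knowledge item exists in the current knowledge base.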
In one embodiment of the present disclosure, the step S220 of selecting, for each class of user utterances, a set of knowledge items from the knowledge base that matches the set of the class of user utterances may include:
if the set of a certain class of user utterances matches no related knowledge item in the current knowledge base, that is, no answer is available, related knowledge items need to be added to the knowledge base according to the content of the user utterances, for example by network search or manual entry, so as to supplement and improve the current knowledge base. The added knowledge items are then screened into the set of knowledge items; in the subsequent steps, an added knowledge item is associated with the matching set of user utterances and input into the knowledge base question-answer model as a training sample, so that the model is trained and optimized online.
The above implementations are suitable for the middle and later stages after the knowledge base question-answering system goes online, when the knowledge base question-answer model needs to be optimized to improve system performance, that is, when the system has been online for some time and has accumulated a large number of dialogue records.
Combining the two different implementations, the data processing method for knowledge base question answering improves the timeliness of model optimization based on real online data and helps ensure that the model performs well; it makes operation more convenient for operators and improves working efficiency; and it speeds up the discovery of deficiencies in the knowledge items, promoting continuous improvement of the knowledge base.
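The no-answer case described above, followed by the association step, could look roughly like the following sketch. The `KnowledgeBase` class, the `new_item` argument (standing in for an item obtained by network search or manual entry), and the choice of the first-ranked item for association are all assumptions made for illustration; the disclosure only requires that the set of utterances be associated with one knowledge item.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class KnowledgeBase:
    """Minimal in-memory knowledge base (hypothetical)."""
    items: Dict[str, str] = field(default_factory=dict)  # item id -> answer text


def build_training_sample(utterances: List[str],
                          ranked_item_ids: List[str],
                          kb: KnowledgeBase,
                          new_item: Optional[Tuple[str, str]] = None
                          ) -> Tuple[List[str], str]:
    """Associate a set of user utterances with one knowledge item.

    If nothing in the knowledge base matched, `new_item` is added to the
    knowledge base first and then used for the association.
    """
    if not ranked_item_ids:
        if new_item is None:
            raise ValueError("no matching knowledge item and none supplied")
        item_id, answer = new_item
        kb.items[item_id] = answer        # supplement the knowledge base
        ranked_item_ids = [item_id]
    chosen = ranked_item_ids[0]           # assumed policy: take the top-ranked item
    unique_utterances = list(dict.fromkeys(utterances))  # de-duplicate, keep order
    return unique_utterances, chosen
```

The returned pair is what would be fed to the question-answer model as a training sample; in practice an operator might pick a different item from the ranked list.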
The present disclosure also provides a data processing device for knowledge base question answering; fig. 4 shows a schematic structural diagram of an exemplary embodiment of the device. The device comprises: a communication interface 1000, a memory 2000, and a processor 3000. The communication interface 1000 is used for communicating with external devices to exchange data. The memory 2000 stores a computer program that is executable on the processor 3000. The processor 3000 implements the method of the above-described embodiments when executing the computer program. There may be one or more memories 2000 and one or more processors 3000.
The memory 2000 may include a high-speed RAM memory, and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
If the communication interface 1000, the memory 2000 and the processor 3000 are implemented independently, the communication interface 1000, the memory 2000 and the processor 3000 may be connected to each other through a bus to complete communication therebetween. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not represent only one bus or one type of bus.
Optionally, in a specific implementation, if the communication interface 1000, the memory 2000, and the processor 3000 are integrated on a chip, the communication interface 1000, the memory 2000, and the processor 3000 may complete communication with each other through an internal interface.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present disclosure includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the implementations of the present disclosure. The processor performs the various methods and processes described above. For example, method embodiments in the present disclosure may be implemented as a software program tangibly embodied in a machine-readable medium, such as a memory. In some embodiments, some or all of the software program may be loaded and/or installed via memory and/or a communication interface. When the software program is loaded into memory and executed by a processor, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above by any other suitable means (e.g., by means of firmware).
The logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a "readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the readable storage medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in the memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on data information, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
In the description herein, reference to the terms "one embodiment/mode," "some embodiments/modes," "example," "specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment/mode or example is included in at least one embodiment/mode or example of the application. In this specification, such uses of the above terms do not necessarily refer to the same embodiment/mode or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments/modes or examples. In addition, the various embodiments/modes or examples described in this specification, and the features of those embodiments/modes or examples, can be combined by one skilled in the art provided they do not conflict with one another.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
It will be understood by those skilled in the art that the foregoing embodiments are merely for clarity of illustration of the disclosure and are not intended to limit the scope of the disclosure. Other variations or modifications may occur to those skilled in the art, based on the foregoing disclosure, and are still within the scope of the present disclosure.

Claims (10)

1. A data processing method for knowledge base question answering, the method comprising:
acquiring any knowledge item from a knowledge base;
selecting a user utterance matched with the knowledge item from the dialogue records to form a set of user utterances;
associating the set of user utterances with the knowledge item; and
training a knowledge base question-answer model by taking the associated set of user utterances and the knowledge item as training samples, so as to respond to subsequently input user utterances according to the training result.
2. The data processing method of claim 1, wherein said selecting user utterances in the conversation record that match said knowledge items forms a set of user utterances, comprising:
if the knowledge item was provided to the user by the knowledge base question-answer model as an approximate answer and was replied to or clicked and selected by the user, setting the corresponding user utterance in the dialogue record to level A;
if the knowledge item was provided to the user by the knowledge base question-answer model as an approximate answer and was not replied to or clicked and selected by the user, setting the corresponding user utterance in the dialogue record to level B;
if the knowledge item was not provided to the user by the knowledge base question-answer model as the best answer or an approximate answer, but its confidence is greater than or equal to a preset value, setting the corresponding user utterance in the dialogue record to level C; and
sorting and de-duplicating the user utterances in an order of priority level A > level B > level C to form a set of the user utterances.
3. A data processing method for knowledge base question answering, the method comprising:
clustering the user utterances in the dialogue record to form a set of at least one class of user utterances;
selecting, for the set of each class of user utterances, a set of knowledge items from the knowledge base that matches the set of that class of user utterances;
associating the set of that class of user utterances with one knowledge item in the set of knowledge items; and
training a knowledge base question-answer model by taking the associated set of user utterances and the one knowledge item as a training sample, so as to respond to subsequently input user utterances according to the training result.
4. The data processing method of claim 3, wherein clustering the user utterances in the conversation record to form a set of at least one class of user utterances comprises:
in the dialogue record, clustering into one class the user utterances for which the feedback content of the knowledge base question-answer model was an approximate answer or no answer; or, in the dialogue record, clustering into one class the user utterances for which the confidence given by the knowledge base question-answer model is smaller than a preset value.
5. The data processing method of claim 3, wherein clustering the user utterances in the conversation record to form a set of at least one class of user utterances comprises:
sorting the sets of at least one class of user utterances obtained by clustering.
6. The data processing method of claim 5, wherein sorting the clustered set of at least one class of user utterances comprises:
sorting the sets of at least one class of user utterances obtained by clustering in descending order of question count; the question count is the total number of user utterances in each set before de-duplication.
7. The data processing method of claim 6, wherein sorting the clustered set of at least one class of user utterances comprises:
sorting the sets of at least one class of user utterances that have the same question count in ascending order of cluster question count; the cluster question count is the total number of user utterances in each set after de-duplication.
8. The data processing method of claim 7, wherein sorting the clustered set of at least one class of user utterances comprises:
sorting the sets of at least one class of user utterances that have the same cluster question count in chronological order from most recent to earliest.
9. The data processing method of claim 3, wherein selecting, for each set of user utterances, a set of knowledge items from a knowledge base that match the set of user utterances of that type comprises:
matching the knowledge items in the knowledge base one by one with the user utterances in the set of each class of user utterances;
selecting the knowledge items whose confidence, as given by the knowledge base question-answer model, is greater than or equal to a preset value to form the set of knowledge items; and
in the set of knowledge items, arranging the knowledge items in descending order of their cumulative number of occurrences and de-duplicating them.
10. A data processing apparatus for knowledge base question answering, the data processing apparatus comprising:
a memory storing execution instructions; and
a processor executing execution instructions stored by the memory to cause the processor to perform the method of any of claims 1 to 9.
CN202010255287.8A 2020-04-02 2020-04-02 Data processing method and equipment for knowledge base questions and answers Active CN111428019B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010255287.8A CN111428019B (en) 2020-04-02 2020-04-02 Data processing method and equipment for knowledge base questions and answers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010255287.8A CN111428019B (en) 2020-04-02 2020-04-02 Data processing method and equipment for knowledge base questions and answers

Publications (2)

Publication Number Publication Date
CN111428019A true CN111428019A (en) 2020-07-17
CN111428019B CN111428019B (en) 2023-07-28

Family

ID=71556118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010255287.8A Active CN111428019B (en) 2020-04-02 2020-04-02 Data processing method and equipment for knowledge base questions and answers

Country Status (1)

Country Link
CN (1) CN111428019B (en)

Also Published As

Publication number Publication date
CN111428019B (en) 2023-07-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant