CN111339248A - Data attribute filling method, device, equipment and computer readable storage medium - Google Patents


Info

Publication number
CN111339248A
Authority
CN
China
Prior art keywords
data; responded; initial; attribute; problem data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010088080.6A
Other languages
Chinese (zh)
Inventor
张智
莫洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010088080.6A priority Critical patent/CN111339248A/en
Publication of CN111339248A publication Critical patent/CN111339248A/en
Priority to PCT/CN2020/098768 priority patent/WO2021159655A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and discloses a data attribute filling method, which comprises the following steps: acquiring the knowledge owner to which the initial problem data to be responded belongs based on a target prediction result, and determining the knowledge base corresponding to the initial problem data to be responded according to the knowledge owner; calculating the comprehensive similarity between the initial problem data to be responded and the historical problem data in the knowledge base; if the comprehensive similarity is greater than or equal to a first preset threshold, inputting the initial problem data to be responded into the preset nodes of a graph G = (V, E) to obtain a clustering result; and if the matching degree between the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold, determining the attributes of the clustering result and performing attribute filling on the clustering result with those attributes. The invention also discloses a data attribute filling apparatus, a device, and a computer-readable storage medium. The data attribute filling method provided by the invention improves the efficiency of data attribute filling.

Description

Data attribute filling method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a data attribute filling method, apparatus, device, and computer-readable storage medium.
Background
At present, data attribute filling generally relies on literal-similarity clustering, which is unsuitable for large-scale conversation-log mining scenarios with multiple knowledge owners: the corresponding knowledge attributes cannot be supplemented in batches in a single operation, and attributes must instead be set manually for the questions raised by users, which is time-consuming, error-prone, and inefficient. How to realize efficient automatic filling of data attributes in a big-data scenario is therefore a technical problem urgently to be solved in this field.
Disclosure of Invention
The invention mainly aims to provide a data attribute filling method, a data attribute filling device, data attribute filling equipment and a computer readable storage medium, and aims to solve the technical problem of low data attribute filling efficiency.
In order to achieve the above object, the present invention provides a data attribute filling method, including the following steps:
predicting initial problem data to be responded by a preset model set to obtain a target prediction result;
acquiring a knowledge owner to which the initial question data to be responded belongs based on the target prediction result, and determining a knowledge base corresponding to the initial question data to be responded according to the knowledge owner;
calculating the comprehensive similarity between the initial to-be-responded question data and the historical question data in the knowledge base;
judging whether the comprehensive similarity is greater than or equal to a first preset threshold value;
if the comprehensive similarity is greater than or equal to the first preset threshold, inputting the initial problem data to be responded into each preset node of a graph G = (V, E), determining the weight of the initial problem data to be responded according to its degree in the graph, and clustering the initial problem data to be responded based on the weight to obtain a clustering result, wherein the item with the highest weight in the clustering result is the problem data and the rest are similar problem data, V is the node set, E is the edge set, and the similar problem data are data having a similarity relation to the problem data;
judging whether the matching degree of the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value or not;
and if the matching degree of the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value, determining the attributes of the clustering result, and performing attribute filling on the clustering result by adopting the attributes.
Optionally, before the predicting the initial to-be-responded question data by the preset model set to obtain the target prediction result, the method further includes the following steps:
removing punctuation marks in the first initial problem data set to be responded to through a regular expression to obtain a second initial problem data set to be responded to;
performing synonym conversion on the second initial problem data set to be responded by a preset synonym conversion mode to obtain a third initial problem data set to be responded;
and calling a library function to perform literal duplicate removal processing on the third initial problem data set to be responded to obtain a target problem data set to be responded to, wherein the target problem data set to be responded to at least comprises one initial problem data to be responded to.
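The punctuation-removal step above can be sketched as follows; the regular expression, the helper name `strip_punctuation`, and the sample questions are illustrative assumptions, since the patent does not give its actual pattern.

```python
import re

# Illustrative punctuation class: ASCII punctuation plus the CJK-symbol and
# full-width blocks commonly used for Chinese punctuation (an assumption --
# the patent does not state its actual regular expression).
_PUNCT_RE = re.compile(r"[\u3000-\u303f\uff00-\uffef!-/:-@\[-`{-~]")

def strip_punctuation(questions):
    """Remove punctuation from each question via the regular expression."""
    return [_PUNCT_RE.sub("", q) for q in questions]

first_set = ["How do I renew my policy?", "What is the insured amount??"]
second_set = strip_punctuation(first_set)  # punctuation-free question set
```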
Optionally, the invoking of a library function to perform literal deduplication processing on the third initial problem data set to be responded to obtain a target problem data set to be responded, where the target problem data set to be responded includes at least one piece of initial problem data to be responded, includes the following steps:
sequencing each third initial problem data to be responded in the third initial problem data sets to be responded according to the sentence length by calling a quick sequencing algorithm in a library function to obtain sequenced third initial problem data sets to be responded;
traversing the sorted third initial problem data set to be responded, and clearing repeated words to obtain a target problem data set to be responded.
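The length-sort-plus-deduplication step above could be sketched like this. Python's built-in `sorted` (Timsort) stands in for the quicksort named in the text, and reading "clearing repeated words" as dropping literal duplicate questions is an interpretive assumption.

```python
def literal_deduplicate(questions):
    """Sort questions by sentence length, then drop literal duplicates in one pass."""
    # The built-in sort (Timsort) stands in for the library quicksort in the text.
    ordered = sorted(questions, key=len)
    seen, result = set(), []
    for q in ordered:          # traverse the sorted set
        if q not in seen:      # clear repeated entries
            seen.add(q)
            result.append(q)
    return result
```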
Optionally, the synonym conversion is performed on the second initial problem data set to be responded to through a preset synonym conversion mode to obtain a third initial problem data set to be responded to, including the following steps:
performing word segmentation on the second initial question data set to be responded to obtain word segmentation data;
acquiring a feature vector of the word segmentation data, and calculating cosine included angle values of the feature vector and feature vectors of all words in a preset word bank;
judging whether the cosine included angle value is smaller than a preset included angle value or not;
if the cosine included angle value is smaller than a preset included angle value, obtaining synonymous data of each word in the preset word bank, and forming the synonymous data into a third initial problem data set to be responded;
if the cosine included angle value is greater than or equal to the preset included angle value, continuing to execute the step of judging whether the cosine included angle value is smaller than the preset included angle value until the cosine included angle value satisfies the preset included angle value.
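The cosine-angle comparison above can be illustrated with plain vectors. The threshold value, the toy two-dimensional lexicon, and the function names are assumptions for illustration only; real word embeddings are much higher-dimensional.

```python
import math

def cosine_angle(u, v):
    """Angle in radians between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))  # clamp for float safety

def find_synonyms(word_vec, lexicon, max_angle=0.3):
    """Return lexicon words whose vectors lie within max_angle of word_vec."""
    return [w for w, vec in lexicon.items() if cosine_angle(word_vec, vec) < max_angle]

# Toy two-dimensional "feature vectors" standing in for the preset word bank.
lexicon = {"insure": [0.9, 0.1], "underwrite": [0.88, 0.15], "weather": [0.1, 0.95]}
```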
Optionally, the predicting the initial problem data to be responded to through a preset model set to obtain a target prediction result includes the following steps:
predicting the initial question data to be responded through a language representation bert model in a preset model set, and judging whether the initial question data to be responded belongs to an effective type;
if the initial question data to be responded belong to the effective type, obtaining an effective type prediction result;
predicting the initial question data to be responded through a text classification textcnn model in a preset model set, and judging whether the initial question data to be responded belongs to a chatting type;
if the initial question data to be responded belongs to the chatting type, obtaining a chatting type prediction result;
and combining the effective class prediction result and the chatting class prediction result to obtain a target prediction result.
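A minimal sketch of combining the two binary predictions into a target prediction result; the lambda stand-ins are toy assumptions, not the trained BERT and TextCNN models the text describes.

```python
def predict_target(question, is_valid, is_chat):
    """Combine a validity prediction and a chatting prediction into one result."""
    return {"valid": is_valid(question), "chat": is_chat(question)}

# Toy stand-ins for the trained classifiers (assumptions, not real models):
# a BERT model would judge validity and a TextCNN model would judge chatting.
is_valid = lambda q: len(q.strip()) > 3
is_chat = lambda q: q.lower().startswith(("hi", "hello"))

result = predict_target("How do I file a claim", is_valid, is_chat)
```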
Optionally, if the matching degree between the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold, determining the attribute of the clustering result, and performing attribute filling on the clustering result by using the attribute includes:
if the matching degree of the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value, acquiring an attribute set of the clustering result based on a preset mapping relation between the attributes of the historical problem data and the attributes of the clustering result, wherein the attribute set of the clustering result comprises at least one attribute of the clustering result;
and mining a frequent item set in the attribute set of the clustering result, and determining the attribute of the clustering result based on the frequent item set.
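Frequent-itemset mining over the clusters' attribute sets could be sketched with simple support counting (an Apriori-style single pass without candidate pruning); the `min_support` value, the attribute names, and the function itself are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(attribute_sets, min_support=2, max_size=2):
    """Count attribute itemsets across clusters; keep those meeting min_support."""
    counts = Counter()
    for attrs in attribute_sets:
        items = sorted(set(attrs))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}

# Toy attribute sets gathered from three clustering results.
attribute_sets = [
    {"policy_no", "amount"},
    {"policy_no", "applicant"},
    {"policy_no", "amount"},
]
frequent = frequent_itemsets(attribute_sets)
```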
Optionally, the calculating the comprehensive similarity between the initial to-be-answered question data and the historical question data in the knowledge base includes the following steps:
calculating the literal similarity between the initial problem data to be responded and the historical problem data in the knowledge base through term frequency-inverse document frequency (TF-IDF);
calculating the semantic similarity between the initial problem data to be responded and the historical problem data in the knowledge base through a siamese (twin) network;
and ranking the literal similarity and the semantic similarity by similarity value to obtain the comprehensive similarity.
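The literal-similarity side of the calculation above (TF-IDF plus cosine) can be sketched as follows; the siamese-network semantic side and the final priority ranking are omitted, and the whitespace tokenization and idf smoothing are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors for whitespace-tokenized documents (smoothed idf)."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    vecs = []
    for doc in tokenized:
        tf = Counter(doc)
        vecs.append({t: (tf[t] / len(doc)) * math.log((1 + n) / (1 + df[t]))
                     for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["renew policy amount", "renew policy", "weather today"]
vecs = tfidf_vectors(docs)
```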
Further, in order to achieve the above object, the present invention further provides a data attribute filling apparatus, including the following modules:
the prediction module is used for predicting the initial problem data to be responded through a preset model set to obtain a target prediction result;
the classification module is used for acquiring a knowledge owner to which the initial to-be-responded question data belongs based on the target prediction result, and determining a knowledge base corresponding to the initial to-be-responded question data according to the knowledge owner;
the identification module is used for calculating the comprehensive similarity between the initial to-be-responded question data and the historical question data in the knowledge base;
the similarity judging module is used for judging whether the similarity is greater than or equal to a first preset threshold value or not;
the clustering module is used for inputting the initial problem data to be responded into each preset node of a graph G = (V, E) if the comprehensive similarity is greater than or equal to the first preset threshold, determining the weight of the initial problem data to be responded according to its degree in the graph, and clustering the initial problem data to be responded based on the weight to obtain a clustering result, wherein the item with the highest weight in the clustering result is the problem data and the rest are similar problem data, V is the node set, E is the edge set, and the similar problem data are data having a similarity relation to the problem data;
the matching degree judging module is used for judging whether the matching degree of the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value or not;
and the filling module is used for determining the attribute of the clustering result and filling the attribute of the clustering result by adopting the attribute if the matching degree of the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold.
Optionally, the data attribute filling apparatus further includes the following modules:
the punctuation mark clearing module is used for removing punctuation marks in the first initial problem data set to be responded through the regular expression to obtain a second initial problem data set to be responded;
the synonym conversion module is used for carrying out synonym conversion on the second initial problem data set to be responded through a preset synonym conversion mode to obtain a third initial problem data set to be responded;
and the literal duplicate removal module is used for calling a library function to perform literal duplicate removal processing on the third initial problem data set to be responded to obtain a target problem data set to be responded to, wherein the target problem data set to be responded to at least comprises one initial problem data to be responded to.
Optionally, the literal deduplication module includes the following units:
the sorting unit is used for sorting each third initial problem data to be responded in the third initial problem data set to be responded by sentence length by calling a quicksort algorithm in the library function, to obtain the sorted third initial problem data set to be responded;
and the literal deduplication unit is used for traversing the sorted third initial problem data set to be responded and removing repeated words to obtain the target problem data set to be responded.
Optionally, the synonym transformation module includes the following elements:
the word segmentation unit is used for performing word segmentation on the second initial question data set to be responded to obtain word segmentation data;
the cosine included angle value calculating unit is used for acquiring the characteristic vector of the word segmentation data and calculating cosine included angle values of the characteristic vector and the characteristic vector of each word in a preset word bank;
the cosine included angle value judging unit is used for judging whether the cosine included angle value is smaller than a preset included angle value or not;
a synonymous data obtaining unit, configured to obtain synonymous data of each word in the preset lexicon if the cosine included angle value is smaller than a preset included angle value, and form the synonymous data into a third initial problem data set to be answered; if the cosine included angle value is larger than or equal to the preset included angle value, the step of judging whether the cosine included angle value is smaller than the preset included angle value is continuously executed until the cosine included angle value meets the preset included angle value.
Optionally, the prediction module comprises the following units:
the effective type prediction unit is used for predicting the initial question data to be answered through a language representation bert model in a preset model set and judging whether the initial question data to be answered belongs to an effective type;
an effective class prediction result obtaining unit, configured to obtain an effective class prediction result if the initial data of the problem to be responded belongs to an effective class;
the chatting type prediction unit is used for predicting the initial question data to be responded through a text classification textcnn model in a preset model set and judging whether the initial question data to be responded belongs to a chatting type;
a chatting type prediction result obtaining unit, configured to obtain a chatting type prediction result if the initial question data to be responded belongs to a chatting type;
and the prediction result combination unit is used for combining the effective class prediction result and the chatting class prediction result to obtain a target prediction result.
Optionally, the filling module comprises:
the attribute set acquisition unit of the clustering result is used for acquiring an attribute set of the clustering result based on a preset mapping relation between the attributes of the historical problem data and the attributes of the clustering result if the matching degree of the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value, wherein the attribute set of the clustering result comprises at least one attribute of the clustering result;
and the frequent item set mining unit is used for mining frequent item sets in the attribute sets of the clustering results and determining the attributes of the clustering results based on the frequent item sets.
Optionally, the identification module comprises the following units:
the literal similarity calculation unit is used for calculating the literal similarity between the initial problem data to be responded and the historical problem data in the knowledge base through term frequency-inverse document frequency (TF-IDF);
the semantic similarity calculation unit is used for calculating the semantic similarity between the initial problem data to be responded and the historical problem data in the knowledge base through a siamese (twin) network;
and the similarity obtaining unit is used for ranking the literal similarity and the semantic similarity by similarity value to obtain the comprehensive similarity.
Further, to achieve the above object, the present invention also provides a data attribute filling device, comprising a memory, a processor, and a data attribute filling program stored on the memory and executable on the processor, where the data attribute filling program, when executed by the processor, implements the steps of the data attribute filling method according to any one of the above.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium having a data attribute filling program stored thereon, which, when executed by a processor, implements the steps of the data attribute filling method according to any one of the above.
Clustering a problem data set through a graph separates the problem data from similar problem data; the problem data and the similar problem data share the same attributes, each attribute corresponds to a knowledge base, and different data are stored in the corresponding knowledge base according to their attributes. A language-representation BERT model then predicts whether an unanswered question is a valid sentence, and a binary classification model trained with a text-classification TextCNN model identifies whether it is chatting. For the valid, non-chatting part, the literal similarity and the semantic similarity are calculated through term frequency-inverse document frequency (TF-IDF) and a siamese network respectively, the problem data satisfying the similarity are clustered, and the attribute with the highest matching degree to the clustering result is selected from the knowledge base, thereby achieving fast attribute filling of the data.
Drawings
FIG. 1 is a schematic diagram of an environment in which data attribute populating devices operate according to embodiments of the present invention;
FIG. 2 is a flowchart illustrating a data attribute filling method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a data attribute filling method according to a second embodiment of the present invention;
FIG. 4 is a detailed flowchart of one embodiment of step S103 in FIG. 3;
FIG. 5 is a detailed flowchart of one embodiment of step S102 in FIG. 3;
FIG. 6 is a detailed flowchart of one embodiment of step S10 in FIG. 2;
FIG. 7 is a detailed flowchart of one embodiment of step S70 in FIG. 2;
FIG. 8 is a detailed flowchart of one embodiment of step S30 in FIG. 2;
FIG. 9 is a functional block diagram of an embodiment of a data attribute populating apparatus according to the invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a data attribute filling device.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an operating environment of a data attribute populating device according to an embodiment of the present invention.
As shown in fig. 1, the data attribute populating apparatus includes: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display (Display), an input unit such as a Keyboard (Keyboard), and the network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the hardware configuration of the data property populating device shown in FIG. 1 does not constitute a limitation of the data property populating device and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer-readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a data attribute populating program. The operating system is a program that manages and controls the data property filling apparatus and software resources, and supports the operation of the data property filling program as well as other software and/or programs.
In the hardware structure of the data attribute populating device shown in fig. 1, the network interface 1004 is primarily used to access a network; the user interface 1003 is mainly used for detecting a confirmation instruction, an editing instruction, and the like. And processor 1001 may be configured to invoke the data property filling program stored in memory 1005 and perform the operations of the various embodiments of the data property filling method below.
Based on the above hardware structure of the data attribute filling device, embodiments of the data attribute filling method of the present invention are provided.
Referring to fig. 2, fig. 2 is a flowchart illustrating a data attribute filling method according to a first embodiment of the present invention. In this embodiment, the data attribute filling method includes the following steps:
step S10, predicting the initial question data to be responded by a preset model set to obtain a target prediction result;
in this embodiment, pre-trained prediction models in a preset model set are used to predict the initial problem data to be responded and obtain a prediction result. For example, the preset model set may include a language-representation BERT model and a text-classification TextCNN model; different models are then applied to the initial data to be responded, and the resulting prediction may indicate that the initial data to be responded belongs to the chatting class or to the valid class.
Step S20, acquiring the knowledge owner of the initial to-be-responded question data based on the target prediction result, and determining a knowledge base corresponding to the initial to-be-responded question data according to the knowledge owner;
in this embodiment, the properties and relationships of an object are referred to as the attributes of the object; for example, an insured amount, a policy number, and an applicant may all be classified under "insurance". The knowledge owner to which the initial data to be responded belongs refers to the predicted classification of that data. Knowledge bases with different classifications have been set up in advance, and a mapping relationship exists between each knowledge base and the initial data to be responded with different knowledge owners; therefore, once the knowledge owner to which the initial data to be responded belongs is obtained, the initial data to be responded can be dispatched to the corresponding knowledge base according to the mapping relationship.
Step S30, calculating the comprehensive similarity between the initial question data to be answered and the historical question data in the knowledge base;
in this embodiment, after the initial problem data to be responded with different knowledge owners are dispatched to the corresponding knowledge bases, the similarity between the dispatched initial problem data and the historical problem data must be calculated. The purpose is to acquire other data having an approximate relationship with the current initial problem data to be responded, which may include literal similarity. For example, if the current initial problem data to be responded contains "insurance" several times, and a piece of historical problem data in the corresponding knowledge base also contains "insurance" several times, a certain similarity exists between the two pieces of data. A preset similarity calculation method, such as term frequency-inverse document frequency, may be used for this calculation.
Step S40, judging whether the similarity is larger than or equal to a first preset threshold value;
in this embodiment, because there may be a plurality of pieces of historical problem data similar to the current initial problem data to be answered in the knowledge base, and the plurality of pieces of historical problem data do not always satisfy the preset similarity, the first preset threshold is preset, and the value of the first preset threshold is not limited, for example, may be 90%.
Step S50, if the comprehensive similarity is greater than or equal to the first preset threshold, inputting the initial problem data to be responded into each preset node of a graph G = (V, E), determining the weight of the initial problem data to be responded according to its degree in the graph, and clustering the initial problem data to be responded based on the weight to obtain a clustering result, wherein the item with the highest weight in the clustering result is the problem data and the rest are similar problem data, V is the node set, E is the edge set, and the similar problem data are data having a similarity relation to the problem data;
in this embodiment, a graph consists of a finite, non-empty set of vertices and a set of edges between those vertices, and may be represented as G = (V, E), where V is the node set and E is the edge set. Here, each point is a piece of initial problem data to be responded and each edge represents the similarity between two such pieces; the point with the highest degree, i.e., the most central point, is taken as the representative, namely the problem data, where the degree serves as the weight of each point.
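The degree-based selection of a representative node described above can be sketched on a toy graph; the edge list and question names are illustrative assumptions.

```python
from collections import defaultdict

def cluster_by_degree(edges):
    """Build G = (V, E) from an edge list; the highest-degree node represents the cluster."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degree = {node: len(nbrs) for node, nbrs in adj.items()}   # degree as weight
    representative = max(degree, key=degree.get)               # the problem data
    similar = sorted(n for n in degree if n != representative) # similar problem data
    return representative, similar

# Each node is a question; each edge links a pair that met the similarity threshold.
edges = [("q1", "q2"), ("q1", "q3"), ("q1", "q4"), ("q2", "q3")]
representative, similar = cluster_by_degree(edges)
```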
Step S60, judging whether the matching degree of the attribute of the historical problem data in the knowledge base and the clustering result is larger than or equal to a second preset threshold value;
in this embodiment, the attributes of the clustering result and the historical problem data may be in a one-to-one mapping relationship or in a one-to-many mapping relationship, and these mapping relationships are set in advance.
And step S70, if the matching degree of the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold, determining the attributes of the clustering result, and filling the attributes of the clustering result with the attributes.
In this embodiment, each knowledge owner corresponds to one knowledge base. A knowledge base holds multiple pieces of historical problem data with different attributes. When the matching degree between the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to the second preset threshold, the attributes of those historical problems can be filled into the clustering result. One specific filling approach is to build an attribute table to be filled in advance and, whenever the matching degree reaches the second preset threshold, map the corresponding attributes into that table.
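The table-filling approach above can be sketched as below. All names here (`fill_attributes`, the 0.8 default threshold, the dictionary-as-table shape) are illustrative assumptions, not taken from the original text.

```python
# Sketch of step S70's filling approach: a pre-built attribute table keyed
# by cluster, filled only when the match score reaches the second threshold.
def fill_attributes(cluster_id, candidate_attrs, match_score,
                    attribute_table, threshold=0.8):
    """candidate_attrs: attributes of matched historical problem data."""
    if match_score >= threshold:
        attribute_table.setdefault(cluster_id, []).extend(candidate_attrs)
    return attribute_table
```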
In summary, the problem data set is first clustered through the graph to separate problem data from similar problem data; the two share the same attributes, each attribute corresponds to a knowledge base, and different data are stored into the corresponding knowledge base according to their attributes. The attributes with the highest matching degree to the clustering result are then selected from the knowledge base, thereby filling in the problem attributes.
Referring to fig. 3, fig. 3 is a flowchart illustrating a data attribute filling method according to a second embodiment of the present invention. In this embodiment, before predicting the initial to-be-answered problem data through the preset model set in step S10 to obtain a target prediction result, the data attribute filling method includes the following steps:
step S80, removing punctuation marks in the first initial problem data set to be responded to through a regular expression to obtain a second initial problem data set to be responded to;
In this embodiment, punctuation marks in the problem data are removed with a regular expression, yielding punctuation-free problem data.
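A minimal sketch of step S80 follows. The character class is an illustrative assumption covering common Chinese and ASCII punctuation; the patent does not specify the exact expression.

```python
import re

# Sketch of step S80: strip punctuation with a regular expression.
# The character class below is illustrative, not the patented pattern.
PUNCT = re.compile(r"[，。！？、；：,.!?;:\"'（）()\[\]【】…-]")

def strip_punctuation(questions):
    """Remove punctuation from each question string."""
    return [PUNCT.sub("", q) for q in questions]
```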
Step S90, synonym conversion is carried out on the second initial problem data set to be responded through a preset synonym conversion mode, and a third initial problem data set to be responded is obtained;
In this embodiment, variant characters or words are found by string search and then replaced, much like a dictionary lookup, e.g. mapping different written forms of the insurance product name "e Sheng Bao" to a single canonical form. This mainly implements speech-recognition error correction for insurance product names and unifies how a product is described.
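The dictionary-style replacement can be sketched as below. The mapping shown is a hypothetical example; in practice the synonym table would be preconfigured from the product vocabulary.

```python
# Sketch of step S90's string search-and-replace (the table is hypothetical).
SYNONYMS = {"e shengbao": "e Sheng Bao", "e shenbao": "e Sheng Bao"}

def unify_synonyms(question, table=SYNONYMS):
    """Replace every variant phrase with its canonical form."""
    for variant, canonical in table.items():
        question = question.replace(variant, canonical)
    return question
```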
And S100, calling a library function to perform literal duplication elimination on the third initial question data set to be responded to obtain a target question data set to be responded to, wherein the target question data set to be responded to at least comprises one initial question data to be responded to.
In this embodiment, deduplication is performed through a library function, yielding literally deduplicated problem data. A library function is simply a commonly used function compiled into a library file so that it can be called wherever needed.
The third initial problem data set to be responded is traversed in order to check whether problem data with identical wording exist; even after repeated words within a single question are removed, other questions with exactly the same wording may remain. If literally identical problem data exist, only one copy is kept, yielding the target problem data set to be responded. Keeping a single copy avoids duplicate data, i.e. every initial problem data to be responded in the target set is unique.
Referring to fig. 4, fig. 4 is a detailed flowchart of an embodiment of step S100 in fig. 3. In this embodiment, step S100, in which a library function is called to perform literal deduplication on the third initial question data set to be answered to obtain a target question data set to be answered (the target set contains at least one initial question data to be answered), includes the following steps:
step S1001, sorting each third initial problem data to be responded in a third initial problem data set to be responded according to the sentence length by calling a quick sorting algorithm in a library function to obtain a sorted third initial problem data set to be responded;
In this embodiment, the data to be sorted are partitioned into two independent parts such that every sentence length in one part is smaller than every sentence length in the other; each part is then quick-sorted in the same way. The whole process proceeds recursively until the data form an ordered sequence.
And step S1002, traversing the sorted third initial problem data set to be responded, and clearing repeated words to obtain a target problem data set to be responded.
In this embodiment, the two sorted parts can be traversed simultaneously, so repeats are identified promptly; any repeats found are removed, yielding literally deduplicated problem data, i.e. the initial problem data to be responded.
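Steps S1001 and S1002 can be sketched as follows. This sketch uses the standard library's sort in place of a hand-written quicksort, under the assumption that any length-ordered sort satisfies step S1001.

```python
# Sketch of steps S1001-S1002: order by sentence length, then keep only
# the first occurrence of each literal string.
def deduplicate(questions):
    ordered = sorted(questions, key=len)  # step S1001: sort by sentence length
    seen, target = set(), []
    for q in ordered:                     # step S1002: drop literal repeats
        if q not in seen:
            seen.add(q)
            target.append(q)
    return target
```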
Referring to fig. 5, fig. 5 is a schematic view of a detailed flow of an embodiment of step S90 in fig. 3. In this embodiment, step S90, performing synonym transformation on the second initial question data set to be answered in a preset synonym transformation manner to obtain a third initial question data set to be answered, includes the following steps:
step S901, performing word segmentation on the second initial question data set to be responded to obtain word segmentation data;
In this embodiment, the punctuation-free problem data can be segmented into words, yielding word segmentation data.
Step S902, acquiring a feature vector of the word segmentation data, and calculating the cosine angle between this feature vector and the feature vector of each word in a preset word bank;
In this embodiment, the word segmentation data are converted into feature vectors, and the angle between feature vectors is computed with the cosine formula; the smaller the angle, the more similar the vectors.
Step S903, judging whether the cosine included angle value is smaller than a preset included angle value;
in this embodiment, in order to obtain the cosine included angle value meeting the preset condition, a preset included angle value needs to be set, for example, 20 °.
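Steps S902 and S903 can be sketched as below, assuming dense feature vectors; the 20° threshold follows the example given above.

```python
import math

# Sketch of steps S902-S903: angle between feature vectors, compared
# against the preset angle value (20 degrees in the example above).
def cosine_angle_deg(u, v):
    """Angle in degrees between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def is_synonym(u, v, max_angle=20.0):
    return cosine_angle_deg(u, v) < max_angle
```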
Step S904, if the cosine angle value is smaller than the preset angle value, obtaining the synonymous data of each such word in the preset lexicon and forming the synonymous data into a third initial problem data set to be responded; if the cosine angle value is greater than or equal to the preset angle value, returning to step S902.
In this embodiment, the constraint of the preset angle value yields the data whose angle falls below it. For values greater than or equal to the preset angle value, the cosine angles between the feature vector of the word segmentation data and the feature vectors of other words in the preset word bank must be calculated.
Referring to fig. 6, fig. 6 is a detailed flowchart of an embodiment of step S10 in fig. 2. In this embodiment, in step S10, predicting the initial to-be-answered problem data by using a preset model set to obtain a target prediction result, including the following steps:
Step S101, predicting the initial problem data to be responded through a language representation bert model in a preset model set, and judging whether the initial problem data to be responded belongs to the valid type;
In this embodiment, before the language representation bert model can recognize valid problem data, it must be trained. The initial bert model is trained on sample data whose valid and invalid types are known, until it can accurately identify the valid type among the initial problem data to be responded.
Step S102, if the initial question data to be responded belongs to the effective type, obtaining an effective type prediction result;
In this embodiment, as noted in step S101, the language representation bert model can identify whether the initial question data to be answered is of the valid type, from which a valid-type prediction result is obtained. The purpose of this identification is to single out the valid questions among all initial question data to be answered: a valid question must belong to some knowledge base. For example, in a man-machine question-answering scene about buying insurance, a question about buying fruit would be invalid data.
Step S103, predicting the initial question data to be responded through a text classification textcnn model in a preset model set, and judging whether the initial question data to be responded belongs to a chatting type;
In this embodiment, the initial text classification model likewise undergoes chatting-type prediction training until training is complete, that is, until a certain accuracy is reached, after which it can predict the type of the initial problem data to be responded.
Step S104, if the initial question data to be responded belongs to a chatting type, obtaining a chatting type prediction result;
In this embodiment, the text classification textcnn model is trained with preset chatting and non-chatting training samples, giving it this recognition capability; in an insurance-buying man-machine question-answering scene, for example, laughter or sighs can serve as chatting-type data.
And step S105, combining the effective class prediction result and the chatting class prediction result to obtain a target prediction result.
In this embodiment, the chatting-class prediction result and the valid-class prediction result together form the target prediction result.
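The combination in step S105 can be sketched as below. The two classifiers are passed in as plain callables standing in for the trained bert and textcnn models; keeping the valid, non-chatting questions matches the summary given later in the description.

```python
# Sketch of step S105: keep questions the valid-type classifier accepts
# and the chatting-type classifier rejects. is_valid / is_chat stand in
# for the trained bert and textcnn models.
def target_prediction(questions, is_valid, is_chat):
    return [q for q in questions if is_valid(q) and not is_chat(q)]
```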
Referring to fig. 7, fig. 7 is a schematic diagram illustrating a detailed flow of an embodiment of step S70 in fig. 2, where in the embodiment, in step S70, if a matching degree between an attribute of historical problem data in a knowledge base and a clustering result is greater than or equal to a second preset threshold, determining an attribute of the clustering result, and performing attribute filling on the clustering result by using the attribute includes the following steps:
step S701, if the matching degree of the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value, acquiring an attribute set of the clustering result based on a preset mapping relation between the attributes of the historical problem data and the attributes of the clustering result, wherein the attribute set of the clustering result comprises at least one attribute of the clustering result;
step S702, a frequent item set in the attribute set of the clustering result is mined, and the attribute of the clustering result is determined based on the frequent item set.
In this embodiment, attributes that appear frequently in the attribute set of the clustering result can be mined through a big data mining platform. The criterion for a frequent item set can be set in advance; for example, an attribute that appears three or more times may be treated as frequent.
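A simplified single-item version of this mining step can be sketched as follows, assuming the minimum support of three occurrences from the example above; full frequent-itemset algorithms such as Apriori would extend this to attribute combinations.

```python
from collections import Counter

# Sketch of step S702: count attribute occurrences across the attribute
# sets and keep those meeting the minimum support (3, per the example).
def frequent_attributes(attribute_sets, min_support=3):
    """Return attributes appearing in at least min_support sets."""
    counts = Counter(a for s in attribute_sets for a in s)
    return {a for a, n in counts.items() if n >= min_support}
```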
Referring to fig. 8, fig. 8 is a schematic view of a detailed flow of the step S30 in fig. 2. In this embodiment, in step S30, the step of calculating the comprehensive similarity between the initial to-be-answered question data and the historical question data in the knowledge base includes the following steps:
Step S301, calculating the literal similarity between the initial question data to be responded and the historical question data in the knowledge base through word frequency-inverse document frequency TF-IDF;
In this embodiment, word segmentation is performed with jieba and arranged into a specified format; each problem data to be compared is then converted into a sparse vector through doc2bow in the gensim library, the corpus is processed with word frequency-inverse document frequency TF-IDF, an index is built between the feature values and the sparse-matrix similarities, and finally the literal similarity between each pair of problem data to be responded is obtained.
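As a simplified, dependency-free stand-in for the jieba/gensim pipeline described above, TF-IDF weighting plus cosine similarity can be sketched directly; tokens are assumed to be pre-segmented, and the smoothing in the IDF term is an implementation choice, not from the original.

```python
import math
from collections import Counter

# Simplified stand-in for the gensim doc2bow + TF-IDF pipeline:
# TF-IDF vectors and cosine similarity over pre-segmented token lists.
def tfidf_similarity(doc_a, doc_b, corpus):
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))  # document frequency
    def vec(doc):
        tf = Counter(doc)
        return {t: tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf}
    va, vb = vec(doc_a), vec(doc_b)
    dot = sum(va[t] * vb.get(t, 0.0) for t in va)
    na = math.sqrt(sum(x * x for x in va.values()))
    nb = math.sqrt(sum(x * x for x in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```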
Step S302, calculating semantic similarity between initial problem data to be responded and historical problem data in a knowledge base through a twin network;
In this embodiment, the twin network consists of two sub-networks with identical structure and shared parameters. This model is chosen when the two sentences come from the same field and are structurally very similar; the spatial similarity between the two sentences is measured with the Manhattan distance, Euclidean distance, cosine similarity, or the like, yielding the semantic similarity.
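Given the sentence embeddings produced by the twin network's shared encoder, the Manhattan-distance scoring can be sketched as below. Mapping the distance through exp(-d) is one common convention (it places scores in (0, 1]) and is an assumption here, not specified by the original text.

```python
import math

# Sketch: score twin-network sentence embeddings by Manhattan distance,
# mapped into (0, 1] via exp(-distance).
def manhattan_similarity(emb_a, emb_b):
    distance = sum(abs(a - b) for a, b in zip(emb_a, emb_b))
    return math.exp(-distance)
```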
Step S303, the literal similarity and the semantic similarity are respectively subjected to priority ranking according to the similarity value, and comprehensive similarity is obtained.
In this embodiment, a more accurate attribute requires both the literal similarity and the semantic similarity, which are combined into the comprehensive similarity. When the similarity between the initial problem data to be responded and the historical problem data is compared against the first preset threshold, both the literal similarity and the semantic similarity must be greater than or equal to it before the initial problem data to be responded can be input into the nodes of the graph G = (V, E).
In summary, the problem data set is first clustered through the graph to separate problem data from similar problem data; the two share the same attributes, each attribute corresponds to a knowledge base, and different data are stored into the corresponding knowledge base according to their attributes. A bert model then predicts whether an unanswered problem is a valid sentence, and a binary classifier trained with the textcnn model identifies whether it is chatting. For the valid, non-chatting part, the literal and semantic similarities are computed with word frequency-inverse document frequency TF-IDF and a twin network respectively, problem data meeting the similarity requirement are clustered, and the attributes with the highest matching degree to the clustering result are selected from the knowledge base, completing the attribute filling.
Referring to fig. 9, fig. 9 is a functional module diagram of an embodiment of a data attribute filling apparatus according to the present invention. In this embodiment, the data attribute filling apparatus includes:
the prediction module 10 is configured to predict the initial problem data to be responded to through a preset model set to obtain a target prediction result;
the classification module 20 is configured to obtain a knowledge owner to which the initial question data to be responded belongs based on the target prediction result, and determine a knowledge base corresponding to the initial question data to be responded according to the knowledge owner;
the identification module 30 is used for calculating the comprehensive similarity between the initial to-be-responded question data and the historical question data in the knowledge base;
a similarity judging module 40, configured to judge whether the similarity is greater than or equal to a first preset threshold;
the clustering module 50 is configured to, if the similarity is greater than or equal to a first preset threshold, input the initial problem data to be responded into each node of a preset graph G = (V, E), determine the weight of the initial problem data to be responded according to the degree in the graph, and perform clustering processing on the initial problem data to be responded based on the weight to obtain a clustering result, where the item with the highest weight in the clustering result is problem data, the rest are similar problem data, V is a node set, E is an edge set, and the similar problem data is data having a similar relationship with the problem data;
a matching degree judging module 60, configured to judge whether a matching degree between an attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold;
and a filling module 70, configured to determine an attribute of the clustering result if a matching degree between the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold, and perform attribute filling on the clustering result by using the attribute.
In this embodiment, the modules of the apparatus can obtain multiple attributes in a single run, improving the efficiency with which different initial problem data to be responded are classified under different attributes.
The invention also provides a computer readable storage medium.
In this embodiment, the computer readable storage medium has stored thereon a data attribute padding program, and the data attribute padding program, when executed by a processor, implements the steps of the data attribute padding method as described in any one of the above embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM), and includes instructions for causing a terminal (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The present invention is described in connection with the accompanying drawings, but the present invention is not limited to the above embodiments, which are only illustrative and not restrictive, and those skilled in the art can make various changes without departing from the spirit and scope of the invention as defined by the appended claims, and all changes that come within the meaning and range of equivalency of the specification and drawings that are obvious from the description and the attached claims are intended to be embraced therein.

Claims (10)

1. A data attribute padding method, characterized by comprising the steps of:
predicting initial problem data to be responded by a preset model set to obtain a target prediction result;
acquiring a knowledge owner to which the initial question data to be responded belongs based on the target prediction result, and determining a knowledge base corresponding to the initial question data to be responded according to the knowledge owner;
calculating the comprehensive similarity between the initial to-be-responded question data and the historical question data in the knowledge base;
judging whether the similarity is greater than or equal to a first preset threshold value or not;
if the similarity is greater than or equal to a first preset threshold, inputting the initial problem data to be responded into preset nodes of a graph G (V, E), determining the weight of the initial problem data to be responded according to the degree in the graph, and clustering the initial problem data to be responded based on the weight to obtain a clustering result, wherein the highest weight in the clustering result is problem data, the rest are similar problem data, V is a node set, E is an edge set, and the similar problem data is data with a similar relation to the problem data;
judging whether the matching degree of the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value or not;
and if the matching degree of the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value, determining the attributes of the clustering result, and performing attribute filling on the clustering result by adopting the attributes.
2. The data attribute population method of claim 1, wherein before said predicting the initial to-be-answered question data by the preset model set to obtain the target prediction result, further comprising the steps of:
removing punctuation marks in the first initial problem data set to be responded to through a regular expression to obtain a second initial problem data set to be responded to;
performing synonym conversion on the second initial problem data set to be responded by a preset synonym conversion mode to obtain a third initial problem data set to be responded;
and calling a library function to perform literal duplicate removal processing on the third initial problem data set to be responded to obtain a target problem data set to be responded to, wherein the target problem data set to be responded to at least comprises one initial problem data to be responded to.
3. The data attribute population method of claim 2 wherein the calling library function performs literal deduplication processing on the third initial question-to-answer data set to obtain a target question-to-answer data set comprising the steps of:
sequencing each third initial problem data to be responded in the third initial problem data sets to be responded according to the sentence length by calling a quick sequencing algorithm in a library function to obtain sequenced third initial problem data sets to be responded;
traversing the sorted third initial problem data set to be responded, and clearing repeated words to obtain a target problem data set to be responded.
4. The data attribute population method of claim 2 wherein the synonym transformation of the second initial question-to-answer data set by a preset synonym transformation to obtain a third initial question-to-answer data set comprises the steps of:
performing word segmentation on the second initial question data set to be responded to obtain word segmentation data;
acquiring a feature vector of the word segmentation data, and calculating cosine included angle values of the feature vector and feature vectors of all words in a preset word bank;
judging whether the cosine included angle value is smaller than a preset included angle value or not;
if the cosine included angle value is smaller than a preset included angle value, obtaining synonymous data of each word in the preset word bank, and forming the synonymous data into a third initial problem data set to be responded;
if the cosine included angle value is larger than or equal to the preset included angle value, the step of judging whether the cosine included angle value is smaller than the preset included angle value is continuously executed until the cosine included angle value meets the preset included angle value.
5. The data attribute population method of claim 1 wherein predicting the initial to-be-answered problem data by a preset model set to obtain a target prediction result comprises the steps of:
predicting the initial question data to be responded through a language representation bert model in a preset model set, and judging whether the initial question data to be responded belongs to an effective type;
if the initial question data to be responded belong to the effective type, obtaining an effective type prediction result;
predicting the initial question data to be responded through a text classification textcnn model in a preset model set, and judging whether the initial question data to be responded belongs to a chatting type;
if the initial question data to be responded belongs to the chatting type, obtaining a chatting type prediction result;
and combining the effective class prediction result and the chatting class prediction result to obtain a target prediction result.
6. The method according to claim 1, wherein if the matching degree between the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold, determining the attribute of the clustering result, and performing attribute filling on the clustering result by using the attribute comprises:
if the matching degree of the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value, acquiring an attribute set of the clustering result based on a preset mapping relation between the attributes of the historical problem data and the attributes of the clustering result, wherein the attribute set of the clustering result comprises at least one attribute of the clustering result;
and mining a frequent item set in the attribute set of the clustering result, and determining the attribute of the clustering result based on the frequent item set.
7. The data attribute population method of any one of claims 1-6 wherein the calculating of the integrated similarity between the initial to-be-answered question data and the historical question data in the knowledge base comprises the steps of:
calculating the literal similarity between the initial question data to be responded and the historical question data in the knowledge base through the word frequency-inverse file frequency TF-IDF;
calculating semantic similarity between initial problem data to be responded and historical problem data in the knowledge base through a twin network;
and respectively carrying out priority sequencing on the literal similarity and the semantic similarity according to the similarity numerical value to obtain comprehensive similarity.
8. A data attribute populating apparatus, characterized in that the data attribute populating apparatus comprises the following modules:
the prediction module is used for predicting the initial problem data to be responded through a preset model set to obtain a target prediction result;
the classification module is used for acquiring a knowledge owner to which the initial to-be-responded question data belongs based on the target prediction result, and determining a knowledge base corresponding to the initial to-be-responded question data according to the knowledge owner;
the identification module is used for calculating the comprehensive similarity between the initial to-be-responded question data and the historical question data in the knowledge base;
the similarity judging module is used for judging whether the similarity is greater than or equal to a first preset threshold value or not;
the clustering module is used for inputting the initial problem data to be responded into preset nodes of a graph G (V, E) if the similarity is greater than or equal to a first preset threshold, determining the weight of the initial problem data to be responded according to the degree in the graph, and clustering the initial problem data to be responded based on the weight to obtain a clustering result, wherein the highest weight in the clustering result is problem data, the rest are similar problem data, V is a node set, E is an edge set, and the similar problem data is data with a similar relation to the problem data;
the matching degree judging module is used for judging whether the matching degree of the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value or not;
and the filling module is used for determining the attribute of the clustering result and filling the attribute of the clustering result by adopting the attribute if the matching degree of the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold.
9. A data property filling apparatus, characterized in that the data property filling apparatus comprises a memory, a processor and a data property filling program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the data property filling method according to any of claims 1-7.
10. A computer-readable storage medium, having stored thereon a data property filling program which, when executed by a processor, implements the steps of the data property filling method of any one of claims 1-7.
CN202010088080.6A 2020-02-12 2020-02-12 Data attribute filling method, device, equipment and computer readable storage medium Pending CN111339248A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010088080.6A CN111339248A (en) 2020-02-12 2020-02-12 Data attribute filling method, device, equipment and computer readable storage medium
PCT/CN2020/098768 WO2021159655A1 (en) 2020-02-12 2020-06-29 Data attribute filling method, apparatus and device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010088080.6A CN111339248A (en) 2020-02-12 2020-02-12 Data attribute filling method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111339248A true CN111339248A (en) 2020-06-26

Family

ID=71182154

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010088080.6A Pending CN111339248A (en) 2020-02-12 2020-02-12 Data attribute filling method, device, equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN111339248A (en)
WO (1) WO2021159655A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541054A (en) * 2020-12-15 2021-03-23 平安科技(深圳)有限公司 Method, device, equipment and storage medium for governing questions and answers of knowledge base
CN113204974A (en) * 2021-05-14 2021-08-03 清华大学 Method, device and equipment for generating confrontation text and storage medium
CN113239697A (en) * 2021-06-01 2021-08-10 平安科技(深圳)有限公司 Entity recognition model training method and device, computer equipment and storage medium
WO2021159655A1 (en) * 2020-02-12 2021-08-19 平安科技(深圳)有限公司 Data attribute filling method, apparatus and device, and computer-readable storage medium
CN113761178A (en) * 2021-08-11 2021-12-07 北京三快在线科技有限公司 Data display method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133866A (en) * 2014-07-18 2014-11-05 国家电网公司 Intelligent-power-grid-oriented missing data filling method
US10394956B2 (en) * 2015-12-31 2019-08-27 Shanghai Xiaoi Robot Technology Co., Ltd. Methods, devices, and systems for constructing intelligent knowledge base
CN106844781B (en) * 2017-03-10 2020-04-21 广州视源电子科技股份有限公司 Data processing method and device
CN108932301B (en) * 2018-06-11 2021-04-27 天津科技大学 Data filling method and device
CN110674621B (en) * 2018-07-03 2024-06-18 北京京东尚科信息技术有限公司 Attribute information filling method and device
CN109460775B (en) * 2018-09-20 2020-09-11 国家计算机网络与信息安全管理中心 Data filling method and device based on information entropy
CN110287179A (en) * 2019-06-25 2019-09-27 广东工业大学 Apparatus, device and method for filling missing data attribute values
CN111339248A (en) * 2020-02-12 2020-06-26 平安科技(深圳)有限公司 Data attribute filling method, device, equipment and computer readable storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021159655A1 (en) * 2020-02-12 2021-08-19 平安科技(深圳)有限公司 Data attribute filling method, apparatus and device, and computer-readable storage medium
CN112541054A (en) * 2020-12-15 2021-03-23 平安科技(深圳)有限公司 Method, device, equipment and storage medium for governing questions and answers of knowledge base
CN112541054B (en) * 2020-12-15 2023-08-29 平安科技(深圳)有限公司 Knowledge base question and answer management method, device, equipment and storage medium
CN113204974A (en) * 2021-05-14 2021-08-03 清华大学 Method, device and equipment for generating adversarial text and storage medium
CN113239697A (en) * 2021-06-01 2021-08-10 平安科技(深圳)有限公司 Entity recognition model training method and device, computer equipment and storage medium
CN113761178A (en) * 2021-08-11 2021-12-07 北京三快在线科技有限公司 Data display method and device

Also Published As

Publication number Publication date
WO2021159655A1 (en) 2021-08-19

Similar Documents

Publication Publication Date Title
CN108647205B (en) Fine-grained emotion analysis model construction method and device and readable storage medium
CN111339248A (en) Data attribute filling method, device, equipment and computer readable storage medium
WO2021017721A1 (en) Intelligent question answering method and apparatus, medium and electronic device
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
CN104199965B (en) Semantic information retrieval method
CN105912716B (en) Short text classification method and device
CN110019732B (en) Intelligent question answering method and related device
CN111797210A (en) Information recommendation method, device and equipment based on user portrait and storage medium
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
WO2020232898A1 (en) Text classification method and apparatus, electronic device and computer non-volatile readable storage medium
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN110210038B (en) Core entity determining method, system, server and computer readable medium thereof
WO2024098623A1 (en) Cross-media retrieval method and apparatus, cross-media retrieval model training method and apparatus, device, and recipe retrieval system
WO2023065642A1 (en) Corpus screening method, intention recognition model optimization method, device, and storage medium
US20200004786A1 (en) Corpus generating method and apparatus, and human-machine interaction processing method and apparatus
WO2018213783A1 (en) Computerized methods of data compression and analysis
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN113935314A (en) Abstract extraction method, device, terminal equipment and medium based on heteromorphic graph network
CN115795030A (en) Text classification method and device, computer equipment and storage medium
CN110781673B (en) Document acceptance method and device, computer equipment and storage medium
KR102560521B1 (en) Method and apparatus for generating knowledge graph
CN106407332B (en) Search method and device based on artificial intelligence
JP2019148933A (en) Summary evaluation device, method, program, and storage medium
CN112925912A (en) Text processing method, and synonymous text recall method and device
CN110442696B (en) Query processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination