CN111339248A - Data attribute filling method, device, equipment and computer readable storage medium - Google Patents


Info

Publication number
CN111339248A
Authority
CN
China
Prior art keywords
data; responded; initial; attribute; problem data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010088080.6A
Other languages
Chinese (zh)
Inventor
张智
莫洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010088080.6A priority Critical patent/CN111339248A/en
Publication of CN111339248A publication Critical patent/CN111339248A/en
Priority to PCT/CN2020/098768 priority patent/WO2021159655A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and discloses a data attribute filling method, which comprises the following steps: acquiring the knowledge owner to which the initial problem data to be responded belongs based on a target prediction result, and determining the knowledge base corresponding to the initial problem data to be responded according to the knowledge owner; calculating the comprehensive similarity between the initial problem data to be responded and the historical problem data in the knowledge base; if the comprehensive similarity is greater than or equal to a first preset threshold, inputting the initial problem data to be responded into the preset nodes of a graph G = (V, E) to obtain a clustering result; and if the matching degree between the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold, determining the attributes of the clustering result and performing attribute filling on the clustering result with those attributes. The invention also discloses a data attribute filling apparatus, a device, and a computer-readable storage medium. The data attribute filling method provided by the invention improves the efficiency of data attribute filling.

Description

Data attribute filling method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a data attribute filling method, apparatus, device, and computer-readable storage medium.
Background
At present, data attribute filling generally relies on literal-similarity clustering, which is unsuitable for large-scale conversation-log mining scenarios with multiple knowledge owners: the corresponding knowledge attributes cannot be supplemented in batches in a single operation, and attributes must instead be set manually for the questions raised by users, which is time-consuming, error-prone, and inefficient. How to realize efficient automatic filling of data attributes in a big-data scenario is therefore a technical problem urgently to be solved in this field.
Disclosure of Invention
The invention mainly aims to provide a data attribute filling method, a data attribute filling device, data attribute filling equipment and a computer readable storage medium, and aims to solve the technical problem of low data attribute filling efficiency.
In order to achieve the above object, the present invention provides a data attribute filling method, including the following steps:
predicting initial problem data to be responded by a preset model set to obtain a target prediction result;
acquiring a knowledge owner to which the initial question data to be responded belongs based on the target prediction result, and determining a knowledge base corresponding to the initial question data to be responded according to the knowledge owner;
calculating the comprehensive similarity between the initial to-be-responded question data and the historical question data in the knowledge base;
judging whether the comprehensive similarity is greater than or equal to a first preset threshold value;
if the comprehensive similarity is greater than or equal to the first preset threshold, inputting the initial problem data to be responded into each preset node of a graph G = (V, E), determining the weight of the initial problem data to be responded according to its degree in the graph, and clustering the initial problem data to be responded based on the weight to obtain a clustering result, wherein the item with the highest weight in the clustering result is the problem data and the rest are similar problem data, V is the node set, E is the edge set, and the similar problem data are data having a similarity relation to the problem data;
judging whether the matching degree of the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value or not;
and if the matching degree of the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value, determining the attributes of the clustering result, and performing attribute filling on the clustering result by adopting the attributes.
Optionally, before the predicting the initial to-be-responded question data by the preset model set to obtain the target prediction result, the method further includes the following steps:
removing punctuation marks in the first initial problem data set to be responded to through a regular expression to obtain a second initial problem data set to be responded to;
performing synonym conversion on the second initial problem data set to be responded by a preset synonym conversion mode to obtain a third initial problem data set to be responded;
and calling a library function to perform literal duplicate removal processing on the third initial problem data set to be responded to obtain a target problem data set to be responded to, wherein the target problem data set to be responded to at least comprises one initial problem data to be responded to.
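The punctuation-removal step above can be sketched as follows; the regular expression, the helper name `strip_punctuation`, and the sample questions are illustrative assumptions, since the patent does not give its actual pattern.

```python
import re

# Illustrative punctuation class: ASCII punctuation plus the CJK-symbol and
# full-width blocks commonly used for Chinese punctuation (an assumption --
# the patent does not state its actual regular expression).
_PUNCT_RE = re.compile(r"[\u3000-\u303f\uff00-\uffef!-/:-@\[-`{-~]")

def strip_punctuation(questions):
    """Remove punctuation from each question via the regular expression."""
    return [_PUNCT_RE.sub("", q) for q in questions]

first_set = ["How do I renew my policy?", "What is the insured amount??"]
second_set = strip_punctuation(first_set)  # punctuation-free question set
```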
Optionally, the invoking of a library function to perform literal deduplication processing on the third initial problem data set to be responded to obtain a target problem data set to be responded, where the target problem data set to be responded includes at least one piece of initial problem data to be responded, includes the following steps:
sequencing each third initial problem data to be responded in the third initial problem data sets to be responded according to the sentence length by calling a quick sequencing algorithm in a library function to obtain sequenced third initial problem data sets to be responded;
traversing the sorted third initial problem data set to be responded, and clearing repeated words to obtain a target problem data set to be responded.
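The length-sort-plus-deduplication step above could be sketched like this. Python's built-in `sorted` (Timsort) stands in for the quicksort named in the text, and reading "clearing repeated words" as dropping literal duplicate questions is an interpretive assumption.

```python
def literal_deduplicate(questions):
    """Sort questions by sentence length, then drop literal duplicates in one pass."""
    # The built-in sort (Timsort) stands in for the library quicksort in the text.
    ordered = sorted(questions, key=len)
    seen, result = set(), []
    for q in ordered:          # traverse the sorted set
        if q not in seen:      # clear repeated entries
            seen.add(q)
            result.append(q)
    return result
```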
Optionally, the synonym conversion is performed on the second initial problem data set to be responded to through a preset synonym conversion mode to obtain a third initial problem data set to be responded to, including the following steps:
performing word segmentation on the second initial question data set to be responded to obtain word segmentation data;
acquiring a feature vector of the word segmentation data, and calculating cosine included angle values of the feature vector and feature vectors of all words in a preset word bank;
judging whether the cosine included angle value is smaller than a preset included angle value or not;
if the cosine included angle value is smaller than a preset included angle value, obtaining synonymous data of each word in the preset word bank, and forming the synonymous data into a third initial problem data set to be responded;
if the cosine included angle value is greater than or equal to the preset included angle value, continuing to execute the step of judging whether the cosine included angle value is smaller than the preset included angle value until the cosine included angle value satisfies the preset included angle value.
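The cosine-angle comparison above can be illustrated with plain vectors. The threshold value, the toy two-dimensional lexicon, and the function names are assumptions for illustration only; real word embeddings are much higher-dimensional.

```python
import math

def cosine_angle(u, v):
    """Angle in radians between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))  # clamp for float safety

def find_synonyms(word_vec, lexicon, max_angle=0.3):
    """Return lexicon words whose vectors lie within max_angle of word_vec."""
    return [w for w, vec in lexicon.items() if cosine_angle(word_vec, vec) < max_angle]

# Toy two-dimensional "feature vectors" standing in for the preset word bank.
lexicon = {"insure": [0.9, 0.1], "underwrite": [0.88, 0.15], "weather": [0.1, 0.95]}
```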
Optionally, the predicting the initial problem data to be responded to through a preset model set to obtain a target prediction result includes the following steps:
predicting the initial question data to be responded through a language representation bert model in a preset model set, and judging whether the initial question data to be responded belongs to an effective type;
if the initial question data to be responded belong to the effective type, obtaining an effective type prediction result;
predicting the initial question data to be responded through a text classification textcnn model in a preset model set, and judging whether the initial question data to be responded belongs to a chatting type;
if the initial question data to be responded belongs to the chatting type, obtaining a chatting type prediction result;
and combining the effective class prediction result and the chatting class prediction result to obtain a target prediction result.
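A minimal sketch of combining the two binary predictions into a target prediction result; the lambda stand-ins are toy assumptions, not the trained BERT and TextCNN models the text describes.

```python
def predict_target(question, is_valid, is_chat):
    """Combine a validity prediction and a chatting prediction into one result."""
    return {"valid": is_valid(question), "chat": is_chat(question)}

# Toy stand-ins for the trained classifiers (assumptions, not real models):
# a BERT model would judge validity and a TextCNN model would judge chatting.
is_valid = lambda q: len(q.strip()) > 3
is_chat = lambda q: q.lower().startswith(("hi", "hello"))

result = predict_target("How do I file a claim", is_valid, is_chat)
```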
Optionally, if the matching degree between the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold, determining the attribute of the clustering result, and performing attribute filling on the clustering result by using the attribute includes:
if the matching degree of the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value, acquiring an attribute set of the clustering result based on a preset mapping relation between the attributes of the historical problem data and the attributes of the clustering result, wherein the attribute set of the clustering result comprises at least one attribute of the clustering result;
and mining a frequent item set in the attribute set of the clustering result, and determining the attribute of the clustering result based on the frequent item set.
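Frequent-itemset mining over the clusters' attribute sets could be sketched with simple support counting (an Apriori-style single pass without candidate pruning); the `min_support` value, the attribute names, and the function itself are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(attribute_sets, min_support=2, max_size=2):
    """Count attribute itemsets across clusters; keep those meeting min_support."""
    counts = Counter()
    for attrs in attribute_sets:
        items = sorted(set(attrs))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}

# Toy attribute sets gathered from three clustering results.
attribute_sets = [
    {"policy_no", "amount"},
    {"policy_no", "applicant"},
    {"policy_no", "amount"},
]
frequent = frequent_itemsets(attribute_sets)
```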
Optionally, the calculating the comprehensive similarity between the initial to-be-answered question data and the historical question data in the knowledge base includes the following steps:
calculating the literal similarity between the initial problem data to be responded and the historical problem data in the knowledge base through term frequency-inverse document frequency (TF-IDF);
calculating the semantic similarity between the initial problem data to be responded and the historical problem data in the knowledge base through a siamese (twin) network;
and ranking the literal similarity and the semantic similarity by similarity value to obtain the comprehensive similarity.
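The literal-similarity side of the calculation above (TF-IDF plus cosine) can be sketched as follows; the siamese-network semantic side and the final priority ranking are omitted, and the whitespace tokenization and idf smoothing are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors for whitespace-tokenized documents (smoothed idf)."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    vecs = []
    for doc in tokenized:
        tf = Counter(doc)
        vecs.append({t: (tf[t] / len(doc)) * math.log((1 + n) / (1 + df[t]))
                     for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["renew policy amount", "renew policy", "weather today"]
vecs = tfidf_vectors(docs)
```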
Further, in order to achieve the above object, the present invention further provides a data attribute filling apparatus, including the following modules:
the prediction module is used for predicting the initial problem data to be responded through a preset model set to obtain a target prediction result;
the classification module is used for acquiring a knowledge owner to which the initial to-be-responded question data belongs based on the target prediction result, and determining a knowledge base corresponding to the initial to-be-responded question data according to the knowledge owner;
the identification module is used for calculating the comprehensive similarity between the initial to-be-responded question data and the historical question data in the knowledge base;
the similarity judging module is used for judging whether the similarity is greater than or equal to a first preset threshold value or not;
the clustering module is used for inputting the initial problem data to be responded into each preset node of a graph G = (V, E) if the comprehensive similarity is greater than or equal to the first preset threshold, determining the weight of the initial problem data to be responded according to its degree in the graph, and clustering the initial problem data to be responded based on the weight to obtain a clustering result, wherein the item with the highest weight in the clustering result is the problem data and the rest are similar problem data, V is the node set, E is the edge set, and the similar problem data are data having a similarity relation to the problem data;
the matching degree judging module is used for judging whether the matching degree of the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value or not;
and the filling module is used for determining the attribute of the clustering result and filling the attribute of the clustering result by adopting the attribute if the matching degree of the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold.
Optionally, the data attribute filling apparatus further includes the following modules:
the punctuation mark clearing module is used for removing punctuation marks in the first initial problem data set to be responded through the regular expression to obtain a second initial problem data set to be responded;
the synonym conversion module is used for carrying out synonym conversion on the second initial problem data set to be responded through a preset synonym conversion mode to obtain a third initial problem data set to be responded;
and the literal duplicate removal module is used for calling a library function to perform literal duplicate removal processing on the third initial problem data set to be responded to obtain a target problem data set to be responded to, wherein the target problem data set to be responded to at least comprises one initial problem data to be responded to.
Optionally, the literal deduplication module includes the following units:
the sorting unit is used for sorting each third initial problem data to be responded in the third initial problem data set to be responded by sentence length by calling a quicksort algorithm in the library function, to obtain the sorted third initial problem data set to be responded;
and the literal deduplication unit is used for traversing the sorted third initial problem data set to be responded and removing repeated words to obtain the target problem data set to be responded.
Optionally, the synonym transformation module includes the following elements:
the word segmentation unit is used for performing word segmentation on the second initial question data set to be responded to obtain word segmentation data;
the cosine included angle value calculating unit is used for acquiring the characteristic vector of the word segmentation data and calculating cosine included angle values of the characteristic vector and the characteristic vector of each word in a preset word bank;
the cosine included angle value judging unit is used for judging whether the cosine included angle value is smaller than a preset included angle value or not;
a synonymous data obtaining unit, configured to obtain synonymous data of each word in the preset lexicon if the cosine included angle value is smaller than a preset included angle value, and form the synonymous data into a third initial problem data set to be answered; if the cosine included angle value is larger than or equal to the preset included angle value, the step of judging whether the cosine included angle value is smaller than the preset included angle value is continuously executed until the cosine included angle value meets the preset included angle value.
Optionally, the prediction module comprises the following units:
the effective type prediction unit is used for predicting the initial question data to be answered through a language representation bert model in a preset model set and judging whether the initial question data to be answered belongs to an effective type;
an effective class prediction result obtaining unit, configured to obtain an effective class prediction result if the initial data of the problem to be responded belongs to an effective class;
the chatting type prediction unit is used for predicting the initial question data to be responded through a text classification textcnn model in a preset model set and judging whether the initial question data to be responded belongs to a chatting type;
a chatting type prediction result obtaining unit, configured to obtain a chatting type prediction result if the initial question data to be responded belongs to a chatting type;
and the prediction result combination unit is used for combining the effective class prediction result and the chatting class prediction result to obtain a target prediction result.
Optionally, the filling module comprises:
the attribute set acquisition unit of the clustering result is used for acquiring an attribute set of the clustering result based on a preset mapping relation between the attributes of the historical problem data and the attributes of the clustering result if the matching degree of the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value, wherein the attribute set of the clustering result comprises at least one attribute of the clustering result;
and the frequent item set mining unit is used for mining frequent item sets in the attribute sets of the clustering results and determining the attributes of the clustering results based on the frequent item sets.
Optionally, the identification module comprises the following units:
the literal similarity calculation unit is used for calculating the literal similarity between the initial problem data to be responded and the historical problem data in the knowledge base through term frequency-inverse document frequency (TF-IDF);
the semantic similarity calculation unit is used for calculating the semantic similarity between the initial problem data to be responded and the historical problem data in the knowledge base through a siamese (twin) network;
and the similarity obtaining unit is used for ranking the literal similarity and the semantic similarity by similarity value to obtain the comprehensive similarity.
Further, to achieve the above object, the present invention also provides a data attribute filling device, comprising a memory, a processor, and a data attribute filling program stored on the memory and executable on the processor, where the data attribute filling program, when executed by the processor, implements the steps of the data attribute filling method according to any one of the above.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium having a data attribute filling program stored thereon, which, when executed by a processor, implements the steps of the data attribute filling method according to any one of the above.
Clustering a problem data set through a graph separates the problem data from similar problem data; the problem data and the similar problem data share the same attributes, each attribute corresponds to a knowledge base, and different data are stored in the corresponding knowledge base according to their attributes. A language-representation BERT model then predicts whether an unanswered question is a valid sentence, and a binary classification model trained with a text-classification TextCNN model identifies whether it is chatting. For the valid, non-chatting part, the literal similarity and the semantic similarity are calculated through term frequency-inverse document frequency (TF-IDF) and a siamese network respectively, the problem data satisfying the similarity are clustered, and the attribute with the highest matching degree to the clustering result is selected from the knowledge base, thereby achieving fast attribute filling of the data.
Drawings
FIG. 1 is a schematic diagram of an environment in which data attribute populating devices operate according to embodiments of the present invention;
FIG. 2 is a flowchart illustrating a data attribute filling method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a data attribute filling method according to a second embodiment of the present invention;
FIG. 4 is a detailed flowchart of one embodiment of step S103 in FIG. 3;
FIG. 5 is a detailed flowchart of one embodiment of step S102 in FIG. 3;
FIG. 6 is a detailed flowchart of one embodiment of step S10 in FIG. 2;
FIG. 7 is a detailed flowchart of one embodiment of step S70 in FIG. 2;
FIG. 8 is a detailed flowchart of one embodiment of step S30 in FIG. 2;
FIG. 9 is a functional block diagram of an embodiment of a data attribute populating apparatus according to the invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a data attribute filling device.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an operating environment of a data attribute populating device according to an embodiment of the present invention.
As shown in fig. 1, the data attribute populating apparatus includes: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display (Display), an input unit such as a Keyboard (Keyboard), and the network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the hardware configuration of the data property populating device shown in FIG. 1 does not constitute a limitation of the data property populating device and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer-readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a data attribute populating program. The operating system is a program that manages and controls the data property filling apparatus and software resources, and supports the operation of the data property filling program as well as other software and/or programs.
In the hardware structure of the data attribute populating device shown in fig. 1, the network interface 1004 is primarily used to access a network; the user interface 1003 is mainly used for detecting a confirmation instruction, an editing instruction, and the like. And processor 1001 may be configured to invoke the data property filling program stored in memory 1005 and perform the operations of the various embodiments of the data property filling method below.
Based on the above hardware structure of the data attribute filling device, embodiments of the data attribute filling method of the present invention are provided.
Referring to fig. 2, fig. 2 is a flowchart illustrating a data attribute filling method according to a first embodiment of the present invention. In this embodiment, the data attribute filling method includes the following steps:
step S10, predicting the initial question data to be responded by a preset model set to obtain a target prediction result;
in this embodiment, pre-trained prediction models in a preset model set are used to predict the initial problem data to be responded and obtain a prediction result. For example, the preset model set may include a language-representation BERT model and a text-classification TextCNN model; different models are then applied to the initial data to be responded, and the resulting prediction may indicate that the initial data to be responded belongs to the chatting class or to the valid class.
Step S20, acquiring the knowledge owner of the initial to-be-responded question data based on the target prediction result, and determining a knowledge base corresponding to the initial to-be-responded question data according to the knowledge owner;
in this embodiment, the properties and relationships of an object are referred to as the attributes of the object; for example, an insured amount, a policy number, and an applicant may all be classified under "insurance". The knowledge owner to which the initial data to be responded belongs refers to the predicted classification of that data. Knowledge bases with different classifications have been set up in advance, and a mapping relationship exists between each knowledge base and the initial data to be responded with different knowledge owners; therefore, once the knowledge owner to which the initial data to be responded belongs is obtained, the initial data to be responded can be dispatched to the corresponding knowledge base according to the mapping relationship.
Step S30, calculating the comprehensive similarity between the initial question data to be answered and the historical question data in the knowledge base;
in this embodiment, after the initial problem data to be responded with different knowledge owners are dispatched to the corresponding knowledge bases, the similarity between the dispatched initial problem data and the historical problem data must be calculated. The purpose is to acquire other data having an approximate relationship with the current initial problem data to be responded, which may include literal similarity. For example, if the current initial problem data to be responded contains "insurance" several times, and a piece of historical problem data in the corresponding knowledge base also contains "insurance" several times, a certain similarity exists between the two pieces of data. A preset similarity calculation method, such as term frequency-inverse document frequency, may be used for this calculation.
Step S40, judging whether the similarity is larger than or equal to a first preset threshold value;
in this embodiment, because there may be a plurality of pieces of historical problem data similar to the current initial problem data to be answered in the knowledge base, and the plurality of pieces of historical problem data do not always satisfy the preset similarity, the first preset threshold is preset, and the value of the first preset threshold is not limited, for example, may be 90%.
Step S50, if the comprehensive similarity is greater than or equal to the first preset threshold, inputting the initial problem data to be responded into each preset node of a graph G = (V, E), determining the weight of the initial problem data to be responded according to its degree in the graph, and clustering the initial problem data to be responded based on the weight to obtain a clustering result, wherein the item with the highest weight in the clustering result is the problem data and the rest are similar problem data, V is the node set, E is the edge set, and the similar problem data are data having a similarity relation to the problem data;
in this embodiment, a graph consists of a finite, non-empty set of vertices and a set of edges between those vertices, and may be represented as G = (V, E), where V is the node set and E is the edge set. Here, each point is a piece of initial problem data to be responded and each edge represents the similarity between two such pieces; the point with the highest degree, i.e., the most central point, is taken as the representative, namely the problem data, where the degree serves as the weight of each point.
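The degree-based selection of a representative node described above can be sketched on a toy graph; the edge list and question names are illustrative assumptions.

```python
from collections import defaultdict

def cluster_by_degree(edges):
    """Build G = (V, E) from an edge list; the highest-degree node represents the cluster."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degree = {node: len(nbrs) for node, nbrs in adj.items()}   # degree as weight
    representative = max(degree, key=degree.get)               # the problem data
    similar = sorted(n for n in degree if n != representative) # similar problem data
    return representative, similar

# Each node is a question; each edge links a pair that met the similarity threshold.
edges = [("q1", "q2"), ("q1", "q3"), ("q1", "q4"), ("q2", "q3")]
representative, similar = cluster_by_degree(edges)
```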
Step S60, judging whether the matching degree of the attribute of the historical problem data in the knowledge base and the clustering result is larger than or equal to a second preset threshold value;
in this embodiment, the attributes of the clustering result and the historical problem data may be in a one-to-one mapping relationship or in a one-to-many mapping relationship, and these mapping relationships are set in advance.
And step S70, if the matching degree of the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold, determining the attributes of the clustering result, and filling the attributes of the clustering result with the attributes.
In this embodiment, each knowledge owner corresponds to one knowledge base. A knowledge base holds multiple pieces of historical problem data with different attributes. When the matching degree between the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to the second preset threshold, the attributes of those historical problems can be filled into the clustering result. One specific filling approach is to build an attribute table to be filled in advance and, whenever the matching degree reaches the second preset threshold, map the corresponding attributes into that table.
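The table-filling approach above can be sketched as below. All names here (`fill_attributes`, the 0.8 default threshold, the dictionary-as-table shape) are illustrative assumptions, not taken from the original text.

```python
# Sketch of step S70's filling approach: a pre-built attribute table keyed
# by cluster, filled only when the match score reaches the second threshold.
def fill_attributes(cluster_id, candidate_attrs, match_score,
                    attribute_table, threshold=0.8):
    """candidate_attrs: attributes of matched historical problem data."""
    if match_score >= threshold:
        attribute_table.setdefault(cluster_id, []).extend(candidate_attrs)
    return attribute_table
```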
In summary, the problem data set is first clustered through the graph to separate problem data from similar problem data; the two share the same attributes, each attribute corresponds to a knowledge base, and different data are stored into the corresponding knowledge base according to their attributes. The attributes with the highest matching degree to the clustering result are then selected from the knowledge base, thereby filling in the problem attributes.
Referring to fig. 3, fig. 3 is a flowchart illustrating a data attribute filling method according to a second embodiment of the present invention. In this embodiment, before predicting the initial to-be-answered problem data through the preset model set in step S10 to obtain a target prediction result, the data attribute filling method includes the following steps:
step S80, removing punctuation marks in the first initial problem data set to be responded to through a regular expression to obtain a second initial problem data set to be responded to;
In this embodiment, punctuation marks in the problem data are removed with a regular expression, yielding punctuation-free problem data.
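A minimal sketch of step S80 follows. The character class is an illustrative assumption covering common Chinese and ASCII punctuation; the patent does not specify the exact expression.

```python
import re

# Sketch of step S80: strip punctuation with a regular expression.
# The character class below is illustrative, not the patented pattern.
PUNCT = re.compile(r"[，。！？、；：,.!?;:\"'（）()\[\]【】…-]")

def strip_punctuation(questions):
    """Remove punctuation from each question string."""
    return [PUNCT.sub("", q) for q in questions]
```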
Step S90, synonym conversion is carried out on the second initial problem data set to be responded through a preset synonym conversion mode, and a third initial problem data set to be responded is obtained;
In this embodiment, variant characters or words are found by string search and then replaced, much like a dictionary lookup, e.g. mapping different written forms of the insurance product name "e Sheng Bao" to a single canonical form. This mainly implements speech-recognition error correction for insurance product names and unifies how a product is described.
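The dictionary-style replacement can be sketched as below. The mapping shown is a hypothetical example; in practice the synonym table would be preconfigured from the product vocabulary.

```python
# Sketch of step S90's string search-and-replace (the table is hypothetical).
SYNONYMS = {"e shengbao": "e Sheng Bao", "e shenbao": "e Sheng Bao"}

def unify_synonyms(question, table=SYNONYMS):
    """Replace every variant phrase with its canonical form."""
    for variant, canonical in table.items():
        question = question.replace(variant, canonical)
    return question
```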
And S100, calling a library function to perform literal duplication elimination on the third initial question data set to be responded to obtain a target question data set to be responded to, wherein the target question data set to be responded to at least comprises one initial question data to be responded to.
In this embodiment, deduplication is performed through a library function, yielding literally deduplicated problem data. A library function is simply a commonly used function compiled into a library file so that it can be called wherever needed.
The third initial problem data set to be responded is traversed in order to check whether problem data with identical wording exist; even after repeated words within a single question are removed, other questions with exactly the same wording may remain. If literally identical problem data exist, only one copy is kept, yielding the target problem data set to be responded. Keeping a single copy avoids duplicate data, i.e. every initial problem data to be responded in the target set is unique.
Referring to fig. 4, fig. 4 is a detailed flowchart of an embodiment of step S100 in fig. 3. In this embodiment, step S100, in which a library function is called to perform literal deduplication on the third initial question data set to be answered to obtain a target question data set to be answered (the target set contains at least one initial question data to be answered), includes the following steps:
step S1001, sorting each third initial problem data to be responded in a third initial problem data set to be responded according to the sentence length by calling a quick sorting algorithm in a library function to obtain a sorted third initial problem data set to be responded;
In this embodiment, the data to be sorted are partitioned into two independent parts such that every sentence length in one part is smaller than every sentence length in the other; each part is then quick-sorted in the same way. The whole process proceeds recursively until the data form an ordered sequence.
And step S1002, traversing the sorted third initial problem data set to be responded, and clearing repeated words to obtain a target problem data set to be responded.
In this embodiment, the two sorted parts can be traversed simultaneously, so repeats are identified promptly; any repeats found are removed, yielding literally deduplicated problem data, i.e. the initial problem data to be responded.
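Steps S1001 and S1002 can be sketched as follows. This sketch uses the standard library's sort in place of a hand-written quicksort, under the assumption that any length-ordered sort satisfies step S1001.

```python
# Sketch of steps S1001-S1002: order by sentence length, then keep only
# the first occurrence of each literal string.
def deduplicate(questions):
    ordered = sorted(questions, key=len)  # step S1001: sort by sentence length
    seen, target = set(), []
    for q in ordered:                     # step S1002: drop literal repeats
        if q not in seen:
            seen.add(q)
            target.append(q)
    return target
```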
Referring to fig. 5, fig. 5 is a schematic view of a detailed flow of an embodiment of step S90 in fig. 3. In this embodiment, step S90, performing synonym transformation on the second initial question data set to be answered in a preset synonym transformation manner to obtain a third initial question data set to be answered, includes the following steps:
step S901, performing word segmentation on the second initial question data set to be responded to obtain word segmentation data;
In this embodiment, the punctuation-free problem data can be segmented into words, yielding word segmentation data.
Step S902, acquiring a feature vector of the word segmentation data, and calculating the cosine angle between this feature vector and the feature vector of each word in a preset word bank;
In this embodiment, the word segmentation data are converted into feature vectors, and the angle between feature vectors is computed with the cosine formula; the smaller the angle, the more similar the vectors.
Step S903, judging whether the cosine included angle value is smaller than a preset included angle value;
in this embodiment, in order to obtain the cosine included angle value meeting the preset condition, a preset included angle value needs to be set, for example, 20 °.
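Steps S902 and S903 can be sketched as below, assuming dense feature vectors; the 20° threshold follows the example given above.

```python
import math

# Sketch of steps S902-S903: angle between feature vectors, compared
# against the preset angle value (20 degrees in the example above).
def cosine_angle_deg(u, v):
    """Angle in degrees between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def is_synonym(u, v, max_angle=20.0):
    return cosine_angle_deg(u, v) < max_angle
```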
Step S904, if the cosine angle value is smaller than the preset angle value, obtaining the synonymous data of each such word in the preset lexicon and forming the synonymous data into a third initial problem data set to be responded; if the cosine angle value is greater than or equal to the preset angle value, returning to step S902.
In this embodiment, the constraint of the preset angle value yields the data whose angle falls below it. For values greater than or equal to the preset angle value, the cosine angles between the feature vector of the word segmentation data and the feature vectors of other words in the preset word bank must be calculated.
Referring to fig. 6, fig. 6 is a detailed flowchart of an embodiment of step S10 in fig. 2. In this embodiment, in step S10, predicting the initial to-be-answered problem data by using a preset model set to obtain a target prediction result, including the following steps:
Step S101, predicting the initial problem data to be responded through a language representation bert model in a preset model set, and judging whether the initial problem data to be responded belongs to the valid type;
In this embodiment, before the language representation bert model can recognize valid problem data, it must be trained. The initial bert model is trained on sample data whose valid and invalid types are known, until it can accurately identify the valid type among the initial problem data to be responded.
Step S102, if the initial question data to be responded belongs to the effective type, obtaining an effective type prediction result;
In this embodiment, as noted in step S101, the language representation bert model can identify whether the initial question data to be answered is of the valid type, from which a valid-type prediction result is obtained. The purpose of this identification is to single out the valid questions among all initial question data to be answered: a valid question must belong to some knowledge base. For example, in a man-machine question-answering scene about buying insurance, a question about buying fruit would be invalid data.
Step S103, predicting the initial question data to be responded through a text classification textcnn model in a preset model set, and judging whether the initial question data to be responded belongs to a chatting type;
In this embodiment, the initial text classification model likewise undergoes chatting-type prediction training until training is complete, that is, until a certain accuracy is reached, after which it can predict the type of the initial problem data to be responded.
Step S104, if the initial question data to be responded belongs to a chatting type, obtaining a chatting type prediction result;
In this embodiment, the text classification textcnn model is trained with preset chatting and non-chatting training samples, giving it this recognition capability; in an insurance-buying man-machine question-answering scene, for example, laughter or sighs can serve as chatting-type data.
And step S105, combining the effective class prediction result and the chatting class prediction result to obtain a target prediction result.
In this embodiment, the chatting-class prediction result and the valid-class prediction result together form the target prediction result.
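The combination in step S105 can be sketched as below. The two classifiers are passed in as plain callables standing in for the trained bert and textcnn models; keeping the valid, non-chatting questions matches the summary given later in the description.

```python
# Sketch of step S105: keep questions the valid-type classifier accepts
# and the chatting-type classifier rejects. is_valid / is_chat stand in
# for the trained bert and textcnn models.
def target_prediction(questions, is_valid, is_chat):
    return [q for q in questions if is_valid(q) and not is_chat(q)]
```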
Referring to fig. 7, fig. 7 is a schematic diagram illustrating a detailed flow of an embodiment of step S70 in fig. 2, where in the embodiment, in step S70, if a matching degree between an attribute of historical problem data in a knowledge base and a clustering result is greater than or equal to a second preset threshold, determining an attribute of the clustering result, and performing attribute filling on the clustering result by using the attribute includes the following steps:
step S701, if the matching degree of the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value, acquiring an attribute set of the clustering result based on a preset mapping relation between the attributes of the historical problem data and the attributes of the clustering result, wherein the attribute set of the clustering result comprises at least one attribute of the clustering result;
step S702, a frequent item set in the attribute set of the clustering result is mined, and the attribute of the clustering result is determined based on the frequent item set.
In this embodiment, attributes that appear frequently in the attribute set of the clustering result can be mined through a big data mining platform. The criterion for a frequent item set can be set in advance; for example, an attribute that appears three or more times may be treated as frequent.
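A simplified single-item version of this mining step can be sketched as follows, assuming the minimum support of three occurrences from the example above; full frequent-itemset algorithms such as Apriori would extend this to attribute combinations.

```python
from collections import Counter

# Sketch of step S702: count attribute occurrences across the attribute
# sets and keep those meeting the minimum support (3, per the example).
def frequent_attributes(attribute_sets, min_support=3):
    """Return attributes appearing in at least min_support sets."""
    counts = Counter(a for s in attribute_sets for a in s)
    return {a for a, n in counts.items() if n >= min_support}
```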
Referring to fig. 8, fig. 8 is a schematic view of a detailed flow of the step S30 in fig. 2. In this embodiment, in step S30, the step of calculating the comprehensive similarity between the initial to-be-answered question data and the historical question data in the knowledge base includes the following steps:
Step S301, calculating the literal similarity between the initial question data to be responded and the historical question data in the knowledge base through word frequency-inverse document frequency TF-IDF;
In this embodiment, word segmentation is performed with jieba and arranged into a specified format; each problem data to be compared is then converted into a sparse vector through doc2bow in the gensim library, the corpus is processed with word frequency-inverse document frequency TF-IDF, an index is built between the feature values and the sparse-matrix similarities, and finally the literal similarity between each pair of problem data to be responded is obtained.
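As a simplified, dependency-free stand-in for the jieba/gensim pipeline described above, TF-IDF weighting plus cosine similarity can be sketched directly; tokens are assumed to be pre-segmented, and the smoothing in the IDF term is an implementation choice, not from the original.

```python
import math
from collections import Counter

# Simplified stand-in for the gensim doc2bow + TF-IDF pipeline:
# TF-IDF vectors and cosine similarity over pre-segmented token lists.
def tfidf_similarity(doc_a, doc_b, corpus):
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))  # document frequency
    def vec(doc):
        tf = Counter(doc)
        return {t: tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf}
    va, vb = vec(doc_a), vec(doc_b)
    dot = sum(va[t] * vb.get(t, 0.0) for t in va)
    na = math.sqrt(sum(x * x for x in va.values()))
    nb = math.sqrt(sum(x * x for x in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```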
Step S302, calculating semantic similarity between initial problem data to be responded and historical problem data in a knowledge base through a twin network;
In this embodiment, the twin network consists of two sub-networks with identical structure and shared parameters. This model is chosen when the two sentences come from the same field and are structurally very similar; the spatial similarity between the two sentences is measured with the Manhattan distance, Euclidean distance, cosine similarity, or the like, yielding the semantic similarity.
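Given the sentence embeddings produced by the twin network's shared encoder, the Manhattan-distance scoring can be sketched as below. Mapping the distance through exp(-d) is one common convention (it places scores in (0, 1]) and is an assumption here, not specified by the original text.

```python
import math

# Sketch: score twin-network sentence embeddings by Manhattan distance,
# mapped into (0, 1] via exp(-distance).
def manhattan_similarity(emb_a, emb_b):
    distance = sum(abs(a - b) for a, b in zip(emb_a, emb_b))
    return math.exp(-distance)
```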
Step S303, the literal similarity and the semantic similarity are respectively subjected to priority ranking according to the similarity value, and comprehensive similarity is obtained.
In this embodiment, a more accurate attribute requires both the literal similarity and the semantic similarity, which are combined into the comprehensive similarity. When the similarity between the initial problem data to be responded and the historical problem data is compared against the first preset threshold, both the literal similarity and the semantic similarity must be greater than or equal to it before the initial problem data to be responded can be input into the nodes of the graph G = (V, E).
In summary, the problem data set is first clustered through the graph to separate problem data from similar problem data; the two share the same attributes, each attribute corresponds to a knowledge base, and different data are stored into the corresponding knowledge base according to their attributes. A bert model then predicts whether an unanswered problem is a valid sentence, and a binary classifier trained with the textcnn model identifies whether it is chatting. For the valid, non-chatting part, the literal and semantic similarities are computed with word frequency-inverse document frequency TF-IDF and a twin network respectively, problem data meeting the similarity requirement are clustered, and the attributes with the highest matching degree to the clustering result are selected from the knowledge base, completing the attribute filling.
Referring to fig. 9, fig. 9 is a functional module diagram of an embodiment of a data attribute filling apparatus according to the present invention. In this embodiment, the data attribute filling apparatus includes:
the prediction module 10 is configured to predict the initial problem data to be responded to through a preset model set to obtain a target prediction result;
the classification module 20 is configured to obtain a knowledge owner to which the initial question data to be responded belongs based on the target prediction result, and determine a knowledge base corresponding to the initial question data to be responded according to the knowledge owner;
the identification module 30 is used for calculating the comprehensive similarity between the initial to-be-responded question data and the historical question data in the knowledge base;
a similarity judging module 40, configured to judge whether the similarity is greater than or equal to a first preset threshold;
the clustering module 50 is configured to, if the similarity is greater than or equal to a first preset threshold, input the initial problem data to be responded into each node of a preset graph G = (V, E), determine the weight of the initial problem data to be responded according to the degree in the graph, and perform clustering processing on the initial problem data to be responded based on the weight to obtain a clustering result, where the item with the highest weight in the clustering result is problem data, the rest are similar problem data, V is a node set, E is an edge set, and the similar problem data is data having a similar relationship with the problem data;
a matching degree judging module 60, configured to judge whether a matching degree between an attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold;
and a filling module 70, configured to determine an attribute of the clustering result if a matching degree between the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold, and perform attribute filling on the clustering result by using the attribute.
In this embodiment, the modules of the apparatus can obtain multiple attributes in a single run, improving the efficiency with which different initial problem data to be responded are classified under different attributes.
The invention also provides a computer readable storage medium.
In this embodiment, the computer readable storage medium has stored thereon a data attribute padding program, and the data attribute padding program, when executed by a processor, implements the steps of the data attribute padding method as described in any one of the above embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM), and includes instructions for causing a terminal (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The present invention is described in connection with the accompanying drawings, but the present invention is not limited to the above embodiments, which are only illustrative and not restrictive, and those skilled in the art can make various changes without departing from the spirit and scope of the invention as defined by the appended claims, and all changes that come within the meaning and range of equivalency of the specification and drawings that are obvious from the description and the attached claims are intended to be embraced therein.

Claims (10)

1. A data attribute padding method, characterized by comprising the steps of:
predicting initial problem data to be responded by a preset model set to obtain a target prediction result;
acquiring a knowledge owner to which the initial question data to be responded belongs based on the target prediction result, and determining a knowledge base corresponding to the initial question data to be responded according to the knowledge owner;
calculating the comprehensive similarity between the initial to-be-responded question data and the historical question data in the knowledge base;
judging whether the similarity is greater than or equal to a first preset threshold value or not;
if the similarity is greater than or equal to a first preset threshold, inputting the initial problem data to be responded into preset nodes of a graph G (V, E), determining the weight of the initial problem data to be responded according to the degree in the graph, and clustering the initial problem data to be responded based on the weight to obtain a clustering result, wherein the highest weight in the clustering result is problem data, the rest are similar problem data, V is a node set, E is an edge set, and the similar problem data is data with a similar relation to the problem data;
judging whether the matching degree of the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value or not;
and if the matching degree of the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value, determining the attributes of the clustering result, and performing attribute filling on the clustering result by adopting the attributes.
2. The data attribute population method of claim 1, wherein before said predicting the initial to-be-answered question data by the preset model set to obtain the target prediction result, further comprising the steps of:
removing punctuation marks in the first initial problem data set to be responded to through a regular expression to obtain a second initial problem data set to be responded to;
performing synonym conversion on the second initial problem data set to be responded by a preset synonym conversion mode to obtain a third initial problem data set to be responded;
and calling a library function to perform literal duplicate removal processing on the third initial problem data set to be responded to obtain a target problem data set to be responded to, wherein the target problem data set to be responded to at least comprises one initial problem data to be responded to.
3. The data attribute population method of claim 2 wherein the calling library function performs literal deduplication processing on the third initial question-to-answer data set to obtain a target question-to-answer data set comprising the steps of:
sequencing each third initial problem data to be responded in the third initial problem data sets to be responded according to the sentence length by calling a quick sequencing algorithm in a library function to obtain sequenced third initial problem data sets to be responded;
traversing the sorted third initial problem data set to be responded, and clearing repeated words to obtain a target problem data set to be responded.
4. The data attribute population method of claim 2 wherein the synonym transformation of the second initial question-to-answer data set by a preset synonym transformation to obtain a third initial question-to-answer data set comprises the steps of:
performing word segmentation on the second initial question data set to be responded to obtain word segmentation data;
acquiring a feature vector of the word segmentation data, and calculating cosine included angle values of the feature vector and feature vectors of all words in a preset word bank;
judging whether the cosine included angle value is smaller than a preset included angle value or not;
if the cosine included angle value is smaller than a preset included angle value, obtaining synonymous data of each word in the preset word bank, and forming the synonymous data into a third initial problem data set to be responded;
if the cosine included angle value is larger than or equal to the preset included angle value, the step of judging whether the cosine included angle value is smaller than the preset included angle value is continuously executed until the cosine included angle value meets the preset included angle value.
5. The data attribute population method of claim 1 wherein predicting the initial to-be-answered problem data by a preset model set to obtain a target prediction result comprises the steps of:
predicting the initial question data to be responded through a language representation bert model in a preset model set, and judging whether the initial question data to be responded belongs to an effective type;
if the initial question data to be responded belong to the effective type, obtaining an effective type prediction result;
predicting the initial question data to be responded through a text classification textcnn model in a preset model set, and judging whether the initial question data to be responded belongs to a chatting type;
if the initial question data to be responded belongs to the chatting type, obtaining a chatting type prediction result;
and combining the effective class prediction result and the chatting class prediction result to obtain a target prediction result.
6. The method according to claim 1, wherein if the matching degree between the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold, determining the attribute of the clustering result, and performing attribute filling on the clustering result by using the attribute comprises:
if the matching degree of the attributes of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value, acquiring an attribute set of the clustering result based on a preset mapping relation between the attributes of the historical problem data and the attributes of the clustering result, wherein the attribute set of the clustering result comprises at least one attribute of the clustering result;
and mining a frequent item set in the attribute set of the clustering result, and determining the attribute of the clustering result based on the frequent item set.
7. The data attribute population method of any one of claims 1-6 wherein the calculating of the integrated similarity between the initial to-be-answered question data and the historical question data in the knowledge base comprises the steps of:
calculating the literal similarity between the initial question data to be responded and the historical question data in the knowledge base through the word frequency-inverse file frequency TF-IDF;
calculating semantic similarity between initial problem data to be responded and historical problem data in the knowledge base through a twin network;
and respectively carrying out priority sequencing on the literal similarity and the semantic similarity according to the similarity numerical value to obtain comprehensive similarity.
8. A data attribute populating apparatus, characterized in that the data attribute populating apparatus comprises the following modules:
the prediction module is used for predicting the initial problem data to be responded through a preset model set to obtain a target prediction result;
the classification module is used for acquiring a knowledge owner to which the initial to-be-responded question data belongs based on the target prediction result, and determining a knowledge base corresponding to the initial to-be-responded question data according to the knowledge owner;
the identification module is used for calculating the comprehensive similarity between the initial to-be-responded question data and the historical question data in the knowledge base;
the similarity judging module is used for judging whether the similarity is greater than or equal to a first preset threshold value or not;
the clustering module is used for inputting the initial problem data to be responded into preset nodes of a graph G (V, E) if the similarity is greater than or equal to a first preset threshold, determining the weight of the initial problem data to be responded according to the degree in the graph, and clustering the initial problem data to be responded based on the weight to obtain a clustering result, wherein the highest weight in the clustering result is problem data, the rest are similar problem data, V is a node set, E is an edge set, and the similar problem data is data with a similar relation to the problem data;
the matching degree judging module is used for judging whether the matching degree of the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold value or not;
and the filling module is used for determining the attribute of the clustering result and filling the attribute of the clustering result by adopting the attribute if the matching degree of the attribute of the historical problem data in the knowledge base and the clustering result is greater than or equal to a second preset threshold.
9. A data property filling apparatus, characterized in that the data property filling apparatus comprises a memory, a processor and a data property filling program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the data property filling method according to any of claims 1-7.
10. A computer-readable storage medium, having stored thereon a data property filling program which, when executed by a processor, implements the steps of the data property filling method of any one of claims 1-7.
CN202010088080.6A 2020-02-12 2020-02-12 Data attribute filling method, device, equipment and computer readable storage medium Pending CN111339248A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010088080.6A CN111339248A (en) 2020-02-12 2020-02-12 Data attribute filling method, device, equipment and computer readable storage medium
PCT/CN2020/098768 WO2021159655A1 (en) 2020-02-12 2020-06-29 Data attribute filling method, apparatus and device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010088080.6A CN111339248A (en) 2020-02-12 2020-02-12 Data attribute filling method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111339248A true CN111339248A (en) 2020-06-26

Family

ID=71182154

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010088080.6A Pending CN111339248A (en) 2020-02-12 2020-02-12 Data attribute filling method, device, equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN111339248A (en)
WO (1) WO2021159655A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541054A (en) * 2020-12-15 2021-03-23 平安科技(深圳)有限公司 Method, device, equipment and storage medium for governing questions and answers of knowledge base
CN113204974A (en) * 2021-05-14 2021-08-03 清华大学 Method, device and equipment for generating confrontation text and storage medium
CN113239697A (en) * 2021-06-01 2021-08-10 平安科技(深圳)有限公司 Entity recognition model training method and device, computer equipment and storage medium
WO2021159655A1 (en) * 2020-02-12 2021-08-19 平安科技(深圳)有限公司 Data attribute filling method, apparatus and device, and computer-readable storage medium
CN113761178A (en) * 2021-08-11 2021-12-07 北京三快在线科技有限公司 Data display method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133866A (en) * 2014-07-18 2014-11-05 国家电网公司 Intelligent-power-grid-oriented missing data filling method
US10394956B2 (en) * 2015-12-31 2019-08-27 Shanghai Xiaoi Robot Technology Co., Ltd. Methods, devices, and systems for constructing intelligent knowledge base
CN106844781B (en) * 2017-03-10 2020-04-21 广州视源电子科技股份有限公司 Data processing method and device
CN108932301B (en) * 2018-06-11 2021-04-27 天津科技大学 Data filling method and device
CN110674621B (en) * 2018-07-03 2024-06-18 北京京东尚科信息技术有限公司 Attribute information filling method and device
CN109460775B (en) * 2018-09-20 2020-09-11 国家计算机网络与信息安全管理中心 Data filling method and device based on information entropy
CN110287179A (en) * 2019-06-25 2019-09-27 广东工业大学 Apparatus, device and method for filling missing data attribute values
CN111339248A (en) * 2020-02-12 2020-06-26 平安科技(深圳)有限公司 Data attribute filling method, device, equipment and computer readable storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021159655A1 (en) * 2020-02-12 2021-08-19 平安科技(深圳)有限公司 Data attribute filling method, apparatus and device, and computer-readable storage medium
CN112541054A (en) * 2020-12-15 2021-03-23 平安科技(深圳)有限公司 Method, device, equipment and storage medium for governing questions and answers of knowledge base
CN112541054B (en) * 2020-12-15 2023-08-29 平安科技(深圳)有限公司 Knowledge base question and answer management method, device, equipment and storage medium
CN113204974A (en) * 2021-05-14 2021-08-03 清华大学 Method, device and equipment for generating adversarial text and storage medium
CN113239697A (en) * 2021-06-01 2021-08-10 平安科技(深圳)有限公司 Entity recognition model training method and device, computer equipment and storage medium
CN113761178A (en) * 2021-08-11 2021-12-07 北京三快在线科技有限公司 Data display method and device

Also Published As

Publication number Publication date
WO2021159655A1 (en) 2021-08-19

Similar Documents

Publication Publication Date Title
CN108647205B (en) Fine-grained emotion analysis model construction method and device and readable storage medium
CN111339248A (en) Data attribute filling method, device, equipment and computer readable storage medium
WO2021017721A1 (en) Intelligent question answering method and apparatus, medium and electronic device
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
CN104199965B (en) Semantic information retrieval method
CN105912716B (en) Short text classification method and device
CN110019732B (en) Intelligent question answering method and related device
CN111797210A (en) Information recommendation method, device and equipment based on user portrait and storage medium
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
WO2020232898A1 (en) Text classification method and apparatus, electronic device and computer non-volatile readable storage medium
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN110210038B (en) Core entity determining method, system, server and computer readable medium thereof
WO2024098623A1 (en) Cross-media retrieval method and apparatus, cross-media retrieval model training method and apparatus, device, and recipe retrieval system
WO2023065642A1 (en) Corpus screening method, intention recognition model optimization method, device, and storage medium
US20200004786A1 (en) Corpus generating method and apparatus, and human-machine interaction processing method and apparatus
WO2018213783A1 (en) Computerized methods of data compression and analysis
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN113935314A (en) Abstract extraction method, device, terminal equipment and medium based on heteromorphic graph network
CN115795030A (en) Text classification method and device, computer equipment and storage medium
CN110781673B (en) Document acceptance method and device, computer equipment and storage medium
KR102560521B1 (en) Method and apparatus for generating knowledge graph
CN106407332B (en) Search method and device based on artificial intelligence
JP2019148933A (en) Summary evaluation device, method, program, and storage medium
CN112925912A (en) Text processing method, and synonymous text recall method and device
CN110442696B (en) Query processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination