CN112035449A - Data processing method and device, computer equipment and storage medium - Google Patents

Data processing method and device, computer equipment and storage medium

Info

Publication number
CN112035449A
CN112035449A
Authority
CN
China
Prior art keywords
data
target
classification
sets
entity information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010712476.3A
Other languages
Chinese (zh)
Inventor
王冰玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dazhu Hangzhou Technology Co ltd
Original Assignee
Dazhu Hangzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dazhu Hangzhou Technology Co ltd filed Critical Dazhu Hangzhou Technology Co ltd
Priority to CN202010712476.3A priority Critical patent/CN112035449A/en
Publication of CN112035449A publication Critical patent/CN112035449A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data processing method and device, computer equipment and a storage medium, wherein the method comprises the following steps: performing primary field classification on data to be processed according to first key features contained in the data to be processed to obtain a plurality of groups of first data sets, wherein the data to be processed is unstructured data; converting the target first data set into a feature vector based on a classification model, and performing secondary domain classification on the target first data set according to the feature vector to obtain a plurality of groups of second data sets; extracting entity information in the target second data set based on the information extraction model; performing data cleaning on entity information in the target second data set to obtain target data, wherein the target data are structured data; and importing the target data into a target structured database. The invention solves the technical problems of low efficiency and limitation in extracting information in unstructured data in the related technology.

Description

Data processing method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of computers, and in particular, to a data processing method and apparatus, a computer device, and a storage medium.
Background
In the related art, big-data structuring generally requires sorting out corresponding data templates according to data characteristics in order to extract data information. However, for large-scale data this information extraction method is not universal: whenever a new data template must be arranged, a large amount of manual work is usually required, and exact matching and information extraction depend on manually reworking precise rules, which easily causes lag and limitation in information extraction. For example, the most common way to structure text data is to summarize fixed-format data templates and use rule-based methods to extract the effective information in the text exactly. This approach is simple and effective for data with fixed templates and definite data types. For the large amount of data that has no fixed template and whose types are dispersed and disordered, however, processing with simple rules consumes enormous manpower and time for sorting and is extremely inefficient; and when data appears in a new form, exact matching against existing templates almost completely fails, so new data templates must be manually arranged to handle the new form, further causing lag in information processing.
In view of the above problems in the related art, no effective solution has been found at present.
Disclosure of Invention
The embodiment of the invention provides a data processing method and device, computer equipment and a storage medium, which at least solve the technical problems of low efficiency and limitation in extracting information in unstructured data in the related technology.
According to an embodiment of the present invention, there is provided a data processing method including: performing primary field classification on data to be processed according to first key features contained in the data to be processed to obtain a plurality of groups of first data sets, wherein the data to be processed is unstructured data; converting a target first data set into a feature vector based on a classification model, and performing secondary domain classification on the target first data set according to the feature vector to obtain a plurality of groups of second data sets, wherein the target first data set is any one of the plurality of groups of first data sets; extracting entity information in a target second data set based on an information extraction model, wherein the target second data set is any one of the multiple groups of second data sets; performing data cleaning on entity information in the target second data set to obtain target data, wherein the target data are structured data; and importing the target data into a target structured database.
Optionally, performing first-level domain classification on the data to be processed according to a first key feature included in the data to be processed to obtain a plurality of groups of first data sets includes: step A1, performing primary classification on the data to be processed according to the first key features to obtain a plurality of groups of first data sets; step A2, randomly extracting one or more first data sets from the plurality of groups of first data sets, labeling one or more primary class labels for the one or more first data sets, and using the one or more first data sets as a first verification set; step A3, calculating a first precision P1 and a first recall R1 of the first-level classification based on the actual class labels of the one or more groups of first data sets and the first verification set, wherein for any group of first data sets, the precision represents the ratio of the number of samples judged to be a label M whose actual label is also M to the number of samples judged to be the label M, the recall represents the ratio of the number of samples judged to be the label M whose actual label is also M to the number of samples in that group actually labeled M, and M represents the class label of that group of first data sets; step A4, calculating the accuracy F1 of the primary classification by the following formula: F1 = 2 × P1 × R1/(P1 + R1); step A5, comparing F1 with a first threshold; if F1 is less than the first threshold, adding the first data with first-level classification errors to the plurality of groups of first data sets, looping the operations of steps A1-A5 until F1 is greater than or equal to the first threshold, and outputting the labeled plurality of groups of first data sets.
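The per-class precision, recall, and F1 computation of steps A3-A4 can be sketched as follows; this is an illustrative sketch, not the patent's implementation, and the labels used are assumed example values.

```python
def precision_recall_f1(predicted, actual, label):
    """Precision, recall and F1 for one class label M, following steps A3-A4.

    `predicted` and `actual` are parallel lists of class labels; `label` is M.
    (Illustrative sketch only; names are not from the patent text.)
    """
    pred_m = sum(1 for p in predicted if p == label)                 # judged as M
    true_m = sum(1 for a in actual if a == label)                    # actually M
    hit_m = sum(1 for p, a in zip(predicted, actual) if p == a == label)
    p = hit_m / pred_m if pred_m else 0.0
    r = hit_m / true_m if true_m else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0                       # F1 = 2PR/(P+R)
    return p, r, f1

# Example: five samples, model assigns the label "finance" to three of them
pred = ["finance", "finance", "legal", "finance", "legal"]
gold = ["finance", "legal", "legal", "finance", "finance"]
print(precision_recall_f1(pred, gold, "finance"))
```

Of the three samples judged "finance", two are actually "finance", and two of the three actual "finance" samples are found, so precision, recall, and F1 are all 2/3 here.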
Optionally, converting the target first data set into a feature vector based on a classification model, and performing secondary domain classification on the target first data set according to the feature vector to obtain a plurality of groups of second data sets, includes: step a1, digitizing the target first data set by using the classification model to obtain a feature vector corresponding to the target first data set; step a2, classifying the target first data set according to the feature vector to obtain a plurality of groups of second data sets; step a3, randomly extracting one or more second data sets from the plurality of groups of second data sets, labeling one or more secondary class labels for the one or more second data sets, and using the one or more second data sets as a second verification set; step a4, calculating a second precision P2 and a second recall R2 of the classification model based on the actual class labels of the one or more groups of second data sets and the second verification set; step a5, calculating the accuracy F2 of the classification model by the following formula: F2 = 2 × P2 × R2/(P2 + R2); step a6, comparing F2 with a second threshold; if F2 is less than the second threshold, adding the second data with classification model errors to the plurality of groups of second data sets, looping the operations of steps a1-a6 until F2 is greater than or equal to the second threshold, and outputting the plurality of groups of second data sets.
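Steps a1-a2 (digitizing records into feature vectors and classifying by vector similarity) could look like the following minimal sketch. The patent does not specify the classification model; a bag-of-words vector with nearest-prototype cosine matching is used here purely as a stand-in, and the sub-field names are invented examples.

```python
import math
from collections import Counter

def to_vector(text):
    """Step a1 stand-in: turn a record into a term-frequency feature vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def secondary_classify(records, subfield_prototypes):
    """Step a2 stand-in: assign each record to the most similar sub-field."""
    groups = {name: [] for name in subfield_prototypes}
    for rec in records:
        vec = to_vector(rec)
        best = max(subfield_prototypes,
                   key=lambda name: cosine(vec, to_vector(subfield_prototypes[name])))
        groups[best].append(rec)
    return groups

# Hypothetical sub-fields within a coarse "finance" class
protos = {"stocks": "share price market stock", "loans": "loan credit interest bank"}
data = ["the stock price rose", "bank approved the loan"]
print(secondary_classify(data, protos))
```

A trained classifier (and a learned embedding instead of raw term counts) would replace both stand-ins in practice.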
Optionally, the extracting entity information in the target second data set based on the information extraction model includes: step B1, extracting a plurality of entity information in the target second data set through the information extraction model; step B2, labeling a plurality of entity labels for the plurality of entity information to obtain a plurality of labeled entity information, and taking the plurality of labeled entity information as a third verification set; step B3, calculating a third precision P3 and a third recall R3 of the information extraction model according to the actual entity labels of the plurality of entity information and the third verification set; step B4, calculating a third accuracy F3 of the information extraction model by the following formula: F3 = 2 × P3 × R3/(P3 + R3); step B5, comparing F3 with a third threshold; if F3 is less than the third threshold, adding the third data wrongly extracted by the information extraction model to the target second data set, looping the operations of steps B1-B5 until F3 is greater than or equal to the third threshold, and outputting the entity information in the target second data set.
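Step B1 can be illustrated with a minimal rule-based extractor. The patent's information extraction model is unspecified; the regular-expression patterns and field names below are illustrative assumptions only, standing in for a real NER model.

```python
import re

# Toy patterns standing in for the patent's (unspecified) extraction model.
PATTERNS = {
    "date":  re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "money": re.compile(r"\$\d+(?:\.\d+)?"),
    "org":   re.compile(r"\b[A-Z][a-zA-Z]+ (?:Inc|Ltd|Corp)\b"),
}

def extract_entities(text):
    """Step B1 stand-in: pull entity information (times, amounts, organisations)
    out of one unstructured text record."""
    return {label: pat.findall(text) for label, pat in PATTERNS.items()}

print(extract_entities("On 2020-07-23 Acme Inc paid $99.5 to Dazhu Ltd."))
```

A production system would use a trained sequence-labeling model for person names, place names, behaviors, and other open-ended entity types that rules cannot cover.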
Optionally, performing data cleaning on the entity information in the target second data set to obtain target data includes: step b1, performing data cleaning on the plurality of entity information in the target second data set according to preset rules, wherein the preset rules at least include: filtering unreasonable data in the entity information and standardizing the format of the entity information; step b2, randomly extracting one or more entity information from the cleaned entity information, and calculating the qualification rate of the cleaning result; step b3, comparing the qualification rate with a fourth threshold; and if the qualification rate is less than the fourth threshold, adding the fourth data that failed cleaning to the entity information, looping the operations of steps b1-b3 until the qualification rate is greater than or equal to the fourth threshold, and outputting the entity information.
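The two preset rules of step b1 (filtering unreasonable values, format standardization) and the qualification rate of step b2 can be sketched as below. The concrete rules (an age range, one date format) are illustrative assumptions, not rules stated in the patent.

```python
import re

def clean_entity(entity):
    """Step b1 stand-in: filter unreasonable values and normalise formats.
    The specific rules (age range, date format) are assumed examples."""
    cleaned = {}
    for field, value in entity.items():
        if field == "age":
            if not (0 <= int(value) <= 130):   # rule 1: drop unreasonable data
                continue
            cleaned[field] = int(value)
        elif field == "date":
            # rule 2: format standardisation, e.g. 2020/7/23 -> 2020-07-23
            y, m, d = re.split(r"[-/.]", value)
            cleaned[field] = f"{int(y):04d}-{int(m):02d}-{int(d):02d}"
        else:
            cleaned[field] = value.strip()
    return cleaned

def pass_rate(cleaned_samples, is_qualified):
    """Step b2 stand-in: qualification rate over a random sample of cleaned records."""
    return sum(map(is_qualified, cleaned_samples)) / len(cleaned_samples)

print(clean_entity({"age": "200", "date": "2020/7/23", "name": " Wang "}))
```

The implausible age is dropped, the date is normalised, and the name is trimmed; step b3 would then compare `pass_rate` against the fourth threshold.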
Optionally, after the entity information in the target second data set is subjected to data cleaning to obtain target data, the method further includes: determining a second key feature in the target data; searching for a third key feature associated with the second key feature in the target data; searching a fourth key feature having a logical relation with the third key feature based on a knowledge graph, wherein the fourth key feature is structured data; supplementing the fourth key feature to the target data.
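The knowledge-graph supplement step above (second key feature → associated third feature → logically related fourth feature) might be sketched as follows; the toy relation map and the `located_in`/`region_of` relation names are entirely hypothetical, standing in for a real knowledge graph.

```python
# Toy adjacency map standing in for a real knowledge graph (assumed example data).
KNOWLEDGE_GRAPH = {
    ("Hangzhou", "located_in"): "Zhejiang",
    ("Zhejiang", "region_of"): "China",
}

def supplement(target_data, second_key):
    """From the second key feature, find an associated third feature, then a
    fourth feature logically related to it via the graph, and add the fourth
    feature (already structured) to the target data."""
    third = KNOWLEDGE_GRAPH.get((target_data[second_key], "located_in"))
    if third is None:
        return target_data
    fourth = KNOWLEDGE_GRAPH.get((third, "region_of"))
    if fourth is not None:
        target_data["country"] = fourth   # the supplemented fourth key feature
    return target_data

print(supplement({"city": "Hangzhou"}, "city"))
```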
Optionally, after supplementing the fourth key feature to the target data, the method further includes: periodically updating the target data; and associating the fourth key feature with the second key feature in the updated target data.
According to an embodiment of the present invention, there is provided a data processing apparatus including: the first classification module is used for performing primary field classification on the data to be processed according to first key features contained in the data to be processed to obtain a plurality of groups of first data sets, wherein the data to be processed is unstructured data; the second classification module is used for converting a target first data set into a feature vector based on a classification model, and performing secondary domain classification on the target first data set according to the feature vector to obtain a plurality of groups of second data sets, wherein the target first data set is any one of the plurality of groups of first data sets; the extraction module is used for extracting entity information in a target second data set based on an information extraction model, wherein the target second data set is any one of the multiple groups of second data sets; a cleaning module, configured to perform data cleaning on entity information in the target second data set to obtain target data, where the target data is structured data; and the import module is used for importing the target data into a target structured database.
Optionally, the first classification module is configured to: step A1, perform primary classification on the data to be processed according to the first key features to obtain a plurality of groups of first data sets; step A2, randomly extract one or more first data sets from the plurality of groups of first data sets, label one or more primary class labels for the one or more first data sets, and use the one or more first data sets as a first verification set; step A3, calculate a first precision P1 and a first recall R1 of the first-level classification based on the actual class labels of the one or more groups of first data sets and the first verification set, wherein for any group of first data sets, the precision represents the ratio of the number of samples judged to be a label M whose actual label is also M to the number of samples judged to be the label M, the recall represents the ratio of the number of samples judged to be the label M whose actual label is also M to the number of samples in that group actually labeled M, and M represents the class label of that group of first data sets; step A4, calculate the accuracy F1 of the primary classification by the following formula: F1 = 2 × P1 × R1/(P1 + R1); step A5, compare F1 with a first threshold; if F1 is less than the first threshold, add the first data with first-level classification errors to the plurality of groups of first data sets, loop the operations of steps A1-A5 until F1 is greater than or equal to the first threshold, and output the labeled plurality of groups of first data sets.
Optionally, the second classification module is configured to: step a1, digitize the target first data set by using the classification model to obtain a feature vector corresponding to the target first data set; step a2, classify the target first data set according to the feature vector to obtain a plurality of groups of second data sets; step a3, randomly extract one or more second data sets from the plurality of groups of second data sets, label one or more secondary class labels for the one or more second data sets, and use the one or more second data sets as a second verification set; step a4, calculate a second precision P2 and a second recall R2 of the classification model based on the actual class labels of the one or more groups of second data sets and the second verification set; step a5, calculate the accuracy F2 of the classification model by the following formula: F2 = 2 × P2 × R2/(P2 + R2); step a6, compare F2 with a second threshold; if F2 is less than the second threshold, add the second data with classification model errors to the plurality of groups of second data sets, loop the operations of steps a1-a6 until F2 is greater than or equal to the second threshold, and output the plurality of groups of second data sets.
Optionally, the extracting module is configured to: step B1, extract a plurality of entity information in the target second data set through the information extraction model; step B2, label a plurality of entity labels for the plurality of entity information to obtain a plurality of labeled entity information, and take the plurality of labeled entity information as a third verification set; step B3, calculate a third precision P3 and a third recall R3 of the information extraction model according to the actual entity labels of the plurality of entity information and the third verification set; step B4, calculate a third accuracy F3 of the information extraction model by the following formula: F3 = 2 × P3 × R3/(P3 + R3); step B5, compare F3 with a third threshold; if F3 is less than the third threshold, add the third data wrongly extracted by the information extraction model to the target second data set, loop the operations of steps B1-B5 until F3 is greater than or equal to the third threshold, and output the entity information in the target second data set.
Optionally, the cleaning module is configured to: step b1, perform data cleaning on the plurality of entity information in the target second data set according to preset rules, wherein the preset rules at least include: filtering unreasonable data in the entity information and standardizing the format of the entity information; step b2, randomly extract one or more entity information from the cleaned entity information, and calculate the qualification rate of the cleaning result; step b3, compare the qualification rate with a fourth threshold; and if the qualification rate is less than the fourth threshold, add the fourth data that failed cleaning to the entity information, loop the operations of steps b1-b3 until the qualification rate is greater than or equal to the fourth threshold, and output the entity information.
Optionally, the apparatus further comprises: the determining module is used for determining a second key feature in the target data after the entity information in the target second data set is subjected to data cleaning to obtain the target data; the searching module is used for searching a third key feature related to the second key feature in the target data; the searching module is used for searching a fourth key feature which has a logical relation with the third key feature based on a knowledge graph, wherein the fourth key feature is structured data; an adding module for supplementing the fourth key feature to the target data.
Optionally, the apparatus further comprises: an updating module for periodically updating the target data after supplementing the fourth key feature to the target data; and an association module for associating the fourth key feature with the second key feature in the updated target data.
According to yet another embodiment of the present invention, there is also provided a computer device comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform the steps of any of the above method embodiments.
According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps in any of the above method embodiments when executed.
According to the invention, the unstructured data is roughly classified in a wide field according to the key features, the roughly classified data is converted into the feature vectors according to the training model, and the roughly classified data is further finely classified according to the feature vectors, so that the problems of dispersed data categories and disorder of the unstructured data are solved; then extracting entity backbone information in the finely classified data through an information extraction model; carrying out data cleaning on entity backbone information, and outputting structured data which is reasonable in data and standard in format; and finally, the structured database is imported, so that the user can conveniently analyze and manage the unstructured data. Compared with the related technology, the scheme enables the processing of the unstructured data to be more modularized and more streamlined, and solves the technical problems of low efficiency and limitation in extracting information in the unstructured data in the related technology.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of a hardware structure in which a data processing method according to an embodiment of the present invention is applied to a computer terminal;
FIG. 2 is a flow chart of a method of data processing according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating an implementation of a data processing method according to an embodiment of the present invention;
fig. 4 is a block diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Example 1
The method provided by the first embodiment of the present application may be executed in a mobile terminal, a server, a computer terminal, or a similar computing device. Taking the operation on a computer terminal as an example, fig. 1 is a hardware structure block diagram of a data processing method applied to a computer terminal according to an embodiment of the present invention. As shown in fig. 1, the computer terminal may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally, a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the computer terminal. For example, the computer terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as computer programs corresponding to the data processing method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to a computer terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In order to solve the technical problems in the related art, a data processing method is provided in the present embodiment, and fig. 2 is a flowchart of a data processing method according to an embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
step S202, performing primary domain classification on data to be processed according to first key features contained in the data to be processed to obtain a plurality of groups of first data sets, wherein the data to be processed is unstructured data;
The unstructured data in this embodiment mainly refers to data that cannot be logically expressed by a fixed structure. Compared with structured data, the essential differences of unstructured data lie in three aspects: its volume is larger than that of structured data; it grows faster than structured data; and its sources are diverse. Divided by form, it mainly includes text, images and pictures, video streams, television streams, and the like.
Optionally, the key feature in this embodiment may be a key word in the text data, but is not limited thereto. The primary purpose of roughly classifying the unstructured data according to the above-described embodiments (i.e., the above-described first-level domain classification) is to distinguish the unstructured data according to the domain to which the unstructured data belongs, thereby facilitating further data mining.
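A keyword-driven coarse classification of the kind described above might look like the following sketch; the domain names and keyword lists are illustrative assumptions, and in practice the keyword rules would be maintained and extended as badcases accumulate.

```python
# Assumed example keyword rules for the first-level (coarse) domain classification.
DOMAIN_KEYWORDS = {
    "finance": {"stock", "loan", "interest", "bank"},
    "medical": {"patient", "diagnosis", "hospital"},
}

def coarse_classify(records):
    """Route each raw record to the broad domain whose keywords it hits most,
    producing the groups of first data sets; unmatched records go to 'other'."""
    first_data_sets = {domain: [] for domain in DOMAIN_KEYWORDS}
    first_data_sets["other"] = []
    for rec in records:
        words = set(rec.lower().split())
        hits = {d: len(words & kws) for d, kws in DOMAIN_KEYWORDS.items()}
        best = max(hits, key=hits.get)
        first_data_sets[best if hits[best] else "other"].append(rec)
    return first_data_sets

print(coarse_classify(["the bank raised interest",
                       "patient sent to hospital",
                       "hello world"]))
```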
Step S204, converting the target first data set into a feature vector based on a classification model, and performing secondary domain classification on the target first data set according to the feature vector to obtain a plurality of groups of second data sets, wherein the target first data set is any one of the plurality of groups of first data sets;
In this embodiment, since the rough classification only roughly divides the unstructured data according to the broad domain to which it belongs, the roughly classified data needs to be further sub-classified (i.e., the secondary domain classification above): the data under each rough class is subdivided so that the unstructured data is sorted at a finer granularity and the finer domain to which it belongs is determined.
Step S206, extracting entity information in a target second data set based on the information extraction model, wherein the target second data set is any one of a plurality of groups of second data sets;
In this embodiment, useful information is extracted from the classification results of step S204, such as entity information in text data and pattern information in image data. Optionally, taking text as an example, the entity information may be text backbone information such as person names, place names, organization names, times, and behaviors.
Step S208, performing data cleaning on entity information in the target second data set to obtain target data, wherein the target data is structured data;
in this embodiment, the information extracted in step S206 is cleaned and corrected to obtain normalized data.
Step S210, importing the target data into a target structured database.
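Step S210 amounts to inserting the cleaned, structured records into a relational table. A minimal sketch using an in-memory SQLite database follows; the table name and columns are illustrative assumptions, not a schema from the patent.

```python
import sqlite3

def import_target_data(rows):
    """Step S210 stand-in: import structured target data into a relational table.
    An in-memory SQLite database stands in for the target structured database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (name TEXT, date TEXT, amount REAL)")
    conn.executemany("INSERT INTO target VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

conn = import_target_data([("Acme", "2020-07-23", 99.5)])
print(conn.execute("SELECT COUNT(*) FROM target").fetchone()[0])
```

Once imported, the data can be queried, analyzed, and managed with ordinary SQL, which is the practical payoff of the whole pipeline.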
According to the embodiment of the invention, the unstructured data is roughly classified in a wide field according to the key features, the roughly classified data is converted into the feature vectors according to the training model, and the roughly classified data is further finely classified according to the feature vectors, so that the problems of data category dispersion and disorder of the unstructured data are solved; then extracting entity backbone information in the finely classified data through an information extraction model; carrying out data cleaning on entity backbone information, and outputting structured data which is reasonable in data and standard in format; and finally, the structured database is imported, so that the user can conveniently analyze and manage the unstructured data. Compared with the related technology, the scheme enables the processing of the unstructured data to be more modularized and more streamlined, and solves the technical problems of low efficiency and limitation in extracting information in the unstructured data in the related technology.
In an optional embodiment of the present disclosure, performing a first-level domain classification on data to be processed according to a first key feature included in the data to be processed to obtain a plurality of groups of first data sets includes: step A1, classifying the data to be processed for the first time according to the first key features to obtain a plurality of groups of first data sets; step A2, randomly extracting one or more first data sets from the plurality of groups of first data sets, labeling one or more primary class labels for the one or more first data sets, and using the one or more first data sets as a first verification set; step A3, calculating a first precision P1 and a first recall R1 of the first-level classification based on the actual class labels of the one or more groups of first data sets and the first verification set, wherein for any group of first data sets, the precision represents the ratio of the number of samples judged to be a label M whose actual label is also M to the number of samples judged to be the label M, the recall represents the ratio of the number of samples judged to be the label M whose actual label is also M to the number of samples in that group actually labeled M, and M represents the class label of that group of first data sets; step A4, calculating the accuracy F1 of the primary classification by the following formula: F1 = 2 × P1 × R1/(P1 + R1); step A5, comparing F1 with a first threshold; if F1 is less than the first threshold, adding the first data with first-level classification errors to the plurality of groups of first data sets, looping the operations of steps A1-A5 until F1 is greater than or equal to the first threshold, and outputting the labeled plurality of groups of first data sets.
Taking text structuring as an example, the present solution is further described with reference to a specific embodiment. Fig. 3 is an execution flowchart of the data processing method according to a specific embodiment of the present invention. As shown in fig. 3, an unprocessed data stream is first obtained and input into a rough classification model (i.e., the above first-level domain classification is performed), and the unprocessed data stream is classified once according to keywords detected in it (i.e., the complicated unprocessed data stream is roughly divided into the broad domains to which it belongs). During rough classification, the classification result is spot-checked periodically; when the accuracy (i.e., the above F1) reaches a preset rough classification threshold (i.e., the above first threshold), the model is not iterated for the time being (and the next, second-level domain classification process begins); if the accuracy does not reach the preset rough classification threshold, badcase data (bad cases or problem data, i.e., the above first data) is added and the model is iterated.
In this embodiment, calculating the accuracy of the rough classification model generally requires extracting part of the data (i.e., the above one or more groups of first data sets) as a verification set (i.e., the above first verification set). The data in the verification set is labeled with class labels as the verification reference (i.e., the actual class labels), and is also labeled with class labels by the rough classification model; finally, the consistency of the two sets of labels is compared.
Specifically, the consistency of the two sets of labels covers two aspects: on one hand the precision of the model (i.e., the above P1, Precision), and on the other hand the recall (i.e., the above R1, Recall). For a certain class of data samples (i.e., any one of the above groups of first data sets) with class label M, the precision is the number of samples judged as label M by the model whose actual label is also M, divided by the number of samples judged as label M by the model, that is, how much of the data assigned by the model to a class is judged correctly; the recall is the number of samples judged as label M by the model whose actual label is also M, divided by the number of samples in that group whose actual label is M, that is, how much of the data that should belong to a class is correctly found by the model. Then, over the one or more groups of first data sets, the overall precision P1 and recall R1 of the first-level rough classification can be obtained by averaging or weighted-averaging over the classes. Finally, the harmonic mean of model precision and model recall (F1-score) is taken as the final accuracy result, i.e., F1 = 2P1R1/(P1 + R1). When the model accuracy is below the preset threshold, the model is iterated by adding badcase data. Since the rough classification mainly uses rules, adding badcase data specifically means adding rough classification rules for the data misjudged by the model; the iterative process of a rule model usually supplements the rules repeatedly until the model accuracy reaches the threshold or above.
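The precision, recall, and F1 computation described above can be written out directly. This sketch macro-averages over the classes (weighted averaging is the stated alternative); the sample labels below are purely illustrative:

```python
def macro_prf(actual, predicted):
    """Per-class precision and recall, macro-averaged over the classes,
    combined by the harmonic mean F1 = 2*P*R/(P + R) described above."""
    classes = sorted(set(actual) | set(predicted))
    precisions, recalls = [], []
    for m in classes:
        tp = sum(1 for a, p in zip(actual, predicted) if a == m and p == m)
        judged_m = sum(1 for p in predicted if p == m)  # samples judged as label M
        actual_m = sum(1 for a in actual if a == m)     # samples actually labeled M
        precisions.append(tp / judged_m if judged_m else 0.0)
        recalls.append(tp / actual_m if actual_m else 0.0)
    p1 = sum(precisions) / len(precisions)  # overall precision P1
    r1 = sum(recalls) / len(recalls)        # overall recall R1
    f1 = 2 * p1 * r1 / (p1 + r1) if p1 + r1 else 0.0
    return p1, r1, f1

# Illustrative spot check: two broad domains, five verification samples.
actual = ["politics", "politics", "sports", "sports", "sports"]
predicted = ["politics", "sports", "sports", "sports", "politics"]
p1, r1, f1 = macro_prf(actual, predicted)
```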
In addition, the unprocessed data stream in this embodiment is a general term and may be either dynamically growing data or static data. Dynamic data can be stored in a distributed database to facilitate querying and processing, while static data can be stored according to actual needs, either in various databases or as text-format files.
In this embodiment, the rough classification model is mainly a rule model: it roughly judges text data according to keywords in the text and divides the data into broad categories, which are then handed to the fine classification model for further, finer-grained processing. As new data arrives, the form of the data will gradually change, so the classification results need to be spot-checked in time and the rough classification model iterated accordingly.
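A minimal sketch of such a keyword rule model is shown below. The domain names and keyword lists are hypothetical; iterating on badcase data would amount to extending or refining `COARSE_RULES`:

```python
# Hypothetical keyword rules; the real rules depend on the broad domains used.
COARSE_RULES = {
    "transport travel": ["train", "flight", "arrive", "depart"],
    "finance": ["loan", "interest", "deposit"],
}

def coarse_classify(text, rules=COARSE_RULES, default="other"):
    """Rule-based rough classification: assign the first broad domain
    whose keywords appear in the text, or a default catch-all label."""
    for label, keywords in rules.items():
        if any(k in text for k in keywords):
            return label
    return default
```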
In an optional embodiment of the present disclosure, converting the target first data set into feature vectors based on a classification model and performing second-level domain classification on it according to the feature vectors to obtain a plurality of groups of second data sets includes: step a1, numericizing the target first data set with the classification model to obtain the corresponding feature vectors; step a2, classifying the target first data set according to the feature vectors to obtain a plurality of groups of second data sets; step a3, randomly extracting one or more groups of second data sets from the plurality of groups, labeling them with one or more second-level class labels, and using them as a second verification set; step a4, calculating a second precision P2 and a second recall R2 of the classification model based on the actual class labels of the one or more groups of second data sets and the second verification set; step a5, calculating the accuracy F2 of the classification model by the formula F2 = 2P2R2/(P2 + R2); step a6, comparing F2 with a second threshold; if F2 is less than the second threshold, adding the second data misclassified by the classification model to the plurality of groups of second data sets and repeating steps a1-a6 until F2 is greater than or equal to the second threshold, then outputting the plurality of groups of second data sets.
As shown in fig. 3, the fine classification of the roughly classified data includes: spot-checking the classification result periodically, and not iterating the model for the time being when the accuracy (i.e., the above F2) reaches the threshold (i.e., the above second threshold), in which case the next process, entity information extraction, begins; if the accuracy does not reach the threshold, adding badcase data (i.e., the above second data) and iterating the model.
In this embodiment, text fine classification refines the result of the rough classification, further subdividing the data in each broad domain into finer domains. The main steps include several sub-processes: classification data labeling, classification model training, and classification model iteration.
In this embodiment, the model iteration process in the fine classification uses the same accuracy calculation as the rough classification; the differences lie mainly in two places: the labels and the level they belong to differ (i.e., first-level versus second-level classification), and the model type differs. Different levels means that the fine categories sit under the rough categories, and each rough category may contain several fine categories. For example, a news text may be judged political, entertainment, or economic by the rough classification model; if the rough classification model judges the news to be entertainment, the text then enters the fine classification model under the entertainment category and is further judged to be music, movie, or other entertainment. Different model types means the specific algorithms for processing the data differ: the rough classification usually uses rules, while the fine classification usually uses machine learning and deep learning classification algorithms. First the text data is numericized in some manner into text feature vectors, which can represent finer features of the text than classifying all texts by keywords alone; the converted text features are input into a model, and after a certain number of training iterations the fine classification model is produced.
In one example of this embodiment, the fine classification model may be a CNN (Convolutional Neural Network) related model, such as TextCNN (a text CNN model); it may also be an RNN (Recurrent Neural Network) related model, such as LSTM (Long Short-Term Memory network).
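Since a full TextCNN or LSTM requires a deep learning framework, the sketch below substitutes a simple term-frequency vector and nearest-centroid classifier purely to illustrate the two-step idea (numericize the text into a feature vector, then classify on that vector); the vocabulary and centroids are illustrative stand-ins for learned model parameters, not the patent's actual models:

```python
import math
from collections import Counter

def text_to_vector(text, vocab):
    """Numericize text into a term-frequency feature vector (a simple
    stand-in for the learned representations of TextCNN or LSTM)."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def fine_classify(text, centroids, vocab):
    """Assign the fine category whose centroid vector is closest in
    cosine similarity to the text's feature vector."""
    vec = text_to_vector(text, vocab)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))
```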
The iteration cycle of the fine classification model needs to be shorter than that of the rough classification. The main reason is that the keywords the rough classification mainly depends on do not change greatly in a short time, whereas the form of the data under the fine classification changes faster owing to the diversity of information descriptions and the like, placing higher demands on the model's iteration cycle. Accordingly, compared with the rule model of the rough classification, the machine learning and deep learning models adopted for the fine classification yield finer and more accurate classification.
In an alternative embodiment, extracting entity information from the target second data set based on the information extraction model includes: step B1, extracting a plurality of pieces of entity information from the target second data set through the information extraction model; step B2, labeling the plurality of pieces of entity information with a plurality of entity labels to obtain labeled entity information, and using the labeled entity information as a third verification set; step B3, calculating a third precision P3 and a third recall R3 of the information extraction model from the actual entity labels of the plurality of pieces of entity information and the third verification set; step B4, calculating a third accuracy F3 of the information extraction model by the formula F3 = 2P3R3/(P3 + R3); step B5, comparing F3 with a third threshold; if F3 is less than the third threshold, adding the third data wrongly extracted by the information extraction model to the target data set and repeating steps B1-B5 until F3 is greater than or equal to the third threshold, then outputting the entity information in the target second data set.
As shown in fig. 3, the information extraction on the finely classified data includes: spot-checking the extracted information periodically, and not iterating the model for the time being when the accuracy (i.e., the above F3) reaches the threshold (i.e., the above third threshold), in which case the next process, data cleaning, begins; if the accuracy does not reach the threshold, adding badcase data (i.e., the above third data) and iterating the model.
In this embodiment, the main purpose of text entity extraction is to extract the skeleton of the text information. Common entities include person names, place names, times, organizations, actions, and so on; which specific entities to extract is determined by the specific text form of the fine category. The main steps include entity labeling, entity extraction model training, and entity extraction model iteration.
The accuracy of the text entity extraction model also needs to be judged periodically against a labeled test set (i.e., the above third verification set): the entity extraction model is run on the test set and its output compared with the actual entity labels to obtain the model's accuracy (i.e., the above F3), which, like the accuracy of the classification models, is the harmonic mean of the model's precision and recall. If the model accuracy on the test set reaches the threshold, the model is not updated for the time being; if it does not, the data the model predicted wrongly on the test set (i.e., the above third data), together with the actual entity labels, is put into the training set of the entity extraction model and the model is iterated again.
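The entity-level accuracy check can be sketched as a set comparison between predicted and actual entities. Treating each entity as a (type, text) pair is an assumption for illustration, since the text does not fix the matching granularity:

```python
def entity_prf(actual_entities, predicted_entities):
    """F3 for entity extraction: precision and recall over the sets of
    extracted entities, combined by the harmonic mean as in the
    classification case."""
    actual, predicted = set(actual_entities), set(predicted_entities)
    tp = len(actual & predicted)  # entities both extracted and correct
    p3 = tp / len(predicted) if predicted else 0.0
    r3 = tp / len(actual) if actual else 0.0
    f3 = 2 * p3 * r3 / (p3 + r3) if p3 + r3 else 0.0
    return p3, r3, f3
```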
Optionally, performing data cleaning on the entity information in the target second data set to obtain the target data includes: step b1, cleaning the entity information in the target second data set according to preset rules, wherein the preset rules at least include filtering unreasonable data out of the plurality of pieces of entity information and standardizing the format of the plurality of pieces of entity information; step b2, randomly extracting one or more pieces of entity information from the cleaned entity information and calculating the qualification rate of the cleaning result; step b3, comparing the qualification rate with a fourth threshold; if the qualification rate is less than the fourth threshold, adding the fourth data that failed cleaning to the entity information and repeating steps b1-b3 until the qualification rate is greater than or equal to the fourth threshold, then outputting the entity information.
As shown in fig. 3, the data cleaning includes: periodically sampling the cleaning result, and not iterating the cleaning rule model for the time being when the accuracy (i.e., the qualification rate) reaches the threshold (i.e., the above fourth threshold), in which case the next operation begins; if the accuracy does not reach the threshold, sorting through the problem data and iterating the cleaning rules.
The data cleaning in this embodiment mainly further cleans the extracted entities; the main process can be summarized as filtering impurities, judging reasonableness, and standardizing formats. For example, when cleaning a Chinese "name" entity, all characters other than Chinese characters are impurities; when cleaning a time entity, a month appearing as "13" makes the time data unreasonable. As another example, a date can be expressed in various formats: "5/24/2020" is often also written in the "2020-05-24" format, and for the convenience of subsequent processes the format needs to be standardized and unified.
Alternatively, the accuracy of the cleaning result can be measured from two aspects: format uniformity on one hand and content reasonableness on the other. Format uniformity means the same kind of entity has the same format and conforms to the conventional rule, for example dates cleaned into the "XXXX-XX-XX" format; content reasonableness means the content of each cleaned field meets the specific requirements of that field, for example an age must lie in the range [0, 150]. The cleaned result is the resulting structured data.
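The three cleaning aspects (impurity filtering, reasonableness judgment, format standardization) can be sketched per field as below. The accepted date spellings and field helpers are illustrative assumptions, not an exhaustive rule set:

```python
import re
from datetime import datetime

def clean_name(raw):
    """Impurity filtering: keep only Chinese characters in a name field."""
    return "".join(re.findall(r"[\u4e00-\u9fff]", raw))

def clean_date(raw):
    """Format standardization: normalize common date spellings to the
    unified 'XXXX-XX-XX' format; return None for unreasonable values
    such as month 13, which no format accepts."""
    for fmt in ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

def clean_age(raw):
    """Reasonableness judgment: an age must fall in the range [0, 150]."""
    try:
        age = int(raw)
    except (TypeError, ValueError):
        return None
    return age if 0 <= age <= 150 else None
```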
In this embodiment, measuring the accuracy of the cleaning result also requires periodic spot checks; when the spot-check qualification rate is below the threshold, the cleaning strategy needs to be updated, adding cleaning rules for the incorrectly cleaned badcase data (i.e., the above fourth data).
In another optional embodiment of the present disclosure, after performing data cleaning on the entity information in the target second data set to obtain the target data, the method further includes: determining a second key feature in the target data; searching the target data for a third key feature related to the second key feature; searching, based on a knowledge graph, for a fourth key feature having a logical relation with the third key feature, wherein the fourth key feature is structured data; and supplementing the fourth key feature to the target data.
Optionally, after supplementing the fourth key feature to the target data, the method further includes: periodically updating the target data; and associating the fourth key feature with the second key feature in the updated target data.
As shown in fig. 3, after the accuracy of the data cleaning reaches the threshold, data supplementation is performed and the supplementation logic is updated periodically; the final data obtained is imported into the structured database.
In this embodiment, data supplementation often needs to be performed by means of a knowledge graph or other associated structured data, increasing the richness of the data by reasoning over the structured data; the cleaning result is logically supplemented according to the other dependencies of the data, so as to dig out as much of the explicit and implicit information in the data as possible.
For example: a person's work location (i.e., the above second key feature) may be supplemented based on the fact that the person's work unit is A (i.e., the above third key feature) and the address of unit A, looked up in the associated knowledge graph or table, is B (i.e., the above fourth key feature).
Optionally, the logic supplementation is implemented by a table join operation in the database.
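A minimal sketch of such logic supplementation as a table join, using the work-unit example above; the table and column names are illustrative, and sqlite3 stands in for whatever database holds the structured data:

```python
import sqlite3

# Persons know their work unit; an associated table knows each unit's address.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (name TEXT, work_unit TEXT);
    CREATE TABLE unit_address (unit TEXT, address TEXT);
    INSERT INTO person VALUES ('Zhang San', 'Unit A');
    INSERT INTO unit_address VALUES ('Unit A', 'City B');
""")

# Supplement each person's work location by joining on the work unit.
rows = conn.execute("""
    SELECT p.name, u.address AS work_location
    FROM person p JOIN unit_address u ON p.work_unit = u.unit
""").fetchall()
```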
In an alternative example of the present disclosure, taking text structuring as an example: the text before structuring (i.e., the above data to be processed) is "On May 1, 2020, Zhang San took a train and arrived at Nanjing South Station". Based on the keywords "took" and "arrived" (i.e., the above first key feature), the text undergoes first-level domain classification and is labeled with the rough classification label "transport travel". The text is then converted into a feature vector based on the classification model; the feature vector covers the full textual features of the text, according to which the roughly classified text is further finely classified (i.e., the above second-level domain classification) and labeled with the fine classification label "transport travel_railway travel". The information extraction model extracts the entity information "Zhang San", the departure date, the arrival station, and "took a train" from the text. The extracted entity information is then cleaned, the cleaning result including: the name "Zhang San", the departure date "2020-05-01", and the arrival station "Nanjing South". Finally, logical supplementation of the arrival station "Nanjing South" yields Zhang San's arrival city, "Nanjing".
Through the above embodiment, the flows of data classification, information extraction, and data cleaning are applied to varied unstructured data, forming a mature structured-data production line from construction to iteration. With these more complete flows, most of the steps in text structuring are modeled and modularized, saving labor and time cost and making the greatest use of the data.
Example 2
In this embodiment, a data processing apparatus is further provided. The apparatus is used to implement the foregoing embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram of a data processing apparatus according to an embodiment of the present invention, as shown in fig. 4, the apparatus including: the first classification module 402 is configured to perform first-level domain classification on data to be processed according to a first key feature included in the data to be processed, so as to obtain multiple groups of first data sets, where the data to be processed is unstructured data; a second classification module 404, connected to the first classification module 402, configured to convert the target first data set into a feature vector based on a classification model, and perform secondary domain classification on the target first data set according to the feature vector to obtain multiple groups of second data sets, where the target first data set is any one of the multiple groups of first data sets; an extracting module 406, connected to the second classifying module 404, configured to extract entity information in a target second data set based on an information extraction model, where the target second data set is any one of a plurality of sets of second data sets; a cleaning module 408, connected to the extracting module 406, configured to perform data cleaning on the entity information in the target second data set to obtain target data, where the target data is structured data; and an importing module 410 connected to the cleaning module 408 for importing the target data into the target structured database.
Optionally, the first classification module 402 is configured to: step A1, classify the data to be processed at the first level according to the first key feature to obtain a plurality of groups of first data sets; step A2, randomly extract one or more groups of first data sets from the plurality of groups, label them with one or more first-level class labels, and use them as a first verification set; step A3, calculate a first precision P1 and a first recall R1 of the first-level classification based on the actual class labels of the one or more groups of first data sets and the first verification set, wherein, for any group of first data sets with class label M, the precision is the ratio of the number of samples judged as label M whose actual label is also M to the number of samples judged as label M, and the recall is the ratio of the number of samples judged as label M whose actual label is also M to the number of samples actually labeled M in that group; step A4, calculate the accuracy F1 of the first-level classification by the formula F1 = 2P1R1/(P1 + R1); step A5, compare F1 with a first threshold; and if F1 is less than the first threshold, add the first data misclassified by the first-level classification to the plurality of groups of first data sets and repeat steps A1-A5 until F1 is greater than or equal to the first threshold, then output the labeled plurality of groups of first data sets.
Optionally, the second classification module 404 is configured to: step a1, numericize the target first data set with the classification model to obtain the corresponding feature vectors; step a2, classify the target first data set according to the feature vectors to obtain a plurality of groups of second data sets; step a3, randomly extract one or more groups of second data sets from the plurality of groups, label them with one or more second-level class labels, and use them as a second verification set; step a4, calculate a second precision P2 and a second recall R2 of the classification model based on the actual class labels of the one or more groups of second data sets and the second verification set; step a5, calculate the accuracy F2 of the classification model by the formula F2 = 2P2R2/(P2 + R2); step a6, compare F2 with a second threshold; and if F2 is less than the second threshold, add the second data misclassified by the classification model to the plurality of groups of second data sets and repeat steps a1-a6 until F2 is greater than or equal to the second threshold, then output the plurality of groups of second data sets.
Optionally, the extracting module 406 is configured to: step B1, extract a plurality of pieces of entity information from the target second data set through the information extraction model; step B2, label the plurality of pieces of entity information with a plurality of entity labels to obtain labeled entity information, and use the labeled entity information as a third verification set; step B3, calculate a third precision P3 and a third recall R3 of the information extraction model from the actual entity labels of the plurality of pieces of entity information and the third verification set; step B4, calculate a third accuracy F3 of the information extraction model by the formula F3 = 2P3R3/(P3 + R3); step B5, compare F3 with a third threshold; and if F3 is less than the third threshold, add the third data wrongly extracted by the information extraction model to the target data set and repeat steps B1-B5 until F3 is greater than or equal to the third threshold, then output the entity information in the target second data set.
Optionally, the cleaning module 408 is configured to: step b1, clean the entity information in the target second data set according to preset rules, wherein the preset rules at least include filtering unreasonable data out of the plurality of pieces of entity information and standardizing the format of the plurality of pieces of entity information; step b2, randomly extract one or more pieces of entity information from the cleaned entity information and calculate the qualification rate of the cleaning result; step b3, compare the qualification rate with a fourth threshold; and if the qualification rate is less than the fourth threshold, add the fourth data that failed cleaning to the entity information and repeat steps b1-b3 until the qualification rate is greater than or equal to the fourth threshold, then output the entity information.
Optionally, the apparatus further includes: a determining module, configured to determine a second key feature in the target data after the entity information in the target second data set is cleaned to obtain the target data; a first searching module, configured to search the target data for a third key feature related to the second key feature; a second searching module, configured to search, based on a knowledge graph, for a fourth key feature having a logical relation with the third key feature, wherein the fourth key feature is structured data; and an adding module, configured to supplement the fourth key feature to the target data.
Optionally, the apparatus further includes: an updating module, configured to periodically update the target data after the fourth key feature is supplemented to the target data; and an associating module, configured to associate the fourth key feature with the second key feature in the updated target data.
It should be noted that the above modules may be implemented by software or hardware; for the latter, the following implementations are possible but not limiting: the modules are all located in the same processor, or the modules are respectively located in different processors in any combination.
Example 3
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
S1, performing first-level domain classification on the data to be processed according to a first key feature contained in the data to be processed to obtain a plurality of groups of first data sets, wherein the data to be processed is unstructured data;
S2, converting a target first data set into feature vectors based on a classification model, and performing second-level domain classification on the target first data set according to the feature vectors to obtain a plurality of groups of second data sets, wherein the target first data set is any one of the plurality of groups of first data sets;
S3, extracting entity information from a target second data set based on an information extraction model, wherein the target second data set is any one of the plurality of groups of second data sets;
S4, performing data cleaning on the entity information in the target second data set to obtain target data, wherein the target data is structured data;
and S5, importing the target data into a target structured database.
Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium that can store a computer program.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, performing first-level domain classification on the data to be processed according to a first key feature contained in the data to be processed to obtain a plurality of groups of first data sets, wherein the data to be processed is unstructured data;
S2, converting a target first data set into feature vectors based on a classification model, and performing second-level domain classification on the target first data set according to the feature vectors to obtain a plurality of groups of second data sets, wherein the target first data set is any one of the plurality of groups of first data sets;
S3, extracting entity information from a target second data set based on an information extraction model, wherein the target second data set is any one of the plurality of groups of second data sets;
S4, performing data cleaning on the entity information in the target second data set to obtain target data, wherein the target data is structured data;
and S5, importing the target data into a target structured database.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device, and in some cases the steps shown or described may be performed in an order different from that described herein; or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A data processing method, comprising:
performing primary field classification on data to be processed according to first key features contained in the data to be processed to obtain a plurality of groups of first data sets, wherein the data to be processed is unstructured data;
converting a target first data set into a feature vector based on a classification model, and performing secondary domain classification on the target first data set according to the feature vector to obtain a plurality of groups of second data sets, wherein the target first data set is any one of the plurality of groups of first data sets;
extracting entity information in a target second data set based on an information extraction model, wherein the target second data set is any one of the multiple groups of second data sets;
performing data cleaning on entity information in the target second data set to obtain target data, wherein the target data are structured data;
and importing the target data into a target structured database.
2. The method according to claim 1, wherein performing a first-level domain classification on the data to be processed according to a first key feature included in the data to be processed to obtain a plurality of sets of first data sets, comprises:
step A1, performing primary classification on the data to be processed according to the first key features to obtain a plurality of groups of first data sets;
step A2, randomly extracting one or more first data sets from the plurality of first data sets, labeling one or more primary class labels for the one or more first data sets, and using the one or more first data sets as a first verification set;
step A3, calculating a first precision P1 and a first recall R1 of the primary classification based on the actual class labels of the one or more groups of first data sets and the first verification set, wherein, for any group of first data sets with class label M, the precision represents the ratio of the number of samples classified as label M whose actual label is also M to the total number of samples classified as label M, and the recall represents the ratio of the number of samples classified as label M whose actual label is also M to the number of samples actually labeled M in that group;
step A4, calculating an accuracy F1 of the primary classification by the following formula: F1 = 2P1R1/(P1 + R1);
step A5, comparing the F1 with a first threshold;
if F1 is less than the first threshold, adding the first data misclassified by the primary classification to the plurality of sets of first data sets, and looping the operations of steps A1-A5 until F1 is greater than or equal to the first threshold, then outputting the labeled plurality of sets of first data sets.
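The precision/recall/F1 check of steps A3–A5 can be sketched as below. The classifier output, labels, and threshold value are hypothetical; only the formulas follow the claim (P = correctly-predicted-M / predicted-as-M, R = correctly-predicted-M / actually-M, F1 = 2PR/(P + R)):

```python
def precision_recall(predicted, actual, label):
    """Precision and recall for one class label M, per step A3."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == label and a == label)
    pred_m = sum(1 for p in predicted if p == label)      # samples classified as M
    actual_m = sum(1 for a in actual if a == label)        # samples actually labeled M
    precision = tp / pred_m if pred_m else 0.0
    recall = tp / actual_m if actual_m else 0.0
    return precision, recall

def f1_score(precision, recall):
    """Step A4: F1 = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical first-level domain labels on a sampled verification set.
predicted = ["news", "news", "sports", "news"]
actual    = ["news", "sports", "sports", "news"]
p1, r1 = precision_recall(predicted, actual, "news")  # P1 = 2/3, R1 = 1.0
f1 = f1_score(p1, r1)                                 # 0.8
first_threshold = 0.9
needs_another_round = f1 < first_threshold            # step A5: loop A1-A5 again
```

When F1 falls below the threshold, the misclassified samples would be fed back into the training data and steps A1–A5 repeated, as the claim describes.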
3. The method of claim 1, wherein converting a target first data set into feature vectors based on a classification model, and performing two-level domain classification on the target first data set according to the feature vectors to obtain a plurality of groups of second data sets comprises:
step a1, performing digitization on the target first data set by using the classification model to obtain a feature vector corresponding to the target first data set;
step a2, performing secondary classification on the target first data set according to the feature vector to obtain a plurality of groups of second data sets;
step a3, randomly extracting one or more second data sets from the plurality of groups of second data sets, labeling one or more secondary class labels for the one or more second data sets, and using the one or more second data sets as a second verification set;
step a4, calculating a second accuracy P2 and a second recall R2 of the classification model based on the actual class labels of the one or more sets of second data sets and the second verification set;
step a5, calculating an accuracy F2 of the classification model by the following formula: F2 = 2P2R2/(P2 + R2);
step a6, comparing the F2 with a second threshold;
if F2 is less than the second threshold, adding the second data misclassified by the classification model to the plurality of sets of second data sets, and looping the operations of steps a1-a6 until F2 is greater than or equal to the second threshold, then outputting the plurality of sets of second data sets.
4. The method of claim 1, wherein extracting entity information in the target second data set based on the information extraction model comprises:
step B1, extracting a plurality of entity information items from the target second data set through the information extraction model;
step B2, labeling a plurality of entity labels for the plurality of entity information to obtain a plurality of labeled entity information, and taking the plurality of labeled entity information as a third verification set;
step B3, calculating a third precision P3 and a third recall R3 of the information extraction model according to the actual entity labels of the plurality of entity information and the third verification set;
step B4, calculating a third accuracy F3 of the information extraction model by the following formula: F3 = 2P3R3/(P3 + R3);
step B5, comparing the F3 with a third threshold;
if F3 is less than the third threshold, adding the third data incorrectly extracted by the information extraction model to the target second data set, and looping the operations of steps B1-B5 until F3 is greater than or equal to the third threshold, then outputting the entity information in the target second data set.
5. The method of claim 1, wherein performing data cleansing on the entity information in the target second data set to obtain target data comprises:
step b1, performing data cleaning on the plurality of entity information items in the target second data set according to preset rules, wherein the preset rules at least include: filtering unreasonable data from the entity information and standardizing the format of the entity information;
step b2, randomly extracting one or more entity information items from the cleaned entity information, and calculating a qualification rate of the cleaning result;
step b3, comparing the qualification rate with a fourth threshold;
if the qualification rate is less than the fourth threshold, adding the fourth data that failed cleaning to the entity information, and looping the operations of steps b1-b3 until the qualification rate is greater than or equal to the fourth threshold, then outputting the entity information.
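The cleaning loop of steps b1–b3 can be sketched as below. The concrete rules (whitespace stripping, date normalization), the sampling scheme, and the threshold value are hypothetical examples of "preset rules" and a "qualification rate":

```python
import random

def clean(entity):
    """Step b1 preset rules: standardize dates to YYYY-MM-DD, filter the rest."""
    value = entity.strip().replace("/", "-")
    parts = value.split("-")
    if len(parts) == 3 and all(p.isdigit() for p in parts):
        return "{:0>4}-{:0>2}-{:0>2}".format(*parts)
    return None  # unreasonable data is filtered out

def qualification_rate(cleaned, sample_size=3):
    """Step b2: randomly sample cleaned items and measure the pass rate."""
    sample = random.sample(cleaned, min(sample_size, len(cleaned)))
    ok = sum(1 for v in sample if v is not None)
    return ok / len(sample)

entities = [" 2020/07/22 ", "2020-7-2", "garbage"]
cleaned = [clean(e) for e in entities]
fourth_threshold = 0.9
# Step b3: if the rate is below the threshold, re-clean the failing
# records and loop b1-b3 until the rate reaches the threshold.
rate = qualification_rate(cleaned)
```

Here the sample covers all three records, so one of three fails and the rate (2/3) stays below the threshold, which is exactly the condition under which the claim loops back to step b1.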
6. The method of claim 1, wherein after performing data cleansing on the entity information in the target second data set to obtain target data, the method further comprises:
determining a second key feature in the target data;
searching for a third key feature associated with the second key feature in the target data;
searching a fourth key feature having a logical relation with the third key feature based on a knowledge graph, wherein the fourth key feature is structured data;
supplementing the fourth key feature to the target data.
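Claim 6 can be illustrated with a minimal lookup sketch. The knowledge graph, relation names, and record fields below are all hypothetical; a real knowledge graph would be a far richer store:

```python
# Hypothetical knowledge graph: (entity, relation) -> logically related
# structured feature (the "fourth key feature").
knowledge_graph = {
    ("Hangzhou", "province"): "Zhejiang",
}

def supplement(target_data, associations):
    """For each second key feature, find its associated third key feature in
    the target data, look up the fourth key feature in the knowledge graph,
    and supplement the target data with it."""
    for second_key, relation in associations.items():
        third_key = target_data.get(second_key)          # e.g. "city" -> "Hangzhou"
        fourth = knowledge_graph.get((third_key, relation))
        if fourth is not None:
            target_data[relation] = fourth               # supplement the record
    return target_data

record = {"name": "Dazhu", "city": "Hangzhou"}
record = supplement(record, {"city": "province"})
```

The supplemented field is itself structured data, so the enriched record can still be imported into the target structured database.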
7. The method of claim 6, wherein after supplementing the fourth key feature to the target data, the method further comprises:
periodically updating the target data;
associating the fourth key feature with the second key feature in the updated target data.
8. A data processing apparatus, comprising:
the first classification module is used for performing primary field classification on the data to be processed according to first key features contained in the data to be processed to obtain a plurality of groups of first data sets, wherein the data to be processed is unstructured data;
the second classification module is used for converting a target first data set into a feature vector based on a classification model, and performing secondary domain classification on the target first data set according to the feature vector to obtain a plurality of groups of second data sets, wherein the target first data set is any one of the plurality of groups of first data sets;
the extraction module is used for extracting entity information in a target second data set based on an information extraction model, wherein the target second data set is any one of the multiple groups of second data sets;
a cleaning module, configured to perform data cleaning on entity information in the target second data set to obtain target data, where the target data is structured data;
and the import module is used for importing the target data into a target structured database.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer storage medium on which a computer program is stored, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010712476.3A 2020-07-22 2020-07-22 Data processing method and device, computer equipment and storage medium Pending CN112035449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010712476.3A CN112035449A (en) 2020-07-22 2020-07-22 Data processing method and device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112035449A true CN112035449A (en) 2020-12-04

Family

ID=73582469



Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030037036A1 (en) * 2001-08-20 2003-02-20 Microsoft Corporation System and methods for providing adaptive media property classification
CN102214208A (en) * 2011-04-27 2011-10-12 百度在线网络技术(北京)有限公司 Method and equipment for generating structured information entity based on non-structured text
CN108197163A (en) * 2017-12-14 2018-06-22 上海银江智慧智能化技术有限公司 A kind of structuring processing method based on judgement document
CN108304911A (en) * 2018-01-09 2018-07-20 中国科学院自动化研究所 Knowledge Extraction Method and system based on Memory Neural Networks and equipment
CN110287785A (en) * 2019-05-20 2019-09-27 深圳壹账通智能科技有限公司 Text structure information extracting method, server and storage medium
CN111143536A (en) * 2019-12-30 2020-05-12 腾讯科技(深圳)有限公司 Information extraction method based on artificial intelligence, storage medium and related device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yu Liuyang; Guo Zhigang; Chen Gang; Xi Yaoyi: "A Survey of Knowledge Extraction Techniques for Knowledge Graph Construction", Journal of Information Engineering University, no. 02, pages 103-111 *
Deng Sanhong; Fu Yuyangzi; Wang Hao: "Research on Multi-label Classification of Chinese Books Based on an LSTM Model", Data Analysis and Knowledge Discovery, no. 07, pages 56-64 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560460A (en) * 2020-12-08 2021-03-26 北京百度网讯科技有限公司 Method and device for extracting structured information, electronic equipment and readable storage medium
CN112560460B (en) * 2020-12-08 2022-02-25 北京百度网讯科技有限公司 Method and device for extracting structured information, electronic equipment and readable storage medium
CN113468340A (en) * 2021-06-28 2021-10-01 北京众标智能科技有限公司 Construction system and construction method of industrial knowledge map
CN113468340B (en) * 2021-06-28 2024-05-07 北京众标智能科技有限公司 Construction system and construction method of industrial knowledge graph
CN113469374A (en) * 2021-09-02 2021-10-01 北京易真学思教育科技有限公司 Data prediction method, device, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination