CN108154179B - Data error detection method and system - Google Patents

Publication number
CN108154179B
Authority
CN
China
Prior art keywords
feature vector
dictionary
data
target data
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711417309.0A
Other languages
Chinese (zh)
Other versions
CN108154179A (en
Inventor
林文慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Runke General Technology Co Ltd
Original Assignee
Beijing Runke General Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Runke General Technology Co Ltd filed Critical Beijing Runke General Technology Co Ltd
Priority to CN201711417309.0A priority Critical patent/CN108154179B/en
Publication of CN108154179A publication Critical patent/CN108154179A/en
Application granted granted Critical
Publication of CN108154179B publication Critical patent/CN108154179B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/28Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data error detection method and system. The method comprises: determining each dictionary associated with target data according to a preset determination rule; performing dictionary processing on the target data according to the dictionaries to obtain at least one dictionary-based feature vector; and transmitting the dictionary-based feature vectors to a classifier model, trained according to a preset training rule, for error detection of the target data. Because dictionaries of the same dimension are used for the dictionary processing, the resulting dictionary-based feature vectors of the target data all have the same dimension when they are passed to the classifier model. This avoids the prior-art problem in which features of dimensions missing from the original data are set to zero, which reduces detection accuracy.

Description

Data error detection method and system
Technical Field
The invention relates to the technical field of data analysis and processing, in particular to a data error detection method and system.
Background
To ensure product performance, original data from the actual operation of a product is used to design and adjust the product during development. However, failures of the acquisition equipment, abnormal operation, and similar causes can introduce large deviations or outright errors into the original data during acquisition, so the data must undergo error detection and abnormal data must be filtered out to keep simulations faithful. In the existing approach, a Support Vector Machine (SVM) classifier model is designed according to the characteristics of the original data, features are extracted from the original data to train the model, and the trained classifier model then performs error detection on the data.
The inventor studied existing data error detection methods and found that a method based on an SVM classifier model requires input feature vectors of fixed dimensionality. The dimensionality of each piece of original data, however, is not fixed, so to satisfy the fixed-input-dimension condition, features of dimensions that are missing from the original data are set to zero. Although this simple processing satisfies the fixed-input-dimension condition, it lowers the detection accuracy.
Disclosure of Invention
In view of this, the present invention provides a data error detection method and system, so as to solve the problem that in the prior art, when data error detection is performed according to an SVM classifier model, in order to satisfy a condition that input dimensions are fixed, when features of some dimensions in original data are missing, features of corresponding dimensions are set to zero, which results in reduction of detection accuracy. The specific scheme is as follows:
a method of error detection of data, comprising:
determining each dictionary related to the target data according to a preset determination rule;
performing dictionary processing on the target data according to the dictionaries to obtain at least one dictionary feature vector;
and transmitting the dictionary-like feature vectors to a classifier model trained according to a preset training rule for error detection, so as to realize error detection of the target data.
In the foregoing method, preferably, the determining, according to a preset determination rule, each dictionary associated with the target data includes:
analyzing the target data to obtain each feature vector set document associated with the target data, wherein each feature vector set document comprises at least one feature vector;
performing clustering analysis on each feature vector set document to obtain a dictionary associated with the feature vector set document;
and when an ending instruction for the feature vector set document clustering analysis is received, obtaining each dictionary related to the target data.
In the foregoing method, preferably, the performing cluster analysis on each feature vector set document includes:
determining the number of clusters in the feature vector set document, and selecting, from the feature vector set document, as many feature vectors as there are clusters to serve as a first cluster center;
performing dissimilarity calculation on each feature vector in the feature vector set document and the feature vector contained in the first cluster center to obtain a first dissimilarity calculation result;
adjusting the feature vector of the first cluster center according to the first dissimilarity degree calculation result to obtain a second cluster center;
performing dissimilarity calculation on each feature vector in the feature vector set document and the feature vector contained in the second cluster center to obtain a second dissimilarity calculation result;
and when the difference value of the corresponding items of the first dissimilarity degree calculation result and the second dissimilarity degree calculation result is smaller than a preset analysis threshold value, the feature vector contained in the second cluster center is used as a root word, and a dictionary is constructed according to the root word.
In the foregoing method, preferably, the performing dictionary processing on the target data according to the dictionaries to obtain at least one dictionary-based feature vector includes:
analyzing at least one data document contained in the target data;
describing each data document according to each dictionary to obtain the dictionary-like feature vector corresponding to the data document;
and when an instruction for completing the description of the data document is detected, obtaining at least one dictionary feature vector.
In the above method, preferably, the process of training the classifier model according to the preset training rule includes:
taking the dictionary-based feature vector as a training sample;
establishing a data classifier model according to a Support Vector Machine (SVM);
and transmitting the training sample to the data classifier model, and finishing the training of the classifier model when the accuracy of the error detection result output by the data classifier model reaches a preset distinguishing threshold value.
An error detection system for data, comprising:
the determining module is used for determining each dictionary related to the target data according to a preset determining rule;
the dictionary module is used for performing dictionary processing on the target data according to each dictionary to obtain at least one dictionary feature vector;
and the error detection module is used for transmitting the dictionary-based feature vectors to a classifier model trained according to a preset training rule to perform error detection, so that the error detection of the target data is realized.
In the above system, preferably, the determining module includes:
the first analysis unit is used for analyzing the target data to obtain each feature vector set document associated with the target data, and each feature vector set document comprises at least one feature vector;
the cluster analysis unit is used for carrying out cluster analysis on each feature vector set document to obtain a dictionary related to the feature vector set document;
and the first obtaining unit is used for obtaining each dictionary related to the target data when receiving an end instruction of the feature vector set document clustering analysis.
In the above system, preferably, the cluster analysis unit includes:
a selecting subunit, configured to determine the number of clusters in the feature vector set document, and select, from the feature vector set document, as many feature vectors as there are clusters to serve as a first cluster center;
the first calculating subunit is configured to perform dissimilarity calculation on each feature vector in the feature vector set document and a feature vector included in the first cluster center, respectively, to obtain a first dissimilarity calculation result;
the adjusting subunit is configured to adjust the feature vector of the first cluster center according to the first dissimilarity calculation result to obtain a second cluster center;
the second calculating subunit is configured to perform dissimilarity calculation on each feature vector in the feature vector set document and a feature vector included in the second cluster center, so as to obtain a second dissimilarity calculation result;
and the constructing subunit is used for taking the feature vector contained in the second cluster center as a root word and constructing a dictionary according to the root word when the difference value of the corresponding terms of the first dissimilarity degree calculation result and the second dissimilarity degree calculation result is smaller than a preset analysis threshold value.
In the above system, preferably, the dictionary module includes:
the second analysis unit is used for analyzing at least one data document contained in the target data;
the describing unit is used for describing each data document according to each dictionary to obtain the dictionary-like feature vector corresponding to the data document;
and the second obtaining unit is used for obtaining at least one dictionary-like feature vector when an instruction for completing the description of the data document is detected.
In the above system, preferably, the process of training the classifier model according to the preset training rule includes:
the determining unit is used for taking the dictionary-based feature vector as a training sample;
the establishing unit is used for establishing a data classifier model according to the support vector machine SVM;
and the training unit is used for transmitting the training samples to the data classifier model, and finishing the training of the classifier model when the accuracy of the error detection result output by the data classifier model reaches a preset distinguishing threshold value.
Compared with the prior art, the invention has the following advantages:
the invention discloses a data error detection method and a system, wherein the error detection method comprises the following steps: determining each dictionary related to the target data according to a preset determination rule; performing dictionary processing on the target data according to the dictionaries to obtain at least one dictionary feature vector; and transmitting the dictionary-like feature vectors to a classifier model trained according to a preset training rule for error detection, so as to realize error detection of the target data. According to the error detection method and system provided by the invention, the dictionary with the same dimension is adopted to carry out dictionary processing on the target data, a plurality of dictionary characteristic vectors with the same dimension corresponding to the target data are obtained, and the dictionary characteristic vectors are transmitted to the classifier model to carry out error detection, so that the problem that in the prior art, when the characteristics of some dimensions in the original data are missing, the characteristics of the corresponding dimensions are set to zero, and the detection accuracy is reduced is solved.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for detecting errors in data according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of another method for error detection of data according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of another method for error detection of data according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of another method for error detection of data according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of an error detection system for data according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of another structure of an error detection system for data according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention discloses a data error detection method that can be applied to error detection of original data such as running data, experimental data, and working-condition data. Because such data is collected from actual operation, it may contain abnormal data or fault data. In the embodiment of the present invention, an error detection method for aircraft data is taken as an example, where the aircraft may be a rocket, an airplane, or the like, and the aircraft data may be wind-tunnel (blowing) data, flight experiment data, or the like. The flow of the error detection method is shown in FIG. 1 and includes the following steps:
S101, determining each dictionary associated with the target data according to a preset determination rule;
In the embodiment of the present invention, the target data refers to original data and includes a plurality of data documents. Each data document contains a plurality of pieces of original data, and the dimensions of the data documents may be the same or different. Each dictionary associated with the target data comprises a plurality of roots, and each root is a feature vector of a preset dimension.
S102, performing dictionary processing on the target data according to the dictionaries to obtain at least one dictionary-based feature vector;
In the embodiment of the present invention, dictionary processing means representing each data document in the target data by the dictionary-based feature vector corresponding to the dictionaries; in general, one data document corresponds to one dictionary-based feature vector.
S103, transmitting the dictionary-like feature vectors to a classifier model trained according to a preset training rule for error detection, and realizing error detection of the target data.
In the embodiment of the invention, the classifier model is a two-classification model obtained by training according to a preset training rule, so the output of the classifier model can be normal or fault.
The invention discloses a data error detection method, which comprises the following steps: determining each dictionary related to target data according to a preset determination rule, performing dictionary processing on the target data according to each dictionary to obtain at least one dictionary-like feature vector, transmitting the dictionary-like feature vector to a classifier model trained according to a preset training rule for error detection, and realizing error detection on the target data. According to the error detection method, the dictionary with the same dimension is adopted to carry out dictionary processing on the target data, the dictionary-like feature vectors with the same dimension corresponding to the target data are obtained, and the dictionary-like feature vectors are transmitted to the classifier model to carry out error detection, so that the problem that in the prior art, when some dimensions of features in original data are missing, the features of the corresponding dimensions are set to be zero, and the detection accuracy is reduced is solved.
In the embodiment of the present invention, a flow of a method for determining each dictionary associated with target data according to a preset determination rule is shown in fig. 2, and the method includes the steps of:
S201, analyzing the target data to obtain each feature vector set document associated with the target data, wherein each feature vector set document comprises at least one feature vector;
In the embodiment of the invention, 1000 or more labeled original data documents of a certain type of aircraft are used as a training data set, where each label is either fault or normal, and feature vector sets are extracted from the original data. The specific implementation is as follows:
Randomly extract 8 columns from one original data document and create a new feature vector set document TrainData_x, recording the titles of those columns as TitleTrain_x. To exploit the temporal information provided by context, every five consecutive data rows are combined into one 5 × 8-dimensional feature vector.
Poll the remaining original data documents in turn by title: for each document that contains the title columns in TitleTrain_x, extract the matching data rows and append them to the end of the TrainData_x document. Once all original data documents have been queried and processed, the feature vector set document TrainData_x is complete.
Repeat the above generation process to generate M feature vector set documents. When randomly selecting the 8 columns, the same set of 8 columns must not be selected twice; otherwise two identical feature vector set documents would be generated.
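The requirement that no set of 8 columns be chosen twice can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `pick_column_sets`, the fixed seed, and the example column titles are assumptions introduced here.

```python
import math
import random


def pick_column_sets(titles, m, width=8, seed=0):
    """Pick m distinct sets of `width` column titles from `titles`."""
    if m > math.comb(len(titles), width):
        raise ValueError("not enough distinct column sets available")
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    chosen = []
    while len(chosen) < m:
        # Sort so that the same columns drawn in another order compare equal.
        candidate = tuple(sorted(rng.sample(titles, width)))
        if candidate not in chosen:
            chosen.append(candidate)
    return chosen


# Hypothetical column titles; real titles come from the original data documents.
titles = [f"col{i}" for i in range(12)]
column_sets = pick_column_sets(titles, m=5)
```

Each returned tuple would play the role of one TitleTrain_x, so M distinct tuples lead to M distinct feature vector set documents.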
In the embodiment of the present invention, the 5 × 8 dimension of the feature vector data is not mandatory; the dimension is estimated from the number of original data documents and the content of those documents. Moreover, because the dimension of each original data document is not fixed, a document cannot always be segmented evenly, and leftover data that does not fill a complete segment is discarded.
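The grouping of rows into fixed-size feature vectors, including the discarding of leftover rows, might look like the sketch below. The function name `make_feature_vectors` and the synthetic 13-row input are assumptions for illustration; each 5 × 8 block is flattened into a 40-element list.

```python
def make_feature_vectors(rows, window=5):
    """Group consecutive rows into flattened window-by-columns vectors.

    Rows that do not fill a complete window are discarded.
    """
    usable = len(rows) - len(rows) % window
    vectors = []
    for start in range(0, usable, window):
        block = rows[start:start + window]
        vectors.append([value for row in block for value in row])
    return vectors


# 13 synthetic rows of 8 columns: two full windows, 3 leftover rows dropped.
rows = [[float(r * 8 + c) for c in range(8)] for r in range(13)]
vectors = make_feature_vectors(rows)
```

With 13 input rows the sketch yields two 40-element vectors and drops the last three rows, matching the rule that incompletely segmented data is discarded.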
S202, performing clustering analysis on each feature vector set document to obtain a dictionary associated with the feature vector set document;
In the embodiment of the invention, clustering uses the K-Means algorithm, which takes as input the number of clusters K and a database of n data objects and outputs K clusters that satisfy the minimum-variance criterion. The algorithm receives the input K and then divides the n data objects into K clusters such that objects in the same cluster are highly similar while objects in different clusters are less similar.
In the embodiment of the present invention, the feature vector set document TrainData_x obtained in S201 is denoted {X1, X2, ..., Xn}, where each Xi is a 5 × 8-dimensional vector.
K-Means clustering divides the feature vector set document into K classes S = {S1, S2, ..., SK}, where S denotes the set of clusters corresponding to the feature vector set document and K denotes the number of clusters in the feature vector set document. Preferably, the number of clusters equals the number of feature vectors in the first cluster center: K feature vectors are randomly selected from the feature vector set, one as the center of each cluster in the set of clusters, and together these K feature vectors form the first cluster center.
And respectively calculating the dissimilarity degree of all the feature vectors to the center of the first cluster, and classifying each feature vector into the cluster with the lowest dissimilarity degree. The degree of dissimilarity is characterized by the Euclidean distance, and the smaller the distance, the more similar the two are.
The Euclidean distance is calculated as follows, where X = {x1, x2, ..., xk} denotes an arbitrary feature vector and S1 = {μ1, μ2, ..., μk} denotes the center of any cluster:
$$D = \sqrt{\sum_{i=1}^{k} (x_i - \mu_i)^2}$$
D denotes the first dissimilarity. Based on the first dissimilarity calculation result, the centers of the K clusters are recalculated and updated: each new center is the arithmetic mean, dimension by dimension, of all vectors in the cluster, which yields the second cluster center.
The dissimilarity of every feature vector to the second cluster center is then calculated, and each vector is assigned to the cluster with the lowest dissimilarity, giving the second dissimilarity calculation result.
This calculation is repeated. When the difference between corresponding items of the first dissimilarity calculation result and the second dissimilarity calculation result is smaller than a preset analysis threshold, the feature vectors contained in the second cluster center are used as roots, and a dictionary is constructed from the roots.
The dictionary comprises K cluster centers, each of which is a 5 × 8-dimensional feature vector; these are the K basic roots of the dictionary model.
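The clustering loop described above can be sketched in plain Python as follows. This is an illustrative reading of the text rather than the patented implementation: the names `euclidean` and `build_dictionary`, the seed, the `max_iter` safeguard, and the handling of empty clusters (the old center is kept) are assumptions. The stopping test compares, item by item, each vector's distance to its nearest center in two consecutive rounds.

```python
import math
import random


def euclidean(x, mu):
    """Euclidean distance between a feature vector and a cluster center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mu)))


def build_dictionary(vectors, k, threshold=1e-6, seed=0, max_iter=100):
    """Return k cluster centers (the roots of one dictionary)."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]  # first cluster center
    previous = None
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        dissimilarity = []
        for v in vectors:
            dists = [euclidean(v, c) for c in centers]
            best = dists.index(min(dists))  # cluster with lowest dissimilarity
            clusters[best].append(v)
            dissimilarity.append(dists[best])
        # Stop once corresponding items of two consecutive results agree.
        if previous is not None and all(
            abs(a - b) < threshold for a, b in zip(previous, dissimilarity)
        ):
            break
        previous = dissimilarity
        # New center: per-dimension arithmetic mean of the cluster members.
        centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else center
            for members, center in zip(clusters, centers)
        ]
    return centers


# Tiny 2-dimensional example; real roots would be 5 x 8 = 40-dimensional.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
roots = sorted(build_dictionary(points, k=2))
```

Running the function once per feature vector set document would give the M dictionaries referred to in S203.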
S203, when an ending instruction for the feature vector set document clustering analysis is received, obtaining dictionaries associated with the target data.
In the embodiment of the present invention, each feature vector document in the M feature vector set documents generated in S201 is subjected to cluster analysis, so as to obtain M dictionaries associated with the target data.
In the embodiment of the present invention, a flow of a method for performing dictionary processing on the target data according to each dictionary to obtain at least one dictionary-like feature vector is shown in fig. 3, and includes the steps of:
S301, analyzing at least one data document contained in the target data;
S302, describing each data document according to each dictionary to obtain the dictionary-based feature vector corresponding to the data document;
In the embodiment of the invention, each data document in the target data is described with the generated M dictionaries in turn.
Suppose a data document is described using the current dictionary, which is any one of the M dictionaries. If the 8 title columns of the data document all match the 8 title columns of the dictionary, the matching column data is extracted and every 5 rows form one 5 × 8-dimensional feature vector. The dissimilarity between each generated feature vector and each of the K roots (5 × 8-dimensional feature vectors) in the current dictionary is calculated, the feature vector is assigned to the root with the smallest dissimilarity, and the count of that root is incremented by 1.
After all feature vectors of the data document have been processed, a one-dimensional statistical array of K elements is generated, in which the value of each element is the count of the corresponding root; this array is the description of the data document with respect to one dictionary. If the 8 title columns of the data document do not completely match the 8 title columns of the dictionary, all elements of the data document's statistical array for the current dictionary are 0.
After each data document has been described by the M dictionaries, the generated dictionary description vectors are concatenated into an M × K-dimensional feature vector, which is the dictionary-based feature vector of the data document.
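The description step can be sketched as a root-count histogram per dictionary, concatenated across dictionaries. The data layout is an assumption made for illustration: a document is a mapping from column title to a list of values, and each dictionary is a `(titles, roots)` pair. The names `nearest_root` and `describe` are hypothetical, and the example uses a window of 2 rows and 2 columns to stay small.

```python
import math


def nearest_root(vector, roots):
    """Index of the root with the smallest Euclidean distance to `vector`."""
    dists = [math.dist(vector, root) for root in roots]
    return dists.index(min(dists))


def describe(document, dictionaries, window=5):
    """Concatenate one root-count array per dictionary (M x K values in total)."""
    description = []
    for titles, roots in dictionaries:
        counts = [0] * len(roots)
        # A dictionary whose title columns are not all present contributes zeros.
        if all(title in document for title in titles):
            rows = list(zip(*(document[title] for title in titles)))
            usable = len(rows) - len(rows) % window
            for start in range(0, usable, window):
                vector = [v for row in rows[start:start + window] for v in row]
                counts[nearest_root(vector, roots)] += 1
        description.extend(counts)
    return description


document = {"a": [0, 0, 10, 10, 5], "b": [0, 0, 10, 10, 5]}
dictionaries = [
    (["a", "b"], [[0, 0, 0, 0], [10, 10, 10, 10]]),  # both title columns present
    (["a", "c"], [[0, 0, 0, 0], [1, 1, 1, 1]]),      # column "c" is missing
]
feature = describe(document, dictionaries, window=2)
```

Every document yields a vector of the same length regardless of which columns it contains, which is the property the classifier needs.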
S303, when an instruction for completing the description of the data document is detected, obtaining at least one dictionary-based feature vector.
In the embodiment of the present invention, each feature vector document in the M feature vector set documents generated in S201 is described according to each dictionary, so as to obtain M dictionary-like feature vectors associated with the target data.
In the embodiment of the invention, the dictionary-like feature vector is transmitted to a classifier model trained according to a preset training rule for error detection, so that the error detection of the target data is realized. Fig. 4 shows a flow of a method for training a classifier model according to a preset training rule, which includes:
S401, taking the dictionary-based feature vector as a training sample;
S402, establishing a data classifier model according to a Support Vector Machine (SVM);
S403, transmitting the training sample to the data classifier model, and finishing the training of the classifier model when the accuracy of the error detection result output by the data classifier model reaches a preset distinguishing threshold.
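Steps S401 to S403 can be illustrated with a small linear SVM trained by hinge-loss sub-gradient descent until the training accuracy reaches a threshold. The source does not specify the kernel, solver, or hyperparameters, so everything below is an assumption made for illustration: the linear model, the learning rate, the regularization weight, the toy samples, and the encoding of normal as +1 and fault as -1. A production system would more likely rely on an established SVM library.

```python
def predict(weights, bias, x):
    """Return +1 (normal) or -1 (fault) for one dictionary-based feature vector."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


def accuracy(weights, bias, samples, labels):
    hits = sum(predict(weights, bias, x) == y for x, y in zip(samples, labels))
    return hits / len(samples)


def train_linear_svm(samples, labels, target_accuracy=1.0,
                     learning_rate=0.1, reg=0.01, max_epochs=1000):
    """Train until training accuracy reaches target_accuracy or epochs run out."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(w * v for w, v in zip(weights, x)) + bias)
            # Regularization shrink is applied on every step.
            weights = [w - learning_rate * reg * w for w in weights]
            if margin < 1:
                # Margin violated: move the boundary toward the correct side.
                weights = [w + learning_rate * y * v for w, v in zip(weights, x)]
                bias += learning_rate * y
        if accuracy(weights, bias, samples, labels) >= target_accuracy:
            break
    return weights, bias


# Toy dictionary-based feature vectors (root counts) with their labels.
samples = [[2, 0], [3, 1], [0, 2], [1, 3]]
labels = [1, 1, -1, -1]
weights, bias = train_linear_svm(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

Here `target_accuracy` stands in for the preset distinguishing threshold; measuring it on held-out data instead of the training samples would be a reasonable variation.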
In the embodiment of the present invention, corresponding to the error detection method of the data, the present invention further provides an error detection system of data, where a structural block diagram of the error detection system of data is shown in fig. 5, and the error detection system of data includes:
a determining module 501, a dictionary module 502, and an error detection module 503.
Wherein:
the determining module 501 is configured to determine, according to a preset determining rule, each dictionary associated with the target data;
the dictionary module 502 is configured to perform dictionary processing on the target data according to the dictionaries to obtain at least one dictionary feature vector;
the error detection module 503 is configured to transmit the dictionary-based feature vector to a classifier model trained according to a preset training rule to perform error detection, so as to implement error detection on the target data.
The invention discloses an error detection system of data, which comprises: determining each dictionary related to target data according to a preset determination rule, performing dictionary processing on the target data according to each dictionary to obtain at least one dictionary-like feature vector, transmitting the dictionary-like feature vector to a classifier model trained according to a preset training rule for error detection, and realizing error detection on the target data. According to the error detection system, the dictionary with the same dimensionality is adopted to conduct dictionary processing on the target data, dictionary feature vectors with the same dimensionality corresponding to the target data are obtained, the dictionary feature vectors are transmitted to the classifier model to conduct error detection, and the problem that in the prior art, when the features of certain dimensionality in original data are missing, the features of the corresponding dimensionality are set to be zero, and detection accuracy is reduced is avoided.
In this embodiment of the present invention, a block diagram of the determining module 501 is shown in fig. 6, and includes:
a first analysis unit 504, a cluster analysis unit 505 and a first obtaining unit 506.
Wherein:
the first analysis unit 504 is configured to analyze the target data to obtain feature vector set documents associated with the target data, where each feature vector set document includes at least one feature vector;
the cluster analysis unit 505 is configured to perform cluster analysis on each feature vector set document to obtain a dictionary associated with the feature vector set document;
the first obtaining unit 506 is configured to obtain, when an end instruction for performing cluster analysis on the feature vector set documents is received, each dictionary associated with the target data.
In this embodiment of the present invention, a block diagram of the structure of the cluster analysis unit 505 is shown in fig. 6, and includes:
a selecting subunit 507, a first calculating subunit 508, an adjusting subunit 509, a second calculating subunit 510, and a constructing subunit 511.
Wherein,
the selecting subunit 507 is configured to determine the number of clusters in the feature vector set document, and to select from the feature vector set document the same number of feature vectors as there are clusters, the selected feature vectors serving as a first cluster center;
the first calculating subunit 508 is configured to perform dissimilarity calculation on each feature vector in the feature vector set document and the feature vector included in the first cluster center, respectively, to obtain a first dissimilarity calculation result;
the adjusting subunit 509 is configured to adjust the feature vector of the first cluster center according to the first dissimilarity calculation result, so as to obtain a second cluster center;
the second calculating subunit 510 is configured to perform dissimilarity calculation on each feature vector in the feature vector set document and the feature vector included in the second cluster center, respectively, to obtain a second dissimilarity calculation result;
the constructing subunit 511 is configured to, when the difference between corresponding terms of the first dissimilarity calculation result and the second dissimilarity calculation result is smaller than a preset analysis threshold, use the feature vector included in the second cluster center as a root word, and construct a dictionary according to the root word.
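As a non-limiting illustration, the cluster analysis carried out by subunits 507 to 511 may be sketched in Python as follows. The sketch follows the familiar k-means pattern. The use of Euclidean distance as the dissimilarity measure, the adjustment of each center to the mean of its assigned vectors, the random initial selection, and all function and parameter names (`build_dictionary`, `threshold`, `max_rounds`, `seed`) are assumptions made for illustration; they are not specified in the embodiment.

```python
import math
import random


def dissimilarity(a, b):
    # Euclidean distance; the choice of measure is an assumption of this sketch.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centers(vectors, centers):
    # For every feature vector: (index of its nearest center, dissimilarity to it).
    results = []
    for vec in vectors:
        dists = [dissimilarity(vec, c) for c in centers]
        best = min(range(len(centers)), key=dists.__getitem__)
        results.append((best, dists[best]))
    return results


def adjust_centers(vectors, centers, results):
    # Move each center to the mean of the vectors currently assigned to it.
    adjusted = []
    for idx, center in enumerate(centers):
        members = [v for v, (best, _) in zip(vectors, results) if best == idx]
        if not members:
            adjusted.append(list(center))  # keep a center that attracted no vectors
            continue
        dim = len(center)
        adjusted.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return adjusted


def build_dictionary(vectors, num_clusters, threshold=1e-6, max_rounds=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(vectors, num_clusters)             # first cluster center
    first = nearest_centers(vectors, centers)               # first dissimilarity result
    for _ in range(max_rounds):
        centers = adjust_centers(vectors, centers, first)   # second cluster center
        second = nearest_centers(vectors, centers)          # second dissimilarity result
        # Stop when every pair of corresponding terms differs by less than the threshold.
        if all(abs(d1 - d2) < threshold for (_, d1), (_, d2) in zip(first, second)):
            break
        first = second
    return centers  # the final centers serve as the root words of the dictionary


if __name__ == "__main__":
    feature_vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
    print(build_dictionary(feature_vectors, num_clusters=2))
```

In this sketch the number of root words equals the number of clusters, so every dictionary built with the same `num_clusters` has the same size.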
In this embodiment of the present invention, a block diagram of the structure of the dictionary module 502 is shown in fig. 6, and includes:
a second parsing unit 512, a description unit 513 and a second obtaining unit 514.
Wherein,
the second parsing unit 512 is configured to parse at least one data document included in the target data;
the description unit 513 is configured to describe each data document according to each dictionary, so as to obtain the dictionary-based feature vector corresponding to the data document;
the second obtaining unit 514 is configured to obtain at least one dictionary-based feature vector when an instruction indicating that description of the data documents is complete is detected.
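One possible way for the description unit 513 to describe a data document according to a dictionary is sketched below in Python. The embodiment does not fix the description rule; this sketch assumes a normalized histogram in which each feature vector of the document is counted toward its nearest root word. The function names `dissimilarity` and `describe`, the Euclidean measure, and the sample values are hypothetical.

```python
import math


def dissimilarity(a, b):
    # Euclidean distance; an assumption of this sketch.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def describe(document, dictionary):
    # Count, for each root word, how many feature vectors of the document
    # lie nearest to it, then normalize the counts by the document length.
    counts = [0] * len(dictionary)
    for vec in document:
        nearest = min(range(len(dictionary)),
                      key=lambda i: dissimilarity(vec, dictionary[i]))
        counts[nearest] += 1
    total = len(document)
    if total == 0:
        return [0.0] * len(dictionary)
    return [c / total for c in counts]


if __name__ == "__main__":
    dictionary = [[0.0, 0.0], [10.0, 10.0]]           # two root words
    long_doc = [[0.0, 1.0], [1.0, 0.0], [9.0, 9.0]]   # three feature vectors
    short_doc = [[10.0, 11.0]]                        # one feature vector
    print(describe(long_doc, dictionary))
    print(describe(short_doc, dictionary))
```

Under this assumption, documents containing different numbers of feature vectors are all mapped to vectors whose length equals the number of root words, which is consistent with the same-dimensionality property described for the system.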
In this embodiment of the present invention, a block diagram of the error detection module 503 is shown in fig. 6, and includes:
a determining unit 515, an establishing unit 516 and a training unit 517.
Wherein,
the determining unit 515 is configured to use the dictionary-based feature vectors as training samples;
the establishing unit 516 is configured to establish a data classifier model according to a support vector machine SVM;
the training unit 517 is configured to transmit the training samples to the data classifier model, and to complete training of the classifier model when the accuracy of the error detection result output by the data classifier model reaches a preset distinguishing threshold.
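The training rule applied by units 515 to 517 may be illustrated by the following Python sketch. It is a simplified stand-in rather than the embodiment's implementation: it trains a linear SVM by sub-gradient descent on the hinge loss and stops once the accuracy on the training samples reaches the preset threshold. The label convention (+1 for data without error, -1 for erroneous data), the hyperparameters `lr`, `reg` and `max_epochs`, the function names, and the sample values are all assumptions.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def predict(w, b, x):
    # +1: data without error, -1: erroneous data (label convention assumed here).
    return 1 if dot(w, x) + b >= 0 else -1


def accuracy(w, b, samples, labels):
    hits = sum(1 for x, y in zip(samples, labels) if predict(w, b, x) == y)
    return hits / len(samples)


def train_svm(samples, labels, accuracy_threshold=0.95,
              lr=0.1, reg=0.01, max_epochs=1000):
    # Linear SVM trained by sub-gradient descent on the regularized hinge loss.
    w = [0.0] * len(samples[0])
    b = 0.0
    acc = 0.0
    for _ in range(max_epochs):
        for x, y in zip(samples, labels):
            margin = y * (dot(w, x) + b)
            w = [wi * (1 - lr * reg) for wi in w]      # L2 regularization step
            if margin < 1:                             # sample violates the margin
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
        acc = accuracy(w, b, samples, labels)
        if acc >= accuracy_threshold:                  # preset threshold reached
            break
    return w, b, acc


if __name__ == "__main__":
    # Hypothetical dictionary-based feature vectors and labels.
    samples = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
               [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
    labels = [1, 1, 1, -1, -1, -1]
    w, b, acc = train_svm(samples, labels)
    print(acc, predict(w, b, [0.85, 0.15]), predict(w, b, [0.15, 0.85]))
```

In practice the accuracy would more commonly be measured on held-out samples rather than on the training samples themselves; the sketch uses the training samples only to keep the example short.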
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to each other. Because the system embodiment is substantially similar to the method embodiment, it is described only briefly; for relevant details, reference may be made to the corresponding description of the method embodiment.
Finally, it should be further noted that, in the present application, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (6)

1. A method for error detection of data, comprising:
determining each dictionary related to the target data according to a preset determination rule;
performing dictionary processing on the target data according to the dictionaries to obtain at least one dictionary feature vector;
transmitting the dictionary-based feature vector to a classifier model trained according to a preset training rule for error detection, so as to realize error detection of the target data;
wherein, according to a preset determination rule, determining each dictionary associated with the target data comprises:
analyzing the target data to obtain each feature vector set document associated with the target data, wherein each feature vector set document comprises at least one feature vector;
performing clustering analysis on each feature vector set document to obtain a dictionary associated with the feature vector set document;
when an ending instruction for the feature vector set document clustering analysis is received, obtaining dictionaries associated with the target data;
wherein performing cluster analysis on each of the feature vector set documents comprises:
determining the number of clusters in the feature vector set document, and selecting feature vectors with the same number as the clusters from the feature vector set document as a first cluster center;
performing dissimilarity calculation on each feature vector in the feature vector set document and the feature vector contained in the first cluster center to obtain a first dissimilarity calculation result;
adjusting the feature vector of the first cluster center according to the first dissimilarity degree calculation result to obtain a second cluster center;
performing dissimilarity calculation on each feature vector in the feature vector set document and the feature vector contained in the second cluster center to obtain a second dissimilarity calculation result;
and when the difference value of the corresponding items of the first dissimilarity degree calculation result and the second dissimilarity degree calculation result is smaller than a preset analysis threshold value, the feature vector contained in the second cluster center is used as a root word, and a dictionary is constructed according to the root word.
2. The method of claim 1, wherein performing dictionary processing on the target data according to the dictionaries to obtain at least one dictionary feature vector comprises:
analyzing at least one data document contained in the target data;
describing each data document according to each dictionary to obtain the dictionary-like feature vector corresponding to the data document;
and when an instruction for completing the description of the data document is detected, obtaining at least one dictionary feature vector.
3. The method of claim 1, wherein the training of the classifier model according to the predetermined training rules comprises:
taking the dictionary-based feature vector as a training sample;
establishing a data classifier model according to a Support Vector Machine (SVM);
and transmitting the training sample to the data classifier model, and finishing the training of the classifier model when the accuracy of the error detection result output by the data classifier model reaches a preset distinguishing threshold value.
4. An error detection system for data, comprising:
the determining module is used for determining each dictionary related to the target data according to a preset determining rule;
the dictionary module is used for performing dictionary processing on the target data according to each dictionary to obtain at least one dictionary feature vector;
the error detection module is used for transmitting the dictionary-like feature vector to a classifier model trained according to a preset training rule for error detection, so that the error detection of the target data is realized;
wherein the determining module comprises:
the first analysis unit is used for analyzing the target data to obtain each feature vector set document associated with the target data, and each feature vector set document comprises at least one feature vector;
the cluster analysis unit is used for carrying out cluster analysis on each feature vector set document to obtain a dictionary related to the feature vector set document;
a first obtaining unit, configured to obtain, when an end instruction for performing cluster analysis on the feature vector set documents is received, each dictionary associated with the target data;
wherein the cluster analysis unit includes:
a selecting subunit, configured to determine the number of clusters in the feature vector set document, and select, as a first cluster center, a feature vector with the same number as the clusters from the feature vector set document;
the first calculating subunit is configured to perform dissimilarity calculation on each feature vector in the feature vector set document and a feature vector included in the first cluster center, respectively, to obtain a first dissimilarity calculation result;
the adjusting subunit is configured to adjust the feature vector of the first cluster center according to the first dissimilarity calculation result to obtain a second cluster center;
the second calculating subunit is configured to perform dissimilarity calculation on each feature vector in the feature vector set document and a feature vector included in the second cluster center, so as to obtain a second dissimilarity calculation result;
and the constructing subunit is used for taking the feature vector contained in the second cluster center as a root word and constructing a dictionary according to the root word when the difference value of the corresponding terms of the first dissimilarity degree calculation result and the second dissimilarity degree calculation result is smaller than a preset analysis threshold value.
5. The system of claim 4, wherein the dictionary module comprises:
the second analysis unit is used for analyzing at least one data document contained in the target data;
the describing unit is used for describing each data document according to each dictionary to obtain the dictionary-like feature vector corresponding to the data document;
and the second obtaining unit is used for obtaining at least one dictionary-like feature vector when an instruction for completing the description of the data document is detected.
6. The system of claim 4, wherein the process of training the classifier model according to the preset training rules comprises:
the determining unit is used for taking the dictionary-based feature vector as a training sample;
the establishing unit is used for establishing a data classifier model according to the support vector machine SVM;
and the training unit is used for transmitting the training samples to the data classifier model, and finishing the training of the classifier model when the accuracy of the error detection result output by the data classifier model reaches a preset distinguishing threshold value.
CN201711417309.0A 2017-12-25 2017-12-25 Data error detection method and system Active CN108154179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711417309.0A CN108154179B (en) 2017-12-25 2017-12-25 Data error detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711417309.0A CN108154179B (en) 2017-12-25 2017-12-25 Data error detection method and system

Publications (2)

Publication Number Publication Date
CN108154179A CN108154179A (en) 2018-06-12
CN108154179B true CN108154179B (en) 2020-06-05

Family

ID=62465675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711417309.0A Active CN108154179B (en) 2017-12-25 2017-12-25 Data error detection method and system

Country Status (1)

Country Link
CN (1) CN108154179B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760138B (en) * 2011-04-26 2015-03-11 北京百度网讯科技有限公司 Classification method and device for user network behaviors and search method and device for user network behaviors
CN103093238B * 2013-01-15 2016-01-20 江苏大学 Visual dictionary construction method based on D-S evidence theory
US9424288B2 (en) * 2013-03-08 2016-08-23 Oracle International Corporation Analyzing database cluster behavior by transforming discrete time series measurements
CN106705999B (en) * 2016-12-21 2021-05-25 南京航空航天大学 Unmanned aerial vehicle gyroscope fault diagnosis method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ESTIMATING EMBEDDED DATA FROM CLUSTERED HALFTONE DOTS VIA LEARNED DICTIONARY;Chang-Hwan Son 等;《IEEE》;20150129;第2624-2628页 *

Also Published As

Publication number Publication date
CN108154179A (en) 2018-06-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant