CN113505117A

CN113505117A - Data quality evaluation method, device, equipment and medium based on data indexes

Info

Publication number: CN113505117A
Application number: CN202110848688.9A
Authority: CN
Inventors: 陈丹丽
Original assignee: Ping An Trust Co Ltd
Current assignee: Ping An Trust Co Ltd
Priority date: 2021-07-26
Filing date: 2021-07-26
Publication date: 2021-10-15

Abstract

The invention relates to a data analysis technology, and discloses a data quality evaluation method based on data indexes, which comprises the following steps: segmenting the acquired data to be evaluated into segmented data segments by utilizing the segmentation symbols, and extracting the key semantics of each segmented data segment; carrying out data-related paragraph merging on the segmented data segments according to the key semantics to obtain merged data segments; extracting keywords of each merged data segment, and determining the data field of each merged data segment by using the keywords; acquiring data indexes by using the data field, and calculating an index value of each data index; and calculating the data quality score of the data to be evaluated according to the ratio of each combined data segment and the index value by using a preset weight algorithm. In addition, the invention also relates to a block chain technology, and the product portrait and the user portrait can be stored in the nodes of the block chain. The invention also provides a data quality evaluation device, equipment and a medium based on the data indexes. The invention can improve the accuracy of data quality evaluation.

Description

Data quality evaluation method, device, equipment and medium based on data indexes

Technical Field

The present invention relates to the field of data analysis technologies, and in particular, to a data quality assessment method and apparatus based on data indicators, an electronic device, and a computer-readable storage medium.

Background

As data grows explosively, most traditional businesses have begun digital transformation. The business value of the data generated during enterprise digital transformation is gradually being mined, and because data is the foundation for generating business value and achieving business goals, data quality has become a key concern.

Most existing methods evaluate data quality based on fixed rules. However, the data to be evaluated often contains data from multiple sources or fields, and the quality evaluation standards for different data are often inconsistent, so evaluating all of the data with fixed rules results in low accuracy of data quality evaluation.

Disclosure of Invention

The invention provides a data quality evaluation method and device based on data indexes and a computer readable storage medium, and mainly aims to solve the problem of low accuracy of data quality evaluation.

In order to achieve the above object, the present invention provides a data quality evaluation method based on data indexes, including:

acquiring data to be evaluated, performing segmentation processing on the data to be evaluated according to preset segmentation symbols to obtain segmented data segments, and extracting key semantics of each segmented data segment;

carrying out data related paragraph merging on the segmented data segments according to the key semantics to obtain merged data segments;

extracting a keyword of each merged data segment, and determining the data field of each merged data segment by using the keyword;

searching in a preset index list by utilizing the data field to obtain a data index corresponding to each merged data segment, and calculating an index value of the data index corresponding to each merged data segment;

and counting the proportion of each data segment in the combined data segments in the data to be evaluated, and calculating the data quality score of the data to be evaluated according to the proportion and the index value by using a preset weight algorithm.

Optionally, the extracting key semantics of each of the segmented data segments includes:

selecting one of the segmented data segments one by one as a target segmented data segment, and performing word segmentation processing on the target segmented data segment to obtain data word segments;

carrying out vector conversion on the data word segmentation to obtain a word segmentation vector;

constructing a vector subset set of the word segmentation vectors, and performing feature extraction on the vector subset set by using a pre-constructed semantic analysis model to obtain a feature subset;

and calculating an output value of each feature vector in the feature subset by using a preset activation function, and selecting the feature vector of which the output value is greater than a preset output threshold value as the key semantics of the target segmented data segment.

Optionally, the performing data-related paragraph merging on the segmented data segments according to the key semantics to obtain merged data segments includes:

respectively calculating the similarity between each semantic in the key semantics;

and merging the segmented data segments corresponding to the key semantics with the similarity larger than a preset similarity threshold value to obtain merged data segments.

Optionally, the extracting the keyword of each merged data segment includes:

selecting one of the merged data segments from the merged data segments as a target merged data segment, and selecting one of target participles from the data participles of the target merged data segment;

counting a first frequency of the target participle appearing in the target combined data segment, and counting a second frequency of the target participle appearing in all the combined data segments;

calculating key values of the target participles by using the first frequency and the second frequency, and returning to the step of selecting one target participle from the data participles of the target combined data segment until the key values of all the target participles in the target combined data segment are calculated;

and selecting the target participle with the key value larger than the preset key value as the keyword of the target merged data segment, and returning to the step of selecting one merged data segment from the merged data segments as the target merged data segment until the keywords of all the data segments in the merged data segments are obtained.

Optionally, the determining the data field of each merged data segment by using the keyword includes:

performing vector conversion on the keywords in the merged data segment to obtain a keyword vector;

calculating a matching value between the keyword vector and a preset standard data field;

and selecting the standard data field with the matching value larger than a preset matching threshold value as the data field of the combined data segment corresponding to the keyword.

Optionally, the retrieving, by using the data field, a data index corresponding to each merged data segment in a preset index list includes:

constructing an index of a preset index list;

and retrieving in the preset index list according to the data field and the index to obtain a data index corresponding to the data field.

Optionally, after calculating the index value of the data index corresponding to each of the merged data segments, the method further includes:

selecting data indexes of which the index values are not in a preset threshold interval from the data indexes of the combined data segment;

collecting the combined data segments corresponding to the selected data indexes into a data segment set;

and carrying out data quality positioning early warning on a preset user by utilizing the data segment set.

In order to solve the above problem, the present invention further provides a data quality evaluation apparatus based on data indexes, the apparatus comprising:

the semantic extraction module is used for acquiring data to be evaluated, performing segmentation processing on the data to be evaluated according to preset segmentation symbols to obtain segmented data segments, and extracting the key semantics of each segmented data segment;

a paragraph merging module, configured to perform data-related paragraph merging on the segmented data segments according to the key semantics to obtain merged data segments;

the data field determining module is used for extracting the key words of each merged data segment and determining the data field of each merged data segment by using the key words;

the index calculation module is used for searching a preset index list by utilizing the data field to obtain a data index corresponding to each merged data segment, and calculating an index numerical value of the data index corresponding to each merged data segment;

and the quality evaluation module is used for counting the proportion of each data segment in the combined data segments in the data to be evaluated, and calculating the data quality score of the data to be evaluated according to the proportion and the index value by using a preset weight algorithm.

In order to solve the above problem, the present invention also provides an electronic device, including:

a memory storing at least one instruction; and

a processor executing the instructions stored in the memory to implement the data quality evaluation method based on data indexes described above.

In order to solve the above problem, the present invention further provides a computer-readable storage medium, in which at least one instruction is stored, and the at least one instruction is executed by a processor in an electronic device to implement the data quality assessment method based on data indicators.

According to the embodiment of the invention, the data to be evaluated is divided according to different data fields, data indexes are selected specifically for the data field of each part of the data to be evaluated, each part of the data is evaluated with its own data indexes, and the evaluation results of all parts are combined by weight calculation to obtain the data quality score of the data to be evaluated. This avoids evaluating all data with a unified rule, improves the match between the data indexes and the data, and thereby improves the accuracy of data quality evaluation. Therefore, the data quality evaluation method and device based on the data indexes, the electronic equipment and the computer readable storage medium provided by the invention can solve the problem of low accuracy of data quality evaluation.

Drawings

FIG. 1 is a schematic flow chart of a data quality evaluation method based on data indicators according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a process for extracting key semantics according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a process of extracting keywords according to an embodiment of the present invention;

FIG. 4 is a functional block diagram of a data quality assessment apparatus based on data indicators according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device implementing the data quality assessment method based on data indicators according to an embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The embodiment of the application provides a data quality evaluation method based on data indexes. The execution subject of the data quality assessment method based on the data indexes includes but is not limited to at least one of the electronic devices, such as a server, a terminal, and the like, which can be configured to execute the method provided by the embodiments of the present application. In other words, the data quality evaluation method based on the data index may be performed by software or hardware installed in the terminal device or the server device, and the software may be a block chain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.

Fig. 1 is a schematic flow chart of a data quality evaluation method based on data indicators according to an embodiment of the present invention. In this embodiment, the data quality evaluation method based on the data index includes:

S1, acquiring data to be evaluated, carrying out segmentation processing on the data to be evaluated according to preset segmentation symbols to obtain segmented data segments, and extracting key semantics of each segmented data segment.

In the embodiment of the invention, the data to be evaluated comprises business process data, user information data, product description data, system operation log data and the like.

The embodiment of the invention can use computer statements with a data grabbing function (such as Java statements, Python statements, and the like) to grab the stored data to be evaluated from a pre-constructed storage area, wherein the storage area comprises but is not limited to a database, a block chain node, a network cache, and the like.

In the embodiment of the present invention, the data to be evaluated may be segmented according to preset segmentation symbols to obtain a plurality of segmented data segments corresponding to the data to be evaluated, where the segmentation symbols include, but are not limited to, commas, spaces, and semicolons.

For example, suppose the data to be evaluated is "fqwfc njutrnr cerbwbv". When the segmentation symbol is a space, the data to be evaluated is divided into three segmented data segments, fqwfc, njutrnr and cerbwbv, according to the positions of the spaces in the data to be evaluated.
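As an illustrative, non-limiting sketch, the segmentation in S1 could be written in Python as follows. The function name `split_into_segments` and the default delimiter set (comma, semicolon, space) are assumptions made for this example only.

```python
import re

def split_into_segments(data, delimiters=",; "):
    """Split raw data on any of the preset segmentation symbols."""
    # Build a character class from the delimiters; re.escape guards
    # against characters that are special inside a regular expression.
    pattern = "[" + re.escape(delimiters) + "]+"
    # Drop empty strings produced by leading or trailing delimiters.
    return [seg for seg in re.split(pattern, data) if seg]

segments = split_into_segments("fqwfc njutrnr cerbwbv")
print(segments)  # ['fqwfc', 'njutrnr', 'cerbwbv']
```

Runs of consecutive delimiters are treated as a single separator here, which is one possible design choice rather than a requirement of the method.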

In the embodiment of the invention, dividing the data to be evaluated into a plurality of segmented data segments according to the preset segmentation symbols reduces the amount of data in each segmented data segment, which improves the efficiency of analyzing the data to be evaluated.

In the embodiment of the present invention, each segmented data segment may be analyzed by using a preset semantic analysis Model to extract a key semantic meaning of each segmented data segment, where the semantic analysis Model includes but is not limited to a Natural Language Processing (NLP) Model and a Hidden Markov Model (HMM) Model, and the key semantic meaning refers to a semantic meaning capable of representing core content of the segmented data segment.

In the embodiment of the present invention, referring to fig. 2, the extracting the key semantics of each segmented data segment includes:

S21, selecting one of the segmented data segments one by one as a target segmented data segment, and performing word segmentation processing on the target segmented data segment to obtain data word segments;

S22, carrying out vector conversion on the data word segments to obtain word segmentation vectors;

S23, constructing a vector subset set of the word segmentation vectors, and performing feature extraction on the vector subset set by using a pre-constructed semantic analysis model to obtain a feature subset;

S24, calculating an output value of each feature vector in the feature subset by using a preset activation function, and selecting the feature vector of which the output value is greater than a preset output threshold value as the key semantics of the target segmented data segment.

In detail, a pre-constructed word bank can be used to perform word segmentation on the target segmented data segment to obtain the data word segments. The word bank contains a plurality of standard word segments, and substrings of the target segmented data segment of different lengths are looked up in the word bank, so that the data word segments corresponding to the target segmented data segment can be obtained.
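One common way to look up candidates of different lengths in a word bank is forward maximum matching. The sketch below uses that technique as an assumption; the source does not name a specific matching strategy, and `segment_words`, `lexicon` and `max_len` are hypothetical names.

```python
def segment_words(text, lexicon, max_len=4):
    """Forward maximum matching against a word bank (lexicon)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback so the
            # loop makes progress on text that is not in the word bank.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

lexicon = {"data", "quality", "score"}
print(segment_words("dataqualityscore", lexicon, max_len=7))
# ['data', 'quality', 'score']
```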

Specifically, a preset word vector conversion model can be used for performing vector conversion on the data word segmentation to obtain a word segmentation vector, and the word vector conversion model includes, but is not limited to, a word2vec model and a CRF (Conditional Random Field) model.

In the embodiment of the invention, the vector subset set comprises all subsets of the participle vectors, and the vector subset set of the participle vectors is constructed, so that the diversity of analysis vector combinations is favorably improved, and the accuracy of the generated key semantics is further improved.

For example, if the word segmentation vectors include vector A, vector B, and vector C, the vector subset set of the word segmentation vectors includes six subsets: [vector A], [vector B], [vector C], [vector A, vector B], [vector A, vector C], and [vector B, vector C].
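The enumeration in this example can be sketched with the Python standard library. Following the six subsets listed above, the sketch assumes that the empty set and the full set are excluded; the function name `proper_nonempty_subsets` is hypothetical.

```python
from itertools import combinations

def proper_nonempty_subsets(items):
    """Enumerate every subset of size 1 .. len(items) - 1, smallest first."""
    subsets = []
    for size in range(1, len(items)):
        subsets.extend(list(combo) for combo in combinations(items, size))
    return subsets

for subset in proper_nonempty_subsets(["vector A", "vector B", "vector C"]):
    print(subset)
```

Because the number of subsets grows exponentially with the number of word segmentation vectors, a practical implementation would likely cap the subset size; that cap is not described in the source.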

In one practical application scenario of the present invention, among the data participles of the same data segment, the participles that can represent the core semantics of the data segment are strongly associated with each other. For example, a data segment includes a data participle A, a data participle B and a data participle C, where the data participle A and the data participle B can represent the core semantics of the data segment; in that case, the association between the data participle A and the data participle B is greater than both the association between the data participle A and the data participle C and the association between the data participle B and the data participle C.

Further, the embodiment of the present invention may analyze the relevance between the analysis vectors in each vector subset of the vector subset set by using a pre-constructed semantic analysis model, so as to filter a representative feature subset from the vector subset set according to the relevance.

For example, the existing vector subset set includes a vector subset a, a vector subset B, and a vector subset C, and the semantic analysis model is used to analyze the association degree of the word vectors in the vector subset a, the vector subset B, and the vector subset C, respectively, to obtain that the association degree of the word vectors in the vector subset a is 80, the association degree of the word vectors in the vector subset B is 70, and the association degree of the word vectors in the vector subset C is 60, and then the vector subset a is determined to be the feature subset of the target segmented data segment.

In detail, after the feature subset is extracted, an output value of each feature vector in the feature subset may be calculated by using a preset activation function, and a feature vector of which the output value is greater than a preset output threshold is selected as a key semantic of the target segmented data segment, where the activation function includes, but is not limited to, a sigmoid activation function, a softmax activation function, and a ReLU activation function.
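A minimal sketch of the thresholding in S24 is given below. The source does not state how a feature vector is reduced to a single output value, so the sketch assumes, purely for illustration, that the sigmoid is applied to the mean of the vector components; `select_key_semantics` and the threshold of 0.6 are likewise hypothetical.

```python
import math

def sigmoid(x):
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def select_key_semantics(feature_vectors, threshold=0.6):
    """Keep the feature vectors whose activation output exceeds the threshold."""
    selected = []
    for vec in feature_vectors:
        # Assumption: the scalar fed to the activation is the component mean.
        output = sigmoid(sum(vec) / len(vec))
        if output > threshold:
            selected.append(vec)
    return selected

features = [[2.0, 1.0], [-1.0, 0.0], [0.2, 0.4]]
print(select_key_semantics(features))  # [[2.0, 1.0]]
```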

S2, carrying out data-related paragraph merging on the segmented data segments according to the key semantics to obtain merged data segments.

In the embodiment of the present invention, because the data to be evaluated is only segmented according to the preset segmentation symbol in S1, the obtained segmented data segment may have incomplete content of the data segment, for example, the segmented data segment of the data to be evaluated includes a data segment a and a data segment B, where the data segment a and the data segment B are data segments for recording user behavior, and if the data segment a or the data segment B is used alone to perform data quality analysis, an inaccurate analysis may be caused.

Therefore, the embodiment of the invention can perform paragraph merging on the segmented data segments according to the key semantics so as to merge the data segments belonging to the same content in the segmented data segments together, thereby improving the accuracy of subsequent quality analysis on data.

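In this embodiment of the present invention, the performing data-related paragraph merging on the segmented data segments according to the key semantics to obtain merged data segments includes:

respectively calculating the similarity between each semantic in the key semantics;

and merging the segmented data segments corresponding to the key semantics with the similarity larger than a preset similarity threshold value to obtain merged data segments.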

In detail, the separately calculating the similarity between each of the key semantics includes:

calculating the similarity between each semantic meaning in the key semantic meanings by using a similarity algorithm as follows:

wherein S is the similarity, x_k is the k-th key semantic in the key semantics, y_j is the j-th key semantic in the key semantics, and α is a preset coefficient.

In other embodiments of the present invention, the similarity between the target semantic and each of the key semantics may be calculated by using an algorithm having a similarity calculation function, such as a cosine distance algorithm, a Euclidean distance algorithm, or the like.

For example, the key semantics include a semantic A, a semantic B, a semantic C, and a semantic D. The semantic A is selected as the target semantic, and the similarities between the semantic A and the semantic B, between the semantic A and the semantic C, and between the semantic A and the semantic D are calculated respectively, giving a similarity of 89 between the semantic A and the semantic B, 76 between the semantic A and the semantic C, and 62 between the semantic A and the semantic D. When the preset similarity threshold is 80, the segmented data segment corresponding to the semantic B is merged with the segmented data segment corresponding to the semantic A. The semantic C is then selected as the target semantic, and the similarity between the semantic C and the semantic D is calculated to be 30, so the data segments corresponding to the semantic C and the semantic D are not merged. The merged data segments obtained are: a data segment corresponding to the semantics A and B, a data segment corresponding to the semantic C, and a data segment corresponding to the semantic D.
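The merging procedure in this example can be sketched as follows. Cosine similarity is used here because it is one of the alternative algorithms named in the description; it is not the preset similarity algorithm of the embodiment. The vectors, the threshold of 0.8 and the function names are illustrative assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def merge_segments(segments, semantics, threshold=0.8):
    """Group segments whose key semantics are similar to a target semantic."""
    merged = []
    used = [False] * len(segments)
    for i in range(len(segments)):
        if used[i]:
            continue
        # Segment i becomes the target; later unmerged segments are compared to it.
        used[i] = True
        group = [segments[i]]
        for j in range(i + 1, len(segments)):
            if not used[j] and cosine_similarity(semantics[i], semantics[j]) > threshold:
                group.append(segments[j])
                used[j] = True
        merged.append(group)
    return merged

segments = ["seg A", "seg B", "seg C", "seg D"]
semantics = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, -1.0]]
print(merge_segments(segments, semantics))
# [['seg A', 'seg B'], ['seg C'], ['seg D']]
```

As in the example, each segment is compared only with the current target semantic, so the grouping depends on the order in which targets are selected.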

In the embodiment of the invention, similar data segments in the segmented data segments can be merged according to the key semantics, so that the integrity of each semantic of the acquired merged data segment is ensured, and the accuracy of evaluating the data quality is further improved.

S3, extracting the keywords of each merged data segment, and determining the data field of each merged data segment by using the keywords.

In one practical application scenario of the invention, the emphasis of the evaluation criteria for the data to be evaluated differs from field to field.

For example, the data to be evaluated includes data recording stock price changes and data recording user behaviors. The data recording stock price changes belongs to the market quotation field and requires higher timeliness, whereas the data recording user behaviors belongs to the personal information field and requires higher accuracy and integrity.

Therefore, the data field classification can be performed on the data to be evaluated according to the key semantics, so that the data in the data to be evaluated can be classified into classification data of various different fields, and the subsequent targeted evaluation on the classification data of the different data fields is facilitated.

In the embodiment of the present invention, as shown in fig. 3, the extracting the keyword of each merged data segment includes:

S30, selecting one merged data segment from the merged data segments as a target merged data segment;

S31, selecting one target participle from the data participles of the target merged data segment;

S32, counting a first frequency with which the target participle appears in the target merged data segment, and counting a second frequency with which the target participle appears in all the merged data segments;

S33, calculating a key value of the target participle by using the first frequency and the second frequency;

S34, judging whether the number of the selected target participles is smaller than the number of the data participles of the target merged data segment;

if the number of the selected target participles is smaller than the number of the data participles of the target merged data segment, executing S35, returning to the step of selecting one target participle from the data participles of the target merged data segment;

if the number of the selected target participles is greater than or equal to the number of the data participles of the target merged data segment, executing S36, selecting the target participles with the key value larger than a preset key value as the keywords of the target merged data segment;

S37, judging whether the number of the selected target merged data segments is smaller than the number of the data segments in the merged data segments;

if the number of the selected target merged data segments is smaller than the number of the data segments in the merged data segments, executing S38, returning to the step of selecting one merged data segment from the merged data segments as a target merged data segment;

if the number of the selected target merged data segments is greater than or equal to the number of the data segments in the merged data segments, executing S39, obtaining the keywords of each merged data segment.

In detail, the key value of the target participle may be calculated by using the first frequency and the second frequency according to the following preset key value algorithm:

wherein K is the key value, M is the second frequency, N is the first frequency, and β is a preset constant.

In the embodiment of the invention, the key value of each target participle in the target merged data segment is calculated by using the preset key value algorithm, and the target participles with the key value larger than the preset key value are selected as the keywords of the target merged data segment.

For example, the target merged data segment includes a data participle A, a data participle B, and a data participle C. The data participle A is selected as the target participle, and its key value is calculated to be 80 by using the preset key value algorithm; the data participle B is selected as the target participle, and its key value is calculated to be 60; the data participle C is selected as the target participle, and its key value is calculated to be 50. When the preset key value is 70, the data participle A is selected as the keyword of the target merged data segment.

In other embodiments of the present invention, an algorithm having a keyword extraction function, such as the TF-IDF (term frequency-inverse document frequency) algorithm or the TextRank algorithm, may also be used to extract the keywords of each merged data segment.
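As a hedged illustration of the TF-IDF alternative mentioned above (not the preset key value algorithm of the embodiment), keywords could be selected as sketched below. The smoothed inverse document frequency, the threshold `min_score` and the sample participles are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf_keywords(segments, target_index, min_score):
    """Return the participles of one merged segment whose TF-IDF exceeds min_score."""
    target = segments[target_index]
    counts = Counter(target)
    n_segments = len(segments)
    keywords = []
    for word, count in counts.items():
        # Term frequency inside the target merged data segment.
        tf = count / len(target)
        # Number of merged data segments that contain the participle.
        df = sum(1 for seg in segments if word in seg)
        # Smoothed inverse document frequency (an assumed variant).
        idf = math.log((1 + n_segments) / (1 + df)) + 1
        if tf * idf > min_score:
            keywords.append(word)
    return keywords

segments = [
    ["stock", "price", "stock", "change"],
    ["user", "login", "user", "click"],
    ["price", "user", "report"],
]
print(tfidf_keywords(segments, 0, min_score=0.6))  # ['stock']
```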

Further, the embodiment of the present invention may determine the data field of each merged data segment by using the keyword, for example, calculate a matching value between the keyword and a preset standard data field, and determine that the data field of which the matching value is greater than a preset matching threshold is the data field of the merged data segment corresponding to the keyword.

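In this embodiment of the present invention, the determining the data field of each merged data segment by using the keyword includes:

performing vector conversion on the keywords in the merged data segment to obtain a keyword vector;

calculating a matching value between the keyword vector and a preset standard data field;

and selecting the standard data field with the matching value larger than a preset matching threshold value as the data field of the merged data segment corresponding to the keyword.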

In detail, the step of performing vector conversion on the keyword corresponding to each data segment in the merged data segment is consistent with the step of performing vector conversion on the data segmentation in S1 to obtain a segmentation vector, which is not described herein again.

Specifically, the calculating a matching value between the keyword vector and a preset standard data field includes:

calculating a matching value between the keyword vector and a preset standard data field by using a matching value algorithm as follows:

wherein, P is the matching value, a is the keyword vector, and b is the vector expression of the standard data field.

For example, the merged data segments include a data segment A and a data segment B, where the keyword corresponding to the data segment A is a first keyword and the keyword corresponding to the data segment B is a second keyword, and the preset standard data fields are a first field and a second field. The matching value algorithm is used to calculate the matching values between the first keyword and each of the first field and the second field, giving a matching value of 90 between the first keyword and the first field and a matching value of 70 between the first keyword and the second field; when the preset matching threshold is 80, the data field corresponding to the data segment A is determined to be the first field. Similarly, the matching values between the second keyword and each of the first field and the second field are calculated, giving a matching value of 77 between the second keyword and the first field and a matching value of 85 between the second keyword and the second field, so the data field corresponding to the data segment B is determined to be the second field.
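The field matching described above may be sketched as follows. Cosine similarity stands in for the matching value algorithm here as an assumption, since the sketch is only meant to show the thresholding step; the field names, vectors and threshold are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between a keyword vector and a field vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_field(keyword_vector, field_vectors, threshold=0.8):
    """Return the best-matching standard data field above the threshold, else None."""
    best_field, best_score = None, threshold
    for name, vec in field_vectors.items():
        score = cosine(keyword_vector, vec)
        if score > best_score:
            best_field, best_score = name, score
    return best_field

fields = {
    "market quotation": [1.0, 0.1],
    "personal information": [0.1, 1.0],
}
print(match_field([0.9, 0.2], fields))  # market quotation
print(match_field([0.5, 0.5], fields))  # None
```

Returning only the single best field when several exceed the threshold is a design choice of this sketch; the description only requires the matching value to exceed the preset matching threshold.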

According to the embodiment of the invention, the data field of each merged data segment is determined through the keyword, so that the targeted data quality evaluation can be conveniently carried out on the data in different data fields, and the accuracy of the data quality evaluation is improved.

S4, retrieving the data indexes corresponding to the merged data segments from a preset index list by using the data fields, and calculating the index numerical values of the data indexes corresponding to the merged data segments.

In the implementation of the present invention, the data fields may be retrieved in a preset index list to obtain data indexes corresponding to the data fields, where the index list includes a plurality of data fields and data indexes corresponding to each data field, and the data indexes include, but are not limited to, integrity indexes, timeliness indexes, redundancy indexes, accuracy indexes, and intelligibility indexes.

In this embodiment of the present invention, the retrieving the data index corresponding to each merged data segment from a preset index list by using the data field includes:

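constructing an index of the preset index list;

and retrieving in the preset index list according to the data field and the index to obtain a data index corresponding to the data field.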

In detail, the index of the preset index list can be constructed by using the CREATE INDEX statement in SQL.

Illustratively, the index of the preset index list may be constructed using the CREATE INDEX statement as follows:

CREATE INDEX index_name

ON table_name (column_name);

wherein index_name is the name of the created index and can be predefined by a user, table_name is the name of the preset index list, and column_name is the name of the data column in the preset index list on which the index is to be established.
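For illustration only, the index construction and the subsequent retrieval could be exercised with Python's built-in `sqlite3` module as sketched below. The table name `index_list`, the column names and the sample rows are assumptions and do not come from the embodiment.

```python
import sqlite3

# In-memory database standing in for the storage of the preset index list.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE index_list (data_field TEXT, data_index TEXT)")
conn.executemany(
    "INSERT INTO index_list VALUES (?, ?)",
    [
        ("market quotation", "timeliness"),
        ("personal information", "accuracy"),
        ("personal information", "integrity"),
    ],
)
# Build the index on the column used for retrieval.
conn.execute("CREATE INDEX idx_data_field ON index_list (data_field)")

def lookup_indexes(connection, data_field):
    """Retrieve the data indexes registered for one data field."""
    rows = connection.execute(
        "SELECT data_index FROM index_list WHERE data_field = ? ORDER BY data_index",
        (data_field,),
    ).fetchall()
    return [row[0] for row in rows]

print(lookup_indexes(conn, "personal information"))  # ['accuracy', 'integrity']
```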

Further, in the embodiment of the present invention, an index value of the data index corresponding to each merged data segment may be calculated according to the obtained data index.

For example, suppose the merged data segments include a data segment A and a data segment B, where the data index of data segment A is an integrity index and the data index of data segment B is a timeliness index. For data segment A, the total amount of data in data segment A and the number of data items in data segment A that are null values may be counted, the total amount and the number of null values are divided, and the resulting quotient is used as the index value of the integrity index;

for data segment B, the acquisition time of each data item in data segment B and the current time may be obtained, the difference between the current time and the acquisition time is calculated for each data item, the differences are summed, and the resulting value is used as the index value of the timeliness index of data segment B.
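The two index values in this example may be sketched in Python as follows. The function names are hypothetical. The text does not state which quantity is the dividend, so the sketch assumes the number of null values is divided by the total amount; it also assumes that null values are represented by None and that time differences are expressed in seconds.

```python
from datetime import datetime


def integrity_index(values):
    """Quotient of null count and total count (assumed: nulls / total)."""
    if not values:
        raise ValueError("data segment is empty")
    null_count = sum(1 for v in values if v is None)
    return null_count / len(values)


def timeliness_index(acquired_at, now):
    """Sum over all items of (current time - acquisition time), in seconds."""
    return sum((now - t).total_seconds() for t in acquired_at)
```

Under these assumptions, a segment with two null values among four items has an integrity index value of 0.5, and a larger timeliness index value indicates older data.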

In the embodiment of the invention, the data index corresponding to each combined data segment is retrieved from the preset index list by utilizing the data field, so that the data index can be selected according to different data fields in a targeted manner, and the accuracy of data quality evaluation is improved.

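In this embodiment of the present invention, after calculating the index value of the data index corresponding to each of the merged data segments, the method further includes:

judging whether the index value of each data segment is within a preset threshold interval;

collecting the data segments whose index values are not within the threshold interval into a data segment set, and sending a quality positioning early warning for the data segments in the set to a preset user.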

For example, the merged data segments include a data segment A, a data segment B, and a data segment C, where the index value of the data index corresponding to data segment A is 123, that of data segment B is 456, and that of data segment C is 789. When the preset threshold interval is (500, 800), it is determined that the index values of data segment A and data segment B are not within the threshold interval, which indicates that the data quality of data segment A and data segment B is poor. Data segment A and data segment B are therefore collected into a data segment set, and a quality positioning early warning for data segment A and data segment B is sent to a preset user.

In detail, the quality positioning early warning refers to pushing to the user information about the data segments in the merged data segments whose index values are not within the preset threshold interval, so as to inform the user which part of the data in the merged data segments has an index value outside the preset threshold interval.

Specifically, the data segments whose index values are not within the preset threshold interval can be displayed by means of a prompt box, a highlight mark, and the like, so that a data quality positioning early warning is given to the preset user.
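The threshold check in the example above may be sketched as follows. The function name out_of_interval is hypothetical, and the interval (500, 800) is assumed to be open at both ends; how the warning is actually delivered to the user is outside the sketch.

```python
def out_of_interval(index_values, low, high):
    """Return the names of segments whose index value is not inside (low, high)."""
    return [
        name
        for name, value in index_values.items()
        if not (low < value < high)
    ]


# Values taken from the example: A = 123, B = 456, C = 789.
flagged = out_of_interval({"A": 123, "B": 456, "C": 789}, 500, 800)
```

Here flagged plays the role of the data segment set for which the quality positioning early warning would be sent.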

In the embodiment of the invention, giving the user a data quality early warning lets the user know the specific position, within the merged data segments, of data that may have a data quality problem, and makes it convenient for the user to modify, delete, or otherwise adjust that data, so that the quality of the data in the merged data segments is improved.

S5, counting the proportion of each data segment in the merged data segments in the data to be evaluated, and calculating the data quality score of the data to be evaluated according to the proportion and the index value by using a preset weight algorithm.

In the embodiment of the invention, the data quality score of the data to be evaluated can be calculated by counting the proportion of each data segment in the combined data segments in the data to be evaluated and taking the proportion as a parameter of a preset weight algorithm.

For example, the total data amount of all the data segments in the merged data segments is counted, the data amount of each data segment in the merged data segments is counted, and the data amount of each data segment is divided by the total data amount, so as to obtain the proportion of each data segment in the data to be evaluated.
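The proportion calculation may be sketched as follows. The function name proportions and the representation of the data amounts as a mapping from segment name to item count are assumptions made for illustration only.

```python
def proportions(segment_sizes):
    """Divide each segment's data amount by the total data amount."""
    total = sum(segment_sizes.values())
    if total == 0:
        raise ValueError("total data amount is zero")
    return {name: size / total for name, size in segment_sizes.items()}
```

For instance, segments containing 1, 1, and 2 items account for proportions of 0.25, 0.25, and 0.5 of the data to be evaluated, respectively.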

In the embodiment of the present invention, the calculating, by using a preset weighting algorithm, a data quality score of the data to be evaluated according to the proportion includes:

calculating the data quality score of the data to be evaluated according to the ratio and the index value by using the following weight algorithm:

where G is the data quality score, n is the number of data segments in the merged data segments, Q_i is the index value of the data index of the i-th data segment in the merged data segments, and P_i is the i-th preset weight coefficient.

In the embodiment of the invention, the data quality score of the data to be evaluated can be calculated by utilizing the weight algorithm.
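The formula itself is not reproduced in this text. As a sketch only, the following Python example assumes the weight algorithm is a plain weighted sum of the index values, that is, G equals the sum of P_i multiplied by Q_i over the n data segments, and that the proportions from step S5 may serve as the weight coefficients. Both assumptions, and the function name quality_score, are illustrative rather than taken from the embodiment.

```python
def quality_score(index_values, weights):
    """Assumed weighted sum: G = sum of P_i * Q_i over all data segments."""
    if len(index_values) != len(weights):
        raise ValueError("one weight coefficient is required per data segment")
    return sum(p * q for p, q in zip(weights, index_values))
```

With index values 0.5, 1.0, and 0.25 and weight coefficients 0.25, 0.25, and 0.5, the assumed weighted sum gives a score of 0.5. In practice the index values of different data indexes may need to be brought to a common scale before they are combined.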

According to the embodiment of the invention, the data to be evaluated is divided according to different data fields, data indexes are selected in a targeted manner according to the data field of each part of the data to be evaluated, each part of the data is evaluated using its data indexes, and the evaluation results of all parts are combined by weight calculation to obtain the data quality score of the data to be evaluated. This avoids evaluating all data with a single unified rule, improves the match between the data indexes and the data, and thereby improves the accuracy of data quality evaluation. Therefore, the data quality evaluation method based on the data indexes can solve the problem of low accuracy in data quality evaluation.

Fig. 4 is a functional block diagram of a data quality evaluation apparatus based on data indicators according to an embodiment of the present invention.

The data quality evaluation device 100 based on the data index according to the present invention may be installed in an electronic device. According to the implemented functions, the data quality evaluation device 100 based on data indexes may include a semantic extraction module 101, a paragraph merging module 102, a data domain determination module 103, an index calculation module 104, and a quality evaluation module 105. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.

In the present embodiment, the functions regarding the respective modules/units are as follows:

the semantic extraction module 101 is configured to obtain data to be evaluated, perform segmentation processing on the data to be evaluated according to preset segmentation symbols to obtain segmented data segments, and extract a key semantic meaning of each segmented data segment;

the paragraph merging module 102 is configured to perform data-related paragraph merging on the segmented data segments according to the key semantics to obtain merged data segments;

the data field determining module 103 is configured to extract a keyword of each merged data segment, and determine a data field of each merged data segment by using the keyword;

the index calculation module 104 is configured to retrieve, by using the data field, a data index corresponding to each merged data segment from a preset index list, and calculate an index value of the data index corresponding to each merged data segment;

the quality evaluation module 105 is configured to count a ratio of each data segment in the merged data segment in the data to be evaluated, and calculate a data quality score of the data to be evaluated according to the ratio and the index value by using a preset weight algorithm.

In detail, when the modules in the data quality assessment apparatus 100 based on data indicators according to the embodiment of the present invention are used, the same technical means as the data quality assessment method based on data indicators described in fig. 1 to 3 are adopted, and the same technical effects can be produced, which is not described herein again.

Fig. 5 is a schematic structural diagram of an electronic device implementing a data quality evaluation method based on data indicators according to an embodiment of the present invention.

The electronic device 1 may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program, such as a data quality assessment program based on data indicators, stored in the memory 11 and executable on the processor 10.

In some embodiments, the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or a plurality of packaged integrated circuits with the same function or different functions, and includes one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is the control unit of the electronic device; it connects the components of the electronic device by using various interfaces and lines, and executes the functions of the electronic device and processes its data by running or executing programs or modules stored in the memory 11 (for example, executing the data quality evaluation program based on data indexes) and calling data stored in the memory 11.

The memory 11 includes at least one type of readable storage medium including flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as codes of a data quality evaluation program based on data indexes, but also to temporarily store data that has been output or is to be output.

The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.

The communication interface 13 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., a Wi-Fi interface, a Bluetooth interface, etc.), which are typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a display or an input unit such as a keyboard, and optionally a standard wired interface or a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.

Fig. 5 only shows an electronic device with components, and it will be understood by a person skilled in the art that the structure shown in fig. 5 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.

For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management and the like are realized through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.

It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.

The data quality assessment program based on data indicators stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:

Specifically, the specific implementation method of the processor 10 for the instruction may refer to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not described herein again.

Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).

The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.

The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.

The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A data quality assessment method based on data indexes is characterized by comprising the following steps:

2. The data quality assessment method based on data indicators as claimed in claim 1, wherein said extracting key semantics of each of said segmented data segments comprises:

3. The data quality evaluation method based on data indicators as claimed in claim 1, wherein the merging the data-related paragraphs of the segmented data segments according to the key semantics to obtain a merged data segment comprises:

4. The data quality assessment method according to claim 1, wherein the extracting the keyword of each of the merged data segments comprises:

5. The data quality assessment method based on data indicators as claimed in claim 1, wherein the determining the data domain of each of the merged data segments by using the keywords comprises:

6. The data quality assessment method based on data indicators as claimed in any one of claims 1 to 5, wherein the retrieving the data indicator corresponding to each of the merged data segments in a preset indicator list by using the data field comprises:

constructing an index of a preset index list;

7. The data index-based data quality assessment method according to any one of claims 1 to 5, wherein after calculating the index value of the data index corresponding to each of the merged data segments, the method further comprises:

8. An apparatus for data quality assessment based on data indicators, the apparatus comprising:

9. An electronic device, characterized in that the electronic device comprises:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a data indicator-based data quality assessment method according to any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the data quality assessment method based on data indicators according to any one of claims 1 to 7.