CN115544337B

CN115544337B - Data processing method and system starting from data origin

Info

Publication number: CN115544337B
Application number: CN202211060927.5A
Authority: CN
Inventors: 王世今; 莫卉星; 刘珂杭; 高铭; 武欢欢
Original assignee: Smart Co Ltd Beijing Technology Co ltd
Current assignee: Smart Co Ltd Beijing Technology Co ltd
Priority date: 2022-09-01
Filing date: 2022-09-01
Publication date: 2023-06-27
Anticipated expiration: 2042-09-01
Also published as: CN115544337A

Abstract

The invention provides a data processing method and system starting from a data origin. The method comprises: determining a classification dimension for data classification according to the data origin of the initial data; dividing the initial data into data types according to the dimension attribute of the initial data to obtain dimension data; and naming the dimension data according to its type attribute to obtain a data type name. Because the classification dimension is determined from the data origin, it can cover all data. Dividing the data according to the dimension attribute ensures that the division is complete and that each item falls into a single class, and choosing a suitable data name according to the type attribute ensures efficient and accurate data queries. Together these improve the effective use and mining of the data.

Description

Data processing method and system starting from data origin

Technical Field

The present invention relates to the field of data processing and application, and in particular, to a data processing method and system starting from a data origin.

Background

Society is developing rapidly, and the volume of data is growing with it. Data has penetrated every industry and covers almost all daily human activities. Faced with large amounts of data, people have begun to mine and use it in an effort to extract commercial value, academic value, and so on. However, fully and effectively using data, mining it, and refining its value is not easy. The first questions are what types of data currently exist on the market and whether the current data dimensions already cover all of those types. When data classification is unclear, data is used blindly: either it is unclear what the overall data types are, or the data types currently in hand are mistaken for the complete classification. How to classify the mass of data on the market scientifically, and how to give those classifications reasonable names, has therefore become an urgent problem. Only when a pile of disordered data has been classified can its full picture be revealed, effective data be found in the different classification modules as needed, and the value of the data in the research field be mined to the fullest extent.

In traditional data processing methods, first, the acquired data is not classified finely enough, so not all data can be classified; second, the same kind of data may be assigned to several classifications; finally, classification names may be poorly chosen, which makes the data difficult to search. As a result, the data cannot be used and mined efficiently.

Disclosure of Invention

The invention provides a data processing method and a system starting from a data origin, which classify and name data from the perspective of the data source, ensure the integrity and the singleness of data division and ensure the accurate query of the data.

A method of data processing starting from a data origin, comprising:

step 1: determining a classification dimension of the data classification according to the data origin of the initial data;

step 2: according to the dimension attribute of the initial data, carrying out data type division on the initial data to obtain dimension data;

step 3: naming the dimension data according to the type attribute of the dimension data to obtain the data type name.

Preferably, before step 1, the method further comprises: acquiring a data origin of the initial data, comprising:

Acquiring adjacent data acquisition nodes through which the initial data passes, and acquiring a previous data acquisition node through which the initial data passes according to a time stamp of the adjacent data acquisition nodes;

tracing the initial data according to the timestamp of the previous data acquisition node to obtain an initial acquisition node of the initial data;

a data origin of the initial data is determined based on the timestamp of the initial acquisition node.
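The backtracking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `AcquisitionNode` record, its `previous` link, and the `trace_origin` helper are hypothetical names, and the sketch assumes each node records both a timestamp and the node from which it received the data.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AcquisitionNode:
    """Hypothetical record of one node the initial data passed through."""
    name: str
    timestamp: float          # when this node handled the data
    previous: Optional[str]   # name of the node the data came from, if any


def trace_origin(last_node: str, nodes: Dict[str, AcquisitionNode]) -> AcquisitionNode:
    """Follow `previous` links back until the initial acquisition node is reached."""
    current = nodes[last_node]
    while current.previous is not None:
        earlier = nodes[current.previous]
        # A previous node must have handled the data earlier than the current one.
        if earlier.timestamp >= current.timestamp:
            raise ValueError("timestamps are not decreasing along the trace")
        current = earlier
    return current


# Hypothetical trace: the data was first collected by a map application.
nodes = {
    "warehouse": AcquisitionNode("warehouse", 300.0, "gateway"),
    "gateway": AcquisitionNode("gateway", 200.0, "map_app"),
    "map_app": AcquisitionNode("map_app", 100.0, None),
}
origin = trace_origin("warehouse", nodes)
```

The node returned by `trace_origin` plays the role of the initial acquisition node; how its timestamp record is mapped to a data origin is left open here, as it is in the text.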

Preferably, in step 1, determining the classification dimension of the data classification according to the data origin of the initial data comprises:

determining a classification angle according to the application requirement of the initial data;

extracting initial dimensions with consistent classification angles from a dimension database, and selecting a preset number of target dimensions from the initial dimensions based on the emphasis of the application requirements;

establishing a dimension distribution diagram of the target dimension under the classification angle, and judging whether the dimension distribution diagram covers all aspects of the classification angle;

if yes, determining the target dimension as a classification dimension for data classification;

otherwise, determining a missing aspect, and matching the optimal dimension for the missing aspect as a complementary dimension, and forming a classification dimension for data classification by the complementary dimension and the target dimension together.
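One way to picture the coverage check and the addition of complementary dimensions is the sketch below. It assumes, purely for illustration, that each dimension is described by the set of aspects it covers and that the "optimal" dimension for a missing aspect is the candidate covering the most missing aspects; the function name and data layout are hypothetical.

```python
from typing import Dict, List, Set


def choose_classification_dimensions(
    target_dims: List[str],
    aspects_of_angle: Set[str],
    dim_coverage: Dict[str, Set[str]],
) -> List[str]:
    """Return the target dimensions plus complementary ones for missing aspects."""
    covered: Set[str] = set()
    for dim in target_dims:
        covered |= dim_coverage.get(dim, set())
    missing = aspects_of_angle - covered

    result = list(target_dims)
    candidates = [d for d in dim_coverage if d not in result]
    while missing:
        # Assumed meaning of "optimal": the candidate covering most missing aspects.
        best = max(candidates, key=lambda d: len(dim_coverage[d] & missing), default=None)
        if best is None or not (dim_coverage[best] & missing):
            break  # no remaining candidate covers anything that is still missing
        result.append(best)
        candidates.remove(best)
        missing -= dim_coverage[best]
    return result


# Hypothetical dimension database for a shopping-oriented classification angle.
coverage = {
    "payment": {"pay"},
    "browsing": {"browse"},
    "recommendation": {"recommend"},
    "logistics": {"deliver"},
}
dims = choose_classification_dimensions(
    ["payment", "browsing"], {"pay", "browse", "deliver"}, coverage
)
```

Here the two target dimensions leave the aspect `deliver` uncovered, so `logistics` is added as the complementary dimension.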

Preferably, determining the classification angle according to the application requirement of the initial data includes:

extracting keywords in the application requirements, and matching corresponding initial classification angles for each keyword;

and selecting the initial classification angle with the largest number as the classification angle of the initial data.
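The selection of the most frequent initial classification angle amounts to a majority vote over the keywords. A minimal sketch follows, assuming a hypothetical keyword-to-angle lookup table and an already extracted keyword list:

```python
from collections import Counter
from typing import Dict, List


def pick_classification_angle(keywords: List[str], keyword_to_angle: Dict[str, str]) -> str:
    """Match each keyword to an initial angle and return the most frequent angle."""
    angles = [keyword_to_angle[k] for k in keywords if k in keyword_to_angle]
    if not angles:
        raise ValueError("no keyword could be matched to a classification angle")
    return Counter(angles).most_common(1)[0][0]


# Hypothetical lookup table and keywords extracted from an application requirement.
lookup = {"purchase": "human activity", "checkout": "human activity", "route": "geography"}
angle = pick_classification_angle(["purchase", "checkout", "route"], lookup)
```

Keyword extraction itself is outside the scope of this sketch.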

Preferably, in step 2, according to the dimension attribute of the initial data, the data type of the initial data is divided to obtain dimension data, which includes:

based on the characteristics of the classification dimensions, setting analysis points under each classification dimension and the weight of each analysis point;

setting an attribute determination model based on the analysis points of the classification dimension and the corresponding weights thereof;

inputting the initial data into the attribute determining model, and determining the dimension attribute of the initial data;

and acquiring an initial dimension corresponding to the dimension attribute as the dimension of the initial data to obtain corresponding dimension data.

Preferably, setting an attribute determination model based on the analysis points of the classification dimension and the weights corresponding to the analysis points, includes:

setting the number of channels and the attribute corresponding to each channel based on the classification dimension, and constructing a channel model based on the number of channels and the attribute corresponding to each channel;

Analyzing the analysis points of the classification dimension, and determining the association characteristics and the analysis sequence among the analysis points;

acquiring an initial analysis mode corresponding to the analysis point from a data analysis library, and selecting an initial analysis mode with association according to the association characteristic;

splitting the initial analysis mode with the association to determine sub-analysis rules, selecting the same sub-analysis rules in the initial analysis mode with the association, and determining the positions of the same sub-analysis rules in the initial analysis mode;

combining and simplifying the initial analysis modes with the association based on the same sub-analysis rules and the positions to obtain a target analysis mode;

based on the analysis sequence, establishing an analysis flow of the target analysis mode, acquiring analysis resources corresponding to the analysis flow, and constructing an analysis layer based on the analysis resources;

setting corresponding calculation rules based on the weights corresponding to the analysis points, and constructing an evaluation layer based on the calculation rules;

based on the weight corresponding to the analysis point, establishing a first connection relation between the analysis layer and the evaluation layer;

establishing a data analysis rule in the channel model according to the first connection relation by using the analysis layer and the evaluation layer;

Each channel in the channel model is connected with an output layer;

the output layer comprises a score comparison layer which is used for comparing the output scores of each channel and selecting the maximum score value;

the output layer further comprises an attribute output layer, which is used for determining the attribute of the channel corresponding to the maximum scoring value and taking the attribute as the dimension attribute of final initial data;

and establishing an attribute determination model based on the channel model and the output layer.
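The structure of the attribute determination model (channels, each with an analysis layer and an evaluation layer, feeding a shared output layer) can be sketched as below. The sketch rests on assumptions not fixed by the text: each analysis point is a function returning a score in [0, 1], the calculation rule is a weighted sum, and all class and function names are hypothetical. It omits the merging of shared sub-analysis rules.

```python
import math
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
AnalysisFn = Callable[[Record], float]  # assumed to return a score in [0, 1]


class Channel:
    """One channel: an attribute plus weighted analysis points (hypothetical design)."""

    def __init__(self, attribute: str, points: List[Tuple[AnalysisFn, float]]):
        self.attribute = attribute
        self.points = points

    def score(self, record: Record) -> float:
        # Analysis layer: run each analysis point; evaluation layer: weighted sum.
        return sum(weight * analyse(record) for analyse, weight in self.points)


def determine_attributes(record: Record, channels: List[Channel]) -> List[str]:
    """Output layer: compare channel scores and return the top attribute(s)."""
    scores = {c.attribute: c.score(record) for c in channels}
    best = max(scores.values())
    return [a for a, s in scores.items() if math.isclose(s, best)]


shopping = Channel("shopping", [
    (lambda r: 1.0 if "shop" in r else 0.0, 0.2),        # shopping manner
    (lambda r: 1.0 if "paid_with" in r else 0.0, 0.5),   # payment manner
    (lambda r: 1.0 if r.get("verified") else 0.0, 0.3),  # authentication manner
])
search = Channel("search", [(lambda r: 1.0 if "query" in r else 0.0, 1.0)])

record = {"shop": "online", "paid_with": "credit card", "verified": True}
result = determine_attributes(record, [shopping, search])
```

`determine_attributes` returns a list rather than a single value so that a tie between channels stays visible to the caller.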

Preferably, inputting the initial data into the attribute determining model, determining the dimension attribute of the initial data includes:

inputting the initial data into the attribute determination model to obtain an output dimension;

judging whether the output dimension is one dimension or not;

if yes, taking the output dimension as the dimension attribute of the initial data;

otherwise, determining the dimension characteristics of each dimension contained in the output dimension, and acquiring the related characteristics among all dimensions based on the dimension characteristics;

based on the relevant characteristics, relevant analysis points of the initial data are obtained from the analysis points, and the output dimension corresponding to the highest score of the initial data at the relevant analysis points is selected to be used as the dimension attribute of the initial data.
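The tie-breaking step can be illustrated with a short sketch. It assumes that the relevant analysis points have already been identified and that per-dimension scores at those points are available as plain numbers; `resolve_tie` and its argument layout are hypothetical.

```python
from typing import Dict, List


def resolve_tie(
    output_dims: List[str],
    relevant_points: List[str],
    point_scores: Dict[str, Dict[str, float]],
) -> str:
    """Pick one dimension; on a tie, use the scores at the relevant analysis points."""
    if len(output_dims) == 1:
        return output_dims[0]

    def relevant_total(dim: str) -> float:
        # Sum of this dimension's scores over the relevant analysis points only.
        return sum(point_scores[dim].get(p, 0.0) for p in relevant_points)

    return max(output_dims, key=relevant_total)


# Hypothetical tie between two dimensions, separated by the payment manner.
scores = {
    "payment": {"payment_manner": 0.9, "shopping_manner": 0.5},
    "activity": {"payment_manner": 0.4, "shopping_manner": 0.5},
}
chosen = resolve_tie(["payment", "activity"], ["payment_manner"], scores)
```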

Preferably, in step 3, naming the data name of the dimension data according to the type attribute of the dimension data to obtain a data type name, including:

determining name keywords under the type attribute according to the type attribute of the dimension data;

determining application keywords of the dimension data according to the application requirements of the dimension data in the historical application;

wherein the name keywords and the application keywords are a plurality of;

randomly combining the name keywords and the application keywords to obtain a first data name;

inputting the first data name into a semantic scoring model, and selecting a data name with a scoring value larger than a preset value from the first data names as a second data name according to a scoring result;

acquiring a historical search name for the dimension data, determining the probability of successfully acquiring target dimension data by a user under the historical search name, and acquiring an optimal search name from the historical search name based on the probability;

acquiring a data name with the similarity with the optimal search name being larger than a preset similarity value from the second data name as a third data name;

Extracting keywords from the dimension data under the type attribute to obtain data keywords;

acquiring the adaptation degree of the data keyword and the third data name, and selecting the third data name with the largest adaptation degree as a fourth data name;

setting a data name format based on a query rule of the data processing system, wherein the data name format comprises a name word number and a name text attribute;

judging whether the fourth data name meets the data name format or not;

if yes, taking the fourth data name as the data type name of the dimension data;

otherwise, determining the specific content of the data name format that the fourth data name does not meet;

if the specific content is the number of the name words, dividing the fourth data name according to the type attribute to obtain a plurality of groups of word patterns, and adding or deleting the number of the words of the fourth data name based on the matching degree of the word patterns and the type attribute to obtain a fifth data name as the data type name of the dimension data;

and if the specific content is the name text attribute, extracting characters which do not meet the name text attribute from the fourth data name, carrying out similarity replacement on the characters based on the name text attribute to obtain target characters which meet the name text attribute, and modifying the fourth data name based on the target characters to obtain a sixth data name serving as the data type name of the dimension data.
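The successive filtering from first to fourth data name, followed by a simplified format check, can be sketched as follows. Several parts are stand-ins chosen for illustration: the semantic scoring model is passed in as a function, similarity is measured with Python's `difflib.SequenceMatcher`, the adaptation degree is approximated by counting data keywords that occur in the name, all keyword pairings are enumerated instead of being combined at random, and the format check only limits the word count.

```python
from difflib import SequenceMatcher
from itertools import product
from typing import Callable, Dict, List


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for the unspecified measure."""
    return SequenceMatcher(None, a, b).ratio()


def choose_type_name(
    name_keywords: List[str],
    application_keywords: List[str],
    semantic_score: Callable[[str], float],  # hypothetical semantic scoring model
    search_success: Dict[str, float],        # historical search name -> success probability
    data_keywords: List[str],
    min_score: float = 0.5,
    min_similarity: float = 0.5,
    max_words: int = 3,
) -> str:
    # First data names: pairings of a name keyword with an application keyword.
    first = [f"{n} {a}" for n, a in product(name_keywords, application_keywords)]
    # Second data names: those the semantic scoring model rates above the preset value.
    second = [c for c in first if semantic_score(c) > min_score]
    # Optimal search name: the historical name with the highest success probability.
    best_search = max(search_success, key=lambda name: search_success[name])
    # Third data names: sufficiently similar to the optimal search name.
    third = [c for c in second if similarity(c, best_search) > min_similarity]
    if not third:
        raise ValueError("no candidate name survived the filters")
    # Fourth data name: best adaptation to the data keywords (assumed: keyword hits).
    fourth = max(third, key=lambda c: sum(k in c for k in data_keywords))
    # Simplified format check: trim names that exceed the allowed word count.
    words = fourth.split()
    return " ".join(words[:max_words]) if len(words) > max_words else fourth


name = choose_type_name(
    name_keywords=["payment"],
    application_keywords=["records", "risk"],
    semantic_score=lambda candidate: 0.9,
    search_success={"payment records": 0.8, "money": 0.3},
    data_keywords=["records", "alipay"],
)
```

The replacement of characters that violate the name text attribute is not covered by this sketch.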

Preferably, in step 2, after obtaining the dimension data, the method further includes: verifying the dimension data, including:

dividing the dimension data into two classes according to a time characteristic to obtain first data and second data;

determining a data center point of the dimensional data based on the dimensional characteristics of the first data and the second data;

calculating a variance value between the first data and the second data based on the center point;

judging whether the variance value is smaller than a preset variance value or not;

if yes, the classification of the dimension data is indicated to meet the requirements;

otherwise, the classification of the dimension data is not satisfied, and the dimension data needs to be reclassified.
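The verification step can be sketched in a few lines. The formulas used here are assumptions made for illustration and may differ from the patent's own: the center point is taken as the pooled mean of the data dimension values of both groups, and the variance as the mean squared deviation of all values from that center.

```python
from typing import Sequence


def center_point(first: Sequence[float], second: Sequence[float]) -> float:
    """Assumed center point: pooled mean of the values in both groups."""
    return (sum(first) + sum(second)) / (len(first) + len(second))


def variance_about_center(first: Sequence[float], second: Sequence[float]) -> float:
    """Assumed variance: mean squared deviation of all values from the center."""
    c = center_point(first, second)
    total = sum((a - c) ** 2 for a in first) + sum((b - c) ** 2 for b in second)
    return total / (len(first) + len(second))


def classification_ok(
    first: Sequence[float], second: Sequence[float], max_variance: float
) -> bool:
    """True if the variance is below the preset value; otherwise reclassify."""
    return variance_about_center(first, second) < max_variance


# Hypothetical data dimension values before and after the preset time point.
before = [0.8, 0.9]
after = [0.85, 0.85]
consistent = classification_ok(before, after, max_variance=0.01)
```

With these values the center is 0.85 and the variance is 0.00125, so the classification passes a preset variance of 0.01.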

A data processing system starting from a data origin, comprising:

the dimension determining module is used for determining the classification dimension of the data classification according to the data origin of the initial data;

the data classification module is used for carrying out data type division on the initial data according to the dimension attribute of the initial data to obtain dimension data;

and the data naming module is used for naming the data name of the dimension data according to the type attribute of the dimension data to obtain the data type name.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

The technical scheme of the invention is further described in detail through the drawings and the embodiments.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:

FIG. 1 is a flow chart of a data processing method from data origin in an embodiment of the invention;

FIG. 2 is another flow chart of an embodiment of the present invention;

FIG. 3 is a block diagram of a data processing system that begins with the origin of data in an embodiment of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.

Example 1

An embodiment of the present invention provides a data processing method starting from a data origin, as shown in fig. 1, including:

In this embodiment, the classification dimension may be extracted, for example, from the full course of the data, i.e., how the data is generated and used. Data is generated in people's daily production and life; people then find that the data has value; and finally people decide to apply the data in order to mine that value.

In this embodiment, the data dimensions may be divided, for example, as follows: (1) people's queries about unknown things in daily life, such as searches made with browsers and map software; (2) people's various activities in daily life, such as eating and drinking, travel, shopping, and financial management; (3) payments people make for their activities, such as consumer payments made with Alipay, credit cards, and the like; (4) people's communication and interaction with others, such as contact via WeChat, Momo, voice calls, and the like; (5) people's demographic or social attributes, such as attribute information from credit bureaus, public payment bureaus, judicial bureaus, and the like.

In this embodiment, the type attributes of the data are named, for example, as follows: (1) the dimension of people's queries about unknown things in daily life, such as searches made with browsers and map software, is named "search"; (2) the dimension of people's various activities in daily life, such as clothing and food, travel, shopping, and financial management, is named "activity"; (3) the dimension of payments people make for their activities, such as consumer payments made with Alipay, credit cards, and the like, is named "payment"; (4) the dimension of people's communication and interaction with others, such as contact via WeChat, Momo, voice calls, and the like, is named "social"; (5) the dimension of people's demographic or social attributes, such as attribute information from credit bureaus, public payment bureaus, judicial bureaus, and the like, is named "government affairs".

The beneficial effects of the above design scheme are: because the classification dimension is determined from the data origin, it can cover all data; dividing the data according to the dimension attribute ensures that the division is complete and that each item falls into a single class; and choosing a suitable data name according to the type attribute ensures efficient and accurate data queries. Together these improve the effective use and mining of the data.

Example 2

Based on embodiment 1, the embodiment of the present invention provides a data processing method starting from a data origin, and before step 1, the method further includes: acquiring a data origin of the initial data, comprising:

The beneficial effects of the above design scheme are: by tracing back through the nodes the initial data has passed, the initial acquisition node is obtained, and the data origin of the initial data is determined from the timestamp record of the initial acquisition node, which provides a basis for determining the classification dimension of the data classification.

Example 3

Based on embodiment 1, the embodiment of the present invention provides a data processing method starting from a data origin, in step 1, determining a classification dimension of a data classification according to the data origin of initial data includes:

In this embodiment, the classification angle takes, for example, people as its starting point; the initial dimensions then include clothing and food, travel, shopping, and entertainment activities that may occur in people's daily life. If, for example, the emphasis of the application requirement is shopping, the preset number of target dimensions are a payment dimension, a browsing dimension, a recommendation dimension, and so on.

The beneficial effects of the above design scheme are: the classification dimension is determined according to the application requirement for the initial data, which ensures that the classification dimensions are designed reasonably and completely and provides a foundation for dividing the data.

Example 4

Based on embodiment 3, an embodiment of the present invention provides a data processing method starting from a data origin, determining a classification angle according to an application requirement on the initial data, including:

The beneficial effects of the above design scheme are: among the initial classification angles matched to the keywords in the application requirement, the one that occurs most often is selected as the classification angle of the initial data, which ensures that the chosen classification angle fits the application requirement.

Example 5

Based on embodiment 1, a data processing method from a data origin is characterized in that, as shown in fig. 2, in step 2, according to a dimension attribute of the initial data, data type division is performed on the initial data to obtain dimension data, including:

step 2-1: based on the characteristics of the classification dimensions, setting analysis points under each classification dimension and the weight of each analysis point;

step 2-2: setting an attribute determination model based on the analysis points of the classification dimension and the corresponding weights thereof;

Step 2-3: inputting the initial data into the attribute determining model, and determining the dimension attribute of the initial data;

step 2-4: and acquiring an initial dimension corresponding to the dimension attribute as the dimension of the initial data to obtain corresponding dimension data.

In this embodiment, for example, if the classification dimension is a shopping dimension, the corresponding analysis points may be the shopping manner, the payment manner, the authentication manner, and so on, with corresponding weights of 0.2, 0.5, and 0.3; the more important the analysis point, the higher its weight.
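As a worked illustration of how such weights might combine, assume (this is not stated in the source) that each analysis point $k$ yields a score $s_k \in [0, 1]$ and that the score of the shopping channel is the weighted sum of those scores. With hypothetical scores of 0.9, 0.6, and 1.0 for the shopping manner, the payment manner, and the authentication manner:

```latex
S_{\mathrm{shopping}} = \sum_{k=1}^{3} w_k s_k
  = 0.2 \times 0.9 + 0.5 \times 0.6 + 0.3 \times 1.0
  = 0.18 + 0.30 + 0.30
  = 0.78
```

Under this assumption the payment manner contributes most to the channel score because it carries the largest weight.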

The beneficial effects of the above design scheme are: analysis points and their weights are set according to the characteristics of the classification dimensions, and an attribute determination model is built from them; inputting the initial data into the model yields the dimension of the initial data and the corresponding dimension data, which ensures that the initial data is divided accurately.

Example 6

Based on embodiment 5, the embodiment of the invention provides a data processing method starting from data origin, and based on the analysis points of the classification dimension and the corresponding weights thereof, an attribute determination model is set, which comprises the following steps:

Each channel in the channel model is connected with an output layer;

The working principle of the attribute determination model is that the initial data is input into a channel model of the attribute determination model, after the analysis layer of each channel in the channel model analyzes the initial data, the score of the initial data in the channel is determined through the evaluation layer, and then the attribute of the corresponding channel with the largest score is selected through the output layer to be used as the dimension attribute of the initial data, so that dimension data is obtained.

The beneficial effects of the above design scheme are: the initial analysis modes are simplified and combined according to the association characteristics and the analysis sequence among the analysis points of the classification dimension to obtain the target analysis mode. Analyzing and scoring the initial data in every classification dimension ensures that its dimension is determined accurately, while simplifying and combining the initial analysis modes avoids wasting analysis resources and reduces the complexity of analyzing the initial data, which keeps the attribute determination model efficient and improves the efficiency of classifying the initial data.

Example 7

Based on embodiment 5, an embodiment of the present invention provides a data processing method starting from a data origin, inputting the initial data into the attribute determining model, determining a dimension attribute of the initial data, including:

judging whether the output dimension is one dimension or not;

In this embodiment, an output dimension of more than one dimension indicates that more than one channel gave the initial data the same score.

In this embodiment, the relevant analysis point is obtained from the analysis points, and is a key analysis point that further determines to which dimension the initial data belongs.

The beneficial effects of the above design scheme are: the output dimension of the attribute determination model is analyzed, and when more than one dimension is output, the scores at the relevant analysis points serve as the basis for the final determination of the dimension attribute. This avoids overlap between categories, i.e., the same data cannot be assigned to more than one category at the same time, and ensures the accuracy of data classification.

Example 8

Based on embodiment 1, the embodiment of the present invention provides a data processing method starting from a data origin, in step 3, according to a type attribute of the dimension data, performing data name naming on the dimension data to obtain a data type name, including:

wherein the name keywords and the application keywords are a plurality of;

judging whether the fourth data name meets the data name format or not;

The working principle and the beneficial effects of the above design scheme are as follows. A first data name related to the dimension data is determined from the type attribute of the dimension data and its application requirements, which ensures that the first data name is relevant to the dimension data. Because the first data names are obtained by randomly combining name keywords and application keywords, some may be semantically inappropriate, so second data names are selected with a semantic scoring model to ensure semantic soundness. Third data names are then determined from users' historical searches for the dimension data, which ensures that they agree with how users search. Finally, because the dimension data under a dimension attribute differ from one another, the fourth data name that best matches the data keywords of the dimension data is selected from the third data names, which ensures that the fourth data name matches the dimension data being searched for.

Example 9

Based on embodiment 1, the embodiment of the present invention provides a data processing method starting from a data origin, and in step 2, after obtaining dimension data, the method further includes: verifying the dimension data, including:

the calculation formula of the data center point $L_c$ is as follows:

wherein $n$ represents the number of data items in the first data, $m$ represents the number of data items in the second data, $a_i$ represents the data dimension value of the $i$-th data item of the first data, and $b_j$ represents the data dimension value of the $j$-th data item of the second data;

the calculation formula of the variance value $\sigma$ is as follows:

In this embodiment, the time characteristic is before the preset time point and after the preset time point, the first data is dimension data before the preset time point, and the second data is dimension data after the preset time point.

In this embodiment, the data center points of the dimension data are used to represent the average data characteristics of the dimension data.

In this embodiment, the data dimension value represents the degree of matching between the dimension data and the dimension; the greater the degree of matching, the larger the data dimension value, which lies in the interval (0, 1).

The beneficial effects of the above design scheme are: because dimension data is acquired at different points in time, the passage of time can affect its classification. The scheme therefore determines the data center point of the dimension data on the basis of the time characteristic and determines the variance of the first data and the second data about that center point, thereby verifying the overall classification of the dimension data. This ensures that the verification is accurate and the classification reasonable, and provides a basis for subsequent processing and searching of the data.

Example 10

A data processing system, as shown in figure 3, starting from a data origin, comprising:

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A method of data processing starting from a data origin, comprising:

step 1: according to the data origin of the initial data, determining the classification dimension of the data classification, specifically:

extracting initial dimensions consistent with the classification angle from a dimension database, and selecting a preset number of target dimensions from the initial dimensions based on the emphasis of the application demand;

otherwise, determining a missing aspect, and matching the optimal dimension for the missing aspect as a complementary dimension, and forming a classification dimension for data classification by the complementary dimension and a target dimension together;

step 2: according to the dimension attribute of the initial data, carrying out data type division on the initial data to obtain dimension data, wherein the dimension data is specifically as follows:

acquiring an initial dimension corresponding to the dimension attribute as the dimension of the initial data to obtain corresponding dimension data;

based on the analysis points of the classification dimension and the corresponding weights thereof, setting an attribute determination model, comprising:

each channel in the channel model is connected with an output layer;

establishing an attribute determination model based on the channel model and the output layer;

Step 3: according to the type attribute of the dimension data, carrying out data name naming on the dimension data to obtain a data type name, wherein the data type name is specifically as follows:

wherein the name keywords and the application keywords are a plurality of;

judging whether the fourth data name meets the data name format or not;

2. A method of data processing starting from a data origin according to claim 1, further comprising, prior to step 1: acquiring a data origin of the initial data, comprising:

3. A data processing method starting from a data origin according to claim 1, wherein determining a classification angle based on the application requirements for the initial data comprises:

4. A data processing method starting from a data origin according to claim 1, wherein inputting the initial data into the attribute determination model, determining the dimensional attributes of the initial data, comprises:

judging whether the output dimension is one dimension or not;

5. A data processing method starting from a data origin according to claim 1, characterized in that in step 2, after obtaining the dimensional data, it further comprises: verifying the dimension data, including:

6. A data processing system, proceeding from a data origin, comprising:

a dimension determination module for determining a classification dimension of a data classification based on a data origin of the initial data, comprising:

The data classification module is used for carrying out data type division on the initial data according to the dimension attribute of the initial data to obtain dimension data, and comprises the following steps:

each channel in the channel model is connected with an output layer;

the data naming module is used for naming the data name of the dimension data according to the type attribute of the dimension data to obtain the data type name, and comprises the following steps:

wherein the name keywords and the application keywords are a plurality of;

judging whether the fourth data name meets the data name format or not;