CN110929105B

CN110929105B - User ID (identity) association method based on big data technology

Info

Publication number: CN110929105B
Application number: CN201911190714.2A
Authority: CN
Inventors: 李元佳; 陈新宇; 李柱新; 李剑伟
Original assignee: Guangdong Yunxi Intelligent Technology Co ltd
Current assignee: Guangdong Yunxi Intelligent Technology Co ltd
Priority date: 2019-11-28
Filing date: 2019-11-28
Publication date: 2022-11-29
Anticipated expiration: 2039-11-28
Also published as: CN110929105A

Abstract

The invention discloses a user ID association method based on big data technology, comprising the following steps: step A, reading configuration information; step B, pulling data from a data warehouse according to the configuration information and constructing graph vertices and edges; step C, merging duplicate graph vertices and graph edges through Spark GraphX; step D, performing a directed search for the N-degree relations of each graph vertex and collecting them; step E, judging by rules, from the N-degree relation nodes of each graph vertex collected in the previous step, whether low-level vertices have multiple affiliation relations; step F, performing confidence calculation on the ambiguous nodes collected in the previous step according to the configured ambiguity rules, thereby completing the attribution judgment of the ambiguous vertices; and step G, storing the output result in a specified database in a specified format according to the storage configuration. The user ID association method provided by the invention mines data more completely and accurately.

Description

User ID (identity) association method based on big data technology

Technical Field

The invention relates to the field of user ID identification, and in particular to a user ID association method based on big data technology.

Background

With the development of the internet, enterprise group services are reachable through more and more channels. Users can access an enterprise's services under different identities and carry out transactions with the enterprise, so the enterprise collects a large amount of user identity information. This overload of identity information causes increasingly prominent problems: it is difficult to identify a unique user across multiple channels efficiently and stably, information is easily overlooked or lost, processing is complex and inefficient, the real situation of a user cannot be fully understood, and user operations become more difficult. A method of associating user IDs is therefore needed to cope with the large number of user identities.

The prior art includes the following methods for associating user IDs:

1. Converting business rules into corresponding SQL statements, then joining and mapping the primary keys of multiple data sets to complete the user ID association.

2. Exporting the database data, converting it into a Spark DataFrame, performing graph computation with Spark GraphX, and completing the user ID association according to the connectivity of the graph.

3. Determining the association features between different user IDs from historical user ID logs, building a user ID mapping list, calculating the confidence between a user ID and the corresponding user IDs of other types, and judging from the confidence whether user IDs from different data sources are associated.

Methods 1 and 3 both require a data primary key to associate user IDs, and in practice such a key cannot be found. Method 2, implemented with Spark GraphX, may generate an excessively large connected graph, which introduces data noise. The present method is based on big data technology: it combines Spark GraphX graph computation with confidence rules to cut the association graph and thereby completes the user ID association, which effectively solves the problems of the prior art.

Disclosure of Invention

To address the problems in the prior art, the invention provides a user ID association method based on big data technology. It solves the problems of conventional user ID association methods, in which user information is easily overlooked or lost, processing is complex, and efficiency is low.

To achieve this purpose, the invention provides the following technical solution:

A user ID association method based on big data technology, comprising the following steps:

step A, reading configuration information, wherein the configuration information comprises data source table configuration, data table and mapping field configuration, result storage configuration, and ambiguity rules;

step B, pulling data from a data warehouse according to the configuration information, and constructing graph vertices and edges;

step C, merging duplicate graph vertices and graph edges through Spark GraphX, and connecting the graph vertices according to the graph edges to generate a plurality of association graphs;

step D, performing a directed search for the N-degree relations of each graph vertex and collecting them, judging by rules whether the nodes in each vertex's relation set are ambiguous, and marking any ambiguous nodes as ambiguous nodes;

step E, judging by rules, from the N-degree relation nodes of each graph vertex collected in the previous step, whether low-level vertices have multiple affiliation relations, and marking those that do;

step F, performing confidence calculation on the ambiguous nodes collected in the previous step according to the configured ambiguity rules, thereby completing the attribution judgment of the ambiguous vertices;

and step G, storing the output result in a specified database in a specified format according to the storage configuration.
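By way of non-limiting illustration, step E can be sketched in Python as follows. The text does not define what makes a vertex "low-level" or which rule detects multiple affiliation, so the sketch assumes that every ID type has a configured rank and that a vertex has multiple affiliations when its collected relations contain more than one higher-level vertex of the same type. The names `Vertex`, `LEVEL`, and `multi_affiliated`, and the ID types shown, are hypothetical.

```python
from typing import Dict, List, NamedTuple, Set


class Vertex(NamedTuple):
    id_type: str   # e.g. "member_id", "phone", "device_id" (illustrative types)
    value: str


# Assumed ranking of ID types: a larger number means a higher-level identity.
LEVEL: Dict[str, int] = {"member_id": 2, "phone": 1, "device_id": 0}


def multi_affiliated(relations: Dict[Vertex, Set[Vertex]]) -> List[Vertex]:
    """Return vertices related to two or more higher-level vertices of one type."""
    marked = []
    for vertex, related in relations.items():
        owners_by_type: Dict[str, Set[Vertex]] = {}
        for other in related:
            if LEVEL[other.id_type] > LEVEL[vertex.id_type]:
                owners_by_type.setdefault(other.id_type, set()).add(other)
        if any(len(owners) > 1 for owners in owners_by_type.values()):
            marked.append(vertex)
    return sorted(marked)


alice = Vertex("member_id", "M1")
bob = Vertex("member_id", "M2")
shared_device = Vertex("device_id", "D9")

# Relation sets as they might look after step D (hand-written for the example).
relations = {
    shared_device: {alice, bob},   # one device seen with two member IDs
    alice: {shared_device},
    bob: {shared_device},
}
marked = multi_affiliated(relations)
print(marked)
```

In this example only the shared device is marked, because it is the only vertex attached to two member IDs; the member IDs themselves have no higher-level neighbours.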

Further, step A also includes, before the configuration information is read, a step of configuring the data source, the data table, the field mapping, the ambiguity rules, and the result data storage information.

Further, the ambiguity rules include, but are not limited to, rules based on reference tables, rules based on weight attributes, rules based on confidence algorithms, and rules based on time.
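As a hedged illustration of how such rules might feed the confidence calculation of step F, the following Python sketch combines a weight-attribute rule with a time-based rule. The text names the rule families but gives no formulas, so the source weights, the exponential time decay, the 90-day half-life, and the names `Link`, `SOURCE_WEIGHT`, `confidence`, and `attribute` are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class Link:
    """One observed link between an ambiguous vertex and a candidate owner."""
    owner: str        # candidate user the ambiguous vertex may belong to
    source: str       # data source that produced the link
    last_seen: date   # most recent date the link was observed


# Assumed weight-attribute rule: how much each data source is trusted.
SOURCE_WEIGHT: Dict[str, float] = {"crm": 1.0, "app_login": 0.8, "web_cookie": 0.3}


def confidence(link: Link, today: date, half_life_days: float = 90.0) -> float:
    """Source weight multiplied by an exponential time decay (assumed formula)."""
    age_days = (today - link.last_seen).days
    return SOURCE_WEIGHT.get(link.source, 0.1) * 0.5 ** (age_days / half_life_days)


def attribute(links: List[Link], today: date) -> str:
    """Sum the confidence per candidate owner and return the best-scoring owner."""
    totals: Dict[str, float] = {}
    for link in links:
        totals[link.owner] = totals.get(link.owner, 0.0) + confidence(link, today)
    # Iterating over sorted names makes ties resolve deterministically.
    return max(sorted(totals), key=lambda name: totals[name])


today = date(2019, 11, 28)
links = [
    Link("user_A", "crm", date(2019, 11, 28)),         # fresh link, trusted source
    Link("user_B", "web_cookie", date(2019, 11, 28)),  # fresh link, weak source
    Link("user_B", "app_login", date(2019, 5, 2)),     # older link, decayed
]
owner = attribute(links, today)
print(owner)
```

Here the ambiguous vertex is attributed to `user_A`, whose single fresh, high-weight link outscores the two weaker links to `user_B`. A reference-table rule could be added in the same style as a lookup that overrides the score.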

Further, the data pulled from the data warehouse is cleaned data, and the data cleaning specifically includes, but is not limited to, deduplication, outlier handling, special-symbol handling, and format handling.
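The four cleaning operations can be illustrated with a minimal Python sketch. The record fields (`phone`, `email`), the 11-digit phone rule, and the function names are assumptions chosen for the example; the text does not specify which fields are cleaned or how outliers are defined.

```python
import re
from typing import Dict, Iterable, List, Optional

Record = Dict[str, str]


def normalize_phone(raw: str) -> Optional[str]:
    """Strip special symbols; treat anything that is not 11 digits as an outlier."""
    digits = re.sub(r"\D", "", raw)
    return digits if len(digits) == 11 else None


def clean(records: Iterable[Record]) -> List[Record]:
    """Format, filter outliers, and de-duplicate records while keeping input order."""
    seen = set()
    cleaned = []
    for rec in records:
        phone = normalize_phone(rec.get("phone", ""))   # special-symbol handling
        email = rec.get("email", "").strip().lower()    # format handling
        if phone is None or "@" not in email:           # outlier handling
            continue
        key = (phone, email)
        if key in seen:                                 # deduplication
            continue
        seen.add(key)
        cleaned.append({"phone": phone, "email": email})
    return cleaned


rows = [
    {"phone": "138-0013-8000", "email": " Alice@Example.com "},
    {"phone": "13800138000", "email": "alice@example.com"},   # duplicate once cleaned
    {"phone": "12345", "email": "bob@example.com"},           # outlier: too short
]
cleaned = clean(rows)
print(cleaned)
```

Formatting is applied before deduplication on purpose: the first two rows only become identical after the symbols and letter case have been normalized.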

The big data-based user ID association method provided by the invention has the following beneficial effects. For user ID association across multiple data sets, general SQL-based user ID association cannot be carried out when no data primary key exists, whereas the present method does not depend on such a key. The invention also cuts the user association data by confidence, which avoids an excessively large association graph and the entanglement of data from multiple users. Compared with schemes that rely only on historical user ID logs for user ID association, the method provided by the invention mines data more completely and accurately.

Drawings

FIG. 1 is a flowchart of a big data-based user ID association method according to the present invention.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention, but are not intended to limit the scope of the invention.

The user ID association method based on big data comprises the following steps:

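step A, reading configuration information, wherein the configuration information comprises data source table configuration, data table and mapping field configuration, result storage configuration, and ambiguity rules;

step B, pulling data from the data warehouse according to the configuration information, and constructing graph vertices and edges;

step C, merging duplicate graph vertices and graph edges through Spark GraphX, and connecting the graph vertices according to the graph edges to generate a plurality of association graphs;

step D, performing a directed search for the N-degree relations of each graph vertex and collecting them, judging by rules whether the nodes in each vertex's relation set are ambiguous, and marking any ambiguous nodes as ambiguous nodes;

step E, judging by rules, from the N-degree relation nodes of each graph vertex collected in the previous step, whether low-level vertices have multiple affiliation relations, and marking those that do;

step F, performing confidence calculation on the ambiguous nodes collected in the previous step according to the configured ambiguity rules, thereby completing the attribution judgment of the ambiguous vertices;

and step G, storing the output result in a specified database in a specified format according to the storage configuration.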

Step A also includes, before the configuration information is read, configuring the data source, the data table, the field mapping, the ambiguity rules, and the result data storage information.

The ambiguity rules include, but are not limited to, rules based on a reference table, rules based on weight attributes, rules based on a confidence algorithm, and rules based on time.

The data pulled from the data warehouse is cleaned data, and the data cleaning specifically includes, but is not limited to, deduplication, outlier handling, special-symbol handling, and format handling.

Examples

Referring to FIG. 1, the big data-based user ID association method of the present invention includes the following steps:

First, valid user information that has already been cleaned and filtered is read. The cleaning and filtering work specifically comprises deduplication, outlier handling, special-symbol handling, format alignment, and the like.

The configuration information is then read. It specifically comprises data source table configuration, data table and mapping field configuration, result storage configuration, ambiguity rules, and the like.
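A configuration of this kind might look like the following Python sketch, given purely as an illustration. The text lists the configuration sections but not their layout, so every key, table name, field name, and rule parameter below is hypothetical, as is the `validate` helper.

```python
from typing import Any, Dict, List

# Hypothetical configuration; all table, field, and rule names are illustrative.
CONFIG: Dict[str, Any] = {
    # Data source tables, each with a mapping from table field to ID type.
    "sources": [
        {
            "table": "dw.crm_member",
            "mapping": {"member_no": "member_id", "mobile": "phone"},
        },
        {
            "table": "dw.app_login_log",
            "mapping": {"mobile": "phone", "device": "device_id"},
        },
    ],
    # Ambiguity rules applied to ambiguous nodes.
    "ambiguity_rules": [
        {"type": "weight", "weights": {"dw.crm_member": 1.0, "dw.app_login_log": 0.8}},
        {"type": "time", "half_life_days": 90},
    ],
    # Where and in what format the result is stored.
    "storage": {"database": "result_db", "table": "user_id_mapping", "format": "json"},
}

REQUIRED_KEYS = ("sources", "ambiguity_rules", "storage")


def validate(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = [f"missing section: {key}" for key in REQUIRED_KEYS if key not in config]
    for i, source in enumerate(config.get("sources", [])):
        if not source.get("table"):
            problems.append(f"sources[{i}] has no table")
        if not source.get("mapping"):
            problems.append(f"sources[{i}] has no field mapping")
    return problems


problems = validate(CONFIG)
print(problems)
```

Validating the configuration before any data is pulled lets a missing section fail fast instead of surfacing later during graph construction.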

Graph vertices and edges are then generated. Using the data obtained in the previous step, duplicate graph vertices and graph edges are merged through Spark GraphX, and the graph vertices are connected according to the graph edges to generate a plurality of association graphs.

The association graphs are then generated.
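The generation of vertices, edges, and association graphs can be sketched in plain Python as follows. This is a stand-in for illustration only and does not use the Spark GraphX API: duplicates are merged with sets, and the association graphs are computed as connected components with a union-find structure. The row layout and the names `build_graph` and `association_graphs` are assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple

Vertex = Tuple[str, str]            # (id_type, value)
Edge = Tuple[Vertex, Vertex]


def build_graph(rows: Iterable[Dict[str, str]]) -> Tuple[Set[Vertex], Set[Edge]]:
    """Turn each row into vertices and edges; the sets merge duplicate items."""
    vertices: Set[Vertex] = set()
    edges: Set[Edge] = set()
    for row in rows:
        ids = sorted((k, v) for k, v in row.items() if v)
        vertices.update(ids)
        # Chain the IDs of one row; sorting gives every edge a canonical direction.
        edges.update(zip(ids, ids[1:]))
    return vertices, edges


def association_graphs(vertices: Set[Vertex], edges: Set[Edge]) -> List[Set[Vertex]]:
    """Group vertices into connected components using union-find."""
    parent = {v: v for v in vertices}

    def find(v: Vertex) -> Vertex:
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)

    groups: Dict[Vertex, Set[Vertex]] = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=min)


rows = [
    {"member_id": "M1", "phone": "13800138000"},
    {"phone": "13800138000", "device_id": "D1"},
    {"member_id": "M1", "phone": "13800138000"},   # repeated row, merged by the sets
    {"member_id": "M2", "device_id": "D2"},
]
vertices, edges = build_graph(rows)
graphs = association_graphs(vertices, edges)
print(len(vertices), len(edges), len(graphs))
```

The four rows yield five distinct vertices, three distinct edges, and two association graphs: one linking `M1`, the phone number, and `D1`, and one linking `M2` and `D2`.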

The N-degree relations of the vertices in the association graph are then collected: a directed search finds the N-degree relations of each graph vertex and collects them. Rules then determine whether the nodes in each vertex's relation set are ambiguous, and any ambiguous nodes are marked as ambiguous nodes.
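One possible reading of this step is a breadth-first search limited to N hops along directed edges, followed by a pluggable rule applied to each vertex's relation set. The Python sketch below follows that reading. The adjacency-list layout, the example rule (a vertex is ambiguous when it reaches more than one member ID), and the function names are assumptions, not details given in the text.

```python
from collections import deque
from typing import Callable, Dict, List, Set

Adjacency = Dict[str, List[str]]   # vertex -> vertices one directed edge away
Rule = Callable[[str, Set[str]], bool]


def n_degree_relations(adj: Adjacency, start: str, n: int) -> Dict[str, int]:
    """Breadth-first search returning each vertex within n hops and its hop count."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if depth[current] == n:
            continue                       # do not expand beyond n hops
        for nxt in adj.get(current, []):
            if nxt not in depth:
                depth[nxt] = depth[current] + 1
                queue.append(nxt)
    del depth[start]                       # the start vertex is not its own relation
    return depth


def mark_ambiguous(adj: Adjacency, n: int, rule: Rule) -> Set[str]:
    """Apply the rule to every vertex's relation set; return the flagged vertices."""
    return {v for v in adj if rule(v, set(n_degree_relations(adj, v, n)))}


def reaches_two_members(vertex: str, related: Set[str]) -> bool:
    """Example rule: ambiguous if more than one member ID is within reach."""
    return sum(1 for r in related if r.startswith("member:")) > 1


adj = {
    "cookie:C1": ["device:D9"],
    "device:D9": ["member:M1", "member:M2"],
    "phone:P1": ["member:M1"],
    "member:M1": [],
    "member:M2": [],
}
relations = n_degree_relations(adj, "cookie:C1", 2)
flagged_n1 = sorted(mark_ambiguous(adj, 1, reaches_two_members))
flagged_n2 = sorted(mark_ambiguous(adj, 2, reaches_two_members))
print(relations, flagged_n1, flagged_n2)
```

The choice of N matters in this example: with N = 1 only the device is flagged, while with N = 2 the cookie is flagged as well, because both member IDs are two hops away from it.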

It is then determined whether the association graph contains ambiguous nodes. If it does, ambiguity rule processing is performed, after which the association graph is cut into different subgraphs representing different individuals; finally, the output result is stored in the specified database in the specified format according to the storage configuration.

If the association graph contains no ambiguous nodes, the result is stored directly in the specified database in the specified format.
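The cutting and storage described in the two preceding paragraphs can be sketched in Python as follows. The text does not say how the cut is performed, so the sketch assumes that each edge already carries a confidence value and that edges below a threshold are removed. The threshold, the in-memory SQLite database standing in for the specified database, the JSON output format, and the table name `user_id_mapping` are all assumptions.

```python
import json
import sqlite3
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def cut(edges: Dict[Edge, float], threshold: float) -> List[Edge]:
    """Keep only the edges whose confidence reaches the threshold."""
    return [edge for edge, conf in edges.items() if conf >= threshold]


def subgraphs(vertices: Set[str], edges: List[Edge]) -> List[List[str]]:
    """Connected components of the remaining graph, found by depth-first search."""
    adj: Dict[str, Set[str]] = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen: Set[str] = set()
    result = []
    for v in sorted(vertices):
        if v in seen:
            continue
        stack, component = [v], []
        seen.add(v)
        while stack:
            cur = stack.pop()
            component.append(cur)
            for nxt in adj[cur] - seen:
                seen.add(nxt)
                stack.append(nxt)
        result.append(sorted(component))
    return result


def store(components: List[List[str]], conn: sqlite3.Connection) -> None:
    """Write one row per individual, with the associated IDs as a JSON array."""
    conn.execute("CREATE TABLE IF NOT EXISTS user_id_mapping (user_key TEXT, ids TEXT)")
    conn.executemany(
        "INSERT INTO user_id_mapping VALUES (?, ?)",
        [(f"user_{i}", json.dumps(ids)) for i, ids in enumerate(components)],
    )
    conn.commit()


edges = {
    ("member:M1", "phone:P1"): 0.9,
    ("device:D9", "member:M1"): 0.8,
    ("device:D9", "member:M2"): 0.2,   # weak link from the shared device
    ("member:M2", "phone:P2"): 0.9,
}
vertices = {v for edge in edges for v in edge}
components = subgraphs(vertices, cut(edges, threshold=0.5))

conn = sqlite3.connect(":memory:")
store(components, conn)
rows = conn.execute("SELECT user_key, ids FROM user_id_mapping ORDER BY user_key").fetchall()
print(rows)
```

Without the cut, all five vertices would form a single association graph; removing the low-confidence edge from the shared device separates them into two individuals.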

The above describes only preferred embodiments of the present invention and is not to be construed as limiting the invention; any modifications, equivalents, improvements, and the like made within the spirit and principle of the present invention are intended to be included within its scope of protection.

Claims

1. A user ID association method based on big data technology, characterized by comprising the following steps: step A, reading configuration information, wherein the configuration information comprises data source table configuration, data table and mapping field configuration, result storage configuration, and ambiguity rules; step B, pulling data from a data warehouse according to the configuration information, and constructing graph vertices and edges; step C, merging duplicate graph vertices and graph edges through Spark GraphX, and connecting the graph vertices according to the graph edges to generate a plurality of association graphs; step D, performing a directed search for the N-degree relations of each graph vertex and collecting them, judging by rules whether the nodes in each vertex's relation set are ambiguous, and marking any ambiguous nodes as ambiguous nodes; step E, judging by rules, from the N-degree relation nodes of each graph vertex collected in the previous step, whether low-level vertices have multiple affiliation relations, and marking those that do; step F, performing confidence calculation on the ambiguous nodes collected in the previous step according to the configured ambiguity rules, thereby completing the attribution judgment of the ambiguous vertices; and step G, storing the output result in a specified database in a specified format according to the storage configuration.

2. The method of claim 1, wherein step A further comprises, before the configuration information is read, a step of configuring the data source, the data table, the field mapping, the ambiguity rules, and the result data storage information.

3. The method of claim 2, wherein the ambiguity rules include, but are not limited to, rules based on reference tables, rules based on weight attributes, rules based on confidence algorithms, and rules based on time.

4. The method of claim 1, wherein the data pulled from the data warehouse is cleaned data, and the data cleaning specifically includes, but is not limited to, deduplication, outlier handling, special-symbol handling, and format handling.