CN110929105B - User ID (identity) association method based on big data technology - Google Patents

Publication number
CN110929105B
CN110929105B CN201911190714.2A CN201911190714A CN110929105B CN 110929105 B CN110929105 B CN 110929105B CN 201911190714 A CN201911190714 A CN 201911190714A CN 110929105 B CN110929105 B CN 110929105B
Authority
CN
China
Prior art keywords
data
graph
ambiguity
rules
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911190714.2A
Other languages
Chinese (zh)
Other versions
CN110929105A (en)
Inventor
李元佳
陈新宇
李柱新
李剑伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Yunxi Intelligent Technology Co ltd
Original Assignee
Guangdong Yunxi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Yunxi Intelligent Technology Co ltd
Priority to CN201911190714.2A
Publication of CN110929105A
Application granted
Publication of CN110929105B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists

Abstract

The invention discloses a user ID association method based on big data technology, comprising the following steps: step A, reading configuration information; step B, pulling data from the data warehouse according to the configuration information and constructing graph vertices and edges; step C, merging duplicate graph vertices and edges through Spark GraphX; step D, performing a directed search for the N-degree relations of each graph vertex and collecting them; step E, judging, from the N-degree relation nodes of each graph vertex collected in the previous step, whether lower-level vertices have multiple affiliations; step F, calculating confidence for the ambiguous nodes collected in the previous step according to the configured ambiguity rules, thereby deciding the attribution of each ambiguous vertex; and step G, storing the output in the specified database in the specified format according to the storage configuration. The user ID association method provided by the invention mines data more completely and accurately.

Description

User ID (identity) association method based on big data technology
Technical Field
The invention relates to the field of user ID identification, in particular to a user ID association method based on a big data technology.
Background
With the development of the internet, group services are reachable through more and more channels. Users can access enterprise services under different identities and transact with the enterprise, so the enterprise collects a large amount of user identity information. As a result, user identity information becomes overloaded: it is hard to identify a unique user efficiently and stably across multiple channels, information is easily overlooked or lost, processing is complex and inefficient, the real situation of a user cannot be fully grasped, and operating on users becomes harder. These problems are increasingly prominent. A method of associating user IDs is therefore needed to deal with the large number of user identities.
The prior art offers the following methods for associating user IDs:
1. Converting the business rules into corresponding SQL statements, then joining and mapping the primary keys of multiple data sets to complete user ID association.
2. Exporting the database data, converting it into a Spark DataFrame, performing graph computation with Spark GraphX, and completing user ID association according to the connectivity of the graph.
3. Determining the association features between different user IDs from historical user ID logs, building a user ID mapping list, calculating the confidence between a user ID and the corresponding user IDs of other types, and judging from the confidence whether user IDs from different data sources are associated.
Methods 1 and 3 both require a primary key in the data to associate user IDs, and in practice such a primary key may not exist. Method 2, implemented with Spark GraphX, may generate an excessively large connected graph, which introduces data noise. The present method is based on big data technology: it combines Spark GraphX graph computation with confidence rules to cut the association graph and thereby completes user ID association, which effectively solves the problems in the prior art.
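The noise problem of method 2 can be made concrete with a small example. The following is a minimal sketch in plain Python, used here as a stand-in for Spark GraphX; the function name and all sample identifiers are hypothetical and do not come from the patent. It shows how a single shared identifier causes plain connected-component association to merge two members into one user.

```python
from collections import defaultdict

def connected_components(edges):
    """Group node IDs into connected components with a simple union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical sample: two members whose records share one phone number.
edges = [
    ("member:1001", "phone:13800000000"),
    ("member:1002", "phone:13800000000"),
    ("member:1001", "wechat:abc"),
    ("member:1002", "email:x@example.com"),
]
components = connected_components(edges)
# Both members end up in a single component of five nodes.
```

Without a further cutting step, every identifier of both members would be attributed to the same person, which is the over-merging the confidence rules are meant to prevent.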
Disclosure of Invention
In view of the problems in the prior art, the invention provides a user ID association method based on big data technology, which solves the problems of existing user ID association methods: user information is easily overlooked or lost, processing is complex, and efficiency is low.
To achieve this purpose, the invention provides the following technical solution:
A user ID association method based on big data technology, comprising the following steps:
step A, reading configuration information, wherein the configuration information comprises the data source table configuration, the data table and mapping field configuration, the result storage configuration, and the ambiguity rules;
step B, pulling data from the data warehouse according to the configuration information and constructing graph vertices and edges;
step C, merging duplicate graph vertices and edges through Spark GraphX, and connecting the vertices along the edges to generate a plurality of association graphs;
step D, performing a directed search for the N-degree relations of each graph vertex and collecting them, judging by rules whether the nodes in each vertex's relation set are ambiguous, and marking those that are as ambiguous nodes;
step E, judging by rules, from the N-degree relation nodes of each graph vertex collected in the previous step, whether lower-level vertices have multiple affiliations, and marking those that do;
step F, calculating confidence for the ambiguous nodes collected in the previous step according to the configured ambiguity rules, thereby deciding the attribution of each ambiguous vertex;
step G, storing the output in the specified database in the specified format according to the storage configuration.
Further, step A additionally comprises, before the configuration information is read, a step of configuring the data source, the data tables, the field mappings, the ambiguity rules, and the result data storage information.
Further, the ambiguity rules include, but are not limited to, rules based on reference tables, rules based on weight attributes, rules based on confidence algorithms, and rules based on time.
Further, the data pulled from the data warehouse is cleaned data; the data cleaning specifically includes, but is not limited to, deduplication, outlier handling, special symbol handling, and format handling.
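The patent does not disclose the layout of the configuration read in step A. As an illustration only, the sketch below shows one way the four configured sections could be represented and checked in Python. Every key, table name, weight, and the `n_degree` setting is an assumption made for this example.

```python
# Hypothetical configuration layout; all keys and values are assumptions.
CONFIG = {
    # Data source tables with their field-to-ID-type mappings.
    "data_sources": [
        {"table": "dw.crm_member", "id_fields": {"member_id": "member", "mobile": "phone"}},
        {"table": "dw.shop_order", "id_fields": {"buyer_id": "member", "openid": "wechat"}},
    ],
    # One entry per kind of ambiguity rule named in the text.
    "ambiguity_rules": [
        {"type": "reference_table", "table": "dw.shared_ids"},
        {"type": "weight", "weights": {"phone": 0.6, "wechat": 0.3, "email": 0.1}},
        {"type": "time", "prefer": "most_recent"},
    ],
    # Search depth N for the N-degree relation collection (assumed setting).
    "n_degree": 2,
    # Where and how the result is stored.
    "result_storage": {"database": "result_db", "table": "user_id_mapping", "format": "parquet"},
}

REQUIRED_KEYS = ("data_sources", "ambiguity_rules", "result_storage")

def missing_config_keys(config):
    """Return the required top-level sections that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]
```

Validating the configuration before any data is pulled lets a misconfigured job fail early rather than after an expensive graph computation.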
The beneficial effect of the big-data-based user ID association method provided by the invention is that it can associate user IDs across multiple data sets even where generic SQL-based user ID association cannot be carried out because no data primary key exists. The invention also cuts the user association data by confidence, avoiding oversized association graphs and the entanglement of data from multiple users. Compared with schemes that associate user IDs using only historical user ID logs, the method mines data more completely and accurately.
Drawings
FIG. 1 is a flowchart of a big data-based user ID association method according to the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention, but are not intended to limit the scope of the invention.
The user ID association method based on big data comprises the following steps:
step A, reading configuration information, wherein the configuration information comprises the data source table configuration, the data table and mapping field configuration, the result storage configuration, and the ambiguity rules;
step B, pulling data from the data warehouse according to the configuration information and constructing graph vertices and edges;
step C, merging duplicate graph vertices and edges through Spark GraphX, and connecting the vertices along the edges to generate a plurality of association graphs;
step D, performing a directed search for the N-degree relations of each graph vertex and collecting them, judging by rules whether the nodes in each vertex's relation set are ambiguous, and marking those that are as ambiguous nodes;
step E, judging by rules, from the N-degree relation nodes of each graph vertex collected in the previous step, whether lower-level vertices have multiple affiliations, and marking those that do;
step F, calculating confidence for the ambiguous nodes collected in the previous step according to the configured ambiguity rules, thereby deciding the attribution of each ambiguous vertex;
step G, storing the output in the specified database in the specified format according to the storage configuration.
Step A additionally comprises, before the configuration information is read, configuring the data source, the data tables, the field mappings, the ambiguity rules, and the result data storage information.
The ambiguity rules include, but are not limited to, rules based on a reference table, rules based on weight attributes, rules based on a confidence algorithm, and rules based on time.
The data pulled from the data warehouse is cleaned data; the data cleaning specifically includes, but is not limited to, deduplication, outlier handling, special symbol handling, and format handling.
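The four cleaning operations named above can be illustrated on a single field. The sketch below is a hypothetical Python example, not the patent's cleaning procedure: it assumes 11-digit mobile numbers with an optional 86 country code, and the function names are invented for illustration.

```python
import re

def clean_phone(raw):
    """Normalize one phone value; return None when the value is abnormal."""
    digits = re.sub(r"\D", "", raw or "")           # special symbol handling
    if digits.startswith("86") and len(digits) == 13:
        digits = digits[2:]                          # format handling: drop country code
    return digits if len(digits) == 11 else None     # outlier handling

def clean_records(records):
    """Clean (member_id, phone) pairs and drop duplicates, keeping first-seen order."""
    seen, cleaned = set(), []
    for member_id, phone in records:
        phone = clean_phone(phone)
        if phone is None:
            continue
        row = (member_id.strip(), phone)
        if row not in seen:                          # deduplication
            seen.add(row)
            cleaned.append(row)
    return cleaned

raw = [
    (" 1001", "+86 138-0000-0000"),
    ("1001", "13800000000"),
    ("1002", "12345"),
]
cleaned = clean_records(raw)
```

Normalizing before deduplicating matters here: the first two rows only become identical once whitespace, symbols, and the country code are removed.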
Examples
Referring to fig. 1, the big data-based user ID association method of the present invention includes the following steps:
First, the user's valid information, already cleaned and filtered, is read. Data cleaning and filtering specifically includes deduplication, outlier handling, special symbol handling, format alignment, and the like.
Next, the configuration information is read, specifically the data source table configuration, the data table and mapping field configuration, the result storage configuration, the ambiguity rules, and the like.
Graph vertices and edges are then generated. Using the data obtained in the previous step, duplicate graph vertices and edges are merged through Spark GraphX, and the vertices are connected along the edges to generate a plurality of association graphs.
The association graph is then generated.
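Vertex and edge generation can be sketched without a cluster. The plain-Python example below stands in for the Spark GraphX step and makes several assumptions that are not in the patent: a vertex is an `(id_type, value)` pair, IDs appearing in the same row are linked pairwise, and each connected component is treated as one association graph.

```python
from collections import defaultdict, deque

def build_graph(rows):
    """Turn cleaned rows into deduplicated vertices and undirected edges."""
    vertices, edges = set(), set()
    for row in rows:
        nodes = sorted((id_type, value) for id_type, value in row.items() if value)
        vertices.update(nodes)                    # a set merges duplicate vertices
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                edges.add((nodes[i], nodes[j]))   # sorted pairs merge duplicate edges
    return vertices, edges

def association_graphs(vertices, edges):
    """Split the graph into connected components, one per association graph."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    unseen, graphs = set(vertices), []
    while unseen:
        start = min(unseen)
        unseen.discard(start)
        queue, component = deque([start]), {start}
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node] - component:
                component.add(neighbor)
                unseen.discard(neighbor)
                queue.append(neighbor)
        graphs.append(component)
    return graphs

rows = [
    {"member": "1001", "phone": "13800000000"},
    {"member": "1001", "phone": "13800000000"},   # duplicate row
    {"member": "1001", "wechat": "abc"},
    {"member": "2002", "email": "y@example.com"},
]
vertices, edges = build_graph(rows)
graphs = association_graphs(vertices, edges)
```

In this sample the duplicate row adds nothing, leaving five vertices, three edges, and two association graphs. On real data volumes the same grouping would be done by a distributed graph engine rather than an in-memory loop.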
The N-degree relations of the vertices in the association graph are then collected: a directed search finds the N-degree relations of each graph vertex and collects them. Rules determine whether the nodes in each vertex's relation set are ambiguous, and ambiguous ones are marked as ambiguous nodes.
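The N-degree collection can be read as a breadth-first search that stops after N hops. The sketch below is one possible interpretation in plain Python. It treats edges as undirected for simplicity, and its ambiguity rule (an identifier is ambiguous when more than one member vertex lies within N hops) is a hypothetical example, since the patent leaves the rules configurable.

```python
from collections import defaultdict, deque

def n_degree_relations(adjacency, start, n):
    """Return {vertex: hop_distance} for every vertex within n hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == n:
            continue                              # do not expand beyond n hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]
    return distance

def ambiguous_nodes(adjacency, n, owner_type="member"):
    """Flag identifiers that reach more than one owner-type vertex within n hops."""
    flagged = {}
    for vertex in list(adjacency):
        if vertex[0] == owner_type:
            continue
        reachable = n_degree_relations(adjacency, vertex, n)
        owners = sorted(v for v in reachable if v[0] == owner_type)
        if len(owners) > 1:
            flagged[vertex] = owners
    return flagged

edges = [
    (("member", "1001"), ("phone", "13800000000")),
    (("member", "1002"), ("phone", "13800000000")),
    (("member", "1001"), ("wechat", "abc")),
]
adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)
flagged = ambiguous_nodes(adjacency, n=1)
```

With `n=1` only the shared phone number is flagged. A larger N widens the search, so the WeChat ID would also be flagged at `n=3` because it reaches the second member through the phone number.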
It is then judged whether the association graph contains ambiguous nodes. If so, the ambiguity rules are applied and the association graph is cut accordingly into different subgraphs representing different individuals; finally, the output is stored in the specified database in the specified format according to the storage configuration.
If the association graph contains no ambiguous nodes, the result is stored directly in the specified database in the specified format.
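The cutting step can be illustrated as keeping, for each ambiguous node, only the edge to its most confident owner. The Python sketch below is a hypothetical example: the patent does not give a confidence formula, so a simple time-based score (more recent observations score higher) is assumed, and all names and sample values are invented.

```python
def resolve_ambiguity(edges, flagged, confidence):
    """Keep only the highest-confidence owner edge of each ambiguous node.

    edges      : set of (owner, identifier) pairs
    flagged    : {identifier: [candidate owners]}
    confidence : {(owner, identifier): score}, output of an assumed rule
    """
    kept = set(edges)
    attribution = {}
    for identifier, owners in flagged.items():
        best = max(owners, key=lambda owner: confidence.get((owner, identifier), 0.0))
        attribution[identifier] = best
        for owner in owners:
            if owner != best:
                kept.discard((owner, identifier))   # cut the losing edge
    return kept, attribution

edges = {
    ("member:1001", "phone:138"),
    ("member:1002", "phone:138"),
    ("member:1001", "wechat:abc"),
}
flagged = {"phone:138": ["member:1001", "member:1002"]}

# Assumed time-based rule: the fewer days since the pair was last seen, the higher the score.
last_seen_days_ago = {("member:1001", "phone:138"): 400, ("member:1002", "phone:138"): 3}
confidence = {edge: 1.0 / (1 + days) for edge, days in last_seen_days_ago.items()}

kept, attribution = resolve_ambiguity(edges, flagged, confidence)
```

Here the phone number is attributed to member 1002, and the edge to member 1001 is removed. Recomputing connected components on the kept edges would then yield one subgraph per individual.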
The above describes only preferred embodiments of the present invention and is not to be construed as limiting it; any modification, equivalent substitution, improvement, and the like made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (4)

1. A user ID association method based on big data technology, characterized by comprising the following steps: step A, reading configuration information, wherein the configuration information comprises the data source table configuration, the data table and mapping field configuration, the result storage configuration, and the ambiguity rules; step B, pulling data from the data warehouse according to the configuration information and constructing graph vertices and edges; step C, merging duplicate graph vertices and edges through Spark GraphX, and connecting the vertices along the edges to generate a plurality of association graphs; step D, performing a directed search for the N-degree relations of each graph vertex and collecting them, judging by rules whether the nodes in each vertex's relation set are ambiguous, and marking those that are as ambiguous nodes; step E, judging by rules, from the N-degree relation nodes of each graph vertex collected in the previous step, whether lower-level vertices have multiple affiliations, and marking those that do; step F, calculating confidence for the ambiguous nodes collected in the previous step according to the configured ambiguity rules, thereby deciding the attribution of each ambiguous vertex; and step G, storing the output in the specified database in the specified format according to the storage configuration.
2. The method of claim 1, wherein step A further comprises, before the configuration information is read, a step of configuring the data source, the data tables, the field mappings, the ambiguity rules, and the result data storage information.
3. The method of claim 2, wherein the ambiguity rules include, but are not limited to, rules based on reference tables, rules based on weight attributes, rules based on confidence algorithms, and rules based on time.
4. The method of claim 1, wherein the data pulled from the data warehouse is cleaned data, and the data cleaning specifically includes, but is not limited to, deduplication, outlier handling, special symbol handling, and format handling.
CN201911190714.2A 2019-11-28 2019-11-28 User ID (identity) association method based on big data technology Active CN110929105B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911190714.2A CN110929105B (en) 2019-11-28 2019-11-28 User ID (identity) association method based on big data technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911190714.2A CN110929105B (en) 2019-11-28 2019-11-28 User ID (identity) association method based on big data technology

Publications (2)

Publication Number Publication Date
CN110929105A CN110929105A (en) 2020-03-27
CN110929105B CN110929105B (en) 2022-11-29

Family

ID=69847468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911190714.2A Active CN110929105B (en) 2019-11-28 2019-11-28 User ID (identity) association method based on big data technology

Country Status (1)

Country Link
CN (1) CN110929105B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949839B (en) * 2020-08-24 2021-08-24 上海嗨普智能信息科技股份有限公司 Data association method, electronic device and medium
CN112508596A (en) * 2020-10-21 2021-03-16 广州云徙科技有限公司 Entity mapping-based consumer life cycle division method and system
CN112463065A (en) * 2020-12-10 2021-03-09 恩亿科(北京)数据科技有限公司 Account number getting-through calculation method and system
CN117271850B (en) * 2023-11-17 2024-01-30 上海光潾网络科技有限公司 User data matching method, platform, equipment and medium based on client data platform

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104025632A (en) * 2011-10-18 2014-09-03 阿尔卡特朗讯公司 Lte subscriber identity correlation service
CN105138864A (en) * 2015-09-24 2015-12-09 大连理工大学 Protein interaction relationship data base construction method based on biomedical science literature
CN107153687A (en) * 2017-04-18 2017-09-12 东北大学 A kind of indexing means of social networks text data
CN107515915A (en) * 2017-08-18 2017-12-26 晶赞广告(上海)有限公司 User based on user behavior data identifies correlating method
CN108491424A (en) * 2018-02-07 2018-09-04 链家网(北京)科技有限公司 User ID correlating method and device
CN108810155A (en) * 2018-06-19 2018-11-13 中国科学院光电研究院 A kind of car networking vehicle position information reliability evaluation method and system
CN108959461A (en) * 2018-06-15 2018-12-07 东南大学 A kind of entity link method based on graph model
CN109739939A (en) * 2018-12-29 2019-05-10 颖投信息科技(上海)有限公司 The data fusion method and device of knowledge mapping
CN110096577A (en) * 2018-01-31 2019-08-06 国际商业机器公司 From the intention of abnormal profile data prediction user
CN110177094A (en) * 2019-05-22 2019-08-27 武汉斗鱼网络科技有限公司 A kind of user community recognition methods, device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2880709B1 (en) * 2005-01-11 2014-04-25 Vision Objects METHOD OF SEARCHING, RECOGNIZING AND LOCATING INK, DEVICE, PROGRAM AND LANGUAGE CORRESPONDING
US8918418B2 (en) * 2010-04-19 2014-12-23 Facebook, Inc. Default structured search queries on online social networks
US20180032930A1 (en) * 2015-10-07 2018-02-01 0934781 B.C. Ltd System and method to Generate Queries for a Business Database

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
基于Rough Set的最简决策树确定算法的研究;朱红;《计算机工程与应用》;20030501(第13期);第132-134页 *
面向大数据实体识别的超图分割算法;胡志刚等;《小型微型计算机系统》;20180715(第07期);第168-173页 *
面向知识与信息管理的领域本体自动构建算法;侯鑫等;《计算机集成制造系统》;20110115(第01期);第161-172页 *

Also Published As

Publication number Publication date
CN110929105A (en) 2020-03-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. A06, E-PARK Creative Park, Yuzhu Zhigu, No. 32, Kengtian Street, Huangpu District, Guangzhou City, Guangdong Province, 510000

Applicant after: Guangdong Yunxi Intelligent Technology Co.,Ltd.

Address before: Room 1302-1303, building a, 459 Qianmo Road, Binjiang District, Hangzhou, Zhejiang 310000

Applicant before: Hangzhou Yunqian Technology Co.,Ltd.

GR01 Patent grant