CN112667869A - Data processing method, device, system and storage medium - Google Patents

Publication number
CN112667869A
Authority
CN
China
Prior art keywords
attribute values
key
key attribute
cluster
attribute value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910977784.6A
Other languages
Chinese (zh)
Other versions
CN112667869B (en)
Inventor
吴铁民
王赛
陈晓勇
向师富
柯根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201910977784.6A priority Critical patent/CN112667869B/en
Publication of CN112667869A publication Critical patent/CN112667869A/en
Application granted granted Critical
Publication of CN112667869B publication Critical patent/CN112667869B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide a data processing method, device, system, and storage medium. In the embodiments, attribute values that belong to the same data object are identified among a plurality of key attribute values according to the hierarchical relationship among a plurality of key attributes and the association relationship among the plurality of key attribute values, which completes the vertical clustering of attribute values that belong to the same data object under different attributes. Because this clustering approach takes multiple key attributes into account, it helps reduce the probability of erroneous clustering and improve the accuracy of identifying data that belongs to the same data object.

Description

Data processing method, device, system and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data processing method, device, system, and storage medium.
Background
With the development of the information age, various information carrying media are increasingly diversified, and how to realize effective management of data is more and more important. In order to realize effective management of enterprise data, an enterprise can construct a universal unified account.
In order to construct a unified account, data belonging to the same natural person needs to be identified from mass data, but the existing data identification mode has low accuracy.
Disclosure of Invention
Aspects of the present disclosure provide a data processing method, device, system, and storage medium for improving accuracy of data identification.
An embodiment of the present application provides a data processing method, including:
acquiring a plurality of data records, wherein the plurality of data records comprise a plurality of key attribute values under a plurality of key attributes, and each key attribute value uniquely belongs to one data object at any given time;
identifying attribute values belonging to the same data object in the plurality of key attribute values according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values;
and outputting the attribute values belonging to the same data object in the plurality of key attribute values.
An embodiment of the present application further provides a data processing method, including:
acquiring a plurality of data records, wherein the plurality of data records comprise a plurality of first-class attribute values;
if the plurality of data records contain a plurality of second-class attribute values, clustering the plurality of first-class attribute values according to the association relationship between the plurality of first-class attribute values and the plurality of second-class attribute values to obtain a plurality of information clusters;
for each information cluster, calculating the probability that different first-class attribute values correspond to each candidate result, wherein the candidate results include: belonging to the same data object and not belonging to the same data object;
and determining whether the different first-class attribute values belong to the same data object according to the probability that the different first-class attribute values correspond to each candidate result.
An embodiment of the present application further provides a server device, including: a memory and a processor; wherein the memory is configured to store a computer program;
the processor is coupled to the memory and is configured to execute the computer program for:
acquiring a plurality of data records, wherein the plurality of data records comprise a plurality of key attribute values under a plurality of key attributes, and each key attribute value uniquely belongs to one data object at any given time;
identifying attribute values belonging to the same data object in the plurality of key attribute values according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values;
and outputting the attribute values belonging to the same data object in the plurality of key attribute values.
An embodiment of the present application further provides a server device, including: a memory and a processor; wherein the memory is configured to store a computer program;
the processor is coupled to the memory and is configured to execute the computer program for:
acquiring a plurality of data records, wherein the plurality of data records comprise a plurality of first-class attribute values;
if the plurality of data records contain a plurality of second-class attribute values, dividing the plurality of first-class attribute values into a plurality of information clusters according to the association relationship between the plurality of first-class attribute values and the plurality of second-class attribute values;
for each information cluster, calculating the probability that different first-class attribute values correspond to each candidate result, wherein the candidate results include: belonging to the same data object and not belonging to the same data object;
and determining whether the different first-class attribute values belong to the same data object according to the probability that the different first-class attribute values correspond to each candidate result.
An embodiment of the present application further provides a data processing system, including: client equipment and server equipment;
the client device is configured to send a plurality of data records to the server device, wherein the data records comprise a plurality of key attribute values under a plurality of key attributes, and each key attribute value uniquely belongs to one data object at any given time; and to output, in a visual manner, the attribute values belonging to the same data object in the plurality of key attribute values;
the server device is configured to identify the attribute values belonging to the same data object in the plurality of key attribute values according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values; and to send the attribute values belonging to the same data object in the plurality of key attribute values to the client device.
Embodiments of the present application also provide a computer-readable storage medium storing computer instructions, which when executed by one or more processors, cause the one or more processors to perform the steps of the data processing method.
In the embodiments of the present application, attribute values that belong to the same data object are identified among a plurality of key attribute values according to the hierarchical relationship among a plurality of key attributes and the association relationship among the plurality of key attribute values, which completes the vertical clustering of attribute values that belong to the same data object under different attributes. Because this clustering approach takes multiple key attributes into account, it helps reduce the probability of erroneous clustering and improve the accuracy of identifying data that belongs to the same data object.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1a is a block diagram of a data processing system according to an embodiment of the present application;
FIG. 1b is a schematic structural diagram of a connected subgraph provided in the embodiment of the present application;
FIG. 1c is a connected subgraph formed by the first-step clustering of the connected subgraph provided in FIG. 1b;
FIG. 1d is a connected subgraph formed by hierarchical clustering of the connected subgraph provided in FIG. 1b;
FIG. 1e is a schematic structural diagram of another connected subgraph provided in the embodiment of the present application;
FIG. 1f is a connected subgraph formed by hierarchical clustering of the connected subgraph provided in FIG. 1e;
FIG. 1g is a schematic structural diagram of another connected subgraph provided in the embodiment of the present application;
FIG. 1h is a connected subgraph formed by hierarchical clustering of the connected subgraph provided in FIG. 1g;
FIG. 1i is a connected subgraph formed by hierarchical clustering of the connected subgraph provided in FIG. 1h;
FIG. 1j is a connected subgraph formed by hierarchically clustering the connected subgraph provided in FIG. 1i again;
FIG. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 3a is a block diagram of another data processing system according to an embodiment of the present application;
FIG. 3b is a schematic structural diagram of an information cluster according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of another data processing method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a server device according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of another server device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In some embodiments of the present application, an attribute value belonging to the same data object among a plurality of key attribute values is identified according to a hierarchical relationship among a plurality of key attributes and an association relationship among a plurality of key attribute values, and thus, vertical clustering of attribute values belonging to the same data object under different attributes is completed. Due to the data clustering mode, various key attributes are considered, the probability of wrong clustering is favorably reduced, and the accuracy of identifying the data belonging to the same data object is favorably improved.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
FIG. 1a is a schematic structural diagram of a data processing system according to an embodiment of the present application. As shown in FIG. 1a, the system includes a client device 10a and a server device 10b. The implementation forms of the client device 10a and the server device 10b shown in FIG. 1a are only exemplary and are not limiting.
In this embodiment, the client device 10a refers to a computer device located on the client side that has computing, communication, and other functions. The client device 10a may be a computer or a server located on the client side.
In this embodiment, the server device 10b is a computer device capable of performing data processing, and generally has the capability of hosting and guaranteeing services. The server device 10b may be a single server device, a cloud server array, or a virtual machine (VM) running in the cloud server array. In addition, the server device may also be another computing device with the corresponding service capability, such as a terminal device (for example, a computer) running a service program.
Alternatively, the server device 10b may be a service platform of an enterprise that provides the relevant data processing services directly to the client device 10a. In this scenario, the server device 10b may provide the client device 10a with a cloud computing service that lies between a PaaS service and a SaaS service.
Alternatively, the server device 10b may be a server of a cloud service provider that is leased by a third party, where the third party provides the relevant data processing service to the client device 10a. In this scenario, the cloud service provider provides the third party with an IaaS service, a PaaS service, or a cloud computing service that lies between an IaaS service and a PaaS service. The server leased by the third party (server device 10b) provides the client device 10a with a cloud computing service that lies between a PaaS service and a SaaS service.
In this embodiment, the server device 10b and the client device 10a may be connected wirelessly or by wire. Optionally, the server device 10b may be communicatively connected to the client device 10a through a mobile network; accordingly, the network standard of the mobile network may be any one of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), 5G, WiMax, and the like. Optionally, the server device 10b may also be communicatively connected to the client device 10a through Bluetooth, WiFi, infrared, or the like.
In this embodiment, the client device 10a may maintain data records and may provide a plurality of data records to the server device 10b. The plurality of data records comprise a plurality of key attributes and a plurality of key attribute values under those key attributes. In the embodiments of the present application, "a plurality of" means two or more. Optionally, if the server device 10b is a service platform of an enterprise that provides the relevant data processing services directly to the client device 10a, the client device 10a may send the plurality of data records directly to the server device 10b. If the server device 10b is a server of a cloud service provider leased by a third party that provides the relevant data processing service to the client device 10a, the client device 10a may send the plurality of data records to the third party, and the third party sends the plurality of data records to the server device 10b.
In this embodiment, a key attribute is an attribute whose values each uniquely identify one data object at any given time; that is, each key attribute value uniquely belongs to one data object at any given time. In some embodiments, a key attribute may also be referred to as an attribute identifier (ID), and a key attribute value as an ID value. The data objects, and therefore the key attributes, differ between application scenarios. For example, in some application scenarios the data object may be a natural person, whose key attributes may be, but are not limited to, an identification number, account number, mobile phone number, email address, passport number, membership number, and so on. In other application scenarios the data object is a company, whose key attributes may be, but are not limited to, a company name, taxpayer identification number, business registration number, and so on.
In the embodiment of the present application, one natural object may be used as one data object, or a plurality of natural objects may be used as one data object. For example, in some application scenarios, all members of a family may be considered as a data object; in other application scenarios, a user in a region, a business district, or a city may also be used as a data object; alternatively, a natural object of the same type may be used as one data object. For example, in a shopping scenario, the type of the user is determined according to the age, sex, or region of the user, and the same type of user is used as a data object, but the invention is not limited thereto.
Accordingly, the server device 10b receives the plurality of data records. In the embodiments of the present application, the server device 10b stores a hierarchical relationship among key attributes. The server device 10b may therefore identify the attribute values that belong to the same data object among the plurality of key attribute values according to the hierarchical relationship among the plurality of key attributes included in the plurality of data records and the association relationship among the plurality of key attribute values. In other words, the key attribute values are hierarchically clustered according to these two relationships, so that the key attribute values belonging to the same data object are identified. The server device 10b then sends the attribute values belonging to the same data object among the plurality of key attribute values to the client device 10a.
Accordingly, the client device 10a receives the attribute values belonging to the same data object from among the plurality of key attribute values, and outputs the attribute values belonging to the same data object from among the plurality of key attribute values in a visual manner.
In this embodiment, the server device 10b may send the attribute values belonging to the same data object among the plurality of key attribute values to the client device 10a in various forms. For example, as shown in FIG. 1a, the server device 10b may aggregate the attribute values belonging to the same data object into connected subgraphs and send the resulting connected subgraphs to the client device 10a. Accordingly, the client device 10a presents the received connected subgraphs on a display screen. Each connected subgraph corresponds to one data object. In a connected subgraph, nodes represent key attribute values, and an edge between two nodes represents the association relationship between the two key attribute values. The number of connected subgraphs is determined by the number of data objects corresponding to the plurality of data records; FIG. 1a shows only 3 data objects (data objects 1-3) as an example. As another example, the server device 10b may aggregate the attribute values belonging to the same data object into tables and send the resulting tables to the client device 10a. Accordingly, the client device 10a presents the received tables on the display screen. Each table corresponds to one data object, and key attribute values in the same row or column are attribute values that have an association relationship.
The data processing system provided in this embodiment can identify an attribute value belonging to the same data object from among a plurality of key attribute values according to a hierarchical relationship among a plurality of key attributes and an association relationship among the plurality of key attribute values, and complete vertical clustering of attribute values belonging to the same data object under different attributes. Due to the data clustering mode, various key attributes are considered, the probability of wrong clustering is favorably reduced, and the accuracy of identifying the data belonging to the same data object is favorably improved.
In the embodiments of the present application, the plurality of data records may be data records generated by a plurality of terminal devices that provide services for the client device 10a. For example, the client device 10a is a server device of an online shopping platform, and an application program related to the online shopping platform is installed on each terminal device. A user can access the online shopping platform through the application program, and the client device 10a can record the data generated when the user accesses the platform and save the corresponding data records. The data records may be stored in various forms, such as character strings or tables.
In the embodiments of the present application, the plurality of data records sent by the client device 10a to the server device 10b may be data generated for a plurality of data objects; that is, the plurality of data records are generated across screens. The data processing method provided by the embodiments of the present application can therefore identify data generated by a data object across screens. In addition, these data records may relate to multiple fields. For example, in some application scenarios, the data records may relate to online shopping, video, games, sporting events, and so on. The data processing method provided by the embodiments of the present application can therefore also identify data generated by a data object across fields.
In some application scenarios, the data records also include behavior data. For example, in a shopping scenario, the data records include data on the goods purchased by the user; in a video scenario, the data records include data on the videos watched by the user; in an online game scenario, the data records include the user's game data. Behavior data reflects the behavior habits of the user, so goods, videos, games, and the like that match those habits can be recommended to the user. Based on this, the server device 10b may also look up, in the plurality of data records, the behavior data of the data object corresponding to each connected subgraph, derive the behavior characteristics of that data object from its behavior data, and add the behavior characteristics to the connected subgraph as identification information of the data object. The server device 10b then sends the connected subgraphs carrying the identification information of the data objects to the client device 10a. In this way, the client device 10a can recommend content corresponding to the identification information to the user based on the identification information of the data object on each connected subgraph.
In the embodiments of the present application, the server device 10b may establish a hierarchical relationship among several key attributes in advance, where the number of these key attributes is greater than or equal to the number of key attributes in the plurality of data records. Assume that the number of data records is M and the number of key attributes in the plurality of data records is N, where M and N are integers greater than or equal to 2 and M is greater than or equal to N. The server device 10b may then extract the hierarchical relationship among the plurality of key attributes included in the plurality of data records from the previously established hierarchical relationship among the several key attributes.
Optionally, the server device 10b may obtain historical data records in a specified historical time period, where the historical data records include historical key attribute values under the several key attributes; the server device 10b may then establish the hierarchical relationship among the several key attributes according to the number of historical key attribute values under each of them. The specified historical time period may be, but is not limited to, the past 5 years, the past 2 years, the past 6 months, and so forth. Optionally, the server device 10b may divide the number of historical key attribute values under each key attribute by the length of the historical time period to obtain the change in historical key attribute values under that key attribute per unit time, that is, the rate of change of the historical key attribute values over time, and may establish the hierarchical relationship among the several key attributes according to this rate of change. A key attribute with fewer historical key attribute values is ranked higher. Alternatively, a technician may preliminarily rank the levels of the several key attributes based on experience, with the key attribute ranked first being, in theory, the most stable.
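The change-rate ranking described above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `rank_key_attributes`, the `(attribute, value)` input format, and the choice to count distinct values are all assumptions made for the example.

```python
from collections import defaultdict


def rank_key_attributes(history, period_length):
    """Rank key attributes from most stable to least stable.

    history: iterable of (attribute, value) pairs observed during the
        historical time period (hypothetical input format).
    period_length: length of the period in any unit, e.g. months.
    Returns (ordered attribute names, change rate per attribute).
    """
    if period_length <= 0:
        raise ValueError("period_length must be positive")
    # Assumption: "number of historical key attribute values" is taken
    # to mean the number of distinct values seen under each attribute.
    distinct = defaultdict(set)
    for attribute, value in history:
        distinct[attribute].add(value)
    # Change rate = number of values divided by the period length.
    rates = {a: len(vals) / period_length for a, vals in distinct.items()}
    # Fewer values per unit time means a higher level; ties are broken
    # by attribute name to keep the result deterministic.
    order = sorted(rates, key=lambda a: (rates[a], a))
    return order, rates
```

For instance, an attribute that took one value over 12 months would be ranked above one that took three values over the same period.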
Based on this, the server device 10b may further divide the number of historical key attribute values under each of the several key attributes by the product of the length of the historical time period and the number of historical key attribute values under the key attribute ranked first in the preliminary ranking. This gives, for each key attribute, the number of historical key attribute values that correspond to each historical key attribute value under the first-ranked key attribute per unit time. The server device 10b may then establish the hierarchical relationship among the several key attributes according to these numbers. Optionally, the fewer historical key attribute values a key attribute has for each historical key attribute value under the first-ranked key attribute, the higher it is ranked.
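A minimal Python sketch of this normalized ranking follows. It assumes that the divisor is the product of the period length and the number of values under the first-ranked key attribute, which is one reading of the passage above; the function names and the `counts` dictionary are hypothetical.

```python
def per_top_value_rates(counts, top_attribute, period_length):
    """Values per key attribute, per value of the top attribute, per unit time.

    counts: dict mapping attribute name -> number of historical key
        attribute values in the period (hypothetical input format).
    top_attribute: the attribute ranked first in the preliminary ranking.
    period_length: length of the historical period in any unit.
    """
    top_count = counts[top_attribute]
    if top_count <= 0 or period_length <= 0:
        raise ValueError("top attribute count and period must be positive")
    # Assumption: divisor = period length * number of top-attribute values.
    divisor = top_count * period_length
    return {attribute: n / divisor for attribute, n in counts.items()}


def rank_by_normalized_rate(counts, top_attribute, period_length):
    """Order attributes so that the lowest normalized rate comes first."""
    rates = per_top_value_rates(counts, top_attribute, period_length)
    return sorted(rates, key=lambda a: (rates[a], a))
```

With this normalization the first-ranked attribute always has a rate of `1 / period_length`, so the other attributes are effectively compared against it.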
On the other hand, the server device 10b may further analyze an association relationship between the plurality of key attribute values based on membership between the plurality of key attribute values and the plurality of data records. In the embodiment of the application, the key attribute values belonging to the same data record are defined as the key attribute values with direct association; the key attribute value which does not belong to the same data record but still has an association relationship is defined as the key attribute value having an indirect association relationship. Further, in the embodiment of the present application, whether the key attribute value has a direct association relationship or the key attribute value has an indirect association relationship, the two are considered to have an association relationship. Based on this, the server device 10b may determine whether an association relationship exists between two key attribute values according to whether any two key attribute values appear in the same data record or appear in different data records having an indirect association relationship with each other.
The following description takes a first key attribute value and a second key attribute value as an example, where the first key attribute value and the second key attribute value are any two of the plurality of key attribute values.
Determination mode 1: judging whether the first key attribute value and the second key attribute value appear in the same data record.
Determination mode 2: judging whether one of the first key attribute value and the second key attribute value appears in the same data record as a key attribute value that has an association relationship with the other.
Determination mode 3: judging whether a key attribute value that has an association relationship with the first key attribute value and a key attribute value that has an association relationship with the second key attribute value appear in the same data record.
Accordingly, if the judgment result of determination modes 1-3 is yes, it is determined that the first key attribute value and the second key attribute value have an association relationship. Here, the judgment result of determination modes 1-3 being yes means that the judgment result of any one or more of determination modes 1-3 is yes.
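Taken together, the three determination modes can be read as reachability over co-occurrence: two key attribute values are associated if a chain of shared data records links them. The Python sketch below follows that reading, which is an interpretation rather than something the text states outright; the record format (each record as a collection of key attribute values) and the function names are assumptions.

```python
from collections import defaultdict, deque


def build_cooccurrence(records):
    """Map each key attribute value to the values it shares a record with."""
    neighbours = defaultdict(set)
    for record in records:
        values = set(record)
        for value in values:
            neighbours[value] |= values - {value}
    return neighbours


def are_associated(records, first, second):
    """Return True if the two values are directly or indirectly associated.

    Direct association (mode 1) is co-occurrence in one record; indirect
    association (modes 2 and 3) is found by following co-occurrence links
    breadth-first from `first` until `second` is reached.
    """
    neighbours = build_cooccurrence(records)
    if first not in neighbours or second not in neighbours:
        return False
    seen = {first}
    queue = deque([first])
    while queue:
        current = queue.popleft()
        if current == second:
            return True
        for nxt in neighbours[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

Rebuilding the co-occurrence map on every call keeps the sketch short; a real system would build it once and reuse it.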
In this embodiment, the server device 10b may identify an attribute value belonging to the same data object from among the plurality of key attribute values according to a hierarchical relationship among the plurality of key attributes and an association relationship among the plurality of key attribute values. Optionally, the server device 10b may perform initial clustering on the plurality of key attribute values according to the association relationship among the plurality of key attribute values to obtain a plurality of clusters. Wherein each cluster contains key attribute values having an associative relationship. Further, for each cluster, the server device 10b may perform secondary clustering on the key attribute value in each cluster according to the hierarchical relationship among the plurality of key attributes to obtain at least one sub-cluster; and the key attribute values in the same sub-cluster can be regarded as the attribute values belonging to the same data object.
Optionally, in some embodiments, there may be giant clusters in the plurality of clusters, which may be pruned in order to reduce the amount of computation. In practical applications, a giant cluster generally represents that the number of key attribute values included in the cluster is greater than or equal to a preset first number threshold, or one or more problem attribute values exist in the cluster, where a problem attribute value refers to a key attribute value in the cluster whose number of other key attribute values associated therewith is greater than or equal to a preset second number threshold. For example, in an online shopping scenario, a cheating account generated by a swipe may occur; in a hotel or other travel enterprise, issue attribute values may occur in the form of a tour guide card, tool card, or the like. Based on this, in this embodiment of the present application, before performing secondary clustering on the key attribute values in each cluster, the server device 10b may further perform at least one of the following determination operations:
Judgment operation 1: judge whether a giant cluster whose number of key attribute values is greater than or equal to the preset first number threshold exists among the plurality of clusters.
Judgment operation 2: judge whether a giant cluster containing a problem attribute value exists among the plurality of clusters.
Further, if the result of at least one judgment operation is yes, the giant clusters among the plurality of clusters are pruned. That is, if the result of judgment operation 1 and/or judgment operation 2 is yes, it is determined that a giant cluster exists among the plurality of clusters, and the giant cluster is pruned.
Optionally, if the result of judgment operation 1 is yes, then for the giant cluster concerned, the attribute value relationship pairs may be sorted by association weight, the top K pairs retained according to a preset number K of attribute value relationship pairs, and the remaining attribute value relationship pairs pruned, where K is a positive integer. Optionally, K is less than or equal to the preset first number threshold.
Further, if the result of judgment operation 2 is yes, then for the giant cluster concerned, the association relationships between the problem attribute value and the other key attribute values associated with it may all be cut off; the pruning manner is not limited thereto.
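The two judgment operations and the corresponding pruning can be sketched as follows. This is a minimal illustration only: the function names, the representation of a cluster as a list of (value, value, weight) triples, and the order in which the two pruning steps are combined are assumptions made for the example and are not prescribed by the embodiment.

```python
from collections import defaultdict

def find_problem_values(pairs, second_threshold):
    """Return key attribute values associated with at least
    `second_threshold` other key attribute values."""
    neighbours = defaultdict(set)
    for a, b, _weight in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {v for v, ns in neighbours.items() if len(ns) >= second_threshold}

def prune_giant_cluster(pairs, first_threshold, second_threshold, k):
    """Prune one cluster given as (value_a, value_b, weight) triples.

    Assumed order: problem attribute values are cut off first, then the
    top-K rule is applied if the cluster is large enough.
    """
    values = {v for a, b, _ in pairs for v in (a, b)}
    problems = find_problem_values(pairs, second_threshold)
    kept = list(pairs)
    if problems:
        # Judgment operation 2: cut every association touching a problem value.
        kept = [p for p in kept if p[0] not in problems and p[1] not in problems]
    if len(values) >= first_threshold:
        # Judgment operation 1: keep only the K pairs with the largest weights.
        kept = sorted(kept, key=lambda p: p[2], reverse=True)[:k]
    return kept
```

A cluster that triggers neither judgment operation is returned unchanged by this sketch.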
Further, when initially clustering the plurality of key attribute values, the server device 10b may do so by means of graph computation. Optionally, a graph computation framework may be used to initially cluster the plurality of key attribute values. The graph computation framework may be ODPS Graph, Spark GraphX, or the like, but is not limited thereto. When initially clustering the plurality of key attribute values by graph computation, the server device 10b may use a breadth-first traversal algorithm, a depth-first traversal algorithm, or an adjacency matrix traversal algorithm. Initial clustering with the adjacency matrix traversal algorithm is described below as an example. A specific implementation is as follows: pair every two key attribute values that have an association relationship to obtain a plurality of attribute value relationship pairs; build an adjacency list for each key attribute value in the attribute value relationship pairs according to the association relationships among those key attribute values; and traverse the adjacency list of each key attribute value to obtain a plurality of connected subgraphs, each connected subgraph being one cluster. In each connected subgraph, a node represents a key attribute value, and the connecting line between two nodes represents the association relationship between the two key attribute values.
Optionally, an attribute value relationship pair may be represented as a triple. The triple consists of two key attribute values that have an association relationship and the association weight between them. Therefore, in each connected subgraph, the value on the connecting line between two nodes represents the association weight between the two associated key attribute values.
Optionally, the association weight between any two associated key attribute values may be calculated according to the co-occurrence frequency of the two key attribute values in the same data record and the respective occurrence frequencies of the two key attribute values in the plurality of data records. Optionally, assuming that there is an association relationship between a first key attribute value and a second key attribute value, the association weight between the two can be expressed as:
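The initial clustering described above (build adjacency lists from the relationship triples, then traverse them to collect connected subgraphs) can be sketched as follows. The sketch uses a breadth-first traversal in plain Python rather than a graph computation framework; the function name and the string labels used for key attribute values are illustrative assumptions.

```python
from collections import defaultdict, deque

def initial_clusters(triples):
    """Group key attribute values into connected subgraphs.

    `triples` is an iterable of (value_a, value_b, association_weight).
    Returns a list of sets, one set of key attribute values per cluster.
    """
    # Adjacency list: value -> {neighbour: association weight}
    adjacency = defaultdict(dict)
    for a, b, weight in triples:
        adjacency[a][b] = weight
        adjacency[b][a] = weight

    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:  # breadth-first traversal of one connected subgraph
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(component)
    return clusters
```

Each returned set corresponds to one cluster, which is then the input to the secondary clustering.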
Figure BDA0002234208810000121
where X represents the co-occurrence frequency of the first key attribute value and the second key attribute value in the same data record, and Y and Z represent the occurrence frequencies of the first key attribute value and the second key attribute value, respectively, in the plurality of data records. For example, assume that the client device 10a sends 10 data records to the server device 10b and that the first key attribute value and the second key attribute value have an association relationship. If the first key attribute value appears in data records 1-7 and the second key attribute value appears in data records 1-5 and data records 8-10, then the two values appear together in data records 1-5, that is, their co-occurrence frequency in the same data record is 5; the first key attribute value occurs 7 times in the 10 data records, and the second key attribute value occurs 8 times. In this case, the association weight between the first key attribute value and the second key attribute value can be expressed as:
Figure BDA0002234208810000122
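The counts X, Y and Z used in the example above can be obtained as sketched below. The weight formula itself is given only as a figure and is therefore not reproduced here; the sketch stops at the three counts. The function name and the representation of a data record as a set of attribute values are assumptions for illustration.

```python
def occurrence_counts(records, first_value, second_value):
    """Return (X, Y, Z) for two key attribute values.

    X: number of records containing both values,
    Y: number of records containing the first value,
    Z: number of records containing the second value.
    Each record is assumed to be a set of attribute values.
    """
    x = sum(1 for r in records if first_value in r and second_value in r)
    y = sum(1 for r in records if first_value in r)
    z = sum(1 for r in records if second_value in r)
    return x, y, z

# The 10-record example from the text: the first value appears in
# records 1-7, the second value in records 1-5 and 8-10.
example_records = [set() for _ in range(10)]
for i in range(0, 7):
    example_records[i].add("first")
for i in list(range(0, 5)) + list(range(7, 10)):
    example_records[i].add("second")
```

With these records, `occurrence_counts(example_records, "first", "second")` yields the counts 5, 7 and 8 stated in the example.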
Further, taking a first cluster of the plurality of clusters as an example, a specific implementation process by which the server device 10b performs secondary clustering on the key attribute values in each cluster is described below. The first cluster is any one of the plurality of clusters obtained by initial clustering. When performing secondary clustering on the key attribute values in the first cluster, the server device 10b may determine a reference core attribute from the core attributes contained in the first cluster, a core attribute being one of the key attributes; further, according to the level relationship among the key attributes contained in the first cluster and the association weights among the key attribute values, the key attribute values under the reference core attribute are each used as a clustering source point, and the key attribute values under the non-reference core attributes in the first cluster are clustered into the sub-clusters represented by the clustering source points.
Optionally, when determining the reference core attribute from the core attributes contained in the first cluster, the server device 10b may select the core attribute with the most key attribute values as the reference core attribute, according to the number of key attribute values under each core attribute contained in the first cluster. For example, if the first cluster contains 3 core attributes, namely the identification number, the passport number, and the membership number, and contains 5 identification numbers, 6 passport numbers, and 9 membership numbers, the membership number can be used as the reference core attribute.
Further, when clustering the key attribute values under the non-reference core attributes in the first cluster into the sub-clusters represented by the clustering source points, the server device 10b may first cluster the key attribute values at higher attribute levels into the sub-clusters represented by the clustering source points. The key attribute values at each level are clustered into the sub-clusters represented by the clustering source points in the same way. A third key attribute value is taken as an example below. The third key attribute value is any key attribute value in the first cluster that has not yet been clustered into any sub-cluster.
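Selecting the reference core attribute by counting key attribute values can be sketched as follows. The function name and the dictionary layout (attribute name mapped to the set of its values in the cluster) are assumptions for illustration; ties are resolved here by taking the first attribute listed, which the text does not specify.

```python
def choose_reference_core_attribute(cluster_values, core_attributes):
    """Pick the core attribute with the most key attribute values.

    `cluster_values` maps an attribute name to the set of its values in
    the cluster; `core_attributes` lists the core attribute names.
    On a tie, the attribute listed first in `core_attributes` wins
    (an assumption of this sketch).
    """
    counts = {attr: len(cluster_values.get(attr, ())) for attr in core_attributes}
    return max(counts, key=counts.get)

# Example from the text: 5 identification numbers, 6 passport numbers,
# 9 membership numbers.
example_cluster = {
    "identification_number": {f"id_{i}" for i in range(5)},
    "passport_number": {f"passport_{i}" for i in range(6)},
    "membership_number": {f"member_{i}" for i in range(9)},
}
```

For the example cluster, the membership number is selected, and its 9 values become the clustering source points.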
For the third key attribute value, the server device 10b may calculate a correlation between the third key attribute value and each clustering source point according to the association weight between the third key attribute value and each clustering source point; and dividing the third key attribute value into the sub-cluster represented by the clustering source point with the maximum relevance.
Further, the server device 10b may determine the shortest association path between the third key attribute value and each clustering source point according to the association weights between the key attribute values traversed by the association paths between the third key attribute value and that clustering source point, and may calculate the correlation between the third key attribute value and each clustering source point according to the association weights between the key attribute values traversed by the corresponding shortest association path.
Further, if a plurality of clustering source points have the same greatest correlation with the third key attribute value, the following applies. If the target level of the third key attribute value is the level immediately below the level of the clustering source points, or is the highest level in the level relationship among the key attributes contained in the first cluster, the third key attribute value may be clustered into the sub-cluster corresponding to any one of those clustering source points. If the target level of the third key attribute value is any other level, the correlation between the third key attribute value and each key attribute value at the level immediately above the target level may be calculated, and the third key attribute value is clustered into the sub-cluster corresponding to the key attribute value that, among all key attribute values at the level immediately above the target level, has the greatest correlation with the third key attribute value. To facilitate understanding of the above process by which the server device 10b performs secondary clustering on the key attribute values in the first cluster, a description is given below with reference to the connected subgraphs shown in FIG. 1b to FIG. 1f.
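One possible reading of the path-based correlation is sketched below. It assumes that the correlation along a path is the product of the association weights on that path, and that the "shortest" association path is the one maximizing this product; both are assumptions of the sketch, as are the function names and the adjacency dictionary layout. With weights in (0, 1], a Dijkstra-style search finds the maximum product.

```python
import heapq

def best_correlations(adjacency, start):
    """Maximum path-weight product from `start` to every reachable value.

    `adjacency` maps value -> {neighbour: association weight}, with
    weights assumed to lie in (0, 1].
    """
    best = {start: 1.0}
    heap = [(-1.0, start)]  # max-heap on correlation via negated keys
    while heap:
        neg_corr, node = heapq.heappop(heap)
        corr = -neg_corr
        if corr < best.get(node, 0.0):
            continue  # stale heap entry
        for neighbour, weight in adjacency.get(node, {}).items():
            candidate = corr * weight
            if candidate > best.get(neighbour, 0.0):
                best[neighbour] = candidate
                heapq.heappush(heap, (-candidate, neighbour))
    return best

def assign_to_source(adjacency, value, source_points):
    """Return the clustering source point with the greatest correlation."""
    corr = best_correlations(adjacency, value)
    return max(source_points, key=lambda s: corr.get(s, 0.0))
```

Ties between source points are not handled in this sketch; `max` simply returns the first source point with the greatest correlation.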
Suppose FIG. 1b is the connected subgraph corresponding to the first cluster. As shown in FIG. 1b, the first cluster contains a plurality of key attributes: membership number, identification number, passport number, account number, e-mail box, and mobile phone number. The levels of the key attributes, from high to low, are: identification number, passport number, membership number, account number, e-mail box (mailbox for short), and mobile phone number; the core attributes are the membership number, the identification number, and the passport number.
As can be seen from FIG. 1b, among the core attributes there are 5 membership numbers, 4 identification numbers, and 3 passport numbers, so the server device 10b can select the membership number, which has the most key attribute values, as the reference core attribute and use the 5 membership numbers as clustering source points. Further, according to the levels among the plurality of key attributes, the server device 10b first clusters the 4 identification numbers into the sub-clusters represented by the 5 membership numbers. Taking identification number 1 as an example, the path between identification number 1 and membership number 1 is: identification number 1 - account number 1 - mailbox 2 - membership number 1; the path between identification number 1 and membership number 2 is: identification number 1 - account number 1 - mailbox 2 - passport number 1 - membership number 2; the path between identification number 1 and membership number 3 is: identification number 1 - membership number 3; the path between identification number 1 and membership number 4 is: identification number 1 - account number 1 - mailbox 2 - passport number 1 - membership number 4; the path between identification number 1 and membership number 5 is: identification number 1 - account number 1 - mailbox 2 - passport number 1 - membership number 4 - mobile phone number 2 - membership number 5. Further, the correlations between identification number 1 and membership numbers 1-5 are calculated according to the association weights between the key attribute values traversed by the paths between identification number 1 and membership numbers 1-5, respectively. Optionally, each correlation can be obtained by multiplying together the association weights between the key attribute values traversed by the corresponding path. For example, the correlation between identification number 1 and membership number 4 is 0.75 × 0.87 × 0.9 × 0.9 = 0.528525.
Then, since the correlation between the identification number 1 and the membership number 4 is the largest, the identification number 1 is clustered into the sub-cluster corresponding to the membership number 4. According to the same method, the identification number 2 is clustered into the sub-cluster represented by the member number 3, the identification number 3 is clustered into the sub-cluster represented by the member number 1, and the identification number 4 is clustered into the sub-cluster represented by the member number 5. Wherein, the connected subgraph formed after clustering the identity card number is shown in fig. 1 c.
According to the method, the passport number, the account number, the email address and the mobile phone number are sequentially clustered respectively to obtain the connected subgraph shown in fig. 1d, wherein each connected subgraph in fig. 1d represents a sub-cluster, and each sub-cluster corresponds to a data object.
Similarly, assume that fig. 1e is a connected subgraph corresponding to the first cluster. As shown in fig. 1e, the first cluster contains a plurality of key attributes: a membership number, an identification number, and a passport number. The levels among the key attributes are as follows from high to low in sequence: an identification number, passport number, and member number; wherein the core attributes are a membership number and an identity card number. Then, a membership number may be selected as the reference core attribute. Further, firstly, according to the association weight between the identity card number 1-4 and the membership number 1-5, the identity card number 1 is clustered to the sub-cluster represented by the membership number 3, the identity card number 2 is clustered to the sub-cluster represented by the membership number 4, the identity card number 3 is clustered to the sub-cluster represented by the membership number 1, and the identity card number 4 is clustered to the sub-cluster represented by the membership number 5. Then, passport number 1 is clustered into a sub-cluster represented by member number 2, and passport number 2 is clustered into a sub-cluster corresponding to member number 3, based on the association weights between passport numbers 1 and 2 and member numbers 1 to 5, respectively. 
However, for passport number 3, the association weights between passport number 3 and membership numbers 4 and 5 are the same, that is, the correlations between passport number 3 and membership numbers 4 and 5 are the same. The target level at which the passport number is located is neither the level immediately below the level of the membership number (the level of the clustering source points) nor the highest level in the level relationship among the key attributes contained in the first cluster (the level of the identification number). Since the level immediately above the level of the passport number is the level of the identification number, the correlations between passport number 3 and identification numbers 2 and 4 can be calculated respectively. Further, since the correlation between passport number 3 and identification number 2 (0.9 × 0.86) is greater than the correlation between passport number 3 and identification number 4 (0.9 × 0.73 × 0.72), passport number 3 can be clustered into the sub-cluster corresponding to identification number 2. The 5 sub-clusters corresponding to the first cluster, as shown in FIG. 1f, are thereby obtained, where each sub-cluster corresponds to one data object.
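The tie-breaking rule illustrated by passport number 3 can be sketched as follows. Levels are encoded as integers with 0 as the highest level, so "the level immediately below the source points" is the source level plus one; this encoding, the function name, and the argument layout are assumptions for illustration.

```python
def resolve_tie(value_level, source_level, top_level, tied_sources,
                upper_level_correlations, upper_value_to_source):
    """Choose a sub-cluster when several source points tie.

    Levels are integers, 0 being the highest (an assumption of this sketch).
    `upper_level_correlations` maps each key attribute value at the level
    immediately above the tied value to its correlation with that value;
    `upper_value_to_source` maps each such value to the source point of
    the sub-cluster it already belongs to.
    """
    if value_level == source_level + 1 or value_level == top_level:
        # Any of the tied source points may be chosen.
        return tied_sources[0]
    best_upper = max(upper_level_correlations, key=upper_level_correlations.get)
    return upper_value_to_source[best_upper]

# Passport number 3 from the example: tied between membership numbers
# 4 and 5, resolved via identification numbers 2 and 4.
chosen = resolve_tie(
    value_level=1, source_level=2, top_level=0,
    tied_sources=["member_4", "member_5"],
    upper_level_correlations={"id_2": 0.9 * 0.86, "id_4": 0.9 * 0.73 * 0.72},
    upper_value_to_source={"id_2": "member_4", "id_4": "member_5"},
)
```

Here 0.9 × 0.86 = 0.774 exceeds 0.9 × 0.73 × 0.72 = 0.47304, so passport number 3 follows identification number 2 into the sub-cluster of membership number 4.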
In some embodiments, after performing secondary clustering on the key attribute value in each cluster to obtain at least one sub-cluster, the server device 10b may further perform information check on the at least one sub-cluster to ensure that the key attribute value in each sub-cluster belongs to the data object corresponding to the sub-cluster. The following takes the first sub-cluster of the at least one sub-cluster as an example for illustration. Wherein the first sub-cluster is any one of the at least one sub-cluster.
For the first sub-cluster, the server device 10b may further judge whether, among the key attributes contained in the first sub-cluster, there is a key attribute under which the first sub-cluster contains a plurality of key attribute values. If the result is yes, the first sub-cluster is re-clustered with the plurality of key attribute values under that key attribute as new clustering source points, until each key attribute in any sub-cluster decomposed from the first sub-cluster has only one key attribute value. Optionally, if a plurality of key attributes in the first sub-cluster each have a plurality of key attribute values, the key attribute values under the highest-level key attribute among them are used as the new clustering source points.
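The check described above, and the choice of new clustering source points, can be sketched as follows. The function name, the dictionary layout of a sub-cluster, and the list of attribute names ordered from the highest level to the lowest are assumptions for illustration.

```python
def new_source_points(sub_cluster, levels_high_to_low):
    """Return (attribute, values) to re-cluster on, or None if not needed.

    `sub_cluster` maps a key attribute name to the set of its values in
    the sub-cluster; `levels_high_to_low` lists attribute names from the
    highest level to the lowest.
    """
    conflicting = [attr for attr in levels_high_to_low
                   if len(sub_cluster.get(attr, ())) > 1]
    if not conflicting:
        return None  # every key attribute has a single value
    top = conflicting[0]  # highest-level attribute with several values
    return top, sorted(sub_cluster[top])

example_sub_cluster = {
    "identification_number": {"id_2"},
    "passport_number": {"passport_1", "passport_2"},
    "mobile_phone_number": {"phone_1", "phone_2"},
}
example_levels = ["identification_number", "passport_number", "mobile_phone_number"]
```

For the example sub-cluster, the passport number is the highest-level attribute with more than one value, so its two values would serve as the new clustering source points.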
Optionally, for the first sub-cluster, the server device 10b may further search, in the plurality of data records, non-key attribute values under the first non-key attribute respectively associated with the key attribute values included in the first sub-cluster. The first non-key attribute is any one or more attributes except the key attribute in the attributes related to the attribute information of the data object. The data objects are different, and the key attributes and the non-key attributes are also different. For example, if the data object is a natural person, the key attribute may be at least one of an identification number, a passport number, a mobile phone number, an e-mail box, and a bank card number; the non-critical attribute may be at least one of name, home address, age, date of birth, IP address when the terminal device generates the data record, and MAC address of the terminal device, but is not limited thereto. For another example, if the data object is a company, the key attribute may be at least one of a company name, a taxpayer identification number, and a business registration number, but is not limited thereto; the non-critical attribute may be at least one of a company address, a zip code, and an IP address, but is not limited thereto.
Further, the server device 10b may judge whether the non-critical attribute values under the first non-critical attribute are consistent. If the result is no, that is, the non-critical attribute values under the first non-critical attribute are inconsistent, the key attribute values associated with the inconsistent non-critical attribute values may be used as new source point attribute values, and the first sub-cluster is re-clustered until the non-critical attribute values under the first non-critical attribute contained in the first sub-cluster are consistent. Here, consistency of the non-critical attribute values under the first non-critical attribute contained in the first sub-cluster means that the non-critical attribute values under the first non-critical attribute contained in each new sub-cluster formed by re-clustering the first sub-cluster are consistent. For the process of re-clustering the first sub-cluster, reference may be made to the above process of performing secondary clustering on the first cluster, which is not described herein again.
To make the above judgment process clearer, an example is given below in which the data object is a natural person, the key attributes are the identification number, the passport number, and the membership number, and the non-critical attributes are the name and the date of birth. The server device 10b can search the plurality of data records for the names and dates of birth respectively associated with the identification number, the passport number, and the membership number, and judge whether those names and dates of birth are consistent. If the result is that the name associated with the identification number is inconsistent with the name associated with the passport number, the first sub-cluster is re-clustered with the identification number and the passport number as new clustering source points, until the non-critical attribute values associated with the key attribute values in each new sub-cluster are consistent.
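The consistency check on a non-critical attribute can be sketched as follows. Data records are modelled as dictionaries from attribute name to value; this layout, the function names, and the sample names are assumptions for illustration only.

```python
from collections import defaultdict

def group_by_non_key_value(key_values, records, non_key_attribute):
    """Group a sub-cluster's key attribute values by the non-critical
    attribute value they co-occur with in the data records."""
    groups = defaultdict(set)
    for record in records:
        non_key = record.get(non_key_attribute)
        if non_key is None:
            continue
        for key_value in key_values:
            if key_value in record.values():
                groups[non_key].add(key_value)
    return dict(groups)

def is_consistent(groups):
    """Consistent when at most one distinct non-critical value is found."""
    return len(groups) <= 1

example_records = [
    {"identification_number": "ID1", "name": "Zhang San"},
    {"passport_number": "P1", "name": "Li Si"},
    {"membership_number": "M1", "name": "Zhang San"},
]
example_groups = group_by_non_key_value({"ID1", "P1", "M1"}, example_records, "name")
```

In this example the identification number and the passport number are associated with different names, so the sub-cluster would be re-clustered with those two values as new clustering source points.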
The data processing method provided by the embodiment of the application is suitable for various application scenarios. For example, a company or an enterprise can use the data processing method to construct a global unified account or a group-level OneID. Correspondingly, the server device 10b provides a data processing service for the company or enterprise to construct its global unified account or group-level OneID, so as to form a complete unified account system. The company may be of various types, such as an internet company, a dairy company, a game company, a club, a finance company, a travel company, a real estate company, an e-commerce platform, or a travel service platform, and may even be a large group covering various businesses, but is not limited thereto.
For a dairy company, a family can be used as the data object to construct the unified account system of the dairy company; for a travel service platform, an e-commerce platform, or a club, a user can be used as the data object to construct a unified account system; for a game company, a group can be used as the data object to construct its unified account system, where groups can be divided by the age, sex, and so on of the users; but the data object is not limited thereto.
For any kind of enterprise or company, establishing a unified account system helps to manage the user information of the enterprise in a unified way. When the unified account system of an enterprise is constructed, information belonging to the same user can be identified by using the data processing method provided by the embodiment of the application. The following takes a travel service platform as an example to describe the data processing method provided by the embodiment of the present application.
Fig. 1g is a connected subgraph of one of a plurality of clusters formed by initially clustering, by the server device 10b, a plurality of key attribute values under a plurality of key attributes included in a data record provided by the travel service platform. As shown in fig. 1g, the key attributes are: the system comprises an identity card number, a passport number, a bank card number, a payment platform account number, an e-commerce platform account number, a hotel membership card and a mobile phone number, wherein the identity card number and the passport number are core attributes, and the level relationship among the key attributes is arranged from high to low according to the sequence.
As can be seen from FIG. 1g, among the core attributes there are 3 identification numbers and 2 passport numbers, so the server device 10b selects the identification number, which has the most attribute values, as the reference core attribute and uses the 3 identification numbers as clustering source points. Further, according to the levels among the key attributes, the server device 10b first clusters the 2 passport numbers (passport numbers 1 and 2) into the sub-clusters represented by the 3 identification numbers. According to the clustering method provided in the above embodiment, passport numbers 1 and 2 can both be clustered into the sub-cluster represented by identification number 2. In the same way, the server device 10b may cluster the hotel member card number 2 and the hotel member card number 2 into the sub-cluster represented by identification number 2, and cluster hotel member card number 4 into the sub-cluster corresponding to identification number 4, so as to obtain the connected subgraph shown in FIG. 1h.
Further, in the same way, mobile phone number 1 is clustered into the sub-cluster represented by identification number 2, and mobile phone numbers 2 and 3 are clustered into the sub-clusters represented by identification number 3 and identification number 2, respectively, so as to obtain the connected subgraphs shown in FIG. 1i. Each connected subgraph in FIG. 1i represents a sub-cluster, and each sub-cluster corresponds to a user. FIG. 1i illustrates only the case of 3 sub-clusters, denoted sub-clusters 1-3.
Further, in sub-cluster 3 there are different attribute values under the same key attribute, namely passport numbers 1 and 2, mobile phone numbers 1 and 2, and hotel membership card numbers 2 and 3. Since the passport number is at a higher level than the mobile phone number and the hotel membership card number, sub-cluster 3 is re-clustered with the passport number as the new reference attribute, that is, with passport numbers 1 and 2 as new clustering source points. The specific process is as follows: first, according to the level relationship among the key attributes contained in sub-cluster 3, payment platform account number 1 is clustered into the sub-cluster represented by passport number 1; then e-commerce platform account number 1 is clustered into the sub-cluster represented by passport number 1, yielding the connected subgraphs shown in FIG. 1j and forming new sub-clusters (sub-clusters 1-4). Since no key attribute contained in sub-clusters 1-4 shown in FIG. 1j has different key attribute values, the clustering of the different key attributes of the same user is complete. Each connected subgraph in FIG. 1j corresponds to one sub-cluster, and one sub-cluster represents one user.
Further, a uniform identifier may be set for each user represented by the sub-cluster shown in fig. 1j, and in some embodiments, the identifier may be referred to as OneID. Optionally, the key attribute value with the largest occurrence number in a specified time period in the key attribute values contained in each sub-cluster may be used as the OneID of the user represented by the sub-cluster. The designated time period can be flexibly set according to actual conditions, for example, the designated time period can be set to the last 1 week, the last 1 month, 2 months and the like, but is not limited thereto.
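Choosing the OneID as the key attribute value that occurs most often within the specified time period can be sketched as follows. The function name and the representation of occurrences as (key attribute value, timestamp) pairs are assumptions for illustration; how ties are broken is not specified in the text and is left to `Counter.most_common` here.

```python
from collections import Counter
from datetime import datetime, timedelta

def choose_one_id(sub_cluster_values, occurrences, now, window):
    """Return the key attribute value of the sub-cluster that occurs most
    often within `window` before `now`, or None if none occurs.

    `occurrences` is an iterable of (key_attribute_value, timestamp).
    """
    start = now - window
    counts = Counter(
        value for value, timestamp in occurrences
        if value in sub_cluster_values and start <= timestamp <= now
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]

example_now = datetime(2019, 10, 15)
example_occurrences = (
    [("phone_1", example_now - timedelta(days=1))] * 3
    + [("id_2", example_now - timedelta(days=2))]
    + [("id_2", example_now - timedelta(days=60))] * 5
)
```

With a 30-day window, the five older occurrences of the identification number fall outside the period, so the mobile phone number is chosen in this example.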
The data processing method provided by the embodiment of the application can be applied to unified account construction for enterprises or companies and can also be applied to other application scenarios. For example, it can be applied to various kinds of information statistics. The following takes as an example the statistics of the garbage produced, or the water, electricity, or gas consumed, within a set time period by each cell governed by a community. In this application scenario, one cell serves as one data object. For a cell, the key attributes include: cell name, cell location, unit building location, identity of a resident within the cell, and the like, but are not limited thereto. The cell location is position information that pinpoints the cell, namely position information that can distinguish different cells; the unit building location is position information that pinpoints each unit building in a cell, namely position information that can distinguish different unit buildings in the same cell; the identity of a resident within the cell may be information that uniquely identifies a household. For example, the identity of a resident within the cell may be, but is not limited to, the owner's identification number, property certificate number, passport number, telephone number, and the like. In this application scenario, the plurality of data records acquired by the server device 10b include a plurality of key attributes, where each key attribute includes a plurality of key attribute values. Further, each data record may be: how much garbage was produced, or how much water, electricity, or gas was consumed, in a certain cell on a certain day; or may be: how much garbage was produced, or how much water, electricity, or gas was consumed, at a certain place on a certain day; but is not limited thereto.
Further, the server device 10b may identify the attribute values belonging to the same cell based on the level relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values, that is, cluster the key attribute values representing the same cell; for example, it may cluster together the cell name, the cell location, and the resident identity that represent the same cell.
Further, according to the key attribute values belonging to the same cell, the server device 10b may obtain from the plurality of data records the garbage amount, water consumption, power consumption, and gas consumption corresponding to each key attribute value within the set time period, and thereby obtain the garbage amount produced and the water, power, and gas consumed by the cell within the set time period.
In other embodiments, when checking the at least one sub-cluster, the server device 10b may further obtain, from the plurality of data records, a plurality of non-critical attribute values under a second non-critical attribute and a plurality of non-critical attribute values under a third non-critical attribute respectively associated with each sub-cluster. Here, the plurality of non-critical attribute values under the second non-critical attribute and the plurality of non-critical attribute values under the third non-critical attribute respectively associated with each sub-cluster refer to the non-critical attribute values under the second non-critical attribute and under the third non-critical attribute that are respectively associated with the key attribute values in that sub-cluster. For example, for FIG. 1f, the plurality of non-critical attribute values under the second non-critical attribute associated with sub-cluster 1 refers to the non-critical attribute values under the second non-critical attribute respectively associated with membership number 1 and identification number 3. In this embodiment, the second non-critical attribute may be the same as or different from the first non-critical attribute. Optionally, if the first non-critical attribute comprises a plurality of non-critical attributes, the second non-critical attribute may be one of them.
Further, the third non-critical attribute is different from the second non-critical attribute. The third non-critical attribute may also be one of the first non-critical attributes. For example, if the data object is a natural person, the first non-key attribute may be any one of a name, a home address, an age, a birth date, an IP address, and a MAC address, and correspondingly, the second non-key attribute may be any one of a name, a home address, an age, a birth date, an IP address, and a MAC address other than the first non-key attribute.
Further, the server device 10b may divide the plurality of non-critical attribute values under the second non-critical attribute into a plurality of information clusters by using an association relationship between the plurality of non-critical attribute values under the second non-critical attribute and the plurality of non-critical attribute values under the third non-critical attribute. The server device 10b may determine, according to the membership relationship of the plurality of non-critical attribute values under the second non-critical attribute and the membership relationship of the plurality of non-critical attribute values under the third non-critical attribute in the plurality of data records, an association relationship between the plurality of non-critical attribute values under the second non-critical attribute and the plurality of non-critical attribute values under the third non-critical attribute, and the specific embodiment thereof may refer to the relevant contents of the above embodiments, which is not described herein again.
Further, for each information cluster, the server device 10b may calculate the belonging probability between each two non-critical attribute values and the candidate result under the second non-critical attribute in each information cluster, respectively; wherein the candidate result comprises: belonging to the same data object and not belonging to the same data object. I.e. the probability that the two non-critical property values belong to the same data object and the probability that the two non-critical property values do not belong to the same data object are calculated. Further, the server device 10b may determine whether each two non-critical attribute values under the second non-critical attribute belong to the same data object according to the belonging probability between each two non-critical attribute values under the second non-critical attribute in each information cluster and the candidate result. Optionally, the server device 10b may select a candidate result corresponding to a higher probability as a final decision result from the probabilities that the two non-critical attribute values belong to the same data object and the probabilities that the two non-critical attribute values do not belong to the same data object. For example, if the probability that the two non-critical attribute values belong to the same data object is greater than the probability that the two non-critical attribute values do not belong to the same data object, it may be determined that the two non-critical attribute values belong to the same data object; otherwise, it is determined that the two non-critical attribute values do not belong to the same data object.
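As an illustrative, non-limiting sketch of the pairwise decision described above, the following Python snippet selects, for every pair of non-key attribute values in one information cluster, the candidate result with the higher probability. The function name `decide_pairs` and the `score` callable are hypothetical stand-ins for the output of the decision model and are not taken from the application.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical interface: returns (p_same, p_different) for a pair of values.
Scorer = Callable[[str, str], Tuple[float, float]]


def decide_pairs(values: Iterable[str], score: Scorer) -> Dict[Tuple[str, str], bool]:
    """Map each unordered pair to True when 'same data object' is more probable."""
    decisions: Dict[Tuple[str, str], bool] = {}
    for a, b in combinations(sorted(set(values)), 2):
        p_same, p_different = score(a, b)
        # Pick the candidate result with the higher probability;
        # otherwise (including ties) the pair is treated as "not the same object".
        decisions[(a, b)] = p_same > p_different
    return decisions
```

In practice, `score` would wrap the trained decision model; here it is left abstract so that any model producing the two probabilities can be plugged in.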
Taking the first non-critical attribute value and the second non-critical attribute value under the second non-critical attribute contained in the first information cluster as an example, a specific implementation process of the belonging probability between each two non-critical attribute values and the candidate result is exemplarily described below. The first information cluster is any one of a plurality of information clusters, and the first non-key attribute value and the second non-key attribute value are any two non-key attribute values under the second non-key attribute contained in the first information cluster.
For the first non-key attribute value and the second non-key attribute value, the server device 10b may search the plurality of data records for the other attribute values and behavior attributes respectively associated with the first non-key attribute value and the second non-key attribute value. Here, the other attribute values are the attribute values in the plurality of data records other than the plurality of key attribute values and the non-key attribute values under the third non-key attribute. Further, the server device 10b may input the other attribute values and behavior attributes respectively associated with the first non-key attribute value and the second non-key attribute value into the decision model, so as to obtain, for each candidate result, the probability that the pair formed by the first non-key attribute value and the second non-key attribute value belongs to that candidate result.
Further, if the probability that the first non-key attribute value and the second non-key attribute value belong to the same data object is greater than the probability that they do not belong to the same data object, it is determined that the first non-key attribute value and the second non-key attribute value belong to the same data object. Further, if the first non-key attribute value and the second non-key attribute value belong to the same data object, it is determined that the sub-cluster corresponding to the first non-key attribute value and the sub-cluster corresponding to the second non-key attribute value belong to the same data object. Further, the server device 10b may also merge sub-clusters belonging to the same data object. Optionally, the server device 10b may obtain the behavior data associated with the merged sub-cluster, analyze the behavior characteristics of the corresponding data object, and recommend corresponding content to the data object according to the behavior characteristics of the data object.
In the embodiment of the present application, the server device 10b may also train the decision model. Optionally, the server device 10b may take minimization of a loss function as the training target, take the sample attribute values and sample behavior data respectively associated with non-key attribute values that are known to belong to the same data object and are different from the second non-key attribute as positive samples, take the sample attribute values and sample behavior data respectively associated with non-key attribute values that are known not to belong to the same data object and are different from the second non-key attribute as negative samples, and perform model training to obtain the decision model. The loss function is determined according to the probabilities obtained by model training and the actual probabilities of the positive samples and the negative samples. Optionally, the actual probabilities of the positive samples and the negative samples are 1 and 0, respectively.
Optionally, the decision model may be a Wide & Deep model, a GBDT model, an LR model, an RF model, or the like, but is not limited thereto.
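By way of a hedged example, the training procedure can be sketched with an LR model, one of the model types listed above. The sketch assumes that each sample pair has already been turned into a numeric feature vector and that the loss function is binary cross-entropy; both are assumptions made for illustration, since the application only states that the loss is determined from the predicted and actual probabilities. The names `train_lr` and `predict_same` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_lr(samples: Sequence[Sequence[float]], labels: Sequence[int],
             lr: float = 0.5, epochs: int = 500) -> Tuple[List[float], float]:
    """Batch gradient descent on binary cross-entropy (assumed loss).

    Labels are the actual probabilities: 1 for positive samples, 0 for negative ones.
    """
    n, dim = len(samples), len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(samples, labels):
            p = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # derivative of cross-entropy w.r.t. the logit
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_same(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Predicted probability that the pair belongs to the same data object."""
    return _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

The probability that the pair does not belong to the same data object is then `1 - predict_same(w, b, x)`, so the two candidate results can be compared directly.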
In some other embodiments, the third non-critical attribute may not exist in the plurality of data records, and the server device 10b may obtain, from the plurality of data records, a plurality of non-critical attribute values under the second non-critical attribute respectively associated with each sub-cluster, and obtain, from the plurality of data records, behavior data respectively associated with the plurality of non-critical attribute values under the second non-critical attribute. Further, the server device 10b may further obtain behavior characteristics of the data object corresponding to each sub-cluster according to behavior data associated with a plurality of non-key attribute values under the second non-key attribute; and determining whether the at least one sub-cluster belongs to the same data object according to the behavior characteristics of the data object corresponding to each sub-cluster.
Taking a first sub-cluster and a second sub-cluster of at least one sub-cluster as an example, an implementation process for determining whether a plurality of sub-clusters belong to the same data object according to behavior characteristics of data objects respectively corresponding to the plurality of sub-clusters will be exemplarily described below. Wherein the first sub-cluster and the second sub-cluster are any two sub-clusters of the plurality of sub-clusters.
For the first sub-cluster and the second sub-cluster, the server device 10b may calculate a similarity between a behavior feature of a data object corresponding to the first sub-cluster and a behavior feature of a data object corresponding to the second sub-cluster, and if the calculated similarity is greater than or equal to a preset similarity threshold, the data objects corresponding to the first sub-cluster and the second sub-cluster are the same data object.
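The threshold comparison above can be sketched as follows. The application does not name a similarity measure, so cosine similarity over numeric behavior feature vectors is used here purely as an assumption, and the function names and the default threshold of 0.9 are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_data_object(features_1: Sequence[float], features_2: Sequence[float],
                     threshold: float = 0.9) -> bool:
    """True when the similarity reaches the preset similarity threshold."""
    return cosine_similarity(features_1, features_2) >= threshold
```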
Further, under the condition that the data objects corresponding to the first sub-cluster and the second sub-cluster are the same data object, the server device 10b may merge the first sub-cluster and the second sub-cluster, add the identification information to the data objects corresponding to the first sub-cluster and the second sub-cluster according to the behavior characteristics corresponding to the first sub-cluster and the second sub-cluster, respectively, and send the first sub-cluster and the second sub-cluster to the client device 10a after the identification information is added. Accordingly, the client device 10a outputs the first sub-cluster and the second sub-cluster in a visualized manner.
Optionally, the client device 10a may also recommend the relevant content to the data objects corresponding to the first sub-cluster and the second sub-cluster according to the identification information of the first sub-cluster and the second sub-cluster.
In addition to the above client-server (C/S) system architecture, the data processing method provided by the embodiment of the present application can also be performed autonomously by a server device. The server device is a computer device located at the server side that has computing, communication and other functions, and may be a computer or a server located at the server side. For example, for an online shopping website, a video website, a game website, etc., the server device may be a website server, or may be a cloud server array, etc. In this embodiment, the server device may store the data records of accessing users. Based on this, the server device can obtain a plurality of data records from the data records it maintains. The plurality of data records include a plurality of key attributes and a plurality of key attribute values under the plurality of key attributes. Further, in the embodiment of the present application, the server device stores a hierarchical relationship between the key attributes. Correspondingly, the server device can identify the attribute values belonging to the same data object among the plurality of key attribute values according to the hierarchical relationship among the plurality of key attributes contained in the plurality of data records and the association relationship among the plurality of key attribute values. That is, the plurality of key attribute values are hierarchically clustered according to the hierarchical relationship among the key attributes and the association relationship among the key attribute values included in the data records, so that the key attribute values belonging to the same data object are identified. Further, the server device outputs, in a visual manner, the attribute values belonging to the same data object among the plurality of key attribute values.
For the key attribute, the key attribute value, and the specific implementation process of the data processing performed by the server device, reference may be made to the relevant contents of the above system embodiment, and details are not described here again.
In addition to the above system embodiments, the embodiments of the present application also provide a data processing method, and the data processing method provided by the embodiments of the present application is exemplarily described below from the perspective of a server device.
Fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application. As shown in fig. 2, the method includes:
201. Acquiring a plurality of data records, wherein the plurality of data records comprise a plurality of key attribute values under a plurality of key attributes, and each key attribute value belongs to one data object at any given time.
202. Identifying the attribute values belonging to the same data object among the plurality of key attribute values according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values.
203. Outputting the attribute values belonging to the same data object among the plurality of key attribute values.
In this embodiment, the server device may obtain a plurality of data records. If the server device is a server device in the C/S system architecture, the server device may receive the plurality of data records sent by the client device. Alternatively, the server device may be a computer device located at the server side that has computing, communication and other functions, such as a computer or a server located at the server side. For example, for an online shopping website, a video website, a game website, etc., the server device may be a website server, or may be a cloud server array, etc. In this case, the server device may store the data records itself. Based on this, the server device can obtain a plurality of data records from the data records it maintains. For descriptions of the data object, the key attribute, and the key attribute value, reference may be made to the relevant contents of the above system embodiment, and details are not described here again.
In the embodiment of the application, the server device stores the hierarchical relationship between the key attributes. Accordingly, in step 202, the server device may identify the attribute values belonging to the same data object among the plurality of key attribute values according to the hierarchical relationship among the plurality of key attributes included in the plurality of data records and the association relationship among the plurality of key attribute values. That is, the plurality of key attribute values are hierarchically clustered according to the hierarchical relationship among the key attributes and the association relationship among the key attribute values included in the data records, so that the key attribute values belonging to the same data object are identified. Next, in step 203, the server device can output the key attribute values belonging to the same data object in a visual manner.
In this embodiment, the attribute values belonging to the same data object among the plurality of key attribute values can be identified according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values, so that the vertical clustering of the attribute values belonging to the same data object under different attributes is completed. Because this clustering approach takes multiple key attributes into account, it helps reduce the probability of erroneous clustering and helps improve the accuracy of identifying data belonging to the same data object.
Optionally, one implementation of step 203 is as follows: the server device sends the attribute values belonging to the same data object among the plurality of key attribute values to the client device. Correspondingly, the client device receives the attribute values belonging to the same data object among the plurality of key attribute values, and outputs them in a visual manner.
Or, in some embodiments, the server device may directly display an attribute value belonging to the same data object in the plurality of key attribute values on a human-computer interaction interface thereof.
In this embodiment, the server device may output the attribute values belonging to the same data object among the plurality of key attribute values in various forms. For example, as shown in fig. 1a, the server device may aggregate the attribute values belonging to the same data object among the plurality of key attribute values in the form of connected subgraphs, and output the aggregated connected subgraphs. Optionally, the server device may display the connected subgraphs on its human-computer interaction interface. Alternatively, the server device may send the aggregated connected subgraphs to the client device. Each connected subgraph corresponds to one data object. In a connected subgraph, nodes represent key attribute values, and a connecting line between two nodes represents the association relationship between the two key attribute values. For another example, the server device may aggregate the attribute values belonging to the same data object among the plurality of key attribute values in the form of tables, and output the tables formed by aggregation. Optionally, the server device may display the tables on its human-computer interaction interface. Alternatively, the server device may send the tables to the client device. Each table corresponds to one data object, and the key attribute values in the same row or column represent attribute values having an association relationship.
In some application scenarios, after step 202, the server device may further search behavior data of the data object corresponding to each connected subgraph in multiple data records, and obtain behavior characteristics of the data object corresponding to each connected subgraph according to the behavior data of the data object corresponding to each connected subgraph, and add the behavior characteristics of the data object corresponding to each connected subgraph as identification information of the data object corresponding to each connected subgraph. And then, the server-side equipment outputs the connected subgraph added with the identification information in a visual form.
In some embodiments, the server device may pre-establish a hierarchical relationship between several key attributes. Wherein the number of key attributes is greater than or equal to the number of key attributes in the plurality of data records. Based on the above, the server device may extract the hierarchical relationship among the plurality of key attributes included in the plurality of data records from the pre-established hierarchical relationship among the plurality of key attributes.
Optionally, the server device may obtain historical data records in a specified historical time period, where the historical data records include historical key attribute values under a plurality of key attributes; further, the server-side device may establish a hierarchical relationship between the plurality of key attributes according to the number of historical key attribute values under each of the plurality of key attributes. For a specific implementation manner of establishing the hierarchical relationship among the plurality of key attributes by the server device according to the number of the historical key attribute values under each key attribute in the plurality of key attributes, reference may be made to relevant contents of the above system embodiment, which is not described herein again.
In other embodiments, the server device may further analyze an association relationship between the plurality of key attribute values based on membership between the plurality of key attribute values and the plurality of data records. In the following, an example will be described by taking a first key attribute value and a second key attribute value of a plurality of key attribute values as an example. The first key attribute value and the second key attribute value are any two key attribute values in the plurality of key attribute values.
Determination mode 1: judging whether the first key attribute value and the second key attribute value appear in the same data record.
Determination mode 2: judging whether one of the first key attribute value and the second key attribute value appears in the same data record as a key attribute value that has an association relationship with the other.
Determination mode 3: judging whether a key attribute value that has an association relationship with the first key attribute value and a key attribute value that has an association relationship with the second key attribute value appear in the same data record.
Accordingly, if the judgment result in determination modes 1 to 3 is yes, it is determined that the first key attribute value and the second key attribute value have an association relationship. Here, the judgment result in determination modes 1 to 3 being yes means that the judgment result of any one or more of determination modes 1 to 3 is yes.
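The three determination modes can be sketched as follows. This is a simplified illustration that assumes each data record is given as a set of key attribute values, and that the "association relationship" referred to inside modes 2 and 3 means direct co-occurrence in a record; the function names and sample values are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set


def direct_neighbors(records: Iterable[Set[str]]) -> Dict[str, Set[str]]:
    """For each key attribute value, the values that share at least one record with it."""
    nbrs: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        for a, b in combinations(sorted(record), 2):
            nbrs[a].add(b)
            nbrs[b].add(a)
    return nbrs


def associated(first: str, second: str, records: Iterable[Set[str]]) -> bool:
    """True if any of determination modes 1 to 3 yields 'yes' (simplified reading)."""
    nbrs = direct_neighbors(records)
    n1, n2 = nbrs.get(first, set()), nbrs.get(second, set())
    # Mode 1: both values appear in the same data record.
    if second in n1:
        return True
    # Mode 2: one value shares a record with a value associated with the other.
    if n1 & n2:
        return True
    # Mode 3: an associate of the first shares a record with an associate of the second.
    return any(nbrs.get(m, set()) & n2 for m in n1)
```

For example, with records `{phone:1, id:3}`, `{id:3, member:1}` and `{member:1, email:x}`, the values `phone:1` and `email:x` never share a record but are still judged associated through mode 3.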
In this embodiment of the application, the server device may identify an attribute value belonging to the same data object from among the plurality of key attribute values according to a hierarchical relationship among the plurality of key attributes and an association relationship among the plurality of key attribute values. Optionally, the server device may perform initial clustering on the plurality of key attribute values according to an association relationship between the plurality of key attribute values to obtain a plurality of clusters. Wherein each cluster contains key attribute values having an associative relationship. Further, for each cluster, the server device may perform secondary clustering on the key attribute value in each cluster according to the hierarchical relationship among the plurality of key attributes to obtain at least one sub-cluster; and the key attribute values in the same sub-cluster can be regarded as the attribute values belonging to the same data object.
In this embodiment of the present application, before performing secondary clustering on the key attribute value in each cluster, the server device may further perform at least one of the following determination operations:
Judgment operation 1: judging whether a giant cluster whose number of key attribute values is greater than or equal to a preset first number threshold exists in the plurality of clusters.
Judgment operation 2: judging whether a giant cluster containing a problem attribute value exists in the plurality of clusters.
Further, if the result of the at least one judgment operation is yes, pruning is performed on the giant clusters in the plurality of clusters. That is, if the result of judgment operation 1 and/or judgment operation 2 is yes, it is determined that a giant cluster exists in the plurality of clusters, and the giant cluster is pruned. For a specific implementation of pruning the giant cluster, reference may be made to the related contents of the above embodiments, which are not described herein again.
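A minimal sketch of the two judgment operations is given below. It only flags the giant clusters and does not attempt the pruning itself, whose details are described elsewhere in the application. The function name, the representation of clusters as sets, and the way problem attribute values are supplied are assumptions for illustration.

```python
from typing import List, Set


def find_giant_clusters(clusters: List[Set[str]], first_number_threshold: int,
                        problem_values: Set[str]) -> List[Set[str]]:
    """Return the clusters flagged by judgment operation 1 or judgment operation 2."""
    giants = []
    for cluster in clusters:
        too_large = len(cluster) >= first_number_threshold       # judgment operation 1
        has_problem = not cluster.isdisjoint(problem_values)      # judgment operation 2
        if too_large or has_problem:
            giants.append(cluster)
    return giants
```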
Further, the server device may perform the initial clustering of the plurality of key attribute values by means of graph computation. Optionally, a graph computation framework may be used to initially cluster the plurality of key attribute values. The graph computation framework may be ODPS Graph, Spark GraphX, or the like, but is not limited thereto. A specific implementation is as follows: pairing every two key attribute values that have an association relationship among the plurality of key attribute values to obtain a plurality of attribute value relationship pairs; establishing an adjacency list for each key attribute value in the plurality of attribute value relationship pairs according to the association relationship among the key attribute values in the plurality of attribute value relationship pairs; and traversing the adjacency list of each key attribute value in the plurality of attribute value relationship pairs to obtain a plurality of connected subgraphs, wherein each connected subgraph is a cluster. In each connected subgraph, a node represents a key attribute value, and a connecting line between two nodes represents the association relationship between the two key attribute values.
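The pairing, adjacency-list and traversal steps can be sketched in plain Python as follows. This single-machine breadth-first traversal is only an illustration of the idea and is not the ODPS Graph or Spark GraphX implementation; the function name and sample values are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def connected_subgraphs(pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Build adjacency lists from attribute value relationship pairs and
    return one set of key attribute values per connected subgraph (cluster)."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen: Set[str] = set()
    clusters: List[Set[str]] = []
    for start in list(adjacency):
        if start in seen:
            continue
        # Breadth-first traversal of the adjacency lists from an unvisited node.
        component: Set[str] = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        clusters.append(component)
    return clusters
```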
Optionally, the attribute value relationship pairs may be represented in the data structure of triples. A triple is composed of two key attribute values having an association relationship and the association weight between the two key attribute values. Therefore, in each connected subgraph, the value on the connecting line between two nodes represents the association weight between the two key attribute values having the association relationship.
Optionally, the association weight between any two key attribute values having an association relationship may be calculated according to the number of times the two key attribute values occur together in the same data record and the number of times each of the two key attribute values occurs in the plurality of data records. For a specific implementation of calculating the association weight, reference may be made to the related contents of the above system embodiments, and details are not described herein again.
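To make the triple representation concrete, the sketch below derives triples from data records. The exact weight formula is given in the system embodiment and is not reproduced in this passage, so the ratio used here (co-occurrence count divided by the number of records containing either value) is only an assumed stand-in that depends on the same two quantities. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def association_triples(records: Iterable[Set[str]]) -> List[Tuple[str, str, float]]:
    """Return (value_a, value_b, weight) triples for values sharing a record."""
    single: Counter = Counter()  # occurrences of each value across records
    joint: Counter = Counter()   # co-occurrences of each unordered pair
    for record in records:
        values = sorted(record)
        single.update(values)
        joint.update(combinations(values, 2))
    triples = []
    for (a, b), together in joint.items():
        # Assumed formula: co-occurrences / records containing a or b.
        weight = together / (single[a] + single[b] - together)
        triples.append((a, b, weight))
    return triples
```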
Further, taking a first cluster of the multiple clusters as an example, a specific implementation process of performing secondary clustering on the key attribute value in each cluster by the server device is exemplarily described. The first cluster is any one of a plurality of clusters obtained by initial clustering. When the server-side equipment performs secondary clustering on the key attribute values in the first cluster, a reference core attribute can be determined from the core attributes contained in the first cluster, wherein the core attribute belongs to the key attribute; further, according to the level relation among the key attributes contained in the first cluster and the associated weight among the key attribute values, the key attribute values under the reference core attribute are respectively used as clustering source points, and the key attribute values under the non-reference core attribute in the first cluster are clustered into sub-clusters represented by the clustering source points.
Optionally, when the server device determines the reference core attribute from the core attributes included in the first cluster, the server device may select the core attribute including the most key attribute values as the reference core attribute according to the number of key attribute values under each core attribute included in the first cluster. Optionally, if the number of the key attribute values under each core attribute included in the first cluster is the same and is greater than 1, any one of the core attributes included in the first cluster may be used as the reference core attribute, or the core attribute with the highest rank may be selected as the reference core attribute. Further, if the number of the key attribute values under each core attribute contained in the first cluster is 1, it is indicated that the key attribute values in the first cluster belong to the same data object, and the hierarchical iteration is ended.
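The selection rule above can be sketched as follows. The sketch assumes that the core attributes of the first cluster are given as a mapping from attribute name to its key attribute values, and that a smaller rank number means a higher level; it also applies the highest-level tie-break to any tie for the largest count, which is a slight generalization. All names are hypothetical.

```python
from typing import Dict, Optional, Set


def pick_reference_core_attribute(core_values: Dict[str, Set[str]],
                                  rank: Dict[str, int]) -> Optional[str]:
    """Return the reference core attribute, or None when iteration should end."""
    counts = {attr: len(values) for attr, values in core_values.items()}
    # One value under every core attribute: the cluster already maps to one data object.
    if all(n == 1 for n in counts.values()):
        return None
    most = max(counts.values())
    tied = [attr for attr, n in counts.items() if n == most]
    # Among attributes with the most values, prefer the highest level (rank 0 is highest).
    return min(tied, key=lambda attr: rank[attr])
```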
Further, when the server device clusters the key attribute values under the non-reference core attribute in the first cluster into the sub-clusters represented by each cluster source point, the server device may cluster the key attribute values with higher attribute levels into the sub-clusters represented by each cluster source point first. And clustering the key attribute values under each grade into the sub-clusters represented by the clustering source points by the same method. The third key attribute value is taken as an example and is exemplarily described below. And the third key attribute value is any key attribute value which is not clustered to any sub-cluster currently in the key attribute values contained in the first cluster.
For the third key attribute value, the server-side device may calculate a correlation between the third key attribute value and each clustering source point according to the correlation weight between the third key attribute value and each clustering source point; and dividing the third key attribute value into the sub-cluster represented by the clustering source point with the maximum relevance.
Further, the server-side device may determine the shortest associated path between the third key attribute value and each cluster source point according to the associated weight between the third key attribute value and the key attribute value through which the associated path between each cluster source point passes; and calculating the correlation between the third key attribute value and each clustering source point respectively according to the correlation weight between the third key attribute value and the key attribute value passed by the shortest correlation path between the clustering source points.
Further, when a plurality of clustering source points have the greatest correlation with the third key attribute value, the following applies. If the target level of the third key attribute value is the level next to the level of the clustering source points, or is the highest level in the hierarchical relationship among the key attributes contained in the first cluster, the third key attribute value may be clustered into the sub-cluster corresponding to any one of the plurality of clustering source points having the greatest correlation with it. If the target level of the third key attribute value is any level other than the level next to the level of the clustering source points and the highest level in the hierarchical relationship among the key attributes contained in the first cluster, the correlation between the third key attribute value and each key attribute value at the level immediately above the target level may be calculated, and the third key attribute value is clustered into the sub-cluster represented by the key attribute value that, among all key attribute values at the level immediately above the target level, has the greatest correlation with the third key attribute value.
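The assignment of a key attribute value to the clustering source point with the greatest correlation can be sketched as follows. How the correlation is derived from the weights on the shortest association path is not spelled out in this passage, so the sketch assumes weights in (0, 1], treats an edge's length as the negative logarithm of its weight, and takes the product of the weights along the shortest path as the correlation. Ties are resolved by simply taking the first source point, which corresponds only to the simple case described above. All names are hypothetical.

```python
import heapq
import math
from typing import Dict, Sequence

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: association weight in (0, 1]}


def path_correlation(graph: Graph, start: str, target: str) -> float:
    """Product of weights along the shortest association path (Dijkstra on -log weight)."""
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        if node == target:
            return math.exp(-cost)
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost - math.log(weight)
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return 0.0  # no association path exists


def assign_to_source(graph: Graph, value: str, sources: Sequence[str]) -> str:
    """Pick the clustering source point with the greatest correlation to `value`."""
    return max(sources, key=lambda s: path_correlation(graph, value, s))
```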
In some embodiments, after performing secondary clustering on the key attribute value in each cluster to obtain at least one sub-cluster, the server device may further perform information check on the at least one sub-cluster to ensure that the key attribute value in each sub-cluster belongs to the data object corresponding to the sub-cluster. The following takes the first sub-cluster of the at least one sub-cluster as an example for illustration. Wherein the first sub-cluster is any one of the at least one sub-cluster.
For the first sub-cluster, the server device may search, in the plurality of data records, non-key attribute values under the first non-key attribute respectively associated with key attribute values included in the first sub-cluster. The first non-key attribute is any one or more attributes except the key attribute in the attributes related to the attribute information of the data object. Further, the server device may determine whether the non-critical attribute values under the first non-critical attribute are consistent. If the judgment result is negative, that is, the non-key attribute values under the first non-key attribute are inconsistent, the associated key attribute values with inconsistent non-key attribute values under the first non-key attribute can be used as new source point attribute values, and the first sub-cluster is re-clustered until the non-key attribute values under the first non-key attribute contained in the first sub-cluster are consistent.
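The consistency check can be illustrated with the simplified sketch below. Instead of re-clustering from new source point attribute values as described above, it merely groups the key attribute values of a sub-cluster by their associated non-key attribute value, which is enough to show when a sub-cluster is inconsistent and which key attribute values would seed the re-clustering. The mapping `non_key_of` and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def check_sub_cluster(sub_cluster: Set[str], non_key_of: Dict[str, str]) -> List[Set[str]]:
    """Group key attribute values by their value under the first non-key attribute.

    A single returned group means the sub-cluster is consistent; several groups
    mean the non-key attribute values disagree and the sub-cluster should be split.
    """
    groups: Dict[str, Set[str]] = defaultdict(set)
    for key_value in sub_cluster:
        groups[non_key_of[key_value]].add(key_value)
    # Sort for a deterministic order of the resulting groups.
    return sorted(groups.values(), key=lambda group: sorted(group))
```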
In other embodiments, when the server device checks at least one sub-cluster, the server device may further obtain, from the plurality of data records, a plurality of non-critical attribute values under the second non-critical attribute and a plurality of non-critical attribute values under the third non-critical attribute respectively associated with each sub-cluster. Further, the server device may divide the plurality of non-key attribute values under the second non-key attribute into a plurality of information clusters by using an association relationship between the plurality of non-key attribute values under the second non-key attribute and the plurality of non-key attribute values under the third non-key attribute. The server device may determine, according to the membership relationship of the plurality of non-critical attribute values under the second non-critical attribute and the membership relationship of the plurality of non-critical attribute values under the third non-critical attribute in the plurality of data records, an association relationship between the plurality of non-critical attribute values under the second non-critical attribute and the plurality of non-critical attribute values under the third non-critical attribute, and the specific implementation manner thereof may refer to the relevant contents of the above embodiments, which is not described herein again.
Further, for each information cluster, the server device may calculate, for every two non-key attribute values under the second non-key attribute in that information cluster, the probability of each candidate result. The candidate results include: belonging to the same data object and not belonging to the same data object. That is, the server device calculates the probability that the two non-key attribute values belong to the same data object and the probability that they do not. Further, the server device may determine whether every two non-key attribute values under the second non-key attribute belong to the same data object according to these probabilities. Optionally, the server device may select, as the final decision result, the candidate result corresponding to the higher of the two probabilities.
Taking the first non-critical attribute value and the second non-critical attribute value under the second non-critical attribute contained in the first information cluster as an example, a specific implementation process of the belonging probability between each two non-critical attribute values and the candidate result is exemplarily described below. The first information cluster is any one of a plurality of information clusters, and the first non-key attribute value and the second non-key attribute value are any two non-key attribute values under the second non-key attribute contained in the first information cluster.
For the first non-key attribute value and the second non-key attribute value, the server device may search the plurality of data records for the other attribute values and behavior attributes respectively associated with them, where the other attribute values are the attribute values in the plurality of data records other than the plurality of key attribute values and the non-key attribute values under the third non-key attribute. Further, the server device may input the other attribute values and behavior attributes respectively associated with the first non-key attribute value and the second non-key attribute value into the decision model, to obtain the probability of each candidate result for the first non-key attribute value and the second non-key attribute value.
Further, if the probability that the first non-key attribute value and the second non-key attribute value belong to the same data object is greater than the probability that they do not belong to the same data object, it is determined that the first non-key attribute value and the second non-key attribute value belong to the same data object. Further, if the first non-key attribute value and the second non-key attribute value belong to the same data object, it is determined that the sub-cluster corresponding to the first non-key attribute value and the sub-cluster corresponding to the second non-key attribute value belong to the same data object. Furthermore, the server device can also merge sub-clusters belonging to the same data object. Optionally, the server device may obtain the behavior data associated with the merged sub-clusters, analyze the behavior characteristics of the corresponding data object, and recommend corresponding content to the data object according to the behavior characteristics of the data object.
In this embodiment, the server device may also train the decision model. For a specific implementation of training the decision model, reference may be made to the related contents of the above system embodiments, and details are not described herein again.
In some other embodiments, the third non-critical attribute may not exist in the plurality of data records, and the server device may obtain, from the plurality of data records, a plurality of non-critical attribute values under the second non-critical attribute respectively associated with each sub-cluster, and obtain, from the plurality of data records, behavior data respectively associated with the plurality of non-critical attribute values under the second non-critical attribute. Further, the server device may further obtain behavior characteristics of the data object corresponding to each sub-cluster according to behavior data associated with the plurality of non-key attribute values under the second non-key attribute; and determining whether the at least one sub-cluster belongs to the same data object according to the behavior characteristics of the data object corresponding to each sub-cluster.
Taking a first sub-cluster and a second sub-cluster of at least one sub-cluster as an example, an implementation process for determining whether a plurality of sub-clusters belong to the same data object according to behavior characteristics of data objects respectively corresponding to the plurality of sub-clusters will be exemplarily described below. Wherein the first sub-cluster and the second sub-cluster are any two sub-clusters of the plurality of sub-clusters.
For the first sub-cluster and the second sub-cluster, the server device may calculate a similarity between a behavior feature of a data object corresponding to the first sub-cluster and a behavior feature of a data object corresponding to the second sub-cluster, and if the calculated similarity is greater than or equal to a preset similarity threshold, the data objects corresponding to the first sub-cluster and the second sub-cluster are the same data object.
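The threshold comparison above can be illustrated with a short sketch. The application does not specify how behavior features are encoded or which similarity measure is used; the example below assumes that the behavior features are numeric vectors and uses cosine similarity, and the threshold value 0.8 is purely hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors (assumed feature encoding)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an empty behavior profile as dissimilar to everything
    return dot / (norm_u * norm_v)


def same_data_object(features_a, features_b, threshold=0.8):
    """Apply the rule from the text: similarity >= preset threshold means same data object."""
    return cosine_similarity(features_a, features_b) >= threshold
```

For example, `same_data_object([1, 0, 2], [2, 0, 4])` compares two vectors that point in the same direction, while `same_data_object([1, 0], [0, 1])` compares two orthogonal ones.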
Further, under the condition that the data objects corresponding to the first sub-cluster and the second sub-cluster are the same data object, the server device may merge the first sub-cluster and the second sub-cluster, add identification information to the data objects corresponding to the first sub-cluster and the second sub-cluster according to behavior characteristics respectively corresponding to the first sub-cluster and the second sub-cluster, and output the first sub-cluster and the second sub-cluster after the identification information is added.
Optionally, the server device may send the first sub-cluster and the second sub-cluster to which the identification information is added to the client device. Accordingly, the client device outputs the first sub-cluster and the second sub-cluster in a visualized manner. Optionally, the client device may also recommend the relevant content to the data objects corresponding to the first sub-cluster and the second sub-cluster according to the identification information of the first sub-cluster and the second sub-cluster.
Or, the server device may further display the first sub-cluster and the second sub-cluster after the identification information is added on the human-computer interaction interface. Further, the server-side device recommends the relevant content to the data objects corresponding to the first sub-cluster and the second sub-cluster according to the identification information of the first sub-cluster and the second sub-cluster.
Accordingly, embodiments of the present application also provide a computer readable storage medium storing computer instructions, which, when executed by one or more processors, cause the one or more processors to execute the steps of the data processing method.
Fig. 3a is a schematic structural diagram of another data processing system according to an embodiment of the present application. As shown in fig. 3a, the system comprises: a client device 30a and a server device 30b. For the description of the implementation forms of the client device and the server device and the communication modes of the client device and the server device, reference may be made to the related contents of the system embodiment shown in fig. 1a, which are not described herein again.
In this embodiment, the client device 30a may generate data records and may send a plurality of data records to the server device 30b. The plurality of data records include a plurality of first-type attribute values. In this embodiment, the first-type attribute is an attribute to be processed, which may be a key attribute in the above embodiments or a non-key attribute.
Further, in this embodiment, the plurality of data records may or may not include the second type of attribute value. The second type of attribute is a preset attribute, which may be a key attribute in the above embodiment or a non-key attribute. Preferably, the second type of attribute is the above-mentioned key attribute. In this embodiment, if the plurality of data records include a plurality of second-type attribute values, the server device 30b may cluster the plurality of first-type attribute values according to an association relationship between the plurality of first-type attribute values and the plurality of second-type attribute values, so as to obtain a plurality of information clusters. That is, the server device 30b clusters the first attribute values having an association relationship with the same second attribute value to obtain a plurality of information clusters. And the first type attribute value in each information cluster is associated with the second type attribute value corresponding to the information cluster.
Further, for each information cluster, the server device 30b may calculate the belonging probability between the different first-class attribute values and the candidate result, respectively. Wherein the candidate result comprises: belonging to the same data object and not belonging to the same data object. I.e. calculating the probability that different first type attribute values belong to the same data object and the probability that different first type attribute values do not belong to the same data object.
Further, the server device 30b may determine whether different first-class attribute values belong to the same data object according to the belonging probability between the different first-class attribute values and the candidate result in each information cluster. Optionally, the server device 30b may select a candidate result corresponding to a higher probability as a final decision result from the probabilities that different first-class attribute values belong to the same data object and the probabilities that different first-class attribute values do not belong to the same data object. That is, if the probability that different first-type attribute values belong to the same data object is greater than the probability that the different first-type attribute values do not belong to the same data object, it may be determined that the different first-type attribute values belong to the same data object; otherwise, it is determined that the different first-type attribute values do not belong to the same data object.
Further, the server device 30b may send the first-type attribute values belonging to the same data object to the client device 30a. Accordingly, the client device 30a receives the first-type attribute values belonging to the same data object and outputs them in a visualized manner.
In the embodiment of the present application, the server device 30b may send the first-type attribute values belonging to the same data object to the client device 30a in various forms. For example, as shown in fig. 3a, the server device 30b may use the form of connected subgraphs, aggregate the attribute values belonging to the same data object among the plurality of first-type attribute values, and send the resulting connected subgraphs to the client device 30a. Accordingly, the client device 30a presents the received connected subgraphs on a display screen. Each connected subgraph corresponds to one data object. In a connected subgraph, nodes represent first-type attribute values, and a connecting line between two nodes represents an association relationship between the two first-type attribute values. For another example, the server device 30b may also use the form of tables, aggregate the attribute values belonging to the same data object among the plurality of first-type attribute values, and send the resulting tables to the client device 30a. Accordingly, the client device 30a presents the received tables on the display screen. Each table corresponds to one data object, and the first-type attribute values located in the same row or the same column are attribute values that have an association relationship.
The data processing system provided by this embodiment can complete the horizontal clustering of attribute values belonging to the same data object under the same attribute according to the probability that different attribute values under the same attribute belong to, and do not belong to, the same data object. Because this clustering approach does not require a strong association among attribute values, it lowers the requirements on data attributes and improves the flexibility and generality of data clustering.
In the present embodiment, the plurality of data records sent by the client device 30a to the server device 30b may be data generated for a plurality of data objects, that is, the plurality of data records are a plurality of data records generated across screens. The data processing mode provided by the embodiment can realize cross-screen identification of data generated by the data object. In addition, these data records may relate to a number of areas. For example, in some application scenarios, these data records may relate to various fields of online shopping, video, games, sporting events, and so on. Therefore, the data processing method provided by the embodiment can also realize cross-domain identification of data generated by the data object.
In this embodiment, the server device 30b may determine the first-class attribute value associated with each second-class attribute value according to the association relationship between the first-class attribute value and the second-class attribute value, and use the first-class attribute value associated with each second-class attribute value as an information cluster, thereby obtaining a plurality of information clusters.
Optionally, when determining the first-class attribute value associated with each second-class attribute value, the server device 30b may determine the association relationship between the first-class attribute value and the second-class attribute value according to the membership relationship between the first-class attribute value and the second-class attribute value in the plurality of data records. For a specific implementation, reference may be made to the related contents in the data system shown in fig. 1a, which are not described herein again. Further, the server device 30b may pair the first type attribute value and the associated second type attribute value to obtain a plurality of attribute value relationship pairs, where one attribute value relationship pair includes one first type attribute value and one second type attribute value, and the first type attribute value and the second type attribute value in the attribute value relationship pair have an association relationship.
Optionally, the server device 30b may cluster the plurality of attribute value relationship pairs in a graph computation manner to obtain a plurality of connected subgraphs, where each connected subgraph corresponds to one information cluster. In each connected subgraph, a central node represents a second type of attribute value, an edge node represents a first type of attribute value, and a connecting line between the central node and the edge node represents that the corresponding first type of attribute value and the second type of attribute value have an incidence relation. For a specific implementation of clustering a plurality of attribute value relationship pairs in graph calculation, reference may be made to relevant contents of the above embodiments, which are not described herein again.
Alternatively, attribute-value relationship pairs may be represented in the form of a data structure of triples. Wherein a triple is composed of a first type of attribute value and a second type of attribute value and an associated weight between the two attribute values. Therefore, in each connected subgraph, the value on the connecting line between two nodes represents the association weight between two attribute values with association relationship. Wherein, the connected subgraph can be a heterogeneous undirected weighted graph structure.
Optionally, the server device 30b may calculate the association weight between the first-type attribute value and the second-type attribute value according to the common occurrence frequency of the first-type attribute value and the second-type attribute value having the association relationship in the same data record and the respective occurrence frequency of the first-type attribute value and the second-type attribute value in multiple data records, and a specific implementation manner thereof may refer to relevant contents of the foregoing embodiments, which is not described herein again.
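The weight computation can be sketched as below. The exact formula is given in an earlier embodiment that is not reproduced in this passage, so the sketch substitutes a Jaccard-style ratio (co-occurrence count divided by the number of records containing either value) purely as an assumption. The function name and the field names `account` and `ip` are hypothetical.

```python
from collections import Counter


def association_weights(records, first_attr, second_attr):
    """Return {(first_value, second_value): weight} using an ASSUMED Jaccard-style formula."""
    first_counts, second_counts, pair_counts = Counter(), Counter(), Counter()
    for record in records:
        first_value = record.get(first_attr)
        second_value = record.get(second_attr)
        if first_value is not None:
            first_counts[first_value] += 1
        if second_value is not None:
            second_counts[second_value] += 1
        if first_value is not None and second_value is not None:
            pair_counts[(first_value, second_value)] += 1
    weights = {}
    for (f, s), co in pair_counts.items():
        # Assumption: weight = co-occurrences / records containing either value.
        weights[(f, s)] = co / (first_counts[f] + second_counts[s] - co)
    return weights


records = [
    {"account": "a1", "ip": "ip1"},
    {"account": "a1", "ip": "ip1"},
    {"account": "a2", "ip": "ip1"},
    {"account": "a1", "ip": "ip2"},
]
weights = association_weights(records, "account", "ip")
```

Each key of `weights` together with its value corresponds to one triple of the kind described above: a first-type attribute value, a second-type attribute value, and the association weight between them.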
Optionally, for each information cluster, if the number of the first-class attribute values associated with the second-class attribute values in the information cluster is greater than a preset third number threshold, pruning may be performed on the information cluster. Optionally, according to the association weight between the first-class attribute value and the second-class attribute value in the information cluster, the first-class attribute value corresponding to the smaller association weight is cut off, so that the number of the remaining first-class attribute values in the information cluster is smaller than or equal to a preset third number threshold. Therefore, the edges with low weight can be cleaned, the attribute value relation pairs with weak relative relation are eliminated, and the complexity of the graph is reduced.
Optionally, the server device 30b may perform anti-cheating and cleaning operations on low-weight attribute value pairs according to the application scenario. Based on this, if the number of first-type attribute values in an information cluster is greater than a preset fourth number threshold, the server device 30b may also clean the information cluster, that is, prune the entire information cluster. Preferably, the fourth number threshold is greater than the third number threshold. In practical applications, an ordinary household generally shares one IP address among several natural persons, that is, one IP address is associated with a limited number of accounts, whereas a cheating group may use one IP address for hundreds or even thousands of accounts. Based on this, the maximum number of accounts that can be associated with one IP address (i.e., the fourth number threshold) can be preset; if the number of accounts associated with the IP address in an information cluster is greater than the fourth number threshold, the information cluster can be regarded as a cheating information cluster and pruned in its entirety.
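The two pruning rules (the third and fourth number thresholds) can be combined in one small sketch. It assumes that an information cluster is represented as a dict mapping each first-type attribute value to its association weight. The function name and the threshold values used in the example are hypothetical.

```python
def prune_cluster(weighted_members, third_threshold, fourth_threshold):
    """Prune one information cluster given as {first_type_value: association_weight}."""
    # Anti-cheating rule: too many members means the whole cluster is discarded.
    if len(weighted_members) > fourth_threshold:
        return {}
    # Size rule: keep only the members with the largest association weights.
    if len(weighted_members) > third_threshold:
        ranked = sorted(weighted_members.items(), key=lambda item: item[1], reverse=True)
        return dict(ranked[:third_threshold])
    return dict(weighted_members)


cluster = {"a1": 0.9, "a2": 0.5, "a3": 0.1}
kept = prune_cluster(cluster, third_threshold=2, fourth_threshold=5)
```

With these values the lowest-weight member `a3` is dropped. If `fourth_threshold` were set to 2 instead, the same cluster would be treated as a cheating cluster and removed entirely.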
Further, after determining the first-type attribute values associated with each second-type attribute value, the server device 30b may use a Cartesian product to associate the first-type attribute values that are associated with the same second-type attribute value, and generate first-type attribute value relationship pairs suspected to belong to the same data object, where each first-type attribute value relationship pair includes two different first-type attribute values.
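Pair generation within each information cluster can be sketched as follows. Since each pair must contain two different values, the sketch uses `itertools.combinations`, which is equivalent to the Cartesian product of a cluster with itself after self-pairs and mirrored duplicates are removed. The function name and the sample data are hypothetical.

```python
from itertools import combinations


def candidate_pairs(information_clusters):
    """Return the set of unordered pairs of first-type values that share a second-type value."""
    pairs = set()
    for members in information_clusters.values():
        # Sorting gives each unordered pair one canonical representation.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


clusters = {"ip1": {"a1", "a2", "a3"}, "ip2": {"a3", "a4"}}
pairs = candidate_pairs(clusters)
```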
Further, for each information cluster, the server device 30b may calculate the belonging probability between different first-type attribute values and the candidate result in the information cluster, and determine whether the different first-type attribute values belong to the same data object according to the belonging probability between the different first-type attribute values and the candidate result in the information cluster.
The following takes the first-type attribute values A and B in a first cluster of the plurality of information clusters as an example to describe how the server device 30b calculates the belonging probability between different first-type attribute values and the candidate results. The first cluster is any one of the plurality of information clusters, and the first-type attribute values A and B are any two of the first-type attribute values contained in the first cluster.
In this embodiment, the server device 30b obtains, from the plurality of data records, the other-type attribute values and behavior data respectively associated with the first-type attribute values A and B, where the other-type attribute values are attribute information other than the first-type attribute values. That is, the other-type attribute values may be second-type attribute values, or may be attribute information other than the first-type attribute values and the second-type attribute values. Further, the server device 30b inputs the other-type attribute values and behavior data respectively associated with the first-type attribute values A and B into the decision model, to obtain the belonging probability between the first-type attribute values A and B and each candidate result.
Further, if the probability that the first-type attribute values A and B belong to the same data object is greater than the probability that they do not belong to the same data object, it is determined that the first-type attribute values A and B belong to the same data object. Optionally, the server device 30b may obtain the behavior data associated with the first-type attribute values A and B, analyze the behavior characteristics of the corresponding data object, and recommend corresponding content to the data object according to the behavior characteristics of the data object.
In the embodiment of the present application, the server device 30b may also train the decision model. Optionally, the server device 30b may take minimizing a loss function as the training target and perform model training to obtain the decision model, using as positive samples the sample attribute values and sample behavior data respectively associated with different first-type attribute values that are known to belong to the same data object, and using as negative samples the sample attribute values and sample behavior data respectively associated with different first-type attribute values that are known not to belong to the same data object. The loss function is determined according to the probability obtained by model training and the actual probabilities of the positive and negative samples. Optionally, the actual probabilities of the positive and negative samples are 1 and 0, respectively. Optionally, the loss function can be expressed as $\arg\min \sum_i L(y_i, f(x_i))$, where $y_i$ is the actual probability of a positive or negative sample, $f(x_i)$ is the belonging probability obtained by model training, and $x_i$ is a positive or negative sample. Further, $L(y, f(x))$ may be expressed as $L(y, f(x)) = \log(1 + \exp(-y f(x)))$.
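The per-sample loss L(y, f(x)) = log(1 + exp(-y f(x))) and its sum over samples can be evaluated with the short sketch below. It only computes the formula as written and leaves the label convention and the model output f(x) to the caller. The function names are hypothetical, and no particular training procedure is implied.

```python
import math


def logistic_loss(y, score):
    """Per-sample loss log(1 + exp(-y * f(x))), where `score` stands for f(x)."""
    return math.log1p(math.exp(-y * score))


def total_loss(labels, scores):
    """Sum of per-sample losses; this is the quantity that training seeks to minimize."""
    return sum(logistic_loss(y, s) for y, s in zip(labels, scores))
```

For a sample with label 1, a larger score gives a smaller loss, so `logistic_loss(1, 2.0)` is smaller than `logistic_loss(1, -2.0)`.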
Alternatively, the decision model may be a Wide & Deep model, a GBDT model, an LR model, or an RF model, etc., but is not limited thereto.
In some application scenarios, the plurality of data records may not include the second-type attribute value, and the server device 30b may further obtain behavior data associated with the plurality of first-type attribute values from the plurality of data records, and obtain behavior characteristics of the data object corresponding to the plurality of first-type attribute values according to the behavior data associated with the plurality of first-type attribute values. Further, the server device 30b may determine whether the plurality of first-type attribute values belong to the same data object according to the behavior characteristics of the data object corresponding to the plurality of first-type attribute values, respectively.
The following takes the first-class attribute values C and D of the plurality of first-class attribute values as an example, and an exemplary description is given to determine whether the plurality of first-class attribute values belong to the same data object. The first-class attribute values C and D are any two attribute values in the plurality of first-class attribute values.
For the first-class attribute values C and D, the server device 30b may calculate similarity of behavior characteristics of the data objects corresponding to the first-class attribute values C and D, and further determine that the first-class attribute values C and D belong to the same data object if the similarity of the behavior characteristics of the data objects corresponding to the first-class attribute values C and D is greater than or equal to a set similarity threshold. Correspondingly, if the similarity of the behavior features of the data objects corresponding to the first-class attribute values C and D is smaller than the set similarity threshold, it is determined that the first-class attribute values C and D do not belong to the same data object.
Further, after identifying the first-class attribute value relationship pair that actually belongs to the same data object from the first-class attribute value relationship pair suspected to belong to the same data object in the first information cluster, the server device 30b may also aggregate the first-class attribute values in the first-class attribute value relationship pair that actually belongs to the same data object by using a graph computation method, that is, aggregate the first-class attribute values that belong to the same data object, to obtain a connected subgraph corresponding to each data object. In the connected subgraph, each node represents a first class attribute value.
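Aggregating confirmed pairs into one connected subgraph per data object can be sketched with a union-find structure. This is one common way to compute connected components and is offered as an assumption, since the application does not name a specific graph computation algorithm. The function name and the sample pairs are hypothetical.

```python
def connected_components(pairs):
    """Group values linked by confirmed same-object pairs into connected components."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps trees shallow
            node = parent[node]
        return node

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    # Sorted output makes the result deterministic.
    return sorted(sorted(component) for component in components.values())


confirmed = [("a1", "a2"), ("a2", "a3"), ("a4", "a5")]
objects = connected_components(confirmed)
```

Each inner list in `objects` plays the role of one connected subgraph, that is, the first-type attribute values attributed to one data object.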
In order to more clearly understand the above data processing process, an example will be described below in which the data object is a natural person, the first type attribute is an account, and the second type attribute is an IP address.
Assuming that the plurality of data records contain a plurality of account numbers and a plurality of IP addresses, in this embodiment, the association relationship between the plurality of account numbers and the plurality of IP addresses may be determined according to the membership relationship of the plurality of account numbers and the plurality of IP addresses in the plurality of data records. Further, the server device 30b may pair the account numbers with their associated IP addresses, respectively, to generate a plurality of account number-IP relationship pairs.
Optionally, the server device 30b may cluster a plurality of account-IP relationship pairs in a graph computation manner, so as to obtain a plurality of connected subgraphs as shown in fig. 3 b. The central node of each connected subgraph represents an IP address, the edge nodes represent account numbers, and the connecting lines between the central nodes and the edge nodes represent the incidence relation between the IP address and the account numbers. In fig. 3b, the number of IP addresses is 2 (IP1 and IP2) and the number of accounts is 8 (accounts 1-8), which are only used for illustration and not limited.
Further, for the accounts in each connected subgraph, the server device 30b may use a Cartesian product to associate the accounts in the same connected subgraph, so as to generate suspected same-person account pairs. Further, the server device 30b may calculate the probability that each suspected same-person account pair belongs to the same natural person and the probability that it does not, and use the candidate result with the higher probability as the determination result of whether the suspected same-person account pair belongs to the same natural person. That is, if the probability that a suspected same-person account pair belongs to the same natural person is greater than the probability that it does not, the pair is determined to belong to the same natural person; otherwise, the pair is determined not to belong to the same natural person.
In addition to the above client-server (C/S) system architecture, the data processing method provided by the embodiment of the present application can be autonomously performed by a server device. The server device is a computer device with computing, communication and other functions located at a server. The server device may be a computer or a server located at the server. For example, for an online shopping website, a video website, a game website, etc., the server device may be a website server, or may be a cloud server array, etc. In this embodiment, the server device may store the data record of the accessing user. Based on this, the server device can obtain a plurality of data records from the data records it maintains. In this embodiment, the plurality of data records may or may not include the second type of attribute value. The second type of attribute is a preset attribute, which may be a key attribute in the above embodiment or a non-key attribute. Preferably, the second type of attribute is the above-mentioned key attribute. In this embodiment, if the plurality of data records include a plurality of second-type attribute values, the server device may cluster the plurality of first-type attribute values according to an association relationship between the plurality of first-type attribute values and the plurality of second-type attribute values, so as to obtain a plurality of information clusters. Namely, the server device clusters the first attribute values having an association relationship with the same second attribute value to obtain a plurality of information clusters. And the first type attribute value in each information cluster is associated with the second type attribute value corresponding to the information cluster. Further, for each information cluster, the server device may calculate the belonging probability between different first-class attribute values and the candidate result. 
Wherein the candidate result comprises: belonging to the same data object and not belonging to the same data object. I.e. calculating the probability that different first type attribute values belong to the same data object and the probability that different first type attribute values do not belong to the same data object. Further, the server device may determine whether different first-class attribute values belong to the same data object according to the probability of belonging between the different first-class attribute values and the candidate result in each information cluster. For a specific implementation process of the data processing performed by the server device, reference may be made to relevant contents in the system embodiment shown in fig. 3a, which is not described herein again.
In addition to the above system embodiments, the embodiments of the present application also provide a data processing method, and the data processing method provided by the embodiments of the present application is exemplarily described below from the perspective of a server device.
Fig. 4 is a schematic flowchart of another data processing method according to an embodiment of the present application. As shown in fig. 4, the method includes:
401. Obtaining a plurality of data records, wherein the plurality of data records include a plurality of first-type attribute values.
402. If the plurality of data records include a plurality of second-type attribute values, clustering the plurality of first-type attribute values according to the association relationship between the plurality of first-type attribute values and the plurality of second-type attribute values to obtain a plurality of information clusters.
403. For each information cluster, calculating the probability of belonging between different first-type attribute values and each candidate result; wherein the candidate results include: belonging to the same data object and not belonging to the same data object.
404. Determining whether the different first-type attribute values belong to the same data object according to the probability of belonging between the different first-type attribute values and the candidate results.
In this embodiment, the first type attribute is an attribute to be processed, which may be a key attribute in the above embodiment or a non-key attribute.
Further, in this embodiment, the plurality of data records may or may not include second-type attribute values. The second-type attribute is a preset attribute, which may be a key attribute in the above embodiment or a non-key attribute. Preferably, the second-type attribute is the above-mentioned key attribute. If the plurality of data records include a plurality of second-type attribute values, in step 402, the plurality of first-type attribute values may be clustered according to the association relationship between the plurality of first-type attribute values and the plurality of second-type attribute values to obtain a plurality of information clusters. That is, the first-type attribute values that are associated with the same second-type attribute value are clustered together, so that the first-type attribute values in each information cluster are all associated with the second-type attribute value corresponding to that information cluster.
Further, for each information cluster, the server device may calculate the probability of belonging between different first-type attribute values and each candidate result. The candidate results include: belonging to the same data object and not belonging to the same data object. In other words, the server device calculates the probability that different first-type attribute values belong to the same data object and the probability that they do not belong to the same data object.
Further, the server device 30b may determine whether different first-class attribute values belong to the same data object according to the belonging probability between the different first-class attribute values and the candidate result in each information cluster. Optionally, the server device 30b may select a candidate result corresponding to a higher probability as a final decision result from the probabilities that different first-class attribute values belong to the same data object and the probabilities that different first-class attribute values do not belong to the same data object. That is, if the probability that different first-type attribute values belong to the same data object is greater than the probability that the different first-type attribute values do not belong to the same data object, it may be determined that the different first-type attribute values belong to the same data object; otherwise, it is determined that the different first-type attribute values do not belong to the same data object.
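The decision rule described above can be illustrated with a minimal sketch. The function name `decide_pairs` and the input layout (a dictionary mapping a pair of first-type attribute values to its two probabilities) are assumptions made for illustration only and do not appear in the embodiment.

```python
def decide_pairs(pair_probabilities):
    """Return True for a pair when 'same data object' is the more probable result.

    pair_probabilities maps (value_a, value_b) -> (p_same, p_different).
    This layout is a hypothetical choice for the sketch.
    """
    decisions = {}
    for pair, (p_same, p_different) in pair_probabilities.items():
        # A pair is accepted only if p_same is strictly greater;
        # every other case falls into the "otherwise" branch.
        decisions[pair] = p_same > p_different
    return decisions


decisions = decide_pairs({
    ("A", "B"): (0.8, 0.2),
    ("A", "C"): (0.3, 0.7),
    ("B", "C"): (0.5, 0.5),
})
```

In this sketch a tie is treated as "not the same data object", which follows the "otherwise" branch of the rule stated above.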
In this embodiment, the horizontal clustering of the attribute values belonging to the same data object under the same attribute can be completed according to the probabilities that different attribute values under the same attribute belong to, and do not belong to, the same data object. Because this clustering approach does not require a strong association relationship among attribute values, it lowers the requirements on the data attributes and helps improve the flexibility and universality of data clustering.
In this embodiment, the server device may output the first type attribute values belonging to the same data object. In some embodiments, the server device may present the first type of attribute values belonging to the same data object on its display screen. In other embodiments, the server device may send the first type attribute information belonging to the same data object to the client device. Accordingly, the client device 30a receives the first-type attribute information belonging to the same data object and outputs the first-type attribute information belonging to the same data object in a visualized manner.
In some embodiments, an optional implementation of step 402 is: determining the first-type attribute values associated with each second-type attribute value according to the association relationship between the first-type attribute values and the second-type attribute values, and taking the first-type attribute values associated with each second-type attribute value as one information cluster, thereby obtaining a plurality of information clusters.
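As a minimal sketch of this grouping, the following assumes that each data record is a dictionary and that two values are associated when they appear in the same record. The attribute names `account` and `device_id` and the function name are hypothetical.

```python
from collections import defaultdict


def build_information_clusters(records, first_attr, second_attr):
    """Group first-type values by the second-type value they co-occur with."""
    clusters = defaultdict(set)
    for record in records:
        first = record.get(first_attr)
        second = record.get(second_attr)
        # Only a record carrying both values establishes an association.
        if first is not None and second is not None:
            clusters[second].add(first)
    return dict(clusters)


# Hypothetical records: "account" plays the first-type attribute,
# "device_id" the second-type attribute.
records = [
    {"device_id": "d1", "account": "u1"},
    {"device_id": "d1", "account": "u2"},
    {"device_id": "d2", "account": "u3"},
    {"account": "u4"},  # no second-type value, so it joins no cluster
]
clusters = build_information_clusters(records, "account", "device_id")
```

Each key of `clusters` corresponds to one second-type attribute value and each value is the information cluster of first-type attribute values associated with it.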
Optionally, when the server device determines the first-class attribute value associated with each second-class attribute value, the server device may determine an association relationship between the first-class attribute value and the second-class attribute value according to a membership relationship of the first-class attribute value and the second-class attribute value in the plurality of data records. For a specific implementation, reference may be made to the related contents in the data system shown in fig. 1a, which are not described herein again. Further, the server device may pair the first type attribute value with a second type attribute value associated therewith to obtain a plurality of attribute value relationship pairs, where one attribute value relationship pair includes one first type attribute value and one second type attribute value, and the first type attribute value and the second type attribute value in the attribute value relationship pair have an association relationship.
Optionally, the server device may cluster the plurality of attribute value relationship pairs in a graph computation manner to obtain a plurality of connected subgraphs, where each connected subgraph corresponds to one information cluster. In each connected subgraph, a central node represents a second type of attribute value, an edge node represents a first type of attribute value, and a connecting line between the central node and the edge node represents that the corresponding first type of attribute value and the second type of attribute value have an incidence relation. For a specific implementation of clustering a plurality of attribute value relationship pairs in graph calculation, reference may be made to relevant contents of the above embodiments, which are not described herein again.
Optionally, the attribute value relationship pairs may be represented as triples, where a triple is composed of a first-type attribute value, a second-type attribute value, and the association weight between the two attribute values. Accordingly, in each connected subgraph, the value on the connecting line between two nodes represents the association weight between the two associated attribute values. The connected subgraph may be a heterogeneous undirected weighted graph structure.
Optionally, the server device may calculate the association weight between the first type attribute value and the second type attribute value according to a common occurrence frequency of the first type attribute value and the second type attribute value having the association relationship in the same data record and respective occurrence frequencies of the first type attribute value and the second type attribute value in multiple data records, and a specific implementation manner thereof may refer to relevant contents of the above embodiments, which is not described herein again.
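The passage above names the inputs of the association weight (the co-occurrence frequency and the individual occurrence frequencies) but not the formula, which is given elsewhere in the application. The sketch below therefore substitutes a Jaccard-style ratio purely as an assumption; the function and attribute names are likewise hypothetical.

```python
from collections import Counter


def association_weights(records, first_attr, second_attr):
    """Weight each (first, second) pair from co-occurrence and individual counts.

    Assumed formula (not taken from the source):
        weight = co / (count_first + count_second - co)
    """
    pair_count = Counter()
    first_count = Counter()
    second_count = Counter()
    for record in records:
        first = record.get(first_attr)
        second = record.get(second_attr)
        if first is not None:
            first_count[first] += 1
        if second is not None:
            second_count[second] += 1
        if first is not None and second is not None:
            pair_count[(first, second)] += 1
    return {
        (first, second): co / (first_count[first] + second_count[second] - co)
        for (first, second), co in pair_count.items()
    }


records = [
    {"account": "u1", "device_id": "d1"},
    {"account": "u1", "device_id": "d1"},
    {"account": "u1", "device_id": "d2"},
    {"account": "u2", "device_id": "d1"},
]
weights = association_weights(records, "account", "device_id")
```

Each entry of `weights`, together with its key, forms one (first-type value, second-type value, weight) triple of the kind described above.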
Optionally, for each information cluster, if the number of first-type attribute values associated with the second-type attribute value in the information cluster is greater than a preset third number threshold, the information cluster may be pruned. Optionally, according to the association weights between the first-type attribute values and the second-type attribute value in the information cluster, the first-type attribute values with the smaller association weights are cut off, so that the number of first-type attribute values remaining in the information cluster is less than or equal to the preset third number threshold. In this way, low-weight edges are cleaned up, weakly associated attribute value relationship pairs are eliminated, and the complexity of the graph is reduced.
Alternatively, the server device may perform anti-cheating and cleaning operations on the low-weight attribute value pairs according to the application scenario. On this basis, if the number of first-type attribute values in a certain information cluster is greater than a preset fourth number threshold, the server device may also clean the information cluster, that is, remove the information cluster. Preferably, the fourth number threshold is greater than the third number threshold.
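The weight-based pruning can be sketched as follows, assuming an information cluster is held as a dictionary from first-type attribute value to its association weight with the cluster's second-type attribute value. The function name and the parameter `max_size` (standing in for the third number threshold) are hypothetical.

```python
def prune_cluster(weighted_members, max_size):
    """Keep at most max_size first-type values, dropping the lowest weights.

    weighted_members maps first-type value -> association weight.
    """
    if len(weighted_members) <= max_size:
        return dict(weighted_members)
    # Sort by weight, highest first, and keep the strongest associations.
    ranked = sorted(weighted_members.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:max_size])


pruned = prune_cluster({"u1": 0.9, "u2": 0.1, "u3": 0.5}, max_size=2)
```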
Further, after the first-type attribute values associated with each second-type attribute value are determined, the first-type attribute values associated with the same second-type attribute value may be paired with one another by using a Cartesian product, generating first-type attribute value relationship pairs suspected of belonging to the same data object, where each first-type attribute value relationship pair includes two different first-type attribute values.
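A minimal sketch of the pair generation is shown below. It assumes that the order within a pair does not matter, so the Cartesian product of a cluster with itself is reduced to unordered pairs of distinct values; the function name is hypothetical.

```python
from itertools import combinations


def candidate_pairs(information_clusters):
    """Build unordered pairs of distinct first-type values within each cluster.

    information_clusters maps second-type value -> set of first-type values.
    """
    pairs = set()
    for members in information_clusters.values():
        # combinations() over the sorted members yields each unordered pair once,
        # i.e. the self-product of the cluster without (x, x) and mirrored pairs.
        pairs.update(combinations(sorted(members), 2))
    return pairs


pairs = candidate_pairs({"d1": {"u1", "u2", "u3"}, "d2": {"u4"}})
```

A cluster with a single first-type attribute value, such as `d2` here, produces no candidate pair.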
Further, for each information cluster, the server device may calculate the belonging probability between different first-class attribute values and the candidate result in the information cluster, and determine whether the different first-class attribute values belong to the same data object according to the belonging probability between the different first-class attribute values and the candidate result in the information cluster.
The following takes the first-type attribute values A and B in a first cluster of the plurality of information clusters as an example to describe the process of calculating the probability of belonging between different first-type attribute values and the candidate results. The first cluster is any one of the plurality of information clusters, and the first-type attribute values A and B are any two of the first-type attribute values contained in the first cluster.
The server device acquires, from the plurality of data records, the other-type attribute values and the behavior data respectively associated with the first-type attribute values A and B, where the other-type attribute values are attribute information other than the first-type attribute values. That is, the other-type attribute values may be second-type attribute values, or may be attribute information other than the first-type and second-type attribute values. Further, the other-type attribute values and behavior data respectively associated with the first-type attribute values A and B are input into a decision model to obtain the probability of belonging between the first-type attribute values A and B and each candidate result.
Further, if the probability that the first-type attribute values A and B belong to the same data object is greater than the probability that they do not belong to the same data object, it is determined that the first-type attribute values A and B belong to the same data object. Optionally, the behavior data associated with the first-type attribute values A and B may be acquired, the behavior characteristics of the corresponding data object may be analyzed, and corresponding content may be recommended to the data object according to its behavior characteristics.
In the embodiment of the present application, the decision model may also be trained. Optionally, with minimizing a loss function as the training target, model training is performed to obtain the decision model, where sample attribute values and sample behavior data that are respectively associated with different first-type attribute values known to belong to the same data object are taken as positive samples, and sample attribute values and sample behavior data that are respectively associated with different first-type attribute values known not to belong to the same data object are taken as negative samples. The loss function is determined according to the probabilities output during model training and the actual probabilities of the positive and negative samples.
Optionally, the decision model may be a Wide & Deep model, a gradient boosting decision tree (GBDT) model, a logistic regression (LR) model, a random forest (RF) model, or the like, but is not limited thereto.
In some application scenarios, the plurality of data records may not include second-type attribute values. In this case, the behavior data respectively associated with the plurality of first-type attribute values may be obtained from the plurality of data records, and the behavior characteristics of the data objects respectively corresponding to the plurality of first-type attribute values may be derived from that behavior data. Further, whether the plurality of first-type attribute values belong to the same data object can be determined according to the behavior characteristics of the data objects respectively corresponding to the plurality of first-type attribute values.
The following takes the first-class attribute values C and D of the plurality of first-class attribute values as an example, and an exemplary description is given to determine whether the plurality of first-class attribute values belong to the same data object. The first-class attribute values C and D are any two attribute values in the plurality of first-class attribute values.
For the first-type attribute values C and D, the similarity of the behavior characteristics of the data objects corresponding to C and D is calculated. If the similarity is greater than or equal to a set similarity threshold, it is determined that the first-type attribute values C and D belong to the same data object. Correspondingly, if the similarity is smaller than the set similarity threshold, it is determined that the first-type attribute values C and D do not belong to the same data object.
Further, after the first-type attribute value relationship pairs that really belong to the same data object are identified from the first-type attribute value relationship pairs suspected of belonging to the same data object in the first information cluster, the first-type attribute values in the pairs that really belong to the same data object can be aggregated by using a graph computation method. That is, the first-type attribute values belonging to the same data object are aggregated to obtain a connected subgraph corresponding to each data object. In the connected subgraph, each node represents a first-type attribute value.
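The similarity test can be sketched as follows. The embodiment does not name a similarity measure, so cosine similarity over numeric behavior feature vectors is used here as an assumption, and the threshold value 0.8 is an arbitrary illustrative choice.

```python
import math


def cosine_similarity(features_c, features_d):
    """Cosine similarity of two equal-length numeric feature vectors."""
    dot = sum(x * y for x, y in zip(features_c, features_d))
    norm_c = math.sqrt(sum(x * x for x in features_c))
    norm_d = math.sqrt(sum(y * y for y in features_d))
    if norm_c == 0 or norm_d == 0:
        # A vector with no behavior signal is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_c * norm_d)


def same_object_by_behavior(features_c, features_d, threshold=0.8):
    """Apply the 'greater than or equal to the set threshold' rule."""
    return cosine_similarity(features_c, features_d) >= threshold


similar = same_object_by_behavior([3.0, 0.0, 1.0], [3.0, 0.0, 1.0])
dissimilar = same_object_by_behavior([1.0, 0.0], [0.0, 1.0])
```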
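One way to aggregate confirmed pairs into one connected subgraph per data object is a union-find (disjoint-set) structure. The embodiment only states that a graph computation method is used, so this particular technique and the function name `aggregate_pairs` are assumptions for illustration.

```python
def aggregate_pairs(confirmed_pairs):
    """Merge confirmed pairs into groups; each group stands for one data object."""
    parent = {}

    def find(value):
        parent.setdefault(value, value)
        # Path halving keeps the trees shallow.
        while parent[value] != value:
            parent[value] = parent[parent[value]]
            value = parent[value]
        return value

    for left, right in confirmed_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = {}
    for value in list(parent):
        groups.setdefault(find(value), set()).add(value)
    # Sorted output makes the result deterministic.
    return sorted(sorted(group) for group in groups.values())


objects = aggregate_pairs([("u1", "u2"), ("u2", "u3"), ("u4", "u5")])
```

Here `u1` and `u3` end up in the same group even though they were never paired directly, because both were confirmed to belong with `u2`.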
Accordingly, embodiments of the present application also provide a computer readable storage medium storing computer instructions, which, when executed by one or more processors, cause the one or more processors to execute the steps of the data processing method.
It should be noted that the steps of the methods provided in the above embodiments may all be executed by the same device, or may be executed by different devices. For example, the execution subject of steps 401 and 402 may be device A; for another example, the execution subject of step 401 may be device A, and the execution subject of step 402 may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the sequence numbers of the operations, such as 401, 402, etc., are merely used to distinguish various operations, and the sequence numbers themselves do not represent any execution order. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel.
Fig. 5 is a schematic structural diagram of a server device according to an embodiment of the present application. As shown in fig. 5, the server device includes: a memory 50a and a processor 50b. The memory 50a is used to store a computer program.
The processor 50b is coupled to the memory 50a and executes the computer program to: acquire a plurality of data records, wherein the plurality of data records include a plurality of key attribute values under a plurality of key attributes, and each key attribute value belongs to one data object at a time; identify the attribute values belonging to the same data object among the plurality of key attribute values according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values; and output the attribute values belonging to the same data object among the plurality of key attribute values.
In some embodiments, the server device includes a communication component 50c. Accordingly, when obtaining the plurality of data records, the processor 50b is specifically configured to: receive, through the communication component 50c, the plurality of data records sent by the client device. Accordingly, when outputting the attribute values belonging to the same data object among the plurality of key attribute values, the processor 50b is specifically configured to: send the attribute values belonging to the same data object among the plurality of key attribute values to the client device through the communication component 50c, so that the client device can output them in a visual manner.
Further, when sending the attribute values belonging to the same data object among the plurality of key attribute values to the client device, the processor 50b is specifically configured to: send these attribute values to the client device in the form of connected subgraphs through the communication component 50c. In a connected subgraph, nodes represent key attribute values, and the connecting line between two nodes represents the association relationship between the two key attribute values.
Optionally, the processor 50b is further configured to: before the attribute values belonging to the same data object in the plurality of key attribute values are sent to the client in the form of connected subgraphs through the communication component 50c, behavior data of the data object corresponding to each connected subgraph is searched in a plurality of data records; acquiring behavior characteristics of the data object corresponding to each connected subgraph according to the behavior data of the data object corresponding to each connected subgraph; and adding the behavior characteristics of the data object corresponding to each connected subgraph as the identification information of the data object corresponding to each connected subgraph.
In other embodiments, the server device includes a display screen 50d. Accordingly, when outputting the attribute values belonging to the same data object among the plurality of key attribute values, the processor 50b is specifically configured to: display these attribute values on the display screen 50d in the form of connected subgraphs.
In still other embodiments, the memory 50a stores a pre-established hierarchical relationship among a number of key attributes. Accordingly, the processor 50b is further configured to: before identifying the attribute values belonging to the same data object among the plurality of key attribute values, extract the hierarchical relationship among the plurality of key attributes from the pre-established hierarchical relationship among the number of key attributes; and analyze the association relationship among the plurality of key attribute values based on the membership relationship between the plurality of key attribute values and the plurality of data records.
Further, the processor 50b is further configured to: acquiring a historical data record in a specified historical period before extracting the hierarchical relationship among the plurality of key attributes from the pre-established hierarchical relationship among the plurality of key attributes, wherein the historical data record comprises historical key attribute values under the plurality of key attributes; and establishing a hierarchical relationship among the plurality of key attributes according to the number of historical key attribute values under each key attribute in the plurality of key attributes.
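The passage above says the hierarchy is built from the number of historical key attribute values under each key attribute but does not state the ordering rule here. The sketch below counts distinct values and, as an explicit assumption, places attributes with fewer distinct values at a higher level (level 1 is highest); the flag `fewer_is_higher` reverses that choice. All names are hypothetical.

```python
def build_attribute_hierarchy(history, key_attrs, fewer_is_higher=True):
    """Assign a level to each key attribute from its number of distinct values.

    The ordering direction is an assumption of this sketch, not taken
    from the source.
    """
    counts = {
        attr: len({rec[attr] for rec in history if rec.get(attr) is not None})
        for attr in key_attrs
    }
    # Sort by count, ascending or descending depending on the assumed direction.
    ordered = sorted(key_attrs, key=lambda attr: counts[attr], reverse=not fewer_is_higher)
    return {attr: level for level, attr in enumerate(ordered, start=1)}


history = [
    {"account": "a", "phone": "1", "device": "x"},
    {"account": "a", "phone": "1", "device": "y"},
    {"account": "a", "phone": "2", "device": "z"},
]
levels = build_attribute_hierarchy(history, ["device", "phone", "account"])
```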
Optionally, when analyzing the association relationship among the plurality of key attribute values, the processor 50b is specifically configured to perform, for a first key attribute value and a second key attribute value, at least one of the following judgment operations: judging whether the first key attribute value and the second key attribute value appear in the same data record; judging whether one of the first key attribute value and the second key attribute value appears in the same data record as a key attribute value associated with the other; and judging whether a key attribute value associated with the first key attribute value and a key attribute value associated with the second key attribute value appear in the same data record. If the result of the at least one judgment operation is yes, it is determined that the first key attribute value and the second key attribute value have an association relationship. The first key attribute value and the second key attribute value are any two of the plurality of key attribute values.
In still other embodiments, when identifying the attribute values belonging to the same data object among the plurality of key attribute values, the processor 50b is specifically configured to: perform initial clustering on the plurality of key attribute values according to the association relationship among them to obtain a plurality of clusters, wherein each cluster contains key attribute values that have an association relationship; for each cluster, perform secondary clustering on the key attribute values in the cluster according to the hierarchical relationship among the key attributes to obtain at least one sub-cluster; and take the key attribute values in the same sub-cluster as the attribute values belonging to the same data object.
Further, when initially clustering the plurality of key attribute values, the processor 50b is specifically configured to: pairing two key attribute values with an incidence relation in the key attribute values to obtain a plurality of attribute value relation pairs; constructing an adjacency list of each key attribute value in the multiple attribute value relationship pairs according to the incidence relationship among the key attribute values in the multiple attribute value relationship pairs; and traversing the adjacency list of each key attribute value in the plurality of attribute value relation pairs to obtain a plurality of connected subgraphs, wherein each connected subgraph is a cluster. In each connected subgraph, the nodes represent key attribute values, and the connecting line between the two nodes represents the incidence relation between the two key attribute values.
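The initial clustering described above (pairs, then adjacency lists, then traversal) can be sketched with a breadth-first traversal. The string encoding of key attribute values such as `"phone:1"` and the function name are hypothetical.

```python
from collections import defaultdict, deque


def connected_subgraphs(relationship_pairs):
    """Build adjacency lists from the pairs, then traverse them into clusters."""
    adjacency = defaultdict(set)
    for left, right in relationship_pairs:
        adjacency[left].add(right)
        adjacency[right].add(left)

    visited = set()
    clusters = []
    for start in sorted(adjacency):
        if start in visited:
            continue
        # Breadth-first traversal collects every value reachable from start.
        visited.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
        clusters.append(sorted(component))
    return clusters


clusters = connected_subgraphs([
    ("phone:1", "acct:a"),
    ("acct:a", "dev:x"),
    ("phone:2", "acct:b"),
])
```

Each returned list is one connected subgraph, that is, one cluster of key attribute values that are directly or indirectly associated.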
In some other embodiments, when performing secondary clustering on the key attribute values in each cluster, the processor 50b is specifically configured to: for a first cluster, determine a reference core attribute from the core attributes contained in the first cluster, wherein a core attribute is a key attribute; and, taking the key attribute values under the reference core attribute as clustering source points, cluster the key attribute values under the non-reference core attributes in the first cluster into the sub-clusters represented by the clustering source points according to the hierarchical relationship among the key attributes contained in the first cluster and the association weights among the key attribute values. The first cluster is any one of the plurality of clusters.
Further, when determining the reference core attribute from the core attributes contained in the first cluster, the processor 50b is specifically configured to: select, according to the number of key attribute values under each core attribute contained in the first cluster, the core attribute with the most key attribute values as the reference core attribute.
Further, when clustering the key attribute values under the non-reference core attributes in the first cluster into the sub-clusters represented by the clustering source points, the processor 50b is specifically configured to: for a third key attribute value, calculate the correlation between the third key attribute value and each clustering source point according to the association weights between the third key attribute value and each clustering source point; and assign the third key attribute value to the sub-cluster represented by the clustering source point with the maximum correlation. The third key attribute value is any key attribute value contained in the first cluster that has not yet been clustered into any sub-cluster.
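Selecting the reference core attribute can be sketched as follows, assuming a cluster is a collection of (attribute, value) tuples. The attribute names and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import defaultdict


def pick_reference_core_attribute(cluster, core_attrs):
    """Return the core attribute with the most distinct values, and those values.

    cluster is an iterable of (attribute, value) tuples.
    Ties are broken alphabetically, which is an arbitrary choice of this sketch.
    """
    values_by_attr = defaultdict(set)
    for attr, value in cluster:
        if attr in core_attrs:
            values_by_attr[attr].add(value)
    if not values_by_attr:
        return None, set()
    reference = max(sorted(values_by_attr), key=lambda attr: len(values_by_attr[attr]))
    # The values under the reference core attribute act as clustering source points.
    return reference, values_by_attr[reference]


cluster = [("account", "a"), ("account", "b"), ("phone", "1"), ("device", "x")]
reference, source_points = pick_reference_core_attribute(cluster, {"account", "phone"})
```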
Optionally, when calculating the correlation between the third key attribute value and each clustering source point, the processor 50b is specifically configured to: determine the shortest association path between the third key attribute value and each clustering source point according to the association weights between the key attribute values traversed by the association paths between the third key attribute value and that clustering source point; and calculate the correlation between the third key attribute value and each clustering source point according to the association weights between the key attribute values traversed by the corresponding shortest association path.
Further, if the target level to which the third key attribute value belongs is any level other than the level immediately below the clustering source points and the highest level in the hierarchical relationship among the key attributes contained in the first cluster, the processor 50b is further configured to: if there are multiple clustering source points having the maximum correlation with the third key attribute value, calculate the correlation between the third key attribute value and each key attribute value at the level immediately above the target level; and cluster the third key attribute value into the sub-cluster to which the key attribute value having the maximum correlation with it, among all the key attribute values at the level immediately above the target level, belongs.
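The following sketch shows one possible reading of the shortest-path correlation. It assumes that association weights lie in (0, 1], that the length of an edge is the negative logarithm of its weight, and that the correlation is the product of the weights along the shortest path. None of these choices is stated in the passage above; they are assumptions, as are the function names.

```python
import heapq
import math


def correlations_to_sources(graph, value, sources):
    """Dijkstra from value with edge cost -log(weight).

    graph maps node -> {neighbor: weight}, weights assumed in (0, 1].
    The shortest path under this cost maximizes the product of weights.
    """
    distance = {value: 0.0}
    heap = [(0.0, value)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > distance.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist - math.log(weight)
            if candidate < distance.get(neighbor, math.inf):
                distance[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return {src: math.exp(-distance[src]) for src in sources if src in distance}


def assign_to_source(graph, value, sources):
    """Pick the clustering source point with the maximum correlation."""
    correlation = correlations_to_sources(graph, value, sources)
    if not correlation:
        return None
    return max(sorted(correlation), key=correlation.get)


# Undirected weighted graph written with both directions of every edge.
graph = {
    "dev:x": {"acct:a": 0.5, "phone:1": 0.9},
    "acct:a": {"dev:x": 0.5},
    "phone:1": {"dev:x": 0.9, "acct:b": 0.8},
    "acct:b": {"phone:1": 0.8},
}
correlation = correlations_to_sources(graph, "dev:x", ["acct:a", "acct:b"])
chosen = assign_to_source(graph, "dev:x", ["acct:a", "acct:b"])
```

In this example the indirect path through `phone:1` (0.9 × 0.8) is stronger than the direct edge to `acct:a` (0.5), so `dev:x` is assigned to the sub-cluster of `acct:b`.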
Optionally, the processor 50b is further configured to: before clustering the key attribute values under the non-reference core attribute in the first cluster into the sub-clusters represented by the clustering source points, calculating the association weight between every two key attribute values according to the common occurrence frequency of every two key attribute values in the key attribute values contained in the first cluster in the same data record and the respective occurrence frequency of every key attribute value in every two key attribute values in a plurality of data records.
In some embodiments, before performing secondary clustering on the key attribute values in each cluster, the processor 50b is further configured to perform at least one of the following judgment operations: judging whether the plurality of clusters contain a giant cluster whose number of key attribute values is greater than or equal to a preset first number threshold; and judging whether the plurality of clusters contain a giant cluster that includes a problem attribute value, where a problem attribute value is a key attribute value for which the number of other key attribute values associated with it is greater than or equal to a preset second number threshold. If the result of the at least one judgment operation is yes, the giant clusters among the plurality of clusters are pruned.
In other embodiments, after obtaining the at least one sub-cluster, the processor 50b is further configured to: for a first sub-cluster, search the plurality of data records for the non-key attribute values under a first non-key attribute that are respectively associated with the key attribute values contained in the first sub-cluster; judge whether the non-key attribute values under the first non-key attribute are consistent; and if not, re-cluster the first sub-cluster by taking the key attribute values whose associated non-key attribute values under the first non-key attribute are inconsistent as new source point attribute values, until the non-key attribute values under the first non-key attribute contained in the first sub-cluster are consistent.
In still other embodiments, after obtaining the at least one sub-cluster, the processor 50b is further configured to: acquire, from the plurality of data records, a plurality of non-key attribute values under a second non-key attribute and a plurality of non-key attribute values under a third non-key attribute respectively associated with each sub-cluster; divide the plurality of non-key attribute values under the second non-key attribute into a plurality of information clusters by using the association relationship between the non-key attribute values under the second non-key attribute and the non-key attribute values under the third non-key attribute; for each information cluster, calculate the probability of belonging between every two non-key attribute values under the second non-key attribute in the information cluster and each candidate result, where the candidate results include: belonging to the same data object and not belonging to the same data object; and determine whether every two non-key attribute values under the second non-key attribute belong to the same data object according to the probability of belonging between them and the candidate results in each information cluster.
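The two judgment operations can be sketched as follows, assuming a cluster is given as adjacency lists (key attribute value mapped to the set of values associated with it). The function name and the parameter names standing in for the first and second number thresholds are hypothetical.

```python
def is_giant_cluster(adjacency, first_threshold, second_threshold):
    """Check a cluster against the size rule and the problem-value rule.

    adjacency maps each key attribute value in the cluster to the set of
    other key attribute values associated with it.
    """
    # Rule 1: the cluster holds at least first_threshold key attribute values.
    too_many_values = len(adjacency) >= first_threshold
    # Rule 2: some value is associated with at least second_threshold others.
    has_problem_value = any(
        len(neighbors) >= second_threshold for neighbors in adjacency.values()
    )
    return too_many_values or has_problem_value


hub_cluster = {
    "dev:x": {"acct:a", "acct:b", "acct:c"},
    "acct:a": {"dev:x"},
    "acct:b": {"dev:x"},
    "acct:c": {"dev:x"},
}
flagged_by_degree = is_giant_cluster(hub_cluster, first_threshold=10, second_threshold=3)
flagged_by_size = is_giant_cluster(hub_cluster, first_threshold=4, second_threshold=10)
not_flagged = is_giant_cluster(hub_cluster, first_threshold=10, second_threshold=10)
```

A cluster flagged by either rule would then be passed to the pruning step; the pruning itself is not shown here.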
Further, the processor 50b is further configured to: for a first non-key attribute value and a second non-key attribute value under a second non-key attribute, if the first non-key attribute value and the second non-key attribute value under the second non-key attribute belong to the same data object, determining that sub-clusters corresponding to the first non-key attribute value and the second non-key attribute value belong to the same data object; merging sub-clusters belonging to the same data object; the first non-critical attribute value and the second non-critical attribute value are any two non-critical attribute values under the second non-critical attribute.
In some optional embodiments, as shown in fig. 5, the server device may further include: power supply components 50e, etc. Optionally, if the server device is a terminal device such as a computer, as shown in the dashed line box of fig. 5, an optional component such as an audio component 50f may be further included. Only some of the components are schematically shown in fig. 5, which does not mean that the server device must include all of the components shown in fig. 5, nor that the server device only includes the components shown in fig. 5.
The server device provided in this embodiment can identify an attribute value belonging to the same data object from among the multiple key attribute values according to the hierarchical relationship among the multiple key attributes and the association relationship among the multiple key attribute values, thereby completing the longitudinal clustering of attribute values belonging to the same data object under different attributes. Due to the data clustering mode, various key attributes are considered, the probability of wrong clustering is favorably reduced, and the accuracy of identifying the data belonging to the same data object is favorably improved.
Fig. 6 is a schematic structural diagram of another server device according to an embodiment of the present application. As shown in fig. 6, the server device includes: a memory 60a and a processor 60b. The memory 60a is used for storing a computer program.
The processor 60b is coupled to the memory 60a and executes the computer program for: acquiring a plurality of data records, wherein the plurality of data records comprise a plurality of first-class attribute values;
if the plurality of data records contain a plurality of second-class attribute values, clustering the plurality of first-class attribute values according to the association relationship between the plurality of first-class attribute values and the plurality of second-class attribute values to obtain a plurality of information clusters; for each information cluster, calculating the probability that different first-class attribute values belong to each candidate result, where the candidate results include: belonging to the same data object and not belonging to the same data object; and determining, according to these probabilities, whether the different first-class attribute values belong to the same data object.
In some embodiments, when the processor 60b divides the plurality of first-type attribute values into a plurality of information clusters, it is specifically configured to: determining a first class attribute value associated with each second class attribute value according to the association relationship between the first class attribute values and the second class attribute values; and taking the first-class attribute value associated with each second-class attribute value as an information cluster to obtain a plurality of information clusters.
In other embodiments, the processor 60b, when calculating the belonged probabilities between the different first-class attribute values and the candidate results, is specifically configured to: aiming at first type attribute values A and B in a first information cluster, acquiring other type attribute values and behavior data respectively associated with the first type attribute values A and B in a plurality of data records; the other type attribute value is attribute information except the first type attribute value; inputting other attribute values and behavior data respectively associated with the first-class attribute values A and B into the decision model to obtain the belonging probability between the first-class attribute values A and B and the candidate result.
Optionally, before inputting the other attribute values and the behavior data respectively associated with the first-class attribute values A and B into the decision model, the processor 60b is further configured to: perform model training with minimization of a loss function as the training target to obtain the decision model, taking as positive samples the sample attribute values and sample behavior data respectively associated with different first-class attribute values that are known to belong to the same data object, and taking as negative samples the sample attribute values and sample behavior data respectively associated with different first-class attribute values that are known not to belong to the same data object. The loss function is determined according to the probabilities obtained by model training and the actual probabilities of the positive and negative samples.
In still other embodiments, the processor 60b is further configured to: if the plurality of data records do not contain the second type attribute values, behavior data respectively associated with the plurality of first type attribute values are obtained from the plurality of data records; acquiring behavior characteristics of data objects corresponding to the first type attribute values according to behavior data associated with the first type attribute values respectively; and determining whether the plurality of first-type attribute values belong to the same data object according to the behavior characteristics of the data objects corresponding to the plurality of first-type attribute values respectively.
Further, when determining whether the plurality of first-class attribute values belong to the same data object, the processor 60b is specifically configured to: calculating the similarity of the behavior characteristics of the data objects corresponding to the first type attribute values C and D aiming at the first type attribute values C and D; and if the similarity of the behavior characteristics of the data objects corresponding to the first-class attribute values C and D is greater than or equal to the set similarity threshold, determining that the first-class attribute values C and D belong to the same data object.
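The threshold comparison described above can be illustrated with a minimal sketch. It assumes the behavior characteristics are already encoded as equal-length numeric vectors and uses cosine similarity as the similarity measure; the text does not name a specific measure, so the measure, the function names, and the default threshold of 0.9 are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def same_data_object(features_c, features_d, threshold=0.9):
    """True when the similarity is greater than or equal to the set threshold."""
    return cosine_similarity(features_c, features_d) >= threshold
```

Any other similarity measure could be substituted for `cosine_similarity` without changing the thresholding step.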
In one embodiment, the server device further includes a display screen 60c. Accordingly, the processor 60b is further configured to: visually present the first-class attribute values belonging to the same data object on the display screen 60c. Optionally, the processor 60b may present the first-class attribute values belonging to the same data object on the display screen 60c in the form of a connected subgraph. The nodes of the connected subgraph represent first-class attribute values, and a connecting line between two nodes indicates that the two first-class attribute values belong to the same data object.
In another embodiment, the server device further includes a communication component 60d. Accordingly, the processor 60b is further configured to: send the first-class attribute values belonging to the same data object to the client device through the communication component 60d, for the client device to output in a visual manner. Optionally, the processor 60b may send the first-class attribute values belonging to the same data object to the client device in the form of a connected subgraph through the communication component 60d.
In some optional embodiments, as shown in fig. 6, the server device may further include: power supply assembly 60e, etc. Optionally, if the server device is a terminal device such as a computer, as shown in the dashed line box of fig. 6, optional components such as an audio component 60f may also be included. Only some of the components are schematically shown in fig. 6, which does not mean that the server device must include all of the components shown in fig. 6, nor that the server device only includes the components shown in fig. 6.
The server device provided in this embodiment can complete horizontal clustering of attribute values belonging to the same data object under the same attribute according to the probability that different attribute values belong to and do not belong to the same data object under the same attribute. Due to the data clustering mode, stronger association relation among attribute values is not needed, the requirement on data attributes can be reduced, and the flexibility and universality of data clustering are facilitated.
In an embodiment of the present application, the memory is used for storing a computer program and may be configured to store other various data to support operations on the server device. Wherein the processor may execute a computer program stored in the memory to implement the corresponding control logic. The memory may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
In embodiments of the present application, the communication component is configured to facilitate wired or wireless communication between the server device and other devices. The server device can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, 4G, 5G or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component may also be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.
In the embodiment of the present application, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
In embodiments of the present application, a power component is configured to provide power to various components of a server device. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
In embodiments of the present application, the audio component may be configured to output and/or input audio signals. For example, the audio component includes a microphone (MIC) configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may further be stored in a memory or transmitted via a communication component. In some embodiments, the audio component further comprises a speaker for outputting audio signals. For example, for a server device with a voice interaction function, voice interaction with a user can be realized through the audio component.
It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (30)

1. A data processing method, comprising:
acquiring a plurality of data records, wherein the plurality of data records comprise a plurality of key attribute values under a plurality of key attributes, and each key attribute value belongs to only one data object at any given time;
identifying attribute values belonging to the same data object in the plurality of key attribute values according to the level relationship among the plurality of key attributes and the incidence relationship among the plurality of key attribute values;
and outputting the attribute values belonging to the same data object in the plurality of key attribute values.
2. The method according to claim 1, further comprising, before identifying the attribute values belonging to the same data object from the hierarchical relationship between the key attributes and the association relationship between the key attribute values:
extracting the hierarchical relationship among a plurality of key attributes from the pre-established hierarchical relationship among the plurality of key attributes;
and analyzing the association relation among the plurality of key attribute values based on the membership relation between the plurality of key attribute values and the plurality of data records.
3. The method according to claim 2, before extracting the hierarchical relationship among the plurality of key attributes from the pre-established hierarchical relationship among a plurality of key attributes, further comprising:
acquiring a historical data record in a specified historical time period, wherein the historical data record comprises historical key attribute values under a plurality of key attributes;
and establishing a hierarchical relationship among the plurality of key attributes according to the number of historical key attribute values under each key attribute in the plurality of key attributes.
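As a minimal sketch of the step in claim 3, the function below counts the distinct historical values under each key attribute and orders the attributes by that count. The claim does not state the direction of the ordering, so placing the attribute with the fewest distinct values at the top level is an assumption, as are the function name, the dict-based record layout, and the alphabetical tie-break.

```python
from collections import defaultdict


def build_hierarchy(historical_records, key_attributes):
    """Order key attributes by their number of distinct historical values.

    historical_records: iterable of dicts mapping attribute name -> value.
    Assumption: fewer distinct values means a higher level; ties are broken
    alphabetically so the result is deterministic.
    """
    distinct_values = defaultdict(set)
    for record in historical_records:
        for attribute in key_attributes:
            value = record.get(attribute)
            if value is not None:
                distinct_values[attribute].add(value)
    return sorted(key_attributes, key=lambda a: (len(distinct_values[a]), a))
```

The returned list runs from the highest level to the lowest under the stated assumption.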
4. The method of claim 2, wherein analyzing the association between the plurality of key attribute values based on membership between the plurality of key attribute values and the plurality of data records comprises performing at least one of the following:
judging whether a first key attribute value and a second key attribute value appear in the same data record or not according to the first key attribute value and the second key attribute value;
judging whether one key attribute value of the first key attribute value and the second key attribute value and a key attribute value having an association relationship with the other key attribute value appear in the same data record;
judging whether the key attribute value having the association relationship with the first key attribute value and the key attribute value having the association relationship with the second key attribute value appear in the same data record;
if the result of the at least one judgment operation is yes, determining that the first key attribute value and the second key attribute value have an association relationship;
the first key attribute value and the second key attribute value are any two attribute values of the plurality of key attribute values.
5. The method of claim 1, wherein identifying the attribute values belonging to the same data object from the level relationship between the key attributes and the association relationship between the key attribute values comprises:
performing initial clustering on the plurality of key attribute values according to the incidence relation among the plurality of key attribute values to obtain a plurality of clusters, wherein each cluster comprises the key attribute values with the incidence relation;
for each cluster, performing secondary clustering on the key attribute value in each cluster according to the hierarchical relationship among the plurality of key attributes to obtain at least one sub-cluster;
and regarding the key attribute values in the same sub-cluster as the attribute values belonging to the same data object.
6. The method according to claim 5, wherein the initially clustering the plurality of key attribute values according to the association relationship between the plurality of key attribute values comprises:
pairing two key attribute values with an incidence relation in the plurality of key attribute values to obtain a plurality of attribute value relation pairs;
constructing an adjacency list of each key attribute value in the attribute value relationship pairs according to the incidence relationship among the key attribute values in the attribute value relationship pairs;
traversing the adjacency list of each key attribute value in the attribute value relationship pairs to obtain a plurality of connected subgraphs, wherein each connected subgraph is a cluster;
in each connected subgraph, the nodes represent key attribute values, and the connecting line between the two nodes represents the incidence relation between the two key attribute values.
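The initial clustering of claim 6 can be sketched as follows: attribute value relationship pairs are turned into an adjacency list, and a breadth-first traversal of that list yields the connected subgraphs. The function names and the use of plain strings for key attribute values are assumptions for illustration; the claim does not prescribe a traversal order.

```python
from collections import defaultdict, deque


def build_adjacency(pairs):
    """Build an adjacency list from (value_a, value_b) relationship pairs."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def connected_subgraphs(pairs):
    """Return one set of key attribute values per connected subgraph (cluster)."""
    adjacency = build_adjacency(pairs)
    seen = set()
    clusters = []
    for start in list(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(component)
    return clusters
```

Each returned set corresponds to one cluster of the initial clustering.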
7. The method according to claim 5, wherein said secondary clustering of key attribute values in each cluster according to the hierarchical relationship between the plurality of key attributes comprises:
for a first cluster, determining a reference core attribute from core attributes contained in the first cluster, wherein the core attribute belongs to a key attribute;
clustering key attribute values under the non-reference core attribute in the first cluster into sub-clusters represented by each clustering source point by taking the key attribute values under the reference core attribute as clustering source points according to the level relation among the key attributes contained in the first cluster and the associated weight among the key attribute values;
wherein the first cluster is any one of the plurality of clusters.
8. The method of claim 7, wherein determining a reference core attribute from the core attributes included in the first cluster comprises:
and selecting the core attribute with the most key attribute values as the reference core attribute according to the number of the key attribute values under each core attribute contained in the first cluster.
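A minimal sketch of the selection in claim 8 is given below. It assumes a cluster is represented as (attribute, value) pairs and that the set of core attributes is known in advance; the function name and the alphabetical tie-break are hypothetical, since the claim does not say how ties are resolved.

```python
from collections import Counter


def pick_reference_core_attribute(cluster, core_attributes):
    """Pick the core attribute with the most distinct key attribute values.

    cluster: iterable of (attribute, value) pairs in one cluster.
    Returns None when the cluster contains no core attribute.
    """
    counts = Counter(attr for attr, _ in set(cluster) if attr in core_attributes)
    if not counts:
        return None
    # max() keeps the first maximum, so sorting gives an alphabetical tie-break.
    return max(sorted(counts), key=lambda attr: counts[attr])
```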
9. The method according to claim 7, wherein the clustering the key attribute values under the non-reference core attribute in the first cluster into the sub-clusters represented by each cluster source point by using the key attribute values under the reference core attribute as the cluster source points respectively according to the hierarchical relationship between the key attributes contained in the first cluster and the associated weight between each key attribute value comprises:
aiming at a third key attribute value, calculating the correlation between the third key attribute value and each clustering source point according to the correlation weight between the third key attribute value and each clustering source point;
dividing the third key attribute value into a sub-cluster represented by the clustering source point with the maximum correlation;
the third key attribute value is any key attribute value of the key attribute values contained in the first cluster that is not currently clustered to any sub-cluster.
10. The method according to claim 9, wherein the calculating the correlation between the third key attribute value and each cluster source point according to the correlation weight between the third key attribute value and each cluster source point comprises:
determining the shortest associated path between the third key attribute value and each clustering source point according to the associated weight between the third key attribute value and the key attribute value passed by the associated path between each clustering source point;
and calculating the correlation between the third key attribute value and each clustering source point respectively according to the correlation weight between the third key attribute value and the key attribute value passed by the shortest correlation path between the clustering source points.
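Claims 9 and 10 can be illustrated with a sketch under explicit assumptions: association weights lie in (0, 1], the correlation along a path is the product of the weights on that path, and the "shortest associated path" is the path that maximizes this product. None of these definitions is given in the claims. Under them, Dijkstra's algorithm on the cost `-log(weight)` finds the best path from each clustering source point, and each remaining value joins the source point with the maximum correlation. Ties are broken arbitrarily here and are not resolved through the upper level as in claim 11.

```python
import heapq
import math


def correlations_from(source, weights):
    """Best path correlation from `source` to every reachable value.

    weights: dict mapping value -> {neighbour: weight in (0, 1]}.
    """
    best_cost = {source: 0.0}  # accumulated -log(weight) along the best path
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best_cost.get(node, math.inf):
            continue  # stale heap entry
        for neighbour, weight in weights.get(node, {}).items():
            new_cost = cost - math.log(weight)
            if new_cost < best_cost.get(neighbour, math.inf):
                best_cost[neighbour] = new_cost
                heapq.heappush(heap, (new_cost, neighbour))
    return {node: math.exp(-cost) for node, cost in best_cost.items()}


def assign_to_sources(sources, weights):
    """Map each non-source value to the source point with maximum correlation."""
    per_source = {s: correlations_from(s, weights) for s in sources}
    nodes = set(weights) | {n for nbrs in weights.values() for n in nbrs}
    assignment = {}
    for node in nodes:
        if node in per_source:
            continue
        score, best_source = max((per_source[s].get(node, 0.0), s) for s in sources)
        if score > 0.0:  # unreachable values stay unassigned
            assignment[node] = best_source
    return assignment
```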
11. The method according to claim 10, wherein if the target level of the third key attribute value is any level other than the next level of the cluster source points and the highest level of the level relationship between the key attributes included in the first cluster, the method further comprises:
if the clustering source points with the maximum correlation with the third key attribute value are multiple, calculating the correlation between the third key attribute value and each key attribute value at the upper level of the target level;
and clustering the third key attribute value into a sub-cluster represented by the key attribute value, among the key attribute values of the previous level of the target level, that has the maximum correlation with the third key attribute value.
12. The method according to claim 9, wherein before clustering the key attribute values under the non-reference core attributes in the first cluster into the sub-clusters represented by the cluster source points by using the key attribute values under the reference core attributes as the cluster source points respectively according to the hierarchical relationship between the key attributes contained in the first cluster and the associated weight between the key attribute values, the method further comprises:
and calculating the association weight between every two key attribute values according to the common occurrence frequency of every two key attribute values in the key attribute values contained in the first cluster in the same data record and the respective occurrence frequency of each key attribute value in every two key attribute values in the plurality of data records.
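Claim 12 names the inputs of the association weight (the co-occurrence frequency of two values in the same data record and the individual occurrence frequency of each value) but not the formula. The sketch below assumes a Jaccard-style ratio, co-occurrences divided by the number of records containing either value; this formula, the function name, and the record layout are assumptions chosen for illustration only.

```python
from collections import Counter
from itertools import combinations


def association_weights(records):
    """Compute a weight for every pair of values that co-occur in a record.

    records: iterable of collections of key attribute values.
    Returns a dict keyed by the sorted (value_a, value_b) pair.
    """
    occurrences = Counter()
    co_occurrences = Counter()
    for record in records:
        values = sorted(set(record))  # count each value once per record
        occurrences.update(values)
        co_occurrences.update(combinations(values, 2))
    weights = {}
    for (a, b), together in co_occurrences.items():
        # Assumed formula: records with both / records with either.
        weights[(a, b)] = together / (occurrences[a] + occurrences[b] - together)
    return weights
```

With this choice the weight lies in (0, 1] and equals 1 only when the two values always appear together.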
13. The method according to claim 5, wherein before performing secondary clustering on the key attribute values in each cluster according to the hierarchical relationship among the plurality of key attributes, the method further comprises performing a determination operation as follows:
judging whether giant clusters with the number of key attribute values larger than or equal to a preset first number threshold exist in the plurality of clusters;
judging whether a giant cluster containing a problem attribute value exists in the plurality of clusters; the problem attribute value refers to a key attribute value for which the number of other key attribute values associated with it is greater than or equal to a preset second number threshold;
if the result of the at least one judgment operation is yes, pruning the giant clusters in the plurality of clusters.
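The two judgment operations of claim 13 can be sketched as a single detection pass. The sketch assumes clusters are sets of values and that an adjacency mapping from each value to its associated values is available; the function and parameter names are hypothetical. The pruning step itself is not shown because the claim does not describe how it is performed.

```python
def find_giant_clusters(clusters, adjacency, size_threshold, degree_threshold):
    """Return indices of clusters that meet either giant-cluster condition.

    clusters: list of sets of key attribute values.
    adjacency: dict mapping value -> set of associated values.
    """
    giant = []
    for index, cluster in enumerate(clusters):
        # Condition 1: the cluster holds at least `size_threshold` values.
        too_large = len(cluster) >= size_threshold
        # Condition 2: some value is associated with too many other values.
        has_problem_value = any(
            len(adjacency.get(value, ())) >= degree_threshold for value in cluster
        )
        if too_large or has_problem_value:
            giant.append(index)
    return giant
```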
14. The method of claim 5, further comprising, after obtaining at least one sub-cluster:
aiming at a first sub-cluster, searching non-key attribute values under first non-key attributes respectively associated with key attribute values contained in the first sub-cluster in the plurality of data records;
judging whether the non-key attribute values under the first non-key attribute are consistent;
and if the judgment result is negative, re-clustering the first sub-cluster by taking the key attribute value with inconsistent non-key attribute values under the associated first non-key attributes as a new source point attribute value until the non-key attribute values under the first non-key attributes contained in the first sub-cluster are consistent.
15. The method of claim 5, wherein after obtaining at least one sub-cluster, the method further comprises:
acquiring a plurality of non-key attribute values under a second non-key attribute and a plurality of non-key attribute values under a third non-key attribute respectively associated with each sub-cluster from the plurality of data records;
dividing a plurality of non-key attribute values under the second non-key attribute into a plurality of information clusters by using the incidence relation between the plurality of non-key attribute values under the second non-key attribute and the plurality of non-key attribute values under the third non-key attribute;
for each information cluster, calculating, for every two non-key attribute values under the second non-key attribute in that information cluster, the probability that the two values belong to each candidate result; the candidate results include: belonging to the same data object and not belonging to the same data object;
and determining, according to these probabilities, whether every two non-key attribute values under the second non-key attribute in each information cluster belong to the same data object.
16. The method of claim 15, further comprising:
for a first non-key attribute value and a second non-key attribute value under the second non-key attribute, if the first non-key attribute value and the second non-key attribute value under the second non-key attribute belong to the same data object, determining that sub-clusters corresponding to the first non-key attribute value and the second non-key attribute value belong to the same data object;
merging sub-clusters belonging to the same data object;
the first non-key attribute value and the second non-key attribute value are any two non-key attribute values under the second non-key attribute.
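The merging step of claim 16 can be sketched with a union-find (disjoint-set) structure, which keeps merges transitive: if sub-clusters 1 and 2 are merged and 2 and 3 are merged, all three end up together. The sketch assumes the pairs of non-key attribute values judged to belong to the same data object have already been mapped to the identifiers of their sub-clusters; the identifiers and the function name are hypothetical.

```python
def merge_sub_clusters(sub_clusters, same_object_pairs):
    """Merge sub-clusters judged to belong to the same data object.

    sub_clusters: dict mapping sub-cluster id -> set of key attribute values.
    same_object_pairs: iterable of (id_a, id_b) pairs to be merged.
    """
    parent = {cid: cid for cid in sub_clusters}

    def find(cid):
        # Follow parent links to the root, halving the path on the way.
        while parent[cid] != cid:
            parent[cid] = parent[parent[cid]]
            cid = parent[cid]
        return cid

    for a, b in same_object_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    merged = {}
    for cid, values in sub_clusters.items():
        merged.setdefault(find(cid), set()).update(values)
    return list(merged.values())
```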
17. The method of any of claims 1-16, wherein said obtaining a plurality of data records comprises:
receiving a plurality of data records sent by client equipment;
the outputting the attribute values belonging to the same data object in the plurality of key attribute values comprises:
and sending the attribute values belonging to the same data object in the plurality of key attribute values to the client equipment so as to be output by the client equipment in a visual mode.
18. The method of claim 17, wherein sending attribute values of the plurality of key attribute values that belong to a same data object to the client device comprises:
sending attribute values belonging to the same data object in the plurality of key attribute values to the client device in a form of a connected subgraph; wherein, in the connected subgraph, nodes represent key attribute values; the connecting line between the two nodes represents the incidence relation between the two key attribute values.
19. The method of claim 18, further comprising, prior to sending attribute values belonging to the same data object in the plurality of key attribute values to the client in the form of a connected subgraph:
searching behavior data of the data object corresponding to each connected subgraph in the plurality of data records;
acquiring behavior characteristics of the data object corresponding to each connected subgraph according to the behavior data of the data object corresponding to each connected subgraph;
and taking the behavior characteristics of the data object corresponding to each connected subgraph as the identification information of the data object corresponding to each connected subgraph, and adding the identification information to each connected subgraph.
20. A data processing method, comprising:
acquiring a plurality of data records, wherein the plurality of data records comprise a plurality of first-class attribute values;
if the plurality of data records contain a plurality of second-class attribute values, clustering the plurality of first-class attribute values according to the association relationship between the plurality of first-class attribute values and the plurality of second-class attribute values to obtain a plurality of information clusters;
respectively calculating the belonging probability between different first-class attribute values and the candidate result aiming at each information cluster; the candidate results include: belong to the same data object and do not belong to the same data object;
and determining whether the different first-class attribute values belong to the same data object according to the belonging probability between the different first-class attribute values and the candidate result.
21. The method according to claim 20, wherein said dividing the plurality of attribute values of the first class into a plurality of information clusters according to the association relationship between the attribute values of the first class and the attribute values of the second class comprises:
determining a first class attribute value associated with each second class attribute value respectively according to the association relationship between the first class attribute value and the second class attribute value;
and taking the first class attribute value associated with each second class attribute value as an information cluster to obtain the plurality of information clusters.
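By way of example only, the clustering step of claims 20 and 21 can be sketched as follows: every second-class attribute value becomes the key of one information cluster, and the cluster holds the first-class attribute values associated with it. The record layout (a list of dictionaries) and the field names passed as `first_key` and `second_key` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_information_clusters(records, first_key, second_key):
    """Return {second-class value: set of associated first-class values}.

    A first-class value is treated as associated with a second-class
    value when both appear in the same data record (an assumption).
    """
    clusters = defaultdict(set)
    for record in records:
        first = record.get(first_key)
        second = record.get(second_key)
        if first is not None and second is not None:
            clusters[second].add(first)
    return dict(clusters)


def candidate_pairs(clusters):
    """List the pairs of first-class values to be compared per cluster."""
    pairs = set()
    for members in clusters.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs
```

Restricting the later probability calculation to pairs that share a cluster avoids comparing every first-class attribute value with every other one.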
22. The method of claim 20, wherein calculating for each information cluster the probability of belonging between the different first-class attribute values and the candidate result comprises:
for first-class attribute values A and B in a first information cluster, acquiring, from the plurality of data records, other attribute values and behavior data respectively associated with the first-class attribute values A and B; the other attribute values are attribute values other than the first-class attribute values;
inputting the other attribute values and behavior data respectively associated with the first-class attribute values A and B into a decision model to obtain the belonging probabilities between the first-class attribute values A and B and the candidate results.
23. The method of claim 22, further comprising, prior to inputting the other attribute values and behavior data respectively associated with the first-class attribute values A and B into the decision model:
taking the minimization of a loss function as a training target, taking the sample attribute values and the sample behavior data which are known to belong to the same data object and are respectively associated with different first-class attribute values as positive samples, and taking the sample attribute values and the sample behavior data which are known not to belong to the same data object and are respectively associated with different first-class attribute values as negative samples, and performing model training to obtain the decision model;
wherein the loss function is determined according to the probabilities obtained by model training and the actual probabilities of the positive samples and the negative samples.
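The training procedure of claims 22 and 23 may be illustrated, without limitation, by the sketch below. It assumes that the decision model is a logistic regression and that the loss function is binary cross-entropy between the predicted probability and the actual label (1 for a positive sample, 0 for a negative sample); the claims specify neither. Each sample is assumed to be a numeric feature vector already derived from the other attribute values and behavior data of a pair of first-class attribute values, and all function and parameter names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_decision_model(samples, labels, lr=0.5, epochs=500):
    """Fit weights by batch gradient descent on mean cross-entropy loss.

    samples: list of equal-length feature vectors, one per pair.
    labels: 1 if the pair belongs to the same data object, else 0.
    """
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of cross-entropy w.r.t. the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def belonging_probabilities(model, x):
    """Return the probability of each candidate result for one pair."""
    weights, bias = model
    p_same = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return {"same": p_same, "different": 1.0 - p_same}
```

A pair of first-class attribute values could then be assigned to the candidate result with the larger probability, which is one simple way to carry out the determining step of claim 20.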
24. The method of claim 20, further comprising:
if the plurality of data records do not contain second-class attribute values, acquiring, from the plurality of data records, behavior data respectively associated with the plurality of first-class attribute values;
acquiring behavior characteristics of data objects corresponding to the plurality of first-class attribute values respectively according to behavior data associated with the plurality of first-class attribute values respectively;
and determining whether the plurality of first-class attribute values belong to the same data object according to the behavior characteristics of the data objects corresponding to the plurality of first-class attribute values respectively.
25. The method according to claim 24, wherein the determining whether the plurality of attribute values of the first class belong to the same data object according to the behavior characteristics of the data object corresponding to the plurality of attribute values of the first class respectively comprises:
for first-class attribute values C and D, calculating the similarity between the behavior characteristics of the data objects corresponding to the first-class attribute values C and D;
and if the similarity of the behavior characteristics of the data objects corresponding to the first-class attribute values C and D is greater than or equal to a set similarity threshold, determining that the first-class attribute values C and D belong to the same data object.
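The threshold comparison of claim 25 can be sketched as follows. Cosine similarity over numeric behavior feature vectors and the default threshold of 0.9 are assumptions chosen for illustration; the claim only requires some similarity measure and a set similarity threshold.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # no behavior recorded: treat as dissimilar
    return dot / (norm_u * norm_v)


def same_data_object(features_c, features_d, threshold=0.9):
    """True if the similarity is greater than or equal to the threshold."""
    return cosine_similarity(features_c, features_d) >= threshold
```

For instance, two first-class attribute values whose behavior feature vectors point in nearly the same direction would be judged to belong to the same data object, while orthogonal vectors would not.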
26. The method of claim 20, further comprising:
outputting the first-class attribute values belonging to the same data object in a visual mode; or,
sending the first-class attribute values belonging to the same data object to a client device, so that the client device outputs them in a visual mode.
27. A server-side device, comprising: a memory and a processor; wherein the memory is configured to store a computer program;
the processor is coupled to the memory and configured to execute the computer program to perform:
acquiring a plurality of data records, wherein the plurality of data records comprise a plurality of key attribute values under a plurality of key attributes, and each key attribute value belongs to one data object at any given time;
identifying attribute values belonging to the same data object in the plurality of key attribute values according to the level relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values;
and outputting the attribute values belonging to the same data object in the plurality of key attribute values.
28. A server-side device, comprising: a memory and a processor; wherein the memory is configured to store a computer program;
the processor is coupled to the memory and configured to execute the computer program to perform:
acquiring a plurality of data records, wherein the plurality of data records comprise a plurality of first-class attribute values;
if the plurality of data records contain a plurality of second-class attribute values, dividing the plurality of first-class attribute values into a plurality of information clusters according to the association relationship between the plurality of first-class attribute values and the plurality of second-class attribute values;
for each information cluster, calculating the belonging probabilities between different first-class attribute values and the candidate results; the candidate results include: belonging to the same data object and not belonging to the same data object;
and determining whether the different first-class attribute values belong to the same data object according to the belonging probability between the different first-class attribute values and the candidate result.
29. A data processing system, comprising: a client device and a server-side device;
the client device is configured to provide a plurality of data records to the server-side device, wherein the plurality of data records comprise a plurality of key attribute values under a plurality of key attributes, and each key attribute value belongs to one data object at any given time; and to output, in a visual mode, the attribute values belonging to the same data object in the plurality of key attribute values;
the server-side device is configured to identify the attribute values belonging to the same data object in the plurality of key attribute values according to the level relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values; and to send the attribute values belonging to the same data object in the plurality of key attribute values to the client device.
30. A computer-readable storage medium having stored thereon computer instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the method of any one of claims 1-26.
CN201910977784.6A 2019-10-15 2019-10-15 Data processing method, device, system and storage medium Active CN112667869B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910977784.6A CN112667869B (en) 2019-10-15 2019-10-15 Data processing method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910977784.6A CN112667869B (en) 2019-10-15 2019-10-15 Data processing method, device, system and storage medium

Publications (2)

Publication Number Publication Date
CN112667869A true CN112667869A (en) 2021-04-16
CN112667869B CN112667869B (en) 2024-05-03

Family

ID=75399911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910977784.6A Active CN112667869B (en) 2019-10-15 2019-10-15 Data processing method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN112667869B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104731809A (en) * 2013-12-23 2015-06-24 阿里巴巴集团控股有限公司 Processing method and device of attribute information of objects
CN105095306A (en) * 2014-05-20 2015-11-25 阿里巴巴集团控股有限公司 Operating method and device based on associated objects
CN107066616A (en) * 2017-05-09 2017-08-18 北京京东金融科技控股有限公司 Method, device and electronic equipment for account processing
CN108205570A (en) * 2016-12-19 2018-06-26 华为技术有限公司 A kind of data detection method and device
KR101880628B1 (en) * 2017-11-27 2018-08-16 한국인터넷진흥원 Method for labeling machine-learning dataset and apparatus thereof
US20190102453A1 (en) * 2017-10-02 2019-04-04 Kabushiki Kaisha Toshiba Information processing device, information processing method, and computer program product

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
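Zhang Yi; Du Xiuchun; Liu Xin; Liu Huafu: "Research on a multi-domain-based association analysis method for Internet physical objects", Computer Technology and Development (计算机技术与发展), no. 04, 5 December 2017 (2017-12-05) *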

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486218A (en) * 2021-09-06 2021-10-08 北京世纪好未来教育科技有限公司 Data processing method and device, electronic equipment and storage medium
CN113486218B (en) * 2021-09-06 2021-12-14 北京世纪好未来教育科技有限公司 Data processing method and device, electronic equipment and storage medium
CN114860797A (en) * 2022-03-16 2022-08-05 电子科技大学 Data derivation processing method

Also Published As

Publication number Publication date
CN112667869B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
Wu et al. Research on internet information mining based on agent algorithm
US20170206204A1 (en) System, method, and device for generating a geographic area heat map
CN107111651A (en) A kind of matching degree computational methods, device and user equipment
CN105718490A (en) Method and device for updating classifying model
CN107918618B (en) Data processing method and device
CN104254852A (en) Method and system for hybrid information query
CN107515915A (en) User based on user behavior data identifies correlating method
US10748166B2 (en) Method and system for mining churn factor causing user churn for network application
US10628412B2 (en) Iterative visualization of a cohort for weighted high-dimensional categorical data
CN108932646B (en) User tag verification method and device based on operator and electronic equipment
CN111191133B (en) Service search processing method, device and equipment
CN106991577A (en) A kind of method and device for determining targeted customer
CN113971527A (en) Data risk assessment method and device based on machine learning
CN109325648A (en) Multi-dimensional data stream statistics method, server and storage medium based on index
CN112667869B (en) Data processing method, device, system and storage medium
CN111159559A (en) Method for constructing recommendation engine according to user requirements and user behaviors
US20130325866A1 (en) Community Profiling for Social Media
CN104142952A (en) Method and device for showing reports
CN108345620B (en) Brand information processing method, brand information processing device, storage medium and electronic equipment
CN111882113A (en) Enterprise mobile banking user prediction method and device
Pramanik et al. Can i foresee the success of my meetup group?
Xiang et al. Camer: a context-aware mobile service recommendation system
CN114048294B (en) Similar population extension model training method, similar population extension method and device
CN110062112A (en) Data processing method, device, equipment and computer readable storage medium
KR101462858B1 (en) Methods for competency assessment of corporation for global business

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant