CN112667869B

CN112667869B - Data processing method, device, system and storage medium

Info

Publication number: CN112667869B
Application number: CN201910977784.6A
Authority: CN
Inventors: 吴铁民; 王赛; 陈晓勇; 向师富; 柯根
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2019-10-15
Filing date: 2019-10-15
Publication date: 2024-05-03
Anticipated expiration: 2039-10-15
Also published as: CN112667869A

Abstract

Embodiments of the application provide a data processing method, device, system, and storage medium. In the embodiments, the attribute values that belong to the same data object among a plurality of key attribute values are identified according to the hierarchical relationship among a plurality of key attributes and the association relationship among the plurality of key attribute values, thereby completing the longitudinal clustering of the attribute values that belong to the same data object under different attributes. Because this clustering approach takes multiple key attributes into account, the probability of erroneous clustering is reduced, which improves the accuracy of identifying the data that belongs to the same data object.

Description

Data processing method, device, system and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data processing method, device, system, and storage medium.

Background

With the development of the information age, various information bearing media are becoming more and more diverse, and how to achieve effective management of data is also becoming more and more important. To achieve efficient management of enterprise data, an enterprise may build a global unified account.

In order to construct the unified account, data belonging to the same natural person needs to be identified from mass data, but the existing data identification mode is low in accuracy.

Disclosure of Invention

Aspects of the present application provide a data processing method, apparatus, system, and storage medium for improving accuracy of data identification.

The embodiment of the application provides a data processing method, which comprises the following steps:

Acquiring a plurality of data records, wherein the plurality of data records comprise a plurality of key attribute values under a plurality of key attributes, and each key attribute value belongs to only one data object at any given time;

Identifying attribute values belonging to the same data object in the plurality of key attribute values according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values;

and outputting attribute values which are subordinate to the same data object in the plurality of key attribute values.

The embodiment of the application also provides a data processing method, which comprises the following steps:

acquiring a plurality of data records, wherein the plurality of data records comprise a plurality of first type attribute values;

If the plurality of data records contain a plurality of second type attribute values, clustering the plurality of first type attribute values according to the association relation between the plurality of first type attribute values and the plurality of second type attribute values to obtain a plurality of information clusters;

For each information cluster, calculating the probability of each candidate result for the first type attribute values in the cluster; the candidate results comprise: belonging to the same data object, and not belonging to the same data object;

and determining whether different first type attribute values belong to the same data object according to the calculated probabilities of the candidate results.

The embodiment of the application also provides a server device, which comprises: a memory and a processor; wherein the memory is used for storing a computer program;

the processor is coupled to the memory for executing the computer program for:

If the plurality of data records contain a plurality of second type attribute values, dividing the plurality of first type attribute values into a plurality of information clusters according to the association relationship between the plurality of first type attribute values and the plurality of second type attribute values;

The embodiment of the application also provides a data processing system, which comprises: a client device and a server device;

The client device is used for sending a plurality of data records to the server device, wherein the data records comprise a plurality of key attribute values under a plurality of key attributes, and each key attribute value belongs to only one data object at any given time; and for outputting, in a visual manner, the attribute values belonging to the same data object among the plurality of key attribute values.

The server device is configured to identify an attribute value belonging to the same data object in the plurality of key attribute values according to a hierarchical relationship between the plurality of key attributes and an association relationship between the plurality of key attribute values; and transmitting the attribute values belonging to the same data object in the plurality of key attribute values to the client device.

The embodiment of the application also provides a computer readable storage medium storing computer instructions, wherein the computer instructions, when executed by one or more processors, cause the one or more processors to perform the steps in the data processing method described above.

In the embodiments of the application, the attribute values that belong to the same data object among the plurality of key attribute values are identified according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values, thereby completing the longitudinal clustering of the attribute values that belong to the same data object under different attributes. Because this clustering approach takes multiple key attributes into account, the probability of erroneous clustering is reduced, which improves the accuracy of identifying the data that belongs to the same data object.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1a is a schematic diagram of a data processing system according to an embodiment of the present application;

FIG. 1b is a schematic structural diagram of a connected subgraph according to an embodiment of the present application;

FIG. 1c is a connected subgraph formed by performing a first clustering step on the connected subgraph provided in FIG. 1b;

FIG. 1d is a connected subgraph formed by hierarchical clustering of the connected subgraph provided in FIG. 1b;

FIG. 1e is a schematic structural diagram of another connected subgraph according to an embodiment of the present application;

FIG. 1f is a connected subgraph formed by hierarchical clustering of the connected subgraph provided in FIG. 1e;

FIG. 1g is a schematic diagram of a connected subgraph according to an embodiment of the present application;

FIG. 1h is a connected subgraph formed by hierarchical clustering of the connected subgraph provided in FIG. 1g;

FIG. 1i is a connected subgraph formed by hierarchical clustering of the connected subgraph provided in FIG. 1h;

FIG. 1j is a connected subgraph formed by performing hierarchical clustering again on the connected subgraph provided in FIG. 1i;

FIG. 2 is a schematic flow chart of a data processing method according to an embodiment of the present application;

FIG. 3a is a schematic diagram illustrating another data processing system according to an embodiment of the present application;

FIG. 3b is a schematic structural diagram of an information cluster according to an embodiment of the present application;

FIG. 4 is a flowchart illustrating another data processing method according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a server device according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of another server device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

To address the technical problem of low accuracy in identifying the data of the same data object, in some embodiments of the application the attribute values that belong to the same data object among a plurality of key attribute values are identified according to the hierarchical relationship among a plurality of key attributes and the association relationship among the plurality of key attribute values, thereby completing the longitudinal clustering of the attribute values that belong to the same data object under different attributes. Because this clustering approach takes multiple key attributes into account, the probability of erroneous clustering is reduced, which improves the accuracy of identifying the data that belongs to the same data object.

The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.

FIG. 1a is a schematic diagram of a data processing system according to an embodiment of the present application. As shown in FIG. 1a, the system comprises: a client device 10a and a server device 10b. The implementation of the client device 10a and the server device 10b shown in FIG. 1a is only exemplary and not limiting.

In the present embodiment, the client device 10a refers to a computer device located on the client side that has computing, communication, and similar functions. The client device 10a may be, for example, a computer or a server located on the client side.

In this embodiment, the server device 10b is a computer device capable of performing data processing, and generally has the capability of undertaking services and guaranteeing them. The server device 10b may be a single server device, a cloud server array, or a Virtual Machine (VM) running in the cloud server array. In addition, the server device may also refer to other computing devices having corresponding service capabilities, for example, a terminal device (running a service program) such as a computer.

Alternatively, the server device 10b may be a service platform of an enterprise that directly provides relevant data processing services to the client device 10a. In this scenario, the server device 10b may provide the client device 10a with cloud computing services that lie between PaaS services and SaaS services.

Alternatively, the server device 10b may be a server of a cloud service provider that is leased by a third party providing relevant data processing services to the client device 10a. In this scenario, the cloud service provider provides the third party with IaaS services, PaaS services, or cloud computing services that lie between IaaS services and PaaS services. The server leased by the third party (server device 10b) provides the client device 10a with cloud computing services that lie between PaaS services and SaaS services.

In this embodiment, the server device 10b and the client device 10a may be connected wirelessly or by wire. Optionally, the server device 10b may be communicatively connected to the client device 10a through a mobile network; accordingly, the network standard of the mobile network may be any one of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), 5G, WiMAX, and the like. Alternatively, the server device 10b may also be communicatively connected to the client device 10a by Bluetooth, WiFi, infrared, or the like.

In this embodiment, the client device 10a may store data records and may provide a plurality of data records to the server device 10b. The plurality of data records comprise a plurality of key attributes and a plurality of key attribute values under the plurality of key attributes. In the embodiments of the application, "a plurality" means two or more, whether it refers to records or to other items. Optionally, if the server device 10b is a service platform of an enterprise that directly provides the relevant data processing service to the client device 10a, the client device 10a may directly send the plurality of data records to the server device 10b. If the server device 10b is a server of a cloud service provider leased by a third party providing the relevant data processing service to the client device 10a, the client device 10a may send the plurality of data records to the third party, and the third party then sends the plurality of data records to the server device 10b.

In this embodiment, a key attribute is an attribute whose attribute values can each uniquely identify one data object at any given time; that is, each key attribute value belongs to only one data object at any given time. In some embodiments, a key attribute may also be referred to as an attribute identification (ID), and a key attribute value as an ID value. The key attributes of data objects differ across application scenarios. For example, in some application scenarios the data object may be a natural person, for which the key attributes may be, but are not limited to: an identification number, account number, cell phone number, email, passport number, member number, and so on. In other application scenarios the data object is a company, and its key attributes may be, but are not limited to, a company name, a taxpayer identification number, an industry and commerce registration number, and so on.

In the embodiment of the application, one natural object can be used as one data object, and a plurality of natural objects can also be used as one data object. For example, in some application scenarios, all members in a household may be treated as one data object; in other application scenarios, a user in an area, a business district, or a city may also be used as a data object; or the same type of natural object may also be used as one data object. For example, in a shopping scenario, the type of the user is determined according to information such as age, sex, or region of the user, and the same type of user is used as one data object, etc., but is not limited thereto.

Accordingly, the server device 10b receives the plurality of data records. Further, in the embodiment of the present application, the server device 10b stores the hierarchical relationship among the key attributes. The server device 10b may therefore identify the attribute values belonging to the same data object among the plurality of key attribute values according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values contained in the plurality of data records. In other words, the server device 10b performs hierarchical clustering on the key attribute values according to these two relationships, so as to identify the key attribute values that belong to the same data object. Further, the server device 10b transmits the attribute values belonging to the same data object among the plurality of key attribute values to the client device 10a.

Accordingly, the client device 10a receives attribute values belonging to the same data object from the plurality of key attribute values, and outputs attribute values belonging to the same data object from the plurality of key attribute values in a visualized manner.

In the embodiment of the present application, the server device 10b may send the attribute values belonging to the same data object among the plurality of key attribute values to the client device 10a in various forms. For example, as shown in FIG. 1a, the server device 10b may aggregate the attribute values belonging to the same data object in the form of a connected subgraph and send the resulting connected subgraph to the client device 10a. Accordingly, the client device 10a presents the received connected subgraph on its display screen. Each connected subgraph corresponds to one data object. In a connected subgraph, nodes represent key attribute values, and a connecting line between two nodes represents an association relationship between the two key attribute values. The number of connected subgraphs is determined by the number of data objects corresponding to the plurality of data records; only 3 data objects (data objects 1-3) are shown in FIG. 1a. As another example, the server device 10b may aggregate the attribute values belonging to the same data object in the form of a table and send the resulting table to the client device 10a. Accordingly, the client device 10a presents the received table on its display screen. Each table corresponds to one data object, and the key attribute values in the same row or column represent attribute values that have an association relationship.

With the data processing system provided by this embodiment, the attribute values belonging to the same data object among the plurality of key attribute values can be identified according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values, thereby completing the longitudinal clustering of the attribute values that belong to the same data object under different attributes. Because this clustering approach takes multiple key attributes into account, the probability of erroneous clustering is reduced, which improves the accuracy of identifying the data that belongs to the same data object.

In an embodiment of the present application, the plurality of data records may be data records generated by a plurality of terminal devices to which the client device 10a provides services. For example, the client device 10a is a server device of an online shopping platform, an application program related to the online shopping platform is installed on each terminal device, and a user can access the online shopping platform through the application program; the client device 10a can record the data generated when the user accesses the online shopping platform and store the related data records. A data record may be stored in various forms, such as a character string or a table.

In an embodiment of the present application, the plurality of data records sent by the client device 10a to the server device 10b may be data generated for a plurality of data objects, that is, data records generated across screens. The data processing mode provided by the embodiment of the application can therefore identify the data generated by a data object across screens. In addition, these data records may relate to multiple fields. For example, in some application scenarios, the data records may relate to fields such as online shopping, video, gaming, and sporting events. The data processing mode provided by the embodiment of the application can therefore also identify the data generated by a data object across fields.

In some application scenarios, the data record also includes some behavior data. For example, in a shopping scenario, the data record includes data of a user's purchased goods; for another example, in a video scene, the data record includes video data viewed by a user; for another example, in a web game scenario, the data record contains game data for the user. These behavior data may reflect the behavior habits of the user, from which goods, videos, games, etc. that match the behavior habits of the user may be recommended to the user. Based on this, the server device 10b may further search the behavior data of the data object corresponding to each connected sub-graph in the plurality of data records, obtain the behavior feature of the data object corresponding to each connected sub-graph according to the behavior data of the data object corresponding to each connected sub-graph, and further add the behavior feature of the data object corresponding to each connected sub-graph as the identification information of the data object corresponding to each connected sub-graph. Further, the server-side device 10b transmits the connected subgraph with the identification information of the data object to the client device 10a. In this way, the client device 10a can recommend the content conforming to the identification information to the user based on the identification information of the data object on each connected subgraph.

In the embodiment of the present application, the server device 10b may pre-establish a hierarchical relationship among several key attributes, where the number of these key attributes is greater than or equal to the number of key attributes in the plurality of data records. Assume that the number of pre-established key attributes is M and the number of key attributes in the plurality of data records is N, where M and N are integers greater than or equal to 2 and M is greater than or equal to N. Based on this, the server device 10b may extract the hierarchical relationship among the plurality of key attributes contained in the plurality of data records from the pre-established hierarchical relationship among the several key attributes.
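As a minimal sketch of this extraction step, assume the pre-established hierarchy is stored as a list ordered from the highest level to the lowest. The source only speaks of a "hierarchical relationship", so the list representation, the function name `extract_hierarchy`, and the attribute names are illustrative assumptions.

```python
from typing import Iterable, List


def extract_hierarchy(full_hierarchy: List[str],
                      record_attributes: Iterable[str]) -> List[str]:
    """Return the key attributes found in the records, ordered by level.

    full_hierarchy: the M pre-established key attributes, highest level first
                    (assumed representation, not specified by the source).
    record_attributes: the N key attributes present in the data records.
    """
    present = set(record_attributes)
    unknown = present - set(full_hierarchy)
    if unknown:
        # M >= N is required: every record attribute must have a known level.
        raise ValueError(f"attributes without a known level: {sorted(unknown)}")
    return [attr for attr in full_hierarchy if attr in present]


# Hypothetical hierarchy over M = 5 key attributes, highest level first.
FULL_HIERARCHY = ["id_number", "passport_number", "phone", "email", "account"]

# The data records only contain N = 3 of them.
sub_hierarchy = extract_hierarchy(FULL_HIERARCHY, ["email", "id_number", "phone"])
print(sub_hierarchy)
```

The relative order of the N attributes is inherited unchanged from the larger hierarchy, which is what allows one hierarchy to be reused across different batches of data records.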

Alternatively, the server device 10b may obtain historical data records within a specified historical period, where the historical data records include historical key attribute values under the several key attributes; the server device 10b may then establish the hierarchical relationship among the several key attributes according to the number of historical key attribute values under each of them. The specified historical period may be, but is not limited to, the past 5 years, the past 2 years, or the past 6 months. Optionally, the server device 10b may divide the number of historical key attribute values under each key attribute by the length of the historical period to obtain the number of historical key attribute values under that key attribute per unit time, that is, the rate at which the values under that key attribute change over time, and may then establish the hierarchical relationship among the several key attributes according to these rates of change. The fewer historical key attribute values a key attribute has, the higher its level. Alternatively, a technician may preliminarily rank the levels of the several key attributes based on experience, with the key attribute ranked first being, in theory, the most stable. Based on this, the server device 10b may divide the number of historical key attribute values under each key attribute by the product of the number of historical key attribute values under the first-ranked key attribute and the length of the historical period, to obtain, for each key attribute, the number of historical key attribute values per unit time that correspond to each historical key attribute value under the first-ranked key attribute; the server device 10b may then establish the hierarchical relationship among the several key attributes according to these numbers. Optionally, the fewer historical key attribute values a key attribute has per historical key attribute value under the first-ranked key attribute, the higher its level.
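The first variant described above (ranking by rate of change) might look as follows. This is a sketch under stated assumptions: the "number of historical key attribute values" is counted as the number of distinct values, ties are broken alphabetically, and the function and attribute names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def rank_key_attributes(history: Iterable[Tuple[str, str]],
                        period_length: float) -> List[str]:
    """Order key attributes from highest level to lowest.

    history: (key_attribute, key_attribute_value) pairs seen in the
             specified historical period.
    period_length: length of that period in any consistent time unit.
    """
    if period_length <= 0:
        raise ValueError("period_length must be positive")

    # Assumption: count distinct historical values under each key attribute.
    distinct: Dict[str, Set[str]] = defaultdict(set)
    for attribute, value in history:
        distinct[attribute].add(value)

    # Values per unit time, i.e. the rate of change of each key attribute.
    rate = {attr: len(values) / period_length
            for attr, values in distinct.items()}

    # Fewer values per unit time means a more stable attribute, hence a
    # higher level; the name is used only as a deterministic tie-breaker.
    return sorted(rate, key=lambda attr: (rate[attr], attr))


# Hypothetical two-year history.
history = [
    ("id_number", "ID-1"),
    ("phone", "138-0001"), ("phone", "139-0002"),
    ("account", "u1"), ("account", "u2"), ("account", "u3"),
]
ranking = rank_key_attributes(history, period_length=2.0)
print(ranking)
```

The second variant in the text would additionally divide each rate by the number of values under the first-ranked attribute; that changes the scale of the rates but is applied uniformly to all attributes.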

On the other hand, the server device 10b may further analyze the association relationship between the plurality of key attribute values based on the membership relationship between the plurality of key attribute values and the plurality of data records. In the embodiment of the application, the key attribute value belonging to the same data record is defined as the key attribute value with a direct association relation; and defining the key attribute values which do not belong to the same data record but still have the association relationship as the key attribute values with the indirect association relationship. Further, in the embodiment of the present application, both the key attribute value having the direct association relationship and the key attribute value having the indirect association relationship are considered to have the association relationship. Based on this, the server device 10b may determine whether there is an association relationship between two key attribute values according to whether any two key attribute values appear in the same data record or appear in different data records having indirect association relationship with each other.

An exemplary description will be made below taking a first key attribute value and a second key attribute value among a plurality of key attribute values as an example. The first key attribute value and the second key attribute value are any two key attribute values in the plurality of key attribute values.

Determination mode 1: judging whether the first key attribute value and the second key attribute value appear in the same data record.

Determination mode 2: judging whether one of the first key attribute value and the second key attribute value appears in the same data record as a key attribute value that has an association relationship with the other.

Determination mode 3: judging whether a key attribute value that has an association relationship with the first key attribute value and a key attribute value that has an association relationship with the second key attribute value appear in the same data record.

Accordingly, if a judgment result of yes exists among determination modes 1-3, it is determined that the first key attribute value and the second key attribute value have an association relationship. Here, a judgment result of yes exists among determination modes 1-3 when the judgment result of any one or more of determination modes 1-3 is yes.
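One way to read determination modes 1-3 is that association is the transitive closure of "appears in the same data record". Under that interpretation, which is an assumption rather than something the text states outright, a disjoint-set (union-find) structure answers the question for any pair of values. The class and function names below are illustrative.

```python
from typing import Dict, Hashable, Iterable, Sequence


class DisjointSet:
    """Minimal union-find used to group associated key attribute values."""

    def __init__(self) -> None:
        self._parent: Dict[Hashable, Hashable] = {}

    def find(self, item: Hashable) -> Hashable:
        self._parent.setdefault(item, item)
        root = item
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while self._parent[item] != root:
            self._parent[item], item = root, self._parent[item]
        return root

    def union(self, a: Hashable, b: Hashable) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a


def build_associations(records: Iterable[Sequence[Hashable]]) -> DisjointSet:
    """Values in the same record are directly associated (mode 1);
    chains of shared records yield indirect associations (modes 2 and 3)."""
    groups = DisjointSet()
    for record in records:
        values = list(record)
        for other in values[1:]:
            groups.union(values[0], other)
    return groups


def are_associated(groups: DisjointSet, a: Hashable, b: Hashable) -> bool:
    return groups.find(a) == groups.find(b)


# Hypothetical records; each value is an (attribute, value) pair.
records = [
    [("phone", "138"), ("email", "a@x")],
    [("email", "a@x"), ("account", "u1")],
    [("account", "u1"), ("id_number", "ID9")],
    [("phone", "139"), ("email", "b@y")],
]
groups = build_associations(records)
print(are_associated(groups, ("phone", "138"), ("id_number", "ID9")))
```

In the example, `("phone", "138")` and `("email", "a@x")` satisfy mode 1, `("phone", "138")` and `("account", "u1")` satisfy mode 2, and `("phone", "138")` and `("id_number", "ID9")` satisfy mode 3, while the two phone numbers share no chain of records and are not associated.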

In the embodiment of the present application, the server device 10b may identify, according to the hierarchical relationship between the plurality of key attributes and the association relationship between the plurality of key attribute values, attribute values belonging to the same data object among the plurality of key attribute values. Alternatively, the server device 10b may perform initial clustering on the plurality of key attribute values according to the association relationship between the plurality of key attribute values, so as to obtain a plurality of clusters. Wherein each cluster contains key attribute values having an association relationship. Further, for each cluster, the server device 10b may perform secondary clustering on the key attribute values in each cluster according to the hierarchical relationship between the plurality of key attributes, to obtain at least one sub-cluster; the key attribute values in the same sub-cluster can be regarded as attribute values belonging to the same data object.

Optionally, in some embodiments, a giant cluster may exist among the plurality of clusters, and the giant cluster may be pruned to reduce the amount of computation. In practice, a giant cluster is typically one in which the number of key attribute values is greater than or equal to a preset first number threshold, or one that contains one or more problem attribute values, where a problem attribute value is a key attribute value that, within the cluster, is associated with a number of other key attribute values greater than or equal to a preset second number threshold. For example, in an online shopping scenario, cheating accounts created through fake-order brushing may appear; in hotels or other travel enterprises, problem attribute values such as tour cards and tool cards may occur. Based on this, in the embodiment of the present application, before the secondary clustering is performed on the key attribute values in each cluster, the server device 10b may further perform at least one of the following judgment operations:

Judgment operation 1: judging whether the plurality of clusters include a giant cluster in which the number of key attribute values is greater than or equal to the preset first number threshold.

Judgment operation 2: judging whether the plurality of clusters include a giant cluster containing a problem attribute value.

Further, if the result of at least one of the judgment operations is yes, pruning is performed on the giant clusters among the plurality of clusters. That is, if the judgment result of judgment operation 1 and/or judgment operation 2 is yes, it is determined that a giant cluster exists among the plurality of clusters, and the giant cluster is pruned.

Optionally, for a giant cluster for which the judgment result of judgment operation 1 is yes, the K attribute value relation pairs with the highest association weights are retained according to a preset number K of attribute value relation pairs, and the remaining attribute value relation pairs are pruned. K is a positive integer. Optionally, K is less than or equal to the preset first number threshold.

Further, for a giant cluster for which the judgment result of judgment operation 2 is yes, the association relationships between the problem attribute value and all other key attribute values associated with it may be cut off. Other pruning approaches may also be used.
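The two pruning rules can be sketched together as below. Several points are assumptions not fixed by the text: a cluster is given as a list of `(value_a, value_b, weight)` triples with one triple per associated pair, problem values are removed before the top-K rule is applied, and all names and thresholds are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Tuple

Triple = Tuple[Hashable, Hashable, float]


def prune_giant_cluster(triples: List[Triple],
                        first_threshold: int,
                        second_threshold: int,
                        top_k: int) -> List[Triple]:
    """Prune one cluster given as (value_a, value_b, weight) triples."""
    nodes = {value for a, b, _ in triples for value in (a, b)}

    # Number of other key attribute values each value is associated with
    # (assumes at most one triple per pair of values).
    degree: Counter = Counter()
    for a, b, _ in triples:
        degree[a] += 1
        degree[b] += 1
    problem_values = {v for v, d in degree.items() if d >= second_threshold}

    kept = list(triples)
    if problem_values:
        # Judgment operation 2: cut every association of a problem value.
        kept = [t for t in kept
                if t[0] not in problem_values and t[1] not in problem_values]
    if len(nodes) >= first_threshold:
        # Judgment operation 1: keep only the K highest-weight pairs.
        kept = sorted(kept, key=lambda t: t[2], reverse=True)[:top_k]
    return kept


# Hypothetical cluster in which "card" is linked to every other value.
cluster = [
    ("card", "a", 0.1), ("card", "b", 0.1), ("card", "c", 0.1), ("card", "d", 0.1),
    ("a", "b", 0.9), ("b", "c", 0.5), ("c", "d", 0.2),
]
pruned = prune_giant_cluster(cluster, first_threshold=5,
                             second_threshold=4, top_k=2)
print(pruned)
```

Here `"card"` reaches the second threshold and loses all of its associations, and because the cluster holds five values, only the two strongest remaining pairs survive.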

Further, when the server device 10b performs initial clustering on the plurality of key attribute values, it may do so by means of graph computation. Optionally, a graph computation framework may be used for the initial clustering, such as, but not limited to, ODPS Graph or Spark GraphX. When the server device 10b performs the initial clustering by graph computation, it may use a breadth-first traversal algorithm, a depth-first traversal algorithm, or an adjacency matrix traversal algorithm. Initial clustering of the plurality of key attribute values using the adjacency matrix traversal algorithm is described below as an example. The specific implementation is as follows: pair every two key attribute values that have an association relationship among the plurality of key attribute values to obtain a plurality of attribute value relation pairs; construct an adjacency list for each key attribute value in the plurality of attribute value relation pairs according to the association relationships among those key attribute values; and traverse the adjacency list of each key attribute value in the plurality of attribute value relation pairs to obtain a plurality of connected subgraphs, each of which is a cluster. In each connected subgraph, a node represents a key attribute value, and a connecting line between two nodes represents an association relationship between the two key attribute values.
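The pairing, adjacency-list, and traversal steps can be illustrated on a single machine with plain Python. This sketch uses a breadth-first traversal over the adjacency lists, one of the traversal options the text names, rather than a graph computation framework; the function name and sample values are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def connected_subgraphs(
        pairs: Iterable[Tuple[Hashable, Hashable]]) -> List[Set[Hashable]]:
    """Group key attribute values into clusters (connected subgraphs)."""
    # Build the adjacency list of every key attribute value in the pairs.
    adjacency: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    visited: Set[Hashable] = set()
    clusters: List[Set[Hashable]] = []
    for start in adjacency:
        if start in visited:
            continue
        # Breadth-first traversal collects one connected subgraph.
        component: Set[Hashable] = set()
        queue = deque([start])
        visited.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
        clusters.append(component)
    return clusters


# Hypothetical attribute value relation pairs.
relation_pairs = [("p1", "e1"), ("e1", "a1"), ("p2", "e2")]
clusters = connected_subgraphs(relation_pairs)
print(clusters)
```

Each returned set is one cluster of the initial clustering; the secondary clustering by attribute level would then operate inside each set.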

Optionally, the attribute value relation pairs may be represented as triples. Each triple consists of two key attribute values that have an association relationship and the association weight between them. Accordingly, in each connected subgraph, the numerical value on the edge between two nodes represents the association weight between the two associated key attribute values.

Optionally, the association weight between any two associated key attribute values may be calculated from the frequency with which the two values co-occur in the same data record and the frequency with which each of them occurs in the plurality of data records. Optionally, assuming that an association relationship exists between a first key attribute value and a second key attribute value, the association weight between the two may be expressed as a formula in X, Y and Z [formula missing in source], where X represents the frequency with which the first key attribute value and the second key attribute value co-occur in the same data record, and Y and Z represent the frequencies with which the first key attribute value and the second key attribute value, respectively, occur in the plurality of data records. For example, assume that the client device 10a transmits 10 data records to the server device 10b, that the first key attribute value and the second key attribute value have an association relationship, that the first key attribute value appears in data records 1 to 7, and that the second key attribute value appears in data records 1 to 5 and data records 8 to 10. The two values then appear together in data records 1 to 5, that is, their co-occurrence frequency is 5; the first key attribute value occurs 7 times in the 10 data records, and the second key attribute value occurs 8 times. Optionally, the association weight between the first key attribute value and the second key attribute value may then be obtained by substituting X = 5, Y = 7 and Z = 8 into the formula [expression missing in source].
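The counting step in the example above can be sketched as follows. The sketch only derives the three frequencies X, Y and Z from the data records; how they are combined into the association weight is not reproduced here and is left to the caller. The function name `occurrence_counts` and the representation of a data record as a set of values are assumptions made for illustration.

```python
def occurrence_counts(records, first, second):
    """Return (X, Y, Z) for two key attribute values.

    X: number of records containing both values,
    Y: number of records containing `first`,
    Z: number of records containing `second`.
    """
    x = sum(1 for record in records if first in record and second in record)
    y = sum(1 for record in records if first in record)
    z = sum(1 for record in records if second in record)
    return x, y, z

# Ten records mirroring the example: the first value appears in records 1-7,
# the second value in records 1-5 and 8-10.
records = []
for n in range(1, 11):
    values = set()
    if n <= 7:
        values.add("first")
    if n <= 5 or n >= 8:
        values.add("second")
    records.append(values)

X, Y, Z = occurrence_counts(records, "first", "second")
print(X, Y, Z)
```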

Further, taking a first cluster of the plurality of clusters as an example, the following describes how the server device 10b performs secondary clustering on the key attribute values in each cluster. The first cluster is any one of the clusters obtained by initial clustering. When performing secondary clustering on the key attribute values in the first cluster, the server device 10b may determine a reference core attribute from the core attributes contained in the first cluster, where the core attributes are themselves key attributes; further, according to the hierarchical relationship among the key attributes contained in the first cluster and the association weights among the key attribute values, the server device 10b takes each key attribute value under the reference core attribute as a cluster source point and clusters the key attribute values under the attributes other than the reference core attribute in the first cluster into the sub-clusters represented by the cluster source points.

Optionally, when determining the reference core attribute from the core attributes contained in the first cluster, the server device 10b may select, according to the number of key attribute values under each core attribute contained in the first cluster, the core attribute with the most key attribute values as the reference core attribute. For example, assuming that the first cluster contains 3 core attributes, namely identification card number, passport number and member number, with 5 identification card numbers, 6 passport numbers and 9 member numbers, the member number may be used as the reference core attribute.
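The selection rule above amounts to taking the core attribute with the largest number of values. A minimal sketch follows; the function name `pick_reference_core_attribute` and the placeholder values are hypothetical, and the behaviour on ties (the first attribute with the maximum count wins) is an assumption, since the description does not say how ties are resolved.

```python
def pick_reference_core_attribute(core_values):
    """Return the core attribute that has the most distinct key attribute values."""
    return max(core_values, key=lambda attribute: len(set(core_values[attribute])))

# Counts follow the example: 5 identification card numbers, 6 passport numbers,
# 9 member numbers. The values themselves are placeholders.
core_values = {
    "identification card number": [f"id-{i}" for i in range(1, 6)],
    "passport number": [f"pp-{i}" for i in range(1, 7)],
    "member number": [f"m-{i}" for i in range(1, 10)],
}
reference = pick_reference_core_attribute(core_values)
print(reference)
```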

Further, when clustering the key attribute values under the attributes other than the reference core attribute in the first cluster into the sub-clusters represented by the cluster source points, the server device 10b may process the key attribute values in order of attribute level, clustering those with higher attribute levels first. The same method is used to cluster the key attribute values at every level into the sub-clusters represented by the cluster source points. The following takes a third key attribute value as an example, where the third key attribute value is any key attribute value contained in the first cluster that has not yet been clustered into any sub-cluster.

For the third key attribute value, the server device 10b may calculate the correlation between the third key attribute value and each cluster source point according to the association weights between them, and assign the third key attribute value to the sub-cluster represented by the cluster source point with the greatest correlation.

Further, the server device 10b may determine the shortest association path between the third key attribute value and each cluster source point according to the association weights between the key attribute values that the association paths between the third key attribute value and that cluster source point pass through, and then calculate the correlation between the third key attribute value and each cluster source point according to the association weights between the key attribute values along the corresponding shortest association path.

Further, suppose a plurality of cluster source points share the greatest correlation with the third key attribute value. If the target level at which the third key attribute value is located is the level immediately below the level of the cluster source points, or is the highest level in the hierarchical relationship among the key attributes contained in the first cluster, the third key attribute value may be clustered into the sub-cluster corresponding to any one of those cluster source points. If the target level is any other level, the correlation between the third key attribute value and each key attribute value at the level immediately above the target level is calculated, and the third key attribute value is clustered into the sub-cluster to which the upper-level key attribute value with the greatest correlation belongs. To facilitate understanding of the above secondary clustering of the key attribute values in the first cluster by the server device 10b, an exemplary description is given below with reference to the connected subgraphs shown in figs. 1b to 1f.

Assume that fig. 1b is the connected subgraph corresponding to the first cluster. As shown in fig. 1b, the first cluster includes a plurality of key attributes: member number, identification card number, passport number, account number, electronic mailbox and mobile phone number. The levels of these key attributes, from high to low, are: identification card number, passport number, member number, account number, electronic mailbox (mailbox for short) and mobile phone number; the core attributes are member number, identification card number and passport number.

As can be seen from fig. 1b, among the core attributes there are 5 member numbers, 4 identification card numbers and 3 passport numbers, so the server device 10b may select the member number, which has the most key attribute values, as the reference core attribute and use the 5 member numbers as cluster source points. Further, according to the levels of the plurality of key attributes, the server device 10b first clusters the 4 identification card numbers into the sub-clusters represented by the 5 member numbers. Taking identification card number 1 as an example, the path between identification card number 1 and member number 1 is: identification card number 1 - account number 1 - mailbox 2 - member number 1; the path between identification card number 1 and member number 2 is: identification card number 1 - account number 1 - mailbox 2 - passport number 1 - member number 2; the path between identification card number 1 and member number 3 is: identification card number 1 - member number 3; the path between identification card number 1 and member number 4 is: identification card number 1 - account number 1 - mailbox 2 - passport number 1 - member number 4; and the path between identification card number 1 and member number 5 is: identification card number 1 - account number 1 - mailbox 2 - passport number 1 - member number 4 - mobile phone number 2 - member number 5. Further, the correlation between identification card number 1 and each of member numbers 1-5 is calculated according to the association weights between the key attribute values along the corresponding path. Optionally, the association weights along a path may be multiplied together to obtain the correlation. For example, the correlation between identification card number 1 and member number 4 is 0.75 × 0.87 × 0.9 × 0.9 = 0.528525.
Then, because the correlation between identification card number 1 and member number 4 is the largest, identification card number 1 is clustered into the sub-cluster corresponding to member number 4. By the same method, identification card number 2 is clustered into the sub-cluster represented by member number 3, identification card number 3 into the sub-cluster represented by member number 1, and identification card number 4 into the sub-cluster represented by member number 5. The connected subgraph formed after the identification card numbers are clustered is shown in fig. 1c.
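The correlation calculation in this example can be sketched as a search for the path whose product of association weights is largest. The sketch rests on several assumptions that are not stated in the description: association weights lie in the interval (0, 1], the "shortest association path" is read as the path with the largest weight product, the four weights 0.75, 0.87, 0.9 and 0.9 are placed on the edges in path order, and the weight 0.4 on the direct edge to member number 3 is invented for illustration. The function names are hypothetical.

```python
import heapq

def undirected(edges):
    """Build a weighted adjacency mapping from (a, b, weight) triples."""
    graph = {}
    for a, b, weight in edges:
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph

def best_path_correlation(graph, start):
    """For every reachable node, return the largest product of weights
    along any path from `start` (weights assumed to lie in (0, 1])."""
    best = {start: 1.0}
    heap = [(-1.0, start)]  # max-heap on the product, via negation
    while heap:
        negated, node = heapq.heappop(heap)
        product = -negated
        if product < best.get(node, 0.0):
            continue  # stale heap entry
        for neighbour, weight in graph.get(node, {}).items():
            candidate = product * weight
            if candidate > best.get(neighbour, 0.0):
                best[neighbour] = candidate
                heapq.heappush(heap, (-candidate, neighbour))
    return best

# Edge placement of the four weights and the 0.4 weight are assumptions.
edges = [
    ("id 1", "account 1", 0.75),
    ("account 1", "mailbox 2", 0.87),
    ("mailbox 2", "passport 1", 0.9),
    ("passport 1", "member 4", 0.9),
    ("id 1", "member 3", 0.4),
]
graph = undirected(edges)
correlation = best_path_correlation(graph, "id 1")

# Assign the value to the cluster source point with the greatest correlation.
sources = ["member 3", "member 4"]
chosen = max(sources, key=lambda source: correlation.get(source, 0.0))
print(chosen, correlation["member 4"])
```

Because every weight is at most 1, a product can only shrink as a path grows, which is what allows a Dijkstra-style greedy search to be used here.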

According to the method, the passport number, the account number, the email box and the mobile phone number are clustered in sequence to obtain connected subgraphs shown in fig. 1d, wherein each connected subgraph in fig. 1d represents a sub-cluster, and each sub-cluster corresponds to one data object.

Also, assume that fig. 1e is a connected subgraph corresponding to the first cluster. As shown in fig. 1e, the key attributes contained in the first cluster are: member number, identification card number and passport number. The levels of these key attributes, from high to low, are: identification card number, passport number and member number; the core attributes are member number and identification card number. The member number may then be selected as the reference core attribute. Further, first, according to the association weights between identification card numbers 1-4 and member numbers 1-5, identification card number 1 is clustered into the sub-cluster represented by member number 3, identification card number 2 into the sub-cluster represented by member number 4, identification card number 3 into the sub-cluster represented by member number 1, and identification card number 4 into the sub-cluster represented by member number 5. Then, according to the association weights between passport numbers 1 and 2 and member numbers 1-5, passport number 1 is clustered into the sub-cluster represented by member number 2 and passport number 2 into the sub-cluster represented by member number 3. For passport number 3, however, the correlation with member number 4 is the same as the correlation with member number 5. Because the level of the passport number is neither the level immediately below the level of the member number (the cluster source points) nor the highest level in the hierarchical relationship among the key attributes contained in the first cluster (the level of the identification card number), and because the level immediately above the level of the passport number is the level of the identification card number, the correlations between passport number 3 and identification card numbers 2 and 4 can be calculated respectively.
Further, because the correlation between passport number 3 and identification card number 2 (0.9 × 0.86 = 0.774) is greater than the correlation between passport number 3 and identification card number 4 (0.9 × 0.73 × 0.72 = 0.47304), passport number 3 can be clustered into the sub-cluster to which identification card number 2 belongs. In this way, the 5 sub-clusters corresponding to the first cluster shown in fig. 1f are obtained, where each sub-cluster corresponds to one data object.
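The tie-breaking rule applied to passport number 3 can be sketched as follows. The function `resolve_tie` and its parameters are hypothetical; the sketch assumes the correlations with the upper-level values have already been computed and that the caller knows which cluster source point each upper-level value was assigned to. "Any one" of the tied source points is implemented as the first one in the list.

```python
def resolve_tie(levels, source_level, target_level, tied_sources,
                upper_correlations, source_of):
    """Pick a cluster source point when several share the greatest correlation.

    levels: key attributes ordered from highest to lowest level.
    upper_correlations: correlation of the value with each key attribute value
        at the level immediately above the target level.
    source_of: the cluster source point each upper-level value belongs to.
    """
    target_index = levels.index(target_level)
    source_index = levels.index(source_level)
    # Highest level, or the level immediately below the source points:
    # any tied source point is acceptable.
    if target_index == 0 or target_index == source_index + 1:
        return tied_sources[0]
    # Otherwise follow the upper-level value with the greatest correlation.
    best_upper = max(upper_correlations, key=upper_correlations.get)
    return source_of[best_upper]

levels = ["identification card number", "passport number", "member number"]
chosen = resolve_tie(
    levels,
    source_level="member number",
    target_level="passport number",
    tied_sources=["member number 4", "member number 5"],
    upper_correlations={
        "identification card number 2": 0.9 * 0.86,
        "identification card number 4": 0.9 * 0.73 * 0.72,
    },
    source_of={
        "identification card number 2": "member number 4",
        "identification card number 4": "member number 5",
    },
)
print(chosen)
```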

In some embodiments, after performing secondary clustering on the key attribute values in each cluster to obtain at least one sub-cluster, the server device 10b may further perform information verification on at least one sub-cluster to ensure that the key attribute values in each sub-cluster belong to the data object corresponding to the sub-cluster. An exemplary description will be made below taking a first sub-cluster of the at least one sub-cluster as an example. Wherein the first sub-cluster is any one of the at least one sub-cluster.

For the first sub-cluster, the server device 10b may further determine whether any key attribute contained in the first sub-cluster has a plurality of key attribute values, and if so, re-cluster the first sub-cluster with the plurality of key attribute values under that key attribute as new cluster source points, until, in every sub-cluster decomposed from the first sub-cluster, the key attribute values under any one key attribute are the same. Optionally, if a plurality of key attributes in the first sub-cluster each have a plurality of key attribute values, the key attribute values under the highest-level one of those key attributes are used as the new cluster source points.
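The check described above can be sketched as a scan of the key attributes from the highest level downwards, stopping at the first attribute that holds more than one value. The function name `new_cluster_sources` and the mapping-based representation of a sub-cluster are assumptions; the sample values loosely mirror the travel service platform example given later in the description, with the single identity card value added for illustration.

```python
def new_cluster_sources(sub_cluster, levels):
    """Return (attribute, values) for the highest-level key attribute that has
    more than one value in the sub-cluster, or None if there is none.

    sub_cluster: mapping of key attribute -> set of key attribute values.
    levels: key attributes ordered from highest to lowest level.
    """
    for attribute in levels:
        values = sub_cluster.get(attribute, set())
        if len(values) > 1:
            return attribute, sorted(values)
    return None

levels = [
    "identity card number",
    "passport number",
    "bank card number",
    "payment platform account number",
    "e-commerce platform account number",
    "hotel membership card number",
    "mobile phone number",
]
sub_cluster = {
    "identity card number": {"identity card 2"},
    "passport number": {"passport 1", "passport 2"},
    "hotel membership card number": {"hotel card 2", "hotel card 3"},
    "mobile phone number": {"phone 1", "phone 2"},
}
result = new_cluster_sources(sub_cluster, levels)
print(result)
```

A return value of `None` indicates that the sub-cluster needs no further decomposition under this check.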

Optionally, for the first sub-cluster, the server device 10b may further search the plurality of data records for the non-key attribute values under a first non-key attribute that are respectively associated with the key attribute values contained in the first sub-cluster. The first non-key attribute is any one or more attributes, other than the key attributes, among the attributes related to the attribute information of the data object. Different data objects have different key attributes and non-key attributes. For example, if the data object is a natural person, the key attribute may be at least one of an identification card number, a passport number, a mobile phone number, an electronic mailbox and a bank card number; the non-key attribute may be at least one of a name, a home address, an age, a date of birth, the IP address of the terminal device when it generated a data record, and the MAC address of the terminal device, but is not limited thereto. For another example, if the data object is a company, the key attribute may be at least one of a company name, a taxpayer identification number and an industry and commerce registration number, but is not limited thereto; the non-key attribute may be at least one of a company address, a zip code and an IP address, but is not limited thereto.

Further, the server device 10b may determine whether the non-key attribute values under the first non-key attribute are consistent. If not, that is, if the non-key attribute values under the first non-key attribute are inconsistent, the key attribute values whose associated non-key attribute values under the first non-key attribute are inconsistent are used as new cluster source points, and the first sub-cluster is re-clustered until the non-key attribute values under the first non-key attribute contained in the first sub-cluster are consistent. Here, consistency of the non-key attribute values under the first non-key attribute contained in the first sub-cluster means that, within each new sub-cluster formed by re-clustering the first sub-cluster, the non-key attribute values under the first non-key attribute are consistent. For the process of re-clustering the first sub-cluster, reference may be made to the process of performing secondary clustering on the first cluster, which is not repeated here.

To make the above judging process easier to understand, the following takes as an example a data object that is a natural person, whose key attributes are an identification card number, a passport number and a member number, and whose non-key attributes are a name and a date of birth. The server device 10b may look up, in the plurality of data records, the name and date of birth associated with each of the identification card number, the passport number and the member number, and determine whether those names and dates of birth are consistent. If, for example, the name associated with the identification card number is inconsistent with the name associated with the passport number, the identification card number and the passport number are used as new cluster source points, and the first sub-cluster is re-clustered until the non-key attribute values associated with the key attribute values in each new sub-cluster are consistent.
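The lookup and consistency check in this example can be sketched by grouping the key attribute values of a sub-cluster by the non-key attribute value they are associated with in the data records. The function name `group_by_non_key_value`, the dictionary form of a data record and all sample values (including the names) are hypothetical. More than one group signals an inconsistency; how the new cluster source points are then drawn from the groups is left to the caller.

```python
from collections import defaultdict

def group_by_non_key_value(records, key_values, non_key_attribute):
    """Map each non-key attribute value to the key attribute values of the
    sub-cluster that appear with it in at least one data record."""
    groups = defaultdict(set)
    for record in records:
        non_key_value = record.get(non_key_attribute)
        if non_key_value is None:
            continue
        for key_value in key_values:
            if key_value in record.values():
                groups[non_key_value].add(key_value)
    return dict(groups)

# Hypothetical data records for one sub-cluster.
records = [
    {"identification card number": "ID-1", "name": "Zhang San"},
    {"passport number": "P-1", "name": "Li Si"},
    {"member number": "M-1", "name": "Zhang San"},
]
groups = group_by_non_key_value(records, {"ID-1", "P-1", "M-1"}, "name")
consistent = len(groups) <= 1
print(groups, consistent)
```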

The data processing method provided by the embodiment of the application is suitable for various application scenarios. For example, a company or an enterprise can use the data processing method to construct a global unified account or a group OneID. Accordingly, the server device 10b provides data processing services to the company or enterprise to construct its global unified account or group OneID, forming a complete unified account system. The company may be of various types, such as an internet company, a dairy company, a game company, a club, a finance company, a travel company, a real estate company, an electronic commerce platform or a travel service platform, or even a large group covering various businesses, but is not limited thereto.

For a dairy company, a family can be used as the data object to construct the unified account system of the dairy company; for a travel service platform, an e-commerce platform or a club, a user may be used as the data object to build the unified account system; for a game company, a group may be used as the data object to build the unified account system, where groups can be divided according to the age, sex and the like of the users, but the division is not limited thereto.

Whatever the enterprise or company, constructing its unified account system facilitates unified management of its user information. When the unified account system of the enterprise is constructed, the data processing method provided by the embodiment of the application can be used to identify the information belonging to the same user. The data processing method provided by the embodiment of the present application is described below by taking a travel service platform as an example.

Fig. 1g is the connected subgraph of one of the plurality of clusters formed by the server device 10b performing initial clustering on the plurality of key attribute values under the plurality of key attributes included in the data records provided by the travel service platform. As shown in fig. 1g, the key attributes are: identity card number, passport number, bank card number, payment platform account number, e-commerce platform account number, hotel membership card number and mobile phone number, where the identity card number and the passport number are core attributes, and the levels of the key attributes run from high to low in the order listed.

As can be seen from fig. 1g, among the core attributes there are 3 identity card numbers and 2 passport numbers, so the server device 10b selects the identity card number, which has the most attribute values, as the reference core attribute and takes the 3 identity card numbers as cluster source points. Further, according to the levels of the plurality of key attributes, the server device 10b first clusters the 2 passport numbers (passport numbers 1 and 2) into the sub-clusters represented by the 3 identity card numbers. According to the clustering method provided in the above embodiment, both passport numbers 1 and 2 may be clustered into the sub-cluster represented by identity card number 2. By the same method, the server device 10b may cluster the hotel membership card number 2 and the hotel membership card number 2 into the sub-cluster represented by the identity card number 2, and cluster the hotel membership card number 4 into the sub-cluster corresponding to the identity card number 4, to obtain the connected subgraph shown in fig. 1h.

Further, by the same method, mobile phone number 1 is clustered into the sub-cluster represented by identity card number 2, and mobile phone numbers 2 and 3 are clustered into the sub-clusters represented by identity card number 3 and identity card number 2 respectively, yielding the connected subgraphs shown in fig. 1i. Each connected subgraph in fig. 1i represents a sub-cluster, and each sub-cluster corresponds to a user. Only 3 sub-clusters are illustrated in fig. 1i, labeled sub-clusters 1-3.

Further, sub-cluster 3 contains different attribute values under the same key attribute: passport numbers 1 and 2, mobile phone numbers 1 and 2, and hotel membership card numbers 2 and 3. Because the passport number ranks higher than the mobile phone number and the hotel membership card number, sub-cluster 3 is re-clustered with the passport number as the new reference attribute, that is, with passport numbers 1 and 2 as new cluster source points. The specific process is as follows: first, according to the hierarchical relationship among the key attributes contained in sub-cluster 3, payment platform account number 1 is clustered into the sub-cluster represented by passport number 1; then e-commerce platform account number 1 is clustered into the sub-cluster represented by passport number 1, yielding the connected subgraphs shown in fig. 1j and forming the new sub-clusters (sub-clusters 1-4). Since no key attribute contained in sub-clusters 1-4 shown in fig. 1j has different key attribute values, the clustering of the different key attributes of the same user is complete. Each connected subgraph in fig. 1j corresponds to one sub-cluster, and each sub-cluster represents one user.

Further, a unified identifier may be set for each user represented by a sub-cluster shown in fig. 1j; in some embodiments this identifier may be referred to as a OneID. Optionally, among the key attribute values included in each sub-cluster, the key attribute value that occurs most often within a specified time period may be used as the OneID of the user represented by that sub-cluster. The specified time period may be set flexibly according to actual conditions, for example to the last 1 week, the last 1 month or the last 2 months, but is not limited thereto.
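The optional OneID rule above, taking the key attribute value that occurs most often within the specified time period, can be sketched as follows. The function name `choose_one_id`, the representation of occurrences as (timestamp, value) pairs and all sample data are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

def choose_one_id(sub_cluster, occurrences, now, window):
    """Return the key attribute value of the sub-cluster that occurs most often
    between `now - window` and `now`, or None if nothing occurs in the window.

    occurrences: iterable of (timestamp, key attribute value) pairs.
    """
    cutoff = now - window
    counts = Counter(
        value
        for timestamp, value in occurrences
        if value in sub_cluster and cutoff <= timestamp <= now
    )
    return counts.most_common(1)[0][0] if counts else None

now = datetime(2019, 10, 15)
sub_cluster = {"identity card 1", "phone 1"}
occurrences = (
    [(now - timedelta(days=30), "identity card 1")] * 5   # outside the window
    + [(now - timedelta(days=1), "identity card 1")]
    + [(now - timedelta(days=d), "phone 1") for d in (1, 2, 3)]
)
one_id = choose_one_id(sub_cluster, occurrences, now, timedelta(weeks=1))
print(one_id)
```

In this sample the identity card value occurs more often overall, but within the last week the phone value occurs three times against one, so the phone value is selected.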

The data processing method provided by the embodiment of the application can be applied to the construction of the unified account of an enterprise or company, and can also be applied to other application scenarios, for example to various kinds of information statistics. In the following, an example is described in which the garbage produced, water consumed, electricity consumed or gas consumed within a set time period by each cell administered by a community is counted. In this application scenario, one cell serves as one data object. For a cell, the key attributes include: cell name, cell location, unit building location, identification of households within the cell, and the like, but are not limited thereto. The cell location is location information precise to the cell, that is, location information that can distinguish different cells; the unit building location is location information precise to each unit building in the cell, that is, location information that can distinguish different unit buildings in the same cell; and the identification of a household in the cell may be any information that uniquely identifies a household, for example, but not limited to, the owner's identification card number, property certificate number, passport number or telephone number. In this application scenario, the plurality of data records acquired by the server device 10b include a plurality of key attributes, where each key attribute includes a plurality of key attribute values.
Further, each data record may be: the amount of garbage produced, water consumed, electricity consumed or gas consumed in a certain district on a certain day; or the amount of garbage produced, water consumed, electricity consumed or gas consumed at a certain location on a certain day; but is not limited thereto.

Further, the server device 10b may identify the attribute values belonging to the same cell, that is, cluster the key attribute values representing the same cell, based on the hierarchical relationship among the plurality of key attributes and the association relationships among the plurality of key attribute values. For example, the cell name, cell location, unit building location and household identifications representing the same cell may be clustered together.

Further, according to the key attribute values belonging to the same cell, the server device 10b may obtain from the plurality of data records the amount of garbage, water consumption, electricity consumption and gas consumption corresponding to each key attribute value within the set time period, so as to obtain the amount of garbage produced and the water, electricity and gas consumed by the cell within the set time period.

In other embodiments, when verifying the at least one sub-cluster, the server device 10b may further obtain, from the plurality of data records, a plurality of non-key attribute values under a second non-key attribute and a plurality of non-key attribute values under a third non-key attribute respectively associated with each sub-cluster. The non-key attribute values under the second and third non-key attributes associated with a sub-cluster are the non-key attribute values under those attributes that are associated with the key attribute values in that sub-cluster. For example, the plurality of non-key attribute values under the second non-key attribute associated with sub-cluster 1 in fig. 1f are the non-key attribute values under the second non-key attribute associated with member number 1 and identification card number 3, respectively. In this embodiment, the second non-key attribute may be the same as or different from the first non-key attribute. Optionally, if the first non-key attribute is a plurality of non-key attributes, the second non-key attribute may be one of them.
Further, the third non-key attribute is different from the second non-key attribute. The third non-key attribute may also be one of the first non-key attributes. For example, if the data object is a natural person, the first non-key attribute may be any one of a name, a home address, an age, a date of birth, an IP address and a MAC address, and correspondingly, the second non-key attribute may be any one of a name, a home address, an age, a date of birth, an IP address and a MAC address other than the first non-key attribute.

Further, the server device 10b may divide the plurality of non-key attribute values under the second non-key attribute into a plurality of information clusters by using the association relationships between the plurality of non-key attribute values under the second non-key attribute and the plurality of non-key attribute values under the third non-key attribute. The server device 10b may determine these association relationships according to the membership between the non-key attribute values under the second non-key attribute and the plurality of data records; for the specific implementation, reference may be made to the related content of the foregoing embodiment, which is not repeated here.

Further, for each information cluster, the server device 10b may calculate, for every two non-key attribute values under the second non-key attribute in that information cluster, the probability of each candidate result, where the candidate results are: belonging to the same data object, and not belonging to the same data object. That is, the server device 10b calculates the probability that the two non-key attribute values belong to the same data object and the probability that they do not. Further, the server device 10b may determine whether every two non-key attribute values under the second non-key attribute belong to the same data object according to these probabilities. Optionally, the server device 10b may select the candidate result with the greater probability as the final decision result. For example, if the probability that the two non-key attribute values belong to the same data object is greater than the probability that they do not, it may be determined that the two non-key attribute values belong to the same data object; otherwise, it is determined that they do not belong to the same data object.

The process of calculating the probabilities of the candidate results for every two non-key attribute values is described below by taking as an example a first non-key attribute value and a second non-key attribute value under the second non-key attribute included in a first information cluster. The first information cluster is any one of the plurality of information clusters, and the first non-key attribute value and the second non-key attribute value are any two non-key attribute values under the second non-key attribute contained in the first information cluster.

For the first non-key attribute value and the second non-key attribute value, the server device 10b may search the plurality of data records for the other attribute values and behavior attributes respectively associated with the first non-key attribute value and the second non-key attribute value; wherein the other attribute values are the attribute values in the plurality of data records other than the plurality of key attribute values and the non-key attribute values under the third non-key attribute. Further, the server device 10b may input the other attribute values and behavior attributes respectively associated with the first non-key attribute value and the second non-key attribute value into the decision model, to obtain the probability of each candidate result for the first non-key attribute value and the second non-key attribute value.

Further, if the probability that the first non-key attribute value and the second non-key attribute value belong to the same data object is greater than the probability that they do not belong to the same data object, it is determined that the first non-key attribute value and the second non-key attribute value belong to the same data object. Further, if the first non-key attribute value and the second non-key attribute value belong to the same data object, it is determined that the sub-cluster corresponding to the first non-key attribute value and the sub-cluster corresponding to the second non-key attribute value belong to the same data object. Further, the server device 10b may also merge the sub-clusters belonging to the same data object. Optionally, the server device 10b may acquire behavior data associated with the merged sub-cluster, analyze the behavior characteristics of the corresponding data object, and recommend corresponding content to the data object according to those behavior characteristics.
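The decision and merging steps described above can be sketched as follows. This is only an illustrative sketch: the function name `merge_sub_clusters`, the dictionary layout, and the use of a union-find structure for merging are assumptions made for the example and are not specified by the embodiment.

```python
def merge_sub_clusters(sub_clusters, pair_probabilities):
    """Merge sub-clusters whose non-key attribute values are judged to share a data object.

    sub_clusters: dict mapping a non-key attribute value (second non-key attribute)
                  to the set of key attribute values of its sub-cluster (hypothetical layout).
    pair_probabilities: dict mapping (value_a, value_b) to the probability that the
                  two values belong to the same data object, e.g. from a decision model.
    """
    parent = {value: value for value in sub_clusters}

    def find(value):
        # Follow parent links to the representative, compressing the path on the way.
        while parent[value] != value:
            parent[value] = parent[parent[value]]
            value = parent[value]
        return value

    for (a, b), p_same in pair_probabilities.items():
        p_not_same = 1.0 - p_same
        # Keep the candidate result with the greater probability.
        if p_same > p_not_same:
            parent[find(a)] = find(b)

    merged = {}
    for value, members in sub_clusters.items():
        merged.setdefault(find(value), set()).update(members)
    return list(merged.values())


# Hypothetical example data.
example_sub_clusters = {
    "device:1": {"phone:1"},
    "device:2": {"email:a"},
    "device:3": {"phone:9"},
}
example_probabilities = {("device:1", "device:2"): 0.8, ("device:2", "device:3"): 0.3}
merged_clusters = merge_sub_clusters(example_sub_clusters, example_probabilities)
```

In this example the sub-clusters of `device:1` and `device:2` are merged, while that of `device:3` stays separate.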

In the embodiment of the present application, the server device 10b may also train the decision model. Optionally, the server device 10b may perform model training with minimization of a loss function as the training target, to obtain the decision model. In this training, the positive samples are sample attribute values and sample behavior data that are respectively associated with different non-key attribute values under the second non-key attribute and are known to belong to the same data object, and the negative samples are sample attribute values and sample behavior data that are respectively associated with different non-key attribute values under the second non-key attribute and are known not to belong to the same data object. The loss function is determined according to the probabilities obtained during model training and the actual probabilities of the positive samples and the negative samples. Optionally, the actual probabilities of the positive samples and the negative samples are 1 and 0, respectively.

Optionally, the decision model may be a Wide & Deep model, a GBDT model, an LR model, an RF model, or the like, but is not limited thereto.
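As a rough illustration of such training, the following sketch fits a logistic-regression (LR) decision model by gradient descent on the log loss, with positive samples labelled 1 and negative samples labelled 0. The feature vectors, the choice of log loss, the learning rate, and all function names are assumptions made for the example; the embodiment does not fix the feature construction or the exact form of the loss function.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_decision_model(samples, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # gradient of the log loss with respect to the logit
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / m for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / m
    return weights, bias


def predict(model, x):
    """Return (P(same data object), P(not the same data object))."""
    weights, bias = model
    p_same = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return p_same, 1.0 - p_same


# Hypothetical pairwise features, e.g. [attribute overlap, behavior similarity].
positive_samples = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.95]]   # known same data object
negative_samples = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3]]   # known different data objects
model = train_decision_model(
    positive_samples + negative_samples,
    [1] * len(positive_samples) + [0] * len(negative_samples),
)
```

Calling `predict(model, features)` then yields the two candidate-result probabilities that are compared when making the final decision.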

In other embodiments, the third non-key attribute may not exist in the plurality of data records. In this case, the server device 10b may obtain, from the plurality of data records, a plurality of non-key attribute values under the second non-key attribute respectively associated with each sub-cluster, and obtain, from the plurality of data records, behavior data respectively associated with these non-key attribute values. Further, the server device 10b may obtain the behavior characteristics of the data object corresponding to each sub-cluster according to this behavior data, and determine whether the sub-clusters belong to the same data object according to the behavior characteristics of the data object corresponding to each sub-cluster.

The process of determining whether the sub-clusters belong to the same data object according to the behavior characteristics of their corresponding data objects is described below by way of example, taking a first sub-cluster and a second sub-cluster as an example. The first sub-cluster and the second sub-cluster are any two of the sub-clusters.

For the first sub-cluster and the second sub-cluster, the server device 10b may calculate a similarity between the behavior feature of the data object corresponding to the first sub-cluster and the behavior feature of the data object corresponding to the second sub-cluster, and if the calculated similarity is greater than or equal to a preset similarity threshold, the data objects corresponding to the first sub-cluster and the second sub-cluster are the same data object.
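The threshold comparison above can be sketched as follows. The embodiment does not name a similarity measure; cosine similarity over numeric behavior-feature vectors, the threshold value of 0.9, and the function names are assumptions chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an empty behavior profile as dissimilar to everything
    return dot / (norm_a * norm_b)


def same_data_object(features_a, features_b, threshold=0.9):
    """True if the similarity reaches the preset similarity threshold."""
    return cosine_similarity(features_a, features_b) >= threshold


# Hypothetical behavior features, e.g. visit counts per content category.
first_sub_cluster_features = [1.0, 2.0, 3.0]
second_sub_cluster_features = [2.0, 4.0, 6.0]
should_merge = same_data_object(first_sub_cluster_features, second_sub_cluster_features)
```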

Further, in the case that the data objects corresponding to the first sub-cluster and the second sub-cluster are the same data object, the server device 10b may combine the first sub-cluster and the second sub-cluster, add identification information to the data objects corresponding to the first sub-cluster and the second sub-cluster according to the behavior features corresponding to the first sub-cluster and the second sub-cluster, and send the first sub-cluster and the second sub-cluster to the client device 10a after adding the identification information. Accordingly, the client device 10a outputs the first sub-cluster and the second sub-cluster in a visual manner.

Optionally, the client device 10a may further recommend related content to the data objects corresponding to the first sub-cluster and the second sub-cluster according to the identification information of the first sub-cluster and the second sub-cluster.

In addition to the above-described client-server (C/S) system architecture, the data processing method provided by the embodiments of the present application may also be completed autonomously by a server device. The server device refers to a computer device located on the server side that has computing, communication, and similar capabilities; it may be a computer or a server. For example, for online shopping websites, video websites, game websites, etc., the server device may be a website server, or may be a cloud-based server array, etc. In this embodiment, the server device may store data records of accessing users. Based on this, the server device may obtain a plurality of data records from the data records it stores. The plurality of data records comprise a plurality of key attributes and a plurality of key attribute values under the plurality of key attributes. Further, in the embodiment of the present application, the server device stores a hierarchical relationship between key attributes. Accordingly, the server device may identify, according to the hierarchical relationship between the plurality of key attributes and the association relationship between the plurality of key attribute values included in the plurality of data records, attribute values belonging to the same data object among the plurality of key attribute values. That is, hierarchical clustering is performed on the key attribute values according to the hierarchical relationship among the plurality of key attributes contained in the plurality of data records and the association relationship among the plurality of key attribute values, so as to identify the key attribute values that belong to the same data object. Further, the server device outputs, in a visual manner, the attribute values belonging to the same data object among the plurality of key attribute values.
For descriptions of the key attributes, the key attribute values, and the data processing performed by the server device, reference may be made to the relevant content of the above system embodiment, which is not described herein again.

In addition to the above system embodiments, the embodiments of the present application further provide a data processing method, which is described below by way of example from the perspective of the server device.

Fig. 2 is a flow chart of a data processing method according to an embodiment of the present application. As shown in fig. 2, the method includes:

201. Acquiring a plurality of data records, wherein the plurality of data records comprise a plurality of key attribute values under a plurality of key attributes, and each key attribute value belongs to one data object at any given time.

202. Identifying attribute values belonging to the same data object among the plurality of key attribute values according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values.

203. Outputting the attribute values belonging to the same data object among the plurality of key attribute values.

In this embodiment, the server device may acquire a plurality of data records. If the server device is the server device in the C/S system architecture, it may receive a plurality of data records sent by the client device. Alternatively, the server device may be a computer device located on the server side that has computing, communication, and similar capabilities, for example a computer or a server. For example, for online shopping websites, video websites, game websites, etc., the server device may be a website server, or may be a cloud-based server array, etc. In this case, the server device may store the data records, and may therefore obtain the plurality of data records from the data records it stores. For descriptions of the data object, the key attribute, and the key attribute value, reference may be made to the related contents of the above system embodiment, which are not described herein.

In the embodiment of the application, the server device stores the hierarchical relationship among the key attributes. Accordingly, in step 202, the server device may identify attribute values belonging to the same data object from the plurality of key attribute values according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values included in the plurality of data records. That is, hierarchical clustering is performed on the key attribute values according to the hierarchical relationship among the plurality of key attributes contained in the plurality of data records and the association relationship among the plurality of key attribute values, so as to identify the key attribute values that belong to the same data object. Next, in step 203, the server device may output the key attribute values belonging to the same data object in a visual manner.

In this embodiment, according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values, the attribute values belonging to the same data object among the plurality of key attribute values can be identified, which completes the longitudinal clustering of the attribute values that belong to the same data object under different attributes. Because this clustering approach takes multiple key attributes into account, the probability of erroneous clustering is reduced, which improves the accuracy of identifying the data belonging to the same data object.

Optionally, one implementation of step 203 is as follows: the server device sends the attribute values belonging to the same data object among the plurality of key attribute values to the client device. Correspondingly, the client device receives these attribute values and outputs them in a visual manner.

Alternatively, in some embodiments, the server device may directly display the attribute values belonging to the same data object among the plurality of key attribute values on its human-machine interaction interface.

In this embodiment, the server device may output the attribute values belonging to the same data object among the plurality of key attribute values in a plurality of forms. For example, as shown in fig. 1a, the server device may aggregate the attribute values belonging to the same data object among the plurality of key attribute values in the form of a connected subgraph, and output the connected subgraph formed by the aggregation. Optionally, the server device may display the connected subgraph on its human-machine interaction interface, or may send the aggregated connected subgraph to the client device. Each connected subgraph corresponds to one data object. In the connected subgraph, a node represents a key attribute value, and a connecting line between two nodes represents an association relationship between the two key attribute values. For another example, the server device may aggregate the attribute values belonging to the same data object among the plurality of key attribute values in the form of a table, and output the table formed by the aggregation. Optionally, the server device may present the table on its human-machine interaction interface, or may send the table to the client device. Each table corresponds to one data object, and the key attribute values in the same row or column represent attribute values having an association relationship.

In some application scenarios, after step 202, the server device may further search the plurality of data records for the behavior data of the data object corresponding to each connected subgraph, and obtain the behavior features of that data object from its behavior data, so as to add the behavior features of the data object corresponding to each connected subgraph as identification information of that data object. The server device then outputs the connected subgraphs with the added identification information in a visual form.

In some embodiments, the server device may pre-establish a hierarchical relationship between several key attributes, wherein the number of these key attributes is greater than or equal to the number of key attributes in the plurality of data records. Based on this, the server device may extract the hierarchical relationship between the plurality of key attributes contained in the plurality of data records from the pre-established hierarchical relationship between the several key attributes.

Optionally, the server device may obtain historical data records within a specified historical time period, where the historical data records include historical key attribute values under a number of key attributes; further, the server device may establish a hierarchical relationship between the plurality of key attributes according to the number of historical key attribute values under each of the plurality of key attributes. The specific implementation manner of establishing the hierarchical relationship between the plurality of key attributes by the server device according to the number of the historical key attribute values under each key attribute in the plurality of key attributes may be referred to the relevant content of the system embodiment, which is not described herein.

In other embodiments, the server device may further analyze an association between the plurality of key attribute values based on membership between the plurality of key attribute values and the plurality of data records. An exemplary description will be made below taking a first key attribute value and a second key attribute value among a plurality of key attribute values as an example. The first key attribute value and the second key attribute value are any two key attribute values in the plurality of key attribute values.

In the embodiment of the application, the server device can identify the attribute value belonging to the same data object in the plurality of key attribute values according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values. Optionally, the server device may perform initial clustering on the plurality of key attribute values according to an association relationship between the plurality of key attribute values, so as to obtain a plurality of clusters. Wherein each cluster contains key attribute values having an association relationship. Further, for each cluster, the server device may perform secondary clustering on the key attribute values in each cluster according to the hierarchical relationship between the plurality of key attributes, to obtain at least one sub-cluster; the key attribute values in the same sub-cluster can be regarded as attribute values belonging to the same data object.

In the embodiment of the present application, before performing secondary clustering on the key attribute values in each cluster, the server device may further perform at least one of the following judgment operations:

Further, if the result of at least one of the judgment operations is yes, pruning is performed on the giant cluster among the plurality of clusters. That is, if the judgment result of judgment operation 1 and/or judgment operation 2 is yes, it is determined that a giant cluster exists among the plurality of clusters, and pruning is performed on the giant cluster. For a specific implementation of pruning the giant cluster, refer to the relevant content of the foregoing embodiment, which is not described herein again.

Further, when performing initial clustering on the plurality of key attribute values, the server device may adopt a graph computation approach. Optionally, the plurality of key attribute values may be initially clustered using a graph computation framework, such as ODPS Graph or Spark GraphX, but not limited thereto. A specific implementation is as follows: pairing every two key attribute values that have an association relationship among the plurality of key attribute values, to obtain a plurality of attribute value relation pairs; constructing an adjacency list for each key attribute value in the plurality of attribute value relation pairs according to the association relationships among the key attribute values in those pairs; and traversing the adjacency list of each key attribute value to obtain a plurality of connected subgraphs, wherein each connected subgraph is a cluster. In each connected subgraph, a node represents a key attribute value, and a connecting line between two nodes represents an association relationship between the two key attribute values.
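The pairing, adjacency-list, and traversal steps above can be sketched on a single machine as follows. This is a minimal in-memory illustration using breadth-first traversal, not a use of any particular graph computation framework; the function name and the example values are hypothetical.

```python
from collections import defaultdict, deque


def initial_clustering(relation_pairs):
    """Group key attribute values into connected subgraphs (one per cluster).

    relation_pairs: iterable of (value_a, value_b) pairs that have an association relationship.
    """
    # Build the adjacency list of every key attribute value.
    adjacency = defaultdict(set)
    for a, b in relation_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    visited = set()
    clusters = []
    for start in list(adjacency):
        if start in visited:
            continue
        # Breadth-first traversal collects one connected subgraph.
        component = set()
        queue = deque([start])
        visited.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
        clusters.append(component)
    return clusters


# Hypothetical attribute value relation pairs.
example_pairs = [
    ("phone:1", "email:a"),
    ("email:a", "id:X"),
    ("phone:2", "email:b"),
]
example_clusters = initial_clustering(example_pairs)
```

Here the first two pairs share `email:a` and therefore fall into one connected subgraph, while the third pair forms a second one.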

Optionally, the association weight between any two key attribute values having an association relationship may be calculated according to the frequency with which the two key attribute values occur together in the same data record and the respective frequencies with which each of the two key attribute values occurs in the plurality of data records. For the specific implementation of calculating the association weights, reference may be made to the relevant content of the above system embodiment, which is not described herein.
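One way to combine the co-occurrence frequency with the individual occurrence frequencies is sketched below. The exact formula is given in the system embodiment and is not reproduced here; the Jaccard-style ratio used in this sketch, as well as the function name and record layout, are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import combinations


def association_weights(records):
    """Compute an illustrative association weight for every co-occurring value pair.

    records: list of data records, each a list of key attribute values.
    Assumed formula: co_occurrences / (count_a + count_b - co_occurrences).
    """
    value_count = Counter()
    pair_count = Counter()
    for record in records:
        values = sorted(set(record))  # count each value once per record
        value_count.update(values)
        pair_count.update(combinations(values, 2))
    return {
        (a, b): co / (value_count[a] + value_count[b] - co)
        for (a, b), co in pair_count.items()
    }


# Hypothetical data records.
example_records = [
    ["phone:1", "email:a"],
    ["phone:1", "email:a"],
    ["phone:1", "email:b"],
]
example_weights = association_weights(example_records)
```

Under this assumed formula, `email:a` and `phone:1` co-occur in two of the three records that contain either value, so their weight is higher than that of `email:b` and `phone:1`.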

Further, the process by which the server device performs secondary clustering on the key attribute values in each cluster is illustrated below, taking a first cluster as an example. The first cluster is any one of the plurality of clusters obtained by the initial clustering. When performing secondary clustering on the key attribute values in the first cluster, the server device may determine a reference core attribute from the core attributes contained in the first cluster, wherein the core attributes belong to the key attributes; further, according to the hierarchical relationship among the key attributes contained in the first cluster and the association weights among the key attribute values, the key attribute values under the reference core attribute are respectively used as clustering source points, and the key attribute values under the non-reference core attributes in the first cluster are clustered into the sub-clusters represented by the clustering source points.

Optionally, when determining the reference core attribute from the core attributes contained in the first cluster, the server device may select, according to the number of key attribute values under each core attribute contained in the first cluster, the core attribute with the most key attribute values as the reference core attribute. Optionally, if the numbers of key attribute values under the core attributes contained in the first cluster are the same and greater than 1, any one of the core attributes contained in the first cluster may be selected as the reference core attribute, or the core attribute with the highest level may be selected as the reference core attribute. Further, if the number of key attribute values under each core attribute contained in the first cluster is 1, this indicates that the key attribute values in the first cluster belong to the same data object, and the hierarchical iteration ends.
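The selection rule above can be sketched as follows. The function name, the dictionary layout, and the convention that a smaller rank number means a higher level are assumptions for illustration; the sketch also breaks partial ties by level, which the text only describes for the case where all counts are equal.

```python
def choose_reference_core_attribute(core_values, level_rank):
    """Pick the reference core attribute of a cluster, or None if iteration should end.

    core_values: dict mapping each core attribute to the set of its key attribute
                 values in the cluster (hypothetical layout).
    level_rank:  dict mapping each core attribute to its level; a smaller number
                 is assumed to mean a higher level.
    """
    counts = {attr: len(values) for attr, values in core_values.items()}
    if all(count == 1 for count in counts.values()):
        # One value per core attribute: the cluster already describes one data object.
        return None
    max_count = max(counts.values())
    candidates = [attr for attr, count in counts.items() if count == max_count]
    # Among core attributes tied for the most values, prefer the highest level.
    return min(candidates, key=lambda attr: level_rank[attr])


# Hypothetical core attributes and levels.
example_rank = {"id_card": 0, "phone": 1}
reference = choose_reference_core_attribute(
    {"id_card": {"A", "B"}, "phone": {"1", "2", "3"}}, example_rank
)
```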

Further, when the server device clusters the key attribute values under the non-reference core attribute in the first cluster into the sub-clusters represented by the clustering source points, the key attribute values with higher attribute levels can be clustered into the sub-clusters represented by the clustering source points. The method for clustering the key attribute values under each level into the sub-clusters represented by the clustering source points is the same. An exemplary description will be given below taking a third key attribute value as an example. The third key attribute value is any key attribute value which is not clustered to any sub-cluster currently in the key attribute values contained in the first cluster.

For the third key attribute value, the server device may calculate the correlation between the third key attribute value and each clustering source point according to the association weights between the third key attribute value and each clustering source point, and assign the third key attribute value to the sub-cluster represented by the clustering source point with the greatest correlation.

Further, the server device may determine the shortest association path between the third key attribute value and each clustering source point according to the association weights between the key attribute values through which the association paths between the third key attribute value and that clustering source point pass; and calculate the correlation between the third key attribute value and each clustering source point according to the association weights between the key attribute values through which the corresponding shortest association path passes.
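A possible reading of this shortest-path step is sketched below. The embodiment does not state how path length or correlation is derived from the association weights; this sketch assumes weights in (0, 1], takes the length of an edge as the negative logarithm of its weight, and takes the correlation as the product of the weights along the shortest path. The function names and the graph layout are likewise hypothetical.

```python
import heapq
import math


def correlations_to_sources(graph, start, sources):
    """Dijkstra search from `start`; returns the assumed correlation to each reachable source.

    graph: dict mapping a key attribute value to {neighbour: association weight in (0, 1]}.
    """
    distance = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > distance.get(node, math.inf):
            continue  # stale heap entry
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d - math.log(weight)  # assumed edge length: -log(weight)
            if candidate < distance.get(neighbour, math.inf):
                distance[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    # exp(-length) equals the product of the weights along the shortest path.
    return {s: math.exp(-distance[s]) for s in sources if s in distance}


def assign_to_source(graph, value, sources):
    """Return the clustering source point with the greatest correlation, or None."""
    correlation = correlations_to_sources(graph, value, sources)
    return max(correlation, key=correlation.get) if correlation else None


# Hypothetical undirected association graph within one cluster.
example_graph = {
    "email:a": {"phone:1": 0.9, "id:Y": 0.5},
    "phone:1": {"email:a": 0.9, "id:X": 0.8},
    "id:X": {"phone:1": 0.8},
    "id:Y": {"email:a": 0.5},
}
example_sources = ["id:X", "id:Y"]
chosen_source = assign_to_source(example_graph, "email:a", example_sources)
```

In this example `email:a` reaches `id:X` through `phone:1` with an assumed correlation of 0.9 × 0.8, which exceeds its direct weight of 0.5 to `id:Y`.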

Further, in the case where there are a plurality of clustering source points having the greatest correlation with the third key attribute value: if the target level at which the third key attribute value is located is the level immediately below the level at which the clustering source points are located, or is the highest level of the hierarchical relationship between the key attributes contained in the first cluster, the third key attribute value may be clustered into the sub-cluster corresponding to any one of the plurality of clustering source points having the greatest correlation with the third key attribute value. If the target level at which the third key attribute value is located is any level other than the level immediately below the level of the clustering source points and the highest level of the hierarchical relationship between the key attributes contained in the first cluster, the correlation between the third key attribute value and each key attribute value at the level immediately above the target level is calculated, and the third key attribute value is clustered into the sub-cluster represented by the key attribute value that, among the key attribute values at the level immediately above the target level, has the greatest correlation with the third key attribute value.

In some embodiments, after performing secondary clustering on the key attribute values in each cluster to obtain at least one sub-cluster, the server device may further perform information verification on at least one sub-cluster, so as to ensure that the key attribute values in each sub-cluster belong to the data object corresponding to the sub-cluster. An exemplary description will be made below taking a first sub-cluster of the at least one sub-cluster as an example. Wherein the first sub-cluster is any one of the at least one sub-cluster.

For the first sub-cluster, the server device may search the plurality of data records for the non-key attribute values under a first non-key attribute that are respectively associated with the key attribute values contained in the first sub-cluster. The first non-key attribute is any one or more attributes, other than the key attributes, among the attributes related to the attribute information of the data object. Further, the server device may determine whether the non-key attribute values under the first non-key attribute are consistent. If the judgment result is negative, that is, if the non-key attribute values under the first non-key attribute are inconsistent, the key attribute values associated with the inconsistent non-key attribute values under the first non-key attribute are used as new source point attribute values, and the first sub-cluster is re-clustered until the non-key attribute values under the first non-key attribute contained in the first sub-cluster are consistent.
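The consistency check above can be sketched as follows. How the inconsistent values are singled out is not detailed in this passage; the sketch assumes that the most common non-key attribute value is treated as the reference and that the key attribute values disagreeing with it become the new source points. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def inconsistent_key_values(sub_cluster, non_key_lookup):
    """Return key attribute values whose first non-key attribute value deviates.

    sub_cluster:    iterable of key attribute values in one sub-cluster.
    non_key_lookup: dict mapping a key attribute value to its associated non-key
                    attribute value (e.g. a gender field) found in the data records.
    An empty list means the sub-cluster passes the verification.
    """
    groups = defaultdict(list)
    for key_value in sub_cluster:
        if key_value in non_key_lookup:
            groups[non_key_lookup[key_value]].append(key_value)
    if len(groups) <= 1:
        return []  # all associated non-key attribute values are consistent
    # Assumption: the most common non-key value is taken as the reference.
    majority = max(groups, key=lambda value: len(groups[value]))
    return sorted(
        key_value
        for value, key_values in groups.items()
        if value != majority
        for key_value in key_values
    )


# Hypothetical sub-cluster and lookup table.
example_sub_cluster = ["phone:1", "email:a", "email:b"]
example_lookup = {"phone:1": "F", "email:a": "F", "email:b": "M"}
new_source_points = inconsistent_key_values(example_sub_cluster, example_lookup)
```

A non-empty result would trigger re-clustering of the sub-cluster with the returned values as additional source points.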

In other embodiments, when verifying the at least one sub-cluster, the server device may further obtain, from the plurality of data records, a plurality of non-key attribute values under the second non-key attribute and a plurality of non-key attribute values under the third non-key attribute that are respectively associated with each sub-cluster. Further, the server device may divide the plurality of non-key attribute values under the second non-key attribute into a plurality of information clusters by using the association relationship between the plurality of non-key attribute values under the second non-key attribute and the plurality of non-key attribute values under the third non-key attribute. The server device may determine this association relationship according to the membership of the plurality of non-key attribute values under the second non-key attribute and the plurality of non-key attribute values under the third non-key attribute in the plurality of data records. For the specific implementation, reference may be made to the related content of the foregoing embodiment, and details are not repeated herein.

Further, for each information cluster, the server device may calculate, for every two non-key attribute values under the second non-key attribute in that information cluster, the probability of each candidate result. The candidate results are: belonging to the same data object, and not belonging to the same data object. That is, the probability that the two non-key attribute values belong to the same data object and the probability that they do not belong to the same data object are calculated. Further, the server device may determine, according to these probabilities, whether every two non-key attribute values under the second non-key attribute belong to the same data object. Optionally, the server device may select, as the final decision result, the candidate result corresponding to the greater of the two probabilities.

For the first non-key attribute value and the second non-key attribute value, the server device may search the plurality of data records for the other attribute values and behavior attributes respectively associated with the first non-key attribute value and the second non-key attribute value; wherein the other attribute values are the attribute values in the plurality of data records other than the plurality of key attribute values and the non-key attribute values under the third non-key attribute. Further, the server device may input the other attribute values and behavior attributes respectively associated with the first non-key attribute value and the second non-key attribute value into the decision model, to obtain the probability of each candidate result for the first non-key attribute value and the second non-key attribute value.

Further, if the probability that the first non-key attribute value and the second non-key attribute value belong to the same data object is greater than the probability that they do not belong to the same data object, it is determined that the first non-key attribute value and the second non-key attribute value belong to the same data object. Further, if the first non-key attribute value and the second non-key attribute value belong to the same data object, it is determined that the sub-cluster corresponding to the first non-key attribute value and the sub-cluster corresponding to the second non-key attribute value belong to the same data object. Further, the server device may also merge the sub-clusters belonging to the same data object. Optionally, the server device may acquire behavior data associated with the merged sub-clusters, analyze the behavior characteristics of the corresponding data object, and recommend corresponding content to the data object according to those behavior characteristics.

In the embodiment of the application, the server device can also train the decision model. The specific implementation manner of training the decision model can be referred to the relevant content of the system embodiment, which is not described herein.

In other embodiments, the third non-key attribute may not exist in the plurality of data records. In this case, the server device may obtain, from the plurality of data records, a plurality of non-key attribute values under the second non-key attribute respectively associated with each sub-cluster, and obtain, from the plurality of data records, behavior data respectively associated with these non-key attribute values. Further, the server device may obtain the behavior characteristics of the data object corresponding to each sub-cluster according to this behavior data, and determine whether the sub-clusters belong to the same data object according to the behavior characteristics of the data object corresponding to each sub-cluster.

For the first sub-cluster and the second sub-cluster, the server side device may calculate the similarity between the behavior feature of the data object corresponding to the first sub-cluster and the behavior feature of the data object corresponding to the second sub-cluster, and if the calculated similarity is greater than or equal to a preset similarity threshold, the data objects corresponding to the first sub-cluster and the second sub-cluster are the same data object.

Further, in the case that the data objects corresponding to the first sub-cluster and the second sub-cluster are the same data object, the server device may combine the first sub-cluster and the second sub-cluster, add identification information to the data objects corresponding to the first sub-cluster and the second sub-cluster according to the behavior features corresponding to the first sub-cluster and the second sub-cluster, and output the first sub-cluster and the second sub-cluster after the identification information is added.
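The similarity check and merge described above can be sketched as follows. This is only an illustration under stated assumptions: the behavior features are modelled as sets of behavior tags, Jaccard similarity stands in for the unspecified similarity measure, and the dictionary keys, the function names and the 0.5 threshold are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_sub_clusters(first, second, threshold=0.5):
    """Merge two sub-clusters if their behavior features are similar enough.

    Each sub-cluster is a dict with hypothetical keys:
      "values":   set of attribute values in the sub-cluster
      "features": set of behavior tags describing the data object
    Returns the merged, labelled cluster, or None if the two sub-clusters
    are judged to belong to different data objects.
    """
    if jaccard(first["features"], second["features"]) < threshold:
        return None
    features = first["features"] | second["features"]
    return {
        "values": first["values"] | second["values"],
        "features": features,
        # Identification information derived from the behavior features.
        "labels": sorted(features),
    }
```

The labels attached to the merged cluster could then drive the content recommendation step; how the labels are chosen is left open here.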

Optionally, the server device may send the first sub-cluster and the second sub-cluster after the identification information is added to the client device. Accordingly, the client device outputs the first sub-cluster and the second sub-cluster in a visual manner. Optionally, the client device may further recommend related content to the data objects corresponding to the first sub-cluster and the second sub-cluster according to the identification information of the first sub-cluster and the second sub-cluster.

Alternatively, the server device can display the first sub-cluster and the second sub-cluster, after the identification information is added, on the man-machine interaction interface of the server device. Further, the server device recommends relevant content to the data objects corresponding to the first sub-cluster and the second sub-cluster according to the identification information of the first sub-cluster and the second sub-cluster.

Accordingly, embodiments of the present application also provide a computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the steps in the data processing method described above.

FIG. 3a is a schematic diagram illustrating another data processing system according to an embodiment of the present application. As shown in fig. 3a, the system comprises: a client device 30a and a server device 30b. For a description of the implementation manner of the client device and the server device and the communication manner of the client device and the server device, reference may be made to the related content of the system embodiment shown in fig. 1a, which is not described herein.

In this embodiment, the client device 30a may generate a data record and may send a plurality of data records to the server device 30b. The plurality of data records comprise a plurality of attribute values of a first type. In this embodiment, the first type of attribute is an attribute to be processed, which may be a key attribute or a non-key attribute in the above embodiment.

Further, in this embodiment, the plurality of data records may or may not include the second type attribute value. The second type of attribute is a preset attribute, which may be a key attribute or a non-key attribute in the above embodiment. Preferably, the second type of attribute is the key attribute described above. In this embodiment, if the plurality of data records include a plurality of second-type attribute values, the server device 30b may cluster the plurality of first-type attribute values according to the association relationship between the plurality of first-type attribute values and the plurality of second-type attribute values, to obtain a plurality of information clusters. That is, the server device 30b clusters the first type attribute values having the association relationship with the same second type attribute values to obtain a plurality of information clusters. Wherein the first type attribute value in each information cluster is associated with the second type attribute value corresponding to the information cluster.

Further, for each information cluster, the server device 30b may calculate, for different first type attribute values, the probability of each candidate result. The candidate results comprise: belonging to the same data object and not belonging to the same data object. That is, the server device 30b calculates the probability that different first type attribute values belong to the same data object and the probability that they do not belong to the same data object.

Further, the server device 30b may determine whether the different first type attribute values in each information cluster belong to the same data object according to the probabilities of the candidate results. Optionally, the server device 30b may select, from the probability that different first type attribute values belong to the same data object and the probability that they do not belong to the same data object, the candidate result corresponding to the greater probability as the final decision result. That is, if the probability that the different first type attribute values belong to the same data object is greater than the probability that they do not belong to the same data object, it may be determined that the different first type attribute values belong to the same data object; otherwise, it is determined that the different first type attribute values do not belong to the same data object.

Further, the server device 30b may transmit the first type attribute information belonging to the same data object to the client device 30a. Accordingly, the client device 30a receives the first-type attribute information belonging to the same data object, and outputs the first-type attribute values belonging to the same data object in a visualized manner.

In the embodiment of the present application, the server device 30b may send the first type attribute information belonging to the same data object to the client device 30a in various forms. For example, as shown in fig. 3a, the server device 30b may aggregate, in the form of connected subgraphs, the attribute values belonging to the same data object among the plurality of first type attribute values, and send the connected subgraphs formed by aggregation to the client device 30a. Accordingly, the client device 30a presents the received connected subgraphs on the display screen. Each connected subgraph corresponds to one data object. In a connected subgraph, the nodes represent first type attribute values, and a connecting line between two nodes represents an association relationship between the two first type attribute values. For another example, the server device 30b may also aggregate, in the form of tables, the attribute values belonging to the same data object among the plurality of first type attribute values, and send the tables formed by aggregation to the client device 30a. Accordingly, the client device 30a presents the received tables on the display screen. Each table corresponds to one data object, and the first type attribute values in the same row or column represent attribute values having an association relationship.

According to the data processing system provided by the embodiment, the transverse clustering of the attribute values under the same attribute of the same data object can be completed according to the probability that different attribute values under the same attribute belong to the same data object and do not belong to the same data object. Because the data clustering mode does not need to have a stronger association relation between attribute values, the requirement on data attributes can be reduced, and the data clustering method is beneficial to realizing the flexibility and universality of data clustering.

In this embodiment, the plurality of data records sent by the client device 30a to the server device 30b may be data generated for a plurality of data objects, that is, the plurality of data records are data records generated across screens. The data processing manner provided by this embodiment can realize cross-screen identification of the data generated by a data object. In addition, these data records may relate to a plurality of fields. For example, in some application scenarios, these data records may relate to fields such as online shopping, video, gaming, and sporting events. Therefore, the data processing manner provided by this embodiment can also realize cross-domain identification of the data generated by a data object.

In this embodiment, the server device 30b may determine a first type attribute value associated with each second type attribute value according to an association relationship between the first type attribute value and the second type attribute value, and use the first type attribute value associated with each second type attribute value as an information cluster, so as to obtain a plurality of information clusters.

Optionally, when determining the first type attribute value associated with each second type attribute value, the server device 30b may determine the association relationship between the first type attribute value and the second type attribute value according to the membership relationship between the first type attribute value and the second type attribute value in the plurality of data records. The detailed description will be referred to the related content in the data system shown in fig. 1a, and will not be repeated here. Further, the server device 30b may pair the first type attribute value with the associated second type attribute value to obtain a plurality of attribute value relationship pairs, where one attribute value relationship pair includes one first type attribute value and one second type attribute value, and the first type attribute value and the second type attribute value in the attribute value relationship pair have an association relationship.
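The pairing and grouping just described can be sketched as follows, assuming each data record is a dictionary and that two values are associated when they appear in the same record. The field names `account` and `ip` used in the usage note and the function names are hypothetical.

```python
from collections import defaultdict


def build_relation_pairs(records, first_attr, second_attr):
    """Pair each first-type value with the second-type value in the same record."""
    pairs = set()
    for record in records:
        first, second = record.get(first_attr), record.get(second_attr)
        if first is not None and second is not None:
            pairs.add((first, second))
    return pairs


def build_information_clusters(pairs):
    """Group first-type values by the second-type value they are associated with."""
    clusters = defaultdict(set)
    for first, second in pairs:
        clusters[second].add(first)
    return dict(clusters)
```

For example, calling `build_relation_pairs(records, "account", "ip")` and passing the result to `build_information_clusters` yields one information cluster per IP address.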

Optionally, the server device 30b may use a graph calculation manner to cluster the plurality of attribute value pairs to obtain a plurality of connected subgraphs, where each connected subgraph corresponds to one information cluster. In each connected subgraph, a center node represents a second type attribute value, an edge node represents a first type attribute value, and a connection line between the center node and the edge node represents a corresponding association relationship between the first type attribute value and the second type attribute value. For a specific implementation of clustering the plurality of attribute value relation pairs in the graph calculation, reference may be made to the related content of the foregoing embodiment, which is not described herein.

Optionally, the attribute value relation pairs may be represented in the data structure of triples. Each triple consists of a first type attribute value, a second type attribute value, and the association weight between the two attribute values. Therefore, in each connected subgraph, the numerical value on the line between two nodes represents the association weight between the two attribute values having an association relationship. The connected subgraph can be of a heterogeneous undirected weighted graph structure.
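As a rough sketch of the graph calculation, the triples can be loaded into an undirected weighted adjacency map and split into connected subgraphs by breadth-first search. The node tagging scheme and the function name are assumptions made for illustration; the application does not prescribe a particular graph algorithm.

```python
from collections import defaultdict, deque


def connected_subgraphs(triples):
    """Split (first_value, second_value, weight) triples into connected subgraphs.

    Nodes are tagged ("first", v) or ("second", v) so that a first-type value
    and a second-type value with the same text stay distinct (heterogeneous
    graph). Returns a list of dicts mapping frozenset({node_a, node_b}) -> weight.
    """
    adjacency = defaultdict(dict)
    for first, second, weight in triples:
        a, b = ("first", first), ("second", second)
        adjacency[a][b] = weight
        adjacency[b][a] = weight  # undirected: store both directions

    seen, subgraphs = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, edges = deque([start]), {}
        while queue:
            node = queue.popleft()
            for neighbour, weight in adjacency[node].items():
                edges[frozenset((node, neighbour))] = weight
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        subgraphs.append(edges)
    return subgraphs
```

Note that if one first type attribute value is linked to several second type attribute values, this sketch places them all in a single component rather than in separate star-shaped subgraphs.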

Optionally, the server device 30b may calculate the association weight between a first type attribute value and a second type attribute value having an association relationship according to the co-occurrence frequency of the two attribute values in the same data record and the respective occurrence frequencies of the two attribute values in the plurality of data records. For the specific implementation, reference may be made to the related content of the foregoing embodiment, which is not repeated here.
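The exact weight formula is given in the foregoing embodiment and is not repeated in this passage. Purely as an illustration of how the three frequencies might be combined, the sketch below uses a Jaccard-style ratio (co-occurrences divided by the number of records containing either value); this ratio, the field handling and the function name are assumptions, not the formula of the application.

```python
from collections import Counter


def association_weights(records, first_attr, second_attr):
    """Illustrative weight: co-occurrence count / (count_first + count_second - co-occurrence)."""
    first_count, second_count, pair_count = Counter(), Counter(), Counter()
    for record in records:
        first, second = record.get(first_attr), record.get(second_attr)
        if first is not None:
            first_count[first] += 1
        if second is not None:
            second_count[second] += 1
        if first is not None and second is not None:
            pair_count[(first, second)] += 1
    return {
        (first, second): co / (first_count[first] + second_count[second] - co)
        for (first, second), co in pair_count.items()
    }
```

With this choice, a pair that always appears together gets weight 1, and the weight falls as either value also appears with other partners.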

Optionally, for each information cluster, pruning may be performed on the information cluster if the number of first type attribute values associated with the second type attribute value in the information cluster is greater than a preset third number threshold. Optionally, according to the association weights between the first type attribute values and the second type attribute value in the information cluster, the first type attribute values with the smaller association weights are cut off, so that the number of first type attribute values remaining in the information cluster is smaller than or equal to the preset third number threshold. In this way, edges with low weights can be cleaned, attribute value relation pairs with relatively weak associations can be removed, and the complexity of the graph is reduced.

Alternatively, the server device 30b may also perform anti-cheating and cleaning operations on the attribute value relation pairs with low weights according to the application scenario. Based on this, if the number of first type attribute values in a certain information cluster is greater than a preset fourth number threshold, the server device 30b may further clean the information cluster, i.e. prune the information cluster. Preferably, the fourth number threshold is greater than the third number threshold. Taking the case where the data object is a natural person, the first type attribute is an account and the second type attribute is an IP address as an example, in practical application a normal family generally shares one IP address among several natural persons, that is, one IP address is associated with a limited number of accounts, while some organizations may use one IP address for hundreds or even thousands of accounts. Based on this, the maximum number of accounts that can be associated with one IP address (i.e., the fourth number threshold) may be preset; if the number of accounts associated with the IP address in one information cluster is greater than the fourth number threshold, the information cluster may be considered a cheating information cluster, and the whole information cluster may be pruned.
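The two cleaning rules can be sketched together as follows, assuming a cluster is given as a mapping from each first type attribute value to its association weight with the cluster's second type attribute value. The function name and the concrete threshold values in the usage note are hypothetical.

```python
def clean_cluster(weights, third_threshold, fourth_threshold):
    """Prune or discard one information cluster.

    weights maps each first-type value in the cluster to its association
    weight with the cluster's second-type value.
    """
    if len(weights) > fourth_threshold:
        # Too many associated values: treat as a cheating cluster and drop it.
        return {}
    if len(weights) > third_threshold:
        # Keep only the values with the largest association weights.
        keep = sorted(weights, key=weights.get, reverse=True)[:third_threshold]
        return {value: weights[value] for value in keep}
    return dict(weights)
```

For instance, `clean_cluster(w, third_threshold=20, fourth_threshold=500)` would keep at most 20 accounts per IP address and discard any IP address shared by more than 500 accounts.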

Further, after the first type attribute values associated with each second type attribute value are determined, the server device 30b may use a Cartesian product to associate the first type attribute values associated with the same second type attribute value with one another, so as to generate first type attribute value relation pairs suspected to belong to the same data object, where each first type attribute value relation pair includes two different first type attribute values.
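A minimal sketch of this pair generation is given below. It uses `itertools.combinations`, which yields the same unordered pairs of distinct values as a Cartesian product of the cluster with itself after removing self-pairs and mirrored duplicates; the function name and the cluster representation are assumptions.

```python
from itertools import combinations


def candidate_pairs(clusters):
    """Generate suspected same-object pairs of first-type values.

    clusters maps each second-type value to the set of first-type values
    associated with it. Each returned pair holds two different values.
    """
    pairs = set()
    for values in clusters.values():
        pairs.update(combinations(sorted(values), 2))
    return pairs
```

Because the number of pairs grows quadratically with the cluster size, pruning large clusters beforehand keeps this step tractable.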

Further, for each information cluster, the server device 30b may calculate, for different first type attribute values in the information cluster, the probability of each candidate result, and determine whether the different first type attribute values belong to the same data object according to these probabilities.

The following describes, by way of example, how the server device 30b calculates the probability of each candidate result for different first type attribute values, taking the first type attribute values A and B in a first cluster of the plurality of information clusters as an example. The first cluster is any one of the plurality of information clusters, and the first type attribute values A and B are any two of the first type attribute values contained in the first cluster.

In this embodiment, the server device 30b obtains, from the plurality of data records, other attribute values and behavior data respectively associated with the first type attribute values A and B, wherein the other attribute values are attribute information other than the first type attribute values. That is, the other attribute values may be second type attribute values, or attribute information other than the first type attribute values and the second type attribute values. Further, the server device 30b inputs the other attribute values and behavior data respectively associated with the first type attribute values A and B into the decision model, to obtain the probability of each candidate result for the first type attribute values A and B.

Further, if the probability that the first type attribute values A and B belong to the same data object is greater than the probability that they do not belong to the same data object, it is determined that the first type attribute values A and B belong to the same data object. Optionally, the server device 30b may obtain the behavior data associated with the first type attribute values A and B, analyze the behavior characteristics of the corresponding data object, and recommend corresponding content to the data object according to the behavior characteristics of the data object.

In an embodiment of the present application, the server device 30b may also train the decision model. Optionally, the server device 30b may take minimization of a loss function as the training target, take the sample attribute values and sample behavior data that are known to belong to the same data object but are respectively associated with different first type attribute values as positive samples, take the sample attribute values and sample behavior data that are known not to belong to the same data object and are respectively associated with different first type attribute values as negative samples, and perform model training to obtain the decision model. The loss function is determined according to the probabilities obtained by model training for the positive and negative samples and the actual probabilities of the positive and negative samples. Optionally, the actual probabilities of the positive and negative samples are 1 and 0, respectively. Optionally, the loss function may be expressed by a formula over the actual probabilities of the positive and negative samples and the probabilities obtained by model training for the positive and negative samples, and the latter probabilities may in turn be expressed by a further formula. [Both formulas are missing from this text.]
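To make the training procedure concrete, the sketch below assumes a logistic model trained with a binary cross-entropy loss by batch gradient descent. Both the model form and the loss are assumptions chosen as a common default and are not necessarily those of the application; the numeric pair features, function names and hyperparameters are likewise hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Probability that the pair described by `features` belongs to the same data object."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def cross_entropy(samples, weights, bias, eps=1e-12):
    """Mean binary cross-entropy; labels are 1 (positive sample) or 0 (negative sample)."""
    total = 0.0
    for features, label in samples:
        p = min(max(predict(weights, bias, features), eps), 1 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(samples)


def train(samples, learning_rate=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the cross-entropy loss."""
    dim = len(samples[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for features, label in samples:
            error = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias
```

After training, `predict` gives the probability of the candidate result "belonging to the same data object", and one minus that value gives the probability of the other candidate result.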

In some application scenarios, the plurality of data records may not include the second type attribute value, and the server device 30b may further obtain behavior data associated with the plurality of first type attribute values from the plurality of data records, and obtain behavior features of the data objects corresponding to the plurality of first type attribute values according to the behavior data associated with the plurality of first type attribute values. Further, the server device 30b may determine whether the plurality of attribute values of the first type belong to the same data object according to the behavior characteristics of the data object corresponding to the plurality of attribute values of the first type.

The determination of whether the plurality of first-type attribute values belong to the same data object is exemplarily described below taking the first-type attribute values C and D of the plurality of first-type attribute values as an example. Wherein the first type attribute values C and D are any two attribute values of the plurality of first type attribute values.

For the first type attribute values C and D, the server device 30b may calculate the similarity of the behavior features of the data objects corresponding to the first type attribute values C and D, and further determine that the first type attribute values C and D belong to the same data object if the similarity of the behavior features of the data objects corresponding to the first type attribute values C and D is greater than or equal to the set similarity threshold. Correspondingly, if the similarity of the behavior characteristics of the data objects corresponding to the first type attribute values C and D is smaller than the set similarity threshold, determining that the first type attribute values C and D do not belong to the same data object.
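The comparison of C and D can be sketched as follows, assuming the behavior data of each attribute value is a list of category strings, the behavior feature is the relative frequency of each category, and cosine similarity is the similarity measure. All of these choices, together with the function names and the 0.8 threshold, are illustrative assumptions.

```python
import math
from collections import Counter


def behavior_features(events):
    """Turn a list of behavior events (category strings) into relative frequencies."""
    counts = Counter(events)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()} if total else {}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts (0.0 if either is empty)."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def same_data_object(events_c, events_d, threshold=0.8):
    """True when the behavior features of C and D reach the similarity threshold."""
    return cosine(behavior_features(events_c), behavior_features(events_d)) >= threshold
```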

Further, after identifying the first type attribute value relation pair actually belonging to the same data object from the first type attribute value relation pair suspected to belong to the same data object in the first information cluster, the server device 30b may further aggregate the first type attribute values in the first type attribute value relation pair actually belonging to the same data object by using a graph calculation method, that is, aggregate the first type attribute values belonging to the same data object, to obtain a connected subgraph corresponding to each data object. In the connected subgraph, each node represents a first type attribute value.
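One way to carry out this aggregation is a union-find (disjoint-set) pass over the confirmed pairs, sketched below. Union-find is a stand-in for the unspecified graph calculation method, and the function name is hypothetical.

```python
def aggregate(pairs):
    """Group first-type values linked by confirmed same-object pairs, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two groups

    groups = {}
    for value in list(parent):
        groups.setdefault(find(value), set()).add(value)
    return list(groups.values())
```

Each returned set corresponds to one connected subgraph, that is, to the first type attribute values of one data object.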

In order to more clearly understand the above data processing procedure, the following will take a data object as a natural person, a first type attribute is an account number, and a second type attribute is an IP address as an example.

Assuming that the plurality of data records include a plurality of accounts and a plurality of IP addresses, in this embodiment, the association relationship between the plurality of accounts and the plurality of IP addresses may be determined according to membership of the plurality of accounts and the plurality of IP addresses in the plurality of data records. Further, the server device 30b may pair the account numbers with their associated IP addresses, respectively, to generate a plurality of account number-IP relationship pairs.

Optionally, the server device 30b may use a graph calculation method to cluster the plurality of account-IP relationship pairs, so as to obtain a plurality of connected subgraphs as shown in fig. 3b. The center node of each connected subgraph represents an IP address, the edge nodes represent accounts, and the connecting line between the center node and an edge node represents the association relationship between the IP address and the account. In fig. 3b, only 2 IP addresses (IP 1 and IP 2) and 8 accounts (accounts 1-8) are illustrated, but the numbers are not limited thereto.

Further, for the accounts in each connected subgraph, the server device 30b may use a Cartesian product to associate the accounts in the same connected subgraph with one another, so as to generate suspected same-person account pairs. Further, the server device 30b may calculate the probability that each suspected same-person account pair belongs to the same natural person and the probability that it does not belong to the same natural person, and use the candidate result with the greater probability as the judgment result of whether the suspected same-person account pair belongs to the same natural person. If the probability that the suspected same-person account pair belongs to the same natural person is greater than the probability that it does not belong to the same natural person, it is determined that the suspected same-person account pair belongs to the same natural person; otherwise, it is determined that the suspected same-person account pair does not belong to the same natural person.

In addition to the above-described client-server (C/S) system architecture, the data processing method provided by the embodiments of the present application may also be completed autonomously by the server device. The server device refers to a computer device that is located at the server side and has functions such as calculation and communication, for example a computer or a server located at the server side. For example, for online shopping websites, video websites, game websites, etc., the server device may be a website server, or may be a cloud server array, etc. In this embodiment, the server device may store data records of accessing users. Based on this, the server device may obtain a plurality of data records from the data records it stores.

In this embodiment, the plurality of data records may or may not include second type attribute values. The second type of attribute is a preset attribute, which may be a key attribute or a non-key attribute in the above embodiment. Preferably, the second type of attribute is the key attribute described above. In this embodiment, if the plurality of data records include a plurality of second type attribute values, the server device may cluster the plurality of first type attribute values according to the association relationship between the plurality of first type attribute values and the plurality of second type attribute values, to obtain a plurality of information clusters. That is, the server device clusters the first type attribute values having an association relationship with the same second type attribute value to obtain a plurality of information clusters, wherein the first type attribute values in each information cluster are associated with the second type attribute value corresponding to the information cluster.

Further, for each information cluster, the server device may calculate, for different first type attribute values, the probability of each candidate result. The candidate results comprise: belonging to the same data object and not belonging to the same data object. That is, the server device calculates the probability that different first type attribute values belong to the same data object and the probability that they do not belong to the same data object. Further, the server device may determine whether the different first type attribute values in each information cluster belong to the same data object according to the probabilities of the candidate results. For the specific implementation process of the data processing performed by the server device, reference may be made to the related content in the system embodiment shown in fig. 3a, which is not repeated here.

Fig. 4 is a flowchart of another data processing method according to an embodiment of the present application. As shown in fig. 4, the method includes:

401. Obtain a plurality of data records, wherein the plurality of data records contain a plurality of first type attribute values.

402. In the case that the plurality of data records contain a plurality of second type attribute values, cluster the plurality of first type attribute values according to the association relationship between the plurality of first type attribute values and the plurality of second type attribute values, so as to obtain a plurality of information clusters.

403. For each information cluster, calculate, for different first type attribute values, the probability of each candidate result; wherein the candidate results comprise: belonging to the same data object and not belonging to the same data object.

404. Determine whether the different first type attribute values belong to the same data object according to the probabilities of the candidate results for the different first type attribute values.
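Steps 401 to 404 can be tied together in a compact sketch such as the one below. It assumes records are dictionaries and takes the decision model as a caller-supplied function; `probability_fn`, the default field names `account` and `ip`, and the function name `process` are hypothetical placeholders.

```python
from collections import defaultdict
from itertools import combinations


def process(records, probability_fn, first_attr="account", second_attr="ip"):
    """Steps 401-404 in miniature.

    probability_fn(a, b) is a hypothetical stand-in for the decision model; it
    must return (p_same, p_different) for two first-type attribute values.
    Returns the set of pairs judged to belong to the same data object.
    """
    # 401: the records are given. 402: cluster first-type values by second-type value.
    clusters = defaultdict(set)
    for record in records:
        if first_attr in record and second_attr in record:
            clusters[record[second_attr]].add(record[first_attr])

    same_pairs = set()
    for values in clusters.values():
        for a, b in combinations(sorted(values), 2):
            p_same, p_different = probability_fn(a, b)  # 403: candidate probabilities
            if p_same > p_different:                    # 404: keep the larger one
                same_pairs.add((a, b))
    return same_pairs
```

In practice `probability_fn` would wrap a trained decision model that looks up the other attribute values and behavior data associated with the two values.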

In this embodiment, the first type of attribute is an attribute to be processed, which may be a key attribute or a non-key attribute in the above embodiment.

Further, in this embodiment, the plurality of data records may or may not include second type attribute values. The second type of attribute is a preset attribute, which may be a key attribute or a non-key attribute in the above embodiment. Preferably, the second type of attribute is the key attribute described above. In this embodiment, if the plurality of data records include a plurality of second type attribute values, in step 402 the plurality of first type attribute values may be clustered according to the association relationship between the plurality of first type attribute values and the plurality of second type attribute values, so as to obtain a plurality of information clusters. That is, the first type attribute values having an association relationship with the same second type attribute value are clustered to obtain a plurality of information clusters, wherein the first type attribute values in each information cluster are associated with the second type attribute value corresponding to the information cluster.

Further, for each information cluster, the server device may calculate, for different first type attribute values, the probability of each candidate result. The candidate results comprise: belonging to the same data object and not belonging to the same data object. That is, the server device calculates the probability that different first type attribute values belong to the same data object and the probability that they do not belong to the same data object.

In this embodiment, the lateral clustering of the attribute values under the same attribute belonging to the same data object may be completed according to the probabilities that different attribute values under the same attribute belong to and do not belong to the same data object. Because the data clustering mode does not need to have a stronger association relation between attribute values, the requirement on data attributes can be reduced, and the data clustering method is beneficial to realizing the flexibility and universality of data clustering.

In this embodiment, the server device may output the first type attribute values that belong to the same data object. In some embodiments, the server device may display the first type attribute values that belong to the same data object on its display screen. In other embodiments, the server device may send the first type attribute information belonging to the same data object to the client device. Accordingly, the client device receives the first type attribute information belonging to the same data object, and outputs the first type attribute information belonging to the same data object in a visualized manner.

In some embodiments, an alternative implementation of step 402 is: according to the association relation between the first type attribute values and the second type attribute values, determining the first type attribute values associated with each second type attribute value, taking the first type attribute values associated with each second type attribute value as an information cluster, and further obtaining a plurality of information clusters.

Optionally, when determining the first type attribute value associated with each second type attribute value, the server device may determine an association relationship between the first type attribute value and the second type attribute value according to membership relationships between the first type attribute value and the second type attribute value in the plurality of data records. The detailed description will be referred to the related content in the data system shown in fig. 1a, and will not be repeated here. Further, the server device may pair the first type attribute value with the associated second type attribute value to obtain a plurality of attribute value relation pairs, where one attribute value relation pair includes one first type attribute value and one second type attribute value, and the first type attribute value and the second type attribute value in the attribute value relation pair have an association relation.

Optionally, the server device may use a graph calculation manner to cluster the plurality of attribute value pairs to obtain a plurality of connected subgraphs, where each connected subgraph corresponds to one information cluster. In each connected subgraph, a center node represents a second type attribute value, an edge node represents a first type attribute value, and a connection line between the center node and the edge node represents a corresponding association relationship between the first type attribute value and the second type attribute value. For a specific implementation of clustering the plurality of attribute value relation pairs in the graph calculation, reference may be made to the related content of the foregoing embodiment, which is not described herein.

Optionally, the server device may calculate the association weight between a first type attribute value and a second type attribute value that have an association relationship according to the frequency with which the two values co-occur in the same data record and the respective occurrence frequencies of the two values in the plurality of data records. For a specific implementation, reference may be made to the related content of the foregoing embodiment, which is not repeated here.
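The exact weight formula is given in the foregoing embodiment and is not reproduced in this passage. Purely as an assumption for illustration, the sketch below uses a Jaccard-style ratio: the number of records in which the two values co-occur, divided by the number of records in which either value occurs. The function name `association_weights` and the record layout are hypothetical.

```python
from collections import Counter
from itertools import product


def association_weights(records):
    """Compute an assumed Jaccard-style weight for each associated value pair.

    `records` is a list of (first_type_values, second_type_values) tuples,
    one tuple per data record.
    """
    occurrences = Counter()
    co_occurrences = Counter()
    for first_values, second_values in records:
        firsts, seconds = set(first_values), set(second_values)
        # Tag values with their type so equal strings of different types
        # are counted separately.
        occurrences.update(("first", value) for value in firsts)
        occurrences.update(("second", value) for value in seconds)
        co_occurrences.update(product(firsts, seconds))

    weights = {}
    for (first, second), both in co_occurrences.items():
        either = occurrences[("first", first)] + occurrences[("second", second)] - both
        weights[(first, second)] = both / either
    return weights
```

With this assumed formula, a pair that always appears together receives a weight of 1, and a pair that appears together only rarely relative to the individual frequencies of its values receives a weight close to 0, which suits the low-weight cleaning step described next.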

Alternatively, the server device may also perform anti-cheating and cleaning operations on attribute value relation pairs with low weights, depending on the application scenario. On this basis, if the number of first type attribute values in an information cluster is greater than a preset fourth number threshold, the server device may further clean the information cluster, that is, prune the information cluster. Preferably, the fourth number threshold is greater than the third number threshold.

Further, after the first type attribute values associated with each second type attribute value are determined, the first type attribute values associated with the same second type attribute value may be paired with one another by means of a Cartesian product to generate first type attribute value relation pairs suspected of belonging to the same data object, where each first type attribute value relation pair includes two different first type attribute values.
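A minimal sketch of this pairing step is given below, assuming each information cluster is a collection of first type attribute values keyed by its second type attribute value. `itertools.combinations` is used as a stand-in for the Cartesian product of a cluster with itself, with identical and mirrored pairs removed; the function name `suspected_pairs` is hypothetical.

```python
from itertools import combinations


def suspected_pairs(information_clusters):
    """Return unordered pairs of distinct first type values sharing a cluster."""
    pairs = set()
    for first_values in information_clusters.values():
        # Sorting makes each pair canonical, so (a, b) and (b, a) collapse.
        for value_a, value_b in combinations(sorted(set(first_values)), 2):
            pairs.add((value_a, value_b))
    return pairs
```

A cluster containing a single first type attribute value contributes no pairs, and a pair that appears in several clusters is kept only once.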

Further, for each information cluster, the server device may calculate, for the different first type attribute values in the information cluster, the probability of each candidate result, and determine whether the different first type attribute values belong to the same data object according to these probabilities. The candidate results include: belonging to the same data object, and not belonging to the same data object.

The process of calculating the probabilities of the candidate results for different first type attribute values is described below by way of example, taking first type attribute values A and B in a first information cluster of the plurality of information clusters. The first information cluster is any one of the plurality of information clusters, and the first type attribute values A and B are any two of the first type attribute values contained in the first information cluster.

The server device acquires, from the plurality of data records, the other type attribute values and behavior data respectively associated with the first type attribute values A and B, where the other type attribute values are attribute information other than the first type attribute values. That is, the other type attribute values may be second type attribute values, or attribute information other than the first type and second type attribute values. Further, the other type attribute values and behavior data respectively associated with the first type attribute values A and B are input into a decision model to obtain the probability of each candidate result for the first type attribute values A and B.

Further, if the probability that the first type attribute values A and B belong to the same data object is greater than the probability that they do not belong to the same data object, it is determined that the first type attribute values A and B belong to the same data object. Optionally, the behavior data associated with the first type attribute values A and B may then be obtained, the behavior features of the corresponding data object may be analyzed, and corresponding content may be recommended to the data object according to these behavior features.

In the embodiment of the application, the decision model can be trained in advance. Optionally, with minimization of a loss function as the training target, sample attribute values and sample behavior data that are respectively associated with different first type attribute values known to belong to the same data object are taken as positive samples, and sample attribute values and sample behavior data that are respectively associated with different first type attribute values known not to belong to the same data object are taken as negative samples, and model training is performed to obtain the decision model. The loss function is determined according to the probabilities obtained by model training and the actual probabilities of the positive and negative samples.
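The embodiment does not specify the type of decision model. As one possible illustration only, the sketch below assumes a logistic regression trained by gradient descent on a cross-entropy loss, where each sample is a hypothetical numeric feature vector derived from the attribute values and behavior data of a pair of first type attribute values, labeled 1 for a positive sample and 0 for a negative sample. All function names and hyperparameters are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_same(weights, bias, features):
    """Probability that the pair belongs to the same data object."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def cross_entropy(weights, bias, samples):
    """Mean loss between predicted and actual probabilities of the samples."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for features, label in samples:
        p = predict_same(weights, bias, features)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(samples)


def train(samples, epochs=500, learning_rate=0.5):
    """Minimize the cross-entropy loss by batch gradient descent."""
    size = len(samples[0][0])
    weights, bias = [0.0] * size, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * size, 0.0
        for features, label in samples:
            error = predict_same(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


def belongs_to_same_object(weights, bias, features):
    """Decision rule: P(same data object) > P(not same data object)."""
    p_same = predict_same(weights, bias, features)
    return p_same > 1.0 - p_same
```

Because there are only two candidate results, the comparison in `belongs_to_same_object` is equivalent to checking whether the predicted probability exceeds 0.5.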

In some application scenarios, the plurality of data records may not include second type attribute values. In this case, behavior data respectively associated with the plurality of first type attribute values may be obtained from the plurality of data records, and behavior features of the data objects respectively corresponding to the plurality of first type attribute values may be obtained from that behavior data. Further, whether the plurality of first type attribute values belong to the same data object may be determined according to the behavior features of the data objects respectively corresponding to the plurality of first type attribute values.

For the first type attribute values C and D, the similarity of the behavior features of the data objects corresponding to the first type attribute values C and D may be calculated, and further, if the similarity of the behavior features of the data objects corresponding to the first type attribute values C and D is greater than or equal to a set similarity threshold, it is determined that the first type attribute values C and D belong to the same data object. Correspondingly, if the similarity of the behavior characteristics of the data objects corresponding to the first type attribute values C and D is smaller than the set similarity threshold, determining that the first type attribute values C and D do not belong to the same data object.
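The similarity measure is not specified in the embodiment. The following sketch assumes, for illustration only, that behavior features are numeric vectors of equal length and that cosine similarity is used; the function names and the default threshold of 0.9 are hypothetical.

```python
import math


def cosine_similarity(features_c, features_d):
    """Cosine similarity of two equal-length numeric feature vectors."""
    dot = sum(c * d for c, d in zip(features_c, features_d))
    norm = math.sqrt(sum(c * c for c in features_c)) * math.sqrt(sum(d * d for d in features_d))
    return dot / norm if norm else 0.0


def same_data_object(features_c, features_d, threshold=0.9):
    """Apply the rule: similarity >= threshold means the same data object."""
    return cosine_similarity(features_c, features_d) >= threshold
```

For example, `same_data_object([1, 0, 2], [2, 0, 4])` compares two vectors pointing in the same direction, while `same_data_object([1, 0], [0, 1])` compares two orthogonal vectors.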

Further, after the first type attribute value relation pairs that actually belong to the same data object are identified, for the first information cluster, from among the first type attribute value relation pairs suspected of belonging to the same data object, a graph calculation method may further be used to aggregate the first type attribute values in the relation pairs that actually belong to the same data object, that is, to aggregate the first type attribute values belonging to the same data object, so as to obtain a connected subgraph corresponding to each data object. In each connected subgraph, each node represents a first type attribute value.
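The graph calculation method is not limited by the embodiment. One possible sketch, assuming a union-find (disjoint-set) structure is used to merge confirmed relation pairs into connected subgraphs, is shown below; the function name `connected_groups` is hypothetical.

```python
def connected_groups(confirmed_pairs):
    """Merge confirmed pairs into groups, one group per data object."""
    parent = {}

    def find(value):
        parent.setdefault(value, value)
        while parent[value] != value:
            parent[value] = parent[parent[value]]  # path halving
            value = parent[value]
        return value

    for value_a, value_b in confirmed_pairs:
        root_a, root_b = find(value_a), find(value_b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for value in list(parent):
        groups.setdefault(find(value), set()).add(value)
    # Sorted output keeps the result deterministic.
    return sorted(sorted(group) for group in groups.values())
```

Two first type attribute values end up in the same group whenever a chain of confirmed pairs links them, even if the two values were never compared directly.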

It should be noted that the steps of the method provided in the above embodiment may all be executed by the same device, or may be executed by different devices. For example, the execution subject of steps 401 and 402 may be device A; for another example, the execution subject of step 401 may be device A, and the execution subject of step 402 may be device B; and so on.

In addition, some of the flows described in the above embodiments and the drawings include a plurality of operations that appear in a specific order. It should be clearly understood, however, that these operations may be performed out of the order in which they appear herein or in parallel. The sequence numbers of the operations, such as 401 and 402, are merely used to distinguish the operations from one another and do not themselves represent any order of execution. In addition, the flows may include more or fewer operations, and these operations may be performed sequentially or in parallel.

Fig. 5 is a schematic structural diagram of a server device according to an embodiment of the present application. As shown in Fig. 5, the server device includes: a memory 50a and a processor 50b. The memory 50a is used to store a computer program.

The processor 50b is coupled to the memory 50a for executing a computer program for: acquiring a plurality of data records, wherein the plurality of data records comprise a plurality of key attribute values under a plurality of key attributes, and each key attribute value belongs to a data object at the same time; identifying attribute values belonging to the same data object in the plurality of key attribute values according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values; and outputting attribute values belonging to the same data object in the plurality of key attribute values.

In some embodiments, the server device includes: a communication component 50c. Accordingly, the processor 50b, when acquiring a plurality of data records, is specifically configured to: a plurality of data records transmitted by the client device are received via the communication component 50c. Accordingly, the processor 50b is specifically configured to, when outputting attribute values belonging to the same data object among the plurality of key attribute values: attribute values belonging to the same data object among the plurality of key attribute values are sent to the client device through the communication component 50c for visual output by the client device.

Further, when sending the attribute values belonging to the same data object among the plurality of key attribute values to the client device, the processor 50b is specifically configured to: send the attribute values belonging to the same data object among the plurality of key attribute values to the client device in the form of connected subgraphs through the communication component 50c. In each connected subgraph, a node represents a key attribute value, and a connection line between two nodes represents the association relationship between the two key attribute values.

Optionally, the processor 50b is further configured to: before the attribute values belonging to the same data object among the plurality of key attribute values are sent to the client device in the form of connected subgraphs through the communication component 50c, search the plurality of data records for behavior data of the data object corresponding to each connected subgraph; acquire behavior features of the data object corresponding to each connected subgraph according to that behavior data; and add the behavior features of the data object corresponding to each connected subgraph as identification information of that data object.

In other embodiments, the server device includes: a display screen 50d. Accordingly, when outputting the attribute values belonging to the same data object among the plurality of key attribute values, the processor 50b is specifically configured to: present the attribute values belonging to the same data object among the plurality of key attribute values on the display screen 50d in the form of connected subgraphs.

In still other embodiments, the memory 50a stores a pre-established hierarchical relationship among a number of key attributes. Accordingly, the processor 50b is further configured to: before identifying the attribute values belonging to the same data object among the plurality of key attribute values, extract the hierarchical relationship among the plurality of key attributes from the pre-established hierarchical relationship among the number of key attributes; and analyze the association relationship among the plurality of key attribute values according to the membership between the plurality of key attribute values and the plurality of data records.

Further, the processor 50b is configured to: before extracting the hierarchical relationship among the plurality of key attributes from the pre-established hierarchical relationship among the number of key attributes, acquire historical data records within a specified historical period, where the historical data records include historical key attribute values under the number of key attributes; and establish the hierarchical relationship among the number of key attributes according to the number of historical key attribute values under each of these key attributes.
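This passage states that the hierarchy is built from the number of historical key attribute values under each key attribute, but does not state the ordering direction. The sketch below assumes, for illustration only, that a key attribute with fewer distinct historical values is placed at a higher level. The function name `build_hierarchy`, the record layout (one dictionary per historical data record), and the attribute names are hypothetical.

```python
def build_hierarchy(historical_records):
    """Order key attributes from the highest level to the lowest level.

    Assumption: fewer distinct historical values means a higher level.
    Ties are broken by attribute name so the result is deterministic.
    """
    distinct_values = {}
    for record in historical_records:
        for attribute, value in record.items():
            distinct_values.setdefault(attribute, set()).add(value)
    return sorted(distinct_values, key=lambda attr: (len(distinct_values[attr]), attr))


# Hypothetical historical records.
history = [
    {"id_card": "i1", "phone": "p1", "device": "d1"},
    {"id_card": "i1", "phone": "p2", "device": "d2"},
    {"id_card": "i1", "phone": "p2", "device": "d3"},
]
levels = build_hierarchy(history)
```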

Optionally, when analyzing the association relationship among the plurality of key attribute values, the processor 50b is specifically configured to perform at least one of the following judging operations for a first key attribute value and a second key attribute value: judging whether the first key attribute value and the second key attribute value appear in the same data record; judging whether one of the first key attribute value and the second key attribute value appears in the same data record as a key attribute value that has an association relationship with the other; and judging whether a key attribute value that has an association relationship with the first key attribute value and a key attribute value that has an association relationship with the second key attribute value appear in the same data record. If the result of at least one judging operation is yes, it is determined that the first key attribute value and the second key attribute value have an association relationship. The first key attribute value and the second key attribute value are any two of the plurality of key attribute values.

In still other embodiments, when identifying the attribute values belonging to the same data object among the plurality of key attribute values, the processor 50b is specifically configured to: perform initial clustering on the plurality of key attribute values according to the association relationship among them to obtain a plurality of clusters, where each cluster contains key attribute values that have an association relationship; for each cluster, perform secondary clustering on the key attribute values in the cluster according to the hierarchical relationship among the plurality of key attributes to obtain at least one sub-cluster; and regard the key attribute values in the same sub-cluster as attribute values belonging to the same data object.

Further, when performing initial clustering on the plurality of key attribute values, the processor 50b is specifically configured to: pair every two key attribute values that have an association relationship among the plurality of key attribute values to obtain a plurality of attribute value relation pairs; construct an adjacency list for each key attribute value in the plurality of attribute value relation pairs according to the association relationship among the key attribute values in those pairs; and traverse the adjacency list of each key attribute value in the plurality of attribute value relation pairs to obtain a plurality of connected subgraphs, where each connected subgraph is one cluster. In each connected subgraph, a node represents a key attribute value, and a connection line between two nodes represents the association relationship between the two key attribute values.
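The adjacency-list construction and traversal described above can be sketched as follows. A breadth-first traversal is assumed here; the embodiment does not prescribe a particular traversal order. The function name `initial_clusters` and the value format are hypothetical.

```python
from collections import defaultdict, deque


def initial_clusters(relation_pairs):
    """Return the connected subgraphs (clusters) induced by relation pairs."""
    # Build an undirected adjacency list for every key attribute value.
    adjacency = defaultdict(set)
    for value_a, value_b in relation_pairs:
        adjacency[value_a].add(value_b)
        adjacency[value_b].add(value_a)

    visited, clusters = set(), []
    for start in adjacency:
        if start in visited:
            continue
        visited.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
        clusters.append(component)
    return clusters
```

Each value is visited once and each relation pair is examined twice, so the traversal is linear in the number of key attribute values plus the number of relation pairs.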

In other embodiments, when performing secondary clustering on the key attribute values in each cluster, the processor 50b is specifically configured to: for a first cluster, determine a reference core attribute from the core attributes contained in the first cluster, where a core attribute is a key attribute; and, according to the hierarchical relationship among the key attributes contained in the first cluster and the association weights among the key attribute values, take the key attribute values under the reference core attribute as cluster source points and cluster the key attribute values under the non-reference core attributes in the first cluster into the sub-clusters represented by the respective cluster source points. The first cluster is any one of the plurality of clusters.

Further, when determining the reference core attribute from the core attributes contained in the first cluster, the processor 50b is specifically configured to: select, according to the number of key attribute values under each core attribute contained in the first cluster, the core attribute with the largest number of key attribute values as the reference core attribute.
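A minimal sketch of this selection is given below, assuming the first cluster is represented as a mapping from attribute name to the key attribute values under that attribute. The function name `pick_reference_core_attribute`, the attribute names, and the alphabetical tie-break are assumptions not stated in the embodiment.

```python
def pick_reference_core_attribute(cluster, core_attributes):
    """Pick the core attribute with the most distinct key attribute values."""
    counts = {attr: len(set(cluster.get(attr, ()))) for attr in core_attributes}
    # Iterating in sorted order makes ties resolve alphabetically.
    return max(sorted(counts), key=counts.get)


# Hypothetical first cluster; "device" is a key attribute but not a core one.
first_cluster = {
    "id_card": ["i1", "i2"],
    "phone": ["p1", "p2", "p3"],
    "device": ["d1", "d2", "d3", "d4"],
}
reference = pick_reference_core_attribute(first_cluster, ["id_card", "phone"])
```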

Further, when clustering the key attribute values under the non-reference core attributes in the first cluster into the sub-clusters represented by the respective cluster source points, the processor 50b is specifically configured to: for a third key attribute value, calculate the correlation between the third key attribute value and each cluster source point according to the association weights between the third key attribute value and each cluster source point; and assign the third key attribute value to the sub-cluster represented by the cluster source point having the largest correlation with the third key attribute value. The third key attribute value is any key attribute value, among the key attribute values contained in the first cluster, that has not yet been clustered into any sub-cluster.

Optionally, when calculating the correlation between the third key attribute value and each cluster source point, the processor 50b is specifically configured to: determine the shortest association path between the third key attribute value and each cluster source point according to the association weights between the key attribute values along the association paths between the third key attribute value and that cluster source point; and calculate the correlation between the third key attribute value and each cluster source point according to the association weights between the key attribute values along the corresponding shortest association path.
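How path length and correlation are derived from the association weights is not spelled out in this passage. The sketch below assumes, for illustration only, that association weights lie in (0, 1], that the correlation along a path is the product of its weights, and that the shortest association path is the one maximizing that product. Under these assumptions Dijkstra's algorithm can be run with an edge cost of `-log(weight)`. The function name `best_cluster_source` and the graph layout are hypothetical.

```python
import heapq
import math


def best_cluster_source(graph, value, sources):
    """Find the cluster source point most correlated with `value`.

    `graph` maps each key attribute value to {neighbor: association weight},
    with every weight assumed to lie in (0, 1].
    """
    distance = {value: 0.0}
    heap = [(0.0, value)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > distance.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost - math.log(weight)
            if new_cost < distance.get(neighbor, math.inf):
                distance[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))

    # Convert summed -log costs back into products of weights.
    correlations = {s: math.exp(-distance[s]) for s in sources if s in distance}
    if not correlations:
        return None, {}
    best = max(sorted(correlations), key=correlations.get)
    return best, correlations
```

With this choice, a long chain of strong associations can outrank a single weak direct association, which is one plausible reading of clustering a value toward the source point it is most correlated with.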

Further, if the target level at which the third key attribute value is located in the hierarchical relationship among the key attributes contained in the first cluster is any level other than the highest level and the level immediately below the level at which the cluster source points are located, the processor 50b is further configured to: if there are multiple cluster source points having the largest correlation with the third key attribute value, calculate the correlation between the third key attribute value and each key attribute value at the level immediately above the target level; and cluster the third key attribute value into the sub-cluster represented by the key attribute value that, among the key attribute values at the level immediately above the target level, has the highest correlation with the third key attribute value.

Optionally, the processor 50b is further configured to: before clustering the key attribute values under the non-reference core attributes in the first cluster into the sub-clusters represented by the cluster source points, calculate the association weight between every two key attribute values in the first cluster according to the frequency with which the two key attribute values co-occur in the same data record and the respective occurrence frequency of each of the two key attribute values in the plurality of data records.

In some embodiments, the processor 50b is further configured to perform at least one of the following judging operations before performing secondary clustering on the key attribute values in each cluster: judging whether the plurality of clusters include a giant cluster in which the number of key attribute values is greater than or equal to a preset first number threshold; and judging whether the plurality of clusters include a giant cluster containing a problem attribute value, where a problem attribute value is a key attribute value for which the number of other key attribute values associated with it is greater than or equal to a preset second number threshold. If the result of at least one judging operation is yes, pruning is performed on the giant clusters among the plurality of clusters.
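The two judging operations can be sketched as a single check, shown below. The pruning operation itself is not detailed in this passage and is therefore not sketched. The function name `needs_pruning`, the parameter names, and the adjacency layout (each key attribute value mapped to the set of key attribute values associated with it) are hypothetical.

```python
def needs_pruning(cluster_values, adjacency, first_threshold, second_threshold):
    """Return True if the cluster is a giant cluster under either test."""
    # Test 1: the cluster contains too many key attribute values.
    if len(cluster_values) >= first_threshold:
        return True
    # Test 2: the cluster contains a problem attribute value, that is, a
    # value associated with too many other key attribute values.
    return any(
        len(adjacency.get(value, ())) >= second_threshold
        for value in cluster_values
    )
```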

In other embodiments, after obtaining the at least one sub-cluster, the processor 50b is further configured to: for a first sub-cluster, search the plurality of data records for the non-key attribute values under a first non-key attribute that are respectively associated with the key attribute values contained in the first sub-cluster; judge whether these non-key attribute values under the first non-key attribute are consistent; and if the judgment result is no, take the associated non-key attribute values under the first non-key attribute as new source point attribute values and cluster the first sub-cluster again, until the non-key attribute values under the first non-key attribute contained in the first sub-cluster are consistent.

In still other embodiments, after obtaining the at least one sub-cluster, the processor 50b is further configured to: acquire, from the plurality of data records, a plurality of non-key attribute values under a second non-key attribute and a plurality of non-key attribute values under a third non-key attribute that are respectively associated with each sub-cluster; divide the plurality of non-key attribute values under the second non-key attribute into a plurality of information clusters by using the association relationship between the non-key attribute values under the second non-key attribute and the non-key attribute values under the third non-key attribute; for each information cluster, calculate, for every two non-key attribute values under the second non-key attribute in the information cluster, the probability of each candidate result, where the candidate results include: belonging to the same data object, and not belonging to the same data object; and determine whether every two non-key attribute values under the second non-key attribute belong to the same data object according to the probabilities of the candidate results.

Further, the processor 50b is configured to: for a first non-key attribute value and a second non-key attribute value under the second non-key attribute, if the first non-key attribute value and the second non-key attribute value belong to the same data object, determine that the sub-clusters corresponding to the first non-key attribute value and the second non-key attribute value belong to the same data object; and merge the sub-clusters belonging to the same data object. The first non-key attribute value and the second non-key attribute value are any two non-key attribute values under the second non-key attribute.

In some alternative embodiments, as shown in Fig. 5, the server device may further include a power supply component 50e and the like. Optionally, if the server device is a terminal device such as a computer, it may further include optional components such as an audio component 50f, as shown in the dashed box in Fig. 5. Only some of the components are schematically shown in Fig. 5, which does not mean that the server device must contain all of the components shown in Fig. 5, nor that the server device can contain only the components shown in Fig. 5.

The server device provided by this embodiment can identify the attribute values belonging to the same data object among the plurality of key attribute values according to the hierarchical relationship among the plurality of key attributes and the association relationship among the plurality of key attribute values, thereby completing the longitudinal clustering of the attribute values that belong to the same data object under different attributes. Because this clustering approach takes multiple key attributes into account, the probability of erroneous clustering is reduced, which improves the accuracy of identifying data belonging to the same data object.

Fig. 6 is a schematic structural diagram of another server device according to an embodiment of the present application. As shown in Fig. 6, the server device includes: a memory 60a and a processor 60b. The memory 60a is used to store a computer program.

The processor 60b is coupled to the memory 60a and is configured to execute the computer program so as to: acquire a plurality of data records, where the plurality of data records include a plurality of first type attribute values; if the plurality of data records contain a plurality of second type attribute values, cluster the plurality of first type attribute values according to the association relationship between the plurality of first type attribute values and the plurality of second type attribute values to obtain a plurality of information clusters; for each information cluster, calculate, for the different first type attribute values in the information cluster, the probability of each candidate result, where the candidate results include: belonging to the same data object, and not belonging to the same data object; and determine whether the different first type attribute values belong to the same data object according to the probabilities of the candidate results.

In some embodiments, the processor 60b, when dividing the plurality of first type attribute values into a plurality of information clusters, is specifically configured to: determining a first type attribute value respectively associated with each second type attribute value according to the association relation between the first type attribute value and the second type attribute value; and taking the first type attribute value associated with each second type attribute value as an information cluster to obtain a plurality of information clusters.

In other embodiments, when calculating the probabilities of the candidate results for the different first type attribute values, the processor 60b is specifically configured to: for first type attribute values A and B in a first information cluster, acquire, from the plurality of data records, the other type attribute values and behavior data respectively associated with the first type attribute values A and B, where the other type attribute values are attribute information other than the first type attribute values; and input the other type attribute values and behavior data respectively associated with the first type attribute values A and B into a decision model to obtain the probability of each candidate result for the first type attribute values A and B.

Optionally, before inputting the other type attribute values and behavior data respectively associated with the first type attribute values A and B into the decision model, the processor 60b is further configured to: take sample attribute values and sample behavior data that are respectively associated with different first type attribute values known to belong to the same data object as positive samples, take sample attribute values and sample behavior data that are respectively associated with different first type attribute values known not to belong to the same data object as negative samples, and perform model training to obtain the decision model. The loss function is determined according to the probabilities obtained by model training and the actual probabilities of the positive and negative samples.

In still other embodiments, the processor 60b is further configured to: if the plurality of data records do not contain the second type attribute values, behavior data respectively associated with the plurality of first type attribute values are obtained from the plurality of data records; according to the behavior data respectively associated with the plurality of first-class attribute values, behavior characteristics of the data objects respectively corresponding to the plurality of first-class attribute values are obtained; and determining whether the plurality of first-class attribute values belong to the same data object according to the behavior characteristics of the data object corresponding to the plurality of first-class attribute values.

Further, when determining whether the plurality of first type attribute values belong to the same data object, the processor 60b is specifically configured to: for first type attribute values C and D, calculate the similarity of the behavior features of the data objects corresponding to the first type attribute values C and D; and if this similarity is greater than or equal to a set similarity threshold, determine that the first type attribute values C and D belong to the same data object.

In one embodiment, the server device further includes: a display screen 60c. Accordingly, the processor 60b is further configured to: visually present the first type attribute values belonging to the same data object on the display screen 60c. Optionally, the processor 60b may present the first type attribute values belonging to the same data object on the display screen 60c in the form of connected subgraphs, where a node of a connected subgraph represents a first type attribute value, and a connection line between two nodes indicates that the two first type attribute values belong to the same data object.

In another embodiment, the server device further includes: a communication component 60d. Accordingly, the processor 60b is further configured to: send the first type attribute values belonging to the same data object to the client device through the communication component 60d for visual output by the client device. Optionally, the processor 60b may send the first type attribute values belonging to the same data object to the client device in the form of connected subgraphs through the communication component 60d.

In some alternative embodiments, as shown in Fig. 6, the server device may further include a power supply component 60e and the like. Optionally, if the server device is a terminal device such as a computer, it may further include optional components such as an audio component 60f, as shown in the dashed box in Fig. 6. Only some of the components are schematically shown in Fig. 6, which does not mean that the server device must contain all of the components shown in Fig. 6, nor that the server device can contain only the components shown in Fig. 6.

The server device provided by this embodiment can complete the transverse clustering of the attribute values of the same data object under the same attribute according to the probabilities that different attribute values under the same attribute belong, or do not belong, to the same data object. Because this clustering approach does not require a strong association relationship between attribute values, the requirements on data attributes are reduced, which helps make data clustering more flexible and more generally applicable.

In an embodiment of the present application, the memory is used to store a computer program and may be configured to store various other data to support operations on the server device. The processor may execute the computer program stored in the memory to implement the corresponding control logic. The memory may be implemented by any type of volatile or nonvolatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.

In an embodiment of the application, the communication component is configured to facilitate wired or wireless communication between the server device and other devices. The server device may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, 5G, or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component may also be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.

In an embodiment of the present application, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.

In an embodiment of the application, the power supply component is configured to provide power to the various components of the server-side device. The power components may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the devices in which the power components are located.

In embodiments of the application, the audio component may be configured to output and/or input audio signals. For example, the audio component includes a microphone (MIC) configured to receive external audio signals when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signal may be further stored in the memory or transmitted via the communication component. In some embodiments, the audio component further includes a speaker for outputting audio signals. For example, for a server device with voice interaction functionality, voice interaction with a user may be achieved through the audio component.

It should be noted that the terms "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they do not indicate an order, nor do they require that the "first" and the "second" be of different types.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both persistent and non-persistent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims

1. A method of data processing, comprising:

Acquiring a plurality of data records, wherein the plurality of data records comprise a plurality of key attribute values under a plurality of key attributes, each key attribute value belongs to one data object at the same time, and the key attribute refers to an attribute that the attribute value under the key attribute can uniquely identify one data object at the same time;

According to the association relationship among the key attribute values, performing initial clustering on the key attribute values to obtain a plurality of clusters, wherein each cluster comprises key attribute values having an association relationship, and whether an association relationship exists between any two key attribute values is determined according to whether the two key attribute values appear in the same data record or appear in different data records that are indirectly associated with each other;

For each cluster, performing secondary clustering on the key attribute values in each cluster according to the hierarchical relationship among the plurality of key attributes to obtain at least one sub-cluster, wherein the hierarchical relationship among the plurality of key attributes is established according to the number of historical key attribute values under each key attribute;

The key attribute values in the same sub-cluster are regarded as attribute values belonging to the same data object.

2. The method as recited in claim 1, further comprising:

extracting the hierarchical relationship among the plurality of key attributes from the hierarchical relationship among a plurality of key attributes which are pre-established;

And analyzing the association relation among the key attribute values based on the membership relation between the key attribute values and the data records.

3. The method of claim 2, further comprising, prior to extracting the hierarchical relationship between the plurality of key attributes from the pre-established hierarchical relationship between the plurality of key attributes:

acquiring a history data record in a specified history period, wherein the history data record comprises history key attribute values under a plurality of key attributes;

And establishing a hierarchical relationship among the plurality of key attributes according to the number of the historical key attribute values under each key attribute in the plurality of key attributes.
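
As an illustration of the hierarchy construction in claim 3, the following is a minimal sketch rather than the claimed implementation. It counts the distinct historical values under each key attribute and ranks the attributes by that count. The function name `build_hierarchy`, the dict-based record layout, and the ordering direction (fewer distinct values gives a higher level, with level 0 at the top) are all assumptions made for the example; the claim only states that the hierarchy is established from the number of historical key attribute values.

```python
from collections import defaultdict


def build_hierarchy(history_records, key_attributes):
    """Rank key attributes by the number of distinct historical values.

    history_records: list of dicts mapping attribute name -> value
                     (hypothetical record layout).
    Returns a dict mapping attribute name -> level (0 is the top level).
    """
    distinct = defaultdict(set)
    for record in history_records:
        for attr in key_attributes:
            value = record.get(attr)
            if value is not None:
                distinct[attr].add(value)

    # Assumption: an attribute with fewer distinct values sits higher.
    # The attribute name is a tie-breaker so the result is deterministic.
    ordered = sorted(key_attributes, key=lambda a: (len(distinct[a]), a))
    return {attr: level for level, attr in enumerate(ordered)}


history = [
    {"id_card": "A", "phone": "p1"},
    {"id_card": "A", "phone": "p2"},
    {"id_card": "B", "phone": "p3", "email": "e1"},
]
levels = build_hierarchy(history, ["phone", "id_card", "email"])
```

Reversing the sort order would give the opposite convention; which direction applies depends on the embodiment.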

4. The method of claim 2, wherein the analyzing the association between the plurality of key attribute values based on membership between the plurality of key attribute values and the plurality of data records comprises performing at least one of:

Judging whether a first key attribute value and a second key attribute value appear in the same data record or not according to the first key attribute value and the second key attribute value;

Judging whether one key attribute value in the first key attribute value and the second key attribute value and the key attribute value with an association relation with the other key attribute value appear in the same data record;

Judging whether a key attribute value with an association relation with the first key attribute value and a key attribute value with an association relation with the second key attribute value appear in the same data record or not;

If the result of the at least one judging operation is yes, determining that the first key attribute value and the second key attribute value have an association relation;

The first key attribute value and the second key attribute value are any two attribute values of the plurality of key attribute values.

5. The method of claim 1, wherein the initially clustering the plurality of key attribute values according to the association between the plurality of key attribute values comprises:

Pairing the two key attribute values with the association relationship in the plurality of key attribute values to obtain a plurality of attribute value relationship pairs;

constructing an adjacency list of each key attribute value in the attribute value relation pairs according to the association relation among the key attribute values in the attribute value relation pairs;

Traversing the adjacency list of each key attribute value in the plurality of attribute value relation pairs to obtain a plurality of connected subgraphs, wherein each connected subgraph is a cluster;

in each connected subgraph, a node represents a key attribute value, and a connection line between two nodes represents an association relationship between two key attribute values.
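
The initial clustering of claim 5 can be sketched as follows, under stated assumptions. Key attribute values that co-occur in one data record are paired, an adjacency list is built from the pairs, and a breadth-first traversal collects the connected subgraphs. The names `relation_pairs` and `connected_subgraphs`, and the representation of a key attribute value as an `(attribute, value)` tuple, are hypothetical choices for the example.

```python
from collections import defaultdict, deque
from itertools import combinations


def relation_pairs(records):
    """Pair every two key attribute values that appear in the same record."""
    pairs = set()
    for record in records:
        nodes = sorted(record.items())  # (attribute, value) tuples
        pairs.update(combinations(nodes, 2))
    return pairs


def connected_subgraphs(pairs):
    """Build an adjacency list and traverse it to find connected subgraphs."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        clusters.append(component)
    return clusters


records = [
    {"phone": "p1", "email": "e1"},
    {"email": "e1", "id": "i1"},   # shares e1 with the first record
    {"phone": "p2", "id": "i2"},
]
clusters = connected_subgraphs(relation_pairs(records))
```

In this example `p1` and `i1` never share a record, but both co-occur with `e1`, so the indirect association places them in the same cluster. A record that holds a single key attribute value produces no pair and therefore no node in this sketch.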

6. The method of claim 1, wherein said secondary clustering of key attribute values in each cluster according to hierarchical relationships between the plurality of key attributes comprises:

For a first cluster, determining a reference core attribute from the core attributes contained in the first cluster, wherein each core attribute is a key attribute;

According to the hierarchical relation among key attributes contained in the first cluster and the association weight among key attribute values, key attribute values under the reference core attribute are respectively used as clustering source points, and key attribute values under the non-reference core attribute in the first cluster are clustered into sub clusters represented by the clustering source points;

Wherein the first cluster is any one of the plurality of clusters.

7. The method of claim 6, wherein determining a reference core attribute from the core attributes contained in the first cluster comprises:

and selecting the core attribute with the most key attribute value as the reference core attribute according to the number of key attribute values under each core attribute contained in the first cluster.

8. The method according to claim 6, wherein clustering the key attribute values of the first cluster under the non-reference core attribute into the sub-clusters represented by the cluster source points by using the key attribute values of the reference core attribute as the cluster source points according to the hierarchical relation between the key attributes and the association weights between the key attribute values included in the first cluster, respectively, includes:

For a third key attribute value, calculating the correlation between the third key attribute value and each cluster source point according to the association weight between the third key attribute value and each cluster source point;

dividing the third key attribute value into sub clusters represented by the cluster source points with the largest correlation with the third key attribute value;

the third key attribute value is any key attribute value which is not clustered to any sub-cluster currently in the key attribute values contained in the first cluster.

9. The method of claim 8, wherein calculating the correlation between the third key attribute value and each cluster source point based on the association weights between the third key attribute value and each cluster source point comprises:

Determining the shortest association path between the third key attribute value and each cluster source point according to the association weights between the key attribute values passed through by the association paths between the third key attribute value and each cluster source point;

And calculating the correlation between the third key attribute value and each cluster source point according to the association weights between the key attribute values passed through by the shortest association path between the third key attribute value and each cluster source point.
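
One possible reading of claims 8 and 9 is sketched below. It assumes that association weights lie in (0, 1], that the length of an association path is the sum of `-log(weight)` over its edges, and that the correlation is the product of the weights along the shortest path. These choices, along with the names `best_correlations` and `assign_to_sources`, are assumptions for illustration; the claims do not fix a particular formula. The sketch also ignores the attribute hierarchy and the tie-breaking rule of claim 10.

```python
import heapq
import math


def best_correlations(graph, source):
    """Dijkstra over -log(weight) edge costs, starting from one source point.

    graph: dict mapping node -> {neighbor: weight}, weights in (0, 1].
    Returns a dict mapping each reachable node -> product of the weights
    along its shortest association path to the source.
    """
    cost = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        current, node = heapq.heappop(heap)
        if current > cost.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = current - math.log(weight)
            if candidate < cost.get(neighbor, math.inf):
                cost[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return {node: math.exp(-c) for node, c in cost.items()}


def assign_to_sources(graph, sources):
    """Put each non-source node into the sub-cluster of its best source."""
    per_source = {s: best_correlations(graph, s) for s in sources}
    sub_clusters = {s: {s} for s in sources}
    for node in graph:
        if node in sub_clusters:
            continue
        best = max(sources, key=lambda s: per_source[s].get(node, 0.0))
        if per_source[best].get(node, 0.0) > 0.0:
            sub_clusters[best].add(node)
    return sub_clusters


graph = {
    "id1": {"p1": 0.9, "e1": 0.2},
    "id2": {"e1": 0.8},
    "p1": {"id1": 0.9},
    "e1": {"id1": 0.2, "id2": 0.8},
}
result = assign_to_sources(graph, ["id1", "id2"])
```

Here `e1` is linked to both source points, but its association with `id2` is stronger, so it joins the sub-cluster represented by `id2`.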

10. The method of claim 9, wherein if the target level at which the third key attribute value is located is any level other than the level next to the level at which each cluster source point is located and the highest level in the hierarchical relationship between key attributes contained in the first cluster, the method further comprises:

If there are a plurality of cluster source points having the largest correlation with the third key attribute value, calculating the correlation between the third key attribute value and each key attribute value at the upper level of the target level;

And clustering the third key attribute value into a sub cluster represented by the key attribute value with the largest correlation with the third key attribute value in the key attribute values at the upper level of the target level.

11. The method of claim 8, wherein before clustering key attribute values in the first cluster in sub-clusters represented by respective cluster source points with key attribute values in the reference core attribute as cluster source points according to a hierarchical relationship between key attributes and an association weight between key attribute values included in the first cluster, the method further comprises:

And calculating the association weight between every two key attribute values according to the common occurrence frequency of every two key attribute values in the same data record and the respective occurrence frequency of each key attribute value in every two key attribute values in the plurality of data records.
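
The association weight of claim 11 depends on how often two key attribute values occur together and how often each occurs on its own. The sketch below uses a Jaccard-style ratio, co-occurrences divided by the number of records containing either value. That specific formula is an assumption chosen for illustration, as are the function name `association_weights` and the list-of-values record layout.

```python
from collections import Counter
from itertools import combinations


def association_weights(records):
    """Compute a weight for every pair of values that share a record.

    records: list of iterables of key attribute values (hypothetical layout).
    Returns a dict mapping (value_a, value_b) -> weight in (0, 1],
    with value_a < value_b.
    """
    single = Counter()  # number of records each value appears in
    joint = Counter()   # number of records each pair appears in together
    for record in records:
        values = sorted(set(record))
        single.update(values)
        joint.update(combinations(values, 2))

    # Assumed formula: co-occurrences / records containing either value.
    return {
        (a, b): co / (single[a] + single[b] - co)
        for (a, b), co in joint.items()
    }


weights = association_weights([
    ["e1", "p1"],
    ["e1", "p1"],
    ["e1", "p2"],
])
```

With these records, `e1` and `p1` co-occur in two of the three records that contain either of them, so their weight is higher than that of `e1` and `p2`.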

12. The method of claim 1, wherein prior to secondary clustering of key attribute values in each cluster according to the hierarchical relationship between the plurality of key attributes, the method further comprises performing at least one of the following judging operations:

judging whether the plurality of clusters have giant clusters with the number of the included key attribute values being greater than or equal to a preset first number threshold value or not;

judging whether a giant cluster containing a problem attribute value exists in the plurality of clusters; the problem attribute value refers to a key attribute value for which the number of other key attribute values associated with it is greater than or equal to a preset second number threshold;

if the result of the at least one judging operation is yes, pruning is carried out on the giant clusters in the clusters.
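
The two judging operations of claim 12 can be sketched as a single check on one cluster. The function name `needs_pruning`, the parameter names, and the adjacency-dict representation of a cluster are assumptions for the example. The pruning step itself is not shown, because the claim does not specify how it is performed.

```python
def needs_pruning(cluster_adjacency, first_threshold, second_threshold):
    """Check whether a cluster is a giant cluster that should be pruned.

    cluster_adjacency: dict mapping each key attribute value in the
                       cluster -> set of associated key attribute values.
    first_threshold:   cluster size at or above which it counts as giant.
    second_threshold:  neighbor count at or above which a value counts
                       as a problem attribute value.
    Returns (should_prune, problem_values).
    """
    too_large = len(cluster_adjacency) >= first_threshold
    problem_values = [
        value
        for value, neighbors in cluster_adjacency.items()
        if len(neighbors) >= second_threshold
    ]
    return too_large or bool(problem_values), problem_values


# A phone number shared by three e-mail addresses (hypothetical data).
cluster = {
    "p1": {"e1", "e2", "e3"},
    "e1": {"p1"},
    "e2": {"p1"},
    "e3": {"p1"},
}
flag, problems = needs_pruning(cluster, first_threshold=10, second_threshold=3)
```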

13. The method of claim 1, further comprising, after obtaining the at least one sub-cluster:

For a first sub-cluster, searching non-key attribute values under first non-key attributes respectively associated with key attribute values contained in the first sub-cluster in the plurality of data records;

judging whether the non-key attribute values under the first non-key attribute are consistent or not;

If the judgment result is negative, the non-key attribute value under the associated first non-key attribute is used as a new source point attribute value, and the first sub-cluster is reclustered until the non-key attribute value under the first non-key attribute contained in the first sub-cluster is consistent.
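
A simplified sketch of the consistency check in claim 13 follows. It assumes that each key attribute value is associated with at most one value of the first non-key attribute (for example, a name), and it treats reclustering as a split of the sub-cluster by that value. The function name `split_by_non_key` and the lookup dict `non_key_of` are hypothetical. An actual embodiment may recluster using the association weights rather than a plain split.

```python
from collections import defaultdict


def split_by_non_key(sub_cluster, non_key_of):
    """Split a sub-cluster whose non-key attribute values are inconsistent.

    sub_cluster: set of key attribute values.
    non_key_of:  dict mapping key attribute value -> associated non-key
                 attribute value (missing entries are grouped under None).
    Returns a list of sub-clusters, each with a consistent non-key value.
    """
    groups = defaultdict(set)
    for key_value in sub_cluster:
        groups[non_key_of.get(key_value)].add(key_value)

    if len(groups) <= 1:
        return [set(sub_cluster)]  # already consistent, nothing to do
    # Each distinct non-key value acts as a new source point.
    return list(groups.values())


parts = split_by_non_key(
    {"p1", "p2", "e1"},
    {"p1": "Alice", "p2": "Bob", "e1": "Alice"},
)
```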

14. The method of claim 1, wherein after obtaining at least one sub-cluster, the method further comprises:

Acquiring a plurality of non-key attribute values under the second non-key attribute and a plurality of non-key attribute values under the third non-key attribute which are respectively associated with each sub-cluster from the plurality of data records;

dividing the plurality of non-key attribute values under the second non-key attribute into a plurality of information clusters by utilizing the association relation between the plurality of non-key attribute values under the second non-key attribute and the plurality of non-key attribute values under the third non-key attribute;

For each information cluster, calculating, for every two non-key attribute values under the second non-key attribute in the information cluster, the probability of each candidate result; the candidate results comprise: belonging to the same data object and not belonging to the same data object;

And determining whether every two non-key attribute values under the second non-key attribute belong to the same data object according to the probabilities of the candidate results for the two non-key attribute values in each information cluster.

15. The method as recited in claim 14, further comprising:

For a first non-key attribute value and a second non-key attribute value under the second non-key attribute,

If the first non-key attribute value and the second non-key attribute value under the second non-key attribute belong to the same data object, determining that the sub-clusters corresponding to the first non-key attribute value and the second non-key attribute value belong to the same data object;

Merging sub-clusters belonging to the same data object;

the first non-key attribute value and the second non-key attribute value are any two non-key attribute values under the second non-key attribute.
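
The merging step of claim 15 can be sketched with a union-find structure, which is a technique chosen here for illustration and not named in the claims. The function name `merge_sub_clusters`, the sub-cluster identifiers, and the mapping from each non-key attribute value to its sub-cluster are assumptions for the example.

```python
def merge_sub_clusters(sub_clusters, value_to_cluster, same_object_pairs):
    """Merge sub-clusters whose non-key values belong to one data object.

    sub_clusters:      dict mapping sub-cluster id -> set of key values.
    value_to_cluster:  dict mapping non-key value -> sub-cluster id.
    same_object_pairs: pairs of non-key values judged to belong to the
                       same data object.
    Returns a list of merged sets of key attribute values.
    """
    parent = {cid: cid for cid in sub_clusters}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_object_pairs:
        root_a = find(value_to_cluster[a])
        root_b = find(value_to_cluster[b])
        if root_a != root_b:
            parent[root_b] = root_a

    merged = {}
    for cid, members in sub_clusters.items():
        merged.setdefault(find(cid), set()).update(members)
    return list(merged.values())


merged = merge_sub_clusters(
    {"c1": {"p1"}, "c2": {"e1"}, "c3": {"p3"}},
    {"addr A": "c1", "addr A'": "c2", "addr B": "c3"},
    [("addr A", "addr A'")],
)
```

Because the two address values are judged to belong to the same data object, the sub-clusters `c1` and `c2` are merged, while `c3` is left unchanged.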

16. The method of any of claims 1-15, wherein the obtaining a plurality of data records comprises:

Receiving a plurality of data records sent by client equipment;

The outputting the attribute value belonging to the same data object in the plurality of key attribute values includes:

and sending the attribute values belonging to the same data object in the plurality of key attribute values to the client device so as to be output by the client device in a visual mode.

17. The method of claim 16, wherein said sending the attribute values belonging to the same data object of the plurality of key attribute values to the client device comprises:

transmitting attribute values belonging to the same data object in the plurality of key attribute values to the client device in the form of a connected subgraph; in the connected subgraph, nodes represent key attribute values; the connection between two nodes represents the association between two key attribute values.

18. The method of claim 17, further comprising, prior to sending the attribute values belonging to the same data object among the plurality of key attribute values to the client device in the form of a connected subgraph:

searching behavior data of a data object corresponding to each connected subgraph in the plurality of data records;

According to the behavior data of the data object corresponding to each connected subgraph, acquiring behavior characteristics of the data object corresponding to each connected subgraph;

And taking the behavior characteristics of the data objects corresponding to each connected subgraph as the identification information of the data objects corresponding to each connected subgraph, and adding the identification information to each connected subgraph.

19. A server device, comprising: a memory and a processor; wherein the memory is used to store a computer program;

the processor is coupled to the memory for executing the computer program for:

20. A data processing system, comprising: client device and server device;

The client device is used for providing a plurality of data records to the server device, wherein the data records comprise a plurality of key attribute values under a plurality of key attributes, each key attribute value belongs to one data object at the same time, and the key attribute refers to an attribute that the attribute value under the key attribute can uniquely identify one data object at the same time; and for outputting, in a visual mode, attribute values belonging to the same data object in the plurality of key attribute values;

The server device is configured to perform initial clustering on the plurality of key attribute values according to an association relationship between the plurality of key attribute values, so as to obtain a plurality of clusters, where each cluster includes a key attribute value with an association relationship, and determine whether an association relationship exists between two key attribute values according to whether any two key attribute values occur in the same data record or occur in different data records with indirect association relationships with each other; for each cluster, performing secondary clustering on the key attribute values in each cluster according to the hierarchical relationship among the plurality of key attributes to obtain at least one sub-cluster, wherein the hierarchical relationship among the plurality of key attributes is established according to the number of historical key attribute values under each key attribute; and regarding the key attribute values in the same sub-cluster as attribute values belonging to the same data object, and transmitting the attribute values belonging to the same data object in the plurality of key attribute values to the client device.

21. A computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the steps in the method of any of claims 1-18.