WO2018149292A1 - Object clustering method and apparatus - Google Patents

Object clustering method and apparatus

Info

Publication number
WO2018149292A1
WO2018149292A1 PCT/CN2018/074552 CN2018074552W WO2018149292A1 WO 2018149292 A1 WO2018149292 A1 WO 2018149292A1 CN 2018074552 W CN2018074552 W CN 2018074552W WO 2018149292 A1 WO2018149292 A1 WO 2018149292A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
target
identifier
objects
directed
Prior art date
Application number
PCT/CN2018/074552
Other languages
English (en)
French (fr)
Inventor
李霖
陈培炫
陈谦
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2018149292A1
Priority to US16/428,958 (patent US10936669B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90324 Query formulation using system suggestions
    • G06F16/90328 Query formulation using system suggestions using search space presentation or visualization, e.g. category or range presentation and selection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9035 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2323 Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/29 Graphical models, e.g. Bayesian networks

Definitions

  • the present application relates to the field of data processing technologies, and in particular, to object clustering.
  • the amount of information in the network is increasing.
  • the same network user may use multiple different user devices for network access.
  • For example, a user may log in to an instant messaging system, a forum, or the like from his or her own mobile phone, a family member's mobile phone, or another terminal device. To guard against malicious access or to provide targeted services to users, it may be necessary to determine which user devices are frequently used by the same user, which requires clustering the user devices.
  • In the conventional approach, when target objects in a network are clustered, only target objects associated with the identifier of the same associated object are clustered together. For example, if multiple user devices use the same user account when accessing a network system, they are considered to be frequently used by the same user and are clustered together. Clustering target objects only by the identifiers of their associated objects splits the target objects into too many categories, and target objects that are strongly related cannot be clustered together, so clustering accuracy is low.
  • the present application provides an object clustering method and apparatus for maximizing the degree of association between target objects to be clustered and improving the accuracy of clustering.
  • an object clustering method including:
  • acquiring a set of associated objects for each of a plurality of target objects to be clustered, each set including at least one associated object;
  • taking each node as a target node to be processed; determining, from the at least one ingress node group corresponding to the target node, the target ingress node group whose directed edges pointing to the target node have the largest total weight; and updating the category identifier of the target node to the category identifier of the ingress nodes in the target ingress node group, until the category identifiers of all nodes in the directed network graph no longer change, wherein an ingress node group includes at least one ingress node that has a directed edge pointing to the target node and has the same category identifier; and
  • determining that the target objects represented by nodes having the same category identifier belong to one cluster category, thereby obtaining a plurality of cluster categories corresponding to the plurality of target objects.
  • an embodiment of the present application further provides an object clustering apparatus, including a memory and a processor, where the memory is used to store an instruction, and the processor is configured to execute the instruction to perform the following steps:
  • acquiring a set of associated objects for each of a plurality of target objects to be clustered, each set including at least one associated object;
  • taking each node as a target node to be processed; determining, from the at least one ingress node group corresponding to the target node, the target ingress node group whose directed edges pointing to the target node have the largest total weight; and updating the category identifier of the target node to the category identifier of the ingress nodes in the target ingress node group, until the category identifiers of all nodes in the directed network graph no longer change, wherein an ingress node group includes at least one ingress node that has a directed edge pointing to the target node and has the same category identifier; and
  • determining that the target objects represented by nodes having the same category identifier belong to one cluster category, thereby obtaining a plurality of cluster categories corresponding to the plurality of target objects.
  • the embodiment of the present application further provides a storage medium, where the storage medium is used to store program code, and the program code is used to execute the object clustering method according to any of the foregoing aspects.
  • an embodiment of the present application further provides a computer program product comprising instructions, when executed on a computer, causing the computer to perform the object clustering method of any of the preceding aspects.
  • In the embodiments of the present application, the weights of the directed edges between the nodes that represent any two target objects in the directed network graph to be constructed are determined according to the degree of similarity between the associated objects of those two target objects, and the directed network graph is constructed accordingly.
  • Because the weight of a directed edge between the nodes of two target objects is determined from the degree of similarity between their associated objects, and that degree of similarity reflects how strongly the two target objects are related, the weight reflects the likelihood that the two target objects belong to the same cluster category: the greater the weight, the greater that likelihood.
  • The nodes in the directed network graph can therefore be clustered based on the weights of the directed edges between them to obtain the categories corresponding to the plurality of target objects. Compared with the conventional approach of clustering target objects only by the identifiers of their associated objects, the method of the present application mines the similarity between target objects from a global perspective and clusters strongly related target objects together, which can effectively improve the accuracy of target object clustering.
  • FIG. 1 is a schematic diagram of a possible composition structure of a computer device to which an object clustering method is applied according to an embodiment of the present application;
  • FIG. 2 is a schematic diagram of a system composition structure suitable for an object clustering method according to an embodiment of the present application
  • FIG. 3 is a schematic flowchart of an embodiment of an object clustering method disclosed in the present application.
  • FIG. 4 is a schematic flowchart of still another embodiment of an object clustering method disclosed in the present application.
  • FIG. 5a is a schematic structural diagram showing a part of a directed network diagram constructed by an embodiment of the present application.
  • FIG. 5b is a schematic diagram showing a directed network diagram after updating a community identifier of a node in the directed network diagram shown in FIG. 5a by using the community identifier of the ingress node corresponding to the inbound edge with the largest weight;
  • FIG. 6 is a schematic flowchart diagram of still another embodiment of an object clustering method disclosed in the present application.
  • FIG. 7 is a schematic diagram showing the structure of an embodiment of an object clustering apparatus disclosed in the present application.
  • the object clustering method and device provided by the embodiments of the present application are suitable for clustering multiple user equipments that log in to a social network, or multiple articles for topic discovery.
  • The traditional object clustering method clusters target objects only according to whether the identifiers of their associated objects are the same, and in some cases it produces wrong clustering results. For example, target objects associated with different associated objects may actually belong to the same category, and target objects associated with the same associated object may actually belong to different categories.
  • In the first case, target objects associated with different associated objects actually belong to the same cluster category, but the traditional method treats them as belonging to different cluster categories and therefore produces a wrong clustering result.
  • For example, suppose the associated object is a user account and the target object is a mobile phone. User A logs in to the instant messaging system through the mobile phone M1 using the user account U1, and also logs in to the instant messaging system through the mobile phone M2 using the user account U2.
  • When the mobile phones are clustered with the traditional method, different user accounts are used on the mobile phone M1 and the mobile phone M2, that is, the associated objects are different, so it is determined that the mobile phone M1 and the mobile phone M2 are not frequently used by the same user and do not belong to the same cluster category. This clustering result is wrong.
  • The method provided by the embodiments of the present application determines the strength of the association between target objects from the weights of the directed edges between the nodes that represent any two target objects in the directed network graph. Because the same user logs in on the mobile phone M1 and the mobile phone M2 with different user accounts, many associated objects (user accounts) may be associated with both phones, so the association between the mobile phone M1 and the mobile phone M2 is determined to be strong. It is thus determined that the mobile phone M1 and the mobile phone M2 are frequently used by the same user, and the two can be clustered together.
  • In the second case, target objects associated with the same associated object actually belong to different cluster categories, but the traditional method clusters target objects whose associated objects have the same identifier together and therefore produces a wrong clustering result.
  • For example, suppose again that the associated object is a user account and the target object is a mobile phone. User A logs in to the instant messaging system through the mobile phone M1 using the user account U1, and user B also uses user A's account U1 to log in to the instant messaging system through the mobile phone M2.
  • When the mobile phones are clustered with the traditional method, the user account U1 is used on both the mobile phone M1 and the mobile phone M2, that is, the identifier of the associated object is the same, so it is determined that the mobile phone M1 and the mobile phone M2 are frequently used by the same user. This clustering result is wrong.
  • The method provided by the embodiments of the present application determines the strength of the association between target objects from the weights of the directed edges between the nodes that represent any two target objects in the directed network graph. Because different users log in on the mobile phone M1 and the mobile phone M2 with the same user account U1, few associated objects (user accounts) may be associated with both phones, possibly only the user account U1, so the association between the mobile phone M1 and the mobile phone M2 is determined to be weak. It is thus determined that the mobile phone M1 and the mobile phone M2 are not frequently used by the same user, and the two are not clustered together.
  • The method provided by the embodiments of the present application mines the similarity between target objects from a global perspective and clusters related target objects together, which avoids such clustering errors and can effectively improve the accuracy of target object clustering.
  • the method and apparatus of this embodiment are applicable to a single computer device or a distributed computing system.
  • FIG. 1 is a schematic structural diagram of a computer device to which an object clustering method and apparatus according to an embodiment of the present application is applied.
  • the computer device may include components such as a memory 101, a processor 102, a communication module 103, a display 104, an input unit 105, and a communication bus 106.
  • The memory 101, the processor 102, the communication module 103, the display 104, and the input unit 105 all communicate with each other through the communication bus 106.
  • the memory 101 can be used to store software programs and modules.
  • The memory 101 may store an operating system, an application required for at least one function (for example, an image playing function), and the like; it may also store data created during use of the terminal.
  • The memory 101 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • The processor 102 is the control center of the terminal. It connects the various parts of the terminal through various interfaces and lines, and operates by running or executing the software programs and/or modules stored in the memory 101 and by calling the data stored in the memory 101.
  • processor 102 can include one or more processing units.
  • the communication module 103 can be used for transmitting and receiving information, or receiving and transmitting signals during data processing, or communicating with other devices through a network.
  • The display 104 can be used to present a window interface and to display processed data, graphics, directed network graphs, and the like in that interface. It can also display information input by the user, information provided to the user, and the various graphical user interfaces of the computer device; these graphical user interfaces can be composed of any combination of graphics, text, images, and the like.
  • The display may include a display panel, for example one configured as a liquid crystal display, an organic light-emitting diode display, or the like. Further, the display can include a touch display panel capable of capturing touch events.
  • the input unit 105 can be configured to receive input of user-entered characters, numbers, and the like, and to generate signal inputs related to user settings and function control.
  • the input unit may include, but is not limited to, one or more of a physical keyboard, a mouse, a joystick, and the like.
  • the terminal can be any device capable of accessing the service platform, for example, the terminal can be a mobile phone, a tablet computer, a desktop computer, or the like.
  • The object clustering method in the embodiments of the present application may also be applied to a distributed computing system. FIG. 2 illustrates such a distributed application of the object clustering method of the present application.
  • the distributed computing system can include a plurality of computer devices 201.
  • the plurality of computer devices 201 can be connected to each other through a network.
  • The plurality of computer devices 201 can cooperate with each other to complete the data processing involved in the object clustering method and apparatus of the embodiments of the present application.
  • In the embodiments of the present application, each of the plurality of target objects is associated with at least one associated object. For any two target objects, the weights of the directed edges between the nodes that represent them are determined based on the degree of similarity between the associated object set of one target object and the associated object set of the other, and the directed network graph is constructed.
  • Each node in the directed network graph is assigned a unique category identifier. Each node is then taken in turn as the target node to be processed: from the at least one ingress node group corresponding to the target node, the target ingress node group with the largest total weight of directed edges pointing to the target node is determined, and the category identifier of the target node is updated to the category identifier of the target ingress node group, until the category identifiers of all nodes in the directed network graph no longer change. An ingress node group includes at least one ingress node that points to the target node and has the same category identifier.
  • The target objects represented by nodes having the same category identifier are determined to belong to one cluster category.
  • Referring to FIG. 3, a schematic flowchart of an embodiment of an object clustering method according to the present application is shown.
  • The method in this embodiment may be applied to a computer device or a distributed computing system as described above, and may include the following steps:
  • The computer device acquires the sets of associated objects respectively associated with the plurality of target objects to be clustered, where each set of associated objects includes at least one associated object.
  • The target objects to be clustered may be selected according to requirements; correspondingly, the associated objects acquired for clustering may differ for different kinds of target objects.
  • For example, when user devices frequently used by the same user need to be clustered, the target object may be a user device, and the associated object may be a user identifier such as a user account or a user name.
  • The computer device may directly obtain information about all associated objects associated with each target object. Alternatively, it may acquire multiple data relationships, each of which records a correspondence between the identifier of a target object and an associated object associated with that target object, and then determine, according to the identifiers of the target objects, which associated objects are associated with each target object.
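  • The second acquisition option above can be sketched as follows. This is a minimal illustration, not the implementation disclosed in the application: the record format (pairs of target identifier and associated-object identifier) and the names `build_associated_sets`, `records`, and `sets_by_target` are assumptions introduced here.

```python
from collections import defaultdict

def build_associated_sets(records):
    """Group (target_id, associated_id) pairs into one set per target object.

    ASSUMPTION: each data relationship is a 2-tuple; the source does not
    specify a record format.
    """
    associated = defaultdict(set)
    for target_id, associated_id in records:
        associated[target_id].add(associated_id)
    return dict(associated)

# Hypothetical login records: phone identifier, user account identifier.
records = [("M1", "U1"), ("M1", "U2"), ("M2", "U2"), ("M2", "U1"), ("M3", "U9")]
sets_by_target = build_associated_sets(records)
```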
  • For any two target objects, the computer device determines, according to the degree of similarity between the associated objects in the set of associated objects of one target object and those in the set of associated objects of the other target object, the weights of the directed edges between the nodes that represent the two target objects in the directed network graph to be constructed, and constructs the directed network graph.
  • the degree of similarity between related objects can reflect the strength of the correlation between the two target objects, which reflects the possibility that the two target objects belong to the same cluster category. Therefore, the weight of the edge between any two target objects in the directed network graph determined according to the degree of similarity may reflect the possibility that the two target objects belong to the same cluster category, and the greater the weight , indicating that the two target objects are more likely to belong to the same cluster category, so as to subsequently cluster the target objects according to the directed network graph.
  • For ease of distinction, the computer device may refer to one of the two target objects as the first target object and to the other as the second target object; the first target object is different from the second target object. It should be noted that, in the embodiments of the present application, target objects with different identifiers are considered to be different target objects.
  • In one implementation of S302, the computer device determines the first similarity, with which the first target object points to the second target object in the directed network graph to be constructed, according to the total number of associated objects associated with both the first target object and the second target object and the first number of associated objects associated with the first target object; and determines the second similarity, with which the second target object points to the first target object, according to that total number and the second number of associated objects associated with the second target object.
  • Correspondingly, the computer device may construct a directed network graph that includes a first node representing the first target object and a second node representing the second target object, set the weight of the directed edge from the first node to the second node to the first similarity, and set the weight of the directed edge from the second node to the first node to the second similarity.
  • the directed network graph includes nodes and directed edges between the nodes.
  • The directed network graph constructed by the computer device includes a plurality of nodes that respectively represent the plurality of target objects; for convenience of description, the present application takes any two target objects as an example.
  • The similarity with which the first target object, represented by the first node, points to the second target object, represented by the second node, is used as the weight of the directed edge from the first node to the second node; and the similarity with which the second target object points to the first target object is used as the weight of the directed edge from the second node to the first node.
  • If the similarity of one target object to another target object is zero, the two target objects have no associated object in common; in that case, the nodes that represent the two target objects in the directed network graph may have no directed edge connecting them.
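  • The weight determination and graph construction described above can be sketched as follows. The text states only that each directed similarity depends on the number of shared associated objects and on the number of associated objects of the source target object; the ratio used here (shared count divided by the source's own count) is an assumption made for illustration, as are the names `directed_similarity` and `build_directed_graph`.

```python
from itertools import combinations

def directed_similarity(source_set, other_set):
    """Similarity with which the source target object points to the other.

    ASSUMPTION: shared count / source count. The source text names the two
    quantities but not how they are combined.
    """
    if not source_set:
        return 0.0
    return len(source_set & other_set) / len(source_set)

def build_directed_graph(sets_by_target):
    """Return {(source, destination): weight}; zero-weight edges are omitted."""
    edges = {}
    for a, b in combinations(sorted(sets_by_target), 2):
        w_ab = directed_similarity(sets_by_target[a], sets_by_target[b])
        w_ba = directed_similarity(sets_by_target[b], sets_by_target[a])
        if w_ab > 0:
            edges[(a, b)] = w_ab
        if w_ba > 0:
            edges[(b, a)] = w_ba
    return edges

# Hypothetical phones and the user accounts seen on each of them.
sets_by_target = {
    "M1": {"U1", "U2", "U3", "U4"},
    "M2": {"U1", "U2"},
    "M3": {"U9"},
}
edges = build_directed_graph(sets_by_target)
```

  • Under this assumed ratio the two directions generally differ: here M2 shares all of its accounts with M1, while M1 shares only half of its accounts with M2, and M3 receives no edges at all.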
  • It should be noted that the foregoing implementation of step S302 is only one way of determining, from the degree of similarity between the associated objects of any two target objects, the weights of the directed edges between the two nodes that represent those target objects in the directed network graph. It can be understood that the similarity between the associated objects of target objects can be expressed in many ways and, correspondingly, the weights of the directed edges between the nodes of the target objects can be determined from that similarity in many ways, which are not limited herein.
  • the computer device separately assigns a unique category identifier to each node in the directed network graph.
  • The category identifier represents the cluster category to which a node belongs. Before clustering is performed on the directed network graph, it is not known which nodes represent target objects that can be clustered together, so the computer device may regard each node as belonging to its own cluster category and assign each node a unique category identifier.
  • The category identifiers of some or all nodes then change continually; when the category identifiers of all nodes no longer change, clustering is complete.
  • The computer device takes each node in the directed network graph in turn as the target node to be processed and updates the category identifier of the target node to the category identifier of the ingress node group that, among the at least one ingress node group corresponding to the target node, has the largest total weight of in-degree edges.
  • For any node, the other nodes in the directed network graph that point to that node are referred to as its ingress nodes (in-degree nodes); an ingress node of a node can also be understood as a neighbor node of that node in the directed network graph.
  • the node has at least one ingress node.
  • the directed edge of the node from the ingress node to the node may be referred to as an in-degree edge. It can be seen that an in-degree node corresponds to an in-degree edge.
  • An ingress node group includes at least one ingress node that has a directed edge pointing to the target node and has the same category identifier; the total weight of the in-degree edges of an ingress node group is the sum of the weights of the in-degree edges corresponding to all the ingress nodes in that group.
  • The weight of an in-degree edge indicates the similarity between the corresponding ingress node and the node to be updated: the higher the weight, the more likely it is that the target object represented by the ingress node and the target object represented by the target node belong to one category. Correspondingly, if the sum of the weights of all in-degree edges of an ingress node group is the largest, the target node is most likely to belong to the same category as the ingress nodes in that group, so the category identifier of the target node can be unified with the category identifier of that group.
  • That is, the target ingress node group with the largest total weight of directed edges is determined from the ingress node groups corresponding to the target node, and the category identifier of the target node is updated to the category identifier of the target ingress node group.
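  • The selection of the target ingress node group for a single target node can be sketched as follows. The function name `best_ingress_label`, the edge dictionary layout, and the tie-breaking rule (smallest category identifier wins when totals are equal) are assumptions; the text does not say how ties are resolved.

```python
from collections import defaultdict

def best_ingress_label(node, edges, labels):
    """Return the category identifier of the ingress node group whose
    in-degree edges into `node` have the largest total weight."""
    totals = defaultdict(float)
    for (src, dst), weight in edges.items():
        if dst == node:
            # Ingress nodes sharing a category identifier form one group.
            totals[labels[src]] += weight
    if not totals:
        return labels[node]  # no ingress node: keep the current identifier
    # ASSUMPTION: break ties by choosing the smallest category identifier.
    return min(totals, key=lambda lab: (-totals[lab], lab))

# Hypothetical graph: A and B share category 1, C is alone in category 2.
edges = {("A", "T"): 0.4, ("B", "T"): 0.3, ("C", "T"): 0.6}
labels = {"A": 1, "B": 1, "C": 2, "T": 3}
new_label = best_ingress_label("T", edges, labels)
```

  • In this example the group {A, B} is chosen even though the single heaviest in-degree edge comes from C, because the comparison is made on the total weight per group.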
  • Step S304 is an iterative loop. Each time an iteration completes, the computer device determines whether the category identifier of any node changed during that iteration; if so, it selects target nodes from the directed network graph again and performs another iteration.
  • In one implementation, the computer device first treats all nodes in the directed network graph as nodes to be processed. While an unprocessed node to be processed remains, the computer device selects a target node from the unprocessed nodes to be processed and updates its category identifier to the category identifier of the ingress node group that, among the at least one ingress node group corresponding to the target node, has the largest total weight of in-degree edges, until every node to be processed has been processed as the target node.
  • After every node to be processed has been processed as the target node, if there are nodes whose updated category identifier differs from the category identifier before the update, the computer device may determine only those nodes as nodes to be processed and return to the operation of selecting a target node from the unprocessed nodes to be processed. If there is no node whose updated category identifier differs from the category identifier before the update, clustering has ended, and the computer device can perform the subsequent step S306.
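  • The iteration described above can be sketched as follows, under stated assumptions. Initial category identifiers are taken to be node indices, ties are broken toward the smallest identifier, and a `max_rounds` cap is added as a safety limit; none of these details come from the source. As in the text, only nodes whose identifier changed in one round are processed again in the next.

```python
from collections import defaultdict

def cluster_labels(nodes, edges, max_rounds=100):
    """Propagate category identifiers until no identifier changes."""
    incoming = defaultdict(list)
    for (src, dst), weight in edges.items():
        incoming[dst].append((src, weight))

    # Every node starts in its own category (unique identifier).
    labels = {node: index for index, node in enumerate(nodes)}
    pending = list(nodes)
    for _ in range(max_rounds):  # ASSUMPTION: cap rounds as a safety limit
        changed = []
        for node in pending:
            totals = defaultdict(float)
            for src, weight in incoming[node]:
                totals[labels[src]] += weight
            if not totals:
                continue  # isolated node keeps its own identifier
            # ASSUMPTION: ties go to the smallest category identifier.
            best = min(totals, key=lambda lab: (-totals[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed.append(node)
        if not changed:
            break  # no identifier changed: clustering has ended
        pending = changed
    return labels

# Hypothetical graph with two mutually linked pairs of phones.
nodes = ["M1", "M2", "M3", "M4"]
edges = {("M1", "M2"): 0.5, ("M2", "M1"): 1.0,
         ("M3", "M4"): 1.0, ("M4", "M3"): 1.0}
labels = cluster_labels(nodes, edges)
```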
  • the computer device determines that the target object represented by the node with the same category identifier belongs to one cluster category, and obtains multiple cluster categories corresponding to the multiple target objects.
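  • This final grouping step can be sketched as follows; the function name `clusters_from_labels` and the sample identifiers are illustrative assumptions.

```python
from collections import defaultdict

def clusters_from_labels(labels):
    """Group nodes that share a category identifier into one cluster."""
    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return list(groups.values())

# Hypothetical final category identifiers for three phones.
clusters = clusters_from_labels({"M1": 1, "M2": 1, "M3": 3})
```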
  • In this embodiment, the weights of the directed edges between the nodes that represent any two target objects in the directed network graph to be constructed are determined according to the degree of similarity between the associated objects of those two target objects, and the directed network graph is constructed accordingly.
  • Because that degree of similarity reflects how strongly the two target objects are related, the weight of a directed edge between the nodes of two target objects reflects the likelihood that they belong to the same cluster category: the greater the weight, the greater that likelihood.
  • The nodes in the directed network graph can therefore be clustered based on the weights of the directed edges between them to obtain the categories corresponding to the plurality of target objects. Compared with the conventional approach of clustering target objects only by the identifiers of their associated objects, the method of the present application mines the similarity between target objects from a global perspective and clusters strongly related target objects together, which can effectively improve the accuracy of target object clustering.
  • The object clustering method of the embodiments of the present application is described below using two example scenarios, in which the target objects to be clustered are user equipments and documents, respectively.
  • FIG. 4 is a schematic flowchart diagram of an embodiment of an object clustering method.
  • the method of this embodiment is applied to a computer device or a distributed computing system as mentioned above.
  • This embodiment is described using the example of clustering multiple user equipments of a social network so that the devices of the same user are clustered together.
  • the user equipment in the embodiment of the present application may be a mobile phone, a tablet computer, a desktop computer, or the like.
  • the method in this embodiment may include:
  • the computer device acquires a data set to be analyzed in the network, where the to-be-analyzed data set includes multiple data relationships, where each data relationship includes a correspondence between the user identifier and the identifier of the user equipment.
  • the user identifier indicates the identity of the user who accesses (or logs in to) the preset network system;
  • the identifier of the user equipment indicates the identifier of the user equipment used by the user to access the preset network system.
  • the data relationship can be expressed in the form of (user U, user equipment Ue).
  • the user identifier may be a user name of the user in the preset network system, a user's network account number, a user's phone number, and the like.
  • The identifier of the user equipment uniquely identifies one user equipment.
  • the identifier of the user equipment may be an IP address of the user equipment, a device identifier, or the like.
  • There may be one or more preset network systems; for example, the preset network systems may be multiple social networks, such as several different instant messaging systems, forum systems, and the like.
  • Within a given data relationship, however, the preset network system is a single, uniquely determined preset network system.
  • For example, user A logs in to the instant messaging system through mobile phone M1 as instant messaging user U1, so the user name of instant messaging user U1 and the identifier of mobile phone M1 form a data relationship.
  • User A also logs in to the instant messaging system through mobile phone M2 under the identity of instant messaging user U2, which yields a data relationship between the user name of instant messaging user U2 and the identifier of mobile phone M2.
  • In this case the user identifiers are different, but the user equipments are actually used by the same user. This corresponds to the situation described earlier in which the associated objects differ, yet the target objects associated with the different associated objects may actually belong to the same cluster category.
  • As another example, user A logs in to the instant messaging system through mobile phone M1 as instant messaging user U1, so the user name of instant messaging user U1 and the identifier of mobile phone M1 form a data relationship; user B logs in under the identity of the same instant messaging user U1 through mobile phone M2.
  • This yields a data relationship between the user name of instant messaging user U1 and the identifier of mobile phone M2.
  • In this case the user identifiers are the same, but the user equipments are actually used by different users. This corresponds to the situation described earlier in which the target objects associated with the same associated object may actually belong to different cluster categories.
  • The purpose of this embodiment is to determine which user equipments are used by the same user, so that the user equipments used by the same user can be clustered together. Therefore, the data set to be analyzed does not include completely identical data relationships.
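  • As a non-limiting illustration, a data set that contains no completely identical data relationships can be prepared as sketched below. The representation of a data relationship as a `(user_id, device_id)` tuple and the name `build_dataset` are assumptions made for this sketch.

```python
def build_dataset(raw_relationships):
    """Drop exact duplicate (user_id, device_id) tuples, keeping first-seen order."""
    # dict.fromkeys preserves insertion order and keeps one entry per distinct tuple.
    return list(dict.fromkeys(raw_relationships))


# Hypothetical login records; ("U1", "M1") was observed twice.
raw = [("U1", "M1"), ("U2", "M2"), ("U1", "M1"), ("U1", "M2")]
dataset = build_dataset(raw)
```

  • Relationships that share only the user identifier or only the device identifier, such as `("U1", "M1")` and `("U1", "M2")`, are kept, because they are not completely identical.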
  • S402. For a first user equipment and a second user equipment characterized by the identifiers of any two different user equipments, acquire at least one first data relationship that includes the identifier of the first user equipment and at least one second data relationship that includes the identifier of the second user equipment.
  • For example, if the identifier of the first user equipment is Ue1, every data relationship that contains the identifier Ue1 is a first data relationship.
  • The number of first data relationships and the number of second data relationships may differ.
  • Purely for convenience of description, the embodiments of the present application refer to one of any two user equipments as the first user equipment and to the other as the second user equipment; the first user equipment and the second user equipment have different user equipment identifiers.
  • the computer device determines, from the at least one first data relationship and the at least one second data relationship, at least one pair of data relationship pairs including the same user identifier.
  • Each data relationship pair consists of a first data relationship and a second data relationship that have the same user identifier.
  • a pair of data relationship pairs indicates that the same user has used both the first user equipment and the second user equipment to log in to the preset network system.
  • For example, suppose the identifier of the first user equipment is Ue1 and the identifier of the second user equipment is Ue2.
  • If a first data relationship and a second data relationship both contain the user identifier userA, the two form a data relationship pair with the same user identifier userA.
  • This indicates that the user whose user identifier is userA has logged in to the preset network system using both the first user equipment Ue1 and the second user equipment Ue2.
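  • As a non-limiting illustration, the selection of first data relationships, second data relationships, and data relationship pairs can be sketched as follows. The tuple layout `(user_id, device_id)`, the function name `relationship_pairs`, and the sample records are assumptions made for this sketch.

```python
def relationship_pairs(dataset, first_device, second_device):
    """Return the first relationships, the second relationships, and the
    pairs of relationships that share a user identifier."""
    first = [rel for rel in dataset if rel[1] == first_device]
    second = [rel for rel in dataset if rel[1] == second_device]
    # Assumes the data set holds no exact duplicates, so a user identifier
    # appears at most once per device.
    second_by_user = {user: (user, device) for user, device in second}
    pairs = [(rel, second_by_user[rel[0]]) for rel in first if rel[0] in second_by_user]
    return first, second, pairs


# Hypothetical data set: only userA has logged in through both devices.
dataset = [("userA", "Ue1"), ("userB", "Ue1"), ("userA", "Ue2"), ("userC", "Ue2")]
first, second, pairs = relationship_pairs(dataset, "Ue1", "Ue2")
```

  • Here `len(first)` plays the role of the first quantity, `len(second)` the second quantity, and `len(pairs)` the total number of data relationship pairs.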
  • The computer device determines the ratio of the total number of data relationship pairs to the first quantity, which corresponds to the at least one first data relationship, as the similarity of the second user equipment to the first user equipment.
  • The total number of data relationship pairs is the total number of users who have used both the first user equipment and the second user equipment to log in to the preset network system.
  • the total number of the first data relationship is referred to as a first quantity
  • the total number of the second data relationship is referred to as a second quantity.
  • the first number indicates the total number of users who have used or logged in to the preset network system through the first user equipment
  • the second number indicates the total number of users who have used or logged in to the preset network system through the second user equipment.
  • The similarity $W_{Ue2Ue1}$ of the second user equipment Ue2 to the first user equipment Ue1 may be expressed as follows:
  $$W_{Ue2Ue1} = \frac{N_{Ue1 \cap Ue2}}{N_{Ue1}}$$
  • where $N_{Ue1 \cap Ue2}$ indicates the total number of users who have logged in to the preset network system using both the first user equipment Ue1 and the second user equipment Ue2, that is, the total number of data relationship pairs; and
  • $N_{Ue1}$ indicates the total number of users who have logged in to the preset network system using the first user equipment Ue1, that is, the first quantity of first data relationships.
  • The computer device determines the ratio of the total number of data relationship pairs to the second quantity, which corresponds to the at least one second data relationship, as the similarity of the first user equipment to the second user equipment.
  • The similarity $W_{Ue1Ue2}$ of the first user equipment Ue1 to the second user equipment Ue2 may be expressed as follows:
  $$W_{Ue1Ue2} = \frac{N_{Ue1 \cap Ue2}}{N_{Ue2}}$$
  • where $N_{Ue1 \cap Ue2}$ is the total number of data relationship pairs and $N_{Ue2}$ is the total number of users who have logged in to the preset network system using the second user equipment Ue2, that is, the second quantity of second data relationships.
  • The manner in which the computer device calculates the similarity of the second user equipment to the first user equipment and the similarity of the first user equipment to the second user equipment is not limited to the manner shown in step S303 and step S304; other manners of calculating the similarity between devices may be used in practical applications, which is not limited herein.
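  • As a non-limiting illustration, the two directional similarities can be computed as sketched below, assuming that each similarity is the ratio of the number of shared users to the number of users of one of the two devices. The name `similarities`, the mapping `users_by_device`, and the sample data are assumptions made for this sketch.

```python
def similarities(users_by_device, first_device, second_device):
    """Return (similarity of second to first, similarity of first to second)."""
    first_users = users_by_device[first_device]
    second_users = users_by_device[second_device]
    shared = len(first_users & second_users)       # total number of data relationship pairs
    w_second_to_first = shared / len(first_users)  # shared users / first quantity
    w_first_to_second = shared / len(second_users) # shared users / second quantity
    return w_second_to_first, w_first_to_second


# Hypothetical data: four users on Ue1, two of whom also use Ue2.
users_by_device = {
    "Ue1": {"userA", "userB", "userC", "userD"},
    "Ue2": {"userA", "userB"},
}
w21, w12 = similarities(users_by_device, "Ue1", "Ue2")
```

  • The two values generally differ: in this sample, every user of Ue2 also uses Ue1, but only half of the users of Ue1 use Ue2.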
  • Steps S401 and S402 are only one optional way of obtaining the user identifiers corresponding to the user equipments.
  • In practical applications, the computer device or the distributed computing system may also directly obtain the user identifiers corresponding to each of the multiple user equipments to be clustered. That is, the computer device obtains the user identifiers corresponding to each user equipment to be clustered; each user equipment may correspond to one or more user identifiers, and a user identifier corresponding to a user equipment is the identity of a user who logs in to the preset network system using that user equipment.
  • For example, the computer device can obtain the user identifiers of all users who log in to the social network through the first user equipment, and the user identifiers of all users who log in to the social network through the second user equipment.
  • In this case, the first quantity of users who log in to the social network through the first user equipment may be the number of user identifiers corresponding to the first user equipment.
  • The second quantity of users who log in to the social network through the second user equipment may be the number of user identifiers corresponding to the second user equipment.
  • The similarity of the second user equipment to the first user equipment corresponds to the first similarity mentioned in the previous embodiment, and the similarity of the first user equipment to the second user equipment corresponds to the second similarity mentioned in the previous embodiment.
  • The computer device uses the first user equipment and the second user equipment as nodes of the directed network graph, and uses the similarity of the first user equipment to the second user equipment as the weight of the edge pointing from the first user equipment to the second user equipment in the directed network graph.
  • Likewise, the similarity of the second user equipment to the first user equipment is used as the weight of the edge pointing from the second user equipment to the first user equipment.
  • edges between any two nodes in the directed network graph have directions and weights, and the edges in different directions between the two nodes may have different weights.
  • For example, the weight of the edge from node A to node B may be weight 1, and the weight of the edge from node B to node A may be weight 2; weight 1 and weight 2 can be different.
  • In this way, each user equipment serves as a node, and the directed network graph is constructed.
  • In the directed network graph, the edge (also referred to as a directed edge) pointing from the node of the first user equipment to the node of the second user equipment has a weight that characterizes the similarity of the first user equipment to the second user equipment; correspondingly, the weight of the directed edge pointing from the node of the second user equipment to the node of the first user equipment characterizes the similarity of the second user equipment to the first user equipment.
  • each node represents a user equipment corresponding to the identifier of a unique user equipment.
  • The number marked on each edge is the weight corresponding to that edge.
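  • As a non-limiting illustration, a weighted directed network graph of user equipments can be built as sketched below. The edge dictionary keyed by `(source, destination)`, the name `build_directed_graph`, the choice to omit edges between devices with no shared user, and the sample data are all assumptions made for this sketch.

```python
from itertools import combinations


def build_directed_graph(users_by_device):
    """Return {(source, destination): weight} for every ordered device pair
    that shares at least one user identifier."""
    edges = {}
    for a, b in combinations(sorted(users_by_device), 2):
        shared = len(users_by_device[a] & users_by_device[b])
        if shared == 0:
            continue  # assumption of this sketch: no shared user, no edge
        # Edge b -> a carries the similarity of b to a (shared / users of a).
        edges[(b, a)] = shared / len(users_by_device[a])
        # Edge a -> b carries the similarity of a to b (shared / users of b).
        edges[(a, b)] = shared / len(users_by_device[b])
    return edges


# Hypothetical user identifier sets per device.
users_by_device = {
    "Ue1": {"u1", "u2"},
    "Ue2": {"u1", "u2", "u3", "u4"},
    "Ue3": {"u4"},
}
edges = build_directed_graph(users_by_device)
```

  • In the sample, the two edges between Ue1 and Ue2 carry different weights, and no edge is created between Ue1 and Ue3.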
  • The computer device initializes the community identifier of each node in the directed network graph to the identifier of the user equipment represented by that node.
  • the community identifier is used to indicate the community to which the user equipment represented by the node is clustered.
  • a community can also be considered as a cluster category.
  • As shown in FIG. 5a, each node corresponds to a community identifier, which is the identifier of the user equipment corresponding to the node.
  • The community identifier of each node is marked next to the node, where the identifier in parentheses indicates the identity of the user equipment corresponding to the node.
  • Step S407 may also assign a unique community identifier to each node.
  • The computer device sequentially takes each node in the directed network graph as a node to be updated, determines, from the ingress nodes connected to the node to be updated, the target ingress node corresponding to the ingress edge with the largest weight, and uses the community identifier of the target ingress node as the community identifier of the node to be updated.
  • The computer device may classify the node to be updated and the target ingress node corresponding to the ingress edge with the largest weight into one category; that is, the node to be updated belongs to the same community as the target ingress node.
  • To mark the nodes (or user equipments) that belong to the same community, the community identifier of the target ingress node and that of the node to be updated need to be unified. For example, FIG. 5b is a schematic diagram of the directed network graph of FIG. 5a after the community identifier of a node has been updated to the community identifier of the target ingress node corresponding to the ingress edge of that node with the largest weight.
  • For example, comparing the node corresponding to user equipment Ue1 in FIG. 5a and FIG. 5b, the community identifier of that node is Ue2 in FIG. 5b. This is because, among the ingress edges pointing to the node of user equipment Ue1, the edge from the ingress node whose community identifier is Ue2 has the largest weight. Therefore, the community identifier of the node corresponding to user equipment Ue1 is changed to Ue2.
  • the community identifiers of other nodes may be updated in sequence.
  • This embodiment takes updating the community identifier of the node to be updated to the community identifier of its target ingress node as an example; it can be understood that updating the community identifier of the target ingress node to the community identifier of the node to be updated is equally applicable to the embodiments of the present application.
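  • As a non-limiting illustration, the initial update in which each node adopts the community identifier of the source of its heaviest ingress edge can be sketched as follows. The name `initial_update`, the in-place sequential update, the tie-breaking by node name, and the sample weights are assumptions made for this sketch.

```python
def initial_update(edges, community_of):
    """Visit nodes in sequence; each adopts the current community identifier
    of the ingress node whose edge into it has the largest weight."""
    updated = dict(community_of)
    for node in community_of:
        ingress = [(w, src) for (src, dst), w in edges.items() if dst == node]
        if not ingress:
            continue  # a node without ingress edges keeps its identifier
        # Ties on weight are broken by node name, an arbitrary choice here.
        _, target_ingress_node = max(ingress)
        updated[node] = updated[target_ingress_node]
    return updated


# Hypothetical weights: the edge Ue2 -> Ue1 is the heaviest edge into Ue1.
edges = {
    ("Ue2", "Ue1"): 0.8,
    ("Ue3", "Ue1"): 0.3,
    ("Ue1", "Ue2"): 0.4,
    ("Ue1", "Ue3"): 0.6,
}
initial = {"Ue1": "Ue1", "Ue2": "Ue2", "Ue3": "Ue3"}
after = initial_update(edges, initial)
```

  • In the sample, Ue1 first adopts the identifier Ue2; Ue2 and Ue3 are visited afterwards and adopt the identifier that Ue1 carries at that point.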
  • the computer device takes all nodes in the directed network graph as nodes to be processed.
  • the computer device can treat all nodes in the directed network graph as nodes to be processed in order to sequentially update the community identifiers of the various nodes in the directed network graph.
  • the computer device selects an unprocessed to-be-processed node from the to-be-processed node of the directed network graph as the target node to be processed.
  • Steps S411 and S412 are performed on each node in the directed network graph in turn.
  • For ease of description, the node currently being processed is referred to as the target node, and steps S411 and S412 are executed in a loop.
  • Within the current round of the loop, a node to be processed that has already been used as the target node does not need to be used as the target node again.
  • The computer device groups the ingress nodes of the target node that have the same community identifier into an ingress node group and, for each ingress node group, calculates the sum of the weights of the ingress edges corresponding to all ingress nodes in the group, obtaining the total ingress-edge weight of each ingress node group.
  • After step S408, two or more ingress nodes having the same community identifier may exist among the ingress nodes of a target node.
  • In this embodiment, ingress nodes having the same community identifier are classified into one ingress node group.
  • If an ingress node shares its community identifier with no other ingress node of the target node, that ingress node alone forms an ingress node group, and the total ingress-edge weight of that group is the weight of the ingress edge pointing from that ingress node to the target node. It can be seen that an ingress node group includes at least one ingress node.
  • To determine how likely it is that the target node and the ingress nodes of an ingress node group correspond to user equipments frequently used by the same user, the computer device calculates the sum of the weights of the ingress edges corresponding to all ingress nodes in the group. In the embodiments of the present application, this sum is referred to as the total ingress-edge weight; it reflects the degree of similarity between all ingress nodes in the ingress node group and the target node.
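  • As a non-limiting illustration, the grouping of ingress nodes by community identifier and the summation of their edge weights can be sketched as follows. The name `group_totals`, the node names, and the weights are assumptions made for this sketch.

```python
from collections import defaultdict


def group_totals(edges, community_of, target):
    """Sum the weights of the edges into `target`, grouped by the community
    identifier of the ingress node each edge comes from."""
    totals = defaultdict(float)
    for (src, dst), weight in edges.items():
        if dst == target:
            totals[community_of[src]] += weight
    return dict(totals)


# Hypothetical graph: A and B share community "c1"; C is alone in "c2".
edges = {
    ("A", "T"): 0.5,
    ("B", "T"): 0.25,
    ("C", "T"): 0.625,
    ("T", "A"): 0.9,  # not an ingress edge of T, so it is ignored
}
community_of = {"A": "c1", "B": "c1", "C": "c2", "T": "T"}
totals = group_totals(edges, community_of, "T")
```

  • In the sample, the single heaviest ingress edge comes from C, yet the group "c1" has the larger total ingress-edge weight, so the group totals and the single-edge maximum can lead to different choices.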
  • The computer device uses the community identifier of the ingress node group with the largest total ingress-edge weight as the community identifier of the target node.
  • In other words, the computer device clusters the target node and the ingress nodes of that ingress node group into one community by updating the community identifier of the target node to the community identifier of the ingress node group with the largest total ingress-edge weight.
  • The computer device records the target node as a node whose community identifier has been updated.
  • If the community identifier of the target node changes when it is set to the community identifier of the ingress node group with the largest total ingress-edge weight, the target node needs to be marked so that it is subsequently treated as a node to be processed in the next round.
  • To this end, the community identifier of the target node before step S412 may be taken as the pre-update community identifier, and the community identifier assigned to the target node in step S412 as the updated community identifier. The two are then compared; if they differ, the community identifier of the target node has changed.
  • Step S413 is an optional step. In practical applications, the computer device may instead wait until all nodes to be processed in the current round have been processed as target nodes and then determine which of those target nodes had their community identifiers updated. That is, whether any target node underwent a community identifier update can be determined directly in the subsequent step S415.
  • The computer device detects whether the directed network graph still contains a node to be processed that has not yet been used as the target node; if yes, it returns to step S410; if not, it proceeds to step S415.
  • The computer device determines whether any node in the directed network graph has had its community identifier updated. If yes, the nodes whose community identifiers were updated are taken as the nodes to be processed in the directed network graph, and the process returns to step S410; if not, the user equipments corresponding to nodes with the same community identifier are determined to belong to the same community, yielding the multiple communities produced by the clustering.
  • the user equipments belonging to the same community may be considered as: clustered user equipment sets that are frequently used by the same user.
  • Steps S410 to S414 form an iterative process that is repeated.
  • After repeated iterations, the community identifier of each node in the directed network graph no longer changes. Therefore, if no node in the directed network graph has had its community identifier updated, the clustering of all nodes in the directed network graph is complete, and nodes with the same community identifier are clustered into one community.
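  • As a non-limiting illustration, the iteration of steps S409 to S415 can be sketched as follows. The name `cluster`, the tie-breaking rule, the sample graph, and in particular the `max_rounds` safeguard are assumptions made for this sketch; the safeguard is added because some graphs, such as a simple directed cycle, could otherwise keep changing indefinitely under this update rule.

```python
from collections import defaultdict


def cluster(edges, community_of, max_rounds=100):
    """Repeat the group-weight update until no community identifier changes."""
    community_of = dict(community_of)
    incoming = defaultdict(list)
    for (src, dst), weight in edges.items():
        incoming[dst].append((src, weight))

    pending = list(community_of)              # S409: all nodes are to be processed
    for _ in range(max_rounds):               # safeguard, not part of the embodiment
        if not pending:
            break
        changed = []
        for target in pending:                # S410 / S414: each pending node once
            totals = defaultdict(float)       # S411: total weight per ingress node group
            for src, weight in incoming[target]:
                totals[community_of[src]] += weight
            if not totals:
                continue
            # S412: heaviest group wins; ties are broken by identifier (arbitrary).
            best = max(totals, key=lambda c: (totals[c], c))
            if best != community_of[target]:
                community_of[target] = best
                changed.append(target)        # S413: record the updated node
        pending = changed                     # S415: only changed nodes are reprocessed
    return community_of


# Hypothetical graph with two tightly linked device pairs and a weak bridge.
edges = {
    ("Ue2", "Ue1"): 1.0, ("Ue1", "Ue2"): 0.5,
    ("Ue3", "Ue4"): 1.0, ("Ue4", "Ue3"): 0.5,
    ("Ue2", "Ue3"): 0.25, ("Ue3", "Ue2"): 0.25,
}
start = {name: name for name in ("Ue1", "Ue2", "Ue3", "Ue4")}
result = cluster(edges, start)
```

  • In the sample, the loop stops after the second round, with Ue1 and Ue2 sharing one community identifier and Ue3 and Ue4 sharing another.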
  • Step S408 is an optional step. It merely takes into account that, in the newly constructed directed network graph, the community identifier of every node is unique.
  • Therefore, for a given node, the community identifier of the target ingress node corresponding to the ingress edge with the largest weight can be used directly as the community identifier of the node to be updated.
  • In that case each ingress node forms an ingress node group by itself, so the ingress node group with the largest total ingress-edge weight is exactly the target ingress node corresponding to the ingress edge with the largest weight.
  • The object clustering method of the embodiments of the present application is described below using the example of discovering topics among a plurality of documents associated with different community identifiers.
  • FIG. 6 is a schematic flowchart diagram of still another embodiment of an object clustering method according to the present application.
  • the method of this embodiment is applicable to a computer device or a distributed computing system as mentioned above.
  • the method of this embodiment may include
  • the computer device acquires a data set to be analyzed in the network, where the data set to be analyzed includes multiple data relationships, and each data relationship includes: a correspondence between the community identifier and the identifier of the document.
  • The community identifier is an identifier in some preset network system, such as an identifier in a social network, for example an identifier associated with an instant messaging system; the identifier of the document is an identifier, obtained from the network, that uniquely represents the document.
  • The purpose of this embodiment is to determine which documents are frequently used by the same community, so that documents frequently used by the same community are clustered together and the topics of those documents can be extracted.
  • The computer device acquires at least one third data relationship that includes the identifier of the first document and at least one fourth data relationship that includes the identifier of the second document.
  • the computer device determines, from the at least one third data relationship and the at least one fourth data relationship, at least one pair of data relationship pairs including the same community identifier.
  • Each data relationship pair consists of a third data relationship and a fourth data relationship that have the same community identifier.
  • The computer device determines the ratio of the total number of data relationship pairs to the third number, which corresponds to the at least one third data relationship, as the similarity of the second document to the first document.
  • the total number of data relationship pairs is the total number of communities that have used the first document and the second document at the same time.
  • the total number of third data relationships is referred to as a third number
  • the total number of fourth data relationships is referred to as a fourth number.
  • the third number represents the total number of communities that have used the first document
  • the fourth number represents the total number of communities that have used the second document.
  • The computer device determines the ratio of the total number of data relationship pairs to the fourth number, which corresponds to the at least one fourth data relationship, as the similarity of the first document to the second document.
  • The computer device uses the first document and the second document as nodes of the directed network graph, and uses the similarity of the first document to the second document as the weight of the directed edge from the first document to the second document.
  • The similarity of the second document to the first document is used as the weight of the directed edge from the second document to the first document.
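  • As a non-limiting illustration, the directed edge weights between documents can be derived from the data relationships as sketched below, assuming each similarity is the ratio of the number of shared communities to the number of communities that used one of the two documents. The tuple layout `(community_id, document_id)`, the name `document_edge_weights`, and the sample records are assumptions made for this sketch.

```python
from collections import defaultdict


def document_edge_weights(relationships):
    """Return {(source_doc, destination_doc): weight} from (community, doc) pairs."""
    communities_by_doc = defaultdict(set)
    for community_id, document_id in relationships:
        communities_by_doc[document_id].add(community_id)

    edges = {}
    documents = sorted(communities_by_doc)
    for first in documents:
        for second in documents:
            if first == second:
                continue
            shared = len(communities_by_doc[first] & communities_by_doc[second])
            if shared:
                # Edge second -> first: shared communities / communities using `first`.
                edges[(second, first)] = shared / len(communities_by_doc[first])
    return edges


# Hypothetical records: d1 is used by g1 and g2; d2 is used by g1 to g4.
relationships = [
    ("g1", "d1"), ("g2", "d1"),
    ("g1", "d2"), ("g2", "d2"), ("g3", "d2"), ("g4", "d2"),
]
doc_edges = document_edge_weights(relationships)
```

  • The resulting edge dictionary has the same shape as in the user equipment scenario, so the same iterative update of identifiers can be applied to documents and topic identifiers.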
  • The computer device initializes the topic identifier of each node in the directed network graph to the identifier of the document represented by that node.
  • the topic identifier is used to indicate the topic corresponding to the document represented by the node, and a topic can also be considered as a cluster category.
  • The computer device sequentially uses each node in the directed network graph as a node to be updated, determines, from the ingress nodes connected to the node to be updated, the target ingress node corresponding to the ingress edge with the largest weight, and uses the topic identifier of the target ingress node as the topic identifier of the node to be updated.
  • the computer device will use all nodes in the directed network graph as nodes to be processed.
  • the computer device selects an unprocessed to-be-processed node as a target node to be processed from the to-be-processed node of the directed network graph.
  • The computer device groups the ingress nodes of the target node that have the same topic identifier into an ingress node group and, for each ingress node group, calculates the sum of the weights of the ingress edges corresponding to all ingress nodes in the group, obtaining the total ingress-edge weight of each ingress node group.
  • The computer device uses the topic identifier of the ingress node group with the largest total ingress-edge weight as the topic identifier of the target node.
  • the computer device records the target node as a node where the topic identifier update occurs.
  • The computer device detects whether the directed network graph still contains a node to be processed that has not yet been used as the target node. If yes, the process returns to step S610; if not, step S615 is performed.
  • The computer device determines whether any node in the directed network graph has had its topic identifier updated. If yes, the nodes whose topic identifiers were updated are taken as the nodes to be processed in the directed network graph, and the process returns to step S610; if not, the documents corresponding to nodes with the same topic identifier are determined to belong to the same topic, yielding the multiple topics produced by the clustering.
  • an embodiment of the present application further provides an object clustering apparatus.
  • FIG. 7 is a schematic structural diagram of an embodiment of an object clustering apparatus according to the present application.
  • the apparatus of this embodiment may include:
  • the data acquisition unit 701 is configured to acquire an association object set associated with each of the plurality of target objects to be clustered, where the association object set includes at least one association object;
  • the directed graph construction unit 702 is configured to determine, for any two target objects, based on the degree of similarity between the associated objects in the associated object set of one of the target objects and the associated objects in the associated object set of the other target object, the weights of the directed edges between the nodes representing the two target objects in the directed network graph, and to construct the directed network graph;
  • a class initialization unit 703, configured to separately assign a unique class identifier to each node in the directed network graph
  • the clustering analysis unit 704 is configured to sequentially use each node as a target node to be processed, determine, from the at least one ingress node group corresponding to the target node, the target ingress node group with the largest total weight of directed edges pointing to the target node, and update the category identifier of the target node to the category identifier of the ingress nodes in the target ingress node group, until the category identifiers of all nodes in the directed network graph no longer change, where an ingress node group includes at least one ingress node that has a directed edge pointing to the target node and that has the same category identifier;
  • the category extracting unit 705 is configured to determine that the target objects represented by the nodes having the same category identifier belong to one cluster category, and obtain a plurality of cluster categories corresponding to the plurality of target objects.
  • the directed graph construction unit includes:
  • a first weight determining unit, configured to: for any first target object and second target object among the plurality of target objects, determine, according to the total number of associated objects associated with both the first target object and the second target object and a first number of associated objects associated with the first target object, the weight, in the directed network graph to be constructed, of the directed edge pointing from a second node representing the second target object to a first node representing the first target object;
  • a second weight determining unit, configured to determine, according to the total number and a second number of associated objects associated with the second target object, the weight of the directed edge from the first node to the second node;
  • a directed graph construction subunit, configured to construct the directed network graph.
  • the first weight determining unit is configured to determine the ratio of the total number of associated objects associated with both the first target object and the second target object to the first number of associated objects associated with the first target object as the weight of the directed edge from the second node representing the second target object to the first node representing the first target object;
  • the second weight determining unit is configured to determine the ratio of the total number to the second number of associated objects associated with the second target object as the weight of the directed edge from the first node to the second node.
  • the data acquiring unit is configured to acquire at least one data relationship corresponding to each of the plurality of target objects to be clustered, where each data relationship includes a correspondence between the target object and an associated object associated with the target object.
  • the category initialization unit includes:
  • a class initialization subunit configured to use an identifier of a target object represented by a node in the directed network graph as a category identifier of the node.
  • the cluster analysis unit includes:
  • a node initial processing unit configured to use all nodes in the directed network graph as nodes to be processed
  • a clustering analysis subunit, configured to: if an unprocessed node to be processed exists, select a target node from the unprocessed nodes to be processed, determine, from the at least one ingress node group corresponding to the target node, the target ingress node group with the largest total weight of directed edges pointing to the target node, and update the category identifier of the target node to the category identifier of the ingress nodes in the target ingress node group, until all unprocessed nodes to be processed have been processed as target nodes;
  • a loop control unit, configured to: if there is a node whose updated category identifier differs from its pre-update category identifier, determine the nodes whose updated category identifier differs from the pre-update category identifier as nodes to be processed, and trigger the operation of the clustering analysis subunit again;
  • the category extracting unit is configured to: if no node has an updated category identifier that differs from its pre-update category identifier, determine that the target objects represented by nodes having the same category identifier belong to one cluster category, and obtain a plurality of cluster categories corresponding to the plurality of target objects.
  • the target object is a user equipment
  • the associated object is a user identifier
  • the data acquiring unit is configured to acquire a user identifier set associated with each of the multiple user equipments to be clustered, where the user identifier set includes a plurality of user identifiers, and a user identifier associated with a user equipment is the identifier of a user who accesses the preset network by using that user equipment.
  • the embodiment of the present application further provides a computer device, which may include any of the object clustering devices described above.
  • the configuration of the computer device can be as shown in FIG. 1.
  • the processor executes the instructions in the program code stored in the memory to perform the following steps:
  • acquiring an associated object set for each of the plurality of target objects to be clustered, the associated object set including at least one associated object;
  • for any two target objects, determining, according to the degree of similarity between the associated objects in their associated object sets, the weights of the directed edges between the nodes that represent the two target objects in a directed network graph to be constructed, and constructing the directed network graph;
  • assigning a unique category identifier to each node in the directed network graph;
  • taking each node in turn as a target node to be processed, determining, from at least one in-degree node group corresponding to the target node, the target in-degree node group whose directed edges pointing to the target node have the largest total weight, and updating the category identifier of the target node to the category identifier of the in-degree nodes in the target in-degree node group, until the category identifiers of all nodes in the directed network graph no longer change, where an in-degree node group includes at least one in-degree node that has a directed edge pointing to the target node and shares the same category identifier;
  • determining the target objects represented by nodes having the same category identifier as belonging to one cluster category, to obtain a plurality of cluster categories corresponding to the plurality of target objects.
  • the processor performs the following steps according to the instructions in the program code:
  • for any first target object and second target object among the plurality of target objects, determining, according to the total number of associated objects associated with both the first target object and the second target object and the first number of associated objects associated with the first target object, the weight of the directed edge, in the directed network graph to be constructed, from the second node representing the second target object to the first node representing the first target object;
  • the processor performs the following steps according to the instructions in the program code:
  • determining the ratio of the total number to the second number of associated objects associated with the second target object as the weight of the directed edge from the first node to the second node.
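  • the processor performs the following steps according to the instructions in the program code:
  • acquiring at least one data relationship corresponding to each of the plurality of target objects to be clustered, where the data relationship includes the correspondence between the identifier of the target object and an associated object associated with the target object.
  • the processor performs the following steps according to the instructions in the program code:
  • using the identifier of the target object represented by a node in the directed network graph as the category identifier of the node.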
  • the processor performs the following steps according to the instructions in the program code:
  • taking all nodes in the directed network graph as nodes to be processed;
  • if there are unprocessed nodes to be processed, selecting a target node from among them, determining, from at least one in-degree node group corresponding to the target node, the target in-degree node group whose directed edges pointing to the target node have the largest total weight, and updating the category identifier of the target node to the category identifier of the in-degree nodes in the target in-degree node group, until every unprocessed node to be processed has been handled as a target node;
  • if there are nodes whose updated category identifier differs from the category identifier before the update, determining those nodes as the nodes to be processed, and returning to the operation of selecting a target node from the unprocessed nodes to be processed;
  • if there is no node whose updated category identifier differs from the category identifier before the update, determining the target objects represented by nodes having the same category identifier as belonging to one cluster category, to obtain a plurality of cluster categories corresponding to the plurality of target objects.
  • the target object is a user equipment
  • the associated object is a user identifier
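  • the processor performs the following steps according to the instructions in the program code:
  • acquiring a user identifier set associated with each of the multiple user equipments to be clustered, where a user identifier is the identifier of a user who accesses the preset network by using the user equipment.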
  • the embodiment of the present application further provides a storage medium for storing program code, and the program code is used to execute any one of the object clustering methods described in FIG.
  • the embodiment of the present application also provides a computer program product comprising instructions, which when executed on a computer, causes the computer to perform any of the object clustering methods described in FIGS.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Discrete Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

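This application provides an object clustering method and apparatus. The weight of each directed edge between the nodes that represent any two target objects in a directed network graph to be constructed is determined from the degree of similarity between the associated objects of those two target objects, and the directed network graph is then constructed. Because each edge weight is derived from this similarity, it reflects how strongly the two target objects are associated and therefore how likely they are to belong to the same cluster category. Once the graph is built, its nodes can be clustered on the basis of the directed edge weights. Strongly associated target objects are thus clustered together, which can effectively improve clustering accuracy.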

Description

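Object clustering method and apparatus
This application claims priority to Chinese Patent Application No. 201710078997.6, entitled "Object clustering method and apparatus" and filed with the Chinese Patent Office on February 14, 2017, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of data processing technologies, and in particular to object clustering.
Background
With the continuous development of Internet technologies, the amount of information on networks keeps growing. To use this information effectively, objects of the same kind contained in it often need to be clustered. For example, one network user may access the network from several different user devices: a user may log in to an instant messaging system or a forum from their own phone, a family member's phone, or another terminal. To defend against malicious access, or to provide targeted services, it may be necessary to determine which user devices are frequently used by the same user, which requires clustering the user devices.
However, when target objects of one kind are clustered at present, the clustering relies only on the identifiers of the associated objects: target objects associated with an identical associated-object identifier are clustered together. For example, if several user devices access a network system with the same user account, they are assumed to be frequently used by the same user and are placed in one cluster. Clustering by associated-object identifier alone produces a large number of categories and fails to bring strongly associated target objects together, so clustering accuracy is low.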
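Summary
In view of this, this application provides an object clustering method and apparatus that mine the degree of association between the target objects to be clustered as fully as possible and thereby improve clustering accuracy.
To achieve the above objective, in one aspect, this application provides an object clustering method, including:
obtaining an associated object set for each of a plurality of target objects to be clustered, the associated object set including at least one associated object;
for any two target objects, determining, according to the degree of similarity between the associated objects in the associated object set of one target object and the associated objects in the associated object set of the other target object, the weights of the directed edges between the nodes that represent the two target objects in a directed network graph to be constructed, and constructing the directed network graph;
assigning a unique category identifier to each node in the directed network graph;
taking each node in turn as a target node to be processed, determining, from at least one in-degree node group corresponding to the target node, the target in-degree node group whose directed edges pointing to the target node have the largest total weight, and updating the category identifier of the target node to the category identifier of the in-degree nodes in the target in-degree node group, until the category identifiers of all nodes in the directed network graph no longer change, where an in-degree node group includes at least one in-degree node that has a directed edge pointing to the target node and shares the same category identifier; and
determining the target objects represented by nodes having the same category identifier as belonging to one cluster category, to obtain a plurality of cluster categories corresponding to the plurality of target objects.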
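In another aspect, an embodiment of this application further provides an object clustering apparatus including a memory and a processor, the memory being configured to store instructions and the processor being configured to execute the instructions to perform the following steps:
obtaining an associated object set for each of a plurality of target objects to be clustered, the associated object set including at least one associated object;
for any two target objects, determining, according to the degree of similarity between the associated objects in the associated object set of one target object and the associated objects in the associated object set of the other target object, the weights of the directed edges between the nodes that represent the two target objects in a directed network graph to be constructed, and constructing the directed network graph;
assigning a unique category identifier to each node in the directed network graph;
taking each node in turn as a target node to be processed, determining, from at least one in-degree node group corresponding to the target node, the target in-degree node group whose directed edges pointing to the target node have the largest total weight, and updating the category identifier of the target node to the category identifier of the in-degree nodes in the target in-degree node group, until the category identifiers of all nodes in the directed network graph no longer change, where an in-degree node group includes at least one in-degree node that has a directed edge pointing to the target node and shares the same category identifier; and
determining the target objects represented by nodes having the same category identifier as belonging to one cluster category, to obtain a plurality of cluster categories corresponding to the plurality of target objects.
In another aspect, an embodiment of this application further provides a storage medium configured to store program code, the program code being used to perform the object clustering method of any of the foregoing aspects.
In another aspect, an embodiment of this application further provides a computer program product including instructions that, when run on a computer, cause the computer to perform the object clustering method of any of the foregoing aspects.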
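As can be seen from the above, in the embodiments of this application the weight of each directed edge between the nodes that represent any two target objects is determined from the degree of similarity between the associated objects of those two target objects, and the directed network graph is constructed accordingly. This similarity reflects how strongly the two target objects are associated, and hence how likely they are to belong to the same cluster category: the larger the weight, the greater that likelihood. After the graph is constructed, its nodes can be clustered on the basis of the directed edge weights to obtain the categories of the target objects. Compared with the conventional approach, which clusters target objects only by whether their associated-object identifiers are identical, the embodiments help mine the similarity between target objects from a global perspective and cluster strongly associated target objects together, which can effectively improve clustering accuracy.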
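Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the accompanying drawings used in the description are briefly introduced below. The drawings described below are merely embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a possible composition architecture of a computer device to which an object clustering method disclosed in an embodiment of this application is applicable;
FIG. 2 is a schematic diagram of a system architecture to which an object clustering method disclosed in an embodiment of this application is applicable;
FIG. 3 is a schematic flowchart of one embodiment of an object clustering method disclosed in this application;
FIG. 4 is a schematic flowchart of another embodiment of an object clustering method disclosed in this application;
FIG. 5a is a schematic diagram of part of a directed network graph constructed in an embodiment of this application;
FIG. 5b is a schematic diagram of the directed network graph of FIG. 5a after the community identifier of one node has been updated with the community identifier of the in-degree node on its highest-weight in-degree edge;
FIG. 6 is a schematic flowchart of yet another embodiment of an object clustering method disclosed in this application;
FIG. 7 is a schematic structural diagram of one embodiment of an object clustering apparatus disclosed in this application.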
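Detailed Description
The embodiments of this application provide an object clustering method and apparatus, which are suitable for clustering multiple user devices that log in to a social network, or for topic discovery across multiple articles.
The applicant has found through research that the conventional object clustering method, which clusters target objects only by whether their associated-object identifiers are identical, can produce wrong clustering results in some cases. Target objects associated with different associated objects may actually belong to the same category, and target objects associated with the same associated object may actually belong to different categories.
In the first case, where target objects associated with different associated objects actually belong to the same cluster category, the conventional method treats them as different cluster categories and yields a wrong clustering result.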
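For example, let the associated objects be user accounts and the target objects be mobile phones. User A logs in to an instant messaging system from phone M1 under user account U1, and also logs in from phone M2 under user account U2. To determine whether M1 and M2 are frequently used by the same user, the phones need to be clustered. Because different user accounts are used on M1 and M2, that is, the associated objects differ, the conventional method concludes that M1 and M2 are not frequently used by the same user and do not belong to the same cluster category, which is a wrong result.
The method provided in the embodiments of this application instead determines the strength of association between target objects from the weights of the directed edges between their nodes in the directed network graph. Because M1 and M2 are used by the same user logging in with different user accounts, the two phones may share many associated objects (user accounts). M1 and M2 can therefore be found to be strongly associated, determined to be frequently used by the same user, and clustered together.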
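In the second case, where target objects associated with the same associated object actually belong to different cluster categories, the conventional method clusters them together and again yields a wrong clustering result.
For example, let the associated objects be user accounts and the target objects be mobile phones. User A logs in to the instant messaging system from phone M1 under user account U1. User B, who knows user A's account U1, also logs in under U1, but from phone M2. To determine whether M1 and M2 are frequently used by the same user, the phones need to be clustered. Because account U1 is used on both M1 and M2, that is, the associated-object identifiers are identical, the prior-art method concludes that M1 and M2 are frequently used by the same user, which is a wrong result.
The method provided in the embodiments of this application instead determines the strength of association between target objects from the weights of the directed edges between their nodes in the directed network graph. Because M1 and M2 are used by different users logging in with the same account U1, the two phones may share very few associated objects (user accounts), possibly only U1. M1 and M2 can therefore be found to be weakly associated, determined not to be frequently used by the same user, and kept in separate clusters.
As these cases show, the method provided in the embodiments helps mine the similarity between target objects from a global perspective, clusters strongly associated target objects together, avoids such clustering errors, and can thus effectively improve clustering accuracy.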
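The method and apparatus of this embodiment are applicable to a single computer device as well as to a distributed computing system.
FIG. 1 shows a schematic structural diagram of a computer device to which the object clustering method and apparatus of the embodiments are applicable. In FIG. 1, the computer device may include components such as a memory 101, a processor 102, a communication module 103, a display 104, an input unit 105, and a communication bus 106. The memory 101, the processor 102, the communication module 103, the display 104, and the input unit 105 communicate with one another through the communication bus 106.
The memory 101 may be used to store software programs and modules. It may store an operating system, application programs required by at least one function (for example, an image playback function), and data created through use of the terminal. The memory 101 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
The processor 102 is the control center of the terminal. It connects all parts of the terminal through various interfaces and lines, and performs the terminal's functions and processes data by running or executing the software programs and/or modules stored in the memory 101 and invoking the data stored there, thereby monitoring the terminal as a whole. Optionally, the processor 102 may include one or more processing units.
The communication module 103 may be used to send and receive information, to receive and send signals during data processing, or to communicate with other devices over a network.
The display 104 may present a window interface and show in it the processed data, graphics, the directed network graph, and so on. It may also display information entered by the user or provided to the user, as well as the various graphical user interfaces of the computer device, which may be composed of any combination of graphics, text, and pictures. The display may include a display panel, for example one configured as a liquid crystal display or an organic light-emitting diode display. The display may further include a touch display panel capable of capturing touch events.
The input unit 105 may be used to receive characters, digits, and other information entered by the user, and to generate signal input related to user settings and function control. The input unit may include, but is not limited to, one or more of a physical keyboard, a mouse, and a joystick.
It can be understood that, in any scenario, the terminal may be any device capable of accessing the service platform, for example a mobile phone, a tablet computer, or a desktop computer.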
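To improve data processing capability, the object clustering method of the embodiments may also be applied to a distributed computing system. FIG. 2 shows a schematic structural diagram of a distributed computing system to which the object clustering method of this application is applicable.
As shown in FIG. 2, the distributed computing system may include multiple computer devices 201 connected through a network, which cooperate to complete the data processing involved in the object clustering method and apparatus of the embodiments.
On this common basis, in the object clustering method of this application, after the at least one associated object associated with each of the plurality of target objects to be clustered is obtained, the weights of the directed edges between the nodes representing any two target objects are determined from the degree of similarity between the associated objects in their respective associated object sets, and a directed network graph is constructed that contains a node for each target object and directed edges between the nodes. Each node in the graph is assigned a unique category identifier. Each node is then taken in turn as the current target node: from the at least one in-degree node group of the target node, the target in-degree node group whose directed edges pointing to the target node have the largest total weight is determined, and the category identifier of the target node is updated to the category identifier of that group. This continues until the category identifiers of all nodes in the graph no longer change. An in-degree node group includes at least one in-degree node that points to the target node and shares the same category identifier. Finally, the target objects represented by nodes having the same category identifier are determined as belonging to one cluster category, which yields the cluster categories of the target objects. The directed network graph is thus built from the associated objects of the target objects, and the target objects are clustered on the basis of that graph.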
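Based on FIG. 1 and FIG. 2, the object clustering method of the embodiments is described in detail below with reference to different embodiments.
FIG. 3 shows a schematic flowchart of one embodiment of an object clustering method of this application. The method of this embodiment may be applied to the computer device or computer system described above, and may include:
S301: The computer device obtains an associated object set for each of a plurality of target objects to be clustered, the associated object set including at least one associated object.
The target objects to be clustered can be chosen as needed, and the associated objects that must be obtained differ accordingly. For example, when the goal is to cluster together the user devices frequently used by the same user, the target objects may be user devices, and their associated objects may be user identifiers such as accounts or user names.
To determine the associated object set of a target object, the computer device may obtain information on all associated objects of each target object, or it may obtain multiple data relationships, each containing the correspondence between the identifier of one target object and an associated object of that target object. The associated objects of each target object can then be determined from the target object identifiers.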
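S302: For any two target objects, the computer device determines, according to the degree of similarity between the associated objects in the associated object set of one target object and the associated objects in the associated object set of the other target object, the weights of the directed edges between the nodes that represent the two target objects in a directed network graph to be constructed, and constructs the directed network graph.
The degree of similarity between associated objects reflects how strongly the two target objects are associated, and hence how likely they are to belong to the same cluster category. The weight of a directed edge between the nodes of any two target objects, determined from this similarity, therefore reflects that likelihood: the larger the weight, the more likely the two target objects belong to the same cluster category. The target objects can subsequently be clustered on the basis of the directed network graph.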
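For ease of description, for any two different target objects the computer device may call one of them the first target object and the other the second target object. In the embodiments of this application, two target objects are considered different whenever their identifiers differ. One implementation of S302 is then as follows. From the total number of associated objects associated with both the first target object and the second target object, and the first number of associated objects associated with the first target object, the computer device determines the first similarity, directed from the second target object to the first target object, in the directed network graph to be constructed. From the same total number and the second number of associated objects associated with the second target object, it determines the second similarity, directed from the first target object to the second target object. There are many ways to derive the similarity between two target objects from the number of associated objects they share and the number each has individually, and no limitation is imposed here.
Accordingly, constructing the directed network graph may mean constructing a graph that includes a first node representing the first target object and a second node representing the second target object, and setting the weight of the directed edge from the second node to the first node to the first similarity and the weight of the directed edge from the first node to the second node to the second similarity.
A directed network graph consists of nodes and directed edges between them. The graph constructed by the computer device in the embodiments contains one node for each of the plurality of target objects; any two target objects are used here only for ease of description. For any first node and second node in the graph, the similarity directed from the first target object (represented by the first node) to the second target object (represented by the second node) is used as the weight of the directed edge from the first node to the second node, and the similarity directed from the second target object to the first target object is used as the weight of the directed edge from the second node to the first node.
If the similarity directed from one target object to another is zero, the two target objects share no associated object. In that case the nodes representing them in the directed network graph need not be connected by a directed edge.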
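The edge-weight rule of S302 can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `build_directed_weights`, the dictionary layout, and the sample data are assumptions, and the ratio used here is just the optional implementation that the description gives later for user devices (shared count divided by the destination object's own count).

```python
from itertools import permutations


def build_directed_weights(assoc):
    """Return {(src, dst): weight} for every ordered pair that shares an associated object.

    assoc maps each target object to the set of its associated objects.
    weight(src -> dst) = |N(src) & N(dst)| / |N(dst)|  (assumed ratio form).
    Pairs with zero similarity get no edge at all.
    """
    edges = {}
    for src, dst in permutations(assoc, 2):
        shared = len(assoc[src] & assoc[dst])
        if shared:  # zero similarity: leave the two nodes unconnected
            edges[(src, dst)] = shared / len(assoc[dst])
    return edges


# Hypothetical data: A and B share two associated objects, C shares none.
assoc = {
    "A": {"u1", "u2", "u3", "u4"},
    "B": {"u1", "u2"},
    "C": {"u9"},
}
edges = build_directed_weights(assoc)
```

The two directions of a pair generally carry different weights, which is why the graph has to be directed rather than undirected.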
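Note that the implementation of step S302 described above is only one way of determining, from the degree of similarity between the associated objects of any two target objects, the weights of the directed edges between the two nodes that represent them. That similarity can be expressed in many ways, and correspondingly the edge weights can be determined in many ways; no limitation is imposed here.
S303: The computer device assigns a unique category identifier to each node in the directed network graph.
The category identifier indicates the cluster category to which a node belongs. Before clustering on the directed network graph, it is not known which target objects can be clustered together, so the computer device can treat each node as its own cluster category and assign it a unique category identifier. During the subsequent clustering, the category identifiers of some or all nodes keep changing; clustering is complete once no category identifier changes any more.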
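S304: The computer device takes each node in the directed network graph in turn as the target node to be processed and updates the category identifier of the target node to the category identifier of the in-degree nodes in the target in-degree node group, that is, the group with the largest total in-degree edge weight among the at least one in-degree node group of the target node, until the category identifiers of all nodes in the directed network graph no longer change.
For any node in the directed network graph, the other nodes that point to it are called its in-degree nodes; they can also be understood as its neighbor nodes in the graph. Every node in the graph has at least one in-degree node.
For ease of description, the directed edge from an in-degree node to the node is called an in-degree edge in the embodiments. Each in-degree node thus corresponds to one in-degree edge.
An in-degree node group includes at least one in-degree node that has a directed edge pointing to the target node and shares the same category identifier. The total in-degree edge weight of a group is the sum of the weights of the in-degree edges of all in-degree nodes in the group.
The weight of an in-degree edge represents the similarity between the corresponding in-degree node and the target node. The higher the weight, the more likely the target objects represented by the in-degree node and by the target node belong to one category. Correspondingly, if the sum of the in-degree edge weights of a group is the largest, the target node is most likely to belong to the same category as all the in-degree nodes of that group, so the category identifiers of the target node and of that group can be unified. In this embodiment they are unified to the category identifier of the target in-degree node group: the target in-degree node group with the largest total directed-edge weight is determined from the in-degree node groups of the target node, and the category identifier of the target node is updated to the category identifier of that group.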
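Selecting the target in-degree node group amounts to summing the in-degree edge weights per category identifier and taking the largest sum. The sketch below illustrates this under stated assumptions: `best_in_group_label` is a hypothetical helper, edges are stored as `{(src, dst): weight}`, and ties are broken by the smaller identifier, a choice the description does not specify.

```python
from collections import defaultdict


def best_in_group_label(target, edges, labels):
    """Category identifier of the in-degree node group with the largest total weight."""
    totals = defaultdict(float)
    for (src, dst), weight in edges.items():
        if dst == target:
            # In-degree nodes with the same identifier form one group.
            totals[labels[src]] += weight
    if not totals:
        return labels[target]  # no in-degree node: keep the current identifier
    # Largest total weight first; ties go to the smaller identifier (assumption).
    return min(totals, key=lambda lab: (-totals[lab], lab))


# Hypothetical example: X alone has the heaviest single edge into T,
# but Y and Z share identifier "c2" and together outweigh it.
labels = {"T": "T", "X": "c1", "Y": "c2", "Z": "c2"}
edges = {("X", "T"): 0.6, ("Y", "T"): 0.4, ("Z", "T"): 0.3, ("T", "X"): 0.9}
chosen = best_in_group_label("T", edges, labels)
```

The example shows why groups matter: the strongest single in-degree edge does not necessarily decide the category once several in-degree nodes already share an identifier.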
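Step S304 is an iterative loop. After each iteration the computer device checks whether any node's category identifier changed during that iteration. If so, nodes are again selected from the directed network graph as target nodes and another iteration is run.
If a node's category identifier did not change in the most recent iteration, the clustering of that node with its in-degree nodes is already complete, and its category identifier would not change in further iterations either. To avoid re-clustering such nodes and to reduce the amount of data processed, optionally only the first iteration treats all nodes in the directed network graph as nodes to be processed. While unprocessed nodes to be processed remain, the computer device selects a target node from among them and updates its category identifier to that of the target in-degree node group with the largest total in-degree edge weight, until every node to be processed has been handled as a target node. After that, if there are nodes whose updated category identifier differs from the identifier before the update, the computer device takes only those nodes as the nodes to be processed and repeats the selection-and-update operation. If there is no such node, clustering is finished and the computer device can perform the subsequent step S305.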
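The optional worklist variant of S304 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the name `propagate`, the `max_rounds` safety cap, the tie-break by smaller identifier, and the sample graph are all assumptions added here.

```python
from collections import defaultdict


def propagate(nodes, edges, max_rounds=100):
    """Update category identifiers until a full round changes nothing."""
    labels = {n: n for n in nodes}  # each node starts with a unique identifier
    incoming = defaultdict(list)
    for (src, dst), weight in edges.items():
        incoming[dst].append((src, weight))

    pending = list(nodes)  # first round: every node is a node to be processed
    for _ in range(max_rounds):  # cap is an added safeguard (assumption)
        changed = []
        for node in pending:
            totals = defaultdict(float)
            for src, weight in incoming[node]:
                totals[labels[src]] += weight
            if not totals:
                continue
            best = min(totals, key=lambda lab: (-totals[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed.append(node)
        if not changed:
            break
        pending = changed  # later rounds: only nodes whose identifier changed
    return labels


# Hypothetical graph: a<->b and c<->d are strongly tied, b<->c only weakly.
edges = {
    ("a", "b"): 0.9, ("b", "a"): 0.9,
    ("c", "d"): 0.8, ("d", "c"): 0.8,
    ("b", "c"): 0.1, ("c", "b"): 0.1,
}
labels = propagate(["a", "b", "c", "d"], edges)
```

Updates are applied in place, so a node processed later in the same round already sees the new identifiers of nodes processed before it.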
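S305: The computer device determines the target objects represented by nodes having the same category identifier as belonging to one cluster category, to obtain a plurality of cluster categories corresponding to the plurality of target objects.
In the embodiments of this application, the weight of each directed edge between the nodes that represent any two target objects is determined from the degree of similarity between the associated objects of those two target objects, and the directed network graph is constructed accordingly. This similarity reflects how strongly the two target objects are associated, and hence how likely they are to belong to the same cluster category: the larger the weight, the greater that likelihood. After the graph is constructed, its nodes can be clustered on the basis of the directed edge weights to obtain the categories of the target objects. Compared with the conventional approach, which clusters target objects only by whether their associated-object identifiers are identical, the embodiments help mine the similarity between target objects from a global perspective and cluster strongly associated target objects together, which can effectively improve clustering accuracy.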
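The object clustering method of the embodiments is described below using two example scenarios, in which the target objects to be clustered are user devices and documents, respectively.
First, the clustering of user devices. With reference to FIG. 1 and FIG. 2, FIG. 4 shows a schematic flowchart of one embodiment of an object clustering method. The method is applied to the computer device or distributed computing system mentioned above and is described using the example of clustering multiple user devices that log in to a social network, so that the devices of the same user are clustered together. A user device in the embodiments may be a terminal such as a mobile phone, a tablet computer, or a desktop computer.
As shown in FIG. 4, the method of this embodiment may include: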
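S401: The computer device obtains a data set to be analyzed from the network. The data set includes multiple data relationships, each containing the correspondence between a user identifier and a user device identifier.
In each data relationship, the user identifier identifies a user who accesses (or logs in to) a preset network system, and the user device identifier identifies the user device used for that access. For example, a data relationship can be written in the form (user U, user device Ue).
The user identifier may be the user's user name in the preset network system, the user's network account, the user's telephone number, and so on. The user device identifier uniquely identifies one user device; it may be, for example, the IP address or the device identification code of the user device.
Because the cluster analysis can cover one or more preset network systems, one or more such systems can be configured in advance. The preset network systems may be multiple social networks, for example several different instant messaging systems or forum systems.
Note that although there may be multiple preset network systems, a data relationship is a correspondence between a user identifier and a user device identifier, so once a data relationship is given, the preset network system it refers to is a single, definite one.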
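For example, if user A logs in to an instant messaging system from phone M1 as instant messaging user U1, the user name of U1 and the identifier of M1 form one data relationship. If user A also logs in from phone M2 as instant messaging user U2, the user name of U2 and the identifier of M2 form another data relationship. Here the user identifiers differ but the user devices are actually used by the same user, which corresponds to the case mentioned earlier in which target objects associated with different associated objects may actually belong to the same cluster category.
As another example, if user A logs in to the instant messaging system from phone M1 as instant messaging user U1, the user name of U1 and the identifier of M1 form one data relationship. If user B logs in from phone M2, also as instant messaging user U1, the user name of U1 and the identifier of M2 form another data relationship. Here the user identifier is the same but the user devices are actually used by different users, which corresponds to the case mentioned earlier in which target objects associated with the same associated object may actually belong to different cluster categories.
The purpose of the embodiments is to determine which user devices are frequently used by the same user so that they can be clustered together. The data set to be analyzed therefore does not contain completely identical data relationships.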
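S402: For a first user device and a second user device identified by any two different user device identifiers, the computer device obtains at least one first data relationship that contains the identifier of the first user device and at least one second data relationship that contains the identifier of the second user device.
For example, if the identifier of the first user device is Ue1, any data relationship that contains Ue1 is a first data relationship.
The numbers of first and second data relationships may differ.
Note that "first user device" and "second user device" are used only for ease of description to name the two devices of an arbitrary pair; the two devices have different user device identifiers.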
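S403: The computer device determines, from the at least one first data relationship and the at least one second data relationship, at least one data relationship pair that contains the same user identifier.
Each data relationship pair includes a first data relationship and a second data relationship that have the same user identifier.
A data relationship pair indicates that one user has logged in to the preset network system from both the first user device and the second user device.
For example, suppose the identifier of the first user device is Ue1 and that of the second user device is Ue2. If a first data relationship is (userA, Ue1) and a second data relationship is (userA, Ue2), the two form a data relationship pair with the same user identifier userA, which also shows that the user identified by userA has logged in to the preset network system from both the first user device Ue1 and the second user device Ue2.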
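Steps S401 to S403 can be illustrated with a small Python sketch that groups (user, device) data relationships by device and intersects the resulting user sets. The helper name `users_by_device` and the sample relationships are hypothetical; storing users in a set is an assumption that also collapses any accidentally repeated relationship.

```python
from collections import defaultdict


def users_by_device(relations):
    """Group (user_id, device_id) data relationships into {device_id: {user_id, ...}}."""
    by_device = defaultdict(set)
    for user_id, device_id in relations:
        by_device[device_id].add(user_id)
    return dict(by_device)


# Hypothetical data set to be analyzed.
relations = [
    ("userA", "Ue1"),
    ("userA", "Ue2"),
    ("userB", "Ue1"),
    ("userC", "Ue2"),
]
by_device = users_by_device(relations)

# Each shared user corresponds to one data relationship pair of S403.
shared_users = by_device["Ue1"] & by_device["Ue2"]
```

The size of `shared_users` is the total number of data relationship pairs, and the size of each per-device set is the number of data relationships that contain that device.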
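S404: The computer device determines the ratio of the total number of data relationship pairs to the first number, that is, the number of first data relationships, as the similarity of the second user device to the first user device.
The total number of data relationship pairs is the total number of users who have logged in to the preset network system from both the first user device and the second user device.
For ease of distinction, the total number of first data relationships is called the first number and the total number of second data relationships is called the second number. The first number is the total number of users who have logged in to the preset network system from the first user device, and the second number is the total number of users who have logged in from the second user device.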
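Optionally, the similarity $W_{Ue2Ue1}$ of the second user device Ue2 to the first user device Ue1 can be expressed as:
$$W_{Ue2Ue1} = \frac{|N(Ue1) \cap N(Ue2)|}{|N(Ue1)|}$$
Here $|N(Ue1) \cap N(Ue2)|$ is the total number of users who have logged in to the preset network system from both the first user device Ue1 and the second user device Ue2, that is, the total number of data relationship pairs, and $|N(Ue1)|$ is the total number of users who have logged in from the first user device Ue1, that is, the first number of first data relationships.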
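S405: The computer device determines the ratio of the total number of data relationship pairs to the second number, that is, the number of second data relationships, as the similarity of the first user device to the second user device.
Optionally, the similarity $W_{Ue1Ue2}$ of the first user device Ue1 to the second user device Ue2 can be expressed as:
$$W_{Ue1Ue2} = \frac{|N(Ue1) \cap N(Ue2)|}{|N(Ue2)|}$$
Here $|N(Ue1) \cap N(Ue2)|$ is the total number of data relationship pairs, and $|N(Ue2)|$ is the total number of users who have logged in to the preset network system from the second user device Ue2, that is, the second number of second data relationships.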
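A short worked example makes the asymmetry of the two similarities concrete. The user sets below are hypothetical; they are chosen so that the results coincide with the weights 2/3 and 2/5 that the description of FIG. 5a quotes for the edges between Ue1 and Ue2.

```python
from fractions import Fraction

# Hypothetical user sets: five users on Ue1, three on Ue2, two in common.
n_ue1 = {"u1", "u2", "u3", "u4", "u5"}
n_ue2 = {"u1", "u2", "u6"}

shared = len(n_ue1 & n_ue2)               # number of data relationship pairs
w_ue2_ue1 = Fraction(shared, len(n_ue1))  # weight of the edge Ue2 -> Ue1
w_ue1_ue2 = Fraction(shared, len(n_ue2))  # weight of the edge Ue1 -> Ue2
```

With these sets the edge into the device with fewer users is the heavier one: the two shared users account for two thirds of the users of Ue2 but only two fifths of the users of Ue1.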
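Note that once three quantities are known, namely the total number of users who have logged in to the preset network system from both user devices (the total number of data relationship pairs), the number of users who have logged in from the first user device (the first number), and the number of users who have logged in from the second user device (the second number), the way the computer device calculates the similarity of the second user device to the first user device and of the first user device to the second user device is not limited to that shown in steps S404 and S405. Other ways of calculating the similarity between two devices may be used in practice, and no limitation is imposed here.
Steps S401 and S402 are only one optional way of obtaining the user identifiers corresponding to the user devices. In practice, the computer device or distributed computing system may instead directly obtain the user identifiers corresponding to each of the user devices to be clustered. Each user device may correspond to one or more user identifiers, and a user identifier corresponding to a user device identifies a user who has logged in to the preset network system from that device. For example, for any first user device and second user device to be clustered, the computer device may obtain the user identifiers of all users who have logged in to the social network from the first user device and of all users who have logged in from the second user device.
Accordingly, the first number of users who have logged in to the social network from the first user device may be the number of user identifiers corresponding to the first user device, and the second number of users who have logged in from the second user device may be the number of user identifiers corresponding to the second user device.
In this embodiment, the similarity of the second user device to the first user device corresponds to the first similarity mentioned in the preceding embodiment, and the similarity of the first user device to the second user device corresponds to the second similarity.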
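S406: The computer device takes the first user device and the second user device as nodes of the directed network graph, uses the similarity of the first user device to the second user device as the weight of the edge from the first user device to the second user device, and uses the similarity of the second user device to the first user device as the weight of the edge from the second user device to the first user device.
Every edge between two nodes of the directed network graph has a direction and a weight, and the edges in the two directions between the same two nodes may have different weights. For example, for any two nodes A and B, the edge from A to B may have weight 1 and the edge from B to A may have weight 2, where weight 1 and weight 2 may differ.
In the embodiments of this application, every user device can be taken as a node to construct the directed network graph. For any two user devices, that is, a first user device and a second user device, the weight of the edge (also called directed edge) from the node of the first user device to the node of the second user device represents the similarity of the first user device to the second user device, and the weight of the directed edge from the node of the second user device to the node of the first user device represents the similarity of the second user device to the first user device.
For example, FIG. 5a shows part of a directed network graph constructed in an embodiment of this application. Each node in FIG. 5a represents the user device identified by a unique user device identifier, and the number marked above each edge in FIG. 5a is the weight of that edge. As FIG. 5a shows, nodes Ue1 and Ue2 are connected by two directed edges in opposite directions: the edge from Ue1 to Ue2 has weight 2/3, and the edge from Ue2 to Ue1 has weight 2/5.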
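S407: The computer device initializes the community identifier of each node in the directed network graph with the identifier of the user device that the node represents.
The community identifier indicates the community into which the user device represented by the node is clustered. A community can also be regarded as a cluster category.
As shown in FIG. 5a, each node of the directed network graph has a community identifier, which is the identifier of the user device corresponding to the node. In FIG. 5a the community identifier is shown next to each node, and the identifier in parentheses is the identifier of the corresponding user device.
Using the user device identifier as the community identifier of the node is only one implementation. Before clustering, the computer device cannot tell which user devices can be clustered into one community, so it is sufficient to assign each node a unique community identifier. Step S407 may therefore also assign each node any community identifier that is unique within the directed network graph.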
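S408: The computer device takes each node in the directed network graph in turn as the node to be updated, determines, among the in-degree nodes connected to the node to be updated, the target in-degree node on the in-degree edge with the largest weight, and uses the community identifier of the target in-degree node as the community identifier of the node to be updated.
The higher the weight of an in-degree edge, the more likely the user device of the corresponding in-degree node and the user device of the node to be updated are frequently used by the same user. The computer device can therefore place the node to be updated and the target in-degree node on the highest-weight in-degree edge in one category, that is, in the same community.
To mark the nodes (user devices) that belong to the same community, the community identifiers of the target in-degree node and the node to be updated need to be unified. FIG. 5b shows the graph of FIG. 5a after the community identifier of one node has been updated to that of the target in-degree node on its highest-weight in-degree edge. Comparing the node of user device Ue1 in FIG. 5a and FIG. 5b: in FIG. 5a its community identifier is Ue1, and among its in-degree nodes the in-degree edge from the node representing user device Ue2 (whose community identifier is also Ue2) has the largest weight, so the community identifier of the node of Ue1 is changed to Ue2. After the community identifier of the node of Ue1 has been updated, the community identifiers of the other nodes can be updated in turn.
The embodiments are described using the example of updating the community identifier of the node to be updated to that of its target in-degree node. Updating the community identifier of the target in-degree node to that of the node to be updated is equally applicable.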
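S409: The computer device takes all nodes in the directed network graph as nodes to be processed.
In the first round of the loop, the computer device can take all nodes in the directed network graph as nodes to be processed so that the community identifier of every node is updated in turn.
S410: The computer device selects, from the nodes to be processed in the directed network graph, one node that has not yet been processed as the target node.
The operations of steps S411 and S412 below are performed in turn for each node in the directed network graph; the node currently being processed is called the target node for ease of distinction. Because steps S411 and S412 are executed in a loop, a node to be processed that has already served as the target node in the current round is not selected again in that round.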
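S411: The computer device groups the in-degree nodes of the target node that share the same community identifier into one in-degree node group and, for each in-degree node group, sums the weights of the in-degree edges of all in-degree nodes in the group to obtain the total in-degree edge weight of the group.
After step S408, two or more in-degree nodes of a target node may share the same community identifier. In the embodiments, in-degree nodes with the same community identifier are placed in one in-degree node group. If the community identifier of an in-degree node differs from those of all other in-degree nodes, that node forms an in-degree node group on its own, and the total in-degree edge weight of the group is simply the weight of the in-degree edge from that node to the target node. An in-degree node group thus includes at least one in-degree node.
To determine with which in-degree node group the target node shares a user, that is, which group's nodes correspond to user devices frequently used by the same user as the target node's device, the computer device calculates the sum of the in-degree edge weights of all in-degree nodes in each group. This sum is called the total in-degree edge weight in the embodiments, and it reflects how similar the in-degree nodes of the group are to the target node.
S412: The computer device uses the community identifier of the in-degree node group with the largest total in-degree edge weight as the community identifier of the target node.
The in-degree nodes in one in-degree node group belong to the same community (that is, one cluster category). If the total in-degree edge weight of a group is the largest, the target node is most similar to the in-degree nodes of that group. The computer device can therefore cluster the target node and the in-degree nodes of that group into one community and update the community identifier of the target node to the community identifier of that group.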
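S413: If the community identifier of the target node has changed, the computer device records the target node as a node whose community identifier has been updated.
If the community identifier of the in-degree node group with the largest total in-degree edge weight differs from the community identifier the target node had before the update, the community identifier of the target node changes when it is set to that of the group. The target node then needs to be marked so that it can be taken as a node to be processed in the next round.
Optionally, when the target node is selected in step S410, its community identifier can be recorded as the pre-update community identifier, and the identifier assigned in step S412 can be recorded as the post-update community identifier. The two are then compared; if they differ, the community identifier of the target node has changed.
Step S413 is optional. In practice it is also possible to process all nodes to be processed in the current round as target nodes first and only then determine whether any of them had its community identifier updated, that is, to check directly in the subsequent step S415 whether there is any target node whose community identifier has been updated.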
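S414: The computer device checks whether the directed network graph still contains nodes to be processed that have not yet served as the target node. If so, the method returns to step S410; if not, step S415 is performed.
S415: The computer device determines whether the directed network graph contains nodes whose community identifier has been updated. If so, it takes those nodes as the nodes to be processed in the directed network graph and returns to step S410. If not, it determines the user devices corresponding to nodes having the same community identifier as belonging to the same community, which yields the clustered communities.
The user devices that belong to the same community can be regarded as the clustered set of user devices frequently used by the same user.
Once the community identifier of every node in the directed network graph has been updated to the community identifier of the in-degree node group most similar to that node, repeating steps S410 to S414 no longer changes any community identifier. Therefore, when the directed network graph contains no node whose community identifier has been updated, the clustering of all nodes is complete, and the nodes having the same community identifier form one community.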
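The final grouping in S415 is a plain group-by on the community identifier. The sketch below shows one way to do it; the function name `communities` and the sample identifiers are illustrative assumptions.

```python
from collections import defaultdict


def communities(labels):
    """Group node identifiers by their final community identifier."""
    groups = defaultdict(set)
    for node, community_id in labels.items():
        groups[community_id].add(node)
    return dict(groups)


# Hypothetical final state: Ue1 adopted the identifier of Ue2, Ue3 kept its own.
final_labels = {"Ue1": "Ue2", "Ue2": "Ue2", "Ue3": "Ue3"}
result = communities(final_labels)
```

Each value in the returned dictionary is one community, that is, one set of user devices regarded as frequently used by the same user.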
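Note that step S408 is optional in the embodiments of this application. It merely reflects the fact that, in a freshly constructed directed network graph, every community identifier is unique, so no two in-degree nodes of a node share the same community identifier; for ease of understanding, the community identifier of the target in-degree node on the highest-weight in-degree edge is therefore used directly as the community identifier of the node to be updated. Skipping step S408 and performing steps S409 to S415 directly is equally feasible. In that case, because every community identifier is unique in the first round of the loop, each in-degree node can be regarded as an in-degree node group of its own, so the group with the largest total in-degree edge weight is exactly the target in-degree node on the highest-weight in-degree edge.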
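The object clustering method of the embodiments is described below using the example of topic discovery across multiple documents that contain different social group identifiers.
With reference to FIG. 1 and FIG. 2, FIG. 6 shows a schematic flowchart of yet another embodiment of an object clustering method of this application. The method of this embodiment is applicable to the computer device or distributed computing system mentioned above, and may include:
S601: The computer device obtains a data set to be analyzed from the network. The data set includes multiple data relationships, each containing the correspondence between a social group identifier and a document identifier.
In each data relationship, the social group identifier identifies a preset network system, such as a social network, for example an identifier related to an instant messaging system. The document identifier is an identifier obtained from the network that uniquely identifies the document.
The purpose of this embodiment is to determine which documents are frequently used by the same social group, so that those documents can be clustered together and their topic extracted.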
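S602: For a first document and a second document identified by any two different document identifiers, the computer device obtains at least one third data relationship that contains the identifier of the first document and at least one fourth data relationship that contains the identifier of the second document.
S603: The computer device determines, from the at least one third data relationship and the at least one fourth data relationship, at least one data relationship pair that contains the same social group identifier.
Each data relationship pair includes a third data relationship and a fourth data relationship that have the same social group identifier.
A data relationship pair indicates that one social group has used both the first document and the second document.
S604: The computer device determines the ratio of the total number of data relationship pairs to the third number, that is, the number of third data relationships, as the similarity of the second document to the first document.
The total number of data relationship pairs is the total number of social groups that have used both the first document and the second document.
For ease of distinction, the total number of third data relationships is called the third number and the total number of fourth data relationships is called the fourth number. The third number is the total number of social groups that have used the first document, and the fourth number is the total number of social groups that have used the second document.
S605: The computer device determines the ratio of the total number of data relationship pairs to the fourth number, that is, the number of fourth data relationships, as the similarity of the first document to the second document.
For the specific way of calculating the similarity of the second document to the first document and of the first document to the second document, refer to the related description in the preceding embodiment; details are not repeated here.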
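S606: The computer device takes the first document and the second document as nodes of the directed network graph, uses the similarity of the first document to the second document as the weight of the edge from the first document to the second document, and uses the similarity of the second document to the first document as the weight of the edge from the second document to the first document.
S607: The computer device initializes the topic identifier of each node in the directed network graph with the identifier of the document that the node represents.
The topic identifier indicates the topic of the document represented by the node. A topic can also be regarded as a cluster category.
S608: The computer device takes each node in the directed network graph in turn as the node to be updated, determines, among the in-degree nodes connected to the node to be updated, the target in-degree node on the in-degree edge with the largest weight, and uses the topic identifier of the target in-degree node as the topic identifier of the node to be updated.
S609: The computer device takes all nodes in the directed network graph as nodes to be processed.
S610: The computer device selects, from the nodes to be processed in the directed network graph, one node that has not yet been processed as the target node.
S611: The computer device groups the in-degree nodes of the target node that share the same topic identifier into one in-degree node group and, for each in-degree node group, sums the weights of the in-degree edges of all in-degree nodes in the group to obtain the total in-degree edge weight of the group.
S612: The computer device uses the topic identifier of the in-degree node group with the largest total in-degree edge weight as the topic identifier of the target node.
S613: If the topic identifier of the target node has changed, the computer device records the target node as a node whose topic identifier has been updated.
S614: The computer device checks whether the directed network graph still contains nodes to be processed that have not yet served as the target node. If so, the method returns to step S610; if not, step S615 is performed.
S615: The computer device determines whether the directed network graph contains nodes whose topic identifier has been updated. If so, it takes those nodes as the nodes to be processed in the directed network graph and returns to step S610. If not, it determines the documents corresponding to nodes having the same topic identifier as belonging to the same topic, which yields the clustered topics.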
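The document scenario can be sketched end to end in Python. This is a hedged illustration rather than the method as filed: the function name `cluster_documents`, the `max_rounds` cap, the tie-break by smaller topic identifier, and the sample relationships are assumptions, and the optional single-edge pass of S608 is skipped, so every in-degree node starts as its own group.

```python
from collections import defaultdict
from itertools import permutations


def cluster_documents(relations, max_rounds=100):
    """Cluster documents into topics from (group_id, doc_id) data relationships."""
    groups_of = defaultdict(set)
    for group_id, doc_id in relations:
        groups_of[doc_id].add(group_id)

    # Edge src -> dst weighs shared social groups against those of dst.
    incoming = defaultdict(list)
    for src, dst in permutations(groups_of, 2):
        shared = len(groups_of[src] & groups_of[dst])
        if shared:
            incoming[dst].append((src, shared / len(groups_of[dst])))

    topic = {doc: doc for doc in groups_of}  # unique initial topic identifiers
    pending = list(groups_of)
    for _ in range(max_rounds):  # cap is an added safeguard (assumption)
        changed = []
        for doc in pending:
            totals = defaultdict(float)
            for src, weight in incoming[doc]:
                totals[topic[src]] += weight
            if totals:
                best = min(totals, key=lambda t: (-totals[t], t))
                if best != topic[doc]:
                    topic[doc] = best
                    changed.append(doc)
        if not changed:
            break
        pending = changed  # only re-examine documents whose topic changed

    topics = defaultdict(set)
    for doc, topic_id in topic.items():
        topics[topic_id].add(doc)
    return sorted(sorted(docs) for docs in topics.values())


# Hypothetical relationships: d1 and d2 share groups g1 and g2; d3 and d4 share g3.
relations = [
    ("g1", "d1"), ("g2", "d1"),
    ("g1", "d2"), ("g2", "d2"),
    ("g3", "d3"),
    ("g3", "d4"), ("g4", "d4"),
]
topics = cluster_documents(relations)
```

The same function would serve the user device scenario if the relationships were (user identifier, device identifier) pairs, since only the meaning of the two columns differs.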
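In another aspect, an embodiment of this application further provides an object clustering apparatus.
FIG. 7 shows a schematic structural diagram of one embodiment of an object clustering apparatus of this application. The apparatus of this embodiment may include:
a data acquiring unit 701, configured to obtain an associated object set for each of a plurality of target objects to be clustered, the associated object set including at least one associated object;
a directed graph construction unit 702, configured to determine, for any two target objects and according to the degree of similarity between the associated objects in the associated object set of one target object and the associated objects in the associated object set of the other target object, the weights of the directed edges between the nodes that represent the two target objects in a directed network graph to be constructed, and to construct the directed network graph;
a category initialization unit 703, configured to assign a unique category identifier to each node in the directed network graph;
a cluster analysis unit 704, configured to take each node in turn as a target node to be processed, determine, from at least one in-degree node group corresponding to the target node, the target in-degree node group whose directed edges pointing to the target node have the largest total weight, and update the category identifier of the target node to the category identifier of the in-degree nodes in the target in-degree node group, until the category identifiers of all nodes in the directed network graph no longer change, where an in-degree node group includes at least one in-degree node that has a directed edge pointing to the target node and shares the same category identifier; and
a category extracting unit 705, configured to determine the target objects represented by nodes having the same category identifier as belonging to one cluster category, to obtain a plurality of cluster categories corresponding to the plurality of target objects.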
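Optionally, the directed graph construction unit includes:
a first weight determining unit, configured to determine, for any first target object and second target object among the plurality of target objects and according to the total number of associated objects associated with both the first target object and the second target object and the first number of associated objects associated with the first target object, the weight of the directed edge, in the directed network graph to be constructed, from the second node representing the second target object to the first node representing the first target object;
a second weight determining unit, configured to determine, according to the total number and the second number of associated objects associated with the second target object, the weight of the directed edge from the first node to the second node; and
a directed graph construction subunit, configured to construct the directed network graph.
Optionally, the first weight determining unit is configured to determine the ratio of the total number of associated objects associated with both the first target object and the second target object to the first number of associated objects associated with the first target object as the weight of the directed edge from the second node representing the second target object to the first node representing the first target object;
the second weight determining unit is configured to determine the ratio of the total number to the second number of associated objects associated with the second target object as the weight of the directed edge from the first node to the second node.
Optionally, the data acquiring unit is configured to obtain at least one data relationship corresponding to each of the plurality of target objects to be clustered, the data relationship including the correspondence between the identifier of the target object and an associated object associated with the target object.
Optionally, the category initialization unit includes:
a category initialization subunit, configured to use the identifier of the target object represented by a node in the directed network graph as the category identifier of the node.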
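Optionally, the cluster analysis unit includes:
a node initial processing unit, configured to take all nodes in the directed network graph as nodes to be processed;
a cluster analysis subunit, configured to: if there are unprocessed nodes to be processed, select a target node from among them, determine, from at least one in-degree node group corresponding to the target node, the target in-degree node group whose directed edges pointing to the target node have the largest total weight, and update the category identifier of the target node to the category identifier of the in-degree nodes in the target in-degree node group, until every unprocessed node to be processed has been handled as a target node;
a loop control unit, configured to: if there are nodes whose updated category identifier differs from the category identifier before the update, determine those nodes as the nodes to be processed, and trigger the cluster analysis subunit to perform its operation again;
the category extracting unit is specifically configured to: if there is no node whose updated category identifier differs from the category identifier before the update, determine the target objects represented by nodes having the same category identifier as belonging to one cluster category, to obtain a plurality of cluster categories corresponding to the plurality of target objects.
Optionally, the target object is a user device and the associated object is a user identifier, and the data acquiring unit is configured to obtain a user identifier set associated with each of multiple user devices to be clustered, the user identifier set including multiple user identifiers, where a user identifier associated with a user device is the identifier of a user who accesses a preset network through that user device.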
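An embodiment of this application further provides a computer device, which may include any of the object clustering apparatuses described above. For the structure of the computer device, refer to FIG. 1. In the computer device of this embodiment, the processor executes the instructions in the program code stored in the memory to perform the following steps:
obtaining an associated object set for each of a plurality of target objects to be clustered, the associated object set including at least one associated object;
for any two target objects, determining, according to the degree of similarity between the associated objects in the associated object set of one target object and the associated objects in the associated object set of the other target object, the weights of the directed edges between the nodes that represent the two target objects in a directed network graph to be constructed, and constructing the directed network graph;
assigning a unique category identifier to each node in the directed network graph;
taking each node in turn as a target node to be processed, determining, from at least one in-degree node group corresponding to the target node, the target in-degree node group whose directed edges pointing to the target node have the largest total weight, and updating the category identifier of the target node to the category identifier of the in-degree nodes in the target in-degree node group, until the category identifiers of all nodes in the directed network graph no longer change, where an in-degree node group includes at least one in-degree node that has a directed edge pointing to the target node and shares the same category identifier; and
determining the target objects represented by nodes having the same category identifier as belonging to one cluster category, to obtain a plurality of cluster categories corresponding to the plurality of target objects.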
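Optionally, the processor performs the following steps according to the instructions in the program code:
for any first target object and second target object among the plurality of target objects, determining, according to the total number of associated objects associated with both the first target object and the second target object and the first number of associated objects associated with the first target object, the weight of the directed edge, in the directed network graph to be constructed, from the second node representing the second target object to the first node representing the first target object;
determining, according to the total number and the second number of associated objects associated with the second target object, the weight of the directed edge from the first node to the second node; and
constructing the directed network graph.
Optionally, the processor performs the following steps according to the instructions in the program code:
determining the ratio of the total number of associated objects associated with both the first target object and the second target object to the first number of associated objects associated with the first target object as the weight of the directed edge from the second node representing the second target object to the first node representing the first target object; and
determining the ratio of the total number to the second number of associated objects associated with the second target object as the weight of the directed edge from the first node to the second node.
Optionally, the processor performs the following steps according to the instructions in the program code:
obtaining at least one data relationship corresponding to each of the plurality of target objects to be clustered, the data relationship including the correspondence between the identifier of the target object and an associated object associated with the target object.
Optionally, the processor performs the following steps according to the instructions in the program code:
using the identifier of the target object represented by a node in the directed network graph as the category identifier of the node.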
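Optionally, the processor performs the following steps according to the instructions in the program code:
taking all nodes in the directed network graph as nodes to be processed;
if there are unprocessed nodes to be processed, selecting the current target node from among them, determining, from at least one in-degree node group corresponding to the target node, the target in-degree node group whose directed edges pointing to the target node have the largest total weight, and updating the category identifier of the target node to the category identifier of the in-degree nodes in the target in-degree node group, until every unprocessed node to be processed has been handled as a target node;
if there are nodes whose updated category identifier differs from the category identifier before the update, determining those nodes as the nodes to be processed, and returning to the operation of selecting a target node from the unprocessed nodes to be processed; and
if there is no node whose updated category identifier differs from the category identifier before the update, determining the target objects represented by nodes having the same category identifier as belonging to one cluster category, to obtain the plurality of cluster categories corresponding to the plurality of target objects.
Optionally, the target object is a user device and the associated object is a user identifier, and the processor performs the following steps according to the instructions in the program code:
obtaining a user identifier set associated with each of multiple user devices to be clustered, where a user identifier is the identifier of a user who accesses a preset network through the user device.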
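An embodiment of this application further provides a storage medium configured to store program code, the program code being used to perform any of the object clustering methods described with reference to FIG. 1 to FIG. 6.
An embodiment of this application further provides a computer program product including instructions that, when run on a computer, cause the computer to perform any of the object clustering methods described with reference to FIG. 1 to FIG. 6.
Note that the embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another. Because the apparatus embodiments are basically similar to the method embodiments, they are described briefly; for related details, refer to the description of the method embodiments.
Finally, note that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between these entities or operations. The terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element preceded by "includes a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes the element.
The above description of the disclosed embodiments enables a person skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to a person skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. The present invention is therefore not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above are merely preferred implementations of the present invention. A person of ordinary skill in the art may make improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also fall within the protection scope of the present invention.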

Claims (17)

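  1. An object clustering method, applied to a computer device, comprising:
    obtaining an associated object set for each of a plurality of target objects to be clustered, the associated object set including at least one associated object;
    for any two target objects, determining, according to the degree of similarity between the associated objects in the associated object set of one target object and the associated objects in the associated object set of the other target object, the weights of the directed edges between the nodes that represent the two target objects in a directed network graph to be constructed, and constructing the directed network graph;
    assigning a unique category identifier to each node in the directed network graph;
    taking each node in turn as a target node to be processed, determining, from at least one in-degree node group corresponding to the target node, the target in-degree node group whose directed edges pointing to the target node have the largest total weight, and updating the category identifier of the target node to the category identifier of the in-degree nodes in the target in-degree node group, until the category identifiers of all nodes in the directed network graph no longer change, wherein an in-degree node group includes at least one in-degree node that has a directed edge pointing to the target node and shares the same category identifier; and
    determining the target objects represented by nodes having the same category identifier as belonging to one cluster category, to obtain a plurality of cluster categories corresponding to the plurality of target objects.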
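  2. The object clustering method according to claim 1, wherein the determining, for any two target objects and according to the degree of similarity between the associated objects in their associated object sets, of the weights of the directed edges between the nodes that represent the two target objects in the directed network graph to be constructed comprises:
    for any first target object and second target object among the plurality of target objects, determining, according to the total number of associated objects associated with both the first target object and the second target object and the first number of associated objects associated with the first target object, the weight of the directed edge, in the directed network graph to be constructed, from the second node representing the second target object to the first node representing the first target object; and
    determining, according to the total number and the second number of associated objects associated with the second target object, the weight of the directed edge from the first node to the second node.
  3. The object clustering method according to claim 2, wherein the determining of the weight of the directed edge from the second node representing the second target object to the first node representing the first target object according to the total number and the first number comprises:
    determining the ratio of the total number of associated objects associated with both the first target object and the second target object to the first number of associated objects associated with the first target object as the weight of the directed edge from the second node representing the second target object to the first node representing the first target object;
    and wherein the determining of the weight of the directed edge from the first node to the second node according to the total number and the second number of associated objects associated with the second target object comprises:
    determining the ratio of the total number to the second number of associated objects associated with the second target object as the weight of the directed edge from the first node to the second node.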
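  4. The object clustering method according to claim 1, wherein the obtaining of the associated object set for each of the plurality of target objects to be clustered comprises:
    obtaining at least one data relationship corresponding to each of the plurality of target objects to be clustered, the data relationship including the correspondence between the identifier of the target object and an associated object associated with the target object.
  5. The object clustering method according to claim 1 or 4, wherein the assigning of a unique category identifier to each node in the directed network graph comprises:
    using the identifier of the target object represented by a node in the directed network graph as the category identifier of the node.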
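  6. The object clustering method according to claim 1, wherein the taking of each node in turn as a target node to be processed, the determining of the target in-degree node group, and the updating of the category identifier of the target node until the category identifiers of all nodes in the directed network graph no longer change comprise:
    taking all nodes in the directed network graph as nodes to be processed;
    if there are unprocessed nodes to be processed, selecting a target node from among them, determining, from at least one in-degree node group corresponding to the target node, the target in-degree node group whose directed edges pointing to the target node have the largest total weight, and updating the category identifier of the target node to the category identifier of the in-degree nodes in the target in-degree node group, until every unprocessed node to be processed has been handled as a target node;
    if there are nodes whose updated category identifier differs from the category identifier before the update, determining those nodes as the nodes to be processed, and returning to the operation of selecting a target node from the unprocessed nodes to be processed if there are unprocessed nodes to be processed; and
    if there is no node whose updated category identifier differs from the category identifier before the update, performing the operation of determining the target objects represented by nodes having the same category identifier as belonging to one cluster category.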
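  7. The object clustering method according to claim 1, wherein the target object is a user device and the associated object is a user identifier, and the obtaining of the associated object set for each of the plurality of target objects to be clustered comprises:
    obtaining a user identifier set associated with each of multiple user devices to be clustered, wherein a user identifier is the identifier of a user who accesses a preset network through the user device.
  8. The object clustering method according to claim 1, wherein the target object is a document and the associated object is a social group identifier, and the obtaining of the associated object set for each of the plurality of target objects to be clustered comprises:
    obtaining a social group identifier set associated with each of multiple documents to be clustered;
    and wherein the determining of the target objects represented by nodes having the same category identifier as belonging to one cluster category is specifically:
    determining the documents represented by nodes having the same category identifier as belonging to the same topic.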
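  9. An object clustering apparatus, comprising a memory and a processor, the memory being configured to store instructions and the processor being configured to execute the instructions to perform the following steps:
    obtaining an associated object set for each of a plurality of target objects to be clustered, the associated object set including at least one associated object;
    for any two target objects, determining, according to the degree of similarity between the associated objects in the associated object set of one target object and the associated objects in the associated object set of the other target object, the weights of the directed edges between the nodes that represent the two target objects in a directed network graph to be constructed, and constructing the directed network graph;
    assigning a unique category identifier to each node in the directed network graph;
    taking each node in turn as a target node to be processed, determining, from at least one in-degree node group corresponding to the target node, the target in-degree node group whose directed edges pointing to the target node have the largest total weight, and updating the category identifier of the target node to the category identifier of the in-degree nodes in the target in-degree node group, until the category identifiers of all nodes in the directed network graph no longer change, wherein an in-degree node group includes at least one in-degree node that has a directed edge pointing to the target node and shares the same category identifier; and
    determining the target objects represented by nodes having the same category identifier as belonging to one cluster category, to obtain a plurality of cluster categories corresponding to the plurality of target objects.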
  10. 根据权利要求9所述的对象聚类装置,所述处理器用于执行所述指令,以执行下述步骤,包括:
    对于所述多个目标对象中任意的第一目标对象以及第二目标对象,根据与所述第一目标对象和所述第二目标对象都关联的关联对象的总数量,以及所述第一目标对象关联的关联对象的第一数量,确定出待构建的有向网络图中,从表示所述第二目标对象的第二节点指向表示所述第一目标对象的第一节点的有向边的权重;
    根据所述总数量,以及所述第二目标对象关联的关联对象的第二数量,确定出从所述第一节点指向所述第二节点的有向边的权重;
    构建所述有向网络图。
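  11. The object clustering apparatus according to claim 10, wherein the processor is configured to execute the instructions to perform the following steps:
    using the ratio of the total number of associated objects that are associated with both the first target object and the second target object to the first number of associated objects associated with the first target object as the weight of the directed edge pointing from the second node representing the second target object to the first node representing the first target object;
    using the ratio of the total number to the second number of associated objects associated with the second target object as the weight of the directed edge pointing from the first node to the second node.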
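The two ratios in this claim can be written compactly. The symbols below are notation assumed for this note and do not appear in the application: $A_1$ and $A_2$ are the associated object sets of the first and second target objects, and $v_1$ and $v_2$ are the nodes that represent them.

```latex
w(v_2 \to v_1) = \frac{|A_1 \cap A_2|}{|A_1|},
\qquad
w(v_1 \to v_2) = \frac{|A_1 \cap A_2|}{|A_2|}
```

For instance, with $|A_1| = 5$, $|A_2| = 2$, and $|A_1 \cap A_2| = 1$, the weights are $w(v_2 \to v_1) = 1/5 = 0.2$ and $w(v_1 \to v_2) = 1/2 = 0.5$, so the edge pointing into the node with the smaller set carries the larger weight.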
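  12. The object clustering apparatus according to claim 9, wherein the processor is configured to execute the instructions to perform the following steps:
    obtaining at least one data relation corresponding to each of the plurality of target objects to be clustered, the data relation comprising a correspondence between an identifier of the target object and an associated object associated with the target object.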
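  13. The object clustering apparatus according to claim 9 or 12, wherein the processor is configured to execute the instructions to perform the following steps:
    using the identifier of the target object represented by a node in the directed network graph as the category identifier of that node.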
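  14. The object clustering apparatus according to claim 9, wherein the processor is configured to execute the instructions to perform the following steps:
    taking all nodes in the directed network graph as nodes to be processed;
    if there are unprocessed nodes to be processed, selecting the target node currently to be processed from the unprocessed nodes to be processed, determining, from at least one in-degree node group corresponding to the target node, the target in-degree node group whose directed edges pointing to the target node have the largest total weight, and updating the category identifier of the target node to the category identifier of the in-degree nodes in the target in-degree node group, until every unprocessed node to be processed has been processed as a target node;
    if there is a node whose updated category identifier differs from its category identifier before the update, determining the nodes whose category identifiers differ before and after the update as nodes to be processed, and triggering a return to the operation of the cluster analysis subunit;
    if there is no node whose updated category identifier differs from its category identifier before the update, determining the target objects represented by nodes having the same category identifier as belonging to one cluster category, to obtain a plurality of cluster categories corresponding to the plurality of target objects.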
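  15. The object clustering apparatus according to claim 9, wherein the target objects are user devices and the associated objects are user identifiers, and the processor is configured to execute the instructions to perform the following steps:
    obtaining a user identifier set associated with each of a plurality of user devices to be clustered, wherein a user identifier is the identifier of a user who accesses a preset network through the user device.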
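  16. A storage medium, configured to store program code, the program code being used to perform the object clustering method according to any one of claims 1 to 8.
  17. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the object clustering method according to any one of claims 1 to 8.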
PCT/CN2018/074552 2017-02-14 2018-01-30 Object clustering method and apparatus WO2018149292A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/428,958 US10936669B2 (en) 2017-02-14 2019-05-31 Object clustering method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710078997.6 2017-02-14
CN201710078997.6A CN108427956B (zh) 2017-02-14 2017-02-14 Object clustering method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/428,958 Continuation US10936669B2 (en) 2017-02-14 2019-05-31 Object clustering method and system

Publications (1)

Publication Number Publication Date
WO2018149292A1 (zh) 2018-08-23

Family

ID=63155045

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/074552 WO2018149292A1 (zh) 2017-02-14 2018-01-30 Object clustering method and apparatus

Country Status (3)

Country Link
US (1) US10936669B2 (zh)
CN (1) CN108427956B (zh)
WO (1) WO2018149292A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259137A (zh) * 2020-01-17 2020-06-09 平安科技(深圳)有限公司 Method and system for generating a knowledge graph summary
CN114168805A (zh) * 2022-02-08 2022-03-11 腾讯科技(深圳)有限公司 Data processing method, apparatus, device, and medium
CN114443783A (zh) * 2022-04-11 2022-05-06 浙江大学 Supply chain data analysis and enhancement processing method and apparatus

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10642867B2 (en) * 2017-09-15 2020-05-05 Adobe Inc. Clustering based on a directed graph
EP3623960A1 (en) * 2018-09-14 2020-03-18 Adarga Limited Method and system for retrieving and displaying data from an entity network database
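CN109450920A (zh) * 2018-11-29 2019-03-08 北京奇艺世纪科技有限公司 Abnormal account detection method and apparatus
CN109800276A (zh) * 2018-12-14 2019-05-24 深圳壹账通智能科技有限公司 Association strength evaluation method, apparatus, device, and storage medium
CN111353904B (zh) * 2018-12-21 2022-12-20 腾讯科技(深圳)有限公司 Method and device for determining the social level of a node in a social network
CN110059227B (zh) * 2019-01-22 2023-08-04 创新先进技术有限公司 Method and apparatus for determining a network structure among multiple samples
CN112306468A (zh) * 2019-08-02 2021-02-02 伊姆西Ip控股有限责任公司 Method, device, and computer program product for processing a machine learning model
CN111125546A (zh) * 2019-12-25 2020-05-08 深圳前海微众银行股份有限公司 Data processing method, apparatus, device, and computer-readable storage medium
CN111143627B (zh) * 2019-12-27 2023-08-15 北京百度网讯科技有限公司 User identity data determination method, apparatus, device, and medium
CN112671593B (zh) * 2021-01-18 2023-04-07 中国民航信息网络股份有限公司 Server management method and related device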
US11860977B1 (en) * 2021-05-04 2024-01-02 Amazon Technologies, Inc. Hierarchical graph neural networks for visual clustering
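CN113326880A (zh) * 2021-05-31 2021-08-31 南京信息工程大学 Unsupervised image classification method based on community division
CN113965772B (zh) * 2021-10-29 2024-05-10 北京百度网讯科技有限公司 Live video processing method and apparatus, electronic device, and storage medium
CN116882408B (zh) * 2023-09-07 2024-02-27 南方电网数字电网研究院有限公司 Transformer graph model construction method and apparatus, computer device, and storage medium
CN118071740A (zh) * 2024-04-18 2024-05-24 山东中泰药业有限公司 Vision-assisted quality inspection method for pharmaceutical aluminum foil packaging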

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063458A (zh) * 2010-10-12 2011-05-18 百度在线网络技术(北京)有限公司 Method and device for clustering users in a network device of a computer network
US20120284270A1 (en) * 2011-05-04 2012-11-08 Nhn Corporation Method and device to detect similar documents
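CN103020163A (zh) * 2012-11-26 2013-04-03 南京大学 Network community division method based on node similarity in a network
CN103136303A (zh) * 2011-11-24 2013-06-05 北京千橡网景科技发展有限公司 Method and device for dividing user groups on a social network service website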

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7840407B2 (en) * 2006-10-13 2010-11-23 Google Inc. Business listing search
CN101267452B (zh) * 2008-02-27 2011-02-16 华为技术有限公司 Web service composition scheme conversion method and application server
US8620905B2 (en) * 2012-03-22 2013-12-31 Corbis Corporation Proximity-based method for determining concept relevance within a domain ontology
CN103914493A (zh) * 2013-01-09 2014-07-09 北大方正集团有限公司 Method and system for discovering and analyzing microblog user group structure
US9439053B2 (en) * 2013-01-30 2016-09-06 Microsoft Technology Licensing, Llc Identifying subgraphs in transformed social network graphs
US9836522B2 (en) * 2015-03-17 2017-12-05 Sap Se Framework for ordered clustering
US9558265B1 (en) * 2016-05-12 2017-01-31 Quid, Inc. Facilitating targeted analysis via graph generation based on an influencing parameter
US20180103111A1 (en) * 2016-10-07 2018-04-12 International Business Machines Corporation Determination of well-knit groups in organizational settings

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063458A (zh) * 2010-10-12 2011-05-18 百度在线网络技术(北京)有限公司 Method and device for clustering users in a network device of a computer network
US20120284270A1 (en) * 2011-05-04 2012-11-08 Nhn Corporation Method and device to detect similar documents
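CN103136303A (zh) * 2011-11-24 2013-06-05 北京千橡网景科技发展有限公司 Method and device for dividing user groups on a social network service website
CN103020163A (zh) * 2012-11-26 2013-04-03 南京大学 Network community division method based on node similarity in a network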

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259137A (zh) * 2020-01-17 2020-06-09 平安科技(深圳)有限公司 Method and system for generating a knowledge graph summary
CN111259137B (zh) * 2020-01-17 2023-04-07 平安科技(深圳)有限公司 Method and system for generating a knowledge graph summary
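CN114168805A (zh) * 2022-02-08 2022-03-11 腾讯科技(深圳)有限公司 Data processing method, apparatus, device, and medium
CN114168805B (zh) * 2022-02-08 2022-05-20 腾讯科技(深圳)有限公司 Data processing method, apparatus, device, and medium
CN114443783A (zh) * 2022-04-11 2022-05-06 浙江大学 Supply chain data analysis and enhancement processing method and apparatus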

Also Published As

Publication number Publication date
US10936669B2 (en) 2021-03-02
CN108427956B (zh) 2019-08-06
CN108427956A (zh) 2018-08-21
US20190286657A1 (en) 2019-09-19

Similar Documents

Publication Publication Date Title
WO2018149292A1 (zh) Object clustering method and apparatus
US11003755B2 (en) Authentication using emoji-based passwords
US20200311342A1 (en) Populating values in a spreadsheet using semantic cues
US10878218B2 (en) Device fingerprinting, tracking, and management
TWI647583B (zh) Prompt method and prompting device for login account
US20150234927A1 (en) Application search method, apparatus, and terminal
WO2015074496A1 (en) Identity authentication method and device and storage medium
JP6608972B2 (ja) ソーシャルネットワークに基づいてグループを探索する方法、デバイス、サーバ及び記憶媒体
US20140214963A1 (en) Method, server and system for data sharing in social networking service
US20170139913A1 (en) Method and system for data assignment in a distributed system
KR102110642B1 (ko) 패스워드 보호 질문 설정 방법 및 디바이스
WO2018205999A1 (zh) Data processing method and apparatus
US20200356660A1 (en) Managing passwords
US9305226B1 (en) Semantic boosting rules for improving text recognition
US11216482B2 (en) Systems and methods for access to multi-tenant heterogeneous databases
US9552556B2 (en) Site flow optimization
US20190005045A1 (en) Efficient internet protocol prefix match support on no-sql and/or non-relational databases
US20180217992A1 (en) Domain based influence scoring
US9195716B2 (en) Techniques for ranking character searches
US20170061361A1 (en) On-line fellowship enhancement system for off-line company organization
CN110580200B (zh) Data synchronization method and apparatus
US20140089438A1 (en) Method and device for processing information
US10313438B1 (en) Partitioned key-value store with one-sided communications for secondary global key lookup by range-knowledgeable clients
US20240202210A1 (en) Discovery of source range partitioning information in data extract job
CN115525554B (zh) Automated model testing method, system, and storage medium

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 18754371; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 18754371; Country of ref document: EP; Kind code of ref document: A1)