CN113761305A

CN113761305A - Method and device for generating label hierarchical structure

Info

Publication number: CN113761305A
Application number: CN202010494685.5A
Authority: CN
Inventors: 陈希
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2020-06-03
Filing date: 2020-06-03
Publication date: 2021-12-07
Anticipated expiration: 2040-06-03
Also published as: CN113761305B

Abstract

The invention discloses a method and a device for generating a label hierarchical structure, and relates to the technical field of computers. One embodiment of the method comprises: screening out label pairs with association relation according to the occurrence frequency of each label in each file object; generating a label relation graph according to each label pair; wherein, the nodes in the relational graph are labels, and the weight of the edges is the co-occurrence times of the two labels in the same file object; and clustering each node in the label relation graph and calculating the membership degree of adjacent nodes so as to generate a label hierarchical structure. The embodiment can solve the technical problem that the position of the label in the label hierarchy is unique.

Description

Method and device for generating label hierarchical structure

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for generating a label hierarchical structure.

Background

In the content field of the internet, a plurality of websites endow users with a function of freely marking interested objects (such as articles, videos, pictures and the like), and labels marked by the users are called social labels which are gathered into a system called a popular classification (Folksonomy).

Although the number of the labels is rich, the coverage content of the same label is less, the labels are scattered and tiled, and the application value density is lower. In order to overcome the problem of lack of organization of the social tags, the internal relationships need to be found from the tags and a tag hierarchy structure needs to be constructed, so that the tags are applied in business scenes such as search recommendation, advertisement delivery and the like.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:

the position of each label in the generated label hierarchy is unique, which cannot completely meet the actual requirement; if tags can appear in different locations of the same hierarchy, their respective weight ratios cannot be measured.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for generating a tag hierarchy structure, so as to solve the technical problem that a location of a tag in the tag hierarchy structure is unique.

To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method of generating a tag hierarchy, including:

screening out label pairs with association relation according to the occurrence frequency of each label in each file object;

generating a label relation graph according to each label pair; wherein, the nodes in the relational graph are labels, and the weight of the edges is the co-occurrence times of the two labels in the same file object;

and clustering each node in the label relation graph and calculating the membership degree of adjacent nodes so as to generate a label hierarchical structure.

Optionally, the clustering the nodes in the label relationship graph and calculating the membership of adjacent nodes, so as to generate a label hierarchy, includes:

calculating the average centrality of each node in the label relation graph;

screening out at least one secondary root node according to the average centrality of each node and the incidence relation between the nodes;

respectively calculating the membership degree of each secondary root node and each adjacent node so as to determine a candidate node set corresponding to each secondary root node, wherein each node in the candidate node set has a membership relationship with the secondary root node;

the above steps are repeatedly performed, thereby generating a label hierarchy.

Optionally, the calculating an average centrality of each node in the label relationship graph includes:

for each node, respectively calculating the degree centrality, the betweenness centrality, the closeness centrality and the PageRank value of the node;

respectively normalizing the degree centrality, the betweenness centrality, the closeness centrality and the PageRank value;

and calculating the arithmetic mean of the normalized degree centrality, betweenness centrality, closeness centrality and PageRank value, so as to obtain the average centrality of the node.

Optionally, the screening out at least one secondary root node according to the average centrality of each node and the association relationship between each node includes:

sorting the nodes in descending order of average centrality and screening out the top N nodes; wherein N is an integer greater than zero;

for the N nodes, dividing the nodes that have an association relationship into one group, thereby obtaining at least one node group;

and for each node group, taking the node with the maximum average centrality in the node group as a secondary root node.

Optionally, the membership degree of the secondary root node and any one adjacent node is calculated by the following method:

the ratio of the weight of the edge between the adjacent node and the secondary root node to the sum of the weights of all the edges of the adjacent node.

Optionally, the determining a candidate node set corresponding to each secondary root node includes:

and adding the adjacent nodes with the membership degree larger than or equal to a membership degree threshold value into the candidate node set corresponding to the secondary root node so that each adjacent node is at least subordinate to one secondary root node.

Optionally, the screening out the tag pairs having an association relationship according to the occurrence frequency of each tag in each file object includes:

respectively calculating the co-occurrence times of any two tags in the same file object according to the occurrence times of the tags in the file objects;

and for any two tags, judging whether an association relationship exists between the two tags according to the co-occurrence frequency of the two tags in the same file object, the total number of the file objects and the number of the file objects with one tag, thereby screening out the tag pairs with the association relationship.

Optionally, the determining whether an association relationship exists between the two tags according to the number of co-occurrences of the two tags in the same file object, the total number of file objects, and the number of file objects in which one tag appears includes:

dividing the co-occurrence times of the two labels in the same file object by the total number of the file objects to obtain the support degree;

dividing the co-occurrence frequency of the two labels in the same file object by the number of the file objects with one label to obtain a confidence coefficient;

and if the support degree is greater than or equal to a support degree threshold value and the confidence degree is greater than or equal to a confidence degree threshold value, judging that an association relationship exists between the two labels.

Optionally, after the generating the tag hierarchy, further comprising:

and matching corresponding labels for each file object according to the label hierarchical structure.

In addition, according to another aspect of the embodiments of the present invention, there is provided an apparatus for generating a tag hierarchy, including:

the screening module is used for screening out the label pairs with the association relationship according to the occurrence frequency of each label in each file object;

the association module is used for generating a label relation graph according to each label pair; wherein, the nodes in the relational graph are labels, and the weight of the edges is the co-occurrence times of the two labels in the same file object;

and the generating module is used for clustering all nodes in the label relation graph and calculating the membership degree of adjacent nodes so as to generate a label hierarchical structure.

Optionally, the generating module is further configured to:

calculating the average centrality of each node in the label relation graph;

the above steps are repeatedly performed, thereby generating a label hierarchy.

Optionally, the generating module is further configured to:

Optionally, the generating module is further configured to: calculating the membership degree of the secondary root node and any adjacent node by adopting the following method:

Optionally, the generating module is further configured to:

Optionally, the screening module is further configured to:

Optionally, the apparatus further comprises a matching module, configured to:

and after the label hierarchical structure is generated, matching corresponding labels for all the file objects according to the label hierarchical structure.

According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including:

one or more processors;

a storage device for storing one or more programs,

when the one or more programs are executed by the one or more processors, the one or more processors implement the method of any of the embodiments described above.

According to another aspect of the embodiments of the present invention, there is also provided a computer readable medium, on which a computer program is stored, which when executed by a processor implements the method of any of the above embodiments.

One embodiment of the above invention has the following advantages or benefits: because the technical means of generating the label relation graph according to each label pair with the incidence relation, clustering each node in the label relation graph and calculating the membership degree of the adjacent node so as to generate the label hierarchical structure is adopted, the technical problem that the position of the label in the label hierarchical structure is unique in the prior art is solved. The embodiment of the invention solves the problem that the labels have ambiguity by a fuzzy clustering method, so that the labels can appear at different positions, and the probability value (namely membership) of each label appearing at different positions is calculated; and the recursive clustering is flexibly controlled through the membership degree, and the label hierarchy structure can be automatically constructed, so that the labor cost can be saved.

Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of a tag hierarchy in the prior art;

FIG. 2 is a schematic diagram of the main flow of a method of generating a hierarchy of tags, according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a main flow of a method of generating a hierarchical structure of tags according to one referential embodiment of the present invention;

FIG. 4 is a schematic diagram of generating a hierarchy of tags, according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of the major modules of an apparatus for generating a hierarchy of tags, according to an embodiment of the present invention;

FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

FIG. 7 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

A typical label hierarchy is shown in fig. 1, and at present, there are three main ways for constructing a label hierarchy, which are manual, semi-manual, and automatic. The manual construction has the highest quality, but requires a great deal of manual effort and is influenced by subjective factors, and the construction cost is also the highest. The semi-manual mode uses a learning system to assist manual construction, still needs manual work to participate in a large amount of work, and cannot be expanded on a large scale. The automatic construction of the label system is the current mainstream research trend, and the construction process is generally divided into two steps, namely, discovering the relationship among the labels based on the label semantics, and constructing the hierarchy system by utilizing the relationship among the labels. As can be seen, the position of each label in the generated label hierarchy is unique and its corresponding weight fraction cannot be measured even though labels may appear at different positions of the same hierarchy.

In order to solve the above technical problems in the prior art, embodiments of the present invention provide a method for generating a label hierarchy, which not only enables labels to appear at different positions, but also calculates membership (i.e., weight fraction) of each label appearing at different positions.

Fig. 2 is a schematic diagram of a main flow of a method of generating a tag hierarchy structure according to an embodiment of the present invention. As an embodiment of the present invention, as shown in fig. 2, the method for generating a label hierarchy may include:

step 101, screening out the label pairs with the association relation according to the occurrence frequency of each label in each file object.

In the embodiment of the present invention, the file objects may be texts, pictures, videos, etc., and the author and the user may add tags to each file object, so the tags to be structured may be social tags or tags designed by the author, but no explicit relationship has been established yet.

Prior to step 101, the file objects need to be associated with tags (i.e., tagged). For example, a text describing makeup may be labeled "lipstick", "Tom Ford", "Queen". To improve the confidence of the tags, each tag needs to appear a sufficient number of times in the texts; tags that occur extremely rarely can be merged into existing tags as synonyms.

Optionally, step 101 may comprise: respectively calculating the co-occurrence times of any two tags in the same file object according to the occurrence times of the tags in the file objects; and for any two tags, judging whether an association relationship exists between the two tags according to the co-occurrence frequency of the two tags in the same file object, the total number of the file objects and the number of the file objects with one tag, thereby screening out the tag pairs with the association relationship. In the embodiment of the invention, whether the incidence relation exists between any two labels is judged according to the co-occurrence frequency of the two labels in the same file object, the total number of the file objects and the number of the file objects with one label, if so, the two labels form a label pair, so that the incidence relation between the labels is dug out, and the redundant relation is filtered. It should be noted that one tag and a plurality of other tags may respectively form a corresponding tag pair.

Assume that L = {l1, l2, …, ln} is the set of tags and the text library is A = {a1, a2, …, an}, where each text has a unique ID and is labeled with several tags. The number of times every two tags appear together across all texts can then be calculated and used as the basis for judging the strength of the connection between tags. Optionally, the Apriori algorithm may be employed to mine frequent item sets and generate association rules. Mining frequent item sets means finding the tags whose support is greater than a specified minimum threshold; generating association rules means that, on the basis of meeting the minimum support, the minimum confidence is also met, that is, the probability that tag l2 occurs given that tag l1 occurs is greater than a specified threshold, which yields the association rule l1 → l2.

Optionally, determining whether an association relationship exists between the two tags according to the number of co-occurrences of the two tags in the same file object, the total number of file objects, and the number of file objects in which one tag appears includes: dividing the co-occurrence times of the two labels in the same file object by the total number of the file objects to obtain the support degree; dividing the co-occurrence frequency of the two labels in the same file object by the number of the file objects with one label to obtain a confidence coefficient; and if the support degree is greater than or equal to a support degree threshold value and the confidence degree is greater than or equal to a confidence degree threshold value, judging that an association relationship exists between the two labels.

Examples are as follows:

ID	iphone	apple	huawei	android
1	1	1	1	0
2	1	1	0	0
3	1	0	0	0
4	1	0	1	0
5	0	1	1	1
6	1	1	0	0

The table above shows the relationship between texts and labels for 6 texts. The item set is I = {iphone, apple, huawei, android}. Consider the association rule iphone → apple, where X is the set of texts containing iphone and Y is the set of texts containing apple. Texts 1, 2, 3, 4 and 6 contain iphone, and texts 1, 2 and 6 contain both iphone and apple, so |X ∩ Y| = 3 and |A| = 6, giving support = |X ∩ Y| / |A| = 3/6 = 0.5; |X| = 5, giving confidence = |X ∩ Y| / |X| = 3/5 = 0.6. If the minimum support α = 0.5 and the minimum confidence β = 0.6 are given, the iphone label and the apple label are considered to have a strong association relationship, and the two labels form a label pair.
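The support and confidence calculation in this example can be sketched in Python as follows. This is only an illustrative sketch: the names DOCS, rule_stats and associated_pairs are hypothetical, and the choice to accept an unordered pair when either rule direction meets both thresholds is an assumption, since the text only works through the direction iphone → apple.

```python
from itertools import combinations

# Tag sets per text, transcribed from the table above (text ID -> tags).
DOCS = {
    1: {"iphone", "apple", "huawei"},
    2: {"iphone", "apple"},
    3: {"iphone"},
    4: {"iphone", "huawei"},
    5: {"apple", "huawei", "android"},
    6: {"iphone", "apple"},
}

def rule_stats(docs, x, y):
    """Return (support, confidence) for the rule x -> y."""
    both = sum(1 for tags in docs.values() if x in tags and y in tags)
    with_x = sum(1 for tags in docs.values() if x in tags)
    support = both / len(docs)                      # co-occurrences / total texts
    confidence = both / with_x if with_x else 0.0   # co-occurrences / texts with x
    return support, confidence

def associated_pairs(docs, min_support, min_confidence):
    """Screen unordered tag pairs.

    Assumption: a pair is kept if either rule direction meets both thresholds.
    """
    all_tags = sorted(set().union(*docs.values()))
    pairs = []
    for x, y in combinations(all_tags, 2):
        for a, b in ((x, y), (y, x)):
            support, confidence = rule_stats(docs, a, b)
            if support >= min_support and confidence >= min_confidence:
                pairs.append((x, y))
                break
    return pairs

print(rule_stats(DOCS, "iphone", "apple"))   # support and confidence of iphone -> apple
print(associated_pairs(DOCS, 0.5, 0.6))      # pairs passing alpha = 0.5, beta = 0.6
```

With the thresholds from the example, only the iphone and apple labels form a label pair, because every other pair co-occurs in at most 2 of the 6 texts.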

Step 102, generating a label relation graph according to each label pair; wherein the nodes in the label relation graph are labels, and the weight of an edge is the number of co-occurrences of the two labels in the same file object.

A label relation graph (an undirected graph) G = (V, E) is generated according to the label pairs screened out in step 101, where V is the set of nodes (labels) in the graph and E is the set of edges. The weight of an edge is the number of co-occurrences of the labels of its two nodes. For example, if the iphone tag and the apple tag co-occur 3 times, the weight of the edge connecting the two nodes is 3.
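As a minimal sketch of this step, the weighted undirected graph can be held in a nested dictionary. The function name build_label_graph and the dictionary layout are illustrative assumptions, not something prescribed by the embodiment.

```python
from collections import defaultdict

def build_label_graph(docs, tag_pairs):
    """Build an undirected weighted graph: graph[u][v] = co-occurrence count."""
    graph = defaultdict(dict)
    for u, v in tag_pairs:
        # Edge weight: number of file objects in which both labels appear.
        weight = sum(1 for tags in docs.values() if u in tags and v in tags)
        graph[u][v] = weight
        graph[v][u] = weight  # undirected, so store both directions
    return dict(graph)

# Hypothetical input: tag sets per text and the label pairs kept by step 101.
docs = {
    1: {"iphone", "apple", "huawei"},
    2: {"iphone", "apple"},
    3: {"iphone"},
    4: {"iphone", "huawei"},
    5: {"apple", "huawei", "android"},
    6: {"iphone", "apple"},
}
graph = build_label_graph(docs, [("iphone", "apple")])
print(graph)
```

Storing each edge in both directions keeps neighbor lookups simple in the later centrality and membership steps.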

Step 103, clustering each node in the label relation graph and calculating the membership degree of adjacent nodes, thereby generating a label hierarchical structure.

The embodiment of the invention uses the idea of the membership function from fuzzy clustering, combined with the characteristics of a graph, to calculate the association of each label with respect to the cluster centers, and selects the center point representing each cluster by means of centrality algorithms. Optionally, step 103 may comprise:

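generating a first-level label hierarchy by adopting the following method:

calculating the average centrality of each node in the label relation graph;

screening out at least one secondary root node according to the average centrality of each node and the association relationship between the nodes;

respectively calculating the membership degree of each secondary root node and each adjacent node so as to determine a candidate node set corresponding to each secondary root node, wherein each node in the candidate node set has a membership relationship with the secondary root node;

the step of generating a first-level label hierarchy is repeated, thereby generating a label hierarchy.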

In the embodiment of the present invention, the root node may be determined by a user, and the root node may not be a label in the label relationship graph, or may be a certain label (which may be a node with the highest average centrality) in the label relationship graph.

Optionally, calculating an average centrality of each node in the label relation graph includes: for each node, respectively calculating the degree centrality, the betweenness centrality, the closeness centrality and the PageRank value of the node; respectively normalizing the degree centrality, the betweenness centrality, the closeness centrality and the PageRank value; and calculating the arithmetic mean of the normalized degree centrality, betweenness centrality, closeness centrality and PageRank value, so as to obtain the average centrality of the node.

Degree centrality (Degree Centrality) is the most direct measure of node centrality in network analysis. The larger the degree of a node, the higher its degree centrality and the more important the node is in the network.

Betweenness centrality (Betweenness Centrality) characterizes the importance of a node by the number of shortest paths that pass through it.

Closeness centrality (Closeness Centrality) reflects how close a node is to the other nodes in a network, and is expressed as the reciprocal of the sum of the shortest-path distances from the node to all other nodes. That is, the closer a node is to the other nodes, the greater its closeness centrality.

PageRank, also called web page ranking, Google left-side ranking or Page ranking, is a technique calculated from the mutual hyperlinks between web pages. It is one of the factors in web page ranking and reflects the relevance and importance of web pages.
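Two of these measures can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions: distances are counted in hops (edge weights are ignored), closeness is scaled by the number of reachable nodes, and betweenness centrality and PageRank are left out for brevity. The function names and the small example graph are hypothetical.

```python
from collections import deque

def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}

def closeness_centrality(graph):
    """(reachable nodes) / (sum of hop distances), via breadth-first search."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result

# Hypothetical path graph: processor - kirin - battery (weights unused here).
graph = {
    "processor": {"kirin": 30},
    "kirin": {"processor": 30, "battery": 5},
    "battery": {"kirin": 5},
}
print(degree_centrality(graph))
print(closeness_centrality(graph))
```

In this path graph the middle node scores highest on both measures, which matches the intuition that a label connected to everything else is the most central.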

According to the label relation graph generated in step 102, the degree centrality, betweenness centrality, closeness centrality and PageRank value of each node in the graph are calculated respectively; these values are then normalized and their arithmetic mean is taken to obtain the average centrality of the node. Experiments show that each individual centrality algorithm suits different data scenarios. The embodiment of the invention therefore combines them: degree centrality counts the total in-degree and out-degree, betweenness centrality reflects a node's role as a connecting bridge, closeness centrality reflects the shortest paths to other nodes, and the PageRank value takes the interaction among nodes into account. The final average centrality is calculated after the values are made dimensionless (normalized), and the most representative label (namely, a secondary root node) is then selected from a group of labels by average centrality to serve as the representative of that group of labels (namely, a node set in the label relation graph).
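The normalize-then-average step can be sketched as follows. The text does not say which normalization is used, so min-max scaling is an assumption here; the raw scores are made-up values standing in for the output of whatever centrality routines are available, and the function names are hypothetical.

```python
def min_max_normalize(scores):
    """Scale one metric to [0, 1]. Min-max scaling is an assumption."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {node: 0.0 for node in scores}
    return {node: (value - lo) / (hi - lo) for node, value in scores.items()}

def average_centrality(metrics):
    """Arithmetic mean of the normalized metrics for every node."""
    normalized = [min_max_normalize(scores) for scores in metrics.values()]
    nodes = list(normalized[0])
    return {node: sum(m[node] for m in normalized) / len(normalized) for node in nodes}

# Hypothetical raw scores for three nodes, one dict per centrality measure.
raw = {
    "degree": {"a": 3, "b": 1, "c": 2},
    "betweenness": {"a": 4.0, "b": 0.0, "c": 2.0},
    "closeness": {"a": 1.0, "b": 0.5, "c": 0.75},
    "pagerank": {"a": 0.5, "b": 0.25, "c": 0.25},
}
avg = average_centrality(raw)
print(avg)
```

Normalizing first keeps a metric with a large numeric range, such as a raw shortest-path count, from dominating the mean.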

Optionally, screening out at least one secondary root node according to the average centrality of each node and the association relationship between the nodes includes: sorting the nodes in descending order of average centrality and screening out the top N nodes, wherein N is an integer greater than zero; for the N nodes, dividing the nodes that have an association relationship into one group, thereby obtaining at least one node group; and for each node group, taking the node with the maximum average centrality in the node group as a secondary root node. All nodes in the label relation graph can be regarded as the candidate node set for secondary root nodes: the N nodes with the highest average centrality are screened out, it is determined whether any of the N nodes have an association relationship with one another, and if so, those nodes are divided into one group. Since whether an association relationship exists between labels has already been calculated in step 101, if two of the N nodes are directly connected, the two nodes are considered to have an association relationship and are placed in the same node group.

For example, in the label relation graph of the root node "mobile phone", the 9 node labels with the highest average centrality are selected, including battery, screen, full screen, camera, performance, pixel, processor and body. Based on the association relationships among these nodes, they are divided into five node groups: camera forms one group, processor and performance form one group, screen and full screen form one group, battery forms one group, and body forms one group. Finally, the node with the highest average centrality in each of the five node groups is selected as a secondary root node; for example, camera, processor, screen, battery and body are the secondary root nodes of the five categories under mobile phone.
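The selection of secondary root nodes can be sketched as below. The sketch assumes that "nodes with an association relationship" are grouped as connected components of the top-N nodes, which is one reading of the text; the centrality scores, edge weights and the function name select_secondary_roots are made up for illustration.

```python
def select_secondary_roots(graph, avg_centrality, n):
    """Pick the top-n nodes, group connected ones, keep the best node per group."""
    top = sorted(avg_centrality, key=avg_centrality.get, reverse=True)[:n]
    top_set = set(top)
    seen, roots = set(), []
    for start in top:
        if start in seen:
            continue
        # Collect the group: nodes reachable from `start` through top-n nodes only.
        group, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for nbr in graph.get(node, {}):
                if nbr in top_set and nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        roots.append(max(group, key=avg_centrality.get))
    return roots

# Hypothetical edges among candidate labels (weights are illustrative only).
graph = {
    "processor": {"performance": 12, "kirin": 30},
    "performance": {"processor": 12},
    "screen": {"full screen": 9},
    "full screen": {"screen": 9},
    "kirin": {"processor": 30},
}
# Hypothetical average centrality scores.
scores = {
    "camera": 0.9, "processor": 0.85, "screen": 0.8, "battery": 0.7,
    "performance": 0.6, "full screen": 0.5, "body": 0.4, "kirin": 0.1,
}
roots = select_secondary_roots(graph, scores, n=7)
print(roots)
```

With these made-up scores the result is the five secondary root nodes named in the example; the low-scoring node falls outside the top N and is left for the membership step.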

Optionally, the membership degree between a secondary root node and any adjacent node is calculated as follows: the ratio of the weight of the edge between the adjacent node and the secondary root node to the sum of the weights of all the edges of the adjacent node.

For example, if the weight of the Kirin-processor edge is 30, the Kirin-camera edge is 5 and the Kirin-battery edge is 5, then the membership of Kirin is 0.75 under processor, 0.125 under camera and 0.125 under battery. If the processor-apple edge is 15, the camera-apple edge is 25, the screen-apple edge is 30, the battery-apple edge is 20 and the body-apple edge is 10, then the memberships of the apple label under processor, camera, screen, battery and body are, in order, 0.15, 0.25, 0.3, 0.2 and 0.1.
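The membership calculation in this example can be reproduced with a short sketch. The function name memberships and the one-directional dictionary (adjacent node to its edges) are illustrative assumptions; the edge weights are those given in the example, with "kirin" standing for the chip label and "camera" for the photographing label.

```python
def memberships(graph, node, secondary_roots):
    """Membership of `node` under each adjacent secondary root node."""
    total = sum(graph[node].values())  # sum of the weights of all edges of the node
    return {
        root: graph[node][root] / total
        for root in secondary_roots
        if root in graph[node]
    }

# Edge weights from the example (adjacent node -> secondary root -> weight).
graph = {
    "kirin": {"processor": 30, "camera": 5, "battery": 5},
    "apple": {"processor": 15, "camera": 25, "screen": 30, "battery": 20, "body": 10},
}
roots = ["camera", "processor", "screen", "battery", "body"]

kirin = memberships(graph, "kirin", roots)
apple = memberships(graph, "apple", roots)
print(kirin)
print(apple)
```

Because each value is a share of the node's total edge weight, the memberships of a node whose edges all lead to secondary root nodes sum to 1.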

Optionally, determining a candidate node set corresponding to each secondary root node includes: adding the adjacent nodes whose membership degree is greater than or equal to a membership threshold into the candidate node set corresponding to the secondary root node, so that each adjacent node is subordinate to at least one secondary root node. By setting a membership threshold, nodes with membership below the threshold are pruned, while it is ensured that each node remains subordinate to at least one secondary root node. In an embodiment of the invention, one node may be subordinate to several secondary root nodes, and is subordinate to at least one.

For example, if only memberships greater than or equal to 0.2 are retained as valid, Kirin is subordinate to processor with a membership of 0.75, and apple is subordinate to camera, screen and battery with memberships of 0.25, 0.3 and 0.2 respectively.
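The pruning rule can be sketched as follows. The text requires every node to stay under at least one secondary root node but does not say how; keeping the single largest membership when nothing reaches the threshold is an assumption of this sketch, and the function name prune_memberships is hypothetical.

```python
def prune_memberships(memberships, threshold):
    """Keep memberships at or above the threshold.

    Assumption: if none qualifies, keep the single largest one so the node
    still belongs to at least one secondary root node.
    """
    kept = {root: m for root, m in memberships.items() if m >= threshold}
    if not kept and memberships:
        best = max(memberships, key=memberships.get)
        kept = {best: memberships[best]}
    return kept

# Memberships from the example above.
kirin = {"processor": 0.75, "camera": 0.125, "battery": 0.125}
apple = {"processor": 0.15, "camera": 0.25, "screen": 0.3, "battery": 0.2, "body": 0.1}

print(prune_memberships(kirin, 0.2))
print(prune_memberships(apple, 0.2))
```

A higher threshold yields a sparser hierarchy in which fewer labels appear under several parents.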

Then, the subordinate nodes of each secondary root node are taken as a candidate node set and the above steps are repeated to generate the next level of the label hierarchy and its membership degrees, until all hierarchical relationships are established. Alternatively, a stop condition may be set: the number of adjacent nodes is less than a specified threshold. Note that membership does not need to be computed for the nodes of the first and last levels.

For example, taking the candidate node set related to processor, the nodes with the highest centrality are Qualcomm, performance, …; these nodes are used as secondary root nodes, their membership degrees with respect to processor are obtained respectively, and then the lower-level labels and membership degrees of these secondary root nodes are calculated.

For example, using mobile-phone-related texts, the secondary root nodes are calculated as: camera, processor, screen, battery and body. The membership degrees between the secondary root nodes and adjacent nodes are calculated as follows:

Secondary root node	Adjacent node	Degree of membership	Tag ID
Processor	2.4G	0.576000	0
Battery	2.4G	0.218667	4
Battery	360	0.313453	4
Processor	360	0.327013	0
Body	3D	0.456526	3

The label hierarchical structure with weight ratios (namely membership degrees) constructed by the embodiment of the invention can be used to precisely match content with users in search and recommendation, and can also be used to discover similar interests among users.

Optionally, after the tag hierarchy is generated, the corresponding tag may be matched to each file object according to the tag hierarchy. The label hierarchical structure constructed by the embodiment of the invention fully discovers the internal relation among the labels, so that the labels matched with the file objects can better represent the characteristics of the file objects, and accurate recommendation and delivery can be realized when the label hierarchical structure is applied in business scenes such as search recommendation, advertisement delivery and the like.

According to the various embodiments described above, it can be seen that the technical means of generating the label hierarchy structure by generating the label relationship graph according to each label pair having an association relationship, clustering each node in the label relationship graph and calculating the membership of adjacent nodes in the label relationship graph in the embodiments of the present invention solves the technical problem of unique position of a label in the label hierarchy structure in the prior art. The embodiment of the invention solves the problem that the labels have ambiguity by a fuzzy clustering method, so that the labels can appear at different positions, and the probability value (namely membership) of each label appearing at different positions is calculated; and the recursive clustering is flexibly controlled through the membership degree, and the label hierarchy structure can be automatically constructed, so that the labor cost can be saved.

Fig. 3 is a schematic diagram of a main flow of a method of generating a tag hierarchy structure according to one referential embodiment of the present invention. As another embodiment of the present invention, as shown in fig. 3, the method for generating a label hierarchy may include:

step 301, preparing basic data.

As shown in fig. 4, the base data includes each file object and its corresponding tag. The file objects can be texts, pictures, videos and the like, and authors and users can add tags to the file objects, so the tags to be structured may be social tags or tags designed by the authors, but no clear relationship is established yet.

Step 302, respectively calculating the co-occurrence times of any two tags in the same file object according to the occurrence times of each tag in each file object.

Step 303, for any two tags, determining whether an association relationship exists between the two tags according to the number of co-occurrences of the two tags in the same file object, the total number of the file objects, and the number of the file objects in which one tag appears, so as to screen out a tag pair having an association relationship.

As shown in FIG. 4, the Apriori algorithm may be employed to mine frequent item sets and generate association rules. Specifically, the number of co-occurrences of the two labels in the same file object is divided by the total number of file objects to obtain the support; the number of co-occurrences of the two labels in the same file object is divided by the number of file objects in which one of the labels appears to obtain the confidence; and if the support is greater than or equal to a support threshold and the confidence is greater than or equal to a confidence threshold, it is determined that an association relationship exists between the two labels.

Step 304, generating a label relationship graph according to each label pair.

The nodes in the relationship graph are labels, and the weight of an edge is the number of co-occurrences of the two labels in the same file object.
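One possible in-memory form of such a graph is a nested dictionary in which each label maps to its neighbors and the corresponding edge weights. The sketch below assumes the screened label pairs arrive as a mapping from a pair to its co-occurrence count; `build_tag_graph` and the sample values are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def build_tag_graph(pair_weights):
    """Build an undirected weighted graph: graph[a][b] == graph[b][a] == weight."""
    graph = defaultdict(dict)
    for (a, b), weight in pair_weights.items():
        graph[a][b] = weight
        graph[b][a] = weight
    return dict(graph)


# Hypothetical screened pairs with their co-occurrence counts.
graph = build_tag_graph({("camera", "phone"): 2, ("battery", "phone"): 2})
print(graph["phone"])
```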

Step 305, calculating the average centrality of each node in the label relationship graph.

For each node, the degree centrality, the betweenness centrality, the closeness centrality and the PageRank value of the node are calculated respectively; the degree centrality, the betweenness centrality, the closeness centrality and the PageRank value are each normalized; and the arithmetic mean of the normalized degree centrality, betweenness centrality, closeness centrality and PageRank value is calculated to obtain the average centrality of the node.
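The normalize-then-average step can be sketched as below, assuming the four raw scores have already been computed by some graph library. The text does not name a normalization method, so min-max scaling is used here purely as an assumption; `average_centrality` and the raw values are illustrative, not computed from a real graph.

```python
def average_centrality(raw_scores):
    """raw_scores: {metric: {node: value}} -> {node: mean of normalized values}."""
    normalized = {}
    for metric, values in raw_scores.items():
        lo, hi = min(values.values()), max(values.values())
        span = hi - lo
        # Assumption: min-max normalization; a constant metric maps to 0.0.
        normalized[metric] = {
            node: (value - lo) / span if span else 0.0
            for node, value in values.items()
        }
    nodes = next(iter(raw_scores.values())).keys()
    return {
        node: sum(normalized[m][node] for m in normalized) / len(normalized)
        for node in nodes
    }


# Illustrative raw scores for a three-node graph (values are made up).
raw = {
    "degree":      {"phone": 2, "camera": 1, "battery": 1},
    "betweenness": {"phone": 1.0, "camera": 0.0, "battery": 0.0},
    "closeness":   {"phone": 1.0, "camera": 0.5, "battery": 0.5},
    "pagerank":    {"phone": 0.5, "camera": 0.25, "battery": 0.25},
}
avg = average_centrality(raw)
print(avg)
```

Normalizing before averaging keeps a metric with a large numeric range, such as a raw degree count, from dominating the mean.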

Step 306, screening out at least one secondary root node according to the average centrality of each node and the association relationship between the nodes.

The nodes are sorted in descending order of average centrality, and the top N nodes are screened out, where N is an integer greater than zero. Among the N nodes, nodes that have an association relationship with each other are divided into one group, thereby obtaining at least one node group. For each node group, the node with the largest average centrality in the group is taken as a secondary root node. In other words, all nodes in the label relationship graph can be regarded as the candidate node set for secondary root nodes: the N nodes with the highest average centrality are screened out, it is determined whether any of these N nodes have an association relationship with each other, and if so, those nodes are divided into one group.
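A minimal sketch of this screening step follows. It assumes the nested-dictionary graph form, and it reads "nodes with an association relationship are divided into a group" as taking connected components among the top N nodes, which is one plausible interpretation and not stated in the text. The function name and sample data are hypothetical.

```python
def select_secondary_roots(graph, avg_centrality, n):
    """Pick the top-n nodes, group linked ones together, return one root per group."""
    top = sorted(avg_centrality, key=avg_centrality.get, reverse=True)[:n]
    top_set = set(top)
    seen = set()
    roots = []
    for start in top:
        if start in seen:
            continue
        # Collect every top-n node reachable from `start` via edges among top-n nodes.
        group, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbor in graph.get(node, {}):
                if neighbor in top_set and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        # The node with the largest average centrality represents the group.
        roots.append(max(group, key=avg_centrality.get))
    return roots


# Hypothetical graph and average centrality values.
graph = {
    "phone": {"camera": 2, "battery": 2},
    "camera": {"phone": 2},
    "battery": {"phone": 2},
    "laptop": {"keyboard": 3},
    "keyboard": {"laptop": 3},
}
avg = {"phone": 0.9, "laptop": 0.7, "camera": 0.4, "keyboard": 0.3, "battery": 0.1}
roots = select_secondary_roots(graph, avg, n=3)
print(roots)
```

Here the top three nodes are phone, laptop and camera; phone and camera are linked and form one group, so phone is chosen over camera, while laptop forms a group of its own.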

Step 307, respectively calculating the membership degree between each secondary root node and each of its adjacent nodes, thereby determining a candidate node set corresponding to each secondary root node.

Optionally, the membership degree between the secondary root node and any one of its adjacent nodes is calculated as the ratio of the weight of the edge between the adjacent node and the secondary root node to the sum of the weights of all edges of the adjacent node. Adjacent nodes whose membership degree is greater than or equal to a membership degree threshold are then added to the candidate node set corresponding to the secondary root node, so that each adjacent node is subordinate to at least one secondary root node. Setting a membership degree threshold prunes nodes whose membership degree is below the threshold, while ensuring that each node is subordinate to at least one secondary root node. In an embodiment of the invention, a node may be subordinate to several secondary root nodes, and is subordinate to at least one.
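The membership calculation and threshold pruning can be sketched as below, again assuming the nested-dictionary graph form. The function name, sample graph, and threshold value are hypothetical. The sketch only applies the threshold; it does not add any fallback to guarantee that every node ends up under at least one secondary root node, because the text does not describe how that guarantee is enforced.

```python
def candidate_sets(graph, roots, threshold):
    """Return {root: {neighbor: membership}} for neighbors meeting the threshold."""
    result = {}
    for root in roots:
        members = {}
        for neighbor, weight in graph[root].items():
            # Membership = weight of edge (neighbor, root) / sum of neighbor's edge weights.
            membership = weight / sum(graph[neighbor].values())
            if membership >= threshold:
                members[neighbor] = membership
        result[root] = members
    return result


# Hypothetical graph with two secondary root nodes sharing neighbors.
graph = {
    "phone": {"camera": 2, "case": 1},
    "laptop": {"case": 3, "camera": 2},
    "camera": {"phone": 2, "laptop": 2},
    "case": {"phone": 1, "laptop": 3},
}
sets = candidate_sets(graph, ["phone", "laptop"], threshold=0.3)
print(sets)
```

In this example camera has a membership degree of 0.5 under both phone and laptop and so appears in both candidate sets, while case is pruned from phone (0.25) and kept under laptop (0.75).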

Step 308, determining whether a stop condition is met; if yes, going to step 309; if not, returning to step 305.

The stop condition may be that the establishment of the entire hierarchical relationship is completed or that the number of adjacent nodes is less than a specified threshold.

Step 309, stopping generating the hierarchical structure to obtain the label hierarchical structure.

In addition, the specific implementation of the method for generating a label hierarchy in this reference embodiment has been described in detail in the method for generating a label hierarchy above, and is therefore not repeated here.

Fig. 5 is a schematic diagram of the main modules of an apparatus for generating a label hierarchy according to an embodiment of the present invention. As shown in fig. 5, the apparatus 500 for generating a label hierarchy includes a screening module 501, an association module 502, and a generating module 503. The screening module 501 is configured to screen out label pairs having an association relationship according to the number of occurrences of each label in each file object; the association module 502 is configured to generate a label relationship graph according to each label pair, wherein the nodes in the relationship graph are labels and the weight of an edge is the number of co-occurrences of the two labels in the same file object; the generating module 503 is configured to cluster the nodes in the label relationship graph and calculate the membership degree of adjacent nodes, so as to generate a label hierarchy.

Optionally, the generating module 503 is further configured to:

calculating the average centrality of each node in the label relation graph;

the above steps are repeatedly performed, thereby generating a label hierarchy.

Optionally, the generating module 503 is further configured to:

Optionally, the generating module 503 is further configured to: calculating the membership degree of the secondary root node and any adjacent node by adopting the following method:

Optionally, the generating module 503 is further configured to:

Optionally, the screening module 501 is further configured to:

Optionally, the apparatus further comprises a matching module, configured to:

It should be noted that the specific implementation of the apparatus for generating a label hierarchy according to the present invention has already been described in detail in the method for generating a label hierarchy above, and is therefore not repeated here.

Fig. 6 illustrates an exemplary system architecture 600 to which the method of generating a tag hierarchy or the apparatus for generating a tag hierarchy of embodiments of the present invention may be applied.

As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves to provide a medium for communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various types of connections, such as wired or wireless communication links, or fiber optic cables.

A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. The terminal devices 601, 602, 603 may have installed thereon various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).

The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 605 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 601, 602, 603. The background management server may analyze and otherwise process received data such as an item information query request, and feed back a processing result (for example, target push information or item information, by way of example only) to the terminal device.

It should be noted that the method for generating the label hierarchy provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the apparatus for generating the label hierarchy is generally disposed in the server 605. The method for generating the label hierarchy provided by the embodiment of the present invention may also be executed by the terminal devices 601, 602, and 603, and accordingly, the apparatus for generating the label hierarchy may be disposed in the terminal devices 601, 602, and 603.

It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is installed into the storage section 708 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer programs according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a screening module, an association module, and a generation module, where the names of the modules do not in some cases constitute a limitation on the modules themselves.

As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, implement the method of: screening out label pairs with association relation according to the occurrence frequency of each label in each file object; generating a label relation graph according to each label pair; wherein, the nodes in the relational graph are labels, and the weight of the edges is the co-occurrence times of the two labels in the same file object; and clustering each node in the label relation graph and calculating the membership degree of adjacent nodes so as to generate a label hierarchical structure.

According to the technical scheme of the embodiment of the invention, a label relationship graph is generated from the label pairs that have an association relationship, the nodes in the label relationship graph are clustered, and the membership degree of adjacent nodes is calculated in order to generate the label hierarchy. This solves the technical problem in the prior art that a label can occupy only one position in the label hierarchy. The embodiment of the invention handles label ambiguity with a fuzzy clustering method, so that a label can appear at several positions, and the probability value (namely the membership degree) of the label appearing at each position is calculated. The membership degree also flexibly controls the recursive clustering, so the label hierarchy can be constructed automatically, which saves labor cost.

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of generating a hierarchy of tags, comprising:

2. The method of claim 1, wherein clustering nodes in the label relationship graph and calculating membership of adjacent nodes to generate a label hierarchy comprises:

calculating the average centrality of each node in the label relation graph;

the above steps are repeatedly performed, thereby generating a label hierarchy.

3. The method of claim 2, wherein the calculating the average centrality of each node in the label relationship graph comprises:

4. The method according to claim 2, wherein the screening out at least one secondary root node according to the average centrality of each node and the association relationship between each node comprises:

5. The method of claim 2, wherein the degree of membership of the secondary root node to any one of the neighboring nodes is calculated as follows:

6. The method of claim 2, wherein the determining the set of candidate nodes corresponding to each of the secondary root nodes comprises:

7. The method according to claim 1, wherein the screening out the label pairs having an association relationship according to the number of occurrences of each label in each file object comprises:

8. The method according to claim 7, wherein said determining whether there is an association relationship between the two tags according to the number of co-occurrences of the two tags in the same file object, the total number of file objects, and the number of file objects in which one tag appears comprises:

9. The method of claim 1, further comprising, after the generating a hierarchy of tags:

10. An apparatus for generating a hierarchy of tags, comprising:

11. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

the one or more programs, when executed by the one or more processors, implement the method of any of claims 1-9.

12. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-9.