WO2020114108A1 - Clustering result interpretation method and device - Google Patents

Clustering result interpretation method and device

Info

Publication number
WO2020114108A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
interpretation
category
embedded object
model
Prior art date
Application number
PCT/CN2019/112090
Other languages
French (fr)
Chinese (zh)
Inventor
王力
向彪
周俊
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 filed Critical 阿里巴巴集团控股有限公司
Publication of WO2020114108A1 publication Critical patent/WO2020114108A1/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Definitions

  • This specification relates to the field of machine learning technology, and in particular to a method and device for interpreting clustering results.
  • Embedding represents a kind of mapping in mathematics, which can map one space to another space, and retain the basic attributes.
  • the embedding algorithm can transform some complex and difficult-to-express features into easy-to-calculate forms, such as vectors and matrices, which are convenient for machine learning models to process.
  • However, the embedding algorithm is not interpretable, so a clustering model that clusters the embedding results is likewise not interpretable and cannot meet the needs of business scenarios.
  • this specification provides a method and device for interpreting clustering results.
  • An interpretation method for clustering results including:
  • An interpretation method for the identification results of a risk gang identification model including:
  • a risk gang identification model is used to identify the embedded result to obtain a risk gang label to which each user node belongs;
  • a method for interpreting the clustering results of text clustering models including:
  • a text clustering model is used to cluster the embedding results to obtain a category label for each text
  • a device for interpreting clustering results including:
  • the embedding processing unit uses embedding algorithms to embed the embedded objects to obtain the embedding result of each embedded object;
  • the object clustering unit uses a clustering model to cluster the embedding results to obtain a category label for each embedded object
  • a model training unit which uses the features and category labels of the embedded objects to train the explanatory model
  • the object extraction unit extracts several embedded objects from the category for each category
  • a feature determining unit based on the extracted feature of each embedded object and the trained interpretation model, determining that the embedded object belongs to the interpretation feature of the category;
  • the feature summary unit summarizes the interpretation features of each embedded object extracted under the same category to obtain the interpretation features of the clustering model under the category.
  • a device for interpreting clustering results including:
  • Memory for storing machine executable instructions
  • In view of the above, this specification can use the features and category labels of embedded objects to train an interpretable interpretation model, and can determine, based on the trained interpretation model, the interpretation features for the category assignment of each embedded object under each category. The interpretation features of the embedded objects extracted under the same category can then be summarized to obtain the interpretation features of the clustering model under that category, realizing interpretation of the clustering results. This provides a basis for developers to correct deviations of the clustering model, helps improve the model's generalization ability and performance, and helps avoid legal and moral risks.
  • FIG. 1 is a schematic flowchart of a method for explaining a clustering result shown in an exemplary embodiment of this specification.
  • FIG. 2 is a schematic flowchart of another method for explaining a clustering result shown in an exemplary embodiment of the present specification.
  • FIG. 3 is a schematic diagram of a decision tree shown in an exemplary embodiment of this specification.
  • FIG. 4 is a schematic structural diagram of an apparatus for interpreting clustering results shown in an exemplary embodiment of the present specification.
  • Fig. 5 is a block diagram of an apparatus for interpreting clustering results shown in an exemplary embodiment of the present specification.
  • Although the terms first, second, third, etc. may be used in this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other.
  • first information may also be referred to as second information, and similarly, the second information may also be referred to as first information.
  • Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination".
  • On the one hand, the clustering model can be used to cluster the embedding results of the embedded objects to obtain the category label of each embedded object; on the other hand, the features and category labels of the embedded objects can be used to train an interpretable interpretation model, and the interpretation features by which the embedded objects extracted from each category belong to that category can be determined based on the trained interpretation model. The interpretation features of the embedded objects extracted from the same category are then summarized to obtain the interpretation features of the clustering model under that category, thereby realizing the interpretation of the clustering model.
  • FIGS. 1 and 2 are schematic flowcharts of a method for explaining a clustering result shown in exemplary embodiments of this specification.
  • the method for interpreting the clustering result may include the following steps:
  • Step 102 embedding the embedded objects using an embedding algorithm to obtain the embedding result of each embedded object.
  • Step 104 Clustering the embedding results using a clustering model to obtain a category label for each embedded object.
  • the embedded object may be a graph node in the graph structure.
  • the embedded object may be a user node in a user network graph.
  • the user network map may be established based on the user's payment data, friend relationship data, and the like.
  • the vector corresponding to each user node can be obtained.
  • the category label of each user node can be obtained.
  • the embedded object may be text to be clustered, such as news, information, and the like.
  • the embedding algorithm is used to embed the vocabulary included in each text, and the vector corresponding to each vocabulary in each text can be obtained, and the vector set corresponding to each text can be obtained.
  • the category label of each text can be obtained.
  • For example, text 1 corresponds to technology category label 1 and text 2 corresponds to sports category label 2, which may indicate that text 1 is a technology text and text 2 is a sports text.
  • vectors, matrices, etc. obtained after the embedding object is processed by the embedding algorithm may be collectively referred to as embedding results.
  • Using embedding results as input parameters for machine learning calculations can effectively improve machine processing efficiency.
  • In some examples, the calculation of the embedding results and the clustering by the clustering model can be performed at the same time.
  • For example, the embedding algorithm and the clustering model can be combined, and the embedded objects can be used as input parameters of the combined model, which then performs both the calculation of the embedding results and the clustering of the embedded objects; this is not restricted in this specification.
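The two-step pipeline of steps 102 and 104 (embedding, then clustering) can be sketched as follows. This is a minimal illustration with assumed data: the "embedding algorithm" here is just the normalized adjacency row of each graph node (a real system might use DeepWalk or node2vec), the clustering model is a hand-rolled k-means, and the graph, the center seeding, and all values are made up for the example, not taken from the patent.

```python
import numpy as np

# Adjacency matrix of a tiny user network with two obvious communities.
adj = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

# "Embedding result" of each node (step 102): normalized adjacency row,
# standing in for a real embedding algorithm.
embeddings = adj / adj.sum(axis=1, keepdims=True)

def kmeans(X, k, iters=20):
    # Deterministic seeding for this sketch: one center from each half.
    centers = X[[0, len(X) // 2]].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# "Category label" of each node (step 104): the two communities separate.
labels = kmeans(embeddings, k=2)
print(labels)
```

The later steps (training an interpretation model on the nodes' features and these labels) would start from `labels` produced here.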
  • Step 106 Use the features and category labels of the embedded objects to train the interpretation model.
  • an interpretable multi-classification model can be used as the interpretation model, such as a linear model, a decision tree, etc., which is not particularly limited in this specification.
  • the features of the embedded object may include original features and topological features of the embedded object.
  • the original feature is usually an existing feature of the embedded object itself.
  • the original characteristics of the user node may include the user's age, gender, occupation, income, and so on.
  • the original features of the text may include part of speech of the vocabulary, word frequency, and so on.
  • the topological features can be used to represent the topological structure of the embedded object.
  • For user nodes, the topological features may include: the number of first-order neighbors, the number of second-order neighbors, the average number of neighbors of the first-order neighbors, statistics of the first-order neighbors in a specified original feature dimension, and so on.
  • For example, the statistical values of the first-order neighbors in a specified original feature dimension may be the average age of the first-order neighbors, the maximum age of the first-order neighbors, the average annual income of the first-order neighbors, the minimum annual income of the first-order neighbors, and so on.
  • For text, the topological features may include: the word that most often appears before a given word, the number of words that frequently appear together with the word, and so on.
  • The topological features are used to supplement the original features. On the one hand, this can solve the problem that some embedded objects lack original features; on the other hand, the topological structure of the embedded objects can be added to the features, thereby improving the accuracy of the interpretation model's training results.
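As a hedged illustration of the topological features described above, the snippet below computes, for one node of a made-up user graph, the number of first-order neighbors, the number of second-order neighbors, and statistics (average and maximum age) of the first-order neighbors. The graph, node names, and ages are all assumptions for the example.

```python
adj = {  # adjacency list of a small user network (illustrative data)
    "u0": ["u1", "u2"],
    "u1": ["u0", "u3"],
    "u2": ["u0"],
    "u3": ["u1"],
}
age = {"u0": 30, "u1": 40, "u2": 20, "u3": 50}

def topological_features(node):
    first = set(adj[node])
    # Second-order neighbors: neighbors of neighbors, excluding the node
    # itself and its first-order neighbors.
    second = {n for f in first for n in adj[f]} - first - {node}
    return {
        "num_first_order": len(first),
        "num_second_order": len(second),
        "avg_first_order_age": sum(age[n] for n in first) / len(first),
        "max_first_order_age": max(age[n] for n in first),
    }

feats = topological_features("u0")
print(feats)
```

These derived features would then be concatenated with the node's original features (age, gender, occupation, income, etc.) before training the interpretation model.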
  • Step 108 For each category, extract several embedded objects from the category.
  • The number of extracted embedded objects can be preset, such as 5000 or 3000; it can also be a percentage of the total number of embedded objects under the corresponding category, such as 50% or 30%. This specification imposes no special restrictions on this.
  • In step 110, the interpretation features by which each embedded object belongs to the category are determined based on the extracted features of the embedded object and the trained interpretation model.
  • In this embodiment, the contribution value of each feature of the embedded object to the classification result of the embedded object's category can be calculated based on the trained interpretation model, and the features whose contribution values meet a predetermined condition can then be taken as the interpretation features by which the embedded object belongs to the category.
  • For example, the features of the embedded object may be sorted in descending order of contribution value, and the features ranked in, for example, the top 5 or top 8 may be regarded as the interpretation features by which the embedded object belongs to the category; this specification imposes no special restrictions on this.
  • Step 112 Summarize the interpretation features of each embedded object extracted under the same category to obtain the interpretation features of the clustering model under that category.
  • In this embodiment, the total number of occurrences of each interpretation feature may be counted, and the several interpretation features with the largest totals may then be selected as the interpretation features of the clustering model under that category.
  • For example, features 1 to 5 may be selected and used as the interpretation features of the clustering model under the category.
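Step 112's summarization can be sketched as a simple tally: each extracted embedded object supplies its own interpretation features, occurrences are counted across the category, and the most frequent features become the category's interpretation features. The feature names below are illustrative, not from the patent.

```python
from collections import Counter

# Interpretation features of each extracted embedded object in one category
# (illustrative: in practice these come from step 110's per-object top-N).
per_object_features = [
    ["age", "income", "occupation"],   # object 1
    ["age", "income", "region"],       # object 2
    ["age", "occupation", "region"],   # object 3
]

# Tally occurrences across the category and keep the most frequent features.
counts = Counter(f for feats in per_object_features for f in feats)
top_k = [f for f, _ in counts.most_common(2)]
print(top_k)
```

`Counter.most_common` breaks ties by first-encountered order, so the result here is deterministic.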
  • the interpretation features of each embedded object extracted under each category can be obtained, and the result interpretation of the clustering model can be achieved.
  • As can be seen from the above description, this specification can use the features and category labels of embedded objects to train an interpretable interpretation model, and can determine, based on the trained interpretation model, the interpretation features for the category assignment of each embedded object under each category. The interpretation features of the embedded objects extracted under the same category can then be summarized to obtain the interpretation features of the clustering model under that category, realizing interpretation of the clustering results. This provides a basis for developers to correct deviations of the clustering model, helps improve the model's generalization ability and performance, and helps avoid legal and moral risks.
  • the interpretation model is a linear model
  • After the linear model is trained using the features and category labels of the embedded objects, the weight of each embedded-object feature under each category can be obtained. For example, under a given category, the contribution value of feature 1 to the classification result of embedded object 1 equals the feature value of feature 1 of embedded object 1 multiplied by W1; the contribution value of feature 2 equals the feature value of feature 2 of embedded object 1 multiplied by W2; and so on, which will not be repeated in this specification.
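A minimal sketch of the linear-model case, assuming the per-category weight vector has already been obtained from training: each feature's contribution is its feature value multiplied by the corresponding weight, and the top-N contributions give the interpretation features. All numbers below are made up.

```python
import numpy as np

# Assumed trained weights W1, W2, W3 of one category of the linear
# interpretation model, and the feature values of one embedded object.
weights_for_category = np.array([0.8, -0.1, 0.5])
object_features      = np.array([2.0, 10.0, 1.0])

# Contribution of feature i = feature value * weight (as described above).
contributions = object_features * weights_for_category
print(contributions)  # per-feature contribution values

# Rank features by contribution and keep the top N as interpretation features.
top_n = np.argsort(contributions)[::-1][:2]
print(top_n)  # indices of the top-2 contributing features
```
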
  • the interpretation model is a decision tree
  • the split points of each feature in the decision tree can be obtained.
  • each tree node in the decision tree shown in FIG. 3 can represent a unique feature, for example, tree node 1 represents user age, tree node 2 represents user annual income, and so on.
  • The split point of each feature in the decision tree usually refers to the feature threshold of the corresponding feature. For example, if the split point of the age tree node is 50, then bifurcation path 12 can be selected when the user's age is less than or equal to 50, and bifurcation path 13 can be selected when the user's age is greater than 50, and so on.
  • In this embodiment, when determining the contribution values of the features of an embedded object, the embedded object may first be input into the trained decision tree; the path traversed by the embedded object in the decision tree during its classification may then be determined, and each feature on the path and the split point of that feature may be obtained.
  • Referring to FIG. 3, assuming that the path of an embedded object in the decision tree shown in FIG. 3 is tree node 1 -> tree node 2 -> tree node 4, the features represented by these three tree nodes and the split points of those features can be obtained.
  • the distance between the feature value corresponding to the embedded object and the split point is calculated, and the distance can be used as the contribution value of the feature to the classification result of the embedded object category.
  • Taking tree node 1 representing the user's age with a split point of 50 as an example, if the user age of an embedded object is 20, the contribution value of the user-age feature is the difference between 50 and 20, namely 30.
  • the distance can also be normalized, and the normalized result can be used as the corresponding contribution value, which is not particularly limited in this specification.
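The decision-tree case can be sketched as follows, loosely modeled on FIG. 3: the object's path through the tree is traced, and for each feature on the path the distance between the object's feature value and the node's split point is used as that feature's contribution value. The tiny tree, its split points, and the object are illustrative assumptions.

```python
# Each node: (feature, split point, left child, right child); children that
# are None or absent from the dict are leaves. Node 1 splits on age at 50
# (as in the example above); node 2 splits on annual income (assumed).
tree = {
    1: ("age", 50, 2, 3),
    2: ("income", 80_000, 4, None),
}

def path_contributions(obj):
    node, contribs = 1, {}
    while node in tree:
        feature, split, left, right = tree[node]
        # Contribution = distance between feature value and split point.
        contribs[feature] = abs(obj[feature] - split)
        node = left if obj[feature] <= split else right
        if node is None:
            break
    return contribs

c = path_contributions({"age": 20, "income": 30_000})
print(c)
```

As the text notes, these raw distances may optionally be normalized before being compared across features, since features can live on very different scales.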
  • This specification also provides a method for interpreting the identification results of a risk gang identification model.
  • In this method, an embedding algorithm can be used to embed the user nodes in the user network graph to obtain the embedding result of each user node, and a risk gang identification model can then be used to identify the embedding results to obtain the risk gang label to which each user node belongs.
  • Next, the features of the user nodes and the risk gang labels can be used to train an interpretable interpretation model.
  • For each risk gang, several user nodes can be extracted from the risk gang, and based on the extracted features of each user node and the trained interpretation model, the interpretation features by which the user node belongs to the risk gang are determined; the interpretation features of the user nodes extracted from the same risk gang can then be aggregated to obtain the interpretation features of the risk gang identification model corresponding to that risk gang.
  • In this way, the interpretation features corresponding to each risk gang identified by the risk gang identification model can be obtained.
  • For example, the interpretation features of risk gang 1 may include: no fixed occupation, annual income less than 80,000, permanent residence in Guangxi, age 18-25, etc., which can indicate that the risk gang identification model identifies risk gang 1 through these user features.
  • The interpretation features of risk gang 2 may include: no fixed occupation, annual income less than 100,000, permanent residence in Yunnan, age 20-28, the SSID of the Wi-Fi network used being 12345, etc., which can indicate that the risk gang identification model identifies risk gang 2 through these user features.
  • This specification also provides a method for interpreting the clustering results of text clustering models.
  • an embedding algorithm can be used to embed each word in the text to be clustered to obtain the embedding result of each text, and then a text clustering model is used to cluster the embedding results to obtain the category label to which each text belongs .
  • the features of the text and the category label can be used to train an explanatory model with interpretation.
  • For each category, several texts can be extracted from the category, and based on the extracted features of each text and the trained interpretation model, the interpretation features by which the extracted text belongs to the category are determined; the interpretation features of the texts extracted from the same category are then summarized to obtain the interpretation features of the text clustering model under that category.
  • the interpretation characteristics of each text category clustered by the text clustering model can be obtained.
  • For example, the interpretation features of technology texts may include: computer, artificial intelligence, technology, innovation, and the word frequency of "technology" being greater than 0.01, which can indicate that the text clustering model identifies texts belonging to the technology category through these features.
  • The interpretation features of sports texts may include: football, basketball, sports, swimming, record, and so on, which can indicate that the text clustering model identifies texts belonging to the sports category through these features.
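A word-frequency feature of the kind mentioned above (e.g. "the word frequency of technology is greater than 0.01") could be computed as below; the whitespace tokenizer and the sample text are simplifying assumptions for illustration.

```python
def term_frequencies(text):
    # Frequency of each word = count / total number of tokens.
    words = text.lower().split()
    return {w: words.count(w) / len(words) for w in set(words)}

doc = "technology drives innovation and technology shapes computing"
tf = term_frequencies(doc)
print(tf["technology"])  # 2 occurrences out of 7 tokens
```
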
  • this specification also provides an embodiment of the clustering result interpretation apparatus.
  • the embodiment of the apparatus for interpreting clustering results in this specification can be applied to a server.
  • The device embodiments may be implemented by software, or by hardware or a combination of hardware and software. Taking software implementation as an example, the apparatus, as a logical device, is formed by the processor of the server where it is located reading the corresponding computer program instructions from non-volatile storage into memory and running them. At the hardware level, FIG. 4 is a hardware structure diagram of the server where the apparatus for interpreting clustering results is located; in addition to the processor, memory, network interface, and non-volatile storage shown in FIG. 4, the server in the embodiment may also include other hardware according to the actual functions of the server, which will not be repeated here.
  • Fig. 5 is a block diagram of an apparatus for interpreting clustering results shown in an exemplary embodiment of the present specification.
  • Referring to FIG. 5, the apparatus 400 for interpreting clustering results may be applied to the server shown in FIG. 4 described above, and includes: an embedding processing unit 401, an object clustering unit 402, a model training unit 403, an object extraction unit 404, a feature determination unit 405, and a feature summary unit 406.
  • the embedding processing unit 401 uses an embedding algorithm to embed the embedded objects to obtain the embedding result of each embedded object;
  • the object clustering unit 402 uses a clustering model to cluster the embedding results to obtain a category label for each embedded object;
  • the model training unit 403 trains the interpretation model using the characteristics and category labels of the embedded objects
  • the object extraction unit 404 extracts several embedded objects from the category for each category
  • the feature determining unit 405 determines, based on the extracted feature of each embedded object and the trained interpretation model, that the embedded object belongs to the interpretation feature of the category;
  • the feature summary unit 406 summarizes the interpretation features of each embedded object extracted under the same category to obtain the interpretation features of the clustering model under the category.
  • Optionally, when determining the interpretation features by which the embedded object belongs to the category, the feature determination unit 405:
  • the contribution value of each feature of the embedded object to the classification result is calculated based on the trained interpretation model
  • a feature whose contribution value satisfies a predetermined condition is extracted as an interpretation feature of the embedded object belonging to the category.
  • the feature determination unit 405 when the interpretation model is a linear model, the feature determination unit 405:
  • the feature determination unit 405 when the interpretation model is a decision tree, the feature determination unit 405:
  • the distance between the split point of the feature and the corresponding feature value of the embedded object is calculated as the contribution value of the feature to the classification result of the embedded object category.
  • Optionally, when extracting the features whose contribution values satisfy the predetermined condition, the feature determination unit 405:
  • the features arranged in the top N bits are extracted as the interpretation features of the embedded object belonging to the category, and N is a natural number greater than or equal to 1.
  • the features include: original features and topological features.
  • the topological features include one or more of the following:
  • For relevant parts, reference may be made to the description of the method embodiments.
  • The device embodiments described above are only schematic. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network elements. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions in this specification. Those of ordinary skill in the art can understand and implement them without creative effort.
  • the system, device, module or unit explained in the above embodiments may be specifically implemented by a computer chip or entity, or implemented by a product with a certain function.
  • A typical implementation device is a computer, and the specific form of the computer may be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email sending and receiving device, a game console, a tablet computer, a wearable device, or any combination of these devices.
  • this specification also provides a clustering result interpretation apparatus, which includes a processor and a memory for storing machine-executable instructions.
  • the processor and the memory are usually connected to each other via an internal bus.
  • the device may also include an external interface to be able to communicate with other devices or components.
  • By executing the machine-executable instructions, the processor is caused to:
  • the contribution value of each feature of the embedded object to the classification result is calculated based on the trained interpretation model
  • a feature whose contribution value meets a predetermined condition is extracted as an interpretation feature of the embedded object.
  • When the interpretation model is a linear model, in calculating the contribution value of each feature of the embedded object to the classification result based on the trained interpretation model, the processor is caused to:
  • When the interpretation model is a decision tree, in calculating the contribution value of each feature of the embedded object to the classification result based on the trained interpretation model, the processor is caused to:
  • the distance between the split point of the feature and the corresponding feature value of the embedded object is calculated as the contribution value of the feature to the classification result of the embedded object category.
  • When extracting the features whose contribution values satisfy the predetermined condition, the processor is caused to:
  • the features arranged in the top N bits are extracted as the interpretation features of the embedded object belonging to the category, and N is a natural number greater than or equal to 1.
  • the features include: original features and topological features.
  • the topological features include one or more of the following:
  • Correspondingly, this specification also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
  • Determining, based on the extracted features of each embedded object and the trained interpretation model, the interpretation features by which the embedded object belongs to the category includes:
  • the contribution value of each feature of the embedded object to the classification result is calculated based on the trained interpretation model
  • a feature whose contribution value satisfies a predetermined condition is extracted as an interpretation feature of the embedded object belonging to the category.
  • When the interpretation model is a linear model, calculating the contribution value of each feature of the embedded object to the classification result based on the trained interpretation model includes:
  • When the interpretation model is a decision tree, calculating the contribution value of each feature of the embedded object to the classification result based on the trained interpretation model includes:
  • the distance between the split point of the feature and the corresponding feature value of the embedded object is calculated as the contribution value of the feature to the classification result of the embedded object category.
  • Extracting the features whose contribution values satisfy a predetermined condition as the interpretation features by which the embedded object belongs to the category includes:
  • the features arranged in the top N bits are extracted as the interpretation features of the embedded object belonging to the category, and N is a natural number greater than or equal to 1.
  • the features include: original features and topological features.
  • the topological features include one or more of the following:

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A clustering result interpretation method and device. The method comprises: embedding embedded objects by using an embedding algorithm to obtain an embedding result of each embedded object (S102); clustering the embedding results by using a clustering model to obtain a category label of each embedded object (S104); training an interpretation model by using the features and category labels of the embedded objects (S106); extracting multiple embedded objects from each category (S108); determining the interpretation features by which the extracted embedded objects belong to the category according to the features of each extracted embedded object and the trained interpretation model (S110); and summarizing the interpretation features of each extracted embedded object in the same category to obtain the interpretation features of the clustering model in that category (S112).

Description

Clustering result interpretation method and device

Technical Field

This specification relates to the field of machine learning technology, and in particular to a method and device for interpreting clustering results.

Background

Embedding mathematically denotes a mapping that maps one space to another while retaining basic attributes. An embedding algorithm can transform some complex, difficult-to-express features into easily computable forms, such as vectors and matrices, which are convenient for machine learning models to process. However, the embedding algorithm is not interpretable, so a clustering model that clusters the embedding results is likewise not interpretable and cannot meet the needs of business scenarios.
Summary of the Invention

In view of this, this specification provides a method and device for interpreting clustering results.

Specifically, this specification is implemented through the following technical solutions:
A method for interpreting clustering results, including:

embedding the embedded objects using an embedding algorithm to obtain an embedding result for each embedded object;

clustering the embedding results using a clustering model to obtain a category label for each embedded object;

training an interpretation model using the features and category labels of the embedded objects;

for each category, extracting several embedded objects from the category;

determining, based on the extracted features of each embedded object and the trained interpretation model, the interpretation features by which the embedded object belongs to the category;

summarizing the interpretation features of each embedded object extracted under the same category to obtain the interpretation features of the clustering model under that category.
A method for interpreting the identification results of a risk gang identification model, including:
embedding the user nodes with an embedding algorithm to obtain an embedding result for each user node;
identifying the embedding results with a risk gang identification model to obtain the risk gang label to which each user node belongs;
training an interpretation model with the features of the user nodes and the risk gang labels;
for each risk gang, extracting several user nodes from the risk gang;
determining, based on the features of each extracted user node and the trained interpretation model, the interpretation features by which the user node belongs to the risk gang; and
summarizing the interpretation features of each user node extracted from the same risk gang to obtain the interpretation features of the risk gang identification model for that risk gang.
A method for interpreting the clustering results of a text clustering model, including:
embedding the texts to be clustered with an embedding algorithm to obtain an embedding result for each text;
clustering the embedding results with a text clustering model to obtain a category label for each text;
training an interpretation model with the features of the texts and the category labels;
for each category, extracting several texts from the category;
determining, based on the features of each extracted text and the trained interpretation model, the interpretation features by which the text belongs to the category; and
summarizing the interpretation features of each text extracted in the same category to obtain the interpretation features of the text clustering model under that category.
A device for interpreting clustering results, including:
an embedding processing unit, which embeds the embedded objects with an embedding algorithm to obtain an embedding result for each embedded object;
an object clustering unit, which clusters the embedding results with a clustering model to obtain a category label for each embedded object;
a model training unit, which trains an interpretation model with the features and category labels of the embedded objects;
an object extraction unit, which, for each category, extracts several embedded objects from the category;
a feature determination unit, which determines, based on the features of each extracted embedded object and the trained interpretation model, the interpretation features by which the embedded object belongs to the category; and
a feature summary unit, which summarizes the interpretation features of each embedded object extracted under the same category to obtain the interpretation features of the clustering model under that category.
A device for interpreting clustering results, including:
a processor; and
a memory for storing machine-executable instructions;
wherein, by reading and executing the machine-executable instructions stored in the memory and corresponding to the logic for interpreting clustering results, the processor is caused to:
embed the embedded objects with an embedding algorithm to obtain an embedding result for each embedded object;
cluster the embedding results with a clustering model to obtain a category label for each embedded object;
train an interpretation model with the features and category labels of the embedded objects;
for each category, extract several embedded objects from the category;
determine, based on the features of each extracted embedded object and the trained interpretation model, the interpretation features by which the embedded object belongs to the category; and
summarize the interpretation features of each embedded object extracted under the same category to obtain the interpretation features of the clustering model under that category.
As can be seen from the above description, this specification can train an interpretable interpretation model with the features and category labels of the embedded objects, determine, based on the trained interpretation model, the interpretation features by which each embedded object under each category was assigned to that category, and then summarize the interpretation features of the embedded objects in the same category to obtain the interpretation features of the clustering model under that category. This realizes interpretation of the clustering results, thereby providing developers with a basis for correcting deviations in the clustering model, helping to improve the model's generalization ability and performance, and helping to avoid legal and moral risks.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for interpreting clustering results shown in an exemplary embodiment of this specification.
FIG. 2 is a schematic flowchart of another method for interpreting clustering results shown in an exemplary embodiment of this specification.
FIG. 3 is a schematic diagram of a decision tree shown in an exemplary embodiment of this specification.
FIG. 4 is a schematic structural diagram of a server hosting an apparatus for interpreting clustering results, shown in an exemplary embodiment of this specification.
FIG. 5 is a block diagram of an apparatus for interpreting clustering results shown in an exemplary embodiment of this specification.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numerals in different drawings represent the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification. Rather, they are merely examples of devices and methods consistent with some aspects of this specification as detailed in the appended claims.
The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit this specification. The singular forms "a", "said", and "the" used in this specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this specification, first information may also be referred to as second information and, similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
This specification provides a scheme for interpreting clustering results. On the one hand, a clustering model may be used to cluster the embedding results of the embedded objects to obtain a category label for each embedded object. On the other hand, the features and category labels of the embedded objects may be used to train an interpretable interpretation model; based on the trained interpretation model, the interpretation features by which the embedded objects extracted from each category belong to that category may be determined, and the interpretation features of the embedded objects extracted from the same category may then be summarized to obtain the interpretation features of the clustering model under that category, thereby realizing interpretation of the clustering model.
FIG. 1 and FIG. 2 are schematic flowcharts of methods for interpreting clustering results shown in exemplary embodiments of this specification.
Referring to FIG. 1 and FIG. 2, the method for interpreting clustering results may include the following steps.
Step 102: embed the embedded objects with an embedding algorithm to obtain an embedding result for each embedded object.
Step 104: cluster the embedding results with a clustering model to obtain a category label for each embedded object.
In one example, the embedded objects may be graph nodes in a graph structure.
For example, an embedded object may be a user node in a user network graph. The user network graph may be built from users' payment data, friend-relationship data, and the like.
After the user nodes in the user network graph are embedded with the embedding algorithm, a vector corresponding to each user node is obtained.
Feeding the vector corresponding to each user node into the clustering model as input then yields the category label of each user node.
In another example, the embedded objects may be texts to be clustered, such as news articles or other information.
Embedding the words in each text with the embedding algorithm yields a vector for each word, and thus a vector set corresponding to each text.
Feeding the vector set corresponding to each text into the clustering model as input then yields the category label of each text.
For example, text 1 corresponding to technology category label 1 and text 2 corresponding to sports category label 2 may indicate that text 1 is a technology text and text 2 is a sports text.
In this embodiment, for ease of description, the vectors, matrices, and the like obtained by processing embedded objects with the embedding algorithm are collectively referred to as embedding results. Using embedding results as input for machine learning computations can effectively improve processing efficiency.
In other examples, computing the embedding results and clustering them may be performed in a single pass. For example, the embedding algorithm and the clustering model may be combined into one model that takes the embedded objects as input, computes the embedding results, and clusters the embedded objects. This specification places no particular restriction on this.
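The two steps above (embedding, then clustering the embedding results) can be sketched end to end. The degree-based embedding and the minimal k-means below are hypothetical stand-ins for a real embedding algorithm (e.g., node2vec) and a real clustering model; the graph, node names, and deterministic initialization are illustrative only:

```python
import numpy as np

def embed_nodes(adj):
    """Toy stand-in for an embedding algorithm (step 102): map each
    node to [its degree, the mean degree of its neighbors]."""
    deg = {n: len(nbrs) for n, nbrs in adj.items()}
    return {n: [deg[n], sum(deg[m] for m in nbrs) / len(nbrs)]
            for n, nbrs in adj.items()}

def kmeans(X, init_rows, iters=20):
    """Minimal k-means (step 104); deterministic initial centers
    keep the sketch reproducible."""
    centers = X[init_rows].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Squared distance of every row to every center, then assign.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centers)):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# A tiny user network graph given as adjacency lists.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
emb = embed_nodes(adj)
X = np.array([emb[n] for n in sorted(adj)])
labels = kmeans(X, init_rows=[0, 2])   # category label per user node
print(labels.tolist())
```

In practice the vectors would come from the trained embedding algorithm and the labels from the production clustering model; the point is only that each embedded object ends up with a category label, as step 104 requires.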
Step 106: train an interpretation model with the features and category labels of the embedded objects.
In this embodiment, an interpretable multi-class classification model, such as a linear model or a decision tree, may be used as the interpretation model. This specification places no particular restriction on this.
The features of an embedded object may include its original features and its topological features.
The original features are usually features the embedded object already has by itself.
For example, the original features of a user node may include the user's age, gender, occupation, income, and so on.
As another example, the original features of a text may include the parts of speech and frequencies of its words.
The topological features may be used to represent the topological structure of the embedded object.
Taking graph nodes as the embedded objects, the topological features may include: the number of first-order neighbors, the number of second-order neighbors, the average number of neighbors of the first-order neighbors, statistics of the first-order neighbors over a specified original feature dimension, and so on.
Still taking risk gang identification as an example, the statistics of the first-order neighbors over a specified original feature dimension may be the average age of the first-order neighbors, their maximum age, their average annual income, their minimum annual income, and so on.
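The topological features listed above can be computed directly from an adjacency structure. A minimal sketch, assuming (hypothetically) that the graph is an adjacency dict and that each node carries an age attribute as one of its original features:

```python
def topological_features(adj, age, node):
    """Compute the topological features described above for one graph node."""
    first = set(adj[node])                       # first-order neighbors
    second = set()                               # second-order neighbors
    for n in first:
        second |= set(adj[n])
    second -= first | {node}
    avg_nbrs = sum(len(adj[n]) for n in first) / len(first)
    ages = [age[n] for n in first]               # statistics of a chosen
    return {                                     # original feature (age)
        "num_first_order": len(first),
        "num_second_order": len(second),
        "avg_neighbors_of_first_order": avg_nbrs,
        "avg_age_of_first_order": sum(ages) / len(ages),
        "max_age_of_first_order": max(ages),
    }

adj = {"u1": ["u2", "u3"], "u2": ["u1", "u3", "u4"],
       "u3": ["u1", "u2"], "u4": ["u2"]}
age = {"u1": 20, "u2": 30, "u3": 40, "u4": 50}
feats = topological_features(adj, age, "u1")
print(feats)
```

These values can then be concatenated with the node's original features as the input to the interpretation model.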
Taking the words in a text as the embedded objects, the topological features may include: the word that most often precedes a given word, the number of words that frequently co-occur with it, and so on.
In this embodiment, the topological features supplement the original features. On the one hand, this solves the problem that some embedded objects have no original features; on the other hand, the topological structure of the embedded objects is incorporated into the features, improving the accuracy of the interpretation model's training results.
Step 108: for each category, extract several embedded objects from the category.
In this embodiment, for each category output by the clustering model, several embedded objects may be extracted from that category. The number of extracted embedded objects may be preset, for example 5000 or 3000; it may also be a percentage of the total number of embedded objects under the category, for example 50% or 30%. This specification places no particular restriction on this.
Step 110: determine, based on the features of each extracted embedded object and the trained interpretation model, the interpretation features by which the embedded object belongs to the category.
In this embodiment, for each extracted embedded object, the contribution of each of its features to the object's category assignment may be computed based on the trained interpretation model, and the features whose contributions satisfy a predetermined condition taken as the interpretation features by which the embedded object belongs to the category.
For example, the features of the embedded object may be sorted by contribution in descending order, and the top 5 or top 8 features taken as the interpretation features by which the embedded object belongs to the category. This specification places no particular restriction on this.
Step 112: summarize the interpretation features of each embedded object extracted under the same category to obtain the interpretation features of the clustering model under that category.
In one example, for a given category, the summarization may count the total number of occurrences of each interpretation feature and then select the several interpretation features with the most occurrences as the interpretation features of the clustering model under that category.
Embedded object | Interpretation features
Embedded object 1 | Features 1-5
Embedded object 2 | Features 2-6
Embedded object 3 | Features 7-11
Embedded object 4 | Features 1-4, Feature 15
Embedded object 5 | Features 1-3, Features 13-14

Table 1
Referring to the example in Table 1, suppose a category contains 5 embedded objects, embedded object 1 to embedded object 5. The interpretation features by which embedded object 1 was assigned to the category are features 1-5, those of embedded object 2 are features 2-6, and so on. The occurrences of each feature in the category can then be tallied, giving the statistics shown in Table 2.
Interpretation feature | Occurrences
Feature 1, Feature 4 | 3
Feature 2, Feature 3 | 4
Feature 5 | 2
Feature 6-Feature 11, Feature 13-Feature 15 | 1

Table 2
Referring to the example in Table 2, the tally shows that feature 1 and feature 4 each appear 3 times, feature 2 and feature 3 each appear 4 times, and so on.
In this example, if the 5 most frequently occurring interpretation features are selected, features 1-5 are chosen and taken as the interpretation features of the clustering model under this category.
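The count-based summarization above can be reproduced with the data of Table 1; the feature indices below follow the tables:

```python
from collections import Counter

# Interpretation features of each extracted embedded object (Table 1),
# written as sets of feature indices.
per_object = [
    set(range(1, 6)),             # embedded object 1: features 1-5
    set(range(2, 7)),             # embedded object 2: features 2-6
    set(range(7, 12)),            # embedded object 3: features 7-11
    set(range(1, 5)) | {15},      # embedded object 4: features 1-4, 15
    set(range(1, 4)) | {13, 14},  # embedded object 5: features 1-3, 13-14
]

counts = Counter()
for feats in per_object:
    counts.update(feats)          # total occurrences per feature (Table 2)

# Select the 5 most frequent features; ties broken by feature index.
top5 = sorted(counts, key=lambda f: (-counts[f], f))[:5]
print(sorted(top5))
```

The tallied counts match Table 2, and the selected features are features 1-5, as stated above.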
In another example, for a given category, the summarization may compute the sum of the contributions of each interpretation feature under the category and then select the several interpretation features with the largest contribution sums as the interpretation features of the clustering model under that category.
Continuing with the examples in Table 1 and Table 2, the contribution sum of feature 1 equals its contribution in embedded object 1, plus its contribution in embedded object 4, plus its contribution in embedded object 5. Similarly, the contribution sums of the features shown in Table 2 can be computed, and the interpretation features whose contribution sums rank in the top 5 selected as the interpretation features of the clustering model under the category.
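The contribution-sum variant differs only in what is accumulated. A sketch with hypothetical per-object contribution values (in practice these come from the trained interpretation model):

```python
from collections import defaultdict

# Contribution of each interpretation feature within each extracted object
# (hypothetical values, for illustration only).
contributions = [
    {"feature 1": 0.9, "feature 2": 0.5},   # embedded object 1
    {"feature 2": 0.7, "feature 3": 0.2},   # embedded object 2
    {"feature 1": 0.4, "feature 3": 0.3},   # embedded object 3
]

sums = defaultdict(float)
for obj in contributions:
    for feat, c in obj.items():
        sums[feat] += c            # contribution sum per feature

# Features ranked by total contribution under the category.
ranked = sorted(sums, key=sums.get, reverse=True)
print(ranked, dict(sums))
```

The top-ranked features would then be taken as the interpretation features of the clustering model under the category.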
In this embodiment, by summarizing the interpretation features of the embedded objects extracted under each category, the interpretation features of the clustering model under that category are obtained, realizing result interpretation for the clustering model.
As can be seen from the above description, this specification can train an interpretable interpretation model with the features and category labels of the embedded objects, determine, based on the trained interpretation model, the interpretation features by which each embedded object under each category was assigned to that category, and then summarize the interpretation features of the embedded objects in the same category to obtain the interpretation features of the clustering model under that category. This realizes interpretation of the clustering results, thereby providing developers with a basis for correcting deviations in the clustering model, helping to improve the model's generalization ability and performance, and helping to avoid legal and moral risks.
In the following, the computation of feature contributions is described in detail, taking a linear model and a decision tree as the interpretation model in turn.
1. Linear model
In this embodiment, when the interpretation model is a linear model, training the linear model with the features and category labels of the embedded objects yields the weight of each embedded-object feature under each category.
[Table 3, rendered as an image in the original document, lists the weight of each feature under each category: for example, W1 for feature 1 and W2 for feature 2 under category 1.]

Table 3
Referring to the example in Table 3, suppose that in category 1 the weight of feature 1 is W1, the weight of feature 2 is W2, and so on. To compute the contributions of an embedded object's features to its category assignment, first obtain the weight of each feature under the category the object belongs to, then compute the product of the object's feature value and the corresponding weight and take the product as the contribution.
For example, the contribution of feature 1 to the category assignment of embedded object 1 equals the value of feature 1 of embedded object 1 multiplied by W1; the contribution of feature 2 equals the value of feature 2 of embedded object 1 multiplied by W2; and so on, which this specification will not enumerate further.
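For a linear interpretation model, this multiply-and-rank computation can be sketched as follows; the feature names and weight values are hypothetical stand-ins for W1, W2, and so on:

```python
# Weights of each feature under each category, as learned by the trained
# linear interpretation model (hypothetical values).
weights = {
    "category 1": {"age": 0.02, "income": 0.5, "num_first_order": -0.1},
}

def feature_contributions(features, category):
    """Contribution of each feature to the object's category assignment:
    feature value multiplied by the feature's weight under that category."""
    w = weights[category]
    return {f: v * w[f] for f, v in features.items()}

obj = {"age": 30, "income": 2.0, "num_first_order": 4}
contrib = feature_contributions(obj, "category 1")
# Keep the features whose contribution ranks highest (e.g., top 2).
top2 = sorted(contrib, key=contrib.get, reverse=True)[:2]
print(contrib, top2)
```

The top-ranked features serve as the interpretation features by which this object belongs to category 1.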
2. Decision tree
In this embodiment, when the interpretation model is a decision tree, training the decision tree with the features and category labels of the embedded objects yields the split point of each feature in the decision tree.
Referring to the decision tree shown in FIG. 3, each tree node may represent a unique feature; for example, tree node 1 represents the user's age and tree node 2 represents the user's annual income. The split point of a feature in the decision tree usually refers to that feature's threshold. For example, the split point of the age node is 50: when the user's age is at most 50, branch path 12 is selected; when the user's age is greater than 50, branch path 13 is selected; and so on.
In this embodiment, to determine the contributions of an embedded object's features, the embedded object may first be fed into the trained decision tree. Then, as the decision tree classifies the embedded object, the path the object traverses in the tree is determined, and each feature on the path and its split point are obtained.
Still taking FIG. 3 as an example, suppose the path an embedded object traverses in the decision tree is tree node 1 -> tree node 2 -> tree node 4. The features represented by these three tree nodes and the split points of those features are then obtained.
For each obtained feature and its split point, the distance between the object's corresponding feature value and the split point is computed, and this distance is taken as the contribution of the feature to the category assignment of the embedded object.
Again let tree node 1 represent the user's age with a split point of 50. If the user age of an embedded object is 20, the contribution of the age feature is the difference between 50 and 20, that is, 30. Of course, in practical applications, after the distance is computed it may also be normalized, and the normalized result taken as the contribution. This specification places no particular restriction on this.
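The decision-tree case can be sketched with a small hand-built tree. The age node and its split point of 50 follow the description above; the income threshold, node layout, and category labels are hypothetical:

```python
class Node:
    def __init__(self, feature=None, split=None, left=None, right=None, label=None):
        self.feature, self.split = feature, split   # internal node: feature + split point
        self.left, self.right = left, right         # branches: <= split / > split
        self.label = label                          # leaf node: category label

# Tree node 1 (age, split 50) -> tree node 2 (annual income, split 100000) -> leaves.
tree = Node("age", 50,
            left=Node("annual_income", 100_000,
                      left=Node(label="category A"),
                      right=Node(label="category B")),
            right=Node(label="category C"))

def path_contributions(tree, x):
    """Walk the object's decision path; the contribution of each feature on
    the path is the distance |feature value - split point|."""
    contrib, node = {}, tree
    while node.label is None:
        v = x[node.feature]
        contrib[node.feature] = abs(v - node.split)
        node = node.left if v <= node.split else node.right
    return node.label, contrib

label, contrib = path_contributions(tree, {"age": 20, "annual_income": 60_000})
print(label, contrib)
```

For the age feature this reproduces the example above: a user age of 20 against a split point of 50 yields a contribution of 30. Normalizing the distances, as mentioned, would be a post-processing step on the returned dict.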
This specification also provides a method for interpreting the identification results of a risk gang identification model.
On the one hand, an embedding algorithm may be used to embed the user nodes in a user network graph to obtain an embedding result for each user node, and a risk gang identification model may then identify the embedding results to obtain the risk gang label to which each user node belongs.
On the other hand, the features of the user nodes and the risk gang labels may be used to train an interpretable interpretation model. After training, for each risk gang, several user nodes may be extracted from the gang, and the interpretation features by which each extracted user node belongs to the gang determined based on the node's features and the trained interpretation model. The interpretation features of the user nodes extracted from the same risk gang may then be summarized to obtain the interpretation features of the risk gang identification model for that gang.
In this embodiment, the interpretation features of each risk gang identified by the risk gang identification model can be obtained.
For example, the interpretation features of risk gang 1 may include: no fixed occupation, annual income below 80,000, permanent residence in Guangxi, age 18-25, and so on, indicating that the risk gang identification model identified risk gang 1 through these user features.
As another example, the interpretation features of risk gang 2 may include: no fixed occupation, annual income below 100,000, permanent residence in Yunnan, age 20-28, a Wi-Fi network SSID of 12345, and so on, indicating that the risk gang identification model identified risk gang 2 through these user features.
This specification also provides a method for interpreting the clustering results of a text clustering model.
On the one hand, an embedding algorithm may be used to embed the words in the texts to be clustered to obtain an embedding result for each text, and a text clustering model may then cluster the embedding results to obtain the category label to which each text belongs.
On the other hand, the features of the texts and the category labels may be used to train an interpretable interpretation model. After training, for each category, several texts may be extracted from the category, and the interpretation features by which each extracted text belongs to the category determined based on the text's features and the trained interpretation model. The interpretation features of the texts extracted from the same category may then be summarized to obtain the interpretation features of the text clustering model under that category.
In this embodiment, the interpretation features of each text category produced by the text clustering model can be obtained.
For example, the interpretation features of technology texts may include: computer, artificial intelligence, innovation, and a frequency of the word "technology" greater than 0.01, indicating that the text clustering model identified texts of the technology category through these features.
As another example, the interpretation features of sports texts may include: football, basketball, sports, swimming, records, and so on, indicating that the text clustering model identified texts of the sports category through these features.
Corresponding to the foregoing embodiments of the method for interpreting clustering results, this specification also provides embodiments of an apparatus for interpreting clustering results.
The apparatus embodiments of this specification may be applied on a server, and may be implemented by software, by hardware, or by a combination of hardware and software. Taking software implementation as an example, the apparatus in the logical sense is formed by the processor of the server on which it resides reading the corresponding computer program instructions from non-volatile memory into memory and running them. At the hardware level, Fig. 4 shows a hardware structure diagram of the server on which the apparatus for interpreting clustering results resides; in addition to the processor, memory, network interface, and non-volatile memory shown in Fig. 4, the server may include other hardware according to its actual functions, which will not be described further here.
Fig. 5 is a block diagram of an apparatus for interpreting clustering results according to an exemplary embodiment of this specification.
Referring to Fig. 5, the apparatus 400 for interpreting clustering results may be applied to the server shown in Fig. 4 and includes: an embedding processing unit 401, an object clustering unit 402, a model training unit 403, an object extraction unit 404, a feature determination unit 405, and a feature aggregation unit 406.
The embedding processing unit 401 embeds the objects to be embedded using an embedding algorithm to obtain an embedding result for each embedded object;
the object clustering unit 402 clusters the embedding results using a clustering model to obtain a category label for each embedded object;
the model training unit 403 trains an interpretation model using the features and category labels of the embedded objects;
the object extraction unit 404 extracts, for each category, several embedded objects from the category;
the feature determination unit 405 determines, based on the features of each extracted embedded object and the trained interpretation model, the interpretation features by which the embedded object belongs to the category;
the feature aggregation unit 406 aggregates the interpretation features of the embedded objects extracted under the same category to obtain the interpretation features of the clustering model under that category.
Optionally, the feature determination unit 405:
for each embedded object, calculates, based on the trained interpretation model, the contribution value of each feature of the embedded object to the category division result; and
extracts features whose contribution values satisfy a predetermined condition as the interpretation features by which the embedded object belongs to the category.
Optionally, when the interpretation model is a linear model, the feature determination unit 405:
obtains the weight of each feature in the trained linear model under the category to which the embedded object belongs; and
calculates the product of the embedded object's feature value and the corresponding weight as the contribution value of the feature to the category division result of the embedded object.
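As a concrete illustration of the linear-model case, the contribution of each feature is simply the element-wise product of the object's feature vector and the weight vector of the category to which it was assigned. The weight matrix and feature values below are hypothetical:

```python
import numpy as np

def linear_contributions(x, weights, category):
    """Contribution of each feature under a linear interpretation model:
    the feature value multiplied by that feature's weight for the given
    category. `weights` has shape (n_categories, n_features)."""
    return x * weights[category]

# Hypothetical trained weights: 2 categories, 3 features.
w = np.array([[0.5, -1.0, 2.0],
              [1.5,  0.2, -0.3]])
x = np.array([2.0, 1.0, 0.5])  # one embedded object's feature values

contrib = linear_contributions(x, w, category=0)
# contrib = [1.0, -1.0, 1.0]
```

Features with the largest contributions (here, any of the three, depending on the predetermined condition) would be kept as the object's interpretation features.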
Optionally, when the interpretation model is a decision tree, the feature determination unit 405:
in the process of classifying the embedded object with the trained decision tree, obtains the split point of each feature on the path traversed by the embedded object; and
calculates the distance between the feature's split point and the corresponding feature value of the embedded object as the contribution value of the feature to the category division result of the embedded object.
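A sketch of the decision-tree case, using scikit-learn's `decision_path` to recover the internal nodes an object traverses and taking, for each such node, the distance between the node's split point and the object's value of the split feature. The toy training data are hypothetical, and scikit-learn is an assumed implementation choice, not one named by this specification:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: two features, two categories.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

def path_contributions(tree, x):
    """For each internal node on x's decision path, record the distance
    between that node's split point and x's value of the split feature,
    as the feature's contribution to the category division result."""
    t = tree.tree_
    node_index = tree.decision_path(x.reshape(1, -1)).indices
    contrib = {}
    for node in node_index:
        feat = t.feature[node]
        if feat >= 0:  # internal node; leaf nodes have feature == -2
            contrib[feat] = abs(x[feat] - t.threshold[node])
    return contrib

print(path_contributions(tree, np.array([0.1, 0.05])))
```

A larger distance from the split point indicates the object lies further from the decision boundary for that feature, which is the sense in which the distance is read as a contribution value here.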
Optionally, the feature determination unit 405:
sorts the features in descending order of contribution value; and
extracts the top-N features as the interpretation features by which the embedded object belongs to the category, N being a natural number greater than or equal to 1.
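The top-N extraction step can be sketched as follows; the feature names and contribution values are hypothetical:

```python
def top_n_features(contributions, n=3):
    """Sort features by contribution value in descending order and keep
    the top N as the object's interpretation features."""
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:n]]

# Hypothetical contribution values for one embedded object.
contrib = {"age": 0.9, "income": 0.4, "city": 1.2, "ssid": 0.1}
result = top_n_features(contrib, n=2)
# result == ['city', 'age']
```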
Optionally, the features include: original features and topological features.
Optionally, the topological features include one or more of the following:
the number of first-order neighbors, the number of second-order neighbors, the average number of neighbors of the first-order neighbors, and a statistic of the first-order neighbors under a specified original feature dimension.
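These topological features can be computed from a plain adjacency structure. The graph and the "income" original feature below are hypothetical, and the statistic over first-order neighbors is taken to be the mean as one example of the "statistic under a specified original feature dimension":

```python
def topology_features(adj, node, raw_feature):
    """Compute the four topological features listed above for one node.
    `adj` maps node -> set of neighbors; `raw_feature` maps node -> value
    of a specified original feature."""
    first = adj[node]
    second = set()
    for n in first:
        second |= adj[n]
    second -= first | {node}  # second-order = neighbors-of-neighbors only
    avg_deg = sum(len(adj[n]) for n in first) / len(first) if first else 0.0
    feat_mean = sum(raw_feature[n] for n in first) / len(first) if first else 0.0
    return {
        "n_first": len(first),
        "n_second": len(second),
        "avg_first_neighbor_degree": avg_deg,
        "first_neighbor_feature_mean": feat_mean,
    }

# Hypothetical user graph and per-user original feature.
adj = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}
income = {"a": 5.0, "b": 10.0, "c": 20.0, "d": 8.0}
feats = topology_features(adj, "a", income)
# feats == {"n_first": 2, "n_second": 1,
#           "avg_first_neighbor_degree": 1.5,
#           "first_neighbor_feature_mean": 15.0}
```

Such features let the interpretation model explain cluster membership in terms of graph structure as well as the objects' own attributes.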
For details of how the functions and roles of the units in the above apparatus are implemented, refer to the implementation of the corresponding steps in the above method, which will not be repeated here.
Since the apparatus embodiments basically correspond to the method embodiments, for relevant details refer to the description of the method embodiments. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solution in this specification. Persons of ordinary skill in the art can understand and implement the solution without creative effort.
The systems, apparatuses, modules, or units set forth in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smart phone, personal digital assistant, media player, navigation device, e-mail transceiver, game console, tablet computer, wearable device, or any combination of these devices.
Corresponding to the foregoing embodiments of the method for interpreting clustering results, this specification also provides an apparatus for interpreting clustering results, which includes a processor and a memory for storing machine-executable instructions. The processor and the memory are typically connected to each other via an internal bus. In other possible implementations, the device may further include an external interface for communicating with other devices or components.
In this embodiment, by reading and executing the machine-executable instructions stored in the memory that correspond to the clustering-result interpretation logic, the processor is caused to:
embed the objects to be embedded using an embedding algorithm to obtain an embedding result for each embedded object;
cluster the embedding results using a clustering model to obtain a category label for each embedded object;
train an interpretation model using the features and category labels of the embedded objects;
for each category, extract several embedded objects from the category;
determine, based on the features of each extracted embedded object and the trained interpretation model, the interpretation features by which the embedded object belongs to the category; and
aggregate the interpretation features of the embedded objects extracted under the same category to obtain the interpretation features of the clustering model under that category.
Optionally, when determining, based on the features of each extracted embedded object and the trained interpretation model, the interpretation features by which the embedded object belongs to the category, the processor is caused to:
for each embedded object, calculate, based on the trained interpretation model, the contribution value of each feature of the embedded object to the category division result; and
extract features whose contribution values satisfy a predetermined condition as the interpretation features of the embedded object.
Optionally, when the interpretation model is a linear model, when calculating, based on the trained interpretation model, the contribution value of each feature of the embedded object to the category division result, the processor is caused to:
obtain the weight of each feature in the trained linear model under the category to which the embedded object belongs; and
calculate the product of the embedded object's feature value and the corresponding weight as the contribution value of the feature to the category division result of the embedded object.
Optionally, when the interpretation model is a decision tree, when calculating, based on the trained interpretation model, the contribution value of each feature of the embedded object to the category division result, the processor is caused to:
in the process of classifying the embedded object with the trained decision tree, obtain the split point of each feature on the path traversed by the embedded object; and
calculate the distance between the feature's split point and the corresponding feature value of the embedded object as the contribution value of the feature to the category division result of the embedded object.
Optionally, when extracting features whose contribution values satisfy a predetermined condition as the interpretation features by which the embedded object belongs to the category, the processor is caused to:
sort the features in descending order of contribution value; and
extract the top-N features as the interpretation features by which the embedded object belongs to the category, N being a natural number greater than or equal to 1.
Optionally, the features include: original features and topological features.
Optionally, the topological features include one or more of the following:
the number of first-order neighbors, the number of second-order neighbors, the average number of neighbors of the first-order neighbors, and a statistic of the first-order neighbors under a specified original feature dimension.
Corresponding to the foregoing embodiments of the method for interpreting clustering results, this specification also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
embedding the objects to be embedded using an embedding algorithm to obtain an embedding result for each embedded object;
clustering the embedding results using a clustering model to obtain a category label for each embedded object;
training an interpretation model using the features and category labels of the embedded objects;
for each category, extracting several embedded objects from the category;
determining, based on the features of each extracted embedded object and the trained interpretation model, the interpretation features by which the embedded object belongs to the category; and
aggregating the interpretation features of the embedded objects extracted under the same category to obtain the interpretation features of the clustering model under that category.
Optionally, determining, based on the features of each extracted embedded object and the trained interpretation model, the interpretation features by which the embedded object belongs to the category includes:
for each embedded object, calculating, based on the trained interpretation model, the contribution value of each feature of the embedded object to the category division result; and
extracting features whose contribution values satisfy a predetermined condition as the interpretation features by which the embedded object belongs to the category.
Optionally, when the interpretation model is a linear model, calculating, based on the trained interpretation model, the contribution value of each feature of the embedded object to the category division result includes:
obtaining the weight of each feature in the trained linear model under the category to which the embedded object belongs; and
calculating the product of the embedded object's feature value and the corresponding weight as the contribution value of the feature to the category division result of the embedded object.
Optionally, when the interpretation model is a decision tree, calculating, based on the trained interpretation model, the contribution value of each feature of the embedded object to the category division result includes:
in the process of classifying the embedded object with the trained decision tree, obtaining the split point of each feature on the path traversed by the embedded object; and
calculating the distance between the feature's split point and the corresponding feature value of the embedded object as the contribution value of the feature to the category division result of the embedded object.
Optionally, extracting features whose contribution values satisfy a predetermined condition as the interpretation features by which the embedded object belongs to the category includes:
sorting the features in descending order of contribution value; and
extracting the top-N features as the interpretation features by which the embedded object belongs to the category, N being a natural number greater than or equal to 1.
Optionally, the features include: original features and topological features.
Optionally, the topological features include one or more of the following:
the number of first-order neighbors, the number of second-order neighbors, the average number of neighbors of the first-order neighbors, and a statistic of the first-order neighbors under a specified original feature dimension.
The foregoing describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
The above are merely preferred embodiments of this specification and are not intended to limit it. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this specification shall fall within its scope of protection.

Claims (17)

  1. A method for interpreting clustering results, comprising:
    embedding objects to be embedded using an embedding algorithm to obtain an embedding result for each embedded object;
    clustering the embedding results using a clustering model to obtain a category label for each embedded object;
    training an interpretation model using the features and category labels of the embedded objects;
    for each category, extracting several embedded objects from the category;
    determining, based on the features of each extracted embedded object and the trained interpretation model, the interpretation features by which the embedded object belongs to the category; and
    aggregating the interpretation features of the embedded objects extracted under the same category to obtain the interpretation features of the clustering model under that category.
  2. The method according to claim 1, wherein determining, based on the features of each extracted embedded object and the trained interpretation model, the interpretation features by which the embedded object belongs to the category comprises:
    for each embedded object, calculating, based on the trained interpretation model, the contribution value of each feature of the embedded object to the category division result; and
    extracting features whose contribution values satisfy a predetermined condition as the interpretation features by which the embedded object belongs to the category.
  3. The method according to claim 2, wherein, when the interpretation model is a linear model, calculating, based on the trained interpretation model, the contribution value of each feature of the embedded object to the category division result comprises:
    obtaining the weight of each feature in the trained linear model under the category to which the embedded object belongs; and
    calculating the product of the embedded object's feature value and the corresponding weight as the contribution value of the feature to the category division result of the embedded object.
  4. The method according to claim 2, wherein, when the interpretation model is a decision tree, calculating, based on the trained interpretation model, the contribution value of each feature of the embedded object to the category division result comprises:
    in the process of classifying the embedded object with the trained decision tree, obtaining the split point of each feature on the path traversed by the embedded object; and
    calculating the distance between the feature's split point and the corresponding feature value of the embedded object as the contribution value of the feature to the category division result of the embedded object.
  5. The method according to claim 2, wherein extracting features whose contribution values satisfy a predetermined condition as the interpretation features by which the embedded object belongs to the category comprises:
    sorting the features in descending order of contribution value; and
    extracting the top-N features as the interpretation features by which the embedded object belongs to the category, N being a natural number greater than or equal to 1.
  6. The method according to claim 1, wherein
    the features include: original features and topological features.
  7. The method according to claim 6, wherein the topological features include one or more of the following:
    the number of first-order neighbors, the number of second-order neighbors, the average number of neighbors of the first-order neighbors, and a statistic of the first-order neighbors under a specified original feature dimension.
  8. A method for interpreting the identification results of a risk group identification model, comprising:
    embedding user nodes using an embedding algorithm to obtain an embedding result for each user node;
    identifying the embedding results using a risk group identification model to obtain the risk group label to which each user node belongs;
    training an interpretation model using the features of the user nodes and the risk group labels;
    for each risk group, extracting several user nodes from the risk group;
    determining, based on the features of each extracted user node and the trained interpretation model, the interpretation features by which the user node belongs to the risk group; and
    aggregating the interpretation features of the user nodes extracted from the same risk group to obtain the interpretation features of the risk group identification model for that risk group.
  9. A method for interpreting the clustering results of a text clustering model, comprising:
    embedding texts to be clustered using an embedding algorithm to obtain an embedding result for each text;
    clustering the embedding results using a text clustering model to obtain a category label for each text;
    training an interpretation model using the features of the texts and the category labels;
    for each category, extracting several texts from the category;
    determining, based on the features of each extracted text and the trained interpretation model, the interpretation features by which the text belongs to the category; and
    aggregating the interpretation features of the texts extracted from the same category to obtain the interpretation features of the text clustering model under that category.
  10. An apparatus for interpreting clustering results, comprising:
    an embedding processing unit that embeds objects to be embedded using an embedding algorithm to obtain an embedding result for each embedded object;
    an object clustering unit that clusters the embedding results using a clustering model to obtain a category label for each embedded object;
    a model training unit that trains an interpretation model using the features and category labels of the embedded objects;
    an object extraction unit that, for each category, extracts several embedded objects from the category;
    a feature determination unit that determines, based on the features of each extracted embedded object and the trained interpretation model, the interpretation features by which the embedded object belongs to the category; and
    a feature aggregation unit that aggregates the interpretation features of the embedded objects extracted under the same category to obtain the interpretation features of the clustering model under that category.
  11. The apparatus according to claim 10, wherein the feature determination unit:
    for each embedded object, calculates, based on the trained interpretation model, the contribution value of each feature of the embedded object to the category division result; and
    extracts features whose contribution values satisfy a predetermined condition as the interpretation features by which the embedded object belongs to the category.
  12. The apparatus according to claim 11, wherein, when the interpretation model is a linear model, the feature determination unit:
    obtains the weight of each feature in the trained linear model under the category to which the embedded object belongs; and
    calculates the product of the embedded object's feature value and the corresponding weight as the contribution value of the feature to the category division result of the embedded object.
  13. The apparatus according to claim 11, wherein, when the interpretation model is a decision tree, the feature determination unit:
    in the process of classifying the embedded object with the trained decision tree, obtains the split point of each feature on the path traversed by the embedded object; and
    calculates the distance between the feature's split point and the corresponding feature value of the embedded object as the contribution value of the feature to the category division result of the embedded object.
  14. The apparatus according to claim 11, wherein the feature determination unit:
    sorts the features in descending order of contribution value; and
    extracts the top-N features as the interpretation features by which the embedded object belongs to the category, N being a natural number greater than or equal to 1.
  15. The apparatus according to claim 10, wherein
    the features include: original features and topological features.
  16. The apparatus according to claim 15, wherein the topological features include one or more of the following:
    the number of first-order neighbors, the number of second-order neighbors, the average number of neighbors of the first-order neighbors, and a statistic of the first-order neighbors under a specified original feature dimension.
  17. An apparatus for interpreting clustering results, comprising:
    a processor; and
    a memory for storing machine-executable instructions;
    wherein, by reading and executing the machine-executable instructions stored in the memory that correspond to clustering-result interpretation logic, the processor is caused to:
    embed objects to be embedded using an embedding algorithm to obtain an embedding result for each embedded object;
    cluster the embedding results using a clustering model to obtain a category label for each embedded object;
    train an interpretation model using the features and category labels of the embedded objects;
    for each category, extract several embedded objects from the category;
    determine, based on the features of each extracted embedded object and the trained interpretation model, the interpretation features by which the embedded object belongs to the category; and
    aggregate the interpretation features of the embedded objects extracted under the same category to obtain the interpretation features of the clustering model under that category.
PCT/CN2019/112090 2018-12-04 2019-10-21 Clustering result interpretation method and device WO2020114108A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811471749.9 2018-12-04
CN201811471749.9A CN110046634B (en) 2018-12-04 2018-12-04 Interpretation method and device of clustering result

Publications (1)

Publication Number Publication Date
WO2020114108A1 true WO2020114108A1 (en) 2020-06-11

Family

ID=67273278

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/112090 WO2020114108A1 (en) 2018-12-04 2019-10-21 Clustering result interpretation method and device

Country Status (3)

Country Link
CN (1) CN110046634B (en)
TW (1) TWI726420B (en)
WO (1) WO2020114108A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395500A (en) * 2020-11-17 2021-02-23 平安科技(深圳)有限公司 Content data recommendation method and device, computer equipment and storage medium

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046634B (en) * 2018-12-04 2021-04-27 创新先进技术有限公司 Interpretation method and device of clustering result
CN110766040B (en) * 2019-09-03 2024-02-06 创新先进技术有限公司 Method and device for risk clustering of transaction risk data
CN111126442B (en) * 2019-11-26 2021-04-30 北京京邦达贸易有限公司 Method for generating key attribute of article, method and device for classifying article
CN111401570B (en) * 2020-04-10 2022-04-12 支付宝(杭州)信息技术有限公司 Interpretation method and device for privacy tree model
CN111784181B (en) * 2020-07-13 2023-09-19 南京大学 Evaluation result interpretation method for criminal reconstruction quality evaluation system
CN112116028B (en) * 2020-09-29 2024-04-26 联想(北京)有限公司 Model decision interpretation realization method and device and computer equipment
CN113284027B (en) * 2021-06-10 2023-05-09 支付宝(杭州)信息技术有限公司 Training method of partner recognition model, abnormal partner recognition method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105022754A (en) * 2014-04-29 2015-11-04 腾讯科技(深圳)有限公司 Social network based object classification method and apparatus
CN106682095A (en) * 2016-12-01 2017-05-17 浙江大学 Subject term and descriptor prediction and ranking method based on graph
US20170277721A1 (en) * 2014-12-05 2017-09-28 Microsoft Technology Licensing, Llc Image Annotation Using Aggregated Page Information From Active and Inactive Indices
US20170347964A1 (en) * 2015-10-16 2017-12-07 General Electric Company System and method of adaptive interpretation of ecg waveforms
CN108268554A (en) * 2017-01-03 2018-07-10 ***通信有限公司研究院 Method and apparatus for generating a spam SMS filtering strategy
CN108280755A (en) * 2018-02-28 2018-07-13 阿里巴巴集团控股有限公司 Suspicious money laundering gang recognition method and recognition device
CN110046634A (en) * 2018-12-04 2019-07-23 阿里巴巴集团控股有限公司 Clustering result interpretation method and device

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004054680A (en) * 2002-07-22 2004-02-19 Fujitsu Ltd Parallel efficiency calculation method
US9507858B1 (en) * 2007-02-28 2016-11-29 Google Inc. Selectively merging clusters of conceptually related words in a generative model for text
CN102081627B (en) * 2009-11-27 2014-09-17 北京金山办公软件有限公司 Method and system for determining contribution degree of word in text
US20130091150A1 (en) * 2010-06-30 2013-04-11 Jian-Ming Jin Determining similarity between elements of an electronic document
CN103164713B (en) * 2011-12-12 2016-04-06 阿里巴巴集团控股有限公司 Image classification method and device
CN104239338A (en) * 2013-06-19 2014-12-24 阿里巴巴集团控股有限公司 Information recommendation method and information recommendation device
CN104346336A (en) * 2013-07-23 2015-02-11 广州华久信息科技有限公司 Machine text mutual-curse based emotional venting method and system
WO2016004075A1 (en) * 2014-06-30 2016-01-07 Amazon Technologies, Inc. Interactive interfaces for machine learning model evaluations
CN104346459B (en) * 2014-11-10 2017-10-27 南京信息工程大学 Text classification feature selection method based on term frequency and chi-square statistic
SG11201900220RA (en) * 2016-07-18 2019-02-27 Nantomics Inc Distributed machine learning systems, apparatus, and methods
US11621969B2 (en) * 2017-04-26 2023-04-04 Elasticsearch B.V. Clustering and outlier detection in anomaly and causation detection for computing environments
CN107203787B (en) * 2017-06-14 2021-01-08 江西师范大学 Unsupervised regularization matrix decomposition feature selection method
CN108090048B (en) * 2018-01-12 2021-05-25 安徽大学 College evaluation system based on multivariate data analysis
CN108153899B (en) * 2018-01-12 2021-11-02 安徽大学 Intelligent text classification method
CN108319682B (en) * 2018-01-31 2021-12-28 天闻数媒科技(北京)有限公司 Method, device, equipment and medium for correcting classifier and constructing classification corpus
CN108875816A (en) * 2018-06-05 2018-11-23 南京邮电大学 Active learning sample selection strategy merging reliability and diversity criteria

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395500A (en) * 2020-11-17 2021-02-23 平安科技(深圳)有限公司 Content data recommendation method and device, computer equipment and storage medium
CN112395500B (en) * 2020-11-17 2023-09-05 平安科技(深圳)有限公司 Content data recommendation method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN110046634B (en) 2021-04-27
TWI726420B (en) 2021-05-01
TW202022716A (en) 2020-06-16
CN110046634A (en) 2019-07-23

Similar Documents

Publication Publication Date Title
WO2020114108A1 (en) Clustering result interpretation method and device
CN110309331B (en) Cross-modal deep hash retrieval method based on self-supervision
CN111523621B (en) Image recognition method and device, computer equipment and storage medium
US20200143248A1 (en) Machine learning model training method and device, and expression image classification method and device
WO2020073507A1 (en) Text classification method and terminal
WO2020238053A1 (en) Neural network model-based text data category recognition method and apparatus, nonvolatile readable storage medium, and computer device
WO2019200782A1 (en) Sample data classification method, model training method, electronic device and storage medium
CN104750798B (en) Recommendation method and device for application program
CN105022754B (en) Object classification method and device based on social network
TW201909112A (en) Image feature acquisition
US10637826B1 (en) Policy compliance verification using semantic distance and nearest neighbor search of labeled content
CN107786943B (en) User grouping method and computing device
CN109697451B (en) Similar image clustering method and device, storage medium and electronic equipment
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
WO2017092623A1 (en) Method and device for representing text as vector
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
WO2020114109A1 (en) Interpretation method and apparatus for embedding result
JP6004015B2 (en) Learning method, information processing apparatus, and learning program
CN112632984A (en) Graph model mobile application classification method based on description text word frequency
CN112348079A (en) Data dimension reduction processing method and device, computer equipment and storage medium
CN113656373A (en) Method, device, equipment and storage medium for constructing retrieval database
CN111062440B (en) Sample selection method, device, equipment and storage medium
CN112668482A (en) Face recognition training method and device, computer equipment and storage medium
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
WO2020147259A1 (en) User portrait method and apparatus, readable storage medium, and terminal device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19892123; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19892123; Country of ref document: EP; Kind code of ref document: A1)