CN109903082B - Clustering method based on user portrait, electronic device and storage medium - Google Patents

Info

Publication number
CN109903082B
Authority
CN
China
Prior art keywords
user
variables
weight
characteristic
discrete
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910068877.7A
Other languages
Chinese (zh)
Other versions
CN109903082A (en)
Inventor
金戈
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910068877.7A (patent CN109903082B)
Priority to PCT/CN2019/089151 (patent WO2020151152A1)
Publication of CN109903082A
Application granted
Publication of CN109903082B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Databases & Information Systems (AREA)
  • Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Data Mining & Analysis (AREA)
  • Technology Law (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to data analysis technology and provides a clustering method based on user portraits, comprising: acquiring user features and feature variables of a plurality of users; converting the user features into word vectors; clustering the word vectors to determine the category of each user feature; dividing the feature variables into continuous variables and discrete variables; performing quantization processing on the discrete variables and the continuous variables; selecting the preferred user feature categories and assigning a weight greater than 1 to the quantized discrete variables and continuous variables of those categories; and clustering all the quantized discrete variables and continuous variables to obtain biased user feature clusters. The invention also provides an electronic device and a storage medium. The invention achieves targeted clustering while retaining all feature information.

Description

Clustering method based on user portrait, electronic device and storage medium
Technical Field
The present invention relates to the field of data analysis technologies, and in particular, to a user portrait based clustering method, an electronic device, and a storage medium.
Background
The concept of the user portrait has emerged to support precision marketing and deeper mining of potential business value. A user portrait is a labeling of user information. Each label is usually a highly refined feature identifier, such as age, gender, or user preference, and combining all of a user's labels outlines a three-dimensional portrait that abstracts the overall picture of the user's information. At present, when user portraits are clustered, the data sources are generally only divided into broad groups such as life attributes and behavior attributes, so accurate, targeted clustering cannot be achieved.
Disclosure of Invention
In view of the foregoing, it is an object of the present invention to provide a user-portrait-based clustering method, an electronic device, and a storage medium, which can perform targeted clustering while retaining all feature information.
To achieve the above object, the present invention provides an electronic device, comprising a memory and a processor, wherein the memory comprises a user-portrait-based clustering program, and the user-portrait-based clustering program, when executed by the processor, implements the following steps:
acquiring user characteristics of a plurality of users and characteristic variables corresponding to the user characteristics;
converting the user characteristics into word vectors;
clustering the word vectors, and determining the category of each user feature;
dividing the feature variables corresponding to the user features into continuous variables and discrete variables, wherein the continuous variables are numerical variables with an ordering attribute and the discrete variables are non-numerical variables;
performing quantization processing on the discrete variables and the continuous variables;
selecting the preferred user feature categories, and assigning a weight greater than 1 to the quantized discrete variables and continuous variables of the preferred user feature categories, wherein the preference refers to the bias of the clustering process;
and clustering all the quantized discrete variables and continuous variables to obtain biased user feature clusters.
In addition, in order to achieve the above object, the present invention further provides a user portrait-based clustering method, including:
acquiring user characteristics of a plurality of users and corresponding characteristic variables thereof;
converting the user characteristics into word vectors;
clustering the word vectors, and determining the category of each user feature;
dividing the feature variables into continuous variables and discrete variables, wherein the continuous variables are numerical variables with an ordering attribute and the discrete variables are non-numerical variables;
performing quantization processing on the discrete variables and the continuous variables;
selecting the preferred user feature categories, and assigning a weight greater than 1 to the quantized discrete variables and continuous variables of the preferred user feature categories, wherein the preference refers to the bias of the clustering process;
and clustering all the quantized discrete variables and continuous variables to obtain biased user feature clusters.
Preferably, the method for quantizing the discrete variables and the continuous variables includes:
converting ordered discrete variables into numerical form;
converting discrete variables that are unordered and whose number of distinct values exceeds a set number into a higher-level form;
encoding the discrete variables converted into the higher-level form;
and normalizing the encoded discrete variables, the ordered discrete variables, and the continuous variables.
Preferably, the method for assigning a weight greater than 1 to the quantized discrete variables and continuous variables of the preferred user feature categories comprises:
counting the number n of categories obtained by clustering the user features;
varying the weight of the feature variables of the preferred user feature category within the range greater than 1 and not greater than n-1;
and determining the optimal weight according to the contour coefficient (silhouette coefficient) and/or the interpretability of the clusters obtained after weighting.
Further, preferably, the method further comprises:
taking the clustering result corresponding to the optimal weight as the optimal biased user feature clustering, which comprises:
calculating the contour coefficient of each cluster according to the formula
Figure BDA0001956631660000031
where $s_i$ is the contour coefficient of the i-th cluster, and $a_i$ and $b_i$ are the two feature variables with the largest distance that belong to different categories in the i-th clustering result;
and repeating the above steps to obtain a curve of the contour coefficient as a function of the weight, observing whether the curve has an extreme point, taking the weight corresponding to the maximum contour coefficient as the optimal weight, and taking the clustering result corresponding to the maximum contour coefficient as the optimal biased user feature clustering.
Furthermore, preferably, the preferred user feature category is one or more categories. When there is one preferred category, the weight of the feature variables of the preferred user features is in the range greater than 1 and not greater than n-1. When there are multiple preferred categories, the weight of the feature variables of each preferred category is greater than 1 and the sum of the weights is not greater than n-1, where n is the number of categories obtained by clustering the user features.
Furthermore, preferably, the method for assigning a weight greater than 1 to the quantized discrete variables and continuous variables of the preferred user feature categories includes:
counting the total number of user features and the number of user features belonging to each user feature category;
and assigning to a preferred user feature category a weight that is greater than 1 and no greater than the value at which the weighted number of user features of that category equals the total number of user features of the other categories.
In addition, in order to achieve the above object, the present invention further provides a computer readable storage medium, where the computer readable storage medium includes a user portrait based clustering program, and when the user portrait based clustering program is executed by a processor, the steps of the user portrait based clustering method are implemented.
The user-portrait-based clustering method, the electronic device, and the computer-readable storage medium achieve targeted clustering while retaining all feature information. In addition, because ordered and unordered discrete features are processed separately, the overall precision is improved.
Drawings
FIG. 1 is a schematic diagram of an application environment of a preferred embodiment of the user-portrait-based clustering method according to the present invention;
FIG. 2 is a block diagram of a preferred embodiment of the user-portrait-based clustering program of FIG. 1;
FIG. 3 is a flow chart of a preferred embodiment of the user-portrait-based clustering method of the present invention.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The invention provides a clustering method based on user portrait, which is applied to an electronic device 1. FIG. 1 is a schematic diagram of an application environment of a user portrait-based clustering method according to a preferred embodiment of the present invention.
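In the present embodiment, the electronic device 1 may be a terminal device with computing capability, such as a server, a mobile phone, a tablet computer, a portable computer, or a desktop computer. The electronic device 1 comprises a memory 11, a processor 12, a network interface 13, and a communication bus 14.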
The memory 11 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory, and the like. In some embodiments, the readable storage medium may be an internal storage unit of the electronic apparatus 1, such as a hard disk of the electronic apparatus 1. In other embodiments, the readable storage medium may also be an external memory of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1.
In the present embodiment, the readable storage medium of the memory 11 is generally used for storing the user portrait based clustering program 10 and the like installed in the electronic device 1. The memory 11 may also be used for temporarily storing data that has been output or is to be output.
The processor 12, which in some embodiments may be a central processing unit (CPU), a microprocessor, or another data processing chip, runs program code or processes data stored in the memory 11, for example by executing the user-portrait-based clustering program 10.
The network interface 13 may optionally comprise a standard wired interface and a wireless interface (e.g. a Wi-Fi interface), and is typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
The communication bus 14 is used to enable connection communication between these components.
Fig. 1 only shows the electronic device 1 with components 11-14, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may alternatively be implemented.
Optionally, the electronic device 1 may further include a user interface, which may include an input unit such as a keyboard, a voice input device such as a microphone or another device with a voice recognition function, and a voice output device such as a speaker or a headset. Optionally, the user interface may further include a standard wired interface and a wireless interface.
Optionally, the electronic device 1 may further comprise a display, which may also be referred to as a display screen or a display unit.
In some embodiments, the display device may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch device, or the like. The display is used for displaying information processed in the electronic apparatus 1 and for displaying a visualized user interface.
Optionally, the electronic device 1 further comprises a touch sensor. The area provided by the touch sensor for the user to perform touch operation is called a touch area. Further, the touch sensor described herein may be a resistive touch sensor, a capacitive touch sensor, or the like. The touch sensor may include not only a contact type touch sensor but also a proximity type touch sensor. Further, the touch sensor may be a single sensor, or may be a plurality of sensors arranged in an array, for example.
Optionally, the electronic device 1 may further include logic gates, sensors, audio circuits, and the like, which are not described herein.
In the apparatus embodiment shown in FIG. 1, the memory 11, which is a type of computer storage medium, may include an operating system and the user-portrait-based clustering program 10; the processor 12 implements the following steps when executing the user-portrait-based clustering program 10 stored in the memory 11:
acquiring user characteristics of a plurality of users and characteristic variables corresponding to the user characteristics;
converting the user characteristics into word vectors;
clustering the word vectors, and determining the category of each user feature;
dividing the feature variables corresponding to the user features into continuous variables and discrete variables, wherein the continuous variables are numerical variables with an ordering attribute and the discrete variables are non-numerical variables;
performing quantization processing on the discrete variables and the continuous variables;
selecting the preferred user feature categories, and assigning a weight greater than 1 to the quantized discrete variables and continuous variables of the preferred user feature categories, wherein the preference refers to the bias of the clustering process;
and clustering all the quantized discrete variables and continuous variables to obtain biased user feature clusters.
In other embodiments, the user-portrait-based clustering program 10 may be further partitioned into one or more modules, which are stored in the memory 11 and executed by the processor 12 to implement the present invention. A module as referred to herein is a series of computer program instruction segments capable of performing a specified function. FIG. 2 is a functional block diagram of a preferred embodiment of the user-portrait-based clustering program 10 of FIG. 1. The user-portrait-based clustering program 10 may be partitioned into:
a user feature acquisition module 110, which acquires user features of a plurality of users and their corresponding feature variables;
a conversion module 120, which converts the user features into word vectors;
a first clustering module 130, which clusters the word vectors and determines the category of each user feature;
a dividing module 140, which divides the feature variables into continuous variables and discrete variables, wherein the continuous variables are numerical variables with an ordering attribute and the discrete variables are non-numerical variables;
a quantization module 150, which performs quantization processing on the discrete variables and the continuous variables;
a preference selection module 160, which selects the preferred user feature categories and assigns a weight greater than 1 to the quantized discrete variables and continuous variables of the preferred user feature categories, wherein the preference refers to the user features of interest and is also the bias of the clustering process;
a second clustering module 170, which clusters all the quantized discrete variables and continuous variables, that is, clusters the feature variables of the weighted user feature categories together with those of the unweighted user feature categories, to obtain biased user feature clusters.
In addition, the invention also provides a clustering method based on the user portrait. FIG. 3 is a flowchart illustrating a user portrait-based clustering method according to a preferred embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, a user portrait-based clustering method includes:
Step S1: acquiring user features of a plurality of users and the feature variables corresponding to the user features. For example, the user features and their feature variables can be obtained from the network using web crawler technology, or from dedicated data. For example, the user feature is gender and the feature variable is female;
Step S2: converting the user features into word vectors, for example by looking up the word vector corresponding to each user feature in a word vector dictionary, where the dictionary is prepared in advance and trained with Word2Vec;
Step S3: clustering the word vectors and determining the category of each user feature. This step can be implemented with the SKLearn module in Python. For example, name, gender, age, native place and the like can be clustered into personal attributes; educational background, certificates, work experience and the like into professional ability; and birth order in the family, family structure, family happiness, family education and the like into family responsibility;
Step S4: dividing the feature variables into continuous variables and discrete variables, wherein the continuous variables are numerical variables with an ordering attribute and the discrete variables are non-numerical variables (such as place names and level information). The two kinds of variables can be distinguished automatically by a program;
Step S5: performing quantization processing on the discrete variables and the continuous variables;
Step S6: selecting the preferred user feature categories and assigning a weight greater than 1 to the quantized discrete variables and continuous variables of the preferred user feature categories, wherein the preference refers to the bias of the clustering process. For example, to bias the clustering toward personality, the proportion of the feature variables of personality-related user features can be increased so that the clustering result differs more clearly in terms of personality;
Step S7: clustering all the quantized discrete variables and continuous variables, that is, clustering the feature variables of the weighted user feature categories together with those of the unweighted user feature categories (for example by hierarchical clustering or K-Means clustering), to obtain biased user feature clusters. This step can be implemented with the K-Prototypes library in Python.
The clustering method is an unsupervised classification method. It establishes a weighted clustering algorithm based on user portrait features, so that the user classification can be adjusted by weighting for a specific application scenario and the preference of the clustering can be increased according to business requirements.
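The effect of the weighting in steps S6 and S7 can be illustrated with a minimal sketch. It assumes the feature variables are already quantized into a numeric matrix and that weighting means multiplying the columns of a preferred category by the weight before a distance-based clustering; the column names, category labels, and values below are hypothetical.

```python
import numpy as np

def apply_category_weights(X, column_category, preferred, weight):
    """Scale the columns whose feature category is preferred by `weight`."""
    if weight <= 1:
        raise ValueError("a preferred category needs a weight greater than 1")
    X = np.asarray(X, dtype=float)
    scale = np.array([weight if c in preferred else 1.0 for c in column_category])
    return X * scale

def dist(M, i, j):
    """Euclidean distance between users i and j."""
    return float(np.linalg.norm(M[i] - M[j]))

# Hypothetical quantized data: 3 users, columns = [age, education, family_size].
X = np.array([[0.2, 0.9, 0.5],
              [0.8, 0.9, 0.5],
              [0.2, 0.5, 0.5]])
column_category = ["personal", "ability", "family"]

# Bias the clustering toward the "ability" category with weight 2.
Xw = apply_category_weights(X, column_category, {"ability"}, 2.0)

# Unweighted, user 0 is nearer to user 2 (same age); after weighting,
# user 0 is nearer to user 1 (same education).
print(dist(X, 0, 1), dist(X, 0, 2))
print(dist(Xw, 0, 1), dist(Xw, 0, 2))
```

All columns stay in the matrix, so no feature information is discarded; only the relative influence of the preferred category on the distances changes.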
In step S5, the method for quantizing the discrete variables and the continuous variables includes:
converting ordered discrete variables (such as levels) into numerical form;
converting discrete variables (such as place names) that are unordered and whose number of distinct values exceeds a set number (for example, 20) into a higher-level form (such as identity or city tier);
encoding the discrete variables converted into the higher-level form (e.g., one-hot encoding);
and normalizing the encoded discrete variables, the ordered discrete variables, and the continuous variables.
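The quantization steps above can be sketched in plain Python. This is an illustrative sketch, not the patented implementation: the helper names, the toy data, the city-to-tier mapping, and the use of min-max scaling for the normalization are all assumptions, and the distinct-value limit is set to 2 only so the toy example triggers the roll-up.

```python
def ordinal_encode(values, order):
    """Map ordered discrete levels to their rank (0, 1, 2, ...)."""
    rank = {level: i for i, level in enumerate(order)}
    return [float(rank[v]) for v in values]

def roll_up(values, mapping, max_distinct):
    """Replace unordered values by a coarser level when there are too many distinct values."""
    if len(set(values)) <= max_distinct:
        return list(values)
    return [mapping[v] for v in values]

def one_hot(values):
    """One-hot encode unordered discrete values; returns (categories, rows)."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows

def min_max(values):
    """Scale numeric values to [0, 1]; a constant column maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical toy data for four users.
levels = ["silver", "gold", "bronze", "gold"]            # ordered discrete
cities = ["Shenzhen", "Beijing", "Chengdu", "Shenzhen"]  # unordered discrete
ages = [20, 35, 50, 35]                                  # continuous
city_tier = {"Shenzhen": "tier 1", "Beijing": "tier 1", "Chengdu": "tier 2"}

level_scaled = min_max(ordinal_encode(levels, ["bronze", "silver", "gold"]))
tier_names, tier_rows = one_hot(roll_up(cities, city_tier, max_distinct=2))
age_scaled = min_max(ages)

print(level_scaled)
print(tier_names, tier_rows)
print(age_scaled)
```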
In one embodiment of the present invention, in step S6, the preferred user feature category is one or more categories. When there is one preferred category, the weight of the feature variables of the preferred user features is in the range greater than 1 and not greater than n-1. When there are multiple preferred categories, the weight of the feature variables of each preferred category is greater than 1 and the sum of the weights is not greater than n-1, where n is the number of categories obtained by clustering the user features.
In another embodiment of the present invention, the preferred user feature category is one or more categories. When there is one preferred category, the weight of the feature variables of the preferred user features is greater than 1 and no greater than the value at which the weighted number of user features of that category equals the total number of user features of the other categories. When there are multiple preferred categories, the weight of the feature variables of each preferred category is greater than 1, and the sum of the weights is no greater than the value at which the weighted total number of user features of the preferred categories equals the total number of user features of the non-preferred categories. For example, if there are 800 user features in 4 categories, with 100, 300, 200 and 200 user features in the first to fourth categories respectively, and the first category is preferred, then the weight of the first category is varied in the range greater than 1 and not greater than 7, since (300 + 200 + 200) / 100 = 7.
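The upper bound in the example above can be checked with a short calculation. The function name is hypothetical, and applying one shared weight to several preferred categories is a simplifying assumption of this sketch.

```python
def max_preferred_weight(counts, preferred):
    """Largest weight w such that w * (features in the preferred categories)
    does not exceed the number of features in all other categories."""
    preferred_total = sum(counts[i] for i in preferred)
    other_total = sum(c for i, c in enumerate(counts) if i not in preferred)
    if preferred_total == 0:
        raise ValueError("preferred categories contain no user features")
    return other_total / preferred_total

# Feature counts of the four categories from the example (800 in total).
counts = [100, 300, 200, 200]
upper = max_preferred_weight(counts, preferred={0})
print(upper)  # 7.0, so the weight is varied in the range (1, 7]
```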
In both embodiments, the weight assigned to a preferred user feature category can be varied within the stated range to obtain different assignments and therefore different clusterings. The optimal weight of the preferred user feature category can be obtained by one of the following embodiments or by a combination of them.
In an alternative embodiment, the method for assigning a weight greater than 1 to the quantized discrete variables and continuous variables of the preferred user feature categories includes:
counting the number n of categories obtained by clustering the user features;
varying the weight of the feature variables of the preferred user feature category within the range greater than 1 and not greater than n-1;
and determining the optimal weight according to the contour coefficient (silhouette coefficient) and/or the interpretability of the clusters obtained after weighting.
Preferably, the method further comprises:
taking the clustering result corresponding to the optimal weight as the optimal biased user feature clustering, which comprises:
calculating the contour coefficient of each cluster according to the formula
Figure BDA0001956631660000081
where $s_i$ is the contour coefficient of the i-th cluster, and $a_i$ and $b_i$ are the two feature variables with the largest distance that belong to different categories in the i-th clustering result;
and repeating the above steps to obtain a curve of the contour coefficient as a function of the weight, observing whether the curve has an extreme point, taking the weight corresponding to the maximum contour coefficient as the optimal weight, and taking the clustering result corresponding to the maximum contour coefficient as the optimal biased user feature clustering.
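The weight sweep described above can be sketched with scikit-learn. This is a hedged illustration: it uses the standard silhouette coefficient from `sklearn.metrics.silhouette_score`, which may differ from the formula in the figure, K-Means stands in for the unspecified clustering algorithm, and the synthetic data, column indices, and candidate weights are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sweep_weights(X, preferred_cols, weights, n_clusters, seed=0):
    """Cluster once per candidate weight and record the silhouette coefficient."""
    X = np.asarray(X, dtype=float)
    scores = {}
    for w in weights:
        Xw = X.copy()
        Xw[:, preferred_cols] *= w  # weight the preferred feature variables
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(Xw)
        scores[w] = float(silhouette_score(Xw, labels))
    best = max(scores, key=scores.get)  # weight with the highest coefficient
    return best, scores

# Synthetic users: column 0 separates two groups, column 1 is noise.
rng = np.random.default_rng(0)
group_a = np.column_stack([rng.normal(0.2, 0.03, 20), rng.uniform(0, 1, 20)])
group_b = np.column_stack([rng.normal(0.8, 0.03, 20), rng.uniform(0, 1, 20)])
X = np.vstack([group_a, group_b])

# With n = 4 feature categories (assumed), the weight ranges over (1, n-1] = (1, 3].
best_w, scores = sweep_weights(X, preferred_cols=[0],
                               weights=[1.5, 2.0, 3.0], n_clusters=2)
print(best_w, scores)
```

The scores are computed in the weighted space, so the curve should be read together with the interpretability of the resulting clusters rather than as the only criterion.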
In an alternative embodiment, the method for giving a weight greater than 1 to the quantized discrete variable and continuous variable of the preferred user feature class includes:
obtaining a quantization matrix composed of the quantized discrete variables and continuous variables of the one or more preferred user feature categories
$B = (b_{ij})_{m \times n}$
where $b_{ij}$ is the j-th feature variable of the i-th user feature;
constructing a combined weight matrix in which the feature variables of the preferred user feature categories are assigned different weights over multiple weighting rounds
$F = W\Theta = [F_1\ F_2\ \dots\ F_n]^T$
Figure BDA0001956631660000091
Figure BDA0001956631660000092
$F_n = w_{n,1}\theta_1 + w_{n,2}\theta_2 + \dots + w_{n,l}\theta_l$
where the matrix $W$ holds the weights assigned in the different weighting rounds to the feature variables of the one or more preferred user feature categories; $\Theta$ is the vector of linear coefficients of the weighting rounds; $w_{n,l}$ is the weight assigned to the n-th feature variable in the l-th round, which is greater than 1 and not greater than n-1; n is the number of feature variables; l is the number of weighting rounds; $w_l$ is the weight vector formed by the weights of the l-th round, and the sum of the weights in each weight vector is not greater than n-1; $\theta_l$ is the linear coefficient of the l-th round; and $\theta_k \ge 0$, $k = 1, 2, \dots, l$,
Figure BDA0001956631660000093
and $F_n$ is the combined weight of the n-th feature variable;
constructing a vector difference matrix $C$ from the vector matrix,
Figure BDA0001956631660000094
obtaining a weight evaluation model from the vector difference matrix and the combined weight matrix
$M(F) = CF = CW\Theta$;
and taking the solution of the combined weight matrix at which the first derivative of the weight evaluation model is zero as the optimal weight of each feature variable.
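The combined weight $F = W\Theta$ used in this embodiment is a linear combination of the weights from the individual weighting rounds. A minimal NumPy sketch follows; the numbers in `W` and `theta` are purely illustrative, and the vector difference matrix $C$ and the optimization of the evaluation model are not reproduced here because their definitions are given only in the figures.

```python
import numpy as np

def combined_weights(W, theta):
    """Return F = W @ theta, where row n of W holds the weights given to the
    n-th feature variable in each round and theta holds the round coefficients."""
    W = np.asarray(W, dtype=float)
    theta = np.asarray(theta, dtype=float)
    if W.ndim != 2 or theta.ndim != 1 or W.shape[1] != theta.shape[0]:
        raise ValueError("W must be (n, l) and theta must have length l")
    if np.any(theta < 0):
        raise ValueError("linear coefficients must be non-negative")
    return W @ theta

# Illustrative values: 2 feature variables, 2 weighting rounds.
W = np.array([[1.5, 2.5],
              [2.0, 3.0]])
theta = np.array([0.5, 0.5])

F = combined_weights(W, theta)
print(F)  # F_1 = 1.5*0.5 + 2.5*0.5, F_2 = 2.0*0.5 + 3.0*0.5
```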
In an alternative embodiment, the method for giving a weight greater than 1 to the quantized discrete variable and continuous variable of the preferred user feature class includes:
obtaining a quantization matrix composed of the quantized discrete variables and continuous variables of the one or more preferred user feature categories
$B = (b_{ij})_{m \times n}$
where $b_{ij}$ is the j-th feature variable of the i-th user feature;
constructing a combined weight matrix in which the feature variables of the preferred user feature categories are assigned different weights over multiple weighting rounds
$F = W\Theta = [F_1\ F_2\ \dots\ F_n]^T$
Figure BDA0001956631660000101
Figure BDA0001956631660000102
$F_n = w_{n,1}\theta_1 + w_{n,2}\theta_2 + \dots + w_{n,l}\theta_l$
where the matrix $W$ holds the weights assigned in the different weighting rounds to the feature variables of the one or more preferred user feature categories; $\Theta$ is the vector of linear coefficients of the weighting rounds; $w_{n,l}$ is the weight assigned to the n-th feature variable in the l-th round, which is greater than 1 and not greater than n-1; n is the number of feature variables; l is the number of weighting rounds; $w_l$ is the weight vector formed by the weights of the l-th round, and the sum of the weights in each weight vector is not greater than n-1; $\theta_l$ is the linear coefficient of the l-th round; and $\theta_k \ge 0$, $k = 1, 2, \dots, l$,
Figure BDA0001956631660000103
and $F_n$ is the combined weight of the n-th feature variable;
constructing a vector sum matrix $H$ from the vector matrix,
Figure BDA0001956631660000104
obtaining a weight evaluation model from the vector sum matrix and the combined weight matrix
$M'(F) = HF = HW\Theta$;
and taking the solution of the combined weight matrix at which the first derivative of the weight evaluation model is zero as the optimal weight of each feature variable.
Constructing the weight evaluation model from the vector difference matrix reflects the differences between feature variables belonging to different user features, so that the resulting clusters are clearly separated and easier to interpret. Constructing the weight evaluation model from the vector sum matrix reflects the relations between different user features, so that the resulting clusters have a good contour. The evaluation model may therefore also be constructed from a weighted combination of the vector difference matrix and the vector sum matrix.
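The combined weight F = WΘ that both evaluation models rely on can be illustrated with a minimal sketch. The function name `combined_weights` is hypothetical, and rescaling the coefficients so that they sum to 1 is an assumption made here, because the constraint on Θ is given only in a figure that is not reproduced in this text. The bounds on the individual weights are not enforced.

```python
def combined_weights(W, theta):
    """Return F = W * Theta as a list with one combined weight per feature variable.

    W is an n x l nested list: row i holds the weights given to feature
    variable i in each of the l weighting rounds. theta holds the l
    non-negative linear coefficients.
    """
    if any(t < 0 for t in theta):
        raise ValueError("each linear coefficient must be non-negative")
    total = sum(theta)
    if total == 0:
        raise ValueError("at least one linear coefficient must be positive")
    # Assumption: coefficients are rescaled to sum to 1.
    theta = [t / total for t in theta]
    return [sum(w * t for w, t in zip(row, theta)) for row in W]


# Three feature variables, two weighting rounds (illustrative numbers only).
W = [[2.0, 1.5],
     [1.0, 1.0],
     [1.0, 1.5]]
F = combined_weights(W, [1.0, 3.0])
```

Each entry of `F` is a convex combination of the weights a feature variable received across the rounds, so a round with a larger coefficient has more influence on the final weight.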
In an embodiment of the present invention, the method for quantizing a discrete variable and a continuous variable includes:
determining the degree of dispersion of each discrete variable, which can be obtained by one or more of the range, the interquartile range, the variance, the standard deviation, the mean deviation and the coefficient of variation of the word vector; for example, the dispersion is evaluated by the average variance,
Figure BDA0001956631660000111
wherein PC is the degree of dispersion of the discrete variable of a user feature, N is the number of users, and $y_i$ and $o_i$ are respectively the discrete variable of the user feature of the i-th user and its expected value, the expected value being a set value that reduces the degree of dispersion;
summarizing and counting the discrete variables whose degree of dispersion exceeds a threshold until the degree of dispersion no longer exceeds the threshold (the threshold can be set; the higher the required clustering precision, the lower the threshold). For example, the discrete feature of residential area can be summarized from residential communities into streets, and when the dispersion of the feature summarized into streets still exceeds the threshold, it can be further summarized into districts/counties.
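The roll-up of an over-dispersed discrete variable can be sketched as follows. The dispersion formula itself appears only as a figure, so this sketch assumes it is the mean squared deviation of integer-coded categories from their mean. The function names, the integer coding and the roll-up dictionaries are all hypothetical.

```python
from statistics import fmean


def dispersion(values):
    """Assumed form of PC: mean squared deviation of integer-coded categories."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    y = [codes[v] for v in values]
    expected = fmean(y)  # assumption: the expected value is the mean code
    return fmean((yi - expected) ** 2 for yi in y)


def coarsen_until(values, rollups, threshold):
    """Apply successive roll-up mappings until the dispersion is within the threshold."""
    level = 0
    while dispersion(values) > threshold and level < len(rollups):
        values = [rollups[level][v] for v in values]
        level += 1
    return values, level


# Residential area per user: community -> street -> district (made-up names).
communities = ["A1", "A2", "B1", "B2", "C1", "C2"]
to_street = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C1": "C", "C2": "C"}
to_district = {"A": "X", "B": "X", "C": "Y"}
coarse, levels_used = coarsen_until(communities, [to_street, to_district], threshold=0.5)
```

With these made-up values the street level is still above the threshold, so the variable is rolled up a second time to the district level.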
In an embodiment of the present invention, the method for clustering all quantized discrete variables and continuous variables to obtain biased user feature clusters includes:
giving different weights to perform multiple initial clustering;
constructing a tree structure according to results of multiple initial clustering, wherein root nodes are clustered from a first initial clustering result to a last initial clustering result from top to bottom in sequence, and the side length is the proportion of characteristic variables with the same length in the clustering results to all the characteristic variables;
taking the ratio of the side length difference value between the nodes to the maximum side length and the shortest side length as the similarity between the nodes;
and clustering the nodes according to the similarity (for example, clustering by adopting a k-means method), and taking the intersection of the initial clusters in the clustering result as the optimal clustering result.
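Only the final step, taking the intersection of the initial clusters, is simple enough to sketch without guessing at the tree construction. The sketch below assumes each initial clustering is given as a list of labels, one per item, and the function name `intersect_clusterings` is hypothetical. Items end up in the same group only if every selected run put them in the same cluster.

```python
from collections import defaultdict


def intersect_clusterings(labelings):
    """Group item indices that share the same label in every clustering run."""
    groups = defaultdict(list)
    for index, labels in enumerate(zip(*labelings)):
        # The tuple of labels across runs identifies one intersection cell.
        groups[labels].append(index)
    return sorted(groups.values())


# Two illustrative runs over five items.
runs = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]
cells = intersect_clusterings(runs)
```

Here item 2 is separated from both neighbours because the two runs disagree about where it belongs.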
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a user-portrait-based clustering program, and when executed by a processor, the user-portrait-based clustering program implements the following steps:
acquiring user characteristics of a plurality of users and corresponding characteristic variables thereof;
converting the user characteristics into word vectors;
clustering the word vectors, and determining the category of each user feature;
dividing the characteristic variables into continuous variables and discrete variables, wherein the continuous variables are numerical variables with sequence attributes, and the discrete variables are non-numerical variables;
carrying out quantization processing on discrete variables and continuous variables;
screening out the preferred categories of user features, and assigning a weight greater than 1 to the quantized discrete variables and continuous variables of the preferred user feature categories, wherein the preference refers to the bias of the clustering process;
and clustering all the discrete variables and continuous variables subjected to the quantization processing to obtain biased user characteristic clusters.
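The effect of the last two steps, weighting the preferred feature variables and then clustering, can be seen in a small sketch. It assumes the rows are already quantized and normalized, and it uses a plain k-means with farthest-first seeding. The helper names `apply_weights` and `kmeans`, the seeding rule and the data are illustrative assumptions and are not taken from the text.

```python
import math


def apply_weights(rows, weights):
    """Scale each feature column by its weight (1 means no preference)."""
    return [[w * v for v, w in zip(row, weights)] for row in rows]


def kmeans(rows, k, iters=20):
    """Plain k-means with deterministic farthest-first seeding."""
    centers = [rows[0]]
    while len(centers) < k:
        # Next seed: the row farthest from all seeds chosen so far.
        centers.append(max(rows, key=lambda r: min(math.dist(r, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(r, centers[j])) for r in rows]
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centers[j] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels


# Four users, three normalized feature variables (made-up values).
rows = [
    [0.0, 0.0, 0.5],
    [0.2, 1.0, 0.5],
    [0.8, 0.0, 0.5],
    [1.0, 1.0, 0.5],
]
unbiased = kmeans(apply_weights(rows, [1, 1, 1]), k=2)
biased = kmeans(apply_weights(rows, [2, 1, 1]), k=2)  # prefer the first feature
```

Without weighting, the users split along the second feature. Doubling the weight of the first feature makes distances along it dominate, so the same users split along the first feature instead.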
The specific implementation of the computer-readable storage medium of the present invention is substantially the same as the above-mentioned user-portrait-based clustering method and the electronic device, and will not be described herein again.
The user portrait-based clustering method, the electronic device and the storage medium can select several fields for targeted classification and adjust their weights to values greater than 1 (for example, if the users are to be grouped mainly by personal attributes, the weights of those attributes are increased), thereby achieving targeted clustering.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another identical element in a process, apparatus, article, or method comprising the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention or the portions contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) as described above and includes several instructions for enabling a terminal client (which may be a mobile phone, a computer, a server, or a network client, etc.) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. A user portrait based clustering method, comprising:
acquiring user characteristics of a plurality of users and characteristic variables corresponding to the user characteristics;
converting the user characteristics into word vectors;
clustering the word vectors, and determining the category of each user feature;
dividing feature variables corresponding to the user features into continuous variables and discrete variables, wherein the continuous variables are numerical variables with sequence attributes, and the discrete variables are non-numerical variables;
carrying out quantization processing on discrete variables and continuous variables;
screening out the category of the user characteristics with preference, and endowing the discrete variable and the continuous variable which are subjected to quantization processing and are of the user characteristic category with preference with a weight value larger than 1, wherein the preference refers to the bias of a clustering process;
clustering all the discrete variables and continuous variables subjected to the quantization processing to obtain biased user characteristic clusters,
the quantization processing of the discrete variable and the continuous variable comprises the following steps:
judging the discrete degree of the discrete variable, wherein the discrete degree is evaluated by the average variance,
Figure FDA0003858778560000011
wherein PC is the degree of dispersion of the discrete variable of a user feature, N is the number of users, and $y_i$ and $o_i$ are respectively the discrete variable of the user feature of the i-th user and its expected value, the expected value being a set value that reduces the degree of dispersion,
summarizing and counting the discrete variables with the discrete degrees exceeding the threshold value until the discrete degrees do not exceed the threshold value, wherein the method for quantizing the discrete variables and the continuous variables comprises the following steps:
converting discrete variables with order into numerical form;
converting discrete variables which have no order and the value number of which exceeds the set number into a high-order form;
encoding the discrete variable converted into the high-order form;
the discrete variables and the continuous variables with the sequence after being screened out and coded are normalized,
the method for endowing the discrete variable and the continuous variable which are subjected to the quantization processing and have the preference user characteristic category with the weight value which is more than 1 comprises the following steps:
obtaining a quantization matrix formed by discrete variables and continuous variables which are subjected to quantization processing and have one or more types of user characteristic categories which are preferred;
$B=(b_{ij})_{m\times n}$
wherein $b_{ij}$ is the j-th feature variable of the i-th user feature;
m represents the number of user features;
constructing a combined weight matrix which endows different weights to the characteristic variables of the preferred user characteristic categories for different times;
F=WΘ=[F 1 F 2 …F n ] T
Figure FDA0003858778560000021
Figure FDA0003858778560000022
$F_n=w_{n,1}\theta_1+w_{n,2}\theta_2+\dots+w_{n,p}\theta_p$
wherein the matrix W holds the weights assigned to the feature variables of the one or more categories of user features in the different weighting rounds; Θ is the vector of linear coefficients of the weighting rounds; $w_{n,p}$ is the weight assigned to the n-th feature variable in the p-th round, which is greater than 1 and not greater than n-1; n is the number of feature variables; p is the number of weighting rounds; $w_p$ is the weight vector formed by the weights of the p-th round, the sum of the weights in each weight vector being not greater than n-1; $\theta_p$ is the linear coefficient of the p-th round; and $\theta_k\ge 0$, $k=1,2,\dots,p$,
Figure FDA0003858778560000023
$F_n$ is the combined weight of the n-th feature variable;
a vector difference matrix C is constructed using the quantization matrices,
Figure FDA0003858778560000031
obtaining a weight evaluation model according to the vector difference matrix and the combined weight matrix;
M(F)=CF=CWΘ;
and respectively taking the optimal solution of the combined weight matrix corresponding to the zero first-order derivative of the weight evaluation model as the optimal weight of each characteristic variable.
2. The user profile-based clustering method according to claim 1, wherein the method for assigning a weight value greater than 1 to the quantized discrete variable and continuous variable of the preferred user feature class comprises:
counting the number n of the characteristic variables after the user characteristic clustering;
changing the weight of the characteristic variable of the category of the user characteristic with preference within the range of more than 1 and not more than n-1;
and determining the optimal weight according to the contour coefficient or/and the interpretability of the cluster after weighting.
3. The user portrait based clustering method of claim 2, wherein after the step of determining the optimal weight according to the contour coefficient or/and interpretability of the weighted cluster, further comprising:
taking the clustering result corresponding to the optimal weight as the optimal biased user feature clustering, which comprises the following steps:
calculating the contour coefficient of each cluster according to the following formula
Figure FDA0003858778560000032
wherein $s_i$ is the contour coefficient of the i-th cluster, and $a_i$ and $b_i$ are respectively the two feature variables with the maximum distance that belong to different categories in the i-th clustering result;
and repeating the steps to obtain a change curve of the profile coefficient along with the weight value, observing whether the curve has an extreme point, taking the weight value corresponding to the maximum value of the profile coefficient as an optimal weight value, and taking the clustering result corresponding to the maximum value of the profile coefficient as the optimal biased user characteristic clustering.
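The weight sweep described in claims 2 and 3 can be sketched as follows. The contour-coefficient formula appears only as a figure, so this sketch substitutes the standard silhouette coefficient, s = (b - a) / max(a, b), where a is the mean distance to the other points of the same cluster and b is the mean distance to the nearest other cluster. The function names are hypothetical.

```python
import math


def silhouette(rows, labels):
    """Mean standard silhouette coefficient; needs at least two clusters."""
    clusters = {}
    for row, lab in zip(rows, labels):
        clusters.setdefault(lab, []).append(row)
    scores = []
    for row, lab in zip(rows, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # The point's distance to itself is 0, so divide by len(own) - 1.
        a = sum(math.dist(row, o) for o in own) / (len(own) - 1)
        b = min(
            sum(math.dist(row, o) for o in other) / len(other)
            for key, other in clusters.items() if key != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


def best_weight(candidates, score_for_weight):
    """Return the candidate weight whose clustering scores highest."""
    return max(candidates, key=score_for_weight)


# Two well-separated clusters of two points each (made-up values).
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
score = silhouette(points, [0, 0, 1, 1])
```

In use, `score_for_weight` would re-weight the preferred feature variables, re-cluster, and return the silhouette of the result, so `best_weight` picks the maximum of the curve described in the claim.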
4. The user profile-based clustering method according to claim 1, wherein the category of the preferred user features is one or more, and when the category of the preferred user features is one category, the weight of the feature variable of the preferred user features is in a range of more than 1 and not more than n-1; when the preferred category is multiple categories, the weight value of the feature variable of one category of the user features of the multiple categories is more than 1, the sum of the weight values is not more than n-1, and n is the number of the categories after the user features are clustered.
5. The user portrait based clustering method of claim 4, wherein the method for assigning a weight greater than 1 to the quantized discrete variable and continuous variable of the preferred user feature category further comprises:
obtaining a quantization matrix formed by discrete variables and continuous variables which are subjected to quantization processing and have one or more types of user characteristic categories which are preferred;
$B=(b_{ij})_{m\times n}$
wherein $b_{ij}$ is the j-th feature variable of the i-th user feature;
constructing a combined weight matrix which endows different weights to the characteristic variables of the preferred user characteristic categories for different times;
F=WΘ=[F 1 F 2 …F n ] T
Figure FDA0003858778560000041
Figure FDA0003858778560000042
$F_n=w_{n,1}\theta_1+w_{n,2}\theta_2+\dots+w_{n,p}\theta_p$
wherein the matrix W holds the weights assigned to the feature variables of the one or more categories of user features in the different weighting rounds; Θ is the vector of linear coefficients of the weighting rounds; $w_{n,p}$ is the weight assigned to the n-th feature variable in the p-th round, which is greater than 1 and not greater than n-1; n is the number of feature variables; p is the number of weighting rounds; $w_p$ is the weight vector formed by the weights of the p-th round, the sum of the weights in each weight vector being not greater than n-1; $\theta_p$ is the linear coefficient of the p-th round; and $\theta_k\ge 0$, $k=1,2,\dots,p$,
Figure FDA0003858778560000043
$F_n$ is the combined weight of the n-th feature variable;
a vector sum matrix H is constructed using the quantization matrix,
Figure FDA0003858778560000051
obtaining a weight evaluation model according to the vector sum matrix and the combined weight matrix;
M′(F)=HF=HWΘ;
and respectively taking the optimal solution of the combined weight matrix corresponding to the zero first-order derivative of the weight evaluation model as the optimal weight of each characteristic variable.
6. The user profile-based clustering method according to claim 1, wherein the method for assigning a weight value greater than 1 to the quantized discrete variable and continuous variable of the preferred user feature class comprises:
counting the total number of user features and the number of user features belonging to each user feature category;
the weight assigned to a preferred user feature category ranges from more than 1 up to the value that makes the number of user features of that category equal to the sum of the numbers of user features of the other categories.
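One possible reading of the upper bound in claim 6 is that the weighted feature count of the preferred category should match the combined count of all other categories, which gives a bound of (sum of other counts) / (count of the preferred category). The sketch below follows that reading. The interpretation, the function name and the category names are assumptions.

```python
def max_category_weight(counts, preferred):
    """Upper weight bound under the assumed reading: weight * own count = others' count."""
    others = sum(c for name, c in counts.items() if name != preferred)
    return others / counts[preferred]


# Number of user features per category (made-up values).
counts = {"personal": 3, "behavior": 5, "asset": 4}
upper = max_category_weight(counts, "personal")
```

Under this reading the admissible weights for the preferred category lie in the interval (1, upper]. If `upper` is not greater than 1, the category already holds at least half of all user features and no admissible weight exists.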
7. An electronic device comprising a memory and a processor, the memory having stored therein a user representation-based clustering program, the user representation-based clustering program when executed by the processor implementing the steps of the user representation-based clustering method according to any one of claims 1 to 6.
8. A computer-readable storage medium, comprising a user representation-based clustering program, wherein the user representation-based clustering program, when executed by a processor, performs the steps of the user representation-based clustering method as claimed in any one of claims 1 to 6.
CN201910068877.7A 2019-01-24 2019-01-24 Clustering method based on user portrait, electronic device and storage medium Active CN109903082B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910068877.7A CN109903082B (en) 2019-01-24 2019-01-24 Clustering method based on user portrait, electronic device and storage medium
PCT/CN2019/089151 WO2020151152A1 (en) 2019-01-24 2019-05-30 User profile-based clustering method, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910068877.7A CN109903082B (en) 2019-01-24 2019-01-24 Clustering method based on user portrait, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN109903082A CN109903082A (en) 2019-06-18
CN109903082B true CN109903082B (en) 2022-10-28

Family

ID=66944108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910068877.7A Active CN109903082B (en) 2019-01-24 2019-01-24 Clustering method based on user portrait, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN109903082B (en)
WO (1) WO2020151152A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597348B (en) * 2020-04-27 2024-02-06 平安科技(深圳)有限公司 User image drawing method, device, computer equipment and storage medium
CN111881190B (en) * 2020-08-05 2021-10-08 厦门南讯股份有限公司 Key data mining system based on customer portrait
CN112116205B (en) * 2020-08-21 2024-03-12 国网上海市电力公司 Image method, device and storage medium for power utilization characteristics of platform area
CN117973789A (en) * 2021-07-30 2024-05-03 北京壹心壹翼科技有限公司 Intelligent matching method, device, equipment and medium based on full-flow user portrait
CN117272119B (en) * 2023-11-21 2024-03-22 国网山东省电力公司营销服务中心(计量中心) User portrait classification model training method, user portrait classification method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062375A (en) * 2017-12-12 2018-05-22 百度在线网络技术(北京)有限公司 A kind of processing method, device, terminal and the storage medium of user's portrait
CN108427669A (en) * 2018-02-27 2018-08-21 华青融天(北京)技术股份有限公司 Abnormal behaviour monitoring method and system
CN108737856A (en) * 2018-04-26 2018-11-02 西北大学 The IPTV user behaviors modeling of social relationships perception and program commending method
CN108734217A (en) * 2018-05-22 2018-11-02 齐鲁工业大学 A kind of customer segmentation method and device based on clustering

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9251275B2 (en) * 2013-05-16 2016-02-02 International Business Machines Corporation Data clustering and user modeling for next-best-action decisions
CN104268290B (en) * 2014-10-22 2017-08-08 武汉科技大学 A kind of recommendation method based on user clustering
CN107730289A (en) * 2016-08-11 2018-02-23 株式会社理光 A kind of user behavior analysis method and user behavior analysis device
CN106850314B (en) * 2016-12-20 2021-06-15 上海掌门科技有限公司 Method and equipment for determining user attribute model and user attribute information
CN107679946B (en) * 2017-09-28 2021-09-10 平安科技(深圳)有限公司 Fund product recommendation method and device, terminal equipment and storage medium
CN108519993B (en) * 2018-03-02 2022-03-29 华南理工大学 Social network hotspot event detection method based on multi-data-stream calculation
CN109086787B (en) * 2018-06-06 2023-07-25 平安科技(深圳)有限公司 User portrait acquisition method, device, computer equipment and storage medium
CN109165383B (en) * 2018-08-09 2022-07-12 四川政资汇智能科技有限公司 Data aggregation, analysis, mining and sharing method based on cloud platform
CN109255715A (en) * 2018-09-03 2019-01-22 平安科技(深圳)有限公司 Electronic device, Products Show method and computer readable storage medium

Also Published As

Publication number Publication date
CN109903082A (en) 2019-06-18
WO2020151152A1 (en) 2020-07-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant