WO2019120007A1 - Method and apparatus for predicting user gender, and electronic device


Info

Publication number
WO2019120007A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
sample
sample set
classification
information
Application number
PCT/CN2018/115358
Other languages
French (fr)
Chinese (zh)
Inventor
陈岩 (Chen Yan)
刘耀勇 (Liu Yaoyong)
Original Assignee
Oppo广东移动通信有限公司 (Guangdong OPPO Mobile Telecommunications Corp., Ltd.)
Application filed by Oppo广东移动通信有限公司 (Guangdong OPPO Mobile Telecommunications Corp., Ltd.)
Publication of WO2019120007A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24133: Distances to prototypes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions

Definitions

  • the present application relates to the field of electronic device terminals, and in particular, to a user gender prediction method, device, and electronic device.
  • User profiling ("user portraits") has been a very popular research direction in recent years, for example on smartphones. If the gender of a user can be accurately determined from the user's habits, it is very useful for optimizing the electronic device in many respects.
  • Current electronic device systems allow users to register and bind an electronic device to a user account, but not every user is willing to provide gender information, so the user profile remains incomplete for a large portion of users. Therefore, for users who do not provide gender information, it is necessary to provide a user gender prediction method, apparatus, and electronic device.
  • The embodiment of the present application provides a user gender prediction method, apparatus, and electronic device to intelligently predict the gender of a user.
  • the embodiment of the present application provides a user gender prediction method, including:
  • collecting multi-dimensional features of the behavior habits of users who have provided gender information as samples, and constructing a sample set of the behavior habits of the users who have provided gender information;
  • when the number of features exceeds a preset threshold, classifying the sample set according to the information gain rate of each feature for classifying the samples, to construct a decision tree model for user gender prediction, the output of which includes "male" or "female";
  • collecting, according to a prediction time, multi-dimensional features of the behavior habits of a user who has not provided gender information as a prediction sample;
  • predicting the gender of the user who has not provided gender information according to the prediction sample and the decision tree model.
  • The embodiment of the present application further provides a user gender prediction apparatus, where the apparatus includes:
  • a first collecting unit configured to collect a multi-dimensional feature of a behavior habit of a user who has provided gender information as a sample, and construct a sample set of behavior habits of the user who has provided the gender information;
  • a classifying unit configured to perform sample classification on the sample set, when the number of features exceeds a preset threshold, according to the information gain rate of each feature for classifying the sample set, to construct a decision tree model for user gender prediction, where the output of the decision tree model includes "male" or "female";
  • a second collecting unit configured to collect, as a prediction sample, a multi-dimensional feature of a behavior habit of a user who does not provide gender information according to the predicted time;
  • a prediction unit is configured to predict a gender of a user who does not provide gender information according to the prediction sample and the decision tree model.
  • An electronic device provided by an embodiment of the present application includes a processor and a memory, the memory storing a computer program, wherein the processor is configured to execute a user gender prediction method by calling the computer program, the user gender prediction method including:
  • classifying the sample set according to the information gain rate of each feature for classifying the samples, to construct a decision tree model for user gender prediction;
  • predicting the gender of a user who has not provided gender information according to the prediction sample and the decision tree model.
  • The user gender prediction method, apparatus, and electronic device provided by the present application collect multi-dimensional features of the behavior habits of users who have provided gender information as samples and construct a sample set of the behavior habits of those users; when the number of features exceeds a preset threshold, the sample set is classified according to the information gain rate of each feature for classifying the samples, to construct a decision tree model for predicting the gender of the user, the output of which includes "male" or "female";
  • multi-dimensional features of the behavior habits of a user who has not provided gender information are collected as a prediction sample according to the prediction time, and the gender of that user is predicted according to the prediction sample and the decision tree model.
  • FIG. 1 is a schematic diagram of an application scenario of a user gender prediction method according to an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a user gender prediction method provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a decision tree provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of another decision tree provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of still another decision tree provided by an embodiment of the present application.
  • FIG. 6 is another schematic flowchart of a user gender prediction method provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a user gender prediction apparatus according to an embodiment of the present application.
  • FIG. 8 is another schematic structural diagram of a user gender prediction apparatus according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 10 is another schematic structural diagram of an electronic device according to an embodiment of the present application.
  • a user gender prediction method includes:
  • the sample set is classified according to the information gain rate of each feature for classifying the samples, to construct a decision tree model for user gender prediction;
  • the gender of a user who has not provided gender information is predicted according to the prediction sample and the decision tree model.
  • In some embodiments, classifying the sample set to construct the decision tree model includes:
  • using the child node as a leaf node and setting the output of the leaf node according to the categories of the samples in the removed sub-sample set, where the categories of the samples include "male" and "female".
  • the target sample set is divided according to the dividing feature, including:
  • the target sample set is divided according to the feature value.
  • selecting the current division feature from the features according to the information gain rate including:
  • the feature corresponding to the target information gain rate is selected as the current dividing feature.
  • the user gender prediction method further includes:
  • the current node is taken as a leaf node, and the sample category with the largest number of samples is selected as the output of the leaf node.
  • determining whether the child node meets the preset classification termination condition includes:
  • acquiring an information gain rate of the feature in the target sample set for the target sample set classification includes:
  • the information gain rate of the feature classification for the target sample set is obtained, including:
  • the information gain rate of the feature classification for the target sample set is obtained according to the information gain and the split information, including:
  • the information gain rate of the feature for classifying the target sample set is calculated by the following formula:
  • g_R(D, A) = g(D, A) / H_A(D),
  • where g_R(D, A) is the information gain rate of feature A for classifying sample set D, g(D, A) is the information gain of feature A for the sample classification, and H_A(D) is the split information of feature A;
  • g(D, A) can be calculated by the following formula:
  • g(D, A) = H(D) - H(D|A),
  • where H(D) is the empirical entropy of classifying sample set D and H(D|A) is the conditional entropy of feature A for classifying sample set D;
  • the conditional entropy is H(D|A) = Σ_{i=1..n} p_i·H(D|A=A_i), where p_i is the probability that a sample whose feature A takes the i-th value appears in sample set D, and n and i are positive integers greater than zero.
  • The embodiment of the present application provides a user gender prediction method. The execution subject of the method may be the user gender prediction apparatus provided by the embodiment of the present application, or an electronic device integrated with the user gender prediction apparatus, where the user gender prediction apparatus may be implemented in hardware or software.
  • the electronic device may be a device such as a smart phone, a tablet computer, a palmtop computer, a notebook computer, or a desktop computer.
  • FIG. 1 is a schematic diagram of an application scenario of a user gender prediction method according to an embodiment of the present application.
  • Taking the user gender prediction apparatus integrated into an electronic device as an example, the electronic device can collect the multi-dimensional features of the behavior habits of users who have provided gender information as samples and construct a sample set of the behavior habits of those users; classify the sample set according to the information gain rate of each feature for classifying the sample set, to construct a decision tree model for predicting user gender; collect, according to a prediction time, the multi-dimensional features of the behavior habits of a user who has not provided gender information to obtain a prediction sample; and predict the gender of that user according to the prediction sample and the decision tree model.
  • For example, the multi-dimensional features of the behavior habits of users who have provided gender information (for example, the length of time the user reads sports news) may be collected over a historical time period;
  • at the prediction time, behavior habits of a user who has not provided gender information, such as the length of time that user reads sports news or the number of times that user uses beauty software, are collected as a prediction sample, and the gender of that user is predicted based on the prediction sample and the decision tree model.
  • FIG. 2 is a schematic flowchart diagram of a user gender prediction method according to an embodiment of the present application.
  • the specific process of the user gender prediction method provided by the embodiment of the present application may be as follows:
  • The multi-dimensional feature of the behavior habits of a user who has provided gender information is a feature vector of a certain length, and the parameter in each dimension corresponds to one piece of feature information representing those behavior habits; that is, the multi-dimensional feature consists of multiple features.
  • the sample set of the behavior habits of the user who has provided the gender information may include a plurality of samples, each of which includes a multi-dimensional feature of the behavior habit of the user who has provided the gender information.
  • a sample set of behavioral habits of users who have provided gender information may include a plurality of samples collected at a preset frequency during a historical time period.
  • The historical time period may be, for example, the past 7 days or the past 10 days; the preset frequency may be, for example, once every 10 minutes or once every half hour. It can be understood that the multi-dimensional feature data of the behavior habits collected at one time constitutes one sample, and the samples collected over multiple collections constitute the sample set.
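  • As a rough illustration of such a sampling schedule, the sketch below (in Python) enumerates the collection times for one user; the half-hour interval and 7-day window are just the example values mentioned above, not fixed by the method.

```python
from datetime import datetime, timedelta

# One multi-dimensional sample per user every half hour over the past 7 days
# (both values are examples from the text, not requirements of the method).
end = datetime.now()
start = end - timedelta(days=7)

sample_times = []
t = start
while t <= end:
    sample_times.append(t)          # a sample of the user's behaviour features is taken here
    t += timedelta(minutes=30)

# 7 days * 48 samples/day (plus the starting point) = 337 samples per user
```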
  • Each sample in the sample set can be marked to obtain a sample label for each sample. Since this embodiment predicts the gender of the user, the sample labels are the gender "male" and the gender "female"; that is, the sample categories include "male" and "female". Specifically, a sample may be marked according to the behavior habits of the user who has provided the gender information.
  • For example, if the user browses male-oriented goods (such as men's clothing) in the shopping application 50 times, the sample is marked "male"; if the user reads female-oriented novels for more than 20 hours, the sample is marked "female".
  • When marking, the value "1" may be used to indicate "male" and the value "0" to indicate "female", or vice versa.
  • The preset threshold may be, for example, 10000; that is, when the number of features exceeds 10000, the sample set is classified according to the information gain rate of each feature for classifying the samples, to construct a decision tree model for user gender prediction.
  • Feature information that is not directly represented by a numerical value is quantified with specific values. For example, whether the user has turned on the front camera can be represented by the value 1 for turned on and the value 0 for not turned on (or vice versa); whether the user applies beauty processing to a picture can be represented by the value 1 for beauty processing applied and the value 0 for not applied (or vice versa).
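  • A minimal sketch of this quantization step is shown below; the feature names and values are hypothetical and only illustrate how boolean behaviour features and the gender label can be encoded as numbers.

```python
# Hypothetical raw record for one user who has provided gender information.
raw_record = {
    "male_goods_browse_count": 50,   # already numeric, used as-is
    "camera_opened": True,           # boolean behaviour feature, quantized below
    "beauty_filter_used": False,     # boolean behaviour feature, quantized below
    "gender": "male",                # gender provided by the user
}

# Quantize non-numeric features: 1 = turned on / applied, 0 = not (or vice versa).
sample = {
    "male_goods_browse_count": raw_record["male_goods_browse_count"],
    "camera_opened": 1 if raw_record["camera_opened"] else 0,
    "beauty_filter_used": 1 if raw_record["beauty_filter_used"] else 0,
}

# Sample label: the value 1 indicates "male" and 0 indicates "female" (or vice versa).
label = 1 if raw_record["gender"] == "male" else 0
```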
  • the embodiment of the present application may perform sample classification on the sample set based on the information gain rate of the feature classification for the sample to construct a decision tree model of the user gender prediction.
  • a decision tree model can be constructed based on the C4.5 algorithm.
  • A decision tree is a tree structure built for decision-making.
  • A decision tree is a predictive model that represents a mapping between object attributes and object values.
  • Each node represents an object, each forked path in the tree represents a possible attribute value, and each leaf node corresponds to the value of the object represented by the path from the root node to that leaf node.
  • A decision tree has only a single output; if multiple outputs are needed, separate decision trees can be built to handle the different outputs.
  • The C4.5 algorithm is a decision tree algorithm, one of a family of algorithms used for classification problems in machine learning and data mining, and an important improvement of ID3. Its goal is supervised learning: given a data set in which each tuple is described by a set of attribute values and belongs to one of a set of mutually exclusive classes, C4.5 learns a mapping from attribute values to classes that can then be used to classify new entities whose classes are unknown.
  • ID3 (Iterative Dichotomiser 3) is based on the Occam's razor principle, that is, doing as much as possible with as little as possible. In information theory, the smaller the expected information, the greater the information gain and the higher the purity.
  • The core idea of the ID3 algorithm is to measure the choice of attributes by information gain and to split on the attribute with the largest information gain after splitting. The algorithm uses a top-down greedy search to traverse the space of possible decision trees.
  • the information gain rate may be defined as the ratio of the information gain of the feature to the sample classification and the split information of the feature to the sample classification.
  • the specific information gain rate acquisition method is described below.
  • Information gain is defined for a single feature: it compares the amount of information in the system with and without a feature t; the difference between the two is the amount of information the feature brings to the system, that is, its information gain.
  • the split information is used to measure the breadth and uniformity of the feature split data (such as the sample set), and the split information can be the entropy of the feature.
  • the classification process may include the following steps:
  • the division feature of the sample in the subsample set is removed, and the removed subsample set is obtained;
  • the child node is used as a leaf node, and the output of the leaf node is set according to the category of the sample in the removed subsample set.
  • the categories of the sample include “male” and “female”.
  • the partitioning feature is a feature selected from the features according to the information gain rate of each feature for the sample set classification, and is used to classify the sample set.
  • There are various ways to select the division feature according to the information gain rate; for example, in order to improve the accuracy of the sample classification, the feature corresponding to the maximum information gain rate may be selected as the division feature.
  • the category of the sample may include two categories: "male” and “female”, and the category of each sample may be represented by a sample mark. For example, when the sample is marked as a numerical value, the value “1" indicates “male”, and the numerical value is used. “0” means “female”, and vice versa.
  • When the child node satisfies the preset classification termination condition, the child node may be used as a leaf node, that is, classification of the sample set of the child node stops, and the output of the leaf node may be set based on the categories of the samples in the removed sub-sample set.
  • There are several ways to set the output of a leaf node based on the sample categories; for example, the category with the largest number of samples in the removed sub-sample set can be used as the output of the leaf node.
  • The preset classification termination condition may be set according to actual requirements. When a child node satisfies the preset classification termination condition, the current child node is used as a leaf node and classification of the sample set corresponding to that child node is stopped; when the child node does not satisfy the preset classification termination condition, classification of the sample set corresponding to the child node continues.
  • the preset classification termination condition may include: the number of categories of the samples in the removed sub-sample set of the child node is a preset number, that is, the step “determining whether the child node satisfies the preset classification termination condition” may include:
  • the preset classification termination condition may include: the number of categories of the samples in the removed subsample set corresponding to the child node is 1, that is, the sample in the sample set of the child node has only one category. At this time, if the child node satisfies the preset classification termination condition, the category of the sample in the subsample set is taken as the output of the leaf node. If there is only a sample with the category "male" in the subsample set after removal, then "male" can be used as the output of the leaf node.
  • a gain rate threshold may also be set; when the maximum information gain rate is greater than the threshold, the feature corresponding to the information gain rate is selected as the division feature. That is, the step of "selecting the current division feature from the feature according to the information gain rate" may include:
  • the feature corresponding to the target information gain rate is selected as the current partition feature.
  • the current node when the target information gain rate is not greater than a preset threshold, the current node may be used as a leaf node, and the sample category with the largest number of samples is selected as the output of the leaf node.
  • the sample categories include "male” and "female”.
  • the preset threshold can be set according to actual needs, such as 0.9, 0.8, and the like.
  • For example, if the preset gain rate threshold is 0.8 and the maximum information gain rate is greater than this threshold, feature 1 can be used as the division feature.
  • If the preset threshold is 1, the maximum information gain rate is less than the preset threshold.
  • In that case the current node can be used as a leaf node; analysing the sample set may show that samples with the category "male" are the most numerous, exceeding the number of samples with the category "female", so "male" can be used as the output of the leaf node.
  • the sample sets can be divided based on the feature values of the divided features. That is, the step "dividing the sample set according to the division feature" may include:
  • the target sample set is divided according to the feature value.
  • a sample with the same feature value in the sample set can be divided into the same subsample set.
  • the feature values of the divided features include: 0, 1, 2, then, at this time, the samples whose feature values are 0 can be classified into one class, and the samples with the feature value 1 are classified into one class, and the feature values are The samples of 2 are classified into one category.
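  • A small sketch of this division step, assuming each sample is represented as a dict mapping feature names to (already quantized) feature values:

```python
from collections import defaultdict

def split_by_feature(samples, feature):
    """Divide a sample set by one feature: samples sharing the same feature value
    end up in the same sub-sample set."""
    subsets = defaultdict(list)
    for sample in samples:
        subsets[sample[feature]].append(sample)
    return dict(subsets)   # e.g. {0: [...], 1: [...], 2: [...]} if the feature takes values 0, 1, 2
```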
  • For example, consider sample set D {sample 1, sample 2, …, sample i, …, sample n}, where each sample includes a number of features forming the feature set A.
  • the current node acts as a leaf node, and the sample category with the largest number of samples is selected as the output of the leaf node.
  • The feature corresponding to the maximum information gain rate g_R(D, A)max may be selected as the partitioning feature Ag, and the sample set D may be divided according to the feature Ag, for example into sub-sample sets D1 and D2.
  • the divided feature Ag in the subsample sets D1 and D2 is removed, that is, A-Ag.
  • the child nodes d1 and d2 of the root node d are generated with reference to FIG. 3, and the subsample set D1 is taken as the node information of the child node d1, and the subsample set D2 is taken as the node information of the child node d2.
  • A-Ag is taken as the feature set and the sub-sample set Di of each child node is used as the data set, and the above steps are called recursively to construct a subtree until the preset classification termination condition is satisfied.
  • the child node d1 it is determined whether the child node satisfies the preset classification termination condition. If yes, the current child node d1 is used as a leaf node, and the leaf node output is set according to the category of the sample in the subsample set corresponding to the child node d1.
  • the above-mentioned sub-sample set corresponding to the child node is continued to be classified according to the information gain classification method.
  • Taking child node d2 as an example, the information gain rate g_R(D, A) of each remaining feature for classifying the sample set D2 is calculated and the maximum information gain rate g_R(D, A)max is selected. When g_R(D, A)max is greater than the preset threshold ε, the feature corresponding to that information gain rate is selected as the partitioning feature Ag (for example, feature Ai+1), and D2 is divided into several sub-sample sets based on Ag; for example, D2 can be divided into sub-sample sets D21, D22, and D23. The partitioning feature Ag is then removed from D21, D22, and D23, the child nodes d21, d22, and d23 of the current node d2 are generated, and the sub-sample sets D21, D22, and D23 with the partitioning feature removed are used as their node information.
  • the above-described information gain rate classification based method can be used to construct a decision tree as shown in FIG. 4, and the output of the leaf node of the decision tree includes "male” or "female".
  • the feature values of the corresponding divided features may be marked on the path between the nodes. For example, in the above process based on information gain classification, the feature values of the corresponding divided features may be marked on the path of the current node and its child nodes.
  • For example, if the feature values of the partitioning feature Ag include 0 and 1, a 1 may be marked on the path between d2 and d and a 0 on the path between d1 and d, and so on. After each division,
  • marking the corresponding partition feature value, such as 0 or 1, on the path between the current node and its child nodes yields a decision tree as shown in FIG. 5.
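  • The recursive construction described above can be sketched as follows. This is an illustrative C4.5-style outline, not the patent's exact procedure: it assumes each sample is a dict of feature values plus a "label" key, a split_by_feature() helper like the sketch earlier, and a gain_ratio() helper like the sketch given after the formulas below.

```python
from collections import Counter

def majority_label(samples):
    """Sample category ("male"/"female") with the largest number of samples."""
    return Counter(s["label"] for s in samples).most_common(1)[0][0]

def build_tree(samples, features, epsilon=0.8):
    """Recursively build a decision tree in the style described above."""
    labels = {s["label"] for s in samples}
    if len(labels) == 1:                        # termination: only one category remains
        return {"leaf": labels.pop()}
    if not features:                            # no features left to divide on
        return {"leaf": majority_label(samples)}

    best = max(features, key=lambda f: gain_ratio(samples, f))
    if gain_ratio(samples, best) <= epsilon:    # maximum gain rate not above the threshold
        return {"leaf": majority_label(samples)}

    node = {"feature": best, "children": {}}
    remaining = [f for f in features if f != best]          # remove the division feature
    for value, subset in split_by_feature(samples, best).items():
        node["children"][value] = build_tree(subset, remaining, epsilon)
    return node
```

  • The root node of the decision tree would then correspond to a call such as build_tree(sample_set, list_of_all_features), mirroring the use of the whole sample set D as the node information of the root node.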
  • the information gain rate may be defined as the ratio of the information gain of the feature to the sample classification and the split information of the feature to the sample classification.
  • Information gain is defined for a single feature: it compares the amount of information in the system with and without a feature t; the difference between the two is the amount of information the feature brings to the system, that is, its information gain.
  • The information gain reflects the degree to which a feature reduces the uncertainty about the class ("male" or "female").
  • the split information is used to measure the breadth and uniformity of the feature split data (such as the sample set), and the split information can be the entropy of the feature.
  • the step “acquiring the information gain rate of the feature in the target sample set for the target sample set classification” may include:
  • the information gain of the feature for the sample set classification may be obtained based on the empirical entropy of the sample classification and the conditional entropy of the feature for the sample set classification result. That is, the step "acquiring the information gain of the feature classification for the target sample set" may include:
  • the first probability that the positive sample appears in the sample set and the second probability that the negative sample appears in the sample set can be obtained, the positive sample is a sample with a sample category of “male”, and the negative sample is a sample with a sample category of “female”.
  • the terms "first,” “second,” and “third,” etc. in this application are used to distinguish different objects, and are not intended to describe a particular order.
  • the information gain of the feature for the target sample set classification may be the difference between empirical entropy and conditional entropy.
  • the sample includes a multi-dimensional feature, such as feature A.
  • The information gain rate of feature A for the sample classification can be obtained by the following formula:
  • g_R(D, A) = g(D, A) / H_A(D),
  • where g_R(D, A) is the information gain rate of feature A for classifying sample set D, g(D, A) is the information gain of feature A for the sample classification, and H_A(D) is the split information of feature A, that is, the entropy of feature A.
  • g(D, A) can be obtained by the following formula:
  • g(D, A) = H(D) - H(D|A),
  • where H(D) is the empirical entropy of classifying sample set D and H(D|A) is the conditional entropy of feature A for classifying sample set D.
  • Suppose the total number of samples in sample set D is |D| and the number of samples whose category is "male" is j.
  • The information gain is the difference between the information of the decision tree before and after the attribute selection.
  • The empirical entropy H(D) of the sample classification is then: H(D) = -(j/|D|)·log2(j/|D|) - ((|D|-j)/|D|)·log2((|D|-j)/|D|).
  • Specifically, the sample set may be divided into several sub-sample sets according to feature A; the information entropy of each sub-sample set's classification and the probability with which each value of feature A appears in the sample set are then obtained, and from these the divided information entropy, that is, the conditional entropy of feature A for the sample set classification result, can be obtained.
  • The conditional entropy of feature A for the classification result of sample set D can be calculated by the following formula:
  • H(D|A) = Σ_{i=1..n} p_i·H(D|A=A_i),
  • where n is the number of values of feature A, that is, the number of feature value types, p_i is the probability that a sample whose value of feature A is the i-th value appears in sample set D, and A_i is the i-th value of A.
  • For example, sample set D {sample 1, sample 2, …, sample n} can be divided into three sub-sample sets by feature A, such as D1 {sample 1, …, sample d}, D2 {sample d+1, …, sample e}, and D3 {sample e+1, …, sample n}, where d and e are positive integers less than n.
  • The conditional entropy of feature A for the classification result of sample set D is then:
  • H(D|A) = p1·H(D|A=A1) + p2·H(D|A=A2) + p3·H(D|A=A3),
  • where H(D|A=A1) is the information entropy of the classification of sub-sample set D1, that is, its empirical entropy, which can be calculated by the above formula for empirical entropy.
  • The information gain of feature A for classifying sample set D can then be calculated by the following formula:
  • g(D, A) = H(D) - H(D|A), that is, the difference between the empirical entropy H(D) and the conditional entropy H(D|A).
  • The split information of a feature for the sample set classification is the entropy of that feature.
  • The distribution probabilities of the feature's values can be obtained from the distribution of the samples in the target sample set; for example, H_A(D) can be obtained by the following formula:
  • H_A(D) = -Σ_{i=1..n} (|D_i|/|D|)·log2(|D_i|/|D|),
  • where D_i is the subset of samples in sample set D whose feature A takes the i-th value.
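  • Putting the above formulas together, a minimal sketch of the empirical entropy, conditional entropy, information gain, split information, and information gain rate (again assuming samples are dicts with a "label" key) might look like this:

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(samples, feature):
    """Information gain rate g_R(D, A) = g(D, A) / H_A(D) for one feature A."""
    total = len(samples)
    h_d = entropy([s["label"] for s in samples])                 # H(D)

    subsets = {}
    for s in samples:
        subsets.setdefault(s[feature], []).append(s["label"])

    # H(D|A) = sum_i p_i * H(D | A = A_i)
    h_d_given_a = sum((len(sub) / total) * entropy(sub) for sub in subsets.values())
    gain = h_d - h_d_given_a                                     # g(D, A) = H(D) - H(D|A)

    # H_A(D) = -sum_i (|D_i|/|D|) * log2(|D_i|/|D|)
    split_info = -sum((len(sub) / total) * math.log2(len(sub) / total)
                      for sub in subsets.values())
    return gain / split_info if split_info > 0 else 0.0
```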
  • the prediction time can be set according to requirements, such as the current time.
  • a multi-dimensional feature of a behavioral habit of a user who does not provide gender information may be collected as a prediction sample according to a predicted time point.
  • The multi-dimensional features collected in steps 201 and 203 are the same features, for example, the length of time the user reads male-oriented novels, the length of time the user reads female-oriented novels, and so on.
  • the corresponding output result is obtained according to the predicted sample and the decision tree model, and the gender of the user who does not provide the gender information is determined according to the output result.
  • the output results include "male” or "female”.
  • the corresponding leaf node may be determined according to the characteristics of the predicted sample and the decision tree model, and the output of the leaf node is used as a predicted output result.
  • the current leaf node is determined according to the branch condition of the decision tree (ie, the feature value of the divided feature), and the output of the leaf node is taken as the prediction result. Since the output of the leaf node includes "male” or "female", the gender of the user who does not provide the gender information can be determined based on the decision tree at this time.
  • the corresponding leaf node may be found as an1 according to the branch condition of the decision tree in the decision tree shown in FIG. 5, and the output of the leaf node an1 is “male”. At this time, it is determined that the gender of the user who does not provide the gender information is male.
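  • A sketch of this prediction step, using the nested-dict tree structure from the build_tree sketch above and assuming every feature value seen at prediction time also appeared during training:

```python
def predict(tree, prediction_sample):
    """Walk from the root to a leaf along the branches whose feature values match
    the prediction sample; the leaf output ("male"/"female") is the predicted gender."""
    node = tree
    while "leaf" not in node:
        value = prediction_sample[node["feature"]]
        node = node["children"][value]
    return node["leaf"]
```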
  • To summarize, the embodiment of the present application collects the multi-dimensional features of the behavior habits of users who have provided gender information as samples and constructs a sample set of those behavior habits; when the number of features exceeds a preset threshold, the sample set is classified according to the information gain rate of each feature for classifying the samples, to construct a decision tree model for predicting the gender of the user, the output of which includes "male" or "female"; the multi-dimensional features of the behavior habits of a user who has not provided gender information are collected at the prediction time as a prediction sample; and the gender of that user is predicted based on the prediction sample and the decision tree model.
  • Because each sample in the sample set includes multiple pieces of feature information reflecting the user's behavior habits, the embodiment of the present application makes user gender prediction more intelligent and improves the accuracy of the prediction.
  • the user gender prediction method of the present application will be further described below based on the method described in the above embodiments.
  • the user gender prediction method may include:
  • The multi-dimensional feature of the behavior habits of a user who has provided gender information is a feature vector of a certain length, and the parameter in each dimension corresponds to one piece of feature information representing those behavior habits; that is, the multi-dimensional feature consists of multiple features.
  • the sample set of the behavior habits of the user who has provided the gender information may include a plurality of samples, each of which includes a multi-dimensional feature of the behavior habit of the user who has provided the gender information.
  • a sample set of behavioral habits of users who have provided gender information may include a plurality of samples collected at a preset frequency during a historical time period.
  • The historical time period may be, for example, the past 7 days or the past 10 days; the preset frequency may be, for example, once every 10 minutes or once every half hour. It can be understood that the multi-dimensional feature data of the behavior habits collected at one time constitutes one sample, and the samples collected over multiple collections constitute the sample set.
  • A specific sample may be as shown in Table 1 below and includes feature information of multiple dimensions. It should be noted that the feature information shown in Table 1 is only an example; in practice, the number of pieces of feature information included in one sample may be more or less than the number shown in Table 1, and the specific feature information may differ from that shown in Table 1, which is not specifically limited herein.
  • Table 1 (Dimension: Characteristic information)
  • Dimension 1: The number of times the user viewed male-oriented goods (such as men's clothing) in the shopping application
  • Dimension 2: The length of time the user browsed male-oriented goods (such as men's clothing) in the shopping application
  • Dimension 3: The number of times the user viewed female-oriented goods (such as cosmetics or women's clothing) in the shopping application
  • Dimension 4: The length of time the user browsed female-oriented goods (such as cosmetics or women's clothing) in the shopping application
  • Dimension 5: The length of time the user reads male-oriented novels
  • Dimension 6: The length of time the user reads horoscope (constellation) news
  • Dimension 9: The number of times the user uses beauty software
  • Dimension 11: The number of times and the length of time the user plays different categories of games
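  • For illustration only, one such sample could be laid out along some of the Table 1 dimensions as below; the key names and values are hypothetical, since the patent only names the dimensions in prose.

```python
# Hypothetical prediction-time sample laid out along some of the Table 1 dimensions.
sample = {
    "male_goods_view_count": 3,          # dimension 1
    "male_goods_browse_minutes": 12,     # dimension 2
    "female_goods_view_count": 41,       # dimension 3
    "female_goods_browse_minutes": 95,   # dimension 4
    "male_novel_reading_hours": 0.5,     # dimension 5
    "beauty_app_use_count": 27,          # dimension 9
    "game_play_count_by_category": 4,    # dimension 11 (the duration would be a further key)
}
```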
  • Each sample in the sample set can be marked to obtain a sample label for each sample. Since this embodiment predicts the gender of the user, the sample labels are the gender "male" and the gender "female"; that is, the sample categories include "male" and "female". Specifically, a sample may be marked according to the behavior habits of the user who has provided the gender information.
  • For example, if the user browses male-oriented goods (such as men's clothing) in the shopping application 50 times, the sample is marked "male"; if the user reads female-oriented novels for more than 20 hours, the sample is marked "female".
  • When marking, the value "1" may be used to indicate "male" and the value "0" to indicate "female", or vice versa.
  • the root node d of the decision tree can be made, and the sample set D is taken as the node information of the root node d.
  • the sample set of the root node is determined as the target sample set to be classified currently.
  • Specifically, the information gain rates g_R(D, A)1, g_R(D, A)2, …, g_R(D, A)m of the features for classifying the sample set can be calculated,
  • and the maximum information gain rate g_R(D, A)max is selected from them; for example, g_R(D, A)i may be the maximum information gain rate.
  • the information gain rate of the feature classification for the sample set can be obtained as follows:
  • the ratio of the information gain to the split information (the entropy of the feature) is then taken, giving the information gain rate of the feature for the sample classification.
  • the sample includes a multi-dimensional feature, such as feature A.
  • The information gain rate of feature A for the sample classification can be obtained by the following formula:
  • g_R(D, A) = g(D, A) / H_A(D),
  • where g(D, A) is the information gain of feature A for the sample classification and H_A(D) is the split information of feature A, that is, the entropy of feature A;
  • g(D, A) = H(D) - H(D|A), where H(D) is the empirical entropy of the sample classification and H(D|A) is the conditional entropy of feature A for the sample classification.
  • Suppose the total number of samples in sample set D is |D| and the number of samples whose category is "male" is j.
  • The information gain is the difference between the information of the decision tree before and after the attribute selection.
  • The empirical entropy H(D) of the sample classification is then: H(D) = -(j/|D|)·log2(j/|D|) - ((|D|-j)/|D|)·log2((|D|-j)/|D|).
  • Specifically, the sample set may be divided into several sub-sample sets according to feature A; the information entropy of each sub-sample set's classification and the probability with which each value of feature A appears in the sample set are then obtained, and from these the divided information entropy, that is, the conditional entropy of feature A for the sample set classification result, can be obtained.
  • The conditional entropy of feature A for the classification result of sample set D can be calculated by the following formula:
  • H(D|A) = Σ_{i=1..n} p_i·H(D|A=A_i),
  • where n is the number of values of feature A, that is, the number of feature value types, p_i is the probability that a sample whose value of feature A is the i-th value appears in sample set D, and A_i is the i-th value of A.
  • For example, sample set D {sample 1, sample 2, …, sample n} can be divided into three sub-sample sets by feature A, such as D1 {sample 1, …, sample d}, D2 {sample d+1, …, sample e}, and D3 {sample e+1, …, sample n}, where d and e are positive integers less than n.
  • The conditional entropy of feature A for the classification result of sample set D is then:
  • H(D|A) = p1·H(D|A=A1) + p2·H(D|A=A2) + p3·H(D|A=A3),
  • where H(D|A=A1) is the information entropy of the classification of sub-sample set D1, that is, its empirical entropy, which can be calculated by the above formula for empirical entropy.
  • The information gain of feature A for classifying sample set D can then be calculated by the following formula:
  • g(D, A) = H(D) - H(D|A), that is, the difference between the empirical entropy H(D) and the conditional entropy H(D|A).
  • The split information of a feature for the sample set classification is the entropy of that feature.
  • The distribution probabilities of the feature's values can be obtained from the distribution of the samples in the target sample set.
  • H_A(D) can be obtained by the following formula:
  • H_A(D) = -Σ_{i=1..n} (|D_i|/|D|)·log2(|D_i|/|D|),
  • where D_i is the subset of samples in sample set D whose feature A takes the i-th value.
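  • A small worked example with hypothetical numbers may help fix the formulas: suppose D has 10 samples (6 "male", 4 "female") and a binary feature A that equals 1 for 5 samples (1 male, 4 female) and 0 for the remaining 5 samples (all male).

```python
import math

# D: 10 samples, 6 "male" and 4 "female"; A: binary feature, 1 for 5 samples
# (1 male, 4 female) and 0 for 5 samples (all male). All numbers are hypothetical.
h_d = -(0.6 * math.log2(0.6) + 0.4 * math.log2(0.4))           # H(D)       ~ 0.971
h_d_a1 = -(0.2 * math.log2(0.2) + 0.8 * math.log2(0.8))        # H(D|A=1)   ~ 0.722
h_d_a0 = 0.0                                                   # H(D|A=0): only one class
h_d_a = 0.5 * h_d_a1 + 0.5 * h_d_a0                            # H(D|A)     ~ 0.361
gain = h_d - h_d_a                                             # g(D, A)    ~ 0.610
split_info = -(0.5 * math.log2(0.5) + 0.5 * math.log2(0.5))    # H_A(D)     = 1.0
gain_ratio = gain / split_info                                 # g_R(D, A)  ~ 0.610
```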
  • step 306. Determine whether the maximum information gain rate is greater than a preset threshold. If yes, go to step 307. If no, go to step 313.
  • Specifically, it can be determined whether the maximum information gain rate g_R(D, A)max is greater than a preset threshold ε, where ε can be set according to actual needs.
  • If it is, the feature corresponding to g_R(D, A)max may be selected as the division feature Ag.
  • the sample set may be divided into several sub-sample sets according to the number of feature values of the divided features, and the number of the sub-sample sets is the same as the number of feature values.
  • the feature values of the divided features include: 0, 1, 2, then, at this time, the samples whose feature values are 0 can be classified into one class, and the samples with the feature value 1 are classified into one class, and the feature values are The samples of 2 are classified into one category.
  • the sample set D can be divided into D1 ⁇ sample 1, sample 2, ... sample k ⁇ and D2 ⁇ sample k+1 ... sample n ⁇ . Then, the partitioning features Ag in the subsample sets D1 and D2 can be removed, that is, A-Ag.
  • one subsample set corresponds to one child node.
  • Referring to FIG. 3, the child nodes d1 and d2 of the root node d are generated, the sub-sample set D1 is used as the node information of child node d1, and the sub-sample set D2 is used as the node information of child node d2.
  • the divided feature values corresponding to the child nodes may also be set on the path of the child node and the current node, so as to facilitate subsequent user gender prediction, refer to FIG. 5.
  • The preset classification termination condition may be set according to actual requirements. When a child node satisfies the preset classification termination condition, the current child node is used as a leaf node and classification of the sample set corresponding to that child node is stopped; when the child node does not satisfy the preset classification termination condition, classification of the sample set corresponding to the child node continues.
  • the preset classification termination condition may include: the number of categories of the samples in the removed sub-sample set of the child node is a preset number.
  • the preset classification termination condition may include: the number of categories of the samples in the removed subsample set corresponding to the child node is 1, that is, the sample in the sample set of the child node has only one category.
  • the child node d1 it is determined whether the child node satisfies the preset classification termination condition. If yes, the current child node d1 is used as a leaf node, and the leaf node output is set according to the category of the sample in the subsample set corresponding to the child node d1.
  • the above-mentioned sub-sample set corresponding to the child node is continued to be classified according to the information gain classification method.
  • Taking child node d2 as an example, the information gain rate g_R(D, A) of each remaining feature for classifying the sample set D2 is calculated and the maximum information gain rate g_R(D, A)max is selected. When g_R(D, A)max is greater than the preset threshold ε, the feature corresponding to that information gain rate is selected as the partitioning feature Ag (for example, feature Ai+1), and D2 is divided into several sub-sample sets based on Ag; for example, D2 can be divided into sub-sample sets D21, D22, and D23. The partitioning feature Ag is then removed from D21, D22, and D23, the child nodes d21, d22, and d23 of the current node d2 are generated, and the sub-sample sets D21, D22, and D23 with the partitioning feature removed are used as their node information.
  • the child node is used as a leaf node, and the output of the leaf node is set according to a sample category of the child sample set of the child node.
  • the preset classification termination condition may include: the number of categories of the samples in the removed subsample set corresponding to the child node is 1, that is, the sample in the sample set of the child node has only one category.
  • the category of the sample in the subsample set is taken as the output of the leaf node. If there is only a sample with the category "male” in the subsample set after removal, then "male" can be used as the output of the leaf node.
  • the current node is used as a leaf node, and the sample category with the largest number of samples is selected as the output of the leaf node.
  • sample categories include “male” and “female”.
  • the sample category having the largest number of samples in the sub-sample set D1 can be used as the output of the leaf node. If the "female" has the largest number of samples, then “female” can be used as the output of the leaf node a1.
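  • In code, picking that majority category for a leaf is a one-liner; the labels below are hypothetical and just mirror the D1 example above.

```python
from collections import Counter

d1_labels = ["female", "female", "male"]                  # hypothetical labels of D1
leaf_output = Counter(d1_labels).most_common(1)[0][0]     # -> "female"
```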
  • the prediction time can be set according to requirements, such as the current time.
  • the corresponding leaf node may be determined according to the characteristics of the predicted sample and the decision tree model, and the output of the leaf node is used as a predicted output result.
  • the current leaf node is determined according to the branch condition of the decision tree (ie, the feature value of the divided feature), and the output of the leaf node is taken as the prediction result. Since the output of the leaf node includes "male” or "female", the gender of the user who does not provide the gender information can be determined based on the decision tree at this time.
  • the corresponding leaf node may be found as an1 according to the branch condition of the decision tree in the decision tree shown in FIG. 5, and the output of the leaf node an1 is “male”. At this time, it is determined that the gender of the user who does not provide the gender information is male.
  • To summarize, the embodiment of the present application collects the multi-dimensional features of the behavior habits of users who have provided gender information as samples and constructs a sample set of those behavior habits; when the number of features exceeds a preset threshold, the sample set is classified according to the information gain rate of each feature for classifying the samples, to construct a decision tree model for predicting the gender of the user, the output of which includes "male" or "female"; the multi-dimensional features of the behavior habits of a user who has not provided gender information are collected at the prediction time as a prediction sample; and the gender of that user is predicted based on the prediction sample and the decision tree model.
  • Because each sample in the sample set includes multiple pieces of feature information reflecting the user's behavior habits, the embodiment of the present application makes user gender prediction more intelligent and improves the accuracy of the prediction.
  • FIG. 7 is a schematic structural diagram of a user gender prediction apparatus according to an embodiment of the present application.
  • the user gender prediction device is applied to an electronic device, and the user gender prediction device includes a first collection unit 401, a classification unit 402, a second collection unit 403, and a prediction unit 404, as follows:
  • the first collecting unit 401 is configured to collect a multi-dimensional feature of the behavior habit of the user who has provided the gender information as a sample, and construct a sample set of behavior habits of the user who has provided the gender information;
  • the classification unit 402 is configured to perform sample classification on the sample set according to the information gain rate of the feature classification for the feature when the number of the features exceeds a preset threshold, to construct a decision tree model of the user gender prediction;
  • a second collecting unit 403 configured to collect, according to the predicted time, a multi-dimensional feature of a behavior habit of a user who does not provide gender information as a prediction sample;
  • the prediction unit 404 is configured to predict, according to the prediction sample and the decision tree model, the gender of the user who does not provide the gender information.
  • the classification unit 402 may include:
  • a first node generating sub-unit 4021 configured to generate a root node of the decision tree, and use the sample set as node information of the root node; and determine a sample set of the root node as a target sample set to be classified currently;
  • a gain rate obtaining sub-unit 4022 configured to acquire an information gain rate of the feature in the target sample set for classifying the target sample set
  • a feature determining sub-unit 4023 configured to select a current division feature from the features according to the information gain rate;
  • a classification sub-unit 4024 configured to divide the sample set according to the dividing feature to obtain a plurality of sub-sample sets
  • a second node generating sub-unit 4025 configured to remove the divided feature of the sample in the sub-sample set, obtain a removed sub-sample set, generate a child node of the current node, and use the removed sub-sample set as a Describe the node information of the child node;
  • a determining sub-unit 4026 configured to determine whether the child node meets the preset classification termination condition; if not, to update the target sample set to the removed sub-sample set and trigger the gain rate acquisition sub-unit 4022 to execute the step of acquiring the information gain rate of the features in the target sample set for classifying the target sample set; and if so, to use the child node as a leaf node and set the output of the leaf node according to the categories of the samples in the removed sub-sample set,
  • where the categories of the samples include "male" and "female".
  • The classification sub-unit 4024 may be configured to acquire the feature values of the division feature in the sample set and to divide the sample set according to those feature values,
  • so that samples with the same feature value are divided into the same sub-sample set.
  • The feature determining subunit 4023 can be used to:
  • select the feature corresponding to the target information gain rate as the current dividing feature.
  • the gain rate acquisition sub-unit 4022 can be used to:
  • the gain acquisition sub-unit 4022 can be used to:
  • The determining sub-unit 4026 may be configured to determine whether the number of categories of the samples in the removed sub-sample set corresponding to the child node is a preset number.
  • the feature determining sub-unit 4023 is further configured to: when the target information gain rate is not greater than a preset threshold, use the current node as a leaf node, and select a sample category with the largest number of samples as the leaf node. Output.
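  • Purely as an illustration of how these units fit together, the sketch below maps them onto one class; the method names, signatures, and helpers are assumptions made for this example and are not part of the patent.

```python
from typing import Callable, Dict, List

class UserGenderPredictionApparatus:
    """Illustrative grouping of the four units as methods of one class."""

    def __init__(self, build_tree: Callable, predict: Callable, threshold: float = 0.8):
        self.build_tree = build_tree      # e.g. the build_tree sketch earlier
        self.predict = predict            # e.g. the predict sketch earlier
        self.threshold = threshold
        self.model = None

    def first_collection_unit(self, users_with_gender: List[Dict]) -> List[Dict]:
        """Collect behaviour features of users who provided gender, as labelled samples."""
        return [{**u["features"], "label": u["gender"]} for u in users_with_gender]

    def classification_unit(self, sample_set: List[Dict], features: List[str]) -> None:
        """Build the decision tree model; its leaf outputs are "male" or "female"."""
        self.model = self.build_tree(sample_set, features, self.threshold)

    def second_collection_unit(self, user: Dict) -> Dict:
        """Collect the same behaviour features for a user who did not provide gender."""
        return user["features"]

    def prediction_unit(self, prediction_sample: Dict) -> str:
        """Predict the gender from the prediction sample and the decision tree model."""
        return self.predict(self.model, prediction_sample)
```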
  • the steps performed by the units in the user gender prediction apparatus may refer to the method steps described in the foregoing method embodiments.
  • the user gender prediction device can be integrated in an electronic device such as a mobile phone, a tablet, or the like.
  • the foregoing various units may be implemented as an independent entity, and may be implemented in any combination, and may be implemented as the same entity or a plurality of entities.
  • the foregoing units refer to the foregoing embodiments, and details are not described herein again.
  • in the user gender prediction apparatus of this embodiment, the first collection unit 401 collects multi-dimensional features of the behavior habits of users who have provided gender information as samples, and constructs a sample set of the behavior habits of these users;
  • the classification unit 402 classifies the sample set according to the information gain rate of each feature for sample classification, so as to construct a decision tree model for predicting user gender, where the output of the decision tree model includes "male" or "female";
  • the second collection unit 403 collects, according to the prediction time, multi-dimensional features of the behavior habits of a user who has not provided gender information as a prediction sample;
  • the prediction unit 404 predicts, according to the prediction sample and the decision tree model, the gender of the user who has not provided gender information.
  • each sample in the sample set includes a plurality of pieces of feature information reflecting the user's behavior habits.
  • in this way, the embodiment of the present application makes user gender prediction more intelligent and improves the accuracy of the prediction.
  • the electronic device 500 includes a processor 501 and a memory 502.
  • the processor 501 is electrically connected to the memory 502.
  • the processor 501 is the control center of the electronic device 500; it connects the various parts of the entire electronic device through various interfaces and lines, and performs the various functions of the electronic device 500 and processes data by running or loading the computer program stored in the memory 502 and invoking the data stored in the memory 502, thereby monitoring the electronic device 500 as a whole.
  • the memory 502 can be used to store software programs and modules, and the processor 501 executes various functional applications and data processing by running computer programs and modules stored in the memory 502.
  • the memory 502 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a computer program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may be stored according to Data created by the use of electronic devices, etc.
  • the memory 502 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 502 may also include a memory controller to provide the processor 501 with access to the memory 502.
  • in this embodiment, the processor 501 in the electronic device 500 loads instructions corresponding to the processes of one or more computer programs into the memory 502, and runs the computer programs stored in the memory 502, so as to implement various functions, as follows:
  • collecting multi-dimensional features of the behavior habits of users who have provided gender information as samples, and constructing a sample set of the behavior habits of these users; when the number of features exceeds a preset threshold, classifying the sample set according to the information gain rate of each feature for sample classification, so as to construct a decision tree model for user gender prediction, where the output of the decision tree model includes "male" or "female";
  • collecting, according to the prediction time, multi-dimensional features of the behavior habits of a user who has not provided gender information as a prediction sample; and predicting, according to the prediction sample and the decision tree model, the gender of the user who has not provided gender information.
  • when constructing the decision tree model, the processor 501 may specifically perform the following steps: if the child node meets the preset classification termination condition, use the child node as a leaf node, and set the output of the leaf node according to the categories of the samples in the removed sub-sample set, where the categories of the samples include "male" and "female".
  • when dividing the target sample set according to the dividing feature, the processor 501 may specifically perform the following steps: acquiring the feature values of the dividing feature in the target sample set, and dividing the target sample set according to the feature values.
  • when selecting the current dividing feature from the features according to the information gain rates, the processor 501 may specifically perform the following steps: selecting the largest target information gain rate from the information gain rates, determining whether the target information gain rate is greater than a preset threshold, and if yes, selecting the feature corresponding to the target information gain rate as the current dividing feature.
  • the processor 501 may further perform the following steps: when the target information gain rate is not greater than the preset threshold, using the current node as a leaf node, and selecting the sample category with the largest number of samples as the output of the leaf node.
  • when acquiring the information gain of a feature for classifying the target sample set, the processor 501 may specifically perform the following steps: acquiring the empirical entropy of the target sample set classification, acquiring the conditional entropy of the feature for the classification result of the target sample set, and acquiring the information gain according to the conditional entropy and the empirical entropy.
  • the electronic device in the embodiment of the present application collects multi-dimensional features of the behavior habits of users who have provided gender information as samples, and constructs a sample set of the behavior habits of these users; when the number of features exceeds a preset threshold, the sample set is classified according to the information gain rate of each feature for sample classification, so as to construct a decision tree model for predicting user gender, where the output of the decision tree model includes "male" or "female"; multi-dimensional features of the behavior habits of a user who has not provided gender information are collected according to the prediction time as a prediction sample; and the gender of the user who has not provided gender information is predicted according to the prediction sample and the decision tree model.
  • the electronic device 500 may further include: a display 503, a radio frequency circuit 504, an audio circuit 505, and a power source 506.
  • the display 503, the radio frequency circuit 504, the audio circuit 505, and the power source 506 are electrically connected to the processor 501, respectively.
  • the display 503 can be used to display information entered by a user or information provided to a user, as well as various graphical user interfaces, which can be composed of graphics, text, icons, video, and any combination thereof.
  • the display 503 can include a display panel.
  • the display panel can be configured in the form of a liquid crystal display (LCD) or an organic light-emitting diode (OLED).
  • the radio frequency circuit 504 may be configured to transmit and receive radio frequency signals, so as to establish wireless communication with a network device or another electronic device and to exchange signals with it.
  • the audio circuit 505 can be used to provide an audio interface between a user and an electronic device through a speaker or a microphone.
  • the power source 506 can be used to power various components of the electronic device 500.
  • the power source 506 may be logically coupled to the processor 501 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system.
  • the electronic device 500 may further include a camera, a Bluetooth module, and the like, and details are not described herein.
  • the embodiment of the present application further provides a storage medium storing a computer program which, when run on a computer, causes the computer to execute the user gender prediction method in any of the above embodiments, for example: collecting multi-dimensional features of the behavior habits of users who have provided gender information as samples, and constructing a sample set of the behavior habits of these users; when the number of features exceeds a preset threshold, classifying the sample set according to the information gain rate of each feature for sample classification, so as to construct a decision tree model for predicting user gender, where the output of the decision tree model includes "male" or "female"; collecting, according to the prediction time, multi-dimensional features of the behavior habits of a user who has not provided gender information as a prediction sample; and predicting, based on the prediction sample and the decision tree model, the gender of the user who has not provided gender information.
  • the storage medium may be a magnetic disk, an optical disk, a read only memory (ROM), or a random access memory (RAM).
  • the computer program may be stored in a computer-readable storage medium, such as the memory of the electronic device, and executed by at least one processor in the electronic device; the execution process may include, for example, the flow of the embodiments of the user gender prediction method.
  • the storage medium may be a magnetic disk, an optical disk, a read only memory, a random access memory, or the like.
  • each functional module may be integrated into one processing chip, or each module may exist physically separately, or two or more modules may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • if the integrated module is implemented in the form of a software functional module and sold or used as a standalone product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and apparatus for predicting user gender and an electronic device. The method comprises: acquiring a multidimensional feature of a behavior habit of a user, who has provided gender information, as a sample, and constructing a sample set of the behavior habit of the user who has provided gender information (201); when the number of the features exceeds a preset threshold, performing sample classification on the sample set according to information gain rate of the features for sample classification, to construct a decision tree model for predicting user gender (202); acquiring, according to a prediction time, a multidimensional feature of a behavior habit of a user, who has not provided gender information, as a prediction sample (203); and predicting, according to the prediction sample and the decision tree model, gender of the user who has not provided gender information (204).

Description

User gender prediction method, device and electronic device
This application claims priority to the Chinese patent application No. 201711405558.8, filed with the Chinese Patent Office on December 22, 2017 and entitled "User gender prediction method, device, medium and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
本申请涉及电子设备终端领域,具体涉及一种用户性别预测方法、装置及电子设备。The present application relates to the field of electronic device terminals, and in particular, to a user gender prediction method, device, and electronic device.
Background Art
用户画像是近年来非常热门的一个研究方向,比如在智能手机上。如果有一种方法能够从用户习惯上准确地判断出用户的性别,从而对电子设备进行各方面的深度优化是非常有意义的。当前电子设备***会让用户注册绑定电子设备与用户账号,但不是每个用户都愿意提供性别信息,则导致不能解决一大部分用户画像的问题。因此,对没有提供性别信息的用户,有必要提供一种用户性别预测方法、装置及电子设备。User portraits are a very popular research direction in recent years, such as on smartphones. If there is a way to accurately determine the gender of the user from the user's habit, it is very meaningful to optimize the electronic device in all aspects. The current electronic device system allows users to register binding electronic devices and user accounts, but not every user is willing to provide gender information, which can not solve the problem of a large part of the user portrait. Therefore, for users who do not provide gender information, it is necessary to provide a user gender prediction method, device and electronic device.
Technical Problem
本申请实施例提供一种用户性别预测方法、装置及电子设备,以智能关闭应用程序。The embodiment of the present application provides a user gender prediction method, device, and electronic device to intelligently close an application.
Technical Solution
本申请实施例提供一种用户性别预测方法,包括:The embodiment of the present application provides a user gender prediction method, including:
采集已提供性别信息的用户的行为习惯的多维特征作为样本,并构建已提供性别信息的用户的行为习惯的样本集;Collecting multidimensional features of the behavioral habits of users who have provided gender information as a sample, and constructing a sample set of behavioral habits of users who have provided gender information;
当所述特征的数量超过预设阈值时,根据特征对于样本分类的信息增益率对样本集进行样本分类,以构建出用户性别预测的决策树模型;When the number of the features exceeds a preset threshold, the sample set is sample-classified according to the information gain rate of the feature classification for the sample to construct a decision tree model of the user gender prediction;
根据预测时间采集未提供性别信息的用户的行为习惯的多维特征作为预测样本;Collecting multi-dimensional features of behavioral habits of users who do not provide gender information as prediction samples according to predicted time;
根据预测样本和决策树模型预测未提供性别信息的用户的性别。The gender of users who did not provide gender information was predicted based on the predicted sample and the decision tree model.
本申请实施例还提供一种用户性别预测方法装置,所述装置包括:The embodiment of the present application further provides a user gender prediction method apparatus, where the apparatus includes:
第一采集单元,用于采集已提供性别信息的用户的行为习惯的多维特征作为样本,并构建已提供性别信息的用户的行为习惯的样本集;a first collecting unit, configured to collect a multi-dimensional feature of a behavior habit of a user who has provided gender information as a sample, and construct a sample set of behavior habits of the user who has provided the gender information;
a classifying unit, configured to: when the number of the features exceeds a preset threshold, perform sample classification on the sample set according to the information gain rate of the features for sample classification, so as to construct a decision tree model for user gender prediction, where the output of the decision tree model includes "male" or "female";
a second collecting unit, configured to collect, according to the prediction time, multi-dimensional features of the behavior habits of a user who has not provided gender information as a prediction sample;
a prediction unit, configured to predict, according to the prediction sample and the decision tree model, the gender of the user who has not provided gender information.
An electronic device provided by an embodiment of the present application includes a processor and a memory, where the memory stores a computer program, and the processor is configured to execute a user gender prediction method by calling the computer program, the user gender prediction method including:
采集已提供性别信息的用户的行为习惯的多维特征作为样本,并构建已提供性别信息的用户的行为习惯的样本集;Collecting multidimensional features of the behavioral habits of users who have provided gender information as a sample, and constructing a sample set of behavioral habits of users who have provided gender information;
当所述特征的数量超过预设阈值时,根据特征对于样本分类的信息增益率对样本集进行样本分类,以构建出用户性别预测的决策树模型;When the number of the features exceeds a preset threshold, the sample set is sample-classified according to the information gain rate of the feature classification for the sample to construct a decision tree model of the user gender prediction;
根据预测时间采集未提供性别信息的用户的行为习惯的多维特征作为预测样本;Collecting multi-dimensional features of behavioral habits of users who do not provide gender information as prediction samples according to predicted time;
根据预测样本和决策树模型预测未提供性别信息的用户的性别。The gender of users who did not provide gender information was predicted based on the predicted sample and the decision tree model.
Beneficial Effects
According to the user gender prediction method, apparatus, and electronic device provided by the present application, multi-dimensional features of the behavior habits of users who have provided gender information are collected as samples, and a sample set of the behavior habits of these users is constructed; when the number of features exceeds a preset threshold, the sample set is classified according to the information gain rate of each feature for sample classification, so as to construct a decision tree model for predicting user gender, where the output of the decision tree model includes "male" or "female"; multi-dimensional features of the behavior habits of a user who has not provided gender information are collected according to the prediction time as a prediction sample; and the gender of the user who has not provided gender information is predicted according to the prediction sample and the decision tree model.
Brief Description of the Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the user gender prediction method according to an embodiment of the present application.
FIG. 2 is a schematic flowchart of the user gender prediction method according to an embodiment of the present application.
FIG. 3 is a schematic diagram of a decision tree according to an embodiment of the present application.
FIG. 4 is a schematic diagram of another decision tree according to an embodiment of the present application.
FIG. 5 is a schematic diagram of still another decision tree according to an embodiment of the present application.
FIG. 6 is another schematic flowchart of the user gender prediction method according to an embodiment of the present application.
FIG. 7 is a schematic structural diagram of a user gender prediction apparatus according to an embodiment of the present application.
FIG. 8 is another schematic structural diagram of the user gender prediction apparatus according to an embodiment of the present application.
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
FIG. 10 is another schematic structural diagram of the electronic device according to an embodiment of the present application.
Embodiments of the Invention
一种用户性别预测方法,包括:A user gender prediction method includes:
采集已提供性别信息的用户的行为习惯的多维特征作为样本,并构建已提供性别信息的用户的行为习惯的样本集;Collecting multidimensional features of the behavioral habits of users who have provided gender information as a sample, and constructing a sample set of behavioral habits of users who have provided gender information;
当所述特征的数量超过预设阈值时,根据特征对于样本分类的信息增益率对样本集进行样本分类,以构建出用户性别预测的决策树模型;When the number of the features exceeds a preset threshold, the sample set is sample-classified according to the information gain rate of the feature classification for the sample to construct a decision tree model of the user gender prediction;
根据预测时间采集未提供性别信息的用户的行为习惯的多维特征作为预测样本;Collecting multi-dimensional features of behavioral habits of users who do not provide gender information as prediction samples according to predicted time;
根据预测样本和决策树模型预测未提供性别信息的用户的性别。The gender of users who did not provide gender information was predicted based on the predicted sample and the decision tree model.
在本申请实施例所提供的用户性别预测方法中,当所述特征的数量超过预设阈值时,根据特征对于样本分类的信息增益率对样本集进行样本分类,以构建出用户性别预测的决策树模型,包括:In the user gender prediction method provided by the embodiment of the present application, when the number of the features exceeds a preset threshold, the sample set is sampled according to the information gain rate of the feature classification for the sample to construct a user gender prediction decision. Tree model, including:
生成决策树的根节点,并将所述样本集作为所述根节点的节点信息;Generating a root node of the decision tree, and using the sample set as node information of the root node;
将所述根节点的样本集确定为当前待分类的目标样本集;Determining the sample set of the root node as a target sample set to be classified currently;
获取目标样本集内所述特征对于目标样本集分类的信息增益率;Obtaining an information gain rate of the feature in the target sample set for the target sample set classification;
根据所述信息增益率选取从所述特征中选取当前的划分特征;Selecting a current partitioning feature from the features according to the information gain rate;
根据所述划分特征对所述样本集进行划分,得到若干子样本集;Dividing the sample set according to the dividing feature to obtain a plurality of sub-sample sets;
对所述子样本集中样本的所述划分特征进行去除,得到去除后子样本集;And removing the dividing feature of the sample in the subsample set to obtain a removed subsample set;
生成当前节点的子节点,并将所述去除后子样本集作为所述子节点的节点信息;Generating a child node of the current node, and using the removed subsample set as node information of the child node;
判断子节点是否满足预设分类终止条件;Determining whether the child node satisfies a preset classification termination condition;
若否,则将所述目标样本集更新为所述去除后子样本集,并返回执行获取目标样本集内所述特征对于目标样本集分类的信息增益率的步骤;If not, updating the target sample set to the post-removed sub-sample set, and returning to perform the step of acquiring an information gain rate of the feature in the target sample set for the target sample set classification;
若是,则将所述子节点作为叶子节点,根据所述去除后子样本集中样本的类别设置所述叶子节点的输出,所述样本的类别包括男和女。If yes, the child node is used as a leaf node, and an output of the leaf node is set according to a category of the sample in the removed subsample set, and the category of the sample includes a male and a female.
在本申请实施例所提供的用户性别预测方法中,根据所述划分特征对所述目标样本集进行划分,包括:In the user gender prediction method provided by the embodiment of the present application, the target sample set is divided according to the dividing feature, including:
获取所述目标样本集中划分特征的特征值;Obtaining feature values of the feature in the target sample set;
根据所述特征值对所述目标样本集进行划分。The target sample set is divided according to the feature value.
在本申请实施例所提供的用户性别预测方法中,根据所述信息增益率选取从所述特征中选取当前的划分特征,包括:In the user gender prediction method provided by the embodiment of the present application, selecting the current division feature from the features according to the information gain rate, including:
从所述信息增益中选取最大的目标信息增益率;Selecting a maximum target information gain rate from the information gains;
判断所述目标信息增益率是否大于预设阈值;Determining whether the target information gain rate is greater than a preset threshold;
若是,则选取所述目标信息增益率对应的特征作为当前的划分特征。If yes, the feature corresponding to the target information gain rate is selected as the current dividing feature.
在本申请实施例所提供的用户性别预测方法中,所述用户性别预测方法还包括:In the user gender prediction method provided by the embodiment of the present application, the user gender prediction method further includes:
当目标信息增益率不大于预设阈值时,将当前节点作为叶子节点,并选取样本数量最多的样本类别作为所述叶子节点的输出。When the target information gain rate is not greater than the preset threshold, the current node is taken as a leaf node, and the sample category with the largest number of samples is selected as the output of the leaf node.
在本申请实施例所提供的用户性别预测方法中,判断子节点是否满足预设分类终止条件,包括:In the user gender prediction method provided by the embodiment of the present application, determining whether the child node meets the preset classification termination condition includes:
判断所述子节点对应的去除后子样本集中样本的类别数量是否为预设数量;Determining whether the number of categories of samples in the removed subsample set corresponding to the child node is a preset number;
若是,则确定所述子节点满足预设分类终止条件。If yes, it is determined that the child node satisfies a preset classification termination condition.
在本申请实施例所提供的用户性别预测方法中,获取目标样本集内所述特征对于目标样本集分类的信息增益率,包括:In the user gender prediction method provided by the embodiment of the present application, acquiring an information gain rate of the feature in the target sample set for the target sample set classification includes:
获取所述特征对于目标样本集分类的信息增益;Obtaining an information gain of the feature classification for the target sample set;
获取所述特征对于目标样本集分类的***信息;Obtaining split information of the feature classification for the target sample set;
根据所述信息增益与所述***信息,获取所述特征对于目标样本集分类的信息增益率。Obtaining an information gain rate of the feature for the target sample set classification according to the information gain and the split information.
在本申请实施例所提供的用户性别预测方法中,获取所述特征对于目标样本集分类的信息增益率,包括:In the user gender prediction method provided by the embodiment of the present application, the information gain rate of the feature classification for the target sample set is obtained, including:
获取目标样本分类的经验熵;Obtaining the empirical entropy of the target sample classification;
获取所述特征对于目标样本集分类结果的条件熵;Obtaining conditional entropy of the feature for the classification result of the target sample set;
根据所述条件熵和所述经验熵,获取所述特征对于所述目标样本集分类的信息增益率。Obtaining an information gain rate of the feature for the target sample set classification according to the condition entropy and the empirical entropy.
在本申请实施例所提供的用户性别预测方法中,根据所述信息增益与所述***信息,获取所述特征对于目标样本集分类的信息增益率,包括:In the user gender prediction method provided by the embodiment of the present application, the information gain rate of the feature classification for the target sample set is obtained according to the information gain and the split information, including:
通过如下公式计算特征对于目标样本集分类的信息增益率:The information gain rate of the feature classification for the target sample set is calculated by the following formula:
\[ g_R(D,A) = \frac{g(D,A)}{H_A(D)} \]

where $g_R(D,A)$ is the information gain rate of feature A for classifying the sample set D, $g(D,A)$ is the information gain of feature A for sample classification, and $H_A(D)$ is the split information of feature A;

and $g(D,A)$ can be calculated by the following formula:

\[ g(D,A) = H(D) - H(D\mid A), \qquad H_A(D) = -\sum_{i=1}^{n} p_i \log_2 p_i \]

where $H(D)$ is the empirical entropy of the classification of the sample set D, $H(D\mid A)$ is the conditional entropy of feature A for the classification of the sample set D, $p_i$ is the probability that samples taking the i-th value of feature A appear in the sample set D, and n and i are positive integers greater than zero.
一种用户性别预测装置,包括:A user gender prediction device includes:
第一采集单元,用于采集已提供性别信息的用户的行为习惯的多维特征作为样本,并构建已提供性别信息的用户的行为习惯的样本集;a first collecting unit, configured to collect a multi-dimensional feature of a behavior habit of a user who has provided gender information as a sample, and construct a sample set of behavior habits of the user who has provided the gender information;
分类单元,用于当所述特征的数量超过预设阈值时,根据特征对于样本分类的信息增益率对样本集进行样本分类,以构建出用户性别预测的决策树模型,所述决策树模型的输出包括男或女;a classifying unit, configured to perform sample classification on the sample set according to an information gain rate of the feature classification for the feature when the number of the features exceeds a preset threshold, to construct a decision tree model of the user gender prediction, where the decision tree model The output includes male or female;
第二采集单元,用于根据预测时间采集未提供性别信息的用户的行为习惯的多维特征作为预测样本;a second collecting unit, configured to collect, as a prediction sample, a multi-dimensional feature of a behavior habit of a user who does not provide gender information according to the predicted time;
预测单元,用于根据预测样本和决策树模型预测未提供性别信息的用户的性别。A prediction unit is configured to predict a gender of a user who does not provide gender information according to the prediction sample and the decision tree model.
一种电子设备,包括处理器和存储器,所述存储器有计算机程序,所述处理器通过调用所述计算机程序,用于执行用户性别预测方法,所述用户性别预测方法包括:An electronic device includes a processor and a memory, the memory having a computer program, the processor is configured to execute a user gender prediction method by calling the computer program, the user gender prediction method comprising:
采集已提供性别信息的用户的行为习惯的多维特征作为样本,并构建已提供性别信息的用户的行为习惯的样本集;Collecting multidimensional features of the behavioral habits of users who have provided gender information as a sample, and constructing a sample set of behavioral habits of users who have provided gender information;
当所述特征的数量超过预设阈值时,根据特征对于样本分类的信息增益率对样本集进行样本分类,以构建出用户性别预测的决策树模型;When the number of the features exceeds a preset threshold, the sample set is sample-classified according to the information gain rate of the feature classification for the sample to construct a decision tree model of the user gender prediction;
根据预测时间采集未提供性别信息的用户的行为习惯的多维特征作为预测样本;Collecting multi-dimensional features of behavioral habits of users who do not provide gender information as prediction samples according to predicted time;
根据预测样本和决策树模型预测未提供性别信息的用户的性别。The gender of users who did not provide gender information was predicted based on the predicted sample and the decision tree model.
在本申请实施例所提供的电子设备中,当所述特征的数量超过预设阈值时,根据特征对于样本分类的信息增益率对样本集进行样本分类,以构建出用户性别预测的决策树模型,包括:In the electronic device provided by the embodiment of the present application, when the number of the features exceeds a preset threshold, the sample set is sample-classified according to the information gain rate of the feature classification for the sample to construct a decision tree model for user gender prediction. ,include:
生成决策树的根节点,并将所述样本集作为所述根节点的节点信息;Generating a root node of the decision tree, and using the sample set as node information of the root node;
将所述根节点的样本集确定为当前待分类的目标样本集;Determining the sample set of the root node as a target sample set to be classified currently;
获取目标样本集内所述特征对于目标样本集分类的信息增益率;Obtaining an information gain rate of the feature in the target sample set for the target sample set classification;
根据所述信息增益率选取从所述特征中选取当前的划分特征;Selecting a current partitioning feature from the features according to the information gain rate;
根据所述划分特征对所述样本集进行划分,得到若干子样本集;Dividing the sample set according to the dividing feature to obtain a plurality of sub-sample sets;
对所述子样本集中样本的所述划分特征进行去除,得到去除后子样本集;And removing the dividing feature of the sample in the subsample set to obtain a removed subsample set;
生成当前节点的子节点,并将所述去除后子样本集作为所述子节点的节点信息;Generating a child node of the current node, and using the removed subsample set as node information of the child node;
判断子节点是否满足预设分类终止条件;Determining whether the child node satisfies a preset classification termination condition;
若否,则将所述目标样本集更新为所述去除后子样本集,并返回执行获取目标样本集内所述特征对于目标样本集分类的信息增益率的步骤;If not, updating the target sample set to the post-removed sub-sample set, and returning to perform the step of acquiring an information gain rate of the feature in the target sample set for the target sample set classification;
若是,则将所述子节点作为叶子节点,根据所述去除后子样本集中样本的类别设置所述叶子节点的输出,所述样本的类别包括男和女。If yes, the child node is used as a leaf node, and an output of the leaf node is set according to a category of the sample in the removed subsample set, and the category of the sample includes a male and a female.
在本申请实施例所提供的电子设备中,根据所述划分特征对所述目标样本集进行划分,包括:In the electronic device provided by the embodiment of the present application, the target sample set is divided according to the dividing feature, including:
获取所述目标样本集中划分特征的特征值;Obtaining feature values of the feature in the target sample set;
根据所述特征值对所述目标样本集进行划分。The target sample set is divided according to the feature value.
在本申请实施例所提供的电子设备中,根据所述信息增益率选取从所述特征中选取当前的划分特征,包括:In the electronic device provided by the embodiment of the present application, selecting a current dividing feature from the feature according to the information gain rate, including:
从所述信息增益中选取最大的目标信息增益率;Selecting a maximum target information gain rate from the information gains;
判断所述目标信息增益率是否大于预设阈值;Determining whether the target information gain rate is greater than a preset threshold;
若是,则选取所述目标信息增益率对应的特征作为当前的划分特征。If yes, the feature corresponding to the target information gain rate is selected as the current dividing feature.
在本申请实施例所提供的电子设备中,所述用户性别预测方法还包括:In the electronic device provided by the embodiment of the present application, the user gender prediction method further includes:
当目标信息增益率不大于预设阈值时,将当前节点作为叶子节点,并选取样本数量最多的样本类别作为所述叶子节点的输出。When the target information gain rate is not greater than the preset threshold, the current node is taken as a leaf node, and the sample category with the largest number of samples is selected as the output of the leaf node.
在本申请实施例所提供的电子设备中,判断子节点是否满足预设分类终止条件,包括:In the electronic device provided by the embodiment of the present application, determining whether the child node meets the preset classification termination condition includes:
判断所述子节点对应的去除后子样本集中样本的类别数量是否为预设数量;Determining whether the number of categories of samples in the removed subsample set corresponding to the child node is a preset number;
若是,则确定所述子节点满足预设分类终止条件。If yes, it is determined that the child node satisfies a preset classification termination condition.
在本申请实施例所提供的电子设备中,获取目标样本集内所述特征对于目标样本集分类的信息增益率,包括:In the electronic device provided by the embodiment of the present application, acquiring an information gain rate of the feature in the target sample set for the target sample set classification includes:
获取所述特征对于目标样本集分类的信息增益;Obtaining an information gain of the feature classification for the target sample set;
获取所述特征对于目标样本集分类的***信息;Obtaining split information of the feature classification for the target sample set;
根据所述信息增益与所述***信息,获取所述特征对于目标样本集分类的信息增益率。Obtaining an information gain rate of the feature for the target sample set classification according to the information gain and the split information.
在本申请实施例所提供的电子设备中,获取所述特征对于目标样本集分类的信息增益率,包括:In the electronic device provided by the embodiment of the present application, acquiring an information gain rate of the feature for the target sample set classification includes:
获取目标样本分类的经验熵;Obtaining the empirical entropy of the target sample classification;
获取所述特征对于目标样本集分类结果的条件熵;Obtaining conditional entropy of the feature for the classification result of the target sample set;
根据所述条件熵和所述经验熵,获取所述特征对于所述目标样本集分类的信息增益率。Obtaining an information gain rate of the feature for the target sample set classification according to the condition entropy and the empirical entropy.
在本申请实施例所提供的电子设备中,根据所述信息增益与所述***信息,获取所述特征对于目标样本集分类的信息增益率,包括:In the electronic device provided by the embodiment of the present application, the information gain rate of the feature classification for the target sample set is obtained according to the information gain and the split information, including:
通过如下公式计算特征对于目标样本集分类的信息增益率:The information gain rate of the feature classification for the target sample set is calculated by the following formula:
\[ g_R(D,A) = \frac{g(D,A)}{H_A(D)} \]

where $g_R(D,A)$ is the information gain rate of feature A for classifying the sample set D, $g(D,A)$ is the information gain of feature A for sample classification, and $H_A(D)$ is the split information of feature A;

and $g(D,A)$ can be calculated by the following formula:

\[ g(D,A) = H(D) - H(D\mid A), \qquad H_A(D) = -\sum_{i=1}^{n} p_i \log_2 p_i \]

where $H(D)$ is the empirical entropy of the classification of the sample set D, $H(D\mid A)$ is the conditional entropy of feature A for the classification of the sample set D, $p_i$ is the probability that samples taking the i-th value of feature A appear in the sample set D, and n and i are positive integers greater than zero.
The embodiment of the present application provides a user gender prediction method. The execution subject of the method may be the user gender prediction apparatus provided by the embodiment of the present application, or an electronic device integrated with the apparatus, where the apparatus may be implemented in hardware or software. The electronic device may be a smartphone, a tablet computer, a palmtop computer, a notebook computer, a desktop computer, or the like.

Referring to FIG. 1, FIG. 1 is a schematic diagram of an application scenario of the user gender prediction method according to an embodiment of the present application. Taking the case where the user gender prediction apparatus is integrated in an electronic device as an example, the electronic device may collect multi-dimensional features of the behavior habits of users who have provided gender information as samples and construct a sample set of the behavior habits of these users; classify the sample set according to the information gain rate of each feature for sample classification, so as to construct a decision tree model for predicting user gender; collect, according to a prediction threshold, multi-dimensional features of the behavior habits of a user who has not provided gender information to obtain a prediction sample; and predict, according to the prediction sample and the decision tree model, the gender of the user who has not provided gender information.

Specifically, as shown in FIG. 1, taking the prediction of the gender of a user who has not provided gender information as an example, multi-dimensional features of the behavior habits of users who have provided gender information (for example, the duration for which a user reads sports news, the number of times a user uses beauty-camera software, and the like) may be collected within a historical time period as samples, and a sample set of the behavior habits of these users is constructed; the sample set is classified according to the information gain rate of these features for sample classification, so as to construct a decision tree model for user gender prediction; the corresponding multi-dimensional features of users who have not provided gender information are collected according to a prediction threshold (such as t) as prediction samples (for example, t samples of behavior habits such as the duration of reading sports news and the number of times beauty-camera software is used); and the gender of these users is predicted according to the prediction samples and the decision tree model.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of the user gender prediction method according to an embodiment of the present application. The specific process of the method may be as follows:

201. Collect multi-dimensional features of the behavior habits of users who have provided gender information as samples, and construct a sample set of the behavior habits of these users.

The multi-dimensional feature of the behavior habits of a user who has provided gender information has a certain number of dimensions, and the parameter in each dimension corresponds to one piece of feature information characterizing the user's behavior habits; that is, the multi-dimensional feature consists of multiple features.

The sample set of the behavior habits of users who have provided gender information may include a plurality of samples, each of which includes the multi-dimensional features of one such user's behavior habits. The sample set may include multiple samples collected at a preset frequency within a historical time period. The historical time period may be, for example, the past 7 or 10 days; the preset frequency may be, for example, once every 10 minutes or once every half hour. It can be understood that the multi-dimensional feature data collected at one time constitutes one sample, and the multiple samples collected over multiple collections constitute the sample set.

After the sample set is constructed, each sample in it can be labeled to obtain a sample label for each sample. Since this embodiment aims to predict the user's gender, the sample labels include the genders "male" and "female"; that is, the sample categories include "male" and "female". The labeling can be done according to the behavior habits of the users who have provided gender information. For example, if a user has browsed typically male-oriented goods (such as men's clothing) 50 times in a shopping application, the sample is labeled "male"; if a user has spent more than 20 hours reading novels aimed at female readers, the sample is labeled "female". Specifically, the value "1" may represent "male" and the value "0" "female", or vice versa.
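As a rough illustration of how such a labeled sample set might be assembled, here is a minimal self-contained sketch in Python; the feature names and label encoding are hypothetical examples, not features prescribed by the application.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Sample:
    features: Dict[str, float]  # one value per behavior-habit dimension
    label: str                  # "male" or "female", the gender the user provided

# Hypothetical multi-dimensional behavior-habit features for two users who provided gender.
sample_set = [
    Sample({"sports_news_minutes": 95, "beauty_app_uses": 0,  "front_camera_on": 0}, "male"),
    Sample({"sports_news_minutes": 5,  "beauty_app_uses": 12, "front_camera_on": 1}, "female"),
]

# Numeric label encoding as described above: 1 represents "male", 0 represents "female".
encoded = [(s.features, 1 if s.label == "male" else 0) for s in sample_set]
```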
202. When the number of features exceeds a preset threshold, classify the sample set according to the information gain rate of each feature for sample classification, so as to construct a decision tree model for user gender prediction.

In an embodiment, the preset threshold may be 10000; that is, when the number of features exceeds 10000, the sample set is classified according to the information gain rate of each feature for sample classification to construct the decision tree model for user gender prediction.

In an embodiment, to facilitate sample classification, any feature information in the multi-dimensional features of the users' behavior habits that is not directly represented by a numerical value may be quantified with a specific value. For example, for the feature "whether the user has enabled the front camera", the value 1 may indicate enabled and the value 0 not enabled (or vice versa); similarly, for the feature "whether the user applies beauty processing to pictures", the value 1 may indicate that beauty processing is applied and the value 0 that it is not (or vice versa).
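A tiny sketch of this kind of numeric quantification of non-numeric habit features (Python; the raw field names are hypothetical):

```python
def quantify(raw_habits):
    """Encode non-numeric behavior-habit information as numbers:
    1 = front camera enabled / beauty processing applied, 0 = not."""
    return {
        "front_camera_on": 1 if raw_habits["front_camera_enabled"] else 0,
        "beauty_processing": 1 if raw_habits["applies_beauty_filter"] else 0,
        "sports_news_minutes": raw_habits["sports_news_minutes"],  # already numeric
    }

print(quantify({"front_camera_enabled": True, "applies_beauty_filter": False,
                "sports_news_minutes": 42}))
```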
The embodiment of the present application may classify the sample set based on the information gain rate of each feature for sample classification to construct the decision tree model for user gender prediction. For example, the decision tree model may be constructed based on the C4.5 algorithm.

A decision tree is a tree built upon decisions. In machine learning, a decision tree is a prediction model that represents a mapping between object attributes and object values: each node represents an object, each branch represents a possible attribute value, and each leaf node corresponds to the value of the object represented by the path from the root node to that leaf node. A decision tree has only a single output; if multiple outputs are needed, independent decision trees can be built to handle the different outputs.

The C4.5 algorithm is one kind of decision tree algorithm. It belongs to a family of algorithms used for classification problems in machine learning and data mining, and is an important algorithm improved from ID3. Its goal is supervised learning: given a data set in which every tuple can be described by a set of attribute values and every tuple belongs to one of a set of mutually exclusive categories, C4.5 learns a mapping from attribute values to categories, and this mapping can be used to classify new entities whose category is unknown.

ID3 (Iterative Dichotomiser 3) is based on the principle of Occam's razor, that is, doing more with less. In information theory, the smaller the expected information, the larger the information gain and the higher the purity. The core idea of the ID3 algorithm is to measure attribute selection by information gain and to split on the attribute with the largest information gain after splitting. The algorithm traverses the possible decision space with a top-down greedy search.

In the embodiment of the present application, the information gain rate may be defined as the ratio of the information gain of a feature for sample classification to the split information of that feature for sample classification. The specific way of obtaining the information gain rate is described below.

Information gain is defined per feature: for a feature t, consider the amount of information in the system with the feature and without it; the difference between the two is the amount of information the feature brings to the system, that is, its information gain.

The split information measures how broadly and uniformly a feature splits the data (such as the sample set); it can be taken as the entropy of the feature itself.
The process of classifying the sample set based on the information gain rate is described in detail below. For example, the classification process may include the following steps:
generating the root node of the decision tree, and using the sample set as the node information of the root node;
determining the sample set of the root node as the target sample set currently to be classified;
acquiring the information gain rate of each feature in the target sample set for classifying the target sample set;
selecting the current dividing feature from the features according to the information gain rates;
dividing the target sample set according to the dividing feature to obtain several sub-sample sets;
removing the dividing feature from the samples in each sub-sample set to obtain a removed sub-sample set;
generating a child node of the current node, and using the removed sub-sample set as the node information of the child node;
determining whether the child node satisfies a preset classification termination condition;
if not, updating the target sample set to the removed sub-sample set, and returning to the step of acquiring the information gain rate of each feature in the target sample set for classifying the target sample set;
if yes, using the child node as a leaf node, and setting the output of the leaf node according to the categories of the samples in the removed sub-sample set, where the categories of the samples include "male" and "female".
The dividing feature is a feature selected from the features according to the information gain rate of each feature for classifying the sample set, and is used to divide the sample set. There are various ways of selecting the dividing feature according to the information gain rate; for example, to improve the accuracy of sample classification, the feature with the largest information gain rate may be selected as the dividing feature.

The categories of the samples may include two categories, "male" and "female", and the category of each sample may be represented by a sample label. For example, when the label is numerical, the value "1" may indicate "male" and the value "0" "female", or vice versa.

When a child node satisfies the preset classification termination condition, the child node may be used as a leaf node, that is, classification of the sample set of that child node is stopped, and the output of the leaf node may be set based on the categories of the samples in the removed sub-sample set. There are several ways to set the output of a leaf node based on the sample categories; for example, the category with the largest number of samples in the removed sub-sample set may be used as the output of the leaf node.

The preset classification termination condition may be set according to actual requirements. When a child node satisfies the preset classification termination condition, the current child node is used as a leaf node and classification of the sample set corresponding to that child node is stopped; when the child node does not satisfy the preset classification termination condition, classification of the sample set corresponding to the child node continues. For example, the preset classification termination condition may include: the number of categories of the samples in the removed sub-sample set of the child node is a preset number. That is, the step of "determining whether the child node satisfies the preset classification termination condition" may include:
determining whether the number of categories of the samples in the removed sub-sample set corresponding to the child node is a preset number;
if yes, determining that the child node satisfies the preset classification termination condition;
if not, determining that the child node does not satisfy the preset classification termination condition.

For example, the preset classification termination condition may include: the number of categories of the samples in the removed sub-sample set corresponding to the child node is 1, that is, the sample set of the child node contains samples of only one category. In this case, if the child node satisfies the preset classification termination condition, the category of the samples in that sub-sample set is used as the output of the leaf node. For example, if the removed sub-sample set contains only samples of the category "male", "male" can be used as the output of that leaf node.
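A minimal sketch of this termination check and leaf-output rule (Python, illustrative only; the default category count of 1 and the majority-category fallback follow the description above):

```python
from collections import Counter

def is_terminal(labels, preset_count=1):
    """Preset classification termination condition: the removed sub-sample set
    contains at most `preset_count` distinct categories."""
    return len(set(labels)) <= preset_count

def leaf_output(labels):
    """Output of a leaf node: the category ("male" or "female") with the
    largest number of samples in the removed sub-sample set."""
    return Counter(labels).most_common(1)[0][0]

print(is_terminal(["male", "male"]))             # True: only one category -> leaf
print(leaf_output(["male", "female", "male"]))   # "male"
```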
In an embodiment, to improve the decision accuracy of the decision tree model, a gain rate threshold may also be set; only when the largest information gain rate is greater than this threshold is the corresponding feature selected as the dividing feature. That is, the step of "selecting the current dividing feature from the features according to the information gain rates" may include:
selecting the largest target information gain rate from the information gain rates;
determining whether the target information gain rate is greater than a preset threshold;
if yes, selecting the feature corresponding to the target information gain rate as the current dividing feature.

In an embodiment, when the target information gain rate is not greater than the preset threshold, the current node may be used as a leaf node, and the sample category with the largest number of samples is selected as the output of the leaf node, where the sample categories include "male" and "female".

The preset threshold can be set according to actual needs, for example 0.9 or 0.8.

For example, when the information gain rate of feature 1 for sample classification is 0.9 and is the largest information gain rate, and the preset gain rate threshold is 0.8, the maximum information gain rate is greater than the preset threshold, so feature 1 can be used as the dividing feature.

As another example, when the preset threshold is 1, the maximum information gain rate is less than the preset threshold; in this case, the current node can be used as a leaf node. If analysis of the sample set shows that the number of samples of the category "male" is the largest, that is, greater than the number of samples of the category "female", then "male" can be used as the output of that leaf node.
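A small illustrative sketch of this thresholded selection of the dividing feature (Python; it reuses the `gain_ratio` helper sketched earlier, and the threshold value 0.8 is only an example):

```python
def choose_dividing_feature(samples, labels, feature_names, threshold=0.8):
    """Return the feature with the largest information gain rate if that rate
    exceeds the preset threshold; otherwise return None, in which case the
    current node becomes a leaf that outputs the majority category."""
    rates = {
        name: gain_ratio([sample[name] for sample in samples], labels)
        for name in feature_names
    }
    best = max(rates, key=rates.get)
    return best if rates[best] > threshold else None
```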
There are various ways to divide the samples according to the dividing feature; for example, the sample set can be divided based on the feature values of the dividing feature. That is, the step of "dividing the sample set according to the dividing feature" may include:
acquiring the feature values of the dividing feature in the target sample set;
dividing the target sample set according to the feature values.

For example, samples with the same feature value can be divided into the same sub-sample set. If the feature values of the dividing feature are 0, 1, and 2, then the samples whose feature value is 0 are grouped into one sub-sample set, the samples whose feature value is 1 into another, and the samples whose feature value is 2 into a third.
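As a sketch of this value-based partition, including removal of the dividing feature from each resulting sub-sample (Python, illustrative only):

```python
from collections import defaultdict

def divide_by_feature(samples, labels, feature):
    """Group samples with the same value of the dividing feature into the same
    sub-sample set, and drop the dividing feature from each sample."""
    subsets = defaultdict(list)
    for sample, label in zip(samples, labels):
        value = sample[feature]
        reduced = {k: v for k, v in sample.items() if k != feature}  # removed sub-sample
        subsets[value].append((reduced, label))
    return subsets  # one sub-sample set per feature value, e.g. {0: [...], 1: [...], 2: [...]}
```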
For example, consider the sample set D{sample 1, sample 2, ..., sample i, ..., sample n}, where each sample includes several features A.
First, all samples in the sample set are initialized; then a root node d is generated, and the sample set D is used as the node information of the root node d, with reference to FIG. 3.
The information gain rate of each feature, such as feature A, for the classification of the sample set is calculated: g_R(D,A)1, g_R(D,A)2, ..., g_R(D,A)m; the largest information gain rate g_R(D,A)max is selected.
When the largest information gain rate g_R(D,A)max is smaller than the preset threshold ε, the current node is used as a leaf node, and the sample category with the largest number of samples is selected as the output of the leaf node.
When the largest information gain rate g_R(D,A)max is greater than the preset threshold ε, the feature corresponding to g_R(D,A)max may be selected as the partition feature Ag, and the sample set D{sample 1, sample 2, ..., sample i, ..., sample n} is divided according to the feature Ag. Specifically, for each value ai of Ag, D is divided according to Ag = ai into several non-empty sets Di, which serve as the child nodes of the current node. For example, the sample set may be divided into two sub-sample sets D1{sample 1, sample 2, ..., sample k} and D2{sample k+1, ..., sample n}.
The partition feature Ag is removed from the sub-sample sets D1 and D2, i.e. A - Ag. With reference to FIG. 3, the child nodes d1 and d2 of the root node d are generated, the sub-sample set D1 is used as the node information of child node d1, and the sub-sample set D2 is used as the node information of child node d2.
Next, for each child node, A - Ag is used as the feature set and the Di of the child node is used as the data set, and the above steps are called recursively to construct a subtree, until the preset classification termination condition is satisfied.
Taking child node d1 as an example, it is determined whether the child node satisfies the preset classification termination condition; if so, the current child node d1 is used as a leaf node, and the output of that leaf node is set according to the category of the samples in the sub-sample set corresponding to child node d1.
When a child node does not satisfy the preset classification termination condition, the classification based on the information gain rate described above is used to continue classifying the sub-sample set corresponding to the child node. Taking child node d2 as an example, the information gain rate g_R(D,A) of each feature in the sub-sample set D2 relative to the sample classification can be calculated and the largest information gain rate g_R(D,A)max selected; when the largest information gain rate g_R(D,A)max is greater than the preset threshold ε, the feature corresponding to that information gain rate g_R(D,A) can be selected as the partition feature Ag (for example feature Ai+1), and D2 is divided into several sub-sample sets based on the partition feature Ag, for example into the sub-sample sets D21, D22 and D23. Then the partition feature Ag is removed from the sub-sample sets D21, D22 and D23, the child nodes d21, d22 and d23 of the current node d2 are generated, and the sample sets D21, D22 and D23 with the partition feature Ag removed are used as the node information of the child nodes d21, d22 and d23 respectively.
By analogy, the classification based on the information gain rate described above can be used to construct a decision tree as shown in FIG. 4; the outputs of the leaf nodes of the decision tree include "male" or "female".
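This recursive construction can be summarized in a Python sketch. It is only an illustration under the assumptions stated earlier; gain_ratio and partition_by_feature are the helpers sketched in this description, passed in as callables, and the dict keys used for the node structure are invented for the example:

```python
from collections import Counter

def majority_class(samples):
    """Sample category ("male" or "female") with the most samples."""
    return Counter(s['gender'] for s in samples).most_common(1)[0][0]

def build_tree(samples, features, gain_ratio, partition_by_feature, threshold=0.8):
    """Recursively build a decision tree node from a (sub-)sample set."""
    classes = {s['gender'] for s in samples}

    # Preset classification termination condition: only one class left,
    # or no candidate features remain after earlier removals.
    if len(classes) == 1 or not features:
        return {'type': 'leaf', 'output': majority_class(samples)}

    rates = {f: gain_ratio(samples, f) for f in features}
    best = max(rates, key=rates.get)

    # Largest gain rate not above the threshold: leaf with the majority class.
    if rates[best] <= threshold:
        return {'type': 'leaf', 'output': majority_class(samples)}

    # Split on the best feature; the child key doubles as the edge label,
    # i.e. the feature value marked on the path between node and child.
    remaining = [f for f in features if f != best]          # A - Ag
    children = {
        value: build_tree(subset, remaining, gain_ratio,
                          partition_by_feature, threshold)
        for value, subset in partition_by_feature(samples, best).items()
    }
    return {'type': 'node', 'feature': best, 'children': children}
```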
In an embodiment, in order to improve the speed and efficiency of prediction with the decision tree, the feature value of the corresponding partition feature may also be marked on the path between nodes. For example, in the classification based on the information gain described above, the feature value of the corresponding partition feature may be marked on the path between the current node and its child nodes.
For example, when the feature values of the partition feature Ag include 0 and 1, a 1 may be marked on the path between d2 and d, a 0 on the path between a1 and a, and so on; after each division, the corresponding partition feature value, such as 0 or 1, is marked on the path between the current node and its child nodes, and a decision tree as shown in FIG. 5 is obtained.
The way in which the information gain rate is obtained is described in detail below.
In the embodiments of the present application, the information gain rate may be defined as the ratio of the information gain of a feature for the sample classification to the split information of that feature for the sample classification.
The information gain is defined per feature: for a feature t, one considers how much information the system carries with the feature and how much without it, and the difference between the two is the amount of information the feature brings to the system, i.e. the information gain. The information gain represents the degree to which a feature reduces the uncertainty about the classes ("male" and "female").
The split information is used to measure the breadth and uniformity with which a feature splits the data (such as the sample set); the split information may be the entropy of the feature.
The step of "obtaining the information gain rate of the features in the target sample set for the classification of the target sample set" may include:
obtaining the information gain of the feature for the classification of the target sample set;
obtaining the split information of the feature for the classification of the target sample set;
obtaining the information gain rate of the feature for the classification of the target sample set according to the information gain and the split information.
In an embodiment, the information gain of a feature for the classification of the sample set may be obtained based on the empirical entropy of the sample classification and the conditional entropy of the feature for the classification result of the sample set. That is, the step of "obtaining the information gain of the feature for the classification of the target sample set" may include:
obtaining the empirical entropy of the target sample classification;
obtaining the conditional entropy of the feature for the classification result of the target sample set;
obtaining the information gain of the feature for the classification of the target sample set according to the conditional entropy and the empirical entropy. Here, a first probability with which positive samples appear in the sample set and a second probability with which negative samples appear in the sample set may be obtained, where a positive sample is a sample whose category is "male" and a negative sample is a sample whose category is "female"; the empirical entropy of the samples is then obtained according to the first probability and the second probability. The terms "first", "second", "third" and the like in this application are used to distinguish different objects, not to describe a particular order.
In an embodiment, the information gain of a feature for the classification of the target sample set may be the difference between the empirical entropy and the conditional entropy. For example, for the sample set D{sample 1, sample 2, ..., sample i, ..., sample n}, each sample includes multi-dimensional features such as feature A. The information gain rate of feature A for the sample classification can be obtained by the following formula:
g_R(D,A) = g(D,A) / H_A(D)
where g_R(D,A) is the information gain rate of feature A for the classification of the sample set D, g(D,A) is the information gain of feature A for the sample classification, and H_A(D) is the split information of feature A, i.e. the entropy of feature A.
Here, g(D,A) can be obtained by the following formula:
g(D,A) = H(D) - H(D|A)
where H(D) is the empirical entropy of the classification of the sample set D, and H(D|A) is the conditional entropy of feature A for the classification of the sample set D.
If the number of samples whose category is "male" is j, the number of "female" samples is n - j; in this case, the probability with which a positive sample appears in the sample set D is p1 = j/n, and the probability with which a negative sample appears in the sample set D is p2 = (n - j)/n. The empirical entropy H(D) of the sample classification is then calculated with the following formula for the empirical entropy:
H(D) = -(p1·log p1 + p2·log p2)
In the decision tree classification problem, the information gain is the difference between the information of the decision tree before and after the attribute-based division. In this implementation, the empirical entropy H(D) of the sample classification is:
H(D) = -p1·log p1 - p2·log p2
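A sketch of the empirical entropy H(D) under the same assumptions (a base-2 logarithm is assumed, with the usual convention that a class of probability 0 contributes nothing):

```python
import math
from collections import Counter

def empirical_entropy(samples, label='gender'):
    """H(D) = -(p1*log(p1) + p2*log(p2)) over the class proportions."""
    total = len(samples)
    counts = Counter(s[label] for s in samples)   # e.g. {'male': j, 'female': n - j}
    entropy = 0.0
    for count in counts.values():
        p = count / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

# Example: 3 "male" samples and 1 "female" sample -> H(D) ≈ 0.811 bits.
demo = [{'gender': 'male'}] * 3 + [{'gender': 'female'}]
print(round(empirical_entropy(demo), 3))
```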
In an embodiment, the sample set may be divided into several sub-sample sets according to feature A; then the information entropy of the classification of each sub-sample set and the probability with which each feature value of feature A appears in the sample set are obtained, and from this information entropy and this probability the information entropy after the division can be obtained, i.e. the conditional entropy of feature A for the classification result of the sample set.
For example, for the sample feature A, the conditional entropy of feature A for the classification result of the sample set D can be calculated by the following formula:
H(D|A) = Σ_{i=1..n} pi·H(D|A=Ai)
where n is the number of distinct values of feature A, i.e. the number of feature value types; pi is the probability with which a sample whose value of feature A is the i-th value appears in the sample set D; Ai is the i-th value of A; and H(D|A=Ai) is the empirical entropy of the classification of the sub-sample set Di, in which all samples take the i-th value of feature A.
For example, take the case in which feature A has three distinct values, A1, A2 and A3. In this case, the sample set D{sample 1, sample 2, ..., sample i, ..., sample n} can be divided according to feature A into three sub-sample sets: D1{sample 1, sample 2, ..., sample d} with feature value A1, D2{sample d+1, ..., sample e} with feature value A2, and D3{sample e+1, ..., sample n} with feature value A3, where d and e are positive integers smaller than n.
In this case, the conditional entropy of feature A for the classification result of the sample set D is:
H(D|A) = p1·H(D|A=A1) + p2·H(D|A=A2) + p3·H(D|A=A3)
where p1 = |D1|/|D|, p2 = |D2|/|D|, p3 = |D3|/|D|;
H(D|A=A1) is the information entropy of the classification of the sub-sample set D1, i.e. its empirical entropy, which can be calculated with the formula for the empirical entropy given above.
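The conditional entropy H(D|A) can be sketched in the same style (the helper empirical_entropy is repeated from the sketch above so the snippet runs on its own):

```python
import math
from collections import Counter, defaultdict

def empirical_entropy(samples, label='gender'):
    # Same helper as in the earlier sketch.
    total = len(samples)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(s[label] for s in samples).values())

def conditional_entropy(samples, feature, label='gender'):
    """H(D|A) = sum over i of |Di|/|D| * H(Di), one term per value of feature A."""
    subsets = defaultdict(list)
    for s in samples:
        subsets[s[feature]].append(s)     # Di: samples taking the i-th value of A
    total = len(samples)
    return sum(len(di) / total * empirical_entropy(di, label)
               for di in subsets.values())
```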
After the empirical entropy H(D) of the sample classification and the conditional entropy H(D|A) of feature A for the classification result of the sample set D have been obtained, the information gain of feature A for the classification of the sample set D can be calculated, for example by the following formula:
g(D,A) = H(D) - H(D|A)
That is, the information gain of feature A for the classification of the sample set D is the difference between the empirical entropy H(D) and the conditional entropy H(D|A) of feature A for the classification result of the sample set D.
The split information of a feature for the classification of the sample set is the entropy of the feature. It can be obtained from the distribution, over the target sample set, of the values taken by the feature. For example, H_A(D) can be obtained by the following formula:
H_A(D) = -Σ_{i=1..n} (|Di|/|D|)·log(|Di|/|D|)
where n is the number of value categories (kinds of values) of feature A, and Di is the subset of the sample set D in which feature A takes the i-th value.
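Putting the pieces together, the information gain g(D,A) = H(D) - H(D|A), the split information H_A(D) and the gain rate g_R(D,A) = g(D,A) / H_A(D) can be sketched as one function. The helpers are repeated so the snippet is self-contained; returning 0 for a feature with a single value (which gives a zero split information) is an assumption of the sketch:

```python
import math
from collections import Counter, defaultdict

def empirical_entropy(samples, label='gender'):
    total = len(samples)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(s[label] for s in samples).values())

def gain_ratio(samples, feature, label='gender'):
    """g_R(D,A) = (H(D) - H(D|A)) / H_A(D)."""
    total = len(samples)
    subsets = defaultdict(list)
    for s in samples:
        subsets[s[feature]].append(s)

    h_d = empirical_entropy(samples, label)

    # Conditional entropy H(D|A): weighted entropy of the sub-sample sets.
    h_d_given_a = sum(len(di) / total * empirical_entropy(di, label)
                      for di in subsets.values())

    # Split information H_A(D): entropy of feature A itself.
    h_a = -sum((len(di) / total) * math.log2(len(di) / total)
               for di in subsets.values())

    if h_a == 0:          # feature A takes a single value in this sample set
        return 0.0
    return (h_d - h_d_given_a) / h_a
```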
203. Collect, according to a prediction time, multi-dimensional features of the behavior habits of a user who has not provided gender information, as a prediction sample.
The prediction time can be set as required, for example the current time.
For example, the multi-dimensional features of the behavior habits of a user who has not provided gender information may be collected as a prediction sample according to the prediction time point.
In the embodiments of the present application, the multi-dimensional features collected in steps 201 and 203 are the same features, for example the length of time the user spends reading male-oriented novels, the length of time the user spends reading female-oriented novels, and so on.
204. Predict the gender of the user who has not provided gender information according to the prediction sample and the decision tree model.
Specifically, the corresponding output result is obtained according to the prediction sample and the decision tree model, and the gender of the user who has not provided gender information is determined according to the output result, where the output result includes "male" or "female".
For example, the corresponding leaf node may be determined according to the features of the prediction sample and the decision tree model, and the output of that leaf node is used as the predicted output result. That is, the features of the prediction sample are used to follow the branch conditions of the decision tree (i.e. the feature values of the partition features) to the current leaf node, and the output of that leaf node is taken as the prediction result. Since the output of a leaf node includes "male" or "female", the gender of the user who has not provided gender information can be determined based on the decision tree.
For example, after the multi-dimensional features of the behavior habits of a user who has not provided gender information have been collected, the corresponding leaf node can be found in the decision tree shown in FIG. 5 by following the branch conditions of the decision tree; if that leaf node is an1 and the output of the leaf node an1 is "male", it is determined that the gender of the user who has not provided gender information is male.
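Prediction therefore amounts to walking down the tree, at each internal node following the branch whose edge label equals the value that the partition feature takes in the prediction sample. A sketch using the node structure from the construction sketch above (a real implementation would also need a fallback for feature values never seen during construction, which is omitted here):

```python
def predict_gender(tree, sample):
    """Traverse the decision tree with one prediction sample (a dict of features)."""
    node = tree
    while node['type'] != 'leaf':
        value = sample[node['feature']]
        # Follow the branch whose edge label matches the sample's feature value.
        node = node['children'][value]
    return node['output']      # "male" or "female"
```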
As can be seen from the above, in the embodiments of the present application, multi-dimensional features of the behavior habits of users who have provided gender information are collected as samples, and a sample set of the behavior habits of the users who have provided gender information is constructed; when the number of the features exceeds a preset threshold, the sample set is classified according to the information gain rate of the features for the sample classification, so as to construct a decision tree model for predicting the gender of a user, the output of the decision tree model including "male" or "female"; multi-dimensional features of the behavior habits of a user who has not provided gender information are collected according to the prediction time as a prediction sample; and the gender of the user who has not provided gender information is predicted according to the prediction sample and the decision tree model.
Further, since every sample of the sample set includes multiple pieces of feature information reflecting the user's behavior habits, the embodiments of the present application can make user gender prediction more intelligent.
Further, implementing user gender prediction based on the decision tree prediction model can improve the accuracy of user gender prediction.
On the basis of the method described in the above embodiments, the user gender prediction method of the present application is further described below. Referring to FIG. 6, the user gender prediction method may include:
301. Collect multi-dimensional features of the behavior habits of users who have provided gender information as samples, and construct a sample set of the behavior habits of the users who have provided gender information.
The multi-dimensional features of the behavior habits of a user who has provided gender information have dimensions of a certain length, and the parameter in each dimension corresponds to one piece of feature information characterizing the behavior habits of the user who has provided gender information; that is, the multi-dimensional feature is composed of multiple features.
The sample set of the behavior habits of users who have provided gender information may include multiple samples, each sample including the multi-dimensional features of the behavior habits of a user who has provided gender information. The sample set may include multiple samples collected at a preset frequency within a historical time period. The historical time period may be, for example, the past 7 days or 10 days; the preset frequency may be, for example, once every 10 minutes or once every half hour. It can be understood that the multi-dimensional feature data of the behavior habits of a user who has provided gender information collected at one time constitutes one sample, and multiple samples collected multiple times constitute the sample set.
A specific sample may be as shown in Table 1 below and includes feature information of multiple dimensions. It should be noted that the feature information shown in Table 1 is only an example; in practice, the number of pieces of feature information contained in one sample may be more or less than the number shown in Table 1, and the specific feature information used may also differ from that shown in Table 1, which is not specifically limited here.
Dimension | Characteristic information
1 | Number of times the user browses male-oriented goods (e.g. men's clothing) in a shopping application
2 | Length of time the user browses male-oriented goods (e.g. men's clothing) in a shopping application
3 | Number of times the user browses female-oriented goods (e.g. cosmetics, women's clothing) in a shopping application
4 | Length of time the user browses female-oriented goods (e.g. cosmetics, women's clothing) in a shopping application
5 | Length of time the user reads male-oriented novels
6 | Length of time the user reads female-oriented novels
7 | Length of time the user reads sports news
8 | Length of time the user reads horoscope news
9 | Number of times the user takes selfies with the front camera
10 | Number of times the user uses beauty (photo retouching) applications
11 | Number of times and length of time the user plays games of different categories
Table 1
302. Label the samples in the sample set to obtain a sample label for each sample.
After the sample set has been constructed, each sample in the sample set can be labeled to obtain the sample label of each sample. Since what this implementation aims to achieve is predicting the gender of a user, the sample labels include the gender "male" and the gender "female"; that is, the sample categories include "male" and "female". Specifically, the labeling can be done according to the behavior habits of the user who has provided gender information. For example, if the user browses male-oriented goods (such as men's clothing) 50 times in the shopping application, the sample is labeled "male"; as another example, if the user reads female-oriented novels for more than 20 hours, the sample is labeled "female". Specifically, the value "1" may be used to represent "male" and the value "0" to represent "female", or vice versa.
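As an illustration of how one labeled sample could be represented in code, the dict below follows the dimensions of Table 1; the field names, the numeric values and the 1/0 encoding of "male"/"female" are invented purely for this example:

```python
# One sample: a multi-dimensional feature vector describing behavior habits,
# plus the sample label obtained in step 302 (1 = "male", 0 = "female").
sample = {
    'male_goods_browse_count': 50,       # dimension 1
    'male_goods_browse_minutes': 35,     # dimension 2
    'female_goods_browse_count': 2,      # dimension 3
    'female_goods_browse_minutes': 1,    # dimension 4
    'male_novel_reading_hours': 12,      # dimension 5
    'female_novel_reading_hours': 0,     # dimension 6
    'sports_news_reading_hours': 8,      # dimension 7
    'horoscope_news_reading_hours': 0,   # dimension 8
    'front_camera_selfie_count': 3,      # dimension 9
    'beauty_app_usage_count': 0,         # dimension 10
    'game_play_count_and_duration': 20,  # dimension 11 (simplified to one number)
    'gender': 1,                         # sample label
}
```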
303. Generate the root node of the decision tree model, and use the sample set as the node information of the root node.
For example, with reference to FIG. 3, for the sample set D{sample 1, sample 2, ..., sample i, ..., sample n}, the root node d of the decision tree can first be generated, and the sample set D is used as the node information of the root node d.
304. Determine the sample set as the target sample set currently to be classified.
That is, the sample set of the root node is determined as the target sample set currently to be classified.
305. Obtain the information gain rate of each feature in the target sample set for the classification of the sample set, and determine the largest information gain rate.
For example, for the sample set D, the information gain rate of each feature for the classification of the sample set can be calculated: g_R(D,A)1, g_R(D,A)2, ..., g_R(D,A)m; the largest information gain rate g_R(D,A)max is selected, for example g_R(D,A)i being the largest information gain rate.
The information gain rate of a feature for the classification of the sample set can be obtained in the following way:
obtaining the empirical entropy of the sample classification; obtaining the conditional entropy of the feature for the classification result of the sample set; obtaining the information gain of the feature for the classification of the sample set according to the conditional entropy and the empirical entropy;
obtaining the split information of the feature for the classification of the sample set, i.e. the entropy of the feature for the sample classification;
obtaining the ratio of the information gain to the entropy to obtain the information gain rate of the feature for the sample classification.
For example, for the sample set D{sample 1, sample 2, ..., sample i, ..., sample n}, each sample includes multi-dimensional features such as feature A. The information gain rate of feature A for the sample classification can be obtained by the following formula:
g_R(D,A) = g(D,A) / H_A(D)
where g(D,A) is the information gain of feature A for the sample classification, and H_A(D) is the split information of feature A, i.e. the entropy of feature A.
Here, g(D,A) can be obtained by the following formula:
g(D,A) = H(D) - H(D|A)
where H(D) is the empirical entropy of the sample classification, and H(D|A) is the conditional entropy of feature A for the sample classification.
If the number of samples whose category is "male" is j, the number of "female" samples is n - j; in this case, the probability with which a positive sample appears in the sample set D is p1 = j/n, and the probability with which a negative sample appears in the sample set D is p2 = (n - j)/n. The empirical entropy H(D) of the sample classification is then calculated with the following formula for the empirical entropy:
H(D) = -(p1·log p1 + p2·log p2)
In the decision tree classification problem, the information gain is the difference between the information of the decision tree before and after the attribute-based division. In this implementation, the empirical entropy H(D) of the sample classification is:
H(D) = -p1·log p1 - p2·log p2
In an embodiment, the sample set may be divided into several sub-sample sets according to feature A; then the information entropy of the classification of each sub-sample set and the probability with which each feature value of feature A appears in the sample set are obtained, and from this information entropy and this probability the information entropy after the division can be obtained, i.e. the conditional entropy of feature A for the classification result of the sample set.
For example, for the sample feature A, the conditional entropy of feature A for the classification result of the sample set D can be calculated by the following formula:
H(D|A) = Σ_{i=1..n} pi·H(D|A=Ai)
where n is the number of distinct values of feature A, i.e. the number of feature value types; pi is the probability with which a sample whose value of feature A is the i-th value appears in the sample set D; Ai is the i-th value of A; and H(D|A=Ai) is the empirical entropy of the classification of the sub-sample set Di, in which all samples take the i-th value of feature A.
For example, take the case in which feature A has three distinct values, A1, A2 and A3. In this case, the sample set D{sample 1, sample 2, ..., sample i, ..., sample n} can be divided according to feature A into three sub-sample sets: D1{sample 1, sample 2, ..., sample d} with feature value A1, D2{sample d+1, ..., sample e} with feature value A2, and D3{sample e+1, ..., sample n} with feature value A3, where d and e are positive integers smaller than n.
In this case, the conditional entropy of feature A for the classification result of the sample set D is:
H(D|A) = p1·H(D|A=A1) + p2·H(D|A=A2) + p3·H(D|A=A3)
where p1 = |D1|/|D|, p2 = |D2|/|D|, p3 = |D3|/|D|;
H(D|A=A1) is the information entropy of the classification of the sub-sample set D1, i.e. its empirical entropy, which can be calculated with the formula for the empirical entropy given above.
After the empirical entropy H(D) of the sample classification and the conditional entropy H(D|A) of feature A for the classification result of the sample set D have been obtained, the information gain of feature A for the classification of the sample set D can be calculated, for example by the following formula:
g(D,A) = H(D) - H(D|A)
That is, the information gain of feature A for the classification of the sample set D is the difference between the empirical entropy H(D) and the conditional entropy H(D|A) of feature A for the classification result of the sample set D.
The split information of a feature for the classification of the sample set is the entropy of the feature. It can be obtained from the distribution, over the target sample set, of the values taken by the feature. For example, H_A(D) can be obtained by the following formula:
H_A(D) = -Σ_{i=1..n} (|Di|/|D|)·log(|Di|/|D|)
where n is the number of value categories (kinds of values) of feature A, and Di is the subset of the sample set D in which feature A takes the i-th value.
306. Determine whether the largest information gain rate is greater than a preset threshold; if yes, perform step 307; if no, perform step 313.
For example, it can be determined whether the largest information gain rate g_R(D,A)max is greater than the preset threshold ε, where the threshold ε can be set according to actual needs.
307. Select the feature corresponding to the largest information gain rate as the partition feature, and divide the sample set according to the feature values of the partition feature to obtain several sub-sample sets.
For example, when the feature corresponding to the largest information gain rate g_R(D,A)max is the feature Ag, the feature Ag can be selected as the partition feature.
Specifically, the sample set can be divided into several sub-sample sets according to the number of distinct feature values of the partition feature, the number of sub-sample sets being the same as the number of feature values. For example, samples in the sample set that have the same feature value can be put into the same sub-sample set. If the feature values of the partition feature are 0, 1 and 2, then the samples whose feature value is 0 are grouped into one class, the samples whose feature value is 1 into another class, and the samples whose feature value is 2 into a third class.
308. Remove the partition feature from the samples of the sub-sample sets to obtain the sub-sample sets after removal.
For example, when the partition feature has two distinct values, the sample set D can be divided into D1{sample 1, sample 2, ..., sample k} and D2{sample k+1, ..., sample n}. The partition feature Ag can then be removed from the sub-sample sets D1 and D2, i.e. A - Ag.
309. Generate the child nodes of the current node, and use the sub-sample sets after removal as the node information of the corresponding child nodes.
Here, one sub-sample set corresponds to one child node. For example, with reference to FIG. 3, the child nodes d1 and d2 of the root node d are generated, the sub-sample set D1 is used as the node information of child node d1, and the sub-sample set D2 is used as the node information of child node d2.
In an embodiment, the partition feature value corresponding to a child node may also be set on the path between the child node and the current node, to facilitate the subsequent user gender prediction, with reference to FIG. 5.
310. Determine whether the sub-sample set of a child node satisfies the preset classification termination condition; if yes, perform step 312; if no, perform step 311.
The preset classification termination condition can be set according to actual needs. When a child node satisfies the preset classification termination condition, the current child node is used as a leaf node and the classification of the sample set corresponding to the child node is stopped; when the child node does not satisfy the preset classification termination condition, the classification of the sample set corresponding to the child node is continued. For example, the preset classification termination condition may include: the number of categories of the samples in the child node's sub-sample set after removal is a preset number.
For example, the preset classification termination condition may include: the number of categories of the samples in the sub-sample set after removal corresponding to the child node is 1, that is, the sample set of the child node contains samples of only one category.
Taking child node d1 as an example, it is determined whether the child node satisfies the preset classification termination condition; if so, the current child node d1 is used as a leaf node, and the output of that leaf node is set according to the category of the samples in the sub-sample set corresponding to child node d1.
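The preset classification termination condition of step 310 can be expressed, for example, as a simple predicate over a child node's sub-sample set; the "preset number" of categories is taken to be 1 here, as in the example above, and the dict-based sample representation is again an assumption:

```python
def meets_termination_condition(subsample_set, preset_number=1):
    """True when the sub-sample set contains at most `preset_number` sample categories."""
    categories = {sample['gender'] for sample in subsample_set}
    return len(categories) <= preset_number
```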
311. Update the target sample set to the sub-sample set of the child node, and return to step 305.
When a child node does not satisfy the preset classification termination condition, the classification based on the information gain rate described above is used to continue classifying the sub-sample set corresponding to the child node. Taking child node d2 as an example, the information gain rate g_R(D,A) of each feature in the sub-sample set D2 relative to the sample classification can be calculated and the largest information gain rate g_R(D,A)max selected; when the largest information gain rate g_R(D,A)max is greater than the preset threshold ε, the feature corresponding to that information gain rate g_R(D,A) can be selected as the partition feature Ag (for example feature Ai+1), and D2 is divided into several sub-sample sets based on the partition feature Ag, for example into the sub-sample sets D21, D22 and D23. Then the partition feature Ag is removed from the sub-sample sets D21, D22 and D23, the child nodes d21, d22 and d23 of the current node d2 are generated, and the sample sets D21, D22 and D23 with the partition feature Ag removed are used as the node information of the child nodes d21, d22 and d23 respectively.
312. Use the child node as a leaf node, and set the output of the leaf node according to the sample categories in the sub-sample set of the child node.
For example, the preset classification termination condition may include: the number of categories of the samples in the sub-sample set after removal corresponding to the child node is 1, that is, the sample set of the child node contains samples of only one category.
In this case, if the child node satisfies the preset classification termination condition, the category of the samples in the sub-sample set is used as the output of that leaf node. For example, if the sub-sample set after removal contains only samples of the category "male", "male" can be used as the output of that leaf node.
313. Use the current node as a leaf node, and select the sample category with the largest number of samples as the output of that leaf node.
Here, the sample categories include "male" and "female".
For example, when the sub-sample set D1 of child node d1 is being classified, if the largest information gain rate is smaller than the preset threshold, the sample category with the largest number of samples in the sub-sample set D1 can be used as the output of that leaf node; if "female" has the largest number of samples, "female" can be used as the output of the leaf node a1.
314. After the decision tree model has been constructed, obtain the time at which the user's gender needs to be predicted, and collect, according to the prediction time, multi-dimensional features of the behavior habits of a user who has not provided gender information as a prediction sample.
The prediction time can be set as required, for example the current time.
315. Predict the gender of the user who has not provided gender information according to the prediction sample and the decision tree model.
For example, the corresponding leaf node may be determined according to the features of the prediction sample and the decision tree model, and the output of that leaf node is used as the predicted output result. That is, the features of the prediction sample are used to follow the branch conditions of the decision tree (i.e. the feature values of the partition features) to the current leaf node, and the output of that leaf node is taken as the prediction result. Since the output of a leaf node includes "male" or "female", the gender of the user who has not provided gender information can be determined based on the decision tree.
For example, after the multi-dimensional features of the behavior habits of a user who has not provided gender information have been collected, the corresponding leaf node can be found in the decision tree shown in FIG. 5 by following the branch conditions of the decision tree; if that leaf node is an1 and the output of the leaf node an1 is "male", it is determined that the gender of the user who has not provided gender information is male.
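Tying steps 301 to 315 together, a toy end-to-end run might look as follows. It assumes the earlier sketches (partition_by_feature, gain_ratio, build_tree and predict_gender) are available in the same module, and the feature names, values and threshold are invented purely for illustration:

```python
# Toy labeled sample set: two "male" and two "female" users, two behavior features.
training = [
    {'reads_male_novels': 1, 'takes_selfies': 1, 'gender': 'male'},
    {'reads_male_novels': 1, 'takes_selfies': 0, 'gender': 'male'},
    {'reads_male_novels': 0, 'takes_selfies': 1, 'gender': 'female'},
    {'reads_male_novels': 0, 'takes_selfies': 0, 'gender': 'female'},
]

# Build the decision tree model (steps 301-313).
tree = build_tree(training, ['reads_male_novels', 'takes_selfies'],
                  gain_ratio, partition_by_feature, threshold=0.8)

# Collect a prediction sample and predict the gender (steps 314-315).
prediction_sample = {'reads_male_novels': 1, 'takes_selfies': 0}
print(predict_gender(tree, prediction_sample))   # expected output: male
```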
As can be seen from the above, in the embodiments of the present application, multi-dimensional features of the behavior habits of users who have provided gender information are collected as samples, and a sample set of the behavior habits of the users who have provided gender information is constructed; when the number of the features exceeds a preset threshold, the sample set is classified according to the information gain rate of the features for the sample classification, so as to construct a decision tree model for predicting the gender of a user, the output of the decision tree model including "male" or "female"; multi-dimensional features of the behavior habits of a user who has not provided gender information are collected according to the prediction time as a prediction sample; and the gender of the user who has not provided gender information is predicted according to the prediction sample and the decision tree model.
Further, since every sample of the sample set includes multiple pieces of feature information reflecting the user's behavior habits, the embodiments of the present application can make user gender prediction more intelligent.
Further, implementing user gender prediction based on the decision tree prediction model can improve the accuracy of user gender prediction.
An embodiment also provides a user gender prediction apparatus. Please refer to FIG. 7, which is a schematic structural diagram of a user gender prediction apparatus provided by an embodiment of the present application. The user gender prediction apparatus is applied to an electronic device and includes a first collection unit 401, a classification unit 402, a second collection unit 403 and a prediction unit 404, as follows:
The first collection unit 401 is configured to collect multi-dimensional features of the behavior habits of users who have provided gender information as samples, and to construct a sample set of the behavior habits of the users who have provided gender information.
The classification unit 402 is configured to, when the number of the features exceeds a preset threshold, classify the sample set according to the information gain rate of the features for the sample classification, so as to construct a decision tree model for user gender prediction.
The second collection unit 403 is configured to collect, according to a prediction time, multi-dimensional features of the behavior habits of a user who has not provided gender information as a prediction sample.
The prediction unit 404 is configured to predict, according to the prediction sample and the decision tree model, the gender of the user who has not provided gender information.
In an embodiment, referring to FIG. 8, the classification unit 402 may include:
a first node generating sub-unit 4021, configured to generate the root node of the decision tree and use the sample set as the node information of the root node, and to determine the sample set of the root node as the target sample set currently to be classified;
a gain rate obtaining sub-unit 4022, configured to obtain the information gain rate of the features in the target sample set for the classification of the target sample set;
a feature determining sub-unit 4023, configured to select the current partition feature from the features according to the information gain rate;
a classification sub-unit 4024, configured to divide the sample set according to the partition feature to obtain several sub-sample sets;
a second node generating sub-unit 4025, configured to remove the partition feature from the samples of the sub-sample sets to obtain the sub-sample sets after removal, to generate the child nodes of the current node, and to use the sub-sample sets after removal as the node information of the child nodes;
a determining sub-unit 4026, configured to determine whether a child node satisfies the preset classification termination condition; if not, to update the target sample set to the sub-sample set after removal and trigger the gain rate obtaining sub-unit 4022 to perform the step of obtaining the information gain rate of the features in the target sample set for the classification of the sample set; if so, to use the child node as a leaf node and set the output of the leaf node according to the category of the samples in the sub-sample set after removal, where the categories of the samples include "male" and "female".
The classification sub-unit 4024 may be configured to obtain the feature values of the partition feature in the sample set, and to divide the sample set according to the feature values, samples with the same partition feature value being divided into the same sub-sample set.
The feature determining sub-unit 4023 may be configured to:
select the largest target information gain rate from the information gain rates;
determine whether the target information gain rate is greater than a preset threshold;
if yes, select the feature corresponding to the target information gain rate as the current partition feature.
In an embodiment, the gain rate obtaining sub-unit 4022 may be configured to:
obtain the information gain of the feature for the classification of the target sample set;
obtain the split information of the feature for the classification of the target sample set;
obtain the information gain rate of the feature for the classification of the target sample set according to the information gain and the split information.
For example, the gain rate obtaining sub-unit 4022 may be configured to:
obtain the empirical entropy of the target sample classification;
obtain the conditional entropy of the feature for the classification result of the target sample set;
obtain the information gain of the feature for the classification of the target sample set according to the conditional entropy and the empirical entropy.
In an embodiment, the determining sub-unit 4026 may be configured to determine whether the number of categories of the samples in the sub-sample set after removal corresponding to the child node is a preset number, and if so, to determine that the child node satisfies the preset classification termination condition.
In an embodiment, the feature determining sub-unit 4023 may be further configured to, when the target information gain rate is not greater than the preset threshold, use the current node as a leaf node and select the sample category with the largest number of samples as the output of that leaf node.
For the steps performed by the units in the user gender prediction apparatus, reference may be made to the method steps described in the above method embodiments. The user gender prediction apparatus can be integrated in an electronic device, such as a mobile phone or a tablet computer.
In specific implementation, the above units may be implemented as independent entities, or combined in any manner and implemented as the same entity or several entities. For the specific implementation of the above units, reference may be made to the foregoing embodiments, and details are not repeated here.
As can be seen from the above, in the user gender prediction apparatus of this embodiment, the first collection unit 401 collects multi-dimensional features of the behavior habits of users who have provided gender information as samples and constructs a sample set of the behavior habits of the users who have provided gender information; when the number of the features exceeds a preset threshold, the classification unit 402 classifies the sample set according to the information gain rate of the features for the sample classification, so as to construct a decision tree model for predicting the gender of a user, the output of the decision tree model including "male" or "female"; the second collection unit 403 collects, according to the prediction time, multi-dimensional features of the behavior habits of a user who has not provided gender information as a prediction sample; and the prediction unit 404 predicts, according to the prediction sample and the decision tree model, the gender of the user who has not provided gender information.
Further, since every sample of the sample set includes multiple pieces of feature information reflecting the user's behavior habits, the embodiments of the present application can make user gender prediction more intelligent.
Further, implementing user gender prediction based on the decision tree prediction model can improve the accuracy of user gender prediction.
An embodiment of the present application further provides an electronic device. Referring to FIG. 9, the electronic device 500 includes a processor 501 and a memory 502, the processor 501 being electrically connected to the memory 502.
The processor 501 is the control center of the electronic device 500; it connects the various parts of the entire electronic device using various interfaces and lines, and performs the various functions of the electronic device 500 and processes data by running or loading a computer program stored in the memory 502 and calling data stored in the memory 502, so as to monitor the electronic device 500 as a whole.
The memory 502 can be used to store software programs and modules, and the processor 501 executes various functional applications and data processing by running the computer programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, a computer program required for at least one function (such as a sound playing function or an image playing function), and the like, and the data storage area may store data created according to the use of the electronic device, and the like. In addition, the memory 502 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 502 may further include a memory controller to provide the processor 501 with access to the memory 502.
In the embodiment of the present application, the processor 501 in the electronic device 500 loads the instructions corresponding to the processes of one or more computer programs into the memory 502 according to the following steps, and the processor 501 runs the computer programs stored in the memory 502, thereby implementing various functions, as follows:
collecting multi-dimensional features of the behavior habits of users who have provided gender information as samples, and constructing a sample set of the behavior habits of the users who have provided gender information;
when the number of the features exceeds a preset threshold, classifying the sample set according to the information gain rate of the features for the sample classification, so as to construct a decision tree model for user gender prediction, the output of the decision tree model including male or female;
collecting, according to a prediction time, multi-dimensional features of the behavior habits of a user who has not provided gender information as a prediction sample;
predicting the gender of the user who has not provided gender information according to the prediction sample and the decision tree model.
在某些实施方式中,在根据所述特征对于样本分类的信息增益对所述样本集进行样本分类,以构建出所述用户的决策树模型时,处理器501可以具体执行以下步骤:In some embodiments, when the sample set is sample-classified according to the information gain of the feature classification for the sample to construct the decision tree model of the user, the processor 501 may specifically perform the following steps:
generating a root node of the decision tree, and using the sample set as node information of the root node;
determining the sample set of the root node as a target sample set currently to be classified;
obtaining the information gain rate of each feature in the target sample set for classifying the target sample set;
selecting a current splitting feature from the features according to the information gain rates;
dividing the target sample set according to the splitting feature to obtain several sub-sample sets;
removing the splitting feature from the samples in each sub-sample set to obtain post-removal sub-sample sets;
generating child nodes of the current node, and using the post-removal sub-sample sets as node information of the child nodes;
determining whether a child node satisfies a preset classification termination condition;
if not, updating the target sample set to the post-removal sub-sample set, and returning to the step of obtaining the information gain rate of each feature in the target sample set for classifying the target sample set; and
if yes, using the child node as a leaf node and setting the output of the leaf node according to the categories of the samples in the post-removal sub-sample set, the categories of the samples including "male" and "female".
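The following Python sketch is one minimal way to read the recursive procedure listed above, assuming categorical features and the hypothetical sample representation shown earlier. It is an illustrative sketch, not the implementation of this embodiment, and it also anticipates the refinements described below (partitioning by feature value, the gain-rate threshold, and the fallback leaf).

# Minimal sketch of gain-ratio-based decision tree construction (assumptions:
# categorical feature values, samples as dicts with a "gender" label).
import math
from collections import Counter

def entropy(samples):
    # Empirical entropy of the gender labels over a (sub-)sample set.
    total = len(samples)
    counts = Counter(s["gender"] for s in samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(samples, feature):
    # Information gain of splitting on `feature`, divided by its split information.
    total = len(samples)
    groups = {}
    for s in samples:
        groups.setdefault(s[feature], []).append(s)
    cond_entropy = sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = -sum((len(g) / total) * math.log2(len(g) / total) for g in groups.values())
    gain = entropy(samples) - cond_entropy
    return gain / split_info if split_info > 0 else 0.0

def majority_gender(samples):
    # Sample category ("male"/"female") with the largest number of samples.
    return Counter(s["gender"] for s in samples).most_common(1)[0][0]

def build_tree(samples, features, gain_threshold=0.0):
    # Termination: all samples share one gender, or no features remain.
    genders = {s["gender"] for s in samples}
    if len(genders) == 1:
        return {"leaf": genders.pop()}
    if not features:
        return {"leaf": majority_gender(samples)}
    # Select the feature with the largest gain ratio as the splitting feature.
    best = max(features, key=lambda f: gain_ratio(samples, f))
    if gain_ratio(samples, best) <= gain_threshold:
        # Gain rate not greater than the preset threshold: make a leaf node.
        return {"leaf": majority_gender(samples)}
    node = {"feature": best, "children": {}, "default": majority_gender(samples)}
    remaining = [f for f in features if f != best]
    # Partition by the values of the splitting feature, remove that feature
    # from each sub-sample set, and recurse.
    groups = {}
    for s in samples:
        groups.setdefault(s[best], []).append(s)
    for value, subset in groups.items():
        pruned = [{k: v for k, v in s.items() if k != best} for s in subset]
        node["children"][value] = build_tree(pruned, remaining, gain_threshold)
    return node

def predict(tree, sample):
    # Walk the tree until a leaf is reached; unseen values fall back to the
    # majority gender recorded at the node.
    while "leaf" not in tree:
        value = sample.get(tree["feature"])
        tree = tree["children"].get(value, {"leaf": tree["default"]})
    return tree["leaf"]

# Example usage with two hypothetical behavioral features (illustrative only):
samples = [
    {"night_usage_level": "high", "shopping_app_installed": "yes", "gender": "female"},
    {"night_usage_level": "low",  "shopping_app_installed": "no",  "gender": "male"},
    {"night_usage_level": "low",  "shopping_app_installed": "yes", "gender": "female"},
    {"night_usage_level": "high", "shopping_app_installed": "no",  "gender": "male"},
]
model = build_tree(samples, ["night_usage_level", "shopping_app_installed"])
print(predict(model, {"night_usage_level": "high", "shopping_app_installed": "no"}))  # "male"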
In some implementations, when dividing the target sample set according to the splitting feature, the processor 501 may specifically perform the following steps:
obtaining the feature values of the splitting feature in the target sample set; and
dividing the target sample set according to the feature values.
In some implementations, when selecting the current splitting feature from the features according to the information gain rates, the processor 501 may specifically perform the following steps:
selecting the largest information gain rate from the information gain rates as a target information gain rate;
determining whether the target information gain rate is greater than a preset threshold; and
if yes, selecting the feature corresponding to the target information gain rate as the current splitting feature.
In some implementations, the processor 501 may further specifically perform the following step:
when the target information gain rate is not greater than the preset threshold, using the current node as a leaf node, and selecting the sample category with the largest number of samples as the output of the leaf node.
In some implementations, when obtaining the information gain rate of each feature in the target sample set for classifying the target sample set, the processor 501 may specifically perform the following steps:
obtaining the information gain of the feature for classifying the target sample set;
obtaining the split information of the feature for classifying the target sample set; and
obtaining, according to the information gain and the split information, the information gain rate of the feature for classifying the target sample set.
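Spelled out in the same terms as the steps above, these quantities can be computed as in the following Python sketch, which expands the compact gain_ratio helper used in the earlier sketch: the empirical entropy H(D), the conditional entropy H(D|A), the split information H_A(D), the information gain g(D, A) = H(D) − H(D|A), and the information gain rate g_R(D, A) = g(D, A) / H_A(D). It is again a non-authoritative illustration under the same assumed sample representation.

import math
from collections import Counter

def empirical_entropy(samples):
    # H(D): entropy of the gender labels over the sample set D.
    total = len(samples)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(s["gender"] for s in samples).values())

def _groups_by_value(samples, feature):
    # Sub-sample sets keyed by the values of feature A.
    groups = {}
    for s in samples:
        groups.setdefault(s[feature], []).append(s)
    return groups

def conditional_entropy(samples, feature):
    # H(D|A): label entropy after D is partitioned by the values of feature A.
    total = len(samples)
    return sum(len(g) / total * empirical_entropy(g)
               for g in _groups_by_value(samples, feature).values())

def split_information(samples, feature):
    # H_A(D): entropy of the distribution of feature A's values themselves.
    total = len(samples)
    return -sum((len(g) / total) * math.log2(len(g) / total)
                for g in _groups_by_value(samples, feature).values())

def information_gain(samples, feature):
    # g(D, A) = H(D) - H(D|A)
    return empirical_entropy(samples) - conditional_entropy(samples, feature)

def information_gain_rate(samples, feature):
    # g_R(D, A) = g(D, A) / H_A(D); defined as zero when the feature takes a single value.
    split_info = split_information(samples, feature)
    return information_gain(samples, feature) / split_info if split_info > 0 else 0.0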
It can be seen from the above that the electronic device of this embodiment of the present application collects multi-dimensional features of the behavior habits of users who have provided gender information as samples and constructs a sample set of the behavior habits of those users; when the number of the features exceeds a preset threshold, performs sample classification on the sample set according to the information gain rate of each feature for classifying the samples, so as to construct a decision tree model for predicting user gender, the output of the decision tree model including "male" or "female"; collects, according to a prediction time, multi-dimensional features of the behavior habits of a user who has not provided gender information as a prediction sample; and predicts, according to the prediction sample and the decision tree model, the gender of the user who has not provided gender information.
Referring also to FIG. 10, in some implementations, the electronic device 500 may further include a display 503, a radio frequency circuit 504, an audio circuit 505, and a power supply 506, each of which is electrically connected to the processor 501.
The display 503 may be used to display information entered by the user or information provided to the user, as well as various graphical user interfaces, which may be composed of graphics, text, icons, video, and any combination thereof. The display 503 may include a display panel; in some implementations, the display panel may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.
The radio frequency circuit 504 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with a network device or another electronic device and to exchange signals with the network device or the other electronic device.
The audio circuit 505 may be used to provide an audio interface between the user and the electronic device through a speaker and a microphone.
The power supply 506 may be used to supply power to the components of the electronic device 500. In some embodiments, the power supply 506 may be logically connected to the processor 501 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system.
Although not shown in FIG. 10, the electronic device 500 may further include a camera, a Bluetooth module, and the like, which are not described in detail here.
An embodiment of the present application further provides a storage medium storing a computer program which, when run on a computer, causes the computer to execute the user gender prediction method in any one of the above embodiments, for example: collecting multi-dimensional features of the behavior habits of users who have provided gender information as samples, and constructing a sample set of the behavior habits of the users who have provided gender information; when the number of the features exceeds a preset threshold, performing sample classification on the sample set according to the information gain rate of each feature for classifying the samples, so as to construct a decision tree model for predicting user gender, the output of the decision tree model including "male" or "female"; collecting, according to a prediction time, multi-dimensional features of the behavior habits of a user who has not provided gender information as a prediction sample; and predicting, according to the prediction sample and the decision tree model, the gender of the user who has not provided gender information.
In this embodiment of the present application, the storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
In the above embodiments, the description of each embodiment has its own emphasis; for a part that is not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
It should be noted that, for the user gender prediction method of the embodiments of the present application, persons of ordinary skill in the art can understand that all or part of the process of implementing the user gender prediction method of the embodiments of the present application may be completed by a computer program controlling related hardware. The computer program may be stored in a computer-readable storage medium, for example in the memory of an electronic device, and be executed by at least one processor in the electronic device; the execution process may include the flow of the embodiments of the user gender prediction method. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
For the user gender prediction apparatus of the embodiments of the present application, its functional modules may be integrated into one processing chip, or each module may exist physically on its own, or two or more modules may be integrated into one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The user gender prediction method, apparatus, and electronic device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, persons skilled in the art may make changes to the specific implementations and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (19)

  1. A user gender prediction method, comprising:
    collecting multi-dimensional features of the behavior habits of users who have provided gender information as samples, and constructing a sample set of the behavior habits of the users who have provided gender information;
    when the number of the features exceeds a preset threshold, performing sample classification on the sample set according to the information gain rate of each feature for classifying the samples, so as to construct a decision tree model for user gender prediction;
    collecting, according to a prediction time, multi-dimensional features of the behavior habits of a user who has not provided gender information as a prediction sample; and
    predicting, according to the prediction sample and the decision tree model, the gender of the user who has not provided gender information.
  2. The user gender prediction method according to claim 1, wherein, when the number of the features exceeds a preset threshold, performing sample classification on the sample set according to the information gain rate of each feature for classifying the samples, so as to construct a decision tree model for user gender prediction, comprises:
    generating a root node of the decision tree, and using the sample set as node information of the root node;
    determining the sample set of the root node as a target sample set currently to be classified;
    obtaining the information gain rate of each feature in the target sample set for classifying the target sample set;
    selecting a current splitting feature from the features according to the information gain rates;
    dividing the sample set according to the splitting feature to obtain several sub-sample sets;
    removing the splitting feature from the samples in each sub-sample set to obtain a post-removal sub-sample set;
    generating a child node of the current node, and using the post-removal sub-sample set as node information of the child node;
    determining whether the child node satisfies a preset classification termination condition;
    if not, updating the target sample set to the post-removal sub-sample set, and returning to the step of obtaining the information gain rate of each feature in the target sample set for classifying the target sample set; and
    if yes, using the child node as a leaf node and setting the output of the leaf node according to the categories of the samples in the post-removal sub-sample set, the categories of the samples comprising male and female.
  3. The user gender prediction method according to claim 2, wherein dividing the target sample set according to the splitting feature comprises:
    obtaining feature values of the splitting feature in the target sample set; and
    dividing the target sample set according to the feature values.
  4. The user gender prediction method according to claim 2, wherein selecting a current splitting feature from the features according to the information gain rates comprises:
    selecting the largest information gain rate from the information gain rates as a target information gain rate;
    determining whether the target information gain rate is greater than a preset threshold; and
    if yes, selecting the feature corresponding to the target information gain rate as the current splitting feature.
  5. The user gender prediction method according to claim 4, wherein the user gender prediction method further comprises:
    when the target information gain rate is not greater than the preset threshold, using the current node as a leaf node, and selecting the sample category with the largest number of samples as the output of the leaf node.
  6. The user gender prediction method according to claim 2, wherein determining whether the child node satisfies a preset classification termination condition comprises:
    determining whether the number of categories of the samples in the post-removal sub-sample set corresponding to the child node is a preset number; and
    if yes, determining that the child node satisfies the preset classification termination condition.
  7. The user gender prediction method according to claim 2, wherein obtaining the information gain rate of each feature in the target sample set for classifying the target sample set comprises:
    obtaining the information gain of the feature for classifying the target sample set;
    obtaining the split information of the feature for classifying the target sample set; and
    obtaining, according to the information gain and the split information, the information gain rate of the feature for classifying the target sample set.
  8. The user gender prediction method according to claim 7, wherein obtaining the information gain of the feature for classifying the target sample set comprises:
    obtaining the empirical entropy of the classification of the target sample set;
    obtaining the conditional entropy of the feature for the classification result of the target sample set; and
    obtaining, according to the conditional entropy and the empirical entropy, the information gain of the feature for classifying the target sample set.
  9. The user gender prediction method according to claim 7, wherein obtaining, according to the information gain and the split information, the information gain rate of the feature for classifying the target sample set comprises:
    calculating the information gain rate of the feature for classifying the target sample set by the following formula:
    g_R(D, A) = g(D, A) / H_A(D)
    where g_R(D, A) is the information gain rate of feature A for classifying the sample set D, g(D, A) is the information gain of feature A for classifying the samples, and H_A(D) is the split information of feature A;
    and g(D, A) can be calculated by the following formula:
    g(D, A) = H(D) − H(D|A) = H(D) − Σ_{i=1}^{n} p_i · H(D_i)
    where H(D) is the empirical entropy of the classification of the sample set D, H(D|A) is the conditional entropy of feature A for the classification of the sample set D, D_i is the subset of D in which feature A takes its i-th value, p_i is the probability that a sample taking the i-th value of feature A appears in the sample set D, and n and i are positive integers greater than zero.
  10. A user gender prediction apparatus, comprising:
    a first collecting unit, configured to collect multi-dimensional features of the behavior habits of users who have provided gender information as samples, and to construct a sample set of the behavior habits of the users who have provided gender information;
    a classification unit, configured to, when the number of the features exceeds a preset threshold, perform sample classification on the sample set according to the information gain rate of each feature for classifying the samples, so as to construct a decision tree model for user gender prediction, the output of the decision tree model including male or female;
    a second collecting unit, configured to collect, according to a prediction time, multi-dimensional features of the behavior habits of a user who has not provided gender information as a prediction sample; and
    a prediction unit, configured to predict, according to the prediction sample and the decision tree model, the gender of the user who has not provided gender information.
  11. An electronic device, comprising a processor and a memory, the memory storing a computer program, wherein the processor is configured to execute a user gender prediction method by calling the computer program, the user gender prediction method comprising:
    collecting multi-dimensional features of the behavior habits of users who have provided gender information as samples, and constructing a sample set of the behavior habits of the users who have provided gender information;
    when the number of the features exceeds a preset threshold, performing sample classification on the sample set according to the information gain rate of each feature for classifying the samples, so as to construct a decision tree model for user gender prediction;
    collecting, according to a prediction time, multi-dimensional features of the behavior habits of a user who has not provided gender information as a prediction sample; and
    predicting, according to the prediction sample and the decision tree model, the gender of the user who has not provided gender information.
  12. The electronic device according to claim 11, wherein, when the number of the features exceeds a preset threshold, performing sample classification on the sample set according to the information gain rate of each feature for classifying the samples, so as to construct a decision tree model for user gender prediction, comprises:
    generating a root node of the decision tree, and using the sample set as node information of the root node;
    determining the sample set of the root node as a target sample set currently to be classified;
    obtaining the information gain rate of each feature in the target sample set for classifying the target sample set;
    selecting a current splitting feature from the features according to the information gain rates;
    dividing the sample set according to the splitting feature to obtain several sub-sample sets;
    removing the splitting feature from the samples in each sub-sample set to obtain a post-removal sub-sample set;
    generating a child node of the current node, and using the post-removal sub-sample set as node information of the child node;
    determining whether the child node satisfies a preset classification termination condition;
    if not, updating the target sample set to the post-removal sub-sample set, and returning to the step of obtaining the information gain rate of each feature in the target sample set for classifying the target sample set; and
    if yes, using the child node as a leaf node and setting the output of the leaf node according to the categories of the samples in the post-removal sub-sample set, the categories of the samples comprising male and female.
  13. The electronic device according to claim 12, wherein dividing the target sample set according to the splitting feature comprises:
    obtaining feature values of the splitting feature in the target sample set; and
    dividing the target sample set according to the feature values.
  14. The electronic device according to claim 12, wherein selecting a current splitting feature from the features according to the information gain rates comprises:
    selecting the largest information gain rate from the information gain rates as a target information gain rate;
    determining whether the target information gain rate is greater than a preset threshold; and
    if yes, selecting the feature corresponding to the target information gain rate as the current splitting feature.
  15. The electronic device according to claim 14, wherein the user gender prediction method further comprises:
    when the target information gain rate is not greater than the preset threshold, using the current node as a leaf node, and selecting the sample category with the largest number of samples as the output of the leaf node.
  16. The electronic device according to claim 12, wherein determining whether the child node satisfies a preset classification termination condition comprises:
    determining whether the number of categories of the samples in the post-removal sub-sample set corresponding to the child node is a preset number; and
    if yes, determining that the child node satisfies the preset classification termination condition.
  17. The electronic device according to claim 12, wherein obtaining the information gain rate of each feature in the target sample set for classifying the target sample set comprises:
    obtaining the information gain of the feature for classifying the target sample set;
    obtaining the split information of the feature for classifying the target sample set; and
    obtaining, according to the information gain and the split information, the information gain rate of the feature for classifying the target sample set.
  18. The electronic device according to claim 17, wherein obtaining the information gain of the feature for classifying the target sample set comprises:
    obtaining the empirical entropy of the classification of the target sample set;
    obtaining the conditional entropy of the feature for the classification result of the target sample set; and
    obtaining, according to the conditional entropy and the empirical entropy, the information gain of the feature for classifying the target sample set.
  19. The electronic device according to claim 17, wherein obtaining, according to the information gain and the split information, the information gain rate of the feature for classifying the target sample set comprises:
    calculating the information gain rate of the feature for classifying the target sample set by the following formula:
    g_R(D, A) = g(D, A) / H_A(D)
    where g_R(D, A) is the information gain rate of feature A for classifying the sample set D, g(D, A) is the information gain of feature A for classifying the samples, and H_A(D) is the split information of feature A;
    and g(D, A) can be calculated by the following formula:
    g(D, A) = H(D) − H(D|A) = H(D) − Σ_{i=1}^{n} p_i · H(D_i)
    where H(D) is the empirical entropy of the classification of the sample set D, H(D|A) is the conditional entropy of feature A for the classification of the sample set D, D_i is the subset of D in which feature A takes its i-th value, p_i is the probability that a sample taking the i-th value of feature A appears in the sample set D, and n and i are positive integers greater than zero.
PCT/CN2018/115358 2017-12-22 2018-11-14 Method and apparatus for predicting user gender, and electronic device WO2019120007A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711405558.8 2017-12-22
CN201711405558.8A CN109961075A (en) 2017-12-22 2017-12-22 User gender prediction method, apparatus, medium and electronic equipment

Publications (1)

Publication Number Publication Date
WO2019120007A1 true WO2019120007A1 (en) 2019-06-27

Family

ID=66993039

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/115358 WO2019120007A1 (en) 2017-12-22 2018-11-14 Method and apparatus for predicting user gender, and electronic device

Country Status (2)

Country Link
CN (1) CN109961075A (en)
WO (1) WO2019120007A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116757450A (en) * 2023-08-17 2023-09-15 浪潮通用软件有限公司 Method, device, equipment and medium for task allocation of sharing center

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639714B (en) * 2020-06-01 2021-07-23 贝壳找房(北京)科技有限公司 Method, device and equipment for determining attributes of users
CN112348583B (en) * 2020-11-04 2022-12-06 贝壳技术有限公司 User preference generation method and generation system
CN112446144A (en) * 2020-11-17 2021-03-05 哈工大机器人(合肥)国际创新研究院 Fault diagnosis method and device for large-scale rotating machine set

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164470A (en) * 2011-12-15 2013-06-19 盛大计算机(上海)有限公司 Directional application method based on user gender distinguished results and system thereof
CN104598648A (en) * 2015-02-26 2015-05-06 苏州大学 Interactive gender identification method and device for microblog user
CN104933075A (en) * 2014-03-20 2015-09-23 百度在线网络技术(北京)有限公司 User attribute predicting platform and method
CN107180044A (en) * 2016-03-09 2017-09-19 精硕科技(北京)股份有限公司 Recognize Internet user's sex method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473231A (en) * 2012-06-06 2013-12-25 深圳先进技术研究院 Classifier building method and system
US10339465B2 (en) * 2014-06-30 2019-07-02 Amazon Technologies, Inc. Optimized decision tree based models
CN105654131A (en) * 2015-12-30 2016-06-08 小米科技有限责任公司 Classification model training method and device
CN106294667A (en) * 2016-08-05 2017-01-04 四川九洲电器集团有限责任公司 A kind of decision tree implementation method based on ID3 and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164470A (en) * 2011-12-15 2013-06-19 盛大计算机(上海)有限公司 Directional application method based on user gender distinguished results and system thereof
CN104933075A (en) * 2014-03-20 2015-09-23 百度在线网络技术(北京)有限公司 User attribute predicting platform and method
CN104598648A (en) * 2015-02-26 2015-05-06 苏州大学 Interactive gender identification method and device for microblog user
CN107180044A (en) * 2016-03-09 2017-09-19 精硕科技(北京)股份有限公司 Recognize Internet user's sex method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116757450A (en) * 2023-08-17 2023-09-15 浪潮通用软件有限公司 Method, device, equipment and medium for task allocation of sharing center
CN116757450B (en) * 2023-08-17 2024-01-30 浪潮通用软件有限公司 Method, device, equipment and medium for task allocation of sharing center

Also Published As

Publication number Publication date
CN109961075A (en) 2019-07-02

Similar Documents

Publication Publication Date Title
WO2019120007A1 (en) Method and apparatus for predicting user gender, and electronic device
WO2019120023A1 (en) Gender prediction method and apparatus, storage medium and electronic device
WO2019062418A1 (en) Application cleaning method and apparatus, storage medium and electronic device
WO2019120019A1 (en) User gender prediction method and apparatus, storage medium and electronic device
WO2020155627A1 (en) Facial image recognition method and apparatus, electronic device, and storage medium
WO2019062414A1 (en) Method and apparatus for managing and controlling application program, storage medium, and electronic device
US9256693B2 (en) Recommendation system with metric transformation
WO2020207074A1 (en) Information pushing method and device
CN104298713B (en) A kind of picture retrieval method based on fuzzy clustering
US11422831B2 (en) Application cleaning method, storage medium and electronic device
WO2019052403A1 (en) Training method for image-text matching model, bidirectional search method, and related apparatus
CN106792003B (en) Intelligent advertisement insertion method and device and server
CN107894827B (en) Application cleaning method and device, storage medium and electronic equipment
US20220058222A1 (en) Method and apparatus of processing information, method and apparatus of recommending information, electronic device, and storage medium
CN108108455B (en) Destination pushing method and device, storage medium and electronic equipment
US11200444B2 (en) Presentation object determining method and apparatus based on image content, medium, and device
WO2019062342A1 (en) Background application cleaning method and apparatus, and storage medium and electronic device
CN108197225B (en) Image classification method and device, storage medium and electronic equipment
WO2019062419A1 (en) Application cleaning method and apparatus, storage medium and electronic device
CN108665007B (en) Recommendation method and device based on multiple classifiers and electronic equipment
US20180046721A1 (en) Systems and Methods for Automatic Customization of Content Filtering
CN107943582B (en) Feature processing method, feature processing device, storage medium and electronic equipment
TW202217597A (en) Image incremental clustering method, electronic equipment, computer storage medium thereof
JP2022117941A (en) Image searching method and device, electronic apparatus, and computer readable storage medium
CN107943537B (en) Application cleaning method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18891851

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18891851

Country of ref document: EP

Kind code of ref document: A1