WO2019120023A1 - Gender prediction method and apparatus, storage medium and electronic device - Google Patents

Gender prediction method and apparatus, storage medium and electronic device

Info

Publication number
WO2019120023A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
sample
gini index
sample set
classification
Prior art date
Application number
PCT/CN2018/116709
Other languages
French (fr)
Chinese (zh)
Inventor
陈岩
刘耀勇
Original Assignee
Oppo广东移动通信有限公司
Application filed by Oppo广东移动通信有限公司

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24317Piecewise classification, i.e. whereby each classification requires several discriminant rules

Definitions

  • The present application relates to the field of communications technologies, and in particular to a gender prediction method and apparatus, a storage medium, and an electronic device.
  • Current electronic devices are highly intelligent and can perform many functions.
  • In some scenarios, users have other needs for electronic devices, such as predicting the gender of the user.
  • The embodiments of the present application provide a gender prediction method and apparatus, a storage medium, and an electronic device, which can predict user gender.
  • an embodiment of the present application provides a gender prediction method, including:
  • an embodiment of the present application provides a gender prediction apparatus, including:
  • a sample construction unit configured to acquire, as samples, the multi-dimensional features of known-gender users using the electronic device, and to construct a sample set for gender prediction;
  • a classification unit configured to classify the sample set according to the Gini index information gain of each feature for the sample set classification, so as to construct a corresponding classification regression tree model, where the output of the classification regression tree model includes male or female;
  • an acquisition unit configured to collect, according to a prediction time, the multi-dimensional features of an unknown-gender user using the electronic device as a prediction sample;
  • a prediction unit configured to predict the gender of the unknown-gender user according to the prediction sample and the classification regression tree model.
  • a storage medium provided by an embodiment of the present application has a computer program stored thereon, and when the computer program is run on a computer, the computer is caused to perform a gender prediction method according to any embodiment of the present application.
  • An electronic device provided by an embodiment of the present application includes a processor and a memory, where the memory stores a computer program, and the processor is configured to execute the gender prediction method provided in any embodiment of the present application by calling the computer program.
  • FIG. 1 is a schematic diagram of an application scenario of a gender prediction method according to an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a gender prediction method provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a classification regression tree provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of another classification regression tree provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of still another classification regression tree provided by an embodiment of the present application.
  • FIG. 6 is another schematic flowchart of a gender prediction method provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a gender prediction apparatus according to an embodiment of the present application.
  • FIG. 8 is another schematic structural diagram of a gender prediction apparatus according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 10 is another schematic structural diagram of an electronic device according to an embodiment of the present application.
  • References to "an embodiment" herein mean that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application.
  • The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art will understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
  • the embodiment of the present application provides a gender prediction method, including:
  • classifying the sample set according to the Gini index information gain of the features for the sample set classification to construct a corresponding classification regression tree model includes:
  • taking the child node as a leaf node, and setting the output of the leaf node according to the sample categories of the samples in the subsample set, where the sample categories include male or female.
  • obtaining the Gini index information gain of the feature for the target sample set classification includes:
  • obtaining the Gini index information gain of the feature for the target sample set classification according to the Gini index.
  • obtaining the Gini index of a value of the feature for the target sample set classification includes:
  • obtaining the Gini index of the value for the target sample set classification according to the probability of each sample category.
  • obtaining the Gini index of the value for the target sample set classification according to the probability of each sample category includes:
  • obtaining the Gini index information gain of the feature for the target sample set classification according to the Gini index includes:
  • obtaining the Gini index information gain of the value of the feature for the target sample set classification according to the first Gini index, the sample-number ratio of the first subsample set to the target sample set, the second Gini index, and the sample-number ratio of the second subsample set to the target sample set.
  • obtaining a Gini index information gain of the feature for the target sample set classification according to the Gini index includes:
  • the Gini index information gain of the feature classification for the target sample set is calculated by the following formula:
  • Gini(D, A) is the Gini index information gain of feature A for the target sample set D;
  • Gini(D1) is the Gini index for the target sample set D when feature A takes the value a;
  • Gini(D2) is the Gini index for the target sample set D when feature A does not take the value a;
  • a is a value of the feature A
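Read together with the weighting described above (each Gini index multiplied by the sample-number ratio of its subsample set), the formula can be written as below. This is a reconstruction that agrees with the standard CART definition rather than a verbatim copy of the original figure; the notation |D|, |D_1| and |D_2| for sample counts is an assumption introduced here.

```latex
\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)
```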
  • selecting the current division feature and its corresponding division point from the features according to the Gini index information gain includes:
  • taking the feature corresponding to the target Gini index information gain and its value as the division feature and the division point, respectively.
  • determining whether the child node meets a preset classification termination condition includes:
  • predicting the gender of the unknown-gender user based on the prediction sample and the classification regression tree model includes:
  • the embodiment of the present application provides a gender prediction method
  • the execution body of the gender prediction method may be a gender prediction device provided by an embodiment of the present application, or a device integrated with the gender prediction device, such as an electronic device, a network device, or the like.
  • the gender prediction device can be implemented in hardware or software.
  • the electronic device may be a device such as a smart phone, a tablet computer, a palmtop computer, a notebook computer, or a desktop computer.
  • FIG. 1 is a schematic diagram of an application scenario of a gender prediction method according to an embodiment of the present application.
  • Taking the case where the gender prediction device is integrated into an electronic device as an example, the electronic device can acquire, as samples, the multi-dimensional features of known-gender users using the electronic device and construct a sample set for gender prediction; classify the sample set according to the Gini index information gain of each feature for the sample set classification to construct a corresponding classification regression tree model, whose output includes male or female; collect, according to the prediction time, the multi-dimensional features of an unknown-gender user using the electronic device as a prediction sample; and predict the gender of the unknown-gender user according to the prediction sample and the classification regression tree model.
  • The electronic device can also perform deeply customized optimization of system resource scheduling according to the predicted gender, for example, cleaning up applications based on the predicted gender.
  • For example, the multi-dimensional features of the electronic device in a historical time period (e.g., the number of times and duration for which user b browses male-oriented content in an application, the number of times and duration for which user b browses female-oriented content in the application, etc.) can be collected as samples to construct a sample set for gender prediction, and the sample set is classified according to the Gini index information gain of each feature for the sample set classification to construct a corresponding classification regression tree model.
  • The output of the classification regression tree model includes male or female. The multi-dimensional features of the electronic device used by an unknown-gender user are collected according to the prediction time (for example, at time t, the number of times and duration for which user a browses male-oriented content in the application, the number of times and duration for which user a browses female-oriented content in the application, etc.) as a prediction sample, and the gender of the unknown-gender user a (e.g., male or female) is predicted according to the prediction sample and the classification regression tree model.
  • FIG. 2 is a schematic flowchart diagram of a gender prediction method according to an embodiment of the present application.
  • the specific process of the gender prediction method provided by the embodiment of the present application may be as follows:
  • The multi-dimensional features are multi-dimensional user behavior features of a known-gender user, such as a male user or a female user, using an electronic device.
  • For example, they may be the multi-dimensional user behavior features of a known-gender user, such as a male user or a female user, using an electronic device during a historical time period.
  • The multi-dimensional features are gender-specific behavioral features of the user's use of the electronic device.
  • That is, a user exhibits behavioral features with male or female characteristics in the process of using an electronic device.
  • The multi-dimensional feature has a dimension of a certain length, and the parameter in each dimension corresponds to one piece of feature information characterizing the user's use of the electronic device; that is, the multi-dimensional feature is composed of multiple features.
  • The plurality of features may include behavior features of the user using applications on the electronic device, for example: the number of times and duration for which the user browses male-oriented goods (such as men's clothing) in a shopping application; the number of times and duration for which the user browses female-oriented goods (such as cosmetics and women's clothing) in the shopping application; the length of time the user reads male-oriented novels in a reading application; and the length of time the user reads female-oriented novels in the reading application.
  • the multi-dimensional feature may also include relevant behavior characteristic information of the user using the electronic device itself, such as the number of times the user uses the electronic device front camera, the number of times the user uses the rear camera, and the like.
  • The sample set for gender prediction may include a plurality of samples, each sample including the multi-dimensional features of a known-gender user using an electronic device.
  • the sample set of gender predictions may include multiple samples collected at a preset frequency during the historical time period.
  • The historical time period may be, for example, the past 7 days or 10 days; the preset frequency may be, for example, once every 10 minutes or once every half hour. It can be understood that the multi-dimensional feature data collected at one time constitutes one sample, and multiple samples constitute a sample set.
  • The multi-dimensional features of each known-gender user using their electronic device may be collected by a server, and the electronic device may then obtain them from the server when performing gender prediction.
  • A known-gender user is a user who has provided gender information when using an electronic device, for example, a user who provided gender information when registering an account.
  • Each sample in the sample set can be marked to obtain a sample label for each sample.
  • The sample labels include male and female; that is, the sample categories include male and female.
  • The value "1" may be used to represent "male" and the value "0" to represent "female", or vice versa.
  • Feature information in the known-gender user's multi-dimensional features that is not directly represented by a numerical value may be quantified with specific values. For example, for the wireless network connection state of the electronic device, the value 1 can be used to indicate the normal state and the value 0 the abnormal state (or vice versa); likewise, for whether the electronic device is charging, the value 1 can be used to indicate the charging state and the value 0 the non-charging state (or vice versa).
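The quantification described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the field names `wifi_state`, `charging` and `gender`, and the function `encode_sample`, are hypothetical.

```python
# Illustrative encodings only; all field names here are assumptions.
GENDER_TO_LABEL = {"male": 1, "female": 0}  # the reverse mapping works equally well


def encode_sample(raw):
    """Map a raw record to (numeric feature dict, numeric gender label)."""
    features = {
        # 1 = normal wireless connection, 0 = abnormal
        "wifi_normal": 1 if raw["wifi_state"] == "normal" else 0,
        # 1 = charging, 0 = not charging
        "charging": 1 if raw["charging"] else 0,
    }
    return features, GENDER_TO_LABEL[raw["gender"]]


features, label = encode_sample(
    {"wifi_state": "normal", "charging": False, "gender": "female"}
)
print(features, label)
```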
  • The embodiment of the present application may classify the sample set based on the Gini index information gain of each feature for the sample set classification to construct a CART (Classification And Regression Tree) model.
  • A classification regression tree model can be constructed based on the ID3 (Iterative Dichotomiser 3) algorithm.
  • The classification regression tree is a kind of decision tree, and a very important one. It is a binary tree in which each non-leaf node has two children, so in such a tree the number of leaf nodes is one more than the number of non-leaf nodes.
  • A decision tree is a tree built on the basis of decisions. In machine learning, a decision tree is a predictive model that represents a mapping between object attributes and object values: each node represents an object, each forked path in the tree represents a possible attribute value, and each leaf node corresponds to the value of the object represented by the path from the root node to that leaf node. A decision tree has only a single output; if multiple outputs are needed, separate decision trees can be built to handle the different outputs.
  • ID3 (Iterative Dichotomiser 3) is a decision tree algorithm based on the principle of Occam's razor, that is, doing more with as few things as possible.
  • The core idea of the ID3 algorithm is to use information gain to measure attribute selection, and to split on the attribute with the largest information gain after splitting.
  • The algorithm uses a top-down greedy search to traverse the space of possible decisions.
  • Information gain is defined for a single feature: for a feature t, it is the difference between the amount of information in the system with the feature and without it, which is the amount of information the feature brings to the system.
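As a rough illustration of this definition, the sketch below computes information gain as the entropy of the labels minus the entropy remaining after grouping by a feature. Using Shannon entropy as the "amount of information" is an assumption (it is the usual choice in ID3, but the text does not spell it out), and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after grouping by `feature`."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feature], []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


rows = [{"f": 1}, {"f": 1}, {"f": 0}, {"f": 0}]
labels = [1, 1, 0, 0]  # 1 = male, 0 = female
gain = information_gain(rows, labels, "f")
print(gain)
```

In this toy data the feature separates the two categories perfectly, so it removes all of the uncertainty in the labels.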
  • The Gini index is a feature selection measure similar to information entropy. It can be used to indicate the impurity of data, that is, the probability that a randomly selected sample is misclassified in a subset.
  • the Gini index can be used to construct a binary decision tree.
  • The Gini index is also an inequality measure that is commonly used to measure income disparity and can be used to measure any uneven distribution; it is a number between 0 and 1, where 0 means completely equal and 1 means completely unequal.
  • The Gini index information gain (Gini gain) of a feature for the sample set classification represents the impurity gain of the sample set after it is divided based on that feature. For example, the Gini index information gain of feature A for sample set D is Gini(D, A), which represents the impurity gain of sample set D after it is partitioned based on feature A.
  • the classification process may include the following steps:
  • the child node is used as a leaf node, and the output of the leaf node is set according to the sample category of the sample in the subsample set, and the sample category includes male or female.
  • The division feature may be selected according to the Gini index information gain of each feature and its corresponding value for the sample set classification, and is used for classifying the sample set.
  • The division point is a certain value of the division feature.
  • The feature corresponding to the minimum Gini index information gain may be selected as the division feature. That is, the step "selecting the current division feature and its corresponding division point from the features according to the Gini index information gain" may include:
  • taking the feature corresponding to the target Gini index information gain and its value as the division feature and the division point, respectively.
  • That is, the feature and corresponding value that give the smallest impurity of the sample set after division are selected as the division feature and the division point.
  • the value a is a division point.
  • The sample categories may include two categories, male and female, and the category of each sample may be represented by a sample label. For example, when the sample label is a numerical value, the value "1" represents "male" and the value "0" represents "female", or vice versa.
  • When the child node satisfies the preset classification termination condition, the child node may be used as a leaf node, that is, classification of the sample set of the child node is stopped, and the output of the leaf node may be set based on the categories of the samples in the sub-sample set.
  • There are several ways to set the output of a leaf node based on the categories of the samples. For example, the category with the largest number of samples in the sub-sample set can be taken as the output of the leaf node.
  • the preset classification termination condition may be set according to actual requirements.
  • When the child node satisfies the preset classification termination condition, the current child node is regarded as a leaf node, and classification of the sample set corresponding to the child node is stopped; when the child node does not satisfy the preset classification termination condition, the sample set corresponding to the child node continues to be classified.
  • The preset classification termination condition may include: the number of categories of the samples in the sub-sample set of the child node is a preset number; that is, the step "determining whether the child node satisfies the preset classification termination condition" may include:
  • For example, the preset classification termination condition may include: the number of categories of the samples in the sub-sample set corresponding to the child node is 1, that is, the samples in the sample set of the child node have only one category. In this case, the child node satisfies the preset classification termination condition, and the category of the samples in the sub-sample set is taken as the output of the leaf node. For example, if the sub-sample set contains only samples of the category "male", then "male" can be used as the output of the leaf node.
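The termination check and the leaf output rule described above might look like the following sketch. The function names are hypothetical; the preset number of categories defaults to 1, as in the example, and the leaf output uses the "largest number of samples" rule mentioned earlier.

```python
from collections import Counter


def meets_termination(labels, preset_category_count=1):
    """True when the sub-sample set holds exactly the preset number of categories."""
    return len(set(labels)) == preset_category_count


def leaf_output(labels):
    """Output of a leaf node: the category with the largest number of samples."""
    return Counter(labels).most_common(1)[0][0]


pure = ["male", "male", "male"]
mixed = ["male", "female", "male"]
print(meets_termination(pure), leaf_output(pure))
print(meets_termination(mixed), leaf_output(mixed))
```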
  • The sample set may be divided into two sub-sample sets according to whether the division feature takes the value of the division point. For example, when the division feature is A and the division point is a, the sample set can be divided into two sub-sample sets based on whether feature A equals a or not.
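A minimal sketch of this binary division, assuming each sample is a dictionary from feature name to value (a representation chosen here for illustration; `split_on_value` and the feature name are hypothetical):

```python
def split_on_value(samples, feature, value):
    """Divide samples into D1 (feature == value) and D2 (feature != value)."""
    d1 = [s for s in samples if s[feature] == value]
    d2 = [s for s in samples if s[feature] != value]
    return d1, d2


samples = [{"A": "a"}, {"A": "b"}, {"A": "a"}, {"A": "c"}]
d1, d2 = split_on_value(samples, "A", "a")
print(len(d1), len(d2))
```

Because the test is "equals a" versus "does not equal a", every split produces exactly two subsets regardless of how many distinct values the feature has, which is what keeps the tree binary.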
  • The Gini index information gain of a feature for the target sample set classification may include the Gini index information gain of a value of the feature for the target sample set classification; for example, the Gini index information gain (Gini gain) of the value a of feature A for the classification of target sample set D.
  • The Gini index information gain can be obtained based on the Gini index of the feature value for the sample set classification.
  • The step "obtaining the Gini index information gain of the feature for the target sample set classification" may include:
  • obtaining the Gini index information gain of the feature for the target sample set classification according to the Gini index.
  • The Gini index of a value of the feature for the target sample set classification may be obtained as follows:
  • obtaining the Gini index of the value for the target sample set classification according to the probability of each sample category.
  • The Gini index of the value for the target sample set classification includes: the Gini index for the target sample set when the feature takes the value, and the Gini index for the target sample set when the feature does not take the value.
  • The step "obtaining the Gini index of the value for the target sample set classification according to the probability of each sample category" may include:
  • obtaining, according to the probability of each sample category in the second subsample set, the second Gini index for the target sample set when the feature does not take the value.
  • The step "obtaining the Gini index information gain of the feature for the target sample set classification according to the Gini index" may include:
  • obtaining the Gini index information gain of the feature for the target sample set classification according to the first Gini index, the sample-number ratio of the first subsample set to the target sample set, the second Gini index, and the sample-number ratio of the second subsample set to the target sample set.
  • The Gini index Gini(D1) for sample set D can be obtained according to the probability pk of each sample category (male or female) in D1, and the Gini index Gini(D2) according to the probability pk of each sample category (male or female) in D2.
  • Gini(D1) and Gini(D2) are calculated as shown in the following formula.
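The standard definition of the Gini index, which is consistent with the description above, can be written as follows. Here p_k is the proportion of samples of category k in the subsample set D_i, and K is the number of sample categories (2 in this setting); the symbol K and the index i are introduced for illustration.

```latex
\mathrm{Gini}(D_i) = 1 - \sum_{k=1}^{K} p_k^{2}, \qquad i \in \{1, 2\}
```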
  • Then, based on Gini(D1), the sample-number ratio |D1|/|D| of subsample set D1 to sample set D, Gini(D2), and the sample-number ratio |D2|/|D| of subsample set D2 to sample set D, the Gini index information gain Gini(D, A) of feature A for sample set D when its value is a can be calculated, for example as Gini(D, A) = (|D1|/|D|) · Gini(D1) + (|D2|/|D|) · Gini(D2).
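The calculation can be sketched in a few lines. This assumes the standard Gini index (one minus the sum of squared category proportions) and samples stored as dictionaries; the names `gini`, `gini_gain` and `reads_male_novels` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of labels: 1 - sum of squared category proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(samples, labels, feature, value):
    """Gini(D, A) for A == value: size-weighted Gini index of D1 and D2."""
    d1 = [y for s, y in zip(samples, labels) if s[feature] == value]
    d2 = [y for s, y in zip(samples, labels) if s[feature] != value]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)


samples = [{"reads_male_novels": 1}, {"reads_male_novels": 1},
           {"reads_male_novels": 0}, {"reads_male_novels": 0}]
labels = ["male", "male", "female", "male"]
before = gini(labels)
after = gini_gain(samples, labels, "reads_male_novels", 1)
print(before, after)
```

In this toy data the weighted value after the split is lower than the Gini index of the undivided set, which is why smaller values indicate better division points.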
  • In this way, the Gini index information gain of each feature of the samples for the classification of sample set D can be calculated.
  • For example, consider a sample set D = {sample 1, sample 2, …, sample i, …, sample n}, where each sample (sample 1, …, sample i, …, sample n) includes the features t1, t2, …, tm.
  • each sample feature contains multiple values.
  • the construction process of the classification regression tree is as follows:
  • Calculating the Gini index information gain: for each feature (feature 1, feature 2, …, feature m), the Gini index information gain of each of its possible values for the classification of sample set D is calculated, giving Gini(D, t1), Gini(D, t2), …, Gini(D, tm).
  • The minimum Gini index information gain is then selected; for example, Gini(D, ti) is the minimum. In this case, ti is determined to be the division feature t, and the value t' of ti corresponding to Gini(D, ti) is the division point.
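The selection step can be sketched as an exhaustive search over every (feature, value) pair, keeping the pair with the smallest weighted Gini value. This is an illustrative sketch, not the patent's code: the helper names are hypothetical, and ties are resolved by keeping the first minimum found, a choice the text does not specify.

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 - sum of squared category proportions (0.0 for an empty list)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def gini_gain(samples, labels, feature, value):
    """Size-weighted Gini index after dividing on feature == value."""
    d1 = [y for s, y in zip(samples, labels) if s[feature] == value]
    d2 = [y for s, y in zip(samples, labels) if s[feature] != value]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)


def best_split(samples, labels, features):
    """Return (score, division feature, division point) with the minimum score."""
    best = None
    for f in features:
        for v in sorted({s[f] for s in samples}):
            score = gini_gain(samples, labels, f, v)
            if best is None or score < best[0]:
                best = (score, f, v)
    return best


samples = [{"t1": 1, "t2": 0}, {"t1": 1, "t2": 1},
           {"t1": 0, "t2": 0}, {"t1": 0, "t2": 1}]
labels = ["male", "male", "female", "female"]
score, feature, point = best_split(samples, labels, ["t1", "t2"])
print(score, feature, point)
```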
  • the two child nodes d1 and d2 of the current node d are generated, D1 is assigned to the child node d1, and D2 is assigned to the child node d2.
  • The sub-sample sets corresponding to the child nodes then continue to be classified in the same manner, based on the Gini index information gain.
  • Taking child node d2 as an example, the Gini index information gain Gini(D, t) of each value of each feature in sample set D2 for the sample classification is calculated, the minimum Gini(D, t)min is selected, and the feature and value corresponding to Gini(D, t)min are taken as the division feature t and the division point.
  • D2 may then be divided into sub-sample sets D21 and D22; child nodes d21 and d22 of the current node d2 are generated, and D21 and D22 are assigned to child nodes d21 and d22, respectively.
  • By repeating the above classification based on the Gini index information gain, a classification regression tree as shown in FIG. 4 can be formed, and the outputs of its leaf nodes include "male" or "female".
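The recursive construction walked through above can be sketched as follows. This is a simplified illustration under stated assumptions: samples are dictionaries, the termination condition is "only one category left", candidate divisions that leave one side empty are skipped, and the nested-dictionary tree layout with `yes`/`no` branches is a hypothetical representation.

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 - sum of squared category proportions (0.0 for an empty list)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def build_tree(samples, labels, features):
    # Termination condition: only one category left, so this node becomes a leaf.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    n, best = len(labels), None
    for f in features:
        for v in sorted({s[f] for s in samples}):
            yes = [i for i, s in enumerate(samples) if s[f] == v]
            no = [i for i, s in enumerate(samples) if s[f] != v]
            if not yes or not no:
                continue  # this division does not separate anything
            score = (len(yes) / n * gini([labels[i] for i in yes])
                     + len(no) / n * gini([labels[i] for i in no]))
            if best is None or score < best[0]:
                best = (score, f, v, yes, no)
    if best is None:
        # No usable division: fall back to the most common category.
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, f, v, yes, no = best
    return {
        "feature": f, "value": v,
        "yes": build_tree([samples[i] for i in yes], [labels[i] for i in yes], features),
        "no": build_tree([samples[i] for i in no], [labels[i] for i in no], features),
    }


samples = [{"t1": 1, "t2": 0}, {"t1": 1, "t2": 1},
           {"t1": 0, "t2": 0}, {"t1": 0, "t2": 1}]
labels = ["male", "male", "female", "female"]
tree = build_tree(samples, labels, ["t1", "t2"])
print(tree)
```

Each recursive call works on a strictly smaller sub-sample set, so the recursion always terminates.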
  • corresponding division features and their corresponding division feature values may also be marked on the path between the nodes.
  • the feature values of the corresponding divided features may be marked on the path of the current node and its child nodes.
  • For example, if the feature values of the division feature t include 0 and 1, then 1 may be marked on the path between d2 and d, 0 on the path between d1 and d, and so on; after each division, the corresponding division feature value, such as 0 or 1, is marked on the path between the current node and its child nodes, yielding a classification regression tree as shown in FIG. 5.
  • the prediction time can be set according to requirements, such as the current time.
  • a multi-dimensional feature of an electronic device that is used by an unknown gender user can be collected as a prediction sample at a current time point.
  • The multi-dimensional features collected in steps 201 and 203 are the same features, for example: the number of times and duration for which the user browses male-oriented goods (such as men's clothing) in a shopping application; the number of times and duration for which the user browses female-oriented goods (such as cosmetics and women's clothing) in the shopping application; the length of time the user reads male-oriented novels in a reading application; and the length of time the user reads female-oriented novels in the reading application.
  • The corresponding output result is obtained according to the prediction sample and the classification regression tree model, and the gender of the unknown-gender user is determined according to the output result.
  • The output result is either female or male.
  • The corresponding leaf node may be determined according to the features of the prediction sample and the classification regression tree model, and the output of that leaf node is used as the prediction output result.
  • The current leaf node is determined according to the branch conditions of the classification regression tree (i.e., the feature values of the division features), and the output of that leaf node is taken as the prediction result. Since the output of a leaf node is either female or male, the gender of the user can be determined based on the classification regression tree.
  • the corresponding leaf node may be found as dn1 according to the branch condition of the classification regression tree in the classification regression tree shown in FIG. 5, and the output of the leaf node dn1 is male. At this point, it is determined that the user is a male.
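Finding the leaf node for a prediction sample amounts to following the branch conditions from the root. The sketch below assumes a hypothetical nested-dictionary tree in which each internal node stores a division feature, a division point, and `yes`/`no` children; the feature names are invented for illustration.

```python
def predict(tree, sample):
    """Follow branch conditions from the root until a leaf is reached."""
    node = tree
    while "leaf" not in node:
        branch = "yes" if sample[node["feature"]] == node["value"] else "no"
        node = node[branch]
    return node["leaf"]


# Hand-written example tree; structure and feature names are assumptions.
tree = {
    "feature": "browses_cosmetics", "value": 1,
    "yes": {"leaf": "female"},
    "no": {
        "feature": "reads_male_novels", "value": 1,
        "yes": {"leaf": "male"},
        "no": {"leaf": "female"},
    },
}
result = predict(tree, {"browses_cosmetics": 0, "reads_male_novels": 1})
print(result)
```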
  • In summary, the embodiment of the present application acquires, as samples, the multi-dimensional features of known-gender users using the electronic device and constructs a sample set for gender prediction; classifies the sample set according to the Gini index information gain of each feature for the sample set classification to construct a corresponding classification regression tree model, whose output includes male or female; collects, according to the prediction time, the multi-dimensional features of an unknown-gender user using the electronic device as a prediction sample; and predicts the gender of the unknown-gender user according to the prediction sample and the classification regression tree model.
  • This solution can accurately predict user gender.
  • In addition, since each sample of the sample set includes multiple pieces of feature information reflecting the user's habits in using the electronic device, the embodiment of the present application can make user gender prediction more personalized and intelligent.
  • Furthermore, performing gender prediction based on the classification regression tree prediction model can improve the accuracy of user gender prediction and save resources.
  • the gender prediction method may include:
  • The multi-dimensional features are multi-dimensional user behavior features of a known-gender user, such as a male user or a female user, using an electronic device.
  • For example, they may be the multi-dimensional user behavior features of a known-gender user, such as a male user or a female user, using an electronic device during a historical time period.
  • The multi-dimensional features are gender-specific behavioral features of the user's use of the electronic device.
  • That is, a user exhibits behavioral features with male or female characteristics in the process of using an electronic device.
  • The multi-dimensional feature has a dimension of a certain length, and the parameter in each dimension corresponds to one piece of feature information characterizing the user's use of the electronic device; that is, the multi-dimensional feature is composed of multiple features.
  • The plurality of features may include behavior features of the user using applications on the electronic device, for example: the number of times and duration for which the user browses male-oriented goods (such as men's clothing) in a shopping application; the number of times and duration for which the user browses female-oriented goods (such as cosmetics and women's clothing) in the shopping application; the length of time the user reads male-oriented novels in a reading application; and the length of time the user reads female-oriented novels in the reading application.
  • the multi-dimensional feature may also include relevant behavior characteristic information of the user using the electronic device itself, such as the number of times the user uses the electronic device front camera, the number of times the user uses the rear camera, and the like.
  • the sample set for gender prediction may include a plurality of samples, each sample including a multi-dimensional feature of a known-gender user using an electronic device.
  • the sample set for gender prediction may include multiple samples collected at a preset frequency during the historical time period.
  • the historical time period may be, for example, the past 7 days or 10 days; the preset frequency may be, for example, once every 10 minutes or once every half hour. It can be understood that the multi-dimensional feature data acquired at one time constitutes one sample, and multiple samples constitute a sample set.
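As a rough illustration of collecting samples at a preset frequency over a historical time period, the sketch below enumerates the collection times for the example values mentioned above (the past 7 days, once every 10 minutes). The function name `sampling_times` and the end date are hypothetical and chosen only for this example.

```python
from datetime import datetime, timedelta
from typing import List


def sampling_times(end: datetime, period: timedelta, interval: timedelta) -> List[datetime]:
    """Return the collection times in (end - period, end], oldest first."""
    count = period // interval  # dividing two timedeltas with // yields an int
    return [end - k * interval for k in range(count - 1, -1, -1)]


# Hypothetical end of the historical time period.
times = sampling_times(datetime(2017, 12, 22), timedelta(days=7), timedelta(minutes=10))
print(len(times))  # 1008 collection times, i.e. 7 * 24 * 6
```

Each collection time would yield one sample, so this configuration produces a sample set of 1008 samples per user.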
  • the multi-dimensional features of each known-gender user using his or her electronic device may be collected by the server, and the electronic device may then obtain them from the server when performing gender prediction.
  • a known-gender user is a user who provides gender information when using an electronic device, for example a user who provides gender information when registering an account.
  • a specific sample may be as shown in Table 1 below, and includes feature information of multiple dimensions. It should be noted that the feature information shown in Table 1 is only an example. In practice, the number of pieces of feature information included in one sample may be more or fewer than shown in Table 1, and the specific feature information may differ from that shown in Table 1; this is not specifically limited herein.
  • Table 1 (Dimension: Feature information)
  • 1: the number and duration of browsing of male-type items (such as men's clothing) in the shopping app
  • 2: …
  • 3: the length of time users read a male novel
  • 4: the length of time users read a female novel
  • 5: the length of time users read sports news
  • 6: the length of time users read the news of the constellation
  • 7: …
  • 8: the number of times users use beauty software
  • 9: the number and duration of users playing different categories of games
  • the sample labels include male and female.
  • the label of a sample characterizes the sample category of that sample.
  • the sample categories may include male and female.
  • the value "1" may be used to mean "male"
  • and the value "0" may be used to mean "female", or vice versa.
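A minimal sketch of how a labeled sample might be represented, assuming the encoding 1 = male and 0 = female described above. The class name `Sample`, the helper `label_name`, and the feature names are hypothetical and only illustrate the idea; they do not appear in the application.

```python
from dataclasses import dataclass
from typing import Dict

# Assumed encoding; the text notes that the opposite assignment is equally valid.
MALE, FEMALE = 1, 0


@dataclass
class Sample:
    """One observation of a known-gender user's device usage."""
    features: Dict[str, float]  # feature name -> observed value (names are illustrative)
    label: int                  # MALE or FEMALE


def label_name(label: int) -> str:
    """Map the numeric label back to its sample category."""
    return "male" if label == MALE else "female"


sample_set = [
    Sample({"sports_news_minutes": 42.0, "beauty_app_uses": 0.0}, MALE),
    Sample({"sports_news_minutes": 3.0, "beauty_app_uses": 5.0}, FEMALE),
]
print([label_name(s.label) for s in sample_set])  # ['male', 'female']
```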
  • the root node d of the classification regression tree model can be generated, and the sample set D is assigned to the root node d.
  • the sample set of the root node is determined as the target sample set to be classified currently.
  • for each feature, such as the feature t1, the feature t2, ..., the feature tm, the Gini index information gain for the sample set classification, Gini(D, t1), Gini(D, t2), ..., Gini(D, tm), can be calculated; the smallest information gain Gini(D, t)min is then selected.
  • the Gini index information gain of the feature for the sample set classification can be obtained as follows:
  • obtaining the Gini index information gain of the feature value for the target sample set classification according to the first Gini index, the sample number ratio of the first subsample set to the target sample set, the second Gini index, and the sample number ratio of the second subsample set to the target sample set.
  • for example, when the feature A takes the value a, the sample set D is divided into the subsample set D1 (A is a) and the subsample set D2 (A is not a). The Gini index Gini(D1) is obtained according to the probability pk of each sample category (male or female) in D1, and the Gini index Gini(D2) is obtained according to the probability pk of each sample category (male or female) in D2.
  • Gini(D1) and Gini(D2) are calculated as shown in the following formula, where K is the number of sample categories (here K = 2):
    $$Gini(D_i) = 1 - \sum_{k=1}^{K} p_k^2, \quad i \in \{1, 2\}$$
  • then, based on Gini(D1), the sample number ratio |D1|/|D| of the subsample set D1 to the sample set D, Gini(D2), and the sample number ratio |D2|/|D| of the subsample set D2 to the sample set D,
  • the Gini index information gain Gini(D, A) of the feature A for the sample set D when A takes the value a can be calculated. For example, use the following formula:
    $$Gini(D, A) = \frac{|D_1|}{|D|}\,Gini(D_1) + \frac{|D_2|}{|D|}\,Gini(D_2)$$
  • in this way, the Gini index information gain of each feature for the classification of the sample set D can be calculated.
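The calculation described above can be sketched as follows. This is a minimal illustration using the standard definition of the Gini index (one minus the sum of squared category probabilities) and the size-weighted sum over the two subsample sets; the function names `gini` and `gini_split` and the toy data are assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini(labels: Sequence[Hashable]) -> float:
    """Gini index of a label collection: 1 - sum over categories of p_k squared."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(values: Sequence[Hashable], labels: Sequence[Hashable], a: Hashable) -> float:
    """Weighted Gini index of splitting D into D1 (value == a) and D2 (value != a)."""
    d1 = [y for v, y in zip(values, labels) if v == a]
    d2 = [y for v, y in zip(values, labels) if v != a]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)


# Toy data: one feature observed for four known-gender users.
values = ["high", "high", "low", "low"]
labels = ["male", "male", "female", "male"]

print(gini(labels))                      # 0.375
print(gini_split(values, labels, "high"))  # 0.25
```

In the toy data, D1 (value "high") contains only male samples and has a Gini index of 0, while D2 is evenly mixed and has a Gini index of 0.5, so the weighted result is 0.5 × 0 + 0.5 × 0.5 = 0.25, lower than the 0.375 of the undivided set.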
  • the feature corresponding to the smallest information gain and its corresponding value are used as the dividing feature and the dividing point.
  • the feature ti may be selected as the division feature, and the value t' corresponding to ti is the division point.
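Selecting the dividing feature and dividing point can be sketched as an exhaustive search over every feature and every value it takes, keeping the pair with the smallest weighted Gini index. The function name `best_split`, the dictionary-based sample layout, and the feature names are hypothetical; ties are resolved in favor of the first candidate found, which is a choice made for this sketch rather than something the text specifies.

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 - sum over categories of p_k squared."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature, value, score) with the smallest weighted Gini index.

    rows is a list of dicts mapping feature name -> value. Returns None when
    no candidate divides the samples into two non-empty subsample sets.
    """
    best = None
    n = len(rows)
    for feature in rows[0]:
        for a in sorted({row[feature] for row in rows}):
            d1 = [y for row, y in zip(rows, labels) if row[feature] == a]
            d2 = [y for row, y in zip(rows, labels) if row[feature] != a]
            if not d1 or not d2:
                continue  # this value does not actually divide the set
            score = len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
            if best is None or score < best[2]:
                best = (feature, a, score)
    return best


rows = [
    {"sports_news_reading": "high", "beauty_app_usage": "low"},
    {"sports_news_reading": "high", "beauty_app_usage": "low"},
    {"sports_news_reading": "low", "beauty_app_usage": "high"},
    {"sports_news_reading": "high", "beauty_app_usage": "high"},
]
labels = ["male", "male", "female", "female"]

print(best_split(rows, labels))  # ('beauty_app_usage', 'high', 0.0)
```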
  • the samples may be divided according to whether the dividing feature takes the dividing-point value (yes or no), so that the target sample set is divided into two subsample sets.
  • one subsample set corresponds to one child node.
  • for example, referring to FIG. 3, child nodes d1 and d2 of the root node d are generated, and the subsample set D1 is assigned to the child node d1 and the subsample set D2 to the child node d2.
  • the dividing feature value corresponding to each child node may also be set on the path between the child node and the current node to facilitate subsequent gender prediction; refer to FIG. 5.
  • 309. Determine whether the subsample set of the child node meets the preset classification termination condition. If not, perform step 310; if so, perform step 311.
  • the preset classification termination condition may be set according to actual requirements. When the child node satisfies the preset classification termination condition, the current child node is used as a leaf node and classification of the sample set corresponding to the child node stops; when the child node does not satisfy the preset classification termination condition, classification of the sample set corresponding to the child node continues.
  • the preset classification termination condition may include: the number of categories of the samples in the removed subsample set of the child node is a preset number.
  • for example, the preset classification termination condition may include: the number of categories of the samples in the removed subsample set corresponding to the child node is 1, that is, the samples in the sample set of the child node belong to only one category.
  • the child node is used as a leaf node, and the output of the leaf node is set according to the sample category of the subsample set of the child node.
  • for example, when the preset classification termination condition is that the number of categories of the samples in the removed subsample set corresponding to the child node is 1,
  • the category of the samples in the subsample set is taken as the output of the leaf node. If the removed subsample set contains only samples of the category "female", then "female" can be used as the output of the leaf node.
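The recursive construction with the termination condition "only one sample category remains" can be sketched as below. This is an illustrative reading of the steps above, not the applicant's implementation: the nested-dictionary tree layout, the function name `build_tree`, and the majority-vote fallback when no split is possible are assumptions, and the sketch does not remove the dividing feature from the subsample sets.

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 - sum over categories of p_k squared."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels):
    """Recursively build a binary tree from a non-empty sample set.

    A leaf is {"output": category}; an internal node stores the dividing
    feature, the dividing point, and one subtree per answer ("yes"/"no").
    """
    # Termination condition: the subsample set holds exactly one category.
    if len(set(labels)) == 1:
        return {"output": labels[0]}

    best = None
    for feature in rows[0]:
        for a in sorted({row[feature] for row in rows}):
            yes = [i for i, row in enumerate(rows) if row[feature] == a]
            no = [i for i, row in enumerate(rows) if row[feature] != a]
            if not yes or not no:
                continue
            score = sum(len(part) / len(rows) * gini([labels[i] for i in part])
                        for part in (yes, no))
            if best is None or score < best[0]:
                best = (score, feature, a, yes, no)

    if best is None:
        # Assumption (not in the source): with no usable split, output the majority category.
        return {"output": Counter(labels).most_common(1)[0][0]}

    _, feature, a, yes, no = best
    return {
        "feature": feature,
        "value": a,
        "yes": build_tree([rows[i] for i in yes], [labels[i] for i in yes]),
        "no": build_tree([rows[i] for i in no], [labels[i] for i in no]),
    }


rows = [
    {"sports_news_reading": "high", "beauty_app_usage": "low"},
    {"sports_news_reading": "high", "beauty_app_usage": "low"},
    {"sports_news_reading": "low", "beauty_app_usage": "high"},
    {"sports_news_reading": "high", "beauty_app_usage": "high"},
]
labels = ["male", "male", "female", "female"]
tree = build_tree(rows, labels)
```

Because every accepted split produces two non-empty, strictly smaller subsample sets, the recursion always terminates.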
  • the sample categories include women and men.
  • the prediction time, that is, the time at which the gender needs to be predicted, may be the current time or another time.
  • the corresponding leaf node may be determined according to the features of the prediction sample and the classification regression tree model, and the output of the leaf node is used as the prediction result.
  • the current leaf node is determined according to the branch conditions of the classification regression tree (that is, the feature values of the dividing features), and the output of the leaf node is taken as the prediction result. Since the output of a leaf node is either male or female, the user's gender can be determined based on the classification regression tree.
  • for example, in the classification regression tree shown in FIG. 5, the corresponding leaf node can be found to be an2 according to the branch conditions of the tree, and the output of the leaf node an2 is female. It can then be determined that the user is female.
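Prediction by following branch conditions down to a leaf can be sketched as below. The nested-dictionary tree layout, the function name `predict`, and the hand-written example tree are hypothetical; the tree is not the one shown in FIG. 5.

```python
def predict(tree, sample):
    """Walk from the root to a leaf and return the leaf's output.

    At each internal node, take the "yes" branch when the sample's value for
    the dividing feature equals the dividing point, otherwise the "no" branch.
    """
    node = tree
    while "output" not in node:
        branch = "yes" if sample[node["feature"]] == node["value"] else "no"
        node = node[branch]
    return node["output"]


# Hand-written two-level tree used only for illustration.
tree = {
    "feature": "beauty_app_usage",
    "value": "high",
    "yes": {"output": "female"},
    "no": {
        "feature": "sports_news_reading",
        "value": "high",
        "yes": {"output": "male"},
        "no": {"output": "female"},
    },
}

prediction_sample = {"beauty_app_usage": "low", "sports_news_reading": "high"}
print(predict(tree, prediction_sample))  # male
```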
  • as can be seen from the above, the embodiment of the present application acquires the multi-dimensional feature of a known-gender user using the electronic device as a sample, and constructs a sample set for gender prediction; classifies the sample set according to the Gini index information gain of the feature for the sample set classification to construct a corresponding classification regression tree model, the output of which includes male or female; collects, according to the prediction time, the multi-dimensional feature of an unknown-gender user using the electronic device as a prediction sample; and predicts the gender of the unknown-gender user according to the prediction sample and the classification regression tree model.
  • this solution can accurately predict the user's gender.
  • because each sample of the sample set includes multiple pieces of feature information reflecting the user's habits in using the electronic device,
  • the embodiment of the present application can make user gender prediction more personalized and intelligent.
  • gender prediction based on the classification regression tree model can improve the accuracy of user gender prediction and save resources.
  • the embodiment of the present application further provides a gender prediction apparatus, including:
  • a sample construction unit configured to acquire a multi-dimensional feature of a known-gender user using the electronic device as a sample, and construct a sample set for gender prediction;
  • a classification unit configured to classify the sample set according to the Gini index information gain of the feature for the sample set classification to construct a corresponding classification regression tree model, where the output of the classification regression tree model includes male or female;
  • an acquisition unit configured to collect, according to a prediction time, a multi-dimensional feature of an unknown-gender user using the electronic device as a prediction sample;
  • a prediction unit configured to predict the gender of the unknown-gender user according to the prediction sample and the classification regression tree model.
  • the classification unit comprises:
  • a node generating subunit configured to generate a root node of the classification regression tree model, and assign the sample set to the root node, and determine a sample set of the root node as a target sample set to be classified currently;
  • a gain acquisition subunit configured to obtain the Gini index information gain of the feature for the target sample set classification;
  • a dividing feature determining subunit configured to select a current dividing feature and its corresponding dividing point from the features according to the Gini index information gain;
  • a classifying subunit configured to divide the sample set according to the dividing feature and the dividing point, to obtain two subsample sets;
  • a child node generating subunit configured to generate child nodes of the current node, and assign the subsample sets to the corresponding child nodes;
  • a determining subunit configured to determine whether the child node satisfies a preset classification termination condition; if not, update the target sample set to the subsample set and trigger the gain acquisition subunit to perform the step of obtaining the Gini index information gain of the feature for the target sample set classification; if so, use the child node as a leaf node and set the output of the leaf node according to the sample category of the samples in the subsample set, the sample category including male or female.
  • the gain acquisition subunit is configured to:
  • obtain the Gini index information gain of the feature for the target sample set classification.
  • the gain acquisition subunit is configured to:
  • obtain the Gini index of the feature value for the target sample set classification according to the probability of each sample category.
  • the dividing feature determining subunit is configured to:
  • take the feature corresponding to the target Gini index information gain, and its value, as the dividing feature and the dividing point, respectively.
  • the gain acquisition subunit is configured to:
  • obtain the Gini index information gain of the feature value for the target sample set classification according to the first Gini index, the sample number ratio of the first subsample set to the target sample set, the second Gini index, and the sample number ratio of the second subsample set to the target sample set.
  • the gain acquisition subunit is configured to:
  • calculate the Gini index information gain of the feature for the target sample set classification by the following formula:
    $$Gini(D, A) = \frac{|D_1|}{|D|}\,Gini(D_1) + \frac{|D_2|}{|D|}\,Gini(D_2)$$
  • where Gini(D, A) is the Gini index information gain of the feature A for the target sample set D,
  • Gini(D_1) is the Gini index of the subsample set D_1 of the target sample set D in which the feature A takes the value a,
  • Gini(D_2) is the Gini index of the subsample set D_2 of the target sample set D in which the feature A does not take the value a,
  • and a is a value of the feature A.
  • the determining subunit is configured to determine whether the number of categories of the samples in the removed subsample set corresponding to the child node is a preset number; if so, determine that the child node meets the preset classification termination condition.
  • a gender prediction device is also provided in an embodiment. Please refer to FIG. 7.
  • FIG. 7 is a schematic structural diagram of a gender prediction apparatus according to an embodiment of the present application. The gender prediction apparatus is applied to an electronic device and includes a sample construction unit 401, a classification unit 402, an acquisition unit 403, and a prediction unit 404, as follows:
  • the sample construction unit 401 is configured to acquire a multi-dimensional feature of a known-gender user using the electronic device as a sample, and construct a sample set for gender prediction;
  • the classification unit 402 is configured to classify the sample set according to the Gini index information gain of the feature for the sample set classification to construct a corresponding classification regression tree model, the output of which includes male or female;
  • the acquisition unit 403 is configured to collect, according to the prediction time, a multi-dimensional feature of an unknown-gender user using the electronic device as a prediction sample;
  • the prediction unit 404 is configured to predict the gender of the unknown-gender user according to the prediction sample and the classification regression tree model.
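One way to picture how the four units cooperate is a small container that wires them together. This is only a structural sketch: the class name `GenderPredictionApparatus`, its field names, and the stand-in lambdas (in particular the lookup table used in place of a real classification regression tree) are hypothetical and serve only to show the data flow between units 401 to 404.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Features = Dict[str, str]
LabeledSample = Tuple[Features, str]


@dataclass
class GenderPredictionApparatus:
    build_samples: Callable[[], List[LabeledSample]]   # sample construction unit 401
    classify: Callable[[List[LabeledSample]], object]  # classification unit 402
    collect: Callable[[], Features]                    # acquisition unit 403
    predict: Callable[[object, Features], str]         # prediction unit 404

    def run(self) -> str:
        """Build the model from known-gender samples, then predict for a new sample."""
        model = self.classify(self.build_samples())
        return self.predict(model, self.collect())


# Stand-in units: the "model" is a plain lookup table, not a real tree.
apparatus = GenderPredictionApparatus(
    build_samples=lambda: [({"beauty_app_usage": "low"}, "male"),
                           ({"beauty_app_usage": "high"}, "female")],
    classify=lambda samples: {f["beauty_app_usage"]: y for f, y in samples},
    collect=lambda: {"beauty_app_usage": "high"},
    predict=lambda model, features: model[features["beauty_app_usage"]],
)
print(apparatus.run())  # female
```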
  • the classification unit 402 may include:
  • a node generating sub-unit 4021 configured to generate a root node of the classification regression tree model, and allocate the sample set to the root node, and determine a sample set of the root node as a target sample set to be classified currently;
  • a gain acquisition sub-unit 4022 configured to obtain the Gini index information gain of the feature for the target sample set classification;
  • a dividing feature determining sub-unit 4023 configured to select a current dividing feature and its corresponding dividing point from the features according to the Gini index information gain;
  • a classification sub-unit 4024 configured to divide the sample set according to the dividing feature and the dividing point, to obtain two subsample sets;
  • a child node generating sub-unit 4025 configured to remove the dividing feature from the samples in the subsample set to obtain a removed subsample set, generate a child node of the current node, and use the removed subsample set as the node information of the child node;
  • a determining sub-unit 4026 configured to determine whether the child node satisfies a preset classification termination condition; if not, update the target sample set to the subsample set and trigger the gain acquisition sub-unit 4022 to perform the step of obtaining the Gini index information gain of the feature for the target sample set classification; if so, use the child node as a leaf node and set the output of the leaf node according to the sample category of the samples in the subsample set, the sample category including male or female.
  • the gain acquisition sub-unit 4022 can be used to:
  • obtain the Gini index information gain of the feature for the target sample set classification.
  • the gain acquisition sub-unit 4022 can be used to:
  • obtain the Gini index of the feature value for the target sample set classification according to the probability of each sample category.
  • the gain acquisition sub-unit 4022 can be used to:
  • obtain the Gini index information gain of the feature value for the target sample set classification according to the first Gini index, the sample number ratio of the first subsample set to the target sample set, the second Gini index, and the sample number ratio of the second subsample set to the target sample set.
  • the dividing feature determining sub-unit 4023 can be used to:
  • take the feature corresponding to the target Gini index information gain, and its value, as the dividing feature and the dividing point, respectively.
  • the gain acquisition sub-unit 4022 is configured to:
  • calculate the Gini index information gain of the feature for the target sample set classification by the following formula:
    $$Gini(D, A) = \frac{|D_1|}{|D|}\,Gini(D_1) + \frac{|D_2|}{|D|}\,Gini(D_2)$$
  • where Gini(D, A) is the Gini index information gain of the feature A for the target sample set D,
  • Gini(D_1) is the Gini index of the subsample set D_1 of the target sample set D in which the feature A takes the value a,
  • Gini(D_2) is the Gini index of the subsample set D_2 of the target sample set D in which the feature A does not take the value a,
  • and a is a value of the feature A.
  • the determining sub-unit 4026 is configured to determine whether the number of categories of the samples in the removed subsample set corresponding to the child node is a preset number; if so, determine that the child node meets the preset classification termination condition.
  • the steps performed by each unit in the gender prediction apparatus may refer to the method steps described in the foregoing method embodiments.
  • the gender prediction device can be integrated in an electronic device such as a mobile phone, a tablet, or the like.
  • as used herein, the terms "module" and "unit" may be taken to mean software objects that are executed on the computing system.
  • the different components, modules, engines, and services described herein can be considered as implementation objects on the computing system.
  • the apparatus and method described herein may be implemented in software, and may of course be implemented in hardware, all of which are within the scope of the present application.
  • the foregoing various units may be implemented as an independent entity, and may be implemented in any combination, and may be implemented as the same entity or a plurality of entities.
  • for the specific implementation of the foregoing units, refer to the foregoing embodiments; details are not described herein again.
  • in the gender prediction apparatus of the present embodiment, the sample construction unit 401 acquires the multi-dimensional feature of a known-gender user using the electronic device as a sample and constructs a sample set for gender prediction; the classification unit 402 classifies the sample set according to the Gini index information gain of the feature for the sample set classification to construct a corresponding classification regression tree model, the output of which includes male or female; the acquisition unit 403 collects, according to the prediction time, the multi-dimensional feature of an unknown-gender user using the electronic device as a prediction sample; and the prediction unit 404 predicts the gender of the unknown-gender user according to the prediction sample and the classification regression tree model.
  • this solution can accurately predict the user's gender.
  • the electronic device 500 includes a processor 501 and a memory 502.
  • the processor 501 is electrically connected to the memory 502.
  • the processor 501 is the control center of the electronic device 500. It connects the various parts of the entire electronic device using various interfaces and lines, and performs the various functions of the electronic device 500 and processes data by running or loading computer programs stored in the memory 502 and calling data stored in the memory 502, thereby monitoring the electronic device 500 as a whole.
  • the memory 502 can be used to store software programs and modules, and the processor 501 executes various functional applications and data processing by running computer programs and modules stored in the memory 502.
  • the memory 502 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, a computer program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created through use of the electronic device, and the like.
  • memory 502 can include high speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, memory 502 can also include a memory controller to provide processor 501 access to memory 502.
  • the processor 501 in the electronic device 500 loads the instructions corresponding to the processes of one or more computer programs into the memory 502, and runs the computer programs stored in the memory 502 to implement various functions, as follows:
  • when classifying the sample set according to the Gini index information gain of the feature for the sample set classification to construct a corresponding classification regression tree model, the processor 501 may specifically perform the following steps:
  • the child node is used as a leaf node, and an output of the leaf node is set according to a sample category of the sample in the subset of samples, the sample category includes a male or a female.
  • when obtaining the Gini index information gain of the feature for the target sample set classification, the processor 501 may specifically perform the following steps:
  • obtain the Gini index information gain of the feature for the target sample set classification.
  • when obtaining the Gini index of the feature value for the target sample set classification, the processor 501 may specifically perform the following steps:
  • obtain the Gini index of the feature value for the target sample set classification according to the probability of each sample category.
  • the processor 501 may further perform the following steps:
  • the processor 501 may specifically perform the following steps:
  • obtain the Gini index information gain of the feature value for the target sample set classification according to the first Gini index, the sample number ratio of the first subsample set to the target sample set, the second Gini index, and the sample number ratio of the second subsample set to the target sample set.
  • the processor 501 may specifically perform the following steps:
  • take the feature corresponding to the target Gini index information gain, and its value, as the dividing feature and the dividing point, respectively.
  • as can be seen from the above, the electronic device in the embodiment of the present application acquires the multi-dimensional feature of a known-gender user using the electronic device as a sample, and constructs a sample set for gender prediction; classifies the sample set according to the Gini index information gain of the feature for the sample set classification to construct a corresponding classification regression tree model, the output of which includes male or female; collects, according to the prediction time, the multi-dimensional feature of an unknown-gender user using the electronic device as a prediction sample; and predicts the gender of the unknown-gender user according to the prediction sample and the classification regression tree model.
  • this solution can accurately predict the user's gender.
  • the electronic device 500 may further include: a display 503, a radio frequency circuit 504, an audio circuit 505, and a power source 506.
  • the display 503, the radio frequency circuit 504, the audio circuit 505, and the power source 506 are electrically connected to the processor 501, respectively.
  • the display 503 can be used to display information entered by a user or information provided to a user, as well as various graphical user interfaces, which can be composed of graphics, text, icons, video, and any combination thereof.
  • the display 503 can include a display panel.
  • the display panel can be configured in the form of a liquid crystal display (LCD) or an organic light-emitting diode (OLED).
  • the radio frequency circuit 504 can be used to transmit and receive radio frequency signals, so as to establish wireless communication with a network device or another electronic device and exchange signals with it.
  • the audio circuit 505 can be used to provide an audio interface between a user and an electronic device through a speaker or a microphone.
  • the power source 506 can be used to power various components of the electronic device 500.
  • the power source 506 can be logically coupled to the processor 501 through a power management system to enable functions such as managing charging, discharging, and power management through the power management system.
  • the electronic device 500 may further include a camera, a Bluetooth module, and the like, and details are not described herein.
  • the embodiment of the present application further provides a storage medium that stores a computer program which, when run on a computer, causes the computer to perform the gender prediction method in any of the above embodiments, for example: acquiring the multi-dimensional feature of a known-gender user using an electronic device as a sample, and constructing a sample set for gender prediction; classifying the sample set according to the Gini index information gain of the feature for the sample set classification to construct a corresponding classification regression tree model, the output of which includes male or female; collecting, according to the prediction time, the multi-dimensional feature of an unknown-gender user using the electronic device as a prediction sample; and predicting the gender of the unknown-gender user according to the prediction sample and the classification regression tree model.
  • the storage medium may be a magnetic disk, an optical disk, a read only memory (ROM), or a random access memory (RAM).
  • the computer program may be stored in a computer readable storage medium, such as a memory of the electronic device, and executed by at least one processor in the electronic device; its execution may include, for example, the flow of an embodiment of the gender prediction method.
  • the storage medium may be a magnetic disk, an optical disk, a read only memory, a random access memory, or the like.
  • each functional module may be integrated into one processing chip, or each module may exist physically separately, or two or more modules may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • the integrated module, if implemented in the form of a software functional module and sold or used as a standalone product, may also be stored in a computer readable storage medium, such as a read only memory, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a gender prediction method and apparatus, a storage medium and an electronic device. The solution comprises: constructing sample sets for gender prediction; classifying the sample sets according to a Gini index information gain of features for the classification of the sample sets, to construct a corresponding classification regression tree model; acquiring, according to a prediction time, multi-dimensional features of an unknown-gender user using an electronic device as a prediction sample; and predicting the gender of the unknown-gender user according to the prediction sample and the classification regression tree model.

Description

Gender prediction method, apparatus, storage medium and electronic device
This application claims priority to Chinese Patent Application No. 201711407326.6, filed with the Chinese Patent Office on December 22, 2017 and entitled "Gender prediction method, apparatus, storage medium and electronic device", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of communications technologies, and in particular, to a gender prediction method, apparatus, storage medium, and electronic device.
Background
At present, electronic devices such as smartphones usually have multiple applications running at the same time, with one application running in the foreground and the others running in the background.
Current electronic devices are already highly intelligent and can implement many functions. In some scenarios, however, users have further requirements for electronic devices, such as predicting the user's gender.
Summary of the invention
The embodiments of the present application provide a gender prediction method, apparatus, storage medium, and electronic device that can predict a user's gender.
In a first aspect, an embodiment of the present application provides a gender prediction method, including:
acquiring a multi-dimensional feature of a known-gender user using an electronic device as a sample, and constructing a sample set for gender prediction;
classifying the sample set according to the Gini index information gain of the feature for the sample set classification to construct a corresponding classification regression tree model, the output of which includes male or female;
collecting, according to a prediction time, a multi-dimensional feature of an unknown-gender user using an electronic device as a prediction sample;
predicting the gender of the unknown-gender user according to the prediction sample and the classification regression tree model.
第二方面,本申请实施例了提供了的一种性别预测装置,包括:In a second aspect, an embodiment of the present application provides a gender prediction apparatus, including:
样本构建单元,用于获取已知性别用户使用电子设备的多维特征作为样本,并构建性别预测的样本集;a sample construction unit for acquiring a multidimensional feature of a known gender user using the electronic device as a sample, and constructing a sample set of gender prediction;
分类单元,用于根据所述特征对于样本集分类的基尼指数信息增益对所述样本集进行分类,以构建出相应的分类回归树模型,所述分类回归树模型的输出包括男性、或者女性;a classification unit, configured to classify the sample set according to the Gini index information gain of the feature classification for the sample set to construct a corresponding classification regression tree model, where the output of the classification regression tree model includes a male or a female;
采集单元,用于根据预测时间采集未知性别用户使用电子设备的多维特征作为预测样本;An acquisition unit, configured to collect, according to a predicted time, a multi-dimensional feature of an electronic device that is used by an unknown gender user as a prediction sample;
预测单元,用于根据所述预测样本和所述分类回归树模型预测所述未知性别用户的性别。And a prediction unit, configured to predict, according to the predicted sample and the classified regression tree model, the gender of the unknown gender user.
第三方面,本申请实施例提供的存储介质,其上存储有计算机程序,当所述计算机程序在计算机上运行时,使得所述计算机执行如本申请任一实施例提供的性别预测方法。In a third aspect, a storage medium provided by an embodiment of the present application has a computer program stored thereon, and when the computer program is run on a computer, the computer is caused to perform a gender prediction method according to any embodiment of the present application.
第四方面,本申请实施例提供的电子设备,包括处理器和存储器,所述存储器有计算机程序,其特征在于,所述处理器通过调用所述计算机程序,用于执行如本申请任一实施例提供的性别预测方法。In a fourth aspect, an electronic device provided by an embodiment of the present application includes a processor and a memory, where the memory has a computer program, wherein the processor is configured to execute any implementation of the present application by calling the computer program. The gender prediction method provided in the example.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the gender prediction method provided by an embodiment of the present application.
FIG. 2 is a schematic flowchart of the gender prediction method provided by an embodiment of the present application.
FIG. 3 is a schematic diagram of a classification regression tree provided by an embodiment of the present application.
FIG. 4 is a schematic diagram of another classification regression tree provided by an embodiment of the present application.
FIG. 5 is a schematic diagram of still another classification regression tree provided by an embodiment of the present application.
FIG. 6 is another schematic flowchart of the gender prediction method provided by an embodiment of the present application.
FIG. 7 is a schematic structural diagram of the gender prediction apparatus provided by an embodiment of the present application.
FIG. 8 is another schematic structural diagram of the gender prediction apparatus provided by an embodiment of the present application.
FIG. 9 is a schematic structural diagram of the electronic device provided by an embodiment of the present application.
FIG. 10 is another schematic structural diagram of the electronic device provided by an embodiment of the present application.
Detailed Description
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. Appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
An embodiment of the present application provides a gender prediction method, including:
acquiring multi-dimensional features of known-gender users using electronic devices as samples, and constructing a sample set for gender prediction;
classifying the sample set according to the Gini index information gain of the features with respect to classification of the sample set, to construct a corresponding classification regression tree model, where the output of the classification regression tree model includes male or female;
collecting, according to a prediction time, multi-dimensional features of an unknown-gender user using an electronic device as a prediction sample; and
predicting the gender of the unknown-gender user according to the prediction sample and the classification regression tree model.
In some embodiments, dividing the sample set according to the Gini index information gain of the features for the sample set, to construct a corresponding classification regression tree model, includes:
generating a root node of the classification regression tree model, and assigning the sample set to the root node;
determining the sample set of the root node as the target sample set currently to be classified;
acquiring the Gini index information gain of the features for classification of the target sample set;
selecting the current division feature and its corresponding division point from the features according to the Gini index information gain;
dividing the sample set according to the division feature and the division point, to obtain two sub-sample sets;
generating child nodes of the current node, and assigning the sub-sample sets to the corresponding child nodes;
judging whether a child node satisfies a preset classification termination condition;
if not, updating the target sample set to the sub-sample set, and returning to the step of acquiring the Gini index information gain of the features for classification of the target sample set;
if so, taking the child node as a leaf node, and setting the output of the leaf node according to the sample categories of the samples in the sub-sample set, where the sample categories include male or female.
In some embodiments, acquiring the Gini index information gain of the features for classification of the target sample set includes:
acquiring the Gini index of a value of the feature for classification of the target sample set;
acquiring, according to the Gini index, the Gini index information gain of the value of the feature for classification of the target sample set.
In some embodiments, acquiring the Gini index of a value of the feature for classification of the target sample set includes:
dividing the target sample set into a first sub-sample set and a second sub-sample set according to the value of the feature;
acquiring the probabilities of the sample categories in the first sub-sample set and in the second sub-sample set;
acquiring, according to the probabilities of the sample categories, the Gini index of the value for classification of the target sample set.
In some embodiments, acquiring, according to the probabilities of the sample categories, the Gini index of the value for classification of the target sample set includes:
acquiring, according to the probabilities of the sample categories in the first sub-sample set, a first Gini index for classification of the target sample set when the feature takes the value;
acquiring, according to the probabilities of the sample categories in the second sub-sample set, a second Gini index for classification of the target sample set when the feature does not take the value;
and acquiring, according to the Gini index, the Gini index information gain of the feature for classification of the target sample set includes:
acquiring the Gini index information gain of the value of the feature for classification of the target sample set according to the first Gini index, the ratio of the number of samples in the first sub-sample set to that in the target sample set, the second Gini index, and the ratio of the number of samples in the second sub-sample set to that in the target sample set.
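The per-value computation described above can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: the text does not state how a Gini index is derived from category probabilities, so the sketch assumes the standard CART definition, 1 minus the sum of squared category probabilities. The function names `gini_index` and `split_by_value`, and the dictionary representation of a sample, are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a sample set (assumed standard CART form):
    1 minus the sum of squared category probabilities."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def split_by_value(samples, labels, feature, value):
    """Split the labels of the target sample set into the first sub-sample
    set (feature == value) and the second sub-sample set (feature != value)."""
    first = [y for x, y in zip(samples, labels) if x[feature] == value]
    second = [y for x, y in zip(samples, labels) if x[feature] != value]
    return first, second
```

A set containing a single category has a Gini index of 0, and a set split evenly between male and female has a Gini index of 0.5.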
In some embodiments, acquiring, according to the Gini index, the Gini index information gain of the feature for classification of the target sample set includes:
calculating the Gini index information gain of the feature for classification of the target sample set by the following formula:
$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$
where Gini(D, A) is the Gini index information gain of feature A for classification of the target sample set D; Gini(D1) is the Gini index for classification of the target sample set D when feature A takes the value a; Gini(D2) is the Gini index for classification of the target sample set D when feature A does not take the value a; a is one value of feature A; and D1 and D2 are the two sub-sample sets obtained by dividing the target sample set D based on A = a.
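As a worked illustration, the sketch below combines the two sub-sample-set Gini indices using the sample-count ratios |D1|/|D| and |D2|/|D| as weights, which is the combination the text describes. It assumes the standard CART Gini index for each sub-sample set; `gini_index` and `gini_gain` are hypothetical names.

```python
from collections import Counter


def gini_index(labels):
    """Assumed standard CART Gini index: 1 - sum of squared category probabilities."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_gain(d1_labels, d2_labels):
    """Gini(D, A): Gini indices of D1 and D2 weighted by their share of D."""
    total = len(d1_labels) + len(d2_labels)
    return (len(d1_labels) / total) * gini_index(d1_labels) + \
           (len(d2_labels) / total) * gini_index(d2_labels)


# D1: samples where A == a (three male, one female); D2: samples where A != a.
d1 = [1, 1, 1, 0]
d2 = [0, 0]
result = gini_gain(d1, d2)  # (4/6) * 0.375 + (2/6) * 0.0
```

Here Gini(D1) = 1 - (0.75² + 0.25²) = 0.375 and Gini(D2) = 0, so the weighted result is 0.25.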
In some embodiments, selecting the current division feature and its corresponding division point from the features according to the Gini index information gain includes:
determining the minimum Gini index information gain among the Gini index information gains as the target Gini index information gain;
taking the feature corresponding to the target Gini index information gain and its value as the division feature and the division point, respectively.
In some embodiments, judging whether a child node satisfies the preset classification termination condition includes:
judging whether the number of categories of the samples in the post-removal sub-sample set corresponding to the child node equals a preset number;
if so, determining that the child node satisfies the preset classification termination condition.
In some embodiments, predicting the gender of the unknown-gender user according to the prediction sample and the classification regression tree model includes:
determining the corresponding leaf node according to the features of the prediction sample and the classification regression tree model, and taking the output of that leaf node as the prediction output.
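The leaf lookup used for prediction can be sketched as a walk from the root to a leaf. The nested-dictionary tree layout (keys `feature`, `value`, `equal`, `not_equal`, `output`) and the feature name `male_content_level` are assumptions made for this illustration and do not come from the application.

```python
def predict(tree, sample):
    """Follow the division feature and division point at each internal node
    until a leaf is reached, then return that leaf's output."""
    node = tree
    while "output" not in node:
        if sample[node["feature"]] == node["value"]:
            node = node["equal"]
        else:
            node = node["not_equal"]
    return node["output"]


# Hypothetical two-leaf tree: one division on a bucketed browsing feature.
example_tree = {
    "feature": "male_content_level",
    "value": "high",
    "equal": {"output": "male"},
    "not_equal": {"output": "female"},
}
prediction = predict(example_tree, {"male_content_level": "high"})
```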
An embodiment of the present application provides a gender prediction method. The method may be executed by the gender prediction apparatus provided by an embodiment of the present application, or by a device that integrates the gender prediction apparatus, such as an electronic device or a network device, where the gender prediction apparatus may be implemented in hardware or software. The electronic device may be a smartphone, a tablet computer, a palmtop computer, a notebook computer, a desktop computer, or a similar device.
Referring to FIG. 1, FIG. 1 is a schematic diagram of an application scenario of the gender prediction method provided by an embodiment of the present application. Taking the gender prediction apparatus integrated in an electronic device as an example, the electronic device may acquire multi-dimensional features of known-gender users using electronic devices as samples, and construct a sample set for gender prediction; classify the sample set according to the Gini index information gain of the features with respect to classification of the sample set, to construct a corresponding classification regression tree model whose output includes male or female; collect, according to a prediction time, multi-dimensional features of an unknown-gender user using the electronic device as a prediction sample; and predict the gender of the unknown-gender user according to the prediction sample and the classification regression tree model. In addition, the electronic device may deeply customize and optimize system resource scheduling according to the predicted gender, for example, by cleaning up applications according to the predicted gender.
Specifically, as shown in FIG. 1, taking the prediction of the gender of user a as an example, multi-dimensional features of known-gender users (such as male user b, female user c, and so on) using electronic devices may be collected within a historical time period as samples (for example, the number of times and duration for which user b browses male-oriented content in an application, the number of times and duration for which user b browses female-oriented content in an application, and so on), and a sample set for gender prediction is constructed. The sample set is classified according to the Gini index information gain of the features with respect to classification of the sample set, to construct a corresponding classification regression tree model whose output includes male or female. Multi-dimensional features of the unknown-gender user using the electronic device are collected according to the prediction time as a prediction sample (for example, at time t, the number of times and duration for which user a browses male-oriented content in an application, the number of times and duration for which user a browses female-oriented content in an application, and so on). The gender of the unknown-gender user a (male or female) is then predicted according to the prediction sample and the classification regression tree model.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of the gender prediction method provided by an embodiment of the present application. The specific flow of the gender prediction method provided by this embodiment may be as follows:
201. Acquire multi-dimensional features of known-gender users using electronic devices as samples, and construct a sample set for gender prediction.
The multi-dimensional features are multi-dimensional user behavior features of a known-gender user, such as a male user or a female user, using an electronic device; for example, the multi-dimensional user behavior features of known-gender users using electronic devices within a historical time period.
In one embodiment, the multi-dimensional features are behavior features with gender characteristics that arise while the user uses the electronic device, for example, behavior features with male or female characteristics.
The multi-dimensional features have dimensions of a certain length, and the parameter in each dimension represents one kind of feature information about the user's use of the electronic device; that is, the multi-dimensional features consist of multiple features. These features may include behavior features of the user using applications on the electronic device, for example: the number of times and duration for which the user browses male-oriented goods (such as menswear) in a shopping application; the number of times and duration for which the user browses female-oriented goods (such as cosmetics and womenswear) in a shopping application; the duration for which the user reads male-oriented novels in a reading application; and the duration for which the user reads female-oriented novels in a reading application.
The multi-dimensional features may also include behavior feature information related to the user's use of the electronic device itself, such as the number of times the user uses the front camera of the electronic device, the number of times the user uses the rear camera, and so on.
The sample set for gender prediction may include multiple samples, each including multi-dimensional features of a known-gender user using an electronic device. The sample set may include multiple samples collected at a preset frequency within a historical time period. The historical time period may be, for example, the past 7 days or 10 days; the preset frequency may be, for example, once every 10 minutes or once every half hour. It can be understood that the multi-dimensional feature data collected in one acquisition constitutes one sample, and multiple samples constitute the sample set.
In one embodiment, a server may collect the multi-dimensional features of each known-gender user using his or her electronic device, and the electronic device may then obtain them from the server when performing gender prediction. A known-gender user may be a user who provided gender information when using the electronic device, for example, a user who provided gender information during account registration.
After the sample set is formed, each sample in the sample set may be labeled to obtain a sample label for each sample. Because this embodiment aims to predict the gender of the user, the sample labels include male and female; that is, the sample categories include male and female. Labeling may be performed according to the gender of the known-gender user. For example, when a male user browses male-oriented content (such as goods) in an application, the sample is labeled "male"; as another example, when a female user reads female-oriented novels, the sample is labeled "female". Specifically, the value "1" may be used to represent "male" and the value "0" to represent "female", or vice versa.
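The construction and labeling of the sample set can be pictured with a small Python sketch. The feature names (`male_goods_views`, `female_goods_views`, `front_camera_uses`) and the record layout are hypothetical, chosen only to mirror the kinds of features listed above; the encoding 1 for male and 0 for female follows the text, which notes that the reverse assignment is equally valid.

```python
MALE, FEMALE = 1, 0  # label encoding described in the text (could be reversed)


def build_sample_set(records):
    """Turn (features, known_gender) records into parallel lists of
    samples and numeric sample labels."""
    samples, labels = [], []
    for features, gender in records:
        samples.append(dict(features))
        labels.append(MALE if gender == "male" else FEMALE)
    return samples, labels


# Each record is one acquisition for one known-gender user (hypothetical values).
records = [
    ({"male_goods_views": 12, "female_goods_views": 1, "front_camera_uses": 2}, "male"),
    ({"male_goods_views": 0, "female_goods_views": 9, "front_camera_uses": 7}, "female"),
]
samples, labels = build_sample_set(records)
```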
202. Classify the sample set according to the Gini index information gain of the features with respect to sample classification, to construct a corresponding classification regression decision tree model.
In one embodiment, to facilitate sample classification, feature information in the multi-dimensional feature information of known users that is not directly expressed as a numerical value may be quantified with specific values. For example, for the feature information of the wireless network connection state of the electronic device, the value 1 may represent a normal state and the value 0 an abnormal state (or vice versa); as another example, for the feature information of whether the electronic device is charging, the value 1 may represent the charging state and the value 0 the non-charging state (or vice versa).
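A minimal sketch of this quantification step is given below. The feature names `wifi_state` and `charging`, the raw string values, and the `ENCODINGS` table are assumptions made for illustration; only the 1/0 assignments come from the text.

```python
# Hypothetical lookup table mapping non-numeric feature values to numbers.
ENCODINGS = {
    "wifi_state": {"normal": 1, "abnormal": 0},
    "charging": {"charging": 1, "not_charging": 0},
}


def quantize(raw_features):
    """Replace non-numeric feature values with their numeric codes;
    features without an encoding entry are passed through unchanged."""
    quantized = {}
    for name, value in raw_features.items():
        if name in ENCODINGS:
            quantized[name] = ENCODINGS[name][value]
        else:
            quantized[name] = value
    return quantized


encoded = quantize({"wifi_state": "normal", "charging": "not_charging", "front_camera_uses": 3})
```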
In the embodiments of the present application, the sample set may be classified based on the Gini index information gain of the features with respect to sample classification, to construct a classification regression tree (CART, Classification And Regression Tree). For example, the classification regression tree model may be constructed based on the ID3 (Iterative Dichotomiser 3) algorithm.
A classification regression tree is one kind of decision tree, and a very important one. It is a binary tree in which every non-leaf node has two children, so for any such tree the number of leaf nodes is one more than the number of non-leaf nodes. A decision tree is a tree built on the basis of decisions. In machine learning, a decision tree is a predictive model that represents a mapping between object attributes and object values: each node represents an object, each branching path in the tree represents a possible attribute value, and each leaf node corresponds to the value of the object represented by the path from the root node to that leaf node. A decision tree has only a single output; if multiple outputs are needed, independent decision trees can be built to handle the different outputs.
The ID3 (Iterative Dichotomiser 3) algorithm is one decision tree algorithm. It is based on the principle of Occam's razor, that is, accomplishing more with as little as possible. In information theory, the smaller the expected information, the greater the information gain and the higher the purity. The core idea of the ID3 algorithm is to use information gain to measure the choice of attribute, and to split on the attribute with the largest information gain after splitting. The algorithm traverses the space of possible decisions using a top-down greedy search.
Information gain is defined for each feature individually: for a feature t, it compares the amount of information in the system with and without the feature, and the difference between the two is the amount of information the feature brings to the system, that is, the information gain.
The Gini index is a feature selection method similar to information entropy. It can be used to express the impurity of data, that is, the probability that a randomly selected sample in a subset is misclassified. In the CART algorithm, the Gini index can be used to construct a binary decision tree.
The Gini index is a measure of inequality, commonly used to measure income imbalance, and can be used to measure any uneven distribution. It is a number between 0 and 1, where 0 means completely equal and 1 means completely unequal. When used as a classification measure, the more mixed the categories contained in the population, the larger the Gini index (much like the concept of entropy). That is, the greater the impurity of the data, the larger the Gini index.
The Gini index information gain (Gini Gain) of a feature for classification of a sample set represents the impurity gain of the samples after the sample set is divided based on that feature. For example, the Gini index information gain of feature A for classification of the sample set D is Gini(D, A), which represents the impurity gain of the sample set after the sample set D is divided based on feature A.
The process of classifying the sample set based on the Gini index information gain is described in detail below. For example, the classification process may include the following steps:
generating a root node of the classification regression tree model, and taking the sample set as the node information of the root node;
determining the sample set of the root node as the target sample set currently to be classified;
acquiring the Gini index information gain of the features for classification of the target sample set;
selecting the current division feature and its corresponding division point from the features according to the Gini index information gain;
dividing the sample set according to the division feature and the division point, to obtain two sub-sample sets;
generating child nodes of the current node, and assigning the sub-sample sets to the corresponding child nodes;
judging whether a child node satisfies the preset classification termination condition;
if not, updating the target sample set to the sub-sample set, and returning to the step of acquiring the Gini index information gain of the features for classification of the target sample set;
if so, taking the child node as a leaf node, and setting the output of the leaf node according to the sample categories of the samples in the sub-sample set, where the sample categories include male or female.
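The steps above can be sketched as a short recursive procedure. This is an illustrative sketch under several assumptions, not the applicant's implementation: it uses the standard CART Gini index for each sub-sample set, treats "only one category left" as the termination condition, falls back to the majority category when no value separates the samples, and stores the tree as nested dictionaries. All function names and the feature names in the usage data are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """Assumed standard CART Gini index of a non-empty label list."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(samples, labels):
    """Return (gain, feature, value) with the smallest weighted Gini index,
    or None if no value splits the set into two non-empty parts."""
    best = None
    for feature in samples[0]:
        for value in sorted({s[feature] for s in samples}):
            d1 = [y for s, y in zip(samples, labels) if s[feature] == value]
            d2 = [y for s, y in zip(samples, labels) if s[feature] != value]
            if not d1 or not d2:
                continue
            gain = (len(d1) * gini_index(d1) + len(d2) * gini_index(d2)) / len(labels)
            if best is None or gain < best[0]:
                best = (gain, feature, value)
    return best


def build_tree(samples, labels):
    """Recursively divide the target sample set until each node is a leaf."""
    if len(set(labels)) == 1:  # termination: a single category remains
        return {"output": labels[0]}
    split = best_split(samples, labels)
    if split is None:  # nothing left to divide on: use the majority category
        return {"output": Counter(labels).most_common(1)[0][0]}
    _, feature, value = split
    equal = [(s, y) for s, y in zip(samples, labels) if s[feature] == value]
    other = [(s, y) for s, y in zip(samples, labels) if s[feature] != value]
    return {
        "feature": feature,
        "value": value,
        "equal": build_tree([s for s, _ in equal], [y for _, y in equal]),
        "not_equal": build_tree([s for s, _ in other], [y for _, y in other]),
    }


# Hypothetical bucketed features; labels use 1 for male and 0 for female.
train_samples = [
    {"male_content": "high", "front_camera": "low"},
    {"male_content": "high", "front_camera": "high"},
    {"male_content": "low", "front_camera": "high"},
    {"male_content": "low", "front_camera": "low"},
]
train_labels = [1, 1, 0, 0]
tree = build_tree(train_samples, train_labels)
```

In this toy data the `male_content` feature separates the two categories perfectly (weighted Gini index 0), so it is chosen at the root and both children become leaves immediately.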
The division feature may be selected from the features and their corresponding values according to the Gini index information gain of each feature with respect to classification of the sample set, and is used to classify the sample set. The division point is one of the values of the division feature.
In the embodiments of the present application, there are multiple ways to select the division feature according to the Gini index information gain. For example, to improve the accuracy of sample classification, the feature corresponding to the minimum Gini index information gain may be selected as the division feature. That is, the step of "selecting the current division feature and its corresponding division point from the features according to the Gini index information gain" may include:
determining the minimum Gini index information gain among the Gini index information gains as the target Gini index information gain;
taking the feature corresponding to the target Gini index information gain and its value as the division feature and the division point, respectively.
That is, the feature and corresponding value for which the change (such as the decrease) in sample impurity after division of the sample set is smallest are selected as the division feature and the division point. For example, if, when a feature A takes a value a, the change (such as the decrease) in sample impurity after the sample set is divided based on A = a is smallest, then feature A is the division feature and the value a is the division point.
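Once a Gini index information gain has been computed for every candidate (feature, value) pair, the selection itself reduces to taking a minimum. The sketch below assumes the gains are held in a dictionary keyed by (feature, value); the function name `select_split` and the example gains are hypothetical.

```python
def select_split(gains):
    """Pick the (feature, value) pair with the smallest Gini index
    information gain; returns (division_feature, division_point, gain)."""
    (feature, value), gain = min(gains.items(), key=lambda item: item[1])
    return feature, value, gain


# Hypothetical precomputed gains for three candidate divisions.
candidate_gains = {
    ("male_content", "high"): 0.10,
    ("front_camera", "high"): 0.45,
    ("female_content", "high"): 0.25,
}
division_feature, division_point, target_gain = select_split(candidate_gains)
```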
其中,样本的类别可以包括男性、女性两种类别,每个样本的类别可以用样本标记来表示,比如,当样本标记为数值时,数值“1”表示“男性”,用数值“0”表示“女性”,反之亦可。The category of the sample may include two categories, male and female, and the category of each sample may be represented by a sample mark. For example, when the sample is marked as a numerical value, the value “1” represents “male”, and the value “0” is used. "Female", vice versa.
当子节点满足预设分类终止条件时,可以将子节点作为叶子节点,即停止对该子节点的样本集分类,并且可以基于去除后子样本集中样本的类别设置该叶子节点的输出。基于样本的类别设置叶子节点的输出的方式有多种。比如,可以去除后样本集中样本数量最多的类别作为该叶子节点的输出。When the child node satisfies the preset classification termination condition, the child node may be used as a leaf node, that is, the sample set classification of the child node is stopped, and the output of the leaf node may be set based on the category of the sample in the removed sub-sample set. There are several ways to set the output of a leaf node based on the category of the sample. For example, the category with the largest number of samples in the sample set can be removed as the output of the leaf node.
其中,预设分类终止条件可以根据实际需求设定,当子节点满足预设分类终止条件时,将当前子节点作为叶子节点,停止对子节点对应的样本集进行分类;当子节点不满足预设分类终止条件时,继续对子节点对应的样本集进行分类。比如,预设分类终止条件可以包括:子节点的去除后子样本集合中样本的类别数量为与预设数量,也即步骤“判断子节点是否满足预设分类终止条件”可以包括:The preset classification termination condition may be set according to actual requirements. When the child node satisfies the preset classification termination condition, the current child node is regarded as a leaf node, and the sample set corresponding to the child node is stopped; when the child node does not satisfy the pre-sale When the classification termination condition is set, the sample set corresponding to the child node is continued to be classified. For example, the preset classification termination condition may include: the number of categories of the samples in the removed sub-sample set of the child node is a preset number, that is, the step “determining whether the child node satisfies the preset classification termination condition” may include:
Determining whether the number of categories of the samples in the sub-sample set after removal corresponding to the child node equals the preset number;
If so, determining that the child node satisfies the preset classification termination condition;
If not, determining that the child node does not satisfy the preset classification termination condition.
For example, the preset classification termination condition may include: the number of categories of the samples in the sub-sample set after removal corresponding to the child node is 1, that is, the sample set of the child node contains samples of only one category. In this case, if the child node satisfies the preset classification termination condition, the category of the samples in the sub-sample set is taken as the output of the leaf node. For example, if the sub-sample set after removal contains only samples of the category "male", then "male" may be taken as the output of the leaf node.
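The termination check and leaf output described above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation; the function names `meets_termination_condition` and `leaf_output`, the `preset_count` parameter, and the toy label lists are assumptions introduced here.

```python
from collections import Counter


def meets_termination_condition(labels, preset_count=1):
    """True when the number of distinct categories equals the preset number."""
    return len(set(labels)) == preset_count


def leaf_output(labels):
    """Category with the most samples in the sub-sample set."""
    return Counter(labels).most_common(1)[0][0]


# Toy sub-sample sets (invented for illustration).
labels_d1 = ["male", "male", "male"]
labels_d2 = ["male", "female", "female"]

print(meets_termination_condition(labels_d1))  # True
print(meets_termination_condition(labels_d2))  # False
print(leaf_output(labels_d2))                  # female
```

With `preset_count=1` the check reduces to "only one category remains", which is the example given in the text.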
In an embodiment, the sample set may be divided into two sub-sample sets according to whether the dividing feature takes the value of the dividing point. For example, when the dividing feature is A and the dividing point is a, the sample set may be divided into two sub-sample sets based on whether feature A equals a.
In the embodiments of the present application, the Gini index information gain of a feature for classification of the target sample set may include the Gini index information gain of a value of the feature for classification of the target sample set; for example, the Gini index gain (Gini Gain) of the value a of feature A for classification of the target sample set D. The Gini index information gain may be obtained from the Gini index of the feature value for classification of the sample set. For example, the step "obtaining the Gini index information gain of the feature for classification of the target sample set" may include:
Obtaining the Gini index of the value of the feature for classification of the target sample set;
Obtaining, according to the Gini index, the Gini index information gain of the value of the feature for classification of the target sample set.
Specifically, the Gini index of the value of the feature for classification of the target sample set is obtained as follows:
Dividing the target sample set into a first sub-sample set and a second sub-sample set according to the value of the feature, where the value is one of all possible values of the feature;
Obtaining the probabilities of the sample categories in the first sub-sample set and in the second sub-sample set;
Obtaining, according to the probabilities of the sample categories, the Gini index of the value for classification of the target sample set.
The Gini index of the value for classification of the target sample set includes: the Gini index for classification of the target sample set when the feature takes the value, and the Gini index for classification of the target sample set when the feature does not take the value. The step "obtaining, according to the probabilities of the sample categories, the Gini index of the value for classification of the target sample set" may include:
Obtaining, according to the probabilities of the sample categories in the first sub-sample set, a first Gini index for classification of the target sample set when the feature takes the value;
Obtaining, according to the probabilities of the sample categories in the second sub-sample set, a second Gini index for classification of the target sample set when the feature does not take the value.
In this case, the step "obtaining, according to the Gini index, the Gini index information gain of the feature for classification of the target sample set" may include:
Obtaining the Gini index information gain for classification of the target sample set when the feature takes the value, according to the first Gini index, the ratio of the number of samples in the first sub-sample set to that in the target sample set, the second Gini index, and the ratio of the number of samples in the second sub-sample set to that in the target sample set.
For example, take the target sample set to be sample set D and the feature to be feature A. Feature A has multiple possible values; for a value such as A = a, the Gini index information gain of the value a of feature A for sample set D may be obtained as follows:
First, sample set D is divided into sub-sample sets D1 and D2 according to whether A = a is "yes" or "no";
The Gini index Gini(D1) for sample set D when feature A is a is calculated from the probabilities pk of the sample categories (male or female) in D1, and the Gini index Gini(D2) for sample set D when feature A is not a is calculated from the probabilities pk of the sample categories (male or female) in D2. Gini(D1) and Gini(D2) are calculated by the following formula.
(Formula image PCTCN2018116709-appb-000002)

$$\mathrm{Gini}(D_i) = 1 - \sum_{k=1}^{K} p_k^2, \qquad i = 1, 2$$

where $p_k$ is the probability of the k-th sample category in $D_i$, and k = 1, 2, ..., K.
Next, the Gini index information gain for classification of sample set D when feature A is a, namely Gini(D, A), may be calculated from Gini(D1), the ratio D1/D of the number of samples in sub-sample set D1 to that in sample set D, Gini(D2), and the ratio D2/D of the number of samples in sub-sample set D2 to that in sample set D. For example, it may be obtained by the following formula:
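(Formula image PCTCN2018116709-appb-000003)

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

where $|D_1|/|D|$ and $|D_2|/|D|$ are the sample number ratios D1/D and D2/D.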
In this way, the Gini index information gain of each value of each sample feature for classification of sample set D can be calculated.
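The calculation described above can be sketched in a few lines. This is an illustrative sketch that assumes the standard Gini index definition (one minus the sum of squared category probabilities); the function names `gini` and `gini_of_split`, the dictionary representation of a sample, and the toy data are assumptions and do not come from the application.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of category labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_of_split(samples, labels, feature, value):
    """Gini(D, A) for the test 'feature == value' on a non-empty sample set."""
    d1 = [y for x, y in zip(samples, labels) if x[feature] == value]  # A = a: yes
    d2 = [y for x, y in zip(samples, labels) if x[feature] != value]  # A = a: no
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)


# Toy data (invented for illustration): 1 = male, 0 = female.
samples = [{"A": "a"}, {"A": "a"}, {"A": "b"}, {"A": "b"}]
labels = [1, 1, 0, 1]

print(gini_of_split(samples, labels, "A", "a"))  # 0.25
```

Here D1 holds two "male" samples (Gini 0) and D2 holds one of each category (Gini 0.5), so the weighted result is 0.5 × 0 + 0.5 × 0.5 = 0.25.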
For example, consider a sample set D {sample 1, sample 2, ..., sample i, ..., sample n}, where sample 1 includes features t1, t2, ..., tm, sample i includes features t1, t2, ..., tm, and sample n includes features t1, t2, ..., tm. Each sample feature can take multiple values. The classification regression tree is constructed as follows:
First, all samples in sample set D are initialized. Then a root node d of the classification regression tree is generated and sample set D is assigned to the root node d, as shown in FIG. 3.
Using the above calculation of the Gini index information gain, the Gini index information gains Gini(D, t1), Gini(D, t2), ..., Gini(D, tm) of the possible values of each feature (feature 1, feature 2, ..., feature m) for classification of sample set D are calculated.
The smallest Gini index information gain is selected. If, for example, Gini(D, ti) is the smallest information gain, ti is determined to be the dividing feature t, and the value t' corresponding to ti in Gini(D, ti) is the dividing point.
Based on whether ti = t' is "yes" or "no", sample set D is divided into two sub-sample sets D1 {sample 1, sample 2, ..., sample k} and D2 {sample k+1, ..., sample n}. Then two child nodes d1 and d2 of the current node d are generated, D1 is assigned to child node d1, and D2 is assigned to child node d2.
Next, for each child node, taking child node d1 as an example, it is determined whether the child node satisfies the preset classification termination condition. If so, the current child node d1 is taken as a leaf node, and the output of the leaf node is set according to the categories of the samples in the sub-sample set corresponding to child node d1.
When a child node does not satisfy the preset classification termination condition, the sub-sample set corresponding to the child node continues to be classified in the above information-gain-based manner. Taking child node d2 as an example, the Gini index information gain Gini(D, t) of each value of each feature in sample set D2 for sample classification may be calculated, and the smallest information gain Gini(D, t)min is selected. The feature and value corresponding to Gini(D, t)min are taken as the dividing feature t and the dividing point, and D2 is divided into two sub-sample sets based on the dividing feature t and the dividing point, for example into sub-sample sets D21 and D22. Then child nodes d21 and d22 of the current node d2 are generated, and D21 and D22 are assigned to child nodes d21 and d22 respectively.
By analogy, the classification regression tree shown in FIG. 4 can be constructed using the above classification based on the Gini index information gain. The output of each leaf node of the classification regression tree is "male" or "female".
The terms "first", "second", "third", and the like in this application are used to distinguish different objects, not to describe a particular order.
In an embodiment, to improve the speed and efficiency of prediction with the classification regression tree, the corresponding dividing feature and its dividing feature value may also be marked on the paths between nodes. For example, during the above information-gain-based classification, the feature value of the corresponding dividing feature may be marked on the path between the current node and its child node.
For example, when the feature values of dividing feature t include 0 and 1, the path between d2 and d may be marked 1 and the path between d1 and d may be marked 0, and so on. After each division, the corresponding dividing feature value, such as 0 or 1, may be marked on the path between the current node and its child node, yielding the classification regression tree shown in FIG. 5.
203. Collect, according to a prediction time, the multi-dimensional features of an unknown-gender user using the electronic device as a prediction sample.
The prediction time may be set as required; for example, it may be the current time.
For example, the multi-dimensional features of an unknown-gender user using the electronic device may be collected at the current time point as the prediction sample.
In the embodiments of the present application, the multi-dimensional features collected in steps 201 and 203 are the same features, for example: the number of times and duration the user browses male-oriented goods (such as men's clothing) in a shopping application, the number of times and duration the user browses female-oriented goods (such as cosmetics and women's clothing) in a shopping application, the duration for which the user reads male-oriented fiction in a reading application, and the duration for which the user reads female-oriented fiction in a reading application.
204. Predict the gender of the unknown-gender user according to the prediction sample and the classification regression tree model.
Specifically, a corresponding output result is obtained according to the prediction sample and the classification regression tree model, and the gender of the unknown-gender user is determined according to the output result. The output result is female or male.
For example, the corresponding leaf node may be determined according to the features of the prediction sample and the classification regression tree model, and the output of that leaf node is taken as the prediction output result. That is, the features of the prediction sample are used to follow the branch conditions of the classification regression tree (the feature values of the dividing features) to the current leaf node, and the output of that leaf node is taken as the prediction result. Since the output of a leaf node is female or male, the user's gender can be determined based on the classification regression tree.
For example, after the multi-dimensional features of an unknown-gender user using the electronic device at the current time point are collected, the corresponding leaf node dn1 may be found in the classification regression tree shown in FIG. 5 by following the branch conditions of the tree. If the output of leaf node dn1 is male, the user is determined to be male.
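The lookup described above, following marked branches from the root to a leaf, might look like the sketch below. The nested-dictionary tree, the feature names `reads_sports_news` and `uses_beauty_app`, their values, and the convention that branch 1 means "the feature equals the dividing point" and branch 0 means it does not are all assumptions made for illustration; the tree of FIG. 5 is not reproduced here.

```python
# Hypothetical tree: internal nodes test "feature == value";
# branch 1 is taken when the test holds, branch 0 otherwise.
tree = {
    "feature": "reads_sports_news", "value": "high",
    "branches": {
        1: {"output": "male"},
        0: {
            "feature": "uses_beauty_app", "value": "often",
            "branches": {1: {"output": "female"}, 0: {"output": "male"}},
        },
    },
}


def predict(node, sample):
    """Follow the branch labels from the root until a leaf is reached."""
    while "output" not in node:
        branch = 1 if sample[node["feature"]] == node["value"] else 0
        node = node["branches"][branch]
    return node["output"]


prediction_sample = {"reads_sports_news": "low", "uses_beauty_app": "often"}
print(predict(tree, prediction_sample))  # female
```

Storing the branch value on each edge means prediction only needs one comparison per level, which is the efficiency benefit the marked paths are meant to provide.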
As can be seen from the above, the embodiments of the present application obtain the multi-dimensional features of known-gender users using electronic devices as samples and construct a sample set for gender prediction; classify the sample set according to the Gini index information gain of the features for classification of the sample set, so as to construct a corresponding classification regression tree model whose output is male or female; collect, according to a prediction time, the multi-dimensional features of an unknown-gender user using the electronic device as a prediction sample; and predict the gender of the unknown-gender user according to the prediction sample and the classification regression tree model. This solution can accurately predict the user's gender.
Further, since each sample in the sample set includes multiple items of feature information reflecting the user's habits in using the electronic device, the embodiments of the present application can make user gender prediction more personalized and intelligent.
Further, implementing user gender prediction based on the classification regression tree prediction model can improve the accuracy of user gender prediction and save resources.
The gender prediction method of the present application is further described below on the basis of the method described in the above embodiments. Referring to FIG. 6, the gender prediction method may include:
301. Obtain the multi-dimensional features of known-gender users using electronic devices as samples, and construct a sample set for gender prediction.
The multi-dimensional features are the multi-dimensional user behavior features of known-gender users, such as male users or female users, using electronic devices. For example, they may be the multi-dimensional user behavior features of known-gender users using electronic devices within a historical time period.
In an embodiment, the multi-dimensional features are behavior features with gender characteristics exhibited while the user uses the electronic device, for example behavior features with male or female characteristics.
The multi-dimensional features have dimensions of a certain length, and the parameter in each dimension corresponds to one item of feature information characterizing the user's use of the electronic device; that is, the multi-dimensional feature information consists of multiple features. The multiple features may include behavior features of the user using applications on the electronic device, for example: the number of times and duration the user browses male-oriented goods (such as men's clothing) in a shopping application, the number of times and duration the user browses female-oriented goods (such as cosmetics and women's clothing) in a shopping application, the duration for which the user reads male-oriented fiction in a reading application, and the duration for which the user reads female-oriented fiction in a reading application.
The multi-dimensional features may also include behavior feature information related to the user's use of the electronic device itself, for example the number of times the user uses the front camera of the electronic device, the number of times the user uses the rear camera, and so on.
The sample set for gender prediction may include multiple samples, each sample including the multi-dimensional features of a known-gender user using an electronic device. The sample set may include multiple samples collected at a preset frequency within a historical time period. The historical time period may be, for example, the past 7 days or 10 days; the preset frequency may be, for example, once every 10 minutes or once every half hour. It can be understood that the multi-dimensional feature data collected at one time constitutes one sample, and multiple samples constitute the sample set.
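To make the relationship between the historical time period, the preset frequency, and the number of samples concrete, the sketch below enumerates collection times. The function name `collection_times`, its parameters, and the end date are assumptions chosen for illustration; the application does not specify how collection is scheduled.

```python
from datetime import datetime, timedelta


def collection_times(end, days=7, interval_minutes=10):
    """Timestamps in the window [end - days, end), spaced interval_minutes apart."""
    start = end - timedelta(days=days)
    step = timedelta(minutes=interval_minutes)
    times = []
    t = start
    while t < end:
        times.append(t)
        t += step
    return times


# Arbitrary end date, chosen only for the example.
times = collection_times(datetime(2018, 1, 8), days=7, interval_minutes=10)
print(len(times))  # 1008
```

Collecting once every 10 minutes over 7 days gives 7 × 24 × 6 = 1008 samples per user; a half-hour interval over the same period would give 336.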
In an embodiment, a server may collect the multi-dimensional features of each known-gender user using his or her electronic device, and the electronic device may then obtain them from the server when performing gender prediction. A known-gender user may be a user who provided gender information when using the electronic device, for example a user who provided gender information at account registration.
A specific sample may be as shown in Table 1 below and includes feature information in multiple dimensions. It should be noted that the feature information shown in Table 1 is only an example. In practice, a sample may contain more or fewer items of feature information than shown in Table 1, and the specific feature information used may also differ from that shown in Table 1; no specific limitation is imposed here.
Dimension   Feature information
1           Number of times and duration the user browses male-oriented goods (such as men's clothing) in a shopping application
2           Number of times and duration the user browses female-oriented goods (such as cosmetics and women's clothing) in a shopping application
3           Duration for which the user reads male-oriented fiction
4           Duration for which the user reads female-oriented fiction
5           Duration for which the user reads sports news
6           Duration for which the user reads horoscope news
7           Number of times the user takes selfies with the front camera
8           Number of times the user uses beautification software
9           Number of times and duration the user plays different categories of games
Table 1
302. Label the samples in the sample set to obtain a sample label for each sample.
Since this embodiment aims to predict the user's gender, the sample labels include male and female. The sample label of a sample represents the sample category of that sample. Accordingly, the sample categories may include male and female.
In addition, samples may be labelled according to the gender of the known-gender user. For example, when a male user browses male-oriented content (such as goods) in an application, the sample is labelled "male"; likewise, when a female user reads female-oriented fiction, the sample is labelled "female". Specifically, the value "1" may represent "male" and the value "0" may represent "female", or vice versa.
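A labelled sample of the kind described in Table 1 and step 302 could be represented as below. This is only a sketch: the `Sample` class, its field names (a subset of the Table 1 dimensions), the `LABEL` mapping, and the example values are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, astuple

# Numeric encoding mentioned in the text: 1 = male, 0 = female (the reverse also works).
LABEL = {"male": 1, "female": 0}


@dataclass
class Sample:
    """A few of the Table 1 dimensions; the field names are illustrative."""
    male_goods_browse_count: int
    female_goods_browse_count: int
    male_fiction_minutes: float
    female_fiction_minutes: float
    front_camera_selfie_count: int


def to_labelled_sample(sample, gender):
    """Return (feature tuple, numeric sample label) for a known-gender user."""
    return astuple(sample), LABEL[gender]


features, label = to_labelled_sample(Sample(12, 1, 95.0, 0.0, 3), "male")
print(features, label)  # (12, 1, 95.0, 0.0, 3) 1
```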
303. Generate the root node of the classification regression tree model, and assign the sample set to the root node.
For example, referring to FIG. 3, for a sample set D {sample 1, sample 2, ..., sample i, ..., sample n}, the root node d of the classification regression tree model may first be generated, and sample set D assigned to root node d.
304. Determine the sample set to be the target sample set currently to be classified.
That is, the sample set of the root node is determined to be the target sample set currently to be classified.
305. Obtain the Gini index information gain of each feature in the target sample set for classification of the target sample set, and determine the smallest information gain.
For example, for sample set D, the Gini index information gains Gini(D, t1), Gini(D, t2), ..., Gini(D, tm) of each feature (feature t1, feature t2, ..., feature tm) for classification of the sample set may be calculated, and the smallest information gain Gini(D, t)min is selected.
The Gini index information gain of a feature for classification of the sample set may be obtained as follows:
Dividing the target sample set into a first sub-sample set and a second sub-sample set according to a value of the feature, where the value is one of all possible values of the feature;
Obtaining the probabilities of the sample categories in the first sub-sample set and in the second sub-sample set;
Obtaining, according to the probabilities of the sample categories in the first sub-sample set, a first Gini index for classification of the target sample set when the feature takes the value;
Obtaining, according to the probabilities of the sample categories in the second sub-sample set, a second Gini index for classification of the target sample set when the feature does not take the value;
Obtaining the Gini index information gain for classification of the target sample set when the feature takes the value, according to the first Gini index, the ratio of the number of samples in the first sub-sample set to that in the target sample set, the second Gini index, and the ratio of the number of samples in the second sub-sample set to that in the target sample set.
For example, take the target sample set to be sample set D and the feature to be feature A. Feature A has multiple possible values; for a value such as A = a, the Gini index information gain of the value a of feature A for sample set D may be obtained as follows:
First, sample set D is divided into sub-sample sets D1 and D2 according to whether A = a is "yes" or "no";
The Gini index Gini(D1) for sample set D when feature A is a is calculated from the probabilities pk of the sample categories (male or female) in D1, and the Gini index Gini(D2) for sample set D when feature A is not a is calculated from the probabilities pk of the sample categories (male or female) in D2. Gini(D1) and Gini(D2) are calculated by the following formula.
(Formula image PCTCN2018116709-appb-000004)

$$\mathrm{Gini}(D_i) = 1 - \sum_{k=1}^{K} p_k^2, \qquad i = 1, 2$$

where $p_k$ is the probability of the k-th sample category in $D_i$, and k = 1, 2, ..., K.
Next, the Gini index information gain for classification of sample set D when feature A is a, namely Gini(D, A), may be calculated from Gini(D1), the ratio D1/D of the number of samples in sub-sample set D1 to that in sample set D, Gini(D2), and the ratio D2/D of the number of samples in sub-sample set D2 to that in sample set D. For example, it may be obtained by the following formula:
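(Formula image PCTCN2018116709-appb-000005)

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

where $|D_1|/|D|$ and $|D_2|/|D|$ are the sample number ratios D1/D and D2/D.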
In this way, the Gini index information gain of each value of each sample feature for classification of sample set D can be calculated.
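As a worked illustration of the calculation just described, suppose (with numbers invented purely for this example) that sample set D has 10 samples, that A = a sends 4 of them to D1 (3 male, 1 female) and the remaining 6 to D2 (2 male, 4 female), and that the Gini index takes its standard form of one minus the sum of squared category probabilities:

```latex
\begin{aligned}
\mathrm{Gini}(D_1) &= 1 - \left(\tfrac{3}{4}\right)^2 - \left(\tfrac{1}{4}\right)^2 = \tfrac{3}{8} = 0.375 \\
\mathrm{Gini}(D_2) &= 1 - \left(\tfrac{2}{6}\right)^2 - \left(\tfrac{4}{6}\right)^2 = \tfrac{4}{9} \approx 0.444 \\
\mathrm{Gini}(D, A) &= \tfrac{4}{10}\cdot\tfrac{3}{8} + \tfrac{6}{10}\cdot\tfrac{4}{9}
                     = \tfrac{3}{20} + \tfrac{4}{15} = \tfrac{5}{12} \approx 0.417
\end{aligned}
```

Repeating this for every candidate value of every feature and keeping the smallest result gives the dividing feature and dividing point used in step 306.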
306. Take the feature corresponding to the smallest information gain and its corresponding value as the dividing feature and the dividing point.
For example, when the feature corresponding to the smallest gain Gini(D, t)min is feature ti and its value is t', feature ti may be selected as the dividing feature and the value t' corresponding to ti as the dividing point.
307. Divide the target sample set into two sub-sample sets according to the dividing feature and the dividing point.
Specifically, the target sample set may be divided into two sub-sample sets according to whether or not the dividing feature takes the value of the dividing point.
For example, sample set D may be divided into two sub-sample sets D1 {sample 1, sample 2, ..., sample k} and D2 {sample k+1, ..., sample n} based on whether ti = t' is "yes" or "no".
308. Generate child nodes of the current node, and assign the sub-sample sets to the corresponding child nodes.
One sub-sample set corresponds to one child node. For example, referring to FIG. 3, child nodes d1 and d2 of root node d are generated, sub-sample set D1 is assigned to child node d1, and sub-sample set D2 is assigned to child node d2.
In an embodiment, the dividing feature value corresponding to a child node may also be set on the path between the child node and the current node to facilitate subsequent gender prediction; see FIG. 5.
309. Determine whether the sub-sample set of the child node satisfies the preset classification termination condition. If not, perform step 310; if so, perform step 311.
The preset classification termination condition may be set according to actual requirements. When a child node satisfies the condition, the current child node is taken as a leaf node and classification of the sample set corresponding to the child node stops; when a child node does not satisfy the condition, classification of the sample set corresponding to the child node continues. For example, the preset classification termination condition may include: the number of categories of the samples in the child node's sub-sample set after removal equals a preset number.
For example, the preset classification termination condition may include: the number of categories of the samples in the sub-sample set after removal corresponding to the child node is 1, that is, the sample set of the child node contains samples of only one category.
310. Update the target sample set to the sub-sample set of the child node, and return to step 305.
311. Take the child node as a leaf node, and set the output of the leaf node according to the sample categories in the sub-sample set of the child node.
For example, the preset classification termination condition may include: the number of categories of the samples in the sub-sample set after removal corresponding to the child node is 1, that is, the sample set of the child node contains samples of only one category.
In this case, if the child node satisfies the preset classification termination condition, the category of the samples in the sub-sample set is taken as the output of the leaf node. For example, if the sub-sample set after removal contains only samples of the category "female", then "female" may be taken as the output of the leaf node.
The sample categories include female and male.
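Steps 303 to 311 can be read as a loop over a list of nodes still waiting to be divided. The sketch below follows that reading under several assumptions that are not taken from the application: samples are (feature dictionary, label) pairs, nodes are plain dictionaries, branch 1 means "the dividing feature equals the dividing point" and branch 0 means it does not, the Gini index has its standard form, the termination check runs when a node is taken from the pending list, and a node whose samples cannot be divided by any value falls back to its most common category.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of category labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows):
    """Steps 305-306: (score, feature, value) with the smallest weighted Gini index."""
    best, n = None, len(rows)
    for feature in rows[0][0]:
        for value in sorted({x[feature] for x, _ in rows}):  # sorted for determinism
            d1 = [y for x, y in rows if x[feature] == value]
            d2 = [y for x, y in rows if x[feature] != value]
            if not d1 or not d2:
                continue  # this value does not actually divide the set
            score = len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
            if best is None or score < best[0]:
                best = (score, feature, value)
    return best


def build_tree(rows):
    root = {"rows": rows}                                   # step 303
    pending = [root]                                        # step 304
    while pending:
        node = pending.pop()
        node_rows = node.pop("rows")
        labels = [y for _, y in node_rows]
        # Step 309: a single remaining category ends the classification.
        split = None if len(set(labels)) == 1 else best_split(node_rows)
        if split is None:
            node["output"] = Counter(labels).most_common(1)[0][0]  # step 311
            continue
        _, feature, value = split
        node["feature"], node["value"] = feature, value
        node["branches"] = {                                # steps 307-308
            1: {"rows": [r for r in node_rows if r[0][feature] == value]},
            0: {"rows": [r for r in node_rows if r[0][feature] != value]},
        }
        pending.extend(node["branches"].values())           # step 310
    return root


# Toy sample set (invented for illustration).
rows = [
    ({"sports": "high", "beauty": "rare"}, "male"),
    ({"sports": "high", "beauty": "often"}, "male"),
    ({"sports": "low", "beauty": "often"}, "female"),
    ({"sports": "low", "beauty": "rare"}, "male"),
]
tree = build_tree(rows)
print(tree["feature"], tree["value"])  # sports high
```

On this toy data all four candidate divisions of the root tie at 0.25, so the first one examined (`sports == "high"`) is kept; the "low" branch is then divided on `beauty == "often"`, giving leaves "female" and "male".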
312. After the classification regression tree model is constructed, obtain the time at which gender needs to be predicted, and collect, according to that time, the multi-dimensional features of the current unknown-gender user using the electronic device as a prediction sample.
The time at which gender needs to be predicted may be the current time or another time.
313、根据预测样本和分类回归树模型预测未知性别用户的性别。313. Predict the gender of the unknown gender user according to the prediction sample and the classification regression tree model.
比如,可以根据预测样本的特征和分类回归树模型确定相应的叶子节点,将该叶子节点的输出作为预测输出结果。如利用预测样本的特征按照分类回归树的分支条件(即划分特征的特征值)确定当前的叶子节点,取该叶子节点的输出作为预测的结果。由于叶子节点的输出包括男性、或女性,因此,此时可以基于分类回归树来确定用户性别。For example, the corresponding leaf node may be determined according to the characteristics of the predicted sample and the classification regression tree model, and the output of the leaf node is used as a predicted output result. For example, the current leaf node is determined according to the branch condition of the classification regression tree (ie, the feature value of the divided feature), and the output of the leaf node is taken as the prediction result. Since the output of the leaf node includes male, or female, the user's gender can be determined based on the classification regression tree at this time.
例如,采集当前时间点应用的多维特征后,可以在图5所示的分类回归树中按照分类回归树的分支条件查找相应的叶子节点为an2,叶子节点an2的输出为女性,此时,便确定用户为女性。For example, after collecting the multi-dimensional features of the current time point application, the corresponding leaf nodes can be found as an2 according to the branching condition of the classification regression tree in the classification regression tree shown in FIG. 5, and the output of the leaf node an2 is female. Make sure the user is a female.
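The lookup described above can be sketched as a walk from the root to a leaf. The dictionary layout of the tree, the key names (`feature`, `value`, `equal`, `not_equal`, `output`) and the feature names in the sample tree are all hypothetical choices made for this illustration; the application does not prescribe a data structure.

```python
def predict(node, sample):
    """Follow the branch conditions from the root until a leaf is reached.

    Internal nodes are dicts with "feature", "value", "equal" and
    "not_equal"; leaf nodes are dicts with a single "output" key.
    """
    while "output" not in node:
        # Branch condition: does the sample's feature take the dividing value?
        matches = sample.get(node["feature"]) == node["value"]
        node = node["equal"] if matches else node["not_equal"]
    return node["output"]


# A tiny hand-built tree with made-up features, only to exercise predict().
tree = {
    "feature": "usage_period",
    "value": "evening",
    "equal": {"output": "female"},
    "not_equal": {
        "feature": "charging_habit",
        "value": "overnight",
        "equal": {"output": "male"},
        "not_equal": {"output": "female"},
    },
}
```

Calling `predict(tree, {"usage_period": "morning", "charging_habit": "overnight"})` descends through the `not_equal` branch of the root and then the `equal` branch of the second node.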
As can be seen from the above, the embodiment of the present application acquires multi-dimensional features of electronic devices as used by users of known gender as samples, and constructs a sample set for gender prediction; classifies the sample set according to the Gini index information gain of the features for classifying the sample set, so as to construct a corresponding classification regression tree model whose output is male or female; collects, according to the prediction time, multi-dimensional features of the electronic device as used by a user of unknown gender as a prediction sample; and predicts the gender of the user of unknown gender according to the prediction sample and the classification regression tree model. This solution can accurately predict the user's gender.
Further, since each sample in the sample set includes multiple pieces of feature information reflecting the user's habits in using the electronic device, the embodiment of the present application can make user gender prediction more personalized and intelligent.
Further, implementing user gender prediction based on the classification regression tree prediction model can improve the accuracy of user gender prediction and save resources.
An embodiment of the present application further provides a gender prediction apparatus, including:
a sample construction unit, configured to acquire multi-dimensional features of electronic devices as used by users of known gender as samples, and to construct a sample set for gender prediction;
a classification unit, configured to classify the sample set according to the Gini index information gain of the features for classifying the sample set, so as to construct a corresponding classification regression tree model, where the output of the classification regression tree model is male or female;
a collection unit, configured to collect, according to a prediction time, multi-dimensional features of the electronic device as used by a user of unknown gender as a prediction sample;
a prediction unit, configured to predict the gender of the user of unknown gender according to the prediction sample and the classification regression tree model.
In some embodiments, the classification unit includes:
a node generating subunit, configured to generate a root node of the classification regression tree model, assign the sample set to the root node, and determine the sample set of the root node as the target sample set currently to be classified;
a gain acquisition subunit, configured to obtain the Gini index information gain of the features for classifying the target sample set;
a dividing feature determining subunit, configured to select the current dividing feature and its corresponding dividing point from the features according to the Gini index information gain;
a classification subunit, configured to divide the sample set according to the dividing feature and the dividing point to obtain two subsample sets;
a child node generating subunit, configured to generate child nodes of the current node and assign the subsample sets to the corresponding child nodes;
a judging subunit, configured to judge whether a child node satisfies a preset classification termination condition; if not, to update the target sample set to the subsample set and trigger the gain acquisition subunit to perform the step of obtaining the Gini index of the features for the target sample set; if so, to take the child node as a leaf node and set the output of the leaf node according to the sample category of the samples in the subsample set, where the sample category is male or female.
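The interplay of these subunits can be sketched as a recursive procedure. The sketch below covers only the control flow (node generation, division, termination check, recursion). The split chooser is passed in as a parameter, and `first_valid_split` is a deliberately naive placeholder rather than the Gini-based selection the application describes. All names, the sample layout `(features, label)` and the majority-vote fallback are assumptions of this illustration.

```python
def build_tree(samples, choose_split):
    """Recursively build a binary tree from (features_dict, label) pairs."""
    labels = [label for _, label in samples]
    # Termination condition: the node's sample set holds a single category.
    if len(set(labels)) == 1:
        return {"output": labels[0]}
    split = choose_split(samples)
    if split is None:
        # Assumption: fall back to the majority category when no split exists.
        return {"output": max(set(labels), key=labels.count)}
    feature, value = split
    # Divide the target sample set into A == a and A != a.
    d1 = [s for s in samples if s[0].get(feature) == value]
    d2 = [s for s in samples if s[0].get(feature) != value]
    return {
        "feature": feature,
        "value": value,
        "equal": build_tree(d1, choose_split),
        "not_equal": build_tree(d2, choose_split),
    }


def first_valid_split(samples):
    """Placeholder chooser: first (feature, value) that separates the set."""
    for features, _ in samples:
        for feature, value in features.items():
            matches = sum(1 for f, _ in samples if f.get(feature) == value)
            if 0 < matches < len(samples):
                return feature, value
    return None
```

Passing the chooser as an argument keeps the loop independent of how the dividing feature and dividing point are scored, so a Gini-based chooser can be substituted without changing `build_tree`.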
In some embodiments, the gain acquisition subunit is configured to:
obtain the Gini index of a value of the feature for classifying the target sample set;
obtain, according to the Gini index, the Gini index information gain of the value of the feature for classifying the target sample set.
In some embodiments, the gain acquisition subunit is configured to:
divide the target sample set into a first subsample set and a second subsample set according to the value of the feature;
obtain the probabilities of the sample categories in the first subsample set and in the second subsample set;
obtain, according to the probabilities of the sample categories, the Gini index of the value for classifying the target sample set.
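The three steps above (divide, estimate category probabilities, compute the Gini index) can be sketched as follows. The sketch assumes the standard CART definition Gini(D) = 1 - sum of p_k squared over the categories k, which this passage does not spell out. The function names and the `(features, label)` sample layout are hypothetical.

```python
from collections import Counter


def partition(samples, feature, value):
    """Split (features_dict, label) pairs into A == a and A != a subsets."""
    d1 = [s for s in samples if s[0].get(feature) == value]
    d2 = [s for s in samples if s[0].get(feature) != value]
    return d1, d2


def class_probabilities(subset):
    """Relative frequency of each sample category in a subsample set."""
    counts = Counter(label for _, label in subset)
    total = len(subset)
    return {label: n / total for label, n in counts.items()}


def gini(subset):
    """Gini index of a subsample set: 1 minus the sum of squared probabilities."""
    if not subset:
        return 0.0  # assumption: an empty subset is treated as pure
    return 1.0 - sum(p * p for p in class_probabilities(subset).values())
```

A subsample set containing only one category has a Gini index of 0, which is why a single-category node is a natural stopping point.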
In some embodiments, the dividing feature determining subunit is configured to:
determine the smallest Gini index information gain among the Gini index information gains as the target Gini index information gain;
take the feature corresponding to the target Gini index information gain and its value as the dividing feature and the dividing point, respectively.
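Selecting the dividing feature and dividing point then reduces to taking the minimum over all candidate (feature, value) pairs. The sketch below assumes the gains have already been computed and stored in a dictionary keyed by (feature, value). That layout and the function name are hypothetical.

```python
def select_split(gains):
    """Return the (feature, value) pair with the smallest gain.

    `gains` maps (feature, value) -> Gini index information gain.
    """
    if not gains:
        raise ValueError("no candidate splits to choose from")
    (feature, value), _ = min(gains.items(), key=lambda item: item[1])
    return feature, value
```

The minimum is chosen because the quantity the text calls the Gini index information gain behaves like a weighted impurity: a lower value means the two resulting subsample sets are purer.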
In some embodiments, the gain acquisition subunit is configured to:
divide the target sample set into a first subsample set and a second subsample set according to the value of the feature;
obtain the probabilities of the sample categories in the first subsample set and in the second subsample set;
obtain, according to the probabilities of the sample categories in the first subsample set, a first Gini index for classifying the target sample set when the feature takes the value;
obtain, according to the probabilities of the sample categories in the second subsample set, a second Gini index for classifying the target sample set when the feature does not take the value;
obtain the Gini index information gain of the value of the feature for classifying the target sample set according to the first Gini index, the ratio of the number of samples in the first subsample set to that in the target sample set, the second Gini index, and the ratio of the number of samples in the second subsample set to that in the target sample set.
In some embodiments, the gain acquisition subunit is configured to:
calculate the Gini index information gain of the feature for classifying the target sample set by the following formula:
$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$
where Gini(D, A) is the Gini index information gain of feature A for classifying the target sample set D; Gini(D_1) is the Gini index for classifying the target sample set D when feature A takes the value a; Gini(D_2) is the Gini index for classifying the target sample set D when feature A does not take the value a; a is one value of feature A; and D_1 and D_2 are the two subsample sets obtained by dividing the target sample set D based on A = a.
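The combination step can be sketched in a few lines of Python. It assumes the formula is the usual size-weighted sum, Gini(D, A) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2), which is what the verbal description of the two Gini indices and the two sample-count ratios implies. The function and parameter names are hypothetical.

```python
def gini_information_gain(gini_d1, size_d1, gini_d2, size_d2):
    """Size-weighted sum of the Gini indices of the two subsample sets.

    D1 and D2 partition D, so |D| is taken as size_d1 + size_d2.
    """
    total = size_d1 + size_d2
    if total == 0:
        raise ValueError("the target sample set is empty")
    return (size_d1 / total) * gini_d1 + (size_d2 / total) * gini_d2
```

Taking the two Gini indices and the two sizes as plain numbers keeps this step separate from how the subsample sets were produced.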
In some embodiments, the dividing feature determining subunit is configured to:
determine the smallest Gini index information gain among the Gini index information gains as the target Gini index information gain;
take the feature corresponding to the target Gini index information gain and its value as the dividing feature and the dividing point, respectively.
In some embodiments, the judging subunit is configured to judge whether the number of sample categories in the post-removal subsample set corresponding to the child node is a preset number; if so, it is determined that the child node satisfies the preset classification termination condition.
An embodiment further provides a gender prediction apparatus. Please refer to FIG. 7, which is a schematic structural diagram of the gender prediction apparatus provided by an embodiment of the present application. The gender prediction apparatus is applied to an electronic device and includes a sample construction unit 401, a classification unit 402, a collection unit 403, and a prediction unit 404, as follows:
the sample construction unit 401, configured to acquire multi-dimensional features of electronic devices as used by users of known gender as samples, and to construct a sample set for gender prediction;
the classification unit 402, configured to classify the sample set according to the Gini index information gain of the features for classifying the sample set, so as to construct a corresponding classification regression tree model, where the output of the classification regression tree model is male or female;
the collection unit 403, configured to collect, according to a prediction time, multi-dimensional features of the electronic device as used by a user of unknown gender as a prediction sample;
the prediction unit 404, configured to predict the gender of the user of unknown gender according to the prediction sample and the classification regression tree model.
In an embodiment, referring to FIG. 8, the classification unit 402 may include:
a node generating subunit 4021, configured to generate a root node of the classification regression tree model, assign the sample set to the root node, and determine the sample set of the root node as the target sample set currently to be classified;
a gain acquisition subunit 4022, configured to obtain the Gini index information gain of the features for classifying the target sample set;
a dividing feature determining subunit 4023, configured to select the current dividing feature and its corresponding dividing point from the features according to the Gini index information gain;
a classification subunit 4024, configured to generate child nodes of the current node and assign the subsample sets to the corresponding child nodes;
a child node generating subunit 4025, configured to remove the dividing feature from the samples in the subsample set to obtain a post-removal subsample set, generate a child node of the current node, and use the post-removal subsample set as the node information of the child node;
a judging subunit 4026, configured to judge whether a child node satisfies a preset classification termination condition; if not, to update the target sample set to the subsample set and trigger the gain acquisition subunit 4022 to perform the step of obtaining the Gini index of the features for the target sample set; if so, to take the child node as a leaf node and set the output of the leaf node according to the sample category of the samples in the subsample set, where the sample category is male or female.
The gain acquisition subunit 4022 may be configured to:
obtain the Gini index of a value of the feature for classifying the target sample set;
obtain, according to the Gini index, the Gini index information gain of the value of the feature for classifying the target sample set.
In an embodiment, the gain acquisition subunit 4022 may be configured to:
divide the target sample set into a first subsample set and a second subsample set according to the value of the feature;
obtain the probabilities of the sample categories in the first subsample set and in the second subsample set;
obtain, according to the probabilities of the sample categories, the Gini index of the value for classifying the target sample set.
In an embodiment, the gain acquisition subunit 4022 may be configured to:
obtain, according to the probabilities of the sample categories in the first subsample set, a first Gini index for classifying the target sample set when the feature takes the value;
obtain, according to the probabilities of the sample categories in the second subsample set, a second Gini index for classifying the target sample set when the feature does not take the value;
obtain the Gini index information gain of the value of the feature for classifying the target sample set according to the first Gini index, the ratio of the number of samples in the first subsample set to that in the target sample set, the second Gini index, and the ratio of the number of samples in the second subsample set to that in the target sample set.
The dividing feature determining subunit 4023 may be configured to:
determine the smallest Gini index information gain among the Gini index information gains as the target Gini index information gain;
take the feature corresponding to the target Gini index information gain and its value as the dividing feature and the dividing point, respectively.
In an embodiment, the judging subunit 4026 may be configured to judge whether the number of sample categories in the post-removal subsample set corresponding to the child node is a preset number;
if so, it is determined that the child node satisfies the preset classification termination condition.
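The check performed by the judging subunit can be sketched as a one-line predicate. The function name is hypothetical, and the default of 1 for the preset number follows the single-category example given elsewhere in the description rather than a value fixed by this passage.

```python
def meets_termination_condition(labels, preset_count=1):
    """True when the number of distinct sample categories equals the preset number."""
    return len(set(labels)) == preset_count
```

With the default preset number, a child node whose subsample set contains only "female" samples satisfies the condition, while a set mixing "female" and "male" does not.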
In an embodiment, the gain acquisition subunit 4022 may be configured to:
divide the target sample set into a first subsample set and a second subsample set according to the value of the feature;
obtain the probabilities of the sample categories in the first subsample set and in the second subsample set;
obtain, according to the probabilities of the sample categories in the first subsample set, a first Gini index for classifying the target sample set when the feature takes the value;
obtain, according to the probabilities of the sample categories in the second subsample set, a second Gini index for classifying the target sample set when the feature does not take the value;
obtain the Gini index information gain of the value of the feature for classifying the target sample set according to the first Gini index, the ratio of the number of samples in the first subsample set to that in the target sample set, the second Gini index, and the ratio of the number of samples in the second subsample set to that in the target sample set.
In an embodiment, the gain acquisition subunit 4022 is configured to:
calculate the Gini index information gain of the feature for classifying the target sample set by the following formula:
$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$
where Gini(D, A) is the Gini index information gain of feature A for classifying the target sample set D; Gini(D_1) is the Gini index for classifying the target sample set D when feature A takes the value a; Gini(D_2) is the Gini index for classifying the target sample set D when feature A does not take the value a; a is one value of feature A; and D_1 and D_2 are the two subsample sets obtained by dividing the target sample set D based on A = a.
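A small worked example may make the calculation concrete. The numbers are invented for illustration. Suppose D holds 10 samples and dividing on A = a yields D_1 with 4 female and 1 male samples and D_2 with 2 female and 3 male samples. Assuming the standard definition Gini(D_i) = 1 - sum of squared category probabilities and the size-weighted sum for Gini(D, A):

```latex
\begin{aligned}
\mathrm{Gini}(D_1) &= 1 - \left(0.8^2 + 0.2^2\right) = 0.32 \\
\mathrm{Gini}(D_2) &= 1 - \left(0.4^2 + 0.6^2\right) = 0.48 \\
\mathrm{Gini}(D, A) &= \tfrac{5}{10} \cdot 0.32 + \tfrac{5}{10} \cdot 0.48 = 0.40
\end{aligned}
```

A competing candidate whose value comes out below 0.40 would be preferred as the dividing feature and dividing point, since the smallest value is selected.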
In an embodiment, the dividing feature determining subunit 4023 is configured to:
determine the smallest Gini index information gain among the Gini index information gains as the target Gini index information gain;
take the feature corresponding to the target Gini index information gain and its value as the dividing feature and the dividing point, respectively.
In an embodiment, the judging subunit 4026 is configured to judge whether the number of sample categories in the post-removal subsample set corresponding to the child node is a preset number; if so, it is determined that the child node satisfies the preset classification termination condition.
For the steps performed by each unit of the gender prediction apparatus, reference may be made to the method steps described in the foregoing method embodiments. The gender prediction apparatus may be integrated in an electronic device such as a mobile phone or a tablet computer.
The terms "module" and "unit" as used herein may be regarded as software objects executed on the computing system. The different components, modules, engines, and services described herein may be regarded as objects implemented on the computing system. The apparatus and method described herein may be implemented in software, and may of course also be implemented in hardware, all of which fall within the protection scope of the present application.
In specific implementations, each of the above units may be implemented as an independent entity, or the units may be combined arbitrarily and implemented as one or several entities. For the specific implementation of each of the above units, refer to the foregoing embodiments; details are not repeated here.
As can be seen from the above, in the gender prediction apparatus of this embodiment, the sample construction unit 401 acquires multi-dimensional features of electronic devices as used by users of known gender as samples and constructs a sample set for gender prediction; the classification unit 402 classifies the sample set according to the Gini index information gain of the features for classifying the sample set, so as to construct a corresponding classification regression tree model whose output is male or female; the collection unit 403 collects, according to the prediction time, multi-dimensional features of the electronic device as used by a user of unknown gender as a prediction sample; and the prediction unit 404 predicts the gender of the user of unknown gender according to the prediction sample and the classification regression tree model. This solution can accurately predict the user's gender.
An embodiment of the present application further provides an electronic device. Referring to FIG. 9, the electronic device 500 includes a processor 501 and a memory 502. The processor 501 is electrically connected to the memory 502.
The processor 501 is the control center of the electronic device 500. It connects the various parts of the entire electronic device using various interfaces and lines, and performs the various functions of the electronic device 500 and processes data by running or loading the computer programs stored in the memory 502 and calling the data stored in the memory 502, thereby monitoring the electronic device 500 as a whole.
The memory 502 may be used to store software programs and modules. The processor 501 executes various functional applications and data processing by running the computer programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the computer programs required for at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the electronic device. In addition, the memory 502 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. Accordingly, the memory 502 may further include a memory controller to provide the processor 501 with access to the memory 502.
In the embodiment of the present application, the processor 501 in the electronic device 500 loads the instructions corresponding to the processes of one or more computer programs into the memory 502 and runs the computer programs stored in the memory 502, thereby implementing various functions, as follows:
acquire multi-dimensional features of electronic devices as used by users of known gender as samples, and construct a sample set for gender prediction;
classify the sample set according to the Gini index information gain of the features for classifying the sample set, so as to construct a corresponding classification regression tree model, where the output of the classification regression tree model is male or female;
collect, according to a prediction time, multi-dimensional features of the electronic device as used by a user of unknown gender as a prediction sample;
predict the gender of the user of unknown gender according to the prediction sample and the classification regression tree model.
In some embodiments, when dividing the sample set according to the Gini index information gain of the features for the sample set so as to construct the corresponding classification regression tree model, the processor 501 may specifically perform the following steps:
generate a root node of the classification regression tree model, and assign the sample set to the root node;
determine the sample set of the root node as the target sample set currently to be classified;
obtain the Gini index information gain of the features for classifying the target sample set;
select the current dividing feature and its corresponding dividing point from the features according to the Gini index information gain;
divide the sample set according to the dividing feature and the dividing point to obtain two subsample sets;
generate child nodes of the current node, and assign the subsample sets to the corresponding child nodes;
judge whether a child node satisfies a preset classification termination condition;
if not, update the target sample set to the subsample set, and return to the step of obtaining the Gini index of the features for the target sample set;
if so, take the child node as a leaf node, and set the output of the leaf node according to the sample category of the samples in the subsample set, where the sample category is male or female.
In some embodiments, when obtaining the Gini index information gain of the features for classifying the target sample set, the processor 501 may specifically perform the following steps:
obtain the Gini index of a value of the feature for classifying the target sample set;
obtain, according to the Gini index, the Gini index information gain of the value of the feature for classifying the target sample set.
In some embodiments, when obtaining the Gini index of the value of the feature for classifying the target sample set, the processor 501 may specifically perform the following steps:
divide the target sample set into a first subsample set and a second subsample set according to the value of the feature;
obtain the probabilities of the sample categories in the first subsample set and in the second subsample set;
obtain, according to the probabilities of the sample categories, the Gini index of the value for classifying the target sample set.
In some embodiments, when obtaining the Gini index of the value for classifying the target sample set according to the probabilities of the sample categories, the processor 501 may further specifically perform the following steps:
obtain, according to the probabilities of the sample categories in the first subsample set, a first Gini index for classifying the target sample set when the feature takes the value;
obtain, according to the probabilities of the sample categories in the second subsample set, a second Gini index for classifying the target sample set when the feature does not take the value.
When obtaining, according to the Gini index, the Gini index information gain of the feature for classifying the target sample set, the processor 501 may specifically perform the following step:
obtain the Gini index information gain of the value of the feature for classifying the target sample set according to the first Gini index, the ratio of the number of samples in the first subsample set to that in the target sample set, the second Gini index, and the ratio of the number of samples in the second subsample set to that in the target sample set.
In some embodiments, when selecting the current dividing feature and its corresponding dividing point from the features according to the Gini index information gain, the processor 501 may specifically perform the following steps:
determine the smallest Gini index information gain among the Gini index information gains as the target Gini index information gain;
take the feature corresponding to the target Gini index information gain and its value as the dividing feature and the dividing point, respectively.
由上述可知,本申请实施例的电子设备,获取已知性别用户使用电子设备的多维特征作为样本,并构建性别预测的样本集;根据所述特征对于样本集分类的基尼指数信息增益对所述样本集进行分类,以构建出相应的分类回归树模型,所述分类回归树模型的输出包括男性、或者女性;根据预测时间采集未知性别用户使用电子设备的多维特征作为预测样本;根据所述预测样本和所述分类回归树模型预测未知性别用户的性别。该方案可以准确地预测用户性别。It can be seen from the above that the electronic device in the embodiment of the present application acquires a multi-dimensional feature of the known gender user using the electronic device as a sample, and constructs a sample set of gender prediction; and the Gini index information gain according to the feature classification for the sample set The sample set is classified to construct a corresponding classification regression tree model, and the output of the classification regression tree model includes a male or a female; the multi-dimensional feature of the electronic device is used as a prediction sample by the unknown gender user according to the prediction time; The sample and the classification regression tree model predict the gender of an unknown gender user. The program accurately predicts user gender.
Referring also to FIG. 10, in some embodiments the electronic device 500 may further include a display 503, a radio frequency circuit 504, an audio circuit 505, and a power source 506, each of which is electrically connected to the processor 501.
The display 503 may be used to display information entered by the user or provided to the user, as well as various graphical user interfaces, which may be composed of graphics, text, icons, video, and any combination thereof. The display 503 may include a display panel; in some embodiments, the display panel may be configured as a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.
The radio frequency circuit 504 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with network devices or other electronic devices and to exchange signals with them.
The audio circuit 505 may be used to provide an audio interface between the user and the electronic device through a speaker and a microphone.
The power source 506 may be used to supply power to the components of the electronic device 500. In some embodiments, the power source 506 may be logically connected to the processor 501 through a power management system, so that charging, discharging, and power consumption are managed through the power management system.
Although not shown in FIG. 10, the electronic device 500 may further include a camera, a Bluetooth module, and the like, which are not described in detail here.
An embodiment of the present application further provides a storage medium storing a computer program which, when run on a computer, causes the computer to perform the gender prediction method of any of the above embodiments, for example: acquiring multi-dimensional features of electronic-device usage by users of known gender as samples and constructing a sample set for gender prediction; classifying the sample set according to the Gini index information gain of each feature for classifying the sample set, so as to construct a corresponding classification and regression tree model whose output is male or female; collecting, at a prediction time, multi-dimensional features of electronic-device usage by a user of unknown gender as a prediction sample; and predicting the gender of that user according to the prediction sample and the classification and regression tree model.
In the embodiments of the present application, the storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of the other embodiments.
It should be noted that, for the gender prediction method of the embodiments of the present application, a person of ordinary skill in the art will understand that all or part of the flow of the method can be carried out by a computer program controlling the relevant hardware. The computer program may be stored in a computer-readable storage medium, for example in a memory of the electronic device, and executed by at least one processor in the electronic device; its execution may include the flow of the embodiments of the gender prediction method. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
For the gender prediction apparatus of the embodiments of the present application, the functional modules may be integrated in one processing chip, may each exist physically on their own, or two or more of them may be integrated in one module. The integrated module may be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as a standalone product, it may also be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc.
The gender prediction method, apparatus, storage medium, and electronic device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method and its core idea. A person skilled in the art may, following the idea of the present application, vary the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (20)

  1. A gender prediction method, comprising:
    acquiring multi-dimensional features of electronic-device usage by users of known gender as samples, and constructing a sample set for gender prediction;
    classifying the sample set according to the Gini index information gain of the features for classifying the sample set, so as to construct a corresponding classification and regression tree model, the output of the classification and regression tree model including male or female;
    collecting, according to a prediction time, multi-dimensional features of electronic-device usage by a user of unknown gender as a prediction sample;
    predicting the gender of the user of unknown gender according to the prediction sample and the classification and regression tree model.
  2. The gender prediction method according to claim 1, wherein dividing the sample set according to the Gini index information gain of the features for the sample set, so as to construct a corresponding classification and regression tree model, comprises:
    generating a root node of the classification and regression tree model, and assigning the sample set to the root node;
    determining the sample set of the root node as the target sample set currently to be classified;
    acquiring the Gini index information gain of the features for classifying the target sample set;
    selecting, according to the Gini index information gain, the current division feature and its corresponding division point from the features;
    dividing the sample set according to the division feature and the division point to obtain two sub-sample sets;
    generating child nodes of the current node, and assigning the sub-sample sets to the corresponding child nodes;
    determining whether a child node satisfies a preset classification termination condition;
    if not, updating the target sample set to the sub-sample set, and returning to the step of acquiring the Gini index information gain of the features for classifying the target sample set;
    if so, taking the child node as a leaf node, and setting the output of the leaf node according to the sample categories of the samples in the sub-sample set, the sample categories including male or female.
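The recursion in claim 2 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes categorical features, the binary "equals value / does not equal value" division described in claims 4 and 5, a termination condition of "only one class remains or no further split is possible", and a majority vote for the leaf output. The sample representation and the feature names `app_type` and `night_use` are hypothetical. The sketch checks the termination condition on entry to each node rather than after creating the child, which yields the same tree.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(samples):
    """Return (score, feature, value, d1, d2) with the smallest weighted Gini index."""
    best = None
    n = len(samples)
    for feature in sorted({f for x, _ in samples for f in x}):
        for value in sorted({x[feature] for x, _ in samples}):
            d1 = [s for s in samples if s[0][feature] == value]
            d2 = [s for s in samples if s[0][feature] != value]
            if not d1 or not d2:
                continue  # this value does not actually divide the set
            score = (len(d1) * gini([y for _, y in d1])
                     + len(d2) * gini([y for _, y in d2])) / n
            if best is None or score < best[0]:
                best = (score, feature, value, d1, d2)
    return best


def build_tree(samples):
    """Recursively build a binary tree; leaves carry an 'output' label."""
    labels = [y for _, y in samples]
    split = best_split(samples)
    # Assumed termination condition: a single class remains, or no split exists.
    if len(set(labels)) == 1 or split is None:
        return {"output": Counter(labels).most_common(1)[0][0]}
    _, feature, value, d1, d2 = split
    return {"feature": feature, "value": value,
            "left": build_tree(d1), "right": build_tree(d2)}


# Hypothetical sample set: (features, known gender).
samples = [
    ({"app_type": "game", "night_use": "yes"}, "male"),
    ({"app_type": "game", "night_use": "no"}, "male"),
    ({"app_type": "shopping", "night_use": "yes"}, "female"),
    ({"app_type": "shopping", "night_use": "no"}, "female"),
]
tree = build_tree(samples)
```

On this toy sample set the root divides on `app_type == "game"` and both children are pure, so the tree has depth one.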
  3. The gender prediction method according to claim 2, wherein acquiring the Gini index information gain of the features for classifying the target sample set comprises:
    acquiring the Gini index of a value of the feature for classifying the target sample set;
    acquiring, according to the Gini index, the Gini index information gain of the value of the feature for classifying the target sample set.
  4. The gender prediction method according to claim 3, wherein acquiring the Gini index of the value of the feature for classifying the target sample set comprises:
    dividing the target sample set into a first sub-sample set and a second sub-sample set according to the value of the feature;
    acquiring the probabilities of the sample categories in the first sub-sample set and in the second sub-sample set;
    acquiring, according to the probabilities of the sample categories, the Gini index of the value for classifying the target sample set.
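The computation in claim 4 can be illustrated with a short sketch. It assumes the usual definition of the Gini index from class probabilities, Gini = 1 - Σ p_k², which the text of this claim does not spell out, and it uses a hypothetical feature name `app_type` and a hypothetical four-sample set.

```python
from collections import Counter


def gini(labels):
    """Gini index from the class probabilities in `labels`: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_by_value(samples, feature, value):
    """First sub-sample set: feature == value; second sub-sample set: the rest."""
    d1 = [s for s in samples if s[0][feature] == value]
    d2 = [s for s in samples if s[0][feature] != value]
    return d1, d2


# Hypothetical samples: (features, known gender).
samples = [
    ({"app_type": "game"}, "male"),
    ({"app_type": "game"}, "male"),
    ({"app_type": "shopping"}, "female"),
    ({"app_type": "game"}, "female"),
]
d1, d2 = split_by_value(samples, "app_type", "game")
g1 = gini([label for _, label in d1])  # two male, one female
g2 = gini([label for _, label in d2])  # one female only
```

Here the first sub-sample set is mixed, so its Gini index is 1 - (4/9 + 1/9) = 4/9, while the second contains a single class and has a Gini index of 0.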
  5. The gender prediction method according to claim 4, wherein acquiring, according to the probabilities of the sample categories, the Gini index of the value for classifying the target sample set comprises:
    acquiring, according to the probabilities of the sample categories in the first sub-sample set, a first Gini index for classifying the target sample set when the feature takes the value;
    acquiring, according to the probabilities of the sample categories in the second sub-sample set, a second Gini index for classifying the target sample set when the feature does not take the value;
    and acquiring, according to the Gini index, the Gini index information gain of the feature for classifying the target sample set comprises:
    acquiring the Gini index information gain of the value of the feature for classifying the target sample set according to the first Gini index, the ratio of the number of samples in the first sub-sample set to that in the target sample set, the second Gini index, and the ratio of the number of samples in the second sub-sample set to that in the target sample set.
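A minimal sketch of the combination described in claim 5 follows. It assumes the four quantities are combined as a sample-count-weighted sum of the two Gini indices, which is the standard CART form; the claim itself lists the inputs but not the arithmetic. The function name `weighted_gini` is hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(labels_d1, labels_d2):
    """Assumed combination: |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)."""
    n = len(labels_d1) + len(labels_d2)
    if n == 0:
        return 0.0
    return (len(labels_d1) / n * gini(labels_d1)
            + len(labels_d2) / n * gini(labels_d2))


d1 = ["male", "male", "female"]  # samples where the feature takes the value
d2 = ["female"]                  # samples where it does not
score = weighted_gini(d1, d2)    # 3/4 * 4/9 + 1/4 * 0 = 1/3
```

Under this reading a smaller score means purer sub-sample sets, which is consistent with claim 7 selecting the minimum.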
  6. The gender prediction method according to claim 3, wherein acquiring, according to the Gini index, the Gini index information gain of the feature for classifying the target sample set comprises:
    calculating the Gini index information gain of the feature for classifying the target sample set by the following formula:
    Figure PCTCN2018116709-appb-100001
    where Gini(D, A) is the Gini index information gain of feature A for classifying the target sample set D; Gini(D1) is the Gini index for classifying the target sample set D when feature A takes the value a; Gini(D2) is the Gini index for classifying the target sample set D when feature A does not take the value a; a is one value of feature A; and D1 and D2 are the two sub-sample sets obtained by dividing the target sample set D on the basis of A = a.
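The formula image is not reproduced in this text. Assuming it has the standard CART form, which matches the inputs listed in claim 5 (the two Gini indices and the two sample-count ratios), it would read as below; the definition of Gini(D_i) in terms of the class probabilities p_k is likewise the usual one and is an assumption here.

```latex
\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2),
\qquad
\mathrm{Gini}(D_i) = 1 - \sum_{k} p_k^{2}
```

For example, with |D| = 4, |D_1| = 3, Gini(D_1) = 4/9, |D_2| = 1, and Gini(D_2) = 0, this gives Gini(D, A) = (3/4)(4/9) + (1/4)(0) = 1/3.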
  7. The gender prediction method according to claim 2, wherein selecting, according to the Gini index information gain, the current division feature and its corresponding division point from the features comprises:
    determining the smallest of the Gini index information gains as the target Gini index information gain;
    taking the feature corresponding to the target Gini index information gain and its value as the division feature and the division point, respectively.
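The selection in claim 7 can be sketched as an exhaustive search over (feature, value) pairs that keeps the pair with the smallest score. The sketch assumes the score is the weighted Gini index of the two resulting sub-sample sets and that ties are broken by the first pair encountered in sorted order; the feature names and samples are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def select_split(samples):
    """Return (score, division_feature, division_point) with the minimum score."""
    best = None
    n = len(samples)
    for feature in sorted({f for x, _ in samples for f in x}):
        for value in sorted({x[feature] for x, _ in samples}):
            d1 = [y for x, y in samples if x[feature] == value]
            d2 = [y for x, y in samples if x[feature] != value]
            if not d1 or not d2:
                continue  # skip values that leave one side empty
            score = len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
            if best is None or score < best[0]:
                best = (score, feature, value)
    return best


# Hypothetical samples: (features, known gender).
samples = [
    ({"app_type": "game", "night_use": "yes"}, "male"),
    ({"app_type": "game", "night_use": "no"}, "male"),
    ({"app_type": "shopping", "night_use": "yes"}, "female"),
    ({"app_type": "shopping", "night_use": "no"}, "female"),
]
choice = select_split(samples)
```

In this toy set `app_type` separates the two classes perfectly (score 0), whereas `night_use` leaves both sides evenly mixed (score 0.5), so `app_type` with the value `"game"` is chosen.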
  8. The gender prediction method according to claim 2, wherein determining whether the child node satisfies the preset classification termination condition comprises:
    determining whether the number of sample categories in the post-removal sub-sample set corresponding to the child node equals a preset number;
    if so, determining that the child node satisfies the preset classification termination condition.
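The termination check of claim 8 reduces to counting the distinct sample categories in a child node's sub-sample set. The sketch below assumes a preset number of 1 (the node is pure); the claim does not fix this value, and the function name is hypothetical.

```python
def meets_termination(labels, preset_class_count=1):
    """True when the number of distinct categories equals the preset number."""
    return len(set(labels)) == preset_class_count


pure_child = meets_termination(["female", "female"])         # one category
mixed_child = meets_termination(["male", "female", "male"])  # two categories
```

With a preset number of 1, a child whose samples are all of one gender becomes a leaf, and a mixed child is divided further.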
  9. The gender prediction method according to claim 2, wherein predicting the gender of the user of unknown gender according to the prediction sample and the classification and regression tree model comprises:
    determining the corresponding leaf node according to the features of the prediction sample and the classification and regression tree model, and taking the output of that leaf node as the prediction result.
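Prediction as described in claim 9 amounts to walking from the root to a leaf. The sketch below assumes a tree stored as nested dictionaries in which an internal node holds a division feature, a division point, and two children, and a leaf holds an `output`; this representation, the hand-built tree, and the feature names are all hypothetical.

```python
def predict(node, features):
    """Follow the division tests from the root until a leaf is reached."""
    while "output" not in node:
        if features.get(node["feature"]) == node["value"]:
            node = node["left"]   # feature takes the division point value
        else:
            node = node["right"]  # feature takes any other value
    return node["output"]


# Hand-built example tree (not learned from real data).
tree = {
    "feature": "app_type", "value": "game",
    "left": {"output": "male"},
    "right": {
        "feature": "night_use", "value": "yes",
        "left": {"output": "female"},
        "right": {"output": "male"},
    },
}
result = predict(tree, {"app_type": "shopping", "night_use": "yes"})
```

The prediction sample fails the root test, passes the `night_use` test, and therefore lands on the leaf whose output is `"female"`.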
  10. A gender prediction apparatus, comprising:
    a sample construction unit, configured to acquire multi-dimensional features of electronic-device usage by users of known gender as samples, and to construct a sample set for gender prediction;
    a classification unit, configured to classify the sample set according to the Gini index information gain of the features for classifying the sample set, so as to construct a corresponding classification and regression tree model, the output of the classification and regression tree model including male or female;
    an acquisition unit, configured to collect, according to a prediction time, multi-dimensional features of electronic-device usage by a user of unknown gender as a prediction sample;
    a prediction unit, configured to predict the gender of the user of unknown gender according to the prediction sample and the classification and regression tree model.
  11. The gender prediction apparatus according to claim 10, wherein the classification unit comprises:
    a node generation subunit, configured to generate a root node of the classification and regression tree model, assign the sample set to the root node, and determine the sample set of the root node as the target sample set currently to be classified;
    a gain acquisition subunit, configured to acquire the Gini index information gain of the features for classifying the target sample set;
    a division feature determination subunit, configured to select, according to the Gini index information gain, the current division feature and its corresponding division point from the features;
    a classification subunit, configured to divide the sample set according to the division feature and the division point to obtain two sub-sample sets;
    a child node generation subunit, configured to generate child nodes of the current node and assign the sub-sample sets to the corresponding child nodes;
    a determination subunit, configured to determine whether a child node satisfies a preset classification termination condition; if not, to update the target sample set to the sub-sample set and trigger the gain acquisition subunit to perform the step of acquiring the Gini index information gain of the features for classifying the target sample set; and if so, to take the child node as a leaf node and set the output of the leaf node according to the sample categories of the samples in the sub-sample set, the sample categories including male or female.
  12. The gender prediction apparatus according to claim 11, wherein the gain acquisition subunit is configured to:
    acquire the Gini index of a value of the feature for classifying the target sample set;
    acquire, according to the Gini index, the Gini index information gain of the value of the feature for classifying the target sample set.
  13. The gender prediction apparatus according to claim 12, wherein the gain acquisition subunit is configured to:
    divide the target sample set into a first sub-sample set and a second sub-sample set according to the value of the feature;
    acquire the probabilities of the sample categories in the first sub-sample set and in the second sub-sample set;
    acquire, according to the probabilities of the sample categories, the Gini index of the value for classifying the target sample set.
  14. The gender prediction apparatus according to claim 11, wherein the division feature determination subunit is configured to:
    determine the smallest of the Gini index information gains as the target Gini index information gain;
    take the feature corresponding to the target Gini index information gain and its value as the division feature and the division point, respectively.
  15. The gender prediction apparatus according to claim 12, wherein the gain acquisition subunit is configured to:
    divide the target sample set into a first sub-sample set and a second sub-sample set according to the value of the feature;
    acquire the probabilities of the sample categories in the first sub-sample set and in the second sub-sample set;
    acquire, according to the probabilities of the sample categories in the first sub-sample set, a first Gini index for classifying the target sample set when the feature takes the value;
    acquire, according to the probabilities of the sample categories in the second sub-sample set, a second Gini index for classifying the target sample set when the feature does not take the value;
    acquire the Gini index information gain of the value of the feature for classifying the target sample set according to the first Gini index, the ratio of the number of samples in the first sub-sample set to that in the target sample set, the second Gini index, and the ratio of the number of samples in the second sub-sample set to that in the target sample set.
  16. The gender prediction apparatus according to claim 12, wherein the gain acquisition subunit is configured to:
    calculate the Gini index information gain of the feature for classifying the target sample set by the following formula:
    Figure PCTCN2018116709-appb-100002
    where Gini(D, A) is the Gini index information gain of feature A for classifying the target sample set D; Gini(D1) is the Gini index for classifying the target sample set D when feature A takes the value a; Gini(D2) is the Gini index for classifying the target sample set D when feature A does not take the value a; a is one value of feature A; and D1 and D2 are the two sub-sample sets obtained by dividing the target sample set D on the basis of A = a.
  17. The gender prediction apparatus according to claim 11, wherein the division feature determination subunit is configured to:
    determine the smallest of the Gini index information gains as the target Gini index information gain;
    take the feature corresponding to the target Gini index information gain and its value as the division feature and the division point, respectively.
  18. The gender prediction apparatus according to claim 11, wherein the determination subunit is configured to determine whether the number of sample categories in the post-removal sub-sample set corresponding to the child node equals a preset number, and if so, to determine that the child node satisfies the preset classification termination condition.
  19. A storage medium having a computer program stored thereon, wherein, when the computer program is run on a computer, the computer is caused to perform the gender prediction method according to any one of claims 1 to 9.
  20. An electronic device, comprising a processor and a memory, the memory storing a computer program, wherein the processor, by invoking the computer program, is configured to perform the gender prediction method according to any one of claims 1 to 9.
PCT/CN2018/116709 2017-12-22 2018-11-21 Gender prediction method and apparatus, storage medium and electronic device WO2019120023A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711407326.6A CN109961077A (en) 2017-12-22 2017-12-22 Gender prediction's method, apparatus, storage medium and electronic equipment
CN201711407326.6 2017-12-22

Publications (1)

Publication Number Publication Date
WO2019120023A1 true WO2019120023A1 (en) 2019-06-27

Family

ID=66992647

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/116709 WO2019120023A1 (en) 2017-12-22 2018-11-21 Gender prediction method and apparatus, storage medium and electronic device

Country Status (2)

Country Link
CN (1) CN109961077A (en)
WO (1) WO2019120023A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209173A (en) * 2020-01-02 2020-05-29 腾讯科技(深圳)有限公司 Performance prediction method, device, storage medium and electronic equipment
CN112308647A (en) * 2020-01-02 2021-02-02 北京京东尚科信息技术有限公司 Method, system, electronic device and storage medium for predicting gender attribute of product
CN112800747A (en) * 2021-02-02 2021-05-14 虎博网络技术(北京)有限公司 Text processing method and device and computer equipment
CN113657917A (en) * 2020-05-12 2021-11-16 上海佳投互联网技术集团有限公司 Visitor gender analysis method and system based on USER-AGENT
CN113822309A (en) * 2020-09-25 2021-12-21 京东科技控股股份有限公司 User classification method, device and non-volatile computer-readable storage medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705642B (en) * 2019-09-30 2023-05-23 北京金山安全软件有限公司 Classification model, classification method, classification device, electronic equipment and storage medium
CN111143441A (en) * 2019-12-30 2020-05-12 北京每日优鲜电子商务有限公司 Gender determination method, device, equipment and storage medium
CN113268654A (en) * 2020-02-17 2021-08-17 北京搜狗科技发展有限公司 User gender identification method and device and electronic equipment
CN111881972B (en) * 2020-07-24 2023-11-07 腾讯音乐娱乐科技(深圳)有限公司 Black-out user identification method and device, server and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120259619A1 (en) * 2011-04-06 2012-10-11 CitizenNet, Inc. Short message age classification
CN103530540A (en) * 2013-09-27 2014-01-22 西安交通大学 User identity attribute detection method based on man-machine interaction behavior characteristics
CN105654131A (en) * 2015-12-30 2016-06-08 小米科技有限责任公司 Classification model training method and device
CN106897727A (en) * 2015-12-21 2017-06-27 百度在线网络技术(北京)有限公司 A kind of user's gender identification method and device
CN106960387A (en) * 2017-04-28 2017-07-18 浙江工商大学 Individual credit risk appraisal procedure and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045931A (en) * 2015-09-02 2015-11-11 南京邮电大学 Video recommendation method and system based on Web mining
CN107169284A (en) * 2017-05-12 2017-09-15 北京理工大学 A kind of biomedical determinant attribute system of selection

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209173A (en) * 2020-01-02 2020-05-29 腾讯科技(深圳)有限公司 Performance prediction method, device, storage medium and electronic equipment
CN112308647A (en) * 2020-01-02 2021-02-02 北京京东尚科信息技术有限公司 Method, system, electronic device and storage medium for predicting gender attribute of product
CN111209173B (en) * 2020-01-02 2023-10-31 腾讯科技(深圳)有限公司 Gender prediction method and device, storage medium and electronic equipment
CN113657917A (en) * 2020-05-12 2021-11-16 上海佳投互联网技术集团有限公司 Visitor gender analysis method and system based on USER-AGENT
CN113822309A (en) * 2020-09-25 2021-12-21 京东科技控股股份有限公司 User classification method, device and non-volatile computer-readable storage medium
CN113822309B (en) * 2020-09-25 2024-04-16 京东科技控股股份有限公司 User classification method, apparatus and non-volatile computer readable storage medium
CN112800747A (en) * 2021-02-02 2021-05-14 虎博网络技术(北京)有限公司 Text processing method and device and computer equipment

Also Published As

Publication number Publication date
CN109961077A (en) 2019-07-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18890098

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18890098

Country of ref document: EP

Kind code of ref document: A1