CN108399418B - User classification method and device

Info

Publication number: CN108399418B
Application number: CN201810063767.7A
Authority: CN
Inventors: 谢仁强
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2018-01-23
Filing date: 2018-01-23
Publication date: 2021-09-03
Anticipated expiration: 2038-01-23
Also published as: CN108399418A

Abstract

The embodiments of the invention provide a user classification method and device, relating to the technical field of computers. The method comprises: acquiring a characteristic value of a user to be classified for a preset characteristic; and inputting the acquired characteristic value into a pre-trained classification model to classify the user, obtaining a classification result for the user. The classification model is a model obtained by training a preset first model with the training information of each sample user; the training information of one sample user comprises the sample user's characteristic value for the preset characteristic and the sample user's label classification; and the training information of each sample user is information determined from user information provided by different data sources. Compared with the prior art, the scheme provided by the embodiments of the invention yields a classification model with a better classification effect and classification results with higher accuracy.

Description

User classification method and device

Technical Field

The invention relates to the technical field of computers, in particular to a user classification method and device.

Background

Users belonging to different categories often differ in how interested they are in the various kinds of information an operator provides; for example, users of different categories are interested in different types of advertisements, videos, and news. For this reason, before pushing information to users, an operator wants to obtain the users' classifications and then push information to them in a targeted manner according to those classifications.

In the prior art, matching rules of user types are generally preset, the characteristics of a user belonging to one user type are defined in the matching rule of one user type, when the classification of the user to be classified is obtained, the characteristics of the user to be classified are obtained first, then the matching rule matched with the characteristics of the user to be classified is searched, and the user type corresponding to the searched rule is the classification of the user to be classified.

For example, the matching rule for the college student user type may be: users aged 18 to 22 are college student users.
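The rule-matching approach described above can be sketched as follows. This is only an illustrative sketch: the names `MatchingRule`, `RULES`, and `classify_by_rules`, and the representation of a user as a dictionary with an `age` key, are assumptions made for the example and do not come from the prior art being described.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MatchingRule:
    """One preset matching rule: a user type and a predicate over user features."""
    user_type: str
    matches: Callable[[dict], bool]


# Hypothetical hand-written rule mirroring the age-based example above.
RULES = [
    MatchingRule("college_student", lambda u: 18 <= u.get("age", -1) <= 22),
]


def classify_by_rules(user: dict) -> Optional[str]:
    """Return the user type of the first rule the user matches, or None."""
    for rule in RULES:
        if rule.matches(user):
            return rule.user_type
    return None
```

Under this sketch, any 20-year-old user is labelled a college student whether or not the user is actually enrolled, which illustrates how the quality of the result depends entirely on the hand-written rule.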

However, in the process of implementing the invention, the inventor found that the prior art has at least the following problem: the preset matching rules for user types are generally produced by staff who analyze various types of users, so the accuracy of the rules is limited by factors such as the staff's experience and whether the characteristics of the users they selected are typical; as a result, classifying users with these matching rules has low accuracy.

Disclosure of Invention

The embodiment of the invention aims to provide a user classification method and device so as to improve the accuracy of user classification.

The specific technical scheme is as follows:

in a first aspect, an embodiment of the present invention provides a user classification method, where the method includes:

acquiring a characteristic value of a user to be classified aiming at a preset characteristic;

inputting the obtained characteristic values into a classification model obtained by pre-training to classify the users to be classified, and obtaining classification results of the users to be classified;

wherein the classification model is: a model obtained by training a preset first model with the training information of each sample user; the training information of one sample user comprises: the sample user's characteristic value for the preset characteristic and the sample user's label classification; and the training information of each sample user is: information determined from user information provided by different data sources.

In one implementation, the classification model is trained by:

obtaining user information provided by different data sources, wherein the user information of a user provided by one data source comprises: a label classification of the user provided by that data source, which serves as a first classification;

according to the first classification, determining a positive sample user from target users, and determining the labeling classification of the positive sample user, wherein the target users are as follows: a user corresponding to the obtained user information;

determining a negative sample user and determining the labeling classification of the negative sample user;

and training the preset first model with the characteristic values of the positive sample users and the negative sample users for the preset characteristic, the label classifications of the positive sample users, and the label classifications of the negative sample users, to obtain the classification model.

In one implementation, the step of determining a positive sample user from the target users according to the first classification and determining an annotation classification of the positive sample user includes:

calculating the label classification of each user by using the classification information of each of the target users, wherein the classification information of one user comprises: a first classification of the user and the confidence of the data source providing that first classification;

calculating the confidence of the label classification of each user;

and selecting a positive sample user of each annotation classification from the target users according to the confidence coefficient of the annotation classification of each user.

In one implementation, the step of calculating the label classification of each user by using the classification information of each user in the target users includes:

calculating the confidence sums of the target data sources of each user, wherein the target data sources of one user are: the data sources that provide the same first classification for the user;

and determining the target classification of each user as the label classification of that user, wherein the target classification of one user is: the first classification for which the confidence sum of the user's target data sources is largest.
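The confidence-sum calculation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `label_classification`, the representation of a user's classification information as `(first_classification, source_confidence)` pairs, and the sample confidence values are all hypothetical. The text does not say how ties between equal confidence sums are resolved; this sketch simply keeps the classification encountered first.

```python
from collections import defaultdict


def label_classification(votes):
    """Pick the first classification with the largest confidence sum.

    votes: iterable of (first_classification, source_confidence) pairs,
           one pair per data source that provided a first classification.
    Returns (label, confidence_sums).
    """
    sums = defaultdict(float)
    for first_class, source_conf in votes:
        # Data sources providing the same first classification are the
        # "target data sources" for that classification; add up their confidences.
        sums[first_class] += source_conf
    if not sums:
        raise ValueError("the user has no first classification")
    # Ties are broken by insertion order (an assumption of this sketch).
    label = max(sums, key=sums.get)
    return label, dict(sums)


# Hypothetical confidences for four data sources describing one user.
votes = [
    ("middle_school_student", 0.9),
    ("college_student", 0.6),
    ("middle_school_student", 0.7),
    ("non_student", 0.5),
]
label, sums = label_classification(votes)
```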

In one implementation, the step of calculating the confidence level of the label classification of each user includes:

calculating the Wilson interval of the label classification of each user according to the following formula, and determining the lower limit of the Wilson interval of each user's label classification as the confidence of that label classification:

$$\frac{\hat{p} + \dfrac{z_{\alpha}^{2}}{2n} \pm z_{\alpha}\sqrt{\dfrac{\hat{p}\,(1-\hat{p})}{n} + \dfrac{z_{\alpha}^{2}}{4n^{2}}}}{1 + \dfrac{z_{\alpha}^{2}}{n}}$$

where $n$ represents the sum of the confidences of the data sources providing first classifications of a user,

$\hat{p}$ represents the ratio of a first value to $n$, the first value being: the maximum of the confidence sums of the user's target data sources,

$z_{\alpha}$ represents the z statistic for $\alpha$, and $\alpha$ represents the confidence level.
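A minimal Python sketch of the Wilson interval lower limit is given below, assuming the standard Wilson score interval. The function name `wilson_lower_bound` and the default `z = 1.96` (the z statistic commonly used for a two-sided 95% confidence level) are assumptions for illustration; the embodiment does not state which confidence level is used.

```python
import math


def wilson_lower_bound(p_hat: float, n: float, z: float = 1.96) -> float:
    """Lower limit of the Wilson score interval.

    p_hat: ratio of the largest confidence sum to n (between 0 and 1).
    n:     sum of the confidences of all data sources that provided
           a first classification for the user (must be positive).
    z:     z statistic for the chosen confidence level (assumed default).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    z2 = z * z
    centre = p_hat + z2 / (2.0 * n)
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z2 / (4.0 * n * n))
    # Clamp at zero to absorb floating-point rounding when p_hat is 0.
    return max(0.0, (centre - margin) / (1.0 + z2 / n))
```

For a fixed ratio, the lower limit grows as n grows, so a label classification supported by more, or more trusted, data sources receives a higher confidence than one supported by a single weak source with the same ratio.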

In one implementation, the step of determining the negative sample user and determining the label classification of the negative sample user includes:

the negative sample user of each label classification is obtained by adopting the following method:

acquiring candidate negative sample users of a second classification, wherein the second classification is one of the labeling classifications;

extracting users in a preset proportion from the positive sample users in the second classification to obtain verification users in the second classification;

setting the label classification of the negative example sample users to a third classification, wherein the third classification is: a classification denoting "not the second classification", and the negative example sample users are: the candidate negative sample users and the verification users of the second classification;

training a preset second model with the characteristic values of the positive example sample users and the negative example sample users for the preset characteristic, the label classification of the positive example sample users, and the label classification of the negative example sample users, to obtain a binary classification model for the label classification, wherein the positive example sample users are: the positive sample users of the second classification other than the verification users;

classifying the negative example sample users with the binary classification model to obtain the confidence of the classification result of each negative example sample user;

obtaining a negative sample selection threshold according to the confidences of the classification results of the verification users;

and selecting the negative sample users of the second classification from the candidate negative sample users according to the negative sample selection threshold.

In one implementation, the negative sample selection threshold is: the minimum of the obtained confidences of the classification results of the verification users.
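The threshold-based selection of negative sample users can be sketched as follows. The sketch makes two assumptions that the text does not state explicitly: the confidence of a classification result is taken to be the binary model's score that a user belongs to the second classification, and a candidate is selected as a negative sample user when its score is strictly below the threshold. The function name `select_negative_users` and the sample scores are hypothetical.

```python
def select_negative_users(candidate_conf, verification_conf):
    """Select negative sample users from the candidate negative sample users.

    candidate_conf:    dict mapping candidate user id -> confidence score
    verification_conf: dict mapping verification user id -> confidence score
    Returns (threshold, selected_user_ids).
    """
    if not verification_conf:
        raise ValueError("at least one verification user is required")
    # Threshold: the smallest confidence observed among the verification users,
    # which are known positive sample users of the second classification.
    threshold = min(verification_conf.values())
    # Assumption: candidates scoring below every verification user are
    # treated as negative sample users of the second classification.
    selected = [uid for uid, conf in candidate_conf.items() if conf < threshold]
    return threshold, selected


# Hypothetical scores produced by the binary classification model.
candidates = {"u1": 0.05, "u2": 0.40, "u3": 0.90}
verification = {"v1": 0.35, "v2": 0.80}
threshold, negatives = select_negative_users(candidates, verification)
```

The verification users act as known positives hidden among the negative examples: a candidate that scores lower than every one of them is unlikely to belong to the second classification.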

In one implementation, the user is a device level user or an account level user.

In one implementation, the predetermined characteristic is at least one of the following:

the age of the user, the location of the user, the type of video the user watches, the type of e-book the user reads, the time the user watches the video, the time the user reads the e-book, the group characteristics the user joins, the specific functions the user uses.

In a second aspect, an embodiment of the present invention provides an apparatus for classifying users, where the apparatus includes:

the characteristic value acquisition module is used for acquiring the characteristic value of the user to be classified aiming at the preset characteristic;

the user classification module is used for inputting the acquired characteristic values into a classification model obtained by pre-training to classify the users to be classified, and obtaining classification results of the users to be classified;

In one implementation, the apparatus further includes:

a classification model obtaining module for obtaining the classification model;

wherein the classification model obtaining module comprises:

the user information obtaining sub-module is used for obtaining user information provided by different data sources, wherein the user information of one user provided by one data source comprises: the user's label classification provided by the data source is used as a first classification;

the positive sample user determining submodule is used for determining a positive sample user from the target users according to the first classification and determining the labeling classification of the positive sample user, wherein the target users are as follows: a user corresponding to the obtained user information;

the negative sample user determining submodule is used for determining negative sample users and determining the labeling classification of the negative sample users;

and the model training submodule is used for training the preset first model with the characteristic values of the positive sample users and the negative sample users for the preset characteristic, the label classifications of the positive sample users, and the label classifications of the negative sample users, to obtain the classification model.

In one implementation, the positive sample user determination sub-module includes:

a label classification calculating unit, configured to calculate a label classification of each user using classification information of each user in the target users, where the classification information of one user includes: a first classification of the user, a confidence level of a data source providing the first classification of the user;

the annotation classification confidence coefficient calculation unit is used for calculating the confidence coefficient of the annotation classification of each user;

and the positive sample user selection unit is used for selecting the positive sample user of each label classification from the target users according to the confidence coefficient of the label classification of each user.

In one implementation, the label classification calculating unit includes:

a confidence sum calculating subunit, configured to calculate the confidence sums of the target data sources of each user, where the target data sources of one user are: the data sources that provide the same first classification for the user;

and the label classification determining subunit is used for determining the target classification of each user as the label classification of that user, wherein the target classification of one user is: the first classification for which the confidence sum of the user's target data sources is largest.

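In one implementation, the label classification confidence calculating unit is specifically configured to: calculate the Wilson interval of the label classification of each user according to the following formula, and determine the lower limit of the Wilson interval of each user's label classification as the confidence of that label classification:

$$\frac{\hat{p} + \dfrac{z_{\alpha}^{2}}{2n} \pm z_{\alpha}\sqrt{\dfrac{\hat{p}\,(1-\hat{p})}{n} + \dfrac{z_{\alpha}^{2}}{4n^{2}}}}{1 + \dfrac{z_{\alpha}^{2}}{n}}$$

where $n$ represents the sum of the confidences of the data sources providing first classifications of a user,

$\hat{p}$ represents the ratio of a first value to $n$, the first value being: the maximum of the confidence sums of the user's target data sources,

$z_{\alpha}$ represents the z statistic for $\alpha$, and $\alpha$ represents the confidence level.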

In one implementation, the negative sample user determination submodule is specifically configured to obtain a negative sample user for each label classification;

wherein the negative example user determination submodule comprises:

the candidate negative sample user acquiring unit is used for acquiring a candidate negative sample user of a second classification, wherein the second classification is one of the labeling classifications;

the verification user obtaining unit is used for extracting users with a preset proportion from the positive sample users of the second classification to obtain verification users of the second classification;

the label classification setting unit is used for setting the label classification of the negative example sample users to a third classification, wherein the third classification is: a classification denoting "not the second classification", and the negative example sample users are: the candidate negative sample users and the verification users of the second classification;

the model training unit is used for training a preset second model with the characteristic values of the positive example sample users and the negative example sample users for the preset characteristic, the label classification of the positive example sample users, and the label classification of the negative example sample users, to obtain a binary classification model for the label classification, wherein the positive example sample users are: the positive sample users of the second classification other than the verification users;

the negative example confidence obtaining unit is used for classifying the negative example sample users with the binary classification model to obtain the confidence of the classification result of each negative example sample user;

the negative sample selection threshold obtaining unit is used for obtaining a negative sample selection threshold according to the confidences of the classification results of the verification users;

and the negative sample user selection unit is used for selecting the negative sample users in the second classification from the candidate negative sample users according to the negative sample selection threshold.

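In one implementation, the preset feature is at least one of the following:

the age of the user, the location of the user, the type of video the user watches, the type of e-book the user reads, the time the user watches the video, the time the user reads the e-book, the group characteristics the user joins, the specific functions the user uses.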

In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any of claims 1-9 when executing a program stored in the memory.

In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to execute any of the user classification methods described above.

In yet another aspect of the present invention, the present invention further provides a computer program product containing instructions, which when run on a computer, causes the computer to execute any of the user classification methods described above.

As can be seen from the above, in the scheme provided by the embodiments of the present invention, the user information used to determine the sample users is provided by multiple data sources, so the resulting sample data is more comprehensive and user information with typical characteristics is more easily obtained. A classification model obtained by training the preset model with more comprehensive sample data is more robust, classifies more stably, and has a better classification effect. In addition, because users are classified by the classification model, staff do not need to set matching rules for user types based on experience, which avoids the influence of factors such as insufficient staff experience. Compared with the prior art, classifying users with the classification model obtained in the embodiments of the invention therefore yields classification results with higher accuracy.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.

Fig. 1 is a flowchart illustrating a user classification method according to an embodiment of the present invention.

Fig. 2 is a schematic flow chart of a first classification model training method according to an embodiment of the present invention.

Fig. 3 is a flowchart illustrating a second classification model training method according to an embodiment of the present invention.

Fig. 4 is a flowchart illustrating a third classification model training method according to an embodiment of the present invention.

Fig. 5 is a flowchart illustrating a fourth classification model training method according to an embodiment of the present invention.

Fig. 6 is a schematic structural diagram of a user classification device according to an embodiment of the present invention.

Fig. 7 is a schematic structural diagram of a first classification model training apparatus according to an embodiment of the present invention.

Fig. 8 is a schematic structural diagram of a second classification model training apparatus according to an embodiment of the present invention.

Fig. 9 is a schematic structural diagram of a third classification model training apparatus according to an embodiment of the present invention.

Fig. 10 is a schematic structural diagram of a fourth classification model training apparatus according to an embodiment of the present invention.

Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.

In the prior art, when users are classified, the matching rules of the user types used are generally generated by analyzing various types of users by staff, so that the accuracy of the matching rules of the user types is low due to the influence of factors such as the experience of the staff, whether the characteristics of the selected users have typicality and the like, and further, when the matching rules of the user types are adopted to classify the users, the accuracy is low.

In order to solve the problems in the prior art, an embodiment of the present invention provides a user classification method, including:

wherein the classification model is: a model obtained by training a preset first model with the training information of each sample user; the training information of one sample user comprises: the sample user's characteristic value for the preset characteristic and the sample user's label classification; and the training information of each sample user is: information determined from user information provided by different data sources.

As shown in fig. 1, a schematic flow chart of a user classification method provided in an embodiment of the present invention is shown, where the method includes:

S101: acquiring a characteristic value of a user to be classified for a preset characteristic;

in one implementation, the preset feature may be at least one of the following features: the age of the user, the location of the user, the type of video the user watches, the type of e-book the user reads, the time the user watches the video, the time the user reads the e-book, the group characteristics the user joins, the specific functions the user uses, etc.

Wherein the age of the user can be expressed in the form of age group, such as 18-25 years, 10-18 years, 25 years and above;

the location of the user may be a geographical location where the user uses various application functions provided by the service provider, for example, the location of the user is a campus, a workplace, or a residence;

the type of the video watched by the user can be determined by information such as the category of the video watched by the user, the label of the video, the video publishing time, the character related to the video and the like, for example, the type of the video watched by the user can be an education type, an entertainment type, a legal type and the like;

the type of e-book the user reads may be determined by information such as the category and the publishing time of the e-books read by the user; for example, the type of e-book the user reads may be a time-travel type, a fantasy type, a romance type, and the like;

the time for the user to watch the video may be the time length for the user to watch the video within a preset time, for example, the time length for the user to watch the video every day; it may also be a time period during which the user watches the video, for example, which time period of each day the user watches the video, etc.;

the time for the user to read the electronic book may be a time period for the user to read the electronic book within a preset time, for example, a time period for the user to read the electronic book every day; it may also be a time period for the user to read the e-book, for example, which time period of each day the user reads the e-book, etc.;

the group that the user joins may be represented by an identifier indicating whether the user joins a certain group; it can also be represented by a tag that the user joins a group, for example, the group can be an arcade bubble circle, a star arcade bubble circle, etc.;

the particular function used by the user may be whether the user uses a function provided by the service provider, such as whether the user uses a "super curriculum schedule" function, whether the user is a curiosity fantasy live user, and the like.

It should be noted that the present invention only uses the above as an example to describe the content of the preset feature, and does not limit the embodiment of the present invention, and of course, the preset feature may also include other features, for example: whether the user uses an education network IP, etc.
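One possible way to turn such preset features into numeric characteristic values is sketched below. The vocabularies, the dictionary keys (`age_group`, `location`, `video_type`, `daily_watch_minutes`, `uses_education_ip`), and the one-hot encoding itself are all assumptions made for illustration; the embodiment does not prescribe any particular encoding.

```python
# Hypothetical vocabularies drawn from the examples given in the text.
AGE_GROUPS = ["10-18", "18-25", "25+"]
LOCATIONS = ["campus", "workplace", "residence"]
VIDEO_TYPES = ["education", "entertainment", "legal"]


def one_hot(value, vocabulary):
    """Return a 0/1 vector with a 1 at the position of value, if present."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def encode_features(user: dict) -> list:
    """Map a user's raw attributes to a flat list of characteristic values."""
    vector = []
    vector += one_hot(user.get("age_group"), AGE_GROUPS)
    vector += one_hot(user.get("location"), LOCATIONS)
    vector += one_hot(user.get("video_type"), VIDEO_TYPES)
    # Duration of video watching per day, in minutes (assumed unit).
    vector.append(float(user.get("daily_watch_minutes", 0.0)))
    # Whether the user connects from an education network IP.
    vector.append(1.0 if user.get("uses_education_ip") else 0.0)
    return vector


example_vector = encode_features({
    "age_group": "18-25",
    "location": "campus",
    "video_type": "education",
    "daily_watch_minutes": 90,
    "uses_education_ip": True,
})
```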

S102: inputting the obtained characteristic values into a classification model obtained by pre-training to classify the users to be classified, and obtaining classification results of the users to be classified;

The users are classified from different angles, and various different user classification forms can be obtained, for example, the users can be classified into male users and female users according to gender, can be classified into underage users, young users, middle-aged users and old users according to age, and can be classified into middle-school student users, college student users and non-student users according to whether the users are in school or not.

Therefore, when the user classification method provided by the embodiment of the present invention is applied, a technician first needs to determine a user classification form according to the needs of the practical application, and then determine, according to the determined user classification form, how many classes the classification model distinguishes. For example, when the user classification form classifies users into male users and female users, the classification model is a binary classification model, and the classification result is that the user is a male user or that the user is a female user; when the user classification form classifies users into underage users, young users, middle-aged users and old users, the classification model is a four-class classification model, and the classification result is that the user is an underage user, a young user, a middle-aged user or an old user.

The label classification of the sample user is label information indicating the category of the sample user;

The data source is an information source capable of providing user information. For example, the data source may be the operator's user registration system, in which case it provides the user information filled in when a user registers an account; or it may be the client log that a client provided by the operator reports to the operator's management server while the user uses the client, in which case it provides the usage information recorded in the client log. In the embodiment of the present invention, any information source capable of providing user information may be used as a data source, and the present invention does not limit the type of the data source.

In one implementation, as shown in fig. 2, the classification model may be trained by the following methods:

S201: obtaining user information provided by different data sources, wherein the user information of a user provided by one data source comprises: a label classification of the user provided by that data source, which serves as a first classification;

The label classification of a user is information, provided by a data source, that indicates the user's category. Because different data sources have different characteristics, the user information they provide may differ, and the label classifications they provide may be the same or different. The label classification in the user information provided by one data source is used as a first classification of the user; for the same user, if several data sources provide user information containing a label classification, the user has that many first classifications.

S202: according to the first classification, determining a positive sample user from target users, and determining the labeling classification of the positive sample user, wherein the target users are as follows: a user corresponding to the obtained user information;

S203: determining a negative sample user and determining the labeling classification of the negative sample user;

when the user classification method provided by the embodiment of the invention is applied, a technician already determines the user classification form according to the actual application requirements, namely determines which categories the users are classified into, and sets a label classification for the users of each category, wherein the label classification is information used for indicating which category the users are.

The classification model is obtained by training a preset model by using training information of sample users, and therefore, in order to obtain a classification model corresponding to the user classification form, sample users of various categories, that is, sample users of various label classifications, need to be determined among users corresponding to user information provided by different data sources.

Due to the characteristics of the user classification forms and the characteristics of the user information provided by different data sources, under the condition of some user classification forms, not all sample users with label classification can be directly obtained by analyzing the user information provided by the different data sources, and the sample users with label classification which cannot be directly obtained need to be obtained by other methods.

In order to distinguish sample users obtained in different ways, the sample users obtained directly by analyzing the user information provided by different data sources are used as positive sample users. The obtained positive sample users are grouped by label classification: positive sample users with the same label classification belong to the same category, and the label classification represents that category. As can be seen, the positive sample users may cover multiple label classifications.

And taking the sample user obtained by other modes as a negative sample user, and setting a label classification for the obtained negative sample user, wherein the set label classification represents the category of the negative sample user.

For the same classification model, the obtained positive sample users and the negative sample users form sample users of all classes of users in the user classification form adopted by the classification model.

S204: training a preset first model with the characteristic values of the positive sample users and the negative sample users for the preset characteristic, the label classifications of the positive sample users, and the label classifications of the negative sample users, to obtain the classification model.
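Step S204 can be sketched as follows. The embodiment does not specify what the preset first model is, so this sketch substitutes a very simple nearest-centroid classifier purely as a stand-in; the class `NearestCentroidModel`, the function `train_classification_model`, and the sample data are hypothetical.

```python
from collections import defaultdict


class NearestCentroidModel:
    """Stand-in for the preset first model (an assumption of this sketch)."""

    def fit(self, features, labels):
        sums = {}
        counts = defaultdict(int)
        for x, y in zip(features, labels):
            if y not in sums:
                sums[y] = [0.0] * len(x)
            sums[y] = [s + v for s, v in zip(sums[y], x)]
            counts[y] += 1
        # One centroid (mean feature vector) per label classification.
        self.centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}
        return self

    def predict(self, x):
        def squared_distance(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))
        return min(self.centroids, key=lambda y: squared_distance(self.centroids[y]))


def train_classification_model(positive_samples, negative_samples):
    """Each sample is a (characteristic_values, label_classification) pair."""
    samples = list(positive_samples) + list(negative_samples)
    features = [x for x, _ in samples]
    labels = [y for _, y in samples]
    return NearestCentroidModel().fit(features, labels)


# Hypothetical two-feature training data.
positives = [([1.0, 0.0], "college_student"), ([0.9, 0.1], "college_student")]
negatives = [([0.0, 1.0], "non_college_student"), ([0.1, 0.8], "non_college_student")]
model = train_classification_model(positives, negatives)
```

Once trained, calling `model.predict` on the characteristic values of a user to be classified corresponds to step S102 of the classification method.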

In one implementation, as shown in fig. 3, the determining, at S202, a positive sample user from the target users according to the first classification, and determining an annotation classification of the positive sample user includes:

S2021: calculating the label classification of each user by using the classification information of each of the target users, wherein the classification information of one user comprises: a first classification of the user and the confidence of the data source providing that first classification;

The target users are the users corresponding to the obtained user information, which is provided by different data sources. Because different data sources obtain user information through different channels and in different ways, the users covered by the user information of different data sources may be the same or different, and the user information that different data sources provide for the same user may also be the same or different.

Thus, one target user may have first classifications provided by several different data sources, and these first classifications may differ. For example, suppose the user classification scheme has three categories, namely middle school student, college student and non-student, and four different data sources A, B, C and D provide user information for the same user X: the first classification provided by data source A is middle school student, that provided by data source B is college student, that provided by data source C is middle school student, and that provided by data source D is non-student.

To determine positive sample users from the target users and determine their label classifications, note that the label classification of each positive sample user can be only one first classification. The first classifications that different data sources provide for the same user therefore need to be processed so that a single first classification is determined as the label classification of that user.

The characteristics of different data sources are different, so that the reliability of the user information provided by different data sources is different, and the reliability can be represented by confidence.

Based on this, the first classification of a user provided by different data sources and the confidence level of the data source providing the first classification of the user can be used to calculate the label classification of the user, that is, the label classification of each user is calculated by using the classification information of each user in the target users.

S2022: calculating the confidence of the label classification of each user;

Although the above method yields a label classification for each user, that label classification is calculated from the first classifications of the user and the confidences of the data sources providing them, so the reliability of the label classifications of different users may differ considerably, and not every user belonging to a label classification can be used as a positive sample user of that label classification. To make the classification results of the trained classification model more accurate, users whose label classification is more reliable need to be selected as the positive sample users of that label classification. The reliability of a user's label classification is represented by the confidence of that label classification.

In one implementation, the confidence level of the annotation classification for each user can be calculated by:

Specifically, the Wilson interval of the label classification of each user is calculated according to the following formula, and the lower limit of the Wilson interval is determined as the confidence of the label classification of that user:

where the z statistic in the formula corresponds to α, and α represents the confidence level.
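For orientation, the lower limit of a Wilson interval can be sketched with the standard Wilson score formula. This is only an illustrative sketch: the inputs `p_hat` (the observed share of evidence supporting the label classification) and `n` (the amount of evidence), the two-sided reading of the confidence level, and the function name are all assumptions, since the text above does not state how these quantities are defined.

```python
import math
from statistics import NormalDist


def wilson_lower_bound(p_hat, n, confidence_level=0.95):
    """Lower limit of the standard Wilson score interval.

    p_hat: observed proportion in [0, 1] (assumed input, see lead-in).
    n: number of observations behind p_hat (assumed input).
    confidence_level: two-sided confidence level (an assumption).
    """
    if n <= 0:
        return 0.0
    # z statistic of the standard normal distribution for the chosen level.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    centre = p_hat + z * z / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


# Hypothetical inputs: half of 100 observations support the label.
print(wilson_lower_bound(0.5, 100))
```

With the same proportion, a larger `n` gives a lower limit closer to `p_hat`, which is why the lower limit is a cautious measure of reliability when little evidence is available.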

S2023: selecting the positive sample users of each annotation classification from the target users according to the confidence of the annotation classification of each user.

The positive sample users may be selected as follows. First, the target users are grouped by label classification. Then, for each label classification, the target users whose label classification confidence is not less than the preset confidence threshold of that label classification are selected as the positive sample users of that label classification. The confidence threshold of each label classification is set by the skilled person according to the needs of the practical application, and the thresholds of different label classifications may be the same or different. In addition, the confidence threshold of each label classification may be determined each time positive sample users are selected, or it may be preset and used directly each time. The present application does not limit the relative sizes of the confidence thresholds of the label classifications or the method of determining them.

Alternatively, the target users are first grouped by label classification, and the users of each label classification are sorted in descending order of confidence. Then, according to the preset number of positive sample users for each label classification, users are taken starting from the one with the highest confidence until the preset number is reached. The users taken are the positive sample users of that label classification.
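The two selection methods described above can be sketched as follows. This is a minimal illustration, assuming each target user is represented as a `(user_id, label, confidence)` tuple; the function names, the sample users and the numeric values are hypothetical.

```python
def select_by_threshold(users, thresholds):
    """Keep users whose confidence is not less than their label's threshold."""
    selected = {}
    for user_id, label, confidence in users:
        if confidence >= thresholds[label]:
            selected.setdefault(label, []).append(user_id)
    return selected


def select_top_n(users, quota):
    """Keep the quota[label] most confident users of each label."""
    by_label = {}
    for user_id, label, confidence in users:
        by_label.setdefault(label, []).append((confidence, user_id))
    result = {}
    for label, members in by_label.items():
        # Sort in descending order of confidence, then cut at the quota.
        ranked = sorted(members, key=lambda m: m[0], reverse=True)
        result[label] = [user_id for _, user_id in ranked[:quota[label]]]
    return result


# Hypothetical target users with already computed label confidences.
users = [
    ("u1", "middle school student", 0.62),
    ("u2", "middle school student", 0.35),
    ("u3", "college student", 0.71),
    ("u4", "college student", 0.55),
]
print(select_by_threshold(users, {"middle school student": 0.5, "college student": 0.6}))
print(select_top_n(users, {"middle school student": 1, "college student": 2}))
```

The threshold variant gives no control over how many users are kept per label classification, while the top-N variant fixes the count but lets the lowest accepted confidence vary.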

It should be noted that, the embodiment of the present invention only describes, by taking the above as an example, a method for selecting a positive sample user of each label classification from target users according to the confidence of the label classification of each user, and does not limit the embodiment of the present invention.

As can be seen from the above, positive sample users are determined according to the confidence of each user's annotation classification; once a positive sample user is determined, the annotation classification of that positive sample user is determined as well.

In one implementation, as shown in fig. 4, S2021 (calculating the label classification of each user by using the classification information of each user among the target users) includes:

S2021A: calculating the confidence sum of the target data sources of each user, wherein the target data sources of one user are: the data sources that provide the same first classification for the user;

Because the characteristics of the data sources differ, the first classifications that the data sources provide for a user may be the same or different. The data sources that provide the same first classification for a user are taken as target data sources, and their confidences are added up; the result is the confidence sum of the target data sources. The confidence sum indicates how reliable the corresponding first classification is as the label classification of the user: the larger the confidence sum, the more reliable that first classification is as the user's label classification.

For example, assume that in the above example the confidence of data source A is 8, the confidence of data source B is 6, the confidence of data source C is 9, and the confidence of data source D is 7. Thus, when the first classification of user X is middle school student, the confidence sum of the target data sources is 17 (8 + 9); when the first classification of user X is college student, the confidence sum of the target data sources is 6; and when the first classification of user X is non-student, the confidence sum of the target data sources is 7.

S2021B: determining the target classification of each user as the label classification of that user, wherein the target classification of one user is: the first classification whose target data sources have the largest confidence sum for that user.

For example, in the above example, the first classification of user X with the largest confidence sum of target data sources is middle school student, so the target classification of user X is middle school student. The target classification is determined as the label classification of user X, so the label classification of user X is middle school student.
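The confidence-sum calculation for user X can be reproduced with a short sketch. The function name and the dictionary layout are assumptions made for illustration; the data source confidences and first classifications are those of the example above. How ties between equal confidence sums are resolved is not stated in the text, so this sketch simply keeps the first classification encountered.

```python
from collections import defaultdict


def label_classification(first_classes, source_confidence):
    """Return (label classification, confidence sums) for one user.

    first_classes: data source name -> first classification it provides.
    source_confidence: data source name -> confidence of that data source.
    """
    sums = defaultdict(float)
    for source, first_class in first_classes.items():
        # Add up the confidences of the data sources that agree on a class.
        sums[first_class] += source_confidence[source]
    # The first classification with the largest confidence sum is the label.
    label = max(sums, key=sums.get)
    return label, dict(sums)


# User X and data sources A to D from the example above.
confidence = {"A": 8, "B": 6, "C": 9, "D": 7}
user_x = {
    "A": "middle school student",
    "B": "college student",
    "C": "middle school student",
    "D": "non-student",
}
label, sums = label_classification(user_x, confidence)
print(label, sums)
```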

In one implementation, as shown in fig. 5, S203 (determining negative sample users and determining the label classification of the negative sample users) includes:

S2031: acquiring candidate negative sample users of a second classification, wherein the second classification is one of the labeling classifications;

For example, when the users of Aiqiyi are classified, they are divided into three categories: middle school students, college students and non-students. Middle school student users are defined as users aged 10-18, college student users as users aged 18-25, and all users over 25 as non-student users.

By directly processing the user information provided by different data sources, only middle school student and college student sample users can be obtained; non-student sample users need to be obtained by other methods. Therefore, the middle school student sample users and the college student sample users are the positive sample users used for training the classification model, and the non-student sample users are the negative sample users used for training the classification model.

Middle school student and college student are each a second classification. To make this implementation easier to understand, college student is taken as the second classification in the following explanation.

Obtaining the candidate negative sample users of the second classification means obtaining the candidate non-student sample users corresponding to college students. That is, a preset number of users aged 18-25 are randomly selected from the users corresponding to the user information provided by the different data sources, and these users are the candidate non-student sample users of college students.

S2032: extracting users in a preset proportion from the positive sample users in the second classification to obtain verification users in the second classification;

Usually the preset proportion is 15%, but other proportions, such as 10%, may also be used.

Continuing with the above example, a preset proportion, for example 10%, of the obtained college student sample users are randomly extracted, and the extracted sample users are referred to as verification users.

S2033: setting the label classification of the negative example users as a third classification, wherein the third classification is a classification representing "not the second classification", and the negative example users are: the candidate negative sample users and the verification users of the second classification;

The label classification of a negative example user is not determined from that user's information; it is assigned.

Continuing with the above example, the extracted verification users are combined with the randomly selected candidate non-student sample users of college students to form the negative example users, and the label classification of these negative example users is set to non-student; the non-student label classification so set is the third classification.
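Steps S2032 and S2033 can be sketched as a simple split. This is only an illustration: the function name, the user identifiers, the fixed random seed and the assumption that candidate and positive user identifiers are distinct are all hypothetical, and the 10% proportion is taken from the example above.

```python
import random


def build_binary_training_set(positive_users, candidate_negatives,
                              second_class="college student",
                              third_class="non-student",
                              ratio=0.10, seed=0):
    """Split off verification users and assemble the negative example users."""
    rng = random.Random(seed)
    positive_users = list(positive_users)
    # S2032: randomly extract a preset proportion of the positive sample users.
    n_verify = round(len(positive_users) * ratio)
    verification = rng.sample(positive_users, n_verify)
    held_out = set(verification)
    # The remaining positive sample users are the positive example users.
    positive_examples = [u for u in positive_users if u not in held_out]
    # S2033: candidates plus verification users all get the third classification.
    negative_examples = list(candidate_negatives) + verification
    labels = {u: second_class for u in positive_examples}
    labels.update({u: third_class for u in negative_examples})
    return positive_examples, negative_examples, verification, labels


college = [f"college_{i}" for i in range(100)]
candidates = [f"candidate_{i}" for i in range(100)]
pos, neg, verify, labels = build_binary_training_set(college, candidates)
print(len(pos), len(neg), len(verify))  # 90 110 10
```

The verification users are known college students deliberately labelled as non-students, which is what later allows their classification confidences to calibrate the negative sample selection threshold.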

S2034: training a preset second model using the feature values of the positive example users and the negative example users for the preset features, the label classification of the positive example users and the label classification of the negative example users, to obtain a binary classification model for the label classification, wherein the positive example users are: the positive sample users of the second classification other than the verification users;

Continuing with the above example, the remaining 90% of the college student sample users are the positive example users. A preset second model is trained using the feature values of the positive example users and the obtained negative example users for the preset features and their respective label classifications, to obtain a binary classification model. The label classification of the positive example users is college student and that of the negative example users is non-student, and the obtained binary classification model is used to classify a user as a college student user or a non-student user.

S2035: classifying the negative example users by using the binary classification model to obtain the confidence of the classification result of each negative example user;

Continuing with the above example, the negative example users may be classified with the obtained binary classification model according to their feature values for the preset features, so as to obtain the classification result of each negative example user. The classification result of a negative example user may represent how reliably that user is a college student sample user, expressed as the confidence of the classification result: the higher the confidence, the more reliably the negative example user is a college student sample user. Alternatively, the classification result may represent how reliably the negative example user is a non-student sample user, again expressed as the confidence of the classification result: the higher the confidence, the more reliably the negative example user is a non-student sample user.

S2036: obtaining a negative sample selection threshold according to the confidences of the classification results of the verification users;

Since the verification users among the negative example users were extracted from the positive sample users already obtained, the confidences of the classification results of the verification users can be read from the confidences obtained for the classification results of the negative example users.

In one implementation, when the confidence of a negative example user's classification result indicates how reliably that user is a college student user, the negative sample selection threshold may be the smallest of the confidences obtained for the classification results of the verification users.

In another implementation, when the confidence of a negative example user's classification result indicates how reliably that user is a non-student user, the negative sample selection threshold may be the largest of the confidences obtained for the classification results of the verification users.

Continuing with the above example, assume that the confidence of a negative example user's classification result indicates how reliably that user is a college student user. Since the verification users are known college student sample users, the lowest confidence among their classification results marks the lowest confidence at which a negative example user may still be a college student user. When the confidence of a negative example user's classification result is below this lowest value, that user is very unlikely to be a college student sample user. The lowest confidence among the classification results of the verification users is therefore used as the negative sample selection threshold.

S2037: selecting the negative sample users of the second classification from the candidate negative sample users according to the negative sample selection threshold.

In one implementation, when the confidence of a negative example user's classification result indicates how reliably that user is a college student user, the negative sample users of the second classification may be selected as follows: the candidate negative sample users whose classification result confidence is less than the negative sample selection threshold are selected as the negative sample users.

In another implementation, when the confidence of a negative example user's classification result indicates how reliably that user is a non-student user, the negative sample users of the second classification may be selected as follows: the candidate negative sample users whose classification result confidence is greater than the negative sample selection threshold are selected as the negative sample users.
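Steps S2036 and S2037 can be sketched for both readings of the confidence. The sketch assumes the binary classification model has already produced one confidence per user; the function name, the `confidence_means` switch and the numeric confidences are hypothetical.

```python
def select_negative_samples(candidate_conf, verification_conf,
                            confidence_means="second"):
    """Return (threshold, selected candidate negative sample users).

    candidate_conf / verification_conf: user id -> classification confidence.
    confidence_means: "second" if the confidence expresses membership of the
    second classification (e.g. college student), "third" if it expresses
    membership of the third classification (e.g. non-student).
    """
    if confidence_means == "second":
        # Threshold: the smallest confidence among the verification users.
        threshold = min(verification_conf.values())
        chosen = [u for u, c in candidate_conf.items() if c < threshold]
    elif confidence_means == "third":
        # Threshold: the largest confidence among the verification users.
        threshold = max(verification_conf.values())
        chosen = [u for u, c in candidate_conf.items() if c > threshold]
    else:
        raise ValueError("confidence_means must be 'second' or 'third'")
    return threshold, chosen


# Hypothetical confidences that each user is a college student.
verification = {"v1": 0.81, "v2": 0.64, "v3": 0.92}
candidates = {"c1": 0.12, "c2": 0.70, "c3": 0.58}
threshold, negatives = select_negative_samples(candidates, verification)
print(threshold, negatives)  # 0.64 ['c1', 'c3']
```

Candidate `c2` is dropped because its confidence is no lower than that of a known college student, so it cannot safely be treated as a non-student.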

In one implementation, the user may be a device level user or an account level user.

When the user is a device-level user, the user information is associated with the device used, and the obtained user information is comprehensive information produced by integrating the information of all users of the device. Determining sample users and classifying users to be classified from this information lets the classification result better cover all users of the device, so that the needs of all users of the device can be better met when information is promoted.

When the user is an account-level user, the user information is associated with the account used, and the obtained information is that of the user corresponding to the account, including information generated when the account is used on different devices. Determining sample users and classifying users to be classified from this information gives the classification model a better effect and makes the classification result more accurate.

For a better understanding of the above user classification method, a specific example is described below:

Suppose that Aiqiyi develops a new program for middle school students and wants to identify the middle school student user group by classifying Aiqiyi users, so that the program can be promoted to middle school students in a targeted manner. Here the users are account-level users.

To begin with, the Aiqiyi users are divided into three categories: middle school students, college students and non-students. Middle school student users are defined as users aged 10-18, college student users as users aged 18-25, and all users over 25 as non-student users. The preset features for all users are: whether the user uses an education network IP address, the user's location when watching videos, and the time of day at which the user watches videos. 5000 sample users need to be obtained for each category.

In the first step, user information of a large number of users is obtained from three data sources, namely the personal information filled in when a user registers an account, the delivery address filled in when the user uses the Aiqiyi mall, and the identity card information provided when the user uses the Aiqiyi overseas shopping function, together with the first classification of each user provided by each data source. The confidences of the three data sources are 5, 7 and 10 respectively, and a first classification may be college student, middle school student or non-student.

In the second step, for each user, the confidence sums of the data sources that give college student, middle school student or non-student as the first classification are calculated respectively, and the first classification with the largest confidence sum is taken as the label classification of that user.

In the third step, for the users whose label classification is college student or middle school student, the Wilson interval of the label classification is calculated, and the lower limit of the calculated Wilson interval is taken as the confidence of the user's label classification.

In the fourth step, the users whose label classification is middle school student or college student are selected and divided into two groups accordingly. Within each group, the users are sorted in descending order of label classification confidence, and users are taken starting from the one with the highest confidence until 5000 users have been obtained from each group, so that the minimum label classification confidence among the users taken is greater than the maximum label classification confidence among the remaining users. The 5000 users labelled as middle school students and the 5000 users labelled as college students obtained in this way are the middle school student sample users and the college student sample users respectively, and these are the positive sample users.

In the fifth step, the negative sample users of the college student sample users are obtained.

1) 5000 users aged 18-25 are randomly obtained from the users corresponding to the user information provided by the data sources as candidate negative sample users of the college student sample users, and 10% of the 5000 college student sample users are randomly extracted as verification users.

2) The label classifications of the extracted verification users and of the candidate negative sample users are set to non-student.

3) The remaining college student sample users are taken as positive example users, and the candidate negative sample users together with the verification users are taken as negative example users. A preset second model is trained using the feature values of these samples for the preset features and their respective label classifications, to obtain a binary classification model that can classify a user to be classified as a college student or a non-student.

4) The negative example users are classified using the obtained binary classification model, and the confidence of the classification result of each of them is obtained; the confidence represents how reliably the negative example user is a college student sample user.

5) The minimum of the confidences of the classification results of the verification users is obtained and taken as the negative sample selection threshold.

6) The candidate negative sample users whose classification result confidence is less than the negative sample selection threshold are selected as the negative sample users of the college student sample users.

In the sixth step, the negative sample users of the middle school student sample users are obtained.

They are obtained using the same method as in the fifth step.

In the seventh step, 1000 users aged over 25 are randomly obtained as candidate negative sample users of the non-student users.

In the eighth step, the negative sample users of the college student sample users, the negative sample users of the middle school student sample users and the candidate negative sample users of the non-student users are combined, and 5000 sample users are randomly extracted from the combined set as the negative sample users.
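The combining and random extraction in the eighth step might look like the sketch below. The function name, the fixed seed and the removal of duplicate users across the three groups are assumptions; the text does not say whether a user can appear in more than one group or how that case is handled.

```python
import random


def merge_and_sample(groups, k, seed=0):
    """Merge several user groups and randomly draw k users from the result."""
    merged, seen = [], set()
    for group in groups:
        for user in group:
            # Assumption: a user appearing in several groups is counted once.
            if user not in seen:
                seen.add(user)
                merged.append(user)
    if len(merged) <= k:
        return merged
    return random.Random(seed).sample(merged, k)


# Hypothetical group sizes; the example above draws 5000 users in total.
college_negatives = [f"cn_{i}" for i in range(3000)]
middle_school_negatives = [f"mn_{i}" for i in range(3000)]
non_student_candidates = [f"ns_{i}" for i in range(1000)]
negative_sample_users = merge_and_sample(
    [college_negatives, middle_school_negatives, non_student_candidates], 5000)
print(len(negative_sample_users))  # 5000
```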

In the ninth step, a preset first model is trained using the feature values of the obtained positive sample users and negative sample users for the preset features and their respective label classifications, to obtain a three-class classification model that can classify a user to be classified as a middle school student, a college student or a non-student.

Finally, the feature values of a user to be classified for the preset features are obtained and input into the three-class classification model, and the user to be classified is classified according to the result output by the model.

As shown in fig. 6, which is a schematic structural diagram of a user classifying device according to an embodiment of the present invention, the device includes:

the feature value obtaining module 610 is configured to obtain a feature value of a user to be classified for a preset feature;

the user classification module 620 is configured to input the obtained feature values to a classification model obtained through pre-training to classify the user to be classified, so as to obtain a classification result of the user to be classified;

As can be seen from the above, in the scheme provided by the embodiment of the present invention, the training information of the sample users can be determined by processing the user information provided by multiple data sources, the preset model is trained with the obtained training information to obtain the classification model, and the feature values of a user to be classified for the preset features are input into the classification model, so that the user to be classified is classified and a classification result is obtained. Because the training information of the training samples of the classification model is derived from different data sources, it has higher confidence, so the obtained classification model classifies better and the obtained classification results are more accurate.

In one implementation, fig. 7 is a schematic structural diagram of a first classification model training apparatus provided in an embodiment of the present invention. The apparatus includes a classification model obtaining module, which is used for obtaining the classification model.

Specifically, the classification model obtaining module includes:

the user information obtaining sub-module 710 is configured to obtain user information provided by different data sources, where the user information of a user provided by one data source includes: the user's label classification provided by the data source is used as a first classification;

the positive sample user determining submodule 720 is configured to determine, according to the first classification, a positive sample user from target users, and determine an annotation classification of the positive sample user, where the target users are: a user corresponding to the obtained user information;

the negative sample user determining submodule 730 is used for determining a negative sample user and determining the labeling classification of the negative sample user;

the model training submodule 740 is configured to train a preset first model using the feature values of the positive sample users and the negative sample users for the preset features, the label classifications of the positive sample users and the label classifications of the negative sample users, so as to obtain the classification model.

In one implementation, fig. 8 is a schematic structural diagram of a second classification model training apparatus provided in an embodiment of the present invention, in which the positive sample user determination sub-module 720 includes:

the label classification calculating unit 7201, configured to calculate a label classification for each user using the classification information of each user in the target users, where the classification information of one user includes: a first classification of the user, a confidence level of a data source providing the first classification of the user;

an annotation classification confidence calculation unit 7202 for calculating the confidence of the annotation classification of each user;

the positive sample user selecting unit 7203 is configured to select a positive sample user of each annotation classification from the target users according to the confidence of the annotation classification of each user.

In one implementation, fig. 9 is a schematic structural diagram of a third classification model training apparatus provided in an embodiment of the present invention, in which the label classification calculating unit 7201 includes:

a confidence sum calculating subunit 7201A, configured to calculate the confidence sum of the target data sources of each user, where the target data sources of one user are: the data sources that provide the same first classification for the user;

an annotation classification determining subunit 7201B, configured to determine the target classification of each user as the annotation classification of that user, where the target classification of one user is: the first classification whose target data sources have the largest confidence sum for that user.

In one implementation, the labeling classification confidence calculating unit 7202 is specifically configured to: calculate the Wilson interval of the labeling classification of each user according to the following formula, and determine the lower limit of the Wilson interval as the confidence of the labeling classification of that user:

where the z statistic in the formula corresponds to α, and α represents the confidence level.

In one implementation, fig. 10 is a schematic structural diagram of a fourth classification model training apparatus provided in an embodiment of the present invention, in which the negative sample user determining submodule 230 is specifically configured to obtain the negative sample users of each label classification. Specifically, the negative sample user determining submodule 230 includes:

a candidate negative sample user acquiring unit 2301, configured to acquire a candidate negative sample user in a second classification, where the second classification is one of the labeling classifications;

the verified user obtaining unit 2302 is used for extracting users with a preset proportion from the positive sample users in the second classification to obtain verified users in the second classification;

an annotation classification setting unit 2303, configured to set the annotation classification of the negative example user as a third classification, where the third classification is: for classes representing non-second classes, negative examples users are: candidate negative sample users and verified users of the second classification;

the model training unit 2304 is configured to train a preset second model by using positive example users and negative example users respectively for feature values of preset features, label classifications of the positive example users and label classifications of the negative example users, so as to obtain two classification models of label classifications, where the positive example users are: positive sample users other than the verification sample user among the positive sample users of the second category;

a negative example confidence obtaining unit 2305, configured to classify the negative example users by using the binary classification model, and to obtain the confidence of the classification result of each negative example user;

a negative sample selection threshold obtaining unit 2306, configured to obtain a negative sample selection threshold according to the confidences of the classification results of the verification sample users;

a negative sample user selecting unit 2307, configured to select the negative sample users of the second classification from the candidate negative sample users according to the negative sample selection threshold.

In one implementation, the negative sample selection threshold may be the smallest of the obtained confidences of the classification results of the verification sample users.
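The threshold-based selection performed by units 2305 to 2307 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each confidence is taken to be the binary classification model's score for a user belonging to the second classification, and a candidate is kept as a negative sample user when its score falls below the threshold. The text does not state the direction of the comparison, so that rule, the function name `select_negative_samples`, and the dictionary inputs are all hypothetical.

```python
from typing import Dict, List


def select_negative_samples(
    candidate_scores: Dict[str, float],
    verification_scores: Dict[str, float],
) -> List[str]:
    """Pick negative sample users from the candidate negative sample users.

    candidate_scores    -- user id -> confidence from the binary model
    verification_scores -- user id -> confidence for verification sample users
    Both confidences are assumed to mean "belongs to the second classification".
    """
    if not verification_scores:
        raise ValueError("at least one verification sample user is required")

    # Threshold: the smallest confidence among the verification sample users.
    threshold = min(verification_scores.values())

    # Assumption: a candidate scoring below every known member of the second
    # classification is treated as a reliable negative sample user.
    return sorted(
        user for user, score in candidate_scores.items() if score < threshold
    )
```

With candidate scores `{"u1": 0.1, "u2": 0.55, "u3": 0.3}` and verification scores `{"v1": 0.4, "v2": 0.9}`, the threshold is 0.4 and the function returns `["u1", "u3"]`. Because the verification sample users are known members of the second classification that were mixed into the negative examples, their lowest score indicates how low a true member can score under the model.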

In one implementation, the user is a device-level user or an account-level user.

In one implementation, the preset feature is at least one of the following features:

the age of the user, the location of the user, the types of video the user watches, the types of e-book the user reads, the times at which the user watches videos, the times at which the user reads e-books, the characteristics of the groups the user joins, and the characteristics of the user's use of a specific function.

An embodiment of the present invention further provides an electronic device, as shown in fig. 11, which includes a processor 1110, a communication interface 1120, a memory 1130, and a communication bus 1140, wherein the processor 1110, the communication interface 1120, and the memory 1130 communicate with one another through the communication bus 1140,

a memory 1130 for storing computer programs;

the processor 1110, when executing the program stored in the memory 1130, implements the following steps:

It should be noted that other implementation manners of the user classification method implemented by the processor 1110 executing the program stored in the memory 1130 are the same as the user classification method embodiments provided in the foregoing method embodiment section, and are not described again here.

As can be seen from the above, in the scheme provided in the embodiment of the present invention, the electronic device may determine the training information of the sample users by processing the user information provided by multiple data sources, train the preset model with that training information to obtain a classification model, and input the feature values of a user to be classified for the preset features into the classification model, thereby classifying the user and obtaining a classification result. Because the training information of the training samples is derived from different data sources, it has higher confidence, so the resulting classification model classifies better and the classification results it produces are more accurate.

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.

The processor may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which has instructions stored therein, and when the instructions are executed on a computer, the instructions cause the computer to execute the user classification method described in any of the above embodiments.

In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the user classification method of any of the above embodiments.

The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The embodiments in the present specification are described in an interrelated manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus embodiments, electronic device embodiments, computer-readable storage medium embodiments, and embodiments of computer program products containing instructions are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding descriptions of the method embodiments.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for classifying a user, the method comprising:

the classification model is obtained by training in the following way:

determining a negative sample user and determining the labeling classification of the negative sample user, wherein the steps comprise:

selecting the negative sample users of the second classification from the candidate negative sample users according to the negative sample selection threshold;

and training a preset first model by using the feature values of the positive sample users and of the negative sample users for the preset features, together with the labeling classification of the positive sample users and the labeling classification of the negative sample users, to obtain the classification model.

2. The method of claim 1, wherein the step of determining the positive sample user from the target users and determining the label classification of the positive sample user according to the first classification comprises:

calculating the confidence of the label classification of each user;

3. The method of claim 2, wherein the step of calculating the label classification for each user using the classification information of each user in the target users comprises:

4. The method of claim 2 or 3, wherein the step of calculating the confidence level of the label classification for each user comprises:

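$$\text{Wilson interval} = \frac{\hat{p} + \frac{z_{\alpha}^{2}}{2n} \pm z_{\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z_{\alpha}^{2}}{4n^{2}}}}{1 + \frac{z_{\alpha}^{2}}{n}}$$

where $z_{\alpha}$ represents the z statistic corresponding to $\alpha$, and $\alpha$ represents the confidence level.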

5. The method of claim 1, wherein the negative sample selection threshold is:

the smallest of the obtained confidences of the classification results of the verification sample users.

6. The method of any of claims 1-3 and 5, wherein the user is a device level user or an account level user.

7. The method of claim 1, wherein the predetermined characteristic is at least one of:

8. An apparatus for classifying a user, the apparatus comprising:

a classification model obtaining module for obtaining the classification model;

wherein the classification model obtaining module comprises:

the negative sample user determining submodule is used for determining negative sample users and determining the labeling classification of the negative sample users; the negative sample user determination submodule is specifically used for obtaining negative sample users of each label classification;

wherein the negative sample user determination submodule comprises:

the negative sample user selection unit is used for selecting the negative sample users of the second classification from the candidate negative sample users according to the negative sample selection threshold;

and the model training submodule is used for training a preset first model by using the feature values of the positive sample users and of the negative sample users for the preset features, together with the labeling classification of the positive sample users and the labeling classification of the negative sample users, to obtain the classification model.

9. The apparatus of claim 8, wherein the positive sample user determination submodule comprises:

10. The apparatus of claim 9, wherein the label classification calculating unit comprises:

11. The apparatus of claim 9 or 10,

the labeling classification confidence calculation unit is specifically configured to: calculate the Wilson interval of the labeling classification of each user according to the following formula, and determine the lower limit of the Wilson interval of the labeling classification of each user as the confidence of the labeling classification of that user:

$$\text{Wilson interval} = \frac{\hat{p} + \frac{z_{\alpha}^{2}}{2n} \pm z_{\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z_{\alpha}^{2}}{4n^{2}}}}{1 + \frac{z_{\alpha}^{2}}{n}}$$

where $\hat{p}$ represents the ratio of a first value to n, the first value being: the maximum value among the confidence sums of a user's target data sources; $z_{\alpha}$ represents the z statistic corresponding to $\alpha$; and $\alpha$ represents the confidence level.

12. The apparatus of claim 8, wherein the negative sample selection threshold is:

13. The apparatus of any of claims 8-10 and 12, wherein the user is a device level user or an account level user.

14. The apparatus of claim 8, wherein the predetermined characteristic is at least one of:

15. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

a processor for implementing the steps of the method of any one of claims 1 to 7 when executing a program stored in the memory.