CN111639714A - Method, device and equipment for determining attributes of users - Google Patents

Method, device and equipment for determining attributes of users

Publication number
CN111639714A
Authority
CN
China
Prior art keywords
sample
samples
application
user
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010484863.6A
Other languages
Chinese (zh)
Other versions
CN111639714B (en)
Inventor
李嘉晨
郭凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Seashell Housing Beijing Technology Co Ltd
Original Assignee
Beike Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beike Technology Co Ltd
Priority to CN202010484863.6A
Publication of CN111639714A
Application granted
Publication of CN111639714B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method, apparatus, medium, and device for determining attributes of a user are disclosed. The method comprises: obtaining application feature samples of a plurality of users to form a sample set, where an application feature sample represents the applications installed on a user's terminal device; setting a label for at least one application feature sample in the sample set according to applications having a predetermined attribute, to form positive samples and negative samples; initializing a semi-supervised classification model according to the positive samples and negative samples in the sample set; performing attribute prediction processing on the unlabeled application feature samples in the sample set according to the initialized semi-supervised classification model; and setting labels for corresponding unlabeled application feature samples according to the prediction result, to form further positive samples and negative samples. The method helps determine the attributes of a user conveniently and accurately.

Description

Method, device and equipment for determining attributes of users
Technical Field
The present disclosure relates to computer technologies, and in particular, to a method of determining an attribute of a user, an apparatus for determining an attribute of a user, a storage medium, and an electronic device.
Background
In some application fields, more precise services need to be provided to users according to their attributes (such as demographic attributes). An attribute of a user may be, for example, the user's gender, whether the user is married, or whether the user has children.
In a network environment, how to lawfully and accurately infer the attributes of a user of a terminal device is a technical problem of considerable concern.
Disclosure of Invention
The present disclosure is proposed to solve the above technical problems. The embodiment of the disclosure provides a method for determining an attribute of a user, a device for determining the attribute of the user, a storage medium and an electronic device.
According to an aspect of an embodiment of the present disclosure, there is provided a method of determining an attribute of a user, the method including: obtaining application characteristic samples of a plurality of users to form a sample set; the application characteristic sample is used for characterizing an application installed by the terminal equipment of a user; according to the application with the preset attribute, setting a label for at least one application characteristic sample in the sample set to form a positive sample and a negative sample; initializing a semi-supervised classification model according to the positive samples and the negative samples in the sample set; according to the initialized semi-supervised classification model, performing attribute prediction processing on the application feature samples without the set labels in the sample set; and setting labels for corresponding application characteristic samples in the application characteristic samples without the labels according to the prediction processing result to form a positive sample and a negative sample.
In an embodiment of the present disclosure, the acquiring application feature samples of a plurality of users includes: for any user, generating an application map of the user according to application installation information reported by the terminal equipment of the user; and compressing the application map of the user to obtain an application feature sample of the user.
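The text does not say how an application map is represented. As a minimal sketch, assuming the map is a multi-hot vector over a fixed application vocabulary, the step of generating a map from reported installation information could look like the following. The vocabulary `APP_VOCABULARY`, the function name `build_application_map`, and the sample reports are all hypothetical.

```python
from typing import Dict, List

# Hypothetical application vocabulary; the source does not list one.
APP_VOCABULARY: List[str] = ["app_a", "app_b", "app_c", "app_d"]


def build_application_map(installed_apps: List[str],
                          vocabulary: List[str] = APP_VOCABULARY) -> List[int]:
    """Return a multi-hot vector: 1 if the app at that index is installed, else 0."""
    installed = set(installed_apps)
    return [1 if app in installed else 0 for app in vocabulary]


# Installation reports keyed by user identifier (illustrative data only).
reports: Dict[str, List[str]] = {
    "user_1": ["app_a", "app_c"],
    "user_2": ["app_b", "app_c", "app_d"],
}
application_maps = {uid: build_application_map(apps) for uid, apps in reports.items()}
```

With a real vocabulary of many thousands of applications such vectors are long and sparse, which is presumably why the embodiment compresses them afterwards.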
In another embodiment of the present disclosure, the compressing the application map of the user to obtain an application feature sample of the user includes: factoring the application map of the user to obtain a first characteristic of the application map of the user; taking the application map of the user as the input of a neural network, and performing feature extraction processing on the application map of the user through the neural network to obtain second features of the application map of the user; and splicing the first characteristic and the second characteristic to obtain an application characteristic sample of the user.
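The three operations in this embodiment (factorization, neural-network feature extraction, splicing) can be illustrated with a small sketch. It rests on several assumptions not stated in the text: the application map is a numeric vector, the factorization is reduced to a rank-1 projection found by power iteration, and the "neural network" is a single untrained tanh layer with random placeholder weights standing in for a trained encoder. All names are hypothetical.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def top_right_singular_vector(maps: Matrix, iterations: int = 50) -> List[float]:
    """Power iteration on M^T M: approximates the leading right singular vector of M."""
    n_apps = len(maps[0])
    v = [1.0 / math.sqrt(n_apps)] * n_apps
    for _ in range(iterations):
        # u = M v (one value per user), then w = M^T u (one value per app).
        u = [sum(r * x for r, x in zip(row, v)) for row in maps]
        w = [sum(maps[i][j] * u[i] for i in range(len(maps))) for j in range(n_apps)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return v


def first_feature(app_map: List[float], v: List[float]) -> List[float]:
    """Factorization-based feature: projection of the map onto the leading direction."""
    return [sum(a * x for a, x in zip(app_map, v))]


def second_feature(app_map: List[float], weights: Matrix) -> List[float]:
    """Network-based feature: one tanh layer (placeholder for a trained network)."""
    return [math.tanh(sum(a * w for a, w in zip(app_map, row))) for row in weights]


def compress(app_map: List[float], v: List[float], weights: Matrix) -> List[float]:
    """Splice (concatenate) the first and second features into one feature sample."""
    return first_feature(app_map, v) + second_feature(app_map, weights)


# Illustrative application maps for three users over four apps.
maps: Matrix = [[1, 0, 1, 0], [0, 1, 1, 1], [1, 1, 1, 0]]
rng = random.Random(0)
weights: Matrix = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
v = top_right_singular_vector(maps)
sample = compress(maps[0], v, weights)
```

The spliced sample here has three dimensions (one from the factorization, two from the layer); a practical system would use a higher-rank factorization and a trained network.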
In yet another embodiment of the present disclosure, the setting a label for at least one application feature sample in the sample set according to an application having a predetermined attribute includes: determining the number of applications with preset attributes respectively installed on the terminal equipment of each user according to the application maps of all users; and setting labels corresponding to the preset attributes for the application characteristic samples of which the number meets the preset number of requirements.
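A minimal sketch of this count-based labeling follows. The application lists, the threshold `MIN_COUNT`, and the extra rule that a user must have no applications of the opposite attribute are assumptions made for illustration; the text only requires that the count meet a predetermined number requirement.

```python
from typing import Iterable, Optional, Set

# Hypothetical lists of applications carrying each state of the attribute.
POSITIVE_APPS: Set[str] = {"pos_app_1", "pos_app_2", "pos_app_3"}
NEGATIVE_APPS: Set[str] = {"neg_app_1", "neg_app_2", "neg_app_3"}
MIN_COUNT = 2  # assumed "predetermined number requirement"


def seed_label(installed_apps: Iterable[str],
               min_count: int = MIN_COUNT) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None (no label set)."""
    installed = set(installed_apps)
    n_pos = len(installed & POSITIVE_APPS)
    n_neg = len(installed & NEGATIVE_APPS)
    # Assumption: conflicting evidence leaves the sample unlabeled.
    if n_pos >= min_count and n_neg == 0:
        return 1
    if n_neg >= min_count and n_pos == 0:
        return 0
    return None
```

Requiring several attribute-bearing applications, rather than one, is one way to keep the seed labels reliable at the cost of labeling fewer samples.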
In yet another embodiment of the present disclosure, after setting a label for at least one application feature sample in the sample set according to an application having a predetermined attribute to form a positive sample and a negative sample, the method further includes: taking any labeled sample in the sample set as a base sample, determining all application feature samples whose distance from the base sample meets a predetermined distance requirement, and taking, among all those application feature samples, the ones that carry the same label as the base sample as few samples; and if the number of all those application feature samples meets a predetermined requirement, generating a new few sample according to the base sample and the few samples.
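The neighborhood step can be sketched as follows, assuming Euclidean distance and a radius threshold as the "predetermined distance requirement"; the text does not fix either choice. The type alias `Sample` and both function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

# (feature vector, label), where the label is 1, 0, or None when not set.
Sample = Tuple[Sequence[float], Optional[int]]


def neighborhood(base_features: Sequence[float],
                 samples: List[Sample],
                 radius: float) -> List[Sample]:
    """All samples whose Euclidean distance from the base sample is within radius."""
    return [s for s in samples if math.dist(base_features, s[0]) <= radius]


def few_samples_in(neighbors: List[Sample], base_label: int) -> List[Sample]:
    """Neighbors that carry the same label as the base sample."""
    return [s for s in neighbors if s[1] == base_label]


# Illustrative data: the base sample is positive (label 1) and is not in the list.
base: Sample = ([0.0, 0.0], 1)
others: List[Sample] = [([0.1, 0.0], 1), ([0.0, 0.2], None),
                        ([0.3, 0.1], 0), ([2.0, 2.0], 1)]
neighbors = neighborhood(base[0], others, radius=0.5)
few = few_samples_in(neighbors, base[1])
```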
In yet another embodiment of the present disclosure, generating a new few sample according to the base sample and the few samples includes: dividing all the application feature samples into a first part and a second part; determining, for each of the first part and the second part, a probability of generating few samples according to the numbers of few samples and of unlabeled application feature samples contained in the first part and the second part respectively; determining a direction for generating few samples according to the probabilities, where the direction indicates the first part and/or the second part; and generating a new few sample from the base sample and the few samples in the part indicated by the direction.
In yet another embodiment of the present disclosure, dividing all the application feature samples into a first part and a second part includes: taking the base sample as the center of a circle, and dividing all the application feature samples into a first part and a second part located on the two sides of a diameter through the center, where the diameter is chosen so that the difference between the numbers of few samples on its two sides is minimized.
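One way to picture this division is a two-dimensional sketch that tries a fixed number of candidate diameters through the base sample and keeps the one that best balances the few samples on its two sides. Restricting the features to two dimensions, searching over `n_angles` evenly spaced angles, and the function name are all assumptions; in a higher-dimensional feature space the "diameter" would correspond to a hyperplane through the base sample.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def split_by_diameter(base: Point,
                      neighbors: List[Point],
                      is_few: List[bool],
                      n_angles: int = 180) -> Tuple[List[Point], List[Point]]:
    """Split neighbors by the candidate diameter that best balances few samples."""
    best_diff = None
    best_side: List[bool] = []
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        # Normal vector of the diameter whose direction is (cos theta, sin theta).
        normal = (-math.sin(theta), math.cos(theta))
        side = [(p[0] - base[0]) * normal[0] + (p[1] - base[1]) * normal[1] >= 0
                for p in neighbors]
        few_first = sum(1 for s, f in zip(side, is_few) if s and f)
        few_second = sum(1 for s, f in zip(side, is_few) if not s and f)
        diff = abs(few_first - few_second)
        if best_diff is None or diff < best_diff:
            best_diff, best_side = diff, side
    first = [p for p, s in zip(neighbors, best_side) if s]
    second = [p for p, s in zip(neighbors, best_side) if not s]
    return first, second


# Illustrative neighborhood in which every neighbor is a few sample.
pts: List[Point] = [(1.0, 0.0), (-1.0, 0.0), (0.5, 0.5), (-0.5, -0.5)]
first_part, second_part = split_by_diameter((0.0, 0.0), pts, [True] * 4)
```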
In yet another embodiment of the present disclosure, generating a new few sample according to the base sample and the few samples in the part indicated by the direction includes: setting the feature value of each dimension of the new few sample according to the value range formed, for that dimension, by the feature values of the base sample and of the few samples in the part indicated by the direction.
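As a minimal sketch of setting each dimension from a value range, the function below draws every feature of the new sample uniformly between the minimum and maximum observed for that dimension. Uniform sampling is an assumption; the text only says the value is set according to the range. The function name and data are hypothetical.

```python
import random
from typing import List, Sequence


def generate_few_sample(base: Sequence[float],
                        few_samples: List[Sequence[float]],
                        rng: random.Random) -> List[float]:
    """Draw each dimension from the range spanned by the base and few samples."""
    points = [base, *few_samples]
    new_sample = []
    for dim in range(len(base)):
        values = [p[dim] for p in points]
        # Assumption: a uniform draw inside [min, max] for this dimension.
        new_sample.append(rng.uniform(min(values), max(values)))
    return new_sample


# Illustrative base sample and few samples from the selected part.
base_sample = [0.0, 1.0]
selected_few = [[0.2, 0.8], [0.4, 1.4]]
new_few = generate_few_sample(base_sample, selected_few, random.Random(0))
```

The new sample inherits the label of the base sample, since the few samples are by definition those sharing that label.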
In another embodiment of the present disclosure, setting labels for corresponding unlabeled application feature samples according to the prediction result to form positive samples and negative samples includes: obtaining, from the prediction result, the probability that each unlabeled application feature sample belongs to the positive class; taking the unlabeled application feature samples whose probability is not lower than a first predetermined probability value as positive samples; and taking the unlabeled application feature samples whose probability is lower than a second predetermined probability value as negative samples.
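The two-threshold rule can be written down directly. The threshold values 0.9 and 0.1 below are placeholders chosen for illustration; the text only calls them the first and second predetermined probability values.

```python
from typing import Optional


def pseudo_label(p_positive: float,
                 first_threshold: float = 0.9,
                 second_threshold: float = 0.1) -> Optional[int]:
    """Map a predicted positive-class probability to 1, 0, or None (stay unlabeled)."""
    if p_positive >= first_threshold:   # "not lower than" the first value
        return 1
    if p_positive < second_threshold:   # "lower than" the second value
        return 0
    return None
```

Samples whose probability falls between the two thresholds stay unlabeled, so only confident predictions are added to the positive and negative samples.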
In yet another embodiment of the present disclosure, the method further comprises: after forming the positive samples and the negative samples, returning to the step of initializing the semi-supervised classification model according to the positive samples and the negative samples in the sample set, until no unlabeled application feature samples remain in the sample set.
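The loop described here resembles self-training. The sketch below shows its shape using a deliberately simple stand-in classifier (nearest centroid with a distance-based probability), which is not the semi-supervised model of the disclosure. The thresholds, the `max_rounds` limit, and the early stop when a round labels nothing are added assumptions, included because confident predictions are not guaranteed for every sample.

```python
import math
from typing import Dict, List, Optional, Sequence


def centroid(points: List[Sequence[float]]) -> List[float]:
    return [sum(p[d] for p in points) / len(points) for d in range(len(points[0]))]


def positive_probability(x: Sequence[float],
                         pos_c: Sequence[float],
                         neg_c: Sequence[float]) -> float:
    """Stand-in probability: closer to the positive centroid gives a higher value."""
    d_pos, d_neg = math.dist(x, pos_c), math.dist(x, neg_c)
    total = d_pos + d_neg
    return 0.5 if total == 0 else d_neg / total


def self_train(features: Dict[str, Sequence[float]],
               seed_labels: Dict[str, int],
               first_threshold: float = 0.7,
               second_threshold: float = 0.3,
               max_rounds: int = 10) -> Dict[str, Optional[int]]:
    labels: Dict[str, Optional[int]] = dict(seed_labels)
    for _ in range(max_rounds):
        unlabeled = [u for u in features if labels.get(u) is None]
        if not unlabeled:
            break  # every sample has a label
        # "Initialize" the stand-in model from the current positive/negative samples.
        pos_c = centroid([features[u] for u, y in labels.items() if y == 1])
        neg_c = centroid([features[u] for u, y in labels.items() if y == 0])
        progressed = False
        for u in unlabeled:
            p = positive_probability(features[u], pos_c, neg_c)
            if p >= first_threshold:
                labels[u], progressed = 1, True
            elif p < second_threshold:
                labels[u], progressed = 0, True
        if not progressed:
            break  # assumption: stop when a round adds no new labels
    return labels


# Illustrative two-dimensional feature samples with one seed label per class.
feats = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [0.1, 0.2],
         "d": [0.9, 0.8], "e": [0.4, 0.4]}
result = self_train(feats, {"a": 0, "b": 1})
```

In this toy data the sample "e" lies between the two classes and never crosses either threshold, which is why the sketch needs a stopping rule beyond "no unlabeled samples remain".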
According to another aspect of the embodiments of the present disclosure, there is provided an apparatus for determining an attribute of a user, the apparatus including: the acquisition sample module is used for acquiring application characteristic samples of a plurality of users to form a sample set; the application characteristic sample is used for characterizing an application installed by the terminal equipment of a user; the first setting module is used for setting labels for at least one application characteristic sample in the sample set formed by the sample obtaining module according to the application with the preset attribute to form a positive sample and a negative sample; the initialization module is used for initializing a semi-supervised classification model according to the positive samples and the negative samples in the sample set; the prediction processing module is used for performing attribute prediction processing on the application feature samples without the labels in the sample set according to the semi-supervised classification model initialized by the initialization module; and the second setting module is used for setting labels for corresponding application characteristic samples in the application characteristic samples without the labels according to the prediction processing result of the prediction processing module to form a positive sample and a negative sample.
In an embodiment of the present disclosure, the obtaining a sample module includes: the first sub-module is used for generating an application map of any user according to the application installation information reported by the terminal equipment of the user; and the second sub-module is used for compressing the application map of the user to obtain an application feature sample of the user.
In yet another embodiment of the present disclosure, the second sub-module is further configured to: factoring the application map of the user to obtain a first characteristic of the application map of the user; taking the application map of the user as the input of a neural network, and performing feature extraction processing on the application map of the user through the neural network to obtain second features of the application map of the user; and splicing the first characteristic and the second characteristic to obtain an application characteristic sample of the user.
In yet another embodiment of the present disclosure, the first setting module is further configured to: determining the number of applications with preset attributes respectively installed on the terminal equipment of each user according to the application maps of all users; and setting labels corresponding to the preset attributes for the application characteristic samples of which the number meets the preset number of requirements.
In yet another embodiment of the present disclosure, the apparatus further includes: a few-sample determining module, configured to take any labeled sample in the sample set as a base sample, determine all application feature samples whose distance from the base sample meets a predetermined distance requirement, and take, among all those application feature samples, the ones that carry the same label as the base sample as few samples; and a sample generation module, configured to generate a new few sample according to the base sample and the few samples if the number of all those application feature samples meets a predetermined requirement.
In yet another embodiment of the present disclosure, the sample generation module includes: a third sub-module, configured to divide all the application feature samples into a first part and a second part; a fourth sub-module, configured to determine, for each of the first part and the second part, a probability of generating few samples according to the numbers of few samples and of unlabeled application feature samples contained in the first part and the second part respectively; a fifth sub-module, configured to determine a direction for generating few samples according to the probabilities, where the direction indicates the first part and/or the second part; and a sixth sub-module, configured to generate a new few sample from the base sample and the few samples in the part indicated by the direction.
In yet another embodiment of the present disclosure, the third sub-module is further configured to: take the base sample as the center of a circle, and divide all the application feature samples into a first part and a second part located on the two sides of a diameter through the center, where the diameter is chosen so that the difference between the numbers of few samples on its two sides is minimized.
In yet another embodiment of the present disclosure, the sixth sub-module is further configured to: set the feature value of each dimension of the new few sample according to the value range formed, for that dimension, by the feature values of the base sample and of the few samples in the part indicated by the direction.
In yet another embodiment of the present disclosure, the second setting module is further configured to: obtain, from the prediction result, the probability that each unlabeled application feature sample belongs to the positive class; take the unlabeled application feature samples whose probability is not lower than a first predetermined probability value as positive samples; and take the unlabeled application feature samples whose probability is lower than a second predetermined probability value as negative samples.
In yet another embodiment of the present disclosure, the apparatus further comprises a control module configured to: after the second setting module forms a positive sample and a negative sample, triggering the initialization module to execute the step of initializing the semi-supervised classification model again according to the positive sample and the negative sample in the sample set until no application feature sample without a set label exists in the sample set.
Based on the method and the device for determining the attribute of the user provided by the above embodiment of the present disclosure, by using the application feature sample to reflect the application installed in the terminal device of the user, and using the application with the predetermined attribute (such as the application with the demographic attribute) to set the corresponding label for the application feature sample, the positive sample and the negative sample with higher purity can be obtained conveniently and accurately; because the purity of the positive sample and the negative sample in the method is high, the semi-supervised classification model is initialized by using the positive sample and the negative sample, and the classification performance of the semi-supervised classification model is favorably improved; and the accuracy of the semi-supervised classification model for carrying out attribute prediction processing on the application feature samples without the labels is improved. Therefore, the technical scheme provided by the disclosure is beneficial to rapidly and conveniently acquiring the attributes of the user.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of one embodiment of a suitable scenario for use with the present disclosure;
FIG. 2 is a flow diagram of one embodiment of a method of determining attributes of a user of the present disclosure;
FIG. 3 is a flow diagram of one embodiment of the present disclosure for obtaining application characteristic samples of a user;
FIG. 4 is a flow chart of one embodiment of the present disclosure for generating a new positive sample;
FIG. 5 is a schematic diagram of an embodiment of a neighborhood of base samples according to the present disclosure;
FIG. 6 is a flow chart of one embodiment of the present disclosure for generating new few samples;
FIG. 7 is a schematic diagram of an embodiment of the present disclosure that divides all application feature samples in the neighborhood into two parts;
FIG. 8 is a flow chart of another embodiment of a method of determining attributes of a user of the present disclosure;
FIG. 9 is a schematic diagram illustrating an embodiment of an apparatus for determining attributes of a user according to the present disclosure;
FIG. 10 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another; they are not intended to imply any particular technical meaning, nor do they imply any necessary logical order between the elements.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more than two and "at least one" may refer to one, two or more than two.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the present disclosure may be implemented in electronic devices such as terminal devices, computer systems, and servers, which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with such electronic devices include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment. In a distributed cloud computing environment, tasks may be performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the disclosure
In implementing the present disclosure, the inventors found that attributes of a user (e.g., demographic attributes) can be obtained through a user registration process, through documents provided by the user such as an identity card or passport, or by way of questionnaires. However, in some application scenarios, the attributes of the user cannot be obtained in these ways. For example, when a user accesses a website (e.g., a house rental website, a house sales website, or a news website) through a terminal device (e.g., a smart phone) without logging in, the network side cannot obtain the user's attributes from information such as the user's network access address. As another example, when the user registration process involves only a user name and a login password, the network side cannot obtain the user's attributes, such as the user's gender, whether the user is married, and whether the user has children.
Exemplary overview
One example of an application scenario of the technology for determining attributes of a user provided by the present disclosure is shown in fig. 1.
In FIG. 1, it is assumed that there are currently n users, i.e., user 1, user 2, ……, and user n, and that each user has a terminal device, i.e., terminal device 1.1 (e.g., a computer), terminal device 2.1 (e.g., a smart phone), ……, and terminal device n.1 (e.g., a tablet computer). Each user accesses the server 102 using their own terminal device. The server 102 pushes corresponding information to terminal device 1.1, terminal device 2.1, ……, and terminal device n.1 according to the users' access requirements, and each terminal device forms a corresponding page (such as a web page or an APP page) from the information pushed by the server 102 and displays it to its user.
With the consent of user 1, user 2, ……, and user n, the server 102 can obtain all applications installed on terminal device 1.1, terminal device 2.1, ……, and terminal device n.1, respectively.
The server 102 may adopt the technical solution of the present disclosure to infer at least one attribute of user 1, such as gender, whether married, or whether the user has children, from all the applications it has obtained as installed on terminal device 1.1.
The server 102 may adopt the technical solution of the present disclosure to infer at least one attribute of user 2, such as gender, whether married, or whether the user has children, from all the applications it has obtained as installed on terminal device 2.1.
……
The server 102 may adopt the technical solution of the present disclosure to infer at least one attribute of user n, such as gender, whether married, or whether the user has children, from all the applications it has obtained as installed on terminal device n.1.
Based on the inferred attributes of user 1, user 2, ……, and user n, the server 102 may push information matching each user's attributes to terminal device 1.1, terminal device 2.1, ……, and terminal device n.1, respectively. The information may be advertisement information, current news, or entertainment information. For example, advertisement information related to products for women may be pushed to female users, and information on housing near schools, advertisements for school education, and the like may be pushed to female users who have children.
Exemplary method
FIG. 2 is a flow chart of one embodiment of a method of determining attributes of a user of the present disclosure. The method of the embodiment shown in FIG. 2 comprises steps S200, S201, S202, S203, and S204. Each step is described below.
S200, obtaining application characteristic samples of a plurality of users to form a sample set.
The application characteristic sample in the present disclosure may represent a plurality of applications installed by the terminal device of the user, for example, all applications installed by the terminal device of the user. An application feature sample in the present disclosure typically includes a user identification and a plurality of feature points. One of the feature points may be considered a one-dimensional feature, and the N feature points may be considered N-dimensional features. In the case where all the feature points included in the application feature sample are feature points after the compression processing, each feature point in the application feature sample generally has no specific meaning.
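As a minimal sketch of the structure just described (a user identification plus N feature points, with an optional label), an application feature sample could be modelled as below. The class name, the field names, and the 1/0/None label encoding are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ApplicationFeatureSample:
    user_id: str                  # user identification
    features: List[float]         # N feature points, i.e., an N-dimensional feature
    label: Optional[int] = None   # assumed encoding: 1 positive, 0 negative, None not set

    @property
    def dimension(self) -> int:
        """Number of feature points (N)."""
        return len(self.features)


sample_set = [
    ApplicationFeatureSample("user_1", [0.12, -0.40, 0.93]),
    ApplicationFeatureSample("user_2", [0.55, 0.08, -0.21], label=1),
]
unlabeled = [s for s in sample_set if s.label is None]
```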
The terminal device in the present disclosure may refer to an electronic device with network access capability, such as a smart mobile phone, a computer, or a tablet computer. The application in the present disclosure may refer to an APP installed on a smart mobile phone and a tablet computer, or may refer to a client program installed on a computer, and the like. The sample set in the present disclosure typically contains application characteristic samples for a large number of users.
S201, setting labels for at least one application characteristic sample in the sample set according to the application with the preset attribute to form a positive sample and a negative sample.
A predetermined attribute (e.g., a demographic attribute) in the present disclosure may refer to an attribute having two mutually exclusive states. For example, the predetermined attribute may be gender, whether the user is married, or whether the user has children. Specifically, if the user's gender is male, it cannot be female; if the user is married, the user cannot be unmarried; and if the user has children, the user cannot be childless.
All applications to which the present disclosure relates do not have predetermined attributes (e.g., demographic attributes), and typically, only a small number of applications have predetermined attributes in all applications. For example, cosmetic supervision, beauty pomelo, day-to-day P-chart, beauty swatter, and beauty art applications may be considered as applications with female attributes. For another example, applications such as hair salons (boy version), men's clothing, mr. beast, etc. may be considered as applications having male attributes.
Since the predetermined attributes in the present disclosure are attributes having two states opposite to each other, one of the attributes may be corresponding to the first label, and the application characteristic sample having the first label is a positive sample, and the other attribute may be corresponding to the second label, and the application characteristic sample having the second label is a negative sample. For example, a female attribute corresponds to a first tag and a male attribute corresponds to a second tag.
S202, initializing the semi-supervised classification model according to the positive samples and the negative samples in the sample set.
The positive and negative samples in the present disclosure are typically higher purity positive and negative samples, i.e., the positive and negative samples in the present disclosure are typically higher reliability positive and negative samples.
The semi-supervised classification model in the present disclosure is used to predict the user attribute (e.g., a demographic attribute) for the application feature samples in the sample set that have no label. The semi-supervised classification model may refer to a classifier based on a semi-supervised learning algorithm, that is, a learning algorithm that combines supervised learning and unsupervised learning. The semi-supervised classification model can perform the corresponding classification operation when the sample set includes both application feature samples with labels (i.e., the positive and negative samples) and application feature samples without labels. Initializing the semi-supervised classification model may refer to setting the parameters of the classifier.
S203, according to the initialized semi-supervised classification model, performing attribute prediction processing on the application feature samples without labels in the sample set.
The present disclosure can provide each unlabeled application feature sample in the sample set to the semi-supervised classification model as a model input, and the semi-supervised classification model can output a probability value for each model input, where the probability value represents the likelihood that the corresponding model input is a positive sample.
S204, setting labels for corresponding application feature samples among the application feature samples without labels according to the prediction processing result, to form positive samples and negative samples.
A requirement can be set in advance. For each prediction processing result that meets the preset requirement, the corresponding label (the first label or the second label) is set for the application feature sample corresponding to that result, so that a positive sample or a negative sample is formed. In this way the number of unlabeled application feature samples in the sample set is reduced, and the numbers of positive and negative samples in the sample set gradually increase.
The application characteristic samples are used for reflecting applications installed in the terminal equipment of a user, and the applications with preset attributes (such as applications with demographic attributes) are used for setting labels for the application characteristic samples, so that the positive samples and the negative samples with higher purity can be conveniently and accurately obtained; and the accuracy of the semi-supervised classification model for carrying out attribute prediction processing on the application feature samples without the labels is improved. Therefore, the technical scheme provided by the disclosure is beneficial to quickly and conveniently knowing the attributes of the user (such as the demographic attributes of the user).
In an alternative example, one implementation of the present disclosure to obtain application characteristic samples for any of a plurality of users may be as shown in fig. 3.
In fig. 3, S300, for any user of a plurality of users, an application map of the user is generated according to the application installation information reported by the terminal device of the user.
Optionally, the present disclosure may collect the application installation information of the terminal device of the user after obtaining the consent of the user through information interaction. For example, after obtaining the agreement of the user, the terminal device of the user reports the application installation information to the network side, so that the present disclosure can generate the application map of the user according to the received application installation information.
Optionally, the application map of a user in the present disclosure may generally indicate, among a plurality of applications, which applications are installed in the terminal device of the user and which are not. For example, the application map of the user includes a plurality of points, each of which corresponds to one application; if the application corresponding to a point is installed in the terminal device of the user, the value of the point may be set to a first value (e.g., 1), and if it is not installed, the value of the point may be set to a second value (e.g., 0). The application map of the user is then a numeric string formed by a plurality of 1s and 0s.
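The construction of such an application map can be illustrated with a minimal Python sketch. The catalog of applications, the application names, and the function name build_application_map are hypothetical and are used only for illustration; the disclosure does not prescribe a particular data structure.

```python
from typing import Iterable, List


def build_application_map(catalog: List[str], installed: Iterable[str]) -> List[int]:
    """Return one value per catalog entry: 1 if installed, 0 otherwise."""
    installed_set = set(installed)
    return [1 if app in installed_set else 0 for app in catalog]


# Hypothetical catalog and installation report, for illustration only.
catalog = ["app_a", "app_b", "app_c", "app_d", "app_e"]
reported = ["app_d", "app_b"]

app_map = build_application_map(catalog, reported)
# The same map written as a numeric string of 1s and 0s.
as_string = "".join(str(value) for value in app_map)
```

Each position in the resulting list keeps a physical meaning (one specific application), which is the property that is lost after the compression processing of S301.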
S301, compressing the application map of the user to obtain an application feature sample of the user.
Optionally, because the number of existing applications is usually very large (e.g., tens of thousands) while the number of applications installed on one user's terminal device is usually small (e.g., a few dozen), the application map of the user is usually sparse (e.g., only a few dozen 1s among tens of thousands of points). By compressing the application map of the user, the present disclosure can effectively avoid a sparse application feature sample, which facilitates the subsequent attribute prediction processing for the user.
Optionally, the present disclosure may employ a preset model to compress the application map of any user. For example, an application map of a user is provided as an input to a preset model, and an application feature sample of the user is obtained according to an output of the model. Before the compression process is performed, each point included in the application map generally has a physical meaning, and after the compression process is performed, the number of points included in the application feature sample is much smaller than the number of points included in the application map, and each point included in the application feature sample generally no longer has a physical meaning.
Optionally, the preset model for the compression processing in the present disclosure may include DeepFM (FM: Factorization Machine). DeepFM mainly comprises a Deep part and an FM part. The FM part is mainly used to extract the low-order features of the application map, where the low-order features refer to second-order features. The Deep part is usually a DNN (deep neural network, such as a multilayer fully-connected neural network) and is mainly used to extract the high-order features of the application map, where the high-order features are features of order higher than two.
Specifically, the present disclosure may use the FM part to perform factorization processing on the application map of the user to obtain a first feature of the application map, which may be referred to as the second-order feature. Meanwhile, the present disclosure may provide the application map of the user as the input of the Deep part, which performs feature extraction processing on it to obtain a second feature of the application map, which may be referred to as the high-order feature. After the second-order feature and the high-order feature are obtained, the present disclosure may stitch (concatenate) them to obtain the application feature sample of the user. For example, in the case where the second-order feature is an M1-dimensional feature (M1 is an integer greater than 1) and the high-order feature is an M2-dimensional feature (M2 is an integer greater than M1), the present disclosure may take the stitched (M1+M2)-dimensional feature as the application feature sample of the user. The M1-dimensional feature may refer to a feature formed by M1 feature points, the M2-dimensional feature to a feature formed by M2 feature points, and the (M1+M2)-dimensional feature to a feature formed by (M1+M2) feature points. For example, the (M1+M2)-dimensional feature may be: {F1: 0.87, F2: 0.43, ……}. By compressing the application map with DeepFM, the present disclosure can not only obtain the application feature sample of a user conveniently, but also make the application feature sample contain more comprehensive feature information, which helps improve the prediction performance of the semi-supervised classification model.
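The stitching of a second-order feature and a high-order feature can be sketched as follows. This is only an illustrative sketch under strong assumptions: the embeddings and weights are random and untrained, a single fully-connected ReLU layer stands in for the Deep part, and the names fm_second_order and dense_relu are hypothetical. The second-order term uses the usual factorization-machine identity 0.5 * ((sum of x_i v_i)^2 - sum of (x_i v_i)^2), evaluated per embedding dimension.

```python
import random
from typing import List


def fm_second_order(x: List[int], emb: List[List[float]]) -> List[float]:
    """Pairwise-interaction vector of an FM, one value per embedding dimension."""
    k = len(emb[0])
    out = []
    for f in range(k):
        total = sum(x[i] * emb[i][f] for i in range(len(x)))
        squares = sum((x[i] * emb[i][f]) ** 2 for i in range(len(x)))
        out.append(0.5 * (total * total - squares))
    return out


def dense_relu(x: List[int], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One fully-connected layer with ReLU; a stand-in for the Deep part."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


rng = random.Random(0)
n_apps, m1, m2 = 6, 2, 3  # hypothetical sizes, with M2 > M1 as in the text
emb = [[rng.uniform(-1, 1) for _ in range(m1)] for _ in range(n_apps)]
w = [[rng.uniform(-1, 1) for _ in range(n_apps)] for _ in range(m2)]
b = [0.0] * m2

app_map = [0, 1, 0, 1, 0, 0]                 # applications 1 and 3 installed
second_order = fm_second_order(app_map, emb)  # M1-dimensional feature
high_order = dense_relu(app_map, w, b)        # M2-dimensional feature
feature_sample = second_order + high_order    # stitched (M1+M2)-dimensional feature
```

In a trained DeepFM the embeddings and layer weights would be learned; here they only show how the two parts produce vectors of M1 and M2 points that are concatenated into one application feature sample.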
In an alternative example, when setting the label for an application feature sample, the present disclosure may consider the number of applications having the predetermined attribute (e.g., a predetermined demographic attribute such as female or male) installed on the terminal device of the user. Specifically, the present disclosure may first determine, according to the application maps of all users, the number of applications having the predetermined attribute installed on the terminal device of each user, and then set the label corresponding to the predetermined attribute for the application feature samples whose number meets a predetermined number requirement.
Optionally, the predetermined number requirement may be that the number of applications having the predetermined attribute (e.g., a predetermined demographic attribute) installed on the terminal device is not less than a predetermined number threshold. The predetermined number requirement may also be that the terminal device ranks in the top N by the number of installed applications having the predetermined attribute. Of course, the predetermined number requirement may also be a combination of the two. By considering the number of applications having the predetermined attribute installed on the user's terminal device, the present disclosure helps ensure the accuracy of the labels set for the application feature samples, and thus helps further ensure the purity of the positive and negative samples in the sample set.
As a specific example, in the case where the predetermined attribute is female, the present disclosure may count, according to the application maps of all users, the number of applications having the female attribute installed in the terminal device of each user. Assume that applications having the female attribute are installed in the terminal devices of 100 users: the terminal device of the first user has X1 applications with the female attribute, the terminal device of the second user has X2 such applications, ……, and the terminal device of the hundredth user has X100 such applications. The present disclosure may sort X1, X2, ……, X100 in descending order, select the top 10 at a proportion of 10%, and set the first label (e.g., 1) for each of the application feature samples corresponding to the selected top 10, thereby forming 10 positive samples.
In the case where the predetermined attribute is male, the present disclosure may count, according to the application maps of all users, the number of applications having the male attribute installed in the terminal device of each user. Assume that applications having the male attribute are installed in the terminal devices of 200 users: the terminal device of the first user has Y1 applications with the male attribute, the terminal device of the second user has Y2 such applications, ……, and the terminal device of the two-hundredth user has Y200 such applications. The present disclosure may sort Y1, Y2, ……, Y200 in descending order, select the top 20 at a proportion of 10%, and set the second label (e.g., 0) for each of the application feature samples corresponding to the selected top 20, thereby forming 20 negative samples.
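The ranking-and-proportion labeling described in the two examples above can be sketched as follows. The user identifiers, the counts, and the function name top_percent_users are hypothetical; the sketch assumes that only users with at least one application of the given attribute are ranked, as in the examples.

```python
from typing import Dict, List


def top_percent_users(counts: Dict[str, int], percent: int) -> List[str]:
    """Rank users that installed at least one attribute app; keep the top percent."""
    candidates = [user for user, count in counts.items() if count > 0]
    candidates.sort(key=lambda user: counts[user], reverse=True)
    keep = len(candidates) * percent // 100
    return candidates[:keep]


# Hypothetical per-user counts of installed apps with each attribute.
female_counts = {f"u{i}": i for i in range(0, 11)}   # u0 has none; 10 users ranked
male_counts = {f"v{i}": i for i in range(1, 21)}     # 20 users ranked

# First label (1) for the top 10% by female-attribute apps,
# second label (0) for the top 10% by male-attribute apps.
labels = {user: 1 for user in top_percent_users(female_counts, 10)}
labels.update({user: 0 for user in top_percent_users(male_counts, 10)})
```

All other users remain without a label and are left for the semi-supervised classification model to predict.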
In an alternative example, the number of positive and negative samples obtained in the above manner is typically small. The present disclosure may therefore form new positive samples based on the small number of positive samples in the sample set, and form new negative samples based on the small number of negative samples in the sample set.
The process of generating a new positive sample by the present disclosure is described below in conjunction with fig. 4. The process of generating a new negative example is substantially the same as the process shown in fig. 4, and only the positive example needs to be replaced by the negative example, and the description is not repeated here.
In fig. 4, S400 determines whether there is an unselected positive sample in the sample set, and if there is an unselected positive sample, S401 is performed, and if there is no unselected positive sample, S405 is performed.
Although fig. 4 depicts the processing operation of generating a new positive sample as being performed for each positive sample in the sample set, it is understood that the present disclosure may also perform this processing operation for only a portion of the positive samples in the sample set. For example, the processing operation of generating new positive samples may be performed on randomly selected positive samples. As another example, it may be performed on a certain number of top-ranked positive samples in the sample set, where the ranking may refer to a ranking by the number of applications having the predetermined attribute (e.g., a predetermined demographic attribute) installed on the terminal device.
S401, selecting a positive sample from unselected positive samples in the sample set, marking the positive sample as the selected positive sample, and taking the positive sample as a base sample. A base sample in the present disclosure may refer to a positive sample on the basis of which a new positive sample is generated.
S402, determining all application feature samples in the sample set whose distance from the base sample meets the predetermined distance requirement, and taking the positive samples among them as small samples.
Optionally, the present disclosure may calculate the distance, such as the Euclidean distance, between each application feature sample in the sample set and the base sample. The predetermined distance requirement in the present disclosure may be that the distance from the base sample is less than a predetermined distance. If a circle is drawn with the base sample as its center and the predetermined distance as its radius, all application feature samples located in the circle, except the base sample, are the application feature samples whose distance from the base sample meets the predetermined distance requirement. One example is shown in fig. 5.
In fig. 5, the circles filled with gray represent application feature samples without labels, the triangles filled with gray represent positive samples, the triangles filled with black represent base samples, a circle 500 is obtained by taking the base sample as a center and taking a predetermined distance as a radius R, and all the circles filled with gray and all the triangles filled with gray in the circle 500 are all application feature samples whose distance from the base sample meets a predetermined distance requirement.
S403, judging whether the number of application feature samples in the neighborhood meets a predetermined requirement; if so, going to S404, and if not, returning to S400.
Optionally, the predetermined requirements in the present disclosure may be: the density of the small sample is not lower than a predetermined density. The predetermined requirements may also be: the number of small samples is not less than a predetermined number of samples, etc. If the space formed by all the application feature samples in the sample set whose distance from the base sample meets the predetermined distance requirement is called a neighborhood, the predetermined requirement may also be: the number of small samples is not lower than a predetermined number of samples, and the number of all application feature samples in the neighborhood is not lower than another predetermined number of samples, etc.
Alternatively, the density of the small samples may be a ratio of the number of all the small samples in the neighborhood to the number of all the application feature samples in the neighborhood. For example, the density of the small samples in fig. 5 is 3/10.
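The neighborhood of S402 and the density check of S403 can be sketched as follows. The coordinates, the radius, and the function names are hypothetical and do not reproduce fig. 5; each sample is assumed to be a pair of a feature tuple and a label, where the label None marks an application feature sample without a label.

```python
import math
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[Sequence[float], Optional[int]]  # (features, label); None = no label


def neighborhood(base: Sample, samples: List[Sample], radius: float) -> List[Sample]:
    """All samples other than the base whose Euclidean distance to it is below radius."""
    return [s for s in samples
            if s is not base and math.dist(s[0], base[0]) < radius]


def small_sample_density(neigh: List[Sample], base_label: int) -> float:
    """Share of neighborhood samples that carry the same label as the base sample."""
    if not neigh:
        return 0.0
    small = sum(1 for s in neigh if s[1] == base_label)
    return small / len(neigh)


# Hypothetical two-dimensional samples; label 1 = positive, None = no label.
base = ((0.0, 0.0), 1)
samples = [
    base,
    ((0.5, 0.0), 1),
    ((0.0, 0.5), None),
    ((-0.3, 0.2), None),
    ((0.1, -0.4), 1),
    ((3.0, 3.0), 1),      # outside the radius
    ((0.2, 0.2), None),
]

neigh = neighborhood(base, samples, radius=1.0)
density = small_sample_density(neigh, base_label=1)
meets_requirement = density >= 0.3  # hypothetical predetermined density
```

In this sketch the base sample is excluded from both the numerator and the denominator of the density; whether the base sample is counted is a design choice the sketch makes for itself.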
S404, a new positive sample is generated based on the base sample and the small samples, and the process returns to S400.
Optionally, the neighborhood usually includes a plurality of small samples, the present disclosure may generate at least one new positive sample by using the base sample and a part of the small samples, or the present disclosure may generate a new positive sample by using the base sample and each of the small samples. One example of the present disclosure of generating new positive samples from base samples and small samples is shown in fig. 6.
S405, ending the process of generating new positive samples.
If the number of small samples in the neighborhood of a base sample is too small and/or the number of all application feature samples in the neighborhood is too small, a new positive sample generated based on that base sample is likely to be wrong. Checking the predetermined requirement in S403 before generating a new positive sample helps avoid this.
By generating new positive samples and new negative samples, the present disclosure increases the numbers of positive and negative samples in the sample set, which avoids the adverse effect on initialization of the semi-supervised classification model caused by too few positive and negative samples, and thus helps improve the initialization effect of the semi-supervised classification model.
In an alternative example, an example of generating a new small sample from a base sample and the small samples in its neighborhood is shown in fig. 6.
In fig. 6, S600, all application feature samples except the base sample in the neighborhood of the base sample are divided into two parts, i.e., a first part and a second part.
Optionally, when dividing all application feature samples except the base sample in the neighborhood into two parts, the present disclosure may take the base sample as the center of a circle and choose a diameter through the center such that the difference between the numbers of small samples on the two sides of the diameter is minimized; the application feature samples on the two sides of that diameter form the first part and the second part respectively.
Following the example shown in fig. 5, the two parts are shown in fig. 7. In fig. 7, all application feature samples in the neighborhood are divided by a diameter passing through the base sample at the center into two parts located on the left and right sides of the diameter; the left part includes five application feature samples without labels and one positive sample, and the right part includes one application feature sample without a label and two positive samples.
S601, determining the probability of generating a small sample corresponding to each of the first part and the second part according to the number of small samples and the number of application feature samples without labels contained in that part.
Optionally, the present disclosure may use the ratio of the number of small samples included in the first part to the number of all application feature samples included in the first part as the probability of generating a small sample corresponding to the first part. Likewise, the present disclosure may use the ratio of the number of small samples included in the second part to the number of all application feature samples included in the second part as the probability of generating a small sample corresponding to the second part. In the example shown in fig. 7, the probability of generating a small sample for the left part is 1/6, and that for the right part is 2/3.
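The division of S600 and the probabilities of S601 can be sketched as follows. The coordinates and the two candidate diameters are hypothetical; they are chosen so that the sample counts on each side match those stated for fig. 7. A diameter is represented by its normal vector, an assumption made so that the same code would also apply when the feature space has more than two dimensions.

```python
from fractions import Fraction


def split_by_diameter(base, neigh, normal):
    """Split samples by the sign of (x - base) . normal; negative side is the first part."""
    first, second = [], []
    for feats, label in neigh:
        side = sum((f - c) * n for f, c, n in zip(feats, base, normal))
        (first if side < 0 else second).append((feats, label))
    return first, second


def count_small(part):
    """Number of small samples (label 1) in a part."""
    return sum(1 for _, label in part if label == 1)


def choose_split(base, neigh, normals):
    """Pick the candidate diameter with the smallest difference in small-sample counts."""
    return min((split_by_diameter(base, neigh, n) for n in normals),
               key=lambda parts: abs(count_small(parts[0]) - count_small(parts[1])))


def generation_probability(part):
    """Ratio of small samples to all application feature samples in the part."""
    return Fraction(count_small(part), len(part)) if part else Fraction(0)


base = (0.0, 0.0)
neigh = [  # label 1 = positive (small sample), None = no label
    ((-0.5, 0.1), None), ((-0.4, 0.3), None), ((-0.2, -0.3), None),
    ((-0.6, -0.1), None), ((-0.3, 0.5), None), ((-0.1, 0.4), 1),
    ((0.3, 0.2), None), ((0.4, 0.3), 1), ((0.2, 0.5), 1),
]
left, right = choose_split(base, neigh, normals=[(1.0, 0.0), (0.0, 1.0)])
p_left, p_right = generation_probability(left), generation_probability(right)
```

With these hypothetical points the vertical diameter is chosen, because it leaves one small sample on the left and two on the right, whereas the horizontal candidate would put all three on one side.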
S602, determining the direction of generating a small sample according to the probabilities.
Optionally, the direction in the present disclosure is used to indicate the first part and/or the second part. Specifically, the direction may indicate whether to generate a new small sample based on the base sample and the small samples in the first part, and whether to generate a new small sample based on the base sample and the small samples in the second part. The direction may affect the value of each feature point of the new small sample generated. The present disclosure may determine the direction of generating a small sample based on the above probabilities by rolling dice or the like.
Continuing with the example shown in fig. 7, since the probability of generating a small sample corresponding to the part on the left side of the diameter is 1/6, one face of a six-sided die may indicate that a small sample is generated and the other five faces that none is generated; by rolling the die, it can be determined whether to generate a new small sample based on the base sample and the small sample in the left part. Since the probability corresponding to the part on the right side of the diameter is 2/3, four faces of a six-sided die may indicate that a small sample is generated and the other two faces that none is generated.
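The dice roll can be sketched as one independent random draw per part. The function name choose_directions and the dictionary keys are hypothetical; a pseudo-random number generator stands in for the die.

```python
import random
from fractions import Fraction


def choose_directions(p_first, p_second, rng):
    """Decide independently for each part whether a new small sample is generated."""
    return {
        "first": rng.random() < float(p_first),
        "second": rng.random() < float(p_second),
    }


rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
decision = choose_directions(Fraction(1, 6), Fraction(2, 3), rng)
# decision["first"] is True with probability 1/6, decision["second"] with probability 2/3.
```

Because the two draws are independent, the result may indicate the first part, the second part, both, or neither, which matches the statement that the direction indicates the first part and/or the second part.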
S603, generating a new small sample according to the base sample and the small samples in the part indicated by the direction.
Optionally, the present disclosure may determine the value of each dimensional feature of a new small sample based on the value of each dimensional feature of the base sample and the value of each dimensional feature of a small sample in the part indicated by the direction, so as to generate a new small sample, that is, a new positive sample.
Specifically, for each dimensional feature, the feature values of the base sample and of the small sample in the part indicated by the direction form a value range. The present disclosure may select one feature value from the value range of each dimensional feature and use the selected values as the feature values of the corresponding dimensional features of the new small sample, thereby forming a new positive sample. Setting the feature values of the new small sample within the value ranges formed by the base sample and the small sample helps set those values reasonably and helps prevent the new small sample from becoming noise, thereby reducing the amount of noise that new small samples introduce into the sample set.
For example, assume that each application feature sample in the sample set includes M-dimensional features, i.e., a first-dimensional feature, a second-dimensional feature, ……, an (M-1)-th-dimensional feature, and an M-th-dimensional feature. Assume that the feature values of the M-dimensional features of the base sample are x1, x2, ……, xm-1, and xm, and that the feature values of the M-dimensional features of a small sample are y1, y2, ……, ym-1, and ym. Under these assumptions, the value ranges of the M-dimensional features of the new small sample are the ranges formed by x1 and y1, x2 and y2, ……, xm-1 and ym-1, and xm and ym. The present disclosure can select one value from each of these ranges as the feature value of the corresponding dimensional feature of the new small sample. For example, the present disclosure may use the middle value of each range as the feature value.
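The selection of feature values from the per-dimension value ranges can be sketched as follows. The function name new_small_sample and the numeric values are hypothetical. The sketch shows the middle-value choice mentioned above and, as one further possible choice, a uniformly random value within each range.

```python
import random


def new_small_sample(base, small, rng=None):
    """Build a new sample whose i-th value lies in the range formed by x_i and y_i.

    Without rng the middle value of each range is used; with rng a value is
    drawn uniformly from each range.
    """
    if rng is None:
        return [(x + y) / 2 for x, y in zip(base, small)]
    return [rng.uniform(min(x, y), max(x, y)) for x, y in zip(base, small)]


base_features = [0.87, 0.43, 0.10]    # x1, x2, x3 of the base sample
small_features = [0.67, 0.63, 0.30]   # y1, y2, y3 of a small sample

midpoint_sample = new_small_sample(base_features, small_features)
random_sample = new_small_sample(base_features, small_features, random.Random(0))
```

Either way, every feature value of the new sample stays between the corresponding values of the base sample and the small sample, so the new sample lies inside the region spanned by two existing positive samples.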
By determining the direction of generating small samples and generating new small samples based on that direction, the present disclosure helps avoid forming new small samples at improper positions. Since a new small sample formed at an improper position has a high probability of being a noise sample, generating new small samples based on the direction helps avoid introducing noise samples into the sample set, which helps improve the purity of the positive and negative samples in the sample set and the initialization effect of the semi-supervised classification model.
In an alternative example, labels may be set for the corresponding application feature samples without labels in the sample set, to form positive and negative samples, as follows. The prediction processing result output by the semi-supervised classification model is usually a probability value that represents the likelihood that the input application feature sample is a positive sample, and each probability value output by the model can be judged separately. For any probability value: if it is not lower than a first predetermined probability value, the corresponding input (i.e., the corresponding application feature sample) can be taken as a positive sample, i.e., the first label is set for it; if it is lower than a second predetermined probability value, the corresponding input can be taken as a negative sample, i.e., the second label is set for it; if it is lower than the first predetermined probability value and not lower than the second predetermined probability value, the corresponding input remains an application feature sample without a label, i.e., no label is set for it this time. The first predetermined probability value is typically much greater than the second predetermined probability value. For example, the first predetermined probability value may be 0.85 or 0.9, and the second predetermined probability value may be 0.15 or 0.1.
The higher the first predetermined probability value setting, the higher the purity of the obtained positive sample, and the lower the second predetermined probability value setting, the higher the purity of the obtained negative sample. The first predetermined probability value and the second predetermined probability value may be set according to actual requirements.
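The two-threshold rule can be written as a small function. The function name assign_label and the default thresholds 0.85 and 0.15 (taken from the example values above) are assumptions of this sketch; None stands for "no label set this time".

```python
from typing import Optional


def assign_label(prob: float,
                 first_threshold: float = 0.85,
                 second_threshold: float = 0.15) -> Optional[int]:
    """Map a predicted probability of being positive to a label or to None."""
    if prob >= first_threshold:   # not lower than the first predetermined value
        return 1                  # first label: positive sample
    if prob < second_threshold:   # lower than the second predetermined value
        return 0                  # second label: negative sample
    return None                   # stays without a label in this round


results = [assign_label(p) for p in (0.93, 0.85, 0.50, 0.15, 0.04)]
```

Raising first_threshold or lowering second_threshold leaves more samples without a label in a given round but increases the purity of the samples that do receive one.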
According to the method and the device, the application characteristic sample corresponding to the extremely high probability value is used as the positive sample, and the application characteristic sample corresponding to the extremely low probability value is used as the negative sample, so that the phenomenon that a noise sample is introduced into the sample set is avoided, the purities of the positive sample and the negative sample in the sample set are improved, and the initialization effect of the semi-supervised classification model is improved.
In an optional example, after positive and negative samples are obtained from the prediction processing result, the present disclosure may initialize the semi-supervised classification model again using the positive and negative samples in the current sample set, use the re-initialized model to perform prediction processing on the application feature samples without labels in the current sample set, and continue to set labels for the corresponding unlabeled application feature samples according to the prediction processing result to form positive and negative samples. This iterative loop is repeated until no application feature sample without a label remains in the sample set or, alternatively, until the number of application feature samples without labels no longer decreases as the number of iterations increases. Repeating the loop continuously reduces the number of unlabeled application feature samples in the sample set, so that all application feature samples in the sample set receive corresponding labels as soon as possible.
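The iterative loop and its two stopping conditions can be sketched as follows. The disclosure does not specify the classifier, so a simple nearest-centroid scorer is swapped in here purely as a stand-in; the functions fit, predict, and self_train, the thresholds, and the data are all hypothetical.

```python
import math


def centroid(rows):
    """Per-dimension mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit(pos, neg):
    """Stand-in 'initialization': remember the centroid of each class."""
    return centroid(pos), centroid(neg)


def predict(model, x):
    """Stand-in probability of being positive, from distances to the two centroids."""
    d_pos, d_neg = math.dist(x, model[0]), math.dist(x, model[1])
    return 0.5 if d_pos + d_neg == 0 else d_neg / (d_pos + d_neg)


def self_train(pos, neg, unlabeled, hi=0.85, lo=0.15):
    pos, neg, unlabeled = list(pos), list(neg), list(unlabeled)
    while unlabeled:                      # stop when nothing is left without a label
        model = fit(pos, neg)             # re-initialize with the current labeled samples
        remaining = []
        for x in unlabeled:
            p = predict(model, x)
            if p >= hi:
                pos.append(x)
            elif p < lo:
                neg.append(x)
            else:
                remaining.append(x)
        if len(remaining) == len(unlabeled):
            break                         # stop when the unlabeled count no longer decreases
        unlabeled = remaining
    return pos, neg, unlabeled


pos_out, neg_out, unlabeled_out = self_train(
    pos=[[1.0, 1.0]], neg=[[0.0, 0.0]],
    unlabeled=[[0.95, 1.0], [0.05, 0.0], [0.5, 0.5]],
)
```

In this hypothetical run the two samples close to a labeled sample are labeled in the first round, while the sample halfway between the classes never reaches either threshold, so the loop ends through the second stopping condition.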
The following describes an implementation process of the method for determining the attribute of the user according to the present disclosure with reference to fig. 8, taking the attribute of the user as the gender as an example.
In fig. 8, S800 generates an application map of each user according to the application installation information reported by the terminal device of each user.
S801, compressing the application map of each user to obtain the application feature sample of each user.
S802, determining the number of applications with female attributes installed on the terminal equipment of each user and the number of applications with male attributes installed on the terminal equipment of each user according to the application map of each user.
S8031, rank the number of applications installed with female attributes, and set a female label, i.e., a first label, for the first i 1% (e.g., 10%) application feature sample in the rank.
S8032, sort the number of applications installed with male attribute, and set a male label, i.e., a second label, for the first i 2% (10%) of the application feature samples in the sort.
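Steps S802 to S8032 can be sketched in Python as follows. The function names, the rounding down of the top-percentage cutoff, and the choice to leave a user unlabeled when the user falls in both top lists are assumptions made for this sketch; the text above does not specify how such cases are handled.

```python
import math
from typing import Dict, Set


def top_percent_ids(counts: Dict[str, int], percent: float) -> Set[str]:
    """Return the ids of the top `percent`% of users, ranked by count (descending)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    k = math.floor(len(counts) * percent / 100)  # assumed rounding rule
    ranked = sorted(counts, key=lambda uid: counts[uid], reverse=True)
    return set(ranked[:k])


def seed_labels(
    female_counts: Dict[str, int],
    male_counts: Dict[str, int],
    i1: float = 10.0,
    i2: float = 10.0,
) -> Dict[str, str]:
    """Set the first (female) and second (male) labels for the top-ranked users."""
    female_top = top_percent_ids(female_counts, i1)
    male_top = top_percent_ids(male_counts, i2)
    labels: Dict[str, str] = {}
    # Assumption: a user ranked in both top lists is ambiguous and stays unlabeled.
    for uid in female_top - male_top:
        labels[uid] = "female"
    for uid in male_top - female_top:
        labels[uid] = "male"
    return labels
```

All users not returned by `seed_labels` remain unlabeled and are handled later by the semi-supervised classification model.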
S8041, for any sample in the sample set that has the female label, taking the sample as a base sample and determining all application feature samples whose distance from the base sample meets a predetermined distance requirement; taking, as few samples, those of these application feature samples that have the female label; and, when the number of all these application feature samples meets a predetermined requirement, generating new few samples according to the base sample and the few samples.
For the specific implementation of this step, reference may be made to the description of fig. 6 above.
S8042, for any sample in the sample set that has the male label, taking the sample as a base sample and determining all application feature samples whose distance from the base sample meets a predetermined distance requirement; taking, as few samples, those of these application feature samples that have the male label; and, when the number of all these application feature samples meets a predetermined requirement, generating new few samples according to the base sample and the few samples.
For the specific implementation of this step, reference may be made to the description of fig. 6 above.
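The neighborhood search in S8041 and S8042 can be sketched as below. This is only an illustration under stated assumptions: Euclidean distance is used as the distance measure, "meets a predetermined distance requirement" is read as "distance not greater than `max_distance`", and "meets a predetermined requirement" is read as "at least `min_neighbors` neighbors". None of these specifics are given in the steps above, and all names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def find_few_samples(
    base_id: str,
    features: Dict[str, Vector],
    labels: Dict[str, Optional[str]],
    max_distance: float,   # assumed form of the distance requirement
    min_neighbors: int,    # assumed form of the count requirement
) -> Tuple[List[str], List[str], bool]:
    """Return (neighbor ids, few-sample ids, whether generation may proceed)."""
    base_label = labels.get(base_id)
    if base_label is None:
        raise ValueError("the base sample must carry a label")
    base_vec = features[base_id]
    # All application feature samples close enough to the base sample.
    neighbors = [
        sid for sid, vec in features.items()
        if sid != base_id and math.dist(base_vec, vec) <= max_distance
    ]
    # Among them, the samples that share the base sample's label are the few samples.
    few = [sid for sid in neighbors if labels.get(sid) == base_label]
    return neighbors, few, len(neighbors) >= min_neighbors
```

The same function serves both steps: calling it with a female-labeled base sample corresponds to S8041, and with a male-labeled base sample to S8042.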
S805, initializing the semi-supervised classification model according to all the positive samples and negative samples in the current sample set.
S806, performing user attribute prediction processing on the unlabeled application feature samples in the current sample set according to the initialized semi-supervised classification model.
S807, setting labels for the corresponding unlabeled application feature samples according to the prediction processing result to form positive samples and negative samples.
S808, determining whether to continue the process of setting labels for unlabeled application feature samples; if so, returning to S805; if not, proceeding to S809.
Optionally, the present disclosure may determine that the process of setting labels needs to continue when unlabeled application feature samples remain in the current sample set and their number is smaller than the number of unlabeled application feature samples in the previous iteration.
Optionally, the present disclosure may determine that the process of setting labels does not need to continue when unlabeled application feature samples remain in the current sample set but their number is equal to the number of unlabeled application feature samples in the previous iteration.
Optionally, the present disclosure may determine that the process of setting labels does not need to continue when no unlabeled application feature sample remains in the current sample set.
S809, ending the flow.
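The loop formed by S805 to S809, including the three stopping conditions, can be sketched as follows. The classifier here is a deliberately simple nearest-centroid stand-in, not the semi-supervised classification model of the disclosure; it, the function names, and the thresholds 0.95 and 0.05 are assumptions introduced only to make the sketch runnable.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Label = Optional[int]  # 1 = positive, 0 = negative, None = no label yet
Predictor = Callable[[Vector], float]


def fit_centroid_model(train: List[Tuple[Vector, int]]) -> Predictor:
    """Stand-in classifier (an assumption): score by distance to class centroids."""
    pos = [vec for vec, lab in train if lab == 1]
    neg = [vec for vec, lab in train if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    pos_c = [sum(col) / len(pos) for col in zip(*pos)]
    neg_c = [sum(col) / len(neg) for col in zip(*neg)]

    def predict(vec: Vector) -> float:
        d_pos, d_neg = math.dist(vec, pos_c), math.dist(vec, neg_c)
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total  # closer to pos -> higher score

    return predict


def self_training_loop(
    features: Dict[str, Vector],
    labels: Dict[str, Label],
    fit: Callable[[List[Tuple[Vector, int]]], Predictor],
    high: float = 0.95,
    low: float = 0.05,
) -> Tuple[Dict[str, Label], int]:
    """Repeat S805-S807 until S808 decides to stop; return labels and round count."""
    labels = dict(labels)  # do not mutate the caller's dictionary
    previous_unlabeled: Optional[int] = None
    rounds = 0
    while True:
        unlabeled = [sid for sid in features if labels.get(sid) is None]
        # S808: stop when nothing is left, or when the last round made no progress.
        if not unlabeled or (
            previous_unlabeled is not None and len(unlabeled) >= previous_unlabeled
        ):
            return labels, rounds  # S809
        previous_unlabeled = len(unlabeled)
        # S805: (re)initialize the model on all currently labeled samples.
        train = [(features[s], labels[s]) for s in features if labels.get(s) is not None]
        predict = fit(train)
        # S806 and S807: predict, then label only the confident samples.
        for sid in unlabeled:
            p = predict(features[sid])
            if p >= high:
                labels[sid] = 1
            elif p < low:
                labels[sid] = 0
        rounds += 1
```

Checking the stopping conditions at the top of the loop is equivalent to checking them at S808 after each round, and it also covers the case where the sample set contains no unlabeled samples to begin with.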
Exemplary devices
Fig. 9 is a schematic structural diagram of an embodiment of an apparatus for determining an attribute of a user according to the present disclosure. The apparatus of this embodiment may be used to implement the method embodiments of the present disclosure described above. As shown in fig. 9, the apparatus of this embodiment includes: a sample obtaining module 900, a first setting module 901, an initialization module 902, a prediction processing module 903, and a second setting module 904. Optionally, the apparatus of this embodiment may further include: a few-sample determining module 905, a sample generating module 906, and a control module 907.
The sample obtaining module 900 is configured to obtain application feature samples of multiple users to form a sample set. An application feature sample characterizes the applications installed on the terminal device of a user.
Optionally, the sample obtaining module 900 includes a first sub-module 9001 and a second sub-module 9002. The first sub-module 9001 is configured to, for any user, generate an application map of the user according to the application installation information reported by the terminal device of the user. The second sub-module 9002 is configured to compress the application map of the user to obtain the application feature sample of the user. For example, the second sub-module 9002 may perform factorization processing on the application map of the user to obtain a first feature of the application map; the second sub-module 9002 may also take the application map of the user as the input of a neural network and perform feature extraction processing on it via the neural network to obtain a second feature of the application map; the second sub-module 9002 then splices (concatenates) the first feature and the second feature to obtain the application feature sample of the user.
The first setting module 901 is configured to set a label for at least one application feature sample in the sample set formed by the sample obtaining module 900 according to applications with a predetermined attribute, so as to form positive samples and negative samples. For example, the first setting module 901 may determine, according to the application maps of all users, the number of applications with the predetermined attribute installed on the terminal device of each user, and set the label corresponding to the predetermined attribute for the application feature samples for which this number meets a predetermined number requirement.
The initialization module 902 is configured to initialize the semi-supervised classification model according to the positive samples and negative samples in the sample set.
The prediction processing module 903 is configured to perform user attribute prediction processing on the unlabeled application feature samples in the sample set according to the semi-supervised classification model initialized by the initialization module 902.
The second setting module 904 is configured to set a label for a corresponding application feature sample in the application feature samples without the label according to the prediction processing result of the prediction processing module 903, so as to form a positive sample and a negative sample.
Optionally, the second setting module 904 may take, as positive samples, the unlabeled application feature samples whose probability value, obtained from the prediction processing result, is not lower than a first predetermined probability value, and may take, as negative samples, the unlabeled application feature samples whose probability value, obtained from the prediction processing result, is lower than a second predetermined probability value.
The few-sample determining module 905 is configured to, for any labeled sample in the sample set, take the sample as a base sample, determine all application feature samples whose distance from the base sample meets a predetermined distance requirement, and take, as few samples, those of these application feature samples that have the same label as the base sample.
The sample generating module 906 is configured to generate a new few sample according to the base sample and the few samples if the number of all the application feature samples whose distance from the base sample meets the predetermined distance requirement meets a predetermined requirement.
Optionally, the sample generating module 906 includes: a third sub-module 9061, a fourth sub-module 9062, a fifth sub-module 9063, and a sixth sub-module 9064. The third sub-module 9061 is configured to divide all application feature samples whose distance from the base sample meets the predetermined distance requirement into a first part and a second part. As an example, the third sub-module 9061 may take the base sample as the center of a circle and divide all the application feature samples into a first part and a second part located on the two sides of a diameter passing through the center, on the principle that the difference between the numbers of few samples on the two sides of the diameter is minimized. The fourth sub-module 9062 is configured to determine the probability of generating few samples corresponding to each of the first part and the second part according to the number of few samples and the number of unlabeled application feature samples contained in each of the first part and the second part. The fifth sub-module 9063 is configured to determine the direction in which few samples are generated according to the probability determined by the fourth sub-module 9062, where the direction indicates the first part and/or the second part. The sixth sub-module 9064 is configured to generate a new few sample according to the base sample and the few samples in the part indicated by the direction determined by the fifth sub-module 9063. For example, the sixth sub-module 9064 may set the feature value of each dimension of the new few sample according to the value range formed, in that dimension, by the feature values of the base sample and of a few sample in the part indicated by the direction determined by the fifth sub-module 9063.
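The work of the fifth and sixth sub-modules can be sketched as below. The sketch takes the two per-part probabilities as inputs, because the formula used by the fourth sub-module is not given in this passage. Drawing the direction at random in proportion to those probabilities, picking one few sample at random from the chosen part, and sampling each dimension uniformly within its value range are all assumptions of this sketch, as are the function names.

```python
import random
from typing import List, Optional, Sequence

Vector = Sequence[float]


def choose_direction(prob_first: float, prob_second: float, rng: random.Random) -> str:
    """Pick "first" or "second" in proportion to the given probabilities."""
    if prob_first < 0 or prob_second < 0 or prob_first + prob_second == 0:
        raise ValueError("probabilities must be non-negative and not both zero")
    return rng.choices(["first", "second"], weights=[prob_first, prob_second])[0]


def generate_few_sample(base: Vector, few: Vector, rng: random.Random) -> List[float]:
    """Draw each dimension from the value range spanned by the base and few sample."""
    if len(base) != len(few):
        raise ValueError("base and few sample must have the same dimension")
    return [rng.uniform(min(b, f), max(b, f)) for b, f in zip(base, few)]


def synthesize(
    base: Vector,
    first_part_few: List[Vector],
    second_part_few: List[Vector],
    prob_first: float,
    prob_second: float,
    seed: Optional[int] = None,
) -> Optional[List[float]]:
    """Generate one new few sample, or None if the chosen part has no few samples."""
    rng = random.Random(seed)
    direction = choose_direction(prob_first, prob_second, rng)
    candidates = first_part_few if direction == "first" else second_part_few
    if not candidates:
        return None
    partner = rng.choice(candidates)  # assumed: one few sample picked at random
    return generate_few_sample(base, partner, rng)
```

Because every generated coordinate lies between the corresponding coordinates of the base sample and an existing few sample, the new sample stays inside the region already occupied by samples with that label.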
The few-sample determining module 905 and the sample generating module 906 in the present disclosure may perform their operations before the semi-supervised classification model is initialized for the first time, so as to increase the number of positive samples and negative samples in the sample set. This helps avoid the initialization effect of the semi-supervised classification model being degraded because the sample set contains too few positive samples and negative samples.
The control module 907 is configured to trigger the initialization module 902 to perform the step of initializing the semi-supervised classification model again according to the positive samples and the negative samples in the current sample set after the positive samples and the negative samples are formed by the second setting module 904 until there are no application feature samples without labels in the current sample set.
For the operations specifically performed by the modules and their sub-modules, reference may be made to the description of the method embodiments above with reference to figs. 2 to 8; details are not repeated here.
Exemplary electronic device
An electronic device according to an embodiment of the present disclosure is described below with reference to fig. 10. FIG. 10 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 10, the electronic device 101 includes one or more processors 1011 and memory 1012.
The processor 1011 may be a Central Processing Unit (CPU) or other form of processing unit having the capability to determine attributes of a user and/or instruction execution capabilities, and may control other components in the electronic device 101 to perform desired functions.
Memory 1012 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 1011 to implement the methods of determining attributes of a user of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 101 may further include: an input device 1013, an output device 1014, etc., which are interconnected by a bus system and/or other form of connection mechanism (not shown). Further, the input device 1013 may include, for example, a keyboard, a mouse, and the like. The output device 1014 can output various kinds of information to the outside. The output devices 1014 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.
Of course, for simplicity, only some of the components of the electronic device 101 relevant to the present disclosure are shown in fig. 10, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 101 may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method of determining attributes of a user according to various embodiments of the present disclosure described in the "exemplary methods" section of this specification above.
The program code of the computer program product for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, C++ or the like, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a method of determining attributes of a user according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium may include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the phrase "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects, and the like, will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A method of determining attributes of a user, comprising:
obtaining application characteristic samples of a plurality of users to form a sample set; the application characteristic sample is used for characterizing an application installed by the terminal equipment of a user;
according to the application with the preset attribute, setting a label for at least one application characteristic sample in the sample set to form a positive sample and a negative sample;
initializing a semi-supervised classification model according to the positive samples and the negative samples in the sample set;
according to the initialized semi-supervised classification model, performing attribute prediction processing on the application feature samples without the set labels in the sample set;
and setting labels for corresponding application characteristic samples in the application characteristic samples without the labels according to the prediction processing result to form a positive sample and a negative sample.
2. The method of claim 1, wherein the obtaining application feature samples for a plurality of users comprises:
for any user, generating an application map of the user according to application installation information reported by the terminal equipment of the user;
and compressing the application map of the user to obtain an application feature sample of the user.
3. The method of claim 2, wherein the compressing the application map of the user to obtain the application feature sample of the user comprises:
factoring the application map of the user to obtain a first characteristic of the application map of the user;
taking the application map of the user as the input of a neural network, and performing feature extraction processing on the application map of the user through the neural network to obtain second features of the application map of the user;
and splicing the first characteristic and the second characteristic to obtain an application characteristic sample of the user.
4. The method according to claim 2 or 3, wherein said tagging at least one application feature sample of said set of samples according to an application having a predetermined attribute comprises:
determining the number of applications with preset attributes respectively installed on the terminal equipment of each user according to the application maps of all users;
and setting labels corresponding to the preset attributes for the application characteristic samples of which the number meets the preset number of requirements.
5. The method of any of claims 1 to 4, wherein after setting a label for at least one application feature sample in the sample set according to an application having a predetermined attribute to form a positive sample and a negative sample, the method further comprises:
taking any labeled sample in the sample set as a base sample, determining all application feature samples whose distance from the base sample meets a predetermined distance requirement, and taking, as few samples, those of all the application feature samples that have the label of the base sample;
and if the number of all the application feature samples meets a predetermined requirement, generating a new few sample according to the base sample and the few samples.
6. The method of claim 5, wherein the generating a new few sample according to the base sample and the few samples comprises:
dividing all the application feature samples into a first part and a second part;
determining the probability of generating few samples corresponding to each of the first part and the second part according to the number of few samples contained in each of the first part and the second part and the number of application feature samples without labels;
determining the direction of generating few samples according to the probability; wherein the direction indicates the first part and/or the second part;
generating a new few sample according to the base sample and the few samples in the part indicated by the direction.
7. The method of claim 6, wherein the dividing all the application feature samples into a first part and a second part comprises:
taking the base sample as the center of a circle, dividing all the application feature samples into a first part and a second part located on the two sides of a diameter passing through the center, on the principle that the difference between the numbers of few samples among the application feature samples on the two sides of the diameter is minimized.
8. An apparatus for determining attributes of a user, wherein the apparatus comprises:
a sample obtaining module, configured to obtain application feature samples of a plurality of users to form a sample set; the application feature sample is used for characterizing an application installed on the terminal device of a user;
a first setting module, configured to set a label for at least one application feature sample in the sample set formed by the sample obtaining module according to an application having a predetermined attribute, to form a positive sample and a negative sample;
an initialization module, configured to initialize a semi-supervised classification model according to the positive samples and the negative samples in the sample set;
a prediction processing module, configured to perform attribute prediction processing on the application feature samples without labels in the sample set according to the semi-supervised classification model initialized by the initialization module;
and a second setting module, configured to set labels for corresponding application feature samples among the application feature samples without labels according to the prediction processing result of the prediction processing module, to form a positive sample and a negative sample.
9. A computer-readable storage medium, the storage medium storing a computer program for performing the method of any of the preceding claims 1-7.
10. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-7.
CN202010484863.6A 2020-06-01 2020-06-01 Method, device and equipment for determining attributes of users Active CN111639714B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010484863.6A CN111639714B (en) 2020-06-01 2020-06-01 Method, device and equipment for determining attributes of users

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010484863.6A CN111639714B (en) 2020-06-01 2020-06-01 Method, device and equipment for determining attributes of users

Publications (2)

Publication Number Publication Date
CN111639714A true CN111639714A (en) 2020-09-08
CN111639714B CN111639714B (en) 2021-07-23

Family

ID=72329716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010484863.6A Active CN111639714B (en) 2020-06-01 2020-06-01 Method, device and equipment for determining attributes of users

Country Status (1)

Country Link
CN (1) CN111639714B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022204939A1 (en) * 2021-03-30 2022-10-06 Paypal, Inc. Machine learning and reject inference techniques utilizing attributes of unlabeled data samples

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126597A (en) * 2016-06-20 2016-11-16 乐视控股(北京)有限公司 User property Forecasting Methodology and device
CN106776925A (en) * 2016-11-30 2017-05-31 腾云天宇科技(北京)有限公司 A kind of Forecasting Methodology of mobile terminal user's sex, server and system
CN107194336A (en) * 2017-05-11 2017-09-22 西安电子科技大学 The Classification of Polarimetric SAR Image method of network is measured based on semi-supervised depth distance
CN107451565A (en) * 2017-08-01 2017-12-08 重庆大学 A kind of semi-supervised small sample deep learning image model classifying identification method
CN108256052A (en) * 2018-01-15 2018-07-06 成都初联创智软件有限公司 Automobile industry potential customers' recognition methods based on tri-training
CN108256537A (en) * 2016-12-28 2018-07-06 北京酷我科技有限公司 A kind of user gender prediction method and system
US20180204111A1 (en) * 2013-02-28 2018-07-19 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
EP2646904B1 (en) * 2010-11-29 2018-08-29 BioCatch Ltd. Method and device for confirming computer end-user identity
CN108829763A (en) * 2018-05-28 2018-11-16 电子科技大学 A kind of attribute forecast method of the film review website user based on deep neural network
CN109299976A (en) * 2018-09-07 2019-02-01 深圳大学 Clicking rate prediction technique, electronic device and computer readable storage medium
CN109961075A (en) * 2017-12-22 2019-07-02 广东欧珀移动通信有限公司 User gender prediction method, apparatus, medium and electronic equipment
CN110210335A (en) * 2019-05-16 2019-09-06 上海工程技术大学 A kind of training method, system and the device of pedestrian's weight identification learning model
CN110490625A (en) * 2018-05-11 2019-11-22 北京京东尚科信息技术有限公司 User preference determines method and device, electronic equipment, storage medium
CN110674883A (en) * 2019-09-29 2020-01-10 江南大学 Active learning method based on k nearest neighbor and probability selection
CN111191722A (en) * 2019-12-30 2020-05-22 支付宝(杭州)信息技术有限公司 Method and device for training prediction model through computer
CN111209173A (en) * 2020-01-02 2020-05-29 腾讯科技(深圳)有限公司 Performance prediction method, device, storage medium and electronic equipment

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2646904B1 (en) * 2010-11-29 2018-08-29 BioCatch Ltd. Method and device for confirming computer end-user identity
US20180204111A1 (en) * 2013-02-28 2018-07-19 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
CN106126597A (en) * 2016-06-20 2016-11-16 乐视控股(北京)有限公司 User property Forecasting Methodology and device
CN106776925A (en) * 2016-11-30 2017-05-31 腾云天宇科技(北京)有限公司 A kind of Forecasting Methodology of mobile terminal user's sex, server and system
CN108256537A (en) * 2016-12-28 2018-07-06 北京酷我科技有限公司 A kind of user gender prediction method and system
CN107194336A (en) * 2017-05-11 2017-09-22 西安电子科技大学 The Classification of Polarimetric SAR Image method of network is measured based on semi-supervised depth distance
CN107451565A (en) * 2017-08-01 2017-12-08 重庆大学 A kind of semi-supervised small sample deep learning image model classifying identification method
CN109961075A (en) * 2017-12-22 2019-07-02 广东欧珀移动通信有限公司 User gender prediction method, apparatus, medium and electronic equipment
CN108256052A (en) * 2018-01-15 2018-07-06 成都初联创智软件有限公司 Method for identifying potential customers in the automobile industry based on tri-training
CN110490625A (en) * 2018-05-11 2019-11-22 北京京东尚科信息技术有限公司 User preference determination method and device, electronic equipment, storage medium
CN108829763A (en) * 2018-05-28 2018-11-16 电子科技大学 Attribute prediction method for film review website users based on deep neural network
CN109299976A (en) * 2018-09-07 2019-02-01 深圳大学 Click-through rate prediction method, electronic device and computer readable storage medium
CN110210335A (en) * 2019-05-16 2019-09-06 上海工程技术大学 Training method, system and device for pedestrian re-identification learning model
CN110674883A (en) * 2019-09-29 2020-01-10 江南大学 Active learning method based on k nearest neighbor and probability selection
CN111191722A (en) * 2019-12-30 2020-05-22 支付宝(杭州)信息技术有限公司 Method and device for training prediction model through computer
CN111209173A (en) * 2020-01-02 2020-05-29 腾讯科技(深圳)有限公司 Performance prediction method, device, storage medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
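Zhang Wei et al.: "A rough set attribute reduction algorithm for partially labeled data", Computer Science (《计算机科学》) *
Wang Junshu et al.: "DE-self-training semi-supervised classification algorithm for hyperspectral remote sensing images", Transactions of the Chinese Society for Agricultural Machinery (《农业机械学报》) *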

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022204939A1 (en) * 2021-03-30 2022-10-06 Paypal, Inc. Machine learning and reject inference techniques utilizing attributes of unlabeled data samples

Also Published As

Publication number Publication date
CN111639714B (en) 2021-07-23

Similar Documents

Publication Publication Date Title
Krishnaraj et al. An efficient radix trie‐based semantic visual indexing model for large‐scale image retrieval in cloud environment
Yu et al. Category-based deep CCA for fine-grained venue discovery from multimodal data
US11455473B2 (en) Vector representation based on context
Bucak et al. Multiple kernel learning for visual object recognition: A review
Zhu et al. Dimensionality reduction by mixed kernel canonical correlation analysis
Dey Sarkar et al. A novel feature selection technique for text classification using Naive Bayes
US10878281B2 (en) Video face clustering detection with inherent and weak supervision
Kumar et al. Extraction of informative regions of a face for facial expression recognition
US11263223B2 (en) Using machine learning to determine electronic document similarity
CN110929505B (en) Method and device for generating house source title, storage medium and electronic equipment
Zhang et al. Semisupervised particle swarm optimization for classification
Said et al. DGSD: Distributed graph representation via graph statistical properties
CN111639714B (en) Method, device and equipment for determining attributes of users
JP2019028984A (en) System and method for clustering near-duplicate images in very large image collections, method and system for clustering multiple images, program, method for clustering multiple content items
Wang et al. Semi-supervised constraints preserving hashing
Tang et al. Collaborative Filtering Recommendation Using Nonnegative Matrix Factorization in GPU‐Accelerated Spark Platform
Wang et al. Random angular projection for fast nearest subspace search
US11227231B2 (en) Computational efficiency in symbolic sequence analytics using random sequence embeddings
US20210357681A1 (en) Scalable Attributed Graph Embedding for Large-Scale Graph Analytics
JP2022079430A (en) Methods, systems and computer programs
Liu et al. Fast constrained spectral clustering and cluster ensemble with random projection
Liu et al. Incremental tensor principal component analysis for handwritten digit recognition
Cao et al. Multiple hierarchical deep hashing for large scale image retrieval
US20200372108A1 (en) Natural language skill generation for digital assistants
Wei et al. Cross-modal retrieval based on shared proxies

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201104

Address after: 100085 Floor 102-1, Building No. 35, West Second Banner Road, Haidian District, Beijing

Applicant after: Seashell Housing (Beijing) Technology Co.,Ltd.

Address before: 300457 Unit 5, Room 1, 112, Room 1, Office Building C, Nangang Industrial Zone, Binhai New Area Economic and Technological Development Zone, Tianjin

Applicant before: BEIKE TECHNOLOGY Co.,Ltd.

GR01 Patent grant