Summary of the invention
The embodiments of this specification provide a target group mining method, a target group mining apparatus, a server, and a readable storage medium.
In a first aspect, an embodiment of this specification provides a target group mining method, comprising:
For a set of users to be identified, screening out a primary target user set according to the users' natural attribute information and social attribute information;
For the primary target user set, extracting target group feature vectors;
For the target group feature vectors, performing identification based on a pre-trained first-level weak classifier to obtain an intermediate target user set;
For the intermediate target user set, performing identification based on a pre-trained second-level strong classifier to determine a final target user set belonging to the target group.
In a second aspect, an embodiment of this specification provides a target group mining method, comprising:
For a set of users to be identified, screening out a primary target user set according to the users' natural attribute information and social attribute information;
For the primary target user set, extracting target group feature vectors;
For the target group feature vectors, performing identification based on a pre-trained strong classifier to determine a final user set belonging to the target group.
In a third aspect, an embodiment of this specification provides a target group mining apparatus, comprising:
A primary target user set screening unit, configured to screen out, for a set of users to be identified, a primary target user set according to the users' natural attribute information and social attribute information;
A target group feature vector extraction unit, configured to extract target group feature vectors for the primary target user set;
A first-level identification unit, configured to perform identification on the target group feature vectors based on a pre-trained first-level weak classifier to obtain an intermediate target user set;
A second-level identification unit, configured to perform identification on the intermediate target user set based on a pre-trained second-level strong classifier to determine a final target user set belonging to the target group.
In a fourth aspect, an embodiment of this specification provides a target group mining apparatus, comprising:
A primary target user set screening unit, configured to screen out, for a set of users to be identified, a primary target user set according to the users' natural attribute information and social attribute information;
A target group feature vector extraction unit, configured to extract target group feature vectors for the primary target user set;
An identification unit, configured to perform identification on the target group feature vectors based on a pre-trained strong classifier to determine a final user set belonging to the target group.
In a fifth aspect, an embodiment of this specification provides a server, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of any of the methods described above when executing the program.
In a sixth aspect, an embodiment of this specification provides a computer-readable storage medium on which a computer program is stored, wherein the program implements the steps of any of the methods described above when executed by a processor.
The embodiments of this specification have the following beneficial effects:
The target group mining method provided by the embodiments of this specification uses two-level classification (a weak classifier followed by a strong classifier), so that large volumes of user data can be mined more accurately. Moreover, screening out the primary target user set from the set of users to be identified narrows the scope of identification. In addition, the target group feature vector, which is determined from the traits that members of the target group have in common, describes the target group accurately, so that the identification result is more accurate.
In addition, a two-level model training approach is proposed. For example, to train a model for enrolled college students, a weak classifier (such as naive Bayes) is first used for classification, and the negative results among its classification results are then used as negative samples to train a stronger classifier (such as a support vector machine) in a second round. The stronger classifier classifies the positive results of the first round, and users whose final classification result is positive are judged to belong to the enrolled college student group.
The way features are selected for candidate college-student users is a further point of innovation. For example, the number of campus Wi-Fi connections is used as a feature (where campus Wi-Fi is identified from the Wi-Fi networks to which users already labeled as college students connect frequently around the start of term), the number of enrolled college students that the user has added to the address book is used as a feature, changes in the user's LBS position are used as a feature, whether the user has added a delivery address at or near a college is used as a feature, and so on. Together these features describe enrolled college students fairly comprehensively and form the basis for training an accurate model and accurately identifying college students.
Detailed description of the embodiments
For a better understanding of the above technical solutions, the technical solutions of the embodiments of this specification are described in detail below with reference to the drawings and specific embodiments. It should be understood that the embodiments of this specification and the specific features in them are detailed descriptions of the technical solutions rather than limitations on them, and that, where no conflict arises, the embodiments and the technical features in them may be combined with one another.
Fig. 1 is a schematic diagram of an application scenario of the target group mining method according to an embodiment of this specification. The terminal 10 is the user side, and the server 20 is the background server of a website or app. The server 20 collects data about multiple users from multiple terminals 10, for example obtaining a user's natural attribute information (such as age and gender) or social attribute information (such as activity area and friends) from the user's registration, operation, and consumption information. Based on the large amount of user information obtained, the server 20 mines a specific target group.
In a first aspect, an embodiment of this specification provides a target group mining method which, referring to Fig. 2, includes steps S201 to S204.
S201: For a set of users to be identified, screen out a primary target user set according to the users' natural attribute information and social attribute information.
In the scenario of Fig. 1, for example, the server obtains user data from multiple terminals, and this data constitutes the set of users to be identified.
To mine the target group efficiently, a primary target user set can be screened out of the large set of users to be identified according to the users' natural attribute information and social attribute information, thereby narrowing the data range.
For example, if the target group to be mined is enrolled college students, the screening condition can be set according to the actual situation as: aged 17 to 27, with a frequent activity area on campus. A primary target user set that preliminarily matches enrolled college students is thereby screened out of the large volume of user data.
It will be appreciated that using age as the natural attribute information and activity area as the social attribute information is only an example, and actual implementations are not limited to it. Likewise, taking enrolled college students as the target group to be mined is only an example; the embodiments of this specification apply equally to mining other target groups (such as the employees of a particular company or workers in a particular trade).
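The screening condition in the college-student example could be sketched as follows. This is only an illustration: the `User` record, its field names, and the `"campus"` area label are assumptions made for the example, not part of the specification.

```python
from dataclasses import dataclass


@dataclass
class User:
    user_id: str
    age: int             # natural attribute (hypothetical field)
    frequent_area: str   # social attribute, e.g. "campus" or "office" (hypothetical field)


def screen_primary_targets(users, min_age=17, max_age=27, area="campus"):
    """Keep users whose age and frequent activity area match the screening condition."""
    return [u for u in users
            if min_age <= u.age <= max_age and u.frequent_area == area]


users = [
    User("u1", 19, "campus"),
    User("u2", 35, "campus"),
    User("u3", 21, "office"),
    User("u4", 27, "campus"),
]
primary = screen_primary_targets(users)
print([u.user_id for u in primary])
```

For a different target group, only the condition inside `screen_primary_targets` would need to change.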
S202: For the primary target user set, extract target group feature vectors.
It will be appreciated that the members of a target group share certain traits, so the target group feature vector can be determined on the basis of those shared traits.
Still taking enrolled college students as an example, the features determined may include: home city, online account balance, age, address-book contacts, delivery address, number of campus Wi-Fi connections, LBS (location-based service) features, and so on.
Home city, online account balance, and age may be called the user's static features. For example, whether the home city is the same as the city where the college is located can serve as one criterion, as can whether the online account balance falls within a certain range and whether the age falls within a certain range (17 to 27).
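One possible way to encode these three static checks as 0/1 feature values is sketched below. The dictionary keys and the balance range are hypothetical; the text only says that the balance is checked against "a certain range".

```python
def static_features(user, college_city,
                    balance_range=(0.0, 5000.0),  # assumed range, not from the text
                    age_range=(17, 27)):
    """Encode the three static checks as a list of 0/1 values."""
    low_balance, high_balance = balance_range
    low_age, high_age = age_range
    return [
        int(user["home_city"] == college_city),               # home city matches college city
        int(low_balance <= user["balance"] <= high_balance),  # balance within range
        int(low_age <= user["age"] <= high_age),              # age within range
    ]


student_like = {"home_city": "Hangzhou", "balance": 120.0, "age": 20}
other = {"home_city": "Beijing", "balance": 9000.0, "age": 30}
print(static_features(student_like, "Hangzhou"))
print(static_features(other, "Hangzhou"))
```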
Address-book contacts, delivery address, number of campus Wi-Fi connections, and LBS features may be called the user's dynamic features. For example, a user's delivery address counts as a campus address if it is a school dormitory, a school laboratory, a teaching and research office, or the like. For the address-book contact feature, the number of enrolled college students among the contacts the user has added is taken as the reference; in particular, for newly admitted students, the friend-adding behavior can be further restricted to the number of contacts added during the two months of August and September (the start-of-school season). Similarly, the number of times the user connects to campus Wi-Fi in August and September can be used as a feature value. Campus Wi-Fi can be determined from the names of the Wi-Fi networks that labeled enrolled college students connect to frequently, which are judged to be campus Wi-Fi. For the user's LBS feature, the distance from the user to the nearest college is used for the judgment; this distance is measured before the college entrance examination, during the Spring Festival, during the summer vacation, and at the start of school in September.
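The campus Wi-Fi and LBS features described above might be computed along the following lines. This is a sketch under assumptions: the log format (lists of `(date, ssid)` pairs), the `min_students` threshold, and the use of the haversine great-circle distance are illustrative choices; the text does not say how "frequently connected" or "distance" is measured.

```python
import math
from collections import Counter
from datetime import date


def infer_campus_wifi(labeled_student_logs, min_students=2):
    """Treat a Wi-Fi name as campus Wi-Fi if enough labeled students connected to it."""
    seen = Counter()
    for log in labeled_student_logs:           # one list of (date, ssid) per labeled student
        for ssid in {ssid for _, ssid in log}:  # count each student at most once per name
            seen[ssid] += 1
    return {ssid for ssid, n in seen.items() if n >= min_students}


def count_term_start_connections(log, campus_wifi, months=(8, 9)):
    """Count a user's campus Wi-Fi connections made in August or September."""
    return sum(1 for day, ssid in log if ssid in campus_wifi and day.month in months)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_college_km(lat, lon, colleges):
    """Distance from the user to the closest college in a list of (lat, lon) pairs."""
    return min(haversine_km(lat, lon, c_lat, c_lon) for c_lat, c_lon in colleges)


student_logs = [
    [(date(2023, 9, 1), "CampusNet"), (date(2023, 9, 2), "CampusNet")],
    [(date(2023, 9, 3), "CampusNet"), (date(2023, 9, 3), "CafeWiFi")],
]
campus = infer_campus_wifi(student_logs)
user_log = [
    (date(2023, 8, 30), "CampusNet"),
    (date(2023, 9, 5), "CampusNet"),
    (date(2023, 11, 1), "CampusNet"),
    (date(2023, 9, 6), "HomeWiFi"),
]
print(campus, count_term_start_connections(user_log, campus))
print(nearest_college_km(30.0, 120.0, [(30.0, 120.0), (31.0, 120.0)]))
```

The same distance function could be evaluated at each of the four time points named in the text to produce one LBS feature per period.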
S203: For the target group feature vectors, perform identification based on the pre-trained first-level weak classifier to obtain an intermediate target user set.
The target group feature vectors extracted for the primary target user set are input to the pre-trained first-level weak classifier for identification, which further determines an intermediate target user set belonging to the target group.
S204: For the intermediate target user set, perform identification based on the pre-trained second-level strong classifier to determine a final target user set belonging to the target group.
The intermediate target user set identified by the first-level weak classifier is further identified by the pre-trained second-level strong classifier, which determines the final target user set belonging to the target group.
Identifying the target group with two levels of classifiers in this way helps ensure the accuracy of the identification.
The algorithm used by the first-level weak classifier includes, but is not limited to, any one of naive Bayes, logistic regression, and boosting. The algorithm used by the second-level strong classifier includes, but is not limited to, one of a support vector machine, a deep neural network, a gradient boosting decision tree, and XGBoost. The training process of the first-level weak classifier and the second-level strong classifier is described below with reference to Fig. 3.
Fig. 3 is a schematic diagram of model training in the target group mining method provided in the first aspect of the embodiments of this specification.
The training process is as follows:
(1) Obtain positive samples of the target group, and obtain unlabeled samples.
Users who have already been labeled as belonging to the target group are taken as positive (labeled) samples. Unlabeled samples are determined according to the users' natural attribute information and social attribute information. For example, still taking the identification of enrolled college students as an example, members of the target group are users who are relatively close to a school and whose age falls within the admission range, so users aged 17 to 27 and within 5 km of the nearest school are selected as unlabeled samples. In general, the number of unlabeled samples is considerably larger than the number of confirmed positive samples. For example, the ratio of positive samples to unlabeled samples may be 1:10, with about 3 million verified users as positive samples and about 30 million unverified users as unlabeled samples.
(2) For the positive samples and the unlabeled samples, extract target group feature vectors.
For the positive samples and the unlabeled samples, features that describe the traits shared by the target group are extracted as the target group feature vector. For example, for enrolled college students, static features (one or more of the user's age, home city, and online account balance) and dynamic features (one or more of network connection information for a specific area within a specific time period, the user's friends within a specific time period, the user's address, and LBS information) are extracted to form the target group feature vector.
(3) Train the first-level weak classifier based on the target group feature vectors extracted from the positive samples and the unlabeled samples.
The algorithm used by the first-level weak classifier includes, but is not limited to, any one of naive Bayes, logistic regression, and boosting. The first-level weak classifier can make a preliminary judgment about membership of the target group from the target group feature vector.
(4) Classify the unlabeled samples using the first-level weak classifier to determine negative samples of the target group.
The first-level weak classifier is used to classify the unlabeled samples mentioned in step (1). Samples whose classification result is 0 are labeled as negative samples, and these negative samples are used together with the positive samples to train the second-level strong classifier.
(5) Train the second-level strong classifier based on the positive samples and the negative samples.
The second-level strong classifier is trained on the labeled positive samples and the negative samples determined by the first-level weak classifier. The algorithm used by the second-level strong classifier includes, but is not limited to, one of a support vector machine, a deep neural network, a gradient boosting decision tree, and XGBoost.
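The flow of samples through steps (3) to (5) can be sketched as below. Two assumptions are made loudly here: the weak classifier is trained with the unlabeled samples provisionally treated as class 0, which the text does not state explicitly, and the toy `NearestCentroid` class is only a stand-in so that the sketch runs without external libraries. It is not naive Bayes or a support vector machine; any object with `fit` and `predict` methods could be passed instead.

```python
class NearestCentroid:
    """Toy stand-in classifier: predicts the class whose mean feature vector is closer."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def dist(x, c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return [int(dist(x, self.centroids[1]) < dist(x, self.centroids[0])) for x in X]


def train_two_stage(positives, unlabeled, make_weak, make_strong):
    # Step (3): weak classifier on positives (1) versus unlabeled samples (assumed 0).
    weak = make_weak().fit(positives + unlabeled,
                           [1] * len(positives) + [0] * len(unlabeled))
    # Step (4): unlabeled samples that the weak classifier scores 0 become negatives.
    negatives = [x for x, p in zip(unlabeled, weak.predict(unlabeled)) if p == 0]
    # Step (5): strong classifier on positives versus those negatives.
    strong = make_strong().fit(positives + negatives,
                               [1] * len(positives) + [0] * len(negatives))
    return weak, strong, negatives


# Hypothetical feature vectors: [campus Wi-Fi connections, km to nearest college]
positives = [[9, 0.5], [8, 1.0], [10, 0.2]]
unlabeled = [[9, 0.8], [1, 4.0], [0, 4.5], [2, 3.0], [8, 0.6], [1, 5.0]]
weak, strong, negatives = train_two_stage(positives, unlabeled,
                                          NearestCentroid, NearestCentroid)
print(negatives)
print(strong.predict([[9, 0.7], [1, 4.2]]))
```

Passing factories (`make_weak`, `make_strong`) rather than fixed classes keeps the sample flow separate from the choice of algorithm, which the text leaves open.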
Once the first-level weak classifier and the second-level strong classifier have been trained, the target group can be identified within the set of users to be identified.
Fig. 4 is a schematic diagram of model identification in the target group mining method provided in the first aspect of the embodiments of this specification.
The model identification process includes:
(1) For the set of users to be identified, screen out a primary target user set.
For example, for enrolled college students, the primary target user set is screened out according to user age and activity area.
(2) For the primary target user set, extract target group feature vectors.
For example, for enrolled college students, the target group feature vector is determined to include static features (one or more of the user's age, home city, and online account balance) and dynamic features (one or more of network connection information for a specific area within a specific time period, the user's friends within a specific time period, the user's address, and LBS information). These features are therefore extracted for the primary target user set as the target group feature vectors.
(3) For the target group feature vectors, perform identification based on the first-level weak classifier to obtain an intermediate target user set.
The target group feature vectors are input to the first-level weak classifier for identification. The identification result is 1 (this level of classification determines that the user belongs to the target group) or 0 (this level of classification determines that the user does not belong to the target group), and the users whose identification result is 1 form the intermediate target user set.
(4) For the intermediate target user set, perform identification based on the second-level strong classifier to determine the final target user set belonging to the target group.
The intermediate target user set is further identified by the second-level strong classifier. The identification result is 1 (this level of classification determines that the user belongs to the target group) or 0 (this level of classification determines that the user does not belong to the target group), and the users whose identification result is 1 form the final target user set.
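The four identification steps above amount to a cascade in which each stage only sees the users that passed the previous one. A minimal sketch follows; the function and parameter names are hypothetical, and the simple threshold lambdas merely stand in for the trained weak and strong classifiers.

```python
def identify_target_group(users, screen, extract, weak_predict, strong_predict):
    """Run screening, feature extraction, and two-level classification in sequence."""
    primary = [u for u in users if screen(u)]                   # step (1)
    vectors = [extract(u) for u in primary]                     # step (2)
    intermediate = [(u, v) for u, v, r                          # step (3): keep result 1
                    in zip(primary, vectors, weak_predict(vectors)) if r == 1]
    inter_vectors = [v for _, v in intermediate]
    return [u for (u, _), r                                     # step (4): keep result 1
            in zip(intermediate, strong_predict(inter_vectors)) if r == 1]


users = [
    {"id": "a", "age": 20, "wifi": 9, "km": 0.5},
    {"id": "b", "age": 40, "wifi": 9, "km": 0.5},   # removed by screening
    {"id": "c", "age": 22, "wifi": 1, "km": 0.4},   # rejected by the weak stage
    {"id": "d", "age": 19, "wifi": 5, "km": 6.0},   # rejected by the strong stage
]
final = identify_target_group(
    users,
    screen=lambda u: 17 <= u["age"] <= 27,
    extract=lambda u: [u["wifi"], u["km"]],
    weak_predict=lambda X: [int(x[0] >= 3) for x in X],                     # stand-in
    strong_predict=lambda X: [int(x[0] >= 3 and x[1] <= 2.0) for x in X],   # stand-in
)
print([u["id"] for u in final])
```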
It can be seen that the target group mining method provided by the embodiments of this specification uses two-level classification (a weak classifier followed by a strong classifier), so that large volumes of user data can be mined more accurately. Moreover, screening out the primary target user set from the set of users to be identified narrows the scope of identification. In addition, the target group feature vector, which is determined from the traits that members of the target group have in common, describes the target group accurately, so that the identification result is more accurate.
In addition, a two-level model training approach is proposed. For example, to train a model for enrolled college students, a weak classifier (such as naive Bayes) is first used for classification, and the negative results among its classification results are then used as negative samples to train a stronger classifier (such as a support vector machine) in a second round. The stronger classifier classifies the positive results of the first round, and users whose final classification result is positive are judged to belong to the enrolled college student group. The way features are selected for candidate college-student users is a further point of innovation. For example, the number of campus Wi-Fi connections is used as a feature (where campus Wi-Fi is identified from the Wi-Fi networks to which users already labeled as college students connect frequently around the start of term), the number of enrolled college students that the user has added to the address book is used as a feature, changes in the user's LBS position are used as a feature, whether the user has added a delivery address at or near a college is used as a feature, and so on. Together these features describe enrolled college students fairly comprehensively and form the basis for training an accurate model and accurately identifying college students.
In a second aspect, based on the same inventive concept, an embodiment of this specification provides a target group mining method. For the details of the processes in Figs. 5 to 7, refer to Figs. 2 to 4; only the differences are described below.
Fig. 5 is a flowchart of the target group mining method provided in the second aspect of the embodiments of this specification. The method includes:
S501: For a set of users to be identified, screen out a primary target user set according to the users' natural attribute information and social attribute information;
S502: For the primary target user set, extract target group feature vectors;
S503: For the target group feature vectors, perform identification based on a pre-trained strong classifier to determine a final user set belonging to the target group.
The primary target user set in step S501 may be screened out as follows: matching rules that fit the target group are set in advance according to multiple items of the users' natural attribute information and social attribute information, and the primary target user set is then selected from the set of users to be identified according to those matching rules. For example, the primary target user set is determined according to one or more of: user age information, user activity area information, the number of network connections made by the user to a specific area within a specific time period, the number of friends newly added by the user within a specific time period, and whether the user's address falls within a specific area. For example, for mining college students, the matching rules set may be: within 5 km of a school, aged 17 to 27, connected to campus Wi-Fi more than 5 times recently, more than 1 contact added to the phone address book, delivery address within the school area, home city and current city not the same city, and similar rules.
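The college-student matching rules listed above could be written as a single predicate, as sketched below. The field names are hypothetical, and the sketch assumes that all listed rules must hold at once; the text does not say whether the rules are combined with "and" or scored in some other way.

```python
def matches_college_rule(u):
    """Return True if the user satisfies every rule in the college-student example."""
    return (
        u["km_to_school"] <= 5.0                     # within 5 km of a school
        and 17 <= u["age"] <= 27                     # aged 17 to 27
        and u["recent_campus_wifi_connections"] > 5  # more than 5 recent campus Wi-Fi connections
        and u["new_contacts"] > 1                    # more than 1 contact added
        and u["delivery_in_school_area"]             # delivery address within the school area
        and u["home_city"] != u["current_city"]      # home city differs from current city
    )


candidate = {
    "km_to_school": 1.2, "age": 19, "recent_campus_wifi_connections": 12,
    "new_contacts": 4, "delivery_in_school_area": True,
    "home_city": "Chengdu", "current_city": "Hangzhou",
}
too_old = dict(candidate, age=45)
print(matches_college_rule(candidate), matches_college_rule(too_old))
```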
Fig. 6 is a schematic diagram of model training in the target group mining method provided in the second aspect of the embodiments of this specification.
(1) Obtain positive samples of the target group, and obtain unlabeled samples.
For example, the unlabeled samples are determined according to one or more of: user age information, user activity area information, the number of network connections made by the user to a specific area within a specific time period, the number of friends newly added by the user within a specific time period, and whether the user's address falls within a specific area.
(2) Select a preset proportion of the unlabeled samples as negative samples.
For example, the unlabeled samples are sampled, and 10% of them are selected at random (or according to a rule) as negative samples.
(3) For the positive samples and the negative samples, extract target group feature vectors.
(4) Train the strong classifier based on the target group feature vectors extracted from the positive samples and the negative samples.
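Step (2) above, selecting a preset proportion of the unlabeled samples as negatives, can be sketched with the standard library's random sampling. The function name, the fixed seed, and the rounding down of the sample size are assumptions made for this illustration.

```python
import random


def sample_negatives(unlabeled, proportion=0.10, seed=0):
    """Randomly pick a preset proportion of unlabeled samples to serve as negatives."""
    rng = random.Random(seed)                # fixed seed so the sketch is repeatable
    k = int(len(unlabeled) * proportion)     # sample size, rounded down
    return rng.sample(unlabeled, k)          # sampling without replacement


unlabeled_ids = [f"user{i}" for i in range(50)]
negatives = sample_negatives(unlabeled_ids)
print(len(negatives))
```

Because some unlabeled users do belong to the target group, negatives chosen this way are noisy; the text relies on the rule-based screening and the strong classifier to tolerate that noise.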
Fig. 7 is a schematic diagram of model identification in the target group mining method provided in the second aspect of the embodiments of this specification.
The model identification process includes:
(1) For the set of users to be identified, screen out a primary target user set.
(2) For the primary target user set, extract target group feature vectors.
(3) For the target group feature vectors, perform identification based on the pre-trained strong classifier to determine a final user set belonging to the target group.
The target group mining method provided in the second aspect differs from that provided in the first aspect in that identification is performed with only a single level of strong classifier, which is simpler to implement. To preserve identification accuracy, some specific processing is applied to the screening of the primary target user set during identification and, during model training, to the screening of unlabeled samples and the determination of negative samples. For example, rule matching is used to determine the primary target user set and the unlabeled samples, and a certain proportion of the unlabeled samples is taken as negative samples.
Fig. 8 is a schematic structural diagram of the target group mining apparatus provided in the third aspect of the embodiments of this specification. The apparatus includes:
A primary target user set screening unit 801, configured to screen out, for a set of users to be identified, a primary target user set according to the users' natural attribute information and social attribute information;
A target group feature vector extraction unit 802, configured to extract target group feature vectors for the primary target user set;
A first-level identification unit 803, configured to perform identification on the target group feature vectors based on a pre-trained first-level weak classifier to obtain an intermediate target user set;
A second-level identification unit 804, configured to perform identification on the intermediate target user set based on a pre-trained second-level strong classifier to determine a final target user set belonging to the target group.
In an optional implementation, the apparatus further includes a classifier training unit 805.
The classifier training unit 805 further includes:
A sample acquisition subunit 8051, configured to obtain positive samples of the target group and to obtain unlabeled samples;
A target group feature vector extraction subunit 8052, configured to extract target group feature vectors for the positive samples and the unlabeled samples;
A first-level training subunit 8053, configured to train the first-level weak classifier based on the target group feature vectors extracted from the positive samples and the unlabeled samples;
A negative sample determination subunit 8054, configured to classify the unlabeled samples using the first-level weak classifier to determine negative samples of the target group;
A second-level training subunit 8055, configured to train the second-level strong classifier based on the positive samples and the negative samples.
In an optional implementation, the primary target user set screening unit 801 or the sample acquisition subunit 8051 is specifically configured to determine the primary target user set or the unlabeled samples according to user age information and user activity area information.
In an optional implementation, the target group feature vector extraction unit 802 or the target group feature vector extraction subunit 8052 is specifically configured to extract the users' static features and dynamic features for the primary target user set, or for the positive samples and the unlabeled samples; the target group feature vector is composed of the static features and the dynamic features.
In an optional implementation, the static features include one or more of the user's age, home city, and online account balance, and the dynamic features include one or more of network connection information for a specific area within a specific time period, the user's friends within a specific time period, the user's address, and LBS information.
In an optional implementation, the algorithm used by the first-level weak classifier includes one of naive Bayes, logistic regression, and boosting, and the algorithm used by the second-level strong classifier includes one of a support vector machine, a deep neural network, a gradient boosting decision tree, and XGBoost.
Fig. 9 is a schematic structural diagram of the target group mining apparatus provided in the fourth aspect of the embodiments of this specification. The apparatus includes:
A primary target user set screening unit 901, configured to screen out, for a set of users to be identified, a primary target user set according to the users' natural attribute information and social attribute information;
A target group feature vector extraction unit 902, configured to extract target group feature vectors for the primary target user set;
An identification unit 903, configured to perform identification on the target group feature vectors based on a pre-trained strong classifier to determine a final user set belonging to the target group.
In an optional implementation, the apparatus further includes a classifier training unit 904.
The classifier training unit 904 further includes:
A sample acquisition subunit 9041, configured to obtain positive samples of the target group and to obtain unlabeled samples;
A negative sample determination subunit 9042, configured to select a preset proportion of the unlabeled samples as negative samples;
A target group feature vector extraction subunit 9043, configured to extract target group feature vectors for the positive samples and the negative samples;
A classifier training subunit 9044, configured to train the strong classifier based on the target group feature vectors extracted from the positive samples and the negative samples.
In an optional implementation, the primary target user set screening unit 901 or the sample acquisition subunit 9041 is specifically configured to determine the primary target user set or the unlabeled samples according to one or more of: user age information, user activity area information, the number of network connections made by the user to a specific area within a specific time period, the number of friends newly added by the user within a specific time period, and whether the user's address falls within a specific area.
In an optional implementation, the target group feature vector extraction unit 902 or the target group feature vector extraction subunit 9043 is specifically configured to extract the users' static features and dynamic features for the primary target user set, or for the positive samples and the negative samples; the target group feature vector is composed of the static features and the dynamic features.
In an optional implementation, the static features include one or more of the user's age, home city, and online account balance, and the dynamic features include one or more of network connection information for a specific area within a specific time period, the user's friends within a specific time period, the user's address, and LBS information.
In an optional implementation, the algorithm used by the strong classifier includes one of a support vector machine, a deep neural network, a gradient boosting decision tree, and XGBoost.
In a fifth aspect, based on the same inventive concept as the target group mining method in the foregoing embodiments, the present invention further provides a server which, as shown in Fig. 10, includes a memory 1004, a processor 1002, and a computer program stored in the memory 1004 and executable on the processor 1002. The processor 1002 implements the steps of the target group mining method described above when executing the program.
In Fig. 10, the bus architecture is represented by bus 1000. Bus 1000 may include any number of interconnected buses and bridges, and links together various circuits including one or more processors, represented by processor 1002, and memory, represented by memory 1004. Bus 1000 may also link together various other circuits such as peripheral devices, voltage regulators, and power management circuits; these are well known in the art and are therefore not described further here. Bus interface 1006 provides an interface between bus 1000 and the receiver 1001 and transmitter 1003. The receiver 1001 and transmitter 1003 may be the same element, namely a transceiver, which provides a unit for communicating with various other apparatus over a transmission medium. Processor 1002 is responsible for managing bus 1000 and for general processing, and memory 1004 may be used to store data that processor 1002 uses when performing operations.
In a sixth aspect, based on the same inventive concept as the target group mining method in the foregoing embodiments, the present invention further provides a computer-readable storage medium on which a computer program is stored. The program implements the steps of the target group mining method described above when executed by a processor.
This specification is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments of this specification. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of this specification have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of this specification.
Obviously, those skilled in the art can make various modifications and variations to this specification without departing from its spirit and scope. If such modifications and variations fall within the scope of the claims of this specification and their technical equivalents, this specification is intended to include them as well.