CN104598452B - User's gender analysis method and apparatus - Google Patents

User's gender analysis method and apparatus

Info

Publication number
CN104598452B
Authority
CN
China
Prior art keywords
user
gender
cis
male
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310526980.4A
Other languages
Chinese (zh)
Other versions
CN104598452A (en
Inventor
丁若谷
陈家耀
冯是聪
吴明辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Miaozhen Systems Information Technology Co Ltd
Original Assignee
Miaozhen Systems Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Miaozhen Systems Information Technology Co Ltd filed Critical Miaozhen Systems Information Technology Co Ltd
Priority to CN201310526980.4A priority Critical patent/CN104598452B/en
Publication of CN104598452A publication Critical patent/CN104598452A/en
Application granted granted Critical
Publication of CN104598452B publication Critical patent/CN104598452B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention provides a method and apparatus for analyzing user gender, relating to the field of data analysis. It solves the problem that existing analysis approaches are unsuitable where personalized domain names are only weakly associated with personal names. The method includes: collecting a sample data set that contains multiple pairs of a user's personalized domain name and the corresponding user gender; counting, by gender, the probability with which different letters at each position, and different letter combinations over several adjacent positions, occur in the personalized domain names of the sample data set; and, using the proportion of males in the sample data set and these probabilities as reference parameters, analyzing the personalized domain name of a user of unknown gender to determine that user's gender. The technical solution is suited to data analysis and realizes user gender analysis based on an automated algorithm.

Description

User's gender analysis method and apparatus
Technical field
The present invention relates to the field of data analysis, and in particular to a method and apparatus for analyzing user gender.
Background art
In the Internet environment, a user's gender is an important piece of information. Based on it, an Internet content provider can present different content to different users. For example, male users may be more interested in e-sports than female users, while female users may be more interested in fashion apparel than male users. If the user's gender can be identified, an Internet advertising provider can show e-sports advertisements to male users and fashion apparel advertisements to female users, making the advertising better targeted and more effective.
For users who register for blogs, microblogs, or other social networking sites, many service providers suggest, once the required registration information has been completed, that users fill in some attributes of their own, such as gender, age, and employment status, or set a personalized domain name for themselves. The items that touch on user privacy are usually optional rather than mandatory, so a considerable share of users choose not to fill them in; for example, users who want to keep their information from leaking may leave age and gender blank. Data analysis organizations, and the providers themselves, therefore cannot obtain the user's gender directly. Optional items that do not involve privacy, by contrast, are filled in at a much higher rate. The personalized domain name is one example: to improve user experience and user attachment, service providers often let users set a virtual URL with a personal touch for their microblog or personal home page. Users can set this registered domain name to their own name or to any combination of digits or letters they like, which is both fashionable and convenient. Because of inherent gender differences, however, men and women tend to instinctively choose personalized domain names that reflect their own attributes. For example, a user might register the personalized domain name http://weibo.com/basketballfans, where weibo.com is the domain of the microblog service provider and basketballfans is the part chosen by the user. Inferring the user's gender from such a representative personalized domain name therefore collects user information without intruding on the user.
Among existing technologies, the closest is United States Patent 7,447,996 [1]. That patent proposes a software module that infers a user's gender from the user name in an instant messaging system and displays different avatars according to gender. It relies on specific onomastic data, that is, on the relationship between personal names and gender in a particular language. For example, the patent mentions that, for Chinese names, a lookup in an onomastic database shows that "xiuxiu" and "lili" are more likely to be female names.
An onomastic database is not suited to many network application scenarios, and in particular not to cases where the personalized domain name is only weakly associated with a personal name. Personalized domain names usually contain many components outside the range of common names, and these components are hard to analyze with onomastic data. For example, a personalized domain name may contain "basketball", and males probably predominate among the basketball fans who put "basketball" into their personalized domain names. Adding data such as "basketball corresponds to male" to the database would greatly increase the work required, and the database could hardly be made complete.
Summary of the invention
The present invention provides a method and apparatus for analyzing user gender, solving the problem that existing analysis approaches are unsuitable where personalized domain names are only weakly associated with personal names.
A method for analyzing user gender includes:
collecting a sample data set, the sample data set including multiple pairs of a user's personalized domain name and the corresponding user gender;
counting, by gender, the probability with which different letters at each position, and different letter combinations over several adjacent positions, occur in the personalized domain names of the sample data set; and
using the proportion of males in the sample data set and said probabilities as reference parameters, analyzing the personalized domain name of a user of unknown gender to determine the user's gender.
Preferably, before the step of counting, by gender, the probability with which different letters at each position and different letter combinations over several adjacent positions occur in the personalized domain names of the sample data set, the method further includes:
calculating the proportion of males in the sample data set.
Preferably, counting, by gender, the probability with which different letters at each position and different letter combinations over several adjacent positions occur in the personalized domain names of the sample data set includes:
Step a: take the user-specified part of one personalized domain name and record the user gender corresponding to that domain name;
Step b: count the number of occurrences of the letter at each position of the specified part and/or the number of occurrences of each letter combination over several adjacent positions;
Step c: apply steps a and b to every personalized domain name in the sample data set until the whole sample data set has been traversed;
Step d: tally, for each gender, the number of occurrences of the letters at each position of the personalized domain names and/or of the letter combinations over several adjacent positions, and calculate the probability with which each letter and/or letter combination occurs for each gender.
Preferably, tallying these occurrence counts for each gender and calculating the probability with which each letter and/or letter combination occurs for each gender is specifically:
According to the expression
$$P(\text{n-gram corresponds to male}) = \frac{f_{\text{male}}}{f_{\text{male}} + f_{\text{female}}}$$
calculate the probability that each letter at each position, and each letter combination over several adjacent positions, corresponds to male. On the left side of the equation, P(n-gram corresponds to male) is the probability that a letter combination of length n over adjacent positions corresponds to male; when n is 1, it is the probability that the letter at a single position corresponds to male. On the right side, $f_{\text{male}}$ is the n-gram's male frequency, that is, the number of times the letter at a single position or the letter combination of length n corresponds to male, and $f_{\text{female}}$ is the n-gram's female frequency, that is, the number of times it corresponds to female.
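The counting and probability steps above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names `ngrams` and `train`, the gender codes `"M"` and `"F"`, and the choice to pool n-grams regardless of where they start in the string are all assumptions. The three sample pairs are the ones used in the document's own worked embodiment.

```python
from collections import defaultdict

def ngrams(s, n):
    """All contiguous substrings of length n, taken left to right."""
    return [s[j:j + n] for j in range(len(s) - n + 1)]

def train(samples):
    """samples: iterable of (user_specified_part, gender), gender 'M' or 'F'.

    Returns (male_ratio, probs), where probs maps each n-gram to the
    fraction of its occurrences that came from male users.
    """
    counts = defaultdict(lambda: {"M": 0, "F": 0})  # hypothetical layout
    males = total = 0
    for part, gender in samples:
        total += 1
        males += gender == "M"
        # n runs from 1 (single letters) up to the full length of the part.
        for n in range(1, len(part) + 1):
            for gram in ngrams(part, n):
                counts[gram][gender] += 1
    probs = {g: c["M"] / (c["M"] + c["F"]) for g, c in counts.items()}
    return males / total, probs

samples = [("nickleave", "F"), ("inferpku", "M"), ("bankofdota", "M")]
male_ratio, probs = train(samples)
print(male_ratio, probs["e"], probs["dota"])
```

Keeping the raw male and female counts, instead of only the ratio, would let the table be updated incrementally as new samples arrive.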
Preferably, using said probabilities as reference parameters, analyzing the personalized domain name of a user of unknown gender to determine the user's gender includes:
Step a: obtain the length of the personalized domain name of the user of unknown gender, denoted k;
Step b: according to the expression
calculate the probability that the user is male, where url denotes the user-specified part of the personalized domain name; substr(url, j, i) denotes the substring of url of length i that starts at the j-th character, which for i = 1 is the single letter at that position; n denotes the number of substrings substr(url, j, i); w_h denotes the weight of the letter or letter combination; and P(substr(url, j, i) corresponds to male in the sample data set) denotes the male probability of the letter or letter combination in that substring;
Step c: compare the result of step b with the proportion of males in the sample data set;
Step d: when the result of step b is greater than or equal to the proportion used in step c, determine that the user of unknown gender is male.
Preferably, after step d, the method further includes:
Step e: when the result of step b is less than the proportion used in step c, determine that the user of unknown gender is female.
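Steps a to e can be sketched as below. Because the expression in step b is not reproduced in this text, the scoring rule here is an assumption: a weighted mean of the male probabilities of the n-grams found in the probability table, with all weights equal to 1 by default. Skipping n-grams that never occurred in the sample data, and returning `None` when none are known, are also assumptions. The `probs` table and `male_ratio` are hand-written, hypothetical values.

```python
def ngrams(s, n):
    """All contiguous substrings of length n, taken left to right."""
    return [s[j:j + n] for j in range(len(s) - n + 1)]

def male_probability(part, probs, weight=lambda gram: 1.0):
    """Weighted mean of P(male | n-gram) over the known n-grams of `part`.

    Returns None when no n-gram of `part` appears in `probs`.
    """
    num = den = 0.0
    k = len(part)                      # step a: length of the domain part
    for i in range(1, k + 1):          # substring length i
        for gram in ngrams(part, i):   # every start position j
            if gram in probs:
                w = weight(gram)
                num += w * probs[gram]
                den += w
    return num / den if den else None

def classify(part, probs, male_ratio):
    """Steps c to e: compare the score with the sample male proportion."""
    p = male_probability(part, probs)
    if p is None:
        return None
    return "male" if p >= male_ratio else "female"

# Hypothetical parameters, for illustration only.
probs = {"b": 0.7, "a": 0.5, "ba": 0.9, "l": 0.3}
male_ratio = 0.6
print(classify("ba", probs, male_ratio), classify("la", probs, male_ratio))
```

Passing a different `weight` function, for example one that grows with the n-gram length, is one way to let longer and more specific letter combinations count for more.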
The present invention also provides an apparatus for analyzing user gender, including:
a sampling module, configured to collect a sample data set, the sample data set including multiple pairs of a user's personalized domain name and the corresponding user gender;
a reference parameter calculation module, configured to count, by gender, the probability with which different letters at each position, and different letter combinations over several adjacent positions, occur in the personalized domain names of the sample data set; and
an analysis module, configured to use the proportion of males in the sample data set and said probabilities as reference parameters to analyze the personalized domain name of a user of unknown gender and determine the user's gender.
Preferably, the apparatus further includes:
a reference ratio calculation module, configured to calculate the proportion of males in the sample data set.
Preferably, the reference parameter calculation module includes:
a gender extraction unit, configured to take the user-specified part of one personalized domain name and record the user gender corresponding to that domain name;
a counting unit, configured to count the number of occurrences of the letter at each position of the specified part and/or the number of occurrences of each letter combination over several adjacent positions; and
a statistics unit, configured to apply the processing of the counting unit to every personalized domain name in the sample data set until the sample data set has been traversed, to tally, for each gender, the number of occurrences of the letters at each position of the personalized domain names and/or of the letter combinations over several adjacent positions, and to calculate the probability with which each letter and/or letter combination occurs for each gender.
Preferably, the statistics unit calculates the probability with which each letter at each position and/or each letter combination over several adjacent positions occurs for each gender specifically as follows:
According to the expression
$$P(\text{n-gram corresponds to male}) = \frac{f_{\text{male}}}{f_{\text{male}} + f_{\text{female}}}$$
calculate the male probability of each letter at each position and of each letter combination over several adjacent positions, where P(n-gram corresponds to male) is the probability that a letter combination of length n over adjacent positions corresponds to male and, when n is 1, the probability that a letter at a single position corresponds to male; $f_{\text{male}}$ is the n-gram's male frequency, that is, the number of times a letter at a single position or a letter combination of length n corresponds to male; and $f_{\text{female}}$ is the n-gram's female frequency, that is, the number of times it corresponds to female.
Preferably, the analysis module includes:
a domain name length acquisition unit, configured to obtain the length of the personalized domain name of the user of unknown gender, denoted k;
a probability calculation unit, configured to, according to the expression
calculate the probability that the user is male, where url denotes the user-specified part of the personalized domain name, substr(url, j, i) denotes the substring of url formed by the i adjacent characters starting at the j-th character, n denotes the number of substrings substr(url, j, i), w_h denotes the weight of the letter or letter combination, and P(substr(url, j, i) corresponds to male in the sample data set) denotes the male probability of the letter or letter combination in that substring;
a comparison unit, configured to compare the result of the probability calculation unit with the proportion calculated by the reference ratio calculation module; and
a determination unit, configured to determine that the user of unknown gender is male when the comparison unit finds that the result of the probability calculation unit is greater than or equal to the proportion calculated by the reference ratio calculation module.
Preferably, the determination unit is further configured to determine that the user of unknown gender is female when the comparison unit finds that the result of the probability calculation unit is less than the proportion calculated by the reference ratio calculation module.
The present invention provides a method and apparatus for analyzing user gender. A sample data set containing multiple pairs of a user's personalized domain name and the corresponding user gender is collected; the probability with which different letters at each position, and different letter combinations over several adjacent positions, occur by gender in the personalized domain names of the sample data set is counted; and, with these probabilities as reference parameters, the personalized domain name of a user of unknown gender is analyzed to determine the user's gender. This realizes user gender analysis based on an automated algorithm, is more flexible and accurate, and solves the problem that existing analysis approaches are unsuitable where personalized domain names are only weakly associated with personal names.
Brief description of the drawings
Fig. 1 is a flowchart of a user gender analysis method provided by Embodiment 1 of the present invention;
Fig. 2 is a schematic structural diagram of a user gender analysis apparatus provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of the reference parameter calculation module 202 in Fig. 2;
Fig. 4 is a schematic structural diagram of the analysis module 203 in Fig. 2.
Detailed description of the embodiments
The embodiments of the present invention provide a method and apparatus for analyzing user gender that use an automated algorithm and thereby avoid dependence on an onomastic database.
The embodiments of the present invention are described in detail below with reference to the drawings. It should be noted that, where no conflict arises, the embodiments of this application and the features in them may be combined with one another arbitrarily.
Embodiment 1 of the present invention is described first, with reference to the drawings.
This embodiment provides a user gender analysis method. The flow of user gender analysis using this method is shown in Fig. 1 and includes:
Step 101: collect a sample data set, the sample data set including multiple pairs of a user's personalized domain name and the corresponding user gender;
Step 102: calculate the proportion of males in the sample data set;
Step 103: determine, for the sample data set, the proportion of males and, by gender, the probability with which different letters at each position, and different letter combinations over several adjacent positions, occur in the personalized domain names;
This step specifically includes:
Step a: take the user-specified part of one personalized domain name and record the user gender corresponding to that domain name;
Step b: count the number of occurrences of the letter at each position of the specified part and/or the number of occurrences of each letter combination over several adjacent positions;
The number of occurrences of the letter at a single position is counted as follows:
First increment the occurrence count of the letter at the first position of the personalized domain name, then increment the occurrence count of the letter at the second position, and continue in order through the last position of the personalized domain name.
The number of occurrences of letter combinations over several adjacent positions is counted as follows:
First fix the length n of the string formed by the adjacent positions. Starting at the first position of the personalized domain name, take n positions to form a string and increment the occurrence count of that letter combination; then, starting at the second position, take n positions to form a string and increment the occurrence count of that letter combination. Continue in this way until the last position of the string coincides with the last position of the personalized domain name. n takes values from 2 up to the length of the personalized domain name.
Step c: apply steps a and b to every personalized domain name in the sample data set until the whole sample data set has been traversed;
Step d: tally, for each gender, the number of occurrences of the letters at each position of the personalized domain names and/or of the letter combinations over several adjacent positions, and calculate the probability with which each letter and/or letter combination occurs for each gender.
In this step, according to the expression
$$P(\text{n-gram corresponds to male}) = \frac{f_{\text{male}}}{f_{\text{male}} + f_{\text{female}}}$$
calculate the probability that each letter at each position, and each letter combination over several adjacent positions, corresponds to male. On the left side of the equation, P(n-gram corresponds to male) is the probability that a letter combination of length n over adjacent positions corresponds to male; when n is 1, it is the probability that a letter at a single position corresponds to male. On the right side, $f_{\text{male}}$ is the n-gram's male frequency, that is, the number of times a letter at a single position or a letter combination of length n corresponds to male, and $f_{\text{female}}$ is the n-gram's female frequency, that is, the number of times it corresponds to female.
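The sliding-window enumeration described in step b can be illustrated with a short Python sketch. The function name `all_ngrams` is hypothetical, and the example string is the user-specified part from the basketballfans domain mentioned in the background section. For a string of length k, the windows of every length from 1 to k total k(k+1)/2 substrings.

```python
def all_ngrams(part):
    """Map each length n (1..k) to the substrings of that length,
    obtained by sliding a window of n positions from the first
    position until its end reaches the last position of `part`."""
    k = len(part)
    return {n: [part[j:j + n] for j in range(k - n + 1)]
            for n in range(1, k + 1)}

grams = all_ngrams("basketballfans")
total = sum(len(v) for v in grams.values())
print(grams[3][:2], total)
```

Because the count grows quadratically with the string length, an implementation handling long domain parts might cap n at a small value; that cap would be a design choice outside what the text describes.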
Step 104: using said probabilities as reference parameters, analyze the personalized domain name of a user of unknown gender to determine the user's gender;
This step specifically includes:
Step a: obtain the length of the personalized domain name of the user of unknown gender, denoted k;
Step b: according to the expression
calculate the probability that the user is male, where url denotes the user-specified part of the personalized domain name, substr(url, j, i) denotes the substring of url formed by the i adjacent characters starting at the j-th character, n denotes the number of substrings substr(url, j, i), w_h denotes the weight of the letter or letter combination, and P(substr(url, j, i) corresponds to male in the sample data set) denotes the male probability of the letter or letter combination in that substring;
Step c: compare the result of step b with the proportion of males calculated in step 102;
Step d: when the result of step b is greater than or equal to the proportion calculated in step 102, determine that the user of unknown gender is male;
Step e: when the result of step b is less than the proportion calculated in step 102, determine that the user of unknown gender is female.
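The expression referred to in step b is not reproduced in this text. One reading that is consistent with the symbol definitions given there, offered purely as an assumption and not as the patent's own formula, is a weighted mean over all substrings of url:

```latex
P(\text{url corresponds to male})
  = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{k-i+1}
    w_h \, P\bigl(\operatorname{substr}(\text{url}, j, i)\ \text{corresponds to male}\bigr),
\qquad n = \frac{k(k+1)}{2}
```

Under this reading, n = k(k+1)/2 holds when every substring of every length from 1 to k is counted, and setting every weight w_h to 1 reduces the score to the arithmetic mean of the substrings' male probabilities. How the index h relates to i and j is not stated in the text and is left open here.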
Embodiment 2 of the present invention is described below with reference to the drawings.
This embodiment provides a user gender analysis apparatus whose structure is shown in Fig. 2 and includes:
a sampling module 201, configured to collect a sample data set, the sample data set including multiple pairs of a user's personalized domain name and the corresponding user gender;
a reference parameter calculation module 202, configured to count, by gender, the probability with which different letters at each position, and different letter combinations over several adjacent positions, occur in the personalized domain names of the sample data set; and
an analysis module 203, configured to use the proportion of males in the sample data set and said probabilities as reference parameters to analyze the personalized domain name of a user of unknown gender and determine the user's gender.
Preferably, the apparatus further includes:
a reference ratio calculation module 204, configured to calculate the proportion of males in the sample data set.
Preferably, the structure of the reference parameter calculation module 202 is shown in Fig. 3 and includes:
a gender extraction unit 2021, configured to take the user-specified part of one personalized domain name and record the user gender corresponding to that domain name;
a counting unit 2022, configured to count the number of occurrences of the letter at each position of the specified part and/or the number of occurrences of each letter combination over several adjacent positions; and
a statistics unit 2023, configured to apply the processing of the counting unit to every personalized domain name in the sample data set until the sample data set has been traversed, to tally, for each gender, the number of occurrences of the letters at each position of the personalized domain names and/or of the letter combinations over several adjacent positions, and to calculate the probability with which each letter and/or letter combination occurs for each gender.
Preferably, the statistics unit 2023 calculates the probability with which each letter at each position and/or each letter combination over several adjacent positions occurs for each gender specifically as follows:
According to the expression
$$P(\text{n-gram corresponds to male}) = \frac{f_{\text{male}}}{f_{\text{male}} + f_{\text{female}}}$$
calculate the male probability of each letter at each position and of each letter combination over several adjacent positions, where P(n-gram corresponds to male) is the probability that a letter combination of length n over adjacent positions corresponds to male and, when n is 1, the probability that a letter at a single position corresponds to male; $f_{\text{male}}$ is the n-gram's male frequency, that is, the number of times a letter at a single position or a letter combination of length n corresponds to male; and $f_{\text{female}}$ is the n-gram's female frequency, that is, the number of times it corresponds to female.
Preferably, the structure of the analysis module 203 is shown in Fig. 4 and includes:
a domain name length acquisition unit 2031, configured to obtain the length of the personalized domain name of the user of unknown gender, denoted k;
a probability calculation unit 2032, configured to, according to the expression
calculate the probability that the user is male, where url denotes the user-specified part of the personalized domain name, substr(url, j, i) denotes the substring of url formed by the i adjacent characters starting at the j-th character, n denotes the number of substrings substr(url, j, i), w_h denotes the weight of the letter or letter combination, and P(substr(url, j, i) corresponds to male in the sample data set) denotes the male probability of the letter or letter combination in that substring;
a comparison unit 2033, configured to compare the result of the probability calculation unit 2032 with the proportion calculated by the reference ratio calculation module 204; and
a determination unit 2034, configured to determine that the user of unknown gender is male when the comparison unit 2033 finds that the result of the probability calculation unit 2032 is greater than or equal to the proportion calculated by the reference ratio calculation module 204.
Preferably, the determination unit 2034 is further configured to determine that the user of unknown gender is female when the comparison unit 2033 finds that the result of the probability calculation unit 2032 is less than the proportion calculated by the reference ratio calculation module 204.
Embodiment 3 of the present invention is described below with reference to the drawings.
This embodiment discloses a user gender analysis system that automatically classifies a user's gender according to the personalized domain name the user has applied for, owns, or uses. The embodiment first obtains a sample data set of correspondences between personalized domain names and user genders by means such as user data statistics and business partnerships, then analyzes the user-specified part of each personalized domain name and uses machine learning to train a classifier that classifies user gender from the personalized domain name. When the personalized domain name of a user of unknown gender needs to be classified, this classifier outputs the predicted user gender.
The steps are as follows.
Step 1: Collect a sample data set of correspondences between personalized domain names and user genders, and extract the user-specified part of each personalized domain name.
Step 2: Calculate the proportion of males in the sample data set.
Step 3: Take the user-specified part of one personalized domain name, denoted string 1, and record the corresponding user gender.
Step 4: Denote the length of string 1 as k. Count the frequency of occurrence of all 1-grams in string 1, of all 2-grams, of all 3-grams, and so on up to the k-gram (k, the length of the string, is an integer of 1 or more and is the upper limit of the n-gram length), and accumulate the frequency of each n-gram (n, the n-gram length, ranges from 1 to k) under the corresponding user gender.
Step 5: Repeat steps 3 and 4 until the sample data set collected in step 1 has been traversed.
Step 6:Calculate the probability and the n-gram of user's gender corresponding to the n-gram occurred goes out occurrence Number, while the probability that different sexes occur in statistical sample data set, collectively as the parameter of grader.
Step 7: To use the classifier on a personalized domain name of unknown user gender, extract the user-specified part, denote its length as k, obtain its 1-grams through its k-gram, and calculate the probability that the gender is male by the following formula:
In the formula, which gives the probability that the user's gender is male, url denotes the user-specified part of the personalized domain name, substr(url, j, i) denotes the substring of url of length i starting at the j-th character, n denotes the number of substrings substr(url, j, i), and w_h denotes the weight of the letter or letter combination.
Step 8: If the probability calculated in Step 7 is greater than the proportion of males in the sample data set calculated in Step 2, classify the personalized domain name as belonging to a male user; otherwise classify it as belonging to a female user.
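As an illustration of the n-gram enumeration described in Step 4, the following minimal Python sketch lists every contiguous substring of a string. The function name all_ngrams is an assumption made for this illustration and does not appear in the patent.

```python
def all_ngrams(s: str) -> list[str]:
    """Return every contiguous substring of s, ordered by length, then by start position."""
    k = len(s)
    # n is the n-gram length (1..k); j is the 0-based start position.
    return [s[j:j + n] for n in range(1, k + 1) for j in range(k - n + 1)]


print(all_ngrams("abc"))  # ['a', 'b', 'c', 'ab', 'bc', 'abc']
```

A string of length k yields k(k+1)/2 n-grams in total, so nickleave (k = 9) yields 45.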
Embodiment 4 of the present invention is described below with reference to the accompanying drawings.
An embodiment of the present invention provides a user gender analysis method; the detailed process is as follows:
Step 1: Collect the following three personalized domain names: http://weibo.com/nickleave, http://weibo.com/inferpku and http://t.qq.com/bankofdota, whose user-specified parts are nickleave, inferpku and bankofdota respectively. In this example, the system of the present invention obtains information from the service providers of weibo.com and t.qq.com through business cooperation, and learns that the user gender corresponding to nickleave is female and that the user gender corresponding to inferpku and bankofdota is male.
Step 2: Calculate the proportion of males among all samples. Step 1 collected three samples in total, with genders nickleave (female), inferpku (male) and bankofdota (male). The proportion of males among the three samples is therefore 2/3.
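Steps 1 and 2 of this example can be sketched in Python as follows. Treating the first path segment of the URL as the user-specified part is an assumption of this sketch, and the helper name user_part is hypothetical; the patent does not say how that part is extracted.

```python
from urllib.parse import urlparse

# The three labelled samples from Step 1.
samples = [
    ("http://weibo.com/nickleave", "female"),
    ("http://weibo.com/inferpku", "male"),
    ("http://t.qq.com/bankofdota", "male"),
]


def user_part(url: str) -> str:
    """Return the first path segment, taken here as the user-specified part (assumption)."""
    return urlparse(url).path.strip("/").split("/")[0]


names = [user_part(url) for url, _ in samples]
male_ratio = sum(gender == "male" for _, gender in samples) / len(samples)

print(names)       # ['nickleave', 'inferpku', 'bankofdota']
print(male_ratio)  # 2/3, the male proportion used later as the decision threshold
```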
Step 3: Take nickleave, female.
Step 4: Add the 1-grams, 2-grams, 3-grams, 4-grams, 5-grams, 6-grams, 7-grams, 8-grams and the 9-gram of nickleave to the female counts; the statistics are shown in Table 1. For brevity, the table lists only the 1-gram, 2-gram and 3-gram cases.
Table 1
Step 5: Repeat the above process for inferpku. After accumulation, the results in Table 1 are updated to Table 2:
Table 2
1-gram  Male freq.  Female freq.  2-gram  Male freq.  Female freq.  3-gram  Male freq.  Female freq.
n       1           1             ni      0           1             nic     0           1
i       1           1             ic      0           1             ick     0           1
c       0           1             ck      0           1             ckl     0           1
k       1           1             kl      0           1             kle     0           1
l       0           1             le      0           1             lea     0           1
e       1           2             ea      0           1             eav     0           1
a       0           1             av      0           1             ave     0           1
v       0           1             ve      0           1             inf     1           0
f       1           0             in      1           0             nfe     1           0
r       1           0             nf      1           0             fer     1           0
p       1           0             fe      1           0             erp     1           0
u       1           0             er      1           0             rpk     1           0
-       -           -             rp      1           0             pku     1           0
-       -           -             pk      1           0             -       -           -
-       -           -             ku      1           0             -       -           -
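The accumulation that produces Table 2 can be sketched with Python's collections.Counter. This is a minimal illustration rather than the patent's implementation; the names all_ngrams and counts are assumptions of the sketch.

```python
from collections import Counter


def all_ngrams(s: str) -> list[str]:
    """Every contiguous substring of s, for lengths 1 through len(s)."""
    k = len(s)
    return [s[j:j + n] for n in range(1, k + 1) for j in range(k - n + 1)]


# One frequency table per gender; each occurrence of an n-gram adds 1.
counts = {"male": Counter(), "female": Counter()}
for name, gender in [("nickleave", "female"), ("inferpku", "male")]:
    counts[gender].update(all_ngrams(name))

# Spot-check a few entries against Table 2 (male frequency, female frequency).
for gram in ("n", "e", "ck", "inf"):
    print(gram, counts["male"][gram], counts["female"][gram])
```

The letter e occurs twice in nickleave and once in inferpku, which is why its row in Table 2 reads male 1, female 2.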
Repeat the above process once more for bankofdota. After accumulation, Table 2 is updated to Table 3:
Table 3
Step 6: For each n-gram that occurs, calculate the probability of each user gender and the number of occurrences of the n-gram. The probability that an n-gram corresponds to a male user (the male probability in the table below) is calculated as:
$$P(\text{n-gram corresponds to male}) = \frac{\text{male frequency of the n-gram}}{\text{male frequency of the n-gram} + \text{female frequency of the n-gram}}$$
For example, the male probability of the 1-gram n in Table 4 is calculated from the male frequency 2 and the female frequency 1 of the 1-gram n in the table above, i.e. 2 / (2 + 1) = 0.666667.
For brevity, Table 4 lists only the 1-gram, 2-gram and 3-gram cases.
Table 4
1-gram  Male probability  2-gram  Male probability  3-gram  Male probability
n       0.666667          ni      0                 nic     0
i       0.5               ic      0                 ick     0
c       0                 ck      0                 ckl     0
k       0.666667          kl      0                 kle     0
l       0                 le      0                 lea     0
e       0.333333          ea      0                 eav     0
a       0.666667          av      0                 ave     0
v       0                 ve      0                 inf     1
f       1                 in      1                 nfe     1
r       1                 nf      1                 fer     1
p       1                 fe      1                 erp     1
u       1                 er      1                 rpk     1
b       1                 rp      1                 pku     1
o       1                 pk      1                 ban     1
d       1                 ku      1                 ank     1
t       1                 ba      1                 nko     1
-       -                 an      1                 kof     1
-       -                 nk      1                 ofd     1
-       -                 ko      1                 fdo     1
-       -                 of      1                 dot     1
-       -                 fd      1                 ota     1
-       -                 do      1                 -       -
-       -                 ot      1                 -       -
-       -                 ta      1                 -       -
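The male probabilities in Table 4 follow from dividing the male frequency of an n-gram by its total frequency. The sketch below applies that ratio to a few n-grams. The frequency pairs are the Table 2 values, with n raised to 2 male occurrences by bankofdota (the other four n-grams do not occur in bankofdota); the function name male_probability is an assumption of this illustration.

```python
def male_probability(male_freq: int, female_freq: int) -> float:
    """Share of an n-gram's occurrences that come from male users.

    Only meaningful for n-grams seen at least once in the sample data set.
    """
    return male_freq / (male_freq + female_freq)


# (male frequency, female frequency) after all three samples are accumulated.
freqs = {"n": (2, 1), "i": (1, 1), "e": (1, 2), "ve": (0, 1), "inf": (1, 0)}

for gram, (male, female) in freqs.items():
    print(gram, round(male_probability(male, female), 6))
```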
Step 7: Suppose the personalized domain name to be classified is www.renren.com/eleven, in which the user-specified part is eleven. The n-grams of eleven that also appear in the gender frequency table are e (three times), l (once), v (once), n (once), le (once) and ve (once). In the third gender frequency table above (Table 3), these n-grams occur as e (three times), l (once), v (once), n (three times), le (once) and ve (once), so the n-grams of the user name eleven occur 10 times in total in the gender frequency table. From this the weight w_h of each letter or letter combination is calculated. Substituting these values into the formula above gives:
Step 8: Since the probability 0.166 that the user of eleven is male, calculated in the previous step, is less than the male proportion 0.67 in the sample, www.renren.com/eleven is classified as belonging to a female user.
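The whole worked example can be sketched end to end in Python. The weight w_h is not defined in the text reproduced here, so this sketch assumes equal weights over the distinct n-grams of the input that were seen in the sample data set. That is an assumption of the illustration, not the patent's stated formula, although it yields about 0.167, in line with the 0.166 reported above. The names train and male_score are hypothetical.

```python
from collections import Counter

SAMPLES = [("nickleave", "female"), ("inferpku", "male"), ("bankofdota", "male")]


def all_ngrams(s: str) -> list[str]:
    """Every contiguous substring of s, for lengths 1 through len(s)."""
    k = len(s)
    return [s[j:j + n] for n in range(1, k + 1) for j in range(k - n + 1)]


def train(samples):
    """Return per-n-gram male probabilities and the male proportion of the samples."""
    male, female = Counter(), Counter()
    for name, gender in samples:
        (male if gender == "male" else female).update(all_ngrams(name))
    probs = {g: male[g] / (male[g] + female[g]) for g in male.keys() | female.keys()}
    ratio = sum(gender == "male" for _, gender in samples) / len(samples)
    return probs, ratio


def male_score(name, probs):
    """Equal-weight average over the distinct known n-grams (assumed weighting)."""
    known = sorted({g for g in all_ngrams(name) if g in probs})
    if not known:
        return None  # nothing in common with the sample data set
    return sum(probs[g] for g in known) / len(known)


probs, ratio = train(SAMPLES)
score = male_score("eleven", probs)
label = "male" if score is not None and score > ratio else "female"
print(round(score, 3), round(ratio, 2), label)
```

Under this weighting the six known n-grams are e, l, v, n, le and ve, whose male probabilities sum to 1, so the score is 1/6 and the name is classified as female.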
The embodiments of the present invention provide a user gender analysis method and apparatus. A sample data set comprising multiple pairs of a user personalized domain name and the corresponding user gender is collected; the probabilities with which the different letters at each position, and the different letter combinations at several adjacent positions, of the user personalized domain names in the sample data set occur for each gender are counted; and, with these probabilities as reference parameters, a user personalized domain name of unknown user gender is analyzed to determine the user gender. This realizes user gender analysis based on an automated algorithm that is more flexible and accurate, and solves the problem that existing analysis methods are unsuitable where the personalized domain name is only weakly associated with the user's name.
The technical solution provided by the embodiments of the present invention uses an automated algorithm and thereby avoids dependence on a name-studies (onomastics) database. Existing analysis methods depend on the user's name and are therefore unsuitable for cases, such as personalized domain names, in which the association with the name is weak; the technical solution provided by the embodiments does not have this problem. In addition, through the analysis of personalized domain names, the embodiments can be used in broader applications such as display advertising optimization.
Those of ordinary skill in the art will understand that all or some of the steps of the above embodiments can be implemented as a computer program flow. The computer program can be stored in a computer-readable storage medium and executed on a corresponding hardware platform (such as a system, equipment, an apparatus or a device); when executed, it performs one of the steps of the method embodiments or a combination thereof.
Optionally, all or some of the steps of the above embodiments can also be implemented using integrated circuits. The steps can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not limited to any specific combination of hardware and software.
Each device, functional module or functional unit in the above embodiments can be implemented with a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices.
When a device, functional module or functional unit in the above embodiments is implemented in the form of a software functional module and sold or used as an independent product, it can be stored in a computer-readable storage medium. The computer-readable storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.
Any variation or replacement that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be subject to the protection scope of the claims.

Claims (10)

1. A user gender analysis method, characterized by comprising:
Collecting a sample data set, the sample data set comprising multiple pairs of a user personalized domain name and the corresponding user gender;
Counting, by gender, the probability of occurrence of the different letters at each position, and of the different letter combinations at several adjacent positions, in the user personalized domain names in the sample data set, comprising:
Step a: taking the user-specified part of one user personalized domain name, and recording the user gender corresponding to that user personalized domain name;
Step b: counting the number of occurrences of the letter at each position of the specified part and/or the number of occurrences of the different letter combinations at several adjacent positions;
Step c: performing the processing of steps a to b on all user personalized domain names in the sample data set, until the sample data set has been fully traversed;
Step d: counting the number of times the letter at each position of the user personalized domain names occurs for each gender and/or the number of times the letter combination at several adjacent positions occurs for each gender, and calculating the probability that the letter at each position and/or the letter combination at several adjacent positions occurs for each gender;
Using the proportion of males in the sample data set and the probabilities as reference parameters, analyzing a user personalized domain name of unknown user gender and determining the user gender.
2. The user gender analysis method according to claim 1, characterized in that, before the step of counting, by gender, the probability of occurrence of the different letters at each position and of the different letter combinations at several adjacent positions in the user personalized domain names in the sample data set, the method further comprises:
Calculating the proportion of males in the sample data set.
3. The user gender analysis method according to claim 1, characterized in that counting the number of times the letter at each position of the user personalized domain names occurs for each gender and/or the number of times the letter combination at several adjacent positions occurs for each gender, and calculating the probability that the letter at each position and/or the letter combination at several adjacent positions occurs for each gender, is specifically:
According to the expression $$P(\text{n-gram corresponds to male}) = \frac{\text{male frequency of the n-gram}}{\text{male frequency of the n-gram} + \text{female frequency of the n-gram}}$$
calculating the probability that each letter at each position, and each letter combination at several adjacent positions, corresponds to male; wherein P(n-gram corresponds to male) on the left side of the equation is the probability that a letter combination of length n at several adjacent positions corresponds to male, and, when n is 1, is the probability that the letter at a single position corresponds to male; on the right side of the equation, the male frequency of the n-gram is the number of times the letter at a single position, or the letter combination of length n at several adjacent positions, corresponds to male, and the female frequency of the n-gram is the number of times the letter at a single position, or the letter combination of length n at several adjacent positions, corresponds to female.
4. The user gender analysis method according to claim 1, characterized in that using the probabilities as reference parameters, analyzing a user personalized domain name of unknown user gender and determining the user gender comprises:
Step a: obtaining the length of the user personalized domain name of unknown user gender, denoted k;
Step b: according to the expression
calculating the probability that the gender of the user is male, wherein url denotes the user-specified part of the personalized domain name; substr(url, j, i) denotes the substring of url formed by the letter combination of length i at several adjacent positions starting at the j-th character, which, when i is 1, is the substring formed by the letter at a single position; n denotes the number of substrings substr(url, j, i); w_h denotes the weight of the letter or letter combination; and P(substr(url, j, i) corresponds to male in the sample data set) denotes the male probability corresponding to the letter or letter combination of the substring;
Step c: comparing the calculation result of step b with the proportion of males in the sample data set;
Step d: when the calculation result of step b is greater than or equal to the proportion compared in step c, determining that the gender of the user of unknown gender is male.
5. The user gender analysis method according to claim 4, characterized in that, after step d, the method further comprises:
Step e: when the calculation result of step b is less than the proportion compared in step c, determining that the gender of the user of unknown gender is female.
6. A user gender analysis apparatus, characterized by comprising:
A sampling module, configured to collect a sample data set, the sample data set comprising multiple pairs of a user personalized domain name and the corresponding user gender;
A reference parameter calculation module, configured to count, by gender, the probability of occurrence of the different letters at each position, and of the different letter combinations at several adjacent positions, in the user personalized domain names in the sample data set;
An analysis module, configured to use the proportion of males in the sample data set and the probabilities as reference parameters, analyze a user personalized domain name of unknown user gender, and determine the user gender;
Wherein the reference parameter calculation module comprises:
A gender extraction unit, configured to take the user-specified part of one user personalized domain name and record the user gender corresponding to that user personalized domain name;
A counting unit, configured to count the number of occurrences of the letter at each position of the specified part and/or the number of occurrences of the different letter combinations at several adjacent positions;
A statistics unit, configured to apply the processing of the counting unit to all user personalized domain names in the sample data set until the sample data set has been fully traversed, count the number of times the letter at each position of the user personalized domain names occurs for each gender and/or the number of times the letter combination at several adjacent positions occurs for each gender, and calculate the probability that the letter at each position and/or the letter combination at several adjacent positions occurs for each gender.
7. The user gender analysis apparatus according to claim 6, characterized in that the apparatus further comprises:
A reference proportion calculation module, configured to calculate the proportion of males in the sample data set.
8. The user gender analysis apparatus according to claim 6, characterized in that the statistics unit calculating the probability that the letter at each position and/or the letter combination at several adjacent positions occurs for each gender is specifically:
According to the expression $$P(\text{n-gram corresponds to male}) = \frac{\text{male frequency of the n-gram}}{\text{male frequency of the n-gram} + \text{female frequency of the n-gram}}$$
calculating the male probability corresponding to each letter at each position and to each letter combination at several adjacent positions, wherein P(n-gram corresponds to male) is the probability that a letter combination of length n at several adjacent positions corresponds to male and, when n is 1, is the probability that a letter at a single position corresponds to male; the male frequency of the n-gram is the number of times a letter at a single position, or a letter combination of length n at several adjacent positions, corresponds to male; and the female frequency of the n-gram is the number of times a letter at a single position, or a letter combination of length n at several adjacent positions, corresponds to female.
9. The user gender analysis apparatus according to claim 7, characterized in that the analysis module comprises:
A domain name length acquisition unit, configured to obtain the length of the user personalized domain name of unknown user gender, denoted k;
A probability calculation unit, configured to, according to the expression
calculate the probability that the gender of the user is male, wherein url denotes the user-specified part of the personalized domain name; substr(url, j, i) denotes the substring of url formed by the i adjacent characters starting at the j-th character; n denotes the number of substrings substr(url, j, i); w_h denotes the weight of the letter or letter combination; and P(substr(url, j, i) corresponds to male in the sample data set) denotes the male probability corresponding to the letter or letter combination of the substring formed by the j-th character of url, or by the i adjacent characters starting at the j-th character;
A comparison unit, configured to compare the calculation result of the probability calculation unit with the proportion calculated by the reference proportion calculation module;
A determination unit, configured to determine that the gender of the user of unknown gender is male when the comparison result of the comparison unit is that the calculation result of the probability calculation unit is greater than or equal to the proportion calculated by the reference proportion calculation module.
10. The user gender analysis apparatus according to claim 9, characterized in that
The determination unit is further configured to determine that the gender of the user of unknown gender is female when the comparison result of the comparison unit is that the calculation result of the probability calculation unit is less than the proportion calculated by the reference proportion calculation module.
CN201310526980.4A 2013-10-30 2013-10-30 User's gender analysis method and apparatus Active CN104598452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310526980.4A CN104598452B (en) 2013-10-30 2013-10-30 User's gender analysis method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310526980.4A CN104598452B (en) 2013-10-30 2013-10-30 User's gender analysis method and apparatus

Publications (2)

Publication Number Publication Date
CN104598452A CN104598452A (en) 2015-05-06
CN104598452B true CN104598452B (en) 2018-09-11

Family

ID=53124251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310526980.4A Active CN104598452B (en) 2013-10-30 2013-10-30 User's gender analysis method and apparatus

Country Status (1)

Country Link
CN (1) CN104598452B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106656943B (en) * 2015-11-03 2019-09-17 秒针信息技术有限公司 A kind of matching process and device of network user's attribute
CN105809557A (en) * 2016-03-15 2016-07-27 微梦创科网络科技(中国)有限公司 Method and device for mining genders of users in social network
CN106844687B (en) * 2017-01-23 2021-01-01 炫彩互动网络科技有限公司 Method and system for determining gender of user based on game log
CN107357782B (en) * 2017-06-29 2020-12-18 深圳市金立通信设备有限公司 Method and terminal for identifying gender of user
CN111309913A (en) * 2020-02-26 2020-06-19 北京慧博科技有限公司 Method for analyzing gender by name

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1987852A (en) * 2005-12-21 2007-06-27 腾讯科技(深圳)有限公司 Method and device for determining communication object attribute according to news content
CN103164470A (en) * 2011-12-15 2013-06-19 盛大计算机(上海)有限公司 Directional application method based on user gender distinguished results and system thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8005782B2 (en) * 2007-08-10 2011-08-23 Microsoft Corporation Domain name statistical classification using character-based N-grams
US8041662B2 (en) * 2007-08-10 2011-10-18 Microsoft Corporation Domain name geometrical classification using character-based n-grams

Also Published As

Publication number Publication date
CN104598452A (en) 2015-05-06

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20150506

Assignee: Beijing Interactive Technology Co., Ltd.

Assignor: Beijing Sibotu Information Technology Co., Ltd.

Contract record no.: 2015110000019

Denomination of invention: Method and device for analyzing user gender

License type: Exclusive License

Record date: 20150603

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
EC01 Cancellation of recordation of patent licensing contract

Assignee: Beijing Interactive Technology Co., Ltd.

Assignor: Miaozhen Systems Information Technology Co., Ltd.

Contract record no.: 2015110000019

Date of cancellation: 20160426

EM01 Change of recordation of patent licensing contract

Change date: 20160426

Contract record no.: 2015110000019

Assignor after: Miaozhen Systems Information Technology Co., Ltd.

Assignor before: Beijing Sibotu Information Technology Co., Ltd.

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
CB02 Change of applicant information

Address after: 100102 Beijing, Chaoyang District Fu Tong East Street, building 1, room 5, room 321008

Applicant after: Miaozhen Systems Information Technology Co., Ltd.

Address before: Beijing City, a small town east of Changping District road 102218 in No. 398 Coal Construction Group No. 1 building, 4th floor, Miaozhen Systems

Applicant before: Beijing Sibotu Information Technology Co., Ltd.

COR Change of bibliographic data
GR01 Patent grant