CN105956187A - Selection method of majority-class user network access characteristics - Google Patents

Selection method of majority-class user network access characteristics

Info

Publication number
CN105956187A
Authority
CN
China
Prior art keywords
user
feature
minority class
network access
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610394392.3A
Other languages
Chinese (zh)
Other versions
CN105956187B (en)
Inventor
牟超
周庆
胡月
孙启亮
孟瑶
全文君
廖凤露
尹春梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN201610394392.3A
Publication of CN105956187A
Application granted
Publication of CN105956187B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for selecting the network access features of minority-class users and relates to the field of big data analysis. The method comprises the following steps: first, downloading logs from a gateway server, compiling statistics on the network access data of real users, and forming an initial data set; second, creating N virtual minority-class users and building a virtual data set; third, performing correlation analysis on the virtual data set and selecting the access features that are significantly correlated with the minority-class users; and finally, performing factor analysis to further reduce the feature dimensionality. Creating virtual users automatically balances the proportion of minority-class users, so network access features can still be extracted when the target users make up only a small proportion of all users.

Description

A method for selecting network access features of minority-class users
Technical field
The present invention relates to the field of big data analysis, and in particular to a method for selecting network access features of minority-class users.
Background
With the spread of Internet technology, the number of Internet users keeps growing. The network has penetrated every aspect of life, so analyzing users' network behavior is highly valuable: it helps in understanding users and provides a scientific basis for decision-making. For example, in e-commerce, analyzing users' purchasing behavior enables precise placement of product advertisements; in education, analyzing teenagers' online behavior makes it possible to correct undesirable network access in time; in information security, monitoring users' online behavior makes it possible to block access by unauthorized users in time. Selecting network access features that can express a user's network behavior is an indispensable part of user behavior analysis, and also its most important part.
In current research, the most common user network access features are the frequency and duration of visits to different types of websites. These features are numerous, so the important ones must be selected in order to reduce dimensionality. Many feature selection methods exist; correlation analysis, for example, is a simple feature selection algorithm that is easy to carry out. However, most current feature selection algorithms do not consider the case where the target users are minority-class users, that is, where they make up a very small proportion of all users. For example, among the large number of users who browse a product advertisement, only a small fraction are willing to buy. In this case, the imbalance in numbers causes the selected features to be incomplete and unable to fully express the access information of the target users. How to adaptively select important features from a large number of network access features when minority-class users make up only a small proportion of all users is therefore of both research significance and practical value.
Summary of the invention
In view of the above drawbacks of the prior art, the technical problem to be solved by the present invention is to provide a method for selecting network access features of minority-class users that can adaptively select important features from a large number of network access features when minority-class users make up only a small proportion of all users.
To achieve the above object, the present invention provides a method for selecting network access features of minority-class users, characterized by comprising the following steps:
Step 1: download logs from a gateway server, compile statistics on the network access data of real users, and form an initial data set; the dimension of the initial data set is m × d, where m is the total number of users and d is the number of features, and the initial data set contains data that follow an exponential distribution;
Step 2: create N virtual users of the minority class and build a virtual data set; the dimension of the virtual data set is (m+N) × d, and the data of the virtual data set follow the same probability distribution as the data of the initial data set;
Step 3: perform correlation analysis on the virtual data set and select the access features that are significantly correlated with the minority-class users.
Further, the method also comprises Step 4: perform factor analysis on the virtual data set on which correlation analysis has been performed, to further reduce the feature dimensionality.
Further, Step 2 specifically comprises:
A0. Label the minority-class users S_i, where i ∈ [1, p] and p is the total number of minority-class users;
A1. Calculate the mean μ_j of each feature, j ∈ [1, d];
A2. Calculate the total number of virtual minority-class users to be created: N = m - p;
A3. Determine whether p is greater than 1; if so, continue with A4; otherwise, directly create N copies of S_1 and perform Step 3;
A4. Calculate the number n of virtual users required for each real minority-class user: n = ⌊N/p⌋, where ⌊·⌋ denotes rounding down;
A5. Add n virtual users for S_i according to the exponential distribution.
Further, step A5 of Step 2 specifically comprises:
B0. Determine whether i exceeds the number of real minority-class users; if so, terminate; otherwise, continue;
B1. Set i = i + 1;
B2. Determine whether the number of virtual users already created for S_i is greater than n; if not, continue with B3; otherwise, jump back to B0;
B3. Find the min(p-1, 5) real minority-class users with the smallest Euclidean distance to S_i;
B4. Randomly select one of these min(p-1, 5) users, denoted S_ik, k ∈ [1, min(p-1, 5)];
B5. Generate a random number R, with R ~ U(0, 1);
B6. Create a virtual user S'_i, whose j-th feature S'_i(j) is expressed as:
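The bookkeeping in steps A1, A2 and A4 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name is hypothetical, the feature means are assumed to be taken over all m rows of the initial data set (the text does not say which rows are averaged), and the per-user count is assumed to be the total N divided by p and rounded down.

```python
from statistics import mean

def plan_virtual_users(data, minority_idx):
    """Return (feature means, N, n) for steps A1, A2 and A4.

    data         -- m rows, each holding d non-negative feature values
    minority_idx -- row indices of the p real minority-class users (step A0)
    """
    m, p = len(data), len(minority_idx)
    d = len(data[0])
    # A1: mean of each feature (assumption: averaged over all m users)
    mu = [mean(row[j] for row in data) for j in range(d)]
    # A2: total number of virtual users, N = m - p
    total_virtual = m - p
    # A4: virtual users per real minority user (assumption: floor(N / p))
    per_user = total_virtual // p
    return mu, total_virtual, per_user
```

With 10 users of whom 3 belong to the minority class, this gives N = 7 and n = 2. The function assumes p is at least 1; the case p = 1 is the branch of step A3 in which S_1 is simply duplicated.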
$$S'_i(j) = -\mu_j \ln\left((1-R)\,e^{-S_i(j)/\mu_j} + R\,e^{-S_{ik}(j)/\mu_j}\right)$$
where j ∈ [1, d];
B7. Combine these features to obtain the feature set of the newly created virtual user:
S'_i = [S'_i(1), S'_i(2), ..., S'_i(j), ..., S'_i(d)].
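Steps B3 to B7 can be sketched as follows. The function names are hypothetical, the sketch assumes p > 1 and strictly positive feature means, and the loop structure is a simplification of the B0 to B2 counters rather than a literal transcription of them.

```python
import math
import random

def nearest_minority(i, minority, k_max=5):
    """B3: indices of the min(p-1, 5) real minority users closest to user i."""
    k = min(len(minority) - 1, k_max)
    others = [j for j in range(len(minority)) if j != i]
    others.sort(key=lambda j: math.dist(minority[i], minority[j]))
    return others[:k]

def make_virtual_user(s_i, s_ik, mu, r):
    """B6/B7: build every feature of S'_i from S_i, its neighbour S_ik and R."""
    return [
        -mu_j * math.log((1 - r) * math.exp(-a / mu_j) + r * math.exp(-b / mu_j))
        for a, b, mu_j in zip(s_i, s_ik, mu)
    ]

def create_virtual_users(minority, mu, n, rng):
    """Create n virtual users for each real minority user (simplified B0-B7)."""
    virtual = []
    for i, s_i in enumerate(minority):
        neighbours = nearest_minority(i, minority)              # B3
        for _ in range(n):
            s_ik = minority[rng.choice(neighbours)]             # B4
            r = rng.random()                                    # B5: R ~ U(0, 1)
            virtual.append(make_virtual_user(s_i, s_ik, mu, r))  # B6, B7
    return virtual
```

Because the expression inside the logarithm is a weighted average of two exponentials, each generated feature value lies between the corresponding values of S_i and S_ik. Passing an explicit `random.Random` instance as `rng` keeps runs reproducible.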
Further, in Step 3, the correlation analysis is a simple correlation analysis whose correlation coefficient is the Pearson correlation coefficient, with a significance level of 0.05; the output set of significant features is [Sig_1, Sig_2, ..., Sig_s], where s is the number of significantly correlated features and Sig_s is a significantly correlated feature.
Further, Step 4 specifically comprises:
C1. Calculate the Bartlett statistic and the KMO statistic;
C2. Determine whether the Bartlett statistic is less than 0.05 and whether the KMO statistic is less than 0.5; if not, continue; otherwise, terminate;
C3. Select the common factors whose eigenvalues are greater than 1, each common factor comprising several homogeneous features of minority-class user network access;
C4. Rotate the factor axes using the varimax method to highlight the network access features of minority-class users.
The beneficial effects of the invention are as follows: by creating virtual users, the invention automatically balances the proportion of minority-class users and ensures that, after the virtual users are added, each feature still follows the original exponential distribution, thereby achieving adaptive extraction of the features of minority-class users' network access behavior.
Brief description of the drawings
Fig. 1 is a schematic flowchart of Embodiment 1 of the present invention;
Fig. 2 is a schematic flowchart of creating virtual users for academically struggling university students;
Fig. 3 is a schematic flowchart of creating virtual users based on the exponential distribution;
Fig. 4 is a schematic flowchart of the factor analysis.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments:
In this embodiment, academically struggling university students are taken as the minority class, and the invention is used to analyze their network access features.
As shown in Figs. 1 to 4, this embodiment provides a method for selecting the network access features of academically struggling university students, comprising the following steps:
Step 1: download logs from the gateway server, compile statistics on the school's network access data, and form an initial data set. The data set includes both students without academic difficulty and the minority of academically struggling students. The logs downloaded from the gateway server include URLs, student numbers, and so on. The analysis focuses on the types of websites that academically struggling students visit, their network access frequency, their network access duration, and so on. The dimension of the initial data set is m × d, where m is the total number of students and d is the number of features; the features mainly include access frequency, access duration, and so on. It is worth noting that the frequency features and the duration features follow an exponential distribution. In addition, to reduce the data analysis workload, the frequency and duration features are usually aggregated by website type. Websites can generally be divided into large portals, industry websites, transaction websites, classified-information websites, forums, government websites, functional websites, entertainment websites, enterprise websites, and so on.
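As an illustration of how such an initial data set might be assembled, the following Python sketch aggregates log records into one row of frequency and duration features per student. The record layout (student number, website type, seconds), the three website types and the function name are assumptions made for this example; the embodiment does not specify a log format.

```python
from collections import defaultdict

# Hypothetical website types; a real deployment would use its own taxonomy.
CATEGORIES = ["portal", "forum", "entertainment"]

def build_initial_dataset(records, categories=CATEGORIES):
    """Aggregate (user_id, category, seconds) records into one row per user.

    Each row stores, for every category, the visit count followed by the
    total access duration, so the number of features is d = 2 * len(categories).
    """
    index = {c: k for k, c in enumerate(categories)}
    rows = defaultdict(lambda: [0.0] * (2 * len(categories)))
    for user_id, category, seconds in records:
        if category not in index:
            continue  # skip website types outside the chosen taxonomy
        k = index[category]
        rows[user_id][2 * k] += 1            # frequency feature
        rows[user_id][2 * k + 1] += seconds  # duration feature
    users = sorted(rows)
    return users, [rows[u] for u in users]
```

The returned list of rows is the m × d matrix used by the later steps, with the parallel `users` list recording which student each row belongs to.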
Step 2: create N virtual users representing academically struggling students (the minority class) and build a virtual data set; the m real users and the N virtual users together form a virtual data set of dimension (m+N) × d. Creating virtual users for academically struggling students increases the size of the minority class in the data set, which helps in extracting its network access features. It is worth noting that the features of the created virtual users should follow the same probability distribution as the data of the real users. For example, network access frequency and access duration are memoryless and follow an exponential distribution, so the newly created virtual academically struggling students should also follow the exponential distribution of the real academically struggling students.
Specifically, as shown in Fig. 2, to analyze the network access features of academically struggling students, Step 2 comprises:
A0. Label the real academically struggling students S_i, where i ∈ [1, p] and p is the total number of real academically struggling students;
A1. Calculate the mean μ_j of each feature, j ∈ [1, d];
A2. Calculate the total number of virtual academically struggling students to be created: N = m - p;
A3. Determine whether p is greater than 1; if so, continue with A4; otherwise, directly create N copies of S_1 and perform Step 3;
A4. Calculate the number n of virtual users required for each real academically struggling student: n = ⌊N/p⌋, where ⌊·⌋ denotes rounding down;
A5. Add n virtual academically struggling students for S_i according to the exponential distribution.
Further, as shown in Fig. 3, the specific steps in A5 of adding n virtual academically struggling students for S_i according to the exponential distribution are as follows:
B0. Determine whether i exceeds the number of real academically struggling students; if so, terminate; otherwise, continue with B1;
B1. Set i = i + 1;
B2. Determine whether the number of virtual academically struggling students already created for S_i is greater than n; if not, continue with B3; otherwise, jump back to B0;
B3. Find the min(p-1, 5) real academically struggling students with the smallest Euclidean distance to S_i;
B4. Randomly select one of these min(p-1, 5) users, denoted S_ik, k ∈ [1, min(p-1, 5)];
B5. Generate a random number R, with R ~ U(0, 1);
B6. To ensure that all academically struggling students still follow the same exponential distribution after the virtual user S'_i is created, each feature j should satisfy:
$$\frac{P\{S_i < x < S'_i\}}{P\{S_i < x < S_{ik}\}} = R \qquad (1)$$
Solving equation (1) with the distribution function of the exponential distribution gives the j-th feature S'_i(j) of the newly created virtual academically struggling student, which can be expressed as follows:
$$S'_i(j) = -\mu_j \ln\left((1-R)\,e^{-S_i(j)/\mu_j} + R\,e^{-S_{ik}(j)/\mu_j}\right) \qquad (2)$$
where j ∈ [1, d];
B7. Combine these features to obtain the feature set of the newly created virtual academically struggling student:
S'_i = [S'_i(1), S'_i(2), ..., S'_i(j), ..., S'_i(d)]    (3)
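The step from the probability ratio in B6 to the closed-form feature value can be written out explicitly. The derivation below is a sketch that assumes each feature j follows an exponential distribution parameterized by its mean μ_j (rate 1/μ_j), which is consistent with step A1 but is not stated in these terms in the text.

```latex
\begin{aligned}
F_j(x) &= 1 - e^{-x/\mu_j}, \qquad
P\{a < x < b\} = F_j(b) - F_j(a) = e^{-a/\mu_j} - e^{-b/\mu_j}, \\[4pt]
R &= \frac{e^{-S_i(j)/\mu_j} - e^{-S'_i(j)/\mu_j}}
          {e^{-S_i(j)/\mu_j} - e^{-S_{ik}(j)/\mu_j}}, \\[4pt]
e^{-S'_i(j)/\mu_j} &= (1-R)\,e^{-S_i(j)/\mu_j} + R\,e^{-S_{ik}(j)/\mu_j}, \\[4pt]
S'_i(j) &= -\mu_j \ln\!\left((1-R)\,e^{-S_i(j)/\mu_j} + R\,e^{-S_{ik}(j)/\mu_j}\right).
\end{aligned}
```

Setting R = 0 recovers S_i(j) and letting R approach 1 recovers S_ik(j), so the virtual user is an interpolation between a real user and one of its neighbours on the scale of the exponential distribution function.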
Step 3: perform correlation analysis on the virtual data set and select the access features that are significantly correlated with academically struggling students.
In this embodiment, simple correlation analysis is used; its correlation coefficient is the Pearson correlation coefficient and the significance level is 0.05. The output set of significant features is [Sig_1, Sig_2, ..., Sig_s], where s is the number of significantly correlated features and Sig_s is a significantly correlated feature.
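A minimal sketch of such a screening step is shown below. Two things in it are assumptions rather than part of the embodiment: each feature is correlated against a 0/1 label marking minority-class membership, and the p-value comes from the Fisher z-transformation, a large-sample approximation chosen here to keep the example dependency-free. A statistics library would normally supply the exact t-based p-value.

```python
import math
from statistics import NormalDist

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transformation (approximation)."""
    if abs(r) >= 1.0:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def significant_features(data, labels, alpha=0.05):
    """Indices of features significantly correlated with the minority label."""
    selected = []
    for j in range(len(data[0])):
        column = [row[j] for row in data]
        r = pearson_r(column, labels)
        if approx_p_value(r, len(labels)) < alpha:
            selected.append(j)
    return selected
```

The sketch assumes more than three rows and no constant columns, since a constant feature has zero variance and its correlation is undefined.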
Step 4: perform factor analysis to further reduce the feature dimensionality for academically struggling students, specifically comprising:
C1. Calculate the Bartlett statistic and the KMO statistic;
C2. Determine whether the Bartlett statistic is less than 0.05 and whether the KMO statistic is less than 0.5; if not, continue; otherwise, terminate;
C3. Select the common factors whose eigenvalues are greater than 1. Each common factor comprises homogeneous features of minority-class user network access; these common factors classify the features, and each common factor represents one category of features that affects students' academic performance;
C4. Rotate the factor axes using the varimax method to highlight the network access features of minority-class users, so that the resulting common factors are easier to interpret and help in understanding the main factors that affect university students' academic performance.
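The selection rule in C3 amounts to counting the eigenvalues of the feature correlation matrix that exceed 1. The sketch below illustrates only that rule; it does not cover the Bartlett and KMO checks or the varimax rotation. The cyclic Jacobi eigenvalue routine is one dependency-free way to obtain the eigenvalues and is a choice made for this example, not something the embodiment prescribes; the function names are hypothetical.

```python
import math

def jacobi_eigenvalues(sym, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in sym]
    d = len(a)
    for _ in range(sweeps):
        off = sum(a[p][q] ** 2 for p in range(d) for q in range(d) if p != q)
        if off < tol:
            break  # off-diagonal part is negligible: diagonal holds eigenvalues
        for p in range(d - 1):
            for q in range(p + 1, d):
                if abs(a[p][q]) < 1e-15:
                    continue
                # rotation angle that zeroes a[p][q]
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(d):  # apply rotation to columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(d):  # apply rotation to rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(d)), reverse=True)

def count_common_factors(corr):
    """C3: number of common factors whose eigenvalue is greater than 1."""
    return sum(1 for ev in jacobi_eigenvalues(corr) if ev > 1.0)
```

For a correlation matrix in which two features are correlated at 0.8 and a third is independent of both, the eigenvalues are 1.8, 1.0 and 0.2, so a single common factor is retained.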
In summary, by creating virtual users for academically struggling students, this embodiment automatically balances the proportion of minority-class users while also ensuring that, after the virtual users are added, each feature of all academically struggling students still follows the original exponential distribution, thereby achieving adaptive extraction of the features of minority-class users' network access behavior. Although this embodiment takes academically struggling university students as the minority class and analyzes their network access behavior, it is equally applicable to other studies of minority-class network access behavior, which are not described further here.
The preferred embodiments of the present invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain on the basis of the prior art through logical analysis, reasoning or limited experimentation in accordance with the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims (6)

1. A method for selecting network access features of minority-class users, characterized by comprising the following steps:
Step 1: downloading logs from a gateway server, compiling statistics on the network access data of real users, and forming an initial data set; wherein the dimension of said initial data set is m × d, m is the total number of users, d is the number of features, and said initial data set contains data that follow an exponential distribution;
Step 2: creating N virtual users of the minority class and building a virtual data set; wherein the dimension of said virtual data set is (m+N) × d, and the data of said virtual data set follow the same probability distribution as the data of the initial data set;
Step 3: performing correlation analysis on the virtual data set and selecting the access features that are significantly correlated with the minority-class users.
2. The method for selecting network access features of minority-class users according to claim 1, characterized in that said method further comprises Step 4: performing factor analysis on the virtual data set on which correlation analysis has been performed, to further reduce the feature dimensionality.
3. The method for selecting network access features of minority-class users according to claim 1, characterized in that said Step 2 specifically comprises:
A0. Labeling said minority-class users S_i, where i ∈ [1, p] and p is the total number of said minority-class users;
A1. Calculating the mean μ_j of each feature, j ∈ [1, d];
A2. Calculating the total number of said virtual minority-class users to be created: N = m - p;
A3. Determining whether p is greater than 1; if so, continuing with A4; otherwise, directly creating N copies of S_1 and performing said Step 3;
A4. Calculating the number n of virtual users required for each real minority-class user: n = ⌊N/p⌋;
A5. Adding n of said virtual users for S_i according to the exponential distribution.
4. The method for selecting network access features of minority-class users according to claim 3, characterized in that the specific steps in A5 of adding n virtual users for S_i according to the exponential distribution are as follows:
B0. Determining whether i exceeds the number of said real minority-class users; if so, terminating; otherwise, continuing;
B1. Setting i = i + 1;
B2. Determining whether the number of virtual users already created for S_i is greater than n; if not, continuing with B3; otherwise, jumping back to B0;
B3. Finding the min(p-1, 5) real minority-class users with the smallest Euclidean distance to S_i;
B4. Randomly selecting one of these min(p-1, 5) users, denoted S_ik, k ∈ [1, min(p-1, 5)];
B5. Generating a random number R, with R ~ U(0, 1);
B6. Creating a virtual user S'_i, the j-th feature S'_i(j) of said virtual user being expressed as:
$$S'_i(j) = -\mu_j \ln\left((1-R)\,e^{-S_i(j)/\mu_j} + R\,e^{-S_{ik}(j)/\mu_j}\right)$$
where j ∈ [1, d];
B7. Combining these features to obtain the feature set of the newly created virtual user:
S'_i = [S'_i(1), S'_i(2), ..., S'_i(j), ..., S'_i(d)].
5. The method for selecting network access features of minority-class users according to claim 1, characterized in that, in said Step 3, said correlation analysis is a simple correlation analysis whose correlation coefficient is the Pearson correlation coefficient, with a significance level of 0.05; the output set of significant features is [Sig_1, Sig_2, ..., Sig_s], where s is the number of significantly correlated features and Sig_s is a significantly correlated feature.
6. The method for selecting network access features of minority-class users according to claim 2, characterized in that said Step 4 is carried out as follows:
C1. Calculating the Bartlett statistic and the KMO statistic;
C2. Determining whether the Bartlett statistic is less than 0.05 and whether the KMO statistic is less than 0.5; if not, continuing with step C3; otherwise, terminating;
C3. Selecting the common factors whose eigenvalues are greater than 1, each said common factor comprising several homogeneous features of minority-class user network access;
C4. Rotating the factor axes using the varimax method to highlight the network access features of minority-class users.
CN201610394392.3A 2016-06-03 2016-06-03 A kind of choosing method of minority class subscriber network access feature Expired - Fee Related CN105956187B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610394392.3A CN105956187B (en) 2016-06-03 2016-06-03 A kind of choosing method of minority class subscriber network access feature

Publications (2)

Publication Number Publication Date
CN105956187A true CN105956187A (en) 2016-09-21
CN105956187B CN105956187B (en) 2019-03-15

Family

ID=56907799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610394392.3A Expired - Fee Related CN105956187B (en) 2016-06-03 2016-06-03 A kind of choosing method of minority class subscriber network access feature

Country Status (1)

Country Link
CN (1) CN105956187B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220318A (en) * 2017-05-18 2017-09-29 重庆大学 A kind of method for determining special student groups online feature
CN107508809A (en) * 2017-08-17 2017-12-22 腾讯科技(深圳)有限公司 Identify the method and device of website type

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411610A (en) * 2011-10-12 2012-04-11 浙江大学 Semi-supervised dimensionality reduction method for high dimensional data clustering
US20140244644A1 (en) * 2011-07-06 2014-08-28 Fred Bergman Healthcare Pty Ltd Event detection algorithms
CN104254852A (en) * 2012-03-17 2014-12-31 海智网聚网络技术(北京)有限公司 Method and system for hybrid information query

Also Published As

Publication number Publication date
CN105956187B (en) 2019-03-15

Similar Documents

Publication Publication Date Title
US11687728B2 (en) Text sentiment analysis method based on multi-level graph pooling
CN105046515B (en) Method and device for sorting advertisements
US9699042B2 (en) Systems and methods of classifying sessions
CN103020845B (en) A kind of method for pushing and system of mobile application
CN109615128A (en) Real estate client's conclusion of the business probability forecasting method, device and server
CN106776881A (en) A kind of realm information commending system and method based on microblog
CN105022754A (en) Social network based object classification method and apparatus
CN110490625A (en) User preference determines method and device, electronic equipment, storage medium
CN103544188A (en) Method and device for pushing mobile internet content based on user preference
CN102523274A (en) Core network side based system and method for initiatively pushing wireless personalized accurate information
CN110134845A (en) Project public sentiment monitoring method, device, computer equipment and storage medium
CN109086317A (en) Risk control method and relevant apparatus
CN109446431A (en) For the method, apparatus of information recommendation, medium and calculate equipment
CN104484449B (en) The context extraction method and device of Webpage
CN107612922A (en) User ID authentication method and device based on user operation habits and geographical position
CN111767443A (en) Efficient web crawler analysis platform
CN110609958A (en) Data pushing method and device, electronic equipment and storage medium
US10742627B2 (en) System and method for dynamic network data validation
CN107862039A (en) Web data acquisition methods, system and Data Matching method for pushing
CN107506649A (en) A kind of leak detection method of html web page, device and electronic equipment
CN105989114A (en) Collection content recommendation method and terminal
CN102982012B (en) Method and device used for obtaining target character strings in disorder text
CN105956187A (en) Selection method of majority-class user network access characteristics
CN105426392A (en) Collaborative filtering recommendation method and system
CN107679883A (en) The method and system of advertisement generation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190315
