CN105956187B - Method for selecting network access features of minority-class users - Google Patents

Info

Publication number
CN105956187B
Authority
CN
China
Prior art keywords
minority class
user
feature
data set
network access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610394392.3A
Other languages
Chinese (zh)
Other versions
CN105956187A (en)
Inventor
牟超
周庆
胡月
孙启亮
孟瑶
全文君
廖凤露
尹春梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-06-03
Filing date
2016-06-03
Publication date
2019-03-15
Application filed by Chongqing University
Priority to CN201610394392.3A
Publication of CN105956187A
Application granted
Publication of CN105956187B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for selecting network access features of minority-class users, in the field of big data analysis, comprising the following steps: first, download logs from a gateway server, compile the network access data of real users, and form an initial data set; second, create N virtual minority-class users to construct a virtual data set; third, perform correlation analysis on the virtual data set and select the access features significantly correlated with minority-class users; finally, perform factor analysis to further reduce the feature dimensionality. By creating virtual users, the invention automatically balances the proportion of minority-class users, so that network access features can still be extracted when the target users make up only a very small share of all users.

Description

Method for selecting network access features of minority-class users
Technical field
The present invention relates to the field of big data analysis, and in particular to a method for selecting network access features of minority-class users.
Background
With the spread of Internet technology, the number of Internet users keeps growing. The network has penetrated every aspect of daily life, so analyzing users' network behavior is highly worthwhile: it helps in understanding users and provides a scientific basis for decision-making. For example, in e-commerce, analyzing users' purchasing behavior enables precisely targeted product advertising; in education, analyzing teenagers' online behavior makes it possible to correct harmful network access in time; in information security, monitoring users' online behavior makes it possible to block access by illegitimate users in time. Selecting the network access features that best express users' network behavior is an essential and critical step in user behavior analysis.
In current research, the most common user network access features are the frequency and duration of visits to different types of websites. There are usually a great many such features, so the important ones must be selected in order to reduce dimensionality. Existing research offers many feature selection methods; correlation analysis, for example, is a simple, fast, and easily executed feature selection algorithm. However, current feature selection algorithms generally do not consider the case where the target users are a minority class, that is, where they make up a very small proportion of all users; for example, among the many users who view a product advertisement, only a small fraction are willing to buy. In this case, because the class sizes are imbalanced, the selected features are not comprehensive enough to fully express the target users' access information. How to adaptively select important features from a large number of network access features when minority-class users make up a very small proportion of all users is therefore of considerable research significance and practical value.
Summary of the invention
In view of the above defects of the prior art, the technical problem to be solved by the present invention is to provide a method for selecting network access features of minority-class users that can adaptively select important features from a large number of network access features when minority-class users make up a very small proportion of all users.
To achieve the above object, the present invention provides a method for selecting network access features of minority-class users, characterized in that it comprises the following steps:
Step 1: download logs from a gateway server, compile the network access data of real users, and form an initial data set; the initial data set has dimension m × d, where m is the total number of users and d is the number of features, and it contains data that follow an exponential distribution;
Step 2: create N virtual minority-class users and construct a virtual data set; the virtual data set has dimension (m+N) × d, and its data are identically distributed with the data of the initial data set;
Step 3: perform correlation analysis on the virtual data set and select the access features significantly correlated with minority-class users.
Further, the method also comprises Step 4: perform factor analysis on the virtual data set that has undergone correlation analysis, to further reduce the feature dimensionality.
Further, Step 2 specifically comprises:
A0. Label the minority-class users S_i, where i ∈ [1, p] and p is the total number of minority-class users;
A1. Compute the mean μ_j of each feature, j ∈ [1, d];
A2. Compute the total number of virtual minority-class users to be created: N = m − p;
A3. Determine whether p is greater than 1; if so, continue with A4; otherwise, replicate S_1 N times and proceed to Step 3;
A4. Compute the number n of virtual users to be created for each real minority-class user: n = ⌊N/p⌋, where ⌊·⌋ denotes rounding down;
A5. Add n virtual users for each S_i according to the exponential distribution.
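As a rough illustration of steps A1 to A4, the following Python sketch computes the feature means, the total number N of virtual users, and the per-user quota n. The function name `plan_virtual_users`, the list-of-lists data layout, the choice to average over all m rows, and the reading of step A4 as n = floor(N / p) are assumptions made for this example; the text does not spell them out.

```python
def plan_virtual_users(dataset, minority_rows):
    """Return (means, N, n) for steps A1 to A4.

    dataset       -- m rows of d numeric features (hypothetical layout)
    minority_rows -- row indices of the labelled minority-class users (A0)
    """
    m, d = len(dataset), len(dataset[0])
    p = len(minority_rows)
    if p == 0:
        raise ValueError("at least one minority-class user is required")

    # A1: mean of each feature (assumption: averaged over all m rows).
    means = [sum(row[j] for row in dataset) / m for j in range(d)]

    # A2: total number of virtual minority-class users, N = m - p.
    total_virtual = m - p

    # A3/A4: with a single minority user, S_1 is replicated N times;
    # otherwise each real minority user is assumed to get floor(N / p).
    per_user = total_virtual if p == 1 else total_virtual // p
    return means, total_virtual, per_user


if __name__ == "__main__":
    data = [[1.0, 4.0]] * 7 + [[3.0, 8.0]] * 3   # 10 users, 2 features
    print(plan_virtual_users(data, minority_rows=[7, 8, 9]))
```

Because of the rounding in A4, n · p can be smaller than N, so the virtual data set may end up slightly short of (m+N) rows unless the remainder is handled separately.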
Further, step A5 of Step 2 specifically comprises:
B0. Determine whether i exceeds the number of real minority-class users; if so, terminate; otherwise, continue;
B1. Set i = i + 1;
B2. Determine whether the number of virtual users created for S_i exceeds n; if not, continue with B3; otherwise, return to B0;
B3. Find the min(p−1, 5) real minority-class users with the smallest Euclidean distance to S_i;
B4. Randomly select one of these min(p−1, 5) users, and denote its index by k ∈ [1, min(p−1, 5)];
B5. Generate a random number R, with R ~ U(0, 1);
B6. Create a virtual user S'_i, whose j-th feature S'_i(j) is expressed as:
[formula missing in source]
where j ∈ [1, d];
B7. Combine these features to obtain the feature set of the newly created virtual user:
S'_i = [S'_i(1), S'_i(2), ..., S'_i(j), ..., S'_i(d)].
Further, in Step 3, the correlation analysis is simple correlation analysis, the correlation coefficient is the Pearson correlation coefficient, and the significance level is 0.05; the output set of significant features is [Sig_1, Sig_2, ..., Sig_s], where s is the number of significantly correlated features and Sig_s is a significantly correlated feature.
Further, Step 4 specifically comprises:
C1. Compute the Bartlett statistic and the KMO statistic;
C2. Determine whether the Bartlett statistic is less than 0.05 and whether the KMO statistic is less than 0.5; if not, continue; otherwise, terminate;
C3. Select the common factors whose eigenvalues are greater than 1; each common factor groups several similar network access features of minority-class users;
C4. Rotate the factor axes using the varimax method to highlight the network access features of minority-class users.
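The text does not define the two statistics used in C1 and C2. For reference, the standard textbook definitions of Bartlett's test of sphericity and the Kaiser-Meyer-Olkin (KMO) measure are given below; the symbols M (number of rows in the data set) and q (number of retained features) are introduced here only for illustration. In common practice the 0.05 threshold is compared with the p-value of Bartlett's test rather than with the chi-square value itself, and a KMO above 0.5 is usually read as adequate; whether the text intends exactly this reading is an assumption.

```latex
% R: q x q sample correlation matrix of the retained features
% M: number of rows (users) in the data set
\chi^2_{\mathrm{Bartlett}}
  = -\left(M - 1 - \frac{2q + 5}{6}\right)\ln\lvert R\rvert ,
\qquad
\mathrm{df} = \frac{q(q-1)}{2}

% r_{ij}: entries of R;  a_{ij}: partial correlations obtained from Q = R^{-1}
a_{ij} = -\frac{Q_{ij}}{\sqrt{Q_{ii}\,Q_{jj}}} ,
\qquad
\mathrm{KMO}
  = \frac{\sum_{i \neq j} r_{ij}^{2}}
         {\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} a_{ij}^{2}}
```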
The beneficial effects of the present invention are as follows: by creating virtual users, the invention automatically balances the proportion of minority-class users while ensuring that each feature still follows the original exponential distribution after the virtual users are added, thereby adaptively extracting the features of minority-class users' network access behavior.
Brief description of the drawings
Fig. 1 is a flow diagram of Embodiment 1 of the present invention;
Fig. 2 is a flow diagram of creating virtual users for academically struggling university students;
Fig. 3 is a flow diagram of creating virtual users based on the exponential distribution;
Fig. 4 is a flow diagram of the factor analysis.
Detailed description of the embodiments
The present invention is further described below with reference to the drawings and an embodiment:
This embodiment takes academically struggling university students as the minority class and uses the present invention to analyze their network access features.
As shown in Figs. 1 to 4, this embodiment provides a method for selecting the network access features of academically struggling university students, comprising the following steps:
Step 1: download logs from the gateway server, compile the school's network access data, and form an initial data set. The data set includes university students who are not academically struggling and a small number who are. The logs downloaded from the gateway server include the URL, the student number, and so on. The analysis focuses on the types of websites that academically struggling students access, their access frequency, and their access duration. The initial data set has dimension m × d, where m is the total number of students and d is the number of features; the features mainly include access frequency and access duration. Notably, the frequency and duration features follow an exponential distribution. In addition, to reduce the data analysis workload, frequency and duration are usually aggregated by type of website. Websites can generally be divided into large portals, industry websites, transaction websites, classified-information websites, forums, government websites, functional websites, entertainment websites, corporate websites, and so on.
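The aggregation described in Step 1 can be sketched as follows. This is a minimal illustration only: the record layout `(user_id, category, duration_seconds)`, the function name `build_initial_dataset`, and the assumption that each URL has already been mapped to a website category are all hypothetical, since the text does not describe the log format.

```python
from collections import defaultdict


def build_initial_dataset(records, categories):
    """Aggregate log records into an m x d matrix (d = 2 * len(categories)).

    records    -- iterable of (user_id, category, duration_seconds) tuples
                  (hypothetical layout; URLs are assumed already categorised)
    categories -- ordered list of website categories to keep
    Each row holds the per-category visit counts followed by the
    per-category total durations for one user.
    """
    index = {name: k for k, name in enumerate(categories)}
    freq = defaultdict(lambda: [0] * len(categories))
    dur = defaultdict(lambda: [0.0] * len(categories))
    for user, category, seconds in records:
        if category not in index:
            continue  # ignore categories that are not analysed
        freq[user][index[category]] += 1
        dur[user][index[category]] += seconds
    users = sorted(freq)
    matrix = [freq[u] + dur[u] for u in users]
    return users, matrix


if __name__ == "__main__":
    log = [("s1", "portal", 30.0), ("s1", "portal", 10.0), ("s2", "forum", 5.0)]
    users, matrix = build_initial_dataset(log, ["portal", "forum"])
    for user, row in zip(users, matrix):
        print(user, row)
```

Keeping frequency and duration as separate columns per category mirrors the description above, where both kinds of feature are counted for each type of website.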
Step 2: create N virtual users representing academically struggling university students (the minority class) and construct a virtual data set; the m real users and the N virtual users together form a virtual data set of dimension (m+N) × d. Creating these virtual users increases the number of minority-class samples in the data set, which helps extract their network access features. Notably, the features of the created virtual users should be identically distributed with the data of the real users. For example, access frequency and access duration are memoryless and follow an exponential distribution, so the newly created virtual academically struggling students should follow the same exponential distribution as the real ones.
Specifically, as shown in Fig. 2, to analyze the network access features of academically struggling university students, Step 2 comprises:
A0. Label the real academically struggling students S_i, where i ∈ [1, p] and p is the total number of real academically struggling students;
A1. Compute the mean μ_j of each feature, j ∈ [1, d];
A2. Compute the total number of virtual academically struggling students to be created: N = m − p;
A3. Determine whether p is greater than 1; if so, continue with A4; otherwise, replicate S_1 N times and proceed to Step 3;
A4. Compute the number n of virtual users to be created for each real academically struggling student: n = ⌊N/p⌋, where ⌊·⌋ denotes rounding down;
A5. Add n virtual academically struggling students for each S_i according to the exponential distribution;
Further, as shown in Fig. 3, the specific steps of adding n virtual academically struggling students for each S_i according to the exponential distribution, as described in A5, are as follows:
B0. Determine whether i exceeds the number of real academically struggling students; if so, terminate; otherwise, continue with B1;
B1. Set i = i + 1;
B2. Determine whether the number of virtual academically struggling students created for S_i exceeds n; if not, continue with B3; otherwise, return to B0;
B3. Find the min(p−1, 5) real academically struggling students with the smallest Euclidean distance to S_i;
B4. Randomly select one of these min(p−1, 5) users, and denote its index by k ∈ [1, min(p−1, 5)];
B5. Generate a random number R, with R ~ U(0, 1);
B6. To ensure that all academically struggling students still follow the same exponential distribution after the virtual user S'_i is created, each feature j should satisfy:
[formula (1) missing in source]
Solving formula (1) with the distribution function of the exponential distribution gives the j-th feature S'_i(j) of the newly created virtual user, which can be expressed as:
[formula (2) missing in source]
where j ∈ [1, d];
B7. Combine these features to obtain the feature set of the newly created virtual academically struggling student:
S'_i = [S'_i(1), S'_i(2), ..., S'_i(j), ..., S'_i(d)] (3)
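The formulas for step B6 are not reproduced in this text, so the patent's exact generation rule cannot be shown here. The sketch below illustrates only the overall shape of steps B0 to B7 and substitutes an assumed rule for B6: it interpolates between S_i and the chosen neighbour in the cumulative-probability space of an exponential distribution with mean μ_j, then maps the result back through the inverse distribution function. That rule, the function names, and the use of a seeded `random.Random` are assumptions for illustration, not the patent's formula. The sketch also assumes p > 1 and μ_j > 0 (the p = 1 case is handled by replication in A3).

```python
import math
import random


def exp_cdf(x, mu):
    """Distribution function of an exponential distribution with mean mu."""
    return 1.0 - math.exp(-x / mu)


def exp_quantile(u, mu):
    """Inverse distribution function of an exponential with mean mu."""
    return -mu * math.log(1.0 - u)


def make_virtual_user(minority, i, means, rng):
    """Create one virtual user from real minority user i (steps B3 to B7)."""
    p = len(minority)
    # B3: the min(p - 1, 5) real minority users closest to S_i (Euclidean).
    others = sorted((j for j in range(p) if j != i),
                    key=lambda j: math.dist(minority[i], minority[j]))
    neighbours = others[:min(p - 1, 5)]
    k = rng.choice(neighbours)      # B4: pick one neighbour at random
    r = rng.random()                # B5: R ~ U(0, 1)
    virtual = []
    for j, mu in enumerate(means):
        # B6 (ASSUMED rule, not the patent's formula): interpolate between
        # S_i(j) and S_k(j) in CDF space, then map back with the quantile.
        u_i = exp_cdf(minority[i][j], mu)
        u_k = exp_cdf(minority[k][j], mu)
        virtual.append(exp_quantile(u_i + r * (u_k - u_i), mu))
    return virtual                  # B7: combined feature vector


def generate_virtual_users(minority, means, n, seed=0):
    """Loop B0 to B2: create n virtual users for every real minority user."""
    rng = random.Random(seed)
    return [make_virtual_user(minority, i, means, rng)
            for i in range(len(minority)) for _ in range(n)]
```

With this assumed rule, every generated feature value lies between the corresponding values of S_i and its chosen neighbour, which keeps virtual users inside the range spanned by the real minority users.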
Step 3: perform correlation analysis on the virtual data set and select the access features significantly correlated with academically struggling university students.
In this embodiment, simple correlation analysis is used, the correlation coefficient is the Pearson correlation coefficient, and the significance level is 0.05; the output set of significant features is [Sig_1, Sig_2, ..., Sig_s], where s is the number of significantly correlated features and Sig_s is a significantly correlated feature.
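A minimal sketch of this selection step is shown below. It assumes each user carries a binary label (1 for minority class, 0 otherwise) and correlates every feature column with that label. The text does not say how significance is tested; this example uses the Fisher z-transformation with a normal approximation, which is a swapped-in approximation chosen because it needs only the standard library, and it requires more than three rows. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transformation (approximation)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def significant_features(rows, labels, alpha=0.05):
    """Indices of features whose correlation with the label is significant."""
    n, d = len(rows), len(rows[0])
    selected = []
    for j in range(d):
        column = [row[j] for row in rows]
        if approx_p_value(pearson(column, labels), n) < alpha:
            selected.append(j)
    return selected
```

An exact test based on the t distribution would be the more usual choice when a statistics library is available; the normal approximation is used here only to keep the example self-contained.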
Step 4: perform factor analysis to further reduce the feature dimensionality for academically struggling university students, specifically comprising:
C1. Compute the Bartlett statistic and the KMO statistic;
C2. Determine whether the Bartlett statistic is less than 0.05 and whether the KMO statistic is less than 0.5; if not, continue; otherwise, terminate;
C3. Select the common factors whose eigenvalues are greater than 1; each common factor groups similar network access features of minority-class users; these common factors amount to a classification of the features, and each common factor represents one category of features that affects students' academic performance;
C4. Rotate the factor axes using the varimax method to highlight the network access features of minority-class users, making the resulting common factors easier to interpret and helping to identify the main factors that affect university students' academic performance.
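Step C3 keeps the common factors whose eigenvalues exceed 1. As a small self-contained illustration, the sketch below computes the eigenvalues of a feature correlation matrix with cyclic Jacobi rotations and counts those above 1. The choice of the Jacobi method, the function names, and the example matrix are assumptions made for illustration; the text does not say how the eigenvalues are obtained, and the varimax rotation of C4 is not covered here.

```python
import math


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]  # work on a copy
    size = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(size)
                  for j in range(size) if i != j)
        if off < tol:
            break  # off-diagonal mass is negligible: a is diagonal
        for p in range(size - 1):
            for q in range(p + 1, size):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Angle that zeroes the (p, q) entry after rotation.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(size):          # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(size):          # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(size)), reverse=True)


def count_common_factors(corr):
    """Number of eigenvalues of the correlation matrix greater than 1."""
    return sum(1 for value in symmetric_eigenvalues(corr) if value > 1.0)


if __name__ == "__main__":
    corr = [[1.0, 0.8, 0.1],
            [0.8, 1.0, 0.1],
            [0.1, 0.1, 1.0]]
    print(symmetric_eigenvalues(corr), count_common_factors(corr))
```

In the example matrix the first two features are strongly correlated, so they load on a single dominant factor, and only one eigenvalue exceeds 1.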
In summary, by creating virtual users for academically struggling university students, this embodiment automatically balances the proportion of minority-class users while ensuring that, after the virtual users are added, each feature of all academically struggling students still follows the original exponential distribution, thereby adaptively extracting the features of minority-class users' network access behavior. Although this embodiment takes academically struggling university students as the minority class and analyzes their network access behavior, the method applies equally to other studies of minority-class network access behavior, which are not described further here.
The preferred embodiment of the present invention has been described in detail above. It should be understood that those of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain through logical analysis, reasoning, or limited experimentation on the basis of the prior art and under the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims (5)

1. A method for selecting network access features of minority-class users, characterized by comprising the following steps:
Step 1: download logs from a gateway server, compile the network access data of real users, and form an initial data set; wherein the initial data set has dimension m × d, m is the total number of users, d is the number of features, and the initial data set contains data that follow an exponential distribution;
Step 2: create N virtual minority-class users and construct a virtual data set; wherein the virtual data set has dimension (m+N) × d, and the data of the virtual data set and of the initial data set are identically distributed;
Step 2 specifically comprises:
A0. Label the minority-class users S_i, where i ∈ [1, p] and p is the total number of minority-class users,
A1. Compute the mean μ_j of each feature, j ∈ [1, d],
A2. Compute the total number of virtual minority-class users to be created: N = m − p,
A3. Determine whether p is greater than 1; if so, continue with A4; otherwise, replicate S_1 N times and proceed to Step 3,
A4. Compute the number n of virtual users to be created for each real minority-class user: n = ⌊N/p⌋,
A5. Add the n virtual users for each S_i according to the exponential distribution;
Step 3: perform correlation analysis on the virtual data set and select the access features significantly correlated with minority-class users.
2. The method for selecting network access features of minority-class users according to claim 1, characterized in that the method further comprises Step 4: perform factor analysis on the virtual data set that has undergone correlation analysis, to further reduce the feature dimensionality.
3. The method for selecting network access features of minority-class users according to claim 1, characterized in that the specific steps of adding n virtual users for each S_i according to the exponential distribution, as described in A5, are as follows:
B0. Determine whether i exceeds the number of real minority-class users; if so, terminate; otherwise, continue;
B1. Set i = i + 1;
B2. Determine whether the number of virtual users created for S_i exceeds n; if not, continue with B3; otherwise, return to B0;
B3. Find the min(p−1, 5) real minority-class users with the smallest Euclidean distance to S_i;
B4. Randomly select one of these min(p−1, 5) users and denote it accordingly;
B5. Generate a random number R, with R ~ U(0, 1);
B6. Create a virtual user S'_i, whose j-th feature S'_i(j) is expressed as:
[formula missing in source]
where j ∈ [1, d];
B7. Combine these features to obtain the feature set of the newly created virtual user:
S'_i = [S'_i(1), S'_i(2), ..., S'_i(j), ..., S'_i(d)].
4. The method for selecting network access features of minority-class users according to claim 1, characterized in that, in Step 3, the correlation analysis is simple correlation analysis, the correlation coefficient is the Pearson correlation coefficient, and the significance level is 0.05; the output set of significant features is [Sig_1, Sig_2, ..., Sig_s], where s is the number of significantly correlated features and Sig_s is a significantly correlated feature.
5. The method for selecting network access features of minority-class users according to claim 2, characterized in that Step 4 is carried out as follows:
C1. Compute the Bartlett statistic and the KMO statistic;
C2. Determine whether the Bartlett statistic is less than 0.05 and whether the KMO statistic is less than 0.5; if not, continue with step C3; otherwise, terminate;
C3. Select the common factors whose eigenvalues are greater than 1; each common factor groups several similar network access features of minority-class users;
C4. Rotate the factor axes using the varimax method to highlight the network access features of minority-class users.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610394392.3A CN105956187B (en) 2016-06-03 2016-06-03 Method for selecting network access features of minority-class users

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610394392.3A CN105956187B (en) 2016-06-03 2016-06-03 Method for selecting network access features of minority-class users

Publications (2)

Publication Number Publication Date
CN105956187A CN105956187A (en) 2016-09-21
CN105956187B true CN105956187B (en) 2019-03-15

Family

ID=56907799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610394392.3A Expired - Fee Related CN105956187B (en) 2016-06-03 2016-06-03 Method for selecting network access features of minority-class users

Country Status (1)

Country Link
CN (1) CN105956187B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220318A (en) * 2017-05-18 2017-09-29 重庆大学 A kind of method for determining special student groups online feature
CN107508809B (en) * 2017-08-17 2020-10-23 腾讯科技(深圳)有限公司 Method and device for identifying website type

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411610A (en) * 2011-10-12 2012-04-11 浙江大学 Semi-supervised dimensionality reduction method for high dimensional data clustering
CN104254852A (en) * 2012-03-17 2014-12-31 海智网聚网络技术(北京)有限公司 Method and system for hybrid information query

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9646073B2 (en) * 2011-07-06 2017-05-09 Fred Bergman Healthcare Pty. Ltd. Event detection algorithms

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411610A (en) * 2011-10-12 2012-04-11 浙江大学 Semi-supervised dimensionality reduction method for high dimensional data clustering
CN104254852A (en) * 2012-03-17 2014-12-31 海智网聚网络技术(北京)有限公司 Method and system for hybrid information query

Also Published As

Publication number Publication date
CN105956187A (en) 2016-09-21

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190315