CN110135566A

CN110135566A - Registration user name detection method based on an LSTM binary classification neural network model

Info

Publication number: CN110135566A
Application number: CN201910425791.5A
Authority: CN
Inventors: 普雪飞
Original assignee: Sichuan Changhong Electric Co Ltd
Current assignee: Sichuan Changhong Electric Co Ltd
Priority date: 2019-05-21
Filing date: 2019-05-21
Publication date: 2019-08-16

Abstract

The invention discloses a registration user name detection method based on an LSTM binary classification neural network model, comprising the following steps: preprocessing training data and test data, where the training data includes normal user name data and randomly generated user name data; encoding the preprocessed user name data and unifying the length of each item; performing character-level serialization on the encoded data; building an LSTM binary classification neural network model from the preprocessed, encoded and serialized training data to form a detection model for registered user names; and inputting the preprocessed, encoded and serialized test data into the detection model, which outputs the probability P that the test data is an abnormal sample. When P is greater than or equal to the abnormal probability threshold, the test data is identified as an abnormal sample; otherwise it is identified as a normal sample. The invention is effective at detecting whether a user name registered on a platform was randomly generated.

Description

Registration user name detection method based on an LSTM binary classification neural network model

Technical field

The present invention relates to the technical fields of web applications and deep learning, and in particular to a registration user name detection method based on an LSTM binary classification neural network model.

Background technique

In recent years, web applications have become increasingly popular. To serve and retain users better, many platforms provide user registration and open it to the public, which brings some problems. On the one hand, open registration allows users with ulterior motives to register malicious accounts in bulk, which may cause network security problems. On the other hand, a platform has a massive number of users of uneven quality, which inevitably affects the efficiency of subsequent operational activities.

Summary of the invention

Randomly or arbitrarily generated user names are a feature shared by bulk malicious registrations and low-quality registered users, and such user names often do not follow the naming patterns of pinyin or English. To solve the problems in the prior art, the purpose of the invention is to detect, among a platform's registered users, suspected malicious users and low-quality users whose user names were randomly generated, thereby providing a reference for identifying bulk registration and low-quality users. To this end, a registration user name detection method based on an LSTM binary classification neural network model is proposed.

To achieve the above purpose, the technical solution adopted by the invention is a registration user name detection method based on an LSTM binary classification neural network model, comprising the following steps:

Step 1: preprocess the training data and test data, where the training data includes normal user name data and randomly generated user name data;

Step 2: encode the preprocessed user name data and unify the length of each item;

Step 3: perform character-level serialization on the encoded data;

Step 4: build an LSTM binary classification neural network model from the preprocessed, encoded and serialized training data to form a detection model for registered user names;

Step 5: input the preprocessed, encoded and serialized test data into the detection model, which outputs the probability P that the test data is an abnormal sample; when P is greater than or equal to the abnormal probability threshold, the test data is identified as an abnormal sample, and otherwise it is identified as a normal sample.

As a preferred embodiment, preprocessing the training data and test data in step 1 specifically includes: removing all non-English-letter characters from the training data and test data, converting uppercase English characters to lowercase, and, if a user name in the data is an email address, removing the email-type suffix.

As another preferred embodiment, the data in step 2 is encoded as follows: the preprocessed data contains only English letters, and the 26 English letters correspond to 26 codes; the length of each item's code sequence is unified, with sequences that are too short padded with zeros and sequences that are too long truncated.

As another preferred embodiment, serializing the encoded data in step 3 specifically includes: performing feature mapping on each letter using word-vector technology, so that each letter corresponds to a fixed-length vector. Specifically, the data is mapped to an embedding matrix through Embedding word embedding; with an output dimension of 32, the code of each letter is mapped to a 32-dimensional vector, and each user name sample becomes a 1*20*32 matrix.

As another preferred embodiment, the LSTM binary classification neural network model built in step 4 is as follows:

The first layer is an embedding layer. Each input sample is a character sequence of length 20; after mapping by the embedding layer, each sample's output is a 20*32 matrix, and n samples are represented as n*20*32;

The second layer is an LSTM layer. Its input is a matrix of dimension n*20*32, where n is the number of user name samples; its output dimension is 64, and it outputs the result of every time step, giving dimension n*20*64;

The third layer is a flatten layer, which reshapes the data to dimension n*1280;

The fourth layer is a fully connected layer with output dimension 64; after this layer the data has dimension n*64, and the activation function of this layer is ReLU;

The fifth layer is the output layer, with output dimension 2 and Softmax as its activation function. The loss function of the model is the cross-entropy loss, and the optimizer is the Adam optimization algorithm.
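For reference, the Softmax output and cross-entropy loss named above take the following standard form for two classes. The notation is assumed here and does not come from the patent: z_0 and z_1 are the two output-layer activations (index 1 taken to be the abnormal class), P is the resulting probability of the abnormal class, and y is the true label (1 for abnormal, 0 for normal).

```latex
P = \frac{e^{z_1}}{e^{z_0} + e^{z_1}}, \qquad
\mathcal{L} = -\bigl[\, y \log P + (1 - y) \log (1 - P) \,\bigr]
```

Under this reading, P is the quantity that is later compared with the abnormal probability threshold.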

As another preferred embodiment, false positives of the detection model on the test data are reduced by raising the abnormal probability threshold.

The beneficial effects of the invention are as follows. The model is trained on normal user name data and randomly generated user name data. As a binary classification model, its accuracy and recall for each class on new data are about 95%, so it can effectively distinguish normal user names from randomly generated ones. Because real-world use requires the proportion of normal samples identified as abnormal (the false positive rate) to be as small as possible, the false positive rate can be controlled by setting the probability threshold for judging a sample abnormal; correspondingly, the probability of judging an abnormal sample to be normal increases.
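The quantities mentioned here can be computed from predicted and true labels as in the sketch below. It assumes that the per-class "accuracy" is read as precision, that label 0 means normal and label 1 means abnormal, and it uses made-up labels purely for illustration; none of these details come from the patent.

```python
from typing import Dict, List

def per_class_metrics(y_true: List[int], y_pred: List[int], label: int) -> Dict[str, float]:
    """Precision and recall of one class of a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}

def false_positive_rate(y_true: List[int], y_pred: List[int]) -> float:
    """Share of normal samples (label 0) that were flagged abnormal (label 1)."""
    preds_for_normal = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not preds_for_normal:
        return 0.0
    return sum(1 for p in preds_for_normal if p == 1) / len(preds_for_normal)

# Made-up labels for illustration only (0 = normal, 1 = abnormal).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

print(per_class_metrics(y_true, y_pred, label=1))  # {'precision': 0.75, 'recall': 0.75}
print(false_positive_rate(y_true, y_pred))         # 0.25
```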

Brief description of the drawings

Fig. 1 is a flow block diagram of the method of the embodiment of the invention;

Fig. 2 is a structure diagram of the LSTM binary classification neural network model in the embodiment of the invention.

Specific embodiment

The embodiment of the invention is described in detail below with reference to the accompanying drawings.

Embodiment:

As shown in Fig. 1, a registration user name detection method based on an LSTM binary classification neural network model includes the following steps:

1. Preprocess the user names in the training data and test data: remove all non-English-letter characters, convert uppercase letters to lowercase, and, if the user name is an email address, remove the email-type suffix. For example, the output for [email protected] after preprocessing is goodname.
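The preprocessing step can be sketched as follows. The function name `preprocess_username`, the address `Goodname123@example.com` and the choice to drop the email suffix before filtering the remaining characters are all assumptions made for illustration; the text does not fix the order of the operations.

```python
import re

def preprocess_username(raw: str) -> str:
    """Normalize a raw user name to lowercase English letters only."""
    # Assumption: the email-type suffix is everything from the first "@" onward,
    # and it is dropped before the other characters are filtered.
    local_part = raw.split("@", 1)[0]
    # Convert uppercase letters to lowercase.
    lowered = local_part.lower()
    # Remove every character that is not an English letter.
    return re.sub(r"[^a-z]", "", lowered)

print(preprocess_username("Goodname123@example.com"))  # goodname
print(preprocess_username("Goodname123"))              # goodname
```

If the suffix were stripped after the character filter instead, the letters of the mail domain would survive in the output, which is why this sketch removes it first.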

2. Encode the letters in the preprocessed user name data and unify the length of each sample. A processed user name contains only English letters, and the 26 letters correspond to 26 codes, for example a:1, b:2, c:3 and so on. Finally, the length of each user name's code sequence is unified to 15: sequences that are too short are padded with zeros, and sequences that are too long are truncated. For example, the user name "Goodname123" becomes the sequence [0,0,0,0,0,0,0,7,15,15,4,14,1,13,5] after encoding and length unification.
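A minimal sketch of this encoding is given below. The left-side zero padding follows the example sequence above; which end of an overlong name is truncated is not stated in the text, so keeping the first `max_len` letters is an assumption, as is the function name `encode_username`.

```python
from typing import List

def encode_username(name: str, max_len: int = 15) -> List[int]:
    """Map letters a..z to codes 1..26 and return exactly max_len codes."""
    codes = [ord(ch) - ord("a") + 1 for ch in name if "a" <= ch <= "z"]
    # Assumption: an overlong name keeps its first max_len letters.
    codes = codes[:max_len]
    # A short name is padded with zeros on the left, as in the example above.
    return [0] * (max_len - len(codes)) + codes

print(encode_username("goodname"))
# [0, 0, 0, 0, 0, 0, 0, 7, 15, 15, 4, 14, 1, 13, 5]
```

Reserving 0 for padding keeps it distinct from the 26 letter codes, so a model with 27 input symbols can tell padding apart from real characters.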

3. Perform feature mapping on each letter using word-vector technology, so that each letter corresponds to a fixed-length vector;

The data is mapped to an embedding matrix through Embedding word embedding. With an output dimension of 32, the code of each letter is mapped to a 32-dimensional vector, and each user name sample becomes a 1*20*32 matrix.
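The embedding step is in essence a table lookup, which the following plain-Python sketch makes explicit. The vocabulary size of 27 (26 letter codes plus 0 for padding), the random initial values and the helper names are assumptions; in a trained model the table entries are learned together with the rest of the network.

```python
import random
from typing import List

VOCAB_SIZE = 27  # assumption: codes 1..26 for letters plus 0 for padding
EMBED_DIM = 32   # output dimension stated in the text

def make_embedding_table(seed: int = 0) -> List[List[float]]:
    """Create a randomly initialised lookup table with 27 rows of 32 values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
            for _ in range(VOCAB_SIZE)]

def embed(codes: List[int], table: List[List[float]]) -> List[List[List[float]]]:
    """Look up one vector per code; the outer list is a batch of size 1."""
    return [[table[c] for c in codes]]

table = make_embedding_table()
codes = [0] * 12 + [7, 15, 15, 4, 14, 1, 13, 5]  # "goodname" padded to length 20
batch = embed(codes, table)
print(len(batch), len(batch[0]), len(batch[0][0]))  # 1 20 32
```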

4. Build the LSTM binary classification neural network model shown in Fig. 2:

The first layer is an embedding layer. Each input sample is a character sequence of length 20; after mapping by the embedding layer, each sample's output is a 20*32 matrix, and n samples are represented as n*20*32;

The second layer is an LSTM layer. Its input is a matrix of dimension n*20*32 (n is the number of user name samples); its output dimension is 64, and it outputs the result of every time step, giving dimension n*20*64;

The third layer is a flatten layer, which reshapes the data to dimension n*1280;

The fourth layer is a fully connected layer with output dimension 64; after this layer the data has dimension n*64, and the activation function is ReLU;

The fifth layer is the output layer, with output dimension 2 and Softmax as its activation function. The loss function of the model is the cross-entropy loss, and the optimizer is the Adam optimization algorithm.
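The layer dimensions can be checked with a small shape trace. This is bookkeeping only, not a trainable model; the function name and parameter names are assumptions, and the defaults use the sequence length 20, embedding size 32 and 64 LSTM units given for the model.

```python
from typing import List, Tuple

def trace_shapes(n: int, seq_len: int = 20, embed_dim: int = 32,
                 lstm_units: int = 64, dense_units: int = 64,
                 classes: int = 2) -> List[Tuple[str, Tuple[int, ...]]]:
    """Return the output shape of each layer for a batch of n user names."""
    return [
        ("embedding", (n, seq_len, embed_dim)),
        ("lstm (every time step)", (n, seq_len, lstm_units)),
        ("flatten", (n, seq_len * lstm_units)),
        ("dense + ReLU", (n, dense_units)),
        ("output + Softmax", (n, classes)),
    ]

for name, shape in trace_shapes(n=8):
    print(f"{name:24s} {shape}")
```

The flatten size 1280 is 20 * 64, which only works out because the LSTM returns its output at every time step. In Keras, for instance, that behaviour corresponds to `LSTM(64, return_sequences=True)`.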

5. Model false-positive correction:

When the detection model judges a normal sample to be abnormal, this is called a false positive; when it judges an abnormal sample to be normal, this is called a miss. In actual use the loss caused by false positives is much larger than that caused by misses, so a mechanism for setting a probability threshold is provided to control false positives. Raising the threshold reduces false positives; correspondingly, misses increase to some extent.
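The trade-off can be illustrated with a short sketch of the decision rule. The scores and labels below are made up for illustration and are not results from the patent; the function names and the default threshold of 0.5 are likewise assumptions.

```python
from typing import List, Tuple

def classify(p_abnormal: float, threshold: float = 0.5) -> str:
    """Apply the rule: abnormal when P >= threshold, otherwise normal."""
    return "abnormal" if p_abnormal >= threshold else "normal"

def count_errors(scored: List[Tuple[float, str]], threshold: float) -> Tuple[int, int]:
    """Return (false positives, misses) for a list of (P, true label) pairs."""
    false_pos = sum(1 for p, y in scored
                    if y == "normal" and classify(p, threshold) == "abnormal")
    misses = sum(1 for p, y in scored
                 if y == "abnormal" and classify(p, threshold) == "normal")
    return false_pos, misses

# Made-up scores for illustration only; they are not results from the patent.
scored = [(0.10, "normal"), (0.55, "normal"), (0.70, "normal"),
          (0.60, "abnormal"), (0.85, "abnormal"), (0.95, "abnormal")]

for t in (0.5, 0.8):
    print(t, count_errors(scored, t))
# 0.5 (2, 0)
# 0.8 (0, 1)
```

On this toy data, raising the threshold from 0.5 to 0.8 removes both false positives at the cost of one miss.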

The embodiment is further described below:

When bulk registration occurs on a platform, users registering maliciously are likely to generate a large number of user names at random; combined with some appropriate rules, the detection result can be used to judge how likely it is that these users are part of a bulk registration.

Some users only try the platform casually and may fill in an arbitrary login name when registering. User names of this kind that are detected can be flagged, and in later user operations the priority of these users can be lowered, reducing operating cost and improving operating efficiency.

Using random user name detection to find user names suspected of being randomly generated can, to some extent, provide a reference for identifying malicious registration, and can filter out a portion of low-quality users based on the naming patterns of their user names.

The method preprocesses user names and serializes them at the character level. It exploits the strength of LSTM in representing sequential data and trains the model on normal user names and randomly generated user names, learning the patterns of character co-occurrence in both kinds of user name and thereby the naming patterns of normal and randomly generated user names. The resulting LSTM binary classification neural network model is effective at detecting whether a user name registered on a platform was randomly generated.

The random user name detection model is a binary classification model. Existing classification algorithms all support binary classification, but an LSTM can handle sequence problems: whether in the pinyin of Chinese characters or in English words, spelling follows specific rules of character combination, and the order in which characters precede and follow one another affects the meaning of the word that is finally formed. An LSTM captures this relationship well and so learns the difference in naming between normal user names and abnormal, randomly generated ones.

The embodiment described above expresses only a specific implementation of the invention. Its description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the invention.

Claims

1. A registration user name detection method based on an LSTM binary classification neural network model, characterized by comprising the following steps:

Step 1: preprocess the training data and test data, where the training data includes normal user name data and randomly generated user name data;

Step 2: encode the preprocessed user name data and unify the length of each item;

Step 3: perform character-level serialization on the encoded data;

Step 4: build an LSTM binary classification neural network model from the preprocessed, encoded and serialized training data to form a detection model for registered user names;

Step 5: input the preprocessed, encoded and serialized test data into the detection model, which outputs the probability P that the test data is an abnormal sample; when P is greater than or equal to the abnormal probability threshold, the test data is identified as an abnormal sample, and otherwise it is identified as a normal sample.

2. The registration user name detection method based on an LSTM binary classification neural network model according to claim 1, characterized in that preprocessing the training data and test data in step 1 specifically includes: removing all non-English-letter characters from the training data and test data, converting uppercase English characters to lowercase, and, if a user name in the data is an email address, removing the email-type suffix.

3. The registration user name detection method based on an LSTM binary classification neural network model according to claim 2, characterized in that the data in step 2 is encoded as follows: the preprocessed data contains only English letters, and the 26 English letters correspond to 26 codes; the length of each item's code sequence is unified, with sequences that are too short padded with zeros and sequences that are too long truncated.

4. The registration user name detection method based on an LSTM binary classification neural network model according to claim 3, characterized in that serializing the encoded data in step 3 specifically includes: performing feature mapping on each letter using word-vector technology, so that each letter corresponds to a fixed-length vector; specifically, the data is mapped to an embedding matrix through Embedding word embedding, and with an output dimension of 32, the code of each letter is mapped to a 32-dimensional vector and each user name sample becomes a 1*20*32 matrix.

5. The registration user name detection method based on an LSTM binary classification neural network model according to claim 4, characterized in that the LSTM binary classification neural network model built in step 4 is as follows:

The first layer is an embedding layer. Each input sample is a character sequence of length 20; after mapping by the embedding layer, each sample's output is a 20*32 matrix, and n samples are represented as n*20*32;

The second layer is an LSTM layer. Its input is a matrix of dimension n*20*32, where n is the number of user name samples; its output dimension is 64, and it outputs the result of every time step, giving dimension n*20*64;

The third layer is a flatten layer, which reshapes the data to dimension n*1280;

The fourth layer is a fully connected layer with output dimension 64; after this layer the data has dimension n*64, and the activation function of this layer is ReLU;

The fifth layer is the output layer, with output dimension 2 and Softmax as its activation function. The loss function of the model is the cross-entropy loss, and the optimizer is the Adam optimization algorithm.

6. The registration user name detection method based on an LSTM binary classification neural network model according to claim 1 or 5, characterized in that false positives of the detection model on the test data are reduced by raising the abnormal probability threshold.