CN113190646A - User name sample labeling method and device, electronic equipment and storage medium


Info

Publication number
CN113190646A
Authority
CN
China
Prior art keywords
sample
user name
samples
cluster
clusters
Prior art date
Legal status
Granted
Application number
CN202010038362.5A
Other languages
Chinese (zh)
Other versions
CN113190646B (en)
Inventor
周亚林
张子琦
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202010038362.5A priority Critical patent/CN113190646B/en
Publication of CN113190646A publication Critical patent/CN113190646A/en
Application granted granted Critical
Publication of CN113190646B publication Critical patent/CN113190646B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification


Abstract

The disclosure relates to a user name sample labeling method, which comprises the following steps: clustering user name samples based on the obtained semantic features of the user name samples to obtain a plurality of sample clusters; screening, from the plurality of sample clusters and according to their respective specified features, the sample clusters that meet a predetermined sample cluster selection condition, wherein the specified feature characterizes whether the user name samples in a sample cluster are of the negative sample type, and the sample cluster selection condition is determined based on statistics of the specified feature over sample clusters formed by user names identified in advance as abnormal; and labeling the user name samples in the screened sample clusters as negative user name samples.

Description

User name sample labeling method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of network security technologies, and in particular, to a method and an apparatus for labeling a user name sample, an electronic device, and a storage medium.
Background
A user name, also called an account name, may consist of Chinese characters, letters, numbers and the like; for example, "Jolmmunnam", "zmlmf" and "12345" can all serve as user names. An abnormal user name is typically a user name generated and registered in large batches by a malicious user using a script; such user names either contain pornographic or reactionary information themselves, or are used to distribute pornographic information, phishing website links, advertisements and the like on a network platform, which adversely affects legitimate users and easily causes network security problems.
In order to prevent the abnormal user name from appearing on the network platform, it is necessary to identify the registered user name, so as to limit the successful registration or use of the abnormal user name.
In the related art, user name samples are usually labeled manually with positive and negative sample types; the labeled samples are then used as training samples to train a user name recognition model that classifies user names, and finally the trained model identifies whether a target user name is abnormal.
When user name samples are labeled purely manually, the accuracy of the labeling result depends heavily on the subjective judgment of the individual annotator; if that judgment is poor, the labels are easily inaccurate, which in turn degrades the accuracy of the trained model's recognition results.
Disclosure of Invention
The present disclosure provides a user name sample labeling method and apparatus, a user name recognition model training method and apparatus, a user name recognition method based on the user name recognition model, an electronic device and a storage medium, so as to at least solve the problem in the related art that the recognition result of the trained model is inaccurate because user name samples are labeled purely manually.
The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a method for annotating a user name sample, including:
clustering the user name samples based on the obtained semantic features of the user name samples to obtain a plurality of sample clusters;
screening, from the plurality of sample clusters and according to their respective specified features, the sample clusters that meet a predetermined sample cluster selection condition; the specified feature of a sample cluster characterizes whether the user name samples in the sample cluster are of the negative sample type; the sample cluster selection condition is determined based on statistics of the specified feature over sample clusters formed by user names identified in advance as abnormal;
and marking the user name sample in the screened sample cluster as a negative user name sample.
In an alternative embodiment, the specified features include: the method for screening the sample clusters meeting the preset sample cluster selection condition from the plurality of sample clusters according to the respective designated characteristics of the plurality of sample clusters comprises the following steps:
calculating the average similarity of semantic features among the user name samples in each sample cluster;
and screening out the sample clusters whose average semantic feature similarity is greater than a semantic similarity threshold, wherein the predetermined sample cluster selection condition includes that the average semantic feature similarity between different user name samples in a sample cluster is greater than the semantic similarity threshold.
In an optional implementation manner, the calculating the average similarity of semantic features between the username samples in each sample cluster includes:
determining semantic center vectors of the user name samples corresponding to the clustering center points of the sample clusters;
and calculating the average distance between the semantic feature vector of each user name sample in each sample cluster and the semantic center vector of the user name sample corresponding to the cluster center point of each sample cluster, so as to obtain the average similarity of the semantic features between the user name samples in the sample clusters.
In an alternative embodiment, the specified features include: the method for screening the user name samples in the sample clusters according to the similarity of the labeled positive and negative sample types of the user name samples in the sample clusters includes the following steps:
calculating the labeled positive and negative sample type similarity between the user name samples in each sample cluster;
and screening out the sample clusters whose labeled positive and negative sample type similarity is smaller than a type similarity threshold, wherein the predetermined sample cluster selection condition includes that the labeled positive and negative sample type similarity of the user name samples in a sample cluster is smaller than the type similarity threshold.
In an optional implementation manner, calculating the similarity of positive and negative sample types of username samples in each sample cluster includes:
determining the number of the user name samples of the positive sample type and the number of the user name samples of the negative sample type in each sample cluster;
and respectively calculating the ratio of the number of the user name samples of the positive sample type to the number of the user name samples of the negative sample type in each sample cluster, and taking the ratio as the similarity of the positive sample type and the negative sample type of the user name samples contained in each sample cluster.
According to a second aspect of the embodiments of the present disclosure, there is provided a method for training a user name recognition model, including:
constructing a training sample set based on the negative user name sample and the rest user name samples in the plurality of sample clusters, wherein the training sample set is used for training a user name recognition model for carrying out classification recognition on user names;
inputting the training sample set into a neural network text classification model, and acquiring a first feature vector from a hidden layer of the neural network text classification model;
inputting the training sample set into a neural network structure model, and acquiring a second feature vector from a hidden layer of the neural network structure model;
and training a user name recognition model for carrying out classification recognition on user names by taking the first feature vector and the second feature vector as training samples.
In an optional implementation manner, if the number of the user name samples in the training sample set is increased, and/or the positive and negative sample type information of the user name samples in the training sample set is changed, removing the output layer of the user name recognition model;
adding a preset number of full connection layers in the user name identification model with the output layer removed;
training the preset number of full connection layers based on the added training samples and/or the user name samples with the changed positive and negative sample type information to obtain the updated user name identification model.
According to a third aspect of the embodiments of the present disclosure, there is provided a user name identification method, including:
acquiring a user name to be identified;
and inputting the user name to be identified into the trained user name identification model for identification so as to obtain an identification result output by the trained user name identification model.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an apparatus for annotating a user name sample, including:
the clustering module is configured to perform clustering on the user name samples based on the acquired semantic features of the user name samples to obtain a plurality of sample clusters;
the screening module is configured to screen sample clusters which meet a preset sample cluster selection condition from the plurality of sample clusters according to the respective specified characteristics of the plurality of sample clusters; the specified characteristics are used for representing whether the username sample in the sample cluster is a negative sample type, and the sample cluster selection condition is determined based on the specified characteristic statistical result of the sample cluster formed by the abnormal usernames which are identified in advance;
and the marking module is configured to mark the user name samples in the screened sample cluster as negative user name samples.
In an alternative embodiment, the specified features include: the semantic feature average similarity between different user name samples in the sample cluster, wherein the screening module comprises:
a semantic similarity calculation unit configured to perform calculation of an average similarity of semantic features between the username samples in each sample cluster;
the first screening unit is configured to perform screening from sample clusters, wherein the average semantic feature similarity of the sample clusters is greater than a semantic similarity threshold, and the predetermined sample cluster selection condition includes that the average semantic feature similarity between different user name samples in the sample clusters is greater than the semantic similarity threshold.
In an optional implementation manner, the semantic similarity calculation unit includes:
a first determining subunit, configured to perform determining a semantic center vector of a username sample corresponding to the cluster center point of each sample cluster;
and the first calculating subunit is configured to calculate an average distance between the semantic feature vector of each username sample in each sample cluster and the semantic center vector of the username sample corresponding to the respective cluster center point of each sample cluster, so as to obtain an average similarity of the semantic features between the username samples in the sample cluster.
In an alternative embodiment, the specified features include: the user name sample in the sample cluster is labeled with positive and negative sample type similarity, wherein the screening module comprises:
the type similarity calculation unit is configured to calculate the labeled positive and negative sample type similarity between the username samples in each sample cluster;
and the second screening unit is configured to perform screening from the sample clusters, wherein the user name samples are labeled with sample clusters with positive and negative sample type similarities smaller than a type similarity threshold, and the predetermined sample cluster selection condition includes that the user name samples in the sample clusters are labeled with positive and negative sample type similarities smaller than the type similarity threshold.
In one embodiment, the type similarity calculation unit includes:
a second determining subunit configured to perform determining the number of user name samples of the positive sample type and the number of user name samples of the negative sample type in each sample cluster;
and the second calculating subunit is configured to perform calculation of a ratio of the number of the user name samples of the positive sample type to the number of the user name samples of the negative sample type in each sample cluster, as a positive and negative sample type similarity of the user name samples included in each sample cluster.
According to a fifth aspect of the embodiments of the present disclosure, there is provided a training apparatus for a user name recognition model, including:
a construction module configured to perform construction of a training sample set based on the negative username sample and the rest of username samples in the plurality of sample clusters, wherein the training sample set is used for training a username identification model for performing classification identification on a username;
a first input module configured to perform inputting the training sample set into a neural network text classification model, and obtaining a first sample feature vector from a hidden layer of the neural network text classification model;
a second input module configured to input the training sample set into a neural network structure model, and obtain a second sample feature vector from a hidden layer of the neural network structure model;
and the training module is configured to train, by taking the first sample feature vector and the second sample feature vector as training samples, a user name recognition model for classifying and recognizing user names.
In an optional implementation manner, the apparatus for training the username recognition model further includes:
a removing module configured to remove an output layer of the username identification model if the number of the username samples in the training sample set increases and/or the positive and negative sample type information of the username samples in the training sample set changes;
the adding module is configured to add a preset number of full connection layers in the user name recognition model after the output layer is removed;
and the processing module is configured to execute training on the preset number of full connection layers based on the added training samples and/or the username samples with changed positive and negative sample type information so as to obtain the updated username identification model.
According to a sixth aspect of the embodiments of the present disclosure, there is provided a user name recognition apparatus based on a user name recognition model, including:
the user name acquisition module is configured to acquire a user name to be identified;
an output module configured to perform recognition by inputting the user name to be recognized into the trained user name recognition model to obtain a recognition result output by the trained user name recognition model.
According to a seventh aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor; and a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the method steps of the method for annotating a sample of a user name according to any of the above-mentioned first aspects, the method steps of the method for training a user name recognition model according to any of the second aspects, or the method steps of the user name recognition method according to any of the third aspects.
According to an eighth aspect of embodiments of the present disclosure, there is provided a storage medium, wherein the instructions of the storage medium, when executed by a processor of a training electronic device for a username identification model, enable the training electronic device for the username identification model to perform the method steps of the annotation method of any of the username samples of the first aspect or the method steps of the training method of any of the username identification models of the second aspect or the method steps of any of the username identification methods of the third aspect.
According to a ninth aspect of embodiments of the present disclosure, there is provided a computer program product which, when run on an electronic device, causes the electronic device to perform: the user name sample labeling method of any one of the first aspect, the user name recognition model training method of any one of the second aspect, or the user name recognition method of any one of the third aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
Because typical abnormal user names are often semantically similar (abnormal user names generated in large batches by a script generally have this characteristic), clustering the user name samples according to their semantic feature vectors gathers user name samples of the same type into the same cluster. Sample clusters meeting the predetermined sample cluster selection condition (that is, clusters gathered from suspected abnormal user names) can then be screened out of the sample clusters according to the specified features of the user name samples they contain, so that the suspected abnormal user names gathered together can be labeled in a concentrated manner. This avoids the problem that scattered suspected abnormal user names cannot be labeled intensively or are labeled incorrectly, keeps labeling errors well under control, and therefore improves the accuracy of the recognition result of the user name recognition model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a flowchart illustrating a method for annotating a sample username according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating a method for annotating a sample username according to an exemplary embodiment.
Fig. 3 is a flowchart illustrating a method for annotating a sample username according to an exemplary embodiment.
Fig. 4 is a flow chart illustrating a method of username identification in accordance with an exemplary embodiment.
Fig. 5 is a block diagram illustrating a user name sample labeling apparatus according to an exemplary embodiment.
Fig. 6 is a block diagram illustrating a username identification apparatus in accordance with an exemplary embodiment.
Fig. 7 is a block diagram illustrating a hardware configuration of an electronic device according to an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Example 1
Fig. 1 is a flowchart illustrating a user name sample labeling method according to an exemplary embodiment; as shown in fig. 1, the accuracy of the recognition result of a user name recognition model can be improved by this method.
The execution subject of the method includes, but is not limited to, a server, a personal computer, a notebook computer, a tablet computer, a smart phone, and other intelligent electronic devices that can execute a predetermined process, such as numerical calculation and/or logical calculation, by running a predetermined program or instruction. Wherein the server may be a single server or a plurality of servers. The method may comprise the steps of:
in step S101, based on the obtained semantic features of each username sample, the username samples are clustered to obtain a plurality of sample clusters.
In one implementation, before performing step S101, obtaining a user name sample may be further included.
The user name sample refers to a user name that can be used as a model training sample. For example, the user names "Jolmmunnam", "zmlmf", "12345" and the like mentioned in the background may be obtained as user name samples. The user name samples may be collected from user names that have actually been registered. In general, a number of user name samples may be obtained.
A user name, also called an account name, is information that can uniquely identify a user's identity. For example, the user name may be an online banking account, a game account, a WeChat account, a forum account, or a mailbox address, among others. Such an account may be the user's mobile phone number or identity card number, or, for convenience of memory and input, a long or short character string (which may include, for example, lower case letters, upper case letters, numbers and special characters) may be used as the user name, for example: AABBx-123, cc1234567, and the like.
Alternatively, the user name samples may be obtained, for example, by using the JSP Standard Tag Library (JSTL).
It should be noted that, in order for the training samples of the user name recognition model to cover as many types of user names as possible, including both abnormal user names and legitimate user names, the user name samples obtained in step S101 may include not only legitimate user names but also abnormal user names generated and registered in large batches by malicious users using scripts, such as "XX-district clothing sales expert cc-1", "XX-district clothing sales expert cc-2" and "XX-district clothing sales expert ss-2", as well as abnormal user names containing pornographic, violent or similar elements.
Before describing a specific implementation manner of step S101, the following description is made of the reason for adopting the technical means of "clustering user name samples based on semantic features of the obtained user name samples" in the present disclosure:
generally, in the process of the training method of the user name identification model, positive and negative type information labeling is performed on each user name sample in advance, so that the common characteristics of the user samples (namely, user name positive samples) labeled with the positive sample type information and the common characteristics of the user samples (namely, user name negative samples) labeled with the positive sample type information can be determined according to the positive sample type information labeled on the user name samples, and further, the user name identification can be performed on the basis of the common characteristics of the user name positive samples and the common characteristics of the user name negative samples in the subsequent process.
In the embodiment of the disclosure, the user name samples are clustered based on the semantic features of the user name samples, on one hand, the characteristic that typical abnormal user names (for example, abnormal user names generated in large batches by using scripts or user names with the semantics of erotic violence) are often similar in semantics is considered, so that the user name samples can be clustered based on the semantic features of the user name samples, and thus, the sample clusters where the abnormal user names are located are pointedly marked out.
On the other hand, the embodiment of the present disclosure considers the labeled positive and negative sample type information: if the positive and negative sample type information of the user name samples is labeled correctly, then after the user name samples are clustered, the same sample cluster should usually contain only positive user name samples or only negative user name samples, so the labeled positive and negative sample type information can be verified on the clustering result obtained from the semantic features of the user name samples.
Optionally, in order to avoid the existence of repeated words or incorrect punctuation marks or the like in the username sample, which may cause inaccuracy of the semantic features determined for the username sample, in an embodiment of the present disclosure, before obtaining the semantic features of each username sample, the method may include: and (4) preprocessing/cleaning the obtained user name samples.
The pre-treatment/washing may include, but is not limited to, the following operations:
removing preset stop words in the user name sample, where the stop words are words with little relevance to the user name (for example, common function words);
removing punctuation marks in the user name sample.
The above method of pre-processing/cleaning the username sample is merely an exemplary illustration and is not intended to limit the present disclosure in any way.
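As a further illustration only, the following Python sketch shows one possible form of such a preprocessing/cleaning step. The stop-token list, the regular expression and the whitespace tokenization are assumptions for demonstration; the disclosure itself only requires that stop words and punctuation with little relevance to the user name be removed.

```python
import re

# Illustrative stop tokens only; the disclosure merely requires removing words
# with little relevance to the user name (e.g. common function words).
STOP_TOKENS = {"the", "a", "de", "le"}

def clean_username(name: str) -> str:
    """Remove punctuation marks and preset stop words from a user name sample."""
    # Drop punctuation and other non-word characters.
    name = re.sub(r"[^\w\s]", " ", name)
    # Drop preset stop tokens (Chinese text would first need word segmentation).
    tokens = [t for t in name.split() if t.lower() not in STOP_TOKENS]
    return " ".join(tokens)

print(clean_username("the clothing-sales expert cc-1!!"))  # -> "clothing sales expert cc 1"
```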
Optionally, after preprocessing each user name sample, a Word2vec tool, the Continuous Bag-of-Words (CBOW) model, or other technical means may be adopted to obtain the semantic feature vector of each user name sample.
For example, when the Word2vec method is adopted, each user name sample can be loaded into the Word2vec module of the gensim library, and the corresponding semantic features output by the Word2vec module for each user name sample are then obtained.
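A minimal sketch of this step with the gensim library is given below. The tokenized samples, the vector size and the window are assumed values; only the general use of the Word2Vec module (with sg=0 selecting the CBOW variant mentioned above) follows the description, and averaging word vectors per sample is one common but assumed way to form a per-sample feature.

```python
import numpy as np
from gensim.models import Word2Vec

# Tokenized user name samples (tokenization itself is platform-specific).
tokenized_samples = [["clothing", "sales", "cc", "1"],
                     ["clothing", "sales", "cc", "2"],
                     ["alice", "1990"]]

# Train a small Word2Vec model; vector_size/window/min_count are illustrative.
w2v = Word2Vec(sentences=tokenized_samples, vector_size=64, window=3,
               min_count=1, sg=0)  # sg=0 -> CBOW

def sample_vector(tokens):
    """Average the word vectors of a sample's tokens to get its semantic feature."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

features = np.stack([sample_vector(t) for t in tokenized_samples])
```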
Alternatively, the present disclosure may also obtain semantic features of each username sample through a recurrent neural network and an attention mechanism. The present disclosure is not limited as to the manner in which the semantic features of each username sample are obtained.
After the semantic features of each username sample are obtained in any manner, the username samples can be clustered based on the obtained semantic features of each username sample to obtain each sample cluster.
For example, the present disclosure may utilize a clustering analysis algorithm, such as a mean shift clustering algorithm, to cluster the username samples based on the obtained semantic features of each username sample to obtain each sample cluster.
Alternatively, k semantic feature vectors may be randomly selected as initial mean vectors; the distances from the remaining semantic feature vectors to each mean vector are calculated, and each user name sample is assigned to the cluster whose mean vector is closest to its semantic feature vector; new mean vectors are then calculated, and the process iterates until the mean vectors are no longer updated or a maximum number of iterations is reached, yielding the sample clusters.
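For illustration, either variant can be realized with scikit-learn; the sketch below uses KMeans, whose mean-vector iteration matches the procedure just described (MeanShift could be substituted for the mean shift algorithm mentioned above). The number of clusters k, the random features and the random_state are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the matrix of semantic feature vectors produced above
# (one 64-dimensional vector per user name sample).
features = np.random.rand(200, 64)

k = 10  # assumed number of clusters
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)

cluster_labels = kmeans.labels_            # cluster index of each user name sample
cluster_centers = kmeans.cluster_centers_  # semantic center vector of each sample cluster
```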
In step S102, a sample cluster satisfying a predetermined sample cluster selection condition is selected from the plurality of sample clusters according to the respective designated characteristics of the plurality of sample clusters.
The specified characteristics of the sample cluster can represent whether the username sample in the sample cluster is a negative sample type.
In one implementation, specifying the features includes: and average similarity of semantic features among different user name samples in the sample cluster.
Sample cluster selection conditions include: and the average similarity of the semantic features among different user name samples in the sample cluster is greater than a semantic similarity threshold value.
By the method, the sample cluster meeting the preset sample cluster selection condition can be selected according to the characteristic that typical abnormal user names are often similar semantically (for example, abnormal user names generated in a large batch by using scripts generally have the characteristic).
In one implementation, specifying the features includes: and the positive and negative sample type similarity of the user name samples marked in the sample cluster.
Sample cluster selection conditions include: the type similarity of the positive and negative samples marked by the user name sample in the sample cluster is smaller than a type similarity threshold value.
By screening the sample clusters through the method, the samples to be labeled can be limited in the range of the screened sample clusters, and then the screened sample clusters can be subjected to key labeling, so that the error caused by labeling is controlled to exist only in a certain range, and the labeling accuracy is improved.
Optionally, in this embodiment of the present disclosure, before the screened sample clusters are output as the result, they may be marked so that they can be distinguished from the other sample clusters. Alternatively, the screened sample clusters and the remaining sample clusters may be output to different storage areas for distinction.
In step S103, the user name samples in the screened sample cluster are labeled as negative user name samples.
As described above, in an actual scenario, the acquired user name samples may either carry no positive/negative sample type labels or already carry such labels.
By adopting the method provided by the embodiment of the disclosure, for the former situation, the positive and negative sample type information of each user name sample can be quickly and accurately marked so as to ensure the accuracy of the positive and negative sample type information of each user name sample in the constructed training sample set; for the latter case, the accuracy of the positive and negative sample type information can be checked and corrected, so as to ensure the accuracy of the positive and negative sample type information of each user name sample in the constructed training sample set.
The present disclosure provides specific embodiments of step S103 for the above two cases, described in detail below.
The first embodiment:
in the embodiment of the present disclosure, after receiving the screened sample cluster, the user name sample in the screened sample cluster may be labeled as a negative user name sample.
Alternatively, for example, each user name sample in the screened sample cluster may be labeled as a negative user name sample, and the user name samples in each sample cluster except the screened sample cluster may be labeled with positive sample type information.
When each user name sample is labeled, negative sample type information labeling can be performed on the user name samples in the screened sample cluster by writing a script, for example. Or, in order to perform the key labeling on each user name sample in the screened sample cluster, the received screened sample cluster may be output, so that the labeling personnel may perform the key labeling. The above two labeling ways are only exemplary illustrations of the embodiments of the disclosure, and do not limit the disclosure in any way.
In this way, the accuracy of the positive and negative sample type information of each user name sample in the training sample set can be ensured, avoiding the inaccurate labeling results that arise in the related art when the positive and negative sample type information of user name samples is labeled only manually, and thereby improving the accuracy of the recognition result of the user name recognition model.
The second embodiment:
in the embodiment of the disclosure, after the screened sample clusters are received, whether the pre-labeled positive and negative sample type information of each user name sample is correct can be determined according to the positive and negative sample type similarity of the user name samples contained in the screened sample clusters; and the situation that the pre-marked positive and negative sample type information is incorrect can be corrected, namely, the mark is updated.
By the method, the user name samples marked with the positive and negative sample type information can be further corrected according to the positive and negative sample type similarity of the user name samples contained in each sample cluster, so that when the positive and negative sample type information of the user name samples are incorrectly marked, correction can be performed by the marking mode disclosed by the invention, and therefore, the accuracy of the marking result of the user name samples can be ensured, and the accuracy of the identification result of the user name identification model can be further improved.
The first solution is applicable to both the user name sample labeled with the positive and negative sample type information and the user name sample not labeled with the positive and negative sample type information. The second solution may be applicable to username samples tagged with positive and negative sample type information.
Because typical abnormal user names are often semantically similar (abnormal user names generated in large batches by a script generally have this characteristic), clustering the user name samples according to their semantic feature vectors gathers user name samples of the same type into the same cluster. Sample clusters meeting the predetermined sample cluster selection condition (that is, clusters gathered from suspected abnormal user names) can then be screened out of the sample clusters according to the specified features of the user name samples they contain, so that the suspected abnormal user names gathered together can be labeled in a concentrated manner. This avoids the problem that scattered suspected abnormal user names cannot be labeled intensively or are labeled incorrectly, keeps labeling errors well under control, and therefore improves the accuracy of the recognition result of the user name recognition model.
Based on the inventive concept of the above embodiment, the embodiment of the present disclosure further provides a method for training a user name recognition model based on a user name sample labeling method, where the method includes:
constructing a training sample set based on the negative user name sample and the rest user name samples in the plurality of sample clusters, wherein the training sample set is used for training a user name recognition model for carrying out classification recognition on the user names;
in one implementation, for example, a training sample set may be constructed based on username samples tagged with positive sample type information and username samples tagged with negative sample type information.
Alternatively, in one implementation, a training sample set may be constructed based on the corrected sample clusters and the remaining sample clusters whose user name sample type information is labeled correctly.
Optionally, when the user name recognition model for classifying and recognizing user names is trained in the embodiment of the present disclosure, the training sample set may be input into a convolution layer of a pre-trained convolutional neural network text classification model, and a convolution operation is performed on the training sample set through the convolution layer to obtain feature information of each user name sample in the training sample set (referred to as first feature information for ease of distinction).
The pre-trained convolutional neural network text classification model is used for obtaining the characteristic information of the training sample set, and the parameters of the model are not updated in the training process.
For example, a training sample set may be input into a convolutional layer of a convolutional neural network text classification model trained in advance, one-dimensional convolutional kernels with the scales of 1, 3 and 5 may be established in the convolutional layer, and convolution operation may be performed on the training sample set to extract first feature information of each user name sample in the training sample set, where the first feature information obtained by convolutional kernels of different sizes has different dimensions.
It should be noted that before the training sample set is input into the convolutional layer of the convolutional neural network text classification model, the output layer of the convolutional neural network text classification model may be removed, so that after the convolutional operation is performed on the training sample set, the comprehensive characteristic information of each user name sample may be obtained from the hidden layer of the convolutional neural network text classification model.
Optionally, in order to make the dimensions of the first feature information corresponding to the convolution kernels of the respective sizes the same, the present disclosure may sequentially perform a Pooling operation on the first feature information under the convolution kernels of different sizes, where the Pooling operation may be, for example, a maximum Pooling operation (Max Pooling) or an Average Pooling operation (Average Pooling), and then extract a maximum value of the first feature information under the convolution kernels of different sizes.
After extracting the maximum values of the first feature information under the convolution kernels with different sizes, splicing the maximum value results to obtain a first feature vector.
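The following PyTorch sketch illustrates this kind of feature extractor: embeddings are convolved with one-dimensional kernels of sizes 1, 3 and 5 (as in the text), max pooling is applied per kernel size, and the pooled results are concatenated into the first feature vector. Vocabulary size, embedding dimension and channel count are assumptions, and the sketch is not the actual pre-trained text classification model of the disclosure.

```python
import torch
import torch.nn as nn

class TextCNNFeatures(nn.Module):
    """Minimal sketch of the convolution + max-pooling feature extractor;
    kernel sizes (1, 3, 5) follow the text, other hyper-parameters are assumed."""
    def __init__(self, vocab_size=5000, emb_dim=64, channels=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, channels, kernel_size=k) for k in (1, 3, 5)])

    def forward(self, token_ids):                     # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)     # (batch, emb_dim, seq_len)
        pooled = []
        for conv in self.convs:
            c = torch.relu(conv(x))                   # (batch, channels, L)
            pooled.append(torch.max(c, dim=2).values) # max pooling over length
        return torch.cat(pooled, dim=1)               # concatenated first feature vector
```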
Optionally, after obtaining the first feature vector, the method may further include inputting the training sample set into the neural network structure model, and obtaining a second feature vector from a hidden layer of the neural network structure model. The parameters of the neural network structure model can be randomly initialized and untrained.
And splicing the obtained first characteristic vector and the second characteristic vector, and then adding one or more full connection layers based on the spliced result of the first characteristic vector and the second characteristic vector to construct a user name identification model for classifying and identifying user names.
After a user name recognition model for carrying out classification recognition on user names is built, a training sample set is input into the user name recognition model for model training.
Optionally, after obtaining the output result from the user name recognition model, the output result may be mapped between (0,1) by using a softmax function, which is used to represent the probability that the user name sample is a positive sample and/or the user name is a negative sample.
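A minimal sketch of such a classification head is given below: the first feature vector (from the pre-trained text classification model) and the second feature vector (from the randomly initialized structure model) are concatenated, passed through fully connected layers, and mapped into (0, 1) with a softmax. All layer dimensions are assumptions.

```python
import torch
import torch.nn as nn

class UsernameClassifier(nn.Module):
    """Sketch of the user name recognition head built on the concatenated features."""
    def __init__(self, dim_first=96, dim_second=64, hidden=128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim_first + dim_second, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2))   # two classes: positive / negative user name

    def forward(self, first_vec, second_vec):
        x = torch.cat([first_vec, second_vec], dim=1)  # splice the two feature vectors
        logits = self.fc(x)
        return torch.softmax(logits, dim=1)            # probabilities in (0, 1)
```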
It should be noted that, in the model training process, a new user name sample or a situation that the positive and negative sample type information of the user name sample in the training sample set is changed may occur, so that a problem that the identification performance of the user name identification model is not matched with the training sample set may occur.
In order to avoid the above problem, the present disclosure further includes updating the user name recognition model, and the following manner may be adopted when updating:
if the number of the user name samples in the training sample set is increased and/or the positive and negative sample type information of the user name samples in the training sample set is changed, removing an output layer of the user name recognition model;
adding a preset number of full connection layers in the user name identification model with the output layer removed;
training a preset number of full connection layers based on the added training samples and/or the user name samples with the changed positive and negative sample type information to obtain an updated user name identification model.
When the user name identification model is updated by the method, the user name identification model does not need to be copied in full quantity every time, and the user name identification model can be trained on the full connection layers in the preset number only based on the added training samples and/or the user name samples with the changed positive and negative sample type information, so that the updating efficiency of the user name identification model can be improved, and the workload can be reduced.
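The sketch below illustrates one possible form of this update step, building on the classifier head from the previous sketch: the old output layer is removed, the retained layers are frozen, and a preset number of new fully connected layers is appended and trained only on the added and/or relabeled samples. The attribute name `old_model.fc` and the layer sizes are assumptions.

```python
import torch.nn as nn

def build_updated_model(old_model: nn.Module, feat_dim: int, num_new_fc: int = 2):
    """Strip the old output layer, freeze what remains, and append new FC layers."""
    # Keep everything except the final (output) layer of the old classifier head.
    backbone = nn.Sequential(*list(old_model.fc.children())[:-1])
    for p in backbone.parameters():
        p.requires_grad = False          # old parameters are not retrained

    new_layers = []
    for _ in range(num_new_fc - 1):
        new_layers += [nn.Linear(feat_dim, feat_dim), nn.ReLU()]
    new_layers.append(nn.Linear(feat_dim, 2))   # new output layer
    return nn.Sequential(backbone, *new_layers)
```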
An alternative way of the method for labeling the user name sample provided in embodiment 1 of the present disclosure is described in detail below by describing embodiment 2.
Example 2
Fig. 2 is a flowchart illustrating a method for labeling user name samples according to an exemplary embodiment, and as shown in fig. 2, in step S102, a sample cluster satisfying a predetermined sample cluster selection condition is selected from a plurality of sample clusters according to respective specified characteristics of the plurality of sample clusters, including the following steps S201 to S202:
the specified characteristics of the sample cluster include: semantic feature average similarity between different user name samples in the sample cluster;
sample cluster selection conditions include: and the average similarity of the semantic features among different user name samples in the sample cluster is greater than a semantic similarity threshold value.
In step S201, calculating an average similarity of semantic features between username samples in each sample cluster;
in an alternative embodiment, calculating the average similarity of semantic features between the username samples in each sample cluster comprises:
and determining semantic center vectors of the user name samples corresponding to the clustering center points of the sample clusters.
Alternatively, assuming that the semantic feature vectors of the samples are clustered into N clusters through the above step S102, each cluster has a corresponding cluster center, i.e., a center point of a sample cluster. It will be appreciated that each cluster center may be represented by a vector of the same dimensions as the semantic feature vector, referred to herein as a semantic center vector.
And calculating the average distance between the semantic feature vector of each username sample in each sample cluster and the semantic center vector of the username sample corresponding to the respective cluster center point of each sample cluster to obtain the average similarity of the semantic features between the username samples in the sample cluster.
In step S202, a sample cluster with the average semantic feature similarity greater than the semantic similarity threshold is selected from the sample clusters.
By analyzing the characteristics of abnormal user names, which are usually registered in batches through scripts, it can be seen that abnormal user name samples are semantically similar (that is, the distances between their semantic feature vectors are small); therefore, the present disclosure selects the sample clusters meeting the predetermined selection condition by selecting the sample clusters whose average distance is smaller than a specified distance threshold.
Alternatively, after the average distance of each sample cluster is obtained, the average distances may be compared with one another, and the several sample clusters with the smallest average distances are selected as the sample clusters meeting the predetermined selection condition.
For example, assuming that the user name samples are divided into 10 sample clusters after being clustered and the average distances of the 10 sample clusters are respectively calculated, pairwise comparison may be performed based on the average distances of the 10 sample clusters, and then 3 sample clusters with smaller average distances are selected as the sample clusters satisfying the predetermined selection condition.
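A compact sketch of steps S201–S202 follows: for every cluster, the average distance between the members' semantic feature vectors and the cluster's semantic center vector is computed, and clusters whose average distance falls below a distance threshold (equivalently, whose average semantic similarity exceeds the similarity threshold) are selected. The Euclidean distance and the threshold value are assumptions.

```python
import numpy as np

def screen_by_semantic_similarity(features, cluster_labels, cluster_centers,
                                  distance_threshold=0.5):
    """Return the indices of clusters whose average member-to-center distance is
    below the threshold; inputs are numpy arrays as produced by the clustering sketch."""
    selected = []
    for c in np.unique(cluster_labels):
        members = features[cluster_labels == c]
        avg_dist = np.linalg.norm(members - cluster_centers[c], axis=1).mean()
        if avg_dist < distance_threshold:
            selected.append(int(c))
    return selected
```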
In this way, the accuracy of the positive and negative sample type information of each user name sample in the training sample set can be ensured, avoiding the inaccurate labeling results that arise in the related art when the positive and negative sample type information of user name samples is labeled only manually, and thereby improving the accuracy of the recognition result of the user name recognition model.
In view of the same inventive concept as that in the foregoing embodiment 1, the present disclosure further provides a flowchart of a username sample labeling method, so as to solve the problem in the related art that the labeling result is inaccurate when the username sample is simply labeled with the positive and negative sample type information manually.
An alternative way of the method for labeling the user name sample provided in embodiment 1 of the present disclosure is described in detail below by describing embodiment 3.
Example 3
Fig. 3 is a flowchart illustrating a method for labeling user name samples according to an exemplary embodiment, and as shown in fig. 3, in step S102, a sample cluster satisfying a predetermined sample cluster selection condition is selected from a plurality of sample clusters according to respective specified characteristics of the plurality of sample clusters, including the following steps S301 and S302:
wherein the specified characteristics of the sample cluster comprise: the positive and negative sample type similarity of the user name samples marked in the sample cluster;
predetermined sample cluster selection conditions, including: the type similarity of the positive and negative samples marked by the user name sample in the sample cluster is smaller than a type similarity threshold value.
In step S301, the labeled positive and negative sample type similarity between the username samples in each sample cluster is calculated.
In an alternative embodiment, the labeled positive and negative sample type similarity between the username samples in each sample cluster can be calculated, for example, as follows:
determining the number of user name samples of a positive sample type and the number of user name samples of a negative sample type in each sample cluster;
and respectively calculating the ratio of the number of the user name samples of the positive sample type to the number of the user name samples of the negative sample type in each sample cluster, and taking the ratio as the similarity of the positive sample type and the negative sample type of the user name samples contained in each sample cluster.
In general, if the positive and negative type information of the user name samples is labeled correctly, then within any single sample cluster obtained, all user name samples should carry the same type information. Conversely, if some type information is labeled incorrectly, the type information of the user name samples within the same cluster will not be completely consistent.
Based on the thought, the method and the device can determine whether the positive and negative sample type information of the user name samples contained in each sample cluster is labeled correctly through the positive and negative sample type similarity of the user name samples contained in each sample cluster, and can correct the user name samples with the wrong labeling on the positive and negative sample type information.
Optionally, in a case where each user name sample in each sample cluster is labeled with positive and negative sample type information, when determining the positive and negative sample type similarity of the user name samples included in each sample cluster, the positive and negative sample type similarity of the user name samples included in each sample cluster may be determined by respectively calculating a ratio of the number of the user name samples of the positive sample type to the number of the user name samples of the negative sample type in each sample cluster.
In step S302, a sample cluster in which the type similarity of the positive and negative samples labeled in the username sample is smaller than the type similarity threshold is screened from the sample cluster, wherein the predetermined sample cluster selection condition includes that the type similarity of the positive and negative samples labeled in the username sample in the sample cluster is smaller than the type similarity threshold.
The closer the ratio of the number of positive-type user name samples to the number of negative-type user name samples in a sample cluster is to 1, the more likely it is that the positive and negative type information of the user name samples in that cluster has been labeled incorrectly. Therefore, the sample clusters whose ratio differs from 1 by less than a preset difference threshold are selected as the sample clusters whose labeled positive and negative sample type similarity is smaller than the type similarity threshold, so that the positive and negative sample type information of the user names in the screened clusters can be corrected later.
The preset difference threshold may be preset according to needs, for example, may be set to 0.1.
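The ratio-based screening of steps S301–S302 can be sketched as follows; representing labels as the strings 'pos'/'neg' and the 0.1 difference threshold are assumptions taken from the example above.

```python
from collections import Counter

def screen_by_label_ratio(cluster_labels, sample_labels, diff_threshold=0.1):
    """Select clusters whose positive/negative count ratio is within diff_threshold
    of 1, i.e. whose labels are mixed roughly evenly and therefore suspect."""
    selected = []
    for c in set(cluster_labels):
        counts = Counter(lbl for cl, lbl in zip(cluster_labels, sample_labels)
                         if cl == c)
        pos, neg = counts.get("pos", 0), counts.get("neg", 0)
        if neg == 0:          # all samples of one type: labels look consistent
            continue
        ratio = pos / neg
        if abs(ratio - 1) < diff_threshold:
            selected.append(c)
    return selected
```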
In an implementation manner, after the step S302 is executed, the method may further include correcting positive and negative sample type information of the username sample in the selected sample cluster to obtain a corrected sample cluster to construct a training sample set.
For example, assuming that a sample cluster A whose ratio differs from 1 by less than the preset difference threshold is screened out, the user name samples in sample cluster A may be output to a storage space of the server, so that an annotator can obtain the cluster from the storage space and correct the positive and negative sample type information of its user name samples.
Alternatively, in order to improve the efficiency of labeling the positive and negative sample type information of the user name samples, the sample clusters whose ratio differs from 1 by less than the preset difference threshold may simply be discarded instead of being corrected.
After the positive and negative sample type information of the user name samples in the selected sample clusters is corrected, a training sample set can be constructed based on the corrected sample clusters and the sample clusters that were not screened out in step S302 (i.e., those whose ratio differs from 1 by no less than the preset difference threshold).
By adopting the method provided by the embodiment of the present disclosure, the accuracy of the positive and negative sample type information of each user name sample in the training sample set can be ensured, avoiding the inaccurate labeling results caused in the related art by labeling the positive and negative sample type information of user name samples manually, and thereby improving the accuracy of the recognition result of the user name recognition model.
Based on the same inventive concept as embodiment 1, the present disclosure further provides a user name recognition method based on the above training method of the user name recognition model, so as to recognize user names.
The user name recognition method provided by the present disclosure is described in detail below through embodiment 4.
Example 4
Fig. 4 is a flowchart illustrating a user name recognition method based on a user name recognition model according to an exemplary embodiment. As shown in fig. 4, the method is applied to a web server and can solve the problem of inaccurate user name recognition results in the related art. The executing subject of the method includes, but is not limited to, a server, a personal computer, a laptop, a tablet computer, a smartphone, or any other intelligent electronic device that can execute a predetermined processing procedure, such as numerical calculation and/or logical calculation, by executing a predetermined program or instruction. The server may be a single network server, a server group consisting of a plurality of network servers, or a cloud based on Cloud Computing and consisting of a large number of computers or network servers. The method may comprise the following steps:
in step S401, a user name to be identified is acquired;
in step S402, the user name to be recognized is input into the trained user name recognition model for recognition, so as to obtain a recognition result output by the trained user name recognition model.
The identification result may be a normal probability value and/or an abnormal probability value of the user name to be identified.
Alternatively, the user name recognition model may be, but is not limited to being, trained by the training method of the user name recognition model shown in fig. 1. For the related description of the training method for the user name recognition model, reference may be made to the content shown in fig. 1; to avoid redundant description, it is not repeated here.
After the user name recognition model has been trained in advance, when a user name to be recognized is obtained, it can be input into the pre-trained user name recognition model, so that the model outputs the probability that the user name belongs to the positive and/or negative sample type.
For example, a normal critical threshold p and an abnormal critical threshold q may be set within [0, 1] according to actual requirements, where p > q. When the positive-class probability output by the user name recognition model is greater than p, the user name is considered normal; when the positive-class probability is smaller than q, the user name is considered illegal and may be subjected to further processing, such as manual review.
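A minimal sketch of this thresholding step is given below. The concrete values of p and q, the function name, and the handling of probabilities falling between q and p are assumptions for illustration only.

```python
def triage_username(prob_normal, p=0.9, q=0.3):
    """Route a user name according to the model's positive-class (normal) probability.

    p and q are illustrative critical thresholds in [0, 1] with p > q.
    """
    if prob_normal > p:
        return "normal"
    if prob_normal < q:
        return "illegal"        # may be forwarded for further processing, e.g. manual review
    return "undetermined"       # between q and p; handling is application-specific
```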
With the embodiments provided by the present disclosure, the accuracy of the positive and negative sample type information of each user name sample in the training sample set can be ensured, inaccurate labeling results caused by purely manual labeling of positive and negative sample type information in the related art are avoided, and the accuracy of user name recognition results is improved.
Based on the same inventive concept as the foregoing embodiment 1, the present disclosure further provides a training apparatus for a user name recognition model, so as to solve the problem in the related art that labeling results are inaccurate when positive and negative sample type information of user name samples is labeled purely manually.
The training apparatus for the user name recognition model provided by the present disclosure is described in detail below through example 5.
Example 5
FIG. 5 is a block diagram illustrating a training apparatus for a user name recognition model according to an exemplary embodiment. Referring to fig. 5, the apparatus includes a clustering module 501, a screening module 502, and a labeling module 503.
A clustering module 501 configured to perform clustering on the username samples based on the obtained semantic features of each username sample to obtain a plurality of sample clusters;
a screening module 502 configured to screen, from the plurality of sample clusters, a sample cluster satisfying a predetermined sample cluster selection condition according to the respective specified features of the plurality of sample clusters, wherein the specified feature is used for characterizing whether the user name samples in a sample cluster are of the negative sample type, and the sample cluster selection condition is determined based on a statistical result of the specified feature over sample clusters formed by previously identified abnormal user names;
and a labeling module 503 configured to label the user name samples in the screened sample cluster as negative user name samples.
In an alternative embodiment, the specified feature includes the average semantic feature similarity between different user name samples in a sample cluster, and the screening module includes:
a semantic similarity calculation unit configured to perform calculation of an average similarity of semantic features between the username samples in each sample cluster;
a first screening unit configured to screen, from the plurality of sample clusters, sample clusters whose average semantic feature similarity is greater than a semantic similarity threshold, wherein the sample cluster selection condition includes that the average semantic feature similarity between different user name samples in a sample cluster is greater than the semantic similarity threshold.
In an alternative embodiment, the semantic similarity calculation unit includes:
a first determining subunit configured to determine the semantic center vector of the user name sample corresponding to the cluster center point of each sample cluster;
and a first calculating subunit configured to calculate the average distance between the semantic feature vector of each user name sample in a sample cluster and the semantic center vector corresponding to the cluster center point of that sample cluster, so as to obtain the average semantic feature similarity between the user name samples in the cluster, as illustrated in the sketch below.
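The following sketch illustrates what the determining and calculating subunits might compute: the mean Euclidean distance from each sample's semantic feature vector to the cluster's semantic center vector, mapped to a similarity score. The distance metric, the mapping 1/(1+d), and the threshold value are assumptions; the disclosure only states that a smaller average distance corresponds to a higher average similarity.

```python
import numpy as np

def cluster_average_similarity(sample_vectors, center_vector):
    """Average semantic similarity of one cluster.

    sample_vectors: (n, dim) semantic feature vectors of the cluster's user names.
    center_vector:  (dim,) semantic center vector, e.g. the cluster centroid.
    """
    distances = np.linalg.norm(sample_vectors - center_vector, axis=1)
    mean_distance = distances.mean()
    return 1.0 / (1.0 + mean_distance)   # smaller mean distance -> higher similarity

def screen_by_semantic_similarity(clusters, centers, sim_threshold=0.8):
    """Keep cluster ids whose average similarity exceeds the semantic similarity threshold."""
    return [cid for cid, vecs in clusters.items()
            if cluster_average_similarity(vecs, centers[cid]) > sim_threshold]
```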
In an alternative embodiment, the specified feature includes the labeled positive and negative sample type similarity of the user name samples in a sample cluster, and the screening module includes:
the type similarity calculation unit is configured to calculate the labeled positive and negative sample type similarity between the username samples in each sample cluster;
and a second screening unit configured to screen, from the plurality of sample clusters, sample clusters in which the labeled positive and negative sample type similarity of the user name samples is smaller than a type similarity threshold, wherein the predetermined sample cluster selection condition includes that the labeled positive and negative sample type similarity of the user name samples in a sample cluster is smaller than the type similarity threshold.
In one embodiment, the type similarity calculation unit includes:
a second determining subunit configured to perform determining the number of user name samples of the positive sample type and the number of user name samples of the negative sample type in each sample cluster;
and a second calculating subunit configured to calculate, for each sample cluster, the ratio of the number of user name samples of the positive sample type to the number of user name samples of the negative sample type, as the positive and negative sample type similarity of the user name samples contained in that sample cluster.
The apparatus provided by the present disclosure relies on the observation that typical abnormal user names are often semantically similar (for example, abnormal user names generated in large batches by scripts generally share this characteristic). By clustering the user name samples according to their semantic feature vectors, user name samples with the same type information can be gathered together. Sample clusters satisfying the predetermined sample cluster selection condition (that is, clusters gathered from suspected abnormal user names) can then be screened out according to the specified features of the user name samples they contain, so that the suspected abnormal user names gathered together can be labeled in a concentrated manner. This avoids the situation in which suspected abnormal user names cannot be labeled collectively, or are labeled incorrectly, because they are scattered across the data, keeps the labeling error well controlled, and therefore improves the accuracy of the recognition results of the user name recognition model.
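For illustration, the clustering step could be realized as in the sketch below, which groups user name samples by their semantic feature vectors using K-Means. The choice of K-Means, the number of clusters, and the assumption that semantic vectors have already been extracted (e.g., by a text encoder) are illustrative; the disclosure does not fix a particular clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_username_samples(semantic_vectors, n_clusters=50, seed=0):
    """Group user name samples by their semantic feature vectors.

    semantic_vectors: (n_samples, dim) array of semantic features, assumed to
    have been extracted beforehand.
    Returns one cluster id per sample and the cluster centers, which can be
    reused as semantic center vectors.
    """
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    assignments = km.fit_predict(np.asarray(semantic_vectors))
    return assignments, km.cluster_centers_
```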
With the same inventive concept as the above embodiment 4, the present disclosure also provides a block diagram of a user name identification apparatus, which is used to identify a user name sample.
The following describes in detail the user name identification apparatus provided by the present disclosure by describing embodiment 6.
Fig. 6 is a block diagram illustrating a user name recognition apparatus based on a user name recognition model according to an exemplary embodiment. Referring to fig. 6, the apparatus includes a user name obtaining module 601 and an output module 602.
A user name obtaining module 601 configured to perform obtaining of a user name to be identified;
an output module 602 configured to input the user name to be recognized into a pre-trained user name recognition model and output a recognition result, wherein the recognition result is a normal probability value and/or an abnormal probability value of the user name to be recognized.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
With the apparatus provided by the present disclosure, the accuracy of the positive and negative sample type information of each user name sample in the training sample set can be ensured, inaccurate labeling results caused by purely manual labeling of positive and negative sample type information in the related art are avoided, and the accuracy of user name recognition results is improved.
Example 7
Fig. 7 is a diagram illustrating a hardware configuration of an electronic device according to an exemplary embodiment. As shown in fig. 7, electronic device 700 includes, but is not limited to: a radio frequency unit 701, a network module 702, an audio output unit 703, an input unit 704, a sensor 705, a display unit 706, a user input unit 707, an interface unit 708, a memory 709, a processor 710, a power supply 711, and the like.
Those skilled in the art will appreciate that the electronic device configuration shown in fig. 7 does not constitute a limitation of the electronic device, and that the electronic device may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. In the embodiments of the present disclosure, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
The processor 710, coupled to the memory, is configured to:
clustering the user name samples based on the obtained semantic feature vectors of the user name samples to obtain a plurality of sample clusters;
screening, from the plurality of sample clusters, a sample cluster satisfying a predetermined sample cluster selection condition according to the respective specified features of the plurality of sample clusters, wherein the specified feature is used for characterizing whether the user name samples in a sample cluster are of the negative sample type, and the sample cluster selection condition is determined based on a statistical result of the specified feature over sample clusters formed by previously identified abnormal user names;
and marking the user name sample in the screened sample cluster as a negative user name sample.
In an alternative embodiment, the specified feature includes the average semantic feature similarity between different user name samples in a sample cluster, and the sample cluster selection condition includes that the average semantic feature similarity between different user name samples in a sample cluster is greater than a semantic similarity threshold.
Optionally, if the specified feature includes the average semantic feature similarity between different user name samples in a sample cluster, screening a sample cluster satisfying the predetermined sample cluster selection condition from the plurality of sample clusters according to the respective specified features of the plurality of sample clusters includes:
calculating the average similarity of semantic features among the user name samples in each sample cluster;
and screening sample clusters whose average semantic feature similarity is greater than the semantic similarity threshold, wherein the sample cluster selection condition includes that the average semantic feature similarity between different user name samples in a sample cluster is greater than the semantic similarity threshold.
Optionally, calculating the average similarity of semantic features between the username samples in each sample cluster includes:
determining semantic center vectors of the user name samples corresponding to the clustering center points of the sample clusters;
and calculating the average distance between the semantic feature vector of each user name sample in a sample cluster and the semantic center vector corresponding to the cluster center point of that sample cluster, so as to obtain the average semantic feature similarity between the user name samples in the cluster.
In an alternative embodiment, the specified feature includes the labeled positive and negative sample type similarity of the user name samples in a sample cluster;
the sample cluster selection condition includes that the labeled positive and negative sample type similarity of the user name samples in a sample cluster is smaller than a type similarity threshold.
Optionally, if the specified feature includes the labeled positive and negative sample type similarity of the user name samples in a sample cluster, screening a sample cluster satisfying the predetermined sample cluster selection condition from the plurality of sample clusters according to the respective specified features of the plurality of sample clusters includes:
calculating the labeled positive and negative sample type similarity between the user name samples in each sample cluster;
and screening sample clusters in which the labeled positive and negative sample type similarity of the user name samples is smaller than the type similarity threshold, wherein the sample cluster selection condition includes that the labeled positive and negative sample type similarity of the user name samples in a sample cluster is smaller than the type similarity threshold.
In an optional implementation manner, calculating the similarity of positive and negative sample types of the username samples in each sample cluster may include:
determining the number of user name samples of a positive sample type and the number of user name samples of a negative sample type in each sample cluster;
and calculating, for each sample cluster, the ratio of the number of user name samples of the positive sample type to the number of user name samples of the negative sample type, as the positive and negative sample type similarity of the user name samples contained in that sample cluster.
Optionally, training a user name recognition model for performing classification recognition on a user name based on a training sample set includes:
constructing a training sample set based on the negative user name sample and the rest user name samples in the plurality of sample clusters, wherein the training sample set is used for training a user name recognition model for carrying out classification recognition on the user names;
inputting a training sample set into a neural network text classification model, and acquiring a first feature vector from a hidden layer of the neural network text classification model;
inputting the training sample set into a neural network structure model, and acquiring a second feature vector from a hidden layer of the neural network structure model;
and training a user name recognition model for carrying out classification recognition on user names by taking the first feature vector and the second feature vector as training samples.
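A hedged sketch of this two-branch training scheme follows. The two hidden-layer feature extractors are represented as pre-computed feature arrays and the final recognizer is a logistic-regression head; the disclosure names a neural network text classification model and a neural network structure model without fixing their architectures or the final classifier, so all concrete choices below are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_username_recognizer(text_hidden_features, struct_hidden_features, labels):
    """Train a recognizer on features taken from two hidden layers.

    text_hidden_features:   (n, d1) first feature vectors from the hidden layer
                            of the text classification model.
    struct_hidden_features: (n, d2) second feature vectors from the hidden layer
                            of the structure model.
    labels:                 1 = positive (normal), 0 = negative (abnormal).
    """
    features = np.hstack([text_hidden_features, struct_hidden_features])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, labels)
    return clf
```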
In one implementation, the processor is further operable to:
if the number of user name samples in the training sample set is increased and/or the positive and negative sample type information of user name samples in the training sample set is changed, removing the output layer of the user name recognition model;
adding a preset number of fully connected layers to the user name recognition model from which the output layer has been removed;
and training the preset number of fully connected layers based on the added training samples and/or the user name samples whose positive and negative sample type information has changed, so as to obtain an updated user name recognition model.
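The incremental update could look like the following PyTorch-style sketch, which strips the old output layer and appends fresh fully connected layers to be trained on the added or re-labelled samples. The assumption that the model is an nn.Sequential, the layer sizes, the number of added layers, and the freezing of retained layers are illustrative choices not specified in the disclosure.

```python
import torch.nn as nn

def rebuild_for_incremental_training(recognizer, hidden_dim, num_new_fc=2, num_classes=2):
    """Drop the old output layer and append fresh fully connected layers.

    Assumes `recognizer` is an nn.Sequential whose last module is the output
    layer and whose penultimate output has dimension `hidden_dim`; real models
    may require model-specific surgery.
    """
    backbone = nn.Sequential(*list(recognizer.children())[:-1])  # remove output layer
    for param in backbone.parameters():
        param.requires_grad = False          # optionally keep the retained layers fixed

    new_layers = []
    for _ in range(num_new_fc - 1):
        new_layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
    new_layers.append(nn.Linear(hidden_dim, num_classes))

    return nn.Sequential(backbone, *new_layers)   # only the new layers are trainable
```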
Alternatively, the processor may be further configured to:
acquiring a user name to be identified;
and inputting the user name to be identified into the trained user name identification model for identification so as to obtain an identification result output by the trained user name identification model.
The memory 709 is used for storing a computer program executable on the processor 710; when executed by the processor 710, the computer program implements the functions performed by the processor 710 described above.
It should be understood that, in the embodiment of the present disclosure, the radio frequency unit 701 may be used for receiving and transmitting signals during message transmission or a call; specifically, it receives downlink data from a base station and then sends the received downlink data to the processor 710 for processing, and it also transmits uplink data to the base station. In general, the radio frequency unit 701 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 701 may also communicate with a network and other devices through a wireless communication system.
The electronic device provides wireless broadband internet access to the user via the network module 702, such as assisting the user in sending and receiving e-mails, browsing web pages, and accessing streaming media.
The audio output unit 703 may convert audio data received by the radio frequency unit 701 or the network module 702 or stored in the memory 709 into an audio signal and output as sound. Also, the audio output unit 703 may also provide audio output related to a specific function performed by the electronic apparatus 700 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 703 includes a speaker, a buzzer, a receiver, and the like.
The input unit 704 is used to receive audio or video signals. The input unit 704 may include a Graphics Processing Unit (GPU) 7041 and a microphone 7042, and the graphics processor 7041 processes image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 706, and the image frames processed by the graphics processor 7041 may be stored in the memory 709 (or other storage medium) or transmitted via the radio frequency unit 701 or the network module 702. The microphone 7042 may receive sounds and process them into audio data; in the case of a phone call mode, the processed audio data may be converted into a format transmittable to a mobile communication base station and output via the radio frequency unit 701.
The electronic device 700 also includes at least one sensor 705, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor that can adjust the brightness of the display panel 7061 according to the brightness of ambient light, and a proximity sensor that can turn off the display panel 7061 and/or a backlight when the electronic device 700 is moved to the ear. As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of an electronic device (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), and vibration identification related functions (such as pedometer, tapping); the sensors 705 may also include fingerprint sensors, pressure sensors, iris sensors, molecular sensors, gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc., which are not described in detail herein.
The display unit 706 is used to display information input by the user or information provided to the user. The Display unit 706 may include a Display panel 7061, and the Display panel 7061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 707 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 707 includes a touch panel 7071 and other input devices 7072. The touch panel 7071, also referred to as a touch screen, may collect touch operations by a user on or near the touch panel 7071 (e.g., operations by a user on or near the touch panel 7071 using a finger, a stylus, or any other suitable object or attachment). The touch panel 7071 may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 710, receives a command from the processor 710, and executes the command. In addition, the touch panel 7071 can be implemented by various types such as resistive, capacitive, infrared, and surface acoustic wave. The user input unit 707 may include other input devices 7072 in addition to the touch panel 7071. In particular, the other input devices 7072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described herein again.
Further, the touch panel 7071 may be overlaid on the display panel 7061, and when the touch panel 7071 detects a touch operation on or near the touch panel 7071, the touch operation is transmitted to the processor 710 to determine the type of the touch event, and then the processor 710 provides a corresponding visual output on the display panel 7061 according to the type of the touch event. Although the touch panel 7071 and the display panel 7061 are shown in fig. 7 as two separate components to implement the input and output functions of the electronic device, in some embodiments, the touch panel 7071 and the display panel 7061 may be integrated to implement the input and output functions of the electronic device, which is not limited herein.
The interface unit 708 is an interface for connecting an external device to the electronic apparatus 700. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 708 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic apparatus 700 or may be used to transmit data between the electronic apparatus 700 and the external device.
The memory 709 may be used to store software programs as well as various data. The memory 709 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 709 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 710 is a control center of the electronic device, connects various parts of the whole electronic device by using various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 709 and calling data stored in the memory 709, thereby monitoring the whole electronic device. Processor 710 may include one or more processing units; preferably, the processor 710 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 710.
The electronic device 700 may also include a power supply 711 (e.g., a battery) for providing power to the various components, and preferably, the power supply 711 may be logically coupled to the processor 710 via a power management system, such that functions of managing charging, discharging, and power consumption may be performed via the power management system.
In addition, the electronic device 700 includes some functional modules that are not shown, and are not described in detail herein.
In an exemplary embodiment, a storage medium including instructions is further provided. A computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements each process of any of the method embodiments described above and can achieve the same technical effects; to avoid repetition, the details are not repeated here. Optionally, the storage medium may be a non-transitory computer-readable storage medium, such as a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for labeling a user name sample is characterized by comprising the following steps:
clustering the user name samples based on the obtained semantic features of the user name samples to obtain a plurality of sample clusters;
according to the respective designated characteristics of the plurality of sample clusters, screening the sample clusters meeting the preset sample cluster selection condition from the plurality of sample clusters, wherein the designated characteristics are used for representing whether the username samples in the sample clusters are negative sample types, and the sample cluster selection condition is determined based on the designated characteristic statistical result of the sample clusters which are identified as abnormal usernames in advance;
and marking the user name sample in the screened sample cluster as a negative user name sample.
2. The labeling method of claim 1, wherein the specified feature comprises the average semantic feature similarity between different user name samples in the sample cluster, and screening the sample clusters meeting the preset sample cluster selection condition from the plurality of sample clusters according to the respective specified features of the plurality of sample clusters comprises the following steps:
calculating the average similarity of semantic features among the user name samples in each sample cluster;
and screening sample clusters, wherein the average semantic feature similarity of the sample clusters is greater than a semantic similarity threshold, and the preset sample cluster selection condition comprises that the average semantic feature similarity between different user name samples in the sample clusters is greater than the semantic similarity threshold.
3. The labeling method of claim 2, wherein calculating the average semantic feature similarity between the user name samples in each sample cluster comprises:
determining semantic center vectors of the user name samples corresponding to the clustering center points of the sample clusters;
and calculating the average distance between the semantic feature vector of each user name sample in each sample cluster and the semantic center vector of the user name sample corresponding to the cluster center point of each sample cluster, so as to obtain the average similarity of the semantic features between the user name samples in the sample clusters.
4. The labeling method of claim 1, wherein the specified feature comprises the labeled positive and negative sample type similarity of the user name samples in the sample cluster, and screening the sample clusters meeting the preset sample cluster selection condition from the plurality of sample clusters according to the respective specified features of the plurality of sample clusters comprises the following steps:
calculating the labeled positive and negative sample type similarity between the user name samples in each sample cluster;
and screening sample clusters, wherein the labeled positive and negative sample type similarity of the user name sample is smaller than a type similarity threshold, and the preset sample cluster selection condition comprises that the labeled positive and negative sample type similarity of the user name sample in the sample clusters is smaller than the type similarity threshold.
5. The labeling method of claim 4, wherein calculating the positive and negative sample type similarity of username samples in each sample cluster comprises:
determining the number of the user name samples of the positive sample type and the number of the user name samples of the negative sample type in each sample cluster;
and respectively calculating the ratio of the number of the user name samples of the positive sample type to the number of the user name samples of the negative sample type in each sample cluster, and taking the ratio as the similarity of the positive sample type and the negative sample type of the user name samples contained in each sample cluster.
6. A training method of a user name recognition model based on the labeling method of the user name sample according to claim 1, wherein the training method comprises:
constructing a training sample set based on the negative user name sample and the rest user name samples in the plurality of sample clusters, wherein the training sample set is used for training a user name recognition model for carrying out classification recognition on user names;
inputting the training sample set into a neural network text classification model, and acquiring a first feature vector from a hidden layer of the neural network text classification model;
inputting the training sample set into a neural network structure model, and acquiring a second feature vector from a hidden layer of the neural network structure model;
and training a user name recognition model for carrying out classification recognition on user names by taking the first feature vector and the second feature vector as training samples.
7. A user name recognition method based on the training method of the user name recognition model according to claim 6, characterized in that the recognition method comprises:
acquiring a user name to be identified;
and inputting the user name to be identified into the trained user name identification model for identification so as to obtain an identification result output by the trained user name identification model.
8. A user name sample labeling apparatus, comprising:
the clustering module is configured to perform clustering on the user name samples based on the acquired semantic features of the user name samples to obtain a plurality of sample clusters;
The screening module is configured to screen a sample cluster which meets a preset sample cluster selection condition from the plurality of sample clusters according to respective specified characteristics of the plurality of sample clusters, wherein the specified characteristics are used for representing whether a username sample in the sample cluster is a negative sample type, and the sample cluster selection condition is determined based on a specified characteristic statistical result of the sample cluster formed by the abnormal user names which are identified in advance;
and the marking module is configured to mark the user name samples in the screened sample cluster as negative user name samples.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of annotating a username sample as defined in any of claims 1 to 5 or the method of training a username recognition model as defined in claim 6 or the method of username recognition as defined in claim 7.
10. A storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method of annotating a username sample as defined in any one of claims 1 to 5 or the method of training a username recognition model as defined in claim 6 or the method of username recognition as defined in claim 7.
CN202010038362.5A 2020-01-14 2020-01-14 User name sample labeling method and device, electronic equipment and storage medium Active CN113190646B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010038362.5A CN113190646B (en) 2020-01-14 2020-01-14 User name sample labeling method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010038362.5A CN113190646B (en) 2020-01-14 2020-01-14 User name sample labeling method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113190646A true CN113190646A (en) 2021-07-30
CN113190646B CN113190646B (en) 2024-05-07

Family

ID=76972683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010038362.5A Active CN113190646B (en) 2020-01-14 2020-01-14 User name sample labeling method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113190646B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170323202A1 (en) * 2016-05-06 2017-11-09 Fujitsu Limited Recognition apparatus based on deep neural network, training apparatus and methods thereof
CN108616491A (en) * 2016-12-13 2018-10-02 北京酷智科技有限公司 A kind of recognition methods of malicious user and system
CN107194430A (en) * 2017-05-27 2017-09-22 北京三快在线科技有限公司 A kind of screening sample method and device, electronic equipment
CN110309297A (en) * 2018-03-16 2019-10-08 腾讯科技(深圳)有限公司 Rubbish text detection method, readable storage medium storing program for executing and computer equipment
CN109284380A (en) * 2018-09-25 2019-01-29 平安科技(深圳)有限公司 Illegal user's recognition methods and device, electronic equipment based on big data analysis

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113447928A (en) * 2021-08-30 2021-09-28 广东电网有限责任公司湛江供电局 False alarm rate reduction target identification method and system based on synthetic aperture radar
CN113988176A (en) * 2021-10-27 2022-01-28 支付宝(杭州)信息技术有限公司 Sample labeling method and device
CN114418752A (en) * 2022-03-28 2022-04-29 北京芯盾时代科技有限公司 Method and device for processing user data without type label, electronic equipment and medium

Also Published As

Publication number Publication date
CN113190646B (en) 2024-05-07

Similar Documents

Publication Publication Date Title
CN107944380B (en) Identity recognition method and device and storage equipment
US10169639B2 (en) Method for fingerprint template update and terminal device
CN111260665B (en) Image segmentation model training method and device
CN109947650B (en) Script step processing method, device and system
CN113190646B (en) User name sample labeling method and device, electronic equipment and storage medium
CN110704661A (en) Image classification method and device
CN109951889B (en) Internet of things network distribution method and mobile terminal
CN111177180A (en) Data query method and device and electronic equipment
CN109726121B (en) Verification code obtaining method and terminal equipment
CN110162653B (en) Image-text sequencing recommendation method and terminal equipment
CN111159338A (en) Malicious text detection method and device, electronic equipment and storage medium
WO2017088434A1 (en) Human face model matrix training method and apparatus, and storage medium
CN109885490B (en) Picture comparison method and device
CN114722937A (en) Abnormal data detection method and device, electronic equipment and storage medium
CN110674294A (en) Similarity determination method and electronic equipment
CN108304369B (en) File type identification method and device
CN115563255A (en) Method and device for processing dialog text, electronic equipment and storage medium
CN114970562A (en) Semantic understanding method, device, medium and equipment
CN111353422B (en) Information extraction method and device and electronic equipment
CN109739998B (en) Information classification method and device
CN116259083A (en) Image quality recognition model determining method and related device
CN117011649B (en) Model training method and related device
CN109168154B (en) User behavior information collection method and device and mobile terminal
CN109886324B (en) Icon identification method and device
CN111610913B (en) Message identification method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant