CN114385775B - Sensitive word recognition method based on big data - Google Patents


Info

Publication number
CN114385775B
CN114385775B CN202111636920.9A
Authority
CN
China
Prior art keywords
sensitive
word
words
text
trie
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111636920.9A
Other languages
Chinese (zh)
Other versions
CN114385775A (en)
Inventor
周洁琴
周金明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Inspector Intelligent Technology Co ltd
Original Assignee
Nanjing Inspector Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Inspector Intelligent Technology Co ltd filed Critical Nanjing Inspector Intelligent Technology Co ltd
Priority to CN202111636920.9A priority Critical patent/CN114385775B/en
Publication of CN114385775A publication Critical patent/CN114385775A/en
Application granted granted Critical
Publication of CN114385775B publication Critical patent/CN114385775B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/338 Presentation of query results
    • G06F16/35 Clustering; Classification
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a sensitive word recognition method based on big data, comprising the following steps. Step 1: collect text data using crawler software, mark the text data as sensitive to obtain sensitive text D1 and normal text D2, assign each sensitive word a category and a risk grade, and store the sensitive words in a sensitive word list S. Step 2: perform new word discovery with an N-gram model and amplify the sensitive word list S. Step 3: apply deformation processing to each sensitive word in the sensitive word list S to obtain deformed sensitive words. Step 4: filter the sensitive words in the sensitive word list S based on a Trie and a BERT model. The method improves both the accuracy and the efficiency of sensitive word auditing and recognition.

Description

Sensitive word recognition method based on big data
Technical Field
The invention relates to the field of big data research, in particular to a natural language processing method, and specifically relates to a sensitive word recognition method based on big data.
Background
With the continuous development of the internet, people see a large amount of text information on its various platforms. Some of this information contains sensitive content, such as material with terrorist tendencies, which, if not identified and controlled, can disturb social stability and harm the public interest. Controlling the quality of text information, recognizing and handling sensitive information in time, and ensuring that published content contains no sensitive information helps build a healthy network environment.
In the process of implementing the present invention, the inventors found at least the following problems in the prior art. The traditional technique mainly relies on rule matching: a sensitive word list is constructed, largely by manual collection, which is costly and inefficient; the list is then traversed to match each text message, and any text in which a sensitive word is found must be submitted to an auditor for manual review. This method has three defects. First, as the number of sensitive words and their deformations grows, the word list grows with them, and cyclic matching slows the search. Second, auditors must still judge each hit manually, so the manual workload remains high. Third, the method only detects whether a sensitive word appears, ignoring its context, which easily causes false alarms.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a sensitive word recognition method based on big data, which improves the accuracy and efficiency of sensitive word auditing and recognition. The technical solution is as follows:
the invention provides a sensitive word recognition method based on big data, which mainly comprises the following steps:
Step 1, collect text data using crawler software, mark the text data as sensitive to obtain sensitive text D1 and normal text D2, assign each sensitive word a category and a risk grade, and store the sensitive words in a sensitive word list S.
Step 2, new word discovery is carried out through an N-gram model, and the sensitive word list S is amplified:
Segment the sensitive text D1 from step 1 with an N-gram model, splitting it into pieces of length N to obtain a number of splice words of length N. Count the word frequency of each splice word, calculate its frequency P = count_w / N, and select splice words whose frequency exceeds a set threshold a as candidate words w', where count_w denotes the number of sensitive texts containing the splice word w and N denotes the total number of sensitive texts.
The degree of solidification I(x; y) of each candidate word is calculated as I(x; y) = log(P(x, y) / (P(x)P(y))), where P(x, y) denotes the probability that the word x and the word y co-occur in the candidate word, P(x) the probability that the word x occurs alone, and P(y) the probability that the word y occurs alone.
The degree of freedom H(w') of each candidate word w' is calculated as H(w') = min(H_l(w'), H_r(w')), where the left adjacency entropy is H_l(w') = -Σ_{w'_l ∈ s_l} P(w'_l | w') log P(w'_l | w') and the right adjacency entropy is H_r(w') = -Σ_{w'_r ∈ s_r} P(w'_r | w') log P(w'_r | w'); here s_l is the set of left adjacent words of the candidate word w', s_r is the set of right adjacent words of the candidate word w', P(w'_l | w') is the conditional probability that the left adjacent word w'_l appears given that the candidate word w' appears, and P(w'_r | w') is the conditional probability that the right adjacent word w'_r appears given that the candidate word w' appears.
Candidate words simultaneously satisfying I(x; y) greater than the solidification threshold b and H(w') greater than the freedom threshold c are taken as new words; the sensitivity grade of each new word is set to low-risk, its sensitivity category is that of the sensitive text in which it occurs, and the new words together with their sensitivity grade and category are stored in the sensitive word list S.
Step 3, apply deformation processing to each sensitive word m in the sensitive word list S to obtain deformed sensitive words m'. The deformations include: adding special characters inside the sensitive word, replacing one or more characters of the sensitive word with pinyin, splitting one or more characters of the sensitive word into components, and replacing one or more characters of the sensitive word with traditional Chinese characters.
After deformation processing, the deformed sensitive word m' is stored in the sensitive word list S with the sensitivity category and grade of the original sensitive word m, and the sensitive word list S is stored in a database.
Step 4, filtering the sensitive words in the sensitive word list S based on the Trie and the BERT model;
Generate a sensitive word Trie from the sensitive words, and search the text content to be examined against the Trie in text order to obtain all sensitive words contained in the text.
Put the sensitive text D1 and the normal text D2 together, randomly divide them into a training set and a test set, train a BERT model, and recognize and filter sensitive words in the input detection text by combining the Trie and the BERT model.
Determine whether the input detection text contains sensitive words according to the Trie: generate a sensitive word Trie from the sensitive word library and perform Chinese matching against it.
The matched result is further judged according to the BERT model:
If no sensitive word is contained, the text passes the audit directly;
If sensitive words are contained, the BERT model judges whether the text is sensitive. If the text is judged sensitive and contains a high-risk sensitive word, the text is filtered out directly; if the text is judged sensitive and contains only low-risk sensitive words, the sensitive words in the text are replaced with 'x'; if the text is judged normal, it is submitted for manual auditing.
Preferably, in step 1, the sensitive words are classified and graded as follows: the sensitive words are divided into five categories C1, C2, C3, C4, C5, and into two grades, high-risk sensitive words and low-risk sensitive words.
Preferably, in step 4, Chinese matching is performed against the sensitive word Trie as follows: split the input detection text into single characters using a regular expression; search for the first character of the text starting from the root node; if it is not found, search for the next character from the root node, until a matching character is found; once a matching character is found, continue searching for the next character among the descendants of the matched node, until a leaf node is reached. After the traversal completes, return all matched characters.
Preferably, the method further comprises continuously updating the training set and retraining the BERT model according to the updated training set.
Further, based on the result of the manual auditing in step 4, if the BERT model judged a text normal but manual auditing judged it sensitive, the text is used as training data and the BERT model is retrained.
Compared with the prior art, the technical solution has the following beneficial effects: by amplifying the sensitive word list through new word discovery, new sensitive words can be mined automatically, reducing labor. Screening sensitive words with the Trie and the BERT model effectively improves auditing speed, reduces the number of manual interventions, and improves sensitive word recognition accuracy.
Drawings
Fig. 1 is a schematic diagram of a sensitive word Trie provided in an embodiment of the present disclosure.
Detailed Description
To clarify the technical scheme and working principle of the present invention, the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings. Any combination of the optional solutions above may form an optional embodiment of the present disclosure, which is not described here in detail.
The terms "step 1," "step 2," "step 3," and the like in the description and in the claims and in the foregoing drawings, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those described herein.
The embodiment of the disclosure provides a sensitive word recognition method based on big data, which mainly comprises the following steps:
Step 1, collect text data using crawler software, mark the text data as sensitive to obtain sensitive text D1 and normal text D2, assign each sensitive word a category and a risk grade, and store the sensitive words in a sensitive word list S. Preferably, the sensitive words are divided into five categories C1, C2, C3, C4, C5, and into two grades, high-risk sensitive words and low-risk sensitive words.
Step 2, new word discovery is carried out through an N-gram model, and the sensitive word list S is amplified:
The sensitive text D1 from step 1 is segmented with an N-gram model: the text is split into pieces of length N, yielding a number of splice words of length N. For example, with N = 2, the sentence 有人放高利贷 ("someone lends at high interest") yields the splice words {有人, 人放, 放高, 高利, 利贷}.
Count the word frequency of each splice word, calculate its frequency P = count_w / N, and select splice words whose frequency exceeds a set threshold a as candidate words w', where count_w denotes the number of sensitive texts containing the splice word w and N denotes the total number of sensitive texts.
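The frequency filter above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the threshold a, the gram length n, and the sample texts are illustrative assumptions.

```python
from collections import Counter

def candidate_words(sensitive_texts, n=2, a=0.5):
    """Split each sensitive text into overlapping n-grams ("splice words")
    and keep those whose document frequency P = count_w / N exceeds a."""
    total = len(sensitive_texts)
    count = Counter()
    for text in sensitive_texts:
        # count each splice word at most once per text (document frequency)
        count.update({text[i:i + n] for i in range(len(text) - n + 1)})
    return {w for w, c in count.items() if c / total > a}

texts = ["有人放高利贷", "高利贷害人", "今天天气很好"]
candidates = candidate_words(texts, n=2, a=0.5)
# 高利 and 利贷 occur in 2 of 3 texts (P ≈ 0.67 > 0.5), so they survive
```

Counting presence per document rather than raw occurrences matches the definition of count_w as the number of sensitive texts containing the splice word.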
The degree of solidification I(x; y) of each candidate word is calculated as I(x; y) = log(P(x, y) / (P(x)P(y))), where P(x, y) denotes the probability that the word x and the word y co-occur in the candidate word, P(x) the probability that x occurs alone, and P(y) the probability that y occurs alone. If I(x; y) is greater than the solidification threshold b, the candidate word w' composed of x and y satisfies one of the new-word conditions.
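The degree of solidification is pointwise mutual information; a small numeric sketch (the probabilities below are made up for illustration, not taken from the patent):

```python
import math

def solidification(p_xy, p_x, p_y):
    """Degree of solidification I(x; y) = log(P(x, y) / (P(x) P(y))).
    A high value means the two parts co-occur far more often than chance,
    i.e. the candidate word is internally 'solid'."""
    return math.log(p_xy / (p_x * p_y))

# Hypothetical values: 高 and 利 each occur with probability 0.01,
# but co-occur as 高利 with probability 0.005 → I = log(50) ≈ 3.91
score = solidification(0.005, 0.01, 0.01)
```

When x and y are independent, P(x, y) = P(x)P(y) and the score is 0; larger values indicate stronger cohesion.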
The degree of freedom H(w') of each candidate word w' is calculated as H(w') = min(H_l(w'), H_r(w')), where the left adjacency entropy is H_l(w') = -Σ_{w'_l ∈ s_l} P(w'_l | w') log P(w'_l | w') and the right adjacency entropy is H_r(w') = -Σ_{w'_r ∈ s_r} P(w'_r | w') log P(w'_r | w'); here s_l is the set of left adjacent words of the candidate word w', s_r is the set of right adjacent words of the candidate word w', P(w'_l | w') is the conditional probability that the left adjacent word w'_l appears given that the candidate word w' appears, and P(w'_r | w') is the conditional probability that the right adjacent word w'_r appears given that the candidate word w' appears.
If H(w') is greater than the freedom threshold c, the candidate word has many and evenly distributed left and right neighbors, so it satisfies the new-word condition on freedom. The solidification threshold b and the freedom threshold c are tuned according to the quality of the resulting new words.
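The degree of freedom can be sketched as the entropy of the neighbor distributions. Combining the two sides with min() is a common convention in new-word discovery; the patent text defines the left and right conditional probabilities but does not spell out the combination, so that choice is an assumption here.

```python
import math
from collections import Counter

def adjacency_entropy(neighbors):
    """Entropy of a candidate word's neighbor distribution: many distinct,
    evenly spread neighbors give high entropy, meaning the word combines
    freely with its context."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def degree_of_freedom(left_neighbors, right_neighbors):
    # H(w') taken as the smaller of the left and right adjacency entropies
    return min(adjacency_entropy(left_neighbors),
               adjacency_entropy(right_neighbors))
```

A candidate that always follows the same character (e.g. a suffix fragment) gets entropy 0 on that side and is rejected, however solid it is internally.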
Candidate words simultaneously satisfying I(x; y) greater than the solidification threshold b and H(w') greater than the freedom threshold c are taken as new words; the sensitivity grade of each new word is set to low-risk, its sensitivity category is that of the sensitive text in which it occurs, and the new words together with their sensitivity grade and category are stored in the sensitive word list S.
Step 3, apply deformation processing to each sensitive word m in the sensitive word list S to obtain deformed sensitive words m'. The deformations include: ① adding special characters inside the sensitive word, e.g. 高利贷 ("high interest lending") becomes 高#利#贷; ② replacing one or more characters with pinyin, e.g. 高利贷 yields gaolidai, 高利dai, gao利贷 and the like; ③ splitting one or more characters into their components — in 高利贷 only the character 贷 can be split (into 代 and 贝), so the splitting deformation yields 高利代贝; ④ replacing one or more characters with traditional Chinese characters — since 高 and 利 have the same simplified and traditional forms, only 贷 is modified, giving 高利貸.
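The four deformation rules can be sketched as below. The pinyin, split, and traditional substitution tables are tiny illustrative samples I supply for the example word only, not the patent's actual resources; a real system would use full conversion tables.

```python
def deform(word):
    """Generate deformed variants of a sensitive word (step 3 sketch)."""
    pinyin = {"高": "gao", "贷": "dai"}   # hypothetical pinyin table
    split = {"贷": "代贝"}                # 贷 decomposes into 代 + 贝
    traditional = {"贷": "貸"}            # simplified → traditional form
    out = set()
    out.add("#".join(word))              # ① insert special characters
    for i, ch in enumerate(word):
        if ch in pinyin:                 # ② replace a character with pinyin
            out.add(word[:i] + pinyin[ch] + word[i + 1:])
        if ch in split:                  # ③ split a character into components
            out.add(word[:i] + split[ch] + word[i + 1:])
        if ch in traditional:            # ④ swap in the traditional character
            out.add(word[:i] + traditional[ch] + word[i + 1:])
    return out

variants = deform("高利贷")
# contains 高#利#贷, gao利贷, 高利dai, 高利代贝, 高利貸
```

Each variant inherits the category and grade of the original word when it is stored back into the sensitive word list S.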
After deformation processing, the deformed sensitive word m' is stored in the sensitive word list S with the sensitivity category and grade of the original sensitive word m, and the sensitive word list S is stored in a database.
Step 4, filtering the sensitive words in the sensitive word list S based on the Trie and the BERT model;
Generate a sensitive word Trie from the sensitive words, and search the text content to be examined against the Trie in text order to obtain all sensitive words contained in the text.
The sensitive text D1 and the normal text D2 are put together and randomly divided into a training set and a test set to train the BERT model.
By combining the Trie and the BERT model, sensitive words in the input detection text are recognized and filtered, reducing the misjudgment rate.
Determine whether the input detection text contains sensitive words according to the Trie: generate a sensitive word Trie from the sensitive word library and perform Chinese matching against it.
A sensitive word Trie is generated from the sensitive word library. Taking 高利贷 as an example, see Fig. 1, a schematic diagram of the sensitive word Trie provided in an embodiment of the disclosure.
Split 高利贷 into the three characters 高, 利 and 贷;
Check whether the root node already has a child node for the character 高; if so, add the nodes 利 and 贷 in turn beneath it; if not, first add 高 under the root node.
Chinese matching is performed against the sensitive word Trie, taking 有人放高利贷 as an example: split the text content into single characters using a regular expression and search for the first character 有 from the root node; if it is found, continue with the next character 人; if not, search for the next character from the root node, until a matching character is found. When a matching character such as 高 is found, search for the node of the character 利 beneath the 高 node, and so on. After the traversal completes, all matched characters are returned.
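The Trie construction and matching steps above can be sketched together. This is a minimal dictionary-based sketch under my own representation (a plain dict per node with an "end" marker), not the patent's data layout.

```python
def build_trie(words):
    """Build the sensitive word Trie: each node is a dict of child
    characters; the key "end" marks the last character of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})  # reuse shared prefixes such as 高
        node["end"] = True
    return root

def match(trie, text):
    """Scan the text left to right; at every position walk the Trie as far
    as it goes and record each complete sensitive word that is reached."""
    hits = []
    for i in range(len(text)):
        node = trie
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:
                break
            if node.get("end"):
                hits.append(text[i:j + 1])
    return hits

trie = build_trie(["高利贷"])
found = match(trie, "有人放高利贷")  # → ["高利贷"]
```

Restarting the walk at every position realizes the "search the next character from the root" behavior described above.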
The matched result is further judged according to the BERT model:
If no sensitive word is contained, the text passes the audit directly;
If sensitive words are contained, the BERT model judges whether the text is sensitive. If the text is judged sensitive and contains a high-risk sensitive word, the text is filtered out directly; if the text is judged sensitive and contains only low-risk sensitive words, the sensitive words in the text are replaced with 'x'; if the text is judged normal, it is submitted for manual auditing.
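The step-4 decision logic can be sketched as a small dispatch function. Here a plain substring scan stands in for the Trie lookup and a caller-supplied callable stands in for the BERT classifier; the function and parameter names are my assumptions, not the patent's interfaces.

```python
def audit(text, sensitive_words, levels, is_sensitive):
    """Decide what to do with a text: pass, filter, mask, or manual review.
    `levels` maps each sensitive word to "high" or "low" risk."""
    found = [w for w in sensitive_words if w in text]
    if not found:
        return "pass", text                   # no sensitive words: approve
    if is_sensitive(text):
        if any(levels.get(w) == "high" for w in found):
            return "filtered", None           # high-risk word: drop the text
        for w in found:                       # low-risk words: mask with 'x'
            text = text.replace(w, "x" * len(w))
        return "masked", text
    return "manual_review", text              # model says normal: human audit

words = ["高利贷", "内测"]                     # hypothetical word list
levels = {"高利贷": "high", "内测": "low"}
audit("有人放高利贷", words, levels, lambda t: True)  # → ("filtered", None)
```

Only the disagreement case (Trie hit but classifier says normal) reaches a human, which is how the combination reduces manual workload.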
Preferably, the method further comprises continuously updating the training set and retraining the BERT model according to the updated training set.
Preferably, based on the result of the manual auditing in step 4, if the BERT model judged a text normal but manual auditing judged it sensitive, the text is used as training data and the BERT model is retrained, improving the accuracy of its judgments.
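The feedback loop above can be sketched as follows. The `retrain` callable stands in for a BERT fine-tuning run, and all names are assumptions for illustration.

```python
def feedback_update(train_set, text, model_label, human_label, retrain):
    """When the model called a text normal but manual auditing marked it
    sensitive, add it to the training set and trigger retraining."""
    if model_label == "normal" and human_label == "sensitive":
        train_set.append((text, "sensitive"))
        retrain(train_set)  # retraining is delegated to the caller
    return train_set
```

Texts where model and auditor agree are left out, so only genuinely corrective examples accumulate in the training set.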
While the invention has been described above by way of example with reference to the accompanying drawings, it is to be understood that the invention is not limited to the particular embodiments described; insubstantial modifications of the inventive concept and technical solution, as well as direct applications of them to other occasions without improvement or equivalent replacement, all fall within the protection scope of the invention.

Claims (4)

1. A sensitive word recognition method based on big data, characterized by mainly comprising the following steps:
Step 1, collect text data using crawler software, mark the text data as sensitive to obtain sensitive text D1 and normal text D2, assign each sensitive word a category and a risk grade, and store the sensitive words in a sensitive word list S;
Step 2, new word discovery is carried out through an N-gram model, and the sensitive word list S is amplified:
segment the sensitive text D1 from step 1 with an N-gram model, splitting it into pieces of length N to obtain a number of splice words of length N; count the word frequency of each splice word, calculate its frequency P = count_w / N, and select splice words whose frequency exceeds a set threshold a as candidate words w', where count_w denotes the number of sensitive texts containing the splice word w and N denotes the total number of sensitive texts;
the degree of solidification I(x; y) of each candidate word is calculated as I(x; y) = log(P(x, y) / (P(x)P(y))), where P(x, y) denotes the probability that the word x and the word y co-occur in the candidate word, P(x) the probability that the word x occurs alone, and P(y) the probability that the word y occurs alone;
the degree of freedom H(w') of each candidate word w' is calculated as H(w') = min(H_l(w'), H_r(w')), where the left adjacency entropy is H_l(w') = -Σ_{w'_l ∈ s_l} P(w'_l | w') log P(w'_l | w') and the right adjacency entropy is H_r(w') = -Σ_{w'_r ∈ s_r} P(w'_r | w') log P(w'_r | w'); here s_l is the set of left adjacent words of the candidate word w', s_r is the set of right adjacent words of the candidate word w', P(w'_l | w') is the conditional probability that the left adjacent word w'_l appears given that the candidate word w' appears, and P(w'_r | w') is the conditional probability that the right adjacent word w'_r appears given that the candidate word w' appears;
candidate words simultaneously satisfying I(x; y) greater than the solidification threshold b and H(w') greater than the freedom threshold c are taken as new words; the sensitivity grade of each new word is set to low-risk, its sensitivity category is that of the sensitive text in which it occurs, and the new words together with their sensitivity grade and category are stored in the sensitive word list S;
Step 3, apply deformation processing to each sensitive word m in the sensitive word list S to obtain deformed sensitive words m', the deformations including: adding special characters inside the sensitive word, replacing one or more characters of the sensitive word with pinyin, splitting one or more characters of the sensitive word into components, and replacing one or more characters of the sensitive word with traditional Chinese characters;
after deformation processing, the deformed sensitive word m' is stored in the sensitive word list S with the sensitivity category and grade of the original sensitive word m, and the sensitive word list S is stored in a database;
Step 4, filtering the sensitive words in the sensitive word list S based on the Trie and the BERT model;
generate a sensitive word Trie from the sensitive words, and search the text content to be examined against the Trie in text order to obtain all sensitive words contained in the text;
put the sensitive text D1 and the normal text D2 together, randomly divide them into a training set and a test set, train a BERT model, and recognize and filter sensitive words in the input detection text by combining the Trie and the BERT model;
determine whether the input detection text contains sensitive words according to the Trie: generate a sensitive word Trie from the sensitive word library and perform Chinese matching against it;
the matched result is further judged according to the BERT model:
if no sensitive word is contained, the text passes the audit directly;
if sensitive words are contained, the BERT model judges whether the text is sensitive; if the text is judged sensitive and contains a high-risk sensitive word, the text is filtered out directly; if the text is judged sensitive and contains only low-risk sensitive words, the sensitive words in the text are replaced with 'x'; if the text is judged normal, it is submitted for manual auditing;
the Chinese matching against the sensitive word Trie is performed as follows: split the input detection text into single characters using a regular expression; search for the first character of the text starting from the root node; if it is not found, search for the next character from the root node, until a matching character is found; once a matching character is found, continue searching for the next character among the descendants of the matched node until a leaf node is reached, and after the traversal completes, return all matched characters.
2. The big data-based sensitive word recognition method according to claim 1, wherein in step 1, sensitive words are classified and labeled according to grades, specifically: the sensitive words are divided into five categories of C1, C2, C3, C4 and C5, and the sensitive words are divided into two grades of high-risk sensitive words and low-risk sensitive words.
3. A method of big data based sensitive word recognition according to any of claims 1-2, further comprising continuously updating the training set, and updating the training BERT model based on the training set.
4. The big data based sensitive word recognition method of claim 3, wherein, based on the result of the manual auditing in step 4, if the BERT model judged a text normal but manual auditing judged it sensitive, the text is used as training data and the BERT model is retrained.
CN202111636920.9A 2021-12-29 2021-12-29 Sensitive word recognition method based on big data Active CN114385775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111636920.9A CN114385775B (en) 2021-12-29 2021-12-29 Sensitive word recognition method based on big data


Publications (2)

Publication Number Publication Date
CN114385775A CN114385775A (en) 2022-04-22
CN114385775B true CN114385775B (en) 2024-06-04

Family

ID=81199172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111636920.9A Active CN114385775B (en) 2021-12-29 2021-12-29 Sensitive word recognition method based on big data

Country Status (1)

Country Link
CN (1) CN114385775B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115510500B (en) * 2022-11-18 2023-02-28 北京国科众安科技有限公司 Sensitive analysis method and system for text content
CN116028750B (en) * 2022-12-30 2024-05-07 北京百度网讯科技有限公司 Webpage text auditing method and device, electronic equipment and medium
CN116562297B (en) * 2023-07-07 2023-09-26 北京电子科技学院 Chinese sensitive word deformation identification method and system based on HTRIE tree

Citations (6)

Publication number Priority date Publication date Assignee Title
CN108280130A (en) * 2017-12-22 2018-07-13 中国电子科技集团公司第三十研究所 A method of finding sensitive data in text big data
CN109977416A (en) * 2019-04-03 2019-07-05 中山大学 A kind of multi-level natural language anti-spam text method and system
CN112487149A (en) * 2020-12-10 2021-03-12 浙江诺诺网络科技有限公司 Text auditing method, model, equipment and storage medium
CN113076748A (en) * 2021-04-16 2021-07-06 平安国际智慧城市科技股份有限公司 Method, device and equipment for processing bullet screen sensitive words and storage medium
CN113486654A (en) * 2021-07-28 2021-10-08 焦点科技股份有限公司 Sensitive word bank construction and expansion method based on prior topic clustering
CN113822059A (en) * 2021-09-18 2021-12-21 北京云上曲率科技有限公司 Chinese sensitive text recognition method and device, storage medium and equipment

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9740925B2 (en) * 2012-11-19 2017-08-22 Imds America Inc. Method and system for the spotting of arbitrary words in handwritten documents

Non-Patent Citations (3)

Title
An improved new-word synthesis algorithm based on multi-character mutual information and adjacency entropy; Wang Xin; Modern Computer (Professional Edition); 2018-04-15 (No. 11); pp. 7-11 *
A text classification method for identifying Chinese sensitive web pages; Chen Xin et al.; Measurement & Control Technology; 2011-05-18 (No. 05); pp. 27-31, 40 *
A deformed-sensitive-word filtering algorithm based on an improved Trie tree; Ye Qing; Modern Computer (Professional Edition); 2018-11-25 (No. 33); pp. 3-7 *
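The citations above repeatedly reference Trie-tree-based sensitive word filtering. The following is a minimal illustrative sketch of the general technique (not the patented method itself): sensitive words are inserted into a Trie, and the text is scanned position by position, walking the Trie to report every match. The class and method names here are hypothetical.

```python
class TrieNode:
    """One node of the Trie: a map from next character to child node."""
    def __init__(self):
        self.children = {}
        self.is_end = False  # True if a sensitive word ends at this node


class SensitiveWordFilter:
    """Builds a Trie from a sensitive-word list and scans text against it."""

    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_end = True

    def find_all(self, text):
        """Return every sensitive word occurring in text, in scan order."""
        hits = []
        for i in range(len(text)):
            node = self.root
            j = i
            # Walk the Trie as far as the text allows from position i.
            while j < len(text) and text[j] in node.children:
                node = node.children[text[j]]
                j += 1
                if node.is_end:
                    hits.append(text[i:j])
        return hits
```

This naive scan is O(n·m) for text length n and longest word m; production filters typically use an Aho-Corasick automaton, which adds failure links to the Trie so the scan runs in a single linear pass.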

Also Published As

Publication number Publication date
CN114385775A (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN114385775B (en) Sensitive word recognition method based on big data
CN110516067B (en) Public opinion monitoring method, system and storage medium based on topic detection
CN114610515B (en) Multi-feature log anomaly detection method and system based on log full semantics
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN107229668B (en) Text extraction method based on keyword matching
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN108733748B (en) Cross-border product quality risk fuzzy prediction method based on commodity comment public sentiment
CN112199608B (en) Social media rumor detection method based on network information propagation graph modeling
CN111767725A (en) Data processing method and device based on emotion polarity analysis model
CN111831824A (en) Public opinion positive and negative face classification method
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN109918648B (en) Rumor depth detection method based on dynamic sliding window feature score
CN108959395A (en) A kind of level towards multi-source heterogeneous big data about subtracts combined cleaning method
CN109325125B (en) Social network rumor detection method based on CNN optimization
CN116244446A (en) Social media cognitive threat detection method and system
CN114860882A (en) Fair competition review auxiliary method based on text classification model
CN111190873B (en) Log mode extraction method and system for log training of cloud native system
CN110287493B (en) Risk phrase identification method and device, electronic equipment and storage medium
CN111125443A (en) On-line updating method of test question bank based on automatic duplicate removal
WO2024087754A1 (en) Multi-dimensional comprehensive text identification method
CN111597423A (en) Performance evaluation method and device of interpretable method of text classification model
CN116610758A (en) Information tracing method, system and storage medium
CN113051455B (en) Water affair public opinion identification method based on network text data
CN114860903A (en) Event extraction, classification and fusion method oriented to network security field
CN114443930A (en) News public opinion intelligent monitoring and analyzing method, system and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant