CN117077678B - Sensitive word recognition method, device, equipment and medium - Google Patents

Sensitive word recognition method, device, equipment and medium

Info

Publication number
CN117077678B
Authority
CN
China
Prior art keywords
sensitive word
sensitive
detected
word
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311322762.9A
Other languages
Chinese (zh)
Other versions
CN117077678A (en
Inventor
熊浩
万青玲
刘波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Shenyue Software Technology Co ltd
Original Assignee
Hebei Shenyue Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei Shenyue Software Technology Co ltd
Priority to CN202311322762.9A
Publication of CN117077678A
Application granted
Publication of CN117077678B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)

Abstract

The present disclosure relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a medium for identifying a sensitive word. The method comprises: obtaining text data to be detected; determining a word vector sequence corresponding to the text data to be detected, wherein the word vector sequence can represent context information; performing an expansion (dilated) convolution operation on the word vector sequence by using an expansion convolution layer to obtain a first feature vector; performing random inactivation (dropout) and data transformation on the first feature vector by using a feature extraction layer to obtain a second feature vector; performing sensitive word multi-classification analysis by using a full-connection layer based on the second feature vector to obtain a plurality of sensitive word categories corresponding to the text data to be detected; determining the position corresponding to the sensitive word and the corresponding sensitive word category by using a CRF layer according to the second feature vector; and prompting the category of the sensitive word and replacing the sensitive word with a preset symbol according to the position corresponding to the sensitive word. The method improves the recognition effect and efficiency for sensitive words.

Description

Sensitive word recognition method, device, equipment and medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a medium for identifying a sensitive word.
Background
Most of the information in the internet platform is presented in text form, for example, people can watch videos or web pages through the internet, comment through an evaluation system, or talk through the internet, etc. When a user inputs text content through a dialogue system or an evaluation system, there may be incorrect comments or content, and some sensitive words inevitably appear, so that in order to build a good internet use environment, supervision on the text content is particularly important.
Text content is usually supervised by using a sensitive word library that contains sensitive words of various types, with which the type of a sensitive word is recognized and its position is detected. With the development of artificial intelligence, some sensitive word detection has been combined with big data technology, for example by discovering new words through an N-gram model to expand the sensitive word library. However, building and maintaining a sensitive word library consumes manpower and time, some sensitive word variants are difficult to detect, and because sensitive words can have multiple meanings, content in a normal context is easily misjudged. Therefore, the conventional sensitive word recognition method consumes a lot of manpower and has a poor recognition effect.
Disclosure of Invention
The application aims to provide a sensitive word recognition method, device, equipment and medium, which can improve the detection effect and efficiency of sensitive words.
The first object of the present application is achieved by the following technical solutions:
in a first aspect, a method for identifying a sensitive word is provided, including:
acquiring text data to be detected;
performing sensitive word recognition on the text data to be detected by using a multi-category sensitive word recognition model to obtain a sensitive word category corresponding to the text data to be detected and a position corresponding to the sensitive word;
prompting the category of the sensitive word, and replacing the sensitive word with a preset symbol according to the position corresponding to the sensitive word;
wherein the multi-category sensitive word recognition model comprises: the expansion convolution layer, the feature extraction layer, the full connection layer and the CRF layer, wherein the method for carrying out sensitive word recognition on the text data to be detected by utilizing the multi-category sensitive word recognition model comprises the following steps:
determining a word vector sequence corresponding to the text data to be detected, wherein the word vector sequence can represent context information;
performing expansion convolution operation by utilizing the expansion convolution layer according to the word vector sequence to obtain a first feature vector;
according to the first feature vector, the feature extraction layer is utilized to perform random inactivation and data transformation to obtain a second feature vector;
performing sensitive word multi-classification analysis by using the full-connection layer based on the second feature vector to obtain a plurality of sensitive word categories corresponding to the text data to be detected;
and determining the corresponding position of the sensitive word and the corresponding sensitive word category by utilizing the CRF layer according to the second feature vector.
Through the above technical scheme, an IDCNN-CRF model is adopted to perform sensitive word category recognition and sensitive word position detection. The text data to be detected is first converted into a word vector sequence capable of representing context information, and the expansion convolution layer then extracts features from the word vector sequence. The first feature vector obtained in this way undergoes random inactivation and data transformation in the feature extraction layer. The full connection layer then performs sensitive word multi-classification on the resulting second feature vector, the CRF layer locates the sensitive words based on the second feature vector, and the sensitive words are desensitized according to their corresponding positions. Because the expanded receptive field lets the multi-category sensitive word recognition model better understand the semantics of the context, sensitive words and their variants can be captured rapidly and accurately, so that the accuracy and efficiency of sensitive word recognition are improved compared with the related-art approach based on a sensitive word library.
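The final desensitization step described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the CRF layer emits one BIO-style tag per character (for example `B-privacy`, `I-privacy`, `O`); the tagging scheme, the function names `tags_to_spans` and `mask_spans`, and the `*` symbol are hypothetical and are not specified by the present application.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, category)


def tags_to_spans(tags: List[str]) -> List[Span]:
    """Convert per-character BIO tags (an assumed output format) into spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    category: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and category == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, category))
            start, category = None, None
        if tag.startswith("B-"):
            start, category = i, tag[2:]
    return spans


def mask_spans(text: str, spans: List[Span], symbol: str = "*") -> str:
    """Replace every character inside a span with the preset symbol."""
    chars = list(text)
    for start, end, _category in spans:
        for i in range(start, end):
            chars[i] = symbol
    return "".join(chars)


text = "my pin is 1234"
tags = ["O"] * 10 + ["B-privacy"] + ["I-privacy"] * 3
spans = tags_to_spans(tags)
masked = mask_spans(text, spans)
```

Here `spans` carries both the position and the category of each sensitive word, so the same structure can drive the category prompt and the replacement.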
In one possible implementation, the multi-class sensitive word recognition model training process includes:
training an initial multi-category sensitive word recognition model based on a training set to obtain a trained multi-category sensitive word recognition model, wherein the training set comprises: the system comprises a plurality of sensitive word training samples and labels corresponding to the sensitive word training samples, wherein the labels comprise multi-category labels and sensitive word position labels;
based on the verification set, determining a loss value of the trained multi-class sensitive word recognition model, adjusting model parameters of the trained multi-class sensitive word recognition model according to the loss value and a preset loss value, and performing iterative training to obtain the multi-class sensitive word recognition model.
According to the technical scheme, model training is carried out based on the training set, then the loss value of the trained multi-class sensitive word recognition model is determined based on the verification set, and then parameter correction of the model is carried out according to the loss value, so that the multi-class sensitive word recognition model is obtained.
In one possible implementation, the method further includes:
in the iterative training process, when the variation amplitude of the loss value of the continuous preset round training is smaller than a preset amplitude threshold, updating the training set and the verification set according to the optimized sample, and continuing to perform iterative training according to the updated training set and verification set.
Through the technical scheme, when the variation amplitude of the loss value of the continuous preset round training is smaller than the preset amplitude threshold, the training set and the verification set are updated according to the optimized sample, so that the optimization efficiency and effect on the model can be improved, and the overfitting is avoided.
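The plateau condition above can be sketched as a small check on the recorded loss values. This is only one possible reading: it treats the "variation amplitude" as the absolute difference between consecutive rounds, and the function name `loss_plateaued` and its parameters are hypothetical.

```python
from typing import Sequence


def loss_plateaued(losses: Sequence[float], rounds: int,
                   amplitude_threshold: float) -> bool:
    """Return True if each of the last `rounds` round-to-round loss changes
    is smaller than the threshold (assumed meaning of 'variation amplitude')."""
    if len(losses) < rounds + 1:
        return False  # not enough history yet
    recent = losses[-(rounds + 1):]
    return all(abs(b - a) < amplitude_threshold
               for a, b in zip(recent, recent[1:]))


history = [1.0, 0.5, 0.49, 0.489, 0.4885]
should_refresh_data = loss_plateaued(history, rounds=3, amplitude_threshold=0.02)
```

When the check returns True, a training loop following the scheme above would update the training set and verification set with the optimized samples before continuing.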
In one possible implementation manner, before the performing the sensitive word recognition on the text data to be detected by using the multi-category sensitive word recognition model to obtain the sensitive word category corresponding to the text data to be detected and the position corresponding to the sensitive word, the method further includes:
determining a sensitive word recognition mode based on a trigger condition; the trigger condition comprises at least one of the following: the text source of the text data to be detected is a network source; whether the recognition accuracy is greater than a preset accuracy threshold; a specified recognition mode is detected in a user-triggered instruction;
when the sensitive word recognition mode comprises a semantic analysis mode, performing sensitive word recognition on the text data to be detected by utilizing a multi-category sensitive word recognition model to obtain a sensitive word category corresponding to the text data to be detected and a position corresponding to the sensitive word;
when the sensitive word recognition mode comprises a matching mode, determining a sensitive word category corresponding to the text data to be detected and a position corresponding to the sensitive word by using a sensitive word dictionary tree.
Through the technical scheme, at least two sensitive word recognition modes can be supported, the currently adopted recognition mode is automatically determined based on the triggering condition, and various requirements of users can be met.
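For the matching mode, a sensitive word dictionary tree (trie) might be sketched as follows. The class name `SensitiveWordTrie`, the `"$end"` marker, the longest-match policy and the example words and categories are all assumptions made for illustration; the present application does not specify them.

```python
from typing import Dict, List, Tuple

Hit = Tuple[int, int, str]  # (start, end_exclusive, category)


class SensitiveWordTrie:
    """Character-level trie; each terminal node stores the word's category."""

    END = "$end"  # hypothetical marker; never collides with a single character

    def __init__(self) -> None:
        self.root: Dict = {}

    def add(self, word: str, category: str) -> None:
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self.END] = category

    def find(self, text: str) -> List[Hit]:
        """Scan left to right, keeping the longest match at each position."""
        hits: List[Hit] = []
        i = 0
        while i < len(text):
            node, j, best = self.root, i, None
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if self.END in node:
                    best = (i, j, node[self.END])
            if best is not None:
                hits.append(best)
                i = best[1]  # continue after the matched word
            else:
                i += 1
        return hits


trie = SensitiveWordTrie()
trie.add("spam", "ads")
trie.add("spammer", "abuse")
hits = trie.find("a spammer sent spam")
```

Each hit has the same (position, category) shape as the output of the semantic analysis mode, which makes it straightforward to apply the same replacement step to either mode.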
In one possible implementation, when the sensitive word recognition mode includes both the matching mode and the semantic analysis mode;
after replacing the sensitive word with the preset symbol according to the position corresponding to the sensitive word, the method further comprises:
determining the supplementary sensitive words of the text data to be detected and the positions corresponding to the supplementary sensitive words according to the matching mode;
and replacing the supplementary sensitive words with preset symbols according to the positions corresponding to the supplementary sensitive words.
Through the technical scheme, after the matching mode and the semantic analysis mode are selected at the same time, the matching mode can be supplemented or replaced by the semantic analysis mode, so that the reliability of sensitive word recognition is improved.
In one possible implementation manner, the determining, by using a sensitive word dictionary tree, a sensitive word category corresponding to the text data to be detected and a position corresponding to the sensitive word includes:
identifying a plurality of sensitive words of the text data to be detected by using a sensitive word dictionary tree;
determining whether paired sensitive words matched with a preset associated word set exist from a plurality of sensitive words;
and screening the plurality of sensitive words according to the paired sensitive words to obtain final effective sensitive words and sensitive word positions.
Through the above technical scheme, paired sensitive words are configured, and when sensitive words are recognized in the matching mode, the final effective sensitive words and their positions are determined only when paired sensitive words are hit, so that the accuracy of sensitive word recognition is improved and missed detection of sensitive words is reduced.
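One possible reading of the paired screening above is sketched below: a matched word is kept only if it belongs to a preset associated pair whose two words both occur in the text. The data layout (a set of `frozenset` pairs) and the name `screen_by_pairs` are hypothetical, and other interpretations of the screening rule are possible.

```python
from typing import FrozenSet, List, Set, Tuple

WordHit = Tuple[int, int, str]  # (start, end_exclusive, matched word)


def screen_by_pairs(hits: List[WordHit],
                    associated_pairs: Set[FrozenSet[str]]) -> List[WordHit]:
    """Keep only hits whose word is part of a pair that is fully present."""
    found = {word for _start, _end, word in hits}
    effective: Set[str] = set()
    for pair in associated_pairs:
        if pair <= found:  # every word of the pair was matched in the text
            effective |= pair
    return [hit for hit in hits if hit[2] in effective]


raw_hits = [(0, 4, "wire"), (10, 15, "money"), (20, 24, "free")]
pairs = {frozenset({"wire", "money"}), frozenset({"free", "prize"})}
effective_hits = screen_by_pairs(raw_hits, pairs)
```

In this example "free" is discarded because its partner "prize" does not appear, while "wire" and "money" are kept together with their positions.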
In one possible implementation manner, after the obtaining the text data to be detected, the method further includes:
judging whether the sentence number of the text data to be detected exceeds a first preset number threshold value;
if yes, slicing the text data to be detected according to the relevance among sentences and a second preset quantity threshold value to obtain a plurality of text slices;
the step of performing sensitive word recognition on the text data to be detected by using a multi-category sensitive word recognition model to obtain a sensitive word category corresponding to the text data to be detected and a position corresponding to the sensitive word, includes:
and carrying out sensitive word recognition on the plurality of text slices in sequence by utilizing a multi-category sensitive word recognition model to obtain a sensitive word category corresponding to the text data to be detected and a position corresponding to the sensitive word.
According to the above technical scheme, recognition accuracy may be affected when the number of sentences in the text data to be detected exceeds the first preset number threshold. The text data to be detected is therefore sliced according to the second preset number threshold and the relevance between adjacent sentences, so that both the semantics of the sentences and the amount of data are taken into account and the accuracy of sensitive word recognition is improved.
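The slicing step can be sketched as a greedy grouping of sentences. The sketch assumes that the second threshold caps the number of sentences per slice and uses word-level Jaccard overlap as a stand-in for "relevance"; the `min_relevance` cut-off, the sentence splitter and all function names are hypothetical and not part of the present application.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter on Western and CJK sentence-ending punctuation."""
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]*", text)
    return [p.strip() for p in parts if p.strip()]


def relevance(a: str, b: str) -> float:
    """Word-level Jaccard overlap, a toy stand-in for a real relevance score."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def slice_text(text: str, first_threshold: int, second_threshold: int,
               min_relevance: float = 0.1) -> List[str]:
    """Slice only when the sentence count exceeds `first_threshold`; a slice
    grows while it is below `second_threshold` sentences and the next
    sentence is relevant to the previous one."""
    sentences = split_sentences(text)
    if len(sentences) <= first_threshold:
        return [text]
    slices = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        current = slices[-1]
        if len(current) < second_threshold and relevance(prev, cur) >= min_relevance:
            current.append(cur)
        else:
            slices.append([cur])
    return [" ".join(s) for s in slices]


doc = ("The cat sat. The cat slept. Stocks fell today. "
       "Stocks rose later. Rain is coming.")
text_slices = slice_text(doc, first_threshold=3, second_threshold=2)
```

Each resulting slice would then be passed to the recognition model in sequence, as described in the implementation above.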
In a second aspect, there is provided a sensitive word recognition apparatus, comprising:
the acquisition module is used for acquiring text data to be detected;
the recognition module is used for recognizing the sensitive words of the text data to be detected by utilizing a multi-category sensitive word recognition model to obtain the sensitive word category corresponding to the text data to be detected and the position corresponding to the sensitive word;
the desensitization module is used for prompting the category of the sensitive word and replacing the sensitive word with a preset symbol according to the position corresponding to the sensitive word;
wherein the multi-category sensitive word recognition model comprises: the expansion convolution layer, the feature extraction layer, the full connection layer and the CRF layer, and the identification module is further used for: determining a word vector sequence corresponding to the text data to be detected, wherein the word vector sequence can represent context information; performing expansion convolution operation by utilizing the expansion convolution layer according to the word vector sequence to obtain a first feature vector; according to the first feature vector, the feature extraction layer is utilized to perform random inactivation and data transformation to obtain a second feature vector; performing sensitive word multi-classification analysis by using the full-connection layer based on the second feature vector to obtain a plurality of sensitive word categories corresponding to the text data to be detected; and determining the corresponding position of the sensitive word and the corresponding sensitive word category by utilizing the CRF layer according to the second feature vector.
In a third aspect, an electronic device is provided, the electronic device comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to: operations corresponding to the sensitive word recognition method according to any one of the possible implementations of the first aspect are performed.
In a fourth aspect, a computer readable storage medium is provided, the storage medium storing at least one instruction, at least one program, code set, or instruction set, the at least one instruction, at least one program, code set, or instruction set being loaded and executed by a processor to implement a method of sensitive word recognition as shown in any one of the possible implementations of the first aspect.
In summary, the present application includes at least one of the following beneficial technical effects:
1. An IDCNN-CRF model is adopted to perform sensitive word category recognition and sensitive word position detection. The text data to be detected is first converted into a word vector sequence capable of representing context information, and the expansion convolution layer then extracts features from the word vector sequence. The first feature vector obtained in this way undergoes random inactivation and data transformation in the feature extraction layer. The full connection layer then performs sensitive word multi-classification on the resulting second feature vector, the CRF layer locates the sensitive words based on the second feature vector, and the sensitive words are desensitized according to their corresponding positions. Because the expanded receptive field lets the multi-category sensitive word recognition model better understand the semantics of the context, sensitive words and their variants can be captured rapidly and accurately, so that the accuracy and efficiency of sensitive word recognition are improved compared with the related-art approach based on a sensitive word library.
2. Recognition accuracy may be affected when the number of sentences in the text data to be detected exceeds the first preset number threshold. The text data to be detected is therefore sliced according to the second preset number threshold and the relevance between adjacent sentences, so that both the semantics of the sentences and the amount of data are taken into account and the accuracy of sensitive word recognition is improved.
Drawings
FIG. 1 is a schematic diagram of a system for recognizing sensitive words according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for recognizing sensitive words according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a multi-class sensitive word recognition model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a manner of sensitive word recognition provided in an embodiment of the present application;
FIG. 5 is a schematic flow chart of a dual-mode detection method for matching mode and semantic analysis mode according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a sensitive word recognition device according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an overall framework of a sensitive word recognition device according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to fig. 1 to 8.
The present embodiment merely illustrates the present application and is not intended to limit it. After reading this specification, those skilled in the art may modify the present embodiment as required without making a creative contribution, and such modifications are protected by patent law as long as they fall within the scope of the present application.
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
In addition, the term "and/or" herein is merely an association relationship describing an association object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In this context, unless otherwise specified, the term "/" generally indicates that the associated object is an "or" relationship.
Sensitive word detection has always been an indispensable part of dialogue systems and evaluation systems, and is required wherever dialogue or user text input is involved. Common sensitive word detection and recognition methods rely on a sensitive word library, which requires a great deal of manpower to maintain. With the development of artificial intelligence, some sensitive word detection has been combined with big data technology, for example by discovering new words through an N-gram model to expand the sensitive word list.
However, sensitive word variants take many forms. When a sensitive word library is used for recognition, the rule system must keep a huge library up to date to intercept sensitive words effectively, and the variants must also be checked manually, so the workload is high. In addition, sensitive words can have multiple meanings, and detection based on a sensitive word library alone easily misjudges content in a normal context.
Therefore, the conventional sensitive word detection method consumes a lot of manpower and has a poor detection effect. Moreover, text subject to sensitive word detection is generally only 2-3 sentences long and has a small vocabulary, whereas current text classification models are generally suited to classifying long text content, so applying such a model to sensitive word classification gives a poor effect.
In order to solve the problems, the application provides a sensitive word recognition method, a device, equipment and a medium applicable to short text types based on an expansion convolution network.
The expansion (dilated) convolutional network is a special kind of convolutional neural network and is generally applied to image processing. It performs well in tasks such as image segmentation, which requires classification of each pixel of the input image. In an ordinary convolutional neural network, the convolution kernel performs a multiply-accumulate operation over adjacent points of the input feature map, whereas in an expansion convolutional network the convolution kernel skips some pixels during the convolution, so that the receptive field is enlarged.
The present application migrates the expansion convolutional network from the image field to the text field and, combined with a conditional random field (CRF) model, can quickly and accurately capture sensitive words and related variants.
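How skipping positions enlarges the receptive field can be shown with a minimal pure-Python sketch of a one-dimensional expansion (dilated) convolution over a sequence. The function names, the zero padding, and the example kernel size and dilation rates are illustrative assumptions; the present application does not state the dilation rates it uses.

```python
from typing import List


def dilated_conv1d(seq: List[float], kernel: List[float],
                   dilation: int) -> List[float]:
    """'Same'-length 1-D convolution (cross-correlation form) whose kernel
    taps are `dilation` positions apart; positions outside the sequence
    are treated as zero."""
    center = len(kernel) // 2
    out: List[float] = []
    for t in range(len(seq)):
        total = 0.0
        for j, weight in enumerate(kernel):
            idx = t + (j - center) * dilation
            if 0 <= idx < len(seq):
                total += weight * seq[idx]
        out.append(total)
    return out


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Input positions seen by one output of a stack of dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
ordinary = dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=1)
dilated = dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=2)
rf_plain = receptive_field(3, [1, 1, 1])    # three ordinary layers
rf_dilated = receptive_field(3, [1, 1, 2])  # assumed dilation rates
```

With the same three-tap kernel and therefore the same number of multiplications per output, the dilated stack covers more input positions than the ordinary stack, which is the property exploited to capture longer context.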
FIG. 1 illustrates an exemplary sensitive word recognition system diagram according to an embodiment of the present application. The sensitive word recognition system includes: one or more client devices 100, a server 200, one or more communication networks 300 coupling the one or more client devices 100 to the server 200.
The client device 100 provides a text content input or upload interface on a user interaction interface. The client device obtains text data to be detected through an input or uploading interface. Client devices include, but are not limited to: smart phones, tablet computers, notebook computers, desktop computers, smart speakers, wearable devices, etc.
Communication network 300 includes, but is not limited to: a wired communication link, a wireless communication link, or a fiber optic cable, etc.
The server 200 may specifically be a dedicated server for implementing a single service, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms.
In addition, the text processing method in the embodiment of the present application may be applied to an electronic device, which may be the client device 100 or the server 200, and the embodiment of the present application is not limited in particular. In one possible scenario, the client device 100 obtains text data to be detected; performing sensitive word recognition on the text data to be detected by using a multi-category sensitive word recognition model to obtain a sensitive word corresponding to the text data to be detected and a position corresponding to the sensitive word; and replacing the sensitive word with a preset symbol according to the position corresponding to the sensitive word. In another possible case, the client device 100 obtains text data to be detected input by a user and sends the text data to the server 200, and the server 200 performs sensitive word recognition on the text data to be detected by using a multi-category sensitive word recognition model to obtain a sensitive word corresponding to the text data to be detected and a position corresponding to the sensitive word; and replacing the sensitive word with a preset symbol according to the position corresponding to the sensitive word.
The method, the device, the equipment and the medium for identifying the sensitive words are described in detail below with reference to the specific embodiments.
As shown in fig. 2, an embodiment of the present application provides a method for identifying a sensitive word, where the method may be performed by an electronic device, and the method for identifying a sensitive word includes: step S210-step S230, wherein:
step S210, acquiring text data to be detected;
In the embodiment of the application, text may be extracted from a database of text to be detected and used as the text data to be detected for recognition; the text may be webpage text, video text, dialogue text or comment text. Alternatively, the real-time text content of a dialogue system or evaluation system may be acquired automatically by network crawling and used as the text data to be detected.
In some embodiments, after obtaining the text data to be detected, the method further comprises: cleaning the text data to be detected. Specifically, information such as special marks is removed from the original data; the special marks comprise punctuation marks, special symbols and meaningless common words, so that the calculation complexity and the system overhead are reduced.
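The cleaning step might look like the following minimal sketch, which removes markup-style tags, punctuation and special symbols, and a small list of common words. The regular expressions, the `STOP_WORDS` set and the function name `clean_text` are illustrative assumptions rather than the cleaning rules of the present application.

```python
import re
from typing import Iterable

# Hypothetical list of meaningless common words; a real list would be larger.
STOP_WORDS = {"the", "a", "an", "of"}


def clean_text(text: str, stop_words: Iterable[str] = STOP_WORDS) -> str:
    """Strip tags, punctuation/special symbols and common words."""
    text = re.sub(r"<[^>]+>", " ", text)   # markup-style tags
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation and special symbols
    stop = set(stop_words)
    tokens = [tok for tok in text.split() if tok.lower() not in stop]
    return " ".join(tokens)


cleaned_en = clean_text("<b>Hello</b>, the world!!! #1")
cleaned_zh = clean_text("你好，世界！")
```

Because cleaning changes character offsets, positions reported after this step refer to the cleaned text; mapping them back to the original input would require keeping an offset table, which this sketch omits.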
Step S220, performing sensitive word recognition on the text data to be detected by using a multi-category sensitive word recognition model to obtain a sensitive word category corresponding to the text data to be detected and a position corresponding to the sensitive word;
Wherein the multi-category sensitive word recognition model comprises: an expansion convolution layer, a feature extraction layer, a full connection layer and a CRF layer. The multi-category sensitive word recognition model can be used for rapidly and accurately determining the category of the sensitive word and the corresponding position of the sensitive word.
Sensitive words fall into various categories, such as sensitive words that violate laws and regulations, sensitive words related to personal privacy, sensitive words related to business secrets, and culturally sensitive words.
Specifically, referring to fig. 3 and fig. 4, fig. 3 is a schematic structural diagram of a multi-category sensitive word recognition model according to an embodiment of the present application, and fig. 4 is a schematic diagram of a manner of sensitive word recognition provided in an embodiment of the present application. The multi-category sensitive word recognition model uses the IDCNN (Iterated Dilated Convolutional Neural Networks)-CRF (Conditional Random Field) algorithm, which migrates a network model originally applied in the image field into the text field and better understands the semantics of the context by expanding the receptive field.
The Input layer input_1 is used for inputting text data to be detected;
the word Embedding layer Embedding_1 is used for converting text data to be detected into a word vector sequence;
The spatial_dropout1d module regularizes the word vector sequences: the word vector sequences are processed in batches, and certain word vectors in the word vector sequences of each batch are randomly set to 0, so as to improve the independence among the word vectors, improve generalization capability and prevent overfitting;
the expansion convolution layer comprises 3 expansion convolution modules Conv1d, which are used for extracting features to obtain first feature vectors; each convolution module performs a convolution operation on the data output by the previous convolution module or the word embedding layer by using a group of convolution kernels so as to extract the features of the input data; expansion (dilated) convolution can expand the receptive field without increasing the amount of calculation and extract wider features; by stacking a plurality of convolution modules, higher-level features are gradually extracted and longer context information is captured, so that common sensitive words and their variants are captured more quickly and accurately and the accuracy of the text classification task is improved;
the feature extraction layer includes: the random inactivation layer dropout_1 and the data transformation layer lambda_1 can also comprise a reshape layer;
the dropout_1 layer is used for carrying out random inactivation operation on the first feature vector so as to reduce the excessive dependence of the model on certain features and improve the generalization capability of the model;
The lambda_1 layer is used for carrying out data transformation operation on the features subjected to random inactivation operation to obtain a second feature vector, wherein the data transformation comprises one or more of text vector cleaning such as normalization operation, vectorization, text preliminary classification and the like, and the parameter learning process is not involved;
the reshape_1 layer is used for matching the interfaces between different layers: it performs a dimension reduction operation on the second feature vector so that the vector dimension meets the dimension requirement of the full-connection layer;
the full-connection layer Dense is used for classifying the sensitive words according to the second feature vector or the dimension-reduced second feature vector;
and the CRF layer is used for decoding the second feature vector by using the Viterbi algorithm to obtain the sensitive word class and the position thereof.
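The Viterbi decoding performed by the CRF layer can be illustrated with a small pure-Python sketch. It assumes per-position tag scores (`emissions`) and tag-to-tag scores (`transitions`) are already available; both names and the score values used below are illustrative assumptions, not values from the application.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring sequence of tag indices.

    emissions[t][j]: score of tag j at position t (assumed to come from the dense layer).
    transitions[i][j]: score of moving from tag i to tag j (assumed learned by the CRF).
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        prev_best, new_score = [], []
        for j in range(n_tags):
            # Best previous tag for landing on tag j at this position.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            prev_best.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        backpointers.append(prev_best)
        score = new_score
    # Trace back from the best final tag.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for prev_best in reversed(backpointers):
        path.append(prev_best[path[-1]])
    return path[::-1]

# Tags: 0 = O, 1 = B, 2 = I. The move O -> I is heavily penalised.
transitions = [[0.0, 0.0, -1e4], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(viterbi([[2.0, 0.0, 0.0], [0.0, 0.5, 1.0]], transitions))
```

In the second position the tag I has the highest local score, but the transition penalty makes the decoder choose B instead, which is how a CRF layer keeps the tag sequence consistent.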
In one implementation manner, step S220 performs sensitive word recognition on the text data to be detected using the multi-category sensitive word recognition model, including: step S221-step S225, wherein:
step S221, determining a word vector sequence corresponding to the text data to be detected, wherein the word vector sequence can represent context information;
in some embodiments, word features are represented using a distributed representation. Word vectors can be generated using Word2vec, GloVe, BERT (Bidirectional Encoder Representations from Transformers) or FastText; the specific word vector generation mode can be specified by a user, or a mapping relation between scenes and word vector generation modes can be established in advance, so that the word vector generation mode corresponding to the current sensitive word recognition scene is determined based on that scene. A plurality of word vectors are determined, and a word vector sequence capable of representing context semantic information is obtained from the plurality of word vectors. This fully considers the semantic environment, namely the context information of words, and carries richer semantic information; at the same time, high-dimensional and highly sparse text data is converted into continuous, dense, low-dimensional data suitable for neural network processing. It should be noted that, for a variant of a sensitive word, even if the glyph or the pronunciation changes, the context semantic dependency relationship is unchanged, so the variant of the sensitive word can be identified based on the context semantic dependency relationship in the embodiment of the application, and the identification effect is improved.
Word2vec can capture semantic relations among words; GloVe can capture co-occurrence relationships between words; FastText can handle polysemous or ambiguous words; the advantage of BERT is its ability to capture contextual relationships between words.
In some embodiments, at least two word vector generation modes may be adopted to determine a plurality of word vectors, specifically, word vector extraction is performed on text data to be detected by using each word vector generation mode, and a plurality of word vectors corresponding to each word vector generation mode are generated; and carrying out information fusion on a plurality of word vectors corresponding to each word vector generation mode to obtain a plurality of word vectors corresponding to the text data to be detected, wherein the fusion method can adopt an average value method, a weighted average value method, a splicing method or matrix decomposition fusion. By combining the word vector generation technologies, more comprehensive and accurate word vector representation can be obtained, so that the performance and accuracy of subsequent natural language processing tasks are improved. And sequencing the obtained word vectors according to the sequence to form a word vector sequence.
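Three of the fusion methods named above (average, weighted average and splicing) can be sketched in a few lines. The function names and the toy vectors are illustrative assumptions; in practice each input vector would come from a different word vector generation mode for the same word.

```python
def fuse_average(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(vals) / len(vectors) for vals in zip(*vectors)]

def fuse_weighted(vectors, weights):
    """Element-wise weighted mean; weights need not sum to 1."""
    total = sum(weights)
    return [sum(w * v for w, v in zip(weights, vals)) / total for vals in zip(*vectors)]

def fuse_concat(vectors):
    """Splicing: concatenate the vectors end to end (lengths may differ)."""
    return [x for vec in vectors for x in vec]

v_a, v_b = [1.0, 2.0], [3.0, 4.0]  # toy stand-ins for two embeddings of one word
print(fuse_average([v_a, v_b]))
print(fuse_weighted([v_a, v_b], [1, 3]))
print(fuse_concat([v_a, v_b]))
```

Averaging requires all generation modes to produce vectors of the same dimension, whereas splicing does not but increases the dimension of the resulting word vector.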
Step S222, performing expansion convolution operation by utilizing an expansion convolution layer according to the word vector sequence to obtain a first feature vector;
Step S223, according to the first feature vector, carrying out random inactivation and data transformation by utilizing the feature extraction layer to obtain a second feature vector;
step S224, performing sensitive word multi-classification analysis by using a full-connection layer based on the second feature vector to obtain a plurality of sensitive word categories corresponding to the text data to be detected;
the expansion convolution layer is utilized to carry out expansion convolution operation on the word vector sequence, so that text features with different scales are captured, context semantics are better understood to obtain a first feature vector, and the problem of difficult recognition caused by sensitive word variants can be solved to a certain extent.
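To make the effect of expansion (dilated) convolution on the receptive field concrete, the following sketch applies a 1D dilated convolution to a short sequence and computes the receptive field of a stack of such layers. The kernel size 3 and the dilation rates (1, 1, 2) are assumptions for illustration; the application only states that three Conv1d modules are stacked.

```python
def dilated_conv1d(seq, kernel, dilation):
    """Zero-padded 'same' 1D dilated convolution (odd kernel size) over a list of floats."""
    half = len(kernel) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + (k - half) * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < len(seq):
                acc += w * seq[idx]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input positions that can influence one output of the stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1, 1], dilation=2))
print(receptive_field(3, [1, 1, 2]), receptive_field(3, [1, 1, 1]))
```

With the same number of weights per layer, the stack with dilations (1, 1, 2) sees 9 input positions while the undilated stack sees 7, which is the sense in which the receptive field grows without extra computation.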
According to the embodiment of the application, the full connection layer serving as the classifier is adjusted, and multi-classification can be achieved. And according to the second feature vector, evaluating each sensitive category by using an activation function of the full connection layer to obtain sensitive category probability distribution, and determining the sensitive categories with the prediction probability larger than a preset prediction probability threshold value as a plurality of sensitive categories corresponding to the text data to be detected.
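The threshold-based selection of several categories can be sketched as below. The sketch assumes a sigmoid activation so that each category receives an independent probability; the application does not name the activation function, and the category labels and threshold value are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def select_categories(logits, labels, threshold=0.5):
    """Return the labels whose predicted probability exceeds the threshold."""
    probs = {label: sigmoid(z) for label, z in zip(labels, logits)}
    selected = [label for label, p in probs.items() if p > threshold]
    return selected, probs

# Hypothetical category names and dense-layer outputs.
labels = ["legal", "privacy", "business"]
selected, probs = select_categories([2.0, -1.0, 0.3], labels)
print(selected)
```

Because every category is compared against the threshold separately, one text can be assigned several sensitive categories at once, or none.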
According to the embodiment of the application, the IDCNN network in deep learning is adopted to perform sensitive category multi-classification on the input second feature vector; the IDCNN can capture longer context information and, compared with a traditional LSTM (Long Short-Term Memory) network, can be computed in parallel.
Step S225, determining the corresponding position of the sensitive word and the corresponding sensitive word category by using the CRF layer according to the second feature vector.
In the embodiment of the application, the CRF, an algorithm used for named entity recognition, is used to determine semantically the sensitive words of each sensitive word category and the positions corresponding to the sensitive words. For example, for a piece of text data to be detected, data of shape (200, 11) is output, where 11 is the number of extracted tags: tag_list = ['O', 'B_sensitive word category 1', 'B_sensitive word category 2', 'B_sensitive word category 3', 'B_sensitive word category 4', 'B_sensitive word category 5', 'I_sensitive word category 1', 'I_sensitive word category 2', 'I_sensitive word category 3', 'I_sensitive word category 4', 'I_sensitive word category 5'], where B represents the start position of a sensitive word, I represents a subsequent (inside) position of the sensitive word, and O represents a non-sensitive word.
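A decoded tag sequence of this kind can be turned into sensitive word positions with a short routine. The sketch below assumes the usual BIO reading (B opens a sensitive word, I continues it) and uses shortened, hypothetical category names such as `c1`.

```python
def bio_to_spans(tags):
    """Convert BIO tags into (start, end_exclusive, category) spans."""
    spans, start, cat = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" flushes the final span
        prefix, _, name = tag.partition("_")
        continues = prefix == "I" and cat is not None and name == cat
        if start is not None and not continues:
            spans.append((start, i, cat))
            start, cat = None, None
        if prefix == "B":
            start, cat = i, name
    return spans

print(bio_to_spans(["O", "B_c1", "I_c1", "O", "B_c2"]))
```

An I tag that does not follow a B or I tag of the same category is ignored here; other conventions treat it as the start of a new span.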
The model training speed of the multi-category sensitive word recognition model corresponding to the semantic analysis mode is high, the visual field of the convolution layer is large, the relation between context semantics can be better grasped, the prediction of the sensitive words is carried out through semantic understanding, the model is only required to be updated and upgraded regularly, and the workload of manual maintenance is greatly reduced.
Step S230, prompting the category of the sensitive word, and replacing the sensitive word with a preset symbol according to the position corresponding to the sensitive word.
In some embodiments, a preset symbol corresponding to the sensitive word can be determined based on the type of the sensitive word, and the sensitive word is rewritten based on the preset symbol corresponding to the sensitive word, so that the text content and the corresponding user can be tracked and audited at a later stage; and at the same time, the user can be prompted to pay attention to the text.
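Replacing sensitive words by position, with a symbol chosen per category, might look like the following sketch. The mapping `MASK_SYMBOLS`, the category names and the span format `(start, end_exclusive, category)` are assumptions made for illustration.

```python
# Hypothetical mapping from sensitive word category to its preset symbol.
MASK_SYMBOLS = {"privacy": "#", "legal": "*"}

def mask_spans(text, spans, default_symbol="*"):
    """Overwrite each (start, end_exclusive, category) span with its category symbol."""
    chars = list(text)
    for start, end, category in spans:
        symbol = MASK_SYMBOLS.get(category, default_symbol)
        chars[start:end] = symbol * (end - start)  # same length, so later positions stay valid
    return "".join(chars)

print(mask_spans("call 12345 now", [(5, 10, "privacy")]))
```

Using a different symbol per category keeps the category visible in the masked text, which supports the later tracking and auditing mentioned above.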
Through this embodiment, the IDCNN-CRF model is adopted to perform sensitive word category recognition and sensitive word position detection. The text data to be detected is first converted into a word vector sequence capable of representing context information; the expansion convolution layer then extracts feature vectors from the word vector sequence; the feature extraction layer applies random inactivation and data transformation to the extracted first feature vector; the full connection layer performs sensitive word multi-classification on the resulting second feature vector; the CRF layer locates the sensitive words based on the second feature vector; and finally the sensitive words are desensitized according to the positions corresponding to the sensitive words. By expanding the receptive field, the multi-category sensitive word recognition model better understands the semantics of the context. Compared with sensitive word recognition based on a sensitive word library in the related art, sensitive words and their variants can be captured rapidly and accurately, and the accuracy and the efficiency of sensitive word recognition are improved.
In some embodiments, to ensure accuracy of recognition, the size of data input into the multi-category sensitive word recognition model may be limited, and after step S210, the sensitive word recognition method further includes: judging whether the number of sentences of the text data to be detected exceeds a first preset number threshold, wherein the first preset number threshold can be set to be a fixed preset number threshold, such as 3, 4 and the like; if yes, slicing the text data to be detected according to the relevance among sentences and a second preset quantity threshold value to obtain a plurality of text slices, and sequentially carrying out sensitive word recognition on the plurality of text slices by utilizing a multi-category sensitive word recognition model to obtain a sensitive word category corresponding to the text data to be detected and a position corresponding to the sensitive word.
In this embodiment of the present application, the text data to be detected is split into sentences using preset symbols as separators, where the preset symbols at least include the period, the semicolon, the question mark and the exclamation mark. It should be noted that a period may sometimes appear inside quotation marks or brackets; in that case, a judgment needs to be made from the context to avoid splitting the sentence incorrectly. The number of sentences obtained by this sentence splitting is used as the basis of the slicing processing.
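One simple way to avoid splitting at a period inside brackets or quotation marks is to track open delimiters while scanning. The sketch below is a heuristic under that assumption, not the context-based judgment of the application; the delimiter and terminator sets are illustrative.

```python
# Opening delimiter -> closing delimiter (illustrative set).
OPENERS = {"(": ")", "[": "]", '"': '"', "\u201c": "\u201d", "\uff08": "\uff09"}
# Western and Chinese period, semicolon, question mark, exclamation mark.
TERMINATORS = set(".;?!\u3002\uff1b\uff1f\uff01")

def split_sentences(text):
    """Split on terminators that are not inside brackets or quotation marks."""
    sentences, stack, start = [], [], 0
    for i, ch in enumerate(text):
        if stack and ch == stack[-1]:
            stack.pop()                      # closing delimiter
        elif ch in OPENERS:
            stack.append(OPENERS[ch])        # remember which closer to expect
        elif ch in TERMINATORS and not stack:
            piece = text[start:i + 1].strip()
            if piece:
                sentences.append(piece)
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("He said (see fig. 2) ok. Really? Yes!"))
```

This heuristic still splits at periods in abbreviations or decimal numbers outside brackets, so a production splitter would need further rules.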
When the text data to be detected needs to be sliced, the relevance between adjacent sentences is determined, for the sentences obtained by segmentation, by analyzing the semantic relation, the contextual linkage or the keywords among the sentences. Initial text slice determination step: a number of sentences equal to the second preset quantity threshold is taken from the segmented sentences as the initial text slice, where the second preset quantity threshold is greater than 2 and less than or equal to the first preset quantity threshold. Text slice determination step: it is judged whether the last sentence a of the initial text slice is associated with its previous sentence b; if it is, the initial text slice is taken as a text slice; if it is not, it is judged whether the last sentence a is associated with its subsequent sentence c; if a is not associated with c, the initial text slice is taken as a text slice; if a is associated with c, the sentences of the initial text slice other than the last sentence a are taken as a text slice. Segmented sentence update step: the text slice is removed from the segmented sentences to update them. The initial text slice determination step, the text slice determination step and the segmented sentence update step are then executed repeatedly to complete the slicing and obtain a plurality of text slices.
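The slicing loop can be sketched as follows, reading the text slice determination step as: keep the last sentence a in the slice unless it is unrelated to its previous sentence b and related to its subsequent sentence c. The association test `shares_keyword` is a placeholder assumption; the application determines relevance from semantics, context or keywords without fixing a method.

```python
def shares_keyword(s1, s2, min_len=4):
    """Placeholder association test: the sentences share a word of min_len+ characters."""
    def keywords(s):
        stripped = (w.strip(".,;?!").lower() for w in s.split())
        return {w for w in stripped if len(w) >= min_len}
    return bool(keywords(s1) & keywords(s2))

def slice_sentences(sentences, slice_size, is_related=shares_keyword):
    """Group sentences into slices of at most slice_size, keeping related neighbours together."""
    slices, remaining = [], list(sentences)
    while remaining:
        initial = remaining[:slice_size]
        if len(remaining) > slice_size and len(initial) >= 2:
            a, b, c = initial[-1], initial[-2], remaining[slice_size]
            if not is_related(a, b) and is_related(a, c):
                initial = initial[:-1]  # leave `a` for the next slice, next to `c`
        slices.append(initial)
        remaining = remaining[len(initial):]  # update the segmented sentences
    return slices

links = {("s3", "s4")}
related = lambda x, y: (x, y) in links or (y, x) in links
print(slice_sentences(["s1", "s2", "s3", "s4", "s5"], 3, related))
```

In the example, `s3` is moved to the second slice because it is related to `s4` but not to `s2`.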
In this embodiment, when the number of sentences of the text data to be detected exceeds a first preset number threshold, the accuracy of recognition may be affected, and the text data to be detected is sliced according to the second preset number threshold and the relevance between adjacent sentences, so that the semantics of the sentences and the data size are comprehensively considered to improve the accuracy of recognition of the sensitive words.
In one possible implementation manner of the embodiment of the present application, a multi-category sensitive word recognition model training process includes:
training an initial multi-category sensitive word recognition model based on a training set to obtain a trained multi-category sensitive word recognition model, wherein the training set comprises: the system comprises a plurality of sensitive word training samples and labels corresponding to the sensitive word training samples, wherein the labels comprise multi-category labels and sensitive word position labels;
based on the verification set, determining a loss value of the trained multi-class sensitive word recognition model, adjusting model parameters of the trained multi-class sensitive word recognition model according to the loss value and a preset loss value, and performing iterative training to obtain the multi-class sensitive word recognition model.
Sensitive sample data and normal sample data are collected and manually labeled. The proportions of the training set and the validation set can be set in a customized manner, and the embodiment of the application is not limited in this respect.
In the training process, the model parameters Epoch and Batch size are adjusted according to the loss value on the verification set, so that the loss on the verification set is minimized. The loss is the average error calculated between the predicted values and the actual values of the verification set. When the loss value is not greater than the preset loss value, the training is completed; when the loss value is greater than the preset loss value, the model parameters are adjusted by using a grid search algorithm based on the loss value, and training is continued until the number of training iterations reaches a preset threshold or the loss value is not greater than the preset loss value.
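The validation loss and the grid search over Epoch and Batch size can be sketched as below. The callable `train_and_validate` is a hypothetical stand-in for training the model with the given settings and returning its validation loss; the candidate values and the toy loss surface are invented purely for illustration.

```python
from itertools import product

def mean_abs_error(predicted, actual):
    """Average error between predicted and actual values of the verification set."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def grid_search(train_and_validate, epoch_options, batch_options, max_trials=None):
    """Try every (epochs, batch_size) pair and return (loss, epochs, batch_size) of the best."""
    best = None
    for trial, (epochs, batch_size) in enumerate(product(epoch_options, batch_options)):
        if max_trials is not None and trial >= max_trials:
            break  # preset limit on the number of training runs
        loss = train_and_validate(epochs, batch_size)
        if best is None or loss < best[0]:
            best = (loss, epochs, batch_size)
    return best

# Toy loss surface standing in for real training; its minimum is at (20, 32).
fake_run = lambda epochs, batch: abs(epochs - 20) / 10 + abs(batch - 32) / 100
print(grid_search(fake_run, [10, 20, 30], [16, 32, 64]))
```

The sketch selects the setting with the lowest validation loss; a comparison against the preset loss value could be added as an early exit inside the loop.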
Therefore, in the embodiment of the application, model training is performed based on the training set, then the loss value of the trained multi-class sensitive word recognition model is determined based on the verification set, and then the parameter correction of the model is performed according to the loss value, so that the multi-class sensitive word recognition model is obtained.
Further, in the training process, the data is limited, and the situation of low optimization efficiency may occur, so in the iterative training process in the embodiment of the present application, when the variation amplitude of the loss value of the continuous preset round training is smaller than the preset amplitude threshold, the training set and the verification set are updated according to the optimized sample, and the iterative training is continued according to the updated training set and verification set.
When the optimized samples are preset samples, they can be added to, or used to replace, samples in the training set and the verification set proportionally or randomly; when the optimized samples include samples that failed verification and a portion of the samples in the training set, the samples that failed verification are added to, or used to replace, samples in the training set, and the portion of the samples in the training set is added to, or used to replace, samples in the verification set.
According to the embodiment of the application, when the variation amplitude of the loss value of the continuous preset round training is smaller than the preset amplitude threshold, the training set and the verification set are updated according to the optimized sample, so that the optimization efficiency and effect on the model can be improved, and the overfitting is avoided.
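The condition "the variation amplitude of the loss value over a preset number of consecutive rounds is smaller than the preset amplitude threshold" can be checked with a small helper. The function name and the example numbers are assumptions.

```python
def loss_plateaued(losses, rounds, amplitude_threshold):
    """True if each of the last `rounds` round-to-round loss changes is below the threshold."""
    if len(losses) < rounds + 1:
        return False  # not enough history yet
    recent = losses[-(rounds + 1):]
    return all(abs(b - a) < amplitude_threshold for a, b in zip(recent, recent[1:]))

history = [0.9, 0.5, 0.49, 0.488, 0.487]
print(loss_plateaued(history, rounds=3, amplitude_threshold=0.02))
```

When the helper returns true, the training loop would update the training set and verification set with the optimized samples and then continue iterating.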
In one possible implementation manner of the embodiment of the present application, the method further includes: when the using times/using time length of the multi-category sensitive word recognition model of the current version reaches a preset requirement, acquiring a first data set recognized by the multi-category sensitive word recognition model, wherein the first data set comprises a plurality of actually detected text data and sensitive word labels corresponding to the actually detected text data;
and carrying out iterative training on the multi-category sensitive word recognition model of the current version by using the first data set to obtain the multi-category sensitive word recognition model of the upgrading version so as to realize version upgrading of the multi-category sensitive word recognition model.
The preset requirements corresponding to the times/time of use can be set by a user in a self-defined way. Recording a multi-category sensitive word recognition model of a current version as Pi, performing sensitive word recognition by using the Pi, storing each actual detection text data and a corresponding result of the recognition, and performing manual correction when the result is wrong so as to obtain each actual detection text data and a corresponding sensitive word label; when the using times of Pi recognition reaches T1/the using time of Pi recognition reaches T2, training the Pi according to the recognized actual detection text data and the sensitive word label to obtain the Pi+1 of the next version; and then repeatedly executing the steps, and carrying out iterative optimization on the multi-category sensitive word recognition model of the previous version by continuously utilizing the actual recognition result, so that the sensitive word recognition effect is continuously improved.
In another embodiment provided in the present application, referring to fig. 5, a sensitive word recognition mode may be determined based on a triggering condition corresponding to a service requirement. A sensitive word algorithm and service logic are designed using the triggering condition, deep learning and natural language processing technology, providing a dual-mode detection scheme with an accurate matching mode and a semantic analysis mode. The text data to be detected and a model enabling flag are acquired. When the model enabling flag is on, the IDCNN-CRF algorithm is started, and the IDCNN-CRF algorithm prompts the sensitive words and masks them. When the model enabling flag is off, scene information is acquired; different scenes correspond to different sensitive word libraries, and black-and-white list data for the current scene information can be set to supplement the sensitive word library corresponding to the current scene information; the words of the scene library that appear in the text data to be detected are then found using a DFA matching algorithm and masked.
Specifically, in one possible implementation manner of the embodiment of the present application, before performing sensitive word recognition on text data to be detected by using a multi-category sensitive word recognition model to obtain a sensitive word corresponding to the text data to be detected and a position corresponding to the sensitive word, the method further includes:
determining a sensitive word recognition mode based on the trigger condition; the triggering condition comprises at least one of the following: whether the text source of the text data to be detected is a network source; whether the recognition accuracy is greater than a preset accuracy threshold; whether a specified recognition mode is detected in a user-triggered instruction;
when the sensitive word recognition mode comprises a semantic analysis mode, performing sensitive word recognition on the text data to be detected by using the multi-category sensitive word recognition model to obtain a sensitive word corresponding to the text data to be detected and a position corresponding to the sensitive word;
when the sensitive word recognition mode comprises a matching mode, determining a sensitive word corresponding to the text data to be detected and a position corresponding to the sensitive word by using a sensitive word dictionary tree.
In the embodiment of the application, when the text source of the text data to be detected is a network source, determining that the sensitive word recognition mode is a semantic analysis mode, and otherwise, determining that the sensitive word recognition mode is a matching mode; when the recognition accuracy is larger than a preset accuracy threshold, determining that the sensitive word recognition mode is a semantic analysis mode, and otherwise, determining that the sensitive word recognition mode is a matching mode; the user-triggered instruction includes a specified recognition mode, and the specified recognition mode can be one or a combination of a semantic analysis mode and a matching mode.
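The mode selection rules can be sketched as a single function. The order in which the conditions are checked (user instruction first, then text source, then accuracy) and the default when nothing is known are assumptions; the application lists the conditions without ranking them.

```python
SEMANTIC, MATCHING = "semantic", "matching"

def choose_modes(source=None, accuracy=None, accuracy_threshold=0.9, user_modes=None):
    """Return the set of recognition modes to run for the current request."""
    if user_modes:                       # a specified mode in the user-triggered instruction
        return set(user_modes)
    if source is not None:               # network-sourced text uses semantic analysis
        return {SEMANTIC} if source == "network" else {MATCHING}
    if accuracy is not None:             # accuracy above the threshold uses semantic analysis
        return {SEMANTIC} if accuracy > accuracy_threshold else {MATCHING}
    return {MATCHING}                    # assumed default when no condition applies

print(choose_modes(source="network"))
print(choose_modes(user_modes=[SEMANTIC, MATCHING]))
```

Returning a set allows the combined case, in which both modes are run and the matching mode supplies supplementary sensitive words.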
It should be noted that the sensitive word library corresponding to the matching mode can be expanded with sensitive words that were wrongly recognized or missed while the semantic analysis mode was in use. A shared sensitive word library usable by the semantic analysis mode and/or the matching mode can also be set, where the shared sensitive word library comprises the sensitive words that were wrongly recognized or missed and the sensitive words whose recognition frequency is greater than a preset frequency. When sensitive word recognition is needed, the shared sensitive word library is used first, and then either or both of the sensitive word recognition modes are used, so that the accuracy of sensitive word recognition can be improved.
The embodiment of the application reserves the matching mode of the sensitive word stock, can support a user to build the custom sensitive word stock corresponding to different scenes, and realizes shielding of the sensitive words through literal and semantic judgment.
By combining the two sensitive word recognition modes, sensitive content can be detected quickly, and the security risk caused by the spread of dangerous language is reduced.
Therefore, the embodiment of the application can support at least two sensitive word recognition modes, and the currently adopted recognition mode is automatically determined based on the triggering condition, so that various requirements of users can be met.
Specifically, in one possible implementation manner of the embodiment of the present application, when the sensitive word recognition mode is a matching mode and a semantic analysis mode;
after replacing the sensitive word with the preset symbol according to the position corresponding to the sensitive word, the method further comprises:
determining the position corresponding to the supplementary sensitive word of the text data to be detected according to the matching mode;
and replacing the supplementary sensitive words with preset symbols according to the positions corresponding to the supplementary sensitive words.
In the embodiment, after the matching mode and the semantic analysis mode are selected at the same time, the matching mode can be supplemented or replaced by the semantic analysis mode, so that the reliability of sensitive word recognition is improved.
In one possible implementation manner of the embodiment of the present application, determining a sensitive word corresponding to text data to be detected and a position corresponding to the sensitive word by using a sensitive word dictionary tree includes:
identifying a plurality of sensitive words of the text data to be detected by using a sensitive word dictionary tree;
determining whether paired sensitive words matched with a preset associated word set exist from a plurality of sensitive words;
and screening a plurality of sensitive words according to the paired sensitive words to obtain final effective sensitive words and sensitive word positions.
In the embodiment of the application, all the sensitive words corresponding to the target scene are built into a sensitive word dictionary tree (that is, all the sensitive words are organized into a tree, so that every word beginning with a given character can be looked up). The text data to be detected is then searched character by character against the sensitive word dictionary tree based on a DFA (Deterministic Finite Automaton) algorithm: for each character of the text data to be detected, it is determined whether the character exists in the sensitive word dictionary tree; if so, it is determined whether the following characters complete a word in the tree; if they do, the match is successful, and the sensitive word and the sensitive word position are recorded. The search then continues from the next character until the whole text data to be detected has been searched, so that a plurality of sensitive words of the text data to be detected are obtained.
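The dictionary tree construction and the character-by-character scan can be sketched with nested dictionaries. The end marker `END`, the longest-match policy and the English example words are assumptions made for illustration.

```python
END = "__end__"  # marker key; never collides with a single-character key

def build_trie(words):
    """Build a nested-dict dictionary tree from the sensitive words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def find_sensitive_words(text, trie):
    """Return (word, start, end_exclusive) for the longest match at each position."""
    hits, i = [], 0
    while i < len(text):
        node, j, last_end = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]      # follow the transition for this character
            j += 1
            if END in node:
                last_end = j          # a complete sensitive word ends here
        if last_end is not None:
            hits.append((text[i:last_end], i, last_end))
            i = last_end
        else:
            i += 1
    return hits

trie = build_trie(["bad", "badge", "evil"])
print(find_sensitive_words("a badge is evil", trie))
```

Each trie node acts as an automaton state and each character as a transition, so the time spent at a text position is bounded by the length of the longest sensitive word rather than by the size of the word library.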
If a single sensitive word hit is enough for text to be marked as sensitive, many false positives may result. A sentence may touch only a small part of the meaning of a certain sensitive word, and the word can be determined to be sensitive only when it is associated with another word. The application therefore requires paired sensitive word hits, which improves the accuracy of sensitive word detection, where paired sensitive words are at least two sensitive words. A preset associated word set is pre-configured in the electronic equipment and supports configuration through combinations of several key sensitive words, where the words in a combination are associated with each other and serve as paired sensitive words. For example, the sensitive word entry "WeChat & part-time" means that WeChat and part-time are taken as effective sensitive words only when the text data to be detected contains both words.
Therefore, by configuring paired sensitive words, when sensitive words are identified through the matching mode, the final effective sensitive words and the positions of the sensitive words are determined only when all the paired sensitive words are hit, so that the accuracy of sensitive word recognition is improved and false detection of sensitive words is reduced.
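Screening matched words against the preset associated word set can be sketched as follows. The representation of a pair entry as a tuple, and the rule that words outside every pair entry are kept unchanged, are assumptions; the application only gives the "WeChat & part-time" example.

```python
def filter_paired(hits, pair_groups):
    """Keep a paired word only if every word of at least one of its groups was hit."""
    found = {word for word, _, _ in hits}
    paired_vocab = {w for group in pair_groups for w in group}
    satisfied = {w for group in pair_groups if set(group) <= found for w in group}
    # Words outside every group are treated as standalone sensitive words (assumption).
    return [h for h in hits if h[0] not in paired_vocab or h[0] in satisfied]

pairs = [("WeChat", "part-time")]  # parsed form of the entry "WeChat & part-time"
print(filter_paired([("WeChat", 0, 6)], pairs))
print(filter_paired([("WeChat", 0, 6), ("part-time", 10, 19)], pairs))
```

The hit tuples use the same `(word, start, end_exclusive)` shape as a dictionary tree scan would produce, so the positions of the surviving words remain available for masking.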
In a possible implementation manner of this embodiment of the present application, before identifying a plurality of sensitive words of a text to be detected using a sensitive word dictionary tree, the method further includes:
acquiring the scene category of the sensitive word recognition and black-and-white list data;
determining a target sensitive word dictionary tree from mapping relations based on scene categories, wherein the mapping relations are mapping relations of a plurality of scene categories and a plurality of sensitive word dictionary trees;
and adding and deleting the target sensitive word dictionary tree according to the black-and-white list data to obtain the sensitive word dictionary tree.
The black-and-white list data can be user-defined setting for dynamically adjusting the target sensitive word dictionary tree corresponding to the scene category. The blacklist data is used for supplementing the sensitive word dictionary tree, and the whitelist data is used for deleting the sensitive word dictionary tree. The scene categories include: information scenes, medical scenes, educational scenes, etc. There may be overlapping of several sensitive word dictionary trees corresponding to each scene category, which is not limited in the embodiment of the present application.
The method comprises the steps that different scene categories correspond to different sensitive word dictionary trees, a target sensitive word dictionary tree corresponding to the scene category identified by the current sensitive word is determined through preset mapping relations of the scene categories and the sensitive word dictionary trees, wherein the target sensitive word dictionary tree possibly relates to one or more sensitive words, and the target sensitive word dictionary tree is dynamically adjusted based on black and white list data, so that personal requirements and data structures can be kept synchronous without reconstructing the dictionary tree.
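Adjusting an existing dictionary tree in place from black-and-white list data can be sketched as below. The scene names, the example words, the `END` marker and the choice to apply the whitelist after the blacklist are assumptions for illustration.

```python
END = "__end__"  # marker key for "a sensitive word ends at this node"

def trie_add(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True

def trie_remove(trie, word):
    """Unmark `word` as a complete entry; returns False if it was not present."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return node.pop(END, False)

def trie_contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

def adjust_trie(scene_tries, scene, blacklist=(), whitelist=()):
    """Look up the scene's tree and edit it in place, without rebuilding it."""
    trie = scene_tries[scene]
    for word in blacklist:
        trie_add(trie, word)      # blacklist supplements the tree
    for word in whitelist:
        trie_remove(trie, word)   # whitelist deletes entries from the tree
    return trie

# Hypothetical mapping from scene category to its dictionary tree.
scene_tries = {"medical": {}, "education": {}}
trie_add(scene_tries["medical"], "cure")
trie = adjust_trie(scene_tries, "medical", blacklist=["remedy"], whitelist=["cure"])
print(trie_contains(trie, "remedy"), trie_contains(trie, "cure"))
```

Removal here only clears the end marker and leaves the now unused nodes in place, which keeps the edit cheap; pruning those nodes would be an optional refinement.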
The above embodiments describe the sensitive word recognition method from the perspective of the method flow. The following embodiments describe a sensitive word recognition apparatus from the perspective of modules or units, as detailed below.
An embodiment of the present application provides a sensitive word recognition apparatus, as shown in fig. 6, the apparatus may include:
an obtaining module 610, configured to obtain text data to be detected;
the recognition module 620 is configured to perform sensitive word recognition on the text data to be detected by using the multi-category sensitive word recognition model, so as to obtain a sensitive word corresponding to the text data to be detected and a position corresponding to the sensitive word;
the desensitization module 630 is configured to prompt a category of the sensitive word, and replace the sensitive word with a preset symbol according to a position corresponding to the sensitive word;
wherein the multi-category sensitive word recognition model comprises: an expansion (dilated) convolution layer, a feature extraction layer, a full-connection layer and a CRF layer, and the recognition module 620 is further configured to: determine a word vector sequence corresponding to the text data to be detected, wherein the word vector sequence can represent context information; perform an expansion convolution operation on the word vector sequence using the expansion convolution layer to obtain a first feature vector; perform random inactivation (dropout) and data transformation on the first feature vector using the feature extraction layer to obtain a second feature vector; perform sensitive word multi-classification analysis based on the second feature vector using the full-connection layer to obtain a plurality of sensitive word categories corresponding to the text data to be detected; and determine, according to the second feature vector, the position corresponding to the sensitive word and the corresponding sensitive word category using the CRF layer.
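To illustrate what the expansion convolution step contributes, the sketch below implements a one-dimensional dilated convolution in plain Python. It assumes that "expansion convolution" means dilated convolution as used in IDCNN, which the framework description names. The scalar inputs, the fixed kernel, and the function names `dilated_conv1d` and `receptive_field` are illustrative assumptions; a real model would operate on word vectors with learned kernels.

```python
from typing import List, Sequence


def dilated_conv1d(seq: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Same-length 1-D dilated convolution with zero padding.

    As in most deep learning libraries, this is a cross-correlation
    (the kernel is not flipped). The kernel size is assumed to be odd.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(seq)):
        total = 0.0
        for k, weight in enumerate(kernel):
            j = i + (k - half) * dilation  # taps are spaced `dilation` apart
            if 0 <= j < len(seq):          # positions outside the input count as zero
                total += weight * seq[j]
        out.append(total)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input positions that can influence one output position
    after stacking one layer per entry in `dilations`."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

With a kernel of size 3, stacking layers with dilations 1, 1 and 2 gives each output position a view of 9 input positions, whereas three undilated layers would cover only 7. This is how a dilated stack captures wider context without adding parameters.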
In another implementation manner, the sensitive word recognition apparatus further includes:
the multi-category sensitive word recognition model training module is used for:
training an initial multi-category sensitive word recognition model based on a training set to obtain a trained multi-category sensitive word recognition model, wherein the training set comprises: the system comprises a plurality of sensitive word training samples and labels corresponding to the sensitive word training samples, wherein the labels comprise multi-category labels and sensitive word position labels;
based on the verification set, determining a loss value of the trained multi-class sensitive word recognition model, adjusting model parameters of the trained multi-class sensitive word recognition model according to the loss value and a preset loss value, and performing iterative training to obtain the multi-class sensitive word recognition model.
In one possible implementation, the multi-category sensitive word recognition model training module is further configured to:
in the iterative training process, when the variation amplitude of the loss value of the continuous preset round training is smaller than a preset amplitude threshold, updating the training set and the verification set according to the optimized sample, and continuing to perform iterative training according to the updated training set and verification set.
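The plateau condition that triggers a data update can be sketched as a small check on the loss history. The interpretation used here, that every round-to-round change over the last preset number of rounds must stay below the amplitude threshold, is an assumption, and the function name and parameters are hypothetical.

```python
from typing import Sequence


def loss_has_plateaued(losses: Sequence[float], rounds: int, amplitude_threshold: float) -> bool:
    """Return True when each of the last `rounds` changes in loss
    is smaller than `amplitude_threshold`."""
    if len(losses) < rounds + 1:
        return False  # not enough history to judge
    recent = losses[-(rounds + 1):]
    changes = [abs(later - earlier) for earlier, later in zip(recent, recent[1:])]
    return all(change < amplitude_threshold for change in changes)
```

In a training loop, a `True` result would be the signal to refresh the training and validation sets with the optimized samples before continuing the iterations.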
In one possible implementation manner, the sensitive word recognition apparatus further includes:
the determining module is used for determining a sensitive word recognition mode based on a triggering condition, where the triggering condition comprises at least one of the following: whether the text source of the text data to be detected is a network source; whether the recognition accuracy is greater than a preset accuracy threshold; and whether a specified recognition mode is detected in a user-triggered instruction;
when the sensitive word recognition mode comprises a semantic analysis mode, triggering the recognition module 620;
when the sensitive word recognition mode comprises a matching mode, triggering a matching recognition module;
and the matching recognition module is used for determining the sensitive word corresponding to the text data to be detected and the position corresponding to the sensitive word by using the sensitive word dictionary tree.
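The mode selection can be sketched as below, using the rules stated in claim 1: a network source or an accuracy above the threshold selects the semantic analysis mode, otherwise the matching mode, and a user instruction may specify either mode or both. The precedence given to the user instruction, the union of results when several conditions apply, and the fallback to the matching mode are assumptions, as are all names and the default threshold.

```python
from typing import Optional, Set

SEMANTIC = "semantic_analysis"
MATCHING = "matching"


def select_modes(
    user_modes: Optional[Set[str]] = None,
    source_is_network: Optional[bool] = None,
    recognition_accuracy: Optional[float] = None,
    accuracy_threshold: float = 0.9,  # hypothetical preset threshold
) -> Set[str]:
    """Return the set of recognition modes to trigger."""
    if user_modes:
        # Assumption: an explicit user choice overrides the other conditions.
        return set(user_modes)
    modes: Set[str] = set()
    if source_is_network is not None:
        modes.add(SEMANTIC if source_is_network else MATCHING)
    if recognition_accuracy is not None:
        modes.add(SEMANTIC if recognition_accuracy > accuracy_threshold else MATCHING)
    # Assumption: with no condition available, fall back to matching.
    return modes or {MATCHING}
```

Returning a set rather than a single value lets the caller trigger the recognition module, the matching recognition module, or both, which mirrors the wording "comprises a semantic analysis mode" and "comprises a matching mode".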
Referring to fig. 7, fig. 7 is a schematic diagram of the overall framework of a sensitive word recognition device according to an embodiment of the present application. The overall framework comprises a front end and a back end;
the front end comprises sensitive word detection management and a default sensitive word library; the back end comprises: a business system, an algorithm system, the Spring Boot framework, the Flask framework, MyBatis-Plus, Swagger, the IDCNN algorithm, the CRF algorithm, the DFA algorithm, MySQL and MongoDB.
The exact matching mode detects sensitive words by matching single sensitive words and combined sensitive words. The implementation flow of the semantic analysis mode is divided into five major modules: a data acquisition module, a data cleaning module, a text word vectorization module, a classifier module and a desensitization module. Both modes use MySQL and MongoDB for data storage, so that each module can read and store data flexibly, efficiently and without interruption through the two databases, which reduces memory overhead and allows the system to provide services more quickly.
In another implementation, when the sensitive word recognition mode comprises both the matching mode and the semantic analysis mode,
the sensitive word recognition device further comprises:
the supplementary information determining module is used for determining supplementary sensitive words of the text data to be detected and positions corresponding to the supplementary sensitive words according to the matching mode;
and the supplementary replacement module is used for replacing the supplementary sensitive words with preset symbols according to the positions corresponding to the supplementary sensitive words.
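When both modes run, the matching mode can supply hits that the semantic analysis mode did not cover. The sketch below assumes positions are `(start, end)` character offsets with an exclusive end and that the preset symbol replaces each character one for one. The function names and the sample text are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int]  # (start, end), end exclusive


def mask_spans(text: str, spans: Iterable[Span], symbol: str = "*") -> str:
    """Replace every character inside the given spans with `symbol`."""
    chars = list(text)
    for start, end in spans:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = symbol
    return "".join(chars)


def supplementary_spans(semantic: Iterable[Span], matching: Iterable[Span]) -> List[Span]:
    """Matching-mode hits that are not fully covered by semantic-mode hits."""
    covered: Set[int] = set()
    for start, end in semantic:
        covered.update(range(start, end))
    return [(s, e) for s, e in matching if not set(range(s, e)) <= covered]


def desensitize(text: str, semantic: List[Span], matching: List[Span], symbol: str = "*") -> str:
    """Mask semantic-mode hits first, then the supplementary matching-mode hits."""
    masked = mask_spans(text, semantic, symbol)
    return mask_spans(masked, supplementary_spans(semantic, matching), symbol)
```

Replacing one character with one symbol keeps the text length unchanged, so positions found by the matching mode remain valid after the first replacement pass.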
In another implementation, the matching identification module includes:
the recognition unit is used for recognizing a plurality of sensitive words of the text data to be detected by using the sensitive word dictionary tree;
the determining unit is used for determining whether paired sensitive words matched with a preset associated word set exist from a plurality of sensitive words;
and the screening unit is used for screening a plurality of sensitive words according to the paired sensitive words to obtain final effective sensitive words and the sensitive word positions.
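The three units above can be illustrated with a small screening function. The text does not spell out the screening rule, so this sketch assumes the following: a word that belongs to a preset associated pair counts as effective only if one of its partners is also hit, and a word outside every pair is kept as is. The names `screen_hits` and `associated_pairs` and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Hit = Tuple[str, int, int]  # (word, start, end), end exclusive


def screen_hits(hits: List[Hit], associated_pairs: Iterable[Tuple[str, str]]) -> List[Hit]:
    """Keep only the hits that are effective under the pairing rule."""
    found: Set[str] = {word for word, _, _ in hits}
    partners: Dict[str, Set[str]] = {}
    for first, second in associated_pairs:
        partners.setdefault(first, set()).add(second)
        partners.setdefault(second, set()).add(first)
    effective = []
    for hit in hits:
        required = partners.get(hit[0])
        # Unpaired words stand on their own (assumption); paired words
        # need at least one partner among the words that were hit.
        if required is None or required & found:
            effective.append(hit)
    return effective
```

The `hits` list would come from the dictionary tree lookup, and the returned list carries the final effective sensitive words together with their positions.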
In another implementation manner, the sensitive word recognition apparatus further includes:
the sensitive word dictionary tree acquisition module is used for:
acquiring scene category and black-and-white list data identified by sensitive words;
determining a target sensitive word dictionary tree from mapping relations based on scene categories, wherein the mapping relations are mapping relations of a plurality of scene categories and a plurality of sensitive word dictionary trees;
And adding and deleting the target sensitive word dictionary tree according to the black-and-white list data to obtain the sensitive word dictionary tree.
In another implementation manner, the sensitive word recognition apparatus further includes:
the judging module is used for judging whether the number of sentences of the text data to be detected exceeds a first preset number threshold value; if yes, triggering a slicing module;
the slicing module is used for slicing the text data to be detected according to the relevance among sentences and a second preset quantity threshold value to obtain a plurality of text slices;
the identification module 620 includes:
and the recognition unit is used for sequentially carrying out sensitive word recognition on the plurality of text slices by utilizing the multi-category sensitive word recognition model to obtain a sensitive word category corresponding to the text data to be detected and a position corresponding to the sensitive word.
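The judging and slicing steps can be sketched as follows. The relevance criterion is not specified in the text, so it is reduced here to a hypothetical `related` predicate that defaults to treating all adjacent sentences as related. The sentence splitter, the parameter names, and the offset bookkeeping are likewise illustrative assumptions.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(text: str) -> List[Tuple[int, str]]:
    """Return (offset, sentence) pairs; a sentence ends at . ! ? or 。！？"""
    pattern = r"[^.!?。！？]+[.!?。！？]*"
    return [(m.start(), m.group()) for m in re.finditer(pattern, text)]


def slice_text(
    text: str,
    sentence_limit: int,   # first preset number threshold
    slice_size: int,       # second preset number threshold (sentences per slice)
    related: Callable[[str, str], bool] = lambda a, b: True,
) -> List[Tuple[int, str]]:
    """Return (offset, slice) pairs; short texts come back as a single slice."""
    sentences = split_sentences(text)
    if len(sentences) <= sentence_limit:
        return [(0, text)]
    slices: List[Tuple[int, str]] = []
    current: List[Tuple[int, str]] = []
    for offset, sentence in sentences:
        full = len(current) >= slice_size
        unrelated = bool(current) and not related(current[-1][1], sentence)
        if current and (full or unrelated):
            # Close the slice when it is full or the topic appears to change.
            slices.append((current[0][0], "".join(s for _, s in current)))
            current = []
        current.append((offset, sentence))
    if current:
        slices.append((current[0][0], "".join(s for _, s in current)))
    return slices
```

Keeping the offset of each slice lets a position `(start, end)` found inside a slice be mapped back to the original text as `(offset + start, offset + end)` before desensitization.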
The device provided in the embodiment of the present application is applicable to the above method embodiment, and is not described herein again.
An embodiment of the present application provides an electronic device. As shown in fig. 8, the electronic device 800 includes a processor 801 and a memory 803. The processor 801 is coupled to the memory 803, for example via a bus 802. Optionally, the electronic device 800 may also include a transceiver 804. It should be noted that, in practical applications, the number of transceivers 804 is not limited to one, and the structure of the electronic device 800 does not constitute a limitation on the embodiments of the present application.
The processor 801 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logic blocks, modules, and circuits described in connection with the present disclosure. The processor 801 may also be a combination that implements computing functions, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and the like.
Bus 802 may include a path for transferring information between the aforementioned components. Bus 802 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. Bus 802 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean that there is only one bus or one type of bus.
The memory 803 may be, but is not limited to, a ROM (Read Only Memory) or another type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 803 is used to store the application program code for executing the solution of the present application, and execution is controlled by the processor 801. The processor 801 is configured to execute the application program code stored in the memory 803 to implement the content shown in the foregoing method embodiments.
Among them, electronic devices include, but are not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 8 is only an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present application.
The present application provides a computer readable storage medium having a computer program stored thereon, which when run on a computer, causes the computer to perform the corresponding method embodiments described above.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages. These sub-steps or stages are not necessarily performed at the same time and may be performed at different times, and they are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that a person skilled in the art may make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications shall also be considered to fall within the protection scope of the present application.

Claims (8)

1. A method for recognizing a sensitive word, comprising:
acquiring text data to be detected;
determining a sensitive word recognition mode based on the trigger condition; the triggering condition at least comprises one of the following: when the text source of the text data to be detected is a network source; identifying whether the accuracy is greater than a preset accuracy threshold; detecting a specified recognition mode in a user-triggered instruction; when the text source of the text data to be detected is a network source, determining that the sensitive word recognition mode is a semantic analysis mode, and otherwise, determining that the sensitive word recognition mode is a matching mode; when the recognition accuracy is larger than a preset accuracy threshold, determining that the sensitive word recognition mode is a semantic analysis mode, and otherwise, determining that the sensitive word recognition mode is a matching mode; the user-triggered instruction comprises a specified recognition mode, wherein the specified recognition mode is one or a combination of a semantic analysis mode and a matching mode;
when the sensitive word recognition mode comprises a semantic analysis mode, performing sensitive word recognition on the text data to be detected by utilizing a multi-category sensitive word recognition model to obtain a sensitive word category corresponding to the text data to be detected and a position corresponding to the sensitive word;
when the sensitive word recognition mode comprises a matching mode, determining a sensitive word category corresponding to the text data to be detected and a position corresponding to the sensitive word by using a sensitive word dictionary tree;
Prompting the category of the sensitive word, and replacing the sensitive word with a preset symbol according to the position corresponding to the sensitive word;
wherein the multi-category sensitive word recognition model comprises: the expansion convolution layer, the feature extraction layer, the full connection layer and the CRF layer, wherein the method for carrying out sensitive word recognition on the text data to be detected by utilizing the multi-category sensitive word recognition model comprises the following steps:
determining a word vector sequence corresponding to the text data to be detected, wherein the word vector sequence can represent context information;
performing expansion convolution operation by utilizing the expansion convolution layer according to the word vector sequence to obtain a first feature vector;
according to the first feature vector, the feature extraction layer is utilized to perform random inactivation and data transformation to obtain a second feature vector;
performing sensitive word multi-classification analysis by using the full-connection layer based on the second feature vector to obtain a plurality of sensitive word categories corresponding to the text data to be detected;
determining the corresponding position of the sensitive word and the corresponding sensitive word category by utilizing the CRF layer according to the second feature vector;
the determining, by using the sensitive word dictionary tree, a sensitive word category corresponding to the text data to be detected and a position corresponding to the sensitive word includes:
Identifying a plurality of sensitive words of the text data to be detected by using a sensitive word dictionary tree;
determining whether paired sensitive words matched with a preset associated word set exist from a plurality of sensitive words;
and screening the plurality of sensitive words according to the paired sensitive words to obtain final effective sensitive words and sensitive word positions.
2. The method of claim 1, wherein the multi-category sensitive word recognition model training process comprises:
training an initial multi-category sensitive word recognition model based on a training set to obtain a trained multi-category sensitive word recognition model, wherein the training set comprises: the system comprises a plurality of sensitive word training samples and labels corresponding to the sensitive word training samples, wherein the labels comprise multi-category labels and sensitive word position labels;
based on the verification set, determining a loss value of the trained multi-class sensitive word recognition model, adjusting model parameters of the trained multi-class sensitive word recognition model according to the loss value and a preset loss value, and performing iterative training to obtain the multi-class sensitive word recognition model.
3. The method of claim 2, further comprising:
In the iterative training process, when the variation amplitude of the loss value of the continuous preset round training is smaller than a preset amplitude threshold, updating the training set and the verification set according to the optimized sample, and continuing to perform iterative training according to the updated training set and verification set.
4. The method of claim 1, wherein when the sensitive word recognition pattern is a matching pattern and a semantic analysis pattern;
after replacing the sensitive word with the preset symbol according to the position corresponding to the sensitive word, the method further comprises:
determining the supplementary sensitive words of the text data to be detected and the positions corresponding to the supplementary sensitive words according to the matching mode;
and replacing the supplementary sensitive words with preset symbols according to the positions corresponding to the supplementary sensitive words.
5. The method for recognizing a sensitive word according to any one of claims 1 to 4, further comprising, after the obtaining the text data to be detected:
judging whether the sentence number of the text data to be detected exceeds a first preset number threshold value;
if yes, slicing the text data to be detected according to the relevance among sentences and a second preset quantity threshold value to obtain a plurality of text slices;
The step of performing sensitive word recognition on the text data to be detected by using a multi-category sensitive word recognition model to obtain a sensitive word category corresponding to the text data to be detected and a position corresponding to the sensitive word, includes:
and carrying out sensitive word recognition on the plurality of text slices in sequence by utilizing a multi-category sensitive word recognition model to obtain a sensitive word category corresponding to the text data to be detected and a position corresponding to the sensitive word.
6. A sensitive word recognition apparatus, comprising:
the acquisition module is used for acquiring text data to be detected;
the determining module is used for determining a sensitive word recognition mode based on the triggering condition; the triggering condition at least comprises one of the following: when the text source of the text data to be detected is a network source; identifying whether the accuracy is greater than a preset accuracy threshold; detecting a specified recognition mode in a user-triggered instruction; when the text source of the text data to be detected is a network source, determining that the sensitive word recognition mode is a semantic analysis mode, and otherwise, determining that the sensitive word recognition mode is a matching mode; when the recognition accuracy is larger than a preset accuracy threshold, determining that the sensitive word recognition mode is a semantic analysis mode, and otherwise, determining that the sensitive word recognition mode is a matching mode; the user-triggered instruction comprises a specified recognition mode, wherein the specified recognition mode is one or a combination of a semantic analysis mode and a matching mode;
When the sensitive word recognition mode comprises a semantic analysis mode, triggering a recognition module,
when the sensitive word recognition mode comprises a matching mode, triggering a matching recognition module,
the recognition module is used for recognizing the sensitive words of the text data to be detected by utilizing a multi-category sensitive word recognition model when the sensitive word recognition mode comprises a semantic analysis mode, so as to obtain the category of the sensitive words corresponding to the text data to be detected and the position corresponding to the sensitive words;
the matching recognition module is used for determining the category of the sensitive word corresponding to the text data to be detected and the position corresponding to the sensitive word by using the sensitive word dictionary tree;
the desensitization module is used for prompting the category of the sensitive word and replacing the sensitive word with a preset symbol according to the position corresponding to the sensitive word;
wherein the multi-category sensitive word recognition model comprises: the expansion convolution layer, the feature extraction layer, the full connection layer and the CRF layer, and the identification module is further used for: determining a word vector sequence corresponding to the text data to be detected, wherein the word vector sequence can represent context information; performing expansion convolution operation by utilizing the expansion convolution layer according to the word vector sequence to obtain a first feature vector; according to the first feature vector, the feature extraction layer is utilized to perform random inactivation and data transformation to obtain a second feature vector; performing sensitive word multi-classification analysis by using the full-connection layer based on the second feature vector to obtain a plurality of sensitive word categories corresponding to the text data to be detected; determining the corresponding position of the sensitive word and the corresponding sensitive word category by utilizing the CRF layer according to the second feature vector;
The matching recognition module comprises:
the identification unit is used for identifying a plurality of sensitive words of the text data to be detected by using the sensitive word dictionary tree;
the determining unit is used for determining whether paired sensitive words matched with a preset associated word set exist from a plurality of sensitive words;
and the screening unit is used for screening the plurality of sensitive words according to the paired sensitive words to obtain final effective sensitive words and sensitive word positions.
7. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to: a sensitive word recognition method according to any one of claims 1 to 5 is performed.
8. A computer readable storage medium storing at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the sensitive word recognition method of any one of claims 1 to 5.
CN202311322762.9A 2023-10-13 2023-10-13 Sensitive word recognition method, device, equipment and medium Active CN117077678B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311322762.9A CN117077678B (en) 2023-10-13 2023-10-13 Sensitive word recognition method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311322762.9A CN117077678B (en) 2023-10-13 2023-10-13 Sensitive word recognition method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN117077678A CN117077678A (en) 2023-11-17
CN117077678B true CN117077678B (en) 2023-12-29

Family

ID=88704551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311322762.9A Active CN117077678B (en) 2023-10-13 2023-10-13 Sensitive word recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117077678B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368535A (en) * 2018-12-26 2020-07-03 珠海金山网络游戏科技有限公司 Sensitive word recognition method, device and equipment
CN113705225A (en) * 2021-09-07 2021-11-26 北京北大方正电子有限公司 Sensitive word data processing method and device and electronic equipment
CN114186567A (en) * 2021-12-10 2022-03-15 广州华多网络科技有限公司 Sensitive word detection method and device, equipment, medium and product thereof
CN114298035A (en) * 2021-12-29 2022-04-08 电子科技大学广东电子信息工程研究院 Text recognition desensitization method and system thereof
CN114491034A (en) * 2022-01-24 2022-05-13 聚好看科技股份有限公司 Text classification method and intelligent device
CN115510500A (en) * 2022-11-18 2022-12-23 北京国科众安科技有限公司 Sensitive analysis method and system for text content

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
网页敏感词过滤与敏感文本分类***设计;李伟;;电脑知识与技术;16(08);第245-247页 *

Also Published As

Publication number Publication date
CN117077678A (en) 2023-11-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant