CN113590738A

CN113590738A - Method for detecting network sensitive information based on content and emotion

Info

Publication number: CN113590738A
Application number: CN202011447762.8A
Authority: CN
Inventors: 邓海刚; 徐本锡; 李超; 章森; 王正
Original assignee: Tianbo Electronic Information Technology Co ltd
Current assignee: Tianbo Electronic Information Technology Co ltd
Priority date: 2020-12-09
Filing date: 2020-12-09
Publication date: 2021-11-02

Abstract

The invention provides a method for detecting network sensitive information based on content and emotion, comprising four steps: (1) creating an emotion word bank from existing dictionary resources and calculating the emotion intensity of each term; (2) preprocessing the text whose emotional tendency is to be judged; (3) matching the semantic role pattern of each sentence and determining the weight of the sentence; and (4) comparing the emotion polarity value of the text calculated in the previous step with a set threshold to determine the emotional tendency of the text.

Description

Method for detecting network sensitive information based on content and emotion

Technical Field

The invention relates to the technical field of detection methods, in particular to a method for detecting network sensitive information based on content and emotion.

Background

Existing sensitive information detection methods are mainly based on information retrieval technology. Following the general flow of information detection, research on sensitive information detection is divided into query expansion, document indexing and information detection models. Query expansion semantically expands the words given by the user in several ways to form a set of expanded words for retrieval, thereby improving the recall and precision of detection. Document indexing extracts data from unstructured and semi-structured documents and reorganizes it so that it can be processed by a computer. Traditional document indexing does not consider the semantics among keywords, so its detection performance is poor. In recent years, semantic-based indexing, which adds the semantic concepts of keywords to the index and can therefore index similar or related concepts such as near-synonyms and synonyms, has become a research hotspot at home and abroad. Information detection searches all files for items that match the query and the words given by the user, and returns the documents containing the query items together with the positions at which they occur. The quality of detection depends on the detection model. Common information detection models are the Boolean model, the vector space model and the probability model, but none of these three models considers the semantic associations among keywords, and none can effectively handle uncertainty in the detection process. Moreover, these three approaches neither optimize the detection results nor consider the sensitive information carried by emotional semantics.

Disclosure of Invention

In view of this, the invention provides a method for detecting network sensitive information based on content and emotion to solve the existing problems, and the specific technical scheme is as follows:

A method for detecting network sensitive information based on content and emotion comprises the following four steps:

(1) According to existing dictionary resources, an emotion word bank is created and the emotion intensity of each term is calculated. Two different words are denoted W1 and W2, and their pointwise mutual information (PMI) is calculated by the following formula:

$$\mathrm{PMI}(W_1, W_2) = \log \frac{P(W_1, W_2)}{P(W_1)\,P(W_2)}$$

where P(W1, W2) is the probability that the two words co-occur and P(W1) and P(W2) are the probabilities that each word occurs on its own. If the result is positive, the two words tend to co-occur, and the larger the value, the stronger the correlation between them; if the result is negative, the two words essentially do not co-occur;

(2) The text whose emotional tendency is to be judged is preprocessed: the text is split into sentences, all emotion words it contains are extracted, and the polarity value of each emotion word is determined taking into account the influence of degree adverbs and negation words on the magnitude of its polarity. For example, in the phrase 'no longer beautiful', the negation reverses the commendatory emotion expressed by 'beautiful' into a derogatory one.

(3) The semantic role pattern of each sentence is matched and the weight of the sentence is determined; the emotion polarity value of the text is then determined by combining this weight with the polarity values of the emotion words from step (2).

A weight is set for each pattern: a pattern in which a positive object is the implementer is given a positive integer between 1 and 5 as its weight, and a pattern in which a negative object is the implementer is given a negative integer between -1 and -5.

(4) The emotion polarity value of the text calculated in the previous step is compared with a set threshold to determine the emotional tendency of the text.
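Steps (3) and (4) can be illustrated with a minimal sketch. It is not the patented implementation: the text does not state how sentence weights and polarity values are combined, so the weighted sum, the function names and the threshold value of 1.0 are all assumptions made for illustration.

```python
from typing import List, Tuple


def text_polarity(sentences: List[Tuple[float, int]]) -> float:
    """Combine (sentence_polarity, pattern_weight) pairs into one text score.

    Assumption: each sentence contributes polarity * weight; the source only
    says the two are combined, not how.
    """
    return sum(polarity * weight for polarity, weight in sentences)


def tendency(score: float, threshold: float = 1.0) -> str:
    """Step (4): compare the text score with a set threshold (value assumed)."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Two sentences: mildly positive with weight 2, strongly negative with weight 3.
score = text_polarity([(0.5, 2), (-1.0, 3)])
label = tendency(score)
```

With a signed weight, a derogatory sentence about a positively weighted object and a commendatory sentence about a negatively weighted object both lower the score, which is one possible reading of how the pattern weight is meant to act.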

Further, a degree adverb word bank and a negation word bank are created in step (1), and weights reflecting their influence on emotion intensity are assigned to their entries; a semantic pattern library is created and a weight is assigned to each pattern.

Further, the detection of sensitive information comprises the detection of forbidden words, the classification of pornographic and violent text, and the detection of inappropriate political comments.

Furthermore, in constructing the dictionary resources and the semantic pattern library, the emotional coloring of a text is taken to be determined by its adjectives, adverbs, emotion verbs and negation words and by the semantic composition pattern of its sentences, and a dedicated database is established for each of the emotion words, the degree adverbs, the negation words and the semantic pattern library.

Further, the entries in the semantic pattern library are subdivided into ten patterns: (1) positive implementer + action + positive recipient + modifier, (2) positive implementer + action + negative recipient + modifier, (3) negative implementer + action + positive recipient + modifier, (4) negative implementer + action + negative recipient + modifier, (5) positive implementer + action + modifier, (6) negative implementer + action + modifier, (7) action + positive recipient + modifier, (8) action + negative recipient + modifier, (9) positive object + modifier, (10) negative object + modifier.

By adopting the technical scheme, the method has the following beneficial effects:

In the research on content emotion analysis, semantic patterns and an emotion dictionary are combined: the emotion polarity of emotion words is calculated, and the effect of the semantic pattern on judging the emotional tendency of words, sentences and texts is taken into account, yielding a method for accurately judging the positive or negative emotional polarity of information content.

Drawings

FIG. 1 is a schematic structural diagram of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Example 1: a method for detecting network sensitive information based on content and emotion comprises the following four steps:

(1) creating an emotion word bank according to existing dictionary resources, calculating the emotion intensity of each term, denoting two different words by W1 and W2, and calculating their pointwise mutual information;

(3) matching the semantic role pattern of each sentence and determining the weight of the sentence, determining the emotion polarity value of the text by combining this weight with the polarity values of the emotion words from step (2), and setting a weight for each pattern: a pattern in which a positive object is the implementer is given a positive integer between 1 and 5, and a pattern in which a negative object is the implementer is given a negative integer between -1 and -5.

A degree adverb word bank and a negation word bank are created in step (1), and weights reflecting their influence on emotion intensity are assigned to their entries; a semantic pattern library is created and a weight is assigned to each pattern.

The detection of sensitive information comprises the detection of forbidden words, the classification of pornographic and violent text, and the detection of inappropriate political comments.

In constructing the dictionary resources and the semantic pattern library, the emotional coloring of a text is taken to be determined by its adjectives, adverbs, emotion verbs and negation words and by the semantic composition pattern of its sentences, and a dedicated database is established for each of the emotion words, the degree adverbs, the negation words and the semantic pattern library.

The entries in the semantic pattern library are subdivided into ten patterns: (1) positive implementer + action + positive recipient + modifier, (2) positive implementer + action + negative recipient + modifier, (3) negative implementer + action + positive recipient + modifier, (4) negative implementer + action + negative recipient + modifier, (5) positive implementer + action + modifier, (6) negative implementer + action + modifier, (7) action + positive recipient + modifier, (8) action + negative recipient + modifier, (9) positive object + modifier, (10) negative object + modifier.

In this embodiment, inappropriate political comments are the main focus of analysis. First, words related to political topics are collected to build a political object word bank. The words are labeled manually and divided into positive entities and negative entities, with numbers indicating how positive or negative each one is. Sentences containing political objects are then extracted from the text and evaluated by the emotional tendency analysis system. On this basis, if a positive entity is evaluated derogatorily, or a negative entity is evaluated commendably, the sentence is regarded as an inappropriate political comment and the text as possibly sensitive.
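The decision rule in this paragraph amounts to a sign comparison, sketched below. The function name and the numeric encoding (positive numbers for positive entities and commendatory evaluations, negative numbers for the opposite) are assumptions for illustration; treating a zero on either side as "no signal" is likewise assumed.

```python
def is_inappropriate_comment(entity_polarity: int, sentence_polarity: float) -> bool:
    """Flag a sentence when the evaluation runs against the entity's label.

    entity_polarity: manual label, > 0 for a positive entity, < 0 for a negative one.
    sentence_polarity: output of the emotional tendency analysis,
    > 0 commendatory, < 0 derogatory.
    """
    return entity_polarity * sentence_polarity < 0


checks = [
    is_inappropriate_comment(3, -1.5),   # positive entity, derogatory evaluation
    is_inappropriate_comment(-4, 2.0),   # negative entity, commendatory evaluation
    is_inappropriate_comment(3, 2.0),    # positive entity, commendatory evaluation
    is_inappropriate_comment(-4, -1.0),  # negative entity, derogatory evaluation
]
```

Only the first two cases are flagged; the last two are consistent with the manual labels and pass.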

The political object word bank is added as a new set of feature items to the original vector space model. In text vectorization, the weight of each feature item can be calculated in several ways; for convenience, absolute term frequency, as described in the fourth chapter, is used here. However, to emphasize the feature items that come from the political object word bank, each object in the word bank is assigned a weight in the range [-5, 5].
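A minimal sketch of such a vectorization follows. The text does not say how the object weight enters the vector, so multiplying the term frequency by the weight is an assumption, and the vocabulary (`obj_a`, `obj_b` and the other feature names) is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def vectorize(tokens: List[str], features: List[str],
              object_weights: Dict[str, int]) -> List[float]:
    """Absolute term-frequency vector; political-object features are scaled.

    Assumption: an object's feature value is its term frequency multiplied by
    its manually assigned weight in [-5, 5]; other features keep the raw count.
    """
    counts = Counter(tokens)
    return [float(counts[f] * object_weights.get(f, 1)) for f in features]


# Hypothetical vocabulary: "obj_a" and "obj_b" stand in for political objects.
features = ["good", "bad", "obj_a", "obj_b"]
object_weights = {"obj_a": 4, "obj_b": -3}
tokens = ["obj_a", "good", "obj_b", "obj_b", "good"]
vector = vectorize(tokens, features, object_weights)
```

Ordinary features keep their raw counts, while the two object features carry both frequency and the sign of their manual label.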

As shown in FIG. 1, most of the data required for identifying sensitive text content consists of text appearing in videos and the content of, and comments on, news web pages on the Internet. After the text data is received, it is preprocessed: redundant spaces and symbols are removed and the whole text is split into individual sentences. Each sentence is then segmented into words using word segmentation technology, and the result is searched for entries in the forbidden word bank. If a forbidden word is found, the sentence containing it is extracted and sensitive word processing is carried out. If no forbidden word is found, text classification is performed to judge whether the data belongs to categories such as pornography, violence and terror. Classification has two outcomes: text that belongs to these categories undergoes sensitive text identification and processing, while text that does not is passed to the final check for inappropriate political comments. Sentences identified in that check as containing inappropriate political content also undergo sensitive text identification and processing.

Only after all three checks have been passed and the text is determined to contain no sensitive content is normal transmission of the text data permitted.
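The three-stage flow described above can be sketched as follows. This is an illustration under stated assumptions rather than the system of FIG. 1: whitespace splitting stands in for Chinese word segmentation, the two classifiers are passed in as hypothetical callables, and the returned labels are invented for the example.

```python
import re
from typing import Callable, List, Set


def split_sentences(text: str) -> List[str]:
    """Collapse redundant whitespace, then split on sentence-ending punctuation."""
    cleaned = re.sub(r"\s+", " ", text).strip()
    return [s.strip() for s in re.split(r"[.!?。！？]", cleaned) if s.strip()]


def detect(text: str, forbidden: Set[str],
           is_violent_or_erotic: Callable[[List[str]], bool],
           is_bad_political: Callable[[List[str]], bool]) -> str:
    """Run the three checks in order; return the first that fires, else 'pass'."""
    # Whitespace splitting is a stand-in for a real word segmenter.
    sentences = [s.lower().split() for s in split_sentences(text)]
    if any(forbidden & set(tokens) for tokens in sentences):
        return "forbidden_word"
    all_tokens = [t for tokens in sentences for t in tokens]
    if is_violent_or_erotic(all_tokens):
        return "violent_or_erotic"
    if any(is_bad_political(tokens) for tokens in sentences):
        return "political"
    return "pass"


forbidden = {"badword"}
no = lambda tokens: False  # placeholder classifier that never fires
result_clean = detect("Hello   world. Nice day!", forbidden, no, no)
result_blocked = detect("This has a badword inside.", forbidden, no, no)
```

Text is transmitted normally only when `detect` returns `"pass"`, mirroring the requirement that all three checks be cleared.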

In this design, forbidden words are checked by word segmentation followed by direct lookup, the vector space model is used to classify the input so as to identify the pornography and violence categories, and finally the emotion analysis method is used to identify inappropriate political content.

Construction of the related dictionaries and the semantic pattern library

The emotional coloring of a text is usually determined by its adjectives, adverbs, emotion verbs and negation words and by the semantic composition pattern of its sentences. Accordingly, a dedicated database is established for each of the emotion words, the degree adverbs, the negation words and the semantic pattern library.

(1) Construction of the emotion word bank

The emotion word bank is built on three resources, namely the HowNet emotion dictionary, the CGI (Chinese common language) dictionary and the students' commendatory and derogatory dictionary, supplemented with vocabulary commonly used on the Internet in recent years. The HowNet emotion dictionary contains ten categories of emotion words, including positive and negative attributes, positive and negative emotions, and positive and negative evaluations, and is rich in content; the CGI dictionary and the commendatory and derogatory dictionary also record a large number of words with positive or negative emotional coloring. By combining the three resources, commonly seen emotion words are collected; taking into account the original annotations of each dictionary and the requirements of emotion analysis, the final emotion word bank is produced through successive rounds of comparison and screening. This word bank provides an important basis for calculating the emotional tendency of words, sentences and texts.
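Step (1) relates the words in this word bank to one another through pointwise mutual information (PMI). The sketch below computes the standard PMI from document-level co-occurrence counts; the logarithm base 2, the document-level probability estimates and the toy corpus are assumptions, since the text does not specify them.

```python
import math
from typing import List


def pmi(word1: str, word2: str, documents: List[List[str]]) -> float:
    """Pointwise mutual information from document-level co-occurrence counts.

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ), with each probability
    estimated as the fraction of documents containing the word(s).
    Returns -inf when the pair never co-occurs.
    """
    n = len(documents)
    sets = [set(doc) for doc in documents]
    p1 = sum(word1 in s for s in sets) / n
    p2 = sum(word2 in s for s in sets) / n
    p12 = sum(word1 in s and word2 in s for s in sets) / n
    if p1 == 0 or p2 == 0:
        raise ValueError("both words must occur in the corpus")
    if p12 == 0:
        return float("-inf")
    return math.log2(p12 / (p1 * p2))


# Toy corpus, invented for illustration.
docs = [["great", "excellent"], ["great", "excellent"], ["awful"], ["poor"]]
together = pmi("great", "excellent", docs)  # the pair always co-occurs
apart = pmi("great", "awful", docs)         # the pair never co-occurs
```

A positive value indicates that the two words tend to co-occur, and a negative value (here negative infinity) that they essentially do not.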

(2) Construction of the negation word bank

The presence of a negation word often reverses the emotional tendency of a sentence. For example, in the phrase 'no longer like', the negation 'no longer' reverses the commendatory emotion expressed by 'like' into a derogatory one. To account for this, a negation word bank is constructed containing the negation words commonly used in Chinese, as shown in the following table:

(3) Construction of the degree adverb word bank

Besides negation words, which affect the emotional tendency of a word, adverbs of degree also affect the magnitude of its emotional polarity. For example, in 'this novel is very good', the degree adverb 'very' modifies 'good' and makes the commendatory polarity of the whole sentence stronger.
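The effect of negation words and degree adverbs on an emotion word can be sketched as below. The lexicon entries and their numeric values are hypothetical placeholders, since the actual word banks and weights are not reproduced in the text, and applying a modifier only to the next emotion word is an assumed simplification.

```python
from typing import Dict, List

# Hypothetical entries; the real word banks and weights are not given in the text.
EMOTION: Dict[str, float] = {"like": 1.0, "good": 1.0, "bad": -1.0}
NEGATION = {"not", "no", "never"}
DEGREE: Dict[str, float] = {"very": 1.5, "slightly": 0.5}


def phrase_polarity(tokens: List[str]) -> float:
    """Sum emotion-word polarities, applying the modifiers that precede them.

    A negation word flips the sign and a degree adverb scales the magnitude of
    the next emotion word; modifiers reset once an emotion word is consumed.
    """
    total, sign, scale = 0.0, 1.0, 1.0
    for token in tokens:
        if token in NEGATION:
            sign = -sign
        elif token in DEGREE:
            scale *= DEGREE[token]
        elif token in EMOTION:
            total += sign * scale * EMOTION[token]
            sign, scale = 1.0, 1.0
    return total


plain = phrase_polarity(["like"])
negated = phrase_polarity(["no", "longer", "like"])
intensified = phrase_polarity(["very", "good"])
```

Here 'no longer like' flips the polarity of 'like', and 'very good' scores higher than 'good' alone.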

To express the adjusting effect of degree adverbs on the emotion intensity of the words they modify, a dedicated dictionary is established for them, and a weight is set for each degree adverb according to how strongly it affects the emotion word, as shown in the following table:

(4) Construction of the semantic pattern library

The semantic components of a sentence are divided into four: 'action', 'implementer', 'recipient' and 'modifier'. The 'action' is the central predicate verb of the sentence and also its main event; the 'implementer' is the subject that performs the action; the 'recipient' is the object on which the action is performed; and the 'modifier' covers the remaining semantic items, such as time, place, tool and degree, which do not affect the emotional tendency of the sentence.

With these simplified semantic role categories, the semantic mode of a sentence falls roughly into four classes, namely the subject-predicate-object mode, the subject-predicate mode, the predicate-object mode and the subject-description mode:

(1) Subject-predicate-object mode: implementer + action + recipient + modifier.

(2) Subject-predicate mode: implementer + action + modifier.

(3) Predicate-object mode: action + recipient + modifier.

(4) Subject-description mode: implementer + modifier.

In the fourth mode, since there is no predicate verb, its implementer is not the subject of the action, but represents the object modified by the modifier.
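The mapping from the roles present in a sentence to the four modes listed above can be sketched as a small function. The function name and the mode labels are illustrative choices, and the roles are assumed to have been extracted already by a semantic role labeling step that is not shown.

```python
from typing import Optional


def semantic_mode(implementer: Optional[str], action: Optional[str],
                  recipient: Optional[str]) -> str:
    """Map the roles present in a sentence to one of the four simplified modes."""
    if implementer and action and recipient:
        return "subject-predicate-object"
    if implementer and action:
        return "subject-predicate"
    if action and recipient:
        return "predicate-object"
    if implementer and not action and not recipient:
        # No predicate verb: the "implementer" is the object being described.
        return "subject-description"
    raise ValueError("roles do not match any of the four modes")


mode_full = semantic_mode("critic", "praised", "novel")
mode_no_subject = semantic_mode(None, "praised", "novel")
mode_description = semantic_mode("novel", None, None)
```

Modifiers are left out of the signature because they appear in all four modes and therefore do not help tell them apart.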

In sensitive text recognition, semantic and emotion analysis must be carried out on texts containing the names of political objects. Political objects can be divided into positive and negative categories, and in an emotional sentence that contains both positive and negative objects, the emotional tendency of the sentence is determined by the roles the objects occupy in the semantic pattern. Combining the four semantic modes described above with the semantic roles occupied by the different political objects, the entries in the semantic pattern library are further subdivided into ten patterns:

(1) Positive implementer + action + positive recipient + modifier.

(2) Positive implementer + action + negative recipient + modifier.

(3) Negative implementer + action + positive recipient + modifier.

(4) Negative implementer + action + negative recipient + modifier.

(5) Positive implementer + action + modifier.

(6) Negative implementer + action + modifier.

(7) Action + positive recipient + modifier.

(8) Action + negative recipient + modifier.

(9) Positive object + modifier.

(10) Negative object + modifier.

For these ten semantic role relations, a weight is set for each pattern: a pattern with a positive object as the implementer is given a positive integer between 1 and 5; a pattern with a negative object as the implementer is given a negative integer between -1 and -5; and for patterns without an implementer, a pattern with a positive object is given a positive integer between 1 and 2 and a pattern with a negative object is given a weight between -1 and -2. When calculating the emotion of a sentence, the emotional tendency is first calculated without considering the semantic pattern, and is then adjusted using the semantic pattern matched by the sentence and that pattern's weight, so that the final judgment is closer to the actual situation.
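The weight ranges stated above can be encoded as a lookup over the ten patterns. In this sketch the function name and the +1 / -1 / None encoding of the roles are assumptions; the text gives only the permitted ranges, not the specific weight chosen for each pattern, so the function returns the range rather than a single value.

```python
from typing import Optional, Tuple


def pattern_weight_range(implementer: Optional[int],
                         obj: Optional[int]) -> Tuple[int, int]:
    """Return the (min, max) integer weight permitted for a pattern.

    implementer / obj: +1 for a positive political object, -1 for a negative
    one, None when the role is absent. The implementer is the performer of
    the action; obj is the recipient or the described object.
    """
    if implementer is not None:   # patterns (1)-(6): an implementer is present
        return (1, 5) if implementer > 0 else (-5, -1)
    if obj is not None:           # patterns (7)-(10): no implementer
        return (1, 2) if obj > 0 else (-2, -1)
    raise ValueError("no political object in the sentence")


range_pattern_2 = pattern_weight_range(+1, -1)     # positive implementer, negative recipient
range_pattern_6 = pattern_weight_range(-1, None)   # negative implementer only
range_pattern_7 = pattern_weight_range(None, +1)   # action + positive recipient
range_pattern_10 = pattern_weight_range(None, -1)  # negative object + modifier
```

When an implementer is present its polarity alone decides the range, which matches the statement that the weight follows the object acting as the implementer.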

Having thus described the basic principles and principal features of the invention, it will be appreciated by those skilled in the art that the invention is not limited by the embodiments described above, which are given by way of illustration only, but that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.

Claims

1. A method for detecting network sensitive information based on content and emotion is characterized in that the whole process comprises the following four steps:

2. The method for detecting network sensitive information based on content and emotion as claimed in claim 1, wherein in the first step a degree adverb word bank and a negation word bank are created and weights reflecting their influence on emotion intensity are assigned to their entries; a semantic pattern library is created and a weight is assigned to each pattern.

3. The method for detecting network sensitive information based on content and emotion as claimed in claim 1, wherein the detection of sensitive information comprises the detection of forbidden words, the classification of pornographic and violent text, and the detection of inappropriate political comments.

4. The method as claimed in claim 1, wherein, in constructing the dictionary resources and the semantic pattern library, the emotional coloring of a text is taken to be determined by its adjectives, adverbs, emotion verbs and negation words and by the semantic composition pattern of its sentences, and a dedicated database is established for each of the emotion words, the degree adverbs, the negation words and the semantic pattern library.

5. The method for detecting network sensitive information based on content and emotion as claimed in claim 1, wherein the entries in the semantic pattern library are subdivided into ten patterns: (1) positive implementer + action + positive recipient + modifier, (2) positive implementer + action + negative recipient + modifier, (3) negative implementer + action + positive recipient + modifier, (4) negative implementer + action + negative recipient + modifier, (5) positive implementer + action + modifier, (6) negative implementer + action + modifier, (7) action + positive recipient + modifier, (8) action + negative recipient + modifier, (9) positive object + modifier, (10) negative object + modifier.