CN116244446A - Social media cognitive threat detection method and system - Google Patents

Publication number
CN116244446A
Authority
CN
China
Prior art keywords
cognitive
threat
emotion
text
cognitive threat
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211732859.2A
Other languages
Chinese (zh)
Inventor
李飞扬
姜迎畅
胡浩
吴疆
张玉臣
李炳龙
周洪伟
汪永伟
董书琴
谭晶磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of network security and relates to a social media cognitive threat detection method and system. The method collects sensitive topic text data from network platforms and preprocesses it; obtains cognitive threat topic texts from the preprocessed data through multi-level cognitive threat detection; constructs a cognitive threat propagation knowledge graph by performing named entity recognition and entity relation extraction on the cognitive threat topic texts; and, based on that knowledge graph, performs user tracing, event tracing, and organization tracing of the propagation of the cognitive threat topic texts. For topic texts related to specific topics and sensitive events, the invention identifies cognitive threats from the emotional tendency behind the texts. Compared with traditional manual evidence collection, it greatly shortens the identification cycle and increases the amount of information examined and the detection efficiency; it offers good security and feasibility, high threat-judgment accuracy, and a good detection effect, and is applicable to a wide range of scenarios.

Description

Social media cognitive threat detection method and system
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a social media cognitive threat detection method and system.
Background
Cognition refers to the process by which a person acquires or applies knowledge, that is, the process of information processing, which includes sensation, perception, memory, thinking, imagination, language, and the like. A cognitive threat feeds an individual information that is purposeful, inflammatory, covert, targeted, and untrue. By continuously influencing and solidifying the individual's cognitive process, it leads the individual to form distorted, abnormal, and contrary negative cognition, or alters the individual's existing normal cognitive system so that the individual deviates from the core social value system. The rise of self-media platforms and social platforms provides a hotbed for the propagation of cognitive threats, and cyberspace has very quickly become the main battlefield of cognitive threat confrontation. Social networks are characterized by anonymous identities, free speech, high real-time performance, and rapid propagation; most of their users are young people who are insensitive to social problems and are easily subjected to cognitive infiltration. Problems such as the high concealment of cognitive threats, the difficulty of tracing them, and the difficulty of cross-platform supervision remain to be solved, so techniques for identifying, tracing, and countering cognitive threat information are urgently needed in order to purify cyberspace.
Disclosure of Invention
Therefore, the invention provides a social media cognitive threat detection method and system. For topic texts related to specific topics and sensitive events, cognitive threats are identified from the emotional tendency behind the texts; compared with traditional manual evidence collection, the identification cycle is greatly shortened and the amount of information examined and the detection efficiency are improved.
According to the design scheme provided by the invention, the social media cognitive threat detection method comprises the following steps:
collecting sensitive topic text data of a network platform and preprocessing the data;
for the preprocessed sensitive topic text data, obtaining cognitive threat topic texts through multi-level cognitive threat detection, wherein the multi-level cognitive threat detection comprises: primary detection, which divides the sensitive topic text data into cognitive threat topic texts and initial suspected cognitive threat topic texts; intermediate detection, which classifies the initial suspected cognitive threat topic texts into cognitive threat topic texts, suspected cognitive threat topic texts, and non-cognitive threat topic texts; and final detection, which obtains cognitive threat topic texts from the suspected cognitive threat topic texts through manual labeling;
constructing a cognitive threat propagation knowledge graph by performing named entity recognition and entity relation extraction on the cognitive threat topic texts;
and performing user tracing, event tracing, and organization tracing on the propagation of the cognitive threat topic texts based on the cognitive threat propagation knowledge graph.
As the social media cognitive threat detection method of the invention, further, collecting sensitive topic text data of a network platform and preprocessing the data comprises:
first, collecting sensitive topic text information of the network platform and the related user data in a distributed manner according to a user authorization information base;
then, for the collected text information, merging the title and the body text, removing redundant information with a redundancy detection algorithm, deduplicating the related comments, cleaning and converting noisy text data, and segmenting the text into words with a word segmentation system.
As the social media cognitive threat detection method of the invention, further, in the primary detection, the sensitive topic text data is divided into cognitive threat topic texts and initial suspected cognitive threat topic texts by an emotion analysis method, wherein the emotion analysis method comprises:
first, constructing a basic emotion dictionary from a known emotion dictionary by applying a word frequency statistics method, and expanding the emotion dictionary by computing correlation statistics between words in the text data and words in the basic emotion dictionary;
second, taking each text in the sensitive topic text data as a unit and emotion words as separators, computing emotion weight statistics for the clause between each pair of separators, and judging the emotion polarity of the text from the proportion of negative emotion weight in the weight of all emotion words;
then, dividing the sensitive topic text data into cognitive threat topic texts and initial suspected cognitive threat topic texts according to the emotion polarity of each text.
As the social media cognitive threat detection method of the invention, further, constructing the basic emotion dictionary from a known emotion dictionary by applying a word frequency statistics method comprises:
first, selecting a series of emotion words from the known emotion dictionary, ranking them by the hit count each returns from a search engine, and selecting several emotion words according to that hit count;
then, selecting, based on word frequency statistics, the emotion words most strongly correlated with the topic, and forming the basic emotion dictionary from these words together with the emotion words selected by hit count;
then, expanding the basic emotion dictionary with synonyms and with candidate words that carry an emotional tendency.
As the social media cognitive threat detection method of the invention, further, computing emotion weight statistics for the clause between each pair of separators and judging the emotion polarity of the text from the proportion of negative emotion weight in the weight of all emotion words comprises:
first, for the clause between each pair of separators, computing its emotional tendency through emotion word analysis, negation word analysis, adverb analysis, fixed collocation analysis, adversative word analysis, and exclamatory sentence analysis;
then, computing the sum of the negative emotional tendency values of all clauses and the sum of the absolute values of all emotion weights in the text, and judging the emotion polarity of the text from the proportion of the negative emotion word weight in the weight of all emotion words of the text.
As the social media cognitive threat detection method of the invention, further, in the intermediate detection, the initial suspected cognitive threat topic texts are classified into cognitive threat topic texts, suspected cognitive threat topic texts, and non-cognitive threat topic texts by a deep learning method, and the classification process comprises:
constructing a deep learning model and pre-training it with a labeled training data set, wherein the deep learning model comprises a BERT model for producing word vector representations of the input and a BiLSTM model for performing cognitive threat detection on those word vectors;
inputting the initial suspected cognitive threat topic texts into the pre-trained deep learning model, obtaining a cognitive threat probability value from the model, and using that probability value to determine which of the initial suspected cognitive threat topic texts are cognitive threat topic texts, suspected cognitive threat topic texts, and non-cognitive threat topic texts.
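The mapping from a probability value to the three classes can be sketched as a two-threshold rule. This is only an illustration: the source does not state any cut-off values, so the `low` and `high` parameters and the label names below are assumptions.

```python
def triage(prob: float, low: float = 0.3, high: float = 0.7) -> str:
    """Map a cognitive threat probability to one of three classes.

    `low` and `high` are hypothetical cut-offs; the source gives no values.
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if prob >= high:
        return "threat"        # cognitive threat topic text
    if prob <= low:
        return "non_threat"    # non-cognitive threat topic text
    return "suspected"         # passed on to manual labeling
```

Texts that land in the middle band are the ones handed to the final, manual stage, so widening the band trades more annotation work for fewer automatic mistakes.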
As the social media cognitive threat detection method of the invention, further, obtaining cognitive threat topic texts from the preprocessed sensitive topic text data through multi-level cognitive threat detection further comprises: evaluating the degree of influence of the cognitive threat by applying the emotion analysis method to the overall emotional tendency of the comment section of each cognitive threat topic text.
As the social media cognitive threat detection method of the invention, further, constructing the cognitive threat propagation knowledge graph by performing named entity recognition and entity relation extraction on the cognitive threat topic texts comprises:
constructing a named entity extraction model and optimizing it with an adversarial training method, wherein the named entity extraction model comprises an encoder that maps input characters to a real-valued space and mines their latent semantics, a BiLSTM neural network layer that extracts contextual semantic information by capturing forward and backward features in the vectors produced by the encoder, and a conditional random field (CRF) layer that takes the bidirectional features extracted by the BiLSTM layer as input and generates a label for each character under the BIOES tagging scheme;
taking the cognitive threat topic texts as input to the optimized named entity extraction model, and using the model to identify the entity categories and relations in the cognitive threat topic texts.
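To make the BIOES tagging scheme concrete, the following minimal sketch turns a per-character tag sequence (as a CRF layer might emit) into entity spans. The entity type names `PER` and `ORG` and the sample sentence are illustrative assumptions, not taken from the source.

```python
def decode_bioes(chars, tags):
    """Collect (entity_text, entity_type) pairs from per-character BIOES tags."""
    entities = []
    start, current_type = None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":                      # single-character entity
            entities.append((chars[i], etype))
            start, current_type = None, None
        elif prefix == "B":                    # entity begins
            start, current_type = i, etype
        elif prefix == "I" and start is not None and etype == current_type:
            continue                           # inside the current entity
        elif prefix == "E" and start is not None and etype == current_type:
            entities.append(("".join(chars[start:i + 1]), etype))
            start, current_type = None, None
        else:                                  # "O" or a malformed sequence
            start, current_type = None, None
    return entities


chars = list("张三在微博发帖")
tags = ["B-PER", "E-PER", "O", "B-ORG", "E-ORG", "O", "O"]
print(decode_bioes(chars, tags))  # [('张三', 'PER'), ('微博', 'ORG')]
```

Malformed sequences (for example an `E-` tag with no preceding `B-`) are silently dropped here; a production decoder might instead log or repair them.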
As the social media cognitive threat detection method of the invention, further, in constructing the cognitive threat propagation knowledge graph by performing named entity recognition and entity relation extraction on the cognitive threat topic texts, two named entity extraction models connected in a pipeline are constructed, wherein the first model identifies the entities in the cognitive threat topic texts as a single-label multi-class classification task, and the second model takes the output of the first model as its input and identifies the relations between the entities as a multi-label multi-class classification task.
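The difference between the two task modes lies in how class scores are turned into predictions. As a hedged sketch, a single-label multi-class head picks exactly one class (softmax followed by argmax), while a multi-label head decides each class independently (sigmoid with a threshold). The label names and the 0.5 threshold below are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_single_label(logits, labels):
    """Single-label multi-class: exactly one label wins."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best]


def predict_multi_label(logits, labels, threshold=0.5):
    """Multi-label multi-class: every label is accepted or rejected on its own."""
    return [lab for lab, x in zip(labels, logits) if sigmoid(x) >= threshold]
```

A multi-label head suits relation extraction because one pair of entities can hold several relations at once, whereas a character belongs to only one entity tag.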
Further, the invention also provides a social media cognitive threat detection system, comprising a data acquisition server, multiple cognitive threat identification servers, a knowledge graph server, and a web server, wherein:
the data acquisition server is used for collecting sensitive topic text data of a network platform and preprocessing the data;
the multiple cognitive threat identification servers are used for obtaining cognitive threat topic texts from the preprocessed sensitive topic text data through multi-level cognitive threat detection, and specifically comprise: a primary identification server that divides the sensitive topic text data into cognitive threat topic texts and initial suspected cognitive threat topic texts; an intermediate identification server that classifies the initial suspected cognitive threat topic texts into cognitive threat topic texts, suspected cognitive threat topic texts, and non-cognitive threat topic texts; and a final identification server that obtains cognitive threat topic texts from the suspected cognitive threat topic texts through manual labeling;
the knowledge graph server is used for constructing a cognitive threat propagation knowledge graph by performing named entity recognition and entity relation extraction on the cognitive threat topic texts;
and the web server is used for performing user tracing, event tracing, and organization tracing on the propagation of the cognitive threat topic texts through a web interaction interface, based on the cognitive threat propagation knowledge graph.
The invention has the following beneficial effects:
Relying on social network platforms such as Weibo, Zhihu, and WeChat official accounts, the invention performs multidimensional emotion analysis on crawled sensitive topic texts and their comments to detect cognitive threat topic texts. It constructs a cognitive threat propagation knowledge graph by recognizing named entities related to cognitive threats and extracting the relations among them, and uses the knowledge graph to visualize user tracing, event tracing, and organization tracing of cognitive threat propagation. By exploring implicit relations, it predicts cognitive threat propagation, monitors key accounts, groups, organizations, and users in real time, and provides cognitive countermeasure strategy analysis. This blocks the propagation of cognitive threats, strengthens the supervision of network cognitive threats, and deters illegal online behavior related to cognitive threats, thereby effectively purifying cyberspace.
Description of the drawings:
FIG. 1 is a schematic diagram of the social media cognitive threat detection flow in an embodiment;
FIG. 2 is a schematic diagram of the cognitive threat detection and measurement flow in an embodiment;
FIG. 3 is a schematic diagram of the adversarial training flow in an embodiment;
FIG. 4 is a schematic diagram of the knowledge graph visualization construction hierarchy in an embodiment.
Detailed description:
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and the technical solutions.
The development of the network environment makes cognitive domain threats simpler and more feasible to carry out; they can be carried out alone or jointly, across multiple dimensions and levels, and thereby affect the values of society as a whole. Referring to FIG. 1, an embodiment of the present disclosure provides a social media cognitive threat detection method, comprising:
S101, collecting sensitive topic text data of a network platform and preprocessing the data;
S102, for the preprocessed sensitive topic text data, obtaining cognitive threat topic texts through multi-level cognitive threat detection, wherein the multi-level cognitive threat detection comprises: primary detection, which divides the sensitive topic text data into cognitive threat topic texts and initial suspected cognitive threat topic texts; intermediate detection, which classifies the initial suspected cognitive threat topic texts into cognitive threat topic texts, suspected cognitive threat topic texts, and non-cognitive threat topic texts; and final detection, which obtains cognitive threat topic texts from the suspected cognitive threat topic texts through manual labeling;
S103, constructing a cognitive threat propagation knowledge graph by performing named entity recognition and entity relation extraction on the cognitive threat topic texts;
S104, performing user tracing, event tracing, and organization tracing on the propagation of the cognitive threat topic texts based on the cognitive threat propagation knowledge graph.
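The three detection levels of S102 can be read as a cascade in which each stage only sees what the previous stage could not decide. The sketch below shows that control flow under stated assumptions: the three callables stand in for the emotion-analysis stage, the deep learning stage, and the human annotator, and the `low` and `high` cut-offs are hypothetical values not given in the source.

```python
def multi_level_detect(texts, primary_is_threat, threat_probability,
                       manual_review, low=0.3, high=0.7):
    """Return the texts confirmed as cognitive threats by the three-stage cascade.

    primary_is_threat(text) -> bool    # primary detection (emotion analysis)
    threat_probability(text) -> float  # intermediate detection (deep model)
    manual_review(text) -> bool        # final detection (manual labeling)
    """
    confirmed = []
    for text in texts:
        if primary_is_threat(text):
            confirmed.append(text)     # decided at the primary level
            continue
        p = threat_probability(text)   # only initial suspected texts reach here
        if p >= high:
            confirmed.append(text)     # decided at the intermediate level
        elif p > low and manual_review(text):
            confirmed.append(text)     # suspected text confirmed by a person
    return confirmed
```

Because the stages are passed in as functions, the cascade can be exercised with simple stubs before any dictionary or model exists.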
Relying on social network platforms such as Weibo, Zhihu, and WeChat official accounts, multidimensional emotion analysis is performed on the crawled sensitive topic texts and their comments to detect cognitive threat topic texts; a cognitive threat propagation knowledge graph is constructed by recognizing named entities related to cognitive threats and extracting the relations among them; and the knowledge graph is used to trace the users who propagate cognitive threats. Constructing the cognitive threat knowledge graph also makes it possible to predict cognitive threat propagation paths and to provide cognitive countermeasure strategy analysis. The method can identify threats accurately in the short texts typical of social media platforms and maintains high accuracy on long media texts, which holds news media accountable for what they publish and effectively deters irresponsible media outlets.
As a preferred embodiment, further, collecting sensitive topic text data of a network platform and preprocessing the data comprises:
first, collecting sensitive topic text information of the network platform and the related user data in a distributed manner according to a user authorization information base;
then, for the collected text information, merging the title and the body text, removing redundant information with a redundancy detection algorithm, deduplicating the related comments, cleaning and converting noisy text data, and segmenting the text into words with a word segmentation system.
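Two of the preprocessing operations listed above, merging the title with the body text and deduplicating comments, can be sketched in a few lines. The normalization rule used to decide that two comments are duplicates (case folding and whitespace collapsing) is an assumption; the source does not say how duplicates are detected.

```python
import re


def merge_title_and_body(title: str, body: str) -> str:
    """Join the title and body text into one document, title first."""
    return f"{title.strip()}\n{body.strip()}"


def normalize(comment: str) -> str:
    """Hypothetical duplicate key: lower-case and collapse whitespace."""
    return re.sub(r"\s+", " ", comment).strip().lower()


def dedupe_comments(comments):
    """Keep the first occurrence of each comment, dropping blanks and repeats."""
    seen = set()
    unique = []
    for comment in comments:
        key = normalize(comment)
        if key and key not in seen:
            seen.add(key)
            unique.append(comment.strip())
    return unique
```

Order is preserved so that the earliest copy of a repeated comment survives, which keeps timestamps and reply chains meaningful downstream.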
Massive sensitive topic text data can be obtained from social platforms such as Weibo and WeChat official accounts through their API interfaces; the data comprises ten fields, including article title, body text, and comments. To facilitate further processing, the title and body text are merged, redundant information is removed by a redundancy detection algorithm, comments are deduplicated, noisy data is cleaned and converted, and the text is segmented into words with ICTCLAS. The redundancy detection algorithm for removing redundant information can be designed to comprise the following steps:
Step 1: split the text into sentences at punctuation marks;
Step 2: take the first 5 sentences of the segmented article; delete any sentence that contains phrases such as 'follow us', 'click', or 'font', and keep the rest;
Step 3: take the first 10 sentences of the segmented article; delete any sentence that contains text such as 'editor', 'initial review', or 'click to view', and keep the rest;
Step 4: recombine the retained sentences into the text.
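The four steps above can be sketched as follows. The phrase lists are illustrative English stand-ins for the boilerplate phrases named in the steps, and the sketch assumes that the "first 5" and "first 10" windows both refer to positions in the originally segmented article; the source does not settle either point.

```python
import re

# Hypothetical phrase lists; a real deployment would use platform-specific phrases.
PROMO_PHRASES = ("follow us",)
EDITORIAL_PHRASES = ("editor", "initial review", "click to view")

# A sentence is a run of non-terminator characters plus an optional terminator.
SENTENCE = re.compile(r"[^。！？!?；;.]+[。！？!?；;.]?")


def split_sentences(text):
    """Step 1: split the text into sentences at punctuation marks."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


def remove_redundant(text, promo=PROMO_PHRASES, editorial=EDITORIAL_PHRASES,
                     joiner=" "):
    """Steps 2 to 4: drop boilerplate sentences near the top, then recombine."""
    kept = []
    for i, sentence in enumerate(split_sentences(text)):
        lowered = sentence.lower()
        if i < 5 and any(p in lowered for p in promo):        # Step 2
            continue
        if i < 10 and any(p in lowered for p in editorial):   # Step 3
            continue
        kept.append(sentence)
    return joiner.join(kept)                                  # Step 4
```

For Chinese text, passing `joiner=""` avoids inserting spaces between recombined sentences.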
It should be noted that the data collected in this scheme can be sensitive topic text data from any social network platform: it can be collected from the Weibo platform, or from other social platforms such as WeChat and WeChat official accounts. Taking Weibo as the main example, the data acquisition process can comprise user authorization, acquisition of newly published posts, updating of post information, and acquisition of user information. User authorization is completed through OAuth2, and the acquisition of newly published posts, the updating of post information, and the acquisition of user information are completed by automatically calling the API (application program interface) officially published by Weibo.
In the primary detection, the sensitive topic text data can be divided into cognitive threat topic texts and initial suspected cognitive threat topic texts by an emotion analysis method, wherein the emotion analysis method comprises:
first, constructing a basic emotion dictionary from a known emotion dictionary by applying a word frequency statistics method, and expanding the emotion dictionary by computing correlation statistics between words in the text data and words in the basic emotion dictionary;
second, taking each text in the sensitive topic text data as a unit and emotion words as separators, computing emotion weight statistics for the clause between each pair of separators, and judging the emotion polarity of the text from the proportion of negative emotion weight in the weight of all emotion words;
then, dividing the sensitive topic text data into cognitive threat topic texts and initial suspected cognitive threat topic texts according to the emotion polarity of each text.
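Using emotion words as separators means that a tokenized text is cut immediately after each emotion word, so every resulting clause ends with one emotion word and carries the modifiers that precede it. A minimal sketch, assuming the text is already word-segmented and that tokens after the last emotion word carry no emotion weight:

```python
def split_on_emotion_words(tokens, lexicon):
    """Cut a token list into clauses, each ending with an emotion word.

    `lexicon` is any container of emotion words (for example a dict that
    maps each word to its polarity). Trailing tokens with no emotion word
    are dropped, which is a simplifying assumption of this sketch.
    """
    clauses, current = [], []
    for token in tokens:
        current.append(token)
        if token in lexicon:
            clauses.append(current)
            current = []
    return clauses


tokens = ["this", "is", "not", "good", "but", "very", "bad", "indeed"]
print(split_on_emotion_words(tokens, {"good": 1, "bad": -1}))
# [['this', 'is', 'not', 'good'], ['but', 'very', 'bad']]
```

Keeping the modifiers with the emotion word they precede is what lets negation and degree adverbs be applied clause by clause.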
Multidimensional emotion analysis is the process of analyzing subjective, emotionally colored text along several dimensions (emotion polarity, emotion degree, and emotion type) using natural language processing and text mining techniques. Emotion analysis is an important research direction in the NLP field: correct and effective emotion analysis can quickly extract the positive or negative emotion that people express in text, reveal the emotional tendency behind the text, and thereby separate out, from a large volume of information, hidden political threats and cognitive threats of a culturally infiltrating nature. By analysis granularity, emotion analysis tasks can be divided into document level, sentence level, and word or phrase level; by the kind of text processed, into text-based emotion analysis and comment-based emotion analysis; and by research task, into sub-problems such as emotion classification, emotion retrieval, and emotion extraction. In this embodiment, FIG. 2 shows the basic flow of cognitive threat identification and measurement through dynamic expansion of the emotion dictionary and deep learning.
Methods for emotion classification generally fall into two groups, those based on emotion dictionaries and those based on deep learning; each has its own characteristics and its own shortcomings. A dictionary-based method uses an emotion dictionary annotated with emotion polarity to quantify the emotion polarity of a text. It classifies with a set of rules and the dictionary: words in the dictionary are matched against words in the text to be analyzed, an emotion value is computed for each sentence, and that value serves as the basis for classifying the sentence's emotional tendency. Although the accuracy of this approach is relatively high, constructing the emotion dictionary is costly, and the approach ignores the relations between words in the text and lacks word sense information. A deep-learning-based method treats emotion classification as a special kind of text classification and classifies text using manual annotation and machine learning: it takes manually labeled data and labels and then applies a learning method to analyze the emotion of the text. Common machine learning methods include naive Bayes (NB), decision trees, and support vector machines (SVM). The effectiveness of this approach depends mainly on the quantity and quality of the manually labeled data, so it is strongly affected by human subjectivity and consumes a large amount of labor.
In view of the characteristics of the two approaches, this scheme combines and optimizes the dictionary-based and deep-learning-based methods and proposes a multidimensional emotion analysis method that combines a dynamically expanded emotion dictionary with deep learning, thereby compensating for the shortcomings of each and achieving higher accuracy.
Constructing the basic emotion dictionary from a known emotion dictionary by applying a word frequency statistics method comprises:
first, selecting a series of emotion words from the known emotion dictionary, ranking them by the hit count each returns from a search engine, and selecting several emotion words according to that hit count;
then, selecting, based on word frequency statistics, the emotion words most strongly correlated with the topic, and forming the basic emotion dictionary from these words together with the emotion words selected by hit count;
then, expanding the basic emotion dictionary with synonyms and with candidate words that carry an emotional tendency.
Further, computing emotion weight statistics for the clause between each pair of separators and judging the emotion polarity of the text from the proportion of negative emotion weight in the weight of all emotion words comprises:
first, for the clause between each pair of separators, computing its emotional tendency through emotion word analysis, negation word analysis, adverb analysis, fixed collocation analysis, adversative word analysis, and exclamatory sentence analysis;
then, computing the sum of the negative emotional tendency values of all clauses and the sum of the absolute values of all emotion weights in the text, and judging the emotion polarity of the text from the proportion of the negative emotion word weight in the weight of all emotion words of the text.
A series of emotion words can be selected from the HowNet knowledge base and input to a search engine one by one; the words are sorted by the hit count (hits value) returned by the search engine, and the emotion words with the highest hit counts are selected as general basic emotion words. In addition, a method based on word frequency statistics is used to semi-automatically select basic emotion words that are highly relevant to the theme, and the two groups together form the basic emotion dictionary. Because most words carrying emotion in a text are adjectives, verbs and some nouns, word frequency statistics is carried out only for these words, on a preprocessed text set with enough entries; from the words with higher word frequency, the 20 positive emotion words and the 20 negative emotion words with the highest frequencies are then selected and combined with the general basic emotion words to form the basic emotion dictionary.
Since the words in the basic emotion dictionary express strong emotion tendencies, negative emotion words in the basic emotion dictionary can be assigned an emotion tendency value of -1. The basic emotion dictionary contains relatively few words and cannot cover all words with emotion tendencies that appear in the text set, so it needs to be expanded to construct a relatively complete emotion dictionary. It can be expanded by adding synonyms and by adding candidate words with emotion tendency.
Adding synonyms helps to identify emotion words more widely, and an existing synonym lexicon is used to expand the basic emotion dictionary with synonyms. However, to improve the performance of the emotion tendency calculation, the common synonyms still need to be screened manually. After expansion, the number of words in the emotion dictionary increases to 256, and the emotion tendency value of the synonyms of negative emotion words can be set to -1.
It is very difficult to construct a complete emotion dictionary with no omissions. However, by analyzing the correlation between each word in the text set and the words in the emotion dictionary, and incorporating highly correlated words into the dictionary, an emotion dictionary with wider coverage can be constructed effectively.
The emotional relevance of each word in the candidate word dictionary may be calculated by the pointwise mutual information (PMI) method to determine whether to add it to the emotion dictionary. The PMI method calculates the relevance between words based on mutual information theory. The basic idea is to measure the probability that two words wordi and wordj co-occur in a text: the larger the co-occurrence probability, the higher the correlation of the two words. The calculation formula is as follows:
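$$\mathrm{PMI}(\mathrm{word}_i, \mathrm{word}_j) = \log \frac{P(\mathrm{word}_i \,\&\, \mathrm{word}_j)}{P(\mathrm{word}_i)\, P(\mathrm{word}_j)}$$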
where P(wordi & wordj) is the probability that wordi and wordj co-occur in the text, calculated as follows:
$$P(\mathrm{word}_i \,\&\, \mathrm{word}_j) = \frac{\mathrm{numSentence}(\mathrm{word}_i, \mathrm{word}_j)}{n}$$
where n is the total number of clauses in the text and numSentence(wordi, wordj) is the number of clauses containing both wordi and wordj. P(wordi) and P(wordj) are the proportions of clauses containing wordi and wordj, respectively, in the total number of clauses. The calculation formulas are as follows:
$$P(\mathrm{word}_i) = \frac{\mathrm{numSentence}(\mathrm{word}_i)}{n}$$
$$P(\mathrm{word}_j) = \frac{\mathrm{numSentence}(\mathrm{word}_j)}{n}$$
where numSentence(wordi) is the number of clauses in the text that contain wordi. PMI(wordi, wordj) represents the amount of information about one variable that is obtained when the other variable appears, and reflects the statistical correlation between wordi and wordj: when the PMI is greater than 0, the two words are correlated, and the larger the PMI value, the stronger the correlation; when the PMI is 0, the two words are statistically independent; when the PMI is less than 0, the two words tend to be mutually exclusive.
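The clause-level PMI described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `pmi`, the representation of clauses as sets of tokens, the base-2 logarithm and the handling of pairs that never co-occur are assumptions, since the source does not specify them.

```python
import math

def pmi(clauses, word_i, word_j):
    """Clause-level PMI; `clauses` is a list of token sets (assumed format)."""
    n = len(clauses)
    n_i = sum(1 for c in clauses if word_i in c)
    n_j = sum(1 for c in clauses if word_j in c)
    n_ij = sum(1 for c in clauses if word_i in c and word_j in c)
    if n == 0 or n_i == 0 or n_j == 0 or n_ij == 0:
        # Assumption: pairs that never co-occur get -inf (log of zero).
        return float("-inf")
    p_i, p_j, p_ij = n_i / n, n_j / n, n_ij / n
    # Base 2 is an assumption; the source does not state the log base.
    return math.log2(p_ij / (p_i * p_j))

# Toy corpus: "good" and "happy" always appear together.
toy_clauses = [{"good", "happy"}, {"good", "happy"}, {"bad"}, {"sad"}]
toy_score = pmi(toy_clauses, "good", "happy")
```

A positive score indicates that the two words co-occur more often than independence would predict, which is the criterion the text uses for candidate words.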
When related words are added, some words without emotion tendency co-occur very frequently with positive or negative emotion words; adding them would introduce errors into the emotion dictionary, cause meaningless performance overhead in emotion classification and reduce accuracy. To solve this problem and improve the efficiency of the dictionary expansion algorithm, the SO-PMI value of each candidate word is calculated, specifically as follows: the PMI values between the candidate word and the positive words of the basic dictionary are calculated, the PMI values between the candidate word and the negative words are calculated, and the SO-PMI value of the candidate word is obtained as the difference between the two sums. The calculation formula is as follows:
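$$\mathrm{SO\text{-}PMI}(\mathrm{word}) = \sum_{\mathrm{posWord} \in \mathrm{posWords}} \mathrm{PMI}(\mathrm{word}, \mathrm{posWord}) - \sum_{\mathrm{negWord} \in \mathrm{negWords}} \mathrm{PMI}(\mathrm{word}, \mathrm{negWord})$$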
The relationship between the SO-PMI value and the emotion tendency can be set as follows:
Figure SMS_5
In summary, the emotion dictionary expansion method is summarized as follows:
posWords:
if word is a positive word in the basic emotion dictionary, then word is incorporated into posWords;
if word is a synonym of a certain positive word in the basic emotion dictionary, then word is incorporated into posWords;
if word satisfies word.property ∈ {a, d, an, ag, al} or word.property ∈ {vn, vd, vi, vg, vl}, and 1.36 < SO-PMI(word) < 23, then word is incorporated into posWords.
Similarly, negWords:
if word is a negative word in the basic emotion dictionary, then word is incorporated into negWords;
if word is a synonym of a negative word in the basic emotion dictionary, then word is incorporated into negWords;
if word satisfies word.property ∈ {a, d, an, ag, al} or word.property ∈ {vn, vd, vi, vg, vl}, and -16 < SO-PMI(word) < -1, then word is incorporated into negWords.
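The expansion rules above can be sketched as follows. This is a hedged illustration, not the patented implementation: the function names `so_pmi` and `assign_polarity`, the `synonyms` mapping and the idea of passing PMI in as a callable are assumptions made so the example is self-contained; the part-of-speech tag sets and the SO-PMI thresholds are taken from the text.

```python
ADJ_ADV_TAGS = {"a", "d", "an", "ag", "al"}
VERB_TAGS = {"vn", "vd", "vi", "vg", "vl"}

def so_pmi(word, pos_words, neg_words, pmi):
    """Sum of PMI with positive seeds minus sum of PMI with negative seeds."""
    return (sum(pmi(word, p) for p in pos_words)
            - sum(pmi(word, q) for q in neg_words))

def assign_polarity(word, tag, base_pos, base_neg, synonyms, score):
    """Return 'pos', 'neg' or None for a candidate word.

    `synonyms` maps a word to a set of its synonyms (hypothetical structure);
    `score` is the SO-PMI value of the word.
    """
    related = synonyms.get(word, set())
    if word in base_pos or related & base_pos:
        return "pos"
    if word in base_neg or related & base_neg:
        return "neg"
    if tag in ADJ_ADV_TAGS or tag in VERB_TAGS:
        if 1.36 < score < 23:      # positive range from the text
            return "pos"
        if -16 < score < -1:       # negative range from the text
            return "neg"
    return None                    # not added to either list

# Toy PMI table standing in for corpus statistics.
toy_table = {("w", "p1"): 2.0, ("w", "p2"): 1.0, ("w", "n1"): 0.5}
toy_so_pmi = so_pmi("w", ["p1", "p2"], ["n1"], lambda a, b: toy_table[(a, b)])
```

Candidates whose SO-PMI falls outside both ranges, or whose part of speech is not in the listed tag sets, are simply left out of the dictionary.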
Based on the emotion dictionary, each text sentence S is taken as a unit and each emotion word WS in the sentence is taken as a separator; emotion weight calculation is carried out on the clause (WSi-1, WSi) between two separators, where the clause (WSi-1, WSi) contains the word WSi but does not contain the word WSi-1. The model consists of 5 modules, which are respectively: emotion word analysis, negative word analysis, adverb analysis, fixed collocation word and sentence analysis, turning word analysis and exclamation sentence analysis.
Emotion word analysis: for each word in the text to be analyzed, the emotion dictionary is scanned to judge whether the word exists in it. If so, the word is regarded as an emotion word, and its emotion tendency value is read from the emotion dictionary and returned; if not, the word is regarded as a neutral word and 0 is returned. The process loops until every word of the whole text set has been judged. By calculating the emotion tendency value of each word, the actual emotion words (i.e., words with weight not equal to 0) are obtained, and emotion words that do not affect the emotion of the particular sentence (i.e., emotion words with weight equal to 0) are filtered out.
Negative word analysis: for a sentence in which the emotion word WSi appears, the number negNum(WSi-1, WSi) of negative words between WSi and the preceding separator WSi-1 (i.e., within one clause) is counted. If negNum is odd, the emotion value of the clause is the inverted emotion tendency value of the emotion word; otherwise, the original emotion tendency value is kept.
Adverb analysis: whether a word is in the adverb dictionary is judged; if so, the emotion intensity of the adverb is obtained from the adverb dictionary, and the corresponding weight is multiplied by the current emotion tendency value of the clause to obtain the emotion weight of the clause.
Turning word analysis: if a turning word is scanned, weight(WSi-1, WSi) is inverted so that the emotion tendency of the clause (WSi-1, WSi) is biased toward that of the clause (WSi, WSi+1) following the turning word.
Exclamation sentence analysis: the exclamation mark "!" is used as the marker of an exclamation sentence and is denoted exc. The emotion weight is calculated as follows: when an exclamation mark is scanned, the emotion word WSi-1 closest to the exclamation mark is searched for from back to front, and the emotion tendency value of WSi-1 is taken as the weight of exc.
The sum weight(S) of the negative emotion tendency values of all clauses contained in a text S and the sum total(S) of the absolute values of all emotion weights are calculated, and from them the proportion scale(S) of negative emotion word weight in all emotion word weights of the text. The emotion polarity of the text S is judged according to scale(S), and a preliminary judgment of the cognitive threat property of the text is made according to the emotion polarity: a text with scale(S) in the interval [0.68, 1] is regarded as a cognitive threat, and a text with scale(S) in the interval [0, 0.68) is regarded as a suspected cognitive threat topic text. This realizes the first-stage classification of text cognitive threat.
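The clause weighting and the scale(S) ratio described above can be sketched as follows. The helper names (`clause_weight`, `scale`, `first_stage`) are hypothetical, the +1 polarity for positive words and the value returned for a text without emotion words are assumptions, and only the negation parity rule, the adverb multiplication and the 0.68 threshold come from the text.

```python
def clause_weight(polarity, neg_count=0, adverb_strength=1.0):
    """Weight of one clause: polarity flipped by an odd number of negations,
    then scaled by the degree-adverb strength."""
    sign = -1 if neg_count % 2 == 1 else 1
    return sign * polarity * adverb_strength

def scale(weights):
    """Share of negative weight in the total absolute emotion weight."""
    total = sum(abs(w) for w in weights)
    if total == 0:
        return 0.0  # assumption: no emotion words means no negative share
    negative = sum(-w for w in weights if w < 0)
    return negative / total

def first_stage(weights, threshold=0.68):
    """First-stage label from the clause weights of one text."""
    if scale(weights) >= threshold:
        return "cognitive threat"
    return "suspected cognitive threat"

# Three clauses: two negative, one positive (assumed polarity +1).
example_weights = [clause_weight(-1), clause_weight(-1, adverb_strength=2.0),
                   clause_weight(1)]
example_label = first_stage(example_weights)
```

Texts labeled as suspected in this stage are the ones passed on to the second-stage model.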
As a preferred embodiment, further, in the intermediate detection, classifying the initial suspected cognitive threat topic text into a cognitive threat topic text, a suspected cognitive threat topic text and a non-cognitive threat topic text by using a deep learning method, wherein the classification process includes:
constructing a deep learning model and training it in advance with a labeled training data set, wherein the deep learning model comprises a BERT model for producing word vector representations of the input and a BiLSTM model for detecting cognitive threat from the input word vectors;
inputting the initial suspected cognitive threat topic text into a pre-trained deep learning model, acquiring a cognitive threat probability value by using the deep learning model, and determining the cognitive threat topic text, the suspected cognitive threat topic text and the non-cognitive threat topic text in the initial suspected cognitive threat topic text by using the cognitive threat probability value.
In the embodiment of the scheme, aiming at the respective characteristics of the two methods, the emotion dictionary-based method and the deep learning-based method are combined and optimized, and a multidimensional emotion analysis method combining a dynamically expanded emotion dictionary with deep learning is provided, so that the respective defects of the two methods are overcome and higher accuracy is obtained.
Emotion analysis-based cognitive threat recognition performs emotion analysis on text in two stages. In the first stage, the existing HowNet emotion dictionary and the BosonNLP emotion dictionary can be used as references; a basic emotion dictionary is constructed by a word frequency statistics method, and the emotion tendency of each candidate word is judged by calculating the statistical correlation between the candidate word and the words in the basic emotion dictionary, so that the emotion dictionary is expanded dynamically. Based on the emotion dictionary, a negative word dictionary and a degree adverb dictionary, each text S is taken as a unit and each emotion word WS of the sentence is taken as a separator; for the clauses (WSi-1, WSi) between two separators, the sum weight(S) of the negative emotion weights and the sum total(S) of the absolute values of the emotion weights are calculated, and scale(S) is defined as the proportion of the negative emotion weights in all emotion weights. The emotion polarity of the text S is judged according to scale(S), and a preliminary judgment of the cognitive threat property of the text is made according to the emotion polarity, which completes the primary recognition of cognitive threat. Statistical analysis of a large number of collected experimental texts shows that a text whose negative emotion weight proportion scale(S) lies in the interval [0.68, 1] is highly likely to be a cognitive threat, while a text whose scale(S) lies in the interval [0, 0.68) is preliminarily classified as a suspected cognitive threat topic text and passed on to the second-stage recognition.
In the second stage, emotion analysis is performed with a BERT+BiLSTM deep learning model as the core, and the emotion tendency of the text is analyzed further to complete the re-recognition of cognitive threat. Word vectors pre-trained by BERT (Bidirectional Encoder Representations from Transformers) replace word vectors trained in the traditional way, and the segmented text is converted into multi-dimensional word vectors. A bidirectional long short-term memory network (BiLSTM) model, which can handle both short-term and long-term dependence problems, forms the core of the emotion tendency analysis. A manually labeled cognitive threat topic text set and known cognitive threat topic texts under the same subject are used as the training set. The BERT+BiLSTM model, with a Softmax classifier, performs further emotion analysis on the suspected cognitive threat topic texts obtained in the first stage and divides them into a confirmed cognitive threat topic text set, a suspected cognitive threat topic text set and a non-cognitive threat topic text set.
Texts whose first-stage result (emotion analysis based on the dynamically expanded emotion dictionary) is suspected cognitive threat undergo second-stage recognition, in which the BERT+BiLSTM deep learning model is the core of further cognitive threat recognition. The model is trained first, and the training flow is as follows: the training data set can be labeled manually as to whether each text has cognitive threat properties; after word segmentation, the BERT model produces word vector representations of the training data; finally, the converted vectors are fed into the BiLSTM neural network, and a BiLSTM model covering the cognitive threat samples is trained from those samples. At recognition time, a text whose first-stage result is suspected cognitive threat is vectorized by BERT, the converted vectors are fed into the cognitive threat model, and the model outputs a cognitive threat probability value. A large number of text experiments show that a text with an output probability in (0.68, 1) can be confirmed to have cognitive threat properties, a text with a probability in (0.32, 0.68) is a suspected cognitive threat that requires manual judgment, and a text with a probability in [0, 0.32] is a non-cognitive threat.
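The routing of second-stage outputs by probability can be sketched as follows. This is a minimal illustration that assumes the model's probability is already available as a float; the function name `second_stage` is hypothetical, and because the source leaves the boundary values 0.68 and 1 unassigned, placing 0.68 in the suspected band and 1 in the confirmed band is an assumption.

```python
def second_stage(prob):
    """Map a cognitive threat probability to one of three outcomes."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if prob > 0.68:
        return "cognitive threat"
    if prob > 0.32:
        # Assumption: exactly 0.68 falls into the manual-review band.
        return "suspected cognitive threat (manual judgment)"
    return "non-cognitive threat"

# Hypothetical model outputs for three texts.
example_routes = [second_stage(p) for p in (0.91, 0.50, 0.10)]
```

Only the middle band reaches the manual labeling step, which keeps the human workload limited to the cases the model is least sure about.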
Experimental results show that, on a data set containing nearly 5000 microblog texts, the cognitive threat recognition accuracy of the emotion tendency analysis methods based on deep learning alone and on the emotion dictionary alone is 67.9% and 83.27% respectively, whereas the comprehensive cognitive threat recognition accuracy of the multidimensional emotion analysis in this scheme is 89.9%, which is better.
Further, in this embodiment, for the preprocessed sensitive topic text data, the cognitive threat topic text is obtained through multi-level cognitive threat detection, and further includes: and evaluating the influence degree of the cognitive threat by using an emotion analysis method according to the overall emotion tendency in a comment area in the cognitive threat topic text.
The influence degree of a cognitive threat is defined by the overall emotion tendency of the comment area under a text identified as a cognitive threat. All comment texts under such a text are merged, preprocessed and segmented, and the overall emotion tendency of the comment area is judged with the emotion dictionary-based cognitive threat identification method. The guidance of comments caused by the text with cognitive threat properties serves as the basis for judging the threat degree, and the threat degree of the text is evaluated from the comment emotion analysis result. From high to low: the cognitive threat degree of a text whose comments have an overall negative emotion weight proportion in [0.68, 1] of the overall emotion word weight is defined as first level; a proportion in [0.32, 0.68) is defined as second level; and a proportion in [0, 0.32) is defined as third level. The analysis result can provide an important reference for dealing with the cognitive threat.
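The three-level grading of threat influence can be sketched as below. The function name `threat_level` and the input format (a flat list of clause weights from all merged comments) are assumptions; treating a comment area without any emotion words as third level is also an assumption, and the interval boundaries follow the half-open reading of the text.

```python
def threat_level(comment_weights):
    """Grade threat influence (1 = highest) from merged comment clause weights."""
    total = sum(abs(w) for w in comment_weights)
    if total == 0:
        ratio = 0.0  # assumption: no emotion words in the comments
    else:
        ratio = sum(-w for w in comment_weights if w < 0) / total
    if ratio >= 0.68:
        return 1
    if ratio >= 0.32:
        return 2
    return 3

# Comments dominated by negative sentiment give the highest level.
example_level = threat_level([-1, -1, -1, 1])
```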
As a preferred embodiment, further, constructing a cognitive threat propagation knowledge graph by identifying named entities and extracting entity relationships of the cognitive threat topic text includes:
constructing a named entity extraction model and optimizing the model by using an adversarial training method, wherein the named entity extraction model comprises an encoder for mapping input characters to a real-valued space and mining latent semantics, a BiLSTM neural network layer for extracting contextual semantic information by capturing forward and backward bidirectional features in the vectors converted by the encoder, and a CRF (conditional random field) layer for taking the bidirectional features extracted by the BiLSTM neural network layer as input and generating the label corresponding to each character in combination with the BIOES labeling paradigm;
and taking the cognitive threat topic text as the input of the optimized named entity extraction model, and identifying the entity categories and relations in the cognitive threat topic text by using the named entity extraction model.
At present, named entity recognition is more commonly implemented with statistical machine learning-based methods. The named entity extraction model in this embodiment adopts the BERT-BiLSTM-CRF model, an end-to-end deep learning model developed from the BiLSTM-CRF model that does not require manually engineered features and can meet the requirements of the current Chinese address resolution and address element labeling task. From bottom to top, the model consists of an encoder (Transformer), a BiLSTM neural network layer and a conditional random field (CRF) layer. The Transformer encoder is based on a character-level Chinese BERT model; it maps the input Chinese address characters into a low-dimensional dense real-valued space and mines the latent semantics of the various address elements in the Chinese address. The BiLSTM neural network layer takes the character vectors converted by the encoder as input and captures the forward (left-to-right) and backward (right-to-left) features of the Chinese address sequence, so that contextual semantic information is fully acquired. The CRF layer is a probabilistic graphical model; it takes the bidirectional features extracted by the upstream BiLSTM as input and, combined with the BIOES labeling paradigm, generates the label corresponding to each character in the address, so that the Chinese address is further parsed into address elements according to the labels. Because the order of the sequence is taken into account in the calculation, the recognition effect for named entities can be improved to a great extent.
There is not yet an established standard for entities in the cognitive threat field, and most existing named entity recognition tasks for online text target network public opinion only. In this case, the crawled data is analyzed and, according to the cognitive threat identification requirement, 6 types of entities in the cognitive threat field can be set, namely user, time, address, platform, organization and hot event, as shown in table 1:
TABLE 1 cognitive threat entity types
Figure SMS_6
Figure SMS_7
Entity labeling is the most important issue in the named entity recognition task and is the basis of model training. The common labeling schemes are BIO and BIOES. Although the BIOES scheme provides more information, it requires more labels to be predicted, and since the data set constructed in this scheme is limited in size, the effect of adopting BIOES could suffer. In the BIO tagging system, "B" tags the beginning of an entity, "I" tags the inside of an entity, and "O" tags non-entity tokens. The tags of each entity class comprise a "begin" tag and an "inside" tag, so with 6 entity classes plus the "O" tag, the named entity recognition data set constructed in this case can be set to 13 tags (6 × 2 + 1).
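The BIO tag inventory and the labeling of a token sequence can be sketched as follows. The entity type names in `ENTITY_TYPES` and the span format `(start, end_exclusive, type)` are hypothetical choices made for illustration; only the six entity classes and the BIO convention come from the text.

```python
# Hypothetical label names for the six entity classes listed in the text.
ENTITY_TYPES = ["USER", "TIME", "ADDRESS", "PLATFORM", "ORG", "EVENT"]

def build_tags(entity_types):
    """Return the BIO tag set: one 'O' plus B-/I- for each entity type."""
    tags = ["O"]
    for t in entity_types:
        tags += [f"B-{t}", f"I-{t}"]
    return tags

def bio_encode(tokens, spans):
    """Label tokens given entity spans as (start, end_exclusive, type)."""
    labels = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        labels[start] = f"B-{entity_type}"
        for i in range(start + 1, end):
            labels[i] = f"I-{entity_type}"
    return labels

TAGS = build_tags(ENTITY_TYPES)
example_labels = bio_encode(["a", "b", "c", "d"], [(1, 3, "USER")])
```

With six entity types this yields the 13 tags mentioned above.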
The entities involved in the knowledge graph of cognitive threat users have complex relations, so when knowledge is extracted from cognitive threat topic texts, the dependency between adjacent labels should be taken into account. BiLSTM is good at handling long-distance text information but cannot handle the dependency between adjacent labels. Therefore, in knowledge extraction for the knowledge graph of cognitive threat users, a CRF (Conditional Random Field) is combined with the BiLSTM: on the basis of the preliminary predicted label that the BiLSTM outputs for each word, the output scores are corrected through the relations between adjacent labels, and an optimal predicted sequence is obtained.
The CRF layer takes the output scores of the upstream BiLSTM as input and outputs the most probable predicted label sequence that satisfies the label transition constraints. For any input sequence X = (x1, x2, …, xn), let P be the output score matrix of the BiLSTM; P has size n × k, where n is the number of words, k is the number of tags, and P_{i,j} is the score of the j-th tag for the i-th word. For a predicted sequence Y = (y1, y2, …, yn), the score function is:
$$s(X, Y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$
where A is the transition score matrix, A_{i,j} is the score of a transition from tag i to tag j, and A is a square matrix of size k+2, with y_0 and y_{n+1} denoting the start and end tags added to the sequence. The probability of generating the predicted sequence Y is:
$$p(Y \mid X) = \frac{e^{s(X, Y)}}{\sum_{\widetilde{Y} \in Y_X} e^{s(X, \widetilde{Y})}}$$
Taking logarithms on both sides gives the likelihood function of the predicted sequence:
$$\log p(Y \mid X) = s(X, Y) - \log \sum_{\widetilde{Y} \in Y_X} e^{s(X, \widetilde{Y})}$$
In the formula, Y_X denotes the set of all possible annotation sequences and $\widetilde{Y}$ ranges over it; during training, Y is the true annotation sequence. Decoding yields the output sequence with the maximum score:
$$Y^{*} = \arg\max_{\widetilde{Y} \in Y_X} s(X, \widetilde{Y})$$
The CRF layer outputs the optimal label sequence of the cognitive threat topic text. Attention is focused on the words corresponding to labels such as forwarding user, forwarding time, forwarding place, forwarding platform, text abstract information and text subject, which are the basis for establishing the knowledge graph of cognitive threat users, for tracing processes such as forwarding user tracing, forwarding process tracing, forwarding time tracing and forwarding platform tracing, and for reasoning about the relationships between users who forward cognitive threats.
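The CRF scoring, likelihood and decoding steps described above can be sketched in pure Python. This sketch assumes the standard linear-chain formulation used with BiLSTM-CRF taggers (emission scores P, a (k+2) × (k+2) transition matrix A whose last two indices are start and end markers); the function names and the toy matrices are illustrative, and the brute-force normalizer is only practical for tiny examples.

```python
import math
from itertools import product

def path_score(P, A, y):
    """s(X, Y): transition scores (with start/end markers) plus emission scores."""
    k = len(P[0])
    start, end = k, k + 1          # assumed positions of the two extra tags
    tags = [start] + list(y) + [end]
    trans = sum(A[tags[i]][tags[i + 1]] for i in range(len(tags) - 1))
    emit = sum(P[i][t] for i, t in enumerate(y))
    return trans + emit

def log_likelihood(P, A, y):
    """log p(Y|X) with the normalizer computed by enumerating all sequences."""
    k, n = len(P[0]), len(P)
    scores = [path_score(P, A, cand) for cand in product(range(k), repeat=n)]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return path_score(P, A, y) - log_z

def viterbi(P, A):
    """Highest-scoring tag sequence by dynamic programming."""
    k, n = len(P[0]), len(P)
    start, end = k, k + 1
    best = [A[start][j] + P[0][j] for j in range(k)]
    back = []
    for i in range(1, n):
        prev, best, ptr = best, [], []
        for j in range(k):
            t = max(range(k), key=lambda s: prev[s] + A[s][j])
            best.append(prev[t] + A[t][j] + P[i][j])
            ptr.append(t)
        back.append(ptr)
    last = max(range(k), key=lambda s: best[s] + A[s][end])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy example: 3 words, 2 tags; transitions favour staying on the same tag.
P_toy = [[1, 0], [0, 1], [1, 0]]
A_toy = [[0.5, -1, 0, 0], [-1, 0.5, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
decoded = viterbi(P_toy, A_toy)
```

In the toy example the emissions alone would pick the sequence 0, 1, 0, but the transition scores penalize tag switches, so the decoded sequence stays on tag 0; this is the kind of correction by adjacent labels that the text attributes to the CRF layer.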
When BERT and its variants are used, their parameters have already reached a good level through pre-training, so a lower learning rate should be used in order to preserve the pre-training effect. In contrast, the downstream task layers are not pre-trained; if a low learning rate is set for them, the training process is slow and it is difficult to keep pace with the training of BERT. Therefore, in the embodiment of the present disclosure, a hierarchical learning rate policy may be set: a smaller learning rate is set for the upstream BERT pre-trained layers, while a larger learning rate is set for the downstream layers.
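The hierarchical learning rate policy can be sketched as a grouping of parameters by name. Everything concrete here is an assumption for illustration: the function name `build_param_groups`, the `"bert."` name prefix, the parameter names and the two learning rate values. The list-of-dicts layout with `"params"` and `"lr"` keys mirrors the per-parameter-group options that PyTorch optimizers accept, but plain strings stand in for tensors so the sketch has no dependencies.

```python
def build_param_groups(param_names, bert_lr=2e-5, head_lr=1e-3,
                       bert_prefix="bert."):
    """Split parameters into a low-LR BERT group and a higher-LR downstream group."""
    bert = [n for n in param_names if n.startswith(bert_prefix)]
    head = [n for n in param_names if not n.startswith(bert_prefix)]
    return [
        {"params": bert, "lr": bert_lr},   # pre-trained layers: small steps
        {"params": head, "lr": head_lr},   # BiLSTM/CRF layers: larger steps
    ]

# Hypothetical parameter names of a BERT-BiLSTM-CRF model.
example_groups = build_param_groups(
    ["bert.encoder.layer.0.weight", "bilstm.weight_ih", "crf.transitions"]
)
```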
In the model training process, once the loss value decreases only slowly, continuing to use a larger learning rate can make the model oscillate around the optimum as it converges toward the global optimum. To ensure that the loss function finally stays within a range very close to the optimal value and gradually approaches it, a learning rate decay strategy is adopted, that is, the step size of the parameter updates is reduced. The scheme can set the following learning rate decay strategy: when the model effect no longer improves during training, the learning rate is reduced, which can effectively improve the model precision.
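The decay-on-plateau idea can be sketched as a small scheduler. The class name `PlateauDecay` and the `factor`, `patience` and `min_lr` parameters are assumptions introduced for illustration; the source only states that the learning rate is reduced when the model effect stops improving.

```python
class PlateauDecay:
    """Reduce the learning rate when a validation metric stops improving."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor        # multiplicative decay (assumed value)
        self.patience = patience    # tolerated epochs without improvement
        self.min_lr = min_lr
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's metric (higher is better) and return the new LR."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr

# The metric stalls after the first epoch, so the rate is halved once.
scheduler = PlateauDecay(lr=0.1, factor=0.5, patience=1)
lr_history = [scheduler.step(m) for m in (0.80, 0.80, 0.79, 0.85)]
```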
BERT-BiLSTM-CRF is used as the model for identifying named entities, but because of the local instability of neural networks, even small perturbations may produce large errors in the model. Therefore, the present embodiment employs an adversarial training approach to optimize the model. In adversarial training, tiny perturbations are added to the model input, which alleviates the local instability of the neural network and improves the robustness of the model. See fig. 3. In the training process, BERT first generates an initial vector for the input text, and a perturbation is then added to the initial vector to generate adversarial samples, which are variants of the original samples that easily mislead the model. The initial vectors and the adversarial samples are input together into the BiLSTM for training, and the neural network learns more robust parameters during training so as to resist attacks by adversarial samples.
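One common way to generate such a perturbation is the fast gradient method (FGM), which moves the embedding a small distance along the normalized gradient of the loss. The source does not name the perturbation method, so the choice of FGM, the function name `fgm_perturb` and the `epsilon` value are assumptions; plain lists stand in for embedding tensors.

```python
import math

def fgm_perturb(embedding, grad, epsilon=1.0):
    """Return embedding + epsilon * grad / ||grad||_2 (FGM-style perturbation)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0:
        return list(embedding)     # no gradient signal: leave input unchanged
    return [e + epsilon * g / norm for e, g in zip(embedding, grad)]

# Toy embedding and loss gradient with L2 norm 5.
adversarial = fgm_perturb([0.0, 0.0], [3.0, 4.0], epsilon=1.0)
```

In a training loop, the loss on the perturbed embedding would be added to the loss on the clean embedding before the parameter update.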
In the preferred embodiment, further, when the cognitive threat propagation knowledge graph is constructed through named entity recognition and entity relation extraction on the cognitive threat topic text, two named entity extraction models connected in a pipeline can be constructed: the first named entity extraction model identifies the entities in the cognitive threat topic text as a single-label multi-class classification task, and the second named entity extraction model takes the output of the first model as input and identifies the relations between entities as a multi-label multi-class classification task.
Knowledge fusion is an important task in constructing a domain knowledge graph: multiple related entities are aligned, associated and merged into a whole. The main work is divided into two parts, entity unification and entity disambiguation. Because cognitive threat topic texts are political and aggressive in nature, the entities identified by named entity recognition are not unified, so entity unification and entity disambiguation are needed. Entity unification means that different entity instances with the same meaning need to be unified into one entity.
Ambiguity of a named entity means that one entity mention can correspond to multiple real-world entities. Because of the richness and complexity of Chinese semantics, the same word may have different meanings in different contexts, so entity disambiguation is required. The scheme can adopt a link-based entity disambiguation method that links an entity mention to the corresponding entity in a knowledge base. The final valid entities are obtained after entity unification.
The graph database is good at handling large volumes of complex, interconnected, loosely structured data that changes rapidly and must be queried frequently. In relational databases, such queries result in a large number of table joins and thus create performance problems. In this embodiment, Neo4j, a persistence engine with full transaction support, can be adopted to provide large-scale scalability: property graphs with billions of nodes and relationships can be processed on one machine, and the system can be extended to run on multiple machines in parallel, which also addresses the problem of performance degradation. By modeling the data as a graph, Neo4j traverses nodes and edges at the same speed, regardless of the amount of data that makes up the graph.
As shown in fig. 4, when the knowledge-graph visualization is implemented based on Neo4j, a set of visualization graph elements may be defined in the Neo4j program. The schema of cognitive threat propagation may be expressed primarily by type (type) and attribute (property). User, event, address, platform, hotspot event, emotion tag, threat intent are defined as entities. In the definition of the relationship, the following relationship can be defined: text-hotspot events, user-text, forwarding, user-organization, text-emotion tags, text-threat intent, and the like. The triplet may be expressed as: < text, text-hot event, hot event >, < user, user-cognitive threat topic text, text >, < user, forward, user >, < user, user-organization, user >, < text, text-threat intent, threat intent >, etc.
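The triples above could be written to Neo4j with Cypher `MERGE` statements, as in the following sketch. The label names, relationship type names and the `name` property are hypothetical; node values are passed as query parameters (`$head`, `$tail`), while labels and relationship types are interpolated into the query text and are therefore checked against a fixed whitelist. The sketch only builds the query string and does not connect to a database.

```python
# Hypothetical label and relationship names derived from the triples in the text.
ENTITY_LABELS = {"Text", "User", "Organization", "HotEvent",
                 "ThreatIntent", "EmotionTag"}
RELATION_TYPES = {"TEXT_HOT_EVENT", "USER_TEXT", "FORWARD",
                  "USER_ORG", "TEXT_EMOTION", "TEXT_INTENT"}

def triple_to_cypher(head_label, relation, tail_label):
    """Build a parameterized Cypher query that upserts one triple."""
    if head_label not in ENTITY_LABELS or tail_label not in ENTITY_LABELS:
        raise ValueError("unknown entity label")
    if relation not in RELATION_TYPES:
        raise ValueError("unknown relation type")
    return (f"MERGE (h:{head_label} {{name: $head}}) "
            f"MERGE (t:{tail_label} {{name: $tail}}) "
            f"MERGE (h)-[:{relation}]->(t)")

# <user, forward, user> triple; the values would be supplied as parameters.
forward_query = triple_to_cypher("User", "FORWARD", "User")
forward_params = {"head": "user_a", "tail": "user_b"}
```

Using `MERGE` rather than `CREATE` keeps the graph free of duplicate nodes when the same user or text appears in several triples.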
By constructing the cognitive threat propagation knowledge graph, user tracing, event tracing and organization tracing of cognitive threat propagation are realized. Through implicit relation mining, key accounts, groups, organizations and users are monitored in real time, which can provide a basis for accurately locating and selectively blocking cognitive threats.
Further, based on the above method, the embodiment of the present invention further provides a social media cognitive threat detection system, including: the system comprises a data acquisition server, a plurality of cognitive threat authentication servers, a knowledge graph server and a web server, wherein,
the data acquisition server is used for acquiring the text data of the sensitive topics of the network platform and preprocessing the data;
the multiple cognitive threat authentication servers are used for acquiring cognitive threat topic texts from the preprocessed sensitive topic text data through multi-level cognitive threat detection, and specifically comprise: a primary identification server for dividing the sensitive topic text data into cognitive threat topic texts and initial suspected cognitive threat topic texts, an intermediate identification server for classifying the initial suspected cognitive threat topic texts into cognitive threat topic texts, suspected cognitive threat topic texts and non-cognitive threat topic texts, and a final identification server for obtaining cognitive threat topic texts from the suspected cognitive threat topic texts through manual labeling;
The knowledge graph server is used for constructing a cognitive threat propagation knowledge graph through named entity identification and entity relation extraction of the cognitive threat topic text;
and the web server is used for carrying out user tracing, event tracing and organization tracing on the cognitive threat topic text transmission by utilizing the web interaction interface based on the cognitive threat transmission knowledge graph.
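The three identification tiers can be sketched as a simple routing function. This is a minimal sketch under several assumptions that the description does not fix: that a high share of negative emotion weight marks a text as a threat at the primary tier, that the intermediate tier uses two probability cut-offs, and the specific threshold values. The `classify` and `review` callables stand in for the deep learning model and the manual labeling step; all names are hypothetical.

```python
from typing import Callable, Sequence


def negative_ratio(emotion_weights: Sequence[float]) -> float:
    """Share of negative emotion weight in the total absolute emotion weight."""
    total = sum(abs(w) for w in emotion_weights)
    if total == 0:
        return 0.0
    negative = sum(-w for w in emotion_weights if w < 0)
    return negative / total


def detect(
    text: str,
    emotion_weights: Sequence[float],
    classify: Callable[[str], float],
    review: Callable[[str], bool],
    negative_cutoff: float = 0.8,  # assumed threshold, not from the source
    low: float = 0.3,              # assumed threshold, not from the source
    high: float = 0.7,             # assumed threshold, not from the source
) -> bool:
    """Return True if the text is judged to be cognitive threat topic text."""
    # Primary tier: emotion analysis separates clear threats from initial suspects.
    if negative_ratio(emotion_weights) >= negative_cutoff:
        return True
    # Intermediate tier: model probability gives threat / suspected / non-threat.
    probability = classify(text)
    if probability >= high:
        return True
    if probability < low:
        return False
    # Final tier: remaining suspected texts go to manual labeling.
    return review(text)


if __name__ == "__main__":
    verdict = detect("example", [-9.0, 1.0], classify=lambda _: 0.0, review=lambda _: False)
    print(verdict)
```

Structuring the tiers this way means the model is only called for texts the emotion analysis could not settle, and manual review only for texts the model could not settle.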
The front end may be designed in JavaScript and use ECharts for data visualization. The plurality of cognitive threat authentication servers may be arranged in a decentralized, distributed manner and are jointly responsible for detecting and measuring cognitive threat information; only text that multiple servers recognize as a cognitive threat can be judged to be cognitive threat topic text. After a text is judged to be cognitive threat topic text, the cognitive threat identification server uploads the text information and the measured cognitive threat attributes of the text to the distributed network. The distributed structure improves both the accuracy of cognitive threat detection and the resilience of the system: the failure of a single server does not affect the operation of the whole system.
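The rule that a text counts as cognitive threat topic text only when multiple servers agree can be illustrated with a quorum check. The description does not state how many servers must agree, so the `quorum` parameter below is an assumption, as are the function name and the verdict layout.

```python
from typing import Mapping


def confirmed_as_threat(verdicts: Mapping[str, bool], quorum: int) -> bool:
    """Return True if at least `quorum` servers flagged the text as a threat.

    `verdicts` maps a server identifier to that server's verdict.
    """
    if not 1 <= quorum <= len(verdicts):
        raise ValueError("quorum must be between 1 and the number of servers")
    return sum(verdicts.values()) >= quorum


if __name__ == "__main__":
    votes = {"server-a": True, "server-b": True, "server-c": False}
    print(confirmed_as_threat(votes, quorum=2))
```

Setting `quorum` to the number of servers gives a unanimous rule, while a smaller value tolerates a failed or dissenting server.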
The knowledge graph server corresponds to the cognitive threat knowledge extraction module and the cognitive threat propagation knowledge graph construction module. It automatically accesses the distributed network through smart contracts, extracts the text information of detected cognitive threat topic texts, performs named entity recognition and relation extraction on cognitive threat entities such as users, times, addresses, organizations, forwarding platforms, and hotspot events related to the cognitive threat topic text, and constructs the cognitive threat propagation knowledge graph with Neo4j.
The human-computer interaction interface may use the Web as a bridge between users and the data, and use the ECharts visualization tool to convert abstract data and relationships into intuitive charts.
In the embodiments of this scheme, the system offers good security and feasibility; its modular design keeps operational complexity low and makes maintenance convenient. Verification on test data shows that threat judgment accuracy for topic texts of up to one hundred characters reaches 93%, indicating high accuracy and a good detection effect. In addition, the scheme has a wide range of application scenarios, including news media supervision, online public opinion supervision, and crackdowns on illegal activity.
The relative steps, numerical expressions and numerical values of the components and steps set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be understood by reference to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
The elements and method steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or a combination thereof, and the elements and steps of the examples have been generally described in terms of functionality in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those of ordinary skill in the art may implement the described functionality using different methods for each particular application, but such implementation is not considered to be beyond the scope of the present invention.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the above methods may be performed by a program that instructs associated hardware, and that the program may be stored on a computer readable storage medium, such as: read-only memory, magnetic or optical disk, etc. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits, and accordingly, each module/unit in the above embodiments may be implemented in hardware or may be implemented in a software functional module. The present invention is not limited to any specific form of combination of hardware and software.
Finally, it should be noted that the above examples are only specific embodiments of the present invention, intended to illustrate its technical solutions rather than to limit them, and the scope of protection of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing examples, those skilled in the art should understand that anyone familiar with the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the scope of protection of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A social media cognitive threat detection method, characterized by comprising the following steps:
collecting sensitive topic text data from a network platform and preprocessing the data;
obtaining cognitive threat topic text from the preprocessed sensitive topic text data through multi-level cognitive threat detection, wherein the multi-level cognitive threat detection comprises: primary detection, which divides the sensitive topic text data into cognitive threat topic text and initial suspected cognitive threat topic text; intermediate detection, which classifies the initial suspected cognitive threat topic text into cognitive threat topic text, suspected cognitive threat topic text, and non-cognitive threat topic text; and final detection, which obtains cognitive threat topic text from the suspected cognitive threat topic text through manual labeling;
constructing a cognitive threat propagation knowledge graph through named entity recognition and entity relation extraction on the cognitive threat topic text;
and performing user tracing, event tracing, and organization tracing of the propagation of the cognitive threat topic text based on the cognitive threat propagation knowledge graph.
2. The social media cognitive threat detection method according to claim 1, wherein collecting sensitive topic text data from a network platform and preprocessing the data comprises:
first, collecting, in a distributed manner and according to a user authorization information base, sensitive topic text information from the network platform together with related user data;
and then, for the collected text information, merging the title and the body text, removing redundant information with a redundancy detection algorithm, de-duplicating related comments, cleaning and converting noisy text data, and segmenting the text into words with a word segmentation system.
3. The social media cognitive threat detection method according to claim 1, wherein, in the primary detection, the sensitive topic text data is divided into cognitive threat topic text and initial suspected cognitive threat topic text by an emotion analysis method, wherein the emotion analysis method comprises:
first, constructing a basic emotion dictionary from a known emotion dictionary by word frequency statistics, and expanding the emotion dictionary by computing correlation statistics between words in the text data and words in the basic emotion dictionary;
second, taking each text in the sensitive topic text data as a unit and emotion words as separators, computing emotion weight statistics for the clauses between separators, and judging the emotion polarity of the text according to the proportion of negative emotion weight in the weight of all emotion words;
and then dividing the sensitive topic text data into cognitive threat topic text and initial suspected cognitive threat topic text according to the emotion polarity of the text.
4. The social media cognitive threat detection method according to claim 3, wherein constructing the basic emotion dictionary from a known emotion dictionary by word frequency statistics comprises:
first, selecting a series of emotion words from the known emotion dictionary, ranking them by search engine click rate, and selecting a plurality of emotion words according to the click rate;
then, selecting the emotion words most relevant to the topic based on word frequency statistics, and forming the basic emotion dictionary from the emotion words selected by click rate and the emotion words selected by topic relevance;
and then expanding the basic emotion dictionary with synonyms and candidate words that carry emotional tendency.
5. The social media cognitive threat detection method according to claim 3, wherein computing emotion weight statistics for the clauses between separators and judging the emotion polarity of the text according to the proportion of negative emotion weight in the weight of all emotion words comprises:
first, for the clauses between separators, computing emotion tendency through emotion word analysis, negation word analysis, adverb analysis, fixed collocation analysis, transition word analysis, and exclamatory sentence analysis;
and then computing the sum of the negative emotion tendency values of all clauses and the sum of the absolute values of all emotion weights of the text, and judging the emotion polarity of the text by the proportion of negative emotion word weight in the weight of all emotion words of the text.
6. The social media cognitive threat detection method according to claim 1, wherein, in the intermediate detection, the initial suspected cognitive threat topic text is classified into cognitive threat topic text, suspected cognitive threat topic text, and non-cognitive threat topic text by a deep learning method, and the classification process comprises:
constructing a deep learning model and pre-training it with a labeled training data set, wherein the deep learning model comprises a BERT model for representing the input as word vectors and a BiLSTM model for performing cognitive threat detection on the word vectors;
and inputting the initial suspected cognitive threat topic text into the pre-trained deep learning model, obtaining a cognitive threat probability value from the model, and using the probability value to determine which of the initial suspected cognitive threat topic text is cognitive threat topic text, suspected cognitive threat topic text, or non-cognitive threat topic text.
7. The social media cognitive threat detection method according to claim 1, wherein obtaining cognitive threat topic text from the preprocessed sensitive topic text data through multi-level cognitive threat detection further comprises: evaluating the degree of influence of the cognitive threat by applying an emotion analysis method to the overall emotional tendency of the comment area of the cognitive threat topic text.
8. The social media cognitive threat detection method according to claim 1, wherein constructing the cognitive threat propagation knowledge graph through named entity recognition and entity relation extraction on the cognitive threat topic text comprises:
constructing a named entity extraction model and optimizing the model with an adversarial training method, wherein the named entity extraction model comprises an encoder for mapping input characters to a real-valued space and mining latent semantics, a BiLSTM neural network layer for extracting contextual semantic information by capturing forward and backward features in the vectors produced by the encoder, and a CRF (conditional random field) layer that takes the bidirectional features extracted by the BiLSTM layer as input and generates a label for each character according to the BIOES tagging scheme;
and inputting the cognitive threat topic text into the optimized named entity extraction model, and identifying entity categories and relations in the cognitive threat topic text with the model.
9. The social media cognitive threat detection method according to claim 8, wherein, in constructing the cognitive threat propagation knowledge graph through named entity recognition and entity relation extraction on the cognitive threat topic text, two named entity extraction models connected in a pipeline are constructed, wherein the first named entity extraction model identifies entities in the cognitive threat topic text as a single-label multi-class classification task, and the second named entity extraction model takes the output of the first named entity extraction model as input and identifies relations between the entities as a multi-label multi-class classification task.
10. A social media cognitive threat detection system, comprising a data acquisition server, a plurality of cognitive threat authentication servers, a knowledge graph server, and a web server, wherein:
the data acquisition server is used for collecting sensitive topic text data from the network platform and preprocessing the data;
the plurality of cognitive threat authentication servers are used for obtaining cognitive threat topic text from the preprocessed sensitive topic text data through multi-level cognitive threat detection, and specifically comprise: a primary identification server that divides the sensitive topic text data into cognitive threat topic text and initial suspected cognitive threat topic text; an intermediate identification server that classifies the initial suspected cognitive threat topic text into cognitive threat topic text, suspected cognitive threat topic text, and non-cognitive threat topic text; and a final identification server that obtains cognitive threat topic text from the suspected cognitive threat topic text through manual labeling;
the knowledge graph server is used for constructing a cognitive threat propagation knowledge graph through named entity recognition and entity relation extraction on the cognitive threat topic text;
and the web server is used for performing user tracing, event tracing, and organization tracing of the propagation of cognitive threat topic text through a web interaction interface, based on the cognitive threat propagation knowledge graph.
CN202211732859.2A 2022-12-30 2022-12-30 Social media cognitive threat detection method and system Pending CN116244446A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211732859.2A CN116244446A (en) 2022-12-30 2022-12-30 Social media cognitive threat detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211732859.2A CN116244446A (en) 2022-12-30 2022-12-30 Social media cognitive threat detection method and system

Publications (1)

Publication Number Publication Date
CN116244446A (en) 2023-06-09

Family

ID=86628873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211732859.2A Pending CN116244446A (en) 2022-12-30 2022-12-30 Social media cognitive threat detection method and system

Country Status (1)

Country Link
CN (1) CN116244446A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874755A (en) * 2024-03-13 2024-04-12 中国电子科技集团公司第三十研究所 System and method for identifying hidden network threat users
CN117874755B (en) * 2024-03-13 2024-05-10 中国电子科技集团公司第三十研究所 System and method for identifying hidden network threat users
CN117910567A (en) * 2024-03-20 2024-04-19 道普信息技术有限公司 Vulnerability knowledge graph construction method based on safety dictionary and deep learning network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination