CN110909531A - Method, device, equipment and storage medium for discriminating information security

Info

Publication number: CN110909531A
Application number: CN201910991165.2A
Authority: CN
Inventors: 杨冬艳; 王智浩
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-10-18
Filing date: 2019-10-18
Publication date: 2020-03-24
Anticipated expiration: 2039-10-18
Also published as: CN110909531B

Abstract

The invention relates to the technical field of artificial intelligence and discloses an information security screening method. A crawler system is built on a distributed system framework and an in-memory computing engine to collect information from different channels. Industry-leading machine learning and semantic definition algorithms are then used to continuously learn entries related to information security, so that the data sources of the acquired text keep expanding and network security information is analyzed more comprehensively and at a deeper level. Internal association relationships among the data are constructed, which makes the analysis results more effective and more convincing, and information transmitted in the network is screened based on the internal association relationships in a knowledge base of the learned entries. The invention also provides a device, equipment and a computer-readable storage medium for information security screening, which mine the internal relationships within network information security knowledge, help identify fraud or vulnerability scenarios in network services, and improve the security of information transmitted over the network.

Description

Method, device, equipment and storage medium for discriminating information security

Technical Field

The present invention relates to the field of network security technologies, and in particular, to a method, an apparatus, a device, and a storage medium for discriminating information security.

Background

With the continuous development of network technology, networks have become a part of everyday life, and users now fulfil many of their needs online. In doing so, users have to provide private information such as identity card and bank details. Because the network is a common platform on which users carry out transactions and exchange information, poorly protected information can leak, and serious consequences may follow if it is obtained by criminals. Network information security has therefore become a major focus of network communication, especially the monitoring of and protection against network attacks. Unlike threats in the traditional security field, threats to network information security take variable forms and are hard to perceive.

In the prior art, security identification of network information mainly relies on self-learning with a single model algorithm. However, traditional methods such as rule engines, data mining and machine discovery still struggle to identify the inherent relationships among threats. In particular, network attacks are now often carried out by pushing or altering phrase information; if a single model is used to identify such relationships, the recognition confidence is low, misjudgments or missed detections often occur, and users are exposed to serious security risks.

Disclosure of Invention

The main object of the invention is to provide a method, a device, equipment and a storage medium for information security screening, aiming to solve the technical problem that existing network information security identification approaches identify potential security risks with low accuracy.

In order to achieve the above object, the present invention provides an information security screening method, which includes the following steps:

acquiring data information related to network security on each Internet channel through a crawler platform, wherein the crawler platform is built based on a distributed system framework and an in-memory computing engine, and the data information at least comprises text data and image data;

performing machine learning of text semantics or image shape outlines on the data information according to a preset machine learning algorithm and a semantic definition algorithm of the entries to obtain a machine learning result;

converting the machine learning result into a feature matrix of word vectors, and establishing an internal association relationship between different pieces of data information based on the feature matrix to obtain an information security identification library, wherein the internal association relationship comprises a semantic association relationship between text data and a shape contour association relationship between image data;

acquiring a security event to be processed and determining the data type of the security event, wherein the security event is network information received by a network terminal from a network server through a network;

and selecting a corresponding knowledge base from the information security identification base according to the data type, and carrying out security classification and discrimination processing on the security event based on the semantic association relationship between the text data in the knowledge base and the shape profile association relationship between the image data.
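The last two steps (determine the data type of the security event, select the matching knowledge base, then classify) can be sketched as a simple dispatch. This is only an illustrative sketch: the function names, the rule for telling text from images, the three labels and the byte-signature stand-in for shape contour matching are all assumptions and are not taken from the source.

```python
def detect_data_type(event):
    """Hypothetical rule: bytes payloads are treated as images, strings as text."""
    return "image" if isinstance(event, (bytes, bytearray)) else "text"


def screen_text(event, knowledge_base):
    """knowledge_base maps a risky entry to the set of entries associated with it."""
    words = set(event.lower().split())
    hits = [w for w in words if w in knowledge_base]
    # A hit counts as confirmed when an associated entry appears in the same event.
    confirmed = [w for w in hits if knowledge_base[w] & words]
    if confirmed:
        return "unsafe"
    return "suspicious" if hits else "safe"


def screen_image(event, knowledge_base):
    """Stand-in for contour matching: knowledge_base is a set of known byte signatures."""
    return "unsafe" if any(sig in event for sig in knowledge_base) else "safe"


# One screening function per data type (assumed structure).
SCREENERS = {"text": screen_text, "image": screen_image}


def screen_event(event, library):
    """Pick the knowledge base that matches the event's data type, then classify."""
    data_type = detect_data_type(event)
    return SCREENERS[data_type](event, library[data_type])


# Toy identification library with one knowledge base per data type.
library = {
    "text": {"password": {"verify", "urgent"}, "transfer": {"account"}},
    "image": {b"BADICON"},
}
print(screen_event("urgent verify your password now", library))
print(screen_event(b"\x89PNG...BADICON...", library))
```

Keeping one knowledge base per data type means a new data type can be supported by adding one entry to the library and one screening function, without touching the dispatch itself.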

Optionally, the step of obtaining, by the crawler platform, data information related to network security in each internet channel includes:

acquiring interactive data collected by the crawler platform when monitoring the Internet channel in real time;

extracting basic data related to network security from the monitored interactive data according to a rule of randomly extracting samples, and forming data samples for training the information security identification library based on the basic data, wherein the internet channel comprises at least one of an internet webpage and a data storage platform;

if the extracted basic data is text data, dividing the text data into a plurality of entries according to a semantic recognition technology to form the data information, wherein the entries are unit words and sentences with definite semantics;

if the extracted basic data is image data, the image data is divided into a plurality of maps according to the image shape minimum unit division technology to form the data information, and the maps are image fragments with complete outlines of determined single shapes.

Optionally, the segmenting the text data into a plurality of entries according to a semantic recognition technology, and the forming the data information includes:

dividing the text data according to a forward segmentation method and a reverse segmentation method respectively to obtain a forward entry set and a reverse entry set;

calculating the absolute frequency and the relative frequency of each entry in the forward entry set and the reverse entry set;

comparing the absolute frequency of the forward entry set with the absolute frequency of the reverse entry set, and comparing the relative frequency of the forward entry set with the relative frequency of the reverse entry set to obtain a comparison result of the absolute frequency and the relative frequency;

calculating the difference between the absolute frequency and the relative frequency in the comparison result, and selecting any entry set whose difference is within a preset range as the segmentation set of the text data;

judging whether the absolute frequency and the relative frequency of the selected entries in the entry set are greater than corresponding preset statistical values or not;

if the judgment shows that an entry falls below the preset statistical value, removing that entry from the entry set to form the final data information;

wherein the absolute frequency is calculated as follows: the number of occurrences of the entry is divided by the length of the text data to obtain the absolute frequency of the entry.
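A minimal sketch of the forward and reverse segmentation and of the absolute frequency filter follows. It assumes that both segmentation methods are dictionary-based maximum matching, which the source does not state; the vocabulary, `max_len` and the threshold are hypothetical. The relative frequency is omitted because the source does not define how it is computed.

```python
from collections import Counter


def forward_max_match(text, vocab, max_len=5):
    """Greedy left-to-right matching; unknown characters become one-character entries."""
    entries, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                entries.append(piece)
                i += size
                break
    return entries


def reverse_max_match(text, vocab, max_len=5):
    """Greedy right-to-left matching, returned in reading order."""
    entries, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                entries.append(piece)
                j -= size
                break
    entries.reverse()
    return entries


def absolute_frequency(entries, text):
    """Occurrences of each entry divided by the length of the text, as in the source."""
    return {entry: n / len(text) for entry, n in Counter(entries).items()}


def drop_rare(entries, text, min_abs_freq):
    """Remove entries whose absolute frequency is below the preset value."""
    freq = absolute_frequency(entries, text)
    return [e for e in entries if freq[e] >= min_abs_freq]


VOCAB = {"no", "now", "here", "where"}  # hypothetical dictionary
forward = forward_max_match("nowhere", VOCAB)
reverse = reverse_max_match("nowhere", VOCAB)
print(forward, reverse)
```

The two directions can disagree on the same string, which is why the method compares the frequency statistics of both entry sets before choosing one.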

Optionally, the machine learning algorithm includes a language learning model and a regression training model, and in the step of dividing the text data into a plurality of entries according to a semantic recognition technique to form the data information, the method further includes:

acquiring semantic definition rules of new entries and proper nouns in the Internet;

after the step of converting the machine learning result into a feature matrix of a word vector, and establishing an internal association relationship between different data information in the data information based on the feature matrix to obtain an information security identification library, the method further comprises:

re-segmenting and learning the text data according to the semantic definition rules and the language learning model to form a text knowledge base;

and performing regression analysis on the information safety knowledge base according to the regression training model and the text knowledge base to obtain new entries and proper nouns which meet regression conditions in the text knowledge base, and adding the new entries and the proper nouns into the information safety knowledge base.

Optionally, the re-segmenting and learning the text data in the text data according to the semantic definition rule and the language learning model to form a text knowledge base includes:

if the language learning model is a TF-IDF model, re-segmenting the text data according to the semantic definition rule to obtain a new entry set;

evaluating each entry in the new entry set according to a feature evaluation method of the TF-IDF model;

and adjusting the new entry set according to the evaluation result to form the text knowledge base.

Optionally, the feature evaluation method of the TF-IDF model includes:

calculating the characteristic frequency TF and the inverse document frequency IDF of each entry in the new entry set in the text data;

and determining the evaluation level P of each entry according to the characteristic frequency and the inverse document frequency.
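The evaluation step can be illustrated with a small TF-IDF computation. The passage above does not give a formula for the evaluation level P, so this sketch assumes P = TF × IDF with one common smoothed IDF variant; the corpus and function names are hypothetical.

```python
import math
from collections import Counter


def tf(entry, doc_entries):
    """Feature frequency: share of the document's entries equal to this entry."""
    return Counter(doc_entries)[entry] / len(doc_entries)


def idf(entry, corpus):
    """Smoothed inverse document frequency (assumed variant)."""
    containing = sum(1 for doc in corpus if entry in doc)
    return math.log((1 + len(corpus)) / (1 + containing)) + 1


def evaluation_level(entry, doc_entries, corpus):
    """Assumed definition of P: the product of TF and IDF."""
    return tf(entry, doc_entries) * idf(entry, corpus)


# Toy corpus of already segmented documents.
corpus = [
    ["phishing", "link", "bank"],
    ["bank", "transfer"],
    ["bank", "link"],
]
for entry in corpus[0]:
    print(entry, round(evaluation_level(entry, corpus[0], corpus), 3))
```

Under this assumption an entry that appears in every document, such as "bank" here, scores lower than one concentrated in a single document, so the entry set can be adjusted by dropping low-scoring entries.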

Optionally, the converting the learning result into a feature matrix of a word vector, and establishing an internal association relationship between different pieces of data information in the data information based on the feature matrix includes:

performing word segmentation training on each entry in the learned text data according to a machine learning model in the machine learning algorithm to obtain word characteristics;

performing vector training on the word features and corresponding semantics thereof through word2vec to generate word vectors of the word features;

carrying out multi-dimensional semantic expansion on the word features, and carrying out vector training on the dimensions according to the training mode of the word vectors to generate corresponding dimension vectors;

calculating a position vector of the word feature in the entry according to the word vector and the dimension vector corresponding to the word vector;

and constructing a three-dimensional space position relation graph of the word features according to the position vectors, wherein the three-dimensional space position relation graph comprises the internal association relation between the entries in the text data.

In addition, in order to achieve the above object, the present invention also provides an information security screening apparatus, including:

a data acquisition module, configured to acquire data information related to network security on each Internet channel through a crawler platform, wherein the crawler platform is built based on a distributed system framework and an in-memory computing engine, and the data information at least comprises text data and image data;

a processing module configured to:

according to a preset machine learning algorithm and a semantic definition algorithm of entries, learning text semantics or image shape outlines of the data information, converting a learning result into a feature matrix of word vectors, and establishing an internal association relationship between different data information in the data information based on the feature matrix to obtain an information security identification library, wherein the internal association relationship comprises a semantic association relationship between text data and a shape outline association relationship between image data;

and selecting a corresponding knowledge base from the information security identification base according to the data type, and carrying out security classification and discrimination processing on the security event based on the semantic association relationship between the text data and the shape and contour association relationship between the image data in the knowledge base.

In another embodiment of the present invention, the data acquisition module comprises a monitoring unit and a grasping unit, wherein,

the monitoring unit is used for setting the crawler platform to continuously monitor interactive data of the Internet channel in real time, extracting basic data related to network security from the monitored interactive data according to a rule of randomly extracting samples, and forming data samples for training the information security identification library based on the basic data, wherein the Internet channel comprises at least one of an Internet webpage and a data storage platform;

the capturing unit is used for dividing the text data into a plurality of entries according to a semantic recognition technology when the extracted basic data is the text data to form the data information, wherein the entries are unit words with definite semantics; and when the extracted basic data is image data, segmenting the image data into a plurality of maps according to the segmentation technology of the minimum unit of the image shape to form the data information, wherein the maps are image fragments with the determined complete outline of a single shape.

In another embodiment of the present invention, the capturing unit is configured to: divide the text data according to a forward segmentation method and a reverse segmentation method, respectively, to obtain a forward entry set and a reverse entry set; calculate the absolute frequency and the relative frequency of each entry in the forward entry set and the reverse entry set; compare the absolute frequency of the forward entry set with the absolute frequency of the reverse entry set, and compare the relative frequency of the forward entry set with the relative frequency of the reverse entry set, to obtain a comparison result of the absolute frequency and the relative frequency; calculate the difference between the absolute frequency and the relative frequency in the comparison result, and select any entry set whose difference is within a preset range as the segmentation set of the text data; judge whether the absolute frequency and the relative frequency of the entries in the selected entry set are greater than the corresponding preset statistical values; and, if the judgment shows that an entry falls below the preset statistical value, remove that entry from the entry set to form the final data information; wherein the absolute frequency is calculated as follows: the number of occurrences of the entry is divided by the length of the text data to obtain the absolute frequency of the entry.

In another embodiment of the present invention, the capturing unit is further configured to obtain semantic definition rules for new terms and proper nouns in the internet;

the processing module is further used for re-segmenting and learning the text data according to the semantic definition rules and the language learning model to form a text knowledge base; and performing regression analysis on the information safety knowledge base according to the regression training model and the text knowledge base to obtain new entries and proper nouns in the text knowledge base which meet the regression conditions, and adding the new entries and the proper nouns to the information safety knowledge base.

In another embodiment of the present invention, if the language learning model is a TF-IDF model, the processing module is configured to re-segment the text data according to the semantic definition rule to obtain a new entry set; evaluate each entry in the new entry set according to a feature evaluation method of the TF-IDF model; and adjust the new entry set according to the evaluation result to form the text knowledge base.

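In another embodiment of the present invention, the feature evaluation method of the TF-IDF model includes: calculating the characteristic frequency TF and the inverse document frequency IDF of each entry in the new entry set in the text data; and determining the evaluation level P of each entry according to the characteristic frequency and the inverse document frequency.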

In another embodiment of the present invention, the processing module is configured to perform word segmentation training on each entry in the learned text data according to a machine learning model in the machine learning algorithm to obtain word features; performing vector training on the word features and corresponding semantics thereof through word2vec to generate word vectors of the word features; carrying out multi-dimensional semantic expansion on the word features, and carrying out vector training on the dimensions according to the training mode of the word vectors to generate corresponding dimension vectors; calculating a position vector of the word feature in the entry according to the word vector and the dimension vector corresponding to the word vector; and constructing a three-dimensional space position relation graph of the word features according to the position vectors, wherein the three-dimensional space position relation graph comprises the internal association relation between the entries in the text data.

In addition, in order to achieve the above object, the present invention provides information security screening equipment, including: a memory, a processor and an information security screening program stored on the memory and executable on the processor, wherein the information security screening program, when executed by the processor, implements the steps of the information security screening method according to any one of the above items.

In addition, to achieve the above object, the present invention provides a computer readable storage medium, which stores an information security screening program, and when the information security screening program is executed by a processor, the information security screening program implements the steps of the information security screening method according to any one of the above items.

In the information security screening method provided by the invention, a crawler system is built on the distributed system architecture Hadoop and the in-memory computing engine Spark, and information is collected from different channels through the crawler system. Industry-leading machine learning and deep learning technologies are then used to continuously learn entries related to information security, so that the data sources of the acquired text keep expanding and network security information is analyzed more comprehensively and at a deeper level, which makes the analysis results more effective and more convincing. Information transmitted in the network is screened based on a knowledge base of the learned entries, so that fraud or vulnerability scenarios in network services are identified and the security of information transmitted over the network is improved.

Drawings

Fig. 1 is a schematic structural diagram of an operating environment of a mobile terminal according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a first embodiment of the information security screening method according to the present invention;

Fig. 3 is a schematic structural diagram of a natural semantic processing platform according to the present invention;

Fig. 4 is a functional block diagram of the information security screening apparatus according to the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The invention provides an information security screening device, which may be a plug-in on a mobile terminal and is used to execute the information security screening method provided by the embodiments of the invention. Fig. 1 is a schematic structural diagram of the operating environment of a mobile terminal on which the information security screening of the embodiments can be implemented.

As shown in fig. 1, the mobile terminal includes: a processor 101, e.g. a CPU, a communication bus 102, a user interface 103, a network interface 104, a memory 105. Wherein the communication bus 102 is used for enabling connection communication between these components. The user interface 103 may comprise a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the network interface 104 may optionally comprise a standard wired interface, a wireless interface (e.g. WI-FI interface). The memory 105 may be a high-speed RAM memory or a non-volatile memory (e.g., a disk memory). The memory 105 may alternatively be a memory system separate from the processor 101 described above.

Those skilled in the art will appreciate that the hardware configuration of the mobile terminal shown in fig. 1 does not constitute a limitation of the information security screening apparatus or device provided by the present invention, and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components.

As shown in fig. 1, the memory 105, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a screening program for network information security. The operating system is a program that manages and controls the hardware and software resources of the information security screening device, and supports the running of the information security screening program and other software and/or programs.

In the hardware configuration of the mobile terminal shown in fig. 1, the network interface 104 is mainly used for accessing a network; the user interface 103 is mainly used for receiving the face image data to be recognized, together with requirements and other parameters for face recognition; and the processor 101 may be used to call the information security screening program stored in the memory 105 and perform the operations of the following embodiments of the information security screening method.

Based on the hardware structure of the mobile terminal described above, the invention provides an information security screening method, which is mainly applied to small terminal devices such as mobile phones, iPads and other mobile devices. Fig. 2 is a flowchart of the information security screening method provided by an embodiment of the invention. In this embodiment, the method specifically includes the following steps:

Step S210, data information related to network security on each Internet channel is obtained through a crawler platform;

In this embodiment, the acquired data information includes at least text data and image data. The crawler platform is implemented on the distributed system architecture Hadoop and the in-memory computing engine Spark, and comprises a base layer, an algorithm layer, a capability layer and a function layer. The base layer provides the deep learning technologies that the platform can implement; the algorithm layer holds the algorithms for processing the acquired data; and the acquired data are processed by these algorithms to deliver the functions and capabilities defined in the capability layer and the function layer.

That is, at the bottom layer the crawler system relies on the Hadoop Distributed File System (HDFS) to store text data on network information security and uses the Spark in-memory computing engine to process the data. The crawler system can provide communication interfaces to various network devices and network software. Through these interfaces the system can obtain data information related to network security from different devices, software and web pages, and can monitor and acquire such information continuously, 24 hours a day and 7 days a week. The sources reached through these interfaces include search engines, news portals, forums, blogs, electronic newspapers and the like.

Here the data information includes at least one of text data and image data. Information on the devices and the network is read through the communication interfaces of the crawler system; this information is text or image information that has already been determined to pose a potential network security risk. The image information may be, for example, icon information associated with a certain behavior. The text information may be statements, or even small pieces of code, programming languages and the like, since information attacks on a network are usually implemented in some programming language and carried out by inserting certain keywords into text information.

Step S220, performing machine learning of text semantics or image shape outline on the data information according to a preset machine learning algorithm and a semantic definition algorithm of the vocabulary entry to obtain a machine learning result;

Step S230, converting the machine learning result into a feature matrix of word vectors, and establishing an internal association relationship between different pieces of data information based on the feature matrix to obtain an information security identification library;

In this step, the internal association relationship includes the semantic association relationship between text data and the shape contour association relationship between image data; in practical applications, it may also be an association relationship between text data and image data.

In this embodiment, converting the learning result into a feature matrix of word vectors and establishing the internal association relationship between different pieces of data information based on the feature matrix proceeds as follows.

In practical applications, the semantic association relationships within the text data, between data and data and between data and models, can be established through learning, specifically by constructing a feature matrix of word vectors:

first, word segmentation training is performed on the text data, that is, the text data is divided into phrases in multiple ways, and a uniquely identified word vector is generated for each phrase of the divided text data;

then, based on the word vectors, the phrases are expanded in multiple dimensions, a dimension vector is generated in each dimension, and additional word senses are added based on the dimension vectors;

finally, the position vector of each phrase within the entry of the text data is calculated from the dimension vector and the word vector, and the word vector and the dimension vector are combined to obtain the semantic relationship vector of each phrase with respect to each entry.

In practical applications, the position vector requires not only learning the position of each phrase in the entries that appear in the text data, but also predicting its position in other entries according to the extended meanings of the phrase.

Specifically, the multi-dimensional expansion is performed mainly according to the semantics of the phrases. The direction and distance between words in the divided text data can be ignored, so that each word position can be encoded directly with every other word in the sentence. A corresponding weight matrix can represent the relationship between each word and the other words in the same sentence: the larger the weight, the stronger the relationship, and weights tend to be larger between words whose meanings are ambiguous;
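One way to picture such a weight matrix is to score every pair of words in a sentence and normalise each row. The sketch below uses softmax-normalised dot products, in the manner of attention weights; this particular formula is an assumption, since the source does not name one, and the two-dimensional toy vectors are hypothetical.

```python
import math


def relation_weights(vectors):
    """Row i holds the normalised weights between word i and every word in the sentence."""
    weights = []
    for query in vectors:
        # Dot product of word i with each word, regardless of direction or distance.
        scores = [sum(a * b for a, b in zip(query, key)) for key in vectors]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy word vectors: the first two words are close, the third is unrelated.
sentence_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
matrix = relation_weights(sentence_vectors)
for row in matrix:
    print([round(w, 3) for w in row])
```

Each row sums to one, and a larger entry indicates a stronger relationship between the two words, matching the reading of the weights given above.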

Further, when representing the text data, a vector containing the position of each word is introduced to carry word-order information. For example, "you owe me 1 million" and "I owe you 1 million" contain the same words but have opposite meanings, and the difference is expressed by the position vector, which encodes the position of each word;

finally, the vectors are combined into a final vector, on the basis of which the association relationships between the data are interpreted by combining association rule mining with graph mining.
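The role of the position vector can be illustrated with two sentences that use the same words in a different order. The sketch assumes a sinusoidal position encoding added to each word vector; the source does not specify how position vectors are computed, and the toy word vectors are hypothetical.

```python
import math


def position_vector(pos, dim):
    """Sinusoidal encoding of a word position (assumed choice, not from the source)."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def encode(words, word_vectors):
    """Return one combined vector per word: word vector plus position vector."""
    dim = len(next(iter(word_vectors.values())))
    return [
        [w + p for w, p in zip(word_vectors[word], position_vector(pos, dim))]
        for pos, word in enumerate(words)
    ]


# Toy four-dimensional word vectors.
word_vectors = {
    "you": [1.0, 0.0, 0.0, 0.0],
    "owe": [0.0, 1.0, 0.0, 0.0],
    "me": [0.0, 0.0, 1.0, 0.0],
}
first = encode(["you", "owe", "me"], word_vectors)
second = encode(["me", "owe", "you"], word_vectors)
print(first != second)
```

Without the position vectors the two sentences would be the same bag of word vectors; with them, the encoding of each word depends on where it stands, so the two word orders are no longer interchangeable.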

Furthermore, the text data contains at least one entry or keyword. For text data with only one entry or keyword, its semantics can be learned through a security learning algorithm. For text data containing two or more entries, in addition to learning the semantics, the semantic relationships between the entries must be established: after the semantics of the entries are learned, the internal association relationships between the entries are established according to those semantics. This can also be understood as simply classifying the entries to form a corresponding security knowledge base. The security knowledge base serves as a reference knowledge base against which the security event to be screened is compared in order to determine its security; it may also serve as a screening feature set for calculating the security of the security event.

In this embodiment, for the text data in the basic learning knowledge base, semantic learning is performed on the text data according to the machine learning model in the machine learning algorithm combined with the semantic definition method of the entries, and the semantic association relationships between the entries in the basic learning knowledge base are established according to the learning result to form the security text knowledge base.

For the image data in the basic learning knowledge base, the shapes and contours of the image data are learned and memorized according to the machine learning model in the machine learning algorithm, and internal connections are established between image data of similar or identical types to form an image knowledge base.

In this embodiment, the security learning algorithm can be understood as a set of algorithms or models for learning to recognize network security issues. The learning mainly targets natural semantics: the meaning of an entry can be determined from its semantic definition. For example, words of an offensive nature generally carry attack semantics, and entries of this attack class are learned and trained on to obtain recognition patterns for them.

For image data, in addition to the memory learning of shape and contour, operations such as image flipping, structure transformation, color histogram extraction, high-level semantic information extraction and low-level visual clustering are performed on the image data to obtain its basic features. On this basis, frequent pattern mining is performed separately on the low-level data features and the high-level semantic features, the two results are fused to form multi-modal association rules, and the association relationships of the image data are interpreted according to these rules.

In this scheme, the knowledge base is built with a machine learning model together with the semantic definitions of the entries, so that new entries can also be learned. New words include general words and industry-specific technical terms that emerge as society and technology develop, as well as proper nouns such as names of people, places and organizations, commodity names, foreign-language renderings of trademarks, and dialect idioms. Proper nouns are a special processing unit: each behaves as a single word as a whole and generally follows specific rules.

The number of new words is difficult to quantify. As society progresses, and especially in rapidly developing fields such as computing, biotechnology and information technology, more and more new words resembling technical terms appear. For example, the word "online game" did not exist before networks and games developed, but it is now a common word that is frequently encountered in text processing and must be cut out as a unit during word segmentation. Distinguishing meaningful new words from huge volumes of unordered information has become an important part of current information work. For this reason, the system uses a mechanical word segmentation method combined with a semantic definition method, and processes the segmented text with statistical knowledge during segmentation so as to recognize new words.

Step S240, acquiring a security event to be processed, and determining the data type of the security event;

in this step, the security event is network information received by the network terminal from the network server through the network.

Step S250, selecting a corresponding knowledge base from the information security identification base according to the data type, and carrying out security classification and discrimination processing on the security event based on the semantic association relationship between the text data and the shape and contour association relationship between the image data in the knowledge base.
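Steps S240 and S250 amount to a dispatch on the data type of the event. The following is a minimal sketch under stated assumptions: the identification library is modeled as a plain dictionary, and a simple lookup stands in for the learned association relationships. The names `IDENTIFICATION_LIBRARY`, `detect_data_type` and `screen_event`, and all entries in the toy knowledge bases, are hypothetical.

```python
# Hypothetical knowledge bases: each maps a known entry (or contour label)
# to a security class assumed to have been learned beforehand.
IDENTIFICATION_LIBRARY = {
    "text": {"phishing": "dangerous", "trojan": "dangerous", "invoice": "safe"},
    "image": {"contour:padlock": "safe", "contour:fake-login-form": "dangerous"},
}


def detect_data_type(event):
    """Step S240: decide whether the event carries text or image data."""
    return "image" if "contours" in event else "text"


def screen_event(event):
    """Step S250: pick the matching knowledge base and classify the event."""
    data_type = detect_data_type(event)
    knowledge_base = IDENTIFICATION_LIBRARY[data_type]
    features = event["contours"] if data_type == "image" else event["entries"]
    # Collect the classes of every feature the knowledge base recognizes.
    classes = [knowledge_base[f] for f in features if f in knowledge_base]
    if "dangerous" in classes:
        return (data_type, "dangerous")
    return (data_type, "safe" if classes else "unknown")
```

In this sketch an event with no recognized feature is reported as "unknown" rather than "safe"; how such events should be treated is a design choice that the description leaves open.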

In this embodiment, the security classification and discrimination specifically refers to judgment of information security level and mining and classification of data text.

In this embodiment, for step S230, building the intrinsic association graph also includes building intrinsic associations between each entry and real security events. In practical application, data information related to network security is acquired in the form of security events; screened security events are cached in the security log of a web page or of a device, and may be either dangerous or safe events. After the events are acquired, key entries are extracted from them by an entry extraction method and then learned in combination with their semantic definitions. The entries extracted from the events are connected through their semantic definitions and the security types of the events, so that the internal relationships of the events are learned.
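One way to picture the association between entries and security events is a co-occurrence graph in which each entry also records the security types of the events it came from. The sketch below is illustrative only: the function name `build_association_graph`, the event format and the use of raw co-occurrence counts are assumptions, and entry extraction is taken as already done.

```python
from collections import defaultdict
from itertools import combinations


def build_association_graph(events):
    """Link entries that co-occur in an event and record each entry's event labels.

    `events` is an iterable of (entries, security_type) pairs.
    """
    edges = defaultdict(int)                        # (entry_a, entry_b) -> co-occurrence count
    labels = defaultdict(lambda: defaultdict(int))  # entry -> {security_type: count}
    for entries, security_type in events:
        unique = sorted(set(entries))
        for entry in unique:
            labels[entry][security_type] += 1
        # Every unordered pair of entries in the same event gets an edge.
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges, labels
```

Edge keys are stored in alphabetical order, so a pair is looked up as `edges[("link", "reset")]` regardless of the order in which the entries appeared in the event.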

In this embodiment, after step S230, an association relationship between the security text knowledge base and the image knowledge base is also established. This step is mainly directed at text data that is accompanied by a matching image. Specifically, whether a currently acquired security event contains both text and an image is determined by learning phrases such as "shown in the following figure" in the text data; if so, the screening proceeds as follows:

First, after an event in which text and an image coexist has been identified, the image is obtained and contour recognition is performed on it. Specifically, the entry interpretation of the image is looked up through contour matching against the image knowledge base; this interpretation is determined through the association relationship between the security knowledge base and the image knowledge base, which has been trained and established in advance. The text in the security event is then compared against the entry interpretation. If the text contains an entry that corresponds to or is similar to the entry interpretation, the security event is considered safe; otherwise a certain risk exists and security control processing is required.
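The text and image consistency check described above can be sketched as follows. This is a simplified illustration under stated assumptions: contour recognition is taken as already done and reduced to a label, the trained association between the two knowledge bases is replaced by a hand-written table, and "similar" entries are approximated by exact word matches. The names `FIGURE_CUE`, `CONTOUR_TO_ENTRIES` and `screen_text_image_event` are hypothetical.

```python
FIGURE_CUE = "shown in the following figure"

# Hypothetical association between image contours and the entries that
# legitimately describe them (a stand-in for the trained link between the
# text knowledge base and the image knowledge base).
CONTOUR_TO_ENTRIES = {
    "padlock": {"lock", "secure", "encryption"},
    "login-form": {"login", "password", "account"},
}


def is_text_image_event(text):
    """Detect the cue phrase that signals an accompanying image."""
    return FIGURE_CUE in text.lower()


def screen_text_image_event(text, contour_label):
    """Return 'safe' if the text mentions an entry associated with the image."""
    if not is_text_image_event(text):
        return "not-applicable"
    expected = CONTOUR_TO_ENTRIES.get(contour_label, set())
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    return "safe" if expected & words else "risk"
```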

In this embodiment, when data information is acquired through a crawler platform, specifically, the crawler platform is set to monitor interactive data of the internet channel in real time without interruption, basic data related to network security is extracted from the monitored interactive data according to a rule of randomly extracting a sample, and a data sample for training the information security recognition library is formed based on the basic data, where the internet channel includes at least one of an internet webpage and a data storage platform;

if the extracted basic data is text data, the text data is divided into a plurality of entries according to a semantic recognition technology to form the data information, where the entries are unit words and phrases with definite semantics, namely the smallest units of the text language that can be used independently, are meaningful, and have a relatively fixed meaning within a paragraph or sentence;

and if the extracted basic data is image data, segmenting the image data into a plurality of maps according to the segmentation technology of the minimum unit of the image shape to form the data information, wherein the maps are image fragments with complete outlines of single shapes.
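The sampling step above (filter the monitored data for security relevance, draw a random sample, then handle text and images separately) might look like the sketch below. The item format, the `is_security_related` predicate and the function name `sample_basic_data` are assumptions made for illustration; the actual crawler platform and its extraction rule are not specified at this level of detail.

```python
import random


def sample_basic_data(items, is_security_related, sample_size, seed=None):
    """Keep security-related items, draw a random sample, and group it by kind.

    Each item is assumed to be a dict with a "kind" key of "text" or "image".
    """
    rng = random.Random(seed)
    related = [item for item in items if is_security_related(item)]
    # Never ask for more items than are available.
    chosen = rng.sample(related, min(sample_size, len(related)))
    grouped = {"text": [], "image": []}
    for item in chosen:
        grouped[item["kind"]].append(item)
    return grouped
```

The text group would then go to entry segmentation and the image group to shape-based segmentation, as described above.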

In practical application, when the basic data is acquired, a large amount of data information is first collected from different channels and filtered according to initialized semantics. The filtering includes signature authentication of the configuration files in the data information, that is, checking whether the data is legal data or data that has undergone security processing. After the verification passes, the related word bank and sample bank are read into the engine memory of the crawler system for storage, thereby forming the basic data.

In this embodiment, the segmenting the text data into a plurality of entries according to a semantic recognition technique, and the forming the data information includes:

comparing the absolute frequency of the forward entry set with the absolute frequency of the reverse entry set, and comparing the relative frequency of the forward entry set with the relative frequency of the reverse entry set, to obtain a comparison result of the absolute frequency and the relative frequency; calculating the difference between the absolute frequency and the relative frequency in the comparison result, and selecting an entry set whose difference is within a preset range as the segmentation set of the text data;

In practical application, two aspects must be considered when implementing entry segmentation: segmentation accuracy and segmentation speed. Whichever word segmentation method is used, a large amount of time is spent calculating the word-forming likelihood of the character string to be segmented, and the candidate entries are then refined according to statistical or grammatical rules to obtain the most likely correct segmentation, which improves accuracy. Therefore, increasing the speed of the initial segmentation also helps to speed up the word segmentation algorithm as a whole.

First, the same sentence is segmented with both the forward maximum matching method and the reverse maximum matching method, and the results are compared. For example, for the segmentation of "Changchun city long spring festival to form a dictionary", the result of the reverse maximum matching method is selected, because the forward maximum matching method produces a word that cannot be matched. Second, referring to the aforementioned concept of word frequency, each word is given a word frequency value according to its probability of occurrence in Chinese. When "Changchun pharmacy in Changchun city" is segmented with both methods, the word frequency of "Chuncou", obtained by the reverse maximum matching method, is much lower than that of the other words; that segmentation is therefore not the usual one, and the result of the forward maximum matching method is selected. Combining the forward and reverse maximum matching methods in this way greatly improves segmentation accuracy, and using a word frequency library alongside them effectively resolves segmentation ambiguity, which further safeguards accuracy. After the entries are obtained by forward and reverse segmentation, their ambiguity is judged semantically, and entries with large ambiguity or a low occurrence frequency are divided again to obtain the final, accurate entries.
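The combination of forward and reverse maximum matching with a word frequency library can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the toy dictionary, its frequency values, the example sentence and the log-frequency scoring rule used to choose between the two results are all assumptions made for the example.

```python
from math import log

# Hypothetical toy dictionary with made-up relative word frequencies.
WORD_FREQ = {
    "研究": 0.004,
    "研究生": 0.001,
    "生命": 0.003,
    "命": 0.0005,
    "起源": 0.002,
}
MAX_LEN = max(len(w) for w in WORD_FREQ)
UNKNOWN_FREQ = 1e-8  # assumed frequency for out-of-dictionary characters


def forward_max_match(text):
    """Scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in WORD_FREQ:
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_max_match(text):
    """Scan right to left, always taking the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(MAX_LEN, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in WORD_FREQ:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]


def score(tokens):
    """Sum of log frequencies; a higher score means more common words."""
    return sum(log(WORD_FREQ.get(t, UNKNOWN_FREQ)) for t in tokens)


def segment(text):
    """Run both matchers and keep the result whose words are more frequent."""
    fwd, rev = forward_max_match(text), reverse_max_match(text)
    return fwd if fwd == rev else max((fwd, rev), key=score)
```

For the sentence "研究生命起源", forward matching yields 研究生 / 命 / 起源 and reverse matching yields 研究 / 生命 / 起源. With the assumed frequencies the reverse result scores higher, so `segment` returns it.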

In this embodiment, whether the forward entry set or the reverse entry set is selected as the segmentation set may be determined from the comparison between the absolute frequency and the relative frequency in each entry set. Optionally, the entry set whose absolute frequency is greater than its relative frequency is selected as the final segmentation set. If the absolute frequency is greater than the relative frequency in both the forward and the reverse entry set, the difference between the absolute frequency and the relative frequency is further determined and one of the entry sets is selected based on that difference; preferably, the entry set whose absolute frequency differs only slightly from its relative frequency is selected as the segmentation set.

In this embodiment, the machine learning algorithm includes a language learning model and a regression training model, and in the step of dividing the text data into a plurality of entries according to a semantic recognition technique to form the data information, the method further includes:

after the step of learning text semantics or image shape contours of the data information according to a preset machine learning algorithm and a semantic definition algorithm of the vocabulary entry, converting a learning result into a feature matrix of a word vector, establishing an internal association relationship between different data information in the data information based on the feature matrix, and obtaining an information security identification library, the method further comprises the following steps:

In this embodiment, the language learning model specifically includes a conditional random field model, a TF-IDF model, a hidden Markov model, word2vec and the like, as well as the more recent BERT model. These are fused with machine learning models, including a vector space model, a probabilistic graphical model, a decision tree model and a support vector machine model, and further learning models from the natural language field (which may also be understood as regression models), such as CNN, RNN, LSTM and xgboost, are added. Natural language processing is performed on the network information text to analyze the matching relationships between keywords and the connections and progressive relationships between contexts, and graph mining is used to present the internal relationships between network information security event entities in the form of a knowledge graph. In actual use, mainstream natural language processing methods are the primary means and are combined with machine learning methods to discriminate security events, thereby improving the accuracy and recall of the system.

In this case, the repartitioning and learning the text data in the text data according to the language learning model in combination with the semantic definition rules of the new terms and the proper nouns to form a text knowledge base includes:

re-segmenting the text data according to the new entries and the semantic definition rules of proper nouns to obtain a new entry set;

In this embodiment, the method for evaluating the characteristics of the TF-IDF model includes:

determining the evaluation level P of each entry according to the feature frequency TF and the inverse document frequency IDF, wherein the calculation formula is as follows:

$$P = TF \times IDF$$

where IDF is derived from DF(t), and DF(t) denotes the number of texts containing the entry t.

In practice, the TF-IDF method described above is one of the most commonly used criteria for evaluating a feature: it uses the TF × IDF value of the feature. TF (feature frequency) is defined as the number of times a feature appears in a page. To account for document length, TF is normalized as:

$$TF'(f_i, p_j) = \frac{TF(f_i, p_j)}{\max_{v} TF(f_v, p_j)}, \quad v = 1, 2, \ldots$$

For a feature in the feature set that does not appear in the document, the TF value may be 0. To avoid this, the TF definition is further modified to $TF''(f_i, p_j) = 0.5 + 0.5 \times TF'(f_i, p_j)$.

The TF value reflects the importance of a feature relative to a document: by default, the more often a feature occurs, the more important it is. However, some features appear in almost all documents and therefore have a high TF value; for example, "computer" occurs very frequently in the text resources of a network education resource management system. Such features clearly do not help classification and should be removed from the feature set. The IDF (inverse document frequency) concept is therefore introduced, which is defined as:

the IDF value of a feature clearly decreases with increasing DF value.
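The feature evaluation described above can be sketched in a few lines. The TF normalization and the 0.5 + 0.5 × TF smoothing follow the text; the IDF expression log(N / DF), with N the number of documents, is the common textbook form and is an assumption here, since the text only states that IDF decreases as DF increases. The function names and the toy corpus are hypothetical.

```python
from collections import Counter
from math import log


def normalized_tf(term, doc_tokens):
    """TF divided by the count of the most frequent feature in the document."""
    counts = Counter(doc_tokens)
    if not counts:
        return 0.0
    return counts[term] / max(counts.values())


def smoothed_tf(term, doc_tokens):
    """0.5 + 0.5 * normalized TF, so absent features do not score 0."""
    return 0.5 + 0.5 * normalized_tf(term, doc_tokens)


def idf(term, corpus):
    """Assumed form log(N / DF); returns 0.0 when the term occurs nowhere."""
    df = sum(1 for doc in corpus if term in doc)
    return log(len(corpus) / df) if df else 0.0


def evaluation_level(term, doc_tokens, corpus):
    """P = TF x IDF for one feature in one document."""
    return smoothed_tf(term, doc_tokens) * idf(term, corpus)


corpus = [
    ["computer", "virus", "attack"],
    ["computer", "network"],
    ["computer", "phishing", "phishing"],
]
```

With this toy corpus, "computer" appears in every document, so its IDF and therefore its evaluation level are 0, which matches the observation that such features do not help classification.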

In the present case, when the knowledge base is obtained in steps S10-S20, it may be built by combining a proprietary knowledge base unique to the security domain with a one-stop natural language processing technology suited to information security. Once the knowledge base obtained in this manner has been trained, requirements such as entity extraction, association relation extraction, data capture, trend tracking, hotspot identification, topic analysis and sentiment judgment can be met for network security information received later. Functions such as intelligent knowledge search, intelligent recommendation, security public opinion analysis, text classification and mining of security information, intention analysis, relationship analysis and a knowledge management system in the security domain can also be realized.

In this scheme, text related to network security information is acquired continuously, 24 hours a day, through customized tasks of the automated crawler system, so that the data sources of the acquired text keep expanding. Network security information is thus analyzed across a more comprehensive range of fields and at a deeper level, which increases the validity and persuasiveness of the analysis results. Meanwhile, machine learning and deep learning are combined with security-domain knowledge to form a knowledge base unique to the security field, and this knowledge is combined with existing security services to provide data and technical support for them. The internal relationships of network information security knowledge are mined through intelligent recommendation, relationship analysis, entity recognition, sentiment analysis and the like, helping actual services to discover more fraud or vulnerability scenarios.

The information security screening method provided by the invention screens entry information in detail through natural semantics, and the screening process can be realized by constructing a natural semantic processing platform. Taking text data as an example, the framework of the processing platform is shown in fig. 3. The platform comprises a functional layer, a capability layer, an algorithm layer and a base layer. The functional layer supports the analysis of various kinds of information, such as text classification, internet data search, certain management functions, and relationship analysis among data.

In this embodiment, after the functional layer acquires the text data, it sends the text data to the capability layer, which performs word segmentation on the text data and also performs semantic recognition. Specifically, the capability layer includes text word segmentation, sentiment analysis, topic analysis, entity recognition, trend tracking, hotspot recognition, and the like.

The capability layer segments the text data, semantically defines and identifies the word features obtained after segmentation, and outputs them to the algorithm layer, which trains on them. Specifically, traditional language models in the algorithm layer, such as the TF-IDF model, the word2vec model and matrix decomposition, perform vector learning on the word features and convert them into vectors, and the internal associations among the entries in the text data are constructed based on these vectors. In practical application, the internal association can be established for each individual word feature.
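The idea of turning word features into vectors and deriving associations from them can be illustrated with a deliberately simple stand-in. The sketch below uses raw co-occurrence counts as vectors and cosine similarity with a fixed threshold as the association test; it is not TF-IDF, word2vec or matrix decomposition, and the function names, the threshold of 0.5 and the toy documents are assumptions chosen for illustration.

```python
from math import sqrt


def cooccurrence_vectors(documents):
    """Represent each entry by how often it co-occurs with every vocabulary entry."""
    vocab = sorted({w for doc in documents for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0] * len(vocab) for w in vocab}
    for doc in documents:
        unique = set(doc)
        for w in unique:
            for other in unique:
                if other != w:
                    vectors[w][index[other]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def associated_entries(vectors, threshold=0.5):
    """Return entry pairs whose vectors are at least `threshold` similar."""
    words = sorted(vectors)
    return [
        (a, b)
        for i, a in enumerate(words)
        for b in words[i + 1:]
        if cosine(vectors[a], vectors[b]) >= threshold
    ]


documents = [
    ["phishing", "link", "password"],
    ["phishing", "link"],
    ["firewall", "patch"],
]
```

Here entries are associated when they appear in similar contexts, so "password" is linked to both "link" and "phishing", while "firewall" is linked to none of them.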

Finally, the data information to be recognized is screened based on the intrinsic association relationship, so that it can be examined from multiple angles and its security is determined through its semantics.

Based on data mining, knowledge graphs and linguistic experience, and on the basis of a knowledge base constructed for the information security field, the scheme combines natural language processing language models, including traditional language models such as the conditional random field model, the TF-IDF model, the hidden Markov model and word2vec, as well as the more recent BERT model. These are fused with machine learning models, including a vector space model, a probabilistic graphical model, a decision tree model and a support vector machine model, and deep learning models of the natural language field such as CNN, RNN and LSTM, together with xgboost, are added. Natural language processing is performed on the network information text to analyze the matching relationships between keywords and the connections and progressive relationships between contexts, and graph mining is used to present the internal relationships between network information security event entities in the form of a knowledge graph. In actual use, mainstream natural language processing methods are the primary means and are combined with machine learning methods to discriminate security events, thereby improving the accuracy and recall of the system.

In order to solve the above problem, an embodiment of the present invention further provides an information security screening apparatus, and referring to fig. 4, fig. 4 is a schematic diagram of functional modules of the information security screening apparatus provided in the embodiment of the present invention. In this embodiment, the apparatus comprises:

the data acquisition module 41 is configured to acquire data information related to network security in each internet channel through a crawler platform, where the crawler platform is built based on a distributed system framework and a memory-type computer engine, and the data information at least includes text data and image data;

a processing module 42 configured to:

and selecting a corresponding knowledge base from the information security identification base according to the data type, and carrying out security classification and discrimination processing on the security event based on the knowledge base.

In practical applications, the functions implemented by the foregoing apparatus may also be implemented by specific functional modules, where the apparatus specifically includes:

a crawler platform, configured to acquire data information related to network security on each internet channel, where the crawler platform is built based on a distributed system framework and a memory-type computer engine, and the data information at least includes text data and image data;

an algorithm training module, configured to learn text semantics or image shape contours of the data information according to a preset machine learning algorithm and a semantic definition algorithm of entries, convert the learning result into a feature matrix of word vectors, and establish an internal association relationship between different data information in the data information based on the feature matrix to obtain an information security recognition library, where the internal association relationship includes a semantic association relationship between text data and a shape contour association relationship between image data;

a screening module, configured to acquire a security event to be processed and determine the data type of the security event, where the security event is network information received by a network terminal from a network server through a network; and to select a corresponding knowledge base from the information security identification base according to the data type, and carry out security classification and discrimination processing on the security event based on the semantic association relationship between the text data and the shape and contour association relationship between the image data in the knowledge base.

The embodiments of the information security screening apparatus are based on the same description as the embodiments of the information security screening method above, and are therefore not described in detail again here.

The invention also provides a computer readable storage medium.

In this embodiment, the computer-readable storage medium stores an information security screening program, and the information security screening program, when executed by a processor, implements the steps of the information security screening method described in any one of the above embodiments. The method implemented when the information security screening program is executed by the processor may refer to each embodiment of the information security screening method of the present invention, and thus, description thereof is not repeated.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM), and includes instructions for causing a terminal (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

The present invention is described in connection with the accompanying drawings, but the present invention is not limited to the above embodiments, which are only illustrative and not restrictive, and those skilled in the art can make various changes without departing from the spirit and scope of the invention as defined by the appended claims, and all changes that come within the meaning and range of equivalency of the specification and drawings that are obvious from the description and the attached claims are intended to be embraced therein.

Claims

1. The method for screening the information security is characterized by comprising the following steps of:

2. The method for screening information security according to claim 1, wherein the step of obtaining the data information related to network security on each internet channel through the crawler platform comprises:

3. The method for screening information security of claim 2, wherein the segmenting the text data into entries according to a semantic recognition technique, and the forming the data information comprises:

4. The information security screening method of claim 3, wherein the machine learning algorithm includes a language learning model and a regression training model, and in the step of segmenting the text data into entries according to a semantic recognition technique to form the data information, the method further includes:

5. The information security screening method of claim 4, wherein the re-segmenting and learning the text data from the text data according to the semantic definition rules and the language learning model to form a text knowledge base comprises:

6. The method for screening information security according to claim 5, wherein the feature evaluation method of the TF-IDF model includes:

7. The information security screening method of any one of claims 1 to 6, wherein the converting the machine learning result into a feature matrix of a word vector, and establishing an intrinsic correlation between different ones of the data information based on the feature matrix comprises:

8. An information security screening apparatus, comprising:

a processing module configured to:

9. An information security screening apparatus characterized in that it comprises: a memory, a processor, and an information security screening program stored on the memory and executable on the processor, the information security screening program when executed by the processor implementing the steps of the information security screening method of any one of claims 1-7.

10. A computer-readable storage medium, on which an information security screening program is stored, which, when executed by a processor, implements the steps of the information security screening method according to any one of claims 1 to 7.