CN110719253A

CN110719253A - Web honeypot system based on intelligent question answering

Info

Publication number: CN110719253A
Application number: CN201910807155.9A
Authority: CN
Inventors: 黄诚; 方勇; 龙啸; 高健
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2019-08-29
Filing date: 2019-08-29
Publication date: 2020-01-21

Abstract

The invention relates to an adaptive Web honeypot system based on intelligent question-answering technology, intended to capture deep, complete attack chains. The system consists of three main algorithmic models. An attention-based LSTM model analyzes the attacker's message as a question and extracts its key attack vector. A SeqGAN-based sensitive information forgery model dynamically generates deceptive feedback from that attack vector. A Web application learning model based on external observation adaptively learns the characteristics of the application by observing it from outside. The honeypot system replaces traditional honeypot deception with deception driven by context semantics and can produce generative, question-answering style responses.

Description

Web honeypot system based on intelligent question answering

Technical Field

Background

Cyberspace has become an important factor affecting national security, international relations, and the political landscape. The Internet, together with the cyberspace formed by the applications, services, and data it carries, is changing the way people live. As it becomes integrated into every part of society, the risk of information leakage for individuals, enterprises, and even governments and countries increases.

To deal with security threats in cyberspace, industry and academia have introduced a variety of security products. Traditional products such as firewalls, intrusion detection and prevention systems, network behavior management systems, security scanners, and vulnerability auditing tools cannot fundamentally reverse the imbalance between attackers and defenders. They can only discover and repair security problems that exist in systems and networks; they cannot help the defender obtain complete attack chain information about an attacker.

Honeypots and the deception techniques derived from them are tools for changing this asymmetry between attack and defense. For this reason, the mechanism design and practical application of honeypots have long been a focus of security researchers. However, existing research results and engineering practice have many shortcomings. High-interaction honeypots can attract deeper attacks but carry high deployment risk and require complex configuration. Low-interaction honeypots can be deployed quickly but respond to attack requests only in simple ways. Current deception research focuses mainly on deception at the protocol-stack level and lacks research on deception at the strategy level.

To address these problems, the invention transfers research from the field of intelligent question answering to construct a Web dynamic deception feedback system based on an intelligent question-answering mechanism. The system resolves the conflict between honeypot interactivity and safety, improves the deceptiveness and concealment of Web honeypots, and makes full use of the active defense and intelligence-gathering capabilities of network deception technology.

Disclosure of Invention

The invention relates to a Web dynamic deception feedback system based on an intelligent question-answering mechanism. An attention-based LSTM module analyzes the attacker's request; a SeqGAN-based sensitive information generation module produces trapping information from the analyzed request; the generated information is embedded into a webpage template produced by the external observation module; and the resulting complete webpage is used to deceive the attacker.

The invention includes the following aspects:

1) establishing a scanner cluster from open-source vulnerability scanners;

2) setting up an HTTP proxy to capture, in a man-in-the-middle manner, all messages exchanged between the scanners and the vulnerable pages;

3) collecting Web vulnerability reports from real penetration tests;

4) combining the attack interaction messages from the experimental environment with the Web vulnerability reports from real penetration tests to form the original question-answer corpus;

5) preprocessing the original question-answer corpus, including deduplication, special-symbol cleaning, word segmentation, and word embedding;

6) applying special handling to obfuscated special symbols in the original question-answer corpus;

7) segmenting the original question-answer corpus into words based on the Web protocol;

8) embedding the segmented corpus as word vectors using the Word2Vec algorithm from the Gensim package;

9) forming the final question-answer corpus through the steps above;

10) applying semantic analysis based on the Web protocol, which exploits Web semantics and represents the semantics in the protocol through natural-language encoding;

11) extracting the represented semantic features to describe malicious attacks;

12) extracting the key attack vector from the attack request message using an attention-based LSTM model;

13) using a SeqGAN-based model to recognize the semantics of the attack request, infer the effect the attacker wants to achieve, and forge sensitive feedback information;

14) using a webpage template generation model based on Web crawling, in which a crawler interactively accesses the website to be protected and records the real website resources and response information;

15) constructing website templates and a route mapping from the correspondence between website paths and webpages, using the relevant strategy;

16) forming the dynamic honeypot generation system from the external observation learning mechanism, the key attack vector extraction model, and the adversarial generation model.
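Steps 5) to 7) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the token pattern `TOKEN_RE`, the function names, and the choice to treat URL-encoded characters as the "obfuscated special symbols" are all hypothetical. The embedding of step 8) is left out because it needs the third-party Gensim package; the token lists produced here have the list-of-token-lists shape that a Word2Vec trainer typically consumes.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Hypothetical token pattern: a word, a number, or one punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z_]+|\d+|[^\sA-Za-z_\d]")


def tokenize_request_line(request_line):
    """Split 'METHOD target HTTP/x.y' into protocol-aware tokens."""
    method, target, _version = request_line.split(" ", 2)
    parts = urlsplit(target)
    tokens = [method]
    # Path segments are kept whole, e.g. 'item.php'.
    tokens += [segment for segment in parts.path.split("/") if segment]
    # parse_qsl percent-decodes values, which undoes simple URL-encoding
    # obfuscation such as %27 for a single quote.
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        tokens.append(key)
        tokens += TOKEN_RE.findall(value)
    return tokens


def build_corpus(request_lines):
    """Drop blank and duplicate lines, then tokenize what remains."""
    seen = set()
    corpus = []
    for line in request_lines:
        line = line.strip()
        if not line or line in seen:
            continue
        seen.add(line)
        corpus.append(tokenize_request_line(line))
    return corpus


captured = [
    "GET /item.php?id=1%27+OR+1%3D1 HTTP/1.1",
    "GET /item.php?id=1%27+OR+1%3D1 HTTP/1.1",  # duplicate scanner probe
]
corpus = build_corpus(captured)
```

Splitting on protocol structure first (method, path, query keys) and only then on characters keeps protocol fields intact while still exposing individual injection symbols such as quotes and operators.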

Based on the above, the invention adopts the following technical scheme. A Web honeypot system based on intelligent question answering comprises five parts: data collection and preprocessing, external-observation simulation of the Web application, attack anomaly point determination, sensitive information generation, and attack response generation. The system provides the following functions:

1) building a scanner cluster from the major open-source scanners, collecting various Web vulnerability reports, and gathering the original question-answer corpus;

2) removing duplicates and special symbols from the original question-answer corpus by automated and manual means;

3) performing cleaning, word segmentation, and related operations;

4) embedding the processed corpus with the Word2Vec algorithm from the Gensim package to generate word vectors;

5) learning the route structure and the visual characteristics of the pages of the Web application from outside, using a Web crawler and a parser;

6) capturing attacks, extracting the attack vector, and generating deceptive feedback information;

7) generating question-answer responses to the attacker's attacks based on the semantic information in the Web protocol;

8) returning to the attacker a feedback webpage composed of the sensitive information and the webpage template, according to the template generation strategy.

The objectives of the invention are as follows:

1) to provide a honeypot model based on intelligent question answering that responds effectively and dynamically to the behavior of malicious attackers;

2) to collect a high-quality intelligent question-answer corpus;

3) to extract the key anomalous vectors in attack messages using an attention-based LSTM algorithm;

4) to learn, on the basis of SeqGAN and from a semantic perspective, the mapping from attack sequences to feedback sequences;

5) to build a corresponding honeypot system prototype and improve the reliability and success rate of the Web honeypot system in dynamic feedback, adaptive configuration, visual deception, and attack trapping.

Drawings

FIG. 1 is a schematic diagram of the overall framework of the model of the present invention.

FIG. 2 is a functional design diagram of data collection in the present invention.

FIG. 3 is a functional design diagram of data preprocessing in the present invention.

FIG. 4 is a functional design diagram of the attack anomaly point determination module of the present invention.

FIG. 5 is an algorithm design diagram of the sensitive information generation module of the present invention.

FIG. 6 is a functional design diagram of the attack response module of the present invention.

Detailed Description

The honeypot system consists of the following five modules. The technical scheme of the embodiments of the application is described clearly and completely below with reference to the accompanying drawings.

Honeypots and the network deception techniques derived from them are tools for changing the asymmetry between attack and defense. The semantics-based intelligent question-answering Web honeypot uses context semantics in place of traditional recognition of a single attack vector, so it can respond to an attacker's malicious behavior in a way that is more consistent with the semantics of the request. This improves the reliability and success rate of the Web honeypot system in dynamic feedback, adaptive configuration, visual deception, and attack trapping. The specific technical scheme is as follows.

FIG. 1 is a schematic diagram of the overall framework of the model and details the design of the honeypot based on intelligent question answering. As shown in FIG. 1, the system comprises the following modules.

(1) Data collection and preprocessing module

The data collection and preprocessing module is responsible for collecting the corpus and embedding it as vectors for the model. The question-answer corpus has two sources: attack interaction messages from the experimental environment and Web vulnerability reports from real penetration tests. To collect experimental data, an HTTP proxy intercepts the scanning traffic between a scanner and a vulnerable page and records each request message together with its response message; the attack vector in the request and the sensitive information in the corresponding response form one question-answer pair. In addition, question-answer pairs are extracted manually from Web vulnerability analysis reports.
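The pairing of recorded requests and responses can be sketched as follows. The source does not say how the attack vector or the sensitive information is isolated, so both extraction rules here are assumptions made only for illustration: the attack vector is taken to be the decoded query string, and the sensitive information is taken to be the response lines that do not appear in a benign baseline response. `QAPair`, `attack_vector`, `sensitive_lines`, and `build_pairs` are hypothetical names.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit, unquote_plus


@dataclass(frozen=True)
class QAPair:
    question: str  # attack vector taken from the request
    answer: str    # sensitive content taken from the response


def attack_vector(request_target):
    """Assumption: the attack vector is the decoded query string."""
    return unquote_plus(urlsplit(request_target).query)


def sensitive_lines(baseline_body, attacked_body):
    """Assumption: lines absent from a benign baseline response are the leak."""
    baseline = set(baseline_body.splitlines())
    leaked = [line for line in attacked_body.splitlines() if line not in baseline]
    return "\n".join(leaked)


def build_pairs(exchanges, baseline_body):
    """Turn (request target, response body) records into unique QA pairs."""
    pairs = []
    for target, body in exchanges:
        question = attack_vector(target)
        answer = sensitive_lines(baseline_body, body)
        if question and answer:
            pair = QAPair(question, answer)
            if pair not in pairs:
                pairs.append(pair)
    return pairs


baseline = "<h1>Item</h1>\n<p>ok</p>"
exchanges = [
    ("/item.php?id=1%27+UNION+SELECT+user%2Cpass",
     "<h1>Item</h1>\n<p>admin:5f4dcc3b</p>"),
    ("/item.php?id=1", baseline),  # benign probe, no leak, so no pair
]
pairs = build_pairs(exchanges, baseline)
```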

(2) Web application external observation simulation module

The Web application external observation module is mainly responsible for adaptive generation of the honeypot. It learns the route structure and the visual characteristics of the pages of the Web application from outside, so that the honeypot framework can generate deceptive content automatically and dynamically. The module consists mainly of a Web crawler and a parser. The crawler first accesses the Web application to be protected from outside, recursively extracts all hyperlinks that belong to the current domain name, and stores each hyperlink together with the application message it maps to. The parser then analyzes the stored hyperlinks and application messages and classifies resources as static or dynamic according to the relationship between each hyperlink and its webpage. Finally, the parser generates an access routing table and the corresponding webpage templates.
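The crawl-and-classify flow described above can be sketched as follows. This is an illustration under assumptions, not the patented parser: the site is an in-memory dictionary so the example runs without network access, `fetch` is a hypothetical callable that a real deployment would back with an HTTP client, and the static/dynamic decision uses a file-suffix heuristic because the source does not specify the actual rule.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Assumption: file suffixes stand in for the unspecified static/dynamic rule.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


class LinkParser(HTMLParser):
    """Collect href and src attribute values from a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(fetch, start_url):
    """Traverse same-host links breadth-first and return {path: kind}."""
    host = urlsplit(start_url).netloc
    routes = {}
    queue = [start_url]
    seen = {start_url}
    while queue:
        url = queue.pop(0)
        body = fetch(url)
        if body is None:  # resource could not be retrieved
            continue
        path = urlsplit(url).path or "/"
        is_static = path.lower().endswith(STATIC_SUFFIXES)
        routes[path] = "static" if is_static else "dynamic"
        parser = LinkParser()
        parser.feed(body)
        for link in parser.links:
            absolute = urljoin(url, link)
            # Keep only links on the protected application's own host.
            if urlsplit(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return routes


# Hypothetical three-resource site used in place of a live application.
SITE = {
    "http://shop.example/": (
        '<a href="/login.php">Login</a><link href="/style.css">'
        '<a href="http://other.example/">External</a>'
    ),
    "http://shop.example/login.php": '<a href="/">Home</a>',
    "http://shop.example/style.css": "body{}",
}
routes = crawl(SITE.get, "http://shop.example/")
```

The resulting mapping plays the role of the access routing table; each dynamic path would additionally be associated with a webpage template.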

(3) Attack abnormal point judging module

The attack anomaly point determination module extracts the specific anomalous information in an attacker's Web request. It uses an LSTM algorithm improved with an attention mechanism to locate the anomalous attack points. The module has two main parts: training of the attention LSTM model, and determination and extraction of anomaly points. Training is semi-supervised: normal Web application request packets serve as training samples, the LSTM encodes each sample, and the parameters of the model are tuned during training.
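The extraction step can be illustrated with a small sketch of attention-style weighting. It assumes that a trained model already yields one relevance score per token; the scores below are made up by hand, and the LSTM itself is not implemented. The sketch only shows how softmax-normalized weights could be used to pick the tokens that form the key attack vector. The function names and the `top_k` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def key_attack_tokens(tokens, scores, top_k=3):
    """Return the top_k highest-weighted tokens in their original order."""
    weights = softmax(scores)
    ranked = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:top_k])]


tokens = ["GET", "item.php", "id", "1", "'", "OR", "1", "=", "1"]
# Hand-written placeholder scores; a trained model would supply these.
scores = [0.1, 0.2, 0.3, 0.5, 3.0, 2.5, 0.5, 2.0, 0.5]
vector = key_attack_tokens(tokens, scores)
```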

(4) Sensitive information generating module

The sensitive information generation module generates question-answer responses to the attacker's attacks based on the semantic information in the Web protocol. It trains a generative model on the collected corpus of attack vectors and sensitive information and uses an improved generative adversarial network to learn the latent relationship between attack sequences and feedback sequences. The module mainly uses an improved SeqGAN algorithm for adversarial generation, with word embedding implemented by word2vec. The embedded data consists mainly of attack word sequences and sensitive feedback word sequences. The autoencoder model of the generator transforms an input attack sequence, and the distribution of the transformed sequence is compared with that of the sensitive feedback word sequences. During training, output that starts from initialized noise is driven progressively closer to the distribution of the sensitive feedback word sequences, while the discriminator distinguishes real data from generated data. Once the loss function falls below a set threshold, the generator is considered able to produce the corresponding sensitive feedback for previously unseen attack input.
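For reference, the adversarial training described above follows the general shape of the standard SeqGAN objectives, written out below. The source does not give formulas for its improved variant, so this is the baseline formulation only: $G_\theta$ is the generator, $D_\phi$ the discriminator, $Y_{1:T}$ a feedback token sequence, and $Q$ the action value that SeqGAN estimates by Monte Carlo rollouts scored by the discriminator. In the Seq2Seq-fused model, the generator would presumably also be conditioned on the input attack sequence; that conditioning is an assumption and is not shown.

```latex
% Discriminator: separate real feedback sequences from generated ones
\min_{\phi}\;
  -\mathbb{E}_{Y \sim p_{\mathrm{data}}}\!\left[\log D_\phi(Y)\right]
  -\mathbb{E}_{Y \sim G_\theta}\!\left[\log\bigl(1 - D_\phi(Y)\bigr)\right]

% Generator: policy gradient with the discriminator output as reward
\nabla_\theta J(\theta) \simeq \sum_{t=1}^{T}
  \mathbb{E}_{y_t \sim G_\theta(\cdot \mid Y_{1:t-1})}
  \left[\nabla_\theta \log G_\theta(y_t \mid Y_{1:t-1})\,
        Q_{D_\phi}^{G_\theta}(Y_{1:t-1},\, y_t)\right]
```

The policy-gradient form is what lets a GAN train on discrete token sequences, since sampling a token is not differentiable.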

(5) Attack response generation module

The attack response generation module aggregates the functions of the other modules and implements the complete logic from receiving an attack vector to returning a feedback webpage. It interfaces with the three modules above. The routing function receives the path requested by the attacker and matches it by similarity against the routing table of the external observation module. A request that misses every route receives a resource-not-found response; a request that hits a route is passed to the anomaly point determination module for further processing. The internal resource function receives the webpage template of the matched route and the generated fake sensitive information, replaces the placeholders in the template according to the template generation strategy, and returns to the attacker a feedback webpage composed of the sensitive information and the webpage template, thereby deceiving the attacker.

The Web honeypot system based on intelligent question answering has been described in detail above.
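The routing and template-filling logic can be sketched as follows. Several details are assumptions for illustration: the source does not name the similarity measure or the threshold, so `difflib.SequenceMatcher` and a cutoff of 0.8 are used here; the routing table, the `$sensitive` placeholder, and the two stub callables standing in for the anomaly point determination and sensitive information generation modules are hypothetical.

```python
from difflib import SequenceMatcher
from string import Template

# Hypothetical routing table as the external observation module might emit it.
ROUTES = {
    "/login.php": Template("<html><body><h1>Login</h1><pre>$sensitive</pre></body></html>"),
    "/item.php": Template("<html><body><h1>Item</h1><pre>$sensitive</pre></body></html>"),
}


def match_route(path, routes, threshold=0.8):
    """Return the most similar known route, or None if nothing is close enough."""
    best_path, best_ratio = None, 0.0
    for candidate in routes:
        ratio = SequenceMatcher(None, path, candidate).ratio()
        if ratio > best_ratio:
            best_path, best_ratio = candidate, ratio
    return best_path if best_ratio >= threshold else None


def respond(target, extract_vector, generate_sensitive, routes=ROUTES):
    """Map a request target to a (status, body) deception response."""
    path = target.split("?", 1)[0]
    route = match_route(path, routes)
    if route is None:
        return 404, "Not Found"  # missed route: resource does not exist
    vector = extract_vector(target)
    body = routes[route].substitute(sensitive=generate_sensitive(vector))
    return 200, body


def stub_extract(target):
    """Stand-in for the anomaly point determination module."""
    return target.partition("?")[2]


def stub_forge(vector):
    """Stand-in for the sensitive information generation module."""
    return "root:x:0:0:root:/root:/bin/bash"


hit = respond("/item.php?id=../../etc/passwd", stub_extract, stub_forge)
miss = respond("/admin/secret", stub_extract, stub_forge)
```

Passing the two model stages in as callables keeps the response module a thin aggregator, which matches its described role of connecting the other modules.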

Claims

1. A Web honeypot system based on intelligent question answering, characterized by comprising the following steps: step one: extracting templates from the website application to be protected and generating the mapping between routes and templates; step two: extracting the attack vector from the attacker's request using an attention-based LSTM model; step three: forging, using a SeqGAN model, the sensitive information that the attacker expects to obtain through the attack; step four: combining the three models into a honeypot system that performs adaptive deception defense for the website application.

2. The external observation model of claim 1, characterized in that: webpage templates are generated with a character-level webpage difference strategy based on the Ratcliff-Obershelp algorithm with segmented hashing; the generation strategy covers text variables and random hashes, which can be detected entirely from character differences between webpages; time variables, which are consistent across multiple webpages and therefore require a dedicated regular expression for matching; and reflected variables, which require comparing the differences between webpages against the identical variables in the webpages and in the webpage resource paths; when none of the detection strategies hits, the current variable is marked as an unknown variable.

3. The attention-based LSTM attack vector extraction model of claim 1, characterized in that: the attack vector in an attack message is extracted using an LSTM algorithm model improved with an attention mechanism.

4. The SeqGAN sensitive information generation model of claim 1, characterized in that: an improved SeqGAN model fused with Seq2Seq is used.

5. The three main modules of claim 1, characterized in that: together they form a honeypot system that can dynamically simulate any Web application, generate a corresponding honeypot system, and dynamically deceive an attacker according to the context semantics of the attacker's requests.