CN112528190A

CN112528190A - Web page tampering judgment method and device based on fragmentation structure and content and storage medium

Info

Publication number: CN112528190A
Application number: CN202011543689.4A
Authority: CN
Inventors: 杜家浩; 黄旭; 石少东
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Hangzhou Information Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Hangzhou Information Technology Co Ltd
Priority date: 2020-12-23
Filing date: 2020-12-23
Publication date: 2021-03-19

Abstract

The invention belongs to the field of IT web page security protection and provides a web page tampering judgment method and device based on a fragmentation structure and content, and a storage medium. It addresses the problem that judging whether a web page has been tampered with by text comparison alone tends to miss implanted hidden links (dark chains). The method comprises: obtaining the HTML source code; inputting the HTML source code into a preset tag tree structure model and a preset content evaluation model respectively, the content evaluation model being established through an LSTM network model; calculating a suspicious weight α from the HTML source code and the tag tree structure model; calculating a content difference degree β from the HTML source code and the content evaluation model; and having a comprehensive evaluator substitute the suspicious weight α and the content difference degree β into a judgment formula and judge from the output value whether tampering exists. Web page attacks by hidden-link implantation and by text replacement can thereby both be detected accurately.

Description

Web page tampering judgment method and device based on fragmentation structure and content and storage medium

Technical Field

The invention belongs to the field of IT webpage safety protection, and particularly relates to a webpage tampering judgment method and device based on a fragmentation structure and content, and a storage medium.

Background

With the rapid development of the internet, network application products in every industry have grown explosively, but the accompanying network security problems have also become increasingly prominent and cannot be ignored. Web browsing is the main everyday channel for acquiring information; according to incomplete statistics from 2018, China alone had 260.4 billion websites. Maintaining the security of these websites is increasingly challenging. In 2017, the national internet emergency center detected about 20,000 tampered websites in China, 618 of them government websites. In terms of tampering method, websites implanted with hidden links (dark chains) accounted for 68% of all tampered websites, and hidden-link implantation remains the main form of website tampering in China. Web page tampering has thus shifted from ordinary modification of page text to the implantation of hidden links.

In the prior art, however, most evaluation of whether a web page has been tampered with uses hash watermark comparison over the source code of the whole page: when the hashes differ, the comparison recurses down to the XPath (XML Path Language) path of the specific DOM (Document Object Model) node that changed, and an alarm message containing that XPath is output. Such methods calculate the text similarity of the two versions as the main basis for judging whether the page has been tampered with. They require few resources and are easy to implement, and they are mainly suited to scenarios in which the page structure and content differ obviously before and after tampering.

For a dynamic web page with a large amount of text, however, similarity matching is a poor basis for judgment. A similarity matching algorithm considers from a global perspective whether the page has been tampered with, so it cannot locate a small piece of tampered content inside a page with a lot of text. In addition, the similarity threshold is chosen manually: a threshold that is too low produces a large number of missed tampering reports, while one that is too high produces a large number of false tampering alarms. Judging from text content alone is therefore not flexible enough and is hard to apply to the now ubiquitous dynamic-content web pages.
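The limitation of global similarity matching can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the 500-word page, the placeholder URL `http://spam.example`, the `display:none` style used to hide the link, and the helper name `word_similarity`. It uses the standard-library `difflib.SequenceMatcher` as a stand-in for whatever similarity algorithm a prior-art system might use.

```python
from difflib import SequenceMatcher


def word_similarity(reference_html, current_html):
    """Similarity in [0, 1] over whitespace-separated tokens (illustrative helper)."""
    matcher = SequenceMatcher(None, reference_html.split(), current_html.split(),
                              autojunk=False)
    return matcher.ratio()


# Hypothetical reference page with 500 words of body text.
body = " ".join(f"word{i}" for i in range(500))
reference = f"<html><body><p>{body}</p></body></html>"

# The same page with one hidden link implanted after the paragraph.
hidden = '<a href="http://spam.example" style="display:none">x</a>'
tampered = reference.replace("</p>", "</p>" + hidden)

score = word_similarity(reference, tampered)
print(f"similarity after implanting one hidden link: {score:.4f}")
```

Because only a handful of tokens change, the score stays very close to 1, so any manually chosen threshold low enough to tolerate ordinary dynamic content would also let this page pass.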

Disclosure of Invention

The invention provides a web page tampering judgment method and device based on a fragmentation structure and content, and a storage medium, and solves the problem in the prior art that judging whether a web page has been tampered with by text comparison alone easily overlooks the implantation of hidden links (dark chains).

The basic scheme of the invention is as follows: a webpage tampering judgment method based on a fragmentation structure and content comprises the following steps:

acquiring the HTML source code;

inputting the HTML source code into a preset tag tree structure model and a preset content evaluation model respectively, wherein the tag tree structure model is obtained by training a first neural network model on a training sample set whose training samples comprise both tampered and untampered samples, and the content evaluation model is obtained by training a second neural network model on a sample library comprising text content;

calculating a suspicious weight α according to the HTML source code and the tag tree structure model, and calculating a content difference degree β according to the HTML source code and the content evaluation model;

and substituting, by a comprehensive evaluator, the suspicious weight α and the content difference degree β into a judgment formula, and judging from the output value whether tampering exists.

The principle and the beneficial effects of the basic scheme are as follows: the tag tree structure model checks all links in the web page to obtain the suspiciousness of hidden-link (dark chain) modification or insertion, the content evaluation model checks the text in the web page to obtain a content difference degree reflecting the possibility that content has been modified, and whether the page has been tampered with is judged comprehensively from the combination of the suspiciousness and the content difference degree. Both forms of web page tampering, text tampering and link tampering, are thus taken into account, so tampering is judged more accurately than in the prior art, which relies on text comparison alone. Moreover, a tag tree structure model over the fragmentation structure is established to calculate the suspiciousness at the tag level, which addresses both the tendency to miss hidden links and the need to set a similarity threshold manually.

Further, training the first neural network model on the training sample set includes:

acquiring the training sample set, extracting the tag tree from the HTML source code in the training sample set, generating a tag path set, normalizing it, and training the first neural network model on the normalized tag path set.

Beneficial effects: the tag tree is a representation of the fragmentation structure in which the paths of all tags are arranged. In this scheme an LSTM network model learns the tag trees in the training sample set, and the model is tuned by comparing the expectation it outputs with the preset expectation, so that the error of the model's output in step S2 stays within a preset range at run time. The scheme thus aims at training and tuning the LSTM network model. No manual judgment of whether the model has learned adequately is needed: the LSTM parameters are adjusted automatically from the comparison between the expected values, which keeps the error of the resulting tag tree structure model within a threshold.

Further, obtaining the tag tree structure model by training the first neural network model on the training sample set includes:

adjusting the parameters of the first neural network model according to the difference between the actual expectation preset in the training sample set and the theoretical expectation finally output by the first neural network model.

Further, calculating the suspicious weight α according to the HTML source code and the tag tree structure model includes:

normalizing the tag paths in the HTML source code, substituting the normalized tag paths into the preset tag tree structure model, and taking the output of the tag tree structure model at time t as the suspicious weight α.

Beneficial effects: the stable tag tree structure model obtained in the previous step is applied to the tag paths in the HTML source code collected this time, producing an intermediate result, and this intermediate result is used as the suspicious weight α.

The stable tag tree structure model obtained in the previous step contains not only the method for calculating the suspicious weight from a tag but also a standard data set, that is, a set of samples that have not been tampered with. A suspicious weight α is therefore calculated for each tag path: the less often the X(i) corresponding to the tag appears in the standard data set, the higher the weight α, indicating that the hierarchical structure of the tag is more likely to occur in a tampered web page.

Further, the content evaluation model is established through an LSTM network model, and training the second neural network model on the sample library includes:

acquiring the sample library, extracting the text content corresponding to the tags in the sample library, generating a vocabulary set, merging and normalizing it, and training the second neural network model on the normalized vocabulary set.

Beneficial effects: the LSTM network model is trained on the sample library, and its parameters are fine-tuned according to the comparison between the expected value it outputs and the preset expected value, which ensures the accuracy of the content evaluation model finally output.

Further, calculating the content difference degree β according to the HTML source code and the content evaluation model includes:

extracting the tag content of each tag in the HTML source code to form a tag set, normalizing the tag set, substituting it into the preset content evaluation model, and taking the result output for each tag as the content difference degree β.

Beneficial effects: the tag content of each tag in the HTML source code of the current web page is input into the content evaluation model to calculate the content difference degree β for that tag, which measures how far the content on a tag path differs from the web page as a whole.
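The extraction of tag content described above can be sketched in Python with the standard-library `html.parser`. This is only an illustrative sketch: the class name `TagTextCollector`, the function name `tag_word_sets`, and the choice to attach each text fragment to the innermost open element and split it on whitespace are assumptions, not details given in the source.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never kept on the open stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagTextCollector(HTMLParser):
    """Hypothetical helper: gathers the text found directly inside each element."""

    def __init__(self):
        super().__init__()
        self.open_elems = []  # (tag, index into self.items) for each open element
        self.items = []       # (tag_path, [text fragments]) in document order

    def handle_starttag(self, tag, attrs):
        path = "/" + "/".join([t for t, _ in self.open_elems] + [tag])
        self.items.append((path, []))
        if tag not in VOID_TAGS:
            self.open_elems.append((tag, len(self.items) - 1))

    def handle_endtag(self, tag):
        # Pop back to the matching open element; ignore stray closing tags.
        if any(t == tag for t, _ in self.open_elems):
            while self.open_elems.pop()[0] != tag:
                pass

    def handle_data(self, data):
        # Text belongs to the innermost element that is still open.
        if self.open_elems and data.strip():
            self.items[self.open_elems[-1][1]][1].append(data)


def tag_word_sets(html):
    """Return (tag path, list of words) for every element in the page."""
    collector = TagTextCollector()
    collector.feed(html)
    collector.close()
    return [(path, " ".join(parts).split()) for path, parts in collector.items]
```

The resulting word lists would still need to be normalized (the source names Word2vec) before being passed to a content evaluation model.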

Further, Word2vec is adopted in the normalization processing.

Further, the judgment formula is as follows:

J(i) = (α(i) + σ) × β(i), i ∈ {1, 2, …, N};

where N is the total number of paths in the tag tree generated from the HTML source code, α(i) is the suspicious weight of the i-th tag path, σ is an empirical constant, and β(i) is the degree of difference between the text content of the i-th tag path and the content of the whole web page.
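A direct Python transcription of the judgment formula may make the per-path computation concrete. The function name `tamper_scores` and all numeric values in the usage lines are hypothetical; the source does not give a value for σ.

```python
def tamper_scores(alpha, beta, sigma):
    """Compute J(i) = (alpha(i) + sigma) * beta(i) for every tag path i."""
    if len(alpha) != len(beta):
        raise ValueError("alpha and beta must have one entry per tag path")
    return [(a + sigma) * b for a, b in zip(alpha, beta)]


# Hypothetical values for a page with two tag paths.
scores = tamper_scores(alpha=[0.1, 0.9], beta=[0.2, 0.8], sigma=0.1)
print(scores)
```

Because the two factors are multiplied, a path scores high only when both its structure is unusual and its content differs from the rest of the page; σ keeps a path with α(i) = 0 from being ignored entirely when β(i) is large.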

Beneficial effects: the method also works when no reference web page is available, since whether the current web page shows tampering can be judged directly from the current page itself. The overall calculation model is highly flexible, and the trained system has a degree of generality.

Further, before the suspicious weight α and the content difference degree β are substituted into the judgment formula, the method further includes:

generating a tag path set from the reference HTML of a reference web page, comparing it with the tag path set in the HTML source code of the current web page, and updating the value of the content difference degree β according to the comparison result.

Beneficial effects: this scheme also considers whether a reference web page exists. When one exists, the content difference between the reference page and the current page is judged directly by comparing the tag path set base(i) of the reference page with the tag path set Path(i) in the HTML source code of the current page, and β is updated accordingly. If the current page is consistent with the reference page, the content difference degree β that the content evaluation model calculates for the current page only approaches 0 asymptotically, which leaves some error relative to the true content difference of 0 between the two pages. By updating the value of β directly from the result of the consistency check, the scheme improves overall accuracy.

The invention also provides a web page tampering judgment device based on a fragmentation structure and content, comprising an information acquisition module, a storage module, a tag tree structure model creating module, a content evaluation model creating module, a suspicious weight calculation module, a content difference calculation module and a comprehensive evaluator;

the information acquisition module is used for acquiring an HTML source code of the current webpage and sending the HTML source code to the suspicious weight calculation module and the content difference calculation module;

the storage module comprises a sample storage area and a model storage area, the sample storage area is used for storing a training sample set and a sample library, and the model storage area is used for storing a label tree structure model and a content evaluation model;

the label tree structure model creating module is used for acquiring a training sample set in the storage module, training the LSTM network model through the training sample set to obtain a label tree structure model, and sending the label tree structure model to the storage module for storage and updating;

the content evaluation model creating module is used for acquiring a sample library in the storage module, training the LSTM network through the sample library to obtain a content evaluation model, and sending the content evaluation model to the storage module for storage and updating;

the suspicious weight calculation module is used for receiving the HTML source code sent by the information acquisition module, acquiring the tag tree structure model from the storage module, normalizing the HTML source code, substituting the normalized HTML source code into the tag tree structure model to calculate the suspicious weight α, and sending the suspicious weight α to the comprehensive evaluator;

the content difference calculation module is used for receiving the HTML source code sent by the information acquisition module, acquiring the content evaluation model from the storage module, normalizing the HTML source code, inputting the normalized HTML source code into the content evaluation model to calculate the content difference degree β, and sending the content difference degree β to the comprehensive evaluator;

the comprehensive evaluator is used for receiving the suspicious weight α sent by the suspicious weight calculation module and the content difference degree β sent by the content difference calculation module, updating the content difference degree β according to whether a reference web page exists, substituting the suspicious weight α and the content difference degree β into the judgment formula, obtaining a judgment result from the value of J(i), and outputting the judgment result.

The invention also provides a computer-readable storage medium, wherein one or more instructions are stored in the computer-readable storage medium, and when executed, the one or more instructions realize the webpage tampering judgment method based on the fragmentation structure and the content.

An electronic device, comprising: a memory and a processor; at least one program instruction is stored in the memory; the processor is used for realizing the webpage tampering judgment method based on the fragmentation structure and the content by loading and executing the at least one program instruction.

Drawings

FIG. 1 is a flowchart of a web page tampering judgment method based on a fragmentation structure and content according to a first embodiment of the present invention;

FIG. 2 is a flow chart related to the building of the label tree structure model in FIG. 1;

FIG. 3 is a schematic diagram of the generation of the tag path set of FIG. 2;

FIG. 4 is a flow chart relating to the computation of the suspicious weight of FIG. 1;

FIG. 5 is a flow chart associated with the creation of the content evaluation model of FIG. 1;

FIG. 6 is a schematic diagram of the creation of a content evaluation model in FIG. 1;

FIG. 7 is a flow chart of the content difference calculation in FIG. 1;

FIG. 8 is a flow chart illustrating the operation of the comprehensive evaluator of FIG. 1;

FIG. 9 is a block diagram illustrating a web page tampering judgment device based on a fragmentation structure and content according to a third embodiment of the present invention;

FIG. 10 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.

Detailed Description

The following is further detailed by the specific embodiments:

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments so that the reader can better understand the present application; the technical solution claimed in the present application can nevertheless be implemented without these technical details and with various changes and modifications based on the following embodiments.

The first embodiment of the invention relates to a web page tampering judgment method based on a fragmentation structure and content. The fragmentation structure is represented as a tag tree, which is in essence the set of the paths of all tags in the page, arranged in order.
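As an illustration of treating a page as a set of tag paths, the following Python sketch lists one root-to-element path per tag using the standard-library `html.parser`. The names `TagPathCollector` and `extract_tag_paths` and the slash-separated path format are assumptions made for this sketch; the source does not specify how paths are serialized.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Hypothetical helper: records the path from the root to every element."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.paths = []  # Path(1..N) in document order

    def handle_starttag(self, tag, attrs):
        self.paths.append("/" + "/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def extract_tag_paths(html):
    """Return the tag path of every element in the HTML source."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


page = "<html><body><div><p>hi</p><a href='x'>l</a></div></body></html>"
for path in extract_tag_paths(page):
    print(path)
```

An implanted link typically shows up as an extra path, or as a path whose nesting is unusual for the site, which is the kind of structural signal a tag tree model is meant to pick up.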

In this embodiment, as shown in FIG. 1, the method for judging web page tampering based on a fragmentation structure and content includes the following steps:

S1, acquiring the HTML source code of the current web page to be detected;

S2, acquiring a preset tag tree structure model;

S3, calculating a suspicious weight α according to the HTML source code and the preset tag tree structure model;

S4, acquiring a preset content evaluation model;

S5, calculating a content difference degree β according to the HTML source code and the preset content evaluation model;

S6, the comprehensive evaluator substitutes the suspicious weight α and the content difference degree β into a judgment formula and judges from the output value whether tampering exists.

In this scheme, the preset tag tree structure model checks all links in the web page to obtain the suspiciousness of hidden-link (dark chain) modification or insertion, the preset content evaluation model checks the text in the web page to obtain a content difference degree reflecting the possibility that content has been modified, and whether the page has been tampered with is judged comprehensively from the combination of the two. Both forms of web page tampering, text tampering and link tampering, are thus taken into account, so tampering is judged more accurately than in the prior art, which relies on text comparison alone. Moreover, a tag tree structure model over the fragmentation structure is established to calculate the suspiciousness at the tag level, which addresses both the tendency to miss hidden links and the need to set a similarity threshold manually.

Implementation details of the web page tampering judgment based on a fragmentation structure and content are described below. They are provided only for ease of understanding and are not essential to implementing this embodiment. The specific flow is shown in FIG. 1, and the embodiment is applied to a server on the network side.

Specifically, the tag tree structure model of S2 is preset before S1 starts. As shown in FIG. 2, it is set up as follows:

S2-1, acquiring a training sample set;

as shown in FIG. 3, extracting the tag tree from the HTML source code in the training sample set and generating a tag path set Path(i), where i is any integer in [1, n] and n is the total number of tag paths in the training sample; executing S2-2;

S2-2, normalizing Path(i) by Word2vec so that every Path(i) is mapped to a value in the interval [0, 1]; executing S2-3 and S2-4;

S2-3, extracting the Path(i) set of each sample determined to be untampered and, with the expected output γ = 0, using it as an input sample of the LSTM network model; executing S2-5;

S2-4, extracting the Path(i) set of each sample determined to be tampered and, with the expected output γ = 1, using it as an input sample of the LSTM network model; executing S2-5;

S2-5, at time t the LSTM network model outputs an intermediate result corresponding to Path(i), which is taken as the suspicious weight α; at time n the LSTM network model outputs the final output corresponding to Path(i), which is taken as the expected value γ′; executing S2-6;

S2-6, applying dropout to the LSTM network model and performing forward propagation; executing S2-7;

S2-7, comparing the actual expectation γ with the theoretical expectation γ′ of Path(i) and calculating the overall error; executing S2-8;

S2-8, comparing the overall error with a preset error threshold range; if the overall error is within the error threshold range, executing S2-9; if the overall error exceeds the error threshold range, executing S2-10;

S2-9, outputting the LSTM network model as the tag tree structure model;

S2-10, adjusting the parameters of the LSTM network model and re-executing S2-3 and S2-4.

The training sample set in S2-1 is distinct from the page whose HTML source code is acquired in S1: it consists of the HTML source code of many known web pages for which it has already been determined whether an attack is present. After normalization in S2-2, steps S2-3 and S2-4 assign a different actual expected output γ depending on whether tampering is present, and the sorted samples are fed into the LSTM network model. Because the input in S2-3 and S2-4 is continuous, that is, the Path(i) sets and their actual expected outputs γ are fed to the LSTM network in time order, the output of the model differs from one time step to the next. Time n is the moment at which all Path(i) sets have been input, and the output at that time is the final expected output γ′; time t is a moment at which the Path(i) sets have not yet all been input, and the output at that time is an intermediate result, which is used as the suspicious weight. The dropout in S2-6 prevents over-training of the LSTM network model and reduces overfitting. In S2-8, whether the LSTM network needs further tuning is decided by comparing the overall error between the actual expectation γ and the theoretical expectation γ′ with the preset error threshold range.
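The control flow of S2-7 to S2-10 (compute the overall error, compare it with a preset threshold, and either stop or adjust the parameters and repeat) can be sketched in plain Python. To keep the example self-contained, a single logistic unit trained by gradient descent stands in for the LSTM network model; this substitution, the mean-squared-error measure, the learning rate, the threshold of 0.05 and the sample values are all assumptions made for illustration.

```python
import math


def train_until_within_threshold(samples, error_threshold,
                                 learning_rate=1.0, max_rounds=20000):
    """samples: list of (x, gamma), x in [0, 1], gamma 0 (clean) or 1 (tampered)."""
    w, b = 0.0, 0.0  # parameters of the stand-in model (NOT an LSTM)
    n = len(samples)
    for round_no in range(1, max_rounds + 1):
        # Forward pass: theoretical expectation gamma' for every sample.
        preds = [1.0 / (1.0 + math.exp(-(w * x + b))) for x, _ in samples]
        # S2-7: overall error between actual gamma and theoretical gamma'.
        error = sum((p - g) ** 2 for p, (_, g) in zip(preds, samples)) / n
        # S2-8 and S2-9: stop once the error is within the preset threshold.
        if error <= error_threshold:
            return (w, b), error, round_no
        # S2-10: adjust the parameters (plain gradient descent) and repeat.
        grad_w = sum(2 * (p - g) * p * (1 - p) * x
                     for p, (x, g) in zip(preds, samples)) / n
        grad_b = sum(2 * (p - g) * p * (1 - p)
                     for p, (_, g) in zip(preds, samples)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    raise RuntimeError("error threshold not reached within max_rounds")


# Hypothetical normalized inputs: low values labeled clean, high values tampered.
samples = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
params, final_error, rounds = train_until_within_threshold(samples, 0.05)
print(params, final_error, rounds)
```

The point of the sketch is the stopping rule rather than the model: no one inspects the training by hand, and the loop ends only when the error falls inside the range fixed in advance.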

In this scheme, the LSTM network model serves as a first neural network model, and the first neural network model may be implemented by using other neural network models. The error threshold range is preset by a designer and cannot be changed after setting.

The essence of S2 is therefore that the LSTM network model is trained on the tag paths of many known web pages together with their known tampering status, and its parameters are then adjusted by comparing the theoretical expectation it calculates with the known actual expectation, so that the correspondence between the tag paths of the samples and the intermediate outputs becomes tighter. The LSTM network model finally output in S2-9 serves as the tag tree structure model and is preset before S1 is executed.

Specifically, as shown in FIG. 4, calculating the suspicious weight α in S3 according to the HTML source code and the preset tag tree structure model includes the following steps:

S3-1, extracting the tag paths in the HTML source code from S1 to form a set Path(i);

S3-2, normalizing Path(i) by Word2vec so that every Path(i) is mapped to a value in the interval (0, 1), forming X(i);

S3-3, substituting X(i) into the tag tree structure model output in S2 and calculating the value that the model outputs at time t; this value is the intermediate result, and the intermediate result is the suspicious weight α.

S3 uses the stable tag tree structure model obtained in the previous step, which contains not only the method for calculating the suspicious weight from a tag but also a standard data set, that is, a set of samples that have not been tampered with. A suspicious weight α is therefore calculated for each tag path: the less often the X(i) corresponding to the tag appears in the standard data set, the higher the weight α, indicating that the hierarchical structure of the tag is more likely to occur in a tampered web page.
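The relationship stated above, where rarer paths in the untampered standard data set receive a higher weight, can be illustrated with a simple frequency count. This is only a sketch of the intuition and not the method of the source, in which α is the intermediate output of the trained LSTM network model; the function name `rarity_weights` and the scaling by the most frequent path are assumptions.

```python
from collections import Counter


def rarity_weights(standard_paths, page_paths):
    """Weight in [0, 1] per page path: 0 for the most common standard path,
    1 for a path never seen in the standard (untampered) data set."""
    counts = Counter(standard_paths)
    max_count = max(counts.values(), default=0)
    if max_count == 0:
        return [1.0 for _ in page_paths]
    return [1.0 - counts[path] / max_count for path in page_paths]


# Hypothetical standard data set: paragraph paths are common, links less so.
standard = ["/html/body/div/p"] * 8 + ["/html/body/div/a"] * 2
page = ["/html/body/div/p", "/html/body/div/a", "/html/body/div/div/a"]
print(rarity_weights(standard, page))
```

In this toy data the deeply nested link path never occurs in the standard set, so it receives the highest weight.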

In summary, the preset tag tree structure model checks all links in the web page to obtain the suspiciousness of hidden-link (dark chain) modification or insertion, the preset content evaluation model checks the text in the web page to obtain a content difference degree reflecting the possibility that content has been modified, and whether the page has been tampered with is judged comprehensively from the combination of the two. Both forms of web page tampering, text tampering and link tampering, are thus taken into account, so tampering is judged more accurately than in the prior art, which relies on text comparison alone.

Specifically, the content evaluation model of S4 is preset before S1 starts. As shown in FIG. 5 and FIG. 6, it is set up as follows:

S4-1, acquiring a sample library;

extracting the text content in the sample library and splitting it into words, each word corresponding to a tag, to form a vocabulary set B(i), where i ∈ [1, n] and n is the total number of tags in the sample; executing S4-2;

S4-2, normalizing the vocabulary set B(i) so that every B(i) is mapped to a value in the interval [0, 1]; executing S4-3;

S4-3, the LSTM network model outputs the final output corresponding to B(i), which is taken as the expected value θ′; executing S4-4;

S4-4, obtaining the expected value θ preset for each tag in the sample library and the expected value θ′ output by the LSTM network model for that tag, and calculating the content difference degree Y(i) of the content of each tag in the sample library; executing S4-5;

S4-5, calculating the overall error loss of the text content of the whole sample library from the set of tag content difference degrees Y(i); executing S4-6;

S4-6, comparing the overall error loss with a preset error loss threshold range; if the overall error loss is within the error loss threshold range, executing S4-7; if the overall error loss exceeds the error loss threshold range, executing S4-8;

S4-7, outputting the LSTM network model as the content evaluation model;

S4-8, adjusting the parameters of the LSTM network model and executing S4-3 again.

The sample library in S4-1 is distinct from the page whose HTML source code is acquired in S1: it consists of the HTML source code of many known web pages for which it has already been determined whether an attack is present. The essence of S4 is therefore that the LSTM network model is trained on the tag content of many known web pages together with their known tampering status, and its parameters are then adjusted by comparing the theoretical expectation it calculates with the known actual expectation, so that the correspondence between the tag content of the samples and the output value (that is, the content difference degree) becomes tighter.

Specifically, as shown in FIG. 7 and FIG. 8, calculating the content difference degree β in S5 according to the HTML source code and the preset content evaluation model includes the following steps:

S5-1, extracting the tag content of each tag in the HTML source code from S1 to form a tag set; executing S5-2;

S5-2, normalizing the tag set so that every element of the tag set is mapped to a value in the interval (0, 1), forming X(i); executing S5-3;

S5-3, substituting X(i) into the content evaluation model output in S4 and calculating the result output for each tag; this result is the content difference degree β.

In S5 the tag content of each tag in the HTML source code of the current web page is input into the content evaluation model to calculate the content difference degree β for that tag, which measures how far the content on a tag path differs from the web page as a whole. So that staff can later read the degree of difference from the value of β, five grades are preset: high correlation [0.0, 0.3), general correlation [0.3, 0.5), cannot be judged [0.5, 0.8), cannot be judged [0.8, 1.0), and complete independence 1.0. For example, a content difference degree of β = 0.3 falls in the grade "general correlation".
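The grading of β can be written as a small lookup. The sketch below assumes left-closed, right-open intervals, which is consistent with β = 0.3 being graded as general correlation; the function name `difference_grade` is hypothetical, and the grade labels are English renderings of the source's terms, with the third and fourth grades carrying the same label as they do in the source.

```python
# (exclusive upper bound, grade label); intervals are left-closed, right-open.
GRADES = [
    (0.3, "high correlation"),
    (0.5, "general correlation"),
    (0.8, "cannot be judged"),
    (1.0, "cannot be judged"),  # the source repeats this label for [0.8, 1.0)
]


def difference_grade(beta):
    """Map a content difference degree in [0, 1] to its preset grade."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    for upper, label in GRADES:
        if beta < upper:
            return label
    return "complete independence"  # only beta == 1.0 reaches this line


print(difference_grade(0.3))
```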

Specifically, as shown in fig. 8, in S6, the comprehensive evaluator substitutes the doubtful weight α and the content difference β into a judgment formula to judge the existence of tampering behavior according to the output value, and includes the following steps:

S6-1, judging whether a reference webpage exists; if so, executing S6-5; if not, executing S6-2;

S6-2, obtaining the suspicious weight α output by S3 and the content difference β output by S5, and executing S6-3;

S6-3, substituting the suspicious weight α and the content difference β into the judgment formula J(i) = (α(i) + σ) × β(i), i ∈ (1, 2, …, N), to obtain an output value J(i), where N represents the total number of paths in the tag tree generated by the HTML source code, J(i) represents the tampering suspicion value of the i-th tag in the tag tree generated by the HTML source code, α(i) represents the suspicious weight of the i-th tag path, σ represents an empirical constant, and β(i) represents the difference between the text content of the i-th tag path and the content of the whole webpage; executing S6-4;

S6-4, judging whether tampering behavior exists according to the tampering suspicion value J(i): if J(i) > 0.5, judging that tampering exists; if J(i) ≤ 0.5, judging that no tampering exists; outputting the judgment result;

S6-5, obtaining the reference HTML of the reference webpage, and executing S6-6;

S6-6, extracting the tags in the reference HTML, generating a tag path set Base(i), and executing S6-7;

S6-7, comparing the tag path set Base(i) in the reference HTML with the tag path set Path(i) in the HTML source code to judge whether their contents are consistent; if so, executing S6-8; if not, executing S6-3;

S6-8, resetting the value of the content difference β, and executing S6-3.

The comprehensive evaluator can work both when a reference webpage exists and when it does not, and judges whether webpage tampering exists according to the suspicious weight α and the content difference β. Whether a reference webpage exists is judged in advance in S6-1. When a reference webpage exists, the content difference between the reference webpage and the current webpage is judged directly by comparing the tag path set Base(i) of the reference webpage with the tag path set Path(i) in the HTML source code of the current webpage, and β is updated accordingly. If the current webpage is consistent with the reference webpage, the content difference β calculated for the current webpage by the content evaluation model only approaches 0 asymptotically, whereas the actual content difference between the reference webpage and the current webpage is 0, so the calculated value carries a certain error. In this scheme, the value of β is updated directly according to whether the current webpage is consistent with the reference webpage, which improves the overall accuracy. When no reference webpage exists in S6-1, whether tampering exists can still be judged directly from the current webpage, so the overall calculation model is flexible and, once trained, has a certain universality.
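The flow of S6 can be sketched as below. This is an illustrative reading under several assumptions that the text leaves open: the value `SIGMA = 0.1` is hypothetical (σ is only described as an empirical constant), the reset in S6-8 is taken to mean setting β to 0 for a path whose content matches the reference, the comparison is done per tag path, and the page is reported as tampered when any J(i) exceeds 0.5. The function name `judge_page` and the tuple layout of `entries` are likewise invented for the example.

```python
SIGMA = 0.1       # hypothetical value; sigma is only called an empirical constant
THRESHOLD = 0.5   # decision threshold stated in S6-4


def judge_page(entries, reference=None, sigma=SIGMA, threshold=THRESHOLD):
    """Compute J(i) = (alpha(i) + sigma) * beta(i) for every tag path.

    entries   -- iterable of (path, content, alpha, beta) for the current page
    reference -- optional {path: content} of the reference page (S6-5, S6-6)
    Returns (scores, tampered) where scores maps each path to J(i).
    """
    scores = {}
    for path, content, alpha, beta in entries:
        # S6-7 / S6-8 (assumed reading): identical content means no difference.
        if reference is not None and reference.get(path) == content:
            beta = 0.0
        scores[path] = (alpha + sigma) * beta          # S6-3
    tampered = any(j > threshold for j in scores.values())  # S6-4
    return scores, tampered
```

For instance, a path with α = 0.7 and β = 0.9 gives J = (0.7 + 0.1) × 0.9 = 0.72, which is above the threshold, whereas the same path yields J = 0 if its content matches the reference webpage.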

The second embodiment of the invention relates to a webpage tampering judgment method based on a fragmentation structure and content. The second embodiment is a refinement of the first embodiment. In the second embodiment of the present invention, after the S6 comprehensive evaluator judges whether the web page tampering behavior exists according to the suspicious weight α and the content difference β, the HTML source code of the current web page and the web page tampering behavior judgment result are brought into the calculation process of the tag tree structure model of S2 and the content evaluation model of S4.

Specifically, if the tampering suspicion value J(i) > 0.5 in S6-4, the webpage corresponding to the current HTML source code is judged to have been tampered with, and the HTML source code together with the judgment result "tampering exists" is added to the training sample set mentioned in S2-1 and the sample library mentioned in S4-1. If the tampering suspicion value J(i) ≤ 0.5 in S6-4, the webpage corresponding to the current HTML source code is judged not to have been tampered with, and the HTML source code together with the judgment result "no tampering exists" is added to the training sample set mentioned in S2-1 and the sample library mentioned in S4-1.

In this embodiment, the new judgment results of the comprehensive evaluator are added to the training sample set and the sample library, thereby enriching both.
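The feedback described in this embodiment can be sketched as a small container that records each judged page in both collections. The class `SampleStore`, its field names, and the `(html, tampered)` tuple layout are hypothetical; the text does not specify how samples are stored, and labelling a page as tampered when any J(i) exceeds 0.5 is an assumed reading.

```python
from dataclasses import dataclass, field


@dataclass
class SampleStore:
    """Holds the training sample set (S2-1) and the sample library (S4-1)."""
    training_set: list = field(default_factory=list)
    sample_library: list = field(default_factory=list)

    def add_judgment(self, html, j_values, threshold=0.5):
        """Label a judged page and append it to both collections."""
        tampered = any(j > threshold for j in j_values)
        sample = (html, tampered)
        self.training_set.append(sample)
        self.sample_library.append(sample)
        return tampered
```

Retraining the two models on the enlarged collections would then proceed as in S2 and S4.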

The steps of the above methods are divided for clarity of description. In implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, such variants all fall within the protection scope of this patent. Adding insignificant modifications to the algorithm or process, or introducing insignificant design changes, without changing the core design of the algorithm or process also falls within the protection scope of this patent.

The third embodiment of the invention relates to a webpage tampering judgment device based on a fragmentation structure and content. As shown in fig. 9, the device comprises an information acquisition module, a storage module, a tag tree structure model creation module, a content evaluation model creation module, a suspicion weight calculation module, a content difference calculation module and a comprehensive evaluator.

The information acquisition module is used for acquiring the HTML source code of the current webpage and sending the HTML source code to the suspicion weight calculation module and the content difference calculation module. The acquisition mode includes, but is not limited to, crawling the HTML source code of the current webpage with crawler software, wherein the HTML source code includes the path and the tag content of each tag.

The storage module comprises a sample storage area and a model storage area. The sample storage area is used for storing the training sample set and the sample library, and the model storage area is used for storing the tag tree structure model and the content evaluation model.

The tag tree structure model creation module is used for acquiring the training sample set from the storage module, training the LSTM network model with the training sample set to obtain the tag tree structure model, and sending the tag tree structure model to the storage module for storage or updating. Storage or updating proceeds as follows: if the storage module holds no tag tree structure model, the received model is stored; if the storage module already holds a tag tree structure model, the storage module replaces the original model with the one sent by the tag tree structure model creation module.

The content evaluation model creation module is used for acquiring the sample library from the storage module, training the LSTM network with the sample library to obtain the content evaluation model, and sending the content evaluation model to the storage module for storage or updating. Storage or updating proceeds as follows: if the storage module holds no content evaluation model, the received model is stored; if the storage module already holds a content evaluation model, the storage module replaces the original model with the one sent by the content evaluation model creation module.
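The store-or-replace behaviour of the model storage area can be sketched as follows. The class `ModelStorage`, its method names, and the string keys are hypothetical, and any object may stand in for a trained model; the text does not describe how models are serialized or persisted.

```python
class ModelStorage:
    """Model storage area: keeps at most one model per name."""

    def __init__(self):
        self._models = {}

    def save(self, name, model):
        """Store a new model, or replace the one already held under `name`.

        Returns "stored" when no model existed and "updated" when one
        was replaced.
        """
        action = "updated" if name in self._models else "stored"
        self._models[name] = model
        return action

    def load(self, name):
        """Return the model currently held under `name`."""
        return self._models[name]
```

Both creation modules could use the same storage object, for example under the keys `"tag_tree_structure"` and `"content_evaluation"`.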

The suspicion weight calculation module is used for receiving the HTML source code sent by the information acquisition module, acquiring the tag tree structure model from the storage module, normalizing the HTML source code, substituting the normalized HTML source code into the tag tree structure model to calculate the suspicious weight α, and sending the suspicious weight α to the comprehensive evaluator.

The content difference calculation module is used for receiving the HTML source code sent by the information acquisition module, acquiring the content evaluation model from the storage module, normalizing the HTML source code, inputting the normalized HTML source code into the content evaluation model to calculate the content difference β, and sending the content difference β to the comprehensive evaluator.

The comprehensive evaluator is used for receiving the suspicious weight α sent by the suspicion weight calculation module and the content difference β sent by the content difference calculation module, updating the content difference β according to whether a reference webpage exists, substituting the suspicious weight α and the content difference β into the judgment formula J(i) = (α(i) + σ) × β(i), i ∈ (1, 2, …, N), obtaining a judgment result according to the value of J(i), and outputting the judgment result; wherein N represents the total number of paths in the tag tree generated by the HTML source code, J(i) represents the tampering suspicion value of the i-th tag in the tag tree generated by the HTML source code, α(i) represents the suspicious weight of the i-th tag path, σ represents an empirical constant, and β(i) represents the difference between the text content of the i-th tag path and the content of the whole webpage.

In addition, the comprehensive evaluator is also used for sending the HTML source code of the current webpage and the judgment result to the storage module for storage, further enriching the training sample set and the sample library held in the sample storage area of the storage module.

It should be understood that this embodiment is a device embodiment corresponding to the first or second embodiment, and may be implemented in cooperation with the first or second embodiment. The related technical details mentioned in the first or second embodiment remain valid in this embodiment and are not repeated here. Accordingly, the related technical details mentioned in this embodiment can also be applied to the first or second embodiment.

Compared with the prior art, this embodiment provides a webpage tampering judgment device based on a fragmentation structure and content. The tag tree structure model creation module and the content evaluation model creation module update the tag tree structure model and the content evaluation model in the storage module. The suspicion weight calculation module calculates the suspicious weight from the HTML source code of the current webpage and the tag tree structure model, the content difference calculation module calculates the content difference from the HTML source code of the current webpage and the content evaluation model, and the comprehensive evaluator judges whether tampering exists according to the suspicious weight and the content difference. All links in the webpage are checked with the preset tag tree structure model to obtain the suspicion that a dark chain has been modified or inserted, and the text in the webpage is checked with the preset content evaluation model to obtain a content difference reflecting the possibility that the content has been modified. Whether the webpage has been tampered with is then judged by combining the link suspicion with the text content difference, so that both text tampering and link tampering are considered, and the judgment is more accurate than prior-art approaches that rely only on text comparison.

It should be noted that each module referred to in this embodiment is a logical module. In practical applications, one logical unit may be one physical unit, may be a part of one physical unit, or may be implemented by a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, elements that are not closely related to solving the technical problem proposed by the present invention are not introduced in this embodiment, but this does not mean that no other elements are present in this embodiment.

A fourth embodiment of the present invention relates to an electronic apparatus. As shown in fig. 10, the electronic apparatus includes at least one processor and a memory coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the processor is used for loading and executing the instructions so as to implement the webpage tampering judgment method based on a fragmentation structure and content described in the above embodiments.

The memory and processor are connected by a bus, which may include any number of interconnected buses and bridges that connect one or more of the various circuits of the processor and memory together. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor.

The processor is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory may be used to store data used by the processor in performing operations.

A fifth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer-readable storage medium stores one or more instructions, and when executed, the one or more instructions implement the webpage tampering judgment method based on fragmentation structure and content in the above embodiments.

That is, as can be understood by those skilled in the art, all or part of the steps of the methods in the above embodiments may be implemented by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

The foregoing is merely an embodiment of the present invention. Common general knowledge, such as well-known specific structures and characteristics, is not described here in detail. A person skilled in the art knows all the common technical knowledge in the field as of the filing date or the priority date, has access to all the prior art in the field, and is able to apply routine experimentation as of that date; such a person can, in light of the teachings of this application, complete and implement the present invention in combination with their own abilities, and certain typical known structures or known methods should not become an obstacle to implementing the invention. It should be noted that a person skilled in the art may make several changes and modifications without departing from the structure of the present invention; these should also be regarded as falling within the protection scope of the present invention and will not affect the effect of implementing the invention or the practicability of the patent. The scope of protection claimed by this application shall be determined by the contents of the claims, and the description of the embodiments in the specification may be used to interpret the contents of the claims.

Claims

1. A webpage tampering judgment method based on a fragmentation structure and content is characterized by comprising the following steps:

acquiring HTML source codes;

2. The web page tampering judgment method based on the fragmentation structure and the content as claimed in claim 1, wherein the label tree structure model is obtained by training a first neural network model and a training sample set, and comprises:

3. The method for webpage tampering judgment based on the fragmentation structure and the content as claimed in claim 2, wherein the calculating the suspicious weight α according to the HTML source code and the tag tree structure model comprises:

4. The method for webpage tampering judgment based on the fragmented structure and the content as claimed in claim 1, wherein the content evaluation model is trained by a second neural network model and a sample library including text content, and comprises:

5. The method for webpage tampering judgment based on the fragmentation structure and the content as claimed in claim 1, wherein the calculating the content difference β according to the HTML source code and the content evaluation model comprises:

6. The method according to claim 1, wherein the judgment formula is as follows:

J(i) = (α(i) + σ) × β(i), i ∈ (1, 2, …, N);

wherein N represents the total number of paths in the tag tree generated by the HTML source code, J(i) represents the tampering suspicion value of the i-th tag in the tag tree generated by the HTML source code, α(i) represents the suspicious weight of the i-th tag path, σ represents an empirical constant, and β(i) represents the difference between the text content of the i-th tag path and the content of the whole webpage.

7. The method as claimed in claim 6, wherein before the step of substituting the suspicious weight α and the content difference β into the decision formula, the method further comprises:

8. A webpage tampering judgment device based on a fragmentation structure and content is characterized in that: the system comprises an information acquisition module, a storage module, a label tree structure model creation module, a content evaluation model creation module, a suspicion weight calculation module, a content difference calculation module and a comprehensive evaluation device;

9. A computer readable storage medium having one or more instructions stored therein, wherein the one or more instructions, when executed, implement the method for webpage tampering judgment based on a fragmented structure and content according to any one of claims 1 to 7.

10. An electronic device, comprising: a memory and a processor; at least one program instruction is stored in the memory; the processor, which loads and executes the at least one program instruction to implement the method for webpage tampering judgment based on the fragmentation structure and content according to any one of claims 1 to 7.