CN111488622A - Method and device for detecting webpage tampering behavior and related components - Google Patents

Method and device for detecting webpage tampering behavior and related components

Info

Publication number
CN111488622A
Authority
CN
China
Prior art keywords
target
webpage
keyword
detection
tampering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910074366.6A
Other languages
Chinese (zh)
Inventor
杨荣海
何嘉伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd
Priority to CN201910074366.6A
Publication of CN111488622A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/64: Protecting data integrity, e.g. using checksums, certificates or signatures

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Storage Device Security (AREA)

Abstract

The application discloses a method for detecting webpage tampering behavior, which comprises: determining a target webpage according to a detection instruction, and matching target type keywords in the target webpage; determining a target text corresponding to the target type keywords in the target webpage, and generating a semantic vector corresponding to the target text by using a target language model, wherein the target language model is obtained by transferring a language model trained in a source domain; and performing a webpage tampering behavior detection operation on the target webpage according to the semantic vector. The method can therefore reduce the dependence on the number of samples in the machine learning process and detect webpage tampering behavior with few training samples. The application also discloses a device for detecting webpage tampering behavior, a computer-readable storage medium and an electronic device, which have the same beneficial effects.

Description

Method and device for detecting webpage tampering behavior and related components
Technical Field
The present invention relates to the field of network security technologies, and in particular, to a method and an apparatus for detecting a webpage tampering behavior, a computer-readable storage medium, and an electronic device.
Background
With the continued development of network technology, webpage tampering (web defacement) has become a major form of hacking that affects the normal business of client websites. Webpage tampering refers to a hacker intruding into a legitimate website and inserting illegal text, images, malicious links and the like, such as lottery or pornographic content. Webpage tampering is an important technique for black hat search engine optimization: it can raise the ranking of a target website in a search engine and thereby increase its traffic. It can also be used to publish a hacker's announcements or to show off.
The existing mainstream webpage tampering detection methods make a judgment from the text information of a webpage: they apply data mining and statistical learning models, or deep learning models, to the webpage text to find illegal text inserted into the page. However, these methods need a large amount of sample data to support machine learning, and their detection performance is unsatisfactory when training samples are few.
Therefore, how to reduce the dependence on the number of samples in the machine learning process and detect webpage tampering behavior with few training samples is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The application aims to provide a method and a device for detecting webpage tampering behaviors, a computer-readable storage medium and an electronic device, which can reduce the dependence on the number of samples in the machine learning process and detect the webpage tampering behaviors under the condition of less training samples.
In order to solve the technical problem, the present application provides a method for detecting a webpage tampering behavior, where the method includes:
determining a target webpage according to the detection instruction, and matching target type keywords in the target webpage;
determining a target text corresponding to the target type keywords in the target webpage, and generating a semantic vector corresponding to the target text by using the target language model after the transfer learning;
and executing webpage tampering behavior detection operation on the target webpage according to the semantic vector.
Optionally, the generating of the semantic vector corresponding to the target text by using the target language model after the transfer learning includes:
and extracting word vectors of the target text, and inputting all the word vectors into the target language model after the transfer learning to generate semantic vectors.
Optionally, matching the target type keyword in the target webpage includes:
and performing keyword matching operation on the target webpage by using the keyword list to obtain a target type keyword.
Optionally, before performing a keyword matching operation on the target webpage by using the keyword table to obtain the target type keyword, the method further includes:
generating a keyword table according to the database; the database comprises any one or a combination of any several of a tampered common vocabulary, a standard corpus and an expert knowledge base.
Optionally, the performing, according to the semantic vector, a webpage tampering behavior detection operation on the target webpage includes:
and inputting the semantic vector into a deep learning model so as to judge whether the target webpage has webpage tampering behaviors or not by using the deep learning model.
The application also provides a detection device for webpage tampering behavior, which comprises:
the keyword matching module is used for determining a target webpage according to the detection instruction and matching target type keywords in the target webpage;
the semantic vector generation module is used for determining a target text corresponding to the target type keywords in the target webpage and generating a semantic vector corresponding to the target text by using the target language model after the transfer learning;
and the detection module is used for executing webpage tampering behavior detection operation on the target webpage according to the semantic vector.
Optionally, the semantic vector generating module includes:
the text determination module is used for determining a target text corresponding to the target type keyword in the target webpage;
and the vector generating unit is used for extracting word vectors of the target text and inputting all the word vectors into the target language model after the transfer learning to generate semantic vectors.
Optionally, the keyword matching module includes:
the webpage determining unit is used for determining a target webpage according to the detection instruction;
and the screening unit is used for executing keyword matching operation on the target webpage by utilizing the keyword list to obtain the target type keyword.
Optionally, the device further includes:
the keyword table generating module is used for generating a keyword table according to the database; the database comprises any one or a combination of any several of a tampered common vocabulary, a standard corpus and an expert knowledge base.
Optionally, the detection module is specifically a module for inputting the semantic vector into the deep learning model so as to determine whether the target webpage has a webpage tampering behavior by using the deep learning model.
The application also provides a computer readable storage medium, on which a computer program is stored, and the steps executed by the method for detecting the webpage tampering behavior are realized when the computer program is executed.
The application also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps executed by the detection method for the webpage tampering behavior when calling the computer program in the memory.
The application provides a method for detecting webpage tampering behavior, which comprises: determining a target webpage according to a detection instruction, and matching target type keywords in the target webpage; determining a target text corresponding to the target type keywords in the target webpage, and generating a semantic vector corresponding to the target text by using the target language model after the transfer learning; and executing a webpage tampering behavior detection operation on the target webpage according to the semantic vector.
After the target webpage to be detected is determined, the application matches target type keywords in the target webpage, generates the corresponding semantic vector with the target language model obtained by transfer, and then performs the webpage tampering detection according to the semantic vector. Because the target type keywords in the target webpage serve as the basis for detection, illegal text samples of a specific type do not need to be obtained from the target webpage, and target type keywords are far easier to obtain than illegal text samples. Because the target language model after transfer learning has been trained in advance in a source domain with a large amount of knowledge and labels, it can produce high-quality semantic vectors for the detection operation. The application can therefore reduce the dependence on the number of samples in the machine learning process and detect webpage tampering behavior with few training samples. The application also provides a device for detecting webpage tampering behavior, a computer-readable storage medium and an electronic device, which have the same beneficial effects and are not described again here.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a method for detecting a webpage tampering behavior according to an embodiment of the present application;
Fig. 2 is a flowchart of a semantic vector generation method according to an embodiment of the present application;
Fig. 3 is a flowchart of a method for extracting keywords of a target type according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a device for detecting a webpage tampering behavior according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Webpage tampering changes a normal webpage into a tampered webpage. A normal webpage, also called a white webpage, is a page of the client website's own business and contains no illegal information; a tampered webpage is a page that has been hacked and has had malicious sensitive words or malicious links inserted, where the malicious links typically point to malicious websites. Current mainstream webpage tampering detection systems make a judgment from the text information of the webpage, using data mining, statistical learning models, or deep learning models to find illegal text inserted in the page. However, this prior art has the following disadvantages: (1) the detection operation needs the support of a large amount of data, and obtaining high-quality data takes a great deal of work, which increases development cost; (2) for specific types of tampering, samples are hard to acquire, very few can be acquired, and the acquisition cost is very high; (3) at the model level, a model trained on a small sample generally performs poorly and struggles to generalize well.
In view of the various defects in the prior art, the present application provides a new method for detecting a webpage tampering behavior through the following embodiments, which can solve the problems in the prior art, and achieve the effects of reducing the dependence on the number of samples in the machine learning process and detecting the webpage tampering behavior under the condition of a small number of training samples.
Referring to fig. 1, fig. 1 is a flowchart of a method for detecting a webpage tampering behavior according to an embodiment of the present application.
The specific steps may include:
S101: determining a target webpage according to the detection instruction, and matching target type keywords in the target webpage;
This embodiment may be applied to a network security device, and an operation of receiving a detection instruction precedes this step. The detection instruction may be an instruction issued by a user through a terminal device, or an instruction automatically generated by the network security device under a preset condition, and is not specifically limited here.
After the detection instruction is received, it may be parsed to obtain the object of webpage tampering detection. The number of target webpages corresponding to the detection instruction is not limited; there may be one or several. The webpage tampering behavior mentioned above refers to a hacker intruding into a legitimate website and inserting illegal text, images, malicious links and the like, such as lottery or pornographic content. Webpage tampering is an important technique for black hat search engine optimization (Black Hat SEO), which deceives the search engine by cheating in order to illegitimately raise the ranking of a target website.
It should be noted that, in the prior art, after the target webpage is determined, a suspected tampering sample of a specific type is extracted from it, and the webpage tampering detection operation is then performed on that sample. However, when samples of this specific type are scarce in the target webpage, a great deal of work is required to acquire high-quality data. This embodiment does not acquire the specific-type text samples used in the prior art; it matches target type keywords instead. Keywords meeting the specific requirements may be obtained from the target webpage by various methods, such as a multi-pattern string matching algorithm, TF-IDF, TextRank, RAKE, or a topic model, which are not specifically limited here.
It is understood that the operation of matching the target type keyword in the present embodiment may include the following steps: when the target type keyword is not matched in the target webpage, judging that the target webpage is not the webpage tampering behavior detection object described in the embodiment, and ending the process; on the contrary, when the target type keyword is matched in the target webpage, the webpage tampering behavior detection operation described in this embodiment may be performed according to the target type keyword. That is, the operation flows of S102 to S103 of the present embodiment are all implemented on the basis that S101 matches the target type keyword in the target web page.
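The control flow described above (stop when no target type keyword is matched, otherwise continue to S102 and S103) can be sketched as follows. This is only an illustrative sketch: the function name `detect_tampering` and the `embed` and `classify` callables are hypothetical placeholders, plain substring search stands in for the matching algorithm, and the whole page is used as the target text for brevity.

```python
from typing import Callable, List, Optional, Sequence


def detect_tampering(
    page_text: str,
    keyword_table: Sequence[str],
    embed: Callable[[str], List[float]],
    classify: Callable[[List[float]], bool],
) -> Optional[bool]:
    """Return None when the page is out of scope, else the classifier verdict."""
    # S101: match target type keywords (substring search is an assumption here).
    matched = [kw for kw in keyword_table if kw in page_text]
    if not matched:
        # No target type keyword: the page is not a detection object, so stop.
        return None
    # S102: produce a semantic vector (hypothetical callable; the whole page
    # is passed as the target text to keep the sketch short).
    vector = embed(page_text)
    # S103: decide from the semantic vector (hypothetical callable).
    return classify(vector)
```

Returning `None` rather than `False` keeps "not examined by this flow" distinct from "examined and judged not tampered".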
S102: determining a target text corresponding to the target type keywords in the target webpage, and generating a semantic vector corresponding to the target text by using the target language model after the transfer learning;
It should be noted that the target text may be determined as follows: search the target webpage for the sentences that contain the target type keyword, using punctuation marks or a number of bytes as the basis for segmentation, and take those sentences as the target text. For example, suppose the target webpage contains the text "Granularity in a particular environment is a major design issue, because it profoundly affects the size of the amount of data deposited in the data warehouse, as well as the type of query that can be answered." If "data warehouse" is the target type keyword and the text is segmented by punctuation marks, the target text corresponding to "data warehouse" is "because it profoundly affects the size of the amount of data deposited in the data warehouse". If the target text is instead taken as a preset number of words before and after the target type keyword, and that number is 10, the target text is the keyword together with the 10 words on either side of it. Other methods for determining the target text may also be used and are not specifically limited here, as long as the resulting target text accurately expresses the context of the target type keyword. In addition, determining the target text removes irrelevant information from the target webpage, which improves the accuracy of webpage tampering detection.
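The two segmentation strategies described above can be illustrated with a short sketch. The function names are hypothetical, the punctuation set is an assumption, and the window variant counts whitespace-separated tokens, which is only one possible reading of "number of words".

```python
import re
from typing import List


def sentences_with_keyword(text: str, keyword: str) -> List[str]:
    """Split on punctuation and keep the segments that contain the keyword."""
    # Assumed punctuation set: common ASCII and full-width marks.
    segments = re.split(r"[,.;:!?，。；：！？]", text)
    return [s.strip() for s in segments if keyword in s]


def window_around_keyword(text: str, keyword: str, n_words: int) -> List[str]:
    """Return the keyword plus up to n_words tokens on each side of it."""
    tokens = text.split()
    key_tokens = keyword.split()
    k = len(key_tokens)
    windows = []
    for i in range(len(tokens) - k + 1):
        # Ignore punctuation attached to a token when comparing.
        if [t.strip(",.;:!?") for t in tokens[i:i + k]] == key_tokens:
            start = max(0, i - n_words)
            windows.append(" ".join(tokens[start:i + k + n_words]))
    return windows
```

Applied to the example sentence with the keyword "data warehouse", the first function yields the clause beginning "because it profoundly affects", and the second yields the keyword with its neighbouring tokens.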
In this embodiment, the target text corresponding to the target type keywords can be converted into semantic vectors by any of several transfer-learned language models, for example a neural network language model (NNLM) or a log-bilinear language model (LBL).
The target language model mentioned in this embodiment is a language model that is trained in the source domain and then applied to the target domain. In transfer learning, the source domain (Source Domain) is a domain that has knowledge and a large number of data labels; the target domain (Target Domain) is the target task to which knowledge and labels need to be given. As an optional implementation, the source domain may include the text classification field and the specific text detection field, which is not specifically limited here. When the language model trained in the source domain is transferred, its parameters may be adjusted according to the conditions of the target domain.
This embodiment does not limit the kind of target language model. A language model (Language Model) assumes that all possible sentences in a certain language follow a probability distribution.
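For reference, this probability is commonly written with the chain rule, where $w_1, \dots, w_n$ denote the words of a sentence. The notation is a standard convention and does not appear in the original text:

```latex
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
```

A particular language model, neural or otherwise, differs mainly in how it estimates each conditional term on the right-hand side.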
On the basis of using keywords to pre-screen the specific types of small-sample instances that need attention, this embodiment uses a transfer-learned language model to learn knowledge from a source domain similar to the target domain and to improve the semantic representation of the small-sample instances. This improves the deep model's ability to detect small-sample tampered webpages, raising the detection rate and lowering the false alarm rate. Specifically, before this step there may be an operation of training a source domain model: a language model is trained, and knowledge learned, in a source domain similar to the webpage tampering task, where such source domains include but are not limited to text classification and sensitive text detection. After the source domain language model is obtained, it can be transferred to the small-sample tampering task to generate high-quality semantic vectors, which are then combined with a deep learning model to predict whether a small-sample type webpage has been tampered with. Because some specific types of tampering have few samples, applying a semantic vector technique directly to those samples does not represent the semantics of the webpage content well. In this embodiment, transfer learning applies a model learned in an old domain to a new domain by exploiting the similarity between data, tasks, or models. With a transfer-learned language model, a high-quality language model can be trained on standard or general-purpose corpora, or on scenarios and tasks similar to webpage tampering.
By transferring the language model after the source field training to the small sample webpage tampering task, high-quality webpage content semantic vector expression can be obtained, and the detection effect is improved.
S103: and executing webpage tampering behavior detection operation on the target webpage according to the semantic vector.
After the semantic vector of the target text is obtained, whether webpage tampering behavior exists in the target webpage can be judged according to the semantic vector. As a possible implementation, the semantic vector may be input into a trained deep learning model, which determines whether the target webpage has been tampered with. A semantic vector maps the semantics of words into a high-dimensional space; with the development of natural language processing, semantic vectors are often used as input to various machine learning or deep learning models. The deep learning model mentioned here is obtained by training on a large number of semantic vector samples and is able to identify webpage tampering behavior. Training may proceed as follows: semantic vectors of webpage tampering behavior are extracted from existing attack logs as positive samples, and semantic vectors of normal text are taken as negative samples. The positive and negative semantic vectors are input into an original model as training samples to obtain the deep learning model. The trained model performs binary classification on an input semantic vector to judge whether webpage tampering behavior exists in the target webpage.
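The training step described above, with positive and negative semantic vectors fed to a binary classifier, can be sketched in a few lines. Note the substitution: the text calls for a deep learning model, whereas this sketch uses plain logistic regression so that it stays self-contained. All names, the learning rate and the epoch count are illustrative assumptions.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    # Clamp the input so math.exp cannot overflow.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_binary_classifier(
    vectors: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 200,
    lr: float = 0.5,
) -> List[float]:
    """Fit weights (bias stored last) by stochastic gradient descent on log loss."""
    dim = len(vectors[0])
    weights = [0.0] * (dim + 1)
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            # zip stops at the shorter sequence, so the bias slot is skipped here.
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + weights[-1])
            err = p - y
            for j in range(dim):
                weights[j] -= lr * err * x[j]
            weights[-1] -= lr * err
    return weights


def is_tampered(weights: Sequence[float], vector: Sequence[float]) -> bool:
    """Binary decision: True means the vector is classified as tampering."""
    score = sum(w * xi for w, xi in zip(weights, vector)) + weights[-1]
    return sigmoid(score) >= 0.5
```

Here label 1 marks a positive (tampering) semantic vector and label 0 a negative (normal text) one, following the sample convention in the paragraph above.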
After the target webpage to be detected is determined, this embodiment matches target type keywords in the target webpage, generates the corresponding semantic vector with the target language model obtained by transfer, and then performs the webpage tampering detection according to the semantic vector. Because the target type keywords in the target webpage serve as the basis for detection, illegal text samples of a specific type do not need to be obtained from the target webpage, and target type keywords are far easier to obtain than illegal text samples. Because the target language model after transfer learning has been trained in advance in a source domain with a large amount of knowledge and labels, it can produce high-quality semantic vectors for the detection operation. This embodiment can therefore reduce the dependence on the number of samples in the machine learning process and detect webpage tampering behavior with few training samples.
Referring to fig. 2, fig. 2 is a flowchart of a semantic vector generation method provided in an embodiment of the present application, where this embodiment is a further description of the embodiment corresponding to fig. 1, and the semantic vector generation method based on the transfer learning may be obtained by combining this embodiment with the embodiment corresponding to fig. 1, and this embodiment may include the following steps:
S201: determining a target webpage according to the detection instruction, and matching target type keywords in the target webpage;
S202: determining a target text corresponding to the target type keywords in the target webpage;
S203: extracting word vectors of the target text, and inputting all the word vectors into the target language model after the transfer learning to generate semantic vectors.
The semantic vector may be generated by extracting the word vectors of the target text and inputting all of them into the target language model after transfer learning. Word embedding is the general term for a group of language modeling and feature learning techniques in natural language processing (NLP); a word vector is a vector of real numbers to which a word or phrase from the vocabulary is mapped.
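To make the data flow concrete, the sketch below looks up a word vector for each token and pools them into one fixed-length vector. Averaging is used purely as a stand-in: in the embodiment the word vectors are fed to the transferred language model, which is not reproduced here. The function name, the `word_vectors` lookup table and the zero-vector fallback are assumptions.

```python
from typing import Dict, List


def semantic_vector(
    tokens: List[str],
    word_vectors: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No token has a vector: fall back to a zero vector (an assumption).
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]
```

In a transfer learning setting the `word_vectors` table would come from the source domain, so that text from the small-sample target domain is embedded with knowledge learned elsewhere.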
Referring to fig. 3, fig. 3 is a flowchart of a method for extracting a target-type keyword according to an embodiment of the present application, where the embodiment is a further explanation of S101 of fig. 1, and a more preferred implementation may be obtained by combining the embodiment with the embodiment corresponding to fig. 1, where the embodiment may include the following steps:
S301: generating a keyword table according to the database;
The database comprises any one or a combination of any several of a tampered common vocabulary, a standard corpus and an expert knowledge base. Since keywords are much easier to collect than samples, keywords can be collected by fusing multiple sources, including but not limited to: a) collecting a tampered common vocabulary; b) extracting keywords from a standard corpus, such as a wiki Chinese corpus; c) expert prior knowledge.
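Fusing several keyword sources into one table can be as simple as a normalised set union. The sketch below assumes each source is already an iterable of strings; the function name and the lowercasing and whitespace-trimming rules are illustrative choices, not part of the original description.

```python
from typing import Iterable, Set


def build_keyword_table(*sources: Iterable[str]) -> Set[str]:
    """Merge keyword sources into one de-duplicated table."""
    table: Set[str] = set()
    for source in sources:
        for word in source:
            # Normalise so the same keyword from two sources is stored once.
            word = word.strip().lower()
            if word:
                table.add(word)
    return table
```

For example, the tampered common vocabulary, keywords extracted from a corpus, and an expert-supplied list could be passed as three separate arguments.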
S302: and performing keyword matching operation on the target webpage by using the keyword list to obtain a target type keyword.
Specifically, the keyword matching operation in this step may use a multi-pattern string matching algorithm to determine whether the webpage to be detected contains any keyword; if it does, the transfer learning module is used to determine whether the webpage has been tampered with. In addition, the keyword table can be adjusted dynamically according to the hits of the keyword matching operation. For example, when many target type keywords are obtained but their average quality is low, the keyword table can be pruned to improve the quality of the keywords obtained by the matching operation.
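A minimal way to match many keywords in one pass is a single compiled regular expression with one alternative per keyword. This is a stand-in chosen for brevity, not a dedicated multi-pattern algorithm such as Aho-Corasick, and the function names and case-insensitive matching are assumptions.

```python
import re
from typing import Iterable, List


def compile_keyword_matcher(keywords: Iterable[str]) -> re.Pattern:
    """Build one pattern that matches any keyword in the table."""
    # Longest first, so a longer keyword wins over a keyword that is its prefix.
    ordered = sorted(set(keywords), key=len, reverse=True)
    if not ordered:
        raise ValueError("keyword table is empty")
    return re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)


def match_keywords(matcher: re.Pattern, page_text: str) -> List[str]:
    """Return the distinct keywords hit, lowercased, in order of first appearance."""
    hits: List[str] = []
    for m in matcher.finditer(page_text):
        kw = m.group(0).lower()
        if kw not in hits:
            hits.append(kw)
    return hits
```

An empty result means the page contains no target type keyword, so it would not be passed on for further tampering detection.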
Combining this embodiment with the embodiment corresponding to fig. 1 addresses the difficulty that a general detection model has in detecting small-sample tampering: a keyword pre-screening operation is first performed on the webpage to be detected to determine whether it is a specific type of small-sample tampered webpage. For example, in an observed real-world scenario, tampering samples related to game plug-ins are few in number and easily confused with the websites of legitimate game companies, so traditional models struggle to distinguish them. Specifically, a keyword table may be constructed for the specific type of webpage, and the webpage sample is then checked for keywords of that type. If the webpage sample is determined to be of the small-sample type, tampering can be judged using the transfer learning approach of the embodiment corresponding to fig. 2.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a device for detecting a webpage tampering behavior according to an embodiment of the present application;
the apparatus may include:
the keyword matching module 100 is configured to determine a target webpage according to the detection instruction and match a target type keyword in the target webpage;
the semantic vector generation module 200 is configured to determine a target text corresponding to the target type keyword in the target webpage, and generate a semantic vector corresponding to the target text by using the target language model after the migration learning;
and the detection module 300 is configured to perform a webpage tampering detection operation on the target webpage according to the semantic vector.
After the target webpage to be detected is determined, this embodiment matches target type keywords in the target webpage, generates the corresponding semantic vector with the target language model obtained by transfer, and then performs the webpage tampering detection according to the semantic vector. Because the target type keywords in the target webpage serve as the basis for detection, illegal text samples of a specific type do not need to be obtained from the target webpage, and target type keywords are far easier to obtain than illegal text samples. Because the target language model has been trained in advance in a source domain with a large amount of knowledge and labels, it can produce high-quality semantic vectors for the detection operation. This embodiment can therefore reduce the dependence on the number of samples in the machine learning process and detect webpage tampering behavior with few training samples.
Further, the semantic vector generation module 200 includes:
the text determination module is used for determining a target text corresponding to the target type keyword in the target webpage;
and the vector generating unit is used for extracting word vectors of the target text and inputting all the word vectors into the target language model after the transfer learning to generate semantic vectors.
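The step from word vectors to a single semantic vector might look like the toy sketch below. The application does not specify the model architecture here, so mean pooling is used purely as a placeholder for the target language model after transfer learning; the three-dimensional `WORD_VECTORS` table and the function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical 3-dimensional word vectors; a real system would load
# pretrained embeddings instead of this hand-written table.
WORD_VECTORS: Dict[str, List[float]] = {
    "free": [1.0, 0.0, 0.0],
    "aimbot": [0.0, 1.0, 0.0],
    "download": [0.0, 0.0, 1.0],
}
UNKNOWN = [0.0, 0.0, 0.0]  # vector used for out-of-vocabulary tokens


def extract_word_vectors(target_text: str) -> List[List[float]]:
    """Look up one word vector per whitespace-separated token."""
    return [WORD_VECTORS.get(tok, UNKNOWN) for tok in target_text.lower().split()]


def mean_pool(word_vectors: List[List[float]]) -> List[float]:
    """Placeholder for the language model: average the word vectors
    component by component to obtain one fixed-size semantic vector."""
    if not word_vectors:
        return list(UNKNOWN)
    n = len(word_vectors)
    return [sum(column) / n for column in zip(*word_vectors)]
```

In practice the pooling function would be replaced by a forward pass through the transferred language model, which takes the same list of word vectors as input.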
Further, the keyword matching module 100 includes:
the webpage determining unit is used for determining a target webpage according to the detection instruction;
and the screening unit is used for executing a keyword matching operation on the target webpage by using the keyword table to obtain the target type keywords.
Further, the detection device further comprises:
the keyword table generating module is used for generating a keyword table according to the database; the database comprises any one or a combination of any several of a tampered common vocabulary, a standard corpus and an expert knowledge base.
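As a rough illustration of generating a keyword table from such a database, the sketch below merges the three sources. How the sources are combined is an assumption made here for illustration: candidate terms from the tampering vocabulary are dropped when they occur often in the standard corpus (and so would not discriminate), while expert-supplied terms are always kept. The function name `build_keyword_table` and the `max_corpus_count` threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def build_keyword_table(
    tamper_vocab: Iterable[str],
    standard_corpus: Iterable[str],
    expert_terms: Iterable[str],
    max_corpus_count: int = 1,
) -> List[str]:
    """Merge the three database sources into one sorted keyword table."""
    # Count single-token frequencies in ordinary (non-tampered) text.
    # Multi-word candidates are never counted and are therefore always kept.
    corpus_counts = Counter(
        tok for doc in standard_corpus for tok in doc.lower().split()
    )
    candidates = {term.lower() for term in tamper_vocab}
    # Assumed rule: discard candidates that are common in the standard corpus.
    filtered = {t for t in candidates if corpus_counts[t] <= max_corpus_count}
    # Expert knowledge base entries bypass the frequency filter.
    return sorted(filtered | {term.lower() for term in expert_terms})
```

The frequency filter is only one plausible use of a standard corpus; any of the three sources may also be used alone, as the text states.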
Further, the detection module 300 is specifically a module for inputting the semantic vector into the deep learning model so as to determine whether the target webpage has a webpage tampering behavior by using the deep learning model.
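The final decision step can be pictured as a forward pass through a small feed-forward network, sketched below in plain Python. The application only says "deep learning model", so the two-layer architecture, the hand-set weights in `PARAMS`, and the 0.5 threshold are all hypothetical; a trained model would supply learned parameters.

```python
import math
from typing import Dict, List


def relu(x: float) -> float:
    return max(0.0, x)


def forward(vector, w_hidden, b_hidden, w_out, b_out) -> float:
    """One hidden ReLU layer followed by a sigmoid output unit.
    Returns the estimated probability that the page is tampered."""
    hidden = [
        relu(sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


def is_tampered(vector: List[float], params: Dict, threshold: float = 0.5) -> bool:
    return forward(vector, **params) >= threshold


# Hypothetical hand-set parameters for a 3-dimensional semantic vector.
PARAMS = {
    "w_hidden": [[2.0, 2.0, 0.0], [0.0, 0.0, 1.0]],
    "b_hidden": [0.0, 0.0],
    "w_out": [3.0, -1.0],
    "b_out": -2.0,
}
```

The threshold trades missed detections against false alarms and would normally be tuned on held-out samples.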
Since the embodiments of the apparatus portion and the method portion correspond to each other, please refer to the description of the embodiments of the method portion for the embodiments of the apparatus portion, which is not repeated here.
The present application also provides a computer-readable storage medium storing a computer program which, when executed, may implement the steps provided by the above-described embodiments. The storage medium may include any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The present application further provides an electronic device, which may include a memory and a processor, where the memory stores a computer program, and when the processor calls the computer program in the memory, the steps provided in the foregoing embodiments may be implemented. Of course, the electronic device may also include various network interfaces, power supplies, and the like.
The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the others, and for the parts that are the same or similar the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant details can be found in the description of the method. It should be noted that those skilled in the art may make improvements and modifications to the present application without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

Claims (12)

1. A method for detecting webpage tampering behavior is characterized by comprising the following steps:
determining a target webpage according to a detection instruction, and matching target type keywords in the target webpage;
determining a target text corresponding to the target type keywords in the target webpage, and generating a semantic vector corresponding to the target text by using a target language model after transfer learning;
and executing webpage tampering behavior detection operation on the target webpage according to the semantic vector.
2. The detection method according to claim 1, wherein the generating semantic vectors corresponding to the target texts by using the target language model after the transfer learning comprises:
and extracting word vectors of the target text, and inputting all the word vectors into the target language model after the transfer learning to generate the semantic vectors.
3. The detection method of claim 1, wherein matching the target type keyword in the target web page comprises:
and performing keyword matching operation on the target webpage by using a keyword table to obtain the target type keyword.
4. The method of claim 3, wherein before performing a keyword matching operation on the target web page using a keyword table to obtain the target type keyword, the method further comprises:
generating the keyword table according to a database; wherein the database comprises any one or a combination of any several of a tampered common vocabulary, a standard corpus and an expert knowledge base.
5. The detection method according to any one of claims 1 to 4, wherein performing a webpage tampering behavior detection operation on the target webpage according to the semantic vector comprises:
inputting the semantic vector into a deep learning model so as to judge whether the webpage tampering behaviors exist in the target webpage by using the deep learning model.
6. An apparatus for detecting webpage tampering behavior, comprising:
the keyword matching module is used for determining a target webpage according to a detection instruction and matching target type keywords in the target webpage;
the semantic vector generation module is used for determining a target text corresponding to the target type keywords in the target webpage and generating a semantic vector corresponding to the target text by using a target language model after transfer learning;
and the detection module is used for executing webpage tampering behavior detection operation on the target webpage according to the semantic vector.
7. The detection apparatus according to claim 6, wherein the semantic vector generation module comprises:
the text determination module is used for determining a target text corresponding to the target type keyword in the target webpage;
and the vector generating unit is used for extracting word vectors of the target text and inputting all the word vectors into the target language model after the transfer learning to generate the semantic vectors.
8. The detection apparatus according to claim 6, wherein the keyword matching module comprises:
the webpage determining unit is used for determining a target webpage according to the detection instruction;
and the screening unit is used for executing keyword matching operation on the target webpage by utilizing a keyword table to obtain the target type keyword.
9. The detection device of claim 8, further comprising:
the keyword table generating module is used for generating the keyword table according to a database; wherein the database comprises any one or a combination of any several of a tampered common vocabulary, a standard corpus and an expert knowledge base.
10. The detection apparatus according to any one of claims 6 to 9, wherein the detection module is specifically a module configured to input the semantic vector into a deep learning model so as to determine whether the webpage tampering behavior exists on the target webpage by using the deep learning model.
11. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method for detecting tampering behaviour of a web page as claimed in any one of claims 1 to 5 when executing said computer program.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for detecting tampering behaviour of a web page according to any one of claims 1 to 5.
CN201910074366.6A 2019-01-25 2019-01-25 Method and device for detecting webpage tampering behavior and related components Pending CN111488622A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910074366.6A CN111488622A (en) 2019-01-25 2019-01-25 Method and device for detecting webpage tampering behavior and related components

Publications (1)

Publication Number Publication Date
CN111488622A true CN111488622A (en) 2020-08-04

Family

ID=71813599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910074366.6A Pending CN111488622A (en) 2019-01-25 2019-01-25 Method and device for detecting webpage tampering behavior and related components

Country Status (1)

Country Link
CN (1) CN111488622A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341146A (en) * 2017-06-23 2017-11-10 上海交通大学 The semantic resolution system of transportable spoken language and its implementation based on semantic groove internal structure
CN107437038A (en) * 2017-08-07 2017-12-05 深信服科技股份有限公司 A kind of detection method and device of webpage tamper
CN108111478A (en) * 2017-11-07 2018-06-01 中国互联网络信息中心 A kind of phishing recognition methods and device based on semantic understanding
CN109165529A (en) * 2018-08-14 2019-01-08 杭州安恒信息技术股份有限公司 A kind of dark chain altering detecting method, device and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘文洁 et al.: "Semantic inference network based on transfer learning" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528677A (en) * 2020-12-22 2021-03-19 北京百度网讯科技有限公司 Training method and device of semantic vector extraction model and electronic equipment
CN112528677B (en) * 2020-12-22 2022-03-11 北京百度网讯科技有限公司 Training method and device of semantic vector extraction model and electronic equipment

Similar Documents

Publication Publication Date Title
CN110537180B (en) System and method for tagging elements in internet content within a direct browser
CN102446255B (en) Method and device for detecting page tamper
US10423649B2 (en) Natural question generation from query data using natural language processing system
CN102436563B (en) Method and device for detecting page tampering
CN108959559B (en) Question and answer pair generation method and device
CN109391706A (en) Domain name detection method, device, equipment and storage medium based on deep learning
CN108038173B (en) Webpage classification method and system and webpage classification equipment
CN112541476B (en) Malicious webpage identification method based on semantic feature extraction
WO2020151173A1 (en) Webpage tampering detection method and related apparatus
CN113055386B (en) Method and device for identifying and analyzing attack organization
CN110427612B (en) Entity disambiguation method, device, equipment and storage medium based on multiple languages
CN111753171B (en) Malicious website identification method and device
CN110909531B (en) Information security screening method, device, equipment and storage medium
CN102591965A (en) Method and device for detecting black chain
US20140149106A1 (en) Categorization Based on Word Distance
CN112328936A (en) Website identification method, device and equipment and computer readable storage medium
Zhang et al. Annotating needles in the haystack without looking: Product information extraction from emails
CN112132710A (en) Legal element processing method and device, electronic equipment and storage medium
CN112818200A (en) Data crawling and event analyzing method and system based on static website
CN104036189A (en) Page distortion detecting method and black link database generating method
Alves et al. Leveraging BERT's Power to Classify TTP from Unstructured Text
Li et al. Semantic‐enhanced multimodal fusion network for fake news detection
CN111488622A (en) Method and device for detecting webpage tampering behavior and related components
CN113742785A (en) Webpage classification method and device, electronic equipment and storage medium
CN111382383A (en) Method, device, medium and computer equipment for determining sensitive type of webpage content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination