CN116127236A - Webpage web component identification method and device based on parallel structure - Google Patents

Publication number
CN116127236A
Authority
CN
China
Prior art keywords
web
information
component
response information
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310419786.XA
Other languages
Chinese (zh)
Other versions
CN116127236B (en)
Inventor
张敏
任高锋
周静
赵建聪
穆丽珠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Webray Tech Beijing Co ltd
Original Assignee
Webray Tech Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Webray Tech Beijing Co ltd filed Critical Webray Tech Beijing Co ltd
Priority to CN202310419786.XA priority Critical patent/CN116127236B/en
Publication of CN116127236A publication Critical patent/CN116127236A/en
Application granted granted Critical
Publication of CN116127236B publication Critical patent/CN116127236B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a webpage web component identification method and device based on a parallel structure, belonging to the technical field of computers, wherein the method comprises the following steps: acquiring web response information to be identified; performing feature conversion on the web response information to obtain feature vectors; determining web components included in the web response information based on the feature vectors and the recognition model; the recognition model comprises a plurality of sub-models, model training data comprise web response information and label information of a plurality of component types, each sub-model is obtained by training based on the web response information and the label information of the component type corresponding to each sub-model, and the label information is web component information.

Description

Webpage web component identification method and device based on parallel structure
Technical Field
The invention relates to the technical field of computers, in particular to a webpage web component identification method and device based on a parallel structure.
Background
Web page (web) fingerprint recognition identifies the components used by a website, for example the framework, the web front-end framework, and the web front-end language. It identifies the service component information of a target web object by comparing the features of the object with existing web fingerprint information. The service component information can then be correlated with known vulnerabilities to perform vulnerability detection or security early warning.
With the rapid development of computer technology, and in order to give users a better experience, more and more websites and applications have appeared, bringing an increasing number of component types and component instances, so a web fingerprint may contain a wide variety of components. How to accurately identify web components is therefore a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention provides a web component identification method and device based on a parallel structure, which are used for accurately identifying web components.
The invention provides a webpage web component identification method based on a parallel structure, which comprises the following steps:
acquiring web response information to be identified;
performing feature conversion on the web response information to obtain feature vectors;
determining web components included in the web response information based on the feature vectors and the recognition model;
the recognition model comprises a plurality of sub-models, training data comprises web response information and label information of a plurality of component types, each sub-model is obtained by training based on the web response information and the label information of the component type corresponding to each sub-model, and the label information is web component information.
According to the webpage web component identification method based on the parallel structure provided by the invention, the web response information is subjected to feature conversion to obtain the feature vector, and the method comprises the following steps:
Extracting keywords of the web response information;
determining a feature vector of the web response information according to the number of times of occurrence of the keyword in the web response information and the number of web response information including the keyword in a corpus; the corpus is a corpus of web response information comprising the web response information to be identified and training data.
According to the webpage web component identification method based on the parallel structure provided by the invention, before extracting the keywords of the web response information, the method further comprises the following steps:
performing webpage analysis processing on the web response information to obtain text information;
preprocessing the text information to obtain preprocessed text information;
and carrying out word segmentation on the preprocessed text information.
According to the webpage web component recognition method based on the parallel structure, the text information is preprocessed, and the method comprises at least one of the following steps:
converting English capital characters in the text information into lowercase characters;
carrying out unified processing on the format of the text information;
and eliminating stop words in the text information.
According to the webpage web component recognition method based on the parallel structure, before training the recognition model, the method further comprises the following steps:
for any piece of label information, splitting the label information into component information corresponding to at least one component type;
for any component type, de-duplicating the component information corresponding to the component type to obtain the label vector dimension and the component list of the component type;
and obtaining the label vector of the component type for each piece of label information according to the label vector dimension and the component list of the component type.
According to the webpage web component identification method based on the parallel structure, which is provided by the invention, the method further comprises the following steps:
for any sub-model, inputting a feature vector corresponding to web response information in training data and a label vector of a component type corresponding to the sub-model into the sub-model for training; the number of sub-models is the same as the number of component types.
According to the webpage web component identification method based on the parallel structure provided by the invention, the sub-model is built based on an ensemble learning model.
The invention also provides a webpage web component identification device based on the parallel structure, which comprises:
an acquisition module, used for acquiring web response information to be identified;
the processing module is used for carrying out feature conversion on the web response information to obtain feature vectors;
the processing module is further used for determining a web component included in the web response information based on the feature vector and the recognition model;
the recognition model comprises a plurality of sub-models, training data comprises web response information and label information of a plurality of component types, each sub-model is obtained by training based on the web response information and the label information of the component type corresponding to each sub-model, and the label information is web component information.
The invention also provides electronic equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the webpage web component identification method based on the parallel structure when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a web page web component recognition method based on a parallel structure as described in any one of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a method of identifying web components of a web page based on a parallel architecture as described in any of the above.
According to the webpage web component identification method and device based on the parallel structure provided by the invention, feature conversion is performed on the web response information to be identified to obtain a feature vector, and the web components included in the web response information are determined based on the feature vector and the recognition model. The recognition model comprises a plurality of sub-models; the training data comprises web response information and label information of a plurality of component types, the label information being web component information. Because each sub-model is trained on the web response information and the label information of its own component type, inaccurate recognition caused by an uneven distribution of component types in the training data is avoided. Because the recognition model consists of a plurality of sub-models, it can identify information of a plurality of component types, so the web components determined with the trained recognition model are accurate and comprehensive.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of the web component recognition method based on the parallel structure provided by the invention;
FIG. 2 is the first schematic diagram of the model training principle of the web component recognition method based on the parallel structure provided by the invention;
FIG. 3 is the second schematic diagram of the model training principle of the web component recognition method based on the parallel structure provided by the invention;
FIG. 4 is a schematic diagram of the web component recognition device based on the parallel structure provided by the invention;
FIG. 5 is a schematic structural diagram of the electronic device provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
First, the terms used in the embodiments of the present invention are explained:
Stop words are characters or words that, in information retrieval, are automatically filtered out before or after processing natural language data (or text) in order to save storage space and improve search efficiency, for example 'the', 'is', 'at', 'which', 'on'.
One piece of tag information corresponds to one piece of response information. One piece of tag information includes component information of a plurality of component types, and the component information includes a component name and/or version information.
The component list is the component information obtained by de-duplicating all component information of a given component type across the plurality of pieces of tag information.
The tag vector dimension is the length of the component list.
The following describes the technical scheme of the present invention in detail with reference to fig. 1 to 5. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 1 is a schematic flow chart of the web component recognition method based on a parallel structure according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
Step 101, acquiring web response information to be identified;
Step 102, performing feature conversion on the web response information to obtain feature vectors;
Specifically, for example, keywords in the web response information are extracted and a feature vector is obtained based on the keywords. Each element of the feature vector may be calculated from, for example, the number of times a keyword appears in the response information and the number of pieces of web response information in the corpus that include the keyword. The importance of a word to a text increases with the number of times it appears in that text and decreases with the frequency with which it appears across the corpus. If a word appears with a high frequency in one text and rarely in other texts, the word is considered to have good class discrimination and is suitable for classification or recognition.
Step 103, determining web components included in the web response information based on the feature vectors and the recognition model;
the recognition model comprises a plurality of sub-models, the training data comprises web response information and label information of a plurality of component types, each sub-model is obtained by training based on the web response information and the label information of the component type corresponding to each sub-model, and the label information is web component information.
In particular, web component types include web servers (web_servers), middleware (web_middleware), web front end frameworks (web_frames), web front end languages (web_language), UI interfaces (ui_lib), web applications (web_apps), containers (web_containers), content delivery networks (Content Delivery Network, CDNs), and the like.
The feature vector corresponding to the response information to be identified is input into the trained recognition model to determine the web components included in the response information.
The recognition model comprises a plurality of sub-models. Each sub-model is obtained by supervised training based on the web response information and the label information of the component type corresponding to that sub-model, and the label information is web component information.
For example, if there are 8 component types, there are 8 sub-models in one-to-one correspondence with the component types, and each sub-model is trained on the web response information in the training data and the label information of its corresponding component type in the training data.
In the method of this embodiment, feature conversion is performed on the web response information to be identified to obtain a feature vector, and the web components included in the web response information are determined based on the feature vector and the recognition model. The recognition model comprises a plurality of sub-models; the training data comprises web response information and label information of a plurality of component types, the label information being web component information. Because each sub-model is trained on the web response information and the label information of its own component type, inaccurate recognition caused by an uneven distribution of component types in the training data is avoided. Because the recognition model consists of a plurality of sub-models, it can identify information of a plurality of component types, so the web components determined with the trained recognition model are accurate and comprehensive.
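As an illustration of the last part of step 103, the following minimal sketch turns the 0/1 output vector of each sub-model back into component names. It assumes that each sub-model outputs one 0/1 vector aligned with the component list of its component type; the function name decode_components and the dictionary layout are hypothetical and are not taken from the patent.

```python
def decode_components(predicted_by_type, component_lists):
    """Turn per-type 0/1 vectors into the names of the identified components.

    predicted_by_type: {component_type: [0, 1, ...]} (assumed sub-model output)
    component_lists:   {component_type: [component_info, ...]} in the same order
    """
    identified = {}
    for component_type, vector in predicted_by_type.items():
        names = component_lists[component_type]
        # Keep every component whose position in the vector is set to 1.
        identified[component_type] = [
            name for name, flag in zip(names, vector) if flag == 1
        ]
    return identified
```

Each component type is decoded independently, which mirrors the one-branch-per-type structure of the recognition model.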
Optionally, before step 102, the method further includes:
performing webpage analysis processing on the web response information to obtain text information;
preprocessing the text information to obtain preprocessed text information;
and performing word segmentation on the preprocessed text information.
Wherein the preprocessing of the text information comprises at least one of the following:
converting English capital characters in the text information into lowercase characters;
carrying out unified processing on the format of the text information;
and eliminating stop words in the text information.
Specifically, to facilitate keyword extraction, web page parsing is first performed on the web response information to obtain text information, and the key attribute information in it is extracted. The key attribute information generally comes from elements such as title, meta, script, and link, because component information is often hidden in them. The web response information comprises the response header and the response-body text of the target website. Optionally, the text content of the response body is obtained with the text function of the web page parser.
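The parsing step can be sketched as follows. This is only an illustration using the HTMLParser class from the Python standard library; the patent does not name a parser, and the class name KeyAttributeParser and the function extract_text are hypothetical.

```python
from html.parser import HTMLParser

# Elements named in the text as carrying component hints.
KEY_TAGS = {"title", "meta", "script", "link"}


class KeyAttributeParser(HTMLParser):
    """Collects the title text and the attribute values of meta/script/link tags."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in KEY_TAGS:
            # Keep attribute values such as src, href, name and content.
            self.fragments.extend(value for _, value in attrs if value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is kept; other body text is ignored here.
        if self._in_title and data.strip():
            self.fragments.append(data.strip())


def extract_text(header: str, body: str) -> str:
    """Join the response header with the key attribute text of the response body."""
    parser = KeyAttributeParser()
    parser.feed(body)
    parser.close()
    return " ".join([header] + parser.fragments)
```

For a body containing a script tag whose src ends in jquery-2.2.4.min.js, the returned text contains that path, so the later keyword extraction can see the library name and version.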
For preprocessing of the text information: English words are case-sensitive, but "Bootstrap" and "bootstrap" should be counted as the same word, so all words are generally converted to lowercase. Special symbols such as '\r', '\n', and '\t' make text formats differ greatly, so the text format is unified by filtering out these special symbols, that is, format unification is performed. The text also contains many stop words that carry no information, such as the short words "a", "to", and "is", or punctuation marks; these need not be considered during text analysis and are therefore removed. Word segmentation splits the sentences in the text into words, which makes the text easier for a machine to process.
In the embodiment, before the feature conversion is performed on the response information, text information extraction and some preprocessing are performed on the response information, so that the feature conversion is simpler, the efficiency is higher, the accuracy is higher, and the final model recognition result is more accurate.
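The preprocessing and word segmentation described above could look like the following minimal sketch. The stop-word list and the tokenization pattern are assumptions chosen for illustration, and for brevity the sketch removes stop words after segmentation rather than before it.

```python
import re

# Illustrative stop-word list (an assumption); a real list would be much longer.
STOP_WORDS = {"a", "to", "is", "the", "at", "which", "on"}


def preprocess(text: str) -> list[str]:
    """Lowercase, unify the format, segment into words, and drop stop words."""
    text = text.lower()
    # Format unification: collapse \r, \n, \t and repeated spaces into one space.
    text = re.sub(r"\s+", " ", text)
    # Word segmentation: runs of letters and digits, optionally joined by . _ or -
    # so that tokens such as "bootstrap.min.js" or "1.5.8" stay in one piece.
    tokens = re.findall(r"[a-z0-9]+(?:[._-][a-z0-9]+)*", text)
    return [token for token in tokens if token not in STOP_WORDS]
```

Keeping dotted tokens whole is a design choice of this sketch: file names and version strings are often the most informative keywords for component identification.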
Alternatively, step 102 may be implemented as follows:
extracting keywords of web response information;
determining a feature vector of the web response information according to the number of times of occurrence of the keywords in the web response information and the number of web response information including the keywords in a corpus; the corpus comprises the web response information to be identified and the web response information in the training data.
Optionally, the feature vector of the response information may be computed with a text vector representation such as term frequency-inverse document frequency (TF-IDF), word vectorization (word2vec), or word embedding (embedding). In the present invention, the vector element corresponding to each keyword in the feature vector of the web response information is calculated by the following formulas (1), (2) and (3):
$$\mathrm{TF} = \frac{n_1}{n_2} \qquad (1)$$
$$\mathrm{IDF} = \log\frac{m_1}{m_2} \qquad (2)$$
$$\text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF} \qquad (3)$$
wherein the vector element corresponding to each keyword is its TF-IDF value; $n_1$ represents the number of times the keyword appears in the web response information; $n_2$ represents the total number of occurrences of all keywords included in the web response information; $m_1$ represents the total number of pieces of web response information in the corpus; and $m_2$ represents the number of pieces of web response information in the corpus that contain the keyword.
Specifically, the TF-IDF method is used to extract feature vectors from the text information of the response information. Formula (1) gives the term frequency (TF), formula (2) the inverse document frequency (IDF), and formula (3) the TF-IDF value. The TF-IDF of each keyword in each piece of response information can be calculated with these formulas, and texts can be distinguished according to the keywords.
TF-IDF is used to evaluate the importance of a word to one of the documents in a set of documents or a corpus. The importance of a word increases proportionally with the number of times it appears in the document, but at the same time decreases inversely with the frequency with which it appears in the corpus. If a word appears in one text with a high word frequency (TF) and rarely in other text, it is considered that the word or phrase has a good class distinction capability, suitable for classification or recognition. Therefore, the elements of the feature vector are represented by TF-IDF, so that better model performance can be obtained, namely, the accuracy of the model output result is higher.
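The TF-IDF feature conversion can be sketched as below. The sketch assumes the common unsmoothed form, TF = n1/n2 and IDF = log(m1/m2) with the natural logarithm; the exact smoothing and logarithm base used in the patent formulas are not recoverable here, and the function name tf_idf_vectors is hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(documents: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, vectors) with one TF-IDF vector per tokenized document."""
    m1 = len(documents)  # total number of documents in the corpus
    # m2 for each keyword: the number of documents that contain it.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    vocabulary = sorted(doc_freq)
    vectors = []
    for doc in documents:
        counts = Counter(doc)  # n1 for each keyword in this document
        n2 = len(doc)          # total keyword occurrences in this document
        vectors.append([
            (counts[word] / n2) * math.log(m1 / doc_freq[word]) if n2 else 0.0
            for word in vocabulary
        ])
    return vocabulary, vectors
```

With this unsmoothed form, a keyword that occurs in every document gets an IDF of zero and therefore contributes nothing to the feature vector, which matches the intuition that such a keyword cannot distinguish one response from another.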
Optionally, in the training process, data acquisition is performed first: response information and label information of different component types are collected from an asset platform. The response information comprises the response header and the response-body text of the target website, and the label information comprises component name and/or component version information.
The acquired response information and label information are divided into a training set and a test set. After the division, the response information and label information in both sets are preprocessed, for example by format unification and stop-word removal.
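A minimal sketch of the division into a training set and a test set follows. The random shuffle, the seed, and the 0.8 fraction are assumptions made for illustration; the text only states that the collected data is divided.

```python
import random


def split_records(records: list, train_fraction: float = 0.8, seed: int = 0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Each record here would hold one piece of response information together with its label information, so that the pair stays in the same set after the split.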
Further, feature conversion is performed on the response information and the label information. For the response information, the TF-IDF algorithm is applied to its text information to obtain a feature vector, each element of which is the TF-IDF value of a keyword in the response information, that is, the product of the term frequency TF and the inverse document frequency IDF.
Optionally, the label information may be converted into label vectors in the following manner:
for any piece of label information, splitting the label information into component information corresponding to at least one component type;
for any component type, de-duplicating the component information corresponding to the component type to obtain the label vector dimension and the component list of the component type;
and obtaining the label vector of the component type for each piece of label information according to the label vector dimension and the component list of the component type.
For example, one of the 8 component types is selected, and all component information of that type in the training set, namely component names and version information, is collected, merged, and de-duplicated. The number of entries remaining after merging and de-duplication is taken as the dimension n of the training set label, and the de-duplicated component and version entries form the component list. The component list is then used to convert label information into a label vector containing only 0 and 1: a label vector of dimension n is created with all elements initialized to 0; for each piece of component information in a given piece of label information, the corresponding index in the component list is found and the element at that index is set to 1, while the remaining elements stay 0. The result is the label vector of that label information for the component type. For example, suppose the component list obtained after merging and de-duplication is [apache, nginx, nginx/1.5.8, microsoft iis/8.5], where nginx in nginx/1.5.8 is the web component and 1.5.8 is the version information, so the label dimension is 4. If a piece of label information includes component information of component type web_server, namely {"web_server": [nginx/1.5.8, apache]}, then its label vector is [1,0,1,0]. Label vectors for the other component types are obtained in the same way.
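The label vector conversion in this example can be sketched as follows. The function names build_component_list and to_label_vector are hypothetical, and the sketch assumes each piece of label information has already been split into a dictionary that maps a component type to a list of component strings.

```python
def build_component_list(labels: list[dict], component_type: str) -> list[str]:
    """Merge and de-duplicate the component info of one type across all labels."""
    seen = {}
    for label in labels:
        for component in label.get(component_type, []):
            seen.setdefault(component, None)  # a dict keeps first-seen order
    return list(seen)


def to_label_vector(label: dict, component_type: str, component_list: list[str]) -> list[int]:
    """Create a 0/1 vector whose length is the label vector dimension."""
    vector = [0] * len(component_list)
    for component in label.get(component_type, []):
        if component in component_list:
            # Set the element at the component's index in the list to 1.
            vector[component_list.index(component)] = 1
    return vector
```

Called with the component list [apache, nginx, nginx/1.5.8, microsoft iis/8.5] and the label {"web_server": ["nginx/1.5.8", "apache"]}, to_label_vector returns [1, 0, 1, 0], matching the example in the text.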
Model training: component information covers both common and uncommon components, so the collected data may be imbalanced. To alleviate the overfitting caused by data imbalance, an ensemble learning model may be selected for training, chosen from at least one of eXtreme Gradient Boosting (XGBoost, an optimized distributed gradient boosting library), random forests, gradient boosting decision trees (Gradient Boosting Decision Tree, GBDT), and the like. The invention selects the engineered XGBoost ensemble learning model as the sub-model.
Optionally, for any sub-model, inputting a feature vector corresponding to web response information in training data and a label vector of a component type corresponding to the sub-model into the sub-model for training; the number of sub-models is the same as the number of component types.
Specifically, the feature vectors of the response information in the training set and the label vectors of a given component type are input into the ensemble learning model corresponding to that component type for training, forming one branch of the parallel model. For example, if the web components are divided into 8 component types, 8 sub-models are trained in the same manner and together form the parallel structure model, which makes the output of the model more accurate.
For example, for response information 1 and its corresponding tag information 1, the feature vector of response information 1 is obtained, together with the tag vector converted from the component information of component type 1 in tag information 1. The feature vectors of a plurality of pieces of response information and the tag vectors of component type 1 from the corresponding pieces of tag information are input into the sub-model corresponding to component type 1 for training.
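The parallel structure, one independently trained sub-model per component type over shared feature vectors, can be sketched as below. To keep the sketch free of third-party dependencies, a simple nearest-neighbour stand-in replaces the XGBoost sub-model selected by the invention; the stand-in and all names in the sketch are hypothetical, and any object with fit and predict methods could be substituted.

```python
from typing import Protocol


class SubModel(Protocol):
    """Interface every branch is assumed to offer (an assumption of this sketch)."""

    def fit(self, features: list[list[float]], labels: list[list[int]]) -> None: ...
    def predict(self, features: list[list[float]]) -> list[list[int]]: ...


class NearestNeighbourModel:
    """Stand-in sub-model (NOT XGBoost): copies the label vector of the closest training sample."""

    def fit(self, features, labels):
        self._features, self._labels = features, labels

    def predict(self, features):
        def closest(x):
            distances = [
                sum((a - b) ** 2 for a, b in zip(x, seen)) for seen in self._features
            ]
            return self._labels[distances.index(min(distances))]

        return [closest(x) for x in features]


def train_parallel_model(features, labels_by_type, make_model=NearestNeighbourModel):
    """Train one independent sub-model per component type on the same feature vectors."""
    branches = {}
    for component_type, label_vectors in labels_by_type.items():
        model = make_model()
        model.fit(features, label_vectors)
        branches[component_type] = model
    return branches


def predict_parallel_model(branches, features):
    """Run every branch on the same features and collect the results per component type."""
    return {ctype: model.predict(features) for ctype, model in branches.items()}
```

Because the branches share nothing except the input features, a component type with few labelled samples does not affect the sub-models of the other types, and the branches can be trained independently of one another.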
For web component identification, the more component information is identified and the more accurate it is, the more comprehensively and accurately vulnerability detection and security early warning can be performed.
Model prediction: data preprocessing and feature conversion are performed on the response information and tag information of the test set. The feature vectors of the response information are input into each sub-model branch for prediction, the prediction of each branch is compared with the tag vectors of its component type to calculate the branch accuracy, and the predictions of all branches of the parallel structure model are then combined to calculate the overall accuracy.
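The accuracy calculation can be sketched as follows. The text does not define how accuracy is computed, so the sketch assumes exact-match accuracy: a sample counts as correct for a branch when its predicted tag vector equals the expected one, and as correct overall only when every branch is correct. The function names are hypothetical.

```python
def branch_accuracy(predicted: list[list[int]], expected: list[list[int]]) -> float:
    """Fraction of samples whose predicted tag vector equals the expected one."""
    if not expected:
        return 0.0
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return hits / len(expected)


def overall_accuracy(predicted_by_type: dict, expected_by_type: dict) -> float:
    """Fraction of samples that every branch predicted exactly (assumed definition)."""
    types = list(expected_by_type)
    sample_count = len(expected_by_type[types[0]]) if types else 0
    if sample_count == 0:
        return 0.0
    hits = sum(
        1
        for i in range(sample_count)
        if all(predicted_by_type[t][i] == expected_by_type[t][i] for t in types)
    )
    return hits / sample_count
```

Other definitions, such as per-element accuracy or an average over branches, would fit the same structure; exact match is simply the strictest choice.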
When web component identification is performed with the parallel structure model, the data acquisition module first acquires the response information and label information of target websites from an asset platform. The response information comprises the response header and the response-body text of the target website, and the label information comprises component type and version information. For example, a total of 50000 pieces of data are collected, of which 40000 form the training data set and the remaining 10000 form the test data set. One collected piece of label data is shown below, with the specific company name and location replaced by "company name":
{domain: "testssl.healthcare-inc.com",
protocol: "tcp",
asn: "AS4808",
port: 443,
city: "Beijing",
provice: "Beijing",
cpes: ["cpe:/a:igor_sysoev:nginx:1.5.8"],
status: "up",
provider: "ipipnet",
country_name: "china",
organization: "China Unicom Beijing Province Network",
org: "ibreathcare.com",
data: {body: "<!DOCTYPE html>
<html lang=\"en\">
<head>
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />
<title>company name</title>
<meta name=\"Resource-type\" content=\"Document\" /> ……
<script src=\"https://staticssl.healthcare-inc.com/defat/jquery-2.2.4.min.js\"></script>
<script src=\"https://staticssl.healthcare-inc.com/suomao/bootstrap.min.js\"></script>
<script src=\"https://staticssl.healthcare-inc.com/suomao/jquery-ui.min.js\"></script>
<script type=\"text/javascript\" src=\"js/jquery.fullPage.min.js\"></script>
<script>
protocol: "tcp",
cert_serial: "03fcdb7ec6ab0de9c3de496fa0666302e14f",
header: "HTTP/1.1 200 OK
Server: nginx/1.5.8
Date: Thu, 02 Jun 2022 23:05:12 GMT
Content-Type: text/html
Last-Modified: Thu, 20 Jan 2022 04:59:06 GMT
Transfer-Encoding: chunked
Connection: keep-alive
Content-Encoding: gzip",
ssl_text: "ssl.cipher.version : TLSv1.2 ……",
port: 443,
web: {web_server: [{version: "1.5.8",name: "Nginx"}],frontend: [{version: "",name: "Bootstrap"}],links: [],ui_lib: [{version: "",name: "jQuery UI"},{version: "2.2.4",name: "jQuery"},{version: "",name: "animate.css"}],robots: "",web_frameworks: [{version: "",name: "animate.css"},{version: "",name: "Bootstrap"},{version: "",name: "jQuery"}],web_lang: [],web_container: []},
service: "https",
title: "company name",
cert_expiredate: "2022-07-30T15:03:32.189507+8:00",
tunnel: "ssl",
server: "nginx/1.5.8",
is_ipv6: false,
os: "",
isp: "connected",
mask: "",
ip_str: "120.132.61.32",……}
As shown in fig. 3, data preprocessing includes dividing the collected information into a training set and a test set and then, after the split, preprocessing the response information and the component type information.
Response information preprocessing includes performing web page parsing on the response body, converting all uppercase English characters in the parsed text to lowercase, removing special characters and stop words, and performing word segmentation.
Web page parsing is performed on the response body in the response information to extract its key attribute information, namely title, meta, script, link, and the like, because component information is hidden in these attributes. Specifically, parsing means obtaining the text information in the response body with the text function of the web page parser, that is, obtaining the text content of the response body of the target data. The dictionary data above contains the response body (body) and the response header (header) of the response information; because the response body in the example data is too long, only some of the important information is shown for analysis. The content of the key attribute information is shown below:
<!DOCTYPE html>\r\n
<html lang=\"en\">\r\n
\r\n
<head>\r\n
\t<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\r\n
\t<title>Company name</title>\r\n
\t<meta name=\"Resource-type\"
It can be seen from the above section that company name information exists in the title attribute.
<link rel=\"shortcut icon\" href=\"images/icon.png\">\r\n
\t<link rel=\"stylesheet\" type=\"text/css\" href=\"css/jquery.fullPage.css\" />\r\n
\t<link rel=\"stylesheet\" type=\"text/css\" href=\"css/style.css\" />\r\n
\t<link rel=\"stylesheet\" type=\"text/css\" href=\"css/animate.css\" />\r\n
\t<link rel=\"stylesheet\" href=\"https://staticssl.healthcare-inc.com/suomao/bootstrap.min.css\">\r\n
\t<link rel=\"stylesheet\" href=\"css/header.css\">
From the above, it can be seen that jquery, animate, and bootstrap component information exists in the link attribute.
<script src=\"https://staticssl.healthcare-inc.com/defat/jquery-2.2.4.min.js\"></script>\r\n
\t<script src=\"https://staticssl.healthcare-inc.com/suomao/bootstrap.min.js\"></script>\r\n
\t<script src=\"https://staticssl.healthcare-inc.com/suomao/jquery-ui.min.js\"></script>\r\n
\t<script type=\"text/javascript\" src=\"js/jquery.fullPage.min.js\"
From the above, it can be seen that jquery and bootstrap component information exists in the script attribute, and that the jquery entry carries its version information, 2.2.4.
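The extraction of key attribute information described above can be sketched in Python. This is only an illustration under assumptions: the source does not name the web page parser it uses, so the standard-library `html.parser` module stands in for it, and the class name `KeyAttributeParser`, the function `extract_key_attributes`, and the "name: value" output format are hypothetical.

```python
from html.parser import HTMLParser

# Tags whose content or attributes the description treats as key attribute information.
KEY_TAGS = {"title", "meta", "script", "link"}


class KeyAttributeParser(HTMLParser):
    """Collects the title text and the attribute values of meta/script/link tags."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in KEY_TAGS:
            # Keep every non-empty attribute as a "name: value" fragment.
            for name, value in attrs:
                if value:
                    self.fragments.append(f"{name}: {value}")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.fragments.append(data.strip())


def extract_key_attributes(body):
    """Return the key attribute fragments found in a response body."""
    parser = KeyAttributeParser()
    parser.feed(body)
    parser.close()
    return parser.fragments


# Abridged response body built from the example data in the description.
sample = (
    '<html><head><title>Company name</title>'
    '<meta name="Resource-type" content="Document" />'
    '<link rel="stylesheet" href="css/animate.css">'
    '<script src="https://staticssl.healthcare-inc.com/defat/jquery-2.2.4.min.js"></script>'
    '</head></html>'
)
fragments = extract_key_attributes(sample)
```

The fragments keep the file names in which the component names and versions are embedded, so later steps can turn them into keywords.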
To preprocess the response information, the key attribute information of the response body is obtained first, and the corresponding text is then obtained with the text function. English is case-sensitive, but "Bootstrap" and "bootstrap" should be counted as the same word, so all words are converted to lowercase. The presence of many special symbols such as '\r', '\n' and '\t' makes text formats differ greatly, so the special symbols are filtered out to unify the format. The text also contains many stop words that carry no information, such as the short words "a", "to" and "is", or punctuation marks; these are not needed for text analysis and are therefore removed. Finally, word segmentation splits the sentences in the text into words, which makes the text easier for a machine to process.
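The cleaning steps above (lowercasing, filtering special symbols, removing stop words, word segmentation) might look as follows in Python. This is a minimal sketch, not the patented implementation: the stop-word list, the set of characters kept inside tokens, and the function name `preprocess` are assumptions made for illustration.

```python
import re

# Assumed, deliberately tiny stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "to", "is", "of", "and"}


def preprocess(text):
    """Lowercase, strip special symbols, split into words, drop stop words."""
    text = text.lower()
    # Remove escaped sequences such as the two characters "\r" as well as
    # real carriage-return, newline and tab characters.
    text = re.sub(r"\\[rnt]|[\r\n\t]", " ", text)
    # Keep letters, digits and the characters that appear in file names and
    # versions (. / _ -); everything else becomes a separator.
    text = re.sub(r"[^a-z0-9./_-]+", " ", text)
    # Whitespace segmentation, trimming stray punctuation at token edges.
    tokens = (token.strip("./_-") for token in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]


raw = "\\tBootstrap is a UI Framework\\r\\n jquery-2.2.4.min.js"
tokens = preprocess(raw)
# tokens == ['bootstrap', 'ui', 'framework', 'jquery-2.2.4.min.js']
```

Keeping dots and hyphens inside tokens is a design choice of this sketch so that version strings such as 2.2.4 survive the cleaning.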
The analysis result of the response information is as follows:
http/1.1 200 ok, server: nginx/1.5.8, date: thu, 02 jun 2022 23:05:12 gmt, content-type: text/html, last-modified: thu, 20 jan 2022 04:59:06 gmt, transfer-encoding: chunked, connection: keep-alive, content-encoding: gzip, company name, http-equiv: content-type, content: text/html; charset=utf-8, name: resource-type, content: document, content: width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0, name: viewport, …… 20164-a.b.c: [email protected]: 400-8217-403.
Preprocessing of the component type information includes dividing the component information into 8 types, namely web server (web_server), middleware (web_middleware), web front-end framework (web_frameworks), web front-end language (web_lang), UI library (ui_lib), web application (web_app), container (web_container), and CDN (cdn), which are stored in the form of a dictionary.
For example, the web field contains the detailed component information: web: {web_server: [{version: "1.5.8",name: "Nginx"}],frontend: [{version: "",name: "Bootstrap"}],links: [],ui_lib: [{version: "",name: "jQuery UI"},{version: "2.2.4",name: "jQuery"},{version: "",name: "animate.css"}],robots: "",web_frameworks: [{version: "",name: "animate.css"},{version: "",name: "Bootstrap"},{version: "",name: "jQuery"}],web_lang: [],web_container: []}, and the components are parsed from it. For instance, web_server: [{version: "1.5.8", name: "Nginx"}] is converted to the dictionary form {'web_server': ['nginx/1.5.8']}, and the other components are converted in the same way. The result of the component information conversion in this embodiment is {'web_server': ['nginx/1.5.8'], 'web_middleware': [], 'web_frameworks': ['animate.css', 'bootstrap', 'jquery'], 'web_lang': [], 'web_app': [], 'cdn': [], 'ui_lib': ['jquery', 'jquery/2.2.4', 'animate.css'], 'web_container': []}.
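The conversion of the web field into a per-type dictionary can be sketched as below. The function name `convert_components` and the rule "lowercase name, then append /version when a version is present" are assumptions inferred from the web_server example; the input is an abridged copy of the sample record.

```python
# The 8 component types named in the description.
COMPONENT_TYPES = [
    "web_server", "web_middleware", "web_frameworks", "web_lang",
    "ui_lib", "web_app", "web_container", "cdn",
]


def convert_components(web):
    """Map each component type to a list of 'name' or 'name/version' strings."""
    converted = {}
    for ctype in COMPONENT_TYPES:
        labels = []
        for item in web.get(ctype, []):
            name = item["name"].lower()
            version = item.get("version", "")
            labels.append(f"{name}/{version}" if version else name)
        converted[ctype] = labels  # types absent from the record become []
    return converted


# Abridged "web" field from the sample record (other keys omitted).
web_field = {
    "web_server": [{"version": "1.5.8", "name": "Nginx"}],
    "web_frameworks": [
        {"version": "", "name": "animate.css"},
        {"version": "", "name": "Bootstrap"},
        {"version": "", "name": "jQuery"},
    ],
    "web_lang": [],
    "web_container": [],
}
result = convert_components(web_field)
# result["web_server"] == ['nginx/1.5.8']
```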
Feature conversion: the feature conversion includes feature conversion for response information and feature conversion for tag information.
The TF-IDF method is used to extract the feature vector from the text of the response information. The term frequency (TF), the inverse document frequency (IDF), and the term frequency-inverse document frequency (TF-IDF) are given by formulas (1), (2), and (3), respectively. With these formulas the TF-IDF value of each keyword in each piece of response information can be calculated, and texts can be distinguished by their keywords:
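$$\mathrm{TF} = \frac{n_1}{n_2} \tag{1}$$
$$\mathrm{IDF} = \log\frac{m_1}{m_2} \tag{2}$$
$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF} \tag{3}$$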
where $n_1$ represents the number of times the keyword appears in the web response information, and $n_2$ represents the total number of occurrences of all keywords in the web response information; $m_1$ represents the total number of pieces of web response information in the corpus, and $m_2$ represents the number of pieces of web response information in the corpus that contain the keyword.
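A small Python sketch of the TF-IDF calculation follows. It assumes the common unsmoothed form TF = n1/n2 and IDF = log(m1/m2) with the natural logarithm; the source does not state the logarithm base or any smoothing term, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """corpus: list of token lists. Returns one {keyword: tf-idf} dict per document."""
    m1 = len(corpus)  # total number of documents in the corpus
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # m2 per keyword: documents containing it
    vectors = []
    for tokens in corpus:
        counts = Counter(tokens)
        n2 = sum(counts.values())  # total keyword occurrences in this document
        vectors.append({
            word: (n1 / n2) * math.log(m1 / doc_freq[word])
            for word, n1 in counts.items()
        })
    return vectors


corpus = [
    ["nginx", "jquery", "jquery"],
    ["apache", "jquery"],
    ["nginx", "bootstrap"],
]
vectors = tf_idf(corpus)
# "apache" occurs in 1 of 3 documents and is half of document 1:
# vectors[1]["apache"] == 0.5 * log(3)
```

A keyword that is frequent in one response but rare in the corpus gets a high weight, which is what lets the feature vector separate sites that use different components.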
For the feature conversion of the component information, a component type is first selected, for example one of the 8 component types above, and all component information of that type in the training set, namely component names and version information, is collected. The number of entries after merging and de-duplication is the dimension of the label vector, denoted n. The label information is then converted into a label vector containing only 0 and 1, in which the element at the index of each component present is 1 and all other elements are 0. For example, suppose the web_server entries collected are apache, nginx, nginx/1.5.8, apache, microsoft iis/8.5; after merging and de-duplication the component list is [apache, nginx, nginx/1.5.8, microsoft iis/8.5], and the label dimension is 4. For a piece of label information whose web_server component is {"web_server": ["nginx/1.5.8"]}, the label vector of the web_server type is [0, 0, 1, 0]. For a piece of label information {"web_server": ["apache", "nginx", "nginx/1.5.8"]}, the label vector of the web_server type is [1, 1, 1, 0]. Label vectors for the other component types are converted in the same way.
Model training: to alleviate the overfitting caused by data imbalance, an ensemble learning model is selected for training; for example, the engineered XGBoost ensemble learning model can be used as the sub-model.
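The label vector construction for one component type can be sketched as follows, using the web_server example from the text. The function names `build_component_list` and `to_label_vector` are hypothetical, and ignoring labels that never occurred in the training set is an assumption of this sketch.

```python
def build_component_list(train_labels):
    """Merge and de-duplicate the labels of one component type, keeping first-seen order."""
    component_list = []
    for labels in train_labels:
        for label in labels:
            if label not in component_list:
                component_list.append(label)
    return component_list


def to_label_vector(labels, component_list):
    """Return a 0/1 vector with 1 at the index of every component present."""
    index = {name: i for i, name in enumerate(component_list)}
    vector = [0] * len(component_list)
    for label in labels:
        if label in index:  # labels unseen in training are ignored (assumption)
            vector[index[label]] = 1
    return vector


# web_server labels collected from the training set, as in the example.
train_web_server = [["apache"], ["nginx"], ["nginx/1.5.8"], ["apache"], ["microsoft iis/8.5"]]
components = build_component_list(train_web_server)
# components == ['apache', 'nginx', 'nginx/1.5.8', 'microsoft iis/8.5']

single = to_label_vector(["nginx/1.5.8"], components)                   # [0, 0, 1, 0]
multi = to_label_vector(["apache", "nginx", "nginx/1.5.8"], components)  # [1, 1, 1, 0]
```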
Optionally, the sub-model is built based on an ensemble learning model comprising a plurality of tree models, and the input of the latter tree model comprises the error of the former tree model in training the sub-model.
Specifically, the XGBoost model accumulates a plurality of weak classifiers into a strong classifier: the error of the first weak classifier in fitting the sample data is used as the input of the second weak classifier, which continues the fitting, and so on. Iteration reduces the fitting error, so the prediction accuracy of the classifier, and hence of the final recognition model, is higher. The training process is shown in fig. 2.
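The idea that each weak learner fits the error left by the previous ones can be illustrated with a deliberately simplified boosting loop. This is not XGBoost itself: it uses one-dimensional regression stumps, squared error, and no regularization, and the names `fit_stump` and `boost` as well as the learning rate are assumptions made for the illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (least squared error)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:
            continue  # a split must leave samples on both sides
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lv, rv)
    if best is None:  # all x values equal: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, lv, rv = best
    return lambda x: lv if x <= threshold else rv


def boost(xs, ys, rounds=5, learning_rate=0.5):
    """Add stumps one by one; each new stump is fitted to the current residuals."""
    trees, preds, history = [], [0.0] * len(xs), []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # error of the ensemble so far
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)))
    return trees, history


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
trees, history = boost(xs, ys)
# history holds the squared error after each round; it shrinks every round.
```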
The feature vectors of the response information in the training set and the label vectors of the corresponding component type are input into an XGBoost model for training. The XGBoost model comprises a plurality of tree models (for example, k trees); the symbol shown in Figure SMS_23 denotes the data to be fitted and the symbol shown in Figure SMS_24 denotes each tree, and the weak classifiers are finally accumulated into a strong classifier. The XGBoost model is used to construct the parallel recognition model for training and subsequent model evaluation. The input of one XGBoost model is the set of feature vectors of the response information and the set of label vectors of one component type; since the components are split into 8 types, the 8 sub-models of the parallel recognition model are trained simultaneously.
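The parallel structure, one sub-model per component type trained on the same feature vectors, can be sketched as follows. To keep the example self-contained, a trivial nearest-neighbour class stands in for the XGBoost sub-model; that class, the function names, and the use of a thread pool to train the branches at the same time are all assumptions of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


class NearestNeighbourSubModel:
    """Placeholder for one branch; the description uses an XGBoost model here."""

    def fit(self, features, label_vectors):
        self.features, self.label_vectors = features, label_vectors
        return self

    def predict(self, feature):
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        nearest = min(range(len(self.features)),
                      key=lambda i: dist(self.features[i], feature))
        return self.label_vectors[nearest]


def train_parallel_model(features, labels_by_type, make_model=NearestNeighbourSubModel):
    """Train one sub-model per component type, all on the same feature vectors."""
    with ThreadPoolExecutor() as pool:
        futures = {
            ctype: pool.submit(make_model().fit, features, vectors)
            for ctype, vectors in labels_by_type.items()
        }
        return {ctype: future.result() for ctype, future in futures.items()}


def predict_all(models, feature):
    """Run every branch and collect one label vector per component type."""
    return {ctype: model.predict(feature) for ctype, model in models.items()}


# Two toy responses and label vectors for two of the eight component types.
features = [[1.0, 0.0], [0.0, 1.0]]
labels_by_type = {
    "web_server": [[1, 0], [0, 1]],
    "cdn": [[0], [1]],
}
models = train_parallel_model(features, labels_by_type)
prediction = predict_all(models, [0.9, 0.1])
# prediction == {'web_server': [1, 0], 'cdn': [0]}
```

Because the branches share inputs but not labels, they are independent of each other and can be trained or replaced one at a time.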
Model prediction: data preprocessing and feature conversion are performed on the response information and the label information of the test set; the feature vector of the response information and the label vector of a given component type are input into the corresponding branch for prediction; the prediction result of each branch is compared with the label vector to calculate the accuracy of that branch; and the prediction results of the whole parallel structure model are then combined to calculate the overall accuracy.
Assume that the prediction result is {'web_server': ['nginx/1.5.8'], 'web_middleware': [], 'web_frameworks': ['bootstrap', 'jquery'], 'web_lang': [], 'web_app': [], 'cdn': [], 'ui_lib': ['jquery'], 'web_container': []}.
The actual labels are {'web_server': ['nginx/1.5.8'], 'web_middleware': [], 'web_frameworks': ['animate.css', 'bootstrap', 'jquery'], 'web_lang': [], 'web_app': [], 'cdn': [], 'ui_lib': ['jquery', 'jquery/2.2.4', 'animate.css'], 'web_container': []}.
Comparing the prediction result with the real component information shows that the parallel structure model constructed in the embodiment of the invention can predict component information of different component types and can also predict several pieces of component information within one component type, so the method of the embodiment of the invention can predict more comprehensive component information.
Based on the test results, the accuracy of each single parallel branch is as follows: 96.31% of web server, 98.7% of web middleware, 95.35% of web frames, 96.05% of web language, 99.96% of web app, 99.95% of cdn, 99.95% of ui_lib, 95.94% of web container, and 99.86% of web container.
Based on the test results, the narrow-sense accuracy of the model is 89.2%, where a prediction counts as correct only when the predicted results for all 8 component types of a piece of text information are exactly the same as the real data. The generalized accuracy of the model is 91.81%, which also counts as correct the case where the predicted results for the 8 component types are incomplete but every predicted item is correct, that is, appears in the real label.
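The two accuracy notions can be made concrete with a short Python sketch. It reads "narrow" as an exact match on all 8 component types and "generalized" as every predicted item appearing in the real label for every type; the second reading is an interpretation of the text, and the function names are hypothetical. The dictionaries are the predicted and actual results from the example above.

```python
def narrow_correct(pred, actual):
    """Exact match: every component type has exactly the real set of labels."""
    return all(set(pred[t]) == set(actual[t]) for t in actual)


def generalized_correct(pred, actual):
    """Possibly incomplete, but nothing predicted is wrong (assumed reading)."""
    return all(set(pred[t]) <= set(actual[t]) for t in actual)


def accuracy(preds, actuals, is_correct):
    """Fraction of samples judged correct by the given criterion."""
    return sum(is_correct(p, a) for p, a in zip(preds, actuals)) / len(actuals)


predicted = {
    "web_server": ["nginx/1.5.8"], "web_middleware": [],
    "web_frameworks": ["bootstrap", "jquery"], "web_lang": [],
    "web_app": [], "cdn": [], "ui_lib": ["jquery"], "web_container": [],
}
actual = {
    "web_server": ["nginx/1.5.8"], "web_middleware": [],
    "web_frameworks": ["animate.css", "bootstrap", "jquery"], "web_lang": [],
    "web_app": [], "cdn": [], "ui_lib": ["jquery", "jquery/2.2.4", "animate.css"],
    "web_container": [],
}

narrow = accuracy([predicted], [actual], narrow_correct)            # 0.0
generalized = accuracy([predicted], [actual], generalized_correct)  # 1.0
```

The example prediction misses animate.css and jquery/2.2.4, so it fails the narrow criterion but passes the generalized one, which is why the generalized accuracy is never lower than the narrow one.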
Through the example and the web component identification result, the method provided by the embodiment of the invention can comprehensively and accurately identify the web component information.
The web page web component recognition device based on the parallel structure provided by the invention is described below, and the web page web component recognition device based on the parallel structure described below and the web page web component recognition method based on the parallel structure described above can be correspondingly referred to each other.
Fig. 4 is a schematic structural diagram of a web component recognition device based on a parallel structure. As shown in fig. 4, the web component recognition device based on the parallel structure provided in this embodiment includes:
an obtaining module 210, configured to obtain web response information to be identified;
the processing module 220 is configured to perform feature conversion on the web response information to obtain a feature vector;
the processing module 220 is further configured to determine, based on the feature vector and the recognition model, a web component included in the web response information;
the recognition model comprises a plurality of sub-models, training data comprises web response information and label information of a plurality of component types, each sub-model is obtained by training based on the web response information and the label information of the component type corresponding to each sub-model, and the label information is web component information.
Optionally, the processing module 220 is specifically configured to:
extracting keywords of the web response information;
determining a feature vector of the web response information according to the number of times of occurrence of the keyword in the web response information and the number of web response information including the keyword in a corpus; the corpus is a corpus of web response information comprising the web response information to be identified and training data.
Optionally, the processing module 220 is further configured to:
performing webpage analysis processing on the web response information to obtain text information;
preprocessing the text information to obtain preprocessed text information;
and carrying out word segmentation on the preprocessed text information.
Optionally, the processing module 220 is specifically configured to perform at least one of the following:
converting English capital characters in the text information into lowercase characters;
carrying out unified processing on the format of the text information;
and eliminating stop words in the text information.
Optionally, the processing module 220 is further configured to:
splitting the label information into component information corresponding to at least one component type for any one of the label information;
for any component type, performing de-duplication processing on the component information corresponding to the component type to obtain the label vector dimension and the component list of the component type;
and obtaining the label vector corresponding to the component type in each piece of label information according to the label vector dimension corresponding to the component type in each piece of label information and the component list.
Optionally, the processing module 220 is further configured to:
for any sub-model, inputting a feature vector corresponding to web response information in training data and a label vector of a component type corresponding to the sub-model into the sub-model for training; the number of sub-models is the same as the number of component types.
Optionally, the sub-model is built based on an ensemble learning model.
The device of the embodiment of the present invention is configured to perform the method of any of the foregoing method embodiments, and its implementation principle and technical effects are similar, and are not described in detail herein.
Fig. 5 illustrates a physical schematic diagram of an electronic device, as shown in fig. 5, which may include: processor 810, communication interface (Communications Interface) 820, memory 830, and communication bus 840, wherein processor 810, communication interface 820, memory 830 accomplish communication with each other through communication bus 840. Processor 810 may invoke logic instructions in memory 830 to perform a web page web component identification method based on a parallel architecture, the method comprising: acquiring web response information to be identified;
performing feature conversion on the web response information to obtain feature vectors;
determining web components included in the web response information based on the feature vectors and the recognition model;
the recognition model comprises a plurality of sub-models, training data comprises web response information and label information of a plurality of component types, each sub-model is obtained by training based on the web response information and the label information of the component type corresponding to each sub-model, and the label information is web component information.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, where the computer program when executed by a processor can perform a method for identifying web components of web pages based on parallel structures provided by the above methods, where the method includes: acquiring web response information to be identified;
performing feature conversion on the web response information to obtain feature vectors;
determining web components included in the web response information based on the feature vectors and the recognition model;
the recognition model comprises a plurality of sub-models, training data comprises web response information and label information of a plurality of component types, each sub-model is obtained by training based on the web response information and the label information of the component type corresponding to each sub-model, and the label information is web component information.
In yet another aspect, the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the method for identifying web components of web pages based on parallel structures provided by the above methods, the method comprising: acquiring web response information to be identified;
performing feature conversion on the web response information to obtain feature vectors;
determining web components included in the web response information based on the feature vectors and the recognition model;
the recognition model comprises a plurality of sub-models, training data comprises web response information and label information of a plurality of component types, each sub-model is obtained by training based on the web response information and the label information of the component type corresponding to each sub-model, and the label information is web component information.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. The web page component identification method based on the parallel structure is characterized by comprising the following steps of:
acquiring web response information to be identified;
performing feature conversion on the web response information to obtain feature vectors;
determining web components included in the web response information based on the feature vectors and the recognition model;
the recognition model comprises a plurality of sub-models, training data comprises web response information and label information of a plurality of component types, each sub-model is obtained by training based on the web response information and the label information of the component type corresponding to each sub-model, and the label information is web component information.
2. The method for identifying web components of web pages based on parallel structures according to claim 1, wherein the step of performing feature conversion on the web response information to obtain feature vectors comprises the steps of:
extracting keywords of the web response information;
determining a feature vector of the web response information according to the number of times of occurrence of the keyword in the web response information and the number of web response information including the keyword in a corpus; the corpus is a corpus of web response information comprising the web response information to be identified and training data.
3. The method for identifying web components of web pages based on parallel structures according to claim 2, wherein before extracting the keywords of the web response information, further comprises:
performing webpage analysis processing on the web response information to obtain text information;
preprocessing the text information to obtain preprocessed text information;
and carrying out word segmentation on the preprocessed text information.
4. A method of identifying web components of a web page based on a parallel structure as claimed in claim 3 wherein preprocessing the text information comprises at least one of:
converting English capital characters in the text information into lowercase characters;
carrying out unified processing on the format of the text information;
and eliminating stop words in the text information.
5. The parallel structure based web page web component recognition method of any one of claims 1-4, wherein prior to training the recognition model, the method further comprises:
splitting the label information into component information corresponding to at least one component type for any one of the label information;
performing de-duplication processing on the component information corresponding to any component type to obtain the label vector dimension and the component list of the component type;
and obtaining the label vector corresponding to the component type in each piece of label information according to the label vector dimension corresponding to the component type in each piece of label information and the component list.
6. The parallel structure based web page web component identification method of any one of claims 1-4, wherein the method further comprises:
for any sub-model, inputting a feature vector corresponding to web response information in training data and a label vector of a component type corresponding to the sub-model into the sub-model for training; the number of sub-models is the same as the number of component types.
7. The parallel structure based web component identification method of any one of claims 1-4, wherein the sub-model is built based on an ensemble learning model.
8. A web page web component recognition device based on a parallel structure, comprising:
the acquisition module is used for acquiring web response information to be identified;
the processing module is used for carrying out feature conversion on the web response information to obtain feature vectors;
the processing module is further used for determining a web component included in the web response information based on the feature vector and the recognition model;
the recognition model comprises a plurality of sub-models, training data comprises web response information and label information of a plurality of component types, each sub-model is obtained by training based on the web response information and the label information of the component type corresponding to each sub-model, and the label information is web component information.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the web page web component recognition method based on a parallel architecture as claimed in any one of claims 1 to 7 when the program is executed by the processor.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the web page web component recognition method based on a parallel structure as claimed in any one of claims 1 to 7.
CN202310419786.XA 2023-04-19 2023-04-19 Webpage web component identification method and device based on parallel structure Active CN116127236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310419786.XA CN116127236B (en) 2023-04-19 2023-04-19 Webpage web component identification method and device based on parallel structure


Publications (2)

Publication Number Publication Date
CN116127236A true CN116127236A (en) 2023-05-16
CN116127236B CN116127236B (en) 2023-07-21

Family

ID=86312196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310419786.XA Active CN116127236B (en) 2023-04-19 2023-04-19 Webpage web component identification method and device based on parallel structure

Country Status (1)

Country Link
CN (1) CN116127236B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107294993A (en) * 2017-07-05 2017-10-24 重庆邮电大学 A kind of WEB abnormal flow monitoring methods based on integrated study
US20210203575A1 (en) * 2019-12-30 2021-07-01 Armis Security Ltd. System and method for determining device attributes using a classifier hierarchy
CN113806667A (en) * 2021-09-26 2021-12-17 上海交通大学 Method and system for supporting webpage classification
CN114528457A (en) * 2021-12-31 2022-05-24 北京邮电大学 Web fingerprint detection method and related equipment
CN115130038A (en) * 2022-06-17 2022-09-30 奇安信科技集团股份有限公司 Webpage classification method and device
CN115618291A (en) * 2022-10-14 2023-01-17 吉林省吉林祥云信息技术有限公司 Method, system, equipment and storage medium for identifying web fingerprint based on Transformer


Also Published As

Publication number Publication date
CN116127236B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
US11170179B2 (en) Systems and methods for natural language processing of structured documents
EP3819785A1 (en) Feature word determining method, apparatus, and server
US20220197923A1 (en) Apparatus and method for building big data on unstructured cyber threat information and method for analyzing unstructured cyber threat information
CN110263009B (en) Method, device and equipment for generating log classification rule and readable storage medium
CN111177532A (en) Vertical search method, device, computer system and readable storage medium
CN108228875B (en) Log analysis method and device based on perfect hash
CN110580308A (en) information auditing method and device, electronic equipment and storage medium
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN115544240B (en) Text sensitive information identification method and device, electronic equipment and storage medium
CN113986864A (en) Log data processing method and device, electronic equipment and storage medium
CN115130038A (en) Webpage classification method and device
CN111191469A (en) Large-scale corpus cleaning and aligning method and device
CN114743012B (en) Text recognition method and device
CN116127236B (en) Webpage web component identification method and device based on parallel structure
CN113688240B (en) Threat element extraction method, threat element extraction device, threat element extraction equipment and storage medium
CN110232328A (en) A kind of reference report analytic method, device and computer readable storage medium
CN115906797A (en) Text entity alignment method, device, equipment and medium
CN115455416A (en) Malicious code detection method and device, electronic equipment and storage medium
CN115577082A (en) Document keyword extraction method and device, electronic equipment and storage medium
CN115563985A (en) Statement analysis method, statement analysis device, statement analysis apparatus, storage medium, and program product
CN109597879B (en) Service behavior relation extraction method and device based on 'citation relation' data
CN114970531A (en) Intention identification and named entity extraction method and device based on instant messaging message
CN114021064A (en) Website classification method, device, equipment and storage medium
CN117973402B (en) Text conversion preprocessing method and device, storage medium and electronic equipment
CN114519357B (en) Natural language processing method and system based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant