CN108304483A - Web page classification method, apparatus, and device - Google Patents

Web page classification method, apparatus, and device

Info

Publication number
CN108304483A
Authority
CN
China
Prior art keywords
target webpage
web page
interface
web
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711481103.4A
Other languages
Chinese (zh)
Other versions
CN108304483B (en)
Inventor
邹荣珠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN201711481103.4A priority Critical patent/CN108304483B/en
Publication of CN108304483A publication Critical patent/CN108304483A/en
Application granted granted Critical
Publication of CN108304483B publication Critical patent/CN108304483B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An embodiment of the invention discloses a web page classification method, apparatus, and device. The method includes: performing web page analysis on a target web page to obtain the target web page elements on the target web page and the web page data corresponding to each target web page element; calling a feature extraction interface to perform feature extraction on the web page data corresponding to a target web page element, obtaining the features corresponding to that element; calling a feature vector generation interface to generate the feature vector of the target web page element from the extracted features; and calling a classification algorithm interface to classify the target web page according to the feature vectors of the target web page elements, obtaining the classification result of the target web page. After the web page analysis result is obtained, the invention continues on to web page classification by calling interfaces and finally obtains the classification result, without triggering and scheduling two separate sets of program code in succession, which raises the degree of automation of web page classification.

Description

Web page classification method, apparatus, and device
Technical field
The present invention relates to the technical field of data processing, and more specifically to a web page classification method, apparatus, and device.
Background
With the spread of the Internet, web-based network attacks and malicious behavior keep increasing and seriously threaten the security of users' network access. Common malicious website behavior includes homepage modification, Trojan injection, phishing, automatic pop-ups, and malicious redirects. By analyzing the behavioral characteristics of such malicious websites, researchers have gone on to study web page classification with machine learning methods, for example classifying web pages with a trained classification model in order to identify and block malicious web pages.
Web page classification requires analyzing the web page in advance to extract specific content, which is then used for classification. For example, to classify a target web page with a trained classification model, the target web page is first analyzed to extract its specific content; once the analysis is complete, the trained classification model processes the extracted content and finally yields the classification result of the target web page.
In the prior art, however, the web page analysis process and the process of classifying the web page from the analysis result are implemented by mutually independent program code. To classify a target web page, the web page analysis program code must first be scheduled to analyze the target web page and, after the analysis result is obtained, the web page classification program code must then be scheduled to classify based on that result. Classifying a web page therefore requires triggering and scheduling two independent sets of program code in succession, which makes the process complicated and its degree of automation low.
Summary of the invention
In view of this, the present invention provides a web page classification method, apparatus, and device.
To achieve the above object, in a first aspect, the present invention provides a web page classification method, the method including:
performing web page analysis on a target web page to obtain the target web page elements on the target web page and the web page data corresponding to the target web page elements;
calling a feature extraction interface to perform feature extraction on the web page data corresponding to the target web page element, obtaining the features corresponding to the target web page element;
calling a feature vector generation interface to generate the feature vector of the target web page element according to the extracted features corresponding to the target web page element;
calling a classification algorithm interface to classify the target web page according to the feature vector of each target web page element, obtaining the classification result of the target web page.
Optionally, performing web page analysis on the target web page to obtain the target web page elements on the target web page and the web page data corresponding to the target web page elements includes:
matching the web page data of the target web page against preset screening conditions, determining each successfully matched preset screening condition as a target condition, and obtaining the web page data corresponding to the target condition;
determining the target web page element corresponding to the target condition according to the correspondence between the preset screening conditions and the web page elements on the target web page;
determining the web page data corresponding to the target web page element on the target web page according to the web page data and the target web page element that each correspond to the target condition.
Optionally, the preset screening conditions include regular expressions that describe preset web page elements.
Optionally, before the feature extraction interface is called for the first time, the method further includes: initializing the function corresponding to the feature extraction interface;
and/or,
before the feature vector generation interface is called for the first time, the method further includes: initializing the function corresponding to the feature vector generation interface;
and/or,
before the classification algorithm interface is called for the first time, the method further includes: initializing the function corresponding to the classification algorithm interface;
wherein the initialization includes parameter configuration and requesting resources.
Optionally, after the classification result of the target web page is obtained, the method further includes: releasing the requested resources.
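The initialize-before-first-call and release-after-classification behavior described above can be sketched as follows. This is only an illustration under assumptions: the class name FeatureExtractor, its methods, and the buffer_size parameter are hypothetical and do not come from the patent text.

```python
class FeatureExtractor:
    """Hypothetical function behind a feature extraction interface."""

    def __init__(self):
        self.initialized = False
        self.config = {}
        self.buffer = None

    def initialize(self, **params):
        # Initialization = parameter configuration + requesting resources.
        self.config = dict(params)
        self.buffer = bytearray(params.get("buffer_size", 1024))
        self.initialized = True

    def extract(self, web_data: str) -> dict:
        # Initialize lazily, so it happens before the first call only.
        if not self.initialized:
            self.initialize()
        return {"length": len(web_data)}

    def release(self):
        # Release the requested resources once classification is done.
        self.buffer = None
        self.initialized = False


extractor = FeatureExtractor()
features = extractor.extract("<title>Daily News</title>")
was_initialized = extractor.initialized
extractor.release()
```

The same pattern would apply to the functions behind the feature vector generation and classification algorithm interfaces.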
Optionally, the method further includes:
calling a classification model training interface to train a pre-established classification model with the feature vector of each target web page element, obtaining a trained classification model;
correspondingly, calling the classification algorithm interface to classify the target web page according to the feature vector of each target web page element, obtaining the classification result of the target web page, includes:
calling the classification algorithm interface, using the feature vector of each target web page element as the input parameter of the trained classification model, and taking the output parameter obtained after processing by the trained classification model as the classification result of the target web page.
In a second aspect, the present invention further provides a web page classification apparatus, the apparatus including:
a web page analysis module, configured to perform web page analysis on a target web page to obtain the target web page elements on the target web page and the web page data corresponding to the target web page elements;
a feature extraction module, configured to call a feature extraction interface to perform feature extraction on the web page data corresponding to the target web page element, obtaining the features corresponding to the target web page element;
a vector generation module, configured to call a feature vector generation interface to generate the feature vector of the target web page element according to the extracted features corresponding to the target web page element;
a classification module, configured to call a classification algorithm interface to classify the target web page according to the feature vector of each target web page element, obtaining the classification result of the target web page.
Optionally, the web page analysis module includes:
a matching submodule, configured to match the web page data of the target web page against preset screening conditions, determine each successfully matched preset screening condition as a target condition, and obtain the web page data corresponding to the target condition;
a first determination submodule, configured to determine the target web page element corresponding to the target condition according to the correspondence between the preset screening conditions and the web page elements on the target web page;
a second determination submodule, configured to determine the web page data corresponding to the target web page element on the target web page according to the web page data and the target web page element that each correspond to the target condition.
Optionally, the preset screening conditions include regular expressions that describe preset web page elements.
Optionally, the apparatus further includes:
an initialization module, configured to initialize the function corresponding to the feature extraction interface before the feature extraction interface is called for the first time; and/or to initialize the function corresponding to the feature vector generation interface before the feature vector generation interface is called for the first time; and/or to initialize the function corresponding to the classification algorithm interface before the classification algorithm interface is called for the first time; wherein the initialization includes parameter configuration and requesting resources.
Optionally, the apparatus further includes:
a release module, configured to release the requested resources after the classification result of the target web page is obtained.
Optionally, the apparatus further includes:
a model training module, configured to call a classification model training interface to train a pre-established classification model with the feature vector of each target web page element, obtaining a trained classification model;
correspondingly, the classification module is specifically configured to:
call the classification algorithm interface, use the feature vector of each target web page element as the input parameter of the trained classification model, and take the output parameter obtained after processing by the trained classification model as the classification result of the target web page.
In a third aspect, the present invention further provides a computer-readable storage medium storing instructions which, when run on a terminal device, cause the terminal device to execute the above web page classification method.
In a fourth aspect, the present invention further provides a web page classification device, the device including a memory and a processor,
the memory being configured to store program code and to transfer the program code to the processor;
the processor being configured to run the program code, wherein the program code, when run, executes the above web page classification method.
In the web page classification method provided by the invention, web page analysis is first performed on the target web page to obtain its target web page elements and the web page data corresponding to each target web page element, which serve as the basis for classification. Second, the feature extraction interface is called to perform feature extraction on the web page data corresponding to a target web page element, obtaining the features corresponding to that element. Third, the feature vector generation interface is called to generate the feature vector of the target web page element from the extracted features. Finally, the classification algorithm interface is called to classify the target web page according to the feature vector of each target web page element, obtaining the classification result of the target web page. After the web page analysis result is obtained, the invention continues on to web page classification by calling interfaces and finally obtains the classification result. Compared with the prior art, the invention does not need to trigger and schedule two mutually independent sets of program code in succession, which raises the degree of automation of web page classification.
Description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a web page classification method provided by an embodiment of the present invention;
Fig. 2 is a flow chart of a web page analysis method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of a tree structure provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of a compiler provided by an embodiment of the present invention;
Fig. 5 is a flow chart of a compiler-based web page classification method provided by an embodiment of the present invention;
Fig. 6 is a structural schematic diagram of a web page classification apparatus provided by an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of a web page classification device provided by an embodiment of the present invention.
The terms "first", "second", "third", "fourth", and the like (if present) in the specification, the claims, and the above drawings are used to distinguish similar parts and not to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than the one illustrated here.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
At present, web page classification is implemented by invoking web page analysis program code and web page classification program code one after the other. Specifically, when a target web page needs to be classified, the web page analysis program code is scheduled first to analyze the target web page and obtain the web page analysis result; the web page classification program code is then scheduled to classify the target web page based on that result, finally yielding the classification result of the target web page. This process can only classify a web page by scheduling two mutually independent sets of program code in succession, so its degree of automation is low.
To raise the degree of automation of web page classification, the present invention provides a web page classification method in which, after web page analysis of the target web page yields the analysis result, the target web page is classified based on that result by calling interfaces. The invention therefore does not need to trigger two mutually independent sets of program code in succession; instead, it merges the functions that the web page analysis program code and the web page classification program code implement separately in the prior art, and completes the classification of the target web page automatically.
Specifically, referring to Fig. 1, which is a flow chart of a web page classification method provided by an embodiment of the present application, the method may include:
S101: Perform web page analysis on the target web page to obtain the target web page elements on the target web page and the web page data corresponding to the target web page elements.
Web page elements may include the text, pictures, audio, animation, video, and so on of the target web page; for example, a web page element may be the title tag or the keywords tag of the web page. A target web page element is one or more of the web page elements of the target web page.
The web page data corresponding to a target web page element is the data matching range of that element on the target web page, or alternatively the specific data values of that element. For example, if the target web page element is the title tag, its corresponding web page data is the data matching range of the title tag on the target web page; this range might be the 1st to the 10th byte and the 20th to the 23rd byte of the target web page.
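The two forms of web page data mentioned above (a data matching range versus the concrete data values) can be related by a small helper. The following is a minimal sketch, assuming byte ranges are 1-based and inclusive as in the example; the function name and the sample page are hypothetical.

```python
from typing import List, Tuple


def slice_ranges(page: bytes, ranges: List[Tuple[int, int]]) -> List[bytes]:
    """Return the concrete bytes covered by each 1-based, inclusive (start, end) range."""
    return [page[start - 1:end] for start, end in ranges]


# Hypothetical page with two title occurrences and one paragraph in between.
page = b"<title>A</title><p>x</p><title>B</title>"

# Data matching ranges of the title tag: bytes 1-16 and bytes 25-40.
title_ranges = [(1, 16), (25, 40)]

# Concrete data values covered by those ranges.
title_values = slice_ranges(page, title_ranges)
```

Storing only the ranges keeps the analysis result small, while the values can be recovered from the page on demand.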
In the embodiment of the present invention, once the target web page has been chosen as the object of classification, web page analysis is first performed on it to obtain the target web page analysis result, which includes the target web page elements found on the target web page and the web page data corresponding to each target web page element.
For example, when classification is needed to decide whether the target web page is a news page, web page analysis of the target web page can yield the title tag on the page and the web page data corresponding to the title tag. The target web page can subsequently be classified based on the title tag and its corresponding web page data to decide whether it is a news page.
S102: Call the feature extraction interface to perform feature extraction on the web page data corresponding to the target web page element, obtaining the features corresponding to the target web page element.
In the embodiment of the present invention, the feature extraction interface is set up in advance. After web page analysis of the target web page yields the target web page elements and their corresponding web page data, the feature extraction interface is called to run its corresponding feature extraction function, which extracts features from the web page data corresponding to a target web page element and yields the features corresponding to that element.
S103: Call the feature vector generation interface to generate the feature vector of the target web page element according to the extracted features corresponding to the target web page element.
In the embodiment of the present invention, the feature vector generation interface is set up in advance. After the features corresponding to a target web page element have been extracted, the feature vector generation interface is called to run its corresponding feature vector generation function, which generates the feature vector of the target web page element from the extracted features. Specifically, generating a feature vector consists of normalizing the corresponding features and combining them in a predetermined format.
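The normalize-and-combine step of S103 can be illustrated with a short sketch. The text does not specify the normalization or the predetermined format, so the feature names, the fixed ordering, and the divide-by-maximum scaling below are assumptions made for illustration only.

```python
from typing import Dict, List

# Assumed "predetermined format": a fixed feature order (hypothetical names).
FEATURE_ORDER = ("length", "digit_count", "uppercase_ratio")

# Assumed upper bounds used to scale each feature into [0, 1].
FEATURE_MAX = {"length": 200.0, "digit_count": 20.0, "uppercase_ratio": 1.0}


def generate_feature_vector(features: Dict[str, float]) -> List[float]:
    """Normalize each feature and combine the results in the fixed order."""
    vector = []
    for name in FEATURE_ORDER:
        value = features.get(name, 0.0)
        scaled = value / FEATURE_MAX[name]
        # Clamp so that out-of-range values still land in [0, 1].
        vector.append(min(max(scaled, 0.0), 1.0))
    return vector


title_vector = generate_feature_vector(
    {"length": 50, "digit_count": 5, "uppercase_ratio": 0.25}
)
```

A fixed ordering matters because the classifier downstream interprets each position of the vector as the same feature for every page.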
It is worth noting that a target web page element may have multiple data matching ranges on the target web page. For example, the title tag of the target web page may have multiple data matching ranges, meaning that the page contains multiple pieces of title content. In one implementation, the embodiment of the present invention generates the feature vector of the title tag only after extracting the features of all title content corresponding to the title tag.
In practical applications, the target web page may include multiple target web page elements. The embodiment of the present invention calls the feature vector generation interface multiple times to generate a feature vector for each target web page element, where each feature vector characterizes the features of its corresponding target web page element on the target web page.
S104: Call the classification algorithm interface to classify the target web page according to the feature vector of each target web page element, obtaining the classification result of the target web page.
In the embodiment of the present invention, the classification algorithm interface is set up in advance. After the feature vector of each target web page element has been generated, the classification algorithm interface is called to run its corresponding classification algorithm function, which classifies the target web page according to the feature vectors of the target web page elements.
In practical applications, the target web page is classified by using the feature vector of each target web page element on the page as the input parameter of a trained classification model and taking the output parameter obtained after processing by the model as the classification result of the target web page.
In addition, the embodiment of the present invention can also use the feature vectors of the target web page elements on the target web page to train a pre-established classification model. Specifically, the classification model training interface is called to run its corresponding model training function, which trains the pre-established classification model with the feature vector of each target web page element and yields the trained classification model.
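The training and classification interfaces can be pictured with a toy model. The patent does not name a classification algorithm, so the nearest-centroid classifier below is a stand-in chosen for brevity, and all function names, labels, and sample vectors are hypothetical.

```python
from typing import Dict, List, Tuple


def concat_vectors(vectors_by_element: Dict[str, List[float]]) -> List[float]:
    """Combine per-element feature vectors into one model input (sorted by element name)."""
    return [x for name in sorted(vectors_by_element) for x in vectors_by_element[name]]


def train_model(samples: List[Tuple[List[float], str]]) -> Dict[str, List[float]]:
    """Stand-in training interface: compute one centroid per class label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vector, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, x in enumerate(vector):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def classify(model: Dict[str, List[float]], vector: List[float]) -> str:
    """Stand-in classification interface: return the label of the closest centroid."""
    def squared_distance(centroid: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(centroid, vector))
    return min(model, key=lambda label: squared_distance(model[label]))


training_samples = [
    ([0.9, 0.8], "malicious"), ([0.8, 0.9], "malicious"),
    ([0.1, 0.2], "benign"), ([0.2, 0.1], "benign"),
]
model = train_model(training_samples)

page_input = concat_vectors({"title_tag": [0.9], "keywords_tag": [0.7]})
result = classify(model, page_input)
```

Any model that maps a fixed-length vector to a label could take the place of the centroid computation here.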
In the embodiment of the present invention, classifying the target web page makes it possible to identify whether the page is malicious, so that attacks from malicious web pages can be blocked in time. Classification can also identify whether the target web page belongs to a particular category, such as news pages.
In the web page classification method provided by the embodiment of the present invention, web page analysis is first performed on the target web page to obtain its target web page elements and the web page data corresponding to each target web page element, which serve as the basis for classification. Second, the feature extraction interface is called to perform feature extraction on the web page data corresponding to a target web page element, obtaining the features corresponding to that element. Third, the feature vector generation interface is called to generate the feature vector of the target web page element from the extracted features. Finally, the classification algorithm interface is called to classify the target web page according to the feature vector of each target web page element, obtaining the classification result of the target web page. After the web page analysis result is obtained, the invention continues on to web page classification by calling interfaces and finally obtains the classification result. Compared with the prior art, the invention does not need to trigger and schedule two mutually independent sets of program code in succession, which raises the degree of automation of web page classification.
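The overall flow S101 to S104 can be sketched as a single program in which analysis hands its result directly to three interface calls. Everything below is an illustrative assumption: the function names, the two features, the scaling constants, and the threshold rule that stands in for a trained model are not taken from the patent.

```python
import re
from typing import Dict, List

TITLE_RE = re.compile(r"<title>([^<]*)</title>", re.IGNORECASE)


def analyze(page: str) -> Dict[str, List[str]]:
    """S101: find target elements and their web page data (title content only here)."""
    titles = TITLE_RE.findall(page)
    return {"title_tag": titles} if titles else {}


def extract_features(web_data: List[str]) -> Dict[str, float]:
    """S102: hypothetical feature extraction interface."""
    text = " ".join(web_data)
    return {"length": float(len(text)), "word_count": float(len(text.split()))}


def generate_vector(features: Dict[str, float]) -> List[float]:
    """S103: hypothetical feature vector generation interface (scale, fixed order)."""
    return [min(features["length"] / 100.0, 1.0),
            min(features["word_count"] / 20.0, 1.0)]


def classify(vectors: Dict[str, List[float]]) -> str:
    """S104: toy rule standing in for a trained classification model."""
    score = sum(sum(v) for v in vectors.values())
    return "news" if score > 0.2 else "other"


def classify_page(page: str) -> str:
    """Run analysis and classification in one pass, with no second program to schedule."""
    elements = analyze(page)
    vectors = {name: generate_vector(extract_features(data))
               for name, data in elements.items()}
    return classify(vectors)


label = classify_page("<title>Daily News: markets rally</title>")
```

The point of the sketch is the call chain in classify_page: the analysis result never leaves the process, which is the automation gain the text describes.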
To classify a web page, the page must first be analyzed, and the classification is then carried out on the basis of the analysis result. The web page analysis process is described in this embodiment with reference to Fig. 2, which is a flow chart of a web page analysis method provided by an embodiment of the present invention. Corresponding to the specific implementation of S101, the method includes:
S201: Match the web page data of the target web page against the preset screening conditions, determine each successfully matched preset screening condition as a target condition, and obtain the web page data corresponding to the target condition.
In the embodiment of the present invention, the web page data matched against the preset screening conditions may be all of the web page data of the target web page or only part of it. The web page data used for web page analysis can be chosen according to actual needs, and this application places no limit on how the web page data of the target web page is obtained.
The target web page contains web page elements such as the title tag and the keywords tag. According to the classification requirement, the embodiment of the present invention selects one or more web page elements in advance as preset web page elements and configures a preset screening condition for each of them to describe it. For example, if the classification requirement is to identify whether the target web page is a news page, the title tag can be selected as a preset web page element and a preset screening condition describing the title tag can be configured; whether the target web page is a news page can then be identified from the analysis result for the title tag. When the classification requirement changes, the preset screening conditions can be changed flexibly to match.
In the embodiment of the present invention, a preset screening condition may be described by a regular expression and/or by a function implemented in a high-level programming language. In one implementation, the preset screening conditions include regular expressions that describe preset web page elements. For example, the regular expression $1~/<title>[^<]*</title>/i can be used to describe the title element.
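The title condition quoted above can be tried out with Python's re module. This is a sketch, assuming the trailing i is a case-insensitive flag and that only the pattern between the slashes needs to be carried over; the function name and sample page are hypothetical.

```python
import re

# Pattern taken from the example; re.IGNORECASE plays the role of the trailing "i".
TITLE_CONDITION = re.compile(r"<title>[^<]*</title>", re.IGNORECASE)


def match_condition(web_data: str):
    """Return (start, end, text) for every place the screening condition matches."""
    return [(m.start(), m.end(), m.group(0))
            for m in TITLE_CONDITION.finditer(web_data)]


html = "<html><head><TITLE>Daily News</TITLE></head></html>"
matches = match_condition(html)
```

Each tuple carries both the matching range (start and end offsets) and the matched data, the two forms of web page data the description allows.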
In practical applications, matching the web page data of the target web page against the preset screening conditions determines the one or more preset screening conditions that match the web page data successfully, and these are taken as the target conditions. At the same time, the web page data of the target web page that corresponds to each target condition can be obtained; this may be a data matching range within the web page data of the target web page or the specific data within it. For example, if the web page data matching target condition 1 is the 1st to the 10th byte of the web page data of the target web page, then the data corresponding to target condition 1 is either the data matching range "1st byte to 10th byte" or the specific data contained in the 1st to the 10th byte.
S202: Determine the target web page element corresponding to the target condition according to the correspondence between the preset screening conditions and the web page elements on the target web page.
In the embodiment of the present invention, because the preset screening conditions are configured for the preset web page elements of the target web page, there is a correspondence between preset screening conditions and preset web page elements. After determining the target conditions among the preset screening conditions, the embodiment of the present invention uses this correspondence to determine the web page element corresponding to each target condition, which is taken as a target web page element.
S203: Determine the web page data corresponding to the target web page element on the target web page according to the web page data and the target web page element that each correspond to the target condition.
In the embodiment of the present invention, S201 has obtained the web page data corresponding to the target condition and S202 has determined the target web page element corresponding to the target condition. Based on the correspondence of the target condition with the web page data and with the target web page element, the web page data corresponding to the target web page element is determined.
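Steps S201 to S203 can be sketched together as follows. The condition identifiers, the two regular expressions, and the correspondence table are hypothetical examples, not the conditions used in the patent.

```python
import re
from typing import Dict, List

# Hypothetical preset screening conditions, keyed by a condition identifier.
PRESET_CONDITIONS = {
    "cond_title": re.compile(r"<title>[^<]*</title>", re.IGNORECASE),
    "cond_keywords": re.compile(r'<meta\s+name="keywords"[^>]*>', re.IGNORECASE),
}

# Hypothetical correspondence between conditions and preset web page elements.
CONDITION_TO_ELEMENT = {
    "cond_title": "title_tag",
    "cond_keywords": "keywords_tag",
}


def analyze_page(web_data: str) -> Dict[str, List[str]]:
    """Return the target web page elements and their corresponding web page data."""
    # S201: conditions that match become target conditions, each with its data.
    data_by_condition = {}
    for cond_id, pattern in PRESET_CONDITIONS.items():
        found = [m.group(0) for m in pattern.finditer(web_data)]
        if found:
            data_by_condition[cond_id] = found

    result = {}
    for cond_id, data in data_by_condition.items():
        # S202: look up the element that the target condition describes.
        element = CONDITION_TO_ELEMENT[cond_id]
        # S203: attach the condition's data to that element.
        result[element] = data
    return result


page = "<head><title>Daily News</title></head><body>text</body>"
analysis = analyze_page(page)
```

Because the conditions live in a table, meeting a new classification requirement amounts to editing PRESET_CONDITIONS and CONDITION_TO_ELEMENT rather than the analysis code.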
The embodiment of the present invention uses the target web page element obtained by web page analysis and its corresponding web data as the basis for web page classification, and completes the classification of the target webpage by calling interfaces.
The web page analysis method provided by the embodiment can analyze the target webpage according to the needs of web page classification, obtain the target web page elements and their corresponding web data as the basis for classification, and finally complete the classification of the target webpage automatically by calling interfaces.
In one implementation scenario, the target webpage includes multiple web page elements with a hierarchical composition relationship among them, which can be represented as a tree structure. For example, the web page elements of the target webpage include the web page object, web page tag, header tag, title tag, keyword tag, web address, domain name, access port, access path, and so on; the hierarchical composition relationship of these elements can be represented by the tree structure shown in Fig. 3. A web page element at a leaf node of the tree is called a basic web page element, for example the title tag and the keyword tag. One or more basic web page elements can make up the web page element at the parent node of those leaf nodes; for example, the title tag and the keyword tag make up the header tag. In general, a preset screening condition is a screening condition that describes a basic web page element.
In this embodiment, after web page analysis is performed on the target webpage in S101, the resulting target web page element may be the web page element at any node of the tree structure shown in Fig. 3: a web page element at a leaf node, i.e. a basic web page element; a web page element at a parent node; or a web page element at a still higher level.
In one case, if the target web page element is a basic web page element, the web data corresponding to it can be determined directly. For example, if the target web page element is the title tag, the web data corresponding to the title tag on the target webpage is determined and serves directly as the basis for subsequently classifying the target webpage.
In another case, if the target web page element is at a parent node, its web data is obtained once the web data of the leaf-node web page elements that make it up have been determined. Specifically, the web data corresponding to the target web page element may be the combination of the web data corresponding to its constituent web page elements. For example, if the target web page element is the header tag, then after the web data corresponding to the title tag and to the keyword tag have each been determined on the target webpage, the combination of the two is taken as the web data corresponding to the header tag. Assuming the web data of the title tag is the 1st to 10th byte and that of the keyword tag is the 20th to 33rd byte, the web data of the header tag is the 1st to 10th byte together with the 20th to 33rd byte.
If the target web page element is at a still higher node of the tree structure in Fig. 3, its corresponding web data can be obtained in the same way, which is not described further here.
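The way a parent element inherits the combined web data of its children can be sketched as a small recursive lookup. The tree below contains only the header/title/keyword fragment described above; the dictionary layout, the element names html_header and html_keywords, and the helper ranges_for are assumptions made for illustration.

```python
# Hypothetical fragment of the element tree: parent element -> child elements.
# Elements that do not appear as keys are leaf (basic) web page elements.
TREE = {
    "html_header": ["html_title", "html_keywords"],
}


def ranges_for(element, leaf_ranges):
    """Collect the byte ranges of an element by combining its descendants' ranges."""
    if element in leaf_ranges:
        # Basic web page element: its range is known directly.
        return [leaf_ranges[element]]
    combined = []
    for child in TREE.get(element, []):
        combined.extend(ranges_for(child, leaf_ranges))
    return combined


# Byte ranges from the example in the text (1st-10th byte, 20th-33rd byte).
leaf_ranges = {"html_title": (1, 10), "html_keywords": (20, 33)}
header_ranges = ranges_for("html_header", leaf_ranges)
```

Because the function recurses through TREE, the same lookup also covers elements at higher levels once their children are added to the table.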
Based on the above embodiments, the embodiment of the present invention further provides a compiler-based web page classification method. The grammar rules of the web page classification method are defined first. Grammar rules can be expressed as grammar productions, whose general format is:
vn: v1(p1) ... vk(pk); or, vn: v1 ... vk;
Here ":" is the reduction symbol. With the reduction symbol as the boundary, the left side of a production is a non-terminal symbol vn, and the right side contains one or more symbols v1, ..., vk, each of which may carry a data screening condition p1, ..., pk. A non-terminal symbol is a symbol that can be subdivided further; a terminal symbol is one that cannot. The semantics of a production are that the symbol on its left side is obtained by reducing the symbols on its right side.
In the grammar rules of the embodiments of the present application, web page elements can be abstracted as symbols, referred to as web page element symbols; different web page elements are abstracted as different web page element symbols.
For example, the webpage can be abstracted as the web page element symbol HTML_TOP, the web address as html_url, the web page object as html_object, and so on.
Grammar rules can define the hierarchical composition relationship between web page elements. For example, for the hierarchical composition relationship shown in Fig. 3, the grammar rule that makes up the top-level page element may be:
HTML_TOP: html_url html_object;
Here the web page element symbol html_url denotes the web address and html_object denotes the web page object; together they reduce to the web page element symbol HTML_TOP of the webpage.
The grammar rule that makes up the web address may be:
html_url: domain_name access_port url_path;
Here domain_name denotes the domain name, access_port the access port, and url_path the access path; together the three reduce to the web page element symbol html_url of the web address.
The above is only an illustrative example of grammar rules that define the hierarchical composition relationship between web page elements. It is not exhaustive, and the embodiments of the present application place no limitation on such grammar rules.
The correspondence between data screening conditions and web page elements can also be defined by grammar rules.
For example, a grammar rule defining the correspondence between a data screening condition and a web page element may be:
html_title: html_data($1 ~ /<title>[^<]*</title>/i);
Here $1 ~ /<title>[^<]*</title>/i is the data screening condition, expressed as a regular expression; the terminal symbol html_data is the symbol of the HTML (HyperText Markup Language) input data stream; and the web page element symbol html_title denotes the title tag. This grammar rule expresses the correspondence between the title tag and the data screening condition.
The above is only an illustrative example of a grammar rule that defines the correspondence between a data screening condition and a web page element. It is not exhaustive, and the embodiments of the present application place no limitation on such grammar rules either.
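To make the structure of such a rule concrete, the sketch below splits one rule into its three parts: the web page element symbol on the left, the terminal symbol on the right, and the regular expression carried as the screening condition. The text does not specify how the compiler reads the rule file, so RULE_PATTERN and parse_condition_rule are purely hypothetical, and the sketch assumes the rule is written on one line in the form "symbol: terminal($1 ~ /regex/i);".

```python
import re

# Hypothetical pattern for one condition rule, e.g.
#   html_title: html_data($1 ~ /<title>[^<]*</title>/i);
# The greedy (.*) runs to the last "/" so that slashes inside the
# regular expression body (as in "</title>") stay part of the body.
RULE_PATTERN = re.compile(
    r"^\s*(\w+)\s*:\s*(\w+)\s*\(\s*\$1\s*~\s*/(.*)/(i?)\s*\)\s*;\s*$"
)


def parse_condition_rule(line):
    """Split a rule into (element symbol, terminal symbol, compiled condition)."""
    match = RULE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a condition rule: {line!r}")
    element, terminal, body, flag = match.groups()
    flags = re.IGNORECASE if flag == "i" else 0
    return element, terminal, re.compile(body, flags)


element, terminal, condition = parse_condition_rule(
    "html_title: html_data($1 ~ /<title>[^<]*</title>/i);"
)
hit = condition.search("<HEAD><Title>Daily news</Title></HEAD>")
```

The trailing "i" is mapped to a case-insensitive match, which is why the mixed-case "<Title>" in the sample input still satisfies the condition.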
In addition, the embodiment needs not only to analyze the target webpage but also to classify it based on the analysis result. Therefore, on top of the grammar rules defined above, grammar productions that call each functional interface of web page classification must be added, so that the classification function is realized by calling interfaces.
Specifically, web page classification requires at least the following interfaces: a feature extraction interface, a feature vector generation interface, and a classification algorithm interface; a classification model training interface may also be included. The function corresponding to the feature extraction interface may be html_classify_features_extract, the function corresponding to the feature vector generation interface may be html_classify_vector_normalized, the function corresponding to the classification algorithm interface may be html_classify_vector_predict, and the function corresponding to the classification model training interface may be html_classify_vector_train.
A grammar production that schedules the feature extraction interface is set in advance. For example, when malicious-webpage topic features need to be extracted from the web data corresponding to the title tag html_title, the grammar production can be defined as follows:
html_classify→html_title
{
html_classify_features_extract($1,...);
}
Here, to distinguish this production from the productions defined earlier for the relationships between web page elements, "→" is used to indicate that html_title reduces to html_classify, a symbol in the predefined syntax analysis domain. The "→" in the production marks it as one that performs external function processing after the symbol html_title has been generated, that is, it calls the feature extraction interface html_classify_features_extract($1, ...). "$1" is an input parameter of html_classify_features_extract and denotes the web data corresponding to html_title; in other words, the web data corresponding to html_title is passed to the feature extraction function html_classify_features_extract.
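The effect of such a production can be pictured as a table of actions keyed by the symbol whose generation triggers them. The sketch below is an assumption-laden illustration: the body of html_classify_features_extract (counting a few suspicious words), the features dictionary, the ACTIONS table, and on_reduce are all invented for the example; only the function name and the role of "$1" come from the text.

```python
def html_classify_features_extract(data, features):
    """Hypothetical stand-in: count suspicious words in the element's web data."""
    suspicious_words = ("free", "prize", "winner")
    text = data.lower()
    features["title_suspicious_words"] = sum(text.count(w) for w in suspicious_words)


# Hypothetical action table: symbol whose generation triggers an action -> action.
ACTIONS = {
    "html_title": html_classify_features_extract,
}


def on_reduce(symbol, data, features):
    """Called when `symbol` has been generated; `data` plays the role of $1."""
    action = ACTIONS.get(symbol)
    if action is not None:
        action(data, features)


features = {}
on_reduce("html_title", "<title>Free prize for every winner</title>", features)
on_reduce("html_url", "example.com:8080/index.html", features)  # no action registered
```

Changing which element feeds the feature extractor then amounts to changing one entry in the table, which mirrors the claim that only the grammar rule needs to be edited.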
In addition, when the feature extraction interface needs web page analysis results at a finer granularity, only the grammar rules need to be modified, which reduces the complexity of program development to some extent.
For example, html_title can be refined further:
html_title: title_A title_B;
title_A: title_A_p1 title_A_p2;
html_classify→title_A_p1{html_classify_features_extract($1,...);};
That is, each grammar symbol in the grammar rules can be refined and expanded further, so that the function is realized at a finer granularity.
A grammar production that schedules the feature vector generation interface is also set in advance; this interface is called after feature extraction has been performed through the feature extraction interface. The grammar production can be defined as follows:
html_classify→html_target
{
html_classify_vector_normalized(...);
}
Here, obtaining html_target indicates that the web page analysis process has finished. When html_target is obtained by reduction and feature extraction is complete, this production calls html_classify_vector_normalized, the function corresponding to the feature vector generation interface, to generate the feature vector of the target webpage.
A grammar production that schedules the classification algorithm interface is also set in advance. The classification algorithm interface is called after the feature vector of the target webpage has been generated, so the grammar production for the feature vector generation interface above can be extended to:
html_classify→html_target
{
html_classify_vector_normalized(...);
html_classify_vector_predict(...);
}
Optionally, if the generated feature vector is to be used to train a pre-established classification model, the classification model training interface can be called. This interface is called after the feature vector of the target webpage has been generated, so the grammar production for the feature vector generation interface above can be extended to:
html_classify→html_target
{
html_classify_vector_normalized(...);
html_classify_vector_train(...);
}
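The order of calls expressed by these productions (generate the feature vector first, then use it) can be sketched as below for the prediction variant. The text deliberately omits the parameters of the interface functions, so every signature here is an assumption, as are the unit-length scaling, the linear scoring rule, the feature names, and the helper on_target_reduced.

```python
import math


def html_classify_vector_normalized(features, order):
    """Hypothetical: lay features out in a fixed order and scale to unit length."""
    raw = [float(features.get(name, 0.0)) for name in order]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw


def html_classify_vector_predict(vector, weights, bias):
    """Hypothetical linear classifier: 'malicious' when the score is positive."""
    score = sum(w * v for w, v in zip(weights, vector)) + bias
    return "malicious" if score > 0 else "benign"


def on_target_reduced(features):
    """Mirrors the production body: normalize first, then predict."""
    vector = html_classify_vector_normalized(
        features, ["title_suspicious_words", "url_digits"]
    )
    return html_classify_vector_predict(vector, weights=[1.0, 0.5], bias=-0.5)


label = on_target_reduced({"title_suspicious_words": 3, "url_digits": 4})
```

The training variant would replace the second call with a call to the training function while keeping the same ordering.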
In addition, the embodiment can determine the use of the generated feature vector through the definition of the grammar production, for example training the classification model or classifying the webpage. Accordingly, the grammar production for the feature vector generation interface above can be extended to:
The embodiment describes each functional interface of web page classification with concise grammar productions. When a function needs adjusting, the grammar productions can be changed flexibly, so the web page classification function can be modified quickly.
In addition, the web page classification method provided by the embodiment usually has an initialization process that configures the initialization parameters needed for the preset functions and applies for initial memory resources; after the preset functions have been performed, the resources applied for can be released. Therefore, the embodiment can also define an initialization grammar rule, which performs initialization by executing the built-in function html_classify_init(...), and an end grammar rule, which performs release by executing the built-in function html_classify_fini(...).
The initialization grammar rule can be defined as:
init{
html_classify_init(...);
}
The end grammar rule can be defined as:
fini{
html_classify_fini(...);
}
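The pairing of the init and fini rules resembles an acquire/release pattern. As a rough analogy only, the sketch below wraps the two phases in a Python context manager; the state dictionary, its keys, and classifier_session are assumptions, and the comments mark where html_classify_init and html_classify_fini would conceptually run.

```python
from contextlib import contextmanager


@contextmanager
def classifier_session(config):
    # Stand-in for html_classify_init(...): configure parameters, acquire resources.
    state = {"config": dict(config), "buffers": [], "released": False}
    try:
        yield state
    finally:
        # Stand-in for html_classify_fini(...): release whatever was acquired,
        # even if classification raised an error part-way through.
        state["buffers"].clear()
        state["released"] = True


with classifier_session({"threshold": 0.5}) as session:
    session["buffers"].append(b"page bytes")   # resources used during classification
    in_use = len(session["buffers"])
```

The finally clause is the design point: release runs whether or not classification completes normally.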
Note that the grammar productions defined above omit the parameters of the interface functions so that the role of each production can be described simply and clearly.
After the grammar productions of the web page classification method have been defined, the embodiment saves the definitions above as a grammar rule file.
In addition, performance tuning of a web page classification method usually requires frequent adjustment of algorithm parameters or updates to the feature extraction method. Because the embodiment uses a compiler and describes the calls to each web page classification interface as grammar productions, the user can adjust the classification algorithm parameters by modifying the grammar rules alone: replacing the original grammar rule file in the product deployment environment with the new one and restarting is enough, which reduces the workload of algorithm adjustment.
Based on the grammar rules defined above, the embodiment can implement the web page classification method with a compiler. Specifically, the compiler compiles the grammar rules in the generated grammar rule file into a lexical analyzer, a syntax analyzer, an initialization scheduler, and an end scheduler that releases resources.
The compilation process of the lexical analyzer may include:
The compiler analyzes the grammar productions associated with terminal symbols in the obtained grammar rule file and extracts the lexical elements. A lexical element is the right side of such a grammar production, for example: "html_data($1 ~ /<title>[^<]*</title>/i)".
The compiler collects the regular expressions in the extracted lexical elements and builds a finite automaton from them, yielding a lexical analyzer that integrates this finite automaton; this is the lexical analyzer compiled from the grammar rule file. The finite automaton may be a deterministic finite automaton (DFA) or a nondeterministic finite automaton (NFA), and the matching method stored in the lexical analyzer may accordingly be DFA-based or NFA-based matching. Different matching methods can be configured for the DFA, and likewise for the NFA, according to actual needs.
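The idea of collecting all screening-condition regular expressions into one matcher can be approximated in Python by joining them into a single alternation with named groups. This is a substitute technique for illustration: Python's re module is a backtracking engine rather than the DFA or NFA that the compiler builds, and the keyword pattern, CONDITIONS, and tokenize are assumptions.

```python
import re

# Hypothetical lexical elements: (web page element symbol, regular expression).
CONDITIONS = [
    ("html_title", r"<title>[^<]*</title>"),
    ("html_keywords", r'<meta\s+name="keywords"[^>]*>'),
]

# One combined pattern; the name of the group that matched identifies the symbol.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in CONDITIONS),
    re.IGNORECASE,
)


def tokenize(page):
    """Return (symbol, start, end) tuples ordered by data matching range."""
    tokens = [(m.lastgroup, m.start(), m.end()) for m in COMBINED.finditer(page)]
    return sorted(tokens, key=lambda token: (token[1], token[2]))


page = '<head><meta name="keywords" content="a,b"><title>T</title></head>'
tokens = tokenize(page)
```

A single pass over the page yields every hit with its matching range, which is the form of output the syntax analyzer expects as input.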
The compilation process of the syntax analyzer may include:
The syntax analyzer is generated from the hierarchical composition relationship between web page elements defined by the grammar rules, specifically using an LALR generation method. The generated syntax analyzer contains a pushdown automaton that analyzes grammar states. The pushdown automaton comprises a controller, an automaton state stack, a web page element symbol stack, a state transition table (GOTO table), an action table (ACTION table), an input, and an output.
The input is the symbol sequence supplied by the lexical analyzer, sorted by the data matching range of each hit. The controller schedules the pushdown automaton. The automaton state stack stores the grammar states of the pushdown automaton. The web page element symbol stack stores the web page element symbols supplied by the lexical analyzer. The output of the pushdown automaton may be each web page element symbol together with its corresponding data. The pushdown automaton may be a nondeterministic pushdown automaton (NPDA) or a deterministic pushdown automaton (DPDA), and the syntax analyzer may accordingly perform syntax analysis with either. With the ACTION table, inputting a symbol in the current state does not move to a new state; the action is looked up directly, and it can only be a reduction. With the GOTO table, inputting a symbol in the current state moves to a new state, and the action is then looked up; it can be a reduction, a shift, or an accept.
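The role of the symbol stack and of shift and reduce actions can be illustrated with a deliberately naive shift-reduce loop over the two composition rules given earlier. It is not an LALR parser: it has no state stack and no ACTION or GOTO tables, and simply reduces whenever the top of the stack equals the right side of a rule. The function parse, the concatenation of data on reduction, and the sample token data are assumptions.

```python
# The two composition rules from the text, as (left side, right side).
GRAMMAR = [
    ("html_url", ["domain_name", "access_port", "url_path"]),
    ("HTML_TOP", ["html_url", "html_object"]),
]


def parse(tokens):
    """tokens: list of (symbol, data). Returns (reductions, final stack)."""
    stack = []        # web page element symbol stack, holding (symbol, data)
    reductions = []   # left-side symbols in the order they were produced
    for token in tokens:
        stack.append(token)                                   # shift
        reduced = True
        while reduced:                                        # reduce while possible
            reduced = False
            for left, right in GRAMMAR:
                n = len(right)
                if [symbol for symbol, _ in stack[-n:]] == right:
                    data = "".join(d for _, d in stack[-n:])  # combine child data
                    del stack[-n:]
                    stack.append((left, data))
                    reductions.append(left)
                    reduced = True
                    break
    return reductions, stack


reductions, stack = parse([
    ("domain_name", "example.com"),
    ("access_port", ":8080"),
    ("url_path", "/index.html"),
    ("html_object", "<html></html>"),
])
```

Each reduction is the point at which a production's attached action, such as a call to the feature extraction interface, would be triggered.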
In addition, the compiler can compile the initialization grammar rule in the grammar rule file into the initialization scheduler and the end grammar rule into the end scheduler.
After the compilation process above, the compiler has generated the lexical analyzer, the syntax analyzer, the initialization scheduler, and the end scheduler, as shown in Fig. 4. The web page classification method of the embodiment is carried out with the lexical analyzer, the syntax analyzer, the initialization scheduler, and the end scheduler. Because a compiler depends little on the platform, the compiler-based web page classification method provided by the embodiment is easy to port to new platforms.
The implementation of the web page classification method provided by the embodiments of the present application in practice is illustrated below with an example.
Fig. 5 is a flowchart of a compiler-based web page classification method provided by an embodiment of the present invention. The method includes:
Step S501: Obtain the web data of the target webpage.
Step S502: Input the web data into the lexical analyzer. The lexical analyzer matches the web data against the regular expressions and obtains the web data hit by a regular expression; the hit web data can be characterized by its data matching range. The lexical analyzer outputs the web page element symbols that carry the regular expression as a predicate, together with the hit web data.
In this embodiment, the terminal symbol (for example, html_data in the example above) serves as the input symbol of the lexical analyzer, and the web data corresponding to the terminal symbol, i.e. the web data obtained in step S501, serves as its input. The web page element symbol with a predicate is the terminal symbol html_data(x), where x denotes the regular expression defined in the grammar rule.
Step S503: Execute the initialization scheduler to call the initialization function and complete the initialization for web page classification.
Before syntax analysis, the classification program is initialized, including parameter configuration and initial resource application. If more memory is needed during subsequent processing, new memory resources can be applied for; that is, memory resources can be applied for dynamically. In addition, functional interfaces can be loaded dynamically: a functional interface is loaded only when it is needed and otherwise is not loaded, which saves memory resources.
Note that the order of step S503 relative to steps S501 and S502 is not limited: step S503 may be executed before step S501, between steps S501 and S502, after step S502, or concurrently with step S501 or step S502.
Step S504: Input the terminal symbols with predicates output by the lexical analyzer into the syntax analyzer. The syntax analyzer derives the target web page element from the terminal symbols with predicates and determines the web data corresponding to the target web page element.
Step S505: The syntax analyzer calls the feature extraction interface and performs feature extraction on the web data corresponding to the target web page element to obtain the features corresponding to the target web page element.
Step S506: The syntax analyzer calls the feature vector generation interface and generates the feature vector of the target web page element from the extracted features corresponding to the target web page element.
Step S507: The syntax analyzer calls the classification algorithm interface and classifies the target webpage according to the feature vector of each target web page element to obtain the classification result of the target webpage.
Step S508: After the syntax analyzer obtains the classification result, the end scheduler is executed to release the resources applied for.
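The overall order of steps S502 to S508 can be condensed into one toy function. Everything in it is a simplification made for illustration: a single title pattern stands in for the lexical and syntax analyzers, a two-word count stands in for feature extraction, a fixed threshold stands in for the classification algorithm, and the log list exists only to make the order of the stages visible.

```python
import re

# Hypothetical single screening condition standing in for the whole grammar.
TITLE = re.compile(r"<title>([^<]*)</title>", re.IGNORECASE)


def classify_page(page, log):
    log.append("init")                                    # S503: initialization
    try:
        match = TITLE.search(page)                        # S502: lexical matching
        title = match.group(1) if match else ""           # S504: element and its data
        log.append("analyze")
        hits = sum(word in title.lower() for word in ("free", "prize"))  # S505
        log.append("extract")
        vector = [hits / 2.0]                             # S506: feature vector
        log.append("vector")
        label = "malicious" if vector[0] >= 0.5 else "benign"            # S507
        log.append("predict")
        return label
    finally:
        log.append("fini")                                # S508: release resources


log = []
label = classify_page("<html><title>Free prize</title></html>", log)
```

As in the flow of Fig. 5, the release step runs last, after the classification result has been obtained.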
In the compiler-based web page classification method provided by the embodiment, web page analysis is described by grammar rules, and web page classification is divided into separate functional interfaces that are merged into the web page analysis grammar rules and scheduled by the syntax analyzer at run time, which raises the degree of automation of the web page classification function. Compared with the prior art, which implements web page classification by scheduling two sets of program code one after the other, the embodiment reduces the complexity of the web page classification process and raises the degree of automation.
Corresponding to the method embodiments above, an embodiment of the present invention further provides a web page classification device. Fig. 6 is a schematic structural diagram of a web page classification device provided by an embodiment of the present invention. The device includes:
A web page analysis module 601, configured to perform web page analysis on the target webpage to obtain the target web page elements on the target webpage and the web data corresponding to the target web page elements;
A feature extraction module 602, configured to call the feature extraction interface and perform feature extraction on the web data corresponding to the target web page element to obtain the features corresponding to the target web page element;
A vector generation module 603, configured to call the feature vector generation interface and generate the feature vector of the target web page element from the extracted features corresponding to the target web page element;
A classification module 604, configured to call the classification algorithm interface and classify the target webpage according to the feature vector of each target web page element to obtain the classification result of the target webpage.
Specifically, the web page analysis module includes:
A matching submodule, configured to match the web data of the target webpage against the preset screening conditions, determine the successfully matched preset screening condition as the target condition, and obtain the web data corresponding to the target condition;
A first determination submodule, configured to determine the target web page element corresponding to the target condition according to the correspondence between the preset screening conditions and the web page elements on the target webpage;
A second determination submodule, configured to determine the web data corresponding to the target web page element on the target webpage according to the web data and the target web page element that each correspond to the target condition.
The preset screening conditions include regular expressions that describe preset web page elements.
Specifically, the device further includes:
An initialization module, configured to initialize the function corresponding to the feature extraction interface before that interface is called for the first time; and/or to initialize the function corresponding to the feature vector generation interface before that interface is called for the first time; and/or to initialize the function corresponding to the classification algorithm interface before that interface is called for the first time; wherein the initialization includes parameter configuration and resource application.
In addition, the device further includes:
A release module, configured to release the resources applied for after the classification result of the target webpage is obtained.
In one implementation, the device further includes:
A model training module, configured to call the classification model training interface and train a pre-established classification model with the feature vector of each target web page element to obtain a trained classification model;
Correspondingly, the classification module is specifically configured to:
Call the classification algorithm interface, take the feature vector of each target web page element as the input parameter of the trained classification model, and take the output parameter obtained after processing by the trained classification model as the classification result of the target webpage.
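The relationship between the model training module and the classification module (train first, then feed feature vectors to the trained model) can be sketched with a tiny perceptron. The text does not name a classification algorithm, so the perceptron, the (weights, bias) model representation, the signatures, and the two training samples are all assumptions chosen to keep the example short.

```python
def html_classify_vector_train(samples, epochs=10, learning_rate=0.1):
    """Hypothetical perceptron training.

    samples: list of (feature vector, label) pairs with label 0 or 1.
    Returns the trained model as (weights, bias).
    """
    dimension = len(samples[0][0])
    weights, bias = [0.0] * dimension, 0.0
    for _ in range(epochs):
        for vector, label in samples:
            score = sum(w * v for w, v in zip(weights, vector)) + bias
            error = label - (1 if score > 0 else 0)
            # Perceptron update: move the boundary only on misclassified samples.
            weights = [w + learning_rate * error * v for w, v in zip(weights, vector)]
            bias += learning_rate * error
    return weights, bias


def html_classify_vector_predict(vector, model):
    """Hypothetical prediction with the trained model: returns 0 or 1."""
    weights, bias = model
    return 1 if sum(w * v for w, v in zip(weights, vector)) + bias > 0 else 0


model = html_classify_vector_train([([1.0, 0.0], 1), ([0.0, 1.0], 0)])
result = html_classify_vector_predict([1.0, 0.0], model)
```

The trained model is passed explicitly to the prediction function, matching the description in which the feature vector is the input parameter of the trained model and its output is the classification result.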
In the web page classification device provided by the present invention, web page analysis is performed on the target webpage to obtain the target web page elements of the target webpage and the web data corresponding to each target web page element, which serve as the basis for web page classification. By calling the feature extraction interface, feature extraction is performed on the web data corresponding to the target web page element to obtain the features corresponding to the target web page element. By calling the feature vector generation interface, the feature vector of the target web page element is generated from the extracted features. By calling the classification algorithm interface, the target webpage is classified according to the feature vector of each target web page element to obtain the classification result of the target webpage. After obtaining the web page analysis result, the present invention continues to realize the web page classification function by calling interfaces and finally obtains the classification result. Compared with the prior art, the present invention does not need to trigger and schedule two mutually independent sets of program code one after the other, which raises the degree of automation of the web page classification function.
Correspondingly, an embodiment of the present invention further provides a web page classification equipment which, as shown in Fig. 7, may include:
A processor 701, a memory 702, an input device 703, and an output device 704. The web page classification equipment may have one or more processors 701; Fig. 7 takes one processor as an example. In some embodiments of the present invention, the processor 701, memory 702, input device 703, and output device 704 may be connected by a bus or in other ways; Fig. 7 takes a bus connection as an example.
The memory 702 can store software programs and modules, and the processor 701 performs the various functional applications and data processing of the web page classification equipment by running the software programs and modules stored in the memory 702. The memory 702 mainly includes a program storage area and a data storage area, where the program storage area can store the operating system, the application programs required by at least one function, and so on. In addition, the memory 702 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. The input device 703 can receive input numeric or character information and generate signal inputs related to the user settings and function control of the web page classification equipment.
Specifically, in this embodiment, the processor 701 loads the executable files corresponding to the processes of one or more application programs into the memory 702 according to the following instructions, and runs the application programs stored in the memory 702, thereby realizing the various functions of the web page classification method above.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present invention.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems (if any), devices, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function; in actual implementation there may be other divisions, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented as software functional units and sold or used as an independent product, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A web page classification method, characterized in that the method comprises:
performing web page analysis on a target web page to obtain the target web page elements on the target web page and the web page data corresponding to the target web page elements;
calling a feature extraction interface to perform feature extraction on the web page data corresponding to the target web page elements, to obtain the features corresponding to the target web page elements;
calling a feature vector generation interface to generate the feature vectors of the target web page elements according to the extracted features corresponding to the target web page elements;
calling a classification algorithm interface to classify the target web page according to the feature vector of each target web page element, to obtain the classification result of the target web page.
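The four steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the three tags inspected, the fixed vocabulary and the toy classification rule are all assumptions chosen only to show how web page analysis, the feature extraction interface, the feature vector generation interface and the classification algorithm interface might be chained.

```python
import re
from typing import Dict, List

# Hypothetical vocabulary used to build feature vectors (an assumption).
VOCAB = ["buy", "price", "news", "report"]


def analyze_page(html: str) -> Dict[str, str]:
    """Step 1: map each target web page element to its web page data."""
    elements = {}
    for tag in ("title", "h1", "p"):
        match = re.search(rf"<{tag}\b[^>]*>(.*?)</{tag}>", html, re.S | re.I)
        if match:
            elements[tag] = match.group(1).strip()
    return elements


def extract_features(data: str) -> List[str]:
    """Step 2 (feature extraction interface): lowercase word tokens."""
    return re.findall(r"[a-z]+", data.lower())


def generate_vector(features: List[str]) -> List[int]:
    """Step 3 (feature vector generation interface): counts over VOCAB."""
    return [features.count(term) for term in VOCAB]


def classify(vectors: Dict[str, List[int]]) -> str:
    """Step 4 (classification algorithm interface): toy rule, not a real model."""
    if not vectors:
        return "unknown"
    totals = [sum(column) for column in zip(*vectors.values())]
    shopping, news = totals[0] + totals[1], totals[2] + totals[3]
    if shopping == news:
        return "unknown"
    return "shopping" if shopping > news else "news"


def classify_page(html: str) -> str:
    """Chain the four steps for one target web page."""
    elements = analyze_page(html)
    vectors = {name: generate_vector(extract_features(data))
               for name, data in elements.items()}
    return classify(vectors)
```

Keeping each step behind its own function mirrors the interface-based structure of the claim: any one step could be swapped for a different implementation without touching the others.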
2. The web page classification method according to claim 1, characterized in that performing web page analysis on the target web page to obtain the target web page elements on the target web page and the web page data corresponding to the target web page elements comprises:
matching the web page data of the target web page against preset screening conditions, determining each successfully matched preset screening condition as a target condition, and obtaining the web page data corresponding to the target condition;
determining the target web page element corresponding to the target condition according to the correspondence between the preset screening conditions and the web page elements on the target web page;
determining the web page data corresponding to the target web page elements on the target web page according to the web page data and the target web page element that correspond to the target condition.
3. The web page classification method according to claim 2, characterized in that the preset screening conditions include regular expressions for describing preset web page elements.
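The matching procedure of claims 2 and 3 can be sketched as follows. The condition identifiers, the regular expressions and the condition-to-element table are hypothetical examples; the claims only require that preset screening conditions (which may be regular expressions) are matched against the page data and then mapped to web page elements.

```python
import re
from typing import Dict

# Hypothetical preset screening conditions, keyed by a condition id.
CONDITIONS = {
    "cond_title": re.compile(r"<title>(.*?)</title>", re.I | re.S),
    "cond_keywords": re.compile(
        r'<meta\s+name="keywords"\s+content="(.*?)"', re.I),
    "cond_h1": re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.I | re.S),
}

# Hypothetical correspondence between screening conditions and page elements.
CONDITION_TO_ELEMENT = {
    "cond_title": "title",
    "cond_keywords": "keywords",
    "cond_h1": "h1",
}


def analyze(page: str) -> Dict[str, str]:
    """Return a mapping from target web page element to its web page data."""
    # Conditions that match become target conditions; keep the matched data.
    matched = {}
    for cond_id, pattern in CONDITIONS.items():
        hit = pattern.search(page)
        if hit:
            matched[cond_id] = hit.group(1).strip()
    # Look up the element for each target condition and attach its data.
    return {CONDITION_TO_ELEMENT[cond_id]: data
            for cond_id, data in matched.items()}
```

For a page containing only a title and a keywords meta tag, `analyze` returns entries for `title` and `keywords` and omits `h1`, because that condition never became a target condition.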
4. The web page classification method according to claim 1, characterized in that:
before the feature extraction interface is called for the first time, the method further comprises: performing initialization processing on the function corresponding to the feature extraction interface;
and/or,
before the feature vector generation interface is called for the first time, the method further comprises: performing initialization processing on the function corresponding to the feature vector generation interface;
and/or,
before the classification algorithm interface is called for the first time, the method further comprises: performing initialization processing on the function corresponding to the classification algorithm interface;
wherein the initialization processing includes parameter configuration and resource allocation.
5. The web page classification method according to claim 4, characterized in that, after the classification result of the target web page is obtained, the method further comprises: releasing the allocated resources.
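One possible reading of claims 4 and 5 is lazy initialization before the first interface call, followed by an explicit release once classification is finished. The sketch below assumes a hypothetical `FeatureExtractor` class; the configuration dictionary and the token buffer merely stand in for "parameter configuration" and "resource allocation".

```python
from typing import List, Optional


class FeatureExtractor:
    """Hypothetical wrapper around a feature extraction interface."""

    def __init__(self) -> None:
        self.initialized = False
        self.config: Optional[dict] = None
        self.buffer: Optional[List[str]] = None

    def _initialize(self) -> None:
        self.config = {"lowercase": True}  # parameter configuration
        self.buffer = []                   # resource allocation (stand-in)
        self.initialized = True

    def extract(self, data: str) -> List[str]:
        # Initialization happens once, before the first call does any work.
        if not self.initialized:
            self._initialize()
        text = data.lower() if self.config["lowercase"] else data
        tokens = text.split()
        self.buffer.extend(tokens)
        return tokens

    def release(self) -> None:
        """Release the allocated resources after the result is obtained."""
        self.buffer = None
        self.config = None
        self.initialized = False
```

The same pattern would apply to the feature vector generation interface and the classification algorithm interface; calling `release()` only after the classification result is available corresponds to the ordering in claim 5.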
6. The web page classification method according to claim 1, characterized in that the method further comprises:
calling a classification model training interface to train a pre-established classification model using the feature vectors of the target web page elements, to obtain a trained classification model;
correspondingly, calling the classification algorithm interface to classify the target web page according to the feature vector of each target web page element, to obtain the classification result of the target web page, comprises:
calling the classification algorithm interface, using the feature vector of each target web page element as an input parameter of the trained classification model, and taking the output parameter obtained after processing by the trained classification model as the classification result of the target web page.
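The train-then-classify flow of claim 6 can be illustrated with a small nearest-centroid model. The patent does not name a specific model, so the choice of algorithm, the function names `train` and `classify`, and the labelled sample format are all assumptions made for this sketch.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def train(samples: List[Tuple[Vector, str]]) -> Dict[str, List[float]]:
    """Classification model training interface: one centroid per label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vector, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}


def classify(model: Dict[str, List[float]], vector: Vector) -> str:
    """Classification algorithm interface: label of the nearest centroid."""
    def squared_distance(centroid: List[float]) -> float:
        return sum((c - v) ** 2 for c, v in zip(centroid, vector))
    return min(model, key=lambda label: squared_distance(model[label]))
```

Here the trained model is just a dictionary of centroids; the feature vector is the input parameter and the returned label plays the role of the output parameter described in the claim.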
7. A web page classification device, characterized in that the device comprises:
a web page analysis module, configured to perform web page analysis on a target web page to obtain the target web page elements on the target web page and the web page data corresponding to the target web page elements;
a feature extraction module, configured to call a feature extraction interface to perform feature extraction on the web page data corresponding to the target web page elements, to obtain the features corresponding to the target web page elements;
a vector generation module, configured to call a feature vector generation interface to generate the feature vectors of the target web page elements according to the extracted features corresponding to the target web page elements;
a classification module, configured to call a classification algorithm interface to classify the target web page according to the feature vector of each target web page element, to obtain the classification result of the target web page.
8. The web page classification device according to claim 7, characterized in that the web page analysis module comprises:
a matching submodule, configured to match the web page data of the target web page against preset screening conditions, determine each successfully matched preset screening condition as a target condition, and obtain the web page data corresponding to the target condition;
a first determination submodule, configured to determine the target web page element corresponding to the target condition according to the correspondence between the preset screening conditions and the web page elements on the target web page;
a second determination submodule, configured to determine the web page data corresponding to the target web page elements on the target web page according to the web page data and the target web page element that correspond to the target condition.
9. A computer-readable storage medium, characterized in that instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is caused to perform the web page classification method according to any one of claims 1 to 6.
10. Web page classification equipment, characterized in that the equipment comprises a memory and a processor;
the memory is configured to store program code and to transfer the program code to the processor;
the processor is configured to run the program code, wherein the web page classification method according to any one of claims 1 to 6 is performed when the program code is run.
CN201711481103.4A 2017-12-29 2017-12-29 Webpage classification method, device and equipment Active CN108304483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711481103.4A CN108304483B (en) 2017-12-29 2017-12-29 Webpage classification method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711481103.4A CN108304483B (en) 2017-12-29 2017-12-29 Webpage classification method, device and equipment

Publications (2)

Publication Number Publication Date
CN108304483A true CN108304483A (en) 2018-07-20
CN108304483B CN108304483B (en) 2021-01-19

Family

ID=62868276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711481103.4A Active CN108304483B (en) 2017-12-29 2017-12-29 Webpage classification method, device and equipment

Country Status (1)

Country Link
CN (1) CN108304483B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101414351A (en) * 2008-11-03 2009-04-22 章毅 Fingerprint recognition system and control method
CN101719156A (en) * 2009-12-30 2010-06-02 南开大学 System of seamless integrated pure XML query engine in relational database
US20120084328A1 (en) * 2010-09-30 2012-04-05 International Business Machines Corporation Graphical User Interface for a Search Query
CN102426585A (en) * 2011-08-09 2012-04-25 中国科学技术信息研究所 Webpage automatic classification method based on Bayesian network
CN103020298A (en) * 2012-12-31 2013-04-03 华为技术有限公司 Method and device for acquiring page
CN104391860A (en) * 2014-10-22 2015-03-04 安一恒通(北京)科技有限公司 Content type detection method and device
CN105912607A (en) * 2016-04-06 2016-08-31 普强信息技术(北京)有限公司 Grammar rule based classification method
CN106663224A (en) * 2014-06-30 2017-05-10 亚马逊科技公司 Interactive interfaces for machine learning model evaluations
CN106657075A (en) * 2016-12-26 2017-05-10 东软集团股份有限公司 Multilayer protocol analysis method and device as well as data matching method and device

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162624A (en) * 2019-04-16 2019-08-23 腾讯科技(深圳)有限公司 A kind of text handling method, device and relevant device
CN110162624B (en) * 2019-04-16 2024-04-09 腾讯科技(深圳)有限公司 Text processing method and device and related equipment
CN110502677A (en) * 2019-04-18 2019-11-26 杭州海康威视数字技术股份有限公司 A kind of device identification method, device and equipment, storage medium
CN110502677B (en) * 2019-04-18 2022-09-16 杭州海康威视数字技术股份有限公司 Equipment identification method, device and equipment, and storage medium
CN110515921A (en) * 2019-09-02 2019-11-29 江苏建筑职业技术学院 A kind of Artificial intelligent information screening plant
CN110515921B (en) * 2019-09-02 2021-11-02 江苏建筑职业技术学院 Computer artificial intelligence information screening device
CN111125603A (en) * 2019-12-27 2020-05-08 百度时代网络技术(北京)有限公司 Webpage scene recognition method and device, electronic equipment and storage medium
CN111125603B (en) * 2019-12-27 2023-06-27 百度时代网络技术(北京)有限公司 Webpage scene recognition method and device, electronic equipment and storage medium
CN113297525A (en) * 2021-06-17 2021-08-24 恒安嘉新(北京)科技股份公司 Webpage classification method and device, electronic equipment and storage medium
CN113297525B (en) * 2021-06-17 2023-12-12 恒安嘉新(北京)科技股份公司 Webpage classification method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN108304483B (en) 2021-01-19

Similar Documents

Publication Publication Date Title
CN108304483A (en) A kind of Web page classification method, device and equipment
CN111177569B (en) Recommendation processing method, device and equipment based on artificial intelligence
CN106649825B (en) Voice interaction system and creation method and device thereof
JP6894534B2 (en) Information processing method and terminal, computer storage medium
CN104142822B (en) Use information retrieval carries out source code flow point analysis
CN105117387B (en) A kind of intelligent robot interactive system
CN108021554A (en) Audio recognition method, device and washing machine
CN110286917A (en) File packing method, device, equipment and storage medium
CN110532176A (en) A kind of formalization verification method, electronic device and the storage medium of intelligence contract
CN105653949B (en) A kind of malware detection methods and device
CN107391675A (en) Method and apparatus for generating structure information
CN107220098A (en) The implementation method and device of regulation engine
CN109614093B (en) Visual intelligent contract system and intelligent contract processing method
CN106326099B (en) A kind of method, apparatus and electronic equipment for program tracking
CN109902157A (en) A kind of training sample validation checking method and device
CN108304387A (en) The recognition methods of noise word, device, server group and storage medium in text
US10785236B2 (en) Generation of malware traffic signatures using natural language processing by a neural network
CN107341027A (en) The generation method and generating means of user interface
CN107679937A (en) Customize method, system, storage medium and the equipment of service function
CN106775906A (en) Business flow processing method and device
CN109634569A (en) Process implementation method, device, equipment and readable storage medium storing program for executing based on note
CN108231074A (en) A kind of data processing method, voice assistant equipment and computer readable storage medium
CN110008352A (en) Entity finds method and device
CN110674355B (en) DSL application system for describing data labeling task and method thereof
Akpınar et al. Heuristic role detection of visual elements of web pages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant