CN109062972A - Web page classification method, device and computer-readable storage medium
- Publication number
- CN109062972A CN201810694720.0A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
Abstract
The present invention provides a web page classification method, device and storage medium. The method obtains web page links from seed pages and obtains the page source code of the pages to be classified that those links point to. It then filters noise from the page source code to obtain the screening text of each page to be classified, and performs word segmentation and stop-word removal on the screening text to obtain the usable word set of each page to be classified. Next, the method extracts core keywords from the usable word set to obtain the core keyword set of each page to be classified, calculates the average of the core keyword word vectors of each page, and inputs that average into a trained web page classification model to obtain the classification result of each page to be classified. With the present invention, the pages to be classified that the links in the seed pages point to can be classified automatically.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a web page classification method, device and computer-readable storage medium.
Background
With the rapid development of Internet and Web technology, the number of web pages on the Internet keeps growing and data resources keep getting richer, providing a potential data source for all kinds of data-intensive applications. However, the sheer volume of information makes it difficult for people to process it, and traditional manual information processing clearly can no longer meet the demands of large-scale data processing. Against this background, automatically obtaining the effective text content of massive numbers of web pages and automatically classifying those pages is the key to organizing and managing Internet resources.
Summary of the invention
In view of the above, the present invention provides a web page classification method, device and computer-readable storage medium, whose main purpose is to classify web pages automatically by combining crawler technology with a neural network model.
To achieve the above object, the present invention provides a web page classification method, comprising:
Obtaining step: obtaining web page links from seed pages, and obtaining page source code from the pages to be classified that the links point to;
Pre-processing step: filtering noise from the page source code to obtain the screening text of each page to be classified, and performing word segmentation and stop-word removal on the screening text to obtain the usable word set of each page to be classified;
Extraction step: extracting core keywords from the usable word set to obtain the core keyword set of each page to be classified;
Calculation step: calculating the average of the core keyword word vectors of each page to be classified, and inputting the average into a pre-trained web page classification model to obtain the classification result of each page to be classified; and
Loop step: taking the pages to be classified for which classification results have been obtained as new seed pages, and returning to the obtaining step.
Preferably, the training of the web page classification model comprises:
Labeling a preset number of pre-selected seed pages with page types;
Pre-processing the page source code of the seed pages to obtain the usable word set of each seed page;
Extracting core keywords from the usable word set to obtain the core keyword set of each seed page;
Calculating the average of the core keyword word vectors of each seed page; and
Training a neural network model with the average of the core keyword word vectors of each seed page and the corresponding page type label to obtain the web page classification model.
Preferably, the screening text comprises the text inside the title tag, the keywords tag and the description tag of the page source code, and the word segmentation uses one or more of a string-matching-based segmentation method, an understanding-based segmentation method and a statistics-based segmentation method.
Preferably, the method further comprises:
Setting the number of times the loop step is to be executed, and ending the loop step once the set number has been reached.
Preferably, the method further comprises:
Storing in a database the web page links corresponding to the seed pages labeled with page types and to the pages to be classified for which classification results have been obtained;
When an obtained web page link already exists in the database, terminating the subsequent operations for that link.
The present invention also provides an electronic device comprising a memory and a processor. The memory contains a web page classification program which, when executed by the processor, implements the following steps:
Obtaining step: obtaining web page links from seed pages, and obtaining page source code from the pages to be classified that the links point to;
Pre-processing step: filtering noise from the page source code to obtain the screening text of each page to be classified, and performing word segmentation and stop-word removal on the screening text to obtain the usable word set of each page to be classified;
Extraction step: extracting core keywords from the usable word set to obtain the core keyword set of each page to be classified;
Calculation step: calculating the average of the core keyword word vectors of each page to be classified, and inputting the average into a pre-trained web page classification model to obtain the classification result of each page to be classified; and
Loop step: taking the pages to be classified for which classification results have been obtained as new seed pages, and returning to the obtaining step.
Preferably, the training of the web page classification model comprises:
Labeling a preset number of pre-selected seed pages with page types;
Pre-processing the page source code of the seed pages to obtain the usable word set of each seed page;
Extracting core keywords from the usable word set to obtain the core keyword set of each seed page;
Calculating the average of the core keyword word vectors of each seed page; and
Training a neural network model with the average of the core keyword word vectors of each seed page and the corresponding page type label to obtain the web page classification model.
Preferably, the screening text comprises the text inside the title tag, the keywords tag and the description tag of the page source code, and the word segmentation uses one or more of a string-matching-based segmentation method, an understanding-based segmentation method and a statistics-based segmentation method.
Preferably, the web page classification program, when executed by the processor, further implements the following steps:
Storing in a database the web page links corresponding to the seed pages labeled with page types and to the pages to be classified for which classification results have been obtained;
When an obtained web page link already exists in the database, terminating the subsequent operations for that link.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium containing a web page classification program which, when executed by a processor, implements any of the steps of the web page classification method described above.
The web page classification method, device and computer-readable storage medium proposed by the present invention obtain web page links from seed pages and obtain page source code from the pages to be classified that the links point to. Noise is then filtered from the page source code to obtain a screening text consisting of the text in the title, keywords and description tags. The screening text undergoes word segmentation and stop-word removal to produce a usable word set, from which core keywords are extracted with the TF-IDF algorithm to give the core keyword set of each page to be classified. The average of the core keyword word vectors of each page is then calculated and input into the web page classification model to obtain the classification result of the page. Because a page for which a classification result has been obtained can serve as a new seed page, from which further links and the corresponding page source code are obtained, the present invention makes it possible to classify a large number of web pages automatically.
Brief description of the drawings
Fig. 1 is a schematic diagram of a preferred embodiment of the electronic device of the present invention;
Fig. 2 is a program module diagram of the web page classification program in Fig. 1;
Fig. 3 is a flow chart of a preferred embodiment of the web page classification method of the present invention.
The realization of the object, the functional features and the advantages of the present invention are further described below with reference to the embodiments and the accompanying drawings.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The present invention provides an electronic device. Fig. 1 is a schematic diagram of a preferred embodiment of the electronic device 1 of the present invention. In this embodiment, the electronic device 1 uses crawler technology to crawl web page links and page source code, pre-processes the page source code to obtain usable words, and from these obtains the core keyword set of each page to be classified. It then uses the average of the core keyword word vectors of each page to be classified together with a pre-trained web page classification model to obtain the classification result of each page to be classified.
The electronic device 1 may be a server, smartphone, tablet computer, portable computer, desktop computer or other terminal device with storage and computing capability. In one embodiment, when the electronic device 1 is a server, the server may be one or more of a rack server, a blade server, a tower server, a cabinet server and the like.
The electronic device 1 includes a memory 11, a processor 12, a network interface 13 and a communication bus 14.
The memory 11 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as flash memory, a hard disk, a multimedia card or a card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as the hard disk of the electronic device 1. In other embodiments, the readable storage medium may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card fitted to the electronic device 1.
In this embodiment, the readable storage medium of the memory 11 is typically used to store the operating system, the web page classification program 10, the web page classification model, and the web page links corresponding to the seed pages labeled with page types and to the pages to be classified for which classification results have been obtained. The memory 11 may also be used to temporarily store data that has been output or is about to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a microprocessor or another data processing chip, used to run the program code or process the data stored in the memory 11, for example to execute the web page classification program 10.
The network interface 13 may include a standard wired interface and a wireless interface (such as a Wi-Fi interface). It is typically used to establish a communication connection between the electronic device 1 and other electronic equipment or systems.
The communication bus 14 provides the communication connections between these components.
Fig. 1 shows only the electronic device 1 with components 11-14 and the web page classification program 10. It should be understood that not all of the components shown are required; more or fewer components may be implemented instead.
Optionally, the electronic device 1 may also include a display, which may also be called a display screen or display unit. In some embodiments it may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) display or the like. The display is used to show the information processed in the electronic device 1 and to present a visual user interface.
Optionally, the electronic device 1 further includes a touch sensor. The region that the touch sensor provides for the user's touch operations is called the touch area. The touch sensor may be a resistive touch sensor, a capacitive touch sensor or the like. It may be a contact touch sensor or a proximity touch sensor. It may also be a single sensor or multiple sensors arranged, for example, in an array. The user can start the web page classification program 10 by touching the touch area.
In addition, the area of the display of the electronic device 1 may be the same as or different from the area of the touch sensor. Optionally, the display and the touch sensor are stacked to form a touch display screen, and the device detects touch operations triggered by the user through the touch display screen.
The electronic device 1 may also include a radio frequency (RF) circuit, sensors, an audio circuit and so on, which are not described in detail here.
In the above embodiment, the processor 12 implements the following steps when it executes the web page classification program 10 stored in the memory 11:
Obtaining step: obtaining web page links from seed pages, and obtaining page source code from the pages to be classified that the links point to;
Pre-processing step: filtering noise from the page source code to obtain the screening text of each page to be classified, and performing word segmentation and stop-word removal on the screening text to obtain the usable word set of each page to be classified;
Extraction step: extracting core keywords from the usable word set to obtain the core keyword set of each page to be classified;
Calculation step: calculating the average of the core keyword word vectors of each page to be classified, and inputting the average into a pre-trained web page classification model to obtain the classification result of each page to be classified; and
Loop step: taking the pages to be classified for which classification results have been obtained as new seed pages, and returning to the obtaining step.
For a detailed description of these steps, refer to the explanation below of Fig. 2, the program module diagram of a preferred embodiment of the web page classification program 10, and of Fig. 3, the flow chart of a preferred embodiment of the web page classification method.
In other embodiments, the web page classification program 10 may be divided into multiple modules, which are stored in the memory 11 and executed by the processor 12 to carry out the present invention. A module in the sense of the present invention is a series of computer program instruction segments capable of performing a specific function.
Fig. 2 is a program module diagram of a preferred embodiment of the web page classification program 10 in Fig. 1. In this embodiment, the web page classification program 10 may be divided into an obtaining module 110, a pre-processing module 120, an extraction module 130, a calculation module 140, a model training module 150 and a model application module 160.
The obtaining module 110 obtains web page links and page source code. For example, the obtaining module 110 uses a general-purpose web crawler to obtain web page links from the seed pages and to obtain page source code from the pages to be classified that the links point to.
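The link-gathering part of the obtaining module can be sketched as follows. This is a minimal illustration built on the Python standard library rather than the crawler the patent has in mind; the names LinkCollector, extract_links and fetch_source, and the example.com URLs, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects absolute http(s) URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                url = urljoin(self.base_url, value)  # resolve relative links
                if url.startswith(("http://", "https://")) and url not in self.links:
                    self.links.append(url)


def extract_links(seed_url, seed_html):
    """Return the de-duplicated links found in a seed page, in document order."""
    collector = LinkCollector(seed_url)
    collector.feed(seed_html)
    return collector.links


def fetch_source(url, timeout=10):
    """Download the source of a page to be classified (defined here, not called)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


sample_html = (
    '<a href="/sports/nba.html">NBA</a> '
    '<a href="https://example.com/finance">Finance</a> '
    '<a href="/sports/nba.html">again</a>'
)
links = extract_links("https://example.com/index.html", sample_html)
```

A production crawler would additionally need politeness controls (robots.txt, rate limits) and character-set detection, none of which the source text specifies.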
The pre-processing module 120 pre-processes the page source code to obtain the usable word set of each page to be classified. In this embodiment, the pre-processing module 120 first uses regular expressions to filter noise from the page source code and obtain the text inside the title tag, the keywords tag and the description tag, i.e. the text in <title>, <keywords> and <description>, which serves as the screening text of each page to be classified. It then performs word segmentation and stop-word removal on the screening text to obtain the usable word set of each page to be classified.
A regular expression is typically used to search for and replace text that matches a given pattern or rule. Each regular expression can filter out a corresponding kind of page noise, including advertisements, navigation bars, JavaScript code, CSS style code, HTML tags, punctuation marks, special characters and so on.
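A minimal sketch of this regex-based screening is shown below. It assumes, as is usual in HTML, that keywords and description are carried in <meta name="..." content="..."> elements with the name attribute written before content; the source only names the tags and does not give the expressions, so the patterns and the function name screening_text are illustrative.

```python
import re

# Text of the <title> element.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# content of <meta name="keywords"> / <meta name="description"> (assumed attribute order).
META_RE = re.compile(
    r"""<meta\s+name=["'](keywords|description)["']\s+content=["'](.*?)["']""",
    re.IGNORECASE | re.DOTALL,
)
# Script and style blocks are treated as noise and removed first.
NOISE_RE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)


def screening_text(source):
    """Return title, keywords and description text joined by single spaces."""
    cleaned = NOISE_RE.sub(" ", source)
    parts = []
    title = TITLE_RE.search(cleaned)
    if title:
        parts.append(title.group(1).strip())
    for _name, content in META_RE.findall(cleaned):
        parts.append(content.strip())
    return " ".join(part for part in parts if part)


page = """<html><head><title>NBA Finals Recap</title>
<meta name="keywords" content="basketball, NBA, finals">
<meta name="description" content="Game report and statistics">
<script>var title = "<title>ad</title>";</script>
<style>body { color: red; }</style></head><body>...</body></html>"""
text = screening_text(page)
```

Regular expressions are brittle against real-world markup; an HTML parser would be more robust, but the regex form is what the text describes.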
Word segmentation is the foundation of text processing. It may use one or more of a string-matching-based segmentation method, an understanding-based segmentation method and a statistics-based segmentation method. The string-matching-based method is also known as dictionary-based segmentation. In this embodiment, the jieba segmenter may be used to segment the screening text.
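To illustrate the string-matching (dictionary-based) family, here is a sketch of forward maximum matching, one classic member of that family. It is not the algorithm jieba uses; the toy dictionary and the max_len value are assumptions made for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text greedily, always taking the longest dictionary word available."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word is
        # found; a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


toy_dictionary = {"网页", "分类", "方法", "网页分类"}
tokens = forward_max_match("网页分类方法", toy_dictionary)
```

Because the match is greedy, the result depends heavily on dictionary coverage, which is one reason statistical methods are often combined with it.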
Stop words are mainly function words such as conjunctions, prepositions, auxiliary words and modal particles, and sometimes also pronouns or numerals. These function words usually have no concrete meaning of their own and only play a role inside a complete sentence, for example "so", "this", "that" and various Chinese particles. In this embodiment, the screening text can be checked against a preset stop-word list to remove stop words, giving the usable word set of each page to be classified.
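Checking tokens against a stop-word list amounts to a simple filter. In the sketch below the stop-word list is a small hypothetical sample, not the preset list referred to in the text.

```python
# Hypothetical sample of a stop-word list; a real list would be far longer.
STOP_WORDS = {"的", "了", "这", "那", "所以", "因此"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words and whitespace-only tokens, keeping the original order."""
    return [t for t in tokens if t.strip() and t not in stop_words]


usable_words = remove_stop_words(["这", "网页", "的", "分类", " ", "方法"])
```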
The extraction module 130 extracts core keywords from the usable word set to obtain the core keyword set of each page to be classified. In this embodiment, the term frequency-inverse document frequency (TF-IDF) algorithm is used together with a preset corpus (such as the Chinese Wikipedia corpus): usable words whose TF-IDF value exceeds a preset threshold are taken as core keywords, giving the core keyword set of each page to be classified.
TF-IDF is a statistical method for evaluating how important a word is to one document within a document collection or corpus. In this embodiment, the TF-IDF algorithm evaluates how important each usable word of a page to be classified is to that page, and the usable words whose TF*IDF value exceeds the preset threshold are taken as the core keywords of the page. Term frequency (TF) is the frequency with which a usable word occurs in a page, i.e. the number of times the word occurs in a given page to be classified divided by the total number of occurrences of all usable words in that page. Inverse document frequency (IDF) can be regarded as a weight on the importance of a usable word to a page to be classified: the higher the frequency of a usable word within a certain class of pages and the lower its frequency across all pages, the larger the IDF value and the more important the word is to the page to be classified.
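A small worked sketch of the thresholding follows. TF is computed as defined in the text; for IDF the common form log(N / (1 + df)) is assumed, since the source does not state a formula. The toy corpus, the threshold of 0.1 and the function names are all illustrative.

```python
import math
from collections import Counter


def tf_idf(page_words, corpus):
    """Return {word: TF * IDF} for one page against a corpus of word sets."""
    counts = Counter(page_words)
    total = len(page_words)
    scores = {}
    for word, count in counts.items():
        tf = count / total                              # share of the page's words
        df = sum(1 for doc in corpus if word in doc)    # documents containing the word
        idf = math.log(len(corpus) / (1 + df))          # assumed smoothing variant
        scores[word] = tf * idf
    return scores


def core_keywords(page_words, corpus, threshold):
    """Keep the words whose TF-IDF score exceeds the threshold."""
    return {w for w, s in tf_idf(page_words, corpus).items() if s > threshold}


toy_corpus = [
    {"basketball", "game", "score"},
    {"stock", "market", "game"},
    {"weather", "rain", "report"},
    {"stock", "price", "report"},
]
page_words = ["basketball", "basketball", "game", "report"]
keywords = core_keywords(page_words, toy_corpus, threshold=0.1)
```

Here "basketball" occurs often in the page but in only one corpus document, so it clears the threshold, while "game" and "report" are common across the corpus and are dropped.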
The calculation module 140 maps core keywords to word vectors and calculates the average of the core keyword word vectors of each page. In this embodiment, the core keyword word vectors of a page to be classified use a distributed representation. A distributed word vector is a low-dimensional real-valued vector that maps each core keyword to a point in a low-dimensional space. This representation is not unique; it only needs to provide a certain degree of discrimination. The distance between distributed word vectors can be measured with the traditional Euclidean distance or with the cosine distance. With vectors represented in this way, the distance between "mic" and "microphone" is far smaller than the distance between "mic" and "sunlight". The model application module 160 relies on exactly this property to classify web pages.
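The averaging and the two distance measures can be sketched in plain Python. The three-dimensional vectors below are invented for illustration; real word vectors would come from a trained embedding model and have far more dimensions.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up vectors: the first two keywords are near-synonyms, the third is unrelated.
toy_vectors = {
    "mic": [0.9, 0.1, 0.0],
    "microphone": [0.8, 0.2, 0.0],
    "sunlight": [0.0, 0.1, 0.9],
}
page_vector = mean_vector([toy_vectors["mic"], toy_vectors["microphone"]])
near = cosine_distance(toy_vectors["mic"], toy_vectors["microphone"])
far = cosine_distance(toy_vectors["mic"], toy_vectors["sunlight"])
```

Averaging discards word order and weighting, which keeps the page feature the same size regardless of how many core keywords a page has.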
The model training module 150 trains a neural network model with the average of the core keyword word vectors of each pre-selected seed page and the corresponding page type label, producing the web page classification model. The page type labels may be assigned manually or automatically. For example, when the number of pre-selected seed pages is large, seed pages taken from financial websites can be labeled automatically as finance and seed pages taken from sports websites can be labeled automatically as sports. For greater precision, the pre-selected seed pages can be given multi-level labels by manual annotation, for example labeling a seed page as sport-basketball-NBA, so that web page resources can later be used more effectively, for instance to subdivide page types. It will be understood that page type labeling can also combine manual and automatic annotation.
The neural network model may be a neural-network-based deep learning model, including but not limited to a convolutional neural network, a deep neural network or a recurrent neural network. After the calculation module 140 has obtained the average of the core keyword word vectors of each pre-selected seed page, the model training module 150 uses these averages and the corresponding page type labels as sample data and, through training and validation, adjusts the model parameters to obtain the trained web page classification model.
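As a rough illustration of training on (average vector, label) pairs, the sketch below fits a single-layer softmax network by stochastic gradient descent in plain Python. This is a deliberately small stand-in: the source names convolutional, deep and recurrent networks without fixing an architecture, and the data, hyperparameters and function names here are assumptions.

```python
import math
import random


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(samples, labels, n_classes, epochs=200, lr=0.5, seed=0):
    """Fit one weight row and bias per class with cross-entropy loss."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(dim)] for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            probs = softmax([sum(w * v for w, v in zip(row, x)) + b
                             for row, b in zip(weights, biases)])
            for k in range(n_classes):
                # Gradient of the loss w.r.t. the k-th score is p_k - 1[k == y].
                grad = probs[k] - (1.0 if k == y else 0.0)
                for j in range(dim):
                    weights[k][j] -= lr * grad * x[j]
                biases[k] -= lr * grad
    return weights, biases


def predict(model, x):
    """Return the index of the highest-scoring class."""
    weights, biases = model
    scores = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return max(range(len(scores)), key=scores.__getitem__)


PAGE_TYPES = ["finance", "sports"]  # hypothetical label set
seed_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
seed_labels = [0, 0, 1, 1]
model = train(seed_vectors, seed_labels, n_classes=len(PAGE_TYPES))
predicted_type = PAGE_TYPES[predict(model, [0.15, 0.85])]
```

In practice a held-out validation set would be used to tune the parameters, as the text indicates; the toy data here is too small for that.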
The model application module 160 uses the average of the core keyword word vectors of a page to be classified together with the web page classification model to obtain the classification result of the page. In this embodiment, the calculated average of the core keyword word vectors serves as the feature vector of the page to be classified. Using the web page classification model, the cosine distance between this average and the average of the core keyword word vectors of each seed page is calculated, and the page type label of the seed page with the smallest cosine distance, or with a cosine distance below a threshold, is taken as the page type of the page to be classified.
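The nearest-seed rule can be sketched as follows. This version picks the seed with the smallest cosine distance and, optionally, rejects the match when even that distance exceeds a threshold, which is one possible reading of the "smallest or below a threshold" wording. The seed vectors and the max_distance parameter are hypothetical.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def classify_by_nearest_seed(page_vector, seeds, max_distance=None):
    """seeds is a list of (average_vector, page_type) pairs for labeled seed pages."""
    best_type, best_distance = None, float("inf")
    for seed_vector, page_type in seeds:
        distance = cosine_distance(page_vector, seed_vector)
        if distance < best_distance:
            best_type, best_distance = page_type, distance
    if max_distance is not None and best_distance > max_distance:
        return None  # no seed is close enough to trust the label
    return best_type


toy_seeds = [([1.0, 0.0], "finance"), ([0.0, 1.0], "sports")]
page_type = classify_by_nearest_seed([0.2, 0.9], toy_seeds)
```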
In one embodiment, the web page classification model includes access models for multiple page types. The K seed pages nearest to the average of the core keyword word vectors of the page to be classified are found, and the corresponding page categories and their probabilities are counted. The average of the core keyword word vectors of the page to be classified is then input into the access model of each category in turn, in descending order of probability. In this way the multi-class problem of web page classification is converted into multiple binary classification problems.
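One way to read this embodiment is sketched below: rank the categories by their share among the K nearest seed pages, then ask a per-category binary model in that order and accept the first one that says yes. The binary models are stood in for by hypothetical lambda functions, and all names and data are illustrative.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def ranked_categories(page_vector, seeds, k):
    """Return [(category, probability)] among the k nearest seeds, most likely first."""
    nearest = sorted(seeds, key=lambda s: cosine_distance(page_vector, s[0]))[:k]
    votes = Counter(category for _vector, category in nearest)
    return [(category, count / len(nearest)) for category, count in votes.most_common()]


def classify_one_vs_rest(page_vector, seeds, binary_models, k=3):
    """Try each category's binary model in descending order of probability."""
    for category, _probability in ranked_categories(page_vector, seeds, k):
        if binary_models[category](page_vector):
            return category
    return None


toy_seeds = [
    ([0.0, 1.0], "sports"),
    ([0.1, 1.0], "sports"),
    ([1.0, 0.0], "finance"),
    ([1.0, 0.1], "finance"),
]
# Hypothetical stand-ins for trained per-category binary models.
toy_models = {
    "sports": lambda v: v[1] > v[0],
    "finance": lambda v: v[0] > v[1],
}
ranking = ranked_categories([0.1, 0.9], toy_seeds, k=3)
category = classify_one_vs_rest([0.1, 0.9], toy_seeds, toy_models, k=3)
```

Ordering by probability means the most plausible binary model is usually consulted first, so most pages are resolved after a single model evaluation.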
In another embodiment, the web page classification model is obtained by training in another program; that is, the web page classification program 10 need not include the model training module 150.
In addition, the present invention also provides a web page classification method. Fig. 3 is a flow chart of a preferred embodiment of the web page classification method of the present invention. When the processor 12 of the electronic device 1 executes the web page classification program 10 stored in the memory 11, the following steps of the web page classification method are implemented:
Step S300: the obtaining module 110 obtains web page links from the seed pages and obtains page source code from the pages to be classified that the links point to. For example, the obtaining module 110 uses a general-purpose web crawler to obtain all web page links from a preset number of pre-selected seed pages and obtains page source code from the pages to be classified that the links point to.
Step S301: the pre-processing module 120 filters noise from the page source code to obtain the screening text of each page to be classified, and performs word segmentation and stop-word removal on the screening text to obtain the usable word set of each page to be classified. The screening text comprises the text inside the title tag, the keywords tag and the description tag of the page source code, and the word segmentation uses one or more of a string-matching-based segmentation method, an understanding-based segmentation method and a statistics-based segmentation method. For the process of obtaining the screening text from the page source code and of segmenting it and removing stop words, refer to the detailed description of the pre-processing module 120 above, which is not repeated here.
Step S302: the extraction module 130 extracts core keywords from the usable word set to obtain the core keyword set of each page to be classified. For example, the extraction module 130 uses the TF-IDF algorithm together with the Chinese Wikipedia corpus to extract the usable words whose TF*IDF value exceeds a preset threshold as the core keywords of the page to be classified.
Step S303: the calculation module 140 calculates the average of the core keyword word vectors of each page to be classified, and the model application module 160 inputs the average into the web page classification model trained by the model training module 150, which outputs the classification result of each page to be classified.
Step S304: the pages to be classified for which classification results have been obtained are taken as new seed pages, and steps S300-S303 are repeated.
In other embodiments, the following is also included between step S303 and step S304:
Setting the number of times step S304 is to be executed; once the set number has been reached, step S304 is no longer executed and the web page classification operation ends.
For ease of description, seed pages are divided here into first-generation seed pages, second-generation seed pages, third-generation seed pages and so on. Similarly, pages to be classified can be divided into first-generation pages to be classified, second-generation pages to be classified and so on. The seed pages used for model training are first-generation seed pages. The first-generation pages to be classified are the pages that all the links in the first-generation seed pages point to, and they can serve as second-generation seed pages, and so on.
For example, suppose the execution count of step S304 is set to 2. After the classification results of the first-generation pages to be classified have been obtained, step S304 is executed for the first time: the first-generation pages to be classified are taken as second-generation seed pages and steps S300-S303 are repeated, yielding the classification results of the second-generation pages to be classified. Step S304 is then executed a second time, which yields the classification results of the third-generation pages to be classified. After that, step S304 is not executed again and the web page classification operation ends.
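The generation-limited loop can be sketched as below, using an in-memory dictionary in place of the real web. The function crawl_and_classify and its callbacks get_links and classify are hypothetical; in a full system they would wrap steps S300 to S303.

```python
def crawl_and_classify(seed_urls, get_links, classify, loop_count):
    """Run the initial pass plus loop_count repetitions of the loop step (S304)."""
    results = {}
    current_seeds = list(seed_urls)
    for _generation in range(loop_count + 1):
        next_seeds = []
        for seed in current_seeds:
            for url in get_links(seed):        # step S300: links in the seed page
                results[url] = classify(url)   # steps S301-S303 for the linked page
                next_seeds.append(url)
        current_seeds = next_seeds             # step S304: classified pages become seeds
    return results


# Toy link graph: each key links to the pages in its list.
toy_web = {"seed": ["a"], "a": ["b"], "b": ["c"], "c": ["d"]}
results = crawl_and_classify(
    ["seed"],
    get_links=lambda url: toy_web.get(url, []),
    classify=lambda url: "type-of-" + url,
    loop_count=2,
)
```

With loop_count=2 the sketch classifies three generations of pages (a, b and c) and stops before reaching d, matching the worked example in the text.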
In other embodiments, the web page links corresponding to the seed pages labeled with page types and to the pages to be classified for which classification results have been obtained can also be stored in a database; when an obtained web page link already exists in the database, the subsequent operations for that link are terminated. For example, after the obtaining module 110 obtains a web page link from a seed page, it queries the database for that link. If the query succeeds, the page the link points to already has a classification result and the operation need not be repeated; if the query fails, the subsequent steps are executed as normal.
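The database check can be sketched with an in-memory SQLite database. The table name classified, its columns and the helper functions are assumptions made for this example; the source does not say which database or schema is used.

```python
import sqlite3


def open_store():
    """Create an in-memory store mapping each web page link to its page type."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS classified (url TEXT PRIMARY KEY, page_type TEXT)"
    )
    return conn


def already_classified(conn, url):
    row = conn.execute("SELECT 1 FROM classified WHERE url = ?", (url,)).fetchone()
    return row is not None


def record(conn, url, page_type):
    conn.execute(
        "INSERT OR REPLACE INTO classified (url, page_type) VALUES (?, ?)",
        (url, page_type),
    )
    conn.commit()


def process_link(conn, url, classify):
    """Classify and store a link, or return None if a result already exists."""
    if already_classified(conn, url):
        return None  # query succeeded: skip the subsequent steps
    page_type = classify(url)
    record(conn, url, page_type)
    return page_type


store = open_store()
record(store, "https://example.com/seed", "finance")  # a labeled seed page
skipped = process_link(store, "https://example.com/seed", lambda url: "sports")
fresh = process_link(store, "https://example.com/new", lambda url: "sports")
```

Keying the table on the link also prevents the loop step from revisiting pages when the link graph contains cycles.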
The web page classification method proposed in this embodiment obtains web page links from sub-pages, obtains web page source code from the web pages to be classified to which the links point, and performs noise filtering on the source code to obtain screening text comprising the text of the title tag, the keywords tag and the description tag. It then performs word segmentation and stop-word removal on the screening text to obtain an available word set, extracts core keywords from the available word set using the TF-IDF algorithm to obtain the core keyword set of each web page to be classified, calculates the average of the core keyword word vectors of each web page to be classified, and inputs that average into the web page classification model to obtain the classification result of the web page to be classified. Web page links are then obtained from the classified web pages, and the above steps are repeated. A web crawler is used to obtain a large amount of web page data, enabling deep crawling of web page source code and web page links, and automatic web page classification is achieved by training a deep learning model. Therefore, with the present invention, a large number of web pages can be classified automatically.
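The per-page processing summarized above can be sketched with the standard library alone. Several parts are assumptions made for illustration: whitespace splitting stands in for a real word segmenter, `STOP_WORDS` is a toy list, the smoothed IDF formula is one common TF-IDF variant rather than the one the source necessarily uses, and the word vectors are supplied by the caller. The pre-trained classification model is not shown; the vector returned by `mean_vector` would be its input.

```python
import math
from collections import Counter
from html.parser import HTMLParser
from typing import Dict, List, Sequence

STOP_WORDS = {"the", "a", "an", "of", "and", "for", "to", "in"}  # toy list


class MetaTextParser(HTMLParser):
    """Keeps only <title> text and keywords/description <meta> content."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attr.get("name") or "").lower() in (
            "keywords",
            "description",
        ):
            self.parts.append(attr.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def screening_text(html: str) -> str:
    """Noise filtering: drop everything except the three tag texts."""
    parser = MetaTextParser()
    parser.feed(html)
    return " ".join(parser.parts)


def usable_words(text: str) -> List[str]:
    """Placeholder segmentation (whitespace split) plus stop-word removal."""
    tokens = [t.strip(".,;:!?|-").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]


def core_keywords(docs: Sequence[List[str]], top_k: int = 3) -> List[List[str]]:
    """Top-k words per document by TF-IDF (smoothed IDF variant)."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    selected: List[List[str]] = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            w: (c / len(doc)) * (math.log((1 + n) / (1 + df[w])) + 1.0)
            for w, c in tf.items()
        }
        # Sort by descending score, then alphabetically for stable output.
        selected.append(sorted(scores, key=lambda w: (-scores[w], w))[:top_k])
    return selected


def mean_vector(
    words: Sequence[str], vectors: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the word vectors of the keywords that have a vector."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]
```

Averaging gives every page a fixed-length input regardless of how many core keywords it has, which is what lets a single model accept all pages.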
In addition, an embodiment of the present invention further provides a computer-readable storage medium. The computer-readable storage medium may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like.
The specific implementation of the computer-readable storage medium of the present invention is substantially the same as the specific implementations of the above web page classification method and electronic device 1; please refer to the related descriptions, which are not repeated here.
It should be noted that, in this document, the terms "include", "comprise" and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, device, article or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article or method. In addition, the technical solutions of the embodiments may be combined with one another, provided that the combination can be implemented by a person of ordinary skill in the art. When a combination of technical solutions is contradictory or cannot be implemented, the combination should be regarded as not existing and as falling outside the scope of protection claimed by the present invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above and includes a number of instructions that cause a server to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit the scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of protection of the present invention.
Claims (10)
1. A web page classification method, applied to an electronic device, characterized in that the method comprises:
an obtaining step: obtaining web page links from sub-pages, and obtaining web page source code from the web pages to be classified to which the web page links point;
a preprocessing step: performing noise filtering on the web page source code to obtain the screening text of each web page to be classified, and performing word segmentation and stop-word removal on the screening text to obtain the available word set of each web page to be classified;
an extraction step: extracting core keywords from the available word set to obtain the core keyword set of each web page to be classified;
a calculation step: calculating the average of the core keyword word vectors of each web page to be classified, and inputting the average into a pre-trained web page classification model to obtain the classification result of each web page to be classified; and
a loop step: taking the web pages to be classified for which classification results have been obtained as new sub-pages, and returning to the obtaining step.
2. The web page classification method according to claim 1, characterized in that the training of the web page classification model comprises:
labeling a preset number of pre-selected sub-pages with web page types;
preprocessing the web page source code of the sub-pages to obtain the available word set of each sub-page;
extracting core keywords from the available word set to obtain the core keyword set of each sub-page;
calculating the average of the core keyword word vectors of each sub-page; and
training a neural network model using the average of the core keyword word vectors of each sub-page and the corresponding web page type label to obtain the web page classification model.
3. The web page classification method according to claim 1 or 2, characterized in that the screening text comprises the text in the title tag, the keywords tag and the description tag of the web page source code, and the word segmentation method used in the word segmentation comprises one or more of a string-matching-based segmentation method, an understanding-based segmentation method and a statistics-based segmentation method.
4. The web page classification method according to claim 2, characterized in that the method further comprises:
setting the number of executions of the loop step, and ending the loop step when the set requirement is met.
5. The web page classification method according to claim 2, characterized in that the method further comprises:
storing, in a database, the web page links corresponding to the sub-pages labeled with web page types and to the web pages to be classified for which classification results have been obtained; and
when an obtained web page link already exists in the database, terminating subsequent operations for that web page link.
6. An electronic device comprising a memory and a processor, characterized in that the memory contains a web page classification program which, when executed by the processor, implements the following steps:
an obtaining step: obtaining web page links from sub-pages, and obtaining web page source code from the web pages to be classified to which the web page links point;
a preprocessing step: performing noise filtering on the web page source code to obtain the screening text of each web page to be classified, and performing word segmentation and stop-word removal on the screening text to obtain the available word set of each web page to be classified;
an extraction step: extracting core keywords from the available word set to obtain the core keyword set of each web page to be classified;
a calculation step: calculating the average of the core keyword word vectors of each web page to be classified, and inputting the average into a pre-trained web page classification model to obtain the classification result of each web page to be classified; and
a loop step: taking the web pages to be classified for which classification results have been obtained as new sub-pages, and returning to the obtaining step.
7. The electronic device according to claim 6, characterized in that the training of the web page classification model comprises:
labeling a preset number of pre-selected sub-pages with web page types;
preprocessing the web page source code of the sub-pages to obtain the available word set of each sub-page;
extracting core keywords from the available word set to obtain the core keyword set of each sub-page;
calculating the average of the core keyword word vectors of each sub-page; and
training a neural network model using the average of the core keyword word vectors of each sub-page and the corresponding web page type label to obtain the web page classification model.
8. The electronic device according to claim 6 or 7, characterized in that the screening text comprises the text in the title tag, the keywords tag and the description tag of the web page source code, and the word segmentation method used in the word segmentation comprises one or more of a string-matching-based segmentation method, an understanding-based segmentation method and a statistics-based segmentation method.
9. The electronic device according to claim 6, characterized in that the web page classification program, when executed by the processor, further implements the following steps:
storing, in a database, the web page links corresponding to the sub-pages labeled with web page types and to the web pages to be classified for which classification results have been obtained; and
when an obtained web page link already exists in the database, terminating subsequent operations for that web page link.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium contains a web page classification program which, when executed by a processor, implements the steps of the web page classification method according to any one of claims 1 to 5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810694720.0A CN109062972A (en) | 2018-06-29 | 2018-06-29 | Web page classification method, device and computer readable storage medium |
PCT/CN2018/107490 WO2020000717A1 (en) | 2018-06-29 | 2018-09-26 | Web page classification method and device, and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810694720.0A CN109062972A (en) | 2018-06-29 | 2018-06-29 | Web page classification method, device and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109062972A (en) | 2018-12-21 |
Family ID: 64817979
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810694720.0A Withdrawn CN109062972A (en) | 2018-06-29 | 2018-06-29 | Web page classification method, device and computer readable storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109062972A (en) |
WO (1) | WO2020000717A1 (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109783175A (en) * | 2019-01-16 | 2019-05-21 | 平安普惠企业管理有限公司 | Application icon management method, device, readable storage medium storing program for executing and terminal device |
CN110191096A (en) * | 2019-04-30 | 2019-08-30 | 安徽工业大学 | A kind of term vector homepage invasion detection method based on semantic analysis |
CN110427628A (en) * | 2019-08-02 | 2019-11-08 | 杭州安恒信息技术股份有限公司 | Web assets classes detection method and device based on neural network algorithm |
CN110545355A (en) * | 2019-07-31 | 2019-12-06 | 努比亚技术有限公司 | intelligent reminding method, terminal and computer readable storage medium |
CN110705290A (en) * | 2019-09-29 | 2020-01-17 | 新华三信息安全技术有限公司 | Webpage classification method and device |
CN110750493A (en) * | 2019-09-03 | 2020-02-04 | 平安科技(深圳)有限公司 | Legal text filing method and device, readable storage medium and terminal equipment |
CN111382385A (en) * | 2020-02-21 | 2020-07-07 | 奇安信科技集团股份有限公司 | Webpage affiliated industry classification method and device |
CN111797299A (en) * | 2019-04-09 | 2020-10-20 | Oppo广东移动通信有限公司 | Model training method, webpage classification method, device, storage medium and equipment |
CN111931040A (en) * | 2020-06-30 | 2020-11-13 | 深圳市世强元件网络有限公司 | Recommendation method for service entry of service entity in network platform |
CN112256987A (en) * | 2020-10-19 | 2021-01-22 | 中国互联网金融协会 | Method, device, equipment and storage medium for monitoring overseas stock trading website |
CN112860726A (en) * | 2021-02-07 | 2021-05-28 | 天云融创数据科技(北京)有限公司 | Structured query statement classification model training method and device |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101794311B (en) * | 2010-03-05 | 2012-06-13 | 南京邮电大学 | Fuzzy data mining based automatic classification method of Chinese web pages |
CN103226578B (en) * | 2013-04-02 | 2015-11-04 | 浙江大学 | Towards the website identification of medical domain and the method for webpage disaggregated classification |
CN104035968B (en) * | 2014-05-20 | 2017-11-03 | 微梦创科网络科技(中国)有限公司 | The construction method and device of training corpus collection based on social networks |
CN106126512A (en) * | 2016-04-13 | 2016-11-16 | 北京天融信网络安全技术有限公司 | The Web page classification method of a kind of integrated study and device |
2018
- 2018-06-29 CN CN201810694720.0A patent/CN109062972A/en not_active Withdrawn
- 2018-09-26 WO PCT/CN2018/107490 patent/WO2020000717A1/en active Application Filing
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109783175A (en) * | 2019-01-16 | 2019-05-21 | 平安普惠企业管理有限公司 | Application icon management method, device, readable storage medium storing program for executing and terminal device |
CN111797299A (en) * | 2019-04-09 | 2020-10-20 | Oppo广东移动通信有限公司 | Model training method, webpage classification method, device, storage medium and equipment |
CN110191096A (en) * | 2019-04-30 | 2019-08-30 | 安徽工业大学 | A kind of term vector homepage invasion detection method based on semantic analysis |
CN110191096B (en) * | 2019-04-30 | 2023-05-09 | 安徽工业大学 | Word vector webpage intrusion detection method based on semantic analysis |
CN110545355B (en) * | 2019-07-31 | 2021-04-02 | 努比亚技术有限公司 | Intelligent reminding method, terminal and computer readable storage medium |
CN110545355A (en) * | 2019-07-31 | 2019-12-06 | 努比亚技术有限公司 | intelligent reminding method, terminal and computer readable storage medium |
CN110427628A (en) * | 2019-08-02 | 2019-11-08 | 杭州安恒信息技术股份有限公司 | Web assets classes detection method and device based on neural network algorithm |
CN110750493A (en) * | 2019-09-03 | 2020-02-04 | 平安科技(深圳)有限公司 | Legal text filing method and device, readable storage medium and terminal equipment |
CN110750493B (en) * | 2019-09-03 | 2022-08-09 | 平安科技(深圳)有限公司 | Legal text filing method and device, readable storage medium and terminal equipment |
CN110705290A (en) * | 2019-09-29 | 2020-01-17 | 新华三信息安全技术有限公司 | Webpage classification method and device |
CN111382385A (en) * | 2020-02-21 | 2020-07-07 | 奇安信科技集团股份有限公司 | Webpage affiliated industry classification method and device |
CN111382385B (en) * | 2020-02-21 | 2024-04-12 | 奇安信科技集团股份有限公司 | Method and device for classifying industries of web pages |
CN111931040A (en) * | 2020-06-30 | 2020-11-13 | 深圳市世强元件网络有限公司 | Recommendation method for service entry of service entity in network platform |
CN111931040B (en) * | 2020-06-30 | 2024-01-12 | 深圳市世强元件网络有限公司 | Recommendation method for service entry of service entity in network platform |
CN112256987A (en) * | 2020-10-19 | 2021-01-22 | 中国互联网金融协会 | Method, device, equipment and storage medium for monitoring overseas stock trading website |
CN112860726A (en) * | 2021-02-07 | 2021-05-28 | 天云融创数据科技(北京)有限公司 | Structured query statement classification model training method and device |
Also Published As
Publication number | Publication date |
---|---|
WO2020000717A1 (en) | 2020-01-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109062972A (en) | Web page classification method, device and computer readable storage medium | |
CN108629043B (en) | Webpage target information extraction method, device and storage medium | |
US11157830B2 (en) | Automated customized web portal template generation systems and methods | |
CN109325165B (en) | Network public opinion analysis method, device and storage medium | |
CN109815333B (en) | Information acquisition method and device, computer equipment and storage medium | |
CN109492222B (en) | Intention identification method and device based on concept tree and computer equipment | |
CN108959383A (en) | Analysis method, device and the computer readable storage medium of network public-opinion | |
US20230004604A1 (en) | Ai-augmented auditing platform including techniques for automated document processing | |
CN107679144A (en) | News sentence clustering method, device and storage medium based on semantic similarity | |
CN107704503A (en) | User's keyword extracting device, method and computer-readable recording medium | |
US9720912B2 (en) | Document management system, document management method, and document management program | |
CN112287069B (en) | Information retrieval method and device based on voice semantics and computer equipment | |
CN111984589A (en) | Document processing method, document processing device and electronic equipment | |
CN107870945A (en) | Content classification method and apparatus | |
CN114547315A (en) | Case classification prediction method and device, computer equipment and storage medium | |
CN106815253B (en) | Mining method based on mixed data type data | |
CN112650910A (en) | Method, device, equipment and storage medium for determining website update information | |
CN107168635A (en) | Information demonstrating method and device | |
CN112307314A (en) | Method and device for generating fine selection abstract of search engine | |
CN111382243A (en) | Text category matching method, text category matching device and terminal | |
CN110647504A (en) | Method and device for searching judicial documents | |
CN112579781A (en) | Text classification method and device, electronic equipment and medium | |
CN117251761A (en) | Data object classification method and device, storage medium and electronic device | |
CN111488452A (en) | Webpage tampering detection method, detection system and related equipment | |
CN111291561A (en) | Text recognition method, device and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20181221 |