CN109062972A - Web page classification method, device and computer readable storage medium - Google Patents

Web page classification method, device and computer readable storage medium

Publication number
CN109062972A
Authority
CN
China
Prior art keywords
webpage
web page
sorted
pages
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201810694720.0A
Other languages
Chinese (zh)
Inventor
吴壮伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201810694720.0A (published as CN109062972A)
Priority to PCT/CN2018/107490 (published as WO2020000717A1)
Publication of CN109062972A
Legal status: Withdrawn

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a web page classification method, device and storage medium. The method obtains web page links from seed pages, obtains page source code from the web pages to be classified that the links point to, filters noise from the page source code to obtain the filtered text of each web page to be classified, and performs word segmentation and stop-word removal on the filtered text to obtain the usable word set of each web page to be classified. The method then extracts core keywords from the usable word set to obtain the core keyword set of each web page to be classified, calculates the average of the core keyword word vectors of each web page to be classified, and inputs the average into a trained web page classification model to obtain the classification result of each web page to be classified. With the present invention, the web pages to be classified that the links in the seed pages point to can be classified automatically.

Description

Web page classification method, device and computer readable storage medium
Technical field
The present invention relates to the technical field of data processing, and in particular to a web page classification method, device and computer-readable storage medium.
Background
With the rapid development of Internet and Web technology, the number of web pages on the Internet keeps growing and data resources keep getting richer, providing potential data sources for all kinds of data-intensive applications. However, the sheer volume of information makes it difficult for people to process, and traditional manual information processing clearly can no longer meet the demands of large-scale data processing. Against this background, automatically obtaining the useful text content of massive numbers of web pages and classifying those pages automatically is the key to organizing and managing Internet resources.
Summary of the invention
In view of the above, the present invention provides a web page classification method, device and computer-readable storage medium, whose main purpose is to classify web pages automatically by combining crawler technology with a neural network model.
To achieve the above object, the present invention provides a web page classification method, comprising:
Acquisition step: obtain web page links from seed pages, and obtain page source code from the web pages to be classified that the links point to;
Preprocessing step: filter noise from the page source code to obtain the filtered text of each web page to be classified, then perform word segmentation and stop-word removal on the filtered text to obtain the usable word set of each web page to be classified;
Extraction step: extract core keywords from the usable word set to obtain the core keyword set of each web page to be classified;
Calculation step: calculate the average of the core keyword word vectors of each web page to be classified, and input the average into a pre-trained web page classification model to obtain the classification result of each web page to be classified; and
Loop step: take the web pages that have obtained classification results as new seed pages and return to the acquisition step.
Preferably, the training of the web page classification model comprises:
Labeling a preset number of pre-selected seed pages with web page types;
Preprocessing the page source code of the seed pages to obtain the usable word set of each seed page;
Extracting core keywords from the usable word set to obtain the core keyword set of each seed page;
Calculating the average of the core keyword word vectors of each seed page; and
Training a neural network model with the averages of the core keyword word vectors of the seed pages and the corresponding web page type labels to obtain the web page classification model.
Preferably, the filtered text comprises the text in the title tag, keywords tag and description tag of the page source code, and the word segmentation uses one or more of a string-matching-based segmentation method, an understanding-based segmentation method and a statistics-based segmentation method.
Preferably, the method further comprises:
Setting the number of times the loop step is executed, and ending the loop step when the set number is reached.
Preferably, the method further comprises:
Storing in a database the web page links corresponding to the seed pages labeled with web page types and to the web pages that have obtained classification results;
When an obtained web page link already exists in the database, ending the subsequent operations for that link.
The present invention also provides an electronic device comprising a memory and a processor, the memory containing a web page classification program which, when executed by the processor, implements the following steps:
Acquisition step: obtain web page links from seed pages, and obtain page source code from the web pages to be classified that the links point to;
Preprocessing step: filter noise from the page source code to obtain the filtered text of each web page to be classified, then perform word segmentation and stop-word removal on the filtered text to obtain the usable word set of each web page to be classified;
Extraction step: extract core keywords from the usable word set to obtain the core keyword set of each web page to be classified;
Calculation step: calculate the average of the core keyword word vectors of each web page to be classified, and input the average into a pre-trained web page classification model to obtain the classification result of each web page to be classified; and
Loop step: take the web pages that have obtained classification results as new seed pages and return to the acquisition step.
Preferably, the training of the web page classification model comprises:
Labeling a preset number of pre-selected seed pages with web page types;
Preprocessing the page source code of the seed pages to obtain the usable word set of each seed page;
Extracting core keywords from the usable word set to obtain the core keyword set of each seed page;
Calculating the average of the core keyword word vectors of each seed page; and
Training a neural network model with the averages of the core keyword word vectors of the seed pages and the corresponding web page type labels to obtain the web page classification model.
Preferably, the filtered text comprises the text in the title tag, keywords tag and description tag of the page source code, and the word segmentation uses one or more of a string-matching-based segmentation method, an understanding-based segmentation method and a statistics-based segmentation method.
Preferably, the web page classification program, when executed by the processor, further implements the following steps:
Storing in a database the web page links corresponding to the seed pages labeled with web page types and to the web pages that have obtained classification results;
When an obtained web page link already exists in the database, ending the subsequent operations for that link.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium containing a web page classification program which, when executed by a processor, implements any of the steps of the web page classification method described above.
The web page classification method, device and computer-readable storage medium proposed by the present invention obtain web page links from seed pages and obtain page source code from the web pages to be classified that the links point to; filter noise from the page source code to obtain filtered text comprising the text in the title tag, keywords tag and description tag; perform word segmentation and stop-word removal on the filtered text to obtain a usable word set; extract core keywords from the usable word set with the TF-IDF algorithm to obtain the core keyword set of each web page to be classified; and then calculate the average of the core keyword word vectors of each web page to be classified and input it into the web page classification model to obtain the classification result. Because a web page that has obtained a classification result can serve as a new seed page from which web page links and the corresponding page source code are obtained again, the present invention can classify a large number of web pages automatically.
Brief description of the drawings
Fig. 1 is a schematic diagram of a preferred embodiment of the electronic device of the present invention;
Fig. 2 is a program module diagram of the web page classification program in Fig. 1;
Fig. 3 is a flow chart of a preferred embodiment of the web page classification method of the present invention.
The realization of the object, the functions and the advantages of the present invention are further described with reference to the accompanying drawings in connection with the embodiments.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative work fall within the protection scope of the present invention.
The present invention provides an electronic device. Fig. 1 is a schematic diagram of a preferred embodiment of the electronic device 1 of the present invention. In this embodiment, the electronic device 1 uses crawler technology to crawl web page links and page source code, preprocesses the page source code to obtain usable words, derives the core keyword set of each web page to be classified, and then obtains the classification result of each web page to be classified from the average of its core keyword word vectors and a pre-trained web page classification model.
The electronic device 1 may be a terminal device with storage and computing functions, such as a server, smartphone, tablet computer, portable computer or desktop computer. In one embodiment, when the electronic device 1 is a server, the server may be one or more of a rack server, a blade server, a tower server, a cabinet server and the like.
The electronic device 1 includes memory 11, processor 12, network interface 13 and communication bus 14.
The memory 11 includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, hard disk, multimedia card or card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as its hard disk. In other embodiments, the readable storage medium may be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card fitted to the electronic device 1.
In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system, the web page classification program 10, the web page classification model, and the web page links corresponding to the seed pages labeled with web page types and to the web pages that have obtained classification results. The memory 11 may also temporarily store data that has been output or is to be output.
In some embodiments the processor 12 may be a central processing unit (CPU), a microprocessor or another data processing chip, used to run the program code stored in the memory 11 or to process data, for example to execute the web page classification program 10.
The network interface 13 may include a standard wired interface and a wireless interface (such as a Wi-Fi interface), and is generally used to establish a communication connection between the electronic device 1 and other electronic devices or systems.
The communication bus 14 implements the connections and communication between the above components.
Fig. 1 shows only the electronic device 1 with components 11-14 and the web page classification program 10. It should be understood that not all of the components shown are required; more or fewer components may be implemented instead.
Optionally, the electronic device 1 may further include a display, which may also be called a display screen or display unit. In some embodiments it may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) display or the like. The display shows the information processed in the electronic device 1 and a visual user interface.
Optionally, the electronic device 1 further includes a touch sensor. The region that the touch sensor provides for the user's touch operations is called the touch area. The touch sensor may be a resistive touch sensor, a capacitive touch sensor or the like, and may be a contact touch sensor or a proximity touch sensor. It may be a single sensor or a plurality of sensors arranged, for example, in an array. The user can start the web page classification program 10 by touching the touch area.
The area of the display of the electronic device 1 may be the same as or different from the area of the touch sensor. Optionally, the display and the touch sensor are stacked to form a touch display screen, on which the device detects touch operations triggered by the user.
The electronic device 1 may also include a radio frequency (RF) circuit, sensors, an audio circuit and so on, which are not described here.
In the above embodiment, the processor 12 implements the following steps when executing the web page classification program 10 stored in the memory 11:
Acquisition step: obtain web page links from seed pages, and obtain page source code from the web pages to be classified that the links point to;
Preprocessing step: filter noise from the page source code to obtain the filtered text of each web page to be classified, then perform word segmentation and stop-word removal on the filtered text to obtain the usable word set of each web page to be classified;
Extraction step: extract core keywords from the usable word set to obtain the core keyword set of each web page to be classified;
Calculation step: calculate the average of the core keyword word vectors of each web page to be classified, and input the average into a pre-trained web page classification model to obtain the classification result of each web page to be classified; and
Loop step: take the web pages that have obtained classification results as new seed pages and return to the acquisition step.
For a detailed description of the above steps, refer to the description of Fig. 2, the program module diagram of a preferred embodiment of the web page classification program 10, and of Fig. 3, the flow chart of a preferred embodiment of the web page classification method.
In other embodiments, the web page classification program 10 may be divided into multiple modules, which are stored in the memory 11 and executed by the processor 12 to carry out the present invention. A module in the sense of the present invention is a series of computer program instruction segments capable of performing a specific function.
Fig. 2 is the program module diagram of a preferred embodiment of the web page classification program 10 in Fig. 1. In this embodiment, the web page classification program 10 may be divided into: an acquisition module 110, a preprocessing module 120, an extraction module 130, a calculation module 140, a model training module 150 and a model application module 160.
The acquisition module 110 obtains web page links and page source code. For example, the acquisition module 110 uses a general-purpose web crawler to obtain web page links from the seed pages and to obtain page source code from the web pages to be classified that the links point to.
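The link-collection half of this module can be illustrated with a short sketch. It is not the crawler used by the patent, which names only a "general-purpose web crawler"; it assumes the source code of the starting page has already been downloaded, and the names LinkCollector and extract_links are hypothetical. Only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links from the <a href=...> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute == self.base_url:
            return  # a link back to the page itself
        if absolute.startswith(("http://", "https://")) and absolute not in self.links:
            self.links.append(absolute)


def extract_links(page_html, page_url):
    """Return the de-duplicated outgoing links of one page (hypothetical helper)."""
    collector = LinkCollector(page_url)
    collector.feed(page_html)
    return collector.links


sample_html = '<a href="/news/1.html">A</a><a href="https://example.org/b">B</a><a href="#top">top</a>'
print(extract_links(sample_html, "https://example.com/index.html"))
```

Each returned link would then be fetched to obtain the source code of the page it points to; the download step is left out so the sketch runs without network access.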
The preprocessing module 120 preprocesses the page source code to obtain the usable word set of each web page to be classified. In this embodiment, the preprocessing module 120 first filters noise from the page source code with regular expressions to obtain the text in the title tag, keywords tag and description tag of the page source code, that is, the text in <title>, <keywords> and <description>, which serves as the filtered text of each web page to be classified. It then performs word segmentation and stop-word removal on the filtered text to obtain the usable word set of each web page to be classified.
A regular expression is commonly used to search for and replace text that matches a given pattern or rule. Each regular expression can filter out a corresponding kind of web page noise, including advertisements, navigation bars, JavaScript code, CSS style code, HTML tags, punctuation marks, special characters and so on.
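As a rough illustration of regex-based filtering, the sketch below keeps only the title, keywords and description text and strips punctuation. It rests on several assumptions that are not in the source: keywords and description are taken to be standard HTML meta tags written as name="..." followed by content="...", attribute values contain no quote characters, and the function name screening_text is hypothetical.

```python
import re
from html import unescape

# Text between <title> and </title>.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# content="..." of <meta name="keywords"> and <meta name="description">.
META_RE = re.compile(
    r"""<meta\s+name=["'](keywords|description)["']\s+content=["'](.*?)["']""",
    re.IGNORECASE | re.DOTALL,
)
# Punctuation and special characters (anything that is not a word char or space).
NON_WORD_RE = re.compile(r"[^\w\s]+")


def screening_text(source):
    """Return the title/keywords/description text of a page, noise removed."""
    parts = TITLE_RE.findall(source)
    parts += [content for _name, content in META_RE.findall(source)]
    text = unescape(" ".join(parts))   # decode entities such as &amp;
    text = NON_WORD_RE.sub(" ", text)  # drop punctuation
    return " ".join(text.split())      # collapse whitespace


sample = (
    '<html><head><title>NBA Finals: Game 7</title>'
    '<meta name="keywords" content="basketball, NBA">'
    '<meta name="description" content="Live score &amp; recap.">'
    '<script>var a = 1;</script></head>'
    '<body><div class="ad">Buy now!</div></body></html>'
)
print(screening_text(sample))
```

Because only the three tags are extracted, scripts, styles, advertisements and navigation markup never reach the output, so no separate pattern is needed for each noise type in this simplified version.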
Word segmentation is the foundation of text processing. It may use one or more of a string-matching-based segmentation method, an understanding-based segmentation method and a statistics-based segmentation method. The string-matching-based method is also called dictionary-based segmentation. In this embodiment, the Jieba segmenter can be used to segment the filtered text.
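To make the string-matching (dictionary-based) family concrete, here is a minimal forward maximum matching sketch. It is one simple member of that family, not the segmenter the patent actually uses; the tiny dictionary, the max_len limit and the function name are illustrative assumptions.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy dictionary segmentation: at each position take the longest
    dictionary word that matches, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; a real one holds many thousands of words.
dictionary = {"网页", "分类", "方法", "网页分类"}
print(forward_max_match("网页分类方法", dictionary))
```

The greedy choice of the longest match is what distinguishes this from understanding-based or statistical methods, which weigh alternative segmentations instead of committing to the first long match.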
Stop words are mainly function words such as conjunctions, prepositions, auxiliary words and modal particles, and sometimes also pronouns and expressions such as "several times". These function words usually carry no specific meaning on their own and only play a role inside a complete sentence, for example "so", "this" and "that". In this embodiment, stop words can be removed from the filtered text by checking it against a preset stop-word list, giving the usable word set of each web page to be classified.
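Checking tokens against a preset stop-word list amounts to a simple membership filter. The list below is a small hypothetical sample, not the list used by the patent, and the tokens are assumed to come from an earlier segmentation step.

```python
# Hypothetical stop-word list; real lists contain hundreds of entries.
STOP_WORDS = {"的", "了", "这", "那", "所以", "the", "of", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens that are non-blank and not in the stop-word list."""
    return [t for t in tokens if t.strip() and t not in stop_words]


print(remove_stop_words(["网页", "的", "分类", " ", "方法"]))
```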
The extraction module 130 extracts core keywords from the usable word set to obtain the core keyword set of each web page to be classified. In this embodiment, using the term frequency-inverse document frequency (TF-IDF) algorithm and a preset corpus (such as the Chinese Wikipedia corpus), usable words whose TF-IDF value exceeds a preset threshold are taken as core keywords, giving the core keyword set of each web page to be classified.
The TF-IDF algorithm is a statistical method for assessing how important a word is to a document within a document set or corpus. In this embodiment, it assesses how important each usable word of a web page to be classified is to that page, and a usable word whose TF*IDF value exceeds the preset threshold is taken as a core keyword of the page. The term frequency (TF) is the frequency with which a usable word occurs in a web page, that is, the number of times the usable word occurs in a web page to be classified divided by the total number of occurrences of all usable words in that page. The inverse document frequency (IDF) can be regarded as a weight on the importance of a usable word to a web page to be classified: the more frequently a usable word occurs in a given class of web pages and the less frequently it occurs across all web pages, the larger its IDF value and the more important the word is to the web page to be classified.
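The thresholding rule can be sketched as follows. The source defines TF as a ratio of counts but gives no IDF formula, so this sketch assumes a common smoothed variant, log(N / (1 + df)), where N is the number of corpus documents and df the number that contain the word. The corpus, threshold and function name are illustrative only.

```python
import math
from collections import Counter


def core_keywords(usable_words, corpus_docs, threshold):
    """Return {word: tf*idf} for words whose score exceeds the threshold.

    usable_words: list of tokens from one page (after stop-word removal).
    corpus_docs:  list of sets, each the vocabulary of one corpus document.
    """
    counts = Counter(usable_words)
    total = len(usable_words)
    n_docs = len(corpus_docs)
    scores = {}
    for word, count in counts.items():
        tf = count / total                       # share of the page's words
        df = sum(1 for doc in corpus_docs if word in doc)
        idf = math.log(n_docs / (1 + df))        # assumed smoothed IDF variant
        scores[word] = tf * idf
    return {w: s for w, s in scores.items() if s > threshold}


corpus = [{"basketball", "score"}, {"news", "score"},
          {"news", "finance"}, {"score", "weather"}]
page = ["basketball", "basketball", "score", "nba"]
print(core_keywords(page, corpus, threshold=0.1))
```

In this toy run "score" appears in most corpus documents, so its IDF drops to zero and it is filtered out, while the rarer "basketball" and "nba" pass the threshold.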
The calculation module 140 maps the core keywords to word vectors and calculates the average of the core keyword word vectors of each web page. In this embodiment, the core keyword word vectors of a web page to be classified use a distributed representation. A distributed word vector is a low-dimensional real-valued vector that maps each core keyword to a point in a low-dimensional space; the representation is not unique and only needs to provide a certain degree of discrimination. The distance between distributed word vectors can be measured with the traditional Euclidean distance or with the cosine distance. With vectors represented this way, the distance between "mic" and "microphone" is far smaller than the distance between "mic" and "sunlight". The model application module 160 uses this property to classify web pages.
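The averaging and the cosine distance can be written out in a few lines. The two-dimensional vectors below are made up for illustration; real embeddings would come from a trained word-vector model with far more dimensions, which the source does not specify. Cosine distance is taken here as one minus cosine similarity.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def page_vector(keywords, embeddings):
    """Average the vectors of the keywords that have a known embedding."""
    vectors = [embeddings[w] for w in keywords if w in embeddings]
    if not vectors:
        raise ValueError("no keyword has a known word vector")
    return mean_vector(vectors)


# Hypothetical toy embeddings.
embeddings = {"mic": [1.0, 0.0], "microphone": [0.9, 0.1], "sunlight": [0.0, 1.0]}
print(page_vector(["mic", "sunlight"], embeddings))
print(cosine_distance(embeddings["mic"], embeddings["microphone"]))
print(cosine_distance(embeddings["mic"], embeddings["sunlight"]))
```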
The model training module 150 trains a neural network model with the averages of the core keyword word vectors of the pre-selected seed pages and the corresponding web page type labels to obtain the web page classification model. The web page type labels may be assigned manually or automatically. For example, when the number of pre-selected seed pages is large, seed pages chosen from financial websites can be automatically labeled as the finance type and seed pages chosen from sports websites as the sports type. For greater accuracy, the pre-selected seed pages can be given multi-level labels by manual labeling, for example labeling a seed page as sports-basketball-NBA, so that web page resources can later be used more effectively, for instance to subdivide web page types. It is understood that the labels can also be assigned by combining manual and automatic labeling.
The neural network model may be a neural-network-based deep learning model, including but not limited to a convolutional neural network, a deep neural network and a recurrent neural network. After the calculation module 140 obtains the average of the core keyword word vectors of each pre-selected seed page, the model training module 150 uses these averages and the corresponding web page type labels as sample data and adjusts the model parameters through training and validation to obtain the trained web page classification model.
The model application module 160 obtains the classification result of a web page to be classified from the average of its core keyword word vectors and the web page classification model. In this embodiment, the calculated average of the core keyword word vectors of the web page to be classified serves as its feature vector. Using the web page classification model, the cosine distance between this average and the average of the core keyword word vectors of each seed page is calculated, and the web page type label of the seed page whose cosine distance is smallest, or is below a threshold, is taken as the web page type of the web page to be classified.
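The nearest-label rule in this paragraph can be sketched without any neural network. The sketch only illustrates the distance comparison, not the trained model; the labeled vectors, the optional max_distance cut-off and the function names are assumptions made for the example.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def classify(page_vec, labeled_vectors, max_distance=None):
    """Return (label, distance) of the closest labeled page.

    labeled_vectors: list of (average_vector, type_label) pairs.
    If max_distance is given and even the closest page is farther away,
    the label is None (the page is left unclassified).
    """
    best_label, best_dist = None, float("inf")
    for vec, label in labeled_vectors:
        dist = cosine_distance(page_vec, vec)
        if dist < best_dist:
            best_label, best_dist = label, dist
    if max_distance is not None and best_dist > max_distance:
        return None, best_dist
    return best_label, best_dist


labeled = [([1.0, 0.0], "finance"), ([0.0, 1.0], "sports")]
print(classify([0.2, 0.9], labeled))
print(classify([1.0, 1.0], labeled, max_distance=0.1))
```

The second call shows the threshold case: a page equally far from both labeled pages is left without a type rather than being forced into the nearest one.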
In one embodiment, the web page classification model includes access models for multiple web page types. The K seed pages nearest to the average of the core keyword word vectors of the web page to be classified are found, the corresponding web page categories and their probabilities are counted, and the average is input into the access model of each category in descending order of probability, which converts the multi-class web page classification problem into multiple binary classification problems.
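The neighbour-counting part of this embodiment might look like the sketch below, which ranks candidate types by their share among the K nearest labeled pages. The per-type models that would then be tried in this order are not specified in the source and are omitted; the data and the name rank_types_by_knn are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_types_by_knn(page_vec, labeled_vectors, k):
    """Return [(type_label, probability), ...] sorted from most to least likely,
    where probability is the label's share among the k nearest labeled pages."""
    nearest = sorted(labeled_vectors, key=lambda s: cosine_distance(page_vec, s[0]))[:k]
    counts = Counter(label for _vec, label in nearest)
    return [(label, n / len(nearest)) for label, n in counts.most_common()]


labeled = [
    ([1.0, 0.0], "finance"), ([0.9, 0.1], "finance"), ([0.8, 0.3], "finance"),
    ([0.0, 1.0], "sports"), ([0.1, 0.9], "sports"),
]
print(rank_types_by_knn([1.0, 0.2], labeled, k=3))
print(rank_types_by_knn([1.0, 0.2], labeled, k=5))
```

Trying the per-type binary models in this order means the most probable type is tested first, so the search can stop as soon as one model accepts the page.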
In another embodiment, the web page classification model is trained by another program; that is, the web page classification program 10 need not include the model training module 150.
In addition, the present invention provides a web page classification method. Fig. 3 is a flow chart of a preferred embodiment of the web page classification method of the present invention. When the processor 12 of the electronic device 1 executes the web page classification program 10 stored in the memory, the following steps of the web page classification method are implemented:
Step S300: the acquisition module 110 obtains web page links from the seed pages and obtains page source code from the web pages to be classified that the links point to. For example, the acquisition module 110 uses a general-purpose web crawler to obtain all web page links from a preset number of pre-selected seed pages and obtains page source code from the web pages to be classified that the links point to.
Step S301: the preprocessing module 120 filters noise from the page source code to obtain the filtered text of each web page to be classified, then performs word segmentation and stop-word removal on the filtered text to obtain the usable word set of each web page to be classified. The filtered text comprises the text in the title tag, keywords tag and description tag of the page source code, and the word segmentation uses one or more of a string-matching-based segmentation method, an understanding-based segmentation method and a statistics-based segmentation method. For the process of obtaining the filtered text from the page source code and of segmenting it and removing stop words, refer to the detailed description of the preprocessing module 120 above, which is not repeated here.
Step S302: the extraction module 130 extracts core keywords from the usable word set to obtain the core keyword set of each web page to be classified. For example, the extraction module 130 uses the TF-IDF algorithm together with the Chinese Wikipedia corpus to extract the usable words whose TF*IDF value exceeds a preset threshold as the core keywords of the web page to be classified.
Step S303: the calculation module 140 calculates the average of the core keyword word vectors of each web page to be classified, and the model application module 160 inputs the average into the web page classification model trained by the model training module 150 and outputs the classification result of each web page to be classified.
Step S304: take the web pages that have obtained classification results as new seed pages and repeat steps S300-S303.
In other embodiments, the following is further included between step S303 and step S304:
Setting the number of times step S304 is executed; when the set number is reached, step S304 is no longer executed and the web page classification operation ends.
For ease of description, seed pages are divided here into first-generation seed pages, second-generation seed pages, third-generation seed pages and so on. Similarly, web pages to be classified are divided into first-generation web pages to be classified, second-generation web pages to be classified and so on. The seed pages used for model training are first-generation seed pages; the first-generation web pages to be classified are the web pages pointed to by all web page links in the first-generation seed pages and can serve as second-generation seed pages; and so on.
For example, suppose the number of executions of step S304 is set to 2. After the classification results of the first-generation web pages to be classified are obtained, step S304 is executed for the first time: the first-generation web pages to be classified become second-generation seed pages and steps S300-S303 are repeated, giving the classification results of the second-generation web pages to be classified. Step S304 is then executed a second time, giving the classification results of the third-generation web pages to be classified, after which step S304 is no longer executed and the web page classification operation ends.
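The generation-by-generation loop can be sketched with the fetching and classification steps passed in as functions, so the control flow runs without a network. With the loop count set to 2, three generations of pages are classified, matching the example above. The parameter names, the in-memory link graph and the skip of already classified pages are assumptions added for the sketch.

```python
def classify_by_generation(start_urls, get_links, classify_page, loop_count):
    """Classify pages generation by generation.

    get_links(url)     -> list of URLs linked from that page (steps S300).
    classify_page(url) -> type label for that page (stands in for S301-S303).
    loop_count         -> how many times the loop step (S304) is executed.
    """
    results = {}
    current = list(start_urls)
    # The first pass plus loop_count repetitions gives loop_count + 1 generations.
    for _ in range(loop_count + 1):
        next_generation = []
        for page in current:
            for url in get_links(page):
                if url in results or url in start_urls:
                    continue  # assumed safeguard: do not classify a page twice
                results[url] = classify_page(url)
                next_generation.append(url)
        current = next_generation  # classified pages become the new starting pages
    return results


# Hypothetical link graph standing in for real web pages.
graph = {"s": ["a", "b"], "a": ["c"], "b": [], "c": ["d"], "d": ["e"]}
found = classify_by_generation(["s"], lambda u: graph.get(u, []),
                               lambda u: "type-of-" + u, loop_count=2)
print(sorted(found))
```

Page "e" would belong to a fourth generation, so it is never fetched when the loop count is 2.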
In other embodiments, the web page links corresponding to the seed pages labeled with web page types and to the web pages that have obtained classification results may also be stored in a database, and when an obtained web page link already exists in the database, the subsequent operations for that link are ended. For example, after the acquisition module 110 obtains a web page link from a seed page, it queries the database for the link. If the query succeeds, the web page that the link points to already has a classification result and the operations need not be repeated; if the query fails, the subsequent steps are executed as normal.
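A minimal sketch of the link store follows, using SQLite from the Python standard library. The source does not name a database, so the engine, the table name classified and the helper names are all assumptions.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the link store; one row per already handled link."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS classified ("
        "url TEXT PRIMARY KEY, page_type TEXT NOT NULL)"
    )
    return conn


def lookup(conn, url):
    """Return the stored type for a link, or None if the link is new."""
    row = conn.execute(
        "SELECT page_type FROM classified WHERE url = ?", (url,)
    ).fetchone()
    return row[0] if row else None


def save(conn, url, page_type):
    """Record a link together with its label or classification result."""
    conn.execute(
        "INSERT OR REPLACE INTO classified (url, page_type) VALUES (?, ?)",
        (url, page_type),
    )
    conn.commit()


conn = open_store()
url = "https://example.com/news/1.html"
if lookup(conn, url) is None:      # query fails: run the normal steps
    save(conn, url, "sports")      # then store the result
print(lookup(conn, url))           # query succeeds: skip the page next time
```

Making the URL the primary key keeps each link to a single row, so the existence check is one indexed lookup regardless of how many pages have been processed.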
The web page classification method proposed in this embodiment obtains web page links from sub-pages and obtains web page source code from the web pages to be classified that the links point to. It performs noise filtering on the source code to obtain a screening text comprising the text in the title tag, keyword tag, and description tag; performs word segmentation and stop-word removal on the screening text to obtain a usable word set; and extracts core keywords from the usable word set using the TF-IDF algorithm to obtain the core keyword set of each web page to be classified. It then calculates the average of the core keyword word vectors of each web page to be classified and inputs the average into the web page classification model to obtain the classification result of the web page. Finally, it obtains web page links from the classified web pages and repeats the above steps. A web crawler is used to obtain a large amount of web page data, enabling deep crawling of web page source code and web page links, and automatic web page classification is achieved by training a deep learning model. The present invention can therefore classify a large number of web pages automatically.
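The feature pipeline summarized above can be illustrated with a short sketch, under several stated assumptions. The class and function names are hypothetical. The TF-IDF formula used here (term frequency times log(N / document frequency)) is one common variant and is not specified in the text. Word segmentation and stop-word removal are assumed to have been done already, and the word vectors are supplied as a plain dictionary rather than by a trained embedding model.

```python
import math
from collections import Counter
from html.parser import HTMLParser
from typing import Dict, List, Sequence


class ScreeningTextParser(HTMLParser):
    """Keep only <title> text and keywords/description <meta> content."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() in ("keywords", "description"):
                self.parts.append(attr.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data.strip())


def screening_text(html: str) -> str:
    """Noise filtering: drop everything except the three tag texts."""
    parser = ScreeningTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(p for p in parser.parts if p)


def core_keywords(docs: Sequence[Sequence[str]], top_k: int = 3) -> List[List[str]]:
    """Return the top_k words of each tokenized page, ranked by TF-IDF."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            w: (c / len(doc)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:top_k])
    return result


def mean_vector(words: Sequence[str], embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

The averaged vector returned by `mean_vector` is what would be passed to the trained classification model; the model itself is outside the scope of this sketch.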
In addition, an embodiment of the present invention further provides a computer-readable storage medium. The computer-readable storage medium may be any one of, or any combination of, a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like.
The specific embodiments of the computer-readable storage medium of the present invention are substantially the same as the specific embodiments of the above web page classification method and electronic device 1; please refer to the related description, which is not repeated here.
It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. In addition, the technical solutions of the embodiments may be combined with one another, but only on the basis that a person of ordinary skill in the art can implement the combination. When a combination of technical solutions is contradictory or cannot be implemented, the combination should be regarded as not existing and as falling outside the scope of protection claimed by the present invention.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above and includes a number of instructions that cause a server to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present invention.

Claims (10)

1. A web page classification method applied to an electronic device, characterized in that the method comprises:
an obtaining step: obtaining web page links from sub-pages, and obtaining web page source code from the web pages to be classified that the web page links point to;
a preprocessing step: performing noise filtering on the web page source code to obtain a screening text of each web page to be classified, and performing word segmentation and stop-word removal on the screening text to obtain a usable word set of each web page to be classified;
an extraction step: extracting core keywords from the usable word set to obtain a core keyword set of each web page to be classified;
a calculation step: calculating the average of the core keyword word vectors of each web page to be classified, and inputting the average into a pre-trained web page classification model to obtain a classification result of each web page to be classified; and
a loop step: taking the web pages to be classified for which classification results have been obtained as new sub-pages, and returning to the obtaining step.
2. The web page classification method according to claim 1, characterized in that the training steps of the web page classification model comprise:
labeling web page types for a preset number of sub-pages chosen in advance;
preprocessing the web page source code of the sub-pages to obtain a usable word set of each sub-page;
extracting core keywords from the usable word set to obtain a core keyword set of each sub-page;
calculating the average of the core keyword word vectors of each sub-page; and
training a neural network model using the average of the core keyword word vectors of each sub-page and the corresponding web page type label to obtain the web page classification model.
3. The web page classification method according to claim 1 or 2, characterized in that the screening text comprises the text in the title tag, keyword tag, and description tag of the web page source code, and the word segmentation method used in the word segmentation comprises one or more of a segmentation method based on string matching, a segmentation method based on understanding, and a segmentation method based on statistics.
4. The web page classification method according to claim 2, characterized in that the method further comprises:
setting the number of times the loop step is executed, and terminating the loop step when the set requirement is met.
5. The web page classification method according to claim 2, characterized in that the method further comprises:
storing in a database the web page links corresponding to the sub-pages labeled with web page types and to the web pages to be classified for which classification results have been obtained; and
when an acquired web page link already exists in the database, terminating subsequent operations for that web page link.
6. An electronic device comprising a memory and a processor, characterized in that the memory contains a web page classification program which, when executed by the processor, implements the following steps:
an obtaining step: obtaining web page links from sub-pages, and obtaining web page source code from the web pages to be classified that the web page links point to;
a preprocessing step: performing noise filtering on the web page source code to obtain a screening text of each web page to be classified, and performing word segmentation and stop-word removal on the screening text to obtain a usable word set of each web page to be classified;
an extraction step: extracting core keywords from the usable word set to obtain a core keyword set of each web page to be classified;
a calculation step: calculating the average of the core keyword word vectors of each web page to be classified, and inputting the average into a pre-trained web page classification model to obtain a classification result of each web page to be classified; and
a loop step: taking the web pages to be classified for which classification results have been obtained as new sub-pages, and returning to the obtaining step.
7. The electronic device according to claim 6, characterized in that the training steps of the web page classification model comprise:
labeling web page types for a preset number of sub-pages chosen in advance;
preprocessing the web page source code of the sub-pages to obtain a usable word set of each sub-page;
extracting core keywords from the usable word set to obtain a core keyword set of each sub-page;
calculating the average of the core keyword word vectors of each sub-page; and
training a neural network model using the average of the core keyword word vectors of each sub-page and the corresponding web page type label to obtain the web page classification model.
8. The electronic device according to claim 6 or 7, characterized in that the screening text comprises the text in the title tag, keyword tag, and description tag of the web page source code, and the word segmentation method used in the word segmentation comprises one or more of a segmentation method based on string matching, a segmentation method based on understanding, and a segmentation method based on statistics.
9. The electronic device according to claim 6, characterized in that the web page classification program, when executed by the processor, further implements the following steps:
storing in a database the web page links corresponding to the sub-pages labeled with web page types and to the web pages to be classified for which classification results have been obtained; and
when an acquired web page link already exists in the database, terminating subsequent operations for that web page link.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium contains a web page classification program which, when executed by a processor, implements the steps of the web page classification method according to any one of claims 1 to 5.
CN201810694720.0A 2018-06-29 2018-06-29 Web page classification method, device and computer readable storage medium Withdrawn CN109062972A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810694720.0A CN109062972A (en) 2018-06-29 2018-06-29 Web page classification method, device and computer readable storage medium
PCT/CN2018/107490 WO2020000717A1 (en) 2018-06-29 2018-09-26 Web page classification method and device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810694720.0A CN109062972A (en) 2018-06-29 2018-06-29 Web page classification method, device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN109062972A true CN109062972A (en) 2018-12-21

Family

ID=64817979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810694720.0A Withdrawn CN109062972A (en) 2018-06-29 2018-06-29 Web page classification method, device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN109062972A (en)
WO (1) WO2020000717A1 (en)

Also Published As

Publication number Publication date
WO2020000717A1 (en) 2020-01-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20181221

WW01 Invention patent application withdrawn after publication