WO2016180270A1 - Web page classification method and apparatus, computing device, and machine-readable storage medium - Google Patents

Web page classification method and apparatus, computing device, and machine-readable storage medium

Info

Publication number
WO2016180270A1
WO2016180270A1 PCT/CN2016/081139 CN2016081139W WO2016180270A1 WO 2016180270 A1 WO2016180270 A1 WO 2016180270A1 CN 2016081139 W CN2016081139 W CN 2016081139W WO 2016180270 A1 WO2016180270 A1 WO 2016180270A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
webpage
corpus
similarity
domain name
Prior art date
Application number
PCT/CN2016/081139
Other languages
English (en)
French (fr)
Inventor
梁捷
郑海洪
邹红才
Original Assignee
广州市动景计算机科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州市动景计算机科技有限公司
Priority to US15/505,851 (patent US10997256B2)
Publication of WO2016180270A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/954 Navigation, e.g. using categorised browsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Definitions

  • the present application relates to the field of web page processing technologies, and in particular, to a web page classification method and apparatus, a computing device, and a machine readable storage medium.
  • web page classification can be used in network applications: it helps users find the information they prefer quickly and conveniently, and, in the requirements-analysis stage of developing network-related products, it allows the preferences of different users to be determined from the types of web pages they browse.
  • web page classification generally requires parsing a large number of web pages, extracting feature data from the Uniform Resource Locator (URL) and the header of each web page as training data, and training a classification model built on a classification algorithm with that training data to obtain a web page classifier. When a target web page is classified, the target feature data of the target web page is first extracted and then analyzed with the web page classifier to determine the type of the target web page.
  • commonly used classification algorithms include decision tree classification, the naive Bayesian classifier, the support vector machine (SVM) classification algorithm, neural network methods, the k-nearest neighbor (kNN) method, fuzzy classification, etc.
  • when web page classification is implemented with the above method, the feature data contains a large number of short sentences or words and the amount of data to process is large; for Chinese web pages in particular, the feature data consists mostly of Chinese words, the processing complexity is higher, and the classification efficiency is correspondingly low.
  • the present application provides a web page classification method and apparatus, a computing device, and a non-transitory machine readable storage medium.
  • a first aspect of the embodiments of the present application provides a webpage classification method, including:
  • the corpus is trained by a word-to-vector tool, such as word2vec, to obtain a vector corresponding to each corpus entry in the corpus, and each corpus entry and its corresponding vector are recorded in a classification model file, wherein the corpus entries are associated with the titles and keywords in the headers of web pages;
  • At least one target webpage category is selected as the classification result of the target webpage.
  • the determining, according to the scoring model file, the target similarity corresponding to each target corpus includes:
  • the webpage classification method further includes:
  • the corresponding domain name is recorded as a vertical domain name in the vertical domain name list.
  • the method for classifying a webpage further includes:
  • in response to determining that the target domain name exists in the vertical domain name list, determining a target webpage category and a target similarity corresponding to the target webpage according to the target domain name.
  • the method for classifying a webpage further includes:
  • in response to determining that the target domain name does not exist in the vertical domain name list, determining a target webpage category and a target similarity corresponding to the target webpage according to the uniform resource locator (URL) corresponding to the target webpage.
  • a second aspect of the embodiments of the present application provides a webpage classification apparatus, including:
  • a corpus training unit configured to train the corpus by the word-to-vector tool word2vec, obtain a vector corresponding to each corpus entry in the corpus, and record each corpus entry and its corresponding vector in a classification model file, wherein the corpus entries are associated with the titles and keywords in the headers of web pages;
  • a corpus screening unit configured to determine, according to the classification model file, the vector corresponding to each classification seed word of each preset webpage category, calculate the vector sum of all the classification seed words corresponding to the same webpage category, search the classification model file for the corpus entries whose vectors have a similarity to the vector sum within a preset range, and record each found corpus entry, the corresponding similarity, and the webpage category corresponding to the vector sum in a scoring model file;
  • a target webpage processing unit configured to search the scoring model file for the target corpus corresponding to the target title and the target keyword in the header of the target webpage, and determine, according to the scoring model file, the target similarity and target webpage category corresponding to each target corpus;
  • the webpage category determining unit is configured to select at least one target webpage category as the classification result of the target webpage according to the determined target similarity.
  • the target webpage processing unit includes:
  • a weight coefficient setting unit configured to set weight coefficients corresponding to the target title and the target keyword respectively;
  • a target similarity calculation unit configured to: for the first target corpus corresponding to the target title, calculate the product of its reference similarity in the scoring model file and a first weight coefficient corresponding to the target title, to obtain the target similarity corresponding to the first target corpus; and, for the second target corpus corresponding to the target keyword, calculate the product of its reference similarity in the scoring model file and a second weight coefficient corresponding to the target keyword, to obtain the target similarity corresponding to the second target corpus.
  • the webpage classification apparatus further includes:
  • a vertical domain name determining unit configured to determine the classification result of each webpage under the same domain name by treating each such webpage as a target webpage, and, in response to determining that the classification results of the webpages under the same domain name and the corresponding similarities satisfy a preset threshold condition, record the corresponding domain name as a vertical domain name in the vertical domain name list.
  • the webpage classification apparatus further includes:
  • a target domain name processing unit configured to determine, in the case that the target title or the target keyword is not available, whether the target domain name corresponding to the target webpage exists in the vertical domain name list, and, in response to determining that the target domain name exists in the vertical domain name list, determine a target webpage category and a target similarity corresponding to the target webpage according to the target domain name.
  • the webpage classification apparatus further includes:
  • a URL processing unit configured to determine, in response to determining that the target domain name does not exist in the vertical domain name list, a target webpage category and a target similarity corresponding to the target webpage according to the uniform resource locator (URL) corresponding to the target webpage.
  • a third aspect of the embodiments of the present application provides a computing device, including:
  • a processor that reads information related to the web page from the memory and performs the following operations:
  • the corpus is trained by the word-to-vector tool word2vec to obtain a vector corresponding to each corpus entry in the corpus, and each corpus entry and its corresponding vector are recorded in a classification model file, wherein the corpus entries are associated with the titles and keywords in the headers of web pages;
  • At least one target webpage category is selected as the classification result of the target webpage.
  • a fourth aspect of the embodiments of the present application provides a non-transitory machine readable storage medium having executable code stored thereon; when the executable code is executed by a processor, the processor is caused to execute the method according to the present application.
  • a fifth aspect of the embodiments of the present application provides a computing device comprising a processor and a non-transitory machine readable storage medium.
  • the non-transitory machine readable storage medium stores executable code.
  • when the executable code is executed by the processor, the processor is caused to perform the method according to the first aspect of the embodiments of the present application.
  • the embodiments of the present application convert each corpus entry in the corpus into a vector, so that comparison and similarity analysis between corpus entries become vector operations, which are easier for a computer to perform automatically; the efficiency of webpage classification is thereby improved.
  • the embodiment of the present application filters the corpus according to the plurality of preset classification seed words, and can remove the corpus irrelevant to the webpage type, thereby improving the accuracy of the webpage classification.
  • FIG. 1 is a flowchart of a webpage classification method according to an exemplary embodiment.
  • FIG. 2 is a flowchart of another webpage classification method according to an exemplary embodiment.
  • FIG. 3 is a functional block diagram of a web page classification apparatus according to an exemplary embodiment.
  • FIG. 4 is a functional block diagram of another web page classification apparatus according to an exemplary embodiment.
  • FIG. 1 is a flowchart of a webpage classification method according to an exemplary embodiment. As shown in FIG. 1, the method includes the following steps.
  • the large number of web pages used to construct the corpus may be drawn from users' browsing records.
  • the headers of each web page generally include two fields, a title and a keyword, so that the vocabulary in the two fields can be recorded as a corpus in the corpus.
  • the embodiment of the present application can also be applied to an existing corpus built from webpage headers. For example, when web pages are classified periodically, the corpus constructed during the previous round of classification can be reused.
  • because the title is generally a sentence or a phrase rather than a single word, it must be split into words by a word segmentation tool; a keyword is itself a single word and needs no segmentation.
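  • as a minimal sketch of how corpus entries might be collected from a web page header, the following Python example assumes that the header is an HTML `<head>` whose title is in a `<title>` element and whose keywords are a comma-separated `<meta name="keywords">` value. The names `HeaderParser`, `segment`, and `corpus_entries` are hypothetical, and the whitespace-based `segment` is only a stand-in for a real word segmentation tool, which would be required for Chinese titles.

```python
from html.parser import HTMLParser


class HeaderParser(HTMLParser):
    """Collects the <title> text and the <meta name="keywords"> content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            # Assumption: keywords are separated by commas.
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def segment(title):
    # Placeholder segmenter (assumption): split on whitespace only.
    return title.split()


def corpus_entries(html):
    """Returns the segmented title words followed by the keywords."""
    parser = HeaderParser()
    parser.feed(html)
    return segment(parser.title) + parser.keywords


page = ('<html><head><title>Fantasy novel chapters</title>'
        '<meta name="keywords" content="novel, fantasy"></head></html>')
entries = corpus_entries(page)
```

  • the title is segmented while the keywords are taken as they are, mirroring the distinction drawn above.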
  • the corpus is trained by the word-to-vector tool word2vec to obtain a vector corresponding to each corpus in the corpus, and each corpus and corresponding vector are recorded in the classification model file.
  • the above word2vec is a text processing tool that assigns a unique vector to each word by analyzing the similarity among a large number of words; applied to the embodiment of the present application, it determines the vector corresponding to each corpus entry by analyzing the similarity between the entries in the corpus. To represent complex similarities between words, the vector is multi-dimensional, for example: [0.792, -0.177, -0.17, 0.109, -0.542,...]. The higher the similarity between two corpus entries, the smaller the difference between their vectors (the difference can be measured by the cosine of the angle between the two vectors: the closer the cosine is to 1, the smaller the difference).
  • the classification model file may be a binary file in BIN format; for example, it may be named word.bin and records each corpus entry and its corresponding vector.
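  • the cosine measure mentioned above can be illustrated with a short Python sketch. The function name `cosine_similarity` is chosen here for illustration only, and returning 0.0 for a zero-length vector is an assumption made to avoid division by zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: treat a zero vector as having no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```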
  • the classification seed words related to the novel may include: Online novels, novels, novels, classical novels, fine novels, novels, novels, novels, original novels, txt complete works, romance novels, love stories, fantasy novels, fantasy novels, science fiction, martial arts novels, Xian Xia novels, urban powers, acquaintances, fan fiction, novels, novels, novel, suspense novels, horror novels, detective reasoning, detective novels, mystery novels, youth campuses, etc.
  • for each classification seed word, the corresponding vector is first determined. Specifically, the classification model file is searched for the corpus entry most similar to the classification seed word, and the vector of that entry is recorded as the vector of the seed word.
  • because vectors produced by word2vec can be added, the vectors of the classification seed words corresponding to the same web page category are added together to obtain the vector sum for that web page category. For example, the vectors corresponding to the classification seed words related to the novel listed above are added to obtain the vector sum corresponding to the "fiction" category.
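  • the vector addition described above amounts to an element-wise sum. The following Python sketch uses made-up three-dimensional vectors for two seed words; the values and the name `vector_sum` are illustrative assumptions, not data from the application.

```python
def vector_sum(vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


# Hypothetical seed-word vectors for one web page category.
seed_vectors = {
    "novel": [0.5, 0.25, -0.25],
    "fantasy novel": [0.25, 0.5, 0.0],
}
category_vector = vector_sum(seed_vectors.values())
```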
  • steps S13 and S14 are used to filter the corpus describing the webpage type from the classification model file by the processing calculation of the vector, and are uniformly recorded in the scoring model file.
  • the similarity between the vectors may be specifically expressed as a cosine of an angle between the vectors, that is, a value ranging from 0 to 1.
  • the similarity between the vectors may also be expressed as a score on a 100-point scale, obtained by multiplying the above cosine value by 100.
  • the above scoring model file may be a text file in TXT format, for example named word.txt, whose storage format is "corpus found according to the vector sum: web page category corresponding to the vector sum: similarity", where the web page category corresponding to the vector sum is also the web page category of the found corpus. For example, if corpus A and corpus B are found according to the vector sum of the above "fiction" category with similarities of 95 and 80 respectively, they can be recorded in the scoring model file as "A: Fiction: 95" and "B: Fiction: 80".
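  • a rough Python sketch of producing and reading records in the "corpus: category: similarity" form is given below. It assumes the cosine similarities against the category vector sum have already been computed; the preset range of 0.7 to 1.0, the ": " separator, and the names `to_score`, `scoring_lines`, and `parse_line` are assumptions for illustration.

```python
def to_score(cosine):
    """Converts a cosine value to a score on a 100-point scale."""
    return round(cosine * 100)


def scoring_lines(similarities, category, low=0.7, high=1.0):
    """Builds scoring-file records for entries whose similarity is in range.

    similarities maps each corpus entry to its cosine similarity with the
    vector sum of the given category; low and high are assumed bounds.
    """
    lines = []
    for entry, cosine in similarities.items():
        if low <= cosine <= high:
            lines.append(f"{entry}: {category}: {to_score(cosine)}")
    return lines


def parse_line(line):
    """Splits one record back into (entry, category, score)."""
    entry, category, score = (part.strip() for part in line.split(":"))
    return entry, category, int(score)


records = scoring_lines({"A": 0.95, "B": 0.80, "Z": 0.10}, "Fiction")
```

  • entries outside the preset range, such as Z here, are dropped, which corresponds to filtering out corpus entries unrelated to the web page category.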
  • similar to step S11, the target title needs to be segmented, so that its phrase or sentence is divided into multiple words. After segmentation, the corpus entries in the scoring model file that match the words obtained from the target title and the target keywords are selected as the target corpus.
  • S17 Calculate a sum of target similarities corresponding to the same target webpage category, and select at least one target webpage category with the largest sum of the target similarities as the classification result of the target webpage.
  • for example, suppose the target corpus found in the scoring model file includes A, B, and C; the target webpage category corresponding to A and B is "fiction" with target similarities of 90 and 85 respectively, and the target webpage category corresponding to C is "sports" with a target similarity of 80. Adding the target similarities of A and B gives a sum of 175 for "fiction"; since 175 > 80, "fiction" is preferred as the classification result of the target web page.
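  • the aggregation in step S17 can be sketched in Python as follows, using the A, B, C figures from the example. The function name `classify` and the choice to return `None` when no target corpus is found are assumptions for illustration.

```python
from collections import defaultdict


def classify(target_entries):
    """Sums target similarities per category and picks the largest sum.

    target_entries is an iterable of (category, target_similarity) pairs,
    one pair per target corpus entry found in the scoring model file.
    """
    totals = defaultdict(float)
    for category, similarity in target_entries:
        totals[category] += similarity
    if not totals:
        # Assumption: no matching target corpus means no classification.
        return None, {}
    best = max(totals, key=totals.get)
    return best, dict(totals)


# A and B map to "fiction" (90, 85); C maps to "sports" (80).
best_category, sums = classify([("fiction", 90), ("fiction", 85), ("sports", 80)])
```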
  • the embodiments of the present application convert each corpus entry in the corpus into a vector, so that comparison and similarity analysis between corpus entries become vector operations, which are easier for a computer to perform automatically; the efficiency of webpage classification is thereby improved.
  • the embodiment of the present application filters the corresponding corpus according to the preset classification seed words, and can remove the corpus irrelevant to the webpage type, thereby improving the accuracy of the webpage classification.
  • the webpage category corresponding to the single target corpus with the largest target similarity may be selected as the classification result of the target webpage; or the target corpus may be sorted by target similarity in descending order and the webpage categories corresponding to the top N target corpus selected as the classification result of the target webpage; or the webpage categories corresponding to all target corpus whose target similarity is greater than a preset threshold may be selected as the classification result of the target webpage.
  • At least one target webpage category may be selected as the classification result of the target webpage according to the determined target similarity.
  • in one feasible embodiment of the present application, the similarity recorded in the scoring model file for the target corpus may be used directly as the target similarity; in another feasible embodiment of the present application, the target similarity is determined as follows:
  • the first weight coefficient corresponding to the target title is greater than the second weight coefficient corresponding to the target keyword.
  • for example, the first weight coefficient may be set to 1 and the second weight coefficient to 0.8; the target similarity of the first target corpus is then its reference similarity multiplied by 1, and the target similarity of the second target corpus is its reference similarity multiplied by 0.8.
  • in this way, the probability that the webpage category of a target corpus from the target title is chosen as the classification result of the target webpage is increased, which improves the accuracy of the webpage classification.
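  • the weighting can be sketched in a few lines of Python using the example coefficients 1 and 0.8. The constant names, the function name `target_similarity`, and the `source` labels "title" and "keyword" are hypothetical.

```python
TITLE_WEIGHT = 1.0    # first weight coefficient (example value from the text)
KEYWORD_WEIGHT = 0.8  # second weight coefficient (example value from the text)


def target_similarity(reference_similarity, source):
    """Scales the reference similarity by the weight of its source field."""
    if source == "title":
        weight = TITLE_WEIGHT
    elif source == "keyword":
        weight = KEYWORD_WEIGHT
    else:
        raise ValueError(f"unknown source: {source}")
    return reference_similarity * weight


from_title = target_similarity(90, "title")
from_keyword = target_similarity(80, "keyword")
```

  • with these values, an entry found through the title keeps its full reference similarity, while an entry found through a keyword contributes 80% of it.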
  • a webpage classification method provided by another embodiment of the present application may include the following steps:
  • S203 Determine, according to the classification model file, the vector corresponding to each classification seed word of each preset webpage category, and calculate the vector sum of all the classification seed words corresponding to the same web page category.
  • the classification result of each web page under the same domain name is determined as follows: for each web page, the target corpus corresponding to its title and keywords is searched for in the scoring model file; the target similarity and target webpage category of each target corpus found are determined; the sum of the target similarities corresponding to the same target webpage category is calculated; and at least one target webpage category with the largest sum of target similarities is selected as the classification result of that web page.
  • the preset threshold condition for determining whether a domain name is a vertical domain name includes at least the following three items:
  • for example, suppose the classification result of a webpage includes two webpage categories, "fiction" and "sports". The corpus corresponding to "fiction" includes A and B, with similarities (expressed as scores) of 90 and 85 respectively; the corpus corresponding to "sports" is C, with a similarity of 80. The similarity ratio of "fiction" is then (90+85)/(90+85+80).
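  • the similarity ratio in this example can be checked with a small Python sketch. The function name `similarity_ratio` and the dictionary layout are assumptions; returning 0.0 when there are no scores is also an assumption.

```python
def similarity_ratio(category_scores, category):
    """Share of one category's similarity scores in the total for a page.

    category_scores maps each web page category to the list of similarity
    scores of the corpus entries assigned to it.
    """
    total = sum(sum(scores) for scores in category_scores.values())
    if total == 0:
        return 0.0
    return sum(category_scores[category]) / total


page_scores = {"fiction": [90, 85], "sports": [80]}
fiction_ratio = similarity_ratio(page_scores, "fiction")  # 175 / 255
```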
  • the classification result of each webpage may include multiple webpage categories (i.e., each webpage may correspond to multiple webpage categories), and the same webpage category may appear in the classification results of different webpages (i.e., the webpage categories of different webpages may be partially or completely the same). If the number of webpages under the domain name whose classification result contains webpage category D is greater than a preset value, D may be referred to as a public webpage category of these webpages.
  • the first ratio, the preset value, and the second ratio may be set according to the actual application and are not specifically limited by the present application. If the summary result corresponding to a domain name meets the above three conditions, the domain name may be determined to be a vertical domain name, that is, all the webpages under the domain name are of the same type.
  • the public webpage category that satisfies the above conditions 2) and 3) may be used as the webpage category corresponding to the domain name (that is, the webpages under the domain name all belong to the public webpage category), and its similarity is recorded correspondingly.
  • the webpage category and the corresponding similarity may be stored while storing the determined vertical domain name, for example, the webpage category and the similarity corresponding to the vertical domain name may also be recorded in the vertical domain name list.
  • step S207 Obtain a target title and a target keyword in a header of the target webpage. If the obtaining is successful, step S208 is performed, otherwise step S209 is performed.
  • step S209 If the target title or the target keyword is missing, determine whether the target domain name corresponding to the target webpage exists in the vertical domain name list. If yes, execute step S210; otherwise, perform step S211.
  • step S210 Determine a target webpage category and a target similarity corresponding to the target webpage according to the target domain name, and perform step S212.
  • the webpage categories of all the webpages corresponding to the vertical domain name are the same, and the webpage category and the similarity corresponding to the target domain name can be directly used as the target webpage category and the target similarity of the target webpage.
  • the public webpage category and the corresponding similarity obtained in the judging process of step S206 are recorded as the webpage category and similarity of the corresponding vertical domain name, so that in step S210 the result for the target domain name can be read directly.
  • the webpage category corresponding to each vertical domain name and its similarity may also be set directly; for example, the webpage category of the domain name “sports.sina.com.cn” may be set to "Sports" with a similarity of 90.
  • step S211 Determine a target webpage category and a target similarity corresponding to the target webpage according to the URL corresponding to the target webpage, and perform step S212.
  • the corresponding webpage category and similarity may be preset and stored for common domain names and common URL characteristics; for example, the webpage category corresponding to URLs with the characteristic “xxx.com/sport” may be preset as "Sports" with a similarity of 80.
  • steps S209 to S211 serve as supplementary steps when the header data of the target webpage is missing (including a missing title, keyword, etc.): because the header data is missing, the target webpage category and target similarity cannot be determined by step S208, but they may still be determined according to the target domain name or the URL characteristics of the target webpage, thereby ensuring the accuracy of the classification result.
  • for any target webpage, the target webpage category and target similarity can be determined according to its URL characteristics. However, the URL characteristic is a weak rule; that is, where both methods are feasible, the target domain name takes precedence. When the target domain name is a vertical domain name, the target webpage category and target similarity are determined according to the target domain name. When the target domain name is not a vertical domain name (the target domain name does not conform to the vertical domain name rule), the target webpage category and target similarity cannot be determined from the target domain name and are instead determined according to the URL characteristics of the target webpage.
  • the embodiment of the present application analyzes in advance, according to the scoring model file, whether each domain name involved is a vertical domain name, so that when the header data of the target webpage is missing (including missing titles, keywords, etc.), the target webpage category and target similarity can be determined according to the target domain name or URL of the target webpage, ensuring that the classification succeeds and is accurate. If the target domain name is a vertical domain name, the target webpage category and target similarity are determined according to the target domain name; if not, they are determined according to the URL characteristics of the target webpage.
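  • the fallback order of steps S209 to S211 might be sketched in Python as follows. The tables `VERTICAL_DOMAINS` and `URL_RULES` reuse the example values given above ("sports.sina.com.cn" with 90, and the "/sport" URL characteristic with 80); treating a URL characteristic as a path prefix and the name `fallback_classify` are assumptions for illustration.

```python
from urllib.parse import urlparse

# Hypothetical vertical domain name list: domain -> (category, similarity).
VERTICAL_DOMAINS = {"sports.sina.com.cn": ("Sports", 90)}

# Hypothetical URL rules: path prefix -> (category, similarity).
URL_RULES = [("/sport", ("Sports", 80))]


def fallback_classify(url):
    """Classifies a page whose header lacks a title or keywords.

    The vertical domain name list is consulted first (S209/S210); only if
    the domain is not listed are the URL rules applied (S211).
    """
    parsed = urlparse(url)
    domain = parsed.hostname or ""
    if domain in VERTICAL_DOMAINS:
        return VERTICAL_DOMAINS[domain]
    for prefix, result in URL_RULES:
        if parsed.path.startswith(prefix):
            return result
    return None  # Assumption: no rule matched, so no result is produced.


by_domain = fallback_classify("http://sports.sina.com.cn/nba/")
by_url = fallback_classify("http://xxx.com/sport")
```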
  • the supplementary step based on the domain name and the URL provided by the embodiment of the present application can avoid the problem that the classification accuracy of the target webpage is insufficient or even the classification failure due to the lack of the header, and is simple and easy to implement, and does not affect the efficiency of the webpage classification.
  • FIG. 3 is a functional block diagram of a web page classification apparatus according to an exemplary embodiment.
  • the functional blocks of the web page classification apparatus may be implemented by hardware, software or a combination of hardware and software that implements the principles of the present invention. It will be understood by those skilled in the art that the functional modules described in FIG. 3 can be combined or divided into sub-modules to implement the principles of the above described invention. Accordingly, the description herein may support any possible combination, or division, or further limitation of the functional modules described herein.
  • the apparatus includes a corpus extraction unit 100, a corpus training unit 200, a corpus screening unit 300, a target web page processing unit 400, and a web page category determining unit 500.
  • the corpus extraction unit 100 is configured to acquire a title and a keyword in a header of each web page, and record the obtained title and keyword as a corpus in the corpus.
  • the web page classification process can also be performed using an existing corpus, in which case the corpus extraction unit 100 can be omitted.
  • the corpus training unit 200 is configured to train the corpus by the word-to-vector tool word2vec to obtain a vector corresponding to each corpus in the corpus, and record each corpus and corresponding vector in the classification model file.
  • the corpus screening unit 300 is configured to determine, according to the classification model file, a vector corresponding to each category seed word corresponding to each preset webpage category, and calculate a vector sum of all the classification seed words corresponding to the same webpage category. Searching for a corpus corresponding to a vector whose similarity with the vector sum is within a preset range in the classification model file, and recording the found corpus, the corresponding similarity, and the vector and the corresponding webpage category in Rating model file.
  • the target webpage processing unit 400 is configured to acquire a target title and a target keyword in a header of the target webpage, and search for the target corpus corresponding to the target title and the target keyword in the scoring model file, according to the The scoring model file determines the target similarity and the target web page category corresponding to each target corpus.
  • the web page category determining unit 500 is configured to calculate a sum of target similarities corresponding to the same target web page category, and select at least one target web page category having the largest sum of target similarities as the classification result of the target web page. As described above, the person skilled in the art may further configure the webpage category determining unit to select at least one target webpage category as the classification result of the target webpage according to the determined target similarity. For example, the webpage category determining unit is configured to select only the webpage category corresponding to one target corpus with the largest target similarity as the classification result of the target webpage; or the target corpus may be sorted according to the rules of the target similarity from large to small, and the selection is performed. Top N target corpora The webpage category corresponding to the word is used as the classification result of the target webpage; and the webpage category corresponding to all the target corpus of the target similarity greater than the preset threshold is also selected as the classification result of the target webpage.
  • As can be seen from the above technical solutions, the embodiments of the present application convert each corpus word in the corpus into a vector, so that comparison and similarity analysis between corpus words become vector operations, which are easier to automate on a computer and therefore improve the efficiency of webpage classification.
  • In addition, the embodiments of the present application screen the corpus words according to multiple preset classification seed words, which removes corpus words unrelated to webpage types and improves the accuracy of webpage classification.
  • In one possible embodiment, to determine the target similarity of each target corpus word, the target webpage processing unit 400 may include a weight coefficient setting unit and a target similarity calculation unit.
  • The weight coefficient setting unit is configured to set weight coefficients for the target title and the target keywords respectively.
  • The target similarity calculation unit is configured to calculate, for a first target corpus word corresponding to the target title, the product of its reference similarity in the scoring model file and the first weight coefficient corresponding to the target title, to obtain the target similarity of the first target corpus word; and to calculate, for a second target corpus word corresponding to a target keyword, the product of its reference similarity in the scoring model file and the second weight coefficient corresponding to the target keyword, to obtain the target similarity of the second target corpus word.
  • Referring to FIG. 4, the webpage classification apparatus may further include a vertical domain name determining unit 600.
  • The vertical domain name determining unit 600 is configured to take each webpage under the same domain name as the target webpage and determine its classification result, and to determine whether the classification results of the webpages under that domain name and their corresponding similarities satisfy preset threshold conditions; if they do, it records the domain name as a vertical domain name in the vertical domain name list.
  • The webpage classification apparatus may further include a target domain name processing unit 700.
  • The target domain name processing unit 700 is configured to determine, when acquisition of the target title or target keywords fails, whether the target domain name corresponding to the target webpage exists in the vertical domain name list, and, if it does, to determine the target webpage category and target similarity of the target webpage according to the target domain name.
  • The webpage classification apparatus may further include a URL processing unit 800.
  • The URL processing unit 800 is configured to determine, when the target domain name processing unit 700 finds that the target domain name does not exist in the vertical domain name list, the target webpage category and target similarity of the target webpage according to the uniform resource locator (URL) of the target webpage.
  • The embodiments of the present application further provide a non-transitory computer storage medium, which may be, for example, a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device. The storage medium stores a program which, when executed by a processor of a related device, enables the device to perform some or all of the steps of the webpage classification method described in the above method embodiments.
  • The embodiments of the present application further provide a computing device that includes a memory and a processor.
  • The memory stores information related to webpages. The processor reads the information related to webpages from the memory and executes some or all of the steps of the webpage classification method described in the above method embodiments.
  • The computing device may be, for example, a personal computer, a server, a mobile terminal such as a mobile phone, or a network device.
  • The above technical concept of the present invention can also be embodied as a computing device including a processor and a non-transitory machine-readable storage medium.
  • The non-transitory machine-readable storage medium stores executable code.
  • When the executable code is executed by the processor, the processor is caused to perform some or all of the steps of the webpage classification method described in the above method embodiments.

Abstract

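A webpage classification method and apparatus, a computing device, and a machine-readable storage medium. A word-to-vector tool, word2vec, converts each corpus word in a corpus into a vector, so that comparison and similarity analysis between corpus words become vector operations, which are easier to automate on a computer and improve the efficiency of webpage classification. In addition, corpus words are screened according to preset classification seed words, which removes corpus words unrelated to webpage types and improves the accuracy of webpage classification.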

Description

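Webpage classification method and apparatus, computing device, and machine-readable storage medium
Technical Field
The present application relates to the field of webpage processing technologies, and in particular to a webpage classification method and apparatus, a computing device, and a machine-readable storage medium.
Background
With the rapid development of the Internet, the information available from the network is becoming ever richer. Classifying webpages according to the information they present helps users of network applications find the information they prefer quickly and conveniently. During the requirements analysis stage of developing network-related products, it also allows the preferences of different users to be determined from the types of webpages they browse.
In the related art, webpage classification generally requires parsing a massive number of webpages, extracting feature data from the uniform resource locator (URL) and the header of each webpage as training data, and training a classification model built on a classification algorithm with that training data to obtain a webpage classifier. To classify a target webpage, target feature data are first extracted from it and then analyzed by the webpage classifier to obtain the type of the target webpage. Commonly used classification algorithms include decision tree classification, the naive Bayes classifier, classification based on the support vector machine (SVM), neural networks, k-nearest neighbor (kNN), and fuzzy classification.
When webpage classification is implemented in this way, the feature data contain a large number of short sentences or words, so the volume of data to process is large. For Chinese webpages in particular, the feature data are mostly Chinese words, which are more complex to process, and classification efficiency is correspondingly low.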
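Summary
To overcome the problems in the related art, the present application provides a webpage classification method and apparatus, a computing device, and a non-transitory machine-readable storage medium.
A first aspect of the embodiments of the present application provides a webpage classification method, including:
training a corpus with a word-to-vector tool, for example word2vec, to obtain the vector corresponding to each corpus word in the corpus, and recording each corpus word and its corresponding vector in a classification model file, where the corpus words in the corpus are associated with the titles and keywords in webpage headers;
determining, according to the classification model file, the vector corresponding to each classification seed word of each preset webpage category, and calculating the vector sum of all classification seed words corresponding to the same webpage category;
searching the classification model file for corpus words whose vectors have a similarity to the vector sum within a preset range, and recording the corpus words found, their similarities, and the webpage category corresponding to the vector sum in a scoring model file;
searching the scoring model file for target corpus words corresponding to the target title and target keywords in the header of a target webpage;
determining, according to the scoring model file, the target similarity and target webpage category corresponding to each target corpus word; and
selecting, according to the determined target similarities, at least one target webpage category as the classification result of the target webpage.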
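With reference to the first aspect, in a first possible embodiment of the first aspect, determining the target similarity corresponding to each target corpus word according to the scoring model file includes:
setting weight coefficients for the target title and the target keywords respectively;
for a first target corpus word corresponding to the target title, calculating the product of its reference similarity in the scoring model file and a first weight coefficient corresponding to the target title, to obtain the target similarity of the first target corpus word; and
for a second target corpus word corresponding to a target keyword, calculating the product of its reference similarity in the scoring model file and a second weight coefficient corresponding to the target keyword, to obtain the target similarity of the second target corpus word.
With reference to the first aspect or the first possible embodiment of the first aspect, in a second possible embodiment of the first aspect, the webpage classification method further includes:
taking each webpage under the same domain name as the target webpage and determining its classification result; and
in response to determining that the classification results of the webpages under the same domain name and their corresponding similarities satisfy preset threshold conditions, recording the domain name as a vertical domain name in a vertical domain name list.
With reference to the second possible embodiment of the first aspect, in a third possible embodiment of the first aspect, the webpage classification method further includes:
when the target title or the target keywords cannot be obtained, determining whether the target domain name corresponding to the target webpage exists in the vertical domain name list; and
in response to determining that the target domain name exists in the vertical domain name list, determining the target webpage category and target similarity of the target webpage according to the target domain name.
With reference to the third possible embodiment of the first aspect, in a fourth possible embodiment of the first aspect, the webpage classification method further includes:
in response to determining that the target domain name does not exist in the vertical domain name list, determining the target webpage category and target similarity of the target webpage according to the uniform resource locator (URL) of the target webpage.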
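A second aspect of the embodiments of the present application provides a webpage classification apparatus, including:
a corpus training unit, configured to train a corpus with the word-to-vector tool word2vec to obtain the vector corresponding to each corpus word in the corpus, and to record each corpus word and its corresponding vector in a classification model file, where the corpus words in the corpus are associated with the titles and keywords in webpage headers;
a corpus screening unit, configured to determine, according to the classification model file, the vector corresponding to each classification seed word of each preset webpage category, to calculate the vector sum of all classification seed words corresponding to the same webpage category, to search the classification model file for corpus words whose vectors have a similarity to the vector sum within a preset range, and to record the corpus words found, their similarities, and the webpage category corresponding to the vector sum in a scoring model file;
a target webpage processing unit, configured to search the scoring model file for target corpus words corresponding to the target title and target keywords in the header of a target webpage, and to determine, according to the scoring model file, the target similarity and target webpage category corresponding to each target corpus word; and
a webpage category determining unit, configured to select, according to the determined target similarities, at least one target webpage category as the classification result of the target webpage.
With reference to the second aspect, in a first possible implementation of the second aspect, the target webpage processing unit includes:
a weight coefficient setting unit, configured to set weight coefficients for the target title and the target keywords respectively; and
a target similarity calculation unit, configured to calculate, for a first target corpus word corresponding to the target title, the product of its reference similarity in the scoring model file and a first weight coefficient corresponding to the target title, to obtain the target similarity of the first target corpus word; and to calculate, for a second target corpus word corresponding to a target keyword, the product of its reference similarity in the scoring model file and a second weight coefficient corresponding to the target keyword, to obtain the target similarity of the second target corpus word.
With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the webpage classification apparatus further includes:
a vertical domain name determining unit, configured to take each webpage under the same domain name as the target webpage and determine its classification result, and, in response to determining that the classification results of the webpages under the same domain name and their corresponding similarities satisfy preset threshold conditions, to record the domain name as a vertical domain name in a vertical domain name list.
With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the webpage classification apparatus further includes:
a target domain name processing unit, configured to determine, when the target title or the target keywords cannot be obtained, whether the target domain name corresponding to the target webpage exists in the vertical domain name list, and, in response to determining that it does, to determine the target webpage category and target similarity of the target webpage according to the target domain name.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the webpage classification apparatus further includes:
a URL processing unit, configured to determine, in response to determining that the target domain name does not exist in the vertical domain name list, the target webpage category and target similarity of the target webpage according to the uniform resource locator (URL) of the target webpage.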
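A third aspect of the embodiments of the present application provides a computing device, including:
a memory, which stores information related to webpages; and
a processor, which reads the information related to webpages from the memory and performs the following operations:
training a corpus with the word-to-vector tool word2vec to obtain the vector corresponding to each corpus word in the corpus, and recording each corpus word and its corresponding vector in a classification model file, where the corpus words in the corpus are associated with the titles and keywords in webpage headers;
determining, according to the classification model file, the vector corresponding to each classification seed word of each preset webpage category, and calculating the vector sum of all classification seed words corresponding to the same webpage category;
searching the classification model file for corpus words whose vectors have a similarity to the vector sum within a preset range, and recording the corpus words found, their similarities, and the webpage category corresponding to the vector sum in a scoring model file;
searching the scoring model file for target corpus words corresponding to the target title and target keywords in the header of a target webpage;
determining, according to the scoring model file, the target similarity and target webpage category corresponding to each target corpus word; and
selecting, according to the determined target similarities, at least one target webpage category as the classification result of the target webpage.
A fourth aspect of the embodiments of the present application provides a non-transitory machine-readable storage medium storing executable code which, when executed by a processor, causes the processor to perform the method according to the first aspect of the embodiments of the present application.
A fifth aspect of the embodiments of the present application further provides a computing device that includes a processor and a non-transitory machine-readable storage medium. The non-transitory machine-readable storage medium stores executable code. When the executable code is executed by the processor, the processor performs the method according to the first aspect of the embodiments of the present application.
As can be seen from the above technical solutions, the embodiments of the present application convert each corpus word in the corpus into a vector, so that comparison and similarity analysis between corpus words become vector operations, which are easier to automate on a computer and therefore improve the efficiency of webpage classification. In addition, the embodiments screen the corpus words according to multiple preset classification seed words, which removes corpus words unrelated to webpage types and improves the accuracy of webpage classification.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit the present application.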
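Brief Description of the Drawings
The accompanying drawings are incorporated into and constitute a part of this specification. They illustrate embodiments consistent with the present invention and, together with the specification, serve to explain its principles.
FIG. 1 is a flowchart of a webpage classification method according to an exemplary embodiment.
FIG. 2 is a flowchart of another webpage classification method according to an exemplary embodiment.
FIG. 3 is a functional block diagram of a webpage classification apparatus according to an exemplary embodiment.
FIG. 4 is a functional block diagram of another webpage classification apparatus according to an exemplary embodiment.
Detailed Description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.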
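FIG. 1 is a flowchart of a webpage classification method according to an exemplary embodiment. As shown in FIG. 1, the method includes the following steps.
S11: Acquire the title and keywords in the header of each webpage, and record them as corpus words in a corpus. That is, the corpus words in the corpus are associated with the titles and keywords in webpage headers.
In the embodiments of the present application, the massive number of webpages used to build the corpus may come from users' browsing records. The header of a webpage generally contains both a title field and a keyword field, so the words in these two fields can be recorded in the corpus as corpus words. Note that the embodiments can also be applied to an existing corpus related to webpage headers. For example, when webpage classification is performed periodically, the corpus built during the previous classification run can be reused.
In addition, because a title is generally a sentence or phrase rather than a single word, it must be segmented into words with a word segmentation tool. A keyword is already a single word and needs no segmentation.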
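S12: Train the corpus with the word-to-vector tool word2vec to obtain the vector corresponding to each corpus word in the corpus, and record each corpus word and its corresponding vector in a classification model file.
word2vec is a text processing tool that assigns a unique vector to each word by analyzing the similarity among a massive number of words. Applied to the embodiments of the present application, it determines the vector of each corpus word by analyzing the similarity among the corpus words in the corpus. To represent the complex similarity relationships between words, the vector is multi-dimensional, for example [0.792, -0.177, -0.107, 0.109, -0.542, ...]. The higher the similarity between two corpus words, the smaller the difference between their vectors (this vector difference may be expressed as the cosine of the angle between the two vectors). In this embodiment, the classification model file may be a binary file in BIN format, for example named word.bin, which records each corpus word and its corresponding vector.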
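The similarity measure mentioned above can be illustrated with a minimal sketch. The function name `cosine_similarity` and the three-dimensional toy vectors are assumptions made for illustration; real word2vec vectors have many more dimensions and would be loaded from the classification model file.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction; treat as unrelated
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from any trained model.
vec_novel = [0.792, -0.177, -0.107]
vec_fiction = [0.780, -0.160, -0.120]
vec_football = [-0.300, 0.650, 0.410]

# Related words should score higher than unrelated ones.
print(cosine_similarity(vec_novel, vec_fiction) >
      cosine_similarity(vec_novel, vec_football))
```

Under this measure, a larger cosine value means the two vectors point in more similar directions.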
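S13: Determine, according to the classification model file, the vector corresponding to each classification seed word of each preset webpage category, and calculate the vector sum of all classification seed words corresponding to the same webpage category.
In the embodiments of the present application, for each webpage category, such as novels, sports, or technology, several related words that may appear in webpages are preset as classification seed words. For example, the classification seed words for novels may include: online novels, novels, novel library, classic novels, premium novels, novels online, complete novels, novel collections, novel series, original novels, complete TXT collections, romance novels, love stories, xuanhuan (Eastern fantasy) novels, fantasy novels, science fiction novels, wuxia (martial arts) novels, xianxia novels, urban superpowers, fan-fiction danmei, fan fiction, supernatural novels, time-travel novels, cultivation novels, suspense novels, horror novels, detective mysteries, detective novels, mystery novels, campus youth, and so on.
For each classification seed word, its vector is determined first, as follows: the classification model file is searched for the corpus word most similar to the seed word, and the vector of that corpus word is taken as the vector of the seed word. Further, because the vectors produced by word2vec support addition, the vectors of the seed words of the same webpage category are added together to obtain the vector sum for that category. For example, adding the vectors of the novel-related seed words above gives the vector sum for the "novel" category.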
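A minimal sketch of the vector-sum step follows. It simplifies the seed-word lookup by assuming every seed word appears verbatim in the model (the text instead searches for the most similar corpus word); `model`, `seed_words`, and `vector_sum` are hypothetical names with toy values.

```python
def vector_sum(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]

# Hypothetical classification model: corpus word -> vector.
model = {
    "novel": [0.8, -0.2, 0.1],
    "fantasy novel": [0.7, -0.1, 0.2],
    "football": [-0.3, 0.6, 0.4],
}

# Hypothetical preset seed words per webpage category.
seed_words = {"novel": ["novel", "fantasy novel"]}

# Assumption: each seed word is itself a corpus word in the model.
category_sums = {
    category: vector_sum([model[word] for word in words])
    for category, words in seed_words.items()
}
```

Here `category_sums["novel"]` is approximately `[1.5, -0.3, 0.3]`, the sum of the two seed vectors.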
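S14: Search the classification model file for corpus words whose vectors have a similarity to the vector sum within a preset range, and record the corpus words found, their similarities, and the webpage category corresponding to the vector sum in a scoring model file.
The classification model file is traversed, the similarity between each vector recorded in it and the vector sum is calculated, the vectors whose similarity falls within the preset range are selected, and the corresponding corpus words are recorded in the scoring model file.
Through these vector computations, steps S13 and S14 filter the corpus words that describe webpage types out of the classification model file and record them together in the scoring model file.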
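The screening in step S14 can be sketched as below. The range bounds `low` and `high`, the function names, and the toy vectors are assumptions for illustration; the source only says the similarity must fall within a preset range.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def screen_corpus(model, category_sums, low=0.8, high=1.0):
    """Keep corpus words whose similarity to a category's vector sum
    lies in [low, high]; returns (word, category, similarity) tuples."""
    entries = []
    for category, total in category_sums.items():
        for word, vector in model.items():
            similarity = cosine(vector, total)
            if low <= similarity <= high:
                entries.append((word, category, similarity))
    return entries

# Hypothetical toy data.
model = {
    "novel": [0.8, -0.2, 0.1],
    "fantasy novel": [0.7, -0.1, 0.2],
    "football": [-0.3, 0.6, 0.4],
}
category_sums = {"novel": [1.5, -0.3, 0.3]}
entries = screen_corpus(model, category_sums)
```

With these toy values, "novel" and "fantasy novel" pass the filter while "football" is dropped, which is the intended effect of removing words unrelated to a category.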
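In one possible embodiment of the present application, the similarity between vectors may be expressed as the cosine of the angle between them, that is, with values in the range 0 to 1.
In another possible embodiment of the present application, the similarity between vectors may also be expressed as a score on a 100-point scale, obtained by multiplying the cosine value by 100.
In addition, the scoring model file may be a text file in TXT format, for example named word.txt, with the storage format "corpus word found from the vector sum:webpage category of the vector sum:similarity", where the webpage category of the vector sum is also the webpage category of the corpus word found. For example, if corpus words A and B are found from the vector sum of the "novel" category, with similarities of 95 and 80 respectively, they can be recorded in the scoring model file as "A:novel:95" and "B:novel:80".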
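The "word:category:similarity" record format can be written and read back with a short sketch. The helper names `format_entry` and `parse_entry` are hypothetical, and splitting from the right is a design assumption so that a colon inside the corpus word does not break parsing.

```python
def format_entry(word, category, score):
    """Serialize one scoring-model record as 'word:category:score'."""
    return f"{word}:{category}:{score}"

def parse_entry(line):
    """Parse one record; split from the right so the word may contain ':'."""
    word, category, score = line.strip().rsplit(":", 2)
    return word, category, float(score)

lines = [format_entry("A", "novel", 95), format_entry("B", "novel", 80)]

# Load the records into a lookup table: word -> (category, score).
scoring_model = {}
for line in lines:
    word, category, score = parse_entry(line)
    scoring_model[word] = (category, score)
```

A dictionary keyed by corpus word makes the later lookup of target corpus words a constant-time operation.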
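S15: Acquire the target title and target keywords in the header of the target webpage, and search the scoring model file for the target corpus words corresponding to the target title and target keywords.
As in step S11, the target title must be segmented, dividing the phrase or sentence into multiple words. After segmentation, the corpus words in the scoring model file that correspond to the words obtained from the target title and to the target keywords are selected as the target corpus words.
S16: Determine, according to the scoring model file, the target similarity and target webpage category corresponding to each target corpus word.
S17: Calculate the sum of the target similarities corresponding to the same target webpage category, and select at least one target webpage category with the largest sum as the classification result of the target webpage.
For example, suppose the target corpus words found in the scoring model file are A, B, and C, where A and B both belong to the target webpage category "novel" with target similarities of 90 and 85, and C belongs to "sports" with a target similarity of 80. Adding the target similarities of A and B gives a sum of 175 for "novel". Because 175 > 80, "novel" is preferred as the classification result of the target webpage.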
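The worked example above (A and B in "novel", C in "sports") can be reproduced with a minimal sketch of steps S15 to S17. The function `classify` and the in-memory `scoring_model` dictionary are assumptions; the words are taken to be already segmented.

```python
from collections import defaultdict

def classify(target_words, scoring_model):
    """Sum similarities per category over the words found in the
    scoring model and return (best category, per-category totals)."""
    totals = defaultdict(float)
    for word in target_words:
        if word in scoring_model:  # words absent from the model are ignored
            category, similarity = scoring_model[word]
            totals[category] += similarity
    if not totals:
        return None, {}
    best = max(totals, key=totals.get)
    return best, dict(totals)

# Hypothetical scoring model: corpus word -> (category, similarity score).
scoring_model = {"A": ("novel", 90), "B": ("novel", 85), "C": ("sports", 80)}

best, totals = classify(["A", "B", "C", "unknown"], scoring_model)
print(best, totals)  # novel {'novel': 175.0, 'sports': 80.0}
```

Returning `None` when no target word is found leaves room for the domain-name and URL fallbacks described later in the text.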
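As can be seen from the above technical solutions, the embodiments of the present application convert each corpus word in the corpus into a vector, so that comparison and similarity analysis between corpus words become vector operations, which are easier to automate on a computer and therefore improve the efficiency of webpage classification. In addition, the embodiments screen the corresponding corpus words according to preset classification seed words, which removes corpus words unrelated to webpage types and improves the accuracy of webpage classification.
In this embodiment, it is possible to select only the webpage category of the single target corpus word with the largest target similarity as the classification result of the target webpage; to sort the target corpus words by target similarity in descending order and select the webpage categories of the top N target corpus words; or to select the webpage categories of all target corpus words whose target similarity exceeds a preset threshold. Both N and the preset threshold can be set according to actual application requirements, for example N = 10 and a preset threshold of 80 (similarity expressed as a score) or 0.8 (similarity expressed as a cosine value). Note that, given the above examples, those skilled in the art can readily conceive of other ways of determining the target webpage category from the target similarities. In summary, at least one target webpage category can be selected as the classification result of the target webpage according to the determined target similarities.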
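The three alternative selection rules can be sketched side by side. The function names and the tuple layout `(word, category, similarity)` are assumptions for illustration; duplicate categories are collapsed while preserving rank order.

```python
def top_one(entries):
    """Category of the single most similar target corpus word."""
    if not entries:
        return []
    return [max(entries, key=lambda e: e[2])[1]]

def top_n(entries, n=10):
    """Categories of the n most similar target corpus words."""
    ranked = sorted(entries, key=lambda e: e[2], reverse=True)[:n]
    return list(dict.fromkeys(e[1] for e in ranked))

def above_threshold(entries, threshold=80):
    """Categories of all words whose similarity exceeds the threshold."""
    return list(dict.fromkeys(e[1] for e in entries if e[2] > threshold))

# Hypothetical target corpus words with scores on the 100-point scale.
entries = [("A", "novel", 90), ("B", "novel", 85), ("C", "sports", 80)]
```

With these entries, `top_one` and `above_threshold` both return only "novel" (80 does not exceed the threshold of 80), while `top_n` with n = 3 returns "novel" and "sports".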
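In one possible embodiment of the present application, step S16 may directly use the similarity recorded for a target corpus word in the scoring model file as its target similarity. In another possible embodiment, the target similarity may be determined as follows:
set weight coefficients for the target title and the target keywords respectively;
for a first target corpus word corresponding to the target title, calculate the product of its reference similarity in the scoring model file and the first weight coefficient, to obtain the target similarity of the first target corpus word;
for a second target corpus word corresponding to a target keyword, calculate the product of its reference similarity in the scoring model file and the second weight coefficient, to obtain the target similarity of the second target corpus word.
Because a title generally reflects the type of a webpage more accurately than keywords do, the first weight coefficient, for the target title, is greater than the second weight coefficient, for the target keywords. For example, the first weight coefficient can be set to 1 and the second to 0.8, so that the target similarity of a first target corpus word is its reference similarity multiplied by 1, and the target similarity of a second target corpus word is its reference similarity multiplied by 0.8.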
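The weighting rule can be sketched in a few lines, using the example coefficients 1 and 0.8 from the text. The `WEIGHTS` table and the `"title"`/`"keyword"` labels are hypothetical names chosen for this illustration.

```python
# Example weight coefficients from the text: title 1, keyword 0.8.
WEIGHTS = {"title": 1.0, "keyword": 0.8}

def target_similarity(reference_similarity, source):
    """Scale a reference similarity by the weight of the field
    ('title' or 'keyword') the target corpus word came from."""
    return reference_similarity * WEIGHTS[source]

title_score = target_similarity(90, "title")      # word found in the title
keyword_score = target_similarity(90, "keyword")  # same word found in keywords
```

A word with reference similarity 90 therefore contributes about 90 when it comes from the title and about 72 when it comes from the keywords.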
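In the above embodiment, setting weight coefficients raises the probability that the webpage category of a target corpus word from the target title is chosen as the classification result of the target webpage, which improves the accuracy of webpage classification.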
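Referring to FIG. 2, a webpage classification method provided by another embodiment of the present application may include the following steps:
S201: Acquire the title and keywords in the header of each webpage, and record them as corpus words in a corpus.
S202: Train the corpus with the word-to-vector tool word2vec to obtain the vector corresponding to each corpus word in the corpus, and record each corpus word and its corresponding vector in a classification model file.
S203: Determine, according to the classification model file, the vector corresponding to each classification seed word of each preset webpage category, and calculate the vector sum of all classification seed words corresponding to the same webpage category.
S204: Search the classification model file for corpus words whose vectors have a similarity to the vector sum within a preset range, and record the corpus words found, their similarities, and the webpage category corresponding to the vector sum in a scoring model file.
S205: For each webpage under the same domain name, determine its classification result.
The classification result of each webpage under the same domain name is determined as in the embodiment shown in FIG. 1. Specifically, for each webpage, the scoring model file is searched for the target corpus words corresponding to its title and keywords, the target similarity and target webpage category of each target corpus word found are determined, the sum of the target similarities of the same target webpage category is calculated, and at least one target webpage category with the largest sum is selected as the classification result of that webpage.
S206: Determine whether the classification results of the webpages under the same domain name and their corresponding similarities satisfy preset threshold conditions; if so, record the domain name as a vertical domain name in a vertical domain name list.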
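Because a domain name hosts multiple webpages and the classification result of a webpage may contain multiple webpage categories, a domain name also corresponds to multiple webpage categories. The preset threshold conditions for deciding whether a domain name is a vertical domain name include at least the following three:
1) For each webpage under the domain name, the sum of the similarities of the corpus words belonging to the webpage category chosen as its classification result, as a proportion of the sum of the similarities of all corpus words of that webpage, is higher than a first ratio.
For example, suppose the classification result of a webpage includes the two webpage categories "novel" and "sports", where the corpus words for "novel" are A and B with similarities, expressed as scores, of 90 and 85, and the corpus word for "sports" is C with a similarity of 80. The similarity proportion for "novel" is then (90+85)/(90+85+80).
2) There is at least one common webpage category, and the number of webpages under the domain name whose classification result contains that common webpage category is greater than a preset number.
The classification result of a webpage can include several webpage categories (each webpage can correspond to several categories), and the classification results of different webpages can share a category (the categories of different webpages can be partly or wholly the same). If the number of webpages under the domain name whose classification result contains webpage category D is greater than the preset number, D can be called a common webpage category of those webpages.
3) There is at least one common webpage category, and the ratio of the number of webpages under the domain name whose classification result contains that common webpage category to the total number of webpages under the domain name is greater than a second ratio.
The first ratio, the preset number, and the second ratio can all be set according to the actual application and are not specifically limited in the present application. If the aggregated results for a domain name satisfy all three conditions, the domain name can be determined to be a vertical domain name, that is, all webpages of the domain name are of the same type.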
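The three threshold conditions can be combined into one check, sketched below. This is an illustration under stated assumptions: each page is represented as a dictionary of per-category similarity sums, a page's classification result is simplified to its single top category, and `first_ratio`, `min_pages`, and `second_ratio` are hypothetical parameter names for the first ratio, the preset number, and the second ratio.

```python
from collections import Counter

def vertical_category(pages, first_ratio, min_pages, second_ratio):
    """Return the common category if the domain qualifies as vertical,
    otherwise None. `pages` is a list of {category: similarity sum}."""
    if not pages:
        return None
    counts = Counter()
    for scores in pages:
        total = sum(scores.values())
        if not scores or total <= 0:
            return None
        best = max(scores, key=scores.get)
        # Condition 1: the chosen category dominates the page's similarity.
        if scores[best] / total <= first_ratio:
            return None
        counts[best] += 1
    category, count = counts.most_common(1)[0]
    # Condition 2: enough pages share the common category.
    # Condition 3: those pages are a large enough share of the domain.
    if count > min_pages and count / len(pages) > second_ratio:
        return category
    return None

# Hypothetical pages under one domain; the first matches the worked example.
pages = [
    {"novel": 175, "sports": 80},   # 175 / 255 is about 0.69
    {"novel": 90},
    {"novel": 160, "technology": 30},
]
result = vertical_category(pages, first_ratio=0.6, min_pages=2, second_ratio=0.8)
```

With these values all three conditions hold and `result` is "novel"; raising `first_ratio` to 0.7 makes the first page fail condition 1.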
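In addition, when a domain name is determined to be a vertical domain name, this embodiment can also take the common webpage category that satisfies conditions 2) and 3) as the webpage category of the domain name (that is, the category of every webpage under the domain name is the common webpage category) and record its similarity. In one possible embodiment, the webpage category and corresponding similarity of a vertical domain name can be stored along with the domain name itself, for example in the vertical domain name list, so that later steps (such as step S210 below) can look them up.
S207: Acquire the target title and target keywords in the header of the target webpage. If acquisition succeeds, perform step S208; otherwise perform step S209.
S208: Search the scoring model file for the target corpus words corresponding to the target title and target keywords, determine the target similarity and target webpage category of each target corpus word according to the scoring model file, and perform step S212.
S209: When the target title or the target keywords are missing, determine whether the target domain name corresponding to the target webpage exists in the vertical domain name list. If it does, perform step S210; otherwise perform step S211.
S210: Determine the target webpage category and target similarity of the target webpage according to the target domain name, and perform step S212.
Under the vertical domain name rule, all webpages of a vertical domain name have the same webpage category, so the webpage category and similarity of the target domain name can be used directly as the target webpage category and target similarity of the target webpage.
In one possible embodiment of the present application, the common webpage category and corresponding similarity obtained during the determination in step S206 can be recorded as the webpage category and similarity of the corresponding vertical domain name, so that step S210 can read the entry for the target domain name directly from the recorded results.
In another possible embodiment of the present application, the webpage category and similarity of each vertical domain name can also be set directly. For example, the domain name "sports.sina.com.cn" can be assigned the webpage category "sports" with a similarity of 90.
S211: Determine the target webpage category and target similarity of the target webpage according to the URL of the target webpage, and perform step S212.
In the embodiments of the present application, webpage categories and similarities can be preset and stored for common domain names and for URLs with common features. For example, URLs matching the pattern "xxx.com/sport" can be preset to the webpage category "sports" with a similarity of 80.
S212: Calculate the sum of the target similarities corresponding to the same target webpage category, and select at least one target webpage category with the largest sum as the classification result of the target webpage.
Among the above steps, S209 to S211 are supplementary steps for the case where the header data of the target webpage are missing (including a missing title or missing keywords). In that case the target webpage category and target similarity cannot be determined through step S208, so steps S209 to S211 determine them from the target domain name or the URL features of the target webpage, thereby ensuring the accuracy of the classification result. Although the target webpage category and target similarity of any target webpage can be determined from its URL features, the vertical domain name rule is a strong rule and the URL feature rule is a weak one; that is, when both methods are feasible, the former is more accurate. Therefore, when the target domain name is a vertical domain name, the target webpage category and target similarity are determined from the target domain name in preference. Only when the target domain name is not a vertical domain name (it does not satisfy the vertical domain name rule, so the target webpage category and target similarity cannot be determined from it) are they determined from the URL features of the target webpage.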
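The fallback order of steps S209 to S211 (vertical domain name first, URL feature second) can be sketched as follows. The lookup tables reuse the two examples from the text; the function name, the table layout, and matching URL features by path prefix are assumptions made for this illustration.

```python
from urllib.parse import urlparse

def classify_without_header(url, vertical_domains, url_rules):
    """Fallback for pages whose title or keywords are missing.
    Tries the strong rule (vertical domain name) before the weak rule
    (URL feature); returns (category, similarity) or None."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host in vertical_domains:          # S209/S210: vertical domain name
        return vertical_domains[host]
    for path_prefix, result in url_rules:  # S211: URL feature
        if parsed.path.startswith(path_prefix):
            return result
    return None

# Hypothetical tables built from the examples in the text.
vertical_domains = {"sports.sina.com.cn": ("sports", 90)}
url_rules = [("/sport", ("sports", 80))]

by_domain = classify_without_header(
    "http://sports.sina.com.cn/nba/1.html", vertical_domains, url_rules)
by_url = classify_without_header(
    "http://xxx.com/sport/a", vertical_domains, url_rules)
```

Checking the domain table first mirrors the text's reasoning that the vertical domain name rule is the more reliable of the two.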
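As can be seen from the above technical solution, the embodiments of the present application analyze in advance, according to the scoring model file, whether the domain names involved are vertical domain names. When the header data of the target webpage are missing (including a missing title or missing keywords), the target webpage category and target similarity are determined from the target domain name or the URL of the target webpage, so that classification both succeeds and is accurate. If the target domain name is a vertical domain name, the target webpage category and target similarity are determined from the target domain name in preference; if it is not, they are determined from the URL features of the target webpage. The supplementary steps based on the domain name and the URL thus avoid insufficient classification accuracy, or even classification failure, caused by a missing header. They are simple to implement and do not reduce the efficiency of webpage classification.
FIG. 3 is a functional block diagram of a webpage classification apparatus according to an exemplary embodiment. The functional modules of the webpage classification apparatus may be implemented by hardware, software, or a combination of hardware and software that implements the principles of the present invention. Those skilled in the art will understand that the functional modules described in FIG. 3 may be combined or divided into sub-modules to implement those principles. Accordingly, the description herein supports any possible combination, division, or further definition of the functional modules described herein.
Referring to FIG. 3, the apparatus includes a corpus extraction unit 100, a corpus training unit 200, a corpus screening unit 300, a target webpage processing unit 400, and a webpage category determining unit 500.
The corpus extraction unit 100 is configured to acquire the title and keywords in the header of each webpage and to record them as corpus words in the corpus. As described above, in an alternative embodiment of the present application, webpage classification can be performed with an existing corpus, in which case the corpus extraction unit 100 is omitted.
The corpus training unit 200 is configured to train the corpus with the word-to-vector tool word2vec to obtain the vector corresponding to each corpus word in the corpus, and to record each corpus word and its corresponding vector in the classification model file.
The corpus screening unit 300 is configured to determine, according to the classification model file, the vector corresponding to each classification seed word of each preset webpage category, to calculate the vector sum of all classification seed words corresponding to the same webpage category, to search the classification model file for corpus words whose vectors have a similarity to the vector sum within a preset range, and to record the corpus words found, their similarities, and the webpage category corresponding to the vector sum in the scoring model file.
The target webpage processing unit 400 is configured to acquire the target title and target keywords in the header of the target webpage, to search the scoring model file for the target corpus words corresponding to the target title and target keywords, and to determine, according to the scoring model file, the target similarity and target webpage category corresponding to each target corpus word.
The webpage category determining unit 500 is configured to calculate the sum of the target similarities corresponding to the same target webpage category and to select at least one target webpage category with the largest sum as the classification result of the target webpage. As described above, those skilled in the art may instead configure the webpage category determining unit to select at least one target webpage category as the classification result according to the determined target similarities. For example, the unit may select only the webpage category of the single target corpus word with the largest target similarity; it may sort the target corpus words by target similarity in descending order and select the webpage categories of the top N target corpus words; or it may select the webpage categories of all target corpus words whose target similarity exceeds a preset threshold.
As can be seen from the above technical solutions, the embodiments of the present application convert each corpus word in the corpus into a vector, so that comparison and similarity analysis between corpus words become vector operations, which are easier to automate on a computer and therefore improve the efficiency of webpage classification. In addition, the embodiments screen the corpus words according to multiple preset classification seed words, which removes corpus words unrelated to webpage types and improves the accuracy of webpage classification.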
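In one possible embodiment of the present application, to determine the target similarity of each target corpus word, the target webpage processing unit 400 may include a weight coefficient setting unit and a target similarity calculation unit.
The weight coefficient setting unit is configured to set weight coefficients for the target title and the target keywords respectively.
The target similarity calculation unit is configured to calculate, for a first target corpus word corresponding to the target title, the product of its reference similarity in the scoring model file and the first weight coefficient corresponding to the target title, to obtain the target similarity of the first target corpus word; and to calculate, for a second target corpus word corresponding to a target keyword, the product of its reference similarity in the scoring model file and the second weight coefficient corresponding to the target keyword, to obtain the target similarity of the second target corpus word.
Referring to FIG. 4, the webpage classification apparatus provided by other possible embodiments of the present application may further include a vertical domain name determining unit 600.
The vertical domain name determining unit 600 is configured to take each webpage under the same domain name as the target webpage and determine its classification result, and to determine whether the classification results of the webpages under that domain name and their corresponding similarities satisfy preset threshold conditions; if they do, it records the domain name as a vertical domain name in the vertical domain name list.
In addition, building on the vertical domain name determining unit 600, the webpage classification apparatus provided by this embodiment may further include a target domain name processing unit 700.
The target domain name processing unit 700 is configured to determine, when acquisition of the target title or target keywords fails, whether the target domain name corresponding to the target webpage exists in the vertical domain name list, and, if it does, to determine the target webpage category and target similarity of the target webpage according to the target domain name.
Further, the webpage classification apparatus provided by this embodiment may include a URL processing unit 800. The URL processing unit 800 is configured to determine, when the target domain name processing unit 700 finds that the target domain name does not exist in the vertical domain name list, the target webpage category and target similarity of the target webpage according to the uniform resource locator (URL) of the target webpage.
As for the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.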
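In addition, the embodiments of the present application further provide a non-transitory computer storage medium, which may be, for example, a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device. The computer storage medium stores a program which, when executed by a processor of a related device, enables the device to perform some or all of the steps of the webpage classification method described in the above method embodiments.
The embodiments of the present application further provide a computing device that includes a memory and a processor. The memory stores information related to webpages. The processor reads the information related to webpages from the memory and executes some or all of the steps of the webpage classification method described in the above method embodiments. The computing device may be, for example, a personal computer, a server, a mobile terminal such as a mobile phone, or a network device.
The above technical concept of the present invention can also be implemented as a computing device that includes a processor and a non-transitory machine-readable storage medium. The non-transitory machine-readable storage medium stores executable code. When the executable code is executed by the processor, the processor performs some or all of the steps of the webpage classification method described in the above method embodiments.
Other embodiments of the present invention will be readily apparent to those skilled in the art after considering the specification and practicing the invention disclosed here. The present application is intended to cover any variations, uses, or adaptations of the present invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed in the present application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the invention being indicated by the following claims.
It should be understood that the present invention is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.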

Claims (14)

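  1. A webpage classification method, comprising:
    training a corpus with a word-to-vector tool to obtain the vector corresponding to each corpus word in the corpus, and recording each corpus word and its corresponding vector in a classification model file, wherein the corpus words in the corpus are associated with the titles and keywords in webpage headers;
    determining, according to the classification model file, the vector corresponding to each classification seed word of each preset webpage category, and calculating the vector sum of all classification seed words corresponding to the same webpage category;
    searching the classification model file for corpus words whose vectors have a similarity to the vector sum within a preset range, and recording the corpus words found, their similarities, and the webpage category corresponding to the vector sum in a scoring model file;
    searching the scoring model file for target corpus words corresponding to the target title and target keywords in the header of a target webpage;
    determining, according to the scoring model file, the target similarity and target webpage category corresponding to each target corpus word; and
    selecting, according to the determined target similarities, at least one target webpage category as the classification result of the target webpage.
  2. The webpage classification method according to claim 1, wherein selecting at least one target webpage category as the classification result of the target webpage according to the determined target similarities comprises:
    calculating the sum of the target similarities corresponding to the same target webpage category, and selecting at least one target webpage category with the largest sum of target similarities as the classification result of the target webpage.
  3. The webpage classification method according to claim 1, wherein determining the target similarity corresponding to each target corpus word according to the scoring model file comprises:
    setting weight coefficients for the target title and the target keywords respectively;
    for a first target corpus word corresponding to the target title, calculating the product of its reference similarity in the scoring model file and a first weight coefficient corresponding to the target title, to obtain the target similarity of the first target corpus word; and
    for a second target corpus word corresponding to a target keyword, calculating the product of its reference similarity in the scoring model file and a second weight coefficient corresponding to the target keyword, to obtain the target similarity of the second target corpus word.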
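  4. The webpage classification method according to any one of claims 1 to 3, further comprising:
    taking each webpage under the same domain name as the target webpage and determining its classification result; and
    in response to determining that the classification results of the webpages under the same domain name and their corresponding similarities satisfy preset threshold conditions, recording the domain name as a vertical domain name in a vertical domain name list.
  5. The webpage classification method according to claim 4, further comprising:
    when the target title or the target keywords cannot be obtained, determining whether the target domain name corresponding to the target webpage exists in the vertical domain name list; and
    in response to determining that the target domain name exists in the vertical domain name list, determining the target webpage category and target similarity of the target webpage according to the target domain name.
  6. The webpage classification method according to claim 5, further comprising:
    in response to determining that the target domain name does not exist in the vertical domain name list, determining the target webpage category and target similarity of the target webpage according to the uniform resource locator (URL) of the target webpage.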
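  7. A webpage classification apparatus, comprising:
    a corpus training unit, configured to train a corpus with a word-to-vector tool to obtain the vector corresponding to each corpus word in the corpus, and to record each corpus word and its corresponding vector in a classification model file, wherein the corpus words in the corpus are associated with the titles and keywords in webpage headers;
    a corpus screening unit, configured to determine, according to the classification model file, the vector corresponding to each classification seed word of each preset webpage category, to calculate the vector sum of all classification seed words corresponding to the same webpage category, to search the classification model file for corpus words whose vectors have a similarity to the vector sum within a preset range, and to record the corpus words found, their similarities, and the webpage category corresponding to the vector sum in a scoring model file;
    a target webpage processing unit, configured to search the scoring model file for target corpus words corresponding to the target title and target keywords in the header of a target webpage, and to determine, according to the scoring model file, the target similarity and target webpage category corresponding to each target corpus word; and
    a webpage category determining unit, configured to select, according to the determined target similarities, at least one target webpage category as the classification result of the target webpage.
  8. The webpage classification apparatus according to claim 7, wherein the webpage category determining unit comprises:
    a unit configured to calculate the sum of the target similarities corresponding to the same target webpage category and to select at least one target webpage category with the largest sum of target similarities as the classification result of the target webpage.
  9. The webpage classification apparatus according to claim 8, wherein the target webpage processing unit comprises:
    a weight coefficient setting unit, configured to set weight coefficients for the target title and the target keywords respectively; and
    a target similarity calculation unit, configured to calculate, for a first target corpus word corresponding to the target title, the product of its reference similarity in the scoring model file and a first weight coefficient corresponding to the target title, to obtain the target similarity of the first target corpus word; and to calculate, for a second target corpus word corresponding to a target keyword, the product of its reference similarity in the scoring model file and a second weight coefficient corresponding to the target keyword, to obtain the target similarity of the second target corpus word.
  10. The webpage classification apparatus according to any one of claims 7 to 9, further comprising:
    a vertical domain name determining unit, configured to take each webpage under the same domain name as the target webpage and determine its classification result, and, in response to determining that the classification results of the webpages under the same domain name and their corresponding similarities satisfy preset threshold conditions, to record the domain name as a vertical domain name in a vertical domain name list.
  11. The webpage classification apparatus according to claim 10, further comprising:
    a target domain name processing unit, configured to determine, when the target title or the target keywords cannot be obtained, whether the target domain name corresponding to the target webpage exists in the vertical domain name list, and, in response to determining that it does, to determine the target webpage category and target similarity of the target webpage according to the target domain name.
  12. The webpage classification apparatus according to claim 11, further comprising:
    a URL processing unit, configured to determine, in response to determining that the target domain name does not exist in the vertical domain name list, the target webpage category and target similarity of the target webpage according to the uniform resource locator (URL) of the target webpage.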
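  13. A computing device, comprising:
    a memory, which stores information related to webpages; and
    a processor, which reads the information related to webpages from the memory and performs the following operations:
    training a corpus with a word-to-vector tool to obtain the vector corresponding to each corpus word in the corpus, and recording each corpus word and its corresponding vector in a classification model file, wherein the corpus words in the corpus are associated with the titles and keywords in webpage headers;
    determining, according to the classification model file, the vector corresponding to each classification seed word of each preset webpage category, and calculating the vector sum of all classification seed words corresponding to the same webpage category;
    searching the classification model file for corpus words whose vectors have a similarity to the vector sum within a preset range, and recording the corpus words found, their similarities, and the webpage category corresponding to the vector sum in a scoring model file;
    searching the scoring model file for target corpus words corresponding to the target title and target keywords in the header of a target webpage;
    determining, according to the scoring model file, the target similarity and target webpage category corresponding to each target corpus word; and
    selecting, according to the determined target similarities, at least one target webpage category as the classification result of the target webpage.
  14. A non-transitory machine-readable storage medium storing executable code which, when executed by a processor, causes the processor to perform the webpage classification method according to any one of claims 1 to 6.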
PCT/CN2016/081139 2015-05-08 2016-05-05 网页分类方法和装置、计算设备以及机器可读存储介质 WO2016180270A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/505,851 US10997256B2 (en) 2015-05-08 2016-05-05 Webpage classification method and apparatus, calculation device and machine readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510230951.2A CN106202124B (zh) 2015-05-08 2015-05-08 网页分类方法及装置
CN201510230951.2 2015-05-08

Publications (1)

Publication Number Publication Date
WO2016180270A1 true WO2016180270A1 (zh) 2016-11-17

Family

ID=57248605

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/081139 WO2016180270A1 (zh) 2015-05-08 2016-05-05 网页分类方法和装置、计算设备以及机器可读存储介质

Country Status (3)

Country Link
US (1) US10997256B2 (zh)
CN (1) CN106202124B (zh)
WO (1) WO2016180270A1 (zh)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108733695B (zh) * 2017-04-18 2020-04-03 腾讯科技(深圳)有限公司 用户搜索串的意图识别方法及装置
CN108256104B (zh) * 2018-02-05 2020-05-26 恒安嘉新(北京)科技股份公司 基于多维特征的互联网网站综合分类方法
CN110147881B (zh) * 2018-03-13 2022-11-22 腾讯科技(深圳)有限公司 语言处理方法、装置、设备及存储介质
CN109388665B (zh) * 2018-09-30 2020-10-09 吉林大学 作者关系在线挖掘方法及***
CN109359301A (zh) * 2018-10-19 2019-02-19 国家计算机网络与信息安全管理中心 一种网页内容的多维度标注方法及装置
CN109829478B (zh) * 2018-12-29 2024-05-07 平安科技(深圳)有限公司 一种基于变分自编码器的问题分类方法和装置
US11080358B2 (en) * 2019-05-03 2021-08-03 Microsoft Technology Licensing, Llc Collaboration and sharing of curated web data from an integrated browser experience
CN110263175B (zh) * 2019-06-27 2022-05-03 北京金山安全软件有限公司 一种信息归类的方法、装置及电子设备
CN110674442B (zh) * 2019-09-17 2023-08-18 ***股份有限公司 页面监控方法、装置、设备及计算机可读存储介质
CN110991509B (zh) * 2019-11-25 2023-08-01 杭州安恒信息技术股份有限公司 基于人工智能技术的资产识别与信息分类方法
CN111325032B (zh) * 2020-02-21 2023-06-16 中国建设银行股份有限公司 一种5g+智能银行机构名称的规范化方法及装置
CN111382337B (zh) * 2020-03-10 2023-04-25 开封博士创新技术转移有限公司 一种信息对接匹配方法、装置、服务器及可读存储介质
CN111898369B (zh) * 2020-08-17 2024-03-08 腾讯科技(深圳)有限公司 文章标题生成方法、模型的训练方法、装置和电子设备
CN113076453A (zh) * 2021-03-22 2021-07-06 鹏城实验室 域名分类方法、设备及计算机可读存储介质
US20230409649A1 (en) * 2022-06-21 2023-12-21 Uab 360 It Systems and methods for categorizing domains using artificial intelligence

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1158460A (zh) * 1996-12-31 1997-09-03 复旦大学 一种跨语种语料自动分类与检索方法
JP2010067005A (ja) * 2008-09-10 2010-03-25 Yahoo Japan Corp 検索装置、および検索装置の制御方法
CN102831246A (zh) * 2012-09-17 2012-12-19 中央民族大学 藏文网页分类方法和装置
CN102955791A (zh) * 2011-08-23 2013-03-06 句容今太科技园有限公司 网络信息搜索与分类服务***
CN103605702A (zh) * 2013-11-08 2014-02-26 北京邮电大学 一种基于词相似度的网络文本分类方法
CN104331498A (zh) * 2014-11-19 2015-02-04 亚信科技(南京)有限公司 一种对互联网用户访问的网页内容自动分类的方法

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6460036B1 (en) * 1994-11-29 2002-10-01 Pinpoint Incorporated System and method for providing customized electronic newspapers and target advertisements
US6233575B1 (en) * 1997-06-24 2001-05-15 International Business Machines Corporation Multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values
US6360227B1 (en) * 1999-01-29 2002-03-19 International Business Machines Corporation System and method for generating taxonomies with applications to content-based recommendations
US7240067B2 (en) * 2000-02-08 2007-07-03 Sybase, Inc. System and methodology for extraction and aggregation of data from dynamic content
US7133862B2 (en) * 2001-08-13 2006-11-07 Xerox Corporation System with user directed enrichment and import/export control
US8027876B2 (en) * 2005-08-08 2011-09-27 Yoogli, Inc. Online advertising valuation apparatus and method
US8060505B2 (en) * 2007-02-13 2011-11-15 International Business Machines Corporation Methodologies and analytics tools for identifying white space opportunities in a given industry
US7822742B2 (en) * 2008-01-02 2010-10-26 Microsoft Corporation Modifying relevance ranking of search result items
US20110004588A1 (en) * 2009-05-11 2011-01-06 iMedix Inc. Method for enhancing the performance of a medical search engine based on semantic analysis and user feedback
US8326820B2 (en) * 2009-09-30 2012-12-04 Microsoft Corporation Long-query retrieval
US8554854B2 (en) * 2009-12-11 2013-10-08 Citizennet Inc. Systems and methods for identifying terms relevant to web pages using social network messages
KR101095069B1 (ko) * 2010-02-03 2011-12-20 고려대학교 산학협력단 사용자 관심 주제를 추출하는 휴대용 통신 단말기 및 그 방법
US8886587B1 (en) * 2011-04-01 2014-11-11 Google Inc. Model development and evaluation
CN102207961B (zh) * 2011-05-25 2013-10-23 盛乐信息技术(上海)有限公司 一种网页自动分类方法及装置
US8635107B2 (en) * 2011-06-03 2014-01-21 Adobe Systems Incorporated Automatic expansion of an advertisement offer inventory
US8713028B2 (en) * 2011-11-17 2014-04-29 Yahoo! Inc. Related news articles
US9020950B2 (en) * 2011-12-19 2015-04-28 Palo Alto Research Center Incorporated System and method for generating, updating, and using meaningful tags
US8972376B1 (en) * 2013-01-02 2015-03-03 Palo Alto Networks, Inc. Optimized web domains classification based on progressive crawling with clustering
CN104424308A (zh) * 2013-09-04 2015-03-18 中兴通讯股份有限公司 网页分类标准获取方法、装置及网页分类方法、装置
CN103605794B (zh) 2013-12-05 2017-02-15 国家计算机网络与信息安全管理中心 一种网站分类方法
EP3089049A4 (en) * 2014-12-26 2017-10-04 Ubic, Inc. Data analysis system, data analysis method, and data analysis program
EP3915023A1 (en) * 2019-01-23 2021-12-01 Keeeb Inc. Data processing system for data search and retrieval augmentation and enhanced data storage

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107436864A (zh) * 2017-08-04 2017-12-05 逸途(北京)科技有限公司 一种基于Word2Vec的中文问答语义相似度计算方法
CN107436864B (zh) * 2017-08-04 2021-03-02 识因智能科技(北京)有限公司 一种基于Word2Vec的中文问答语义相似度计算方法
CN107784099A (zh) * 2017-10-24 2018-03-09 济南浪潮高新科技投资发展有限公司 一种自动生成中文新闻摘要的方法
WO2019182593A1 (en) * 2018-03-22 2019-09-26 Equifax, Inc. Text classification using automatically generated seed data
US10671812B2 (en) 2018-03-22 2020-06-02 Equifax Inc. Text classification using automatically generated seed data
CN109462582A (zh) * 2018-10-30 2019-03-12 腾讯科技(深圳)有限公司 文本识别方法、装置、服务器及存储介质
CN110705290A (zh) * 2019-09-29 2020-01-17 新华三信息安全技术有限公司 一种网页分类方法及装置

Also Published As

Publication number Publication date
CN106202124B (zh) 2019-12-31
CN106202124A (zh) 2016-12-07
US10997256B2 (en) 2021-05-04
US20180218241A1 (en) 2018-08-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16792120

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 15505851

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16792120

Country of ref document: EP

Kind code of ref document: A1