TW201737118A

TW201737118A - Method and device for webpage text classification, method and device for webpage text recognition

Info

Publication number: TW201737118A
Application number: TW106105613A
Authority: TW
Inventors: Bing-Nan Duan
Original assignee: Alibaba Group Services Ltd
Priority date: 2016-03-30
Filing date: 2017-02-20
Publication date: 2017-10-16
Also published as: CN107291723A; TWI735543B; WO2017167067A1; CN107291723B

Abstract

A method and device for webpage text classification, and a method and device for webpage text recognition. The method for webpage text classification comprises: collecting text data from a webpage; segmenting the text data to obtain basic text segments; calculating a first attribute value and a second attribute value of each of the basic text segments; calculating a characteristic value of each of the basic text segments according to the first attribute value and the second attribute value; selecting characteristic text segments from the basic text segments according to the characteristic value; calculating a weight corresponding to each of the characteristic text segments; and treating the weight as a characteristic vector corresponding to the characteristic text segments and using the characteristic vector to train a classification model. The method and device of the present invention effectively ensure objectivity and accuracy in extracting characteristics and also take into account the influence of a characteristic on classification, thereby increasing the accuracy of webpage text classification and helping users obtain valid information accurately and promptly from a massive amount of text.

Description

Method and device for webpage text classification, and method and device for webpage text recognition

The present application relates to the technical field of text classification, and in particular to a method for webpage text classification, a device for webpage text classification, a method for webpage text recognition, and a device for webpage text recognition.

In today's information society, information in all its forms has greatly enriched people's lives. With the large-scale spread of the Internet in particular, the amount of information online is growing rapidly: electronic documents, emails and web pages fill the network, which leads to information clutter. To find the information we need quickly, accurately and comprehensively, text classification has become an important way to organize and manage text data effectively, and it is attracting increasingly wide attention.

Webpage text classification means determining the category of each webpage, under predefined topic categories, from the content of a massive number of webpage documents. Its technical basis is content-based plain-text classification. The basic approach is as follows: for each webpage in the crawled collection, extract the plain-text content to obtain the corresponding plain text; assemble the extracted plain text into a new document collection; and apply a plain-text classification algorithm to that collection. The webpages are then classified according to the correspondence between the plain text and the webpages; that is, a webpage is classified using its plain-text content.

Because massive text is polysemous, ambiguous and heterogeneous, the selection of classification features in the prior art is unsatisfactory. For example, the role of certain invalid words is often exaggerated, or important attributes of certain feature word segments are ignored, which makes the accuracy of webpage text classification extremely low.

In view of the above problems, the embodiments of the present application are proposed to provide a method for webpage text classification and a method for webpage text recognition, together with a corresponding device for webpage text classification and device for webpage text recognition, that overcome the above problems or at least partially solve them.

To solve the above problems, an embodiment of the present application discloses a method for webpage text classification, comprising: collecting text data from webpages; performing word segmentation on the text data to obtain basic word segments; calculating a first attribute value and a second attribute value of each basic word segment; calculating a feature value of each basic word segment according to the first attribute value and the second attribute value; selecting feature word segments from the basic word segments according to the feature values; calculating a weight for each feature word segment; and using the weights as the feature vector of the corresponding feature word segments and training a classification model with the feature vector.

Preferably, the first attribute value is the information gain value of the basic word segment, the second attribute value is the standard deviation of the chi-square statistic values of the basic word segment with respect to each predefined category, and the feature value is the discrimination degree of the basic word segment.

Preferably, the feature value of each basic word segment is calculated from the first attribute value and the second attribute value by the following formula:

Here, score is the discrimination degree of the basic word segment, igScore is its information gain value, chiScore is the chi-square statistic value of the basic word segment with respect to each predefined category, and n is the number of predefined categories.
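The formula itself is not reproduced in this text. One reconstruction consistent with the symbols defined here, assuming that the score is the product of the information gain and the standard deviation of the chi-square values over the n categories (as the detailed description states), and assuming the population form of the standard deviation (dividing by n, which the text does not confirm), would be:

```latex
\[
\mathrm{score} \;=\; \mathrm{igScore} \times
\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{chiScore}_i - \overline{\mathrm{chiScore}}\right)^{2}},
\qquad
\overline{\mathrm{chiScore}} \;=\; \frac{1}{n}\sum_{i=1}^{n}\mathrm{chiScore}_i
\]
```

Under this reading, a word segment whose chi-square value is the same for every category has a standard deviation of zero and therefore a score of zero, however large its information gain.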

Preferably, the step of selecting feature word segments from the basic word segments according to the feature values comprises: sorting the basic word segments by their feature values from high to low; and extracting a preset number of basic word segments whose feature values are higher than a preset threshold as the feature word segments.

Preferably, the step of calculating a weight for each feature word segment comprises: obtaining the number of times each feature word segment occurs in the text data of the corresponding webpage; counting the total number of feature word segments in the text data of the webpage; and calculating the weight of each feature word segment from its feature value, the number of times it occurs in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage.

Preferably, the weight of each feature word segment is calculated from the feature value of the feature word segment, the number of times it occurs in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage, by the following formula:

Here, weight is the weight of the feature word segment, tf is the number of times the feature word segment occurs in the text data of the corresponding webpage, n is the total number of feature word segments in the text data of the webpage, and score is the discrimination degree of the feature word segment.

Preferably, the step of calculating a weight for each feature word segment further comprises: normalizing the weights of the feature word segments.

Preferably, the weights of the feature word segments are normalized by the following formula:

Here, norm(weight) is the normalized weight, weight is the weight of the feature word segment, min(weight) is the minimum weight value in the text data of the webpage, and max(weight) is the maximum weight value in the text data of the webpage.
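The normalization formula is not reproduced in this text. Since it is defined only in terms of weight, min(weight) and max(weight), a plausible reading is ordinary min-max scaling, (weight - min) / (max - min). The sketch below rests on that assumption; the function name, the sample weights, and the handling of the case where all weights are equal are illustrative choices, not taken from the original.

```python
def normalize_weights(weights):
    """Min-max scale the weights of one webpage's feature word segments to [0, 1].

    Assumption: norm(weight) = (weight - min) / (max - min).
    """
    lo = min(weights.values())
    hi = max(weights.values())
    if hi == lo:
        # All weights are equal, so the scaling is undefined; map them to 0.0.
        return {segment: 0.0 for segment in weights}
    return {segment: (w - lo) / (hi - lo) for segment, w in weights.items()}


# Hypothetical weights for three feature word segments of one webpage.
normalized = normalize_weights({"match": 2.0, "stock": 4.0, "fund": 6.0})
print(normalized)
```

Scaling per webpage keeps the feature vectors of long and short pages on a comparable range before they are passed to the classifier.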

An embodiment of the present application further discloses a method for webpage text recognition, comprising: extracting text data from a webpage to be recognized; performing word segmentation on the text data to obtain basic word segments; calculating a first attribute value and a second attribute value of each basic word segment; calculating a feature value of each basic word segment according to the first attribute value and the second attribute value; selecting feature word segments from the basic word segments according to the feature values; calculating a weight for each feature word segment; inputting the weights as a feature vector into a pre-trained classification model to obtain classification information; and labeling the webpage to be recognized with the classification information.

Preferably, the step of calculating a weight for each feature word segment comprises: obtaining the number of times each feature word segment occurs in the text data of the corresponding webpage; counting the total number of feature word segments in the text data of the webpage; and calculating the weight of each feature word segment from its feature value, the number of times it occurs in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage.

An embodiment of the present application further discloses a device for webpage text classification, comprising: a collection module configured to collect text data from webpages; a word segmentation module configured to perform word segmentation on the text data to obtain basic word segments; a word segment attribute calculation module configured to calculate a first attribute value and a second attribute value of each basic word segment; a feature value calculation module configured to calculate a feature value of each basic word segment according to the first attribute value and the second attribute value; a feature extraction module configured to select feature word segments from the basic word segments according to the feature values; a feature weight assignment module configured to calculate a weight for each feature word segment; and a model training module configured to use the weights as the feature vector of the corresponding feature word segments and to train a classification model with the feature vector.

Preferably, the feature value calculation module calculates the feature value of each basic word segment from the first attribute value and the second attribute value by the following formula:

Preferably, the feature extraction module comprises: a sorting sub-module configured to sort the basic word segments by their feature values from high to low; and an extraction sub-module configured to extract a preset number of basic word segments whose feature values are higher than a preset threshold as the feature word segments.

Preferably, the feature weight assignment module comprises: an occurrence counting sub-module configured to obtain the number of times each feature word segment occurs in the text data of the corresponding webpage; a total count sub-module configured to count the total number of feature word segments in the text data of the webpage; and a calculation sub-module configured to calculate the weight of each feature word segment from its feature value, the number of times it occurs in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage.

Preferably, the calculation sub-module calculates the weight of each feature word segment from the feature value of the feature word segment, the number of times it occurs in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage, by the following formula:

Preferably, the feature weight assignment module further comprises: a normalization sub-module configured to normalize the weights of the feature word segments.

Preferably, the normalization sub-module normalizes the weights of the feature word segments by the following formula:

Here, norm(weight) is the normalized weight, weight is the weight of the feature word segment, min(weight) is the minimum weight value in the text data of the webpage, and max(weight) is the maximum weight value in the text data of the webpage.

An embodiment of the present application further discloses a device for webpage text recognition, comprising: a text extraction module configured to extract text data from a webpage to be recognized; a word segmentation module configured to perform word segmentation on the text data to obtain basic word segments; a word segment attribute calculation module configured to calculate a first attribute value and a second attribute value of each basic word segment; a feature value calculation module configured to calculate a feature value of each basic word segment according to the first attribute value and the second attribute value; a feature extraction module configured to select feature word segments from the basic word segments according to the feature values; a feature weight assignment module configured to calculate a weight for each feature word segment; a classification module configured to input the weights as a feature vector into a pre-trained classification model to obtain classification information; and a labeling module configured to label the webpage to be recognized with the classification information.

The embodiments of the present application have the following advantages: by improving how feature word segments are extracted and how their weights are calculated, the embodiments not only effectively ensure the objectivity and accuracy of feature extraction but also take into account the influence of features on classification. This improves the accuracy of webpage text classification and makes it easier for users to obtain valid information from massive amounts of text accurately and promptly.

The embodiments of the present application fuse at least two feature extraction algorithms and introduce the standard deviation into the chi-square statistics, which effectively ensures the objectivity and accuracy of feature extraction. In addition, the number of features is chosen using a long-tail distribution plot, and the feature word segments are given weights that take into account the influence of each feature on classification, so effective features can be further filtered out and webpage text classification becomes more accurate.

401‧‧‧Collection module

402‧‧‧Word segmentation module

403‧‧‧Word segment attribute calculation module

404‧‧‧Feature value calculation module

405‧‧‧Feature extraction module

406‧‧‧Feature weight assignment module

407‧‧‧Model training module

501‧‧‧Text extraction module

502‧‧‧Word segmentation module

503‧‧‧Word segment attribute calculation module

504‧‧‧Feature value calculation module

505‧‧‧Feature extraction module

506‧‧‧Feature weight assignment module

507‧‧‧Classification module

508‧‧‧Labeling module

FIG. 1 is a flow chart of the steps of a method for webpage text classification of the present application; FIG. 2 is a schematic diagram of a long-tail distribution in an example of the present application; FIG. 3 is a flow chart of the steps of a method for webpage text recognition of the present application; FIG. 4 is a structural block diagram of a device for webpage text classification of the present application; FIG. 5 is a structural block diagram of a device for webpage text recognition of the present application.

To make the above objects, features and advantages of the present application clearer and easier to understand, the application is described in further detail below with reference to the accompanying drawings and specific embodiments.

Text classification trains on a given collection of texts to obtain rules that map unknown texts to categories; that is, it calculates the relevance between a text and each category, and then uses the trained classifier to decide which category the text belongs to.

Text classification is a supervised learning process. From a labeled collection of training texts, it finds a relationship model (the classifier) between text attributes (features) and text categories, and then uses the learned model to judge the category of new texts. The overall process can be divided into two parts, training and classification. The purpose of training is to construct a classification model from the relationship between texts and categories so that it can be used for classification. Classification is the process of classifying unknown texts according to the training result and assigning them category labels.

Referring to FIG. 1, a flow chart of the steps of an embodiment of a method for webpage text classification of the present application is shown. The method may specifically include the following steps:

Step 101: collect text data from webpages. This step obtains the text data of the webpages used to train the classification model; in practice this may be a massive amount of data. The usual approach is, for each webpage in the crawled collection, to extract the plain-text content to obtain the corresponding plain text, and then to assemble the extracted plain text into a new document collection. This document collection is the text data of the webpages referred to in this application.
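As a minimal sketch of the plain-text extraction described in step 101, the following uses Python's standard html.parser module. The class name, the decision to skip script and style content, and the sample page are assumptions made for illustration; the application does not prescribe an extraction technique.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collects visible text and ignores the content of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_plain_text(html):
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical crawled page; the resulting list plays the role of the document collection.
pages = [
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>News</h1><p>Stock prices rose.</p>"
    "<script>var x = 1;</script></body></html>"
]
corpus = [extract_plain_text(page) for page in pages]
print(corpus)
```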

Step 102: perform word segmentation on the text data to obtain basic word segments. As is well known, English is written in units of words separated by spaces, whereas Chinese is written in units of characters, and the characters of a sentence must be read together to convey a meaning. For example, the English sentence "I am a student" is "我是一個學生" in Chinese. A computer can easily tell from the spaces that "student" is one word, but it cannot easily tell that the two characters "學" and "生" together form one word. Cutting a sequence of Chinese characters into meaningful words is Chinese word segmentation. For example, segmenting "我是一個學生" gives: 我 / 是 / 一個 / 學生.

Some commonly used word segmentation methods are introduced below:

1. Segmentation based on string matching: the Chinese character string to be analyzed is matched, according to a certain strategy, against the entries of a preset machine dictionary; if a string is found in the dictionary, the match succeeds (a word is recognized). Word segmentation systems in actual use all treat such mechanical segmentation as an initial pass and rely on various other linguistic information to further improve segmentation accuracy.

2. Segmentation based on feature scanning or marker segmentation: words with obvious features are first recognized and cut out of the string to be analyzed. Using these words as breakpoints, the original string can be divided into smaller strings, which are then segmented mechanically, reducing the matching error rate. Alternatively, segmentation is combined with part-of-speech tagging: rich part-of-speech information helps the segmentation decisions, and the tagging process in turn checks and adjusts the segmentation result, improving segmentation accuracy.

3. Segmentation based on understanding: words are recognized by having the computer simulate a person's understanding of the sentence. The basic idea is to perform syntactic and semantic analysis during segmentation and to use syntactic and semantic information to resolve ambiguity. Such a system usually has three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Coordinated by the master control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences in order to resolve segmentation ambiguity; that is, it simulates the process by which a person understands a sentence. This method requires a large amount of linguistic knowledge and information.

4. Segmentation based on statistics: in Chinese text, the frequency or probability with which characters co-occur adjacently reflects well how likely they are to form a word. The frequency of each combination of adjacently co-occurring characters in a corpus can therefore be counted to compute their mutual information and the adjacent co-occurrence probability of two Chinese characters X and Y. Mutual information reflects how tightly Chinese characters are bound together. When this tightness exceeds a certain threshold, the character group can be considered to possibly form a word. This method only needs to count character-group frequencies in the corpus and does not need a segmentation dictionary.
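The first method in the list above, dictionary-based string matching, can be sketched as follows. The text does not name a specific matching strategy, so forward maximum matching is assumed here, and the function name, the max_len parameter, and the four-entry dictionary are purely illustrative.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text greedily from left to right, preferring the longest dictionary entry."""
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary entry matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is accepted as a fallback when nothing matches.
            if size == 1 or candidate in dictionary:
                segments.append(candidate)
                i += size
                break
    return segments


# Toy dictionary, for illustration only.
toy_dictionary = {"我", "是", "一個", "學生"}
segments = forward_max_match("我是一個學生", toy_dictionary)
print(segments)
```

As the list notes, such mechanical matching is normally only a first pass; ambiguous strings still need other linguistic information to be segmented correctly.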

The present application does not limit how the text data is segmented. All word segments obtained by segmenting the document collection are the basic word segments referred to in this application.

In a specific implementation, before proceeding to the next step, invalid words among the basic word segments, such as stop words, may be removed in advance. Stop words are high-frequency words such as pronouns, prepositions and conjunctions that occur frequently in all kinds of text and are therefore considered to carry little information useful for classification. Those skilled in the art may also designate, as required, feature words to be deleted before or during feature extraction; the present application does not limit this.
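Stop-word removal of this kind amounts to a simple filter over the basic word segments. In the sketch below, the stop-word list and the function name are hypothetical; a real system would load a much larger list suited to its corpus.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"我", "是", "的", "一個"}


def remove_stop_words(segments, stop_words=STOP_WORDS):
    """Return the segments that are not stop words, preserving their order."""
    return [segment for segment in segments if segment not in stop_words]


filtered = remove_stop_words(["我", "是", "一個", "學生"])
print(filtered)
```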

Step 103: calculate a first attribute value and a second attribute value of each basic word segment. Step 104: calculate a feature value of each basic word segment according to the first attribute value and the second attribute value. Step 105: select feature word segments from the basic word segments according to the feature values. Steps 103 to 105 concern feature selection in text classification. The dimensionality of the original feature space is usually very high and contains many redundant features, so feature dimensionality reduction is needed. Feature selection is one kind of dimensionality reduction. Its basic idea is to score each original feature item independently with some evaluation function and sort the items by score, and then either select a number of the highest-scoring items or preset a threshold and filter out the features whose score is below it; the remaining candidate features form the resulting feature subset.

Feature selection algorithms include document frequency, mutual information, information gain, and the χ² statistic (CHI). In the prior art, those skilled in the art usually pick one of these to select feature word segments, but using a single algorithm has many drawbacks. Take information gain as an example: it infers the amount of information a word segment carries from the difference in information between the cases where the word segment does and does not appear in the text; that is, the information gain value of a word segment represents the amount of information contained in the word segment feature. A higher information gain value means the word segment feature brings more information to the classifier. However, existing information gain algorithms consider only the amount of information a word segment feature provides to the classifier as a whole, and ignore how well it discriminates between the individual categories.
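To make the information gain measure concrete, here is a minimal sketch using the common textbook definition: the entropy of the category distribution minus the expected entropy after splitting the documents by presence or absence of the word segment. The application does not spell out its exact variant, so this definition, the function names, and the four-document toy corpus are assumptions.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(term, docs):
    """docs is a list of (set_of_segments, category) pairs."""
    all_counts = Counter(label for _, label in docs)
    with_term = Counter(label for segs, label in docs if term in segs)
    without_term = all_counts - with_term  # Counter subtraction drops zero counts
    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with
    conditional = (
        (n_with / n) * entropy(list(with_term.values()))
        + (n_without / n) * entropy(list(without_term.values()))
    )
    return entropy(list(all_counts.values())) - conditional


# Toy labeled corpus, for illustration only.
docs = [
    ({"match", "score"}, "sports"),
    ({"match", "player"}, "sports"),
    ({"stock", "score"}, "finance"),
    ({"stock", "fund"}, "finance"),
]
ig_match = information_gain("match", docs)  # appears only in sports documents
ig_score = information_gain("score", docs)  # appears equally in both categories
print(ig_match, ig_score)
```

Here "match" separates the two categories perfectly, while "score" says nothing about the category, which is why their gains sit at opposite ends of the range.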

Alternatively, take the χ² statistic (CHI) as an example. The chi-square statistic is also used to characterize the correlation between two variables, and it takes into account both the presence and the absence of a feature in a given class of text. The larger the chi-square statistic value, the stronger the feature's correlation with that class and the more category information it carries. However, existing χ² statistic (CHI) algorithms excessively exaggerate the role of low-frequency words.
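The chi-square statistic of a word segment with respect to one category can be sketched from a 2x2 contingency table of document counts. The form used below, N(AD - BC)² / ((A + C)(B + D)(A + B)(C + D)), is the one commonly used in text classification and is assumed here; the application does not reproduce its own formula, and the function name and toy corpus are illustrative.

```python
def chi_square(term, category, docs):
    """Chi-square statistic of `term` with respect to `category`.

    docs is a list of (set_of_segments, category) pairs.
    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    a = b = c = d = 0
    for segs, label in docs:
        has_term = term in segs
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # the term or the category is constant across the corpus
    return n * (a * d - b * c) ** 2 / denominator


# Toy labeled corpus, for illustration only.
docs = [
    ({"match", "score"}, "sports"),
    ({"match", "player"}, "sports"),
    ({"stock", "score"}, "finance"),
    ({"stock", "fund"}, "finance"),
]
chi_match = chi_square("match", "sports", docs)
chi_score = chi_square("score", "sports", docs)
print(chi_match, chi_score)
```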

To address these drawbacks, the present application proposes using at least two algorithms for feature extraction instead of a single one; that is, two different algorithms are used to calculate the first attribute value and the second attribute value of each basic word segment. For example, the information gain algorithm is used to calculate the first attribute value and the CHI algorithm is used to calculate the second attribute value.

Of course, those skilled in the art may, depending on the actual situation, use other algorithms to calculate different attribute values of the word segments, or even more than two attribute values; all of these are feasible, and the present application does not limit this.

In a preferred embodiment of the present application, the first attribute value may be the information gain value of the basic word segment, the second attribute value may be the standard deviation of the chi-square statistic values of the basic word segment with respect to each predefined category, and the feature value may be the discrimination degree of the basic word segment. That is, step 103 may specifically include the following sub-steps: sub-step 1031, calculating the information gain value of each basic word segment; sub-step 1032, calculating the chi-square statistic values of each basic word segment; and sub-step 1033, based on the number of basic word segments, computing the standard deviation of the chi-square statistics of the basic word segment with respect to each predefined category.

In this case, step 104 may be: obtaining the degree of discrimination of each basic word segment based on the product of the information gain value and the standard deviation.
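Sub-steps 1031 to 1033 and step 104 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the textbook definitions of information gain and of the chi-square statistic computed from a 2x2 document-count table, and it assumes the population standard deviation. The function names `chi_square`, `entropy`, `information_gain`, and `discrimination` are hypothetical.

```python
import math
from statistics import pstdev


def chi_square(a, b, c, d):
    """Chi-square statistic for one segment and one category (2x2 table).

    a: documents in the category that contain the segment
    b: documents outside the category that contain the segment
    c: documents in the category that lack the segment
    d: documents outside the category that lack the segment
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def entropy(counts):
    """Shannon entropy (base 2) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(with_seg, without_seg):
    """Information gain of a segment.

    with_seg[i] / without_seg[i]: number of documents of category i
    that contain / do not contain the segment.
    """
    totals = [w + wo for w, wo in zip(with_seg, without_seg)]
    n = sum(totals)
    if n == 0:
        return 0.0
    p_seg = sum(with_seg) / n
    conditional = p_seg * entropy(with_seg) + (1 - p_seg) * entropy(without_seg)
    return entropy(totals) - conditional


def discrimination(with_seg, without_seg):
    """Information gain times the std. dev. of the per-category chi-square values.

    Assumption: population standard deviation (divide by n, not n - 1).
    """
    n_with, n_without = sum(with_seg), sum(without_seg)
    chis = [
        chi_square(a, n_with - a, c, n_without - c)
        for a, c in zip(with_seg, without_seg)
    ]
    return information_gain(with_seg, without_seg) * pstdev(chis)
```

Under these assumptions, a segment that occurs in only one of three categories receives a clearly higher score than a segment spread evenly across all of them. With exactly two categories the two per-category chi-square values coincide, so the standard deviation (and hence this score) is zero; the sketch is only informative for three or more categories.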

More specifically, the feature value of each basic word segment may be calculated from the first attribute value and the second attribute value by the following formula:
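The formula itself does not survive in this text. One reading consistent with the statement that the degree of discrimination is the product of the information gain value and the standard deviation of the chi-square values is given below. It uses the symbols score, igScore, chiScore, and n that this document defines for this formula; the population form of the standard deviation (division by n) is an assumption, and the original may use a different normalization.

```latex
\mathrm{score} = \mathrm{igScore} \times
\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{chiScore}_i - \overline{\mathrm{chiScore}}\right)^{2}},
\qquad
\overline{\mathrm{chiScore}} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{chiScore}_i
```

Here chiScore_i denotes the chi-square statistic value of the basic word segment relative to the i-th predefined category.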

The present application combines at least two feature extraction algorithms and introduces the standard deviation into the chi-square statistics, which effectively ensures the objectivity and accuracy of feature extraction.

In a preferred embodiment of the present application, step 105 may specifically include the following sub-steps: sub-step 1051, arranging the basic word segments from high to low according to their corresponding feature values; sub-step 1052, extracting a preset number of basic word segments whose feature values are higher than a preset threshold as the feature word segments.

After the feature values of the basic word segments are calculated, the values are found to follow the long-tail distribution (Zipf's law) shown schematically in Fig. 2. In Fig. 2, the horizontal axis is the number of basic word segments and the vertical axis is the degree of discrimination of the basic word segments. Applying the preferred embodiment of the present application, the basic word segments whose abscissa is, for example, greater than 0 and less than 30000 may be taken as the feature word segments.

By using the long-tail distribution plot to choose the number of features, the present application can further screen out effective features, making webpage text classification more accurate.
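Sub-steps 1051 and 1052 amount to ranking the segments and cutting the ranked list. A minimal sketch, assuming the feature values are held in a dictionary; the function name `select_features` and its parameters `max_count` and `threshold` are hypothetical stand-ins for the "preset number" and "preset threshold":

```python
def select_features(scores, max_count, threshold):
    """Rank segments by feature value (high to low) and keep at most
    max_count of those whose value exceeds threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [segment for segment, score in ranked if score > threshold][:max_count]
```

In the example of Fig. 2, `max_count` would correspond to a cut at roughly the first 30000 ranked segments.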

Step 106: calculate the weight corresponding to each feature word segment. Within a text, each feature word segment is assigned a weight that indicates how important that segment is in the text. Weights are generally calculated from the frequency of the feature items, and many calculation methods exist, such as Boolean weighting, term-frequency weighting, TF/IDF weighting, and TFC weighting. These existing weighting methods have a number of drawbacks. For example, in TF/IDF weighting, TF represents the number of occurrences of a feature in a single text and IDF reflects the feature's occurrence across the whole corpus, so the influence of the feature on classification is ignored entirely.

The present application therefore proposes a preferred embodiment for calculating the weights. In this embodiment, step 106 may include the following sub-steps: sub-step 1061, obtaining the number of times each feature word segment appears in the text data of the corresponding webpage; sub-step 1062, counting the total number of feature word segments in the text data of the webpage; sub-step 1063, calculating the weight corresponding to each feature word segment from the feature value of the feature word segment, the number of times the feature word segment appears in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage.

As an example of a specific application of the preferred embodiment of the present application, sub-step 1063 may calculate the weight corresponding to each feature word segment by the following formula:
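The weight formula does not survive in this text; the document only names its inputs: the occurrence count tf, the total number n of feature word segments in the page, and the feature value score. The sketch below assumes the simplest combination of those three quantities, weight = (tf / n) × score, and assumes that n counts occurrences rather than distinct segments. Both choices are assumptions, and the function name `feature_weights` is hypothetical.

```python
from collections import Counter


def feature_weights(page_segments, scores):
    """Weight each feature segment found on one page.

    page_segments: list of word segments produced from the page text
    scores:        dict mapping feature segment -> feature value (score)

    Assumed form: weight = (tf / n) * score, where tf is the segment's
    occurrence count on the page and n is the total number of
    feature-segment occurrences on the page.
    """
    counts = Counter(seg for seg in page_segments if seg in scores)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {seg: (tf / total) * scores[seg] for seg, tf in counts.items()}
```

Because score enters the weight directly, a segment that discriminates well between categories is weighted up even when it is not the most frequent segment on the page, which is the property the text contrasts with TF/IDF.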

In a specific implementation, more preferably, step 106 may further include the following sub-step: sub-step 1064, normalizing the weights of the feature word segments.

As an example of a specific application of the present application, the weights of the feature word segments may be normalized by the following formula:
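The normalization formula does not survive in this text, but the quantities the document defines for it, norm(weight), min(weight), and max(weight) over the text data of one webpage, match ordinary min-max scaling: norm(weight) = (weight − min(weight)) / (max(weight) − min(weight)). The sketch below assumes that reading. The function name `normalize_weights` and the handling of the degenerate case where all weights are equal are assumptions.

```python
def normalize_weights(weights):
    """Min-max normalize the weights of one page into the range [0, 1].

    weights: dict mapping feature segment -> weight
    """
    if not weights:
        return {}
    lowest = min(weights.values())
    highest = max(weights.values())
    if highest == lowest:
        # Assumption: when every weight is identical the formula is
        # undefined (0 / 0), so map all segments to 0.0.
        return {seg: 0.0 for seg in weights}
    span = highest - lowest
    return {seg: (w - lowest) / span for seg, w in weights.items()}
```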

The weights used in the above examples take into account the influence of a feature on classification and can therefore further improve the effectiveness of feature selection. Of course, any weight calculation method may be used in the present application, and the present application places no limitation on this.

The weights calculated above for the feature word segments (either the weights obtained in sub-step 1063 or the normalized weights obtained in sub-step 1064) can serve as the feature vector of a text. Once the feature vector is obtained, a text classification algorithm can be selected to train a classification model.

Step 107: use the weights as the feature vector of the corresponding feature word segments, and train a classification model with the feature vector.

Those skilled in the art may train the classification model from the feature vectors with any text classification algorithm, such as the naive Bayes algorithm, a support vector machine, or the KNN (k-nearest-neighbor) algorithm. All of these are feasible, and the present application places no limitation on this.
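As one illustration of step 107, the sketch below uses the k-nearest-neighbor option, since it needs no external library. It assumes each page's feature vector is a sparse dictionary mapping feature segment to weight and that similarity is measured by cosine similarity. The class name `KnnTextClassifier` and the helper `cosine` are hypothetical, and any of the other classifiers named above could be substituted.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(seg, 0.0) for seg, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


class KnnTextClassifier:
    """Toy k-nearest-neighbor classifier.

    For KNN, "training" only stores the labelled feature vectors.
    """

    def __init__(self, k=3):
        self.k = k
        self.samples = []

    def fit(self, vectors, labels):
        self.samples = list(zip(vectors, labels))
        return self

    def predict(self, vector):
        # Rank stored samples by similarity and let the top k vote.
        ranked = sorted(
            self.samples, key=lambda s: cosine(vector, s[0]), reverse=True
        )
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]
```

A call such as `KnnTextClassifier(k=3).fit(vectors, labels)` returns the fitted model, and `predict` then assigns a category to the feature vector of an unseen page.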

By improving both the way feature word segments are extracted and the way their weights are calculated, the embodiments of the present application not only effectively ensure the objectivity and accuracy of feature extraction but also take into account the influence of features on classification. This improves the accuracy of webpage text classification and makes it easier for users to obtain effective information from massive amounts of text in a timely and accurate manner.

Referring to Fig. 3, a flowchart of an embodiment of a method for webpage text recognition of the present application is shown, which may specifically include the following steps: step 301, extracting the text data in the webpage to be recognized; step 302, segmenting the text data to obtain basic word segments; step 303, calculating a first attribute value and a second attribute value of each basic word segment; step 304, calculating a feature value of each basic word segment from the first attribute value and the second attribute value; step 305, screening feature word segments out of the basic word segments according to the feature values; step 306, calculating the weight corresponding to each feature word segment; step 307, inputting the weights as a feature vector into a pre-trained classification model to obtain classification information; step 308, marking the webpage to be recognized with the classification information.

In a preferred embodiment of the present application, the first attribute value may be the information gain value of the basic word segment, the second attribute value may be the standard deviation of the chi-square statistic values of the basic word segment relative to the predefined categories, and the feature value may be the degree of discrimination of the basic word segment.

As an example of a specific application of the present application, the feature value of each basic word segment may be calculated from the first attribute value and the second attribute value by the following formula:

In a preferred embodiment of the present application, step 305 may include the following sub-steps: sub-step 3051, arranging the basic word segments from high to low according to their corresponding feature values; sub-step 3052, extracting a preset number of basic word segments whose feature values are higher than a preset threshold as the feature word segments.

In a preferred embodiment of the present application, step 306 may include the following sub-steps: sub-step 3061, obtaining the number of times each feature word segment appears in the text data of the corresponding webpage; sub-step 3062, counting the total number of feature word segments in the text data of the webpage; sub-step 3063, calculating the weight corresponding to each feature word segment from the feature value of the feature word segment, the number of times the feature word segment appears in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage.

As an example of a specific application of the preferred embodiment of the present application, sub-step 3063 may calculate the weight corresponding to each feature word segment by the following formula:

In a specific implementation, more preferably, step 306 may further include the following sub-step: sub-step 3064, normalizing the weights of the feature word segments.

Here, norm(weight) is the normalized weight, weight is the weight of the feature word segment, min(weight) is the smallest weight value in the text data of the webpage, and max(weight) is the largest weight value in the text data of the webpage.

The weights calculated above for the feature word segments can serve as the feature vector of a text. Once the feature vector is obtained, it can be input into the classification model generated in advance by the process shown in Fig. 1 to obtain the classification information to which the current feature vector belongs. Finally, the webpage currently being recognized is marked with the corresponding classification information.
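The recognition flow for a single page can be sketched as one function. This is an illustration under stated assumptions rather than the applicant's implementation: it assumes the feature values selected during training are reused for the page being recognized, it assumes the weight form (tf / n) × score because the weight formula does not survive in this text, and it takes the segmenter and the trained model as injected callables. The names `recognize_page`, `segment`, and `classify` are hypothetical.

```python
from collections import Counter


def recognize_page(page_text, segment, scores, classify):
    """Return the category label for one page.

    page_text: text extracted from the page to be recognized
    segment:   callable, text -> list of word segments (any segmenter)
    scores:    dict, feature segment -> feature value kept from training
    classify:  callable, feature vector (dict) -> category label
    """
    # Keep only segments that were selected as features.
    counts = Counter(seg for seg in segment(page_text) if seg in scores)
    total = sum(counts.values())
    # Assumed weight form: (tf / n) * score.
    vector = (
        {seg: (tf / total) * scores[seg] for seg, tf in counts.items()}
        if total
        else {}
    )
    return classify(vector)
```

The returned label is what step 308 would attach to the page; how the label is stored with the page is left to the caller.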

It should be noted that, for simplicity of description, the method embodiments are all expressed as a series of combined actions. Those skilled in the art should understand, however, that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application some steps may be performed in another order or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.

Referring to Fig. 4, a structural block diagram of an embodiment of a device for webpage text classification of the present application is shown, which may specifically include the following modules: a collection module 401, configured to collect the text data in a webpage; a word segmentation module 402, configured to segment the text data to obtain basic word segments; a segment attribute calculation module 403, configured to calculate a first attribute value and a second attribute value of each basic word segment; a feature value calculation module 404, configured to calculate a feature value of each basic word segment from the first attribute value and the second attribute value; a feature extraction module 405, configured to screen feature word segments out of the basic word segments according to the feature values; a feature weight assignment module 406, configured to calculate the weight corresponding to each feature word segment; and a model training module 407, configured to use the weights as the feature vector of the corresponding feature word segments and to train a classification model with the feature vector.

In a preferred embodiment of the present application, the first attribute value may be the information gain value of the basic word segment, the second attribute value may be the standard deviation of the chi-square statistic values of the basic word segment relative to the predefined categories, and the feature value may be the degree of discrimination of the basic word segment.

As an example of a specific application of the embodiment of the present application, the feature value calculation module 404 may calculate the feature value of each basic word segment from the first attribute value and the second attribute value by the following formula:

In a preferred embodiment of the present application, the feature extraction module 405 may include the following sub-modules: a sorting sub-module 4051, configured to arrange the basic word segments from high to low according to their corresponding feature values; and an extraction sub-module 4052, configured to extract a preset number of basic word segments whose feature values are higher than a preset threshold as the feature word segments.

In a preferred embodiment of the present application, the feature weight assignment module 406 may include the following sub-modules: an occurrence counting sub-module 4061, configured to obtain the number of times each feature word segment appears in the text data of the corresponding webpage; a segment total counting sub-module 4062, configured to count the total number of feature word segments in the text data of the webpage; and a calculation sub-module 4063, configured to calculate the weight corresponding to each feature word segment from the feature value of the feature word segment, the number of times the feature word segment appears in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage.

As an example of a specific application of the embodiment of the present application, the calculation sub-module 4063 may calculate the weight corresponding to each feature word segment from the feature value of the feature word segment, the number of times the feature word segment appears in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage by the following formula:

In a preferred embodiment of the present application, the feature weight assignment module 406 may further include the following sub-module: a normalization sub-module 4064, configured to normalize the weights of the feature word segments.

As an example of a specific application of the embodiment of the present application, the normalization sub-module 4064 may normalize the weights of the feature word segments by the following formula:

Because the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for related details, refer to the corresponding description of the method embodiment.

Referring to Fig. 5, a structural block diagram of an embodiment of a device for webpage text recognition of the present application is shown, which may specifically include the following modules: a text extraction module 501, configured to extract the text data in the webpage to be recognized; a word segmentation module 502, configured to segment the text data to obtain basic word segments; a segment attribute calculation module 503, configured to calculate a first attribute value and a second attribute value of each basic word segment; a feature value calculation module 504, configured to calculate a feature value of each basic word segment from the first attribute value and the second attribute value; a feature extraction module 505, configured to screen feature word segments out of the basic word segments according to the feature values; a feature weight assignment module 506, configured to calculate the weight corresponding to each feature word segment; a classification module 507, configured to input the weights as a feature vector into a pre-trained classification model to obtain classification information; and a marking module 508, configured to mark the webpage to be recognized with the classification information.

As an example of a specific application of the embodiment of the present application, the feature value calculation module 504 may calculate the feature value of each basic word segment from the first attribute value and the second attribute value by the following formula:

Here, score is the degree of discrimination of the basic word segment, igScore is the information gain value of the basic word segment, chiScore is the chi-square statistic value of the basic word segment relative to each predefined category, and n is the number of predefined categories.

In a preferred embodiment of the present application, the feature extraction module 505 may include the following sub-modules: a sorting sub-module 5051, configured to arrange the basic word segments from high to low according to their corresponding feature values; and an extraction sub-module 5052, configured to extract a preset number of basic word segments whose feature values are higher than a preset threshold as the feature word segments.

In a preferred embodiment of the present application, the feature weight assignment module 506 may include the following sub-modules: an occurrence counting sub-module 5061, configured to obtain the number of times each feature word segment appears in the text data of the corresponding webpage; a segment total counting sub-module 5062, configured to count the total number of feature word segments in the text data of the webpage; and a calculation sub-module 5063, configured to calculate the weight corresponding to each feature word segment from the feature value of the feature word segment, the number of times the feature word segment appears in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage.

In a preferred embodiment of the present application, the feature weight assignment module 506 may further include the following sub-module: a normalization sub-module 5064, configured to normalize the weights of the feature word segments.

Each embodiment in this specification focuses on its differences from the other embodiments; for the parts that are the same or similar, the embodiments may be referred to one another.

Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a device, or a computer program product. Accordingly, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.

In a typical configuration, the computer device includes one or more processors (CPUs), an input/output interface, a network interface, and memory. The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

The embodiments of the present application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, so that a series of operational steps is performed on the computer or other programmable terminal device to produce computer-implemented processing. The instructions executed on the computer or other programmable terminal device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.

Finally, it should also be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that includes the element.

The method for webpage text classification, the device for webpage text classification, the method for webpage text recognition, and the device for webpage text recognition provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, those of ordinary skill in the art may, in accordance with the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

A method for webpage text classification, characterized by comprising: collecting text data in a webpage; segmenting the text data to obtain basic word segments; calculating a first attribute value and a second attribute value of each basic word segment; calculating a feature value of each basic word segment from the first attribute value and the second attribute value; screening feature word segments out of the basic word segments according to the feature values; calculating the weight corresponding to each feature word segment; and using the weights as the feature vector of the corresponding feature word segments and training a classification model with the feature vector.

The method according to claim 1, wherein the first attribute value is the information gain value of the basic word segment, the second attribute value is the standard deviation of the chi-square statistic values of the basic word segment relative to the predefined categories, and the feature value is the degree of discrimination of the basic word segment.

The method according to claim 2, wherein the feature value of each basic word segment is calculated from the first attribute value and the second attribute value by the following formula, in which score is the degree of discrimination of the basic word segment, igScore is the information gain value of the basic word segment, chiScore is the chi-square statistic value of the basic word segment relative to each predefined category, and n is the number of predefined categories.

The method according to claim 1, 2, or 3, wherein the step of screening feature word segments out of the basic word segments according to the feature values comprises: arranging the basic word segments from high to low according to their corresponding feature values; and extracting a preset number of basic word segments whose feature values are higher than a preset threshold as the feature word segments.

The method according to claim 1, 2, or 3, wherein the step of calculating the weight corresponding to each feature word segment comprises: obtaining the number of times each feature word segment appears in the text data of the corresponding webpage; counting the total number of feature word segments in the text data of the webpage; and calculating the weight corresponding to each feature word segment from the feature value of the feature word segment, the number of times the feature word segment appears in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage.

根據申請專利範圍第5項所述的方法，其中，透過如下公式依據該特徵分詞的特徵值，各特徵分詞在相應網頁的文本資料中出現的次數，以及，該網頁的文本資料中特徵分詞的總數，計算得到各特徵分詞相應的權重：其中，weight為特徵分詞的權重，tf為特徵分詞在相應網頁的文本資料中出現的次數，n為網頁的文本資料中特徵分詞的總數，score為特徵分詞的區分度。 The method of claim 5, wherein the weight corresponding to each feature word segment is calculated from the feature value of the feature word segment, the number of times each feature word segment appears in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage by the following formula (the formula is not reproduced in this text), where weight is the weight of the feature word segment, tf is the number of times the feature word segment appears in the text data of the corresponding webpage, n is the total number of feature word segments in the text data of the webpage, and score is the degree of discrimination of the feature word segment.
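Because the weight formula itself is missing from this text, the sketch below assumes a plausible form, weight = (tf / n) × score, that is, the relative frequency of the feature word segment on the page scaled by its degree of discrimination. This form, the function name `feature_weights`, and the reading of n as the count of feature word segment occurrences (not distinct segments) are assumptions, not the claimed formula.

```python
from collections import Counter


def feature_weights(page_segments, feature_scores):
    """Weight each feature word segment found on one page.

    page_segments:  all word segments of the page, in order of appearance
    feature_scores: mapping of feature word segment -> score
    """
    # Keep only occurrences of selected feature word segments.
    hits = [seg for seg in page_segments if seg in feature_scores]
    n = len(hits)          # total feature word segment occurrences on the page
    tf = Counter(hits)     # occurrences of each feature word segment
    # ASSUMPTION: weight = (tf / n) * score; the original formula is not
    # reproduced in the source text. An empty page yields an empty mapping.
    return {seg: (count / n) * feature_scores[seg] for seg, count in tf.items()}
```

For a page segmented into `["phone", "case", "phone", "the", "price"]` with feature scores `{"phone": 2.0, "price": 1.0}`, three occurrences are feature word segments, so under the assumed form "phone" receives (2/3) × 2.0 and "price" receives (1/3) × 1.0.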

根據申請專利範圍第1或2或3項所述的方法，其中，所述計算各特徵分詞相應的權重的步驟還包括：對該特徵分詞的權重進行歸一化處理。 The method of claim 1, 2 or 3, wherein the step of calculating the weight corresponding to each feature word segment further comprises: normalizing the weight of the feature word segment.

根據申請專利範圍第7項所述的方法，其中，透過以下公式對該特徵分詞的權重進行歸一化處理：其中，norm(weight)為歸一化之後的權重，weight為該特徵分詞的權重，min(weight)為該網頁中文本資料中最小weight值，max(weight)為該網頁中文本資料中最大weight值。 The method of claim 7, wherein the weight of the feature word segment is normalized by the following formula (the formula is not reproduced in this text), where norm(weight) is the weight after normalization, weight is the weight of the feature word segment, min(weight) is the minimum weight value in the text data of the webpage, and max(weight) is the maximum weight value in the text data of the webpage.
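The variables defined in this claim (weight, min(weight), max(weight)) match the usual min-max normalization, norm(weight) = (weight − min(weight)) / (max(weight) − min(weight)). The following sketch assumes that form, since the formula itself is not reproduced here; the function name and the handling of the degenerate case where all weights are equal are also assumptions.

```python
def normalize_weights(weights):
    """Min-max normalize the feature word segment weights of one page.

    weights: mapping of feature word segment -> weight
    Returns a mapping with every weight rescaled into [0, 1].
    """
    if not weights:
        return {}
    lowest = min(weights.values())
    highest = max(weights.values())
    span = highest - lowest
    if span == 0:
        # ASSUMPTION: when all weights are equal the formula is undefined
        # (division by zero); this sketch maps every weight to 0.0.
        return {segment: 0.0 for segment in weights}
    return {segment: (w - lowest) / span for segment, w in weights.items()}
```

Normalizing per page keeps the feature vectors of long and short pages on a comparable scale before they are passed to the classification model.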

一種網頁文本識別的方法，其特徵在於，包括：提取待識別網頁中的文本資料；對該文本資料進行分詞，獲得基礎分詞；計算各基礎分詞的第一屬性值和第二屬性值；依據該第一屬性值和第二屬性值計算各基礎分詞的特徵值；依據該特徵值從該基礎分詞中篩選出特徵分詞；計算各特徵分詞相應的權重；將該權重作為特徵向量輸入預先訓練出的分類模型中，獲得分類資訊；針對該待識別網頁標記分類資訊。 A method for recognizing webpage text, comprising: extracting text data from a webpage to be recognized; segmenting the text data to obtain basic word segments; calculating a first attribute value and a second attribute value of each basic word segment; calculating a feature value of each basic word segment according to the first attribute value and the second attribute value; selecting feature word segments from the basic word segments according to the feature value; calculating a weight corresponding to each feature word segment; inputting the weights as a feature vector into a pre-trained classification model to obtain classification information; and marking the webpage to be recognized with the classification information.
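The final two steps of this claim (feeding the weights to a pre-trained model and marking the page) can be sketched as below. Everything here is hypothetical scaffolding: the fixed vocabulary order, the `predict` interface, the `ArgmaxStub` class standing in for a real pre-trained classification model, and the dictionary representation of a webpage are assumptions for illustration only.

```python
def to_feature_vector(weights, vocabulary):
    """Lay the page's weights out in a fixed vocabulary order.

    Feature word segments absent from the page get a weight of 0.0.
    """
    return [weights.get(segment, 0.0) for segment in vocabulary]


def label_page(page, weights, vocabulary, model):
    """Classify one page and return a copy marked with its category."""
    vector = to_feature_vector(weights, vocabulary)
    category = model.predict(vector)
    return {**page, "category": category}


class ArgmaxStub:
    """Toy stand-in for a pre-trained classification model.

    It returns the label tied to the largest vector component; a real
    model would be trained on feature vectors as described for the
    classification method.
    """

    def __init__(self, labels):
        self.labels = labels

    def predict(self, vector):
        return self.labels[vector.index(max(vector))]


# Usage: a page whose only feature word segment is "phone".
page = {"url": "http://example.com/item"}
labeled = label_page(page, {"phone": 0.8}, ["phone", "recipe"],
                     ArgmaxStub(["electronics", "food"]))
```

The vocabulary order must be the same at training and recognition time so that each vector position always refers to the same feature word segment.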

根據申請專利範圍第9項所述的方法，其中，該第一屬性值為該基礎分詞的資訊增益值，該第二屬性值為該基礎分詞相對於預定義的各個分類的卡方統計量值的標準差，該特徵值為該基礎分詞的區分度。 The method of claim 9, wherein the first attribute value is the information gain value of the basic word segment, the second attribute value is the standard deviation of the chi-square statistic values of the basic word segment with respect to each of the predefined categories, and the feature value is the degree of discrimination of the basic word segment.

根據申請專利範圍第9或10項所述的方法，其中，所述依據該特徵值從該基礎分詞中篩選出特徵分詞的步驟包括：將該基礎分詞按照其對應的特徵值由高至低排列；提取預設數量的，該特徵值高於預設閾值的基礎分詞作為特徵分詞。 The method of claim 9 or 10, wherein the step of selecting feature word segments from the basic word segments according to the feature value comprises: sorting the basic word segments by their corresponding feature values from highest to lowest; and extracting a preset number of basic word segments whose feature values are higher than a preset threshold as the feature word segments.

根據申請專利範圍第9或10項所述的方法，其中，所述計算各特徵分詞相應的權重的步驟包括：獲取各特徵分詞在相應網頁的文本資料中出現的次數；統計該網頁的文本資料中特徵分詞的總數；依據該特徵分詞的特徵值，各特徵分詞在相應網頁的文本資料中出現的次數，以及，該網頁的文本資料中特徵分詞的總數，計算得到各特徵分詞相應的權重。 The method of claim 9 or 10, wherein the step of calculating the weight corresponding to each feature word segment comprises: obtaining the number of times each feature word segment appears in the text data of the corresponding webpage; counting the total number of feature word segments in the text data of the webpage; and calculating the weight corresponding to each feature word segment according to the feature value of the feature word segment, the number of times each feature word segment appears in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage.

根據申請專利範圍第9或10項所述的方法，其中，所述計算各特徵分詞相應的權重的步驟還包括：對該特徵分詞的權重進行歸一化處理。 The method of claim 9 or 10, wherein the step of calculating the weight corresponding to each feature word segment further comprises: normalizing the weight of the feature word segment.

一種網頁文本分類的裝置，其特徵在於，包括：採集模組，用於採集網頁中的文本資料；分詞模組，用於對該文本資料進行分詞，獲得基礎分詞；分詞屬性計算模組，用於計算各基礎分詞的第一屬性值和第二屬性值；特徵值計算模組，用於依據該第一屬性值和第二屬性值計算各基礎分詞的特徵值；特徵提取模組，用於依據該特徵值從該基礎分詞中篩選出特徵分詞；特徵權重分配模組，用於計算各特徵分詞相應的權重；模型訓練模組，用於將該權重作為相應特徵分詞的特徵向量，採用該特徵向量訓練出分類模型。 A device for classifying webpage text, comprising: a collection module, configured to collect text data from webpages; a word segmentation module, configured to segment the text data to obtain basic word segments; a word segment attribute calculation module, configured to calculate a first attribute value and a second attribute value of each basic word segment; a feature value calculation module, configured to calculate a feature value of each basic word segment according to the first attribute value and the second attribute value; a feature extraction module, configured to select feature word segments from the basic word segments according to the feature value; a feature weight assignment module, configured to calculate a weight corresponding to each feature word segment; and a model training module, configured to use the weight as the feature vector of the corresponding feature word segment and to train a classification model with the feature vector.

根據申請專利範圍第14項所述的裝置，其中，該第一屬性值為該基礎分詞的資訊增益值，該第二屬性值為該基礎分詞相對於預定義的各個分類的卡方統計量值的標準差，該特徵值為該基礎分詞的區分度。 The device of claim 14, wherein the first attribute value is the information gain value of the basic word segment, the second attribute value is the standard deviation of the chi-square statistic values of the basic word segment with respect to each of the predefined categories, and the feature value is the degree of discrimination of the basic word segment.

根據申請專利範圍第15項所述的裝置，其中，該特徵值計算模組透過如下公式依據該第一屬性值和第二屬性值計算各基礎分詞的特徵值：其中，score為基礎分詞的區分度，igScore為基礎分詞的資訊增益值，chiScore為基礎分詞對相對於預定義的各個分類的卡方統計量值，n為預定義的分類的數量。 The device of claim 15, wherein the feature value calculation module calculates the feature value of each basic word segment from the first attribute value and the second attribute value by the following formula (the formula is not reproduced in this text), where score is the degree of discrimination of the basic word segment, igScore is the information gain value of the basic word segment, chiScore is the chi-square statistic value of the basic word segment with respect to each of the predefined categories, and n is the number of predefined categories.

根據申請專利範圍第14或15或16項所述的裝置，其中，該特徵提取模組包括：排序子模組，用於將該基礎分詞按照其對應的特徵值由高至低排列；提取子模組，用於提取預設數量的，該特徵值高於預設閾值的基礎分詞作為特徵分詞。 The device of claim 14, 15 or 16, wherein the feature extraction module comprises: a sorting sub-module, configured to sort the basic word segments by their corresponding feature values from highest to lowest; and an extraction sub-module, configured to extract a preset number of basic word segments whose feature values are higher than a preset threshold as the feature word segments.

根據申請專利範圍第14或15或16項所述的裝置，其中，該特徵權重分配模組包括：次數統計子模組，用於獲取各特徵分詞在相應網頁的文本資料中出現的次數；分詞總數統計子模組，用於統計該網頁的文本資料中特徵分詞的總數；計算子模組，用於依據該特徵分詞的特徵值，各特徵分詞在相應網頁的文本資料中出現的次數，以及，該網頁的文本資料中特徵分詞的總數，計算得到各特徵分詞相應的權重。 The device of claim 14, 15 or 16, wherein the feature weight assignment module comprises: an occurrence counting sub-module, configured to obtain the number of times each feature word segment appears in the text data of the corresponding webpage; a word segment total counting sub-module, configured to count the total number of feature word segments in the text data of the webpage; and a calculation sub-module, configured to calculate the weight corresponding to each feature word segment according to the feature value of the feature word segment, the number of times each feature word segment appears in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage.

根據申請專利範圍第18項所述的裝置，其中，該計算子模組透過如下公式依據該特徵分詞的特徵值，各特徵分詞在相應網頁的文本資料中出現的次數，以及，該網頁的文本資料中特徵分詞的總數，計算得到各特徵分詞相應的權重：其中，weight為特徵分詞的權重，tf為特徵分詞在相應網頁的文本資料中出現的次數，n為網頁的文本資料中特徵分詞的總數，score為特徵分詞的區分度。 The device of claim 18, wherein the calculation sub-module calculates the weight corresponding to each feature word segment from the feature value of the feature word segment, the number of times each feature word segment appears in the text data of the corresponding webpage, and the total number of feature word segments in the text data of the webpage by the following formula (the formula is not reproduced in this text), where weight is the weight of the feature word segment, tf is the number of times the feature word segment appears in the text data of the corresponding webpage, n is the total number of feature word segments in the text data of the webpage, and score is the degree of discrimination of the feature word segment.

根據申請專利範圍第14或15或16項所述的裝置，其中，該特徵權重分配模組還包括：歸一化子模組，用於對該特徵分詞的權重進行歸一化處理。 The device of claim 14, 15 or 16, wherein the feature weight assignment module further comprises: a normalization sub-module, configured to normalize the weight of the feature word segment.

根據申請專利範圍第20項所述的裝置，其中，該歸一化子模組透過以下公式對該特徵分詞的權重進行歸一化處理：其中，norm(weight)為歸一化之後的權重，weight為該特徵分詞的權重，min(weight)為該網頁中文本資料中最小weight值，max(weight)為該網頁中文本資料中最大weight值。 The device of claim 20, wherein the normalization sub-module normalizes the weight of the feature word segment by the following formula (the formula is not reproduced in this text), where norm(weight) is the weight after normalization, weight is the weight of the feature word segment, min(weight) is the minimum weight value in the text data of the webpage, and max(weight) is the maximum weight value in the text data of the webpage.

一種網頁文本識別的裝置，其特徵在於，包括：文本提取模組，用於提取待識別網頁中的文本資料；分詞模組，用於對該文本資料進行分詞，獲得基礎分詞；分詞屬性計算模組，用於計算各基礎分詞的第一屬性值和第二屬性值；特徵值計算模組，用於依據該第一屬性值和第二屬性值計算各基礎分詞的特徵值；特徵提取模組，用於依據該特徵值從該基礎分詞中篩選出特徵分詞；特徵權重分配模組，用於計算各特徵分詞相應的權重；分類模組，用於將該權重作為特徵向量輸入預先訓練出的分類模型中，獲得分類資訊；標記模組，用於針對該待識別網頁標記分類資訊。 A device for recognizing webpage text, comprising: a text extraction module, configured to extract text data from a webpage to be recognized; a word segmentation module, configured to segment the text data to obtain basic word segments; a word segment attribute calculation module, configured to calculate a first attribute value and a second attribute value of each basic word segment; a feature value calculation module, configured to calculate a feature value of each basic word segment according to the first attribute value and the second attribute value; a feature extraction module, configured to select feature word segments from the basic word segments according to the feature value; a feature weight assignment module, configured to calculate a weight corresponding to each feature word segment; a classification module, configured to input the weights as a feature vector into a pre-trained classification model to obtain classification information; and a marking module, configured to mark the webpage to be recognized with the classification information.