CN105893556B - Entry classification method and device based on encyclopedic content - Google Patents

Info

Publication number
CN105893556B
Authority
CN
China
Prior art keywords
entry
attribute
category
webpage
webpages
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610201440.2A
Other languages
Chinese (zh)
Other versions
CN105893556A (en
Inventor
王智广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qizhi Business Consulting Co ltd
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd
Priority to CN201610201440.2A
Publication of CN105893556A
Application granted
Publication of CN105893556B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an entry classification method and device based on encyclopedic content. The method comprises the following steps: extracting attribute data of an entry from the encyclopedic content of that entry, wherein the attribute data comprises attribute names and attribute values; acquiring the entry webpages corresponding to the extracted attribute values; and determining the category of each extracted attribute name, and determining the category of the entry webpage corresponding to the attribute value according to the category of the attribute name. The method and the device thus determine the category of the entry corresponding to an attribute value directly from the category of the attribute name in the entry's attribute data, in contrast to the prior art, in which the categories of some entries are labeled manually and a machine learning method then predicts the categories of unknown entries.

Description

Entry classification method and device based on encyclopedic content
Technical Field
The invention relates to the technical field of Internet applications, and in particular to an entry classification method and device based on encyclopedic content.
Background
An encyclopedia is a general compendium of knowledge across all disciplines, such as astronomy, geography, nature, the humanities, religion, belief, and literature. An online encyclopedia is a knowledge base: an open, free network encyclopedia containing entries of many kinds. Covering many types of entries is characteristic of encyclopedias, and some applications need encyclopedia entries to be classified (for example as person, movie work, or music work). Because many encyclopedia entries are edited by internet users, they carry no explicit classification information.
In the related art, encyclopedia entries are classified mainly by machine learning. Specifically, keywords that can represent the entry category are extracted from the content of the encyclopedia entries, the categories of some entries are labeled manually, and the categories of unknown entries are then predicted. However, classification by machine learning has two drawbacks: first, a large labeled set must be annotated manually; second, the accuracy is limited.
Therefore, how to classify encyclopedia entries rapidly and accurately has become a technical problem to be solved urgently.
Disclosure of Invention
In view of the above, the present invention has been made to provide an encyclopedia-based vocabulary entry classification method and a corresponding apparatus that overcome or at least partially solve the above-mentioned problems.
According to an aspect of the present invention, there is provided an entry classification method based on encyclopedic content, including:
extracting attribute data of corresponding entries from encyclopedic content, wherein the attribute data comprise attribute names and attribute values;
acquiring entry webpages corresponding to the extracted attribute values;
and determining the category of the extracted attribute name, and determining the category of the entry webpage corresponding to the attribute value according to the category of the attribute name.
Optionally, the extracting attribute data of its corresponding entry from encyclopedic content includes:
determining a field for extracting attribute data of the entry;
and extracting attribute data of the corresponding entries from the encyclopedic content by using the determined fields.
Optionally, the extracting attribute data of its corresponding entry from encyclopedic content includes:
acquiring position information indicating where the attribute data of the corresponding entry is recorded in the encyclopedic content;
and extracting attribute data of the corresponding entry from the encyclopedic content according to the position information.
Optionally, acquiring the position information of the attribute data of the corresponding entry recorded in the encyclopedic content includes:
matching entry webpages corresponding to the encyclopedic contents in a webpage template library to obtain webpage templates corresponding to the entry webpages;
and acquiring the position information of the attribute data of the corresponding entry in the encyclopedic content according to the webpage template corresponding to the entry webpage.
Optionally, the method further comprises:
determining, for entry webpages of different page types under each website, the position information of the attribute data of the corresponding entries;
and recording the correspondence between the entry webpages of each page type and that position information, thereby generating the webpage template library.
Optionally, before the entry webpage corresponding to the extracted attribute value is acquired, the method further includes:
taking the extracted attribute value as the entry to be classified; or
taking the entry matching the extracted attribute value as the entry to be classified.
Optionally, the obtaining of the entry webpage corresponding to the extracted attribute value includes:
acquiring a link address corresponding to the attribute value in the encyclopedia content;
and taking the link address as a vocabulary entry webpage corresponding to the attribute value.
Optionally, the obtaining of the entry webpage corresponding to the extracted attribute value includes:
looking up the entry webpage corresponding to the attribute value in a pre-established correspondence between entries and entry webpages.
Optionally, determining the category of the extracted attribute name includes:
and converting the attribute name into a standardized category field and using the standardized category field as the category of the attribute name.
Optionally, determining the category of the entry webpage corresponding to the attribute value according to the category of the attribute name includes:
and taking the category of the attribute name as the category of the entry webpage corresponding to the attribute value.
According to another aspect of the present invention, there is also provided an entry classification device based on encyclopedic content, including:
the extraction module is suitable for extracting attribute data of corresponding entries from encyclopedic content, wherein the attribute data comprise attribute names and attribute values;
the acquisition module is suitable for acquiring the entry webpage corresponding to the extracted attribute value;
and the determining module is suitable for determining the category of the extracted attribute name and determining the category of the entry webpage corresponding to the attribute value according to the category of the attribute name.
Optionally, the extraction module is further adapted to:
determining a field for extracting attribute data of the entry;
and extracting attribute data of the corresponding entries from the encyclopedic content by using the determined fields.
Optionally, the extraction module is further adapted to:
acquiring position information indicating where the attribute data of the corresponding entry is recorded in the encyclopedic content;
and extracting attribute data of the corresponding entry from the encyclopedic content according to the position information.
Optionally, the extraction module is further adapted to:
matching entry webpages corresponding to the encyclopedic contents in a webpage template library to obtain webpage templates corresponding to the entry webpages;
and acquiring the position information of the attribute data of the corresponding entry in the encyclopedic content according to the webpage template corresponding to the entry webpage.
Optionally, the apparatus further comprises a generating module adapted to:
determining, for entry webpages of different page types under each website, the position information of the attribute data of the corresponding entries;
and recording the correspondence between the entry webpages of each page type and that position information, thereby generating the webpage template library.
Optionally, the obtaining module is further adapted to:
before obtaining the entry webpage corresponding to the extracted attribute value, taking the extracted attribute value as the entry to be classified; or
taking the entry matching the extracted attribute value as the entry to be classified.
Optionally, the obtaining module is further adapted to:
acquiring a link address corresponding to the attribute value in the encyclopedia content;
and taking the link address as a vocabulary entry webpage corresponding to the attribute value.
Optionally, the obtaining module is further adapted to:
looking up the entry webpage corresponding to the attribute value in a pre-established correspondence between entries and entry webpages.
Optionally, the determining module is further adapted to: and converting the attribute name into a standardized category field and using the standardized category field as the category of the attribute name.
Optionally, the determining module is further adapted to: and taking the category of the attribute name as the category of the entry webpage corresponding to the attribute value.
In the embodiments of the invention, attribute data of an entry, comprising attribute names and attribute values, is first extracted from the encyclopedic content of that entry; the entry webpage corresponding to each extracted attribute value is then acquired; the category of the extracted attribute name is determined; and the category of the entry webpage corresponding to the attribute value is determined according to the category of the attribute name. The category of the entry corresponding to an attribute value can therefore be determined directly from the category of the attribute name in the entry's attribute data, in contrast to the prior art, in which the categories of some entries are labeled manually and a machine learning method then predicts the categories of unknown entries.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 illustrates a flow diagram of an encyclopedia content-based entry classification method according to one embodiment of the invention;
FIG. 2 is a schematic structural diagram of an entry classification device based on encyclopedic content according to an embodiment of the present invention; and
FIG. 3 is a schematic structural diagram of an entry classification device based on encyclopedic content according to another embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to solve the technical problem, an embodiment of the present invention provides an entry classification method based on encyclopedic content. Fig. 1 shows a flowchart of an encyclopedia content-based vocabulary entry classification method according to an embodiment of the present invention. As shown in fig. 1, the method at least includes the following steps S102 to S106:
step S102, extracting attribute data of corresponding entries from encyclopedic contents, wherein the attribute data comprises attribute names and attribute values;
step S104, acquiring entry webpages corresponding to the extracted attribute values;
and step S106, determining the category of the extracted attribute name, and determining the category of the entry webpage corresponding to the attribute value according to the category of the attribute name.
In the embodiments of the invention, attribute data of an entry, comprising attribute names and attribute values, is first extracted from the encyclopedic content of that entry; the entry webpage corresponding to each extracted attribute value is then acquired; the category of the extracted attribute name is determined; and the category of the entry webpage corresponding to the attribute value is determined according to the category of the attribute name. The category of the entry corresponding to an attribute value can therefore be determined directly from the category of the attribute name in the entry's attribute data, in contrast to the prior art, in which the categories of some entries are labeled manually and a machine learning method then predicts the categories of unknown entries.
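Steps S102 to S106 can be sketched as a small pipeline. This is only an illustrative sketch, not the patented implementation: the function name `classify_entries`, its three callable parameters, and the toy data (including the unlinked "height" attribute) are assumptions introduced here.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def classify_entries(
    content: object,
    extract: Callable[[object], Iterable[Tuple[str, str]]],
    acquire: Callable[[str], Optional[str]],
    categorize_name: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Map entry webpages to categories following steps S102 to S106."""
    labels: Dict[str, str] = {}
    for name, value in extract(content):      # S102: (attribute name, attribute value)
        webpage = acquire(value)              # S104: entry webpage of the attribute value
        category = categorize_name(name)      # S106: category of the attribute name
        if webpage is not None and category is not None:
            labels[webpage] = category        # S106: webpage takes the name's category
    return labels


# Toy stand-ins for the three steps (hypothetical data).
content = [("wife", "Li Si"), ("height", "unknown")]
webpages = {"Li Si": "http://baike.so.com/view/66376.htm"}
categories = {"wife": "person"}

labels = classify_entries(content, lambda c: c, webpages.get, categories.get)
```

Passing the three steps in as callables keeps the sketch neutral about how each step is realized; the alternatives for each step are described in the paragraphs that follow.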
For step S102, extracting the attribute data of the corresponding entry from the encyclopedic content, the embodiments of the invention provide several implementations, for example using the fields corresponding to the attribute data or using the position information of the attribute data in the entry webpage. These are described in detail below.
In the first mode, the attribute data is extracted using the fields corresponding to the attribute data. That is, the fields used for extracting the attribute data of an entry are determined, and the attribute data of the corresponding entry is extracted from the encyclopedic content using the determined fields. Such fields may be, for example, nationality and graduate college; family, phylum, species, and kingdom; or production region and type. The fields may be collected and determined according to the types of entries involved: for entries of the person type, fields such as nationality and graduate college are collected; for entries of the movie work type, fields such as production region and type are collected.
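A minimal sketch of the first mode, assuming the attribute data of an entry is already available as a list of (name, value) pairs. The field lists reuse the examples given above; the dictionary `FIELDS_BY_ENTRY_TYPE`, the function `extract_by_fields`, and the sample rows are hypothetical.

```python
# Hypothetical field lists per entry type; the field names follow the examples in the text.
FIELDS_BY_ENTRY_TYPE = {
    "person": {"nationality", "graduate college"},
    "movie work": {"production region", "type"},
    "organism": {"kingdom", "phylum", "family", "species"},
}


def extract_by_fields(attribute_rows, fields):
    """Keep only the (name, value) pairs whose attribute name is one of the determined fields."""
    wanted = {field.lower() for field in fields}
    return [
        (name, value)
        for name, value in attribute_rows
        if name.strip().lower() in wanted
    ]


# Made-up attribute rows for one entry; "Height" is not a determined field and is dropped.
rows = [
    ("Nationality", "China"),
    ("Height", "unknown"),
    ("Graduate college", "Dance Academy"),
]
all_fields = set().union(*FIELDS_BY_ENTRY_TYPE.values())
attributes = extract_by_fields(rows, all_fields)
```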
In the second mode, the attribute data is extracted using the position information of the attribute data in the entry webpage. That is, position information indicating where the attribute data of the corresponding entry is recorded in the encyclopedic content is acquired, and the attribute data is extracted from the encyclopedic content according to that position information. For example, the attribute data located under "basic information" in an entry webpage may be extracted.
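The second mode might look like the following sketch, which uses Python's standard `html.parser`. It assumes, purely for illustration, that the "basic information" block is an element with `id="basic-info"` holding `<dt>`/`<dd>` pairs; the real markup of any encyclopedia site is not described in the source, and the class name `BasicInfoParser` is hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags without a closing tag


class BasicInfoParser(HTMLParser):
    """Collect <dt>/<dd> text pairs inside the element whose id equals location_id."""

    def __init__(self, location_id):
        super().__init__()
        self.location_id = location_id  # position information (assumed to be an element id)
        self.depth = 0                  # nesting depth inside the located element, 0 = outside
        self.current = None             # "dt" or "dd" while inside such a tag
        self.attr_name = ""
        self.attr_value = ""
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            if tag == "dt":
                self.current, self.attr_name = "dt", ""
            elif tag == "dd":
                self.current, self.attr_value = "dd", ""
        elif dict(attrs).get("id") == self.location_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        if tag == "dd" and self.current == "dd":
            self.pairs.append((self.attr_name.strip(), self.attr_value.strip()))
        if tag in ("dt", "dd"):
            self.current = None
        self.depth -= 1

    def handle_data(self, data):
        if self.current == "dt":
            self.attr_name += data
        elif self.current == "dd":
            self.attr_value += data


page = """
<div id="summary">(entry summary text)</div>
<dl id="basic-info">
  <dt>Nationality</dt><dd><a href="http://baike.so.com/subview/61891/14022133.htm">China</a></dd>
  <dt>Wife</dt><dd><a href="http://baike.so.com/view/66376.htm">Li Si</a></dd>
</dl>
"""
parser = BasicInfoParser("basic-info")
parser.feed(page)
parser.close()
pairs = parser.pairs
```

The sketch assumes well-formed markup; a production extractor would need to tolerate unclosed tags.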
Furthermore, for acquiring the position information in the second mode, an embodiment of the invention provides an optional scheme: the entry webpage corresponding to the encyclopedic content is matched against a webpage template library to obtain the webpage template corresponding to that entry webpage, and the position information of the attribute data of the corresponding entry recorded in the encyclopedic content is then acquired according to that webpage template.
Because the position of the attribute data differs between entry webpages of different page types under different websites, in an embodiment of the invention the position information of the attribute data is determined for entry webpages of each page type under each website, and the correspondence between page types and that position information is recorded to generate the webpage template library. Further, in an optional embodiment, the structure and/or topic of a large number of collected entry webpages may be analyzed, and entry webpages with the same structure and/or topic may be assigned to the same page type.
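One way to picture the webpage template library is as a dictionary keyed by website and page type. The sketch below is an assumption-laden illustration: the page-type key (an ordered, de-duplicated tag sequence), the location strings such as `dl#basic-info`, and all function names are hypothetical and are not taken from the source.

```python
from urllib.parse import urlparse


def page_type_of(tag_sequence):
    """Hypothetical page-type key: the ordered, de-duplicated sequence of block tags."""
    seen = []
    for tag in tag_sequence:
        if tag not in seen:
            seen.append(tag)
    return "/".join(seen)


def build_template_library(samples):
    """samples: (url, tag_sequence, location) triples for pages analysed in advance."""
    library = {}
    for url, tags, location in samples:
        library[(urlparse(url).netloc, page_type_of(tags))] = location
    return library


def match_template(library, url, tags):
    """Return the position information recorded for the matching template, or None."""
    return library.get((urlparse(url).netloc, page_type_of(tags)))


# Made-up structures and locations for two page types on one website.
library = build_template_library([
    ("http://baike.so.com/view/2717.htm", ["h1", "p", "dl", "p"], "dl#basic-info"),
    ("http://baike.so.com/subview/61891/14022133.htm", ["h1", "ul", "table"], "table.info"),
])
location = match_template(
    library, "http://baike.so.com/view/15347.htm", ["h1", "p", "p", "dl"]
)
```

Pages with the same structure key under the same website share one template, which mirrors the idea of grouping entry webpages with the same structure into one page type.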
The first and second modes above describe how to extract the attribute data of the corresponding entry from the encyclopedic content. In practical applications, the two modes may also be combined to extract the attribute data; the invention is not limited in this respect.
After the attribute data of the corresponding entry is extracted from the encyclopedic content in step S102, the extracted attribute value may be used as the entry to be classified, or the entry matching the extracted attribute value may be used as the entry to be classified. Taking the entry "Zhang San" as an example, the attribute data of the entry "Zhang San" is extracted from the encyclopedic content; the attribute data includes attribute names and attribute values, as shown in Table 1 below.
TABLE 1
Attribute name | Attribute value | Entry webpage corresponding to attribute value
Nationality | China | http://baike.so.com/subview/61891/14022133.htm
Ethnicity | Han Chinese | http://baike.so.com/view/2717.htm
Graduate college | Dance Academy | http://baike.so.com/view/15347.htm
Wife | Li Si | http://baike.so.com/view/66376.htm
At this point, the extracted attribute values "China", "Han Chinese", "Dance Academy", and "Li Si" may be used as entries to be classified.
Further, for acquiring the entry webpage corresponding to the extracted attribute value in step S104, an embodiment of the invention provides an optional scheme in which the link address corresponding to the attribute value is obtained from the encyclopedic content and used as the entry webpage corresponding to the attribute value. In another optional scheme, the entry webpage corresponding to the attribute value is looked up in a pre-established correspondence between entries and entry webpages. Taking the entry "Zhang San" as an example, step S104 obtains the entry webpage (i.e., the link address) corresponding to each extracted attribute value (i.e., each entry to be classified), as shown in Table 1 above.
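The two options for step S104 can be combined in a short sketch: first look for a link address attached to the attribute value in the encyclopedic content, then fall back to a pre-established entry index. The names `LinkCollector`, `entry_webpage_for`, and `entry_index`, and the sample HTML fragment, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Map the text of each <a> element to its href attribute."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None
        self._text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = ""

    def handle_data(self, data):
        if self._href is not None:
            self._text += data

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links[self._text.strip()] = self._href
            self._href = None


def entry_webpage_for(value, content_html, entry_index):
    """Option 1: link address in the content; option 2: pre-established correspondence."""
    collector = LinkCollector()
    collector.feed(content_html)
    collector.close()
    link = collector.links.get(value)
    if link is not None:
        return link
    return entry_index.get(value)


content_html = (
    '<dd><a href="http://baike.so.com/view/66376.htm">Li Si</a></dd>'
    "<dd>Dance Academy</dd>"
)
entry_index = {"Dance Academy": "http://baike.so.com/view/15347.htm"}

from_link = entry_webpage_for("Li Si", content_html, entry_index)
from_index = entry_webpage_for("Dance Academy", content_html, entry_index)
```

Trying the link first and the index second is a design choice of this sketch; the source presents the two schemes as alternatives without prescribing an order.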
In an optional embodiment of the invention, to determine the category of the extracted attribute name in step S106, the attribute name may be converted into a standardized category field, which is used as the category of the attribute name. For example, the attribute names "nationality", "ethnicity", "graduate college", and "wife" in Table 1 are converted into the standardized category fields "country", "ethnic group", "school institution", and "person", respectively.
In another optional embodiment of the invention, to determine the category of the entry webpage corresponding to the attribute value according to the category of the attribute name in step S106, the category of the attribute name may be used directly as the category of that entry webpage. Taking the entry "Zhang San" as an example, the category of each attribute name is used as the category of the entry webpage corresponding to its attribute value, as shown in Table 2 below.
TABLE 2
Entry webpage | Category
http://baike.so.com/subview/61891/14022133.htm | Country
http://baike.so.com/view/2717.htm | Ethnic group
http://baike.so.com/view/15347.htm | School institution
http://baike.so.com/view/66376.htm | Person
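The conversion from attribute names to standardized category fields, and the resulting assignment shown in Table 2, can be sketched as a lookup table. The mapping `CATEGORY_FIELDS` is hypothetical, and the English labels "ethnicity" and "ethnic group" are an assumed reading of the translated source, which renders both the first and the second attribute name as "nationality".

```python
# Hypothetical standardization table: attribute name -> standardized category field.
CATEGORY_FIELDS = {
    "nationality": "country",
    "ethnicity": "ethnic group",
    "graduate college": "school institution",
    "wife": "person",
}


def standardize(attribute_name):
    """Normalize case and whitespace, then look up the standardized category field."""
    key = " ".join(attribute_name.lower().split())
    return CATEGORY_FIELDS.get(key)


def categorize(rows):
    """rows: (attribute name, attribute value, entry webpage) triples as in Table 1."""
    categories = {}
    for name, _value, webpage in rows:
        category = standardize(name)
        if category is not None:
            categories[webpage] = category  # the webpage takes the attribute name's category
    return categories


table_1 = [
    ("Nationality", "China", "http://baike.so.com/subview/61891/14022133.htm"),
    ("Ethnicity", "Han Chinese", "http://baike.so.com/view/2717.htm"),
    ("Graduate college", "Dance Academy", "http://baike.so.com/view/15347.htm"),
    ("Wife", "Li Si", "http://baike.so.com/view/66376.htm"),
]
table_2 = categorize(table_1)
```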
It should be noted that, in practical applications, the optional embodiments above may be combined in any manner to form further optional embodiments of the invention, which are not described again here.
Based on the same inventive concept as the entry classification method provided by the embodiments above, an embodiment of the invention also provides an entry classification device based on encyclopedic content. FIG. 2 is a schematic structural diagram of an entry classification device based on encyclopedic content according to an embodiment of the present invention. As shown in FIG. 2, the device may include at least an extraction module 210, an acquisition module 220, and a determination module 230.
The functions of the components of the entry classification device based on encyclopedic content, and the connections between them, are described below:
the extraction module 210 is adapted to extract attribute data of a corresponding entry from encyclopedic content, wherein the attribute data includes an attribute name and an attribute value;
an obtaining module 220, coupled to the extracting module 210, adapted to obtain entry webpages corresponding to the extracted attribute values;
the determining module 230, coupled to the obtaining module 220, is adapted to determine the category of the extracted attribute name, and determine the category of the entry webpage corresponding to the attribute value according to the category of the attribute name.
In an embodiment of the present invention, the extraction module 210 is further adapted to: determine the fields used for extracting the attribute data of the entry; and extract the attribute data of the corresponding entry from the encyclopedic content using the determined fields. Here, the fields may be, for example, nationality and graduate college; family, phylum, species, and kingdom; or production region and type; the invention is not limited in this respect. The fields may be collected and determined according to the types of entries involved: for entries of the person type, fields such as nationality and graduate college are collected; for entries of the movie work type, fields such as production region and type are collected.
In an embodiment of the present invention, the extraction module 210 is further adapted to: acquire position information indicating where the attribute data of the corresponding entry is recorded in the encyclopedic content; and extract the attribute data of the corresponding entry from the encyclopedic content according to the position information. For example, the attribute data located under "basic information" in an entry webpage may be extracted.
In an embodiment of the present invention, the extracting module 210 is further adapted to:
matching entry webpages corresponding to encyclopedic contents in a webpage template library to obtain webpage templates corresponding to the entry webpages;
and acquiring the position information of the attribute data of the corresponding entry in the encyclopedic content according to the webpage template corresponding to the entry webpage.
In an embodiment of the present invention, as shown in fig. 3, the apparatus shown in fig. 2 above may further include: a generating module 240, coupled to the extracting module 210, adapted to:
determining, for entry webpages of different page types under each website, the position information of the attribute data of the corresponding entries;
and recording the correspondence between the entry webpages of each page type and that position information, thereby generating a webpage template library.
In an embodiment of the present invention, the obtaining module 220 is further adapted to:
before obtaining the entry webpage corresponding to the extracted attribute value, taking the extracted attribute value as the entry to be classified; or
taking the entry matching the extracted attribute value as the entry to be classified.
In an embodiment of the present invention, the obtaining module 220 is further adapted to:
acquiring a link address corresponding to the attribute value in encyclopedia content;
and taking the link address as a vocabulary entry webpage corresponding to the attribute value.
In an embodiment of the present invention, the obtaining module 220 is further adapted to:
looking up the entry webpage corresponding to the attribute value in a pre-established correspondence between entries and entry webpages.
In an embodiment of the invention, the determining module 230 is further adapted to:
the attribute name is converted into a standardized category field and used as a category of the attribute name.
In an embodiment of the invention, the determining module 230 is further adapted to:
and taking the category of the attribute name as the category of the entry webpage corresponding to the attribute value.
According to any one or a combination of the above preferred embodiments, the following advantages can be achieved by the embodiments of the present invention:
In the embodiments of the invention, attribute data of an entry, comprising attribute names and attribute values, is first extracted from the encyclopedic content of that entry; the entry webpage corresponding to each extracted attribute value is then acquired; the category of the extracted attribute name is determined; and the category of the entry webpage corresponding to the attribute value is determined according to the category of the attribute name. The category of the entry corresponding to an attribute value can therefore be determined directly from the category of the attribute name in the entry's attribute data, in contrast to the prior art, in which the categories of some entries are labeled manually and a machine learning method then predicts the categories of unknown entries.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the entry classification device based on encyclopedic content according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering. These words may be interpreted as names.
Thus, it should be appreciated by those skilled in the art that while a number of exemplary embodiments of the invention have been illustrated and described in detail herein, many other variations or modifications consistent with the principles of the invention may be directly determined or derived from the disclosure of the present invention without departing from the spirit and scope of the invention. Accordingly, the scope of the invention should be understood and interpreted to cover all such other variations or modifications.
An aspect of the embodiments of the present invention provides A1, an entry classification method based on encyclopedic content, including:
extracting attribute data of corresponding entries from encyclopedic content, wherein the attribute data comprise attribute names and attribute values;
acquiring entry webpages corresponding to the extracted attribute values;
and determining the category of the extracted attribute name, and determining the category of the entry webpage corresponding to the attribute value according to the category of the attribute name.
A2, the method according to A1, wherein the extracting attribute data of corresponding entries from encyclopedia content includes:
determining a field for extracting attribute data of the entry;
and extracting attribute data of the corresponding entries from the encyclopedic content by using the determined fields.
A3, the method according to A1 or A2, wherein the extracting attribute data of corresponding entries from encyclopedia content comprises:
acquiring position information indicating where the attribute data of the corresponding entry is recorded in the encyclopedic content;
and extracting attribute data of the corresponding entry from the encyclopedic content according to the position information.
A4, the method according to any one of A1 to A3, wherein the acquiring of the position information of the attribute data of the corresponding entry recorded in the encyclopedic content includes:
matching entry webpages corresponding to the encyclopedic contents in a webpage template library to obtain webpage templates corresponding to the entry webpages;
and acquiring the position information of the attribute data of the corresponding entry in the encyclopedic content according to the webpage template corresponding to the entry webpage.
A5, the method according to any one of A1-A4, further comprising:
determining position information of attribute data of entries corresponding to entry webpages of different page types for the entry webpages of different page types under various websites;
and recording the corresponding relation between the entry webpages of different page types and the position information of the attribute data of the corresponding entries of the entry webpages of different page types, and generating the webpage template library.
A6, the method according to any one of A1-A5, wherein before obtaining the entry webpage corresponding to the extracted attribute value, the method further includes:
taking the extracted attribute values as entries to be classified; or
taking the entries matched with the extracted attribute values as entries to be classified.
A7, the method according to any one of A1-A6, wherein the obtaining of the entry webpage corresponding to the extracted attribute values includes:
acquiring a link address corresponding to the attribute value in the encyclopedia content;
and taking the link address as the entry webpage corresponding to the attribute value.
A8, the method according to any one of A1-A7, wherein the obtaining of the entry webpage corresponding to the extracted attribute values includes:
and searching the entry webpage corresponding to the attribute value in the corresponding relation according to the pre-established corresponding relation between the entry and the entry webpage.
A9, the method according to any one of A1-A8, wherein determining the category of the extracted attribute name includes:
and converting the attribute name into a standardized category field and using the standardized category field as the category of the attribute name.
A10, the method according to any one of A1-A9, wherein the determining the category of the entry webpage corresponding to the attribute value according to the category of the attribute name comprises:
and taking the category of the attribute name as the category of the entry webpage corresponding to the attribute value.
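The flow of A1, together with the refinements of A7 to A10, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the table `CATEGORY_FIELDS`, the `Attribute` record, the sample attribute data, and the `example.org` URLs are all hypothetical, and the attribute data is assumed to have been extracted already.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical table that converts raw attribute names into standardized
# category fields (A9). A real system would maintain a far larger table.
CATEGORY_FIELDS = {
    "director": "person/director",
    "directed by": "person/director",
    "starring": "person/actor",
    "author": "person/writer",
}


@dataclass
class Attribute:
    name: str             # attribute name, e.g. "Directed by"
    value: str            # attribute value, e.g. a person's name
    link: Optional[str]   # link address attached to the value, if any (A7)


def normalize_category(attribute_name: str) -> Optional[str]:
    """Convert an attribute name into a standardized category field (A9)."""
    return CATEGORY_FIELDS.get(attribute_name.strip().lower())


def resolve_entry_page(attr: Attribute, entry_index: Dict[str, str]) -> Optional[str]:
    """Find the entry webpage for an attribute value.

    Prefer the link address found in the encyclopedic content (A7) and fall
    back to a pre-established entry-to-webpage correspondence (A8).
    """
    return attr.link or entry_index.get(attr.value)


def classify_entries(attributes: List[Attribute],
                     entry_index: Dict[str, str]) -> Dict[str, str]:
    """Return {entry webpage: category} for every resolvable attribute value."""
    result = {}
    for attr in attributes:
        category = normalize_category(attr.name)
        page = resolve_entry_page(attr, entry_index)
        if category is not None and page is not None:
            # A10: the category of the attribute name becomes the category
            # of the entry webpage corresponding to the attribute value.
            result[page] = category
    return result


# Hypothetical sample data standing in for one entry's attribute box.
attributes = [
    Attribute("Directed by", "Person A", "https://example.org/wiki/Person_A"),
    Attribute("Starring", "Person B", None),
    Attribute("Running time", "120 minutes", None),
]
entry_index = {"Person B": "https://example.org/wiki/Person_B"}
classified = classify_entries(attributes, entry_index)
```

In this sketch, attribute names with no standardized category field (such as "Running time") and attribute values with no resolvable entry webpage are simply skipped. How such cases should be handled is a design choice the text above does not specify.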
Another aspect of the embodiments of the present invention provides B11, an entry classification device based on encyclopedia content, including:
the extraction module is suitable for extracting attribute data of corresponding entries from encyclopedic content, wherein the attribute data comprise attribute names and attribute values;
the acquisition module is suitable for acquiring the entry webpage corresponding to the extracted attribute value;
and the determining module is suitable for determining the category of the extracted attribute name and determining the category of the entry webpage corresponding to the attribute value according to the category of the attribute name.
B12, the apparatus according to B11, wherein the extraction module is further adapted to:
determining a field for extracting attribute data of the entry;
and extracting attribute data of the corresponding entries from the encyclopedic content by using the determined fields.
B13, the apparatus according to B11 or B12, wherein the extraction module is further adapted to:
acquiring position information indicating where the attribute data of the corresponding entry is recorded in the encyclopedic content;
and extracting attribute data of the corresponding entry from the encyclopedic content according to the position information.
B14, the apparatus according to any one of B11-B13, wherein the extraction module is further adapted to:
matching entry webpages corresponding to the encyclopedic contents in a webpage template library to obtain webpage templates corresponding to the entry webpages;
and acquiring the position information of the attribute data of the corresponding entry in the encyclopedic content according to the webpage template corresponding to the entry webpage.
B15, the apparatus according to any one of B11-B14, further comprising a generating module adapted to:
determining position information of attribute data of entries corresponding to entry webpages of different page types for the entry webpages of different page types under various websites;
and recording the corresponding relation between the entry webpages of different page types and the position information of the attribute data of the corresponding entries of the entry webpages of different page types, and generating the webpage template library.
B16, the apparatus according to any one of B11-B15, wherein the acquisition module is further adapted to:
before obtaining the entry webpage corresponding to the extracted attribute value, taking the extracted attribute value as an entry to be classified; or
taking the entry matched with the extracted attribute value as an entry to be classified.
B17, the apparatus according to any one of B11-B16, wherein the acquisition module is further adapted to:
acquiring a link address corresponding to the attribute value in the encyclopedia content;
and taking the link address as the entry webpage corresponding to the attribute value.
B18, the apparatus according to any one of B11-B17, wherein the acquisition module is further adapted to:
and searching the entry webpage corresponding to the attribute value in the corresponding relation according to the pre-established corresponding relation between the entry and the entry webpage.
B19, the apparatus according to any one of B11-B18, wherein the determining module is further adapted to: converting the attribute name into a standardized category field and using the standardized category field as the category of the attribute name.
B20, the apparatus according to any one of B11-B19, wherein the determining module is further adapted to: taking the category of the attribute name as the category of the entry webpage corresponding to the attribute value.
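The webpage template library of A4 and A5 (B14 and B15) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the page type is reduced to the host name, the position information is expressed as the CSS class of the elements holding attribute names and values, and `TEMPLATE_LIBRARY`, `baike.example.com`, and the sample HTML are hypothetical. The text above does not prescribe any of these representations.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical template library: page type (here simply the host name) mapped
# to position information for attribute names and attribute values.
TEMPLATE_LIBRARY = {
    "baike.example.com": {"name_class": "attr-name", "value_class": "attr-value"},
}


class AttributeExtractor(HTMLParser):
    """Collect text from elements whose class matches the template."""

    def __init__(self, template):
        super().__init__()
        self.template = template
        self.current = None  # "name", "value", or None
        self.names = []
        self.values = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.template["name_class"] in classes:
            self.current = "name"
        elif self.template["value_class"] in classes:
            self.current = "value"

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current capture.
        self.current = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self.current is None:
            return
        target = self.names if self.current == "name" else self.values
        target.append(text)


def extract_attributes(url, html):
    """Match the page to a template, then extract (name, value) pairs."""
    template = TEMPLATE_LIBRARY.get(urlparse(url).hostname)
    if template is None:
        return []  # no template recorded for this page type
    parser = AttributeExtractor(template)
    parser.feed(html)
    return list(zip(parser.names, parser.values))


# Hypothetical entry webpage fragment.
sample_html = (
    '<dl>'
    '<dt class="attr-name">Director</dt><dd class="attr-value">Person A</dd>'
    '<dt class="attr-name">Author</dt><dd class="attr-value">Person B</dd>'
    '</dl>'
)
pairs = extract_attributes("https://baike.example.com/item/1", sample_html)
```

Pairing names and values by order of appearance works only when each attribute name is followed by exactly one value element. A production extractor would need position information that ties each value to its name explicitly.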

Claims (18)

1. An entry classification method based on encyclopedic content comprises the following steps:
extracting attribute data of corresponding entries from encyclopedic content, wherein the attribute data comprise attribute names and attribute values;
acquiring entry webpages corresponding to the extracted attribute values;
determining the category of the extracted attribute name, and determining the category of the entry webpage corresponding to the attribute value according to the category of the attribute name;
wherein the method further comprises:
analyzing the structures and/or themes of at least two entry webpages, and dividing the entry webpages with the same structures and/or themes into entry webpages belonging to the same page type;
wherein, the extracting of the attribute data of the corresponding entry from the encyclopedic content comprises:
determining fields of attribute data for extracting entries, wherein the fields of the attribute data are collected and determined according to types contained in the entries;
and extracting attribute data of the corresponding entries from the encyclopedic content by using the determined fields.
2. The method of claim 1, wherein the extracting attribute data of its corresponding entry from encyclopedia content comprises:
acquiring position information indicating where the attribute data of the corresponding entry is recorded in the encyclopedic content;
and extracting attribute data of the corresponding entry from the encyclopedic content according to the position information.
3. The method according to claim 2, wherein the acquiring of the position information indicating where the attribute data of the corresponding entry is recorded in the encyclopedic content comprises:
matching entry webpages corresponding to the encyclopedic contents in a webpage template library to obtain webpage templates corresponding to the entry webpages;
and acquiring the position information of the attribute data of the corresponding entry in the encyclopedic content according to the webpage template corresponding to the entry webpage.
4. The method of claim 3, further comprising:
determining position information of attribute data of entries corresponding to entry webpages of different page types for the entry webpages of different page types under various websites;
and recording the corresponding relation between the entry webpages of different page types and the position information of the attribute data of the corresponding entries of the entry webpages of different page types, and generating the webpage template library.
5. The method of claim 1, wherein before obtaining the entry webpage corresponding to the extracted attribute value, the method further comprises:
taking the extracted attribute values as entries to be classified; or
taking the entries matched with the extracted attribute values as entries to be classified.
6. The method of claim 1, wherein obtaining the entry webpage corresponding to the extracted attribute value comprises:
acquiring a link address corresponding to the attribute value in the encyclopedia content;
and taking the link address as the entry webpage corresponding to the attribute value.
7. The method of claim 1, wherein obtaining the entry webpage corresponding to the extracted attribute value comprises:
and searching the entry webpage corresponding to the attribute value in the corresponding relation according to the pre-established corresponding relation between the entry and the entry webpage.
8. The method of claim 1, wherein determining the category of the extracted attribute name comprises:
and converting the attribute name into a standardized category field and using the standardized category field as the category of the attribute name.
9. The method of claim 1, wherein determining the category of the entry webpage corresponding to the attribute value according to the category of the attribute name comprises:
and taking the category of the attribute name as the category of the entry webpage corresponding to the attribute value.
10. An entry classification device based on encyclopedic content, comprising:
the extraction module is suitable for extracting attribute data of corresponding entries from encyclopedic content, wherein the attribute data comprise attribute names and attribute values;
the acquisition module is suitable for acquiring the entry webpage corresponding to the extracted attribute value;
the determining module is suitable for determining the category of the extracted attribute name and determining the category of the entry webpage corresponding to the attribute value according to the category of the attribute name;
wherein the apparatus further comprises:
the analysis module is suitable for analyzing the structures and/or the themes of at least two entry webpages and dividing the entry webpages with the same structures and/or themes into the entry webpages belonging to the same page type;
wherein the extraction module is further adapted to:
determining fields of attribute data for extracting entries, wherein the fields of the attribute data are collected and determined according to types contained in the entries;
and extracting attribute data of the corresponding entries from the encyclopedic content by using the determined fields.
11. The apparatus of claim 10, wherein the extraction module is further adapted to:
acquiring position information indicating where the attribute data of the corresponding entry is recorded in the encyclopedic content;
and extracting attribute data of the corresponding entry from the encyclopedic content according to the position information.
12. The apparatus of claim 11, wherein the extraction module is further adapted to:
matching entry webpages corresponding to the encyclopedic contents in a webpage template library to obtain webpage templates corresponding to the entry webpages;
and acquiring the position information of the attribute data of the corresponding entry in the encyclopedic content according to the webpage template corresponding to the entry webpage.
13. The apparatus of claim 12, further comprising a generation module adapted to:
determining position information of attribute data of entries corresponding to entry webpages of different page types for the entry webpages of different page types under various websites;
and recording the corresponding relation between the entry webpages of different page types and the position information of the attribute data of the corresponding entries of the entry webpages of different page types, and generating the webpage template library.
14. The apparatus of claim 10, wherein the acquisition module is further adapted to:
before obtaining the entry webpage corresponding to the extracted attribute value, taking the extracted attribute value as an entry to be classified; or
taking the entry matched with the extracted attribute value as an entry to be classified.
15. The apparatus of claim 10, wherein the acquisition module is further adapted to:
acquiring a link address corresponding to the attribute value in the encyclopedia content;
and taking the link address as the entry webpage corresponding to the attribute value.
16. The apparatus of claim 10, wherein the acquisition module is further adapted to:
and searching the entry webpage corresponding to the attribute value in the corresponding relation according to the pre-established corresponding relation between the entry and the entry webpage.
17. The apparatus of claim 10, wherein the determination module is further adapted to: converting the attribute name into a standardized category field and using the standardized category field as the category of the attribute name.
18. The apparatus of claim 10, wherein the determination module is further adapted to: taking the category of the attribute name as the category of the entry webpage corresponding to the attribute value.
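The division of entry webpages into page types recited in claims 1 and 10 can be illustrated with a deliberately crude sketch. It assumes that "the same structure" means an identical sequence of opening HTML tags. That is only one possible interpretation; the names `structure_signature` and `group_by_page_type` and the sample pages are hypothetical, and analysis by theme is not covered.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Record the sequence of opening tags as a crude structural signature."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def structure_signature(html):
    """Return the tuple of opening tag names found in the page."""
    parser = TagSequence()
    parser.feed(html)
    return tuple(parser.tags)


def group_by_page_type(pages):
    """Group URLs whose pages share one signature. pages maps URL to HTML."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[structure_signature(html)].append(url)
    return list(groups.values())


# Hypothetical pages: the first two share a structure, the third does not.
pages = {
    "https://example.org/a": "<dl><dt>Director</dt><dd>Person A</dd></dl>",
    "https://example.org/b": "<dl><dt>Author</dt><dd>Person B</dd></dl>",
    "https://example.org/c": "<table><tr><td>Founded</td></tr></table>",
}
page_types = group_by_page_type(pages)
```

Exact signature equality is brittle, since real pages of one type differ in the number of rows or optional sections. A practical system would more likely compare structures with a similarity threshold.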

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610201440.2A CN105893556B (en) 2016-03-31 2016-03-31 Entry classification method and device based on encyclopedic content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610201440.2A CN105893556B (en) 2016-03-31 2016-03-31 Entry classification method and device based on encyclopedic content

Publications (2)

Publication Number Publication Date
CN105893556A CN105893556A (en) 2016-08-24
CN105893556B true CN105893556B (en) 2020-04-14

Family

ID=57012093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610201440.2A Active CN105893556B (en) 2016-03-31 2016-03-31 Entry classification method and device based on encyclopedic content

Country Status (1)

Country Link
CN (1) CN105893556B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304530B (en) * 2018-01-26 2022-03-18 腾讯科技(深圳)有限公司 Knowledge base entry classification method and device and model training method and device
CN109492745A (en) * 2018-11-01 2019-03-19 西北工业大学 A kind of intelligence machine describes method and device
CN113672742B (en) * 2021-08-19 2024-04-26 鲨鱼快游网络技术(北京)有限公司 Knowledge co-building method, electronic equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073729A (en) * 2011-01-14 2011-05-25 百度在线网络技术(北京)有限公司 Relationship knowledge sharing platform and implementation method thereof
CN102495892A (en) * 2011-12-09 2012-06-13 北京大学 Webpage information extraction method
CN103123636A (en) * 2011-11-21 2013-05-29 北京百度网讯科技有限公司 Method to build vocabulary entry classification models, method of vocabulary entry automatic classification and device
CN103984685A (en) * 2013-02-07 2014-08-13 百度国际科技(深圳)有限公司 Method, device and equipment for classifying items to be classified

Also Published As

Publication number Publication date
CN105893556A (en) 2016-08-24

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee after: Beijing Qizhi Business Consulting Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20240108

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Beijing Qizhi Business Consulting Co.,Ltd.
