WO2017080183A1 - Network novel chapter list evaluation method and device - Google Patents

Network novel chapter list evaluation method and device

Info

Publication number
WO2017080183A1
Authority
WO
WIPO (PCT)
Prior art keywords
chapter list
list page
value
page
chapter
Prior art date
Application number
PCT/CN2016/083434
Other languages
French (fr)
Chinese (zh)
Inventor
何建国
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority date
Filing date
Publication date
Application filed by 北京奇虎科技有限公司, 奇智软件(北京)有限公司
Publication of WO2017080183A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Definitions

  • the present invention relates to the field of Internet technologies, and in particular, to a method and apparatus for evaluating a list of network novel chapters.
  • the same e-book (such as a novel, etc.) usually exists at multiple sites at the same time, so when searching for an e-book, a plurality of sites where the e-book exists are presented in the search results.
  • chapter list names are not standardized, and chapters may be repeated, broken, or invalid, all of which harm the user experience.
  • the top-ranked site in existing search results is taken to be the best-quality site, that is, the site with the fewest non-standard chapter list names, repeated chapters, broken chapters, invalid chapters, and so on; however, its chapter list page may still be incomplete, and it may even contain fake chapters that have been pieced together.
  • the chapter list page is evaluated by manually configuring the template of the novel site.
  • although this method is highly accurate, its disadvantage is also obvious: the number of websites that manual effort can cover is limited, and the approach is not intelligent enough. Therefore, how to evaluate the chapter list page flexibly, quickly and accurately becomes a technical problem that needs to be solved.
  • the present invention has been made in order to provide a method and apparatus for evaluating a network novel chapter list that overcomes the above problems or at least partially solves the above problems.
  • the invention provides a method for evaluating a list of network novel chapters, comprising the following steps:
  • the invention also provides a network novel chapter list evaluation device, comprising:
  • a categorization module configured to determine a similarity between a plurality of chapter list pages of the same subject and classify the chapter list pages whose similarity is higher than a preset threshold into the same set, wherein each chapter list page corresponds to one site;
  • a diversity module configured to obtain an authority value of each site in the same set and use the set with the largest sum of authority values as the first set, wherein the authority value is determined according to the ratings given to the site by multiple users;
  • a feature quantity obtaining module configured to acquire at least one feature quantity value of each chapter list page in the first set;
  • a target obtaining module configured to calculate, according to a preset rule, a comprehensive weight of the at least one feature quantity value of each chapter list page, and obtain a chapter list page in which the comprehensive weight value is the largest.
  • a computer program comprising computer readable code which, when run on a computing device, causes said computing device to perform the network novel chapter list evaluation method described above.
  • a computer readable medium storing the above computer program is provided.
  • the present invention has the following advantages:
  • the present invention provides a method for evaluating a network novel chapter list: a plurality of chapter list pages from different sites are classified into sets based on the similarity between the chapter list pages; the authority value of each site in the same set is then obtained, and the set with the largest sum of authority values is used as the first set; finally, the comprehensive weight of the at least one feature quantity value of each chapter list page in the first set is calculated based on a preset rule, and the chapter list page with the largest comprehensive weight is obtained. That is, the scheme can automatically acquire the chapter list pages of multiple sites and obtain the highest-quality chapter list page by jointly considering the similarity, the authority values of the sites, and the acquired feature quantity values. This solves the problem that judging chapter list pages through manually configured templates is inefficient; the solution of the present invention can flexibly and quickly identify the chapter list page that best satisfies the requirements, and the evaluation result is accurate and objective.
  • when acquiring at least one feature quantity value of each chapter list page, the present invention separately analyzes and obtains, based on preset rules, multiple feature quantity values that characterize the correctness, integrity, and authenticity of the chapter list page.
  • fake chapter list pages are filtered out, and the comprehensive weight corresponding to the at least one feature quantity value is then obtained for each chapter list page based on the preset rule; the chapter list page with the largest comprehensive weight is the highest-quality target chapter list page. That is, the solution of the present invention can automatically compare and analyze the quality of chapter list pages in terms of correctness, completeness, and authenticity, and identify the most effective chapter list page, so that the evaluation result is more accurate.
  • FIG. 1 is a flow chart of a program of an embodiment of a method for evaluating a network novel chapter list according to the present invention
  • FIG. 2 is a flow chart showing a procedure of an embodiment of a method for evaluating a network novel chapter list according to the present invention
  • FIG. 3 is a flow chart showing a procedure of an embodiment of a method for evaluating a network novel chapter list in the present invention
  • FIG. 4 is a flow chart showing a procedure of an embodiment of a method for evaluating a network novel chapter list in the present invention
  • FIG. 5 is a flowchart of a process of an embodiment of a method for evaluating a network novel chapter list according to the present invention
  • FIG. 6 is a structural block diagram of an embodiment of a network novel chapter list evaluation apparatus according to the present invention.
  • FIG. 7 is a structural block diagram of a categorization module in an embodiment of a network novel chapter list evaluation apparatus according to the present invention.
  • FIG. 8 is a structural block diagram of a feature quantity acquisition module in an embodiment of a network novel chapter list evaluation apparatus according to the present invention.
  • FIG. 9 is a structural block diagram of a feature quantity acquisition module in an embodiment of a network novel chapter list evaluation apparatus according to the present invention.
  • FIG. 10 is a structural block diagram of a target acquisition module in an embodiment of a network novel chapter list evaluation apparatus according to the present invention.
  • FIG. 11 is a block diagram schematically showing a computing device for performing a network novel chapter list evaluation method according to the present invention.
  • FIG. 12 schematically shows a storage unit for holding or carrying program code implementing the network novel chapter list evaluation method according to the present invention.
  • the Internet generally includes a client (user mobile terminal), a network, and a server (such as the web server of a website).
  • the client can be a user's Internet terminal, such as a desktop computer (PC), a laptop, or a smart device with web browsing capabilities such as a personal digital assistant (PDA), a mobile Internet device (MID), or a smartphone; each of these can request a service provided by another process (such as a server-provided process) in an Internet environment.
  • a mobile phone loaded with an APP of an e-book function is used as a client, for example, an Android mobile phone or the like; and a user feedback column is provided in the APP, and the user can send question feedback information to the server through the column.
  • the server returns a reply message to the user.
  • the server is typically a remote computer system that can be accessed via a communication medium such as the Internet. Moreover, a server can often serve multiple clients over the Internet.
  • the service process includes receiving requests from the client, collecting user information and feedback information, and the like.
  • the server acts as an information provider for the computer network.
  • the server is usually located on the party providing the service, or configured by the service provider to serve the content, such a service provider may be, for example, an Internet service company's website.
  • the method for evaluating a network novel chapter list provided by the present invention is described from the perspective of a server, and can be implemented by programming as a computer program on a remote network device. Such a device includes, but is not limited to, a computer, a network host, a single web server, a set of multiple network servers, or a cloud of multiple servers.
  • an exemplary embodiment of a method for evaluating a list of network novel chapters of the present invention specifically includes the following steps:
  • in the method for evaluating a network novel chapter list according to the present invention, a web spider can acquire data from a plurality of websites for the same subject, thereby acquiring the chapter list pages of that subject.
  • the subject may be a title of the novel or a part of key text features therein. Therefore, before step S11, the method further includes the step of: obtaining the chapter list page corresponding to the subject from the plurality of sites based on the same subject.
  • the search engine may receive a search request whose keyword is the subject and perform structural analysis on the webpages under the domain name of a novel website; if a webpage includes multiple parallel chapter list tags, the webpage can be determined to be a novel chapter list page. The link targets (href, Hypertext Reference) of the parallel chapter list tags are highly similar to one another: they share the same chapter list directory but have different file names. For example, assume that the href attributes of the parallel chapter list tags all contain the directory 5_5288 but contain different file names, namely 970871 to 970980.
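The href pattern described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `looks_like_chapter_list` and the thresholds `min_links` and `min_share` are assumptions introduced here, and the check simply asks whether most links on a page share one directory while pointing at distinct files.

```python
from collections import Counter
from html.parser import HTMLParser
from posixpath import dirname


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def looks_like_chapter_list(html, min_links=20, min_share=0.8):
    """Return True when most links share one directory but differ in file name.

    min_links and min_share are illustrative thresholds, not values from the source.
    """
    collector = _HrefCollector()
    collector.feed(html)
    hrefs = collector.hrefs
    if len(hrefs) < min_links:
        return False
    directory, count = Counter(dirname(h) for h in hrefs).most_common(1)[0]
    same_dir = {h for h in hrefs if dirname(h) == directory}
    # Links must share a directory and point at distinct files.
    return count / len(hrefs) >= min_share and len(same_dir) == count
```

With links such as `/5_5288/970871.html` through `/5_5288/970980.html`, every href has the directory `/5_5288`, so the page would be treated as a chapter list page under these assumed thresholds.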
  • the plurality of parallel chapter list tags included in the novel chapter list page contain chapter text feature vectors, which include keywords and/or chapter numbers representing the chapters, and the search engine may evaluate the chapter list page based on the keywords and/or chapter numbers.
  • a chapter list tag includes the keyword “chapter”, and may also include “volume”, “section”, etc.; it also includes number words such as “one”, “two”, “eighteen” and the like; of course, chapter numbers can also be stored in numeric form, such as “1”, “2”, “18”, and the like.
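As a rough illustration of extracting a keyword and a chapter number from a chapter title, the following Python sketch may help. The tables `UNIT_WORDS` and `NUMBER_WORDS` and the function `chapter_feature` are hypothetical names chosen for this example, and the tables are deliberately tiny.

```python
import re

# Illustrative keyword and number-word tables; a real system would need far more.
UNIT_WORDS = ("chapter", "volume", "section")
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "eighteen": 18}

_PATTERN = re.compile(
    r"\b(?P<unit>" + "|".join(UNIT_WORDS) + r")\s+(?P<num>\d+|[a-z]+)\b",
    re.IGNORECASE,
)


def chapter_feature(title):
    """Return (unit keyword, chapter number) for a title, or None if absent."""
    match = _PATTERN.search(title)
    if match is None:
        return None
    raw = match.group("num").lower()
    number = int(raw) if raw.isdigit() else NUMBER_WORDS.get(raw)
    if number is None:
        return None
    return match.group("unit").lower(), number
```

Both the word form and the numeric form map to the same feature, so "Chapter Eighteen" and "Chapter 18" would compare as equal when pages are later matched against each other.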
  • step S11 is performed: determining the similarity between the plurality of chapter list pages of the same subject, and classifying the chapter list pages whose similarity is higher than the preset threshold into the same set, wherein each chapter list page corresponds to one site.
  • the embodiment may be: extracting text feature vectors from the chapter list names in the plurality of chapter list pages of the same subject, wherein a text feature vector may be a plurality of keywords in a chapter list name, and determining the similarity between the keywords based on a similarity algorithm; or extracting numerical feature vectors from the page numbers corresponding to the chapter list names of the same subject, wherein a numerical feature vector may be a value representing the page number.
  • the text feature vector and its corresponding numerical feature vector may be combined to calculate the similarity between any two chapter list pages, or either feature vector may be used alone to calculate the similarity between chapter list pages.
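The source does not name a specific similarity algorithm. As one possible stand-in, the sketch below uses Jaccard similarity on chapter names and on page numbers and blends the two. The functions `jaccard` and `page_similarity`, the `text_weight` parameter, and the page layout are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def page_similarity(page_a, page_b, text_weight=0.5):
    """Blend text and numeric similarity; text_weight is an assumed parameter.

    Each page is a dict with "titles" (chapter names) and "numbers"
    (page or chapter numbers), a hypothetical layout chosen for this sketch.
    """
    text_sim = jaccard(page_a["titles"], page_b["titles"])
    num_sim = jaccard(page_a["numbers"], page_b["numbers"])
    return text_weight * text_sim + (1 - text_weight) * num_sim
```

Setting `text_weight` to 1.0 or 0.0 corresponds to using only the text feature vector or only the numerical feature vector.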
  • the step S11 specifically includes the following steps:
  • the chapter list page with the highest authority value may be determined as the reference chapter list page by acquiring the authority values of the different sites, where the authority value of a site is obtained from a large number of users scoring the site; the text feature vector of each chapter list page is then extracted based on a certain algorithm, and the total number of identical text feature vectors between each chapter list page and the reference chapter list page is calculated.
  • when the total number is greater than the pre-stored threshold, the chapter list page and the reference chapter list page are classified into the same set; the above method is repeated, and the chapter list pages not in that set are classified into one or more other sets.
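The grouping procedure described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: pages are plain dicts with hypothetical `site`, `authority`, and `titles` keys, and "identical text feature vectors" is approximated by identical chapter titles.

```python
def group_pages(pages, threshold):
    """Group pages around successive reference pages.

    pages: list of dicts with "site", "authority" and "titles" keys
    (a hypothetical layout). threshold: minimum number of shared titles,
    standing in for the pre-stored threshold of the source.
    """
    remaining = sorted(pages, key=lambda p: p["authority"], reverse=True)
    groups = []
    while remaining:
        reference = remaining.pop(0)  # highest authority among those left
        ref_titles = set(reference["titles"])
        group, rest = [reference], []
        for page in remaining:
            shared = len(ref_titles & set(page["titles"]))
            (group if shared > threshold else rest).append(page)
        groups.append(group)
        remaining = rest
    return groups
```

Each pass picks the most authoritative remaining page as the reference, so every page ends up in exactly one set.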
  • the method of the present invention further includes a step S12: acquiring an authority value of each site in the same set, and using a set with the largest sum of authoritative values as the first set, wherein the authority value is based on multiple The rating of the site by the user is determined.
  • in step S11, the plurality of chapter list pages are classified into different sets according to the similarity between the chapter list pages; in step S12, the sum of the authority values of the sites hosting the chapter list pages in each set is calculated, wherein the authority value of a site is determined according to the ratings given to the site by multiple users, and the set with the largest sum of authority values is taken as the first set.
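Step S12 amounts to summing authority values per set and keeping the largest. The sketch below assumes, purely for illustration, that a site's authority is the mean of its user ratings; the source only states that the value is determined from the ratings of multiple users. The names `authority_from_ratings` and `pick_first_set` are hypothetical.

```python
from statistics import mean


def authority_from_ratings(ratings):
    """Assumed aggregation: a site's authority is the mean of its user ratings."""
    return mean(ratings)


def pick_first_set(groups):
    """Return the group whose sites have the largest summed authority.

    groups is a list of lists of {"site": ..., "ratings": [...]} dicts,
    a hypothetical layout chosen for this sketch.
    """
    return max(
        groups,
        key=lambda group: sum(authority_from_ratings(p["ratings"]) for p in group),
    )
```

Because the sum rather than the maximum is compared, a set of several moderately rated sites can outrank a set containing a single highly rated site.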
  • the method of the present invention further includes a step S13: acquiring at least one feature quantity value of each chapter list page in the first set.
  • the at least one feature quantity value may be a feature quantity value that characterizes the integrity, the correctness, or the authenticity of the chapter list page; the ways of acquiring each feature quantity value are introduced below through different embodiments.
  • the step of acquiring at least one feature quantity value of each chapter list page in the first set further includes:
  • according to the difference between the second average value and the first average value, a first feature quantity value characterizing the integrity of the chapter list page is set based on a preset integrity rule, wherein the size of the difference corresponds to the first feature quantity value.
  • the text feature vector of each chapter list page in the first set is first extracted; the number of identical text feature vectors is then calculated for every pair of chapter list pages, and the resulting values are averaged to obtain a first average value.
  • the method of the present invention further includes the step of: setting, based on a preset correctness rule and according to the difference between the second average value and the first average value, a second feature quantity value that characterizes the correctness of the chapter list page, wherein the size of the difference corresponds to the second feature quantity value. That is, after the difference between the second average value and the first average value is obtained, a second feature quantity value characterizing the correctness of the chapter list page is set based on the preset correctness rule. Similarly, the larger the difference, the greater the probability that the chapter list page is incorrect and the smaller the corresponding second feature quantity value; the size of the difference is likewise stored in advance in association with the second feature quantity value. For example, if the difference is 15, the corresponding second feature quantity value is 65; if the difference is 5, the corresponding second feature quantity value is 85. Of course, this embodiment is merely exemplary and does not constitute a limitation of the invention.
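The two averages and the resulting feature quantity value can be sketched as below. The source says only that difference sizes are stored in association with feature values; the linear rule used here (`base - slope * diff`) is an assumption, chosen because `base=90, slope=2` reproduces the integrity examples (15 gives 60, 5 gives 80) and `base=95` reproduces the correctness examples (15 gives 65, 5 gives 85). Function names and the use of title overlap as "identical text feature vectors" are also assumptions.

```python
from itertools import combinations
from statistics import mean


def shared(a, b):
    """Number of chapter titles two pages have in common."""
    return len(set(a) & set(b))


def first_average(pages):
    """Mean shared-title count over every pair of pages in the first set."""
    return mean(shared(a, b) for a, b in combinations(pages, 2))


def second_average(pages, index):
    """Mean shared-title count between one page and each of the others."""
    target = pages[index]
    return mean(shared(target, other) for i, other in enumerate(pages) if i != index)


def feature_value(pages, index, base=90, slope=2):
    """Map the gap between the two averages to a score (assumed linear rule)."""
    diff = abs(first_average(pages) - second_average(pages, index))
    return max(0, base - slope * diff)
```

Here `pages` is a list of title lists with at least two entries; a page that overlaps with its peers about as much as the peers overlap with each other receives a score close to `base`.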
  • the step of obtaining at least one feature quantity value of each chapter list page in the first set further includes:
  • S135: obtaining the text feature vectors in the chapter list corresponding to the same page numbers in each chapter list page in the first set, wherein the value of each such page number is greater than a preset page number threshold;
  • This embodiment is mainly used to judge the authenticity of the chapter list page.
  • the text feature vectors of the chapters whose page numbers exceed the preset page number threshold are acquired, and the total number of identical text feature vectors between a given chapter list page and a plurality of other chapter list pages is calculated.
  • when the total number is less than the preset second threshold, the chapter list page is most likely an erroneously generated or fabricated chapter list page; it is determined to be a fake chapter list page and is filtered out.
  • when the total number is greater than the preset second threshold, the feature quantity value characterizing authenticity is determined by how far the total number exceeds that threshold: the greater the difference between the total number and the second threshold, the higher the accuracy and the less likely the page is a fabricated or erroneous chapter list page, so the larger the corresponding feature quantity value; conversely, the smaller the difference, the smaller the corresponding feature quantity value.
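A minimal Python sketch of the fake-page check follows. It assumes each page is a mapping from chapter number to chapter title and that identical titles stand in for identical text feature vectors; the names `tail_titles` and `is_fake` are hypothetical.

```python
def tail_titles(page, page_threshold):
    """Titles of chapters whose number exceeds page_threshold.

    page maps chapter number -> chapter title (a hypothetical layout).
    """
    return {title for number, title in page.items() if number > page_threshold}


def is_fake(pages, index, page_threshold, second_threshold):
    """Flag a page whose tail chapters overlap too little with the other pages."""
    target = tail_titles(pages[index], page_threshold)
    total = sum(
        len(target & tail_titles(other, page_threshold))
        for i, other in enumerate(pages)
        if i != index
    )
    return total < second_threshold
```

Filtering then reduces to keeping the pages for which `is_fake` returns False, for example `[p for i, p in enumerate(pages) if not is_fake(pages, i, 8, 2)]`.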
  • the method of the present invention further includes a step S14: calculating, according to a preset rule, a comprehensive weight of the at least one feature quantity value of each chapter list page, and obtaining the chapter list page with the largest comprehensive weight.
  • the step of calculating, according to a preset rule, the comprehensive weight of the at least one feature quantity value of each chapter list page and obtaining the chapter list page with the largest comprehensive weight further includes the following steps:
  • S151 Perform weighting processing on at least one feature quantity value of the same chapter list page according to a preset rule, to obtain a comprehensive weight value of the chapter list page;
  • each feature quantity value is weighted by its corresponding weight value, and the result is the comprehensive weight of the chapter list page, wherein the feature quantity values characterize the integrity and/or correctness of the chapter list page. It is to be understood, of course, that such examples are merely exemplary and are not intended to limit the invention.
  • the size of the comprehensive weight of each chapter list page is compared, and the chapter list page with the largest comprehensive weight is obtained.
  • the chapter list page with the largest comprehensive weight is the target chapter list page.
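The weighting and selection steps can be sketched as a weighted sum followed by a maximum. The feature names and weight values below are assumptions for illustration; the source does not specify them.

```python
def comprehensive_weight(features, weights):
    """Weighted sum of a page's feature values."""
    return sum(weights[name] * value for name, value in features.items())


def best_page(pages, weights):
    """Return the site whose page has the largest comprehensive weight.

    pages maps site name -> {feature name: value}; both the layout and
    the feature names are assumptions made for this sketch.
    """
    return max(pages, key=lambda site: comprehensive_weight(pages[site], weights))
```

For instance, with equal weights of 0.5 on hypothetical `integrity` and `correctness` features, a page scoring 80 and 85 has a comprehensive weight of 82.5 and would be preferred over a page scoring 60 and 65.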
  • the present invention provides a method for evaluating a network novel chapter list, which classifies a plurality of chapter list pages from different sites into sets based on the similarity between the chapter list pages; the set with the largest sum of the authority values of its sites is used as the first set, the comprehensive weight of the at least one feature quantity value of each chapter list page in the first set is then calculated based on a preset rule, and the chapter list page with the largest comprehensive weight is obtained.
  • the invention solves the prior-art problem that judging chapter list pages through manually configured templates is inefficient; the solution of the present invention can flexibly and quickly identify the chapter list page that best meets the requirements, and the evaluation results are accurate and objective.
  • the present invention also provides a device for evaluating a list of network novel chapters, please refer to FIG. 6.
  • the device includes a classification module 11, a diversity module 12, a feature quantity acquisition module 13, and a target acquisition module 14, and uses the above modules to construct a principle framework of the entire device, thereby implementing a modular implementation.
  • the specific functions implemented by each module are specifically disclosed below.
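The modular structure can be pictured as four stages chained together. The following Python skeleton is only a sketch: the class name `ChapterListEvaluator` and its field names are assumptions, and each stage is left as a callable to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChapterListEvaluator:
    """Chains four stages that mirror modules 11 to 14 (names are assumptions)."""

    categorize: Callable        # pages -> list of sets of similar pages
    select_first_set: Callable  # sets -> the set with the largest authority sum
    extract_features: Callable  # first set -> pages paired with feature values
    pick_target: Callable       # scored pages -> best page

    def evaluate(self, pages):
        groups = self.categorize(pages)
        first_set = self.select_first_set(groups)
        scored = self.extract_features(first_set)
        return self.pick_target(scored)
```

Keeping each module behind its own callable lets one stage, such as the similarity measure, be swapped without touching the others.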
  • the categorization module 11 is configured to determine the similarity between the plurality of chapter list pages of the same subject and classify the chapter list pages whose similarity is higher than the preset threshold into the same set, wherein each chapter list page corresponds to one site.
  • in the device for evaluating a network novel chapter list according to the present invention, a web spider can acquire data from a plurality of websites for the same subject, thereby acquiring the chapter list pages of that subject.
  • the subject may be a title of the novel or a part of key text features therein. Therefore, the present invention further includes a list page obtaining module, configured to acquire the chapter list pages corresponding to the subject from a plurality of sites based on the same subject.
  • the list page obtaining module may receive a search request whose keyword is the subject and perform structural analysis on the webpages under the domain name of a novel website; if a webpage includes multiple parallel chapter list tags, the webpage can be determined to be a novel chapter list page. The link targets (href, Hypertext Reference) of the parallel chapter list tags are highly similar to one another: they share the same chapter list directory but have different file names. For example, assume that the href attributes of the parallel chapter list tags all contain the directory 5_5288 but contain different file names, namely 970871 to 970980.
  • the plurality of parallel chapter list tags included in the novel chapter list page contain chapter text feature vectors, which include keywords and/or chapter numbers representing the chapters, and the list page obtaining module may evaluate the chapter list page based on the keywords and/or chapter numbers.
  • a chapter list tag includes the keyword “chapter”, and may also include “volume”, “section”, etc.; it also includes number words such as “one”, “two”, “eighteen” and the like; of course, chapter numbers can also be stored in numeric form, such as “1”, “2”, “18”, and the like.
  • the categorization module 11 is required to determine the similarity between the plurality of chapter list pages of the same subject and to classify the chapter list pages whose similarity is higher than a preset threshold into the same set, wherein each chapter list page corresponds to one site.
  • the categorization module 11 in this embodiment may extract text feature vectors from the chapter list names in the plurality of chapter list pages of the same subject, wherein a text feature vector may be a plurality of keywords in a chapter list name.
  • alternatively, the categorization module 11 extracts numerical feature vectors from the page numbers corresponding to the chapter list names of the same subject, wherein a numerical feature vector may be a value representing a page number. In this embodiment, the categorization module 11 may combine the text feature vector and its corresponding numerical feature vector to calculate the similarity between any two chapter list pages, or may use either feature vector alone to calculate the similarity between chapter list pages.
  • the categorization module 11 further includes a reference page determining unit 111, a first extracting unit 112, a first calculating unit 113, and a first categorizing unit 114.
  • the reference page determining unit 111 is configured to determine, according to an authority value of a site corresponding to the chapter list page, a chapter list page with the highest authority value as a reference chapter list page;
  • the first extracting unit 112 is configured to extract a text feature vector of each chapter list page;
  • the first calculating unit 113 is configured to calculate a total number of the same character feature vectors of each chapter list page and the reference chapter list page;
  • the first categorizing unit 114 is configured to classify the chapter list page and the reference chapter list page into the same set when the total number is greater than a preset threshold.
  • when judging the similarity between the plurality of chapter list pages, the reference page determining unit 111 first obtains a reference chapter list page. Specifically, the chapter list page with the highest authority value may be determined as the reference chapter list page by acquiring the authority values of the different sites, where the authority value of a site is obtained from a large number of users scoring the site. The first extracting unit 112 then extracts the text feature vector of each chapter list page based on a certain algorithm, and the first calculating unit 113 calculates the total number of identical text feature vectors between each chapter list page and the reference chapter list page. When the total number is greater than the preset threshold, the first categorizing unit 114 classifies the chapter list page and the reference chapter list page into the same set; the above method is repeated, and the chapter list pages not in that set are classified into one or more other sets.
  • the diversity module 12 is configured to obtain an authority value of each site in the same set and use the set with the largest sum of authority values as the first set, wherein the authority value is determined according to the ratings given to the site by multiple users.
  • the plurality of chapter list pages are classified into different sets according to the similarity between the chapter list pages; the diversity module 12 then calculates the sum of the authority values of the sites hosting the chapter list pages in each set, wherein the authority value of a site is determined according to the ratings given to the site by multiple users, and the set with the largest sum of authority values is taken as the first set.
  • the feature quantity obtaining module 13 is configured to acquire at least one feature quantity value of each chapter list page in the first set.
  • the at least one feature quantity value may be a feature quantity value that characterizes the integrity, the correctness, or the authenticity of the chapter list page; the ways in which the feature quantity acquisition module 13 acquires each feature quantity value are introduced below through different embodiments.
  • the feature quantity acquisition module 13 further includes a second extracting unit 131, a first average value calculating unit 132, a second average value calculating unit 133, and a first setting unit 134:
  • the second extracting unit 131 is configured to extract a text feature vector of each chapter list page in the first set
  • the first average value calculating unit 132 is configured to calculate a first average value of the number of the same character feature vectors in each of the two chapter list pages in the first set;
  • the second average value calculating unit 133 is configured to calculate a second average value of the number of identical character feature vectors of a certain chapter list page and a plurality of other chapter list pages;
  • the first setting unit 134 is configured to set, based on a preset integrity rule and according to the difference between the second average value and the first average value, a first feature quantity value that characterizes the integrity of the chapter list page, wherein the size of the difference corresponds to the first feature quantity value.
  • the second extracting unit 131 extracts the text feature vector of each chapter list page in the first set; the first average value calculating unit 132 then calculates the number of identical text feature vectors for every pair of chapter list pages and averages the resulting values to obtain a first average value; the second average value calculating unit 133 calculates the number of identical text feature vectors between a given chapter list page and each of a plurality of other chapter list pages and averages them to obtain a second average value.
  • the first setting unit 134 calculates the difference between the first average value and the second average value and then sets, based on the preset integrity rule, a first feature quantity value characterizing the integrity of the chapter list page; the larger the difference, the greater the probability that the chapter list page is incomplete and the smaller the corresponding first feature quantity value, wherein the size of the difference is stored in advance in association with the first feature quantity value. For example, if the difference is 15, the corresponding first feature quantity value is 60; if the difference is 5, the corresponding first feature quantity value is 80. Of course, this embodiment is merely exemplary and does not constitute a limitation of the invention.
  • the device of the present invention further includes a second setting unit, configured to set, based on a preset correctness rule and according to the difference between the second average value and the first average value, a second feature quantity value that characterizes the correctness of the chapter list page, wherein the size of the difference corresponds to the second feature quantity value. That is, after the second setting unit obtains the difference between the second average value and the first average value, it sets the second feature quantity value characterizing the correctness of the chapter list page based on the preset correctness rule. Similarly, the larger the difference, the greater the probability that the chapter list page is incorrect and the smaller the corresponding second feature quantity value, wherein the size of the difference is likewise stored in advance in association with the second feature quantity value. For example, if the difference is 15, the corresponding second feature quantity value is 65; if the difference is 5, the corresponding second feature quantity value is 85. Of course, this embodiment is merely exemplary and does not constitute a limitation of the invention.
  • the feature quantity acquisition module 13 further includes a first obtaining unit 135, a total number obtaining unit 136, and a determining unit 137.
  • the first obtaining unit 135 is configured to obtain the text feature vectors in the chapter list corresponding to the same page numbers in each chapter list page in the first set, wherein the value of each such page number is greater than a preset page number threshold;
  • the total number obtaining unit 136 is configured to obtain a total number of the same character feature vectors of a certain chapter list page and a plurality of other chapter list pages;
  • the determining unit 137 is configured to determine, according to the size relationship between the total number and the second threshold of the preset representation reality, whether the chapter list page is a fake chapter list page.
  • This embodiment is mainly used to judge the authenticity of the chapter list page. The first obtaining unit 135 obtains the character feature vectors of the chapter list entries whose page numbers are greater than the preset page number threshold, and the total number obtaining unit 136 calculates the total number of identical character feature vectors shared by a given chapter list page and a plurality of other chapter list pages. That is, the first obtaining unit 135 obtains the character feature vectors of the last several entries of each chapter list page, and the total number obtaining unit 136 calculates, for entries with the same page number, the total number of identical character feature vectors shared by a given chapter list page and the other chapter list pages.
  • when the determining unit 137 determines that the total number is greater than or equal to the preset second threshold, it determines that the chapter list page is a valid chapter list page; when the total number is less than the preset second threshold, the chapter list page is most likely an erroneously generated or fabricated chapter list page, and it is determined to be a fake chapter list page.
  • the device of the present invention further includes a filtering module, configured to filter out the fake chapter list page after the determining unit determines that the chapter list page is a fake chapter list page.
  • the feature quantity that characterizes newness may be determined according to how far the total number exceeds the preset second threshold: the greater the difference between the total number and the second threshold, the higher the accuracy and the less likely the page is a fabricated or erroneous chapter list page, so the corresponding newness feature value is larger; conversely, the corresponding newness feature value is smaller.
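The check performed by units 135 to 137 and the filtering module can be sketched as follows. This is only an illustrative sketch: representing a chapter list page as a mapping from page number to chapter title, using the title string itself as the character feature vector, and the sample data and threshold values are all assumptions.

```python
def shared_tail_count(page, others, page_threshold):
    """Count entries of `page` beyond the page threshold that another
    page repeats identically at the same page number."""
    total = 0
    for other in others:
        for number, title in page.items():
            if number > page_threshold and other.get(number) == title:
                total += 1
    return total


def is_fake(page, others, page_threshold, second_threshold):
    """A page is treated as fake when its shared count is below the threshold."""
    return shared_tail_count(page, others, page_threshold) < second_threshold


# Hypothetical pages: page number -> chapter title.
pages = {
    "site-a": {98: "the duel", 99: "aftermath", 100: "epilogue"},
    "site-b": {98: "the duel", 99: "aftermath", 100: "epilogue"},
    "site-c": {98: "bonus ad", 99: "filler", 100: "coming soon"},
}

valid = {}
for name, page in pages.items():
    others = [p for other_name, p in pages.items() if other_name != name]
    if not is_fake(page, others, page_threshold=97, second_threshold=3):
        valid[name] = page  # fake pages are filtered out

print(sorted(valid))  # ['site-a', 'site-b']
```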
  • the target acquiring module 14 included in the device is configured to calculate, according to a preset rule, a comprehensive weight of the at least one feature quantity value of each chapter list page, and obtain the chapter list page with the largest comprehensive weight.
  • the target obtaining module 14 further includes a weighting unit 141, a comparing unit 142, and a target acquiring unit 143.
  • the weighting unit 141 is configured to perform weighting processing on at least one feature quantity value of the same chapter list page according to a preset rule to obtain an integrated weight of the chapter list page;
  • the comparing unit 142 is configured to compare the size of the comprehensive weight corresponding to each chapter list page;
  • the target obtaining unit 143 is configured to obtain a chapter list page in which the comprehensive weight is the largest.
  • the weighting unit 141 performs weighting processing on the feature quantity value corresponding to the weight according to the preset weight value corresponding to each specific feature quantity value, and the obtained result is the comprehensive weight of the chapter list page.
  • a particular feature magnitude represents the chapter list page integrity and/or correctness.
  • for example, the weighting unit 141 obtains, according to the foregoing steps, a first feature quantity value of 80 characterizing the integrity of a certain chapter list page and a second feature quantity value characterizing its correctness.
  • the comparing unit 142 compares the comprehensive weights of the chapter list pages, and the target obtaining unit 143 obtains the chapter list page with the largest comprehensive weight.
  • the chapter list page with the largest comprehensive weight is the target chapter list page.
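The weighting, comparing, and target obtaining steps amount to a weighted sum followed by a maximum. The sketch below assumes integer weights of 6 and 4 for integrity and correctness; the text does not give the preset weights, so these values, the site names, and the function names are hypothetical. The feature values reuse the example magnitudes 80, 85, 60, and 65 from the text.

```python
# Hypothetical preset weights; the text does not state the actual values.
WEIGHTS = {"integrity": 6, "correctness": 4}


def comprehensive_weight(features, weights=WEIGHTS):
    """Weighted sum of the feature quantity values of one chapter list page."""
    return sum(weights[name] * value for name, value in features.items())


pages = {
    "site-a": {"integrity": 80, "correctness": 85},
    "site-b": {"integrity": 60, "correctness": 65},
}

scores = {site: comprehensive_weight(f) for site, f in pages.items()}
target = max(scores, key=scores.get)  # page with the largest comprehensive weight

print(scores)  # {'site-a': 820, 'site-b': 620}
print(target)  # site-a
```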
  • the present invention provides a network novel chapter list evaluation apparatus, and the classification module 11 classifies a plurality of chapter list pages of different sites into the same set based on the similarity between the plurality of chapter list pages.
  • the diversity module 12 further takes the set with the largest sum of site authority values as the first set, and the feature quantity obtaining module 13 acquires at least one feature quantity of each chapter list page in the first set.
  • the target obtaining module 14 calculates the comprehensive weight of the at least one feature quantity value of each chapter list page in the first set based on the preset rule, and obtains a chapter list page in which the comprehensive weight is the largest.
  • the scheme can automatically acquire chapter list pages of multiple sites, and obtain the highest-quality chapter list page by comparing and comprehensively analyzing the similarity, the authoritative value of the site, and the acquired feature values.
  • the invention solves the low efficiency of the prior art, in which chapter list pages are judged through manually configured templates.
  • the solution of the present invention can flexibly and quickly evaluate the chapter list page that most satisfies the requirements, and the evaluation result is accurate and objective.
  • modules in the devices of the embodiments can be adaptively changed and placed in one or more devices different from the embodiment.
  • the modules or units or components of the embodiments may be combined into one module or unit or component, and further they may be divided into a plurality of sub-modules or sub-units or sub-components.
  • all features disclosed in this specification (including the accompanying claims, the abstract and the drawings), and all processes or units of any method or device so disclosed, may be combined in any combination.
  • Each feature disclosed in this specification (including the accompanying claims, the abstract and the drawings) may be replaced by alternative features that provide the same, equivalent or similar purpose.
  • Various component embodiments of the present invention may be implemented in hardware or on one or more processors
  • Those skilled in the art will appreciate that some or all of the functionality of some or all of the components of the network novel chapter list evaluation device in accordance with embodiments of the present invention may be implemented in practice using a microprocessor or digital signal processor (DSP).
  • the invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.
  • Such a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • FIG. 11 illustrates a computing device that can implement a method of evaluating a network novel chapter list.
  • the computing device conventionally includes a processor 1110 and a computer program product or computer readable medium in the form of a memory 1120.
  • the memory 1120 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM.
  • Memory 1120 has a memory space 1130 for program code 1131 for performing any of the method steps described above.
  • the storage space 1130 for program code may include respective program codes 1131 for implementing various steps in the above methods, respectively.
  • the program code can be read from or written to one or more computer program products.
  • Such computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
  • Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG.
  • the storage unit may have a storage segment, a storage space, and the like that are similarly arranged to the storage 1120 in the computing device of FIG.
  • the program code can be compressed, for example, in an appropriate form.
  • the storage unit includes computer readable code 1131', that is, code readable by a processor such as 1110, which, when executed by a computing device, causes the computing device to perform each step of the methods described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the technical field of Internet, in particular to a network novel chapter list evaluation method and device. The method comprises the steps of: determining the similarity between a plurality of chapter list pages of the same body, and classifying a plurality of chapter list pages with the similarity higher than a pre-set threshold value into the same set, wherein each chapter list page corresponds to one site; acquiring an authoritative value of each site in the same set, and taking a set with the maximum sum value of the authoritative values as a first set, wherein the authoritative values are determined according to scores given by users to the sites; acquiring at least one feature amount value of each chapter list page in the first set; and calculating a comprehensive weight value of at least one feature amount value of each chapter list page according to a pre-set rule, and acquiring a chapter list page with the maximum comprehensive weight value. The present invention solves the problem of low efficiency caused by judging a chapter list page by means of manually configuring a template in the prior art, and can flexibly and rapidly evaluate a chapter list page best meeting the requirements, and the evaluation result is accurate and objective.

Description

Method and device for evaluating a network novel chapter list
Technical field
The present invention relates to the field of Internet technologies, and in particular to a method and device for evaluating a network novel chapter list.
Background
With the growing popularity of computers and computer networks, the Internet has penetrated every area of people's work, study, and life, and has become an important way to publish and obtain information.
On the Internet, the same e-book (for example, a novel) usually exists on multiple sites at once, so a search for the e-book returns multiple sites that host it. When an e-book is reprinted across sites, however, chapter names may be non-standard, and chapters may be duplicated, missing, or invalid, all of which degrade the user experience. Usually, the top-ranked site in existing search results is the best-quality site, that is, the one with the fewest non-standard chapter names and duplicated, missing, or invalid chapters; even so, its chapter list page may still be incomplete and may even contain fake chapters that were pieced together.
In the prior art, chapter list pages are evaluated by manually configuring templates for novel sites. Although this method is accurate, its drawbacks are obvious: the number of sites that manual effort can cover is limited, and the approach is not intelligent enough. How to evaluate chapter list pages flexibly, quickly, and accurately has therefore become a technical problem that needs to be solved.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and device for evaluating a network novel chapter list that overcome, or at least partially solve, the above problems.
The present invention provides a method for evaluating a network novel chapter list, comprising the following steps:
determining the similarity between a plurality of chapter list pages of the same subject, and classifying the chapter list pages whose similarity is higher than a preset threshold into the same set, where each chapter list page corresponds to one site;
obtaining the authority value of each site in the same set, and taking the set with the largest sum of authority values as the first set, where an authority value is determined from the scores that multiple users give the site;
obtaining at least one feature quantity value of each chapter list page in the first set;
calculating, according to a preset rule, a comprehensive weight from the at least one feature quantity value of each chapter list page, and obtaining the chapter list page with the largest comprehensive weight.
The present invention also provides a device for evaluating a network novel chapter list, comprising:
a classification module, configured to determine the similarity between a plurality of chapter list pages of the same subject and classify the chapter list pages whose similarity is higher than a preset threshold into the same set, where each chapter list page corresponds to one site;
a diversity module, configured to obtain the authority value of each site in the same set and take the set with the largest sum of authority values as the first set, where an authority value is determined from the scores that multiple users give the site;
a feature quantity obtaining module, configured to obtain at least one feature quantity value of each chapter list page in the first set;
a target obtaining module, configured to calculate, according to a preset rule, a comprehensive weight from the at least one feature quantity value of each chapter list page, and obtain the chapter list page with the largest comprehensive weight.
According to a further aspect of the present invention, a computer program is provided, comprising computer readable code which, when run on a computing device, causes the computing device to perform the network novel chapter list evaluation method described above.
According to yet another aspect of the present invention, a computer readable medium is provided, in which the above computer program is stored.
Compared with the prior art, the present invention has the following advantages:
1. The present invention provides a method for evaluating a network novel chapter list. Based on the similarity between chapter list pages, chapter list pages from different sites are classified into sets; the set whose sites have the largest sum of authority values is taken as the first set; and, based on a preset rule, a comprehensive weight is calculated from the at least one feature quantity value of each chapter list page in the first set, and the chapter list page with the largest comprehensive weight is obtained. The scheme thus obtains chapter list pages from multiple sites automatically and, by comparing and jointly analyzing the similarity, the site authority values, and the obtained feature quantity values, finds the chapter list page of relatively highest quality. This solves the low efficiency of the prior art, in which chapter list pages are judged through manually configured templates; the scheme can flexibly and quickly identify the chapter list page that best meets the requirements, and the evaluation result is accurate and objective.
2. Further, when obtaining the at least one feature quantity value of each chapter list page, the present invention analyzes, based on preset rules, multiple feature quantity values that characterize the correctness, integrity, and newness of the chapter list page, and also filters out fake chapter list pages. It then obtains, based on a preset rule, the comprehensive weight of each chapter list page from its at least one feature quantity value; the chapter list page with the largest comprehensive weight is the highest-quality target chapter list page. The scheme can thus automatically compare and analyze the quality of chapter list pages in terms of correctness, integrity, and newness, and identify the most effective chapter list page, making the evaluation result more accurate.
The above description is only an overview of the technical solutions of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the contents of the specification, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set forth below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
FIG. 1 is a flowchart of one embodiment of the network novel chapter list evaluation method of the present invention;
FIG. 2 is a flowchart of one embodiment of the network novel chapter list evaluation method of the present invention;
FIG. 3 is a flowchart of one embodiment of the network novel chapter list evaluation method of the present invention;
FIG. 4 is a flowchart of one embodiment of the network novel chapter list evaluation method of the present invention;
FIG. 5 is a flowchart of one embodiment of the network novel chapter list evaluation method of the present invention;
FIG. 6 is a structural block diagram of one embodiment of the network novel chapter list evaluation device of the present invention;
FIG. 7 is a structural block diagram of the classification module in one embodiment of the network novel chapter list evaluation device of the present invention;
FIG. 8 is a structural block diagram of the feature quantity obtaining module in one embodiment of the network novel chapter list evaluation device of the present invention;
FIG. 9 is a structural block diagram of the feature quantity obtaining module in one embodiment of the network novel chapter list evaluation device of the present invention;
FIG. 10 is a structural block diagram of the target obtaining module in one embodiment of the network novel chapter list evaluation device of the present invention;
FIG. 11 schematically shows a block diagram of a computing device for performing the network novel chapter list evaluation method according to the present invention; and
FIG. 12 schematically shows a storage unit for holding or carrying program code that implements the network novel chapter list evaluation method according to the present invention.
Detailed description
The invention is further described below with reference to the drawings and specific embodiments.
It is necessary first to give the following introductory description of the application scenario and principles of the present invention.
The Internet generally comprises clients (user mobile terminals), networks, and servers (such as a website's web server). A client may be a user's Internet mobile terminal, such as a desktop computer (PC), a laptop, or a smart device with web browsing capability, such as a personal digital assistant (PDA), a mobile Internet device (MID), or a smartphone. In an Internet environment, these terminals can request a service provided by another process (such as a process provided by a server). For example, in the present invention the client is a mobile phone, such as an Android phone, running an app with an e-book function; the app has a user feedback section through which the user can send problem feedback to the server, and the server returns a reply to the user.
A server is typically a remote computer system that can be accessed through a communication medium such as the Internet. A server can usually serve multiple clients from the Internet. Providing a service includes receiving requests sent by clients and collecting client information and feedback. In essence, the server acts as an information provider for the computer network. The server is usually located at the party providing the service, or is configured by the service provider with the service content; such a service provider may be, for example, the website of an Internet service company.
The following describes in detail specific implementations of several technical solutions of the present invention that apply the above principles to the above scenario. Note that the network novel chapter list evaluation method provided by the present invention is described from the server's perspective. The method can be implemented by programming as a computer program that runs on a remote network device, including but not limited to a computer, a network host, a single network server, a set of multiple network servers, or a cloud composed of multiple servers.
Referring to FIG. 1, a typical embodiment of the network novel chapter list evaluation method of the present invention comprises the following steps:
S11: determine the similarity between a plurality of chapter list pages of the same subject, and classify the chapter list pages whose similarity is higher than a preset threshold into the same set, where each chapter list page corresponds to one site.
Note that in the network novel chapter list evaluation method of the present invention, a web spider can crawl data from multiple websites based on the same subject to obtain that subject's chapter list pages. The subject may be the title of the novel or some of its key text features. Therefore, before step S11, the method further includes the step of obtaining, from multiple sites, the chapter list pages corresponding to the same subject.
Specifically, in one embodiment of the present invention, the search engine may receive a search request carrying a keyword of the subject and perform structural analysis on the web pages under a novel website's domain name. If a web page contains multiple parallel chapter list tags, it can be determined to be a novel chapter list page. The link targets (href, Hypertext Reference) of these parallel chapter list tags are highly similar: they share the same directory but have different file names. For example, suppose the href attributes of the parallel chapter list tags all contain the directory 5_5288, while the file names they contain differ, ranging from 970871 to 970980.
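The structural test described above (many parallel links that share one directory but have distinct file names) can be sketched with Python's standard HTML parser. This is a hedged illustration, not the patent's analysis: treating every `<a>` tag as a chapter list tag, the `.html` extension in the sample markup, and the `min_links` and `min_share` parameters are assumptions.

```python
from collections import Counter
from html.parser import HTMLParser
import posixpath


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def looks_like_chapter_list(html, min_links=3, min_share=0.8):
    """True when most links share one directory and differ only in file name."""
    collector = LinkCollector()
    collector.feed(html)
    hrefs = collector.hrefs
    if len(hrefs) < min_links:
        return False
    directories = Counter(posixpath.dirname(h) for h in hrefs)
    directory, count = directories.most_common(1)[0]
    files = {posixpath.basename(h) for h in hrefs
             if posixpath.dirname(h) == directory}
    return count / len(hrefs) >= min_share and len(files) == count


sample = "".join(
    f'<li><a href="/5_5288/{n}.html">chapter</a></li>'
    for n in range(970871, 970875)
)
print(looks_like_chapter_list(sample))  # True
```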
Further, the parallel chapter list tags on a novel chapter list page contain chapter text feature vectors, which include keywords and/or chapter numbers that identify chapters, and the search engine can use these keywords and/or chapter numbers to evaluate chapter list pages. For example, a chapter list tag may include the keyword "章" (chapter), or "卷" (volume), "节" (section), "章节" (chapter), and so on; it also includes a word denoting the chapter number, such as "一" (one), "二" (two), or "一十八" (eighteen). The chapter number may of course also be stored as digits, such as "1", "2", or "18".
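Extracting the chapter keyword and chapter number from a chapter title can be sketched with a regular expression. The keywords 章, 卷, 节, and 章节 and the number forms (Chinese numerals such as 一十八, or digits such as 18) come from the text; the optional 第 prefix, the numeral range (up to thousands), and the function names are assumptions made for illustration.

```python
import re

DIGITS = {char: value for value, char in enumerate("零一二三四五六七八九")}
UNITS = {"十": 10, "百": 100, "千": 1000}
# The longer keyword comes first so that "章节" is not cut short to "章".
CHAPTER_PATTERN = re.compile(
    r"第?\s*([0-9]+|[零一二三四五六七八九十百千]+)\s*(章节|章|卷|节)"
)


def chinese_to_int(text):
    """Convert a simple Chinese numeral such as 一十八 to an integer."""
    total, current = 0, 0
    for char in text:
        if char in DIGITS:
            current = DIGITS[char]
        elif char in UNITS:
            total += (current or 1) * UNITS[char]  # a bare 十 means 10
            current = 0
        else:
            raise ValueError(f"unsupported numeral character: {char!r}")
    return total + current


def parse_chapter(title):
    """Return (chapter number, keyword), or None when no chapter marker is found."""
    match = CHAPTER_PATTERN.search(title)
    if match is None:
        return None
    raw, keyword = match.groups()
    number = int(raw) if raw.isdigit() else chinese_to_int(raw)
    return number, keyword


print(parse_chapter("第一十八章 决战"))  # (18, '章')
print(parse_chapter("第18章"))           # (18, '章')
print(parse_chapter("Prologue"))         # None
```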
Further, after the chapter list pages corresponding to the same subject have been obtained from multiple sites, step S11 is performed: determine the similarity between the chapter list pages of the same subject, and classify the chapter list pages whose similarity is higher than a preset threshold into the same set, where each chapter list page corresponds to one site. This embodiment may extract text feature vectors from the chapter names in the chapter list pages of the same subject, where a text feature vector may consist of several keywords in a chapter name, and judge the similarity between the keywords using a similarity algorithm. Alternatively, it may extract numerical feature vectors from the page numbers corresponding to the chapter names, where a numerical feature vector may be the number representing the page number. In this embodiment, the text feature vectors and their corresponding numerical feature vectors may be combined to calculate the similarity between any two chapter list pages, or either kind of feature vector may be used alone.
Specifically, referring to FIG. 2, in one embodiment of the present invention, step S11 further comprises the following steps:
S111: according to the authority values of the sites corresponding to the chapter list pages, determine the chapter list page with the highest authority value as the reference chapter list page;
S112: extract the text feature vectors of each chapter list page;
S113: calculate the total number of text feature vectors that each chapter list page has in common with the reference chapter list page;
S114: when the total number is greater than a preset threshold, classify the chapter list page and the reference chapter list page into the same set.
When judging the similarity between chapter list pages, a reference chapter list page is obtained first. In one embodiment of the present invention, the authority values of the different sites are obtained and the chapter list page with the highest authority value is taken as the reference chapter list page, where a site's authority value is derived from the scores that a large number of users give the site. The text feature vectors of each chapter list page are then extracted using some algorithm, and the total number of text feature vectors that each chapter list page shares with the reference chapter list page is calculated. When the total number is greater than the pre-stored threshold, the chapter list page is classified into the same set as the reference chapter list page. The method is then repeated to classify the chapter list pages not in that set into one or more other sets.
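Steps S111 to S114, repeated until every page is assigned, can be sketched as a greedy grouping loop. This is a minimal sketch under assumptions: each page's text feature vectors are modeled as a set of chapter-title strings, and the site names, authority values, and threshold are hypothetical.

```python
def classify(pages, authority, threshold):
    """Group chapter list pages around successive reference pages.

    pages: site -> set of text features; authority: site -> authority value.
    """
    remaining = dict(pages)
    groups = []
    while remaining:
        # S111: the remaining page from the most authoritative site is the reference.
        reference = max(remaining, key=lambda site: authority[site])
        reference_features = remaining.pop(reference)
        group = [reference]
        for site in list(remaining):
            # S113: count the features shared with the reference page.
            shared = len(remaining[site] & reference_features)
            # S114: join the reference's set when the count exceeds the threshold.
            if shared > threshold:
                group.append(site)
                del remaining[site]
        groups.append(group)
    return groups


pages = {
    "site-a": {"c1", "c2", "c3", "c4"},
    "site-b": {"c1", "c2", "c3", "x"},
    "site-c": {"y1", "y2", "c1"},
}
authority = {"site-a": 9, "site-b": 7, "site-c": 5}
print(classify(pages, authority, threshold=2))
# [['site-a', 'site-b'], ['site-c']]
```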
Further, referring to FIG. 1, the method of the present invention also includes step S12: obtaining the authority value of each site in the same set, and taking the set with the largest sum of authority values as the first set, where the authority value is determined from the ratings given to the site by multiple users.
In the foregoing step S11, the chapter list pages are classified into different sets according to the similarity between them. In step S12, the sum of the authority values of the sites hosting the chapter list pages in each set is calculated, where the authority value of a site is determined from the ratings given to that site by multiple users, and the set with the largest sum of authority values is taken as the first set.
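As a minimal sketch of step S12, assuming each set is a list of site names and the authority values are held in a dictionary (both representations and the name `select_first_set` are hypothetical), the first set is simply the set whose authority values add up to the largest total:

```python
def select_first_set(groups, authority):
    """Return the set whose sites have the largest sum of authority values."""
    return max(groups, key=lambda group: sum(authority[site] for site in group))


# A single high-authority site can lose to a set of several mid-authority sites,
# because the sum over the set is compared, not the maximum.
s12_groups = [["a"], ["b", "c"]]
s12_authority = {"a": 9, "b": 5, "c": 7}
first_set = select_first_set(s12_groups, s12_authority)  # sums are 9 and 12
```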
Further, referring to FIG. 1, the method of the present invention also includes step S13: obtaining at least one feature value of each chapter list page in the first set. It should be noted that the at least one feature value may be a feature value characterizing the completeness, correctness, or freshness of the chapter list page; implementations for obtaining the feature values are described below through different embodiments.
1. Specifically, referring to FIG. 3, in one embodiment of the present invention, the step of obtaining at least one feature value of each chapter list page in the first set further includes:
S131. Extract the text feature vector of each chapter list page in the first set;
S132. Calculate a first average value, being the average number of identical text feature vectors shared by every two chapter list pages in the first set;
S133. Calculate a second average value, being the average number of identical text feature vectors shared by a given chapter list page and multiple other chapter list pages;
S134. According to the magnitude of the difference between the second average value and the first average value, set, based on a preset completeness rule, a first feature value characterizing the completeness of the chapter list page, where the magnitude of the difference corresponds to the first feature value.
Specifically, the text feature vector of each chapter list page in the first set is extracted first. The number of identical text feature vectors shared by every two chapter list pages is then counted, and the resulting counts are averaged to obtain the first average value. The number of identical text feature vectors shared by a given chapter list page and multiple other chapter list pages is counted and averaged to obtain the second average value. The difference between the first average value and the second average value is then calculated, and the first feature value characterizing the completeness of the chapter list page is set based on the preset completeness rule. The larger the difference, the more likely the chapter list page is incomplete and the smaller the corresponding first feature value; the difference magnitudes and the first feature values are stored in association with each other in advance. For example, when the difference is 15, the corresponding first feature value is 60; when the difference is 5, the corresponding first feature value is 80. Of course, this embodiment is merely exemplary and does not limit the present invention.
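The calculation in steps S131 to S134 can be sketched as below. Several things here are assumptions made for illustration: text feature vectors are modeled as sets, the difference is taken as the first average minus the second average, and the lookup table `COMPLETENESS_TABLE` only reproduces the two example points from the description (5 maps to 80, 15 maps to 60), with a hypothetical default of 40 for larger differences.

```python
from itertools import combinations

# Hypothetical pre-stored association: (upper bound on difference, feature value).
COMPLETENESS_TABLE = [(5, 80), (15, 60)]
DEFAULT_COMPLETENESS = 40  # assumed value for differences above 15


def first_average(pages):
    """S132: average shared-feature count over every pair of pages."""
    pairs = list(combinations(pages.values(), 2))
    return sum(len(a & b) for a, b in pairs) / len(pairs)


def second_average(pages, site):
    """S133: average shared-feature count between one page and all others."""
    others = [features for s, features in pages.items() if s != site]
    return sum(len(pages[site] & features) for features in others) / len(others)


def completeness_value(difference):
    """S134: map the difference to the first feature value via the table."""
    for upper_bound, value in COMPLETENESS_TABLE:
        if difference <= upper_bound:
            return value
    return DEFAULT_COMPLETENESS


full = {f"ch{i}" for i in range(30)}
s13_pages = {"a": full, "b": set(full), "c": {f"ch{i}" for i in range(6)}}
avg_all = first_average(s13_pages)        # (30 + 6 + 6) / 3
avg_c = second_average(s13_pages, "c")    # (6 + 6) / 2
value_c = completeness_value(avg_all - avg_c)
value_a = completeness_value(avg_all - second_average(s13_pages, "a"))
```

In this example the short page `c` falls 8 below the overall average and receives the lower value, while the full page `a` sits above the average and receives the higher one.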
Further, the method of the present invention also includes the step of setting, according to the magnitude of the difference between the second average value and the first average value and based on a preset correctness rule, a second feature value characterizing the correctness of the chapter list page, where the magnitude of the difference corresponds to the second feature value. That is, after the difference between the second average value and the first average value is obtained, the second feature value characterizing the correctness of the chapter list page is set based on the preset correctness rule. Likewise, the larger the difference, the more likely the chapter list page is incorrect and the smaller the corresponding second feature value; the difference magnitudes and the second feature values are also stored in association with each other in advance. For example, when the difference is 15, the corresponding second feature value is 65; when the difference is 5, the corresponding second feature value is 85. Of course, this embodiment is merely exemplary and does not limit the present invention.
2. Referring to FIG. 4, in another embodiment of the present invention, the step of obtaining at least one feature value of each chapter list page in the first set further includes:
S135. Obtain, for each chapter list page in the first set, the text feature vectors of the chapter list entries corresponding to the same page number, where the value of the page number is greater than a preset page number threshold;
S136. Obtain the total number of identical text feature vectors shared by a given chapter list page and multiple other chapter list pages;
S137. Determine, according to how the total compares with a preset second threshold characterizing freshness, whether the chapter list page is a fake chapter list page.
This embodiment is mainly used to judge the freshness of a chapter list page. The text feature vectors of the chapter list entries whose page numbers are greater than the preset page number threshold are obtained, and the total number of identical text feature vectors shared by a given chapter list page and multiple other chapter list pages is calculated. That is, the text feature vectors corresponding to the last few entries of each chapter list page are obtained, and the total number of identical text feature vectors that a given chapter list page shares with multiple other chapter list pages at the same page numbers is counted. When the total is greater than or equal to the preset second threshold, the chapter list page is determined to be a valid chapter list page. When the total is less than the preset second threshold, the chapter list page is very likely to have been produced in error or fabricated, so it is determined to be a fake chapter list page and is filtered out. Similarly, in this embodiment the feature value characterizing freshness may also be determined by how far the total exceeds the preset second threshold: the greater the difference between the total and the second threshold, the higher the accuracy, the less likely the page is fabricated or erroneous, and the greater the corresponding feature value characterizing freshness; conversely, the smaller the difference, the smaller the corresponding feature value.
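The check in steps S135 to S137 might look like the following sketch. It assumes, purely for illustration, that each chapter list page is a dictionary from page number to chapter title and that two entries count as "identical" when both the page number and the title match; the function names and the example thresholds are hypothetical.

```python
def tail_features(entries, page_threshold):
    """S135: keep only entries whose page number exceeds the page threshold."""
    return {(number, title) for number, title in entries.items()
            if number > page_threshold}


def shared_tail_total(pages, site, page_threshold):
    """S136: total tail entries that one page shares with every other page."""
    target = tail_features(pages[site], page_threshold)
    return sum(
        len(target & tail_features(entries, page_threshold))
        for other, entries in pages.items()
        if other != site
    )


def is_fake(pages, site, page_threshold, second_threshold):
    """S137: a page is treated as fake when the total is below the threshold."""
    return shared_tail_total(pages, site, page_threshold) < second_threshold


genuine = {100: "t100", 101: "t101", 102: "t102", 103: "t103"}
s137_pages = {
    "a": dict(genuine),
    "b": dict(genuine),
    "c": {100: "t100", 101: "made up 1", 102: "made up 2", 103: "t103"},
}
total_a = shared_tail_total(s137_pages, "a", page_threshold=100)
total_c = shared_tail_total(s137_pages, "c", page_threshold=100)
fake_c = is_fake(s137_pages, "c", page_threshold=100, second_threshold=3)
```

Restricting the comparison to the tail of the list matches the description's focus on the most recent chapters, which is where fabricated entries would be expected to diverge from the other sites.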
Further, referring to FIG. 1, the method of the present invention also includes step S14: calculating, according to a preset rule, a comprehensive weight from the at least one feature value of each chapter list page, and obtaining the chapter list page with the largest comprehensive weight.
Specifically, in one embodiment of the present invention, referring to FIG. 5, the step of calculating the comprehensive weight from the at least one feature value of each chapter list page according to a preset rule and obtaining the chapter list page with the largest comprehensive weight further includes:
S151. Weight the at least one feature value of the same chapter list page according to a preset rule to obtain the comprehensive weight of that chapter list page;
S152. Compare the comprehensive weights of the chapter list pages;
S153. Obtain the chapter list page with the largest comprehensive weight.
Specifically, each feature value is weighted by the preset weight corresponding to that particular feature value, and the result is the comprehensive weight of the chapter list page, where the particular feature values characterize the completeness and/or correctness of the chapter list page. For example, in one exemplary embodiment of the present invention, the foregoing steps yield, for a given chapter list page, a first feature value of 80 characterizing completeness and a second feature value of 90 characterizing correctness; the preset weight corresponding to the first feature value is 0.5 and the preset weight corresponding to the second feature value is 0.7. The weighted result, 0.5*80+0.7*90=103, is the comprehensive weight of that chapter list page. Of course, this embodiment is merely exemplary and does not limit the present invention.
Further, after the comprehensive weight of each chapter list page has been calculated, the comprehensive weights are compared and the chapter list page with the largest comprehensive weight is obtained. That page is the target chapter list page. It will be understood that although the method of the present invention takes the data processing stage of a novel search engine as its application scenario, it is not limited to that scenario in practice; it can also be applied to other situations in which the best chapter list page needs to be obtained, laying the groundwork for subsequent processing and improving the user's experience of the product.
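Steps S151 to S153 amount to a weighted sum followed by taking the maximum. The sketch below uses the weights and feature values from the example in the description (0.5 and 0.7, 80 and 90); the dictionary layout, the key names, the function names, and the second page's values are assumptions introduced only for illustration.

```python
# Preset weights per feature value, taken from the example in the description.
WEIGHTS = {"completeness": 0.5, "correctness": 0.7}


def comprehensive_weight(features, weights=WEIGHTS):
    """S151: weighted sum of a page's feature values."""
    return sum(weights[name] * features[name] for name in weights)


def select_target_page(page_features, weights=WEIGHTS):
    """S152 and S153: compare the weights and return the best page."""
    return max(page_features,
               key=lambda site: comprehensive_weight(page_features[site], weights))


s14_features = {
    "a": {"completeness": 80, "correctness": 90},  # 0.5*80 + 0.7*90, about 103
    "b": {"completeness": 60, "correctness": 65},  # hypothetical second page
}
target_page = select_target_page(s14_features)
```

Because the weights are floating-point numbers, the computed value for page `a` should be compared with 103 using a tolerance rather than exact equality.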
In summary, the present invention provides a method for evaluating network novel chapter lists. Based on the similarity between multiple chapter list pages, chapter list pages from different sites are classified into sets; the set with the largest sum of site authority values is taken as the first set; a comprehensive weight is then calculated, based on a preset rule, from the at least one feature value of each chapter list page in the first set, and the chapter list page with the largest comprehensive weight is obtained. The scheme thus obtains chapter list pages from multiple sites automatically and, by comparing and jointly analyzing the similarity, the site authority values, and the obtained feature values, arrives at the highest-quality chapter list page. This solves the low efficiency of the prior art, in which chapter list pages are judged using manually configured templates. The scheme can evaluate flexibly and quickly which chapter list page best meets the requirements, and the evaluation result is accurate and objective.
Further, following the functional modular thinking of computer software, the present invention also provides a device for evaluating network novel chapter lists; see FIG. 6. The device includes a classification module 11, a diversity module 12, a feature quantity acquisition module 13, and a target acquisition module 14. These modules form the framework of the whole device and realize a modular implementation. The specific functions of each module are described below.
The classification module 11 is configured to determine the similarity between multiple chapter list pages of the same subject and to classify the chapter list pages whose similarity is higher than a preset threshold into the same set, each chapter list page corresponding to one site.
It should be noted that, in the method for evaluating network novel chapter lists of the present invention, a web spider can crawl data from multiple websites for the same subject, thereby obtaining the chapter list pages of that subject. The subject may be the title of a novel or some of its key text features. The present invention therefore also includes a list page acquisition module, configured to obtain, for the same subject, the corresponding chapter list pages from multiple sites.
Specifically, in one embodiment of the present invention, the list page acquisition module may receive a search request carrying a keyword of the subject and perform structural analysis on the web pages under the domain names of novel websites. If a web page includes multiple parallel chapter list tags, the page can be determined to be a novel chapter list page. The link targets (href, Hypertext Reference) of these parallel chapter list tags are highly similar to one another: their directory is the same but their file names differ. For example, suppose the href attributes of the parallel chapter list tags all contain the directory 5_5288, while the file names they contain differ, ranging from 970871 to 970980.
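One way to sketch this structural check is to collect the href values of the links on a page and test whether enough of them share a directory while differing in file name. The sketch below uses Python's standard `html.parser` and `posixpath` modules; the function name `looks_like_chapter_list`, the `min_links` cut-off, and the treatment of every `<a>` element as a candidate chapter list tag are assumptions, since the description does not fix these details.

```python
import posixpath
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def looks_like_chapter_list(html, min_links=3):
    """True when at least min_links links share a directory but differ in file name."""
    collector = HrefCollector()
    collector.feed(html)
    files_by_directory = {}
    for href in collector.hrefs:
        directory, filename = posixpath.split(href)
        files_by_directory.setdefault(directory, set()).add(filename)
    return any(len(names) >= min_links for names in files_by_directory.values())


chapter_html = (
    '<ul><li><a href="/5_5288/970871.html">c1</a></li>'
    '<li><a href="/5_5288/970872.html">c2</a></li>'
    '<li><a href="/5_5288/970873.html">c3</a></li></ul>'
)
other_html = '<a href="/about.html">a</a><a href="/help/faq.html">b</a><a href="/login">c</a>'
is_chapter_list = looks_like_chapter_list(chapter_html)
is_other_list = looks_like_chapter_list(other_html)
```

A production crawler would likely also require the matching links to be siblings in the same container, which this sketch does not attempt.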
Further, the multiple parallel chapter list tags included in the novel chapter list page contain chapter text feature vectors, which include keywords and/or chapter numbers characterizing chapters; the list page acquisition module can identify chapter list pages based on these keywords and/or chapter numbers. For example, a chapter list tag may include the keyword "章" (chapter), or "卷" (volume), "节" (section), "章节" (chapter section), and so on; it may also include keywords indicating the chapter number, such as "一" (one), "二" (two), or "一十八" (eighteen). The chapter number may of course also be stored as digits, such as "1", "2", or "18".
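As an illustrative sketch of how such keywords and chapter numbers could be pulled out of a link label, a regular expression can match a run of digits or Chinese numerals followed by one of the chapter keywords. The pattern, the optional leading "第", and the function name `chapter_feature` are assumptions; the description only lists the kinds of keywords involved.

```python
import re

# Digits or Chinese numerals, followed by one of the chapter keywords.
CHAPTER_PATTERN = re.compile(
    r"第?\s*([0-9]+|[零一二三四五六七八九十百千]+)\s*(章节|章|卷|节)"
)


def chapter_feature(label):
    """Return (chapter number text, keyword) for a label, or None if absent."""
    match = CHAPTER_PATTERN.search(label)
    if match is None:
        return None
    return match.group(1), match.group(2)


feature_cn = chapter_feature("第一十八章 风起")
feature_digits = chapter_feature("第18章")
feature_none = chapter_feature("关于我们")
```

A page on which most link labels yield a feature of this kind is a strong candidate for a chapter list page; converting the Chinese numerals to integers would be a natural next step but is left out of this sketch.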
Further, after the list page acquisition module has obtained the chapter list pages corresponding to the same subject from multiple sites, the classification module 11 determines the similarity between those chapter list pages and classifies the chapter list pages whose similarity is higher than the preset threshold into the same set, each chapter list page corresponding to one site. In this embodiment, the classification module 11 may extract text feature vectors from the chapter list names in the chapter list pages of the same subject, where a text feature vector may be multiple keywords in a chapter list name, and judge the similarity between those keywords based on a certain similarity judgment algorithm. Alternatively, the classification module 11 may extract numerical feature vectors from the page numbers corresponding to the chapter list names of the same subject, where a numerical feature vector may be the numerical value of a page number. In this embodiment, the classification module 11 may combine the text feature vectors with their corresponding numerical feature vectors to calculate the similarity between any two chapter list pages, or may use either kind of feature vector alone.
Specifically, referring to FIG. 7, in one embodiment of the present invention, the classification module 11 further includes a reference page determining unit 111, a first extraction unit 112, a first calculation unit 113, and a first classification unit 114.
The reference page determining unit 111 is configured to determine, according to the authority values of the sites corresponding to the chapter list pages, the chapter list page with the highest authority value as the reference chapter list page;
The first extraction unit 112 is configured to extract the text feature vector of each chapter list page;
The first calculation unit 113 is configured to count, for each chapter list page, the total number of text feature vectors it shares with the reference chapter list page;
The first classification unit 114 is configured to classify the chapter list page and the reference chapter list page into the same set when the total is greater than a preset threshold.
When judging the similarity between multiple chapter list pages, the reference page determining unit 111 first obtains a reference chapter list page. In one embodiment of the present invention, the authority values of the different sites may be obtained and the chapter list page with the highest authority value taken as the reference chapter list page, where the authority value of a site is derived from the ratings given to that site by a large number of users. The first extraction unit 112 then extracts the text feature vector of each chapter list page based on a certain algorithm, and the first calculation unit 113 counts the total number of text feature vectors that each chapter list page shares with the reference chapter list page. When the total is greater than the pre-stored threshold, the first classification unit 114 classifies the chapter list page and the reference chapter list page into the same set. The method is then repeated to classify the chapter list pages that are not in that set into one or more other sets.
Further, referring to FIG. 6, the diversity module 12 is configured to obtain the authority value of each site in the same set and to take the set with the largest sum of authority values as the first set, where the authority value is determined from the ratings given to the site by multiple users.
The foregoing classification module 11 classifies the chapter list pages into different sets according to the similarity between them. The diversity module 12 calculates, for each set, the sum of the authority values of the sites hosting its chapter list pages, where the authority value of a site is determined from the ratings given to that site by multiple users, and takes the set with the largest sum of authority values as the first set.
Further, referring to FIG. 6, the feature quantity acquisition module 13 is configured to obtain at least one feature value of each chapter list page in the first set. It should be noted that the at least one feature value may be a feature value characterizing the completeness, correctness, or freshness of the chapter list page; the ways in which the feature quantity acquisition module 13 obtains the feature values are described below through different embodiments.
1. Specifically, referring to FIG. 8, in one embodiment of the present invention, the feature quantity acquisition module 13 further includes a second extraction unit 131, a first average value calculation unit 132, a second average value calculation unit 133, and a first setting unit 134:
The second extraction unit 131 is configured to extract the text feature vector of each chapter list page in the first set;
The first average value calculation unit 132 is configured to calculate a first average value, being the average number of identical text feature vectors shared by every two chapter list pages in the first set;
The second average value calculation unit 133 is configured to calculate a second average value, being the average number of identical text feature vectors shared by a given chapter list page and multiple other chapter list pages;
The first setting unit 134 is configured to set, according to the magnitude of the difference between the second average value and the first average value and based on a preset completeness rule, a first feature value characterizing the completeness of the chapter list page, where the magnitude of the difference corresponds to the first feature value.
Specifically, the second extraction unit 131 first extracts the text feature vector of each chapter list page in the first set. The first average value calculation unit 132 then counts the number of identical text feature vectors shared by every two chapter list pages and averages the resulting counts to obtain the first average value. The second average value calculation unit 133 counts the number of identical text feature vectors shared by a given chapter list page and multiple other chapter list pages and averages them to obtain the second average value. The first setting unit 134 then calculates the difference between the first average value and the second average value and sets the first feature value characterizing the completeness of the chapter list page based on the preset completeness rule. The larger the difference, the more likely the chapter list page is incomplete and the smaller the corresponding first feature value; the difference magnitudes and the first feature values are stored in association with each other in advance. For example, when the difference is 15, the corresponding first feature value is 60; when the difference is 5, the corresponding first feature value is 80. Of course, this embodiment is merely exemplary and does not limit the present invention.
Further, the device of the present invention also includes a second setting unit, configured to set, according to the magnitude of the difference between the second average value and the first average value and based on a preset correctness rule, a second feature value characterizing the correctness of the chapter list page, where the magnitude of the difference corresponds to the second feature value. That is, after the second setting unit obtains the difference between the second average value and the first average value, it sets the second feature value characterizing the correctness of the chapter list page based on the preset correctness rule. Likewise, the larger the difference, the more likely the chapter list page is incorrect and the smaller the corresponding second feature value; the difference magnitudes and the second feature values are also stored in association with each other in advance. For example, when the difference is 15, the corresponding second feature value is 65; when the difference is 5, the corresponding second feature value is 85. Of course, this embodiment is merely exemplary and does not limit the present invention.
2. Referring to FIG. 9, in another embodiment of the present invention, the feature quantity acquisition module 13 further includes a first obtaining unit 135, a total number obtaining unit 136, and a judging unit 137.
The first obtaining unit 135 is configured to obtain, for each chapter list page in the first set, the text feature vectors of the chapter list entries corresponding to the same page number, where the value of the page number is greater than a preset page number threshold;
The total number obtaining unit 136 is configured to obtain the total number of identical text feature vectors shared by a given chapter list page and multiple other chapter list pages;
The judging unit 137 is configured to determine, according to how the total compares with a preset second threshold characterizing freshness, whether the chapter list page is a fake chapter list page.
This embodiment is mainly used to judge the freshness of a chapter list page. The first obtaining unit 135 obtains the text feature vectors of the chapter list entries whose page numbers are greater than the preset page number threshold, and the total number obtaining unit 136 then calculates the total number of identical text feature vectors shared by a given chapter list page and multiple other chapter list pages. That is, the first obtaining unit 135 obtains the text feature vectors corresponding to the last few entries of each chapter list page, and the total number obtaining unit 136 counts the total number of identical text feature vectors that a given chapter list page shares with multiple other chapter list pages at the same page numbers. When the judging unit 137 finds that the total is greater than or equal to the preset second threshold, the chapter list page is determined to be a valid chapter list page. When the total is less than the preset second threshold, the chapter list page is very likely to have been produced in error or fabricated, and it is determined to be a fake chapter list page.
Further, the device of the present invention also includes a filtering module, configured to filter out the fake chapter list page after the judging unit determines that the chapter list page is a fake chapter list page. Similarly, in this embodiment the feature value characterizing freshness may also be determined by how far the total exceeds the preset second threshold: the greater the difference between the total and the second threshold, the higher the accuracy, the less likely the page is fabricated or erroneous, and the greater the corresponding feature value characterizing freshness; conversely, the smaller the difference, the smaller the corresponding feature value.
Further, referring to FIG. 6, the target acquisition module 14 included in the device is configured to calculate, according to a preset rule, a comprehensive weight from the at least one feature value of each chapter list page and to obtain the chapter list page with the largest comprehensive weight.
Specifically, in one embodiment of the present invention, referring to FIG. 10, the target acquisition module 14 further includes a weighting unit 141, a comparison unit 142, and a target obtaining unit 143.
The weighting unit 141 is configured to weight the at least one feature value of the same chapter list page according to a preset rule to obtain the comprehensive weight of that chapter list page;
The comparison unit 142 is configured to compare the comprehensive weights of the chapter list pages;
The target obtaining unit 143 is configured to obtain the chapter list page with the largest comprehensive weight.
Specifically, the weighting unit 141 weights each specific feature value by the preset weight corresponding to it, and the result is the combined weight of the chapter list page, where the specific feature values characterize the completeness and/or correctness of the chapter list page. For example, in an exemplary embodiment of the present invention, the preceding steps yield, for a certain chapter list page, a first feature value of 80 characterizing completeness and a second feature value of 90 characterizing correctness. The preset weight corresponding to the first feature value is 0.5 and the preset weight corresponding to the second feature value is 0.7, so the weighting gives 0.5*80 + 0.7*90 = 103, which is the combined weight of that chapter list page. Of course, it is easy to understand that this embodiment is merely exemplary and does not limit the present invention.
Further, after the weighting unit 141 has calculated the combined weight of each chapter list page, the comparing unit 142 compares the combined weights of the chapter list pages, and the target acquiring unit 143 obtains the chapter list page with the largest combined weight. The chapter list page with the largest combined weight is the target chapter list page. It is easy to understand that although the method of the present invention takes the data processing stage of a novel search engine as its application scenario, practical application is not limited to this; the method can also be applied to other situations in which the best chapter list page needs to be obtained, laying the groundwork for subsequent processing and improving the user's product experience.
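The weighting and selection performed by units 141 to 143 can be sketched as a weighted sum followed by a maximum. This is a minimal sketch under stated assumptions: the feature names `completeness` and `correctness`, the function names, and the second sample page are hypothetical; only the values 80, 90, 0.5 and 0.7 come from the example above.

```python
from typing import Dict


def combined_weight(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of one page's feature values, using the preset per-feature weights."""
    return sum(weights[name] * value for name, value in features.items())


def best_page(pages: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> str:
    """Return the page whose combined weight is largest (the target chapter list page)."""
    return max(pages, key=lambda page: combined_weight(pages[page], weights))


weights = {"completeness": 0.5, "correctness": 0.7}
pages = {
    "site_a": {"completeness": 80, "correctness": 90},  # values from the example above
    "site_b": {"completeness": 95, "correctness": 60},  # hypothetical second page
}

# Rounded for display, because floating-point products such as 0.7 * 90 are not exact.
print(round(combined_weight(pages["site_a"], weights), 2))
print(best_page(pages, weights))
```

With these numbers the first page scores about 103 and the second about 89.5, so the first page is selected as the target chapter list page.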
In summary, the present invention provides a network novel chapter list evaluation device. The classification module 11 classifies chapter list pages from different sites into the same set based on the similarity between the chapter list pages; the diversity module 12 takes the set with the largest sum of site authority values as the first set; the feature acquiring module 13 obtains at least one feature value of each chapter list page in the first set; and the target acquiring module 14 calculates, based on a preset rule, the combined weight of the at least one feature value of each chapter list page in the first set and obtains the chapter list page with the largest combined weight. That is, this solution obtains chapter list pages from multiple sites automatically and, by comparing and jointly analyzing several parameters (the similarity, the authority values of the sites, and the acquired feature values), arrives at the chapter list page of the highest quality. This solves the low efficiency of the prior art, in which chapter list pages are judged using manually configured templates. The solution of the present invention can flexibly and quickly identify the chapter list page that best meets the requirements, and the evaluation result is accurate and objective.
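The role of the diversity module 12 in this summary, choosing the set whose site authority values have the largest sum, can be sketched as below. The sketch assumes the sets have already been formed by the classification module; `select_first_set`, the site names, and the authority values are hypothetical illustrations.

```python
from typing import Dict, List


def select_first_set(groups: List[List[str]], authority: Dict[str, float]) -> List[str]:
    """Return the group of sites whose authority values sum to the largest total."""
    return max(groups, key=lambda group: sum(authority[site] for site in group))


# Hypothetical sets of sites whose chapter list pages were judged similar to each other.
groups = [["site_a", "site_b"], ["site_c"]]
authority = {"site_a": 3, "site_b": 4, "site_c": 6}

print(select_first_set(groups, authority))  # ['site_a', 'site_b']
```

Note that the single most authoritative site (`site_c`) does not win here: the selection rule compares the sum over each set, so two moderately trusted sites that agree outweigh one highly trusted site.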
Numerous specific details are set forth in the description provided here. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this manner of disclosure is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and placed in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and they may also be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art will understand that although some embodiments described here include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the network novel chapter list evaluation device according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described here. Such a program implementing the invention may be stored on a computer readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, FIG. 11 shows a computing device that can implement the network novel chapter list evaluation method. The computing device conventionally includes a processor 1110 and a computer program product or computer readable medium in the form of a memory 1120. The memory 1120 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. The memory 1120 has a storage space 1130 for program code 1131 for performing any of the method steps described above. For example, the storage space 1130 for program code may include individual program codes 1131, each implementing one of the steps of the above method. The program code can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 12. The storage unit may have storage segments, storage space, and the like arranged similarly to the memory 1120 in the computing device of FIG. 11. The program code may, for example, be compressed in an appropriate form. Typically, the storage unit includes computer readable code 1131', that is, code that can be read by a processor such as 1110, which, when run by a computing device, causes the computing device to perform the steps of the method described above.
References here to "one embodiment", "an embodiment", or "one or more embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. In addition, note that instances of the phrase "in one embodiment" do not necessarily all refer to the same embodiment.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order. These words may be interpreted as names.
In addition, it should be noted that the language used in this specification has been chosen mainly for readability and instructional purposes, not to explain or limit the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.

Claims (24)

  1. A method for evaluating network novel chapter lists, comprising the steps of:
    determining the similarity between a plurality of chapter list pages of the same subject, and classifying chapter list pages whose similarity is higher than a preset threshold into the same set, each chapter list page corresponding to one site;
    obtaining the authority value of each site in the same set, and taking the set with the largest sum of authority values as a first set, wherein the authority value is determined according to ratings of the site by a plurality of users;
    obtaining at least one feature value of each chapter list page in the first set;
    calculating, according to a preset rule, a combined weight of the at least one feature value of each chapter list page, and obtaining the chapter list page with the largest combined weight.
  2. The method according to claim 1, further comprising, before the step of determining the similarity between a plurality of chapter list pages of the same subject, the step of:
    obtaining, based on the same subject, the chapter list pages corresponding to that subject from a plurality of sites.
  3. The method according to claim 1, wherein the step of determining the similarity between a plurality of chapter list pages of the same subject, and classifying chapter list pages whose similarity is higher than a preset threshold into the same set, further comprises the steps of:
    determining, according to the authority values of the sites corresponding to the chapter list pages, the chapter list page with the highest authority value as a reference chapter list page;
    extracting the text feature vectors of each chapter list page;
    calculating the total number of text feature vectors that each chapter list page has in common with the reference chapter list page;
    when the total number is greater than a preset threshold, classifying the chapter list page and the reference chapter list page into the same set.
  4. The method according to claim 1, wherein the step of obtaining at least one feature value of each chapter list page in the first set comprises:
    extracting the text feature vectors of each chapter list page in the first set;
    calculating a first average of the number of identical text feature vectors shared by every two chapter list pages in the first set;
    calculating a second average of the number of identical text feature vectors shared by a given chapter list page and a plurality of other chapter list pages;
    setting, according to the difference between the second average and the first average and based on a preset completeness rule, a first feature value characterizing the completeness of the chapter list page, wherein the size of the difference corresponds to the first feature value.
  5. The method according to claim 4, further comprising the step of:
    setting, according to the difference between the second average and the first average and based on a preset correctness rule, a second feature value characterizing the correctness of the chapter list page, wherein the size of the difference corresponds to the second feature value.
  6. The method according to claim 1, wherein the step of obtaining at least one feature value of each chapter list page in the first set further comprises:
    obtaining, for each chapter list page in the first set, the text feature vectors in the chapter list corresponding to the same page number, wherein the value of the page number is greater than a preset page-number threshold;
    obtaining the total number of text feature vectors that a given chapter list page has in common with a plurality of other chapter list pages;
    judging, according to the relationship between the total number and a preset second threshold characterizing authenticity, whether the chapter list page is a fake chapter list page.
  7. The method according to claim 6, wherein the step of judging, according to the relationship between the total number and the preset second threshold characterizing authenticity, whether the chapter list page is a fake chapter list page comprises:
    when the total number is greater than or equal to the preset second threshold, determining that the chapter list page is a valid chapter list page;
    when the total number is less than the preset second threshold, determining that the chapter list page is a fake chapter list page.
  8. The method according to claim 7, further comprising, after determining that the chapter list page is a fake chapter list page, the step of:
    filtering out the fake chapter list page.
  9. The method according to claim 1, wherein the step of calculating, according to a preset rule, the combined weight of the at least one feature value of each chapter list page, and obtaining the chapter list page with the largest combined weight, comprises:
    weighting at least one feature value of the same chapter list page according to a preset rule to obtain the combined weight of that chapter list page;
    comparing the combined weights corresponding to the chapter list pages;
    obtaining the chapter list page with the largest combined weight.
  10. The method according to claim 9, wherein the step of weighting at least one feature value of the same chapter list page according to a preset rule to obtain the combined weight of that chapter list page comprises:
    weighting each specific feature value by the preset weight corresponding to it, the result being the combined weight of the chapter list page, wherein the specific feature values characterize the completeness and/or correctness of the chapter list page.
  11. The method according to claim 1, wherein the step of determining the similarity between a plurality of chapter list pages of the same subject further comprises the steps of:
    determining the similarity between the text feature vectors of the chapter list names in the plurality of chapter list pages of the same subject; and/or
    determining the similarity between the numerical feature vectors of the page numbers corresponding to the chapter list names in the plurality of chapter list pages of the same subject.
  12. A device for evaluating network novel chapter lists, comprising:
    a classification module, configured to determine the similarity between a plurality of chapter list pages of the same subject and to classify chapter list pages whose similarity is higher than a preset threshold into the same set, each chapter list page corresponding to one site;
    a diversity module, configured to obtain the authority value of each site in the same set and to take the set with the largest sum of authority values as a first set, wherein the authority value is determined according to ratings of the site by a plurality of users;
    a feature acquiring module, configured to obtain at least one feature value of each chapter list page in the first set;
    a target acquiring module, configured to calculate, according to a preset rule, a combined weight of the at least one feature value of each chapter list page, and to obtain the chapter list page with the largest combined weight.
  13. The device according to claim 12, further comprising a list page acquiring module,
    the list page acquiring module being configured to obtain, based on the same subject, the chapter list pages corresponding to that subject from a plurality of sites.
  14. The device according to claim 12, wherein the classification module further comprises:
    a reference page determining unit, configured to determine, according to the authority values of the sites corresponding to the chapter list pages, the chapter list page with the highest authority value as a reference chapter list page;
    a first extracting unit, configured to extract the text feature vectors of each chapter list page;
    a first calculating unit, configured to calculate the total number of text feature vectors that each chapter list page has in common with the reference chapter list page;
    a first classifying unit, configured to classify the chapter list page and the reference chapter list page into the same set when the total number is greater than a preset threshold.
  15. The device according to claim 12, wherein the feature acquiring module further comprises:
    a second extracting unit, configured to extract the text feature vectors of each chapter list page in the first set;
    a first average calculating unit, configured to calculate a first average of the number of identical text feature vectors shared by every two chapter list pages in the first set;
    a second average calculating unit, configured to calculate a second average of the number of identical text feature vectors shared by a given chapter list page and a plurality of other chapter list pages;
    a first setting unit, configured to set, according to the difference between the second average and the first average and based on a preset completeness rule, a first feature value characterizing the completeness of the chapter list page, wherein the size of the difference corresponds to the first feature value.
  16. The device according to claim 15, further comprising a second setting unit:
    the second setting unit being configured to set, according to the difference between the second average and the first average and based on a preset correctness rule, a second feature value characterizing the correctness of the chapter list page, wherein the size of the difference corresponds to the second feature value.
  17. The device according to claim 12, wherein the feature acquiring module further comprises:
    a first acquiring unit, configured to obtain, for each chapter list page in the first set, the text feature vectors in the chapter list corresponding to the same page number, wherein the value of the page number is greater than a preset page-number threshold;
    a total-number obtaining unit, configured to obtain the total number of text feature vectors that a given chapter list page has in common with a plurality of other chapter list pages;
    a determining unit, configured to judge, according to the relationship between the total number and a preset second threshold characterizing authenticity, whether the chapter list page is a fake chapter list page.
  18. The device according to claim 17, wherein
    the determining unit is further configured to determine that the chapter list page is a valid chapter list page when the total number is greater than or equal to the preset second threshold; and
    to determine that the chapter list page is a fake chapter list page when the total number is less than the preset second threshold.
  19. The device according to claim 17, wherein the feature acquiring module further comprises a filtering unit, the filtering unit being configured to filter out the fake chapter list page after the determining unit determines that the chapter list page is a fake chapter list page.
  20. The device according to claim 12, wherein the target acquiring module further comprises:
    a weighting unit, configured to weight at least one feature value of the same chapter list page according to a preset rule to obtain the combined weight of that chapter list page;
    a comparing unit, configured to compare the combined weights corresponding to the chapter list pages;
    a target acquiring unit, configured to obtain the chapter list page with the largest combined weight.
  21. The device according to claim 20, wherein the weighting unit is further configured to weight each specific feature value by the preset weight corresponding to it, the result being the combined weight of the chapter list page, wherein the specific feature values characterize the completeness and/or correctness of the chapter list page.
  22. The device according to claim 12, wherein the classification module further comprises a similarity determining unit,
    the similarity determining unit being configured to determine the similarity between the text feature vectors of the chapter list names in the plurality of chapter list pages of the same subject; and/or
    to determine the similarity between the numerical feature vectors of the page numbers corresponding to the chapter list names in the plurality of chapter list pages of the same subject.
  23. A computer program, comprising computer readable code which, when run on a computing device, causes the computing device to perform the network novel chapter list evaluation method according to any one of claims 1 to 11.
  24. A computer readable medium storing the computer program according to claim 23.
PCT/CN2016/083434 2015-11-12 2016-05-26 Network novel chapter list evaluation method and device WO2017080183A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510771521.1 2015-11-12
CN201510771521.1A CN105302913B (en) 2015-11-12 2015-11-12 Network novel Chapter List appraisal procedure and device

Publications (1)

Publication Number Publication Date
WO2017080183A1 true WO2017080183A1 (en) 2017-05-18

Family

ID=55200182

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/083434 WO2017080183A1 (en) 2015-11-12 2016-05-26 Network novel chapter list evaluation method and device

Country Status (2)

Country Link
CN (1) CN105302913B (en)
WO (1) WO2017080183A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115017430A (en) * 2022-06-27 2022-09-06 京东科技控股股份有限公司 List page determination method and device, electronic equipment and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105302913B (en) * 2015-11-12 2018-09-18 北京奇虎科技有限公司 Network novel Chapter List appraisal procedure and device
CN107153908A (en) * 2017-03-24 2017-09-12 国家计算机网络与信息安全管理中心 Mobile news App influence power ranking methods

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101625693A (en) * 2009-08-10 2010-01-13 北京精讯云顿数据软件有限公司 Method and system of online article statistics
WO2010038481A1 (en) * 2008-10-03 2010-04-08 富士通株式会社 Computer-readable recording medium containing a sentence extraction program, sentence extraction method, and sentence extraction device
CN101980196A (en) * 2010-10-25 2011-02-23 中国农业大学 Article comparison method and device
CN104050273A (en) * 2014-06-24 2014-09-17 北京奇虎科技有限公司 Devices and methods for recording latest network file and modifying search result
CN104239285A (en) * 2013-06-06 2014-12-24 腾讯科技(深圳)有限公司 New article chapter detecting method and device
CN105302913A (en) * 2015-11-12 2016-02-03 北京奇虎科技有限公司 Network novel chapter list evaluating method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8335998B1 (en) * 2006-12-29 2012-12-18 Global Prior Art, Inc. Interactive global map
CN103123640A (en) * 2012-02-22 2013-05-29 深圳市谷古科技有限公司 Method and device for searching novel
CN103544172B (en) * 2012-07-13 2019-01-29 深圳市世纪光速信息技术有限公司 Chapter catalogue processing method and device for e-books
CN104216872B (en) * 2013-05-31 2017-12-01 腾讯科技(深圳)有限公司 Method and device for identifying junk chapters in a network novel
CN104572650A (en) * 2013-10-11 2015-04-29 中兴通讯股份有限公司 Method and device for realizing browser intelligent reading and terminal comprising device
CN103577566B (en) * 2013-10-25 2017-07-28 北京奇虎科技有限公司 Web page browsing content loading method and device
CN104615768B (en) * 2015-02-13 2017-06-16 广州神马移动信息科技有限公司 Method and device for identifying identical documents
CN104850642B (en) * 2015-05-26 2017-05-17 广州神马移动信息科技有限公司 Internet content quality evaluation method and internet content quality evaluation device

Also Published As

Publication number Publication date
CN105302913B (en) 2018-09-18
CN105302913A (en) 2016-02-03

Similar Documents

Publication Publication Date Title
CN108920947B (en) Abnormity detection method and device based on log graph modeling
CN112347244B (en) Method for detecting pornography-related and gambling-related websites based on mixed feature analysis
JP5990284B2 (en) Spam detection system and method using character histogram
CA2859135C (en) System and methods for spam detection using frequency spectra of character strings
US10637826B1 (en) Policy compliance verification using semantic distance and nearest neighbor search of labeled content
RU2708356C1 (en) System and method for two-stage classification of files
CN108376129B (en) Error correction method and device
US9210189B2 (en) Method, system and client terminal for detection of phishing websites
WO2015117560A1 (en) Web page recognizing method and apparatus
CN112839014B (en) Method, system, equipment and medium for establishing abnormal visitor identification model
WO2022116419A1 (en) Automatic determination method and apparatus for domain name infringement, electronic device, and storage medium
CN112528294A (en) Vulnerability matching method and device, computer equipment and readable storage medium
WO2017080183A1 (en) Network novel chapter list evaluation method and device
CN111177719A (en) Address category determination method, device, computer-readable storage medium and equipment
CN107786529B (en) Website detection method, device and system
CN109064067B (en) Financial risk operation subject determination method and device based on Internet
WO2018145637A1 (en) Method and device for recording web browsing behavior, and user terminal
CN108112026B (en) WiFi identification method and device
US9595071B2 (en) Document identification and inspection system, document identification and inspection method, and document identification and inspection program
CN108959289B (en) Website category acquisition method and device
CN115801455B (en) Method and device for detecting counterfeit website based on website fingerprint
US20210034704A1 (en) Identifying Ambiguity in Semantic Resources
CN108171053B (en) Rule discovery method and system
CN114201607B (en) Information processing method and device
CN115643044A (en) Data processing method, device, server and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16863364

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16863364

Country of ref document: EP

Kind code of ref document: A1