CN109871433B - Method, device, equipment and medium for calculating relevance between document and topic - Google Patents
- Publication number: CN109871433B (application CN201910131086.4A)
- Authority: CN (China)
- Prior art keywords: document, topic, dictionary, words, calculating
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Classification: Information Retrieval; DB Structures and FS Structures Therefor
Abstract
The invention provides a method for calculating the relevance between a document and a topic. The method obtains a document set and acquires a dictionary corresponding to a preset topic; the dictionary is constructed by learning topic data with a semi-supervised learning algorithm and comprises a plurality of words semantically related to the preset topic. For any document in the document set, the relevance between that document and the preset topic corresponding to the dictionary is calculated according to the hits of the dictionary's words in the document set. This relevance characterizes how closely the document's content relates to the preset topic and can serve as a basis for judging whether the document is suitable for adaptation into film and television works tied to hot topics. The application also provides related devices to ensure the application and realization of the method in practice.
Description
Technical Field
The invention relates to the technical field of data processing, and in particular to a method and a device for calculating the relevance between a document and a topic.
Background
At present, many film and television works derive from adaptations of literary works; for example, the dramas Qing Yun Zhi and The Lost Tomb were adapted from novels. Literary works are numerous and varied, so a work's adaptation value needs to be weighed before it is adapted for film or television.
Popular literary works have large readerships, and their adaptations attract large audiences. One existing way to assess a literary work's adaptation value is therefore to evaluate widely read works by their comment counts, like counts, and payment figures, and to select highly scored works as adaptation material.
Research shows, however, that some film and television works adapted from literary works and tied to social hotspots, such as In the Name of the People and Dying to Survive, have high adaptation value even though the reading volume of the underlying literary works is very low. The adaptation value of such works cannot be measured by the method above.
Disclosure of Invention
In view of this, the present invention provides a method for calculating the relevance between a document and a topic, which determines the degree of relevance between a document and a topic and thereby provides a basis for selecting documents relevant to that topic.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
in a first aspect, the present invention provides a method for calculating relevance between a document and a topic, including:
obtaining a document set;
acquiring a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and comprises a plurality of words semantically related to the preset topic;
and for any document in the document set, calculating the relevance between that document and the preset topic corresponding to the dictionary, according to the hits of the dictionary's words in the document set.
In a second aspect, the present invention provides a device for calculating relevancy of a document to a topic, including:
the document acquisition module is used for acquiring a document set;
the dictionary obtaining module is used for obtaining a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and comprises a plurality of words semantically related to the preset topic;
and the relevance calculation module is used for calculating, for any document in the document set, the relevance between that document and the preset topic corresponding to the dictionary, according to the hits of the dictionary's words in the document set.
In a third aspect, the present invention provides a computing device for calculating relevance of a document to a topic, comprising: a processor and a memory, the processor executing a software program stored in the memory, calling data stored in the memory, and performing at least the following steps:
obtaining a document set;
acquiring a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and comprises a plurality of words semantically related to the preset topic;
and for any document in the document set, calculating the relevance between that document and the preset topic corresponding to the dictionary, according to the hits of the dictionary's words in the document set.
In a fourth aspect, the present invention provides a storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method for calculating the relevance between a document and a topic described above.
Based on the above technical solution, the invention provides a method for calculating the relevance between a document and a topic: it acquires a document set and the dictionary corresponding to a preset topic, and calculates the relevance between any document in the set and the preset topic according to the hits of the dictionary's words in the document set. This relevance characterizes how closely the document's content relates to the preset topic and can serve as a basis for judging whether the document is suitable for adaptation into film and television works tied to hot topics.
Drawings
To illustrate the embodiments of the present invention or the prior-art technical solutions more clearly, the drawings used in their description are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for calculating topic relevance of a document provided by the present invention;
FIG. 2 is a flow chart of a topic dictionary construction process provided by the present invention;
FIG. 3 is a diagram illustrating an output display of an LDA topic model in the method for calculating the relevance between a document and a topic provided by the present invention;
FIG. 4 is a schematic diagram of a computing device for calculating topic relevance of a document provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the present invention, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Referring to fig. 1, the invention provides a method for calculating the relevance between a document and a topic, comprising steps S101 to S103.
S101: a document set is obtained.
In particular, documents may be obtained in various ways, such as crawling the web, scanning books, or other means. Taking online novels as an example, a crawler tool gathers documents from websites such as Qidian and 17k; the crawled data include the document name, author, document synopsis, the content of the first 5 chapters, comments, and so on. Note that a document is an article composed of text content and may take any form; many types of documents are possible, for example literary works such as novels.
The crawled data are cleaned: garbled characters are removed and the data format is unified. The processed document data may then be written into a MySQL database. For example, crawling novel content from Qidian yields the corresponding Table 1:
Novel crawler data table hot_social_novel_crawler:

| Field name | Note | Field type | Properties | Remarks |
| --- | --- | --- | --- | --- |
| novel_id | Novel number | int(11) | Non-null primary key | Auto-increment |
| novel_date | Novel release time | varchar(50) | Non-null | |
| novel_title | Novel title | varchar(100) | Non-null | |
| novel_author | Novel author | varchar(255) | Non-null | |
| novel_content | Novel text | text | Nullable | |
| novel_summary | Novel synopsis | text | Nullable | |

TABLE 1
As Table 1 shows, each retrieved novel is numbered with a novel_id of type int(11), a non-null auto-incrementing primary key. The other fields are handled in the same fashion, and the fields in the following tables can be read the same way.
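As an illustration, the table above can be sketched with Python's built-in sqlite3 as a stand-in for MySQL (the field names follow Table 1; the sample row values are hypothetical):

```python
import sqlite3

# sqlite3 stand-in for the MySQL table in Table 1; AUTOINCREMENT
# mirrors the self-incrementing novel_id primary key.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hot_social_novel_crawler (
        novel_id      INTEGER PRIMARY KEY AUTOINCREMENT,
        novel_date    VARCHAR(50)  NOT NULL,
        novel_title   VARCHAR(100) NOT NULL,
        novel_author  VARCHAR(255) NOT NULL,
        novel_content TEXT,
        novel_summary TEXT
    )""")
conn.execute("INSERT INTO hot_social_novel_crawler"
             " (novel_date, novel_title, novel_author) VALUES (?, ?, ?)",
             ("2019-02-21", "Example Novel", "Anon"))
print(conn.execute("SELECT novel_id, novel_title"
                   " FROM hot_social_novel_crawler").fetchall())
# [(1, 'Example Novel')]
```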
In addition, the novel comments crawled by the crawler tool can also be stored in a data table, such as the novel comment data table hot_social_comments_crawler shown in Table 2:

| Field name | Note | Field type | Properties |
| --- | --- | --- | --- |
| novel_id | Novel number | int(11) | Non-null primary key |
| comment_date | Comment time | varchar(50) | Non-null |
| comment_content | Comment text | varchar(255) | Non-null |

TABLE 2
In practical applications, a specific implementation of step S101 (obtaining the document set) may be: using a crawler tool, extract from each document the content data that can represent its subject matter, and combine the content data extracted from each document into a document set.
In particular, document resources on the internet are abundant, and discovering documents with adaptation value requires computing over a large number of them; a crawler tool is convenient and fast for acquiring data at this scale. The crawler extracts from each document the parts of its content that can represent the document's subject matter, such as the name, author, synopsis, first 5 chapters, and comments. Of course, the crawled data are not limited to these parts. To obtain the relevance between many documents and the topics, many documents must be crawled, combined into a document set, and fed into the algorithm model for screening.
The above manner may obtain the document set, and besides obtaining the document set, a dictionary related to the topic needs to be obtained.
S102: acquiring a dictionary corresponding to a preset topic; the dictionary is constructed by learning the topic data by using a semi-supervised learning algorithm, and comprises a plurality of words related to the preset topic semantics.
The topics are preset and may include various hot topics: people's livelihood, economy, culture, education, health, sports, popular science, and so on. A preset topic is needed during dictionary construction; the specific process is described below.
The dictionary corresponding to a topic is extracted from topic data. For example, topic data are obtained with a crawler tool and input into a topic extraction algorithm model, which segments the data into words and divides the words into several clusters, different clusters representing different topics. Note that these topics are hidden topics: they are produced automatically by the topic extraction algorithm model, without manual supervision. In addition, topic data means content expressing public opinions on some topic; it can come from any medium, such as the internet or newspapers and magazines, and can take many forms, such as news data or tagged hot topics.
After the topic extraction algorithm model yields several topic clusters, the words semantically related to the preset topic are extracted from the clusters, giving the dictionary corresponding to that topic. If there are multiple topics, each obtains its own dictionary in this way.
S103: and calculating the relevance of any document and a preset topic corresponding to the dictionary according to the hit condition of the words in the dictionary in the document set aiming at any document in the document set.
Based on the obtained document set and the topic's dictionary, count the hits of the dictionary's words in the document set. For example, a dictionary corresponding to the people's-livelihood topic is obtained through the steps above; then, over the obtained document set, count how each word in the livelihood dictionary appears in each document. Note that a "hit" means a word from the dictionary appears in a document of the document set.
The condition that words in the dictionary appear in the document can reflect the semantic relevance degree of the document and the dictionary, and further, the relevance degree of the document and the topic corresponding to the dictionary can be calculated according to the hit condition.
In practical applications, a specific embodiment of the step S103 is as follows.
The calculation of the relevance between the document and the topic needs to calculate the occurrence frequency of the dictionary in a single document, namely the word frequency, then calculate the occurrence frequency of the dictionary in the document set, namely the inverse document frequency, and further calculate the relevance between the document and the topic according to the word frequency and the inverse document frequency. Thus:
first, the word frequency of the dictionary for any document is calculated based on the number of occurrences of words in the dictionary in any document.
Specifically, the dictionary comprises a plurality of words, the number of times each word in the dictionary appears in a certain document and the total number of words in the document are counted, and the word frequency of the dictionary in the document is calculated according to the number of times each word in the dictionary appears in a certain document and the total number of words in the document.
In practical application, a specific embodiment of the step includes the following steps: A1-A2.
A1: counting the total number of times each word in the dictionary appears in any document, and counting the total number of words in any document.
A2: and taking the ratio of the total number of times of occurrence of each word in the dictionary in any document to the total number of words in any document as the word frequency of the dictionary for any document.
Specifically, count the total number of times all dictionary words appear in the document. For example, if the dictionary for the livelihood topic contains the 2 words "poor" and "employment", and in some document "poor" appears twice and "employment" once, then the livelihood dictionary's words appear three times in that document.
The word frequency is computed as TF = (total occurrences of the dictionary's words in the document) / (total number of words in the document), where TF is the word frequency of the dictionary in that document. Continuing the example above, if the document has 100 words in total, the word frequency of the livelihood dictionary in that document is 3/100 = 0.03.
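A minimal Python sketch of this word-frequency step, reproducing the running example (the token list is a hypothetical stand-in for a segmented document):

```python
def term_frequency(dictionary, doc_tokens):
    # TF = (occurrences of dictionary words in the document) / (total words)
    hits = sum(1 for w in doc_tokens if w in dictionary)
    return hits / len(doc_tokens)

# "poor" appears twice and "employment" once in a 100-word document.
doc = ["poor", "poor", "employment"] + ["filler"] * 97
print(term_frequency({"poor", "employment"}, doc))  # 0.03
```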
Secondly, the inverse document frequency of the dictionary for the document set is calculated according to the occurrence times of the words in the dictionary in all the documents in the document set.
For all documents in the document set, count the occurrences of the dictionary's words; also count the total number of documents in the set. The inverse document frequency of the dictionary for the document set is then calculated from these occurrence counts and the total number of documents.
In practical application, a specific embodiment of the step includes the following steps: B1-B2.
B1: and counting the occurrence times of the words in the dictionary in each document in the document set, and taking the document with the occurrence times meeting a preset threshold value as a document containing the dictionary.
For every document in the document set, count the occurrences of the dictionary's words in it. If the occurrences are numerous, the document is considered to contain the dictionary; if few, it is considered not to contain it. The invention therefore presets a threshold as the criterion: a document contains the dictionary when the dictionary's words appear in it at least log2(n) times, where n is the total number of words in the dictionary and log2(n) is the preset threshold.
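The log2(n) threshold can be sketched as follows (a minimal illustration; the token lists and word names are hypothetical):

```python
import math

def contains_dictionary(dictionary, doc_tokens):
    # A document "contains" the dictionary when dictionary words occur
    # at least log2(n) times, n being the dictionary's word count.
    occurrences = sum(1 for w in doc_tokens if w in dictionary)
    return occurrences >= math.log2(len(dictionary))

four_words = {"w0", "w1", "w2", "w3"}  # n = 4, threshold log2(4) = 2
print(contains_dictionary(four_words, ["w0", "w1", "x"]))  # True  (2 hits >= 2)
print(contains_dictionary(four_words, ["w0", "x", "y"]))   # False (1 hit < 2)
```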
B2: a ratio of the total number of all documents in the document set to the documents containing the lexicon is calculated, and the ratio is determined as an inverse document frequency of the lexicon to the document set.
Specifically, count the total number of documents in the document set, then compute the inverse document frequency from that total and the number of documents containing the dictionary:

IDF = log10( total number of documents / (number of documents containing the dictionary + 1) )

where IDF is the frequency of the dictionary's occurrence across all documents, i.e., the dictionary's inverse document frequency. In this formula, 1 is added to the count of documents containing the dictionary to prevent the denominator inside the logarithm from being zero.
And finally, calculating the product of the word frequency and the inverse document frequency, and taking the product as the relevancy of any document and the topic corresponding to the dictionary.
Specifically, from the word frequency and the inverse document frequency obtained in the steps above, the relevance TF_IDF between the document and the topic corresponding to the dictionary is calculated by the formula TF_IDF = TF · IDF.
For example: suppose a dictionary D corresponding to topic1 is obtained, containing 32 words, along with a document set of 2000 documents of which 19 contain dictionary D. One document is selected from the 2000; it has 10000 words, and the words of dictionary D appear in it 500 times. The relevance between this document and topic1 is then calculated from these statistics as follows:
TF_IDF_D = TF_D · IDF_D = 0.05 × 2 = 0.1
The calculated TF_IDF_D is the relevance of the document to topic1 corresponding to dictionary D.
The relevance between a document and a topic can be stored. For example, if the document is a novel, the relevance can be written into the database according to the field types of Table 3 below; the stored data table is the novel topic relevance table hot_social_novel_topic_collaboration_degree, which contains the fields shown in Table 3.
TABLE 3
From the calculation above, the relevance between a document and a preset topic has two influencing factors: word frequency and inverse document frequency. Note that when computing the dictionary's inverse document frequency, the current document may contain fewer dictionary words than the threshold requires, i.e., the document does not contain the dictionary, while other documents in the set do; an inverse document frequency could still be computed as above. In that case, however, the dictionary can be regarded as unrelated to the current document, so its inverse document frequency is set directly to 0.
Therefore, in practice, when calculating the dictionary's inverse document frequency for the document set, first determine whether the occurrences of the dictionary's words in the current document (the document used for the calculation) meet the preset threshold; if not, set the dictionary's inverse document frequency for the document set directly to 0, and otherwise compute it in the manner given above.
Since the inverse document frequency is 0 in that case, and relevance is the product of word frequency and inverse document frequency, the relevance between that document and the preset topic is also 0.
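Putting the steps together, a sketch of the full relevance computation, including the zero short-circuit just described (names and data are illustrative; the numbers follow the dictionary-D example):

```python
import math

def relevance(dictionary, doc_tokens, total_docs, docs_containing):
    """TF·IDF relevance of one document to the dictionary's topic.
    When dictionary hits in the document fall below log2(n), the
    inverse document frequency (hence the relevance) is taken as 0."""
    hits = sum(1 for w in doc_tokens if w in dictionary)
    if hits < math.log2(len(dictionary)):
        return 0.0
    tf = hits / len(doc_tokens)
    idf = math.log10(total_docs / (docs_containing + 1))
    return tf * idf

# Dictionary D: 32 words; document: 10000 words with 500 dictionary hits;
# 2000 documents in the set, 19 of which contain D.
D = {f"w{i}" for i in range(32)}
doc = ["w0"] * 500 + ["filler"] * 9500
print(relevance(D, doc, 2000, 19))  # 0.1
```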
According to the technical solution above, the method can acquire a document set and the dictionary corresponding to a topic, and calculate the relevance between any document in the set and the topic according to the hits of the topic dictionary's words in the document set. This relevance characterizes how closely the document's content relates to the topic and can serve as a basis for judging whether the document is suitable for adaptation into film and television works tied to hot topics.
Referring to fig. 2, an embodiment of the present invention provides a method for constructing a topic dictionary, which specifically includes steps S201 to S203.
S201: and capturing topic data by using a crawler tool.
Specifically, to obtain topic data, news content may be collected, for example by crawling news media such as People's Daily Online, China Daily, and China Youth Daily, as well as news content on self-media platforms.
The data acquired by the crawler are cleaned: garbled characters are removed and the data format is unified. The processed topic data can be written into a MySQL database.
A table is designed per website, such as the following social topic data table hot_social_news_crawler:
| Field name | Note | Field type | Properties | Remarks |
| --- | --- | --- | --- | --- |
| news_tag | Identifier | varchar(100) | Non-null primary key | Concatenation of time and title |
| news_date | News release time | varchar(50) | Non-null | |
| news_topic | News topic | varchar(100) | Nullable | News site section |
| news_title | News headline | varchar(255) | Non-null | |
| news_content | News content | text | Nullable | |
| news_url | Current URL | varchar(255) | Nullable | |
| from_url | Previous URL | varchar(255) | Nullable | Related news |

TABLE 4
The news_url field in the table is the page's current URL, and from_url is the URL of the page that linked to it, used to judge relations between related news items.
Besides the social news table, there are corresponding tables for China Daily (hot_social_news_crawl_chinese_daily) and People's Daily Online (hot_social_news_crawl_renmin_net); the crawled data are written into the database according to the field types shown in the table.
And the crawled topic data is used as an input of the topic model.
S202: the topic data is input into a topic model tool to extract a classification of the topic terms from the topic data.
In particular, the topic model tool is an existing tool for extracting topics, which can classify data contents input to itself, each classification representing a topic. In the step, the topic data is input into a topic model tool as data content, and the topic data can be classified by the topic model tool, so that the classification of topic words is obtained.
The topic model tool's classification process can be described as follows: for the topic data, draw a topic from the topic distribution; then randomly draw a word from that topic's word distribution; repeat until every word in the topic data has been traversed. This mirrors an analysis of how documents are generated: to write a document, one first decides the topics it should contain, then chooses related words around those topics to form sentences, thereby producing the document. Following this document-generation principle, the topic model tool treats the given topic data as documents and infers their topic distribution by the process above.
One specific example of a topic model tool is the LDA (Latent Dirichlet Allocation) topic model, also called a three-layer Bayesian probability model, with a three-layer structure of words, topics, and documents. LDA is an unsupervised learning method that adopts a bag-of-words model: each document is viewed as a word-frequency vector, and word order is ignored. The classification process is briefly described below, taking the LDA model as an example.
Specifically, topic data, topic number K and word number N are input into an LDA topic model tool, where the topic number K and the word number N are preset parameters in the tool, the topic number K indicates how many topic classifications the topic data needs to be divided into by the tool, each topic classification includes a plurality of topic words, and the word number N is used to indicate the number of topic words that need to be selected in each topic classification. It should be noted that, the LDA topic model tool may calculate a probability value of each topic word with respect to the topic classification, where the probability value represents a probability value that the topic word belongs to the topic classification, and therefore, when selecting the topic word, the topic word may be selected from a high probability value to a low probability value.
Based on these preset parameters, after the M items of topic data are input, the LDA topic model automatically divides them into K topic clusters, each representing an independent topic. The LDA model cannot name the specific content of a topic, so these clusters are called hidden topics. In addition, each topic cluster contains N topic words.
Assuming that the number of topics K in the LDA topic model tool is set to 30 and the number of words N is set to 10, some topic data are input into the LDA topic model tool with the values thus set, resulting in the output results shown in fig. 3. As shown in FIG. 3, the LDA topic model tool outputs 30 clusters of words, each cluster representing a topic, the first column number on the left side being the topic number, and each topic containing 10 topic words.
It should be noted that the LDA topic model follows the Dirichlet distribution: the K topics follow a Dirichlet distribution with parameter α, and the N words within each topic follow a Dirichlet distribution with parameter β. When assigning words to a hidden topic, the LDA model selects the top N words by probability as that hidden topic's word class, arranged in descending order of probability.
S203: inputting the classification of preset topics and topic words into a word vector generation model to obtain a plurality of words with similar topic semantics; wherein a plurality of words constitute a dictionary of topics.
Specifically, one or more topics may be set in advance. If the topics are multiple, the topics and the classification of the topic words are input into the word vector generation model respectively, so that dictionaries corresponding to the topics are obtained respectively.
The word vector generation model obtains, from the classes of topic words, a number of words semantically similar to the topic; these words form the topic's dictionary. One example of a word vector generation model is the word2vec model; the dictionary generation process is briefly described below using it as an example.
For example, the "livelihood" topic and the classes of topic words are input into the word2vec model. Given the input topic, word2vec searches the topic-word classes for words related to the topic's semantics: it computes a score between each topic word and the preset topic, then selects a certain number of words (the number can be preset) in descending order of that score as the dictionary corresponding to the topic.
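The selection step, i.e. ranking candidate topic words by semantic similarity to the preset topic, can be sketched with cosine similarity over toy vectors (the vectors stand in for trained word2vec embeddings and are entirely hypothetical):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_dictionary(topic_vec, candidate_vecs, k):
    """Rank candidate topic words by similarity to the preset topic's
    vector and keep the top k as the topic's dictionary."""
    ranked = sorted(candidate_vecs,
                    key=lambda w: cosine(topic_vec, candidate_vecs[w]),
                    reverse=True)
    return ranked[:k]

# Toy 2-d embeddings (hypothetical values).
vecs = {"poverty": [0.9, 0.1], "employment": [0.8, 0.3], "football": [0.1, 0.9]}
print(build_dictionary([1.0, 0.2], vecs, 2))  # ['poverty', 'employment']
```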
In practical applications, after the relevance between documents and topics has been calculated, the results can be ranked.
In one case, there are multiple preset topics, and the relevance of a document to each topic can be calculated in the above manner. Thus, for any document, its relevance to each of the topics is obtained, and these relevance values can be ranked. The ranking may be in descending order of relevance; the topic with the highest relevance to the document is the topic of the document, and may serve as the adaptation direction when the document is adapted into a film or television work.
In another case, there are multiple documents in the document set, and the relevance of each document to the same topic may be ranked. The ranking criterion may likewise be descending order of relevance; the document ranked first is the document most relevant to the topic.
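The two ranking cases described above can be sketched as follows; the relevance scores are hypothetical inputs standing in for the TF-IDF-style scores the method computes.

```python
# Illustrative sketch of the two ranking cases: ranking topics for one
# document, and ranking documents for one topic. Scores are made up.

def rank_topics_for_document(doc_topic_scores):
    """Descending list of (topic, relevance) for a single document."""
    return sorted(doc_topic_scores.items(), key=lambda kv: kv[1], reverse=True)

def rank_documents_for_topic(doc_scores):
    """Descending list of (document, relevance) for a single topic."""
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)

scores = {"romance": 0.12, "crime": 0.45, "history": 0.08}
ranking = rank_topics_for_document(scores)
print(ranking[0][0])  # the document's dominant topic: 'crime'
```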
Referring to fig. 4, an embodiment of the present invention provides a structure of a computing device for calculating a topic relevance of a document. As shown in fig. 4, the apparatus may specifically include: a document acquisition module 401, a dictionary acquisition module 402, and a relevancy calculation module 403.
A document obtaining module 401, configured to obtain a document set.
A dictionary obtaining module 402, configured to obtain a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and the dictionary comprises a plurality of words semantically related to the preset topic.
A relevance calculation module 403, configured to calculate, for any document in the document set, the relevance between the document and the preset topic corresponding to the dictionary, according to the hits of the dictionary's words in the document set.
In one embodiment, the relevance calculation module 403 may specifically include: a word frequency calculation sub-module, an inverse document frequency calculation sub-module and a relevance calculation sub-module.
The word frequency calculation submodule is used for calculating the word frequency of the dictionary according to the occurrence frequency of all words in the dictionary in any document;
the inverse document frequency calculation sub-module is used for calculating the inverse document frequency of the dictionary according to the number of documents in the document set in which the words of the dictionary occur;
and the relevance calculation sub-module is used for calculating the product of the word frequency and the inverse document frequency and taking the product as the relevance between the document and the preset topic corresponding to the dictionary.
In an embodiment, the word frequency calculation sub-module may specifically include: the device comprises a single document dictionary counting unit, a document word counting unit and a word frequency calculating unit.
The single document dictionary counting unit is used for counting the total times of the occurrence of each word in the dictionary in any document;
the document word counting unit is used for counting the total word number in any document;
and the word frequency calculation unit is used for taking the ratio of the total number of times of occurrence of each word in the dictionary in any document to the total number of words in any document as the word frequency of the dictionary.
In one embodiment, the inverse document frequency calculation sub-module may specifically include: a multi-document dictionary counting unit and an inverse document frequency calculating unit.
The multi-document dictionary counting unit is used for counting the occurrence frequency of all words in the dictionary in each document in the document set, and taking the document with the occurrence frequency meeting a preset threshold value as a target document;
and the inverse document frequency calculating unit is used for calculating the ratio of the total number of all the documents in the document set to the number of the target documents and determining the ratio as the inverse document frequency of the dictionary.
In one embodiment, the inverse document frequency calculation sub-module may specifically include: a preset threshold unit.
The preset threshold unit is used for determining the inverse document frequency of the dictionary as 0 if the number of occurrences of the dictionary's words in each document of the document set does not meet the preset threshold.
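Taken together, the word frequency, inverse document frequency, and threshold behavior described by these sub-modules can be sketched as follows. This is a simplified illustration under stated assumptions (whitespace tokenization, a hit threshold of 1), not the claimed implementation.

```python
# Hedged sketch of the relevance computation: term frequency of the whole
# dictionary in one document, a document-count-ratio "inverse document
# frequency" (0 when no document meets the hit threshold, per the text),
# and their product as the relevance score.

def dictionary_tf(doc_words, dictionary):
    """Total hits of dictionary words in the document / total word count."""
    hits = sum(1 for w in doc_words if w in dictionary)
    return hits / len(doc_words) if doc_words else 0.0

def dictionary_idf(all_docs, dictionary, threshold):
    """Ratio of total documents to target documents meeting the threshold."""
    targets = sum(
        1 for doc in all_docs
        if sum(1 for w in doc if w in dictionary) >= threshold
    )
    return len(all_docs) / targets if targets else 0.0  # 0 when no target document

def relevance(doc_words, all_docs, dictionary, threshold=1):
    return dictionary_tf(doc_words, dictionary) * dictionary_idf(all_docs, dictionary, threshold)

docs = [
    "the court ruled on the contract".split(),
    "the actor read the script".split(),
]
dictionary = {"court", "contract", "lawsuit"}
print(relevance(docs[0], docs, dictionary))  # (2/6) * (2/1) ≈ 0.667
```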
In one embodiment, the device for calculating the relevance of the document to the topic may further include a dictionary construction module, configured to construct a dictionary corresponding to the preset topic.
The dictionary building module may specifically include: a news capturing sub-module, a category word sub-module, a word generation sub-module and a dictionary generation sub-module.
The news capturing submodule is used for capturing topic data by using a crawler tool;
a category word sub-module, configured to input the topic data into a topic model tool to extract a category of a topic word from the topic data;
the word generation submodule is used for inputting the classification of a preset topic and the topic words into a word vector generation model so as to obtain a plurality of words with similar semantics with the preset topic;
and the dictionary generation submodule is used for forming a dictionary of the preset topic according to a plurality of words with similar semantics with the preset topic.
In one embodiment, the document obtaining module may specifically include: a document capturing sub-module and a document combining sub-module.
A document crawling sub-module, configured to crawl a plurality of documents from a network using a crawler tool, wherein what is crawled is, for each document, content data capable of representing the topic of the document;
and the document combination submodule is used for combining the plurality of documents into a document set.
In one embodiment, the device for calculating the relevancy of the document to the topic may further include: and a sorting module.
The ranking module is used for ranking the relevancy of any document and each preset topic if the preset topics are multiple; or ranking the relevance of each document in the document set and the same preset topic.
According to the technical scheme, the device for calculating the relevance between the document and the topic can acquire the document set and the dictionary corresponding to the preset topic, and can calculate the relevance between any document in the document set and the preset topic according to the hit condition of words in the topic dictionary in the document set. The relevance between the document and the preset topic can represent the relevance degree between the document content and the preset topic, and can be used as a basis for considering whether the document is suitable to be adapted into the hot topic related film and television works.
In addition, the present application also provides a computing device for relevancy between a document and a topic, which specifically includes: a processor and a memory, the processor executing a software program stored in the memory, calling data stored in the memory, and performing at least the following steps:
obtaining a document set;
acquiring a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and comprises a plurality of words semantically related to the preset topic;
and for any document in the document set, calculating the relevance between the document and the preset topic corresponding to the dictionary according to the hits of the dictionary's words in the document set.
In addition, the present application also provides a storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method for calculating the relevancy between a document and a topic provided by any of the above embodiments is implemented.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method for calculating the relevancy between a document and a topic is characterized by comprising the following steps:
obtaining a document set;
acquiring a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and comprises a plurality of words semantically related to the preset topic;
for any document in the document set, calculating the word frequency of the dictionary according to the number of occurrences of the dictionary's words in the document;
calculating the inverse document frequency of the dictionary according to the number of documents in the document set in which the words of the dictionary occur, specifically comprising: counting the number of occurrences of the dictionary's words in each document of the document set, and taking the documents whose occurrence count meets a preset threshold as target documents; calculating the ratio of the total number of documents in the document set to the number of target documents, and determining the ratio as the inverse document frequency of the dictionary;
and calculating the product of the word frequency and the inverse document frequency, and taking the product as the relevance between the document and the preset topic corresponding to the dictionary.
2. The method for calculating the relevancy between a document and a topic according to claim 1, wherein the calculating the word frequency of the dictionary according to the occurrence frequency of all words in the dictionary in any document comprises:
counting the total times of occurrence of each word in the dictionary in any document;
counting the total word number in any document;
and taking the ratio of the total number of times of occurrence of each word in the dictionary in any document to the total number of words in any document as the word frequency of the dictionary.
3. The method for calculating the relevance between a document and a topic according to claim 1, wherein the calculating the inverse document frequency of the dictionary according to the number of documents in the document set in which the words of the dictionary occur further comprises:
and if the occurrence frequency of all words in the dictionary in any document does not meet a preset threshold value, determining the inverse document frequency of the dictionary as 0.
4. The method for calculating the relevancy between the document and the topic according to claim 1, wherein the construction mode of the dictionary corresponding to the preset topic comprises:
capturing topic data by using a crawler tool;
inputting the topic data into a topic model tool to extract a classification of topic words from the topic data;
inputting the classification of a preset topic and the topic words into a word vector generation model to obtain a plurality of words with similar semantics with the preset topic;
and forming a dictionary of the preset topic according to a plurality of words with similar semantemes with the preset topic.
5. The method for calculating the topic relevance of a document according to claim 1, wherein the obtaining a set of documents comprises:
crawling a plurality of documents from a network using a crawler tool, wherein what is crawled is, for each document, content data capable of representing the topic of the document;
combining the plurality of documents into a document collection.
6. The method of calculating the degree of relevance of a document to a topic according to claim 1, further comprising:
if a plurality of preset topics exist, sequencing the relevancy of any document and each preset topic;
or,
and sequencing the relevancy of each document in the document set and the same preset topic.
7. A device for calculating relevancy of a document to a topic, comprising:
the document acquisition module is used for acquiring a document set;
the dictionary obtaining module is used for obtaining a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and comprises a plurality of words semantically related to the preset topic;
the relevancy calculation module is used for calculating the relevancy of any document in the document set and a preset topic corresponding to the dictionary according to the hit condition of the words in the dictionary in the document set;
the correlation calculation module may specifically include: a word frequency calculation submodule, an inverse document frequency calculation submodule and a correlation degree calculation submodule, wherein,
the word frequency calculation submodule is used for calculating the word frequency of the dictionary according to the occurrence frequency of all words in the dictionary in any document;
the inverse document frequency calculation sub-module is configured to calculate the inverse document frequency of the dictionary according to the number of documents in the document set in which the words of the dictionary occur, specifically comprising: counting the number of occurrences of the dictionary's words in each document of the document set, and taking the documents whose occurrence count meets a preset threshold as target documents; calculating the ratio of the total number of documents in the document set to the number of target documents, and determining the ratio as the inverse document frequency of the dictionary;
and the relevance calculation sub-module is used for calculating the product of the word frequency and the inverse document frequency and taking the product as the relevance between the document and the preset topic corresponding to the dictionary.
8. The apparatus for calculating the topic relevance of a document according to claim 7, further comprising: the dictionary building module is used for building a dictionary corresponding to the preset topic;
the dictionary construction module includes:
the news capturing submodule is used for capturing topic data by using a crawler tool;
a category word sub-module, configured to input the topic data into a topic model tool to extract a category of a topic word from the topic data;
the dictionary generation submodule is used for inputting the classification of a preset topic and the topic words into a word vector generation model so as to obtain a plurality of words with similar semantics with the preset topic; wherein the plurality of words constitute a dictionary of the preset topic.
9. A computing device of document and topic relevance, comprising: a processor and a memory, the processor executing a software program stored in the memory, calling data stored in the memory, and performing at least the following steps:
obtaining a document set;
acquiring a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and comprises a plurality of words semantically related to the preset topic;
aiming at any document in the document set, calculating the word frequency of the dictionary according to the occurrence frequency of all words in the dictionary in the any document;
calculating the inverse document frequency of the dictionary according to the number of documents in the document set in which the words of the dictionary occur, specifically comprising: counting the number of occurrences of the dictionary's words in each document of the document set, and taking the documents whose occurrence count meets a preset threshold as target documents; calculating the ratio of the total number of documents in the document set to the number of target documents, and determining the ratio as the inverse document frequency of the dictionary;
and calculating the product of the word frequency and the inverse document frequency, and taking the product as the relevancy of the any document and a preset topic corresponding to the dictionary.
10. A storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements a method of calculating a topic relevance for a document as recited in any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910131086.4A CN109871433B (en) | 2019-02-21 | 2019-02-21 | Method, device, equipment and medium for calculating relevance between document and topic |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109871433A CN109871433A (en) | 2019-06-11 |
CN109871433B true CN109871433B (en) | 2021-07-23 |
Family
ID=66919047
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910131086.4A Active CN109871433B (en) | 2019-02-21 | 2019-02-21 | Method, device, equipment and medium for calculating relevance between document and topic |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109871433B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111143506B (en) * | 2019-12-27 | 2023-11-03 | 汉海信息技术(上海)有限公司 | Topic content ordering method, topic content ordering device, server and storage medium |
CN111553144A (en) * | 2020-04-28 | 2020-08-18 | 深圳壹账通智能科技有限公司 | Topic mining method and device based on artificial intelligence and electronic equipment |
CN112926297B (en) * | 2021-02-26 | 2023-06-30 | 北京百度网讯科技有限公司 | Method, apparatus, device and storage medium for processing information |
CN112597283B (en) * | 2021-03-04 | 2021-05-25 | 北京数业专攻科技有限公司 | Notification text information entity attribute extraction method, computer equipment and storage medium |
CN113656695A (en) * | 2021-08-18 | 2021-11-16 | 北京奇艺世纪科技有限公司 | Hot data generation method and device, data processing method and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049568A (en) * | 2012-12-31 | 2013-04-17 | 武汉传神信息技术有限公司 | Method for classifying documents in mass document library |
CN105912528A (en) * | 2016-04-18 | 2016-08-31 | 深圳大学 | Question classification method and system |
CN108228555A (en) * | 2016-12-14 | 2018-06-29 | 北京国双科技有限公司 | Article treating method and apparatus based on column theme |
CN108829889A (en) * | 2018-06-29 | 2018-11-16 | 国信优易数据有限公司 | A kind of newsletter archive classification method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120166414A1 (en) * | 2008-08-11 | 2012-06-28 | Ultra Unilimited Corporation (dba Publish) | Systems and methods for relevance scoring |
CN102298622B (en) * | 2011-08-11 | 2013-01-02 | 中国科学院自动化研究所 | Search method for focused web crawler based on anchor text and system thereof |
2019-02-21 | CN | CN201910131086.4A | granted as CN109871433B | Active
Non-Patent Citations (1)
Title |
---|
Construction of a Chinese commendatory and derogatory word dictionary with co-occurrence relations; Yang Chunming; Computer Engineering and Applications; 31 May 2016; Vol. 52, No. 9; Section 3.3 of the paper *
Also Published As
Publication number | Publication date |
---|---|
CN109871433A (en) | 2019-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109871433B (en) | Method, device, equipment and medium for calculating relevance between document and topic | |
Barrón-Cedeno et al. | Proppy: Organizing the news based on their propagandistic content | |
Kang et al. | Modeling user interest in social media using news media and wikipedia | |
CN104077377B (en) | Network public-opinion focus based on web documents attribute finds method and apparatus | |
CN103914478B (en) | Webpage training method and system, webpage Forecasting Methodology and system | |
Nicosia et al. | QCRI: Answer selection for community question answering-experiments for Arabic and English | |
CA2578513C (en) | System and method for online information analysis | |
US10776885B2 (en) | Mutually reinforcing ranking of social media accounts and contents | |
CN109145216A (en) | Network public-opinion monitoring method, device and storage medium | |
CN110334202A (en) | User interest label construction method and relevant device based on news application software | |
CN108595660A (en) | Label information generation method, device, storage medium and the equipment of multimedia resource | |
CN106383887A (en) | Environment-friendly news data acquisition and recommendation display method and system | |
Im et al. | Linked tag: image annotation using semantic relationships between image tags | |
KR20100084510A (en) | Identifying information related to a particular entity from electronic sources | |
WO2020233344A1 (en) | Searching method and apparatus, and storage medium | |
CN107506472B (en) | Method for classifying browsed webpages of students | |
EP2307951A1 (en) | Method and apparatus for relating datasets by using semantic vectors and keyword analyses | |
Kanaris et al. | Learning to recognize webpage genres | |
US20170235836A1 (en) | Information identification and extraction | |
Kumar et al. | Hashtag recommendation for short social media texts using word-embeddings and external knowledge | |
Vick et al. | The effects of standardizing names for record linkage: Evidence from the United States and Norway | |
Alves et al. | A spatial and temporal sentiment analysis approach applied to Twitter microtexts | |
CN108681977B (en) | Lawyer information processing method and system | |
CN112989824A (en) | Information pushing method and device, electronic equipment and storage medium | |
Kartal et al. | TrClaim-19: The first collection for Turkish check-worthy claim detection with annotator rationales |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||