CN109871433B - Method, device, equipment and medium for calculating relevance between document and topic - Google Patents

Info

Publication number
CN109871433B
CN109871433B (application CN201910131086.4A; also published as CN109871433A)
Authority
CN
China
Prior art keywords
document, topic, dictionary, words, calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910131086.4A
Other languages
Chinese (zh)
Other versions
CN109871433A (en)
Inventor
王文超
乔静静
阳任科
牛文娟
***
刘浩洋
关扬
郏昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201910131086.4A priority Critical patent/CN109871433B/en
Publication of CN109871433A publication Critical patent/CN109871433A/en
Application granted granted Critical
Publication of CN109871433B publication Critical patent/CN109871433B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for calculating the relevance between a document and a topic. The method obtains a document set and a dictionary corresponding to a preset topic; the dictionary is constructed by learning topic data with a semi-supervised learning algorithm and contains a plurality of words semantically related to the preset topic. For any document in the document set, the relevance between that document and the preset topic corresponding to the dictionary is calculated according to the hits of the dictionary words in the document set. The relevance between a document and a preset topic represents how closely the document content relates to the preset topic, and can serve as a basis for deciding whether the document is suitable for adaptation into film and television works related to hot topics. In addition, the application also provides related devices for computing the relevance between a document and a topic, to ensure the application and implementation of the method in practice.

Description

Method, device, equipment and medium for calculating relevance between document and topic
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for calculating relevancy between a document and a topic.
Background
At present, many film and television works are adaptations of literary works; for example, productions such as Qingyunzhi and Daomu Biji (The Lost Tomb) were adapted from novels. Literary works are numerous and varied, and their adaptation value needs to be weighed before they are adapted for film and television.
Popular literary works have large readerships, and their film and television adaptations attract large audiences. One existing way of assessing the film and television adaptation value of a literary work is therefore to evaluate widely read works by their number of comments, number of likes and payment figures, and to select the works with the highest scores as subjects for adaptation.
However, research shows that some film and television works related to social hotspots and adapted from literary works, such as In the Name of the People, The Worship Houses and The God of Medicine, have adaptation value. Yet the reading figures of the underlying literary works are very low, so their adaptation value cannot be measured by the method described above.
Disclosure of Invention
In view of this, the present invention provides a method for calculating the relevance between a document and a topic, which determines how relevant a document is to a topic and thereby provides a basis for selecting documents related to the topic.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
in a first aspect, the present invention provides a method for calculating relevance between a document and a topic, including:
obtaining a document set;
acquiring a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and comprises a plurality of words semantically related to the preset topic;
and for any document in the document set, calculating the relevance of the any document and a preset topic corresponding to the dictionary according to the hit condition of the words in the dictionary in the document set.
In a second aspect, the present invention provides a device for calculating relevancy of a document to a topic, including:
the document acquisition module is used for acquiring a document set;
the dictionary obtaining module is used for obtaining a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and comprises a plurality of words semantically related to the preset topic;
and the relevancy calculation module is used for calculating the relevancy of any document in the document set and a preset topic corresponding to the dictionary according to the hit condition of the words in the dictionary in the document set.
In a third aspect, the present invention provides a computing device for calculating relevance of a document to a topic, comprising: a processor and a memory, the processor executing a software program stored in the memory, calling data stored in the memory, and performing at least the following steps:
obtaining a document set;
acquiring a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and comprises a plurality of words semantically related to the preset topic;
and for any document in the document set, calculating the relevance of the any document and a preset topic corresponding to the dictionary according to the hit condition of the words in the dictionary in the document set.
In a fourth aspect, the present invention provides a storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for calculating the relevance between a document and a topic described above.
Based on the technical scheme, the invention provides the method for calculating the relevance between the document and the topic, the method can acquire the document set and the dictionary corresponding to the preset topic, and the relevance between any document in the document set and the preset topic can be calculated according to the hit condition of words in the dictionary of the preset topic in the document set. The relevance between the document and the preset topic can represent the relevance degree between the document content and the preset topic, and can be used as a basis for considering whether the document is suitable to be adapted into the hot topic related film and television works.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a flowchart of a method for calculating topic relevance of a document provided by the present invention;
FIG. 2 is a flow chart of a topic dictionary construction process provided by the present invention;
FIG. 3 is a diagram illustrating an output display of an LDA topic model in the method for calculating the relevance between a document and a topic provided by the present invention;
FIG. 4 is a schematic diagram of a computing device for calculating topic relevance of a document provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the present invention, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Referring to fig. 1, the invention provides a method for calculating the relevancy between a document and a topic, which comprises steps S101 to S103.
S101: a document set is obtained.
Specifically, documents may be obtained in various ways, such as crawling the web, scanning books, or other means. Taking online novels as an example, a crawler tool is used to crawl documents from websites such as Qidian and 17k; the crawled data include the document title, author, synopsis, the content of the first five chapters, comments and the like. Note that a document here is an article composed of text content, in any form, and may be of various types, for example a literary work such as a novel.
The crawled data are cleaned: garbled characters are removed and the data format is unified. The processed document data may be written into a MySQL database. For example, crawling novel content from Qidian yields the table shown in Table 1:
Novel crawler data table hot_social_novel_crawler:
Field name | Note | Field type | Properties | Remarks
novel_id | Novel number | int(11) | Non-null primary key | Auto-increment
novel_date | Novel release time | varchar(50) | Non-null |
novel_title | Novel title | varchar(100) | Non-null |
novel_author | Novel author | varchar(255) | Non-null |
novel_centent | Novel text | text | Nullable |
novel_summary | Novel synopsis | text | Nullable |
TABLE 1
As can be seen from Table 1, each crawled novel is numbered and assigned a novel_id of type int(11) (an integer column with a display width of 11). The other fields are stored in the same manner, and the fields in the following tables can be read analogously.
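For illustration, the following is a minimal Python sketch of writing cleaned crawl results into the hot_social_novel_crawler table of Table 1, assuming a local MySQL instance and the pymysql driver; the connection parameters and the example row are hypothetical placeholders, and the field names follow Table 1 verbatim.

```python
# Minimal sketch: insert cleaned crawl results into hot_social_novel_crawler.
# The connection parameters and the example row are placeholders.
import pymysql

cleaned_rows = [
    ("2019-01-15", "Example Novel Title", "Example Author",
     "Text of the first five chapters ...", "Short synopsis ..."),
]

conn = pymysql.connect(host="localhost", user="crawler", password="secret",
                       database="hot_social", charset="utf8mb4")
try:
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO hot_social_novel_crawler "
            "(novel_date, novel_title, novel_author, novel_centent, novel_summary) "
            "VALUES (%s, %s, %s, %s, %s)",
            cleaned_rows,
        )
    conn.commit()  # novel_id is filled automatically (auto-increment primary key)
finally:
    conn.close()
```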
In addition, the novel comments crawled by the crawler tool can also be stored in a data table, such as the novel comment data table hot_social_comments_crawler shown in Table 2:
Field name | Note | Field type | Properties | Remarks
novel_id | Novel number | int(11) | Non-null primary key |
comment_date | Comment time | varchar(50) | Non-null |
comment_content | Comment text | varchar(255) | Non-null |
TABLE 2
In practical applications, a specific implementation manner of the step S101 of obtaining the document set may include: content data capable of representing the subject matter of the documents is extracted from each document by using a crawler tool, and the content data extracted from each document is combined into a document set.
Specifically, document resources on the internet are abundant, and a large number of documents must be acquired and evaluated in order to discover documents with adaptation value. A crawler tool is used to acquire the documents, which is a convenient and fast way to collect large amounts of data. The crawler tool extracts from each document the portion of its content that represents its subject matter, such as the document title, author, synopsis, the first five chapters and comments. Of course, the crawled data are not limited to these portions and may include others. To obtain the relevance between a large number of documents and the topics, many documents must be processed, so multiple documents are crawled, combined into a document set, and input into the algorithm model for screening.
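As a concrete illustration of this step, the sketch below crawls a few pages and assembles a document set, assuming the requests and beautifulsoup4 packages; the URLs and CSS selectors are hypothetical placeholders rather than the structure of any actual novel website.

```python
# Minimal sketch: crawl documents and keep only the content that represents
# their subject matter (title, author, synopsis, opening chapters).
import requests
from bs4 import BeautifulSoup

seed_urls = ["https://example.com/novel/1", "https://example.com/novel/2"]  # placeholders

document_set = []
for url in seed_urls:
    html = requests.get(url, timeout=10).text
    page = BeautifulSoup(html, "html.parser")
    title = page.select_one(".title").get_text(strip=True)      # hypothetical selectors
    author = page.select_one(".author").get_text(strip=True)
    summary = page.select_one(".summary").get_text(strip=True)
    chapters = " ".join(p.get_text(strip=True) for p in page.select(".chapter")[:5])
    document_set.append({"title": title, "author": author,
                         "content": summary + " " + chapters})
```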
The document set can be obtained in the manner above; besides the document set, a dictionary related to the topic also needs to be obtained.
S102: acquiring a dictionary corresponding to a preset topic; the dictionary is constructed by learning the topic data by using a semi-supervised learning algorithm, and comprises a plurality of words related to the preset topic semantics.
The topics are preset, and may include various hot topics: livelihood, economy, culture, education, health, sports, science popularization, and the like. In the construction process of the dictionary, a preset topic needs to be used, and the specific process is described in the following.
The dictionary corresponding to a topic is extracted from topic data. For example, topic data are obtained with a crawler tool and input into a topic extraction algorithm model; the model segments the topic data into words and groups the words into several clusters, with different clusters representing different topics. Note that these topics are hidden topics: they are the basis on which the algorithm divides the data automatically, without manual supervision. In addition, topic data refers to content expressing public opinion on a certain topic; it can come from any medium, such as the internet, newspapers or magazines, and can take many forms, such as news articles or tagged trending topics.
After the topic extraction algorithm model has produced several topic clusters, the words in those clusters that are semantically related to a preset topic are extracted, yielding the dictionary corresponding to that topic. If there are multiple preset topics, a corresponding dictionary is obtained for each of them in the same way.
S103: and calculating the relevance of any document and a preset topic corresponding to the dictionary according to the hit condition of the words in the dictionary in the document set aiming at any document in the document set.
Given the obtained document set and the topic dictionary, the hits of the dictionary words in the document set are counted. For example, suppose the dictionary corresponding to the livelihood topic has been obtained through the steps above; then, for the obtained document set, the occurrences of each word of the livelihood dictionary in each document are counted. Note that a 'hit' means that a word in the dictionary appears in a document of the document set.
How the dictionary words appear in a document reflects the semantic relevance between the document and the dictionary, so the relevance between the document and the topic corresponding to the dictionary can be calculated from these hits.
In practical applications, a specific embodiment of the step S103 is as follows.
Calculating the relevance between a document and a topic requires first computing the frequency of the dictionary within a single document, namely the word frequency, then computing the frequency of the dictionary across the document set, namely the inverse document frequency, and finally computing the relevance from the word frequency and the inverse document frequency. Thus:
first, the word frequency of the dictionary for any document is calculated based on the number of occurrences of words in the dictionary in any document.
Specifically, the dictionary contains a plurality of words. The number of times each dictionary word appears in the document and the total number of words in the document are counted, and the word frequency of the dictionary for that document is calculated from these two quantities.
In practical application, a specific embodiment of the step includes the following steps: A1-A2.
A1: counting the total number of times each word in the dictionary appears in any document, and counting the total number of words in any document.
A2: and taking the ratio of the total number of times of occurrence of each word in the dictionary in any document to the total number of words in any document as the word frequency of the dictionary for any document.
Specifically, the total number of times the dictionary words appear in the document is counted. For example, if the dictionary corresponding to the livelihood topic contains the two words 'poor' and 'employment', and 'poor' appears twice and 'employment' once in a document, then the words of the livelihood dictionary appear three times in that document.
The frequency of the topic dictionary in the document is then calculated according to the formula

TF = (total number of occurrences of the dictionary words in the document) / (total number of words in the document),

where TF is the word frequency of the dictionary in that document. Continuing the example above, if the document contains 100 words in total, the word frequency of the livelihood dictionary in the document is 3 / 100 = 0.03.
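A minimal Python sketch of this word-frequency step is given below; the function and variable names are illustrative, and the document is assumed to be already segmented into a list of words (Chinese text would normally be segmented first, for example with a segmenter such as jieba).

```python
# Minimal sketch of the word-frequency (TF) step.
def dictionary_tf(dictionary_words, document_tokens):
    hits = sum(document_tokens.count(w) for w in dictionary_words)
    return hits / len(document_tokens)

# The example from the text: "poor" appears twice and "employment" once
# in a 100-word document, so TF = 3 / 100 = 0.03.
livelihood_dictionary = ["poor", "employment"]
doc = ["poor", "employment", "poor"] + ["filler"] * 97
print(dictionary_tf(livelihood_dictionary, doc))  # 0.03
```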
Secondly, the inverse document frequency of the dictionary for the document set is calculated according to the occurrence times of the words in the dictionary in all the documents in the document set.
The occurrences of the dictionary words in every document of the document set are counted, along with the total number of documents in the set, and the inverse document frequency of the dictionary for the document set is calculated from these counts.
In practical application, a specific embodiment of the step includes the following steps: B1-B2.
B1: and counting the occurrence times of the words in the dictionary in each document in the document set, and taking the document with the occurrence times meeting a preset threshold value as a document containing the dictionary.
For every document in the document set, the number of occurrences of the dictionary words in that document is counted. If the count is high, the document is considered to contain the dictionary; if it is low, the document is considered not to contain it. The invention therefore presets a threshold as the criterion. Specifically, a document is regarded as containing the dictionary only when the dictionary words appear in it at least log2(n) times, where n is the total number of words in the dictionary and log2(n) is the preset threshold.
B2: a ratio of the total number of all documents in the document set to the documents containing the lexicon is calculated, and the ratio is determined as an inverse document frequency of the lexicon to the document set.
Specifically, the total number of documents in the document set is counted, and the inverse document frequency is then calculated from this total and the number of documents containing the dictionary, according to the formula

IDF = log10( (total number of documents in the document set) / (number of documents containing the dictionary + 1) ),

where IDF is the frequency of occurrence of the dictionary across all documents, i.e. the inverse document frequency of the dictionary. In this formula, 1 is added to the number of documents containing the dictionary to prevent the denominator inside the logarithm from being zero.
And finally, calculating the product of the word frequency and the inverse document frequency, and taking the product as the relevancy of any document and the topic corresponding to the dictionary.
Specifically, from the word frequency and the inverse document frequency obtained in the steps above, the relevance TF_IDF between the document and the topic corresponding to the dictionary is calculated by the formula TF_IDF = TF · IDF.
For example: assuming that a dictionary D corresponding to the topic1 is obtained, the dictionary D contains 32 words, a document set containing 2000 documents is obtained, statistics is performed in advance, the total number of the documents containing the dictionary D is 19, one document is selected from the 2000 documents, the document has 10000 words, the number of times that the words in the dictionary D appear in the document is 500, and the correspondence between the document and the topic1 is calculated based on the statistical data, which is specifically as follows:
Figure BDA0001974652040000072
Figure BDA0001974652040000073
TF_IDFD=TFD·IDFD=0.05×2=0.1
calculated TF _ IDFDI.e., the relevance of the document to the topic1 corresponding to dictionary D.
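The whole computation, including the log2(n) containment test of step B1 and the zero short-circuit discussed in the following paragraphs, can be reproduced with a short Python sketch using the numbers of this example; the function name is an illustrative choice.

```python
# Minimal sketch of the full document-topic relevance computation (TF-IDF).
import math

def topic_relevance(dict_size, hits_in_doc, words_in_doc,
                    total_docs, docs_containing_dict):
    threshold = math.log2(dict_size)      # log2(n), n = number of words in the dictionary
    if hits_in_doc < threshold:           # the document does not contain the dictionary
        return 0.0
    tf = hits_in_doc / words_in_doc
    idf = math.log10(total_docs / (docs_containing_dict + 1))  # +1 avoids a zero denominator
    return tf * idf

# Dictionary D: 32 words; document set: 2000 documents, 19 of which contain D;
# the chosen document: 10000 words with 500 dictionary hits.
print(topic_relevance(32, 500, 10000, 2000, 19))  # 0.05 * 2 = 0.1
```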
The relevance between the document and the topic can be stored. For example, if the document is a novel, the relevance can be written into a database using the field types of Table 3 below; the data table stored in the database is the novel topic relevance table hot_social_novel_topic_collaboration_degree, which contains the fields shown in Table 3.
(The fields of Table 3 appear only as an image in the original publication and are not reproduced in the text.)
TABLE 3
It can be seen from the calculation above that the relevance between a document and a preset topic involves two factors, the word frequency and the inverse document frequency. Note that, when calculating the inverse document frequency of the dictionary, it may happen that the number of dictionary words appearing in the given document does not reach the threshold, i.e. the document does not contain the dictionary, while other documents in the set do contain it; an inverse document frequency could still be computed by the formula above. In that case, however, the dictionary can simply be regarded as unrelated to the given document, so its inverse document frequency is set directly to 0.
Therefore, in practice, when calculating the inverse document frequency of the dictionary for the document set, it is first determined whether the number of occurrences of the dictionary words in the current document (the document being evaluated) meets the preset threshold. If it does not, the inverse document frequency of the dictionary for the document set is set directly to 0; if it does, the inverse document frequency is calculated in the manner given above.
When the inverse document frequency is set to 0, the relevance between that document and the preset topic is also 0, since the relevance is the product of the word frequency and the inverse document frequency.
According to the technical scheme above, the method for calculating the relevance between a document and a topic can acquire a document set and the dictionary corresponding to a topic, and can calculate the relevance between any document in the document set and the topic according to the hits of the words of the topic dictionary in the document set. The relevance between a document and a topic represents how closely the document content relates to the topic, and can serve as a basis for deciding whether the document is suitable for adaptation into film and television works related to hot topics.
Referring to fig. 2, an embodiment of the present invention provides a method for constructing a topic dictionary, which specifically includes steps S201 to S203.
S201: and capturing topic data by using a crawler tool.
Specifically, to obtain topics, news content may be collected; for example, news from media such as People's Daily Online, China Daily and China Youth Daily, as well as content on self-media platforms, may be crawled to obtain topic data.
The data acquired by the crawler are cleaned: garbled characters are removed and the data format is unified. The processed topic data can be written into a MySQL database.
A table is designed for each website, for example the social topic data table hot_social_news_crawler below:
Field name | Note | Field type | Properties | Remarks
news_tag | Identifier | varchar(100) | Non-null primary key | Concatenation of time and title
news_date | News release time | varchar(50) | Non-null |
news_topic | News topic | varchar(100) | Nullable | News site section
news_title | News headline | varchar(255) | Non-null | Relevancy of dictionary
news_content | News content | text | Nullable |
news_url | Current URL | varchar(255) | Nullable |
from_url | Previous-level URL | varchar(255) | Nullable | Related news
TABLE 4
The news_url in the table is the URL of the current page, and from_url is the URL of the page from which the current page was reached; it is used to judge the relationship between related news items.
Besides the social topic table, a China Daily table host_social_news_crawl_Chinese_date and a People's Daily Online table host_social_news_crawl_renmin_net are provided, and the crawled data are written into the database according to the field types shown in the table.
And the crawled topic data is used as an input of the topic model.
S202: the topic data is input into a topic model tool to extract a classification of the topic terms from the topic data.
Specifically, the topic model tool is an existing tool for extracting topics; it classifies the data content fed into it, with each class representing a topic. In this step, the topic data are input into the topic model tool as data content, and the tool classifies them, so that the classification of topic words is obtained.
The topic-classification process of the topic model tool can be described as follows: for the topic data, a topic is drawn from the topic distribution; a word is then drawn at random from the word distribution of that topic; and this process is repeated until every word in the topic data has been traversed. The procedure is based on an analysis of how a document is generated: to produce a document, one first decides which topics it should contain and then chooses related words around those topics to form sentences, thereby generating the document. Following this document-generation principle, the topic model tool treats the given topic data as documents and infers their topic distribution according to this process.
A specific example of the topic model tool is the LDA (Latent Dirichlet Allocation) topic model, also called a three-layer Bayesian probability model, which has a three-layer structure of words, topics and documents. LDA is an unsupervised learning method that adopts a bag-of-words model: each document is treated as a word-frequency vector, and the order of the words is ignored. The classification process of the topic model tool is briefly described below using LDA as an example.
Specifically, the topic data, a topic number K and a word number N are input into the LDA topic model tool, where K and N are preset parameters of the tool: K specifies how many topic classes the tool divides the topic data into, each class containing a number of topic words, and N specifies how many topic words are kept in each class. Note that the LDA topic model tool computes, for each topic word, a probability that the word belongs to its class, so the topic words can be selected in order of probability from high to low.
With these preset parameters, once the M items of topic data are input, the LDA topic model automatically divides them into K topic clusters. Each cluster represents an independent topic; since the LDA model cannot name the concrete content of a topic, the clusters are called hidden topics. In addition, each topic cluster contains N topic words.
Assume that the number of topics K in the LDA topic model tool is set to 30 and the number of words N to 10; when topic data are input into the tool with these settings, the output shown in Fig. 3 is obtained. As shown in Fig. 3, the LDA topic model tool outputs 30 clusters of words, each cluster representing a topic; the number in the first column on the left is the topic number, and each topic contains 10 topic words.
It should be noted that the LDA topic model follows the Dirichlet distribution: the K topics follow a Dirichlet distribution with parameter α, and the N words within each topic follow a Dirichlet distribution with parameter β. When the words are grouped under the hidden topics, the LDA model selects the N words with the highest probability as the word classification of each hidden topic and arranges them by probability from high to low.
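As an illustration of this hidden-topic clustering step, the sketch below uses the gensim implementation of LDA, assuming the crawled topic data has already been segmented into token lists; the placeholder texts, and the choice of gensim itself, are assumptions rather than details given in this description.

```python
# Minimal sketch of hidden-topic clustering with gensim's LDA,
# using K = 30 topics and N = 10 words per topic as in the example above.
from gensim import corpora, models

topic_texts = [["employment", "poverty", "housing"],
               ["match", "team", "goal"]]          # placeholder topic data

id2word = corpora.Dictionary(topic_texts)
corpus = [id2word.doc2bow(text) for text in topic_texts]

lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=30)

# Each cluster is listed as its highest-probability words, from high to low.
for topic_id, words in lda.show_topics(num_topics=30, num_words=10, formatted=False):
    print(topic_id, [w for w, prob in words])
```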
S203: inputting the classification of preset topics and topic words into a word vector generation model to obtain a plurality of words with similar topic semantics; wherein a plurality of words constitute a dictionary of topics.
Specifically, one or more topics may be set in advance. If there are multiple topics, each topic is input, together with the classification of topic words, into the word vector generation model, so that a corresponding dictionary is obtained for each topic.
The word vector generation model is used to obtain, from the classification of topic words, a number of words semantically similar to the topic; these words form the dictionary of the topic. An example of such a model is the word2vec word vector generation model, and the dictionary-generation process is briefly described below using it as an example.
For example, the 'livelihood' topic and the classification of topic words are input into the word2vec word vector generation model. According to the input topic, word2vec searches the classification of topic words for words related to the meaning of the livelihood topic: it computes a similarity (probability) between each topic word and the preset topic, then selects a certain number of words (the number can be preset) in order of similarity from high to low as the dictionary corresponding to the topic.
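The dictionary-generation step can likewise be sketched with gensim's Word2Vec; the training texts, the query word "livelihood" and the dictionary size of 50 are illustrative assumptions (the parameter names follow gensim 4.x, where the vector dimension is vector_size).

```python
# Minimal sketch of dictionary generation: pick the words most similar
# to the preset topic word, ranked by similarity from high to low.
from gensim.models import Word2Vec

topic_word_clusters = [["employment", "poverty", "housing", "livelihood"],
                       ["pension", "healthcare", "livelihood", "income"]]  # placeholders

model = Word2Vec(sentences=topic_word_clusters, vector_size=100,
                 window=5, min_count=1, workers=1)

livelihood_dictionary = [w for w, score in
                         model.wv.most_similar("livelihood", topn=50)]
print(livelihood_dictionary)
```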
In practical application, after the relevancy of the document and the topic is calculated, ranking can be carried out.
In one case there are multiple preset topics, and the relevance between a document and each topic can be calculated in the manner above, so that for any document the relevances to all the topics are obtained and can be ranked. The ranking is ordered from high to low relevance; the topic with the highest relevance to the document is taken as the topic of the document and can serve as the adaptation direction if the document is adapted into a film or television work.
In another case, there are multiple documents in the document set, and the relevance of each document in the set to the same topic can be ranked. The ranking is likewise ordered from high to low, and the document with the highest relevance is the one most strongly related to that topic.
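Both ranking directions can be expressed as a simple sort over the stored relevance scores, as in the sketch below; the scores and document names are illustrative placeholders.

```python
# Minimal sketch of the two ranking directions.
relevance = {                       # relevance[document][topic] = TF-IDF score
    "novel_1": {"livelihood": 0.10, "sports": 0.02, "economy": 0.05},
    "novel_2": {"livelihood": 0.01, "sports": 0.08, "economy": 0.03},
}

# Rank the preset topics for one document: the top topic suggests the
# adaptation direction for that document.
topics_for_novel_1 = sorted(relevance["novel_1"].items(),
                            key=lambda kv: kv[1], reverse=True)
print(topics_for_novel_1[0][0])     # "livelihood"

# Rank all documents against the same topic: the top document is the one
# most strongly related to that topic.
docs_for_livelihood = sorted(relevance.items(),
                             key=lambda kv: kv[1]["livelihood"], reverse=True)
print(docs_for_livelihood[0][0])    # "novel_1"
```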
Referring to fig. 4, an embodiment of the present invention provides a structure of a computing device for calculating a topic relevance of a document. As shown in fig. 4, the apparatus may specifically include: a document acquisition module 401, a dictionary acquisition module 402, and a relevancy calculation module 403.
A document obtaining module 401, configured to obtain a document set.
A dictionary obtaining module 402, configured to obtain a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and the dictionary comprises a plurality of words semantically related to the preset topic.
A relevance calculating module 403, configured to calculate, for any document in the document set, a relevance of the preset topic corresponding to the dictionary for the any document according to a hit condition of a word in the dictionary in the document set.
In one embodiment, the relevance calculation module 403 may specifically include: a word frequency calculation sub-module, an inverse document frequency calculation sub-module and a relevance calculation sub-module.
The word frequency calculation submodule is used for calculating the word frequency of the dictionary according to the occurrence frequency of all words in the dictionary in any document;
the inverse document frequency calculation sub-module is used for calculating the inverse document frequency of the dictionary according to the number of occurrences of all words in the dictionary in all documents in the document set;
and the relevance calculation sub-module is used for calculating the product of the word frequency and the inverse document frequency and taking the product as the relevance between the any document and the preset topic corresponding to the dictionary.
In one embodiment, the word frequency calculation sub-module may specifically include: a single-document dictionary counting unit, a document word counting unit and a word frequency calculation unit.
The single document dictionary counting unit is used for counting the total times of the occurrence of each word in the dictionary in any document;
the document word counting unit is used for counting the total word number in any document;
and the word frequency calculation unit is used for taking the ratio of the total number of times of occurrence of each word in the dictionary in any document to the total number of words in any document as the word frequency of the dictionary.
In one embodiment, the inverse document frequency calculation sub-module may specifically include: a multi-document dictionary counting unit and an inverse document frequency calculating unit.
The multi-document dictionary counting unit is used for counting the occurrence frequency of all words in the dictionary in each document in the document set, and taking the document with the occurrence frequency meeting a preset threshold value as a target document;
and the inverse document frequency calculating unit is used for calculating the ratio of the total number of all the documents in the document set to the number of the target documents and determining the ratio as the inverse document frequency of the dictionary.
In one embodiment, the inverse document frequency calculation sub-module may further include a preset threshold unit, which is used for determining the inverse document frequency of the dictionary as 0 if the number of occurrences of the dictionary words in the any document does not meet the preset threshold.
In one embodiment, the device for calculating the relevance of the document to the topic may further include a dictionary construction module, configured to construct a dictionary corresponding to the preset topic.
The dictionary building module may specifically include: a news catch sub-module, a classified word sub-module, a word generation sub-module and a dictionary generation sub-module.
The news capturing submodule is used for capturing topic data by using a crawler tool;
a category word sub-module, configured to input the topic data into a topic model tool to extract a category of a topic word from the topic data;
the word generation submodule is used for inputting the classification of a preset topic and the topic words into a word vector generation model so as to obtain a plurality of words with similar semantics with the preset topic;
and the dictionary generation submodule is used for forming a dictionary of the preset topic according to a plurality of words with similar semantics with the preset topic.
In one embodiment, the document obtaining module may specifically include: a document capturing sub-module and a document combining sub-module.
A document crawling sub-module for crawling a plurality of documents from a network using a crawler tool, wherein the crawled is content data in each of the documents that can represent a topic of the document;
and the document combination submodule is used for combining the plurality of documents into a document set.
In one embodiment, the device for calculating the relevancy of the document to the topic may further include: and a sorting module.
The ranking module is used for ranking the relevancy of any document and each preset topic if the preset topics are multiple; or ranking the relevance of each document in the document set and the same preset topic.
According to the technical scheme, the device for calculating the relevance between the document and the topic can acquire the document set and the dictionary corresponding to the preset topic, and can calculate the relevance between any document in the document set and the preset topic according to the hit condition of words in the topic dictionary in the document set. The relevance between the document and the preset topic can represent the relevance degree between the document content and the preset topic, and can be used as a basis for considering whether the document is suitable to be adapted into the hot topic related film and television works.
In addition, the present application also provides a computing device for relevancy between a document and a topic, which specifically includes: a processor and a memory, the processor executing a software program stored in the memory, calling data stored in the memory, and performing at least the following steps:
obtaining a document set;
acquiring a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and comprises a plurality of words semantically related to the preset topic;
and for any document in the document set, calculating the relevance of the any document and a preset topic corresponding to the dictionary according to the hit condition of the words in the dictionary in the document set.
In addition, the present application also provides a storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method for calculating the relevancy between a document and a topic provided by any of the above embodiments is implemented.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for calculating the relevancy between a document and a topic is characterized by comprising the following steps:
obtaining a document set;
acquiring a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and comprises a plurality of words semantically related to the preset topic;
aiming at any document in the document set, calculating the word frequency of the dictionary according to the occurrence frequency of all words in the dictionary in the any document;
calculating the inverse document frequency of the dictionary according to the number of occurrences of all words in the dictionary in all documents in the document set, specifically comprising: counting the occurrence times of all words in the dictionary in each document in the document set, and taking the document with the occurrence times meeting a preset threshold value as a target document; calculating the ratio of the total number of all documents in the document set to the number of target documents, and determining the ratio as the inverse document frequency of the dictionary;
and calculating the product of the word frequency and the inverse document frequency, and taking the product as the relevancy of the any document and a preset topic corresponding to the dictionary.
2. The method for calculating the relevancy between a document and a topic according to claim 1, wherein the calculating the word frequency of the dictionary according to the occurrence frequency of all words in the dictionary in any document comprises:
counting the total times of occurrence of each word in the dictionary in any document;
counting the total word number in any document;
and taking the ratio of the total number of times of occurrence of each word in the dictionary in any document to the total number of words in any document as the word frequency of the dictionary.
3. The method for calculating the topic relevance of a document according to claim 1, wherein the calculating the inverse document frequency of the dictionary according to the number of occurrences of all words in the dictionary in all documents in the document set comprises:
and if the occurrence frequency of all words in the dictionary in any document does not meet a preset threshold value, determining the inverse document frequency of the dictionary as 0.
4. The method for calculating the relevancy between the document and the topic according to claim 1, wherein the construction mode of the dictionary corresponding to the preset topic comprises:
capturing topic data by using a crawler tool;
inputting the topic data into a topic model tool to extract a classification of topic words from the topic data;
inputting the classification of a preset topic and the topic words into a word vector generation model to obtain a plurality of words with similar semantics with the preset topic;
and forming a dictionary of the preset topic according to a plurality of words with similar semantemes with the preset topic.
5. The method for calculating the topic relevance of a document according to claim 1, wherein the obtaining a set of documents comprises:
crawling a plurality of documents from a network using a crawler tool, wherein crawled are content data in each of the documents that can represent a topic of the document;
combining the plurality of documents into a document collection.
6. The method of calculating the degree of relevance of a document to a topic according to claim 1, further comprising:
if a plurality of preset topics exist, sequencing the relevancy of any document and each preset topic;
or,
sequencing the relevancy of each document in the document set and the same preset topic.
7. A device for calculating relevancy of a document to a topic, comprising:
the document acquisition module is used for acquiring a document set;
the dictionary obtaining module is used for obtaining a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and comprises a plurality of words semantically related to the preset topic;
the relevancy calculation module is used for calculating the relevancy of any document in the document set and a preset topic corresponding to the dictionary according to the hit condition of the words in the dictionary in the document set;
the correlation calculation module may specifically include: a word frequency calculation submodule, an inverse document frequency calculation submodule and a correlation degree calculation submodule, wherein,
the word frequency calculation submodule is used for calculating the word frequency of the dictionary according to the occurrence frequency of all words in the dictionary in any document;
the inverse document frequency calculation sub-module is configured to calculate an inverse document frequency of the dictionary according to the number of occurrences of all words in the dictionary in all documents in the document set, and specifically includes: counting the occurrence times of all words in the dictionary in each document in the document set, and taking the document with the occurrence times meeting a preset threshold value as a target document; calculating the ratio of the total number of all documents in the document set to the number of target documents, and determining the ratio as the inverse document frequency of the dictionary;
and the relevancy calculation sub-module is used for calculating the product of the word frequency and the inverse document frequency and taking the product as the relevancy of the any document and a preset topic corresponding to the dictionary.
8. The apparatus for calculating the topic relevance of a document according to claim 7, further comprising: the dictionary building module is used for building a dictionary corresponding to the preset topic;
the dictionary construction module includes:
the news capturing submodule is used for capturing topic data by using a crawler tool;
a category word sub-module, configured to input the topic data into a topic model tool to extract a category of a topic word from the topic data;
the dictionary generation submodule is used for inputting the classification of a preset topic and the topic words into a word vector generation model so as to obtain a plurality of words with similar semantics with the preset topic; wherein the plurality of words constitute a dictionary of the preset topic.
9. A computing device of document and topic relevance, comprising: a processor and a memory, the processor executing a software program stored in the memory, calling data stored in the memory, and performing at least the following steps:
obtaining a document set;
acquiring a dictionary corresponding to a preset topic; wherein the dictionary is constructed by learning topic data using a semi-supervised learning algorithm, and comprises a plurality of words semantically related to the preset topic;
aiming at any document in the document set, calculating the word frequency of the dictionary according to the occurrence frequency of all words in the dictionary in the any document;
calculating the inverse document frequency of the dictionary according to the number of occurrences of all words in the dictionary in all documents in the document set, specifically comprising: counting the occurrence times of all words in the dictionary in each document in the document set, and taking the document with the occurrence times meeting a preset threshold value as a target document; calculating the ratio of the total number of all documents in the document set to the number of target documents, and determining the ratio as the inverse document frequency of the dictionary;
and calculating the product of the word frequency and the inverse document frequency, and taking the product as the relevancy of the any document and a preset topic corresponding to the dictionary.
10. A storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements a method of calculating a topic relevance for a document as recited in any one of claims 1-6.
CN201910131086.4A 2019-02-21 2019-02-21 Method, device, equipment and medium for calculating relevance between document and topic Active CN109871433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910131086.4A CN109871433B (en) 2019-02-21 2019-02-21 Method, device, equipment and medium for calculating relevance between document and topic

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910131086.4A CN109871433B (en) 2019-02-21 2019-02-21 Method, device, equipment and medium for calculating relevance between document and topic

Publications (2)

Publication Number Publication Date
CN109871433A CN109871433A (en) 2019-06-11
CN109871433B true CN109871433B (en) 2021-07-23

Family

ID=66919047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910131086.4A Active CN109871433B (en) 2019-02-21 2019-02-21 Method, device, equipment and medium for calculating relevance between document and topic

Country Status (1)

Country Link
CN (1) CN109871433B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143506B (en) * 2019-12-27 2023-11-03 汉海信息技术(上海)有限公司 Topic content ordering method, topic content ordering device, server and storage medium
CN111553144A (en) * 2020-04-28 2020-08-18 深圳壹账通智能科技有限公司 Topic mining method and device based on artificial intelligence and electronic equipment
CN112926297B (en) * 2021-02-26 2023-06-30 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing information
CN112597283B (en) * 2021-03-04 2021-05-25 北京数业专攻科技有限公司 Notification text information entity attribute extraction method, computer equipment and storage medium
CN113656695A (en) * 2021-08-18 2021-11-16 北京奇艺世纪科技有限公司 Hot data generation method and device, data processing method and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049568A (en) * 2012-12-31 2013-04-17 武汉传神信息技术有限公司 Method for classifying documents in mass document library
CN105912528A (en) * 2016-04-18 2016-08-31 深圳大学 Question classification method and system
CN108228555A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 Article treating method and apparatus based on column theme
CN108829889A (en) * 2018-06-29 2018-11-16 国信优易数据有限公司 A kind of newsletter archive classification method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120166414A1 (en) * 2008-08-11 2012-06-28 Ultra Unilimited Corporation (dba Publish) Systems and methods for relevance scoring
CN102298622B (en) * 2011-08-11 2013-01-02 中国科学院自动化研究所 Search method for focused web crawler based on anchor text and system thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049568A (en) * 2012-12-31 2013-04-17 武汉传神信息技术有限公司 Method for classifying documents in mass document library
CN105912528A (en) * 2016-04-18 2016-08-31 深圳大学 Question classification method and system
CN108228555A (en) * 2016-12-14 2018-06-29 北京国双科技有限公司 Article treating method and apparatus based on column theme
CN108829889A (en) * 2018-06-29 2018-11-16 国信优易数据有限公司 A kind of newsletter archive classification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Construction of a Chinese commendatory and derogatory term dictionary with co-occurrence relations; Yang Chunming; Computer Engineering and Applications; 2016-05-31; Vol. 52, No. 9; Section 3.3 of the paper *

Also Published As

Publication number Publication date
CN109871433A (en) 2019-06-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant