WO2022095374A1 - Keyword extraction method and apparatus, terminal device, and storage medium - Google Patents

Keyword extraction method and apparatus, terminal device, and storage medium

Info

Publication number
WO2022095374A1
Authority
WO
WIPO (PCT)
Prior art keywords
keyword
word
target
candidate
keywords
Application number
PCT/CN2021/091083
Other languages
English (en)
French (fr)
Inventor
饶刚
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2022095374A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application belongs to the technical field of artificial intelligence, and in particular relates to a keyword extraction method and apparatus, a terminal device, and a storage medium.
  • keyword extraction is widely used in many fields of text processing, such as text clustering, text summarization, and information retrieval.
  • in the current era of big data, keyword extraction generally judges each word in a text by its isolated information alone.
  • it is currently popular to obtain the keywords of a text using the graph-based ranking algorithm TextRank or a topic model (latent Dirichlet allocation, LDA).
  • the inventor has realized that certain special words, such as person names and place names, are often ignored, even though such information may be important information in the text. Therefore, current methods for extracting text keywords have difficulty accurately extracting high-quality keywords relevant to the text.
  • one of the purposes of the embodiments of the present application is to provide a keyword extraction method and apparatus, a terminal device, and a storage medium, aiming to solve the technical problem that current methods for extracting text keywords have difficulty accurately extracting high-quality keywords relevant to the text.
  • an embodiment of the present application provides a keyword extraction method, including:
  • acquiring multiple word segments in a target article, and determining multiple candidate keywords from the multiple word segments according to a preset keyword library;
  • calculating, according to the multiple candidate keywords and the target article, multiple score values corresponding to each candidate keyword among the multiple candidate keywords;
  • inputting the multiple score values corresponding to each candidate keyword into a pre-trained supervision model to obtain the word probability of each candidate keyword, and determining target keywords from the multiple candidate keywords according to the word probabilities.
  • an embodiment of the present application provides a keyword extraction apparatus, including:
  • a first acquisition module, configured to acquire multiple word segments in a target article;
  • a first determination module, configured to determine multiple candidate keywords from the multiple word segments according to a preset keyword library;
  • a first calculation module, configured to calculate, according to the multiple candidate keywords and the target article, multiple score values corresponding to each candidate keyword among the multiple candidate keywords;
  • a second determination module, configured to input the multiple score values corresponding to each candidate keyword into a pre-trained supervision model to obtain the word probability of each candidate keyword, and determine target keywords from the multiple candidate keywords according to the word probabilities.
  • a third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements:
  • acquiring multiple word segments in a target article, and determining multiple candidate keywords from the multiple word segments according to a preset keyword library;
  • calculating, according to the multiple candidate keywords and the target article, multiple score values corresponding to each candidate keyword among the multiple candidate keywords;
  • inputting the multiple score values corresponding to each candidate keyword into a pre-trained supervision model to obtain the word probability of each candidate keyword, and determining target keywords from the multiple candidate keywords according to the word probabilities.
  • a fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements:
  • acquiring multiple word segments in a target article, and determining multiple candidate keywords from the multiple word segments according to a preset keyword library;
  • calculating, according to the multiple candidate keywords and the target article, multiple score values corresponding to each candidate keyword among the multiple candidate keywords;
  • inputting the multiple score values corresponding to each candidate keyword into a pre-trained supervision model to obtain the word probability of each candidate keyword, and determining target keywords from the multiple candidate keywords according to the word probabilities.
  • a fifth aspect of the embodiments of the present application further provides a computer program product which, when run on a terminal device, causes the terminal device to implement:
  • acquiring multiple word segments in a target article, and determining multiple candidate keywords from the multiple word segments according to a preset keyword library;
  • calculating, according to the multiple candidate keywords and the target article, multiple score values corresponding to each candidate keyword among the multiple candidate keywords;
  • inputting the multiple score values corresponding to each candidate keyword into a pre-trained supervision model to obtain the word probability of each candidate keyword, and determining target keywords from the multiple candidate keywords according to the word probabilities.
  • the embodiments of the present application include the following advantages:
  • multiple word segments are obtained by performing word segmentation on the target article and compared against a preset keyword library; candidate keywords are determined from the multiple word segments, and the multiple score values of each candidate keyword are calculated separately;
  • target keywords are then further determined from the multiple candidate keywords according to the multiple score values, so that, on the basis of maintaining a high-quality keyword library as the source of candidate keywords in the target article, the supervision model can simultaneously be used to further compute the word probability of each candidate keyword, and target keywords can be determined from the multiple candidate keywords according to the word probabilities, ensuring that the extracted target keywords are all high-quality words highly relevant to the target article.
  • FIG. 1 is an implementation flowchart of a keyword extraction method provided by an embodiment of the present application.
  • FIG. 2 is an implementation flowchart of a keyword extraction method provided by another embodiment of the present application.
  • FIG. 3 is a schematic diagram of an application scenario of a keyword extraction method provided by an embodiment of the present application.
  • FIG. 4 is an implementation flowchart of a keyword extraction method provided by yet another embodiment of the present application.
  • FIG. 5 is a schematic diagram of an implementation of S102 of a keyword extraction method provided by an embodiment of the present application.
  • FIG. 6 is an implementation flowchart of the supervision model training steps in a keyword extraction method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of feature extraction for sample keywords in a keyword extraction method provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of an implementation of S103 of a keyword extraction method provided by an embodiment of the present application.
  • FIG. 9 is an implementation flowchart of a keyword extraction method provided by still another embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a keyword extraction apparatus provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • the keyword extraction method provided by the embodiments of the present application can be applied to terminal devices such as mobile phones, tablet computers, notebook computers, ultra-mobile personal computers (UMPCs), and netbooks; the embodiments do not impose any restriction on the specific type of terminal device.
  • FIG. 1 shows an implementation flowchart of a keyword extraction method provided by an embodiment of the present application. The method includes the following steps:
  • the above-mentioned target article may be a Weibo article, a news article, etc., which is not limited.
  • the above method for obtaining the target article may be that the terminal device crawls the target article through the network, or may be that the terminal device obtains the existing target article from a specified storage path.
  • the text language of the target article may be Chinese, English, or another language, which is not limited. To better explain the keyword extraction method, this embodiment takes Chinese-language text as the example.
  • the above-mentioned multiple word segments can be obtained by performing word segmentation processing on the target article.
  • a news-type target article often contains text such as the news source and reprint permissions;
  • such text is irrelevant information and will interfere with the accuracy of keyword extraction from the target article. Therefore, data cleaning can be performed on the target article in advance to remove such text.
  • word segmentation of the target article may rely on a pre-established word segmentation lexicon, where the lexicon contains the words usable in a language (for example, Chinese).
  • a sentence or a string of characters can be taken from the target article and compared with the words in the lexicon. If a match is found, that string can be treated as a word carrying one meaning, that is, a word segment. If there is no matching word in the lexicon, the length of the string can be reduced (for example, by dropping the last character) and matched against the lexicon again, until all strings have been matched, thereby obtaining the multiple word segments. A matching sketch follows.
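  • by way of illustration only (this sketch is not part of the disclosure), a forward maximum matching pass against a word lexicon could look like the following Python; the lexicon contents, max_len, and the function name are assumed:

```python
def forward_max_match(text, lexicon, max_len=5):
    """Greedy segmentation: try the longest candidate substring first
    and shrink from the right until a lexicon word matches."""
    segments = []
    i = 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + j]
            if j == 1 or chunk in lexicon:
                segments.append(chunk)  # unmatched single chars pass through
                i += j
                break
    return segments

# toy lexicon; a real lexicon would cover the whole language
lexicon = {"关键词", "抽取", "平安", "平安符"}
print(forward_max_match("关键词抽取", lexicon))  # ['关键词', '抽取']
```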
  • S102 Determine a plurality of candidate keywords from the plurality of word segmentations according to a preset keyword library.
  • the preset keyword library may be built by the user presetting multiple words of interest and storing them as keywords of the keyword library under a storage path specified by the terminal device.
  • for example, if a user finds an article's content interesting while reading and wants to regularly read articles related to that content field, the user can select words from the article content as words of interest and store them in the keyword library.
  • alternatively, after the terminal device determines from the user's confirmation instruction that the user is interested in the article content, the terminal device can determine the field to which the currently read article belongs, crawl the keywords of many articles in that field from the network,
  • and store those keywords in the keyword library as words of interest.
  • the above preset keyword library may contain specific words such as person names, place names, and times, because such words are often ignored by currently popular keyword extraction algorithms. Handling specific words separately therefore ensures the quality of the candidate keywords determined from the multiple word segments.
  • determining multiple candidate keywords from the multiple word segments may work as follows: if the keyword library contains a word identical to a word segment, that word segment can be determined as a candidate keyword, yielding multiple candidate keywords. It can be understood that if a keyword contained in the keyword library and a word segment are synonyms, the word segment can also be used as a candidate keyword, which is not limited. A minimal matching sketch follows.
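  • a minimal sketch of this library lookup, assuming an exact-match keyword library and an optional synonym map (both hypothetical names):

```python
def select_candidates(segments, keyword_lib, synonyms=None):
    """Keep word segments found in the keyword library; a segment whose
    synonym is a library keyword may optionally be kept as well."""
    synonyms = synonyms or {}   # e.g. {"segment": "library keyword"}
    return [seg for seg in segments
            if seg in keyword_lib or synonyms.get(seg) in keyword_lib]
```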
  • the above-mentioned multiple score values include, but are not limited to, a position score value for the position at which the candidate keyword appears in the target article, and a frequency score value for its occurrences in the target article.
  • different score values are assigned based on the different positions at which candidate keywords appear in the target article. For example, for a candidate keyword appearing in the title, it can be considered that the title of a news-type target article usually serves as the core of the article and carries its main content. Therefore, the position score value of a candidate keyword appearing in the title can be set higher than that of one appearing in the body text.
  • if the same candidate keyword appears both in the title and in the body, the highest score can be selected as its position score value (that is, the title position score value).
  • the frequency score value of a candidate keyword may be calculated as the ratio of the number of occurrences of that candidate keyword in the target article to the total number of word segments in the target article.
  • S104 Input the multiple score values corresponding to each candidate keyword into a pre-trained supervision model to obtain the word probability of each candidate keyword, and determine target keywords from the multiple candidate keywords according to the word probabilities.
  • there are multiple candidate keywords, and each candidate keyword has multiple score values; the target keywords, however, may be only some of the multiple candidate keywords.
  • for example, the sum of the multiple score values of each candidate keyword can be computed, or their average, as a measure (word probability) of the degree of association between each candidate keyword and the target article.
  • target keywords may then be determined from the multiple candidate keywords. Exemplarily, a preset number of candidate keywords with the highest score values (word probabilities) may be determined as target keywords.
  • the above-mentioned supervision model may be obtained by model training on existing article content and corresponding keywords.
  • the goal of supervised learning is to learn a function (model) that, given sample data (existing article content) and the function's output values (keywords), fits the relationship between input and output as closely as possible. Through the score values and the supervision model, the word probability that each candidate keyword is a target keyword of the target article can thus be obtained.
  • on the basis of determining multiple candidate keywords through the preset keyword library, the target keywords may be further determined from the multiple candidate keywords through the supervision model, ensuring that the extracted keywords are all high-quality words.
  • in this embodiment, multiple word segments are obtained by performing word segmentation on the target article and compared against a preset keyword library; candidate keywords are determined from the multiple word segments, the multiple score values of each candidate keyword are calculated separately, and target keywords are further determined from the multiple candidate keywords according to the multiple score values.
  • thus, on the basis of maintaining a high-quality keyword library as the source of candidate keywords in the target article, the word probability of each candidate keyword can simultaneously be further computed by the supervision model, and target keywords determined from the multiple candidate keywords according to the word probabilities,
  • ensuring that the extracted target keywords are all high-quality words highly relevant to the target article.
  • S102A Determine the article domain of the target article, and obtain domain texts belonging to the article domain.
  • the target article can be obtained by the terminal device crawling the network. It can be understood that when a user browses the target article on the terminal device, the target article usually already carries a domain tag (article domain) from when it was published. Therefore, the terminal device can be considered to determine the article domain of the target article at the same time as it acquires the target article.
  • exemplarily, when the terminal device is a smartphone and the target article is browsed in a browser, the target article already has a definite article domain; see the related channels in FIG. 3.
  • each term under the related channels can be regarded as an article domain of the target article, and the multiple texts under that article domain can be regarded as domain texts.
  • S102B Calculate, according to multiple domain word segments in the domain texts, the domain association degree between domain word segments.
  • the above-mentioned domain word segments can also be obtained by the method in S101 above.
  • calculating the domain association degree between domain word segments may consist of calculating the mutual information between domain word segments, specifically:
  • PMI(x, y) = log [ p(x, y) / (p(x) p(y)) ]
  • where p(x, y) is the probability that domain segments x and y appear together in the multiple domain texts, p(x) is the probability that domain segment x appears alone in the multiple domain texts, p(y) is the probability that domain segment y appears alone in the multiple domain texts, and PMI(x, y) is the mutual information between domain segments x and y.
  • after computing the mutual information, the left and right information of each domain keyword can be computed from the mutual information to obtain left-right mutual information, and the left-right mutual information can be used as the above-mentioned domain association degree.
  • for example, for the segments "平" (Ping), "安" (An), and "符" (Fu) appearing in the domain texts, the mutual information (domain association degree) of "Ping" and "An", of "Ping" and "Fu",
  • and of "An" and "Fu" can each be computed with the above mutual information formula. A computation sketch follows.
  • S102C Determine target association degrees greater than a preset association degree from the multiple domain association degrees, and determine the target domain word segments corresponding to the target association degrees.
  • the above-mentioned preset association degree may be a value set by the user according to the actual situation, or a fixed value preset on the terminal device; this is not limited.
  • the target association degree can be determined from the multiple domain association degrees according to their magnitudes, and the target domain word segment corresponding to the target association degree determined. For example, when a domain association degree is greater than the preset association degree, it is determined as a target association degree, and the domain word segment corresponding to it is determined as a target domain word segment.
  • as described above, the preset keyword library may be built by presetting multiple words of interest for the user and storing them under a storage path specified by the terminal device; accordingly, the terminal device may also store the target domain word segments in the keyword library.
  • for example, the combined segments "平安" (Ping An, safety) and "平安符" (Ping An Fu, safety charm) mentioned above can be used as target domain word segments and stored in the keyword library.
  • S102E Determine the article domain of the target article, and acquire multiple domain keywords belonging to the article domain.
  • in this case, the terminal device can also directly obtain the tagged words of each domain text as domain keywords and generate the keyword library.
  • with reference to FIG. 3, each of the words in the same column as "related channels" in FIG. 3 ("5G channel", "Internet") can be regarded as an article domain.
  • the terminal device can obtain the corresponding domain texts from the network according to the article domain, and at the same time obtain the tagged words each domain text carried at publication,
  • that is, the domain keywords indicated by the arrows for the first article in the figure.
  • the above domain keywords can be regarded as high-frequency words or core words defined by the publishing organization for each domain text. Therefore, the multiple domain keywords of each domain text under the article domain can be stored in the keyword library.
  • S102, determining multiple candidate keywords from the multiple word segments according to a preset keyword library, further includes the following sub-steps S1021-S1023, detailed as follows:
  • if a target word segment is contained in the keyword library, use the target word segment as a candidate keyword.
  • the word segments stored in the keyword library are all high-quality words of the text's field. Therefore, after the above multiple word segments are obtained, they can each be compared with the word segments in the keyword library. If a word segment matches one stored in the keyword library, it can initially be used as a candidate keyword. Among the multiple word segments, the word segment currently being compared against the keyword library can be regarded as the target word segment.
  • if the target word segment is not contained in the keyword library, judge whether the target word segment is an entity word; if it is, input the target word segment into the supervision model to obtain the keyword probability that the target word segment is a keyword; if the keyword probability is greater than a probability threshold, use the target word segment corresponding to that keyword probability as a candidate keyword.
  • the above entity words are words that describe independently existing things. After determining that a word segment is not stored in the keyword library, it can be judged whether that word segment is an entity word; if not, the word segment can be considered meaningless and deleted. Whether a word segment is an entity word can be judged by named entity recognition (NER) technology.
  • named entity recognition, also known as "proper name recognition", refers to recognizing entities with specific meanings in text, mainly including person names, place names, institution names, and proper nouns.
  • when a word segment not stored in the keyword library is determined to be an entity word, the word segment can be input into the supervision model to obtain its keyword probability.
  • the supervision model is a pre-trained classification model used to judge anew the keyword probability that the word segment is a candidate keyword. For details, refer to the description of the supervision model in S104 above, which is not repeated here.
  • the above-mentioned supervision model can extract the word features of each word segment in the target article, and then output the keyword probability for the target article according to the word features.
  • specifically, the supervision model may comprehensively extract the word features of a word segment from information such as the positions at which it occurs in the target article, its number of occurrences in the target article, and its word length, classify according to those features, and output the keyword probability that the word segment is a keyword of the target article.
  • the above probability threshold may be a value preset by the user, or a threshold set by the supervision model after training and analysis on existing big data; this is not limited. When the keyword probability is greater than the probability threshold, the word segment corresponding to that keyword probability is used as a candidate keyword. A sketch of this decision flow follows.
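  • a sketch of this decision flow under stated assumptions: spaCy's Chinese pipeline (zh_core_web_sm) stands in for the NER step, and supervision_model is a hypothetical trained classifier exposing a scikit-learn-style predict_proba; neither is named in the disclosure:

```python
import spacy

nlp = spacy.load("zh_core_web_sm")  # assumed NER pipeline

def is_entity_word(segment):
    """True if NER tags the segment as a named entity
    (person, place, institution, proper noun, ...)."""
    return len(nlp(segment).ents) > 0

def keep_as_candidate(segment, keyword_lib, supervision_model,
                      features, threshold=0.5):
    if segment in keyword_lib:
        return True                    # S1021: library hit
    if not is_entity_word(segment):
        return False                   # non-entity, out of library: drop
    prob = supervision_model.predict_proba([features])[0][1]
    return prob > threshold            # keep confident entity words only
```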
  • the above-mentioned supervision model can be obtained by training through the following steps S201-S206, detailed as follows:
  • the above training samples can be taken to be the domain texts described above, and the corresponding training keywords the target domain word segments corresponding to those domain texts.
  • training samples may be obtained by crawling multiple domain texts under the same article domain from the network. Based on the word segmentation method described in S101 above, the training samples can be segmented to obtain multiple sample word segments, which is not described again.
  • S202 Perform word segmentation on the text content in the training sample to obtain a plurality of sample word segmentations, and respectively calculate the sample score value corresponding to each sample word segmentation.
  • the above-mentioned sample score value may be determined from the position at which the sample word segment appears in the training sample, or calculated as the word span of the sample word segment in the training sample.
  • a sample score threshold can be set: when a sample score value is greater than the threshold, the sample word segment corresponding to that sample score value is used as a sample keyword; alternatively, the multiple sample score values can be sorted and the sample word segments corresponding to a preset number of top-ranked sample score values used as sample keywords; this is not limited.
  • the above-mentioned tag categories assign specific values to sample keywords for use when computing the model's training loss. Specifically, if a sample keyword matches any training keyword, its tag category may be set to 1; otherwise, its tag category may be set to 0.
  • extracting the keyword features of the sample keywords can be regarded as performing feature engineering on the sample keywords, that is, extracting word features of various aspects of the sample keywords.
  • FIG. 7 shows the keyword features to be extracted for the sample keywords in the training sample.
  • feature fusion of the keyword features can be performed through the neural network structure in the initial supervision model to obtain fused features, so that the fused features comprehensively represent the multiple feature dimensions of a sample keyword.
  • the model can then output the probability that the sample keyword is a keyword according to the fused features, and compute the training loss in combination with the sample keyword's tag category.
  • the model parameters are iteratively updated according to the training loss, and when the training loss converges, the current model is used as the trained supervision model, improving the accuracy with which the supervision model determines target keywords in the target article. A minimal training sketch follows.
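  • a minimal training sketch under stated assumptions: the keyword features of FIG. 7 are already extracted into numeric vectors, labels follow the 1/0 tag categories above, and a gradient-boosted classifier stands in for the neural feature-fusion structure the text describes, purely to keep the sketch short:

```python
from sklearn.ensemble import GradientBoostingClassifier

def train_supervision_model(feature_vectors, labels):
    """Fit a binary classifier mapping sample-keyword features to the
    probability of being a true keyword (tag category 1)."""
    model = GradientBoostingClassifier()
    model.fit(feature_vectors, labels)
    return model

# word probability of a new candidate = positive-class score:
# model.predict_proba([candidate_features])[0][1]
```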
  • the multiple score values include a first score value, a second score value, a third score value, and a fourth score value; S103, calculating, according to the multiple candidate keywords and the target article, multiple score values corresponding to each candidate keyword among the multiple candidate keywords, further includes the following sub-steps S1031-S1034, detailed as follows:
  • the number of the above multiple word segments is the total number of word segments contained in the target article.
  • the word frequency of a candidate keyword may be calculated as the ratio of the number of times that candidate keyword appears in the target article
  • to the total number of word segments in the target article.
  • the above-mentioned first score value may be the word frequency in the target article, or the term frequency-inverse document frequency (TF-IDF) computed from the word frequency.
  • for the inverse document frequency, the terminal device may count a first quantity, the number of the multiple domain texts, and a second quantity, the number of those domain texts that contain the candidate keyword.
  • the ratio of the first quantity to the second quantity is computed and its base-10 logarithm taken; the resulting value is the inverse document frequency of the candidate keyword.
  • the TF-IDF of each candidate keyword can thus be obtained; specifically, the first score value is obtained by multiplying the word frequency by the inverse document frequency.
  • the value of the TF-IDF may fall anywhere between 0 and infinity.
  • therefore, each TF-IDF value can be normalized into the range between 0 and 1. A sketch of this first score follows.
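  • a sketch of this first score, assuming the target article and every domain text are given as lists of word segments:

```python
import math

def first_score(keyword, article_words, domain_texts):
    """TF-IDF-style first score: term frequency in the target article
    times the base-10 inverse document frequency over domain texts."""
    tf = article_words.count(keyword) / len(article_words)
    n_docs = len(domain_texts)                             # first quantity
    n_with = sum(1 for t in domain_texts if keyword in t)  # second quantity
    idf = math.log10(n_docs / max(n_with, 1))              # guard against 0
    return tf * idf   # may additionally be min-max normalised into [0, 1]
```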
  • whether a candidate keyword appears in the title or the body of the target article can reflect its importance in the target article.
  • for example, the second score value of a candidate keyword appearing in the title may be set to 0.6,
  • and the second score value of one appearing in the body set to 0.4; these may be set according to the actual situation. It is understandable that if the same candidate keyword appears multiple times in the target article, at multiple positions such as the title and the body, the sum of the scores corresponding to those positions can be taken as its second score value; alternatively, the average over the occurrences of the same candidate keyword is used as the second score value, which is not limited. It should be noted that, to distinguish the title from the body in the target article, spaces or special symbols can be added between them. A sketch under these example weights follows.
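  • a sketch using the example weights above (0.6 title, 0.4 body); summing per occurrence is one of the two stated options, averaging being the other:

```python
def second_score(keyword, title_words, body_words,
                 title_weight=0.6, body_weight=0.4):
    """Position score: weight every occurrence by where it appears."""
    return (title_words.count(keyword) * title_weight
            + body_words.count(keyword) * body_weight)
```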
  • the above-mentioned third score value can be regarded as the word span of each candidate keyword in the target article.
  • the candidate keywords are determined from the multiple word segments in the target article; therefore, the word segments in the target article can be numbered in the order of the article's text content, and the position of each candidate keyword in the target article
  • determined from its corresponding sequence numbers in the article.
  • if a candidate keyword appears multiple times in the target article (that is, it has multiple sequence numbers),
  • its minimum sequence number can be used as its starting position in the target article,
  • and its maximum sequence number as its ending position in the target article.
  • subtracting the two sequence numbers gives a difference that serves as the third score value.
  • the difference may also be divided by the total number of word segments to normalize it, and the normalized value used as the third score value; this is not limited. A sketch of this word span follows.
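  • a sketch of this word span, assuming the article is given as an ordered list of word segments:

```python
def third_score(keyword, article_words):
    """Word span: last occurrence minus first occurrence, normalised
    by the total number of word segments (the stated variant)."""
    positions = [i for i, w in enumerate(article_words) if w == keyword]
    if not positions:
        return 0.0
    return (positions[-1] - positions[0]) / len(article_words)
```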
  • the above-mentioned text ranking algorithm can be a graph-based ranking (TextRank) model, which divides the target article into constituent units (word segments), builds a graph model, and uses a voting mechanism to rank the word segments within the target article,
  • sorting the important constituents by score, that is, ranking the scores of the multiple word segments in the target article. Afterwards, according to the score of each word segment, the score corresponding to each candidate keyword can be taken from among the multiple word segments as the fourth score value. A sketch of this fourth score follows.
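  • a sketch of this graph-based fourth score, assuming the networkx library; word segments co-occurring within a sliding window are linked, and PageRank supplies the vote-based scores:

```python
import networkx as nx

def fourth_scores(article_words, window=3):
    """TextRank-style scores: build a co-occurrence graph over the
    word segments and rank nodes with PageRank."""
    g = nx.Graph()
    for i, w in enumerate(article_words):
        for j in range(i + 1, min(i + window, len(article_words))):
            g.add_edge(w, article_words[j])
    return nx.pagerank(g)   # {segment: score}; look up each candidate
```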
  • the target keywords of a target article generally appear in the title, and target keywords appear relatively often in the target article; therefore, the four score values defined above for each candidate keyword provide a good measure for judging how key a word is to the target article.
  • the terminal device can comprehensively evaluate the importance of each candidate keyword in the target article based on the multiple score values, improving the accuracy of determining high-quality target keywords from the candidate keywords.
  • there are multiple target keywords; after S104, in which the multiple score values corresponding to each candidate keyword are input into the pre-trained supervision model, the word probability of each candidate keyword is obtained, and target keywords are determined from the multiple candidate keywords according to the word probabilities, the method further includes the following steps S104A-S104B, detailed as follows:
  • the above-mentioned multiple target articles can be understood as the articles clicked by the user within a preset time period.
  • for each of the above target articles, when the user clicks on it, the terminal device can use the above method to extract one or more target keywords from it; in this way, the terminal device acquires one or more target keywords from each target article within the preset time period.
  • for example, the target keywords corresponding to a target article may be: "mother and baby", "cute baby at home", "nutritional development".
  • the terminal device may accumulate the numbers of occurrences of the above target keywords, that is, count the total number of occurrences of each target keyword over the multiple target articles. A ratio can then be computed between the totals of the target keywords.
  • the above target keywords do not appear in every target article, and the above target keywords are only an example.
  • even if a target keyword appears only once across the multiple target articles, it should still be recorded and included in the ratio calculation.
  • S104B Perform article recall according to the ratio and each target keyword to obtain an article set, where the numbers of articles in the article set that respectively contain each target keyword are in proportion to the ratio.
  • the above-mentioned article set is used to store the articles recalled by the terminal device according to the target keywords.
  • the numbers of articles to recall can be derived from the ratio.
  • specifically, the total number of articles the terminal device should recall may be preset, and the number of articles containing each target keyword that should be recalled computed from the total and the ratio. For example, for the above-mentioned "mother and baby", "cute baby at home", and "nutritional development", suppose the ratio is 5:2:3 and the total number of articles to be recalled is 10.
  • the terminal device should then recall 5 target articles containing the target keyword "mother and baby", 2 articles containing the target keyword "cute baby at home", and 3 target articles containing the target keyword "nutritional development". In this way, the terminal device can automatically recall articles of interest to the user from the network according to the target keywords, improving the recall effect of the terminal device. A sketch of the budget split follows.
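  • a sketch of this budget split, assuming the occurrence totals per target keyword are already counted:

```python
def recall_counts(keyword_totals, total_to_recall=10):
    """Split the recall budget across target keywords in proportion to
    their occurrence counts (e.g. 5:2:3 over 10 articles)."""
    total = sum(keyword_totals.values())
    # rounding may need a final adjustment so the counts sum to the budget
    return {kw: round(total_to_recall * cnt / total)
            for kw, cnt in keyword_totals.items()}
```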
  • FIG. 10 is a structural block diagram of a keyword extraction apparatus provided by an embodiment of the present application.
  • each unit included in the apparatus is used to execute the steps in the embodiments corresponding to FIG. 1, FIG. 2, FIG. 4 to FIG. 6, FIG. 8, and FIG. 9.
  • the keyword extraction apparatus 1000 includes: a first acquisition module 1010, a first determination module 1020, a first calculation module 1030 and a second determination module 1040, wherein:
  • the first obtaining module 1010 is configured to obtain multiple word segments in the target article.
  • the first determining module 1020 is configured to determine a plurality of candidate keywords from the plurality of word segmentations according to a preset keyword library.
  • the first calculation module 1030 is configured to calculate a plurality of score values corresponding to each candidate keyword in the plurality of candidate keywords according to the plurality of candidate keywords and the target article, respectively.
  • the second determination module 1040 is configured to input the multiple score values corresponding to each candidate keyword into the pre-trained supervision model to obtain the word probability of each candidate keyword, and determine target keywords from the multiple candidate keywords according to the word probabilities.
  • the keyword extraction apparatus 1000 further includes:
  • the third determining module is configured to determine the article domain of the target article, and obtain domain texts belonging to the article domain.
  • the second calculation module is configured to calculate the domain correlation between each domain word segment according to the plurality of domain word segments in the domain text.
  • the fourth determining module is used for determining a target correlation degree greater than a preset correlation degree from a plurality of domain correlation degrees, and determining the target domain word segmentation corresponding to the target correlation degree.
  • the first generation module is used for storing the target domain word segmentation in the keyword library.
  • the keyword extraction apparatus 1000 further includes:
  • the fifth determination module is used for determining the article field of the target article, and acquiring a plurality of field keywords belonging to the article field.
  • the second generating module is configured to store the plurality of domain keywords in the keyword library.
  • the first determining module 1020 is further configured to:
  • the target word segment is any one of the multiple word segments; if the keyword library contains the target word segment, the target word segment is used as a candidate keyword; if the target word segment is not contained in the keyword library, it is judged whether the target word segment is an entity word; if it is, the target word segment is input into the supervision model to obtain the keyword probability that it is a keyword; if the keyword probability is greater than the probability threshold, the target word segment corresponding to that keyword probability is used as a candidate keyword.
  • the keyword extraction apparatus 1000 further includes the following modules for supervised model training:
  • the second acquisition module is used for acquiring training samples and acquiring marked training keywords from the training samples.
  • the word segmentation module is configured to perform word segmentation on the text content in the training sample to obtain a plurality of sample word segmentations, and respectively calculate the sample score value corresponding to each sample word segmentation.
  • the sixth determination module is configured to determine sample keywords from the multiple sample word segmentations according to the multiple sample score values.
  • a seventh determination module configured to determine the tag category of the sample keyword based on the sample keyword and the training keyword.
  • the extraction module is used for extracting the keyword features of the sample keywords.
  • a training module is used to perform model training based on the keyword features and marked categories of the sample keywords to obtain the supervised model.
  • the plurality of score values include a first score value, a second score value, a third score value and a fourth score value; the first calculation module 1030 is further configured to:
  • the target keywords include multiple; the keyword extraction apparatus 1000 further includes:
  • the statistics module is configured to count the total number of occurrences of each target keyword in the multiple target articles, and calculate the ratio between the totals of the target keywords.
  • the recall module is used for recalling articles according to the ratio and each target keyword to obtain an article set, wherein the numbers of articles in the article set that respectively contain each target keyword are in proportion to the ratio.
  • each unit/module is used to execute the steps in the embodiments corresponding to FIGS. 1, 2, 4 to 6, 8 and 9. Those steps have been explained in detail in the above embodiments, and the relevant descriptions in the embodiments corresponding to FIGS. 1, 2, 4 to 6, 8 and 9 are not repeated here.
  • FIG. 11 is a structural block diagram of a terminal device provided by another embodiment of the present application.
  • the terminal device 1100 of this embodiment includes: a processor 1110 , a memory 1120 , and a computer program 1130 stored in the memory 1120 and executable on the processor 1110 , such as a program for a keyword extraction method.
  • when the processor 1110 executes the computer program 1130, the steps in each of the foregoing keyword extraction method embodiments are implemented, for example, S101 to S104 shown in FIG. 1.
  • alternatively, when the processor 1110 executes the computer program 1130, the functions of the modules in the embodiment corresponding to FIG. 10 are implemented, for example, the functions of modules 1010 to 1040 shown in FIG. 10. Specifically as follows:
  • a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements:
  • acquiring multiple word segments in a target article, and determining multiple candidate keywords from the multiple word segments according to a preset keyword library;
  • calculating, according to the multiple candidate keywords and the target article, multiple score values corresponding to each candidate keyword among the multiple candidate keywords;
  • inputting the multiple score values corresponding to each candidate keyword into a pre-trained supervision model to obtain the word probability of each candidate keyword, and determining target keywords from the multiple candidate keywords according to the word probabilities.
  • when the processor executes the computer program, the following is further implemented:
  • the target domain word segmentation is stored in the keyword database.
  • when the processor executes the computer program, the following is further implemented:
  • the plurality of domain keywords are stored in the keyword library.
  • when the processor executes the computer program, the following is further implemented:
  • if the target word segment is contained in the keyword library, the target word segment is used as a candidate keyword;
  • if the target word segment is not contained in the keyword library, it is judged whether the target word segment is an entity word; if it is, the target word segment is input into the supervision model to obtain the keyword probability that it is a keyword; if the keyword probability is greater than the probability threshold, the target word segment corresponding to that keyword probability is used as a candidate keyword.
  • the training of the supervision model is further implemented through the following steps, specifically:
  • model training is performed based on the keyword features and tag categories of the sample keywords to obtain the supervision model.
  • the multiple score values include a first score value, a second score value, a third score value, and a fourth score value; when the processor executes the computer program, the following is further implemented:
  • the fourth score value corresponding to each candidate keyword is calculated.
  • there are multiple target keywords; when the processor executes the computer program, the following is further implemented:
  • article recall is performed according to the ratio and each target keyword to obtain an article set, wherein the numbers of articles in the article set that respectively contain each target keyword are in proportion to the ratio.
  • a computer-readable storage medium stores a computer program which, when executed by a processor, implements:
  • acquiring multiple word segments in a target article, and determining multiple candidate keywords from the multiple word segments according to a preset keyword library;
  • calculating, according to the multiple candidate keywords and the target article, multiple score values corresponding to each candidate keyword among the multiple candidate keywords;
  • inputting the multiple score values corresponding to each candidate keyword into a pre-trained supervision model to obtain the word probability of each candidate keyword, and determining target keywords from the multiple candidate keywords according to the word probabilities.
  • when the computer program is executed by the processor, the following is further implemented:
  • the target domain word segmentation is stored in the keyword database.
  • when the computer program is executed by the processor, the following is further implemented:
  • the plurality of domain keywords are stored in the keyword library.
  • when the computer program is executed by the processor, the following is further implemented:
  • if the target word segment is contained in the keyword library, the target word segment is used as a candidate keyword;
  • if the target word segment is not contained in the keyword library, it is judged whether the target word segment is an entity word; if it is, the target word segment is input into the supervision model to obtain the keyword probability that it is a keyword; if the keyword probability is greater than the probability threshold, the target word segment corresponding to that keyword probability is used as a candidate keyword.
  • the training of the supervision model is further implemented through the following steps, specifically:
  • model training is performed based on the keyword features and tag categories of the sample keywords to obtain the supervision model.
  • the multiple score values include a first score value, a second score value, a third score value, and a fourth score value; when the computer program is executed by the processor, the following is further implemented:
  • the fourth score value corresponding to each candidate keyword is calculated.
  • there are multiple target keywords; when the computer program is executed by the processor, the following is further implemented:
  • article recall is performed according to the ratio and each target keyword to obtain an article set, wherein the numbers of articles in the article set that respectively contain each target keyword are in proportion to the ratio.
  • the computer program 1130 may be divided into one or more modules, and the one or more modules are stored in the memory 1120 and executed by the processor 1110 to complete the present application.
  • One or more modules may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program 1130 in the terminal device 1100 .
  • the computer program 1130 can be divided into a first acquisition module, a first determination module, a first calculation module and a second determination module, and the specific functions of each module are as above.
  • the terminal device may include, but is not limited to, the processor 1110 and the memory 1120 .
  • those skilled in the art can understand that FIG. 11 is only an example of the terminal device 1100 and does not constitute a limitation on the terminal device 1100, which may include more or fewer components than shown, or combine some components, or use different components.
  • the terminal device may also include an input and output device, a network access device, a bus, and the like.
  • the so-called processor 1110 may be a central processing unit, and may also be other general-purpose processors, digital signal processors, application-specific integrated circuits, off-the-shelf programmable gate arrays or other programmable logic devices, discrete hardware components, and the like.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • the memory 1120 may be an internal storage unit of the terminal device 1100 , such as a hard disk or a memory of the terminal device 1100 .
  • the memory 1120 may also be an external storage device of the terminal device 1100 , such as a plug-in hard disk, a smart memory card, a flash memory card, etc., which are equipped on the terminal device 1100 . Further, the memory 1120 may also include both an internal storage unit of the terminal device 1100 and an external storage device.
  • the computer-readable storage medium may be an internal storage unit of the terminal device described in the foregoing embodiments, such as a hard disk or a memory of the terminal device.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium may also be an external storage device of the terminal device, for example, a pluggable hard disk, a smart memory card, a secure digital card, a flash memory card, etc. equipped on the terminal device.


Abstract

A keyword extraction method and apparatus, a terminal device, and a storage medium, where the method includes: acquiring multiple word segments in a target article; determining multiple candidate keywords from the multiple word segments according to a preset keyword library; calculating, according to the multiple candidate keywords and the target article, multiple score values corresponding to each candidate keyword among the multiple candidate keywords; and inputting the multiple score values corresponding to each candidate keyword into a pre-trained supervision model to obtain the word probability of each candidate keyword, and determining target keywords from the multiple candidate keywords according to the word probabilities. Extracting target keywords from a target article with this method ensures that the extracted target keywords are all high-quality words highly relevant to the target article.

Description

Keyword extraction method and apparatus, terminal device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on November 06, 2020, with application number 202011229490.4 and invention title "Keyword extraction method and apparatus, terminal device, and storage medium", the entire contents of which are incorporated into this application by reference.
Technical Field
This application belongs to the technical field of artificial intelligence, and in particular relates to a keyword extraction method and apparatus, a terminal device, and a storage medium.
Background
In the prior art, keyword extraction is widely used in many fields of text processing, such as text clustering, text summarization, and information retrieval. In the current era of big data, keyword extraction generally judges each word in a text by its isolated information alone. Currently, it is popular to obtain the keywords of a text using the graph-based ranking algorithm TextRank or a topic model (latent Dirichlet allocation, LDA). However, the inventor has realized that certain special words, such as person names and place names, are often ignored, even though such information may be important information in the text. Therefore, current methods for extracting text keywords have difficulty accurately extracting high-quality keywords relevant to the text.
Technical Problem
One of the purposes of the embodiments of this application is to provide a keyword extraction method and apparatus, a terminal device, and a storage medium, aiming to solve the technical problem that current methods for extracting text keywords have difficulty accurately extracting high-quality keywords relevant to the text.
Technical Solution
To solve the above technical problem, the embodiments of this application adopt the following technical solutions:
In a first aspect, an embodiment of this application provides a keyword extraction method, including:
acquiring multiple word segments in a target article;
determining multiple candidate keywords from the multiple word segments according to a preset keyword library;
calculating, according to the multiple candidate keywords and the target article, multiple score values corresponding to each candidate keyword among the multiple candidate keywords;
inputting the multiple score values corresponding to each candidate keyword into a pre-trained supervision model to obtain the word probability of each candidate keyword, and determining target keywords from the multiple candidate keywords according to the word probabilities.
In a second aspect, an embodiment of this application provides a keyword extraction apparatus, including:
a first acquisition module, configured to acquire multiple word segments in a target article;
a first determination module, configured to determine multiple candidate keywords from the multiple word segments according to a preset keyword library;
a first calculation module, configured to calculate, according to the multiple candidate keywords and the target article, multiple score values corresponding to each candidate keyword among the multiple candidate keywords;
a second determination module, configured to input the multiple score values corresponding to each candidate keyword into a pre-trained supervision model to obtain the word probability of each candidate keyword, and determine target keywords from the multiple candidate keywords according to the word probabilities.
In a third aspect, an embodiment of this application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements:
acquiring multiple word segments in a target article;
determining multiple candidate keywords from the multiple word segments according to a preset keyword library;
calculating, according to the multiple candidate keywords and the target article, multiple score values corresponding to each candidate keyword among the multiple candidate keywords;
inputting the multiple score values corresponding to each candidate keyword into a pre-trained supervision model to obtain the word probability of each candidate keyword, and determining target keywords from the multiple candidate keywords according to the word probabilities.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements:
acquiring multiple word segments in a target article;
determining multiple candidate keywords from the multiple word segments according to a preset keyword library;
calculating, according to the multiple candidate keywords and the target article, multiple score values corresponding to each candidate keyword among the multiple candidate keywords;
inputting the multiple score values corresponding to each candidate keyword into a pre-trained supervision model to obtain the word probability of each candidate keyword, and determining target keywords from the multiple candidate keywords according to the word probabilities.
In a fifth aspect, an embodiment of this application further provides a computer program product which, when run on a terminal device, causes the terminal device to implement:
acquiring multiple word segments in a target article;
determining multiple candidate keywords from the multiple word segments according to a preset keyword library;
calculating, according to the multiple candidate keywords and the target article, multiple score values corresponding to each candidate keyword among the multiple candidate keywords;
inputting the multiple score values corresponding to each candidate keyword into a pre-trained supervision model to obtain the word probability of each candidate keyword, and determining target keywords from the multiple candidate keywords according to the word probabilities.
Beneficial Effects
Compared with the prior art, the embodiments of this application include the following advantages:
In the embodiments of this application, multiple word segments are obtained by performing word segmentation on the target article and compared against a preset keyword library; candidate keywords are determined from the multiple word segments, the multiple score values of each candidate keyword are calculated separately, and target keywords are further determined from the multiple candidate keywords according to the multiple score values. Thus, on the basis of maintaining a high-quality keyword library as the source of candidate keywords in the target article, the word probability of each candidate keyword can simultaneously be further computed by the supervision model, and target keywords determined from the multiple candidate keywords according to the word probabilities, ensuring that the extracted target keywords are all high-quality words highly relevant to the target article.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings required in the description of the embodiments or the exemplary art are briefly introduced below. Evidently, the drawings described below are only some embodiments of this application; those of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is an implementation flowchart of a keyword extraction method provided by an embodiment of this application;
FIG. 2 is an implementation flowchart of a keyword extraction method provided by another embodiment of this application;
FIG. 3 is a schematic diagram of an application scenario of a keyword extraction method provided by an embodiment of this application;
FIG. 4 is an implementation flowchart of a keyword extraction method provided by yet another embodiment of this application;
FIG. 5 is a schematic diagram of an implementation of S102 of a keyword extraction method provided by an embodiment of this application;
FIG. 6 is an implementation flowchart of the supervision model training steps in a keyword extraction method provided by an embodiment of this application;
FIG. 7 is a schematic diagram of feature extraction for sample keywords in a keyword extraction method provided by an embodiment of this application;
FIG. 8 is a schematic diagram of an implementation of S103 of a keyword extraction method provided by an embodiment of this application;
FIG. 9 is an implementation flowchart of a keyword extraction method provided by still another embodiment of this application;
FIG. 10 is a schematic structural diagram of a keyword extraction apparatus provided by an embodiment of this application;
FIG. 11 is a schematic structural diagram of a terminal device provided by an embodiment of this application.
Embodiments of the Invention
To make the objectives, technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
The keyword extraction method provided by the embodiments of this application can be applied to terminal devices such as mobile phones, tablet computers, notebook computers, ultra-mobile personal computers (UMPC) and netbooks; the embodiments of this application place no restriction on the specific type of terminal device.
Referring to Fig. 1, Fig. 1 shows a flowchart of an implementation of a keyword extraction method provided by an embodiment of this application; the method includes the following steps:
S101: Obtain a plurality of word segments from the target article.
In application, the target article may be a Weibo post, a news article, or the like, which is not limited here. The target article may be obtained by the terminal device crawling it from the network, or by the terminal device reading an existing article from a designated storage path. The text language of the target article may be Chinese, English or another language, which is likewise not limited. To better explain the keyword extraction method, this embodiment uses Chinese-language text as the example.
In application, the plurality of word segments can be obtained by performing word segmentation on the target article. For example, a news-type target article often contains text such as the news source and reprint notices; such text is irrelevant information that would interfere with the accuracy of keyword extraction from the target article, so the article can be cleaned in advance to remove it. Segmentation can rely on a pre-built segmentation dictionary containing all the words usable in a language (Chinese in this example). For the target article, a sentence or a string of characters is taken and compared against the words in the dictionary, following the forward maximum matching algorithm or the backward maximum matching algorithm. If a match is found, the string can be treated as a word carrying one meaning, i.e. one word segment. If the dictionary contains no matching word, the string is shortened (for example, by dropping its last character) and matched against the dictionary again, until all strings have been matched, yielding the plurality of word segments.
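For illustration only, a minimal sketch of forward maximum matching as described above; the dictionary, its maximum word length, and the sample input are assumptions for demonstration, not part of this application:

```python
def forward_max_match(text, dictionary, max_len=5):
    """Greedy forward maximum matching: at each position, take the longest
    dictionary word starting there; fall back to a single character."""
    segments, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                segments.append(piece)
                i += length
                break
    return segments

# Hypothetical dictionary and input for demonstration only.
dictionary = {"平安", "平安符", "关键词", "抽取"}
print(forward_max_match("关键词抽取", dictionary))  # ['关键词', '抽取']
```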
S102: Determine a plurality of candidate keywords from the plurality of word segments according to a preset keyword library.
In application, the preset keyword library may be built by the user setting a number of terms of interest in advance and storing them as keywords of the library under a storage path designated by the terminal device. For example, when a user reads an article, finds its content interesting and wants to regularly read articles in the related field, the user can pick terms from that article as terms of interest and store them in the keyword library. Alternatively, after the terminal device determines from a confirmation instruction that the user is interested in the article, it can determine the article's field from the content currently being read, crawl keywords of multiple articles in that field from the network as terms of interest, and store them in the keyword library. The preset keyword library can include specific terms such as person names, place names and times, because such terms are often overlooked by currently popular keyword extraction algorithms. Setting these specific terms separately therefore safeguards the quality of the candidate keywords determined from the word segments.
Determining the candidate keywords from the word segments may work as follows: if the keyword library contains a word identical to a word segment, that segment can be determined as a candidate keyword, thereby obtaining a plurality of candidate keywords. It can be understood that if a keyword in the library is a synonym of a word segment, that segment may also be taken as a candidate keyword, which is not limited here.
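A minimal sketch of this lookup step, assuming the keyword library is available as a set of strings (the contents are placeholders):

```python
def select_candidates(segments, keyword_library):
    """Keep each distinct word segment that also appears in the keyword library."""
    seen, candidates = set(), []
    for seg in segments:
        if seg in keyword_library and seg not in seen:
            seen.add(seg)
            candidates.append(seg)
    return candidates
```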
S103: According to the plurality of candidate keywords and the target article, calculate a plurality of score values corresponding to each of the plurality of candidate keywords.
In application, the plurality of score values includes, but is not limited to, a position score based on where the candidate keyword appears in the target article and a frequency score based on how often it appears in the article. For example, different score values can be assigned according to the different positions at which a candidate keyword appears in the target article. For a candidate keyword appearing in the title: the title of a news-type target article is usually treated as the core of the article and carries its main content, so the position score of a candidate keyword appearing in the title can be set higher than that of one appearing in the body. Note that if the same candidate keyword appears both in the title and in the body, the highest value (i.e., the title's position score) can be selected as that keyword's position score. The frequency score of a candidate keyword can be computed as the ratio between the number of that keyword's occurrences in the target article and the total number of word segments in the article.
S104: Input the plurality of score values corresponding to each candidate keyword into a pre-trained supervised model to obtain the word probability of each candidate keyword, and determine target keywords from the plurality of candidate keywords according to the word probabilities.
In application, there are multiple candidate keywords, each with multiple score values, while the target keywords may be only some of these candidates. For example, the sum of each candidate keyword's score values, or their average, can be computed as a measure of how strongly each candidate keyword is associated with the target article (the word probability). Target keywords can then be determined from the candidates; for example, a preset number of candidate keywords with the highest scores (word probabilities) can be determined as the target keywords.
In application, the supervised model may be a model trained on existing article content and the corresponding keywords. The goal of supervised learning is to learn a function (model) that, given sample data (existing article content) and output values (keywords), fits the relationship between input and output as closely as possible. In this way, the score values and the supervised model yield, for each candidate keyword, the probability that it is a target keyword of the target article. Thus, on top of determining the candidate keywords through the preset keyword library, the supervised model can further determine target keywords from the candidates, ensuring that the extracted keywords are all high-quality terms.
In this embodiment, a plurality of word segments is obtained by segmenting the target article and compared against the preset keyword library to determine candidate keywords; a plurality of score values is calculated for each candidate keyword, and target keywords are further determined from the candidates according to those score values. On the basis of maintaining a high-quality keyword library as the source of candidate keywords, the supervised model additionally computes a word probability for each candidate keyword, and target keywords are determined according to the word probabilities, ensuring that the extracted target keywords are all high-quality terms strongly associated with the target article.
Referring to Fig. 2, in a specific embodiment, before determining the plurality of candidate keywords from the plurality of word segments according to the preset keyword library in S102, the method further includes the following steps S102A-S102D, detailed as follows:
S102A: Determine the article field of the target article, and obtain field texts belonging to the article field.
In application, the target article may be crawled from the network by the terminal device. It can be understood that when a user browses a target article on a terminal device, the article usually already carries a field label (article field) assigned at publication, so the terminal device can be considered to determine the article field at the same time as it obtains the article. For example, when the terminal device is a smartphone and the target article is browsed in a browser, the article already has a definite field. See the related channels in Fig. 3: each term under the related channels can be regarded as an article field of the target article, and the multiple texts under that field can all be regarded as field texts.
S102B: According to a plurality of field word segments in the field texts, calculate the field relevance between each pair of field word segments.
In application, the field word segments can likewise be obtained by the method in S101 above; refer to the explanation given there, which is not repeated here. Calculating the field relevance between field word segments may amount to computing the mutual information between each pair of them, specifically by the following formula:

$$PMI(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$

where p(x, y) is the probability that field word segments x and y appear together in the field texts, p(x) is the probability that x appears on its own in the field texts, p(y) is the probability that y appears on its own in the field texts, and PMI(x, y) is the mutual information of x and y. The number of field texts obtained can be counted; among them, the number of field texts in which x and y co-occur, the number in which x appears alone, and the number in which y appears alone can each be counted, and the mutual information between each pair of field word segments computed by the formula above.
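As a sketch, this document-level PMI can be estimated directly from counts; representing the field texts as lists of word segments is an assumption for demonstration:

```python
import math

def pmi(field_texts, x, y):
    """Estimate PMI(x, y) from document-level counts: each p(.) is the
    fraction of field texts containing the segment(s) in question."""
    n = len(field_texts)
    n_x = sum(1 for t in field_texts if x in t)
    n_y = sum(1 for t in field_texts if y in t)
    n_xy = sum(1 for t in field_texts if x in t and y in t)
    if n_xy == 0 or n_x == 0 or n_y == 0:
        return float("-inf")  # never co-occur: treat as unrelated
    return math.log((n_xy / n) / ((n_x / n) * (n_y / n)))
```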
In other applications, after computing the mutual information, the left-right information of each field keyword can be computed from it, yielding left-right mutual information, which is then used as the field relevance. For example, consider the three segments '平', '安' and '符' appearing in the field texts. With the formula above, the mutual information (field relevance) of '平' and '安', of '平' and '符', and of '安' and '符' can each be computed. By the magnitude of the mutual information, it can then be determined that the field relevance between the segments forming '平安' ('平' plus '安') is higher. '平安' can next be treated as one field segment and its left-right mutual information with '符' computed. If the right mutual information of the combination '符平安' comes out very low, '符平安' is determined not to form a field segment; if, however, the left mutual information of '平安符' is high, '平安符' can form a new field segment. In the end, multiple field relevances (left-right mutual information values) among the three segments '平', '安' and '符' are obtained. It can be understood that when computing the mutual information between '平' and '安', the left-right mutual information of the combinations '平安' and '安平' also needs to be computed, from which '平安' is determined to qualify as a field segment while '安平' does not.
S102C: Determine, from the plurality of field relevances, target relevances greater than a preset relevance, and determine the target field segments corresponding to the target relevances.
S102D: Store the target field segments in the keyword library.
In application, the preset relevance may be a value set by the user according to the actual situation, or a fixed value preset by the terminal device, which is not limited here. After the field relevance between each pair of field segments is obtained, target relevances can be determined from the plurality of field relevances according to their magnitude, along with the corresponding target field segments. For example, when a field relevance is greater than the preset relevance, it is determined to be a target relevance, and the field segment corresponding to it is determined to be a target field segment. As explained in S102 above, the preset keyword library may consist of terms of interest set by the user in advance and stored as keywords under a storage path designated by the terminal device; accordingly, the terminal device can also store the target field segments in the keyword library. For example, '平安' and '平安符' above can be taken as target field segments and stored in the keyword library.
Referring to Fig. 4, in a specific embodiment, before determining the plurality of candidate keywords from the plurality of word segments according to the preset keyword library in S102, the method further includes the following steps S102E-S102F, detailed as follows:
S102E: Determine the article field of the target article, and obtain a plurality of field keywords under the article field.
S102F: Store the plurality of field keywords in the keyword library.
In application, how to determine the article field of the target article has been explained above, as has the fact that, once the field is determined, multiple field texts can be crawled from the network. On this basis, the terminal device can also directly take the terms already tagged on each field text as field keywords to build the keyword library. For example, referring to Fig. 3, the terms in the same row as the related channels ('5G频道', '互联网') can all be regarded as article fields. As can also be seen in Fig. 3, when the user selects the '互联网' article field on the terminal device, the device can fetch the corresponding field texts from the network for that field and, at the same time, obtain the field keywords each field text already carries at publication (the terms indicated by the arrow for the first article in the figure). These field keywords can be regarded as high-frequency or core terms defined by the publisher for each field text. The multiple field keywords of each field text under the article field can therefore be stored in the keyword library.
Referring to Fig. 5, in a specific embodiment, S102, determining the plurality of candidate keywords from the plurality of word segments according to the preset keyword library, further includes the following sub-steps S1021-S1023, detailed as follows:
S1021: Determine whether the keyword library contains a target word segment, the target word segment being any one of the plurality of word segments.
S1022: If the keyword library contains the target word segment, take the target word segment as a candidate keyword.
In application, since the word segments stored in the keyword library are all high-quality terms in the text's field, the plurality of word segments obtained above can each be compared against the segments in the library. If a word segment matches one already stored in the library, it can preliminarily be taken as a candidate keyword. Among the plurality of word segments, the segment currently being compared against the library is the target word segment.
S1023: If the keyword library does not contain the target word segment, judge whether the target word segment is an entity word; if the target word segment is an entity word, input the target word segment belonging to the entity word into the supervised model to obtain the keyword probability of that segment; if the keyword probability is greater than a probability threshold, take the target word segment corresponding to that keyword probability as a candidate keyword.
In application, an entity word is a term that describes an independently existing thing. After determining that a word segment is not stored in the keyword library, it can be judged whether that segment is an entity word. If it is not, the segment can be considered meaningless and deleted. Whether a segment is an entity word can be judged with named entity recognition (NER) technology. Specifically, named entity recognition, also called "proper name recognition", refers to recognizing entities with specific meaning in text, mainly including person names, place names, organization names and proper nouns.
In application, when a word segment not stored in the keyword library is determined to be an entity word, it can be input into the supervised model to obtain the segment's keyword probability. The supervised model is a pre-trained classification model used to judge, a second time, the probability that the segment is a candidate keyword; see the description of the supervised model in S104 above, which is not repeated here.
In application, the supervised model can extract the word features of each segment in the target article and then output, based on those features, the probability that the segment is a keyword of the article. The word features of a segment may be extracted comprehensively by the model from information such as the positions where the segment appears in the target article, the number of its occurrences, and the segment's length; classification based on these features then outputs the probability that the segment is a keyword of the article. The probability threshold may be a value preset by the user, or one set by the supervised model after training and analysis on existing big data, which is not limited here. When the keyword probability is greater than the probability threshold, the corresponding segment is taken as a candidate keyword.
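The branching in S1021-S1023 might be sketched as below; `is_entity_word` (an NER check) and `keyword_probability` (the supervised model's score) are hypothetical callables standing in for components the application leaves unspecified:

```python
def pick_candidates(segments, keyword_library, is_entity_word,
                    keyword_probability, threshold=0.5):
    """Library hit -> candidate; otherwise keep only entity words whose
    model-estimated keyword probability clears the threshold."""
    candidates = []
    for seg in segments:
        if seg in keyword_library:
            candidates.append(seg)
        elif is_entity_word(seg) and keyword_probability(seg) > threshold:
            candidates.append(seg)
    return candidates
```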
Referring to Fig. 6, in a specific embodiment, the supervised model can be trained through the following steps S201-S206, detailed as follows:
S201: Obtain training samples, and obtain the annotated training keywords from the training samples.
In application, the training samples can be regarded as the field texts explained above, and the corresponding training keywords as the target field segments of those texts. The training samples can be obtained by crawling multiple field texts under the same article field from the network. Based on the segmentation method explained in S101, the training samples can be segmented into a plurality of sample word segments, which is not described further here.
S202: Segment the text content of the training samples to obtain a plurality of sample word segments, and calculate the sample score value corresponding to each sample word segment.
S203: Determine sample keywords from the plurality of sample word segments according to the plurality of sample score values.
In application, a sample score value may be determined from the positions of a sample segment within the training sample, or computed as the segment's word span within the training sample, which is not limited here. Based on the magnitude of the sample score values, a sample score threshold can be set: a sample segment whose score exceeds the threshold is taken as a sample keyword. Alternatively, the sample score values can be sorted, and the sample segments corresponding to a preset number of the highest scores taken as sample keywords, which is likewise not limited.
S204: Determine the label class of each sample keyword based on the sample keywords and the training keywords.
In application, the label class can be used to give a sample keyword a concrete value for computing the model's training loss. Specifically, if a sample keyword matches any training keyword, its label class can be set to 1; otherwise, its label class is set to 0.
S205: Extract the keyword features of the sample keywords.
S206: Perform model training based on the keyword features and label classes of the sample keywords to obtain the supervised model.
In application, extracting the keyword features of the sample keywords can be regarded as performing feature engineering on them, i.e. extracting word features covering multiple aspects. For the specific feature engineering applied to sample keywords, refer to Fig. 7, which shows the individual keyword features that should be extracted for the sample keywords in the training samples.
In application, after the multiple keyword features of the sample keywords are obtained, the neural network structure inside the initial supervised model can fuse the keyword features into a fused feature, so that the fused feature comprehensively represents the sample keyword's multiple pieces of feature information. The model can then output, from the fused feature, the probability that the sample keyword is a keyword, and compute the training loss in combination with the sample keyword's label class. Finally, the model parameters are updated iteratively according to the training loss, and when the loss converges, the current model is taken as the trained supervised model. This improves the accuracy with which the supervised model determines the target keywords of a target article.
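As a rough illustration of S201-S206, the sketch below trains a classifier from scikit-learn on feature rows and 0/1 label classes; the random features stand in for the engineered keyword features of Fig. 7, and logistic regression replaces the patent's neural-network fusion purely for brevity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 4))             # 4 score features per sample keyword
y = (X.sum(axis=1) > 2).astype(int)  # stand-in 0/1 label classes

model = LogisticRegression().fit(X, y)
word_probabilities = model.predict_proba(X)[:, 1]  # P(keyword) per row
```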
Referring to Fig. 8, in a specific embodiment, the plurality of score values includes a first score value, a second score value, a third score value and a fourth score value; S103, calculating, according to the plurality of candidate keywords and the target article, the plurality of score values corresponding to each candidate keyword, further includes the following sub-steps S1031-S1034, detailed as follows:
S1031: Count the number of the plurality of word segments, calculate from this number the term frequency of each candidate keyword in the target article, and compute from the term frequency the first score value of each candidate keyword.
In application, the number of the plurality of word segments is the total number of segments in the target article, and a candidate keyword's term frequency can be computed as the ratio between the number of its occurrences in the target article and that total. The first score value may be the term frequency itself, or a term frequency-inverse document frequency computed from it. Specifically, the inverse document frequency can be obtained as follows: the terminal device counts a first quantity (the number of field texts) and a second quantity (the number of field texts containing the candidate keyword), computes the ratio of the first quantity to the second, and takes the base-10 logarithm of that ratio; the resulting value is the keyword's inverse document frequency. Each candidate keyword's term frequency-inverse document frequency is then obtained, and the product of term frequency and inverse document frequency gives the first score value. Note that the resulting value can lie anywhere from 0 to infinity; for convenience in later computation, each term frequency-inverse document frequency can be normalized into the interval from 0 to 1.
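A sketch of this first score (TF-IDF) under the counting scheme just described; representing the article and the field texts as lists of word segments is an assumption:

```python
import math

def first_score(keyword, article_segments, field_texts):
    """TF-IDF as described: tf = occurrences / total segments,
    idf = log10(#field texts / #field texts containing the keyword)."""
    tf = article_segments.count(keyword) / len(article_segments)
    containing = sum(1 for text in field_texts if keyword in text)
    idf = math.log10(len(field_texts) / containing) if containing else 0.0
    return tf * idf
```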
S1032: Determine the positions of the plurality of candidate keywords in the target article, and calculate the second score value of each candidate keyword based on those positions.
In application, as explained in S103 above, whether a candidate keyword appears in the title or the body of the target article reflects its importance in the article. Specifically, the second score value of a candidate keyword appearing in the title can be set to 0.6 and that of one appearing in the body to 0.4, with the actual values set according to the situation. It can be understood that if the same candidate keyword appears several times in the target article, in both the title and the body, the sum of the values corresponding to that keyword across those positions can be taken as one second score value; alternatively, the keyword's average value can be taken as the second score value, which is not limited here. Note that to distinguish the title from the body of the target article, a space or a special symbol can be added between them.
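Under the example weights of 0.6 for the title and 0.4 for the body, and summing over occurrences (one of the two options described above), the second score might look like:

```python
def second_score(keyword, title_segments, body_segments):
    """Position score: 0.6 per title occurrence, 0.4 per body occurrence,
    summed over all occurrences of the keyword."""
    return (0.6 * title_segments.count(keyword)
            + 0.4 * body_segments.count(keyword))
```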
S1033: Determine, for each candidate keyword, its initial position and end position in the target article, and calculate the third score value corresponding to each candidate keyword from the initial position and the end position.
In application, the third score value can be regarded as each candidate keyword's word span in the target article. Specifically, since the candidate keywords are determined from the word segments of the target article, every segment of the article can be numbered in order according to the text content, so that the sequence numbers of each candidate keyword in the article, and hence its positions, can be determined. When a candidate keyword appears multiple times in the target article (i.e., one candidate keyword has multiple sequence numbers), its smallest sequence number can be taken as its initial position in the article and its largest sequence number as its end position. Subtracting the two sequence numbers then yields a difference that serves as the third score value. In addition, for convenience in later computation, the difference can be divided by the total number of word segments so as to normalize it, with the normalized value taken as the third score value, which is not limited here.
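A sketch of the normalized word span (third score), assuming the article is given as an ordered list of word segments:

```python
def third_score(keyword, article_segments):
    """Normalized word span: (last index - first index) / total segments."""
    positions = [i for i, seg in enumerate(article_segments) if seg == keyword]
    if not positions:
        return 0.0
    return (max(positions) - min(positions)) / len(article_segments)
```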
S1034: Calculate the fourth score value corresponding to each candidate keyword according to a preset text ranking algorithm.
In application, the text ranking algorithm may be the graph-based ranking (TextRank) model, which splits the target article into constituent units (word segments), builds a graph model, and uses a voting mechanism to rank the important components among the article's segments by score, i.e. it ranks the article's word segments by score. Afterwards, from each segment's score, the scores corresponding to the candidate keywords can be determined from among the segments and taken as the fourth score values.
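A compact TextRank-style sketch: build a co-occurrence graph over the word segments and rank the nodes with PageRank via networkx. The window size of 2 and the unweighted edges are simplifying assumptions:

```python
import networkx as nx

def fourth_scores(article_segments, window=2):
    """Graph-based ranking: link segments that co-occur within a window,
    then rank nodes with PageRank (TextRank's voting mechanism)."""
    graph = nx.Graph()
    for i, seg in enumerate(article_segments):
        for j in range(i + 1, min(i + window + 1, len(article_segments))):
            if seg != article_segments[j]:
                graph.add_edge(seg, article_segments[j])
    return nx.pagerank(graph)  # segment -> score
```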
It can be understood that a target keyword of a target article usually appears in the title and appears relatively often in the article itself. The four score values defined above for each candidate keyword can therefore serve as a good measure of how key a keyword is to the target article, allowing the terminal device to judge each candidate keyword's importance in the article comprehensively from the multiple score values and improving the accuracy of determining high-quality target keywords from the candidates.
Referring to Fig. 9, in a specific embodiment, there are multiple target keywords; after S104, inputting the plurality of score values corresponding to each candidate keyword into the pre-trained supervised model to obtain each candidate keyword's word probability and determining the target keywords from the candidates according to the word probabilities, the method further includes the following steps S104A-S104B, detailed as follows:
S104A: Count the total number of occurrences of each target keyword across multiple target articles, and calculate the ratio between the totals of the target keywords.
In application, the multiple target articles can be understood as the articles clicked by the user within a preset period. Each target keyword can be understood as follows: when the user clicks a target article, the terminal device uses the method above to extract one or more target keywords from it, so the device can obtain the target keyword(s) of every target article clicked within the preset period. For example, when the user clicks a target article and the terminal device records multiple target keywords for it, the article's target keywords might be: 母婴 (mother and baby), 家有萌娃 (cute kids at home) and 营养发育 (nutrition and development). If the user clicks several other target articles within the preset period, and '母婴', '家有萌娃' and '营养发育' appear repeatedly among the keywords recorded from them, the device can accumulate the occurrences of each, i.e. count each target keyword's total across the multiple articles, and then compute the ratio between those totals.
It can be understood that not every target article will yield the target keywords above, which are only one example. Moreover, a target keyword that appears only once across the multiple target articles should also be recorded and included in the ratio calculation.
S104B: Recall articles according to the ratio and each target keyword to obtain an article set, in which the numbers of articles containing each target keyword are in proportion to the ratio.
In application, the article set is used to store the articles recalled by the terminal device according to the target keywords. After each target keyword and the ratio between them are determined, the number of articles recalled can follow the ratio. Specifically, the total number of articles the terminal device should recall can be preset, and the number of articles to recall containing each target keyword computed from the total and the ratio. For example, suppose the ratio of '母婴' to '家有萌娃' to '营养发育' above is 5:2:3 and the terminal device should recall 10 articles in total. To make the numbers of articles containing each target keyword in the article set match the ratio, the device should recall 5 target articles containing the target keyword '母婴', 2 containing the target keyword '家有萌娃', and 3 containing the target keyword '营养发育'. In this way, the terminal device can automatically recall from the network the articles the user is interested in according to the target keywords, improving the device's recall performance.
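The proportional allocation in this example can be sketched as below; note that simple rounding may need adjustment to hit the exact recall total in the general case:

```python
def recall_quota(keyword_counts, total_recall):
    """Allocate the recall budget across target keywords in proportion
    to their occurrence counts in the clicked articles."""
    total = sum(keyword_counts.values())
    return {kw: round(total_recall * n / total)
            for kw, n in keyword_counts.items()}

print(recall_quota({"母婴": 5, "家有萌娃": 2, "营养发育": 3}, 10))
# -> {'母婴': 5, '家有萌娃': 2, '营养发育': 3}
```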
Referring to Fig. 10, Fig. 10 is a structural block diagram of a keyword extraction apparatus provided by an embodiment of this application. The units included in the terminal device of this embodiment are used to execute the steps in the embodiments corresponding to Figs. 1, 2, 4 to 6, 8 and 9; for details, refer to those figures and the related descriptions in their corresponding embodiments. For ease of explanation, only the parts related to this embodiment are shown. Referring to Fig. 10, the keyword extraction apparatus 1000 includes a first obtaining module 1010, a first determining module 1020, a first calculating module 1030 and a second determining module 1040, wherein:
the first obtaining module 1010 is configured to obtain a plurality of word segments from a target article;
the first determining module 1020 is configured to determine a plurality of candidate keywords from the plurality of word segments according to a preset keyword library;
the first calculating module 1030 is configured to calculate, according to the plurality of candidate keywords and the target article, a plurality of score values corresponding to each of the plurality of candidate keywords;
the second determining module 1040 is configured to input the plurality of score values corresponding to each candidate keyword into a pre-trained supervised model to obtain the word probability of each candidate keyword, and determine target keywords from the plurality of candidate keywords according to the word probabilities.
In an embodiment, the keyword extraction apparatus 1000 further includes:
a third determining module, configured to determine the article field of the target article and obtain field texts belonging to the article field;
a second calculating module, configured to calculate, according to a plurality of field word segments in the field texts, the field relevance between each pair of field word segments;
a fourth determining module, configured to determine, from the plurality of field relevances, target relevances greater than a preset relevance, and determine the target field segments corresponding to the target relevances;
a first generating module, configured to store the target field segments in the keyword library.
In an embodiment, the keyword extraction apparatus 1000 further includes:
a fifth determining module, configured to determine the article field of the target article and obtain a plurality of field keywords under the article field;
a second generating module, configured to store the plurality of field keywords in the keyword library.
In an embodiment, the first determining module 1020 is further configured to:
determine whether the keyword library contains a target word segment, the target word segment being any one of the plurality of word segments; if the keyword library contains the target word segment, take the target word segment as a candidate keyword; if the keyword library does not contain the target word segment, judge whether the target word segment is an entity word; if the target word segment is an entity word, input the target word segment belonging to the entity word into the supervised model to obtain the keyword probability of that segment; and if the keyword probability is greater than a probability threshold, take the target word segment corresponding to the keyword probability as a candidate keyword.
In an embodiment, the keyword extraction apparatus 1000 further includes the following modules for training the supervised model:
a second obtaining module, configured to obtain training samples and obtain the annotated training keywords from the training samples;
a segmentation module, configured to segment the text content of the training samples to obtain a plurality of sample word segments and calculate the sample score value corresponding to each sample word segment;
a sixth determining module, configured to determine sample keywords from the plurality of sample word segments according to the plurality of sample score values;
a seventh determining module, configured to determine the label class of each sample keyword based on the sample keywords and the training keywords;
an extraction module, configured to extract the keyword features of the sample keywords;
a training module, configured to perform model training based on the keyword features and label classes of the sample keywords to obtain the supervised model.
In an embodiment, the plurality of score values includes a first score value, a second score value, a third score value and a fourth score value; the first calculating module 1030 is further configured to:
count the number of the plurality of word segments, calculate from this number the term frequency of each candidate keyword in the target article, and compute from the term frequency the first score value of each candidate keyword; determine the positions of the plurality of candidate keywords in the target article and calculate each candidate keyword's second score value based on those positions; determine, for each candidate keyword, its initial position and end position in the target article and calculate the third score value corresponding to each candidate keyword from the initial position and the end position; and calculate the fourth score value corresponding to each candidate keyword according to a preset text ranking algorithm.
In an embodiment, there are multiple target keywords; the keyword extraction apparatus 1000 further includes:
a statistics module, configured to count the total number of occurrences of each target keyword across multiple target articles and calculate the ratio between the totals of the target keywords;
a recall module, configured to recall articles according to the ratio and each target keyword to obtain an article set, in which the numbers of articles containing each target keyword are in proportion to the ratio.
It should be understood that in the structural block diagram of the keyword extraction apparatus shown in Fig. 10, the units/modules are used to execute the steps in the embodiments corresponding to Figs. 1, 2, 4 to 6, 8 and 9, and those steps have been explained in detail in the embodiments above; refer to those figures and the related descriptions in their corresponding embodiments, which are not repeated here.
Fig. 11 is a structural block diagram of a terminal device provided by another embodiment of this application. As shown in Fig. 11, the terminal device 1100 of this embodiment includes a processor 1110, a memory 1120, and a computer program 1130 stored in the memory 1120 and executable on the processor 1110, for example the program of the keyword extraction method. When executing the computer program 1130, the processor 1110 implements the steps in the keyword extraction method embodiments above, for example S101 to S104 shown in Fig. 1; alternatively, when executing the computer program 1130, the processor 1110 implements the functions of the modules in the embodiment corresponding to Fig. 10, for example the functions of modules 1010 to 1040 shown in Fig. 10. Specifically:
A terminal device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
obtaining a plurality of word segments from a target article;
determining a plurality of candidate keywords from the plurality of word segments according to a preset keyword library;
calculating, according to the plurality of candidate keywords and the target article, a plurality of score values corresponding to each of the plurality of candidate keywords;
inputting the plurality of score values corresponding to each candidate keyword into a pre-trained supervised model to obtain the word probability of each candidate keyword, and determining target keywords from the plurality of candidate keywords according to the word probabilities.
In an embodiment, the processor, when executing the computer program, further implements:
determining the article field of the target article, and obtaining field texts belonging to the article field;
calculating, according to a plurality of field word segments in the field texts, the field relevance between each pair of field word segments;
determining, from the plurality of field relevances, target relevances greater than a preset relevance, and determining the target field segments corresponding to the target relevances;
storing the target field segments in the keyword library.
In an embodiment, the processor, when executing the computer program, further implements:
determining the article field of the target article, and obtaining a plurality of field keywords under the article field;
storing the plurality of field keywords in the keyword library.
In an embodiment, the processor, when executing the computer program, further implements:
determining whether the keyword library contains a target word segment, the target word segment being any one of the plurality of word segments;
if the keyword library contains the target word segment, taking the target word segment as a candidate keyword;
if the keyword library does not contain the target word segment, judging whether the target word segment is an entity word; if the target word segment is an entity word, inputting the target word segment belonging to the entity word into the supervised model to obtain the keyword probability of that segment; and if the keyword probability is greater than a probability threshold, taking the target word segment corresponding to the keyword probability as a candidate keyword.
In an embodiment, the processor, when executing the computer program, further implements the training of the supervised model through the following steps, specifically:
obtaining training samples, and obtaining the annotated training keywords from the training samples;
segmenting the text content of the training samples to obtain a plurality of sample word segments, and calculating the sample score value corresponding to each sample word segment;
determining sample keywords from the plurality of sample word segments according to the plurality of sample score values;
determining the label class of each sample keyword based on the sample keywords and the training keywords;
extracting the keyword features of the sample keywords;
performing model training based on the keyword features and label classes of the sample keywords to obtain the supervised model.
In an embodiment, the plurality of score values includes a first score value, a second score value, a third score value and a fourth score value; the processor, when executing the computer program, further implements:
counting the number of the plurality of word segments, calculating from this number the term frequency of each candidate keyword in the target article, and computing from the term frequency the first score value of each candidate keyword;
determining the positions of the plurality of candidate keywords in the target article, and calculating the second score value of each candidate keyword based on those positions;
determining, for each candidate keyword, its initial position and end position in the target article, and calculating the third score value corresponding to each candidate keyword from the initial position and the end position;
calculating the fourth score value corresponding to each candidate keyword according to a preset text ranking algorithm.
In an embodiment, there are multiple target keywords; the processor, when executing the computer program, further implements:
counting the total number of occurrences of each target keyword across multiple target articles, and calculating the ratio between the totals of the target keywords;
recalling articles according to the ratio and each target keyword to obtain an article set, in which the numbers of articles containing each target keyword are in proportion to the ratio.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements:
obtaining a plurality of word segments from a target article;
determining a plurality of candidate keywords from the plurality of word segments according to a preset keyword library;
calculating, according to the plurality of candidate keywords and the target article, a plurality of score values corresponding to each of the plurality of candidate keywords;
inputting the plurality of score values corresponding to each candidate keyword into a pre-trained supervised model to obtain the word probability of each candidate keyword, and determining target keywords from the plurality of candidate keywords according to the word probabilities.
In an embodiment, the computer program, when executed by a processor, further implements:
determining the article field of the target article, and obtaining field texts belonging to the article field;
calculating, according to a plurality of field word segments in the field texts, the field relevance between each pair of field word segments;
determining, from the plurality of field relevances, target relevances greater than a preset relevance, and determining the target field segments corresponding to the target relevances;
storing the target field segments in the keyword library.
In an embodiment, the computer program, when executed by a processor, further implements:
determining the article field of the target article, and obtaining a plurality of field keywords under the article field;
storing the plurality of field keywords in the keyword library.
In an embodiment, the computer program, when executed by a processor, further implements:
determining whether the keyword library contains a target word segment, the target word segment being any one of the plurality of word segments;
if the keyword library contains the target word segment, taking the target word segment as a candidate keyword;
if the keyword library does not contain the target word segment, judging whether the target word segment is an entity word; if the target word segment is an entity word, inputting the target word segment belonging to the entity word into the supervised model to obtain the keyword probability of that segment; and if the keyword probability is greater than a probability threshold, taking the target word segment corresponding to the keyword probability as a candidate keyword.
In an embodiment, the computer program, when executed by a processor, further implements the training of the supervised model through the following steps, specifically:
obtaining training samples, and obtaining the annotated training keywords from the training samples;
segmenting the text content of the training samples to obtain a plurality of sample word segments, and calculating the sample score value corresponding to each sample word segment;
determining sample keywords from the plurality of sample word segments according to the plurality of sample score values;
determining the label class of each sample keyword based on the sample keywords and the training keywords;
extracting the keyword features of the sample keywords;
performing model training based on the keyword features and label classes of the sample keywords to obtain the supervised model.
In an embodiment, the plurality of score values includes a first score value, a second score value, a third score value and a fourth score value; the computer program, when executed by a processor, further implements:
counting the number of the plurality of word segments, calculating from this number the term frequency of each candidate keyword in the target article, and computing from the term frequency the first score value of each candidate keyword;
determining the positions of the plurality of candidate keywords in the target article, and calculating the second score value of each candidate keyword based on those positions;
determining, for each candidate keyword, its initial position and end position in the target article, and calculating the third score value corresponding to each candidate keyword from the initial position and the end position;
calculating the fourth score value corresponding to each candidate keyword according to a preset text ranking algorithm.
In an embodiment, there are multiple target keywords; the computer program, when executed by a processor, further implements:
counting the total number of occurrences of each target keyword across multiple target articles, and calculating the ratio between the totals of the target keywords;
recalling articles according to the ratio and each target keyword to obtain an article set, in which the numbers of articles containing each target keyword are in proportion to the ratio.
Illustratively, the computer program 1130 can be divided into one or more modules, which are stored in the memory 1120 and executed by the processor 1110 to complete this application. The one or more modules can be a series of computer program instruction segments capable of completing specific functions, the instruction segments being used to describe the execution process of the computer program 1130 in the terminal device 1100. For example, the computer program 1130 can be divided into a first obtaining module, a first determining module, a first calculating module and a second determining module, with the specific functions of each module as described above.
The terminal device may include, but is not limited to, the processor 1110 and the memory 1120. Those skilled in the art can understand that Fig. 11 is only an example of the terminal device 1100 and does not constitute a limitation on it: the device may include more or fewer components than shown, or combine certain components, or use different components. For example, the terminal device may also include input and output devices, network access devices, buses, and so on.
The so-called processor 1110 may be a central processing unit, or another general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 1120 may be an internal storage unit of the terminal device 1100, such as its hard disk or internal memory. The memory 1120 may also be an external storage device of the terminal device 1100, such as a plug-in hard disk, smart media card or flash card fitted to the device. Further, the memory 1120 may include both an internal storage unit of the terminal device 1100 and an external storage device.
The computer-readable storage medium may be an internal storage unit of the terminal device described in the foregoing embodiments, such as the device's hard disk or internal memory. The computer-readable storage medium may be non-volatile or volatile. It may also be an external storage device of the terminal device, such as a plug-in hard disk, smart media card, secure digital card or flash card fitted to the device.
The embodiments above are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.

Claims (20)

  1. A keyword extraction method, comprising:
    obtaining a plurality of word segments from a target article;
    determining a plurality of candidate keywords from the plurality of word segments according to a preset keyword library;
    calculating, according to the plurality of candidate keywords and the target article, a plurality of score values corresponding to each of the plurality of candidate keywords;
    inputting the plurality of score values corresponding to each candidate keyword into a pre-trained supervised model to obtain the word probability of each candidate keyword, and determining target keywords from the plurality of candidate keywords according to the word probabilities.
  2. The keyword extraction method of claim 1, wherein before determining the plurality of candidate keywords from the plurality of word segments according to the preset keyword library, the method further comprises:
    determining the article field of the target article, and obtaining field texts belonging to the article field;
    calculating, according to a plurality of field word segments in the field texts, the field relevance between each pair of field word segments;
    determining, from the plurality of field relevances, target relevances greater than a preset relevance, and determining the target field segments corresponding to the target relevances;
    storing the target field segments in the keyword library.
  3. The keyword extraction method of claim 1, wherein before determining the plurality of candidate keywords from the plurality of word segments according to the preset keyword library, the method further comprises:
    determining the article field of the target article, and obtaining a plurality of field keywords under the article field;
    storing the plurality of field keywords in the keyword library.
  4. The keyword extraction method of any one of claims 1-3, wherein determining the plurality of candidate keywords from the plurality of word segments according to the preset keyword library comprises:
    determining whether the keyword library contains a target word segment, the target word segment being any one of the plurality of word segments;
    if the keyword library contains the target word segment, taking the target word segment as a candidate keyword;
    if the keyword library does not contain the target word segment, judging whether the target word segment is an entity word; if the target word segment is an entity word, inputting the target word segment belonging to the entity word into the supervised model to obtain the keyword probability of that segment; and if the keyword probability is greater than a probability threshold, taking the target word segment corresponding to the keyword probability as a candidate keyword.
  5. The keyword extraction method of claim 4, wherein the supervised model is trained through the following steps:
    obtaining training samples, and obtaining the annotated training keywords from the training samples;
    segmenting the text content of the training samples to obtain a plurality of sample word segments, and calculating the sample score value corresponding to each sample word segment;
    determining sample keywords from the plurality of sample word segments according to the plurality of sample score values;
    determining the label class of each sample keyword based on the sample keywords and the training keywords;
    extracting the keyword features of the sample keywords;
    performing model training based on the keyword features and label classes of the sample keywords to obtain the supervised model.
  6. The keyword extraction method of claim 1, wherein the plurality of score values comprises a first score value, a second score value, a third score value and a fourth score value;
    calculating, according to the plurality of candidate keywords and the target article, the plurality of score values corresponding to each of the plurality of candidate keywords comprises:
    counting the number of the plurality of word segments, calculating from this number the term frequency of each candidate keyword in the target article, and computing from the term frequency the first score value of each candidate keyword;
    determining the positions of the plurality of candidate keywords in the target article, and calculating the second score value of each candidate keyword based on those positions;
    determining, for each candidate keyword, its initial position and end position in the target article, and calculating the third score value corresponding to each candidate keyword from the initial position and the end position;
    calculating the fourth score value corresponding to each candidate keyword according to a preset text ranking algorithm.
  7. The keyword extraction method of claim 1, wherein there are multiple target keywords;
    after inputting the plurality of score values corresponding to each candidate keyword into the pre-trained supervised model to obtain the word probability of each candidate keyword and determining the target keywords from the plurality of candidate keywords according to the word probabilities, the method further comprises:
    counting the total number of occurrences of each target keyword across multiple target articles, and calculating the ratio between the totals of the target keywords;
    recalling articles according to the ratio and each target keyword to obtain an article set, in which the numbers of articles containing each target keyword are in proportion to the ratio.
  8. A keyword extraction apparatus, comprising:
    a first obtaining module, configured to obtain a plurality of word segments from a target article;
    a first determining module, configured to determine a plurality of candidate keywords from the plurality of word segments according to a preset keyword library;
    a first calculating module, configured to calculate, according to the plurality of candidate keywords and the target article, a plurality of score values corresponding to each of the plurality of candidate keywords;
    a second determining module, configured to input the plurality of score values corresponding to each candidate keyword into a pre-trained supervised model to obtain the word probability of each candidate keyword, and determine target keywords from the plurality of candidate keywords according to the word probabilities.
  9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
    obtaining a plurality of word segments from a target article;
    determining a plurality of candidate keywords from the plurality of word segments according to a preset keyword library;
    calculating, according to the plurality of candidate keywords and the target article, a plurality of score values corresponding to each of the plurality of candidate keywords;
    inputting the plurality of score values corresponding to each candidate keyword into a pre-trained supervised model to obtain the word probability of each candidate keyword, and determining target keywords from the plurality of candidate keywords according to the word probabilities.
  10. The terminal device of claim 9, wherein the processor, when executing the computer program, further implements:
    determining the article field of the target article, and obtaining field texts belonging to the article field;
    calculating, according to a plurality of field word segments in the field texts, the field relevance between each pair of field word segments;
    determining, from the plurality of field relevances, target relevances greater than a preset relevance, and determining the target field segments corresponding to the target relevances;
    storing the target field segments in the keyword library.
  11. The terminal device of claim 9, wherein the processor, when executing the computer program, further implements:
    determining the article field of the target article, and obtaining a plurality of field keywords under the article field;
    storing the plurality of field keywords in the keyword library.
  12. The terminal device of any one of claims 9-11, wherein the processor, when executing the computer program, further implements:
    determining whether the keyword library contains a target word segment, the target word segment being any one of the plurality of word segments;
    if the keyword library contains the target word segment, taking the target word segment as a candidate keyword;
    if the keyword library does not contain the target word segment, judging whether the target word segment is an entity word; if the target word segment is an entity word, inputting the target word segment belonging to the entity word into the supervised model to obtain the keyword probability of that segment; and if the keyword probability is greater than a probability threshold, taking the target word segment corresponding to the keyword probability as a candidate keyword.
  13. The terminal device of claim 12, wherein the processor, when executing the computer program, further implements the training of the supervised model through the following steps:
    obtaining training samples, and obtaining the annotated training keywords from the training samples;
    segmenting the text content of the training samples to obtain a plurality of sample word segments, and calculating the sample score value corresponding to each sample word segment;
    determining sample keywords from the plurality of sample word segments according to the plurality of sample score values;
    determining the label class of each sample keyword based on the sample keywords and the training keywords;
    extracting the keyword features of the sample keywords;
    performing model training based on the keyword features and label classes of the sample keywords to obtain the supervised model.
  14. The terminal device of claim 9, wherein the plurality of score values comprises a first score value, a second score value, a third score value and a fourth score value; the processor, when executing the computer program, further implements:
    counting the number of the plurality of word segments, calculating from this number the term frequency of each candidate keyword in the target article, and computing from the term frequency the first score value of each candidate keyword;
    determining the positions of the plurality of candidate keywords in the target article, and calculating the second score value of each candidate keyword based on those positions;
    determining, for each candidate keyword, its initial position and end position in the target article, and calculating the third score value corresponding to each candidate keyword from the initial position and the end position;
    calculating the fourth score value corresponding to each candidate keyword according to a preset text ranking algorithm.
  15. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements:
    obtaining a plurality of word segments from a target article;
    determining a plurality of candidate keywords from the plurality of word segments according to a preset keyword library;
    calculating, according to the plurality of candidate keywords and the target article, a plurality of score values corresponding to each of the plurality of candidate keywords;
    inputting the plurality of score values corresponding to each candidate keyword into a pre-trained supervised model to obtain the word probability of each candidate keyword, and determining target keywords from the plurality of candidate keywords according to the word probabilities.
  16. The computer-readable storage medium of claim 15, wherein the computer program, when executed by a processor, further implements:
    determining the article field of the target article, and obtaining field texts belonging to the article field;
    calculating, according to a plurality of field word segments in the field texts, the field relevance between each pair of field word segments;
    determining, from the plurality of field relevances, target relevances greater than a preset relevance, and determining the target field segments corresponding to the target relevances;
    storing the target field segments in the keyword library.
  17. The computer-readable storage medium of claim 15, wherein the computer program, when executed by a processor, further implements:
    determining the article field of the target article, and obtaining a plurality of field keywords under the article field;
    storing the plurality of field keywords in the keyword library.
  18. The computer-readable storage medium of any one of claims 15-17, wherein the computer program, when executed by a processor, further implements:
    determining whether the keyword library contains a target word segment, the target word segment being any one of the plurality of word segments;
    if the keyword library contains the target word segment, taking the target word segment as a candidate keyword;
    if the keyword library does not contain the target word segment, judging whether the target word segment is an entity word; if the target word segment is an entity word, inputting the target word segment belonging to the entity word into the supervised model to obtain the keyword probability of that segment; and if the keyword probability is greater than a probability threshold, taking the target word segment corresponding to the keyword probability as a candidate keyword.
  19. The computer-readable storage medium of claim 18, wherein the computer program, when executed by a processor, further implements the training of the supervised model through the following steps:
    obtaining training samples, and obtaining the annotated training keywords from the training samples;
    segmenting the text content of the training samples to obtain a plurality of sample word segments, and calculating the sample score value corresponding to each sample word segment;
    determining sample keywords from the plurality of sample word segments according to the plurality of sample score values;
    determining the label class of each sample keyword based on the sample keywords and the training keywords;
    extracting the keyword features of the sample keywords;
    performing model training based on the keyword features and label classes of the sample keywords to obtain the supervised model.
  20. The computer-readable storage medium of claim 15, wherein the plurality of score values comprises a first score value, a second score value, a third score value and a fourth score value; the computer program, when executed by a processor, further implements:
    counting the number of the plurality of word segments, calculating from this number the term frequency of each candidate keyword in the target article, and computing from the term frequency the first score value of each candidate keyword;
    determining the positions of the plurality of candidate keywords in the target article, and calculating the second score value of each candidate keyword based on those positions;
    determining, for each candidate keyword, its initial position and end position in the target article, and calculating the third score value corresponding to each candidate keyword from the initial position and the end position;
    calculating the fourth score value corresponding to each candidate keyword according to a preset text ranking algorithm.
PCT/CN2021/091083 2020-11-06 2021-04-29 Keyword extraction method, apparatus, terminal device and storage medium WO2022095374A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011229490.4 2020-11-06
CN202011229490.4A CN112347778B (zh) Keyword extraction method, apparatus, terminal device and storage medium

Publications (1)

Publication Number Publication Date
WO2022095374A1 true WO2022095374A1 (zh) 2022-05-12

Family

ID=74428396

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/091083 WO2022095374A1 (zh) 2020-11-06 2021-04-29 关键词抽取方法、装置、终端设备及存储介质

Country Status (2)

Country Link
CN (1) CN112347778B (zh)
WO (1) WO2022095374A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115269989A (zh) * 2022-08-03 2022-11-01 百度在线网络技术(北京)有限公司 Object recommendation method and apparatus, electronic device, and storage medium
CN115292477A (zh) * 2022-07-18 2022-11-04 盐城金堤科技有限公司 Method and apparatus for determining similar articles to push, storage medium, and electronic device
CN116341521A (zh) * 2023-05-22 2023-06-27 环球数科集团有限公司 AIGC article identification system based on text features
CN116521906A (zh) * 2023-04-28 2023-08-01 广州商研网络科技有限公司 Meta-description generation method and apparatus, device, and medium

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347778B (zh) * 2020-11-06 2023-06-20 平安科技(深圳)有限公司 Keyword extraction method, apparatus, terminal device and storage medium
CN113822013B (zh) * 2021-03-08 2024-04-05 京东科技控股股份有限公司 Labeling method and apparatus for text data, computer device, and storage medium
CN113270092A (zh) * 2021-05-11 2021-08-17 云南电网有限责任公司 Dispatch speech keyword extraction method based on the LDA algorithm
CN113536777A (zh) * 2021-07-30 2021-10-22 深圳豹耳科技有限公司 News keyword extraction method, apparatus, device, and storage medium
CN113743090B (zh) * 2021-09-08 2024-04-12 度小满科技(北京)有限公司 Keyword extraction method and apparatus
CN113792131B (zh) * 2021-09-23 2024-02-09 深圳平安智慧医健科技有限公司 Keyword extraction method and apparatus, electronic device, and storage medium
CN114662474B (zh) * 2022-04-13 2024-06-11 马上消费金融股份有限公司 Keyword determination method and apparatus, electronic device, and storage medium
CN117786249A (zh) * 2023-12-27 2024-03-29 王冰 System for real-time mining and analysis of network hot topics and public opinion distillation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108073568A (zh) * 2016-11-10 2018-05-25 腾讯科技(深圳)有限公司 Keyword extraction method and apparatus
CN110008401A (zh) * 2019-02-21 2019-07-12 北京达佳互联信息技术有限公司 Keyword extraction method, keyword extraction apparatus, and computer-readable storage medium
CN110874530A (zh) * 2019-10-30 2020-03-10 深圳价值在线信息科技股份有限公司 Keyword extraction method, apparatus, terminal device, and storage medium
CN112347778A (zh) * 2020-11-06 2021-02-09 平安科技(深圳)有限公司 Keyword extraction method, apparatus, terminal device and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2635965A4 (en) * 2010-11-05 2016-08-10 Rakuten Inc SYSTEMS AND METHODS RELATING TO KEYWORD EXTRACTION
CN108319627B (zh) * 2017-02-06 2024-05-28 腾讯科技(深圳)有限公司 Keyword extraction method and keyword extraction apparatus
CN108334490B (zh) * 2017-04-07 2021-05-07 腾讯科技(深圳)有限公司 Keyword extraction method and keyword extraction apparatus
CN107704503A (zh) * 2017-08-29 2018-02-16 平安科技(深圳)有限公司 User keyword extraction apparatus and method, and computer-readable storage medium
CN108563636A (zh) * 2018-04-04 2018-09-21 广州杰赛科技股份有限公司 Method, apparatus, device and storage medium for extracting text keywords
CN111223489B (zh) * 2019-12-20 2022-12-06 厦门快商通科技股份有限公司 Specific keyword recognition method and system based on the Attention mechanism
CN111814770B (zh) * 2020-09-04 2021-01-15 中山大学深圳研究院 Content keyword extraction method for news video, terminal device, and medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108073568A (zh) * 2016-11-10 2018-05-25 腾讯科技(深圳)有限公司 Keyword extraction method and apparatus
CN110008401A (zh) * 2019-02-21 2019-07-12 北京达佳互联信息技术有限公司 Keyword extraction method, keyword extraction apparatus, and computer-readable storage medium
CN110874530A (zh) * 2019-10-30 2020-03-10 深圳价值在线信息科技股份有限公司 Keyword extraction method, apparatus, terminal device, and storage medium
CN112347778A (zh) * 2020-11-06 2021-02-09 平安科技(深圳)有限公司 Keyword extraction method, apparatus, terminal device and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115292477A (zh) * 2022-07-18 2022-11-04 盐城金堤科技有限公司 Method and apparatus for determining similar articles to push, storage medium, and electronic device
CN115292477B (zh) * 2022-07-18 2024-04-16 盐城天眼察微科技有限公司 Method and apparatus for determining similar articles to push, storage medium, and electronic device
CN115269989A (zh) * 2022-08-03 2022-11-01 百度在线网络技术(北京)有限公司 Object recommendation method and apparatus, electronic device, and storage medium
CN116521906A (zh) * 2023-04-28 2023-08-01 广州商研网络科技有限公司 Meta-description generation method and apparatus, device, and medium
CN116521906B (zh) * 2023-04-28 2023-10-24 广州商研网络科技有限公司 Meta-description generation method and apparatus, device, and medium
CN116341521A (zh) * 2023-05-22 2023-06-27 环球数科集团有限公司 AIGC article identification system based on text features
CN116341521B (zh) * 2023-05-22 2023-07-28 环球数科集团有限公司 AIGC article identification system based on text features

Also Published As

Publication number Publication date
CN112347778B (zh) 2023-06-20
CN112347778A (zh) 2021-02-09

Similar Documents

Publication Publication Date Title
WO2022095374A1 (zh) Keyword extraction method, apparatus, terminal device and storage medium
TWI653542B (zh) Method, system and apparatus for discovering and tracking hot topics based on network media data streams
US10019515B2 (en) Attribute-based contexts for sentiment-topic pairs
CN107463605B (zh) Method and apparatus for identifying low-quality news resources, computer device, and readable medium
WO2017167067A1 (zh) Method and apparatus for webpage text classification, and method and apparatus for webpage text recognition
CN110674317B (zh) Entity linking method and apparatus based on a graph neural network
US20130060769A1 (en) System and method for identifying social media interactions
CN109299280B (zh) Short text clustering analysis method and apparatus, and terminal device
WO2015149533A1 (zh) Method and apparatus for word segmentation based on webpage content classification
CN108376129B (zh) Error correction method and apparatus
CN111324771B (zh) Method and apparatus for determining video tags, electronic device, and storage medium
CN111460153A (zh) Hot topic extraction method and apparatus, terminal device, and storage medium
US9864795B1 (en) Identifying entity attributes
US20180046721A1 (en) Systems and Methods for Automatic Customization of Content Filtering
WO2017113592A1 (zh) Model generation method, word weighting method, apparatus, device, and computer storage medium
CN107506472B (zh) Method for classifying webpages browsed by students
CN111126067B (zh) Entity relation extraction method and apparatus
Shawon et al. Website classification using word based multiple n-gram models and random search oriented feature parameters
CN111767713A (zh) Keyword extraction method and apparatus, electronic device, and storage medium
Zhang et al. A topic clustering approach to finding similar questions from large question and answer archives
Syed et al. Exploring symmetrical and asymmetrical Dirichlet priors for latent Dirichlet allocation
US10073890B1 (en) Systems and methods for patent reference comparison in a combined semantical-probabilistic algorithm
CN112989118B (zh) Video recall method and apparatus
Tan et al. Newsstories: Illustrating articles with visual summaries
WO2021051587A1 (zh) Semantic-recognition-based search result ranking method and apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21888078

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21888078

Country of ref document: EP

Kind code of ref document: A1