CN112541105A - Keyword generation method, public opinion monitoring method, device, equipment and medium - Google Patents

Keyword generation method, public opinion monitoring method, device, equipment and medium

Info

Publication number
CN112541105A
Authority
CN
China
Prior art keywords
corpus
words
word
information
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910929525.6A
Other languages
Chinese (zh)
Inventor
黄翔
张明锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute Of Geography Fujian Normal University
Original Assignee
Institute Of Geography Fujian Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Geography Fujian Normal University filed Critical Institute Of Geography Fujian Normal University
Priority to CN201910929525.6A priority Critical patent/CN112541105A/en
Publication of CN112541105A publication Critical patent/CN112541105A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of network monitoring, and in particular relates to a keyword generation method, a public opinion monitoring method, a device, equipment and a medium. The method comprises the following steps: acquiring environment corpus information, and extracting corpus words from the environment corpus information to obtain a corpus word set; screening the corpus words contained in the corpus word set against a preset noise corpus word library so as to remove noise corpus words; and calculating a weight for each corpus word remaining in the corpus word set after the noise corpus words are removed, and outputting a preset number of corpus words, taken in descending order of weight, as a keyword set. By extracting and analyzing corpus words from environment corpus information, the keyword generation method provided by the embodiment of the invention can automatically generate keywords related to environmental public opinion, which improves the speed and comprehensiveness of keyword generation and helps improve the efficiency and effect of environmental public opinion monitoring.

Description

Keyword generation method, public opinion monitoring method, device, equipment and medium
Technical Field
The invention belongs to the technical field of network monitoring, and particularly relates to a keyword generation method, a public opinion monitoring method, a device, equipment and a medium.
Background
With the development of informatization and the rise of social networks, information about environmental pollution is spread and discussed on the internet through channels such as microblogs, online forums, WeChat public accounts and self-media platforms. Online public opinion expresses people's attention to and attitude toward environmental conditions, environmental pollution and safety supervision. This widens, to a certain extent, the channels through which governments learn about environmental conditions and public views, but it also makes government monitoring of online public opinion more difficult.
Keywords are typically short, condensed expressions that describe the subject matter of longer texts. High-quality keywords can provide index information for a public opinion monitoring system and give users highly refined and valuable information. Keyword extraction is an important task in natural language processing and plays an important role in information retrieval, question answering systems, text summarization, search engine indexing and similar tasks. Keywords of the environmental field can serve as search terms for public opinion monitoring in that field and improve its precision and efficiency. At present, the traditional keyword generation method is driven by the experience of domain experts, imposes strict standards on the selection of concepts, and is mostly carried out by hand, so keywords are updated slowly and at high cost. Compared with traditional public opinion, online public opinion is more interactive and immediate, uses richer and more varied language, and is more visibly emotional and irrational. The traditional keyword generation method is therefore limited when applied to online corpora.
In short, environmental public opinion keywords are at present defined mainly by hand. Such keywords are updated slowly, depend heavily on people, and cannot reflect what is actually being spread online in a timely manner, which hinders the monitoring of environmental public opinion.
Disclosure of Invention
The embodiment of the invention aims to provide a keyword generation method that overcomes the defects of the prior art, which relies mainly on manually defined public opinion keywords: slow updating, heavy dependence on people and poor timeliness.
The embodiment of the invention is realized in such a way that a keyword generation method comprises the following steps:
acquiring environment corpus information, and extracting corpus words from the environment corpus information to obtain a corpus word set;
screening the corpus words contained in the corpus word set according to a preset noise corpus word library so as to remove the noise corpus words;
and calculating a weight for each corpus word in the corpus word set after the noise corpus words are removed, and outputting a preset number of corpus words, taken in descending order of weight, as a keyword set.
Another objective of the embodiments of the present invention is to provide a public opinion monitoring method, which includes:
acquiring environmental public opinion information, performing word segmentation processing on the environmental public opinion information, and determining at least one public opinion corpus word;
and comparing the public opinion corpus words with keywords in a preset keyword set, and if at least one of the public opinion corpus words is the same as the keywords in the preset keyword set, sending alarm information to a client to give an alarm, wherein the preset keyword set is obtained by executing the keyword generation method in the embodiment of the invention.
Another object of an embodiment of the present invention is to provide an apparatus for generating keywords, including:
the corpus word extracting unit is used for acquiring environment corpus information and extracting corpus words from the environment corpus information to obtain a corpus word set;
the corpus word screening unit is used for screening corpus words contained in the corpus word set according to a preset noise corpus word library so as to remove noise corpus words;
and the keyword determining unit is used for calculating a weight for each corpus word in the corpus word set after the noise corpus words are removed, and outputting a preset number of corpus words, taken in descending order of weight, as a keyword set.
Another objective of the embodiments of the present invention is to provide a public opinion monitoring device, which includes:
the information acquisition unit is used for acquiring environmental public opinion information, performing word segmentation processing on the environmental public opinion information and determining at least one public opinion corpus word;
and the information monitoring unit is used for comparing the public opinion corpus words with keywords in a preset keyword set, and if at least one of the public opinion corpus words is the same as the keywords in the preset keyword set, sending alarm information to a client to alarm, wherein the preset keyword set is obtained by executing the keyword generation method in the embodiment of the invention.
Another object of an embodiment of the present invention is to provide a computer device, including a memory and a processor, where the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the steps of the keyword generation method or the public opinion monitoring method in the embodiment of the present invention.
Another object of an embodiment of the present invention is to provide a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the processor is enabled to execute the steps of the keyword generation method or the public opinion monitoring method in the embodiment of the present invention.
According to the keyword generation method provided by the embodiment of the invention, environment corpus keywords can be extracted rapidly from the network by screening the corpus words of the environment corpus information and calculating their weights, so that the speed and accuracy of keyword generation are improved and the keyword set can keep pace with the updating of online public opinion in real time.
Drawings
Fig. 1 is an application environment diagram of a keyword generation method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a keyword generation method according to an embodiment of the present invention;
Fig. 3 is a flowchart illustrating the method for extracting corpus words according to an embodiment of the present invention;
Fig. 4 is a flowchart illustrating a method for removing noise corpus words according to an embodiment of the present invention;
Fig. 5 is a flowchart of another keyword generation method according to an embodiment of the present invention;
Fig. 6 is a flowchart of a public opinion monitoring method according to an embodiment of the present invention;
Fig. 7 is a block diagram of a keyword generation apparatus according to an embodiment of the present invention;
Fig. 8 is a block diagram illustrating a structure of a corpus word extracting unit according to an embodiment of the present invention;
Fig. 9 is a block diagram of a keyword determination unit according to an embodiment of the present invention;
Fig. 10 is a block diagram of another keyword generation apparatus according to an embodiment of the present invention;
Fig. 11 is a block diagram of a public opinion monitoring device according to an embodiment of the present invention;
Fig. 12 is a block diagram showing an internal configuration of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that, as used herein, the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms unless otherwise specified. These terms are only used to distinguish one element from another. For example, a first xx script may be referred to as a second xx script, and similarly, a second xx script may be referred to as a first xx script, without departing from the scope of the present application.
Fig. 1 is an application environment diagram of a keyword generation method according to an embodiment of the present invention, as shown in fig. 1, in the application environment, a terminal 110 and a computer device 120 are included.
The terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal 110 and the computer device 120 may be connected through a network, and the present invention is not limited thereto.
The computer device 120 may be an independent physical server or terminal, a server cluster formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud servers, cloud databases, cloud storage and CDN.
In the above implementation environment of the embodiment of the present invention, the keyword generation method of the present invention is applied to one of the computer devices 120, obtains environment corpus data from the terminal 110 and other computer devices, and determines the keyword by executing the keyword generation method in the embodiment of the present invention.
Example one
As shown in fig. 2, in an embodiment, a keyword generation method is provided, and this embodiment is mainly illustrated by applying the method to the computer device 120 in fig. 1. A keyword generation method specifically comprises the following steps:
Step S202, obtaining environment corpus information, and extracting corpus words from the environment corpus information to obtain a corpus word set.
In the embodiment of the present invention, the environment corpus information refers to corpus information related to the environmental field, for example corpus information published on environmental forums, environmental government-affairs complaint forums, the mailboxes of provincial and municipal government offices, the 12369 environmental online reporting platforms of administrative units at each level, and other network platforms. The environment corpus information may be crawled directly and automatically from the networks listed above by using web crawler technology. In one specific implementation, the source addresses of the corpus data to be acquired are first stored in a database, and a web crawler then crawls the environment corpus information found on the internet into the database. In addition to the corpus data itself, the database may store behavior data of the corpus data, such as the publishing time, number of likes, number of forwards and number of comments of the corpus information. The environment corpus information may also be acquired directly by gathering and collecting it, which is not described further in this application.
In an embodiment, as shown in fig. 3, step S202 may specifically include the following steps:
Step S302, obtaining environment corpus information, wherein the environment corpus information comprises at least one corpus text;
Step S304, comparing the characters of the corpus text with the words in a preset dictionary from left to right and/or from right to left in sequence, determining all the corpus words contained in the corpus text, and generating a corpus word set.
In the embodiment of the invention, when the characters of the corpus text are compared with the words in the preset dictionary from left to right, a text such as "the fertilizer plant frequently discharges waste water" is divided into "fertilizer/plant/frequent/discharge/waste water"; when they are compared from right to left, the same text is divided into "waste water/discharge/frequent/chemical fertilizer plant". In this way the possible corpus words contained in the corpus text can all be listed, which helps make the keyword set more complete.
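The left-to-right and right-to-left dictionary comparison of step S304 can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not name the matching strategy, so longest-match (maximum matching) is assumed here, and the Chinese sentence, the `DICTIONARY` contents and the `max_len` limit are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words  # kept in right-to-left order, like the example in the text


def corpus_word_set(text, dictionary):
    """Union of the words found by both scan directions."""
    return set(forward_max_match(text, dictionary)) | set(backward_max_match(text, dictionary))


# Hypothetical dictionary and sentence ("the fertilizer plant frequently discharges waste water").
DICTIONARY = {"化肥", "化肥厂", "经常", "排放", "废水"}
TEXT = "化肥厂经常排放废水"
print(forward_max_match(TEXT, DICTIONARY))
print(backward_max_match(TEXT, DICTIONARY))
```

Taking the union of both directions means that a word found by only one scan still enters the corpus word set.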
In the embodiment of the present invention, a word segmentation tool based on a dictionary lookup tree (trie) structure may specifically be used to segment the environment corpus information. The tool works by scanning the sentence against the trie-based dictionary to generate a directed acyclic graph (DAG) of all the possible ways the Chinese characters in the sentence can form words. Dynamic programming is then used to search for the maximum-probability path, which gives the most probable segmentation based on word frequency. Whole paragraphs and sentences of Chinese corpus are segmented in this way. For example, "the fertilizer plant frequently discharges waste water" is divided into "fertilizer plant/often/discharge/waste water".
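The DAG construction and the dynamic-programming search for the maximum-probability path can be sketched as below. This is a simplified illustration under stated assumptions: a plain dictionary of word frequencies stands in for the trie, unknown single characters are given a frequency of 1, and the `FREQ` table and sentence are hypothetical.

```python
import math


def build_dag(sentence, freq):
    """Map each start index to the inclusive end indices of dictionary words starting there."""
    n = len(sentence)
    dag = {}
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in freq]
        dag[start] = ends or [start]  # fall back to the single character
    return dag


def best_segmentation(sentence, freq):
    """Pick the segmentation with the highest product of word probabilities."""
    n = len(sentence)
    log_total = math.log(sum(freq.values()))
    dag = build_dag(sentence, freq)
    # route[i] = (best log-probability of sentence[i:], end index of its first word)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(freq.get(sentence[start:end + 1], 1)) - log_total + route[end + 1][0], end)
            for end in dag[start]
        )
    words, i = [], 0
    while i < n:
        end = route[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


# Hypothetical word frequencies; a real tool would load them from its dictionary file.
FREQ = {"化肥": 50, "化肥厂": 30, "厂": 40, "经常": 200, "排放": 120, "废水": 90}
print(best_segmentation("化肥厂经常排放废水", FREQ))
```

Working from the end of the sentence backwards lets each position reuse the best score already computed for the remainder of the sentence.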
In addition, unknown words, namely words that are not included in the segmentation vocabulary but still need to be segmented, including various proper nouns (names of people, places, enterprises and the like), abbreviations and newly coined words, are segmented by a hidden Markov model based on the word-forming capability of Chinese characters, and the Viterbi algorithm is used to select the segmentation result.
Step S204, screening the corpus words contained in the corpus word set against a preset noise corpus word library so as to remove the noise corpus words.
In one embodiment, step S204 specifically includes:
comparing the corpus words contained in the corpus word set with the noise corpus words contained in the preset noise corpus word library one by one, and removing a corpus word from the corpus word set if it matches a noise corpus word; the noise corpus words contained in the preset noise corpus word library comprise at least one of stop words and punctuation marks, or a combination of the two.
In the embodiment of the invention, noise information irrelevant to the environmental theme needs to be removed so as to reduce the influence of noise on the extraction result. First, the stop words among the segmented corpus words are removed. Stop words are words or phrases, such as "yes" and "no", that are automatically filtered out before or after natural language data (or text) is processed in information retrieval in order to save storage space and improve search efficiency. The stop word list contained in the preset noise corpus word library is loaded into the filter program, each corpus word is compared with the stop words in the list, and a corpus word that appears in the stop word list is removed. Second, meaningless characters such as "@", "#" and "%" are eliminated. Emoticons and useless English text are then removed as well. What finally remains is the clean set of corpus words with the noise corpus words removed.
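The filtering of step S204 can be sketched as follows. The stop word list is hypothetical, and dropping every token that contains no Chinese character is a simplifying assumption used here to stand in for removing symbols, emoticons and useless English; a real system would likely keep an allow-list for meaningful non-Chinese terms.

```python
import re

# Hypothetical stop word list; the patent loads it from the preset noise corpus word library.
STOP_WORDS = {"的", "是", "不", "了"}

# Assumption: a token is kept only if it contains at least one CJK unified ideograph.
CJK = re.compile(r"[\u4e00-\u9fff]")


def remove_noise(words, stop_words=STOP_WORDS):
    """Drop stop words and tokens made only of symbols, emoticons or Latin letters."""
    cleaned = []
    for word in words:
        if word in stop_words:
            continue  # stop word
        if not CJK.search(word):
            continue  # punctuation, "@", "#", "%", emoticons, English
        cleaned.append(word)
    return cleaned


print(remove_noise(["化肥厂", "的", "@", "#", "废水", "haha", "😂", "排放"]))
```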
Step S206, calculating a weight for each corpus word in the corpus word set after the noise corpus words are removed, and outputting a preset number of corpus words, taken in descending order of weight, as a keyword set.
In an embodiment, as shown in fig. 4, step S206 may specifically include the following steps:
Step S402, calculating the occurrence frequency of a corpus word from the number of times the corpus word occurs in a corpus text and the total number of occurrences of all corpus words in the same corpus text.
Step S404, calculating the inverse document frequency (IDF, also rendered as reverse file frequency) of the corpus word from the number of corpus texts in the environment corpus information and the number of those corpus texts that contain the corpus word.
Step S406, calculating the weight of the corpus word from its occurrence frequency and its inverse document frequency.
Step S408, after the weights of all the corpus words in the corpus word set have been calculated, outputting a preset number of corpus words, taken in descending order of weight, as a keyword set.
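Steps S402 to S408 can be sketched as below. The patent does not say how the per-text weights of a word are combined into a single ranking, so taking the maximum weight over all corpus texts is an assumption of this sketch, as are the function name and the toy `DOCS` corpus. The natural logarithm and the plus-one in the denominator follow the usual TF-IDF definition.

```python
import math
from collections import Counter


def tf_idf_keywords(corpus_texts, top_k):
    """corpus_texts: list of token lists. Returns the top_k (word, weight) pairs."""
    n_docs = len(corpus_texts)
    doc_freq = Counter()
    for tokens in corpus_texts:
        doc_freq.update(set(tokens))  # number of texts containing each word

    best = {}
    for tokens in corpus_texts:
        counts = Counter(tokens)
        total = sum(counts.values())
        for word, count in counts.items():
            tf = count / total                               # step S402
            idf = math.log(n_docs / (1 + doc_freq[word]))    # step S404
            weight = tf * idf                                # step S406
            best[word] = max(best.get(word, float("-inf")), weight)

    # Step S408: highest weight first; ties broken by the word itself for determinism.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Hypothetical segmented corpus texts.
DOCS = [
    ["化肥厂", "排放", "废水", "废水"],
    ["噪音", "扰民", "排放"],
    ["空气", "污染", "排放"],
    ["垃圾", "焚烧", "排放"],
]
print(tf_idf_keywords(DOCS, 3))
```

In this toy corpus the word that appears in every text receives a negative weight because of the plus-one smoothing, so it drops to the bottom of the ranking.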
In the embodiment of the invention, the occurrence frequency of a corpus word (term frequency, TF) is the normalized number of times the word occurs in a document. Specifically, for a word $t_i$ in corpus text $d_j$, the occurrence frequency is calculated as:

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where the numerator $n_{i,j}$ is the number of times the word $t_i$ occurs in corpus text $d_j$, and the denominator is the sum of the numbers of occurrences of all corpus words in $d_j$.
In the embodiment of the invention, the inverse document frequency (IDF) of a corpus word measures how distinguishable and how important the word is across a preset number of documents. The larger the inverse document frequency of a word, the more distinguishable the word is among the documents and the more important it is within them. The inverse document frequency is calculated as:

$$\mathrm{idf}_{i} = \log \frac{|D|}{1 + |\{\, j : t_i \in d_j \,\}|}$$

where $|D|$ is the total number of documents, $|\{\, j : t_i \in d_j \,\}|$ is the number of documents containing the word $t_i$, and one is added to the denominator to prevent division by zero.
Finally, the weight of the word in the corpus text is calculated as:

$$w_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i}$$
In the embodiment of the invention, when divided by part of speech, the keywords of the environmental public opinion field are generally nouns and adjectives, and the relevant nouns and adjectives of the field can be captured by combining the TF-IDF model, word frequency and expert experience. Specifically, the segmented corpus words are first tagged by part of speech, words whose number of occurrences exceeds a certain threshold are defined as high-frequency words, and the high-frequency words are extracted from the phrases by machine learning; second, the TF-IDF model is used to capture low-frequency words, that is, words that occur rarely but are highly important; finally, the keywords are screened out of these two types of words to form the keyword set.
In one embodiment, as shown in fig. 5, a keyword generation method is provided which differs from the method shown in fig. 2 in that it further includes step S502 and step S503.
Step S502, inputting the preset number of corpus words, taken in descending order of weight, as seed corpus words into a preset first neural network model, and calculating associated keywords associated with the seed corpus words through the first neural network model, wherein the first neural network model is obtained by training the parameters of a second neural network model on a preset original corpus set;
Step S503, adding to the keyword set a preset number of associated keywords, taken in descending order of their association strength with the seed corpus words.
In the embodiment of the present invention, the first neural network model is obtained by training a shallow neural network model on a preset corpus of the environmental field, and the trained model is used to search for keywords that are highly related to an input keyword. For ease of understanding, this embodiment takes the Word2Vec word embedding model as an example.
The Word2Vec model uses the context of words to convert each word into a low-dimensional real-valued vector, such that the more similar two words are, the closer they lie in the vector space. The model represents text formally in a vector space: words or texts are mapped into an n-dimensional vector space, and the relations between words are expressed through operations between vectors. The traditional word vector space model not only suffers from the "curse of dimensionality" but also severs the semantic relations among words, and so has obvious defects in semantic expression. Through training on a corpus, the Word2Vec model maps the corpus into a low-dimensional dense vector space, which both avoids the "curse of dimensionality" of the traditional vector space and preserves the semantic relations among words, giving it strong semantic representation capability.
Word2Vec includes two models, the continuous bag-of-words model (CBOW) and the skip-gram model. The CBOW model predicts the current word w(t) from the n words before and the n words after it (here n is 2), whereas the skip-gram model predicts the n words before and after the word w(t) from w(t) itself.
In the embodiment of the present invention, the CBOW model is taken as an example. The input layer consists of the 2n word vectors of the context of the word w(t), and the projection layer vector Xw is the sum of these 2n word vectors. The output layer is a Huffman tree constructed with the words appearing in the training corpus as leaf nodes and the number of times each word appears in the corpus as its weight. This Huffman tree has N (= |D|) leaf nodes, corresponding to the words in the dictionary D, and N-1 non-leaf nodes. The prediction made from Xw is optimized by a stochastic gradient ascent algorithm so that the value of P(w | context(w)) is maximized, where context(w) refers to the 2n words in the context of the word w. When the neural network training is completed, the word vectors of all the words are obtained. Based on the trained neural network model, keywords semantically similar to the input word are then found by a similarity algorithm.
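The Huffman tree of the output layer can be illustrated with a short sketch that builds the tree from word counts and confirms that N leaf nodes yield N-1 non-leaf nodes. It covers only the tree construction, not CBOW training; the word counts and function names are hypothetical.

```python
import heapq
from itertools import count


def build_huffman(word_counts):
    """Return (root, number_of_internal_nodes) for a Huffman tree over word_counts."""
    tie = count()  # tie-breaker so heapq never has to compare node payloads
    heap = [(freq, next(tie), word) for word, freq in word_counts.items()]
    heapq.heapify(heap)
    internal = 0
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
        internal += 1
    return heap[0][2], internal


def huffman_codes(node, prefix=""):
    """Leaves are words (str); internal nodes are (left, right) tuples."""
    if isinstance(node, str):
        return {node: prefix or "0"}
    left, right = node
    codes = huffman_codes(left, prefix + "0")
    codes.update(huffman_codes(right, prefix + "1"))
    return codes


# Hypothetical corpus counts.
COUNTS = {"废水": 9, "排放": 12, "经常": 20, "化肥厂": 3}
root, internal_nodes = build_huffman(COUNTS)
print(internal_nodes, huffman_codes(root))
```

Frequent words end up near the root with short codes, which is why a Huffman-tree output layer needs fewer binary decisions for common words.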
Furthermore, in the embodiment of the present invention, the keywords obtained in step S206 are input into the model as seed keywords, and the similarity algorithm of the trained model is used to generate keywords that are strongly associated with the seed keywords. Each generated keyword carries an association strength coefficient, and the words are sorted by this coefficient from high to low. Finally, a preset number of words are selected and added to the keyword set.
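The expansion from seed keywords to associated keywords can be sketched with cosine similarity over word vectors. The patent does not specify the similarity measure, so cosine similarity is an assumption here, and the two-dimensional `VECTORS` table is a hypothetical stand-in for vectors learned by a trained Word2Vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(seeds, vectors, top_k):
    """Rank non-seed words by their strongest similarity to any seed keyword."""
    scores = {}
    for word, vec in vectors.items():
        if word in seeds:
            continue
        scores[word] = max(
            (cosine(vec, vectors[s]) for s in seeds if s in vectors),
            default=0.0,
        )
    # Association strength from high to low; ties broken by the word itself.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Hypothetical embeddings; real ones would come from the trained model.
VECTORS = {
    "废水": [1.0, 0.0],
    "污水": [0.9, 0.1],
    "排污": [0.7, 0.3],
    "噪音": [0.0, 1.0],
}
print(expand_keywords({"废水"}, VECTORS, 2))
```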
In addition, in the embodiment of the invention, the seed keywords and the generated associated keywords can be merged, and the keyword set can be classified according to a preset classification system. For example, the classification system may have two levels. The first level comprises event name, action, response, place, object, time, pollution type, state, degree adverb and result. The second level subdivides action (victim action, victim feeling, pollution action), response (whether the government responded, response level, response action, subsequent response of pollutant), place (physical place, orientation place), object (victim, polluted object, pollutant discharge, pollutant, manager), time (moment, period, time span), pollution type (noise pollution, soil pollution, plastic pollution, water pollution, air pollution) and state (pre-pollution state, post-pollution state). This makes the attributes of the keywords clearer.
According to the keyword generation method provided by the embodiment of the invention, environment corpus information is obtained, keywords are extracted from it, and words associated with the keywords are calculated and extracted, so that keywords can be generated automatically and comprehensively, the speed and accuracy of keyword generation are improved, and the keyword set can keep pace with the updating of online public opinion in real time.
Example two
As shown in fig. 6, in an embodiment, a public opinion monitoring method is provided, and the embodiment is mainly exemplified by applying the method to the computer device 120 in fig. 1. A public opinion monitoring method specifically comprises the following steps:
Step S602, acquiring environmental public opinion information, performing word segmentation on the environmental public opinion information, and determining at least one public opinion corpus word;
Step S604, comparing the public opinion corpus words with the keywords in a preset keyword set, and if at least one of the public opinion corpus words is the same as a keyword in the preset keyword set, sending alarm information to a client to raise an alarm, wherein the preset keyword set is obtained by executing the keyword generation method of the first embodiment of the present invention.
Specifically, in the embodiment of the present invention, the keywords in the keyword set are used as indexes, and the environmental public opinion information is searched by traversing the keywords. If, in addition to the keyword used for the search, other keywords from the keyword list also appear in the information, an alarm is raised, and the information is classified according to the classification system defined for the keyword list, for example into the polluted object, the pollution action and the pollution time it mentions, and displayed in the environmental public opinion result display module.
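The comparison and alarm of step S604 can be sketched as follows. This follows the simpler rule stated in step S604 (any match raises an alarm) and groups the matches by category; the category labels, the shape of the returned alert and the function name are hypothetical, and sending the alert to a client is left out.

```python
def monitor(public_opinion_words, keyword_categories):
    """Return an alert dict if any segmented word matches a preset keyword, else None.

    keyword_categories maps each keyword of the preset keyword set to its
    category in the classification system.
    """
    hits = {}
    for word in public_opinion_words:
        category = keyword_categories.get(word)
        if category is not None:
            hits.setdefault(category, []).append(word)
    if not hits:
        return None
    return {"alarm": True, "matches": hits}


# Hypothetical keyword set with categories from the classification system.
KEYWORDS = {"废水": "pollutant", "排放": "pollution action"}
print(monitor(["化肥厂", "排放", "废水", "昨天"], KEYWORDS))
print(monitor(["今天", "天气"], KEYWORDS))
```

Grouping matches by category gives the display module the classified view described above without a second pass over the text.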
The public opinion monitoring method provided by the embodiment of the invention monitors environmental public opinion information on the network and compares it with the keywords in real time, so that environmental public opinion can be monitored effectively and the accuracy and timeliness of environmental public opinion monitoring are improved.
Example three
As shown in fig. 7, in an embodiment, a keyword generation apparatus is provided, which may be integrated in the computer device 120, and specifically may include:
the corpus word extracting unit 710 is configured to obtain environment corpus information, and extract corpus words from the environment corpus information to obtain a corpus word set.
In the embodiment of the present invention, the environment corpus information refers to corpus information related to the environmental field, for example corpus information published on environmental forums, environmental government-affairs complaint forums, the mailboxes of provincial and municipal government offices, the 12369 environmental online reporting platforms of administrative units at each level, and other network platforms. The environment corpus information may be crawled directly and automatically from the networks listed above by using web crawler technology. In one specific implementation, the source addresses of the corpus data to be acquired are first stored in a database, and a web crawler then crawls the environment corpus information found on the internet into the database. In addition to the corpus data itself, the database may store behavior data of the corpus data, such as the publishing time, number of likes, number of forwards and number of comments of the corpus information. The environment corpus information may also be acquired directly by gathering and collecting it, which is not described further in this application.
In an embodiment, as shown in fig. 8, the corpus word extracting unit 710 may specifically include:
an information obtaining subunit 711, configured to obtain environment corpus information, where the environment corpus information includes at least one corpus text;
and a corpus word extracting subunit 712, configured to compare the characters of the corpus text with the words in the preset lexicon dictionary from left to right and/or from right to left in sequence, determine all the corpus words contained in the corpus text, and generate a corpus word set.
In the embodiment of the present invention, when the characters of the corpus text are compared in sequence with the words in the preset lexicon dictionary from left to right, the text "fertilizer plant discharges waste water frequently", for example, is divided into "fertilizer/plant/frequent/discharge/waste water". When the characters are compared from right to left, the same text is divided into "waste water/discharge/frequent/chemical fertilizer plant". In this way, the possible corpus words contained in the corpus text can be enumerated, which helps make the keyword set more complete.
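The left-to-right and right-to-left comparison described above can be sketched as forward and backward maximum matching. This is only an illustrative sketch: the function names, the `max_len` limit, the toy dictionary, and the Chinese rendering of the example sentence are assumptions rather than details taken from this application, and the exact split depends on the dictionary and matching rule used.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words  # words are listed in the order found, i.e. right to left


def extract_corpus_words(text, dictionary, max_len=4):
    """Union of the words found in both scanning directions."""
    return set(forward_max_match(text, dictionary, max_len)) | set(
        backward_max_match(text, dictionary, max_len)
    )


# Hypothetical dictionary and sentence ("fertilizer plant discharges waste water frequently").
TOY_DICTIONARY = {"化肥", "化肥厂", "厂", "经常", "排放", "废水"}
SENTENCE = "化肥厂经常排放废水"
print(forward_max_match(SENTENCE, TOY_DICTIONARY))
print(backward_max_match(SENTENCE, TOY_DICTIONARY))
```

Taking the union of both directions is one simple way to enumerate candidate corpus words when the two scans disagree.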
In the embodiment of the present invention, a word segmentation tool based on a dictionary lookup tree (trie) structure may be used to segment the environment corpus information. The tool works by scanning the sentence against the dictionary lookup tree to generate a directed acyclic graph formed by all possible ways in which the Chinese characters in the sentence can combine into words. Dynamic programming is then used to search for the maximum-probability path, which gives the most probable segmentation based on word frequency. In this way, Chinese corpus consisting of whole paragraphs and sentences is segmented. For example, "the fertilizer plant discharges waste water frequently" is divided into "fertilizer plant/often/discharge/waste water".
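The directed acyclic graph and maximum-probability path search described above can be sketched as follows. This is a minimal sketch under stated assumptions: a plain dictionary of word frequencies stands in for the dictionary lookup tree, the frequencies are invented for illustration, and the function names are hypothetical.

```python
import math


def build_dag(sentence, freq):
    """Map each start index to the end indices of every dictionary word starting there."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in freq]
        dag[start] = ends or [start]  # fall back to the single character
    return dag


def max_probability_segmentation(sentence, freq):
    """Dynamic programming from right to left over the DAG, then read off the best path."""
    log_total = math.log(sum(freq.values()))
    dag = build_dag(sentence, freq)
    n = len(sentence)
    # best[i] = (log-probability of the best path from i to the end, end index of its first word)
    best = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        best[start] = max(
            (math.log(freq.get(sentence[start:end + 1], 1)) - log_total + best[end + 1][0], end)
            for end in dag[start]
        )
    words, i = [], 0
    while i < n:
        end = best[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


# Hypothetical word frequencies, invented for this example.
TOY_FREQ = {"化肥": 50, "厂": 80, "化肥厂": 30, "经常": 200, "排放": 120, "废水": 90}
print(max_probability_segmentation("化肥厂经常排放废水", TOY_FREQ))
```

With these toy frequencies the single word for "fertilizer plant" is more probable than the two-word split, which is consistent with the example given in the text.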
In addition, unknown words, that is, words that are not included in the segmentation vocabulary but still need to be segmented, including various proper nouns (names of people, places, enterprises, and the like), abbreviations, and newly coined words, are segmented using a hidden Markov model based on the word-forming capability of Chinese characters, and the segmentation result is selected using the Viterbi algorithm.
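How the Viterbi algorithm selects a segmentation from a hidden Markov model can be sketched as below. The sketch assumes the common character-tagging formulation in which each character is labeled B, M, E, or S (begin, middle, end of a word, or single-character word); that labeling scheme and all probabilities shown are assumptions made for illustration and are not specified in this application.

```python
import math

STATES = "BMES"  # Begin, Middle, End of a multi-character word; Single-character word


def viterbi(chars, start_p, trans_p, emit_p, floor=1e-8):
    """Return the most probable tag sequence for the characters (log-space Viterbi)."""
    if not chars:
        return []

    def lg(p):
        return math.log(p if p > 0 else floor)

    scores = [{s: lg(start_p.get(s, 0)) + lg(emit_p[s].get(chars[0], 0)) for s in STATES}]
    back = [{}]
    for t in range(1, len(chars)):
        scores.append({})
        back.append({})
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[t - 1][p] + lg(trans_p[p].get(s, 0)))
            scores[t][s] = (scores[t - 1][prev] + lg(trans_p[prev].get(s, 0))
                            + lg(emit_p[s].get(chars[t], 0)))
            back[t][s] = prev
    # A sentence has to finish on a word boundary, so the last tag is E or S.
    last = max("ES", key=lambda s: scores[-1][s])
    path = [last]
    for t in range(len(chars) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


def cut(text, tags):
    """Turn a tag sequence into words by closing a word at every E or S tag."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in "ES":
            words.append(text[start:i + 1])
            start = i + 1
    return words


# Toy model parameters, invented for this example.
START_P = {"B": 0.6, "S": 0.4}
TRANS_P = {"B": {"M": 0.3, "E": 0.7}, "M": {"M": 0.4, "E": 0.6},
           "E": {"B": 0.6, "S": 0.4}, "S": {"B": 0.5, "S": 0.5}}
EMIT_P = {"B": {"闽": 0.5, "污": 0.5}, "M": {}, "E": {"江": 0.5, "染": 0.5}, "S": {}}

TEXT = "闽江污染"  # hypothetical input containing a place name
print(cut(TEXT, viterbi(TEXT, START_P, TRANS_P, EMIT_P)))
```

In practice the start, transition, and emission probabilities would be estimated from a labeled corpus rather than written by hand.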
The corpus word screening unit 720 is configured to screen corpus words included in the corpus word set according to a preset noise corpus word bank, so as to remove noise corpus words.
In an embodiment, the corpus word screening unit 720 is specifically configured to:
compare the corpus words contained in the corpus word set one by one with the noise corpus words contained in a preset noise corpus word library, and remove a corpus word from the corpus word set if it matches a noise corpus word; the noise corpus words contained in the preset noise corpus word library comprise at least one of stop words and punctuation marks.
In the embodiment of the present invention, noise information irrelevant to the environmental theme needs to be removed in order to reduce the influence of noise on the extraction result. First, the stop words among the segmented corpus words are removed. Stop words are words or phrases, such as "yes" and "no", that are automatically filtered out before or after natural language data (or text) is processed in information retrieval, in order to save storage space and improve search efficiency. The stop word list contained in the preset noise corpus word library is loaded into the filter program and traversed; each corpus word is compared with the stop words in the list, and if the corpus word appears in the list, it is removed. Second, meaningless characters such as "@", "#", and "%" are removed. Emoticons and useless English text are then also removed. The result is a set of clean corpus words from which the noise corpus words have been removed.
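The screening steps above can be sketched as a single filtering pass. The stop word list, the symbol pattern, and the rule that treats letter-only ASCII tokens as useless English are assumptions chosen for illustration; a real noise corpus word library would be considerably larger.

```python
import re

# Hypothetical stop word list; a real preset noise corpus word library would be loaded from a file.
STOP_WORDS = {"是", "否", "的", "了"}
# Tokens made up only of punctuation, symbols, or emoticons (no letters or digits).
SYMBOLS_ONLY = re.compile(r"[\W_]+")


def remove_noise(corpus_words, stop_words=STOP_WORDS, drop_english=True):
    """Return the corpus words with stop words, symbols, and (optionally) English tokens removed."""
    cleaned = []
    for word in corpus_words:
        token = word.strip()
        if not token or token in stop_words:
            continue
        if SYMBOLS_ONLY.fullmatch(token):
            continue
        # Assumption: tokens consisting only of ASCII letters count as useless English.
        if drop_english and token.isascii() and token.isalpha():
            continue
        cleaned.append(token)
    return cleaned


SEGMENTED = ["化肥厂", "是", "@", "经常", "#", "排放", "，", "废水", "lol", "😀"]
print(remove_noise(SEGMENTED))
```

Keeping the stop words in a set makes each lookup constant time, so the cost of the pass is linear in the number of corpus words.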
The keyword determining unit 730 is configured to perform weight calculation on each corpus word in the corpus word set from which the noise corpus words have been removed, and to combine a preset number of corpus words, taken in descending order of weight, into a keyword set for output.
In an embodiment, as shown in fig. 9, the keyword determination unit 730 may specifically include:
a first frequency calculating subunit 731, configured to calculate the occurrence frequency of a corpus word according to the number of occurrences of the corpus word in a corpus text and the total number of occurrences of all corpus words in the same corpus text;
a second frequency calculating subunit 732, configured to calculate the reverse file frequency of a corpus word according to the number of corpus texts in the environment corpus information and the number of corpus texts containing the corpus word;
a weight calculation subunit 733, configured to calculate the weight of a corpus word according to the occurrence frequency and the reverse file frequency of the corpus word;
the corpus word set generating subunit 734 is configured to, after the weights of all corpus words in the corpus word set have been calculated, combine a preset number of corpus words, taken in descending order of weight, into a keyword set for output.
The occurrence frequency of a corpus word in the embodiment of the present invention refers to the normalized number of times the word occurs in a document. Specifically, using the standard TF-IDF notation, the occurrence frequency of a corpus word $t_i$ in a corpus text $d_j$ is calculated as follows:

$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where the numerator $n_{i,j}$ represents the number of occurrences of the word $t_i$ in the corpus text $d_j$, and the denominator represents the sum of the numbers of occurrences of all corpus words in the corpus text $d_j$.
In the embodiment of the present invention, the reverse file frequency of a corpus word measures how well the word distinguishes documents, and how important it is, within a preset number of documents. The larger the reverse file frequency of a word, the better the word distinguishes the documents and the more important the word is in those documents. The reverse file frequency of a word is calculated as follows:

$$IDF_{i} = \log \frac{|D|}{1 + |\{ j : t_i \in d_j \}|}$$

where $|D|$ represents the total number of documents, $|\{ j : t_i \in d_j \}|$ represents the number of documents containing the word $t_i$, and one is added to the denominator to prevent division by zero.
Finally, the weight of a word in the corpus text is calculated as follows:

$$TFIDF_{i,j} = TF_{i,j} \times IDF_{i}$$
In the embodiment of the present invention, keywords in the environmental public opinion field generally consist, by part of speech, of nouns and adjectives, and the relevant nouns and adjectives in the field can be extracted by combining the TF-IDF model, word frequency, and expert experience. Specifically, first, the segmented corpus words are tagged by part of speech, words whose number of occurrences exceeds a certain threshold are defined as high-frequency words, and the high-frequency words are extracted by machine learning; second, the TF-IDF model is used to extract low-frequency words, that is, words that occur less often but are highly important; finally, keywords are screened from these two types of words to form the keyword set.
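The occurrence frequency, reverse file frequency, and weight calculations described above can be sketched as follows. The logarithm base and the choice to keep the highest weight a word reaches across corpus texts are assumptions made so that each word ends up with a single weight; the function names and the toy corpus texts are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(corpus_texts):
    """corpus_texts is a list of word lists, one list per corpus text."""
    n_docs = len(corpus_texts)
    doc_freq = Counter()
    for words in corpus_texts:
        doc_freq.update(set(words))  # count each word once per corpus text
    weights = {}
    for words in corpus_texts:
        counts = Counter(words)
        total = sum(counts.values())
        for word, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / (1 + doc_freq[word]))  # one is added to avoid division by zero
            # Assumption: a word's overall weight is its highest weight in any corpus text.
            weights[word] = max(weights.get(word, float("-inf")), tf * idf)
    return weights


def top_keywords(corpus_texts, k):
    """Return the k corpus words with the highest weights, ties broken alphabetically."""
    weights = tf_idf_weights(corpus_texts)
    return sorted(weights, key=lambda w: (-weights[w], w))[:k]


# Hypothetical segmented and de-noised corpus texts.
TOY_TEXTS = [["废水", "排放", "废水"], ["噪声", "排放"], ["排放", "粉尘"], ["排放"]]
print(top_keywords(TOY_TEXTS, 2))
```

Note that with one added to the denominator, a word that appears in every corpus text receives a slightly negative weight, so it naturally falls to the bottom of the ranking.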
In one embodiment, as shown in fig. 10, a keyword generation apparatus is provided, which differs from the apparatus shown in fig. 7 in that it further includes an associated keyword generation unit 1010 and a set generation unit 1020, wherein:
the associated keyword generation unit 1010 is configured to input a preset number of corpus words, taken in descending order of weight, as seed corpus words into a preset first neural network model, and to calculate associated keywords associated with the seed corpus words through the first neural network model, where the first neural network model is obtained by training parameters in a second neural network model with a preset original corpus set;
the set generation unit 1020 is configured to output to the keyword set a preset number of associated keywords, taken in descending order of association strength with the seed corpus words.
In the embodiment of the present invention, the first neural network model is obtained by training a shallow neural network model on a preset environmental-field corpus, and the trained model is used to find keywords highly related to an input keyword. For ease of understanding, this embodiment takes the Word2Vec word embedding model as an example.
The Word2Vec model uses the context information of words to convert each word into a low-dimensional real-valued vector, such that the more similar two words are, the closer they are in the vector space. The Word2Vec model formalizes text using a vector space representation: words or texts are mapped into an n-dimensional vector space, and the relationships between words are expressed through operations between vectors. The traditional word vector space model not only suffers from the curse of dimensionality, but also severs the semantic relations among words and has obvious deficiencies in semantic expression. Through corpus training, the Word2Vec model maps the corpus into a low-dimensional dense vector space, which both avoids the curse of dimensionality of the traditional vector space and preserves the semantic relations among words, giving it strong semantic representation capability.
Word2Vec includes two models, the continuous bag-of-words model (CBOW) and the Skip-gram model. The CBOW model predicts the current word w(t) from the n words before and after it (where n is 2), while the Skip-gram model predicts the n words before and after w(t) from the word w(t) itself.
In the embodiment of the present invention, the CBOW model is taken as an example. The input layer consists of the 2n word vectors of the context of the word w(t), and the projection layer vector Xw is the sum of these 2n word vectors. The output layer is a Huffman tree constructed with the words appearing in the training corpus as leaf nodes and the number of times each word appears in the corpus as its weight. In this Huffman tree there are N (= |D|) leaf nodes, each corresponding to a word in the dictionary D, and N-1 non-leaf nodes. The prediction made from Xw is optimized by a stochastic gradient ascent algorithm so that the value of P(w | context(w)) is maximized, where context(w) refers to the 2n words in the context of the word w. When training of the neural network is complete, the word vectors of all words are obtained. Based on the trained neural network model, keywords semantically similar to the input keyword are found using a similarity algorithm.
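Two of the structures described above, the Huffman tree of the output layer and the summed projection layer, can be sketched as follows. This sketch covers only those two pieces and does not train anything; the word counts and function names are invented for illustration.

```python
import heapq
import itertools


def build_huffman_codes(word_counts):
    """Build a Huffman tree over the words; return ({word: code}, number of non-leaf nodes)."""
    tie = itertools.count()  # unique tie-breaker so that tree nodes are never compared directly
    heap = [(count, next(tie), word) for word, count in word_counts.items()]
    heapq.heapify(heap)
    internal = 0
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next(tie), (left, right)))
        internal += 1
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # non-leaf node: (left child, right child)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf node: a word from the dictionary
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes, internal


def project(context_vectors):
    """CBOW projection layer: element-wise sum of the 2n context word vectors."""
    return [sum(values) for values in zip(*context_vectors)]


# Hypothetical word counts from a training corpus.
TOY_COUNTS = {"废水": 5, "排放": 3, "粉尘": 1, "噪声": 1}
CODES, INTERNAL_NODES = build_huffman_codes(TOY_COUNTS)
print(CODES, INTERNAL_NODES)
```

With N words as leaves, the construction always performs N-1 merges, which is why the tree has N-1 non-leaf nodes, and more frequent words end up closer to the root.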
Furthermore, in the embodiment of the present invention, the keywords obtained in step 206 are input into the model as seed keywords, and the similarity calculation algorithm of the trained model is used to generate keywords strongly associated with the seed keywords. Each generated keyword carries an association strength coefficient, and the words are sorted from high to low by this coefficient. Finally, a preset number of words are selected and added to the keyword set.
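Ranking associated keywords by association strength can be sketched as follows, assuming that cosine similarity between word vectors is used as the association strength coefficient. The similarity measure, the three-dimensional toy vectors, and the function names are assumptions for illustration; trained word vectors would have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def associated_keywords(seed, vectors, top_n):
    """Return the top_n (word, coefficient) pairs most similar to the seed keyword."""
    scored = [(word, cosine(vectors[seed], vec)) for word, vec in vectors.items() if word != seed]
    scored.sort(key=lambda item: item[1], reverse=True)  # sort from high to low
    return scored[:top_n]


# Hypothetical word vectors, invented for this example.
TOY_VECTORS = {
    "废水": [0.9, 0.1, 0.0],
    "污水": [0.8, 0.2, 0.1],
    "排污": [0.6, 0.5, 0.1],
    "噪声": [0.0, 0.2, 0.9],
}
print(associated_keywords("废水", TOY_VECTORS, 2))
```

The returned words would then be added to the keyword set alongside the seed keywords.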
In addition, in the embodiment of the present invention, the seed keywords and the generated associated keywords can be merged, and the keyword set can be classified according to a preset classification system. For example, the classification system can be divided into two levels. The first level comprises event name, action, response, place, object, time, pollution type, state, degree adverb, and result. The second level subdivides action (victim action, victim feeling, pollution action), response (whether the government responds, response level, response action, subsequent response of the polluter), place (physical place, orientation place), object (victim, polluted object, pollutant discharger, pollutant, manager), time (moment, period, time span), pollution type (noise pollution, soil pollution, plastic pollution, water pollution, air pollution), and state (pre-pollution state, post-pollution state). This makes the attributes of the keywords clearer.
With the keyword generation device provided in the embodiment of the present invention, environment corpus information is obtained, keywords are extracted from it, and words associated with the keywords are calculated and extracted, so that keywords can be generated automatically and comprehensively, the speed and accuracy of keyword generation are improved, and the keyword set can keep pace with the updating of online public opinion in real time.
EXAMPLE IV
As shown in fig. 11, in one embodiment, there is provided a public opinion monitoring device, the device comprising:
an information obtaining unit 1110, configured to obtain environmental public opinion information, perform word segmentation processing on the environmental public opinion information, and determine at least one public opinion corpus word;
the information monitoring unit 1120 is configured to compare the public sentiment corpus words with keywords in a preset keyword set, and send alarm information to the client for alarm if at least one of the public sentiment corpus words is the same as a keyword in the preset keyword set, where the preset keyword set is obtained by executing the keyword generation method in the first embodiment of the present invention.
Specifically, in the embodiment of the present invention, the keywords in the keyword set are used as indexes, and the environmental public opinion information is searched by traversing the keywords. If, in addition to the keyword used for the search, other keywords in the keyword list also appear in the information, an alarm is issued, and the information is classified according to the classification system defined by the keyword list, for example by polluted object, pollution action, and pollution time, and is displayed in the environmental public opinion result display module.
The public opinion monitoring device provided by the embodiment of the invention can effectively monitor the environmental public opinion by monitoring the environmental public opinion information of the network environment and comparing the environmental public opinion information with the keywords in real time, thereby improving the accuracy and timeliness of the environmental public opinion monitoring.
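The comparison performed by the information monitoring unit can be sketched as a lookup of each public opinion corpus word in the preset keyword set. In this sketch the keyword set is assumed to be stored as a mapping from keyword to classification category, and the alarm is represented by a returned record rather than a message sent to a client; both choices, like the function name, are illustrative assumptions.

```python
def monitor(public_opinion_words, keyword_categories):
    """Compare segmented public opinion words with the keyword set and group the matches."""
    matches = {}
    for word in public_opinion_words:
        category = keyword_categories.get(word)
        if category is not None:
            matches.setdefault(category, []).append(word)
    # An alarm is raised as soon as at least one word matches a preset keyword.
    return {"alarm": bool(matches), "matches": matches}


# Hypothetical keyword set with categories from the classification system.
TOY_KEYWORDS = {"废水": "pollutant", "排放": "pollution action", "化肥厂": "pollutant discharger"}
print(monitor(["化肥厂", "经常", "排放", "废水"], TOY_KEYWORDS))
```

Grouping the matches by category gives the classified view that can then be shown in a result display module.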
EXAMPLE V
In one embodiment, a computer device is proposed, the computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring environment corpus information, and extracting corpus words from the environment corpus information to obtain a corpus word set;
screening the corpus words contained in the corpus word set according to a preset noise corpus word library so as to remove the noise corpus words;
and performing weight calculation on each corpus word in the corpus word set after the noise corpus words are removed, and combining a preset number of corpus words, taken in descending order of weight, into a keyword set for output.
In addition, in other embodiments of the present invention, the computer device may further perform the steps of:
acquiring environmental public opinion information, performing word segmentation processing on the environmental public opinion information, and determining at least one public opinion corpus word;
and comparing the public opinion corpus words with keywords in a preset keyword set, and if at least one of the public opinion corpus words is the same as the keywords in the preset keyword set, sending alarm information to a client to give an alarm, wherein the preset keyword set is obtained by executing the keyword generation method in the first embodiment of the invention.
FIG. 12 is a diagram illustrating the internal structure of a computer device in one embodiment. The computer device may be an independent physical server or terminal, a server cluster formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud servers, cloud databases, cloud storage, and CDN. It is not limited thereto, and may also be a smart phone, tablet computer, notebook computer, desktop computer, smart speaker, smart watch, or the like. As shown in fig. 12, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the keyword generation method and/or the public opinion monitoring method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the keyword generation method and/or the public opinion monitoring method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, trackball, or touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, or mouse.
Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
EXAMPLE VI
In one embodiment, a computer readable storage medium is provided, having a computer program stored thereon, which, when executed by a processor, causes the processor to perform the steps of:
acquiring environment corpus information, and extracting corpus words from the environment corpus information to obtain a corpus word set;
screening the corpus words contained in the corpus word set according to a preset noise corpus word library so as to remove the noise corpus words;
and performing weight calculation on each corpus word in the corpus word set after the noise corpus words are removed, and combining a preset number of corpus words, taken in descending order of weight, into a keyword set for output.
Further, in other embodiments of the invention, the computer program, when executed by the processor, causes the processor to perform the further steps of:
acquiring environmental public opinion information, performing word segmentation processing on the environmental public opinion information, and determining at least one public opinion corpus word;
and comparing the public opinion corpus words with keywords in a preset keyword set, and if at least one of the public opinion corpus words is the same as the keywords in the preset keyword set, sending alarm information to a client to give an alarm, wherein the preset keyword set is obtained by executing the keyword generation method in the first embodiment of the invention.
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in the sequence indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise, the steps are not restricted to the exact order shown and described and may be performed in other orders. Moreover, at least some of the steps in the various embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described, but any combination should be considered to fall within the scope of this specification as long as the combined features do not contradict one another.
The above examples show only some embodiments of the present invention, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for generating keywords, the method comprising:
acquiring environment corpus information, and extracting corpus words from the environment corpus information to obtain a corpus word set;
screening the corpus words contained in the corpus word set according to a preset noise corpus word library so as to remove the noise corpus words;
and performing weight calculation on each corpus word in the corpus word set after the noise corpus words are removed, and combining a preset number of corpus words, taken in descending order of weight, into a keyword set for output.
2. The method according to claim 1, wherein the obtaining environmental corpus information and extracting corpus words from the environmental corpus information to obtain a corpus word set specifically includes:
acquiring environment corpus information, wherein the environment corpus information comprises at least one corpus text;
and comparing the characters of the corpus text with the words in a preset lexicon dictionary from left to right and/or from right to left in sequence, determining all the corpus words contained in the corpus text, and generating a corpus word set.
3. The method according to claim 2, wherein the performing weight calculation on each corpus word in the corpus word set after the noise corpus words are removed, and combining a preset number of corpus words, taken in descending order of weight, into a keyword set for output specifically includes:
calculating the occurrence frequency of a corpus word according to the number of occurrences of the corpus word in a corpus text and the total number of occurrences of all corpus words in the same corpus text;
calculating the reverse file frequency of the corpus word according to the number of corpus texts in the environment corpus information and the number of corpus texts containing the corpus word;
calculating the weight of the corpus word according to the occurrence frequency and the reverse file frequency of the corpus word;
and after calculating the weights of all corpus words in the corpus word set, combining a preset number of corpus words, taken in descending order of weight, into a keyword set and outputting the keyword set.
4. The method according to claim 1, wherein the screening the corpus words contained in the corpus word set according to a preset noise corpus word library so as to remove the noise corpus words specifically comprises:
comparing the corpus words contained in the corpus word set one by one with the noise corpus words contained in the preset noise corpus word library, and removing a corpus word from the corpus word set if it matches a noise corpus word; wherein the noise corpus words contained in the preset noise corpus word library comprise at least one of stop words and punctuation marks.
5. The method according to claim 1, wherein after calculating the weights of all the corpus words in the corpus word set, and before outputting a preset number of corpus words, taken in descending order of weight, as keywords, the method further comprises:
inputting the preset number of corpus words, taken in descending order of weight, as seed corpus words into a preset first neural network model, and calculating associated keywords associated with the seed corpus words through the first neural network model, wherein the first neural network model is obtained by training parameters in a second neural network model with a preset original corpus set;
and adding a preset number of associated keywords, taken in descending order of association strength with the seed corpus words, to the keyword set for output.
6. A public opinion monitoring method, the method comprising:
acquiring environmental public opinion information, performing word segmentation processing on the environmental public opinion information, and determining at least one public opinion corpus word;
and comparing the public opinion corpus words with keywords in a preset keyword set, and if at least one of the public opinion corpus words is the same as a keyword in the preset keyword set, sending alarm information to a client to give an alarm, wherein the preset keyword set is obtained by executing the keyword generation method of any one of claims 1 to 5.
7. An apparatus for generating keywords, the apparatus comprising:
the corpus word extracting unit is used for acquiring environment corpus information and extracting corpus words from the environment corpus information to obtain a corpus word set;
the corpus word screening unit is used for screening corpus words contained in the corpus word set according to a preset noise corpus word library so as to remove noise corpus words;
and the keyword determining unit is used for performing weight calculation on each corpus word in the corpus word set after the noise corpus words are removed, and combining a preset number of corpus words, taken in descending order of weight, into a keyword set for output.
8. A public opinion monitoring device, the device comprising:
the information acquisition unit is used for acquiring environmental public opinion information, performing word segmentation processing on the environmental public opinion information and determining at least one public opinion corpus word;
and the information monitoring unit is used for comparing the public opinion corpus words with keywords in a preset keyword set, and if at least one of the public opinion corpus words is the same as a keyword in the preset keyword set, sending alarm information to a client to give an alarm, wherein the preset keyword set is obtained by executing the keyword generation method of any one of claims 1 to 5.
9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to perform the steps of the keyword generation method of any one of claims 1 to 5 or the public opinion monitoring method of claim 6.
10. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when executed by a processor, causes the processor to perform the steps of the keyword generation method of any one of claims 1 to 5 or the public opinion monitoring method of claim 6.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910929525.6A CN112541105A (en) 2019-09-20 2019-09-20 Keyword generation method, public opinion monitoring method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN112541105A true CN112541105A (en) 2021-03-23

Family

ID=75013161

Country Status (1)

Country Link
CN (1) CN112541105A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544255A (en) * 2013-10-15 2014-01-29 常州大学 Text semantic relativity based network public opinion information analysis method
CN107153658A (en) * 2016-03-03 2017-09-12 常州普适信息科技有限公司 A kind of public sentiment hot word based on weighted keyword algorithm finds method
CN108763213A (en) * 2018-05-25 2018-11-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Theme feature text key word extracting method
CN109325165A (en) * 2018-08-29 2019-02-12 中国平安保险(集团)股份有限公司 Internet public opinion analysis method, apparatus and storage medium
CN109472018A (en) * 2018-09-26 2019-03-15 深圳壹账通智能科技有限公司 Enterprise's public sentiment monitoring method, device, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
CN106844658B (en) Automatic construction method and system of Chinese text knowledge graph
CN107066446B (en) Logic rule embedded cyclic neural network text emotion analysis method
CN107229610B (en) Method and device for analyzing sentiment data
CN111291195B (en) Data processing method, device, terminal and readable storage medium
CN110750640B (en) Text data classification method and device based on neural network model and storage medium
US7983902B2 (en) Domain dictionary creation by detection of new topic words using divergence value comparison
CN112800170A (en) Question matching method and device and question reply method and device
US9965726B1 (en) Adding to a knowledge base using an ontological analysis of unstructured text
US20080052262A1 (en) Method for personalized named entity recognition
CN111950273A (en) Network public opinion emergency automatic identification method based on emotion information extraction analysis
CN104765769A (en) Short text query expansion and indexing method based on word vector
CN113094578A (en) Deep learning-based content recommendation method, device, equipment and storage medium
CN110019820B (en) Method for detecting time consistency of complaints and symptoms of current medical history in medical records
CN113515589B (en) Data recommendation method, device, equipment and medium
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN112559747A (en) Event classification processing method and device, electronic equipment and storage medium
US8140464B2 (en) Hypothesis analysis methods, hypothesis analysis devices, and articles of manufacture
CN115795030A (en) Text classification method and device, computer equipment and storage medium
CN114547257B (en) Class matching method and device, computer equipment and storage medium
CN115098706A (en) Network information extraction method and device
CN116756347A (en) Semantic information retrieval method based on big data
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN112231513A (en) Learning video recommendation method, device and system
CN109344397B (en) Text feature word extraction method and device, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination