CN107544988B - Method and device for acquiring public opinion data - Google Patents

Method and device for acquiring public opinion data

Info

Publication number
CN107544988B
Authority
CN
China
Prior art keywords
entity
data
keywords
subunit
extracted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610482038.6A
Other languages
Chinese (zh)
Other versions
CN107544988A (en)
Inventor
王私江
赵辉
高显
岳爱珍
谭静
崔燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201610482038.6A
Publication of CN107544988A
Application granted
Publication of CN107544988B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method and a device for acquiring public opinion data, wherein an entity word bank is mined in advance, and the entity word bank comprises keywords for describing corresponding entities; when public opinion data is acquired, extracting keywords from the acquired webpage data; similarity matching is carried out on the extracted keywords and each entity word bank, and an entity corresponding to the entity word bank with the similarity meeting the preset requirement is determined; and taking the webpage data as the public opinion data of the determined entity. The public opinion data acquisition method can automatically acquire public opinion data, greatly reduces labor cost and improves the coverage rate of the public opinion data compared with a mode of manually collecting the public opinion data.

Description

Method and device for acquiring public opinion data
[ technical field ]
The invention relates to the technical field of computer application, in particular to a method and a device for acquiring public opinion data.
[ background of the invention ]
Network public opinion takes the network as its carrier and events as its core; it is the expression, dissemination and interaction of the sentiments, attitudes, opinions and viewpoints of netizens, and is the reflection of social public opinion in the internet space. With the continuous development of the internet, many companies, enterprises and other organizations need to follow network public opinion continuously in order to analyze it and track their own standing within it, thereby generating public opinion early warnings and providing a data basis for the crisis public relations or brand marketing of the relevant departments. In addition, netizens also need to follow network public opinion continuously, for example as a basis for choosing a safe service provider or for making financial investment decisions.
However, most conventional methods for acquiring network public opinion data rely on manual collection: for example, companies or enterprises employ staff dedicated to collecting and analyzing public opinion data, and netizens follow the related news by themselves. On one hand, these methods consume human resources; on the other hand, the coverage of the acquired public opinion data is low.
[ summary of the invention ]
In view of the above, the present invention provides a method and an apparatus for acquiring public sentiment data, so as to automatically acquire the public sentiment data, reduce labor cost, and improve coverage of the public sentiment data.
The specific technical scheme is as follows:
the invention provides a method for acquiring public opinion data, in which an entity word bank is mined in advance, the entity word bank comprising keywords for describing the corresponding entity; the method comprises the following steps:
extracting keywords from the acquired webpage data;
similarity matching is carried out on the extracted keywords and each entity word bank, and an entity corresponding to the entity word bank with the similarity meeting the preset requirement is determined;
and taking the webpage data as the public opinion data of the determined entity.
According to a preferred embodiment of the present invention, the mining entity thesaurus includes:
obtaining authoritative data of a mined entity;
extracting keywords from the authoritative data;
and taking the extracted keyword set as a word stock of the mined entity.
According to a preferred embodiment of the present invention, the obtaining authoritative data of the mined entity includes:
and acquiring at least one of the name of the mined entity, the official website data and clicked webpage data corresponding to the query containing the mined entity.
According to a preferred embodiment of the present invention, the extracting the keywords comprises:
and segmenting the acquired webpage data, and extracting keywords from the segmented words based on at least one of tf-idf, part of speech, sentence components and context characteristics.
According to a preferred embodiment of the present invention, the mining entity thesaurus further comprises:
filtering the extracted keywords, and using the extracted keyword set as a word bank of the mined entity comprises:
and taking the keyword set obtained after the filtering processing is carried out on the extracted keywords as a word bank of the mined entity.
According to a preferred embodiment of the present invention, the filtering process of the extracted keywords includes at least one of:
filtering the extracted keywords based on a manual mode;
similarity matching is carried out on the extracted keywords and the determined names of other entities, and if the matched names exist, the keywords are deleted;
and performing similarity matching on the extracted keywords and the webpage data, and deleting the keywords if the number of the matched webpage data exceeds a preset number threshold.
According to a preferred embodiment of the invention, the method further comprises: and respectively performing at least one of the following processing on the public opinion data of each entity:
deduplicating the public opinion data;
deleting non-compliant public opinion data;
and inputting the public opinion data of the current entity into a trained subject recognition model, and deleting the public opinion data if the subject recognized for it is not the current entity.
According to a preferred embodiment of the present invention, the subject recognition model is trained as follows:
taking web page data whose subject has been determined as a training corpus;
and performing conditional random field learning based on at least one of the following features extracted from the training corpus: keywords, keyword positions, parts of speech, sentence components and context, to obtain the subject recognition model.
According to a preferred embodiment of the invention, the method further comprises:
and performing sentiment analysis on the public opinion data of each entity, and labeling each item of public opinion data with its sentiment analysis result.
According to a preferred embodiment of the invention, the entity comprises an organisation;
the web page data includes a news web page.
The invention also provides a device for acquiring public opinion data, which comprises a word stock mining unit and a public opinion obtaining unit;
the word stock mining unit is used for mining an entity word stock in advance, and the entity word stock comprises keywords for describing the corresponding entity;
the public opinion obtaining unit includes:
a second extraction subunit, configured to extract keywords from the acquired web page data;
a matching subunit, configured to perform similarity matching between the keywords extracted by the second extraction subunit and each entity lexicon, and to determine the entity corresponding to the entity lexicon whose similarity meets the preset requirement;
and a second determining subunit, configured to take the web page data as the public opinion data of the entity determined by the matching subunit.
According to a preferred embodiment of the present invention, the word stock mining unit includes:
the first obtaining subunit is used for obtaining the authoritative data of the mined entity;
the first extraction subunit is used for extracting keywords from the authority data;
and the first determining subunit is used for taking the keyword set extracted by the first extracting subunit as a word stock of the mined entity.
According to a preferred embodiment of the present invention, the first obtaining subunit is specifically configured to obtain at least one of a name of the mined entity, official website data, and clicked web page data corresponding to a query including the mined entity.
According to a preferred embodiment of the present invention, the second extraction subunit is specifically configured to:
and segmenting the acquired webpage data, and extracting keywords from the segmented words based on at least one of tf-idf, part of speech, sentence components and context characteristics.
According to a preferred embodiment of the present invention, the word stock mining unit further includes:
and the first filtering subunit is used for filtering the keywords extracted by the first extracting subunit and providing the keywords to the first determining subunit to obtain the word bank of the mined entity.
According to a preferred embodiment of the invention, the first filtering subunit performs at least one of the following filtering processes:
filtering the extracted keywords based on a manual mode;
similarity matching is carried out on the keywords extracted by the first extraction subunit and the determined names of other entities, and if the matched names exist, the keywords are deleted;
and performing similarity matching on the keywords extracted by the first extraction subunit and the webpage data, and deleting the keywords if the number of the matched webpage data exceeds a preset number threshold.
According to a preferred embodiment of the present invention, the public opinion obtaining unit further comprises:
the second filtering subunit is used for respectively performing at least one of the following processing on the public opinion data of each entity:
deduplicating the public opinion data;
deleting non-compliant public opinion data;
and inputting the public opinion data of the current entity into a trained subject recognition model, and deleting the public opinion data if the subject recognized for it is not the current entity.
According to a preferred embodiment of the invention, the apparatus further comprises:
the model training unit is used for taking web page data whose subject has been determined as a training corpus, and performing conditional random field learning based on at least one of the following features extracted from the training corpus: keywords, keyword positions, parts of speech, sentence components and context, to obtain the subject recognition model.
According to a preferred embodiment of the invention, the apparatus further comprises:
and a sentiment analysis unit, used for performing sentiment analysis on the public opinion data of each entity and labeling each item of public opinion data with its sentiment analysis result.
According to a preferred embodiment of the present invention, the public opinion obtaining unit further includes:
and the second acquiring subunit is used for acquiring the webpage data.
According to the above technical scheme, keywords extracted from web page data are matched for similarity against entity word banks mined in advance, so as to determine whether the web page data is public opinion data of a certain entity. In this way, public opinion data can be acquired automatically; compared with manual collection, labor cost is greatly reduced and the coverage of public opinion data is improved.
[ description of the drawings ]
Fig. 1 is a flowchart of a method of an entity lexicon mining phase according to an embodiment of the present invention;
fig. 2 is a flowchart of a public opinion data obtaining stage according to an embodiment of the present invention;
fig. 3 is a diagram illustrating an apparatus according to an embodiment of the present invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association between associated objects and means that three relationships may exist; e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
The core idea of the invention is that an entity lexicon is mined in advance, where the entity lexicon comprises keywords describing the corresponding entity; keywords are then extracted from web page data and matched for similarity against each entity lexicon, thereby determining whether the web page data is public opinion data of a certain entity. That is, the implementation of the present invention mainly includes two stages: an entity lexicon mining stage and a public opinion data acquisition stage. The public opinion data acquisition stage uses the result of the entity lexicon mining stage, but both stages can be executed periodically so that the entity lexicon is continuously updated and public opinion data is continuously acquired. The two stages are described in detail below by way of example.
In addition, it should be noted that the entities in the embodiments of the present invention may be of various types, such as person names, movie names, and the like. Preferably, the method is applied to organizations such as companies, enterprises, institutions and communities, so that public opinion monitoring can be performed on these organizations. Since the important public opinion data of organizations such as companies and enterprises is generally news data, found in news web pages on the internet, the following embodiments use organizations as the entity type and news web pages as the web page data type.
Fig. 1 is a flowchart of a method of an entity lexicon mining phase provided in an embodiment of the present invention, and as shown in fig. 1, the phase may specifically include the following steps:
in 101, authoritative data for the mined entity is obtained.
In this step, a mining corpus is acquired for the mined entity. Since the present invention matches web page data against entity lexicons, the entity lexicon must be sufficiently accurate, that is, the keywords it contains must accurately describe the corresponding entity; authoritative data is therefore used as the mining corpus. In an embodiment of the present invention, the authoritative data sources for the mined entity may include the following:
1) The name of the mined entity. An entity is described most accurately by its names, including its full name, short name, nicknames, common names, and the like. Taking the entity "Jingdong" as an example, its full name is "Beijing Jingdong Century Trading Co., Ltd.", its short name is "Jingdong", its nicknames include "east dog", and its common names also include "JD"; all of these names can be used as mining corpus.
2) The official web data of the mined entity. Generally, the data of an entity on the official website can accurately describe the entity, so that the official website data can be used as an important mining corpus. For example, there are some blocks or web pages in the official website to introduce related entities, such as "about us" blocks, from which the corpus can be mined.
3) The clicked web page data corresponding to queries containing the mined entity. In the search log, when a user searches for an entity, the web pages clicked in the search results usually describe the entity well or are strongly related to it, so this web page data can be used as mining corpus. Furthermore, since some user clicks are relatively blind, the clicked web page data may be further filtered, for example by removing web pages whose click count is below a certain threshold, or by filtering on the type or authority of the site to which the web page belongs.
4) Other sources, such as review data for a review-like website, promotional information and authentication data for mined entities on an authoritative website, and so on.
Still taking "Jingdong" as an example, comment data about "Jingdong" may be acquired from review websites such as "Dazhong Dianping" and "Baidu Koubei". Such review websites generally have a web page for each entity on which users comment on that entity, so comment data can be obtained from these pages as mining corpus.
In addition, some entities place promotional information on authoritative websites. For example, "Jingdong" may have promotional information on the Baidu search engine. To be effective, such promotional information generally describes the corresponding entity fairly accurately, and it usually carries a description text such as "jingdong jd.com-professional comprehensive online shopping mall, and tens of thousands of brands of high-quality goods such as home appliances, digital communications, computers, household goods, clothing mothers and babies, books, foods, etc. the convenient and honest service provides joyful online.
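The click-based filtering mentioned for source 3) above can be sketched as follows. This is only an illustration under assumptions: the log format (a list of query/URL pairs), the function name `filter_clicked_pages`, the example URLs and the threshold value are all hypothetical and are not specified by the patent.

```python
from collections import Counter

def filter_clicked_pages(click_log, entity_query, min_clicks=2):
    """Keep URLs clicked at least `min_clicks` times for queries containing the entity.

    click_log: iterable of (query, clicked_url) pairs (assumed log format).
    """
    counts = Counter(url for query, url in click_log if entity_query in query)
    return [url for url, n in counts.items() if n >= min_clicks]

# Toy search log; all URLs are placeholders.
click_log = [
    ("jingdong", "https://example.com/a"),
    ("jingdong mall", "https://example.com/a"),
    ("jingdong", "https://example.com/b"),
    ("other", "https://example.com/c"),
]
kept = filter_clicked_pages(click_log, "jingdong", min_clicks=2)
```

Filtering on site type or authority, which the text also mentions, would need an additional per-site lookup that is not shown here.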
At 102, keywords are extracted from the authoritative data.
In this step, the obtained authoritative data may be firstly segmented, and then keywords may be extracted from the words obtained by the segmentation based on at least one of tf-idf, part of speech, sentence components, and context characteristics.
Generally, tf-idf of a word in a text can accurately reflect the importance degree of the word in the text, wherein tf is the word frequency, and idf is the inverse document frequency. Therefore, words in which tf-idf exceeds a preset threshold may be extracted as keywords.
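A minimal sketch of threshold-based tf-idf extraction is given below. It assumes the text has already been segmented into tokens; the smoothed idf formula, the function name `tfidf_keywords` and the threshold value are illustrative assumptions, since the patent does not fix a particular formula.

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus, threshold):
    """Return {word: score} for words of `doc_tokens` whose tf-idf exceeds `threshold`.

    corpus: list of token lists used to estimate document frequency.
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in corpus if word in d)
        # Smoothed inverse document frequency (one common variant, an assumption here).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[word] = (count / len(doc_tokens)) * idf
    return {w: s for w, s in scores.items() if s > threshold}

corpus = [
    ["mall", "online", "shopping", "mall"],
    ["online", "news"],
    ["online", "weather"],
]
keywords = tfidf_keywords(corpus[0], corpus, threshold=0.3)
```

In this toy corpus "online" occurs in every document, so its score stays below the threshold, while "mall" and "shopping" are kept.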
As for part of speech, words that describe an entity accurately are usually nouns; adjectives or verbs may also qualify, but nouns are preferred, so nouns can be extracted as keywords.
In addition, the subject or object is often important in a sentence, so that the subject and object can be extracted as keywords based on the sentence components.
For a certain type of entity, there are typical contextual features in text that describes it, or in the public opinion data of interest. For example, for an entity such as a company, when features such as "purchase", "buy", "sell" or "financing" appear in the context, keywords can be extracted according to these contextual features. For example, if "Apple buys Didi" appears in a text, then "Didi" and "Apple" can be extracted as keywords.
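Context-based extraction of this kind can be sketched as taking the words adjacent to a trigger word. The trigger list, the function name `context_keywords` and the adjacency rule are assumptions made for illustration; a real system would likely use richer context windows.

```python
# Hypothetical trigger words for company-type entities.
TRIGGERS = {"acquires", "buys", "sells", "finances"}

def context_keywords(tokens, triggers=TRIGGERS):
    """Return the tokens immediately before and after each trigger word."""
    found = []
    for i, tok in enumerate(tokens):
        if tok in triggers:
            if i > 0:
                found.append(tokens[i - 1])
            if i + 1 < len(tokens):
                found.append(tokens[i + 1])
    return found

result = context_keywords(["CompanyA", "acquires", "CompanyB"])
```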
Several of the above approaches can also be combined: a weight is assigned to each factor, a final score is computed for each candidate keyword from its value on each factor and the corresponding weight, and keywords whose scores rank highest or exceed a certain value are extracted.
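The weighted combination can be sketched as a linear score over per-factor values. The factor names, weights and cut-offs below are hypothetical; the patent only states that factors are weighted and that keywords are selected by score.

```python
def combined_score(features, weights):
    """Weighted sum of a candidate's factor values; missing factors count as 0."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)

def select_keywords(candidates, weights, min_score=None, top_n=None):
    """Rank candidates by combined score, then apply a score cut-off and/or a top-N cut."""
    scored = sorted(
        ((combined_score(f, weights), w) for w, f in candidates.items()),
        reverse=True,
    )
    if min_score is not None:
        scored = [(s, w) for s, w in scored if s > min_score]
    if top_n is not None:
        scored = scored[:top_n]
    return [w for _, w in scored]

# Hypothetical factor values for two candidate words.
candidates = {
    "mall": {"tfidf": 0.8, "is_noun": 1.0, "is_subject_or_object": 1.0},
    "convenient": {"tfidf": 0.3, "is_noun": 0.0},
}
weights = {"tfidf": 0.5, "is_noun": 0.3, "is_subject_or_object": 0.2}
selected = select_keywords(candidates, weights, min_score=0.5)
```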
At 103, the extracted keywords are filtered.
In order to further improve the accuracy of the entity word bank and reduce the influence of redundant keywords or inaccurate keywords on the acquisition of subsequent public opinion data, the extracted keywords can be further filtered. The filtering means employed may include, but is not limited to, at least one of:
The first filtering mode: filtering the extracted keywords manually. After the keyword set of each entity has been extracted automatically as described above, it can be submitted to a reviewer for manual review, and inappropriate keywords can be deleted. The workload of this review is relatively small, so the labor cost is low.
The second filtering mode: performing similarity matching between the extracted keywords and the determined names of other entities, and deleting a keyword if a matching name exists. Some extracted keywords may be closer to other entities, and they should be deleted to avoid interference. For example, when keywords are extracted based on the contextual feature "purchase", the two keywords "Apple" and "Didi" may be extracted for the entity "Didi Express". When the two keywords are matched against the names of other entities, the keyword "Apple" turns out to be highly similar to the name of another entity, which indicates that it is not a keyword of "Didi Express", so it can be deleted from the keyword set of "Didi Express".
The third filtering mode: performing similarity matching between the extracted keywords and web page data, and deleting a keyword if the number of matched web pages exceeds a preset number threshold.
If a certain extracted keyword can be matched with a large amount of webpage data, the keyword is not high in discrimination, and the existence of the keyword can interfere with the acquisition of subsequent public opinion data, so that the keyword can be deleted.
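The second and third filtering modes can be sketched as below. The patent does not name a similarity measure, so this sketch assumes character-level similarity from Python's `difflib` for name matching and plain substring matching against page text; function names and thresholds are hypothetical.

```python
from difflib import SequenceMatcher

def drop_keywords_matching_other_entities(keywords, other_entity_names, threshold=0.8):
    """Second mode: delete keywords that closely match the name of another entity."""
    kept = []
    for kw in keywords:
        best = max(
            (SequenceMatcher(None, kw.lower(), name.lower()).ratio()
             for name in other_entity_names),
            default=0.0,
        )
        if best < threshold:
            kept.append(kw)
    return kept

def drop_low_discrimination_keywords(keywords, pages, max_pages):
    """Third mode: delete keywords that match more than `max_pages` web pages."""
    return [kw for kw in keywords
            if sum(1 for page in pages if kw in page) <= max_pages]

after_names = drop_keywords_matching_other_entities(
    ["apple", "ride-hailing"], ["Apple", "Example Bank"])
pages = ["company news today", "sports news", "weather news", "mall launch"]
after_pages = drop_low_discrimination_keywords(["news", "mall"], pages, max_pages=2)
```

Here "apple" is removed because it matches another entity's name, and "news" is removed because it appears in too many pages to discriminate between entities.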
At 104, the processed keyword set is used as a word stock of the mined entity.
By the method, the corresponding keyword set can be mined for each mined entity to serve as the word stock corresponding to the entity. For an entity of an organization, a set of keywords such as business names, product words, industry, known people, territories, etc. may be mined as a thesaurus for the organization.
Fig. 2 is a flowchart of a public opinion data obtaining stage according to an embodiment of the present invention, as shown in fig. 2, the stage may specifically include the following steps:
in 201, web page data is acquired.
When public opinion data is monitored for an entity, newly appearing web page data can be acquired periodically or in real time in order to judge whether it is public opinion data of the entity. For example, news web pages are acquired periodically or in real time.
At 202, keywords are extracted from the web page data.
In this step, keywords may be extracted from a news title, a news abstract, some or all paragraphs of the news body, and the like. The extraction is similar to the way step 102 extracts keywords from the authoritative data in the embodiment shown in fig. 1: the web page data is segmented, and keywords are extracted from the resulting words based on at least one of tf-idf, part of speech, sentence components and context features. Refer to the description of the embodiment shown in fig. 1 for details, which are not repeated here.
In 203, similarity matching is performed between the extracted keywords and each entity lexicon, and an entity corresponding to the entity lexicon with similarity meeting preset requirements is determined.
The keywords extracted from the web page data are matched for similarity against each entity lexicon; if the similarity between an entity's lexicon and the extracted keywords meets the preset requirement, for example exceeds a certain threshold, the web page data is taken as public opinion data of that entity.
At 204, the webpage data is used as the public opinion data of the determined entity.
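Steps 203 and 204 can be sketched as a set-overlap comparison between the page's keywords and each entity lexicon. The Jaccard coefficient, the threshold and the entity names used here are assumptions; the patent leaves the similarity measure and the preset requirement open.

```python
def lexicon_similarity(page_keywords, lexicon):
    """Jaccard similarity between a page's keywords and an entity lexicon (assumed measure)."""
    page_keywords, lexicon = set(page_keywords), set(lexicon)
    if not page_keywords or not lexicon:
        return 0.0
    return len(page_keywords & lexicon) / len(page_keywords | lexicon)

def match_entities(page_keywords, lexicons, threshold=0.3):
    """Return the entities whose lexicon similarity exceeds the threshold."""
    return [entity for entity, lexicon in lexicons.items()
            if lexicon_similarity(page_keywords, lexicon) > threshold]

# Hypothetical lexicons mined in the first stage.
lexicons = {
    "EntityA": {"mall", "online", "shopping", "appliances"},
    "EntityB": {"bank", "loan", "credit"},
}
matched = match_entities({"mall", "shopping", "discount"}, lexicons)
```

The page is then stored as public opinion data for every entity in `matched`.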
At 205, public opinion data of each entity is deduplicated and/or deleted.
By adopting the steps, a series of public opinion data of each entity can be obtained, but the public opinion data may have some repetition, so that the public opinion data can be subjected to deduplication processing. In performing deduplication, it may be determined whether two web page data have the same content based on text similarity.
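Deduplication by text similarity can be sketched as follows. The use of `difflib.SequenceMatcher`, the threshold of 0.9 and the pairwise comparison are assumptions for illustration; large collections would normally need a more scalable near-duplicate technique.

```python
from difflib import SequenceMatcher

def deduplicate(pages, threshold=0.9):
    """Keep a page only if it is not near-identical to a page already kept."""
    unique = []
    for text in pages:
        if all(SequenceMatcher(None, text, kept).ratio() < threshold for kept in unique):
            unique.append(text)
    return unique

pages = [
    "Company A announces new product line today.",
    "Company A announces new product line today!",
    "Company B reports quarterly results.",
]
unique_pages = deduplicate(pages)
```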
When the deletion process is performed, the following two ways may be adopted, but not limited to:
The first mode: deleting non-compliant public opinion data. For example, public opinion data that does not comply with laws and regulations is deleted; public opinion data containing sensitive words, such as pornographic, violent or reactionary terms, can be filtered out.
The second mode: inputting the public opinion data of the current entity into a trained subject recognition model, and deleting the public opinion data if the subject recognized for it is not the current entity.
Some entities, for example small and medium-sized enterprises, have little public opinion data, so it is desirable to obtain as much of it as possible. Large or well-known enterprises, by contrast, have a great deal of public opinion data, which needs further pruning to obtain more accurate and valuable data. In embodiments of the present invention, a subject recognition model may be used to recognize the subject of web page data, for example the subject of a news story. Given an input text, the subject recognition model identifies the subject words in it. If the subject recognized for an item of public opinion data is the entity to which that data was assigned, the data is retained for the entity; otherwise, the data is deleted for that entity.
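The two deletion modes can be sketched together as one filter. The subject recognizer is passed in as a callable; `first_token_subject` below is only a toy stand-in for the trained model, and the sensitive-word list is a placeholder.

```python
def filter_public_opinion(pages, entity, identify_subject, sensitive_words=()):
    """Drop pages containing sensitive words or whose recognized subject is not `entity`."""
    kept = []
    for text in pages:
        if any(word in text for word in sensitive_words):
            continue  # first mode: non-compliant content
        if identify_subject(text) != entity:
            continue  # second mode: subject is a different entity
        kept.append(text)
    return kept

def first_token_subject(text):
    """Toy stand-in for the subject recognition model: treats the first token as the subject."""
    return text.split()[0]

pages = [
    "EntityA launches new service",
    "EntityB sues EntityA",
    "EntityA forbidden content",
]
kept_pages = filter_public_opinion(
    pages, "EntityA", first_token_subject, sensitive_words=("forbidden",))
```

The second page mentions EntityA but is about EntityB, which is the case the subject recognition step is meant to remove.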
When training the subject recognition model, web page data whose subject has been determined may be used as the training corpus, in which subject words and non-subject words are labeled. Conditional random field learning is then performed based on at least one of the following features extracted from the training corpus: keywords, positions, parts of speech, sentence components and context; that is, these features are extracted for the subject words and the non-subject words respectively, to obtain the subject recognition model.
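The feature side of this training step can be sketched as building one feature dictionary per token, paired with a subject/non-subject label. The feature names, the label scheme ("SUBJ"/"O") and the assumption that tokens and part-of-speech tags are already available are all illustrative; the conditional random field learner itself is not shown and would come from a separate toolkit.

```python
def token_features(tokens, pos_tags, keywords, i):
    """Features for token i: the word, its part of speech, keyword flag, position, context."""
    return {
        "word": tokens[i],
        "pos": pos_tags[i],
        "is_keyword": tokens[i] in keywords,
        "position": i / len(tokens),  # relative position in the sentence
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "<EOS>",
    }

def sentence_features(tokens, pos_tags, keywords):
    """One feature dict per token, in sentence order."""
    return [token_features(tokens, pos_tags, keywords, i) for i in range(len(tokens))]

tokens = ["CompanyA", "acquires", "CompanyB"]
pos_tags = ["NOUN", "VERB", "NOUN"]  # assumed to come from an external tagger
labels = ["SUBJ", "O", "O"]          # hand-labeled subject vs. non-subject words
features = sentence_features(tokens, pos_tags, keywords={"CompanyA", "CompanyB"})
```

A sequence of such feature dictionaries together with `labels` forms one training example for a sequence labeler.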
In addition, to better present the public opinion data of each entity, sentiment analysis can be performed on each item of public opinion data, that is, analyzing whether it expresses positive, negative or neutral sentiment, and each item can be labeled with its sentiment analysis result. Any existing text sentiment analysis method can be used in the embodiments of the present invention; it is neither limited nor detailed here. When the public opinion data of each entity is displayed, it can be grouped by sentiment analysis result, or each displayed item can be labeled with its sentiment analysis result.
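Labeling and grouping by sentiment can be sketched as below. Since the patent allows any sentiment analysis method, `toy_sentiment` is just a word-list placeholder, and the word lists themselves are invented for the example.

```python
from collections import defaultdict

# Toy word lists (assumptions); a real analyzer would replace toy_sentiment entirely.
POSITIVE = {"praises", "growth"}
NEGATIVE = {"recall", "lawsuit"}

def toy_sentiment(text):
    """Return 'positive', 'negative' or 'neutral' from simple word counts."""
    words = set(text.lower().split())
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

def label_and_group(pages, analyze=toy_sentiment):
    """Group public opinion items by their sentiment label for display."""
    groups = defaultdict(list)
    for text in pages:
        groups[analyze(text)].append(text)
    return dict(groups)

grouped = label_and_group([
    "EntityA reports strong growth",
    "EntityA faces lawsuit",
    "EntityA opens new office",
])
```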
The above is a detailed description of the method provided by the present invention, and the following is a detailed description of the apparatus provided by the present invention with reference to the examples.
Fig. 3 is a structural diagram of an apparatus according to an embodiment of the present invention. As shown in fig. 3, the apparatus may include a lexicon mining unit 00 and a public opinion obtaining unit 10, and may further include a model training unit 20 and a sentiment analysis unit 30.
The lexicon mining unit 00 is responsible for mining an entity lexicon in advance, and the entity lexicon includes keywords describing corresponding entities.
The lexicon mining unit 00 may specifically include a first obtaining subunit 01, a first extraction subunit 02 and a first determining subunit 04, and may further include a first filtering subunit 03.
The first obtaining subunit 01 is responsible for obtaining the authoritative data of the mined entity. Specifically, the first obtaining subunit 01 may obtain at least one of the name of the mined entity, the official website data, and clicked web page data corresponding to queries containing the mined entity.
The first extraction subunit 02 is responsible for extracting keywords from the authoritative data. Specifically, the first extraction subunit 02 may first perform word segmentation on the obtained authoritative data, and then extract keywords from the words obtained by the segmentation based on at least one of tf-idf, part of speech, sentence components, and context features.
The first filtering subunit 03 is responsible for filtering the keywords extracted by the first extracting subunit 02. Specifically, the first filtering subunit 03 performs at least one of the following filtering processes:
First filter: the extracted keywords are filtered manually.
Second filter: similarity matching is performed between each keyword extracted by the first extraction subunit 02 and the names of other entities that have already been determined; if a matching name exists, the keyword is deleted.
Third filter: similarity matching is performed between each keyword extracted by the first extraction subunit 02 and web page data; if the number of matching web pages exceeds a preset number threshold, the keyword is deleted.
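The second and third filters can be sketched as follows. This is only an illustrative sketch: the function name `filter_keywords` and its parameters are hypothetical, exact case-insensitive name equality stands in for the similarity matching of the second filter, and substring containment stands in for the similarity matching of the third filter.

```python
def filter_keywords(keywords, other_entity_names, web_pages, max_pages):
    """Drop keywords that name another entity or that match too many pages."""
    other_names = {name.lower() for name in other_entity_names}
    kept = []
    for keyword in keywords:
        key = keyword.lower()
        # Second filter: the keyword coincides with another entity's name.
        if key in other_names:
            continue
        # Third filter: the keyword matches more pages than the preset
        # threshold, so it is too generic to identify one entity.
        hits = sum(1 for page in web_pages if key in page.lower())
        if hits > max_pages:
            continue
        kept.append(keyword)
    return kept
```

The third filter reflects the idea that a keyword appearing in a large share of web pages carries little information about any single entity.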
The first determining subunit 04 is responsible for using the keyword set processed by the first filtering subunit 03 as the lexicon of the mined entity. Since the first filtering subunit 03 is optional, if it is not included, the first determining subunit 04 may use the keyword set extracted by the first extraction subunit 02 as the lexicon of the mined entity.
The public opinion obtaining unit 10 is responsible for obtaining public opinion data, and may specifically include a second extraction subunit 12, a matching subunit 13, and a second determining subunit 14, and may further include a second obtaining subunit 11 and a second filtering subunit 15.
The second extraction sub-unit 12 is responsible for extracting keywords from the acquired web page data. Similar to the first extraction sub-unit 02, the second extraction sub-unit 12 may first perform word segmentation on the web page data, and then extract keywords from words resulting from the word segmentation based on at least one of tf-idf, part of speech, sentence components, and context features.
The web page data may be acquired by the second obtaining subunit 11. When public opinion data is monitored for entities, the second obtaining subunit 11 may obtain newly published web page data periodically or in real time, so that it can be determined whether that web page data is public opinion data of some entity. For example, news web pages are obtained periodically or in real time.
The matching subunit 13 is responsible for matching the similarity between the keywords extracted by the second extraction subunit 12 and each entity lexicon, and determining the entity corresponding to the entity lexicon whose similarity meets the preset requirement.
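One possible form of this matching is sketched below. The text does not specify the similarity measure or the preset requirement, so Jaccard similarity over keyword sets and a fixed threshold are assumptions made for illustration; the names `jaccard` and `match_entities` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword collections (0.0 if either is empty)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def match_entities(page_keywords, lexicons, threshold=0.2):
    """Return entities whose lexicon similarity reaches `threshold`, best first.

    `lexicons` maps an entity name to the keywords of its entity lexicon.
    """
    scored = [(jaccard(page_keywords, words), entity)
              for entity, words in lexicons.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [entity for score, entity in scored if score >= threshold]
```

A web page may match more than one entity lexicon under this sketch, in which case it would be treated as public opinion data of each matched entity.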
The second determining subunit 14 is responsible for taking the web page data as public opinion data of the entity determined by the matching subunit 13. The public opinion data of each entity can be stored in a public opinion database.
The second filtering subunit 15 is responsible for performing at least one of the following processes on the public opinion data of each entity:
deduplication;
deleting non-compliant public opinion data, for example, filtering out public opinion data containing pornographic, violent, reactionary, or other sensitive words;
inputting the public opinion data of the current entity into a trained subject recognition model, and deleting the public opinion data if the recognized subject is not the current entity.
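The three processing steps above can be sketched together as follows. This is a sketch under assumptions: the function name `clean_opinion_data` is hypothetical, duplicates are detected by hashing whitespace-normalized lowercase text (near-duplicate detection is not attempted), and the trained subject recognition model is represented by an optional callable `recognize_subject` supplied by the caller.

```python
import hashlib

def clean_opinion_data(pages, entity, banned_words, recognize_subject=None):
    """Deduplicate, drop pages with banned words, and check the subject."""
    seen = set()
    kept = []
    for text in pages:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        # Step 1: deduplication on normalized text.
        if digest in seen:
            continue
        seen.add(digest)
        # Step 2: delete data containing any sensitive word.
        if any(word.lower() in normalized for word in banned_words):
            continue
        # Step 3: delete data whose recognized subject is another entity.
        # `recognize_subject` is a stand-in for the trained model.
        if recognize_subject is not None and recognize_subject(text) != entity:
            continue
        kept.append(text)
    return kept
```

Because each step is optional in the text, a caller can pass an empty `banned_words` list or omit `recognize_subject` to skip the corresponding step.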
The subject recognition model is trained in advance by the model training unit 20. Specifically, the model training unit 20 may use web page data whose subject has already been determined as a training corpus, and perform conditional random field learning based on at least one of the following features extracted from the training corpus: keywords, keyword positions, parts of speech, sentence components, and context, to obtain the subject recognition model.
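The feature extraction that precedes conditional random field learning might look like the sketch below. It covers only feature construction, not training; the function names, the feature keys, and the assumption that part-of-speech tags are already available from an upstream tagger are all hypothetical. Sentence components are omitted because they would require a parser.

```python
def token_features(tokens, pos_tags, index, keyword_set):
    """Build a feature dictionary for the token at `index`."""
    word = tokens[index].lower()
    features = {
        "word": word,
        "is_keyword": word in keyword_set,     # keyword feature
        "position": index / len(tokens),       # relative position
        "pos": pos_tags[index],                # part of speech
    }
    # Context features: the neighbouring words and their tags.
    if index > 0:
        features["prev_word"] = tokens[index - 1].lower()
        features["prev_pos"] = pos_tags[index - 1]
    else:
        features["BOS"] = True  # beginning of sentence
    if index < len(tokens) - 1:
        features["next_word"] = tokens[index + 1].lower()
        features["next_pos"] = pos_tags[index + 1]
    else:
        features["EOS"] = True  # end of sentence
    return features

def sentence_features(tokens, pos_tags, keywords):
    """Return one feature dictionary per token of a sentence."""
    keyword_set = {k.lower() for k in keywords}
    return [token_features(tokens, pos_tags, i, keyword_set)
            for i in range(len(tokens))]
```

Per-token feature dictionaries of this shape, paired with per-token labels marking whether a token belongs to the subject, are a common input format for conditional random field toolkits.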
In order to better characterize the public opinion data of each entity, the sentiment analysis unit 30 may perform sentiment analysis on the public opinion data of each entity, and label each piece of public opinion data with its sentiment analysis result.
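The labeling step can be illustrated with a very simple word-list approach. The text does not say how sentiment analysis is performed, so the word lists `POSITIVE` and `NEGATIVE`, the three labels, and the function names below are assumptions chosen only to show how each piece of public opinion data ends up carrying a sentiment label.

```python
# Hypothetical word lists; a real system would use a trained model
# or a much larger sentiment dictionary.
POSITIVE = {"growth", "profit", "award", "praise"}
NEGATIVE = {"lawsuit", "recall", "loss", "scandal"}

def label_sentiment(text):
    """Return 'positive', 'negative', or 'neutral' from word-list counts."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def annotate(opinion_items):
    """Attach a sentiment label to each piece of public opinion data."""
    return [{"text": text, "sentiment": label_sentiment(text)}
            for text in opinion_items]
```

Storing the label alongside the text lets downstream consumers, such as an early-warning component, filter an entity's public opinion data by sentiment.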
The apparatus provided by the present invention may be located in an application of the local terminal, or may also be a functional unit such as a plug-in or Software Development Kit (SDK) located in the application of the local terminal, or may also be located at the server side, which is not particularly limited in this embodiment of the present invention.
The method and the apparatus provided by the embodiments of the present invention can be applied in a wide range of fields and scenarios. For example, they can be applied in a public opinion monitoring system, where they are responsible for acquiring the public opinion data of each organization, which facilitates generating online public opinion early warnings and provides a data basis for the organizations' online crisis public relations or brand marketing.
As another example, they can be applied in the field of financial investment: a stock or fund app can collect public opinion data related to each stock, thereby providing a reference for the investment choices of investors such as stockholders.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (20)

1. A method for obtaining public opinion data is characterized in that an entity word bank is mined in advance, wherein the entity word bank comprises keywords describing corresponding entities; the method comprises the following steps:
extracting keywords from the acquired webpage data;
similarity matching is carried out on the extracted keywords and each entity word bank, and an entity corresponding to the entity word bank with the similarity meeting the preset requirement is determined;
using the webpage data as public opinion data of the determined entity;
and inputting the public opinion data of the determined entity into a trained subject recognition model, and deleting public opinion data whose recognized subject is not the determined entity.
2. The method of claim 1, wherein mining the entity thesaurus comprises:
obtaining authority data of a mined entity;
extracting keywords from the authoritative data;
and taking the extracted keyword set as a word stock of the mined entity.
3. The method of claim 2, wherein obtaining authority data of the mined entity comprises:
and acquiring at least one of the name of the mined entity, the official website data and clicked webpage data corresponding to the query containing the mined entity.
4. The method of claim 1 or 2, wherein the extracting keywords comprises:
and segmenting the acquired webpage data, and extracting keywords from the segmented words based on at least one of tf-idf, part of speech, sentence components and context characteristics.
5. The method of claim 2, wherein mining the entity thesaurus further comprises:
filtering the extracted keywords, and using the extracted keyword set as a word bank of the mined entity comprises:
and taking the keyword set obtained after the filtering processing is carried out on the extracted keywords as a word bank of the mined entity.
6. The method of claim 5, wherein filtering the extracted keywords comprises at least one of:
filtering the extracted keywords based on a manual mode;
similarity matching is carried out on the extracted keywords and the determined names of other entities, and if the matched names exist, the keywords are deleted;
and performing similarity matching on the extracted keywords and the webpage data, and deleting the keywords if the number of the matched webpage data exceeds a preset number threshold.
7. The method of claim 1, further comprising: and respectively performing at least one of the following processing on the public opinion data of each entity:
deduplication;
and deleting the illegal public opinion data.
8. The method of claim 1, wherein the subject recognition model is trained by:
taking webpage data whose subject has been determined as a training corpus;
and performing conditional random field learning based on at least one characteristic of keywords, positions of the keywords, parts of speech, sentence components and context extracted from the training corpus to obtain the subject recognition model.
9. The method of claim 1, further comprising:
and carrying out sentiment analysis on the public opinion data of each entity, and labeling each piece of public opinion data with its sentiment analysis result.
10. The method of any one of claims 1 to 3, 5 to 9, wherein the entity comprises an organization;
the web page data includes a news web page.
11. An apparatus for obtaining public opinion data, the apparatus comprising: a lexicon mining unit and a public opinion obtaining unit;
the lexicon mining unit is used for mining an entity lexicon in advance, and the entity lexicon comprises keywords for describing the corresponding entity;
the public opinion obtaining unit comprises:
a second extraction subunit, configured to extract a keyword from the acquired web page data;
the matching subunit is used for matching the similarity between the keywords extracted by the second extraction subunit and each entity lexicon, and determining an entity corresponding to the entity lexicon with the similarity meeting the preset requirement;
the second determining subunit is used for taking the webpage data as public opinion data of the entity determined by the matching subunit;
and the second filtering subunit is used for inputting the public opinion data of the determined entity into a trained subject recognition model, and deleting public opinion data whose recognized subject is not the determined entity.
12. The apparatus of claim 11, wherein the lexicon mining unit comprises:
the first acquisition subunit is used for acquiring the authority data of the mined entity;
the first extraction subunit is used for extracting keywords from the authority data;
and the first determining subunit is used for taking the keyword set extracted by the first extracting subunit as a word stock of the mined entity.
13. The apparatus according to claim 12, wherein the first obtaining subunit is specifically configured to obtain at least one of a name of the mined entity, official website data, and clicked web page data corresponding to a query containing the mined entity.
14. The apparatus according to claim 11 or 12, wherein the second extraction subunit is specifically configured to:
and segmenting the acquired webpage data, and extracting keywords from the segmented words based on at least one of tf-idf, part of speech, sentence components and context characteristics.
15. The apparatus of claim 12, wherein the lexicon mining unit further comprises:
and the first filtering subunit is used for filtering the keywords extracted by the first extracting subunit and providing the keywords to the first determining subunit to obtain the word bank of the mined entity.
16. The apparatus of claim 15, wherein the first filtering subunit performs at least one of the following filtering processes:
filtering the extracted keywords based on a manual mode;
similarity matching is carried out on the keywords extracted by the first extraction subunit and the determined names of other entities, and if the matched names exist, the keywords are deleted;
and performing similarity matching on the keywords extracted by the first extraction subunit and the webpage data, and deleting the keywords if the number of the matched webpage data exceeds a preset number threshold.
17. The apparatus of claim 11, wherein the second filtering subunit is further configured to perform at least one of the following processing on the public opinion data of each entity:
deduplication;
and deleting the illegal public opinion data.
18. The apparatus of claim 11, further comprising:
the model training unit is used for taking webpage data whose subject has been determined as a training corpus; and performing conditional random field learning based on at least one characteristic of keywords, positions of the keywords, parts of speech, sentence components and context extracted from the training corpus to obtain the subject recognition model.
19. The apparatus of claim 11, further comprising:
and the sentiment analysis unit is used for carrying out sentiment analysis on the public opinion data of each entity and labeling each piece of public opinion data with its sentiment analysis result.
20. The apparatus according to any one of claims 11 to 13 and 15 to 19, wherein the public opinion obtaining unit further comprises:
and the second acquiring subunit is used for acquiring the webpage data.
CN201610482038.6A 2016-06-27 2016-06-27 Method and device for acquiring public opinion data Active CN107544988B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610482038.6A CN107544988B (en) 2016-06-27 2016-06-27 Method and device for acquiring public opinion data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610482038.6A CN107544988B (en) 2016-06-27 2016-06-27 Method and device for acquiring public opinion data

Publications (2)

Publication Number Publication Date
CN107544988A CN107544988A (en) 2018-01-05
CN107544988B true CN107544988B (en) 2021-03-19

Family

ID=60961479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610482038.6A Active CN107544988B (en) 2016-06-27 2016-06-27 Method and device for acquiring public opinion data

Country Status (1)

Country Link
CN (1) CN107544988B (en)


Also Published As

Publication number Publication date
CN107544988A (en) 2018-01-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant