CN107544988B

CN107544988B - Method and device for acquiring public opinion data

Info

Publication number: CN107544988B
Application number: CN201610482038.6A
Authority: CN
Inventors: 王私江; 赵辉; 高显; 岳爱珍; 谭静; 崔燕
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2016-06-27
Filing date: 2016-06-27
Publication date: 2021-03-19
Anticipated expiration: 2036-06-27
Also published as: CN107544988A

Abstract

The invention provides a method and a device for acquiring public opinion data, wherein an entity word bank is mined in advance, and the entity word bank comprises keywords for describing corresponding entities; when public opinion data is acquired, extracting keywords from the acquired webpage data; similarity matching is carried out on the extracted keywords and each entity word bank, and an entity corresponding to the entity word bank with the similarity meeting the preset requirement is determined; and taking the webpage data as the public opinion data of the determined entity. The public opinion data acquisition method can automatically acquire public opinion data, greatly reduces labor cost and improves the coverage rate of the public opinion data compared with a mode of manually collecting the public opinion data.

Description

Method and device for acquiring public opinion data

[ technical field ]

The invention relates to the technical field of computer application, in particular to a method and a device for acquiring public opinion data.

[ background of the invention ]

Network public opinion takes the network as its carrier and events as its core. It is the expression, dissemination and interaction of the sentiments, attitudes, opinions and viewpoints of the vast number of netizens, and is the mapping of social public opinion in internet space. With the continuous development of the internet, many companies, enterprises and other organizations need to monitor network public opinion continuously, so as to analyze it and track their own standing in it, thereby generating network public opinion early warnings and providing a data basis for network crisis public relations or brand marketing. In addition, netizens also need to follow network public opinion continuously so as to have a basis for selecting a trustworthy service provider, making financial investment choices, and the like.

However, most of the conventional methods for acquiring internet public opinion data are based on manual collection, for example, companies or enterprises employ people dedicated to collecting and analyzing public opinion data; the vast netizens pay attention to the related news by themselves, and the like. On one hand, the methods consume human resources, and on the other hand, the coverage rate of the acquired public opinion data is low.

[ summary of the invention ]

In view of the above, the present invention provides a method and an apparatus for acquiring public sentiment data, so as to automatically acquire the public sentiment data, reduce labor cost, and improve coverage of the public sentiment data.

The specific technical scheme is as follows:

The invention provides a method for acquiring public opinion data, in which an entity word bank is mined in advance, the entity word bank comprising keywords that describe the corresponding entity; the method comprises the following steps:

extracting keywords from the acquired webpage data;

similarity matching is carried out on the extracted keywords and each entity word bank, and an entity corresponding to the entity word bank with the similarity meeting the preset requirement is determined;

and taking the webpage data as the public opinion data of the determined entity.

According to a preferred embodiment of the present invention, the mining entity thesaurus includes:

obtaining authority data of a mined entity;

extracting keywords from the authoritative data;

and taking the extracted keyword set as a word stock of the mined entity.

According to a preferred embodiment of the present invention, the obtaining authority data of the mined entity includes:

and acquiring at least one of the name of the mined entity, the official website data and clicked webpage data corresponding to the query containing the mined entity.

According to a preferred embodiment of the present invention, the extracting the keywords comprises:

and segmenting the acquired webpage data, and extracting keywords from the segmented words based on at least one of tf-idf, part of speech, sentence components and context characteristics.

According to a preferred embodiment of the present invention, the mining entity thesaurus further comprises:

filtering the extracted keywords, and using the extracted keyword set as a word bank of the mined entity comprises:

and taking the keyword set obtained after the filtering processing is carried out on the extracted keywords as a word bank of the mined entity.

According to a preferred embodiment of the present invention, the filtering process of the extracted keywords includes at least one of:

filtering the extracted keywords based on a manual mode;

similarity matching is carried out on the extracted keywords and the determined names of other entities, and if the matched names exist, the keywords are deleted;

and performing similarity matching on the extracted keywords and the webpage data, and deleting the keywords if the number of the matched webpage data exceeds a preset number threshold.

According to a preferred embodiment of the invention, the method further comprises: and respectively performing at least one of the following processing on the public opinion data of each entity:

deduplication;

deleting illegal public opinion data;

and inputting the public opinion data of the current entity into the trained subject recognition model, and deleting the public opinion data if the subject recognized from it is not the current entity.

According to a preferred embodiment of the present invention, the subject recognition model is trained as follows:

taking web page data whose subject has been determined as a training corpus;

and performing conditional random field learning based on at least one of the following features extracted from the training corpus: keywords, positions of the keywords, parts of speech, sentence components and context, to obtain the subject recognition model.

According to a preferred embodiment of the invention, the method further comprises:

and carrying out emotion analysis on the public sentiment data of each entity, and marking emotion analysis results for the public sentiment data.

According to a preferred embodiment of the invention, the entity comprises an organisation;

the web page data includes a news web page.

The invention also provides a device for acquiring public opinion data, which comprises a word stock mining unit and a public opinion obtaining unit;

the word stock mining unit is used for mining an entity word stock in advance, and the entity word stock comprises keywords for describing the corresponding entity;

the public opinion obtaining unit includes:

a second extraction subunit, configured to extract a keyword from the acquired web page data;

the matching subunit is used for matching the similarity between the keywords extracted by the second extraction subunit and each entity lexicon, and determining an entity corresponding to the entity lexicon with the similarity meeting the preset requirement;

and the second determining subunit is used for taking the webpage data as the public opinion data of the entity determined by the matching subunit.

According to a preferred embodiment of the present invention, the word stock mining unit includes:

the first acquisition subunit is used for acquiring the authority data of the mined entity;

the first extraction subunit is used for extracting keywords from the authority data;

and the first determining subunit is used for taking the keyword set extracted by the first extracting subunit as a word stock of the mined entity.

According to a preferred embodiment of the present invention, the first obtaining subunit is specifically configured to obtain at least one of a name of the mined entity, official website data, and clicked web page data corresponding to a query including the mined entity.

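According to a preferred embodiment of the present invention, the second extraction subunit is specifically configured to: segment the acquired webpage data, and extract keywords from the segmented words based on at least one of tf-idf, part of speech, sentence components and context characteristics.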

According to a preferred embodiment of the present invention, the word stock mining unit further includes:

and the first filtering subunit is used for filtering the keywords extracted by the first extracting subunit and providing the keywords to the first determining subunit to obtain the word bank of the mined entity.

According to a preferred embodiment of the invention, the first filtering subunit performs at least one of the following filtering processes:

filtering the extracted keywords based on a manual mode;

similarity matching is carried out on the keywords extracted by the first extraction subunit and the determined names of other entities, and if the matched names exist, the keywords are deleted;

and performing similarity matching on the keywords extracted by the first extraction subunit and the webpage data, and deleting the keywords if the number of the matched webpage data exceeds a preset number threshold.

According to a preferred embodiment of the present invention, the public opinion obtaining unit further comprises:

the second filtering subunit is used for respectively performing at least one of the following processing on the public opinion data of each entity:

deduplication;

deleting illegal public opinion data;

According to a preferred embodiment of the invention, the apparatus further comprises:

the model training unit is used for taking web page data whose subject has been determined as a training corpus, and performing conditional random field learning based on at least one of the following features extracted from the training corpus: keywords, positions of the keywords, parts of speech, sentence components and context, to obtain the subject recognition model.

and the emotion analysis unit is used for carrying out emotion analysis on the public opinion data of each entity and marking emotion analysis results for each public opinion data.

According to a preferred embodiment of the present invention, the public opinion obtaining unit further includes:

and the second acquiring subunit is used for acquiring the webpage data.

According to the above technical scheme, similarity matching is carried out between the keywords extracted from the webpage data and the entity word banks mined in advance, so as to determine whether the webpage data is public opinion data of a certain entity. In this way public opinion data can be acquired automatically; compared with manually collecting public opinion data, labor cost is greatly reduced and the coverage rate of the public opinion data is improved.

[ description of the drawings ]

Fig. 1 is a flowchart of a method of an entity lexicon mining phase according to an embodiment of the present invention;

fig. 2 is a flowchart of a public opinion data obtaining stage according to an embodiment of the present invention;

fig. 3 is a diagram illustrating an apparatus according to an embodiment of the present invention.

[ detailed description ]

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be understood that the term "and/or" as used herein merely describes an association between associated objects, meaning that three relationships may exist, e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.

The word "if" as used herein may be interpreted as "at the time of" or "when" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.

The core idea of the invention is that an entity word bank is mined in advance, wherein the entity word bank comprises keywords for describing the corresponding entity; keywords are then extracted from the webpage data and matched for similarity against each entity word bank, thereby determining whether the webpage data is public opinion data of a certain entity. That is, the implementation of the present invention mainly includes two stages: an entity word bank mining stage and a public opinion data acquisition stage. The public opinion data acquisition stage uses the result of the entity word bank mining stage, but both stages can be executed periodically so as to continuously update the entity word bank and continuously acquire public opinion data. The two stages are described in detail below by way of example.

In addition, it should be noted that the entity in the embodiments of the present invention may be of various types, such as a person name, a movie name, and the like. Preferably, the method may be applied to organizations such as companies, enterprises, institutions and communities, so that public opinion monitoring can be performed on these organizations. Since the important public opinion data of organizations such as companies and enterprises is generally news-like data, which appears in news web pages on the internet, the following embodiments use organizations as the entity type and news web pages as the type of web page data serving as public opinion data.

Fig. 1 is a flowchart of a method of an entity lexicon mining phase provided in an embodiment of the present invention, and as shown in fig. 1, the phase may specifically include the following steps:

in 101, authoritative data for the mined entity is obtained.

This step in effect acquires the mining corpus for the mined entity. Since the method adopted by the present invention matches web page data against the entity lexicon, the entity lexicon must have a certain accuracy, that is, the keywords it contains must accurately describe the corresponding entity, so authoritative data needs to be used as the mining corpus. In an embodiment of the present invention, the authoritative data sources for the mined entity may include the following:

1) The name of the mined entity. An entity is described most accurately by its names, which include the full name, short name, nickname, common name, and the like. Taking the entity "Jingdong" as an example, its full name is "Beijing Jingdong Century Trading Co., Ltd.", its short name is "Jingdong", its nicknames include "east dog", and its common names also include "JD"; all of these names can be used as mining corpus.

2) The official website data of the mined entity. Generally, the data about an entity on its official website describes the entity accurately, so the official website data can be used as an important mining corpus. For example, some sections or web pages of the official website introduce the entity, such as an "about us" section, from which the corpus can be mined.

3) The clicked web page data corresponding to queries containing the mined entity. In the search log, when a user searches for an entity, the web pages clicked in the search results usually describe the entity well or are strongly related to it, so this web page data can be used as mining corpus. Furthermore, since some user clicks are relatively blind, the clicked web page data may be further filtered, for example by removing web page data whose click count is below a certain threshold, or by filtering based on the type or authority of the site to which the web page data belongs, and so on.

4) Other sources, such as review data for a review-like website, promotional information and authentication data for mined entities on an authoritative website, and so on.

Still taking "Jingdong" as an example, comment data about "Jingdong" may be acquired from review websites such as "Dazhong Dianping" and "Baidu Koubei". Generally, such review websites have a web page for each entity on which users comment on that entity, so comment data can be obtained from these web pages as mining corpus.

In addition, some entities place promotional information on authoritative websites. For example, "Jingdong" may have promotional information on the Baidu search engine; to be effective, such promotional information generally describes the corresponding entity fairly accurately. The promotional information on the Baidu search engine generally carries a description text such as "Jingdong JD.com - professional comprehensive online shopping mall, with tens of thousands of brands of high-quality goods such as home appliances, digital communications, computers, household goods, clothing, mother and baby products, books, foods, etc. The convenient and honest service provides joyful online".
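The click-based filtering mentioned for source 3) above can be sketched roughly as follows. This is only an illustration under stated assumptions: the search log is assumed to be a list of (query, clicked URL) pairs, and the function name `filter_clicked_pages`, the `min_clicks` parameter and the example URLs are hypothetical and do not come from the original text.

```python
from collections import Counter

def filter_clicked_pages(click_log, entity_query, min_clicks=2):
    """Keep URLs clicked at least `min_clicks` times for queries containing the entity.

    `click_log` is an assumed format: an iterable of (query, clicked_url) pairs.
    """
    counts = Counter(url for query, url in click_log if entity_query in query)
    # Pages whose click count is below the threshold are treated as blind clicks.
    return sorted(url for url, n in counts.items() if n >= min_clicks)

# Hypothetical search log entries, for illustration only.
click_log = [
    ("jingdong", "https://example.com/about"),
    ("jingdong mall", "https://example.com/about"),
    ("jingdong", "https://example.org/unrelated"),
    ("other shop", "https://example.net/page"),
]
kept = filter_clicked_pages(click_log, "jingdong", min_clicks=2)
```

Filtering by site type or authority, which the text also mentions, would need an additional per-site signal that is not modeled here.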

At 102, keywords are extracted from the authoritative data.

In this step, the obtained authoritative data may be firstly segmented, and then keywords may be extracted from the words obtained by the segmentation based on at least one of tf-idf, part of speech, sentence components, and context characteristics.

Generally, tf-idf of a word in a text can accurately reflect the importance degree of the word in the text, wherein tf is the word frequency, and idf is the inverse document frequency. Therefore, words in which tf-idf exceeds a preset threshold may be extracted as keywords.
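As a minimal sketch of the tf-idf criterion, assuming the text has already been segmented into tokens and that a small reference corpus is available for the document frequencies. The smoothed idf formula, the function name `tfidf_keywords` and the threshold value are illustrative assumptions; the original text does not specify them.

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus, threshold):
    """Return {word: tf-idf} for words in `doc_tokens` whose score exceeds `threshold`."""
    tf = Counter(doc_tokens)
    doc_sets = [set(d) for d in corpus]
    n_docs = len(doc_sets)
    keywords = {}
    for word, count in tf.items():
        df = sum(1 for d in doc_sets if word in d)   # number of documents containing the word
        idf = math.log((1 + n_docs) / (1 + df))      # smoothed inverse document frequency (assumed form)
        score = (count / len(doc_tokens)) * idf      # term frequency times idf
        if score > threshold:
            keywords[word] = score
    return keywords

# Toy, already-segmented corpus for illustration only.
corpus = [
    ["jingdong", "sells", "home", "appliances", "online"],
    ["the", "mall", "sells", "books", "online"],
    ["news", "about", "the", "weather"],
]
keywords = tfidf_keywords(corpus[0], corpus, threshold=0.1)
```

Words that occur in many documents ("sells", "online") receive a low idf and fall below the threshold, while words specific to the first document are kept.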

As for part of speech, the words that describe an entity most accurately are usually nouns; adjectives or verbs may also occur, but nouns are preferred, so nouns can be extracted as keywords.

In addition, the subject or object is often important in a sentence, so that the subject and object can be extracted as keywords based on the sentence components.

For a given kind of entity, there are typical contextual features in text that describes it, or in the public opinion data of interest. For example, for an entity such as a company, when features such as "purchase", "buy", "sell" or "financing" appear in the context, keywords can be extracted according to these contextual features. For example, if "Apple buys Didi" appears in a certain text, then "Didi" and "Apple" can be extracted as keywords.

Several of the above manners can also be adopted simultaneously: a weight is assigned to each factor used, a final score is computed for each keyword from its values on the factors and the corresponding weights, and the keywords whose final score ranks highest or exceeds a certain score are then extracted.
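The weighted combination can be illustrated with a short sketch. The factor names, the weight values and the score cut-off below are hypothetical; the text only states that each factor receives a weight and that a final score is derived per keyword.

```python
def combined_score(features, weights):
    """Weighted sum of a candidate's per-factor values (missing factors count as 0)."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)

def select_keywords(candidates, weights, min_score):
    """Keep candidates whose combined score exceeds `min_score`."""
    scored = {word: combined_score(f, weights) for word, f in candidates.items()}
    return {word: s for word, s in scored.items() if s > min_score}

# Assumed weights for the four factors named in the text.
weights = {"tfidf": 0.4, "is_noun": 0.3, "is_subject_or_object": 0.2, "context_trigger": 0.1}

# Hypothetical per-word factor values, each normalized to [0, 1].
candidates = {
    "mall": {"tfidf": 0.8, "is_noun": 1.0, "is_subject_or_object": 1.0, "context_trigger": 1.0},
    "quickly": {"tfidf": 0.2},
}
selected = select_keywords(candidates, weights, min_score=0.5)
```

A top-N cut could be used instead of a fixed score by sorting the scored candidates and keeping the first N.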

At 103, the extracted keywords are filtered.

In order to further improve the accuracy of the entity word bank and reduce the influence of redundant keywords or inaccurate keywords on the acquisition of subsequent public opinion data, the extracted keywords can be further filtered. The filtering means employed may include, but is not limited to, at least one of:

The first filtering mode: the extracted keywords are filtered manually. After the keyword set of each entity has been extracted automatically in the manner described above, it can be submitted to an auditor for manual review, and inappropriate keywords can be deleted from it. Because the auditor only reviews an automatically extracted set, the workload is relatively small and the labor cost is low.

The second filtering mode: similarity matching is performed between the extracted keywords and the determined names of other entities, and a keyword is deleted if a matching name exists. Some extracted keywords may be closer to other entities, and they should be deleted to avoid interference. For example, when keywords are extracted based on the contextual feature "purchase", the two keywords "Apple" and "Didi" may be extracted for the entity "Didi Express". When the two keywords are each matched for similarity against the names of other entities, it is found that the keyword "Apple" has particularly high similarity with the name of another entity, which indicates that it is not a keyword of "Didi Express", so it may be deleted from the keyword set of "Didi Express".

The third filtering mode: similarity matching is performed between the extracted keywords and web page data, and a keyword is deleted if the number of matched web pages exceeds a preset number threshold.

If a certain extracted keyword can be matched with a large amount of webpage data, the keyword is not high in discrimination, and the existence of the keyword can interfere with the acquisition of subsequent public opinion data, so that the keyword can be deleted.
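The second and third filtering modes can be sketched together as follows. The text does not fix a similarity measure, so the character-level ratio from Python's `difflib.SequenceMatcher` is used here purely as an assumption, and "matching a web page" is simplified to the keyword appearing among the page's tokens. The thresholds and sample data are illustrative.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Character-level similarity in [0, 1]; an assumed stand-in for the unspecified measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def filter_keywords(keywords, other_entity_names, pages, name_threshold=0.8, max_pages=2):
    """Drop keywords that resemble another entity's name or that match too many pages."""
    kept = []
    for kw in keywords:
        # Second filtering mode: the keyword matches the name of another entity.
        if any(name_similarity(kw, name) >= name_threshold for name in other_entity_names):
            continue
        # Third filtering mode: the keyword matches more pages than the threshold allows.
        matched = sum(1 for page in pages if kw.lower() in page)
        if matched > max_pages:
            continue
        kept.append(kw)
    return kept

# Each page is represented by its set of lowercase tokens (toy data).
pages = [
    {"the", "ride", "service", "expands"},
    {"the", "weather", "today"},
    {"the", "market", "closes"},
]
kept = filter_keywords(["apple", "ride", "the"], ["Apple", "Baidu"], pages)
```

Here "apple" is removed because it matches another entity's name, and "the" is removed because it appears in too many pages to be discriminative.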

At 104, the processed keyword set is used as a word stock of the mined entity.

By the method, the corresponding keyword set can be mined for each mined entity to serve as the word stock corresponding to the entity. For an entity of an organization, a set of keywords such as business names, product words, industry, known people, territories, etc. may be mined as a thesaurus for the organization.

Fig. 2 is a flowchart of a public opinion data obtaining stage according to an embodiment of the present invention, as shown in fig. 2, the stage may specifically include the following steps:

in 201, web page data is acquired.

When public opinion data is monitored for an entity, newly appearing web page data can be acquired periodically or in real time in order to judge whether it is public opinion data of the entity. For example, news web pages are retrieved periodically or in real time.

At 202, keywords are extracted from the web page data.

In this step, the keywords may be extracted from the news title, the news abstract, some or all of the paragraphs in the news body, and the like. The extraction is similar to the way step 102 of the embodiment shown in fig. 1 extracts keywords from the authoritative data: the web page data is segmented, and keywords are extracted from the resulting words based on at least one of tf-idf, part of speech, sentence components and context characteristics. For details, refer to the description of the embodiment shown in fig. 1, which is not repeated here.

In 203, similarity matching is performed between the extracted keywords and each entity lexicon, and an entity corresponding to the entity lexicon with similarity meeting preset requirements is determined.

The keywords extracted from the web page data are matched for similarity against each entity lexicon; if the similarity between the lexicon of some entity and the extracted keywords meets the preset requirement, for example exceeds a certain threshold, the web page data is taken as public opinion data of that entity.
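A minimal sketch of this matching step, assuming each entity lexicon and each page's keywords are represented as sets of strings. The similarity used here (the share of page keywords found in the lexicon) and the threshold are illustrative assumptions, since the text leaves the measure and the preset requirement open.

```python
def lexicon_similarity(page_keywords, lexicon):
    """Share of the page's keywords that also appear in the entity lexicon (assumed measure)."""
    if not page_keywords:
        return 0.0
    return len(page_keywords & lexicon) / len(page_keywords)

def match_entities(page_keywords, lexicons, threshold=0.5):
    """Return the entities whose lexicon similarity meets the threshold."""
    return sorted(
        entity
        for entity, lexicon in lexicons.items()
        if lexicon_similarity(page_keywords, lexicon) >= threshold
    )

# Hypothetical lexicons mined in the first stage.
lexicons = {
    "Jingdong": {"jingdong", "jd", "mall", "appliances", "online"},
    "Baidu": {"baidu", "search", "engine", "maps"},
}
page_keywords = {"jd", "mall", "appliances", "discount"}
matched = match_entities(page_keywords, lexicons)
```

A page can match several entities at once, in which case it would be recorded as public opinion data for each of them.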

At 204, the webpage data is used as the public opinion data of the determined entity.

At 205, public opinion data of each entity is deduplicated and/or deleted.

By adopting the steps, a series of public opinion data of each entity can be obtained, but the public opinion data may have some repetition, so that the public opinion data can be subjected to deduplication processing. In performing deduplication, it may be determined whether two web page data have the same content based on text similarity.
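Deduplication based on text similarity might look like the sketch below. Jaccard similarity over whitespace-separated tokens and the 0.8 threshold are assumptions chosen for brevity; real Chinese text would first need word segmentation, and the original text does not name a specific similarity measure.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def deduplicate(pages, threshold=0.8):
    """Keep the first page of each group of near-identical texts."""
    kept, kept_tokens = [], []
    for text in pages:
        tokens = set(text.lower().split())
        # Skip the page if it is close enough to one that has already been kept.
        if any(jaccard(tokens, seen) >= threshold for seen in kept_tokens):
            continue
        kept.append(text)
        kept_tokens.append(tokens)
    return kept

# Toy headlines for illustration only.
pages = [
    "JD opens new warehouse in Beijing",
    "JD opens new warehouse in Beijing today",
    "Baidu releases new map feature",
]
unique_pages = deduplicate(pages)
```

The pairwise comparison is quadratic in the number of pages, which is acceptable for a sketch but would need an index or hashing scheme at scale.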

When the deletion process is performed, the following two ways may be adopted, but not limited to:

The first mode: illegal public opinion data is deleted. For example, public opinion data that does not comply with laws and regulations is deleted; for instance, public opinion data containing pornographic, violent, reactionary or other sensitive words can be filtered out.

The second mode: the public opinion data of the current entity is input into the trained subject recognition model, and the public opinion data is deleted if the subject recognized from it is not the current entity.

Some entities, for example small and medium-sized enterprises, have little public opinion data, and for them it is desirable to obtain as much public opinion data as possible. For large or well-known enterprises, however, there is a great deal of public opinion data, and further deletion is needed to obtain more accurate and valuable public opinion data. In embodiments of the present invention, a subject recognition model may be used to recognize the subject of web page data, such as the subject of a news story. For an input text, the subject recognition model identifies the subject words in the text. If, after the public opinion data is input into the subject recognition model, the recognized subject is the entity corresponding to the public opinion data, the public opinion data is retained for that entity; if the recognized subject is not the entity corresponding to the public opinion data, the public opinion data is deleted for that entity.

When training the subject recognition model, web page data whose subject has been determined may be used as the training corpus; in the corpus, the words of the web page data are labeled as subject words or non-subject words. Conditional random field learning is then performed based on at least one of the keywords, positions, parts of speech, sentence components and context features extracted from the training corpus, that is, the features of the subject words and the non-subject words are extracted respectively, to obtain the subject recognition model.
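The feature extraction that precedes conditional random field learning can be sketched as below. Only the feature construction is shown; the resulting per-token dictionaries and labels would then be passed to a CRF toolkit of choice, which is not shown. The tuple layout (word, part of speech, sentence component), the feature names, the label names and the placeholder company names are all hypothetical.

```python
def token_features(sentence, i, keywords):
    """Build the feature dict for token i of a sentence of (word, pos, component) tuples."""
    word, pos, component = sentence[i]
    feats = {
        "word": word,
        "is_keyword": word in keywords,   # keyword feature
        "position": i,                    # position of the token in the sentence
        "pos": pos,                       # part of speech
        "component": component,           # sentence component, e.g. subject or object
        "BOS": i == 0,
        "EOS": i == len(sentence) - 1,
    }
    # Context features: neighbouring words, when they exist.
    if i > 0:
        feats["prev_word"] = sentence[i - 1][0]
    if i < len(sentence) - 1:
        feats["next_word"] = sentence[i + 1][0]
    return feats

def sentence_features(sentence, keywords):
    return [token_features(sentence, i, keywords) for i in range(len(sentence))]

# One labeled training sentence; "SUBJ" marks the subject word, "O" a non-subject word.
sentence = [
    ("CompanyA", "NOUN", "subject"),
    ("acquires", "VERB", "predicate"),
    ("CompanyB", "NOUN", "object"),
]
labels = ["SUBJ", "O", "O"]
features = sentence_features(sentence, keywords={"CompanyA", "CompanyB"})
```

At prediction time the same feature function would be applied to a new page, and the entity is kept only if the word labeled as subject corresponds to it.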

In addition, in order to better show the public sentiment data of each entity, sentiment analysis can be performed on each public sentiment data, and sentiment analysis results can be labeled for each public sentiment data. The emotion analysis is to analyze whether the public sentiment data expresses positive emotion or negative emotion or neutral emotion. Any text emotion analysis mode in the prior art can be adopted in the embodiment of the present invention, and is not limited and detailed herein. After the sentiment is analyzed for the public sentiment data, sentiment analysis results can be marked for the public sentiment data. When the public sentiment data of each entity is displayed, the public sentiment data can be displayed in a classified mode based on the sentiment analysis result, or the displayed public sentiment data is labeled with the sentiment analysis result.

The above is a detailed description of the method provided by the present invention, and the following is a detailed description of the apparatus provided by the present invention with reference to the examples.

Fig. 3 is a structural diagram of an apparatus according to an embodiment of the present invention. As shown in fig. 3, the apparatus may include a word stock mining unit 00 and a public opinion obtaining unit 10, and may further include a model training unit 20 and an emotion analyzing unit 30.

The lexicon mining unit 00 is responsible for mining an entity lexicon in advance, and the entity lexicon includes keywords describing corresponding entities.

The lexicon mining unit 00 may specifically include a first acquiring sub-unit 01, a first extracting sub-unit 02 and a first determining sub-unit 04, and may further include a first filtering sub-unit 03.

The first obtaining subunit 01 is responsible for obtaining the authority data of the mined entity. Specifically, the first obtaining subunit 01 may obtain at least one of a name of the mined entity, the official website data, and clicked web page data corresponding to the query containing the mined entity.

The first extraction sub-unit 02 is responsible for extracting keywords from the authoritative data. Specifically, the first extraction sub-unit 02 may first perform word segmentation on the obtained authoritative data, and then extract keywords from words obtained by the word segmentation based on at least one of tf-idf, part of speech, sentence components, and context features.

The first filtering subunit 03 is responsible for filtering the keywords extracted by the first extracting subunit 02. Specifically, the first filtering subunit 03 performs at least one of the following filtering processes:

The first filtering mode: the extracted keywords are filtered manually.

The second filtering mode: similarity matching is performed between the keywords extracted by the first extraction subunit 02 and the determined names of other entities, and a keyword is deleted if a matching name exists.

The third filtering mode: similarity matching is performed between the keywords extracted by the first extraction subunit 02 and the webpage data, and a keyword is deleted if the number of matched web pages exceeds a preset number threshold.

The first determining subunit 04 is responsible for using the keyword set processed by the first filtering subunit 03 as a thesaurus of the mined entity. Since the first filtering subunit 03 is an optional subunit, if the first filtering subunit 03 is not included, the first determining subunit 04 may use the keyword set extracted by the first extracting subunit 02 as the thesaurus of the mined entity.

The public opinion obtaining unit 10 is responsible for obtaining public opinion data, and may specifically include a second extraction sub-unit 12, a matching sub-unit 13 and a second determination sub-unit 14, and may further include a second obtaining sub-unit 11 and a second filtering sub-unit 15.

The second extraction sub-unit 12 is responsible for extracting keywords from the acquired web page data. Similar to the first extraction sub-unit 02, the second extraction sub-unit 12 may first perform word segmentation on the web page data, and then extract keywords from words resulting from the word segmentation based on at least one of tf-idf, part of speech, sentence components, and context features.

Wherein the acquired web page data can be acquired by the second acquiring sub-unit 11. When monitoring public opinion data for an entity, the second obtaining sub-unit 11 may periodically or in real time obtain newly appeared web page data to determine whether the web page data is the public opinion data of an entity. For example, news web pages are retrieved periodically or in real-time.

The matching subunit 13 is responsible for performing similarity matching between the keywords extracted by the second extraction subunit 12 and each entity word bank, and for determining the entity corresponding to the entity word bank whose similarity meets the preset requirement.
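The matching step can be sketched as below. The text does not fix a similarity measure or the preset requirement, so Jaccard similarity over keyword sets and a fixed threshold are assumptions made purely for illustration, as are the function names and the sample entities in the usage note.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of keywords."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_entities(page_keywords, entity_keyword_sets, threshold=0.2):
    """Return (entity, score) pairs whose similarity meets the assumed threshold."""
    matches = []
    for entity, keywords in entity_keyword_sets.items():
        score = jaccard(page_keywords, keywords)
        if score >= threshold:
            matches.append((entity, score))
    # Best match first.
    return sorted(matches, key=lambda m: -m[1])
```

For example, a page with keywords `["search", "maps", "launch"]` would match a hypothetical entity whose keyword set is `{"search", "engine", "maps", "example"}` and would not match one whose set is `{"bank", "loan", "deposit", "acme"}`.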

The second determining subunit 14 is responsible for taking the web page data as public opinion data of the entity determined by the matching subunit 13. The public opinion data of each entity can be stored in a public opinion database.

The second filtering subunit 15 is responsible for performing at least one of the following processes on the public opinion data of each entity:

removing duplicates;

deleting illegal public opinion data, for example, filtering out public opinion data containing sensitive words such as pornographic, violent, or reactionary terms;
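A minimal sketch of these two processing steps follows. It assumes exact-duplicate detection by hashing normalized text and a plain substring check against a sensitive-word list; the function name and both techniques are illustrative choices, not details given in the text.

```python
import hashlib


def clean_opinions(items, sensitive_words):
    """Remove duplicate items, then drop items containing sensitive words."""
    seen = set()
    kept = []
    for text in items:
        # Normalize case and whitespace so trivially different copies collide.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Assumed rule: any sensitive word appearing in the text disqualifies it.
        if any(word.lower() in normalized for word in sensitive_words):
            continue
        kept.append(text)
    return kept
```

Hashing only catches copies that are identical after normalization; near-duplicate detection would need a different technique.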

The subject recognition model is trained in advance by the model training unit 20. Specifically, the model training unit 20 may use web page data whose subject has been determined as a training corpus, and perform conditional random field learning based on at least one of the following features extracted from the training corpus: keywords, position, part of speech, sentence components, and context, to obtain the subject recognition model.
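The feature side of this training step can be sketched as follows. The code only builds one feature dictionary per token covering the keyword, position, part-of-speech, and context features; it does not train a conditional random field. The feature names and the `<BOS>`/`<EOS>` markers are assumptions, part-of-speech tags are taken as given input, and sentence components are omitted because they would require a syntactic parser.

```python
def token_features(tokens, pos_tags, keywords, i):
    """Build the feature dict for token i of a segmented sentence."""
    return {
        "word": tokens[i],
        # Keyword feature: whether the token is one of the extracted keywords.
        "is_keyword": tokens[i] in keywords,
        # Position feature: relative position of the token in the sentence.
        "position": i / len(tokens),
        # Part-of-speech feature, supplied by an external tagger (assumed).
        "pos": pos_tags[i],
        # Context features: neighboring words, with boundary markers.
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens, pos_tags, keywords):
    """Return one feature dict per token, in sentence order."""
    return [token_features(tokens, pos_tags, keywords, i) for i in range(len(tokens))]
```

A sequence of such dictionaries, paired with per-token labels marking which tokens belong to the subject, is the kind of input a sequence-labeling learner would consume.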

To better characterize the public opinion data of each entity, the sentiment analysis unit 30 may perform sentiment analysis on the public opinion data of each entity and label each item of public opinion data with its sentiment analysis result.

The apparatus provided by the present invention may be located in an application of the local terminal, or may also be a functional unit such as a plug-in or Software Development Kit (SDK) located in the application of the local terminal, or may also be located at the server side, which is not particularly limited in this embodiment of the present invention.

The method and the device provided by the embodiments of the present invention can be applied in a wide range of fields and scenarios. For example, they can be applied to a public opinion monitoring system that acquires public opinion data for each organization, which facilitates network public opinion early warning and provides a data basis for an organization's network crisis public relations or brand marketing.

As another example, they can be applied in the field of financial investment: a stock or fund app can collect public opinion data related to each stock, providing a reference for the investment decisions of investors such as shareholders.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a logical functional division, and other divisions are possible in an actual implementation.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method for obtaining public opinion data, characterized in that an entity word bank is mined in advance, wherein the entity word bank comprises keywords describing the corresponding entity; the method comprises the following steps:

extracting keywords from the acquired webpage data;

using the webpage data as public opinion data of the determined entity;

and inputting the public opinion data of the determined entity into a trained subject recognition model, and deleting public opinion data whose recognized subject is not the determined entity.

2. The method of claim 1, wherein mining the entity word bank comprises:

obtaining authoritative data of a mined entity;

extracting keywords from the authoritative data;

and taking the extracted keyword set as the word bank of the mined entity.

3. The method of claim 2, wherein obtaining authoritative data of the mined entity comprises:

4. The method of claim 1 or 2, wherein the extracting keywords comprises:

5. The method of claim 2, wherein mining the entity word bank further comprises:

6. The method of claim 5, wherein filtering the extracted keywords comprises at least one of:

filtering the extracted keywords based on a manual mode;

7. The method of claim 1, further comprising performing at least one of the following processes on the public opinion data of each entity:

removing duplicates;

and deleting the illegal public opinion data.

8. The method of claim 1, wherein the subject recognition model is trained by:

taking webpage data whose subject has been determined as a training corpus;

9. The method of claim 1, further comprising:

10. The method of any one of claims 1 to 3, 5 to 9, wherein the entity comprises an organization;

the web page data includes a news web page.

11. An apparatus for obtaining public opinion data, the apparatus comprising: a public opinion obtaining unit used for obtaining public opinion data;

the public opinion obtaining unit includes:

the second determining subunit is used for taking the webpage data as public opinion data of the entity determined by the matching subunit;

and the second filtering subunit is used for inputting the public opinion data of the determined entity into a trained subject recognition model and deleting public opinion data whose recognized subject is not the determined entity.

12. The apparatus of claim 11, wherein the word bank mining unit comprises:

13. The apparatus according to claim 12, wherein the first obtaining subunit is specifically configured to obtain at least one of a name of the mined entity, official website data, and clicked web page data corresponding to a query containing the mined entity.

14. The apparatus according to claim 11 or 12, wherein the second extraction subunit is specifically configured to:

15. The apparatus of claim 12, wherein the word bank mining unit further comprises:

16. The apparatus of claim 15, wherein the first filtering subunit performs at least one of the following filtering processes:

filtering the extracted keywords based on a manual mode;

17. The apparatus of claim 11, wherein the second filtering subunit is further configured to perform at least one of the following processing on the public opinion data of each entity:

removing duplicates;

and deleting the illegal public opinion data.

18. The apparatus of claim 11, further comprising:

19. The apparatus of claim 11, further comprising:

20. The apparatus according to any one of claims 11 to 13 and 15 to 19, wherein the public opinion obtaining unit further comprises:

and the second obtaining subunit is used for acquiring the webpage data.