CN110399613B - Method and system for identifying internet news related to place names based on part-of-speech tagging

Info

Publication number
CN110399613B
CN110399613B CN201910681163.3A CN201910681163A CN110399613B CN 110399613 B CN110399613 B CN 110399613B CN 201910681163 A CN201910681163 A CN 201910681163A CN 110399613 B CN110399613 B CN 110399613B
Authority
CN
China
Prior art keywords
place name
news
place
noun phrase
recognition
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910681163.3A
Other languages
Chinese (zh)
Other versions
CN110399613A (en)
Inventor
苏坤雄
彭光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Co Ltd
Original Assignee
Inspur Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-07-26
Filing date
2019-07-26
Publication date
2023-03-31
Application filed by Inspur Software Co Ltd
Priority to CN201910681163.3A
Publication of CN110399613A
Application granted
Publication of CN110399613B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a system for recognizing the place names referred to in internet news based on part-of-speech tagging, and belongs to the technical field of natural language processing. The method uses the overall reporting region of a news media column to supplement the context information of a news item, which helps the place name disambiguation program determine place names correctly; uses part-of-speech tagging to convert the news content into a sequence of pure noun phrases; performs place name recognition on the noun phrase sequence; applies two rounds of place name reduction to the recognition results to eliminate inaccurate place names; and finally combines the two reduction results by weighting to confirm the place names. The method is easy to understand and simple to implement, effectively addresses the low accuracy of extracting the place names referred to in news, and has good value for popularization and application.

Description

Method and system for identifying internet news related to place names based on part-of-speech tagging
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method and a system for recognizing the place names referred to in internet news based on part-of-speech tagging.
Background
Place name recognition is a subtask of entity recognition in natural language processing. Traditional place name recognition relies on a place name hierarchy model built from an administrative division dictionary, a word segmentation algorithm, place name disambiguation, and related techniques. Typically, a forward maximum matching segmentation algorithm is used to match the longest possible place name. During recognition, the mainstream disambiguation approach introduces context information through hierarchical clustering to resolve the ambiguity of different places that share the same name.
However, current place name recognition methods have the following problems when recognizing the place names referred to in internet news:
(1) In practice, place name recognition based on the place name hierarchy model and forward maximum matching segmentation produces inaccurate results. For example, it matches the region "Gao'an County" inside the phrase "improving safety" and the region "Gongnong District, Hegang City" inside the phrase "industry and agriculture", because the characters of the place name happen to appear inside an unrelated phrase;
(2) Local news media reports often use expressions such as "our city" and "our county" without ever stating the corresponding place name; in such cases, plain place name recognition cannot determine the place the news refers to;
(3) Local news media reports sometimes name only a district without mentioning any other region, for example district-level administrative divisions such as "city middle region" and "flat region". Because many administrative divisions at this level share the same name and no other context information is available, the place name disambiguation program cannot determine the region correctly.
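Problem (1) can be reproduced with a toy sketch of forward maximum matching. This is only an illustration under stated assumptions: the function name, the two-entry gazetteer, and the Chinese strings (an assumed reconstruction of the phrase translated as "improving safety") do not come from the patent text.

```python
def forward_max_match(text, gazetteer):
    """Scan left to right, always taking the longest gazetteer key that starts here."""
    if not gazetteer:
        return []
    max_len = max(len(key) for key in gazetteer)
    hits, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in gazetteer:
                hits.append(gazetteer[piece])  # record the full place name
                i += size
                break
        else:
            i += 1  # no place name starts at this character
    return hits


# Hypothetical gazetteer: short form found in text -> full administrative name.
GAZETTEER = {"高安": "Gao'an County", "鹤岗": "Hegang City"}

# Assumed reconstruction of the phrase translated as "improving safety".
print(forward_max_match("提高安全", GAZETTEER))
```

The matcher has no notion of word boundaries, so a place name spanning the end of one word and the start of the next is reported as a hit. The noun-phrase conversion and the two reductions described later are aimed at exactly this kind of false positive.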
Disclosure of Invention
The technical task of the invention is to provide a method for recognizing the place names referred to in internet news based on part-of-speech tagging that is easy to understand, simple to implement, and effective against the low accuracy of extracting the place names referred to in news.
The invention further provides a system for recognizing the place names referred to in internet news based on part-of-speech tagging.
To achieve this purpose, the invention provides the following technical solution:
A method for recognizing the place names referred to in internet news based on part-of-speech tagging, in which the overall reporting region of a news media column is used to supplement the context information of the news and to help the place name disambiguation program determine place names correctly; part-of-speech tagging is used to convert the news content into a sequence of pure noun phrases; place name recognition is performed on the noun phrase sequence; two place name reductions are applied to the recognition results to eliminate inaccurate place names; and finally the two reduction results are combined by weighting to confirm the place names.
Preferably, the method specifically comprises the following steps:
S1, determining the overall reporting region of a media column: the place name whose number of occurrences accounts for a dominant share among the place names referred to by all reports under a column of the news media;
S2, acquiring a noun phrase sequence: applying part-of-speech tagging to the incoming title input and text input to obtain a noun phrase sequence;
S3, first place name reduction: performing reduction on each individual noun phrase in the noun phrase sequence;
S4, second place name reduction: on the basis of the first place name reduction, performing a second word segmentation on each place name recognition result and its corresponding noun phrase, and reducing accordingly;
S5, obtaining the recognized place names: adding the results of the first and second place name reductions to a weight set, and confirming the place names referred to after weighted aggregation.
Preferably, in step S1, the overall reporting region of the media column is determined and used in place name disambiguation to supplement the context information of the news; the overall reporting region of the media column plus a space is concatenated with the news title and with the news text respectively to obtain the title input and the text input.
If the place names that appear are roughly evenly distributed, the corresponding upper-level place name is selected as the overall reporting region of the column.
Preferably, in step S2, punctuation marks in the original text and words whose part of speech is not a noun are replaced with spaces in the noun phrase sequence.
Preferably, in step S3, place name recognition is performed on the noun phrase sequences of the title input and the text input, and within each noun phrase, place names that are recognized later and have no upper-lower level relationship are removed.
Preferably, in step S4, a second word segmentation is performed on each place name recognition result and its corresponding noun phrase; if every word obtained by segmenting the recognition result appears consecutively in the segmentation of the corresponding noun phrase, the recognition is deemed accurate; otherwise the recognition result is removed.
The system for recognizing the place names referred to in internet news based on part-of-speech tagging comprises the following modules:
a media column overall reporting region determination module, used for determining the place name whose number of occurrences accounts for a dominant share among the place names referred to by all reports under a column of the news media;
a noun phrase sequence acquisition module, used for applying part-of-speech tagging to the incoming title input and text input to obtain a noun phrase sequence;
a first place name reduction module, used for performing reduction on each individual noun phrase in the noun phrase sequence;
a second place name reduction module, used for performing, on the basis of the first place name reduction, a second word segmentation on each place name recognition result and its corresponding noun phrase, and reducing accordingly;
and a recognized place name acquisition module, used for adding the results of the first and second place name reductions to a weight set and confirming the place names referred to after weighted aggregation.
Preferably, the media column overall reporting region determination module provides the overall reporting region for place name disambiguation to supplement the context information of the news, and concatenates the overall reporting region of the media column plus a space with the news title and the news text respectively to obtain the title input and the text input.
Preferably, in the noun phrase sequence acquisition module, punctuation marks in the original text and words whose part of speech is not a noun are replaced with spaces in the noun phrase sequence.
Preferably, the first place name reduction module performs place name recognition on the noun phrase sequences of the title input and the text input and, within each noun phrase, removes place names that are recognized later and have no upper-lower level relationship; the second place name reduction module performs, on the basis of the first place name reduction, a second word segmentation on each place name recognition result and its corresponding noun phrase, and reduces accordingly.
Compared with the prior art, the method for recognizing the place names referred to in internet news based on part-of-speech tagging has the following outstanding advantages: it is easy to understand and simple to implement; by using the overall reporting region of the news media column and reducing the place name recognition results twice, it effectively addresses the low accuracy of extracting the place names referred to in news; in practical use, the accuracy of recognizing the place names referred to in internet news exceeds 90%; and it has good value for popularization and application.
Drawings
Fig. 1 is a flowchart of the method for recognizing the place names referred to in internet news based on part-of-speech tagging according to the present invention.
Detailed Description
The method and system for recognizing the place names referred to in internet news based on part-of-speech tagging are described in further detail below with reference to the accompanying drawing and embodiments.
Examples
As shown in Fig. 1, the method for recognizing the place names referred to in internet news based on part-of-speech tagging uses the overall reporting region of a news media column to supplement the context information of the news and to help the place name disambiguation program determine place names correctly; uses part-of-speech tagging to convert the news content into a sequence of pure noun phrases; performs place name recognition on the noun phrase sequence; applies two place name reductions to the recognition results to eliminate inaccurate place names; and finally combines the two reduction results by weighting to confirm the place names.
The method specifically comprises the following steps:
S1, determining the overall reporting region of a media column: the place name whose number of occurrences accounts for a dominant share among the place names referred to by all reports under a column of the news media.
If the place names that appear are roughly evenly distributed, the corresponding upper-level place name is selected as the overall reporting region of the column. The overall reporting region is used in place name disambiguation to supplement the context information of the news: the overall reporting region of the media column plus a space is concatenated with the news title and with the news text respectively to obtain the title input and the text input.
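Step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the 0.5 dominance threshold, the `PARENT` table, the function names, and the choice to put the region before the title and text are all assumptions, since the text only speaks of a dominant share and of concatenation with a space.

```python
from collections import Counter

# Hypothetical parent table: division -> upper-level division.
PARENT = {"Gangcheng District": "Laiwu City", "Laicheng District": "Laiwu City"}


def overall_region(place_names, parent=PARENT, dominance=0.5):
    """Return the dominant place name, or the shared upper-level name when none dominates."""
    if not place_names:
        return None
    counts = Counter(place_names)
    name, top = counts.most_common(1)[0]
    if top / len(place_names) > dominance:  # assumed reading of "dominant share"
        return name
    parents = {parent.get(p, p) for p in counts}
    return parents.pop() if len(parents) == 1 else None


def build_inputs(region, title, body):
    """Concatenate region + space with the title and the body (region first, by assumption)."""
    return f"{region} {title}", f"{region} {body}"


region = overall_region(["Gangcheng District", "Laicheng District"])
print(build_inputs(region, "Our city opens a new school", "Our city today ..."))
```

Prefixing the region gives the downstream disambiguation step an explicit place name even when the article itself only says "our city".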
S2, acquiring a noun phrase sequence: part-of-speech tagging is applied to the incoming title input and text input to obtain a noun phrase sequence.
In the noun phrase sequence, punctuation marks in the original text and words whose part of speech is not a noun are replaced with spaces.
For example, take the text "The People's Procuratorate of Gangcheng District, Laiwu City carried out judicial relief work, issued 30,000 yuan in judicial relief funds to criminal victims, protected the lawful rights and interests of criminal victims, and ensured that criminal proceedings were carried out smoothly." Its noun phrase sequence is "Laiwu City Gangcheng District People's Procuratorate", "judicial relief work", "criminal victims", "judicial relief funds", "criminal victims", "lawful rights and interests", "criminal proceedings".
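The conversion in step S2 can be sketched on tokens that have already been tagged. The sketch assumes a hypothetical tagset in which noun tags begin with `n`; the tagger itself, the token boundaries, and the function name are illustrative and not specified by the patent.

```python
def noun_phrase_sequence(tagged_tokens, joiner=""):
    """Keep runs of adjacent nouns as phrases; every other token acts as a separator."""
    phrases, current = [], []
    for word, tag in tagged_tokens:
        if tag.startswith("n"):      # assumed convention: noun tags start with "n"
            current.append(word)
        elif current:                # a non-noun or punctuation mark closes the phrase
            phrases.append(joiner.join(current))
            current = []
    if current:
        phrases.append(joiner.join(current))
    return phrases


# Hand-tagged English stand-ins for part of the Laiwu example sentence.
tagged = [
    ("Laiwu City", "ns"), ("Gangcheng District", "ns"), ("People's Procuratorate", "nt"),
    ("carried out", "v"), ("judicial relief work", "n"), (",", "x"),
    ("issued", "v"), ("judicial relief funds", "n"),
]
print(noun_phrase_sequence(tagged, joiner=" "))
```

For Chinese text the default empty `joiner` would concatenate adjacent nouns directly, which matches replacing only the non-noun tokens with spaces.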
S3, first place name reduction: reduction is performed on each individual noun phrase in the noun phrase sequence.
Place name recognition is performed on the noun phrase sequences of the title input and the text input, and within each noun phrase, place names that are recognized later and have no upper-lower level relationship with the earlier ones are removed. For example, for the noun phrase "south part of Longyan", the place name recognition results are "Longyan City" and "Nanbu County" (a county whose name is written the same way as "south part" in Chinese). Since "Nanbu County" and "Longyan City" have no upper-lower level relationship, the recognition result "Nanbu County" is removed in this reduction.
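The first reduction can be sketched with a small parent table, using the Longyan example with the second candidate written as "Nanbu County" (an assumed rendering of the translated "south county"). The table, the function names, and the rule that a later name is kept when it is related to at least one name already kept are assumptions made for illustration.

```python
# Hypothetical hierarchy: division -> direct upper-level division.
PARENT = {
    "Longyan City": "Fujian Province",
    "Nanchong City": "Sichuan Province",
    "Nanbu County": "Nanchong City",
}


def ancestors(name, parent):
    """Walk upward through the parent table and collect every upper-level division."""
    chain = []
    while parent.get(name) is not None:
        name = parent[name]
        chain.append(name)
    return chain


def related(a, b, parent):
    """True when one division lies above the other in the hierarchy."""
    return a in ancestors(b, parent) or b in ancestors(a, parent)


def first_reduction(recognized, parent=PARENT):
    """Within one noun phrase, drop later names unrelated to the names already kept."""
    kept = []
    for name in recognized:
        if not kept or any(related(name, k, parent) for k in kept):
            kept.append(name)
    return kept


print(first_reduction(["Longyan City", "Nanbu County"]))
```

A stricter variant could require the later name to be related to every earlier name; the text does not say which reading is intended.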
S4, second place name reduction: on the basis of the first place name reduction, a second word segmentation is performed on each place name recognition result and on its corresponding noun phrase, and the results are reduced accordingly. If every word obtained by segmenting the place name recognition result appears consecutively in the segmentation of the corresponding noun phrase, the recognition is deemed accurate; otherwise the recognition result is removed.
For example, the place name recognition result "Gongnong District, Hegang City" is segmented into its own words, and the corresponding noun phrase, "total industrial and agricultural output value of Hegang City", is segmented separately. Because the words of the recognition result do not appear consecutively and identically in the segmentation of the noun phrase, this recognition result is eliminated.
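The consecutive-occurrence test of step S4 amounts to checking whether one token list is a contiguous sub-list of another. In the sketch below the token lists are written by hand in English as stand-ins for the output of a real word segmenter; the tokens and function names are hypothetical.

```python
def is_contiguous_sublist(needle, haystack):
    """True when all tokens of needle occur in haystack, adjacent and in the same order."""
    n = len(needle)
    return n > 0 and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )


def second_reduction(candidates):
    """candidates: (place name, tokens of the matched place text, tokens of the noun phrase)."""
    return [
        name
        for name, name_tokens, phrase_tokens in candidates
        if is_contiguous_sublist(name_tokens, phrase_tokens)
    ]


candidates = [
    # The segmenter keeps "industry and agriculture" whole, so "Gongnong" is not a token.
    ("Gongnong District, Hegang City",
     ["Hegang City", "Gongnong"],
     ["Hegang City", "industry and agriculture", "total output value"]),
    ("Gangcheng District, Laiwu City",
     ["Laiwu City", "Gangcheng District"],
     ["Laiwu City", "Gangcheng District", "People's Procuratorate"]),
]
print(second_reduction(candidates))
```

The check works because a proper segmenter respects word boundaries that the dictionary matcher ignored, so a place name that cuts across a word cannot line up with the phrase's own tokens.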
S5, obtaining the recognized place names: the results of the first and second place name reductions are added to a weight set, and the place names referred to are confirmed after weighted aggregation.
In practice, the second reduction sometimes removes too much, so the results of both the first and the second place name reduction are added to the weight set. Administrative divisions at the province, city, and district/county levels rarely share names, so each time such a division appears in the results of the two reductions, its weight is increased by 1; each time an administrative division at the township, town, or village level appears, its weight is increased by 0.6. Finally, all administrative divisions with the maximum weight in the weight set are output.
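The weighting in step S5 can be sketched as below, using the increments of 1 and 0.6 stated above. The level labels, the `level_of` lookup, and the function name are hypothetical; how levels are actually stored is not described in the text.

```python
import math
from collections import defaultdict

UPPER_LEVELS = {"province", "city", "county"}  # province, city, district/county


def confirm_place_names(first_results, second_results, level_of):
    """Accumulate weights over both reduction results and return every top-weighted name."""
    weights = defaultdict(float)
    for name in list(first_results) + list(second_results):
        # +1 for province/city/district-county divisions, +0.6 for lower levels.
        weights[name] += 1.0 if level_of[name] in UPPER_LEVELS else 0.6
    if not weights:
        return []
    best = max(weights.values())
    return sorted(n for n, w in weights.items() if math.isclose(w, best))


level_of = {"Laiwu City": "city", "Gangcheng District": "county", "Example Town": "town"}
first = ["Laiwu City", "Gangcheng District", "Example Town"]
second = ["Laiwu City", "Gangcheng District"]
print(confirm_place_names(first, second, level_of))
```

Because both result lists contribute, a name dropped by an overly aggressive second reduction still carries its weight from the first reduction, while names confirmed by both rounds rank highest.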
The system of the invention for recognizing the place names referred to in internet news based on part-of-speech tagging comprises the following modules:
The media column overall reporting region determination module is used for determining the place name whose number of occurrences accounts for a dominant share among the place names referred to by all reports under a column of the news media.
This module provides the overall reporting region for place name disambiguation to supplement the context information of the news, and concatenates the overall reporting region of the media column plus a space with the news title and the news text respectively to obtain the title input and the text input.
The noun phrase sequence acquisition module is used for applying part-of-speech tagging to the incoming title input and text input to obtain a noun phrase sequence.
In the noun phrase sequence acquisition module, punctuation marks in the original text and words whose part of speech is not a noun are replaced with spaces in the noun phrase sequence.
The first place name reduction module is used for performing reduction on each individual noun phrase in the noun phrase sequence.
The first place name reduction module performs place name recognition on the noun phrase sequences of the title input and the text input and, within each noun phrase, removes place names that are recognized later and have no upper-lower level relationship.
The second place name reduction module is used for performing, on the basis of the first place name reduction, a second word segmentation on each place name recognition result and its corresponding noun phrase, and reducing accordingly.
The recognized place name acquisition module is used for adding the results of the first and second place name reductions to a weight set and confirming the place names referred to after weighted aggregation.
The above-described embodiments are merely preferred embodiments of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.

Claims (6)

1. A method for recognizing the place names referred to in internet news based on part-of-speech tagging, characterized in that: the method uses the overall reporting region of a news media column to supplement the context information of the news and to assist a place name disambiguation program in determining place names correctly, uses part-of-speech tagging to convert the news content into a sequence of pure noun phrases, performs place name recognition on the noun phrase sequence, performs two place name reductions on the place name recognition results to eliminate inaccurate place names, and finally combines the two place name reduction results by weighting to confirm the place names; the method specifically comprises the following steps:
S1, determining the overall reporting region of a media column: the place name whose number of occurrences accounts for a dominant share among the place names referred to by all reports under a column of the news media;
S2, acquiring a noun phrase sequence: applying part-of-speech tagging to the incoming title input and text input to obtain a noun phrase sequence;
S3, first place name reduction: performing reduction on each individual noun phrase in the noun phrase sequence, performing place name recognition on the noun phrase sequences of the title input and the text input, and removing, within each noun phrase, place names that are recognized later and have no upper-lower level relationship;
S4, second place name reduction: on the basis of the first place name reduction, performing a second word segmentation on each place name recognition result and its corresponding noun phrase, and reducing accordingly;
S5, obtaining the recognized place names: adding the results of the first and second place name reductions to a weight set, and confirming the place names referred to after weighted aggregation.
2. The method for recognizing the place names referred to in internet news based on part-of-speech tagging according to claim 1, characterized in that: in step S1, the overall reporting region of the media column is determined and used in place name disambiguation to supplement the context information of the news, and the overall reporting region of the media column plus a space is concatenated with the news title and the news text respectively to obtain the title input and the text input.
3. The method for recognizing the place names referred to in internet news based on part-of-speech tagging according to claim 2, characterized in that: in step S2, punctuation marks in the original text and words whose part of speech is not a noun are replaced with spaces in the noun phrase sequence.
4. A system for recognizing the place names referred to in internet news based on part-of-speech tagging, characterized in that the system comprises the following modules:
a media column overall reporting region determination module, used for determining the place name whose number of occurrences accounts for a dominant share among the place names referred to by all reports under a column of the news media; a noun phrase sequence acquisition module, used for applying part-of-speech tagging to the incoming title input and text input to obtain a noun phrase sequence;
a first place name reduction module, used for performing reduction on each individual noun phrase in the noun phrase sequence;
a second place name reduction module, used for performing, on the basis of the first place name reduction, a second word segmentation on each place name recognition result and its corresponding noun phrase, and reducing accordingly;
a recognized place name acquisition module, used for adding the results of the first and second place name reductions to a weight set and confirming the place names referred to after weighted aggregation; wherein the first place name reduction module performs place name recognition on the noun phrase sequences of the title input and the text input and, within each noun phrase, removes place names that are recognized later and have no upper-lower level relationship; and the second place name reduction module performs, on the basis of the first place name reduction, a second word segmentation on each place name recognition result and its corresponding noun phrase, and reduces accordingly.
5. The system for recognizing the place names referred to in internet news based on part-of-speech tagging according to claim 4, characterized in that: the media column overall reporting region determination module provides the overall reporting region for place name disambiguation to supplement the context information of the news, and concatenates the overall reporting region of the media column plus a space with the news title and the news text respectively to obtain the title input and the text input.
6. The system for recognizing the place names referred to in internet news based on part-of-speech tagging according to claim 5, characterized in that: in the noun phrase sequence acquisition module, punctuation marks in the original text and words whose part of speech is not a noun are replaced with spaces in the noun phrase sequence.
CN201910681163.3A 2019-07-26 2019-07-26 Method and system for identifying internet news related to place names based on part-of-speech tagging Active CN110399613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910681163.3A CN110399613B (en) 2019-07-26 2019-07-26 Method and system for identifying internet news related to place names based on part-of-speech tagging

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910681163.3A CN110399613B (en) 2019-07-26 2019-07-26 Method and system for identifying internet news related to place names based on part-of-speech tagging

Publications (2)

Publication Number Publication Date
CN110399613A CN110399613A (en) 2019-11-01
CN110399613B CN110399613B (en) 2023-03-31

Family

ID=68326170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910681163.3A Active CN110399613B (en) 2019-07-26 2019-07-26 Method and system for identifying internet news related to place names based on part-of-speech tagging

Country Status (1)

Country Link
CN (1) CN110399613B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737407B (en) * 2020-08-25 2020-11-10 成都数联铭品科技有限公司 Event unique ID construction method based on event disambiguation
CN112069824B (en) * 2020-11-11 2021-02-02 北京智慧星光信息技术有限公司 Region identification method, device and medium based on context probability and citation
CN112966511B (en) * 2021-02-08 2024-03-15 广州探迹科技有限公司 Entity word recognition method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013257634A (en) * 2012-06-11 2013-12-26 Nippon Telegr & Teleph Corp <Ntt> Apparatus and method for extracting a pair of place name and word from document, and program
CN103853738A (en) * 2012-11-29 2014-06-11 中国科学院计算机网络信息中心 Identification method for webpage information related region
CN106709011A (en) * 2016-12-26 2017-05-24 武汉大学 Positional concept hierarchy disambiguation calculation method based on spatial locating cluster
CN109299456A (en) * 2018-08-28 2019-02-01 昆明理工大学 A kind of place name identification method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
基于地名识别的地方新闻分类方法;李果等;《软件》;20180415(第04期);全文 *
基于条件随机场的中文地名识别方法;邬伦等;《武汉大学学报(信息科学版)》;20170205(第02期);全文 *

Also Published As

Publication number Publication date
CN110399613A (en) 2019-11-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 271000 Langchao science and Technology Park, 527 Dongyue street, Tai'an City, Shandong Province

Applicant after: INSPUR SOFTWARE Co.,Ltd.

Address before: No. 1036, Shandong high tech Zone wave road, Ji'nan, Shandong

Applicant before: INSPUR SOFTWARE Co.,Ltd.

GR01 Patent grant