CN113761137B

CN113761137B - Method and device for extracting address information

Info

Publication number: CN113761137B
Application number: CN202010491275.5A
Authority: CN
Inventors: 王潇斌; 丁瑞雪; 刘楚; 徐光伟; 马春平; 龙定坤; 谢朋峻
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2020-06-02
Filing date: 2020-06-02
Publication date: 2024-01-09
Anticipated expiration: 2040-06-02
Also published as: CN113761137A

Abstract

The invention discloses a method and a device for extracting address information, relates to the field of computer technology, and mainly aims to extract correct and valid address information from dialogue text. The main technical scheme is as follows: determining, based on the dialogue text, a text to be extracted from which address information is to be obtained; performing word segmentation on the text to be extracted to obtain first address word segments; identifying second address word segments in the text to be extracted by using a preset dictionary; and integrating the first address word segments and the second address word segments according to administrative level to obtain the address information corresponding to the dialogue text.

Description

Method and device for extracting address information

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for extracting address information.

Background

With the rapid development of internet technology, the network has become the largest gathering place of geographic information, and internet geographic information has entered the era of big data. At least 80% of human-computer interaction text data in the next 10 years will relate to geographic information. The internet has become a continuously updated, large-scale geographic information database, and how to mine this geographic information and apply it to geographic information services is a major problem. Place name and address data is the most common public information resource; it is closely related to the daily life of the public and is also a basic resource for government administration. Extracting and standardizing place name and address information, and converting it into a basic product for geographic information services that supports production and daily life, is therefore an urgent need.

Existing algorithms for mining geographic location information mainly rely on keyword matching. However, place name and address information in text acquired in an internet environment, particularly in dialogue scenarios, suffers from problems such as incorrect descriptions, inaccuracy, homophones and non-standard expressions. As a result, the accuracy of location information mining algorithms based on keyword matching is low and does not meet the requirements of various industries for geographic information.

Disclosure of Invention

In view of the above problems, the present invention provides a method and apparatus for extracting address information, and is mainly aimed at extracting correct and valid address information from dialogue text.

In order to achieve the above purpose, the present invention mainly provides the following technical solutions:

In one aspect, the present invention provides a method for extracting address information, which specifically includes:

determining, based on the dialogue text, a text to be extracted from which address information is to be obtained;

performing word segmentation on the text to be extracted to obtain first address word segments;

identifying second address word segments in the text to be extracted by using a preset dictionary;

and integrating the first address word segments and the second address word segments according to administrative level to obtain the address information corresponding to the dialogue text.

Preferably, determining the text to be extracted of the address information based on the dialogue text includes:

acquiring one or more groups of question-answer information pairs in the dialogue text;

and combining the question information and the answer information in the question and answer information pair to generate the text to be extracted.

Preferably, the combining the question information and the answer information in the question and answer information pair to generate the text to be extracted includes:

determining a query term in the question information;

and replacing the query words in the question information with the reply information to obtain the text to be extracted.

Preferably, performing word segmentation on the text to be extracted to obtain the first address word segments includes:

performing word segmentation on the text to be extracted to obtain address word segments;

judging whether the address word segments are correct by using a preset rule;

if an error exists, correcting the address word segments to obtain the first address word segments;

if no error exists, using the address word segments as the first address word segments.

Preferably, judging whether the address word segments are correct by using a preset rule includes:

judging whether the address word segments are correctly segmented by using the MMSEG word segmentation algorithm and a place name dictionary.

Preferably, integrating the first address word segments and the second address word segments according to administrative level includes:

performing de-duplication on the first address word segments and the second address word segments;

sorting the de-duplicated address word segments according to administrative level;

combining the sorted address word segments into address information.

In another aspect, the present invention provides an apparatus for extracting address information, specifically including:

a determining unit for determining a text to be extracted of the address information based on the dialogue text;

the word segmentation unit is used for segmenting the text to be extracted to obtain a first address word segmentation;

the identification unit is used for identifying second address word segmentation with administrative level in the text to be extracted, which is determined by the determination unit, by using a preset dictionary;

and the generating unit is used for integrating the first address word obtained by the word segmentation unit and the second address word obtained by the identification unit according to administrative level to obtain the address information corresponding to the dialogue text.

Preferably, the determining unit includes:

the acquisition module is used for acquiring one or more groups of question-answer information pairs in the dialogue text;

and the generation module is used for combining the question information and the answer information in the question and answer information pair obtained by the acquisition module to generate the text to be extracted.

Preferably, the generating module is specifically configured to determine a query term in the question information; and replacing the query words in the question information with the reply information to obtain the text to be extracted.

Preferably, the word segmentation unit includes:

the word segmentation module is used for segmenting the text to be extracted to obtain address word segmentation;

the judging module is used for judging, by using a preset rule, whether the address word segments obtained by the word segmentation module are correct;

the correction module is used for correcting the address word segmentation to obtain a first address word segmentation if the judgment module determines that an error exists;

and the determining module is used for using the address segmentation as a first address segmentation if the judging module determines that no error exists.

Preferably, the judging module is specifically configured to judge whether the address word segments are correctly segmented by using the MMSEG word segmentation algorithm and a place name dictionary.

Preferably, the generating unit includes:

the duplication eliminating module is used for carrying out duplication eliminating treatment on the first address word segmentation and the second address word segmentation;

the sorting module is used for sorting the address segmentation words subjected to the duplication removal processing by the duplication removal module according to administrative levels;

and the combination module is used for combining the address components sequenced by the sequencing module into address information.

In another aspect, the present invention provides a processor, where the processor is configured to run a program, and the method for extracting address information is performed when the program runs.

By means of the above technical scheme, the invention provides a method and a device for extracting address information that are designed mainly for dialogue text. Because language in dialogue text is conversational and concise, the embodiment of the invention first processes the dialogue text to determine the text to be extracted, then obtains the address word segments contained in that text separately through word segmentation and through a preset dictionary, and finally integrates the obtained address word segments according to administrative level into correct and valid address information for output. By contrast, existing approaches obtain address information by using either a model or a dictionary alone: when they are applied to terse dialogue text, the existing recognition model has difficulty labeling address words, the accuracy of the recognized address words is hard to judge, and the validity of the obtained address information cannot be guaranteed.

The foregoing is only an overview of the technical scheme of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention become more readily apparent, specific embodiments of the invention are set forth below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:

FIG. 1 is a flowchart of a method for extracting address information according to an embodiment of the present invention;

FIG. 2 is a flowchart of another method for extracting address information according to an embodiment of the present invention;

FIG. 3 is a block diagram of an apparatus for extracting address information according to an embodiment of the present invention;

FIG. 4 is a block diagram of another apparatus for extracting address information according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

The method for extracting the address information provided by the embodiment of the invention optimizes and improves the address information extraction mode aiming at the question-answer information collected in the dialogue scene. The specific steps of the method are shown in fig. 1, and the method comprises the following steps:

Step 101, determining a text to be extracted of the address information based on the dialogue text.

The dialogue text in the embodiment of the invention refers to question-and-answer information collected in a dialogue scenario. For example, in an emergency call, the caller needs to tell the call taker his or her location through the dialogue; likewise, customer service systems that handle complaint calls, inquiry calls and the like need to obtain the customer's address information from the questions and answers of the dialogue.

Because dialogue text is mostly obtained through speech-to-text conversion, and language in dialogue scenarios is mostly abbreviated, dialogue text obtained in such scenarios consists largely of sentences without complete semantic expression, and conventional address recognition methods cannot effectively and correctly recognize the address information in it. The dialogue text is therefore processed first so that the processed text can be recognized. In this embodiment, the text to be extracted may be determined by grouping the dialogue text into question-answer pairs, or by filtering out part of the question-and-answer information, for example by deleting answers that are too brief.

The purpose of this step is to obtain, by processing the dialogue text, a text to be extracted that is easier to recognize, thereby improving the efficiency of address information recognition.

Step 102, word segmentation is carried out on the text to be extracted, and a first address word segmentation is obtained.

Because the text to be extracted, obtained by processing the dialogue text, has more complete semantic expression, the address word segments in it can be extracted through word segmentation. The word segmentation and the recognition of address words may be performed with an existing address recognition model, for example an address recognition model trained on labeled samples from a public news corpus.

The address recognition model labels the possible address words in the input text to be extracted, and these address words are defined as the first address word segments.

Step 103, recognizing second address word segments in the text to be extracted by using a preset dictionary.

The preset dictionary in the embodiment of the invention is a place name dictionary configured according to the specific application requirements. It records place names that have administrative levels; in standard usage, such place names generally carry a word indicating the corresponding administrative level, such as country, province, city, district/county or street. The address word segments in the text to be extracted can be matched quickly by using the preset dictionary, thereby obtaining the second address word segments.
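The dictionary lookup described in this step can be illustrated with a minimal sketch. The description does not fix a matching method, so the longest-match scan below is only one possible choice, and the names `GAZETTEER` and `match_place_names`, the level numbers and the dictionary entries are all hypothetical. Chinese strings are used because the text being matched is unsegmented Chinese; English glosses are given in the comments.

```python
# Hypothetical place name dictionary: name -> administrative level
# (1 = province, 2 = city, 3 = district/county). Illustrative entries only.
GAZETTEER = {
    "湖北省": 1,  # Hubei Province
    "湖北": 1,    # Hubei
    "浙江": 1,    # Zhejiang
    "武汉": 2,    # Wuhan
    "杭州": 2,    # Hangzhou
    "孝感": 2,    # Xiaogan
    "孝昌": 3,    # Xiaochang
}


def match_place_names(text, gazetteer=GAZETTEER):
    """Scan left to right, taking the longest dictionary entry at each position."""
    max_len = max(len(name) for name in gazetteer)
    matches = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "湖北省" wins over "湖北".
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in gazetteer:
                matches.append((candidate, gazetteer[candidate]))
                i += size
                break
        else:
            i += 1  # no entry starts here; move on by one character
    return matches


# "You are in Hangzhou."
print(match_place_names("你在杭州。"))  # [('杭州', 2)]
```

Each match carries its administrative level, which is the information the later integration step needs for sorting.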

It should be noted that there is no required logical order between this step and the preceding step 102, and the specific manner of matching address word segments by using the preset dictionary is not particularly limited.

Step 104, integrating the first address word segments and the second address word segments according to administrative level to obtain the address information corresponding to the dialogue text.

This step combines the two groups of address word segments, recognized through word segmentation and through the preset dictionary respectively, to obtain address information that conforms to the rules of address expression. In this embodiment, the address word segments are arranged and combined in order of administrative level, thereby obtaining valid address information. Integrating the first and second address word segments may yield one or more pieces of address information. For example, when there are several address word segments of the same administrative level, it is determined that several pieces of address information can be obtained from the text to be extracted.

As can be seen from the above description, the method for extracting address information provided by the embodiment of the invention first processes the dialogue text into the text to be extracted, so as to overcome problems in the dialogue such as abbreviated expression and speech-to-text conversion errors. It then applies the address recognition model and the preset place name dictionary separately to extract address word segments from the text to be extracted, and integrates the two sets of address word segments according to administrative level to obtain recognizable and valid address information. Compared with existing ways of recognizing address information, the embodiment of the invention can effectively recognize addresses in dialogue text obtained in a dialogue scenario and can extract valid address information from it accurately and quickly.
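The flow of steps 101 to 104 can be sketched as a small pipeline. This is a minimal illustration under stated assumptions rather than the implementation of the embodiment: the function `extract_addresses` and its four callable parameters are hypothetical, and the lambdas in the demonstration are trivial stand-ins for a real address recognition model and place name dictionary.

```python
from typing import Callable, List, Tuple


def extract_addresses(
    qa_pairs: List[Tuple[str, str]],
    build_text: Callable[[str, str], str],
    recognize_with_model: Callable[[str], List[str]],
    recognize_with_dictionary: Callable[[str], List[str]],
    integrate: Callable[[List[str], List[str]], List[str]],
) -> List[str]:
    """Run steps 101-104 over all question-answer pairs of one dialogue."""
    first_words: List[str] = []
    second_words: List[str] = []
    for question, answer in qa_pairs:
        text = build_text(question, answer)                   # step 101
        first_words.extend(recognize_with_model(text))        # step 102
        second_words.extend(recognize_with_dictionary(text))  # step 103
    return integrate(first_words, second_words)               # step 104


# Demonstration with stand-in components (all hypothetical).
demo = extract_addresses(
    [("你在哪里？", "湖北武汉")],  # "Where are you?" / "Hubei Wuhan"
    build_text=lambda q, a: q.replace("哪里", a),
    recognize_with_model=lambda text: [w for w in ("武汉",) if w in text],
    recognize_with_dictionary=lambda text: [w for w in ("湖北", "武汉") if w in text],
    integrate=lambda first, second: ["".join(dict.fromkeys(second + first))],
)
print(demo)  # ['湖北武汉']
```

Passing the two recognizers in as separate callables mirrors the point that steps 102 and 103 do not depend on each other's output.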

Further, building on the method for extracting address information shown in FIG. 1, the following embodiment describes in detail the specific processing of the dialogue text and the manner of integrating the address information. The specific steps include:

Step 201, acquiring one or more sets of question-answer information pairs in the dialogue text.

In this embodiment, a question-answer information pair includes one piece of question information and the corresponding answer information. The question information and the answer information may be identified in the dialogue text by means of semantic recognition. The answer information corresponding to one piece of question information may consist of one or more pieces of information.

In practical applications, a dialogue text usually consists of multiple rounds of questions and answers on a certain topic, and sufficient address information often cannot be obtained from a single question-answer information pair. Analyzing multiple question-answer information pairs together therefore allows the address information to be obtained more comprehensively.
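Grouping a dialogue into question-answer information pairs can be sketched as follows. The embodiment identifies questions and answers by semantic recognition; the sketch below swaps in a much simpler punctuation heuristic (an utterance ending in a question mark is treated as a question) purely for illustration, and the function name `group_question_answer_pairs` is hypothetical.

```python
def group_question_answer_pairs(utterances):
    """Group utterances into (question, [answers]) pairs.

    Assumption: an utterance ending in "?" or "？" is a question, and every
    following non-question utterance answers the most recent question.
    """
    pairs = []
    for text in utterances:
        text = text.strip()
        if text.endswith(("?", "？")):
            pairs.append((text, []))
        elif pairs:
            pairs[-1][1].append(text)
        # Utterances before the first question have nothing to attach to
        # and are skipped.
    return pairs


dialogue = [
    "你在哪里？",                # "Where are you?"
    "杭州",                      # "Hangzhou"
    "西湖区",                    # "Xihu District"
    "你十四天内去过哪些城市？",  # "Which cities have you been to in the last fourteen days?"
    "武汉",                      # "Wuhan"
]
pairs = group_question_answer_pairs(dialogue)
```

Here the first question collects two answer utterances, which matches the statement that one question may have one or more pieces of answer information.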

Step 202, combining the question information and the answer information in the question-answer information pair to generate the text to be extracted.

The purpose of combining the question information and the answer information is to enrich the semantic information of the text to be extracted. An existing address recognition model requires the input sentence to provide a certain amount of context, because address words are labeled according to the context within the sentence. If there is too little semantic information, the sentence lacks context and the model cannot label the address words effectively. For example, suppose the first question-answer information pair is: Question: "Where are you?" Answer: "Hangzhou". Considered on its own, the answer lacks the relevant context, so the address recognition model would find it difficult to recognize or would recognize it inaccurately. In this embodiment, the question-answer information pair is therefore combined to obtain a new text to be extracted.

Specifically, one way of combining is as follows: first determine the query word in the question information, and then replace the query word in the question information with the content of the answer information, thereby obtaining the text to be extracted. The query word may be determined by matching against preset query words. For example, for the question "Where are you?", matching determines that the query word in the sentence is "where", and the text to be extracted obtained after combination is "You are in Hangzhou." As another example, for the question "Which cities have you been to in the last fourteen days?" and the answer "Wuhan.", the combined text is "You have been to Wuhan in the last fourteen days."
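The replacement of the query word by the answer can be sketched in a few lines. The list `QUERY_WORDS`, the function name `combine_question_answer` and the fallback behavior when no query word is found are assumptions made for this illustration. Chinese strings are used because the substitution relies on Chinese word order, in which the query word stays in the position the answer will occupy; the Chinese sentences are illustrative renderings of the examples above.

```python
# Hypothetical preset query words, longest first so that longer phrases
# are tried before shorter ones.
QUERY_WORDS = ["哪些城市", "哪里", "哪儿"]  # "which cities", "where", "where"


def combine_question_answer(question, answer, query_words=QUERY_WORDS):
    """Replace the first preset query word found in the question with the answer."""
    answer = answer.strip().rstrip("。.!！")
    body = question.strip().rstrip("？?")
    for word in query_words:
        if word in body:
            return body.replace(word, answer, 1) + "。"
    # Assumed fallback: no query word matched, so keep both parts together
    # to give the answer at least some context.
    return body + "，" + answer + "。"


# "Where are you?" + "Hangzhou" -> "You are in Hangzhou."
print(combine_question_answer("你在哪里？", "杭州"))  # 你在杭州。
# "Which cities have you been to in the last fourteen days?" + "Wuhan."
print(combine_question_answer("你十四天内去过哪些城市？", "武汉。"))  # 你十四天内去过武汉。
```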

Step 203, word segmentation is carried out on the text to be extracted, and a first address word segmentation is obtained.

The text to be extracted obtained through the above steps can be labeled with address word segments by applying an existing address recognition model. An existing address recognition model, such as one trained on labeled samples from a public news corpus, is used because its training samples are plentiful, whereas for dialogue data it is difficult to collect labeled samples on a large scale for model training because of user privacy concerns.

However, when the text to be extracted is processed with an existing address recognition model, address words can be recognized only from the surface information of the text, and errors in the recognized address words cannot be identified on the basis of common knowledge. Therefore, the first address word segments are extracted in this step as follows: first, the address recognition model is used to recognize the address word segments contained in the text to be extracted, that is, to segment the text and label the possible address words; then, a preset rule is used to judge whether the recognized address word segments are correct. If an error exists, the corresponding address word segments are corrected to obtain the first address word segments; otherwise, the address word segments are used as the first address word segments.

The preset rule may include checking for common problems such as whether the address word segments contain repetitions, whether they contain an erroneous address, and whether their order is wrong. Accordingly, one specific way of checking whether the address word segments are correct in the embodiment of the invention is to judge whether they are correctly segmented by using the MMSEG word segmentation algorithm and a place name dictionary. MMSEG is a dictionary-based word segmentation algorithm that relies mainly on forward maximum matching, supplemented by several disambiguation rules. It is commonly used in Chinese word segmentation, and its principle and operation are not described in detail here. The MMSEG algorithm can be used to judge whether the address word segments extracted from the text to be extracted are segmented correctly, and the place name dictionary is used to determine whether each resulting address word is a correct place name. In this way, repetition problems (for example, "Wuhan Wuhan" extracted from "My home is in Wuhan Wuhan") and address error problems (for example, "Zhejiang, Hangzhou area") can be identified in the recognized address word segments. The administrative levels of adjacent address word segments can then be identified to determine their dependency relationship, so as to find out whether there is an ordering problem (for example, "Wuhan Hubei Province" extracted from "Yes, Wuhan Hubei Province").
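The three checks named above (repetition, erroneous address, wrong order) can be sketched on an already segmented list of address words. This sketch does not implement MMSEG; it only compares the words against a small place name dictionary. The names `PLACE_LEVELS`, `check_address_words` and `correct_address_words`, the problem labels and the simple correction strategy are all assumptions for illustration.

```python
# Hypothetical place name dictionary: name -> administrative level
# (1 = province, 2 = city). Illustrative entries only.
PLACE_LEVELS = {
    "湖北省": 1,  # Hubei Province
    "湖北": 1,    # Hubei
    "浙江": 1,    # Zhejiang
    "武汉": 2,    # Wuhan
    "杭州": 2,    # Hangzhou
}


def check_address_words(words, place_levels=PLACE_LEVELS):
    """Return the set of problems found in a sequence of address words."""
    problems = set()
    if len(set(words)) < len(words):
        problems.add("repeated")
    known = [w for w in words if w in place_levels]
    if len(known) < len(words):
        problems.add("unknown")  # some word is not a place name in the dictionary
    levels = [place_levels[w] for w in known]
    # A lower-level place followed by a higher-level one is out of order.
    if any(a > b for a, b in zip(levels, levels[1:])):
        problems.add("disordered")
    return problems


def correct_address_words(words, place_levels=PLACE_LEVELS):
    """Drop repeats and unknown words, then order from highest to lowest level."""
    unique_known = [w for w in dict.fromkeys(words) if w in place_levels]
    return sorted(unique_known, key=lambda w: place_levels[w])


print(check_address_words(["武汉", "武汉"]))      # {'repeated'}
print(check_address_words(["武汉", "湖北省"]))    # {'disordered'}
print(correct_address_words(["武汉", "湖北省", "武汉"]))  # ['湖北省', '武汉']
```

An empty result from `check_address_words` corresponds to the case in which the address words are used directly as the first address word segments.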

In this step, the address word segments in the text to be extracted are recognized with an existing address recognition model and then checked against the preset rule to determine whether they have problems. If there is no problem, the subsequent steps continue; if there is a problem, such as an erroneous address word segment, the address word segments are recognized again, thereby ensuring that accurate and valid first address word segments are obtained.

Step 204, recognizing second address word segments in the text to be extracted by using a preset dictionary.

This step identifies the address word segments that have an administrative level in the text to be extracted, and matching against the place name dictionary ensures that the identified address word segments are accurate. This step is the same as step 103 shown in FIG. 1 and is not described again here.

Step 205, integrating the first address word and the second address word according to the administrative level to obtain the address information corresponding to the dialogue text.

Because the first address word segments and the second address word segments are obtained by applying different recognition methods to the same text to be extracted, they may contain duplicates. The address integration performed in this step therefore first de-duplicates the first and second address word segments, then sorts the de-duplicated address word segments according to administrative level, generally from high to low, and finally combines them into address information according to the sorting result. In general, among address word segments of the same administrative level, those that have a membership relationship with the address word segment of the preceding level are selected and combined with it. There may be one or more pieces of address information. For example, if the extracted address word segments are "Xiaogan, Xiaochang, Wuhan, Hubei", they are sorted into "Hubei, Wuhan, Xiaogan, Xiaochang", and the address information obtained after combination is "Hubei Xiaogan Xiaochang" and "Hubei Wuhan".
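The de-duplicate, sort and combine sequence can be sketched with a dictionary that records, for each place, its administrative level and the place it belongs to. The data structure `PLACES`, the functions `ancestors` and `integrate`, and the rule of building one address per most specific place are assumptions chosen for this illustration; the embodiment does not prescribe them.

```python
# Hypothetical dictionary: name -> (administrative level, parent place).
PLACES = {
    "湖北": (1, None),    # Hubei (province)
    "武汉": (2, "湖北"),  # Wuhan (city in Hubei)
    "孝感": (2, "湖北"),  # Xiaogan (city in Hubei)
    "孝昌": (3, "孝感"),  # Xiaochang (county in Xiaogan)
}


def ancestors(name, places=PLACES):
    """Return the chain of parents of a place, nearest parent first."""
    chain = []
    parent = places[name][1]
    while parent is not None:
        chain.append(parent)
        parent = places[parent][1]
    return chain


def integrate(first_words, second_words, places=PLACES):
    """De-duplicate, sort by level and combine words into address strings."""
    # 1. De-duplicate, keeping only words known to the dictionary.
    merged = list(first_words) + list(second_words)
    words = [w for w in dict.fromkeys(merged) if w in places]
    # 2. Sort from the highest administrative level to the lowest.
    words.sort(key=lambda w: places[w][0])
    # 3. Build one address per most specific word, attaching only those
    #    ancestors that were actually extracted.
    covered = {a for w in words for a in ancestors(w, places)}
    addresses = []
    for w in words:
        if w in covered:
            continue  # already part of a more specific address
        parts = [a for a in reversed(ancestors(w, places)) if a in words]
        addresses.append("".join(parts + [w]))
    return addresses


# Xiaogan, Xiaochang, Wuhan from one recognizer; Hubei, Wuhan from the other.
print(integrate(["孝感", "孝昌", "武汉"], ["湖北", "武汉"]))
# ['湖北武汉', '湖北孝感孝昌']
```

The two results correspond to "Hubei Wuhan" and "Hubei Xiaogan Xiaochang" in the example above; only the order in which they are listed differs.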

Further, the embodiment of the invention can also judge the validity of the obtained address information. In general, a valid piece of address information is formed by combining several address fragments, so its validity can be judged from the number of address fragments it contains; for example, address information consisting of a single address fragment is regarded as an invalid address. This ensures that the obtained address information has practical value and is usable.
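The validity check by fragment count amounts to a simple filter. In this sketch each address is represented as a list of its fragments; the function name `filter_valid_addresses` and the configurable threshold `min_fragments` are assumptions, with the default of 2 chosen to match the example in which a single-fragment address is invalid.

```python
def filter_valid_addresses(addresses, min_fragments=2):
    """Keep only addresses made of at least min_fragments address fragments."""
    return [parts for parts in addresses if len(parts) >= min_fragments]


candidates = [["湖北", "武汉"], ["杭州"]]  # ["Hubei", "Wuhan"], ["Hangzhou"]
print(filter_valid_addresses(candidates))  # [['湖北', '武汉']]
```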

As can be seen from the above steps and the corresponding examples, the method for extracting address information provided by the embodiment of the invention extracts address information from dialogue text generated in a dialogue scenario by using an existing address recognition model in combination with a place name dictionary. By completing the information in the dialogue text, it makes it possible to label and recognize address word segments in the processed text to be extracted with the existing address recognition model. Finally, the embodiment of the invention integrates the recognized address word segments according to established usage conventions and outputs address information that can be used directly.

In summary, the method for extracting address information provided by the embodiment of the invention is mainly used to extract valid address information from dialogue text. With the internet now in widespread use, the method can also assist in determining a location. For example, when determining a person's location, in addition to positioning based on device location information (Location Based Service, LBS) and on the SIM card information of the device the person uses, the method can be applied to locate the person through the text of real-time communications, supplementing the combined positioning from LBS information and SIM card information and thereby improving positioning accuracy.

Further, as an implementation of the method shown in fig. 1 and 2, an embodiment of the present invention provides an apparatus for extracting address information, where the apparatus is mainly aimed at extracting correct and valid address information from a dialog text. For convenience of reading, the details of the foregoing method embodiment are not described one by one in the embodiment of the present apparatus, but it should be clear that the apparatus in this embodiment can correspondingly implement all the details of the foregoing method embodiment. The device is shown in fig. 3, and specifically comprises:

a determining unit 31 for determining a text to be extracted of the address information based on the dialogue text;

a word segmentation unit 32, configured to segment the text to be extracted determined by the determining unit 31, so as to obtain a first address word segmentation;

a recognition unit 33, configured to recognize the second address word segmentation in the text to be extracted determined by the determination unit 31 using a preset dictionary;

and a generating unit 34, configured to integrate the first address word obtained by the word segmentation unit 32 with the second address word obtained by the recognition unit 33 according to an administrative level, so as to obtain address information corresponding to the dialog text.

Further, as shown in fig. 4, the determining unit 31 includes:

an obtaining module 311, configured to obtain one or more sets of question-answer information pairs in the dialog text;

a generating module 312, configured to combine the question information and the answer information in the question and answer information pair obtained by the obtaining module 311, to generate the text to be extracted.

Further, as shown in fig. 4, the generating module 312 is specifically configured to determine a query term in the question information; and replacing the query words in the question information with the reply information to obtain the text to be extracted.

Further, as shown in fig. 4, the word segmentation unit 32 includes:

the word segmentation module 321 is configured to segment the text to be extracted to obtain an address word;

a judging module 322, configured to judge, by using a preset rule, whether the address word segments obtained by the word segmentation module 321 are correct;

a correction module 323, configured to correct the address word segments to obtain the first address word segments if the judging module 322 determines that an error exists;

a determining module 324, configured to use the address word segments as the first address word segments if the judging module 322 determines that no error exists.

Further, as shown in FIG. 4, the judging module 322 is specifically configured to judge whether the address word segments are correctly segmented by using the MMSEG word segmentation algorithm and a place name dictionary.

Further, as shown in fig. 4, the generating unit 34 includes:

the duplication removing module 341 is configured to perform duplication removing processing on the first address word segment and the second address word segment;

a sorting module 342, configured to sort the address segmentation words subjected to the duplication removal by the duplication removal module 341 according to an administrative level;

a combining module 343, configured to combine the address components ordered by the ordering module 342 into address information.

In addition, the embodiment of the invention also provides a processor, which is used for running a program, wherein the method for extracting address information provided by any one of the embodiments is executed when the program runs.

In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.

It will be appreciated that the relevant features of the methods and apparatus described above may be referenced to one another. In addition, "first", "second" and the like in the above embodiments are used only to distinguish the embodiments and do not indicate that one embodiment is superior or inferior to another.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such a system is apparent from the description above. In addition, the present invention is not directed to any particular programming language. It should be appreciated that the teachings of the present invention as described herein may be implemented in a variety of programming languages, and that the foregoing description of specific languages is provided to disclose preferred embodiments of the present invention.

Furthermore, the memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or nonvolatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or nonvolatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims

1. A method of extracting address information, the method comprising:

segmenting the text to be extracted to obtain a first address word segmentation; identifying a second address word segmentation in the text to be extracted by using a preset dictionary; wherein segmenting the text to be extracted to obtain the first address word segmentation comprises: performing word segmentation on the text to be extracted to obtain an address word segmentation; judging whether the address word segmentation is correct by using a preset rule; if an error exists, correcting the address word segmentation to obtain the first address word segmentation; if no error exists, using the address word segmentation as the first address word segmentation;

wherein judging whether the address word segmentation is correct by using the preset rule comprises: judging whether the segmentation of the address word segmentation is correct by using an MMSEG word segmentation algorithm and a place name dictionary;

2. The method of claim 1, wherein determining text to be extracted of address information based on the dialog text comprises:

3. The method of claim 2, wherein combining question information and answer information in the question-answer information pair to generate the text to be extracted includes:

determining a query term in the question information;

4. The method of claim 1, wherein integrating the first address word segmentation and the second address word segmentation according to administrative level comprises:

combining the ordered address components into address information.

5. An apparatus for extracting address information, the apparatus comprising:

the word segmentation unit is used for segmenting the text to be extracted to obtain a first address word segmentation; the word segmentation unit comprises: the word segmentation module is used for segmenting the text to be extracted to obtain address word segmentation; the judging module is used for judging whether the address word segmentation extracted by the word segmentation module is correct or not by utilizing a preset rule; the correction module is used for correcting the address word segmentation to obtain a first address word segmentation if the judgment module determines that an error exists; the determining module is used for using the address segmentation as a first address segmentation if the judging module determines that no error exists;

the judging module is specifically used for judging whether the segmentation of the address segmentation is correct or not by using an MMSEG segmentation algorithm and a place name dictionary;

the identification unit is used for identifying second address word segmentation in the text to be extracted, which is determined by the determination unit, by using a preset dictionary;

6. The apparatus according to claim 5, wherein the determining unit includes:

7. The apparatus of claim 6, wherein the generating module is specifically configured to determine a query word in the question information, and to replace the query word in the question information with the reply information to obtain the text to be extracted.

8. The apparatus of claim 5, wherein the generating unit comprises:

9. A processor for running a program, wherein the program, when run, performs the method of extracting address information as claimed in any one of claims 1-4.