CN112579713A - Address recognition method and device, computing equipment and computer storage medium - Google Patents


Info

Publication number
CN112579713A
Authority
CN
China
Prior art keywords
address information
probability
state
address
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910935761.9A
Other languages
Chinese (zh)
Other versions
CN112579713B (en)
Inventor
姜荣鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Liaoning Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Liaoning Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Liaoning Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201910935761.9A priority Critical patent/CN112579713B/en
Publication of CN112579713A publication Critical patent/CN112579713A/en
Application granted granted Critical
Publication of CN112579713B publication Critical patent/CN112579713B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/907 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/909 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Library & Information Science (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention relates to the technical field of data processing, and discloses an address identification method, an address identification device, computing equipment and a computer storage medium, wherein the method comprises the following steps: performing word segmentation on the acquired address information to obtain word segmentation results; determining a first transition state of the address information in the probability finite state machine according to the word segmentation result; calculating a first probability of the address information in the probability finite state machine according to the first transition state; if the first probability is smaller than the threshold value, determining whether the address information contains wrongly written words; if the address information contains wrongly-written words, correcting the wrongly-written words to obtain a second migration state of the address information after error correction in the probability finite state machine; calculating a second probability of the address information in the probability finite state machine according to the second transition state; and if the second probability is larger than or equal to the threshold value, determining the address information as the effective address. Through the method, the embodiment of the invention realizes the identification of the address information input by the user through the mode of the probability finite state machine.

Description

Address recognition method and device, computing equipment and computer storage medium
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to an address identification method, an address identification device, computing equipment and a computer storage medium.
Background
Address search is widely used in mobile map applications, map websites, navigation software and other fields. Existing address identification methods mostly use a deterministic finite state machine. A deterministic finite state machine includes a plurality of state nodes and directed arcs connecting the state nodes, and each directed arc is labeled with the transition condition between two state nodes. A schematic structural diagram of the deterministic finite state machine is shown in fig. 1, where S0 to S3 represent state nodes, I0 to I6 represent external inputs, and O0 to O6 represent the output results that indicate transitions between the states. For example, when the current state node of the finite state machine is S1 and an external input I0 is received, an output O0 is generated, which indicates a state transition to S0.
When a deterministic finite state machine is used for address recognition, an input address is considered valid only when it matches the state machine completely. That is, an address is valid if it can reach the end state node from the start state node of the state machine through a number of intermediate state nodes. If the input address contains wrongly written words, it cannot reach the end state node from the start state node and is considered invalid.
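The exact-match behaviour described above can be illustrated with a minimal sketch. The state names, the transition table, and the function `dfa_accepts` are hypothetical and chosen only for illustration; they are not taken from fig. 1.

```python
# Hypothetical deterministic finite state machine over address-element types.
# Each (state, input) pair maps to exactly one next state.
TRANSITIONS = {
    ("start", "province"): "province",
    ("province", "city"): "city",
    ("city", "district"): "district",
    ("district", "road"): "road",
    ("road", "number"): "end",
}


def dfa_accepts(element_types):
    """Return True only if the inputs drive the machine from 'start' to 'end'."""
    state = "start"
    for element in element_types:
        key = (state, element)
        if key not in TRANSITIONS:
            # An unrecognized element (for example a typo) has no outgoing arc,
            # so the whole address is rejected.
            return False
        state = TRANSITIONS[key]
    return state == "end"
```

In this sketch a single unmatched element is enough to reject the address, which is the limitation the background section points out.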
Since the addresses in various places are not named uniformly, the address information input by the user often contains wrongly written characters, and in this case, when the address is identified by the deterministic finite state machine, the address is considered invalid and cannot be identified.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide an address identification method, apparatus, computing device and computer storage medium, which overcome or at least partially solve the above problems.
According to an aspect of an embodiment of the present invention, there is provided an address identification method, including:
acquiring address information input by a user;
performing word segmentation on the address information to obtain a word segmentation result;
determining a first transition state of the address information in a probability finite state machine according to the word segmentation result;
calculating a first probability of the address information in the probability finite state machine according to the first transition state;
if the first probability is smaller than a preset threshold value, determining whether the address information contains wrongly-written words or not according to the first migration state;
if the address information contains wrongly written words, correcting the wrongly written words to obtain a second transition state of the address information after error correction in the probability finite-state machine;
calculating a second probability of the address information in the probability finite state machine according to the second migration state;
and if the second probability is greater than or equal to the preset threshold value, determining that the address information input by the user is an effective address.
In an optional manner, if the address information includes a wrongly written word, performing error correction on the wrongly written word to obtain a second transition state of the address information after error correction in the probability finite state machine, including:
determining the text corresponding to a non-address node contained in the first migration state as the word to be corrected;
converting the word to be corrected into pinyin;
searching a preset wrongly-written-word library for candidate words matching the pinyin;
taking the candidate word with the highest search frequency as the error correction comparison object;
calculating the ratio of the search count of the word to be corrected to the search count of the error correction comparison object;
if the ratio is smaller than a preset ratio, replacing the word to be corrected with the error correction comparison object;
and determining the transition state of the replaced address information in the probability finite state machine as the second transition state.
In an optional manner, if the address information does not include a wrongly written word, the method further includes:
determining a migration sequence of the state nodes;
when the migration sequence is inconsistent with a preset sequence, adjusting the migration sequence to be consistent with the preset sequence;
calculating a third probability of the address information in the probability finite state machine according to the adjusted migration state;
and if the third probability is greater than or equal to the preset threshold value, determining that the address information input by the user is an effective address.
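The order adjustment in this optional manner might be sketched as follows. The level names in `PRESET_ORDER` and the function `adjust_migration_order` are assumptions made for illustration; the source does not specify how the preset sequence is stored or how the reordering is performed.

```python
# Hypothetical preset order of address-element levels, highest level first.
PRESET_ORDER = ["province", "city", "district", "town", "road", "number"]


def adjust_migration_order(nodes):
    """Reorder (level, text) pairs so that the levels follow PRESET_ORDER.

    Returns the adjusted list and a flag telling whether any change was needed.
    """
    rank = {level: i for i, level in enumerate(PRESET_ORDER)}
    adjusted = sorted(nodes, key=lambda node: rank[node[0]])  # stable sort
    return adjusted, adjusted != list(nodes)
```

After the adjustment, the third probability would be computed on the reordered path in the same way as the first probability.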
In an optional manner, before obtaining the address information input by the user, the method further includes:
obtaining historical input address information of a user to obtain a training sample;
extracting state nodes of a finite state machine from the training samples;
determining a migration path of the training sample between the state nodes;
calculating the transition probability between the adjacent state nodes through a hidden Markov model according to the transition path;
and taking the finite state machine containing the transition probability among the state nodes as a probability finite state machine.
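As a rough illustration of the training steps above, the sketch below estimates transition probabilities by counting how often one state node follows another in the training paths. This relative-frequency estimate is a simplification swapped in for illustration; the text specifies a hidden Markov model, whose details are not given in this section. The function name and the sample paths are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transition_probabilities(paths):
    """Estimate P(next | current) as relative frequencies over training paths."""
    counts = defaultdict(Counter)
    for path in paths:
        # Count each pair of adjacent state nodes along the path.
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    probabilities = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probabilities[current] = {nxt: n / total for nxt, n in followers.items()}
    return probabilities
```

The resulting nested dictionary plays the role of the probability finite state machine: its keys are state nodes and its values are the outgoing transition probabilities.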
In an optional manner, performing Chinese word segmentation on the address information to obtain a word segmentation result includes:
performing atom segmentation on the address information to obtain a plurality of single characters;
combining adjacent single words according to different combination modes to obtain a first word segmentation;
and matching the first word segmentation with a preset word association table to obtain a word segmentation result, and combining all the word segmentation in the word segmentation result to obtain the address information.
According to another aspect of the embodiments of the present invention, there is provided an address recognition apparatus, including:
the acquisition module is used for acquiring address information input by a user;
the word segmentation module is used for segmenting words of the address information to obtain word segmentation results;
the first determining module is used for determining a first transition state of the address information in the probability finite state machine according to the word segmentation result;
a first calculating module, configured to calculate a first probability of the address information in the probability finite state machine according to the first migration state;
a second determining module, configured to determine whether the address information includes a wrongly-written or mispronounced word according to the first migration status when the first probability is smaller than a preset threshold;
the error correction module is used for correcting errors of the wrongly-written words when the address information contains the wrongly-written words to obtain a second transition state of the address information after error correction in the probability finite-state machine;
a second calculating module, configured to calculate a second probability of the address information in the probability finite state machine according to the second migration state;
and the third determining module is used for determining that the address information input by the user is an effective address when the second probability is greater than or equal to the preset threshold.
According to still another aspect of an embodiment of the present invention, there is provided a computing device including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the address identification method.
According to another aspect of the embodiments of the present invention, there is provided a computer storage medium having at least one executable instruction stored therein, the executable instruction causing a processor to perform an operation corresponding to an address identification method.
In the embodiments of the invention, the address information input by the user is segmented to obtain a word segmentation result, and a transition state of the address information in a probability finite state machine is determined according to the word segmentation result. The probability finite state machine comprises state nodes and the transition probabilities among them, and the transition state comprises the state nodes corresponding to the address information and the transition probabilities among those state nodes. The probability of the address information in the probability finite state machine is calculated from the transition probabilities corresponding to the transition state. If the probability is smaller than a preset threshold value, the address information may contain wrongly written words; in that case the wrongly written words are corrected, and whether the address information input by the user is an effective address is determined according to the corrected address information. Because the address information is identified according to the probability finite state machine, it can still be identified effectively when it contains wrongly written words, which improves the user experience.
The foregoing is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the content of the description, and so that the foregoing and other objects, features, and advantages of the embodiments are more readily apparent, specific embodiments of the present invention are described below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 shows a schematic diagram of a deterministic finite state machine;
fig. 2 is a flowchart illustrating an address identification method according to a first embodiment of the present invention;
fig. 3 is a schematic structural diagram illustrating a probabilistic finite state machine in an address identification method according to a first embodiment of the present invention;
fig. 4 is a flowchart illustrating an address identification method according to a second embodiment of the present invention;
fig. 5 shows a flow chart of an address identification method according to a third embodiment of the invention;
fig. 6 shows a functional block diagram of an address recognition apparatus according to a fourth embodiment of the present invention;
fig. 7 shows a schematic structural diagram of a computing device in a fifth embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Application scenarios of embodiments of the present invention include, but are not limited to, map application software. When an embodiment is applied to map application software, a user enters the address information to be queried in the search box of the map application software. Because that address may be unfamiliar to the user, the input may contain wrongly written words without the user noticing. Therefore, the present application proposes an address recognition method that can recognize an address even when the information input by the user contains wrongly written words. The general concept of the present application is further explained below through several specific embodiments, taking map application software as the example application scenario.
Fig. 2 shows a flow chart of an address identification method according to a first embodiment of the present invention, as shown in fig. 2, the method includes the following steps:
step 110: and acquiring address information input by a user.
In this step, the address information input by the user is a Point of Interest (POI) of the user. A POI is usually a specific piece of address information, for example a doorplate address. The user enters the address information to be queried in the address search box of the map application software and clicks the query button. The map application software first obtains the province and city of the query. If the address information input by the user includes the province and the city, they are determined from that address information. If it does not, parameters related to the address information input by the user are obtained through a positioning module in the map application software, and the province and city of the queried address are determined from them. These parameters include the longitude and latitude of the address information input by the user and the query frequency of the address information; the longitude and latitude can be acquired by a positioning search engine. The province and city are then combined with the address information input by the user to obtain a piece of standard address information. Standard address information is address information in a preset format that contains the address elements of Table 1 in sequence, with the level of the address elements decreasing from top to bottom of the table.
TABLE 1
[Table 1 is provided as an image in the original publication; it lists the address elements in descending order of level.]
Step 120: and performing word segmentation on the address information to obtain a word segmentation result.
In this step, when the address information is segmented, if the address information input by the user includes province and city information, the address information input by the user is directly segmented, and if the address information input by the user does not include province and city information, the address information input by the user is converted into standard address information, and then the standard address information is segmented. In one embodiment, the step specifically includes the steps of: carrying out atom segmentation on the address information to obtain a plurality of single characters; combining adjacent single words according to different combination modes to obtain a first word segmentation; and matching the first word segmentation with a preset word association table to obtain a word segmentation result, and combining all the word segmentation in the word segmentation result to obtain the address information.
The address information is first subjected to atom segmentation to obtain a plurality of single characters. Atom segmentation splits the address information character by character, which yields word segmentation at the minimum granularity. For example, after atom segmentation of "Liaoning Province, Shenyang City, Hunnan District" (辽宁省沈阳市浑南区), the result is the individual Chinese characters "辽", "宁", "省", "沈", "阳", "市", "浑", "南" and "区". Atom segmentation covers Chinese characters, English characters, and digits. English text is split at the pauses between adjacent English words, and each single digit is split off as one character. For example, "Xinlong Street No. 6 A Square" is split into the characters "Xin", "long", "Street", "6", "No.", "A" and "Square".
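The atom segmentation rules above can be sketched with one regular expression. This is a minimal sketch under the assumption that the "pauses" between English words are whitespace; the function name `atom_segment` is hypothetical.

```python
import re

# A run of English letters stays whole; every other non-space character
# (a Chinese character, a digit, a punctuation mark) becomes one atom.
_ATOM = re.compile(r"[A-Za-z]+|\S")


def atom_segment(text):
    """Split address text into atoms at the minimum granularity."""
    return _ATOM.findall(text)
```

For instance, `atom_segment("101 A Square")` splits the number into three single-digit atoms and keeps each English word intact.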
When adjacent single characters are merged, merging is carried out according to the state of each single character in a core word bank. The core word bank stores the states of single characters and words, and a state indicates whether the current character or word can continue to be merged with adjacent characters. For example, in the core word bank, 1 indicates that a single character can be merged with adjacent characters to form a phrase, 2 indicates that the phrase can still be merged with other characters to form a phrase, and 3 indicates that the phrase cannot be merged with other characters any further. For example, atom segmentation of "Liaoning Province, Shenyang City, Hunnan District" (辽宁省沈阳市浑南区) yields the characters "辽", "宁", "省", "沈", "阳", "市", "浑", "南" and "区". During merging, the state flag of "辽" is 1, so it is merged with "宁" to obtain "辽宁" (Liaoning). The state flag of "辽宁" in the core word bank is 2, so it is merged with the adjacent character to obtain "辽宁省" (Liaoning Province). The state flag of this phrase in the core word bank is 3, so it cannot be merged further. The word "辽宁省" is thus obtained and is determined to be a first word segmentation.
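The flag-driven merging can be sketched as a greedy loop. The contents of `CORE_LEXICON` are hypothetical, and the extra check that the merged string must itself be an entry of the word bank is an assumption added to keep the sketch well defined.

```python
# Hypothetical core word bank: entry -> state flag
# (1 or 2: may still merge with the next character; 3: complete, stop merging).
CORE_LEXICON = {"辽": 1, "辽宁": 2, "辽宁省": 3, "沈": 1, "沈阳": 2, "沈阳市": 3}


def merge_atoms(atoms, lexicon=CORE_LEXICON):
    """Greedily merge adjacent atoms into first word segmentations."""
    words, i = [], 0
    while i < len(atoms):
        current = atoms[i]
        i += 1
        # Keep merging while the flag allows it and the merged form is known.
        while (i < len(atoms)
               and lexicon.get(current) in (1, 2)
               and current + atoms[i] in lexicon):
            current += atoms[i]
            i += 1
        words.append(current)
    return words
```

Characters that are absent from the word bank are left as single-character segments in this sketch.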
In some embodiments, the single words obtained after the atomic segmentation include english words, and then adjacent english words are combined and matched with a preset english word bank to obtain the first segmentation. For example, the single words obtained after the segmentation of "Four Seasons Hotel" are "Four", "Seasons" and "Hotel", and the first segmentation is "Four Seasons Hotel". In other embodiments, the words obtained after atom segmentation include a plurality of numbers, such as "101 building", and the words obtained after atom segmentation are "1", "0", "1", "building", and the first segmentation is "101" and "building". In other embodiments, the single words obtained after the atomic segmentation include traditional words, and the traditional words are matched with a preset simplified and traditional word library and converted into simplified words before combination, for example, "張" is converted into "张".
In some embodiments, the combined phrase may be an unregistered word in the core thesaurus, and the word is selectively added to the core thesaurus according to the search frequency of the unregistered word. For example, if a building is newly built, and the search frequency of the building is higher than a preset value, the word is frequently searched, and the word belongs to a common address element, the word is added into the core word stock. The unknown words may also include proper nouns, such as names, abbreviations, etc., and in some embodiments, if the proper nouns are not included in the core lexicon, the proper nouns are separated by presetting a related unknown lexicon for subsequent address recognition.
In some embodiments, two candidate first word segmentations share a character of the address information input by the user, and the optimal word segmentation combination is determined by matching the first word segmentations with a preset word association table. For example, "commodity and service transaction center" can be segmented in two different ways: the first contains the segments "commodity", "and", "service" and "transaction center", while the second places the boundaries around "and" and "service" differently. The segmentation results are matched with the preset word association table, which contains start words, end words, and word frequencies. For example, if the first character of "service" is a start word in the word association table, its last character is an end word, and the word frequency of "service" is relatively high, the first segmentation is determined to be the optimal word segmentation combination and is used as the word segmentation result. Combining all the segments in the word segmentation result yields the address information input by the user.
Step 130: and determining a first transition state of the address information in the probability finite state machine according to the word segmentation result.
In this step, the probability finite state machine includes state nodes and the transition probabilities between adjacent state nodes. Fig. 3 shows a schematic structural diagram of a probability finite state machine. As shown in fig. 3, the probability finite state machine includes a plurality of state nodes: a gate address class node representing the start state, a gate address class node representing the end state, intermediate gate address class nodes, and non-address nodes. The state nodes are arranged in descending order of address element level. A transition state is matched in the probability finite state machine according to the word segmentation result, and the transition state comprises a transition sequence and the transition probabilities among the state nodes. For example, suppose the word segmentation result includes "Liaoning Province" and "Shenyang City". In the probability finite state machine, the state nodes connected to "Liaoning Province" include fourteen state nodes such as "Shenyang City" and "Dalian City", and the probability of transition from "Liaoning Province" to each of these fourteen state nodes is 1/14. "Liaoning Province", "Shenyang City" and the probability 1/14 between them then form one part of the first transition state; all the matched state nodes together with the transition probabilities between them constitute the first transition state of the address information in the probability finite state machine. For example, when the user inputs "Taoxian International Airport, No. 1 Airport Road, Taoxian Town, Hunnan District, Shenyang City", the probability finite state machine matches a state whose migration sequence is "prefecture-level city - district/county - township - road - building unit", and the first migration state is expressed as
[Formula provided as an image in the original publication: the first migration state.]
where s0 denotes the start state node, q0, q1, ..., q4 denote intermediate state nodes, q5 denotes the end state node, and x denotes the transition probability between adjacent nodes. It should be understood that the first migration state is arranged in the order of the word segmentation result of the address information. In general, if the address information is input in the form of standard address information, the state nodes in the first migration state migrate in descending order of address element level. When the migration state matched by the address information includes the entire path from the start node to the end node, the address information may be a valid address. The probability finite state machine is obtained by training on address information historically input by users; the specific training process is explained in the following embodiments and is not repeated here.
Step 140: and calculating the first probability of the address information in the probability finite state machine according to the transition probability corresponding to the first transition state.
In this step, all the transition probabilities in the first transition state are multiplied to obtain the first probability of the address information in the probability finite state machine. For any address information input by the user, the word segmentation result obtained after word segmentation comprises k word segmentations. The first transition state in the probability finite state machine corresponding to the k participles is
[Formula provided as an image in the original publication: the first migration state for the k segments.]
The first probability is
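[Formula provided as an image in the original publication: the first probability, i.e. the product of all transition probabilities in the first migration state.]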
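Because the formulas of this step survive only as images, the following is a hedged reconstruction from the surrounding text rather than the original notation. The indexed symbols $x_i$, the name $P_1$ for the first probability, and the index range are assumptions.

```latex
s_0 \xrightarrow{x_0} q_0 \xrightarrow{x_1} q_1 \xrightarrow{x_2} \cdots \xrightarrow{x_{k-1}} q_{k-1},
\qquad
P_1 = \prod_{i=0}^{k-1} x_i
```

Here $s_0$ is the start state node, $q_0, \dots, q_{k-1}$ are the state nodes matched by the $k$ segments, and $x_i$ is the transition probability on the arc entering $q_i$.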
For example, when the user inputs "No. 1 Airport Road, Taoxian Town, Hunnan District, Shenyang City", taking the probability finite state machine in fig. 3 as an example, the first transition state is "start state - prefecture-level city - district/county - township - road - street number", and the first probability is 1 × 0.4 × 0.2 × 0.6 × 1 = 0.048.
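The arithmetic of this example can be checked with a short sketch. The function name `path_probability` is hypothetical; the five factors are the transition probabilities quoted in the example.

```python
import math


def path_probability(transition_probabilities):
    """Multiply the transition probabilities along one migration path."""
    return math.prod(transition_probabilities)


# start -> city -> district/county -> township -> road -> street number
first_probability = path_probability([1, 0.4, 0.2, 0.6, 1])
```

Because the product is computed in floating point, it is safer to compare the result with 0.048 using a tolerance, for example `math.isclose`, than with `==`.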
Step 150: and determining whether the address information contains wrongly-written words or not according to the first migration state under the condition that the first probability is smaller than a preset threshold value.
In this step, the preset threshold is an artificially set value, and when the first probability reaches the preset threshold, it is determined that the address information is an effective address. The state nodes in the probabilistic finite state machine include address nodes representing a valid address element, e.g., "Lianning", "Shenyang City", etc., and non-address nodes representing state nodes other than the valid address element, e.g., "Shenyang City". When the first probability is smaller than the preset threshold, it may be that the address information includes a wrong word, and the address node in the correct state node is not matched according to the wrong word, so that the first probability is smaller than the preset threshold. And if the first migration state contains the non-address nodes and the migration probabilities of the state nodes before the non-address nodes are all larger than the preset migration probability, determining that the address information contains wrongly-written words. 
For example, the preset threshold is 0.03 and the preset transition probability is 0.1. When the user inputs "Shenyang City, Hunnan District, Taoxian Town, Taoxian International Airport Road No. 1" with one word miswritten, the state nodes in the first transition state are: start state, prefecture-level city, district/county, township, non-address node, road, and street number. The probability of transitioning from any address node to a non-address node is the same. When a non-address node is included, the transition is considered to pass through the non-address node and then the next address node, so that the path can still reach a termination node. The first probability is 1 × 0.4 × 0.2 × 0.02 × 0.6 × 1 = 0.00096, which is far below the preset threshold. A non-address node exists in the path, and the node transition probabilities before and after the non-address node are 1 and 0.4 respectively, both higher than the preset transition probability; therefore, the address information is determined to contain a wrongly written word.
When the address information does not contain a wrongly written word, the address information input by the user is determined to be invalid.
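The check in step 150 can be sketched as follows. The function name `contains_typo`, the marker string `NON_ADDRESS`, and the convention that `probs[i]` is the probability of the transition into `nodes[i]` are assumptions made for illustration; the rule implemented is the one stated in the description of step 150 (a non-address node is present and every transition before it exceeds the preset transition probability).

```python
NON_ADDRESS = "non-address"


def contains_typo(nodes, probs, first_probability,
                  threshold=0.03, preset_transition=0.1):
    """Decide whether a low-probability path points to a wrongly written word.

    nodes[i] is the state reached by the transition with probability probs[i];
    the start state is implicit.
    """
    if first_probability >= threshold:
        return False  # the address is already valid, nothing to check
    if NON_ADDRESS not in nodes:
        return False  # no unmatched word in the path
    idx = nodes.index(NON_ADDRESS)
    # every transition before the non-address node must be plausible
    return all(p > preset_transition for p in probs[:idx])


nodes = ["city", "district", "township", NON_ADDRESS, "road", "number"]
probs = [1, 0.4, 0.2, 0.02, 0.6, 1]
print(contains_typo(nodes, probs, first_probability=0.00096))
```

With the values of the example, the three transitions before the non-address node (1, 0.4, 0.2) all exceed 0.1, so the sketch reports a wrongly written word.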
Step 160: when the address information contains a wrongly written word, correct the wrongly written word to obtain a second transition state, in the probabilistic finite state machine, of the corrected address information.
In this step, since most users type with a pinyin input method, a wrongly written word is most likely a homophone of the intended word, so error correction targets homophones. The procedure is: determine the text corresponding to the non-address node contained in the first transition state as the error correction word; convert the error correction word into pinyin; search a preset wrongly-written-word lexicon for the candidate words matching the pinyin; take the candidate word with the highest search count as the error correction comparison object; calculate the ratio of the search count of the error correction word to the search count of the error correction comparison object; if the ratio is smaller than a preset ratio, replace the error correction word with the error correction comparison object; and determine the transition state of the replaced address information in the probabilistic finite state machine as the second transition state. The wrongly-written-word lexicon stores, in advance, each pinyin together with all the words corresponding to that pinyin. The search count of a word, whether the error correction comparison object or the error correction word, is the number of times the word is recorded in the search log. In some embodiments, the search count of the error correction word is determined by the empirical formula $\mathrm{Num} \times (1 + \mathrm{Len}/10 + \mathrm{IsStr})$, where Num is the number of times the error correction word is recorded in the search log, Len is the length of the error correction word, and IsStr indicates whether the error correction word is Chinese or pinyin: 0 for Chinese and 1 for pinyin.
For example, when the user inputs "Shenyang City, Hunnan District, Taoxian Town, Taoxian International Airport Road No. 1" with one word miswritten as a homophone, the address information after error correction is "Shenyang City, Hunnan District, Taoxian Town, Taoxian International Airport Road No. 1".
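A minimal sketch of the homophone correction in step 160 is given below. All names (`weighted_count`, `correct_word`), the dictionary-based stand-ins for the pinyin converter, the lexicon, and the search log, the toy counts, and the preset ratio of 0.5 are assumptions for illustration only; the weighting follows the empirical formula of the description, read here as a multiplication.

```python
def weighted_count(num, length, is_pinyin):
    """Empirical search count Num * (1 + Len/10 + IsStr)."""
    return num * (1 + length / 10 + (1 if is_pinyin else 0))


def correct_word(word, pinyin_of, lexicon, search_log, preset_ratio=0.5):
    """Return the replacement for `word`, or `word` itself if none applies.

    pinyin_of:  dict word -> pinyin (stand-in for a real pinyin converter)
    lexicon:    dict pinyin -> list of words sharing that pinyin
    search_log: dict word -> number of records in the search log
    """
    candidates = lexicon.get(pinyin_of.get(word, ""), [])
    if not candidates:
        return word
    # comparison object: the candidate with the highest search count
    best = max(candidates, key=lambda w: search_log.get(w, 0))
    best_count = search_log.get(best, 0)
    if best_count == 0:
        return word
    own = weighted_count(search_log.get(word, 0), len(word), is_pinyin=False)
    return best if own / best_count < preset_ratio else word


# Toy data: two words that share the pinyin "taoxian".
pinyin_of = {"桃先": "taoxian", "桃仙": "taoxian"}
lexicon = {"taoxian": ["桃仙", "桃先"]}
search_log = {"桃仙": 9000, "桃先": 12}
print(correct_word("桃先", pinyin_of, lexicon, search_log))
```

In this toy run the rarely searched word is replaced by its frequently searched homophone, while a word that is itself the most searched candidate is left unchanged.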
Step 170: a second probability of the address information in the probabilistic finite state machine is computed based on the second transition state.
The second transition state, from which the second probability is calculated, does not include a non-address node. For example, when the user inputs "Shenyang City, Hunnan District, Taoxian Town, Taoxian International Airport Road No. 1" with one word miswritten, the state nodes in the second transition state after error correction are: start state, prefecture-level city, district/county, township, road, and street number, and the second probability is 1 × 0.4 × 0.2 × 0.6 × 1 = 0.048.
Step 180: when the second probability is greater than or equal to the preset threshold, determine that the address information input by the user is a valid address.
In this step, if the second probability is smaller than the preset threshold, the address information input by the user is determined to be an invalid address.
In this embodiment, the address information input by the user is segmented to obtain a word segmentation result, and the transition state of the address information in the probabilistic finite state machine is determined according to the word segmentation result. The probabilistic finite state machine comprises state nodes and the transition probabilities between them, and the transition state comprises the state nodes corresponding to the address information and the transition probabilities between those state nodes. The probability of the address information in the probabilistic finite state machine is calculated from the transition probabilities corresponding to the transition state. If the probability is smaller than the preset threshold, the address information may contain a wrongly written word; in that case the wrongly written word is corrected, and whether the address information input by the user is a valid address is determined according to the corrected address information. Because the address information is identified with a probabilistic finite state machine, address information containing wrongly written words can still be identified effectively, which improves the user experience.
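The control flow of steps 120 to 180 can be summarized in the following sketch. The function `recognize` and its four callable parameters (`segment`, `to_path`, `has_typo`, `correct`) are hypothetical placeholders for the segmentation, state machine lookup, wrongly-written-word check, and error correction described above; only the order of the decisions is taken from the description.

```python
from math import prod


def recognize(address, segment, to_path, has_typo, correct, threshold=0.03):
    """Return True if `address` is judged to be a valid address."""
    words = segment(address)                # step 120: word segmentation
    nodes, probs = to_path(words)           # step 130: first transition state
    if prod(probs) >= threshold:            # step 140: first probability
        return True
    if not has_typo(nodes, probs):          # step 150: wrongly written word?
        return False
    corrected = correct(words, nodes)       # step 160: error correction
    _, probs2 = to_path(corrected)          # second transition state
    return prod(probs2) >= threshold        # steps 170-180: second probability
```

Passing the individual stages as callables keeps the sketch independent of any particular segmenter or lexicon.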
Fig. 4 shows a flowchart of an address identification method according to a second embodiment of the present invention. Compared with the first embodiment, the second embodiment further includes the following steps after step 170, as shown in Fig. 4:
Step 210: when the second probability is smaller than the preset threshold, determine the transition sequence of the state nodes.
In this embodiment, the address information may contain no wrongly written word and the first transition state may reach the termination node, yet the first probability is still smaller than the preset threshold. This may be caused by an incorrect order of the state nodes. For example, for the address information "Shenyang City, Taoxian International Airport, Hunnan District, Taoxian Town", the transition sequence is city, street, district/county, township.
Step 220: when the transition sequence is inconsistent with a preset sequence, adjust the transition sequence to be consistent with the preset sequence.
In this step, the preset sequence arranges the address elements from the highest level to the lowest. Taking the address information of step 210, "Shenyang City, Taoxian International Airport, Hunnan District, Taoxian Town", as an example, the adjusted address is "Shenyang City, Hunnan District, Taoxian Town, Taoxian International Airport".
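The adjustment in step 220 amounts to sorting the address elements by level. In the sketch below, the level names in `LEVEL_ORDER` and the representation of each element as a `(text, level)` pair are assumptions for illustration.

```python
# Preset sequence: address levels from highest to lowest (assumed names).
LEVEL_ORDER = ["province", "city", "district", "township", "street", "number"]
RANK = {level: i for i, level in enumerate(LEVEL_ORDER)}


def in_preset_order(elements):
    """True if the (text, level) pairs already follow the preset sequence."""
    ranks = [RANK[level] for _, level in elements]
    return ranks == sorted(ranks)


def reorder(elements):
    """Rearrange (text, level) pairs into the preset sequence."""
    return sorted(elements, key=lambda e: RANK[e[1]])


typed = [
    ("Shenyang City", "city"),
    ("Taoxian International Airport", "street"),
    ("Hunnan District", "district"),
    ("Taoxian Town", "township"),
]
if not in_preset_order(typed):
    typed = reorder(typed)
print(", ".join(text for text, _ in typed))
```

Because Python's `sorted` is stable, elements of the same level keep the order in which the user typed them.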
Step 230: calculate a third probability of the address information in the probabilistic finite state machine according to the adjusted transition state.
In this step, the third probability is calculated in the same way as in step 140; refer to the description of step 140, which is not repeated here.
Step 240: if the third probability is greater than or equal to the preset threshold, determine that the address information input by the user is a valid address.
In this step, if the third probability is smaller than the preset threshold, the address information input by the user is an invalid address.
In this embodiment of the invention, when the order of the address information input by the user is inconsistent with the preset sequence, the transition sequence is adjusted, so that the address information can still be judged effectively.
Fig. 5 is a flowchart of an address identification method according to a third embodiment of the present invention. Before the steps of the first embodiment and the second embodiment are executed, this embodiment further includes the following steps:
Step 310: obtain address information historically input by users to obtain training samples.
In this step, the address information historically input by the user is the address information input by all users when using the map application software, and each piece of address information is used as a training sample.
Step 320: state nodes of the finite state machine are extracted from the training samples.
In this step, each training sample is subjected to word segmentation, and the word segmentation result is used as a state node. The word segmentation process is the same as the word segmentation process in step 120 in the first embodiment, please refer to the description of step 120 in the first embodiment, which is not described herein again.
Step 330: determine the migration paths of the training samples between the state nodes.
The migration path comprises the migration sequence between the state nodes corresponding to each address information.
Step 340: calculate the transition probability between adjacent state nodes by using a hidden Markov model according to the migration paths.
The transition probabilities are determined from the migration paths of all the training samples. For example, suppose the training samples contain 6000 million pieces of address information, in which the state node "Liaoning Province" occurs 2000 million times, and in 1000 million of those occurrences the path transitions to the state node "Shenyang City". The transition probability from "Liaoning Province" to "Shenyang City" is then 1000/2000 = 0.5.
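The counting in the example above can be sketched as a relative-frequency estimate. Note that this is plain counting over observed paths, not a full hidden Markov model training procedure; the function name `estimate_transitions` and the toy paths are assumptions for illustration.

```python
from collections import Counter


def estimate_transitions(paths):
    """Estimate P(dst | src) as count(src -> dst) / count(occurrences of src)."""
    node_counts = Counter()
    pair_counts = Counter()
    for path in paths:
        node_counts.update(path)                    # every occurrence of a node
        pair_counts.update(zip(path, path[1:]))     # every adjacent pair
    return {(src, dst): n / node_counts[src]
            for (src, dst), n in pair_counts.items()}


# Toy training paths: the province occurs twice, once followed by Shenyang City.
paths = [
    ["Liaoning Province", "Shenyang City", "Hunnan District"],
    ["Liaoning Province", "Dalian City"],
]
probs = estimate_transitions(paths)
print(probs[("Liaoning Province", "Shenyang City")])  # prints 0.5
```

Dividing by the number of occurrences of the source node, rather than by its outgoing transitions only, mirrors the way the example of the description counts.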
Step 350: a finite state machine containing the transition probabilities between state nodes is taken as a probabilistic finite state machine.
According to the migration paths, starting state nodes and terminating state nodes are determined. The starting state nodes include all state nodes such as provinces, cities, and counties; that is, each state node can serve as a starting state node. The terminating state nodes include roads, building units, and the like. A starting state node and a terminating state node may be the same state node.
This embodiment of the invention constructs the probabilistic finite state machine from the address information historically input by users, which makes it convenient to identify address information input by a user according to the constructed probabilistic finite state machine.
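One possible in-memory form of the resulting probabilistic finite state machine is sketched below. The class name `ProbabilisticFSM`, the dictionary of `(source, destination)` pairs, and the choice to return 0 for a path that does not end in a terminating state node are assumptions for illustration; the start state is left implicit with probability 1 because every state node may serve as a starting state node.

```python
class ProbabilisticFSM:
    """State nodes, transition probabilities, and terminating state nodes."""

    def __init__(self, transitions, end_states):
        self.transitions = dict(transitions)   # {(src, dst): probability}
        self.end_states = set(end_states)

    def probability(self, path):
        """Probability of a node path; 0.0 if it does not reach an end state."""
        if not path or path[-1] not in self.end_states:
            return 0.0
        p = 1.0  # any node may be a starting state node
        for src, dst in zip(path, path[1:]):
            p *= self.transitions.get((src, dst), 0.0)
        return p


fsm = ProbabilisticFSM(
    transitions={
        ("city", "district"): 0.4,
        ("district", "township"): 0.2,
        ("township", "road"): 0.6,
        ("road", "number"): 1.0,
    },
    end_states={"road", "number"},
)
print(round(fsm.probability(["city", "district", "township", "road", "number"]), 6))
```

Unknown transitions default to probability 0, so a path that uses an unseen pair of state nodes is never accepted.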
Fig. 6 shows a functional block diagram of an address recognition apparatus according to a fourth embodiment of the present invention. As shown in Fig. 6, the apparatus includes:
an obtaining module 410, configured to obtain address information input by a user;
a word segmentation module 420, configured to perform word segmentation on the address information to obtain a word segmentation result;
a first determining module 430, configured to determine, according to the word segmentation result, a first transition state of the address information in a probabilistic finite state machine;
a first calculating module 440, configured to calculate a first probability of the address information in the probabilistic finite state machine according to the first transition state;
a second determining module 450, configured to determine, according to the first transition state, whether the address information contains a wrongly written word when the first probability is smaller than a preset threshold;
an error correction module 460, configured to, when the address information contains a wrongly written word, correct the wrongly written word to obtain a second transition state of the corrected address information in the probabilistic finite state machine;
a second calculating module 470, configured to calculate a second probability of the address information in the probabilistic finite state machine according to the second transition state;
a third determining module 480, configured to determine that the address information input by the user is a valid address when the second probability is greater than or equal to the preset threshold.
In an optional manner, the error correction module 460 is further configured to:
determining text information corresponding to the non-address nodes contained in the first migration state as error correction words;
converting the error-correcting words into pinyin;
searching a wrong word object matched with the pinyin in a preset wrong word library according to the pinyin;
taking the word with the highest searching frequency in the error correction word objects as an error correction comparison object;
calculating the ratio of the search count of the error correction word to the search count of the error correction comparison object;
if the ratio is smaller than a preset ratio, replacing the error correction word with the error correction comparison object;
and determining the transition state of the replaced address information in the probability finite state machine as the second transition state.
In an optional manner, the apparatus further includes: a fourth determining module 490, configured to determine a transition sequence of the state nodes when the address information does not contain a wrongly written word; an adjusting module 400, configured to adjust the transition sequence to be consistent with a preset sequence when the transition sequence is inconsistent with the preset sequence; a third calculating module 401, configured to calculate a third probability of the address information in the probabilistic finite state machine according to the adjusted transition state; and a fifth determining module 402, configured to determine that the address information input by the user is a valid address when the third probability is greater than or equal to the preset threshold.
In an optional manner, the apparatus further comprises:
a first obtaining module 403, configured to obtain address information historically input by a user to obtain a training sample;
an extracting module 404, configured to extract a state node of the finite state machine from the training sample;
a sixth determining module 405, configured to determine a migration path of the training sample between the state nodes;
a fourth calculating module 406, for calculating a transition probability between the adjacent state nodes by using a hidden markov model according to the transition path;
a seventh determining module 407, configured to use a finite state machine containing the transition probability between the state nodes as a probability finite state machine.
In an alternative approach, the word segmentation module 420 is further configured to:
performing atom segmentation on the address information to obtain a plurality of single characters;
combining adjacent single words according to different combination modes to obtain a first word segmentation;
and matching the first word segmentation with a preset word association table to obtain a word segmentation result, wherein all the segmented words in the word segmentation result combine to form the address information.
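The three segmentation operations above can be sketched with a small dynamic program. The function name `segment`, the use of a plain set of known words as a stand-in for the preset word association table, and the preference for the segmentation with the fewest pieces are assumptions for illustration, as is the toy word table.

```python
def segment(address, word_table):
    """Return a segmentation of `address` into words from `word_table`, or None."""
    chars = list(address)                      # atom segmentation into single characters
    n = len(chars)
    best = [None] * (n + 1)                    # best[i]: segmentation of address[:i]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = "".join(chars[start:end])  # combine adjacent characters
            if best[start] is None or piece not in word_table:
                continue                       # piece unknown or prefix not coverable
            candidate = best[start] + [piece]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate          # keep the segmentation with fewest pieces
    return best[n]


# Toy word table (assumed entries).
word_table = {"沈阳", "沈阳市", "浑南区", "桃仙镇", "机场路", "1号"}
result = segment("沈阳市浑南区桃仙镇机场路1号", word_table)
print(result)
```

Every returned segmentation covers the whole input, so joining its pieces reproduces the original address information.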
In this embodiment of the invention, the word segmentation module 420 segments the address information input by the user to obtain a word segmentation result, and the first determining module 430 determines the transition state of the address information in the probabilistic finite state machine according to the word segmentation result. The probabilistic finite state machine comprises state nodes and the transition probabilities between them, and the transition state comprises the state nodes corresponding to the address information and the transition probabilities between those state nodes. The probability of the address information in the probabilistic finite state machine is calculated from the transition probabilities corresponding to the transition state. When the probability is smaller than the preset threshold, the address information may contain a wrongly written word; the error correction module 460 corrects the wrongly written word, and whether the address information input by the user is a valid address is determined according to the corrected address information. Because the address information is identified with a probabilistic finite state machine, address information containing wrongly written words can still be identified effectively, which improves the user experience.
An embodiment of the present invention provides a non-volatile computer storage medium, where the computer storage medium stores at least one executable instruction, and the computer executable instruction may execute the address identification method in any of the above method embodiments.
Fig. 7 is a schematic structural diagram of a computing device in a fifth embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the computing device.
As shown in Fig. 7, the computing device may include: a processor 502, a communications interface 504, a memory 506, and a communication bus 508.
The processor 502, the communications interface 504, and the memory 506 communicate with each other via the communication bus 508. The communications interface 504 is configured to communicate with network elements of other devices, such as clients or other servers. The processor 502 is configured to execute the program 510, and may specifically perform the relevant steps of the address identification method in the above embodiments.
In particular, program 510 may include program code that includes computer operating instructions.
The processor 502 may be a central processing unit CPU, or an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement an embodiment of the present invention. The computing device includes one or more processors, which may be the same type of processor, such as one or more CPUs; or may be different types of processors such as one or more CPUs and one or more ASICs.
The memory 506 is configured to store the program 510. The memory 506 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
The program 510 may be specifically configured to enable the processor 502 to execute steps 110 to 180 in fig. 2, steps 210 to 240 in fig. 4, and steps 310 to 350 in fig. 5, and to implement the functions of the modules 410 to 407 in fig. 6.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination. It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specified otherwise.

Claims (10)

1. An address identification method, the method comprising:
acquiring address information input by a user;
performing word segmentation on the address information to obtain a word segmentation result;
determining a first transition state of the address information in a probability finite state machine according to the word segmentation result, wherein the probability finite state machine comprises state nodes and transition probabilities among the state nodes;
calculating a first probability of the address information in the probability finite state machine according to the transition probability corresponding to the first transition state;
if the first probability is smaller than a preset threshold value, determining whether the address information contains wrongly-written words or not according to the first migration state;
if the address information contains wrongly written words, correcting the wrongly written words to obtain a second transition state of the address information after error correction in the probability finite-state machine;
calculating a second probability of the address information in the probability finite state machine according to the second migration state;
and if the second probability is greater than or equal to the preset threshold value, determining that the address information input by the user is an effective address.
2. The method of claim 1, wherein the state nodes comprise address nodes and non-address nodes;
if the first probability is smaller than a preset threshold, determining whether the address information contains a wrongly written word according to the first migration state, including:
if the first probability is smaller than the preset threshold, determining the state nodes included in the first transition state and the transition probabilities between the state nodes;
if the state node included in the first migration state includes a non-address node, determining a first migration probability of a state node before the non-address node and a second migration probability of a state node after the non-address node;
and if the first transition probability and the second transition probability are both greater than a preset transition probability, determining that the address information contains wrongly-written words.
3. The method of claim 2, wherein if the address information includes a wrong word, performing error correction on the wrong word to obtain a second transition state of the address information after error correction in the probabilistic finite state machine, comprises:
determining text information corresponding to the non-address nodes contained in the first migration state as error correction words;
converting the error-correcting words into pinyin;
searching a wrong word object matched with the pinyin in a preset wrong word library according to the pinyin;
taking the word with the highest searching frequency in the error correction word objects as an error correction comparison object;
calculating the ratio of the search count of the error correction word to the search count of the error correction comparison object;
if the ratio is smaller than a preset ratio, replacing the error correction word with the error correction comparison object;
and determining the transition state of the replaced address information in the probability finite state machine as the second transition state.
4. The method of claim 2, wherein if the address information does not contain a wrongly written word, the method further comprises:
determining a migration sequence of the state nodes;
when the migration sequence is inconsistent with a preset sequence, adjusting the migration sequence to be consistent with the preset sequence;
calculating a third probability of the address information in the probability finite state machine according to the adjusted migration state;
and if the third probability is greater than or equal to the preset threshold value, determining that the address information input by the user is an effective address.
5. The method of claim 1, wherein prior to obtaining the address information entered by the user, the method further comprises:
obtaining historical input address information of a user to obtain a training sample;
extracting state nodes of a finite state machine from the training samples;
determining a migration path of the training sample between the state nodes;
calculating the transition probability between the adjacent state nodes through a hidden Markov model according to the transition path;
and taking the finite state machine containing the transition probability among the state nodes as a probability finite state machine.
6. The method of claim 1, wherein performing Chinese word segmentation on the address information to obtain a word segmentation result comprises:
performing atom segmentation on the address information to obtain a plurality of single characters;
combining adjacent single words according to different combination modes to obtain a first word segmentation;
and matching the first word segmentation with a preset word association table to obtain a word segmentation result.
7. The method of claim 1, wherein after obtaining the first participle, the method further comprises:
and if the first segmentation words contain unknown words, taking the unknown words as segmentation results under the condition that the search frequency of the unknown words is greater than a preset value.
8. An address identification apparatus, the apparatus comprising:
the acquisition module is used for acquiring address information input by a user;
the word segmentation module is used for segmenting words of the address information to obtain word segmentation results;
the first determining module is used for determining a first transition state of the address information in the probability finite state machine according to the word segmentation result;
a first calculating module, configured to calculate a first probability of the address information in the probability finite state machine according to the first migration state;
a second determining module, configured to determine whether the address information includes a wrongly-written or mispronounced word according to the first migration status when the first probability is smaller than a preset threshold;
the error correction module is used for correcting errors of the wrongly-written words when the address information contains the wrongly-written words to obtain a second transition state of the address information after error correction in the probability finite-state machine;
a second calculating module, configured to calculate a second probability of the address information in the probability finite state machine according to the second migration state;
and the third determining module is used for determining that the address information input by the user is an effective address when the second probability is greater than or equal to the preset threshold.
9. A computing device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the address identification method according to any one of claims 1-7.
10. A computer storage medium having stored thereon at least one executable instruction for causing a processor to perform operations corresponding to an address identification method according to any one of claims 1 to 7.
CN201910935761.9A 2019-09-29 2019-09-29 Address recognition method, address recognition device, computing equipment and computer storage medium Active CN112579713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910935761.9A CN112579713B (en) 2019-09-29 2019-09-29 Address recognition method, address recognition device, computing equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910935761.9A CN112579713B (en) 2019-09-29 2019-09-29 Address recognition method, address recognition device, computing equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN112579713A (en) 2021-03-30
CN112579713B (en) 2023-11-21

Family

ID=75111221

Family Applications (1)

Application Number Status Priority Date Filing Date Title
CN201910935761.9A Active CN112579713B (en) 2019-09-29 2019-09-29 Address recognition method, address recognition device, computing equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN112579713B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050182868A1 (en) * 2004-02-17 2005-08-18 Samsung Electronics Co., Ltd. Apparatus and method for controlling memory
CN101923618A (en) * 2010-08-19 2010-12-22 中国航天科技集团公司第七一○研究所 Hidden Markov model based method for detecting assembler instruction level vulnerability
CN103020038A (en) * 2012-12-25 2013-04-03 人民搜索网络股份公司 Internet public opinion regional relevance computing method
CN106202028A (en) * 2015-04-30 2016-12-07 阿里巴巴集团控股有限公司 Address information recognition method and device
CN107526967A (en) * 2017-07-05 2017-12-29 阿里巴巴集团控股有限公司 Risk address recognition method, apparatus and electronic equipment
US20180231391A1 (en) * 2017-02-15 2018-08-16 Telenav, Inc. Navigation system with location based parser mechanism and method of operation thereof
CN108563631A (en) * 2018-03-23 2018-09-21 江苏速度信息科技股份有限公司 Automatic recognition method for natural language address descriptors
CN108694985A (en) * 2017-04-06 2018-10-23 中芯国际集成电路制造(北京)有限公司 Test method and test circuit for detecting storage failure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
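Jiajun Li et al.: "Interact with robot: An efficient approach based on finite state machine and mouse gesture recognition", 2016 9th International Conference on Human System Interactions, pages 203 - 208 *
刘勇国 (Liu Yongguo): "基于数据挖掘的网络入侵检测研究" [Research on network intrusion detection based on data mining], China Doctoral and Master's Dissertations Full-text Database (Doctoral), Information Science and Technology, pages 139 - 4 *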

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114661688A (en) * 2022-03-25 2022-06-24 马上消费金融股份有限公司 Address error correction method and device
CN114661688B (en) * 2022-03-25 2023-09-19 马上消费金融股份有限公司 Address error correction method and device

Also Published As

Publication number Publication date
CN112579713B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
US10783171B2 (en) Address search method and device
US9558179B1 (en) Training a probabilistic spelling checker from structured data
JP5479066B2 (en) Method, apparatus and system for position assisted translation
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN110232129B (en) Scene error correction method, device, equipment and storage medium
CN111488468B (en) Geographic information knowledge point extraction method and device, storage medium and computer equipment
CN110147421B (en) Target entity linking method, device, equipment and storage medium
CN109948122B (en) Error correction method and device for input text and electronic equipment
WO2020215683A1 (en) Semantic recognition method and apparatus based on convolutional neural network, and non-volatile readable storage medium and computer device
CN110705302A (en) Named entity recognition method, electronic device and computer storage medium
CN111727442A (en) Training sequence generation neural network using quality scores
CN113326702B (en) Semantic recognition method, semantic recognition device, electronic equipment and storage medium
JP7254925B2 (en) Transliteration of data records for improved data matching
CN111291099B (en) Address fuzzy matching method and system and computer equipment
CN115470307A (en) Address matching method and device
JP6476886B2 (en) Keyword extraction system, keyword extraction method, and computer program
CN103559177A (en) Geographical name identification method and geographical name identification device
CN112579713B (en) Address recognition method, address recognition device, computing equipment and computer storage medium
CN112784611A (en) Data processing method, device and computer storage medium
CN110222340B (en) Training method of book figure name recognition model, electronic device and storage medium
CN114386407B (en) Word segmentation method and device for text
CN109241208B (en) Address positioning method, address monitoring method, information processing method and device
CN113221558B (en) Express address error correction method and device, storage medium and electronic equipment
CN113268452B (en) Entity extraction method, device, equipment and storage medium
WO2022271369A1 (en) Training of an object linking model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant