CN108363729B

CN108363729B - Character string comparison method and device, terminal equipment and storage medium

Info

Publication number: CN108363729B
Application number: CN201810030724.9A
Authority: CN
Inventors: 刘行行
Original assignee: Ping An Life Insurance Company of China Ltd
Current assignee: Ping An Life Insurance Company of China Ltd
Priority date: 2018-01-12
Filing date: 2018-01-12
Publication date: 2021-01-26
Anticipated expiration: 2038-01-12
Also published as: CN108363729A

Abstract

The invention discloses a character string comparison method, a device, a terminal device and a storage medium, wherein the method comprises the following steps: establishing a basic word bank according to policy information in a policy database; acquiring a first character string and a second character string to be matched; performing word segmentation processing on the first character string and the second character string respectively to obtain a first number of first words and a second number of second words; matching the word frequency of each first word and the word frequency of each second word from the basic word bank; calculating the similarity between the first character string and the second character string by using a cosine similarity algorithm according to the first number, the second number, the word frequency of each first word and the word frequency of each second word; and if the similarity is greater than a preset similarity threshold value, confirming that the meanings of the first character string and the second character string are the same. The technical scheme of the invention can effectively improve the accuracy of the character string comparison result, meanwhile, the word frequency does not need to be repeatedly calculated, and the execution efficiency can be effectively improved.

Description

Character string comparison method and device, terminal equipment and storage medium

Technical Field

The invention relates to the technical field of computers, and in particular to a method for comparing character strings by similarity.

Background

At present, character string similarity comparison is mainly realized with the traditional TF-IDF and cosine similarity calculation methods, which operate directly on the two character strings or texts to be matched. When the word frequency is calculated with TF-IDF, the calculation is carried out only within the range of the two character strings or texts to be matched, and their actual application scenario is not considered. Because the same character string often represents different meanings in different application scenarios, a calculation confined to the two character strings or texts cannot produce an accurate comparison result, so the accuracy of the calculation result is not high.

Meanwhile, the traditional TF-IDF has a complex calculation process, and has low execution efficiency under the condition that the data size of the character string to be matched is large.

Disclosure of Invention

The embodiment of the invention provides a character string comparison method, a character string comparison device, terminal equipment and a storage medium, and aims to solve the problem that in the prior art, the efficiency and the accuracy of character string comparison are not high.

In a first aspect, an embodiment of the present invention provides a character string comparison method, including:

establishing a basic word bank according to policy information in a policy database, wherein the basic word bank comprises Chinese words except English characters and Arabic numbers in the policy information and the word frequency of each Chinese word; the word frequency is obtained by calculation according to the occurrence frequency of each Chinese word in the policy database;

acquiring a first character string and a second character string to be matched;

performing word segmentation processing on the first character string and the second character string respectively to obtain a first number of first words contained in the first character string and a second number of second words contained in the second character string;

matching the word frequency of each first word and the word frequency of each second word from the basic word stock;

calculating the similarity between the first character string and the second character string by using a cosine similarity algorithm according to the first number, the second number, the word frequency of each first word in the first character string and the word frequency of each second word in the second character string;

if the similarity is larger than a preset similarity threshold value, confirming that the meanings of the first character string and the second character string are the same;

the establishing of the basic word bank according to the policy information in the policy database comprises the following steps:

performing word segmentation processing on the Chinese character string to be processed in each policy information to obtain the Chinese word;

counting the occurrence frequency of each Chinese word and the total word number of the Chinese words;

calculating the word frequency of each Chinese word according to the following formula:

$$T_m = \frac{\sum_{j=1}^{G} P_j}{P_m}$$

wherein $T_m$ is the word frequency of the m-th Chinese word, $P_m$ is the number of occurrences of the m-th Chinese word, $G$ is the total word number of the Chinese words, $m \in [1, G]$, $P_j$ is the number of occurrences of the j-th Chinese word, and $\sum_{j=1}^{G} P_j$ is the total number of occurrences of all the Chinese words;

and storing each Chinese word and the word frequency association of the Chinese word in a basic word bank.

In a second aspect, an embodiment of the present invention provides a character string comparison apparatus, including:

the word bank establishing module is used for establishing a basic word bank according to policy information in a policy database, wherein the basic word bank comprises Chinese words except English characters and Arabic numbers in the policy information and the word frequency of each Chinese word; the word frequency is obtained by calculation according to the occurrence frequency of each Chinese word in the policy database;

the acquisition module is used for acquiring a first character string and a second character string to be matched;

the word segmentation module is used for performing word segmentation processing on the first character string and the second character string respectively to obtain a first number of first words contained in the first character string and a second number of second words contained in the second character string;

the word frequency matching module is used for matching the word frequency of each first word and the word frequency of each second word from the basic word bank;

a similarity calculation module, configured to calculate a similarity between the first character string and the second character string by using a cosine similarity calculation method according to the first number, the second number, the word frequency of each first word in the first character string, and the word frequency of each second word in the second character string;

the comparison module is used for confirming that the meanings of the first character string and the second character string are the same if the similarity is larger than a preset similarity threshold;

the word stock establishing module comprises:

the word segmentation processing submodule is used for carrying out word segmentation processing on the Chinese character strings to be processed in each policy information to obtain the Chinese words;

the statistic submodule is used for counting the occurrence frequency of each Chinese word and the total word number of the Chinese words;

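the word frequency calculation submodule is used for calculating the word frequency of each Chinese word according to the following formula:

$$T_m = \frac{\sum_{j=1}^{G} P_j}{P_m}$$

wherein $T_m$ is the word frequency of the m-th Chinese word, $P_m$ is the number of occurrences of the m-th Chinese word, $G$ is the total word number of the Chinese words, $m \in [1, G]$, $P_j$ is the number of occurrences of the j-th Chinese word, and $\sum_{j=1}^{G} P_j$ is the total number of occurrences of all the Chinese words;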

and the association storage submodule is used for storing each Chinese word and the word frequency association of the Chinese word in the basic word stock.

In a third aspect, an embodiment of the present invention provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the character string comparison method when executing the computer program.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of the character string comparison method.

Compared with the prior art, the embodiment of the invention has the following advantages. In the character string comparison method, device, terminal device and storage medium provided by the embodiment of the invention, a basic word bank is established according to the policy information in the policy database. The basic word bank comprises the Chinese words in the policy information, excluding English characters and Arabic numbers, and the word frequency of each Chinese word, which is calculated according to the number of occurrences of that Chinese word in the policy database. When the first character string and the second character string to be matched are compared, the word frequencies of their words are obtained from the basic word bank. Because these word frequencies reflect the actual application scenario, they are more targeted and more accurate, which effectively improves the accuracy of judging, according to cosine similarity, whether the meanings of the first character string and the second character string are the same. Meanwhile, once the basic word bank has been established, the word frequency can be obtained directly from it in every comparison and does not need to be recalculated, which effectively improves the execution efficiency when the data volume of the character strings to be matched is large.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.

Fig. 1 is a flowchart of an implementation of the character string comparison method provided in embodiment 1 of the present invention;

Fig. 2 is a flowchart of an implementation of step S1 in the character string comparison method provided in embodiment 1 of the present invention;

Fig. 3 is a flowchart of an implementation of step S11 in the character string comparison method provided in embodiment 1 of the present invention;

Fig. 4 is a flowchart of an implementation of step S5 in the character string comparison method provided in embodiment 1 of the present invention;

Fig. 5 is a flowchart of an implementation of associating policy information in the character string comparison method provided in embodiment 1 of the present invention;

Fig. 6 is a schematic diagram of the character string comparison apparatus provided in embodiment 2 of the present invention;

Fig. 7 is a schematic diagram of the terminal device provided in embodiment 4 of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

Referring to fig. 1, fig. 1 shows an implementation flow of a character string comparison method according to an embodiment of the present invention. The character string comparison method can be applied to matching analysis of insurance policy information in the insurance industry. The details are as follows:

S1: Establish a basic word bank according to the policy information in the policy database, wherein the basic word bank comprises the Chinese words in the policy information, excluding English characters and Arabic numbers, and the word frequency of each Chinese word. The word frequency is calculated according to the number of occurrences of each Chinese word in the policy database.

In the embodiment of the invention, the policy database is a database which is established in advance by an insurance company and is used for storing policy information of insurance products purchased by a user.

The policy information in the policy database is analyzed to identify keywords in the policy information, and the identified keywords are used to establish a base lexicon.

Specifically, the content of each attribute in the policy information is subjected to word segmentation, the attributes comprise user personal identity attributes, insurance product attributes and the like, each word obtained after word segmentation is used as a keyword, and the word frequency of the keyword is calculated according to the occurrence frequency of each keyword in the policy database.

It should be noted that the purpose of establishing the basic word bank is to provide a basis for comparing character strings. Because the meaning of a Chinese word is complex, and the grammar and structure of Chinese words differ from those of English words, the comparison methods are completely different. Therefore, the basic word bank contains only Chinese words and does not contain other characters such as English characters or Arabic numerals.

Further, the basic word bank may be updated periodically or in real time. For example, this step may be executed once at preset time intervals; since new policy information is continuously stored in the policy database, the basic word bank is rebuilt from the latest policy database every time the step is executed, which updates the basic word bank. Alternatively, the step may be executed when an update of the policy database is detected. However, because the insurance company continuously generates new policies, the policy database is updated frequently, and executing this step on every update would reduce system performance. The step can therefore be executed only when the number of updates reaches a preset threshold, which avoids the performance impact of executing it too frequently.
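The threshold-based update policy described above can be sketched as follows. This is only an illustrative sketch: the class name `LexiconRefresher`, the `rebuild` callback (standing in for a re-run of step S1) and the parameter `update_threshold` are hypothetical names that do not come from the embodiment.

```python
from typing import Callable


class LexiconRefresher:
    """Rebuild the basic word bank only after enough policy-database updates."""

    def __init__(self, rebuild: Callable[[], None], update_threshold: int) -> None:
        self._rebuild = rebuild  # hypothetical callback that re-runs step S1
        self._update_threshold = update_threshold
        self._pending_updates = 0

    def on_policy_database_update(self) -> bool:
        """Record one database update; return True if a rebuild was triggered."""
        self._pending_updates += 1
        if self._pending_updates >= self._update_threshold:
            self._rebuild()
            self._pending_updates = 0
            return True
        return False
```

With a threshold of 3, for instance, the word bank would be rebuilt on the third, sixth, ninth update and so on, instead of on every update.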

S2: Acquire a first character string and a second character string to be matched.

In the embodiment of the present invention, based on the basic thesaurus established in step S1, a comparison between two character strings may be performed to determine whether the meanings of the two character strings are the same. If more than two character strings need to be compared, the character strings can be compared pairwise, each character string is compared with other character strings, and whether the meanings of the character strings are the same or not is judged according to the result of pairwise comparison. Specifically, the first character string and the second character string to be matched may be address character strings in the policy information.
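The pairwise comparison of more than two character strings can be sketched as follows. The function name `pairwise_matches` and the `same_meaning` predicate (standing in for steps S3 to S6) are hypothetical and are used only for illustration.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def pairwise_matches(
    strings: Sequence[str],
    same_meaning: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    """Compare every unordered pair of strings and keep the pairs judged equal in meaning."""
    return [
        (first, second)
        for first, second in combinations(strings, 2)
        if same_meaning(first, second)
    ]
```

Each unordered pair is visited exactly once, so n character strings require n(n-1)/2 comparisons.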

S3: Perform word segmentation processing on the first character string and the second character string respectively to obtain a first number of first words contained in the first character string and a second number of second words contained in the second character string.

Specifically, the method of performing word segmentation on the first character string and the second character string to be matched may be the same as the method of performing word segmentation on the content of each attribute in the policy information when the basic thesaurus is established in step S1.

It should be noted that the first words and the second words are all Chinese words. If the first character string or the second character string contains other characters, such as English characters or Arabic numerals, the non-Chinese character strings are identified first, and word segmentation is then performed on the Chinese character strings that they separate.
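One possible way to separate the Chinese character strings from the non-Chinese ones before word segmentation is sketched below. It assumes that "Chinese character" means a code point in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF); the embodiment does not specify how non-Chinese characters are recognized, and the function name `split_chinese_runs` is hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: a Chinese character is a code point in the basic CJK Unified Ideographs block.
_CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def split_chinese_runs(text: str) -> Tuple[List[str], List[str]]:
    """Return (Chinese runs, non-Chinese runs) of text, each in left-to-right order."""
    chinese_runs = _CHINESE_RUN.findall(text)
    # Splitting on the Chinese runs leaves the non-Chinese pieces; drop empty pieces.
    other_runs = [part for part in _CHINESE_RUN.split(text) if part]
    return chinese_runs, other_runs
```

The Chinese runs would then be passed to the word segmentation step, while the non-Chinese runs could be kept for direct comparison.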

S4: Match the word frequency of each first word and the word frequency of each second word from the basic word bank.

Specifically, the word frequency of each first word and the word frequency of each second word are matched in the base thesaurus according to the first number of first words and the second number of second words obtained in step S3.

It should be noted that, if the word frequency of a first word or a second word cannot be matched in the basic word bank, that word is not stored in the basic word bank. In this case, the basic word bank is first updated with the first character string or the second character string according to the method for establishing the basic word bank described in step S1, and the word frequency of the word is then matched in the updated basic word bank.
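The lookup with fallback update can be sketched as follows. This sketch makes two assumptions that are not stated in the embodiment: the word bank keeps the raw occurrence counts so that it can be updated incrementally, and the word frequency is derived from those counts as the total number of occurrences divided by the occurrences of the word, following the worked example of step S13. The names `BasicWordBank` and `frequency_with_update` are hypothetical.

```python
from collections import Counter
from typing import Iterable


class BasicWordBank:
    """Minimal word bank that stores occurrence counts (an assumption of this sketch)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def __contains__(self, word: str) -> bool:
        return self._counts[word] > 0

    def add_words(self, words: Iterable[str]) -> None:
        """Add the segmented words of one character string to the word bank."""
        self._counts.update(words)

    def word_frequency(self, word: str) -> float:
        """Total occurrences of all words divided by the occurrences of this word."""
        return sum(self._counts.values()) / self._counts[word]


def frequency_with_update(bank: BasicWordBank, word: str, words_of_string: Iterable[str]) -> float:
    """Look up a word; if it is missing, update the bank with its source string first."""
    if word not in bank:
        bank.add_words(words_of_string)
    return bank.word_frequency(word)
```

Note that adding a string changes the total number of occurrences, so under this assumption the frequencies of all words shift slightly after an update.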

S5: Calculate the similarity between the first character string and the second character string by using a cosine similarity algorithm according to the first number, the second number, the word frequency of each first word in the first character string and the word frequency of each second word in the second character string.

Specifically, according to the word frequency of each first word and the word frequency of each second word obtained in step S4, the word frequency of each first word is formed into a word vector of the first character string, and the word frequency of each second word is formed into a word vector of the second character string.

And calculating the similarity between the first character string and the second character string by using a cosine similarity algorithm according to the word vector of the first character string and the word vector of the second character string.
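A minimal sketch of the cosine similarity step is given below. How the two word vectors are aligned is not spelled out in this passage, so the sketch assumes one dimension per distinct word across both character strings, with the component equal to the word frequency from the basic word bank when the word occurs in the string and 0 otherwise. The function name and the `word_frequency` mapping are hypothetical.

```python
import math
from typing import Dict, List


def cosine_similarity(
    first_words: List[str],
    second_words: List[str],
    word_frequency: Dict[str, float],
) -> float:
    """Cosine similarity of two strings represented by word-frequency vectors."""
    # Assumption: one dimension per distinct word appearing in either string.
    vocabulary = sorted(set(first_words) | set(second_words))
    first_set, second_set = set(first_words), set(second_words)
    first_vector = [word_frequency[w] if w in first_set else 0.0 for w in vocabulary]
    second_vector = [word_frequency[w] if w in second_set else 0.0 for w in vocabulary]

    dot = sum(x * y for x, y in zip(first_vector, second_vector))
    first_norm = math.sqrt(sum(x * x for x in first_vector))
    second_norm = math.sqrt(sum(y * y for y in second_vector))
    if first_norm == 0.0 or second_norm == 0.0:
        return 0.0  # an empty string has no direction; treat it as dissimilar
    return dot / (first_norm * second_norm)
```

Because all components are non-negative, the result lies in [0, 1]. Under this assumption, words with larger word-frequency values contribute more to the result than words with smaller values.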

S6: If the similarity between the first character string and the second character string is greater than a preset similarity threshold, confirm that the meanings of the first character string and the second character string are the same.

Specifically, when the similarity calculated in step S5 is greater than the preset similarity threshold, it is determined that the meanings of the first character string and the second character string are the same, otherwise, if the similarity is less than or equal to the similarity threshold, it is determined that the meanings of the first character string and the second character string are different.

The similarity calculated by the cosine similarity algorithm lies in the range [0, 1]. The preset similarity threshold may therefore be set to 0.9, for example, but it is not limited to this value and may be set according to the application requirement.

It is understood that in the embodiment of the present invention, when the similarity is greater than the preset similarity threshold, the first character string and the second character string have the same meaning, and in other embodiments, when the similarity is greater than or equal to the preset similarity threshold, the first character string and the second character string have the same meaning.

It should be noted that the character string comparison method provided in the embodiment of the present invention compares the Chinese character strings within the first character string and the second character string. When the first character string or the second character string includes other characters, those characters may be compared directly, and whether the meanings of the first character string and the second character string are the same is determined comprehensively from the direct comparison result and the similarity. For example, if the similarity obtained in this step is greater than the preset similarity threshold but the Arabic numbers in the first character string and the second character string are different, the meanings of the two character strings can be considered different.
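The combined decision can be sketched as follows. The sketch assumes that "other characters" means runs of ASCII letters and digits and that "direct comparison" means the runs must be identical and in the same order; both are illustrative assumptions, as is the function name `same_meaning`. The default threshold of 0.9 is the example value given above.

```python
import re

# Assumption: the non-Chinese parts to compare directly are runs of ASCII letters and digits.
_OTHER_RUN = re.compile(r"[A-Za-z0-9]+")


def same_meaning(similarity: float, first: str, second: str, threshold: float = 0.9) -> bool:
    """Combine the cosine similarity of the Chinese parts with a direct comparison of the rest."""
    if similarity <= threshold:
        return False  # the Chinese parts are not similar enough
    # The Chinese parts match; the non-Chinese runs must also be identical.
    return _OTHER_RUN.findall(first) == _OTHER_RUN.findall(second)
```

For example, two addresses whose Chinese parts are highly similar but whose building numbers differ would be judged to have different meanings.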

In the embodiment corresponding to fig. 1, the word frequency of each word in the basic word bank is calculated in advance, and when a character string to be matched is compared, only the corresponding word frequency needs to be obtained from the basic word bank for similarity calculation, so that the similarity calculation of the character string is performed based on the basic word bank in an actual application scene, thereby having more pertinence and improving the matching accuracy, and meanwhile, the word frequency does not need to be recalculated in each comparison.

Next, based on the embodiment shown in fig. 1, a specific implementation method for establishing the basic thesaurus according to the policy information in the policy database mentioned in step S1 is described in detail below by using a specific embodiment.

Referring to fig. 2, fig. 2 shows a specific implementation flow of step S1 provided in the embodiment of the present invention, which is detailed as follows:

S11: Perform word segmentation processing on the Chinese character string to be processed in each piece of policy information to obtain Chinese words.

In the embodiment of the present invention, the character strings to be processed in the policy information, such as a home address character string, may contain other characters such as English characters and Arabic numerals in addition to Chinese characters. Therefore, the non-Chinese character strings in the policy information are identified first, and word segmentation is then performed on the Chinese character strings that they separate.

S12: Count the number of occurrences of each Chinese word and the total word number of the Chinese words.

Specifically, the number of occurrences in the policy database of each Chinese word obtained in step S11, and the total word number of the Chinese words obtained by the word segmentation processing in step S11, are counted.

The total word number of the Chinese words is the number of distinct Chinese words obtained by the word segmentation processing.

S13: Calculate the word frequency of each Chinese word according to formula (1):

$$T_m = \frac{\sum_{j=1}^{G} P_j}{P_m} \tag{1}$$

wherein $T_m$ is the word frequency of the m-th Chinese word, $P_m$ is the number of occurrences of the m-th Chinese word, $G$ is the total word number of the Chinese words, $m \in [1, G]$, $P_j$ is the number of occurrences of the j-th Chinese word, and $\sum_{j=1}^{G} P_j$ is the total number of occurrences of all the Chinese words.

Specifically, after the number of occurrences of each chinese word and the total number of words in the chinese character string are counted in step S12, the word frequency of each chinese word is calculated according to formula (1).

S14: Store each Chinese word in association with its word frequency in the basic word bank.

Specifically, the word frequency of each chinese word obtained in step S13 and the chinese word association are stored in the basic thesaurus, so that the word frequency of the target word can be directly obtained from the basic thesaurus when performing the character string comparison.

In order to better understand the technical solution of the embodiment of the present invention, the process of establishing the basic lexicon is described below by a specific example, which is detailed as follows:

Assume that the policy information in the policy database contains the following address strings: "Guangdong Province Shenzhen City Luohu District Zhongmin Times Square", "Guangdong Province Shenzhen City Futian District Ping An Building", "Guangdong Province Shenzhen City Futian District Tencent Building", "Guangdong Province Shenzhen City Futian District Zhongxing Building", "Guangdong Province Guangzhou City Tianhe District Zhengjia Square" and "Guangdong Province Guangzhou City Tianhe District Tianhe City".

First, after the word segmentation processing is performed on the address character strings according to step S11, the obtained word segmentation results are as follows:

1) the word segmentation result of "Guangdong Province Shenzhen City Luohu District Zhongmin Times Square" is: "Guangdong Province", "Shenzhen City", "Luohu District" and "Zhongmin Times Square";

2) the word segmentation result of "Guangdong Province Shenzhen City Futian District Ping An Building" is: "Guangdong Province", "Shenzhen City", "Futian District" and "Ping An Building";

3) the word segmentation result of "Guangdong Province Shenzhen City Futian District Tencent Building" is: "Guangdong Province", "Shenzhen City", "Futian District" and "Tencent Building";

4) the word segmentation result of "Guangdong Province Shenzhen City Futian District Zhongxing Building" is: "Guangdong Province", "Shenzhen City", "Futian District" and "Zhongxing Building";

5) the word segmentation result of "Guangdong Province Guangzhou City Tianhe District Zhengjia Square" is: "Guangdong Province", "Guangzhou City", "Tianhe District" and "Zhengjia Square";

6) the word segmentation result of "Guangdong Province Guangzhou City Tianhe District Tianhe City" is: "Guangdong Province", "Guangzhou City", "Tianhe District" and "Tianhe City".

Step S12 is then executed to count the word segmentation results, which yields 12 Chinese words: "Guangdong Province", "Guangzhou City", "Shenzhen City", "Luohu District", "Futian District", "Tianhe District", "Zhongmin Times Square", "Ping An Building", "Tencent Building", "Zhongxing Building", "Zhengjia Square" and "Tianhe City". That is, the total word number G of the Chinese words is 12. The number of occurrences of "Guangdong Province" is 6, that of "Shenzhen City" is 4, that of "Futian District" is 3, that of "Guangzhou City" and of "Tianhe District" is 2, and that of "Luohu District", "Zhongmin Times Square", "Ping An Building", "Tencent Building", "Zhongxing Building", "Zhengjia Square" and "Tianhe City" is 1. Namely, $\sum_{j=1}^{G} P_j = 6 + 4 + 3 + 2 + 2 + 7 \times 1 = 24$.

Then, the word frequency of each of the above Chinese words is calculated according to formula (1) in step S13. For example, the number of occurrences of "Guangdong Province" is 6, so the word frequency of "Guangdong Province" is 24/6 = 4. Similarly, the word frequencies of the other Chinese words calculated according to formula (1) are as follows: the word frequency of "Shenzhen City" is 24/4 = 6, the word frequency of "Futian District" is 24/3 = 8, the word frequencies of "Guangzhou City" and "Tianhe District" are 24/2 = 12, and the word frequencies of "Luohu District", "Zhongmin Times Square", "Ping An Building", "Tencent Building", "Zhongxing Building", "Zhengjia Square" and "Tianhe City" are 24/1 = 24.

Finally, step S14 is executed to store the 12 Chinese words, each in association with its word frequency, in the basic word bank.
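The worked example of steps S12 to S14 can be reproduced with the short sketch below. It takes the word frequency of a word to be the total number of occurrences of all words divided by the number of occurrences of that word, which matches the numbers in the example (24/6 = 4 for "Guangdong Province"). The function name `build_basic_word_bank` is hypothetical, and the English word labels stand in for the original Chinese words.

```python
from collections import Counter
from typing import Dict, List


def build_basic_word_bank(segmented_strings: List[List[str]]) -> Dict[str, float]:
    """Map each word to (total occurrences of all words) / (occurrences of that word)."""
    counts = Counter(word for words in segmented_strings for word in words)  # step S12
    total_occurrences = sum(counts.values())
    # Steps S13 and S14: compute each word frequency and store it with its word.
    return {word: total_occurrences / count for word, count in counts.items()}


# Segmentation results of the six example addresses (output of step S11).
segmented_addresses = [
    ["Guangdong Province", "Shenzhen City", "Luohu District", "Zhongmin Times Square"],
    ["Guangdong Province", "Shenzhen City", "Futian District", "Ping An Building"],
    ["Guangdong Province", "Shenzhen City", "Futian District", "Tencent Building"],
    ["Guangdong Province", "Shenzhen City", "Futian District", "Zhongxing Building"],
    ["Guangdong Province", "Guangzhou City", "Tianhe District", "Zhengjia Square"],
    ["Guangdong Province", "Guangzhou City", "Tianhe District", "Tianhe City"],
]
word_bank = build_basic_word_bank(segmented_addresses)
```

Under this definition a word that occurs often receives a small value and a rare word a large one, so the value behaves more like an inverse frequency weight than a raw count.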

In the embodiment corresponding to fig. 2, the word segmentation processing is performed on the chinese character strings in the policy information to obtain chinese words, and the word frequency of each chinese word is calculated according to the formula (1), so that each chinese word and the word frequency association thereof are stored in the basic word stock, and the creation of the basic word stock is completed. Because the word frequency of each word in the basic word stock is calculated in advance, when character strings are compared, the corresponding word frequency can be directly obtained from the basic word stock for similarity calculation, on one hand, the word frequency obtained based on the basic word stock is more targeted, so that the accuracy of the word frequency is higher, on the other hand, the word frequency does not need to be repeatedly calculated when the character strings are compared, and when the number of the character strings to be matched is huge, the execution efficiency can be effectively improved.

Based on the embodiment corresponding to Fig. 2, the specific implementation flow of step S11, in which word segmentation is performed on the Chinese character strings to be processed in each piece of policy information to obtain Chinese words, is described in detail below through a specific embodiment.

Referring to fig. 3, fig. 3 shows a specific implementation flow of step S11 provided in the embodiment of the present invention, which is detailed as follows:

S111: Perform single-character segmentation on the Chinese character string to be processed to obtain n single characters $a_i$, where $i \in [1, n]$ and n is the number of Chinese characters contained in the Chinese character string.

In the embodiment of the invention, the Chinese character string to be processed is segmented character by character to obtain n Chinese characters $a_i$. The n characters are stored in an array in the order in which they appear in the Chinese character string, each character being one element of the array; that is, the single characters are arranged from left to right in the same order as in the Chinese character string.

For example, if the Chinese character string to be processed is "深圳市罗湖区" (Luohu District, Shenzhen City), single-character segmentation yields six single characters: $a_1$ = "深", $a_2$ = "圳", $a_3$ = "市", $a_4$ = "罗", $a_5$ = "湖", $a_6$ = "区".

S112: If i is less than n, combine $a_i$ with its adjacent single character $a_{i+1}$ to obtain a temporary word $a_i a_{i+1}$.

Specifically, starting from the first character $a_1$, $a_1$ and $a_2$ are combined to obtain the temporary word $a_1 a_2$.

When i is equal to n, $a_i$ is the last character of the Chinese character string to be processed and has no following adjacent character, so no new temporary word can be obtained by further combination.

S113: If the temporary word $a_i a_{i+1}$ exists in a preset common word bank, combine it with the adjacent single character $a_{i+2}$ to obtain the temporary word $a_i a_{i+1} a_{i+2}$, and continue searching the common word bank with this temporary word until the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the common word bank, where $k \in [i+1, n]$.

In the embodiment of the invention, the preset common word bank contains common Chinese words, and the common word bank can be updated regularly.

If the temporary word $a_i a_{i+1}$ obtained in step S112 exists in the common word bank, it is combined with the adjacent single character $a_{i+2}$ to obtain the temporary word $a_i a_{i+1} a_{i+2}$. If that temporary word still exists in the common word bank, it is combined with $a_{i+3}$ to obtain the temporary word $a_i a_{i+1} a_{i+2} a_{i+3}$, which is looked up in the common word bank in turn, and so on, until the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the common word bank, at which point the search ends.

It should be noted that k is greater than or equal to i+1 and less than or equal to n. When k is equal to n, the temporary word $a_i a_{i+1} a_{i+2} \ldots a_n$ being looked up has reached the end of the Chinese character string to be processed. Therefore, after the lookup of this temporary word is completed, the search does not continue, regardless of whether the temporary word exists in the common word bank.

Continuing with the example Chinese character string "深圳市罗湖区" (Shenzhen City, Luohu District) from step S111, the six single characters obtained in step S111 are a₁ = 深, a₂ = 圳, a₃ = 市, a₄ = 罗, a₅ = 湖, a₆ = 区. First, "深" and "圳" are combined into the temporary word "深圳" (Shenzhen), which is searched in the common word bank. Because "深圳" exists in the common word bank, it is combined with the adjacent "市" into the temporary word "深圳市" (Shenzhen City), which is searched in turn. Because "深圳市" also exists in the common word bank, it is combined with the adjacent "罗" into the temporary word "深圳市罗". Because "深圳市罗" does not exist in the common word bank, the search stops, and the current temporary word is "深圳市罗".

S114: If the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common word bank, recognize a_i a_{i+1} a_{i+2} ... a_{k-1} as a valid word and start again from the single character a_k: if k is equal to n, recognize a_k as a valid word; if k is less than n, take a_k as a_i and return to step S112 to continue execution.

Specifically, if the temporary word a_i a_{i+1} a_{i+2} ... a_k obtained in step S113 does not exist in the common word bank, a_i a_{i+1} a_{i+2} ... a_{k-1} is recognized as a valid word. Then, starting from the single character a_k, if k is less than n, a_k is taken as the new a_i and execution returns to step S112.

When k is equal to i+1, if a_i a_{i+1} does not exist in the common word bank, the single character a_i is recognized as a valid word, and there is no need to check whether a_i itself exists in the common word bank.

When k is equal to n, if a_i a_{i+1} a_{i+2} ... a_n does not exist in the common word bank, a_k is the last single character in the Chinese character string to be processed and has no adjacent single character, so no new temporary word can be formed. Execution therefore does not return to step S112; a_k is directly recognized as a valid word and the flow jumps to step S116.

Continuing with the example Chinese character string "深圳市罗湖区" (Shenzhen City, Luohu District) from step S111: because the temporary word "深圳市罗" does not exist in the common word bank, "深圳市" (Shenzhen City) is recognized as a valid word. Then, starting from the single character "罗", it is combined with the adjacent "湖" into a new temporary word "罗湖" (Luohu) according to step S112, and the search continues in the common word bank.

S115: If the temporary word a_i a_{i+1} a_{i+2} ... a_k exists in the common word bank and k is equal to n, recognize a_i a_{i+1} a_{i+2} ... a_k as a valid word.

Specifically, if the temporary word a_i a_{i+1} a_{i+2} ... a_k obtained in step S113 exists in the common word bank and k is equal to n, the temporary word has reached the last single character of the Chinese character string to be processed. Since it also exists in the common word bank, a_i a_{i+1} a_{i+2} ... a_k is directly recognized as a valid word.

Continuing with the example Chinese character string "深圳市罗湖区" (Shenzhen City, Luohu District) from step S111: after "深圳市" has been recognized as a valid word, the single character "罗" is combined with the adjacent "湖" into the temporary word "罗湖" (Luohu) according to step S112, and "罗湖" is searched in the common word bank. If "罗湖" exists in the common word bank, it is combined with the adjacent "区" into the temporary word "罗湖区" (Luohu District), which is searched in turn. If "罗湖区" exists in the common word bank, then, because k = n at this point, the loop search is not continued and "罗湖区" is directly recognized as a valid word.

S116: Determine the recognized valid words as the Chinese words in the Chinese character string.

Specifically, the valid words recognized in step S114 and step S115 are taken as the Chinese words obtained by performing word segmentation on the Chinese character string.

Continuing with the example Chinese character string "深圳市罗湖区" (Shenzhen City, Luohu District) from step S111, the valid words recognized in step S114 and step S115 are "深圳市" (Shenzhen City) and "罗湖区" (Luohu District). The word segmentation result of the Chinese character string "深圳市罗湖区" is therefore "深圳市" and "罗湖区".

In the embodiment corresponding to fig. 3, the Chinese character string to be processed is first segmented into single characters. Then, starting from the first single character, the single character is combined with its adjacent single character into a temporary word, and the temporary word is searched in the common word bank. If the temporary word exists in the common word bank, it is combined with the next adjacent single character into a new temporary word, which is searched in the common word bank in turn. This combination and search is repeated until a newly combined temporary word is not found in the common word bank, at which point the previous temporary word is taken as a valid word. The single character left over after the valid word is removed is then combined with its next adjacent single character and the search in the common word bank continues, until all single characters of the Chinese character string have been processed. Because word segmentation is performed by single-character segmentation and combination, no initial word length needs to be set, common words are not split incorrectly, and the accuracy of word segmentation is improved. In addition, compared with the traditional forward maximum matching algorithm, the technical solution of this embodiment is simple to implement and executes more efficiently, so that both the generality and the efficiency of character string word segmentation are effectively improved.
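Steps S111 to S116 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `segment`, the 0-based indexing, and the four-entry `COMMON_WORD_BANK` are assumptions made for the example, and the input "深圳市罗湖区" (Shenzhen City, Luohu District) is assumed to be the address behind the worked example.

```python
def segment(text: str, common_word_bank: set[str]) -> list[str]:
    """Segment text by growing a temporary word one character at a time."""
    chars = list(text)          # S111: single-character segmentation
    n = len(chars)
    words: list[str] = []
    i = 0                       # 0-based index of the current start character
    while i < n:
        if i == n - 1:
            # Last character has no neighbour: it is a valid word on its own.
            words.append(chars[i])
            break
        k = i + 1
        temp = chars[i] + chars[k]              # S112: combine a_i and a_{i+1}
        while temp in common_word_bank and k < n - 1:
            k += 1                              # S113: extend while still found
            temp += chars[k]
        if temp in common_word_bank:
            # S115: the end of the string was reached and temp is a known word.
            words.append(temp)
            break
        # S114: temp is unknown, so everything before a_k is a valid word
        # and the scan restarts from a_k.
        words.append("".join(chars[i:k]))
        i = k
    return words                # S116: the recognized valid words


# Hypothetical common word bank, just large enough for the example.
COMMON_WORD_BANK = {"深圳", "深圳市", "罗湖", "罗湖区"}
print(segment("深圳市罗湖区", COMMON_WORD_BANK))
```

Using a `set` for the common word bank keeps each lookup at roughly constant cost, so the scan touches each character a bounded number of times.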

Based on the embodiments corresponding to fig. 1 to fig. 3, a detailed description will be given below of a specific implementation flow of calculating the similarity between the first character string and the second character string by using a cosine similarity algorithm according to the first number, the second number, the word frequency of each first word in the first character string, and the word frequency of each second word in the second character string, which are mentioned in step S5, by using a specific embodiment.

Referring to fig. 4, fig. 4 shows a specific implementation flow of step S5 provided in the embodiment of the present invention, which is detailed as follows:

s51: according to the first number, the second number, the word frequency of each first word in the first character string and the word frequency of each second word in the second character string, constructing a first word vector of the first character string and a second word vector of the second character string as follows:

Q1[d+n+m]＝{T(W₁),T(W₂),…T(W_d),T(W_d+1),T(W_d+2),…，T(W_d+n),Z_n+1,Z_n+2,…,Z_n+m}

Q2[d+n+m]＝{T(W₁),T(W₂),…,T(W_d),Z₁,Z₂,…，Z_n,T(W_n+1),T(W_n+2),…,T(W_n+m)}

wherein, Q1[ d + n + m [ ]]Is the first word vector, Q2[ d + n + m]Is a second word vector, W₁,W₂,…W_d,W_d+1,W_d+2,…，W_d+nFor the first word obtained after the word segmentation processing is carried out on the first character string, W₁,W₂,…W_d,W_n+1,W_n+2,…，W_n+mD + n is a first number, d + m is a second number, T (W) is a second word obtained by performing word segmentation processing on the second character string_s) Is W_sS is in [1, n + m ]]，Z_bHas a value of 0, b ∈ [ 1],n+m]。

Specifically, a first word vector and a second word vector are constructed according to the first number of first words and the second number of second words obtained in step S3, and the word frequency of each first word and the word frequency of each second word obtained in step S4.
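The construction of step S51 can be sketched in Python as follows. The layout assumed here places the words shared by both strings first, then the words found only in the first string, then the words found only in the second string, with 0 filling the positions a string does not cover. The function name `build_word_vectors`, the English word labels, and the assumption that each word occurs at most once per string and is present in the word bank are illustrative choices, not details confirmed by the text.

```python
def build_word_vectors(
    first_words: list[str],
    second_words: list[str],
    word_freq: dict[str, float],
) -> tuple[list[float], list[float]]:
    """Build aligned word vectors Q1 and Q2 from word-bank frequencies."""
    first_set, second_set = set(first_words), set(second_words)
    shared = [w for w in first_words if w in second_set]           # d words
    first_only = [w for w in first_words if w not in second_set]   # n words
    second_only = [w for w in second_words if w not in first_set]  # m words

    shared_freq = [word_freq[w] for w in shared]
    q1 = shared_freq + [word_freq[w] for w in first_only] + [0] * len(second_only)
    q2 = shared_freq + [0] * len(first_only) + [word_freq[w] for w in second_only]
    return q1, q2


# Word frequencies taken from the worked example in the description.
freq = {
    "Guangdong Province": 4,
    "Shenzhen City": 6,
    "Luohu District": 24,
    "Zhongmin Times Square": 24,
}
q1, q2 = build_word_vectors(
    ["Guangdong Province", "Shenzhen City", "Luohu District", "Zhongmin Times Square"],
    ["Shenzhen City", "Luohu District", "Zhongmin Times Square"],
    freq,
)
print(q1, q2)
```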

S52: Calculate the similarity between the first character string and the second character string according to formula (2):

$$\varepsilon = \frac{\sum_{c=1}^{d+n+m} Q1[c]\, Q2[c]}{\sqrt{\sum_{c=1}^{d+n+m} Q1[c]^{2}}\;\sqrt{\sum_{c=1}^{d+n+m} Q2[c]^{2}}} \tag{2}$$

where ε is the similarity between the first character string and the second character string; Q1[c] is the word frequency of the c-th first word in the first character string; and Q2[c] is the word frequency of the c-th second word in the second character string.

Specifically, the similarity epsilon is calculated according to formula (2) based on the first word vector and the second word vector obtained in step S51.
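The calculation of step S52 can be sketched in Python as follows, assuming formula (2) is the standard cosine similarity of the two word vectors. The function name `cosine_similarity` and the choice to return 0.0 when either vector is all zeros are assumptions of this sketch.

```python
import math


def cosine_similarity(q1: list[float], q2: list[float]) -> float:
    """Cosine of the angle between two equally long word vectors."""
    if len(q1) != len(q2):
        raise ValueError("word vectors must have the same length")
    dot = sum(a * b for a, b in zip(q1, q2))
    norm1 = math.sqrt(sum(a * a for a in q1))
    norm2 = math.sqrt(sum(b * b for b in q2))
    if norm1 == 0 or norm2 == 0:
        # Assumption: a string with no known words is treated as dissimilar.
        return 0.0
    return dot / (norm1 * norm2)


# Word vectors from the worked example in the description.
epsilon = cosine_similarity([6, 24, 24, 4], [6, 24, 24, 0])
print(round(epsilon, 4))
```

With a preset similarity threshold of 0.9, as in the example of the description, this value would lead to the two strings being treated as having the same meaning.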

In order to better understand the technical solution of the embodiment of the present invention, the following is described by a specific example, which is detailed as follows:

Assume that the first character string to be matched is "Zhongmin Times Square, Luohu District, Shenzhen City, Guangdong Province" and the second character string to be matched is "Zhongmin Times Square, Luohu District, Shenzhen City".

After word segmentation is performed on the first character string and the second character string, the first words obtained are "Guangdong Province", "Shenzhen City", "Luohu District" and "Zhongmin Times Square", and the second words obtained are "Shenzhen City", "Luohu District" and "Zhongmin Times Square"; that is, the first number is 4 and the second number is 3.

Taking the basic word bank established in step S14 as an example, the result of matching the word frequency of each first word and each second word in the basic word bank is: the word frequency of "Guangdong Province" is 4, the word frequency of "Shenzhen City" is 6, and the word frequencies of "Luohu District" and "Zhongmin Times Square" are both 24.

Therefore, the word vector Q1[4] of the first character string is {6,24,24,4} and the word vector Q2[4] of the second character string is {6,24,24,0} according to the construction method of step S51.

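The similarity between the first character string and the second character string is calculated according to formula (2) of step S52 as:

$$\varepsilon = \frac{6\times 6 + 24\times 24 + 24\times 24 + 4\times 0}{\sqrt{6^{2}+24^{2}+24^{2}+4^{2}}\;\sqrt{6^{2}+24^{2}+24^{2}+0^{2}}} = \frac{1188}{\sqrt{1204}\,\sqrt{1188}} \approx 0.9933$$

That is, the similarity between the first character string and the second character string is about 0.9933.

Assuming that the preset similarity threshold is 0.9, the calculation result shows that the first character string "Zhongmin Times Square, Luohu District, Shenzhen City, Guangdong Province" and the second character string "Zhongmin Times Square, Luohu District, Shenzhen City" have the same meaning.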

If the similarity is calculated by the prior-art method rather than by the method of this embodiment, which is based on the word frequencies provided by the basic word bank, the calculation process is as follows:

After word segmentation is performed on the first character string and the second character string, the first words obtained are still "Guangdong Province", "Shenzhen City", "Luohu District" and "Zhongmin Times Square", and the second words are "Shenzhen City", "Luohu District" and "Zhongmin Times Square".

The word frequency is calculated according to the number of occurrences of each first word and each second word in the first character string and the second character string, giving the following result: the word frequency of "Guangdong Province" is 1, and the word frequencies of "Shenzhen City", "Luohu District" and "Zhongmin Times Square" are each 2. Therefore, the word vector Q1 of the first character string is {2,2,2,1}, and the word vector Q2 of the second character string is {2,2,2,0}.

Calculating the similarity of the first character string and the second character string by a cosine similarity algorithm as follows:

i.e., the similarity between the first string and the second string is about 0.866.

It can be seen that, if the similarity threshold is still 0.9, the similarity calculated by the prior-art method is smaller than the similarity threshold, so the first character string "Zhongmin Times Square, Luohu District, Shenzhen City, Guangdong Province" and the second character string "Zhongmin Times Square, Luohu District, Shenzhen City" are determined to have different meanings. In fact, the first character string merely contains one more Chinese word, "Guangdong Province", than the second character string, and in the policy information the two address character strings have the same meaning, so the prior-art method produces an erroneous judgment.

Therefore, the technical scheme provided by the embodiment of the invention can more accurately judge whether the meanings of the two character strings are the same or not, and the misjudgment rate is reduced.

In the embodiment corresponding to fig. 4, on the basis of the basic word bank established by using the character strings in the policy information of the policy database, word vectors are constructed according to the word frequencies queried in the basic word bank, and the similarity is calculated by using the formula (2), so that the similarity calculation of the character strings is based on the basic word bank in the actual application scene, thereby having more pertinence, improving the matching accuracy, being capable of more accurately judging whether the meanings of the two character strings are the same, and reducing the misjudgment rate.

On the basis of the embodiments corresponding to fig. 1 to fig. 3, after step S6 confirms that the meanings of the first character string and the second character string are the same because their similarity is greater than the preset similarity threshold, the policy information may be further associated.

In the embodiment of the present invention, the first character string is an address character string in the first policy information, and the second character string is an address character string in the second policy information. For example, the address character string may be home address information or work unit address information or the like.

As shown in fig. 5, the character string comparison method further includes:

S7: Establish an association relationship between the policy corresponding to the first policy information in which the first character string is located and the policy corresponding to the second policy information in which the second character string is located.

Specifically, when it is confirmed in step S6 that the first character string and the second character string have the same meaning, it is confirmed that the policy corresponding to the policy information in which the first character string is located and the policy corresponding to the policy information in which the second character string is located are associated with each other, and an association relationship between the two policies is established. And, the higher the similarity between the first character string and the second character string, the higher the degree of association between the two policies.
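Step S7 can be sketched in Python as follows. The record type `PolicyAssociation`, the function `associate_policies`, and the policy identifiers are hypothetical. Reusing the similarity value directly as the degree of association is an assumption of this sketch; the text only states that a higher similarity means a higher degree of association.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PolicyAssociation:
    first_policy_id: str
    second_policy_id: str
    degree: float  # assumed here to equal the string similarity


def associate_policies(
    first_policy_id: str,
    second_policy_id: str,
    similarity: float,
    threshold: float = 0.9,
) -> Optional[PolicyAssociation]:
    """Return an association record only when the strings have the same meaning."""
    if similarity > threshold:
        return PolicyAssociation(first_policy_id, second_policy_id, similarity)
    return None


link = associate_policies("POLICY-001", "POLICY-002", 0.9933)
print(link)
```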

Associating the policies helps the relevant staff of the insurance company accurately mine the potential relationships between policies, and makes it easier for them to analyze the policies and discover possible insurance fraud risks.

It should be noted that, in the embodiment of the present invention, when comparing whether the meanings of the first character string and the second character string are the same, if a non-Chinese character string exists in the first character string or the second character string, the non-Chinese character string is identified first, and whether the meanings of the two character strings are the same is then compared according to the remaining Chinese character strings. Whether the meanings of the non-Chinese character strings are the same is therefore not determined. This is because, when policies are associated based on address character strings, comparing the Chinese character strings effectively reflects the similarity of the address character strings: when the Chinese character strings of two address character strings have the same meaning, the two policies can be considered associated even if the non-Chinese character strings differ, for example in the house number.

In other embodiments, when policy association needs to be performed according to an exact match of the two address character strings, after the meanings of the two address character strings are determined to be the same by comparing the Chinese character strings, the contents of the non-Chinese character strings in the two address character strings are further compared by direct character comparison. If they are the same, the meanings of the two address character strings are confirmed to be the same; if not, they are confirmed to be different.
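The two variants described above can be sketched in Python as follows. The helper names `split_chinese` and `addresses_match`, the `exact` flag, and the use of the CJK Unified Ideographs block U+4E00 to U+9FFF to detect Chinese characters are assumptions of this sketch. The `chinese_same_meaning` callable stands in for the similarity-based comparison of steps S3 to S6.

```python
import re
from typing import Callable

# Assumption: the basic CJK Unified Ideographs block covers the characters of interest.
_CHINESE = re.compile(r"[\u4e00-\u9fff]")


def split_chinese(text: str) -> tuple[str, str]:
    """Return (Chinese characters, everything else), each in original order."""
    chinese = "".join(_CHINESE.findall(text))
    non_chinese = _CHINESE.sub("", text)
    return chinese, non_chinese


def addresses_match(
    first: str,
    second: str,
    chinese_same_meaning: Callable[[str, str], bool],
    exact: bool = False,
) -> bool:
    """Compare the Chinese parts by meaning; optionally require equal non-Chinese parts."""
    first_cn, first_rest = split_chinese(first)
    second_cn, second_rest = split_chinese(second)
    if not chinese_same_meaning(first_cn, second_cn):
        return False
    if exact:
        # Direct character comparison of the non-Chinese content, e.g. house numbers.
        return first_rest == second_rest
    return True


# Stub comparison for the demonstration only: plain equality of the Chinese parts.
same = lambda a, b: a == b
print(addresses_match("深圳市罗湖区12号", "深圳市罗湖区15号", same))
print(addresses_match("深圳市罗湖区12号", "深圳市罗湖区15号", same, exact=True))
```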

In the embodiment corresponding to fig. 5, if the two address character strings are confirmed to have the same meaning, the policy corresponding to the first policy information is associated with the policy corresponding to the second policy information, and an association relationship between the two policies is established. This helps the relevant staff of the insurance company accurately mine the potential relationships between policies and analyze the policies to discover possible insurance fraud risks.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

Example 2

Fig. 6 shows a block diagram of a character string comparison device provided in an embodiment of the present invention, which corresponds to the character string comparison method in the above embodiment, and only shows the relevant parts in the embodiment of the present invention for convenience of description.

Referring to fig. 6, the character string comparison apparatus includes: the word stock establishing module 61, the obtaining module 62, the word segmentation module 63, the word frequency matching module 64, the similarity calculating module 65 and the comparing module 66, wherein the detailed description of each functional module is as follows:

a word bank establishing module 61, configured to establish a basic word bank according to the policy information in the policy database, where the basic word bank includes the Chinese words in the policy information, excluding English characters and Arabic numerals, and the word frequency of each Chinese word; the word frequency is calculated according to the number of occurrences of each Chinese word in the policy database;

an obtaining module 62, configured to obtain a first character string and a second character string to be matched;

a word segmentation module 63, configured to perform word segmentation on the first character string and the second character string respectively to obtain a first number of first words included in the first character string and a second number of second words included in the second character string;

a word frequency matching module 64 for matching the word frequency of each first word and the word frequency of each second word from the basic word stock;

a similarity calculation module 65, configured to calculate a similarity between the first character string and the second character string by using a cosine similarity calculation method according to the first number, the second number, the word frequency of each first word in the first character string, and the word frequency of each second word in the second character string;

and the comparison module 66 is configured to determine that the first character string and the second character string have the same meaning if the similarity is greater than a preset similarity threshold.

Further, the thesaurus establishing module 61 includes:

a word segmentation processing submodule 611, configured to perform word segmentation processing on the to-be-processed chinese character string in each policy information to obtain a chinese word;

a statistics submodule 612, configured to count the occurrence frequency of each chinese word and the total word number of the chinese word;

a word frequency calculating submodule 613, configured to calculate a word frequency of each chinese word according to the following formula:

where T_m is the word frequency of the m-th Chinese word, P_m is the number of occurrences of the m-th Chinese word, G is the total number of Chinese words, m ∈ [1, G], P_j is the number of occurrences of the j-th Chinese word, and the sum of P_j over all j ∈ [1, G] is the total number of occurrences of all the Chinese words;

and an association storage sub-module 614, configured to store each chinese word and the word frequency association of the chinese word in the basic word bank.
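The counting performed by sub-modules 612 to 614 can be sketched in Python as follows. The word-frequency formula itself is not reproduced in this text, so the ratio used below, the number of occurrences of a word divided by the total number of occurrences of all words, is purely an assumption for illustration; the patent may scale the value differently. The function name `build_basic_word_bank` and the sample data are hypothetical.

```python
from collections import Counter


def build_basic_word_bank(segmented_strings: list[list[str]]) -> dict[str, float]:
    """Map each Chinese word to a word frequency derived from its occurrence count."""
    # P_m: number of occurrences of each word across all policy strings.
    occurrences = Counter(word for words in segmented_strings for word in words)
    # Total number of occurrences of all words.
    total = sum(occurrences.values())
    # Assumed ratio P_m / total; the text does not confirm this scaling.
    return {word: count / total for word, count in occurrences.items()}


# Hypothetical, already segmented address strings from two policies.
bank = build_basic_word_bank([["深圳市", "罗湖区"], ["深圳市"]])
print(bank)
```

Storing the result in a dictionary keeps each word associated with its frequency, so that the later matching step is a single lookup and the frequency does not need to be recomputed for every comparison.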

Further, the participle processing sub-module 611 includes:

a segmentation unit 6111, configured to perform single-character segmentation on the Chinese character string to be processed to obtain n single characters a_i, where i ∈ [1, n] and n is the number of single characters contained in the Chinese character string;

a combination unit 6112, configured to, if i is less than n, combine the single character a_i with the adjacent single character a_{i+1} to obtain a temporary word a_i a_{i+1};

a loop search unit 6113, configured to, if the temporary word a_i a_{i+1} exists in a preset common word bank, combine it with the adjacent single character a_{i+2} to obtain a temporary word a_i a_{i+1} a_{i+2}, and continue searching the common word bank with this temporary word, until the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common word bank, where k ∈ [i+1, n];

a first recognition unit 6114, configured to, if the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common word bank, recognize a_i a_{i+1} a_{i+2} ... a_{k-1} as a valid word and start again from the single character a_k: if k is equal to n, recognize a_k as a valid word; if k is less than n, take a_k as a_i and return to the combination unit 6112 to continue execution;

a second recognition unit 6115, configured to, if the temporary word a_i a_{i+1} a_{i+2} ... a_k exists in the common word bank and k is equal to n, recognize a_i a_{i+1} a_{i+2} ... a_k as a valid word;

a result determination unit 6116, configured to determine the recognized valid words as the Chinese words in the Chinese character string.

Further, the similarity calculation module 65 includes:

a word vector constructing sub-module 651, configured to construct a first word vector of the first character string and a second word vector of the second character string according to the first number, the second number, the word frequency of each first word in the first character string, and the word frequency of each second word in the second character string as follows:

Q1[d+n+m] = {T(W₁), T(W₂), …, T(W_d), T(W_{d+1}), T(W_{d+2}), …, T(W_{d+n}), Z_{n+1}, Z_{n+2}, …, Z_{n+m}}

Q2[d+n+m] = {T(W₁), T(W₂), …, T(W_d), Z₁, Z₂, …, Z_n, T(W_{n+1}), T(W_{n+2}), …, T(W_{n+m})}

where Q1[d+n+m] is the first word vector and Q2[d+n+m] is the second word vector; W₁, W₂, …, W_d, W_{d+1}, W_{d+2}, …, W_{d+n} are the first words obtained by performing word segmentation on the first character string; W₁, W₂, …, W_d, W_{n+1}, W_{n+2}, …, W_{n+m} are the second words obtained by performing word segmentation on the second character string; d+n is the first number and d+m is the second number; T(W_s) is the word frequency of W_s, s ∈ [1, n+m]; and Z_b has the value 0, b ∈ [1, n+m];

a formula calculation sub-module 652, configured to calculate the similarity ε between the first character string and the second character string according to the following formula:

$$\varepsilon = \frac{\sum_{c=1}^{d+n+m} Q1[c]\, Q2[c]}{\sqrt{\sum_{c=1}^{d+n+m} Q1[c]^{2}}\;\sqrt{\sum_{c=1}^{d+n+m} Q2[c]^{2}}}$$

Further, the first character string is an address character string in the first policy information, and the second character string is an address character string in the second policy information, and the character string comparison device further includes:

and the association module 67 is configured to establish an association relationship between the policy corresponding to the first policy information and the policy corresponding to the second policy information.

The process of implementing each function by each module in the character string comparison device provided in this embodiment may specifically refer to the description of embodiment 1, and is not described herein again.
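As a rough illustration of how modules 62 to 66 might be wired together, the following Python sketch composes segmentation, word frequency matching, similarity calculation and comparison in one class. The class name `StringComparator`, the injected `segment` callable, the "|"-separated demonstration strings and the default of 0 for words missing from the word bank are all assumptions of this sketch. The vector elements are ordered alphabetically rather than in the layout of step S51, which does not change the cosine value as long as both vectors use the same order.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class StringComparator:
    segment: Callable[[str], list[str]]  # stands in for word segmentation module 63
    word_freq: dict[str, float]          # basic word bank produced by module 61
    threshold: float = 0.9               # preset similarity threshold (assumed value)

    def similarity(self, first: str, second: str) -> float:
        first_words = set(self.segment(first))
        second_words = set(self.segment(second))
        vocabulary = sorted(first_words | second_words)
        # Word frequency matching (module 64); unknown words default to 0 (assumption).
        q1 = [self.word_freq.get(w, 0) if w in first_words else 0 for w in vocabulary]
        q2 = [self.word_freq.get(w, 0) if w in second_words else 0 for w in vocabulary]
        # Cosine similarity of the aligned word vectors (module 65).
        dot = sum(a * b for a, b in zip(q1, q2))
        norm = math.sqrt(sum(a * a for a in q1)) * math.sqrt(sum(b * b for b in q2))
        return dot / norm if norm else 0.0

    def same_meaning(self, first: str, second: str) -> bool:
        # Comparison against the threshold (module 66).
        return self.similarity(first, second) > self.threshold


comparator = StringComparator(
    segment=lambda s: s.split("|"),  # toy segmenter for the demonstration only
    word_freq={
        "Guangdong Province": 4,
        "Shenzhen City": 6,
        "Luohu District": 24,
        "Zhongmin Times Square": 24,
    },
)
first = "Guangdong Province|Shenzhen City|Luohu District|Zhongmin Times Square"
second = "Shenzhen City|Luohu District|Zhongmin Times Square"
print(comparator.same_meaning(first, second))
```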

Example 3

This embodiment provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method for comparing character strings in embodiment 1 is implemented, and details are not repeated here to avoid repetition. Alternatively, the computer program, when executed by the processor, implements the functions of each module/unit in the character string comparison apparatus in embodiment 2, and is not described herein again to avoid redundancy.

Example 4

Fig. 7 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 7, the terminal device 70 of this embodiment includes: a processor 71, a memory 72, and a computer program 73, such as a string comparison program, stored in the memory 72 and operable on the processor 71. The processor 71, when executing the computer program 73, implements the steps in the respective character string comparison method embodiments described above, such as the steps S1 to S6 shown in fig. 1. Alternatively, the processor 71, when executing the computer program 73, implements the functions of the respective modules/units in the respective embodiments of the character string comparison apparatus described above, such as the functions of the modules 61 to 66 shown in fig. 6.

Illustratively, the computer program 73 may be divided into one or more modules/units, which are stored in the memory 72 and executed by the processor 71 to carry out the invention. One or more of the modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 73 in the terminal device 70. For example, the computer program 73 may be divided into a word stock establishing module, an obtaining module, a word segmentation module, a word frequency matching module, a similarity calculation module, and a comparison module, and each of the functional modules is described in detail as follows:

the similarity calculation module is used for calculating the similarity between the first character string and the second character string by using a cosine similarity calculation method according to the first number, the second number, the word frequency of each first word in the first character string and the word frequency of each second word in the second character string;

and the comparison module is used for confirming that the meanings of the first character string and the second character string are the same if the similarity is greater than a preset similarity threshold.

Further, the word stock establishing module comprises:

the word segmentation processing submodule is used for carrying out word segmentation processing on the Chinese character strings to be processed in each policy information to obtain Chinese words;

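the statistics submodule is used for counting the number of occurrences of each Chinese word and the total number of Chinese words;

the word frequency calculation submodule is used for calculating the word frequency of each Chinese word according to the number of occurrences of the Chinese word and the total number of occurrences of all the Chinese words;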

and the association storage submodule is used for storing each Chinese word and the word frequency association of the Chinese word in a basic word bank.

Further, the word segmentation processing submodule comprises:

a segmentation unit, used for performing single-character segmentation on the Chinese character string to be processed to obtain n single characters a_i, where i ∈ [1, n] and n is the number of single characters contained in the Chinese character string;

a combination unit, used for, if i is less than n, combining the single character a_i with the adjacent single character a_{i+1} to obtain a temporary word a_i a_{i+1};

a loop search unit, used for, if the temporary word a_i a_{i+1} exists in a preset common word bank, combining it with the adjacent single character a_{i+2} to obtain a temporary word a_i a_{i+1} a_{i+2}, and continuing to search the common word bank with this temporary word, until the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common word bank, where k ∈ [i+1, n];

a first recognition unit, used for, if the temporary word a_i a_{i+1} a_{i+2} ... a_k does not exist in the common word bank, recognizing a_i a_{i+1} a_{i+2} ... a_{k-1} as a valid word and starting again from the single character a_k: if k is equal to n, recognizing a_k as a valid word; if k is less than n, taking a_k as a_i and returning to the combination unit to continue execution;

a second recognition unit, used for, if the temporary word a_i a_{i+1} a_{i+2} ... a_k exists in the common word bank and k is equal to n, recognizing a_i a_{i+1} a_{i+2} ... a_k as a valid word;

and a result determination unit, used for determining the recognized valid words as the Chinese words in the Chinese character string.

Further, the similarity calculation module includes:

the word vector construction submodule is used for constructing a first word vector of the first character string and a second word vector of the second character string according to the first number, the second number, the word frequency of each first word in the first character string and the word frequency of each second word in the second character string in the following modes:

The formula calculation submodule is used for calculating the similarity between the first character string and the second character string according to the following formula:

Further, the first character string is an address character string in the first policy information, and the second character string is an address character string in the second policy information, the computer program 73 may be further divided into:

and the association module is used for establishing the association relationship between the policy corresponding to the first policy information and the policy corresponding to the second policy information.

The terminal device 70 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal device 70 may include, but is not limited to, the processor 71 and the memory 72. Those skilled in the art will appreciate that fig. 7 is merely an example of the terminal device 70 and does not constitute a limitation on it; the terminal device 70 may include more or fewer components than shown, combine some components, or use different components. For example, the terminal device 70 may also include input/output devices, network access devices, buses, and the like.

The processor 71 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.

The memory 72 may be an internal storage unit of the terminal device 70, such as a hard disk or an internal memory of the terminal device 70. The memory 72 may also be an external storage device of the terminal device 70, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card provided on the terminal device 70. Further, the memory 72 may include both an internal storage unit and an external storage device of the terminal device 70. The memory 72 is used to store the computer program and other programs and data required by the terminal device 70. The memory 72 may also be used to temporarily store data that has been output or is to be output.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods in the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims

1. A character string comparison method, comprising:

acquiring a first character string and a second character string to be matched;

wherein T_m is the m-th word frequency, P_m is the number of occurrences of the m-th Chinese word, G is the total number of Chinese words, m ∈ [1, G], P_j is the number of occurrences of the j-th Chinese word, and Σ_{j=1}^{G} P_j is the total number of occurrences of all the Chinese words;
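As an illustration of how such word frequencies might be precomputed for the basic word bank, the following Python sketch assumes that T_m is the ratio of P_m to the total number of occurrences of all Chinese words; the formula itself does not appear in this text, so this ratio is an assumption drawn from the symbol definitions above. The function name `build_word_frequencies` and its input format (one list of segmented words per policy string) are hypothetical.

```python
from collections import Counter


def build_word_frequencies(segmented_strings):
    """Return {word: T_m}, assuming T_m = P_m / sum of all P_j.

    segmented_strings: iterable of word lists, one per segmented string
    (hypothetical input format).
    """
    # P_m: number of occurrences of each Chinese word across all strings.
    counts = Counter(word for words in segmented_strings for word in words)
    total = sum(counts.values())      # total occurrences of all words
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# Example with made-up segmented address strings.
bank = build_word_frequencies([["上海市", "浦东"], ["上海市", "路"]])
print(bank)
```

Because the frequencies are stored once in the word bank, a later comparison only needs a lookup per word rather than a recount.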

2. The character string comparison method according to claim 1, wherein the performing word segmentation processing on the Chinese character string to be processed in each piece of policy information to obtain the Chinese words comprises:

carrying out single-character segmentation on the Chinese character string to obtain n single characters a_i, wherein i ∈ [1, n] and n is the number of Chinese characters contained in the Chinese character string;

if i is less than n, combining a_i with the adjacent single character a_{i+1} to obtain a temporary word a_i a_{i+1};

if the temporary word a_i a_{i+1} exists in a preset common word bank, combining the temporary word a_i a_{i+1} with the adjacent single character a_{i+2} to obtain a temporary word a_i a_{i+1} a_{i+2}, and continuing to search the common word bank with the temporary word a_i a_{i+1} a_{i+2}, until a temporary word a_i a_{i+1} a_{i+2} … a_k does not exist in the common word bank, wherein k ∈ [i+1, n];

if the temporary word a_i a_{i+1} a_{i+2} … a_k does not exist in the common word bank, recognizing a_i a_{i+1} a_{i+2} … a_{k-1} as a valid word and restarting from the single character a_k: if k is equal to n, recognizing a_k as a valid word; if k is less than n, taking a_k as a_i and continuing to execute the step of, if i is less than n, combining a_i with the adjacent single character a_{i+1} to obtain a temporary word a_i a_{i+1};

if the temporary word a_i a_{i+1} a_{i+2} … a_k exists in the common word bank and k is equal to n, recognizing a_i a_{i+1} a_{i+2} … a_k as a valid word;

and determining the recognized valid words as the Chinese words.

3. The character string comparison method according to any one of claims 1 to 2, wherein said calculating the similarity between said first character string and said second character string using a cosine similarity algorithm based on said first number, said second number, the word frequency of each of said first words in said first character string, and the word frequency of each of said second words in said second character string comprises:

constructing a first word vector of the first character string and a second word vector of the second character string according to the first number, the second number, the word frequency of each first word in the first character string and the word frequency of each second word in the second character string as follows:

Q1[d+n+m]＝{T(W₁),T(W₂),…,T(W_d),T(W_{d+1}),T(W_{d+2}),…,T(W_{d+n}),Z_{n+1},Z_{n+2},…,Z_{n+m}}

Q2[d+n+m]＝{T(W₁),T(W₂),…,T(W_d),Z₁,Z₂,…,Z_n,T(W_{n+1}),T(W_{n+2}),…,T(W_{n+m})}

wherein Q1[d+n+m] is the first word vector, Q2[d+n+m] is the second word vector, W₁, W₂, …, W_d, W_{d+1}, W_{d+2}, …, W_{d+n} are the first words obtained after the word segmentation processing is carried out on the first character string, W₁, W₂, …, W_d, W_{n+1}, W_{n+2}, …, W_{n+m} are the second words obtained after the word segmentation processing is carried out on the second character string, d+n is the first number, d+m is the second number, T(W_s) is the word frequency of W_s, s ∈ [1, n+m], Z_b has a value of 0, and b ∈ [1, n+m];

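The similarity is calculated according to the following formula:

$$\varepsilon = \frac{\sum_{c=1}^{d+n+m} Q1[c]\,Q2[c]}{\sqrt{\sum_{c=1}^{d+n+m} Q1[c]^{2}}\;\sqrt{\sum_{c=1}^{d+n+m} Q2[c]^{2}}}$$

wherein ε is the similarity; Q1[c] is the word frequency of the c-th first word in the first character string; Q2[c] is the word frequency of the c-th second word in the second character string.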
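The construction of the two word vectors and the similarity computation can be sketched in Python as follows. This is a minimal sketch under assumptions, not the claimed implementation: the standard cosine similarity (dot product divided by the product of the Euclidean norms) is assumed, repeated words within one string are collapsed to a single entry, and a word missing from the word bank is given frequency 0. The names `word_vectors`, `cosine_similarity` and `freq` are hypothetical.

```python
import math


def word_vectors(first_words, second_words, freq):
    """Build aligned vectors Q1 and Q2: shared words first, then words only
    in the first string, then words only in the second string."""
    first_set, second_set = set(first_words), set(second_words)
    # dict.fromkeys keeps first-seen order while dropping repeats (assumption).
    shared = [w for w in dict.fromkeys(first_words) if w in second_set]
    only_first = [w for w in dict.fromkeys(first_words) if w not in second_set]
    only_second = [w for w in dict.fromkeys(second_words) if w not in first_set]

    def t(word):
        # Word frequency looked up from the basic word bank; 0.0 if absent (assumption).
        return freq.get(word, 0.0)

    q1 = [t(w) for w in shared] + [t(w) for w in only_first] + [0.0] * len(only_second)
    q2 = [t(w) for w in shared] + [0.0] * len(only_first) + [t(w) for w in only_second]
    return q1, q2


def cosine_similarity(q1, q2):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(q1, q2))
    norm = math.sqrt(sum(a * a for a in q1)) * math.sqrt(sum(b * b for b in q2))
    return dot / norm if norm else 0.0


# Example with made-up frequencies.
freq = {"上海市": 0.5, "浦东": 0.25, "路": 0.25}
q1, q2 = word_vectors(["上海市", "浦东"], ["上海市", "路"], freq)
print(q1, q2, cosine_similarity(q1, q2))
```

Padding the positions of words that occur in only one string with zeros gives both vectors the same length d+n+m, so they can be compared element by element.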

4. The character string comparison method according to any one of claims 1 to 2, wherein the first character string is an address character string in first policy information, the second character string is an address character string in second policy information, and after confirming that the first character string and the second character string have the same meaning if the similarity is greater than a preset similarity threshold, the character string comparison method further comprises:

and establishing an association relationship between the policy corresponding to the first policy information and the policy corresponding to the second policy information.

5. A character string comparison device, comprising:

the word stock establishing module comprises:

Σ_{j=1}^{G} P_j is the total number of occurrences of all the Chinese words;

6. The character string comparison device according to claim 5, wherein the word segmentation processing submodule comprises:

a segmentation unit, configured to perform single-character segmentation on the Chinese character string to obtain n single characters a_i, wherein i ∈ [1, n] and n is the number of Chinese characters contained in the Chinese character string;

a cyclic search unit, configured to: if the temporary word a_i a_{i+1} exists in a preset common word bank, combine the temporary word a_i a_{i+1} with the adjacent single character a_{i+2} to obtain a temporary word a_i a_{i+1} a_{i+2}, and continue to search the common word bank with the temporary word a_i a_{i+1} a_{i+2}, until a temporary word a_i a_{i+1} a_{i+2} … a_k does not exist in the common word bank, wherein k ∈ [i+1, n];

a first recognition unit, configured to: if the temporary word a_i a_{i+1} a_{i+2} … a_k does not exist in the common word bank, recognize a_i a_{i+1} a_{i+2} … a_{k-1} as a valid word and restart from the single character a_k; if k is equal to n, recognize a_k as a valid word; if k is less than n, take a_k as a_i and return to the combination unit to continue execution;

a second recognition unit, configured to: if the temporary word a_i a_{i+1} a_{i+2} … a_k exists in the common word bank and k is equal to n, recognize a_i a_{i+1} a_{i+2} … a_k as a valid word;

and a result determining unit, configured to determine the recognized valid words as the Chinese words.

7. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the character string comparison method according to any one of claims 1 to 4 when executing the computer program.

8. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the string comparison method according to any one of claims 1 to 4.