CN108363729B - Character string comparison method and device, terminal equipment and storage medium - Google Patents


Info

Publication number
CN108363729B
Authority
CN
China
Prior art keywords
word
character string
chinese
frequency
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810030724.9A
Other languages
Chinese (zh)
Other versions
CN108363729A (en
Inventor
刘行行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN201810030724.9A
Publication of CN108363729A
Application granted
Publication of CN108363729B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/90335: Query processing
    • G06F16/90344: Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a character string comparison method, a device, a terminal device and a storage medium, wherein the method comprises the following steps: establishing a basic word bank according to policy information in a policy database; acquiring a first character string and a second character string to be matched; performing word segmentation processing on the first character string and the second character string respectively to obtain a first number of first words and a second number of second words; matching the word frequency of each first word and the word frequency of each second word from the basic word bank; calculating the similarity between the first character string and the second character string by using a cosine similarity algorithm according to the first number, the second number, the word frequency of each first word and the word frequency of each second word; and if the similarity is greater than a preset similarity threshold value, confirming that the meanings of the first character string and the second character string are the same. The technical scheme of the invention can effectively improve the accuracy of the character string comparison result; moreover, because the word frequency does not need to be repeatedly calculated, the execution efficiency can also be effectively improved.

Description

Character string comparison method and device, terminal equipment and storage medium
Technical Field
The invention relates to the technical field of computers, and in particular to a method for comparing the similarity of character strings.
Background
At present, character string similarity is commonly compared by using the traditional TF-IDF and cosine similarity calculation methods, applied directly to the two character strings or texts to be matched. When TF-IDF is used, the word frequency is calculated only within the range of those two character strings or texts, and their actual application scenario is not considered. Because the same character string often carries different meanings in different application scenarios, a calculation confined to the two character strings or texts cannot yield an accurate comparison result, so the accuracy of the calculation result is low.
Meanwhile, the traditional TF-IDF calculation process is complex, and its execution efficiency is low when the data volume of the character strings to be matched is large.
Disclosure of Invention
The embodiment of the invention provides a character string comparison method, a character string comparison device, terminal equipment and a storage medium, and aims to solve the problem that in the prior art, the efficiency and the accuracy of character string comparison are not high.
In a first aspect, an embodiment of the present invention provides a character string comparison method, including:
establishing a basic word bank according to policy information in a policy database, wherein the basic word bank comprises the Chinese words in the policy information, excluding English characters and Arabic numerals, and the word frequency of each Chinese word; the word frequency is calculated according to the number of occurrences of each Chinese word in the policy database;
acquiring a first character string and a second character string to be matched;
performing word segmentation processing on the first character string and the second character string respectively to obtain a first number of first words contained in the first character string and a second number of second words contained in the second character string;
matching the word frequency of each first word and the word frequency of each second word from the basic word bank;
calculating the similarity between the first character string and the second character string by using a cosine similarity algorithm according to the first number, the second number, the word frequency of each first word in the first character string and the word frequency of each second word in the second character string;
if the similarity is larger than a preset similarity threshold value, confirming that the meanings of the first character string and the second character string are the same;
the establishing of the basic word bank according to the policy information in the policy database comprises the following steps:
performing word segmentation processing on the Chinese character string to be processed in each policy information to obtain the Chinese word;
counting the occurrence frequency of each Chinese word and the total word number of the Chinese words;
calculating the word frequency of each Chinese word according to the following formula:
$$T_m = \frac{\sum_{j=1}^{G} P_j}{P_m}$$
wherein $T_m$ is the word frequency of the m-th Chinese word, $P_m$ is the number of occurrences of the m-th Chinese word, $G$ is the total word number of the Chinese words, $m \in [1, G]$, $P_j$ is the number of occurrences of the j-th Chinese word, and $\sum_{j=1}^{G} P_j$ is the total number of occurrences of all the Chinese words;
and storing each Chinese word in association with its word frequency in the basic word bank.
In a second aspect, an embodiment of the present invention provides a character string comparison apparatus, including:
the word bank establishing module is used for establishing a basic word bank according to policy information in a policy database, wherein the basic word bank comprises the Chinese words in the policy information, excluding English characters and Arabic numerals, and the word frequency of each Chinese word; the word frequency is calculated according to the number of occurrences of each Chinese word in the policy database;
the acquisition module is used for acquiring a first character string and a second character string to be matched;
the word segmentation module is used for performing word segmentation processing on the first character string and the second character string respectively to obtain a first number of first words contained in the first character string and a second number of second words contained in the second character string;
the word frequency matching module is used for matching the word frequency of each first word and the word frequency of each second word from the basic word bank;
a similarity calculation module, configured to calculate a similarity between the first character string and the second character string by using a cosine similarity calculation method according to the first number, the second number, the word frequency of each first word in the first character string, and the word frequency of each second word in the second character string;
the comparison module is used for confirming that the meanings of the first character string and the second character string are the same if the similarity is larger than a preset similarity threshold;
the word bank establishing module comprises:
the word segmentation processing submodule is used for carrying out word segmentation processing on the Chinese character strings to be processed in each policy information to obtain the Chinese words;
the statistic submodule is used for counting the occurrence frequency of each Chinese word and the total word number of the Chinese words;
the word frequency calculation submodule is used for calculating the word frequency of each Chinese word according to the following formula:
$$T_m = \frac{\sum_{j=1}^{G} P_j}{P_m}$$
wherein $T_m$ is the word frequency of the m-th Chinese word, $P_m$ is the number of occurrences of the m-th Chinese word, $G$ is the total word number of the Chinese words, $m \in [1, G]$, $P_j$ is the number of occurrences of the j-th Chinese word, and $\sum_{j=1}^{G} P_j$ is the total number of occurrences of all the Chinese words;
and the association storage submodule is used for storing each Chinese word in association with its word frequency in the basic word bank.
In a third aspect, an embodiment of the present invention provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the character string comparison method when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of the character string comparison method.
Compared with the prior art, the embodiments of the invention have the following advantages. In the character string comparison method, device, terminal device and storage medium provided by the embodiments of the invention, a basic word bank is established according to the policy information in the policy database; the basic word bank comprises the Chinese words in the policy information, excluding English characters and Arabic numerals, and the word frequency of each Chinese word, the word frequency being calculated from the number of occurrences of each Chinese word in the policy database. When the first character string and the second character string to be matched are compared, the word frequencies of their words are obtained from the basic word bank. Word frequencies obtained in this way are more targeted and therefore more accurate, so the accuracy of the judgment result can be effectively improved when cosine similarity is used to judge whether the meanings of the first character string and the second character string are the same. Meanwhile, once the basic word bank has been established, the word frequency can be obtained directly from it for each character string comparison and does not need to be repeatedly calculated, so the execution efficiency can be effectively improved when the data volume of the character strings to be matched is large.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
Fig. 1 is a flowchart of an implementation of a character string comparison method provided in embodiment 1 of the present invention;
Fig. 2 is a flowchart of an implementation of step S1 in the character string comparison method provided in embodiment 1 of the present invention;
Fig. 3 is a flowchart of an implementation of step S11 in the character string comparison method provided in embodiment 1 of the present invention;
Fig. 4 is a flowchart of an implementation of step S5 in the character string comparison method provided in embodiment 1 of the present invention;
Fig. 5 is a flowchart of an implementation of associating policy information in the character string comparison method provided in embodiment 1 of the present invention;
Fig. 6 is a schematic diagram of a character string comparison apparatus provided in embodiment 2 of the present invention;
Fig. 7 is a schematic diagram of a terminal device provided in embodiment 4 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
Referring to fig. 1, fig. 1 shows an implementation flow of a character string comparison method according to an embodiment of the present invention. The character string comparison method can be applied to matching analysis of insurance policy information in the insurance industry. The details are as follows:
S1: Establishing a basic word bank according to the policy information in the policy database, wherein the basic word bank comprises the Chinese words in the policy information, excluding English characters and Arabic numerals, and the word frequency of each Chinese word. The word frequency is calculated according to the number of occurrences of each Chinese word in the policy database.
In the embodiment of the invention, the policy database is a database which is established in advance by an insurance company and is used for storing policy information of insurance products purchased by a user.
The policy information in the policy database is analyzed to identify keywords in the policy information, and the identified keywords are used to establish the basic word bank.
Specifically, the content of each attribute in the policy information is subjected to word segmentation, the attributes comprise user personal identity attributes, insurance product attributes and the like, each word obtained after word segmentation is used as a keyword, and the word frequency of the keyword is calculated according to the occurrence frequency of each keyword in the policy database.
It should be noted that the purpose of establishing the basic word bank is to provide a basis for comparing character strings. Because the meanings of Chinese words are complex, and their grammar and structure differ from those of English words, Chinese words require a completely different comparison method. The basic word bank therefore contains only Chinese words and does not contain other characters such as English characters or Arabic numerals.
Further, the basic word bank may be updated periodically or in real time. For example, this step may be executed once every preset time interval; since new policy information is continuously stored in the policy database, each execution establishes the basic word bank from the latest policy database, which updates the basic word bank. Alternatively, this step may be executed when an update of the policy database is detected. Because an insurance company continuously generates new policies, the policy database is updated frequently, and executing this step on every update would reduce system performance. The step can therefore be executed only when the number of updates reaches a preset count threshold, which avoids the performance impact of executing it too frequently.
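As a non-limiting illustration, the count-threshold update described above might be sketched as follows in Python. The class name `WordBankRefresher`, the `rebuild` callback and the counter fields are assumptions introduced for this sketch; the embodiment does not prescribe any particular structure.

```python
class WordBankRefresher:
    """Rebuild the basic word bank only after enough policy-database updates."""

    def __init__(self, rebuild, update_threshold):
        self.rebuild = rebuild                    # hypothetical callable that re-runs step S1
        self.update_threshold = update_threshold  # preset count threshold
        self.pending_updates = 0                  # updates seen since the last rebuild

    def on_policy_update(self):
        """Call once for every detected update of the policy database."""
        self.pending_updates += 1
        if self.pending_updates >= self.update_threshold:
            self.rebuild()
            self.pending_updates = 0


# Usage: with a threshold of 3, seven updates trigger two rebuilds.
rebuilds = []
refresher = WordBankRefresher(lambda: rebuilds.append("rebuilt"), update_threshold=3)
for _ in range(7):
    refresher.on_policy_update()
```

A periodic update could be obtained the same way by calling `rebuild` from a timer instead of from the update counter.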
S2: Acquiring a first character string and a second character string to be matched.
In the embodiment of the present invention, based on the basic word bank established in step S1, two character strings may be compared to determine whether their meanings are the same. If more than two character strings need to be compared, they can be compared pairwise, each character string being compared with every other character string, and whether their meanings are the same is judged from the pairwise results. Specifically, the first character string and the second character string to be matched may be address character strings in the policy information.
S3: Performing word segmentation processing on the first character string and the second character string respectively to obtain a first number of first words contained in the first character string and a second number of second words contained in the second character string.
Specifically, the method of performing word segmentation on the first character string and the second character string to be matched may be the same as the method of performing word segmentation on the content of each attribute in the policy information when the basic word bank is established in step S1.
It should be noted that the first words and the second words are all Chinese words. If other characters, such as English characters or Arabic numerals, exist in the first character string or the second character string, the non-Chinese character strings are identified first during word segmentation, and word segmentation processing is then performed on each of the Chinese character strings that they separate.
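The separation of Chinese and non-Chinese character strings described above might, for illustration only, be sketched with a regular expression. The character range used below (the basic CJK Unified Ideographs block) and the function names are assumptions of this sketch; the embodiment does not state how non-Chinese characters are recognized.

```python
import re

# Assumption: "Chinese character" means the basic CJK Unified Ideographs block.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def chinese_runs(text):
    """Return the maximal runs of Chinese characters, in order of appearance."""
    return CHINESE_RUN.findall(text)


def non_chinese_runs(text):
    """Return the non-Chinese runs that separate the Chinese runs."""
    return [run for run in CHINESE_RUN.split(text) if run]


# Usage: each Chinese run would then be passed to word segmentation separately.
sample = "福田区A座12号楼"
chinese_parts = chinese_runs(sample)
other_parts = non_chinese_runs(sample)
```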
S4: Matching the word frequency of each first word and the word frequency of each second word from the basic word bank.
Specifically, the word frequency of each first word and of each second word is looked up in the basic word bank for the first number of first words and the second number of second words obtained in step S3.
It should be noted that if the word frequency of a first word or a second word cannot be matched in the basic word bank, that word is not stored in the basic word bank. In this case, the basic word bank needs to be updated for the first character string or the second character string according to the method for establishing the basic word bank described in step S1, and the word frequency of that word is then matched in the updated basic word bank.
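A minimal sketch of this lookup with its fallback is given below, assuming the basic word bank is held as a mapping from word to word frequency. The function name and the `update_word_bank` hook are hypothetical; how the update recomputes the frequencies is left to the procedure of step S1.

```python
def match_frequencies(words, word_bank, update_word_bank):
    """Return the stored word frequency of each word, updating the bank on a miss."""
    if any(word not in word_bank for word in words):
        # Assumed hook: re-run the step-S1 procedure so the bank covers these words.
        update_word_bank(word_bank, words)
    return [word_bank[word] for word in words]
```

Passing the update as a callback keeps the lookup independent of how the basic word bank is stored or rebuilt.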
S5: Calculating the similarity between the first character string and the second character string by using a cosine similarity algorithm according to the first number, the second number, the word frequency of each first word in the first character string and the word frequency of each second word in the second character string.
Specifically, according to the word frequency of each first word and the word frequency of each second word obtained in step S4, the word frequencies of the first words are formed into a word vector of the first character string, and the word frequencies of the second words are formed into a word vector of the second character string.
The similarity between the first character string and the second character string is then calculated by using the cosine similarity algorithm according to the word vector of the first character string and the word vector of the second character string.
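For illustration, a cosine similarity over word-frequency vectors might be sketched as follows. How the two word vectors are aligned is not specified in this passage, so the sketch assumes one component per word in the union of both word sets, with 0 for a word that a string does not contain; this alignment and all function names are assumptions, not the embodiment's stated method.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def string_similarity(first_words, second_words, word_bank):
    """Similarity of two segmented strings using frequencies from the word bank."""
    first_set, second_set = set(first_words), set(second_words)
    vocabulary = sorted(first_set | second_set)  # assumed alignment of components
    u = [word_bank[w] if w in first_set else 0.0 for w in vocabulary]
    v = [word_bank[w] if w in second_set else 0.0 for w in vocabulary]
    return cosine_similarity(u, v)


# Usage with illustrative frequencies.
bank = {"广东省": 4.0, "深圳市": 6.0, "福田区": 8.0, "平安大厦": 24.0, "腾讯大厦": 24.0}
a = ["广东省", "深圳市", "福田区", "平安大厦"]
b = ["广东省", "深圳市", "福田区", "腾讯大厦"]
similarity = string_similarity(a, b, bank)
```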
S6: If the similarity between the first character string and the second character string is greater than a preset similarity threshold value, confirming that the meanings of the first character string and the second character string are the same.
Specifically, when the similarity calculated in step S5 is greater than the preset similarity threshold, it is determined that the meanings of the first character string and the second character string are the same; otherwise, if the similarity is less than or equal to the similarity threshold, it is determined that the meanings of the first character string and the second character string are different.
The similarity calculated by the cosine similarity algorithm lies in the range [0, 1]. The preset similarity threshold may therefore be set to 0.9, for example, but is not limited thereto; it may be set according to the application requirements.
It is understood that in the embodiment of the present invention the meanings of the first character string and the second character string are confirmed to be the same when the similarity is greater than the preset similarity threshold; in other embodiments, they may be confirmed to be the same when the similarity is greater than or equal to the preset similarity threshold.
It should be noted that the character string comparison method provided in the embodiment of the present invention compares the Chinese character strings within the first character string and the second character string. When the first character string or the second character string includes other characters, those characters may be compared directly, and whether the meanings of the two character strings are the same is then determined comprehensively from that comparison result and the similarity. For example, if the calculated similarity is greater than the preset similarity threshold but the Arabic numerals in the first character string and the second character string differ, the meanings of the two character strings can be considered different.
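The combined decision might be sketched as below. Comparing only the runs of Arabic digits is an assumption chosen to mirror the example in the text; the function name and the default threshold of 0.9 are likewise illustrative.

```python
import re


def same_meaning(similarity, first, second, threshold=0.9):
    """Combine the similarity threshold with a direct comparison of digit runs."""
    if similarity <= threshold:
        return False
    # Assumed direct comparison: the sequences of Arabic-digit runs must be equal.
    return re.findall(r"[0-9]+", first) == re.findall(r"[0-9]+", second)
```

With this sketch, two addresses whose Chinese parts are highly similar but whose building numbers differ are treated as having different meanings.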
In the embodiment corresponding to Fig. 1, the word frequency of each word in the basic word bank is calculated in advance, and when character strings to be matched are compared, only the corresponding word frequencies need to be obtained from the basic word bank for similarity calculation. The similarity of the character strings is thus calculated on the basis of a word bank drawn from the actual application scenario, which makes the calculation more targeted and improves the matching accuracy; meanwhile, the word frequency does not need to be recalculated for each comparison.
Next, based on the embodiment shown in Fig. 1, a specific implementation of establishing the basic word bank according to the policy information in the policy database, as mentioned in step S1, is described in detail below through a specific embodiment.
Referring to fig. 2, fig. 2 shows a specific implementation flow of step S1 provided in the embodiment of the present invention, which is detailed as follows:
S11: Performing word segmentation processing on the Chinese character strings to be processed in each piece of policy information to obtain Chinese words.
In the embodiment of the present invention, the character strings to be processed in the policy information, such as a home address character string, may include other characters, such as English characters and Arabic numerals, in addition to Chinese characters. The non-Chinese character strings in the policy information are therefore identified first, and word segmentation processing is then performed on each of the Chinese character strings that they separate.
S12: Counting the number of occurrences of each Chinese word and the total word number of the Chinese words.
Specifically, the number of occurrences in the policy database of each Chinese word obtained in step S11, and the total word number of the Chinese words obtained by the word segmentation processing in step S11, are counted.
The total word number of the Chinese words means the number of distinct Chinese words obtained by the word segmentation processing.
S13: Calculating the word frequency of each Chinese word according to formula (1):
$$T_m = \frac{\sum_{j=1}^{G} P_j}{P_m} \qquad (1)$$
wherein $T_m$ is the word frequency of the m-th Chinese word, $P_m$ is the number of occurrences of the m-th Chinese word, $G$ is the total word number of the Chinese words, $m \in [1, G]$, $P_j$ is the number of occurrences of the j-th Chinese word, and $\sum_{j=1}^{G} P_j$ is the total number of occurrences of all the Chinese words.
Specifically, after the number of occurrences of each Chinese word and the total word number of the Chinese words are counted in step S12, the word frequency of each Chinese word is calculated according to formula (1).
S14: Storing each Chinese word in association with its word frequency in the basic word bank.
Specifically, the word frequency of each Chinese word obtained in step S13 is stored in association with that Chinese word in the basic word bank, so that the word frequency of a target word can be obtained directly from the basic word bank when character strings are compared.
In order to better understand the technical solution of the embodiment of the present invention, the process of establishing the basic word bank is described below through a specific example, detailed as follows:
Assume that the policy information in the policy database contains the following address character strings: "Guangdong Province Shenzhen City Luohu District Zhongmin Times Square", "Guangdong Province Shenzhen City Futian District Ping An Building", "Guangdong Province Shenzhen City Futian District Tencent Building", "Guangdong Province Shenzhen City Futian District Zhongxing Building", "Guangdong Province Guangzhou City Tianhe District Zhengjia Square" and "Guangdong Province Guangzhou City Tianhe District Tianhe City".
First, after word segmentation processing is performed on the address character strings according to step S11, the word segmentation results are as follows:
1) the word segmentation result of "Guangdong Province Shenzhen City Luohu District Zhongmin Times Square" is: "Guangdong Province", "Shenzhen City", "Luohu District" and "Zhongmin Times Square";
2) the word segmentation result of "Guangdong Province Shenzhen City Futian District Ping An Building" is: "Guangdong Province", "Shenzhen City", "Futian District" and "Ping An Building";
3) the word segmentation result of "Guangdong Province Shenzhen City Futian District Tencent Building" is: "Guangdong Province", "Shenzhen City", "Futian District" and "Tencent Building";
4) the word segmentation result of "Guangdong Province Shenzhen City Futian District Zhongxing Building" is: "Guangdong Province", "Shenzhen City", "Futian District" and "Zhongxing Building";
5) the word segmentation result of "Guangdong Province Guangzhou City Tianhe District Zhengjia Square" is: "Guangdong Province", "Guangzhou City", "Tianhe District" and "Zhengjia Square";
6) the word segmentation result of "Guangdong Province Guangzhou City Tianhe District Tianhe City" is: "Guangdong Province", "Guangzhou City", "Tianhe District" and "Tianhe City".
Step S12 is then executed to count the word segmentation results, giving 12 Chinese words: "Guangdong Province", "Guangzhou City", "Shenzhen City", "Luohu District", "Futian District", "Tianhe District", "Zhongmin Times Square", "Ping An Building", "Tencent Building", "Zhongxing Building", "Zhengjia Square" and "Tianhe City". That is, the total word number G of the Chinese words is 12. "Guangdong Province" occurs 6 times, "Shenzhen City" occurs 4 times, "Futian District" occurs 3 times, "Guangzhou City" and "Tianhe District" each occur 2 times, and "Luohu District", "Zhongmin Times Square", "Ping An Building", "Tencent Building", "Zhongxing Building", "Zhengjia Square" and "Tianhe City" each occur 1 time. Namely,
$$\sum_{j=1}^{G} P_j = 6 + 4 + 3 + 2 + 2 + 7 \times 1 = 24$$
Then, the word frequency of each of the above Chinese words is calculated according to formula (1) in step S13. For example, "Guangdong Province" occurs 6 times, so its word frequency is 24/6 = 4. Similarly, the word frequencies of the other Chinese words calculated according to formula (1) are: 6 for "Shenzhen City", 8 for "Futian District", 12 for "Guangzhou City" and "Tianhe District", and 24 for "Luohu District", "Zhongmin Times Square", "Ping An Building", "Tencent Building", "Zhongxing Building", "Zhengjia Square" and "Tianhe City".
Finally, step S14 is executed to store the 12 Chinese words, each in association with its word frequency, in the basic word bank.
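The worked example above can be checked with a short Python sketch of steps S12 to S14. The function name is hypothetical, the Chinese spellings of the six addresses are assumed back-translations of the example, and the input is taken to be already segmented as in step S11.

```python
from collections import Counter


def build_basic_word_bank(segmented_strings):
    """Map each word to T_m = (total occurrences of all words) / (occurrences of the word)."""
    counts = Counter(word for words in segmented_strings for word in words)
    total = sum(counts.values())  # sum of P_j over all G distinct words
    return {word: total / count for word, count in counts.items()}


# The six example addresses, already segmented (assumed Chinese originals).
addresses = [
    ["广东省", "深圳市", "罗湖区", "中民时代广场"],  # Luohu District, Zhongmin Times Square
    ["广东省", "深圳市", "福田区", "平安大厦"],      # Futian District, Ping An Building
    ["广东省", "深圳市", "福田区", "腾讯大厦"],      # Futian District, Tencent Building
    ["广东省", "深圳市", "福田区", "中兴大厦"],      # Futian District, Zhongxing Building
    ["广东省", "广州市", "天河区", "正佳广场"],      # Tianhe District, Zhengjia Square
    ["广东省", "广州市", "天河区", "天河城"],        # Tianhe District, Tianhe City
]
word_bank = build_basic_word_bank(addresses)
```

Under these assumptions the sketch yields 12 entries, with a word frequency of 4 for "广东省" (Guangdong Province) and 24 for each word that occurs once, in agreement with the figures in the example.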
In the embodiment corresponding to Fig. 2, word segmentation processing is performed on the Chinese character strings in the policy information to obtain Chinese words, the word frequency of each Chinese word is calculated according to formula (1), and each Chinese word is stored in association with its word frequency in the basic word bank, which completes the creation of the basic word bank. Because the word frequency of each word in the basic word bank is calculated in advance, the corresponding word frequency can be obtained directly from the basic word bank for similarity calculation when character strings are compared. On the one hand, the word frequency obtained from the basic word bank is more targeted and therefore more accurate; on the other hand, the word frequency does not need to be repeatedly calculated when character strings are compared, so the execution efficiency can be effectively improved when the number of character strings to be matched is huge.
Based on the embodiment corresponding to Fig. 2, a specific implementation flow of performing word segmentation processing on the Chinese character strings to be processed in each piece of policy information to obtain Chinese words, as mentioned in step S11, is described in detail below through a specific embodiment.
Referring to fig. 3, fig. 3 shows a specific implementation flow of step S11 provided in the embodiment of the present invention, which is detailed as follows:
S111: Perform single-character segmentation on the Chinese character string to be processed to obtain n single characters $a_i$, where $i \in [1, n]$ and n is the number of Chinese characters contained in the Chinese character string.
In the embodiment of the invention, the Chinese character string to be processed is segmented character by character to obtain n Chinese characters $a_i$. The n characters are stored in an array in the order in which they appear in the string, each character being one element of the array; that is, after segmentation the single characters are arranged from left to right in the same order as in the Chinese character string.
For example, if the Chinese character string to be processed is "Shenzhen City Luohu District" (深圳市罗湖区), single-character segmentation yields six single characters: $a_1$ = "深" (shen), $a_2$ = "圳" (zhen), $a_3$ = "市" (city), $a_4$ = "罗" (luo), $a_5$ = "湖" (lake), $a_6$ = "区" (district).
S112: If i is less than n, combine $a_i$ with the adjacent single character $a_{i+1}$ to obtain a temporary word $a_i a_{i+1}$.
Specifically, starting from the first single character $a_1$, $a_1$ and $a_2$ are combined to obtain the temporary word $a_1 a_2$.
When i = n, $a_i$ is the last character in the Chinese character string to be processed and has no adjacent single character, so no new temporary word can be obtained by further combination.
S113: If the temporary word $a_i a_{i+1}$ exists in a preset common word bank, combine the temporary word $a_i a_{i+1}$ with the adjacent single character $a_{i+2}$ to obtain a temporary word $a_i a_{i+1} a_{i+2}$, and continue searching the common word bank with the temporary word $a_i a_{i+1} a_{i+2}$, until the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the common word bank, where $k \in [i+1, n]$.
In the embodiment of the invention, the preset common word bank contains common Chinese words, and the common word bank can be updated regularly.
If the temporary word obtained in step S112 exists in the common word bank, the temporary word $a_i a_{i+1}$ is combined with the adjacent single character $a_{i+2}$ to obtain the temporary word $a_i a_{i+1} a_{i+2}$. If the temporary word $a_i a_{i+1} a_{i+2}$ still exists in the common word bank, it is combined with $a_{i+3}$ to obtain the temporary word $a_i a_{i+1} a_{i+2} a_{i+3}$, which is again looked up in the common word bank, and so on, until the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the common word bank, at which point the search ends.
It should be noted that k is greater than or equal to i+1 and less than or equal to n. When k equals n, the temporary word $a_i a_{i+1} a_{i+2} \ldots a_n$ being looked up in the common word bank has reached the last single character of the Chinese character string to be processed. Therefore, once the lookup of the temporary word $a_i a_{i+1} a_{i+2} \ldots a_n$ is complete, the search does not continue, regardless of whether the temporary word exists in the common word bank.
Continuing with the example string "Shenzhen City Luohu District" (深圳市罗湖区) mentioned in step S111: among the six single characters obtained in step S111, "深" and "圳" are first combined into the temporary word "深圳" (Shenzhen), which is looked up in the common word bank. Because "深圳" exists in the common word bank, it is combined with the adjacent "市" into the temporary word "深圳市" (Shenzhen City), which is looked up in turn. Because "深圳市" also exists in the common word bank, it is combined with the adjacent "罗" into the temporary word "深圳市罗". Because "深圳市罗" does not exist in the common word bank, the lookup stops, and the current temporary word is "深圳市罗".
S114: If the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the common word bank, recognize $a_i a_{i+1} a_{i+2} \ldots a_{k-1}$ as a valid word and start again from the single character $a_k$: if k equals n, recognize $a_k$ as a valid word; if k is less than n, take $a_k$ as $a_i$ and return to step S112 to continue execution.
Specifically, if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ obtained in step S113 does not exist in the common word bank, $a_i a_{i+1} a_{i+2} \ldots a_{k-1}$ is recognized as a valid word, and processing starts again from the single character $a_k$: if k is less than n, $a_k$ is taken as $a_i$ and the flow returns to step S112.
When k = i+1, if $a_i a_{i+1}$ does not exist in the common word bank, the single character $a_i$ is recognized as a valid word, and there is no need to check whether $a_i$ itself exists in the common word bank.
When k = n, if $a_i a_{i+1} a_{i+2} \ldots a_n$ does not exist in the common word bank, $a_k$ is the last character in the Chinese character string to be processed and has no adjacent single character, so no new temporary word can be obtained by further combination. The flow therefore does not return to step S112; $a_k$ is directly recognized as a valid word and the flow jumps to step S116.
Continuing with the example string "Shenzhen City Luohu District" (深圳市罗湖区) mentioned in step S111: when the temporary word "深圳市罗" does not exist in the common word bank, "深圳市" (Shenzhen City) is recognized as a valid word. Then, starting from the single character "罗", it is combined with the adjacent "湖" into a new temporary word "罗湖" (Luohu) according to step S112, and the lookup in the common word bank continues.
S115: If the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ exists in the common word bank and k = n, recognize $a_i a_{i+1} a_{i+2} \ldots a_k$ as a valid word.
Specifically, if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ obtained in step S113 exists in the common word bank and k equals n, the temporary word has reached the last single character of the Chinese character string to be processed and exists in the common word bank, so $a_i a_{i+1} a_{i+2} \ldots a_k$ is directly recognized as a valid word.
Continuing with the example string "Shenzhen City Luohu District" (深圳市罗湖区) mentioned in step S111: when the temporary word "深圳市罗" does not exist in the common word bank, "深圳市" (Shenzhen City) is recognized as a valid word. Then, starting from the single character "罗", it is combined with the adjacent "湖" into a new temporary word "罗湖" according to step S112, which is looked up in the common word bank. If "罗湖" exists in the common word bank, it is combined with the adjacent "区" into the temporary word "罗湖区" (Luohu District), which is looked up in turn. If "罗湖区" exists in the common word bank, then because k = n at this point, the cyclic search does not continue and "罗湖区" is directly recognized as a valid word.
S116: Determine the recognized valid words as the Chinese words in the Chinese character string.
Specifically, the valid words recognized in steps S114 and S115 are taken as the Chinese words obtained by performing word segmentation on the Chinese character string.
Continuing with the example string "Shenzhen City Luohu District" mentioned in step S111, the valid words recognized in steps S114 and S115 are "Shenzhen City" and "Luohu District"; therefore, the word segmentation result of the Chinese character string "Shenzhen City Luohu District" is: Shenzhen City and Luohu District.
In the embodiment corresponding to Fig. 3, the Chinese character string to be processed is first segmented into single characters. Then, starting from the first single character, that character is combined with the adjacent single character into a temporary word, and the temporary word is looked up in the common word bank. If the temporary word exists in the common word bank, it is combined with the next adjacent single character into a new temporary word, which is looked up in turn; combination and lookup continue in this way until a newly combined temporary word is not found in the common word bank, at which point the previous temporary word is taken as a valid word. The single character left over after the valid word is removed is then combined with its next adjacent single character, and lookup in the common word bank continues until all single characters of the Chinese character string have been processed. Because the string is segmented by splitting it into single characters and recombining them, no initial word length needs to be set and common words are not wrongly split, which improves the accuracy of word segmentation. Moreover, compared with the traditional forward maximum matching algorithm, the technical solution of the embodiment of the invention is simple to implement and more efficient to execute, which effectively improves both the general applicability and the efficiency of character string word segmentation.
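Steps S111 to S116 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the function name `segment` and the tiny `common_words` set are hypothetical, the common word bank is modeled as an in-memory Python set, and indices are 0-based whereas the text counts from 1.

```python
def segment(text, common_words):
    """Greedy grow-and-look-up segmentation following steps S111 to S116."""
    n = len(text)                      # S111: text[i] plays the role of a_i
    words = []
    i = 0
    while i < n:
        if i == n - 1:
            # Last character has no neighbour: it is a valid word by itself.
            words.append(text[i])
            break
        k = i + 1                      # S112: temporary word is text[i:k + 1]
        # S113: keep extending while the temporary word is in the word bank
        # and there is still a character left to append.
        while text[i:k + 1] in common_words and k < n - 1:
            k += 1
        if text[i:k + 1] in common_words:
            # S115: temporary word is known and reaches the end of the string.
            words.append(text[i:k + 1])
            break
        # S114: temporary word is unknown, so everything before a_k is valid.
        words.append(text[i:k])
        if k == n - 1:
            words.append(text[k])      # a_k is the last character
            break
        i = k                          # restart from a_k
    return words                       # S116


# Hypothetical common word bank for the example string.
common_words = {"深圳", "深圳市", "罗湖", "罗湖区"}
print(segment("深圳市罗湖区", common_words))  # ['深圳市', '罗湖区']
```

As in the description of step S114 for k = i+1, a single character whose two-character combination is not in the word bank is emitted as a valid word without being looked up itself.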
Based on the embodiments corresponding to fig. 1 to fig. 3, a detailed description will be given below of a specific implementation flow of calculating the similarity between the first character string and the second character string by using a cosine similarity algorithm according to the first number, the second number, the word frequency of each first word in the first character string, and the word frequency of each second word in the second character string, which are mentioned in step S5, by using a specific embodiment.
Referring to fig. 4, fig. 4 shows a specific implementation flow of step S5 provided in the embodiment of the present invention, which is detailed as follows:
S51: According to the first number, the second number, the word frequency of each first word in the first character string and the word frequency of each second word in the second character string, construct a first word vector of the first character string and a second word vector of the second character string as follows:
$$Q1[d+n+m] = \{T(W_1), T(W_2), \ldots, T(W_d), T(W_{d+1}), T(W_{d+2}), \ldots, T(W_{d+n}), Z_{n+1}, Z_{n+2}, \ldots, Z_{n+m}\}$$
$$Q2[d+n+m] = \{T(W_1), T(W_2), \ldots, T(W_d), Z_1, Z_2, \ldots, Z_n, T(W_{n+1}), T(W_{n+2}), \ldots, T(W_{n+m})\}$$
where $Q1[d+n+m]$ is the first word vector; $Q2[d+n+m]$ is the second word vector; $W_1, W_2, \ldots, W_d, W_{d+1}, W_{d+2}, \ldots, W_{d+n}$ are the first words obtained after word segmentation of the first character string; $W_1, W_2, \ldots, W_d, W_{n+1}, W_{n+2}, \ldots, W_{n+m}$ are the second words obtained after word segmentation of the second character string; d+n is the first number; d+m is the second number; $T(W_s)$ is the word frequency of $W_s$, $s \in [1, n+m]$; and $Z_b$ has the value 0, $b \in [1, n+m]$.
Specifically, a first word vector and a second word vector are constructed according to the first number of first words and the second number of second words obtained in step S3, and the word frequency of each first word and the word frequency of each second word obtained in step S4.
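The construction of step S51 can be sketched as follows. This is a minimal illustration with hypothetical names (`build_word_vectors`, `word_bank`), assuming that the words within each string are distinct and that a word missing from the basic word bank is given the value 0; the source does not specify either case. The shared words come first, then the words found only in the first string, then the words found only in the second string.

```python
def build_word_vectors(first_words, second_words, word_bank):
    """Return (Q1, Q2) laid out as shared words, first-only, second-only."""
    first_set, second_set = set(first_words), set(second_words)
    shared = [w for w in first_words if w in second_set]           # d words
    first_only = [w for w in first_words if w not in second_set]   # n words
    second_only = [w for w in second_words if w not in first_set]  # m words

    def freq(word):
        # Assumption: unknown words contribute 0.
        return word_bank.get(word, 0)

    q1 = [freq(w) for w in shared + first_only] + [0] * len(second_only)
    q2 = ([freq(w) for w in shared] + [0] * len(first_only)
          + [freq(w) for w in second_only])
    return q1, q2


# Word frequencies taken from the worked example of this embodiment.
word_bank = {"Guangdong Province": 4, "Shenzhen City": 6,
             "Luohu District": 24, "Zhongmin Times Square": 24}
first = ["Guangdong Province", "Shenzhen City",
         "Luohu District", "Zhongmin Times Square"]
second = ["Shenzhen City", "Luohu District", "Zhongmin Times Square"]
print(build_word_vectors(first, second, word_bank))
# ([6, 24, 24, 4], [6, 24, 24, 0])
```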
S52: calculating the similarity between the first character string and the second character string according to formula (2):
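$$\varepsilon = \frac{\sum_{c=1}^{d+n+m} Q1[c]\, Q2[c]}{\sqrt{\sum_{c=1}^{d+n+m} Q1[c]^2}\;\sqrt{\sum_{c=1}^{d+n+m} Q2[c]^2}} \tag{2}$$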
wherein epsilon is the similarity between the first character string and the second character string; q1[ c ] is the word frequency of the c-th first word in the first string; q2[ c ] is the word frequency of the c-th second word in the second string.
Specifically, the similarity epsilon is calculated according to formula (2) based on the first word vector and the second word vector obtained in step S51.
In order to better understand the technical solution of the embodiment of the present invention, the following is described by a specific example, which is detailed as follows:
Assume that the first character string to be matched is "Guangdong Province Shenzhen City Luohu District Zhongmin Times Square" and the second character string to be matched is "Shenzhen City Luohu District Zhongmin Times Square".
After word segmentation is performed on the first character string and the second character string, the first words obtained are "Guangdong Province", "Shenzhen City", "Luohu District" and "Zhongmin Times Square", and the second words obtained are "Shenzhen City", "Luohu District" and "Zhongmin Times Square"; that is, the first number is 4 and the second number is 3.
Taking the basic word bank established in step S14 as an example, the word frequencies matched from the basic word bank for each first word and each second word are: the word frequency of Guangdong Province is 4, the word frequency of Shenzhen City is 6, and the word frequencies of Luohu District and Zhongmin Times Square are both 24.
Therefore, the word vector Q1[4] of the first character string is {6,24,24,4} and the word vector Q2[4] of the second character string is {6,24,24,0} according to the construction method of step S51.
Calculating the similarity between the first character string and the second character string according to formula (2) of step S52 as:
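$$\varepsilon = \frac{6 \times 6 + 24 \times 24 + 24 \times 24 + 4 \times 0}{\sqrt{6^2 + 24^2 + 24^2 + 4^2} \times \sqrt{6^2 + 24^2 + 24^2 + 0^2}} = \frac{1188}{\sqrt{1204} \times \sqrt{1188}} \approx 0.9933$$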
i.e., the similarity between the first string and the second string is about 0.9933.
Assuming that the preset similarity threshold is 0.9, the calculation result shows that the first string "Guangdong Province Shenzhen City Luohu District Zhongmin Times Square" and the second string "Shenzhen City Luohu District Zhongmin Times Square" have the same meaning.
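The calculation above can be checked with a short sketch of the cosine similarity of step S52 and the threshold comparison. The function name `cosine_similarity` is hypothetical, and returning 0.0 for an all-zero vector is an assumption, since the source does not cover that case.

```python
import math


def cosine_similarity(q1, q2):
    """Cosine of the angle between two equally long word vectors."""
    if len(q1) != len(q2):
        raise ValueError("word vectors must have the same length")
    dot = sum(a * b for a, b in zip(q1, q2))
    norm1 = math.sqrt(sum(a * a for a in q1))
    norm2 = math.sqrt(sum(b * b for b in q2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty or all-zero vector matches nothing
    return dot / (norm1 * norm2)


THRESHOLD = 0.9  # preset similarity threshold used in the example

similarity = cosine_similarity([6, 24, 24, 4], [6, 24, 24, 0])
print(round(similarity, 4))    # 0.9933
print(similarity > THRESHOLD)  # True
```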
If the similarity is instead calculated by the prior-art method, that is, from word frequencies counted in the two character strings themselves rather than from the word frequencies provided by the basic word bank as in the technical solution of the embodiment of the invention, the calculation process is as follows:
After word segmentation is performed on the first character string and the second character string, the first words obtained are still "Guangdong Province", "Shenzhen City", "Luohu District" and "Zhongmin Times Square", and the second words are "Shenzhen City", "Luohu District" and "Zhongmin Times Square".
The word frequencies are calculated from the number of occurrences of each first word and each second word in the first character string and the second character string, giving the following result: the word frequency of Guangdong Province is 1, and the word frequencies of Shenzhen City, Luohu District and Zhongmin Times Square are 2. Therefore, the word vector Q1 of the first character string is {2,2,2,1}, and the word vector Q2 of the second character string is {2,2,2,0}.
Calculating the similarity of the first character string and the second character string by a cosine similarity algorithm as follows:
Figure GDA0002808097240000191
i.e., the similarity between the first string and the second string is about 0.866.
It can be seen that, if the similarity threshold is still 0.9, the similarity calculated according to the prior-art method is smaller than the similarity threshold, so the first string "Guangdong Province Shenzhen City Luohu District Zhongmin Times Square" and the second string "Shenzhen City Luohu District Zhongmin Times Square" are determined to have different meanings. In fact, the first character string merely contains one more Chinese word, "Guangdong Province", than the second character string, and in the policy information these two address character strings have the same meaning; the prior-art method therefore produces an erroneous judgment.
Therefore, the technical scheme provided by the embodiment of the invention can more accurately judge whether the meanings of the two character strings are the same or not, and the misjudgment rate is reduced.
In the embodiment corresponding to Fig. 4, a basic word bank is established from the character strings in the policy information of the policy database, word vectors are constructed from the word frequencies looked up in the basic word bank, and the similarity is calculated using formula (2). Because the similarity calculation is based on a basic word bank drawn from the actual application scenario, it is more targeted, which improves matching accuracy, allows a more accurate judgment of whether two character strings have the same meaning, and reduces the misjudgment rate.
On the basis of the embodiments corresponding to Figs. 1 to 3, after it is confirmed in step S6 that the first character string and the second character string have the same meaning because their similarity is greater than the preset similarity threshold, the policy information may further be associated.
In the embodiment of the present invention, the first character string is an address character string in the first policy information, and the second character string is an address character string in the second policy information. For example, the address character string may be home address information or work unit address information or the like.
As shown in fig. 5, the character string comparison method further includes:
S7: Establish an association relationship between the policy corresponding to the first policy information in which the first character string is located and the policy corresponding to the second policy information in which the second character string is located.
Specifically, when it is confirmed in step S6 that the first character string and the second character string have the same meaning, it is confirmed that the policy corresponding to the policy information in which the first character string is located and the policy corresponding to the policy information in which the second character string is located are associated with each other, and an association relationship between the two policies is established. And, the higher the similarity between the first character string and the second character string, the higher the degree of association between the two policies.
Associating policies in this way helps the relevant staff of the insurance company to accurately mine potential relationships between policies, to analyze the policies, and to discover possible insurance fraud risks.
It should be noted that, in the embodiment of the present invention, when comparing whether the first character string and the second character string have the same meaning, if non-Chinese characters are present in the first character string or the second character string, the non-Chinese characters are first identified and set aside, and whether the two character strings have the same meaning is then determined from the remaining Chinese character strings. Whether the non-Chinese parts have the same meaning is therefore not judged. This is because, when policies are associated on the basis of address character strings, comparing the Chinese character strings effectively reflects how similar the addresses are: when the Chinese parts of two address character strings have the same meaning, the two policies can be considered associated even if the non-Chinese parts differ, for example when the house numbers are different.
In other embodiments, when policy association needs to be performed according to the result of accurate matching of two address character strings, after determining that the meanings of the two address character strings are the same according to the comparison of the chinese character strings, further comparing whether the contents of the non-chinese character strings in the two address character strings are the same by using a direct character comparison method, if so, determining that the meanings of the two address character strings are the same, and if not, determining that the meanings of the two address character strings are different.
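The handling of non-Chinese characters described above can be sketched as follows. This is an illustrative assumption rather than the method specified in the source: "Chinese character" is approximated here by the CJK Unified Ideographs range U+4E00 to U+9FFF, and the names `split_address` and `strictly_same` are hypothetical. The flag `chinese_parts_same` stands for the outcome of the similarity comparison of the Chinese parts.

```python
import re

# Assumption: a "Chinese character" is any code point in U+4E00..U+9FFF.
_CHINESE = re.compile(r"[\u4e00-\u9fff]")


def split_address(address):
    """Return (Chinese characters, all other characters) in original order."""
    chinese = "".join(_CHINESE.findall(address))
    other = _CHINESE.sub("", address)
    return chinese, other


def strictly_same(addr_a, addr_b, chinese_parts_same):
    """Exact-match variant: Chinese parts match and the rest is identical."""
    return chinese_parts_same and split_address(addr_a)[1] == split_address(addr_b)[1]


print(split_address("深圳市罗湖区88号"))                        # ('深圳市罗湖区号', '88')
print(strictly_same("深圳市罗湖区88号", "深圳市罗湖区66号", True))  # False
```

In the default embodiment only the first element of the pair would be compared, so the two addresses in the last line would still be treated as associated; the exact-match variant rejects them because the house numbers differ.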
In the embodiment corresponding to Fig. 5, if it is determined that the two address character strings have the same meaning, the policy corresponding to the first policy information is related to the policy corresponding to the second policy information, and the two policies are associated. This helps the relevant staff of the insurance company to accurately mine potential relationships between policies and to analyze the policies in order to discover possible insurance fraud risks.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Example 2
Fig. 6 shows a block diagram of a character string comparison device provided in an embodiment of the present invention, which corresponds to the character string comparison method in the above embodiment, and only shows the relevant parts in the embodiment of the present invention for convenience of description.
Referring to fig. 6, the character string comparison apparatus includes: the word stock establishing module 61, the obtaining module 62, the word segmentation module 63, the word frequency matching module 64, the similarity calculating module 65 and the comparing module 66, wherein the detailed description of each functional module is as follows:
a word bank establishing module 61, configured to establish a basic word bank according to the policy information in the policy database, where the basic word bank includes the Chinese words in the policy information, excluding English characters and Arabic numerals, and the word frequency of each Chinese word; the word frequency is calculated according to the number of occurrences of each Chinese word in the policy database;
an obtaining module 62, configured to obtain a first character string and a second character string to be matched;
a word segmentation module 63, configured to perform word segmentation on the first character string and the second character string respectively to obtain a first number of first words included in the first character string and a second number of second words included in the second character string;
a word frequency matching module 64 for matching the word frequency of each first word and the word frequency of each second word from the basic word stock;
a similarity calculation module 65, configured to calculate a similarity between the first character string and the second character string by using a cosine similarity calculation method according to the first number, the second number, the word frequency of each first word in the first character string, and the word frequency of each second word in the second character string;
and the comparison module 66 is configured to determine that the first character string and the second character string have the same meaning if the similarity is greater than a preset similarity threshold.
Further, the thesaurus establishing module 61 includes:
a word segmentation processing submodule 611, configured to perform word segmentation processing on the to-be-processed chinese character string in each policy information to obtain a chinese word;
a statistics submodule 612, configured to count the occurrence frequency of each chinese word and the total word number of the chinese word;
a word frequency calculating submodule 613, configured to calculate a word frequency of each chinese word according to the following formula:
$$T_m = \frac{\sum_{j=1}^{G} P_j}{P_m}$$
where $T_m$ is the word frequency of the m-th Chinese word, $P_m$ is the number of occurrences of the m-th Chinese word, G is the total number of Chinese words, $m \in [1, G]$, $P_j$ is the number of occurrences of the j-th Chinese word, and $\sum_{j=1}^{G} P_j$ is the total number of occurrences of all the Chinese words;
and an association storage sub-module 614, configured to store each chinese word and the word frequency association of the chinese word in the basic word bank.
Further, the participle processing sub-module 611 includes:
a segmentation unit 6111, configured to perform single-character segmentation on the Chinese character string to be processed to obtain n single characters $a_i$, where $i \in [1, n]$ and n is the number of Chinese characters contained in the Chinese character string;
a combination unit 6112, configured to, if i is less than n, combine $a_i$ with the adjacent single character $a_{i+1}$ to obtain a temporary word $a_i a_{i+1}$;
a loop search unit 6113, configured to, if the temporary word $a_i a_{i+1}$ exists in a preset common word bank, combine the temporary word $a_i a_{i+1}$ with the adjacent single character $a_{i+2}$ to obtain a temporary word $a_i a_{i+1} a_{i+2}$, and continue searching the common word bank with the temporary word $a_i a_{i+1} a_{i+2}$, until the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the common word bank, where $k \in [i+1, n]$;
a first recognition unit 6114, configured to, if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the common word bank, recognize $a_i a_{i+1} a_{i+2} \ldots a_{k-1}$ as a valid word and start again from the single character $a_k$: if k equals n, recognize $a_k$ as a valid word; if k is less than n, take $a_k$ as $a_i$ and return to the combination unit 6112 to continue execution;
a second recognition unit 6115, configured to, if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ exists in the common word bank and k = n, recognize $a_i a_{i+1} a_{i+2} \ldots a_k$ as a valid word;
a result determination unit 6116, configured to determine the recognized valid words as the Chinese words in the Chinese character string.
Further, the similarity calculation module 65 includes:
a word vector constructing sub-module 651, configured to construct a first word vector of the first character string and a second word vector of the second character string according to the first number, the second number, the word frequency of each first word in the first character string, and the word frequency of each second word in the second character string as follows:
$$Q1[d+n+m] = \{T(W_1), T(W_2), \ldots, T(W_d), T(W_{d+1}), T(W_{d+2}), \ldots, T(W_{d+n}), Z_{n+1}, Z_{n+2}, \ldots, Z_{n+m}\}$$
$$Q2[d+n+m] = \{T(W_1), T(W_2), \ldots, T(W_d), Z_1, Z_2, \ldots, Z_n, T(W_{n+1}), T(W_{n+2}), \ldots, T(W_{n+m})\}$$
where $Q1[d+n+m]$ is the first word vector; $Q2[d+n+m]$ is the second word vector; $W_1, W_2, \ldots, W_d, W_{d+1}, W_{d+2}, \ldots, W_{d+n}$ are the first words obtained after word segmentation of the first character string; $W_1, W_2, \ldots, W_d, W_{n+1}, W_{n+2}, \ldots, W_{n+m}$ are the second words obtained after word segmentation of the second character string; d+n is the first number; d+m is the second number; $T(W_s)$ is the word frequency of $W_s$, $s \in [1, n+m]$; and $Z_b$ has the value 0, $b \in [1, n+m]$;
The formula calculation sub-module 652 is configured to calculate the similarity between the first character string and the second character string according to the following formula:
$$\varepsilon = \frac{\sum_{c=1}^{d+n+m} Q1[c]\, Q2[c]}{\sqrt{\sum_{c=1}^{d+n+m} Q1[c]^2}\;\sqrt{\sum_{c=1}^{d+n+m} Q2[c]^2}}$$
wherein epsilon is the similarity between the first character string and the second character string; q1[ c ] is the word frequency of the c-th first word in the first string; q2[ c ] is the word frequency of the c-th second word in the second string.
Further, the first character string is an address character string in the first policy information, and the second character string is an address character string in the second policy information, and the character string comparison device further includes:
and the association module 67 is configured to establish an association relationship between the policy corresponding to the first policy information and the policy corresponding to the second policy information.
The process of implementing each function by each module in the character string comparison device provided in this embodiment may specifically refer to the description of embodiment 1, and is not described herein again.
Example 3
This embodiment provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method for comparing character strings in embodiment 1 is implemented, and details are not repeated here to avoid repetition. Alternatively, the computer program, when executed by the processor, implements the functions of each module/unit in the character string comparison apparatus in embodiment 2, and is not described herein again to avoid redundancy.
Example 4
Fig. 7 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 7, the terminal device 70 of this embodiment includes: a processor 71, a memory 72, and a computer program 73, such as a string comparison program, stored in the memory 72 and operable on the processor 71. The processor 71, when executing the computer program 73, implements the steps in the respective character string comparison method embodiments described above, such as the steps S1 to S6 shown in fig. 1. Alternatively, the processor 71, when executing the computer program 73, implements the functions of the respective modules/units in the respective embodiments of the character string comparison apparatus described above, such as the functions of the modules 61 to 66 shown in fig. 6.
Illustratively, the computer program 73 may be divided into one or more modules/units, which are stored in the memory 72 and executed by the processor 71 to carry out the invention. One or more of the modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 73 in the terminal device 70. For example, the computer program 73 may be divided into a word stock establishing module, an obtaining module, a word segmentation module, a word frequency matching module, a similarity calculation module, and a comparison module, and each of the functional modules is described in detail as follows:
the word bank establishing module is used for establishing a basic word bank according to policy information in a policy database, wherein the basic word bank comprises Chinese words except English characters and Arabic numbers in the policy information and the word frequency of each Chinese word; the word frequency is obtained by calculation according to the occurrence frequency of each Chinese word in the policy database;
the acquisition module is used for acquiring a first character string and a second character string to be matched;
the word segmentation module is used for performing word segmentation processing on the first character string and the second character string respectively to obtain a first number of first words contained in the first character string and a second number of second words contained in the second character string;
the word frequency matching module is used for matching the word frequency of each first word and the word frequency of each second word from the basic word bank;
the similarity calculation module is used for calculating the similarity between the first character string and the second character string by using a cosine similarity calculation method according to the first number, the second number, the word frequency of each first word in the first character string and the word frequency of each second word in the second character string;
and the comparison module is used for confirming that the meanings of the first character string and the second character string are the same if the similarity is greater than a preset similarity threshold.
Further, the word stock establishing module comprises:
the word segmentation processing submodule is used for carrying out word segmentation processing on the Chinese character strings to be processed in each policy information to obtain Chinese words;
the statistic submodule is used for counting the occurrence frequency of each Chinese word and the total word number of the Chinese words;
the word frequency calculation submodule is used for calculating the word frequency of each Chinese word according to the following formula:
$$T_m = \frac{P_m}{\sum_{j=1}^{G} P_j}$$
wherein $T_m$ is the word frequency of the m-th Chinese word, $P_m$ is the number of occurrences of the m-th Chinese word, G is the total number of Chinese words, $m \in [1, G]$, $P_j$ is the number of occurrences of the j-th Chinese word, and $\sum_{j=1}^{G} P_j$ is the total number of occurrences of all the Chinese words;
and the association storage submodule is used for storing each Chinese word in association with its word frequency in the basic word bank.
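As an illustration of the word frequency calculation, the following minimal Python sketch counts how often each Chinese word occurs across already segmented policy strings and divides each count by the total number of occurrences. The function name `build_base_lexicon`, the list-of-lists input format, and the sample address words are assumptions made for this example and are not taken from the patent.

```python
from collections import Counter


def build_base_lexicon(segmented_policies):
    """Return {word: word frequency}, where the frequency of a word is its
    occurrence count divided by the total occurrences of all words.

    `segmented_policies` is a hypothetical input: one list of Chinese words
    per policy string, produced by an earlier word segmentation step.
    """
    counts = Counter(word for words in segmented_policies for word in words)
    total = sum(counts.values())
    if total == 0:
        # No words at all: return an empty lexicon instead of dividing by zero.
        return {}
    return {word: count / total for word, count in counts.items()}


# Hypothetical sample data: two segmented address strings.
lexicon = build_base_lexicon([
    ["广东省", "深圳市", "福田区"],
    ["广东省", "深圳市", "南山区"],
])
```

Because every count is divided by the same total, the frequencies stored in the lexicon sum to 1 in this sketch.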
Further, the word segmentation processing submodule comprises:
a segmentation unit for performing single-character segmentation on the Chinese character string to be processed to obtain n single characters $a_i$, wherein $i \in [1, n]$ and n is the number of Chinese characters contained in the Chinese character string;
a combination unit for combining, if i is less than n, $a_i$ with the adjacent single character $a_{i+1}$ to obtain a temporary word $a_i a_{i+1}$;
a cyclic search unit for, if the temporary word $a_i a_{i+1}$ exists in a preset common word bank, combining the temporary word $a_i a_{i+1}$ with the adjacent single character $a_{i+2}$ to obtain a temporary word $a_i a_{i+1} a_{i+2}$, and continuing to search the common word bank using the temporary word $a_i a_{i+1} a_{i+2}$ until a temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the common word bank, wherein $k \in [i+1, n]$;
a first recognition unit for, if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the common word bank, recognizing $a_i a_{i+1} a_{i+2} \ldots a_{k-1}$ as a valid word and starting again from the single character $a_k$: if k is equal to n, recognizing $a_k$ as a valid word; if k is less than n, taking $a_k$ as $a_i$ and returning to the combination unit to continue execution;
a second recognition unit for, if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ exists in the common word bank and k is equal to n, recognizing $a_i a_{i+1} a_{i+2} \ldots a_k$ as a valid word;
and a result determining unit for determining the recognized effective words as the Chinese words in the Chinese character string.
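The segmentation, combination, cyclic search and recognition units above can be sketched as a single loop in Python. This is an illustrative reading of the described steps, not the patented implementation: the function name `segment`, the use of a Python `set` for the common word bank, and the handling of a string with only one character (emitted as a word by itself) are assumptions.

```python
def segment(text, common_words):
    """Greedy forward segmentation against a common word bank.

    Starting from character i, the temporary word is extended by one adjacent
    character for as long as it is found in `common_words`. Indices here are
    0-based, whereas the description uses 1-based indices.
    """
    chars = list(text)
    n = len(chars)
    words = []
    i = 0
    while i < n:
        if i == n - 1:
            # Only the last character is left, nothing to combine it with.
            words.append(chars[i])
            break
        k = i + 1  # index of the last character of the temporary word
        while True:
            temp = "".join(chars[i:k + 1])
            if temp not in common_words:
                # Temporary word not found: emit everything before a_k
                # and restart from a_k.
                words.append("".join(chars[i:k]))
                i = k
                break
            if k == n - 1:
                # Temporary word found and the string is exhausted.
                words.append(temp)
                i = n
                break
            k += 1
    return words


# Hypothetical common word bank and input string.
print(segment("深圳市福田区", {"深圳", "深圳市", "福田", "福田区"}))
```

One consequence of the step-by-step extension is that a longer word is only reached if its shorter prefixes are also in the common word bank; in this sketch, "深圳市" is found only because "深圳" is present as well.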
Further, the similarity calculation module includes:
the word vector construction submodule is used for constructing a first word vector of the first character string and a second word vector of the second character string according to the first number, the second number, the word frequency of each first word in the first character string and the word frequency of each second word in the second character string in the following modes:
$Q1[d+n+m] = \{T(W_1), T(W_2), \ldots, T(W_d), T(W_{d+1}), T(W_{d+2}), \ldots, T(W_{d+n}), Z_{n+1}, Z_{n+2}, \ldots, Z_{n+m}\}$
$Q2[d+n+m] = \{T(W_1), T(W_2), \ldots, T(W_d), Z_1, Z_2, \ldots, Z_n, T(W_{n+1}), T(W_{n+2}), \ldots, T(W_{n+m})\}$
wherein $Q1[d+n+m]$ is the first word vector, $Q2[d+n+m]$ is the second word vector, $W_1, W_2, \ldots, W_d, W_{d+1}, W_{d+2}, \ldots, W_{d+n}$ are the first words obtained after the word segmentation processing is carried out on the first character string, $W_1, W_2, \ldots, W_d, W_{n+1}, W_{n+2}, \ldots, W_{n+m}$ are the second words obtained after the word segmentation processing is carried out on the second character string, d + n is the first number, d + m is the second number, $T(W_s)$ is the word frequency of $W_s$, $s \in [1, n+m]$, the value of $Z_b$ is 0, and $b \in [1, n+m]$;
The formula calculation submodule is used for calculating the similarity between the first character string and the second character string according to the following formula:
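$$\varepsilon = \frac{\sum_{c=1}^{d+n+m} Q1[c] \cdot Q2[c]}{\sqrt{\sum_{c=1}^{d+n+m} Q1[c]^2} \cdot \sqrt{\sum_{c=1}^{d+n+m} Q2[c]^2}}$$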
wherein epsilon is the similarity between the first character string and the second character string; q1[ c ] is the word frequency of the c-th first word in the first string; q2[ c ] is the word frequency of the c-th second word in the second string.
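The vector construction and the similarity calculation can be illustrated with the short Python sketch below. It rests on several assumptions that the text does not state explicitly: the d leading entries are taken to be the words shared by both strings, repeated words are counted once, a word missing from the basic word bank is given frequency 0, and the threshold value 0.8 is only a placeholder. The names `word_vectors`, `cosine_similarity` and `same_meaning` are hypothetical.

```python
import math


def word_vectors(first_words, second_words, lexicon):
    """Build two equal-length vectors: shared words first, then words only in
    the first string, then words only in the second string. Positions for
    words that a string does not contain are filled with 0."""
    first_set, second_set = set(first_words), set(second_words)
    # dict.fromkeys removes repeats while keeping the original order.
    shared = [w for w in dict.fromkeys(first_words) if w in second_set]
    only_first = [w for w in dict.fromkeys(first_words) if w not in second_set]
    only_second = [w for w in dict.fromkeys(second_words) if w not in first_set]

    def freq(word):
        return lexicon.get(word, 0.0)  # assumption: unknown words count as 0

    q1 = [freq(w) for w in shared + only_first] + [0.0] * len(only_second)
    q2 = ([freq(w) for w in shared] + [0.0] * len(only_first)
          + [freq(w) for w in only_second])
    return q1, q2


def cosine_similarity(q1, q2):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(q1, q2))
    norm = math.sqrt(sum(a * a for a in q1)) * math.sqrt(sum(b * b for b in q2))
    return dot / norm if norm else 0.0


def same_meaning(first_words, second_words, lexicon, threshold=0.8):
    """Compare the similarity with a preset threshold (0.8 is a placeholder)."""
    q1, q2 = word_vectors(first_words, second_words, lexicon)
    return cosine_similarity(q1, q2) > threshold


# Hypothetical word frequencies and segmented address strings.
lex = {"广东省": 0.4, "深圳市": 0.4, "福田区": 0.1, "南山区": 0.1}
first = ["广东省", "深圳市", "福田区"]
second = ["广东省", "深圳市", "南山区"]
sim = cosine_similarity(*word_vectors(first, second, lex))
```

In this sample, the two strings share their two high-frequency words and differ only in one low-frequency word, so the similarity stays close to 1.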
Further, where the first character string is an address character string in first policy information and the second character string is an address character string in second policy information, the computer program 73 may be further divided into:
an association module, used for establishing an association relationship between the policy corresponding to the first policy information and the policy corresponding to the second policy information.
The terminal device 70 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal device 70 may include, but is not limited to, the processor 71 and the memory 72. Those skilled in the art will appreciate that fig. 7 is merely an example of the terminal device 70 and does not constitute a limitation on it; the terminal device 70 may include more or fewer components than shown, combine some components, or use different components. For example, the terminal device 70 may also include input-output devices, network access devices, buses, etc.
The processor 71 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 72 may be an internal storage unit of the terminal device 70, such as a hard disk or internal memory of the terminal device 70. The memory 72 may also be an external storage device of the terminal device 70, such as a plug-in hard disk provided on the terminal device 70, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and the like. Further, the memory 72 may include both an internal storage unit of the terminal device 70 and an external storage device. The memory 72 is used to store the computer program and other programs and data required by the terminal device 70. The memory 72 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, the above division of functional units and modules is given only as an example. In practical applications, the above functions may be assigned to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the methods in the above embodiments may also be implemented by a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (8)

1. A character string comparison method, comprising:
establishing a basic word bank according to policy information in a policy database, wherein the basic word bank comprises Chinese words except English characters and Arabic numbers in the policy information and the word frequency of each Chinese word; the word frequency is obtained by calculation according to the occurrence frequency of each Chinese word in the policy database;
acquiring a first character string and a second character string to be matched;
performing word segmentation processing on the first character string and the second character string respectively to obtain a first number of first words contained in the first character string and a second number of second words contained in the second character string;
matching the word frequency of each first word and the word frequency of each second word from the basic word stock;
calculating the similarity between the first character string and the second character string by using a cosine similarity algorithm according to the first number, the second number, the word frequency of each first word in the first character string and the word frequency of each second word in the second character string;
if the similarity is larger than a preset similarity threshold value, confirming that the meanings of the first character string and the second character string are the same;
the establishing of the basic word bank according to the policy information in the policy database comprises the following steps:
performing word segmentation processing on the Chinese character string to be processed in each policy information to obtain the Chinese word;
counting the occurrence frequency of each Chinese word and the total word number of the Chinese words;
calculating the word frequency of each Chinese word according to the following formula:
$$T_m = \frac{P_m}{\sum_{j=1}^{G} P_j}$$
wherein $T_m$ is the word frequency of the m-th Chinese word, $P_m$ is the number of occurrences of the m-th Chinese word, G is the total number of Chinese words, $m \in [1, G]$, $P_j$ is the number of occurrences of the j-th Chinese word, and $\sum_{j=1}^{G} P_j$ is the total number of occurrences of all the Chinese words;
and storing each Chinese word in association with its word frequency in the basic word bank.
2. The character string comparison method according to claim 1, wherein the performing word segmentation processing on the Chinese character string to be processed in each policy information to obtain the Chinese word comprises:
carrying out single-character segmentation on the Chinese character string to obtain n single characters $a_i$, wherein $i \in [1, n]$ and n is the number of Chinese characters contained in the Chinese character string;
if i is less than n, combining $a_i$ with the adjacent single character $a_{i+1}$ to obtain a temporary word $a_i a_{i+1}$;
if the temporary word $a_i a_{i+1}$ exists in a preset common word bank, combining the temporary word $a_i a_{i+1}$ with the adjacent single character $a_{i+2}$ to obtain a temporary word $a_i a_{i+1} a_{i+2}$, and continuing to search the common word bank using the temporary word $a_i a_{i+1} a_{i+2}$ until a temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the common word bank, wherein $k \in [i+1, n]$;
if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the common word bank, recognizing $a_i a_{i+1} a_{i+2} \ldots a_{k-1}$ as a valid word and starting again from the single character $a_k$: if k is equal to n, recognizing $a_k$ as a valid word; if k is less than n, taking $a_k$ as $a_i$ and continuing to execute the step of, if i is less than n, combining $a_i$ with the adjacent single character $a_{i+1}$ to obtain a temporary word $a_i a_{i+1}$;
if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ exists in the common word bank and k is equal to n, recognizing $a_i a_{i+1} a_{i+2} \ldots a_k$ as a valid word;
and determining the identified effective words as the Chinese words.
3. The character string comparison method according to any one of claims 1 to 2, wherein said calculating the similarity between said first character string and said second character string using a cosine similarity algorithm based on said first number, said second number, the word frequency of each of said first words in said first character string, and the word frequency of each of said second words in said second character string comprises:
constructing a first word vector of the first character string and a second word vector of the second character string according to the first number, the second number, the word frequency of each first word in the first character string and the word frequency of each second word in the second character string as follows:
$Q1[d+n+m] = \{T(W_1), T(W_2), \ldots, T(W_d), T(W_{d+1}), T(W_{d+2}), \ldots, T(W_{d+n}), Z_{n+1}, Z_{n+2}, \ldots, Z_{n+m}\}$
$Q2[d+n+m] = \{T(W_1), T(W_2), \ldots, T(W_d), Z_1, Z_2, \ldots, Z_n, T(W_{n+1}), T(W_{n+2}), \ldots, T(W_{n+m})\}$
wherein $Q1[d+n+m]$ is the first word vector, $Q2[d+n+m]$ is said second word vector, $W_1, W_2, \ldots, W_d, W_{d+1}, W_{d+2}, \ldots, W_{d+n}$ are the first words obtained after the word segmentation processing is carried out on the first character string, $W_1, W_2, \ldots, W_d, W_{n+1}, W_{n+2}, \ldots, W_{n+m}$ are the second words obtained after the word segmentation processing is carried out on the second character string, d + n is the first number, d + m is the second number, $T(W_s)$ is the word frequency of $W_s$, $s \in [1, n+m]$, the value of $Z_b$ is 0, and $b \in [1, n+m]$;
The similarity is calculated according to the following formula:
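$$\varepsilon = \frac{\sum_{c=1}^{d+n+m} Q1[c] \cdot Q2[c]}{\sqrt{\sum_{c=1}^{d+n+m} Q1[c]^2} \cdot \sqrt{\sum_{c=1}^{d+n+m} Q2[c]^2}}$$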
wherein epsilon is the similarity; q1[ c ] is the word frequency of the c-th first word in the first string; q2[ c ] is the word frequency of the c-th second word in the second string.
4. The character string comparison method according to any one of claims 1 to 2, wherein the first character string is an address character string in first policy information, the second character string is an address character string in second policy information, and after confirming that the first character string and the second character string have the same meaning if the similarity is greater than a preset similarity threshold, the character string comparison method further comprises:
and establishing an incidence relation between the policy corresponding to the first policy information and the policy corresponding to the second policy information.
5. A character string comparison device, comprising:
the word bank establishing module is used for establishing a basic word bank according to policy information in a policy database, wherein the basic word bank comprises Chinese words except English characters and Arabic numbers in the policy information and the word frequency of each Chinese word; the word frequency is obtained by calculation according to the occurrence frequency of each Chinese word in the policy database;
the acquisition module is used for acquiring a first character string and a second character string to be matched;
the word segmentation module is used for performing word segmentation processing on the first character string and the second character string respectively to obtain a first number of first words contained in the first character string and a second number of second words contained in the second character string;
the word frequency matching module is used for matching the word frequency of each first word and the word frequency of each second word from the basic word bank;
a similarity calculation module, configured to calculate a similarity between the first character string and the second character string by using a cosine similarity calculation method according to the first number, the second number, the word frequency of each first word in the first character string, and the word frequency of each second word in the second character string;
the comparison module is used for confirming that the meanings of the first character string and the second character string are the same if the similarity is larger than a preset similarity threshold;
the word stock establishing module comprises:
the word segmentation processing submodule is used for carrying out word segmentation processing on the Chinese character strings to be processed in each policy information to obtain the Chinese words;
the statistic submodule is used for counting the occurrence frequency of each Chinese word and the total word number of the Chinese words;
the word frequency calculation submodule is used for calculating the word frequency of each Chinese word according to the following formula:
$$T_m = \frac{P_m}{\sum_{j=1}^{G} P_j}$$
wherein $T_m$ is the word frequency of the m-th Chinese word, $P_m$ is the number of occurrences of the m-th Chinese word, G is the total number of Chinese words, $m \in [1, G]$, $P_j$ is the number of occurrences of the j-th Chinese word, and $\sum_{j=1}^{G} P_j$ is the total number of occurrences of all the Chinese words;
and the association storage submodule is used for storing each Chinese word in association with its word frequency in the basic word bank.
6. The character string comparison device according to claim 5, wherein the word segmentation processing submodule comprises:
a segmentation unit for performing single-character segmentation on the Chinese character string to obtain n single characters $a_i$, wherein $i \in [1, n]$ and n is the number of Chinese characters contained in the Chinese character string;
a combination unit for combining, if i is less than n, $a_i$ with the adjacent single character $a_{i+1}$ to obtain a temporary word $a_i a_{i+1}$;
a cyclic search unit for, if the temporary word $a_i a_{i+1}$ exists in a preset common word bank, combining the temporary word $a_i a_{i+1}$ with the adjacent single character $a_{i+2}$ to obtain a temporary word $a_i a_{i+1} a_{i+2}$, and continuing to search the common word bank using the temporary word $a_i a_{i+1} a_{i+2}$ until a temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the common word bank, wherein $k \in [i+1, n]$;
a first recognition unit for, if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ does not exist in the common word bank, recognizing $a_i a_{i+1} a_{i+2} \ldots a_{k-1}$ as a valid word and starting again from the single character $a_k$: if k is equal to n, recognizing $a_k$ as a valid word; if k is less than n, taking $a_k$ as $a_i$ and returning to the combination unit to continue execution;
a second recognition unit for, if the temporary word $a_i a_{i+1} a_{i+2} \ldots a_k$ exists in the common word bank and k is equal to n, recognizing $a_i a_{i+1} a_{i+2} \ldots a_k$ as a valid word;
and the result determining unit is used for determining the identified effective words as the Chinese words.
7. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the character string comparison method according to any one of claims 1 to 4 when executing the computer program.
8. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the string comparison method according to any one of claims 1 to 4.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810030724.9A CN108363729B (en) 2018-01-12 2018-01-12 Character string comparison method and device, terminal equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108363729A CN108363729A (en) 2018-08-03
CN108363729B true CN108363729B (en) 2021-01-26

Family

ID=63006157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810030724.9A Active CN108363729B (en) 2018-01-12 2018-01-12 Character string comparison method and device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108363729B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165326A (en) * 2018-08-16 2019-01-08 蜜小蜂智慧(北京)科技有限公司 A kind of character string matching method and device
CN109472022A (en) * 2018-10-15 2019-03-15 平安科技(深圳)有限公司 New word identification method and terminal device based on machine learning
CN109829150B (en) * 2018-11-27 2023-11-14 创新先进技术有限公司 Insurance claim text processing method and apparatus
CN109885396A (en) * 2019-01-14 2019-06-14 珠海金山网络游戏科技有限公司 Character string construction method and device in a kind of game application
CN109918679B (en) * 2019-03-22 2023-04-11 成都晟堃科技有限责任公司 Method for analyzing paper policy data
CN110532561B (en) * 2019-08-30 2022-12-09 北京明略软件***有限公司 Data detection method and device, storage medium and electronic device
CN111831869B (en) * 2020-06-30 2023-11-03 深圳价值在线信息科技股份有限公司 Character string duplicate checking method, device, terminal equipment and storage medium
CN112287657B (en) * 2020-11-19 2024-01-30 每日互动股份有限公司 Information matching system based on text similarity

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8954470B2 (en) * 2005-07-15 2015-02-10 Indxit Systems, Inc. Document indexing
CN107016027A (en) * 2016-12-08 2017-08-04 阿里巴巴集团控股有限公司 The method and apparatus for realizing business information fast search

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411583B (en) * 2010-09-20 2013-09-18 阿里巴巴集团控股有限公司 Method and device for matching texts
CN102063424A (en) * 2010-12-24 2011-05-18 上海电机学院 Method for Chinese word segmentation
US20120290330A1 (en) * 2011-05-09 2012-11-15 Hartford Fire Insurance Company System and method for web-based industrial classification
CN103678528B (en) * 2013-12-03 2017-01-18 北京建筑大学 Electronic homework plagiarism preventing system and method based on paragraph plagiarism detection

Also Published As

Publication number Publication date
CN108363729A (en) 2018-08-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant