CN112329390B

CN112329390B - Chinese word similarity detection algorithm based on sound, shape and meaning

Info

Publication number: CN112329390B
Application number: CN202011058506.XA
Authority: CN
Inventors: 黄梦醒; 王华敏; 冯思玲; 冯文龙; 张雨; 吴迪
Original assignee: Hainan University
Current assignee: Hainan University
Priority date: 2020-09-30
Filing date: 2020-09-30
Publication date: 2023-08-04
Anticipated expiration: 2040-09-30
Also published as: CN112329390A

Abstract

The invention provides a Chinese word similarity detection algorithm based on sound, shape and meaning, which detects the overall similarity of Chinese character strings by jointly considering three characteristics of Chinese characters: sound, shape and meaning. First, the pinyin of each Chinese character in the Chinese character strings s1 and s2 is converted into a corresponding sound code, and each Chinese character in s1 and s2 is converted into a shape code. Next, the sound code similarity and the shape code similarity between s1 and s2 are calculated separately, and the meaning similarity of the strings is calculated independently. Finally, sound, shape and meaning are combined, and the overall similarity of s1 and s2 is calculated using contribution parameters set for the application scenario. The algorithm can handle complex application scenarios: it can be applied to duplicate detection of structured data items, especially when manual input errors exist, and to detection of sensitive words disguised by wrongly written characters. Compared with similar Chinese character similarity detection algorithms, it greatly improves the detection of Chinese character string similarity.

Description

Chinese word similarity detection algorithm based on sound, shape and meaning

Technical Field

The invention relates to the technical field of Chinese word similarity, and in particular to a Chinese word similarity detection algorithm based on sound, shape and meaning.

Background

A string similarity algorithm calculates the similarity between two different character strings by some method, and the result is typically expressed as a percentage. String similarity algorithms are used in many computing scenarios and have a wide range of applications, such as data cleansing, user input error correction, recommendation systems, hacking detection systems, automatic scoring systems, web search and DNA sequence matching.

At present, two kinds of algorithm are commonly adopted for detecting the similarity of Chinese character strings. The first is similarity detection based on the pronunciation and shape of Chinese characters: basic information such as pinyin, font structure, stroke number and stroke order is obtained, mathematical expressions are generated from these data according to certain coding rules, and the similarity of the Chinese characters is then obtained by processing the mathematical expressions with a specific algorithm. The second is similarity detection based on the semantics of Chinese characters: Chinese character strings are compared with the words and descriptions recorded in a large knowledge base, and the semantic similarity is calculated according to the distance between sememes in the knowledge base.

Both methods have defects. The former cannot recognize cases in which Chinese character strings differ in length or in character order but have the same meaning. The latter requires the detected words to be completely correct, so it cannot detect the similarity between words that hide wrongly written characters.

Disclosure of Invention

The invention aims to provide a Chinese word similarity detection algorithm based on sound, shape and meaning, which improves the sound code and the shape code of Chinese characters, combines them with the meaning of Chinese characters, and fully considers the three characteristics of sound, shape and meaning when calculating the similarity of Chinese character strings.

In order to solve the technical problems, the invention adopts the following technical scheme: a Chinese word similarity detection algorithm based on sound, shape and meaning combines three major characteristics of sound, shape and meaning of Chinese characters to carry out similarity detection on Chinese character strings, and the method comprises the following steps:

step S1: converting the pinyin of each Chinese character in the input Chinese character strings s1 and s2 into a binary phonetic code;

step S2: converting each Chinese character in the input Chinese character strings s1 and s2 into a shape code according to its font;

step S3: calculating the phonetic code similarity, the shape code similarity and the meaning similarity of the Chinese character strings s1 and s2 respectively;

step S4: weighing the influence of the phonetic code similarity, the shape code similarity and the meaning similarity on the overall similarity, and finally obtaining the overall similarity of the Chinese character strings s1 and s2;

preferably, the specific steps of the step S1 include:

step S11: converting each initial consonant of each Chinese phonetic transcription in the Chinese character strings s1 and s2 into binary numbers according to a Gray code comparison table;

step S12: converting each vowel of each Chinese phonetic alphabet in the Chinese character strings s1 and s2 into binary numbers according to a Gray code comparison table;

step S13: if each Chinese phonetic alphabet in the Chinese character strings s1 and s2 has an intermediate vowel, converting the intermediate vowel into a binary number according to a Gray code comparison table;

step S14: the tone of each Chinese phonetic alphabet in the Chinese character strings s1 and s2 is represented by binary numbers;

preferably, the specific steps of the step S2 include:

step S21, dividing the strokes of a Chinese character into horizontal, vertical, left-falling, right-falling and folding according to Chinese character coding rules, and setting a corresponding code for each of these five stroke types;

step S22, recording the corresponding codes in the order in which the horizontal, vertical, left-falling, right-falling and folding strokes of each Chinese character in the Chinese character strings s1 and s2 are written, thereby obtaining the stroke order codes of s1 and s2 respectively;

step S23, obtaining stroke codes according to the stroke number of each Chinese character in the Chinese character strings s1 and s2, and obtaining structure codes according to the font structure of each Chinese character in the Chinese character strings s1 and s2;

preferably, in the step S3, the step of calculating the phonetic code similarity of the Chinese character strings s1 and s2 comprises the following steps:

step S311, comparing the total length of the characters in the Chinese character strings S1 and S2, setting the character string with shorter total length as min_s and the character string with longer total length as max_s;

step S312, the similarity of the phonetic codes between the single Chinese characters a and b in the min_s and the max_s is calculated by the following formula:

wherein h (a, b) is the phonetic code hamming distance of the Chinese character a, b, len (a) is the phonetic code length of a;

based on the similarity of the phonetic codes among the single Chinese characters, comparing each Chinese character in the min_s with each Chinese character in the max_s one by one, and based on the comparison result of the similarity of the phonetic codes among the single Chinese characters, re-exchanging and sequencing the Chinese characters in the max_s;

step S313, calculating the position difference of the Chinese characters before and after exchange, then calculating the absolute value of the position difference, and obtaining position influence factors based on the absolute value of the position difference, wherein the position influence factors are as follows:

where sum_position is the sum of absolute values of the respective position differences, and len (max_s) is the string length of max_s;

step S314, calculating the edit distance lds_yin(max_s, min_s) between the min_s and the max_s after the position exchange by a weighted edit distance algorithm;

step S315, calculating the phonetic code similarity of the Chinese character strings s1 and s2:

where α is a position contribution parameter.

Preferably, in the step S3, the step of calculating the similarity of the phonetic codes of the chinese character strings S1, S2 further includes: if the total lengths of the characters in the chinese character strings S1 and S2 are equal, one of the chinese character strings S1 and S2 is set as min_s, the other is set as max_s, and step S312 is executed, and if the min_s and the max_s are completely matched, the phonetic similarity of the chinese character strings S1 and S2 is calculated by the following formula:

wherein sum_sim_yin is the sum of the phonetic code similarities of the corresponding Chinese characters;

preferably, in the step S3, the step of calculating the similarity of the shape codes of the chinese character strings S1, S2 includes:

step S321, comparing the total length of the characters in the Chinese character strings s1 and s2, setting the character string with shorter total length as min_s and the character string with longer total length as max_s;

step S322, calculating the similarity of the shape codes between the single Chinese characters a and b in the min_s and the max_s by the following modes:

comparing the stroke order code length of a single Chinese character a and b in the min_s and the max_s, setting the smaller stroke order code length as d, setting the larger stroke order code length as s, setting the longest common substring length of the stroke order code of the single Chinese character a and b as Lcs_len, and calculating the common substring duty ratio of the Chinese characters a and b as follows:

the stroke difference c= |lena-lenb| of the Chinese character a and b, wherein lena is the stroke number of the Chinese character a, lenb is the stroke number of the Chinese character b, and the stroke number contribution ratio of the Chinese character a and b is calculated as follows:

let the position of the longest common substring of Chinese character a in the stroke order code of Chinese character a be a_p, let the position of the longest common substring of Chinese character b in the stroke order code of Chinese character b be b_p, calculate its difference p= |a_p-b_p|, and finally calculate the position contribution ratio of the longest common substring as follows:

the hamming distance ham of the structure code is calculated, and then the structural factors are calculated as:

the similarity of the shape codes of the single Chinese characters a and b in the min_s and the max_s is calculated by the following steps:

wherein alpha is a contribution parameter of a common substring duty ratio, beta is a contribution parameter of a stroke number contribution ratio, i is a contribution parameter of a longest common substring position contribution ratio, j is a contribution parameter of a structural factor,

based on the similarity of the shape codes among the single Chinese characters, comparing the similarity of the shape codes of each Chinese character in the min_s with the similarity of the shape codes of each Chinese character in the max_s one by one, and based on the comparison result of the similarity of the shape codes among the single Chinese characters, re-exchanging and sequencing the Chinese characters in the max_s;

step S323, calculating the position difference of the Chinese characters before and after exchange, then calculating the absolute value of the position difference, and obtaining position influence factors based on the absolute value of the position difference, wherein the position influence factors are as follows:

step S324, calculating the edit distance lds_xing(max_s, min_s) between min_s and max_s after the position exchange by a weighted edit distance algorithm;

step S325, calculating the shape code similarity of the Chinese character strings s1 and s2:

where α is a position contribution parameter.

Preferably, in the step S3, the step of calculating the shape code similarity of the chinese character strings S1, S2 further includes: if the total lengths of the characters in the chinese character strings S1 and S2 are equal, one of the chinese character strings S1 and S2 is set as min_s, the other is set as max_s, and step S322 is executed, and if the min_s and the max_s are completely matched, the shape code similarity of the chinese character strings S1 and S2 is calculated by the following formula:

wherein sum_sim_xing is the sum of the shape code similarities of the corresponding Chinese characters;

preferably, in the step S3, the step of calculating the meaning similarity of the chinese character strings S1, S2 includes:

step S331, performing meaning similarity detection on the Chinese character strings s1 and s2 based on the HowNet algorithm, and if the Chinese character strings s1 and s2 are meaningful, calculating the meaning similarity of s1 and s2 according to the following formula:

$$Sim_{yi} = sim_{yi}$$

wherein $sim_{yi}$ is the output of the HowNet algorithm;

otherwise, go to step S332;

step S332, replacing single Chinese characters in the meaningless Chinese character string with similar Chinese characters and performing meaning detection again, repeating until the meaningful word with the closest similarity is found, setting a replacement penalty parameter f, and calculating the meaning similarity of the Chinese character strings s1 and s2 according to the following formula:

$$Sim_{yi} = sim_{yi} - f$$

preferably, the specific steps of the step S4 include:

step S41: setting a similarity threshold t, if the meaning similarity of the Chinese character strings S1 and S2 is greater than t, taking the meaning similarity as the overall similarity between the Chinese character strings S1 and S2, otherwise, entering step S42;

step S42: the overall similarity of the chinese strings s1, s2 is calculated by:

$$Sim_{zong} = Sim_{yin} \times a + Sim_{xing} \times b + Sim_{yi} \times c - f$$

where a is a contribution value of the similarity of the phonetic code, b is a contribution value of the similarity of the shape code, c is a contribution value of the meaning similarity, and a+b+c=1.

Compared with the prior art, the invention has the following beneficial effects:

The invention provides a Chinese word similarity detection algorithm based on sound, shape and meaning, which combines sound and shape with meaning on the basis of improved sound codes and shape codes, and fully considers the three characteristics of sound, shape and meaning to comprehensively calculate the similarity of Chinese character strings. The algorithm has a wider field of application and can handle more complex application scenarios: it can be applied to duplicate detection of structured data items, especially when manual input errors exist, and to detection of sensitive words disguised by wrongly written characters. The algorithm is simple to implement, places low requirements on the implementation environment, and greatly improves the detection of Chinese character string similarity compared with Chinese character similarity detection algorithms of the same type.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only preferred embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of the Chinese word similarity detection algorithm based on sound, shape and meaning according to the present invention;

Detailed Description

For a better understanding of the technical content of the present invention, specific examples are provided below, and the present invention is further described with reference to the accompanying drawings:

referring to fig. 1, the invention provides a Chinese word similarity detection algorithm based on sound, shape and meaning, which combines three major characteristics of sound, shape and meaning of Chinese characters to detect the similarity of Chinese character strings, and comprises the following steps:

The overall similarity of Chinese character strings is detected by jointly considering the three characteristics of sound, shape and meaning of Chinese characters. Specifically, the pinyin of each Chinese character in the Chinese character strings s1 and s2 is first converted into a corresponding sound code, and each Chinese character in s1 and s2 is converted into a shape code. The sound code similarity and the shape code similarity between s1 and s2 are then calculated separately, and the meaning similarity of the strings is calculated independently. Finally, the sound code similarity, the shape code similarity and the meaning similarity are combined, with contribution parameters set for the application scenario, to calculate the overall similarity of s1 and s2.

Specifically, the specific steps of the step S1 include:

step S11: each initial consonant of each Chinese phonetic alphabet in the Chinese character strings s1 and s2 is converted into a five-bit binary number according to a Gray code comparison table, and the codes of the initial consonants are specifically as shown in the following table 1:

TABLE 1 Pinyin initial consonant encoding

Step S12: each vowel of the pinyin of each Chinese character in the Chinese character strings s1 and s2 is converted into a five-bit binary number according to a Gray code comparison table, and the encoding of the vowels is specifically as follows in table 2:

TABLE 2 Pinyin vowel coding

Step S13: if each Chinese phonetic alphabet in the Chinese character strings s1 and s2 has an intermediate vowel, converting the intermediate vowel into a five-bit binary number according to a Gray code comparison table, wherein the encoding of the intermediate vowel is specifically shown in the table 2;

step S14: the tone of the pinyin of each Chinese character in the Chinese character strings s1 and s2 is represented by a two-bit binary number; there are four tones, the first tone being coded as 00, the second tone as 01, the third tone as 10, and the fourth tone as 11;

in summary, the phonetic code of each Chinese character can be represented by a 17-bit binary number, and the phonetic code length of a Chinese character is 17.
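As a minimal sketch of steps S11 to S14, the following Python assembles a 17-bit phonetic code from three 5-bit Gray-coded fields and a 2-bit tone field. Tables 1 and 2 are not reproduced in this text, so the index tables `INITIAL_INDEX` and `FINAL_INDEX`, the use of index 0 for an absent component, and the field order are assumptions made for illustration only; the tone codes are those given in step S14.

```python
def gray(n: int) -> int:
    """Return the reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def bits(value: int, width: int) -> str:
    """Format value as a zero-padded binary string of the given width."""
    return format(value, f"0{width}b")


# Hypothetical index tables (NOT the patent's Tables 1 and 2).
INITIAL_INDEX = {"": 0, "b": 1, "p": 2, "m": 3, "f": 4}
FINAL_INDEX = {"": 0, "a": 1, "o": 2, "e": 3, "i": 4, "u": 5}

# Tone codes as stated in step S14.
TONE_CODE = {1: "00", 2: "01", 3: "10", 4: "11"}


def phonetic_code(initial: str, final: str, medial: str, tone: int) -> str:
    """Concatenate initial (5 bits), final (5), medial (5) and tone (2)."""
    return (
        bits(gray(INITIAL_INDEX[initial]), 5)
        + bits(gray(FINAL_INDEX[final]), 5)
        + bits(gray(FINAL_INDEX[medial]), 5)  # "" means no intermediate vowel
        + TONE_CODE[tone]
    )


code = phonetic_code("b", "a", "", 1)
print(code, len(code))
```

Under this assumed assignment, neighbouring table entries differ in exactly one bit, which is the usual reason for choosing a Gray code when codes are later compared by Hamming distance.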

Specifically, the specific steps of the step S2 include:

step S21, dividing the structure of the Chinese character into horizontal, vertical, left falling, right falling and folding according to Chinese character coding rules, and setting corresponding codes for the horizontal, vertical, left falling, right falling and folding respectively, wherein the specific coding rules are as follows in table 3:

TABLE 3 Chinese character stroke coding rules

Step S22, recording the corresponding codes in the order in which the horizontal, vertical, left-falling, right-falling and folding strokes of each Chinese character in the Chinese character strings s1 and s2 are written, thereby obtaining the stroke order codes of s1 and s2 respectively; for example, the Chinese character 'you' is composed of left-falling, vertical, horizontal, left-falling, vertical hook and dot strokes, and according to the coding rule, the resulting stroke order code is 321354;

step S23, obtaining stroke codes of Chinese characters according to the stroke number of each Chinese character in the Chinese character strings S1 and S2, and obtaining the existing structure codes of the Chinese characters according to the font structure of each Chinese character in the Chinese character strings S1 and S2;

the shape code of the Chinese character is formed based on the stroke order code, the stroke code and the structure code of the Chinese character, the stroke order code reflects the composition of the Chinese character, and the same stroke order code indicates the composition of the same stroke order, so that the similarity of the Chinese character can be reflected to a certain extent; the stroke code reflects the stroke number of the Chinese character; the structure code describes the overall structure of Chinese character pattern, including up-down structure, left-right structure, semi-surrounding structure, etc.; the shape of the Chinese character is described by adopting the stroke order code, the stroke code and the structure code, so that the shape of the Chinese character can be roughly described in terms of composition factors and composition modes, and the visual shape of the Chinese character can be described by the similarity calculated by the three codes.
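The stroke order code of step S22 can be sketched as a simple lookup. Table 3 is not reproduced in this text, so the mapping below is an assumption inferred from the worked example in step S22 (code 321354): horizontal 1, vertical 2, left-falling 3, right-falling or dot 4, folding or hook 5. The stroke names and the helper `stroke_order_code` are hypothetical.

```python
# Assumed stroke-to-digit mapping, inferred from the example code 321354.
STROKE_CODE = {
    "horizontal": "1",
    "vertical": "2",
    "left-falling": "3",
    "right-falling": "4",
    "dot": "4",            # assumed to share the right-falling class
    "folding": "5",
    "vertical hook": "5",  # assumed to share the folding class
}


def stroke_order_code(strokes: list[str]) -> str:
    """Concatenate the digit of each stroke in writing order."""
    return "".join(STROKE_CODE[name] for name in strokes)


def stroke_count(strokes: list[str]) -> int:
    """The stroke code of step S23 is derived from the number of strokes."""
    return len(strokes)


example = ["left-falling", "vertical", "horizontal",
           "left-falling", "vertical hook", "dot"]
print(stroke_order_code(example), stroke_count(example))
```

The structure code is left out here because the text only says that an existing structure code is used, without giving its values.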

Specifically, in the step S3, the step of calculating the similarity of the phonetic codes of the chinese character strings S1, S2 includes:

wherein h (a, b) is the hamming distance of the phonetic codes of the Chinese characters a, b, and len (a) is the phonetic code length of a, namely 17;

based on the phonetic code similarity between single Chinese characters, each Chinese character in min_s is compared one by one with each Chinese character in max_s, and the Chinese characters in max_s are reordered according to the comparison results. Specifically, each Chinese character in min_s is matched with the Chinese character in max_s that has the highest similarity to it, and the Chinese characters in max_s are then reordered with reference to the order of the Chinese characters in min_s, so that the matched characters in max_s appear in the same order as their counterparts in min_s. For example, "min_s: teacher -- max_s: you teach the teacher" becomes, after comparison and reordering, "min_s: teacher -- max_s: teacher your";
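The matching and reordering step can be sketched as follows. The single-character similarity formula is not reproduced in this text; the sketch assumes the common form 1 - h(a, b) / len(a), which uses only the two quantities the text defines. The greedy one-to-one matching, the tie-breaking rule and the placement of unmatched characters at the end are likewise assumptions of this sketch.

```python
def hamming(code_a: str, code_b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(code_a, code_b))


def char_similarity(code_a: str, code_b: str) -> float:
    """Assumed form of the single-character similarity: 1 - h(a, b) / len(a)."""
    return 1.0 - hamming(code_a, code_b) / len(code_a)


def reorder(min_codes: list[str], max_codes: list[str]) -> list[int]:
    """Return the new order of max_s as a list of original indices.

    Each character of min_s (in order) takes the still-unused character of
    max_s with the highest similarity; leftovers keep their relative order.
    Assumes len(min_codes) <= len(max_codes).
    """
    unused = list(range(len(max_codes)))
    order = []
    for code in min_codes:
        best = max(unused, key=lambda j: char_similarity(code, max_codes[j]))
        order.append(best)
        unused.remove(best)
    return order + unused


# Toy 4-bit codes stand in for the 17-bit phonetic codes.
print(reorder(["1100", "0011"], ["0000", "0011", "1100"]))
```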

step S313, because the character positions of max_s are exchanged during matching, the position factors are needed to be considered, the position differences of the Chinese characters before and after the exchange are calculated, then the absolute value of the position differences is calculated, and the position influencing factors are obtained based on the absolute value of the position differences, wherein the position influencing factors are as follows:

step S314, calculating the edit distance lds_yin(max_s, min_s) between the min_s and the max_s after the position exchange by a weighted edit distance algorithm, specifically:

where α is a position contribution parameter, which may be set to 0.1 according to experience.
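Two of the quantities used in steps S313 and S314 can be sketched in Python. `sum_position` follows the definition given in the text (the sum of the absolute position differences before and after the exchange). The weighting scheme of the edit distance is not reproduced in this text, so the sketch assumes insertion and deletion cost 1 and substitution cost 1 - sim(x, y); this is one common weighting, not necessarily the patent's. The formulas that turn these quantities into the position influence factor and the final similarity are not reproduced here and are therefore not implemented.

```python
from typing import Callable, Sequence


def sum_position(order: Sequence[int]) -> int:
    """Sum of |new index - original index| over the characters of max_s.

    `order[k]` is the original index of the character now at position k.
    """
    return sum(abs(new - old) for new, old in enumerate(order))


def weighted_edit_distance(
    xs: Sequence[str],
    ys: Sequence[str],
    sim: Callable[[str, str], float],
) -> float:
    """Levenshtein-style distance with similarity-weighted substitution.

    Assumed costs: insert = delete = 1, substitute = 1 - sim(x, y).
    """
    m, n = len(xs), len(ys)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = float(i)
    for j in range(1, n + 1):
        dist[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitute = dist[i - 1][j - 1] + (1.0 - sim(xs[i - 1], ys[j - 1]))
            delete = dist[i - 1][j] + 1.0
            insert = dist[i][j - 1] + 1.0
            dist[i][j] = min(substitute, delete, insert)
    return dist[m][n]


def exact(x: str, y: str) -> float:
    """Stand-in similarity: 1 for identical items, 0 otherwise."""
    return 1.0 if x == y else 0.0


print(sum_position([2, 1, 0]))
print(weighted_edit_distance("kitten", "sitting", exact))
```

With the stand-in `exact` similarity the function reduces to the ordinary Levenshtein distance; passing a per-character phonetic or shape similarity instead makes near-matching characters cheaper to substitute.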

Specifically, in the step S3, the step of calculating the similarity of the phonetic codes of the chinese character strings S1, S2 further includes: if the total lengths of the characters in the chinese character strings S1 and S2 are equal, one of the chinese character strings S1 and S2 is set to min_s, the other is set to max_s, and step S312 is executed, if min_s and max_s are completely matched, i.e. the similarity of each chinese character matching set is 1, for example, "min_s: mutual—max_s: mutually ", the phonetic similarity of the Chinese character strings s1, s2 is calculated by the following formula without considering the position factor:

specifically, in the step S3, the computation of the shape code similarity comprehensively considers the influence of four factors of the longest common substring duty ratio, the longest common substring position difference, the strokes and the structure, and designs a Chinese character shape code similarity detection algorithm based on the improved shape code. The specific calculation of the shape code similarity of the Chinese character strings s1 and s2 comprises the following steps:

considering the longest public substring duty ratio, comparing the stroke order code length of a single Chinese character a and b in min_s and max_s, setting a smaller stroke order code length as d, setting a larger stroke order code length as s, setting the longest public substring length of the stroke order code of the single Chinese character a and b as Lcs_len, and calculating the longest public substring duty ratio of the single Chinese character a and b as follows:

considering strokes, the stroke difference c= |lena-lenb| of the Chinese characters a and b, wherein lena is the stroke number of the Chinese character a, lenb is the stroke number of the Chinese character b, and the stroke number contribution ratio of the Chinese characters a and b is calculated as follows:

taking the position difference of the longest public substring into consideration, setting the position of the longest public substring of the Chinese character a in the stroke order code of the Chinese character a as a_p, setting the position of the longest public substring of the Chinese character b in the stroke order code of the Chinese character b as b_p, calculating the difference p= |a_p-b_p|, and finally calculating the position contribution ratio of the longest public substring as follows:

considering the structure, the hamming distance ham of the structure code is calculated, and then the structural factors are calculated as:

to sum up: the similarity of the shape codes of the single Chinese characters a and b in the min_s and the max_s is calculated by the following steps:

based on the similarity of the shape codes among the single Chinese characters, comparing the similarity of the shape codes of each Chinese character in the min_s with each Chinese character in the max_s one by one, and based on the comparison result of the similarity of the shape codes among the single Chinese characters, re-exchanging and ordering the Chinese characters in the max_s, wherein the method specifically comprises the following steps: and matching each Chinese character in the min_s with one Chinese character with the highest similarity in the max_s, and then referring to the sequence of the Chinese characters in the min_s, and re-exchanging the sequence of the Chinese characters in the max_s, so that the sequence of the Chinese characters in the max_s is the same as the sequence of the Chinese characters matched and corresponding in the min_s.
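The raw quantities that feed the shape code similarity (d, s, Lcs_len, c, p and ham) can be sketched as follows. The formulas that turn them into the four ratios and the weighted sum are not reproduced in this text, so only the quantities themselves are computed. Zero-based positions, taking the first longest common substring when several exist, deriving the stroke number from the length of the stroke order code, and the two-bit structure codes in the example are assumptions of this sketch.

```python
def longest_common_substring(x: str, y: str) -> tuple[int, int, int]:
    """Return (length, start in x, start in y) of the first longest
    common contiguous substring, using a rolling dynamic-programming row."""
    best_len, best_i, best_j = 0, 0, 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_i, best_j = i - cur[j], j - cur[j]
        prev = cur
    return best_len, best_i, best_j


def shape_features(order_a: str, order_b: str,
                   struct_a: str, struct_b: str) -> dict[str, int]:
    """Collect the inputs of the shape code similarity for characters a, b."""
    lcs_len, a_p, b_p = longest_common_substring(order_a, order_b)
    return {
        "d": min(len(order_a), len(order_b)),    # smaller stroke order code length
        "s": max(len(order_a), len(order_b)),    # larger stroke order code length
        "Lcs_len": lcs_len,
        "c": abs(len(order_a) - len(order_b)),   # stroke difference |lena - lenb|
        "p": abs(a_p - b_p),                     # position difference |a_p - b_p|
        "ham": sum(u != v for u, v in zip(struct_a, struct_b)),
    }


print(shape_features("321354", "32154", "01", "11"))
```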

Step S323, because the character positions of max_s are exchanged during matching, the position factors are needed to be considered, the position difference of the Chinese character before and after the exchange is calculated, then the absolute value of the position difference is calculated, and the position influencing factors are obtained based on the absolute value of the position difference, wherein the position influencing factors are as follows:

step S324, calculating the edit distance lds_xing(max_s, min_s) between min_s and max_s after the position exchange by a weighted edit distance algorithm, specifically:

Specifically, in the step S3, the step of calculating the similarity of the shape codes of the chinese character strings S1, S2 further includes: if the total lengths of the characters in the chinese character strings S1 and S2 are equal, one of the chinese character strings S1 and S2 is set as min_s, the other is set as max_s, and step S322 is executed, if min_s and max_s are completely matched, i.e. the similarity of each chinese character matching set is 1, for example, "min_s: mutual—max_s: mutually ", the shape code similarity of the Chinese character strings s1, s2 is calculated by the following formula without considering the position factor:

specifically, in the step S3, the step of calculating the meaning similarity of the chinese character strings S1 and S2 includes:

step S331, meaning similarity detection is performed on the Chinese character strings s1 and s2 based on the existing HowNet algorithm. The HowNet algorithm selects word senses from various dictionaries and language knowledge bases, annotates words with these senses to build a sense-based knowledge base, and calculates the meaning similarity according to the distance between sememes in the knowledge base. If the Chinese character strings s1 and s2 are meaningful, the meaning similarity of s1 and s2 is calculated and output by the following formula:

$$Sim_{yi} = sim_{yi}$$

wherein $sim_{yi}$ is the output of the HowNet algorithm;

otherwise, go to step S332;

step S332, when a word contains wrongly written characters, the word is meaningless; to judge its similarity, it must first be converted into the meaningful word most similar to it before the comparison is made. Single Chinese characters in the meaningless Chinese character string are therefore replaced with similar Chinese characters and meaning detection is performed again, repeating until the word with the closest similarity is found. A replacement penalty parameter f is then set, and the meaning similarity of the Chinese character strings s1 and s2 is calculated by the following formula:

$$Sim_{yi} = sim_{yi} - f$$

wherein, according to experimental experience, the replacement penalty parameter f is set to 0.1.
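The control flow of steps S331 and S332 can be sketched as below. The text does not specify a HowNet interface, so `hownet_sim` (returning `None` for a meaningless input) and `candidates` (returning variants of a string with single characters replaced by similar ones) are hypothetical callables supplied by the caller. The toy lexicon and the clamp at 0 are additions of this sketch, not part of the source.

```python
from typing import Callable, Iterable, Optional


def meaning_similarity(
    s1: str,
    s2: str,
    hownet_sim: Callable[[str, str], Optional[float]],
    candidates: Callable[[str], Iterable[str]],
    f: float = 0.1,
) -> float:
    """Sim_yi = sim_yi if both strings are meaningful, else best sim_yi - f."""
    direct = hownet_sim(s1, s2)
    if direct is not None:  # step S331: both strings are meaningful
        return direct
    best = 0.0              # step S332: try single-character replacements
    for c1 in [s1, *candidates(s1)]:
        for c2 in [s2, *candidates(s2)]:
            score = hownet_sim(c1, c2)
            if score is not None:
                best = max(best, score)
    return max(best - f, 0.0)  # clamp at 0 is an addition of this sketch


# Toy stand-ins for a real knowledge base and a similar-character generator.
LEXICON = {("老师", "教师"): 0.9}
VARIANTS = {"老帅": ["老师"]}

result = meaning_similarity(
    "老帅", "教师",
    hownet_sim=lambda a, b: LEXICON.get((a, b)),
    candidates=lambda s: VARIANTS.get(s, []),
)
print(round(result, 3))
```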

Specifically, the specific steps of the step S4 include:

step S41: in the algorithm design, considering that when the same meaning is expressed by using completely different words, the phonetic similarity and the shape code similarity have no reference value, so that a similarity threshold t is set, if the meaning similarity of the Chinese character strings S1 and S2 is greater than t, the meaning of the Chinese character strings S1 and S2 can be considered to be highly similar, the similarity in terms of pinyin and font between the Chinese character strings S1 and S2 can be not considered, and the meaning similarity is taken as the overall similarity between the Chinese character strings S1 and S2, otherwise, the step S42 is entered;

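step S42: the overall similarity of the Chinese character strings s1 and s2 is calculated by:

$$Sim_{zong} = Sim_{yin} \times a + Sim_{xing} \times b + Sim_{yi} \times c - f$$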

where a is the contribution value of the phonetic code similarity, b is the contribution value of the shape code similarity, c is the contribution value of the meaning similarity, and a+b+c=1. The contribution parameters a, b and c can be set according to the application scenario, so that the comparison emphasizes the relevant characteristic: for example, a is set larger when detecting words with similar pronunciation, b is set larger when detecting words with similar shape, and c is set larger when detecting words with similar meaning.
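Steps S41 and S42 combine into a short function. The sketch follows the stated formula and the constraint a+b+c=1; the text gives no value for the threshold t, so the value 0.9 in the usage lines is purely illustrative. The text subtracts f in the overall formula without saying whether it applies when no replacement took place, so f is passed explicitly rather than defaulted.

```python
def overall_similarity(
    sim_yin: float,
    sim_xing: float,
    sim_yi: float,
    a: float,
    b: float,
    c: float,
    t: float,
    f: float,
) -> float:
    """Return Sim_zong for two strings given their three similarities."""
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("contribution values must satisfy a + b + c = 1")
    if sim_yi > t:
        # Step S41: highly similar meaning, so sound and shape are ignored.
        return sim_yi
    # Step S42: weighted combination minus the replacement penalty.
    return sim_yin * a + sim_xing * b + sim_yi * c - f


# Meaning similarity above the (illustrative) threshold is returned as is.
print(overall_similarity(0.8, 0.6, 0.95, a=0.4, b=0.4, c=0.2, t=0.9, f=0.1))
# Otherwise the weighted sum is used; here a is set larger to favour sound.
print(overall_similarity(1.0, 0.5, 0.5, a=0.5, b=0.25, c=0.25, t=0.9, f=0.0))
```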

It is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims

1. A Chinese word similarity detection algorithm based on sound, shape and meaning is characterized by combining three characteristics of sound, shape and meaning of Chinese characters to carry out similarity detection on Chinese character strings, and comprises the following steps:

the specific steps of the step S2 include:

in the step S3, the step of calculating the similarity of the phonetic codes of the chinese character strings S1, S2 includes:

where α is a position contribution parameter.

2. The method of claim 1, wherein the specific steps of step S1 include:

step S14: the tone of each Chinese phonetic alphabet in the Chinese character strings s1 and s2 is represented by binary number.

3. The method according to claim 1, wherein in the step S3, the step of calculating the phonetic similarity of the chinese character strings S1 and S2 further comprises: if the total lengths of the characters in the chinese character strings S1 and S2 are equal, one of the chinese character strings S1 and S2 is set as min_s, the other is set as max_s, and step S312 is executed, and if the min_s and the max_s are completely matched, the phonetic similarity of the chinese character strings S1 and S2 is calculated by the following formula:

wherein sum_sim_yin is the sum of the phonetic code similarities of the corresponding Chinese characters.

4. A Chinese word similarity detection algorithm based on phonetic and shape senses as claimed in claim 3 wherein in said step S3, the step of calculating the shape code similarity of the Chinese character strings S1, S2 comprises:

where α is a position contribution parameter.

5. The method according to claim 4, wherein in the step S3, the step of calculating the shape code similarity of the chinese character strings S1 and S2 further comprises: if the total lengths of the characters in the chinese character strings S1 and S2 are equal, one of the chinese character strings S1 and S2 is set as min_s, the other is set as max_s, and step S322 is executed, and if the min_s and the max_s are completely matched, the shape code similarity of the chinese character strings S1 and S2 is calculated by the following formula:

wherein sum_sim_xing is the sum of the shape code similarities of the corresponding Chinese characters.

6. The method according to claim 5, wherein in the step S3, the step of calculating the meaning similarity of the chinese character strings S1 and S2 includes:

wherein $sim_{yi}$ is the output of the HowNet algorithm;

otherwise, go to step S332;

$$Sim_{yi} = sim_{yi} - f$$

7. The method of claim 6, wherein the specific steps of step S4 include:

$$Sim_{zong} = Sim_{yin} \times a + Sim_{xing} \times b + Sim_{yi} \times c - f$$