CN112329390B - Chinese word similarity detection algorithm based on sound, shape and meaning
- Publication number
- CN112329390B (application CN202011058506.XA)
- Authority
- CN
- China
- Prior art keywords
- chinese character
- similarity
- chinese
- character strings
- max
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Document Processing Apparatus (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a Chinese word similarity detection algorithm based on sound, shape and meaning, which detects the overall similarity of Chinese character strings by jointly considering these three characteristics of Chinese characters. First, the pinyin of each Chinese character in the Chinese character strings s1 and s2 is converted into a corresponding phonetic code, and each Chinese character in s1 and s2 is converted into a shape code. Then the phonetic code similarity and the shape code similarity between s1 and s2 are calculated separately, and the meaning similarity of the Chinese character strings is calculated independently. Finally, sound, shape and meaning are combined, and the overall similarity of s1 and s2 is calculated using contribution parameters set for the application scenario. The algorithm suits complex application scenarios: it can be applied to detecting duplicate structured data items, especially when manual input errors are present, and to detecting sensitive words disguised by wrongly written characters. Compared with similar Chinese character similarity detection algorithms, it greatly improves the detection of Chinese character string similarity.
Description
Technical Field
The invention relates to the technical field of Chinese word similarity, and in particular to a Chinese word similarity detection algorithm based on sound, shape and meaning.
Background
A string similarity algorithm calculates the similarity between two different character strings by some method. The similarity is typically expressed as a percentage. String similarity algorithms are used in many computing scenarios and have a wide range of applications, such as data cleansing, user input error correction, recommendation systems, hacking detection systems, automatic scoring systems, web search and DNA sequence matching. At present, two kinds of algorithm are commonly used to detect the similarity of Chinese character strings. The first is similarity detection based on the pronunciation and shape of Chinese characters: basic information such as the pinyin, glyph structure, stroke count and stroke order of the Chinese characters is obtained, mathematical expressions are generated from this data according to certain coding rules, and the similarity of the Chinese characters is then obtained by processing the mathematical expressions with a specific algorithm. The second is similarity detection based on the semantics of Chinese characters: the Chinese character strings are compared with the words and descriptions recorded in a large knowledge base, and the semantic similarity is then calculated from the distance between sememes in the knowledge base. Both methods have defects, however. The former cannot recognize cases in which the Chinese character strings differ in length or character order but have the same meaning, and the latter cannot detect similarity between words that hide wrongly written characters, because it requires the words being compared to be completely correct.
Disclosure of Invention
The invention aims to provide a Chinese word similarity detection algorithm based on sound, shape and meaning which, on the basis of improved phonetic codes and shape codes for Chinese characters, additionally takes the meaning of Chinese characters into account, so that all three characteristics (sound, shape and meaning) are considered when calculating the similarity of Chinese character strings.
To solve the above technical problem, the invention adopts the following technical solution: a Chinese word similarity detection algorithm based on sound, shape and meaning, which combines the three major characteristics of Chinese characters (sound, shape and meaning) to perform similarity detection on Chinese character strings, comprising the following steps:
step S1: converting the pinyin of each Chinese character in the input Chinese character strings s1 and s2 into a binary phonetic code;
step S2: converting each Chinese character in the input Chinese character strings s1 and s2 into a shape code according to its glyph;
step S3: calculating the phonetic code similarity, the shape code similarity and the meaning similarity of the Chinese character strings s1 and s2 respectively;
step S4: weighing the influence of the phonetic code similarity, the shape code similarity and the meaning similarity on the overall similarity, and finally obtaining the overall similarity of the Chinese character strings s1 and s2;
preferably, the specific steps of step S1 include:
step S11: converting the initial of the pinyin of each Chinese character in the Chinese character strings s1 and s2 into a binary number according to a Gray code lookup table;
step S12: converting the final of the pinyin of each Chinese character in the Chinese character strings s1 and s2 into a binary number according to a Gray code lookup table;
step S13: if the pinyin of a Chinese character in the Chinese character strings s1 and s2 has a medial, converting the medial into a binary number according to a Gray code lookup table;
step S14: representing the tone of the pinyin of each Chinese character in the Chinese character strings s1 and s2 as a binary number;
preferably, the specific steps of step S2 include:
step S21: dividing the strokes of Chinese characters into horizontal, vertical, left-falling, right-falling and turning according to Chinese character coding rules, and assigning a corresponding code to each of these five stroke types;
step S22: recording the corresponding codes in the order in which the horizontal, vertical, left-falling, right-falling and turning strokes of each Chinese character in the Chinese character strings s1 and s2 are written, thereby obtaining the stroke order codes of s1 and s2 respectively;
step S23: obtaining stroke codes from the stroke count of each Chinese character in the Chinese character strings s1 and s2, and obtaining structure codes from the glyph structure of each Chinese character in s1 and s2;
preferably, in step S3, calculating the phonetic code similarity of the Chinese character strings s1 and s2 comprises the following steps:
step S311: comparing the total character lengths of the Chinese character strings s1 and s2, and denoting the shorter string as min_s and the longer string as max_s;
step S312: calculating the phonetic code similarity between single Chinese characters a and b in min_s and max_s by the following formula:
wherein h(a, b) is the Hamming distance between the phonetic codes of the Chinese characters a and b, and len(a) is the phonetic code length of a;
based on the phonetic code similarity between single Chinese characters, comparing each Chinese character in min_s with each Chinese character in max_s one by one, and reordering the Chinese characters in max_s by exchange according to the comparison results;
step S313: calculating the position difference of each Chinese character before and after the exchange, taking the absolute value of each position difference, and obtaining a position influence factor from these absolute values, the position influence factor being:
where sum_position is the sum of the absolute values of the position differences, and len(max_s) is the string length of max_s;
step S314: calculating the edit distance lds_yin(max_s, min_s) between min_s and the reordered max_s by a weighted edit distance algorithm;
step S315: calculating the phonetic code similarity of the Chinese character strings s1 and s2:
where α is a position contribution parameter.
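The weights used by the weighted edit distance of step S314 are not reproduced in this text. The following is a minimal sketch only, assuming that insertions and deletions cost 1 and that substituting character a for character b costs 1 - sim(a, b), where sim is a per-character similarity in [0, 1]. The names `weighted_edit_distance` and `sim` are hypothetical and do not come from the patent.

```python
from typing import Callable, Sequence


def weighted_edit_distance(
    max_s: Sequence[str],
    min_s: Sequence[str],
    sim: Callable[[str, str], float],
) -> float:
    """Edit distance whose substitution cost is 1 - sim(a, b).

    The cost scheme is an assumption made for illustration; the patent's
    own weighting is not available in this text.
    """
    rows, cols = len(max_s) + 1, len(min_s) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)  # delete i characters
    for j in range(1, cols):
        dist[0][j] = float(j)  # insert j characters
    for i in range(1, rows):
        for j in range(1, cols):
            substitute = dist[i - 1][j - 1] + (1.0 - sim(max_s[i - 1], min_s[j - 1]))
            dist[i][j] = min(dist[i - 1][j] + 1.0, dist[i][j - 1] + 1.0, substitute)
    return dist[-1][-1]
```

With a similarity that returns 1 for identical characters and 0 otherwise, this reduces to the ordinary Levenshtein distance; a graded similarity makes near-homophones cheaper to substitute than unrelated characters.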
Preferably, in step S3, calculating the phonetic code similarity of the Chinese character strings s1 and s2 further includes: if the total character lengths of s1 and s2 are equal, taking one of them as min_s and the other as max_s and executing step S312; if min_s and max_s match completely, the phonetic code similarity of s1 and s2 is calculated by the following formula:
wherein sum_sim_yin is the sum of the phonetic code similarities of the corresponding Chinese characters;
preferably, in step S3, calculating the shape code similarity of the Chinese character strings s1 and s2 includes:
step S321: comparing the total character lengths of the Chinese character strings s1 and s2, and denoting the shorter string as min_s and the longer string as max_s;
step S322: calculating the shape code similarity between single Chinese characters a and b in min_s and max_s as follows:
comparing the stroke order code lengths of the single Chinese characters a and b in min_s and max_s, denoting the smaller stroke order code length as d and the larger as s, denoting the length of the longest common substring of the stroke order codes of a and b as Lcs_len, and calculating the longest common substring proportion of a and b as:
the stroke difference of the Chinese characters a and b is c = |lena - lenb|, where lena is the stroke count of a and lenb is the stroke count of b, and the stroke count contribution ratio of a and b is calculated as:
denoting the position of the longest common substring within the stroke order code of a as a_p and its position within the stroke order code of b as b_p, calculating the difference p = |a_p - b_p|, and finally calculating the position contribution ratio of the longest common substring as:
calculating the Hamming distance ham of the structure codes, and then calculating the structure factor as:
the shape code similarity of the single Chinese characters a and b in min_s and max_s is then calculated as:
wherein α is the contribution parameter of the longest common substring proportion, β is the contribution parameter of the stroke count contribution ratio, i is the contribution parameter of the longest common substring position contribution ratio, and j is the contribution parameter of the structure factor;
based on the shape code similarity between single Chinese characters, comparing each Chinese character in min_s with each Chinese character in max_s one by one, and reordering the Chinese characters in max_s by exchange according to the comparison results;
step S323: calculating the position difference of each Chinese character before and after the exchange, taking the absolute value of each position difference, and obtaining a position influence factor from these absolute values, the position influence factor being:
where sum_position is the sum of the absolute values of the position differences, and len(max_s) is the string length of max_s;
step S324: calculating the edit distance lds_xing(max_s, min_s) between min_s and the reordered max_s by a weighted edit distance algorithm;
step S325: calculating the shape code similarity of the Chinese character strings s1 and s2:
where α is a position contribution parameter.
Preferably, in step S3, calculating the shape code similarity of the Chinese character strings s1 and s2 further includes: if the total character lengths of s1 and s2 are equal, taking one of them as min_s and the other as max_s and executing step S322; if min_s and max_s match completely, the shape code similarity of s1 and s2 is calculated by the following formula:
wherein sum_sim_xing is the sum of the shape code similarities of the corresponding Chinese characters;
preferably, in step S3, calculating the meaning similarity of the Chinese character strings s1 and s2 includes:
step S331: performing meaning similarity detection on the Chinese character strings s1 and s2 based on the HowNet algorithm, and, if s1 and s2 are meaningful, calculating the meaning similarity of s1 and s2 by the following formula: Sim_yi = sim_yi
wherein sim_yi is the output of the HowNet algorithm;
otherwise, going to step S332;
step S332: performing single-character replacement with similar Chinese characters on the meaningless Chinese character strings s1 and s2 and testing the meaning again, looping until the word with the closest similarity is found, setting a replacement penalty parameter f, and calculating the meaning similarity of s1 and s2 by the following formula:
Sim_yi = sim_yi - f
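Steps S331 and S332 can be sketched as follows. This is an illustrative outline under stated assumptions, not the patented implementation: `hownet_sim`, `is_meaningful` and `candidates` are hypothetical callables standing in for a HowNet-style similarity function, a dictionary check, and a generator of meaningful words obtained by replacing single characters with similar ones. Returning 0.0 when no candidate is found is also an assumption.

```python
from typing import Callable, Iterable, Optional


def meaning_similarity(
    s1: str,
    s2: str,
    hownet_sim: Callable[[str, str], float],    # hypothetical HowNet-style scorer
    is_meaningful: Callable[[str], bool],       # hypothetical dictionary check
    candidates: Callable[[str], Iterable[str]], # hypothetical replacement generator
    f: float,                                   # replacement penalty parameter
) -> float:
    # Step S331: both strings are meaningful, so use the score directly.
    if is_meaningful(s1) and is_meaningful(s2):
        return hownet_sim(s1, s2)

    # Step S332: try replacements for whichever string is meaningless and
    # keep the closest pair, then subtract the penalty f.
    options1 = [s1] if is_meaningful(s1) else list(candidates(s1))
    options2 = [s2] if is_meaningful(s2) else list(candidates(s2))
    best: Optional[float] = None
    for w1 in options1:
        for w2 in options2:
            score = hownet_sim(w1, w2)
            if best is None or score > best:
                best = score
    if best is None:
        return 0.0  # assumption: no usable replacement means no similarity
    return best - f
```

The penalty f keeps a string that needed a character replacement from scoring as high as a string that was correct to begin with.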
preferably, the specific steps of step S4 include:
step S41: setting a similarity threshold t; if the meaning similarity of the Chinese character strings s1 and s2 is greater than t, taking the meaning similarity as the overall similarity between s1 and s2; otherwise, going to step S42;
step S42: calculating the overall similarity of the Chinese character strings s1 and s2 by:
Sim_zong = Sim_yin × a + Sim_xing × b + Sim_yi × c - f
where a is the contribution value of the phonetic code similarity, b is the contribution value of the shape code similarity, c is the contribution value of the meaning similarity, and a + b + c = 1.
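Steps S41 and S42 translate almost directly into code. The sketch below follows the formula as printed; the function name `overall_similarity` is hypothetical, and the default of 0 for f (no replacement was needed) is an assumption.

```python
def overall_similarity(
    sim_yin: float,   # phonetic code similarity
    sim_xing: float,  # shape code similarity
    sim_yi: float,    # meaning similarity
    a: float,
    b: float,
    c: float,
    t: float,         # similarity threshold from step S41
    f: float = 0.0,   # replacement penalty; 0 is an assumed default
) -> float:
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("contribution values must satisfy a + b + c = 1")
    # Step S41: a sufficiently high meaning similarity decides the result.
    if sim_yi > t:
        return sim_yi
    # Step S42: weighted combination of the three similarities minus f.
    return sim_yin * a + sim_xing * b + sim_yi * c - f
```

For instance, with illustrative weights a = 0.4, b = 0.4, c = 0.2 and threshold t = 0.9, a meaning similarity of 0.95 is returned unchanged, whereas a meaning similarity of 0.5 falls through to the weighted sum.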
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a Chinese word similarity detection algorithm based on sound, shape and meaning which, on the basis of improved phonetic and shape codes, combines sound and shape with meaning, so that all three characteristics of a Chinese character string are considered when calculating its similarity; the algorithm has a wider field of application and can meet more complex application scenarios: it can be applied to detecting duplicate structured data items, especially when manual input errors are present, and to detecting sensitive words disguised by wrongly written characters. The algorithm is simple to implement, places low demands on the implementation environment, and greatly improves the detection of Chinese character string similarity compared with Chinese character similarity detection algorithms of the same type.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below show only preferred embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of the Chinese word similarity detection algorithm based on sound, shape and meaning according to the present invention;
Detailed Description
For a better understanding of the technical content of the present invention, specific embodiments are provided below, and the invention is further described with reference to the accompanying drawing:
referring to FIG. 1, the invention provides a Chinese word similarity detection algorithm based on sound, shape and meaning, which combines the three major characteristics of Chinese characters (sound, shape and meaning) to detect the similarity of Chinese character strings, and comprises the following steps:
step S1: converting the pinyin of each Chinese character in the input Chinese character strings s1 and s2 into a binary phonetic code;
step S2: converting each Chinese character in the input Chinese character strings s1 and s2 into a shape code according to its glyph;
step S3: calculating the phonetic code similarity, the shape code similarity and the meaning similarity of the Chinese character strings s1 and s2 respectively;
step S4: weighing the influence of the phonetic code similarity, the shape code similarity and the meaning similarity on the overall similarity, and finally obtaining the overall similarity of the Chinese character strings s1 and s2;
the method detects the overall similarity of Chinese character strings by jointly considering the three characteristics of sound, shape and meaning. Specifically, the pinyin of each Chinese character in s1 and s2 is first converted into a corresponding phonetic code, and each Chinese character in s1 and s2 is converted into a shape code; the phonetic code similarity and the shape code similarity between s1 and s2 are then calculated separately; next, the meaning similarity of the Chinese character strings is calculated independently; finally, the phonetic code similarity, the shape code similarity and the meaning similarity are combined, using contribution parameters set for the application scenario, to calculate the overall similarity of s1 and s2.
Specifically, the steps of step S1 include:
step S11: the initial of the pinyin of each Chinese character in the Chinese character strings s1 and s2 is converted into a five-bit binary number according to a Gray code lookup table; the codes of the initials are given in Table 1:
Table 1 Pinyin initial coding
Step S12: the final of the pinyin of each Chinese character in the Chinese character strings s1 and s2 is converted into a five-bit binary number according to a Gray code lookup table; the codes of the finals are given in Table 2:
Table 2 Pinyin final coding
Step S13: if the pinyin of a Chinese character in the Chinese character strings s1 and s2 has a medial, the medial is converted into a five-bit binary number according to a Gray code lookup table; the codes of the medials are also given in Table 2;
step S14: the tone of the pinyin of each Chinese character in the Chinese character strings s1 and s2 is represented by a two-bit binary number; there are four tones: the first tone is coded 00, the second tone 01, the third tone 10 and the fourth tone 11;
in summary, the phonetic code of each Chinese character can be represented by a 17-bit binary number (five bits each for the initial, the final and the medial, plus two bits for the tone), so the phonetic code length of a Chinese character is 17.
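The 17-bit layout can be illustrated with a short sketch. Tables 1 and 2 are not reproduced in this text, so the orderings of `INITIALS` and `FINALS` below, the use of the empty string for an absent component, and the concatenation order (initial, final, medial, tone) are all assumptions made for illustration; only the bit widths and the four tone codes come from the description.

```python
def gray_code(n: int, bits: int = 5) -> str:
    """Return the reflected binary Gray code of n as a fixed-width bit string."""
    return format(n ^ (n >> 1), f"0{bits}b")


# Hypothetical orderings; the patent's Tables 1 and 2 define the real ones.
INITIALS = ["", "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
            "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"]
FINALS = ["", "a", "o", "e", "i", "u", "v", "ai", "ei", "ao", "ou",
          "an", "en", "ang", "eng", "ong", "er"]
TONES = {1: "00", 2: "01", 3: "10", 4: "11"}  # tone codes from step S14


def phonetic_code(initial: str, final: str, medial: str, tone: int) -> str:
    """Build a 17-bit phonetic code: 5 + 5 + 5 bits plus a 2-bit tone."""
    return (
        gray_code(INITIALS.index(initial))
        + gray_code(FINALS.index(final))
        + gray_code(FINALS.index(medial))  # medials share the finals table
        + TONES[tone]
    )
```

Gray coding means that entries adjacent in the lookup table differ in exactly one bit, so the Hamming distance between two codes is small when the table places similar-sounding components next to each other.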
Specifically, the steps of step S2 include:
step S21: the strokes of Chinese characters are divided into horizontal, vertical, left-falling, right-falling and turning according to Chinese character coding rules, and a corresponding code is assigned to each of these five stroke types; the specific coding rules are given in Table 3:
Table 3 Chinese character stroke coding rules
Step S22: the corresponding codes are recorded in the order in which the horizontal, vertical, left-falling, right-falling and turning strokes of each Chinese character in the Chinese character strings s1 and s2 are written, giving the stroke order codes of s1 and s2 respectively; for example, the Chinese character 'you' is composed of the strokes left-falling, vertical, horizontal, left-falling, vertical hook and dot, and looking these up in the coding rules gives the stroke order code 321354;
step S23: the stroke code of each Chinese character in the Chinese character strings s1 and s2 is obtained from its stroke count, and the existing structure code of each Chinese character in s1 and s2 is obtained from its glyph structure;
the shape code of a Chinese character is formed from its stroke order code, stroke code and structure code. The stroke order code reflects how the character is composed: identical stroke order codes indicate identical stroke sequences, so the code reflects character similarity to a certain extent. The stroke code reflects the stroke count of the character. The structure code describes the overall structure of the glyph, such as top-bottom, left-right or semi-enclosed. Together, the three codes describe the shape of a Chinese character in terms of both its components and the way they are arranged, so the similarity calculated from them reflects the visual similarity of the characters.
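Table 3 is not reproduced in this text, but the worked example (321354) implies that horizontal maps to 1, vertical to 2, left-falling to 3, dot to 4 and vertical hook to 5. The sketch below uses that inferred mapping; assigning right-falling to 4 and turning to 5 follows the common five-stroke convention and is an assumption, as are the names `STROKE_CODES` and `stroke_order_code`.

```python
# Stroke-to-digit mapping inferred from the worked example in the text.
# "right-falling" -> 4 and "turning" -> 5 are assumptions based on the
# common five-stroke convention, not on the patent's Table 3.
STROKE_CODES = {
    "horizontal": "1",
    "vertical": "2",
    "left-falling": "3",
    "right-falling": "4",
    "dot": "4",
    "turning": "5",
    "vertical hook": "5",
}


def stroke_order_code(strokes: list[str]) -> str:
    """Concatenate stroke digits in writing order."""
    return "".join(STROKE_CODES[s] for s in strokes)


def stroke_code(strokes: list[str]) -> int:
    """The stroke code is derived from the stroke count."""
    return len(strokes)
```

Applied to the stroke sequence given in the example (left-falling, vertical, horizontal, left-falling, vertical hook, dot), `stroke_order_code` yields 321354, matching the text.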
Specifically, in step S3, calculating the phonetic code similarity of the Chinese character strings s1 and s2 includes:
step S311: comparing the total character lengths of the Chinese character strings s1 and s2, and denoting the shorter string as min_s and the longer string as max_s;
step S312: the phonetic code similarity between single Chinese characters a and b in min_s and max_s is calculated by the following formula:
wherein h(a, b) is the Hamming distance between the phonetic codes of the Chinese characters a and b, and len(a) is the phonetic code length of a, namely 17;
based on the phonetic code similarity between single Chinese characters, each Chinese character in min_s is compared with each Chinese character in max_s one by one, and the Chinese characters in max_s are reordered by exchange according to the comparison results. Specifically, each Chinese character in min_s is matched with the most similar Chinese character in max_s, and the characters in max_s are then rearranged to follow the order of the characters in min_s, so that the matched characters in max_s appear in the same order as their counterparts in min_s. For example, "min_s: teacher, max_s: you teach the teacher" becomes, after comparison and reordering, "min_s: teacher, max_s: teacher your";
step S313: because the character positions in max_s are exchanged during matching, position must be taken into account: the position difference of each Chinese character before and after the exchange is calculated, the absolute value of each position difference is taken, and a position influence factor is obtained from these absolute values, the position influence factor being:
where sum_position is the sum of the absolute values of the position differences, and len(max_s) is the string length of max_s;
step S314: the edit distance lds_yin(max_s, min_s) between min_s and the reordered max_s is calculated by a weighted edit distance algorithm, specifically:
step S315: the phonetic code similarity of the Chinese character strings s1 and s2 is calculated:
where α is a position contribution parameter, which may be set to 0.1 based on experience.
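The matching and reordering of steps S312 and S313 can be sketched as below. Because the formula images are missing from this text, two things are assumed rather than taken from the patent: the per-character similarity is written as 1 - h(a, b) / len(a), and the position differences are summed over every character of max_s, with unmatched characters kept in their original relative order after the matched ones. The greedy best-match strategy and all function names are likewise illustrative.

```python
from typing import Callable, List, Sequence, Tuple


def hamming(code_a: str, code_b: str) -> int:
    """Number of differing positions between two equal-length bit strings."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have equal length")
    return sum(x != y for x, y in zip(code_a, code_b))


def char_similarity(code_a: str, code_b: str) -> float:
    """Assumed form of the per-character similarity: 1 - h(a, b) / len(a)."""
    return 1.0 - hamming(code_a, code_b) / len(code_a)


def reorder_max_s(
    min_s: Sequence[str],
    max_s: Sequence[str],
    sim: Callable[[str, str], float],
) -> Tuple[List[str], int]:
    """Greedily match each character of min_s to its most similar unused
    character in max_s, move the matches to the front in min_s order, and
    return the reordered max_s together with sum_position."""
    if len(min_s) > len(max_s):
        raise ValueError("min_s must not be longer than max_s")
    unused = list(range(len(max_s)))
    order: List[int] = []
    for ch in min_s:
        best = max(unused, key=lambda k: sim(ch, max_s[k]))
        unused.remove(best)
        order.append(best)
    order.extend(unused)  # unmatched characters keep their relative order
    reordered = [max_s[k] for k in order]
    sum_position = sum(abs(new - old) for new, old in enumerate(order))
    return reordered, sum_position
```

In practice `sim` would look up the 17-bit phonetic codes of the two characters and call `char_similarity`; the sketch keeps it as a parameter so the same routine can serve the shape code comparison as well.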
Specifically, in step S3, calculating the phonetic code similarity of the Chinese character strings s1 and s2 further includes: if the total character lengths of s1 and s2 are equal, one of them is taken as min_s and the other as max_s, and step S312 is executed; if min_s and max_s match completely, i.e. the similarity of every matched pair of Chinese characters is 1 (for example, "min_s: mutual, max_s: mutually"), the position factor is not considered, and the phonetic code similarity of s1 and s2 is calculated by the following formula:
wherein sum_sim_yin is the sum of the phonetic code similarities of the corresponding Chinese characters;
specifically, in the step S3, the computation of the shape code similarity comprehensively considers the influence of four factors of the longest common substring duty ratio, the longest common substring position difference, the strokes and the structure, and designs a Chinese character shape code similarity detection algorithm based on the improved shape code. The specific calculation of the shape code similarity of the Chinese character strings s1 and s2 comprises the following steps:
s321, comparing the total length of the characters in the Chinese character strings S1 and S2, setting the character string with shorter total length as min_s and the character string with longer total length as max_s;
step S322, calculating the similarity of the shape codes between the single Chinese characters a and b in the min_s and the max_s by the following modes:
considering the longest public substring duty ratio, comparing the stroke order code length of a single Chinese character a and b in min_s and max_s, setting a smaller stroke order code length as d, setting a larger stroke order code length as s, setting the longest public substring length of the stroke order code of the single Chinese character a and b as Lcs_len, and calculating the longest public substring duty ratio of the single Chinese character a and b as follows:
considering strokes, the stroke difference c= |lena-lenb| of the Chinese characters a and b, wherein lena is the stroke number of the Chinese character a, lenb is the stroke number of the Chinese character b, and the stroke number contribution ratio of the Chinese characters a and b is calculated as follows:
taking the position difference of the longest public substring into consideration, setting the position of the longest public substring of the Chinese character a in the stroke order code of the Chinese character a as a_p, setting the position of the longest public substring of the Chinese character b in the stroke order code of the Chinese character b as b_p, calculating the difference p= |a_p-b_p|, and finally calculating the position contribution ratio of the longest public substring as follows:
considering the structure, the hamming distance ham of the structure code is calculated, and then the structural factors are calculated as:
to sum up: the similarity of the shape codes of the single Chinese characters a and b in the min_s and the max_s is calculated by the following steps:
wherein alpha is a contribution parameter of a common substring duty ratio, beta is a contribution parameter of a stroke number contribution ratio, i is a contribution parameter of a longest common substring position contribution ratio, j is a contribution parameter of a structural factor,
based on the similarity of the shape codes among the single Chinese characters, comparing the similarity of the shape codes of each Chinese character in the min_s with each Chinese character in the max_s one by one, and based on the comparison result of the similarity of the shape codes among the single Chinese characters, re-exchanging and ordering the Chinese characters in the max_s, wherein the method specifically comprises the following steps: and matching each Chinese character in the min_s with one Chinese character with the highest similarity in the max_s, and then referring to the sequence of the Chinese characters in the min_s, and re-exchanging the sequence of the Chinese characters in the max_s, so that the sequence of the Chinese characters in the max_s is the same as the sequence of the Chinese characters matched and corresponding in the min_s.
Step S323: because the character positions in max_s are exchanged during matching, a position factor must be considered. The position difference of each Chinese character before and after the exchange is calculated, its absolute value is taken, and the position influence factor is obtained from these absolute values as follows:
where sum_position is the sum of the absolute values of the individual position differences, and len(max_s) is the string length of max_s;
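The matching, reordering and position bookkeeping of steps S322 and S323 can be sketched as below. Several details are assumptions for illustration: the matching is greedy and one-to-one, ties go to the earliest character, and unmatched characters of max_s keep their relative order at the end. The function name `reorder_max_s` and the `sim` callback are hypothetical. Only sum_position is computed; the position influence factor itself is not reproduced in this text, so it is not implemented.

```python
from __future__ import annotations

from typing import Callable


def reorder_max_s(
    min_s: str, max_s: str, sim: Callable[[str, str], float]
) -> tuple[str, int]:
    """Reorder max_s to follow min_s; return (reordered max_s, sum_position)."""
    unused = list(range(len(max_s)))  # indices of max_s not yet matched
    new_order: list[int] = []         # original indices, listed in their new order
    for ch in min_s:
        if not unused:
            break
        # Pick the unmatched character of max_s most similar to ch.
        best = max(unused, key=lambda k: sim(ch, max_s[k]))
        new_order.append(best)
        unused.remove(best)
    new_order.extend(unused)          # assumption: leftovers keep their order
    # Absolute position difference of every character before and after the exchange.
    sum_position = sum(abs(new - old) for new, old in enumerate(new_order))
    return "".join(max_s[k] for k in new_order), sum_position


def exact(x: str, y: str) -> float:
    """Stand-in similarity: 1.0 for identical characters, else 0.0."""
    return 1.0 if x == y else 0.0


print(reorder_max_s("互相", "相互", exact))  # ('互相', 2)
```

With a real single-character similarity in place of `exact`, the returned sum_position and len(max_s) are the two inputs the text names for the position influence factor.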
Step S324: the edit distance $lds_{xing}$(max_s, min_s) between min_s and max_s after the position exchange is calculated by a weighted edit distance algorithm, specifically:
Step S325: the shape code similarity of the Chinese character strings s1 and s2 is calculated as:
where α is a position contribution parameter, which may be set to 0.1 according to experience.
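Step S324 relies on a weighted edit distance whose weighting is not reproduced in this text. As a rough sketch only, the version below assumes that insertions and deletions cost 1 and that substituting one character for another costs 1 minus their single-character similarity; both choices, the function name and the `sim` callback are assumptions, not the patented weighting.

```python
from typing import Callable


def weighted_edit_distance(
    max_s: str, min_s: str, sim: Callable[[str, str], float]
) -> float:
    """Levenshtein-style distance with similarity-weighted substitutions."""
    rows, cols = len(max_s) + 1, len(min_s) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = float(i)  # delete i characters
    for j in range(cols):
        dist[0][j] = float(j)  # insert j characters
    for i in range(1, rows):
        for j in range(1, cols):
            # Assumed weighting: similar characters are cheap to substitute.
            substitute = 1.0 - sim(max_s[i - 1], min_s[j - 1])
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,             # deletion
                dist[i][j - 1] + 1.0,             # insertion
                dist[i - 1][j - 1] + substitute,  # weighted substitution
            )
    return dist[-1][-1]


def exact(x: str, y: str) -> float:
    """Stand-in similarity: 1.0 for identical characters, else 0.0."""
    return 1.0 if x == y else 0.0


print(weighted_edit_distance("abc", "ab", exact))  # 1.0
```

With `exact` as the similarity this reduces to the ordinary Levenshtein distance; a graded similarity lowers the cost of near-miss characters.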
Specifically, in step S3, the step of calculating the shape code similarity of the Chinese character strings s1 and s2 further includes: if the total character lengths of s1 and s2 are equal, one of them is set as min_s and the other as max_s, and step S322 is executed. If min_s and max_s match completely, i.e. the similarity of every matched pair of Chinese characters is 1 (for example, "min_s: mutual" and "max_s: mutually"), the position factor is not considered, and the shape code similarity of s1 and s2 is calculated by the following formula:
where $sum\_sim_{xing}$ is the sum of the shape code similarities between the corresponding Chinese characters;
Specifically, in step S3, the step of calculating the meaning similarity of the Chinese character strings s1 and s2 includes:
Step S331: meaning similarity detection is performed on the Chinese character strings s1 and s2 based on the existing HowNet algorithm. HowNet screens word senses from various dictionaries and language knowledge bases and annotates words with them to construct a sense-based knowledge base; the meaning similarity is calculated from the distance between sememes in that knowledge base. If the Chinese character strings s1 and s2 are meaningful, their meaning similarity is calculated and output by the following formula: $Sim_{yi} = sim_{yi}$
where $sim_{yi}$ is the output of the HowNet algorithm;
otherwise, go to step S332;
Step S332: when a word contains wrongly written characters, the word is meaningless. To judge its similarity, it must first be converted into the meaningful word most similar to it, and the similarity comparison is then carried out on that word. For the meaningless Chinese character strings s1 and s2, single characters are replaced with similar Chinese characters and meaning detection is performed; this is repeated until the closest meaningful word is found. A replacement penalty parameter f is then set, and the meaning similarity of the Chinese character strings s1 and s2 is calculated by the following formula:
$Sim_{yi} = sim_{yi} - f$
where, according to experimental experience, the replacement penalty parameter f is set to 0.1.
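The control flow of steps S331 and S332 can be sketched as follows. The two callbacks are hypothetical stand-ins: `hownet_sim` is assumed to return None when either string is not a meaningful word, and `candidates` is assumed to yield the string itself first, followed by single-character replacements ordered from most to least similar. Neither is a real HowNet API.

```python
from typing import Callable, Iterable, Optional


def meaning_similarity(
    s1: str,
    s2: str,
    hownet_sim: Callable[[str, str], Optional[float]],
    candidates: Callable[[str], Iterable[str]],
    f: float = 0.1,
) -> Optional[float]:
    """Return Sim_yi, or None if no meaningful replacement is found."""
    direct = hownet_sim(s1, s2)
    if direct is not None:
        return direct  # step S331: both strings are meaningful
    # Step S332: try replacements until a meaningful pair is found.
    for c1 in candidates(s1):
        for c2 in candidates(s2):
            replaced = hownet_sim(c1, c2)
            if replaced is not None:
                return replaced - f  # Sim_yi = sim_yi - f
    return None


# Toy stand-ins for illustration only.
KNOWN = {("cat", "cat"): 1.0, ("cat", "dog"): 0.6}
REPLACEMENTS = {"cat": ["cat"], "dog": ["dog"], "cst": ["cst", "cat"]}


def stub_sim(x: str, y: str) -> Optional[float]:
    return KNOWN.get((x, y))


def stub_candidates(s: str) -> Iterable[str]:
    return REPLACEMENTS[s]


print(meaning_similarity("cat", "dog", stub_sim, stub_candidates))  # 0.6
```

For the misspelled input "cst", the sketch falls through to the replacement loop, finds "cat", and returns the HowNet-style score minus the penalty f.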
Specifically, the specific steps of the step S4 include:
Step S41: the algorithm design takes into account that, when the same meaning is expressed with completely different words, the phonetic code similarity and the shape code similarity have no reference value. A similarity threshold t is therefore set: if the meaning similarity of the Chinese character strings s1 and s2 is greater than t, their meanings are considered highly similar, their similarity in pinyin and glyph shape is disregarded, and the meaning similarity is taken as the overall similarity between s1 and s2; otherwise, step S42 is entered;
step S42: the overall similarity of the chinese strings s1, s2 is calculated by:
$Sim_{zong} = Sim_{yin} \times a + Sim_{xing} \times b + Sim_{yi} \times c - f$
where a is the contribution value of the phonetic code similarity, b is the contribution value of the shape code similarity, c is the contribution value of the meaning similarity, and a+b+c=1. The contribution values a, b and c can be set according to the application scenario so that the comparison emphasizes the relevant feature: for example, a is set larger for near-sound word detection, b is set larger for near-shape word detection, and c is set larger for near-meaning word detection.
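Steps S41 and S42 can be sketched as a single function. The function name, the default f = 0.0 for the case where no replacement was needed, and the threshold and weights used in the example call are assumptions for illustration; the text does not give a value for t.

```python
def overall_similarity(
    sim_yin: float,
    sim_xing: float,
    sim_yi: float,
    a: float,
    b: float,
    c: float,
    t: float,
    f: float = 0.0,
) -> float:
    """Combine phonetic, shape and meaning similarity into Sim_zong."""
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("contribution values must satisfy a + b + c = 1")
    if sim_yi > t:
        # Step S41: meanings are highly similar, so sound and shape are ignored.
        return sim_yi
    # Step S42: weighted sum minus the replacement penalty f.
    return sim_yin * a + sim_xing * b + sim_yi * c - f


# Illustrative values only: t = 0.9 and the weights are not from the text.
print(overall_similarity(0.8, 0.6, 0.95, a=0.4, b=0.4, c=0.2, t=0.9))  # 0.95
```

Raising a, b or c shifts the emphasis toward sound, shape or meaning respectively, which is the tuning described for different application scenarios.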
It is to be understood that the above examples of the present invention are provided by way of illustration only and do not limit the embodiments of the present invention. Other variations or modifications based on the above teachings will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the protection scope of the following claims.
Claims (7)
1. A Chinese word similarity detection algorithm based on sound, shape and meaning is characterized by combining three characteristics of sound, shape and meaning of Chinese characters to carry out similarity detection on Chinese character strings, and comprises the following steps:
step S1: converting the pinyin of each Chinese character in the input Chinese character strings s1 and s2 into a binary phonetic code;
step S2: converting each Chinese character in the input Chinese character strings s1 and s2 into a shape code according to its glyph;
step S3: respectively calculating the phonetic code similarity, the shape code similarity and the meaning similarity of the Chinese character strings s1 and s2;
step S4: considering the influence of the phonetic code similarity, the shape code similarity and the meaning similarity on the overall similarity, and finally obtaining the overall similarity of the Chinese character strings s1 and s2;
the specific steps of the step S2 include:
step S21: dividing the strokes of Chinese characters into horizontal, vertical, left-falling, right-falling and folding according to Chinese character coding rules, and setting a corresponding code for each of the horizontal, vertical, left-falling, right-falling and folding strokes;
step S22: recording the corresponding codes according to the order of the horizontal, vertical, left-falling, right-falling and folding strokes of each Chinese character in the Chinese character strings s1 and s2, and respectively obtaining the stroke order codes of the Chinese character strings s1 and s2;
step S23: obtaining stroke codes according to the stroke number of each Chinese character in the Chinese character strings s1 and s2, and obtaining structure codes according to the glyph structure of each Chinese character in the Chinese character strings s1 and s2;
in the step S3, the step of calculating the similarity of the phonetic codes of the chinese character strings S1, S2 includes:
step S311, comparing the total length of the characters in the Chinese character strings S1 and S2, setting the character string with shorter total length as min_s and the character string with longer total length as max_s;
step S312, the similarity of the phonetic codes between the single Chinese characters a and b in the min_s and the max_s is calculated by the following formula:
wherein h (a, b) is the phonetic code hamming distance of the Chinese character a, b, len (a) is the phonetic code length of a;
based on the similarity of the phonetic codes among the single Chinese characters, comparing each Chinese character in the min_s with each Chinese character in the max_s one by one, and based on the comparison result of the similarity of the phonetic codes among the single Chinese characters, re-exchanging and sequencing the Chinese characters in the max_s;
step S313, calculating the position difference of the Chinese characters before and after exchange, then calculating the absolute value of the position difference, and obtaining position influence factors based on the absolute value of the position difference, wherein the position influence factors are as follows:
where sum_position is the sum of absolute values of the respective position differences, and len (max_s) is the string length of max_s;
step S314, calculating the edit distance $lds_{yin}$(max_s, min_s) between the min_s and the max_s after the position exchange by a weighted edit distance algorithm;
Step S315, calculating the similarity of the phonetic codes of the Chinese character strings S1 and S2:
where α is a position contribution parameter.
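Steps S21 and S22 of the claim above can be illustrated with a small sketch. The digit assigned to each stroke type and the four-character stroke table are assumptions written by hand for this example; the claim only requires that each stroke type has some corresponding code. The structure code of step S23 is not covered.

```python
from __future__ import annotations

# Assumed codes for the five stroke types named in step S21.
STROKE_CODES = {
    "horizontal": "1",
    "vertical": "2",
    "left-falling": "3",
    "right-falling": "4",
    "folding": "5",
}

# Tiny hand-written stroke table, for illustration only.
STROKES = {
    "十": ["horizontal", "vertical"],
    "人": ["left-falling", "right-falling"],
    "大": ["horizontal", "left-falling", "right-falling"],
    "口": ["vertical", "folding", "horizontal"],
}


def stroke_order_code(ch: str) -> str:
    """Step S22: record the code of each stroke in writing order."""
    return "".join(STROKE_CODES[stroke] for stroke in STROKES[ch])


def stroke_count(ch: str) -> int:
    """Stroke number used for the stroke code in step S23."""
    return len(STROKES[ch])


def encode_string(s: str) -> list[tuple[str, int]]:
    """Encode every character of s as (stroke order code, stroke number)."""
    return [(stroke_order_code(ch), stroke_count(ch)) for ch in s]


print(encode_string("人口"))  # [('34', 2), ('251', 3)]
```

A practical system would load the stroke sequences from a full character database rather than a literal dictionary.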
2. The method of claim 1, wherein the specific steps of step S1 include:
step S11: converting each initial consonant of each Chinese phonetic transcription in the Chinese character strings s1 and s2 into binary numbers according to a Gray code comparison table;
step S12: converting each vowel of each Chinese phonetic alphabet in the Chinese character strings s1 and s2 into binary numbers according to a Gray code comparison table;
step S13: if each Chinese phonetic alphabet in the Chinese character strings s1 and s2 has an intermediate vowel, converting the intermediate vowel into a binary number according to a Gray code comparison table;
step S14: the tone of each Chinese phonetic alphabet in the Chinese character strings s1 and s2 is represented by binary number.
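The Gray code conversion in steps S11 to S13 can be sketched as below. The actual Gray code comparison table is not reproduced in this text, so the ordering of initials in `INITIALS` and the 5-bit width are purely hypothetical. The sketch shows the property that makes Gray codes a natural fit here: entries that are adjacent in the table differ in exactly one bit, so the Hamming distance between their codes is 1.

```python
def gray_code(n: int, width: int) -> str:
    """Binary-reflected Gray code of n as a zero-padded bit string."""
    if not 0 <= n < (1 << width):
        raise ValueError("n does not fit in the requested width")
    return format(n ^ (n >> 1), f"0{width}b")


def hamming(x: str, y: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    if len(x) != len(y):
        raise ValueError("codes must have the same length")
    return sum(p != q for p, q in zip(x, y))


# Hypothetical table ordering: neighbouring initials are assumed to sound alike.
INITIALS = ["b", "p", "m", "f", "d", "t", "n", "l"]
INITIAL_CODE = {ini: gray_code(k, 5) for k, ini in enumerate(INITIALS)}

print(INITIAL_CODE["b"], INITIAL_CODE["p"])                 # 00000 00001
print(hamming(INITIAL_CODE["b"], INITIAL_CODE["p"]))        # 1
```

The same helper could encode the vowel and intermediate vowel tables; the tone of step S14 is described only as a binary number, so it is not modeled here.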
3. The method according to claim 1, wherein in the step S3, the step of calculating the phonetic similarity of the chinese character strings S1 and S2 further comprises: if the total lengths of the characters in the chinese character strings S1 and S2 are equal, one of the chinese character strings S1 and S2 is set as min_s, the other is set as max_s, and step S312 is executed, and if the min_s and the max_s are completely matched, the phonetic similarity of the chinese character strings S1 and S2 is calculated by the following formula:
where $sum\_sim_{yin}$ is the sum of the phonetic code similarities between the corresponding Chinese characters.
4. The method according to claim 3, wherein in the step S3, the step of calculating the shape code similarity of the Chinese character strings s1 and s2 comprises:
step S321, comparing the total length of the characters in the Chinese character strings s1 and s2, setting the character string with the shorter total length as min_s and the character string with the longer total length as max_s;
step S322, calculating the similarity of the shape codes between the single Chinese characters a and b in the min_s and the max_s by the following modes:
comparing the stroke order code length of a single Chinese character a and b in the min_s and the max_s, setting the smaller stroke order code length as d, setting the larger stroke order code length as s, setting the longest common substring length of the stroke order code of the single Chinese character a and b as Lcs_len, and calculating the common substring duty ratio of the Chinese characters a and b as follows:
the stroke difference c= |lena-lenb| of the Chinese character a and b, wherein lena is the stroke number of the Chinese character a, lenb is the stroke number of the Chinese character b, and the stroke number contribution ratio of the Chinese character a and b is calculated as follows:
let the position of the longest common substring of Chinese character a in the stroke order code of Chinese character a be a_p, let the position of the longest common substring of Chinese character b in the stroke order code of Chinese character b be b_p, calculate its difference p= |a_p-b_p|, and finally calculate the position contribution ratio of the longest common substring as follows:
the hamming distance ham of the structure code is calculated, and then the structural factors are calculated as:
the similarity of the shape codes of the single Chinese characters a and b in the min_s and the max_s is calculated by the following steps:
wherein alpha is a contribution parameter of a common substring duty ratio, beta is a contribution parameter of a stroke number contribution ratio, i is a contribution parameter of a longest common substring position contribution ratio, j is a contribution parameter of a structural factor,
based on the similarity of the shape codes among the single Chinese characters, comparing the similarity of the shape codes of each Chinese character in the min_s with the similarity of the shape codes of each Chinese character in the max_s one by one, and based on the comparison result of the similarity of the shape codes among the single Chinese characters, re-exchanging and sequencing the Chinese characters in the max_s;
step S323, calculating the position difference of the Chinese characters before and after exchange, then calculating the absolute value of the position difference, and obtaining position influence factors based on the absolute value of the position difference, wherein the position influence factors are as follows:
where sum_position is the sum of absolute values of the respective position differences, and len (max_s) is the string length of max_s;
step S324, calculating the edit distance $lds_{xing}$(max_s, min_s) between min_s and max_s after the position exchange by a weighted edit distance algorithm;
Step S325, calculating the similarity of the shape codes of the Chinese character strings S1 and S2:
where α is a position contribution parameter.
5. The method according to claim 4, wherein in the step S3, the step of calculating the shape code similarity of the chinese character strings S1 and S2 further comprises: if the total lengths of the characters in the chinese character strings S1 and S2 are equal, one of the chinese character strings S1 and S2 is set as min_s, the other is set as max_s, and step S322 is executed, and if the min_s and the max_s are completely matched, the shape code similarity of the chinese character strings S1 and S2 is calculated by the following formula:
where $sum\_sim_{xing}$ is the sum of the shape code similarities between the corresponding Chinese characters.
6. The method according to claim 5, wherein in the step S3, the step of calculating the meaning similarity of the chinese character strings S1 and S2 includes:
step S331, carrying out meaning similarity detection on the Chinese character strings s1 and s2 based on the HowNet algorithm, and if the Chinese character strings s1 and s2 are meaningful, calculating the meaning similarity of the Chinese character strings s1 and s2 according to the following formula: $Sim_{yi} = sim_{yi}$
where $sim_{yi}$ is the output of the HowNet algorithm;
otherwise, go to step S332;
step S332, for the meaningless Chinese character strings s1 and s2, replacing single characters with similar Chinese characters and performing meaning detection, repeating until the closest meaningful word is found, setting a replacement penalty parameter f, and calculating the meaning similarity of the Chinese character strings s1 and s2 according to the following formula:
$Sim_{yi} = sim_{yi} - f$.
7. The method of claim 6, wherein the specific steps of step S4 include:
step S41: setting a similarity threshold t, if the meaning similarity of the Chinese character strings S1 and S2 is greater than t, taking the meaning similarity as the overall similarity between the Chinese character strings S1 and S2, otherwise, entering step S42;
step S42: the overall similarity of the chinese strings s1, s2 is calculated by:
$Sim_{zong} = Sim_{yin} \times a + Sim_{xing} \times b + Sim_{yi} \times c - f$
where a is a contribution value of the similarity of the phonetic code, b is a contribution value of the similarity of the shape code, c is a contribution value of the meaning similarity, and a+b+c=1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011058506.XA CN112329390B (en) | 2020-09-30 | 2020-09-30 | Chinese word similarity detection algorithm based on sound, shape and meaning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112329390A CN112329390A (en) | 2021-02-05 |
CN112329390B CN112329390B (en) | 2023-08-04 |
Family
ID=74314366
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011058506.XA Active CN112329390B (en) | 2020-09-30 | 2020-09-30 | Chinese word similarity detection algorithm based on sound, shape and meaning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112329390B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112766236B (en) * | 2021-03-10 | 2023-04-07 | 拉扎斯网络科技(上海)有限公司 | Text generation method and device, computer equipment and computer readable storage medium |
CN114091436B (en) * | 2022-01-21 | 2022-05-17 | 万商云集(成都)科技股份有限公司 | Sensitive word detection method based on decision tree and variant recognition |
CN114386385A (en) * | 2022-03-22 | 2022-04-22 | 北京创新乐知网络技术有限公司 | Method, device, system and storage medium for discovering sensitive word derived vocabulary |
CN116757189B (en) * | 2023-08-11 | 2023-10-31 | 四川互慧软件有限公司 | Patient name disambiguation method based on Chinese character features |
CN117010368B (en) * | 2023-10-07 | 2024-07-09 | 山东齐鲁壹点传媒有限公司 | Chinese error correction data enhancement method based on font similarity |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110287286A (en) * | 2019-06-13 | 2019-09-27 | 北京百度网讯科技有限公司 | The determination method, apparatus and storage medium of short text similarity |
Non-Patent Citations (1)
Title |
---|
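Chinese character string matching improved based on edit distance and similarity (基于编辑距离和相似度改进的汉字字符串匹配); Shao Qing (邵清); Ye Kun (叶琨); Electronic Science and Technology (电子科技) (09); pp. 13-17 * |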
Also Published As
Publication number | Publication date |
---|---|
CN112329390A (en) | 2021-02-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112329390B (en) | Chinese word similarity detection algorithm based on sound, shape and meaning | |
WO2020186778A1 (en) | Error word correction method and device, computer device, and storage medium | |
JP5462001B2 (en) | Contextual input method | |
US7174288B2 (en) | Multi-modal entry of ideogrammatic languages | |
RU2377664C2 (en) | Text input method | |
CN110163181B (en) | Sign language identification method and device | |
KR100656736B1 (en) | System and method for disambiguating phonetic input | |
CN101067780B (en) | Character inputting system and method for intelligent equipment | |
CN101133411A (en) | Fault-tolerant romanized input method for non-roman characters | |
CN112966496B (en) | Chinese error correction method and system based on pinyin characteristic representation | |
CN111310441A (en) | Text correction method, device, terminal and medium based on BERT voice recognition | |
CN105404621A (en) | Method and system for blind people to read Chinese character | |
JP5809381B1 (en) | Natural language processing system, natural language processing method, and natural language processing program | |
CN111079418B (en) | Named entity recognition method, device, electronic equipment and storage medium | |
CN113901170A (en) | Event extraction method and system combining Bert model and template matching and electronic equipment | |
CN113822044A (en) | Grammar error correction data generating method, device, computer equipment and storage medium | |
CN117290515A (en) | Training method of text annotation model, method and device for generating text graph | |
CN115455948A (en) | Spelling error correction model training method, spelling error correction method and storage medium | |
JP2006235916A (en) | Text analysis device, text analysis method and speech synthesizer | |
Guan et al. | Text error correction after text recognition based on MacBERT4CSC | |
CN111090720B (en) | Hot word adding method and device | |
CN109241496B (en) | Phonetic system | |
CN117010368B (en) | Chinese error correction data enhancement method based on font similarity | |
CN116757189B (en) | Patient name disambiguation method based on Chinese character features | |
CN115270769A (en) | Text error correction method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||