CN112329390B - Chinese word similarity detection algorithm based on sound, shape and meaning - Google Patents

Chinese word similarity detection algorithm based on sound, shape and meaning

Publication number
CN112329390B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011058506.XA
Other languages
Chinese (zh)
Other versions
CN112329390A (en)
Inventor
黄梦醒
王华敏
冯思玲
冯文龙
张雨
吴迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hainan University
Original Assignee
Hainan University
Application filed by Hainan University
Priority to CN202011058506.XA
Publication of CN112329390A
Application granted
Publication of CN112329390B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Document Processing Apparatus (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a Chinese word similarity detection algorithm based on sound, shape and meaning, which detects the overall similarity of Chinese character strings by jointly considering these three characteristics of Chinese characters. First, the pinyin of each Chinese character in the Chinese character strings s1 and s2 is converted into a corresponding sound (phonetic) code, and each Chinese character in s1 and s2 is converted into a shape code. Next, the sound code similarity and the shape code similarity between s1 and s2 are calculated separately, and the meaning similarity of the strings is calculated independently. Finally, sound, shape and meaning are combined, and the overall similarity of s1 and s2 is calculated using contribution parameters set for the application scenario. The algorithm can handle complex application scenarios: it can be applied to detecting duplicate structured data items, especially when manual input errors are present, and to detecting sensitive words disguised with wrongly written characters. Compared with similar Chinese character similarity detection algorithms, it greatly improves the detection of Chinese character string similarity.

Description

Chinese word similarity detection algorithm based on sound, shape and meaning
Technical Field
The invention relates to the technical field of Chinese word similarity, and in particular to a Chinese word similarity detection algorithm based on sound, shape and meaning.
Background
A string similarity algorithm calculates the similarity between two different character strings by some method, and the similarity is typically expressed as a percentage. String similarity algorithms are used in many computing scenarios and have a wide range of applications, such as data cleansing, user input error correction, recommendation systems, hacking detection systems, automatic scoring systems, web search and DNA sequence matching. At present, two kinds of algorithms are commonly used to detect the similarity of Chinese character strings. The first is similarity detection based on the sound and shape of Chinese characters: basic information such as pinyin, glyph structure, stroke number and stroke order is obtained, encoded into mathematical expressions according to certain coding rules, and then processed by a specific algorithm to obtain the similarity of the Chinese characters. The second is similarity detection based on the semantics of Chinese characters: Chinese character strings are compared with the words and descriptions recorded in a large knowledge base, and the semantic similarity is then calculated from the distance between sememes in the knowledge base. Both methods have defects. The former cannot recognize cases where Chinese character strings differ in length or in character order but have the same meaning. The latter assumes that the words being checked are written correctly, so it cannot detect the similarity between words disguised with wrongly written characters.
Disclosure of Invention
The invention aims to provide a Chinese word similarity detection algorithm based on sound, shape and meaning, which improves the phonetic code and the shape code of Chinese characters, combines them with the meaning of Chinese characters, and calculates the similarity of Chinese character strings by fully considering all three characteristics.
To solve the above technical problem, the invention adopts the following technical scheme: a Chinese word similarity detection algorithm based on sound, shape and meaning, which combines the three characteristics of sound, shape and meaning of Chinese characters to detect the similarity of Chinese character strings, and comprises the following steps:
step S1: converting the pinyin of each Chinese character in the input Chinese character strings s1 and s2 into a binary phonetic code;
step S2: converting each Chinese character in the input Chinese character strings s1 and s2 into a shape code according to its glyph;
step S3: calculating the phonetic code similarity, the shape code similarity and the meaning similarity of the Chinese character strings s1 and s2 respectively;
step S4: obtaining the overall similarity of the Chinese character strings s1 and s2 by weighing the influence of the phonetic code similarity, the shape code similarity and the meaning similarity on the overall similarity;
preferably, the specific steps of the step S1 include:
step S11: converting the initial consonant of the pinyin of each Chinese character in the Chinese character strings s1 and s2 into a binary number according to a Gray code lookup table;
step S12: converting the vowel of the pinyin of each Chinese character in the Chinese character strings s1 and s2 into a binary number according to a Gray code lookup table;
step S13: if the pinyin of a Chinese character in the Chinese character strings s1 and s2 has an intermediate vowel, converting the intermediate vowel into a binary number according to a Gray code lookup table;
step S14: representing the tone of the pinyin of each Chinese character in the Chinese character strings s1 and s2 by a binary number;
preferably, the specific steps of the step S2 include:
step S21: classifying the strokes of Chinese characters into horizontal, vertical, left-falling, right-falling and turning strokes according to Chinese character coding rules, and assigning a corresponding code to each stroke type;
step S22: recording the corresponding codes in the stroke order of each Chinese character in the Chinese character strings s1 and s2, thereby obtaining the stroke order codes of s1 and s2 respectively;
step S23: obtaining a stroke code from the stroke number of each Chinese character in the Chinese character strings s1 and s2, and obtaining a structure code from the glyph structure of each Chinese character in s1 and s2;
preferably, in the step S3, calculating the phonetic code similarity of the Chinese character strings s1 and s2 comprises the following steps:
step S311: comparing the total lengths of the Chinese character strings s1 and s2, and setting the shorter string as min_s and the longer string as max_s;
step S312: calculating the phonetic code similarity between a single Chinese character a in min_s and a single Chinese character b in max_s by the following formula:
where h(a, b) is the Hamming distance between the phonetic codes of the Chinese characters a and b, and len(a) is the phonetic code length of a;
comparing each Chinese character in min_s with each Chinese character in max_s one by one on the basis of this single-character phonetic code similarity, and reordering the Chinese characters in max_s according to the comparison results;
step S313: calculating the position difference of each Chinese character before and after the exchange, taking its absolute value, and obtaining the position influence factor from these absolute values as follows:
where sum_position is the sum of the absolute values of the position differences, and len(max_s) is the string length of max_s;
step S314: calculating the edit distance lds_yin(max_s, min_s) between min_s and the reordered max_s by a weighted edit distance algorithm;
step S315: calculating the phonetic code similarity of the Chinese character strings s1 and s2:
where α is the position contribution parameter.
Preferably, in the step S3, calculating the phonetic code similarity of the Chinese character strings s1 and s2 further includes: if the total lengths of s1 and s2 are equal, setting one of them as min_s and the other as max_s and executing step S312; if min_s and max_s match completely, calculating the phonetic code similarity of s1 and s2 by the following formula:
where sum_sim_yin is the sum of the phonetic code similarities of the corresponding Chinese characters;
preferably, in the step S3, calculating the shape code similarity of the Chinese character strings s1 and s2 includes:
step S321: comparing the total lengths of the Chinese character strings s1 and s2, and setting the shorter string as min_s and the longer string as max_s;
step S322: calculating the shape code similarity between a single Chinese character a in min_s and a single Chinese character b in max_s as follows:
comparing the stroke order code lengths of a and b, setting the smaller length as d and the larger length as s, setting the length of the longest common substring of the two stroke order codes as Lcs_len, and calculating the common substring ratio of a and b as follows:
calculating the stroke difference c = |lena - lenb| of a and b, where lena is the stroke number of a and lenb is the stroke number of b, and calculating the stroke number contribution ratio of a and b as follows:
setting the position of the longest common substring in the stroke order code of a as a_p and its position in the stroke order code of b as b_p, calculating the difference p = |a_p - b_p|, and calculating the position contribution ratio of the longest common substring as follows:
calculating the Hamming distance ham of the structure codes, and then calculating the structure factor as follows:
calculating the shape code similarity of the single Chinese characters a and b as follows:
where α is the contribution parameter of the common substring ratio, β is the contribution parameter of the stroke number contribution ratio, i is the contribution parameter of the longest common substring position contribution ratio, and j is the contribution parameter of the structure factor;
comparing each Chinese character in min_s with each Chinese character in max_s one by one on the basis of this single-character shape code similarity, and reordering the Chinese characters in max_s according to the comparison results;
step S323: calculating the position difference of each Chinese character before and after the exchange, taking its absolute value, and obtaining the position influence factor from these absolute values as follows:
where sum_position is the sum of the absolute values of the position differences, and len(max_s) is the string length of max_s;
step S324: calculating the edit distance lds_xing(max_s, min_s) between min_s and the reordered max_s by a weighted edit distance algorithm;
step S325: calculating the shape code similarity of the Chinese character strings s1 and s2:
where α is the position contribution parameter.
Preferably, in the step S3, calculating the shape code similarity of the Chinese character strings s1 and s2 further includes: if the total lengths of s1 and s2 are equal, setting one of them as min_s and the other as max_s and executing step S322; if min_s and max_s match completely, calculating the shape code similarity of s1 and s2 by the following formula:
where sum_sim_xing is the sum of the shape code similarities of the corresponding Chinese characters;
preferably, in the step S3, calculating the meaning similarity of the Chinese character strings s1 and s2 includes:
step S331: performing meaning similarity detection on the Chinese character strings s1 and s2 based on the HowNet algorithm; if s1 and s2 are both meaningful, calculating the meaning similarity of s1 and s2 by the following formula:
Sim_yi = sim_yi
where sim_yi is the output of the HowNet algorithm;
otherwise, going to step S332;
step S332: replacing single characters of the meaningless Chinese character string with similar Chinese characters and checking the result for meaning, repeating until the closest meaningful word is found, setting a replacement penalty parameter f, and calculating the meaning similarity of s1 and s2 by the following formula:
Sim_yi = sim_yi - f
preferably, the specific steps of the step S4 include:
step S41: setting a similarity threshold t; if the meaning similarity of the Chinese character strings s1 and s2 is greater than t, taking the meaning similarity as the overall similarity between s1 and s2; otherwise, going to step S42;
step S42: calculating the overall similarity of the Chinese character strings s1 and s2 by:
Sim_zong = Sim_yin × a + Sim_xing × b + Sim_yi × c - f
where a is the contribution value of the phonetic code similarity, b is the contribution value of the shape code similarity, c is the contribution value of the meaning similarity, and a + b + c = 1.
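Steps S41 and S42 can be sketched in a few lines of Python. This is only an illustration of the formula above: the function name, the default f = 0 (for the case where no character replacement was needed) and the numeric values in the usage line are assumptions, not values given in the text.

```python
def overall_similarity(sim_yin, sim_xing, sim_yi, a, b, c, t, f=0.0):
    """Combine phonetic, shape and meaning similarity (steps S41 and S42)."""
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("contribution values a, b and c must sum to 1")
    # Step S41: a meaning similarity above the threshold t is used directly.
    if sim_yi > t:
        return sim_yi
    # Step S42: weighted sum of the three similarities minus the penalty f.
    return sim_yin * a + sim_xing * b + sim_yi * c - f


# Illustrative values only; weights and threshold depend on the application.
print(overall_similarity(0.9, 0.8, 0.2, a=0.4, b=0.4, c=0.2, t=0.85))
```

Raising a and b favours typo-tolerant matching, while raising c favours semantic matching; the constraint a + b + c = 1 keeps the weighted sum on the same scale as its inputs.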
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a Chinese word similarity detection algorithm based on sound, shape and meaning, which improves the phonetic and shape codes, combines them with meaning, and comprehensively calculates the similarity of Chinese character strings by fully considering all three characteristics; the algorithm has a wider field of application and can handle more complex application scenarios: it can be applied to detecting duplicate structured data items, especially when manual input errors are present, and to detecting sensitive words disguised with wrongly written characters. The algorithm is simple to implement, places low requirements on the implementation environment, and greatly improves the detection of Chinese character string similarity compared with Chinese character similarity detection algorithms of the same type.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only preferred embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the Chinese word similarity detection algorithm based on sound, shape and meaning according to the present invention.
Detailed Description
For a better understanding of the technical content of the present invention, specific examples are provided below, and the present invention is further described with reference to the accompanying drawings:
Referring to FIG. 1, the invention provides a Chinese word similarity detection algorithm based on sound, shape and meaning, which combines the three characteristics of sound, shape and meaning of Chinese characters to detect the similarity of Chinese character strings, and comprises the following steps:
step S1: converting the pinyin of each Chinese character in the input Chinese character strings s1 and s2 into a binary phonetic code;
step S2: converting each Chinese character in the input Chinese character strings s1 and s2 into a shape code according to its glyph;
step S3: calculating the phonetic code similarity, the shape code similarity and the meaning similarity of the Chinese character strings s1 and s2 respectively;
step S4: obtaining the overall similarity of the Chinese character strings s1 and s2 by weighing the influence of the phonetic code similarity, the shape code similarity and the meaning similarity on the overall similarity;
That is, the overall similarity of Chinese character strings is detected by jointly considering the sound, shape and meaning of Chinese characters. Specifically, the pinyin of each Chinese character in s1 and s2 is first converted into a corresponding phonetic code, and each Chinese character in s1 and s2 is converted into a shape code; the phonetic code similarity and the shape code similarity between s1 and s2 are then calculated separately; next, the meaning similarity of the strings is calculated independently; finally, contribution parameters are set for the application scenario, and the overall similarity of s1 and s2 is calculated by combining the phonetic code similarity, the shape code similarity and the meaning similarity.
Specifically, the specific steps of the step S1 include:
step S11: the initial consonant of the pinyin of each Chinese character in the Chinese character strings s1 and s2 is converted into a five-bit binary number according to a Gray code lookup table; the initial consonant codes are shown in Table 1:
Table 1 Pinyin initial consonant encoding
step S12: the vowel of the pinyin of each Chinese character in the Chinese character strings s1 and s2 is converted into a five-bit binary number according to a Gray code lookup table; the vowel codes are shown in Table 2:
Table 2 Pinyin vowel encoding
step S13: if the pinyin of a Chinese character in the Chinese character strings s1 and s2 has an intermediate vowel, the intermediate vowel is converted into a five-bit binary number according to the Gray code lookup table; the intermediate vowel codes are also given in Table 2;
step S14: the tone of the pinyin of each Chinese character in the Chinese character strings s1 and s2 is represented by a two-bit binary number; there are four tones, with the first tone coded as 00, the second as 01, the third as 10 and the fourth as 11;
in summary, the phonetic code of each Chinese character can be represented by a 17-bit binary number (five bits each for the initial consonant, the vowel and the intermediate vowel, plus two bits for the tone), so the phonetic code length of a Chinese character is 17.
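A minimal Python sketch of how such a 17-bit phonetic code could be assembled is given below. Only the field widths and the tone codes come from the text. The index assigned to each initial consonant and vowel, the all-zero code for a missing intermediate vowel, and the order in which the fields are concatenated are hypothetical, since the actual assignments are those of Tables 1 and 2.

```python
def gray(n: int, bits: int = 5) -> str:
    """Return the reflected binary Gray code of n as a fixed-width bit string."""
    return format(n ^ (n >> 1), f"0{bits}b")


# Hypothetical index assignments for illustration only; the actual
# assignments are those of Tables 1 and 2.
INITIAL_INDEX = {"": 0, "b": 1, "p": 2, "m": 3, "f": 4}
VOWEL_INDEX = {"": 0, "a": 1, "o": 2, "e": 3, "i": 4, "u": 5, "an": 6, "ang": 7}
TONE_CODE = {1: "00", 2: "01", 3: "10", 4: "11"}  # from step S14


def phonetic_code(initial: str, medial: str, vowel: str, tone: int) -> str:
    """Concatenate initial, vowel, intermediate vowel and tone (assumed order)."""
    code = (
        gray(INITIAL_INDEX[initial])
        + gray(VOWEL_INDEX[vowel])
        + gray(VOWEL_INDEX[medial])  # "" (no intermediate vowel) maps to 00000
        + TONE_CODE[tone]
    )
    assert len(code) == 17  # 5 + 5 + 5 + 2 bits
    return code


print(phonetic_code("b", "", "a", 1))    # "ba", first tone
print(phonetic_code("m", "i", "an", 4))  # "mian", fourth tone
```

The point of a Gray code is that neighbouring table entries differ in exactly one bit, so if similar-sounding initials or vowels are placed next to each other in the lookup table, their codes have a small Hamming distance.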
Specifically, the specific steps of the step S2 include:
step S21: the strokes of Chinese characters are classified into horizontal, vertical, left-falling, right-falling and turning strokes according to Chinese character coding rules, and a corresponding code is assigned to each stroke type; the specific coding rules are shown in Table 3:
Table 3 Chinese character stroke coding rules
step S22: the corresponding codes are recorded in the stroke order of each Chinese character in the Chinese character strings s1 and s2, which yields the stroke order codes of s1 and s2 respectively; for example, the Chinese character 'you' is composed of a left-falling stroke, a vertical stroke, a horizontal stroke, a left-falling stroke, a vertical hook and a dot, so the stroke order code generated under the coding rules is 321354;
step S23: the stroke code of each Chinese character in the Chinese character strings s1 and s2 is obtained from its stroke number, and the structure code of each Chinese character in s1 and s2 is obtained from its glyph structure, using existing structure codes;
the shape code of a Chinese character is formed from its stroke order code, stroke code and structure code. The stroke order code reflects how the character is composed: identical stroke order codes indicate identical stroke sequences, so the stroke order code reflects the similarity of Chinese characters to a certain extent. The stroke code reflects the stroke number of the character. The structure code describes the overall structure of the glyph, such as top-bottom, left-right or semi-enclosed. Describing a Chinese character with the stroke order code, the stroke code and the structure code captures both its components and the way they are arranged, so the similarity calculated from the three codes reflects the visual similarity of Chinese characters.
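The stroke order code and stroke code can be illustrated with a short Python sketch. The digit assigned to each stroke type is inferred from the example code 321354 and the order in which the stroke types are listed; treating a dot as a right-falling stroke and a hook as a turning stroke is likewise an inference from that example, not a reproduction of Table 3. The structure code is omitted because its values are not given in the text.

```python
# Stroke-type digits inferred from the example "321354"; Table 3 itself is
# not reproduced here, so treat this mapping as an assumption.
STROKE_DIGIT = {
    "horizontal": "1",
    "vertical": "2",
    "left-falling": "3",
    "right-falling": "4",
    "dot": "4",      # assumed to share the right-falling code
    "turning": "5",
    "hook": "5",     # assumed to count as a turning stroke
}


def stroke_order_code(strokes):
    """Concatenate one digit per stroke, in writing order (step S22)."""
    return "".join(STROKE_DIGIT[s] for s in strokes)


def stroke_code(strokes):
    """The stroke code is simply the number of strokes (step S23)."""
    return len(strokes)


strokes = ["left-falling", "vertical", "horizontal", "left-falling", "hook", "dot"]
print(stroke_order_code(strokes))  # 321354
print(stroke_code(strokes))        # 6
```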
Specifically, in the step S3, calculating the phonetic code similarity of the Chinese character strings s1 and s2 includes:
step S311: comparing the total lengths of the Chinese character strings s1 and s2, and setting the shorter string as min_s and the longer string as max_s;
step S312: calculating the phonetic code similarity between a single Chinese character a in min_s and a single Chinese character b in max_s by the following formula:
where h(a, b) is the Hamming distance between the phonetic codes of the Chinese characters a and b, and len(a) is the phonetic code length of a, namely 17;
on the basis of this single-character phonetic code similarity, each Chinese character in min_s is compared with each Chinese character in max_s one by one, and the Chinese characters in max_s are reordered according to the comparison results. Specifically, each Chinese character in min_s is matched with the Chinese character in max_s that has the highest similarity to it, and the Chinese characters in max_s are then rearranged, following the order of the Chinese characters in min_s, so that the matched Chinese characters in max_s appear in the same order as their counterparts in min_s. For example, "min_s: teacher -- max_s: you teach the teacher" becomes, after comparison and reordering, "min_s: teacher -- max_s: teacher your";
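The matching and reordering in step S312 can be sketched as follows. The similarity function uses the form 1 - h(a, b) / len(a), which is an assumption consistent with the definitions of h(a, b) and len(a) but is not quoted from the formula itself. Matching each character greedily to the best unused character of max_s and appending unmatched characters at the end are also assumptions about details the text leaves open. Short 4-bit toy codes stand in for the 17-bit phonetic codes.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(x) != len(y):
        raise ValueError("codes must have equal length")
    return sum(c1 != c2 for c1, c2 in zip(x, y))


def char_similarity(code_a: str, code_b: str) -> float:
    # Assumed form: 1 - h(a, b) / len(a).
    return 1.0 - hamming(code_a, code_b) / len(code_a)


def reorder(min_codes, max_codes):
    """Return the original indices of max_s in their new order.

    Each character of min_s is matched, in order, to the most similar
    still-unused character of max_s; unmatched characters keep their
    relative order and are placed at the end (an assumption).
    """
    unused = list(range(len(max_codes)))
    order = []
    for code in min_codes:
        best = max(unused, key=lambda j: char_similarity(code, max_codes[j]))
        order.append(best)
        unused.remove(best)
    return order + unused


# Toy 4-bit codes instead of real 17-bit phonetic codes.
print(reorder(["0000", "1111"], ["1110", "0101", "0001"]))  # [2, 0, 1]
```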
step S313: because character positions in max_s are exchanged during matching, position must be taken into account; the position difference of each Chinese character before and after the exchange is calculated, its absolute value is taken, and the position influence factor is obtained from these absolute values as follows:
where sum_position is the sum of the absolute values of the position differences, and len(max_s) is the string length of max_s;
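The quantity sum_position in step S313 can be computed directly from the reordering. The sketch below assumes the reordering is represented as a list in which entry k holds the original index of the character now at position k; this representation is a choice made for illustration. The sketch stops at sum_position and len(max_s), the two inputs of the position influence factor, and does not guess the factor's exact form.

```python
def sum_position(order):
    """Sum of |new position - old position| over all characters of max_s.

    order[k] is the original index in max_s of the character that sits at
    position k after reordering (a hypothetical representation).
    """
    return sum(abs(new - old) for new, old in enumerate(order))


order = [2, 0, 1]           # e.g. the third character moved to the front
print(sum_position(order))  # 4
print(len(order))           # len(max_s) = 3
```

An unchanged order gives sum_position = 0, so the position factor only comes into play when characters actually had to be moved to line up with min_s.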
step S314: calculating the edit distance lds_yin(max_s, min_s) between min_s and the reordered max_s by a weighted edit distance algorithm, specifically:
step S315: calculating the phonetic code similarity of the Chinese character strings s1 and s2:
where α is the position contribution parameter, which may be set to 0.1 based on experience.
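A weighted edit distance of the kind named in step S314 can be sketched with the standard dynamic-programming recurrence. The weighting used here, a substitution cost equal to the normalised Hamming distance between two character codes and a cost of 1 for each insertion or deletion, is an assumption chosen for illustration and is not the weighting defined by the formula in the text.

```python
def code_cost(code_a: str, code_b: str) -> float:
    """Assumed substitution cost: normalised Hamming distance of two codes."""
    return sum(c1 != c2 for c1, c2 in zip(code_a, code_b)) / len(code_a)


def weighted_edit_distance(xs, ys, sub_cost=code_cost):
    """Edit distance between two code sequences with weighted substitutions."""
    m, n = len(xs), len(ys)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = float(i)  # delete every character of xs
    for j in range(1, n + 1):
        dist[0][j] = float(j)  # insert every character of ys
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,  # deletion
                dist[i][j - 1] + 1.0,  # insertion
                dist[i - 1][j - 1] + sub_cost(xs[i - 1], ys[j - 1]),
            )
    return dist[m][n]


# Toy 4-bit codes: one near match, one exact match, one extra character.
print(weighted_edit_distance(["0000", "1111", "0101"], ["0001", "1111"]))  # 1.25
```

With this weighting, replacing a character by a similar-sounding one costs far less than replacing it by an unrelated one, which is what lets near-homophones score as close matches.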
Specifically, in the step S3, calculating the phonetic code similarity of the Chinese character strings s1 and s2 further includes: if the total lengths of s1 and s2 are equal, one of them is set as min_s and the other as max_s, and step S312 is executed; if min_s and max_s match completely, that is, the similarity of every matched pair of Chinese characters is 1, as in "min_s: mutual -- max_s: mutually", the phonetic code similarity of s1 and s2 is calculated by the following formula without considering the position factor:
where sum_sim_yin is the sum of the phonetic code similarities of the corresponding Chinese characters;
specifically, in the step S3, the computation of the shape code similarity comprehensively considers the influence of four factors of the longest common substring duty ratio, the longest common substring position difference, the strokes and the structure, and designs a Chinese character shape code similarity detection algorithm based on the improved shape code. The specific calculation of the shape code similarity of the Chinese character strings s1 and s2 comprises the following steps:
s321, comparing the total length of the characters in the Chinese character strings S1 and S2, setting the character string with shorter total length as min_s and the character string with longer total length as max_s;
step S322, calculating the similarity of the shape codes between the single Chinese characters a and b in the min_s and the max_s by the following modes:
considering the longest public substring duty ratio, comparing the stroke order code length of a single Chinese character a and b in min_s and max_s, setting a smaller stroke order code length as d, setting a larger stroke order code length as s, setting the longest public substring length of the stroke order code of the single Chinese character a and b as Lcs_len, and calculating the longest public substring duty ratio of the single Chinese character a and b as follows:
considering strokes, the stroke difference c= |lena-lenb| of the Chinese characters a and b, wherein lena is the stroke number of the Chinese character a, lenb is the stroke number of the Chinese character b, and the stroke number contribution ratio of the Chinese characters a and b is calculated as follows:
taking the position difference of the longest public substring into consideration, setting the position of the longest public substring of the Chinese character a in the stroke order code of the Chinese character a as a_p, setting the position of the longest public substring of the Chinese character b in the stroke order code of the Chinese character b as b_p, calculating the difference p= |a_p-b_p|, and finally calculating the position contribution ratio of the longest public substring as follows:
considering the structure, the hamming distance ham of the structure code is calculated, and then the structural factors are calculated as:
to sum up: the similarity of the shape codes of the single Chinese characters a and b in the min_s and the max_s is calculated by the following steps:
wherein α is the contribution parameter of the common substring proportion, β is the contribution parameter of the stroke count contribution ratio, i is the contribution parameter of the longest common substring position contribution ratio, and j is the contribution parameter of the structural factor;
based on the shape code similarity between single Chinese characters, each Chinese character in min_s is compared one by one with each Chinese character in max_s, and the Chinese characters in max_s are then reordered according to the comparison results. Specifically, each Chinese character in min_s is matched with the Chinese character in max_s that has the highest similarity to it, and the Chinese characters in max_s are then rearranged, with reference to the order of the Chinese characters in min_s, so that the matched Chinese characters in max_s appear in the same order as their counterparts in min_s.
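The raw quantities that step S322 feeds into its formulas can be sketched as follows. This is a minimal illustration, not the patented implementation: the function and class names are hypothetical, stroke order codes and structure codes are assumed to be plain strings with one symbol per stroke, and the stroke count is therefore assumed to equal the stroke order code length. The sketch stops at the inputs (Lcs_len, d, s, c, p, ham) and does not attempt to combine them into a similarity score.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ShapeFeatures:
    lcs_len: int  # Lcs_len: length of the longest common substring
    d: int        # smaller stroke order code length
    s: int        # larger stroke order code length
    c: int        # stroke difference |lena - lenb|
    p: int        # position difference |a_p - b_p|
    ham: int      # Hamming distance between the structure codes


def longest_common_substring(x: str, y: str) -> Tuple[int, int, int]:
    """Return (length, start in x, start in y) of the first longest common substring."""
    best_len, best_x, best_y = 0, 0, 0
    prev = [0] * (len(y) + 1)  # common-suffix lengths for the previous row
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_x = i - cur[j]
                    best_y = j - cur[j]
        prev = cur
    return best_len, best_x, best_y


def hamming(u: str, v: str) -> int:
    """Hamming distance between two equal-length codes."""
    if len(u) != len(v):
        raise ValueError("structure codes must have equal length")
    return sum(cu != cv for cu, cv in zip(u, v))


def shape_features(order_a: str, order_b: str,
                   struct_a: str, struct_b: str) -> ShapeFeatures:
    """Collect the per-character inputs named in step S322."""
    lcs_len, a_p, b_p = longest_common_substring(order_a, order_b)
    len_a, len_b = len(order_a), len(order_b)  # assumed equal to stroke counts
    return ShapeFeatures(
        lcs_len=lcs_len,
        d=min(len_a, len_b),
        s=max(len_a, len_b),
        c=abs(len_a - len_b),
        p=abs(a_p - b_p),
        ham=hamming(struct_a, struct_b),
    )
```

For instance, `shape_features("12345", "2345", "01", "11")` uses made-up codes in which the second character's stroke order code is the first one minus its leading stroke.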
Step S323, because the character positions in max_s are exchanged during matching, a position factor must be considered: the position difference of each Chinese character before and after the exchange is calculated, its absolute value is taken, and the position influence factor is obtained from these absolute values as follows:
where sum_position is the sum of the absolute values of the position differences, and len(max_s) is the string length of max_s;
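The matching, reordering and position bookkeeping described above can be sketched as below. This is an illustration under stated assumptions rather than the patent's procedure: `reorder_max_s` is a hypothetical name, `sim` is any caller-supplied single-character similarity function, each character of max_s is assumed to be matched at most once, ties go to the earliest candidate, and unmatched characters are assumed to be appended in their original relative order. The sketch returns sum_position only; how it is combined with len(max_s) into the position influence factor is left open.

```python
from typing import Callable, List, Tuple


def reorder_max_s(min_s: str, max_s: str,
                  sim: Callable[[str, str], float]) -> Tuple[str, int]:
    """Reorder max_s to follow min_s and return (reordered max_s, sum_position)."""
    if len(min_s) > len(max_s):
        raise ValueError("min_s must not be longer than max_s")
    unused = list(range(len(max_s)))  # indices of max_s not yet matched
    order: List[int] = []             # original max_s indices in their new order
    for ch in min_s:
        # Pick the unused character of max_s most similar to ch.
        best = max(unused, key=lambda k: sim(ch, max_s[k]))
        order.append(best)
        unused.remove(best)
    order.extend(unused)              # unmatched characters keep their relative order
    reordered = "".join(max_s[k] for k in order)
    # Sum of |new position - old position| over all characters of max_s.
    sum_position = sum(abs(new - old) for new, old in enumerate(order))
    return reordered, sum_position
```

With an exact-match similarity, `reorder_max_s("ab", "bca", sim)` rearranges "bca" into "abc" and reports a total displacement of 4.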
step S324, calculating the edit distance lds_xing(max_s, min_s) between min_s and the position-exchanged max_s by a weighted edit distance algorithm, specifically:
step S325, calculating the similarity of the shape codes of the Chinese character strings S1 and S2:
where α is a position contribution parameter, which may be set to 0.1 according to experience.
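The weights used by the weighted edit distance of step S324 are not reproduced in this text, so the following is only a generic sketch of the technique. It assumes, purely for illustration, that insertions and deletions cost 1 and that substituting one character for another costs 1 minus their single-character similarity; `weighted_edit_distance`, `sim` and `indel_cost` are hypothetical names.

```python
from typing import Callable


def weighted_edit_distance(max_s: str, min_s: str,
                           sim: Callable[[str, str], float],
                           indel_cost: float = 1.0) -> float:
    """Levenshtein-style distance with similarity-weighted substitutions (assumed weights)."""
    n, m = len(max_s), len(min_s)
    prev = [j * indel_cost for j in range(m + 1)]  # distances from the empty prefix
    for i in range(1, n + 1):
        cur = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            # Similar characters are cheap to substitute; identical ones are free.
            substitute = prev[j - 1] + (1.0 - sim(max_s[i - 1], min_s[j - 1]))
            cur[j] = min(prev[j] + indel_cost,      # delete from max_s
                         cur[j - 1] + indel_cost,   # insert into max_s
                         substitute)
        prev = cur
    return prev[m]
```

With an exact-match similarity this reduces to the ordinary Levenshtein distance, which is a convenient sanity check for the assumed weighting.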
Specifically, in step S3, the step of calculating the shape code similarity of the Chinese character strings s1 and s2 further includes: if the total character lengths of s1 and s2 are equal, one of them is set as min_s and the other as max_s, and step S322 is executed; if min_s and max_s are completely matched, i.e. the similarity of every matched pair of Chinese characters is 1 (for example, "min_s: mutual, max_s: mutually"), the position factor is not considered and the shape code similarity of s1 and s2 is calculated by the following formula:
wherein sum_sim_xing is the sum of the shape code similarities of the corresponding Chinese characters;
specifically, in the step S3, the step of calculating the meaning similarity of the chinese character strings S1 and S2 includes:
step S331, meaning similarity detection is performed on the Chinese character strings s1 and s2 using the existing HowNet algorithm. The HowNet algorithm selects sememes from various dictionaries and language knowledge bases, annotates words with these sememes to construct a sememe-based knowledge base, and calculates meaning similarity from the distance between sememes in that knowledge base. If the Chinese character strings s1 and s2 are meaningful, their meaning similarity is calculated and output by the following formula: $\mathrm{Sim}_{yi} = \mathrm{sim}_{yi}$
wherein $\mathrm{sim}_{yi}$ is the output of the HowNet algorithm;
otherwise, go to step S332;
step S332, when a word contains wrongly written characters, the word itself is meaningless; to judge its similarity, it must first be converted into the most similar meaningful word before the comparison is made. Single similar Chinese characters are therefore substituted into the meaningless Chinese character string s1 or s2 and meaning detection is repeated, looping until the closest meaningful word is found. A replacement penalty parameter f is then set, and the meaning similarity of the Chinese character strings s1 and s2 is calculated by the following formula:
$$\mathrm{Sim}_{yi} = \mathrm{sim}_{yi} - f$$
wherein, according to experimental experience, the replacement penalty parameter f is set to 0.1.
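The control flow of steps S331 and S332 can be sketched as follows. All three callables are hypothetical stand-ins supplied by the caller: `semantic_sim` plays the role of the HowNet similarity, `is_meaningful` decides whether a string is a known word, and `candidates` yields replacement words for a meaningless string, assumed to be ordered from most to least similar. Applying the penalty f only once, even if both strings needed a replacement, is also an assumption.

```python
from typing import Callable, Iterable, Optional, Tuple

PENALTY_F = 0.1  # replacement penalty f, the value given in the text


def nearest_meaningful(word: str,
                       is_meaningful: Callable[[str], bool],
                       candidates: Callable[[str], Iterable[str]]
                       ) -> Tuple[Optional[str], bool]:
    """Return (meaningful word or None, whether a replacement was needed)."""
    if is_meaningful(word):
        return word, False
    for cand in candidates(word):  # assumed ordered by decreasing similarity
        if is_meaningful(cand):
            return cand, True
    return None, False


def meaning_similarity(s1: str, s2: str,
                       semantic_sim: Callable[[str, str], float],
                       is_meaningful: Callable[[str], bool],
                       candidates: Callable[[str], Iterable[str]],
                       f: float = PENALTY_F) -> Optional[float]:
    """Sim_yi = sim_yi if both strings are meaningful, else sim_yi - f after replacement."""
    w1, replaced1 = nearest_meaningful(s1, is_meaningful, candidates)
    w2, replaced2 = nearest_meaningful(s2, is_meaningful, candidates)
    if w1 is None or w2 is None:
        return None  # no meaningful substitute found; the text does not cover this case
    sim_yi = semantic_sim(w1, w2)
    return sim_yi - f if (replaced1 or replaced2) else sim_yi
```

Keeping the semantic backend behind a callable means the same flow works with any HowNet implementation, or with a toy dictionary for testing.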
Specifically, the specific steps of the step S4 include:
step S41: when the same meaning is expressed with completely different words, the phonetic code similarity and the shape code similarity have no reference value. A similarity threshold t is therefore set: if the meaning similarity of the Chinese character strings s1 and s2 is greater than t, their meanings are considered highly similar, their similarity in pinyin and glyph is disregarded, and the meaning similarity is taken as the overall similarity between s1 and s2; otherwise, go to step S42;
step S42: the overall similarity of the chinese strings s1, s2 is calculated by:
$$\mathrm{Sim}_{zong} = \mathrm{Sim}_{yin} \times a + \mathrm{Sim}_{xing} \times b + \mathrm{Sim}_{yi} \times c - f$$
where a is the contribution value of the phonetic code similarity, b is the contribution value of the shape code similarity, c is the contribution value of the meaning similarity, and a+b+c=1. The contribution values a, b and c can be set according to the application scenario so that the comparison emphasizes the relevant feature: for example, a is set larger when detecting similar-sounding words, b is set larger when detecting similar-looking words, and c is set larger when detecting near-synonyms.
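Steps S41 and S42 amount to a threshold test followed by a weighted sum, which can be sketched as below. The function name is hypothetical, and the threshold t and penalty f are left as required arguments because the text gives no value for t and ties f to the replacement step.

```python
import math


def overall_similarity(sim_yin: float, sim_xing: float, sim_yi: float,
                       a: float, b: float, c: float,
                       t: float, f: float) -> float:
    """Overall similarity Sim_zong following steps S41 and S42."""
    if not math.isclose(a + b + c, 1.0):
        raise ValueError("contribution values must satisfy a + b + c = 1")
    # S41: highly similar meanings override sound and shape.
    if sim_yi > t:
        return sim_yi
    # S42: weighted combination of the three similarities minus the penalty f.
    return sim_yin * a + sim_xing * b + sim_yi * c - f
```

For example, with a = b = 0.25 and c = 0.5, two strings that sound and look identical but have a meaning similarity of 0.5 score 0.75 when f is 0 and t is 0.9.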
It is to be understood that the above embodiments of the present invention are provided by way of illustration only and do not limit the embodiments of the present invention. Other variations or modifications based on the above description will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the protection scope of the claims of the invention.

Claims (7)

1. A Chinese word similarity detection algorithm based on sound, shape and meaning is characterized by combining three characteristics of sound, shape and meaning of Chinese characters to carry out similarity detection on Chinese character strings, and comprises the following steps:
step S1: each Chinese phonetic transcription of the input Chinese character strings s1 and s2 is converted into binary phonetic codes;
step S2, converting each Chinese character in the input Chinese character strings s1 and s2 into a shape code according to the font;
step S3, respectively calculating the voice code similarity, the shape code similarity and the meaning similarity of the Chinese character strings S1 and S2;
step S4, considering the influence of the phonetic similarity, the shape code similarity and the meaning similarity on the overall similarity, and finally obtaining the overall similarity of the Chinese character strings S1 and S2;
the specific steps of the step S2 include:
step S21, dividing the strokes of Chinese characters into horizontal, vertical, left-falling, right-falling and turning strokes according to Chinese character coding rules, and setting a corresponding code for each of the horizontal, vertical, left-falling, right-falling and turning strokes;
step S22, recording the corresponding codes according to the order of the horizontal, vertical, left-falling, right-falling and turning strokes of each Chinese character in the Chinese character strings s1 and s2, to obtain the stroke order codes of the Chinese character strings s1 and s2 respectively;
step S23, obtaining stroke codes according to the stroke number of each Chinese character in the Chinese character strings S1 and S2, and obtaining structure codes according to the font structure of each Chinese character in the Chinese character strings S1 and S2;
in the step S3, the step of calculating the similarity of the phonetic codes of the chinese character strings S1, S2 includes:
step S311, comparing the total length of the characters in the Chinese character strings S1 and S2, setting the character string with shorter total length as min_s and the character string with longer total length as max_s;
step S312, the similarity of the phonetic codes between the single Chinese characters a and b in the min_s and the max_s is calculated by the following formula:
wherein h (a, b) is the phonetic code hamming distance of the Chinese character a, b, len (a) is the phonetic code length of a;
based on the similarity of the phonetic codes among the single Chinese characters, comparing each Chinese character in the min_s with each Chinese character in the max_s one by one, and based on the comparison result of the similarity of the phonetic codes among the single Chinese characters, re-exchanging and sequencing the Chinese characters in the max_s;
step S313, calculating the position difference of the Chinese characters before and after exchange, then calculating the absolute value of the position difference, and obtaining position influence factors based on the absolute value of the position difference, wherein the position influence factors are as follows:
where sum_position is the sum of absolute values of the respective position differences, and len (max_s) is the string length of max_s;
step S314, calculating the edit distance lds_yin(max_s, min_s) between min_s and the position-exchanged max_s by a weighted edit distance algorithm;
Step S315, calculating the similarity of the phonetic codes of the Chinese character strings S1 and S2:
where α is a position contribution parameter.
2. The method of claim 1, wherein the specific steps of step S1 include:
step S11: converting each initial consonant of each Chinese phonetic transcription in the Chinese character strings s1 and s2 into binary numbers according to a Gray code comparison table;
step S12: converting each vowel of each Chinese phonetic alphabet in the Chinese character strings s1 and s2 into binary numbers according to a Gray code comparison table;
step S13: if each Chinese phonetic alphabet in the Chinese character strings s1 and s2 has an intermediate vowel, converting the intermediate vowel into a binary number according to a Gray code comparison table;
step S14: the tone of each Chinese phonetic alphabet in the Chinese character strings s1 and s2 is represented by binary number.
3. The method according to claim 1, wherein in the step S3, the step of calculating the phonetic similarity of the chinese character strings S1 and S2 further comprises: if the total lengths of the characters in the chinese character strings S1 and S2 are equal, one of the chinese character strings S1 and S2 is set as min_s, the other is set as max_s, and step S312 is executed, and if the min_s and the max_s are completely matched, the phonetic similarity of the chinese character strings S1 and S2 is calculated by the following formula:
wherein sum_sim_yin is the sum of the phonetic code similarities of the corresponding Chinese characters.
4. The Chinese word similarity detection algorithm based on sound, shape and meaning according to claim 3, wherein in the step S3, the step of calculating the shape code similarity of the Chinese character strings s1 and s2 comprises:
step S321, comparing the total length of the characters in the Chinese character strings s1 and s2, setting the character string with shorter total length as min_s and the character string with longer total length as max_s;
step S322, calculating the shape code similarity between the single Chinese characters a and b in the min_s and the max_s as follows:
comparing the stroke order code lengths of the single Chinese characters a and b in the min_s and the max_s, setting the smaller stroke order code length as d, setting the larger stroke order code length as s, setting the longest common substring length of the stroke order codes of the single Chinese characters a and b as Lcs_len, and calculating the common substring proportion of the Chinese characters a and b as follows:
the stroke difference c= |lena-lenb| of the Chinese character a and b, wherein lena is the stroke number of the Chinese character a, lenb is the stroke number of the Chinese character b, and the stroke number contribution ratio of the Chinese character a and b is calculated as follows:
let the position of the longest common substring of Chinese character a in the stroke order code of Chinese character a be a_p, let the position of the longest common substring of Chinese character b in the stroke order code of Chinese character b be b_p, calculate its difference p= |a_p-b_p|, and finally calculate the position contribution ratio of the longest common substring as follows:
the hamming distance ham of the structure code is calculated, and then the structural factors are calculated as:
the similarity of the shape codes of the single Chinese characters a and b in the min_s and the max_s is calculated by the following steps:
wherein α is the contribution parameter of the common substring proportion, β is the contribution parameter of the stroke count contribution ratio, i is the contribution parameter of the longest common substring position contribution ratio, and j is the contribution parameter of the structural factor;
based on the similarity of the shape codes among the single Chinese characters, comparing the similarity of the shape codes of each Chinese character in the min_s with the similarity of the shape codes of each Chinese character in the max_s one by one, and based on the comparison result of the similarity of the shape codes among the single Chinese characters, re-exchanging and sequencing the Chinese characters in the max_s;
step S323, calculating the position difference of the Chinese characters before and after exchange, then calculating the absolute value of the position difference, and obtaining position influence factors based on the absolute value of the position difference, wherein the position influence factors are as follows:
where sum_position is the sum of absolute values of the respective position differences, and len (max_s) is the string length of max_s;
step S324, calculating the edit distance lds_xing(max_s, min_s) between min_s and the position-exchanged max_s by a weighted edit distance algorithm;
Step S325, calculating the similarity of the shape codes of the Chinese character strings S1 and S2:
where α is a position contribution parameter.
5. The method according to claim 4, wherein in the step S3, the step of calculating the shape code similarity of the chinese character strings S1 and S2 further comprises: if the total lengths of the characters in the chinese character strings S1 and S2 are equal, one of the chinese character strings S1 and S2 is set as min_s, the other is set as max_s, and step S322 is executed, and if the min_s and the max_s are completely matched, the shape code similarity of the chinese character strings S1 and S2 is calculated by the following formula:
wherein sum_sim_xing is the sum of the shape code similarities of the corresponding Chinese characters.
6. The method according to claim 5, wherein in the step S3, the step of calculating the meaning similarity of the chinese character strings S1 and S2 includes:
step S331, carrying out meaning similarity detection on the Chinese character strings s1 and s2 based on a HowNet algorithm, and if the Chinese character strings s1 and s2 are meaningful, calculating the meaning similarity of the Chinese character strings s1 and s2 according to the following formula: $\mathrm{Sim}_{yi} = \mathrm{sim}_{yi}$
wherein $\mathrm{sim}_{yi}$ is the output of the HowNet algorithm;
otherwise, go to step S332;
step S332, replacing nonsensical Chinese character strings S1 and S2 with single similar Chinese characters, detecting meaning, circulating until finding the word with the closest similarity, setting a replacement penalty parameter f, and calculating the meaning similarity of the Chinese character strings S1 and S2 according to the following formula:
$$\mathrm{Sim}_{yi} = \mathrm{sim}_{yi} - f.$$
7. The method of claim 6, wherein the specific steps of step S4 include:
step S41: setting a similarity threshold t, if the meaning similarity of the Chinese character strings S1 and S2 is greater than t, taking the meaning similarity as the overall similarity between the Chinese character strings S1 and S2, otherwise, entering step S42;
step S42: the overall similarity of the chinese strings s1, s2 is calculated by:
$$\mathrm{Sim}_{zong} = \mathrm{Sim}_{yin} \times a + \mathrm{Sim}_{xing} \times b + \mathrm{Sim}_{yi} \times c - f$$
where a is a contribution value of the similarity of the phonetic code, b is a contribution value of the similarity of the shape code, c is a contribution value of the meaning similarity, and a+b+c=1.
CN202011058506.XA 2020-09-30 2020-09-30 Chinese word similarity detection algorithm based on sound, shape and meaning Active CN112329390B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011058506.XA CN112329390B (en) 2020-09-30 2020-09-30 Chinese word similarity detection algorithm based on sound, shape and meaning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011058506.XA CN112329390B (en) 2020-09-30 2020-09-30 Chinese word similarity detection algorithm based on sound, shape and meaning

Publications (2)

Publication Number Publication Date
CN112329390A CN112329390A (en) 2021-02-05
CN112329390B true CN112329390B (en) 2023-08-04

Family

ID=74314366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011058506.XA Active CN112329390B (en) 2020-09-30 2020-09-30 Chinese word similarity detection algorithm based on sound, shape and meaning

Country Status (1)

Country Link
CN (1) CN112329390B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766236B (en) * 2021-03-10 2023-04-07 拉扎斯网络科技(上海)有限公司 Text generation method and device, computer equipment and computer readable storage medium
CN114091436B (en) * 2022-01-21 2022-05-17 万商云集(成都)科技股份有限公司 Sensitive word detection method based on decision tree and variant recognition
CN114386385A (en) * 2022-03-22 2022-04-22 北京创新乐知网络技术有限公司 Method, device, system and storage medium for discovering sensitive word derived vocabulary
CN116757189B (en) * 2023-08-11 2023-10-31 四川互慧软件有限公司 Patient name disambiguation method based on Chinese character features
CN117010368B (en) * 2023-10-07 2024-07-09 山东齐鲁壹点传媒有限公司 Chinese error correction data enhancement method based on font similarity

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287286A (en) * 2019-06-13 2019-09-27 北京百度网讯科技有限公司 The determination method, apparatus and storage medium of short text similarity

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
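Improved Chinese character string matching based on edit distance and similarity; 邵清; 叶琨; 电子科技 (Electronic Science and Technology) (09); pp. 13-17 *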

Also Published As

Publication number Publication date
CN112329390A (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN112329390B (en) Chinese word similarity detection algorithm based on sound, shape and meaning
WO2020186778A1 (en) Error word correction method and device, computer device, and storage medium
JP5462001B2 (en) Contextual input method
US7174288B2 (en) Multi-modal entry of ideogrammatic languages
RU2377664C2 (en) Text input method
CN110163181B (en) Sign language identification method and device
KR100656736B1 (en) System and method for disambiguating phonetic input
CN101067780B (en) Character inputting system and method for intelligent equipment
CN101133411A (en) Fault-tolerant romanized input method for non-roman characters
CN112966496B (en) Chinese error correction method and system based on pinyin characteristic representation
CN111310441A (en) Text correction method, device, terminal and medium based on BERT voice recognition
CN105404621A (en) Method and system for blind people to read Chinese character
JP5809381B1 (en) Natural language processing system, natural language processing method, and natural language processing program
CN111079418B (en) Named entity recognition method, device, electronic equipment and storage medium
CN113901170A (en) Event extraction method and system combining Bert model and template matching and electronic equipment
CN113822044A (en) Grammar error correction data generating method, device, computer equipment and storage medium
CN117290515A (en) Training method of text annotation model, method and device for generating text graph
CN115455948A (en) Spelling error correction model training method, spelling error correction method and storage medium
JP2006235916A (en) Text analysis device, text analysis method and speech synthesizer
Guan et al. Text error correction after text recognition based on MacBERT4CSC
CN111090720B (en) Hot word adding method and device
CN109241496B (en) Phonetic system
CN117010368B (en) Chinese error correction data enhancement method based on font similarity
CN116757189B (en) Patient name disambiguation method based on Chinese character features
CN115270769A (en) Text error correction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant