CN108009253A - An improved character string similarity comparison method - Google Patents

An improved character string similarity comparison method

Info

Publication number
CN108009253A
Authority
CN
China
Prior art keywords
word
character string
code
chinese
idf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711263775.8A
Other languages
Chinese (zh)
Inventor
杜庆治
陈鸣
邵玉斌
龙华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201711263775.8A
Publication of CN108009253A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/90335: Query processing
    • G06F16/90344: Query processing by using string matching techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present invention relates to an improved character string similarity comparison method and belongs to the field of duplicate checking. The invention computes the locality-sensitive hash value of the string to be checked, then computes the locality-sensitive hash value of the comparison string, and finally computes the Hamming distance between the two, thereby obtaining the similarity of the two texts. To improve the accuracy of the calculation, after word segmentation each Chinese character of the string is encoded with a sound-shape code, so that homophones and similarly shaped Chinese characters can be compared effectively.

Description

An improved character string similarity comparison method
Technical field
The present invention relates to an improved character string similarity comparison method and belongs to the field of duplicate checking.
Background art
In the fields of data mining and knowledge discovery, a major challenge brought by the flood of massive data is the large amount of duplicated information. Domestically, statistics indicate that about 30% of web pages are duplicates, and excessive duplicated information is one of the main causes of retrieval difficulty. The locality-sensitive hashing algorithm was designed specifically for deduplicating web pages at the scale of hundreds of millions, and it is also widely applied to text deduplication, although text deduplication is more complicated than web page deduplication because Chinese sentence structure is special and words are polysemous. The core idea of the algorithm is dimensionality reduction: high-dimensional feature vectors are mapped to low-dimensional feature vectors, and the degree to which two documents are similar or duplicated is judged by computing the Hamming distance between the two vectors. In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ. For example, the Hamming distance between 1000110 and 1000001 is 3. On this basis, the approach can be generalized to character string similarity comparison, with duplicate checking based on the homophones and similarly shaped characters in the string, which both greatly reduces comparison time and improves comparison accuracy.
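The Hamming distance definition above can be illustrated with a minimal Python sketch. The function name `hamming_distance` is a hypothetical helper introduced here for illustration; it is not part of the patent text.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is defined only for equal-length strings")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


# The pair of bit strings used as the example in the background section.
print(hamming_distance("1000110", "1000001"))
```

For the two bit strings quoted in the text, the last three positions differ, which gives the distance of 3 stated above.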
Summary of the invention
The present invention provides an improved character string similarity comparison method for judging the similarity of character strings.
The technical solution of the invention is an improved character string similarity comparison method with the following steps:
S1, word segmentation: perform Chinese word segmentation on the string to be checked;
S2, hash: encode each Chinese character in each word into a ten-digit sound-shape code, add up the sound-shape codes of the Chinese characters in each word, and convert the accumulated sound-shape code into binary as the hash value of the word; the sound-shape code consists half of a sound code and half of a shape code, the sound code being composed of the final, initial, complement code and tone of each Chinese character, and the shape code being composed of the structure bit, four-corner code and stroke count of each Chinese character;
S3, weight: compute the weight of each segmented word using the TF_IDF algorithm, as follows:
1. compute the term frequency of each word obtained by segmenting the string to be checked:
TF=n1/n2
where TF is the term frequency of a word, n1 is the number of times the word occurs in the string, and n2 is the total number of words in the string;
2. compute the inverse document frequency:
IDF=ln(ducom/ducom1)
where IDF is the inverse document frequency, ducom is the total number of documents in the corpus, and ducom1 is the number of documents in which the word has occurred;
3. compute the weight of the word:
TF_IDF=TF*IDF
where TF_IDF is the weight of the word, and the term frequency and inverse document frequency are those of the same word;
S4, weighting: take the binary hash value of each word obtained in step S2 and examine every bit: if the bit is 1, the position takes the positive weight of the word; if the bit is 0, the position takes the negative weight of the word; once every bit of the binary hash value has been examined, each word yields a sequence of numbers;
S5, accumulation: for the words obtained by segmenting the string to be checked, add up the weighted results position by position to form the accumulated result of the string;
S6, dimensionality reduction: reduce each position of the numeric sequence formed by the accumulated result of the string: if the value at the position is greater than 0, the position becomes 1; otherwise, the position becomes 0; once every position has been examined, the locality-sensitive hash value of the string is obtained;
S7, duplicate checking: compute the locality-sensitive hash value of the comparison string by the same steps, then compute the Hamming distance between it and that of the string to be checked to judge the similarity of the two.
In step S2, the accumulation is performed without carry.
In step S7, if the Hamming distance is less than 3, the two strings are judged to be similar.
The beneficial effects of the invention are: 1. the invention reduces high-dimensional feature vectors to low-dimensional feature vectors, which saves string comparison time; 2. the sound-shape code used when computing the hash value retains both the pinyin features and the glyph features of Chinese characters, which improves the accuracy of comparing homophones and similarly shaped characters.
Brief description of the drawings
Fig. 1 shows the Chinese character sound-shape code encoding rule of the present invention.
Detailed description of the embodiments
Embodiment 1: As shown in Fig. 1, an improved character string similarity comparison method has the following steps:
S1, word segmentation: perform Chinese word segmentation on the string to be checked;
S2, hash: encode each Chinese character in each word into a ten-digit sound-shape code, add up the sound-shape codes of the Chinese characters in each word, and convert the accumulated sound-shape code into binary as the hash value of the word; the sound-shape code consists half of a sound code and half of a shape code, the sound code being composed of the final, initial, complement code and tone of each Chinese character, and the shape code being composed of the structure bit, four-corner code and stroke count of each Chinese character;
S3, weight: compute the weight of each segmented word using the TF_IDF algorithm, as follows:
1. compute the term frequency of each word obtained by segmenting the string to be checked:
TF=n1/n2
where TF is the term frequency of a word, n1 is the number of times the word occurs in the string, and n2 is the total number of words in the string;
2. compute the inverse document frequency:
IDF=ln(ducom/ducom1)
where IDF is the inverse document frequency, ducom is the total number of documents in the corpus, and ducom1 is the number of documents in which the word has occurred;
3. compute the weight of the word:
TF_IDF=TF*IDF
where TF_IDF is the weight of the word, and the term frequency and inverse document frequency are those of the same word;
S4, weighting: take the binary hash value of each word obtained in step S2 and examine every bit: if the bit is 1, the position takes the positive weight of the word; if the bit is 0, the position takes the negative weight of the word; once every bit of the binary hash value has been examined, each word yields a sequence of numbers;
S5, accumulation: for the words obtained by segmenting the string to be checked, add up the weighted results position by position to form the accumulated result of the string;
S6, dimensionality reduction: reduce each position of the numeric sequence formed by the accumulated result of the string: if the value at the position is greater than 0, the position becomes 1; otherwise, the position becomes 0; once every position has been examined, the locality-sensitive hash value of the string is obtained;
S7, duplicate checking: compute the locality-sensitive hash value of the comparison string by the same steps, then compute the Hamming distance between it and that of the string to be checked to judge the similarity of the two.
Further, in step S2 the accumulation may be performed without carry.
Further, in step S7 the two strings may be judged to be similar if the Hamming distance is less than 3.
Embodiment 2: As shown in Fig. 1, an improved character string similarity comparison method has the following steps:
S1, word segmentation: perform Chinese word segmentation on the short string to be checked;
For example, for "China Yunnan Kunming Zi Lang Road", a word segmentation toolkit is used and the segmentation result is: China/Yunnan/Kunming/Zi Lang Road.
S2, hash: encode each Chinese character into a ten-digit sound-shape code consisting half of a sound code and half of a shape code. The sound code is composed of the final, initial, complement code and tone of each Chinese character. The first position is the final: by a simple substitution rule, the final of the Chinese character is mapped to one character position. Chinese pinyin has 24 finals in total, and some of them are replaced by the same character for the purposes of later calculation, as shown in Table 1.
Table 1:
The second position is the initial; likewise, the initial is converted into a character using a substitution table, as shown in Table 2.
Table 2:
The third position is the complement code, normally used when there is another consonant between the initial and the final; it follows the same substitution rule as the table of finals and defaults to 0 if there is none. The fourth position is the tone, with 1, 2, 3 and 4 standing for the four tones of Chinese. In the shape code, the fifth position is the structure bit: according to the structure of the Chinese character, one character represents that structure, as shown in Table 3.
Table 3:
The sixth to ninth positions are the four-corner code, which describes the form of the Chinese character. For each character, the numbers of the four corners are taken in the order upper left, upper right, lower left, lower right. The coding rule follows the mnemonic: a horizontal stroke is 1, a vertical stroke is 2, a dot or right-falling stroke is 3; a cross is 4, a stroke passing through two or more strokes is 5, a box is 6; a corner is 7, the shape 八 is 8, the shape 小 is 9; and a dot above a horizontal stroke is 0. For example, the four-corner code of 庄 ("village") is 0021 and the four-corner code of 气 ("gas") is 8001. The tenth position is the stroke count: 1 to 9 represent one to nine strokes, A represents 10, B represents 11, and so on up to Z, which represents 35; any count above 35 also uses Z. Every position of the sound-shape code described above is a character in the range 0-9 or A-Z.
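The ten positions described above can be assembled mechanically once the per-field characters are known. The following is a minimal Python sketch under stated assumptions, not the patent's implementation: the names `stroke_digit` and `sound_shape_code` are hypothetical, and because Tables 1 to 3 are not reproduced in this text, the characters for the final, initial, complement code and structure bit are passed in as already-looked-up inputs. The sample field values are purely illustrative.

```python
import string

# Digits 0-9 followed by A-Z, so value 10 maps to "A" and value 35 to "Z".
BASE36 = string.digits + string.ascii_uppercase


def stroke_digit(strokes: int) -> str:
    """Encode a stroke count as one character; counts above 35 are clamped to Z."""
    if strokes < 1:
        raise ValueError("stroke count must be positive")
    return BASE36[min(strokes, 35)]


def sound_shape_code(final: str, initial: str, complement: str, tone: int,
                     structure: str, four_corner: str, strokes: int) -> str:
    """Join the sound half (positions 1-4) and the shape half (positions 5-10)."""
    if tone not in (1, 2, 3, 4):
        raise ValueError("tone must be 1, 2, 3 or 4")
    if len(four_corner) != 4 or not four_corner.isdigit():
        raise ValueError("four-corner code must be four digits")
    sound = final + initial + complement + str(tone)
    shape = structure + four_corner + stroke_digit(strokes)
    return sound + shape


# Illustrative inputs: final "F", initial "7", no complement, tone 2,
# structure "1", four-corner code 1313, 11 strokes.
print(sound_shape_code("F", "7", "0", 2, "1", "1313", 11))
```

Keeping the table lookups outside the function is a deliberate choice in this sketch: it avoids guessing at substitution tables that the text refers to but does not show.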
Finally, the sound-shape codes of the Chinese characters in each word are added up, and the accumulated sound-shape code is converted into binary as the hash value of the word.
For example, the sound-shape code of 琅 (láng, the "Lang" of Zi Lang Road) is "F70211313B", which is then converted to binary. Likewise, 狼 (láng, "wolf") can be encoded as F70214323A, which allows homophones to be compared; similarly shaped characters, such as those for "refined" and "China fir", can be compared for glyph similarity. To compute the hash value of Zi Lang Road, the sound-shape codes of its three characters (Zi, Lang, Road) are derived by the method above and then added up, giving 41GE5EC78Y. (Positions 1, 2, 3 and 10 of the ten-digit sound-shape code can exceed F, so each of these positions is expanded into two hexadecimal digits, while the other six positions are unchanged; this gives 8+6=14 hexadecimal digits, each of which is 4 binary digits.) The result is then converted into a 56-bit binary number (the longer length makes the subsequent calculation more accurate): 111010111…; the hash values of China, Yunnan and Kunming are computed in the same way. Note that during accumulation the corresponding positions are added without carry, for example Z+Z=Y;
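The carry-free accumulation and the expansion to 56 bits can be sketched as follows. This is a minimal Python illustration under stated assumptions: `add_without_carry` and `to_bits` are hypothetical names, the carry-free sum is read as addition modulo 36 per position (which is consistent with Z+Z=Y), and the bit layout (8 bits for positions 1, 2, 3 and 10, 4 bits elsewhere) is one interpretation of the text. It is not guaranteed to reproduce the elided bit string quoted above.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Positions 1, 2, 3 and 10 (0-based 0, 1, 2, 9) are given two hex digits (8 bits).
WIDE_POSITIONS = {0, 1, 2, 9}


def add_without_carry(codes: list[str]) -> str:
    """Add equal-length base-36 codes position by position, discarding carries."""
    if not codes:
        raise ValueError("at least one code is required")
    width = len(codes[0])
    if any(len(code) != width for code in codes):
        raise ValueError("all codes must have the same length")
    return "".join(
        DIGITS[sum(int(code[i], 36) for code in codes) % 36]
        for i in range(width)
    )


def to_bits(code: str) -> str:
    """Expand a ten-character code to bits: 8 bits for wide positions, 4 otherwise."""
    bits = []
    for pos, ch in enumerate(code):
        width = 8 if pos in WIDE_POSITIONS else 4
        value = int(ch, 36)
        if value >= 1 << width:
            # Assumption: narrow positions are expected to stay within 0-F.
            raise ValueError(f"position {pos + 1} does not fit in {width} bits")
        bits.append(format(value, f"0{width}b"))
    return "".join(bits)


print(add_without_carry(["Z", "Z"]))
print(len(to_bits("41GE5EC78Y")))
```

Four wide positions and six narrow ones give 4×8 + 6×4 = 56 bits, matching the length stated in the text.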
S3, weight: compute the weight of each segmented word using the TF_IDF algorithm, as follows:
1. compute the term frequency of each word obtained by segmenting the string to be checked:
TF=n1/n2
where TF is the term frequency of a word, n1 is the number of times the word occurs in the string, and n2 is the total number of words in the string;
2. compute the inverse document frequency:
IDF=ln(ducom/ducom1)
where IDF is the inverse document frequency, ducom is the total number of documents in the corpus, and ducom1 is the number of documents in which the word has occurred;
3. compute the weight of the word:
TF_IDF=TF*IDF
where TF_IDF is the weight of the word, and the term frequency and inverse document frequency are those of the same word;
For example, Zi Lang Road occurs once in the string, which has 4 words in total, so its term frequency is 0.25; the corpus has 10000 articles and Zi Lang Road has occurred in 1000 of them, so the inverse document frequency of the word is 2.3, and the weight of "Zi Lang Road" is 0.25*2.3, i.e. 0.57; the weights of China, Yunnan and Kunming are computed in the same way;
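The weight calculation in this step can be checked with a short Python sketch. The function name `tf_idf` is a hypothetical helper; its parameters reuse the symbols n1, n2, ducom and ducom1 defined above.

```python
import math


def tf_idf(n1: int, n2: int, ducom: int, ducom1: int) -> float:
    """Weight of a word: term frequency times inverse document frequency."""
    tf = n1 / n2                    # occurrences of the word / total words in the string
    idf = math.log(ducom / ducom1)  # natural logarithm, as in IDF=ln(ducom/ducom1)
    return tf * idf


# The Zi Lang Road example: 1 occurrence among 4 words,
# and 1000 of 10000 corpus documents contain the word.
print(round(tf_idf(1, 4, 10000, 1000), 4))
```

With these inputs the product is about 0.5756, which the text reports as 0.57.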
S4, weighting: take the binary hash value of each word obtained in step S2 and examine every bit: if the bit is 1, the weight at that position is TF_IDF; if the bit is 0, the weight at that position is -TF_IDF; once every bit of the binary hash value has been examined, each word yields a sequence of numbers;
For example, the hash value of Zi Lang Road converted to 56-bit binary is 111010111…, so the weighted result of the word is a sequence of 56 values: 0.57, 0.57, 0.57, -0.57, 0.57, -0.57, 0.57, 0.57, 0.57, …; the weighted results of China, Yunnan and Kunming are computed in the same way;
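The bit-by-bit weighting can be sketched in a few lines of Python. The name `weight_bits` is hypothetical; the input is the visible nine-bit prefix of the hash value quoted in the example.

```python
def weight_bits(bits: str, weight: float) -> list[float]:
    """Map each 1 bit to +weight and each 0 bit to -weight."""
    if set(bits) - {"0", "1"}:
        raise ValueError("bits must contain only 0 and 1")
    return [weight if bit == "1" else -weight for bit in bits]


# First nine bits of the Zi Lang Road hash value, with the weight 0.57 from the text.
print(weight_bits("111010111", 0.57))
```

The resulting signs follow the bit pattern, so the fourth and sixth values are negative, as in the sequence listed above.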
S5, accumulation: for the words obtained by segmenting the string to be checked, add up the weighted results position by position to form the accumulated result of the string;
For example, for China/Yunnan/Kunming/Zi Lang Road, the accumulated weighted result is a sequence of 56 values: 13, 108, -22, -5, -32, 55, …;
S6, dimensionality reduction: reduce each position of the numeric sequence formed by the accumulated result of the string: if the value at the position is greater than 0, the position is set to 1; otherwise, it is set to 0; once every position has been examined, the locality-sensitive hash value of the string is obtained;
For example, for China/Yunnan/Kunming/Zi Lang Road, the result of dimensionality reduction is the 56-bit binary number 110001…;
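The accumulation and dimensionality-reduction steps can be sketched together in Python. The names `accumulate` and `reduce_to_bits` are hypothetical, and the two short weighted vectors passed to `accumulate` are made-up inputs for illustration only; the six accumulated values passed to `reduce_to_bits` are the ones quoted in the example.

```python
def accumulate(weighted_words: list[list[float]]) -> list[float]:
    """Sum the weighted sequences of all words position by position."""
    if not weighted_words:
        raise ValueError("at least one weighted sequence is required")
    length = len(weighted_words[0])
    if any(len(seq) != length for seq in weighted_words):
        raise ValueError("all weighted sequences must have the same length")
    return [sum(column) for column in zip(*weighted_words)]


def reduce_to_bits(totals: list[float]) -> str:
    """Set a position to 1 if its accumulated value is greater than 0, else 0."""
    return "".join("1" if total > 0 else "0" for total in totals)


# Hypothetical two-word input, only to show the column-wise sum.
print(accumulate([[1.0, -1.0], [0.5, 0.5]]))

# The first six accumulated values quoted in the text.
print(reduce_to_bits([13, 108, -22, -5, -32, 55]))
```

The second call yields the prefix 110001 given in the example; a value of exactly 0 falls into the "otherwise" branch and becomes 0.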
S7, duplicate checking: compute the locality-sensitive hash value of the comparison string by the same steps, then compute the Hamming distance between it and that of the string to be checked; if the Hamming distance is less than 3, the two strings can be judged to be similar;
For example, for China/Yunnan/Kunming/Crape Myrtle Road, the result of dimensionality reduction is the 56-bit binary number 111101…, and for China/Yunnan/Kunming/Zi Lang Road it is the 56-bit binary number 110001…; the Hamming distance is 2, so the two strings can be judged to be similar (the digits after the ellipsis are identical);
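The final comparison can be sketched in Python by treating each fingerprint as an integer and counting the set bits of their XOR. The name `is_similar` and the default `threshold` parameter are hypothetical; the threshold value 3 is the one stated in this step, and only the six visible leading bits of the example fingerprints are used because the remaining bits are stated to be identical.

```python
def is_similar(fp_a: str, fp_b: str, threshold: int = 3) -> bool:
    """Judge two equal-length binary fingerprints similar if their Hamming distance is below threshold."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # XOR leaves a 1 bit wherever the two fingerprints differ.
    distance = bin(int(fp_a, 2) ^ int(fp_b, 2)).count("1")
    return distance < threshold


# Visible prefixes of the two fingerprints from the example.
print(is_similar("111101", "110001"))
```

Identical trailing bits contribute nothing to the XOR, so comparing the differing prefixes gives the same distance of 2 as comparing the full 56-bit values.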
The embodiments of the present invention have been explained in detail above with reference to the accompanying drawing, but the present invention is not limited to the above embodiments; within the knowledge of a person of ordinary skill in the art, various changes can also be made without departing from the concept of the present invention.

Claims (3)

  1. An improved character string similarity comparison method, characterized in that the method has the following steps:
    S1, word segmentation: perform Chinese word segmentation on the string to be checked;
    S2, hash: encode each Chinese character in each word into a ten-digit sound-shape code, add up the sound-shape codes of the Chinese characters in each word, and convert the accumulated sound-shape code into binary as the hash value of the word; the sound-shape code consists half of a sound code and half of a shape code, the sound code being composed of the final, initial, complement code and tone of each Chinese character, and the shape code being composed of the structure bit, four-corner code and stroke count of each Chinese character;
    S3, weight: compute the weight of each segmented word using the TF_IDF algorithm, as follows:
    1. compute the term frequency of each word obtained by segmenting the string to be checked:
    TF=n1/n2
    where TF is the term frequency of a word, n1 is the number of times the word occurs in the string, and n2 is the total number of words in the string;
    2. compute the inverse document frequency:
    IDF=ln(ducom/ducom1)
    where IDF is the inverse document frequency, ducom is the total number of documents in the corpus, and ducom1 is the number of documents in which the word has occurred;
    3. compute the weight of the word:
    TF_IDF=TF*IDF
    where TF_IDF is the weight of the word, and the term frequency and inverse document frequency are those of the same word;
    S4, weighting: take the binary hash value of each word obtained in step S2 and examine every bit: if the bit is 1, the position takes the positive weight of the word; if the bit is 0, the position takes the negative weight of the word; once every bit of the binary hash value has been examined, each word yields a sequence of numbers;
    S5, accumulation: for the words obtained by segmenting the string to be checked, add up the weighted results position by position to form the accumulated result of the string;
    S6, dimensionality reduction: reduce each position of the numeric sequence formed by the accumulated result of the string: if the value at the position is greater than 0, the position becomes 1; otherwise, the position becomes 0; once every position has been examined, the locality-sensitive hash value of the string is obtained;
    S7, duplicate checking: compute the locality-sensitive hash value of the comparison string by the same steps, then compute the Hamming distance between it and that of the string to be checked to judge the similarity of the two.
  2. The improved character string similarity comparison method according to claim 1, characterized in that in step S2 the accumulation is performed without carry.
  3. The improved character string similarity comparison method according to claim 1, characterized in that in step S7, if the Hamming distance is less than 3, the two strings are judged to be similar.
CN201711263775.8A 2017-12-05 2017-12-05 An improved character string similarity comparison method Pending CN108009253A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711263775.8A CN108009253A (en) 2017-12-05 2017-12-05 A kind of improved character string Similar contrasts method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711263775.8A CN108009253A (en) 2017-12-05 2017-12-05 A kind of improved character string Similar contrasts method

Publications (1)

Publication Number Publication Date
CN108009253A (en) 2018-05-08

Family

ID=62056462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711263775.8A Pending CN108009253A (en) 2017-12-05 2017-12-05 An improved character string similarity comparison method

Country Status (1)

Country Link
CN (1) CN108009253A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108629046A (en) * 2018-05-14 2018-10-09 平安科技(深圳)有限公司 A kind of fields match method and terminal device
CN109271610A (en) * 2018-07-27 2019-01-25 昆明理工大学 A kind of vector expression of Chinese character
CN110032738A (en) * 2019-04-16 2019-07-19 中森云链(成都)科技有限责任公司 Microblogging text normalization method based on context graph random walk and phonetic-stroke code
CN111209447A (en) * 2019-02-27 2020-05-29 山东大学 Chinese character string similarity calculation method and device based on sound-shape codes
CN111507732A (en) * 2019-01-30 2020-08-07 北京嘀嘀无限科技发展有限公司 System and method for identifying similar trajectories
CN111563139A (en) * 2020-07-15 2020-08-21 平安国际智慧城市科技股份有限公司 Checking method and device for identifying invoice drug name through OCR (optical character recognition) and computer equipment
CN111753147A (en) * 2020-06-27 2020-10-09 百度在线网络技术(北京)有限公司 Similarity processing method, device, server and storage medium
CN112487409A (en) * 2020-11-30 2021-03-12 杭州橙鹰数据技术有限公司 Method and device for detecting weak password
CN113626554A (en) * 2021-08-17 2021-11-09 北京计算机技术及应用研究所 Method for calculating hash value of Chinese document
CN114520059A (en) * 2022-02-21 2022-05-20 黑龙江中医药大学 Traditional Chinese medicine diagnostics data platform based on big data
US20220215170A1 (en) * 2021-01-06 2022-07-07 Tencent America LLC Framework for chinese text error identification and correction

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102626A (en) * 2014-07-07 2014-10-15 厦门推特信息科技有限公司 Method for computing semantic similarities among short texts
CN105955976A (en) * 2016-04-15 2016-09-21 中国工商银行股份有限公司 Automatic answering system and method
CN106873964A (en) * 2016-12-23 2017-06-20 浙江工业大学 A kind of improved SimHash detection method of code similarities

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102626A (en) * 2014-07-07 2014-10-15 厦门推特信息科技有限公司 Method for computing semantic similarities among short texts
CN104102626B (en) * 2014-07-07 2017-08-15 厦门推特信息科技有限公司 A kind of method for short text Semantic Similarity Measurement
CN105955976A (en) * 2016-04-15 2016-09-21 中国工商银行股份有限公司 Automatic answering system and method
CN106873964A (en) * 2016-12-23 2017-06-20 浙江工业大学 A kind of improved SimHash detection method of code similarities

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ONEDAYDAYUP: "中文相似度匹配算法", 《360个人图书馆HTTP://WWW.360DOC.COM/CONTENT/16/0113/12/16740871_527575144.SHTML》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108629046B (en) * 2018-05-14 2023-08-18 平安科技(深圳)有限公司 Field matching method and terminal equipment
CN108629046A (en) * 2018-05-14 2018-10-09 平安科技(深圳)有限公司 A kind of fields match method and terminal device
CN109271610A (en) * 2018-07-27 2019-01-25 昆明理工大学 A kind of vector expression of Chinese character
CN111507732B (en) * 2019-01-30 2023-07-07 北京嘀嘀无限科技发展有限公司 System and method for identifying similar trajectories
CN111507732A (en) * 2019-01-30 2020-08-07 北京嘀嘀无限科技发展有限公司 System and method for identifying similar trajectories
CN111209447A (en) * 2019-02-27 2020-05-29 山东大学 Chinese character string similarity calculation method and device based on sound-shape codes
CN110032738A (en) * 2019-04-16 2019-07-19 中森云链(成都)科技有限责任公司 Microblogging text normalization method based on context graph random walk and phonetic-stroke code
CN111753147A (en) * 2020-06-27 2020-10-09 百度在线网络技术(北京)有限公司 Similarity processing method, device, server and storage medium
CN111563139A (en) * 2020-07-15 2020-08-21 平安国际智慧城市科技股份有限公司 Checking method and device for identifying invoice drug name through OCR (optical character recognition) and computer equipment
CN112487409A (en) * 2020-11-30 2021-03-12 杭州橙鹰数据技术有限公司 Method and device for detecting weak password
US20220215170A1 (en) * 2021-01-06 2022-07-07 Tencent America LLC Framework for chinese text error identification and correction
US11481547B2 (en) * 2021-01-06 2022-10-25 Tencent America LLC Framework for chinese text error identification and correction
CN113626554A (en) * 2021-08-17 2021-11-09 北京计算机技术及应用研究所 Method for calculating hash value of Chinese document
CN113626554B (en) * 2021-08-17 2023-08-25 北京计算机技术及应用研究所 Method for calculating hash value of Chinese document
CN114520059A (en) * 2022-02-21 2022-05-20 黑龙江中医药大学 Traditional Chinese medicine diagnostics data platform based on big data

Similar Documents

Publication Publication Date Title
CN108009253A (en) An improved character string similarity comparison method
CN106815197B (en) Text similarity determination method and device
CN106202153B (en) A kind of the spelling error correction method and system of ES search engine
CN107562824B (en) Text similarity detection method
Sadakane Succinct representations of lcp information and improvements in the compressed suffix arrays
CN110795556A (en) Abstract generation method based on fine-grained plug-in decoding
CN110277085A (en) Determine the method and device of polyphone pronunciation
CN113051371B (en) Chinese machine reading understanding method and device, electronic equipment and storage medium
Yadav et al. A novel approach of bulk data hiding using text steganography
CN104572685B (en) Data reordering method
CN115269834A (en) High-precision text classification method and device based on BERT
Thet et al. Word segmentation for the Myanmar language
CN115314236A (en) System and method for detecting phishing domains in a Domain Name System (DNS) record set
CN101930474A (en) Chinese character simple stroke search method
WO2010043117A1 (en) Digital encoding method and application thereof
Hakak et al. An efficient text representation for searching and retrieving classical diacritical arabic text
WO2021239114A1 (en) Method for synonym editing and determining creator of text
Hossain et al. Transliteration based bengali text compression using huffman principle
Whitelaw et al. Named entity recognition using a character-based probabilistic approach
Hlaing Manually constructed context-free grammar for Myanmar syllable structure
CN114595665A (en) Method for constructing binary extremely-short code word character and word coding set
Czapla et al. Universal language model fine-tuning with subword tokenization for polish
Zhang et al. Hanzi-TB-paws for helping Chinese people to memorize their passwords with longer bytes
Gongshen et al. A text information hiding algorithm based on alternatives
CN111859901A (en) English repeated text detection method, system, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180508
