CN108009253A - An improved string similarity comparison method - Google Patents
- Publication number
- CN108009253A (application CN201711263775.8A)
- Authority
- CN
- China
- Prior art keywords
- word
- character string
- code
- chinese
- idf
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Document Processing Apparatus (AREA)
Abstract
The present invention relates to an improved string similarity comparison method and belongs to the field of duplicate checking. The invention computes the locality-sensitive hash value of the string to be checked, then computes the locality-sensitive hash value of the comparison string, and finally computes the Hamming distance between the two, thereby obtaining the similarity of the two texts. To improve accuracy, after word segmentation each Chinese character is encoded with a sound-shape code, so that homophones and visually similar characters can be compared effectively.
Description
Technical field
The present invention relates to an improved string similarity comparison method and belongs to the field of duplicate checking.
Background art
In the fields of data mining and knowledge discovery, a major challenge brought by the flood of mass data is the large amount of duplicated information. Domestically, statistics indicate that about 30% of web pages are duplicates, and excessive duplicate information is one of the main causes of retrieval difficulty. The locality-sensitive hashing algorithm was designed specifically to deduplicate web pages at the scale of hundreds of millions, and it is also widely used for text deduplication, although text deduplication is more complicated than web page deduplication because Chinese sentence structure is distinctive and words are often polysemous. The core idea of the algorithm is dimensionality reduction: a high-dimensional feature vector is mapped to a low-dimensional one, and the degree to which two documents are similar or duplicated is judged by the Hamming distance between the two vectors. In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ. For example, the Hamming distance between 1000110 and 1000001 is 3. On this basis, the approach can be extended to string similarity comparison, with duplicate checking based on homophones and visually similar characters, which not only greatly reduces comparison time but also improves comparison accuracy.
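The Hamming distance example above can be checked with a few lines of Python. This is only an illustrative sketch; the function name `hamming_distance` is a hypothetical helper and is not part of the patent.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


# The two bit strings from the background section differ in the last three positions.
print(hamming_distance("1000110", "1000001"))  # 3
```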
Summary of the invention
The present invention provides an improved string similarity comparison method for judging the similarity of strings.
The technical solution of the present invention is an improved string similarity comparison method with the following steps:
S1, word segmentation: perform Chinese word segmentation on the string to be checked;
S2, hash: encode each Chinese character in each word as a ten-position sound-shape code, add up the sound-shape codes of the characters in each word position by position, and convert the accumulated sound-shape code to binary as the hash value of the word; the sound-shape code consists of a sound code part and a shape code part; the sound code is composed of the final, initial, complement code and tone of the character, and the shape code is composed of the structure position, four-corner code and stroke count of the character;
S3, weight: compute the weight of each segmented word with the TF_IDF algorithm, as follows:
1. Compute the term frequency of each word after segmenting the string to be checked:
TF = n1/n2
where TF is the term frequency of a word, n1 is the number of times the word occurs in the string, and n2 is the total number of words in the string;
2. Compute the inverse document frequency:
IDF = ln(ducom/ducom1)
where IDF is the inverse document frequency, ducom is the total number of documents in the corpus, and ducom1 is the number of documents in which the word occurs;
3. Compute the weight of the word:
TF_IDF = TF*IDF
where TF_IDF is the weight of the word, and the term frequency and inverse document frequency refer to the same word;
S4, weighting: for the binary hash value of each word obtained in step S2, examine each bit; if the bit is 1, take the positive weight of the word at that position; if it is 0, take the negative weight; continue until every bit of the binary hash value has been examined, so that each word yields a sequence of numbers;
S5, accumulation: for all words of the segmented string, add up the weighted results position by position to form the accumulated result of the string;
S6, dimensionality reduction: reduce each position of the accumulated number sequence of the string: if the value is greater than 0, the bit is set to 1; otherwise it is set to 0; once every position has been processed, the result is the locality-sensitive hash value of the string;
S7, duplicate checking: compute the locality-sensitive hash value of the comparison string by the same steps, then compute the Hamming distance between it and that of the string to be checked to judge the similarity of the two.
In step S2, the accumulation is performed without carry.
In step S7, if the Hamming distance is less than 3, the two strings are judged to be similar.
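Steps S4 to S7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the per-word binary hash values (step S2) and TF_IDF weights (step S3) are assumed to be already available, and the names `string_fingerprint` and `is_similar` are hypothetical.

```python
from typing import Iterable, Tuple


def string_fingerprint(words: Iterable[Tuple[str, float]]) -> str:
    """S4-S6: combine (binary_hash, weight) pairs into one fingerprint."""
    words = list(words)
    if not words:
        raise ValueError("at least one word is required")
    width = len(words[0][0])
    totals = [0.0] * width
    for bits, weight in words:
        if len(bits) != width:
            raise ValueError("all word hashes must have the same width")
        for i, bit in enumerate(bits):
            # S4: a 1 bit contributes +weight, a 0 bit contributes -weight.
            # S5: contributions are accumulated position by position.
            totals[i] += weight if bit == "1" else -weight
    # S6: positions with a total greater than 0 become 1, all others 0.
    return "".join("1" if total > 0 else "0" for total in totals)


def is_similar(fp_a: str, fp_b: str, threshold: int = 3) -> bool:
    """S7: similar when the Hamming distance is below the threshold."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same width")
    distance = sum(x != y for x, y in zip(fp_a, fp_b))
    return distance < threshold


# Toy 4-bit hashes and weights, chosen only for illustration.
print(string_fingerprint([("1100", 0.5), ("1010", 0.3), ("0110", 0.1)]))
```

The threshold of 3 follows the value given for step S7; the 4-bit hashes in the usage line are made up and much shorter than the 56-bit values described in the embodiment.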
The beneficial effects of the invention are: 1. By reducing high-dimensional feature vectors to low-dimensional feature vectors, the invention saves string comparison time. 2. The sound-shape code used when computing the hash value preserves both the pronunciation features and the shape features of Chinese characters, which improves the accuracy of comparing homophones and visually similar characters.
Brief description of the drawings
Fig. 1 shows the coding rule of the Chinese character sound-shape code used in the present invention.
Detailed description of the embodiments
Embodiment 1: As shown in Fig. 1, an improved string similarity comparison method with the following steps:
S1, word segmentation: perform Chinese word segmentation on the string to be checked;
S2, hash: encode each Chinese character in each word as a ten-position sound-shape code, add up the sound-shape codes of the characters in each word position by position, and convert the accumulated sound-shape code to binary as the hash value of the word; the sound-shape code consists of a sound code part and a shape code part; the sound code is composed of the final, initial, complement code and tone of the character, and the shape code is composed of the structure position, four-corner code and stroke count of the character;
S3, weight: compute the weight of each segmented word with the TF_IDF algorithm, as follows:
1. Compute the term frequency of each word after segmenting the string to be checked:
TF = n1/n2
where TF is the term frequency of a word, n1 is the number of times the word occurs in the string, and n2 is the total number of words in the string;
2. Compute the inverse document frequency:
IDF = ln(ducom/ducom1)
where IDF is the inverse document frequency, ducom is the total number of documents in the corpus, and ducom1 is the number of documents in which the word occurs;
3. Compute the weight of the word:
TF_IDF = TF*IDF
where TF_IDF is the weight of the word, and the term frequency and inverse document frequency refer to the same word;
S4, weighting: for the binary hash value of each word obtained in step S2, examine each bit; if the bit is 1, take the positive weight of the word at that position; if it is 0, take the negative weight; continue until every bit of the binary hash value has been examined, so that each word yields a sequence of numbers;
S5, accumulation: for all words of the segmented string, add up the weighted results position by position to form the accumulated result of the string;
S6, dimensionality reduction: reduce each position of the accumulated number sequence of the string: if the value is greater than 0, the bit is set to 1; otherwise it is set to 0; once every position has been processed, the result is the locality-sensitive hash value of the string;
S7, duplicate checking: compute the locality-sensitive hash value of the comparison string by the same steps, then compute the Hamming distance between it and that of the string to be checked to judge the similarity of the two.
Further, step S2 may be configured so that the accumulation is performed without carry.
Further, step S7 may be configured so that the two strings are judged to be similar if the Hamming distance is less than 3.
Embodiment 2: As shown in Fig. 1, an improved string similarity comparison method with the following steps:
S1, word segmentation: perform Chinese word segmentation on the short string to be checked;
For example, the string "China Yunnan Kunming Zi Lang Road" is processed with a word segmentation tool, and the segmentation result is: China/Yunnan/Kunming/Zi Lang Road.
S2, hash: encode each Chinese character as a ten-position sound-shape code consisting of a sound code part and a shape code part. The sound code is composed of the final, initial, complement code and tone of the character. The first position is the final: using a simple substitution rule, the final of the character is mapped to a single character. Chinese pinyin has 24 finals in total; for the purposes of later calculation, some of them are replaced with the same character, as shown in Table 1.
Table 1:
The second position is the initial; likewise, the initial is converted to a character using a substitution table, as shown in Table 2.
Table 2:
The third position is the complement code, normally used when there is a further consonant between the initial and the final; it uses the same substitution rule as the final table and defaults to 0 when absent. The fourth position is the tone, with 1, 2, 3 and 4 standing for the four tones of Chinese.
The shape code begins at the fifth position, the structure position: according to the structure of the Chinese character, a single character represents that structure, as shown in Table 3.
Table 3:
The sixth to ninth positions are the four-corner code, which describes the form of the character. For each character, the numbers of the four corners are taken in the order upper left, upper right, lower left, lower right. The coding rule, given as a mnemonic rhyme, is: horizontal stroke 1; vertical stroke 2; dot or right-falling stroke 3; cross 4; insertion 5; box 6; corner 7; eight-shape 8; small 9; a dot over a horizontal stroke becomes 0. For example, the four-corner code of "village" is 0021 and the four-corner code of "gas" is 8001. The tenth position is the stroke count: 1 to 9 stand for one to nine strokes, A stands for 10, B stands for 11, and so on; Z stands for 35, and any count above 35 also uses Z. Every position of the sound-shape code described above falls within the range 0-9, A-Z.
Finally, the sound-shape codes of the characters in each word are added up, and the accumulated sound-shape code is converted to binary as the hash value of the word.
For example, the sound-shape code of "Lang" is "F70211313B", which is then converted to binary. Likewise, its homophone "wolf" is encoded as F70214323A and can be compared as a homophone; characters such as "refined" and "China fir" are compared for shape similarity. To compute the hash value of Zi Lang Road, the sound-shape codes of its three characters (Zi, Lang, Road) are obtained as described above and added up, giving 41GE5EC78Y. (Positions 1, 2, 3 and 10 of the ten-position sound-shape code may exceed F, so each of these positions is expanded to two hexadecimal digits, while the six positions that do not exceed F are unchanged, for 8+6=14 hexadecimal digits in total, each hexadecimal digit being 4 binary bits.) The result is then converted to a 56-bit binary value (the larger width makes subsequent calculation more accurate): 111010111... The hash values of China, Yunnan and Kunming are computed in the same way. Note that the accumulation adds corresponding positions without carry, for example Z+Z=Y;
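The carry-free accumulation can be sketched as position-wise addition of digits. The sketch assumes the digit alphabet 0-9, A-Z (A = 10, Z = 35) and assumes that each position wraps around modulo 36; the wrap-around is inferred from the example Z+Z=Y and is not stated explicitly in the text. The function name `add_without_carry` is hypothetical, and the conversion to the 56-bit binary value is not attempted here.

```python
# Assumed digit alphabet: 0-9 followed by A-Z, so that A = 10 and Z = 35.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def add_without_carry(*codes: str) -> str:
    """Add codes position by position, discarding any carry (mod 36)."""
    if not codes:
        raise ValueError("at least one code is required")
    length = len(codes[0])
    if any(len(code) != length for code in codes):
        raise ValueError("all codes must have the same length")
    return "".join(
        DIGITS[sum(DIGITS.index(code[i]) for code in codes) % len(DIGITS)]
        for i in range(length)
    )


print(add_without_carry("Z", "Z"))  # Y, since (35 + 35) % 36 = 34
# Adding the two example codes from the text, purely to show the mechanics:
print(add_without_carry("F70211313B", "F70214323A"))  # UE0425636L
```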
S3, weight: compute the weight of each segmented word with the TF_IDF algorithm, as follows:
1. Compute the term frequency of each word after segmenting the string to be checked:
TF = n1/n2
where TF is the term frequency of a word, n1 is the number of times the word occurs in the string, and n2 is the total number of words in the string;
2. Compute the inverse document frequency:
IDF = ln(ducom/ducom1)
where IDF is the inverse document frequency, ducom is the total number of documents in the corpus, and ducom1 is the number of documents in which the word occurs;
3. Compute the weight of the word:
TF_IDF = TF*IDF
where TF_IDF is the weight of the word, and the term frequency and inverse document frequency refer to the same word;
For example, Zi Lang Road occurs once in the string, which has 4 words in total, so its term frequency is 0.25. The corpus has 10000 articles and Zi Lang Road occurs in 1000 of them, so the inverse document frequency of the word is ln(10000/1000), about 2.3, and the weight of "Zi Lang Road" is 0.25*2.3 = 0.575, taken as 0.57 in this example. The weights of China, Yunnan and Kunming are computed in the same way;
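The weight formula can be reproduced directly. In this sketch the parameter names follow the symbols in the text (n1, n2, ducom, ducom1), while the function name `tf_idf` is hypothetical.

```python
import math


def tf_idf(n1: int, n2: int, ducom: int, ducom1: int) -> float:
    """TF_IDF = (n1 / n2) * ln(ducom / ducom1), as defined in the text."""
    tf = n1 / n2
    idf = math.log(ducom / ducom1)  # natural logarithm
    return tf * idf


# Zi Lang Road: 1 occurrence among 4 words, found in 1000 of 10000 articles.
print(round(tf_idf(1, 4, 10000, 1000), 4))  # 0.5756
```

Without intermediate rounding the weight is about 0.5756; the embodiment rounds the inverse document frequency to 2.3 first and uses 0.57.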
S4, weighting: for the binary hash value of each word obtained in step S2, examine each bit; if the bit is 1, the weight at that position is TF_IDF; if it is 0, the weight at that position is -TF_IDF; continue until every bit of the binary hash value has been examined, so that each word yields a sequence of numbers;
For example, the hash value of Zi Lang Road converted to 56-bit binary is 111010111..., so the weighted result of the word has 56 positions: 0.57, 0.57, 0.57, -0.57, 0.57, -0.57, 0.57, 0.57, 0.57, ...; the weighted results of China, Yunnan and Kunming are computed in the same way;
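The weighting step maps each bit to a signed weight. A minimal sketch follows, with `weight_bits` as a hypothetical helper name and only the first nine bits of the example shown, since the remaining bits are elided in the text.

```python
from typing import List


def weight_bits(bits: str, weight: float) -> List[float]:
    """Map each 1 bit to +weight and each 0 bit to -weight."""
    return [weight if bit == "1" else -weight for bit in bits]


# First nine bits of the Zi Lang Road hash with weight 0.57.
print(weight_bits("111010111", 0.57))
```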
S5, accumulation: for all words of the segmented string, add up the weighted results position by position to form the accumulated result of the string;
For example, for China/Yunnan/Kunming/Zi Lang Road, the accumulated weighted result has 56 positions: 13, 108, -22, -5, -32, 55, ...;
S6, dimensionality reduction: reduce each position of the accumulated number sequence of the string: if the value is greater than 0, the bit is set to 1; otherwise it is set to 0; once every position has been processed, the result is the locality-sensitive hash value of the string;
For example, for China/Yunnan/Kunming/Zi Lang Road, the result of the dimensionality reduction is the 56-bit binary value 110001...;
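Accumulation and dimensionality reduction can be sketched as a column-wise sum followed by a sign test. The function names `accumulate` and `reduce_to_bits` are hypothetical, and only the six accumulated values quoted in the example are used.

```python
from typing import List, Sequence


def accumulate(rows: Sequence[Sequence[float]]) -> List[float]:
    """S5: sum the weighted results of all words position by position."""
    return [sum(column) for column in zip(*rows)]


def reduce_to_bits(totals: Sequence[float]) -> str:
    """S6: a total greater than 0 becomes 1, anything else becomes 0."""
    return "".join("1" if total > 0 else "0" for total in totals)


# The first six accumulated values from the example.
print(reduce_to_bits([13, 108, -22, -5, -32, 55]))  # 110001
```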
S7, duplicate checking: compute the locality-sensitive hash value of the comparison string by the same steps, then compute the Hamming distance between it and that of the string to be checked; if the Hamming distance is less than 3, the two strings can be judged to be similar;
For example, for China/Yunnan/Kunming/crape myrtle road, the result of the dimensionality reduction is the 56-bit binary value 111101...; for China/Yunnan/Kunming/Zi Lang Road, the result is the 56-bit binary value 110001...; the Hamming distance is 2, so the two strings can be judged to be similar (the digits elided after the ellipsis are identical);
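When fingerprints are stored as integers rather than bit strings, the Hamming distance can be obtained with an XOR followed by a bit count. This is one possible way to implement the comparison, offered as a sketch; the function names are hypothetical and only the six leading bits shown in the example are used.

```python
def hamming_distance_int(a: int, b: int) -> int:
    """Count differing bits: XOR marks them, then count the 1s."""
    return bin(a ^ b).count("1")


def judged_similar(a: int, b: int, threshold: int = 3) -> bool:
    """Similar when the Hamming distance is less than the threshold."""
    return hamming_distance_int(a, b) < threshold


crape_myrtle_road = int("111101", 2)
zi_lang_road = int("110001", 2)
print(hamming_distance_int(crape_myrtle_road, zi_lang_road))  # 2
print(judged_similar(crape_myrtle_road, zi_lang_road))        # True
```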
The embodiments of the present invention have been explained in detail above with reference to the accompanying drawing, but the present invention is not limited to the above embodiments; within the knowledge of a person of ordinary skill in the art, various changes can also be made without departing from the concept of the present invention.
Claims (3)
- 1. An improved string similarity comparison method, characterized in that the method comprises the following steps:
  S1, word segmentation: perform Chinese word segmentation on the string to be checked;
  S2, hash: encode each Chinese character in each word as a ten-position sound-shape code, add up the sound-shape codes of the characters in each word position by position, and convert the accumulated sound-shape code to binary as the hash value of the word; the sound-shape code consists of a sound code part and a shape code part; the sound code is composed of the final, initial, complement code and tone of the character, and the shape code is composed of the structure position, four-corner code and stroke count of the character;
  S3, weight: compute the weight of each segmented word with the TF_IDF algorithm, as follows:
  1. Compute the term frequency of each word after segmenting the string to be checked: TF = n1/n2, where TF is the term frequency of a word, n1 is the number of times the word occurs in the string, and n2 is the total number of words in the string;
  2. Compute the inverse document frequency: IDF = ln(ducom/ducom1), where IDF is the inverse document frequency, ducom is the total number of documents in the corpus, and ducom1 is the number of documents in which the word occurs;
  3. Compute the weight of the word: TF_IDF = TF*IDF, where TF_IDF is the weight of the word, and the term frequency and inverse document frequency refer to the same word;
  S4, weighting: for the binary hash value of each word obtained in step S2, examine each bit; if the bit is 1, take the positive weight of the word at that position; if it is 0, take the negative weight; continue until every bit of the binary hash value has been examined, so that each word yields a sequence of numbers;
  S5, accumulation: for all words of the segmented string, add up the weighted results position by position to form the accumulated result of the string;
  S6, dimensionality reduction: reduce each position of the accumulated number sequence of the string: if the value is greater than 0, the bit is set to 1; otherwise it is set to 0; once every position has been processed, the result is the locality-sensitive hash value of the string;
  S7, duplicate checking: compute the locality-sensitive hash value of the comparison string by the same steps, then compute the Hamming distance between it and that of the string to be checked to judge the similarity of the two.
- 2. The improved string similarity comparison method according to claim 1, characterized in that: in step S2, the accumulation is performed without carry.
- 3. The improved string similarity comparison method according to claim 1, characterized in that: in step S7, if the Hamming distance is less than 3, the two strings are judged to be similar.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711263775.8A CN108009253A (en) | 2017-12-05 | 2017-12-05 | An improved string similarity comparison method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711263775.8A CN108009253A (en) | 2017-12-05 | 2017-12-05 | An improved string similarity comparison method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108009253A true CN108009253A (en) | 2018-05-08 |
Family
ID=62056462
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711263775.8A Pending CN108009253A (en) | An improved string similarity comparison method | 2017-12-05 | 2017-12-05 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108009253A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108629046A (en) * | 2018-05-14 | 2018-10-09 | 平安科技(深圳)有限公司 | A kind of fields match method and terminal device |
CN109271610A (en) * | 2018-07-27 | 2019-01-25 | 昆明理工大学 | A kind of vector expression of Chinese character |
CN110032738A (en) * | 2019-04-16 | 2019-07-19 | 中森云链(成都)科技有限责任公司 | Microblogging text normalization method based on context graph random walk and phonetic-stroke code |
CN111209447A (en) * | 2019-02-27 | 2020-05-29 | 山东大学 | Chinese character string similarity calculation method and device based on sound-shape codes |
CN111507732A (en) * | 2019-01-30 | 2020-08-07 | 北京嘀嘀无限科技发展有限公司 | System and method for identifying similar trajectories |
CN111563139A (en) * | 2020-07-15 | 2020-08-21 | 平安国际智慧城市科技股份有限公司 | Checking method and device for identifying invoice drug name through OCR (optical character recognition) and computer equipment |
CN111753147A (en) * | 2020-06-27 | 2020-10-09 | 百度在线网络技术(北京)有限公司 | Similarity processing method, device, server and storage medium |
CN112487409A (en) * | 2020-11-30 | 2021-03-12 | 杭州橙鹰数据技术有限公司 | Method and device for detecting weak password |
CN113626554A (en) * | 2021-08-17 | 2021-11-09 | 北京计算机技术及应用研究所 | Method for calculating hash value of Chinese document |
CN114520059A (en) * | 2022-02-21 | 2022-05-20 | 黑龙江中医药大学 | Traditional Chinese medicine diagnostics data platform based on big data |
US20220215170A1 (en) * | 2021-01-06 | 2022-07-07 | Tencent America LLC | Framework for chinese text error identification and correction |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104102626A (en) * | 2014-07-07 | 2014-10-15 | 厦门推特信息科技有限公司 | Method for computing semantic similarities among short texts |
CN105955976A (en) * | 2016-04-15 | 2016-09-21 | 中国工商银行股份有限公司 | Automatic answering system and method |
CN106873964A (en) * | 2016-12-23 | 2017-06-20 | 浙江工业大学 | A kind of improved SimHash detection method of code similarities |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104102626A (en) * | 2014-07-07 | 2014-10-15 | 厦门推特信息科技有限公司 | Method for computing semantic similarities among short texts |
CN104102626B (en) * | 2014-07-07 | 2017-08-15 | 厦门推特信息科技有限公司 | A kind of method for short text Semantic Similarity Measurement |
CN105955976A (en) * | 2016-04-15 | 2016-09-21 | 中国工商银行股份有限公司 | Automatic answering system and method |
CN106873964A (en) * | 2016-12-23 | 2017-06-20 | 浙江工业大学 | A kind of improved SimHash detection method of code similarities |
Non-Patent Citations (1)
Title |
---|
ONEDAYDAYUP: "Chinese similarity matching algorithm" (中文相似度匹配算法), 360 Personal Library (360个人图书馆), HTTP://WWW.360DOC.COM/CONTENT/16/0113/12/16740871_527575144.SHTML * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108629046B (en) * | 2018-05-14 | 2023-08-18 | 平安科技(深圳)有限公司 | Field matching method and terminal equipment |
CN108629046A (en) * | 2018-05-14 | 2018-10-09 | 平安科技(深圳)有限公司 | A kind of fields match method and terminal device |
CN109271610A (en) * | 2018-07-27 | 2019-01-25 | 昆明理工大学 | A kind of vector expression of Chinese character |
CN111507732B (en) * | 2019-01-30 | 2023-07-07 | 北京嘀嘀无限科技发展有限公司 | System and method for identifying similar trajectories |
CN111507732A (en) * | 2019-01-30 | 2020-08-07 | 北京嘀嘀无限科技发展有限公司 | System and method for identifying similar trajectories |
CN111209447A (en) * | 2019-02-27 | 2020-05-29 | 山东大学 | Chinese character string similarity calculation method and device based on sound-shape codes |
CN110032738A (en) * | 2019-04-16 | 2019-07-19 | 中森云链(成都)科技有限责任公司 | Microblogging text normalization method based on context graph random walk and phonetic-stroke code |
CN111753147A (en) * | 2020-06-27 | 2020-10-09 | 百度在线网络技术(北京)有限公司 | Similarity processing method, device, server and storage medium |
CN111563139A (en) * | 2020-07-15 | 2020-08-21 | 平安国际智慧城市科技股份有限公司 | Checking method and device for identifying invoice drug name through OCR (optical character recognition) and computer equipment |
CN112487409A (en) * | 2020-11-30 | 2021-03-12 | 杭州橙鹰数据技术有限公司 | Method and device for detecting weak password |
US20220215170A1 (en) * | 2021-01-06 | 2022-07-07 | Tencent America LLC | Framework for chinese text error identification and correction |
US11481547B2 (en) * | 2021-01-06 | 2022-10-25 | Tencent America LLC | Framework for chinese text error identification and correction |
CN113626554A (en) * | 2021-08-17 | 2021-11-09 | 北京计算机技术及应用研究所 | Method for calculating hash value of Chinese document |
CN113626554B (en) * | 2021-08-17 | 2023-08-25 | 北京计算机技术及应用研究所 | Method for calculating hash value of Chinese document |
CN114520059A (en) * | 2022-02-21 | 2022-05-20 | 黑龙江中医药大学 | Traditional Chinese medicine diagnostics data platform based on big data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180508 |