CN110751165B - Automatic word-composing method for disordered characters - Google Patents


Info

Publication number
CN110751165B
CN110751165B (application CN201910729423.XA)
Authority
CN
China
Prior art keywords
character
word
vector
conditional probability
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910729423.XA
Other languages
Chinese (zh)
Other versions
CN110751165A (en)
Inventor
蔡浩
陈小明
孙浩军
张承钿
姚浩生
胡超
刘正阳
梁道远
曾鑫
白璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Thirty Nine Eight Big Data Technology Co ltd
Shantou University
Original Assignee
Guangdong Thirty Nine Eight Big Data Technology Co ltd
Shantou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Thirty Nine Eight Big Data Technology Co ltd, Shantou University filed Critical Guangdong Thirty Nine Eight Big Data Technology Co ltd
Priority to CN201910729423.XA priority Critical patent/CN110751165B/en
Publication of CN110751165A publication Critical patent/CN110751165A/en
Application granted granted Critical
Publication of CN110751165B publication Critical patent/CN110751165B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211: Selection of the most significant subset of features
    • G06F18/2113: Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic word-forming method for disordered characters. A character table is first constructed by training a preset model on a training text; for the input disordered characters, all orderings are traversed and a natural word order metric value is calculated for each ordering based on the constructed character table; finally, the ordering with the highest natural word order metric value is taken as the ordering result of the disordered characters. The technical scheme of the invention reduces training cost, effectively solves the problem of low judgment accuracy for Chinese parallel phrases, and improves word-forming accuracy.

Description

Automatic word-forming method for disordered characters
Technical Field
The invention relates to the technical field of computers, in particular to an automatic word forming method for disordered characters.
Background
Correcting text information by computer is a common technical requirement. Errors in Chinese text take many forms, such as homophone errors, grammar errors, and reversed word order. Such errors have many causes: carelessness during manual keyboard input, or, when text is obtained through image scanning, failure to reassemble the characters in order from their position information, yielding a sequence of disordered characters.
For disordered characters, the prior art solves the word-forming problem with neural network models, but this solution is too heavyweight: achieving accurate word forming greatly increases the required model training cost, and the practical effect is not ideal.
Disclosure of Invention
The embodiment of the invention provides an automatic word forming method for disordered characters, which can reduce training cost and improve the word forming accuracy.
The embodiment of the invention provides an automatic word forming method for disordered characters, which comprises the following steps:
according to a training text obtained in advance, constructing a character table after training a training model; the character table comprises a first character word frequency table, a second character word frequency table and a character lookup table; elements in the first character word frequency table record the occurrence frequency of adjacent character combinations in all training texts; elements in the second character word frequency table record the occurrence frequency of character combinations separated by 1 character in all training texts; the character lookup table records a number of common characters and the total number of occurrences of each common character in all training texts;
acquiring the character string to be composed corresponding to the disordered characters, and querying, in the character lookup table, the total occurrence count corresponding to each character of the string, so as to construct a first vector;
querying the first character word frequency table and the second character word frequency table according to the current character ordering of the string to be composed, and constructing a first conditional probability count vector and a second conditional probability count vector from the query results;
calculating a first conditional probability vector and a second conditional probability vector for the string to be composed from the first vector, the first conditional probability count vector and the second conditional probability count vector;
taking the logarithm of each element of the first and second conditional probability vectors, converting the product of probabilities into a sum of log-probabilities, and obtaining a first natural word order metric value and a second natural word order metric value;
and obtaining, from the first and second natural word order metric values, the natural word order metric value for the current character ordering of the string to be composed; traversing all character orderings of the string, obtaining a natural word order metric value for each by the same calculation, selecting the ordering with the largest metric value, and automatically composing the string into words.
Furthermore, the element in row i, column j of the first character word frequency table represents the frequency with which, in all training texts, the character with hash value i is immediately followed by the character with hash value j; wherein i and j are positive integers;
the element in row i, column j of the second character word frequency table represents the frequency with which, in all training texts, the character with hash value i is followed, after one intervening character, by the character with hash value j;
and the column indices corresponding to the common characters recorded in the character lookup table are the hash values of the respective common characters.
Further, the total occurrence count corresponding to each character of the string to be composed is queried in the character lookup table to construct a first vector, specifically:
mapping each character of the string to be composed to its corresponding column of the character lookup table to obtain its total occurrence count, and recording the result as a first vector s_total.
Further, the step of querying the first and second character word frequency tables according to the current character ordering of the string to be composed, and constructing a first and a second conditional probability count vector from the query results, specifically comprises:
according to the current character ordering of the string to be composed, for each pair of adjacent elements (a, b), looking up the corresponding element R1_ab in the first character word frequency table, and constructing all queried elements into a first conditional probability count vector w_n1;
and for each pair of elements (c, d) separated by 1 character, looking up the corresponding element R2_cd in the second character word frequency table, and constructing all queried elements into a second conditional probability count vector w_n2.
Further, the first and second conditional probability vectors for the string to be composed are calculated from the first vector, the first conditional probability count vector and the second conditional probability count vector, specifically:
dividing the first conditional probability count vector w_n1 element-wise by the first n-1 elements of the first vector s_total to obtain the first conditional probability vector w1; wherein the first vector s_total comprises n elements, and each element of w1 is the conditional probability that, given the former character of an adjacent pair in the string appears, the latter character also appears;
and dividing the second conditional probability count vector w_n2 element-wise by the first n-2 elements of s_total to obtain the second conditional probability vector w2, each element of which is the conditional probability that, given the former character appears, the character separated from it by 1 character also appears.
The embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides an automatic word forming method for out-of-order characters, which comprises the steps of firstly constructing a character table by combining a training text with a preset training model, traversing all sequences for the input out-of-order characters, calculating a natural word order metric value corresponding to each sequence based on the constructed character table, and finally taking the sequence with the highest natural word order metric value as a sequence result of the out-of-order characters. Compared with the prior art that the neural network is used for word grouping and sequencing, the technical scheme of the invention uses a simple text training mode, reduces the training cost, and can effectively solve the problem of low judgment accuracy of Chinese parallel phrases and improve the word grouping accuracy.
Drawings
FIG. 1 is a flow chart illustrating an embodiment of an automatic word-composing method for out-of-order characters according to the present invention;
FIG. 2 is a schematic diagram of a training process for model training according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, which is a schematic flow chart of an embodiment of the automatic word-forming method for disordered characters provided by the present invention, the method comprises steps 101 to 106, as follows:
step 101: according to a training text obtained in advance, a character table is constructed after training of a training model; the character table comprises a first character word frequency table, a second character word frequency table and a character lookup table.
In this embodiment, the element in the first character word frequency table records the occurrence frequency of adjacent character combinations in all training texts; elements in the second character word frequency table record the occurrence frequency of combinations separated by 1 character in all training texts; the character lookup table records a number of common characters and the total number of occurrences of each common character in all training texts.
In this embodiment, the ith row and jth column element in the first character word frequency table represents the frequency of occurrence of the combination of adjacent characters with hash values i and j in all training texts; wherein i and j are positive integers; the ith row and jth column elements in the second character word frequency table represent the occurrence frequency of the combination of a second character with a hash value of j after the character with the hash value of i in all training texts; the column number corresponding to the common character recorded in the character lookup table is the hash value of each common character.
To better illustrate this embodiment, the model training process is shown by the following example. The model consists of two square matrices, text_matrix_1 (the first character word frequency table) and text_matrix_2 (the second character word frequency table), plus a two-row lookup table, text_list. text_matrix_1 and text_matrix_2 have the same structure: their row and column indices correspond to the hash values of the common characters, and the hash function is simply a lookup in text_list. The first row of text_list stores the common characters; the second row stores the total number of times each character occurred during training, used for calculating conditional probabilities; the hash value of a character is its column index in text_list. The hash function is defined so that the hash values of all characters are consecutive natural numbers starting from 0. The elements of text_matrix_1 record how often adjacent character combinations appear: C1_ij (the element in row i, column j of text_matrix_1) is the frequency with which, in the training text, the character with hash value i is immediately followed by the character with hash value j. The elements of text_matrix_2 record the frequency of combinations separated by 1 character: C2_ij (the element in row i, column j of text_matrix_2) is the frequency with which the character with hash value i is followed, after one intervening character, by the character with hash value j. Experiments show that this two-layer structure effectively solves the problem of judging parallel phrases in Chinese text without losing the generalization ability of the model.
The training process traverses the text of the training set: the number of occurrences of each character is counted and recorded at the corresponding position of text_list; the number of occurrences of each pair of adjacent characters is counted and recorded in the corresponding element of text_matrix_1; and the number of occurrences of each pair of characters separated by one character is counted and recorded in the corresponding element of text_matrix_2. The detailed training process is shown in fig. 2.
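The counting pass above can be sketched in Python. The names here (build_tables, char_total, pair_adjacent, pair_skip_one) are illustrative stand-ins, not names from the patent, and dictionaries of pair counts stand in for the square matrices text_matrix_1 and text_matrix_2:

```python
from collections import Counter

def build_tables(training_texts):
    """One pass over the training texts, counting single characters
    (text_list), adjacent pairs (text_matrix_1) and pairs separated
    by one character (text_matrix_2)."""
    char_total = Counter()      # stands in for the second row of text_list
    pair_adjacent = Counter()   # stands in for text_matrix_1
    pair_skip_one = Counter()   # stands in for text_matrix_2
    for text in training_texts:
        char_total.update(text)
        for a, b in zip(text, text[1:]):   # adjacent character pairs
            pair_adjacent[(a, b)] += 1
        for a, b in zip(text, text[2:]):   # pairs separated by one character
            pair_skip_one[(a, b)] += 1
    # The patent's hash function is just "column index in text_list";
    # here we assign consecutive indices starting from 0.
    char_to_hash = {c: i for i, c in enumerate(sorted(char_total))}
    return char_total, pair_adjacent, pair_skip_one, char_to_hash
```

On the toy corpus ["abcab"], for instance, the adjacent pair ("a", "b") is counted twice and the gap-1 pair ("a", "c") once.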
Step 102: and acquiring character strings to be composed corresponding to the characters to be composed out of order, and inquiring the total occurrence times corresponding to all the character strings to be composed according to the character lookup table in the character strings so as to construct a first vector.
In this embodiment, step 102 specifically includes: and mapping the character strings to be composed to the columns corresponding to the corresponding character lookup tables to obtain the total occurrence frequency of each character, and recording the total occurrence frequency as a first vector s _ total.
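A minimal sketch of this lookup, assuming a character-count table built during training (first_vector and char_total are hypothetical names, not from the patent):

```python
def first_vector(s, char_total):
    """The patent's s_total: the total occurrence count of each
    character of the candidate string, read from the lookup table."""
    return [char_total[c] for c in s]
```

For example, with totals {"a": 3, "b": 5}, the string "ab" maps to the first vector [3, 5].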
Step 103: querying the first and second character word frequency tables according to the current character ordering of the string to be composed, and constructing a first and a second conditional probability count vector from the query results.
In this embodiment, step 103 specifically comprises: according to the current character ordering of the string to be composed, for each pair of adjacent elements (a, b), looking up the corresponding element R1_ab in the first character word frequency table, and constructing all queried elements into a first conditional probability count vector w_n1;
and for each pair of elements (c, d) separated by 1 character, looking up the corresponding element R2_cd in the second character word frequency table, and constructing all queried elements into a second conditional probability count vector w_n2.
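A sketch of this step, with dictionaries of pair counts standing in for the two word frequency tables (the function and parameter names are illustrative assumptions):

```python
def count_vectors(s, pair_adjacent, pair_skip_one):
    """Build w_n1 and w_n2 for the current ordering of s."""
    # w_n1: count of each adjacent pair (a, b) from the first table
    w_n1 = [pair_adjacent.get((a, b), 0) for a, b in zip(s, s[1:])]
    # w_n2: count of each pair (c, d) separated by one character
    w_n2 = [pair_skip_one.get((c, d), 0) for c, d in zip(s, s[2:])]
    return w_n1, w_n2
```

For a string of length n this yields n-1 adjacent-pair counts and n-2 gap-1 counts, matching the vector dimensions discussed below.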
Step 104: calculating the first and second conditional probability vectors for the string to be composed from the first vector, the first conditional probability count vector and the second conditional probability count vector.
In this embodiment, step 104 specifically comprises: dividing the first conditional probability count vector w_n1 element-wise by the first n-1 elements of the first vector s_total to obtain the first conditional probability vector w1; wherein s_total comprises n elements, and each element of w1 is the conditional probability that, given the former character of an adjacent pair appears, the latter character also appears;
and dividing the second conditional probability count vector w_n2 element-wise by the first n-2 elements of s_total to obtain the second conditional probability vector w2, each element of which is the conditional probability that, given the former character appears, the character separated from it by 1 character also appears.
In this embodiment, the length of the string S to be composed is n; the dimension of w_n1 is one less than that of s_total, and the dimension of w1 equals that of w_n1.
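A sketch of the division, following the direction used in the worked example later in the description (pair count divided by the total count of the former character); the helper name is an assumption:

```python
def conditional_probabilities(s_total, w_n1, w_n2):
    """w1[i] ~ P(next char | current char) for adjacent pairs;
    w2[i] ~ P(char two positions later | current char).
    zip() truncates s_total to the n-1 (resp. n-2) leading elements."""
    w1 = [pair / tot for pair, tot in zip(w_n1, s_total)]
    w2 = [pair / tot for pair, tot in zip(w_n2, s_total)]
    return w1, w2
```

With s_total = [4, 2, 1], w_n1 = [2, 1] and w_n2 = [1], this gives w1 = [0.5, 0.5] and w2 = [0.25].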
Step 105: taking the logarithm of each element of the first and second conditional probability vectors, converting the product of probabilities into a sum of log-probabilities, and obtaining a first and a second natural word order metric value.
In this embodiment, the probability of any particular combination appearing is very small, so when the character string is long the product of probabilities may cause floating-point underflow. To avoid this, logarithms are first taken element-wise on w1 and w2 to obtain w_1 and w_2, and the product of probabilities is converted into a sum of log-probabilities, i.e., all elements of w_1 and w_2 are summed, yielding the natural word order metric p of the character string.
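The log-sum trick can be sketched as follows (the function name is illustrative); summing logarithms gives the same ranking as multiplying probabilities while staying in a numerically safe range:

```python
import math

def natural_word_order_metric(w1, w2):
    """Sum of log-probabilities over both conditional probability
    vectors, avoiding the underflow a direct product would risk."""
    return sum(math.log(p) for p in w1) + sum(math.log(p) for p in w2)
```

For w1 = [0.5, 0.5] and w2 = [0.25], the metric is log 0.5 + log 0.5 + log 0.25 ≈ -2.77, whereas the raw product 0.0625 would shrink toward underflow as the string grows.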
In the process of generating a character string, the latter character can be considered to be related to all characters before it; the probability that the next character of string A is character B can be abstracted as the conditional probability of B given A, i.e., P(B|A). Applying the Markov assumption simplifies the actual calculation: the probability that the k-th character is B is taken to depend only on the few characters immediately preceding it. The problem handled by the invention is not string generation but recovering the natural word order of disordered text, so the invention simplifies the problem and introduces the natural word order metric value p to measure how well a candidate string conforms to natural word order. A larger p indicates better conformance. The problem thus becomes computing p for all possible candidate orderings and outputting the one or several candidates with the largest p value. The whole process can be seen as finding the highest-probability path in a discrete-time Markov model over characters.
For a string S of length n whose characters have hash values h_1, …, h_n, the natural word order metric p is defined as follows:

p = \sum_{i=1}^{n-1} \log\frac{r_{h_{i+1},h_i}}{n_{2,h_i}} + \sum_{i=1}^{n-2} \log\frac{r'_{h_{i+2},h_i}}{n_{2,h_i}}

where n_{2,h} is the total count of the character with hash value h (the second row of text_list), r denotes elements of the first character word frequency table text_matrix_1, and r' denotes elements of the second character word frequency table text_matrix_2.
step 106: and according to the first natural language sequence metric value and the second natural language sequence metric value, obtaining a natural language sequence metric value corresponding to the current character arrangement sequence of the character string to be word-grouped, traversing all the character arrangement sequences of the character string to be word-grouped, sequentially obtaining a plurality of natural language sequence metric values according to the same calculation method, selecting the character arrangement sequence with the maximum natural language sequence metric value, and automatically word-grouping the character string to be word-grouped.
To better illustrate the technical solution of the present invention, the process is illustrated by an example: calculating the natural word order metric p for the string s = {你, 好, 吗}.
1. Calculate the first summation term. According to the encoding of each Chinese character, look up its hash value (i.e., its column index) in the first row of text_list: h_c = Hash(c), c ∈ s.
Suppose the hash values so obtained are 196, 135, and 1202.
2. From the second row of text_list, obtain the corresponding total counts, denoted n_{2,196}, n_{2,135}, and n_{2,1202}.
3. From the first character word frequency table text_matrix_1, obtain, by hash value, the occurrence frequencies of the two adjacent-character pairs in the text, denoted r_{135,196} and r_{1202,135}, i.e., the frequencies of the pairs "你好" and "好吗".
4. Obtain the first summation term of the natural word order metric p of "你好吗":

p_1 = \sum_{i=1}^{n-1} \log\frac{r_{h_{i+1},h_i}}{n_{2,h_i}} = \log\frac{r_{135,196}}{n_{2,196}} + \log\frac{r_{1202,135}}{n_{2,135}}

wherein h_i represents the hash value of the i-th Chinese character.
The second summation term is then obtained by the same calculation, and the two terms are added to give the final natural word order metric value. The invention is fully vectorized: candidate orderings are obtained by permutation and combination, the natural word order metric value p of each ordering can be computed in parallel, and the combination with the largest p value is selected as the candidate output.
In summary, the automatic word-forming method for out-of-order characters provided in the embodiments of the present invention first constructs a character table by training a preset model on a training text, traverses all orderings of the input out-of-order characters, calculates a natural word order metric value for each ordering based on the constructed character table, and finally takes the ordering with the highest metric value as the ordering result. Compared with prior art that uses a neural network for word forming and ordering, the technical scheme of the invention uses a simple text-training mode, which reduces training cost, effectively solves the problem of low judgment accuracy for Chinese parallel phrases, and improves word-forming accuracy.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made therein without departing from the spirit and scope of the invention.

Claims (5)

1. An automatic word-forming method for disordered characters, characterized by comprising the following steps:
according to a training text obtained in advance, constructing a character table after training a training model; the character table comprises a first character word frequency table, a second character word frequency table and a character lookup table; elements in the first character word frequency table record the occurrence frequency of adjacent character combinations in all training texts; elements in the second character word frequency table record the occurrence frequency of character combinations separated by 1 character in all training texts; the character lookup table records a number of common characters and the total number of occurrences of each common character in all training texts;
acquiring the character string to be composed corresponding to the disordered characters, and querying, in the character lookup table, the total occurrence count corresponding to each character of the string, so as to construct a first vector;
querying the first character word frequency table and the second character word frequency table according to the current character ordering of the string to be composed, and constructing a first conditional probability count vector and a second conditional probability count vector from the query results;
calculating a first conditional probability vector and a second conditional probability vector for the string to be composed from the first vector, the first conditional probability count vector and the second conditional probability count vector;
taking the logarithm of each element of the first and second conditional probability vectors, converting the product of probabilities into a sum of log-probabilities, and obtaining a first natural word order metric value and a second natural word order metric value;
and obtaining, from the first and second natural word order metric values, the natural word order metric value for the current character ordering of the string to be composed; traversing all character orderings of the string, obtaining a natural word order metric value for each by the same calculation, selecting the ordering with the largest metric value, and automatically composing the string into words.
2. The method according to claim 1, wherein
the element in row i, column j of the first character word frequency table represents the frequency with which, in all training texts, the character with hash value i is immediately followed by the character with hash value j; wherein i and j are positive integers;
the element in row i, column j of the second character word frequency table represents the frequency with which, in all training texts, the character with hash value i is followed, after one intervening character, by the character with hash value j;
and the column indices corresponding to the common characters recorded in the character lookup table are the hash values of the respective common characters.
3. The method according to claim 1, wherein the total occurrence count corresponding to each character of the string to be composed is queried in the character lookup table to construct a first vector, specifically:
mapping each character of the string to be composed to its corresponding column of the character lookup table to obtain its total occurrence count, and recording the result as a first vector s_total.
4. The method according to claim 3, wherein the first and second character word frequency tables are respectively queried according to the current character ordering of the string to be composed, and a first and a second conditional probability count vector are constructed from the query results, specifically:
according to the current character ordering of the string to be composed, for each pair of adjacent elements (a, b), looking up the corresponding element R1_ab in the first character word frequency table, and constructing all queried elements into a first conditional probability count vector w_n1;
and for each pair of elements (c, d) separated by 1 character, looking up the corresponding element R2_cd in the second character word frequency table, and constructing all queried elements into a second conditional probability count vector w_n2.
5. The method according to claim 4, wherein the first conditional probability vector and the second conditional probability vector corresponding to the character string to be composed are calculated from the first vector, the first conditional probability count vector and the second conditional probability count vector, specifically:
dividing the first n-1 elements of the first vector s_total by the first conditional probability count vector w_n1, element-wise, to obtain the first conditional probability vector w1; wherein the first vector s_total comprises n elements; each element of the first conditional probability vector w1 is the conditional probability that the latter character appears after the former character in the character string to be composed;
dividing the first n-1 elements of the first vector s_total by the second conditional probability count vector w_n2, element-wise, to obtain the second conditional probability vector w2; each element of the second conditional probability vector w2 is the conditional probability that, after the former character appears, the character separated from it by one character also appears in the character string to be composed.
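Read literally, the claim's translation divides s_total by the count vectors; the conventional conditional probability P(latter | former) = R1_ab / total(a) divides the other way round, and the sketch below follows that conventional reading, which is an assumption about the intended computation. It also uses n-2 skip-one pairs for w2 (a string of n characters has only n-2 of them) and returns 0.0 where the conditioning character's total is zero:

```python
def conditional_probability_vectors(s_total, w_n1, w_n2):
    """Element-wise conditional probabilities from the count vectors.
    Assumes the conventional reading: pair count divided by the total
    count of the conditioning (former) character."""
    n = len(s_total)
    w1 = [w_n1[k] / s_total[k] if s_total[k] else 0.0 for k in range(n - 1)]
    w2 = [w_n2[k] / s_total[k] if s_total[k] else 0.0 for k in range(n - 2)]
    return w1, w2
```

With s_total = [4, 2, 1], w_n1 = [2, 1] and w_n2 = [1], for example, the pair (first, second) occurred 2 times out of 4 occurrences of the first character, giving a conditional probability of 0.5.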
CN201910729423.XA 2019-08-06 2019-08-06 Automatic word-composing method for disordered characters Active CN110751165B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910729423.XA CN110751165B (en) 2019-08-06 2019-08-06 Automatic word-composing method for disordered characters


Publications (2)

Publication Number Publication Date
CN110751165A CN110751165A (en) 2020-02-04
CN110751165B true CN110751165B (en) 2023-01-24

Family

ID=69275847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910729423.XA Active CN110751165B (en) 2019-08-06 2019-08-06 Automatic word-composing method for disordered characters

Country Status (1)

Country Link
CN (1) CN110751165B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950291B (en) * 2020-06-22 2024-02-23 北京百度网讯科技有限公司 Semantic representation model generation method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101950306A (en) * 2010-09-29 2011-01-19 北京新媒传信科技有限公司 Method for filtering character strings in process of discovering new words
CN108647207A (en) * 2018-05-08 2018-10-12 上海携程国际旅行社有限公司 Natural language modification method, system, equipment and storage medium
CN108829660A (en) * 2018-05-09 2018-11-16 电子科技大学 A kind of short text signature generating method based on random number division and recursion




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant