CN110751165B - Automatic word-composing method for disordered characters - Google Patents


Info

Publication number
CN110751165B
CN110751165B (application CN201910729423.XA)
Authority
CN
China
Prior art keywords
character
word
vector
conditional probability
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910729423.XA
Other languages
Chinese (zh)
Other versions
CN110751165A (en)
Inventor
蔡浩
陈小明
孙浩军
张承钿
姚浩生
胡超
刘正阳
梁道远
曾鑫
白璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Thirty Nine Eight Big Data Technology Co ltd
Shantou University
Original Assignee
Guangdong Thirty Nine Eight Big Data Technology Co ltd
Shantou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Thirty Nine Eight Big Data Technology Co ltd, Shantou University filed Critical Guangdong Thirty Nine Eight Big Data Technology Co ltd
Priority to CN201910729423.XA priority Critical patent/CN110751165B/en
Publication of CN110751165A publication Critical patent/CN110751165A/en
Application granted granted Critical
Publication of CN110751165B publication Critical patent/CN110751165B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211: Selection of the most significant subset of features
    • G06F18/2113: Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic word-forming method for disordered characters. A character table is first constructed by training a preset model on a training text; for the input disordered characters, all orderings are traversed and a natural word order metric value is calculated for each ordering based on the constructed character table; finally, the ordering with the highest natural word order metric value is taken as the ordering result of the disordered characters. The technical scheme of the invention reduces training cost, effectively solves the problem of low judgment accuracy for Chinese parallel phrases, and improves word-forming accuracy.

Description

Automatic word-forming method for disordered characters
Technical Field
The invention relates to the technical field of computers, in particular to an automatic word forming method for disordered characters.
Background
Correcting text information by computer is a common technical requirement. Errors in Chinese text take many forms, such as homophone errors, grammar errors, and reversed word order. Such errors have many causes: carelessness during manual keyboard input, or, when text is obtained through image scanning, failure to reassemble the characters in order from their position information, yielding a sequence of disordered characters.
For disordered characters, the prior art solves the word-forming problem with neural network models, but this solution is too heavyweight: achieving accurate word forming greatly increases the required model training cost, and the practical effect is not ideal.
Disclosure of Invention
The embodiment of the invention provides an automatic word forming method for disordered characters, which can reduce training cost and improve the word forming accuracy.
The embodiment of the invention provides an automatic word forming method for disordered characters, which comprises the following steps:
according to a training text obtained in advance, constructing a character table after training a training model; the character table comprises a first character word frequency table, a second character word frequency table and a character lookup table; elements in the first character word frequency table record the occurrence frequency of adjacent character combinations in all training texts; elements in the second character word frequency table record the occurrence frequency of character combinations separated by 1 character in all training texts; the character lookup table records a number of common characters and the total number of occurrences of each common character in all training texts;
acquiring the character string to be composed corresponding to the disordered characters, and querying, in the character lookup table, the total occurrence count corresponding to each character of the string, so as to construct a first vector;
querying the first character word frequency table and the second character word frequency table according to the current character ordering of the string to be composed, and constructing a first conditional probability count vector and a second conditional probability count vector from the query results;
calculating a first conditional probability vector and a second conditional probability vector for the string to be composed from the first vector, the first conditional probability count vector and the second conditional probability count vector;
taking the logarithm of each element of the first and second conditional probability vectors, converting the product of probabilities into a sum of log-probabilities, and obtaining a first natural word order metric value and a second natural word order metric value;
and obtaining, from the first and second natural word order metric values, the natural word order metric value for the current character ordering of the string to be composed; traversing all character orderings of the string, obtaining a natural word order metric value for each by the same calculation, selecting the ordering with the largest metric value, and automatically composing the string into words.
Furthermore, the element in row i, column j of the first character word frequency table represents the frequency with which, in all training texts, the character with hash value i is immediately followed by the character with hash value j; wherein i and j are positive integers;
the element in row i, column j of the second character word frequency table represents the frequency with which, in all training texts, the character with hash value i is followed, after one intervening character, by the character with hash value j;
and the column indices corresponding to the common characters recorded in the character lookup table are the hash values of the respective common characters.
Further, the total occurrence count corresponding to each character of the string to be composed is queried in the character lookup table to construct a first vector, specifically:
mapping each character of the string to be composed to its corresponding column of the character lookup table to obtain its total occurrence count, and recording the result as a first vector s_total.
Further, the step of querying the first and second character word frequency tables according to the current character ordering of the string to be composed, and constructing a first and a second conditional probability count vector from the query results, specifically comprises:
according to the current character ordering of the string to be composed, for each pair of adjacent elements (a, b), looking up the corresponding element R1_ab in the first character word frequency table, and constructing all queried elements into a first conditional probability count vector w_n1;
and for each pair of elements (c, d) separated by 1 character, looking up the corresponding element R2_cd in the second character word frequency table, and constructing all queried elements into a second conditional probability count vector w_n2.
Further, the first and second conditional probability vectors for the string to be composed are calculated from the first vector, the first conditional probability count vector and the second conditional probability count vector, specifically:
dividing the first conditional probability count vector w_n1 element-wise by the first n-1 elements of the first vector s_total to obtain the first conditional probability vector w1; wherein the first vector s_total comprises n elements, and each element of w1 is the conditional probability that, given the former character of an adjacent pair in the string appears, the latter character also appears;
and dividing the second conditional probability count vector w_n2 element-wise by the first n-2 elements of s_total to obtain the second conditional probability vector w2, each element of which is the conditional probability that, given the former character appears, the character separated from it by 1 character also appears.
The embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides an automatic word forming method for out-of-order characters, which comprises the steps of firstly constructing a character table by combining a training text with a preset training model, traversing all sequences for the input out-of-order characters, calculating a natural word order metric value corresponding to each sequence based on the constructed character table, and finally taking the sequence with the highest natural word order metric value as a sequence result of the out-of-order characters. Compared with the prior art that the neural network is used for word grouping and sequencing, the technical scheme of the invention uses a simple text training mode, reduces the training cost, and can effectively solve the problem of low judgment accuracy of Chinese parallel phrases and improve the word grouping accuracy.
Drawings
FIG. 1 is a flow chart illustrating an embodiment of an automatic word-composing method for out-of-order characters according to the present invention;
FIG. 2 is a schematic diagram of a training process for model training according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, which is a schematic flow chart of an embodiment of the automatic word-forming method for disordered characters provided by the present invention, the method comprises steps 101 to 106, as follows:
step 101: according to a training text obtained in advance, a character table is constructed after training of a training model; the character table comprises a first character word frequency table, a second character word frequency table and a character lookup table.
In this embodiment, the element in the first character word frequency table records the occurrence frequency of adjacent character combinations in all training texts; elements in the second character word frequency table record the occurrence frequency of combinations separated by 1 character in all training texts; the character lookup table records a number of common characters and the total number of occurrences of each common character in all training texts.
In this embodiment, the ith row and jth column element in the first character word frequency table represents the frequency of occurrence of the combination of adjacent characters with hash values i and j in all training texts; wherein i and j are positive integers; the ith row and jth column elements in the second character word frequency table represent the occurrence frequency of the combination of a second character with a hash value of j after the character with the hash value of i in all training texts; the column number corresponding to the common character recorded in the character lookup table is the hash value of each common character.
To better illustrate this embodiment, the model training process is shown by the following example. The model consists of two square matrices, text_matrix_1 (the first character word frequency table) and text_matrix_2 (the second character word frequency table), plus a two-row lookup table, text_list. text_matrix_1 and text_matrix_2 have the same structure: their row and column indices correspond to the hash values of the common characters, and the hash function is simply a lookup in text_list. The first row of text_list stores the common characters; the second row stores the total number of times each character occurred during training, used for calculating conditional probabilities; the hash value of a character is its column index in text_list. The hash function is defined so that the hash values of all characters are consecutive natural numbers starting from 0. The elements of text_matrix_1 record how often adjacent character combinations appear: C1_ij (the element in row i, column j of text_matrix_1) is the frequency with which, in the training text, the character with hash value i is immediately followed by the character with hash value j. The elements of text_matrix_2 record the frequency of combinations separated by 1 character: C2_ij (the element in row i, column j of text_matrix_2) is the frequency with which the character with hash value i is followed, after one intervening character, by the character with hash value j. Experiments show that this two-layer structure effectively solves the problem of judging parallel phrases in Chinese text without losing the generalization ability of the model.
The training process traverses the text of the training set: the number of occurrences of each character is counted and recorded at the corresponding position of text_list; the number of occurrences of each pair of adjacent characters is counted and recorded in the corresponding element of text_matrix_1; and the number of occurrences of each pair of characters separated by one character is counted and recorded in the corresponding element of text_matrix_2. The detailed training process is shown in fig. 2.
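The counting pass above can be sketched in Python. The names here (build_tables, char_total, pair_adjacent, pair_skip_one) are illustrative stand-ins, not names from the patent, and dictionaries of pair counts stand in for the square matrices text_matrix_1 and text_matrix_2:

```python
from collections import Counter

def build_tables(training_texts):
    """One pass over the training texts, counting single characters
    (text_list), adjacent pairs (text_matrix_1) and pairs separated
    by one character (text_matrix_2)."""
    char_total = Counter()      # stands in for the second row of text_list
    pair_adjacent = Counter()   # stands in for text_matrix_1
    pair_skip_one = Counter()   # stands in for text_matrix_2
    for text in training_texts:
        char_total.update(text)
        for a, b in zip(text, text[1:]):   # adjacent character pairs
            pair_adjacent[(a, b)] += 1
        for a, b in zip(text, text[2:]):   # pairs separated by one character
            pair_skip_one[(a, b)] += 1
    # The patent's hash function is just "column index in text_list";
    # here we assign consecutive indices starting from 0.
    char_to_hash = {c: i for i, c in enumerate(sorted(char_total))}
    return char_total, pair_adjacent, pair_skip_one, char_to_hash
```

On the toy corpus ["abcab"], for instance, the adjacent pair ("a", "b") is counted twice and the gap-1 pair ("a", "c") once.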
Step 102: and acquiring character strings to be composed corresponding to the characters to be composed out of order, and inquiring the total occurrence times corresponding to all the character strings to be composed according to the character lookup table in the character strings so as to construct a first vector.
In this embodiment, step 102 specifically includes: and mapping the character strings to be composed to the columns corresponding to the corresponding character lookup tables to obtain the total occurrence frequency of each character, and recording the total occurrence frequency as a first vector s _ total.
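A minimal sketch of this lookup, assuming a character-count table built during training (first_vector and char_total are hypothetical names, not from the patent):

```python
def first_vector(s, char_total):
    """The patent's s_total: the total occurrence count of each
    character of the candidate string, read from the lookup table."""
    return [char_total[c] for c in s]
```

For example, with totals {"a": 3, "b": 5}, the string "ab" maps to the first vector [3, 5].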
Step 103: querying the first and second character word frequency tables according to the current character ordering of the string to be composed, and constructing a first and a second conditional probability count vector from the query results.
In this embodiment, step 103 specifically comprises: according to the current character ordering of the string to be composed, for each pair of adjacent elements (a, b), looking up the corresponding element R1_ab in the first character word frequency table, and constructing all queried elements into a first conditional probability count vector w_n1;
and for each pair of elements (c, d) separated by 1 character, looking up the corresponding element R2_cd in the second character word frequency table, and constructing all queried elements into a second conditional probability count vector w_n2.
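A sketch of this step, with dictionaries of pair counts standing in for the two word frequency tables (the function and parameter names are illustrative assumptions):

```python
def count_vectors(s, pair_adjacent, pair_skip_one):
    """Build w_n1 and w_n2 for the current ordering of s."""
    # w_n1: count of each adjacent pair (a, b) from the first table
    w_n1 = [pair_adjacent.get((a, b), 0) for a, b in zip(s, s[1:])]
    # w_n2: count of each pair (c, d) separated by one character
    w_n2 = [pair_skip_one.get((c, d), 0) for c, d in zip(s, s[2:])]
    return w_n1, w_n2
```

For a string of length n this yields n-1 adjacent-pair counts and n-2 gap-1 counts, matching the vector dimensions discussed below.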
Step 104: calculating the first and second conditional probability vectors for the string to be composed from the first vector, the first conditional probability count vector and the second conditional probability count vector.
In this embodiment, step 104 specifically comprises: dividing the first conditional probability count vector w_n1 element-wise by the first n-1 elements of the first vector s_total to obtain the first conditional probability vector w1; wherein s_total comprises n elements, and each element of w1 is the conditional probability that, given the former character of an adjacent pair appears, the latter character also appears;
and dividing the second conditional probability count vector w_n2 element-wise by the first n-2 elements of s_total to obtain the second conditional probability vector w2, each element of which is the conditional probability that, given the former character appears, the character separated from it by 1 character also appears.
In this embodiment, the length of the string S to be composed is n; the dimension of w_n1 is one less than that of s_total, and the dimension of w1 equals that of w_n1.
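A sketch of the division, following the direction used in the worked example later in the description (pair count divided by the total count of the former character); the helper name is an assumption:

```python
def conditional_probabilities(s_total, w_n1, w_n2):
    """w1[i] ~ P(next char | current char) for adjacent pairs;
    w2[i] ~ P(char two positions later | current char).
    zip() truncates s_total to the n-1 (resp. n-2) leading elements."""
    w1 = [pair / tot for pair, tot in zip(w_n1, s_total)]
    w2 = [pair / tot for pair, tot in zip(w_n2, s_total)]
    return w1, w2
```

With s_total = [4, 2, 1], w_n1 = [2, 1] and w_n2 = [1], this gives w1 = [0.5, 0.5] and w2 = [0.25].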
Step 105: taking the logarithm of each element of the first and second conditional probability vectors, converting the product of probabilities into a sum of log-probabilities, and obtaining a first and a second natural word order metric value.
In this embodiment, the probability of any particular combination appearing is very small, so when the character string is long the product of probabilities may cause floating-point underflow. To avoid this, logarithms are first taken element-wise on w1 and w2 to obtain w_1 and w_2, and the product of probabilities is converted into a sum of log-probabilities, i.e., all elements of w_1 and w_2 are summed, yielding the natural word order metric p of the character string.
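The log-sum trick can be sketched as follows (the function name is illustrative); summing logarithms gives the same ranking as multiplying probabilities while staying in a numerically safe range:

```python
import math

def natural_word_order_metric(w1, w2):
    """Sum of log-probabilities over both conditional probability
    vectors, avoiding the underflow a direct product would risk."""
    return sum(math.log(p) for p in w1) + sum(math.log(p) for p in w2)
```

For w1 = [0.5, 0.5] and w2 = [0.25], the metric is log 0.5 + log 0.5 + log 0.25 ≈ -2.77, whereas the raw product 0.0625 would shrink toward underflow as the string grows.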
In the process of generating a character string, the latter character can be considered to be related to all characters before it; the probability that the next character of string A is character B can be abstracted as the conditional probability of B given A, i.e., P(B|A). Applying the Markov assumption simplifies the actual calculation: the probability that the k-th character is B is taken to depend only on the few characters immediately preceding it. The problem handled by the invention is not string generation but recovering the natural word order of disordered text, so the invention simplifies the problem and introduces the natural word order metric value p to measure how well a candidate string conforms to natural word order. A larger p indicates better conformance. The problem thus becomes computing p for all possible candidate orderings and outputting the one or several candidates with the largest p value. The whole process can be seen as finding the highest-probability path in a discrete-time Markov model over characters.
For a string S of length n whose characters have hash values h_1, …, h_n, the natural word order metric p is defined as follows:

p = \sum_{i=1}^{n-1} \log\frac{r_{h_{i+1},h_i}}{n_{2,h_i}} + \sum_{i=1}^{n-2} \log\frac{r'_{h_{i+2},h_i}}{n_{2,h_i}}

where n_{2,h} is the total count of the character with hash value h (the second row of text_list), r denotes elements of the first character word frequency table text_matrix_1, and r' denotes elements of the second character word frequency table text_matrix_2.
step 106: and according to the first natural language sequence metric value and the second natural language sequence metric value, obtaining a natural language sequence metric value corresponding to the current character arrangement sequence of the character string to be word-grouped, traversing all the character arrangement sequences of the character string to be word-grouped, sequentially obtaining a plurality of natural language sequence metric values according to the same calculation method, selecting the character arrangement sequence with the maximum natural language sequence metric value, and automatically word-grouping the character string to be word-grouped.
To better illustrate the technical solution of the present invention, the process is illustrated by an example: calculating the natural word order metric p for the string s = {你, 好, 吗}.
1. Calculate the first summation term. According to the encoding of each Chinese character, look up its hash value (i.e., its column index) in the first row of text_list: h_c = Hash(c), c ∈ s.
Suppose the hash values so obtained are 196, 135, and 1202.
2. From the second row of text_list, obtain the corresponding total counts, denoted n_{2,196}, n_{2,135}, and n_{2,1202}.
3. From the first character word frequency table text_matrix_1, obtain, by hash value, the occurrence frequencies of the two adjacent-character pairs in the text, denoted r_{135,196} and r_{1202,135}, i.e., the frequencies of the pairs "你好" and "好吗".
4. Obtain the first summation term of the natural word order metric p of "你好吗":

p_1 = \sum_{i=1}^{n-1} \log\frac{r_{h_{i+1},h_i}}{n_{2,h_i}} = \log\frac{r_{135,196}}{n_{2,196}} + \log\frac{r_{1202,135}}{n_{2,135}}

wherein h_i represents the hash value of the i-th Chinese character.
The second summation term is then obtained by the same calculation, and the two terms are added to give the final natural word order metric value. The invention is fully vectorized: candidate orderings are obtained by permutation and combination, the natural word order metric value p of each ordering can be computed in parallel, and the combination with the largest p value is selected as the candidate output.
In summary, the automatic word-forming method for out-of-order characters provided in the embodiments of the present invention first constructs a character table by training a preset model on a training text, traverses all orderings of the input out-of-order characters, calculates a natural word order metric value for each ordering based on the constructed character table, and finally takes the ordering with the highest metric value as the ordering result. Compared with prior art that uses a neural network for word forming and ordering, the technical scheme of the invention uses a simple text-training mode, which reduces training cost, effectively solves the problem of low judgment accuracy for Chinese parallel phrases, and improves word-forming accuracy.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made therein without departing from the spirit and scope of the invention.

Claims (5)

1. An automatic word-forming method for disordered characters, characterized by comprising the following steps:
according to a training text obtained in advance, constructing a character table after training a training model; the character table comprises a first character word frequency table, a second character word frequency table and a character lookup table; elements in the first character word frequency table record the occurrence frequency of adjacent character combinations in all training texts; elements in the second character word frequency table record the occurrence frequency of character combinations separated by 1 character in all training texts; the character lookup table records a number of common characters and the total number of occurrences of each common character in all training texts;
acquiring the character string to be composed corresponding to the disordered characters, and querying, in the character lookup table, the total occurrence count corresponding to each character of the string, so as to construct a first vector;
querying the first character word frequency table and the second character word frequency table according to the current character ordering of the string to be composed, and constructing a first conditional probability count vector and a second conditional probability count vector from the query results;
calculating a first conditional probability vector and a second conditional probability vector for the string to be composed from the first vector, the first conditional probability count vector and the second conditional probability count vector;
taking the logarithm of each element of the first and second conditional probability vectors, converting the product of probabilities into a sum of log-probabilities, and obtaining a first natural word order metric value and a second natural word order metric value;
and obtaining, from the first and second natural word order metric values, the natural word order metric value for the current character ordering of the string to be composed; traversing all character orderings of the string, obtaining a natural word order metric value for each by the same calculation, selecting the ordering with the largest metric value, and automatically composing the string into words.
2. The method according to claim 1, wherein
the element in row i, column j of the first character word frequency table represents the frequency with which, in all training texts, the character with hash value i is immediately followed by the character with hash value j; wherein i and j are positive integers;
the element in row i, column j of the second character word frequency table represents the frequency with which, in all training texts, the character with hash value i is followed, after one intervening character, by the character with hash value j;
and the column indices corresponding to the common characters recorded in the character lookup table are the hash values of the respective common characters.
3. The method according to claim 1, wherein the total occurrence count corresponding to each character of the string to be composed is queried in the character lookup table to construct a first vector, specifically:
mapping each character of the string to be composed to its corresponding column of the character lookup table to obtain its total occurrence count, and recording the result as a first vector s_total.
4. The method according to claim 3, wherein the first and second character word frequency tables are respectively queried according to the current character ordering of the string to be composed, and a first and a second conditional probability count vector are constructed from the query results, specifically:
according to the current character ordering of the string to be composed, for each pair of adjacent elements (a, b), looking up the corresponding element R1_ab in the first character word frequency table, and constructing all queried elements into a first conditional probability count vector w_n1;
and for each pair of elements (c, d) separated by 1 character, looking up the corresponding element R2_cd in the second character word frequency table, and constructing all queried elements into a second conditional probability count vector w_n2.
5. The method according to claim 4, wherein the first conditional probability vector and the second conditional probability vector corresponding to the character string to be composed are calculated from the first vector, the first conditional probability count vector and the second conditional probability count vector, specifically:
dividing the first n-1 elements of the first vector s_total by the first conditional probability count vector w_n1, element-wise, to obtain the first conditional probability vector w1; wherein the first vector s_total comprises n elements; each element of the first conditional probability vector w1 is the conditional probability that the latter character appears after the former character in the character string to be composed;
dividing the first n-1 elements of the first vector s_total by the second conditional probability count vector w_n2, element-wise, to obtain the second conditional probability vector w2; each element of the second conditional probability vector w2 is the conditional probability that, after the former character appears, the character separated from it by one character also appears in the character string to be composed.
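Read literally, the claim's translation divides s_total by the count vectors; the conventional conditional probability P(latter | former) = R1_ab / total(a) divides the other way round, and the sketch below follows that conventional reading, which is an assumption about the intended computation. It also uses n-2 skip-one pairs for w2 (a string of n characters has only n-2 of them) and returns 0.0 where the conditioning character's total is zero:

```python
def conditional_probability_vectors(s_total, w_n1, w_n2):
    """Element-wise conditional probabilities from the count vectors.
    Assumes the conventional reading: pair count divided by the total
    count of the conditioning (former) character."""
    n = len(s_total)
    w1 = [w_n1[k] / s_total[k] if s_total[k] else 0.0 for k in range(n - 1)]
    w2 = [w_n2[k] / s_total[k] if s_total[k] else 0.0 for k in range(n - 2)]
    return w1, w2
```

With s_total = [4, 2, 1], w_n1 = [2, 1] and w_n2 = [1], for example, the pair (first, second) occurred 2 times out of 4 occurrences of the first character, giving a conditional probability of 0.5.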
CN201910729423.XA 2019-08-06 2019-08-06 Automatic word-composing method for disordered characters Active CN110751165B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910729423.XA CN110751165B (en) 2019-08-06 2019-08-06 Automatic word-composing method for disordered characters


Publications (2)

Publication Number Publication Date
CN110751165A CN110751165A (en) 2020-02-04
CN110751165B true CN110751165B (en) 2023-01-24

Family

ID=69275847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910729423.XA Active CN110751165B (en) 2019-08-06 2019-08-06 Automatic word-composing method for disordered characters

Country Status (1)

Country Link
CN (1) CN110751165B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950291B (en) * 2020-06-22 2024-02-23 北京百度网讯科技有限公司 Semantic representation model generation method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101950306A (en) * 2010-09-29 2011-01-19 北京新媒传信科技有限公司 Method for filtering character strings in process of discovering new words
CN108647207A (en) * 2018-05-08 2018-10-12 上海携程国际旅行社有限公司 Natural language modification method, system, equipment and storage medium
CN108829660A (en) * 2018-05-09 2018-11-16 电子科技大学 A kind of short text signature generating method based on random number division and recursion




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant