CN106815193A - Model training method and device and wrong word recognition methods and device - Google Patents
- Publication number
- CN106815193A CN106815193A CN201510850128.1A CN201510850128A CN106815193A CN 106815193 A CN106815193 A CN 106815193A CN 201510850128 A CN201510850128 A CN 201510850128A CN 106815193 A CN106815193 A CN 106815193A
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
Abstract
This application discloses a model training method and device, and a wrong-word recognition method and device. The model training method includes: extracting text information from a preset text data source, where the text contained in the preset text data source is text that contains no wrong words; determining the word vector corresponding to each word in the text information, where a word vector is a multidimensional array that uniquely represents a word; and, taking each sentence in the text information as a unit, inputting the word vector of each word in the sentence into a memory neural network and training it to obtain a neural network model, where the neural network model is used to recognize wrong words in text. The application addresses the technical problem of the low recognition rate of wrong words in text in the prior art.
Description
Technical field
The present application relates to the field of text processing, and in particular to a model training method and device and a wrong-word recognition method and device.
Background
Text is an important carrier for recording information. Because most text is edited by humans, and human editing inevitably produces slips, wrong words appear in text. At present, wrong words in text are generally recognized by manually building a lexicon of correct words and matching the text against it. However, it is difficult to build a comprehensive and correct lexicon in this way, which leads to a high miss rate and thus a low recognition rate for wrong words in text.
No effective solution to the above problem has yet been proposed.
Summary of the invention
The embodiments of the present application provide a model training method and device and a wrong-word recognition method and device, so as at least to solve the technical problem of the low recognition rate of wrong words in text in the prior art.
According to one aspect of the embodiments of the present application, a model training method is provided, including: extracting text information from a preset text data source, where the text contained in the preset text data source contains no wrong words; determining the word vector corresponding to each word in the text information, where a word vector is a multidimensional array that uniquely represents a word; and, taking each sentence in the text information as a unit, inputting the word vector of each word in the sentence into a memory neural network and training to obtain a neural network model, where the neural network model is used to recognize wrong words in text.
Further, before determining the word vector corresponding to each word in the text information, the model training method also includes: obtaining a target text library, where the text contained in the target text library contains no wrong words; and training on the target text library with a word vector model to generate the word vector corresponding to each word in the target text library, obtaining a first training set.
Further, determining the word vector corresponding to each word in the text information includes: performing word segmentation on the text information to obtain a second training set; and looking up, in the first training set, the word vector corresponding to each word in the second training set.
Further, before the word vector corresponding to each word in each sentence is input into the memory neural network, the model training method also includes: labeling the word vector corresponding to each word in each sentence with a preset mark, where the preset mark indicates that the word corresponding to the word vector is not a wrong word, so that when the neural network model recognizes a non-wrong word, the word is labeled with the preset mark.
According to another aspect of the embodiments of the present application, a wrong-word recognition method is also provided, including: performing word segmentation on a text to be tested and determining the word vector corresponding to each word; and, taking each sentence in the text to be tested as a unit, inputting the word vector of each word in the sentence into a neural network model and using the neural network model to identify the wrong words in the text to be tested.
According to another aspect of the embodiments of the present application, a model training device is also provided, including: an extraction unit configured to extract text information from a preset text data source, where the text contained in the preset text data source contains no wrong words; a determining unit configured to determine the word vector corresponding to each word in the text information, where a word vector is a multidimensional array that uniquely represents a word; and a training unit configured to, taking each sentence in the text information as a unit, input the word vector of each word in the sentence into a memory neural network and train to obtain a neural network model, where the neural network model is used to recognize wrong words in text.
Further, the model training device also includes: an acquiring unit configured to obtain, before the word vector corresponding to each word in the text information is determined, a target text library whose text contains no wrong words; and a generation unit configured to train on the target text library with a word vector model to generate the word vector corresponding to each word in the target text library, obtaining a first training set.
Further, the determining unit includes: a word segmentation module configured to perform word segmentation on the text information to obtain a second training set; and a query module configured to look up, in the first training set, the word vector corresponding to each word in the second training set.
Further, the model training device also includes: a labeling unit configured to, before the word vector corresponding to each word in each sentence is input into the memory neural network, label the word vector corresponding to each word in each sentence with a preset mark, where the preset mark indicates that the word corresponding to the word vector is not a wrong word, so that when the neural network model recognizes a non-wrong word, the word is labeled with the preset mark.
According to another aspect of the embodiments of the present application, a wrong-word recognition device is also provided, including: a vector determining unit configured to perform word segmentation on a text to be tested and determine the word vector corresponding to each word; and a recognition unit configured to, taking each sentence in the text to be tested as a unit, input the word vector of each word in the sentence into a neural network model and use the neural network model to identify the wrong words in the text to be tested.
According to the embodiments of the present application, text information is extracted from a preset text data source whose text contains no wrong words; the word vector corresponding to each word in the text information is determined, where a word vector is a multidimensional array that uniquely represents a word; and, taking each sentence in the text information as a unit, the word vector of each word in the sentence is input into a memory neural network and trained to obtain a neural network model. The neural network model can then be used to recognize wrong words in text, improving the recognition rate of wrong words in text and solving the technical problem of the low recognition rate of wrong words in text in the prior art.
Brief description of the drawings
The accompanying drawings described here are provided for further understanding of the present application and constitute a part of the application. The schematic embodiments of the application and their descriptions are used to explain the application and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flowchart of a model training method according to an embodiment of the present application;
Fig. 2 is a flowchart of a wrong-word recognition method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a model training device according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a wrong-word recognition device according to an embodiment of the present application.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the application. Based on the embodiments in the application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the scope of protection of the application.
It should be noted that the terms "first", "second", and so on in the description, claims, and drawings of the present application are used to distinguish similar objects and are not intended to describe a specific order or sequence. It should be understood that data used in this way may be interchanged where appropriate, so that the embodiments described here can be implemented in an order other than that illustrated or described. In addition, the terms "comprising" and "having" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
According to an embodiment of the present application, a method embodiment of a model training method is provided. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and, although a logical order is shown in the flowchart, in some cases the steps shown or described can be executed in a different order.
Fig. 1 is a flowchart of a model training method according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps:
Step S102: text information is extracted from a preset text data source, where the text contained in the preset text data source contains no wrong words.
The preset text data source can be a resource website such as the People's Daily or the Chinese government website, or a proofread text data source that contains no wrong words. The preset text data source contains a large amount of text without wrong words, from which the text information is extracted.
Step S104: the word vector corresponding to each word in the text information is determined, where a word vector is a multidimensional array that uniquely represents a word.
For the extracted text information, the word vector corresponding to each word is determined. The word vector of each word is represented by a multidimensional array, and different words correspond to different word vectors. The word vectors can be predefined, so that after the text information is extracted, the vector of each word in the text information is queried from the predefined word vectors; alternatively, the word vector of each word can be generated according to a preset word vector generation rule.
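The predefined-lookup path described above can be sketched as follows. This is a minimal illustration, not part of the patent: the toy vocabulary, the 3-dimensional vectors, and the function name are all assumptions.

```python
# Minimal sketch of looking up each word's predefined word vector.
# The vocabulary and 3-dimensional vectors are toy assumptions.
PREDEFINED_VECTORS = {
    "one":   [1, 0, 0],
    "happy": [0, 1, 0],
    "text":  [0, 0, 1],
}

def vector_for(word):
    """Return the unique multidimensional array representing `word`, or None."""
    return PREDEFINED_VECTORS.get(word)

sentence = ["one", "happy", "text"]
vectors = [vector_for(w) for w in sentence]
```

In this scheme each word maps to exactly one array and distinct words map to distinct arrays, matching the uniqueness requirement stated above.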
Step S106: taking each sentence in the text information as a unit, the word vector corresponding to each word in the sentence is input into a memory neural network, and training obtains a neural network model, where the neural network model is used to recognize wrong words in text.
In this embodiment, after the word vector of each word contained in the text information is determined, the sentences in the text information are input, one sentence at a time, into the memory neural network for training. A sentence input into the memory neural network is replaced by the word vectors of its words; that is, the word vector corresponding to each word in the sentence is input into the memory neural network. The memory neural network is preferably a long short-term memory network based on a recurrent neural network (i.e., LSTM + bidirectional RNN). Training the extracted text information through the memory neural network yields the neural network model. Because the word vectors are input sentence by sentence, the machine can memorize the words in each sentence and the combinations they form, and these words and combinations are memorized in the parameters of the neural network model (the parameters are determined by training and are mostly matrices). Compared with the prior-art approach of manually building a lexicon of correct words and recognizing wrong words by text matching, this embodiment trains a memory neural network on text without wrong words to obtain a neural network model and then uses that model to recognize wrong words in text. Without manually building a lexicon, wrong words can be recognized from word combinations and sentences, based on the semantics of the context, so that wrong words in text are identified effectively and quickly.
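The sentence-as-unit input described above can be sketched as follows: each sentence becomes one training sequence of word vectors, which would then be fed to the LSTM + bidirectional RNN. The network itself is omitted here, and the toy word vectors are illustrative assumptions.

```python
# Sketch of preparing sentence-level training sequences for the memory
# neural network. Each sentence is one unit: a list of word vectors.
# The toy word vectors below are illustrative assumptions.
WORD_VECTORS = {"I": [1, 0], "am": [0, 1], "happy": [1, 1], ".": [0, 0]}

def sentences_to_sequences(sentences):
    """Replace each word of each sentence with its word vector."""
    return [[WORD_VECTORS[w] for w in sentence] for sentence in sentences]

corpus = [["I", "am", "happy", "."], ["I", "am", "I", "."]]
sequences = sentences_to_sequences(corpus)
# Each element of `sequences` is one input unit for the recurrent network.
```

Feeding whole sentences, rather than isolated words, is what lets the network memorize word combinations and not just individual vocabulary items.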
According to the embodiments of the present application, text information is extracted from a preset text data source whose text contains no wrong words; the word vector corresponding to each word in the text information is determined, where a word vector is a multidimensional array that uniquely represents a word; and, taking each sentence in the text information as a unit, the word vector of each word in the sentence is input into a memory neural network and trained to obtain a neural network model. The neural network model can then be used to recognize wrong words in text, improving the recognition rate of wrong words in text and solving the technical problem of the low recognition rate of wrong words in text in the prior art.
For example, suppose the text contains a corrupted phrase in which one character of the correct expression "angrily leaving the scene" has been replaced, yielding something like "angrily leaving the position". A prior-art lexicon contains the words "angrily leaving the scene", "angrily", "leaving the scene", and "position", so when the corrupted phrase is matched against the lexicon it decomposes into valid words and is judged to contain no wrong word. In the embodiment of the present application, by contrast, training takes the sentence as a unit, so "angrily leaving the scene" is input into the neural network as a whole and memorized by the parameters of the neural network model. Therefore, when the corrupted phrase is input into the neural network model, the wrong character in it is recognized.
Preferably, before the word vector corresponding to each word in the text information is determined, the model training method also includes: obtaining a target text library, where the text contained in the target text library contains no wrong words; and training on the target text library with a word vector model to generate the word vector corresponding to each word in the target text library, obtaining a first training set.
The target text library of this embodiment can be a dictionary containing various words, such as the Xinhua Dictionary or an idiom dictionary, or a text library such as articles that contain no wrong words; the target text library is obtained to serve as the word vector training set. The word vector model can be an existing mature model that, given input text, generates for each word a multidimensional array of the same dimension, i.e., a word vector. The dimension of the word vectors can be predefined according to the word vector training set; for example, "one" might be marked as [1, 0, 0, ...] and "happy" as [0, 1, 0, ...].
In the embodiment of the present application, the word vector of each word in the word vector training set can be obtained by training in advance, so that the word vector of each word in the text information used for neural network model training can be queried from it.
It should be noted that the embodiment of the present application can also generate a word vector corresponding to each punctuation mark.
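A hedged sketch of generating one unique vector per word, in the spirit of the "[1, 0, 0, ...]" example above, follows. Real word vector models (e.g. word2vec-style models) learn dense vectors instead; this one-hot scheme and all names in it are simplifying assumptions.

```python
# Sketch of generating, for every word in the target text library, a unique
# multidimensional array of a predefined dimension (a one-hot simplification
# of a real word vector model).
def build_first_training_set(vocabulary, dim):
    """Assign each distinct word a vector with a 1 in a unique position."""
    vectors = {}
    for i, word in enumerate(sorted(set(vocabulary))):
        vec = [0] * dim
        vec[i] = 1          # position i uniquely identifies the word
        vectors[word] = vec
    return vectors

first_training_set = build_first_training_set(["one", "happy", "one"], dim=4)
```

The predefined dimension must be at least the vocabulary size for this scheme to stay unique, which is one reason learned dense vectors are preferred in practice.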
Further, determining the word vector corresponding to each word in the text information includes: performing word segmentation on the text information to obtain a second training set; and looking up, in the first training set, the word vector corresponding to each word in the second training set.
For the text information used for neural network model training, word segmentation is first performed on it to obtain a word set, i.e., the second training set; then the word vector corresponding to each word in the second training set is queried from the first training set obtained above, thereby determining the word vector of each word in each sentence of the text information.
Specifically, an existing word segmentation tool can be used to segment the text information, where the segmented text is composed of words; for example, the sentence "I am a Chinese" might be segmented as "I / am / a / Chinese", or differently depending on the tool.
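An existing segmentation tool would normally be used here. Purely as an illustration of what segmentation does, a greedy forward maximum-match segmenter over a toy lexicon might look like the following; the lexicon contents and function names are assumptions, not the patent's method.

```python
# Toy forward maximum-match word segmenter. Real systems would use an
# existing segmentation tool; the lexicon here is an illustrative assumption.
LEXICON = {"i", "am", "a", "chinese", "chinese person"}

def segment(tokens, max_len=2):
    """Greedily match the longest lexicon entry starting at each position."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in LEXICON or n == 1:
                out.append(candidate)
                i += n
                break
    return out

words = segment("i am a chinese person".split())
# With this lexicon, "chinese person" is kept as one word.
```

Different lexicons or matching directions yield different segmentations, which is why the text notes that the same sentence may be segmented in more than one way.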
Preferably, before the word vector corresponding to each word in each sentence is input into the memory neural network, the model training method also includes: labeling the word vector corresponding to each word in each sentence with a preset mark, where the preset mark indicates that the word corresponding to the word vector is not a wrong word, so that when the neural network model recognizes a non-wrong word, the word is labeled with the preset mark.
In the embodiment of the present application, each word in each sentence input into the memory neural network is labeled with a mark such as "1". In this way, when the text is trained to obtain the neural network model, the parameters of the neural network model memorize that these words carry the preset mark. When the neural network model is used to recognize a text to be tested, the words without wrong words in the text are labeled with the preset mark in the output result, while words containing wrong words are left unmarked or labeled with other marks, so that the wrong words in the text to be tested can be filtered out quickly.
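The labeling step above can be sketched as follows. The mark value "1" follows the example in the text; the function name and toy vectors are illustrative assumptions.

```python
# Sketch of attaching the preset "non-wrong-word" mark to each word vector
# before it is input into the memory neural network.
NON_WRONG_WORD = 1  # preset mark from the example above

def label_sentence(word_vectors):
    """Pair every word vector in the sentence with the preset mark."""
    return [(vec, NON_WRONG_WORD) for vec in word_vectors]

labeled = label_sentence([[1, 0], [0, 1]])
```

Because the training text contains no wrong words, every training token receives the same mark; only at prediction time can a word fail to earn it.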
An optional implementation of the model training method of the embodiment of the present application includes:
Step 1: obtain a reliable text library (such as the Xinhua Dictionary, an idiom dictionary, or a text library of articles that contain no wrong words) as the target text library, which serves as word vector training set 1, i.e., the first training set.
Step 2: train on training set 1 with the word vector model to obtain the word vector of each word (including punctuation marks) in training set 1. The word vector model can be an existing mature model that generates, for each word in the input text, a unique multidimensional array of the same dimension, i.e., a word vector. The dimension of the word vectors can be predefined; for example, "one" might be marked as [1, 0, 0, ...] and "happy" as [0, 1, 0, ...].
Step 3: extract text information from a reliable text data source composed of a large number of sentences, as the text training set. Here, a reliable text data source composed of a large number of sentences means a text data source without wrong words, such as one obtained from channels like the People's Daily or the Chinese government website.
Step 4: use an existing word segmentation tool to segment the text training set, obtaining training set 2, i.e., the second training set. The segmented text is composed of words; for example, "I am a Chinese" might be segmented as "I / am / a / Chinese", or differently depending on the tool.
Step 5: taking the sentences of training set 2 as units, find the word vector of each word in the sentence from training set 1, label each word as a non-wrong word (for example, representing a non-wrong word with "1"), and input the resulting word vectors into a long short-term memory network based on a recurrent neural network (i.e., LSTM + bidirectional RNN). Training obtains the neural network model (the model parameters are determined by training and are mostly matrices). Because the input is made sentence by sentence, the machine can memorize the words in each sentence and the combinations they form, and these combinations are memorized by the parameters of the model.
By using the neural network model, analysis can be performed according to the word combinations, sentences, and paragraphs in an article, which improves recognition accuracy and reduces misses.
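Steps 1-5 above can be sketched end to end as follows, stopping just before the LSTM + bidirectional RNN is trained (the network is omitted). All function names, the toy corpus, and the one-hot vector scheme are illustrative assumptions standing in for a real word vector model and segmenter.

```python
# End-to-end sketch of training-data preparation, steps 1-5 above.

def build_word_vectors(words, dim):
    """Steps 1-2: one unique vector per word in the reliable text library."""
    table = {}
    for i, w in enumerate(sorted(set(words))):
        vec = [0] * dim
        vec[i] = 1
        table[w] = vec
    return table

def prepare_training_data(sentences, vectors, mark=1):
    """Steps 3-5: per sentence (already segmented), look up each word's
    vector and label it with the preset non-wrong-word mark."""
    return [[(vectors[w], mark) for w in s] for s in sentences]

library = ["i", "am", "happy", "i", "am"]       # step 1: reliable library
vectors = build_word_vectors(library, dim=3)    # step 2: first training set
sentences = [["i", "am", "happy"]]              # steps 3-4: segmented text
training_data = prepare_training_data(sentences, vectors)
# `training_data` would now be fed, sentence by sentence, to the network.
```

Each element of `training_data` is one sentence-unit of (vector, mark) pairs, the exact shape the sentence-level training in step 5 consumes.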
According to an embodiment of the present application, a wrong-word recognition method is also provided. The wrong-word recognition method can use the neural network model trained by the model training method of the above embodiments to recognize wrong words. As shown in Fig. 2, the wrong-word recognition method includes:
Step S202: word segmentation is performed on the text to be tested, and the word vector corresponding to each word is determined.
Each word obtained by segmentation can have its corresponding word vector queried from the first training set of the embodiment of the present application.
Step S204: taking each sentence in the text to be tested as a unit, the word vector corresponding to each word in the sentence is input into a neural network model, and the neural network model is used to identify the wrong words in the text to be tested.
The neural network model in this embodiment is the neural network model trained by the model training method of the above embodiments of the present application.
Because the neural network model is obtained by training a memory neural network on text without wrong words, the parameters of the neural network model (determined by training and mostly matrices) memorize the words without wrong words in the training text and the combinations they form. Without manually building a lexicon, wrong words can be recognized from word combinations and sentences, based on the semantics of the context, so that wrong words in text are identified effectively and quickly.
The word vectors of the text to be tested are input into the trained neural network model, and through the computation of the neural network model each word is marked in the output result; for example, a non-wrong word is marked as 1 and a wrong word as -1, so that the wrong words can be screened out.
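The screening step above can be sketched as follows, using the "1 = non-wrong word, -1 = wrong word" example from the text. The model itself is omitted; the words and the pretend output marks are illustrative assumptions.

```python
# Sketch of screening wrong words out of the model's per-word output marks.
def screen_wrong_words(words, marks):
    """Return the words whose output mark flags them as wrong (-1)."""
    return [w for w, m in zip(words, marks) if m == -1]

words = ["this", "sentense", "is", "short"]
marks = [1, -1, 1, 1]          # pretend per-word model output
wrong = screen_wrong_words(words, marks)
```

This is the final filtering pass: everything the model marked with the preset value passes through, and only the flagged tokens are surfaced to the user.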
The embodiment of the present application also provides a model training device, which can be used to execute the model training method of the embodiment of the present application. As shown in Fig. 3, the model training device includes: an extraction unit 301, a determining unit 303, and a training unit 305.
The extraction unit 301 is configured to extract text information from a preset text data source, where the text contained in the preset text data source contains no wrong words.
The preset text data source can be a resource website such as the People's Daily or the Chinese government website, or a proofread text data source that contains no wrong words. The preset text data source contains a large amount of text without wrong words, from which the text information is extracted.
The determining unit 303 is configured to determine the word vector corresponding to each word in the text information, where a word vector is a multidimensional array that uniquely represents a word.
For the extracted text information, the word vector corresponding to each word is determined. The word vector of each word is represented by a multidimensional array, and different words correspond to different word vectors. The word vectors can be predefined, so that after the text information is extracted, the vector of each word in the text information is queried from the predefined word vectors; alternatively, the word vector of each word can be generated according to a preset word vector generation rule.
The training unit 305 is configured to, taking each sentence in the text information as a unit, input the word vector corresponding to each word in the sentence into a memory neural network, where training obtains a neural network model used to recognize wrong words in text.
In this embodiment, after the word vector of each word contained in the text information is determined, the sentences in the text information are input, one sentence at a time, into the memory neural network for training. A sentence input into the memory neural network is replaced by the word vectors of its words; that is, the word vector corresponding to each word in the sentence is input into the memory neural network. The memory neural network is preferably a long short-term memory network based on a recurrent neural network (i.e., LSTM + bidirectional RNN). Training the extracted text information through the memory neural network yields the neural network model. Because the word vectors are input sentence by sentence, the machine can memorize the words in each sentence and the combinations they form, and these words and combinations are memorized in the parameters of the neural network model (the parameters are determined by training and are mostly matrices). Compared with the prior-art approach of manually building a lexicon of correct words and recognizing wrong words by text matching, this embodiment trains a memory neural network on text without wrong words to obtain a neural network model and then uses that model to recognize wrong words in text. Without manually building a lexicon, wrong words can be recognized from word combinations and sentences, based on the semantics of the context, so that wrong words in text are identified effectively and quickly.
According to the embodiments of the present application, text information is extracted from a preset text data source whose text contains no wrong words; the word vector corresponding to each word in the text information is determined, where a word vector is a multidimensional array that uniquely represents a word; and, taking each sentence in the text information as a unit, the word vector of each word in the sentence is input into a memory neural network and trained to obtain a neural network model. The neural network model can then be used to recognize wrong words in text, improving the recognition rate of wrong words in text and solving the technical problem of the low recognition rate of wrong words in text in the prior art.
For example, suppose the text contains a corrupted phrase in which one character of the correct expression "angrily leaving the scene" has been replaced, yielding something like "angrily leaving the position". A prior-art lexicon contains the words "angrily leaving the scene", "angrily", "leaving the scene", and "position", so when the corrupted phrase is matched against the lexicon it decomposes into valid words and is judged to contain no wrong word. In the embodiment of the present application, by contrast, training takes the sentence as a unit, so "angrily leaving the scene" is input into the neural network as a whole and memorized by the parameters of the neural network model. Therefore, when the corrupted phrase is input into the neural network model, the wrong character in it is recognized.
Preferably, the model training device also includes: an acquiring unit configured to obtain, before the word vector corresponding to each word in the text information is determined, a target text library whose text contains no wrong words; and a generation unit configured to train on the target text library with a word vector model to generate the word vector corresponding to each word in the target text library, obtaining a first training set.
The target text library of this embodiment can be a dictionary containing various words, such as the Xinhua Dictionary or an idiom dictionary, or a text library such as articles that contain no wrong words; the target text library is obtained to serve as the word vector training set. The word vector model can be an existing mature model that, given input text, generates for each word a multidimensional array of the same dimension, i.e., a word vector. The dimension of the word vectors can be predefined according to the word vector training set; for example, "one" might be marked as [1, 0, 0, ...] and "happy" as [0, 1, 0, ...].
In the embodiment of the present application, the word vector of each word in the word vector training set can be obtained by training in advance, so that the word vector of each word in the text information used for neural network model training can be queried from it.
It should be noted that the embodiment of the present application can also generate a word vector corresponding to each punctuation mark.
Preferably, the determining unit includes: a word segmentation module, configured to perform word segmentation on the text information to obtain a second training set; and a query module, configured to search the first training set for the word vector corresponding to each word in the second training set.
For the text information used for neural network model training, word segmentation is first performed on it to obtain a word set, i.e. the second training set; the word vector corresponding to each word in the second training set is then queried from the first training set obtained above, thereby determining the word vector of each word in every sentence of the text information.

Specifically, an existing word segmentation tool may be used to perform word segmentation on the text information, where the segmented text is composed of words; for example, the sentence "I am a Chinese person" may be segmented as "I / am / a / Chinese person" or as "I / am / a / Chinese / person".
Preferably, the model training apparatus further includes: a marking unit, configured to mark the word vector corresponding to each word in every sentence with a preset identifier before the word vectors are input into the memory neural network, where the preset identifier indicates that the word corresponding to the word vector is a non-wrong word, so that when a non-wrong word is identified using the neural network model, the word is labeled with the preset identifier.

In the embodiment of the present application, each word in every sentence input to the memory neural network is marked with an identifier, such as "1". In this way, when the text is trained to obtain the neural network model, the parameters of the neural network model memorize these words as carrying the preset identifier. When the neural network model is used to recognize text under test, its output labels the words without errors with the preset identifier, while words containing errors are left unmarked or are given a different identifier, so that the wrong words in the text under test can be filtered out quickly.
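This labeling and filtering scheme can be sketched as follows (the identifiers 1 and -1 are the ones used in the example later in this description; the word "wrng" and the model output here are hypothetical):

```python
PRESET_MARK = 1    # preset identifier: the word vector belongs to a non-wrong word
WRONG_MARK = -1    # a different mark, used for wrong words in the model's output

def label_training_sentence(word_vectors):
    """Mark every word vector of a training sentence with the preset identifier.

    All training text is error-free by construction, so every vector in
    the sentence receives the same 'non-wrong word' label.
    """
    return [(vec, PRESET_MARK) for vec in word_vectors]

def filter_wrong_words(words, output_marks):
    """Keep only the words whose output mark differs from the preset
    identifier, i.e. the suspected wrong words."""
    return [w for w, m in zip(words, output_marks) if m != PRESET_MARK]

# A hypothetical model output marking the third word as wrong:
print(filter_wrong_words(["this", "is", "wrng"], [1, 1, -1]))   # ['wrng']
```

Because every non-wrong word shares one identifier, filtering reduces to a single comparison per word in the model's output.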
The model training apparatus includes a processor and a memory; the above extraction unit 301, determining unit 303, training unit 305, and the like are stored in the memory as program units, and the processor executes the above program units stored in the memory.

The processor contains a kernel, and the kernel retrieves the corresponding program units from the memory. One or more kernels may be provided, and the neural network model is obtained through training by adjusting kernel parameters.

The memory may include volatile memory in a computer-readable medium, in the form of random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The present application also provides an embodiment of a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: extracting text information from a preset text data source, where the text contained in the preset text data source contains no wrong words; determining the word vector corresponding to each word in the text information, where a word vector is a multidimensional array that uniquely represents a word; and, taking the sentences in the text information as units, inputting the word vector corresponding to each word in every sentence into a memory neural network and training to obtain a neural network model, where the neural network model is used to recognize wrong words in text.
According to an embodiment of the present application, a wrong word identifying device is also provided; the wrong word identifying device can be used to execute the wrong word recognition method provided in the embodiment of the present application. As shown in Fig. 4, the wrong word identifying device includes: a vector determining unit 401 and a recognition unit 403.

The vector determining unit 401 is configured to perform word segmentation on the text under test and determine the word vector corresponding to each word; in the embodiment of the present application, the word vector corresponding to each word after segmentation can be queried from the first training set.

The recognition unit 403 is configured to, taking the sentences in the text under test as units, input the word vector corresponding to each word in every sentence into the neural network model, and identify the wrong words in the text under test using the neural network model. The neural network model in this embodiment is the neural network model trained by the model training method of the above embodiments of the present application.
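The flow through these two units can be sketched end to end. The segmenter, vector lookup, and model below are all toy stand-ins (any trained model that returns one mark per word would fit this interface); only the pipeline shape comes from the description:

```python
def recognize_wrong_words(sentences, segment, lookup, model_mark):
    """Sketch of units 401 and 403: segment each sentence, look up every
    word's vector, feed the vector sequence to the model, and collect the
    words not marked as non-wrong (mark 1)."""
    wrong = []
    for sentence in sentences:
        words = segment(sentence)
        vectors = [lookup(w) for w in words]
        marks = model_mark(vectors)          # one mark per word
        wrong += [w for w, m in zip(words, marks) if m != 1]
    return wrong

# Toy stand-ins for the trained components:
segment = lambda s: s.split()
lookup = lambda w: [len(w)]                              # toy 1-dim "word vector"
model_mark = lambda vecs: [-1 if v == [4] else 1 for v in vecs]

print(recognize_wrong_words(["a big catt"], segment, lookup, model_mark))  # ['catt']
```

Processing sentence by sentence matches the claim language "taking the sentences in the text under test as units": the model sees one vector sequence per sentence, so context within the sentence is available to it.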
Because the neural network model is obtained by training a memory neural network on text without wrong words, the parameters in the neural network model (mostly determined matrices) can memorize the error-free words and their combinations in the training text. Without manually building a lexicon, wrong words can be recognized from word combinations and sentences; that is, the wrong words in text can be identified effectively and quickly based on the semantics of the context.
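The intuition that correct combinations are memorized from clean text, and a word forming no memorized combination is suspect, can be caricatured with a bigram memory. This is only an illustration of the principle: the real memory neural network generalizes through its trained matrices rather than storing word pairs literally.

```python
def train_bigram_memory(clean_sentences):
    """Memorize every adjacent word pair seen in the error-free training text."""
    seen = set()
    for sentence in clean_sentences:
        seen.update(zip(sentence, sentence[1:]))
    return seen

def suspect_wrong_words(sentence, seen):
    """Flag a word when none of the combinations it forms with its neighbours
    was memorized during training (sentence boundaries count as acceptable)."""
    n = len(sentence)
    flags = []
    for i, w in enumerate(sentence):
        left_ok = i == 0 or (sentence[i - 1], w) in seen
        right_ok = i == n - 1 or (w, sentence[i + 1]) in seen
        flags.append(not (left_ok or right_ok))
    return flags

memory = train_bigram_memory([["the", "cat", "sat"]])
print(suspect_wrong_words(["the", "bat", "sat"], memory))  # [False, True, False]
```

No lexicon of wrong words is ever built: only correct combinations are stored, and anything that fails to match them is flagged.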
The word vectors of the text under test are input into the trained neural network model, and through the computation of the neural network model each word is marked in the output; for example, a non-wrong word is marked as 1 and a wrong word as -1, so that the wrong words can then be filtered out.
The wrong word identifying device includes a processor and a memory; the above vector determining unit 401, recognition unit 403, and the like are stored in the memory as program units, and the processor executes the above program units stored in the memory.

The processor contains a kernel, and the kernel retrieves the corresponding program units from the memory. One or more kernels may be provided, and the wrong words in text are recognized by adjusting kernel parameters.

The memory may include volatile memory in a computer-readable medium, in the form of random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The present application also provides an embodiment of a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: performing word segmentation on the text under test and determining the word vector corresponding to each word; and, taking the sentences in the text under test as units, inputting the word vector corresponding to each word in every sentence into the neural network model and identifying the wrong words in the text under test using the neural network model.
The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.

In the above embodiments of the application, the description of each embodiment has its own emphasis; for parts not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be realized in other ways. The device embodiments described above are only schematic; for example, the division of the units may be a division of logical functions, and other division modes are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the couplings, direct couplings, or communication connections displayed or discussed may be indirect couplings or communication connections through some interfaces, units, or modules, and may be electrical or of other forms.

The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
In addition, the functional units in the embodiments of the application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be realized in the form of hardware or in the form of a software functional unit.

If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical scheme of the application — in essence, the part contributing to the prior art, or all or part of the technical scheme — may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the application. The aforementioned storage medium includes: a USB flash disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a mobile hard disk, a magnetic disk, an optical disk, and other media that can store program code.
The above are only the preferred embodiments of the application. It should be noted that, for those of ordinary skill in the art, several improvements and modifications can also be made without departing from the principle of the application, and these improvements and modifications should also be regarded as falling within the protection scope of the application.
Claims (10)
1. A model training method, characterized by comprising:
extracting text information from a preset text data source, wherein the text contained in the preset text data source contains no wrong words;
determining the word vector corresponding to each word in the text information, wherein the word vector is a multidimensional array that uniquely represents a word;
taking the sentences in the text information as units, inputting the word vector corresponding to each word in every sentence into a memory neural network, and training to obtain a neural network model, wherein the neural network model is used to recognize wrong words in text.
2. The model training method according to claim 1, characterized in that before the word vector corresponding to each word in the text information is determined, the model training method further comprises:
obtaining a target text library, wherein the text contained in the target text library contains no wrong words;
training on the target text library using a word vector model, so as to generate the word vector corresponding to each word in the target text library, obtaining a first training set.
3. The model training method according to claim 2, characterized in that determining the word vector corresponding to each word in the text information comprises:
performing word segmentation on the text information to obtain a second training set;
searching the first training set for the word vector corresponding to each word in the second training set.
4. The model training method according to claim 1, characterized in that before the word vector corresponding to each word in every sentence is input into the memory neural network, the model training method further comprises:
marking the word vector corresponding to each word in every sentence with a preset identifier, wherein the preset identifier indicates that the word corresponding to the word vector is a non-wrong word, so that when a non-wrong word is identified using the neural network model, the word is labeled with the preset identifier.
5. A wrong word recognition method, characterized by comprising:
performing word segmentation on text under test and determining the word vector corresponding to each word;
taking the sentences in the text under test as units, inputting the word vector corresponding to each word in every sentence into the neural network model trained by the model training method according to any one of claims 1 to 4, and identifying the wrong words in the text under test using the neural network model.
6. A model training apparatus, characterized by comprising:
an extraction unit, configured to extract text information from a preset text data source, wherein the text contained in the preset text data source contains no wrong words;
a determining unit, configured to determine the word vector corresponding to each word in the text information, wherein the word vector is a multidimensional array that uniquely represents a word;
a training unit, configured to, taking the sentences in the text information as units, input the word vector corresponding to each word in every sentence into a memory neural network and train to obtain a neural network model, wherein the neural network model is used to recognize wrong words in text.
7. The model training apparatus according to claim 6, characterized in that the model training apparatus further comprises:
an acquiring unit, configured to obtain a target text library before the word vector corresponding to each word in the text information is determined, wherein the text contained in the target text library contains no wrong words;
a generating unit, configured to train on the target text library using a word vector model, so as to generate the word vector corresponding to each word in the target text library, obtaining a first training set.
8. The model training apparatus according to claim 7, characterized in that the determining unit comprises:
a word segmentation module, configured to perform word segmentation on the text information to obtain a second training set;
a query module, configured to search the first training set for the word vector corresponding to each word in the second training set.
9. The model training apparatus according to claim 6, characterized in that the model training apparatus further comprises:
a marking unit, configured to mark the word vector corresponding to each word in every sentence with a preset identifier before the word vectors are input into the memory neural network, wherein the preset identifier indicates that the word corresponding to the word vector is a non-wrong word, so that when a non-wrong word is identified using the neural network model, the word is labeled with the preset identifier.
10. A wrong word identifying device, characterized by comprising:
a vector determining unit, configured to perform word segmentation on text under test and determine the word vector corresponding to each word;
a recognition unit, configured to, taking the sentences in the text under test as units, input the word vector corresponding to each word in every sentence into the neural network model trained by the model training method according to any one of claims 1 to 4, and identify the wrong words in the text under test using the neural network model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510850128.1A CN106815193A (en) | 2015-11-27 | 2015-11-27 | Model training method and device and wrong word recognition methods and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106815193A true CN106815193A (en) | 2017-06-09 |
Family
ID=59155338
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510850128.1A Pending CN106815193A (en) | 2015-11-27 | 2015-11-27 | Model training method and device and wrong word recognition methods and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106815193A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101847140A (en) * | 2009-03-23 | 2010-09-29 | 中国科学院计算技术研究所 | Wrongly-written or mispronounced character processing method and system |
CN104573046A (en) * | 2015-01-20 | 2015-04-29 | 成都品果科技有限公司 | Comment analyzing method and system based on term vector |
CN104615589A (en) * | 2015-02-15 | 2015-05-13 | 百度在线网络技术(北京)有限公司 | Named-entity recognition model training method and named-entity recognition method and device |
CN104899304A (en) * | 2015-06-12 | 2015-09-09 | 北京京东尚科信息技术有限公司 | Named entity identification method and device |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107451106A (en) * | 2017-07-26 | 2017-12-08 | 阿里巴巴集团控股有限公司 | Text method and device for correcting, electronic equipment |
CN109213843A (en) * | 2018-07-23 | 2019-01-15 | 北京密境和风科技有限公司 | A kind of detection method and device of rubbish text information |
CN109543022A (en) * | 2018-12-17 | 2019-03-29 | 北京百度网讯科技有限公司 | Text error correction method and device |
US11080492B2 (en) | 2018-12-17 | 2021-08-03 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and device for correcting error in text |
WO2020132985A1 (en) * | 2018-12-26 | 2020-07-02 | 深圳市优必选科技有限公司 | Self-training method and apparatus for model, computer device, and storage medium |
CN110310083A (en) * | 2019-06-04 | 2019-10-08 | 南方电网科学研究院有限责任公司 | Submitting system of science and technology project data report |
CN110765996A (en) * | 2019-10-21 | 2020-02-07 | 北京百度网讯科技有限公司 | Text information processing method and device |
CN110765996B (en) * | 2019-10-21 | 2022-07-29 | 北京百度网讯科技有限公司 | Text information processing method and device |
CN112599129A (en) * | 2021-03-01 | 2021-04-02 | 北京世纪好未来教育科技有限公司 | Speech recognition method, apparatus, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
CB02 | Change of applicant information | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170609 |
RJ01 | Rejection of invention patent application after publication |