CN106815193A - Model training method and device and wrong word recognition method and device - Google Patents

Model training method and device and wrong word recognition method and device

Info

Publication number
CN106815193A
CN106815193A
Authority
CN
China
Prior art keywords
word
text
term vector
sentence
wrong
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510850128.1A
Other languages
Chinese (zh)
Inventor
刘粉香
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201510850128.1A priority Critical patent/CN106815193A/en
Publication of CN106815193A publication Critical patent/CN106815193A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

This application discloses a model training method and device, and a wrong-word recognition method and device. The model training method includes: extracting text information from a preset text data source, where the text contained in the preset text data source is text containing no wrong words; determining the term vector corresponding to each word in the text information, where a term vector is a multidimensional array uniquely representing a word; and, taking the sentences in the text information as units, inputting the term vectors corresponding to the words in each sentence into a memory neural network and training it to obtain a neural network model, where the neural network model is used to recognize wrong words in text. The application solves the technical problem in the prior art of the low recognition rate of wrong words in text.

Description

Model training method and device and wrong word recognition method and device
Technical field
The application relates to the field of text processing, and in particular to a model training method and device and a wrong-word recognition method and device.
Background technology
Text is an important carrier for recording information. Because text is mostly edited by humans, and human editing inevitably produces slips, wrong words appear in text. At present, wrong words in text are generally recognized by manually building a lexicon of correct words and matching the text against it. However, it is difficult to build a comprehensive and correct lexicon in this way, which leads to a high miss rate and in turn a low recognition rate of wrong words in text.
No effective solution to the above problem has yet been proposed.
Summary of the invention
The embodiments of the present application provide a model training method and device and a wrong-word recognition method and device, so as to at least solve the technical problem in the prior art of the low recognition rate of wrong words in text.
According to one aspect of the embodiments of the present application, a model training method is provided, including: extracting text information from a preset text data source, where the text contained in the preset text data source is text containing no wrong words; determining the term vector corresponding to each word in the text information, where a term vector is a multidimensional array uniquely representing a word; and, taking the sentences in the text information as units, inputting the term vectors corresponding to the words in each sentence into a memory neural network and training it to obtain a neural network model, where the neural network model is used to recognize wrong words in text.
Further, before the term vector corresponding to each word in the text information is determined, the model training method also includes: obtaining a target text library, where the text contained in the target text library is text containing no wrong words; and training on the target text library with a term vector model to generate the term vectors corresponding to the words in the target text library, obtaining a first training set.
Further, determining the term vector corresponding to each word in the text information includes: performing word segmentation on the text information to obtain a second training set; and looking up, in the first training set, the term vector corresponding to each word in the second training set.
Further, before the term vectors corresponding to the words in each sentence are input into the memory neural network, the model training method also includes: labeling the term vector corresponding to each word in each sentence with a preset mark, where the preset mark indicates that the word corresponding to the term vector is a non-wrong word, so that when the neural network model recognizes a non-wrong word, the word is labeled with the preset mark.
According to another aspect of the embodiments of the present application, a wrong-word recognition method is also provided, including: performing word segmentation on a text to be tested and determining the term vector corresponding to each word; and, taking the sentences in the text to be tested as units, inputting the term vectors corresponding to the words in each sentence into a neural network model, and identifying the wrong words in the text to be tested with the neural network model.
According to another aspect of the embodiments of the present application, a model training device is also provided, including: an extraction unit for extracting text information from a preset text data source, where the text contained in the preset text data source is text containing no wrong words; a determining unit for determining the term vector corresponding to each word in the text information, where a term vector is a multidimensional array uniquely representing a word; and a training unit for, taking the sentences in the text information as units, inputting the term vectors corresponding to the words in each sentence into a memory neural network and training it to obtain a neural network model, where the neural network model is used to recognize wrong words in text.
Further, the model training device also includes: an acquiring unit for obtaining, before the term vector corresponding to each word in the text information is determined, a target text library, where the text contained in the target text library is text containing no wrong words; and a generation unit for training on the target text library with a term vector model to generate the term vectors corresponding to the words in the target text library, obtaining a first training set.
Further, the determining unit includes: a word segmentation module for performing word segmentation on the text information to obtain a second training set; and a query module for looking up, in the first training set, the term vector corresponding to each word in the second training set.
Further, the model training device also includes: a marking unit for labeling, before the term vectors corresponding to the words in each sentence are input into the memory neural network, the term vector corresponding to each word in each sentence with a preset mark, where the preset mark indicates that the word corresponding to the term vector is a non-wrong word, so that when the neural network model recognizes a non-wrong word, the word is labeled with the preset mark.
According to another aspect of the embodiments of the present application, a wrong-word recognition device is also provided, including: a vector determining unit for performing word segmentation on a text to be tested and determining the term vector corresponding to each word; and a recognition unit for, taking the sentences in the text to be tested as units, inputting the term vectors corresponding to the words in each sentence into a neural network model and identifying the wrong words in the text to be tested with the neural network model.
According to the embodiments of the present application, text information is extracted from a preset text data source, where the text contained in the preset text data source is text containing no wrong words; the term vector corresponding to each word in the text information is determined, where a term vector is a multidimensional array uniquely representing a word; and, taking the sentences in the text information as units, the term vectors corresponding to the words in each sentence are input into a memory neural network, with training obtaining a neural network model. The neural network model can then be used to recognize wrong words in text, which improves the recognition rate of wrong words in text and solves the technical problem in the prior art of the low recognition rate of wrong words in text.
Brief description of the drawings
The accompanying drawings described herein are provided for further understanding of the present application and constitute a part of the present application. The schematic embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of the present application. In the drawings:
Fig. 1 is a flow chart of a model training method according to an embodiment of the present application;
Fig. 2 is a flow chart of a wrong-word recognition method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a model training device according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a wrong-word recognition device according to an embodiment of the present application.
Specific embodiments
In order that those skilled in the art may better understand the solution of the present application, the technical solution in the embodiments of the present application is described clearly and completely below in conjunction with the accompanying drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the scope of protection of the present application.
It should be noted that the terms "first", "second", etc. in the description, claims, and above accompanying drawings of the present application are used to distinguish similar objects, not to describe a specific order or precedence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described here. In addition, the terms "comprising" and "having" and any variations of them are intended to cover non-exclusive inclusion: for example, a process, method, system, product, or device containing a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or are inherent to the process, method, product, or device.
According to an embodiment of the present application, a method embodiment of a model training method is provided. It should be noted that the steps illustrated in the flow chart of the accompanying drawing may be performed in a computer system such as a set of computer-executable instructions, and, although a logical order is shown in the flow chart, in some cases the steps shown or described may be performed in an order different from that given here.
Fig. 1 is a flow chart of a model training method according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps:
Step S102: text information is extracted from a preset text data source, where the text contained in the preset text data source is text containing no wrong words.
The preset text data source may be a resource website such as People's Daily or the Chinese government website, or may be a proofread text data source containing no wrong words. The preset text data source contains a large amount of text without wrong words, from which the text information is extracted.
Step S104: the term vector corresponding to each word in the text information is determined, where a term vector is a multidimensional array uniquely representing a word.
For the text information extracted above, the term vector corresponding to each word in it is determined. The term vector of each word is a multidimensional array representing that word, and different words correspond to different term vectors. The term vectors of words may be predefined, so that after the text information is extracted, the vector of each word in the text information is queried from the predefined term vectors. The term vector of each word may also be generated according to a preset term-vector generation rule.
Step S106: taking the sentences in the text information as units, the term vectors corresponding to the words in each sentence are input into a memory neural network, with training obtaining a neural network model, where the neural network model is used to recognize wrong words in text.
In this embodiment, after the term vector of each word contained in the text information is determined, the sentences in the text information, taken as units, are sequentially input into the memory neural network for training, with each sentence represented by the term vectors corresponding to its words; that is, the term vectors corresponding to the words in a sentence are input into the memory neural network. The memory neural network may preferably be a long short-term memory network based on a recurrent neural network (i.e., LSTM + bidirectional RNN). The extracted text information is trained with the memory neural network to obtain the neural network model. Because the term vectors of the words are input into the memory neural network in units of sentences, the machine can memorize the words in the sentences and their combinations, and memorize them with the parameters of the neural network model (the parameters of the neural network model are determined by training and are mostly matrices). Compared with the prior-art approach of manually building a lexicon of correct words and matching text against it to recognize wrong words, this embodiment trains a neural network model on text without wrong words through a memory neural network and then uses the neural network model to recognize wrong words in text. Without manually building a lexicon, wrong words can be recognized according to word combinations and sentences, based on the context and semantics, so that wrong words in text are identified effectively and quickly.
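The paragraph above names a long short-term memory network (LSTM + bidirectional RNN) but the patent gives no implementation details. Purely as an illustration, the following minimal sketch shows the standard forward computation of a single LSTM cell step consuming one term vector at a time; the tiny dimensions and constant weights are invented for the example and are not trained values from the patent.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step: consume one term vector x, update hidden/cell state.

    x, h_prev, c_prev are lists of floats; W maps each gate name to a weight
    matrix of size hidden x (input + hidden); b maps each gate name to a bias
    vector. All weights here are illustrative constants, not trained values.
    """
    z = x + h_prev            # concatenated input [x; h_prev]
    hidden = len(h_prev)

    def affine(gate):
        return [sum(W[gate][i][j] * z[j] for j in range(len(z))) + b[gate][i]
                for i in range(hidden)]

    i_g = [sigmoid(v) for v in affine("i")]    # input gate
    f_g = [sigmoid(v) for v in affine("f")]    # forget gate
    o_g = [sigmoid(v) for v in affine("o")]    # output gate
    g_g = [math.tanh(v) for v in affine("g")]  # candidate cell state

    c = [f_g[k] * c_prev[k] + i_g[k] * g_g[k] for k in range(hidden)]
    h = [o_g[k] * math.tanh(c[k]) for k in range(hidden)]
    return h, c

# Run a 2-word "sentence" of 3-dimensional term vectors through a 2-unit cell.
W = {g: [[0.1] * 5, [0.2] * 5] for g in ("i", "f", "o", "g")}
b = {g: [0.0, 0.0] for g in ("i", "f", "o", "g")}
h, c = [0.0, 0.0], [0.0, 0.0]
for word_vec in [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]:
    h, c = lstm_step(word_vec, h, c, W, b)
```

A bidirectional variant, as named in the patent, would run a second such cell over the sentence in reverse order and combine both hidden sequences; training the weights is done by backpropagation in practice and is omitted here.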
According to this embodiment of the present application, text information is extracted from a preset text data source, where the text contained in the preset text data source is text containing no wrong words; the term vector corresponding to each word in the text information is determined, where a term vector is a multidimensional array uniquely representing a word; and, taking the sentences in the text information as units, the term vectors corresponding to the words in each sentence are input into a memory neural network, with training obtaining a neural network model. The neural network model can then be used to recognize wrong words in text, which improves the recognition rate of wrong words in text and solves the technical problem in the prior art of the low recognition rate of wrong words in text.
For example, consider the phrase "indignant position" (a corruption of "angrily leaving") appearing in text. A lexicon built in the prior art contains the words "angrily leaving", "indignant", "leaving", and "position"; when the phrase is checked, its component words are each matched in the lexicon, so the phrase is judged to contain no wrong word. In the embodiment of the present application, however, the neural network model is trained in units of sentences; that is, "angrily leaving" is input into the neural network as a whole and memorized by the parameters of the neural network model. Therefore, when "indignant position" is input into the neural network model, the wrong character in it is recognized.
Preferably, before the term vector corresponding to each word in the text information is determined, the model training method also includes: obtaining a target text library, where the text contained in the target text library is text containing no wrong words; and training on the target text library with a term vector model to generate the term vectors corresponding to the words in the target text library, obtaining a first training set.
The target text library of this embodiment may be a text library containing no wrong words, such as a dictionary containing various words (for example the Xinhua dictionary or an idiom dictionary) or a collection of articles; the target text library is obtained to serve as the term-vector training set. The term vector model may be an existing mature model that, according to the input text, generates for each word a multidimensional array of the same dimension, i.e., a term vector. The dimension of the term vectors may be defined according to the term-vector training set; for example, "one" might be marked as [1, 0, 0, ...] and "happiness" as [0, 1, 0, ...].
In this embodiment of the present application, the term vector of each word in the term-vector training set can be obtained by training in advance, so that the term vectors of the words in the text information used for neural network model training can be queried from it.
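The description above only gives the shape of the result ("one" maps to [1, 0, 0, ...]); the actual term vector model is left unspecified. As a minimal sketch, the simplest scheme satisfying the stated property of one unique multidimensional array per word is one-hot indexing over a vocabulary; a real mature model (e.g. word2vec-style training) would produce dense vectors instead. The function name is invented for illustration.

```python
def build_term_vectors(vocabulary):
    """Assign each word a unique fixed-dimension array (one-hot by index).

    This is a stand-in for a trained term vector model: one-hot indexing is
    the simplest scheme in which every word gets a distinct multidimensional
    array of identical dimension, as described above.
    """
    dim = len(vocabulary)
    table = {}
    for idx, word in enumerate(vocabulary):
        vec = [0] * dim
        vec[idx] = 1
        table[word] = vec
    return table

vectors = build_term_vectors(["one", "happiness", "China"])
# "one" maps to [1, 0, 0], "happiness" to [0, 1, 0]; every vector is unique.
```

The resulting table is the "first training set" in the patent's terms: a lookup from each word to its vector, queried later when preparing sentences for the neural network.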
It should be noted that in this embodiment of the present application, a corresponding term vector may also be generated for each punctuation mark.
Further, determining the term vector corresponding to each word in the text information includes: performing word segmentation on the text information to obtain a second training set; and looking up, in the first training set, the term vector corresponding to each word in the second training set.
For the text information used for neural network model training, word segmentation is first performed on it to obtain a word set, i.e., the second training set; the term vector corresponding to each word in the second training set is then queried in the first training set obtained above, so that the term vector of each word in each sentence of the text information is determined.
Specifically, an existing word segmentation tool may be used to perform word segmentation on the text information, where the segmented text is composed of words; for example, "I am a Chinese person" may be segmented as "I / am / one / Chinese person" or "I / am / one / Chinese / person".
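The patent does not name the segmentation tool it relies on. As a toy illustration of how such a tool divides text against a lexicon, and why segmentation granularity can vary, here is a greedy longest-match segmenter over a hypothetical lexicon; production segmenters for Chinese (e.g. jieba) use far richer statistical models.

```python
def segment(text, lexicon):
    """Greedy longest-match word segmentation.

    Scans left to right, always taking the longest lexicon entry that matches
    at the current position; characters not covered by the lexicon become
    single-character tokens.
    """
    words, i = [], 0
    while i < len(text):
        match = text[i]  # fall back to a single character
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        words.append(match)
        i += len(match)
    return words

lexicon = {"ab", "abc", "cd", "d"}
print(segment("abcd", lexicon))  # longest match takes "abc", then "d"
```

Under a different lexicon the same string splits differently, which mirrors the two alternative segmentations of "I am a Chinese person" given in the text.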
Preferably, before the term vectors corresponding to the words in each sentence are input into the memory neural network, the model training method also includes: labeling the term vector corresponding to each word in each sentence with a preset mark, where the preset mark indicates that the word corresponding to the term vector is a non-wrong word, so that when the neural network model recognizes a non-wrong word, the word is labeled with the preset mark.
In this embodiment of the present application, each word in each sentence input into the memory neural network is labeled with a mark, such as "1". In this way, when the text is trained to obtain the neural network model, the parameters of the neural network model can memorize that these words carry the preset mark. When the neural network model is then used to recognize a text to be tested, the words without wrong words in the text are labeled with the preset mark in the output result, while words containing wrong words are left unmarked or labeled with another mark, so that the wrong words in the text to be tested can be quickly screened out.
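The labeling step above reduces to pairing each word's term vector with the preset mark before it enters the network. A minimal sketch under that reading follows; the mark value 1 comes from the example in the text, while the function name is invented for illustration.

```python
DEFAULT_MARK = 1  # preset mark: every word in the clean training text is non-wrong

def label_sentence(sentence_vectors, mark=DEFAULT_MARK):
    """Pair each word's term vector in a sentence with the preset mark.

    The training corpus contains no wrong words, so every term vector is
    labeled as a non-wrong word; the trained model later emits this mark for
    words it recognizes and a different value for suspected wrong words.
    """
    return [(vec, mark) for vec in sentence_vectors]

sentence = [[1, 0, 0], [0, 1, 0]]   # term vectors of one (tiny) sentence
samples = label_sentence(sentence)
```

Each `(vector, mark)` pair is one training sample for the sentence, matching the uniform non-wrong labeling the paragraph describes.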
An optional implementation of the model training method of this embodiment of the present application includes:
Step 1: obtain a reliable text library (i.e., a text library containing no wrong words, such as the Xinhua dictionary, an idiom dictionary, or a collection of articles) as the target text library, which serves as term-vector training set 1, i.e., the first training set.
Step 2: train on training set 1 with a term vector model to obtain the term vector of each word (including punctuation marks) in training set 1. The term vector model may be an existing mature model that, according to the input text, generates for each word a unique multidimensional array of the same dimension, i.e., a term vector. The dimension of the term vectors may be predefined; for example, "one" might be marked as [1, 0, 0, ...] and "happiness" as [0, 1, 0, ...].
Step 3: extract text information from a reliable text data source composed of a large number of sentences, as the text training set. Here, a reliable text data source composed of a large number of sentences means a text data source without wrong words, obtained for example from channels such as People's Daily or the Chinese government website.
Step 4: using an existing word segmentation tool, perform word segmentation on the above text training set to obtain training set 2, i.e., the second training set. The segmented text is composed of words; for example, "I am a Chinese person" may be segmented as "I / am / one / Chinese person" or "I / am / one / Chinese / person".
Step 5: taking the sentences of training set 2 as units, find the term vector of each word in a sentence from training set 1, label each word as a non-wrong word (for example, representing a non-wrong word with "1"), and input the resulting term vectors into a long short-term memory network based on a recurrent neural network (i.e., LSTM + bidirectional RNN); training obtains the neural network model (the model parameters are determined by training and are mostly matrices). Because the neural network is fed in units of sentences, the machine can memorize the words in the sentences and their combinations, and memorize these combinations with the parameters of the model.
By using the neural network model, analysis can be performed according to the word combinations, sentences, and paragraphs in an article, which improves the recognition accuracy and reduces the miss rate.
According to an embodiment of the present application, a wrong-word recognition method is also provided. The wrong-word recognition method can use the neural network model obtained by training with the model training method of the above embodiments of the present application to recognize wrong words. As shown in Fig. 2, the wrong-word recognition method includes:
Step S202: word segmentation is performed on a text to be tested, and the term vector corresponding to each word is determined.
For each word obtained by word segmentation, its corresponding term vector can be queried from the first training set of the embodiments of the present application.
Step S204: taking the sentences in the text to be tested as units, the term vectors corresponding to the words in each sentence are input into the neural network model, and the wrong words in the text to be tested are identified with the neural network model.
The neural network model in this embodiment is the neural network model obtained by training with the model training method of the above embodiments of the present application.
Because the neural network model is obtained by training a memory neural network on text without wrong words, the parameters of the neural network model (determined by training, mostly matrices) can memorize the words without wrong words in the training text and their combinations. Without manually building a lexicon, wrong words can be recognized according to word combinations and sentences, based on the context and semantics, so that wrong words in text are identified effectively and quickly.
The term vectors of the text to be tested are input into the trained neural network model; through the calculation of the neural network model, each word is marked in the output result. For example, a non-wrong word is marked as 1 and a wrong word is marked as -1, so that the wrong words can be screened out.
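Once the model has emitted a per-word mark (1 or -1 in the example above), the final screening step is a simple filter. The sketch below assumes the marks are already available as a sequence parallel to the words; both sequences here are hypothetical stand-ins for real model output.

```python
def screen_wrong_words(words, marks, wrong_mark=-1):
    """Filter out the words the model marked as wrong.

    words and marks are parallel sequences: per the example in the text, the
    model emits 1 for a word it recognizes as correct and -1 for a suspected
    wrong word.
    """
    return [w for w, m in zip(words, marks) if m == wrong_mark]

words = ["I", "am", "a", "Chinse", "person"]   # "Chinse" is the typo
marks = [1, 1, 1, -1, 1]                        # hypothetical model output
print(screen_wrong_words(words, marks))         # ['Chinse']
```

Keeping the marks parallel to the segmented words also preserves each wrong word's position in the sentence, which is useful if corrections are to be suggested downstream.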
An embodiment of the present application also provides a model training device, which can be used to perform the model training method of the embodiments of the present application. As shown in Fig. 3, the model training device includes: an extraction unit 301, a determining unit 303, and a training unit 305.
The extraction unit 301 is used to extract text information from a preset text data source, where the text contained in the preset text data source is text containing no wrong words.
The preset text data source may be a resource website such as People's Daily or the Chinese government website, or may be a proofread text data source containing no wrong words. The preset text data source contains a large amount of text without wrong words, from which the text information is extracted.
The determining unit 303 is used to determine the term vector corresponding to each word in the text information, where a term vector is a multidimensional array uniquely representing a word.
For the text information extracted above, the term vector corresponding to each word in it is determined. The term vector of each word is a multidimensional array representing that word, and different words correspond to different term vectors. The term vectors of words may be predefined, so that after the text information is extracted, the vector of each word in the text information is queried from the predefined term vectors. The term vector of each word may also be generated according to a preset term-vector generation rule.
The training unit 305 is used to, taking the sentences in the text information as units, input the term vectors corresponding to the words in each sentence into a memory neural network, with training obtaining a neural network model, where the neural network model is used to recognize wrong words in text.
In this embodiment, after the term vector of each word contained in the text information is determined, the sentences in the text information, taken as units, are sequentially input into the memory neural network for training, with each sentence represented by the term vectors corresponding to its words; that is, the term vectors corresponding to the words in a sentence are input into the memory neural network. The memory neural network may preferably be a long short-term memory network based on a recurrent neural network (i.e., LSTM + bidirectional RNN). The extracted text information is trained with the memory neural network to obtain the neural network model. Because the term vectors of the words are input into the memory neural network in units of sentences, the machine can memorize the words in the sentences and their combinations, and memorize them with the parameters of the neural network model (the parameters of the neural network model are determined by training and are mostly matrices). Compared with the prior-art approach of manually building a lexicon of correct words and matching text against it to recognize wrong words, this embodiment trains a neural network model on text without wrong words through a memory neural network and then uses the neural network model to recognize wrong words in text. Without manually building a lexicon, wrong words can be recognized according to word combinations and sentences, based on the context and semantics, so that wrong words in text are identified effectively and quickly.
According to this embodiment of the present application, text information is extracted from a preset text data source, where the text contained in the preset text data source is text containing no wrong words; the term vector corresponding to each word in the text information is determined, where a term vector is a multidimensional array uniquely representing a word; and, taking the sentences in the text information as units, the term vectors corresponding to the words in each sentence are input into a memory neural network, with training obtaining a neural network model. The neural network model can then be used to recognize wrong words in text, which improves the recognition rate of wrong words in text and solves the technical problem in the prior art of the low recognition rate of wrong words in text.
For example, consider the phrase "indignant position" (a corruption of "angrily leaving") appearing in text. A lexicon built in the prior art contains the words "angrily leaving", "indignant", "leaving", and "position"; when the phrase is checked, its component words are each matched in the lexicon, so the phrase is judged to contain no wrong word. In the embodiment of the present application, however, the neural network model is trained in units of sentences; that is, "angrily leaving" is input into the neural network as a whole and memorized by the parameters of the neural network model. Therefore, when "indignant position" is input into the neural network model, the wrong character in it is recognized.
Preferably, model training apparatus also include:Acquiring unit, for each word pair in text message is determined Before the term vector answered, target text storehouse is obtained, the text that target text place is included is not comprising the text for having wrong word This;Generation unit, for being trained to target text storehouse using term vector model, with generating target text storehouse The corresponding term vector of word, obtains the first training set.
The target text storehouse of the present embodiment, can be the dictionary for including various words, such as xinhua dictionary, into words and phrases The text library not comprising wrong word such as allusion quotation, article, obtains target text storehouse with as term vector training set.Term vector Model can be existing maturity model, and the model can generate a dimension phase according to input text to each word Same Multidimensional numerical, i.e. term vector, the dimension of the term vector such as will for that can be defined according to term vector training set It is 1,0,0 that " one " may mark ... ...], it is 0,1,0 that " happiness " may be marked ... ...].
In the embodiment of the present application, the word vector of each word in the text information used for neural network model training can be queried from the word vector training set obtained by training in advance.
It should be noted that, in the embodiment of the present application, a corresponding word vector may also be generated for each punctuation mark.
Preferably, the determining unit includes: a word segmentation module, configured to perform word segmentation on the text information to obtain a second training set; and a query module, configured to look up, in the first training set, the word vector corresponding to each word in the second training set.
For the text information used for neural network model training, word segmentation is first performed on it to obtain a word set, namely the second training set. The word vector corresponding to each word in the second training set is then queried from the first training set obtained above, thereby determining the word vector of each word in every sentence of the text information.
Specifically, an existing word segmentation tool may be used to perform word segmentation on the text information, where the segmented text consists of words. For example, "我是一个中国人" ("I am Chinese") may be segmented as "我 / 是 / 一个 / 中国人" or as "我 / 是 / 一 / 个 / 中国 / 人".
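As a hedged sketch of how such a segmentation tool can work, the function below implements dictionary-based forward maximum matching, one common segmentation strategy: at each position it greedily takes the longest lexicon word, falling back to a single character. The mini-lexicon is hypothetical; production tools use much larger dictionaries and statistical models.

```python
def segment(text, lexicon, max_word_len=4):
    """Forward maximum matching: longest lexicon word at each position, else one character."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown character kept as a single-character word
            i += 1
    return words

lexicon = {"我", "是", "一个", "中国人", "中国"}
print(segment("我是一个中国人", lexicon))  # ['我', '是', '一个', '中国人']
```

Note that with a different lexicon (e.g. lacking "中国人"), the same sentence segments differently, which is exactly the ambiguity the "我 / 是 / 一 / 个 / 中国 / 人" alternative above illustrates.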
Preferably, the model training apparatus further includes: a marking unit, configured to mark the word vector corresponding to each word in every sentence with a preset mark before the word vectors are input into the memory neural network, where the preset mark indicates that the word corresponding to the word vector is not a wrong word, so that when the neural network model identifies a word as not being a wrong word, that word is marked with the preset mark.
In the embodiment of the present application, each word in every sentence input into the memory neural network is marked, for example with "1". In this way, when the text is used for training to obtain the neural network model, the parameters of the neural network model memorize these words as carrying the preset mark. When the neural network model is used to recognize a text to be tested, its output marks the words that are not wrong words with the preset mark, while words identified as wrong words are left unmarked or given a different mark, so that the wrong words in the text to be tested can be filtered out quickly.
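The marking step described above can be sketched as follows: every word vector in a sentence of the error-free training text is paired with the preset mark 1 ("not a wrong word") before being fed to the network. The placeholder vectors below are hypothetical, not trained embeddings.

```python
PRESET_MARK = 1  # preset mark: the word is not a wrong word

def label_sentence(word_vectors):
    """Attach the preset mark to every word vector in a sentence (training texts are error-free)."""
    return [(vec, PRESET_MARK) for vec in word_vectors]

sentence_vectors = [[1, 0, 0], [0, 1, 0]]  # two placeholder word vectors
labeled = label_sentence(sentence_vectors)
print(labeled)  # [([1, 0, 0], 1), ([0, 1, 0], 1)]
```

Since every training word carries the same mark, the network effectively learns what "correct" sentences look like, and words that deviate from those learned patterns at recognition time fail to receive the preset mark.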
The model training apparatus includes a processor and a memory. The above extraction unit 301, determining unit 303, training unit 305 and the like are stored in the memory as program units, and the processor executes the program units stored in the memory.
The processor contains a kernel, and the kernel retrieves the corresponding program units from the memory. One or more kernels may be provided, and the neural network model is obtained by training while adjusting kernel parameters.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The present application also provides an embodiment of a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: extracting text information from a preset text data source, where the texts included in the preset text data source contain no wrong words; determining the word vector corresponding to each word in the text information, where a word vector is a multidimensional array that uniquely represents a word; and, in units of the sentences in the text information, inputting the word vector corresponding to each word in every sentence into a memory neural network, and training to obtain a neural network model, where the neural network model is used to recognize the wrong words in a text.
According to an embodiment of the present application, a wrong word identifying device is also provided, which can be used to execute the wrong word recognition method provided by the embodiment of the present application. As shown in Fig. 4, the wrong word identifying device includes a vector determining unit 401 and a recognition unit 403.
The vector determining unit 401 is configured to perform word segmentation on the text to be tested and determine the word vector corresponding to each word.
In the embodiment of the present application, the word vector corresponding to each word obtained after word segmentation can be queried from the first training set.
The recognition unit 403 is configured to, in units of the sentences in the text to be tested, input the word vector corresponding to each word in every sentence into the neural network model, and identify the wrong words in the text to be tested using the neural network model.
The neural network model in this embodiment is the neural network model obtained by training with the model training method of the above embodiments of the present application.
Because the neural network model is obtained by training a memory neural network on texts without wrong words, the parameters of the neural network model (which are determined during training and are mostly matrices) can memorize the error-free words and word combinations in the training texts. Without manually establishing a lexicon, wrong words can be recognized from word combinations and sentences; based on the context semantics, the wrong words in a text can be identified effectively and quickly.
The word vectors of the text to be tested are input into the trained neural network model, and through the computation of the neural network model each word is marked in the output result, for example a non-wrong word is marked 1 and a wrong word is marked -1, so that the wrong words can be screened out.
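The screening step described above can be sketched as simple post-processing of the model output: keep the words whose mark is -1. The marks below are hand-written stand-ins for real neural network output, and the example words are hypothetical.

```python
def screen_wrong_words(words, marks):
    """Return the words whose output mark is -1 (identified wrong words)."""
    return [word for word, mark in zip(words, marks) if mark == -1]

words = ["愤然", "离", "位"]
marks = [1, 1, -1]  # hypothetical per-word output of the neural network model
print(screen_wrong_words(words, marks))  # ['位']
```

In a full pipeline, `marks` would come from running the sentence's word vectors through the trained model; the filtering itself stays this simple.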
The wrong word identifying device includes a processor and a memory. The above vector determining unit 401, recognition unit 403 and the like are stored in the memory as program units, and the processor executes the program units stored in the memory.
The processor contains a kernel, and the kernel retrieves the corresponding program units from the memory. One or more kernels may be provided, and the wrong words in a text are recognized by adjusting kernel parameters.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The present application also provides an embodiment of a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: performing word segmentation on the text to be tested, and determining the word vector corresponding to each word; and, in units of the sentences in the text to be tested, inputting the word vector corresponding to each word in every sentence into the neural network model, and identifying the wrong words in the text to be tested using the neural network model.
The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
In the above embodiments of the application, the description of each embodiment has its own emphasis; for parts not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content can be realized in other ways. The device embodiments described above are only schematic; for example, the division of the units may be a division of logical functions, and there may be other division modes in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
In addition, the functional units in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit can be realized in the form of hardware or in the form of a software functional unit.
If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical scheme of the present application, in essence or in the part contributing to the prior art, or all or part of the technical scheme, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the steps of the methods described in each embodiment of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present application. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can also be made without departing from the principles of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (10)

1. A model training method, characterized by comprising:
extracting text information from a preset text data source, wherein the texts included in the preset text data source contain no wrong words;
determining the word vector corresponding to each word in the text information, wherein a word vector is a multidimensional array that uniquely represents a word;
in units of the sentences in the text information, inputting the word vector corresponding to each word in every sentence into a memory neural network, and training to obtain a neural network model, wherein the neural network model is used to recognize the wrong words in a text.
2. The model training method according to claim 1, characterized in that, before the word vector corresponding to each word in the text information is determined, the model training method further comprises:
obtaining a target text library, wherein the texts included in the target text library contain no wrong words;
training on the target text library using a word vector model, so as to generate the word vector corresponding to each word in the target text library and obtain a first training set.
3. The model training method according to claim 2, characterized in that determining the word vector corresponding to each word in the text information comprises:
performing word segmentation on the text information to obtain a second training set;
looking up, in the first training set, the word vector corresponding to each word in the second training set.
4. The model training method according to claim 1, characterized in that, before the word vector corresponding to each word in every sentence is input into the memory neural network, the model training method further comprises:
marking the word vector corresponding to each word in every sentence with a preset mark, wherein the preset mark indicates that the word corresponding to the word vector is not a wrong word, so that when a word is identified as not being a wrong word by the neural network model, the word is marked with the preset mark.
5. A wrong word recognition method, characterized by comprising:
performing word segmentation on a text to be tested, and determining the word vector corresponding to each word;
in units of the sentences in the text to be tested, inputting the word vector corresponding to each word in every sentence into the neural network model obtained by training with the model training method according to any one of claims 1 to 4, and identifying the wrong words in the text to be tested using the neural network model.
6. A model training apparatus, characterized by comprising:
an extraction unit, configured to extract text information from a preset text data source, wherein the texts included in the preset text data source contain no wrong words;
a determining unit, configured to determine the word vector corresponding to each word in the text information, wherein a word vector is a multidimensional array that uniquely represents a word;
a training unit, configured to, in units of the sentences in the text information, input the word vector corresponding to each word in every sentence into a memory neural network and train to obtain a neural network model, wherein the neural network model is used to recognize the wrong words in a text.
7. The model training apparatus according to claim 6, characterized in that the model training apparatus further comprises:
an acquiring unit, configured to obtain a target text library before the word vector corresponding to each word in the text information is determined, wherein the texts included in the target text library contain no wrong words;
a generating unit, configured to train on the target text library using a word vector model, so as to generate the word vector corresponding to each word in the target text library and obtain a first training set.
8. The model training apparatus according to claim 7, characterized in that the determining unit comprises:
a word segmentation module, configured to perform word segmentation on the text information to obtain a second training set;
a query module, configured to look up, in the first training set, the word vector corresponding to each word in the second training set.
9. The model training apparatus according to claim 6, characterized in that the model training apparatus further comprises:
a marking unit, configured to mark the word vector corresponding to each word in every sentence with a preset mark before the word vectors are input into the memory neural network, wherein the preset mark indicates that the word corresponding to the word vector is not a wrong word, so that when a word is identified as not being a wrong word by the neural network model, the word is marked with the preset mark.
10. A wrong word identifying device, characterized by comprising:
a vector determining unit, configured to perform word segmentation on a text to be tested and determine the word vector corresponding to each word;
a recognition unit, configured to, in units of the sentences in the text to be tested, input the word vector corresponding to each word in every sentence into the neural network model obtained by training with the model training method according to any one of claims 1 to 4, and identify the wrong words in the text to be tested using the neural network model.
CN201510850128.1A 2015-11-27 2015-11-27 Model training method and device and wrong word recognition methods and device Pending CN106815193A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510850128.1A CN106815193A (en) 2015-11-27 2015-11-27 Model training method and device and wrong word recognition methods and device


Publications (1)

Publication Number Publication Date
CN106815193A true CN106815193A (en) 2017-06-09

Family

ID=59155338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510850128.1A Pending CN106815193A (en) 2015-11-27 2015-11-27 Model training method and device and wrong word recognition methods and device

Country Status (1)

Country Link
CN (1) CN106815193A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101847140A (en) * 2009-03-23 2010-09-29 中国科学院计算技术研究所 Wrongly-written or mispronounced character processing method and system
CN104573046A (en) * 2015-01-20 2015-04-29 成都品果科技有限公司 Comment analyzing method and system based on term vector
CN104615589A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Named-entity recognition model training method and named-entity recognition method and device
CN104899304A (en) * 2015-06-12 2015-09-09 北京京东尚科信息技术有限公司 Named entity identification method and device


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451106A (en) * 2017-07-26 2017-12-08 阿里巴巴集团控股有限公司 Text method and device for correcting, electronic equipment
CN109213843A (en) * 2018-07-23 2019-01-15 北京密境和风科技有限公司 A kind of detection method and device of rubbish text information
CN109543022A (en) * 2018-12-17 2019-03-29 北京百度网讯科技有限公司 Text error correction method and device
US11080492B2 (en) 2018-12-17 2021-08-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and device for correcting error in text
WO2020132985A1 (en) * 2018-12-26 2020-07-02 深圳市优必选科技有限公司 Self-training method and apparatus for model, computer device, and storage medium
CN110310083A (en) * 2019-06-04 2019-10-08 南方电网科学研究院有限责任公司 Submitting system of science and technology project data report
CN110765996A (en) * 2019-10-21 2020-02-07 北京百度网讯科技有限公司 Text information processing method and device
CN110765996B (en) * 2019-10-21 2022-07-29 北京百度网讯科技有限公司 Text information processing method and device
CN112599129A (en) * 2021-03-01 2021-04-02 北京世纪好未来教育科技有限公司 Speech recognition method, apparatus, device and storage medium

Similar Documents

Publication Publication Date Title
CN106815193A (en) Model training method and device and wrong word recognition methods and device
CN106815192B (en) Model training method and device and sentence emotion recognition method and device
CN106815194A (en) Model training method and device and keyword recognition method and device
CN104503998B (en) For the kind identification method and device of user query sentence
CN103970765B (en) Correct mistakes model training method, device and text of one is corrected mistakes method, device
CN113707300B (en) Search intention recognition method, device, equipment and medium based on artificial intelligence
CN109033282B (en) Webpage text extraction method and device based on extraction template
CN105243055A (en) Multi-language based word segmentation method and apparatus
CN112035675A (en) Medical text labeling method, device, equipment and storage medium
CN107491536B (en) Test question checking method, test question checking device and electronic equipment
CN110008473B (en) Medical text named entity identification and labeling method based on iteration method
CN111506696A (en) Information extraction method and device based on small number of training samples
CN112632278A (en) Labeling method, device, equipment and storage medium based on multi-label classification
CN110569335A (en) triple verification method and device based on artificial intelligence and storage medium
CN110610180A (en) Method, device and equipment for generating recognition set of wrongly-recognized words and storage medium
CN110321549B (en) New concept mining method based on sequential learning, relation mining and time sequence analysis
CN104007836A (en) Handwriting input processing method and terminal device
CN112445915A (en) Document map extraction method and device based on machine learning and storage medium
CN111723870A (en) Data set acquisition method, device, equipment and medium based on artificial intelligence
CN107436931B (en) Webpage text extraction method and device
CN113282754A (en) Public opinion detection method, device, equipment and storage medium for news events
CN113850081B (en) Text processing method, device, equipment and medium based on artificial intelligence
CN107506349A (en) A kind of user's negative emotions Forecasting Methodology and system based on network log
CN113761137B (en) Method and device for extracting address information
CN104408036B (en) It is associated with recognition methods and the device of topic

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20170609