CN110032732A - Text punctuation prediction method, apparatus, computer device and storage medium - Google Patents

Text punctuation prediction method, apparatus, computer device and storage medium Download PDF

Info

Publication number
CN110032732A
Authority
CN
China
Prior art keywords
punctuation
text
words
target
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910182506.1A
Other languages
Chinese (zh)
Inventor
王健宗
程宁
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910182506.1A priority Critical patent/CN110032732A/en
Publication of CN110032732A publication Critical patent/CN110032732A/en
Priority to PCT/CN2019/117303 priority patent/WO2020181808A1/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text punctuation prediction method, apparatus, computer device and storage medium, applied to the field of deep learning, for solving the problem that dialogue text obtained by speech recognition has no punctuation. The method includes: obtaining a target text without punctuation; performing word segmentation on the target text to obtain each target word in the target text; performing vectorization on each target word separately to obtain each target vector; inputting each target vector into a network model in sequence, according to the order of each target word in the target text, and obtaining an output result sequence in which each value characterizes the punctuation mark corresponding to one target word; determining the punctuation mark corresponding to each value according to a preset value-punctuation correspondence; and, for each punctuation mark, inserting it at the back position of its corresponding target word in the target text, thereby obtaining the punctuation-predicted dialogue text.

Description

Text punctuation prediction method, apparatus, computer device and storage medium
Technical field
The present invention relates to the field of deep learning, and more particularly to a text punctuation prediction method, apparatus, computer device and storage medium.
Background technique
With the rapid development of society and of high technology, natural language processing applications such as smart home control, automatic question answering and voice assistants are receiving growing attention. However, since spoken dialogue carries no punctuation marks, sentence boundaries and standard language structure cannot be distinguished, which makes punctuation prediction an extremely important natural language processing task. In a smartphone customer-service scenario, what speech recognition produces from a user's speech is raw dialogue text without punctuation, which cannot be used directly; before the dialogue text is used further, punctuation prediction must first be performed on the raw dialogue text so that punctuation can be added to the unpunctuated text.
Therefore, finding a method that can accurately perform punctuation prediction on dialogue text has become a problem urgently to be solved by those skilled in the art.
Summary of the invention
The embodiments of the present invention provide a text punctuation prediction method, apparatus, computer device and storage medium, to solve the problem that dialogue text obtained by speech recognition has no punctuation.
A text punctuation prediction method, comprising:
Obtaining a target text without punctuation;
Performing word segmentation on the target text to obtain each target word in the target text;
Performing vectorization on each target word separately to obtain each target vector corresponding to each target word;
Inputting each target vector into a network model in sequence, according to the order of each target word in the target text, and obtaining a result sequence output in sequence by the network model, wherein each value in the result sequence characterizes the punctuation mark corresponding to one target word, and the network model is composed of a pre-trained LSTM network and a conditional random field;
Determining the punctuation mark corresponding to each value according to a preset value-punctuation correspondence, wherein the value-punctuation correspondence records a one-to-one relationship between values and punctuation marks;
For each of the punctuation marks, inserting the punctuation mark at the back position of the target word corresponding to that punctuation mark in the target text, to obtain the punctuation-predicted dialogue text, wherein the back position refers to the position in the target text that is behind and adjacent to the target word.
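As a minimal sketch, the claimed steps can be strung together as follows; the stub model, the toy vectorization, and the value-punctuation table are illustrative assumptions standing in for the pre-trained LSTM + conditional random field and the preset correspondence:

```python
# Sketch of the claimed pipeline. The real network model (LSTM + CRF) is
# replaced by a stub returning one value per target word; the value-punctuation
# table is illustrative only (value 0 meaning "no punctuation").
VALUE_TO_PUNCT = {0: "", 1: ".", 2: ",", 3: "?"}

def segment(text):
    # Stand-in for a third-party word segmenter; splits on whitespace.
    return text.split()

def stub_model(vectors):
    # Stand-in for the pre-trained LSTM + CRF: one output value per input,
    # pretending a period ends the text.
    return [0] * (len(vectors) - 1) + [1]

def predict_punctuation(text, model=stub_model):
    words = segment(text)                       # step: word segmentation
    vectors = [[float(len(w))] for w in words]  # step: toy vectorization
    result_sequence = model(vectors)            # step: sequential model output
    # step: insert each punctuation mark at the "back position" of its word
    pieces = [w + VALUE_TO_PUNCT[v] for w, v in zip(words, result_sequence)]
    return " ".join(pieces)

print(predict_punctuation("hello I will reply to you tomorrow"))
# → hello I will reply to you tomorrow.
```

The point of the sketch is only the data flow: one value per word out of the model, then insertion immediately behind the corresponding word.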
A text punctuation prediction apparatus, comprising:
A text acquisition module, configured to obtain a target text without punctuation;
A word segmentation module, configured to perform word segmentation on the target text to obtain each target word in the target text;
A word vectorization module, configured to perform vectorization on each target word separately to obtain each target vector corresponding to each target word;
A vector input module, configured to input each target vector into a network model in sequence, according to the order of each target word in the target text, and to obtain a result sequence output in sequence by the network model, wherein each value in the result sequence characterizes the punctuation mark corresponding to one target word, and the network model is composed of a pre-trained LSTM network and a conditional random field;
A punctuation determination module, configured to determine the punctuation mark corresponding to each value according to a preset value-punctuation correspondence, wherein the value-punctuation correspondence records a one-to-one relationship between values and punctuation marks;
A punctuation insertion module, configured to insert, for each of the punctuation marks, the punctuation mark at the back position of the target word corresponding to that punctuation mark in the target text, to obtain the punctuation-predicted dialogue text, wherein the back position refers to the position in the target text that is behind and adjacent to the target word.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the steps of the above text punctuation prediction method when executing the computer program.
A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the above text punctuation prediction method.
In the above text punctuation prediction method, apparatus, computer device and storage medium: first, a target text without punctuation is obtained; then, word segmentation is performed on the target text to obtain each target word in the target text; next, vectorization is performed on each target word separately to obtain each target vector corresponding to each target word; then, according to the order of each target word in the target text, each target vector is input into a network model in sequence, and a result sequence output in sequence by the network model is obtained, wherein each value in the result sequence characterizes the punctuation mark corresponding to one target word, and the network model is composed of a pre-trained LSTM network and a conditional random field; next, the punctuation mark corresponding to each value is determined according to a preset value-punctuation correspondence, which records a one-to-one relationship between values and punctuation marks; finally, for each of the punctuation marks, the punctuation mark is inserted at the back position of the target word corresponding to that punctuation mark in the target text, to obtain the punctuation-predicted dialogue text, wherein the back position refers to the position in the target text that is behind and adjacent to the target word. It can be seen that, by means of a pre-trained LSTM network and a preset conditional random field, the present invention can accurately perform punctuation prediction on the target text and complete the addition of punctuation to unpunctuated text, improving the efficiency of text punctuation prediction and facilitating the direct use of the text in subsequent natural language processing.
Detailed description of the invention
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without any creative effort.
Fig. 1 is a schematic diagram of an application environment of the text punctuation prediction method in an embodiment of the present invention;
Fig. 2 is a flow chart of the text punctuation prediction method in an embodiment of the present invention;
Fig. 3 is a schematic flow chart of step 103 of the text punctuation prediction method under an application scenario in an embodiment of the present invention;
Fig. 4 is a schematic flow chart of training the network model of the text punctuation prediction method under an application scenario in an embodiment of the present invention;
Fig. 5 is a schematic flow chart of step 106 of the text punctuation prediction method under an application scenario in an embodiment of the present invention;
Fig. 6 is a structural schematic diagram of the text punctuation prediction apparatus under an application scenario in an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of the text punctuation prediction apparatus under another application scenario in an embodiment of the present invention;
Fig. 8 is a structural schematic diagram of the punctuation insertion module in an embodiment of the present invention;
Fig. 9 is a schematic diagram of a computer device in an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
The text punctuation prediction method provided by the present application can be applied in an application environment such as that of Fig. 1, in which a client communicates with a server over a network. The client may be, but is not limited to, various personal computers, laptops, smartphones, tablet computers and portable wearable devices. The server may be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in Fig. 2, a text punctuation prediction method is provided. Taking its application to the server in Fig. 1 as an example, the method includes the following steps:
101. Obtain a target text without punctuation;
In this embodiment, the server can obtain a target text without punctuation according to the needs of actual use or of the application scenario. For example, the server can communicate with a client that provides user consultation somewhere: the user inputs a spoken question through the client's microphone, the client uploads the spoken question to the server, and the server converts the speech into text; the resulting text is the target text without punctuation. Alternatively, the server can also perform the task of punctuation recognition on dialogue texts in bulk: a database collects a large number of dialogue texts in advance and transfers them to the server over a network, and the server needs to perform punctuation prediction on each of these dialogue texts, so each of these dialogue texts is a target text awaiting punctuation prediction and without punctuation. It can be understood that the server can obtain these target texts awaiting punctuation prediction in various ways, which will not be elaborated further here.
It should be noted that the text described in this embodiment generally refers to dialogue text, i.e., text content obtained from what a person says through speech-to-text conversion.
102. Perform word segmentation on the target text to obtain each target word in the target text;
It can be understood that, when performing punctuation prediction, the positions where punctuation may appear need to be grasped accurately, and these positions are closely related to the words in the target text; this requires the server to perform word segmentation on the target text to obtain each target word in the target text. For example, if the target text is "hello I will reply to you tomorrow", after segmentation five words can be obtained: "hello", "I", "tomorrow", "reply" and "you"; these five words are the target words.
In particular, when performing word segmentation on the target text, third-party software such as jieba can be used to implement the segmentation and obtain each target word.
In order to reduce interference information in the target text and ensure the accuracy of subsequent segmentation and of recognition by the network model, further, before step 102 the method also includes: deleting specified text in the target text, where the specified text at least includes stop words. It can be understood that the stop words mentioned here can refer to single Chinese characters that are used with extremely high frequency but carry no practical linguistic meaning. Before executing step 102, the server can delete the specified text from the target text. To illustrate, assuming the specified text includes stop words and the target text contains a sentence with such stop words, the server can first delete the stop words without practical meaning, thereby obtaining the cleaned text "I come to work today".
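A minimal sketch of the stop-word deletion described above; the stop-word list here is a hypothetical placeholder, since in practice it would hold high-frequency characters or fillers with no practical linguistic meaning:

```python
# Delete specified text (here: stop words) before segmentation.
# The stop-word list is illustrative, not taken from the patent.
STOP_WORDS = {"uh", "um", "ah"}

def delete_stop_words(words):
    # Keep only words that are not in the stop-word list.
    return [w for w in words if w not in STOP_WORDS]

print(delete_stop_words(["I", "uh", "come", "to", "work", "today"]))
# → ['I', 'come', 'to', 'work', 'today']
```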
103. Perform vectorization on each target word separately to obtain each target vector corresponding to each target word;
After each target word is obtained, in order to facilitate recognition and learning by the subsequent network model, the server needs to perform vectorization on each target word separately, i.e., to represent each word by converting it into a vector, thereby obtaining each target vector corresponding to each target word. Specifically, the server can record each target word in the form of a one-dimensional matrix (a one-dimensional vector).
For ease of understanding, under a concrete application scenario, as shown in Fig. 3, further, step 103 may specifically include:
201. For each of the target words, retrieve whether the target word has been recorded in a preset dictionary; if so, execute step 202; if not, execute steps 203 to 206. The dictionary records the correspondence between words and one-dimensional vectors;
202. Obtain the one-dimensional vector corresponding to the target word;
203. Convert the target word into a first vector by loading the word vectors of a first third-party platform;
204. Convert the target word into a second vector by loading the word vectors of a second third-party platform;
205. Splice the first vector and the second vector to obtain a one-dimensional vector as the one-dimensional vector corresponding to the target word;
206. Record the spliced one-dimensional vector and the corresponding target word into the dictionary.
Regarding the above step 201: when converting each target word into a vector, the server can convert these target words one by one, or can convert multiple target words simultaneously in a multithreaded manner, with each thread performing vector conversion on one target word at a time. Specifically, in the vector conversion process for each target word, the server first retrieves whether the target word has been recorded in the preset dictionary. It should be explained here that, to facilitate the conversion of words into vectors, the server can be provided with a dictionary in advance, which records a one-to-one correspondence between words and one-dimensional vectors. For example, "hello" can be set to correspond to "vector No. 1", "I" to "vector No. 2", "tomorrow" to "vector No. 3", "reply" to "vector No. 4", "you" to "vector No. 5", and so on; by enumerating as many words as possible, the dictionary is refined, so that when each target word in the target text needs to be converted, the server can use the preset dictionary to convert each target word into a one-dimensional vector.
Therefore, if the server detects that the target word has been recorded in the dictionary, it follows that the one-dimensional vector corresponding to the target word has also been recorded in the dictionary; otherwise, no one-dimensional vector corresponding to the target word has been recorded.
Regarding the above step 202: it can be understood that if the retrieval finds that the target word has been recorded in the preset dictionary, then the one-dimensional vector corresponding to the target word has been recorded in the dictionary, and the server can therefore obtain the one-dimensional vector corresponding to the target word from the dictionary.
Regarding the above step 203: it can be understood that if the retrieval finds that the target word has not been recorded in the preset dictionary, then the one-dimensional vector corresponding to the target word has not been recorded in the dictionary. This is because it is often difficult for the server to enumerate all words when presetting the dictionary; even if great cost were spent to record all existing words, since the amount of information in society today increases daily and new words, such as internet slang, are generated almost every day, the preset dictionary will inevitably fail to include certain words. In view of this situation, this embodiment can perform vector conversion of target words at the time of use while supplementing the dictionary with newly encountered words to refine it. Specifically, the server first converts the target word into a first vector by loading the word vectors of a first third-party platform. Since third-party platforms are usually updated in a timely manner, the word vectors loaded from them can generally cover all words currently likely to appear, so the target word can be converted into the first vector.
Regarding the above step 204: in order to increase the accuracy of vector conversion and reduce the error rate, this embodiment also converts the target word into a second vector by loading the word vectors of a second third-party platform. It should be noted that the second third-party platform and the first third-party platform are two different platforms, and the word vectors loaded on each of them are not identical.
Regarding the above step 205: after obtaining the first vector and the second vector, the server can splice the first vector and the second vector to obtain a one-dimensional vector as the one-dimensional vector corresponding to the target word. Specifically, the first vector and the second vector corresponding to the same word can be stitched together one after the other, i.e., the head of the second vector immediately follows the tail of the first vector, so as to obtain a new one-dimensional vector. Since the first vector and the second vector come from the word vectors of two different platforms, they differ from each other; by integrating the conversion rules of the two platforms, this embodiment can reduce the overall error of vector conversion, and it also ensures that each one-dimensional vector has sufficient length, improving the accuracy of subsequent use.
Regarding the above step 206: it can be understood that the spliced one-dimensional vector is a new one-dimensional vector with respect to the preset dictionary. Therefore, in order to refine the dictionary and improve the retrieval hit rate of words when the dictionary is subsequently used, the server can record the spliced one-dimensional vector and the corresponding target word into the dictionary.
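The lookup-or-convert-and-record logic of steps 201-206 can be sketched as follows; the two "platform" embeddings are toy hash-based stand-ins, not real word-vector loaders, and the three-dimensional vectors are an assumption for brevity:

```python
# Steps 201-206 as a sketch: look a word up in the dictionary; on a miss,
# build a first and a second vector (toy stand-ins for two third-party
# word-embedding platforms), splice them head-to-tail, and record the result.
dictionary = {}  # word -> one-dimensional vector

def platform_vector(word, seed, dim=3):
    # Toy stand-in for loading one platform's word vector.
    return [float((hash(word) + seed + i) % 10) for i in range(dim)]

def word_to_vector(word):
    if word in dictionary:                  # step 201/202: dictionary hit
        return dictionary[word]
    first = platform_vector(word, seed=1)   # step 203: first platform
    second = platform_vector(word, seed=2)  # step 204: second platform
    spliced = first + second                # step 205: splice head-to-tail
    dictionary[word] = spliced              # step 206: record into dictionary
    return spliced

v1 = word_to_vector("hello")
v2 = word_to_vector("hello")  # second call hits the dictionary
print(len(v1), v1 == v2)      # → 6 True
```

Splicing the two vectors end to end, as in step 205, is what guarantees the recorded dictionary entry has the combined length of both platform vectors.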
104. According to the order of each target word in the target text, input each target vector into the network model in sequence, and obtain a result sequence output in sequence by the network model, wherein each value in the result sequence characterizes the punctuation mark corresponding to one target word, and the network model is composed of a pre-trained LSTM network and a conditional random field;
After obtaining each target vector corresponding to each target word, the server can input each target vector in sequence into the pre-trained network model according to the order of each target word in the target text, and obtain the result sequence output in sequence by the network model, where each value in the result sequence characterizes the punctuation mark corresponding to one target word. For example, suppose the target text corresponds to five target vectors, vectors No. 1 to No. 5; then, when executing step 104, vector No. 1 is input into the network model first, then vector No. 2, followed by vectors No. 3, No. 4 and No. 5. At the same time, shortly after vector No. 1 is input, the network model outputs the value corresponding to vector No. 1, and subsequently outputs the value corresponding to vector No. 2, then the values corresponding to vectors No. 3, No. 4 and No. 5. The five values output in sequence by the network model thus constitute the result sequence.
It should be noted that the server presets the correspondence between each value and each punctuation mark, which can be set according to actual needs. For example, under one application scenario, the correspondence between values and punctuation marks can be set as shown in Table 1 below:
Table 1

Punctuation:  Space  Full stop  Comma  Question mark
Value:        0      1          2      3
Note that the types of punctuation above can be increased or reduced according to actual needs, and which value corresponds to which punctuation mark can also be set as needed; it need only be guaranteed that the same correspondence is used both when training the network model and when using it.
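Expressed in code, the correspondence of Table 1 and the decoding of a result sequence might look like this; the mapping is the one the table shows, and, as noted above, any other assignment would work as long as training and use share it:

```python
# Value-punctuation correspondence as in Table 1:
# space, full stop, comma, question mark.
VALUE_TO_PUNCT = {0: " ", 1: ".", 2: ",", 3: "?"}

def decode(result_sequence):
    # Map each output value of the network model to its punctuation mark.
    return [VALUE_TO_PUNCT[v] for v in result_sequence]

print(decode([0, 0, 2, 0, 1]))
# → [' ', ' ', ',', ' ', '.']
```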
In this embodiment, the network model consists of two parts: the first part is an LSTM network, and the second part is a conditional random field. It can be understood that, in the punctuation prediction scenario for unpunctuated text, the LSTM network is good at solving long-sequence dependency problems and is suited to processing and predicting critical events with relatively long intervals and delays in a time sequence, so it can better understand the dependencies between the words of unpunctuated text and provide predictions; however, the LSTM network lacks the ability to model output classification information. Therefore, this method abandons the approach of adding a fully connected layer after the LSTM network and instead follows the LSTM network with a conditional random field (CRF, conditional random field algorithm), which can well make up for this defect of the LSTM network, so that the two complement each other and improve the accuracy of punctuation prediction for unpunctuated text.
For ease of understanding, the training process of the network model is described in detail below. As shown in Fig. 4, further, the network model can be trained in advance through the following steps:
301. Collect multiple dialogue texts with punctuation;
302. Separate the punctuation in each collected dialogue text from the text, to obtain each sample text and each punctuation set corresponding to each sample text;
303. For each punctuation set, determine the first value corresponding to each punctuation mark in the punctuation set according to a preset value-punctuation correspondence, and form from these first values the standard sequence corresponding to the punctuation set, wherein the value-punctuation correspondence records a one-to-one relationship between values and punctuation marks;
304. Perform word segmentation on each sample text separately, to obtain each sample word in each sample text;
305. Perform vectorization on each sample word in each sample text separately, to obtain each sample vector corresponding to each sample word;
306. For each sample text, according to the order of each sample word in the sample text, input each sample vector in sequence into the LSTM network in the network model, and obtain each intermediate vector output in sequence by the LSTM network;
307. Input each intermediate vector into the conditional random field in the network model, and obtain a sample sequence output by the conditional random field, wherein each value in the sample sequence characterizes the punctuation mark corresponding to one sample word;
308. Taking the output sample sequence as the adjustment target, adjust the parameters of the LSTM network and the weight coefficients of the conditional random field, so as to minimize the error between the obtained sample sequence and the standard sequence corresponding to the sample text;
309. If the error between the sample sequence and the standard sequence corresponding to each sample text satisfies a preset training termination condition, determine that the network model has been trained.
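The termination test of step 309 above can be sketched as a simple error check; the error metric here, a token-level mismatch rate, and the 5% threshold are illustrative assumptions, since the patent does not fix a particular metric:

```python
# Step 309 sketch: training stops when the error between the output sample
# sequence and the standard sequence meets a preset termination condition.
# The mismatch-rate metric and the 0.05 threshold are assumptions.
def sequence_error(sample_seq, standard_seq):
    mismatches = sum(1 for a, b in zip(sample_seq, standard_seq) if a != b)
    return mismatches / len(standard_seq)

def training_finished(sample_seq, standard_seq, threshold=0.05):
    return sequence_error(sample_seq, standard_seq) <= threshold

print(training_finished([0, 0, 0, 0, 3], [0, 0, 0, 0, 3]))  # → True
print(training_finished([0, 1, 0, 0, 3], [0, 0, 0, 0, 3]))  # → False
```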
Regarding the above step 301: in this embodiment, the staff can collect a large number of dialogue texts under different application scenarios, for example dialogue texts from users asking questions, dialogue texts from customer complaints, dialogue texts from user chats, and so on. When collecting dialogue texts, the server can collect a large number of original dialogue texts through channels such as specialized knowledge bases and network databases. It should be noted that these dialogue texts need to carry punctuation; if a collected original dialogue text has no punctuation, punctuation can be added to it manually.
Regarding the above step 302: during training, the input is dialogue text without punctuation, so the server can separate the punctuation in each collected dialogue text from the text, to obtain each sample text and each punctuation set corresponding to each sample text. For example, if a collected dialogue text is "what products do you have?", after separation the sample text "what products do you have" and the punctuation set "    ?" (four spaces before the question mark) can be obtained.
Regarding the above step 303: it can be understood that, to facilitate the processing of subsequent steps, after separating the punctuation sets from the dialogue texts in step 302, the server can also convert these punctuation sets into standard sequences composed of values. Specifically, each punctuation mark in each punctuation set is converted into a first value according to the value-punctuation correspondence described above, and these first values are then arranged to obtain the standard sequence. For example, for the punctuation set "    ?" above, with reference to the correspondence shown in Table 1, the standard sequence "00003" can be obtained.
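Steps 302-303 can be sketched as follows: strip punctuation from a collected dialogue text, keep one punctuation slot per word (a space meaning "no punctuation"), and map each slot to its first value using the correspondence of Table 1. The token-based representation and the assumption that punctuation always follows a word are illustrative choices:

```python
# Step 302: separate punctuation from a collected dialogue text;
# step 303: convert the punctuation set into a standard sequence of values.
PUNCT_TO_VALUE = {" ": 0, ".": 1, ",": 2, "?": 3}  # correspondence as in Table 1
PUNCTUATION = set(".,?")

def separate(tokens):
    # `tokens` is the dialogue text split into words and punctuation marks.
    sample_words, punct_set = [], []
    for t in tokens:
        if t in PUNCTUATION:
            punct_set[-1] = t       # a mark attaches to the preceding word
        else:
            sample_words.append(t)
            punct_set.append(" ")   # default slot: no punctuation (space)
    return sample_words, punct_set

def standard_sequence(punct_set):
    return [PUNCT_TO_VALUE[p] for p in punct_set]

sample, puncts = separate(["what", "products", "do", "you", "have", "?"])
print(sample)                     # → ['what', 'products', 'do', 'you', 'have']
print(standard_sequence(puncts))  # → [0, 0, 0, 0, 3]
```

With five words and a trailing question mark, the punctuation set has four empty slots followed by "?", matching the standard sequence "00003" of the example above.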
For above-mentioned steps 304, similarly with above-mentioned steps 102, before carrying out network model training, also need to this A little sample texts carry out word segmentation processing.Therefore, server can carry out word segmentation processing to the sample text respectively, obtain each Each sample words in the sample text.For example, sample text is " what product you have ", after participle, Available " you ", " having ", " what ", " product " totally 4 sample words.
In particular, when segmenting a sample text, third-party software such as the jieba ("stammer") segmenter can be used to perform the segmentation and obtain the sample words.
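The segmentation step can be sketched as below. For Chinese the text suggests a third-party segmenter such as jieba; the fallback simply splits on whitespace so the sketch stays self-contained when jieba is not installed:

```python
def segment(text):
    """Segment a sample text into sample words; prefer jieba when present."""
    try:
        import jieba                          # third-party; may be absent
        return [w for w in jieba.lcut(text) if w.strip()]
    except ImportError:
        return text.split()                   # whitespace fallback for the sketch

words = segment("what products do you have")
```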
To reduce interference information in the sample text and guarantee the accuracy of subsequent segmentation and network-model training, further, before step 304 the method further includes: deleting specified text from the sample text, where the specified text includes at least stop words. It can be understood that the stop words referred to here are single Chinese characters with extremely high usage frequency but no practical linguistic meaning, such as "的" and "了". Before executing step 304, the server can delete the specified text from the sample text. To illustrate, assuming the specified text includes stop words and a sample text contains "我的今天来上班", the server can first delete the stop word "的", which has no practical meaning, obtaining the cleaned text "我今天来上班" ("I come to work today").
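Stop-word deletion can be sketched in a few lines. The stop-word list here is illustrative only; the text names high-frequency Chinese characters such as "的" and "了" as examples:

```python
STOP_WORDS = {"的", "了"}   # illustrative stop-word characters

def delete_specified(text):
    """Delete specified text (stop-word characters) before segmentation."""
    return "".join(ch for ch in text if ch not in STOP_WORDS)

print(delete_specified("我的今天来上班"))   # -> "我今天来上班"
```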
For step 305 above, similarly to step 103, after obtaining the sample words, to facilitate recognition and learning by the subsequent network model, the server needs to vectorize each sample word respectively, i.e. represent each word in vector form, obtaining the sample vector corresponding to each sample word. Specifically, the server can record each sample word in the form of a one-dimensional matrix (a one-dimensional vector).
For step 306 above, it can be understood that when training the network model, each sample text among the sample texts is trained on in turn. The server can, according to the order of the sample words in each sample text, sequentially input the sample vectors into the LSTM network in the network model for training, and obtain the intermediate vectors that the LSTM network outputs in turn. For example, suppose a sample text has 4 sample vectors, numbered 1 to 4; then when executing step 306, vector 1 is input into the LSTM network first, then vector 2, followed by vector 3 and vector 4. Correspondingly, soon after vector 1 is input, the LSTM network outputs the intermediate vector corresponding to vector 1, then the intermediate vector corresponding to vector 2, then the intermediate vector corresponding to vector 3, and finally the intermediate vector corresponding to vector 4. It can be understood that, because the LSTM network can retain long- and short-term memory of the text content, the intermediate vectors it outputs contain more textual information than the input sample vectors; this is the basis on which the present invention performs punctuation prediction on unpunctuated text.
Regarding the LSTM network: it overcomes the shortcoming of the traditional RNN (Recurrent Neural Network), which cannot handle long-distance dependencies. An LSTM has three gates: the forget gate, the input gate, and the output gate. The forget gate first determines which information is discarded from the previous cell state; its value ranges from 0 to 1, and the smaller the value, the more information is discarded. The input gate that follows determines how much new information is added to the cell state. The cell state is then updated accordingly, and finally the output gate derives the corresponding output from the current cell state and the new information. For the concrete network structure of the LSTM, refer to existing literature; it is not repeated here.
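The gate mechanics above can be illustrated with a deliberately tiny, scalar LSTM cell. This is only a sketch: real cells are vector-valued, and each gate has its own trained weights, whereas the constants below are made up for readability:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w=0.5, u=0.3):
    z = w * x + u * h_prev
    f = sigmoid(z)                 # forget gate: how much of c_prev to keep
    i = sigmoid(z)                 # input gate: how much new info to admit
    g = math.tanh(z)               # candidate new cell content
    c = f * c_prev + i * g         # updated cell state
    o = sigmoid(z)                 # output gate
    h = o * math.tanh(c)           # hidden output ("intermediate vector")
    return h, c

# Feed a 4-step sequence in order, as in step 306: each step's output
# already carries memory of the earlier steps through the cell state.
h, c = 0.0, 0.0
outputs = []
for x in [1.0, 0.2, -0.5, 0.8]:
    h, c = lstm_step(x, h, c)
    outputs.append(h)
```

Note how the loop threads `h` and `c` from one step to the next; that recurrence is what lets the intermediate vectors encode context beyond the current word.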
For step 307 above, after obtaining the intermediate vectors output by the LSTM network, the server can input each intermediate vector into the conditional random field in the network model respectively, obtaining the sample sequence output by the conditional random field, where each numerical value in the sample sequence respectively characterizes the punctuation mark corresponding to each sample word.
It should be noted that a CRF, i.e. a conditional random field (Conditional Random Fields), is a conditional probability distribution model of one set of output random variables given a set of input random variables; it is a discriminative probabilistic undirected graphical model, discriminative in the sense that it models the conditional probability distribution directly. Therefore, in this embodiment, the CRF can select, based on the intermediate vectors provided by the LSTM network, the most probable sequence among all possible output sequences as the sample sequence. A CRF is usually composed of multiple feature functions, each of which is assigned its own weight coefficient; training the CRF consists of determining these weight coefficients.
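Selecting "the most probable sequence among all possible output sequences" is usually done with Viterbi decoding. The sketch below shows that search over made-up per-step scores (standing in for the LSTM's intermediate vectors) and made-up transition weights; it is an illustration of the decoding idea, not the trained CRF of the patent:

```python
def viterbi(emissions, transitions):
    """emissions: list of {label: score} per step;
    transitions: {(prev_label, label): score}. Returns the best path."""
    labels = list(emissions[0])
    best = {lab: emissions[0][lab] for lab in labels}
    back = []
    for emit in emissions[1:]:
        back.append({})
        new_best = {}
        for lab in labels:
            prev, score = max(
                ((p, best[p] + transitions.get((p, lab), 0.0)) for p in labels),
                key=lambda t: t[1],
            )
            new_best[lab] = score + emit[lab]
            back[-1][lab] = prev              # remember the best predecessor
        best = new_best
    path = [max(best, key=best.get)]
    for pointers in reversed(back):           # walk the backpointers
        path.append(pointers[path[-1]])
    return list(reversed(path))
```

With `emissions = [{"0": 1.0, "3": 0.1}, {"0": 0.2, "3": 0.9}]` and no transition scores, the decoder returns `["0", "3"]`: no punctuation after the first word, a question mark after the second.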
For step 308 above, it can be understood that the process of training the network model in this embodiment is the process of training the LSTM network and the conditional random field, i.e. adjusting the parameters of the LSTM network and the weight coefficients of the conditional random field. To illustrate, suppose that for the sample text "what products do you have", after the sample vectors corresponding to its sample words are sequentially input into the LSTM network followed by the conditional random field, the sample sequence finally output by the conditional random field is [00104], while the standard sequence corresponding to this sample text is [00003]. The server can detect that there is an error between the two, and accordingly adjust the parameters of the LSTM network and the weight coefficients of the conditional random field so that the output of the network model approaches [00003] as closely as possible.
When executing step 308 to adjust the parameters of the LSTM network and the weight coefficients of the conditional random field, the adjustment can also be performed with the existing back-propagation algorithm; this is not expanded on here.
For step 309 above, the server can judge whether the error between the sample sequence and the standard sequence corresponding to each sample text satisfies a preset training termination condition. If it does, the parameters and weight coefficients in the network model have been adjusted into place, and the network model can be determined to have finished training; conversely, if it does not, the network model needs to continue training. The training termination condition can be preset according to the actual usage situation. Specifically, the training termination condition can be set as: if the error between the sample sequence and the standard sequence corresponding to each sample text is in every case less than a specified error value, the preset training termination condition is considered satisfied. Alternatively, it can be set as: execute steps 306 to 308 on the script texts in a validation set, and if the error between the sample sequence output by the network model and the standard sequence falls within a certain range, the preset training termination condition is considered satisfied. The collection of the validation-set script texts is similar to step 301; specifically, after performing the collection of step 301 to obtain a large number of script texts, a certain proportion of the collected script texts can be divided into the training set, and the remaining script texts into the validation set. For example, a random 80% of the collected script texts can be divided into the training-set samples used for subsequent training of the network model, and the other 20% into the validation-set samples used to subsequently verify whether the network model has finished training, namely whether the preset training termination condition is satisfied.
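The 80/20 split described above can be sketched directly; the deterministic seed is only there so the sketch is reproducible:

```python
import random

def split_corpus(script_texts, train_ratio=0.8, seed=42):
    """Randomly divide collected script texts into training and validation sets."""
    texts = list(script_texts)
    random.Random(seed).shuffle(texts)        # deterministic shuffle for the sketch
    cut = int(len(texts) * train_ratio)
    return texts[:cut], texts[cut:]

corpus = [f"script {i}" for i in range(10)]
train_set, val_set = split_corpus(corpus)     # 8 training texts, 2 validation texts
```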
105. Determine, respectively according to a preset numerical-value-to-punctuation correspondence, the punctuation mark corresponding to each numerical value, where the numerical-value-to-punctuation correspondence records a one-to-one relationship between numerical values and punctuation marks;
After obtaining the result sequence output by the network model, the server can determine the punctuation mark corresponding to each numerical value according to the preset numerical-value-to-punctuation correspondence. For example, suppose that after the target vectors corresponding to "hello I will reply to you tomorrow" are input into the network model, the result sequence obtained is [20001]; then, from the correspondence in Table 1 above, the 5 punctuation marks corresponding to this result sequence are ",", space, space, space, "." respectively.
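Decoding the result sequence is the inverse of the mapping used for standard sequences. As before, Table 1 is not reproduced in this excerpt, so the mapping is an assumption reconstructed from the [20001] example:

```python
# Assumed inverse of Table 1: 0 = space, 1 = period, 2 = comma, 3 = question mark.
VALUE_TO_PUNCT = {0: " ", 1: ".", 2: ",", 3: "?"}

def decode_result_sequence(result_sequence):
    """Map each numerical value back to its punctuation mark."""
    return [VALUE_TO_PUNCT[v] for v in result_sequence]

marks = decode_result_sequence([2, 0, 0, 0, 1])
# -> [",", " ", " ", " ", "."]
```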
106. For each punctuation mark among the punctuation marks, insert the punctuation mark into the target text at the position behind the target word corresponding to that punctuation mark, obtaining the punctuation-predicted script text, where the position behind refers to the position in the target text that is located behind and adjacent to that target word.
It can be understood that after determining each punctuation mark, the server inserts these punctuation marks into the corresponding positions of the target text, obtaining the punctuation-predicted script text and completing the punctuation addition to the target text. Continuing the example above, after obtaining the 5 punctuation marks ",", space, space, space, ".", they are added to the target text "hello I will reply to you tomorrow", obtaining the script text "hello, I will reply to you tomorrow."
For ease of understanding, as shown in Figure 5, further, step 106 above can specifically include:
401. Determine the first punctuation mark in the result sequence as the current punctuation mark;
402. Determine the first target word in the target text as the current word;
403. Insert the current punctuation mark into the target text at the position between the current word and the next word, where the next word refers to the word following the current word in the target text;
404. If the current punctuation mark is not the last punctuation mark of the result sequence, determine the punctuation mark following the current punctuation mark in the result sequence as the new current punctuation mark, determine the word following the current word in the target text as the new current word, and return to execute step 403;
405. If the current punctuation mark is the last punctuation mark of the result sequence, determine that the target text is the punctuation-predicted script text.
For step 401 above, continuing the example, the result sequence is [20001] and the first punctuation mark is ",", so "," is determined as the current punctuation mark.
For step 402 above, the target text is "hello I will reply to you tomorrow" (five words in the original Chinese, "你好/我/明天/回复/你", which is why the result sequence has five values), and the first target word is "hello", so "hello" is determined as the current word.
For step 403 above, "," is inserted after "hello", so the target text is updated to "hello, I will reply to you tomorrow". At this point, the next word is "I", which follows "hello".
For step 404 above, the server judges that "," is not the last punctuation mark of the result sequence, so it can determine " " (the space) as the new current punctuation mark, determine "I" as the new current word, and return to step 403. When step 403 is executed, " " is inserted after "I", so the target text remains "hello, I will reply to you tomorrow". Then the server continues and judges that " " is still not the last punctuation mark of the result sequence, so it can determine " " (the 2nd space) as the new current punctuation mark and the following word as the new current word, and so on, until the current punctuation mark is ".", at which point the server judges that "." is the last punctuation mark of the result sequence and therefore executes step 405.
For step 405 above, when the current punctuation mark is ".", all punctuation marks in the result sequence have been added into the target text, and the target text has been updated to "hello, I will reply to you tomorrow." It can be seen that punctuation prediction and addition for the target text are complete, so the server can determine that the target text is the punctuation-predicted script text.
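The loop of steps 401 to 405 above walks the target words and the decoded punctuation marks in lockstep, so it can be sketched compactly as a single pass; the example words below are illustrative English stand-ins for the five segmented words:

```python
def insert_punctuation(target_words, punctuation_marks):
    """Insert each mark right after its corresponding target word
    (steps 401-405). A space mark means "no punctuation here" and
    simply becomes the word separator."""
    pieces = []
    for word, mark in zip(target_words, punctuation_marks):
        pieces.append(word + (mark if mark != " " else ""))
    return " ".join(pieces)

result = insert_punctuation(
    ["hello", "I", "will", "reply", "tomorrow"],
    [",", " ", " ", " ", "."],
)
# -> "hello, I will reply tomorrow."
```

The `zip` is equivalent to the current-punctuation/current-word bookkeeping of steps 401 to 404, and exhausting the sequence corresponds to the termination test of step 405.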
In the embodiment of the present invention, firstly, the target text without punctuation is obtained; then, word segmentation is performed on the target text to obtain each target word in the target text; then, vectorization is performed on each target word respectively to obtain the target vector corresponding to each target word; in addition, according to the order of the target words in the target text, the target vectors are sequentially input into the network model, obtaining the result sequence that the network model outputs in turn, where each numerical value in the result sequence respectively characterizes the punctuation mark corresponding to each target word, and the network model is composed of a pre-trained LSTM network and a conditional random field; next, the punctuation mark corresponding to each numerical value is determined respectively according to the preset numerical-value-to-punctuation correspondence, which records a one-to-one relationship between numerical values and punctuation marks; finally, for each punctuation mark, the punctuation mark is inserted into the target text at the position behind the corresponding target word, obtaining the punctuation-predicted script text, where the position behind refers to the position in the target text located behind and adjacent to the target word. It can be seen that the present invention can accurately perform punctuation prediction on the target text through the pre-trained LSTM network and the preset conditional random field, completing the punctuation addition to unpunctuated text and improving the efficiency of text punctuation prediction, so as to facilitate direct use of the text by subsequent natural language processing.
It should be understood that the sizes of the sequence numbers of the steps in the above embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
In one embodiment, a text punctuation prediction apparatus is provided; this text punctuation prediction apparatus corresponds one-to-one to the text punctuation prediction method in the above embodiment. As shown in Figure 6, the text punctuation prediction apparatus includes a text acquisition module 501, a word segmentation module 502, a word vectorization module 503, a vector input module 504, a punctuation determination module 505 and a punctuation insertion module 506. Each functional module is described in detail as follows:
the text acquisition module 501 is configured to obtain the target text without punctuation;
the word segmentation module 502 is configured to perform word segmentation on the target text to obtain each target word in the target text;
the word vectorization module 503 is configured to perform vectorization on each target word respectively to obtain the target vector corresponding to each target word;
the vector input module 504 is configured to sequentially input the target vectors into the network model according to the order of the target words in the target text, obtaining the result sequence that the network model outputs in turn, where each numerical value in the result sequence respectively characterizes the punctuation mark corresponding to each target word, and the network model is composed of a pre-trained LSTM network and a conditional random field;
the punctuation determination module 505 is configured to determine, respectively according to the preset numerical-value-to-punctuation correspondence, the punctuation mark corresponding to each numerical value, where the numerical-value-to-punctuation correspondence records a one-to-one relationship between numerical values and punctuation marks;
the punctuation insertion module 506 is configured to, for each punctuation mark, insert the punctuation mark into the target text at the position behind the target word corresponding to that punctuation mark, obtaining the punctuation-predicted script text, where the position behind refers to the position in the target text located behind and adjacent to that target word.
As shown in Figure 7, further, the network model can be pre-trained by the following modules:
a script text collection module 507, configured to collect multiple punctuated script texts;
a punctuation-text separation module 508, configured to separate the punctuation in each collected script text from the text, obtaining each sample text and the punctuation set corresponding to each sample text;
a first numerical value determination module 509, configured to, for each punctuation set, determine respectively according to the preset numerical-value-to-punctuation correspondence the first numerical value corresponding to each punctuation mark in the punctuation set, and form with the first numerical values the standard sequence corresponding to the punctuation set, where the numerical-value-to-punctuation correspondence records a one-to-one relationship between numerical values and punctuation marks;
a sample word segmentation module 510, configured to perform word segmentation on each sample text respectively, obtaining the sample words in each sample text;
a sample vectorization module 511, configured to perform vectorization on each sample word in each sample text respectively, obtaining the sample vector corresponding to each sample word;
a sample vector input module 512, configured to, for each sample text, sequentially input the sample vectors into the LSTM network in the network model according to the order of the sample words in that sample text, obtaining the intermediate vectors that the LSTM network outputs in turn;
a random field module 513, configured to input each intermediate vector into the conditional random field in the network model respectively, obtaining the sample sequence output by the conditional random field, where each numerical value in the sample sequence respectively characterizes the punctuation mark corresponding to each sample word;
a parameter-coefficient adjustment module 514, configured to, taking the output sample sequence as the adjustment target, adjust the parameters of the LSTM network and the weight coefficients of the conditional random field so as to minimize the error between the obtained sample sequence and the standard sequence corresponding to each sample text;
a training completion determination module 515, configured to determine that the network model has finished training if the error between the sample sequence and the standard sequence corresponding to each sample text satisfies the preset training termination condition.
As shown in Figure 8, further, the punctuation insertion module 506 may include:
a current punctuation determination unit 5061, configured to determine the first punctuation mark in the result sequence as the current punctuation mark;
a current word determination unit 5062, configured to determine the first target word in the target text as the current word;
an insertion unit 5063, configured to insert the current punctuation mark into the target text at the position between the current word and the next word, where the next word refers to the word following the current word in the target text;
a new punctuation determination unit 5064, configured to, if the current punctuation mark is not the last punctuation mark of the result sequence, determine the punctuation mark following the current punctuation mark in the result sequence as the new current punctuation mark, determine the word following the current word in the target text as the new current word, and return to execute the step of inserting the current punctuation mark into the target text at the position between the current word and the next word;
a prediction completion determination unit 5065, configured to determine that the target text is the punctuation-predicted script text if the current punctuation mark is the last punctuation mark of the result sequence.
Further, the word vectorization module may include:
a word retrieval unit, configured to, for each target word, retrieve whether the target word has been recorded in a preset dictionary, where the dictionary records a correspondence between words and one-dimensional vectors;
a vector acquisition unit, configured to obtain the one-dimensional vector corresponding to the target word if the target word has been recorded in the preset dictionary;
a first word conversion unit, configured to, if the target word has not been recorded in the preset dictionary, convert the target word into a first vector by loading the word vectors of a first third-party platform;
a second word conversion unit, configured to convert the target word into a second vector by loading the word vectors of a second third-party platform;
a vector concatenation unit, configured to concatenate the first vector and the second vector into one one-dimensional vector as the one-dimensional vector corresponding to the target word;
a word recording unit, configured to record the one-dimensional vector obtained by concatenation and the corresponding target word into the dictionary.
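The cooperation of these units can be sketched as a cache-then-concatenate lookup. The two "third-party platforms" are not named in this excerpt, so they are faked below with hash-based toy embeddings; only the control flow mirrors the units above:

```python
DICTIONARY = {}   # preset dictionary: word -> one-dimensional vector

def toy_embedding(word, salt, dim=3):
    """Stand-in for a third-party word-vector lookup (illustrative only)."""
    return [((hash(word + salt) >> (8 * i)) % 100) / 100.0 for i in range(dim)]

def vectorize(word):
    if word in DICTIONARY:                       # word retrieval + vector acquisition
        return DICTIONARY[word]
    first = toy_embedding(word, "platform-1")    # first word conversion unit
    second = toy_embedding(word, "platform-2")   # second word conversion unit
    vec = first + second                         # vector concatenation unit
    DICTIONARY[word] = vec                       # word recording unit
    return vec
```

After the first call for a given word, later calls hit the dictionary cache, which is the point of the recording unit.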
Further, the text punctuation prediction apparatus may also include:
a specified-text deletion module, configured to delete specified text from the target text, where the specified text includes at least stop words.
For the specific limitations of the text punctuation prediction apparatus, refer to the limitations of the text punctuation prediction method above; they are not repeated here. Each module in the above text punctuation prediction apparatus can be realized in whole or in part by software, hardware and combinations thereof. Each of the above modules can be embedded in hardware form in, or be independent of, the processor in a computer device, and can also be stored in software form in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to each of the above modules.
In one embodiment, a computer device is provided; the computer device can be a server, and its internal structure diagram can be as shown in Figure 9. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device is configured to provide computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the running of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is configured to store the data involved in the text punctuation prediction method. The network interface of the computer device is configured to communicate with external terminals through a network connection. The computer program, when executed by the processor, realizes a text punctuation prediction method.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored on the memory and runnable on the processor, where the processor, when executing the computer program, realizes the steps of the text punctuation prediction method in the above embodiment, such as steps 101 to 106 shown in Figure 2; alternatively, the processor, when executing the computer program, realizes the functions of each module/unit of the text punctuation prediction apparatus in the above embodiment, such as the functions of modules 501 to 506 shown in Figure 6. To avoid repetition, they are not described again here.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, where the computer program, when executed by a processor, realizes the steps of the text punctuation prediction method in the above embodiment, such as steps 101 to 106 shown in Figure 2; alternatively, the computer program, when executed by a processor, realizes the functions of each module/unit of the text punctuation prediction apparatus in the above embodiment, such as the functions of modules 501 to 506 shown in Figure 6. To avoid repetition, they are not described again here.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by instructing relevant hardware through a computer program; the computer program can be stored in a non-volatile computer-readable storage medium, and when executed, the computer program may include the processes of the embodiments of each of the above methods. Any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM), etc.
It is apparent to those skilled in the art that, for convenience and conciseness of description, only the division of the above functional units and modules is illustrated as an example; in practical application, the above functions can be allocated to different functional units and modules as needed, i.e. the internal structure of the apparatus is divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments, or replace some of the technical features with equivalents; and these modifications or replacements, which do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention, should all be included within the protection scope of the present invention.

Claims (10)

1. A text punctuation prediction method, characterized by comprising:
obtaining a target text without punctuation;
performing word segmentation on the target text to obtain each target word in the target text;
performing vectorization on each target word respectively to obtain a target vector corresponding to each target word;
sequentially inputting each target vector into a network model according to the order of each target word in the target text, obtaining a result sequence that the network model outputs in turn, where each numerical value in the result sequence respectively characterizes the punctuation mark corresponding to each target word, and the network model is composed of a pre-trained LSTM network and a conditional random field;
determining, respectively according to a preset numerical-value-to-punctuation correspondence, the punctuation mark corresponding to each numerical value, where the numerical-value-to-punctuation correspondence records a one-to-one relationship between numerical values and punctuation marks;
for each punctuation mark, inserting the punctuation mark into the target text at the position behind the target word corresponding to the punctuation mark, obtaining a punctuation-predicted script text, where the position behind refers to the position in the target text located behind and adjacent to the target word.
2. The text punctuation prediction method according to claim 1, characterized in that the network model is pre-trained through the following steps:
collecting multiple punctuated script texts;
separating the punctuation in each collected script text from the text, obtaining each sample text and a punctuation set corresponding to each sample text;
for each punctuation set, determining respectively according to the preset numerical-value-to-punctuation correspondence a first numerical value corresponding to each punctuation mark in the punctuation set, and forming with the first numerical values a standard sequence corresponding to the punctuation set, where the numerical-value-to-punctuation correspondence records a one-to-one relationship between numerical values and punctuation marks;
performing word segmentation on each sample text respectively, obtaining each sample word in each sample text;
performing vectorization on each sample word in each sample text respectively, obtaining a sample vector corresponding to each sample word;
for each sample text, sequentially inputting each sample vector into the LSTM network in the network model according to the order of each sample word in the sample text, obtaining each intermediate vector that the LSTM network outputs in turn;
inputting each intermediate vector into the conditional random field in the network model respectively, obtaining a sample sequence output by the conditional random field, where each numerical value in the sample sequence respectively characterizes the punctuation mark corresponding to each sample word;
taking the output sample sequence as an adjustment target, adjusting the parameters of the LSTM network and the weight coefficients of the conditional random field so as to minimize the error between the obtained sample sequence and the standard sequence corresponding to each sample text;
if the error between the sample sequence and the standard sequence corresponding to each sample text satisfies a preset training termination condition, determining that the network model has finished training.
3. The text punctuation prediction method according to claim 1, wherein inserting each punctuation mark into the target text at the position behind the target word corresponding to that punctuation mark, to obtain the punctuation-predicted script text, comprises:
Determining the first punctuation mark in the result sequence as the current punctuation mark;
Determining the first target word in the target text as the current word;
Inserting the current punctuation mark into the target text at the position between the current word and the next word, where the next word refers to the word following the current word in the target text;
If the current punctuation mark is not the last punctuation mark of the result sequence, determining the punctuation mark following the current punctuation mark in the result sequence as the new current punctuation mark, determining the word following the current word in the target text as the new current word, and returning to the step of inserting the current punctuation mark into the target text at the position between the current word and the next word;
If the current punctuation mark is the last punctuation mark of the result sequence, determining that the target text is the punctuation-predicted script text.
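The loop in claim 3 amounts to interleaving the target words with the predicted punctuation marks. A minimal sketch, assuming the word list and punctuation list have equal length and an empty string stands for "no mark after this word":

```python
def insert_punctuation(words, punctuation):
    """Walk the word list and the predicted punctuation list in step,
    inserting each mark directly after its corresponding word."""
    pieces = []
    for word, mark in zip(words, punctuation):
        pieces.append(word)
        if mark:  # empty string means no punctuation follows this word
            pieces.append(mark)
    # Joined without separators, as for Chinese text; a space-delimited
    # language would join on " " instead.
    return "".join(pieces)
```

For example, `insert_punctuation(["你好", "世界"], ["，", "。"])` yields the punctuated script text `"你好，世界。"`.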
4. The text punctuation prediction method according to claim 1, wherein performing vectorization on each target word to obtain the vector corresponding to each target word comprises:
For each target word, looking up whether the target word is recorded in a preset dictionary, the dictionary recording correspondences between words and one-dimensional vectors;
If the target word is recorded in the preset dictionary, obtaining the one-dimensional vector corresponding to the target word;
If the target word is not recorded in the preset dictionary, converting the target word into a first vector by loading the word vectors of a first third-party platform;
Converting the target word into a second vector by loading the word vectors of a second third-party platform;
Concatenating the first vector and the second vector to obtain a one-dimensional vector as the one-dimensional vector corresponding to the target word;
Recording the concatenated one-dimensional vector and the corresponding target word into the dictionary.
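The dictionary-backed vectorization of claim 4 amounts to a cache in front of two embedding lookups whose results are concatenated. A minimal stdlib sketch, in which `first_platform_vec` and `second_platform_vec` are hypothetical stand-ins for the two third-party word-vector loaders (the claim does not name the platforms):

```python
def first_platform_vec(word):
    # Stand-in for the first third-party word-vector lookup
    # (e.g. a pretrained embedding); here a toy 2-dimensional vector.
    return [float(len(word)), 0.0]

def second_platform_vec(word):
    # Stand-in for the second third-party word-vector lookup.
    return [0.0, float(len(word))]

dictionary = {}  # word -> one-dimensional (flat) vector, as in claim 4

def vectorize(word):
    """Return the cached vector if the word is recorded in the dictionary;
    otherwise build it by concatenating the two platform vectors and
    record the result back into the dictionary."""
    if word in dictionary:
        return dictionary[word]
    vec = first_platform_vec(word) + second_platform_vec(word)  # concatenation
    dictionary[word] = vec
    return vec
```

The design point of the claim is simply memoization: the two (potentially slow) third-party lookups run at most once per word.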
5. The text punctuation prediction method according to any one of claims 1 to 4, further comprising, before performing word segmentation on the target text to obtain the target words in the target text:
Deleting specified text from the target text, the specified text including at least stop words.
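Claim 5's pre-processing step can be sketched as plain string deletion before segmentation. The stop-word set below is illustrative (spoken-Chinese filler words); the claim only requires that stop words be included in the specified text:

```python
STOP_WORDS = {"嗯", "啊", "呃"}  # illustrative filler words, not from the claim

def delete_specified_text(text, specified=STOP_WORDS):
    """Remove every occurrence of the specified strings from the target
    text, as the pre-segmentation step of claim 5."""
    for item in specified:
        text = text.replace(item, "")
    return text
```

The cleaned text is then passed to word segmentation, so the stop words never reach the network model.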
6. A text punctuation prediction apparatus, comprising:
A text acquisition module, configured to acquire a target text without punctuation;
A word segmentation module, configured to perform word segmentation on the target text to obtain the target words in the target text;
A word vectorization module, configured to perform vectorization on each target word to obtain the vector corresponding to each target word;
A vector input module, configured to input each target vector into a network model in the order of the target words in the target text, to obtain the result sequence sequentially output by the network model, each value in the result sequence respectively representing the punctuation mark corresponding to each target word, the network model being composed of a pre-trained LSTM network and a conditional random field;
A punctuation determining module, configured to determine the punctuation mark corresponding to each value according to a preset value-punctuation correspondence, the value-punctuation correspondence recording a one-to-one relationship between values and punctuation marks;
A punctuation insertion module, configured to insert each punctuation mark into the target text at the back position of the target word corresponding to that punctuation mark, to obtain the punctuation-predicted script text, the back position referring to the position in the target text immediately behind and adjacent to the target word.
7. The text punctuation prediction apparatus according to claim 6, wherein the network model is pre-trained by the following modules:
A script text collection module, configured to collect a plurality of punctuated script texts;
A punctuation-text separation module, configured to separate the punctuation in each collected script text, to obtain the sample texts and the punctuation set corresponding to each sample text;
A first value determining module, configured to determine, for each punctuation set according to a preset value-punctuation correspondence, the first value corresponding to each punctuation mark in the punctuation set, and to compose the first values into the standard sequence corresponding to the punctuation set, the value-punctuation correspondence recording a one-to-one relationship between values and punctuation marks;
A sample word segmentation module, configured to perform word segmentation on each sample text, to obtain the sample words in each sample text;
A sample vectorization module, configured to perform vectorization on each sample word in each sample text, to obtain the sample vector corresponding to each sample word;
A sample vector input module, configured to, for each sample text, input each sample vector into the LSTM network of the network model in the order of the sample words in that sample text, to obtain the intermediate vectors sequentially output by the LSTM network;
A random field module, configured to input each intermediate vector into the conditional random field of the network model, to obtain the sample sequence output by the conditional random field, each value in the sample sequence respectively representing the punctuation mark corresponding to each sample word;
A parameter and coefficient adjustment module, configured to take the output sample sequence as the adjustment target and adjust the parameters of the LSTM network and the weight coefficients of the conditional random field, so as to minimize the error between the obtained sample sequence and the standard sequence corresponding to each sample text;
A training completion determining module, configured to determine that training of the network model is complete if the error between the sample sequence and the standard sequence corresponding to each sample text meets a preset training termination condition.
8. The text punctuation prediction apparatus according to claim 6 or 7, wherein the punctuation insertion module comprises:
A current punctuation determining unit, configured to determine the first punctuation mark in the result sequence as the current punctuation mark;
A current word determining unit, configured to determine the first target word in the target text as the current word;
An insertion unit, configured to insert the current punctuation mark into the target text at the position between the current word and the next word, where the next word refers to the word following the current word in the target text;
A new punctuation determining unit, configured to, if the current punctuation mark is not the last punctuation mark of the result sequence, determine the punctuation mark following the current punctuation mark in the result sequence as the new current punctuation mark, determine the word following the current word in the target text as the new current word, and return to execute step 403;
A prediction completion determining unit, configured to determine, if the current punctuation mark is the last punctuation mark of the result sequence, that the target text is the punctuation-predicted script text.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the text punctuation prediction method according to any one of claims 1 to 5.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the text punctuation prediction method according to any one of claims 1 to 5.
CN201910182506.1A 2019-03-12 2019-03-12 A kind of text punctuate prediction technique, device, computer equipment and storage medium Pending CN110032732A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910182506.1A CN110032732A (en) 2019-03-12 2019-03-12 A kind of text punctuate prediction technique, device, computer equipment and storage medium
PCT/CN2019/117303 WO2020181808A1 (en) 2019-03-12 2019-11-12 Text punctuation prediction method and apparatus, and computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910182506.1A CN110032732A (en) 2019-03-12 2019-03-12 A kind of text punctuate prediction technique, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110032732A true CN110032732A (en) 2019-07-19

Family

ID=67235820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910182506.1A Pending CN110032732A (en) 2019-03-12 2019-03-12 A kind of text punctuate prediction technique, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110032732A (en)
WO (1) WO2020181808A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347789B (en) * 2020-11-06 2024-04-12 科大讯飞股份有限公司 Punctuation prediction method, punctuation prediction device, punctuation prediction equipment and storage medium
CN117113941B (en) * 2023-10-23 2024-02-06 新声科技(深圳)有限公司 Punctuation mark recovery method and device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164399A (en) * 2013-02-26 2013-06-19 北京捷通华声语音技术有限公司 Punctuation addition method and device in speech recognition
CN104143331A (en) * 2013-05-24 2014-11-12 腾讯科技(深圳)有限公司 Method and system for adding punctuations
US20170032789A1 (en) * 2015-07-31 2017-02-02 Lenovo (Singapore) Pte. Ltd. Insertion of characters in speech recognition
CN106653030A (en) * 2016-12-02 2017-05-10 北京云知声信息技术有限公司 Punctuation mark adding method and device
CN107221330A (en) * 2017-05-26 2017-09-29 北京搜狗科技发展有限公司 Punctuate adding method and device, the device added for punctuate
CN107767870A (en) * 2017-09-29 2018-03-06 百度在线网络技术(北京)有限公司 Adding method, device and the computer equipment of punctuation mark
CN108932226A (en) * 2018-05-29 2018-12-04 华东师范大学 A kind of pair of method without punctuate text addition punctuation mark

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11250841B2 (en) * 2016-06-10 2022-02-15 Conduent Business Services, Llc Natural language generation, a hybrid sequence-to-sequence approach
CN107247700A (en) * 2017-04-27 2017-10-13 北京捷通华声科技股份有限公司 A kind of method and device for adding text marking
US11593558B2 (en) * 2017-08-31 2023-02-28 Ebay Inc. Deep hybrid neural network for named entity recognition
CN108920446A (en) * 2018-04-25 2018-11-30 华中科技大学鄂州工业技术研究院 A kind of processing method of Engineering document
CN110032732A (en) * 2019-03-12 2019-07-19 平安科技(深圳)有限公司 A kind of text punctuate prediction technique, device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MIN WANG et al.: "Yuan at SemEval-2018 Task 1: Tweets Emotion Intensity Prediction using Ensemble Recurrent Neural Network", 12th International Workshop on Semantic Evaluation, page 205 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020181808A1 (en) * 2019-03-12 2020-09-17 平安科技(深圳)有限公司 Text punctuation prediction method and apparatus, and computer device and storage medium
CN111339750A (en) * 2020-02-24 2020-06-26 网经科技(苏州)有限公司 Spoken language text processing method for removing stop words and predicting sentence boundaries
CN111339750B (en) * 2020-02-24 2023-09-08 网经科技(苏州)有限公司 Spoken language text processing method for removing stop words and predicting sentence boundaries
CN111832564A (en) * 2020-07-20 2020-10-27 浙江诺诺网络科技有限公司 Image character recognition method and system, electronic equipment and storage medium
WO2021213155A1 (en) * 2020-11-25 2021-10-28 平安科技(深圳)有限公司 Method, apparatus, medium, and electronic device for adding punctuation to text
CN112685996A (en) * 2020-12-23 2021-04-20 北京有竹居网络技术有限公司 Text punctuation prediction method and device, readable medium and electronic equipment
CN112685996B (en) * 2020-12-23 2024-03-22 北京有竹居网络技术有限公司 Text punctuation prediction method and device, readable medium and electronic equipment
CN112735384A (en) * 2020-12-28 2021-04-30 科大讯飞股份有限公司 Turning point detection method, device and equipment applied to speaker separation
CN113780449A (en) * 2021-09-16 2021-12-10 平安科技(深圳)有限公司 Text similarity calculation method and device, storage medium and computer equipment
CN113780449B (en) * 2021-09-16 2023-08-25 平安科技(深圳)有限公司 Text similarity calculation method and device, storage medium and computer equipment
CN114049885A (en) * 2022-01-12 2022-02-15 阿里巴巴达摩院(杭州)科技有限公司 Punctuation mark recognition model construction method and punctuation mark recognition model construction device
CN114049885B (en) * 2022-01-12 2022-04-22 阿里巴巴达摩院(杭州)科技有限公司 Punctuation mark recognition model construction method and punctuation mark recognition model construction device

Also Published As

Publication number Publication date
WO2020181808A1 (en) 2020-09-17

Similar Documents

Publication Publication Date Title
CN110032732A (en) A kind of text punctuate prediction technique, device, computer equipment and storage medium
CN109614627A (en) A kind of text punctuate prediction technique, device, computer equipment and storage medium
CN112492111B (en) Intelligent voice outbound method, device, computer equipment and storage medium
WO2022095380A1 (en) Ai-based virtual interaction model generation method and apparatus, computer device and storage medium
CN111212190B (en) Conversation management method, device and system based on conversation strategy management
CN110147445A (en) Intension recognizing method, device, equipment and storage medium based on text classification
CN111182162B (en) Telephone quality inspection method, device, equipment and storage medium based on artificial intelligence
CN109447105A (en) Contract audit method, apparatus, computer equipment and storage medium
CN110032623B (en) Method and device for matching question of user with title of knowledge point
CN109857846B (en) Method and device for matching user question and knowledge point
CN109815489A (en) Collection information generating method, device, computer equipment and storage medium
CN110010121B (en) Method, device, computer equipment and storage medium for verifying answering technique
CN110704571A (en) Court trial auxiliary processing method, trial auxiliary processing device, equipment and medium
CN109800879A (en) Construction of knowledge base method and apparatus
CN104538035A (en) Speaker recognition method and system based on Fisher supervectors
CN109886554A (en) Unlawful practice method of discrimination, device, computer equipment and storage medium
CN109800309A (en) Classroom Discourse genre classification methods and device
CN111651572A (en) Multi-domain task type dialogue system, method and terminal
CN112235470B (en) Incoming call client follow-up method, device and equipment based on voice recognition
CN105677636A (en) Information processing method and device for intelligent question-answering system
CN110059174B (en) Query guiding method and device
CN116150651A (en) AI-based depth synthesis detection method and system
CN112417852B (en) Method and device for judging importance of code segment
CN110782221A (en) Intelligent interview evaluation system and method
CN112667792B (en) Man-machine dialogue data processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination