CN110147545A - Structured output method and system for text, storage medium, and computer device - Google Patents
- Publication number
- CN110147545A (application CN201811089125.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- word
- label
- content
- default label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The present invention discloses a structured output method for text. The method includes: recognizing the text content in a picture; segmenting the text content into words according to a word segmentation model; converting the words into word vectors according to a word vector model; obtaining an association probability matrix between the word vectors and preset labels according to the word vectors and a deep semantic model; and outputting the text content as structured content according to a probabilistic model and the association probability matrix of the preset labels. In the structured output method of the embodiments of the present invention, the word segmentation model segments the text into individual words; the word vector model converts the words into word vectors, which are input into the deep semantic model to obtain the association probability matrix of the preset labels; and the structured content is then output according to the probabilistic model and the association probability matrix. Because the output is produced from the text itself, the method is independent of format, and structured content can be output accurately even for text with a complex format or with no format at all. The invention also discloses a structured output system for text, a non-volatile computer-readable storage medium, and a computer device.
Description
Technical field
The present invention relates to the field of text recognition technology, and in particular to a structured output method for text, a structured output system for text, a non-volatile computer-readable storage medium, and a computer device.
Background technique
At present, most structured output methods for text obtain the structured content of a recognized text or picture by registering it against a reference text or a template. Text with a complex format, or with no format at all, is difficult to register accurately, which reduces the accuracy of the output structured content.
Summary of the invention
Embodiments of the present invention provide a structured output method for text, a structured output system for text, a non-volatile computer-readable storage medium, and a computer device.
The structured output method for text of the embodiments of the present invention includes:
recognizing the text content in a picture;
segmenting the text content into a plurality of words according to a word segmentation model;
converting the words into word vectors according to a word vector model;
obtaining an association probability matrix between the word vectors and preset labels according to the word vectors and a deep semantic model; and
outputting the text content as structured content according to a preset probabilistic model and the association probability matrix of the preset labels.
In the structured output method for text of the embodiments of the present invention, the word segmentation model segments the text into individual words; the word vector model converts the words into word vectors, which are input into the deep semantic model to obtain the association probability matrix of the preset labels; and the structured content is then output according to the preset probabilistic model and the association probability matrix of the preset labels. Because the output is produced from the text itself, the method is independent of format, and structured content can be output accurately even for text with a complex format or with no format.
The structured output system for text of the embodiments of the present invention includes a recognition module, a word segmentation module, a conversion module, an obtaining module, and an output module. The recognition module is configured to recognize the text content in a picture; the word segmentation module is configured to segment the text content into words according to a word segmentation model; the conversion module is configured to convert the words into word vectors according to a word vector model; the obtaining module is configured to obtain the association probability matrix between the word vectors and preset labels according to the word vectors and a deep semantic model; and the output module is configured to output the text content as structured content according to a preset probabilistic model and the association probability matrix of the preset labels.
One or more non-volatile computer-readable storage media of the embodiments of the present invention contain computer-executable instructions that, when executed by one or more processors, cause the processors to perform the structured output method for text described above.
The computer device of the embodiments of the present invention includes a memory and a processor. Computer-readable instructions are stored in the memory, and when the instructions are executed by the processor, they cause the processor to perform the structured output method for text described above.
In the structured output method for text, the structured output system for text, the non-volatile computer-readable storage medium, and the computer device of the embodiments of the present invention, the word segmentation model segments the text into individual words; the word vector model converts the words into word vectors, which are input into the deep semantic model to obtain the association probability matrix of the preset labels; and the structured content is then output according to the preset probabilistic model and the association probability matrix of the preset labels. Because the output is produced from the text itself, the method is independent of format, and structured content can be output accurately even for text with a complex format or with no format.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the invention.
Detailed description of the invention
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is the flow diagram of the structuring output method of the text of certain embodiments of the present invention;
Fig. 2 is the module diagram of the structuring output system of the text of certain embodiments of the present invention;
Fig. 3 is the flow diagram of the structuring output method of the text of certain embodiments of the present invention;
Fig. 4 is the module diagram of the structuring output system of the text of certain embodiments of the present invention;
Fig. 5 is the flow diagram of the structuring output method of the text of certain embodiments of the present invention;
Fig. 6 is the module diagram of the structuring output system of the text of certain embodiments of the present invention;
Fig. 7 is the flow diagram of the structuring output method of the text of certain embodiments of the present invention;
Fig. 8 is the module diagram of the structuring output system of the text of certain embodiments of the present invention;
Fig. 9 is the flow diagram of the structuring output method of the text of certain embodiments of the present invention;
Figure 10 is the schematic illustration of the structuring output method of the text of certain embodiments of the present invention;
Figure 11 is the schematic illustration of the structuring output method of the text of certain embodiments of the present invention;
Figure 12 is the schematic illustration of the structuring output method of the text of certain embodiments of the present invention;
Figure 13 is the schematic diagram of the computer-readable storage medium of certain embodiments of the present invention; and
Figure 14 is the schematic diagram of the computer equipment of certain embodiments of the present invention.
Specific embodiment
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the accompanying drawings are exemplary; they are intended only to explain the present invention and are not to be construed as limiting it.
In the description of the present invention, it is to be understood that the terms "first" and "second" are used for descriptive purposes only and cannot be interpreted as indicating or implying relative importance, or as implicitly indicating the number of the technical features referred to. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present invention, "plurality" means two or more, unless otherwise clearly and specifically defined.
In the description of the present invention, it should be noted that, unless otherwise clearly specified and limited, the terms "installed", "connected to", and "connected" shall be understood in a broad sense. For example, a connection may be a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection, an electrical connection, or mutual communication; it may be a direct connection or an indirect connection through an intermediary; and it may be an internal connection between two elements or an interaction between two elements. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific circumstances.
The following disclosure provides many different embodiments or examples for realizing different structures of the present invention. To simplify the disclosure of the present invention, the components and arrangements of specific examples are described below. Of course, they are merely examples and are not intended to limit the present invention. In addition, the present invention may repeat reference numerals and/or reference letters in different examples; this repetition is for the purposes of simplicity and clarity, and does not in itself indicate a relationship between the various embodiments and/or arrangements discussed. Furthermore, the present invention provides examples of various specific processes and materials, but those of ordinary skill in the art will recognize the applicability of other processes and/or the use of other materials.
Referring to Fig. 1, in some embodiments, the structured output method for text of the embodiments of the present invention includes:
011: recognizing the text content in a picture;
012: segmenting the text content into a plurality of words according to a word segmentation model;
014: converting the words into word vectors according to a word vector model;
016: obtaining an association probability matrix between the word vectors and preset labels according to the word vectors and a deep semantic model; and
018: outputting the text content as structured content according to a preset probabilistic model and the association probability matrix of the preset labels.
Referring to Fig. 2, the structured output system 100 for text of the embodiments of the present invention includes a recognition module 11, a word segmentation module 12, a conversion module 14, an obtaining module 16, and an output module 18. The recognition module 11 is configured to recognize the text content in a picture; the word segmentation module 12 is configured to segment the text content into a plurality of words according to a word segmentation model; the conversion module 14 is configured to convert the words into word vectors according to a word vector model; the obtaining module 16 is configured to obtain the association probability matrix between the word vectors and preset labels according to the word vectors and a deep semantic model; and the output module 18 is configured to output the text content as structured content according to a preset probabilistic model and the association probability matrix of the preset labels.
In other words, step 011 can be implemented by the recognition module 11, step 012 by the word segmentation module 12, step 014 by the conversion module 14, step 016 by the obtaining module 16, and step 018 by the output module 18.
Specifically, the text is first segmented into individual words according to the word segmentation model. The text may be text obtained by optical character recognition (Optical Character Recognition, OCR), that is, OCR text, or it may be plain text; any text can be processed, so the range of application is wide. The length of the text is not limited, and at least one word is obtained after segmentation by the word segmentation model. The word segmentation model may be an N-gram model, a mature model for word segmentation that infers the n-th word from the preceding n-1 words, making the segmentation of the text more accurate. After segmentation, the resulting words are input into the word vector model, which converts each word into a word vector. For example, each word is represented by a string of binary characters so that a computer device can process it. The word vector model may be a Skip-gram model or a continuous bag-of-words (CBOW) model, or both models may be used together to perform the conversion. Word vectors and words correspond one to one.
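The segmentation step above is not given in code in this description. As a hedged sketch, the following Python uses forward maximum matching, a simpler dictionary-based stand-in for the N-gram segmentation model named above (an N-gram model would score candidate splits rather than greedily matching); the romanized vocabulary is invented for illustration only.

```python
def max_match(text, vocab):
    """Greedily match the longest dictionary word at each position;
    fall back to a single character when nothing matches."""
    longest = max(map(len, vocab))
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + longest), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Illustrative romanization of the five-word example in the description.
vocab = {"liming", "youtu", "yanfa", "zhongxin", "gongchengshi"}
print(max_match("limingyoutuyanfazhongxingongchengshi", vocab))
# -> ['liming', 'youtu', 'yanfa', 'zhongxin', 'gongchengshi']
```

The fallback to a single character mirrors the guarantee above that at least one word is always obtained.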
The word vectors are then input into the deep semantic model, which may be a bidirectional long short-term memory model. The deep semantic model computes the association probability of each word vector (that is, each word) with each preset label, where an association probability is the probability that a word belongs to a given preset label. The preset labels comprise three types: beginning labels, intermediate labels, and end labels. For example, suppose the text to be output in structured form is "Li Ming, engineer at Youtu R&D Center", which the word segmentation model has split into five words: "Li Ming", "Youtu", "R&D", "Center", "engineer". The word vector model converts the five words into word vectors, which are input into the deep semantic model. The model can then compute, for the word "Li Ming", the association probability that it is the beginning label of a name, the association probability that it is the intermediate label of a name, the probability that it is the end label of a name, the association probability that it is the beginning label of a company name, and so on, yielding the association probabilities of "Li Ming" with all preset labels. Similarly, the association probabilities of "Youtu", "R&D", "Center", and "engineer" with all preset labels are obtained. The association probabilities of all words with all preset labels then form the association probability matrix of the preset labels.
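The data structure produced by this step can be sketched as follows. This is a hedged illustration, not the patent's model: the scoring function is a deterministic placeholder standing in for the bidirectional LSTM, and the label names (B-/I-/E- for beginning/intermediate/end) are assumed for the example.

```python
import math

# Assumed label inventory for the example: beginning/intermediate/end
# labels for a name (NAME) and a position (OFF, as in Fig. 10).
LABELS = ["B-NAME", "I-NAME", "E-NAME", "B-OFF", "I-OFF", "E-OFF"]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def association_matrix(words, score_fn):
    """Row n = association probabilities of word n over all preset labels;
    each row is normalized so it sums to 1."""
    return [softmax([score_fn(w, lab) for lab in LABELS]) for w in words]

def toy_score(word, label):
    """Deterministic pseudo-scores in place of the real deep semantic model."""
    return sum(map(ord, word + label)) % 13 / 4.0

words = ["Li Ming", "Youtu", "R&D", "Center", "engineer"]
C = association_matrix(words, toy_score)  # 5 rows x 6 labels
```

The real model would produce the scores from the forward and reverse LSTM states; only the shape of the matrix (one row per word, one column per preset label) is the point here.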
Finally, the preset label to which each word belongs is determined according to the association probability matrix of the preset labels and the probabilistic model. Once the preset label of each word has been determined, the text is output as structured content, where each piece of structured content includes any one or more of a beginning label, an intermediate label, and an end label, and each word belongs to exactly one preset label. For example, in the example above it is finally determined that "Li Ming" belongs to the beginning label of a name, "Youtu" belongs to the beginning label of a position, "R&D" belongs to the intermediate label of a position, "Center" belongs to the intermediate label of a position, and "engineer" belongs to the end label of a position. The final output is two pieces of structured content: "Name: Li Ming" and "Position: engineer at Youtu R&D Center". The structured content "Name: Li Ming" includes only the beginning label of a name (corresponding to the word "Li Ming"). The structured content "Position: engineer at Youtu R&D Center" includes the beginning label of a position (corresponding to the word "Youtu"), the intermediate label of a position ("R&D" and "Center"), and the end label of a position (corresponding to "engineer"); here the intermediate label of a position corresponds to two words. That is, a piece of structured content may include three or more words but contains at most three labels, so one preset label may correspond to several words while each word belongs to only one preset label. In other words, the relationship between preset labels and words is one-to-many, which ensures that every word has a corresponding preset label. The probabilistic model may be any one of a conditional random field (conditional random fields, CRF) model, a Hidden Markov Model (Hidden Markov Model, HMM), or a model based on deep learning.
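When the probabilistic model is a CRF, decoding it over the association probability matrix amounts to finding the best label path. The Viterbi sketch below is a hedged, simplified illustration with hand-set transition scores and invented numbers (no learned parameters); it shows how such a model can override a locally most probable label, as in the "Li Ming Clothing Co., Ltd." correction discussed later in this description.

```python
import math

def viterbi(emissions, labels, trans):
    """emissions[t][lab]: per-word association probability for each label;
    trans[(prev, cur)]: transition score (missing pairs are forbidden)."""
    FORBIDDEN = -1e9
    V = [{lab: math.log(emissions[0][lab]) for lab in labels}]
    back = []
    for t in range(1, len(emissions)):
        V.append({})
        back.append({})
        for cur in labels:
            best = max(labels, key=lambda p: V[t - 1][p] + trans.get((p, cur), FORBIDDEN))
            back[t - 1][cur] = best
            V[t][cur] = (V[t - 1][best] + trans.get((best, cur), FORBIDDEN)
                         + math.log(emissions[t][cur]))
    last = max(labels, key=lambda lab: V[-1][lab])
    path = [last]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

# "Li Ming" locally favors B-NAME (0.8), but a following word that is clearly
# inside a company name forces the path through B-ORG. Numbers are illustrative.
labels = ["B-NAME", "B-ORG", "I-ORG"]
emissions = [
    {"B-NAME": 0.8, "B-ORG": 0.1, "I-ORG": 0.1},    # "Li Ming"
    {"B-NAME": 0.01, "B-ORG": 0.04, "I-ORG": 0.95}, # "Clothing Co., Ltd."
]
trans = {("B-ORG", "I-ORG"): 0.0}  # only B-ORG may be followed by I-ORG
print(viterbi(emissions, labels, trans))  # -> ['B-ORG', 'I-ORG']
```

The forbidden transitions play the role of the context constraints the description attributes to the probabilistic model: the highest-emission label for a single word is discarded when no valid path passes through it.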
In the structured output method for text of the embodiments of the present invention, the word segmentation model segments the text into individual words; the word vector model converts the words into word vectors, which are input into the deep semantic model to obtain the association probability matrix of the preset labels; and the structured content is then output according to the preset probabilistic model and the association probability matrix of the preset labels. Because the output is produced from the text itself, the method is independent of format, and structured content can be output accurately even for text with a complex format or with no format. In addition, because the output is produced from the text, no complicated registration algorithm is needed, which improves the performance of detection for complex text and gives a good user experience.
Referring to Fig. 3, in some embodiments, the structured output method further includes:
013: determining the industry to which the text content belongs according to the plurality of words; and
015: determining the preset labels according to the industry.
Referring to Fig. 4, in some embodiments, the structured output system 100 further includes a first determination module 13 and a second determination module 15. The first determination module 13 is configured to determine the industry to which the text content belongs according to the plurality of words. The second determination module 15 is configured to determine the preset labels according to the industry.
In other words, step 013 can be implemented by the first determination module 13, and step 015 can be implemented by the second determination module 15.
Specifically, the industry may include any one of the delivery industry, the banking industry, the retail industry, or the education industry. For example, the industry may include the delivery industry; or the banking industry; or the retail industry; or both the delivery and banking industries; or the delivery, banking, and retail industries; or the delivery, banking, retail, and education industries. The industry may also include many other industries, which are not restricted here.
Different industries correspond to different preset labels, and the structured output system 100 may include the preset labels of one industry or of several different industries, selected according to the application scenario. Because the labels used in different industries generally differ considerably, the industry to which the text content belongs can easily be determined from the words corresponding to the labels. For example, the delivery industry generally has industry-specific preset labels such as a postcode label and a freight label; the banking industry generally has industry-specific preset labels such as a deposit amount label, a deposit time label, and a bank name label; the retail industry generally has industry-specific preset labels such as a commodity amount label and a product name label; and the education industry generally has industry-specific preset labels such as a student number label and a grade label. When determining the preset label corresponding to a word, labels common to nearly all industries, such as name labels and address labels, need not be matched (it is not necessary to match against all preset labels); only the labels with industry characteristics need to be matched against the words. For example, if the text content contains words corresponding to a postcode label, a freight label, and the like, the industry of the text content can be determined to be the delivery industry. If it contains words corresponding to labels such as deposit amount and bank, the industry can be determined to be the banking industry. If it contains words corresponding to a commodity amount label, a product name label, and the like, the industry can be determined to be the retail industry. If it contains words corresponding to a student number label, a grade label, and the like, the industry can be determined to be the education industry. Of course, the industries are not limited to the examples above. In this way, the industry of the text content can be determined from the words, the preset labels corresponding to that industry can be determined once the industry is known, and the words are then matched only against the preset labels of the corresponding industry to obtain the association probabilities, instead of against the preset labels of all industries in the structured output system 100. This reduces the amount of computation and improves output efficiency.
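The industry-determination step described above can be sketched as follows: match only industry-specific label keywords and ignore labels common to all industries, as the description suggests. This is a hedged sketch; the label sets are paraphrased from the examples above, and all identifiers are illustrative, not the patent's exact scheme.

```python
# Industry-specific label sets paraphrased from the description; common labels
# such as name or address are deliberately excluded from matching.
INDUSTRY_LABELS = {
    "delivery":  {"postcode", "freight"},
    "banking":   {"deposit_amount", "deposit_time", "bank_name"},
    "retail":    {"commodity_amount", "product_name"},
    "education": {"student_number", "grade"},
}

def detect_industry(found_labels):
    """Pick the industry whose characteristic labels overlap most with the
    labels found in the text content; None when nothing matches."""
    best, hits = None, 0
    for industry, labels in INDUSTRY_LABELS.items():
        n = len(labels & set(found_labels))
        if n > hits:
            best, hits = industry, n
    return best

print(detect_industry({"postcode", "freight"}))  # -> delivery
print(detect_industry({"deposit_amount"}))       # -> banking
```

Once an industry is selected, only its label set would be carried into the association-probability step, which is the computation saving the description claims.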
Referring to Fig. 5, in some embodiments, step 016 includes:
0162: inputting the word vectors into the deep semantic model in forward order and in reverse order, and outputting a forward-order output result and a reverse-order output result respectively; and
0164: determining the association probabilities between the word vectors and the preset labels according to the forward-order output result and the reverse-order output result, and generating the association probability matrix of the preset labels.
Referring to Fig. 6, in some embodiments, the obtaining module 16 includes a processing unit 162 and a first determination unit 164. The processing unit 162 is configured to input the word vectors into the deep semantic model in forward order and in reverse order, and to output a forward-order output result and a reverse-order output result respectively. The first determination unit 164 is configured to determine the association probabilities between the word vectors and the preset labels according to the forward-order output result and the reverse-order output result, and to generate the association probability matrix of the preset labels.
In other words, step 0162 can be implemented by the processing unit 162, and step 0164 can be implemented by the first determination unit 164.
Specifically, after the text has been segmented into individual words and the words have been converted into word vectors, the word vectors are input one by one into the deep semantic model in forward order (for example, the order in which a user normally reads, such as left to right) to obtain the forward-order output result. At the same time, the word vectors are input one by one into the deep semantic model in reverse order (that is, the order opposite to the forward order) to obtain the reverse-order output result. The association probability of each word vector (that is, each word) with the preset labels is then obtained from the forward-order and reverse-order output results. Because the association probability integrates both results, it takes into account the context of each word in the whole text (that is, the words before and after it), so the association probability obtained is more accurate. The association probability matrix of the preset labels is then generated from the association probabilities of all words with all preset labels.
Referring to Fig. 7, in some embodiments, step 018 includes:
0182: determining the preset label to which each word in the text content belongs according to the association probability matrix of the preset labels and the probabilistic model;
0184: determining the positions, within the structured content, of words that belong to the same preset label according to the positions of the words in the text content; and
0186: outputting the structured content according to the words, the preset labels to which the words belong, and the positions of the words in the structured content.
Referring to Fig. 8, in some embodiments, the output module includes a second determination unit 182, a third determination unit 184, and an output unit 186. The second determination unit 182 is configured to determine the preset label to which each word in the text content belongs according to the association probability matrix of the preset labels and the probabilistic model. The third determination unit 184 is configured to determine the positions, within the structured content, of words that belong to the same preset label according to the positions of the words in the text content. The output unit 186 is configured to output the structured content according to the words, the preset labels to which the words belong, and the positions of the words in the structured content.
In other words, step 0182 can be implemented by the second determination unit 182, step 0184 by the third determination unit 184, and step 0186 by the output unit 186.
Specifically, the preset label to which each word belongs is first determined according to the association probability matrix of the preset labels and the probabilistic model. After the preset labels have been determined, several words in the text may share the same preset label. For example, the text contains "engineer at Youtu R&D Center", segmented as "Youtu", "R&D", "Center", "engineer", and it is finally determined that "Youtu" is the beginning label of a position, "R&D" is the intermediate label of a position, "Center" is also the intermediate label of a position, and "engineer" is the end label of a position; thus two words, "R&D" and "Center", carry the intermediate label of a position. Because a text usually follows normal word order, the relative positions of the words in the text have reference value, so the positions, within the structured content, of words with the same preset label can be determined from the positions of the words in the text. For example, in the text above, "R&D" comes before "Center", so "R&D" also comes before "Center" in the structured content. The structured content is then finally output according to the words, the preset labels to which the words belong, and the positions of the words in the structured content: "Position: engineer at Youtu R&D Center".
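Steps 0184 and 0186 can be sketched as follows: walk the tagged words in their original text order and group consecutive words of the same type into one piece of structured content, so that order within a field is inherited from the text. This is a hedged sketch; the label names and the grouping rule are illustrative assumptions, not the patent's exact scheme.

```python
def assemble(tagged):
    """tagged: (word, label) pairs in text order; labels use a
    beginning/intermediate/end prefix such as B-OFF, I-OFF, E-OFF."""
    fields = []
    for word, label in tagged:
        prefix, _, kind = label.partition("-")
        # start a new field on a beginning label or when the type changes
        if prefix == "B" or not fields or fields[-1][0] != kind:
            fields.append((kind, [word]))
        else:
            fields[-1][1].append(word)
    return [(kind, " ".join(words)) for kind, words in fields]

tagged = [("Li Ming", "B-NAME"), ("Youtu", "B-OFF"), ("R&D", "I-OFF"),
          ("Center", "I-OFF"), ("engineer", "E-OFF")]
print(assemble(tagged))
# -> [('NAME', 'Li Ming'), ('OFF', 'Youtu R&D Center engineer')]
```

Because the input pairs keep text order, "R&D" lands before "Center" in the output field without any extra bookkeeping, which is exactly the reference value of word positions described above.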
Referring to Fig. 9, in some embodiments, step 0182 includes:
01822: determining the preset label to which each word belongs according to the association probability matrix of the preset labels and the words before and after each word in the text content.
Referring again to Fig. 8, in some embodiments, the second determination unit 182 is further configured to determine the preset label to which each word belongs according to the association probability matrix of the preset labels and the words before and after each word in the text content.
In other words, step 01822 can be implemented by the second determination unit 182.
Specifically, in the association probability matrix of the preset labels, each word has a preset label with the highest probability, but this most probable preset label is not necessarily the correct one, because a word does not exist in isolation: only a word that fits its context and semantics is correct. The probabilistic model therefore uses the context of the text, that is, the words before and after each word, to further correct the preset label to which each word finally belongs. For example, considered on its own, the word "Li Ming" may have a probability as high as 0.8 of belonging to the beginning label of a name, and a probability of only 0.1 of belonging to the beginning label of a company name. However, in the text "Li Ming Clothing Co., Ltd.", the word "Li Ming" is the prefix of a company name; that is, "Li Ming" should belong to the beginning label of a company name. The probabilistic model uses exactly this textual context to further correct the preset label to which each word finally belongs, so that the preset label finally determined for each word is more accurate.
Referring to Fig. 10, in one example, the text to be structured is "Li Ming, Engineer, Youtu R&D Center", which the word segmentation model first divides into five words: "Li Ming", "Youtu", "R&D", "Center", and "Engineer". The word vector model then converts these words into word vectors, denoted x1, x2, x3, x4, and x5 respectively. The word vectors are then input into the deep semantic model. In forward order, x1, x2, x3, x4, x5 are input: x1 yields f1; f1 and x2 yield f2; f2 and x3 yield f3; f3 and x4 yield f4; and f4 and x5 yield f5. Because f1 affects f2, f2 affects f3, f3 affects f4, and f4 affects f5, the n-th forward output result is determined by the previous n-1 forward output results and the currently input word vector, finally giving the forward output results f1 to f5. In reverse order, x5, x4, x3, x2, x1 are input: x5 yields b5; b5 and x4 yield b4; b4 and x3 yield b3; b3 and x2 yield b2; and b2 and x1 yield b1. Because b5 affects b4, b4 affects b3, b3 affects b2, and b2 affects b1, the n-th reverse output result is determined by the previous n-1 reverse output results and the currently input word vector, finally giving the reverse output results b1 to b5. Then, from f1 to f5 and b1 to b5, the association probability of each word vector (i.e., each word) with every preset label is obtained to generate the association probability matrix of the preset labels, C = (c(n,m)), where c(n,m) is the association probability of the n-th word with the m-th preset label. As shown in Fig. 10, c1 denotes the first column of the matrix, c2 the second column, c3 the third column, c4 the fourth column, c5 the fifth column, and so on. Then, according to the probabilistic model, the words before and after each word in the text are considered to correct the preset label each word ultimately belongs to; in other words, an optimal path is found in the matrix. For example, c(1,1) is the probability that "Li Ming" belongs to the begin label of the name label, which is 0.8; c(1,2) is the probability that "Li Ming" belongs to the intermediate label of the name label, which is 0.1; c(2,1) is the probability that "Youtu" belongs to the begin label of the position label (B-OFF in Fig. 10), which is 0.5; c(2,2) is the probability that "Youtu" belongs to the intermediate label of the position label (I-OFF in Fig. 10), which is 0.2; and so on: each word has a probability value (i.e., an association probability) for each label. Then, according to the words before and after each word — for example, no word precedes "Li Ming" and "Youtu" follows it — the association probability matrix of the preset labels and the following words are used to judge that the preset label "Li Ming" ultimately belongs to is the begin label of the name (B-PER in Fig. 10). "Li Ming" precedes "Youtu", and "R&D", "Center", "Engineer" follow it, so "Youtu" is judged to belong to the begin label of the position (B-OFF in Fig. 10). Similarly, "R&D" belongs to the intermediate label of the position (I-OFF in Fig. 10), "Center" belongs to the intermediate label of the position (I-OFF in Fig. 10), and "Engineer" belongs to the end label of the position (E-OFF in Fig. 10). For words with the same preset label, their positions in the structured content are determined by their relative positions in the text, and the final output is: "Name: Li Ming", "Position: Engineer, Youtu R&D Center". The above example serves only to illustrate the present invention more clearly and is not a limitation of the present invention.
In some embodiments, the preset labels further include a blank label, and each word belongs to any one of the blank label, the begin label, the intermediate label, and the end label.
Specifically, the preset labels comprise four classes of labels: the blank label, the begin label, the intermediate label, and the end label. When the text is output, some words carry no meaning in the text; such words are attributed to the blank label and are not output, that is, the output result includes only meaningful structured content (one or more of the begin label, the intermediate label, and the end label). For example, as shown in Fig. 11, after OCR is performed on all the text on a business card to obtain the OCR text, the business card may contain a logo or structured content that the user does not need, and the corresponding words also appear in the OCR text even though the user does not want them (as shown in Fig. 11, the user may not need the company website (www.xxx.com) on a personal business card). Therefore, when the preset labels of the words are determined, the words related to the company website are attributed to the blank label, and the final output, as shown in Fig. 12, is: "Name: Li Ming", "Position: Engineer, Youtu R&D Center", "Company: XXX Company". In this way, words belonging to the blank label are not output, only the structured content useful to the user is retained, and the user experience is better.
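The blank-label filtering and field assembly described above can be sketched as follows; the tag names ("O" for the blank label, B-/I-/E- prefixes), the field-name mapping, and the helper function are illustrative assumptions:

```python
# Hedged sketch of assembling the final output: words tagged with the
# blank label ("O" here) are dropped entirely, and B-/I-/E- spans of the
# same field are concatenated in text order. Names are illustrative.
FIELD_NAMES = {"PER": "Name", "OFF": "Position", "COM": "Company"}

def assemble(words, tags):
    fields, current, key = [], [], None
    for word, tag in zip(words, tags):
        if tag == "O":          # blank label: never output
            continue
        prefix, field = tag.split("-")
        if prefix == "B":       # begin label opens a new span
            if current:
                fields.append((key, " ".join(current)))
            current, key = [word], field
        else:                   # I-/E- labels extend the open span
            current.append(word)
    if current:
        fields.append((key, " ".join(current)))
    return [f"{FIELD_NAMES.get(k, k)}: {v}" for k, v in fields]

words = ["Li Ming", "Youtu", "R&D", "Center", "Engineer", "www.xxx.com"]
tags  = ["B-PER", "B-OFF", "I-OFF", "I-OFF", "E-OFF", "O"]
print(assemble(words, tags))
# the website, attributed to the blank label, does not appear in the output
```

This mirrors the Fig. 11/Fig. 12 behavior: only structured content useful to the user survives into the output.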
Referring to Fig. 13, an embodiment of the present invention further provides one or more non-volatile computer-readable storage media 500 containing computer-executable instructions. When the computer-executable instructions are executed by one or more processors 600, the processor 600 is caused to perform the method for structured output of text of any one of the above embodiments.
For example, when the computer-executable instructions are executed by the processor 600, the processor 600 performs the following steps of the method for structured output of text:
011: identifying text content in a picture;
012: dividing the text content into multiple words according to a word segmentation model;
014: converting the words into word vectors according to a word vector model;
016: obtaining an association probability matrix of the word vectors and preset labels according to the word vectors and a deep semantic model; and
018: outputting the text content as structured content according to a preset probabilistic model and the association probability matrix of the preset labels.
For another example, when the computer-executable instructions are executed by the processor 600, the processor 600 performs the following steps of the method for structured output of text:
013: determining an industry to which the text content belongs according to the multiple words; and
015: determining the preset labels according to the industry.
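Steps 011 to 018 above can be sketched as a single pipeline. Every stage below is a hypothetical placeholder for the corresponding component (OCR engine, word segmentation model, word vector model, deep semantic model, probabilistic model); none of the stand-ins comes from the patent itself:

```python
# Hedged end-to-end sketch of steps 011-018 with placeholder callables.
def structured_output(picture, ocr, segment, embed, score, decode):
    text = ocr(picture)                  # 011: identify text content
    words = segment(text)                # 012: divide into multiple words
    vectors = [embed(w) for w in words]  # 014: words -> word vectors
    matrix = score(vectors)              # 016: association probability matrix
    return decode(words, matrix)         # 018: output structured content

# toy stand-ins so the pipeline runs end to end
result = structured_output(
    picture=b"<image bytes>",
    ocr=lambda img: "Li Ming Youtu R&D Center Engineer",
    segment=lambda text: ["Li Ming", "Youtu", "R&D", "Center", "Engineer"],
    embed=lambda w: [float(len(w))],
    score=lambda vs: [[1.0] for _ in vs],
    decode=lambda ws, m: {"Name": ws[0]},
)
print(result)  # {'Name': 'Li Ming'}
```

Steps 013 and 015 would slot in between segmentation and scoring, restricting the preset label set to the industry detected from the words.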
Referring to Fig. 14, an embodiment of the present invention further provides a computer equipment 700. The computer equipment 700 includes a memory 720 and a processor 740. Computer-readable instructions are stored in the memory 720, and when the computer-readable instructions are executed by the processor 740, the processor 740 is caused to perform the method for structured output of text of any one of the above embodiments.
The computer equipment 700 may be a computer, a smart phone, a tablet computer, a laptop, a smart watch, a smart bracelet, a smart helmet, smart glasses, or the like.
For example, when the computer-readable instructions are executed by the processor 740, the processor 740 performs the following steps of the method for structured output of text:
011: identifying text content in a picture;
012: dividing the text content into multiple words according to a word segmentation model;
014: converting the words into word vectors according to a word vector model;
016: obtaining an association probability matrix of the word vectors and preset labels according to the word vectors and a deep semantic model; and
018: outputting the text content as structured content according to a preset probabilistic model and the association probability matrix of the preset labels.
For another example, when the computer-readable instructions are executed by the processor 740, the processor 740 performs the following steps of the method for structured output of text:
013: determining an industry to which the text content belongs according to the multiple words; and
015: determining the preset labels according to the industry.
In the description of this specification, reference to the terms "an embodiment", "some embodiments", "a schematic embodiment", "an example", "a specific example", or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular feature, structure, material, or characteristic described may be combined in any suitable manner in any one or more embodiments or examples.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order according to the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). Furthermore, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented with software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one of the following techniques known in the art, or by a combination thereof: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
Those of ordinary skill in the art will appreciate that all or part of the steps carried by the above embodiment methods can be completed by a program instructing relevant hardware, and the program can be stored in a computer-readable storage medium; when executed, the program includes one of the steps of the method embodiment or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may exist physically alone, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although the embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art can change, modify, replace, and vary the above embodiments within the scope of the present invention.
Claims (14)
1. A method for structured output of text, wherein the method comprises:
identifying text content in a picture;
dividing the text content into multiple words according to a word segmentation model;
converting the words into word vectors according to a word vector model;
obtaining an association probability matrix of the word vectors and preset labels according to the word vectors and a deep semantic model; and
outputting the text content as structured content according to a preset probabilistic model and the association probability matrix of the preset labels.
2. The method for structured output of text according to claim 1, wherein the method further comprises:
determining an industry to which the text content belongs according to the multiple words, wherein the industry is any one of the delivery industry, the banking industry, the retail industry, or the education industry; and
determining the preset labels according to the industry.
3. The method for structured output of text according to claim 1, wherein the step of obtaining the association probability matrix of the word vectors and the preset labels according to the word vectors and the deep semantic model comprises:
inputting the word vectors into the deep semantic model in forward order and in reverse order, and outputting forward output results and reverse output results respectively, wherein the deep semantic model comprises a bidirectional long short-term memory model; and
determining association probabilities of the word vectors and the preset labels according to the forward output results and the reverse output results, and generating the association probability matrix of the preset labels.
4. The method for structured output of text according to claim 1, wherein the step of outputting the text content as structured content according to the preset probabilistic model and the association probability matrix of the preset labels comprises:
determining the preset label to which each word in the text content belongs according to the association probability matrix of the preset labels and the probabilistic model;
determining, according to the positions of the words in the text content, the positions in the structured content of the words belonging to the same preset label; and
outputting the structured content according to the words, the preset labels to which the words belong, and the positions of the words in the structured content.
5. The method for structured output of text according to claim 4, wherein the step of determining the preset label to which each word in the text content belongs according to the association probability matrix of the preset labels and the probabilistic model comprises:
determining the preset label to which each word belongs according to the association probability matrix of the preset labels and the words before and after each word in the text content.
6. The method for structured output of text according to claim 1, wherein the preset labels include a blank label, a begin label, an intermediate label, and an end label; each piece of structured content includes any one or more of the begin label, the intermediate label, and the end label; and each word belongs to any one of the blank label, the begin label, the intermediate label, and the end label.
7. A system for structured output of text, wherein the system comprises:
an identification module for identifying text content in a picture;
a word segmentation module for dividing the text content into multiple words according to a word segmentation model;
a conversion module for converting the words into word vectors according to a word vector model;
an obtaining module for obtaining an association probability matrix of the word vectors and preset labels according to the word vectors and a deep semantic model; and
an output module for outputting the text content as structured content according to a preset probabilistic model and the association probability matrix of the preset labels.
8. The system for structured output of text according to claim 7, wherein the system further comprises:
a first determining module for determining an industry to which the text content belongs according to the multiple words, wherein the industry is any one of the delivery industry, the banking industry, the retail industry, or the education industry; and
a second determining module for determining the preset labels according to the industry.
9. The system for structured output of text according to claim 7, wherein the obtaining module comprises:
a processing unit for inputting the word vectors into the deep semantic model in forward order and in reverse order, and outputting forward output results and reverse output results respectively, wherein the deep semantic model comprises a bidirectional long short-term memory model; and
a first determination unit for determining association probabilities of the word vectors and the preset labels according to the forward output results and the reverse output results, and generating the association probability matrix of the preset labels.
10. The system for structured output of text according to claim 7, wherein the output module comprises:
a second determination unit for determining the preset label to which each word in the text content belongs according to the association probability matrix of the preset labels and the probabilistic model;
a third determination unit for determining, according to the positions of the words in the text content, the positions in the structured content of the words belonging to the same preset label; and
an output unit for outputting the structured content according to the words, the preset labels to which the words belong, and the positions of the words in the structured content.
11. The system for structured output of text according to claim 10, wherein the second determination unit is further configured to determine the preset label to which each word belongs according to the association probability matrix of the preset labels and the words before and after each word in the text content.
12. The system for structured output of text according to claim 10, wherein the preset labels include a blank label, a begin label, an intermediate label, and an end label; each piece of structured content includes any one or more of the begin label, the intermediate label, and the end label; and each word belongs to any one of the blank label, the begin label, the intermediate label, and the end label.
13. One or more non-volatile computer-readable storage media containing computer-executable instructions, wherein, when the computer-executable instructions are executed by one or more processors, the processors perform the method for structured output of text according to any one of claims 1 to 6.
14. A computer equipment, comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and when the instructions are executed by the processor, the processor performs the method for structured output of text according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811089125.0A CN110147545B (en) | 2018-09-18 | 2018-09-18 | Method and system for structured output of text, storage medium and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110147545A true CN110147545A (en) | 2019-08-20 |
CN110147545B CN110147545B (en) | 2023-08-29 |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111832300A (en) * | 2020-07-24 | 2020-10-27 | 中国联合网络通信集团有限公司 | Contract auditing method and device based on deep learning |
CN111914535A (en) * | 2020-07-31 | 2020-11-10 | 平安科技(深圳)有限公司 | Word recognition method and device, computer equipment and storage medium |
CN112541373A (en) * | 2019-09-20 | 2021-03-23 | 北京国双科技有限公司 | Judicial text recognition method, text recognition model obtaining method and related equipment |
CN112712879A (en) * | 2021-01-18 | 2021-04-27 | 腾讯科技(深圳)有限公司 | Information extraction method, device, equipment and storage medium for medical image report |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106599933A (en) * | 2016-12-26 | 2017-04-26 | 哈尔滨工业大学 | Text emotion classification method based on the joint deep learning model |
JP2017102599A (en) * | 2015-11-30 | 2017-06-08 | 日本電信電話株式会社 | Estimation device, parameter learning device, method, and program |
CN108170674A (en) * | 2017-12-27 | 2018-06-15 | 东软集团股份有限公司 | Part-of-speech tagging method and apparatus, program product and storage medium |
CN108280062A (en) * | 2018-01-19 | 2018-07-13 | 北京邮电大学 | Entity based on deep learning and entity-relationship recognition method and device |
CN108399227A (en) * | 2018-02-12 | 2018-08-14 | 平安科技(深圳)有限公司 | Method, apparatus, computer equipment and the storage medium of automatic labeling |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |