CN110147545A - Structured output method and system for text, storage medium and computer device - Google Patents

Structured output method and system for text, storage medium and computer device

Info

Publication number
CN110147545A
CN110147545A (application CN201811089125.0A)
Authority
CN
China
Prior art keywords
text
word
label
content
default label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811089125.0A
Other languages
Chinese (zh)
Other versions
CN110147545B (en)
Inventor
蒋兴华 (Jiang Xinghua)
曹浩宇 (Cao Haoyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN201811089125.0A
Publication of CN110147545A
Application granted
Publication of CN110147545B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention discloses a structured output method for text. The method includes: recognizing the text content in a picture; segmenting the text into words according to a word segmentation model; converting the words into word vectors according to a word vector model; obtaining an association probability matrix between the word vectors and preset labels according to the word vectors and a deep semantic model; and outputting the text as structured content according to a probabilistic model and the association probability matrix of the preset labels. In the structured output method of the embodiments of the present invention, the word segmentation model segments the text into individual words, the word vector model converts the words into word vectors, the word vectors are input into the deep semantic model to obtain the association probability matrix of the preset labels, and the probabilistic model then outputs structured content from that matrix. Because the output is produced from the text itself and is independent of format, structured content can be output accurately even for text with a complicated format or no format at all. The invention also discloses a structured output system for text, a non-volatile computer-readable storage medium, and a computer device.

Description

Structured output method and system for text, storage medium and computer device
Technical field
The present invention relates to the technical field of text recognition, and in particular to a structured output method for text, a structured output system for text, a non-volatile computer-readable storage medium, and a computer device.
Background technique
At present, most structured output methods for text register the recognized text or picture against a template in order to output structured content. Text with a complicated format, or with no format at all, is difficult to register accurately, which degrades the accuracy of the structured content that is output.
Summary of the invention
Embodiments of the present invention provide a structured output method for text, a structured output system for text, a non-volatile computer-readable storage medium, and a computer device.
The structured output method for text of the embodiments of the present invention includes:
recognizing the text content in a picture;
segmenting the text content into a plurality of words according to a word segmentation model;
converting the words into word vectors according to a word vector model;
obtaining an association probability matrix between the word vectors and preset labels according to the word vectors and a deep semantic model; and
outputting the text content as structured content according to a preset probabilistic model and the association probability matrix of the preset labels.
In the structured output method for text of the embodiments of the present invention, the word segmentation model segments the text into individual words, the word vector model converts the words into word vectors, the word vectors are input into the deep semantic model to obtain the association probability matrix of the preset labels, and the preset probabilistic model then outputs structured content from that matrix. The output is produced from the text itself and is independent of format, so structured content can be output accurately even for text with a complicated format or no format at all.
The structured output system for text of the embodiments of the present invention includes a recognition module, a word segmentation module, a conversion module, an obtaining module, and an output module. The recognition module is configured to recognize the text content in a picture; the word segmentation module is configured to segment the text into words according to a word segmentation model; the conversion module is configured to convert the words into word vectors according to a word vector model; the obtaining module is configured to obtain an association probability matrix between the word vectors and preset labels according to the word vectors and a deep semantic model; and the output module is configured to output the text as structured content according to a preset probabilistic model and the association probability matrix of the preset labels.
One or more non-volatile computer-readable storage media of the embodiments of the present invention contain computer-executable instructions that, when executed by one or more processors, cause the processors to perform the structured output method for text described above.
The computer device of the embodiments of the present invention includes a memory and a processor. The memory stores computer-readable instructions that, when executed by the processor, cause the processor to perform the structured output method for text described above.
In the structured output method for text, the structured output system for text, the non-volatile computer-readable storage medium, and the computer device of the embodiments of the present invention, the word segmentation model segments the text into individual words, the word vector model converts the words into word vectors, the word vectors are input into the deep semantic model to obtain the association probability matrix of the preset labels, and the preset probabilistic model then outputs structured content from that matrix. The output is produced from the text itself and is independent of format, so structured content can be output accurately even for text with a complicated format or no format at all.
Additional aspects and advantages of the invention will be set forth in part in the description that follows, will in part become apparent from that description, or may be learned through practice of the invention.
Detailed description of the invention
To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the invention; those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a structured output method for text according to some embodiments of the present invention;
Fig. 2 is a schematic block diagram of a structured output system for text according to some embodiments of the present invention;
Fig. 3 is a schematic flowchart of a structured output method for text according to some embodiments of the present invention;
Fig. 4 is a schematic block diagram of a structured output system for text according to some embodiments of the present invention;
Fig. 5 is a schematic flowchart of a structured output method for text according to some embodiments of the present invention;
Fig. 6 is a schematic block diagram of a structured output system for text according to some embodiments of the present invention;
Fig. 7 is a schematic flowchart of a structured output method for text according to some embodiments of the present invention;
Fig. 8 is a schematic block diagram of a structured output system for text according to some embodiments of the present invention;
Fig. 9 is a schematic flowchart of a structured output method for text according to some embodiments of the present invention;
Fig. 10 is a schematic principle diagram of a structured output method for text according to some embodiments of the present invention;
Fig. 11 is a schematic principle diagram of a structured output method for text according to some embodiments of the present invention;
Fig. 12 is a schematic principle diagram of a structured output method for text according to some embodiments of the present invention;
Fig. 13 is a schematic diagram of a computer-readable storage medium according to some embodiments of the present invention; and
Fig. 14 is a schematic diagram of a computer device according to some embodiments of the present invention.
Specific embodiment
Embodiments of the present invention are described in detail below, with examples shown in the accompanying drawings, in which identical or similar reference numerals denote, throughout, identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting it.
In the description of the present invention, it should be understood that the terms "first" and "second" are used for descriptive purposes only and shall not be interpreted as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature defined as "first" or "second" may therefore explicitly or implicitly include one or more such features. In the description of the present invention, "a plurality of" means two or more, unless specifically defined otherwise.
In the description of the present invention, it should be noted that, unless otherwise expressly specified and limited, the terms "mounted", "connected", and "coupled" are to be understood broadly: a connection may be fixed, detachable, or integral; mechanical, electrical, or a mutual communication; direct, or indirect through an intermediary; and it may be an internal connection between two elements or an interaction between two elements. For those of ordinary skill in the art, the specific meanings of these terms in the present invention can be understood according to the specific circumstances.
The following disclosure provides many different embodiments or examples for implementing different structures of the invention. To simplify the disclosure, the components and arrangements of specific examples are described below. They are, of course, merely examples and are not intended to limit the invention. In addition, the invention may repeat reference numerals and/or letters in different examples; such repetition is for simplicity and clarity and does not in itself indicate a relationship between the various embodiments and/or arrangements discussed. The invention also provides examples of various specific processes and materials, but those of ordinary skill in the art will recognize that other processes and/or other materials may also be used.
Referring to Fig. 1, in some embodiments, the structured output method for text of the embodiments of the present invention includes:
011: recognizing the text content in a picture;
012: segmenting the text content into a plurality of words according to a word segmentation model;
014: converting the words into word vectors according to a word vector model;
016: obtaining an association probability matrix between the word vectors and preset labels according to the word vectors and a deep semantic model; and
018: outputting the text content as structured content according to a preset probabilistic model and the association probability matrix of the preset labels.
Referring to Fig. 2, the structured output system 100 for text of the embodiments of the present invention includes a recognition module 11, a word segmentation module 12, a conversion module 14, an obtaining module 16, and an output module 18. The recognition module 11 is configured to recognize the text content in a picture; the word segmentation module 12 is configured to segment the text content into a plurality of words according to a word segmentation model; the conversion module 14 is configured to convert the words into word vectors according to a word vector model; the obtaining module 16 is configured to obtain an association probability matrix between the word vectors and preset labels according to the word vectors and a deep semantic model; and the output module 18 is configured to output the text content as structured content according to a preset probabilistic model and the association probability matrix of the preset labels.
In other words, step 011 may be implemented by the recognition module 11, step 012 by the word segmentation module 12, step 014 by the conversion module 14, step 016 by the obtaining module 16, and step 018 by the output module 18.
Specifically, the text is first segmented into individual words according to the word segmentation model. The text may be the result of optical character recognition (OCR), or plain text; any text can be processed, which gives the method a wide range of application. The length of the text is not limited, and segmentation by the word segmentation model yields at least one word. The word segmentation model may be an n-gram model, a mature segmentation model that infers the n-th word from the preceding n-1 words and can therefore segment the text more accurately. After segmentation, the resulting words are input into the word vector model, which converts each word into a word vector, for example a string of binary characters representing the word so that a computer device can process it. The word vector model may be a skip-gram model or a continuous bag-of-words (CBOW) model, or both may be used together to perform the conversion. Word vectors and words are in one-to-one correspondence.
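The segmentation and word-vector steps just described can be sketched as a simple lookup. The tiny vocabulary, the 4-dimensional vectors, and the pre-segmented word list below are all illustrative assumptions; a real system would use a trained n-gram segmenter and a skip-gram or CBOW embedding model.

```python
# Toy sketch of the segmentation and word-vector steps.
# Hypothetical segmentation result for the running example
# "Li Ming YouTu R&D Center Engineer".
words = ["Li Ming", "YouTu", "R&D", "Center", "Engineer"]

# Toy embedding table: one fixed vector per word (stand-in for the
# output of a trained skip-gram / CBOW model).
embeddings = {
    "Li Ming":  [0.9, 0.1, 0.0, 0.2],
    "YouTu":    [0.1, 0.8, 0.3, 0.0],
    "R&D":      [0.0, 0.7, 0.5, 0.1],
    "Center":   [0.2, 0.6, 0.4, 0.0],
    "Engineer": [0.1, 0.2, 0.9, 0.3],
}

def to_vectors(tokens, table):
    """Map each word to its vector; words and vectors are one-to-one."""
    return [table[t] for t in tokens]

vectors = to_vectors(words, embeddings)
print(len(vectors))   # 5, one vector per word
print(vectors[0])     # [0.9, 0.1, 0.0, 0.2]
```

The one-to-one mapping mirrors the passage's statement that each word has exactly one word vector.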
The word vectors are then input into the deep semantic model, which may be a bidirectional long short-term memory (LSTM) model. The deep semantic model calculates the association probability of each word vector (i.e., each word) with each preset label, where an association probability is the probability that a word belongs to a given preset label. The preset labels comprise three types: begin labels, intermediate labels, and end labels. For example, suppose the text to be structured is "Li Ming YouTu R&D Center Engineer", which the word segmentation model splits into five words: "Li Ming", "YouTu", "R&D", "Center", "Engineer". The word vector model converts the five words into word vectors and inputs them into the deep semantic model, which can calculate, for the word "Li Ming", the association probability of the begin label of a name, of the intermediate label of a name, of the end label of a name, of the begin label of a company name, and so on, yielding the association probabilities of "Li Ming" with all preset labels. Similarly, the association probabilities of "YouTu", "R&D", "Center", and "Engineer" with all preset labels are obtained. The association probabilities of all words with all preset labels then form the association probability matrix of the preset labels.
Finally, the preset label to which each word belongs is determined according to the association probability matrix of the preset labels and the probabilistic model, and the text is then output as structured content. A structured content item includes any one or more of a begin label, intermediate labels, and an end label, and each word belongs to exactly one preset label. In the example above, it is finally determined that "Li Ming" belongs to the begin label of a name, "YouTu" to the begin label of a position, "R&D" and "Center" to the intermediate label of a position, and "Engineer" to the end label of a position. The final output is two structured content items: "name: Li Ming" and "position: YouTu R&D Center Engineer". The item "name: Li Ming" contains only the begin label of a name (corresponding to the word "Li Ming"). The item "position: YouTu R&D Center Engineer" contains the begin label of a position (the word "YouTu"), the intermediate label of a position ("R&D" and "Center"), and the end label of a position ("Engineer"); here the intermediate label corresponds to two words. In other words, a structured content item may include three or more words and contains at most three label types; a preset label may correspond to multiple words, while each word belongs to only one preset label. The relationship between preset labels and words is therefore one-to-many, which guarantees that every word has a corresponding preset label. The probabilistic model may be any one of a conditional random field (CRF) model, a hidden Markov model (HMM), or a deep-learning-based model.
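The decoding step, choosing one preset label per word from the association probability matrix, can be sketched with a Viterbi search, the decoding procedure used by the CRF and HMM models the passage names. The label set, the emission scores, and the transition penalties below are illustrative assumptions, not values from the patent.

```python
# Minimal Viterbi sketch: pick the best label sequence from an
# association probability matrix. Labels follow the begin/intermediate/
# end scheme of the text; all numbers are made up for illustration.
import math

labels = ["B-NAME", "E-NAME", "B-POS", "I-POS", "E-POS"]

# emission[i][j]: association probability of word i with label j
emission = [
    [0.80, 0.10, 0.05, 0.03, 0.02],  # "Li Ming"
    [0.10, 0.10, 0.50, 0.20, 0.10],  # "YouTu"
    [0.05, 0.05, 0.10, 0.70, 0.10],  # "R&D"
    [0.05, 0.05, 0.10, 0.70, 0.10],  # "Center"
    [0.05, 0.05, 0.10, 0.20, 0.60],  # "Engineer"
]

# Reward legal begin->intermediate->end orderings, penalize the rest.
LEGAL = {("B-NAME", "E-NAME"), ("B-NAME", "B-POS"), ("E-NAME", "B-POS"),
         ("B-POS", "I-POS"), ("B-POS", "E-POS"),
         ("I-POS", "I-POS"), ("I-POS", "E-POS")}

def trans(prev, nxt):
    return 0.0 if (labels[prev], labels[nxt]) in LEGAL else -2.0

def viterbi(emission):
    m = len(labels)
    score = [math.log(p) for p in emission[0]]
    back = []
    for row in emission[1:]:
        new_score, ptr = [], []
        for j in range(m):
            k = max(range(m), key=lambda p: score[p] + trans(p, j))
            new_score.append(score[k] + trans(k, j) + math.log(row[j]))
            ptr.append(k)
        score = new_score
        back.append(ptr)
    j = max(range(m), key=lambda p: score[p])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return [labels[j] for j in reversed(path)]

print(viterbi(emission))
# ['B-NAME', 'B-POS', 'I-POS', 'I-POS', 'E-POS']
```

The transition term is what lets the decoder overrule a locally high emission score, which is exactly the context correction described later in the text.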
In the structured output method for text of the embodiments of the present invention, the word segmentation model segments the text into individual words, the word vector model converts the words into word vectors, the word vectors are input into the deep semantic model to obtain the association probability matrix of the preset labels, and the preset probabilistic model then outputs structured content from that matrix. The output is produced from the text itself and is independent of format, so structured content can be output accurately even for text with a complicated format or no format at all. In addition, the text is output without a complicated registration algorithm, which improves performance relative to detection algorithms for complex text and provides a good user experience.
Referring to Fig. 3, in some embodiments, the structured output method further includes:
013: determining the industry to which the text content belongs according to the plurality of words; and
015: determining the preset labels according to the industry.
Referring to Fig. 4, in some embodiments, the structured output system 100 further includes a first determining module 13 and a second determining module 15. The first determining module 13 is configured to determine the industry to which the text content belongs according to the plurality of words. The second determining module 15 is configured to determine the preset labels according to the industry.
In other words, step 013 may be implemented by the first determining module 13, and step 015 by the second determining module 15.
Specifically, the industry includes any one of the delivery industry, the banking industry, the retail industry, and the education industry. The industry may include the delivery industry; or the banking industry; or the retail industry; or the delivery and banking industries; or the delivery, banking, and retail industries; or the delivery, banking, retail, and education industries. The industry may also include other, different industries, without limitation here.
Different industries correspond to different preset labels, and the structured output system 100 may include the preset labels of one industry or of several different industries, selected according to the application scenario. Since the labels used by different industries generally differ substantially, the industry to which the text content belongs can easily be determined from the words corresponding to the labels. For example, the delivery industry generally has industry-specific preset labels such as a postcode label and a freight label; the banking industry generally has a deposit amount label, a deposit time label, a bank name label, and other industry-specific preset labels; the retail industry generally has a commodity amount label, a product name label, and so on; and the education industry generally has a student number label, a grade label, and so on. When determining the preset label corresponding to a word, labels common to almost every industry, such as name labels and address labels, need not be matched (it is not necessary to match all preset labels); only the labels with industry characteristics need to be matched against the words. If the text content contains words corresponding to a postcode label, a freight label, and the like, the text content can be determined to belong to the delivery industry; if it contains words corresponding to a deposit amount label, a bank label, and the like, to the banking industry; if it contains words corresponding to a commodity amount label, a product name label, and the like, to the retail industry; and if it contains words corresponding to a student number label, a grade label, and the like, to the education industry. Of course, the industries are not limited to the examples above. In this way, the industry to which the text content belongs can be determined from the words, the preset labels corresponding to that industry can then be determined, and the words are matched only against the preset labels of the corresponding industry to obtain the association probabilities, instead of against the preset labels of all industries in the structured output system 100. This reduces the amount of computation and improves output efficiency.
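Steps 013 and 015 can be sketched as a keyword match: words corresponding to industry-specific labels identify the industry, and the identified industry selects its preset label set. The keyword sets and label lists below are illustrative assumptions, not the patent's data.

```python
# Toy sketch of industry detection (013) and label selection (015).
INDUSTRY_KEYWORDS = {
    "delivery":  {"postcode", "freight"},
    "banking":   {"deposit amount", "deposit time", "bank name"},
    "retail":    {"commodity amount", "product name"},
    "education": {"student number", "grade"},
}

# Common labels (name, address, ...) appear in every set and are never
# used for matching; only the industry-specific labels discriminate.
INDUSTRY_LABELS = {
    "delivery":  ["name", "address", "postcode", "freight"],
    "banking":   ["name", "address", "deposit amount", "bank name"],
    "retail":    ["name", "address", "commodity amount", "product name"],
    "education": ["name", "address", "student number", "grade"],
}

def detect_industry(words):
    """Pick the industry whose characteristic keywords match the most words."""
    scores = {ind: sum(w in kws for w in words)
              for ind, kws in INDUSTRY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

words = ["Li Ming", "postcode", "freight"]
industry = detect_industry(words)
print(industry)                   # delivery
print(INDUSTRY_LABELS[industry])  # only this industry's labels are matched
```

Restricting the subsequent probability computation to one industry's labels is what yields the reduced computation the paragraph describes.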
Referring to Fig. 5, in some embodiments, step 016 includes:
0162: inputting the word vectors into the deep semantic model in forward order and in reverse order, and outputting a forward-order output result and a reverse-order output result respectively; and
0164: determining the association probabilities of the word vectors with the preset labels according to the forward-order output result and the reverse-order output result, and generating the association probability matrix of the preset labels.
Referring to Fig. 6, in some embodiments, the obtaining module 16 includes a processing unit 162 and a first determination unit 164. The processing unit 162 is configured to input the word vectors into the deep semantic model in forward order and in reverse order and to output a forward-order output result and a reverse-order output result respectively; the first determination unit 164 is configured to determine the association probabilities of the word vectors with the preset labels according to the forward-order output result and the reverse-order output result and to generate the association probability matrix of the preset labels.
In other words, step 0162 may be implemented by the processing unit 162, and step 0164 by the first determination unit 164.
Specifically, after the text is segmented into individual words and the words are converted into word vectors, the word vectors are input one by one into the deep semantic model in forward order (for example, the order of normal reading, such as left to right) to obtain the forward-order output result, and simultaneously input one by one in reverse order (the opposite of the forward order) to obtain the reverse-order output result. The association probability of each word vector (i.e., each word) with the preset labels is then obtained from the forward-order and reverse-order output results together. Because the two results are combined, the context of each word in the whole text (the words before and after it) is taken into account, making the association probabilities more accurate. The association probability matrix of the preset labels is then generated from the association probabilities of all words with all preset labels.
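The bidirectional pass can be sketched by running one recurrent cell over the word vectors forward and in reverse and combining the two outputs per word. The cell here is a deliberately simple stand-in (a running blend of state and input), not a real LSTM; the vectors are made up.

```python
# Minimal sketch of step 0162: forward and reverse passes over the
# word vectors, combined per word so that each word's representation
# reflects both its left and right context.

def cell(state, x):
    """Toy recurrent cell: blends the carried state with the new input."""
    return [0.5 * s + 0.5 * v for s, v in zip(state, x)]

def run(vectors):
    """Run the cell over a sequence; output n depends on outputs 1..n-1."""
    state = [0.0, 0.0]
    outputs = []
    for x in vectors:
        state = cell(state, x)
        outputs.append(state)
    return outputs

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # x1, x2, x3
forward = run(vectors)                            # f1..f3
backward = list(reversed(run(vectors[::-1])))     # b1..b3, re-aligned

# Per-word combined representation: forward half plus backward half.
combined = [f + b for f, b in zip(forward, backward)]
print(combined[0])  # [0.5, 0.0, 0.625, 0.375]
```

Reversing the backward outputs re-aligns b1..bn with the original word order, so each combined vector pairs f_n with b_n for the same word.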
Referring to Fig. 7, in some embodiments, step 018 includes:
0182: determining the preset label to which each word in the text content belongs according to the association probability matrix of the preset labels and the probabilistic model;
0184: determining the positions, within the structured content, of words that belong to the same preset label according to the positions of the words in the text content; and
0186: outputting the structured content according to the words, the preset labels to which the words belong, and the positions of the words in the structured content.
Referring to Fig. 8, in some embodiments, the output module includes a second determination unit 182, a third determination unit 184, and an output unit 186. The second determination unit 182 is configured to determine the preset label to which each word in the text content belongs according to the association probability matrix of the preset labels and the probabilistic model; the third determination unit 184 is configured to determine the positions, within the structured content, of words that belong to the same preset label according to the positions of the words in the text content; and the output unit 186 is configured to output the structured content according to the words, the preset labels to which the words belong, and the positions of the words in the structured content.
In other words, step 0182 may be implemented by the second determination unit 182, step 0184 by the third determination unit 184, and step 0186 by the output unit 186.
Specifically, the preset label to which each word belongs is first determined according to the association probability matrix of the preset labels and the probabilistic model. After the preset labels have been determined, there may be words that share the same preset label. For example, the text contains "YouTu R&D Center Engineer", segmented into "YouTu", "R&D", "Center", "Engineer", and it is finally determined that "YouTu" carries the begin label of a position, "R&D" an intermediate label of a position, "Center" also an intermediate label of a position, and "Engineer" the end label of a position. Two words, "R&D" and "Center", therefore carry the intermediate label of a position. Because a text usually follows normal word order, the relative positions of the words in the text have reference value: the positions, within the structured content, of words with the same preset label are determined according to the positions of the words in the text. In the example above, "R&D" precedes "Center" in the text, so "R&D" also precedes "Center" in the structured content. The structured content is then finally output according to the words, the preset labels to which they belong, and the positions of the words in the structured content: "position: YouTu R&D Center Engineer".
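The assembly described in steps 0184 and 0186 can be sketched by grouping tagged words by label category while preserving their order in the original text. The "B-"/"I-"/"E-" tag strings are an illustrative naming choice for the begin, intermediate, and end labels of the passage.

```python
# Sketch of steps 0184/0186: words sharing a preset label category keep
# their original text order when joined into one structured field.
tagged = [
    ("Li Ming",  "B-NAME"),  # begin label of a name
    ("YouTu",    "B-POS"),   # begin label of a position
    ("R&D",      "I-POS"),   # intermediate label of a position
    ("Center",   "I-POS"),   # intermediate label of a position
    ("Engineer", "E-POS"),   # end label of a position
]

def assemble(tagged):
    """Group words by label category, preserving position in the text."""
    fields = {}
    for word, tag in tagged:              # input order == text order
        category = tag.split("-", 1)[1]
        fields.setdefault(category, []).append(word)
    return {cat: " ".join(ws) for cat, ws in fields.items()}

print(assemble(tagged))
# {'NAME': 'Li Ming', 'POS': 'YouTu R&D Center Engineer'}
```

Because the input list is already in text order, "R&D" lands before "Center" in the assembled field, matching the position rule in the paragraph above.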
Referring to Fig. 9, in some embodiments, step 0182 includes:
01822: determining the preset label to which each word belongs according to the association probability matrix of the preset labels and the words before and after each word in the text content.
Referring again to Fig. 8, in some embodiments, the second determination unit 182 is further configured to determine the preset label to which each word belongs according to the association probability matrix of the preset labels and the words before and after each word in the text content.
In other words, step 01822 may be implemented by the second determination unit 182.
Specifically, in the association probability matrix of the preset labels, each word has one preset label with the highest probability, but that label is not necessarily the correct preset label, because a word does not exist in isolation: only the sense that fits the context is correct. The probabilistic model therefore further corrects the preset label to which a word finally belongs according to the context of the word in the text, that is, the words before and after it. For example, considered in isolation, the probability that "Li Ming" belongs to the begin label of a name may be as high as 0.8, while the probability that it belongs to the begin label of a company name may be only 0.1. However, in the text "Li Ming Clothing Co., Ltd.", "Li Ming" is the prefix of a company name; that is, "Li Ming" should belong to the begin label of a company name. The probabilistic model corrects the final preset label of each word according to the context of the text, so that the preset label finally determined for each word is more accurate.
Referring to Fig. 10, in one example, the text to be output as structured content is "Li Ming YouTu R&D Center Engineer". The segmentation model first splits it into five words: "Li Ming", "YouTu", "R&D", "Center" and "Engineer". The word vector model then converts the words into word vectors, respectively x1, x2, x3, x4, x5. The word vectors are then input into the deep semantic model, in forward order x1, x2, x3, x4, x5 and in reverse order x5, x4, x3, x2, x1. In the forward pass, x1 yields f1; f1 and x2 yield f2; f2 and x3 yield f3; f3 and x4 yield f4; f4 and x5 yield f5. Since f1 affects f2, f2 affects f3, f3 affects f4, and f4 affects f5, the n-th forward output is determined by the previous n-1 forward outputs and the currently input word vector, finally giving the forward outputs f1 to f5. In the reverse pass, x5 yields b5; b5 and x4 yield b4; b4 and x3 yield b3; b3 and x2 yield b2; b2 and x1 yield b1. Likewise, the n-th reverse output is determined by the previous n-1 reverse outputs and the currently input word vector, finally giving the reverse outputs b1 to b5. From f1 to f5 and b1 to b5, the association probability of each word vector (i.e. each word) with every default label is obtained, generating the association probability matrix of the default labels C = [c(n,m)], where c(n,m) is the association probability of the n-th word with the m-th default label. As shown in Fig. 10, c1 denotes the probabilities of the first word over all default labels, c2 those of the second word, c3 those of the third word, c4 those of the fourth word, c5 those of the fifth word, and so on. Then, according to the probability model, the words before and after each word in the text are considered to correct the default label finally assigned to each word; in other words, an optimal path is found in the matrix. For example, c(1,1) is the probability, 0.8, that "Li Ming" carries the beginning label of a person name; c(1,2) is the probability, 0.1, that "Li Ming" carries the intermediate label of a person name; c(2,1) is the probability, 0.5, that "YouTu" carries the beginning label of position (B-OFF in Fig. 10); c(2,2) is the probability, 0.2, that "YouTu" carries the intermediate label of position (I-OFF in Fig. 10); and so on, so that each word has a probability value (i.e. an association probability) for each label. Then, according to the words before and after each word — there is no word before "Li Ming", and "YouTu" follows it — the association probability matrix of the default labels and the following words are used to judge that the default label finally assigned to "Li Ming" is the beginning label of a person name (B-PER in Fig. 10). "YouTu" is preceded by "Li Ming" and followed by "R&D", "Center" and "Engineer", from which it is judged that "YouTu" carries the beginning label of position (B-OFF in Fig. 10). Similarly, "R&D" carries the intermediate label of position (I-OFF in Fig. 10), "Center" the intermediate label of position (I-OFF in Fig. 10), and "Engineer" the end label of position (E-OFF in Fig. 10). Words with the same default label have their positions in the structured content determined by their relative positions in the text, and the final output is: "Name: Li Ming", "Position: YouTu R&D Center Engineer". The above example is given only to explain the present invention more clearly and is not intended as a limitation of the invention.
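The "optimal path in the matrix" described above can be sketched as a Viterbi-style decode over the per-word label probabilities. The sketch below is illustrative only: the emission scores, transition scores, and the restriction to four labels (B-PER, B-OFF, I-OFF, E-OFF) are invented stand-ins for what a trained deep semantic model and probability model would produce.

```python
# Hedged sketch: find the best label path through the association probability
# matrix, scoring each step by the word's label probability (emission) plus a
# transition score between adjacent labels. All numeric values are invented.

def viterbi(emissions, transitions, labels, default=-1.0):
    """emissions: one dict per word mapping label -> probability;
    transitions: dict mapping (prev_label, label) -> score."""
    paths = {lab: ([lab], emissions[0][lab]) for lab in labels}
    for emit in emissions[1:]:
        new_paths = {}
        for lab in labels:
            prev = max(labels, key=lambda p: paths[p][1]
                       + transitions.get((p, lab), default))
            seq, score = paths[prev]
            new_paths[lab] = (seq + [lab], score
                              + transitions.get((prev, lab), default) + emit[lab])
        paths = new_paths
    return max(paths.values(), key=lambda v: v[1])[0]

labels = ["B-PER", "B-OFF", "I-OFF", "E-OFF"]
# Rows of the association probability matrix for the words
# "Li Ming", "YouTu", "R&D", "Center", "Engineer" (invented values).
emissions = [
    {"B-PER": 0.8, "B-OFF": 0.1, "I-OFF": 0.05, "E-OFF": 0.05},
    {"B-PER": 0.1, "B-OFF": 0.5, "I-OFF": 0.2,  "E-OFF": 0.2},
    {"B-PER": 0.0, "B-OFF": 0.2, "I-OFF": 0.6,  "E-OFF": 0.2},
    {"B-PER": 0.0, "B-OFF": 0.1, "I-OFF": 0.6,  "E-OFF": 0.3},
    {"B-PER": 0.0, "B-OFF": 0.1, "I-OFF": 0.3,  "E-OFF": 0.6},
]
# Only plausible label successions get a positive transition score.
transitions = {("B-PER", "B-OFF"): 0.5, ("B-OFF", "I-OFF"): 0.5,
               ("I-OFF", "I-OFF"): 0.5, ("I-OFF", "E-OFF"): 0.5}
path = viterbi(emissions, transitions, labels)
print(path)  # ['B-PER', 'B-OFF', 'I-OFF', 'I-OFF', 'E-OFF']
```

Note how the transition scores, not the per-word maxima alone, pull "YouTu" toward B-OFF once its neighbors are considered — the same context correction the paragraph above describes.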
In some embodiments, the default labels further include a blank label, and each word carries any one of the blank label, the beginning label, the intermediate label and the end label.
Specifically, the default labels comprise four classes: the blank label, the beginning label, the intermediate label and the end label. When the text is output, some words are meaningless within the text; such words are assigned the blank label and are not output, that is, the output result contains only meaningful structured content (one or more of the beginning label, the intermediate label and the end label). For example, as shown in Fig. 11, after OCR is performed on all the text of a business card to obtain the OCR text, the card may carry a logo or structured content the user does not need, and the corresponding words also appear in the OCR text even though the user does not want them (as shown in Fig. 11, the user may not need the company website, www.xxx.com, on the card). Therefore, when the default labels of the words are determined, the words related to the company website are assigned the blank label, and the final output, as shown in Fig. 12, is "Name: Li Ming", "Position: YouTu R&D Center Engineer", "Company: XXX Company". In this way, words carrying the blank label are not output, only the structured content useful to the user is retained, and the user experience is better.
Referring to Fig. 13, an embodiment of the present invention further provides a computer-readable storage medium 500: one or more non-volatile computer-readable storage media 500 containing computer-executable instructions which, when executed by one or more processors 600, cause the processor 600 to perform the structured output method of text of any one of the above embodiments.
For example, when the computer-executable instructions are executed by the processor 600, the processor 600 performs a structured output method of text comprising the following steps:
011: recognizing the text content in a picture;
012: dividing the text content into multiple words according to a segmentation model;
014: converting the words into word vectors according to a word vector model;
016: obtaining the association probability matrix of the word vectors and the default labels according to the word vectors and a deep semantic model; and
018: outputting the text content as structured content according to a preset probability model and the association probability matrix of the default labels.
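Steps 011 to 018 above can be read as a single pipeline. The sketch below wires the stages together with trivial stand-in components; none of the stand-ins come from the patent, which does not specify concrete models or APIs.

```python
# Hedged sketch of the five-step pipeline; every component passed in is a
# placeholder for the real OCR, segmentation, word vector, deep semantic and
# decoding models.

def structured_output(picture, ocr, segment, embed, deep_model, decode):
    text = ocr(picture)                  # 011: recognize the text content
    words = segment(text)                # 012: divide into multiple words
    vectors = [embed(w) for w in words]  # 014: words -> word vectors
    matrix = deep_model(vectors)         # 016: association probability matrix
    return decode(words, matrix)         # 018: output structured content

# Trivial stand-ins, just to show the data flow end to end.
result = structured_output(
    "Li Ming Engineer",
    ocr=lambda p: p,                     # pretend OCR already ran
    segment=str.split,
    embed=len,                           # toy "word vector": word length
    deep_model=lambda vs: vs,            # identity "matrix"
    decode=lambda ws, m: dict(zip(ws, m)),
)
print(result)  # {'Li': 2, 'Ming': 4, 'Engineer': 8}
```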
For another example, when the computer-executable instructions are executed by the processor 600, the processor 600 performs a structured output method of text comprising the following steps:
013: determining the industry to which the text content belongs according to the multiple words; and
015: determining the default labels according to the industry.
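Steps 013 and 015 can be sketched as a keyword-based industry detector that selects the default label set. Everything below — the industries, keywords and label names — is an invented assumption for illustration, not taken from the patent.

```python
# Illustrative sketch: infer the industry from the segmented words, then pick
# the default label set for that industry. All table contents are invented.

INDUSTRY_LABELS = {
    "delivery": ["B-ADDR", "I-ADDR", "E-ADDR", "B-PER", "E-PER"],
    "banking":  ["B-ACCT", "I-ACCT", "E-ACCT", "B-PER", "E-PER"],
}

INDUSTRY_KEYWORDS = {
    "delivery": {"waybill", "parcel", "courier"},
    "banking":  {"account", "branch", "deposit"},
}

def default_labels_for(words):
    # Pick the industry whose keywords overlap the segmented words the most.
    best = max(INDUSTRY_KEYWORDS,
               key=lambda ind: len(INDUSTRY_KEYWORDS[ind] & set(words)))
    return INDUSTRY_LABELS[best]

labels = default_labels_for(["parcel", "Li Ming", "courier"])
print(labels)  # ['B-ADDR', 'I-ADDR', 'E-ADDR', 'B-PER', 'E-PER']
```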
Referring to Fig. 14, an embodiment of the present invention further provides a computer device 700. The computer device 700 includes a memory 720 and a processor 740; computer-readable instructions are stored in the memory 720 and, when executed by the processor 740, cause the processor 740 to perform the structured output method of text of any one of the above embodiments.
The computer device 700 may be a computer, a smartphone, a tablet computer, a laptop, a smart watch, a smart bracelet, a smart helmet, smart glasses, or the like.
For example, when the computer-readable instructions are executed by the processor 740, the processor 740 performs a structured output method of text comprising the following steps:
011: recognizing the text content in a picture;
012: dividing the text content into multiple words according to a segmentation model;
014: converting the words into word vectors according to a word vector model;
016: obtaining the association probability matrix of the word vectors and the default labels according to the word vectors and a deep semantic model; and
018: outputting the text content as structured content according to a preset probability model and the association probability matrix of the default labels.
For another example, when the computer-readable instructions are executed by the processor 740, the processor 740 performs a structured output method of text comprising the following steps:
013: determining the industry to which the text content belongs according to the multiple words; and
015: determining the default labels according to the industry.
In the description of this specification, reference to the terms "an embodiment", "some embodiments", "a schematic embodiment", "an example", "a specific example" or "some examples" means that a particular feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process, and the scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order according to the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered list of executable instructions for implementing logic functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus or device). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transport a program for use by, or in connection with, an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (an electronic device) having one or more wirings, a portable computer diskette (a magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that parts of the present invention may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques known in the art: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or part of the steps carried by the methods of the above embodiments can be completed by a program instructing relevant hardware; the program can be stored in a computer-readable storage medium and, when executed, includes one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist physically alone, or two or more units may be integrated into one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although the embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limitations of the present invention; those of ordinary skill in the art can change, modify, replace and vary the above embodiments within the scope of the present invention.

Claims (14)

1. A structured output method of text, wherein the structured output method of text comprises:
recognizing the text content in a picture;
dividing the text content into multiple words according to a segmentation model;
converting the words into word vectors according to a word vector model;
obtaining an association probability matrix of the word vectors and default labels according to the word vectors and a deep semantic model; and
outputting the text content as structured content according to a preset probability model and the association probability matrix of the default labels.
2. The structured output method of text according to claim 1, wherein the structured output method further comprises:
determining, according to the multiple words, the industry to which the text content belongs, wherein the industry includes any one of the delivery industry, the banking industry, the retail industry or the education industry; and
determining the default labels according to the industry.
3. The structured output method of text according to claim 1, wherein the step of obtaining the association probability matrix of the word vectors and the default labels according to the word vectors and the deep semantic model comprises:
inputting the word vectors into the deep semantic model in forward order and in reverse order respectively, and outputting a forward output result and a reverse output result respectively, wherein the deep semantic model comprises a bidirectional long short-term memory model; and
determining the association probabilities of the word vectors and the default labels according to the forward output result and the reverse output result, and generating the association probability matrix of the default labels.
4. The structured output method of text according to claim 1, wherein the step of outputting the text content as the structured content according to the preset probability model and the association probability matrix of the default labels comprises:
determining the default label to which each word in the text content belongs according to the association probability matrix of the default labels and the probability model;
determining, according to the positions of the words in the text content, the positions in the structured content of the words belonging to the same default label; and
outputting the structured content according to the words, the default labels to which the words belong, and the positions of the words in the structured content.
5. The structured output method of text according to claim 4, wherein the step of determining the default label to which each word in the text content belongs according to the association probability matrix of the default labels and the probability model comprises:
determining the default label to which each word belongs according to the association probability matrix of the default labels and the words before and after each word in the text content.
6. The structured output method of text according to claim 1, wherein the default labels include a blank label, a beginning label, an intermediate label and an end label; each piece of structured content includes any one or more of the beginning label, the intermediate label and the end label; and each word carries any one of the blank label, the beginning label, the intermediate label and the end label.
7. A structured output system of text, wherein the structured output system of text comprises:
an identification module for recognizing the text content in a picture;
a segmentation module for dividing the text content into multiple words according to a segmentation model;
a conversion module for converting the words into word vectors according to a word vector model;
an obtaining module for obtaining an association probability matrix of the word vectors and default labels according to the word vectors and a deep semantic model; and
an output module for outputting the text content as structured content according to a preset probability model and the association probability matrix of the default labels.
8. The structured output system of text according to claim 7, wherein the structured output system further comprises:
a first determining module for determining, according to the multiple words, the industry to which the text content belongs, wherein the industry includes any one of the delivery industry, the banking industry, the retail industry or the education industry; and
a second determining module for determining the default labels according to the industry.
9. The structured output system of text according to claim 7, wherein the obtaining module comprises:
a processing unit for inputting the word vectors into the deep semantic model in forward order and in reverse order respectively and outputting a forward output result and a reverse output result respectively, wherein the deep semantic model comprises a bidirectional long short-term memory model; and
a first determination unit for determining the association probabilities of the word vectors and the default labels according to the forward output result and the reverse output result, and generating the association probability matrix of the default labels.
10. The structured output system of text according to claim 7, wherein the output module comprises:
a second determination unit for determining the default label to which each word in the text content belongs according to the association probability matrix of the default labels and the probability model;
a third determination unit for determining, according to the positions of the words in the text content, the positions in the structured content of the words belonging to the same default label; and
an output unit for outputting the structured content according to the words, the default labels to which the words belong, and the positions of the words in the structured content.
11. The structured output system of text according to claim 10, wherein the second determination unit is further configured to determine the default label to which each word belongs according to the association probability matrix of the default labels and the words before and after each word in the text content.
12. The structured output system of text according to claim 10, wherein the default labels include a blank label, a beginning label, an intermediate label and an end label; each piece of structured content includes any one or more of the beginning label, the intermediate label and the end label; and each word carries any one of the blank label, the beginning label, the intermediate label and the end label.
13. One or more non-volatile computer-readable storage media containing computer-executable instructions which, when executed by one or more processors, cause the processors to perform the structured output method of text according to any one of claims 1 to 6.
14. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the structured output method of text according to any one of claims 1 to 6.
CN201811089125.0A 2018-09-18 2018-09-18 Method and system for structured output of text, storage medium and computer equipment Active CN110147545B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811089125.0A CN110147545B (en) 2018-09-18 2018-09-18 Method and system for structured output of text, storage medium and computer equipment


Publications (2)

Publication Number Publication Date
CN110147545A true CN110147545A (en) 2019-08-20
CN110147545B CN110147545B (en) 2023-08-29

Family

ID=67588427




Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model
JP2017102599A (en) * 2015-11-30 2017-06-08 日本電信電話株式会社 Estimation device, parameter learning device, method, and program
CN108170674A (en) * 2017-12-27 2018-06-15 东软集团股份有限公司 Part-of-speech tagging method and apparatus, program product and storage medium
CN108280062A (en) * 2018-01-19 2018-07-13 北京邮电大学 Entity based on deep learning and entity-relationship recognition method and device
CN108399227A (en) * 2018-02-12 2018-08-14 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium of automatic labeling


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541373A (en) * 2019-09-20 2021-03-23 北京国双科技有限公司 Judicial text recognition method, text recognition model obtaining method and related equipment
WO2021051957A1 (en) * 2019-09-20 2021-03-25 北京国双科技有限公司 Judicial text recognition method, text recognition model obtaining method, and related device
CN112541373B (en) * 2019-09-20 2023-10-31 北京国双科技有限公司 Judicial text recognition method, text recognition model obtaining method and related equipment
CN111832300A (en) * 2020-07-24 2020-10-27 中国联合网络通信集团有限公司 Contract auditing method and device based on deep learning
CN111914535A (en) * 2020-07-31 2020-11-10 平安科技(深圳)有限公司 Word recognition method and device, computer equipment and storage medium
CN111914535B (en) * 2020-07-31 2023-03-24 平安科技(深圳)有限公司 Word recognition method and device, computer equipment and storage medium
CN112712879A (en) * 2021-01-18 2021-04-27 腾讯科技(深圳)有限公司 Information extraction method, device, equipment and storage medium for medical image report
CN112712879B (en) * 2021-01-18 2023-05-30 腾讯科技(深圳)有限公司 Information extraction method, device, equipment and storage medium for medical image report

Also Published As

Publication number Publication date
CN110147545B (en) 2023-08-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant