CN110046345A - Data extraction method and device - Google Patents

Data extraction method and device

Info

Publication number
CN110046345A
Authority
CN
China
Prior art keywords
information
numerical value
numerical
identified
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910185914.2A
Other languages
Chinese (zh)
Inventor
斯义谱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tong Shield Holdings Ltd
Tongdun Holdings Co Ltd
Original Assignee
Tong Shield Holdings Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tong Shield Holdings Ltd
Priority to CN201910185914.2A
Publication of CN110046345A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present application provide a data extraction method and device. The method comprises: determining text information to be identified; analyzing the text information to be identified with a predetermined analysis model to obtain numerical information and syntactic feature information in the text information to be identified, the numerical information including at least one numerical value and information corresponding to the numerical value; and, where the numerical information includes multiple numerical values, determining a target execution value in the text information to be identified according to the syntactic feature information and the numerical information. The preset analysis model is thus used to extract numerical information and syntactic feature information from the text to be identified; analysis of the semantic relations then determines the association relationships among the multiple numerical values, a corresponding calculation strategy is determined from those relationships, and the target execution value is generated. This improves the accuracy and efficiency of extracting numerical information from the text to be identified and simplifies the extraction operation.

Description

Data extraction method and device
Technical field
This application relates to the technical field of data processing, and in particular to a data extraction method and device.
Background
With the rapid development and spread of internet finance, data derived from cases published by courts is widely used in the risk-control stage of internet finance. More than about 50% of court case data does not state the specific amount of the execution target; that amount is often buried in text data such as the court's judgment documents, the obligations determined by the effective legal document, or the judgment result. Such text data is unstructured, so machine learning, deep learning, natural language processing and similar methods are usually needed to extract the relevant amounts from it, determine the relationships among them, and thereby obtain the execution target amount of the case.
In the prior art, manually defined regular expressions are mainly used to extract all relevant amounts from the unstructured text data; the maximum or the direct sum of those amounts is then taken as the execution target amount of the case.
However, regular expressions have low coverage when extracting amount information and can hardly capture every relevant amount. Moreover, taking the maximum or the sum of the extracted amounts as the execution target amount involves no analysis of the relationships among the amounts, so the resulting execution target amount is not very accurate.
Summary
In view of the above problems, embodiments of the present application provide a data extraction method that can solve the prior-art problems of low coverage and low accuracy in the extraction of amount information.
Correspondingly, embodiments of the present application also provide a data extraction device to ensure the implementation and application of the above method.
To solve the above problems, an embodiment of the present application discloses a data extraction method, the method comprising:
determining text information to be identified;
analyzing the text information to be identified with a predetermined analysis model to obtain numerical information and syntactic feature information in the text information to be identified, the numerical information including at least one numerical value and information corresponding to the numerical value;
where the numerical information includes multiple numerical values, determining a target execution value in the text information to be identified according to the syntactic feature information and the numerical information.
Correspondingly, an embodiment of the present application also discloses a data extraction device, the device comprising:
an information determination module, configured to determine text information to be identified;
an information processing module, configured to analyze the text information to be identified with a predetermined analysis model to obtain numerical information and syntactic feature information in the text information to be identified, the numerical information including at least one numerical value and information corresponding to the numerical value;
a value determination module, configured to determine, where the numerical information includes multiple numerical values, a target execution value in the text information to be identified according to the syntactic feature information and the numerical information.
An embodiment of the present application also provides a device comprising a processor and a memory, wherein
the processor executes computer program code stored in the memory to implement the data extraction method described herein.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the data extraction method described herein.
The embodiments of the present application have the following advantages:
Text information to be identified is determined; the text information to be identified is analyzed with a predetermined analysis model to obtain numerical information and syntactic feature information, the numerical information including at least one numerical value and information corresponding to the numerical value; and, where the numerical information includes multiple numerical values, a target execution value in the text information to be identified is determined according to the syntactic feature information and the numerical information. The preset analysis model is thus used to extract numerical information and syntactic feature information from the text to be identified; analysis of the semantic relations then determines the association relationships among the multiple numerical values, a corresponding calculation strategy is determined from those relationships, and the target execution value is generated. This improves the accuracy and efficiency of extracting numerical information from the text to be identified and simplifies the extraction operation.
Brief description of the drawings
Fig. 1 is a flowchart of the steps of an embodiment of a data extraction method of the present application;
Fig. 2 is a schematic diagram of a dependency structure tree of the present application;
Fig. 3 is a flowchart of the steps of an optional embodiment of a data extraction method of the present application;
Fig. 4 is a flowchart of the steps of an optional embodiment of a data extraction method of the present application;
Fig. 5 is a flowchart of the steps of an embodiment of a data extraction method of the present application;
Fig. 6 is a flowchart of the steps of an embodiment of a data extraction method of the present application;
Fig. 7 is a structural block diagram of an embodiment of a data extraction device of the present application.
Detailed description
To make the above objects, features, and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, a flowchart of the steps of an embodiment of a data extraction method of the present application is shown, which may specifically include the following steps:
Step 101: determine the text information to be identified.
It should be noted that a court's judgment document records the process and outcome of a people's court hearing, and is the instrument by which the people's court determines and allocates the substantive rights and obligations of the parties. Example documents include the "First-instance civil judgment in the private lending dispute between Li Qiaoling and Chen Xiangbin" and the "First-instance civil judgment in the sales contract dispute of Lu Xun v. Chen Bo and Chen Xuhong". The target data produced in such a judgment document, i.e. the execution target amount, can be used in the credit-granting scenarios of banks, consumer installment platforms, P2P online lending platforms, microfinance companies, large consumer finance companies, insurance, e-commerce, wealth management institutions, financial leasing institutions, guarantee, real-estate finance and other industries or companies, as a basis for assessing a borrower's litigation involvement, willingness to repay and ability to repay. For example, of two loan applicants, one has a record of a 1,000 yuan fine and the other a record of a 1,000,000 yuan fine; the latter is clearly more likely to default after borrowing. The extraction and analysis of the execution target amount in judgment documents is therefore an important source of information for risk control in internet finance, and the judgment document is the text information to be identified referred to in this application. From the numerical information extracted from the text information to be identified and the association relationships among that numerical information, the execution target amount of the judgment document, i.e. the target execution value, can be determined, and the financial credit of the person subject to enforcement can then be confirmed so that risk control is more accurate.
In a specific application, the court's judgment document data, for example an electronic document, is obtained, and the text in the electronic document is recognized to obtain the text information to be identified. For example, the obtained text may be preprocessed to provide later steps with text data that is complete in content and uniform in format: useless content (parts that do not involve the execution target amount), duplicate content and the like are removed to reduce the text-processing workload; and the text is converted to a standard format, for example the two-byte international standard encoding (Unicode), to facilitate processing.
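The preprocessing described above might be sketched as follows. This is only an illustration under stated assumptions, not the application's implementation: the function name `preprocess`, the source encoding `gb18030`, and the use of the character 元 (yuan) to recognize lines that may carry an amount are all hypothetical choices.

```python
def preprocess(raw: bytes, encoding: str = "gb18030", markers=("元",)) -> list[str]:
    """Decode a document to Unicode text, then keep unique lines that may carry an amount."""
    text = raw.decode(encoding)              # Python str is Unicode, so decoding unifies the encoding
    kept, seen = [], set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line in seen:         # drop blank lines and exact duplicates
            continue
        seen.add(line)
        if any(m in line for m in markers):  # assumed filter: keep only lines mentioning a unit
            kept.append(line)
    return kept
```

A real pipeline would need a more careful notion of "useless content" than a single marker character; the filter here only shows where such a rule would sit.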
Step 102: analyze the text information to be identified with a predetermined analysis model to obtain the numerical information and syntactic feature information in the text information to be identified.
The numerical information includes at least one numerical value and information corresponding to the numerical value; the analysis model includes a data identification model and a syntactic analysis model.
For example, the data identification model may be used to extract the relevant numerical information from the standardized text information to be identified, for instance the execution target amounts in a court's standard document: the amounts of loan principal, interest, alimony, acceptance fee, agency fee, litigation fee, preservation fee, announcement fee, enforcement fee, attorney fee, medical expenses, travel expenses, lost-wages compensation, and so on. The data identification model is trained in advance on sample data and analyzes natural language to extract numerical information.
The syntactic analysis model may preferably be generated by training on a web corpus with a maximum-spanning-tree algorithm based on a maximum entropy model. It performs syntactic analysis on each sentence in the text information to be identified and yields the dependency structure tree of the sentence, as shown in Fig. 2; the tree describes the dependency relations between the words of the sentence and the corresponding amounts.
Thus, after numerical information and syntactic feature information have been extracted from the text information to be identified, the following step determines the association relationships between the different values and hence the target execution amount.
Step 103: where the numerical information includes multiple numerical values, determine the target execution value in the text information to be identified according to the syntactic feature information and the numerical information.
For example, related values are associated with one another. This step analyzes the association relationships among the multiple values according to the syntactic feature information; as Fig. 2 shows, "principal" and "interest" are coordinate, as are "acceptance fee" and "preservation fee".
It should be noted that the association relationships between different values (amounts) may include: containment, summary, halving, intersecting but not containing, identity, and so on. The target execution value is then determined by applying the calculation strategy that corresponds to the association relationship between the values.
In summary, in the data extraction method provided by the embodiments of the present application, text information to be identified is determined; the text information to be identified is analyzed with a predetermined analysis model to obtain numerical information and syntactic feature information, the numerical information including at least one numerical value and information corresponding to the numerical value; and, where the numerical information includes multiple numerical values, the target execution value is determined according to the syntactic feature information and the numerical information. The preset analysis model is thus used to extract numerical information and syntactic feature information from the text to be identified; analysis of the semantic relations then determines the association relationships among the multiple numerical values, a corresponding calculation strategy is determined from those relationships, and the target execution value is generated. This improves the accuracy and efficiency of extracting numerical information from the text to be identified and simplifies the extraction operation.
Referring to Fig. 3, a flowchart of the steps of an optional embodiment of a data extraction method of the present application is shown. Determining the text information to be identified in step 101 includes the following steps:
Step 1011: obtain the character string information in the text to be identified.
Step 1012: standardize the character string information to obtain the text information to be identified.
The standardization includes one or more of character format processing, Chinese numeral format processing, and numerical unit processing.
In a specific application, character format processing includes, for example: standardizing the character encoding, uniformly encoding the obtained character string information as Unicode; and full-width/half-width conversion, uniformly converting the character string information to half-width form. Chinese numeral format processing includes: case conversion, uniformly converting the Chinese numerals in the character string information to their lowercase (ordinary) forms; and conversion between Chinese and Arabic numerals, uniformly converting the Chinese numerals in the character string information to Arabic numerals. Numerical unit processing includes: completing values whose unit is missing, for example converting "acceptance fee 1234, agency fee 234 yuan" to "acceptance fee 1234 yuan, agency fee 234 yuan"; standardizing units, for example unifying the unit of the numerical information in the character string information to "yuan", so that "4.12 ten-thousand yuan" becomes "41200 yuan" and "100,000 yuan" becomes "100000 yuan"; and converting foreign currencies such as US dollars into values denominated in RMB yuan. Thousands separators are also standardized, for example converting "1,234,000 yuan" to "1234000 yuan".
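Several of these standardization steps can be illustrated with a short sketch. It assumes the source text is Chinese, with 元 for "yuan", 万元 for "ten-thousand yuan" and fee names ending in 费; the function names and the rule that only numbers directly following 费 receive a missing unit are hypothetical simplifications. Foreign-currency conversion is omitted because it needs an exchange rate the application does not specify.

```python
import re
import unicodedata
from decimal import Decimal

UPPER_TO_LOWER = str.maketrans("壹贰叁肆伍陆柒捌玖拾佰仟", "一二三四五六七八九十百千")
DIGITS = {ch: i for i, ch in enumerate("零一二三四五六七八九")}
UNITS = {"十": 10, "百": 100, "千": 1000}


def chinese_to_int(numeral: str) -> int:
    """Convert a Chinese numeral below 100,000,000 (upper or lower case) to an int."""
    total = section = digit = 0
    for ch in numeral.translate(UPPER_TO_LOWER):   # upper-case forms -> ordinary forms
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in UNITS:
            section += (digit or 1) * UNITS[ch]    # a bare "十" means 10
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10000
            section = digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


def _plain(value: Decimal) -> str:
    """Render a Decimal without an exponent or trailing zeros."""
    if value == value.to_integral_value():
        return str(int(value))
    return format(value.normalize(), "f")


def normalize_amounts(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)             # full-width -> half-width
    text = re.sub(r"(?<=\d),(?=\d{3}(?!\d))", "", text)    # remove thousands separators
    text = re.sub(r"(\d+(?:\.\d+)?)万元",                   # ten-thousand yuan -> yuan
                  lambda m: _plain(Decimal(m.group(1)) * 10000) + "元", text)
    # Assumed rule: a number right after a fee name with no unit gets "元" appended.
    text = re.sub(r"(?<=费)(\d+(?:\.\d+)?)(?![\d.万元])", r"\1元", text)
    return text
```

Note that NFKC also turns the full-width comma into an ASCII comma, so the thousands-separator rule runs after it; a production system would have to disambiguate list commas from separators more carefully.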
Referring to Fig. 4, a flowchart of the steps of an optional embodiment of a data extraction method of the present application is shown. Step 102, analyzing the text information to be identified with a predetermined analysis model to obtain the numerical information and syntactic feature information in the text information to be identified, includes the following steps:
Step 1021: obtain the numerical information in the text information to be identified with the data identification model.
For example, the information corresponding to each value includes the name of the value, i.e. the name of the execution target, such as principal, interest, agency fee or preservation fee. That is, the data identification model extracts the values in the text information to be identified together with their names, so that step 103 below can combine them with the syntactic feature information to determine the target execution value.
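The shape of this step's output, a list of named values, can be shown with a toy stand-in. The application uses a trained data identification model; the fixed vocabulary and regular expression below are only a hypothetical placeholder for it (and share the coverage limits the background section attributes to regular expressions), and the names `NamedValue` and `identify` are assumptions. The example sentence is an English rendering of the sentence given for Fig. 2.

```python
import re
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class NamedValue:
    name: str        # name of the execution target, e.g. "principal"
    value: Decimal   # amount in yuan


# Hypothetical vocabulary; a trained model would not rely on a fixed list.
NAMES = ("principal", "interest", "acceptance fee", "agency fee", "preservation fee")
PATTERN = re.compile(r"(%s) (\d+(?:\.\d+)?) yuan" % "|".join(map(re.escape, NAMES)))


def identify(text: str) -> list[NamedValue]:
    """Return every (name, amount) pair found in normalized text."""
    return [NamedValue(m.group(1), Decimal(m.group(2))) for m in PATTERN.finditer(text)]


sentence = ("loan principal 2132348.11 yuan and related interest; "
            "acceptance fee 13823 yuan, preservation fee 5000 yuan.")
named_values = identify(sentence)
```

Here "interest" carries no amount in the sentence, so it produces no entry; only the three priced items are returned.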
Step 1022: obtain the syntactic feature information in the text information to be identified with the syntactic analysis model.
For example, Fig. 2 corresponds to the sentence "Enforce the respondent's performance of loan principal 2132348.11 yuan and related interest; acceptance fee 13823 yuan, preservation fee 5000 yuan." The directed lines indicate the relations between sentence constituents, and the arrows indicate the direction of each relation. The attributive relation is the relation between an attribute and its head, that is, the attribute modifies and/or restricts the head. Because the dependency structure tree in Fig. 2 cannot establish a relationship between "principal" and "acceptance fee", "interest" and "acceptance fee", "principal" and "preservation fee", or "interest" and "preservation fee", these amounts are considered mutually independent. The syntactic analysis model performs syntactic analysis on each sentence in the text information to be identified to determine the association relationships between the values it contains.
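One way to read relationships off a dependency tree is sketched below. The arc representation `(head, dependent, label)` and the label strings `"COO"` (coordination) and `"ATT"` (attributive) are assumptions chosen for illustration; the application does not specify a label set. Pairs with no coordination arc between them are treated as independent, mirroring the reasoning given for Fig. 2.

```python
from itertools import combinations

Arc = tuple[str, str, str]   # (head, dependent, label), an assumed representation


def pairwise_relations(names: list[str], arcs: list[Arc]) -> dict[tuple[str, str], str]:
    """Label each pair of names 'coordinate' if a COO arc links them, else 'independent'."""
    coordinate = {frozenset((head, dep)) for head, dep, label in arcs if label == "COO"}
    return {
        (a, b): "coordinate" if frozenset((a, b)) in coordinate else "independent"
        for a, b in combinations(names, 2)
    }


# Hand-written arcs approximating the tree described for Fig. 2.
arcs = [
    ("principal", "loan", "ATT"),
    ("principal", "interest", "COO"),
    ("acceptance fee", "preservation fee", "COO"),
]
names = ["principal", "interest", "acceptance fee", "preservation fee"]
relations = pairwise_relations(names, arcs)
```

With these arcs, "principal"/"interest" and "acceptance fee"/"preservation fee" come out coordinate and the four cross pairs come out independent.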
It should be noted that any other way of extracting the corresponding syntactic feature information from the character string information is also applicable to the technical solution proposed in this application; no specific limitation is imposed.
Referring to Fig. 5, a flowchart of the steps of an optional embodiment of a data extraction method of the present application is shown. Step 103, determining the target execution value in the text information to be identified according to the syntactic feature information and the numerical information where the numerical information includes multiple numerical values, includes the following steps:
Step 1031: determine the association relationships among the multiple values according to the syntactic feature information and the names.
In a specific application, a dependency syntax tree is obtained from the syntactic feature information, and the relations between the different nouns in a sentence, i.e. the association relationships between the values corresponding to the names, are determined, for example containment or coordination. For example, as shown in Fig. 2, the association relationships between "principal" and "acceptance fee", "interest" and "acceptance fee", "principal" and "preservation fee", and "interest" and "preservation fee" are independent; hence, when the later steps determine calculation strategies to produce the numerical information to be screened, multiple calculation strategies may be applied, producing multiple values to be screened.
Step 1032: determine, according to the association relationships, the calculation strategy with which the multiple values are processed.
For example, if the association relationship determined in the above step is containment, the optimal calculation strategy is to take the maximum; if it is coordination, the optimal calculation strategy is merging (summation). It should be noted that several relationships may exist among the multiple values, so several calculation strategies may be determined, and the subsequent computation of the numerical information to be screened then also produces several values to be screened.
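The two mappings stated above (containment to maximum, coordination to sum) can be written as a small lookup. The relation names as strings and the function name `apply_strategy` are assumptions; strategies for the other relationships the application lists (summary, halving, intersecting, identity) are not described in the text and are deliberately left out.

```python
from decimal import Decimal

# Only the two strategies the text spells out; others are unspecified in the source.
STRATEGIES = {
    "containment": max,   # one amount contains the others: take the largest
    "coordinate": sum,    # amounts stand side by side: add them up
}


def apply_strategy(relation: str, values: list[Decimal]) -> Decimal:
    """Compute one candidate value for the given relation."""
    try:
        strategy = STRATEGIES[relation]
    except KeyError:
        raise ValueError(f"no calculation strategy known for relation {relation!r}") from None
    return strategy(values)


fees = [Decimal("13823"), Decimal("5000")]
total_fees = apply_strategy("coordinate", fees)     # acceptance fee + preservation fee
largest_fee = apply_strategy("containment", fees)
```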
Alternatively, every calculation strategy may be applied to the multiple values to generate the corresponding values to be screened; in step 1034 the syntactic feature information is then used to select, as the target execution value, the value to be screened that was produced by the calculation strategy best fitting the syntactic feature information.
Step 1033: determine the numerical information to be screened with the calculation strategy.
The numerical information to be screened includes multiple values to be screened, each calculated from the multiple values in the numerical information.
Step 1034: according to the syntactic feature information, take the value to be screened that satisfies the screening condition as the target execution value.
For example, where multiple values to be screened have been generated, the one with the highest probability is chosen as the target execution value; that is, the value to be screened that was calculated in the way best fitting the syntactic feature information is taken as the target execution value.
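The screening step can be sketched as "compute one candidate per strategy, score each, keep the best". The application does not say how the probability is computed, so the score used here, the share of observed pairwise relations that support each strategy, is purely a hypothetical stand-in; the function name `screen` is likewise an assumption.

```python
from collections import Counter
from decimal import Decimal

STRATEGY_FOR = {"containment": max, "coordinate": sum}


def screen(values: list[Decimal], relations: list[str]):
    """Return (target, candidates); candidates maps relation -> (score, candidate value)."""
    counts = Counter(r for r in relations if r in STRATEGY_FOR)
    if not counts:
        raise ValueError("no relation with a known calculation strategy")
    total = sum(counts.values())
    # Hypothetical score: fraction of observed relations that support this strategy.
    candidates = {rel: (n / total, STRATEGY_FOR[rel](values)) for rel, n in counts.items()}
    best = max(candidates, key=lambda rel: candidates[rel][0])
    return candidates[best][1], candidates


target, candidates = screen(
    [Decimal("13823"), Decimal("5000")],
    ["coordinate", "coordinate", "containment"],
)
```

In this example two of the three observed relations are coordinate, so the summed candidate is selected over the maximum.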
Optionally, as shown in Fig. 6, where the numerical information identified in step 102 contains only one amount (value), no calculation among multiple values is needed and the amount can be used directly as the target execution amount, namely:
Step 104: where the numerical information includes one numerical value, determine that value as the target execution value.
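The branch between step 103 and step 104 amounts to a simple dispatch, sketched below. The function name `target_execution_value` and the `resolve_multiple` callback, which stands in for whatever multi-value procedure is used, are assumptions made for illustration.

```python
from decimal import Decimal
from typing import Callable, Sequence


def target_execution_value(
    values: Sequence[Decimal],
    resolve_multiple: Callable[[Sequence[Decimal]], Decimal],
) -> Decimal:
    """Return the single value directly (step 104) or delegate to the multi-value path (step 103)."""
    if not values:
        raise ValueError("no numerical value was identified")
    if len(values) == 1:
        return values[0]
    return resolve_multiple(values)


single = target_execution_value([Decimal("5000")], resolve_multiple=sum)
several = target_execution_value([Decimal("13823"), Decimal("5000")], resolve_multiple=sum)
```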
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations, but those skilled in the art should understand that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.
Referring to Fig. 7, a structural block diagram of an embodiment of a data extraction device of the present application is shown, which may specifically include the following modules:
An information determination module 710, configured to determine text information to be identified.
An information processing module 720, configured to analyze the text information to be identified with a predetermined analysis model to obtain the numerical information and syntactic feature information in the text information to be identified, the numerical information including at least one numerical value and information corresponding to the numerical value.
A value determination module 730, configured to determine, where the numerical information includes multiple numerical values, the target execution value in the text information to be identified according to the syntactic feature information and the numerical information.
Optionally, the information determination module comprises:
an information acquisition submodule, configured to obtain the character string information in the text to be identified;
an information processing submodule, configured to standardize the character string information to obtain the text information to be identified;
wherein the standardization includes one or more of character format processing, Chinese numeral format processing, and numerical unit processing.
Optionally, the analysis model includes a data identification model and a syntactic analysis model, and the information processing module is configured to:
obtain the numerical information in the text information to be identified with the data identification model, the information corresponding to a numerical value including the name corresponding to each value in the numerical information;
obtain the syntactic feature information in the text information to be identified with the syntactic analysis model.
Optionally, the value determination module comprises:
a relationship determination submodule, configured to determine the association relationships among the multiple values according to the syntactic feature information and the names;
a strategy determination submodule, configured to determine, according to the association relationships, the calculation strategy with which the multiple values are processed;
an information determination submodule, configured to determine the numerical information to be screened with the calculation strategy, the numerical information to be screened including multiple values to be screened calculated from the multiple values in the numerical information;
a value screening submodule, configured to take, according to the syntactic feature information, the value to be screened that satisfies the screening condition as the target execution value.
Optionally, the value determination module is further configured to:
where the numerical information includes one numerical value, determine that value as the target execution value.
An embodiment of the present application also provides a non-volatile readable storage medium storing one or more modules (programs); when the one or more modules are applied in a terminal device, they cause the terminal device to execute the instructions of the method steps in the embodiments of the present application.
Since the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the description of the method embodiment.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another.
It should be understood by those skilled in the art that, the embodiments of the present application may be provided as method, apparatus or calculating Machine program product.Therefore, the embodiment of the present application can be used complete hardware embodiment, complete software embodiment or combine software and The form of the embodiment of hardware aspect.Moreover, the embodiment of the present application can be used one or more wherein include computer can With in the computer-usable storage medium (including but not limited to magnetic disk storage, CD-ROM, optical memory etc.) of program code The form of the computer program product of implementation.
The embodiment of the present application is referring to according to the method for the embodiment of the present application, terminal device (system) and computer program The flowchart and/or the block diagram of product describes.It should be understood that flowchart and/or the block diagram can be realized by computer program instructions In each flow and/or block and flowchart and/or the block diagram in process and/or box combination.It can provide these Computer program instructions are set to general purpose computer, special purpose computer, Embedded Processor or other programmable data processing terminals Standby processor is to generate a machine, so that being held by the processor of computer or other programmable data processing terminal devices Capable instruction generates for realizing in one or more flows of the flowchart and/or one or more blocks of the block diagram The device of specified function.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, so that a series of operational steps is executed on the computer or other programmable terminal device to produce computer-implemented processing. The instructions executed on the computer or other programmable terminal device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they grasp the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.
Finally, it should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes the element.
Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, for those of ordinary skill in the art, the specific implementations and the scope of application may change in accordance with the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A data extraction method, characterized in that the method comprises:
determining text information to be identified;
analyzing the text information to be identified using a predetermined analysis model to obtain numerical information and syntactic feature information in the text information to be identified, the numerical information comprising at least one numerical value and information corresponding to the numerical value;
in a case where the numerical information comprises a plurality of numerical values, determining a target execution value in the text information to be identified according to the syntactic feature information and the numerical information.
2. The method according to claim 1, characterized in that determining the text information to be identified comprises:
obtaining character string information from text to be identified;
standardizing the character string information to obtain the text information to be identified;
wherein the standardization comprises one or more of character format processing, Chinese numeral format processing, and arithmetic unit processing.
3. The method according to claim 1, characterized in that the analysis model comprises a data identification model and a syntactic analysis model, and analyzing the text information to be identified using the predetermined analysis model to obtain the numerical information and the syntactic feature information in the text information to be identified comprises:
obtaining the numerical information in the text information to be identified using the data identification model, wherein the information corresponding to the numerical value comprises a title corresponding to each numerical value in the numerical information;
obtaining the syntactic feature information in the text information to be identified using the syntactic analysis model.
4. The method according to claim 1, characterized in that, in a case where the numerical information comprises a plurality of numerical values, determining the target execution value in the text information to be identified according to the syntactic feature information and the numerical information comprises:
determining an association relationship among the plurality of numerical values according to the syntactic feature information and the title;
determining, according to the association relationship, a calculation strategy for processing the plurality of numerical values;
determining numerical information to be screened using the calculation strategy, the numerical information to be screened comprising a plurality of numerical values to be screened that are calculated from the plurality of numerical values in the numerical information;
taking, according to the syntactic feature information, the numerical value to be screened that satisfies a screening condition as the target execution value.
5. The method according to claim 1, characterized in that the method further comprises:
in a case where the numerical information comprises a single numerical value, determining that numerical value as the target execution value.
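The selection steps recited in claims 4 and 5 can be sketched roughly as follows. This is only an illustration under stated assumptions: the claims do not define the association relationships, calculation strategies, or screening conditions, so the relation labels (`range`, `additive`, `independent`), the feature keys (`connective`, `target_label`), and all function names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, float]  # (label, value to be screened)


@dataclass(frozen=True)
class NumericValue:
    title: str    # title associated with the value, e.g. "interest rate"
    value: float


def determine_relation(values: List[NumericValue], features: Dict[str, str]) -> str:
    """Hypothetical rule: derive a relation from syntactic features and titles."""
    titles = {v.title for v in values}
    connective = features.get("connective")
    if connective == "range" and len(titles) == 1:
        return "range"
    if connective == "and":
        return "additive"
    return "independent"


def _range(values: List[NumericValue]) -> List[Candidate]:
    nums = [v.value for v in values]
    return [("lower bound", min(nums)), ("upper bound", max(nums))]


def _additive(values: List[NumericValue]) -> List[Candidate]:
    total = sum(v.value for v in values)
    return [("sum", total)] + [(v.title, v.value) for v in values]


def _independent(values: List[NumericValue]) -> List[Candidate]:
    return [(v.title, v.value) for v in values]


# Hypothetical mapping from association relationship to calculation strategy.
STRATEGIES: Dict[str, Callable[[List[NumericValue]], List[Candidate]]] = {
    "range": _range,
    "additive": _additive,
    "independent": _independent,
}


def select_target_value(values: List[NumericValue], features: Dict[str, str]) -> float:
    if not values:
        raise ValueError("no numerical value was extracted")
    if len(values) == 1:
        # Single value: it is the target execution value directly.
        return values[0].value
    relation = determine_relation(values, features)
    candidates = STRATEGIES[relation](values)
    # Hypothetical screening condition: match a label named by the features.
    wanted = features.get("target_label")
    for label, value in candidates:
        if label == wanted:
            return value
    return candidates[0][1]  # fallback when no candidate matches
```

Keeping the strategies in a lookup table keeps the relation-to-strategy step separate from the screening step, mirroring the order of the claimed steps; a real system would presumably derive both from a trained syntactic analysis model rather than from hand-written keys.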
6. A data extraction device, characterized in that the device comprises:
an information determining module, configured to determine text information to be identified;
an information processing module, configured to analyze the text information to be identified using a predetermined analysis model to obtain numerical information and syntactic feature information in the text information to be identified, the numerical information comprising at least one numerical value and information corresponding to the numerical value;
a numerical value determining module, configured to, in a case where the numerical information comprises a plurality of numerical values, determine a target execution value in the text information to be identified according to the syntactic feature information and the numerical information.
7. The device according to claim 6, characterized in that the information determining module comprises:
an information obtaining submodule, configured to obtain character string information from text to be identified;
an information processing submodule, configured to standardize the character string information to obtain the text information to be identified;
wherein the standardization comprises one or more of character format processing, Chinese numeral format processing, and arithmetic unit processing.
8. The device according to claim 6, characterized in that the analysis model comprises a data identification model and a syntactic analysis model, and the information processing module is configured to:
obtain the numerical information in the text information to be identified using the data identification model, wherein the information corresponding to the numerical value comprises a title corresponding to each numerical value in the numerical information;
obtain the syntactic feature information in the text information to be identified using the syntactic analysis model.
9. The device according to claim 6, characterized in that the numerical value determining module comprises:
a relationship determining submodule, configured to determine an association relationship among the plurality of numerical values according to the syntactic feature information and the title;
a strategy determining submodule, configured to determine, according to the association relationship, a calculation strategy for processing the plurality of numerical values;
an information determining submodule, configured to determine numerical information to be screened using the calculation strategy, the numerical information to be screened comprising a plurality of numerical values to be screened that are calculated from the plurality of numerical values in the numerical information;
a numerical value screening submodule, configured to take, according to the syntactic feature information, the numerical value to be screened that satisfies a screening condition as the target execution value.
10. The device according to claim 6, characterized in that the numerical value determining module is further configured to:
in a case where the numerical information comprises a single numerical value, determine that numerical value as the target execution value.
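The standardization recited in claims 2 and 7 (character format processing, Chinese numeral format processing, and arithmetic unit processing) might look roughly like the sketch below. It is an assumption-laden illustration, not the patented procedure: the claims give no algorithm, so the lookup tables, regular expressions, and the function names `normalize` and `chinese_numeral_to_int` are hypothetical, and only the units 万 and 亿 are expanded after Arabic digits.

```python
import re
import unicodedata
from decimal import Decimal

# Hypothetical lookup tables; the claims do not enumerate them.
CN_DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
CN_SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
CN_BIG_UNITS = {"万": 10_000, "亿": 100_000_000}


def chinese_numeral_to_int(text: str) -> int:
    """Convert a Chinese numeral such as 三千五百 to an int."""
    total = 0    # sum of sections already closed by 万 or 亿
    section = 0  # value accumulated since the last 万 or 亿
    digit = 0    # most recent digit not yet multiplied by a unit
    for ch in text:
        if ch in CN_DIGITS:
            digit = CN_DIGITS[ch]
        elif ch in CN_SMALL_UNITS:
            # A bare 十 means 10, hence the "or 1".
            section += (digit or 1) * CN_SMALL_UNITS[ch]
            digit = 0
        elif ch in CN_BIG_UNITS:
            section += digit
            if section == 0 and total:
                total *= CN_BIG_UNITS[ch]  # stacked units such as 万亿
            else:
                total += section * CN_BIG_UNITS[ch]
            section = 0
            digit = 0
        else:
            raise ValueError(f"not a Chinese numeral character: {ch!r}")
    return total + section + digit


# A numeral must start with a digit character or 十 so that a lone unit
# character inside ordinary words is left untouched.
_CN_NUMERAL = re.compile(r"[零一二两三四五六七八九十][零一二两三四五六七八九十百千万亿]*")
_ARABIC_WITH_UNIT = re.compile(r"(\d+(?:\.\d+)?)([万亿]+)")
_THOUSANDS_SEP = re.compile(r"(?<=\d),(?=\d{3}(?!\d))")


def _expand_unit(match: "re.Match[str]") -> str:
    value = Decimal(match.group(1))
    for unit in match.group(2):
        value *= CN_BIG_UNITS[unit]
    if value == value.to_integral_value():
        return str(int(value))
    return format(value, "f")


def normalize(raw: str) -> str:
    # Character format processing: full-width forms become half-width (NFKC).
    text = unicodedata.normalize("NFKC", raw)
    text = _THOUSANDS_SEP.sub("", text)
    # Arithmetic unit processing: Arabic digits followed by 万 or 亿.
    text = _ARABIC_WITH_UNIT.sub(_expand_unit, text)
    # Chinese numeral format processing.
    return _CN_NUMERAL.sub(lambda m: str(chinese_numeral_to_int(m.group())), text)
```

For example, `normalize("年化收益3.5万")` yields `年化收益35000`. The rule-based numeral conversion is deliberately simple and will also rewrite numerals that appear inside ordinary words, so a production system would need additional context checks.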
Application number: CN201910185914.2A; Priority date: 2019-03-12; Filing date: 2019-03-12; Title: A kind of data extraction method and device; Status: Pending; Publication: CN110046345A (en)

Publications (1)

Publication Number Publication Date
CN110046345A true CN110046345A (en) 2019-07-23

Family

ID=67274785

Country Status (1)

Country Link
CN (1) CN110046345A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017021596A (en) * 2015-07-10 2017-01-26 日本電信電話株式会社 Word rearrangement learning device, word rearrangement device, method, and program
WO2017092555A1 (en) * 2015-12-01 2017-06-08 北京国双科技有限公司 Method and device for parsing amount of money in judgement document
CN106815203A (en) * 2015-12-01 2017-06-09 北京国双科技有限公司 A kind of amount of money analysis method and device in judgement document
WO2018025317A1 (en) * 2016-08-02 2018-02-08 株式会社日立製作所 Natural language processing device and natural language processing method
CN108197099A (en) * 2017-12-01 2018-06-22 厦门快商通信息技术有限公司 A kind of text message extracting method and computer readable storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110930165A (en) * 2019-11-08 2020-03-27 国家计算机网络与信息安全管理中心 Anomaly detection method and device for Internet financial website
CN111046657A (en) * 2019-12-04 2020-04-21 东软集团股份有限公司 Method, device and equipment for realizing text information standardization
CN111046657B (en) * 2019-12-04 2023-10-13 东软集团股份有限公司 Method, device and equipment for realizing text information standardization
CN111581472A (en) * 2020-03-23 2020-08-25 北京航空航天大学 Internet financial product publicity yield and commitment extraction method and system
CN111581929A (en) * 2020-04-22 2020-08-25 腾讯科技(深圳)有限公司 Text generation method based on table and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190723