CN111339253A - Method and device for extracting article information - Google Patents

Method and device for extracting article information

Info

Publication number
CN111339253A
Authority
CN
China
Prior art keywords
article information
word dictionary
information
word
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010116410.8A
Other languages
Chinese (zh)
Inventor
王国悦
饶帆
雷鸣
李力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp and CCB Finetech Co Ltd
Priority to CN202010116410.8A
Publication of CN111339253A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for extracting article information, and relates to the field of computer technology. One embodiment of the method comprises: extracting article information from each historical text by means of a core word dictionary and a cutoff word dictionary; training a model based on a long short-term memory (LSTM) network, using each historical text and its corresponding article information as a training set, to obtain an information extraction model; and extracting the article information from a text to be extracted using the information extraction model. This embodiment addresses the technical problem that article information is difficult to extract.

Description

Method and device for extracting article information
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for extracting article information.
Background
Documents in the international settlement documentary business are mostly paper documents. Business operators scan all the paper documents and application forms of a transaction into image files through a registration-and-scanning function and enter them into an intelligent document examination system. Optical character recognition is then performed on the image files, and the document content recognition results and document type recognition results are stored for subsequent document examination.
The article information mainly exists in letter of credit messages and in documents related to the letter of credit. A letter of credit message is an electronic message concerning the letter of credit that is transmitted over the SWIFT (Society for Worldwide Interbank Financial Telecommunication) network in the international settlement documentary business. A document related to the letter of credit is an image file that business personnel enter into the intelligent document examination system through the registration-and-scanning function. The intelligent document examination system checks article-related information automatically, but the article-related information on which the check depends must first be extracted from the letter of credit message and the related documents.
In the process of implementing the invention, the inventors found that at least the following problem exists in the prior art:
International trade involves a wide variety of articles, and different trading parties describe article information differently. Even for the same article, descriptions in different transactions follow no unified rule and vary widely in style, so article information is difficult to extract. Transactions for which extraction fails cannot be brought into the intelligent document examination system for automatic examination, so examination efficiency is low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for extracting article information, so as to solve the technical problem that article information is difficult to extract.
To achieve the above object, according to one aspect of the embodiments of the present invention, there is provided a method for extracting article information, including:
extracting article information from each historical text by means of a core word dictionary and a cutoff word dictionary;
training a model based on a long short-term memory (LSTM) network, using each historical text and its corresponding article information as a training set, to obtain an information extraction model;
and extracting the article information from a text to be extracted using the information extraction model.
Optionally, extracting article information from each historical text by means of the core word dictionary and the cutoff word dictionary includes:
creating a core word dictionary and a cutoff word dictionary respectively;
for each historical text, locating the position of the article information in the historical text by means of the core word dictionary, and searching forward and backward from that position by means of the cutoff word dictionary, so as to extract the article information from the historical text.
Optionally, the cutoff word dictionary comprises a forward cutoff word dictionary and a backward cutoff word dictionary;
searching forward and backward from the position of the article information by means of the cutoff word dictionary, so as to extract the article information, includes:
searching forward from the position of the article information by means of the forward cutoff word dictionary to locate a target forward cutoff word;
searching backward from the position of the article information by means of the backward cutoff word dictionary to locate a target backward cutoff word;
and extracting the article information from the historical text according to the target forward cutoff word and the target backward cutoff word.
Optionally, the cutoff word dictionary further comprises a general cutoff word dictionary;
searching forward and backward from the position of the article information by means of the cutoff word dictionary, so as to extract the article information, further includes:
if searching forward from the position of the article information by means of the forward cutoff word dictionary returns an empty result, searching forward from that position by means of the general cutoff word dictionary to locate the target forward cutoff word;
and if searching backward from the position of the article information by means of the backward cutoff word dictionary returns an empty result, searching backward from that position by means of the general cutoff word dictionary to locate the target backward cutoff word.
Optionally, extracting article information from each historical text by means of the core word dictionary and the cutoff word dictionary further includes:
if searching forward and backward from the position of the article information by means of the cutoff word dictionary returns an empty result, searching forward and backward from that position by means of a quantifier dictionary, so as to extract the article information from the historical text.
Optionally, training a model based on an LSTM network, using each historical text and its corresponding article information as a training set, to obtain an information extraction model includes:
for each historical text, checking by manual verification whether the historical text matches its corresponding article information; if not, updating the article information by manual labeling and putting the historical text and the updated article information into the training set; if so, putting the historical text and its corresponding article information directly into the training set;
and training the model based on the LSTM network with the training set to obtain the information extraction model.
Optionally, the method further comprises:
if the historical text does not match its corresponding article information, identifying a core word and cutoff words in the historical text according to the manually labeled article information, and adding them to the core word dictionary and the cutoff word dictionary respectively.
In addition, according to another aspect of the embodiments of the present invention, there is provided an apparatus for extracting article information, including:
a first extraction module, configured to extract article information from each historical text by means of a core word dictionary and a cutoff word dictionary;
a training module, configured to train a model based on an LSTM network, using each historical text and its corresponding article information as a training set, to obtain an information extraction model;
and a second extraction module, configured to extract the article information from a text to be extracted using the information extraction model.
Optionally, the first extraction module is further configured to:
create a core word dictionary and a cutoff word dictionary respectively;
for each historical text, locate the position of the article information in the historical text by means of the core word dictionary, and search forward and backward from that position by means of the cutoff word dictionary, so as to extract the article information from the historical text.
Optionally, the cutoff word dictionary comprises a forward cutoff word dictionary and a backward cutoff word dictionary;
the first extraction module is further configured to:
search forward from the position of the article information by means of the forward cutoff word dictionary to locate a target forward cutoff word;
search backward from the position of the article information by means of the backward cutoff word dictionary to locate a target backward cutoff word;
and extract the article information from the historical text according to the target forward cutoff word and the target backward cutoff word.
Optionally, the cutoff word dictionary further comprises a general cutoff word dictionary;
the first extraction module is further configured to:
if searching forward from the position of the article information by means of the forward cutoff word dictionary returns an empty result, search forward from that position by means of the general cutoff word dictionary to locate the target forward cutoff word;
and if searching backward from the position of the article information by means of the backward cutoff word dictionary returns an empty result, search backward from that position by means of the general cutoff word dictionary to locate the target backward cutoff word.
Optionally, the first extraction module is further configured to:
if searching forward and backward from the position of the article information by means of the cutoff word dictionary returns an empty result, search forward and backward from that position by means of a quantifier dictionary, so as to extract the article information from the historical text.
Optionally, the training module is further configured to:
for each historical text, check by manual verification whether the historical text matches its corresponding article information; if not, update the article information by manual labeling and put the historical text and the updated article information into the training set; if so, put the historical text and its corresponding article information directly into the training set;
and train the model based on the LSTM network with the training set to obtain the information extraction model.
Optionally, the training module is further configured to:
if the historical text does not match its corresponding article information, identify a core word and cutoff words in the historical text according to the manually labeled article information, and add them to the core word dictionary and the cutoff word dictionary respectively.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of the embodiments described above.
According to another aspect of the embodiments of the present invention, there is also provided a computer readable medium, on which a computer program is stored, which when executed by a processor implements the method of any of the above embodiments.
One embodiment of the above invention has the following advantages or benefits: by extracting article information from each historical text by means of the core word dictionary and the cutoff word dictionary, training an information extraction model on the result, and using that model to extract the article information from the text to be extracted, the technical problem in the prior art that article information is difficult to extract is overcome. The embodiment of the invention combines the core word dictionary and the cutoff word dictionary with a model based on an LSTM network, which improves the success rate and accuracy of article information extraction and allows as many transactions as possible to be brought into the intelligent document examination system, thereby greatly reducing business processing costs, reducing the workload of business personnel, and improving productivity.
Further effects of the above non-conventional optional implementations will be described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
Fig. 1 is a schematic diagram of the main flow of a method for extracting article information according to an embodiment of the present invention;
Fig. 2 is a diagram illustrating word frequency statistics according to an embodiment of the present invention;
Figs. 3a-3d are schematic diagrams of the regularities of core words and cutoff words according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a Bi-LSTM-CRF model according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the main flow of a method for extracting article information according to a reference embodiment of the present invention;
Fig. 6 is a schematic diagram of the main modules of an apparatus for extracting article information according to an embodiment of the present invention;
Fig. 7 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
Fig. 8 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of the main flow of a method for extracting article information according to an embodiment of the present invention. As shown in Fig. 1, in one embodiment the method for extracting article information may include:
Step 101, extracting article information from each historical text by means of a core word dictionary and a cutoff word dictionary.
In this step, article information (for example, a product name) is extracted from the historical text by means of the core word dictionary and the cutoff word dictionary. Optionally, before step 101, text is recognized from historical letter of credit (LC) messages using optical character recognition, and the article information is then extracted from that text. Taking the article name as an example: a core word first locates the line of the text that contains the article name; the article name is then matched within the text according to the compiled cutoff word dictionary, and the extracted phrase is the article name.
Optionally, step 101 may comprise: creating a core word dictionary and a cutoff word dictionary respectively; for each historical text, locating the position of the article information in the historical text by means of the core word dictionary, and searching forward and backward from that position by means of the cutoff word dictionary, so as to extract the article information from the historical text. The core words and cutoff words in the historical texts can be counted to create the core word dictionary and the cutoff word dictionary. The core words locate the position of the article information and the cutoff words delimit it in both directions, so that the article information (such as an article name) can be extracted accurately from the historical text.
Optionally, the cutoff word dictionary comprises a forward cutoff word dictionary and a backward cutoff word dictionary. The forward cutoff words and the backward cutoff words in the historical texts can be counted separately to obtain the two dictionaries, which improves extraction accuracy and reduces extraction difficulty. Optionally, searching forward and backward from the position of the article information by means of the cutoff word dictionary, so as to extract the article information, includes: searching forward from the position of the article information by means of the forward cutoff word dictionary to locate a target forward cutoff word; searching backward from the position of the article information by means of the backward cutoff word dictionary to locate a target backward cutoff word; and extracting the article information from the historical text according to the target forward cutoff word and the target backward cutoff word. Once the two dictionaries have been created, the target forward cutoff word and the target backward cutoff word are located with the respective dictionary, so that the article information is extracted accurately from the historical text.
Optionally, the cutoff word dictionary further includes a general cutoff word dictionary. Optionally, searching forward and backward from the position of the article information by means of the cutoff word dictionary further includes: if searching forward from the position of the article information by means of the forward cutoff word dictionary returns an empty result, searching forward from that position by means of the general cutoff word dictionary to locate the target forward cutoff word; and if searching backward from the position of the article information by means of the backward cutoff word dictionary returns an empty result, searching backward from that position by means of the general cutoff word dictionary to locate the target backward cutoff word. In other words, when the direction-specific dictionary cannot locate a target cutoff word, the general cutoff word dictionary is used instead, which avoids an empty result.
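As an illustration only, the dictionary-based extraction of step 101 can be sketched as follows. The dictionary contents, the function names `find_cutoff` and `extract_item`, the whitespace tokenization, the reading of "forward" as "toward the start of the line", and the use of the line boundary when no cutoff word is found at all are assumptions of this sketch, not details given in the specification.

```python
from typing import List, Optional, Set

# Hypothetical toy dictionaries for illustration only.
CORE_WORDS: Set[str] = {"sweater", "shoes"}
FORWARD_CUTOFFS: Set[str] = {"of"}
BACKWARD_CUTOFFS: Set[str] = {"qty", "total"}
GENERAL_CUTOFFS: Set[str] = {"kgs", "pct", "per"}


def find_cutoff(tokens: List[str], start: int, step: int,
                primary: Set[str], general: Set[str]) -> Optional[int]:
    """Return the index of the nearest cutoff word, scanning from `start`
    in direction `step` (-1 or +1). The direction-specific dictionary is
    tried first; the general dictionary is used only if it finds nothing."""
    for dictionary in (primary, general):
        i = start
        while 0 <= i < len(tokens):
            if tokens[i] in dictionary:
                return i
            i += step
    return None


def extract_item(text: str) -> Optional[str]:
    """Locate a core word, then delimit the article name with cutoff words."""
    tokens = text.lower().split()
    core = next((i for i, t in enumerate(tokens) if t in CORE_WORDS), None)
    if core is None:
        return None  # no core word: nothing to anchor the search on
    left = find_cutoff(tokens, core - 1, -1, FORWARD_CUTOFFS, GENERAL_CUTOFFS)
    right = find_cutoff(tokens, core + 1, +1, BACKWARD_CUTOFFS, GENERAL_CUTOFFS)
    # Assumption: fall back to the line boundary if no cutoff word exists.
    begin = 0 if left is None else left + 1
    end = len(tokens) if right is None else right
    return " ".join(tokens[begin:end])
```

For example, under these toy dictionaries `extract_item("1000 pairs of leather shoes qty 500")` delimits the name between "of" and "qty", while a line such as "100 kgs wool sweater per carton" is delimited only through the general dictionary.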
Optionally, the core word dictionary and the cutoff word dictionary may be created by the following steps:
First, word frequency statistics may be computed for each word in the historical texts, giving the word frequency statistics shown in Fig. 2. The statistics show the following. There are 463 words with a word frequency above about 2500, where the curve in the figure is very steep; apart from common uninformative words, these are mainly words such as pct, kgs, mm, qty, item, weight, sets, quality, rate, total, and per, most of which are cutoff words and can be used to delimit the article name. Words with a word frequency below 2500 and above 10 are mostly article names, adjectives, and verbs, and the core words generally fall in this range. Words with a word frequency below 10 are mainly misspelled words, alternative names, Latin words, and the like; they are numerous but carry little useful information and can be ignored.
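The frequency bands described above can be sketched as a simple bucketing step. The thresholds 2500 and 10 come from the text; the function name `bucket_words`, the bucket labels, and the treatment of words whose frequency equals a threshold exactly are assumptions of this sketch.

```python
from collections import Counter
from typing import Dict, Iterable, List

HIGH_FREQ = 2500  # above this: mostly cutoff word candidates (from the text)
LOW_FREQ = 10     # at or below this: mostly noise (boundary is an assumption)


def bucket_words(texts: Iterable[str], high: int = HIGH_FREQ,
                 low: int = LOW_FREQ) -> Dict[str, List[str]]:
    """Count whitespace-separated words and split them into frequency bands."""
    counts = Counter(word for text in texts for word in text.lower().split())
    buckets: Dict[str, List[str]] = {
        "cutoff_candidates": [], "core_candidates": [], "ignored": []}
    for word, freq in counts.items():
        if freq > high:
            buckets["cutoff_candidates"].append(word)
        elif freq > low:
            buckets["core_candidates"].append(word)
        else:
            buckets["ignored"].append(word)
    return buckets
```

The buckets only propose candidates; as the following paragraphs describe, the final dictionaries still depend on manual screening.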
Then, sentence analysis of the article descriptions reveals the following regularities: every article name has a core noun; an article name generally appears within a single line; and quantifiers and cutoff words appear in the context before and after the article name. As shown in Fig. 3a, the core word is "sweater"; as shown in Fig. 3b, the article name "mink plate" is followed by a quantifier; as shown in Fig. 3c, "shoes" is preceded by the cutoff word "of"; as shown in Fig. 3d, the silica gel cat litter tranquille is preceded by the cutoff word "of" and followed by a cutoff word of administration.
Then, the core words and the cutoff words are compiled to obtain the core word dictionary and the cutoff word dictionary respectively.
Core word compilation: core words are mainly used to locate the position of an article and are obtained from statistics over the relevant fields of 600,000 historical texts. First, the data is cleaned: special characters and punctuation marks are removed, the text is split into words on spaces, word frequencies are counted, and prepositions, articles, stop words, non-nouns, and the like are removed from the high-frequency words. Words with low frequency are then removed, leaving about 20,000 words. Finally, manual screening removes verbs and adjectives, leaving about 10,000 core words (mainly nouns).
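The automatic part of the core word compilation (cleaning, splitting, counting, and filtering) might look like the sketch below. The `FUNCTION_WORDS` list, the `min_freq` parameter, and the decision to drop tokens that are not purely alphabetic are assumptions for illustration; the manual screening of verbs and adjectives is not modeled.

```python
import re
from collections import Counter
from typing import Iterable, List, Set

# Hypothetical stand-in for the prepositions, articles, and stop words
# that the text says are removed.
FUNCTION_WORDS: Set[str] = {"of", "the", "a", "an", "in", "at", "for", "and", "with"}


def core_word_candidates(texts: Iterable[str], min_freq: int = 2) -> List[str]:
    """Clean the texts, count word frequencies, and keep frequent content words."""
    counts: Counter = Counter()
    for text in texts:
        # Replace special characters and punctuation with spaces.
        cleaned = re.sub(r"[^\w\s]", " ", text.lower())
        # Split on whitespace; dropping non-alphabetic tokens is an assumption.
        counts.update(t for t in cleaned.split() if t.isalpha())
    return sorted(w for w, f in counts.items()
                  if f >= min_freq and w not in FUNCTION_WORDS)
```

The returned list corresponds to the candidate set that would then be screened by hand.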
Cutoff word compilation: the context before and after the article names in the historical texts is counted manually, and forward cutoff words, backward cutoff words, and cutoff rules are compiled. Common cutoff words include: at, of, kg, number, etc. In the embodiment of the present invention, the compiled cutoff words are divided into three types, namely general cutoff words, forward cutoff words, and backward cutoff words, as follows:
Figure BDA0002391623800000091
Optionally, extracting article information from each historical text by means of the core word dictionary and the cutoff word dictionary further comprises: if searching forward and backward from the position of the article information by means of the cutoff word dictionary returns an empty result, searching forward and backward from that position by means of a quantifier dictionary, so as to extract the article information from the historical text. Generally, the article information can be located accurately by means of the cutoff words; if there is no cutoff word before or after it in the text, the article information can be located by searching forward and backward with quantifiers. It should be noted that, in either direction, the search first uses the cutoff words and uses the quantifiers only if the result is empty.
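The priority rule in this paragraph (cutoff words first, quantifiers only when the cutoff search is empty) can be sketched for a single direction as follows. The two word sets and the function name `nearest_boundary` are hypothetical.

```python
from typing import List, Optional, Set

# Hypothetical dictionaries for illustration only.
CUTOFF_WORDS: Set[str] = {"of", "qty", "total"}
QUANTIFIERS: Set[str] = {"pcs", "sets", "pairs", "cartons"}


def nearest_boundary(tokens: List[str], start: int, step: int) -> Optional[int]:
    """Scan from `start` in direction `step` (-1 or +1). A cutoff word anywhere
    in that direction wins; quantifiers are consulted only if none is found."""
    for dictionary in (CUTOFF_WORDS, QUANTIFIERS):
        i = start
        while 0 <= i < len(tokens):
            if tokens[i] in dictionary:
                return i
            i += step
    return None
```

With the tokens of "500 pcs mink plate sets" and the core word "plate" at index 3, neither direction contains a cutoff word, so the quantifiers "pcs" and "sets" delimit the name "mink plate".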
Step 102, training a model based on an LSTM network, using each historical text and its corresponding article information as a training set, to obtain an information extraction model.
As data accumulates, step 101 extracts more and more article information, so each historical text and its corresponding article information from step 101 can be used as a training set to train a model based on an LSTM network, thereby obtaining an information extraction model. Optionally, the model based on the LSTM network may be a Bi-LSTM-CRF (bidirectional long short-term memory network with a conditional random field) deep learning model, as shown in Fig. 4.
The basic idea of a bidirectional long short-term memory network (Bi-LSTM) is to apply two LSTMs to each training sequence, one forward and one backward, both connected to the same output layer. This structure gives the output layer complete past and future context for every point in the input sequence. A conditional random field (CRF) is a model of the conditional probability distribution of one set of output random variables given another set of input random variables.
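For reference, a common textbook formulation of a linear-chain CRF placed on top of Bi-LSTM outputs is given below. The notation is an assumption of this note and does not appear in the specification: $x$ is the input word sequence of length $n$, $y$ a candidate tag sequence, $E_{t,k}$ the score the Bi-LSTM assigns to tag $k$ at position $t$, $T_{j,k}$ a learned score for moving from tag $j$ to tag $k$, and $y_0$ a fixed start tag.

```latex
s(x, y) = \sum_{t=1}^{n} \left( E_{t,\,y_t} + T_{y_{t-1},\,y_t} \right),
\qquad
p(y \mid x) = \frac{\exp s(x, y)}{\sum_{y'} \exp s(x, y')}
```

Training maximizes $\log p(y \mid x)$ over the labeled sequences, and prediction selects the tag sequence with the highest score $s(x, y)$.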
The articles are labeled using the BIO labeling scheme (Begin, Intermediate, Other), for example:
Figure BDA0002391623800000101
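As a hedged illustration of the labeling scheme (not a reproduction of the figure above), the sketch below turns a tokenized text and a known article name into B/I/O tags. The function name `bio_tags` and the exact-match search for the article name are assumptions.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], item_tokens: List[str]) -> List[Tuple[str, str]]:
    """Tag the first occurrence of `item_tokens` inside `tokens`:
    B for its first word, I for the remaining words, O for everything else."""
    tags = ["O"] * len(tokens)
    n = len(item_tokens)
    if n:
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == item_tokens:
                tags[start] = "B"
                for k in range(start + 1, start + n):
                    tags[k] = "I"
                break  # only the first match is labeled in this sketch
    return list(zip(tokens, tags))
```

Pairs produced this way from the output of step 101 are the kind of token-level labels a sequence labeling model such as Bi-LSTM-CRF is trained on.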
Optionally, step 102 may comprise: for each historical text, checking by manual verification whether the historical text matches its corresponding article information; if not, updating the article information by manual labeling and putting the historical text and the updated article information into the training set; if so, putting the historical text and its corresponding article information directly into the training set; and training the model based on the LSTM network with the training set to obtain the information extraction model. After the article information is extracted in step 101, it is checked by manual verification to ensure its accuracy, and if it is incorrect, it is corrected by manual labeling.
Optionally, step 102 may further include: if the historical text does not match its corresponding article information, identifying a core word and cutoff words in the historical text according to the manually labeled article information, and adding them to the core word dictionary and the cutoff word dictionary respectively. A mismatch found by manual verification indicates that the current core word dictionary and cutoff word dictionary are incomplete and need to be updated, so that the article information can be extracted accurately in step 101 the next time a similar historical text is encountered, which reduces the manual labeling workload. It should be noted that the core word dictionary and the cutoff word dictionary can be maintained online and updated continuously, thereby improving extraction accuracy.
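The verification and dictionary-update loop described above can be sketched as follows. The `review` callback stands in for the manual check and is hypothetical, as are the function names. How core words and cutoff words are identified from a manual label is not specified in the text; taking the last word of the label as the core word and its immediate neighbors as cutoff words is purely an assumption of this sketch.

```python
from typing import Callable, List, Optional, Set, Tuple


def update_dictionaries(text: str, item: str,
                        core_words: Set[str], cutoff_words: Set[str]) -> None:
    """Assumed heuristic: the last word of the labeled item is a core word and
    the words directly before and after the item are cutoff words."""
    tokens, item_tokens = text.lower().split(), item.lower().split()
    n = len(item_tokens)
    if n == 0:
        return
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == item_tokens:
            core_words.add(item_tokens[-1])
            if start > 0:
                cutoff_words.add(tokens[start - 1])
            if start + n < len(tokens):
                cutoff_words.add(tokens[start + n])
            return


def build_training_set(
        samples: List[Tuple[str, Optional[str]]],
        review: Callable[[str, Optional[str]], str],
        core_words: Set[str], cutoff_words: Set[str]) -> List[Tuple[str, str]]:
    """`review` returns the extracted item unchanged when it matches the text,
    or the manually labeled item when it does not."""
    training_set = []
    for text, item in samples:
        corrected = review(text, item)
        if corrected != item:  # mismatch: learn from the manual label
            update_dictionaries(text, corrected, core_words, cutoff_words)
        training_set.append((text, corrected))
    return training_set
```

Every sample ends up in the training set; only mismatched samples feed new entries back into the dictionaries.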
Step 103, extracting the article information from the text to be extracted using the information extraction model.
After the information extraction model is obtained through training, it is used to process the text to be extracted and extract the article information from it. The embodiment of the invention thus enables the system to extract article information online from letter of credit texts and documents, and the success rate and accuracy of extraction can be improved continuously.
According to the various embodiments described above, by extracting article information from each historical text by means of the core word dictionary and the cutoff word dictionary, training an information extraction model on the result, and using that model to extract the article information from the text to be extracted, the technical problem in the prior art that article information is difficult to extract is overcome. The embodiment of the invention combines the core word dictionary and the cutoff word dictionary with a model based on an LSTM network, which improves the success rate and accuracy of article information extraction and allows as many transactions as possible to be brought into the intelligent document examination system, thereby greatly reducing business processing costs, reducing the workload of business personnel, and improving productivity.
Fig. 5 is a schematic diagram of a main flow of a method of extracting item information according to a referential embodiment of the present invention. As still another embodiment of the present invention, as shown in fig. 5, the method for extracting item information may include:
Step 501, respectively creating a core word dictionary and a cutoff word dictionary.
Wherein the cutoff word dictionary comprises a forward cutoff word dictionary, a backward cutoff word dictionary and a general cutoff word dictionary.
Step 502, for each historical text, locating the position of the article information in the historical text through the core word dictionary.
Step 503, for each historical text, searching forward and backward from the position of the article information through the cutoff word dictionary, and extracting the article information from the historical text.
Step 504, for each historical text, checking by manual verification whether the historical text matches the corresponding article information; if not, go to step 505; if yes, go to step 507.
Step 505, updating the article information by manual labeling.
Step 506, according to the manually labeled article information, identifying core words and cutoff words in the historical text, and adding them to the core word dictionary and the cutoff word dictionary respectively.
Step 507, putting the historical text and the corresponding article information into a training set.
Step 508, training a model based on a long short-term memory (LSTM) network with the training set to obtain an information extraction model.
Step 509, extracting the article information from the text to be extracted by using the information extraction model.
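The control flow of steps 502 to 507 can be sketched as follows. This is only an illustrative sketch: the function name `build_training_set` and the four callables (`extract`, `is_match`, `relabel`, `update_dicts`) are hypothetical stand-ins for the dictionary-based extraction, the manual verification, the manual labeling, and the dictionary update. It also assumes that a text relabeled in steps 505 and 506 is then placed into the training set in step 507.

```python
from typing import Callable, List, Optional, Tuple


def build_training_set(
    history_texts: List[str],
    extract: Callable[[str], Optional[str]],
    is_match: Callable[[str, Optional[str]], bool],
    relabel: Callable[[str], str],
    update_dicts: Callable[[str, str], None],
) -> List[Tuple[str, str]]:
    """Collect (historical text, article information) pairs for training."""
    training_set = []
    for text in history_texts:
        item = extract(text)               # steps 502-503: dictionary-based extraction
        if not is_match(text, item):       # step 504: manual verification
            item = relabel(text)           # step 505: manual labeling
            update_dicts(text, item)       # step 506: feed new words back into the dictionaries
        training_set.append((text, item))  # step 507: add the pair to the training set
    return training_set
```

Passing the manual steps in as callables keeps the loop itself independent of how the review is actually carried out; the resulting list is what step 508 would consume.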
In addition, the detailed implementation of the method for extracting article information in this reference embodiment has already been described in detail above, so the repeated content is not described again.
Fig. 6 is a schematic diagram of the main modules of an apparatus for extracting article information according to an embodiment of the present invention. As shown in fig. 6, the apparatus 600 for extracting article information includes a first extraction module 601, a training module 602, and a second extraction module 603; the first extraction module 601 is configured to extract article information from each historical text through a core word dictionary and a cutoff word dictionary; the training module 602 is configured to train a model based on a long short-term memory (LSTM) network with each historical text and its corresponding article information as a training set, so as to obtain an information extraction model; the second extraction module 603 is configured to extract the article information in the text to be extracted by using the information extraction model.
Optionally, the first extraction module 601 is further configured to:
respectively creating a core word dictionary and a cutoff word dictionary;
for each historical text, locating the position of the article information in the historical text through the core word dictionary, and searching forward and backward from that position through the cutoff word dictionary, so as to extract the article information from the historical text.
Optionally, the cutoff word dictionary comprises a forward cutoff word dictionary and a backward cutoff word dictionary;
the first extraction module 601 is further configured to:
searching forward from the position of the article information through the forward cutoff word dictionary to locate a target forward cutoff word;
searching backward from the position of the article information through the backward cutoff word dictionary to locate a target backward cutoff word;
and extracting the article information from the historical text according to the target forward cutoff word and the target backward cutoff word.
Optionally, the cutoff word dictionary further comprises a general cutoff word dictionary;
the first extraction module 601 is further configured to:
if searching forward from the position of the article information through the forward cutoff word dictionary returns an empty result, searching forward from that position through the general cutoff word dictionary to locate a target forward cutoff word;
and if searching backward from the position of the article information through the backward cutoff word dictionary returns an empty result, searching backward from that position through the general cutoff word dictionary to locate a target backward cutoff word.
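The directional search with fallback to the general cutoff word dictionary can be illustrated with the minimal sketch below. It rests on several assumptions that are not stated in the text: "forward" is taken to mean toward the start of the text and "backward" toward the end, dictionaries are plain lists of strings, matching is by exact substring, and the article information is the span between the two target cutoff words. The names `find_before`, `find_after`, and `extract_item` are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Span = Tuple[int, int]


def find_before(text: str, pos: int, words: Sequence[str]) -> Optional[Span]:
    """Return (start, end) of the dictionary word ending closest before pos, or None."""
    best = None
    for w in words:
        i = text.rfind(w, 0, pos)
        if i != -1 and (best is None or i + len(w) > best[1]):
            best = (i, i + len(w))
    return best


def find_after(text: str, pos: int, words: Sequence[str]) -> Optional[Span]:
    """Return (start, end) of the dictionary word starting closest after pos, or None."""
    best = None
    for w in words:
        i = text.find(w, pos)
        if i != -1 and (best is None or i < best[0]):
            best = (i, i + len(w))
    return best


def extract_item(text, core_words, forward_words, backward_words, general_words):
    # Locate the article information via the earliest core word found in the text.
    hits = [(text.find(c), c) for c in core_words if c in text]
    if not hits:
        return None
    core_start, core = min(hits)
    core_end = core_start + len(core)

    # Forward search; fall back to the general dictionary on an empty result.
    left = find_before(text, core_start, forward_words) or find_before(text, core_start, general_words)
    # Backward search with the same fallback.
    right = find_after(text, core_end, backward_words) or find_after(text, core_end, general_words)

    # If a side is still unbounded, this sketch simply runs to the edge of the text.
    start = left[1] if left else 0
    end = right[0] if right else len(text)
    return text[start:end].strip(" :,.")
```

For instance, with core word `STEEL`, forward cutoff word `SHIPMENT OF`, and backward cutoff word `AS PER`, the text `SHIPMENT OF 500 SETS OF STAINLESS STEEL PIPES AS PER CONTRACT NO. 123` yields `500 SETS OF STAINLESS STEEL PIPES`.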
Optionally, the first extraction module 601 is further configured to:
if searching forward and backward from the position of the article information through the cutoff word dictionary returns an empty result, searching forward and backward from that position through the quantifier dictionary, so as to extract the article information from the historical text.
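How a quantifier dictionary bounds the article information is not spelled out at this point, so the following is only one possible reading, given as a sketch. It assumes that a quantity phrase is a number followed by a quantifier word (for example `500 SETS`), that the nearest quantity phrase before the core word marks the start of the article information, and that the first quantity phrase after the core word marks its end. The function name `extract_with_quantifiers` and the sample quantifiers are hypothetical, and the word-boundary matching is tuned for space-delimited text.

```python
import re
from typing import Optional, Sequence


def extract_with_quantifiers(text: str, core: str, quantifiers: Sequence[str]) -> Optional[str]:
    """Bound the article information around `core` by the nearest quantity phrases."""
    core_start = text.find(core)
    if core_start == -1 or not quantifiers:
        return None
    core_end = core_start + len(core)

    # Assumed shape of a quantity phrase: digits, optional decimals, then a quantifier word.
    units = "|".join(re.escape(q) for q in quantifiers)
    quantity = re.compile(r"\d+(?:\.\d+)?\s*(?:" + units + r")\b")

    start, end = None, None
    for m in quantity.finditer(text):
        if m.end() <= core_start:
            start = m.end()      # keep the phrase closest before the core word
        elif m.start() >= core_end and end is None:
            end = m.start()      # first phrase after the core word
    if start is None and end is None:
        return None              # the quantifier search is empty as well

    lo = 0 if start is None else start
    hi = len(text) if end is None else end
    return text[lo:hi].strip(" ,.:")
```

In this reading the function would be called only after the cutoff word search has returned nothing in both directions.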
Optionally, the training module 602 is further configured to:
for each historical text, checking by manual verification whether the historical text matches the corresponding article information; if not, updating the article information by manual labeling, and putting the historical text and the updated article information into a training set; if yes, directly putting the historical text and the corresponding article information into the training set;
and training a model based on a long short-term memory (LSTM) network with the training set to obtain an information extraction model.
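The text does not say how a pair of historical text and article information is encoded for the LSTM-based model. One common choice for sequence labeling, assumed here purely for illustration, is BIO tagging, where each token is marked as beginning (B), inside (I), or outside (O) the article information. The function name `to_bio`, the tag names, and the whitespace tokenization are all assumptions.

```python
from typing import List, Tuple


def to_bio(text: str, item: str) -> List[Tuple[str, str]]:
    """Tag each whitespace token of `text` as B-ITEM, I-ITEM, or O."""
    tokens = text.split()
    item_tokens = item.split()
    tags = ["O"] * len(tokens)
    n = len(item_tokens)
    if n:
        # Mark only the first occurrence of the labeled span.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == item_tokens:
                tags[i] = "B-ITEM"
                for j in range(i + 1, i + n):
                    tags[j] = "I-ITEM"
                break
    return list(zip(tokens, tags))
```

Each tagged sequence would then serve as one training example; text written without spaces would need a different tokenizer, for example per character.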
Optionally, the training module 602 is further configured to:
if the historical text does not match the corresponding article information, identifying a core word and cutoff words in the historical text according to the manually labeled article information, and adding them to the core word dictionary and the cutoff word dictionary respectively.
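The dictionary update can be sketched as below. The rule for identifying the new words is an assumption made only for this example: the last token of the manually labeled span is treated as the core word, the token immediately before the span as a forward cutoff word, and the token immediately after it as a backward cutoff word. The function name `update_dictionaries` and the use of Python sets as dictionaries are likewise hypothetical.

```python
from typing import Set


def update_dictionaries(
    text: str,
    labeled_item: str,
    core_dict: Set[str],
    forward_dict: Set[str],
    backward_dict: Set[str],
) -> bool:
    """Add words derived from a manually labeled span; return False if nothing was added."""
    if not labeled_item.strip():
        return False
    start = text.find(labeled_item)
    if start == -1:
        return False  # the label does not occur verbatim in the text
    end = start + len(labeled_item)

    # Assumed heuristic: the head (last) token of the span is the core word.
    core_dict.add(labeled_item.split()[-1])

    before = text[:start].split()
    after = text[end:].split()
    if before:
        word = before[-1].strip(",.:;")
        if word:
            forward_dict.add(word)
    if after:
        word = after[0].strip(",.:;")
        if word:
            backward_dict.add(word)
    return True
```

Because the dictionaries grow with each correction, a later run of the dictionary-based extraction over similar texts has more words to match against.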
According to the embodiments described above, article information is extracted from each historical text through the core word dictionary and the cutoff word dictionary, an information extraction model is obtained through training, and the article information in the text to be extracted is extracted by the information extraction model; this solves the technical problem in the prior art that article information is difficult to extract. The embodiment of the invention combines the core word dictionary and the cutoff word dictionary with a model based on a long short-term memory (LSTM) network, which improves the success rate and accuracy of article information extraction and allows as many services as possible to be brought into an intelligent order examination system, thereby greatly reducing service processing cost, reducing the workload of service personnel, and improving productivity.
The specific implementation of the apparatus for extracting article information has already been described in detail in the above method for extracting article information, so the description is not repeated here.
Fig. 7 shows an exemplary system architecture 700 of a method of extracting item information or an apparatus for extracting item information to which an embodiment of the present invention may be applied.
As shown in fig. 7, the system architecture 700 may include terminal devices 701, 702, 703, a network 704, and a server 705. The network 704 serves to provide a medium for communication links between the terminal devices 701, 702, 703 and the server 705. Network 704 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 701, 702, 703 to interact with a server 705 over a network 704, to receive or send messages or the like. The terminal devices 701, 702, 703 may have installed thereon various communication client applications, such as a shopping-like application, a web browser application, a search-like application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only).
The terminal devices 701, 702, 703 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 705 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 701, 702, 703. The background management server may analyze and otherwise process the received data such as the item information query request, and feed back a processing result (for example, target push information, item information — just an example) to the terminal device.
It should be noted that the method for extracting the item information provided by the embodiment of the present invention is generally executed by the server 705, and accordingly, the apparatus for extracting the item information is generally disposed in the server 705. The method for extracting the article information provided by the embodiment of the present invention may also be executed by the terminal devices 701, 702, and 703, and accordingly, the apparatus for extracting the article information may be disposed in the terminal devices 701, 702, and 703.
It should be understood that the number of terminal devices, networks, and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. When executed by the Central Processing Unit (CPU) 801, the computer program performs the above-described functions defined in the system of the present invention.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer programs according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a first extraction module, a training module, and a second extraction module, where the names of the modules do not in some cases constitute a limitation on the modules themselves.
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the device described in the above embodiments, or may exist separately without being incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform: extracting article information from each historical text through the core word dictionary and the cutoff word dictionary; training a model based on a long short-term memory (LSTM) network by taking each historical text and its corresponding article information as a training set, to obtain an information extraction model; and extracting the article information in the text to be extracted by using the information extraction model.
According to the technical scheme of the embodiment of the invention, article information is extracted from each historical text through the core word dictionary and the cutoff word dictionary, an information extraction model is obtained through training, and the article information in the text to be extracted is extracted by the information extraction model; this solves the technical problem in the prior art that article information is difficult to extract. The embodiment of the invention combines the core word dictionary and the cutoff word dictionary with a model based on a long short-term memory (LSTM) network, which improves the success rate and accuracy of article information extraction and allows as many services as possible to be brought into an intelligent order examination system, thereby greatly reducing service processing cost, reducing the workload of service personnel, and improving productivity.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of extracting article information, comprising:
extracting article information from each historical text through a core word dictionary and a cutoff word dictionary;
training a model based on a long short-term memory (LSTM) network by taking each historical text and its corresponding article information as a training set, to obtain an information extraction model;
and extracting the article information in the text to be extracted by adopting the information extraction model.
2. The method of claim 1, wherein extracting article information from each historical text through a core word dictionary and a cutoff word dictionary comprises:
respectively creating a core word dictionary and a cutoff word dictionary;
for each historical text, locating the position of the article information in the historical text through the core word dictionary, and searching forward and backward from that position through the cutoff word dictionary, so as to extract the article information from the historical text.
3. The method of claim 2, wherein the cutoff word dictionary comprises a forward cutoff word dictionary and a backward cutoff word dictionary;
searching forward and backward from the position of the article information through the cutoff word dictionary, so as to extract the article information, comprises:
searching forward from the position of the article information through the forward cutoff word dictionary to locate a target forward cutoff word;
searching backward from the position of the article information through the backward cutoff word dictionary to locate a target backward cutoff word;
and extracting the article information from the historical text according to the target forward cutoff word and the target backward cutoff word.
4. The method of claim 3, wherein the cutoff word dictionary further comprises a general cutoff word dictionary;
searching forward and backward from the position of the article information through the cutoff word dictionary, so as to extract the article information, further comprises:
if searching forward from the position of the article information through the forward cutoff word dictionary returns an empty result, searching forward from that position through the general cutoff word dictionary to locate a target forward cutoff word;
and if searching backward from the position of the article information through the backward cutoff word dictionary returns an empty result, searching backward from that position through the general cutoff word dictionary to locate a target backward cutoff word.
5. The method of claim 1, wherein extracting article information from each historical text through the core word dictionary and the cutoff word dictionary further comprises:
if searching forward and backward from the position of the article information through the cutoff word dictionary returns an empty result, searching forward and backward from that position through the quantifier dictionary, so as to extract the article information from the historical text.
6. The method according to claim 1, wherein training a model based on a long short-term memory (LSTM) network by taking each historical text and its corresponding article information as a training set, to obtain an information extraction model, comprises:
for each historical text, checking by manual verification whether the historical text matches the corresponding article information; if not, updating the article information by manual labeling, and putting the historical text and the updated article information into a training set; if yes, directly putting the historical text and the corresponding article information into the training set;
and training a model based on a long short-term memory (LSTM) network with the training set to obtain an information extraction model.
7. The method of claim 6, further comprising:
if the historical text does not match the corresponding article information, identifying a core word and cutoff words in the historical text according to the manually labeled article information, and adding them to the core word dictionary and the cutoff word dictionary respectively.
8. An apparatus for extracting article information, comprising:
the first extraction module is used for extracting article information from each historical text through a core word dictionary and a cutoff word dictionary;
the training module is used for training a model based on a long short-term memory (LSTM) network by taking each historical text and its corresponding article information as a training set, to obtain an information extraction model;
and the second extraction module is used for extracting the article information in the text to be extracted by adopting the information extraction model.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202010116410.8A 2020-02-25 2020-02-25 Method and device for extracting article information Pending CN111339253A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010116410.8A CN111339253A (en) 2020-02-25 2020-02-25 Method and device for extracting article information

Publications (1)

Publication Number Publication Date
CN111339253A true CN111339253A (en) 2020-06-26

Family

ID=71185608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010116410.8A Pending CN111339253A (en) 2020-02-25 2020-02-25 Method and device for extracting article information

Country Status (1)

Country Link
CN (1) CN111339253A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09288673A (en) * 1996-04-23 1997-11-04 Nippon Telegr & Teleph Corp <Ntt> Japanese morpheme analysis method and device therefor, and dictionary unregistered word collection method and device therefor
EP2428905A1 (en) * 2010-09-14 2012-03-14 Ricoh Company, Ltd. Information processing apparatus, information processing method, and computer program product for using composite data of image and text information
CN103207855A (en) * 2013-04-12 2013-07-17 广东工业大学 Fine-grained sentiment analysis system and method specific to product comment information
CN104331395A (en) * 2014-10-28 2015-02-04 北京京东尚科信息技术有限公司 Method and device for identifying Chinese product name from text
CN106815293A (en) * 2016-12-08 2017-06-09 中国电子科技集团公司第三十二研究所 System and method for constructing knowledge graph for information analysis
CN109858018A (en) * 2018-12-25 2019-06-07 中国科学院信息工程研究所 A kind of entity recognition method and system towards threat information
CN110019676A (en) * 2017-12-01 2019-07-16 北京搜狗科技发展有限公司 A kind of method, apparatus and equipment identifying core word in query information
CN110222176A (en) * 2019-05-24 2019-09-10 苏宁易购集团股份有限公司 A kind of cleaning method of text data, system and readable storage medium storing program for executing
CN110781668A (en) * 2019-10-24 2020-02-11 腾讯科技(深圳)有限公司 Text information type identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220930

Address after: 25 Financial Street, Xicheng District, Beijing 100033

Applicant after: CHINA CONSTRUCTION BANK Corp.

Address before: 25 Financial Street, Xicheng District, Beijing 100033

Applicant before: CHINA CONSTRUCTION BANK Corp.

Applicant before: Jianxin Financial Science and Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20200626