CN108197119A

CN108197119A - Paper archive digitization method based on a knowledge graph

Info

Publication number: CN108197119A
Application number: CN201810111488.3A
Authority: CN
Inventors: 李进荣; 孙懿鑫; 张步明
Original assignee: Chengdu Zhuo Guan Information Technology Co Ltd
Current assignee: Chengdu Zhuo Guan Information Technology Co Ltd
Priority date: 2018-02-05
Filing date: 2018-02-05
Publication date: 2018-06-22

Abstract

The invention discloses a paper archive digitization method based on a knowledge graph. The method includes obtaining image information of a paper archive, analyzing it to obtain standardized text data, extracting the entity information of key entities, building a standard dictionary table and fusing the entity information against it to form structured data, building a knowledge graph with the structured data as knowledge entries, and obtaining the content data of the paper archive from the knowledge graph to generate an electronic document. The invention improves the efficiency of paper archive digitization while reducing the error rate.

Description

Paper archive digitization method based on a knowledge graph

Technical field

The invention belongs to the field of electronic information technology, and in particular relates to a paper archive digitization method based on a knowledge graph.

Background art

Paper archive digitization is the most basic work in building a large archive database. The operating process includes steps such as sorting and classifying the archives, scanning images, entering text, and organizing the results for storage. At present, digitizing a paper archive means turning the physical paper archive into an electronic document (in a format such as JPG, PDF or TIFF) for storage. Because the purpose is to support information services, the result must be readable and usable by the relevant software systems.

Thus, when an electronic archive database is established, two electronic files must be generated for each paper archive: one is the image of the paper archive, and the other is the information that corresponds one-to-one with that image. The current solution is to produce an electronic image plus an Excel entry. For example, after one physical paper archive is scanned, an electronic image named "031-053-01-019-01.jpg" is generated, but its full content cannot be understood from the name "031-053-01-019-01.jpg" alone. The information carried on the paper archive (such as the file number, class number, time, archive type, page name, the submitting unit and department to which it belongs, and the number of pages) must therefore be entered into the corresponding fields of the Excel file. Completing the digitization of one paper archive thus requires two tasks: first, scanning the paper archive; second, entering the archive content into the corresponding fields of the Excel file. The workload is very large.

Although common scanners (document cameras) currently on the market can apply some processing to the scanned image, they generally cannot capture the content information and write it into the corresponding fields of the Excel file. With technological progress, high-end scanners with optical character recognition (OCR) have also appeared, but so far their error rate cannot meet the requirement of less than 0.5% specified for national archive digitization. Even with an imported high-end scanner, the error rate can be reduced by several orders of magnitude but still does not meet the requirement, and such scanners are expensive, often costing hundreds of thousands or even more than a million per unit, which is too high. As a result, the archive digitization workflow at most companies today either has one person process each archive twice or has two people work in sequence on an assembly line. The procedure is complex, which leads to low efficiency and excessive personnel cost.

Summary of the invention

The purpose of the invention is as follows: to solve problems of the prior art such as the complex paper archive digitization procedure and the resulting low efficiency, the invention proposes a paper archive digitization method based on a knowledge graph.

The technical solution of the invention is a paper archive digitization method based on a knowledge graph, comprising:

A. obtaining the image information of the paper archive that needs to be digitized;

B. performing lexical, syntactic and/or semantic analysis on the paper archive image information of step A to obtain standardized text data;

C. extracting the entity information of key entities from the standardized text data of step B;

D. building a standard dictionary table, and performing data fusion on the entity information of step C according to the standard dictionary table to form structured data;

E. building a knowledge graph with the structured data of step D as knowledge entries;

F. obtaining the content data in the paper archive image information according to the knowledge graph of step E, and generating an electronic document.
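As a rough illustration of steps E and F, the sketch below indexes triples in a nested dictionary and serializes the fields stored for one scanned image as JSON. The function names, the triple layout, the field names and the JSON output format are all assumptions made for this example; the patent does not specify a storage structure or an electronic document format.

```python
import json
from collections import defaultdict


def build_knowledge_graph(triples):
    """Step E (sketch): index (entity, relation, value) triples by entity."""
    graph = defaultdict(lambda: defaultdict(list))
    for entity, relation, value in triples:
        # Skip exact duplicates so repeated knowledge entries are stored once.
        if value not in graph[entity][relation]:
            graph[entity][relation].append(value)
    return graph


def generate_electronic_record(graph, image_name):
    """Step F (sketch): collect the content stored for one image as a JSON record."""
    fields = {
        relation: values[0] if len(values) == 1 else values
        for relation, values in graph.get(image_name, {}).items()
    }
    return json.dumps({"image": image_name, "fields": fields},
                      ensure_ascii=False, sort_keys=True)


# Hypothetical structured data for one scanned page (values are illustrative).
triples = [
    ("031-053-01-019-01.jpg", "file_number", "031-053-01-019"),
    ("031-053-01-019-01.jpg", "page_count", "3"),
    ("031-053-01-019-01.jpg", "department", "Finance"),
]
graph = build_knowledge_graph(triples)
record = generate_electronic_record(graph, "031-053-01-019-01.jpg")
print(record)
```

A record of this kind plays the role that the manually filled Excel entry plays in the background section: it carries the information that the image file name alone cannot convey.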

Further, performing lexical, syntactic and/or semantic analysis on the paper archive image information of step A in step B to obtain standardized text data is specifically:

performing document structure classification on the paragraphs of the paper archive image information of step A using a pre-trained paragraph classifier model, and dividing the paper archive image information into paragraph structures according to the classification results;

if the paper archive image information is a Chinese resource, performing word segmentation, part-of-speech tagging and phrase recognition on each divided paragraph structure, and removing the punctuation marks in the paragraph structure;

if the paper archive image information is a foreign-language resource, performing stemming, lemmatization and phrase recognition on each divided paragraph structure, and removing the punctuation marks in the paragraph structure.

Further, extracting the entity information of key entities from the standardized text data of step B in step C is specifically:

classifying the words in the standardized text data using a pre-trained noun classifier model, and identifying and extracting the nouns of each category and the relationships between the nouns according to the classification results.

Further, building the standard dictionary table in step D is specifically:

establishing the architecture of the knowledge graph according to general data standards;

converting the entity attributes of the key entities of step C into triple data;

unifying the relationship types and naming rules of the entity attributes and the key entities according to the triple data, to obtain a standard dictionary table with a standard specification.
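The conversion to triples and the unification of naming rules can be sketched as follows. This is only an illustration under stated assumptions: the alias table `RELATION_ALIASES`, the snake_case naming rule and both function names are hypothetical and are not taken from the patent.

```python
def to_triples(entity, attributes):
    """Convert an entity's attributes into (entity, relation, value) triples."""
    return [(entity, name, value) for name, value in attributes.items()]


# Hypothetical alias table: lower-cased raw relation name -> canonical name.
RELATION_ALIASES = {
    "file no": "file_number",
    "file number": "file_number",
    "pages": "page_count",
    "number of pages": "page_count",
}


def build_standard_dictionary(triples, aliases):
    """Map every raw relation name seen in the triples to one canonical name."""
    table = {}
    for _, relation, _ in triples:
        key = relation.strip().lower()
        # Unknown relations fall back to a snake_case form of the raw name.
        table[relation] = aliases.get(key, key.replace(" ", "_"))
    return table


triples = to_triples(
    "archive-001",
    {"File No": "031-053", "Pages": "3", "Archive Type": "personnel"},
)
standard_dictionary = build_standard_dictionary(triples, RELATION_ALIASES)
normalized = [(e, standard_dictionary[r], v) for e, r, v in triples]
print(normalized)
```

Keeping the raw-to-canonical mapping as a separate table, instead of rewriting the triples in place, lets the same table be reused when later archives introduce the same raw names.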

Further, performing data fusion on the entity information of step C according to the standard dictionary table in step D to form structured data is specifically:

mapping the key entities against the contents of the built standard dictionary table while retaining the attribute relationships of the key entities, to form structured data.

The beneficial effects of the invention are as follows: the invention obtains paper archive image information and processes it into standardized text data, extracts the entity information of key entities, fuses the entity information by building a standard dictionary table to form structured data, builds a knowledge graph with the structured data as knowledge entries, and obtains the paper archive content from the knowledge graph. This improves the efficiency of paper archive digitization while reducing the error rate.

Description of the drawings

Fig. 1 is a flow diagram of the knowledge-graph-based paper archive digitization method of the invention.

Detailed description of the embodiments

To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawing and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

Fig. 1 shows the flow diagram of the knowledge-graph-based paper archive digitization method of the invention. A paper archive digitization method based on a knowledge graph comprises:

A. Obtaining the image information of the paper archive that needs to be digitized.

In this embodiment, the paper archive to be digitized is scanned with a scanner to obtain the scanned image of the paper archive.

B. Performing lexical, syntactic and/or semantic analysis on the paper archive image information of step A to obtain standardized text data.

In this embodiment, lexical, syntactic and/or semantic analysis means performing operations such as structuring and word segmentation on the raw text data of a designated field, based on lexical, syntactic and/or semantic analysis.

C. Extracting the entity information of key entities from the standardized text data of step B.

In this embodiment, entities are named-entity words, event names and the like; attributes are nouns that modify a named entity, such as age, gender or personal relationships. The relationship between an entity and its attributes is obtained mainly by computing co-occurrence probabilities and extracting the attribute word that co-occurs with the entity with the highest probability. Relationships between entities are extracted on the one hand from co-occurrence probabilities within sentences, and on the other hand from the entity-attribute relationships already identified.
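The co-occurrence rule can be illustrated with a small sketch. The patent does not say how the probability is estimated, so the sentence-level conditional frequency used here (the share of sentences mentioning an entity that also mention the attribute word) is an assumption, as are the function name and the sample data.

```python
from collections import Counter


def best_attribute_per_entity(sentences, entities, attribute_words):
    """For each entity, pick the attribute word with the highest estimated
    P(attribute in sentence | entity in sentence)."""
    entity_count = Counter()
    pair_count = Counter()
    for tokens in sentences:
        present = set(tokens)
        for entity in entities & present:
            entity_count[entity] += 1
            for attr in attribute_words & present:
                pair_count[(entity, attr)] += 1

    result = {}
    for entity in entities:
        candidates = {
            attr: pair_count[(entity, attr)] / entity_count[entity]
            for attr in attribute_words
            if pair_count[(entity, attr)] > 0
        }
        if candidates:
            # Sorting first makes ties resolve alphabetically.
            best = max(sorted(candidates), key=candidates.get)
            result[entity] = (best, candidates[best])
    return result


# Hypothetical tokenized sentences from the standardized text data.
sentences = [
    ["Zhang", "age", "35"],
    ["Zhang", "gender", "male"],
    ["Zhang", "age", "recorded"],
    ["Li", "gender", "female"],
]
result = best_attribute_per_entity(sentences, {"Zhang", "Li"}, {"age", "gender"})
print(result)
```

The same counting scheme could be applied to pairs of entities to estimate entity-to-entity relationships from sentence co-occurrence.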

In an optional embodiment of the invention, step B of the above embodiment further comprises:

To divide the raw text data into paragraph structures quickly and accurately, this embodiment structures the raw text data by distinguishing paragraphs such as the title, body, author, time and category. Specifically, the document structure of the raw text data can be determined from document structure distribution features such as the position, length and textual content of the text. Alternatively, a small training corpus can be annotated manually, a paragraph classifier model can be built from the above features to classify the paragraphs, and the predicted class can be used as the paragraph attribute.
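A minimal rule-based sketch of paragraph structure division using the three feature types named above (position, length and textual content) is shown below. The labels, the 60-character threshold, the date pattern and the "Author:" prefix are illustrative assumptions; a trained paragraph classifier, as the text describes, would replace these hand-written rules.

```python
import re

# Assumed date format for the "time" paragraph; real archives may differ.
DATE_PATTERN = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")


def classify_paragraph(text, index):
    """Label one paragraph from its content, position and length."""
    stripped = text.strip()
    if DATE_PATTERN.fullmatch(stripped):                 # content feature
        return "time"
    if index == 0 and len(stripped) <= 60:               # position + length
        return "title"
    if stripped.lower().startswith(("author:", "by ")):  # content feature
        return "author"
    return "body"


def divide_paragraph_structure(paragraphs):
    """Return (label, paragraph) pairs for a list of paragraphs."""
    return [(classify_paragraph(p, i), p) for i, p in enumerate(paragraphs)]


paragraphs = [
    "Personnel Transfer Notice",
    "Author: Records Office",
    "2018-02-05",
    "This notice records the transfer of one staff member between departments.",
]
labels = [label for label, _ in divide_paragraph_structure(paragraphs)]
print(labels)
```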

To process the divided paragraphs quickly and accurately, this embodiment determines the language of the raw text data. If the raw text data is a Chinese resource, Chinese word segmentation, part-of-speech tagging, phrase recognition and the like are performed on it; open-source tools may be used for the lexical, syntactic and/or semantic analysis of Chinese. If the text data is a foreign-language resource, lexical, syntactic and/or semantic analysis is performed on the foreign-language resource with tools for the corresponding language. For example, stemming, lemmatization, phrase recognition and the like are performed on English resources, which means removing tense and word suffixes and restoring each word to its base form. Open-source tools may likewise be used for the lexical, syntactic and/or semantic analysis of English resources.
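The language branch can be sketched with the standard library alone. Everything below is a simplified stand-in: the CJK-ratio language test, the character-level split used in place of real Chinese word segmentation, and the three-suffix stemmer are assumptions for illustration, not the open-source tools the text refers to.

```python
import unicodedata


def is_chinese(text, threshold=0.5):
    """Treat text as Chinese if at least `threshold` of its letters are CJK ideographs."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(letters) >= threshold


def remove_punctuation(text):
    """Drop every character whose Unicode category is punctuation (P*)."""
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))


def naive_stem(word):
    """Toy stemmer: strip one common English suffix, keeping at least 3 letters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(paragraph):
    """Return (language tag, tokens) for one paragraph."""
    cleaned = remove_punctuation(paragraph)
    if is_chinese(cleaned):
        # Placeholder: one token per character instead of true word segmentation.
        return "zh", [ch for ch in cleaned if not ch.isspace()]
    return "foreign", [naive_stem(w.lower()) for w in cleaned.split()]


print(preprocess("档案，已扫描。"))
print(preprocess("Scanned archives, stored."))
```

A production pipeline would substitute a proper segmenter and a proper stemmer or lemmatizer at the two marked points; the branching on language and the punctuation removal would stay the same.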

In an optional embodiment of the invention, step C of the above embodiment further comprises:

classifying the words in the standardized text data using a pre-trained noun classifier model, and identifying and extracting the nouns of each category and the relationships between the nouns according to the classification results. Specifically, the relationships between nouns can be determined from their co-occurrence probabilities within sentences.

To extract knowledge from the standardized text data quickly and accurately, this embodiment observes existing data and determines the structural features of each category of noun from features such as the noun's beginning word, ending word and length. Nouns of the corresponding categories and the relationships between them are then extracted from the standardized text data according to these structural features, which yields the entity information.

In an optional embodiment of the invention, step D of the above embodiment further comprises:

unifying the relationship types and naming rules of the entity attributes and the key entities according to the triple data, to obtain a standard dictionary table with a standard specification;

mapping the key entities against the contents of the built standard dictionary table while retaining the attribute relationships of the key entities, to form structured data, specifically:

judging whether the entity information conforms to the standard specification. If it does, data fusion is performed on the entity information according to the standard dictionary table: the entity name is mapped against the contents of the standard dictionary table to obtain the identical entity name and the attribute information of that identical entity name, forming structured data. If it does not, relationship mapping is performed on the entity information according to professional knowledge classification to form structured data. Here the entity information includes the entity name and the entity attribute information. Using the entity name as the index, the entity is mapped against the contents of the standard dictionary table to obtain the identical entity name and its attribute information; then, following the unified specification for entity names and inter-entity relationships in the standard dictionary table, the attribute information of the entity name and the attribute information of the identical entity name are fused together.
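The fusion logic can be sketched as a lookup followed by an attribute merge. The dictionary contents, the record layout and the policy of keeping the first value when two records disagree are assumptions for this example. Non-conforming entities are only collected here, whereas the text maps them by professional knowledge classification, which is not modeled.

```python
# Hypothetical standard dictionary table: lower-cased alias -> canonical entity name.
STANDARD_DICTIONARY = {
    "finance dept": "Finance Department",
    "finance department": "Finance Department",
    "records office": "Records Office",
}


def fuse_entities(entity_infos, dictionary):
    """Merge attribute information of entities that map to the same canonical name.

    Returns (fused, unmatched): fused maps canonical names to merged attributes;
    unmatched holds entity records whose names are not in the dictionary.
    """
    fused = {}
    unmatched = []
    for info in entity_infos:
        canonical = dictionary.get(info["name"].strip().lower())
        if canonical is None:
            unmatched.append(info)  # would go to knowledge-based mapping
            continue
        merged = fused.setdefault(canonical, {})
        for key, value in info["attributes"].items():
            merged.setdefault(key, value)  # keep the first value on conflict
    return fused, unmatched


entity_infos = [
    {"name": "Finance Dept", "attributes": {"archive_type": "accounting"}},
    {"name": "Finance Department",
     "attributes": {"page_count": "3", "archive_type": "ledger"}},
    {"name": "Unknown Unit", "attributes": {"page_count": "1"}},
]
fused, unmatched = fuse_entities(entity_infos, STANDARD_DICTIONARY)
print(fused)
print(unmatched)
```

Using the entity name as the lookup key mirrors the text's use of the entity name as the index into the standard dictionary table.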

Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principle of the invention, and that the scope of protection of the invention is not limited to these specific statements and embodiments. Those of ordinary skill in the art can, based on the technical teachings disclosed by the invention, make various other specific variations and combinations that do not depart from the essence of the invention, and these variations and combinations remain within the scope of protection of the invention.

Claims

1. A paper archive digitization method based on a knowledge graph, characterized by comprising:

D. building a standard dictionary table, and performing data fusion on the entity information of step C according to the standard dictionary table to form structured data;

2. The paper archive digitization method based on a knowledge graph according to claim 1, characterized in that performing lexical, syntactic and/or semantic analysis on the paper archive image information of step A in step B to obtain standardized text data is specifically:

performing document structure classification on the paragraphs of the paper archive image information of step A using a pre-trained paragraph classifier model, and dividing the paper archive image information into paragraph structures according to the classification results;

if the paper archive image information is a Chinese resource, performing word segmentation, part-of-speech tagging and phrase recognition on each divided paragraph structure, and removing the punctuation marks in the paragraph structure;

if the paper archive image information is a foreign-language resource, performing stemming, lemmatization and phrase recognition on each divided paragraph structure, and removing the punctuation marks in the paragraph structure.

3. The paper archive digitization method based on a knowledge graph according to claim 2, characterized in that extracting the entity information of key entities from the standardized text data of step B in step C is specifically:

classifying the words in the standardized text data using a pre-trained noun classifier model, and identifying and extracting the nouns of each category and the relationships between the nouns according to the classification results.

4. The paper archive digitization method based on a knowledge graph according to claim 3, characterized in that building the standard dictionary table in step D is specifically:

unifying the relationship types and naming rules of the entity attributes and the key entities according to triple data, to obtain a standard dictionary table with a standard specification.

5. The paper archive digitization method based on a knowledge graph according to claim 4, characterized in that performing data fusion on the entity information of step C according to the standard dictionary table in step D to form structured data is specifically:

mapping the key entities against the contents of the built standard dictionary table while retaining the attribute relationships of the key entities, to form structured data.