CN107491730A

CN107491730A - A laboratory test report recognition method based on image processing

Info

Publication number: CN107491730A
Application number: CN201710575858.4A
Authority: CN
Inventors: 尹建伟; 岑超; 赵景晨; 邓水光; 李莹; 吴健; 吴朝晖
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2017-07-14
Filing date: 2017-07-14
Publication date: 2017-12-19

Abstract

The invention discloses a laboratory test report recognition method based on image processing. Based on an investigation and analysis of the structure of laboratory test reports, it provides a set of algorithms that accurately segment each region of a report and clean it up effectively, and it specifies in detail how a report photo taken with a mobile phone is processed step by step to obtain clear images, which are then recognized with a mature open-source OCR engine. Every stage of the report image processing flow has been considered thoroughly and optimized for performance, improving the efficiency of image processing. After recognition, the invention uses a test item dictionary built from a test item information database to select the recognition engine model automatically and to correct the recognition results intelligently, improving the accuracy of the laboratory test report recognition results.

Description

A laboratory test report recognition method based on image processing

Technical field

The invention belongs to the field of medical OCR technology, and in particular relates to a laboratory test report recognition method based on image processing.

Background technology

OCR (Optical Character Recognition) refers to the process of analyzing and recognizing image files of text material to obtain the text and the layout information. It uses optical and computer technology to read the text in an image and convert it into a form that a computer can understand. OCR is a key technology for high-speed text entry.

China's research on OCR started relatively late. Research on recognizing digits, English letters and symbols began only in the 1970s, and research on Chinese character recognition began in the late 1970s. Since then, OCR has been widely applied in industries such as news, printing, publishing, libraries and office automation, greatly improving the efficiency and accuracy of form and document processing and saving manpower, material and financial resources.

Printed-text OCR has now reached a fairly high level: the recognition rate for printed Chinese characters exceeds 98%, and even for poorly printed text it exceeds 95%. With the spread of smartphones and the continued development of the mobile application ecosystem, OCR is also used in mobile applications, for example document recognition, bank card recognition, receipt recognition, business card recognition, passport recognition and identity card recognition.

Storing documents electronically not only avoids the trouble of carrying and losing paper documents, but also makes further data analysis more convenient. Today hospitals, clinics, pharmaceutical companies and health institutions, as well as scholars doing research in the health field, all rely increasingly on data to make decisions and judgments. Big data is bringing fundamental change to health care, and all of it depends on data storage. Although hospitals have their own structured data, the data cannot easily be exchanged between different departments; moreover, a patient usually cannot be guaranteed to have all examinations done at the same hospital. In short, under current conditions, entering laboratory test report data remains an unavoidable prerequisite.

OCR is an efficient and feasible technical solution. Delivered through a mobile application, it greatly lowers the barrier to use, and having users enter the data themselves saves substantial labor cost. Users will not, however, enter their own laboratory test report information for no reason, so a corresponding service must be provided to motivate them. After a routine examination at a hospital, people usually want to understand what the examined indicators say about their physical condition, but have no authoritative doctor or medical team to interpret the report. Report recognition as the tool and report interpretation as the service therefore complement each other and meet a pressing market demand.

Summary of the invention

In view of the above, the invention provides a laboratory test report recognition method based on image processing that can automatically recognize the content of a laboratory test report efficiently and accurately.

A laboratory test report recognition method based on image processing comprises the following steps:

(1) perform edge recognition on the laboratory test report photo taken with a mobile phone to obtain the quadrilateral outline of the report;

(2) crop and rectify the photo by perspective transformation based on the quadrilateral outline to obtain the laboratory test report image;

(3) correct the skew of the laboratory test report image based on the probabilistic Hough transform;

(4) extract the dividing lines in the corrected image and, according to them, divide the image into upper, middle and lower regions, corresponding respectively to patient information, test item information, and doctor and verification/audit information;

(5) further segment the test item information region of the corrected image according to row and column information, and binarize the segmented images;

(6) classify, recognize and intelligently correct the segmented binary images using the LSTM (Long Short-Term Memory) model in the open-source OCR engine Tesseract.

The detailed process of edge recognition on the laboratory test report photo in step (1) is as follows:

1.1 resample the laboratory test report photo to obtain a thumbnail;

1.2 preprocess the thumbnail by applying, in order, dilation, fast edge detection based on structured forests, erosion and binarization, to obtain an edge information image;

1.3 perform line detection on the edge information image using the Hough transform, with added line screening based on local maxima and an adaptive threshold, and merge similar lines;

1.4 compute the intersections between lines by the vector method, find all quadrilaterals enclosed by the lines by traversing four-tuples of intersections, and take the quadrilateral with the largest sum of the weights of its four sides as the quadrilateral outline of the report, where the weight of a side is the number of points on the line containing it.

The detailed process of skew correction of the laboratory test report image based on the probabilistic Hough transform in step (3) is as follows:

3.1 scale the laboratory test report image to a width of 1200;

3.2 convert the scaled image to grayscale and apply illumination correction and binarization: correct the illumination distribution by subtracting the mean-filtered image from the original image and adding back the mean brightness, then enhance the contrast with contrast-limited adaptive histogram equalization, and finally binarize the image;

3.3 erode the binary image and extract the edge pixels of the eroded binary image with the Canny operator to obtain the corresponding edge information image;

3.4 perform line detection on the edge information image based on the probabilistic Hough transform, and correct the skew of the laboratory test report image according to the average inclination angle of all detected lines.

In step (4), the dividing lines in the corrected laboratory test report image are extracted using the LSD (Line Segment Detector) line segment detection algorithm.

The specific implementation of step (5) is as follows:

5.1 scale the corrected laboratory test report image to obtain a reduced image and binarize it;

5.2 count the black pixels in each row of the test item information region of the binary image; the counts alternate between peaks and valleys, each valley being the blank space between rows; divide the region into rows according to these statistics and delete the rows whose black pixel count is below a threshold;

5.3 using the row information obtained in step 5.2, measure and sort the widths of the blank gaps within each row; the gaps between characters are small and account for the vast majority, whereas the gaps between the columns of the test items are large and only a few in number; the character spacing is determined from this, and a threshold larger than the character spacing is set as the minimum column spacing;

5.4 count the black pixels in each column of the test item information region of the binary image; the counts alternate between peaks and valleys, each valley being the blank space between columns; divide the region into columns according to these statistics and the minimum column spacing;

5.5 using the row and column information obtained in the steps above, further segment the test item information region of the corrected image, and binarize the segmented images with the optimal global threshold selected by OTSU (the maximum between-class variance method).

The specific implementation of step (6) is as follows:

6.1 recognize the patient information region using the document mode of Tesseract to obtain a result string; split the string on whitespace; use regular expression matching to find the character blocks that contain an information item name, the block immediately following each being the value of that item; then clean up and structure the parsed patient information to obtain the final patient information object;

6.2 batch-recognize the segmented test item information region using the block mode of Tesseract to obtain the recognition results for the test items;

6.3 build a test item information database containing test item names, test item aliases and test item measured value types (text, number, etc.), and use it to intelligently correct the test item recognition results;

6.4 for each test item, obtain its measured value type from the test item information database and recognize its measured value using the Tesseract engine configuration corresponding to that type, obtaining the recognition result of the measured value.

The specific process of intelligently correcting the test item recognition results using the test item information database in step 6.3 is as follows:

6.3.1 first build a dictionary of test item names from the test item names and aliases in the test item information database;

6.3.2 for the recognition result of any test item, if the result is present in the dictionary, end the correction; if it is not, compute the normalized edit distance between the result and every entry in the dictionary, select the entries whose edit distance is less than 0.8 to form the correction candidate list, and sort the list by edit distance in ascending order;

6.3.3 take the entry with the smallest edit distance in the correction candidate list as the candidate word; if the candidate word is present in the dictionary, replace the recognition result with it directly; if it is not, remove it from the candidate list and repeat step 6.3.3.
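The correction procedure in steps 6.3.1 to 6.3.3 can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the text does not say how the edit distance is normalized, so dividing the Levenshtein distance by the length of the longer string is an assumption, and the function names and English item names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    # Assumption: normalize by the longer string so the value lies in [0, 1].
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def correct_item_name(recognized: str, dictionary: set, max_distance: float = 0.8) -> str:
    """Return the dictionary entry closest to `recognized`, or the input unchanged."""
    if recognized in dictionary:
        return recognized  # step 6.3.2: already a known name, nothing to correct
    candidates = sorted(
        (normalized_distance(recognized, entry), entry) for entry in dictionary
    )
    candidates = [(d, e) for d, e in candidates if d < max_distance]
    return candidates[0][1] if candidates else recognized
```

A result such as `hemog1obin` would be replaced by `hemoglobin`, while a string with no entry closer than 0.8 is left as recognized.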

The advantageous effects of the invention are as follows:

(1) The invention introduces a highly reliable laboratory test report region detection method that combines fast edge detection based on structured forests, an improved Hough transform and quadrilateral detection.

(2) The invention introduces skew correction of the laboratory test report image based on line segment detection, and uses the dividing lines in the image to divide the report into regions, applying different processing methods to different regions and thereby maximizing the accuracy of report recognition.

(3) The invention builds a test item dictionary from the test item information database and on that basis selects the recognition engine model automatically and corrects the recognition results intelligently, improving recognition accuracy.

Brief description of the drawings

Fig. 1 is a schematic diagram of a system implementing the method of the invention.

Fig. 2 is a schematic flow chart of the method of the invention.

Embodiment

To describe the invention more specifically, the technical solution of the invention is described in detail below with reference to the accompanying drawings and an embodiment.

As shown in Fig. 1 and Fig. 2, the laboratory test report recognition method based on image processing of the invention comprises the following steps:

(1) Laboratory test report edge recognition.

After the laboratory test report picture taken with the mobile phone is obtained, the body of the report must first be identified against the background and separated from it, which depends on accurate detection of the report borders. This embodiment detects the four borders of the report through resampling, preprocessing and edge detection, line detection, intersection computation and quadrilateral detection in turn, as follows:

1.1 Image resampling. The resolution of photos taken by different mobile phone cameras varies greatly; processing them directly would make the algorithm unstable, and high-resolution pictures require much more computation. Resampling unifies the picture width to 640 pixels, which ensures that all input pictures are of similar size and reduces the computation required. All subsequent processing is performed on the resampled thumbnail.

1.2 Preprocessing and edge detection. This step extracts the key contours in the image using several image processing methods together with fast edge detection based on structured forests. The main steps are as follows:

1.2.1 Dilation. First dilate the image to reduce the interference of fine detail with edge detection while retaining the main features and the border contour of the report. Dilation is an important morphological image processing operation. For sets A and B in the two-dimensional integer space $Z^2$, the dilation of A by B is defined as:

$$A \oplus B = \{\, z \mid (\hat{B})_z \cap A \neq \varnothing \,\}$$

where $\hat{B}$ is the reflection of set B, B is a structuring element (kernel), and A is the set being dilated. The formula is based on reflecting B about its origin and translating the reflection by z: the dilation is the set of all translations z for which the translated reflection overlaps A. In this embodiment B is a 3 × 3 rectangular kernel.

1.2.2 Fast edge detection based on structured forests. Compared with traditional edge detectors such as the Canny and Sobel operators, the edge detection method based on structured forests [Dollar2013] uses structured learning to exploit the inherent structure of edges. It highlights the important edges in the picture and reduces the influence of picture detail on contour recognition. Contours at different levels appear as different gray levels in the result, which helps reduce the difficulty and the computation of line detection and improves accuracy.

1.2.3 Erosion. Erode the detected edge image to thin the contours, remove fine detail and minor contours, and retain the contours of the main objects. Erosion is the opposite of dilation. For sets A and B in the two-dimensional integer space $Z^2$, the erosion of A by B is defined as:

$$A \ominus B = \{\, z \mid (B)_z \subseteq A \,\}$$

that is, the erosion of A by B is the set of all points z such that B translated by z is contained in A.
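To make the two set definitions concrete, here is a minimal pure-Python sketch of dilation and erosion with a square kernel (3 × 3 when `radius=1`). It is an illustration only: the function names are hypothetical, neighbours outside the image are simply ignored (an assumed border policy), and because the kernel is symmetric its reflection equals itself. Taking the maximum or minimum over the window gives the set definitions for binary images and extends naturally to grayscale images.

```python
def _window(img, y, x, radius):
    """Values of the pixels covered by the kernel centred on (y, x)."""
    h, w = len(img), len(img[0])
    return [img[j][i]
            for j in range(max(0, y - radius), min(h, y + radius + 1))
            for i in range(max(0, x - radius), min(w, x + radius + 1))]


def dilate(img, radius=1):
    # A pixel is set if the kernel centred on it overlaps any set pixel.
    return [[max(_window(img, y, x, radius)) for x in range(len(img[0]))]
            for y in range(len(img))]


def erode(img, radius=1):
    # A pixel survives only if every pixel under the kernel is set.
    return [[min(_window(img, y, x, radius)) for x in range(len(img[0]))]
            for y in range(len(img))]
```

Dilating a single set pixel produces a 3 × 3 block, and eroding that block returns the single pixel.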

1.2.4 Binarization. Apply optimal global thresholding to the eroded edge image using the maximum between-class variance (OTSU) algorithm to obtain a purely black-and-white binary image as the input to line detection. The algorithm follows a clustering idea: it assumes that the image has a bimodal histogram containing two classes of pixels (foreground and background) and computes the optimal threshold separating the two classes, so that their within-class variance is minimal or, equivalently, their between-class variance is maximal. The steps are as follows:

1.2.4.1 compute the normalized histogram of the input image;

1.2.4.2 traverse all possible thresholds t = 1...255 and compute the between-class variance $\sigma_b^2(t) = \omega_1(t)\,\omega_2(t)\,[\mu_1(t) - \mu_2(t)]^2$, where the class probabilities $\omega_i$ and the class means $\mu_i$ are all computed from the histogram;

1.2.4.3 the threshold $t_{opt}$ at which the between-class variance is maximal is the optimal threshold;

1.2.4.4 with $t_{opt}$ as the global threshold, set the pixels whose gray value is below $t_{opt}$ to the minimum gray level 0 and the pixels whose gray value is above $t_{opt}$ to the maximum gray level 255, giving the binarized image.
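Steps 1.2.4.1 to 1.2.4.4 can be sketched in pure Python as below. This is an illustrative sketch under stated assumptions: the input is a flat list of 8-bit gray values, the function names are hypothetical, and a pixel exactly equal to the threshold is assigned to the bright class, a tie-breaking choice the text does not specify.

```python
def otsu_threshold(pixels):
    """Return the threshold t in 1..255 that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    prob = [c / total for c in hist]                   # normalized histogram
    total_mean = sum(i * p for i, p in enumerate(prob))

    best_t, best_var = 0, -1.0
    w0 = 0.0    # probability of the dark class (levels < t)
    cum = 0.0   # first moment of the dark class
    for t in range(1, 256):
        w0 += prob[t - 1]
        cum += (t - 1) * prob[t - 1]
        w1 = 1.0 - w0
        if w0 <= 0.0 or w1 <= 0.0:
            continue                                   # one class is empty
        mu0 = cum / w0
        mu1 = (total_mean - cum) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    # Levels below t become 0; all others become 255 (assumed tie-breaking).
    return [0 if p < t else 255 for p in pixels]
```

For an image with two well-separated gray levels, any threshold between them gives the same variance; the sketch returns the first such threshold.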

1.3 Line detection based on the Hough transform. The Hough transform uses global features of the image to connect the edge pixels of a given shape into continuous, smooth edges. It maps the points in the two-dimensional coordinates of the original image into the polar parameter space and accumulates votes there, thereby recognizing analytically defined curves. Because it uses global image properties, it is little affected by noise and broken boundaries and is robust; it is a commonly used line detection method.

This embodiment improves on the Hough transform by adding line screening based on local maxima and a threshold. While preserving line detection accuracy, similar lines are merged, so that the detected lines better describe the structure of the picture. The main steps are as follows:

1.3.1 Map every white point in the two-dimensional image coordinates to a curve in the polar parameter space:

$$r = x\cos\theta + y\sin\theta$$

where $\theta \in [0, \pi]$, $r \in [0, (w+h) \cdot 2 + 1]$, and w and h are the width and height of the picture.

1.3.2 Accumulate, for each point $(\theta, r)$ in the parameter space, the number $v_{\theta,r}$ of curves passing through it, i.e. its weight.

1.3.3 Filter all points, retaining only the points $(\theta, r)$ that satisfy the following requirements:

① $v_{\theta,r} > 0.15\,v_{max}$, where $v_{max}$ is the maximum accumulated count over all points;

② $v_{\theta,r} = \max\{\, v_{x,y} \mid (x, y) \in S_{\theta,r,k} \,\}$, where $S_{\theta,r,k}$ is the square region centered on $(\theta, r)$ with side length k, i.e. $v_{\theta,r}$ is a local maximum;

③ $v_{\theta-1,r} > 0.15\,v_{max} \wedge v_{\theta+1,r} > 0.15\,v_{max}$; this condition removes isolated points.

1.3.4 Sort all valid points of the parameter space by weight $v_{\theta,r}$. Each point $(\theta, r)$ corresponds to the straight line $x\cos\theta + y\sin\theta = r$ in the two-dimensional image space; the ten lines with the highest weights are taken as the candidate lines.
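The accumulation and the three screening conditions can be sketched as follows. This is a simplified illustration rather than the method of the invention: θ is quantized to whole degrees, r is rounded to an integer and kept as a signed dictionary key instead of an offset array index, and the function name and the default window size `k=5` are assumptions. Merging of similar lines is not shown.

```python
import math
from collections import Counter


def hough_candidates(points, k=5, ratio=0.15, top_n=10):
    """Return up to top_n (votes, theta_deg, r) tuples, strongest first."""
    acc = Counter()
    for x, y in points:                       # every white edge pixel votes
        for theta in range(180):
            rad = math.radians(theta)
            r = round(x * math.cos(rad) + y * math.sin(rad))
            acc[(theta, r)] += 1
    if not acc:
        return []

    threshold = ratio * max(acc.values())
    half = k // 2
    kept = []
    for (theta, r), votes in acc.items():
        if votes <= threshold:                # condition 1: global threshold
            continue
        window_max = max(acc.get((theta + dt, r + dr), 0)
                         for dt in range(-half, half + 1)
                         for dr in range(-half, half + 1))
        if votes < window_max:                # condition 2: local maximum
            continue
        if (acc.get((theta - 1, r), 0) <= threshold
                or acc.get((theta + 1, r), 0) <= threshold):
            continue                          # condition 3: drop isolated points
        kept.append((votes, theta, r))
    kept.sort(reverse=True)
    return kept[:top_n]
```

For twenty collinear points on the horizontal line y = 5, the strongest candidate has 20 votes, r = 5 and θ within one degree of 90°.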

1.4 Intersection computation and quadrilateral detection. After the main lines in the picture have been detected, and given that the report appears as a quadrilateral in the picture, all quadrilaterals enclosed by the lines are traversed to find the one most likely to be the report region, as follows:

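1.4.1 Find the line intersections within the image. Compute the pairwise intersections of all detected lines by the vector method and retain all intersections that lie within the image. In terms of two points on each line, the intersection (x, y) is:

$$x = \frac{(x_1 y_2 - y_1 x_2)(x_3 - x_4) - (x_1 - x_2)(x_3 y_4 - y_3 x_4)}{(x_1 - x_2)(y_3 - y_4) - (y_1 - y_2)(x_3 - x_4)}, \qquad y = \frac{(x_1 y_2 - y_1 x_2)(y_3 - y_4) - (y_1 - y_2)(x_3 y_4 - y_3 x_4)}{(x_1 - x_2)(y_3 - y_4) - (y_1 - y_2)(x_3 - x_4)}$$

where $(x_1, y_1)$, $(x_2, y_2)$ are points on line 1 and $(x_3, y_3)$, $(x_4, y_4)$ are points on line 2.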
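A minimal sketch of the pairwise intersection computation, given two points on each line, is shown below. It uses the standard determinant form, which may differ in presentation from the vector formulation of the original; the function names and the `eps` tolerance for parallel lines are assumptions.

```python
def intersection(p1, p2, p3, p4, eps=1e-9):
    """Intersection of the line through p1, p2 with the line through p3, p4.

    Returns None when the lines are parallel (or nearly so).
    """
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(denom) < eps:
        return None
    a = x1 * y2 - y1 * x2          # cross product of the points on line 1
    b = x3 * y4 - y3 * x4          # cross product of the points on line 2
    x = (a * (x3 - x4) - (x1 - x2) * b) / denom
    y = (a * (y3 - y4) - (y1 - y2) * b) / denom
    return x, y


def in_image(point, width, height):
    """Keep only intersections that fall inside the picture."""
    x, y = point
    return 0 <= x < width and 0 <= y < height
```

The two diagonals of the square with corners (0, 0) and (2, 2) meet at (1, 1), while two horizontal lines yield `None`.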

1.4.2 Find all candidate quadrilaterals. Traverse all possible four-tuples of intersections to find every quadrilateral enclosed by the lines. Compute the area of each quadrilateral and discard all quadrilaterals whose area is less than 10% of the picture area.

1.4.3 Weight ranking. Sort all quadrilaterals by the sum of the weights of their four sides; the quadrilateral with the largest weight sum is taken as the border region of the report.
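The area filter and the weight ranking of steps 1.4.2 and 1.4.3 can be sketched as below. The enumeration of quadrilaterals from the intersections is not shown; the sketch assumes each candidate is already available as a dictionary with hypothetical keys `vertices` (four corners in order around the shape) and `weights` (the Hough weights of its four sides).

```python
def polygon_area(vertices):
    """Area by the shoelace formula; vertices must be in order around the polygon."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def pick_report_quad(candidates, width, height, min_ratio=0.10):
    """Drop quadrilaterals below 10% of the picture area, then take the heaviest."""
    min_area = min_ratio * width * height
    valid = [c for c in candidates if polygon_area(c["vertices"]) >= min_area]
    return max(valid, key=lambda c: sum(c["weights"]), default=None)
```

The area filter matters because a small quadrilateral formed by strong lines, such as a table cell, could otherwise outrank the report border.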

(2) Image cropping and rectification based on perspective transformation.

From the line information of the four sides of the report obtained by the edge recognition of step (1), the coordinates of the four vertices of the report are obtained by computing the pairwise intersections. A transformation matrix is built from the vertex coordinates and the picture dimensions, and a perspective transformation is applied to the image.

Perspective transformation projects the picture onto a new viewing plane, using the condition that the center of perspective, the image point and the target point are collinear. Its general transformation formula is:

$$[x', y', w'] = [u, v, w] \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$

where u, v are the coordinates in the original image, and the image coordinates after transformation are $x = x'/w'$, $y = y'/w'$.
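As an illustration of how a point is mapped once the 3 × 3 transformation matrix is known, the sketch below applies the matrix in the row-vector convention [x', y', w'] = [u, v, 1] · M and then divides by w'. The function name is hypothetical, setting w = 1 for the input point is the usual homogeneous-coordinate assumption, and solving for the matrix from the four vertex correspondences is not shown.

```python
def warp_point(u, v, m):
    """Map original coordinates (u, v) through a 3x3 perspective matrix m.

    m is a list of three rows; the point is treated as the row vector [u, v, 1].
    """
    xp = u * m[0][0] + v * m[1][0] + m[2][0]
    yp = u * m[0][1] + v * m[1][1] + m[2][1]
    wp = u * m[0][2] + v * m[1][2] + m[2][2]
    return xp / wp, yp / wp        # divide by w' to leave homogeneous coordinates
```

With the identity matrix the point is unchanged; a non-zero entry in the third column makes the scale depend on position, which is what removes the keystone distortion of a photographed page.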

(3) Skew correction based on the probabilistic Hough transform.

After the perspective transformation the edges of the report are essentially horizontal and vertical, but the content of the report may still not be parallel to the edges of the sheet. Further skew correction is therefore needed, which also largely removes the slight errors left by the perspective crop and rectification. This embodiment uses a skew correction method based on the probabilistic Hough transform, as follows:

3.1 Scale the original report image. Let the original image have height H and width W; scaling while preserving the aspect ratio gives a reduced image of width 1200 and height $1200 \cdot H / W$. The subsequent image processing that obtains the segmentation positions is performed on the reduced image; provided the important information in the picture is not lost, a smaller picture is processed more efficiently.

3.2 Mobile phone photos are easily affected by the lighting angle, so the brightness is distributed unevenly across the regions of the picture, which significantly degrades global binarization. To solve this problem, this embodiment applies mean filtering and image superposition, combined with contrast-limited adaptive histogram equalization (CLAHE), to correct the uneven brightness before binarization and ensure a uniform brightness distribution, as follows:

3.2.1 Arithmetic mean filtering. First apply arithmetic mean filtering to the picture to obtain a rough sample of the brightness distribution. The arithmetic mean filter is a spatial filter that replaces the value of a pixel with the average gray level of its neighborhood, i.e.:

$$f_2(x, y) = \frac{1}{|S|} \sum_{(s, t) \in S} f_1(s, t)$$

where (x, y) is the current pixel coordinate, $f_1$ is the input picture, $f_2$ is the filtered picture, and S is the neighborhood of (x, y); this embodiment uses a 10 × 10 neighborhood.

3.2.2 Image superposition. Compute the average brightness L of the filtered picture, subtract the filtered brightness from the original brightness at each pixel, and add L, which gives the brightness-equalized image:

$$F(x, y) = f_1(x, y) - f_2(x, y) + L$$

where $f_1$ is the original picture and $f_2$ is the filtered picture.
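The illumination correction of steps 3.2.1 and 3.2.2 can be sketched in pure Python as follows. This is an illustration under assumptions: the image is a list of rows of gray values, the filter window is clipped at the image border, the result is rounded and clamped to 0..255, and the function names are hypothetical.

```python
def mean_filter(img, size=10):
    """Arithmetic mean over a size x size neighbourhood, clipped at the borders."""
    h, w = len(img), len(img[0])
    half = size // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - half), min(h, y - half + size))
                    for i in range(max(0, x - half), min(w, x - half + size))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def flatten_illumination(img, size=10):
    """F = f1 - f2 + L: remove the slow brightness trend, keep the mean level."""
    blurred = mean_filter(img, size)                       # f2
    h, w = len(img), len(img[0])
    mean_level = sum(map(sum, blurred)) / (h * w)          # L
    return [[min(255, max(0, round(img[y][x] - blurred[y][x] + mean_level)))
             for x in range(w)]
            for y in range(h)]
```

On a picture that is only a smooth brightness ramp, the interior of the corrected image becomes a constant gray level, which is the behaviour that makes a single global threshold workable afterwards.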

3.2.3 Contrast-limited adaptive histogram equalization. After the steps above, enhance the picture contrast using contrast-limited adaptive histogram equalization.

Ordinary histogram equalization is commonly used to enhance picture contrast, but if the image contains parts that are markedly darker or brighter than the rest, the contrast in those parts is not enhanced effectively. Adaptive histogram equalization addresses this problem by performing histogram equalization on local regions. In CLAHE, contrast limiting is applied to every small region, which prevents noise from being over-amplified.

3.2.4 Apply OTSU binarization.

3.3 Erode the binary image obtained in the previous step using a 7 × 7 rectangular kernel. This operation joins adjacent text regions in the picture, so that the information extracted by the edge detection in the next step better meets the requirement.

3.4 Extract the edge information image from the eroded binary image using the Canny operator. The Canny algorithm convolves the input image with a Gaussian smoothing template to obtain a slightly blurred image, which minimizes the influence of single-pixel noise. It then uses four masks to detect horizontal, vertical and diagonal edges: the input image is convolved with each mask to give four images, and the maximum response and its direction are stored at each pixel. The edges are then determined with hysteresis thresholding: starting from a larger threshold, the edges that are fairly certain are marked; the direction information is then used to trace the whole edge, now with a smaller threshold. The result is a binary image in which each point indicates whether it is an edge.

3.5 Perform probabilistic Hough transform line detection on the edge information image. Randomly select a point in the edge image, map it into the polar parameter space and accumulate votes at the corresponding points. When the accumulated votes of a point in the parameter space reach the threshold, find the two endpoints of the corresponding line; if the segment length exceeds the set threshold, add the segment to the result set. Repeat until all qualifying segments have been found.

3.6 Take the average inclination angle of all detected lines as the skew angle, rather than considering a single line only; this avoids the influence of individual atypical lines and makes the correction more stable. Once the average skew angle is obtained, rotating the image by the opposite angle corrects the skew.
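Step 3.6 can be sketched as below, assuming the detected segments are given as `(x1, y1, x2, y2)` tuples, which is a common output format for probabilistic Hough implementations but is an assumption here. The function name is hypothetical. Angles are folded into (-90°, 90°] so that the direction in which a segment was traced does not matter; a practical implementation would probably also restrict the average to near-horizontal segments, which is not shown.

```python
import math


def mean_skew_angle(segments):
    """Average inclination in degrees of segments given as (x1, y1, x2, y2)."""
    angles = []
    for x1, y1, x2, y2 in segments:
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        # Fold into (-90, 90] so opposite tracing directions agree.
        if angle > 90:
            angle -= 180
        elif angle <= -90:
            angle += 180
        angles.append(angle)
    return sum(angles) / len(angles) if angles else 0.0
```

The image would then be rotated by `-mean_skew_angle(segments)` degrees to undo the skew.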

(4) Extract the dividing line information in the report and cut.

The dividing lines in the report image are extracted using the LSD line segment detection algorithm; these lines divide the report into regions. A report is usually divided by horizontal lines into three main parts: the top is the hospital name and patient information, the middle is the test item information, and the bottom is information such as the submitting doctor, the tester and the auditor. The report can therefore be divided into parts along the dividing lines and each part processed separately. The reduced image is segmented and row and column information is extracted from it, while the original, unscaled image is segmented at the proportionally corresponding positions, to be cut and cleaned up once the row and column information has been obtained.

(5) Segment each part into rows and columns and clean up.

Because of factors such as the user's shooting angle and the lighting, many things in a report photo can affect picture quality. The most common is uneven light and shade, which can make the blank paper in one part of a picture darker than the text in another part. A global image processing method would then easily damage the text. Within a small area, however, a single well-chosen binarization threshold can remove the great majority of the interference that is lighter than the text. Row and column information is therefore extracted first, and the content of each cell is then binarized separately, which gives better results. The specific method is as follows:

5.1 For the test item part of the reduced image, count the black pixels in each row. The counts alternate between peaks and valleys, each valley being the blank space between rows. Divide the test item part into rows according to these statistics, and discard the rows whose black pixel count is below the threshold.
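The row splitting of step 5.1 is a horizontal projection profile, sketched below. The sketch assumes a binary image stored as rows of 0 (black) and 255 (white); the function name and the two thresholds `min_black` and `min_total` are hypothetical parameters, since the text only says that rows below a threshold are discarded.

```python
def split_rows(binary, min_black=1, min_total=0):
    """Return (start, end) row ranges, end exclusive, of the text rows.

    min_black: a scan line counts as ink if it has at least this many black pixels.
    min_total: bands with fewer black pixels in total are discarded as noise.
    """
    counts = [sum(1 for p in row if p == 0) for row in binary]
    bands, start = [], None
    for y, count in enumerate(counts + [0]):     # trailing 0 closes the last band
        if count >= min_black and start is None:
            start = y                            # entering a peak
        elif count < min_black and start is not None:
            bands.append((start, y))             # reached the next valley
            start = None
    return [(a, b) for a, b in bands if sum(counts[a:b]) >= min_total]
```

The same function applied to a transposed image yields the column profile, before the minimum column spacing is taken into account.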

5.2 Using the row information obtained in the previous step, measure and sort the widths of the blank gaps in each row. The gaps between characters within a row are usually similar in width and account for the vast majority, whereas the gaps between the columns of the test items are larger but number only 4 to 5. The typical character spacing is therefore easy to obtain, and a threshold somewhat larger than the character spacing is then taken as the minimum column spacing.

5.3 For the test item part of the reduced image, count the black pixels in each column. The counts alternate between peaks and valleys, each valley being the blank space between columns. According to these statistics, the part is divided into columns at the blanks whose width exceeds the threshold.
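Steps 5.2 and 5.3 can be sketched as a vertical projection in which blank gaps narrower than the minimum column spacing are treated as spaces between characters and merged. The sketch assumes a binary image of 0 (black) and 255 (white) and takes the minimum column spacing as an input; how that value is derived from the sorted gap widths is not specified in enough detail to reproduce, so it is left to the caller. The function name is hypothetical.

```python
def split_columns(binary, min_col_gap):
    """Return (start, end) column ranges, end exclusive.

    Blank runs narrower than min_col_gap are character spacing and do not
    split a column; wider runs separate two columns.
    """
    width = len(binary[0])
    profile = [sum(1 for row in binary if row[x] == 0) for x in range(width)]
    columns = []
    for x, count in enumerate(profile):
        if count == 0:
            continue                              # blank scan column
        if columns and x - columns[-1][1] < min_col_gap:
            columns[-1][1] = x + 1                # small gap: extend current column
        else:
            columns.append([x, x + 1])            # wide gap: start a new column
    return [tuple(c) for c in columns]
```

With a minimum gap of 3, a one-pixel gap between two characters is bridged while a four-pixel gap starts a new column.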

5.4 Map the row and column information from the reduced image back to row and column positions in the original report image by scaling proportionally, segment the test item part of the original image accordingly, and binarize each segment with the optimal global threshold selected by the maximum between-class variance (OTSU) algorithm to obtain clear text images.

(6) Classified recognition and intelligent error correction of the laboratory test report, based on the Tesseract engine and the lab work information database.

Tesseract 4 uses a long short-term memory (LSTM) network for OCR. LSTM is a type of recurrent neural network that is widely used in handwriting recognition, speech recognition, machine translation and other fields. Compared with traditional OCR methods, LSTM can greatly improve the accuracy and speed of OCR.

To make better use of prior knowledge when correcting recognition results, and to apply different recognition schemes to different types of values and thereby improve recognition accuracy, the present embodiment establishes a lab work information database. The main fields relevant to recognition are: test item name, test item alias, and test item measurement value type.

Once the segmented binarized image fragments have been obtained, the patient information, the test item names and the test item measurement values are recognized in turn, each with a different engine configuration. The specific steps are as follows:

6.1 Patient information recognition and parsing. Recognize the patient information region using Tesseract's document mode to obtain the result string, and split the result on whitespace. For each character block, use regular expression matching to find the blocks that contain an information item name (such as name, age, sex, diagnosis, card number, case number, outpatient number, inpatient number, sample type, etc.); the character block that immediately follows is the value of that item. The parsed patient information is then cleaned and structured to obtain the final patient information object.

6.2 Test item name recognition. Recognize the segmented test item name image blocks in batch using Tesseract's block mode, obtaining a list of test item recognition results.

6.3 Intelligent error correction of test item names based on the lab work information database. A dictionary of test item names is built from the item names and aliases in the lab work information database. For each item name recognition result from step 6.2, proceed as follows:

6.3.1 If the recognition result is present in the dictionary, end error correction and remove that item from the dictionary used for this round of error correction.

6.3.2 If the recognition result is not present in the dictionary, compute the normalized edit distance (Normalized Levenshtein Distance) between the recognition result and every entry in the dictionary, and select the entries whose distance is less than 0.8 to form the error correction candidate list.

6.3.3 Sort all entries in ascending order of their edit distance to the recognition result, and take the entry with the smallest distance in the error correction candidate list as the error correction candidate word. If the candidate word is present in the dictionary, end error correction for this item, use the candidate word as the correction result, and remove the candidate word from the dictionary.

6.3.4 If the candidate word is not present in the dictionary, remove the candidate word from the error correction candidate list and repeat step 6.3.3.

The edit distance between two strings is the minimum number of edit operations needed to change one into the other, where the edit operations are character substitution, insertion and deletion. The normalized edit distance is the edit distance divided by the length of the longer string. Specifically:

Define two strings s1 and s2 with lengths len1 and len2 respectively. dp[i][j] denotes the minimum edit distance between s1[0..i] and s2[0..j], where for a string s, s[0..i] denotes the substring of s that starts at index 0 and has length i. The detailed procedure is as follows:

A. Initialize dp[i][j]: if i = 0, dp[i][j] = j; if j = 0, dp[i][j] = i;

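B. State transition equation, for i > 0 and j > 0:

$$dp[i][j] = \min\bigl(dp[i-1][j] + 1,\; dp[i][j-1] + 1,\; dp[i-1][j-1] + c\bigr)$$

where c = 0 if the i-th character of s1 equals the j-th character of s2, and c = 1 otherwise;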

C. For i = 1 → len1 and j = 1 → len2, compute dp[i][j];

D. The minimum edit distance between s1 and s2 is dp[len1][len2];

E. The normalized edit distance is

$$\frac{dp[len1][len2]}{\max(len1,\ len2)}$$
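The dynamic program in A to E and the correction procedure in 6.3.1 to 6.3.4 can be sketched as follows. The sketch uses the standard Levenshtein recurrence with unit costs for substitution, insertion and deletion. The function names are hypothetical, and returning None when no candidate remains is an assumption, since the description does not say what happens in that case.

```python
from typing import List, Optional


def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of substitutions, insertions and deletions (steps A to D)."""
    len1, len2 = len(s1), len(s2)
    dp = [[0] * (len2 + 1) for _ in range(len1 + 1)]
    for i in range(len1 + 1):
        dp[i][0] = i                      # delete all i characters of the prefix
    for j in range(len2 + 1):
        dp[0][j] = j                      # insert all j characters of the prefix
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match or substitution
    return dp[len1][len2]


def normalized_distance(s1: str, s2: str) -> float:
    """Step E: edit distance divided by the length of the longer string."""
    longest = max(len(s1), len(s2))
    return edit_distance(s1, s2) / longest if longest else 0.0


def correct_names(recognized: List[str], entries: List[str],
                  max_distance: float = 0.8) -> List[Optional[str]]:
    """Correct each recognized name against the dictionary, using each entry once."""
    remaining = set(entries)              # working dictionary for this report
    results: List[Optional[str]] = []
    for text in recognized:
        if text in remaining:             # 6.3.1: exact hit
            remaining.remove(text)
            results.append(text)
            continue
        # 6.3.2 and 6.3.3: candidates under the threshold, nearest first
        scored = sorted((normalized_distance(text, e), e) for e in entries)
        # 6.3.4: skip candidates already consumed by an earlier item
        match = next((e for d, e in scored
                      if d < max_distance and e in remaining), None)
        if match is not None:
            remaining.remove(match)
        results.append(match)             # None: no usable candidate (assumption)
    return results


names = ["glucose", "hemoglobin", "platelets"]
print(correct_names(["glucose", "hemog1obin", "plate1ets"], names))
# ['glucose', 'hemoglobin', 'platelets']
```

Removing each matched entry from the working dictionary reflects the idea that a test item normally appears only once on a report, so an entry that has already been matched should not be chosen again.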

6.4 Measurement value type determination and measurement value recognition based on the lab work information database. After error correction of the test items is complete, the type of each item's measurement value (text, numeric, etc.) can be obtained from the lab work information database. Measurement values of different types are then recognized with the corresponding Tesseract engine configuration, giving the final recognition results.
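One way to picture the type-dependent configuration in step 6.4 is a lookup from measurement value type to Tesseract command-line options. The sketch below only builds the command and does not run it. The mapping, the language codes, the whitelist characters and the choice of `--psm 7` (single text line) are illustrative assumptions, not the configurations used in the embodiment. `--oem 1` selects the LSTM engine, and whether `tessedit_char_whitelist` is honoured by that engine depends on the Tesseract version.

```python
from typing import Dict, List

# Hypothetical value types, standing in for a lookup in the lab work information database.
VALUE_TYPES: Dict[str, str] = {
    "hemoglobin": "number",
    "urine colour": "text",
}

# Hypothetical engine configurations, one per measurement value type.
ENGINE_CONFIGS: Dict[str, List[str]] = {
    "text": ["--oem", "1", "--psm", "7", "-l", "chi_sim"],
    "number": ["--oem", "1", "--psm", "7", "-l", "eng",
               "-c", "tessedit_char_whitelist=0123456789.+-<>"],
}


def tesseract_command(image_path: str, value_type: str) -> List[str]:
    """Build a Tesseract CLI call for one value cell; unknown types fall back to text."""
    options = ENGINE_CONFIGS.get(value_type, ENGINE_CONFIGS["text"])
    return ["tesseract", image_path, "stdout", *options]


def command_for_item(image_path: str, item_name: str) -> List[str]:
    """Look up the item's value type, then pick the matching engine configuration."""
    return tesseract_command(image_path, VALUE_TYPES.get(item_name, "text"))


print(" ".join(command_for_item("cell_03.png", "hemoglobin")))
```

The resulting list can be passed to `subprocess.run` on a machine where Tesseract and the named language data are installed.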

The above description of the embodiments is provided so that those of ordinary skill in the art can understand and apply the present invention. Those skilled in the art can obviously make various modifications to the above embodiments with ease and apply the general principles described herein to other embodiments without creative effort. The present invention is therefore not limited to the above embodiments, and improvements and modifications made by those skilled in the art in light of the disclosure of the present invention shall all fall within the scope of protection of the present invention.

Claims

1. A laboratory test report recognition method based on image processing, comprising the following steps:

(4) extracting the dividing lines in the corrected laboratory test report image, and dividing the image along the dividing lines into upper, middle and lower regions, which correspond respectively to the patient information, the test item information, and the doctor and verification/audit information;

(5) further segmenting the test item information region of the corrected laboratory test report image according to its row and column information, and binarizing the segmented images;

(6) using the LSTM model in the open-source OCR engine Tesseract to perform classified recognition and intelligent error correction on the segmented binarized images.

2. The laboratory test report recognition method according to claim 1, characterized in that the detailed process of performing edge recognition on the laboratory test report image in step (1) is as follows:

1.2 preprocessing the thumbnail, in sequence by dilation, fast edge detection based on structured forests, erosion, and binarization, so as to obtain an edge information image;

1.3 performing line detection on the edge information image using the Hough transform, while introducing line screening based on local maxima and an adaptive threshold, together with merging of similar lines;

1.4 computing the intersection points between lines using a vector method, finding all quadrilaterals enclosed by the lines by traversing 4-tuples of intersection points, and taking the quadrilateral whose four sides have the largest weight sum as the outline of the laboratory test report; the weight of a side is the number of points on the line on which the side lies.

3. The laboratory test report recognition method according to claim 1, characterized in that the detailed process of deskewing the laboratory test report image based on the probabilistic Hough transform in step (3) is as follows:

3.1 scaling the laboratory test report image to a width of 1200;

3.2 converting the scaled laboratory test report image to grayscale and applying illumination correction and binarization, namely: correcting the illumination distribution by subtracting the images before and after mean filtering and adding a mean offset, then enhancing the contrast of the image using contrast-limited adaptive histogram equalization, and finally binarizing the image;

3.3 applying an erosion operation to the binarized image, and extracting the edge pixels of the eroded binarized image using the Canny operator to obtain the corresponding edge information image;

3.4 performing line detection on the edge information image based on the probabilistic Hough transform, and deskewing the laboratory test report image according to the average inclination angle of all detected lines.

4. The laboratory test report recognition method according to claim 1, characterized in that in step (4) the LSD line segment detection algorithm is used to extract the dividing lines in the corrected laboratory test report image.

5. The laboratory test report recognition method according to claim 1, characterized in that the specific implementation of step (5) is as follows:

5.2 counting the number of black pixels in each pixel row of the test item information region of the binarized image, the counts alternating between peaks and troughs, where each trough is the blank space between text lines; dividing the test item information region into multiple rows according to these statistics, and deleting the rows whose black pixel count is below a threshold;

5.3 according to the row information obtained in step 5.2, measuring the widths of the blank gaps in each row and sorting them, the statistics showing that the gaps between characters within a row are small and make up the vast majority, whereas the gaps between the columns of the test items are large and few in number; determining the character spacing from this, and setting a threshold larger than the character spacing as the minimum column spacing;

5.4 counting the number of black pixels in each pixel column of the test item information region of the binarized image, the counts alternating between peaks and troughs, where each trough is the blank space between columns; dividing the test item information region into multiple columns according to these statistics and the minimum column spacing;

5.5 according to the row and column information obtained in the above steps, further segmenting the test item information region of the corrected laboratory test report image, and binarizing each segmented image using the optimal global threshold selected by OTSU.

6. The laboratory test report recognition method according to claim 1, characterized in that the specific implementation of step (6) is as follows:

6.1 recognizing the patient information region using the document mode of Tesseract to obtain a recognition result string, splitting the string on whitespace, and using regular expression matching to find the character blocks that contain an information item name, the character block that immediately follows being the value of that information item; then cleaning and structuring the parsed patient information to obtain the final patient information object;

6.2 recognizing the segmented test item information region in batch using the block mode of Tesseract, obtaining the recognition results of the test items;

6.3 establishing a lab work information database containing test item names, test item aliases and test item measurement value types, and using the lab work information database to perform intelligent error correction on the test item recognition results;

6.4 for any test item, obtaining its measurement value type from the lab work information database, and recognizing the measurement value of the test item using the Tesseract engine configuration that corresponds to that type, obtaining the recognition result of the measurement value.

7. The laboratory test report recognition method according to claim 6, characterized in that the detailed process of using the lab work information database to perform intelligent error correction on the test item recognition results in step 6.3 is as follows:

6.3.1 first obtaining a dictionary of test item names from the test item names and aliases in the lab work information database;

6.3.2 for the recognition result of any test item, ending error correction if the recognition result is present in the dictionary; if the recognition result is not present in the dictionary, computing the normalized edit distance between the recognition result and every entry in the dictionary, and selecting the entries whose edit distance is less than 0.8 to form an error correction candidate list arranged in ascending order of edit distance;

6.3.3 taking the entry with the smallest edit distance in the error correction candidate list as the error correction candidate word; if the candidate word is present in the dictionary, using it directly to replace the recognition result; if the candidate word is not present in the dictionary, removing the candidate word from the error correction candidate list and repeating step 6.3.3.