CN112686319A

CN112686319A - Merging method of electric power signal model training files

Info

Publication number: CN112686319A
Application number: CN202011638466.6A
Authority: CN
Inventors: 张海永; 高承贵
Original assignee: Nanjing Taisi De Intelligent Electric Co ltd
Current assignee: Nanjing Taisi De Intelligent Electric Co ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-04-20

Abstract

The invention discloses a method for merging training files of a power signal model, comprising the following steps: selecting box text files and picture files with the suffix .tif; setting file parameters; automatically naming the .tif picture files and the box files so that they form training files that meet the naming specification; writing the font names in the set format into the text file font_properties; calling the tesseract command to generate a text file with the suffix .tr for each .tif file; joining the names of the .box text files and the names of the .tr text files into two space-separated strings; and finally calling the combine_tessdata command of tesseract to generate a traineddata file. By merging multiple models, the method combines character pictures that were manually marked in practical application with character pictures generated by the method, and corrects misrecognized characters using the manually marked data, thereby reducing the training workload and improving the accuracy of character recognition.

Description

Merging method of electric power signal model training files

Technical Field

The invention belongs to the technical field of training methods for Chinese character recognition models, and particularly relates to a method for merging training files of a power signal model.

Background

The Chinese model of the Tesseract character recognition engine has a low recognition rate, and users commonly improve it by retraining on the characters they use most. Training requires adjusting the positions of characters in pictures and the sizes of character frames in large quantities, which creates a heavy workload; moreover, if the training files are not merged, processing time is long and recognition efficiency is low.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method for merging training files of a power signal model that solves the problems in the prior art.

The technical scheme adopted by the invention is as follows: a method for merging training files of a power signal model comprises the following steps:

1) selecting box text files and picture files with the suffix .tif, which are either manually marked or generated;

2) setting the training parameters of each file, including the training language and page mode parameters, with Chinese as the default training language;

3) automatically naming the .tif picture files and the box files according to the selected file name and the name of the model to be generated, so that they form training files that meet the specification, wherein the .tif picture file is named according to the following specification:

the box text file is named according to the following specification:

4) taking the name between the first "." and the second "." of each .tif picture file name as the font name and writing it, one font per line, into the text file named font_properties in the following format:

font 0 0 0 0 0

5) calling the tesseract command to generate a text file with the suffix .tr for each .tif file, with the following command:

tesseract power.font.exp0.tif power.font.exp0 -psm 6 nobatch box.train

6) joining the names of all .box text files into a space-separated string, passing it as the parameter of the unicharset_extractor command, and executing that command; joining the names of all .tr text files into a space-separated string, passing it as the parameter of the shapeclustering, mftraining and cntraining commands, and executing those commands;

7) finally, calling the combine_tessdata command of tesseract to generate a traineddata file, which is the final merged character model.

In the step 1), the method for generating the box text file and the picture file with the suffix of tif comprises the following steps:

1) reading a txt or excel file;

2) setting parameters of font, size and model name of the training characters;

3) reading the selected txt or excel file line by line, obtaining the total number of text lines, denoted num_lines, and the number of characters in the longest line, denoted max_length;

4) calculating the width and height of the character picture to be generated from the set character spacing (gap), line spacing (linespacing), page margin (padding), maximum picture width, single character width (width) and single character height (height);

5) if the calculated picture width is larger than the set maximum picture width, taking the maximum picture width as the width of the picture to be generated, and recalculating the picture height from the character spacing (gap), line spacing (linespacing) and page margin (padding);

6) denoting the picture size calculated in step 5) as imgsize, calling the QImage class of Qt to generate an all-white picture, and saving it as a picture file with the suffix .tif;

7) drawing the characters one by one on the all-white picture;

8) according to the position of each character and the width and height of its rectangle recorded in step 7), scanning the pixel values inside the rectangle in four directions: top to bottom, bottom to top, left to right, and right to left;

9) converting the coordinates [(x, y), (end_x, end_y)] of the minimum enclosing rectangle of each character obtained in step 8) into the coordinate system used for tesseract training, [(t_x, t_y), (t_endx, t_endy)], with the following conversion formulas:

t_x = x

t_y = height_image - y - (end_y - y)

t_endx = x + width_image

t_endy = height_image - y;

10) performing the coordinate conversion of step 9) for each character and writing the data of each character into a text file with the suffix .box in the following format, one character per line:

character t_x t_y t_endx t_endy.

The beneficial effects of the invention are as follows: compared with the prior art, the method merges multiple models so that character pictures manually marked in practical application are combined with character pictures generated by the method, and misrecognized characters are corrected using the manually marked data. This reduces the training workload while improving the accuracy of character recognition.

Drawings

FIG. 1 is a flow chart of a model training method;

FIG. 2 is a flow chart of a merging method;

FIG. 3 is a schematic diagram of the coordinate system conversion.

Detailed Description

The invention is further described below with reference to specific figures and embodiments.

Example 1: as shown in fig. 1, an efficient model training method includes the following steps:

1) compiling the names, state names and similar text of each station in the power dispatching master station system into a txt or Excel file, and reading that file; because the number of Chinese characters is huge, this step collects only the characters that need to be recognized in practical application, which reduces the number of characters to be recognized, shrinks the generated model, and speeds up recognition;

2) setting the font, size and model name parameters of the training characters, which improves the accuracy of character recognition;

3) reading the selected txt or excel file line by line, and obtaining the total number of text lines (denoted num_lines) and the number of characters in the longest line (denoted max_length); these values are the initial data for calculating the size of the training picture;

4) calculating the width and height of the character picture to be generated from the set character spacing (gap), line spacing (linespacing), page margin (padding), maximum picture width, single character width (width) and single character height (height), using the following formulas:

picture width (width_image) = padding * 2 + max_length * (gap + width)

picture height (height_image) = padding * 2 + num_lines * (linespacing + height)

5) if the calculated picture width is larger than the set maximum picture width, the maximum picture width is taken as the width of the picture to be generated (width_image), and the picture height is recalculated from the character spacing (gap), line spacing (linespacing) and page margin (padding). The calculation flow is as follows:

a) calculating the number of characters that one picture line can contain (max_word_length)

words_num = (width_image - padding * 2) / (gap + width)

max_word_length is the smallest integer greater than or equal to words_num

b) calculating the number of text lines and the picture height

lines_word = total number of characters / max_word_length

the number of text lines (num_lines) is the smallest integer greater than or equal to lines_word (rounded up)

picture height (height_image) = padding * 2 + num_lines * (linespacing + height)
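The size calculation of steps 4) and 5) can be sketched in Python as follows. This is a minimal illustration, not the tool's actual code: the function name picture_size, the total_chars argument and the max_width argument are names assumed here, and the rounding follows the text as written.

```python
import math


def picture_size(total_chars, num_lines, max_length, *, gap, linespacing,
                 padding, width, height, max_width):
    """Return (width_image, height_image, num_lines) per steps 4) and 5)."""
    # Step 4): size needed to hold the longest line and all lines.
    width_image = padding * 2 + max_length * (gap + width)
    if width_image > max_width:
        # Step 5): clamp to the maximum width and re-flow the text.
        width_image = max_width
        words_num = (width_image - padding * 2) / (gap + width)
        max_word_length = math.ceil(words_num)  # rounding as stated in the text
        num_lines = math.ceil(total_chars / max_word_length)
    height_image = padding * 2 + num_lines * (linespacing + height)
    return width_image, height_image, num_lines
```

For example, 30 characters on a single line with 20-pixel cells, a gap of 2, a padding of 10 and a maximum width of 240 re-flow into three lines.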

6) calling the QImage class of Qt (a cross-platform C++ framework for developing graphical user interface applications) to generate an all-white picture of the size calculated in step 5) (denoted imgsize), with the following calling code:

QImage img(imgsize, QImage::Format_RGB888);

img.fill(QColor(255, 255, 255));

7) drawing the characters one by one on the all-white picture. The drawing steps are as follows:

a) setting the position of the first character (abscissa startx, ordinate starty), with the following initial values:

startx = padding; starty = padding;

b) drawing the first character centered within the rectangle whose starting point is (startx, starty), whose width is the single character width (width), and whose height is the single character height (height);

c) calculating startx and starty of the next character as follows:

startx = startx + width + gap

starty = padding

if (startx + padding) > picture width, then:

starty = starty + number of text lines already drawn * (height + linespacing)

startx = padding

d) recording and storing the startx, starty, width and height of each character, and repeating steps a) to c) until all characters have been drawn.
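The cell layout of step 7) can be sketched as below. This is an illustrative sketch only: the function name layout_cells is assumed, and the wrap test deliberately differs from the text. It checks that the whole character cell (startx + width + padding) fits inside the picture, a slightly stricter condition than the "startx + padding" comparison stated above. Actual drawing of the glyphs is left out.

```python
def layout_cells(num_chars, *, width_image, padding, gap, linespacing,
                 width, height):
    """Return one (startx, starty, width, height) cell per character."""
    cells = []
    startx, line = padding, 0
    for _ in range(num_chars):
        # Assumption: wrap when the whole cell would cross the right margin.
        if startx + width + padding > width_image:
            startx, line = padding, line + 1
        # Rows are stacked below the top margin, one row per drawn line.
        starty = padding + line * (height + linespacing)
        cells.append((startx, starty, width, height))
        startx += width + gap
    return cells
```

The recorded cells are the rectangles that the later scanning step searches for each character's minimum circumscribed rectangle.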

8) according to the character positions and the rectangle width and height recorded in step 7), scanning the pixel values inside each rectangle in four directions: top to bottom, bottom to top, left to right, and right to left. Scanning from four directions finds the circumscribed rectangle of a character faster than scanning the character in only two directions (top to bottom and left to right);

a) scanning the pixels of the character rectangle row by row from top to bottom until a row contains a non-white pixel, and recording that row number as y;

b) scanning the pixels of the character rectangle column by column from left to right until a column contains a non-white pixel, and recording that column number as x;

c) scanning the pixels of the character rectangle row by row from bottom to top until a row contains a non-white pixel, and recording that row number as end_y;

d) scanning the pixels of the character rectangle column by column from right to left until a column contains a non-white pixel, and recording that column number as end_x;

e) taking the values obtained in a) to d) as the minimum circumscribed rectangle of the character, whose upper-left vertex is (x, y) and whose lower-right vertex is (end_x, end_y);
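The four-direction scan of step 8) can be sketched as follows. The sketch assumes the picture is available as a list of rows of grayscale values in which 255 means white; that representation, the WHITE constant and the function name bounding_box are assumptions for illustration, whereas the tool described here reads pixels from a QImage.

```python
WHITE = 255  # assumption: grayscale pixels, 255 = white background


def bounding_box(pixels, startx, starty, width, height):
    """Return (x, y, end_x, end_y) of the ink inside one cell, or None."""
    rows = range(starty, starty + height)
    cols = range(startx, startx + width)

    def row_has_ink(r):
        return any(pixels[r][c] != WHITE for c in cols)

    def col_has_ink(c):
        return any(pixels[r][c] != WHITE for r in rows)

    # a) top to bottom: first row containing a non-white pixel.
    y = next((r for r in rows if row_has_ink(r)), None)
    if y is None:
        return None  # the cell is blank
    # c) bottom to top, b) left to right, d) right to left.
    end_y = next(r for r in reversed(rows) if row_has_ink(r))
    x = next(c for c in cols if col_has_ink(c))
    end_x = next(c for c in reversed(cols) if col_has_ink(c))
    return x, y, end_x, end_y
```

Each scan stops at the first row or column that contains ink, so the interior of the glyph is never visited.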

9) because the QImage coordinate system differs from the coordinate system used for tesseract training, and tesseract requires the lower-left and upper-right vertex coordinates of the minimum enclosing rectangle of each character as training data, the coordinates [(x, y), (end_x, end_y)] of the minimum enclosing rectangle from step 8) are converted into the tesseract training coordinate system [(t_x, t_y), (t_endx, t_endy)], as shown in FIG. 3. The conversion formulas are as follows:

t_x = x

t_y = height_image - y - (end_y - y)

t_endx = x + width_image

t_endy = height_image - y

10) performing the conversion of step 9) for each character and writing the data of each character into a text file with the suffix .box in the following format, one character per line:

character t_x t_y t_endx t_endy

The box file generated in this step is the character position file required for tesseract training. The training tool generates it automatically, which is faster and more convenient than the traditional method of manually adjusting the circumscribed rectangle of each character;
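The conversion of step 9) and the line format of step 10) can be sketched as below. The function name to_box_line is assumed. One further assumption is made loudly: the text gives t_endx = x + width_image, and this sketch reads the added width as the width of the character's bounding rectangle (end_x - x), so that t_endx equals end_x. If the picture width is really meant, replace that line accordingly.

```python
def to_box_line(char, x, y, end_x, end_y, height_image):
    """Convert a top-left-origin rectangle to a bottom-left-origin box line."""
    t_x = x
    t_y = height_image - y - (end_y - y)  # equals height_image - end_y
    # Assumption: the width added to x is the bounding-box width, not the
    # width of the whole picture.
    t_endx = x + (end_x - x)
    t_endy = height_image - y
    return f"{char} {t_x} {t_y} {t_endx} {t_endy}"
```

A character whose rectangle spans (12, 14) to (28, 30) in a picture 100 pixels high yields the line "电 12 70 28 86": the vertical axis is flipped while the horizontal coordinates are unchanged.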

11) generating a text file with the suffix .tr using the tesseract command according to the set parameters, such as the training language and page mode, with the following command:

tesseract power.font.exp0.tif power.font.exp0 -psm 6 nobatch box.train

The .tr file generated in this step is the character feature file required for tesseract training. The tool calls the tesseract command automatically, which is faster and more convenient than the traditional method of calling the command manually;

12) reading the files with the suffixes .tif, .box and .tr in the manually marked folder and checking whether a text file named font_properties exists in that folder; if not, taking the name between the first "." and the second "." of each .tif picture file name as the font name and writing it into the font_properties file, one font per line. The format is as follows:

font 0 0 0 0 0

for example, if the .tif picture file is named power.user.exp0.tif, the content written to the font_properties file is:

userfont 0 0 0 0 0

The generated font_properties file is the font file required for tesseract training. Generating it automatically is faster and more convenient than the traditional method of looking up the file name of each .tif file in turn, creating the font_properties file by hand, and typing in each font;
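Deriving the font_properties lines from the .tif file names, as step 12) describes, can be sketched as follows. The function name font_properties_lines is assumed, and writing each font only once is an assumption of this sketch (the text says only that each font occupies one line).

```python
from pathlib import Path


def font_properties_lines(tif_names):
    """Return one '<font> 0 0 0 0 0' line per distinct font name."""
    fonts = []
    for name in tif_names:
        parts = Path(name).name.split(".")
        if len(parts) < 3:
            continue  # name has no second "." to delimit a font name
        font = parts[1]  # text between the first and second "."
        if font not in fonts:  # assumption: list each font only once
            fonts.append(font)
    return [f"{font} 0 0 0 0 0" for font in fonts]
```

The resulting lines can be written to the font_properties file with "\n".join(...).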

13) executing the tesseract training commands unicharset_extractor, shapeclustering, mftraining and cntraining in sequence to generate a file with the suffix .traineddata, which is the character model file; the tool calls and executes these commands automatically in sequence, which is faster and more convenient than typing and executing each command manually as in the traditional method;

14) after training is finished, the training tool automatically calls the tesseract command to recognize the picture generated in step 7), compares the recognition result with the input characters, and reports the misrecognized characters, the character recall rate and the accuracy rate; detecting the recall rate and accuracy rate automatically is faster and more convenient than searching for and counting misrecognized characters manually as in the traditional method;

15) to further improve the character recognition rate, the user can use a picture tool to convert actual application pictures containing misrecognized characters into picture files with the suffix .tif, mark them manually with a marking tool to generate box text files, add these files to the manually marked folder, and train again.

Example 2: as shown in fig. 2, a method for merging training files of a power signal model includes the following steps:

1) selecting box text files and picture files with the suffix .tif that are either manually marked or generated; the box text files and .tif picture files generated in steps 1) to 10) of Example 1 can also be adjusted with other methods or tools. Picture files in .tif format under different paths can be selected, which is more convenient than the traditional method of manually copying the files to be merged into the same path;

2) setting the training parameters of each file, including the training language and page mode parameters, with Chinese as the default training language; different parameters can be set for each selected file, which is more intuitive and convenient than the traditional method of entering parameters and executing commands separately for each file;

the box text file is named according to the following specification:

The merging tool renames the .tif files and the .box files automatically, which is faster and more convenient than the traditional method of renaming the files manually and copying them into the same folder;

4) taking the name between the first "." and the second "." of each .tif picture file name as the font name and writing it, one font per line, into the text file named font_properties in the following format;

font 0 0 0 0 0

font_properties is the font file required by the tesseract training model. The merging tool obtains the font name of each .tif file automatically and writes it into the font_properties file, which is faster, more convenient and less error-prone than the traditional method of determining each font manually, creating the font_properties file by hand, and writing the fonts into it;

tesseract power.font.exp0.tif power.font.exp0 -psm 6 nobatch box.train

The .tr file is the character feature file required by the tesseract training model. The merging tool calls the command automatically once per file, which is faster and more convenient than the traditional method of calling the command manually for each file name in turn;

The merging tool assembles the command statements and calls the training commands automatically, which is faster and more convenient than the traditional method of writing, checking and executing each command manually in sequence.
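How a merging tool might assemble the command lines can be sketched as below. The function name build_commands and its model argument are assumptions. The sketch passes only the file-name lists that the text describes; other options that these tools commonly take (for example the font_properties and unicharset paths) are omitted because the text does not list them. Passing the model name followed by "." to combine_tessdata is likewise an assumption based on common Tesseract 3.x usage.

```python
def build_commands(tif_files, model, psm=6):
    """Return the argument lists for the training commands, in call order."""
    stems = [name[:-len(".tif")] for name in tif_files if name.endswith(".tif")]
    # One tesseract call per .tif file, producing <stem>.tr.
    commands = [
        ["tesseract", f"{stem}.tif", stem, "-psm", str(psm), "nobatch", "box.train"]
        for stem in stems
    ]
    box_files = [f"{stem}.box" for stem in stems]
    tr_files = [f"{stem}.tr" for stem in stems]
    # All .box names go to unicharset_extractor in a single call.
    commands.append(["unicharset_extractor", *box_files])
    # All .tr names go to each of the three training tools.
    for tool in ("shapeclustering", "mftraining", "cntraining"):
        commands.append([tool, *tr_files])
    # Assumption: combine_tessdata takes the model prefix ending in ".".
    commands.append(["combine_tessdata", f"{model}."])
    return commands
```

Each argument list can be passed to subprocess.run, or joined with spaces to obtain the command strings described in the text. The -psm spelling follows the command given in this document; newer Tesseract releases spell the option --psm.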

The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present invention, and therefore, the scope of the present invention should be determined by the scope of the claims.

Claims

1. A merging method of power signal model training files is characterized in that: the method comprises the following steps:

the text file specification for box is named as follows:

font 0 0 0 0 0

tesseract power.font.exp0.tif power.font.exp0 -psm 6 nobatch box.train

2. The method of claim 1, wherein the method further comprises: in the step 1), the method for generating the box text file and the picture file with the suffix of tif comprises the following steps:

1) reading a txt or excel file;

2) setting parameters of font, size and model name of the training characters;

7) drawing single characters on a full-white picture in sequence;

t_x = x

t_y = height_image - y - (end_y - y)

t_endx = x + width_image

t_endy = height_image - y;

character t_x t_y t_endx t_endy.

3. The method of claim 1, wherein: the picture file with the suffix .tif is converted from a picture using a picture tool, the picture is manually marked using a marking tool to generate a box text file, and the box text file is added to the manually marked folder for retraining.