CN114022886B

CN114022886B - Handwriting recognition training set generation method, system and medium for tablet

Info

Publication number: CN114022886B
Application number: CN202111219556.6A
Authority: CN
Inventors: 孙成通; 赵亚欧; 胡焱; 牛鹏
Original assignee: Inspur Financial Information Technology Co Ltd
Current assignee: Inspur Financial Information Technology Co Ltd
Priority date: 2021-10-20
Filing date: 2021-10-20
Publication date: 2024-06-14
Anticipated expiration: 2041-10-20
Also published as: CN114022886A

Abstract

The invention discloses a method, a system and a medium for generating a handwriting recognition training set for a tablet. The method comprises the following steps: setting an acquisition threshold and a first contrast; configuring a picture processing program; configuring a picture data set based on the acquisition threshold, the first contrast and the picture processing program; setting a character list matched with the picture data set; acquiring a sample picture set based on the picture data set and the character list; setting a scaling percentage and a rotation angle; and performing a sample transformation extraction operation based on the sample picture set, the scaling percentage and the rotation angle to obtain the recognition training set. The invention can generate a training set that contains both a large number of rarely used characters, symbols and formulas and a large number of semantically related character samples, symbols and formulas, thereby improving the range and efficiency of handwriting recognition on the tablet and further improving the applicability of the tablet.

Description

Handwriting recognition training set generation method, system and medium for tablet

Technical Field

The invention relates to the technical field of handwriting recognition, in particular to a method, a system and a medium for generating a handwriting recognition training set for a tablet.

Background

As tablets are used more and more at work, the requirements on recognizing handwriting on the tablet keep rising. Tablets commonly use a deep learning algorithm together with a handwriting sample training set to recognize handwriting, and this approach places extremely high demands on how the handwriting sample training set is generated and configured.

The prior art generates a handwriting sample training set in two ways. On the one hand, the training set is configured by randomly extracting samples from a handwriting sample collection, which ensures the diversity of the handwriting samples in the training set but cannot ensure the relevance among them. On the other hand, the training set is configured according to a corpus of sentences, which ensures the relevance among the handwriting samples but loses their diversity, so that rarely used characters easily go unrecognized. A method for generating a handwriting sample training set is therefore needed that ensures both the diversity of the handwriting samples and the relevance among them.

Disclosure of Invention

The invention mainly aims to provide a handwriting sample training set generation method that ensures both the diversity of the handwriting samples and the relevance among them.

In order to achieve the above purpose, the invention adopts a technical scheme that: the method for generating the handwriting recognition training set for the tablet comprises the following steps:

Configuring a picture data set:

Setting an acquisition threshold and a first contrast; configuring a picture processing program; configuring the picture dataset based on the acquisition threshold, the first contrast, and the picture processing program;

Acquiring a sample picture set:

Setting a character list matched with the picture data set; acquiring the sample picture set based on the picture data set and the character list;

Generating a recognition training set:

Setting a scaling percentage and a rotation angle; and executing a sample transformation extraction operation based on the sample picture set, the scaling percentage and the rotation angle to obtain the recognition training set.
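As a rough illustration of the scaling and rotation named in this step, the following Python sketch applies both to a small grayscale picture. Representing a picture as a nested list of pixel values, the nearest-neighbour resampling, keeping the canvas size during rotation, and the function names `scale` and `rotate` are all assumptions made for illustration; the text does not specify how the transformations are implemented.

```python
import math

def scale(img, percent):
    """Nearest-neighbour resize of a 2-D grayscale picture (list of rows)."""
    h, w = len(img), len(img[0])
    new_h = max(1, h * percent // 100)
    new_w = max(1, w * percent // 100)
    return [[img[y * h // new_h][x * w // new_w] for x in range(new_w)]
            for y in range(new_h)]

def rotate(img, degrees, fill=0):
    """Rotate about the picture centre, keeping the canvas size.

    Output pixels that map outside the source picture receive `fill`.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    cos_a = math.cos(math.radians(degrees))
    sin_a = math.sin(math.radians(degrees))
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: locate the source pixel for each output pixel.
            sx = cos_a * (x - cx) + sin_a * (y - cy) + cx
            sy = -sin_a * (x - cx) + cos_a * (y - cy) + cy
            ix, iy = round(sx), round(sy)
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out

picture = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
enlarged = scale(picture, 200)     # 6 x 6 picture
turned = rotate(picture, 180)      # picture turned upside down
```

Applying `scale` with several percentages and `rotate` with several angles to the same synthesized picture yields several training variants of it.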

As an improvement, the step of configuring the picture dataset based on the acquisition threshold, the first contrast and the picture processing program further comprises:

acquiring a plurality of first character pictures, a plurality of second character pictures, a plurality of third character pictures and a plurality of fourth character pictures which are respectively matched with the acquisition threshold;

Invoking the picture processing program to perform inversion processing on the first character pictures, the second character pictures, the third character pictures and the fourth character pictures;

Invoking the picture processing program to adjust the contrast of the inverted first character pictures, second character pictures, third character pictures and fourth character pictures to the first contrast to obtain Chinese character handwriting sample pictures, English handwriting sample pictures, digital handwriting sample pictures and symbol handwriting sample pictures;

Integrating a plurality of Chinese character handwriting sample pictures, a plurality of English handwriting sample pictures, a plurality of digital handwriting sample pictures and a plurality of symbol handwriting sample pictures to obtain the picture data set.
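The inversion and contrast adjustment described above can be sketched as follows. This is a minimal illustration, not the picture processing program itself: the linear contrast mapping around mid-gray, the convention that 50 leaves a picture unchanged, the example value 40, and the function names are assumptions.

```python
def invert(img):
    """Turn dark strokes on a light background into light strokes on dark (8-bit)."""
    return [[255 - p for p in row] for row in img]

def set_contrast(img, contrast):
    """Linear contrast about mid-gray; `contrast` lies in 0-100.

    Assumed mapping: 50 leaves the picture unchanged, lower values pull
    pixels toward gray. This is not Photoshop's actual contrast formula.
    """
    factor = contrast / 50
    return [[min(255, max(0, round(128 + (p - 128) * factor))) for p in row]
            for row in img]

raw = [[255, 255, 0], [255, 0, 255]]        # black stroke on white paper
sample = set_contrast(invert(raw), 40)      # white stroke on black, slightly grayed
```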

As an improvement, the character list is configured with: a plurality of Chinese character sample characters respectively matched with the Chinese character handwriting sample pictures, a plurality of English sample characters respectively matched with the English handwriting sample pictures, a plurality of digital sample characters respectively matched with the digital handwriting sample pictures and a plurality of symbol sample characters respectively matched with the symbol handwriting sample pictures.

As an improvement, the step of acquiring the sample picture set based on the picture data set and the character list further includes:

performing Chinese character picture screening operation based on the picture data set and the character list to obtain a plurality of first synthesized Chinese character pictures and a plurality of second synthesized Chinese character pictures;

performing English picture screening operation based on the picture data set and the character list to obtain a plurality of third synthesized English pictures;

executing formula picture screening operation based on the picture data set and the character list to obtain a plurality of fourth mathematical formula pictures;

integrating a plurality of first synthesized Chinese character pictures, a plurality of second synthesized Chinese character pictures, a plurality of third synthesized English pictures and a plurality of fourth mathematical formula pictures to obtain the sample picture set.

As an improved scheme, the Chinese character picture screening operation includes:

Setting a first extraction number, a Chinese character synthesis space and a synthesis number; configuring a Chinese corpus vocabulary list;

Selecting a plurality of first Chinese character samples matched with the first extraction number from a plurality of Chinese character sample characters; screening a plurality of first Chinese character pictures which are respectively matched with a plurality of first Chinese character samples from a plurality of Chinese character handwriting sample pictures; according to the synthesis quantity and based on the Chinese character synthesis space, synthesizing a plurality of first Chinese character pictures side by side to obtain a plurality of first synthesized Chinese character pictures;

Selecting a second Chinese character sample from a plurality of Chinese character sample characters; screening a plurality of Chinese vocabulary characters associated with the second Chinese sample in the Chinese corpus vocabulary; screening a plurality of third Chinese character samples matched with a plurality of Chinese vocabulary characters respectively from a plurality of Chinese character sample characters; screening a second Chinese character picture matched with the second Chinese character sample from a plurality of Chinese character handwriting sample pictures; screening a plurality of third Chinese character pictures which are respectively matched with a plurality of third Chinese character samples from a plurality of Chinese character handwriting sample pictures; and synthesizing a plurality of third Chinese character pictures with the second Chinese character pictures side by side based on the Chinese character synthesis space to obtain a plurality of second synthesized Chinese character pictures.

As an improved solution, the English picture screening operation includes:

setting a segmentation unit and an English synthesis interval; configuring an English corpus vocabulary list;

Selecting a first English sample from a plurality of English sample characters;

Identifying, in the English corpus vocabulary list, a plurality of first English vocabulary characters whose first character position is the first English sample; dividing the first English vocabulary characters according to the segmentation unit to obtain a plurality of second English vocabulary characters; removing vocabulary characters matched with the first English sample from the plurality of second English vocabulary characters to obtain a plurality of third English vocabulary characters; screening a plurality of second English samples matched with a plurality of third English vocabulary characters respectively from a plurality of English sample characters;

Screening first English pictures matched with the first English samples from a plurality of English handwriting sample pictures; screening a plurality of second English pictures which are respectively matched with a plurality of second English samples from a plurality of English handwriting sample pictures; and synthesizing a plurality of second English pictures with the first English pictures side by side based on the English synthesis interval to obtain a plurality of third synthesized English pictures.

As an improvement, the formula picture screening operation includes:

setting a formula synthesis interval, a digit number, an operator digit number and a digit and operator synthesis sequence;

selecting a plurality of first digital samples from a plurality of digital sample characters according to the digit number; selecting a plurality of operation character samples from a plurality of symbol sample characters according to the operator digit number;

Screening a plurality of first digital pictures which are respectively matched with a plurality of first digital samples from a plurality of digital handwriting sample pictures; screening a plurality of operation character pictures which are respectively matched with a plurality of operation character samples from a plurality of symbol handwriting sample pictures;

and synthesizing a plurality of first digital pictures and a plurality of operation character pictures according to the synthesis sequence of the numbers and operators and the formula synthesis interval to obtain a plurality of fourth mathematical formula pictures.

As an improvement, the sample transform extraction operation includes:

Respectively scaling the first synthesized Chinese character pictures, the second synthesized Chinese character pictures, the third synthesized English pictures and the fourth mathematical formula pictures according to the scaling percentage to obtain first training set pictures, second training set pictures, third training set pictures and fourth training set pictures;

Respectively carrying out rotation transformation on the first training set pictures, the second training set pictures, the third training set pictures and the fourth training set pictures according to the rotation angles to obtain first to-be-extracted synthesized Chinese character pictures, second to-be-extracted synthesized Chinese character pictures, third to-be-extracted English pictures and fourth to-be-extracted mathematical formula pictures;

Setting a first Chinese character proportion, a second Chinese character proportion, an English proportion and a formula proportion; calculating the sum of the quantities of the plurality of first to-be-extracted synthesized Chinese character pictures, the plurality of second to-be-extracted synthesized Chinese character pictures, the plurality of third to-be-extracted English pictures and the plurality of fourth to-be-extracted mathematical formula pictures; calculating the products of the sum with the first Chinese character proportion, the second Chinese character proportion, the English proportion and the formula proportion respectively to obtain a first Chinese character selection quantity, a second Chinese character selection quantity, an English selection quantity and a formula selection quantity;

Selecting a plurality of first Chinese character pictures to be processed from the plurality of first to-be-extracted synthesized Chinese character pictures according to the first Chinese character selection quantity; selecting a plurality of second Chinese character pictures to be processed from the plurality of second to-be-extracted synthesized Chinese character pictures according to the second Chinese character selection quantity; selecting a plurality of third English pictures to be processed from the plurality of third to-be-extracted English pictures according to the English selection quantity; selecting a plurality of fourth formula pictures to be processed from the plurality of fourth to-be-extracted mathematical formula pictures according to the formula selection quantity; integrating the plurality of first Chinese character pictures to be processed, the plurality of second Chinese character pictures to be processed, the plurality of third English pictures to be processed and the plurality of fourth formula pictures to be processed to obtain the recognition training set.
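The proportional selection at the end of the sample transformation extraction operation can be illustrated with a short Python sketch. It reads the selection quantity of each group as the total number of to-be-extracted pictures multiplied by that group's proportion. Expressing proportions as whole percentages, sampling without replacement, clamping a quantity to the size of its group, and the function names are assumptions for illustration only.

```python
import random

def selection_quantities(group_sizes, percentages):
    """Each quantity is the total picture count times that group's proportion."""
    total = sum(group_sizes)
    return [total * pct // 100 for pct in percentages]

def build_training_set(groups, percentages, rng):
    """Draw the computed number of pictures from each group and merge them."""
    sizes = [len(g) for g in groups]
    training_set = []
    for group, wanted in zip(groups, selection_quantities(sizes, percentages)):
        # Clamping is an assumption: the text does not say what happens
        # when the product exceeds the number of pictures in the group.
        training_set.extend(rng.sample(group, min(wanted, len(group))))
    return training_set

# Hypothetical groups: first Chinese, second Chinese, English, formula pictures.
groups = [[f"zh1_{i}" for i in range(120)], [f"zh2_{i}" for i in range(120)],
          [f"en_{i}" for i in range(30)], [f"fx_{i}" for i in range(30)]]
training = build_training_set(groups, [30, 30, 10, 10], random.Random(0))
```

With 300 pictures in total and proportions of 30%, 30%, 10% and 10%, the four selection quantities are 90, 90, 30 and 30.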

The invention also provides a handwriting recognition training set generation system for the tablet, which comprises:

the system comprises a picture data set configuration module, a sample picture set acquisition module and an identification training set generation module;

The picture data set configuration module is used for setting an acquisition threshold value and a first contrast; the picture data set configuration module is also used for configuring a picture processing program; the picture data set configuration module configures the picture data set based on the acquisition threshold, the first contrast, and the picture processing program;

the sample picture set acquisition module is used for setting a character list matched with the picture data set; the sample picture set acquisition module acquires the sample picture set based on the picture data set and the character list;

The recognition training set generation module is used for setting a scaling percentage and a rotation angle; and the recognition training set generation module executes sample transformation extraction operation based on the sample picture set, the scaling percentage and the rotation angle to obtain the recognition training set.

The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for generating a handwriting recognition training set for a tablet.

The beneficial effects of the invention are as follows:

1. The handwriting recognition training set generation method for a tablet can generate a training set that contains both a large number of rarely used characters, symbols and formulas and a large number of semantically related character samples, symbols and formulas. When the training set is applied, both the diversity of the handwriting samples and the relevance among them are ensured, which improves the range and efficiency of handwriting recognition on the tablet, improves the applicability of the tablet, greatly improves the user experience, makes up for the defects of the prior art, and gives the method extremely high market value and product competitiveness.

2. In the handwriting recognition training set generation system for a tablet, the picture data set configuration module, the sample picture set acquisition module and the recognition training set generation module cooperate to generate a training set that contains both a large number of rarely used characters, symbols and formulas and a large number of semantically related character samples, symbols and formulas. When the training set is applied, both the diversity of the handwriting samples and the relevance among them are ensured, which improves the range and efficiency of handwriting recognition on the tablet, improves the applicability of the tablet, greatly improves the user experience, makes up for the defects of the prior art, and gives the system extremely high market value and product competitiveness.

3. The computer readable storage medium can direct the cooperation of the picture data set configuration module, the sample picture set acquisition module and the recognition training set generation module, and thereby generate a training set that contains both a large number of rarely used characters, symbols and formulas and a large number of semantically related character samples, symbols and formulas. When the training set is applied, both the diversity of the handwriting samples and the relevance among them are ensured, which improves the range and efficiency of handwriting recognition on the tablet, improves the applicability of the tablet, greatly improves the user experience, makes up for the defects of the prior art, gives the medium extremely high market value and product competitiveness, and effectively improves the operability of the handwriting recognition training set generation method for a tablet.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a method for generating a handwriting recognition training set for a tablet according to embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of a specific flow chart of a method for generating a handwriting recognition training set for a tablet according to embodiment 1 of the present invention;

FIG. 3 is a schematic diagram of implementation effects of a portion of the sample picture set according to embodiment 1 of the present invention;

FIG. 4 is a diagram of a handwriting recognition training set generation system for a tablet according to embodiment 2 of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings so that the advantages and features of the present invention can be more easily understood by those skilled in the art, thereby making clear and defining the scope of the present invention.

In the description of the present invention, it should be noted that the described embodiments of the present invention are some, but not all embodiments of the present invention; all other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In the description of the present invention, it should be noted that the terms "first," "second," "third," "fourth," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In describing the present invention, it should be noted that: PS (Photoshop) is image processing software.

Example 1

The embodiment provides a handwriting recognition training set generation method for a tablet, as shown in fig. 1 to 3, comprising the following steps:

S100, configuring a picture data set, which specifically comprises the following steps:

S110, setting an acquisition threshold and a first contrast; configuring a picture processing program; configuring the picture data set based on the acquisition threshold, the first contrast, and the picture processing program; in this embodiment, step S110 is the most basic data collection step, and the picture data set is a collection containing a plurality of handwritten pictures that have undergone the prescribed picture processing operations;

Specifically, a plurality of first character pictures, a plurality of second character pictures, a plurality of third character pictures and a plurality of fourth character pictures, each matched with the acquisition threshold, are acquired. In this embodiment the acquisition threshold is 300 and represents a limit on quantity, so 300 first character pictures, 300 second character pictures, 300 third character pictures and 300 fourth character pictures must be ensured. Correspondingly, the first, second, third and fourth character pictures are respectively handwritten Chinese character pictures, handwritten English pictures, handwritten digit pictures and handwritten symbol pictures, handwritten after the characters in the dictionary in a plurality of different handwriting styles that combine regular script and slightly running script.

Correspondingly, in this embodiment the picture processing program is the PS program. The picture processing program is invoked to perform inversion processing on the first character pictures, the second character pictures, the third character pictures and the fourth character pictures; inversion processing turns each picture into a negative with white text on a black background. The picture processing program is then invoked to adjust the contrast of the inverted first, second, third and fourth character pictures to the first contrast, giving the Chinese character handwriting sample pictures, English handwriting sample pictures, digital handwriting sample pictures and symbol handwriting sample pictures. In this embodiment the first contrast is set to a medium-to-low value within the range 0 to 100, which gives the pictures a grayed effect.

The Chinese character handwriting sample pictures, English handwriting sample pictures, digital handwriting sample pictures and symbol handwriting sample pictures are integrated to obtain the picture data set. In this embodiment the picture data set is stored in a database in HDF5 format, indexed by the GBK encoding of the text; each code corresponds to one Chinese character. It is conceivable that the handwriting samples for one Chinese character include samples written by several different people. This step thus builds the data reference library of the method; the picture data set in that library contains a wide variety of samples, not only Chinese characters but also various symbols, and provides the data basis for subsequently building a highly diverse training set.
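The GBK-based indexing mentioned above can be illustrated with a small Python sketch. A plain dictionary stands in for the HDF5 file here, and the hexadecimal key format and the helper names `gbk_key` and `add_sample` are assumptions; the text only states that the GBK encoding of the text serves as the index.

```python
from collections import defaultdict

def gbk_key(char):
    """Hex string of the character's GBK encoding, used here as the index."""
    return char.encode("gbk").hex()

# Stand-in for the HDF5 store: index key -> list of handwriting sample pictures.
dataset = defaultdict(list)

def add_sample(char, picture):
    """File one handwriting sample picture under the character's GBK key."""
    dataset[gbk_key(char)].append(picture)

add_sample("中", [[0, 255], [255, 0]])   # sample from one writer
add_sample("中", [[0, 200], [200, 0]])   # same character, another writer
add_sample("a", [[255, 0], [0, 255]])    # ASCII letters keep their one-byte code
```

Filing several pictures under one key reflects the remark that one character can have samples from several different writers.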

S200, acquiring a sample picture set, which specifically comprises the following steps:

S210, setting a character list matched with the picture data set; acquiring the sample picture set based on the picture data set and the character list; in this embodiment, step S210 is a key sample construction step, which includes a method for synthesizing sample pictures of different types, so that diversity of training sets is greatly improved;

Specifically, the character list is configured with: a plurality of Chinese character sample characters respectively matched with the Chinese character handwriting sample pictures, a plurality of English sample characters respectively matched with the English handwriting sample pictures, a plurality of digital sample characters respectively matched with the digital handwriting sample pictures and a plurality of symbol sample characters respectively matched with the symbol handwriting sample pictures; correspondingly, the character list is a collection of ordinary machine-displayed characters, or standard Song typeface characters, covering a plurality of Chinese characters, English letters, digits and symbols; because handwriting itself is inconvenient to recognize and tell apart, this list serves as the reference for selecting the characters to be synthesized;

Specifically, performing a Chinese character picture screening operation based on the picture data set and the character list to obtain a plurality of first synthesized Chinese character pictures and a plurality of second synthesized Chinese character pictures; in this embodiment, the Chinese character picture screening operation includes two operation aspects, the first aspect is random synthesis to improve Chinese character diversity, and the second aspect is corpus generation to improve relevance between Chinese characters;

specifically, the Chinese character picture screening operation includes:

First, a random sampling operation is carried out to guarantee diversity: a first extraction number, a Chinese character synthesis space and a synthesis number are set, and a Chinese corpus vocabulary list is configured. The first extraction number is the number of single characters extracted at a time; in this embodiment it is set to any value between 10 and 15. A plurality of first Chinese character samples matched with the first extraction number are selected from the plurality of Chinese character sample characters; a first Chinese character sample is a standard Chinese character reference in the character list. Correspondingly, in this embodiment the selection uses uniform random sampling, so that a rarely used character is as likely to appear as a common one. It is conceivable that the extraction, and hence this step, is performed multiple times to achieve a better effect. A plurality of first Chinese character pictures respectively matched with the first Chinese character samples are screened from the Chinese character handwriting sample pictures. In this embodiment the synthesis number is 2; according to the synthesis number and based on the Chinese character synthesis space, the first Chinese character pictures are synthesized side by side to obtain a plurality of first synthesized Chinese character pictures. That is, the pictures are combined into line-text pictures according to the synthesis number: every two first Chinese character pictures form one group, i.e. one vocabulary picture, and the synthesis number is not limited to this value. As an example of side-by-side synthesis, the characters for "language" and "text" are synthesized into the word for "Chinese". Correspondingly, the Chinese character synthesis space is set in units of pixels, and in this embodiment it is set to between 5 and 30 pixels;
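The random synthesis described above can be sketched in Python as follows. The sketch assumes one picture per character, pictures of equal height stored as nested lists of grayscale values, sampling without replacement, and leftover characters that do not fill a group being dropped; the names `compose_side_by_side` and `random_synthesis` are hypothetical.

```python
import random

def compose_side_by_side(pictures, gap, fill=0):
    """Join equal-height pictures left to right with `gap` blank pixel columns."""
    rows = []
    for y in range(len(pictures[0])):
        row = []
        for i, pic in enumerate(pictures):
            if i:
                row.extend([fill] * gap)   # the synthesis space between characters
            row.extend(pic[y])
        rows.append(row)
    return rows

def random_synthesis(char_pictures, extraction_number, synthesis_number, gap, rng):
    """Uniformly sample characters, then join them in groups of `synthesis_number`."""
    chosen = rng.sample(sorted(char_pictures), extraction_number)
    last_start = len(chosen) - synthesis_number + 1
    groups = [chosen[i:i + synthesis_number]
              for i in range(0, last_start, synthesis_number)]
    return [("".join(g), compose_side_by_side([char_pictures[c] for c in g], gap))
            for g in groups]

# Hypothetical 2 x 2 pictures for four characters.
pictures = {c: [[255, 0], [0, 255]] for c in "abcd"}
synthesized = random_synthesis(pictures, 4, 2, 5, random.Random(0))
```

Because every character in the list has the same chance of being drawn, a rarely used character appears as often as a common one, which is the property the uniform sampling is meant to provide.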

Next, a corpus synthesis operation is carried out to ensure the relevance among Chinese characters. In this embodiment the Chinese corpus vocabulary list includes a plurality of words and a plurality of long and short sentences, and different types of vocabulary data can be selected for it according to the application scenario, for example Chinese encyclopedia data, news encyclopedia data, computer professional data, and the like. A second Chinese character sample is selected from the Chinese character sample characters and serves as the reference for corpus generation. A plurality of Chinese vocabulary characters associated with the second Chinese character sample are screened from the Chinese corpus vocabulary list; for example, if the second Chinese character sample is "language", the Chinese vocabulary characters can be "text", "out of number", "language", "tone" and "border", and can also include sentences. Correspondingly, the Chinese vocabulary characters serve as screening templates for the handwriting: a plurality of third Chinese character samples respectively matched with the Chinese vocabulary characters are screened from the Chinese character sample characters; a second Chinese character picture matched with the second Chinese character sample is screened from the Chinese character handwriting sample pictures; a plurality of third Chinese character pictures respectively matched with the third Chinese character samples are screened from the Chinese character handwriting sample pictures; and the third Chinese character pictures are synthesized side by side with the second Chinese character picture based on the Chinese character synthesis space to obtain a plurality of second synthesized Chinese character pictures. In this embodiment the second Chinese character sample may also be selected as a plurality of consecutive characters, which further improves the relevance among Chinese characters.
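A minimal Python sketch of the corpus lookup follows. It interprets "associated" as "appears in the same corpus word", which is an assumption, as are the small example corpus, the requirement that every partner character be present in the character list, and the function name `corpus_partners`.

```python
def corpus_partners(second_sample, corpus_vocabulary, character_list):
    """Return (word, other characters) for each corpus word containing the sample.

    A word is kept only when all of its other characters are in the
    character list, so that a handwriting picture exists for each of them.
    """
    partners = []
    for word in corpus_vocabulary:
        if second_sample not in word:
            continue
        others = [c for c in word if c != second_sample]
        if others and all(c in character_list for c in others):
            partners.append((word, others))
    return partners

# Hypothetical corpus and character list.
corpus = ["语文", "语言", "数学", "语调"]
characters = set("语文言调数学")
pairs = corpus_partners("语", corpus, characters)
```

Each returned word tells the synthesis step which character pictures to place side by side, in word order, with the picture of the second Chinese character sample.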

Specifically, an English picture screening operation is performed based on the picture data set and the character list to obtain a plurality of third synthesized English pictures; in this embodiment, the English picture screening operation mainly addresses the problem that the conference tablet outputs "0r" when the user intends to input "or"; for this situation, the corpus relation among English letters needs to be strengthened;

Specifically, the English picture screening operation includes: setting a segmentation unit and an English synthesis interval; configuring an English corpus vocabulary list; in this embodiment, the segmentation unit is a single letter, which improves the accuracy of English recognition; the English synthesis interval is similar to the Chinese character synthesis interval and is likewise set in pixels; the English corpus vocabulary list is similar to the Chinese corpus vocabulary list and contains English encyclopedia vocabulary and sentences, English behavior vocabulary and sentences, and the like from various fields; selecting a first English sample from a plurality of English sample characters; the first English sample consists purely of English letters and contains no non-letter content such as numbers or symbols; in this embodiment, the first English sample may be a long or short sentence corresponding to the English corpus vocabulary list, kept within 20 letters, so as to improve the diversity of letter combinations; determining, in the English corpus vocabulary list, a plurality of first English vocabulary characters whose first character position is the first English sample; the first English vocabulary characters likewise consist purely of English letters and contain no non-letter content such as numbers or symbols; the first character position is the first position in the vocabulary entry; for example, if the first English sample is the letter "o", the first English vocabulary characters may include "or", "okay", "open", and the like; dividing the first English vocabulary characters according to the segmentation unit to obtain second English vocabulary characters; in the above example, the second English vocabulary characters are "o", "r", "k", "a", "y", "p", "e", "n"; the second English vocabulary characters also serve as screening templates for the handwriting sample pictures; removing, from the second English vocabulary characters, the vocabulary characters that match the first English sample (in the example, "o") to obtain a plurality of third English vocabulary characters (in the example, "r", "k", "a", "y", "p", "e", "n"); screening, from the English sample characters, a plurality of second English samples respectively matched with the third English vocabulary characters; screening, from a plurality of English handwriting sample pictures, a first English picture matched with the first English sample; screening, from the English handwriting sample pictures, a plurality of second English pictures respectively matched with the second English samples; synthesizing the second English pictures respectively with the first English picture side by side based on the English synthesis interval to obtain a plurality of third synthesized English pictures; in this embodiment, the steps of selecting samples and synthesizing pictures may be performed multiple times, thereby improving diversity.
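The character-level part of the English picture screening operation above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `english_pairs` and the tiny vocabulary list are hypothetical, the first English sample is taken to be a single letter as in the "o" example, and the lookup of handwriting pictures and their side-by-side synthesis are left out.

```python
def english_pairs(first_sample, vocabulary):
    """Return (first_sample, other_letter) pairs to be synthesized side by side."""
    # First English vocabulary characters: entries whose first position is the sample.
    first_vocab = [word for word in vocabulary if word.startswith(first_sample)]
    # Second English vocabulary characters: split every entry into single letters,
    # keeping first-seen order and dropping repeats.
    second_chars = []
    for word in first_vocab:
        for letter in word:
            if letter not in second_chars:
                second_chars.append(letter)
    # Third English vocabulary characters: remove the letter that equals the sample.
    third_chars = [c for c in second_chars if c != first_sample]
    return [(first_sample, c) for c in third_chars]


# Hypothetical vocabulary; "and" is ignored because it does not start with "o".
pairs = english_pairs("o", ["or", "okay", "open", "and"])
```

Each resulting pair names the two handwriting pictures that would be placed side by side, so that combinations such as "o" followed by "r" are represented in the training data.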

Specifically, a formula picture screening operation is performed based on the picture data set and the character list to obtain a plurality of fourth mathematical formula pictures; in this embodiment, the formula picture screening operation prevents recognition errors caused by similar glyph shapes when the conference tablet recognizes a mathematical formula; for example, it prevents an intended input of "1+2=3" from being recognized as "i twelve", that is, it prevents "1" from being recognized as "i", "+" from being recognized as "ten" (the Chinese character 十), "=" from being recognized as "two" (the Chinese character 二), and "3" from being recognized as "two";

specifically, the formula picture screening operation includes:

Setting a formula synthesis interval, a digit number, an operator digit number, and a digit and operator synthesis sequence; in this embodiment, the formula synthesis interval is likewise set in pixels; the digit number covers both the number of digits of the number itself and the number of gap positions between digits; the operator digit number refers to the positions that correspond to the gaps between digits; the operator digit number is set according to the digit number so that the two match, and it may be smaller than, larger than, or equal to the digit number; the digit and operator synthesis sequence is the order in which digits and operators are arranged; specifically, in this embodiment, a digit character is placed first and is followed by an operator; the sequence can be configured in many ways according to the complexity of the formula, which yields high diversity;

Selecting a plurality of first digital samples from a plurality of digital sample characters according to the digit number; in this embodiment, the number itself has at most six digits, and a gap position is the position between every two single digit characters; for example, when the number itself has 3 digits, it can only be selected from the range of 100 to 999; selecting a plurality of operation character samples from a plurality of symbol sample characters according to the operator digit number; for example, if the digit number is 3, the first digital sample is any number between 100 and 999; if the first digital sample is 101, the gap positions are the gap between 1 and 0, the gap between 0 and 1, and the gap after the final 1, so the number of gap positions is also 3; correspondingly, the operation character samples should be 3 arbitrary mathematical operators from the symbol sample characters, for example "+", "=" and "<"; the subsequently synthesized mathematical formula may then be "1+0=1<"; the first digital samples and the operation character samples also serve as selection templates for the handwriting sample pictures: screening, from a plurality of digital handwriting sample pictures, a plurality of first digital pictures respectively matched with the first digital samples; screening, from a plurality of symbol handwriting sample pictures, a plurality of operation character pictures respectively matched with the operation character samples; synthesizing the first digital pictures and the operation character pictures according to the digit and operator synthesis sequence and the formula synthesis interval to obtain a plurality of fourth mathematical formula pictures; in this embodiment, the length of a formula is limited to at most 15 characters per formula picture, which ensures the validity of the formula and saves storage space in the training set.
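The digit and operator arrangement described above can be sketched at the character level as follows. This is a minimal sketch, not the patented implementation: the names `interleave`, `make_formula` and `OPERATORS` are hypothetical, the operator list is an assumed subset of the symbol sample characters, and the picture lookup and pixel spacing are omitted.

```python
import random

OPERATORS = ["+", "-", "=", "<", ">"]  # assumed subset of the symbol sample characters
MAX_FORMULA_LEN = 15                   # length limit stated in the embodiment


def interleave(number, operators):
    """Arrange characters as digit, operator, digit, operator, ..."""
    return "".join(digit + op for digit, op in zip(number, operators))


def make_formula(digit_count, rng):
    """Pick a number with `digit_count` digits and one operator per gap position."""
    if not 1 <= digit_count <= 6:
        raise ValueError("the embodiment allows at most six digits")
    low = 10 ** (digit_count - 1) if digit_count > 1 else 0
    number = str(rng.randint(low, 10 ** digit_count - 1))  # e.g. 100..999 for 3 digits
    # One operator per gap; the gap after the last digit counts as well.
    operators = [rng.choice(OPERATORS) for _ in number]
    formula = interleave(number, operators)
    if len(formula) > MAX_FORMULA_LEN:
        raise ValueError("formula exceeds the 15-character limit")
    return formula


example = interleave("101", ["+", "=", "<"])
formula = make_formula(3, random.Random(0))
```

With at most six digits and one operator per gap, a generated string has at most 12 characters, so the 15-character check only guards against other arrangement orders that a variant might add.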

Specifically, integrating the plurality of first synthesized Chinese character pictures, the plurality of second synthesized Chinese character pictures, the plurality of third synthesized English pictures and the plurality of fourth mathematical formula pictures to obtain the sample picture set; the sample picture set is highly diverse and widely applicable, and serves as the main base data for the subsequent generation of the training set.

S300, generating a recognition training set, which specifically comprises the following steps:

S310, setting a scaling percentage and a rotation angle; performing a sample transformation extraction operation based on the sample picture set, the scaling percentage and the rotation angle to obtain the recognition training set; in this embodiment, this step takes into account the differing handwriting habits of multiple users, further improving the adaptability of the conference tablet and the diversity of the training set;

Specifically, in this embodiment, the scaling percentage is 0.7 to 0.1 and the rotation angle is 5 degrees; the sample transformation extraction operation includes: scaling the first synthesized Chinese character pictures, the second synthesized Chinese character pictures, the third synthesized English pictures and the fourth mathematical formula pictures respectively according to the scaling percentage to obtain first training set pictures, second training set pictures, third training set pictures and fourth training set pictures; performing rotation transformation on the first training set pictures, the second training set pictures, the third training set pictures and the fourth training set pictures respectively according to the rotation angle to obtain a plurality of first to-be-extracted synthesized Chinese character pictures, a plurality of second to-be-extracted synthesized Chinese character pictures, a plurality of third to-be-extracted English pictures and a plurality of fourth to-be-extracted mathematical formula pictures; in this embodiment, "respectively" means that each kind of picture is transformed according to its own probabilities, for example: rotating 50 percent of the first synthesized Chinese character pictures clockwise by 5 degrees, and applying a scaling operation of 0.7 to 0.1 times to 1 percent of the rotated pictures; the picture transformation algorithms used for the different kinds may be the same or different, and varying the algorithms can further improve the diversity of the training set;
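The probability-based choice of transformations in the example above can be sketched as a per-picture plan. This is a hedged illustration only: the function name `plan_transforms` and its parameters are hypothetical, the scale bounds are passed in exactly as the text states them, and the actual rotation and scaling of pixels (normally done with an imaging library) is not shown.

```python
import random


def plan_transforms(picture_ids, rng, rotate_prob=0.5, angle_deg=5,
                    scale_prob=0.01, scale_bounds=(0.7, 0.1)):
    """Decide, per picture, which rotation and scaling to apply."""
    plan = {}
    for pid in picture_ids:
        step = {"rotate_deg": 0, "scale": 1.0}
        if rng.random() < rotate_prob:
            step["rotate_deg"] = angle_deg  # clockwise rotation by the set angle
            # Following the example, only rotated pictures are candidates for scaling.
            if rng.random() < scale_prob:
                step["scale"] = rng.uniform(*scale_bounds)
        plan[pid] = step
    return plan


plan = plan_transforms(range(1000), random.Random(42))
```

Keeping the decision separate from the pixel operations makes it easy to give each kind of picture its own probabilities, as the embodiment suggests.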

Correspondingly, to keep the training set effective, reduce its storage footprint, and improve its efficiency in use, a first Chinese character proportion, a second Chinese character proportion, an English proportion and a formula proportion are set; the first Chinese character proportion, the second Chinese character proportion, the English proportion and the formula proportion are respectively: 5/12, 1/12; the proportions are set according to the application scenario of the training set; for example, when the training set is applied to a foreign language conference, the English proportion should be higher, and when it is applied to mathematical discussion, the formula proportion should be higher; correspondingly, calculating the sum of the quantities of the plurality of first to-be-extracted synthesized Chinese character pictures, the plurality of second to-be-extracted synthesized Chinese character pictures, the plurality of third to-be-extracted English pictures and the plurality of fourth to-be-extracted mathematical formula pictures; calculating the products of this sum with the first Chinese character proportion, the second Chinese character proportion, the English proportion and the formula proportion respectively to obtain a first Chinese character selection quantity, a second Chinese character selection quantity, an English selection quantity and a formula selection quantity; selecting a plurality of first Chinese character pictures to be processed from the plurality of first Chinese character pictures to be extracted according to the first Chinese character selection quantity; selecting a plurality of second Chinese character pictures to be processed from the plurality of second Chinese character pictures to be extracted according to the second Chinese character selection quantity; selecting a plurality of third English pictures to be processed from the plurality of third English pictures to be extracted according to the English selection quantity; selecting a plurality of fourth formula pictures to be processed from the plurality of fourth formula pictures to be extracted according to the formula selection quantity; integrating the plurality of first Chinese character pictures to be processed, the plurality of second Chinese character pictures to be processed, the plurality of third English pictures to be processed and the plurality of fourth formula pictures to be processed to obtain the recognition training set.
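The proportion-based extraction above can be sketched as follows. This is a minimal sketch under loudly stated assumptions: the text lists only "5/12, 1/12" for four proportions, so the four values used here (5/12, 5/12, 1/12, 1/12) are purely hypothetical; the function name, the category keys, and the cap that limits a selection quantity to the number of available pictures are likewise additions for illustration.

```python
import random
from fractions import Fraction


def select_by_proportion(candidates, proportions, rng):
    """Pick from each category a share of the total candidate count."""
    # Sum of the quantities of all to-be-extracted pictures.
    total = sum(len(pics) for pics in candidates.values())
    selected = {}
    for name, pics in candidates.items():
        # Selection quantity = sum * proportion, capped by what is available
        # (the cap is an assumption, not stated in the text).
        quantity = min(len(pics), int(total * proportions[name]))
        selected[name] = rng.sample(pics, quantity)
    return selected


# Hypothetical proportions; only "5/12, 1/12" appears in the text.
proportions = {
    "hanzi_random": Fraction(5, 12),
    "hanzi_corpus": Fraction(5, 12),
    "english": Fraction(1, 12),
    "formula": Fraction(1, 12),
}
candidates = {name: [f"{name}_{i}" for i in range(60)] for name in proportions}
training_set = select_by_proportion(candidates, proportions, random.Random(0))
```

With 60 candidates per category the sum is 240, so the English and formula categories each contribute 20 pictures, while the two Chinese character categories would ask for 100 each and are capped at the 60 available.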

Example 2

The present embodiment provides a handwriting recognition training set generation system for a tablet, as shown in FIG. 4, including: a picture data set configuration module, a sample picture set acquisition module and a recognition training set generation module;

In the handwriting recognition training set generation system for a tablet, the picture data set configuration module is used for setting an acquisition threshold and a first contrast; the picture data set configuration module is also used for configuring a picture processing program; the picture data set configuration module configures the picture data set based on the acquisition threshold, the first contrast, and the picture processing program;

Specifically, the picture data set configuration module acquires a plurality of first character pictures, a plurality of second character pictures, a plurality of third character pictures and a plurality of fourth character pictures which are respectively matched with the acquisition threshold; the picture data set configuration module invokes the picture processing program to perform reverse phase (color inversion) processing on the first character pictures, the second character pictures, the third character pictures and the fourth character pictures; the picture data set configuration module invokes the picture processing program to adjust the contrast of the first character pictures, the second character pictures, the third character pictures and the fourth character pictures after the reverse phase processing to the first contrast, so as to obtain Chinese character handwriting sample pictures, English handwriting sample pictures, digital handwriting sample pictures and symbol handwriting sample pictures; the picture data set configuration module integrates the plurality of Chinese character handwriting sample pictures, the plurality of English handwriting sample pictures, the plurality of digital handwriting sample pictures and the plurality of symbol handwriting sample pictures to obtain the picture data set.
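The two preprocessing steps performed by the picture data set configuration module can be sketched on a tiny grayscale picture as follows. This is an illustrative sketch only: pictures are assumed to be rows of 8-bit gray values, the function names are hypothetical, and the linear contrast factor is an assumed stand-in, since this section does not say how the first contrast is defined.

```python
def invert(picture):
    """Reverse phase processing: invert an 8-bit grayscale picture."""
    return [[255 - value for value in row] for row in picture]


def adjust_contrast(picture, factor):
    """Scale each pixel's distance from mid-gray (128) by `factor`, clamped to 0..255."""
    return [[max(0, min(255, round(128 + (value - 128) * factor))) for value in row]
            for row in picture]


# Dark strokes on a light background, as a scanned handwriting picture might look.
raw = [[255, 201], [41, 0]]
inverted = invert(raw)                    # light strokes on a dark background
sample = adjust_contrast(inverted, 1.5)   # 1.5 is an assumed contrast factor
```

After both steps the stroke pixels are pushed toward white and the background toward black, which is the usual motivation for inverting and raising contrast before training.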

In the handwriting recognition training set generation system for the tablet, a sample picture set acquisition module is used for setting a character list matched with the picture data set; the sample picture set acquisition module acquires the sample picture set based on the picture data set and the character list;

Specifically, the character list is configured with: a plurality of Chinese character sample characters respectively matched with the Chinese character handwriting sample pictures, a plurality of English sample characters respectively matched with the English handwriting sample pictures, a plurality of digital sample characters respectively matched with the digital handwriting sample pictures and a plurality of symbol sample characters respectively matched with the symbol handwriting sample pictures;

Specifically, the sample picture set acquisition module executes a Chinese character picture screening operation based on the picture data set and the character list to obtain a plurality of first synthesized Chinese character pictures and a plurality of second synthesized Chinese character pictures; the Chinese character picture screening operation comprises the following steps: the sample picture set acquisition module sets a first extraction number, a Chinese character synthesis interval and a synthesis number; the sample picture set acquisition module configures a Chinese corpus vocabulary list; the sample picture set acquisition module selects, from a plurality of Chinese character sample characters, a plurality of first Chinese character samples matched with the first extraction number; the sample picture set acquisition module screens, from a plurality of Chinese character handwriting sample pictures, a plurality of first Chinese character pictures respectively matched with the first Chinese character samples; the sample picture set acquisition module synthesizes the first Chinese character pictures side by side according to the synthesis number and based on the Chinese character synthesis interval to obtain a plurality of first synthesized Chinese character pictures; the sample picture set acquisition module selects a second Chinese character sample from the Chinese character sample characters; the sample picture set acquisition module screens, in the Chinese corpus vocabulary list, a plurality of Chinese vocabulary characters associated with the second Chinese character sample; the sample picture set acquisition module screens, from the Chinese character sample characters, a plurality of third Chinese character samples respectively matched with the Chinese vocabulary characters; the sample picture set acquisition module screens, from the Chinese character handwriting sample pictures, a second Chinese character picture matched with the second Chinese character sample; the sample picture set acquisition module screens, from the Chinese character handwriting sample pictures, a plurality of third Chinese character pictures respectively matched with the third Chinese character samples; the sample picture set acquisition module synthesizes the third Chinese character pictures respectively with the second Chinese character picture side by side based on the Chinese character synthesis interval to obtain a plurality of second synthesized Chinese character pictures;
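The side-by-side synthesis with a pixel interval, which the screening operations rely on, can be sketched as follows. This is a minimal sketch with hypothetical names: pictures are assumed to be rows of grayscale values of equal height, and the interval is filled with an assumed background value of 0.

```python
def compose_side_by_side(pictures, gap_px, background=0):
    """Join equally tall grayscale pictures left to right with a pixel gap."""
    height = len(pictures[0])
    if any(len(picture) != height for picture in pictures):
        raise ValueError("all pictures must have the same height")
    rows = []
    for y in range(height):
        row = []
        for index, picture in enumerate(pictures):
            if index > 0:
                row.extend([background] * gap_px)  # the synthesis interval, in pixels
            row.extend(picture[y])
        rows.append(row)
    return rows


a = [[1, 1], [1, 1]]   # stand-in for one handwritten character picture
b = [[2], [2]]         # stand-in for another character picture
combined = compose_side_by_side([a, b], gap_px=3)
```

Real handwriting pictures would first need to be brought to a common height; that step is not described in this section and is therefore left out.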

Specifically, the sample picture set acquisition module executes an English picture screening operation based on the picture data set and the character list to obtain a plurality of third synthesized English pictures; the English picture screening operation comprises the following steps: the sample picture set acquisition module sets a segmentation unit and an English synthesis interval; the sample picture set acquisition module configures an English corpus vocabulary list; the sample picture set acquisition module selects a first English sample from a plurality of English sample characters; the sample picture set acquisition module determines, in the English corpus vocabulary list, a plurality of first English vocabulary characters whose first character position is the first English sample; the sample picture set acquisition module divides the first English vocabulary characters according to the segmentation unit to obtain second English vocabulary characters; the sample picture set acquisition module removes, from the second English vocabulary characters, the vocabulary characters matched with the first English sample to obtain a plurality of third English vocabulary characters; the sample picture set acquisition module screens, from the English sample characters, a plurality of second English samples respectively matched with the third English vocabulary characters; the sample picture set acquisition module screens, from a plurality of English handwriting sample pictures, a first English picture matched with the first English sample; the sample picture set acquisition module screens, from the English handwriting sample pictures, a plurality of second English pictures respectively matched with the second English samples; the sample picture set acquisition module synthesizes the second English pictures respectively with the first English picture side by side based on the English synthesis interval to obtain a plurality of third synthesized English pictures;

Specifically, the sample picture set acquisition module executes a formula picture screening operation based on the picture data set and the character list to obtain a plurality of fourth mathematical formula pictures; the formula picture screening operation comprises the following steps: the sample picture set acquisition module sets a formula synthesis interval, a digit number, an operator digit number and a digit and operator synthesis sequence; the sample picture set acquisition module selects a plurality of first digital samples from a plurality of digital sample characters according to the digit number; the sample picture set acquisition module selects a plurality of operation character samples from a plurality of symbol sample characters according to the operator digit number; the sample picture set acquisition module screens, from a plurality of digital handwriting sample pictures, a plurality of first digital pictures respectively matched with the first digital samples; the sample picture set acquisition module screens, from a plurality of symbol handwriting sample pictures, a plurality of operation character pictures respectively matched with the operation character samples; the sample picture set acquisition module synthesizes the first digital pictures and the operation character pictures according to the digit and operator synthesis sequence and the formula synthesis interval to obtain a plurality of fourth mathematical formula pictures;

Specifically, the sample picture set acquisition module integrates a plurality of first synthesized Chinese character pictures, a plurality of second synthesized Chinese character pictures, a plurality of third synthesized English pictures and a plurality of fourth mathematical formula pictures to obtain the sample picture set.

In the handwriting recognition training set generation system for a tablet, the recognition training set generation module is used for setting a scaling percentage and a rotation angle; the recognition training set generation module executes a sample transformation extraction operation based on the sample picture set, the scaling percentage and the rotation angle to obtain the recognition training set;

Specifically, the sample transformation extraction operation includes:

The recognition training set generation module performs scaling treatment on the first synthesized Chinese character pictures, the second synthesized Chinese character pictures, the third synthesized English pictures and the fourth mathematical formula pictures according to the scaling percentage to obtain first training set pictures, second training set pictures, third training set pictures and fourth training set pictures; the recognition training set generation module respectively carries out rotation transformation on the first training set pictures, the second training set pictures, the third training set pictures and the fourth training set pictures according to the rotation angles to obtain a plurality of first to-be-extracted Chinese character pictures, a plurality of second to-be-extracted Chinese character pictures, a plurality of third to-be-extracted English pictures and a plurality of fourth to-be-extracted mathematical formula pictures;

The recognition training set generation module sets a first Chinese character proportion, a second Chinese character proportion, an English proportion and a formula proportion; the recognition training set generation module calculates the sum of the quantities of the plurality of first to-be-extracted synthesized Chinese character pictures, the plurality of second to-be-extracted synthesized Chinese character pictures, the plurality of third to-be-extracted English pictures and the plurality of fourth to-be-extracted mathematical formula pictures; the recognition training set generation module calculates the products of this sum with the first Chinese character proportion, the second Chinese character proportion, the English proportion and the formula proportion respectively to obtain a first Chinese character selection quantity, a second Chinese character selection quantity, an English selection quantity and a formula selection quantity;

The recognition training set generation module selects a plurality of first Chinese character pictures to be processed from the plurality of first Chinese character pictures to be extracted according to the first Chinese character selection quantity; the recognition training set generation module selects a plurality of second Chinese character pictures to be processed from the plurality of second Chinese character pictures to be extracted according to the second Chinese character selection quantity; the recognition training set generation module selects a plurality of third English pictures to be processed from the plurality of third English pictures to be extracted according to the English selection quantity; the recognition training set generation module selects a plurality of fourth formula pictures to be processed from the plurality of fourth formula pictures to be extracted according to the formula selection quantity; the recognition training set generation module integrates the plurality of first Chinese character pictures to be processed, the plurality of second Chinese character pictures to be processed, the plurality of third English pictures to be processed and the plurality of fourth formula pictures to be processed to obtain the recognition training set.

Example 3

The present embodiment provides a computer-readable storage medium including:

The storage medium is used for storing computer software instructions for implementing the handwriting recognition training set generation method for a tablet according to Embodiment 1, and the computer software instructions include a program configured to execute that method; specifically, the executable program may be built into the handwriting recognition training set generation system for a tablet according to Embodiment 2, so that the system can implement the method according to Embodiment 1 by executing the built-in executable program.

Further, the computer readable storage medium provided in the present embodiment may be any combination of one or more readable storage media, where the readable storage media includes an electric, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof.

Compared with the prior art, the method, system and medium for generating a handwriting recognition training set for a tablet provided by the invention can generate a training set that contains a plurality of rarely used words, symbols and formulas as well as a plurality of font samples, symbols and formulas with semantic relations; when applied, the training set therefore offers both diversity of handwriting samples and relevance among them; the system provides effective technical support for the method, so that the range and efficiency of handwriting recognition on the tablet are improved, which in turn improves the applicability of the tablet and the user experience, overcomes the defects of the prior art, and gives the tablet high market value and product competitiveness.

The serial numbers of the foregoing embodiments of the present invention are for description only and do not represent the advantages or disadvantages of the embodiments.

It will be appreciated by those of ordinary skill in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program that instructs related hardware; the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present invention.

Claims

1. The handwriting recognition training set generation method for the tablet is characterized by comprising the following steps of:

Configuring a picture data set;

Acquiring a sample picture set:

Generating a recognition training set:

Setting a scaling percentage and a rotation angle; performing a sample transformation extraction operation based on the sample picture set, the scaling percentage and the rotation angle to obtain the recognition training set;

The step of configuring the picture dataset based on the acquisition threshold, the first contrast, and the picture processing program further comprises:

Acquiring a plurality of first character pictures, a plurality of second character pictures, a plurality of third character pictures and a plurality of fourth character pictures which are respectively matched with the acquisition threshold; the first character picture, the second character picture, the third character picture and the fourth character picture are respectively a handwritten Chinese character picture, a handwritten English picture, a handwritten digital picture and a handwritten symbol picture which are combined by a regular script and a slight running script which are handwritten according to a plurality of different fonts in a dictionary;

2. The method for generating a handwriting recognition training set for a tablet according to claim 1, wherein the character list is configured with: a plurality of Chinese character sample characters respectively matched with the Chinese character handwriting sample pictures, a plurality of English sample characters respectively matched with the English handwriting sample pictures, a plurality of digital sample characters respectively matched with the digital handwriting sample pictures and a plurality of symbol sample characters respectively matched with the symbol handwriting sample pictures.

3. The method for generating a handwriting recognition training set for a tablet according to claim 2, wherein the step of acquiring the sample picture set based on the picture data set and the character list further comprises:

4. The method for generating a handwriting recognition training set for a tablet according to claim 3, wherein the Chinese character picture screening operation comprises:

5. The method for generating a handwriting recognition training set for a tablet according to claim 4, wherein the English picture screening operation comprises:

Selecting a first English sample from a plurality of English sample characters;

6. The method for generating a handwriting recognition training set for a tablet according to claim 5, wherein the formula picture screening operation comprises:

7. The method of claim 6, wherein the sample transformation extraction operation comprises:

8. A handwriting recognition training set generation system for a tablet based on the handwriting recognition training set generation method for a tablet according to any one of claims 1 to 7, characterized by comprising: a picture data set configuration module, a sample picture set acquisition module and a recognition training set generation module;

the recognition training set generation module is used for setting a scaling percentage and a rotation angle; the recognition training set generation module executes sample transformation extraction operation based on the sample picture set, the scaling percentage and the rotation angle to obtain the recognition training set;

The picture data set configuration module configures the picture data set based on the acquisition threshold, the first contrast, and the picture processing program, further comprising:

The picture data set configuration module acquires a plurality of first character pictures, a plurality of second character pictures, a plurality of third character pictures and a plurality of fourth character pictures which are respectively matched with the acquisition threshold; the first character picture, the second character picture, the third character picture and the fourth character picture are respectively a handwritten Chinese character picture, a handwritten English picture, a handwritten digital picture and a handwritten symbol picture which are combined by a regular script and a slight running script which are handwritten according to a plurality of different fonts in a dictionary;

the picture data set configuration module invokes the picture processing program to perform reverse phase processing on the first character pictures, the second character pictures, the third character pictures and the fourth character pictures;

The picture data set configuration module invokes the picture processing program to adjust the contrast of the first character pictures, the second character pictures, the third character pictures and the fourth character pictures after the reverse phase processing to the first contrast to obtain Chinese character handwriting sample pictures, English handwriting sample pictures, digital handwriting sample pictures and symbol handwriting sample pictures;

The picture data set configuration module integrates a plurality of Chinese character handwriting sample pictures, a plurality of English handwriting sample pictures, a plurality of digital handwriting sample pictures and a plurality of symbol handwriting sample pictures to obtain the picture data set.

9. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program when executed by a processor implements the steps of the handwriting recognition training set generation method for a tablet according to any one of claims 1 to 7.