CN114022886A

CN114022886A - Method, system and medium for generating handwriting recognition training set for tablet computer

Info

Publication number: CN114022886A
Application number: CN202111219556.6A
Authority: CN
Inventors: 孙成通; 赵亚欧; 胡焱; 牛鹏
Original assignee: Inspur Financial Information Technology Co Ltd
Current assignee: Inspur Financial Information Technology Co Ltd
Priority date: 2021-10-20
Filing date: 2021-10-20
Publication date: 2022-02-08
Anticipated expiration: 2041-10-20
Also published as: CN114022886B

Abstract

The invention discloses a method, a system and a medium for generating a handwriting recognition training set for a tablet, wherein the method comprises the following steps: setting an acquisition threshold and a first contrast; configuring a picture processing program; configuring a picture data set based on the acquisition threshold, the first contrast and the picture processing program; setting a character list matched with the picture data set; obtaining a sample picture set based on the picture data set and the character list; setting a scaling percentage and a rotation angle; and executing a sample transformation extraction operation based on the sample picture set, the scaling percentage and the rotation angle to obtain the recognition training set. The invention can generate a training set which comprises not only a plurality of rarely-used characters, symbols and formulas, but also a plurality of semantically related character samples, symbols and formulas, thereby improving the range and efficiency of handwriting recognition on the tablet and further improving the applicability of the tablet.

Description

Method, system and medium for generating handwriting recognition training set for tablet computer

Technical Field

The invention relates to the technical field of handwriting font recognition, in particular to a method, a system and a medium for generating a training set for handwriting recognition of a tablet.

Background

As tablets are used more and more widely at work, the requirements on their handwriting recognition keep rising. A tablet uses a deep learning algorithm together with a handwriting sample training set to recognize handwriting written on it, and this approach places extremely high requirements on how the handwriting sample training set is generated and configured.

In the prior art, a handwriting sample training set is generated in one of two ways. In the first, samples are randomly extracted from a set of handwriting samples; this ensures the diversity of the handwriting samples in the training set but cannot ensure the relevance among them. In the second, the training set is configured according to a corpus of sentences; this ensures the relevance among the handwriting samples but loses their diversity, so that rarely-used characters easily fail to be recognized. Therefore, a method for generating a handwriting sample training set is needed that ensures both the diversity of the handwriting samples and the relevance among them.

Disclosure of Invention

The invention mainly aims to develop a method for generating a handwriting sample training set, which can ensure the diversity of handwriting samples and the relevance among the handwriting samples.

In order to achieve the purpose, the invention adopts a technical scheme that: the method for generating the handwriting recognition training set for the tablet computer comprises the following steps of:

configuring a picture data set:

setting an acquisition threshold and a first contrast; configuring a picture processing program; configuring the picture data set based on the acquisition threshold, the first contrast, and the picture processing program;

acquiring a sample picture set:

setting a character list matched with the picture data set; obtaining the sample picture set based on the picture data set and the character list;

generating a recognition training set:

setting a scaling percentage and a rotation angle; and executing a sample transformation extraction operation based on the sample picture set, the scaling percentage and the rotation angle to obtain the recognition training set.

As an improvement, the step of configuring the picture data set based on the acquisition threshold, the first contrast and the picture processing program further comprises:

acquiring a plurality of first character pictures, a plurality of second character pictures, a plurality of third character pictures and a plurality of fourth character pictures which are respectively matched with the acquisition threshold;

calling the picture processing program to perform color inversion processing on the first character pictures, the second character pictures, the third character pictures and the fourth character pictures;

calling the picture processing program to adjust the contrast of the first character pictures, the second character pictures, the third character pictures and the fourth character pictures after the color inversion processing to the first contrast, so as to obtain a plurality of Chinese character handwriting sample pictures, a plurality of English handwriting sample pictures, a plurality of digital handwriting sample pictures and a plurality of symbol handwriting sample pictures;

integrating a plurality of Chinese character handwriting sample pictures, a plurality of English handwriting sample pictures, a plurality of digital handwriting sample pictures and a plurality of symbol handwriting sample pictures to obtain the picture data set.
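The inversion (negative) and contrast-adjustment steps above can be sketched in a few lines. This is only an illustration under stated assumptions: a grayscale picture is modelled as a list of pixel rows with values 0 to 255, the function names are hypothetical, and the contrast formula (scaling each pixel's distance from mid-gray) is one common choice rather than the behaviour of the picture processing program named in the embodiment.

```python
# Grayscale pictures are modelled as lists of rows with pixel values 0..255.
# This stands in for the picture processing program; it is not that program's workflow.

def invert(picture):
    """Return the negative: dark strokes on a light page become light strokes on a dark page."""
    return [[255 - p for p in row] for row in picture]


def adjust_contrast(picture, factor):
    """Scale each pixel's distance from mid-gray (128) by `factor`, then clamp to 0..255.

    `factor` is a hypothetical stand-in for the first contrast; values below 1.0
    flatten the picture toward gray.
    """
    adjusted = []
    for row in picture:
        adjusted.append([max(0, min(255, round(128 + (p - 128) * factor))) for p in row])
    return adjusted


def prepare_sample(picture, factor=0.5):
    """Invert first, then bring the contrast to the chosen level."""
    return adjust_contrast(invert(picture), factor)


scan = [[255, 255, 27], [255, 27, 255]]  # light page with two dark stroke pixels
print(prepare_sample(scan))  # [[64, 64, 178], [64, 178, 64]]
```

With a factor below 1.0 the inverted background no longer stays fully black, which corresponds to the graying effect that a low contrast setting produces.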

As an improved scheme, the character list is configured with: a plurality of Chinese character sample characters respectively matched with the plurality of Chinese character handwriting sample pictures, a plurality of English sample characters respectively matched with the plurality of English handwriting sample pictures, a plurality of digital sample characters respectively matched with the plurality of digital handwriting sample pictures, and a plurality of symbol sample characters respectively matched with the plurality of symbol handwriting sample pictures.

As an improvement, the step of obtaining the sample picture set based on the picture data set and the character list further comprises:

executing Chinese character picture screening operation based on the picture data set and the character list to obtain a plurality of first synthesized Chinese character pictures and a plurality of second synthesized Chinese character pictures;

performing English picture screening operation based on the picture data set and the character list to obtain a plurality of third synthetic English pictures;

executing formula picture screening operation based on the picture data set and the character list to obtain a plurality of fourth mathematical formula pictures;

and integrating a plurality of first synthesized Chinese character pictures, a plurality of second synthesized Chinese character pictures, a plurality of third synthesized English pictures and a plurality of fourth mathematical formula pictures to obtain the sample picture set.

As an improved scheme, the Chinese character picture screening operation includes:

setting a first extraction quantity, a Chinese character synthesis interval and a synthesis quantity; configuring a Chinese corpus vocabulary table;

selecting a plurality of first Chinese character samples matched with the first extraction quantity from a plurality of Chinese character sample characters; screening a plurality of first Chinese character pictures respectively matched with a plurality of first Chinese character samples from a plurality of Chinese character handwriting sample pictures; synthesizing a plurality of first Chinese character pictures side by side according to the synthesis quantity and based on the Chinese character synthesis intervals to obtain a plurality of first synthesized Chinese character pictures;

selecting a second Chinese character sample from the Chinese character sample characters; screening a plurality of Chinese vocabulary characters associated with the second Chinese character sample in the Chinese corpus vocabulary table; screening a plurality of third Chinese character samples which are respectively matched with a plurality of Chinese vocabulary characters from a plurality of Chinese character sample characters; screening a second Chinese character picture matched with the second Chinese character sample from the plurality of Chinese character handwriting sample pictures; screening a plurality of third Chinese character pictures respectively matched with a plurality of third Chinese character samples from a plurality of Chinese character handwriting sample pictures; and respectively synthesizing the plurality of third Chinese character pictures and the second Chinese character pictures in parallel based on the Chinese character synthesis intervals to obtain a plurality of second synthesized Chinese character pictures.

As an improved scheme, the English picture screening operation includes:

setting a segmentation unit and an English synthesis interval; configuring an English corpus vocabulary;

selecting a first English sample from the English sample characters;

identifying, in the English corpus vocabulary, a plurality of first English vocabulary characters whose first character position is occupied by the first English sample; segmenting the first English vocabulary characters according to the segmentation unit to obtain second English vocabulary characters; removing vocabulary characters matched with the first English sample from the second English vocabulary characters to obtain third English vocabulary characters; screening a plurality of second English samples which are respectively matched with the plurality of third English vocabulary characters from the plurality of English sample characters;

screening a first English picture matched with the first English sample from the plurality of English handwriting sample pictures; screening a plurality of second English pictures respectively matched with a plurality of second English samples from a plurality of English handwriting sample pictures; and respectively synthesizing the plurality of second English pictures and the first English picture in parallel based on the English synthesis intervals to obtain a plurality of third synthesized English pictures.

As an improved solution, the formula picture screening operation includes:

setting a formula synthesis interval, a digit number, an operator digit number, and a digit and operator synthesis sequence;

selecting a plurality of first digital samples from the plurality of digital sample characters according to the digit number; selecting a plurality of operation character samples from the plurality of symbol sample characters according to the operator digit number;

screening a plurality of first digital pictures respectively matched with the first digital samples from the plurality of digital handwriting sample pictures; screening a plurality of operation character pictures respectively matched with a plurality of operation character samples from a plurality of symbol handwriting sample pictures;

and synthesizing the plurality of first digital pictures and the plurality of operation character pictures according to the digit and operator synthesis sequence and the formula synthesis interval to obtain a plurality of fourth mathematical formula pictures.
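The ordering of digits and operators in a formula can be sketched as follows. The sketch assumes an alternating sequence (one digit, then one operator), works on character labels rather than pictures, and uses a hypothetical operator set; `interleave` and `random_formula` are illustrative names, not part of the described method.

```python
import random
from itertools import zip_longest


def interleave(digits, operators):
    """Alternate digit, operator, digit, ...; leftovers of the longer list follow in order."""
    sequence = []
    for digit, operator in zip_longest(digits, operators):
        if digit is not None:
            sequence.append(digit)
        if operator is not None:
            sequence.append(operator)
    return sequence


def random_formula(digit_count, operator_count, rng, digits="0123456789", operators="+-×÷="):
    """Pick digit and operator labels at random and arrange them in alternating order.

    The operator alphabet is an assumption made for this sketch.
    """
    chosen_digits = [rng.choice(digits) for _ in range(digit_count)]
    chosen_operators = [rng.choice(operators) for _ in range(operator_count)]
    return interleave(chosen_digits, chosen_operators)


print(interleave(list("123"), list("+=")))  # ['1', '+', '2', '=', '3']
print("".join(random_formula(3, 2, random.Random(7))))
```

Each label in the returned sequence would then be mapped to a handwriting picture and the pictures joined with the formula synthesis interval between them. Because `zip_longest` pads the shorter list, the operator count may be smaller than, equal to, or larger than the digit count.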

As an improved scheme, the sample transformation extraction operation comprises:

respectively carrying out scaling processing on the plurality of first synthesized Chinese character pictures, the plurality of second synthesized Chinese character pictures, the plurality of third synthesized English pictures and the plurality of fourth mathematical formula pictures according to the scaling percentage to obtain a plurality of first training set pictures, a plurality of second training set pictures, a plurality of third training set pictures and a plurality of fourth training set pictures;

respectively carrying out rotation transformation on the plurality of first training set pictures, the plurality of second training set pictures, the plurality of third training set pictures and the plurality of fourth training set pictures according to the rotation angle to obtain a plurality of first to-be-extracted synthetic Chinese character pictures, a plurality of second to-be-extracted synthetic Chinese character pictures, a plurality of third to-be-extracted English pictures and a plurality of fourth to-be-extracted mathematical formula pictures;

setting a first Chinese character proportion, a second Chinese character proportion, an English proportion and a formula proportion; calculating the total quantity of the plurality of first to-be-extracted synthetic Chinese character pictures, the plurality of second to-be-extracted synthetic Chinese character pictures, the plurality of third to-be-extracted English pictures and the plurality of fourth to-be-extracted mathematical formula pictures; calculating the products of the total quantity with the first Chinese character proportion, the second Chinese character proportion, the English proportion and the formula proportion respectively to obtain a first Chinese character selection quantity, a second Chinese character selection quantity, an English selection quantity and a formula selection quantity;

selecting a plurality of first to-be-sorted Chinese character pictures from the plurality of first to-be-extracted synthetic Chinese character pictures according to the first Chinese character selection quantity; selecting a plurality of second to-be-sorted Chinese character pictures from the plurality of second to-be-extracted synthetic Chinese character pictures according to the second Chinese character selection quantity; selecting a plurality of third to-be-sorted English pictures from the plurality of third to-be-extracted English pictures according to the English selection quantity; selecting a plurality of fourth to-be-sorted mathematical formula pictures from the plurality of fourth to-be-extracted mathematical formula pictures according to the formula selection quantity; and integrating the plurality of first to-be-sorted Chinese character pictures, the plurality of second to-be-sorted Chinese character pictures, the plurality of third to-be-sorted English pictures and the plurality of fourth to-be-sorted mathematical formula pictures to obtain the recognition training set.
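The proportion-based extraction can be illustrated with a short sketch. It assumes each selection quantity is the total number of transformed pictures multiplied by the corresponding proportion; rounding to an integer and clamping to the size of each pool are assumptions added here so that the sketch always runs, and the pool names are hypothetical.

```python
import random


def selection_amounts(pool_sizes, proportions):
    """Selection quantity per pool = total picture count x that pool's proportion.

    Rounding and clamping to the pool size are assumptions of this sketch.
    """
    total = sum(pool_sizes.values())
    return {
        name: min(size, round(total * proportions[name]))
        for name, size in pool_sizes.items()
    }


def extract_training_set(pools, proportions, rng):
    """Draw the computed number of pictures from each pool and merge them."""
    sizes = {name: len(items) for name, items in pools.items()}
    amounts = selection_amounts(sizes, proportions)
    training_set = []
    for name, items in pools.items():
        training_set.extend(rng.sample(items, amounts[name]))
    return training_set


# Placeholder labels stand in for the scaled and rotated pictures.
pools = {
    "random_hanzi": [f"rh{i}" for i in range(40)],
    "corpus_hanzi": [f"ch{i}" for i in range(30)],
    "english": [f"en{i}" for i in range(20)],
    "formula": [f"fm{i}" for i in range(10)],
}
proportions = {"random_hanzi": 0.3, "corpus_hanzi": 0.3, "english": 0.2, "formula": 0.1}
print(len(extract_training_set(pools, proportions, random.Random(0))))  # 90
```

Adjusting the four proportions shifts the balance between randomly combined samples, which favour diversity, and corpus-derived samples, which favour relevance.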

The invention also provides a system for generating the handwriting recognition training set for the tablet, which comprises the following components:

the system comprises a picture data set configuration module, a sample picture set acquisition module and a recognition training set generation module;

the picture data set configuration module is used for setting an acquisition threshold and a first contrast; the picture data set configuration module is also used for configuring a picture processing program; the picture data set configuration module configures the picture data set based on the acquisition threshold, the first contrast, and the picture processing program;

the sample picture set acquisition module is used for setting a character list matched with the picture data set; the sample picture set acquisition module acquires the sample picture set based on the picture data set and the character list;

the recognition training set generation module is used for setting a scaling percentage and a rotation angle; and the recognition training set generation module executes the sample transformation extraction operation based on the sample picture set, the scaling percentage and the rotation angle to obtain the recognition training set.

The invention also provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the tablet handwriting recognition training set generation method.

The invention has the beneficial effects that:

1. The method for generating a handwriting recognition training set for a tablet can generate a training set which comprises a plurality of rarely-used characters, symbols and formulas as well as a plurality of semantically related character samples, symbols and formulas, so that both the diversity of the handwriting samples and the relevance among them are ensured when the training set is applied, thereby improving the range and efficiency of handwriting recognition on the tablet, further improving the applicability of the tablet, greatly improving the user experience, making up for the defects of the prior art, and having extremely high market value and product competitiveness.

2. The tablet handwriting recognition training set generation system can generate a training set which comprises a plurality of rarely-used characters, symbols and formulas and a plurality of font samples, symbols and formulas with semantic relation through the mutual matching of the picture data set configuration module, the sample picture set acquisition module and the recognition training set generation module, and finally can ensure the diversity of handwriting samples and the relevance among the handwriting samples when the training set is applied, thereby improving the range and the efficiency of handwriting recognition on the tablet, further improving the applicability of the tablet, greatly improving the experience of users, making up the defects of the prior art and having extremely high market value and product competitiveness.

3. The computer-readable storage medium can direct the cooperation of the picture data set configuration module, the sample picture set acquisition module and the recognition training set generation module, and thereby generate a training set which comprises a plurality of rarely-used characters, symbols and formulas as well as a plurality of semantically related character samples, symbols and formulas, so that both the diversity of the handwriting samples and the relevance among them are ensured when the training set is applied, thereby improving the range and efficiency of handwriting recognition on the tablet, further improving the applicability of the tablet, greatly improving the user experience, making up for the defects of the prior art, having extremely high market value and product competitiveness, and effectively improving the operability of the method for generating a handwriting recognition training set for a tablet.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flowchart of a method for generating a training set for handwriting recognition on a tablet according to embodiment 1 of the present invention;

FIG. 2 is a detailed flowchart of the method for generating a training set for handwriting recognition on a tablet according to embodiment 1 of the present invention;

FIG. 3 is a schematic diagram illustrating an implementation effect of a part of the sample picture set according to embodiment 1 of the present invention;

FIG. 4 is an architecture diagram of a tablet handwriting recognition training set generation system according to embodiment 2 of the present invention.

Detailed Description

The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be more easily understood by those skilled in the art and the scope of protection of the invention is defined more clearly.

In the description of the present invention, it should be noted that the described embodiments of the present invention are a part of the embodiments of the present invention, and not all embodiments; all other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it should be noted that the terms "first", "second", "third", and "fourth" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In the description of the present invention, it is to be noted that PS (Photoshop) is image processing software.

Example 1

The embodiment provides a method for generating a handwriting recognition training set for a tablet, as shown in fig. 1 to 3, comprising the following steps:

S100, configuring a picture data set, specifically comprising:

S110, setting an acquisition threshold and a first contrast; configuring a picture processing program; configuring the picture data set based on the acquisition threshold, the first contrast, and the picture processing program; in this embodiment, step S110 is the most basic data collection step, and the picture data set is a set of handwritten pictures that have passed the required picture processing operations;

specifically, a plurality of first character pictures, a plurality of second character pictures, a plurality of third character pictures and a plurality of fourth character pictures which are respectively matched with the acquisition threshold are obtained; in this embodiment, the acquisition threshold is 300 and is a limit on quantity, so 300 pictures must be ensured for each of the first, second, third and fourth character pictures; correspondingly, the first character pictures, the second character pictures, the third character pictures and the fourth character pictures are respectively handwritten Chinese character pictures, handwritten English pictures, handwritten digital pictures and handwritten symbol pictures, written by hand in regular script and in slightly cursive regular script after several different typefaces in a dictionary; correspondingly, in this embodiment, the picture processing program is a PS program, and the picture processing program is called to perform color inversion processing on the plurality of first character pictures, the plurality of second character pictures, the plurality of third character pictures and the plurality of fourth character pictures; the inversion processing turns each picture into a negative, that is, white characters on a black background; the picture processing program is then called to adjust the contrast of the inverted first character pictures, second character pictures, third character pictures and fourth character pictures to the first contrast, so as to obtain a plurality of Chinese character handwriting sample pictures, a plurality of English handwriting sample pictures, a plurality of digital handwriting sample pictures and a plurality of symbol handwriting sample pictures; in this embodiment, the first contrast is a medium-to-low value within the range of 0 to 100, which gives the picture a grayed appearance;
integrating the plurality of Chinese character handwriting sample pictures, the plurality of English handwriting sample pictures, the plurality of digital handwriting sample pictures and the plurality of symbol handwriting sample pictures to obtain the picture data set; in this embodiment, the picture data set is stored in a database, indexed by GBK code and stored in HDF5 format; each code corresponds to a Chinese character; it is conceivable that the handwritten pictures of one Chinese character include pictures written by several different people; correspondingly, the data reference library of the method is built through the above steps; the picture data set in the data reference library covers a variety of handwriting styles and is not limited to Chinese characters but also comprises various symbols, which provides the data basis for subsequently building a highly diverse training set.
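The idea of indexing the picture data set by GBK code can be sketched as follows. An in-memory dictionary is swapped in for the database and HDF5 storage, which keeps the example self-contained; the class name, method names and placeholder picture values are hypothetical.

```python
from collections import defaultdict


def gbk_key(character):
    """Return the GBK byte code of one character as an upper-case hex string."""
    return character.encode("gbk").hex().upper()


class PictureDataSet:
    """In-memory stand-in for the stored data set, keyed by GBK code."""

    def __init__(self):
        self._pictures = defaultdict(list)

    def add(self, character, picture):
        # Several writers may contribute pictures of the same character.
        self._pictures[gbk_key(character)].append(picture)

    def pictures_for(self, character):
        return list(self._pictures.get(gbk_key(character), []))


data_set = PictureDataSet()
data_set.add("中", "writer_a_picture")  # placeholders for real pixel data
data_set.add("中", "writer_b_picture")
data_set.add("A", "writer_a_picture")
print(gbk_key("中"), len(data_set.pictures_for("中")))  # D6D0 2
```

Keying by code rather than by file name lets one character map to any number of handwriting variants, which matches the point that a character may have pictures from several writers.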

S200, obtaining a sample picture set, and specifically comprising the following steps:

S210, setting a character list matched with the picture data set; obtaining the sample picture set based on the picture data set and the character list; in this embodiment, step S210 is the key sample construction step; it includes synthesis methods for different types of sample pictures and greatly improves the diversity of the training set;

specifically, the character list is configured with: a plurality of Chinese character sample characters respectively matched with the Chinese character handwriting sample pictures, a plurality of English sample characters respectively matched with the English handwriting sample pictures, a plurality of digital sample characters respectively matched with the digital handwriting sample pictures, and a plurality of symbol sample characters respectively matched with the symbol handwriting sample pictures; correspondingly, the character list is a list of ordinary machine-displayed characters or standard Song typeface characters, comprising a plurality of Chinese characters, English letters, digits and symbols; because handwriting is not convenient to identify and distinguish directly, the list serves as the reference for selecting the characters to be synthesized;

specifically, a Chinese character picture screening operation is executed based on the picture data set and the character list to obtain a plurality of first synthesized Chinese character pictures and a plurality of second synthesized Chinese character pictures; in this embodiment, the Chinese character picture screening operation has two aspects: the first synthesizes characters at random to improve the diversity of Chinese characters, and the second generates samples from a corpus to improve the relevance between Chinese characters;

specifically, the Chinese character picture screening operation includes:

first, a random-sampling operation is carried out to ensure diversity: setting a first extraction quantity, a Chinese character synthesis interval and a synthesis quantity; configuring a Chinese corpus vocabulary table; in this embodiment, the first extraction quantity is the number of single characters extracted at one time and is set to any value between 10 and 15; selecting a plurality of first Chinese character samples matched with the first extraction quantity from the plurality of Chinese character sample characters; the first Chinese character samples are standard Chinese characters in the character list; correspondingly, in this embodiment, the selection uses uniform random sampling, so that a rarely-used character is as likely to be drawn as a common character; it is conceivable that this step is performed multiple times to achieve a better effect; screening a plurality of first Chinese character pictures respectively matched with the plurality of first Chinese character samples from the plurality of Chinese character handwriting sample pictures; in this embodiment, the synthesis quantity is 2, and the first Chinese character pictures are synthesized side by side according to the synthesis quantity and based on the Chinese character synthesis interval to obtain a plurality of first synthesized Chinese character pictures; that is, every two first Chinese character pictures are synthesized as a group into a one-line text picture, namely a vocabulary picture; the synthesis quantity is not limited herein; side-by-side synthesis means, for example, that the pictures of the two characters that make up the word "language" are joined into one picture of that word; correspondingly, the Chinese character synthesis interval is set in units of pixels, and in this embodiment it is set to between 5 and 30 pixels;
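The random-sampling branch can be sketched as follows: characters are drawn uniformly, so rare and common characters are equally likely, and their pictures are joined side by side with a fixed pixel gap. Pictures are again modelled as lists of pixel rows; the helper names and the tiny placeholder pictures are assumptions made for illustration.

```python
import random


def sample_characters(characters, count, rng):
    """Draw `count` distinct characters, each entry being equally likely."""
    return rng.sample(characters, count)


def join_side_by_side(pictures, gap, background=0):
    """Concatenate equal-height pictures left to right with `gap` background columns between them."""
    height = len(pictures[0])
    if any(len(picture) != height for picture in pictures):
        raise ValueError("all pictures must have the same height")
    rows = []
    for y in range(height):
        row = []
        for index, picture in enumerate(pictures):
            if index > 0:
                row.extend([background] * gap)
            row.extend(picture[y])
        rows.append(row)
    return rows


def synthesize_groups(characters, lookup, group_size, gap):
    """Join consecutive runs of `group_size` characters into one-line pictures."""
    groups = []
    for start in range(0, len(characters) - group_size + 1, group_size):
        chunk = characters[start:start + group_size]
        picture = join_side_by_side([lookup[c] for c in chunk], gap)
        groups.append(("".join(chunk), picture))
    return groups


# Hypothetical 2x2 placeholder pictures keyed by character label.
lookup = {label: [[255, 255], [255, 255]] for label in "abcdefghijkl"}
chosen = sample_characters(list(lookup), 10, random.Random(0))
for label, picture in synthesize_groups(chosen, lookup, group_size=2, gap=5):
    print(label, len(picture[0]))  # each joined row is 2 + 5 + 2 = 9 pixels wide
```

The background value defaults to 0 because the sample pictures are white characters on a dark background after inversion. Running the sampling step repeatedly with a fresh draw yields additional groups.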

second, a corpus synthesis operation is carried out to ensure the relevance between Chinese characters: in this embodiment, the Chinese corpus vocabulary table includes a plurality of words and a plurality of long and short sentences; different types of vocabulary data can be selected for the Chinese corpus vocabulary table according to the application scenario, for example: Chinese encyclopedia data, news corpus data, computer-specialty data and the like; selecting a second Chinese character sample from the Chinese character sample characters; the second Chinese character sample serves as the reference for generating the corpus samples; screening a plurality of Chinese vocabulary characters associated with the second Chinese character sample in the Chinese corpus vocabulary table; for example: if the second Chinese character sample is "language", the Chinese vocabulary characters may be "text", "digital exterior", "language", "tone" and "situation", and may also include sentences; correspondingly, the plurality of Chinese vocabulary characters serve as a screening template for the handwriting, so a plurality of third Chinese character samples respectively matched with the plurality of Chinese vocabulary characters are screened from the plurality of Chinese character sample characters; screening a second Chinese character picture matched with the second Chinese character sample from the plurality of Chinese character handwriting sample pictures; screening a plurality of third Chinese character pictures respectively matched with the plurality of third Chinese character samples from the plurality of Chinese character handwriting sample pictures; synthesizing each of the plurality of third Chinese character pictures side by side with the second Chinese character picture based on the Chinese character synthesis interval to obtain a plurality of second synthesized Chinese character pictures; in this embodiment, the second Chinese character sample can also be chosen as a plurality of consecutive single characters, thereby further improving the relevance between the Chinese characters.
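The corpus-driven branch can be sketched as a lookup of the characters that accompany a chosen seed character in vocabulary entries. The sketch assumes that an entry counts as "associated" when it contains the seed character, and the vocabulary and available-character set shown are hypothetical illustrations rather than the corpus used in the embodiment.

```python
def associated_characters(seed, vocabulary):
    """Collect, in first-seen order, the characters that share an entry with `seed`.

    Assumption: an entry is associated with the seed when it contains the seed.
    """
    partners = []
    for entry in vocabulary:
        if seed not in entry:
            continue
        for character in entry:
            if character != seed and character not in partners:
                partners.append(character)
    return partners


def corpus_pairs(seed, vocabulary, available):
    """Return (seed, partner) pairs for partners that have a handwriting sample."""
    return [
        (seed, character)
        for character in associated_characters(seed, vocabulary)
        if character in available
    ]


# Hypothetical vocabulary entries and available sample characters.
vocabulary = ["语文", "语言", "语气", "天气", "语境"]
available = {"语", "文", "气"}
print(associated_characters("语", vocabulary))  # ['文', '言', '气', '境']
print(corpus_pairs("语", vocabulary, available))  # [('语', '文'), ('语', '气')]
```

Each returned pair would then be rendered by joining the seed picture and the partner picture with the Chinese character synthesis interval between them. Filtering against the available characters reflects that a partner can only be used if a matching handwriting sample exists.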

Specifically, an English picture screening operation is executed based on the picture data set and the character list to obtain a plurality of third synthesized English pictures; in this embodiment, the English picture screening operation mainly addresses the problem that, when a user intends to input "or", the conference tablet recognizes it as "0r"; to handle such situations, the corpus association between English words needs to be strengthened;

specifically, the English picture screening operation includes: setting a segmentation unit and an English synthesis interval; configuring an English corpus vocabulary; in this embodiment, the segmentation unit is a single letter, which improves English recognition accuracy; the English synthesis interval is similar to the Chinese character synthesis interval and is likewise set in pixels; the English corpus vocabulary is similar to the Chinese corpus vocabulary, namely English encyclopedic vocabulary and sentences, English behavioral vocabulary and sentences, and the like in various fields;

selecting a first English sample from the English sample characters; correspondingly, the first English sample consists purely of English letters and contains no non-letter content such as numbers, symbols or other characters; in this embodiment, the first English sample can also be a long or short sentence from the English corpus vocabulary, kept within 20 letters, so as to improve the diversity of letter combinations;

confirming, in the English corpus vocabulary, a plurality of first English vocabulary characters whose first character position is the first English sample; the first English vocabulary characters likewise consist purely of English letters and contain no non-letter content such as numbers, symbols or other characters; correspondingly, the first character position is the first position in a vocabulary entry; for example: if the first English sample is the letter "o", the first English vocabulary characters may include "or", "okay", "open", etc.;

correspondingly, segmenting the first English vocabulary characters according to the segmentation unit to obtain a plurality of second English vocabulary characters; following the above example, the second English vocabulary characters are "o", "r", "k", "a", "y", "p", "e", "n"; the second English vocabulary characters also serve as a screening template for the handwriting sample pictures; removing the vocabulary characters matched with the first English sample from the second English vocabulary characters to obtain a plurality of third English vocabulary characters; following the example, "o" is removed, leaving "r", "k", "a", "y", "p", "e", "n";

screening a plurality of second English samples respectively matched with the plurality of third English vocabulary characters from the plurality of English sample characters; screening a first English picture matched with the first English sample from the plurality of English handwriting sample pictures; screening a plurality of second English pictures respectively matched with the plurality of second English samples from the plurality of English handwriting sample pictures; synthesizing each of the second English pictures side by side with the first English picture based on the English synthesis interval to obtain a plurality of third synthesized English pictures; in this embodiment, the steps of selecting a sample and synthesizing a picture may be performed multiple times, thereby improving diversity.
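The letter-screening and side-by-side synthesis steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the representation of a picture as a list of grayscale pixel rows, the background value 0, and the choice to place the first English picture on the left are all assumptions made for the example.

```python
def third_vocabulary_characters(first_sample, corpus_vocabulary):
    """Return the letters to pair with first_sample, in first-seen order."""
    # Step 1: vocabulary entries whose first character is the first English sample.
    first_words = [w for w in corpus_vocabulary if w.startswith(first_sample)]
    # Step 2: split by the segmentation unit (a single letter), without repeats.
    second_chars = []
    for word in first_words:
        for letter in word:
            if letter not in second_chars:
                second_chars.append(letter)
    # Step 3: remove the characters that match the first sample itself.
    return [c for c in second_chars if c != first_sample]


def compose_side_by_side(left, right, spacing, background=0):
    """Join two pictures (lists of pixel rows) with `spacing` blank pixels between."""
    height = max(len(left), len(right))

    def pad(picture):
        # Pad the shorter picture with background rows (hypothetical policy).
        width = len(picture[0])
        extra = [[background] * width for _ in range(height - len(picture))]
        return picture + extra

    gap = [background] * spacing
    return [a + gap + b for a, b in zip(pad(left), pad(right))]


if __name__ == "__main__":
    print(third_vocabulary_characters("o", ["or", "okay", "open"]))
```

Each returned letter would then be looked up among the English handwriting sample pictures and composed with the picture of the first sample, for example `compose_side_by_side(picture_o, picture_r, spacing=4)` with hypothetical picture variables.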

Specifically, a formula picture screening operation is executed based on the picture data set and the character list to obtain a plurality of fourth mathematical formula pictures; in this embodiment, the formula picture screening operation prevents the conference tablet from making recognition errors caused by glyph similarity when recognizing mathematical formulas; for example, when the user inputs "1+2=3", it prevents "1" from being recognized as the letter "l", "+" as the Chinese character "十" (ten), "=" as the Chinese character "二" (two), and "3" as "up";

specifically, the formula picture screening operation includes:

setting a formula synthesis interval, number digits, operator digits, and a digit and operator synthesis sequence; in this embodiment, the formula synthesis interval is likewise set in pixels; the number digits cover both the number of digits of the number itself and the digit slots between its digits; the operator digits are the operator positions matched to those digit slots; the operator digits are set according to, and matched with, the number digits, and may be smaller than, larger than or equal to the number digits; correspondingly, the digit and operator synthesis sequence is the arrangement order of the operators and the digits, which is determined case by case; in this embodiment, the sequence is one digit character followed by one operator; the sequence can be configured in many ways according to the complexity of the formula, giving strong diversity;

selecting a plurality of first digital samples from a plurality of digital sample characters according to the number digits; in this embodiment, the number itself has at most six digits, and a digit slot is the position between every two single digit characters; for example: when the number digits is 3, only integers in the range 100 to 999 can be selected; selecting a plurality of operation character samples from a plurality of symbol sample characters according to the operator digits; correspondingly, for example: if the number digits is 3, the first digital sample is any number between 100 and 999; if the first digital sample is 101, the slots are: the slot between 1 and 0, the slot between 0 and 1, and the slot after the final 1, so the number of slots is also 3; correspondingly, the operation character samples are mathematical operators among the symbol sample characters, here any 3 symbols, for example: +, = and <; correspondingly, the subsequently synthesized mathematical formula may be 1+0=1<; the first digital samples and the operation character samples also serve as screening templates for the handwriting sample pictures, so that a plurality of first digital pictures respectively matched with the first digital samples are screened from the digital handwriting sample pictures; screening a plurality of operation character pictures respectively matched with the plurality of operation character samples from the plurality of symbol handwriting sample pictures; synthesizing the plurality of first digital pictures and the plurality of operation character pictures according to the digit and operator synthesis sequence and the formula synthesis interval to obtain a plurality of fourth mathematical formula pictures; in this embodiment, the length of the formula is limited: one formula picture contains at most 15 characters, which ensures the validity of the formula and saves storage space in the training set.
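The digit and operator interleaving described above can be sketched as follows. The function names are hypothetical, the handling of a one-digit number (range 0 to 9) is an assumption, and operators beyond the available slots are simply ignored in this sketch because the text does not say how surplus operators are placed.

```python
import random

MAX_FORMULA_CHARS = 15   # limit stated in this embodiment
MAX_NUMBER_DIGITS = 6    # limit stated in this embodiment


def pick_number(digit_count, rng):
    """Pick a number with exactly digit_count digits, e.g. 100..999 for 3."""
    if not 1 <= digit_count <= MAX_NUMBER_DIGITS:
        raise ValueError("digit_count must be between 1 and 6")
    low = 10 ** (digit_count - 1) if digit_count > 1 else 0  # assumption for 1 digit
    high = 10 ** digit_count - 1
    return str(rng.randint(low, high))


def formula_sequence(number, operators):
    """Arrange one digit character, then one operator, and so on."""
    sequence = []
    for index, digit in enumerate(number):
        sequence.append(digit)
        if index < len(operators):  # surplus operators are ignored (assumption)
            sequence.append(operators[index])
    if len(sequence) > MAX_FORMULA_CHARS:
        raise ValueError("formula exceeds 15 characters")
    return sequence


if __name__ == "__main__":
    rng = random.Random(0)
    print(pick_number(3, rng))
    print("".join(formula_sequence("101", ["+", "=", "<"])))
```

Each element of the returned sequence would then be replaced by a matching handwriting sample picture and the pictures joined using the formula synthesis interval.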

Specifically, integrating the plurality of first synthesized Chinese character pictures, the plurality of second synthesized Chinese character pictures, the plurality of third synthesized English pictures and the plurality of fourth mathematical formula pictures to obtain the sample picture set; the sample picture set is highly diverse and widely applicable, and serves as the main base data for the subsequent generation of the training set.

S300, generating a recognition training set, which specifically comprises the following steps:

S310, setting a scaling percentage and a rotation angle; executing a sample transformation extraction operation based on the sample picture set, the scaling percentage and the rotation angle to obtain the recognition training set; in this embodiment, this step takes the different handwriting habits of various users into account, so as to further improve the adaptability of the conference tablet and the diversity of the training set;

specifically, in this embodiment, the scaling percentage is 0.7 to 0.1 and the rotation angle is 5 degrees; the sample transformation extraction operation comprises: scaling the plurality of first synthesized Chinese character pictures, the plurality of second synthesized Chinese character pictures, the plurality of third synthesized English pictures and the plurality of fourth mathematical formula pictures respectively according to the scaling percentage to obtain a plurality of first training set pictures, a plurality of second training set pictures, a plurality of third training set pictures and a plurality of fourth training set pictures; performing rotation transformation on the plurality of first training set pictures, the plurality of second training set pictures, the plurality of third training set pictures and the plurality of fourth training set pictures respectively according to the rotation angle to obtain a plurality of first to-be-extracted synthesized Chinese character pictures, a plurality of second to-be-extracted synthesized Chinese character pictures, a plurality of third to-be-extracted English pictures and a plurality of fourth to-be-extracted mathematical formula pictures; correspondingly, in this embodiment, each kind of picture is transformed with its own probability, for example: 50 percent of the first synthesized Chinese character pictures are rotated clockwise by 5 degrees, and 1 percent of the rotated pictures are then scaled by a factor of 0.7 to 0.1; correspondingly, the image transformation algorithms applied to each kind of picture can be the same or different, and varying the algorithms can further improve the diversity of the training set;
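The probabilistic transformation in the example above (rotate a share of the pictures, then scale a share of the rotated ones) can be sketched as a planning step that decides the parameters for each picture. The function name, the dictionary layout and the uniform draw of the scale factor between the two stated bounds are assumptions; the actual image rotation and resizing would be done by whatever imaging library is in use, and how "clockwise" maps to a signed angle depends on that library.

```python
import random


def plan_transforms(picture_ids, rotate_prob, scale_prob, angle_deg, scale_range, seed=None):
    """Decide, per picture, the clockwise rotation angle and the scale factor."""
    rng = random.Random(seed)
    low, high = min(scale_range), max(scale_range)
    plan = {}
    for picture_id in picture_ids:
        step = {"angle": 0.0, "scale": 1.0}  # identity transform by default
        if rng.random() < rotate_prob:
            step["angle"] = angle_deg
            # Only already-rotated pictures are candidates for scaling,
            # following the example in the text.
            if rng.random() < scale_prob:
                step["scale"] = rng.uniform(low, high)
        plan[picture_id] = step
    return plan


if __name__ == "__main__":
    plan = plan_transforms(range(1000), 0.5, 0.01, 5.0, (0.7, 0.1), seed=42)
    rotated = sum(1 for step in plan.values() if step["angle"] != 0.0)
    scaled = sum(1 for step in plan.values() if step["scale"] != 1.0)
    print(rotated, scaled)
```

Separate plans with different probabilities could be built for the Chinese character, English and formula pictures, which matches the statement that each kind may be transformed differently.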

correspondingly, in order to ensure the effectiveness of the training set while keeping its space occupancy low, and to improve its use efficiency, a first Chinese character proportion, a second Chinese character proportion, an English proportion and a formula proportion are set; the first Chinese character proportion, the second Chinese character proportion, the English proportion and the formula proportion are respectively 5/12, 5/12, 1/12 and 1/12; the proportions are set according to the application scenario of the training set; for example, if the training set is applied to a foreign-language conference, the English proportion should be higher, and if it is applied to mathematics study, the formula proportion should be higher; correspondingly, the total number (sum) of the plurality of first to-be-extracted synthesized Chinese character pictures, the plurality of second to-be-extracted synthesized Chinese character pictures, the plurality of third to-be-extracted English pictures and the plurality of fourth to-be-extracted mathematical formula pictures is calculated; the products of this total number with the first Chinese character proportion, the second Chinese character proportion, the English proportion and the formula proportion are calculated respectively to obtain a first Chinese character selection quantity, a second Chinese character selection quantity, an English selection quantity and a formula selection quantity; selecting a plurality of first Chinese character pictures to be sorted from the plurality of first to-be-extracted synthesized Chinese character pictures according to the first Chinese character selection quantity; selecting a plurality of second Chinese character pictures to be sorted from the plurality of second to-be-extracted synthesized Chinese character pictures according to the second Chinese character selection quantity; selecting a plurality of third English pictures to be sorted from the plurality of third to-be-extracted English pictures according to the English selection quantity; selecting a plurality of fourth mathematical formula pictures to be sorted from the plurality of fourth to-be-extracted mathematical formula pictures according to the formula selection quantity; and integrating the plurality of first Chinese character pictures to be sorted, the plurality of second Chinese character pictures to be sorted, the plurality of third English pictures to be sorted and the plurality of fourth mathematical formula pictures to be sorted to obtain the recognition training set.
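The proportional selection above amounts to multiplying the total picture count by each proportion and drawing that many pictures from the corresponding pool. The sketch below uses the 5/12, 5/12, 1/12, 1/12 proportions of this embodiment; the pool names, the rounding down to an integer, the random draw and the cap at the pool size are assumptions not stated in the text.

```python
import random
from fractions import Fraction

# Proportions stated in this embodiment; the key names are hypothetical.
PROPORTIONS = {
    "hanzi_1": Fraction(5, 12),
    "hanzi_2": Fraction(5, 12),
    "english": Fraction(1, 12),
    "formula": Fraction(1, 12),
}


def selection_amounts(pools, proportions):
    """Selection quantity per kind = total picture count x proportion."""
    total = sum(len(pictures) for pictures in pools.values())
    return {name: int(total * proportions[name]) for name in pools}


def draw_training_set(pools, proportions, seed=None):
    """Draw the selected pictures from every pool and merge them."""
    rng = random.Random(seed)
    amounts = selection_amounts(pools, proportions)
    training_set = []
    for name, pictures in pools.items():
        # Capping at the pool size is an assumption; the text does not say
        # what happens when a quantity exceeds the available pictures.
        count = min(amounts[name], len(pictures))
        training_set.extend(rng.sample(pictures, count))
    return training_set


if __name__ == "__main__":
    pools = {name: [f"{name}_{i}" for i in range(6)] for name in PROPORTIONS}
    print(selection_amounts(pools, PROPORTIONS))
```

With 24 pictures in total the quantities come out as 10, 10, 2 and 2, so a scenario-specific training set is obtained simply by changing the proportions.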

Example 2

The present embodiment provides a system for generating a training set for handwriting recognition for a tablet, as shown in fig. 4, including: the system comprises a picture data set configuration module, a sample picture set acquisition module and an identification training set generation module;

in the tablet handwriting recognition training set generation system, a picture data set configuration module is used for setting an acquisition threshold and a first contrast; the picture data set configuration module is also used for configuring a picture processing program; the picture data set configuration module configures the picture data set based on the acquisition threshold, the first contrast, and the picture processing program;

specifically, the image data set configuration module acquires a plurality of first character images, a plurality of second character images, a plurality of third character images and a plurality of fourth character images which are respectively matched with the acquisition threshold value; the picture data set configuration module calls the picture processing program to perform phase reversal processing on the first character pictures, the second character pictures, the third character pictures and the fourth character pictures; the picture data set configuration module calls the picture processing program to adjust the contrast of the first character pictures, the second character pictures, the third character pictures and the fourth character pictures after the phase reversal processing to the first contrast, so that a plurality of Chinese character handwriting sample pictures, a plurality of English handwriting sample pictures, a plurality of digital handwriting sample pictures and a plurality of symbol handwriting sample pictures are obtained; and the picture data set configuration module integrates a plurality of Chinese character handwriting sample pictures, a plurality of English handwriting sample pictures, a plurality of digital handwriting sample pictures and a plurality of symbol handwriting sample pictures to obtain the picture data set.
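The two picture processing steps performed by this module can be sketched on 8-bit grayscale pictures stored as lists of pixel rows. Reading "phase reversal processing" as pixel inversion is an interpretation, and the linear contrast adjustment around mid-gray with a `factor` parameter is only a hypothetical stand-in for adjusting a picture to the first contrast, whose exact definition is not given in this section.

```python
def invert(picture):
    """Invert an 8-bit grayscale picture: dark strokes become light and vice versa."""
    return [[255 - pixel for pixel in row] for row in picture]


def adjust_contrast(picture, factor):
    """Stretch (factor > 1) or compress (factor < 1) pixel values around mid-gray."""

    def clamp(value):
        return max(0, min(255, int(round(value))))

    return [[clamp(128 + factor * (pixel - 128)) for pixel in row] for row in picture]


def preprocess(picture, factor):
    """Inversion first, then contrast adjustment, in the order described above."""
    return adjust_contrast(invert(picture), factor)


if __name__ == "__main__":
    sample = [[0, 100, 255]]
    print(preprocess(sample, 1.5))
```

Applying the same `preprocess` call to every collected character picture would give all four picture categories a uniform appearance before they are merged into the picture data set.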

In the tablet handwriting recognition training set generation system, a sample picture set acquisition module is used for setting a character list matched with the picture data set; the sample picture set acquisition module acquires the sample picture set based on the picture data set and the character list;

specifically, the character list is configured with: a plurality of Chinese character sample characters respectively matched with the Chinese character handwriting sample picture, a plurality of English sample characters respectively matched with the English handwriting sample picture, a plurality of digital sample characters respectively matched with the digital handwriting sample picture and a plurality of symbol sample characters respectively matched with the symbol handwriting sample picture;

specifically, the sample picture set acquisition module executes Chinese character picture screening operation based on the picture data set and the character list to obtain a plurality of first synthesized Chinese character pictures and a plurality of second synthesized Chinese character pictures; the Chinese character picture screening operation comprises the following steps: the sample picture set acquisition module sets a first extraction number, a Chinese character synthesis interval and a synthesis number; the sample picture set acquisition module configures a Chinese corpus vocabulary; the sample picture set acquisition module selects a plurality of first Chinese character samples matched with the first extraction quantity from a plurality of Chinese character sample characters; a sample picture set acquisition module screens a plurality of first Chinese character pictures respectively matched with a plurality of first Chinese character samples from a plurality of Chinese character handwriting sample pictures; the sample picture set acquisition module synthesizes a plurality of first Chinese character pictures side by side according to the synthesis quantity and based on the Chinese character synthesis interval to obtain a plurality of first synthesized Chinese character pictures; the sample picture set acquisition module selects a second Chinese character sample from the Chinese character sample characters; a sample picture set acquisition module screens a plurality of Chinese vocabulary characters associated with the second Chinese character sample in the Chinese corpus vocabulary table; a sample picture set acquisition module screens a plurality of third Chinese character samples which are respectively matched with a plurality of Chinese vocabulary characters from a plurality of Chinese character sample characters; a sample picture set acquisition module screens second Chinese character pictures matched with the second Chinese character samples from a 
plurality of Chinese character handwriting sample pictures; a sample picture set acquisition module screens a plurality of third Chinese character pictures respectively matched with a plurality of third Chinese character samples from a plurality of Chinese character handwriting sample pictures; the sample picture set acquisition module is used for respectively synthesizing a plurality of third Chinese character pictures and the second Chinese character pictures in parallel on the basis of the Chinese character synthesis intervals to obtain a plurality of second synthesized Chinese character pictures;

specifically, the sample picture set acquisition module executes an English picture screening operation based on the picture data set and the character list to obtain a plurality of third synthesized English pictures; the English picture screening operation comprises the following steps: the sample picture set acquisition module sets a segmentation unit and an English synthesis interval, and configures an English corpus vocabulary; the sample picture set acquisition module selects a first English sample from the English sample characters; the sample picture set acquisition module confirms, in the English corpus vocabulary, a plurality of first English vocabulary characters whose first character position is the first English sample; the sample picture set acquisition module segments the first English vocabulary characters according to the segmentation unit to obtain a plurality of second English vocabulary characters; the sample picture set acquisition module removes the vocabulary characters matched with the first English sample from the second English vocabulary characters to obtain a plurality of third English vocabulary characters; the sample picture set acquisition module screens a plurality of second English samples respectively matched with the plurality of third English vocabulary characters from the plurality of English sample characters; the sample picture set acquisition module screens a first English picture matched with the first English sample from the plurality of English handwriting sample pictures; the sample picture set acquisition module screens a plurality of second English pictures respectively matched with the plurality of second English samples from the plurality of English handwriting sample pictures; the sample picture set acquisition module synthesizes each of the second English pictures side by side with the first English picture based on the English synthesis interval to obtain the plurality of third synthesized English pictures;

specifically, the sample picture set acquisition module executes a formula picture screening operation based on the picture data set and the character list to obtain a plurality of fourth mathematical formula pictures; the formula picture screening operation comprises the following steps: the sample picture set acquisition module sets a formula synthesis interval, number digits, operator digits, and a digit and operator synthesis sequence; the sample picture set acquisition module selects a plurality of first digital samples from a plurality of digital sample characters according to the number digits, and selects a plurality of operation character samples from a plurality of symbol sample characters according to the operator digits; the sample picture set acquisition module screens a plurality of first digital pictures respectively matched with the plurality of first digital samples from a plurality of digital handwriting sample pictures, and screens a plurality of operation character pictures respectively matched with the plurality of operation character samples from a plurality of symbol handwriting sample pictures; the sample picture set acquisition module synthesizes the plurality of first digital pictures and the plurality of operation character pictures according to the digit and operator synthesis sequence and the formula synthesis interval to obtain a plurality of fourth mathematical formula pictures;

specifically, the sample picture set obtaining module integrates a plurality of first synthesized chinese character pictures, a plurality of second synthesized chinese character pictures, a plurality of third synthesized english pictures, and a plurality of fourth mathematical formula pictures to obtain the sample picture set.

In the tablet handwriting recognition training set generation system, a recognition training set generation module is used for setting the zooming percentage and the rotation angle; the identification training set generation module executes sample transformation extraction operation based on the sample picture set, the scaling percentage and the rotation angle to obtain an identification training set;

specifically, the sample transform decimation operation includes:

the identification training set generation module respectively carries out scaling processing on the plurality of first synthesized Chinese character pictures, the plurality of second synthesized Chinese character pictures, the plurality of third synthesized English pictures and the plurality of fourth mathematical formula pictures according to the scaling percentage to obtain a plurality of first training set pictures, a plurality of second training set pictures, a plurality of third training set pictures and a plurality of fourth training set pictures; the identification training set generation module respectively performs rotation transformation on the plurality of first training set pictures, the plurality of second training set pictures, the plurality of third training set pictures and the plurality of fourth training set pictures according to the rotation angle to obtain a plurality of first to-be-extracted synthetic Chinese character pictures, a plurality of second to-be-extracted synthetic Chinese character pictures, a plurality of third to-be-extracted English pictures and a plurality of fourth to-be-extracted mathematical formula pictures;

the recognition training set generation module sets a first Chinese character proportion, a second Chinese character proportion, an English proportion and a formula proportion; the recognition training set generation module calculates the total number (sum) of the plurality of first to-be-extracted synthesized Chinese character pictures, the plurality of second to-be-extracted synthesized Chinese character pictures, the plurality of third to-be-extracted English pictures and the plurality of fourth to-be-extracted mathematical formula pictures; the recognition training set generation module calculates the products of this total number with the first Chinese character proportion, the second Chinese character proportion, the English proportion and the formula proportion respectively to obtain a first Chinese character selection quantity, a second Chinese character selection quantity, an English selection quantity and a formula selection quantity;

the recognition training set generation module selects a plurality of first Chinese character pictures to be sorted from the plurality of first Chinese character pictures to be extracted and synthesized according to the first Chinese character selection quantity; the recognition training set generation module selects a plurality of second Chinese character pictures to be sorted from a plurality of second synthetic Chinese character pictures to be extracted according to the second Chinese character selection quantity; the identification training set generation module selects a plurality of third English pictures to be sorted from the plurality of third English pictures to be extracted according to the English selection amount; the recognition training set generation module selects a plurality of fourth mathematical formula pictures to be sorted from the plurality of fourth mathematical formula pictures to be extracted according to the formula selection quantity; and the identification training set generation module integrates a plurality of first Chinese character pictures to be sorted, a plurality of second Chinese character pictures to be sorted, a plurality of third English pictures to be sorted and a plurality of fourth formula pictures to be sorted to obtain the identification training set.

Example 3

The present embodiments provide a computer-readable storage medium comprising:

the storage medium is used for storing computer software instructions for implementing the handwriting recognition training set generation method for the tablet described in embodiment 1, and includes a program for executing the handwriting recognition training set generation method for the tablet; specifically, the executable program may be built into the handwriting recognition training set generation system for the tablet described in embodiment 2, so that the system described in embodiment 2 may implement the handwriting recognition training set generation method for the tablet described in embodiment 1 by executing the built-in executable program.

Furthermore, the computer-readable storage medium of the present embodiments may take any combination of one or more readable storage media, where a readable storage medium includes an electronic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof.

Different from the prior art, the method, the system and the medium for generating the training set for recognizing the handwriting on the tablet can generate the training set which comprises a plurality of rarely-used characters, symbols and formulas and a plurality of font samples, symbols and formulas with semantic relation, and can ensure the diversity of the handwriting samples and the relevance among the handwriting samples when the training set is applied.

The serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A method for generating a handwriting recognition training set for a tablet computer is characterized by comprising the following steps:

configuring a picture data set;

acquiring a sample picture set:

generating a recognition training set:

2. The method of claim 1, wherein the step of configuring the picture data set based on the capture threshold, the first contrast, and the picture processing routine further comprises:

3. The method of claim 2, wherein the character list comprises: a plurality of Chinese character sample characters respectively matched with the Chinese character handwriting sample pictures, a plurality of English sample characters respectively matched with the English handwriting sample pictures, a plurality of digital sample characters respectively matched with the digital handwriting sample pictures, and a plurality of symbol sample characters respectively matched with the symbol handwriting sample pictures.

4. The method of claim 3, wherein the step of obtaining the sample picture set based on the picture data set and the character list further comprises:

5. The method of claim 4, wherein the Chinese character picture screening operation comprises:

6. The method of claim 5, wherein the English picture screening operation comprises:

selecting a first English sample from the English sample characters;

7. The method of claim 6, wherein the formula picture screening operation comprises:

8. The method of claim 7, wherein the sample transformation extraction operation comprises:

9. A system for generating a handwriting recognition training set for a tablet, based on the method according to any one of claims 1 to 8, comprising: a picture data set configuration module, a sample picture set acquisition module and a recognition training set generation module;

10. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for generating a training set for handwriting recognition for tablets according to any of claims 1 to 8.