CN110163211B - Image recognition method, device and storage medium


Info

Publication number
CN110163211B
Authority
CN
China
Prior art keywords
image
positioning point
test paper
sample
network model
Prior art date
Legal status
Active
Application number
CN201811037416.5A
Other languages
Chinese (zh)
Other versions
CN110163211A (en)
Inventor
刘东泽
杨晨
李�浩
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201811037416.5A
Publication of CN110163211A
Application granted
Publication of CN110163211B


Classifications

    • G — PHYSICS › G06 — COMPUTING; CALCULATING OR COUNTING › G06F — ELECTRIC DIGITAL DATA PROCESSING › G06F18/00 Pattern recognition › G06F18/20 Analysing › G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation › G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G — PHYSICS › G06 — COMPUTING; CALCULATING OR COUNTING › G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING › G06V10/00 Arrangements for image or video recognition or understanding › G06V10/20 Image preprocessing › G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G — PHYSICS › G06 — COMPUTING; CALCULATING OR COUNTING › G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING › G06V10/00 Arrangements for image or video recognition or understanding › G06V10/20 Image preprocessing › G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G — PHYSICS › G06 — COMPUTING; CALCULATING OR COUNTING › G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING › G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition › G06V30/10 Character recognition › G06V30/14 Image acquisition › G06V30/148 Segmentation of character regions › G06V30/153 Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses an image recognition method, an image recognition device and a storage medium. The embodiment of the invention acquires a labeled sample test paper image, wherein the sample test paper image comprises a labeled sample answer area and sample positioning points of the sample test paper; obtains the positional relationship between the sample answer area and the sample positioning points; trains a positioning point identification network model according to the sample test paper image to obtain a trained positioning point identification network model; acquires a test paper image to be recognized and identifies the positioning point positions of the test paper by adopting the trained positioning point identification network model; extracts an answer area image from the test paper image according to the positioning point positions and the positional relationship; and performs character recognition on the answer area image to obtain a recognition result. The scheme can improve the accuracy and the reliability of character recognition.

Description

Image recognition method, device and storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to an image recognition method, an image recognition apparatus, and a storage medium.
Background
With the increasing performance of underlying computer hardware and the growing accumulation of domain data, Artificial Intelligence (AI) technology has been advancing in many fields and industries. The education industry has accumulated a large amount of examination question data and has a huge demand for fast examination paper marking. The traditional marking scheme is generally based on students answering and teachers judging manually; for large volumes of papers this is inefficient and requires each question to be screened by hand.
At present, automatic scoring is mainly based on Optical Character Recognition (OCR) technology: an examination answer sheet (score card) is photographed and recognized with OCR, that is, character recognition is performed on a score card in a specific format, and the recognition result is then compared with the standard answer, thereby realizing automatic scoring.
In the course of research and practice on the prior art, the inventor of the present invention found that, because the existing scheme uses a plain OCR technology for character recognition, it cannot effectively recognize an original test paper and basically requires the assistance of answer sheets in specific formats. Moreover, the existing character recognition method cannot adapt to the variety of image shooting conditions (such as background, light, angle and texture), nor to the variety of question types, and can only recognize a few specific question types such as oral arithmetic questions. The character recognition method in the existing automatic marking scheme is therefore highly limited, and the accuracy and reliability of character recognition are low.
Disclosure of Invention
The embodiment of the invention provides an image recognition method, an image recognition device and a storage medium, which can improve the accuracy and the reliability of character recognition.
The embodiment of the invention provides an image identification method, which comprises the following steps:
acquiring a labeled sample test paper image, wherein the sample test paper image comprises a labeled sample answer area and a sample positioning point of the sample test paper;
obtaining the position relation between the sample answering area and the sample positioning point;
training a positioning point identification network model according to the sample test paper image to obtain a trained positioning point identification network model;
collecting a test paper image to be identified, and identifying the position of the positioning point of the test paper by adopting the trained positioning point identification network model;
extracting an answer area image from the test paper image according to the position of the positioning point and the position relation;
and performing character recognition on the answer area image to obtain a recognition result.
An embodiment of the present invention further provides an image recognition apparatus, including:
the system comprises a sample acquisition unit, a storage unit and a processing unit, wherein the sample acquisition unit is used for acquiring a labeled sample test paper image, and the sample test paper image comprises a labeled sample answer area and a sample positioning point of the sample test paper;
the relation obtaining unit is used for obtaining the position relation between the sample answering area and the sample positioning point;
the training unit is used for training the positioning point identification network model according to the sample test paper image to obtain the trained positioning point identification network model;
the positioning point identification unit is used for acquiring the test paper image to be identified and identifying the positioning point position of the test paper by adopting the trained positioning point identification network model;
the area extraction unit is used for extracting an answer area image from the test paper image according to the position of the positioning point and the position relation;
and the character recognition unit is used for performing character recognition on the answer area image to obtain a recognition result.
In addition, the embodiment of the present invention further provides a storage medium, where the storage medium stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to perform the steps in any one of the image recognition methods provided by the embodiments of the present invention.
The embodiment of the invention can acquire a labeled sample test paper image, where the sample test paper image includes a labeled sample answer area and sample positioning points of the sample test paper; obtain the positional relationship between the sample answer area and the sample positioning points; train a positioning point identification network model according to the sample test paper image to obtain a trained positioning point identification network model; acquire a test paper image to be recognized and identify the positioning point positions of the test paper by using the trained positioning point identification network model; extract an answer area image from the test paper image according to the positioning point positions and the positional relationship; and perform character recognition on the answer area image to obtain a recognition result. Because the scheme identifies the positioning point positions of the test paper from the test paper image through a deep-learning positioning point identification network model, the positioning point positions can be identified effectively and accurately in test paper images obtained under a wide variety of shooting conditions (such as background, light, angle and texture); the scheme is therefore suitable for various shooting scenes and imposes no limitation on how the test paper image is captured. In addition, the scheme can perform effective character recognition directly on the original test paper without any assistance, so it imposes no limitation on the type of test paper, the question types and the like. Therefore, the character recognition of this scheme is far less limited (for example, it imposes no limitation on shooting scenes, test paper types, question types and the like), and the accuracy and reliability of character recognition are improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1a is a schematic scene diagram of an image recognition method according to an embodiment of the present invention;
FIG. 1b is a schematic flow chart of an image recognition method according to an embodiment of the present invention;
FIG. 1c is a schematic diagram of labeling a test paper image according to an embodiment of the present invention;
FIG. 1d is a schematic structural diagram of a positioning point identification network model provided in an embodiment of the present invention;
FIG. 1e is a schematic representation of an affine transformation provided by an embodiment of the present invention;
fig. 1f is a schematic diagram of an image of a question answering area provided in an embodiment of the present invention;
FIG. 1g is a schematic structural diagram of a character recognition network model provided in an embodiment of the present invention;
FIG. 2a is a schematic flow chart of an image recognition method according to an embodiment of the present invention;
FIG. 2b is a schematic diagram of an image taken at different background, texture, and angle according to an embodiment of the present invention;
FIG. 2c is a schematic diagram of an image taken under different light according to an embodiment of the present invention;
FIG. 2d is a schematic view of a projection cut provided by an embodiment of the present invention;
FIG. 2e is a block diagram of an image recognition framework provided by an embodiment of the present invention;
fig. 3a is a schematic diagram of a first structure of an image recognition apparatus according to an embodiment of the present invention;
FIG. 3b is a schematic diagram of a second structure of an image recognition apparatus according to an embodiment of the present invention;
FIG. 3c is a schematic diagram of a third structure of an image recognition apparatus according to an embodiment of the present invention;
fig. 3d is a schematic diagram of a fourth structure of the image recognition apparatus according to the embodiment of the present invention;
fig. 4 is a schematic structural diagram of a network device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides an image identification method, an image identification device and a storage medium.
The image recognition apparatus may be specifically integrated in a network device, such as a terminal or a server, for example, referring to fig. 1a, the network device may acquire an image of an annotated sample test paper, where the image of the sample test paper includes an annotated sample answer area and a sample positioning point of the sample test paper, and for example, may receive an image of an annotated sample test paper sent by an image acquisition device, such as a mobile phone or a camera device; then, the network equipment can obtain the position relation between the sample answering area and the sample positioning point; training the positioning point identification network model according to the sample test paper image to obtain a trained positioning point identification network model; acquiring a test paper image to be identified, and identifying the position of a positioning point of a test paper by adopting a trained positioning point identification network model; extracting an image of the answer area from the test paper image according to the position and the position relation of the positioning point; and performing character recognition on the image of the answer area to obtain a recognition result.
The following are detailed descriptions. The numbers in the following examples are not intended to limit the order of preference of the examples.
In the embodiment of the present invention, the description is made of an image recognition apparatus, and the image recognition apparatus may be specifically integrated in a network device such as a terminal or a server.
In an embodiment, an image recognition method is provided, which may be executed by a processor of a network device, as shown in fig. 1b, and a specific flow of the image recognition method may be as follows:
101. and acquiring an annotated sample test paper image, wherein the sample test paper image comprises an annotated sample answer area and a sample positioning point of the sample test paper.
The answer area is the area of the test paper in which the examinee writes answers, for example the underlined area of a fill-in-the-blank question, the option-filling area of a multiple-choice question, and so on. In one embodiment, the answer area may further include answer information and the like.
The anchor points are points used to locate the test paper area and may be set according to actual requirements; for example, they may be the vertices of the test paper area. Referring to fig. 1c, the anchor points may be the four vertices a, b, c and d of the test paper.
For example, referring to fig. 1c, a test paper image may be collected, and then, a user such as a teacher may label an answer area (for example, a rectangular box area in fig. 1 c) and anchor points of the test paper, for example, four vertices a, b, c, and d labeled by a circle in the test paper, on the test paper image, so as to obtain a labeled sample test paper image.
In some embodiments, the labeling of the test paper image may be implemented by the network device, for example, the network device may collect a sample test paper image (for example, collect an image of a sample test paper written with an answer), and then label an answer area and a positioning point in the sample test paper image according to a labeling operation of a user, so as to obtain a labeled sample test paper image.
In some embodiments, the labeling of the test paper image may also be implemented by other devices, for example, the terminal acquires a sample test paper image (for example, acquires an image of a sample test paper written with an answer), then, according to the labeling operation of the user, marks an answer area and a positioning point in the sample test paper image, and the terminal sends the labeled test paper image to the network device.
102. And obtaining the position relation between the sample answering area and the sample positioning point.
The position relationship may be a position relationship of the sample answer area relative to the sample positioning point in the sample test paper image.
Specifically, position information (a position value, such as a two-dimensional coordinate value) of the sample answer area and position information (a position value, such as a two-dimensional coordinate value) of the sample positioning point may be obtained, and then the positional relationship between the sample answer area and the sample positioning point may be obtained according to the position information of the sample answer area and the position information of the sample positioning point.
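As an illustration of how such a positional relationship can be represented (this sketch is not part of the patent text; the anchor ordering and the normalization scheme are assumptions), the answer-area corners can be expressed in a coordinate frame spanned by the positioning points:

```python
# Illustrative sketch only: one way to encode the positional relationship between
# the labeled answer area and the sample positioning points as normalized offsets.
# It assumes the anchors a, b, c, d are ordered top-left, top-right, bottom-right,
# bottom-left; names and the normalization scheme are not from the patent.
import numpy as np

def answer_area_relation(anchors_xy: np.ndarray, answer_box_xy: np.ndarray) -> np.ndarray:
    """anchors_xy: (4, 2) array of positioning points a, b, c, d (e.g. paper corners).
    answer_box_xy: (4, 2) array of the answer area's corner coordinates.
    Returns the answer-area corners expressed in a coordinate frame spanned by the
    anchors, so the relation stays valid after the image is rectified."""
    origin = anchors_xy[0]                      # take vertex a as the origin
    x_axis = anchors_xy[1] - origin             # a -> b spans the paper width
    y_axis = anchors_xy[3] - origin             # a -> d spans the paper height
    basis = np.stack([x_axis, y_axis], axis=1)  # 2x2 change-of-basis matrix
    rel = np.linalg.solve(basis, (answer_box_xy - origin).T).T
    return rel                                   # values roughly in [0, 1]
```

Stored this way, the relation can later be multiplied back through a basis built from newly detected positioning points to recover the absolute answer-area position.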
103. And training the positioning point identification network model according to the sample test paper image to obtain the trained positioning point identification network model.
Steps 102 and 103 may be performed in any order; the step numbers do not limit their execution order.
The anchor point identification network model is a deep learning neural network model used for identifying test paper such as a test paper area in an image, namely the anchor point identification network model is a model based on a neural network.
The Neural Network may be a Convolutional Neural Network (CNN).
Taking a Convolutional Neural Network (CNN) as an example, as shown in fig. 1d, the structure may include at least five convolutional layers (Conv) and one fully connected layer (FC), as follows:
Convolutional layer: mainly used for feature extraction from the input image (such as a training sample or an image to be recognized), i.e., mapping the raw data to a hidden-layer feature space. The size of the convolution kernel can be determined according to the practical application; for example, the convolution kernel sizes of the first to fifth convolutional layers can be (64, 64), (32, 32), (16, 16), (8, 8) and (4, 4). Optionally, in order to reduce computational complexity and improve computational efficiency, the convolution kernels of the five convolutional layers may be set to the same size.
Optionally, in order to avoid the distribution of inter-layer data changing during training, and thus to prevent gradients from vanishing or exploding and to accelerate training, Batch Normalization (BN) may be added to normalize the result after each convolution. For example, BN may be added to all convolutional layers; referring to fig. 1d, BN may be added to each of the first to fifth convolutional layers.
Optionally, in order to improve the expressive capability of the model, non-linearity may be introduced by adding activation functions. In the embodiment of the present invention, the activation functions are all "ReLU (Rectified Linear Unit)" and the padding is all "same". The "same" padding mode can be simply understood as padding the edges of the feature map with zeros, where the number of zeros padded on the left (top) is the same as or less than the number padded on the right (bottom).
Optionally, in order to further reduce the amount of computation, a downsampling (pooling) operation may be performed in all of the convolutional layers or in any one or two of them. The downsampling operation is essentially the same as convolution, except that the downsampling kernel only takes the maximum value (max) or the average value (average) of the corresponding positions. For convenience of description, in the embodiment of the present invention, the downsampling operation is performed in the first to fifth convolutional layers, and max pooling is taken as the specific example.
It should be noted that, for convenience of description, in the embodiment of the present invention, the layer where the activation function is located, the normalization processing layer (such as the BN layer), and the down-sampling layer (also referred to as the pooling layer) are all included in the convolution layer, and it should be understood that the structure may also be considered to include the convolution layer, the normalization processing layer, the layer where the activation function is located, the down-sampling layer (i.e., the pooling layer), and the full connection layer, and of course, the structure may further include an input layer for inputting data and an output layer for outputting data, which are not described herein again.
Full connection layer: the learned "distributed feature representation" can be mapped to a sample label space, which mainly functions as a "classifier" in the whole convolutional neural network, and each node of the fully-connected layer is connected to all nodes output by the upper layer (such as a down-sampling layer in the convolutional layer), wherein one node of the fully-connected layer is called one neuron in the fully-connected layer, and the number of neurons in the fully-connected layer can be determined according to the requirements of the practical application. Similar to the convolutional layer, optionally, in the fully-connected layer, a non-linear factor may be added by adding an activation function, for example, an activation function sigmoid (sigmoid function) may be added.
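Before turning to training, the following is a minimal PyTorch sketch of a positioning point identification network of the kind described above: five convolutional layers, each followed by Batch Normalization, ReLU and max pooling, and one fully connected layer that regresses the coordinates of the four positioning points. The input size, channel counts and kernel sizes are illustrative assumptions rather than values fixed by the description.

```python
import torch
import torch.nn as nn

class AnchorPointNet(nn.Module):
    """Sketch of the positioning point identification network: 5 conv layers + 1 FC.
    Kernel sizes/channels are illustrative; only the overall structure follows the text."""
    def __init__(self, num_points: int = 4):
        super().__init__()
        chans = [3, 16, 32, 64, 128, 256]
        blocks = []
        for i in range(5):
            blocks += [
                nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, padding=1),  # "same" padding
                nn.BatchNorm2d(chans[i + 1]),    # BN to stabilise training
                nn.ReLU(inplace=True),           # ReLU activation
                nn.MaxPool2d(2),                 # max-pooling downsampling
            ]
        self.features = nn.Sequential(*blocks)
        # assuming a 256x256 input -> 8x8 feature map after five 2x poolings
        self.fc = nn.Linear(256 * 8 * 8, num_points * 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.fc(x)                        # (batch, 8): x/y for vertices a, b, c, d
```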
Based on the introduced locating point identification network model structure, the step "training the locating point identification network model according to the sample test paper image to obtain the trained locating point identification network model" may specifically include:
(1) And acquiring the real position value of the sample positioning point.
For example, a position value (e.g., a two-dimensional coordinate value) of the labeled sample positioning point in the sample test paper image may be obtained, and the position value is the real position value. For example, referring to fig. 1c, after the labeled test paper image is obtained, the position values of the labeled vertexes a, b, c, and d in the image, such as two-dimensional coordinate values, can be obtained.
(2) And acquiring a predicted position value of the sample positioning point based on the sample image and the positioning point identification network model.
For example, the sample image may be input to the anchor point identification network model, the convolution layer in the anchor point identification network model sequentially performs convolution processing on the sample image, and then performs full join operation on the processing result output by the upper convolution layer in the full join layer to obtain the predicted position value of the sample anchor point.
For example, taking a positioning point identification network model including 5 convolutional layers and 1 fully connected layer as an example, referring to fig. 1d, an input sample image is subjected to convolution processing, normalization processing (BN), activation function (ReLU) processing and a downsampling operation at the first convolutional layer (Conv 1), and the processing result is output to the second convolutional layer (Conv 2); convolution processing, activation function (ReLU) processing and a downsampling operation are performed on the result output by the previous layer at the second convolutional layer, and the processing result is output to the third convolutional layer (Conv 3); convolution processing, normalization processing (BN), activation function (ReLU) processing and a downsampling operation are performed on the result output by the previous layer at the third convolutional layer, and the result is output to the fourth convolutional layer (Conv 4); convolution processing, activation function (ReLU) processing and a downsampling operation are performed on the result output by the previous layer at the fourth convolutional layer, and the processing result is output to the fifth convolutional layer (Conv 5); convolution processing, normalization processing (BN), activation function (ReLU) processing and a downsampling operation are performed on the result output by the previous layer at the fifth convolutional layer, and the result is output to the fully connected layer (FC); finally, a full connection operation is performed on the convolution processing result output by the previous layer at the fully connected layer to output the predicted positions of the positioning points, for example, the vertices a, b, c and d.
(3) And adopting a preset loss function to converge the predicted position value and the real position value of the sample positioning point to obtain the trained positioning point identification network model.
The loss function may be flexibly set according to the requirements of the practical application; for example, the loss function may be set based on the Euclidean distance between the predicted position value and the real position value. Specifically, convergence is reached when the Euclidean distance between the predicted position value and the real position value is less than a preset threshold.
The trained model can be obtained by reducing the error between the predicted position value and the real position value of the positioning point, i.e., by continuously training to adjust the weights to suitable values. For example, when the Euclidean distance between the predicted position value and the real position value of the positioning point is greater than the preset threshold, the weights continue to be adjusted until the Euclidean distance is less than the preset threshold, at which point the trained model is obtained.
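A minimal training-loop sketch under the same assumptions is shown below; the mean squared error stands in for the squared Euclidean distance between predicted and real position values, and the dataset and optimizer choices are illustrative.

```python
import torch
import torch.nn as nn

def train_anchor_net(model, loader, epochs=20, lr=1e-3, threshold=0.01, device="cpu"):
    """Converge the predicted positioning point values towards the labeled real values.
    `loader` is assumed to yield (image_batch, target_batch) with targets of shape (B, 8),
    i.e. the normalized x/y coordinates of the four labeled positioning points."""
    model.to(device).train()
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for images, targets in loader:
            images, targets = images.to(device), targets.to(device)
            preds = model(images)
            # squared Euclidean distance between predicted and real position values
            loss = nn.functional.mse_loss(preds, targets)
            optim.zero_grad()
            loss.backward()
            optim.step()
        # stop once the distance falls below the preset threshold
        if loss.item() < threshold:
            break
    return model
```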
104. And acquiring an image of the test paper to be identified, and identifying the position of the positioning point of the test paper by adopting the trained positioning point identification network model.
For example, the network device may photograph a test paper on which answers have been written to obtain a test paper image; for another example, the network device may receive a test paper image sent by an image capturing device, e.g., the image capturing device captures a test paper with answers and then sends the captured test paper image to the network device.
The position of the positioning point can be identified based on the trained positioning point identification network model as follows:
the test paper image can be input into the trained locating point identification network model, the convolution layer in the locating point identification network model sequentially carries out convolution processing on the test paper image, and then the full-connection operation is carried out on the processing result output by the upper convolution layer at the full-connection layer, so that the predicted position value of the locating point is obtained.
Taking an example that the positioning point identification network model comprises a full connection layer and at least five convolution layers; the step of identifying the location point position of the test paper by using the trained location point identification network model may include:
sequentially carrying out convolution processing on the test paper images on at least five convolution layers to obtain convolution processing results;
and performing full-connection operation on the convolution processing result in the full-connection layer to obtain the position of the positioning point.
For example, taking a positioning point identification network model including 5 convolutional layers and 1 fully connected layer as an example, referring to fig. 1d, the input test paper image is subjected to convolution processing, normalization processing (BN), activation function (ReLU) processing and a downsampling operation at the first convolutional layer, and the processing result is output to the second convolutional layer; convolution processing, activation function (ReLU) processing and a downsampling operation are performed on the result output by the previous layer at the second convolutional layer, and the processing result is output to the third convolutional layer; convolution processing, normalization processing (BN), activation function (ReLU) processing and a downsampling operation are performed on the result output by the previous layer at the third convolutional layer, and the result is output to the fourth convolutional layer; convolution processing, activation function (ReLU) processing and a downsampling operation are performed on the result output by the previous layer at the fourth convolutional layer, and the processing result is output to the fifth convolutional layer; convolution processing, normalization processing (BN), activation function (ReLU) processing and a downsampling operation are performed on the result output by the previous layer at the fifth convolutional layer, and the result is output to the fully connected layer; finally, a full connection operation is performed on the convolution processing result output by the previous layer at the fully connected layer to output the predicted position values of the positioning points.
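Continuing the sketch above, identifying the positioning point positions of a captured test paper image then reduces to a single forward pass; the resizing and normalization steps below are assumptions, not values from the description.

```python
import cv2
import numpy as np
import torch

def detect_anchor_points(model, image_path: str, size: int = 256) -> np.ndarray:
    """Run the trained positioning point network on one test paper image and
    return the four predicted vertices in original-image pixel coordinates."""
    img = cv2.imread(image_path)
    h, w = img.shape[:2]
    inp = cv2.resize(img, (size, size)).astype(np.float32) / 255.0
    tensor = torch.from_numpy(inp).permute(2, 0, 1).unsqueeze(0)   # (1, 3, H, W)
    model.eval()
    with torch.no_grad():
        pred = model(tensor).reshape(4, 2).numpy()                 # normalized (x, y) per vertex
    return pred * np.array([w, h], dtype=np.float32)               # back to pixel coordinates
```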
105. And extracting an image of the answer area from the test paper image according to the position of the positioning point and the position relation.
In the embodiment of the present invention, after the positioning points of the test paper area are identified based on the positioning point identification network model, the answer area image can be extracted (for example, cropped) from the test paper image based on the positioning point positions and the positional relationship between the positioning points and the answer area.
Specifically, in an embodiment, the position of the answer area may be determined based on the position of the positioning point and the position relationship, and then the answer area image may be extracted based on the position of the answer area; that is, the step of "extracting an image of an answer area from an image of a test paper according to the position of the positioning point and the positional relationship" may include:
determining the position of the answering area according to the position and the position relation of the positioning point;
and extracting an answer area image from the test paper image according to the answer area position.
For example, taking the answer area as a rectangular area, after the four vertex positions of the test paper area are obtained, the position of the answer area in the test paper image (e.g., the positions of the four vertices of the answer area) may be determined based on the vertex positions and the positional relationship between the vertices and the answer area, and then the answer area image, for example a rectangular answer area image, may be cropped from the test paper image based on that position.
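As a sketch of this step (matching the relation encoding assumed in the earlier sketch, and therefore hypothetical), the answer-area corners can be recovered from the detected vertices and the stored relation, and the bounding rectangle cropped:

```python
import numpy as np

def crop_answer_area(image: np.ndarray, anchors_xy: np.ndarray, rel: np.ndarray) -> np.ndarray:
    """Recover the answer area position from the detected positioning points and the
    stored positional relationship `rel` (see the earlier sketch), then crop it.
    anchors_xy is a (4, 2) array ordered top-left, top-right, bottom-right, bottom-left."""
    origin = anchors_xy[0]
    basis = np.stack([anchors_xy[1] - origin, anchors_xy[3] - origin], axis=1)
    corners = (basis @ rel.T).T + origin          # absolute answer-area corners
    x0, y0 = corners.min(axis=0).astype(int)
    x1, y1 = corners.max(axis=0).astype(int)
    return image[y0:y1, x0:x1]                    # rectangular answer area image
```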
In an embodiment, in order to improve the accuracy of character recognition, affine transformation can be further performed on the test paper image, so that the answer area can be extracted more accurately and character recognition can be performed. For example, the step "extracting an image of an answer area from an image of a test paper according to the position and the positional relationship of the positioning point" may include:
performing affine transformation on the test paper image according to the positioning point position to obtain an affine-transformed image and a positioning point position after the affine transformation;
and extracting an image of the answer area from the image after the affine transformation according to the position and the position relation of the positioning point after the affine transformation.
Affine transformation, also called affine mapping, refers to a process in which one vector space is linearly transformed and then translated into another vector space in geometry. The affine transformation includes: image rotation, translation, zooming, etc.
The affine transformation of an image essentially linearly transforms the two-dimensional position of the pixels in the image, an arbitrary affine transformation being represented by multiplication by a matrix (linear transformation) followed by addition of a vector (translation).
Because the positions of all points (pixel points) of the image are transformed, the positions of the positioning points after affine transformation are also transformed; therefore, after affine transformation of the image, an affine transformed position, i.e. a new position, of the anchor point may be obtained.
In one embodiment, the affine transformation may be performed on the test paper area in the test paper image; for example, the test paper area may be affine-transformed into a rectangular test paper image, i.e., the test paper area is affinely transformed and projected onto a rectangular area. For example, fig. 1e compares the test paper image before and after the affine transformation: the left side of fig. 1e is the test paper image before the affine transformation and the right side is the test paper image after the affine transformation.
The affine transformation matrix is obtained based on the position of the anchor point, for example, the affine transformation matrix may be obtained based on the current position and the new position of the anchor point, and then, the affine transformation is performed based on the matrix.
For example, the step of performing affine transformation on the test paper image according to the positioning point position may include:
acquiring a new positioning point position;
acquiring an affine transformation matrix according to the positioning point position and the new positioning point position;
and carrying out affine transformation processing on the pixel position of the image according to the affine transformation matrix.
The new positioning point position can be a new position of the positioning point, namely a position of the positioning point after executing affine transformation; the position may be predetermined.
For example, taking anchor points as four vertices of the test paper area as an example, the current positions of vertices a, b, c, and d may be obtained through the anchor point identification network model, then new positions after affine transformation of a, b, c, and d may be obtained, an affine transformation matrix may be calculated based on these two groups of positions, then, affine transformation is performed on all points in the image by using the affine transformation matrix, and referring to fig. 1f and fig. 1c, an image of the selected question answering area may be cut out from the image after affine transformation.
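A sketch of this rectification step with OpenCV is shown below. The description refers to the mapping as an affine transformation; because four point pairs are used in this sketch, a perspective (projective) warp is applied, which likewise maps the detected test paper area onto a rectangle. The output size and the assumption that a, b, c, d are ordered top-left, top-right, bottom-right, bottom-left are illustrative.

```python
import cv2
import numpy as np

def rectify_test_paper(image: np.ndarray, anchors_xy: np.ndarray,
                       out_w: int = 1000, out_h: int = 1400):
    """Map detected vertices a, b, c, d onto the corners of an out_w x out_h rectangle.
    Returns the warped image and the transformed positioning point positions."""
    src = anchors_xy.astype(np.float32)            # detected a, b, c, d
    dst = np.float32([[0, 0], [out_w - 1, 0],      # new (predetermined) positions
                      [out_w - 1, out_h - 1], [0, out_h - 1]])
    matrix = cv2.getPerspectiveTransform(src, dst)
    warped = cv2.warpPerspective(image, matrix, (out_w, out_h))
    return warped, dst                             # dst are the post-transform anchors
```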
In an embodiment, after affine transformation, the position of the answer area in the image after affine transformation may be obtained based on the position of the positioning point and the positional relationship after affine transformation; and extracting an image of the answer area from the affine transformed image based on the position of the answer area.
For example, taking the answer area as a rectangular area as an example, after four vertex positions of the test paper area are obtained, affine transformation is performed on the test paper image, positions of the answer area in the test paper image after affine transformation (for example, positions of four vertices of the answer area) may be determined based on the vertex positions after affine transformation and a positional relationship between the vertices and the answer area (for example, four vertices of the answer area), and then an answer area image, for example, a rectangular answer area image, is cut out from the test paper image after affine transformation based on the positions.
106. And performing character recognition on the image of the answer area to obtain a recognition result.
In order to improve the accuracy and reliability of character recognition, a character image may be cut out from the answer area image, and then character recognition may be performed on the character image, where the character image may be an image containing one or more characters (such as characters, symbols, numbers, and the like).
In an embodiment, in order to improve the efficiency and accuracy of the cutting, the character images may be cut out of the answer area image by projection; for example, OpenCV may be used to perform horizontal and vertical projection so as to cut the character images out of the answer area image.
In an embodiment, in order to improve the accuracy and reliability of character recognition, character recognition may be performed based on a character recognition network model, which is a neural network-based character recognition model.
For example, the step of performing character recognition on the answer area image to obtain a recognition result may include:
cutting out character images from the answer area images in a projection mode;
and performing character recognition on the character image by adopting the trained character recognition network model to obtain a recognition result.
For example, the character image can be cut out from the answer area image by adopting a horizontal projection and a vertical projection mode. In order to improve the cutting accuracy of the character image, a plurality of rows of subarea images can be cut by adopting horizontal projection; then, character images are cut out from the subregion images using vertical projection.
That is, the step of "cutting out a character image from the answer area image by projection" may include:
carrying out horizontal projection on the area image to obtain a horizontal projection result;
cutting the area image according to the horizontal projection result to obtain a plurality of rows of sub-area images;
carrying out vertical projection on the sub-region image to obtain a vertical projection result;
and cutting the sub-area image according to the vertical projection result to obtain a character image.
Wherein, the horizontal projection may be: projection of the two-dimensional image on the y-axis; the vertical projection may be: projection of the two-dimensional image on the x-axis.
For example, the image of the answer area shown in fig. 1f may be horizontally projected, and several lines of sub-area images, such as, for example, contain "(c), may be cut out based on the horizontal projection result. "subregion images, subregion images containing other characters, etc. Then, the sub-region image is vertically projected, for example, the pair contains "(c). The projection of the sub-region image can obtain the images including "(", "c", ") and". "of the character image.
In an embodiment, in order to improve the efficiency and accuracy of character image cutting, the image may be filtered after obtaining a plurality of rows of sub-region images, and then the filtered image is cut by using vertical projection. For example, before vertically projecting the sub-region image, the method of the embodiment of the present invention may further include: filtering the plurality of rows of subarea images according to preset image filtering conditions to obtain filtered subarea images;
at this time, the step of "vertically projecting the sub-region image" may include: vertically projecting the filtered sub-region image; the step of "cutting the sub-region image according to the vertical projection result" may include: and cutting the filtered subregion image according to the vertical projection result.
The preset image filtering condition can be set according to actual requirements and is used for filtering out images irrelevant to answer characters.
For example, the image of the answer area shown in fig. 1f may be horizontally projected, and several lines of sub-area images, such as, for example, contain "(c), may be cut out based on the horizontal projection result. "a subregion image, a subregion image containing other characters, etc. Then, filtering the sub-region image, for example, filtering out an image containing abnormal characters and the like; for example, the inclusion "(c) may be obtained finally by image filtering. "of the sub-region.
Next, the filtered sub-region image is vertically projected; for example, projecting the sub-region image containing "(c)." yields character images containing "(", "c", ")" and "." respectively.
In an embodiment, after the character image is obtained by vertical projection cutting, the character image may be filtered, for example, to filter out images of abnormal characters, to filter out images of incomplete characters, and so on; the filtering rules can be specifically set according to actual requirements.
In the embodiment of the present invention, after the character images are cut out of the answer area image by projection, the trained character recognition network model may be used for character recognition. The character recognition network model is a deep learning neural network model for recognizing characters, i.e., a model based on a neural network such as a Convolutional Neural Network (CNN). For example, a model with a structure similar to the LeNet network may be adopted.
Taking the structure as a Convolutional Neural Network (CNN) as an example, the structure may include at least seven Convolutional Layers (Convolution) and two Fully Connected Layers (FC), as shown in fig. 1 g. Specifically, reference may be made to the above descriptions for the convolutional layer and the fully-connected layer, which are not described herein again. The embodiment of the invention adopts two layers of full connection, so that the accuracy of character recognition can be improved.
Specifically, the character image may be input to the character recognition network model, then the character image is sequentially subjected to convolution processing on at least seven convolution layers, and finally, character classification processing is performed on the convolution processing result on the last two full-connected layers. For example, the step of performing character recognition on the character image by using the trained character recognition network model may include:
carrying out convolution processing on the character images in sequence on the plurality of convolution layers to obtain convolution processing results;
and sequentially carrying out character classification processing on the convolution processing results in the two full-connection layers.
Optionally, in some embodiments, the character image may be further processed by at least one of normalization processing (BN), activation function (relu) processing, and downsampling operation at the convolutional layer, so as to improve the identification accuracy.
For example, taking as an example that the character recognition network model includes 7 convolutional layers and 2 fully connected layers, referring to fig. 1g, an input character image is subjected to convolution processing at the first convolutional layer (Conv 1), and the processing result is output to the second convolutional layer (Conv 2); convolution processing is performed on the result output by the previous layer at the second convolutional layer and the result is output to the third convolutional layer (Conv 3); convolution processing is performed at the third convolutional layer and the result is output to the fourth convolutional layer (Conv 4); convolution processing is performed at the fourth convolutional layer and the result is output to the fifth convolutional layer (Conv 5); convolution processing is performed at the fifth convolutional layer and the result is output to the sixth convolutional layer (Conv 6); convolution processing is performed at the sixth convolutional layer and the result is output to the seventh convolutional layer (Conv 7); finally, convolution processing is performed at the seventh convolutional layer and the processing result is output to the last two fully connected layers for character classification, so as to obtain the recognition result.
For example, the answer character that can be finally recognized through the above character recognition network model is "c".
In the embodiment of the present invention, in order to balance performance and accuracy, a network model with a LeNet-like structure is used for character recognition. It can classify more than 10,000 character categories, including common Chinese characters, digits and the like, and the two fully connected layers help ensure accuracy.
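A minimal PyTorch sketch of such a LeNet-like character recognition network (seven convolutional layers followed by two fully connected layers) is given below; the channel counts, the assumed 64×64 grayscale input, the pooling positions and the 10,000-class output are illustrative.

```python
import torch
import torch.nn as nn

class CharRecognitionNet(nn.Module):
    """Sketch of the character recognition network: 7 conv layers + 2 FC layers.
    Layer sizes are illustrative; only the overall structure follows the description."""
    def __init__(self, num_classes: int = 10000):
        super().__init__()
        chans = [1, 32, 32, 64, 64, 128, 128, 256]
        layers = []
        for i in range(7):
            layers += [nn.Conv2d(chans[i], chans[i + 1], 3, padding=1), nn.ReLU(inplace=True)]
            if i in (1, 3, 5):                    # occasional 2x downsampling
                layers.append(nn.MaxPool2d(2))
        self.features = nn.Sequential(*layers)
        # assuming 64x64 character images -> 8x8 feature map after three poolings
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, 1024), nn.ReLU(inplace=True),   # first FC layer
            nn.Linear(1024, num_classes),                           # second FC layer: character classes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```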
The image recognition method provided by the embodiment of the invention can be applied to an automatic marking scenario: for example, the answer characters in a test paper are recognized by the image recognition method provided by the embodiment of the invention, the recognized answer characters are then compared with the standard answer characters, and a corresponding score is given based on the comparison result, thereby realizing automatic marking.
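As a toy illustration of this marking step (the data layout is hypothetical and not part of the patent), the recognized answers can simply be compared with the standard answers question by question:

```python
def auto_mark(recognized: dict, standard: dict, points_per_question: dict) -> int:
    """Compare recognized answer characters with the standard answers question by question
    and return the total score. All three dicts are keyed by question id."""
    total = 0
    for qid, answer in standard.items():
        if recognized.get(qid, "").strip() == answer:
            total += points_per_question.get(qid, 0)
    return total

# hypothetical usage: question 1 was recognized as "c" and is worth 3 points
print(auto_mark({"1": "c"}, {"1": "c"}, {"1": 3}))   # -> 3
```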
As can be seen from the above, the embodiment of the present invention acquires a labeled sample test paper image, where the sample test paper image includes a labeled sample answer area and sample positioning points of the sample test paper; obtains the positional relationship between the sample answer area and the sample positioning points; trains a positioning point identification network model according to the sample test paper image to obtain a trained positioning point identification network model; acquires a test paper image to be recognized and identifies the positioning point positions of the test paper by using the trained positioning point identification network model; extracts an answer area image from the test paper image according to the positioning point positions and the positional relationship; and performs character recognition on the answer area image to obtain a recognition result. That is, the scheme identifies the positioning point positions of the test paper based on a deep-learning positioning point identification network model, extracts the answer area based on the positioning point positions, and performs character recognition on the answer area image. Because the positioning point positions are identified from the test paper image by the deep-learning positioning point identification network model, they can be identified effectively and accurately in test paper images obtained under a wide variety of shooting conditions (such as background, light, angle and texture); the scheme is therefore suitable for various shooting scenes and imposes no limitation on how the test paper image is captured. In addition, the scheme can perform effective character recognition directly on the original test paper without any assistance, so it imposes no limitation on the type of test paper, the question types and the like. Therefore, the character recognition of this scheme is far less limited (for example, it imposes no limitation on shooting scenes, test paper types, question types and the like), and the accuracy and reliability of character recognition are improved.
The embodiment of the invention also adopts a projection mode to realize character image cutting, and can improve the accuracy, reliability, robustness and efficiency of character image cutting.
In addition, the embodiment of the invention can also perform character recognition on the character image based on the deep learning network, and can further improve the accuracy of the character recognition.
The method described in the above embodiments is further illustrated in detail by way of example.
In this embodiment, the image recognition apparatus will be described by taking as an example that it is specifically integrated in a network device.
The image recognition process of the network device, as shown in fig. 2a, is as follows:
201. and the network equipment acquires the marked sample test paper image.
The marked sample test paper image comprises a marked sample answer area and a sample positioning point of the sample test paper.
The sample test paper image is an image of a sample test paper, for example an image of a mathematics examination paper, an image of a Chinese examination paper, and so on.
The answer area is the area of the test paper in which the examinee writes answers, for example the underlined area of a fill-in-the-blank question, the option-filling area of a multiple-choice question, and so on. In one embodiment, the answer area may further include answer information and the like.
The anchor points are points for locating the test paper area, and may be set according to actual requirements, for example, the anchor points may be vertices of the test paper area, for example, referring to fig. 1c, and the anchor points may be four vertices a, b, c, and d of the test paper.
The network device can acquire the annotated sample test paper images through a plurality of ways. For example, the network device may photograph a sample test paper to obtain a sample test paper image; and then, marking a sample answer area and positioning points of the sample test paper in the sample test paper image according to the marking operation of the user.
For another example, the network device may receive a labeled sample test paper image sent by an image acquisition device; the image acquisition device, such as a terminal, may photograph a sample test paper to obtain a sample test paper image, label the sample answer area and the positioning points of the sample test paper in the sample test paper image according to the labeling operation of the user, and then send the labeled test paper image to the network device.
In practical application, if the embodiment of the present invention is to be adopted to implement automatic examination paper reading, a teacher may select one examination paper written with answers from a plurality of examination papers as a sample examination paper, then shoot an image of the sample examination paper, and label the image of the sample examination paper on a labeling platform or device, such as labeling an answer area and an examination paper area positioning point.
202. The network equipment acquires the position relation between the sample answering area and the sample positioning point; and training the positioning point identification network model according to the sample test paper image to obtain the trained positioning point identification network model.
The position relationship may be a position relationship of the sample answer area relative to a sample positioning point (e.g., a vertex of the test paper) in the sample test paper image.
Specifically, the network device may obtain position information (a position value, such as a two-dimensional coordinate value) of the sample answer area and position information (a position value, such as a two-dimensional coordinate value) of the sample positioning point, and then obtain the positional relationship between the sample answer area and the sample positioning point according to the position information of the sample answer area and the position information of the sample positioning point.
The position relationship may include a position mapping relationship between the positioning point and the answer area, and the like, and may be, for example, a function.
The anchor point identification network model is a deep learning neural network model used for identifying the test paper in the image, such as a test paper area, namely the anchor point identification network model is a model based on a neural network.
The Neural Network may be a Convolutional Neural Network (CNN).
Taking the structure as a Convolutional Neural Network (CNN) as an example, the structure may include at least five Convolutional Layers (Convolutional) and one Fully Connected layer (FC) as shown in fig. 1 d. The specific network structure may refer to the description of the above embodiments.
Based on the above introduced positioning point identification network model structure, the training process of the positioning point identification network model can be as follows:
(1) And acquiring the real position value of the sample positioning point.
For example, a position value (e.g., a two-dimensional coordinate value) of the labeled sample positioning point in the sample test paper image may be obtained, and the position value is the real position value. For example, referring to fig. 1c, after the labeled test paper image is obtained, the position values of the labeled vertexes a, b, c, and d in the image, such as two-dimensional coordinate values, can be obtained.
(2) And obtaining the predicted position value of the sample positioning point based on the sample image and the positioning point identification network model.
For example, the sample image may be input to the anchor point identification network model, convolution layers in the anchor point identification network model sequentially perform convolution processing on the sample image, and then full-connection operation is performed on a processing result output by an upper convolution layer in a full-connection layer, so as to obtain a predicted position value of the sample anchor point.
For example, taking a positioning point identification network model with 5 convolutional layers and 1 fully connected layer as an example, and referring to fig. 1d: the input sample image undergoes convolution, normalization (BN), activation function (ReLU) processing, and a downsampling operation at the first convolutional layer (Conv 1), and the result is output to the second convolutional layer (Conv 2); the second convolutional layer applies convolution, ReLU activation, and downsampling to the result from the previous layer and outputs it to the third convolutional layer (Conv 3); the third convolutional layer applies convolution, normalization (BN), ReLU activation, and downsampling and outputs the result to the fourth convolutional layer (Conv 4); the fourth convolutional layer applies convolution, ReLU activation, and downsampling and outputs the result to the fifth convolutional layer (Conv 5); the fifth convolutional layer applies convolution, normalization (BN), ReLU activation, and downsampling and outputs the result to the fully connected layer (FC); finally, the fully connected layer performs a full-connection operation on the convolution result from the previous layer and outputs the predicted positions of the positioning points, for example the vertices a, b, c, and d.
(3) Converge the predicted position value and the real position value of the sample positioning point using a preset loss function to obtain the trained positioning point identification network model.
The loss function can be set flexibly according to the requirements of the actual application; for example, it can be based on the Euclidean distance between the predicted position value and the real position value. Specifically, training is considered converged when the Euclidean distance between the predicted position value and the real position value is less than a preset threshold.
The trained model is obtained by reducing the error between the predicted and real position values of the positioning point, continuously adjusting the weights to suitable values through training. For example, as long as the Euclidean distance between the predicted and real position values of the positioning point is greater than the preset threshold, the weights continue to be adjusted; once the distance falls below the threshold, the trained model is obtained.
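Purely as an illustration of this convergence step, the sketch below trains such a model using the mean Euclidean distance between predicted and labeled vertex coordinates as the loss, stopping once the distance falls below a preset threshold. The optimizer, learning rate, and threshold value are assumptions.

```python
# Illustrative training loop: converge predicted and real positioning point
# values with a Euclidean-distance loss (assumed hyperparameters).
import torch

def train_positioning_net(model, loader, threshold=2.0, lr=1e-3, max_epochs=100):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        worst = 0.0
        for images, true_points in loader:          # true_points: (N, 8)
            pred_points = model(images)
            # Mean Euclidean distance over the four vertices of each sample.
            dist = (pred_points - true_points).view(-1, 4, 2).norm(dim=2).mean()
            opt.zero_grad()
            dist.backward()
            opt.step()
            worst = max(worst, dist.item())
        if worst < threshold:                        # convergence condition
            return model
    return model
```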
203. The network device collects the test paper image to be identified.
For example, the test paper image and the sample test paper image show two test papers containing the same test questions from the same examination, for example two papers from one language examination.
Because the positioning points are identified by a deep learning network model, the test paper image may be captured in a wide variety of scenes, for example at various angles and under various lighting, backgrounds, and textures. That is, the scheme of the embodiment of the invention supports test paper images captured in various scenes, such as images taken at any angle or under any lighting, background, or texture. Fig. 2b shows test paper images taken against various backgrounds and at various angles; fig. 2c shows test paper images taken under different lighting.
In an embodiment, the network device may directly capture the test paper image to be identified, or the test paper image may be captured by another image acquisition device and sent to the network device.
204. The network device identifies the positions of the positioning points of the test paper using the trained positioning point identification network model.
For example, the network device may sequentially perform convolution processing on the test paper image in the at least five convolutional layers to obtain a convolution processing result, and then perform a full-connection operation on that result in the fully connected layer to obtain the positions of the positioning points.
For example, taking a positioning point identification network model with 5 convolutional layers and 1 fully connected layer as an example, and referring to fig. 1d: the input test paper image undergoes convolution, normalization (BN), activation function (ReLU) processing, and a downsampling operation at the first convolutional layer, and the result is output to the second convolutional layer; the second convolutional layer applies convolution, ReLU activation, and downsampling to the result from the previous layer and outputs it to the third convolutional layer; the third convolutional layer applies convolution, normalization (BN), ReLU activation, and downsampling and outputs the result to the fourth convolutional layer; the fourth convolutional layer applies convolution, ReLU activation, and downsampling and outputs the result to the fifth convolutional layer; the fifth convolutional layer applies convolution, normalization (BN), ReLU activation, and downsampling and outputs the result to the fully connected layer; finally, the fully connected layer performs a full-connection operation on the convolution result from the previous layer and outputs the predicted position values of the positioning points.
205. The network device performs affine transformation on the test paper image according to the positioning point positions to obtain an affine-transformed image and the positions of the positioning points after the affine transformation.
Affine transformation, also called affine mapping, refers to the geometric process of applying a linear transformation to one vector space followed by a translation into another vector space. Affine transformations include image rotation, translation, scaling, and the like.
The affine transformation of an image essentially applies a linear transformation to the two-dimensional positions of the pixels in the image; an arbitrary affine transformation can be represented as a multiplication by a matrix (the linear transformation) followed by the addition of a vector (the translation).
For example, the network device may perform affine transformation on the test paper area in the test paper image to form a rectangular test paper image, i.e., the test paper area is affinely mapped and projected onto a rectangular area. For example, fig. 1e compares the test paper image before and after the affine transformation: the left side shows the image before the transformation and the right side shows the image after it.
The affine transformation matrix is obtained based on the positioning point positions; for example, it may be computed from the current positioning point positions and the new positioning point positions, after which the affine transformation is performed using that matrix.
For example, the network device may obtain the new positioning point positions, obtain the affine transformation matrix from the current and new positioning point positions, and perform affine transformation on the pixel positions of the image according to that matrix. A new positioning point position is the position a positioning point should occupy after the affine transformation; it may be predetermined.
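A minimal sketch of this step, assuming OpenCV is used (the library choice, the three-point correspondence, and the target rectangle size are assumptions rather than features of the embodiment), could look as follows.

```python
# Illustrative affine rectification: estimate an affine matrix from three of
# the detected positioning points and their predetermined new positions, then
# transform both the image and the points.
import cv2
import numpy as np

def rectify(image, points):
    """points: detected vertices in the order top-left, top-right,
    bottom-right, bottom-left; the first three define the affine map."""
    src = np.float32(points[:3])
    dst = np.float32([[0, 0], [800, 0], [800, 1100]])  # new positioning points
    M = cv2.getAffineTransform(src, dst)               # 2x3 affine matrix
    warped = cv2.warpAffine(image, M, (800, 1100))
    # Apply the same matrix to the positioning point positions.
    pts = np.float32(points).reshape(-1, 1, 2)
    new_pts = cv2.transform(pts, M).reshape(-1, 2)
    return warped, new_pts
```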
206. The network device extracts the answer area image from the affine-transformed image according to the positions of the positioning points after the affine transformation and the position relationship.
For example, the network device may obtain the position of the answer area in the image after the affine transformation based on the position of the location point after the affine transformation and the position relationship; and extracting an image of the answer area from the affine transformed image based on the position of the answer area.
For example, taking a rectangular answer area as an example: after the four vertex positions of the test paper area have been obtained, the test paper image is affinely transformed; the position of the answer area in the transformed image (for example, the positions of its four vertices) can then be determined from the transformed vertex positions and the positional relationship between the vertices and the answer area, and the answer area image, for example a rectangular answer area image, can be cut out of the transformed test paper image based on that position.
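As a further non-limiting sketch, and assuming the position relationship is stored as fractional offsets of the answer area within the rectified test paper area, the extraction of step 206 might look like this (the function and parameter names are hypothetical):

```python
# Illustrative cropping of the answer area from the affine-transformed image,
# using the transformed positioning points and an assumed fractional box
# (left, top, right, bottom) that encodes the stored position relationship.
import numpy as np

def crop_answer_area(warped, new_pts, rel_box=(0.1, 0.3, 0.9, 0.5)):
    x0, y0 = new_pts.min(axis=0)        # top-left of the rectified paper
    x1, y1 = new_pts.max(axis=0)        # bottom-right of the rectified paper
    w, h = x1 - x0, y1 - y0
    u0, v0, u1, v1 = rel_box
    left, top = int(x0 + u0 * w), int(y0 + v0 * h)
    right, bottom = int(x0 + u1 * w), int(y0 + v1 * h)
    return warped[top:bottom, left:right]
```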
207. The network device cuts character images out of the answer area image by projection.
To improve the accuracy of character cutting, the network device may first cut out several lines of sub-area images using horizontal projection, and then cut character images out of the sub-area images using vertical projection.
Specifically, referring to fig. 2d, the network device horizontally projects the answer area image to obtain a horizontal projection result and cuts the area image according to that result, obtaining several lines of sub-area images; it then filters the sub-area images, vertically projects the filtered sub-area images to obtain a vertical projection result, and cuts the filtered sub-area images according to the vertical projection result to obtain character images.
For example, the answer area image shown in fig. 1f may be horizontally projected, and several lines of sub-area images may be cut out based on the horizontal projection result, for example a sub-area image containing "(c).", sub-area images containing other characters, and so on. The sub-area images are then filtered, for example to discard images containing abnormal characters, so that the sub-area image containing "(c)." is retained. This filtered sub-area image is then vertically projected, and cutting it according to the vertical projection result yields the character images "(", "c", ")", and ".".
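An illustrative sketch of this projection-based cutting, assuming a binarized answer area image with dark text on a light background and using OpenCV and NumPy (library choices and thresholds are assumptions), is given below; the filtering of abnormal sub-areas described above is omitted for brevity.

```python
# Illustrative projection-based character cutting: horizontal projection cuts
# the answer area into line (sub-area) images, vertical projection cuts each
# line into character images.
import cv2
import numpy as np

def cut_characters(answer_img, min_run=2):
    gray = cv2.cvtColor(answer_img, cv2.COLOR_BGR2GRAY)
    binary = cv2.threshold(gray, 0, 1,
                           cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]

    def runs(profile):
        """Return (start, end) pairs of contiguous non-zero stretches."""
        spans, start = [], None
        for i, v in enumerate(profile):
            if v > 0 and start is None:
                start = i
            elif v == 0 and start is not None:
                if i - start >= min_run:
                    spans.append((start, i))
                start = None
        if start is not None:
            spans.append((start, len(profile)))
        return spans

    chars = []
    for top, bottom in runs(binary.sum(axis=1)):       # horizontal projection
        line = binary[top:bottom]
        for left, right in runs(line.sum(axis=0)):     # vertical projection
            chars.append(answer_img[top:bottom, left:right])
    return chars
```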
208. The network device performs character recognition on the character images using the trained character recognition network model to obtain a recognition result.
The character recognition network model is a deep learning neural network model for recognizing characters, i.e., a model based on a neural network such as a convolutional neural network (CNN). For example, a model with a structure similar to that of the LeNet network can be adopted.
As shown in fig. 1g, the structure may include at least seven convolutional layers and two fully connected layers (FC). For the convolutional layers and the fully connected layers, reference may be made to the description above, which is not repeated here. Because the embodiment of the invention uses two fully connected layers, the accuracy of character recognition can be improved.
Specifically, the network device may input the character image into the character recognition network model, perform convolution processing on the character image sequentially in the at least seven convolutional layers, and finally perform character classification on the convolution result in the last two fully connected layers.
In the embodiment of the invention, in order to balance performance and accuracy, a network model with a LeNet-like structure is used for character recognition. It can classify more than 10,000 characters, such as common Chinese characters and digits, and the two fully connected layers help ensure accuracy.
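A minimal PyTorch sketch of such a LeNet-style classifier, with seven convolutional layers followed by two fully connected layers, is shown below. The channel counts, input size, pooling positions, and exact number of classes are assumptions.

```python
# Illustrative character recognition network: seven convolutional layers and
# two fully connected layers classifying 10,000+ character classes (assumed
# sizes; the embodiment only fixes the layer counts).
import torch
import torch.nn as nn

class CharRecognitionNet(nn.Module):
    def __init__(self, num_classes=10000):
        super().__init__()
        chs = [1, 32, 32, 64, 64, 128, 128, 256]       # 7 convolutional layers
        convs = []
        for i in range(7):
            convs += [nn.Conv2d(chs[i], chs[i + 1], 3, padding=1),
                      nn.ReLU(inplace=True)]
            if i % 2 == 1:                             # occasional downsampling
                convs.append(nn.MaxPool2d(2))
        self.features = nn.Sequential(*convs)
        # For an assumed 64x64 character image, three 2x poolings leave 8x8.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, 1024),              # first fully connected layer
            nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes),              # second fully connected layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```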
According to the above description and referring to fig. 2e, an embodiment of the present invention further provides a framework for character recognition, which may include a vertex positioning module, a tilt correction module, a character cutting module, a projection module, a character recognition module, an external service packaging module, and the like.
The vertex positioning module is configured to identify the positions of the positioning points of the test paper based on the positioning point identification network model, for example identifying the vertex positions through the network model shown in fig. 1d.
The tilt correction module is configured to perform affine transformation on the test paper image, correcting test paper images taken at various angles so as to facilitate subsequent recognition.
The character cutting module is configured to cut the answer area image out of the test paper image (for the specific cutting manner, reference may be made to the description of the foregoing embodiments) and to cut character images out of the answer area by projection.
The projection module is configured to project the cut answer area image, for example horizontally and vertically, so that character images can be cut out.
The character recognition module is configured to perform character recognition on the character images using the trained character recognition network model to obtain a recognition result; for example, a network model with a LeNet-like structure is used, which can classify more than 10,000 characters such as common Chinese characters and digits.
The external service packaging module is an interface of the external service and is used for the external to call the image identification method provided by the embodiment of the invention.
The framework of fig. 2e may further involve data processing procedures such as a data platform (providing data such as image data), data labeling (labeling answer areas, positioning points, and the like), data preprocessing (such as image size and color adjustment), and data specification construction.
In addition, an algorithm platform, a CNN core framework, extensive parameter tuning for optimization, searches for better architectures, and the like may be involved. Through this series of operations, the required character recognition network model and positioning point identification network model can be constructed.
Therefore, the method provided by the embodiment of the invention can identify the positions of the positioning points of the test paper using a deep learning positioning point identification network model, extract the answer area based on those positions, and perform character recognition on the answer area image. Because the positioning point positions are identified from the test paper image by a deep-learning-based positioning point identification network model, they can be identified effectively and accurately in test paper images obtained in a wide variety of shooting scenes (different backgrounds, lighting, angles, textures, and so on); the scheme is therefore suitable for various shooting scenes and imposes no restrictions on how the test paper image is captured. In addition, the scheme can perform effective character recognition directly on the original test paper without any auxiliary means, and imposes no restrictions on the type of test paper, the question types, and so on. The character recognition of the scheme is therefore subject to few restrictions (for example, none on shooting scene, test paper type, or question type), and the accuracy and reliability of character recognition are improved.
In addition, the embodiment of the invention can also carry out affine transformation on the image, can correct the images with various oblique angles, is convenient for character recognition, and improves the accuracy and the efficiency of the character recognition.
In addition, the embodiment of the invention can also realize character image cutting by adopting a projection mode, and can improve the accuracy, reliability, robustness and efficiency of character image cutting.
In addition, the embodiment of the invention can also perform character recognition on the character image based on the deep learning network, and can further improve the accuracy of the character recognition.
In order to better implement the above method, an embodiment of the present invention further provides an image recognition apparatus, where the image recognition apparatus may be specifically integrated in a network device, such as a terminal or a server, and the terminal may include a device, such as a mobile phone, a tablet computer, a notebook computer, or a PC.
For example, as shown in fig. 3a, the image recognition apparatus may include a sample acquisition unit 301, a relationship acquisition unit 302, a training unit 303, an anchor point recognition unit 304, a region extraction unit 305, and a character recognition unit 306, as follows:
the sample acquisition unit 301 is configured to acquire an annotated sample test paper image, where the sample test paper image includes an annotated sample answer area and a sample positioning point of the sample test paper;
a relation obtaining unit 302, configured to obtain a position relation between the sample answer area and the sample positioning point;
the training unit 303 is configured to train the positioning point identification network model according to the sample test paper image to obtain a trained positioning point identification network model;
the positioning point identification unit 304 is used for acquiring the test paper image to be identified and identifying the positioning point position of the test paper by adopting the trained positioning point identification network model;
an area extracting unit 305, configured to extract an answer area image from the test paper image according to the position of the positioning point and the position relationship;
and the character recognition unit 306 is configured to perform character recognition on the answer area image to obtain a recognition result.
In an embodiment, referring to fig. 3b, the training unit 303 may include:
a position obtaining subunit 3031, configured to obtain a real position value of the sample positioning point;
a predicted value obtaining subunit 3032, configured to obtain a predicted position value of the sample positioning point based on the sample image and the positioning point identification network model;
a convergence subunit 3033, configured to converge the predicted position value and the actual position value of the sample anchor point by using a preset loss function, so as to obtain a trained anchor point identification network model.
In an embodiment, the region extracting unit 305 may specifically be configured to:
determining the position of the answer area according to the position of the positioning point and the position relation;
and extracting an answer area image from the test paper image according to the answer area position.
In an embodiment, referring to fig. 3c, the region extracting unit 305 may include:
the affine transformation subunit 3051 is configured to perform affine transformation on the test paper image according to the location point position to obtain an affine-transformed image and an affine-transformed location point position;
and the region extraction subunit 3052 is configured to extract, according to the position of the location point after the affine transformation and the positional relationship, an image of the answer region from the image after the affine transformation.
The affine transformation subunit 3051 may be specifically configured to:
acquiring a new positioning point position;
acquiring an affine transformation matrix according to the positioning point position and the new positioning point position;
and carrying out affine transformation processing on the pixel position of the image according to the affine transformation matrix.
In one embodiment, referring to fig. 3d, the character recognition unit 306 includes:
a cutting subunit 3061, configured to cut out a character image from the answer area image in a projection manner;
and the character recognition subunit 3062 is configured to perform character recognition on the character image by using the trained character recognition network model to obtain a recognition result.
In an embodiment, the cutting subunit 3061 may be specifically configured to:
carrying out horizontal projection on the area image to obtain a horizontal projection result;
cutting the area image according to the horizontal projection result to obtain a plurality of rows of subarea images;
performing vertical projection on the sub-region image to obtain a vertical projection result;
and cutting the sub-region image according to the vertical projection result to obtain a character image.
In one embodiment, the cutting subunit 3061 is used to:
carrying out horizontal projection on the area image to obtain a horizontal projection result;
cutting the area image according to the horizontal projection result to obtain a plurality of rows of subarea images;
filtering the plurality of rows of subarea images according to preset image filtering conditions to obtain filtered subarea images;
vertically projecting the filtered subregion image;
cutting the filtered subregion image according to the vertical projection result to obtain a character image.
In one embodiment, the positioning point identification network model comprises one fully connected layer and at least five convolutional layers; the positioning point identification unit 304 is configured to acquire a test paper image to be identified, sequentially perform convolution processing on the test paper image in the at least five convolutional layers to obtain a convolution processing result, and perform a full-connection operation on the convolution processing result in the fully connected layer to obtain the positions of the positioning points.
In one embodiment, the character recognition network model includes: a plurality of convolutional layers and two full-link layers; the character recognition subunit 3062, may be specifically configured to:
performing convolution processing on the character images in sequence on the plurality of convolution layers to obtain convolution processing results; and carrying out character classification processing on the convolution processing results in the two full connection layers in sequence.
In specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily, and implemented as the same or several entities, and specific implementations of the above units may refer to the foregoing method embodiment, which is not described herein again.
As can be seen from the above, the image recognition apparatus of this embodiment acquires the labeled sample test paper image through the sample acquisition unit 301, where the sample test paper image includes the labeled sample answer area and the sample positioning point of the sample test paper; the relation obtaining unit 302 obtains the position relation between the sample answer area and the sample positioning point; training the positioning point identification network model by a training unit 303 according to the sample test paper image to obtain a trained positioning point identification network model; the positioning point identification unit 304 collects the test paper image to be identified, and identifies the positioning point position of the test paper by adopting the trained positioning point identification network model; extracting, by the region extracting unit 305, an answer region image from the test paper image according to the position of the positioning point and the position relationship; the character recognition unit 306 performs character recognition on the answer area image to obtain a recognition result. Because the scheme can identify the position of the positioning point of the test paper from the test paper image through the positioning point identification network model based on deep learning, the position of the positioning point of the test paper image obtained under various image shooting scenes (such as background, light, angle, texture and the like) can be effectively and accurately identified, and the scheme is suitable for various shooting scenes and has no any limitation requirement on the shooting scenes of the test paper image; in addition, the scheme can also directly carry out effective character recognition on the original test paper without any assistance, so that no limitation requirements are imposed on the type, question type and the like of the test paper; therefore, the character recognition limitation of the scheme is small (for example, no limitation is provided on shooting scenes, test paper types, question types and the like), and the accuracy and the reliability of character recognition are improved.
The embodiment of the invention also adopts a projection mode to realize character image cutting, and can improve the accuracy, reliability, robustness and efficiency of character image cutting.
The embodiment of the invention also provides network equipment, which can be equipment such as a server or a terminal. Fig. 4 is a schematic diagram illustrating a network device according to an embodiment of the present invention, specifically:
the network device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the network device architecture shown in fig. 4 does not constitute a limitation of network devices and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the network device, connects various parts of the entire network device using various interfaces and lines, and performs various functions of the network device and processes data by operating or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the network device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The network device further includes a power supply 403 for supplying power to each component, and preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions of managing charging, discharging, and power consumption are implemented through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The network device may also include an input unit 404, where the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the network device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the network device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
acquiring a marked sample test paper image, wherein the sample test paper image comprises a marked sample answer area and a sample positioning point of the sample test paper; obtaining the position relation between the sample answer area and the sample positioning point; training a positioning point identification network model according to the sample test image to obtain a trained positioning point identification network model; collecting a test paper image to be identified, and identifying the position of the positioning point of the test paper by adopting the trained positioning point identification network model; extracting an answer area image from the test paper image according to the position of the positioning point and the position relation; and performing character recognition on the answer area image to obtain a recognition result.
For example, a true position value of the sample positioning point may be specifically obtained; obtaining a predicted position value of the sample positioning point based on the sample image and the positioning point identification network model; and adopting a preset loss function to converge the predicted position value and the real position value of the sample positioning point to obtain a trained positioning point identification network model.
For another example, affine transformation is performed on the test paper image according to the positioning point positions to obtain an affine-transformed image and positioning point positions after affine transformation; and extracting an image of the answer area from the affine-transformed image according to the position of the positioning point after the affine transformation and the position relation.
For another example, a character image is cut out from the answer area image in a projection mode; and performing character recognition on the character image by adopting the trained character recognition network model to obtain a recognition result.
The structures of the positioning point recognition network model and the character recognition network model may specifically refer to the foregoing embodiments, and are not described herein again.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
As can be seen from the above, the network device of this embodiment may acquire a sample test paper image after being labeled, where the sample test paper image includes a labeled sample answer area and a sample positioning point of the sample test paper; obtaining the position relation between the sample answering area and the sample positioning point; training the positioning point identification network model according to the sample test paper image to obtain a trained positioning point identification network model; collecting a test paper image to be identified, and identifying the position of the positioning point of the test paper by adopting the trained positioning point identification network model; extracting an answer area image from the test paper image according to the position of the positioning point and the position relation; performing character recognition on the answer area image to obtain a recognition result; because the scheme can identify the position of the positioning point of the test paper from the test paper image through the positioning point identification network model based on deep learning, the position of the positioning point of the test paper image obtained under various image shooting scenes (such as background, light, angle, texture and the like) can be effectively and accurately identified, and the scheme is suitable for various shooting scenes and has no limit requirement on the shooting scenes of the test paper image; in addition, the scheme can also directly carry out effective character recognition on the original test paper without any assistance, so that no limitation requirements are imposed on the type, question type and the like of the test paper; therefore, the character recognition limitation of the scheme is small (for example, the shooting scene, the test paper type, the question type and the like are not limited), and the accuracy and the reliability of character recognition are improved.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention provide a storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any one of the image recognition methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
acquiring a marked sample test paper image, wherein the sample test paper image comprises a marked sample answer area and a sample positioning point of the sample test paper; obtaining the position relation between the sample answering area and the sample positioning point; training the positioning point identification network model according to the sample test paper image to obtain a trained positioning point identification network model; collecting a test paper image to be identified, and identifying the position of the positioning point of the test paper by adopting the trained positioning point identification network model; extracting an answer area image from the test paper image according to the position of the positioning point and the position relation; and performing character recognition on the answer area image to obtain a recognition result.
For example, a true position value of the sample positioning point may be specifically obtained; acquiring a predicted position value of the sample positioning point based on the sample image and the positioning point identification network model; and adopting a preset loss function to converge the predicted position value and the real position value of the sample positioning point to obtain a trained positioning point identification network model.
For another example, affine transformation is performed on the test paper image according to the positioning point position to obtain an affine-transformed image and an affine-transformed positioning point position; and extracting an image of the answer area from the affine-transformed image according to the position of the positioning point after the affine transformation and the position relation.
For another example, a character image is cut out from the answer area image in a projection mode; and performing character recognition on the character image by adopting the trained character recognition network model to obtain a recognition result.
The structures of the anchor point identification network model and the character identification network model may specifically refer to the foregoing embodiments, and are not described herein again.
Wherein the storage medium may include: read Only Memory (ROM), random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any image recognition method provided by the embodiment of the present invention, the beneficial effects that can be achieved by any image recognition method provided by the embodiment of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The foregoing detailed description has provided a method, an apparatus, and a storage medium for image recognition according to embodiments of the present invention, and the present disclosure has been made in detail by applying specific examples to explain the principles and embodiments of the present invention, and the description of the foregoing embodiments is only used to help understanding the method and the core concept of the present invention; meanwhile, for those skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed, and in summary, the content of the present specification should not be construed as limiting the present invention.

Claims (13)

1. An image recognition method, comprising:
acquiring a labeled sample test paper image, wherein the sample test paper image comprises a labeled sample answer area and a sample positioning point of the sample test paper;
obtaining the position relation between the sample answer area and the sample positioning point;
acquiring a real position value of the sample positioning point;
acquiring a predicted position value of the sample positioning point based on the sample image and the positioning point identification network model;
adopting a preset loss function to converge the predicted position value and the real position value of the sample positioning point to obtain a trained positioning point identification network model;
acquiring a test paper image to be identified, and identifying the position of the positioning point of the test paper by adopting the trained positioning point identification network model;
extracting an answer area image from the test paper image according to the position of the positioning point and the position relation;
and performing character recognition on the answer area image to obtain a recognition result.
2. The image recognition method of claim 1, wherein extracting an image of an answer area from the test paper image according to the position of the positioning point and the positional relationship comprises:
determining the position of the answer area according to the position of the positioning point and the position relation;
and extracting an answer area image from the test paper image according to the answer area position.
3. The image recognition method according to claim 1, wherein extracting an image of an answer area from the test paper image based on the position of the positioning point and the positional relationship comprises:
carrying out affine transformation on the test paper image according to the positioning point position to obtain an affine-transformed image and a positioning point position after the affine transformation;
and extracting an image of the answer area from the affine-transformed image according to the position of the positioning point after the affine transformation and the position relation.
4. The image recognition method of claim 1, wherein performing character recognition on the answer area image to obtain a recognition result comprises:
cutting out character images from the answer area images in a projection mode;
and performing character recognition on the character image by adopting the trained character recognition network model to obtain a recognition result.
5. The image recognition method of claim 4, wherein cutting out character images from the answer area image by projection comprises:
carrying out horizontal projection on the area image to obtain a horizontal projection result;
cutting the area image according to the horizontal projection result to obtain a plurality of rows of sub-area images;
performing vertical projection on the sub-region image to obtain a vertical projection result;
and cutting the sub-region image according to the vertical projection result to obtain a character image.
6. The image recognition method of claim 5, wherein prior to vertically projecting the subregion image, the method further comprises:
filtering the plurality of rows of subarea images according to preset image filtering conditions to obtain filtered subarea images;
vertically projecting the sub-region image, comprising: vertically projecting the filtered subregion image;
cutting the sub-region image according to the vertical projection result, comprising: and cutting the filtered sub-region image according to the vertical projection result.
7. The image recognition method of claim 1, wherein the anchor point recognition network model comprises one fully connected layer and at least five convolutional layers;
adopting the trained locating point identification network model to identify the position of the locating point of the test paper, comprising the following steps:
sequentially carrying out convolution processing on the test paper images on at least five convolution layers to obtain convolution processing results;
and performing full-connection operation on the convolution processing result in the full-connection layer to obtain the position of the positioning point.
8. The image recognition method of claim 4, wherein the character recognition network model comprises: a plurality of convolution layers and two full link layers;
and performing character recognition on the character image by adopting the trained character recognition network model, wherein the character recognition method comprises the following steps:
performing convolution processing on the character images in sequence on the plurality of convolution layers to obtain convolution processing results;
and carrying out character classification processing on the convolution processing results in sequence on the two full-connection layers.
9. The image recognition method of claim 1, wherein performing affine transformation on the test paper image according to the anchor point positions comprises:
acquiring a new positioning point position;
acquiring an affine transformation matrix according to the positioning point position and the new positioning point position;
and carrying out affine transformation processing on the pixel position of the image according to the affine transformation matrix.
10. An image recognition apparatus, characterized by comprising:
the system comprises a sample acquisition unit, a data processing unit and a data processing unit, wherein the sample acquisition unit is used for acquiring a sample test paper image after marking, and the sample test paper image comprises a marked sample answer area and a sample positioning point of the sample test paper;
the relation obtaining unit is used for obtaining the position relation between the sample answering area and the sample positioning point;
the training unit is used for training the positioning point identification network model according to the sample test paper image to obtain the trained positioning point identification network model;
the positioning point identification unit is used for acquiring the test paper image to be identified and identifying the positioning point position of the test paper by adopting the trained positioning point identification network model;
the area extraction unit is used for extracting an answer area image from the test paper image according to the position of the positioning point and the position relation;
the character recognition unit is used for carrying out character recognition on the answer area image to obtain a recognition result;
the training unit comprises:
the position acquisition subunit is used for acquiring a real position value of the sample positioning point;
a predicted value obtaining subunit, configured to obtain a predicted position value of the sample positioning point based on the sample image and a positioning point identification network model;
and the convergence subunit is used for adopting a preset loss function to converge the predicted position value and the real position value of the sample positioning point to obtain a trained positioning point identification network model.
11. The image recognition apparatus according to claim 10, wherein the region extraction unit includes:
the affine transformation subunit is used for carrying out affine transformation on the test paper image according to the positioning point position to obtain an affine-transformed image and an affine-transformed positioning point position;
and the region extraction subunit is used for extracting an answer region image from the affine-transformed image according to the position of the positioning point after the affine transformation and the position relation.
12. The image recognition apparatus according to claim 10, wherein the character recognition unit includes:
the cutting subunit is used for cutting out character images from the answer area images in a projection mode;
and the character recognition subunit is used for performing character recognition on the character image by adopting the trained character recognition network model to obtain a recognition result.
13. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the image recognition method according to any one of claims 1 to 9.
CN201811037416.5A 2018-09-06 2018-09-06 Image recognition method, device and storage medium Active CN110163211B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811037416.5A CN110163211B (en) 2018-09-06 2018-09-06 Image recognition method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811037416.5A CN110163211B (en) 2018-09-06 2018-09-06 Image recognition method, device and storage medium

Publications (2)

Publication Number Publication Date
CN110163211A CN110163211A (en) 2019-08-23
CN110163211B true CN110163211B (en) 2023-02-28

Family

ID=67645115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811037416.5A Active CN110163211B (en) 2018-09-06 2018-09-06 Image recognition method, device and storage medium

Country Status (1)

Country Link
CN (1) CN110163211B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705534B (en) * 2019-09-17 2022-06-14 浙江工业大学 Wrong problem book generation method suitable for electronic typoscope
CN111104846B (en) * 2019-10-16 2022-08-30 平安科技(深圳)有限公司 Data detection method and device, computer equipment and storage medium
CN113255641A (en) * 2020-12-31 2021-08-13 深圳怡化电脑股份有限公司 Image identification method and device, electronic equipment and storage medium
CN112949621A (en) * 2021-03-16 2021-06-11 新东方教育科技集团有限公司 Method and device for marking test paper answering area, storage medium and electronic equipment
CN113628196A (en) * 2021-08-16 2021-11-09 广东艾檬电子科技有限公司 Image content extraction method, device, terminal and storage medium
CN113657354B (en) * 2021-10-19 2022-01-25 深圳市菁优智慧教育股份有限公司 Answer sheet identification method and system based on deep learning
CN117765560A (en) * 2024-01-08 2024-03-26 北京和气聚力教育科技有限公司 Answer sheet identification method, system, computer equipment and readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN202939608U (en) * 2012-03-07 2013-05-15 爱意福瑞(北京)科技有限公司 Test paper inspecting system
CN104268603B (en) * 2014-09-16 2017-04-12 科大讯飞股份有限公司 Intelligent marking method and system for text objective questions
CN108171297B (en) * 2018-01-24 2022-01-14 谢德刚 Answer sheet identification method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103310082A (en) * 2012-03-07 2013-09-18 爱意福瑞(北京)科技有限公司 Paper inspection method and device
CN103778816A (en) * 2014-01-27 2014-05-07 上海五和文化传播有限公司 Test paper answer extracting system
CN104794948A (en) * 2015-04-20 2015-07-22 西安青柠电子信息技术有限公司 Automatic scoring system and method for applying same
CN104820835A (en) * 2015-04-29 2015-08-05 岭南师范学院 Automatic examination paper marking method for examination papers
CN107133616A (en) * 2017-04-02 2017-09-05 南京汇川图像视觉技术有限公司 A kind of non-division character locating and recognition methods based on deep learning
CN107729865A (en) * 2017-10-31 2018-02-23 中国科学技术大学 A kind of handwritten form mathematical formulae identified off-line method and system
CN108388895A (en) * 2018-03-04 2018-08-10 南京理工大学 A kind of paper answering card automatic processing method based on machine learning
CN108388892A (en) * 2018-05-04 2018-08-10 苏州大学 Paper automated processing system based on OpenCV and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ICDAR 2013 Chinese Handwriting Recognition Competition;Fei Yin et al;《2013 12th International Conference on Document Analysis and Recognition》;20131015;1464-1470 *
Research and Design of a Machine-Vision-Based Electronic Homework Correction System (基于机器视觉的电子作业批改***的研究与设计); Zhu Ran (朱然); China Master's Theses Full-text Database (Electronic Journal); 20180215; Vol. 2018, No. 02; full text *

Also Published As

Publication number Publication date
CN110163211A (en) 2019-08-23

Similar Documents

Publication Publication Date Title
CN110163211B (en) Image recognition method, device and storage medium
CN111709409B (en) Face living body detection method, device, equipment and medium
EP3916627A1 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN108334848B (en) Tiny face recognition method based on generation countermeasure network
JP7386545B2 (en) Method for identifying objects in images and mobile device for implementing the method
CN110363116B (en) Irregular human face correction method, system and medium based on GLD-GAN
CN111770299B (en) Method and system for real-time face abstract service of intelligent video conference terminal
US20230054515A1 (en) Image gaze correction method, apparatus, electronic device, computer-readable storage medium, and computer program product
CN111814860A (en) Multi-target detection method for garbage classification
WO2021238548A1 (en) Region recognition method, apparatus and device, and readable storage medium
CN113298158B (en) Data detection method, device, equipment and storage medium
US20230326173A1 (en) Image processing method and apparatus, and computer-readable storage medium
CN113297956B (en) Gesture recognition method and system based on vision
CN110807362A (en) Image detection method and device and computer readable storage medium
CN109815823B (en) Data processing method and related product
CN111428664A (en) Real-time multi-person posture estimation method based on artificial intelligence deep learning technology for computer vision
CN113850136A (en) Yolov5 and BCNN-based vehicle orientation identification method and system
CN114120163A (en) Video frame processing method and device, and related equipment and storage medium thereof
CN113569594A (en) Method and device for labeling key points of human face
CN112149517A (en) Face attendance checking method and system, computer equipment and storage medium
CN111738264A (en) Intelligent acquisition method for data of display panel of machine room equipment
CN115953744A (en) Vehicle identification tracking method based on deep learning
CN116434253A (en) Image processing method, device, equipment, storage medium and product
CN116546304A (en) Parameter configuration method, device, equipment, storage medium and product
CN113850238A (en) Document detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant