CN116391189A - Object sequence identification method, network training method, device, equipment and medium - Google Patents

Object sequence identification method, network training method, device, equipment and medium

Info

Publication number
CN116391189A
CN116391189A (application number CN202180002790.5A)
Authority
CN
China
Prior art keywords
sample
sequence
image
sample image
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180002790.5A
Other languages
Chinese (zh)
Inventor
陈景焕
马佳彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sensetime International Pte Ltd
Original Assignee
Sensetime International Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sensetime International Pte Ltd filed Critical Sensetime International Pte Ltd
Publication of CN116391189A
Legal status: Pending

Classifications

    • G PHYSICS
    • G07 CHECKING-DEVICES
    • G07F COIN-FREED OR LIKE APPARATUS
    • G07F 17/00 Coin-freed apparatus for hiring articles; Coin-freed facilities or services
    • G07F 17/32 Coin-freed apparatus for hiring articles; Coin-freed facilities or services for games, toys, sports, or amusements
    • G07F 17/3202 Hardware aspects of a gaming system, e.g. components, construction, architecture thereof
    • G07F 17/3216 Construction aspects of a gaming system, e.g. housing, seats, ergonomic aspects
    • G07F 17/322 Casino tables, e.g. tables having integrated screens, chip detection means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G07 CHECKING-DEVICES
    • G07F COIN-FREED OR LIKE APPARATUS
    • G07F 17/00 Coin-freed apparatus for hiring articles; Coin-freed facilities or services
    • G07F 17/32 Coin-freed apparatus for hiring articles; Coin-freed facilities or services for games, toys, sports, or amusements
    • G07F 17/3225 Data transfer within a gaming system, e.g. data sent between gaming machines and users
    • G07F 17/3232 Data transfer within a gaming system, e.g. data sent between gaming machines and users wherein the operator is informed
    • G PHYSICS
    • G07 CHECKING-DEVICES
    • G07F COIN-FREED OR LIKE APPARATUS
    • G07F 17/00 Coin-freed apparatus for hiring articles; Coin-freed facilities or services
    • G07F 17/32 Coin-freed apparatus for hiring articles; Coin-freed facilities or services for games, toys, sports, or amusements
    • G07F 17/3244 Payment aspects of a gaming system, e.g. payment schemes, setting payout ratio, bonus or consolation prizes
    • G07F 17/3248 Payment aspects of a gaming system, e.g. payment schemes, setting payout ratio, bonus or consolation prizes involving non-monetary media of fixed value, e.g. casino chips of fixed value

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Provided are an object sequence identification method, a network training method, an apparatus, a device and a medium. The method includes: acquiring a first image of an object sequence; inputting the first image into an object sequence recognition network for feature extraction to obtain a feature sequence, wherein the supervision information of the object sequence recognition network during training includes at least first supervision information on the class of the sample object sequence in each sample image of a sample image group and second supervision information on the similarity between at least two frames of sample images, and the at least two frames of sample images in each sample image group include an original sample image and at least one frame of image obtained by performing image transformation on the original sample image; and determining, based on the feature sequence, the class of each object in the object sequence.

Description

Object sequence identification method, network training method, device, equipment and medium
Cross Reference to Related Applications
The present application claims priority to Singapore patent application No. 10202110489U, filed with the Intellectual Property Office of Singapore in 2021, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to the technical field of image processing, and in particular, but not exclusively, to an object sequence identification method, a network training method, an apparatus, a device and a medium.
Background
Sequence recognition in images is an important research problem in computer vision. Sequence recognition algorithms are widely applied in scenarios such as scene text recognition and license plate recognition. In the related art, a neural network is used to recognize images of object sequences, and the neural network can be trained using the classes of the objects in the object sequence as supervision information.
In some scenarios, the object sequence is long and the accuracy required for recognizing the objects is high, and sequence recognition methods in the related art have difficulty achieving a recognition effect that meets these requirements.
Disclosure of Invention
The embodiment of the application provides a technical scheme for identifying an object sequence.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a method for identifying an object sequence, which comprises the following steps:
acquiring a first image of an object sequence;
inputting the first image into a recognition network of an object sequence to perform feature extraction to obtain a feature sequence; wherein the supervision information of the object sequence recognition network during training includes at least: first supervision information on the class of the sample object sequence in each sample image of a sample image group and second supervision information on the similarity between at least two frames of sample images; the at least two frames of sample images in each sample image group include an original sample image and at least one frame of image obtained by performing image transformation on the original sample image;
Based on the feature sequence, a class of each object in the sequence of objects is determined.
In some embodiments, the feature extraction performed on the first image input to the recognition network of the object sequence to obtain a feature sequence includes: performing feature extraction on the first image by adopting a convolution sub-network in the identification network of the object sequence to obtain a feature map; and splitting the feature map to obtain the feature sequence. Therefore, the feature map is split according to the dimension information, so that the feature sequence with more height direction features reserved can be obtained, and the object sequence category in the feature sequence can be accurately identified.
In some embodiments, the feature extraction of the first image by using a convolution sub-network in the recognition network of the object sequence to obtain a feature map includes: downsampling the first image in the length dimension of a first direction of the first image by using the convolution sub-network to obtain a first dimension feature, wherein the first direction is different from the arrangement direction of the objects in the object sequence; extracting features in the length dimension of the first image in a second direction based on the length of the first image in the second direction to obtain a second dimension feature; and obtaining the feature map based on the first dimension feature and the second dimension feature. In this way, the feature information of the first image in the second direction dimension can be retained as much as possible.
In some embodiments, the splitting the feature map to obtain the feature sequence includes: pooling the feature map along the first direction to obtain a pooled feature map; and splitting the pooled feature map along the second direction to obtain the feature sequence. In this way, the feature map is pooled along the first direction and split along the second direction, so that the feature sequence can include more detailed information of the first image along the second direction.
In some embodiments, the determining a category for each object in the sequence of objects based on the feature sequence includes: predicting the class corresponding to each feature in the feature sequence by using a classifier of the recognition network of the object sequence; and determining the class of each object in the object sequence according to the prediction result of the class corresponding to each feature in the feature sequence. In this way, a fixed-length feature sequence is converted into class sequences of indefinite length.
The embodiment of the application provides a method for training an object sequence recognition network, which includes the following steps: acquiring a sample image group; wherein at least two frames of sample images in the sample image group include an original sample image and at least one frame of image obtained by performing image transformation on the original sample image, and the original sample image includes a sample object sequence;
inputting the sample images in the sample image group into a recognition network of an object sequence to be trained for feature extraction to obtain a sample feature sequence of each sample image in the sample image group; predicting the class of the sample object sequence in each sample image based on the sample feature sequence of each sample image; determining, based on the sample feature sequence of each sample image in the sample image group, a first loss for supervising the class of the sample object sequence in each sample image, so as to obtain a first loss set; determining, based on the sample feature sequences of at least two frames of sample images in the sample image group, a second loss for supervising the similarity between the at least two frames of sample images; and adjusting network parameters of the recognition network of the object sequence to be trained by using the first loss set and the second loss, so that the loss of the classification result output by the adjusted recognition network of the object sequence meets a convergence condition. In this way, the first loss set supervising the whole sequence and the second loss supervising the similarity between the images in a sample image group are both introduced during training, so that the consistency of feature extraction for similar images can be improved and the class prediction effect of the network is improved as a whole.
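By way of illustration only, the following minimal PyTorch-style sketch shows one possible training step over a sample image group containing an original image and one transformed image. The function names, the assumption that the network returns a feature sequence plus per-step log-probabilities, the use of an MSE distance as the similarity loss, and the unweighted combination of losses are assumptions made for this example and are not taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def train_step(network, optimizer, img_a, img_b, targets, target_lengths):
    """One training step on a two-image sample group (original + transformed image).
    network(img) is assumed to return (feature_sequence, per_step_log_probs)."""
    feat_a, logp_a = network(img_a)              # features and (T, N, C) log-probabilities
    feat_b, logp_b = network(img_b)

    input_lengths = torch.full((img_a.size(0),), logp_a.size(0), dtype=torch.long)
    first_losses = [                             # first loss set: one CTC loss per sample image
        F.ctc_loss(logp_a, targets, input_lengths, target_lengths),
        F.ctc_loss(logp_b, targets, input_lengths, target_lengths),
    ]
    second_loss = F.mse_loss(feat_a, feat_b)     # second loss: similarity supervision

    loss = sum(first_losses) + second_loss       # the weighted fusion is described below
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```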
In some embodiments, the acquiring the sample image group includes: acquiring a first sample image in which the class of the sample object in the picture is annotated; determining at least one second sample image based on the picture content of the first sample image; performing data enhancement on the at least one second sample image to obtain at least one third sample image; and obtaining the sample image group based on the first sample image and the at least one third sample image. In this way, paired images with similar picture content are created from each frame of first sample image, which facilitates subsequently improving the consistency of feature extraction for similar images.
In some embodiments, the inputting the sample images in the sample image group into the recognition network of the object sequence to be trained for feature extraction to obtain a sample feature sequence of each sample image in the sample image group includes: performing feature extraction on each sample image by using a convolution sub-network in the recognition network of the object sequence to be trained to obtain a sample feature map of each sample image; and splitting the sample feature map of each sample image to obtain the sample feature sequence of each sample image. In this way, a sample feature sequence that retains more second direction features can be obtained, which can improve the accuracy of training the network.
In some embodiments, the performing feature extraction on each sample image by using a convolution sub-network in the recognition network of the object sequence to be trained to obtain a sample feature map of each sample image includes: downsampling each sample image in the length dimension of a first direction of the sample image by using the convolution sub-network to obtain a first dimension sample feature, wherein the first direction is different from the arrangement direction of the sample objects in the sample object sequence; extracting features in the length dimension of each sample image in a second direction based on the length of the sample image in the second direction to obtain a second dimension sample feature; and obtaining the sample feature map of each sample image based on the first dimension sample feature and the second dimension sample feature. In this way, the feature information of each sample image in the second direction dimension can be retained as much as possible.
In some embodiments, the splitting the sample feature map of each sample image to obtain the sample feature sequence of each sample image includes: pooling the sample feature map along the first direction to obtain a pooled sample feature map; and splitting the pooled sample feature map along the second direction to obtain the sample feature sequence. In this way, after the sample feature map is pooled along the first direction dimension, it is split along the second direction dimension, so that the sample feature sequence can retain more detailed information of the sample image in the second direction dimension.
In some embodiments, the adjusting network parameters of the recognition network of the object sequence to be trained by using the first loss set and the second loss so that the loss of the classification result output by the adjusted recognition network of the object sequence meets a convergence condition includes: performing weighted fusion on the first loss set and the second loss to obtain a total loss; and adjusting, based on the total loss, network parameters of the recognition network of the object sequence to be trained, so that the loss of the classification result output by the adjusted recognition network meets the convergence condition. In this way, the first loss set and the second loss are combined to train the network, so that the trained network is more robust.
In some embodiments, the performing weighted fusion on the first loss set and the second loss to obtain a total loss includes: determining a category supervision weight corresponding to the sample image group based on the number of sample images in the sample image group; fusing the first losses in the first loss set of the sample image group based on the category supervision weight and a preset first weight to obtain a third loss; adjusting the second loss by using a preset second weight to obtain a fourth loss; and determining the total loss based on the third loss and the fourth loss. In this way, the recognition network of the object sequence to be trained is trained with the total loss obtained by fusing the third loss and the fourth loss, which can improve the consistency of feature extraction for similar images and thus the prediction effect of the whole network.
In some embodiments, the fusing the first loss in the first loss set of the sample image group based on the category supervision weight and a preset first weight to obtain a third loss includes: assigning the category supervision weight to each first loss in the first loss set respectively, so as to obtain an updated loss set comprising at least two updated losses; fusing the updated losses in the updated loss set to obtain fusion losses; and adjusting the fusion loss by adopting the preset first weight to obtain the third loss. In this way, in the training process, the CTC loss of the prediction result of each sample image in the set of sample images is fused, so that the performance of the recognition network obtained by training can be improved.
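As an illustrative sketch of the weighted fusion just described, the following assumes that the category supervision weight is the reciprocal of the number of sample images in the group and uses hypothetical values for the preset first and second weights; these concrete choices are assumptions, not values given in the disclosure.

```python
import torch

def total_loss(first_losses, second_loss, first_weight=1.0, second_weight=0.1):
    """Fuse the first loss set and the second loss into a total loss.
    The reciprocal category supervision weight and both preset weights are assumptions."""
    class_weight = 1.0 / len(first_losses)       # category supervision weight from the group size
    updated = [class_weight * loss for loss in first_losses]   # updated loss set
    fused = torch.stack(updated).sum()           # fuse the updated losses
    third_loss = first_weight * fused            # adjust with the preset first weight
    fourth_loss = second_weight * second_loss    # adjust with the preset second weight
    return third_loss + fourth_loss
```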
An embodiment of the present application provides an apparatus for identifying an object sequence, including:
the first acquisition module is used for acquiring a first image of the object sequence;
the first extraction module is used for inputting the first image into a recognition network of an object sequence to perform feature extraction to obtain a feature sequence; wherein the supervision information of the object sequence recognition network during training includes at least: first supervision information on the class of the sample object sequence in each sample image of a sample image group and second supervision information on the similarity between at least two frames of sample images; the at least two frames of sample images in each sample image group include an original sample image and at least one frame of image obtained by performing image transformation on the original sample image;
And the first determining module is used for determining the category of each object in the object sequence based on the characteristic sequence.
In some embodiments, the first extraction module comprises:
the first extraction sub-module is used for extracting the characteristics of the first image by adopting a convolution sub-network in the identification network of the object sequence to obtain a characteristic diagram;
and the first splitting sub-module is used for splitting the feature map to obtain the feature sequence.
In some embodiments, the first extraction sub-module comprises:
the first downsampling unit is used for downsampling the first image in the length dimension of the first direction of the first image by adopting the convolution sub-network to obtain a first dimension characteristic, wherein the first direction is different from the arrangement direction of the objects in the object sequence;
the first extraction unit is used for extracting the characteristics in the length dimension of the first image in the second direction based on the length of the first image in the second direction to obtain second dimension characteristics;
and the first determining unit is used for obtaining the characteristic map based on the first dimension characteristic and the second dimension characteristic.
In some embodiments, the first splitting sub-module comprises:
The first pooling unit is used for pooling the feature images along the first direction to obtain pooled feature images;
the first splitting unit is used for splitting the pooled feature map along the second direction to obtain the feature sequence.
In some embodiments, the first determining module includes:
the first prediction submodule is used for predicting the category corresponding to each feature in the feature sequence by adopting a classifier of the recognition network of the object sequence;
and the first determining submodule is used for determining the category of each object in the object sequence according to the prediction result of the category corresponding to each feature in the feature sequence.
The embodiment of the application provides an object sequence recognition network training device, which comprises:
the second acquisition module is used for acquiring a sample image group; at least two frames of sample images in the sample image group include an original sample image and at least one frame of image obtained by performing image transformation on the original sample image, and the original sample image includes a sample object sequence;
the second extraction module is used for inputting the sample images in the sample image group into an identification network of an object sequence to be trained to perform feature extraction, so as to obtain a sample feature sequence of each sample image in the sample image group;
A first prediction module, configured to predict a class of a sample object sequence in each sample image based on the sample feature sequence of each sample image;
the second determining module is used for determining a first loss for supervising the class of the sample object sequence in each sample image based on the sample sequence characteristics of each sample image in the sample image group to obtain a first loss set;
a third determining module, configured to determine a second loss of supervision of similarity between at least two frames of sample images based on a sample feature sequence of the at least two frames of sample images in the sample image group;
and the first adjusting module is used for adjusting network parameters of the recognition network of the object sequence to be trained by adopting the first loss set and the second loss so that the loss of the classification result output by the recognition network of the adjusted object sequence meets the convergence condition.
In some embodiments, the second acquisition module includes:
the first acquisition sub-module is used for acquiring a first sample image for marking the category of the sample object in the picture;
a second determining sub-module for determining at least one second sample image based on the picture content of the first sample image;
the first enhancement sub-module is used for performing data enhancement on the at least one second sample image to obtain at least one third sample image;
a third determination sub-module for obtaining the set of sample images based on the first sample image and the at least one third sample image.
In some embodiments, the second extraction module comprises:
the second extraction sub-module is used for carrying out feature extraction on each sample image by adopting a convolution sub-network in the recognition network of the object sequence to be trained to obtain a sample feature map of each sample image;
and the second splitting sub-module is used for splitting the sample feature map of each sample image to obtain the sample feature sequence of each sample image.
In some embodiments, the second extraction sub-module comprises:
the second downsampling unit is used for downsampling each sample image in the length dimension of the first direction of the sample image by adopting the convolution sub-network to obtain a first-dimension sample characteristic, wherein the first direction is different from the arrangement direction of sample objects in the sample object sequence;
The second extraction unit is used for extracting the characteristics in the length dimension of the second direction of each sample image based on the length of the second direction of each sample image to obtain second-dimension sample characteristics;
and the second determining unit is used for obtaining the sample characteristic diagram of each sample image based on the first dimension sample characteristic and the second dimension sample characteristic.
In some embodiments, the second splitting sub-module comprises:
the second pooling unit is used for pooling the sample feature images along the first direction to obtain pooled sample feature images;
and the second splitting unit is used for splitting the pooled sample feature images along the second direction to obtain the sample feature sequences.
In some embodiments, the first adjustment module includes:
the first fusion submodule is used for carrying out weighted fusion on the first loss set and the second loss to obtain total loss;
and the first adjustment sub-module is used for adjusting the network parameters of the object recognition network to be trained based on the total loss so that the loss of the classification result output by the object recognition network after adjustment meets the convergence condition.
In some embodiments, the first fusion sub-module comprises:
a third determining unit, configured to determine a category supervision weight corresponding to the sample image group based on the number of sample images in the sample image group;
the first fusion unit is used for fusing the first loss in the first loss set of the sample image group based on the category supervision weight and a preset first weight to obtain a third loss;
the first adjusting unit is used for adjusting the second loss by adopting a preset second weight to obtain a fourth loss;
a fourth determining unit, configured to determine the total loss based on the third loss and the fourth loss.
in some embodiments, the first fusion unit comprises:
a first assigning subunit, configured to assign the category supervision weights to each first loss in the first loss set, respectively, to obtain an updated loss set including at least two updated losses;
the first fusion subunit is used for fusing the updated losses in the updated loss set to obtain fusion losses;
and the first adjusting subunit is used for adjusting the fusion loss by adopting the preset first weight to obtain the third loss.
Correspondingly, the embodiment of the application provides a computer storage medium having computer-executable instructions stored thereon, where the computer-executable instructions, when executed, can implement the above-described object sequence identification method, or can implement the above-described object sequence recognition network training method.
The embodiment of the application provides a computer device, which includes a memory and a processor, where the memory stores computer-executable instructions, and the processor, when executing the computer-executable instructions on the memory, can implement the above-described object sequence identification method or the above-described object sequence recognition network training method.
The embodiments of the application provide an object sequence identification method, a network training method, an apparatus, a device and a storage medium. First, a recognition network of an object sequence, which is supervised by the similarity among a group of sample images and by the classes of the sample objects in the group of sample images, is used to perform feature extraction on a first image to obtain a feature sequence; in this way, the consistency of feature extraction for multiple frames of similar first images can be improved. Then, class prediction is performed on the object sequence in the feature sequence, so that the classification result of the object sequence in the obtained feature sequence is more accurate. Finally, the classification result of the object sequence in the feature sequence is further processed to determine the class of the object sequence. Therefore, the consistency of the recognition network of the object sequence in feature extraction and recognition results for similar pictures is improved, the robustness is better, and the recognition precision of the object sequence can be improved.
Drawings
Fig. 1 is a schematic implementation flow chart of an identification method of an object sequence according to an embodiment of the present application;
FIG. 2A is a flowchart illustrating another implementation of the method for identifying an object sequence according to an embodiment of the present disclosure;
fig. 2B is a schematic implementation flow chart of an object sequence recognition network training method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an identification network of an object sequence according to an embodiment of the present application;
fig. 4 is an application scenario schematic diagram of an identification network of an object sequence provided in an embodiment of the present application;
FIG. 5A is a schematic structural diagram of an identification device for object sequences according to an embodiment of the present application;
FIG. 5B is a schematic structural diagram of an object sequence recognition network training device according to an embodiment of the present application;
fig. 6 is a schematic diagram of a composition structure of a computer device according to an embodiment of the present application.
Detailed Description
For the purposes, technical solutions and advantages of the embodiments of the present application to be more apparent, the specific technical solutions of the present application are described in further detail below with reference to the accompanying drawings in the embodiments of the present application. The following examples are illustrative of the present application, but are not intended to limit the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a specific ordering of the objects, it being understood that the "first", "second", "third" may be interchanged with a specific order or sequence, as permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
Before further describing embodiments of the present application in detail, the terms and expressions that are referred to in the embodiments of the present application are described, and are suitable for the following explanation.
1) Pair loss: in deep learning, many metric learning methods use pairs of samples to calculate a loss. For example, during training of the model, two samples are selected arbitrarily, features are extracted using the model, and the distance between the features of the two samples is calculated. If the two samples belong to the same class, it is desirable that the distance between them be as small as possible, even 0; if the two samples belong to different classes, it is desirable that the distance between them be as large as possible, even infinite. Based on this principle, many different types of pairwise feature losses are derived; the distances between pairs of samples are calculated using these losses, and the model is updated with various optimization methods based on the resulting loss values.
2) Connectionist Temporal Classification (CTC) is used for calculating a loss value; its main advantage is that it can automatically align data that are not aligned, so it is mainly used for training on serialized data that are not pre-aligned, such as speech recognition and Optical Character Recognition (OCR). In the embodiments of the application, a CTC loss can be used to supervise the overall prediction of the sequence in the early stage of training the network. A minimal sketch of both the pair loss and the CTC loss is given below.
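The following minimal PyTorch sketch illustrates the two terms named above: a pair loss that pulls the feature sequences of an original sample image and its transformed counterpart together, and a CTC loss over per-step class predictions. The tensor shapes, the MSE distance, and the label values are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical per-image feature sequences: (T, C) = (40 steps, 2048 channels).
feat_a = torch.randn(40, 2048)                   # original sample image
feat_b = feat_a + 0.01 * torch.randn(40, 2048)   # transformed (similar) image

# Second supervision: a pair loss pulls features of similar images together.
pair_loss = F.mse_loss(feat_a, feat_b)

# First supervision: a CTC loss over per-step class probabilities.
# log_probs: (T, N, n_classes + 1 for the blank); targets: label sequence per image.
log_probs = torch.randn(40, 1, 11).log_softmax(dim=-1)
targets = torch.tensor([[3, 3, 5, 7]])           # hypothetical object classes
input_lengths = torch.tensor([40])
target_lengths = torch.tensor([4])
ctc_loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)

total = ctc_loss + pair_loss                     # simple unweighted combination
```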
The following describes exemplary applications of the device for identifying an object sequence provided in the embodiments of the present application, where the device provided in the embodiments of the present application may be implemented as a notebook computer, a tablet computer, a desktop computer, a camera, a mobile device (e.g., a personal digital assistant, a dedicated messaging device, a portable game device) and other various types of user terminals with an image capturing function, and may also be implemented as a server. In the following, an exemplary application when the device is implemented as a terminal or a server will be described.
The method may be applied to a computer device, and the functions performed by the method may be performed by a processor in the computer device invoking program code, which may of course be stored in a computer storage medium, where it is seen that the computer device comprises at least a processor and a storage medium.
An embodiment of the present application provides a method for identifying an object sequence, as shown in fig. 1, and is described with reference to steps shown in fig. 1:
step S101, a first image of a sequence of objects is acquired.
In some embodiments, the object sequence may be a sequence formed by arranging arbitrary objects in a certain order, and the specific object type is not particularly limited. For example, if the first image is an image acquired at a gaming venue, the object sequence may be game tokens used in a game at the venue; alternatively, if the first image is an image acquired in a scene where a plurality of planks of different materials or colors are stacked, the object sequence may be a stack of planks stacked together.
The first image is at least one frame of image whose size information and pixel values meet certain conditions, that is, an image whose size has been adjusted and whose pixel values have been normalized.
In some possible implementations, by preprocessing the acquired second image as the first image that can be input into the recognition network of the object sequence, the above-mentioned step S101 can be implemented by the following steps S111 and S112 (not shown in the drawing):
step S111, a second image of the acquisition screen including the object sequence is acquired.
Here, the second image may be an image including appearance information of the object sequence, and the second image may be an image acquired by any acquisition device, or may be any frame of an image or video acquired from the internet or other devices. For example, the second image is a frame image of which the picture content acquired from the network comprises the object sequence; alternatively, the second image is a video segment of a picture content comprising a sequence of objects, etc.
Step S112, preprocessing the image parameters of the second image based on preset image parameters, to obtain the first image.
In some possible implementations, the preset image parameters include image width, image height, image pixel values, etc. First, the size information of the second image is adjusted according to a preset size to obtain an adjusted image, where the preset size includes a preset width and a preset ratio of height to width. For example, the widths of multiple frames of second images are uniformly adjusted to the preset width. Then, the pixel values of the adjusted image are normalized to obtain the first image. For a second image whose height is less than a preset height, pixel filling is performed in the image area that does not reach the preset height, for example, with gray pixel values. In this way, after the size information is adjusted, the height-to-width ratio of the obtained first image is uniform, and deformation of the image during processing can be reduced.
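A minimal sketch of such preprocessing is given below, assuming an OpenCV-style image array; the preset width, preset height and gray fill value are hypothetical examples rather than values taken from this description.

```python
import numpy as np
import cv2

PRESET_W, PRESET_H = 64, 640      # hypothetical preset width and height

def preprocess(second_image: np.ndarray) -> np.ndarray:
    """Resize a (H, W, 3) second image to the preset width, pad with gray, normalize."""
    h, w = second_image.shape[:2]
    new_h = min(int(round(h * PRESET_W / w)), PRESET_H)     # keep the height/width ratio
    resized = cv2.resize(second_image, (PRESET_W, new_h))   # cv2 takes (width, height)

    first_image = np.full((PRESET_H, PRESET_W, 3), 128, dtype=np.uint8)  # gray filling
    first_image[:new_h] = resized                           # copy the resized content

    return first_image.astype(np.float32) / 255.0           # normalize pixel values
```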
Step S102, inputting the first image into an identification network of an object sequence for feature extraction to obtain a feature sequence.
In some embodiments, the supervision information of the object sequence recognition network during training includes at least: first supervision information on the class of the sample object sequence in each sample image of a sample image group and second supervision information on the similarity between at least two frames of sample images; the at least two frames of sample images in each sample image group include an original sample image and at least one frame of image obtained by performing image transformation on the original sample image. The image transformation may include one or more of size scaling, rotation, brightness adjustment, etc. The image transformation does not significantly change the picture content of the image, and the picture content of the transformed image is approximately the same as that of the image before transformation. In this way, the robustness of the object sequence recognition network to variations in image size, rotation angle, brightness, etc. can be enhanced.
In other words, the recognition network of the object sequence is trained based on a second loss that supervises the similarity between at least two frames of sample images whose picture contents are related or similar, and a first loss set that supervises the class of the sample objects in each frame of sample image. During training, the recognition network of the object sequence is trained with multiple groups of sample images whose picture contents are related or similar.
Inputting the first image into an identification network of an object sequence, and adopting a convolutional neural network part in the identification network of the object sequence to extract the characteristics of the first image so as to obtain a characteristic diagram; and splitting the feature map according to a certain mode, so that the feature map extracted by the convolutional neural network is split into a plurality of feature sequences, and the object sequences in the first image can be classified conveniently. In some possible implementations, the feature map may be split according to any dimension of the feature map to obtain a feature sequence; thus, a plurality of features are included in the feature sequence of the first image of one frame. Each feature in the sequence of features may correspond to an object in the sequence of objects, or a plurality of features in the sequence of features may correspond to an object in the sequence of objects.
Step S103, determining the category of each object in the object sequence based on the feature sequence.
In some embodiments, a classifier in an identification network of the sequence of objects is employed to predict a class of features in a feature sequence of each sample image in a set of sample images, resulting in a predicted probability of the feature sequence of each sample image. The category of each feature can be obtained by carrying out category prediction on the features in the feature sequence, and the category of each feature is the category of the object corresponding to the feature; thus, the feature sequences belonging to the same class are feature sequences corresponding to the same object, and the class of the same class of features is the class of the object corresponding to the class of features, so that the class of each object in the object sequence can be obtained. For example, the total class number of the object sequence is n, and then a classifier with n class labels is adopted to predict the class of the feature in the feature sequence, so as to obtain the prediction probability that each feature in the feature sequence belongs to the n class labels. The prediction probability of the feature sequence is used as the classification result of the feature sequence.
In some embodiments, the classification result of the feature sequences is processed according to the post-processing rule of the CTC function (for example, a final sequence recognition result is generated according to the probability of the sequence output by the CTC), so as to obtain the class of the object corresponding to each feature sequence, so that the class of each object in the sequence of objects in the first image can be predicted. The length of the sequences belonging to the same class of objects can also be counted based on the class of the object corresponding to each feature sequence. The classification result of the feature sequence can represent the probability that the object corresponding to the feature sequence belongs to each classification label of the classifier; in a group of probabilities corresponding to a feature sequence, taking a class corresponding to a classification label with a probability value larger than a certain threshold value as a class of an object corresponding to the feature; in this way, the category of each object can be obtained.
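For illustration, the following sketch applies a simple greedy CTC-style post-processing to the per-step classification result: repeated labels are merged into one object and blanks are dropped, while the number of steps per object is counted. This is one possible realization of the post-processing rule mentioned above, written as an assumption; the thresholding variant described in the preceding paragraph is not reproduced here.

```python
import torch

def ctc_greedy_decode(probs: torch.Tensor, blank: int = 0):
    """probs: (T, n_classes + 1) per-step class probabilities for one image.
    Returns the recognized class sequence and the number of steps per object."""
    ids = probs.argmax(dim=-1).tolist()     # best label for each feature in the sequence
    classes, lengths, prev = [], [], blank
    for idx in ids:
        if idx != blank and idx != prev:    # a new object starts at this step
            classes.append(idx)
            lengths.append(1)
        elif idx != blank and idx == prev:  # the same object continues
            lengths[-1] += 1
        prev = idx
    return classes, lengths
```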
In the embodiment of the application, firstly, a recognition network of an object sequence, in which supervision information comprises supervision on similarity among a group of sample images and class supervision on sample objects of the group of sample images, is adopted to perform feature extraction on a first image to obtain a feature sequence; then, carrying out category prediction on each object in the object sequence, so that the classification result of the obtained object sequence is more accurate; finally, the classification result of the object sequence is further processed to determine the classes of the plurality of objects. Therefore, the network which simultaneously considers the image similarity and the object category is adopted to classify each object, so that the consistency of the feature extraction and the recognition result of the similar image in the first image can be improved, the robustness is improved, and the recognition accuracy of the object is improved.
In some embodiments, feature extraction of the first image is implemented using a convolutional network obtained after fine-tuning the structure of a Residual Network (ResNet), so as to obtain the feature sequence; that is, step S102 may be implemented through the steps shown in fig. 2A. Fig. 2A is a schematic flow chart of another implementation of the method for identifying an object sequence according to the embodiment of the present application, and the following description is made with reference to fig. 2A:
step S201, performing feature extraction on the first image by using a convolution sub-network in the recognition network of the object sequence, so as to obtain a feature map.
In some embodiments, the recognition network of the object sequence is trained based on a first loss that supervises the sample object sequence in each sample image as a whole and a second loss that supervises the similarity between sample images; the convolution sub-network part in the recognition network of the object sequence is used to perform feature extraction on the first image to obtain a feature map. The convolution sub-network in the recognition network of the object sequence can be obtained by fine-tuning the ResNet network structure.
In some possible implementations, in the recognition network of the object sequence, the feature extraction is performed on the first image by using a convolution network with the step size adjusted, so as to obtain a feature map with the height remaining unchanged and the width changing, that is, the step S201 may be implemented by the following steps S211 to 213 (not shown in the drawing):
Step S211, downsampling the first image in a length dimension of a first direction of the first image by using the convolution sub-network to obtain a first dimension feature.
In some possible implementations, the adjusted ResNet network structure is used as a convolutional network for feature extraction of the first image. The first direction is different from an arrangement direction of objects in the sequence of objects. For example, if the object sequence is a plurality of objects arranged or stacked in the height direction, i.e., the arrangement direction of the objects in the object sequence is the height direction, the first direction may be the width direction of the object sequence. If the object sequence is a plurality of objects arranged in a horizontal direction, i.e. the direction of arrangement of the objects in the object sequence is a horizontal direction, the first direction may be the height direction of the object sequence. For example, the step length in the first direction in the last step length (Stride) of the convolution layers of the third layer (layer 3) and the fourth layer (layer 4) in the ResNet network structure is kept constant at 2; in this way, it is achieved that downsampling is performed in the length dimension of the first direction of the first image, the length of the resulting feature map in the first direction being half the length of the first image in the first direction. Taking an object sequence as an example of a plurality of objects stacked in the height direction, keeping the width step length in the last step length (Stride) of the convolution layers of the third layer (layer 3) and the fourth layer (layer 4) in the ResNet network structure as 2 unchanged; in this way, it is achieved that downsampling in the width dimension of the first image is performed, the width of the resulting feature map being half the width of the first image.
Step S212, extracting features in a length dimension of the first image in the second direction based on the length of the first image in the second direction, to obtain second dimension features.
In some possible implementations, the second direction is the same as the arrangement direction of the objects in the sequence of objects, changing the step size of the second direction in the last step size of the convolution layers of the third layer and the fourth layer in the residual network structure from 2 to 1; in this way, it is achieved that no downsampling is performed in the length dimension of the second direction of the first image, i.e. the length of the second direction of the first image is maintained, and feature extraction is performed in the length dimension of the second direction of the first image, resulting in a second dimension feature that is the same as the length of the second direction of the first image.
In a specific example, taking the arrangement direction of the object sequence as the height direction as an example, changing the height step in the last step of the convolution layers of the third layer and the fourth layer in the residual network structure from 2 to 1; in this way, it is achieved that no downsampling is performed in the height dimension of the first image, i.e. the height of the first image is maintained, and feature extraction is performed in the height dimension of the first image, resulting in the same features as the height of the first image.
Step S213, obtaining the feature map based on the first dimension feature and the second dimension feature.
In some possible implementations, the first dimension features are combined with the second dimension features to form a feature map of the first image.
In the above steps S211 to 213, the first image is not downsampled in the length dimension in the second direction of the first image such that the dimension of the dimension feature in the second direction is the same as the dimension in the second direction of the first image, and the first image is downsampled in the first direction dimension different from the arrangement direction of the objects such that the length of the dimension feature in the first direction becomes half of the length of the first image in the first direction; in this way, the feature information of the first image in the arrangement direction dimension of the object sequence can be retained as much as possible. Under the condition that the arrangement direction of the object sequence is the height direction, changing the convolution layer with the last Stride of (2, 2) of the third layer and the fourth layer in the ResNet into the convolution layer with the Stride of (1, 2), so that the first image is not downsampled in the height dimension, the dimension of the height dimension characteristic is the same as the height of the first image, and the first image is downsampled in the width dimension, so that the width of the width dimension characteristic is half of the width of the first image; in this way, the feature information of the first image in the height dimension can be retained as much as possible.
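A minimal sketch of the stride adjustment described above is given below, assuming a torchvision ResNet-50 backbone (the exact backbone configuration is an assumption of this example). It keeps the width stride of 2 in the last strided convolutions of layer3 and layer4 while changing the height stride to 1, so the height resolution of the feature map is preserved.

```python
import torch
from torchvision.models import resnet50

# Assumed backbone: torchvision ResNet-50. Change the last strided convolutions of
# layer3 and layer4 from stride (2, 2) to (1, 2): height kept, width halved.
backbone = resnet50(weights=None)
for layer in (backbone.layer3, backbone.layer4):
    first_block = layer[0]
    first_block.conv2.stride = (1, 2)           # (height stride, width stride)
    first_block.downsample[0].stride = (1, 2)   # the shortcut convolution must match

conv_subnetwork = torch.nn.Sequential(          # convolutional part only, no classifier
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4,
)

x = torch.randn(1, 3, 320, 512)                 # (N, C, H, W) toy input
print(conv_subnetwork(x).shape)                 # torch.Size([1, 2048, 40, 16])
```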
And step S202, splitting the feature map to obtain the feature sequence.
In some embodiments, the feature map is split based on dimension information of the feature map, so as to obtain the feature sequence. The dimension information of the feature map includes a dimension in a first direction and a dimension in a second direction (e.g., a width dimension and a height dimension), and the feature map is processed differently based on the two dimensions to obtain a feature sequence. For example, the feature map is first split into a sequence of features by pooling the feature map in a first dimension of the feature map and then splitting the feature map in a second dimension of the feature map. In this way, the recognition network based on the object sequences obtained by training the two loss functions is adopted to extract the characteristics of the image, and the characteristic image is split according to the dimension information, so that the characteristic sequences with more characteristics in the second direction reserved can be obtained, and the object sequence categories in the characteristic sequences can be recognized more accurately.
In some possible implementations, the feature map is pooled in the dimension of the first direction, and split along the dimension of the second direction to obtain the feature sequence, that is, the step S202 may be implemented by steps S221 and 222 (not shown in the drawing):
Step S221, pooling the feature map along the first direction to obtain a pooled feature map.
In some embodiments, the feature map is average-pooled along the first direction dimension of the feature map, while the second direction dimension and the channel dimension are kept unchanged, resulting in a pooled feature map. For example, taking the case where the objects in the object sequence are arranged along the height direction, the feature map is pooled along the width dimension of its dimension information to obtain the pooled feature map. If the dimensions of the feature map are 2048×40×16 (where the channel dimension is 2048, the height dimension is 40, and the width dimension is 16), the feature map becomes a pooled feature map of 2048×40×1 after average pooling over the width dimension.
And step S222, splitting the pooled feature map along the second direction to obtain the feature sequence.
In some embodiments, the pooled feature map is split along a second directional dimension of the feature map to obtain the feature sequence. The number of splits of the pooled feature map may be determined based on the length of the second direction of the feature map; for example, if the length of the second direction of the feature map is 60, then the pooled feature map is split into 60 vectors. In a specific example, taking the arrangement direction of the objects in the object sequence as an example of arrangement along the height, the pooled feature map is split based on the height dimension to obtain the feature sequence. If the pooled feature map is 2048×40×1, splitting the pooled feature map along the height dimension to obtain 40 2048-dimensional vectors, each corresponding to a feature corresponding to a 1/40 image region in the height direction in the original first image. In this way, after the feature images are pooled in a first direction different from the object arrangement direction, the feature images are split in a second direction identical to the object arrangement direction, so that the feature sequence can include more detailed information of the first image along the second direction.
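Continuing the 2048×40×16 example, the following sketch average-pools the width dimension and splits the result along the height dimension into 40 feature vectors of 2048 dimensions; the PyTorch-style tensor layout is an assumption made for illustration.

```python
import torch

# Feature map from the convolutional sub-network, as in the example above:
# (N, C, H, W) = (1, 2048, 40, 16).
feature_map = torch.randn(1, 2048, 40, 16)

pooled = feature_map.mean(dim=3)        # average-pool over the width dimension: (1, 2048, 40)
sequence = pooled.permute(0, 2, 1)      # (1, 40, 2048): 40 steps along the height direction
features = list(sequence.unbind(dim=1)) # 40 vectors of 2048 dims, one per 1/40 height slice
print(len(features), features[0].shape) # 40 torch.Size([1, 2048])
```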
In some embodiments, the classification result of the feature sequence is further processed to predict the class of each object, that is, the above step S103 may be implemented by the following steps S131 and S132 (not shown in the drawing):
step S131, a classifier of the recognition network of the object sequence is adopted to predict the category corresponding to each feature in the feature sequence.
In some embodiments, the feature sequence is input into a classifier to predict the class to which each feature in the feature sequence corresponds. For example, the total class number of the object sequence is n, and then a classifier with n class labels is used to predict the class of the feature in the feature sequence, so as to obtain the prediction probability that the feature in the feature sequence corresponds to each class label in the n class labels.
Step S132, determining a category of each object in the object sequence according to a prediction result of the category corresponding to each feature in the feature sequence.
In some embodiments, after splitting the feature map, the feature sequence includes a plurality of feature vectors of the image to be identified in the second direction dimension, that is, the feature vectors are partial features of the image to be identified, and may include all features of one or more object sequences or include partial features of one object sequence. Thus, the classification result of the object corresponding to each feature in the feature sequence is combined, so that the classification of each object in the object sequence in the first image can be accurately identified. Firstly, predicting the class of an object corresponding to each feature vector in the classification result of the feature sequence; then, by counting the categories of the objects corresponding to each feature vector in the feature sequence, the category of each object included in the first image can be determined. In this way, the classification result of the feature sequence is processed by the post-processing rule of the CTC loss function, so that the class of the target sequence included in the image can be predicted more accurately.
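As an illustration of steps S131 and S132, the following sketch applies a linear classifier with n class labels plus a CTC blank to each feature in the sequence and then reuses the greedy decoding sketch given earlier; the number of classes and the classifier form are hypothetical.

```python
import torch
import torch.nn as nn

n_classes = 10                                   # hypothetical total number of object classes
classifier = nn.Linear(2048, n_classes + 1)      # one extra label for the CTC blank

sequence = torch.randn(1, 40, 2048)              # feature sequence from the splitting step
probs = classifier(sequence).softmax(dim=-1)     # (1, 40, n_classes + 1) per-step probabilities
classes, lengths = ctc_greedy_decode(probs[0])   # reuse the greedy decoding sketch above
```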
In other embodiments, after the object class to which each feature vector belongs is determined, the length of the feature vectors belonging to the same class of object may be determined on this basis. For example, taking the object sequence as a token sequence, the token sequence length corresponding to the features belonging to the same token is determined from the feature sequence. The class of a token is related to the denomination of the token, the pattern of the token, and the game to which the token is applied. The sequence length of the features of each class of object is not fixed, thereby realizing conversion of a fixed-length feature sequence into class sequences of indefinite length.
In some embodiments, the recognition network of the object sequence, which is used for recognizing the class of the object sequence, is obtained by training a recognition network of the object sequence to be trained. The training process of the recognition network of the object sequence to be trained can be implemented through the steps shown in fig. 2B, which is a schematic implementation flow chart of the recognition network training method of the object sequence provided in the embodiment of the present application. The following description is made in connection with fig. 2B:
step S21, a sample image group is acquired.
In some embodiments, at least two frames of sample images in a sample image group include an original sample image and an image obtained by image transformation of the original sample image, where the original sample image includes a sample object sequence; the original sample image also comprises category labeling information of the sample object. The similarity of the picture contents of the multi-frame sample images in the sample image group is larger than a preset threshold value; or, the classes of the sample objects marked in the multi-frame sample images in the sample image group are the same.
Step S22, inputting the sample images in the sample image group into an identification network of an object sequence to be trained for feature extraction, and obtaining a sample feature sequence of each sample image in the sample image group.
In some embodiments, first, sample images in a sample image group are preprocessed so that the sizes of the sample images in the sample image group are uniform; and then, carrying out feature extraction on the processed sample image group to obtain a sample feature sequence of each sample image.
In some possible implementations, a paired image is first created for the acquired original sample image (i.e., the first sample image) to form a sample image pair, data enhancement is performed on the paired image, and the original sample image and the enhanced image are combined as a sample image group; that is, the above step S21 may be implemented by the following steps S231 to S234 (not shown in the drawing):
Step S231, acquiring a first sample image in which the category of the sample object in the picture is labeled.
Here, the image acquisition device may be used to acquire an image of a scene with a sample object, obtain the first sample image, and annotate a class of the sample object in the first sample image. The first sample image may be one or more frames of images.
Step S232, determining at least one second sample image based on the picture content of the first sample image.
In some possible implementations, first, according to the picture content of the first sample image, multiple images with high similarity to the picture content are generated, that is, multiple second sample images are obtained.
Step S233, performing data enhancement on the at least one second sample image to obtain at least one third sample image.
In some embodiments, the generated second sample image is subjected to data enhancement, for example, horizontal flipping, adding random pixel perturbation, adjusting image sharpness or brightness, and the like, so as to obtain a third sample image.
In a specific example, copying the picture content of the first sample image to obtain a second sample image; and then carrying out data enhancement on the second sample image, for example, adjusting the definition or brightness of the second sample image to obtain a third sample image.
Step S234, obtaining the sample image group based on the first sample image and the at least one third sample image.
In some embodiments, the size information of the first sample image and the plurality of third sample images is adjusted and the pixel values are normalized, so as to unify the sample data; at least two frames of the unified images are taken as one sample image group. In this way, paired images with similar picture content are created from each frame of the first sample image, which facilitates improving the consistency of feature extraction for similar images subsequently.
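A minimal sketch of steps S231 to S234, where `enhance` stands in for the augmentations described above (both the function and its perturbation are illustrative assumptions):

```python
import torch

def enhance(img: torch.Tensor) -> torch.Tensor:
    # Hypothetical enhancement: slight random pixel perturbation.
    return (img + 0.01 * torch.randn_like(img)).clamp(0.0, 1.0)

def build_sample_group(first_sample: torch.Tensor, num_pairs: int = 1):
    """Copy the first sample image to obtain second sample images, enhance them
    into third sample images, and combine them with the original as one group."""
    group = [first_sample]
    for _ in range(num_pairs):
        second = first_sample.clone()   # same picture content, same labels
        group.append(enhance(second))   # third sample image
    return group

group = build_sample_group(torch.rand(3, 200, 60))  # original plus one paired image
```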
In some possible implementations, the sample images are preprocessed and data enhancement is performed on the preprocessed images, and the enhanced images are combined as the final sample image group; that is, the step S234 may be implemented by:
firstly, preprocessing image parameters of the at least one third sample image and the first sample image based on preset image parameters to obtain at least two frames of third sample images.
In some possible implementations, the process of preprocessing the image parameters of the third sample image and the first sample image is similar to the preprocessing of the second sample image. First, the size information of the third sample image and the first sample image is adjusted according to a preset size to obtain adjusted images; then, the pixel values of the adjusted images are normalized to obtain at least two frames of third sample images. In this way, the widths of the multiple frames of original sample images are uniformly adjusted to the preset width. For an original sample image whose height is less than the preset height, pixels are filled into the image region that does not reach the preset height, for example with gray pixel values. Therefore, after the size information is adjusted, the height-to-width ratios of the multiple frames of sample images in the obtained sample image group are uniform, which can reduce errors generated in the network training process.
And then, carrying out data enhancement on the at least two frames of third sample images to obtain the sample image group.
Here, the data enhancement includes operations such as random flipping, random cropping, random fine adjustment of the height-width ratio, and random rotation; performing these operations on the multiple frames of adjusted images yields a richer group of sample images. Combining the multiple frames of uniformly sized images as a sample image group therefore enriches the sample images and improves the overall robustness of the network to be trained. A sketch of the preprocessing and data enhancement described in the two steps above follows.
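This sketch assumes a PyTorch tensor layout and illustrative preset sizes; the preset width, preset height, grey value and augmentation parameters are all assumptions:

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

PRESET_W, PRESET_H = 32, 160   # hypothetical preset width and height

def preprocess(image: torch.Tensor) -> torch.Tensor:
    """Resize to the preset width keeping the aspect ratio, pad short images
    with grey pixels up to the preset height, then normalise pixel values."""
    c, h, w = image.shape
    new_h = max(1, round(h * PRESET_W / w))
    image = F.interpolate(image[None], size=(new_h, PRESET_W),
                          mode="bilinear", align_corners=False)[0]
    if new_h < PRESET_H:
        pad = torch.full((c, PRESET_H - new_h, PRESET_W), 0.5)  # grey fill
        image = torch.cat([image, pad], dim=1)
    else:
        image = image[:, :PRESET_H]             # crop if taller (an assumption)
    return (image - 0.5) / 0.5                  # pixel value normalisation

# Random data enhancement of the size-unified images (parameters illustrative).
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=5),
    T.RandomResizedCrop(size=(PRESET_H, PRESET_W), scale=(0.9, 1.0),
                        ratio=(0.18, 0.22)),    # slight aspect-ratio jitter
])

sample_group = [augment(preprocess(img)) for img in
                [torch.rand(3, 200, 60), torch.rand(3, 300, 60)]]
```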
Step S23, predicting the class of the sample object sequence in each sample image based on the sample characteristic sequence of each sample image.
In some embodiments, first, for each sample image in the sample image group, the category of each sample feature in its sample feature sequence is predicted to obtain a classification result for each sample feature; then, based on the classification results of the sample features, the sample object classification result of each sample image is determined. In this way, the sample feature sequence of each sample image in the sample image group is input into the classifier of the recognition network of the object sequence to be trained for category prediction, and a sample classification result of the sample feature sequence of each sample image is obtained. The sample object classification result of each sample image comprises the classification result of each sample feature in the sample feature sequence of that sample image.
In some possible implementations, all classes of the sample objects are analyzed and the classification labels of the classifier are set accordingly, so as to predict the sample classification result of each sample feature sequence; that is, the above step S23 may be implemented by the following procedure:
first, a total class of the sample objects included in the sample image set is determined.
Here, all categories of sample objects are analyzed in the scene where the sample image is located. For example, in a game scenario, the sample object is a token, and all possible token categories are determined as the total token category.
A classification label of a classifier of the recognition network of the sequence of objects to be trained is then determined based on the total class of the sample objects.
Here, the classification label of the classifier is set according to the total class of the sample object, so that the classifier can predict the probability that the sample object in the sample image belongs to any class.
And finally, carrying out category prediction on the sample objects in the sample feature sequences of each sample image in the sample image group by adopting the classifier with the classification label, and obtaining a sample object classification result of each sample image.
Here, a classifier with multi-class classification labels is adopted to predict the class of the object in the sample feature sequence of each sample image, so that a sample classification result of the sample feature sequence of the sample image can be obtained; based on the sample classification results, the most likely class of the object included in the sample feature sequence may be determined. By analyzing the total category of the object and setting the classification label of the classifier, the category of the object in the sample feature sequence can be predicted more accurately.
Step S24, determining a first loss of supervision on the class of the sample object sequence in each sample image based on the sample sequence feature of each sample image in the sample image group, so as to obtain a first loss set.
In some embodiments, CTC loss is employed as the first loss. In a sample image group, for each sample image, a first loss of the sample image is obtained by taking a classification result of a sample feature sequence of the sample image output by a classifier and class labeling information of a sample object in the sample image as input of CTC loss so as to predict a class of the sample object corresponding to each feature in the sample feature sequence of the sample image; thus, a first set of losses can be obtained based on a set of sample images.
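A hedged sketch of using CTC as the first loss for one sample image; the blank index, the label values and the lengths below are illustrative assumptions:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# Classifier output for one sample image: (T, N, C) = (40 slices, batch 1,
# n classes + 1 blank); CTCLoss expects log-probabilities.
log_probs = torch.randn(40, 1, 11).log_softmax(dim=-1)

# Category labelling information of the sample objects, e.g. a stack of 6 tokens.
targets = torch.tensor([[3, 3, 5, 2, 2, 7]])
input_lengths = torch.tensor([40])
target_lengths = torch.tensor([6])

first_loss = ctc(log_probs, targets, input_lengths, target_lengths)
```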
Step S25, determining a second loss of supervision of the similarity between at least two frames of sample images based on the sample feature sequences of the at least two frames of sample images in the sample image group.
In some embodiments, a pairwise loss is employed as the second loss; for example, the pairwise loss may be chosen from losses that measure distribution differences, such as a regression loss (L2 loss), a cosine loss (cos loss), or a relative entropy loss (Kullback-Leibler divergence loss). In a group of sample images, the sample feature sequence of each sample image and the similarity true value between the sample images are input into the second loss to predict the similarity between different sample images.
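One possible realisation of the second loss, sketched for the loss variants named above; the choice of variant and treating paired images as having a ground-truth similarity of 1 (identical content) are assumptions:

```python
import torch
import torch.nn.functional as F

def pairwise_loss(seq_a: torch.Tensor, seq_b: torch.Tensor, kind: str = "l2"):
    """Similarity supervision between the sample feature sequences of two
    images in one group; a sketch of the alternatives named above."""
    if kind == "l2":
        return F.mse_loss(seq_a, seq_b)
    if kind == "cos":
        return (1 - F.cosine_similarity(seq_a, seq_b, dim=-1)).mean()
    if kind == "kl":
        return F.kl_div(seq_a.log_softmax(dim=-1),
                        seq_b.softmax(dim=-1), reduction="batchmean")
    raise ValueError(kind)

seq_original = torch.randn(40, 2048)   # feature sequence of the original sample image
seq_enhanced = torch.randn(40, 2048)   # feature sequence of its transformed counterpart
second_loss = pairwise_loss(seq_original, seq_enhanced, kind="l2")
```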
And S26, adjusting network parameters of the recognition network of the object sequence to be trained by adopting the first loss set and the second loss so that the loss of the classification result output by the recognition network of the adjusted object sequence meets the convergence condition.
Here, the first set of losses may be determined by comparing the classification characterizing each sample object to the truth information for each sample object in the classification result; the second penalty may be determined by comparing the similarity between the different sample images to a true value of the similarity between the different sample images. And adjusting the weight value of the recognition network of the object sequence to be trained by fusing the first loss and the second loss, so that the class loss of the sample object output by the recognition network of the trained object sequence is converged.
Through the steps S21 to S26, in the recognition network of the object sequence to be trained, based on the sample image group formed by the paired images, the first loss set for supervising the whole sequence and the second loss for supervising the similarity between the images in the group of sample images are introduced, so that the feature extraction consistency of the similar images can be improved, and the category prediction effect of the network is improved as a whole.
In some embodiments, in the recognition network of the object sequence to be trained, the convolutional sub-network is adopted to realize the feature extraction of the sample image, so as to obtain the sample feature sequence, that is, the step S22 can be realized through the following steps S241 and 242 (not shown in the drawing):
and S241, performing feature extraction on each sample image by adopting a convolution sub-network in the recognition network of the object sequence to be trained, and obtaining a sample feature map of each sample image.
In some embodiments, in the sample image group, a convolution network after fine tuning the structure of the residual network is used as a convolution sub-network of the identification network to be trained, and feature extraction is performed on each sample image, so as to obtain a sample feature map.
In some possible implementations, in the recognition network of the object sequence to be trained, feature extraction is performed on the sample image by using a convolution sub-network whose strides have been adjusted, so as to obtain a sample feature map in which the information in the second direction dimension is kept unchanged and the information in the first direction dimension is downsampled; that is, the step S241 may be implemented by:
and a first step of adopting the convolution sub-network to downsample each sample image in the length dimension of the first direction of each sample image to obtain a first dimension sample characteristic.
Here, the first direction is different from the arrangement direction of the sample objects in the sample object sequence; the implementation of the first step is similar to that of step S211 described above. In the case that the sample object sequence is stacked along the height direction, the sample image is downsampled in its width dimension to obtain the first-dimension sample feature. For example, the width stride of the last strided convolution layers of layer3 and layer4 of the convolution sub-network is kept at 2, while the height stride is changed from 2 to 1.
And a second step of extracting features in the length dimension of the second direction of each sample image based on the length of the second direction of each sample image to obtain second-dimension sample features.
Here, the implementation procedure of the second step is similar to that of the above-described step S212. In the case that the sample object sequence is stacked along the height direction, feature extraction is performed on the height dimension of the sample image based on the height of the sample image, so as to obtain the second-dimension sample feature. For example, the height stride of the last strided convolution layers of layer3 and layer4 of the convolution sub-network is changed from 2 to 1, so that no downsampling is performed in the height dimension of the sample image, i.e., the second-dimension sample feature of the sample image is maintained.
Thirdly, obtaining the sample feature map of each sample image based on the first dimension sample feature and the second dimension sample feature.
Here, the first-dimension sample features are combined with the second-dimension sample features to form a sample feature map of the sample image.
Through the first to third steps, in the case that the sample object sequence is stacked along the height direction, the convolution layers of layer3 and layer4 in the ResNet whose last stride is (2, 2) are changed into convolution layers with a stride of (1, 2), and the result is used as the convolution sub-network of the recognition network of the object sequence to be trained; in this way, the feature information of each sample image in the dimension of the arrangement direction can be retained as much as possible.
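A sketch of this stride change on a torchvision-style ResNet-50, assuming that the stride-2 convolutions of layer3 and layer4 sit in the first bottleneck block of each stage (the exact location of the strided convolution is an assumption about the backbone variant):

```python
from torchvision.models import resnet50

backbone = resnet50()
for stage in (backbone.layer3, backbone.layer4):
    stage[0].conv2.stride = (1, 2)          # height stride 2 -> 1, keep width stride 2
    stage[0].downsample[0].stride = (1, 2)  # keep the shortcut branch consistent
```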
Step S242, splitting the sample feature map of each sample image to obtain the sample feature sequence of each sample image.
Here, the implementation procedure of step S242 is similar to the implementation procedure of step S202 described above; that is, based on the dimension in the first direction and the dimension in the second direction, different processing is performed on each sample feature map, and a sample feature sequence is obtained. For example, the sample feature map is pooled in the dimension of the first direction, and split into a plurality of feature vectors in the dimension of the second direction to form a sample feature sequence; thus, a sample feature sequence which retains dimensional features of more sample object arrangement directions can be obtained, and the accuracy of training the network can be improved.
In some possible implementations, the sample feature map is pooled in the first direction dimension and split along the second direction dimension to obtain the sample feature sequence; that is, the step S242 may be implemented by:
and firstly, pooling the sample feature map along the first direction to obtain a pooled sample feature map.
Here, the implementation procedure of the first step is similar to that of the above-described step S221; that is, the sample feature map is averaged and pooled along the dimension of the sample feature map in the first direction, and the dimension of the sample feature map in the second direction and the channel dimension are kept unchanged, so as to obtain a pooled sample feature map.
And secondly, splitting the pooled sample feature map along the second direction to obtain the sample feature sequence.
Here, the implementation procedure of the second step is similar to that of the above-described step S222; the pooled sample feature map is split along the dimension of the second direction of the sample feature map to obtain the sample feature sequence. For example, if the dimension of the sample feature map in the second direction is 40, the pooled sample feature map is split into 40 vectors, which form the sample feature sequence. In this way, after the sample feature map is pooled in the first direction dimension, it is split in the second direction dimension, so that the sample feature sequence can retain more detailed information of the sample image in the second direction dimension.
In some embodiments, the performance of identifying the object sequence by the identifying network of the object sequence to be trained is improved by dynamically weighting and fusing the first loss set and the second loss, that is, the step S26 may be implemented by the following steps S261 and S262:
step S261, performing weighted fusion on the first loss set and the second loss to obtain a total loss.
In some embodiments, the first loss set and the second loss are weighted by different dynamic weights, and the weighted first loss set and the weighted second loss are fused to obtain the total loss.
In some embodiments, by setting adjustment parameters for the first loss and the second loss, the recognition performance of the object sequence of the recognition network of the object sequence to be trained is improved, that is, the step S261 may be implemented by:
the method comprises the steps of determining category supervision weights corresponding to a sample image group based on the number of sample images in the sample image group.
In some embodiments, the class supervision weights for the sample image groups are determined for a group of sample images, i.e. the weights of the first losses for each sample image in the same group may be the same. The number of the category supervision weights is the same as that of the sample images, so that the plurality of category supervision weights can be the same value or different values, but the sum of the plurality of category supervision weights is 1. For example, the number of sample images in the sample image group is n, and the category supervision weight may be 1/n.
And secondly, fusing the first loss in the first loss set of the sample image group based on the category supervision weight and a preset first weight to obtain a third loss.
In some embodiments, the third loss is obtained by assigning both the category supervision weight and the preset first weight to the first loss in the first loss set, and summing the plurality of first losses after assigning the parameters.
In some possible implementations, the first losses are fused using the category supervision weight, and the fused loss is then adjusted using the preset first weight to obtain the third loss; that is, the second step can be realized by the following steps:
and step A, respectively giving the category supervision weight to each first loss in the first loss set to obtain an updated loss set comprising at least two updated losses.
In some possible implementations, if the number of sample images in the sample image group is 4, then the class supervision weight is 0.25, and multiplying the class supervision weight of 0.25 with each first loss results in updated losses.
And step B, fusing the updated losses in the updated loss set to obtain fusion losses.
In some possible implementations, the updated losses in the updated loss set are added to obtain a fusion loss.
And step C, adjusting the fusion loss by adopting the preset first weight to obtain the third loss.
In some possible implementations, the ratio between the preset first weight and the preset second weight is set to 1:10, and the preset first weight is multiplied by the fusion loss to obtain the third loss. In this way, in the training process, the CTC losses of the prediction results of the sample images in the set of sample images are fused, so that the performance of the recognition network obtained by training can be improved.
And thirdly, adjusting the second loss by adopting a preset second weight to obtain a fourth loss.
In some embodiments, the second loss is given a preset second weight, and the adjustment of the second loss is implemented, for example, the fourth loss is obtained by multiplying the second loss by the preset second weight. The preset first weight is smaller than the preset second weight, and a certain proportional relation is met between the preset first weight and the preset second weight.
And fourthly, determining the total loss based on the third loss and the fourth loss.
In some embodiments, the third loss and the fourth loss are added to obtain a total loss of the recognition network of the sequence of objects to be trained.
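A sketch of this fusion, assuming the 1/n category supervision weight and the 1:10 ratio between the preset first and second weights used as examples above:

```python
import torch

def total_loss(first_losses, second_loss, alpha=1.0, beta=10.0):
    """Category supervision weight 1/n averages the first (CTC) losses of the
    n images in a group; alpha scales the fused result into the third loss and
    beta scales the pairwise loss into the fourth loss."""
    n = len(first_losses)
    third_loss = alpha * (sum(first_losses) / n)
    fourth_loss = beta * second_loss
    return third_loss + fourth_loss

# e.g. a group of two images: two CTC losses and one pairwise loss
loss = total_loss([torch.tensor(0.8), torch.tensor(0.6)], torch.tensor(0.05))
```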
Step S262, adjusting the network parameters of the object recognition network to be trained based on the total loss, so that the loss of the classification result output by the object recognition network after adjustment meets the convergence condition.
In the embodiment of the application, the third loss and the fourth loss are adopted to be fused to obtain the total loss, and the recognition network of the object sequence to be trained is trained, so that the consistency of feature extraction of similar images can be improved, and the prediction effect of the whole network can be improved.
In the following, an exemplary application of the embodiment of the present application in an actual application scenario is described, taking a game place as the application scenario and the identification of objects (for example, tokens) in the game place as an example.
Sequence recognition algorithms for images are widely applied to scenes such as scene text recognition and license plate recognition. In the related art, the method mainly extracts image features with a convolutional neural network, performs classification prediction on each slice feature, and uses a CTC loss function for de-duplication and supervision of the predicted output; it is applicable to text recognition and license plate recognition tasks.
However, for the problem of token sequence identification in gaming establishments, token sequences are generally long, and the denomination and type of each token must be predicted with high accuracy; existing deep-learning-based sequence recognition approaches are therefore not effective enough for tokens.
Based on this, the embodiment of the application provides a recognition method of an object sequence: on top of CTC-loss-based token recognition, a pairwise loss based on the feature similarity of paired images is added, which improves the consistency of feature extraction for similar images and thus the recognition accuracy of the tokens.
Fig. 3 is a schematic structural diagram of an identification network of an object sequence provided in an embodiment of the present application, and the following description is made with reference to fig. 3, where a framework of the identification network of the object sequence includes the following modules:
the paired data module is configured to construct a corresponding paired image 302 for each frame image 301 in the training image, so as to obtain a sample image set.
In some possible implementations, the training image is subjected to data enhancement processing, such as horizontal flipping, adding random pixel perturbation, adjusting image sharpness, adjusting image brightness, and the like; provided the sequence labels of the tokens in the image are not changed, a corresponding paired image is constructed for each frame of image, and subsequent processing is performed in pairs.
In one specific example, the pair of images of the image is obtained by first copying a frame of image and then data enhancing the copied image.
After a corresponding paired image is constructed for each frame of image, each frame of sample image is preprocessed, including resizing the image while maintaining the aspect ratio, normalizing the pixel values, and so on. Resizing while maintaining the aspect ratio adjusts the widths of the multiple frames of sample images to a uniform size; this reduces the problem that the aspect ratios of the input images differ greatly because the numbers of tokens in them differ, since resizing the images without keeping their aspect ratios would deform them severely. Data enhancement of the processed sample images is used only to enrich the sample image set; for example, the processed sample images are subjected to random flipping, random cropping, random fine adjustment of the aspect ratio, random rotation and the like, which can improve the overall robustness of the network to be trained.
The feature extraction module 303 performs feature extraction on the processed sample image to obtain a feature sequence; the image 301 is processed, and then the feature extraction is performed to obtain a sample feature sequence 31, and the image 302 is processed, and then the feature extraction is performed to obtain a sample feature sequence 32.
In some possible implementations, first, high-level features are extracted from the input sample image using the convolutional neural network part of the recognition network of the object sequence to be trained. The convolutional neural network part is obtained by fine-tuning the network structure of a residual network (ResNet); for example, the convolutional layers of layer3 and layer4 in the ResNet structure whose last stride is (2, 2) are changed to a stride of (1, 2). In this way, the feature map is not downsampled in the height dimension, while the width dimension is still downsampled by half, so that the feature information in the height dimension can be retained as much as possible. Then, the feature map is split, that is, the feature map extracted by the convolutional neural network is split into a sequence of feature vectors, which facilitates the subsequent classifier and loss function calculation. To split the feature map, average pooling is performed along the width direction of the feature map, with the height direction and the channel dimension unchanged; for example, a feature map of dimensions 2048×40×8 (channel dimension 2048, height dimension 40, width dimension 8) is average-pooled along the width direction into a 2048×40×1 feature map, which is then split along the height dimension into 40 vectors of 2048 dimensions, each vector corresponding to the features of a 1/40 region of the original image in the height direction.
In a specific example, if the sample image, as shown in fig. 4, includes a plurality of tokens, the image 401 is divided along the height dimension to obtain the feature sequence, and each feature in the sequence includes the features of at most one token.
The classifier is used for predicting the token category for each feature in the feature sequence with an n-class classifier, obtaining the prediction probabilities of the features.
Here, n is the total number of game denominations.
The loss module determines the feature similarity of the paired images with the pairwise loss 304 according to the feature sequences obtained from the convolutional network, and supervises the network with increasing this similarity as an optimization target. The prediction results of the feature sequences of the two images in a pair, i.e., the prediction probabilities of all feature classifications, are supervised by CTC losses 305 and 306 respectively.
In some possible implementations, the pairwise loss 304, CTC loss 305, and CTC loss 306 are combined into a total loss 307: L = α·(0.5·L_ctc1 + 0.5·L_ctc2) + β·L_pair; wherein the values of α and β can be set as α:β = 1:10.
Finally, back propagation is performed according to the classification result of the feature sequence and the calculation result of the loss function, and the network parameter weights are updated. In the test stage, the classification result of the feature sequence is processed according to the CTC loss function post-processing rule to obtain the predicted token sequence result, which includes the token sequence length and the category corresponding to each token.
In the embodiment of the application, without introducing additional parameters or changing the network structure, the prediction of the sequence length can be improved and the recognition accuracy of the categories is improved at the same time, so that the overall recognition result is improved; the improvement is especially large for scenes with long token sequences.
An embodiment of the present application provides an object sequence recognition device, and fig. 5A is a schematic structural diagram of the object sequence recognition device according to the embodiment of the present application, as shown in fig. 5A, where the object sequence recognition device 500 includes:
a first obtaining module 501, configured to obtain a first image of an object sequence;
the first extraction module 502 is configured to input the first image into an identification network of an object sequence to perform feature extraction, so as to obtain a feature sequence; wherein, the supervision information of the object sequence recognition network in the training process at least comprises: first supervision information of a class of a sample object sequence in each sample image of the sample image group and second supervision information of a similarity between at least two frames of sample images; the at least two frames of sample images in each sample image group comprise an original sample image and an image obtained by performing image transformation on the original sample image by at least one frame;
A first determining module 503 is configured to determine a category of each object in the sequence of objects based on the feature sequence.
In some embodiments, the first extraction module 502 includes:
the first extraction sub-module is used for extracting the characteristics of the first image by adopting a convolution sub-network in the identification network of the object sequence to obtain a characteristic diagram;
and the first splitting sub-module is used for splitting the feature map to obtain the feature sequence.
In some embodiments, the first extraction sub-module comprises:
the first downsampling unit is used for downsampling the first image in the length dimension of the first direction of the first image by adopting the convolution sub-network to obtain a first dimension characteristic, wherein the first direction is different from the arrangement direction of the objects in the object sequence;
the first extraction unit is used for extracting the characteristics in the length dimension of the first image in the second direction based on the length of the first image in the second direction to obtain second dimension characteristics;
and the first determining unit is used for obtaining the characteristic map based on the first dimension characteristic and the second dimension characteristic.
In some embodiments, the first split sub-module comprises:
The first pooling unit is used for pooling the feature images along the first direction to obtain pooled feature images;
the first splitting unit is used for splitting the pooled feature map along the second direction to obtain the feature sequence.
In some embodiments, the first determining module 503 includes:
the first prediction submodule is used for predicting the category corresponding to each feature in the feature sequence by adopting a classifier of the recognition network of the object sequence;
and the first determining submodule is used for determining the category of each object in the object sequence according to the prediction result of the category corresponding to each feature in the feature sequence.
An embodiment of the present application provides a device for training a recognition network of an object sequence, and fig. 5B is a schematic structural diagram of the device for training a recognition network of an object sequence according to the embodiment of the present application, as shown in fig. 5B, where the device 510 for training a recognition network of an object sequence includes:
a second acquisition module 511 for acquiring a sample image group; at least two frames of sample images in the sample image group comprise an original sample image and an image obtained by performing image transformation on the original sample image by at least one frame, and the original sample image comprises a sample object sequence;
A second extraction module 512, configured to input sample images in the sample image group into an identification network of an object sequence to be trained to perform feature extraction, so as to obtain a sample feature sequence of each sample image in the sample image group;
a first prediction module 513, configured to predict a class of the sample object sequence in each sample image based on the sample feature sequence of each sample image;
a second determining module 514, configured to determine, based on the sample sequence feature of each sample image in the sample image group, a first loss for supervising the class of the sample object sequence in each sample image, to obtain a first loss set;
a third determining module 515, configured to determine, based on a sample feature sequence of at least two frames of sample images in the sample image group, a second loss of supervision of similarity between the at least two frames of sample images;
the first adjustment module 516 is configured to adjust network parameters of the identification network of the object sequence to be trained using the first loss set and the second loss, so that the loss of the classification result output by the identification network of the adjusted object sequence meets a convergence condition.
In some embodiments, the second obtaining module 511 includes:
the first acquisition sub-module is used for acquiring a first sample image for marking the category of the sample object in the picture;
a second determining sub-module for determining at least one second sample image based on the picture content of the first sample image;
the first enhancer module is used for carrying out data enhancement on the at least one second sample image to obtain at least one third sample image;
a third determination sub-module for obtaining the set of sample images based on the first sample image and the at least one third sample image.
In some embodiments, the second extraction module 512 includes:
the second extraction sub-module is used for carrying out feature extraction on each sample image by adopting a convolution sub-network in the recognition network of the object sequence to be trained to obtain a sample feature map of each sample image;
and the second splitting module is used for splitting the sample feature images of each sample image to obtain the sample feature sequences of each sample image.
In some embodiments, the second extraction sub-module comprises:
The second downsampling unit is used for downsampling each sample image in the length dimension of the first direction of the sample image by adopting the convolution sub-network to obtain a first-dimension sample characteristic, wherein the first direction is different from the arrangement direction of sample objects in the sample object sequence;
the second extraction unit is used for extracting the characteristics in the length dimension of the second direction of each sample image based on the length of the second direction of each sample image to obtain second-dimension sample characteristics;
and the second determining unit is used for obtaining the sample characteristic diagram of each sample image based on the first dimension sample characteristic and the second dimension sample characteristic.
In some embodiments, the second split sub-module comprises:
the second pooling unit is used for pooling the sample feature images along the first direction to obtain pooled sample feature images;
and the second splitting unit is used for splitting the pooled sample feature images along the second direction to obtain the sample feature sequences.
In some embodiments, the first adjustment module 516 includes:
The first fusion submodule is used for carrying out weighted fusion on the first loss set and the second loss to obtain total loss;
and the first adjustment sub-module is used for adjusting the network parameters of the object recognition network to be trained based on the total loss so that the loss of the classification result output by the object recognition network after adjustment meets the convergence condition.
In some embodiments, the first fusion sub-module comprises:
a third determining unit, configured to determine a category supervision weight corresponding to the sample image group based on the number of sample images in the sample image group;
the first fusion unit is used for fusing the first loss in the first loss set of the sample image group based on the category supervision weight and a preset first weight to obtain a third loss;
the first adjusting unit is used for adjusting the second loss by adopting a preset second weight to obtain a fourth loss;
a fourth determining unit configured to determine the total loss based on the third loss and the fourth loss;
in some embodiments, the first fusion unit comprises:
a first assigning subunit, configured to assign the category supervision weights to each first loss in the first loss set, respectively, to obtain an updated loss set including at least two updated losses;
The first fusion subunit is used for fusing the updated losses in the updated loss set to obtain fusion losses;
and the first adjusting subunit is used for adjusting the fusion loss by adopting the preset first weight to obtain the third loss.
It should be noted that the description of the above device embodiments is similar to the description of the method embodiments described above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the device embodiments of the present application, please refer to the description of the method embodiments of the present application for understanding.
In the embodiment of the present application, if the method for identifying an object sequence is implemented in the form of a software functional module, and sold or used as a separate product, the method may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or portions contributing to the prior art may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a terminal, a server, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, an optical disk, or other various media capable of storing program codes. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Embodiments of the present application further provide a computer program product, where the computer program product includes computer executable instructions, where the computer executable instructions, when executed, enable the method for identifying an object sequence and the method for training a network for identifying an object sequence provided in the embodiments of the present application to be implemented.
The embodiment of the application further provides a computer storage medium, on which computer executable instructions are stored, where the computer executable instructions implement the method for identifying an object sequence and the method for training a network for identifying an object sequence provided in the above embodiment when executed by a processor.
An embodiment of the present application provides a computer device, fig. 6 is a schematic diagram of a composition structure of the computer device according to the embodiment of the present application, as shown in fig. 6, and the computer device 600 includes: a processor 601, at least one communication bus, a communication interface 602, at least one external communication interface and a memory 603. Wherein the communication interface 602 is configured to enable connected communication between these components. The communication interface 602 may include a display screen, and the external communication interface may include a standard wired interface and a wireless interface, among others. The processor 601 is configured to execute an image processing program in a memory, so as to implement the method for identifying an object sequence and the method for training an identification network of an object sequence provided in the foregoing embodiments.
The description of the above object sequence recognition device, the computer device and the storage medium embodiments is similar to the description of the above method embodiments, and has similar technical descriptions and beneficial effects to the corresponding method embodiments, and is limited to the description of the above method embodiments, so that the description is not repeated here. For technical details not disclosed in the embodiments of the identification device, the computer device and the storage medium of the object sequence of the present application, please refer to the description of the method embodiment of the present application for understanding.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application. The foregoing embodiment numbers of the present application are merely for describing, and do not represent advantages or disadvantages of the embodiments. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above described device embodiments are only illustrative, e.g. the division of the units is only one logical function division, and there may be other divisions in practice, such as: multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. In addition, the various components shown or discussed may be coupled or directly coupled or communicatively coupled to each other via some interface, whether indirectly coupled or communicatively coupled to devices or units, whether electrically, mechanically, or otherwise.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; can be located in one place or distributed to a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units. Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware related to program instructions, and the foregoing program may be stored in a computer readable storage medium, where the program, when executed, performs steps including the above method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read Only Memory (ROM), a magnetic disk or an optical disk, or the like, which can store program codes.
Alternatively, the integrated units described above may be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partly contributing to the prior art, and the computer software product may be stored in a storage medium, and include several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disk. The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (19)

1. A method of identifying a sequence of objects, comprising:
acquiring a first image of an object sequence;
inputting the first image into an identification network of an object sequence to perform feature extraction to obtain a feature sequence; wherein, the supervision information of the object sequence recognition network in the training process at least comprises: first supervision information of a class of a sample object sequence in each sample image of the sample image group and second supervision information of a similarity between at least two frames of sample images; the at least two frames of sample images in each sample image group comprise an original sample image and an image obtained by performing image transformation on the original sample image by at least one frame;
based on the feature sequence, a class of each object in the sequence of objects is determined.
2. The method according to claim 1, wherein the inputting the first image into the recognition network of the object sequence for feature extraction, to obtain a feature sequence, includes:
performing feature extraction on the first image by adopting a convolution sub-network in the identification network of the object sequence to obtain a feature map;
and splitting the feature map to obtain the feature sequence.
3. The method of claim 2, wherein the feature extraction of the first image with the convolutional sub-network in the recognition network of the sequence of objects to obtain a feature map comprises:
Downsampling the first image in a length dimension of a first direction of the first image by adopting the convolution sub-network to obtain a first dimension characteristic, wherein the first direction is different from the arrangement direction of the objects in the object sequence;
extracting features in the length dimension of the first image in the second direction based on the length of the first image in the second direction to obtain second dimension features;
and obtaining the characteristic map based on the first dimension characteristic and the second dimension characteristic.
4. A method according to claim 3, wherein said splitting the feature map to obtain the feature sequence comprises:
pooling the feature map along the first direction to obtain a pooled feature map;
and splitting the pooled feature map along the second direction to obtain the feature sequence.
5. The method of any of claims 1 to 4, wherein the determining a category for each object in the sequence of objects based on the sequence of features comprises:
predicting the class corresponding to each feature in the feature sequence by adopting a classifier of the recognition network of the object sequence;
And determining the category of each object in the object sequence according to the prediction result of the category corresponding to each feature in the feature sequence.
6. A method of training a recognition network of a sequence of objects, comprising:
acquiring a sample image group; at least two frames of sample images in the sample image group comprise an original sample image and an image obtained by performing image transformation on the original sample image by at least one frame, and the original sample image comprises a sample object sequence;
inputting sample images in the sample image group into an identification network of an object sequence to be trained for feature extraction, and obtaining a sample feature sequence of each sample image in the sample image group;
predicting the class of the sample object sequence in each sample image based on the sample feature sequence of each sample image;
determining a first loss for supervising the category of the sample object sequence in each sample image based on the sample sequence characteristics of each sample image in the sample image group, so as to obtain a first loss set;
determining a second loss of supervision of the similarity between at least two frames of sample images based on sample feature sequences of the at least two frames of sample images in the sample image group;
And adjusting network parameters of the recognition network of the object sequence to be trained by adopting the first loss set and the second loss so that the loss of the classification result output by the recognition network of the adjusted object sequence meets the convergence condition.
7. The method of claim 6, wherein the acquiring a set of sample images comprises:
acquiring a first sample image for marking the category of a sample object in a picture;
determining at least one second sample image based on the picture content of the first sample image;
performing data enhancement on the at least one second sample image to obtain at least one third sample image;
and obtaining the sample image group based on the first sample image and the at least one third sample image.
8. The method according to claim 6 or 7, wherein the inputting the sample images in the sample image group into the recognition network of the object sequence to be trained for feature extraction, to obtain a sample feature sequence of each sample image in the sample image group, comprises:
performing feature extraction on each sample image by adopting a convolution sub-network in the recognition network of the object sequence to be trained to obtain a sample feature map of each sample image;
And splitting the sample feature map of each sample image to obtain the sample feature sequence of each sample image.
9. The method according to claim 8, wherein the feature extraction of each sample image by using the convolution subnetwork in the recognition network of the object sequence to be trained to obtain a sample feature map of each sample image includes:
downsampling each sample image in the length dimension of the first direction of each sample image by adopting the convolution sub-network to obtain a first-dimension sample characteristic, wherein the first direction is different from the arrangement direction of sample objects in the sample object sequence;
based on the length of each sample image in the second direction, extracting the characteristics in the length dimension of each sample image in the second direction to obtain second-dimension sample characteristics;
and obtaining the sample characteristic diagram of each sample image based on the first dimension sample characteristic and the second dimension sample characteristic.
10. The method of claim 9, wherein the splitting the sample feature map of each sample image to obtain the sample feature sequence of each sample image comprises:
Pooling the sample feature map along the first direction to obtain a pooled sample feature map;
splitting the pooled sample feature map along the second direction to obtain the sample feature sequence.
11. The method according to any one of claims 6 to 10, wherein the adjusting network parameters of the object sequence recognition network to be trained by using the first loss set and the second loss, so that the loss of the classification result output by the adjusted object sequence recognition network meets a convergence condition, comprises:
carrying out weighted fusion on the first loss set and the second loss to obtain a total loss;
and adjusting the network parameters of the object sequence recognition network to be trained based on the total loss, so that the loss of the classification result output by the adjusted object sequence recognition network meets the convergence condition.
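A one-line illustration of the weighted fusion in claim 11; the weights w1 and w2 are illustrative hyper-parameters, since the claim only states that a weighted fusion is performed.

```python
import torch

def fuse_losses(first_loss_set, second_loss, w1=1.0, w2=0.1):
    """first_loss_set: list of scalar loss tensors (one per sample image)."""
    total_loss = w1 * torch.stack(first_loss_set).sum() + w2 * second_loss
    return total_loss   # back-propagated to adjust the network parameters
```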
12. The method of claim 11, wherein the carrying out weighted fusion on the first loss set and the second loss to obtain the total loss comprises:
determining a category supervision weight corresponding to the sample image group based on the number of sample images in the sample image group;
fusing the first losses in the first loss set of the sample image group based on the category supervision weight and a preset first weight to obtain a third loss;
adjusting the second loss by using a preset second weight to obtain a fourth loss;
and determining the total loss based on the third loss and the fourth loss.
13. The method of claim 12, wherein the fusing the first losses in the first loss set of the sample image group based on the category supervision weight and the preset first weight to obtain the third loss comprises:
assigning the category supervision weight to each first loss in the first loss set, so as to obtain an updated loss set comprising at least two updated losses;
fusing the updated losses in the updated loss set to obtain a fusion loss;
and adjusting the fusion loss by using the preset first weight to obtain the third loss.
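Claims 12 and 13 together spell out one particular fusion rule. The sketch below follows that structure; taking the category supervision weight to be the reciprocal of the number of sample images in the group, and the concrete values of the preset weights, are assumptions.

```python
import torch

def fuse_losses_detailed(first_loss_set, second_loss, num_sample_images,
                         first_weight=1.0, second_weight=0.1):
    # Category supervision weight derived from the group size (assumed rule),
    # so the classification term does not grow with the number of augmented views.
    category_supervision_weight = 1.0 / num_sample_images

    # Claim 13: weight every first loss, fuse the updated losses, then scale
    # the fusion loss by the preset first weight to obtain the third loss.
    updated_losses = [category_supervision_weight * loss for loss in first_loss_set]
    fusion_loss = torch.stack(updated_losses).sum()
    third_loss = first_weight * fusion_loss

    # Claim 12: the fourth loss is the second loss scaled by the preset second weight.
    fourth_loss = second_weight * second_loss

    return third_loss + fourth_loss                           # total loss
```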
14. A computer storage medium having stored thereon computer executable instructions, wherein the computer executable instructions, upon execution, are configured to:
acquiring a first image of an object sequence;
inputting the first image into an object sequence recognition network for feature extraction to obtain a feature sequence; wherein supervision information used by the object sequence recognition network during training at least comprises: first supervision information on the class of a sample object sequence in each sample image of a sample image group, and second supervision information on the similarity between at least two frames of sample images; the at least two frames of sample images in each sample image group comprise an original sample image and at least one frame of image obtained by performing image transformation on the original sample image;
and determining, based on the feature sequence, a class of each object in the object sequence.
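The deployment-side flow of this claim reduces to a forward pass plus a decoding step. The argmax decoding and the assumed output shape are illustrative; the claim itself does not fix how the feature sequence is turned into per-object classes.

```python
import torch

@torch.no_grad()
def recognize_object_sequence(recognition_net, first_image):
    """first_image: (C, H, W) tensor; assumes the trained network returns
    per-position class logits of shape (1, L, K)."""
    recognition_net.eval()
    logits = recognition_net(first_image.unsqueeze(0))   # feature sequence -> logits (1, L, K)
    classes = logits.argmax(dim=-1).squeeze(0)           # class of each object in the sequence
    return classes.tolist()
```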
15. A computer storage medium having stored thereon computer executable instructions, wherein the computer executable instructions, upon execution, are configured to:
acquiring a sample image group; wherein at least two frames of sample images in the sample image group comprise an original sample image and at least one frame of image obtained by performing image transformation on the original sample image, and the original sample image comprises a sample object sequence;
inputting the sample images in the sample image group into an object sequence recognition network to be trained for feature extraction to obtain a sample feature sequence of each sample image in the sample image group;
predicting the class of the sample object sequence in each sample image based on the sample feature sequence of each sample image;
determining, based on the sample feature sequence of each sample image in the sample image group, a first loss for supervising the class of the sample object sequence in each sample image, so as to obtain a first loss set;
determining, based on the sample feature sequences of at least two frames of sample images in the sample image group, a second loss for supervising the similarity between the at least two frames of sample images;
and adjusting network parameters of the object sequence recognition network to be trained by using the first loss set and the second loss, so that the loss of the classification result output by the adjusted object sequence recognition network meets a convergence condition.
16. A computer device, comprising a memory having stored thereon computer executable instructions and a processor, wherein the computer executable instructions, when executed by the processor, are configured to:
acquiring a first image of an object sequence;
inputting the first image into an object sequence recognition network for feature extraction to obtain a feature sequence; wherein supervision information used by the object sequence recognition network during training at least comprises: first supervision information on the class of a sample object sequence in each sample image of a sample image group, and second supervision information on the similarity between at least two frames of sample images; the at least two frames of sample images in each sample image group comprise an original sample image and at least one frame of image obtained by performing image transformation on the original sample image;
and determining, based on the feature sequence, a class of each object in the object sequence.
17. A computer device, comprising a memory having stored thereon computer executable instructions and a processor, wherein the computer executable instructions, when executed by the processor, are configured to:
acquiring a sample image group; wherein at least two frames of sample images in the sample image group comprise an original sample image and at least one frame of image obtained by performing image transformation on the original sample image, and the original sample image comprises a sample object sequence;
inputting the sample images in the sample image group into an object sequence recognition network to be trained for feature extraction to obtain a sample feature sequence of each sample image in the sample image group;
predicting the class of the sample object sequence in each sample image based on the sample feature sequence of each sample image;
determining, based on the sample feature sequence of each sample image in the sample image group, a first loss for supervising the class of the sample object sequence in each sample image, so as to obtain a first loss set;
determining, based on the sample feature sequences of at least two frames of sample images in the sample image group, a second loss for supervising the similarity between the at least two frames of sample images;
and adjusting network parameters of the object sequence recognition network to be trained by using the first loss set and the second loss, so that the loss of the classification result output by the adjusted object sequence recognition network meets a convergence condition.
18. A computer program comprising computer instructions executable by an electronic device, wherein the computer instructions, when executed by a processor in the electronic device, are configured to:
acquiring a first image of an object sequence;
inputting the first image into an object sequence recognition network for feature extraction to obtain a feature sequence; wherein supervision information used by the object sequence recognition network during training at least comprises: first supervision information on the class of a sample object sequence in each sample image of a sample image group, and second supervision information on the similarity between at least two frames of sample images; the at least two frames of sample images in each sample image group comprise an original sample image and at least one frame of image obtained by performing image transformation on the original sample image;
and determining, based on the feature sequence, a class of each object in the object sequence.
19. A computer program comprising computer instructions executable by an electronic device, wherein the computer instructions, when executed by a processor in the electronic device, are configured to:
acquiring a sample image group; wherein at least two frames of sample images in the sample image group comprise an original sample image and at least one frame of image obtained by performing image transformation on the original sample image, and the original sample image comprises a sample object sequence;
inputting the sample images in the sample image group into an object sequence recognition network to be trained for feature extraction to obtain a sample feature sequence of each sample image in the sample image group;
predicting the class of the sample object sequence in each sample image based on the sample feature sequence of each sample image;
determining, based on the sample feature sequence of each sample image in the sample image group, a first loss for supervising the class of the sample object sequence in each sample image, so as to obtain a first loss set;
determining, based on the sample feature sequences of at least two frames of sample images in the sample image group, a second loss for supervising the similarity between the at least two frames of sample images;
and adjusting network parameters of the object sequence recognition network to be trained by using the first loss set and the second loss, so that the loss of the classification result output by the adjusted object sequence recognition network meets a convergence condition.
CN202180002790.5A 2021-09-22 2021-09-27 Object sequence identification method, network training method, device, equipment and medium Pending CN116391189A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
SG10202110489U 2021-09-22
SG10202110489U 2021-09-22
PCT/IB2021/058778 WO2023047164A1 (en) 2021-09-22 2021-09-27 Object sequence recognition method, network training method, apparatuses, device, and medium

Publications (1)

Publication Number Publication Date
CN116391189A (en) 2023-07-04

Family

ID=85719326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180002790.5A Pending CN116391189A (en) 2021-09-22 2021-09-27 Object sequence identification method, network training method, device, equipment and medium

Country Status (3)

Country Link
CN (1) CN116391189A (en)
AU (1) AU2021240190A1 (en)
WO (1) WO2023047164A1 (en)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461155A (en) * 2019-01-18 2020-07-28 富士通株式会社 Apparatus and method for training classification model
CN111062237A (en) * 2019-09-05 2020-04-24 商汤国际私人有限公司 Method and apparatus for recognizing sequence in image, electronic device, and storage medium
CN110796182A (en) * 2019-10-15 2020-02-14 西安网算数据科技有限公司 Bill classification method and system for small amount of samples
CN112926673B (en) * 2021-03-17 2023-01-17 清华大学深圳国际研究生院 Semi-supervised target detection method based on consistency constraint

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113869305A (en) * 2020-06-30 2021-12-31 北京搜狗科技发展有限公司 Image-based text recognition method and device, electronic equipment and medium

Also Published As

Publication number Publication date
WO2023047164A1 (en) 2023-03-30
AU2021240190A1 (en) 2023-04-06

Similar Documents

Publication Publication Date Title
Khan et al. Deep unified model for face recognition based on convolution neural network and edge computing
JP2020522077A (en) Acquisition of image features
CN106408037B (en) Image recognition method and device
CN106372624B (en) Face recognition method and system
US9489566B2 (en) Image recognition apparatus and image recognition method for identifying object
JP2022521038A (en) Face recognition methods, neural network training methods, devices and electronic devices
CN111401521B (en) Neural network model training method and device, and image recognition method and device
CN111325237B (en) Image recognition method based on attention interaction mechanism
CN114049512A (en) Model distillation method, target detection method and device and electronic equipment
CN116171462A (en) Object sequence identification method, network training method, device, equipment and medium
CN112036520A (en) Panda age identification method and device based on deep learning and storage medium
CN113111880A (en) Certificate image correction method and device, electronic equipment and storage medium
CN114444566B (en) Image forgery detection method and device and computer storage medium
Nguyen-Quoc et al. A revisit histogram of oriented descriptor for facial color image classification based on fusion of color information
CN112749576B (en) Image recognition method and device, computing equipment and computer storage medium
CN110826534A (en) Face key point detection method and system based on local principal component analysis
CN116391189A (en) Object sequence identification method, network training method, device, equipment and medium
Zheng et al. 3D texture-based face recognition system using fine-tuned deep residual networks
CN112818774A (en) Living body detection method and device
CN116975828A (en) Face fusion attack detection method, device, equipment and storage medium
CN116188956A (en) Method and related equipment for detecting deep fake face image
CN116071804A (en) Face recognition method and device and electronic equipment
Shu et al. Face anti-spoofing based on weighted neighborhood pixel difference pattern
CN112215076B (en) Deep handwriting identification method and device based on double-tower network
CN114118412A (en) Method, system, device and medium for certificate recognition model training and certificate recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination