CN112257716A - Scene character recognition method based on a scale-adaptive and direction attention network - Google Patents

Scene character recognition method based on a scale-adaptive and direction attention network

Info

Publication number
CN112257716A
CN112257716A (application number CN202011424315.0A)
Authority
CN
China
Prior art keywords
network
polar coordinate
feature
characteristic
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011424315.0A
Other languages
Chinese (zh)
Inventor
鲍虎军
李特
操晓春
代朋纹
张华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202011424315.0A
Publication of CN112257716A
Legal status: Pending

Classifications

    • G06V 20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Learning methods
    • G06V 30/10: Character recognition


Abstract

The invention relates to a scene character recognition method based on a scale-adaptive and direction attention network. The method maps the input picture into polar coordinate space to obtain a polar coordinate image and extracts a feature J of the polar coordinate image with a convolutional network; converts the feature expression of the picture in polar coordinate space into a high-order semantic feature F with a deep convolutional network; for the converted high-order semantic features, encodes the features of the regions most relevant to each character with a character receptive-field attention mechanism, obtaining a robust feature expression that is discretized into a feature sequence Q; captures the context among the elements of the feature sequence Q with a bidirectional long short-term memory network to obtain a feature sequence H; and feeds the feature sequence H into a decoding network to be parsed into a character string in correct semantic order. The method can effectively recognize scene characters in any semantic direction, encodes more effective feature expressions for characters of different scales, and significantly improves recognition performance.

Description

Scene character recognition method based on a scale-adaptive and direction attention network
Technical Field
The invention belongs to the technical field of computer vision and relates to a method for recognizing characters of any semantic direction in natural scene images, in particular to a scene character recognition method based on a scale-adaptive and direction attention network.
Background
With the development of information technology, images have become a popular information carrier and play an indispensable role in daily life. Characters in images are high-level visual elements that carry rich and precise semantic information and are very helpful for understanding scene content. Recognizing the character information in images therefore has wide application value, mainly reflected in four aspects. First, content-based image retrieval: character information in an image can effectively resolve ambiguity in the image content, and combining it with the scene content allows a deeper understanding of the image, so that more accurate images can be retrieved from key information. Second, human-computer interaction systems: when shopping or traveling, people often encounter billboards, posters, shop signs, menus and the like that contain textual information in different languages; capturing such images with a mobile device and recognizing the character elements in them brings convenience to daily life. Third, purifying cyberspace: many lawbreakers use images as carriers and embed vulgar or pornographic text in them for dissemination online; recognizing such harmful character information helps block its spread and protects the physical and mental health of minors. Fourth, intelligent transportation systems: in outdoor environments, accurately recognizing license plates and traffic signs contributes to the intelligent management of traffic.
Compared with conventional Optical Character Recognition (OCR), natural Scene Text Recognition (STR) poses many challenges, mainly in the following respects. First, OCR targets scanned documents, whose images are clear and whose backgrounds are simple, whereas STR targets natural scene images: camera shake, illumination or shooting angle easily cause blur, low resolution and text occlusion. Second, the characters processed by OCR are generally uniform in size, consistent in color and neatly arranged, whereas the characters targeted by STR vary in font, color and layout, which increases the difficulty of recognition.
Scene character recognition based on deep neural networks falls into two main categories: regular and irregular scene character recognition. Regular scene character recognition targets horizontal, front-facing characters, and its methods can be divided into three types: character-based, word-based and sequence-based. Character-based methods first detect character positions, then classify each character with a deep neural network, and finally aggregate the per-character results into the final result with heuristic algorithms and language rules. Word-based methods directly classify whole words with a deep neural network. Sequence-based methods first encode the input image into sequence features and then parse them into a text string with an attention-based sequence decoder or Connectionist Temporal Classification (CTC). Irregular scene character recognition targets characters that are multi-oriented, perspective-distorted, curved and so on. Its methods can be divided into three categories: rectification-based, two-dimensional-space-based and direction-feature-encoding-based. Rectification-based methods first rectify irregular characters into horizontal or nearly horizontal ones with a rectification network and then recognize them with a regular-text recognizer; the rectification and recognition networks are trained jointly end to end, the rectification network needs no supervision information, and its learning is driven by gradients fed back from the recognition network. Two-dimensional-space-based methods extract features of the input image with a fully convolutional network to preserve the spatial information of the characters, and then recognize them with a two-dimensional attention mechanism or by class segmentation at each position in the two-dimensional space. Direction-feature-encoding methods first map the input image into one-dimensional features in multiple directions, then learn a weight for each direction and each position within it, fuse all directional features into a more expressive feature with the learned weights, and finally parse the result with a one-dimensional attention decoder.
At present, scene character recognition mainly addresses characters with irregular geometric layout and ignores the arbitrariness of the semantic direction of the text; in practical applications, however, scene text of arbitrary semantic orientation often appears. In addition, because the characters in scene text vary in scale, existing methods do not consider precise feature encoding for individual characters. Scene character recognition at arbitrary scale and in arbitrary semantic direction is therefore a research hotspot for practical applications.
Disclosure of Invention
Aiming at scene characters of arbitrary semantic direction whose individual characters differ in scale, the invention provides a scene character recognition method based on a scale-adaptive and direction attention network. Since both the scale and the orientation of the text must be considered, the original image is mapped into polar coordinate space. To perceive the scale of each character accurately, several receptive fields of moderate size are combined by adaptive selection according to receptive-field theory.
The technical scheme of the invention is as follows:
A scene character recognition method based on a scale-adaptive and direction attention network comprises the following steps:
(1) mapping the input picture into polar coordinate space to obtain a polar coordinate image, and extracting a feature J of the polar coordinate image with a convolutional network;
(2) converting the feature expression of the picture in polar coordinate space into a high-order semantic feature F with a deep convolutional network;
(3) for the high-order semantic feature F obtained in step (2), encoding the features of the regions most relevant to each character with a character receptive-field attention mechanism, obtaining a robust feature expression and discretizing it into a feature sequence Q;
(4) capturing the context among the elements of the feature sequence Q with a bidirectional long short-term memory network to obtain a feature sequence H;
(5) feeding the feature sequence H into a decoding network to be parsed into a character string in correct semantic order.
Further, before step (1), the method further comprises converting the input picture: a color input picture of arbitrary size is converted into a grayscale picture of fixed size, expressed as H × W.
Further, the step (1) specifically comprises the following sub-steps:
(1.1) learning a polar-origin response map with a shallow small network, and then obtaining the polar coordinate origin by weighting the spatial positions with the response map; the shallow small network consists of three convolutional layers, with a rectification unit and a batch normalization layer following the convolutional layers;
(1.2) mapping the coordinate positions in polar coordinate space to positions in Cartesian space according to the transformation between Cartesian and polar coordinates; the value at each position in polar space is obtained by bilinear interpolation over the four positions adjacent to the corresponding Cartesian position, yielding the polar coordinate image;
(1.3) extracting the feature J of the polar coordinate image with a convolutional network; during convolution padding, the polar coordinate image is cyclically padded in the vertical direction, i.e., the topmost row is padded from the bottommost row and vice versa.
Further, the step (2) is specifically:
downsampling the feature J with a convolutional network, the vertical dimension to 1 and the horizontal dimension to L, to obtain the high-order semantic feature F; the feature dimension is 1 × L × D, where D is the number of feature channels.
Further, the step (3) specifically comprises the following sub-steps:
(3.1) feeding the high-order semantic feature F into one standard convolution and K−1 dilated convolutions with different dilation rates to obtain multi-scale features F_1, F_2, …, F_K, each of feature dimension 1 × L × D;
(3.2) concatenating the multi-scale features F_1, F_2, …, F_K to learn the association weight between each character region and the features at each scale;
(3.3) fusing the multi-scale features F_1, F_2, …, F_K with the learned weights and discretizing the result into the feature sequence Q = {q_1, q_2, …, q_L}, where the feature dimension of each q_j is D.
Further, in the step (4), the bidirectional long short-term memory network contains D neurons.
Further, the step (5) is specifically:
the decoding network is a recurrent neural network based on gated recurrent units; for each decoding time t, a multilayer perceptron learns the degree of association between the hidden state s_{t−1} of the gated-recurrent-unit network and the feature sequence H; the network then adaptively attends to the appropriate sequence features according to the learned associations; finally, each GRU unit outputs the distribution y_t over the C character classes at the current time, where C is the number of characters. During network learning, the cumulative probability over all times is maximized; during network inference, either a greedy algorithm selects the class with the largest response as the output at each time, or beam search keeps the β classes with the largest responses at each time for parsing at the next time, and inference ends when a sequence-end symbol is produced or the maximum preset time step T is exceeded.
The invention has the following beneficial effects:
1. The invention applies the polar coordinate transformation to sequential character recognition and can effectively perceive characters of any direction and any scale, thereby significantly improving the recognition effect.
2. The invention proposes a character receptive-field attention mechanism that encodes more relevant features for characters of different scales, thereby significantly improving the recognition effect; the mechanism is simple and effective and can easily be embedded into existing sequence recognition models (e.g., scene character recognition, handwriting recognition, speech recognition) to improve recognition performance.
In summary, the scene character recognition method based on a scale-adaptive and direction attention network provided by the invention can effectively recognize scene characters in any direction, and can learn better feature expressions for characters of different scales within the text, thereby improving overall recognition performance.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of a polar transformation process;
FIG. 3 is a diagram of a scene character recognition network structure in any semantic direction;
FIG. 4 is a schematic diagram of the character receptive-field attention mechanism.
Detailed Description
The technical solution in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
The invention provides a scene character recognition method based on a scale-adaptive and direction attention network; as shown in FIG. 1, the steps are as follows:
(1) mapping the input picture into polar coordinate space to obtain a polar coordinate image, and extracting a feature J of the polar coordinate image with a convolutional network;
(2) converting the feature expression of the picture in polar coordinate space into a high-order semantic feature F with a deep convolutional network;
(3) for the high-order semantic feature F obtained in step (2), encoding the features of the regions most relevant to each character with a character receptive-field attention mechanism, obtaining a robust feature expression and discretizing it into a feature sequence Q;
(4) capturing the context among the elements of the feature sequence Q with a bidirectional long short-term memory network to obtain a feature sequence H;
(5) feeding the feature sequence H into a decoding network to be parsed into a character string in correct semantic order.
The network constructed by the method for recognizing scene characters of any semantic direction is shown in FIG. 3; it comprises a polar coordinate feature conversion module, a feature encoding module and a character sequence decoding module.
In an embodiment of the present invention, before step (1), the method further includes converting the input picture: a scene character image of any size and any semantic direction is converted into an H × W grayscale image I, where H and W denote the height and width of the grayscale image.
In an embodiment of the present invention, in step (1), the input image is converted into a feature expression in polar coordinate space by the polar coordinate feature conversion module, whose structure and flow are shown in FIG. 2. The specific steps are as follows:
(1.1) A shallow small network serves as the polar-origin prediction network and learns a polar-origin response map; the polar coordinate origin is then obtained by weighting the spatial positions with the response map. Specifically, this comprises the following sub-steps:
(1.1.1) The polar-origin response map is learned with a small network of four convolutional layers, the first three each followed by a linear rectification unit (ReLU) and batch normalization (BN).
(1.1.2) From the polar-origin response map O and the horizontal and vertical position-coordinate matrices E_x and E_y over O (both normalized to [−1, 1]), the polar origin (x_0, y_0) is obtained as follows:

x_0 = Σ_k (O ⊙ E_x)_k / Σ_k O_k    (1)

y_0 = Σ_k (O ⊙ E_y)_k / Σ_k O_k    (2)

where k is the position index in the response map O and ⊙ denotes element-wise multiplication of matrices.
The shallow small network is trained by weak supervision, driven only by the feedback from the character-string recognition.
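As an illustration of sub-steps (1.1.1) and (1.1.2), the following PyTorch sketch shows one way the origin-prediction network and Eqs. (1)-(2) could be realized; the layer widths and the softmax normalization of the response map are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class PolarOriginPredictor(nn.Module):
    """Sketch of the shallow origin-prediction network of steps (1.1.1)-(1.1.2).
    Layer widths and the softmax normalization of the response map are assumptions."""
    def __init__(self, channels=32):
        super().__init__()
        layers, in_c = [], 1
        for _ in range(3):                      # first three convs, each with ReLU + BN
            layers += [nn.Conv2d(in_c, channels, 3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.BatchNorm2d(channels)]
            in_c = channels
        layers.append(nn.Conv2d(channels, 1, 3, padding=1))   # fourth conv -> response map O
        self.net = nn.Sequential(*layers)

    def forward(self, img):                     # img: (B, 1, H, W) grayscale input
        O = self.net(img).squeeze(1)            # response map O, shape (B, H, W)
        O = torch.softmax(O.flatten(1), dim=1).view_as(O)    # responses sum to 1 (assumption)
        B, H, W = O.shape
        ys = torch.linspace(-1, 1, H, device=img.device)     # E_y, E_x normalized to [-1, 1]
        xs = torch.linspace(-1, 1, W, device=img.device)
        E_y, E_x = torch.meshgrid(ys, xs, indexing="ij")
        x0 = (O * E_x).flatten(1).sum(dim=1)    # Eq. (1): response-weighted average position
        y0 = (O * E_y).flatten(1).sum(dim=1)    # Eq. (2)
        return x0, y0                           # polar origin (x_0, y_0), each of shape (B,)
```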
(1.2) The coordinate positions in polar coordinate space are mapped to positions in Cartesian space according to the transformation between Cartesian and polar coordinates, as shown in FIG. 2; the value at each position in polar space is obtained by bilinear interpolation over the four positions adjacent to the corresponding Cartesian position, yielding the polar coordinate image. Specifically, this comprises the following sub-steps:
(1.2.1) Construct a polar image P of the same size as the input image I. The coordinates of the images I and P are normalized to [−1, 1], and the coordinate mapping between I and P is computed as follows:

x_i^s = x_0 + ρ_i^t ∙ cos(θ_i^t)    (3)

y_i^s = y_0 + ρ_i^t ∙ sin(θ_i^t)    (4)

ρ_i^t = γ ∙ (x_i^t + 1) / 2    (5)

θ_i^t = (y_i^t + 1) ∙ π    (6)

where (x_i^s, y_i^s) are the coordinates of the i-th position on the input image I, (x_i^t, y_i^t) are the coordinates of the i-th position on the polar image P, and ρ_i^t and θ_i^t are the polar radius and angle of the i-th position on the polar image P; γ denotes the maximum distance from the polar coordinate origin.
(1.2.2) With the obtained coordinate mapping (x_i^s, y_i^s), the value at each position (x_i^t, y_i^t) on the polar image P is computed by bilinear interpolation.
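A sketch of the polar warping of Eqs. (3)-(6), using F.grid_sample for the bilinear interpolation of sub-step (1.2.2); taking γ as the distance from the origin to the farthest image corner is an assumption.

```python
import math
import torch
import torch.nn.functional as F

def polar_warp(img, x0, y0):
    """Sketch of Eqs. (3)-(6): build the polar sampling grid around the predicted
    origin and fill each polar position by bilinear interpolation (grid_sample).
    Taking gamma as the distance to the farthest image corner is an assumption."""
    B, C, H, W = img.shape
    dev = img.device
    # target polar coordinates (x^t, y^t), normalized to [-1, 1]
    y_t, x_t = torch.meshgrid(torch.linspace(-1, 1, H, device=dev),
                              torch.linspace(-1, 1, W, device=dev), indexing="ij")
    x_t, y_t = x_t.expand(B, H, W), y_t.expand(B, H, W)
    x0v, y0v = x0.view(B, 1, 1), y0.view(B, 1, 1)
    corners = torch.tensor([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]], device=dev)
    gamma = torch.sqrt((corners[:, 0] - x0.view(B, 1)) ** 2 +
                       (corners[:, 1] - y0.view(B, 1)) ** 2).max(dim=1).values.view(B, 1, 1)
    rho = gamma * (x_t + 1) / 2        # Eq. (5): radius grows along the horizontal axis
    theta = (y_t + 1) * math.pi        # Eq. (6): angle spans [0, 2*pi] along the vertical axis
    x_s = x0v + rho * torch.cos(theta)          # Eq. (3): source x-coordinate on I
    y_s = y0v + rho * torch.sin(theta)          # Eq. (4): source y-coordinate on I
    grid = torch.stack([x_s, y_s], dim=-1)      # (B, H, W, 2) sampling grid in [-1, 1]
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)   # polar image P
```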
(1.3) The feature J of the polar coordinate image is extracted with a convolutional network. Specifically: the polar image P is cyclically padded in the vertical direction, i.e., the topmost row of P is padded from the bottommost row and, conversely, the bottommost row from the topmost row; then M 3 × 3 convolution kernels (each followed by a rectification unit and a batch normalization layer) are applied to obtain the feature expression J of the polar image in polar coordinate space.
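A minimal sketch of the cyclic vertical padding in step (1.3) follows; the channel count M and the single-layer structure are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class PolarConv(nn.Module):
    """Sketch of step (1.3): 3x3 convolution on the polar image with cyclic padding
    along the vertical (angular) axis, since theta = 0 and theta = 2*pi are adjacent.
    The channel count M and the single-layer structure are assumptions."""
    def __init__(self, in_c=1, M=64):
        super().__init__()
        self.conv = nn.Conv2d(in_c, M, kernel_size=3, padding=0)
        self.post = nn.Sequential(nn.ReLU(inplace=True), nn.BatchNorm2d(M))

    def forward(self, p):                             # p: (B, C, H, W) polar image
        p = F.pad(p, (0, 0, 1, 1), mode="circular")   # top row filled from bottom row and vice versa
        p = F.pad(p, (1, 1, 0, 0))                    # ordinary zero padding on the radial axis
        return self.post(self.conv(p))                # feature expression J
```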
In an embodiment of the present invention, the feature J is downsampled with a convolutional network composed of several convolutional and pooling layers, the vertical dimension to 1 and the horizontal dimension to L, to obtain the high-order semantic feature F; the feature dimension is 1 × L × D, where D is the number of feature channels.
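The patent fixes only the output shape 1 × L × D of this step (L = 23 and D = 256 in the experiments); the following is a plausible sketch in which every intermediate layer is an assumption.

```python
import torch.nn as nn

# Plausible sketch of the step (2) encoder; all intermediate layers are assumptions,
# only the output shape (B, D, 1, L) = (B, 256, 1, 23) is taken from the text.
encoder = nn.Sequential(
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.BatchNorm2d(128),
    nn.MaxPool2d(2),                     # halve both axes
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True), nn.BatchNorm2d(256),
    nn.MaxPool2d((2, 1)),                # shrink the vertical axis faster than the horizontal one
    nn.AdaptiveAvgPool2d((1, 23)),       # collapse to 1 x L with L = 23
)
# F = encoder(J) has shape (B, 256, 1, 23)
```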
In an embodiment of the present invention, the character receptive-field attention mechanism learns a more effective feature expression for characters of different scales and discretizes it into the feature sequence Q; its principle is sketched in FIG. 4, and it is implemented in the feature encoding module, whose structure and flow are shown in FIG. 3. The steps are as follows:
(3.1) From the high-order semantic feature expression F, a depth feature extractor, i.e., a 1 × 1 convolutional layer, generates the feature expression F_1, and K−1 3 × 3 dilated convolutional layers generate the multi-scale features F_2, F_3, …, F_K with dilation rates 2^1, 2^2, …, 2^(K−1), respectively; the feature dimension of each feature is 1 × L × D, so that each position in the feature maps is associated with a different region of the input image.
(3.2) F_1, F_2, …, F_K are concatenated and fed into two convolutional layers to learn the association weights W between each character region and the features at the different scales.
(3.3) The multi-scale features F_1, F_2, …, F_K are combined with the learned weights, so that the model learns a better feature expression for characters of different scales and discretely encodes them into the richer feature sequence Q = {q_1, q_2, …, q_L}, where the feature dimension of each q_j is D. The fusion is computed as follows:

q_j = Σ_{i=1}^{K} W_j^i ∙ F_j^i    (7)

where W_j^i is the association weight (a scalar) between the j-th position and the i-th scale feature, and F_j^i is the feature vector of the i-th scale feature at the j-th position, with feature dimension D.
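Putting sub-steps (3.1)-(3.3) together, a sketch of the CRFA module follows; the softmax normalization of the weights over the K scales and the width of the two weight-learning convolutions are assumptions.

```python
import torch
import torch.nn as nn

class CharacterReceptiveFieldAttention(nn.Module):
    """Sketch of CRFA, sub-steps (3.1)-(3.3): one standard convolution plus K-1
    dilated convolutions produce F_1..F_K, two convolutional layers predict a
    per-position weight for each scale, and Eq. (7) fuses the scales."""
    def __init__(self, D=256, K=4):
        super().__init__()
        branches = [nn.Conv2d(D, D, kernel_size=1)]      # F_1: 1x1 standard convolution
        for i in range(1, K):                            # F_2..F_K: 3x3 dilated convolutions
            d = 2 ** i                                   # dilation rates 2^1, ..., 2^(K-1)
            branches.append(nn.Conv2d(D, D, 3, padding=d, dilation=d))
        self.branches = nn.ModuleList(branches)
        self.weight_net = nn.Sequential(                 # two conv layers -> K weights per position
            nn.Conv2d(K * D, D, 1), nn.ReLU(inplace=True), nn.Conv2d(D, K, 1))

    def forward(self, feat):                             # feat: (B, D, 1, L)
        Fs = [b(feat) for b in self.branches]            # K tensors, each (B, D, 1, L)
        W = torch.softmax(self.weight_net(torch.cat(Fs, 1)), dim=1)   # (B, K, 1, L)
        stacked = torch.stack(Fs, dim=1)                 # (B, K, D, 1, L)
        Q = (W.unsqueeze(2) * stacked).sum(dim=1)        # Eq. (7): q_j = sum_i W_j^i F_j^i
        return Q.squeeze(2).permute(0, 2, 1)             # feature sequence Q, (B, L, D)
```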
In an embodiment of the invention, for the adaptively enhanced feature sequence Q = {q_1, q_2, …, q_L}, the dependencies between different positions are modeled with a bidirectional long short-term memory network containing D neurons, yielding the better sequence features H = {h_1, h_2, …, h_L}.
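A sketch of step (4) follows; since a bidirectional LSTM with D hidden units per direction outputs 2D channels, the linear projection back to D is an assumption made here.

```python
import torch
import torch.nn as nn

# Sketch of step (4): a bidirectional LSTM with D hidden units per direction;
# projecting its 2*D-channel output back to D is an assumption.
D = 256
bilstm = nn.LSTM(input_size=D, hidden_size=D, bidirectional=True, batch_first=True)
proj = nn.Linear(2 * D, D)

Q = torch.randn(8, 23, D)     # (batch, L, D) feature sequence from the CRFA module
H, _ = bilstm(Q)              # (batch, L, 2*D)
H = proj(H)                   # (batch, L, D) context-enhanced sequence features
```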
In an embodiment of the present invention, for scene text of any semantic direction, the character sequence decoding module (shown in FIG. 3) serves as the decoding network and generates a character string with correct semantic order and an accurate recognition result. The decoding network is a recurrent neural network in which the choice of recurrent unit is free; the example here uses the Gated Recurrent Unit (GRU). At each parsing time, a sequence attention mechanism automatically learns the alignment between the character string and the sequence features H. The specific steps are as follows:
(5.1) The association between the hidden state of the GRU and the sequence features is learned with a multilayer perceptron, computed as follows:

e_tj = W_e tanh(W_s s_{t−1} + W_h h_j + b)    (8)

α_tj = exp(e_tj) / Σ_j exp(e_tj)    (9)

where s_{t−1} is the hidden state of the GRU at time t−1 and h_j is the feature vector of the sequence features H at the j-th position; α_tj is the degree of association between the t-th parsing time and the j-th position in H; W_e, W_s, W_h and b are learnable parameters of the perceptron.
(5.2) The relevant features at the t-th parsing time are obtained by weighted combination, computed as follows:

c_t = Σ_j α_tj ∙ h_j    (10)
(5.3) The hidden state of the GRU at time t is updated as follows:

s_t = GRU(s_{t−1}, c_t, y_{t−1})    (11)

where y_{t−1} denotes the label y*_{t−1} at time t−1 during training, and the prediction ŷ_{t−1} at time t−1 during testing.
(5.4) The output probability distribution at each time t is obtained as follows:

y_t = softmax(V s_t)    (12)

where V is a learnable weight parameter.
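The following sketch assembles Eqs. (8)-(12) into a single decoding step; embedding the previous symbol y_{t−1}, its dimension, and the exact class count are assumptions.

```python
import torch
import torch.nn as nn

class AttentionGRUDecoder(nn.Module):
    """Sketch of one decoding step, Eqs. (8)-(12). Embedding the previous symbol
    y_{t-1} is an assumption; C = 38 assumes 36 characters plus start/end symbols."""
    def __init__(self, D=256, C=38, emb=64):
        super().__init__()
        self.W_s = nn.Linear(D, D, bias=False)
        self.W_h = nn.Linear(D, D, bias=True)      # the bias here plays the role of b in Eq. (8)
        self.W_e = nn.Linear(D, 1, bias=False)
        self.embed = nn.Embedding(C, emb)
        self.gru = nn.GRUCell(D + emb, D)          # input is [c_t; embed(y_{t-1})]
        self.V = nn.Linear(D, C)

    def step(self, s_prev, H, y_prev):             # s_prev: (B, D), H: (B, L, D), y_prev: (B,)
        # Eq. (8): e_tj = W_e tanh(W_s s_{t-1} + W_h h_j + b)
        e = self.W_e(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(H))).squeeze(-1)
        alpha = torch.softmax(e, dim=1)            # Eq. (9): attention weights alpha_tj
        c = (alpha.unsqueeze(-1) * H).sum(dim=1)   # Eq. (10): context vector c_t
        s = self.gru(torch.cat([c, self.embed(y_prev)], dim=1), s_prev)   # Eq. (11)
        return s, torch.log_softmax(self.V(s), dim=1)   # Eq. (12), log-probabilities over C classes
```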
During learning, the objective loss function is expressed as follows:

L(θ) = − Σ_{t=1}^{T} log p(y_t* | I; θ)    (13)

where y_t* is the label at time t, θ denotes all learnable parameters of the network, I is the input image, and p(∙) is the probability assigned to the label in the output distribution at time t. The whole network is trained end to end; only the images and the corresponding text strings need to be input, and no extra supervision information is required. During inference, the recognition result with the largest response at each time is selected as output, or beam search keeps the β classes with the largest responses for the next time.
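A sketch of training with the loss of Eq. (13) (teacher forcing) and of greedy inference for the decoder above; the START symbol index and the zero hidden-state initialization are assumptions, and early stopping at the end-of-sequence symbol is omitted for brevity.

```python
import torch

START = 0   # assumed index of the start symbol; EOS handling is omitted for brevity

def train_loss(decoder, H, targets):        # targets: (B, T) ground-truth labels y_t*
    """Sketch of Eq. (13) with teacher forcing: feed the true label y*_{t-1} back."""
    B, T = targets.shape
    s = H.new_zeros(B, 256)                 # zero initial hidden state (assumption, D = 256)
    y_prev = torch.full((B,), START, dtype=torch.long, device=H.device)
    loss = 0.0
    for t in range(T):
        s, log_probs = decoder.step(s, H, y_prev)
        loss = loss - log_probs.gather(1, targets[:, t:t + 1]).sum()  # -sum_t log p(y_t* | I; theta)
        y_prev = targets[:, t]              # teacher forcing
    return loss / B

def greedy_decode(decoder, H, T_max=100):
    """Greedy inference: take the class with the largest response at every time."""
    B = H.size(0)
    s = H.new_zeros(B, 256)
    y_prev = torch.full((B,), START, dtype=torch.long, device=H.device)
    outputs = []
    for _ in range(T_max):                  # a full decoder would also stop at the EOS symbol
        s, log_probs = decoder.step(s, H, y_prev)
        y_prev = log_probs.argmax(dim=1)
        outputs.append(y_prev)
    return torch.stack(outputs, dim=1)      # (B, T_max) predicted label sequence
```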
The method of the present invention will be further described with reference to the following specific examples.
The test environment and experimental results of the scene character recognition method based on a scale-adaptive and direction attention network provided by the invention are as follows:
(1) and (3) testing environment:
the system environment is as follows: ubuntu 16.04;
hardware environment: memory: 128GB, GPU: NVIDIA GTX 1080Ti, CPU 1.70 GHz Intel (R) Xeon (R) E5-2609, hard disk: 4 TB;
(2) experimental data:
the model constructed by the method of the invention is trained on a synthetic data set Synth90k (about nine million word pictures) and SynthText (about four million word pictures). The invention evaluated on five data sets, respectively IIIT5K (3000 training pictures, 2000 test pictures); SVT (647 test pictures); ICDAR03 (1007 test pictures); ICDAR13 (1095 test pictures); ICDAR15 (2077 test pictures). The evaluation criterion utilized is case insensitive word accuracy. In the evaluation, in order to obtain different semantic direction characters, the original image is rotated by 0 degree, 90 degrees, 180 degrees and 270 degrees. The number of characters is 36, including 26 english letters +10 numbers.
(3) The optimization method comprises the following steps:
the ADAELTA optimization method was used, where the size H × W of the image was set to 100 × 100, L was set to 23, K was set to 4 in the convolutional network, i.e., 3 dilation convolutional layers were included, the number of eigen-channels D was set to 256, and T was set to 100. The size of the training mini-batch (minipatch) is set to 128.
(4) The experimental results are as follows:
1) ablation experiment:
the evaluation of the experiment was performed on the IIIT5K test set, and for fair comparison, training was performed only on the Synth90k dataset; in model inference, a greedy selection strategy is used for obtaining a recognition result, and a dictionary is not used for correcting a final prediction result. The Baseline-A firstly trains a semantic direction classification network, namely a 0-degree, 90-degree, 180-degree and 270-degree four-classification network; then using the popular horizontal character recognizer CRNN (B, Shi, X, Bai, and C, Yao, "An end-to-end reliable network for image-based sequence and its application to scene text recognition"IEEE Trans. Pattern Anal. Mach. Intell.Vol. 39, No. 11, pp. 2298 and 2304, 2017). Baseline-B refers to performing 0-degree, 90-degree, 180-degree and 270-degree rotation on any input image; then, pictures in the four directions are identified, and finally, one prediction with the highest summation probability is selected from the four results to serve as a final result. AON (Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou, "AON: towardarbitraryoriented text recognition," in CVPR, 2018, pp. 5571-5579 ") refers to a popular multidirectional coding network that performs weighted combination of features by learning the weights of different positions in different directions. As shown in the ablation experiments in table 1 below, the effectiveness of the Polar Transformation (PT) mechanism proposed by the present invention for sequence recognition and based on word-level received Field Attention (CRFA) was found.
Table 1. Ablation experiment
2) And (3) comparing the performances:
when compared to other methods, the model was trained using the synthetic dataset Synth90k and synthttext. At model inference, the width β of the bundle search is set to 5, and the prediction is rectified using the largest dictionary set provided by the data set, i.e., the character string in the dictionary with the smallest edit distance from the prediction result is selected as the final result. If the data set does not provide a dictionary, then all of the truth values in the test set are placed in a set to form the dictionary. The performance is shown in table 2 below, which shows that the case-insensitive word accuracy is the average performance in four semantic directions (0 degrees, 90 degrees, 180 degrees and 270 degrees), and the results show the robustness and superiority of our method for semantic direction character recognition.
Table 2. Performance comparison
In the table:
The Tesseract-OCR method is described in "Tesseract-OCR v4.0," https://github.com/tesseract-ocr/tesseract/releases.
The GRCNN method is described in J. Wang and X. Hu, "Gated recurrent convolution neural network for OCR," in NeurIPS, 2017, pp. 334–343.
The ALE method is described in S. Fang, H. Xie, Z. Zha, N. Sun, J. Tan, and Y. Zhang, "Attention and language ensemble for scene text recognition with convolutional sequence modeling," in ACM-MM, 2018, pp. 248–256.
The ASTER method is described in B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, "ASTER: An attentional scene text recognizer with flexible rectification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 9, pp. 2035–2048, 2019.
The MORAN-v2 method is described in C. Luo, L. Jin, and Z. Sun, "MORAN: A multi-object rectified attention network for scene text recognition," Pattern Recognition, vol. 90, pp. 109–118, 2019.
The SAR method is described in H. Li, P. Wang, C. Shen, and G. Zhang, "Show, Attend and Read: A simple and strong baseline for irregular text recognition," in AAAI, 2019, pp. 8610–8617.
The above experiments make clear that both the polar coordinate transformation and the character receptive-field attention mechanism of the present invention are effective. Using both for scene character recognition in any semantic direction achieves good performance and robustness.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them; a person skilled in the art may modify or equivalently substitute the technical solutions of the present invention without departing from its scope, which shall be determined by the claims.

Claims (7)

1. A scene character recognition method based on a scale-adaptive and direction attention network, characterized by comprising the following steps:
(1) mapping the input picture into polar coordinate space to obtain a polar coordinate image, and extracting a feature J of the polar coordinate image with a convolutional network;
(2) converting the feature expression of the picture in polar coordinate space into a high-order semantic feature F with a deep convolutional network;
(3) for the high-order semantic feature F obtained in step (2), encoding the features of the regions most relevant to each character with a character receptive-field attention mechanism, obtaining a robust feature expression and discretizing it into a feature sequence Q;
(4) capturing the context among the elements of the feature sequence Q with a bidirectional long short-term memory network to obtain a feature sequence H;
(5) feeding the feature sequence H into a decoding network to be parsed into a character string in correct semantic order.
2. The scene character recognition method based on a scale-adaptive and direction attention network as claimed in claim 1, characterized by further comprising, before step (1), converting the input picture: a color input picture of arbitrary size is converted into a grayscale picture of fixed size, expressed as H × W.
3. The scene character recognition method based on a scale-adaptive and direction attention network as claimed in claim 1, characterized in that the step (1) specifically comprises the following sub-steps:
(1.1) learning a polar-origin response map with a shallow small network, and then obtaining the polar coordinate origin by weighting the spatial positions with the response map; the shallow small network consists of three convolutional layers, with a rectification unit and a batch normalization layer following the convolutional layers;
(1.2) mapping the coordinate positions in polar coordinate space to positions in Cartesian space according to the transformation between Cartesian and polar coordinates; the value at each position in polar space is obtained by bilinear interpolation over the four positions adjacent to the corresponding Cartesian position, yielding the polar coordinate image;
(1.3) extracting the feature J of the polar coordinate image with a convolutional network; during convolution padding, the polar coordinate image is cyclically padded in the vertical direction, i.e., the topmost row is padded from the bottommost row and vice versa.
4. The scene character recognition method based on a scale-adaptive and direction attention network as claimed in claim 1, characterized in that the step (2) is specifically:
downsampling the feature J with a convolutional network, the vertical dimension to 1 and the horizontal dimension to L, to obtain the high-order semantic feature F; the feature dimension is 1 × L × D, where D is the number of feature channels.
5. The scene character recognition method based on a scale-adaptive and direction attention network as claimed in claim 1, characterized in that the step (3) specifically comprises the following sub-steps:
(3.1) feeding the high-order semantic feature F into one standard convolution and K−1 dilated convolutions with different dilation rates to obtain multi-scale features F_1, F_2, …, F_K, each of feature dimension 1 × L × D;
(3.2) concatenating the multi-scale features F_1, F_2, …, F_K to learn the association weight between each character region and the features at each scale;
(3.3) fusing the multi-scale features F_1, F_2, …, F_K with the learned weights and discretizing the result into the feature sequence Q = {q_1, q_2, …, q_L}, where the feature dimension of each q_j is D.
6. The scene character recognition method based on a scale-adaptive and direction attention network as claimed in claim 1, characterized in that in the step (4), the bidirectional long short-term memory network contains D neurons.
7. The scene character recognition method based on a scale-adaptive and direction attention network as claimed in claim 1, characterized in that the step (5) is specifically:
the decoding network is a recurrent neural network based on gated recurrent units; for each decoding time t, a multilayer perceptron learns the degree of association between the hidden state s_{t−1} of the gated-recurrent-unit network and the feature sequence H; the network then adaptively attends to the appropriate sequence features according to the learned associations; finally, each gated recurrent unit outputs the distribution y_t over the C character classes at the current time, where C is the number of characters; during network learning, the cumulative probability over all times is maximized; during network inference, either a greedy algorithm selects the class with the largest response as output at each time, or beam search keeps the β classes with the largest responses at each time for parsing at the next time, until a sequence-end symbol is produced or the maximum preset time step T is exceeded.
CN202011424315.0A 2020-12-08 2020-12-08 Scene character recognition method based on a scale-adaptive and direction attention network Pending CN112257716A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011424315.0A CN112257716A (en) Scene character recognition method based on a scale-adaptive and direction attention network

Publications (1)

Publication Number Publication Date
CN112257716A 2021-01-22

Family

ID=74224958


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165697A (en) * 2018-10-12 2019-01-08 福州大学 Natural scene text detection method based on an attention-mechanism convolutional neural network
CN110097049A (en) * 2019-04-03 2019-08-06 中国科学院计算技术研究所 Natural scene text detection method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CARLOS et al.: "Polar Transformer Networks", arXiv *
FISHER et al.: "Multi-Scale Context Aggregation by Dilated Convolutions", arXiv *
RONGHUA et al.: "Multi-scale Adaptive Feature Fusion Network for Semantic Segmentation in Remote Sensing Images", Remote Sensing *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111975A (en) * 2021-05-12 2021-07-13 合肥工业大学 SAR image target classification method based on multi-kernel scale convolutional neural network
CN113762050A (en) * 2021-05-12 2021-12-07 腾讯云计算(北京)有限责任公司 Image data processing method, apparatus, device and medium
CN113762050B (en) * 2021-05-12 2024-05-24 腾讯云计算(北京)有限责任公司 Image data processing method, device, equipment and medium
CN113297986A (en) * 2021-05-27 2021-08-24 新东方教育科技集团有限公司 Handwritten character recognition method, device, medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210122)