WO2023202695A1 - Data processing method and apparatus, device, and medium

Data processing method and apparatus, device, and medium

Info

Publication number
WO2023202695A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
level
attention
target
features
Prior art date
Application number
PCT/CN2023/089744
Other languages
English (en)
French (fr)
Inventor
吴臻志
祝夭龙
Original Assignee
北京灵汐科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京灵汐科技有限公司
Publication of WO2023202695A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Definitions

  • the embodiments of the present disclosure relate to the field of computer technology, and in particular, to a data processing method and device, electronic equipment, and computer-readable storage media.
  • the self-attention (SA) mechanism is an improvement on the attention mechanism, which reduces the dependence on external information and is better at capturing the internal correlation of data or features.
  • the present disclosure provides a data processing method and device, electronic equipment, and computer-readable storage media.
  • the present disclosure provides a data processing method.
  • the data processing method includes: inputting data to be processed into a target neural network for processing, and obtaining a processing result of the data to be processed.
  • At least one layer of the target neural network The convolution layer is an attention convolution layer based on a first attention mechanism, and/or feature fusion is performed between at least two convolution layers of the target neural network based on a second attention mechanism, wherein the third An attention mechanism includes a self-attention mechanism for a local area of a feature, and the second attention mechanism includes an attention mechanism for a local area of the output feature between output features of different scales.
  • the present disclosure provides a data processing device.
  • the data processing device includes: a data processing module configured to input data to be processed into a target neural network for processing, and to obtain a processing result of the data to be processed, At least one convolution layer of the target neural network is an attention convolution layer based on a first attention mechanism, and/or, at least two convolution layers of the target neural network are based on a second attention mechanism.
  • Feature fusion is performed, wherein the first attention mechanism includes a self-attention mechanism for local areas of features, and the second attention mechanism includes an attention mechanism between output features of different scales.
  • the present disclosure provides an electronic device, which includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores information that can be processed by the at least one processor.
  • One or more computer programs are executed by the processor, and the one or more computer programs are executed by the at least one processor, so that the at least one processor can execute the above-mentioned data processing method.
  • the present disclosure provides a computer-readable storage medium on which a computer program is stored, wherein the computer program implements the above-mentioned data processing method when executed by a processor.
  • Figure 1 is a schematic diagram of the processing process of a self-attention mechanism provided in the related art;
  • Figure 2 is a schematic diagram of a feature pyramid network provided in the related art;
  • Figure 3 is a flow chart of a data processing method provided by an embodiment of the present disclosure.
  • Figure 4 is a flow chart of the working process of a data processing method provided by an embodiment of the present disclosure
  • Figure 5 is a schematic diagram of a target neural network provided by an embodiment of the present disclosure.
  • Figure 6 is a schematic diagram of the working process of an attention convolution layer provided by an embodiment of the present disclosure.
  • Figure 7 is a flow chart of the working process of a data processing method provided by an embodiment of the present disclosure.
  • Figure 8 is a schematic diagram of a feature fusion process provided by an embodiment of the present disclosure.
  • Figure 9 is a schematic diagram of an intermediate feature acquisition process provided by an embodiment of the present disclosure.
  • Figure 10 is a schematic diagram of a mapping relationship of fusion features provided by an embodiment of the present disclosure.
  • Figure 11 is a schematic diagram of a fusion feature acquisition process provided by an embodiment of the present disclosure.
  • Figure 12 is a schematic diagram of a mapping relationship of fusion features provided by an embodiment of the present disclosure.
  • Figure 13 is a block diagram of a data processing device provided by an embodiment of the present disclosure.
  • Figure 14 is a block diagram of a data processing device provided by an embodiment of the present disclosure.
  • Figure 15 is a block diagram of a data processing device provided by an embodiment of the present disclosure.
  • Figure 16 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
  • Figure 17 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
  • Figure 18 is a block diagram of a computer-readable storage medium provided by an embodiment of the present disclosure.
  • Neural Network is a model that imitates the structure and function of biological neural networks and has been widely used in image processing, speech recognition, natural language processing and other fields.
  • Convolution is an important concept in neural networks, and feature extraction can be achieved through convolution operations.
  • In a convolution operation, sliding convolution kernels can be used for filtering to obtain filter responses, thereby extracting features. Since the receptive field of a convolution kernel is usually a local area of the feature map, convolution has the advantage of inductive bias; however, continuous hierarchical stacking is required to extract features over a wider range and thereby correlate different areas of the entire feature map.
  • the attention mechanism is another important concept in neural networks. Its essence is to obtain new feature representations by linear weighting based on the relationship between things.
  • the self-attention mechanism (Self-Attention, SA) is a variant of the attention mechanism, which reduces the dependence on external information and is better at capturing the internal correlation of data or features.
  • The self-attention mechanism draws on the query-key-value (qkv) concept of the Transformer, a mainstream network in Natural Language Processing (NLP): each feature point in the feature map is regarded as an embedding, and a qkv self-attention operation is then performed.
  • Figure 1 is a schematic diagram of the processing process of a self-attention mechanism provided in the related art. Referring to Figure 1, it combines the self-attention mechanism with a visual backbone network and realizes self-attention processing through the qkv mechanism.
  • First, the autocorrelation feature between each pixel (or superpixel) in each frame of the image and all pixels (or superpixels) in all other frames is obtained; next, a normalization (for example, Softmax) operation is performed on the autocorrelation features to obtain weight coefficients (Weights) with a value range of [0, 1], that is, the autocorrelation coefficients; finally, the autocorrelation coefficients are multiplied by the feature g for channel expansion, and the channel expansion result is combined with X as a residual to obtain the final output result Z.
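  • For illustration, the following is a minimal NumPy sketch of the qkv self-attention flow just described (the function names, the random transform matrices, and the flattened single-frame input are assumptions, not the notation of Figure 1); note the (N, N) weight matrix, which is where the quadratic cost discussed below comes from:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_self_attention(X, Wq, Wk, Wv, Wo):
    """Global qkv self-attention with a residual, following the Figure 1 flow.

    X:  (N, c) feature map flattened to N pixels with c channels.
    Wq, Wk, Wv: (c, d) linear transforms producing q, k and the value branch g.
    Wo: (d, c) channel-expansion transform back to c channels.
    """
    q, k, g = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(q @ k.T, axis=-1)  # (N, N) autocorrelation coefficients in [0, 1]
    z = weights @ g                      # weight the feature g by the coefficients
    return X + z @ Wo                    # channel expansion, then residual with X

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))         # 16 pixels, 8 channels
Wq, Wk, Wv = (rng.standard_normal((8, 4)) for _ in range(3))
Wo = rng.standard_normal((4, 8))
print(global_self_attention(X, Wq, Wk, Wv, Wo).shape)  # (16, 8)
```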
  • the biggest advantage of the self-attention mechanism is that it can associate any two points within the entire graph with only one layer (that is, modeling any range of feature patterns). Because of this, the biggest disadvantage of the self-attention mechanism is the excessive amount of calculation (the amount of calculation is usually proportional to the square of the feature map size).
  • This holds regardless of whether the self-attention mechanism is mixed with convolution (that is, the Vision Transformer (ViT), which introduces convolution).
  • The industry has proposed self-attention improvement models based on the Criss-Cross Network (CCNet) and on the Cross-Shaped Window (CSWin).
  • The self-attention model based on CCNet calculates the relationship between a target feature pixel and the pixels in the criss-cross area formed by its row and column, and uses this relationship to weight the features of the target pixel to obtain more effective target features. Similarly, the self-attention model based on CSWin calculates the relationship between a target feature pixel and the pixels in its cross-shaped window area, and uses this relationship to weight the features of the target pixel to obtain more effective target features.
  • The data processing method provided by the embodiments of the present disclosure is not only dedicated to reducing the calculation amount of the self-attention model, but also takes into account that, when extracting features based on convolution operations, the filter response is maximal and the feature extraction result is best when the feature pattern to be extracted corresponds exactly to the convolution kernel with no rotation angle. However, in practical applications, the same feature pattern may appear at different rotation angles in an image, and a single convolution kernel cannot extract all the features corresponding to the same feature pattern. In other words, convolution operations do not support rotation invariance when extracting features.
  • Therefore, embodiments of the present disclosure provide a neural network including a self-attention convolution layer, which performs feature extraction based on the self-attention convolution layer to obtain more effective output features.
  • feature fusion is often used to combine the high resolution of low-level features with the high semantic information of high-level features, thereby enhancing the feature expression effect.
  • In the related art, feature fusion models such as the Feature Pyramid Network (FPN) use channel concatenation or point-by-point addition after resampling to achieve feature fusion at different levels, which is relatively simple to implement.
  • FIG. 2 is a schematic diagram of a feature pyramid network provided in the related art. Referring to FIG. 2, the FPN uses a standard feature extraction network to extract features at multiple scales, then adds a lightweight top-down path and connects it laterally with the feature extraction network. For each level of features extracted by the feature extraction network, double upsampling is first performed to obtain the upsampled features, which are then superimposed with the next-level features that have undergone 1×1 convolution processing (conv), so as to obtain the corresponding fusion features; subsequent data processing operations are performed based on the fusion features.
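  • As a rough sketch of this top-down fusion, assuming nearest-neighbour 2x upsampling and modelling each 1×1 convolution as a per-pixel channel matrix (all names and sizes here are illustrative, not the FPN reference implementation):

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbour 2x upsampling of an (h, w, c) feature map."""
    return f.repeat(2, axis=0).repeat(2, axis=1)

def fpn_fuse(features, lateral_weights):
    """Top-down FPN-style fusion: upsample the coarser level by 2 and add it
    to the next finer level after a 1x1 convolution (a per-pixel channel mix).

    features: list of (h_i, w_i, c_i) maps, finest first, each half the size
    of the previous one. lateral_weights: (c_i, c_out) matrices standing in
    for the 1x1 convolutions.
    """
    laterals = [f @ w for f, w in zip(features, lateral_weights)]  # 1x1 conv
    fused = [laterals[-1]]                     # coarsest level passes through
    for lat in reversed(laterals[:-1]):
        fused.append(lat + upsample2x(fused[-1]))
    return fused[::-1]                         # finest first again

rng = np.random.default_rng(0)
feats = [rng.standard_normal((16 // 2**i, 16 // 2**i, 8)) for i in range(3)]
weights = [rng.standard_normal((8, 8)) for _ in range(3)]
print([f.shape for f in fpn_fuse(feats, weights)])  # [(16, 16, 8), (8, 8, 8), (4, 4, 8)]
```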
  • Considering that the attention mechanism can establish an association between any two pixels in the feature map, in embodiments of the present disclosure the attention mechanism is applied to the feature fusion process to fuse features at different levels, so that correlations in the spatial dimension are integrated into the features.
  • the data processing method according to the embodiment of the present disclosure can be executed by electronic equipment such as a terminal device or a server.
  • The terminal device can be a vehicle-mounted device, user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a wearable device, etc.
  • the method can be implemented by the processor calling computer readable program instructions stored in the memory.
  • the data processing method of the embodiment of the present disclosure may be executed through a server, where the server may be an independent physical server, a server cluster composed of multiple servers, or a cloud server capable of cloud computing.
  • Figure 3 is a flow chart of a data processing method provided by an embodiment of the present disclosure.
  • the data processing method includes:
  • step S31 the data to be processed is input into the target neural network for processing, and the processing result of the data to be processed is obtained.
  • At least one convolution layer of the target neural network is an attention convolution layer based on the first attention mechanism, and/or at least two convolution layers of the target neural network perform feature fusion based on the second attention mechanism.
  • the first attention mechanism includes a self-attention mechanism for local areas of features
  • the second attention mechanism includes an attention mechanism for local areas of output features between output features of different scales.
  • the data to be processed may include any one of image data, voice data, text data, and video data.
  • the embodiments of this disclosure do not limit the type and content of data to be processed.
  • In some embodiments, the target neural network uses the attention convolution layer to perform self-attention operations on the data input to this layer and obtains output features, so that the other network layers of the target neural network can perform data processing based on the output features to obtain the processing result. The target neural network can also fuse output features (including but not limited to the output features of the attention convolution layer) based on the second attention mechanism to obtain fusion features, which are used by other network layers for further data processing to obtain the processing result.
  • the target neural network can be used to perform any of image processing tasks, speech processing tasks, text processing tasks, and video processing tasks.
  • The processing result of the data to be processed can be any one of an image processing result, a speech processing result, a text processing result, and a video processing result (where the processing can include operations such as recognition, classification, and labeling); the result type is related to the type of the data to be processed, the content of the data to be processed, and the task performed by the target neural network.
  • the embodiments of the present disclosure do not limit the specific task types and processing result types performed by the target neural network.
  • In some embodiments, the target neural network uses the attention convolution layer to perform self-attention operations on the image classification features input to this layer, and obtains the image classification output features.
  • the target neural network includes at least one attention convolution layer based on the first attention mechanism, and/or feature fusion is performed between at least two convolution layers of the target neural network based on the second attention mechanism.
  • the processing process of the target neural network at least includes feature extraction and/or feature fusion. It should be noted that both feature extraction and feature fusion use the local attention mechanism, that is, attention operations are performed on local areas of the feature map, and the original features are updated based on the attention operation results. Subsequently, the feature extraction working process and feature fusion working process of the target neural network will be explained respectively.
  • In some embodiments, the target neural network includes attention convolution layers, which can be used for feature extraction. The following describes the working process of feature extraction by the target neural network in conjunction with Figure 4.
  • Figure 4 is a flow chart of the working process of a data processing method provided by an embodiment of the present disclosure.
  • the data processing method includes:
  • Step S41 For any attention convolution layer, linearly transform the input data of the attention convolution layer to obtain the first query feature, the first key feature and the first value feature corresponding to the input data.
  • Step S42 Determine the first attention feature corresponding to the plurality of target feature points of the first query feature based on the first query feature, the first key feature, and the first value feature.
  • The first attention feature includes first attention values corresponding to the multiple target feature points (that is, the first attention feature includes the first attention value corresponding to each target feature point). The first attention value is determined for the local area corresponding to the target feature point, where the local area corresponding to a target feature point is the area in which that target feature point is located in the first query feature. The first attention value is used to characterize the association between the feature points in the local area and the target feature point.
  • Step S43 Determine the output feature corresponding to the attention convolution layer based on the first attention feature and the input data.
  • the processing result is obtained after the output features are processed by the network layer after the attention convolution layer.
  • The input data of the attention convolution layer is the data obtained after the data to be processed has passed through the network layers preceding the attention convolution layer. In some embodiments, the input data can be transformed with a preset transformation matrix; that is, the linear transformation of the input data is implemented by matrix multiplication, thereby obtaining the first query feature, the first key feature, and the first value feature.
  • the input data can also be convolved based on a preset convolution kernel, and the convolution result is the first query feature, the first key feature, or the first value feature.
  • the embodiments of the present disclosure do not limit the linear transformation method of input data.
  • In some embodiments, the first query feature, the first key feature, and the first value feature can be obtained by multiplying the input data by the same transformation matrix, or by performing the same linear transformation on the input data; in this case, the first query feature, the first key feature, and the first value feature are exactly the same features.
  • For example, if the input data is a matrix F of size h1 × w1, it is multiplied by a preset transformation matrix w of size h2 × w2 (where w1 = h2) to obtain the multiplication result F′ of size h1 × w2, and F′ is used as the first query feature, the first key feature, and the first value feature. Here h1 and w1 respectively represent the height and width of the matrix F corresponding to the input data, and h2 and w2 respectively represent the height and width of the transformation matrix w.
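  • The matrix example above can be checked with a few lines of NumPy (the concrete sizes 4×6 and 6×5 are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 6))   # input data F, size h1 x w1
w = rng.standard_normal((6, 5))   # preset transformation matrix w, size h2 x w2, h2 == w1
F_prime = F @ w                   # multiplication result F', size h1 x w2
q = k = v = F_prime               # one shared transform: identical query, key and value features
print(F_prime.shape)              # (4, 5)
```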
  • different first query features, first key features, and first value features can also be obtained based on different transformation matrices or different linear transformation methods.
  • the first attention feature can be determined through self-attention operation in step S42.
  • In some embodiments, the process of determining the first attention feature includes: for each target feature point, determining the first key feature points corresponding to the local area in the first key feature, and the first value feature points corresponding to the local area in the first value feature; determining the similarity between the target feature point and each of these first key feature points, to obtain the first similarity feature corresponding to the target feature point; obtaining the first attention value corresponding to the target feature point based on the first similarity feature and the first value feature points; and obtaining the first attention feature based on the first attention values of the multiple target feature points.
  • the similarity between feature points can be calculated based on cosine similarity, Pearson correlation coefficient (Pearson Correlation), etc.
  • the embodiment of the present disclosure does not limit the method of determining similarity.
  • In some embodiments, determining the similarity between the target feature point and multiple first key feature points includes: in the case where the similarity S_ij between the i-th target feature point and the j-th first key feature point has already been obtained, determining the similarity S_ji between the j-th target feature point and the i-th first key feature point based on S_ij; where i and j are integers greater than or equal to 1 and less than or equal to M, M is the total number of feature points in the first query feature or the first key feature (M is an integer greater than or equal to 1), and i ≠ j. Determining S_ji based on S_ij can effectively reduce the amount of calculation, thereby reducing the data processing pressure.
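  • A small sketch of this reuse, assuming a symmetric similarity measure such as cosine similarity (the helper name and shapes are illustrative):

```python
import numpy as np

def pairwise_similarity_symmetric(E):
    """Cosine similarity between M feature points (rows of E): each S_ij is
    computed once and reused as S_ji, roughly halving the similarity work."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    M = len(E)
    S = np.eye(M)                     # self-similarity of a normalised vector is 1
    for i in range(M):
        for j in range(i + 1, M):
            S[i, j] = En[i] @ En[j]   # compute S_ij once...
            S[j, i] = S[i, j]         # ...and determine S_ji from it
    return S

rng = np.random.default_rng(0)
print(pairwise_similarity_symmetric(rng.standard_normal((5, 3))).shape)  # (5, 5)
```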
  • In the embodiments of the present disclosure, the self-attention calculation is performed only within the local area for each target feature point to obtain the first attention value, and the first attention feature is obtained based on the first attention values of the multiple target feature points. Self-attention calculation based on local areas can effectively reduce the amount of calculation; moreover, because the local area has a certain inductive bias relative to the global area and also provides rotation invariance, better-performing features can be obtained.
  • The target feature point is a feature point belonging to the first query feature, and it has a corresponding relationship with the feature points in the input data; performing the self-attention operation on the target feature point based on the local area is, in essence, determining the corresponding correlation relationship for the input data based on the local area. It should be noted that the range of the target feature points can be flexibly set as needed: it can include all feature points in the first query feature, or one or more specified feature points in the first query feature, which is not limited in the embodiments of the present disclosure.
  • In some embodiments, the data processing method may also include: selecting multiple feature points from the first query feature as target feature points; and determining, according to a preset size, the local area corresponding to each of the multiple target feature points. In other words, when there are multiple target feature points, the local area corresponding to each target feature point can be determined according to the preset size.
  • the local area corresponding to the target feature point may be in the form of a vector (for example, a text processing scenario) or a rectangular (including square) form (for example, an image or video processing scenario).
  • In some embodiments, the first query feature is either a vector or a matrix. When the first query feature is a vector, the preset size includes a preset number of feature points, which is less than the total number of feature points of the first query feature, and the local area is a vector centered on the target feature point whose number of feature points equals the preset number. When the first query feature is a matrix, the preset size includes a preset number of rows and a preset number of columns, where the preset number of rows is less than the total number of rows of the first query feature and the preset number of columns is less than the total number of columns of the first query feature, and the local area is a rectangular area centered on the target feature point with the preset number of rows as its height and the preset number of columns as its width.
  • For example, assume the first query feature is a 5×5 matrix, all feature points in the first query feature are determined as target feature points, and the area centered on each target feature point with a side length of 3 feature points is set as the local area corresponding to that target feature point. For target feature points located at the edge of the first query feature, whose local areas cannot constitute a 3×3 feature area, the local areas can be supplemented into 3×3 feature areas by zero padding to facilitate calculation.
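  • A minimal sketch of taking such local areas, for both the vector case and the matrix case, with zero padding at the edges (the function names and the 5×5 / 3×3 sizes simply follow the example above):

```python
import numpy as np

def local_area_vector(feat, idx, k):
    """Local area for a 1-D first query feature: a window of k feature points
    centred on the target point (k odd), zero-padded at the edges."""
    pad = k // 2
    padded = np.pad(feat, (pad, pad))
    return padded[idx: idx + k]

def local_area_matrix(feat, i, j, rows, cols):
    """Local area for a 2-D first query feature: a rows x cols rectangle
    centred on the target point, zero-padded at the edges."""
    pr, pc = rows // 2, cols // 2
    padded = np.pad(feat, ((pr, pr), (pc, pc)))
    return padded[i: i + rows, j: j + cols]

x = np.arange(25.0).reshape(5, 5)
print(local_area_matrix(x, 0, 0, 3, 3))   # edge point: zero-padded 3x3 area
```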
  • the process of obtaining the first attention feature in step S42 includes: first determining the target feature point, then determining the local area, and then determining the first attention feature.
  • a sliding window method can be used to obtain the first attention feature.
  • In some embodiments, in step S42, the process of obtaining the first attention feature based on the sliding window method includes: setting the sliding window and the step size according to the preset size of the local area; starting from a preset initial sliding position, sliding the window along the first query feature with the step size, and determining, for each sliding operation, the target feature point corresponding to the sliding window, the first key feature points corresponding to the sliding window in the first key feature, and the first value feature points corresponding to the sliding window in the first value feature; determining the similarity between the target feature point and each first key feature point, to obtain the first similarity feature corresponding to the target feature point; obtaining the first attention value corresponding to the target feature point according to the first similarity feature and the first value feature points; and obtaining the first attention feature according to the first attention values of the multiple target feature points.
  • In each sliding operation, the sliding window is at a specific position, which delimits the local area corresponding to that sliding operation. Accordingly, for each sliding operation, the first key feature points corresponding to the sliding window in the first key feature and the first value feature points corresponding to the sliding window in the first value feature can be obtained.
  • The method of obtaining the first attention feature based on the sliding window is similar to the method of extracting features based on a convolution kernel; the difference is that the first attention feature determines feature values based on the self-attention operation, whereas the convolution kernel determines feature values through the convolution operation.
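  • Putting the pieces together, the following is a runnable NumPy sketch of sliding-window local self-attention in the spirit of steps S41 to S43 (all names, the 3×3 window, and the output projection Wo used to match sizes for the residual are assumptions for illustration, not the patented implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_window_self_attention(F, Wq, Wk, Wv, Wo, k=3):
    """Sliding-window local self-attention over an (h, w, c) input F.

    Every feature point of the first query feature is treated as a target
    feature point; its first attention value is computed only over the k x k
    local area of the key/value features centred on it (zero-padded at the
    edges), and the result is projected back and added to F as a residual.
    """
    h, w, c = F.shape
    Q, K, V = F @ Wq, F @ Wk, F @ Wv                    # linear transforms, (h, w, d)
    d = Q.shape[-1]
    pad = k // 2
    Kp = np.pad(K, ((pad, pad), (pad, pad), (0, 0)))    # zero padding at the edges
    Vp = np.pad(V, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, d))
    for i in range(h):                                   # slide the window with step 1
        for j in range(w):
            keys = Kp[i:i + k, j:j + k].reshape(-1, d)   # first key feature points
            vals = Vp[i:i + k, j:j + k].reshape(-1, d)   # first value feature points
            sim = softmax(keys @ Q[i, j])                # first similarity feature
            out[i, j] = sim @ vals                       # first attention value
    return F + out @ Wo                                  # transform to c channels, residual

rng = np.random.default_rng(0)
F = rng.standard_normal((5, 5, 8))                       # matches the 5x5 example above
Wq, Wk, Wv = (rng.standard_normal((8, 4)) for _ in range(3))
Wo = rng.standard_normal((4, 8))
print(local_window_self_attention(F, Wq, Wk, Wv, Wo).shape)  # (5, 5, 8)
```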
  • In some embodiments, the output feature may be determined according to the first attention feature and the input data in step S43. Determining the output feature can follow at least two methods: the first method is to linearly transform the first attention feature to the same size as the input data and superimpose the transformed first attention feature onto the input data, thereby obtaining the output feature; the second method is to establish a position mapping relationship between the feature points of the first attention feature and those of the input data, and to use this position mapping relationship to generate the output feature based on the first attention feature and the input data.
  • In the first method, the size of the output feature can remain the same as that of the input data. In the second method, if the target feature points are not all of the feature points in the first query feature, the size of the output feature differs from that of the input data, and the output feature may only include feature components corresponding to the target feature points.
  • In some embodiments, in step S43, determining the output feature corresponding to the attention convolution layer according to the first attention feature and the input data includes: linearly transforming the first attention feature to obtain a first matching attention feature with the same size as the input data; and superimposing the first matching attention feature with the input data to obtain the output feature corresponding to the input data.
  • In some embodiments, in step S43, determining the output feature corresponding to the attention convolution layer based on the first attention feature and the input data includes: rearranging the first attention values according to the position information, in the input data, of the target feature points corresponding to the first attention values in the first attention feature, to obtain a second matching attention feature; and obtaining the output feature corresponding to the input data according to the second matching attention feature and the feature points corresponding to the target feature points in the input data.
  • the first attention mechanism in the embodiment of the present disclosure essentially belongs to the category of self-attention, which has "local area” characteristics, “inductive bias” characteristics and “rotation invariant” characteristics.
  • “local area” means that when obtaining the first attention feature, only the self-attention operation is performed on the local area of the feature, rather than the self-attention operation on the global feature, which can effectively reduce the amount of calculation;
  • The “inductive bias” characteristic is an additional property arising from performing self-attention operations on local areas: compared with global self-attention operations, which lack inductive bias, self-attention operations performed only on local areas have a certain inductive bias capability;
  • the "rotation-invariant” feature is due to the fact that the self-attention operation itself focuses on the correlation between feature points. This correlation has nothing to do with the distance and relative position between features, making it insensitive to the rotation angle.
  • Figure 5 is a schematic diagram of a target neural network provided by an embodiment of the present disclosure.
  • the target neural network includes a first network layer structure, an attention convolution layer, and a second network layer structure.
  • The first network layer structure is located before the attention convolution layer and may include one or more network layers (for example, convolution layers). The second network layer structure is located after the attention convolution layer and may also include one or more network layers (for example, a batch normalization layer (Batchnorm Layer) and an activation layer (Activation Layer)).
  • the first network layer structure first processes the data to be processed, obtains intermediate data, and inputs the intermediate data into the attention convolution layer.
  • This intermediate data is the input data of the attention convolution layer.
  • the attention convolution layer processes the input data according to any implementation method in the embodiments of the present disclosure, obtains output features, and inputs the output features into the second network layer structure.
  • the second network layer structure processes the output features to obtain the processing results, and the target neural network outputs the processing results.
  • Figure 6 is a schematic diagram of the working process of an attention convolution layer provided by an embodiment of the present disclosure.
  • Referring to Figure 6, the input data F of the attention convolution layer is a tensor of size h_f × w_f × c, where h_f represents the height of F, w_f represents the width of F, and c represents the number of channels of F. F is first multiplied by the first transformation matrix wq, the second transformation matrix wk, and the third transformation matrix wv respectively, implementing three linear transformations of F and correspondingly obtaining the first query feature Q, the first key feature K, and the first value feature V. The size of wq is w_f × h_q × 1, and correspondingly Q is a tensor of size h_q × w_q × c; the size of wk is w_f × h_k × 1, and correspondingly K is a tensor of size h_k × w_k × c; the size of wv is w_f × h_v × 1, and correspondingly V is a tensor of size h_v × w_v × c.
  • In some embodiments, the process of obtaining the similarity feature S includes: first, transforming Q into matrix form of size {h_q*w_q} × c; next, transforming K into matrix form of size c × {h_k*w_k}; finally, performing a matrix multiplication on the transformed matrices to obtain the similarity feature S of size {h_q*w_q} × {h_k*w_k}.
  • The significance of S is that the element at position (i, j) is the influence of the j-th element on the i-th element, or the similarity between the j-th element and the i-th element, thereby capturing the dependency between any two elements in the global context. It should be noted that S is sparse: among its elements, only the similarities between target feature points and local-area feature points have non-zero values, and the values of all other elements, for which no self-attention operation is performed, are zero.
  • Then, V is transformed to obtain V′, a matrix of size {h_v*w_v} × c. The inner product operation is performed on S and V′ to obtain a matrix of size {h_q*w_q} × c, which is rearranged into a tensor of size h_q × w_q × c, namely the attention feature P. Finally, P is linearly transformed to make it the same size as F, and then added to F to obtain the final output feature F′.
  • It should be noted that the number of channels c can be reduced during the linear transformation process; that is, the numbers of channels of Q, K, and V can be smaller than the number of channels of F, and the number of channels of V can differ from that of Q and K (the numbers of channels of Q and K are usually the same).
  • Determining the output features based on the sliding window method is similar to the calculation method of the above process, and will not be described again here.
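  • The shapes of this Figure 6 pipeline can be checked with a short NumPy sketch (a dense S, square feature maps, and channel-reducing transforms are assumed here purely for illustration; in the patented method S is sparse, as noted above):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
h, w, c, d = 8, 8, 16, 8                 # d < c: channel reduction in the q/k/v transforms
F = rng.standard_normal((h, w, c))
Wq, Wk, Wv = (rng.standard_normal((c, d)) for _ in range(3))
Wo = rng.standard_normal((d, c))

Q = (F @ Wq).reshape(h * w, d)           # Q in {h_q*w_q} x d matrix form
K = (F @ Wk).reshape(h * w, d).T         # K in d x {h_k*w_k} matrix form
Vp = (F @ Wv).reshape(h * w, d)          # V' in {h_v*w_v} x d matrix form

S = softmax(Q @ K)                       # similarity feature, {h_q*w_q} x {h_k*w_k}
# In the patented method S is sparse: only similarities between target feature
# points and their local-area feature points are non-zero. A dense S is used
# here purely to check the shapes of the pipeline.
P = (S @ Vp).reshape(h, w, d)            # attention feature P, back in tensor form
F_out = F + P @ Wo                       # linear transform to c channels, residual with F
print(F_out.shape)                       # (8, 8, 16)
```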
  • Figure 7 is a flow chart of the working process of a data processing method provided by an embodiment of the present disclosure.
  • the N-level convolutional layers of the target neural network perform feature fusion based on the second attention mechanism.
  • the scales of the features output by the convolutional layers at each level are different, and N is an integer greater than or equal to 2.
  • the second attention mechanism includes an attention mechanism targeting local areas of the output features between output features of different scales. Referring to Figure 7, the method includes:
  • Step S71: for the n-th level convolution layer, determine the n-th level second attention feature according to the n-1-th level intermediate feature of the n-1-th level convolution layer and the n-th level initial feature output by the n-th level convolution layer, where n represents the level index of the convolution layer, n is an integer, and 2 ≤ n ≤ N-1.
  • Step S72 Update the n-th level initial features according to the n-th level second attention features to obtain the n-th level intermediate features.
  • Step S73: determine the n-th level third attention feature according to the n+1-th level fusion feature of the n+1-th level convolution layer and the n-th level intermediate feature.
  • Step S74 Update the n-th level intermediate features according to the n-th level third attention feature to obtain the n-th level fusion feature.
  • Among them, the initial features are the features in their initial state after the data to be processed has been processed by the convolution layers of the target neural network; the intermediate features are intermediate-state features obtained from the initial features by incorporating feature information of other levels; and the fusion features are features obtained by further fusing the intermediate features with features of other levels.
  • the processing result is the result obtained after the fused features are processed by the network layer after the convolutional layer.
  • the first-level intermediate features can be set equal to the first-level initial features.
  • the first-level fusion features are obtained by updating the first-level intermediate features based on the first-level third attention features.
  • The first-level third attention feature can be obtained based on the second-level fusion feature and the first-level intermediate feature.
  • the Nth level fusion feature can be set equal to the Nth level intermediate feature.
  • In some embodiments, the above feature fusion process can be summarized as follows: first, the n-th level initial feature is updated according to the n-th level second attention feature to obtain the n-th level intermediate feature; secondly, the n-th level intermediate feature is updated according to the n-th level third attention feature to obtain the n-th level fusion feature.
  • The n-th level second attention feature and the n-th level third attention feature are both features obtained based on the second attention mechanism. The n-th level second attention feature is obtained from the n-1-th level intermediate feature and the n-th level initial feature, and reflects the correlation between the feature points of the n-1-th level intermediate feature and the feature points of the n-th level initial feature. The n-th level third attention feature is obtained from the n+1-th level fusion feature and the n-th level intermediate feature, and reflects the correlation between the feature points of the n+1-th level fusion feature and the feature points of the n-th level intermediate feature.
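  • Before the detailed formulations below, the two-pass structure of steps S71 to S74 can be sketched as follows (the per-level shapes and the toy global cross-attention are assumptions; the patented second attention mechanism further restricts each query to a local feature fusion area, for which a sketch is given after Figure 10):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_levels(initials, cross_attention):
    """Two-pass fusion skeleton following steps S71-S74.

    initials: list of N per-level features, level 1 (finest) first.
    cross_attention(query_feat, kv_feat): stands in for the second/third
    attention computation; it must return a feature shaped like query_feat.
    """
    N = len(initials)
    inter = [None] * N
    inter[0] = initials[0]                        # level-1 intermediate = level-1 initial
    for n in range(1, N):                         # bottom-up pass: second attention features
        inter[n] = initials[n] + cross_attention(initials[n], inter[n - 1])
    fused = [None] * N
    fused[N - 1] = inter[N - 1]                   # level-N fusion = level-N intermediate
    for n in range(N - 2, -1, -1):                # top-down pass: third attention features
        fused[n] = inter[n] + cross_attention(inter[n], fused[n + 1])
    return fused

def toy_cross_attention(q_feat, kv_feat):
    """Toy global cross-attention between two levels; the patented method
    would restrict each query point to a local feature fusion area instead."""
    hq, wq, c = q_feat.shape
    Q = q_feat.reshape(-1, c)
    KV = kv_feat.reshape(-1, c)
    return (softmax(Q @ KV.T, axis=-1) @ KV).reshape(hq, wq, c)

rng = np.random.default_rng(0)
levels = [rng.standard_normal((16 // 2**i, 16 // 2**i, 4)) for i in range(3)]
print([f.shape for f in fuse_levels(levels, toy_cross_attention)])
```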
  • In some embodiments, in step S71, for the n-th level convolution layer, determining the n-th level second attention feature according to the n-1-th level intermediate feature of the n-1-th level convolution layer and the n-th level initial feature output by the n-th level convolution layer includes: performing a linear transformation on the n-th level initial feature to obtain the n-th level second query feature corresponding to the n-th level initial feature; performing a linear transformation on the n-1-th level intermediate feature to obtain the n-1-th level second key feature and the n-1-th level second value feature corresponding to the n-1-th level intermediate feature; determining the mapping relationship between the feature points of the n-th level second query feature and the feature points of the n-1-th level second key feature; determining, according to the mapping relationship, the feature fusion area of the n-1-th level second key feature corresponding to each feature point of the n-th level second query feature; determining the similarity between the feature points of the n-th level second query feature and the feature points within the feature fusion area of the n-1-th level second key feature, to obtain the n-th level second similarity feature; determining the n-1-th level second value feature points corresponding to the feature fusion area in the n-1-th level second value feature; and computing the inner product between the n-th level second similarity feature and the n-1-th level second value feature points to obtain the n-th level second attention feature.
  • In some embodiments, the n-th level initial feature is updated according to the n-th level second attention feature to obtain the n-th level intermediate feature, which includes: superimposing the n-th level second attention feature with the n-th level initial feature to obtain the n-th level intermediate feature.
  • In some embodiments, in step S73, determining the n-th level third attention feature according to the n+1-th level fusion feature of the n+1-th level convolution layer and the n-th level intermediate feature includes: performing a linear transformation on the n-th level intermediate feature to obtain the n-th level third query feature corresponding to the n-th level intermediate feature; performing a linear transformation on the n+1-th level fusion feature to obtain the n+1-th level third key feature and the n+1-th level third value feature corresponding to the n+1-th level fusion feature; determining the mapping relationship between the feature points of the n-th level third query feature and the feature points of the n+1-th level third key feature; determining, according to the mapping relationship, the feature fusion area of the n+1-th level third key feature corresponding to each feature point of the n-th level third query feature; determining the similarity between the feature points of the n-th level third query feature and the feature points within the feature fusion area of the n+1-th level third key feature, to obtain the n-th level third similarity feature; determining the n+1-th level third value feature points corresponding to the feature fusion area in the n+1-th level third value feature; and computing the inner product between the n-th level third similarity feature and the n+1-th level third value feature points to obtain the n-th level third attention feature.
  • In some embodiments, in step S74, the n-th level intermediate feature is updated according to the n-th level third attention feature to obtain the n-th level fusion feature, which includes: superimposing the n-th level third attention feature with the n-th level intermediate feature to obtain the n-th level fusion feature.
  • the second attention mechanism in the embodiment of the present disclosure has “other attention” characteristics and "local area” characteristics.
  • The “other attention” characteristic means that the second attention feature and the third attention feature are obtained by attention operations between different features (that is, between output features of different scales), rather than between different feature points of the same feature. The “local area” characteristic means that, when attention operations are performed between different features, they are not performed over all feature points, but only over the one or more feature points of the two features that have a mapping relationship.
  • FIG 8 is a schematic diagram of a feature fusion process provided by an embodiment of the present disclosure.
  • Referring to Figure 8, the level-1 initial features, level-2 initial features, ..., and level-N initial features are the features obtained by sampling the input data with the level-1 to level-N convolution layers, respectively.
  • The level-1 to level-N convolution layers correspond to sampling rates X_1 to X_N respectively, which decrease in sequence from X_1 to X_N; accordingly, the level-1 initial features have the highest resolution and the level-N initial features have the lowest resolution.
  • the first-level intermediate feature is set equal to the first-level initial feature.
  • Next, the level-2 second attention feature is determined according to the level-1 intermediate feature of the level-1 convolution layer and the level-2 initial feature output by the level-2 convolution layer, and the level-2 initial feature is updated with the level-2 second attention feature to obtain the level-2 intermediate feature. By analogy, the level-3 to level-N intermediate features can be obtained.
  • the fusion features of each convolutional layer can be further determined.
  • the Nth level intermediate feature is set to be equal to the Nth level fusion feature.
  • First, the N-1-th level third attention feature is determined according to the N-th level fusion feature and the N-1-th level intermediate feature, and the N-1-th level intermediate feature is updated according to the N-1-th level third attention feature to obtain the N-1-th level fusion feature.
  • the N-2 level fusion features to the 1st level fusion features can be obtained by processing in the above manner.
  • FIG. 9 is a schematic diagram of an intermediate feature acquisition process provided by an embodiment of the present disclosure.
  • Referring to Figure 9, F1′ represents the n-1-th level intermediate feature, a tensor of size h_1 × w_1 × c_1; F2 represents the n-th level initial feature, a tensor of size h_2 × w_2 × c_1; n is an integer greater than 1.
  • First, a linear transformation is performed on F2 to generate the n-th level second query feature Q2 corresponding to F2; Q2 is a tensor of size h_q1 × w_q1 × c_1.
  • Then, a linear transformation is performed on F1′ to obtain the n-1-th level second key feature K2 and the n-1-th level second value feature V2 corresponding to F1′. K2 is a tensor of size h_k1 × w_k1 × c_1 and V2 is a tensor of size h_v1 × w_v1 × c_1, where h_k1 = h_v1 and w_k1 = w_v1. Since the n-th level initial feature has a lower resolution than the n-1-th level intermediate feature, h_q1 is smaller than h_k1 (= h_v1) and w_q1 is smaller than w_k1 (= w_v1).
  • Next, the similarity between each feature point of Q2 and the feature points of K2 within the corresponding feature fusion area is determined to obtain the n-th level second similarity feature S2 (the values corresponding to feature points for which no similarity is calculated can be set to 0); the size of S2 is (h_q1*w_q1) × (h_k1*w_k1). Then, V2 is expanded into a matrix V2′ of size (h_v1*w_v1) × c_1, and the inner product of S2 and V2′ is calculated (that is, {(h_q1*w_q1) × (h_k1*w_k1)} ⊙ {(h_v1*w_v1) × c_1}, where ⊙ represents the inner product operation) to obtain the n-th level second attention feature. Finally, this attention feature is linearly transformed to the same size as F2 and superimposed with F2 to obtain the n-th level intermediate feature F2′, whose size is the same as that of F2.
  • It should be noted that the mapping relationship between the feature points of Q2 and the feature points of K2 (that is, the correspondence between the position of a feature in Q2 and the position of the same feature in K2) allows the second similarity to be calculated only for the feature fusion area, which effectively reduces the amount of calculation compared with calculating the second similarity over the entire feature area.
  • Figure 10 is a schematic diagram of a mapping relationship of fusion features provided by an embodiment of the present disclosure.
  • Referring to Figure 10, the shaded area in the feature map corresponding to Q2 and the shaded area in the feature map corresponding to K2 have a mapping relationship (that is, both correspond to the same feature pattern). For a feature point within the shaded area of the Q2 feature map, the calculation range of its second similarity, i.e. the influence range of the second attention mechanism, is limited to the shaded area of the K2 feature map.
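  • The following runnable sketch restricts cross-level attention to such a feature fusion area (the coordinate-scaling mapping, the square transforms, and all sizes are assumptions chosen so the attention result can be superimposed directly onto the query feature; it could also serve as the cross_attention placeholder in the earlier two-pass sketch):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fusion_area_cross_attention(F_q, F_kv, Wq, Wk, Wv, radius=1):
    """Cross-level attention restricted to a local feature fusion area.

    Each query point (i, j) attends only to key/value points whose positions
    lie within `radius` of its mapped position in the other-level feature map.
    The mapping here is plain coordinate scaling, a stand-in for the
    feature-pattern correspondence of Figures 10 and 12.
    """
    hq, wq, _ = F_q.shape
    hk, wk, _ = F_kv.shape
    Q, K, V = F_q @ Wq, F_kv @ Wk, F_kv @ Wv      # linear transforms
    d = Q.shape[-1]
    out = np.zeros((hq, wq, d))
    for i in range(hq):
        for j in range(wq):
            ci, cj = i * hk // hq, j * wk // wq    # mapped centre in the K/V map
            i0, i1 = max(ci - radius, 0), min(ci + radius + 1, hk)
            j0, j1 = max(cj - radius, 0), min(cj + radius + 1, wk)
            keys = K[i0:i1, j0:j1].reshape(-1, d)  # fusion-area key feature points
            vals = V[i0:i1, j0:j1].reshape(-1, d)  # fusion-area value feature points
            sim = softmax(keys @ Q[i, j])          # similarity over the fusion area only
            out[i, j] = sim @ vals                 # attention value for this query point
    return out  # with square transforms, this can be superimposed directly onto F_q

rng = np.random.default_rng(0)
F2 = rng.standard_normal((8, 8, 4))                # n-th level initial feature (queries)
F1p = rng.standard_normal((16, 16, 4))             # (n-1)-th level intermediate feature (keys/values)
Wq, Wk, Wv = (rng.standard_normal((4, 4)) for _ in range(3))
F2p = F2 + fusion_area_cross_attention(F2, F1p, Wq, Wk, Wv)  # n-th level intermediate F2'
print(F2p.shape)                                   # (8, 8, 4)
```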
  • Figure 11 is a schematic diagram of a fusion feature acquisition process provided by an embodiment of the present disclosure.
  • Referring to Figure 11, F3′ represents the n-th level intermediate feature, a tensor of size h_3 × w_3 × c_2; F4′′ represents the n+1-th level fusion feature, a tensor of size h_4 × w_4 × c_2.
  • First, a linear transformation is performed on F3′ to generate the n-th level third query feature Q3 corresponding to F3′, a tensor of size h_q2 × w_q2 × c_2; a linear transformation is performed on F4′′ to obtain the n+1-th level third key feature K3 and the n+1-th level third value feature V3 corresponding to F4′′, where K3 is a tensor of size h_k2 × w_k2 × c_2 and V3 is a tensor of size h_v2 × w_v2 × c_2, with h_k2 = h_v2 and w_k2 = w_v2. Since the n-th level intermediate feature has a higher resolution than the n+1-th level fusion feature, h_q2 is greater than h_k2 (= h_v2) and w_q2 is greater than w_k2 (= w_v2).
  • Next, the feature fusion area of K3 corresponding to each feature point of Q3 is determined, and the similarity between each feature point of Q3 and the feature points within the feature fusion area of K3 is computed to obtain the n-th level third similarity feature S3 (the values corresponding to feature points for which no similarity is calculated can be set to 0); the size of S3 is (h_q2*w_q2) × (h_k2*w_k2). Then, V3 is expanded into a matrix V3′ of size (h_v2*w_v2) × c_2, and the inner product of S3 and V3′ is calculated to obtain a matrix of size (h_q2*w_q2) × c_2, which is rearranged into a tensor of size h_q2 × w_q2 × c_2, namely the n-th level third attention feature P3.
  • P3 is linearly transformed to make it the same size as F3′, and then superimposed with F3′ to obtain the nth level fusion feature F3′′.
  • the size of F3′′ is the same as F3′.
In some possible implementations, the mapping relationship between the feature points of Q3 and the feature points of K3 is the correspondence between the position of a feature pattern in Q3 and the position of the same pattern in K3. When the third similarity is calculated, it can be calculated only over the feature fusion area, which effectively reduces the amount of computation compared with calculating the third similarity over the entire feature area.

Figure 12 is a schematic diagram of a mapping relationship of fusion features provided by an embodiment of the present disclosure. Referring to Figure 12, the shaded area in the feature map corresponding to Q3 and the shaded area in the feature map corresponding to K3 have a mapping relationship (that is, both correspond to the same feature pattern). For any feature point within the shaded area of Q3, the calculation range of its third similarity, i.e. the influence range of the second attention mechanism, is limited to the shaded area of the K3 feature map.
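The overall fusion schedule described above amounts to a bottom-up pass that produces intermediate features followed by a top-down pass that produces fusion features. Below is a minimal sketch of that two-pass loop, reusing the illustrative cross_level_attention helper from the previous sketch; the per-level projection matrices and fusion-area masks are assumed to be given, a shared channel count across levels is assumed for simplicity, and the same per-level parameter tuple is reused in both passes purely for brevity (the disclosure allows distinct transformations).

    def fuse_pyramid(initial, masks_up, masks_down, params):
        """Two-pass fusion over N levels (0-based indices; level 0 is the
        patent's level 1). 'params[n]' is a (w_q, w_k, w_v, w_o) tuple,
        'masks_*[n]' the corresponding fusion-area masks."""
        n_levels = len(initial)
        inter = [None] * n_levels
        inter[0] = initial[0]                     # level 1: intermediate = initial
        for n in range(1, n_levels):              # bottom-up pass (Figure 9)
            inter[n] = cross_level_attention(initial[n], inter[n - 1],
                                             *params[n], masks_up[n])
        fused = [None] * n_levels
        fused[-1] = inter[-1]                     # level N: fusion = intermediate
        for n in range(n_levels - 2, -1, -1):     # top-down pass (Figure 11)
            fused[n] = cross_level_attention(inter[n], fused[n + 1],
                                             *params[n], masks_down[n])
        return fused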
It should be noted that the target neural network usually relies on a corresponding electronic device and performs data processing by scheduling the various resources of the electronic device (such as computing resources, storage resources, and communication resources) to obtain the processing result of the data to be processed. Because the embodiments of the present disclosure improve the structure and processing mechanism of the target neural network, the way the target neural network schedules the resources of the electronic device also changes accordingly (for example, the amount of resources consumed may change), which in turn affects the processing performance of the electronic device. For example, when the same electronic device processes the same data to be processed to obtain a corresponding processing result, processing with the related art and processing through the internal target neural network may yield different results, and the types and/or amounts of resources consumed during processing may also differ. In other words, the data processing method of the embodiments of the present disclosure can change the manner and amount of resource scheduling of the electronic device, thereby changing the performance of the electronic device and its utilization of resources.
In addition, the present disclosure also provides data processing apparatuses, electronic devices, and computer-readable storage media, all of which can be used to implement any of the data processing methods provided by the present disclosure.
Figure 13 is a block diagram of a data processing device provided by an embodiment of the present disclosure. Referring to Figure 13, the data processing device includes a data processing module 13 configured to input the data to be processed into the target neural network for processing, so as to obtain the processing result of the data to be processed. At least one convolutional layer of the target neural network is an attention convolutional layer based on the first attention mechanism, and/or feature fusion is performed between at least two levels of convolutional layers of the target neural network based on the second attention mechanism, where the first attention mechanism includes a self-attention mechanism for local areas of features and the second attention mechanism includes an attention mechanism between output features of different scales.
In some possible implementations, the data processing apparatus may further include an input module configured to input the data to be processed into the target neural network. The data to be processed may include any one of image data, voice data, text data, and video data; the embodiments of the present disclosure do not limit the type or content of the data to be processed.
In some possible implementations, after the data to be processed has been input into the target neural network through the input module, the data processing module uses the attention convolutional layer in the target neural network to perform self-attention operations on the data input to that layer and obtains output features, so that other network layers in the target neural network can perform data processing based on the output features and obtain the processing result. In addition, the data processing module can fuse output features (including but not limited to the output features of the attention convolutional layer) based on the second attention mechanism in the target neural network to obtain fusion features, so that other network layers can perform further data processing based on the fusion features and obtain the processing result.
In some possible implementations, the target neural network can be used to perform any one of image processing tasks, speech processing tasks, text processing tasks, and video processing tasks. Correspondingly, the processing result of the data to be processed can be any one of an image processing result, a speech processing result, a text processing result, and a video processing result (where the processing can include operations such as recognition, classification, and labeling); it depends on the type and content of the data to be processed and on the task performed by the target neural network. The embodiments of the present disclosure do not limit the specific task types performed by the target neural network or the types of processing results.
In some possible implementations, the target neural network includes an attention convolutional layer, and accordingly the data processing module can be configured to implement feature extraction based on the first attention mechanism.
Figure 14 is a block diagram of a data processing device provided by an embodiment of the present disclosure. Referring to Figure 14, the data processing device includes a transformation sub-module 141, a first attention processing sub-module 142, and an output feature determination sub-module 143.

The transformation sub-module 141 is configured to, for any attention convolutional layer, linearly transform the input data of the attention convolutional layer to obtain the first query feature, the first key feature, and the first value feature corresponding to the input data. The first attention processing sub-module 142 is configured to determine, according to the first query feature, the first key feature, and the first value feature, the first attention features corresponding to multiple target feature points of the first query feature, where the first attention feature includes first attention values corresponding to the multiple target feature points (for example, the first attention feature may include a first attention value corresponding to each target feature point). Each first attention value is determined for the local area corresponding to its target feature point; the local area corresponding to a target feature point is an area in the first query feature that is centered on the target feature point and determined according to a preset size, where the preset size is smaller than the size of the first query feature. The first attention value is configured to characterize the association between the feature points in the local area and the target feature point (for example, the first attention value can represent the correlation between each feature point in the local area and the target feature point). The output feature determination sub-module 143 is configured to determine the output features corresponding to the attention convolutional layer according to the first attention features and the input data, where the input data is the data to be processed after being processed by the network layers before the attention convolutional layer, and the processing result is obtained after the output features are processed by the network layers after the attention convolutional layer.
In some possible implementations, the first attention processing sub-module includes a region mapping unit, a similarity determination unit, a first attention value acquisition unit, and a first attention feature acquisition unit. The region mapping unit is configured to determine, for the multiple target feature points, the multiple first key feature points corresponding to the local area in the first key feature, and the first value feature points corresponding to the local area in the first value feature. The similarity determination unit is configured to determine the similarity between the target feature point and the multiple first key feature points (for example, the similarity between the target feature point and each first key feature point can be determined) to obtain the first similarity feature corresponding to the target feature point. The first attention value acquisition unit is configured to obtain the first attention value corresponding to the target feature point according to the first similarity feature and the first value feature points. The first attention feature acquisition unit is configured to obtain the first attention feature according to the first attention values of the multiple target feature points.
In some possible implementations, the data processing module may also include a selection sub-module and a region determination sub-module. The selection sub-module is configured to select multiple feature points from the first query feature as target feature points; the region determination sub-module is configured to determine the local areas corresponding to the multiple target feature points according to the preset size (for example, the local area corresponding to each target feature point can be determined).
In some possible implementations, the first query feature includes either a vector or a matrix. When the first query feature is a vector, the preset size includes a preset number of feature points, which is smaller than the total number of feature points of the first query vector, and the local area is a vector centered on the target feature point whose number of feature points equals the preset number. When the first query feature is a matrix, the preset size includes a preset number of rows and a preset number of columns, where the preset number of rows is smaller than the total number of rows of the first query feature and the preset number of columns is smaller than the total number of columns of the first query feature; the local area is a rectangular area centered on the target feature point, with the preset number of rows as its height and the preset number of columns as its width. A hedged sketch of this local-window computation is given below.
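The following is a minimal numpy sketch for the matrix case, assuming a square window of the preset size, dot-product similarity (cosine or Pearson similarity could be substituted, as the disclosure permits), and zero padding for target points near the edge, which the disclosure mentions as a way to complete edge local areas. The function name and loop-based form are illustrative only; the result is the attention feature before it is combined with the input data.

    import numpy as np

    def local_self_attention(q, k, v, window=3):
        """First attention mechanism restricted to a local area: every target
        point of Q attends only to the (window x window) area centred on it
        in K and V. q, k, v: (h, w, c) tensors, assumed spatially aligned."""
        h, w, c = q.shape
        r = window // 2
        kp = np.pad(k, ((r, r), (r, r), (0, 0)))   # zero-pad edge neighbourhoods
        vp = np.pad(v, ((r, r), (r, r), (0, 0)))
        out = np.zeros_like(q)
        for i in range(h):
            for j in range(w):
                nk = kp[i:i + window, j:j + window].reshape(-1, c)  # local keys
                nv = vp[i:i + window, j:j + window].reshape(-1, c)  # local values
                s = nk @ q[i, j]                   # similarity to each neighbour
                out[i, j] = s @ nv                 # first attention value
        return out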
In some possible implementations, a sliding-window method is used to obtain the first attention feature. Correspondingly, in addition to the first similarity determination unit, the first attention value acquisition unit, and the first attention feature acquisition unit, the first attention processing sub-module may also include a sliding setting unit and a sliding unit. The sliding setting unit is configured to set a sliding window and a step size according to the preset size of the local area. The sliding unit is configured to slide the sliding window along the first query feature with the step size, starting from a preset initial sliding position, and to determine the target feature point corresponding to the sliding window in each sliding operation, the multiple first key feature points corresponding to the sliding window in the first key feature, and the first value feature points corresponding to the sliding window in the first value feature. The first similarity determination unit is configured to determine the similarity between the target feature point and the multiple first key feature points to obtain the first similarity feature corresponding to the target feature point; the first attention value acquisition unit is configured to obtain the first attention value corresponding to the target feature point according to the first similarity feature and the first value feature points; and the first attention feature acquisition unit is configured to obtain the first attention feature according to the first attention values of the multiple target feature points.
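A sliding-window reading of the same computation, sketched under the same illustrative assumptions: the window and step size are derived from the preset local-area size, and each stop of the window yields one target feature point, its centre. For brevity this sketch simply skips windows that would cross the feature boundary instead of padding, and returns a sparse mapping from target positions to first attention values.

    def sliding_window_attention(q, k, v, window=3, stride=1):
        """Sliding-window variant: the window slides over Q with a step size,
        and each stop yields one target point (the window centre)."""
        h, w, c = q.shape
        r = window // 2
        attn = {}
        for i in range(r, h - r, stride):
            for j in range(r, w - r, stride):
                nk = k[i - r:i + r + 1, j - r:j + r + 1].reshape(-1, c)
                nv = v[i - r:i + r + 1, j - r:j + r + 1].reshape(-1, c)
                attn[(i, j)] = (nk @ q[i, j]) @ nv  # first attention value
        return attn                                  # only visited target points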
In some possible implementations, the first query feature is the same as the first key feature. In this case, when the first similarity determination unit determines the similarity between the target feature point and the multiple first key feature points, it is configured to: having obtained the similarity Sij between the i-th target feature point and the j-th first key feature point, determine from Sij the similarity Sji between the j-th first query feature point and the i-th target feature point, where i and j are both integers greater than or equal to 1 and less than or equal to M, M is the total number of feature points in the first query feature or the first key feature, and i ≥ j or i ≤ j.
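This reuse relies on the similarity measure being symmetric when the first query feature and the first key feature coincide, which holds for dot-product, cosine, and Pearson similarity; roughly half of the similarity computations can then be skipped, as in this illustrative sketch:

    def symmetric_similarity(q):
        """q: (M, c), one row per feature point. When query and key features
        coincide and the measure is symmetric, S[i, j] == S[j, i], so only
        the upper triangle needs to be computed."""
        m = q.shape[0]
        s = np.zeros((m, m))
        for i in range(m):
            for j in range(i, m):
                s[i, j] = q[i] @ q[j]   # compute Sij once ...
                s[j, i] = s[i, j]       # ... and reuse it as Sji
        return s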
In some possible implementations, the output feature determination sub-module includes a first transformation unit and a first superposition unit. The first transformation unit is configured to linearly transform the first attention feature to obtain a first matching attention feature with the same size as the input data; the first superposition unit is configured to superimpose the first matching attention feature with the input data to obtain the output features corresponding to the input data.
In some possible implementations, the output feature determination sub-module includes a rearrangement unit and a feature acquisition unit. The rearrangement unit is configured to rearrange the first attention values according to the position information, in the input data, of the target feature points corresponding to the first attention values in the first attention feature, to obtain a second matching attention feature; the feature acquisition unit is configured to obtain the output features corresponding to the input data according to the second matching attention feature and the feature points in the input data corresponding to the target feature points.
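A compact sketch of the two output-determination modes just described, under the same illustrative assumptions as the earlier sketches. The projection matrix w_o and the dictionary of per-target attention values are hypothetical names; the rearrangement mode below is one plausible reading in which positions without a target point keep the input feature, whereas the disclosure also allows the output to contain only the feature components at the target feature points.

    def output_by_residual(p, f, w_o):
        """Mode 1: project the attention feature to the input's size, then add.
        p: (h, w, cp) attention feature; f: (h, w, c) input; w_o: (cp, c)."""
        h, w, c = f.shape
        return f + (p.reshape(-1, p.shape[-1]) @ w_o).reshape(h, w, c)

    def output_by_rearrange(attn_values, f):
        """Mode 2: write each first attention value back at its target point's
        position, e.g. using the dict from sliding_window_attention above."""
        out = f.copy()
        for (i, j), val in attn_values.items():
            out[i, j] = out[i, j] + val   # combine with the matching point
        return out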
In some possible implementations, feature fusion is performed between the N levels of convolutional layers of the target neural network based on the second attention mechanism, where the scales of the features output by the convolutional layers at each level differ and N is an integer greater than or equal to 2. Correspondingly, the data processing module may be configured to implement feature fusion based on the second attention mechanism.
FIG. 15 is a block diagram of a data processing device provided by an embodiment of the present disclosure. Referring to FIG. 15, the data processing device includes a second attention processing sub-module 151, a first update sub-module 152, a third attention processing sub-module 153, and a second update sub-module 154.

The second attention processing sub-module 151 is configured to, for the nth-level convolutional layer, determine the nth-level second attention feature according to the n-1th-level intermediate feature of the n-1th-level convolutional layer and the nth-level initial feature output by the nth-level convolutional layer, where n is an integer and 2 ≤ n ≤ N-1. The first update sub-module 152 is configured to update the nth-level initial feature according to the nth-level second attention feature to obtain the nth-level intermediate feature. The third attention processing sub-module 153 is configured to determine the nth-level third attention feature according to the n+1th-level fusion feature of the n+1th-level convolutional layer and the nth-level intermediate feature. The second update sub-module 154 is configured to update the nth-level intermediate feature according to the nth-level third attention feature to obtain the nth-level fusion feature. Here, the initial features are the features obtained after the data to be processed has been processed by the convolutional layers of the target neural network, and the processing result is obtained after the fusion features are processed by the network layers after the convolutional layers.
In some possible implementations, for the first-level convolutional layer, since there is no convolutional layer with a lower feature level, the first-level intermediate feature is set equal to the first-level initial feature. Correspondingly, the first-level fusion feature is obtained by updating the first-level intermediate feature based on the first-level third attention feature, and the first-level third attention feature is obtained based on the second-level fusion feature and the first-level intermediate feature. For the Nth-level convolutional layer, since there is no convolutional layer with a higher feature level, the Nth-level fusion feature is set equal to the Nth-level intermediate feature.
In some possible implementations, the second attention processing sub-module includes a second transformation unit, a third transformation unit, a first mapping unit, a first fusion area determination unit, a second similarity determination unit, a first feature point determination unit, and a second attention feature acquisition unit. The second transformation unit is configured to linearly transform the nth-level initial feature to obtain the nth-level second query feature corresponding to the nth-level initial feature. The third transformation unit is configured to linearly transform the n-1th-level intermediate feature to obtain the n-1th-level second key feature and the n-1th-level second value feature corresponding to the n-1th-level intermediate feature. The first mapping unit is configured to determine the mapping relationship between the feature points of the nth-level second query feature and the feature points of the n-1th-level second key feature. The first fusion area determination unit is configured to determine, according to the mapping relationship, the feature fusion areas of the n-1th-level second key feature corresponding to the feature points of the nth-level second query feature. The second similarity determination unit is configured to determine the similarity between the feature points of the nth-level second query feature and the feature points of the n-1th-level second key feature located in the feature fusion area, to obtain the nth-level second similarity feature. The first feature point determination unit is configured to determine the n-1th-level second value feature points corresponding to the feature fusion area in the n-1th-level second value feature. The second attention feature acquisition unit is configured to determine the inner product between the nth-level second similarity feature and the n-1th-level second value feature points to obtain the nth-level second attention feature.
In some possible implementations, the first update sub-module includes a second superposition unit configured to superimpose the nth-level second attention feature and the nth-level initial feature to obtain the nth-level intermediate feature.
In some possible implementations, the third attention processing sub-module includes a fourth transformation unit, a fifth transformation unit, a second mapping unit, a second fusion area determination unit, a third similarity determination unit, a second feature point determination unit, and a third attention feature acquisition unit. The fourth transformation unit is configured to linearly transform the nth-level intermediate feature to obtain the nth-level third query feature corresponding to the nth-level intermediate feature. The fifth transformation unit is configured to linearly transform the n+1th-level fusion feature to obtain the n+1th-level third key feature and the n+1th-level third value feature corresponding to the n+1th-level fusion feature. The second mapping unit is configured to determine the mapping relationship between the feature points of the nth-level third query feature and the feature points of the n+1th-level third key feature. The second fusion area determination unit is configured to determine, according to the mapping relationship, the feature fusion areas of the n+1th-level third key feature corresponding to the feature points of the nth-level third query feature. The third similarity determination unit is configured to determine the similarity between the feature points of the nth-level third query feature and the feature points of the n+1th-level third key feature located in the feature fusion area, to obtain the nth-level third similarity feature. The second feature point determination unit is configured to determine the n+1th-level third value feature points corresponding to the feature fusion area in the n+1th-level third value feature. The third attention feature acquisition unit is configured to determine the inner product between the nth-level third similarity feature and the n+1th-level third value feature points to obtain the nth-level third attention feature.
In some possible implementations, the second update sub-module includes a third superposition unit configured to superimpose the nth-level third attention feature and the nth-level intermediate feature to obtain the nth-level fusion feature.
Figure 16 is a block diagram of an electronic device provided by an embodiment of the present disclosure. Referring to Figure 16, an embodiment of the present disclosure provides an electronic device, which includes: at least one processor 1601; at least one memory 1602; and one or more I/O interfaces 1603 connected between the processor 1601 and the memory 1602. The memory 1602 stores one or more computer programs executable by the at least one processor 1601, and the one or more computer programs are executed by the at least one processor 1601 so that the at least one processor 1601 can perform the above data processing method.
It should be noted that the data processing method provided by the embodiments of the present disclosure can also be applied to electronic devices based on many-core systems. Figure 17 is a block diagram of an electronic device provided by an embodiment of the present disclosure. Referring to Figure 17, the electronic device includes multiple processing cores 1701 and a network-on-chip 1702; the multiple processing cores 1701 are all connected to the network-on-chip 1702, which is configured to exchange data among the multiple processing cores and with the outside. One or more instructions are stored in one or more of the processing cores 1701 and are executed by the one or more processing cores 1701, so that the one or more processing cores 1701 can perform the above data processing method.
In some embodiments, the electronic device may be a brain-inspired chip. Since such a chip can adopt a vectorized calculation method and needs to load parameters such as the weight information of the neural network model from an external memory, for example a double data rate (DDR) synchronous dynamic random-access memory, the batch-processing operation of the embodiments of the present disclosure is relatively efficient.
Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored. Figure 18 is a block diagram of a computer-readable medium provided by an embodiment of the present disclosure. The computer program, when executed by a processor or processing core, implements the above data processing method. The computer-readable storage medium may be a volatile or non-volatile computer-readable storage medium.
Embodiments of the present disclosure also provide a computer program product, which includes computer-readable code or a non-volatile computer-readable storage medium carrying the computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device performs the above data processing method.
As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable program instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), static random access memory (SRAM), flash memory or other memory technology, portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, disk storage or other magnetic storage devices, or any other medium that can be configured to store the desired information and can be accessed by a computer. In addition, communication media typically embody computer-readable program instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to the respective computing/processing devices, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
In the case of a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized by utilizing the state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.
The computer program products described herein may be implemented in hardware, software, or a combination thereof. In one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, it is embodied as a software product, such as a software development kit (SDK).
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other equipment to work in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other equipment, causing a series of operational steps to be performed thereon to produce a computer-implemented process, so that the instructions executed on the computer, other programmable data processing apparatus, or other equipment implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures illustrate the possible architecture, functions, and operations of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures; for example, two consecutive blocks may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a special-purpose hardware-based system that performs the specified functions or actions, or by a combination of special-purpose hardware and computer instructions.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used, and should be interpreted, in a generic and descriptive sense only and not for purposes of limitation. In some instances, it will be apparent to those skilled in the art that, unless expressly stated otherwise, features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics, and/or elements described in connection with other embodiments. Accordingly, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the present disclosure as set forth in the appended claims.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a data processing method and apparatus, a device, and a medium. The method includes: inputting data to be processed into a target neural network for processing to obtain a processing result of the data to be processed, where at least one convolutional layer of the target neural network is an attention convolutional layer based on a first attention mechanism, and/or feature fusion is performed between at least two levels of convolutional layers of the target neural network based on a second attention mechanism; the first attention mechanism includes a self-attention mechanism for local areas of features, and the second attention mechanism includes an attention mechanism, between output features of different scales, for local areas of the output features.

Description

数据处理方法及装置、设备、介质 技术领域
本公开实施例涉及计算机技术领域,尤其涉及一种数据处理方法及装置、电子设备、计算机可读存储介质。
背景技术
自注意力(Self Attention,SA)机制是对注意力机制的改进,其减少了对外部信息的依赖,更擅长捕捉数据或特征的内部相关性。
发明内容
本公开提供一种数据处理方法及装置、电子设备、计算机可读存储介质。
第一方面,本公开提供了一种数据处理方法,该数据处理方法包括:将待处理数据输入目标神经网络中处理,得到所述待处理数据的处理结果,所述目标神经网络的至少一层卷积层为基于第一注意力机制的注意力卷积层,和/或,所述目标神经网络的至少两级卷积层之间基于第二注意力机制进行特征融合,其中,所述第一注意力机制包括针对特征的局部区域的自注意力机制,所述第二注意力机制包括不同尺度的输出特征之间针对所述输出特征的局部区域的注意力机制。
第二方面,本公开提供了一种数据处理装置,该数据处理装置包括:数据处理模块,被配置为将待处理数据输入目标神经网络中处理,用于得到所述待处理数据的处理结果,所述目标神经网络的至少一层卷积层为基于第一注意力机制的注意力卷积层,和/或,所述目标神经网络的至少两级卷积层之间基于第二注意力机制进行特征融合,其中,所述第一注意力机制包括针对特征的局部区域的自注意力机制,所述第二注意力机制包括不同尺度的输出特征之间的注意力机制。
第三方面,本公开提供了一种电子设备,该电子设备包括:至少一个处理器;以及与所述至少一个处理器通信连接的存储器;其中,所述存储器存储有可被所述至少一个处理器执行的一个或多个计算机程序,一个或多个所述计算机程序被所述至少一个处理器执行,以使所述至少一个处理器能够执行上述的数据处理方法。
第四方面,本公开提供了一种计算机可读存储介质,其上存储有计算机程序,其中,所述计算机程序在被处理器执行时实现上述的数据处理方法。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,而非限制本公开。根据下面参考附图对示例性实施例的详细说明,本公开的其它特征及方面将变得清楚。
附图说明
图1为提供的一种自注意力机制的处理过程的示意图;
图2为提供的一种特征金字塔网络的示意图;
图3为本公开实施例提供的一种数据处理方法的流程图;
图4为本公开实施例提供的一种数据处理方法的工作过程的流程图;
图5为本公开实施例提供的一种目标神经网络的示意图;
图6为本公开实施例提供的一种注意力卷积层的工作过程的示意图;
图7为本公开实施例提供的一种数据处理方法的工作过程的流程图;
图8为本公开实施例提供的一种特征融合过程的示意图;
图9为本公开实施例提供的一种中间特征的获取过程的示意图;
图10为本公开实施例提供的一种融合特征的映射关系示意图;
图11为本公开实施例提供的一种融合特征的获取过程的示意图;
图12为本公开实施例提供的一种融合特征的映射关系示意图;
图13为本公开实施例提供的一种数据处理装置的框图;
图14为本公开实施例提供的一种数据处理装置的框图;
图15为本公开实施例提供的一种数据处理装置的框图;
图16为本公开实施例提供的一种电子设备的框图;
图17为本公开实施例提供的一种电子设备的框图;
图18为本公开实施例提供的一种计算机可读存储介质的框图。
具体实施方式
下面结合附图和实施例对本公开作进一步的详细说明。可以理解的是,此处所描述的具体实施例仅仅用于解释本公开,而非对本公开的限定。另外还需要说明的是,为了便于描述,附图中仅示出了与本公开相关的部分而非全部结构。
神经网络(Neural Network,NN)是一种模仿生物神经网络的结构和功能的模型,已经广泛应用在图像处理、语音识别、自然语言处理等领域。卷积是神经网络中的一个重要概念,通过卷积操作可实现特征提取。在相关技术中,可以利用滑动卷积核的方式进行滤波,获得滤波响应,从而提取到特征。由于卷积核作用的感受野通常为特征图的局部区域,因此,卷积具有归纳偏置的优点。相应的,基于卷积核提取特征时,需要通过不断层级式堆叠才能实现对更大范围内的特征提取,从而将整个特征图的不同区域关联起来。
注意力(Attention)机制是神经网络中的另外一个重要概念,其本质是根据事物之间的关系进行线性加权得到新的特征表示。自注意力机制(Self-Attention,SA)是注意力机制的变体,其减少了对外部信息的依赖,更擅长捕捉数据或特征的内部相关性。自注意力机制借鉴了自然语言处理(Natural Language Processing,NLP)主流网络—转换器(Transformer)中的查询-键-值(query-key-value,qkv)概念,将特征图中的每个特征点看作一个嵌入,然后进行qkv自注意力运算。
图1为提供的一种自注意力机制的处理过程的示意图。参照图1,其将自注意机制与视觉主干网相结合,通过qkv机制实现自注意力的处理。
其中,X表示输入的特征图(Feature Map)(例如,尺寸为H×W×C的图像,且其数量可以是N个,N≥1),首先对X进行线性映射(即采用1×1×1卷积来压缩通道数),获得θ、φ、g共三个特征。其次,通过转置(Reshape)操作,合并上述的三个特征除通道数之外的维度,然后对θ和φ进行矩阵点乘操作,获得自相关特征,该步骤旨在计算特征自相关性,即获得每帧图像中每个像素(或超像素)对其他所有帧中所有像素(或超像素)的关系;再次,对自相关特征进行归一化(例如,Softmax)操作,得到值域为[0,1]的权重系数(Weights),该权重系数即为自相关系数;最后,将自相关系数与特征g相乘,进行通道扩展,并将通道扩展结果与X进行残差运算,从而获得最终的输出结果Z。
综上可知,与卷积操作相比,自注意机制最大的优点为仅用一层即可关联全图范围内的任意两点(即建模任意范围的特征模式)。正因为此,导致自注意力机制最大的劣势为计算量过大(计算量通常与特征图尺寸的平方成正比)。尤其是在视觉处理领域,如果自注意力机制不与卷积混用(即引入卷积的视觉变换器(Vision-Transformer,ViT)),将导致特征模式因为缺乏归纳偏置而难以收敛,或者需要在JFT(即谷歌内部的图像分类数据集)等超大规模数据集上进行训练。
为降低自注意力机制的计算量,业界提出了基于十字交叉网络(Criss Cross Network,CCNet)以及基于十字形窗(Cross-Shaped Window,CSWin)等自注意力改进模型。其中,基于CCNet的自注意力模型通过计算目标特征像素点与其所在行列的十字交叉区域内的像素点之间的相互关系,并用该相互关系对目标像素点的特征进行加权,以此获得更加有效的目标特征。基于CSWin的自注意力模型通过计算目标特征像素点与其所在行列的十字形交叉窗口区域内的像素点之间的相互关系,并用该相互关系对目标像素点的特征进行加权,以此获得更加有效的目标特征。
本公开实施例提供的数据处理方法,并非仅致力于降低自注意力模型的计算量,而是考虑到基于卷积操作提取特征时,在所要提取的特征模式与卷积核完全对应且没有旋转角度的情况下,滤波响应才会最大,相应的特征提取结果才会较好。但是,在实际应用中,同一特征模式可能在图像中以不同的旋转角度呈现,使用一个卷积核无法对同一特征模式对应的所有特征进行较好地提取。换言之,卷积操作在提取特征时不支持旋转不变性。与此同时,考虑到自注意机制是根据任意两点间的相似度来对特征图进行重新加权、以建立任意两点之间的关联,且这种加权与距离以及相对位置均无关,从而使得自注意力机制不具备归纳偏置特性。但是,当将自注意力机制作用范围缩小至局部区域时,就会使其具备一定的归纳偏置能力,而且,由于自注意力计算基于特征点之间的相似性进行确定,因此,其具备旋转不变性。有鉴于此,本公开实施例提供一种包含有自注意力卷积层的神经网络,基于该自注意力卷积层进行特征提取,并获得更加有效的输出特征。
另外,在提取特征时,为保障能提取到不同尺度或不同层级的特征,通常需要使用不同尺寸的卷积核,且使用尺寸较小的卷积核通常可以提取到较低层级的特征,使用尺寸较大的卷积核通常可以提取到较高层级的特征。其中,低层级特征的语义信息相对比较少,但目标位置准确,且分辨率较高,而高层级特征的语义信息相对比较丰富,但是目标位置比较粗略,分辨率较低且比较抽象。
在相关技术中,往往通过特征融合的方式,将低层级特征的高分辨率和高层级特征的高语义信息相结合,从而增强特征表达效果。特征金字塔网络(Feature Parymid Network,FPN)是一种常用 的特征融合模型,其采用重采样后进行通道拼接或者逐点相加的方法实现不同层级的特征融合,实现较为简便。
图2为提供的一种特征金字塔网络的示意图。参照图2,FPN网络使用一个标准的特征提取网络提取多个空间位置的特征,然后增加一个轻量级的自顶而下的通路,并且将其与特征提取网络横向连接。针对特征提取网络所提取的每一层级特征,先对其进行二倍上采样,获得上采样后的特征,再将其与经过1×1卷积处理(conv)的下一层级特征叠加,从而获得相应的融合特征,并基于融合特征进行执行后续的数据处理操作。
在相关技术中,为进一步提高特征融合效果,还针对FPN提出多种改进模型,例如,双向(Bi-Directional)FPN、路径聚合网络(Path Aggregation Network,PA-Net)等。然而,无论是FPN还是其改进模型,均未充分考虑两个不同层次的特征之间在空间维度的关联关系。
考虑到注意力机制可以建立特征图中任意两个像素点之间的关联关系,因此,在本公开实施例中,将注意力机制应用到特征融合过程中,以将不同层次的特征之间在空间维度的关联关系融合到特征中。
根据本公开实施例的数据处理方法可以由终端设备或服务器等电子设备执行,终端设备可以为车载设备、用户设备(User Equipment,UE)、移动设备、用户终端、终端、蜂窝电话、无绳电话、个人数字助理(Personal Digital Assistant,PDA)、手持设备、计算设备、可穿戴设备等,所述方法可以通过处理器调用存储器中存储的计算机可读程序指令的方式来实现。或者,可通过服务器执行本公开实施例的数据处理方法,其中,服务器可以是独立的物理服务器、由多个服务器组成的服务器集群或者能够进行云计算的云服务器。
图3为本公开实施例提供的一种数据处理方法的流程图。参照图3,该数据处理方法包括:
在步骤S31中,将待处理数据输入目标神经网络中处理,得到待处理数据的处理结果。
其中,目标神经网络的至少一层卷积层为基于第一注意力机制的注意力卷积层,和/或,目标神经网络的至少两级卷积层之间基于第二注意力机制进行特征融合,第一注意力机制包括针对特征的局部区域的自注意力机制,第二注意力机制包括不同尺度的输出特征之间针对输出特征的局部区域的注意力机制。
举例来说,待处理数据可以包括图像数据、语音数据、文本数据、视频数据中的任意一种。本公开实施例对待处理数据的类型和内容不作限制。
在一些可能的实现方式中,将待处理数据输入目标神经网络之后,目标神经网络利用注意力卷积层对输入该层的数据进行自注意力运算,并获得输出特征,以供目标神经网络中的其他网络层基于输出特征进行数据处理,获得处理结果;并且,目标神经网络还可基于第二注意力机制对输出特征(包括但不限于注意力卷积层的输出特征)进行融合,获得融合特征,以供其他网络层基于融合特征进行进一步地数据处理,获得处理结果。
在一些可能的实现方式中,目标神经网络可用于执行图像处理任务、语音处理任务、文本处理任务、视频处理任务中的任意一种。与之相应的,待处理数据的处理结果可以是图像处理结果、语音处理结果、文本处理结果、视频处理结果中的任意一种(其中,处理可以包括识别、分类、标注等操作),其与待处理数据的类型、待处理数据的内容、以及目标神经网络的执行任务等相关。本公开实施例对目标神经网络所执行的具体任务类型及处理结果类型不作限制。
例如,待处理数据包括待分类图像,则将待分类图像输入目标神经网络之后,目标神经网络利用注意力卷积层对输入该层的图像分类特征进行自注意力运算,并获得图像分类输出特征,以供目标神经网络中的其他网络层基于图像分类输出特征进行数据处理,获得图像分类结果。
如前所述,目标神经网络包括至少一层基于第一注意力机制的注意力卷积层,和/或,目标神经网络的至少两级卷积层之间基于第二注意力机制进行特征融合,因此,在步骤S31中,目标神经网络的处理过程至少包括特征提取,和/或,特征融合。需要说明的是,无论是特征提取还是特征融合,其均利用了局部注意力机制,即针对特征图的局部区域进行注意力运算,并根据注意力运算结果更新原有特征。后续分别针对目标神经网络的特征提取工作过程和特征融合工作过程进行展开说明。
在一些可能的实现方式中,目标神经网络包括注意力卷积层,其可用于进行特征提取。下面结合图4说明该目标神经网络进行特征提取的工作过程。
图4为本公开实施例提供的一种数据处理方法的工作过程的流程图。参照图4,该数据处理方法包括:
步骤S41,针对任一注意力卷积层,对注意力卷积层的输入数据进行线性变换,获得与输入数据对应的第一查询特征、第一键特征和第一值特征。
步骤S42,根据第一查询特征、第一键特征和第一值特征,确定与第一查询特征的多个目标特征点对应的第一注意力特征。
其中,第一注意力特征包括与多个目标特征点对应的第一注意力值(其中,目标特征点的数量可以是多个,相应的,在存在多个目标特征点的情况下,第一注意力特征包括与多个目标特征点对应的第一注意力值,可以理解为第一注意特征中包括与各个目标特征点对应的第一注意力值;例如,第一注意力特征可以包括与各个目标特征点对应的第一注意力值),第一注意力值为针对与目标特征点对应的局部区域确定的,与目标特征点对应的局部区域是第一查询特征中以目标特征点为中心,并根据预设尺寸确定的区域,预设尺寸小于第一查询特征的尺寸,第一注意力值用于表征局部区域内的多个特征点与目标特征点之间的关联关系(其中,局部区域内的特征点的数量可以是多个,相应的,在局部区域内存在多个特征点的情况下,第一注意力值用于表征局部区域内的多个特征点与目标特征点之间的关联关系,可以理解为第一注意力值能够用于表征局部区域内的各个特征点与目标特征点之间的关联关系,并且,当目标特征点的数量也是多个时,第一注意力值可以表征局部区域内的各个特征点与各个目标特征点之间的关联关系;例如,第一注意力值可以用于表征局部区域内的各个特征点与目标特征点之间的关联关系)。
步骤S43,根据第一注意力特征和输入数据,确定与注意力卷积层对应的输出特征。
其中,处理结果是输出特征经由注意力卷积层之后的网络层处理后得到的。
在一些可能的实现方式中,注意力卷积层的输入数据是待处理数据经由注意力卷积层之前的网络层处理后的数据,在步骤S41中,可通过将输入数据与预设的变换矩阵相乘的方式实现对输入数据的线性变换,从而获得第一查询特征、第一键特征和第一值特征。另外,还可以基于预设的卷积核对输入数据进行卷积操作,卷积结果即为第一查询特征、第一键特征或第一值特征。本公开实施例对输入数据的线性变换方式不作限制。
需要说明的是,在一些可能的实现方式中,可以通过对输入数据与同一变换矩阵相乘,或者对输入数据进行一次线性变换的方式,获得第一查询特征、第一键特征和第一值特征,换言之,在此情况下,第一查询特征、第一键特征和第一值特征是完全相同的特征。
在一个示例中,输入数据为矩阵F(h1*w1),将其与预设的变换矩阵w(h2*w2)相乘,获得相乘结果F’(h1*w2),并将F’(h1*w2)作为第一查询特征、第一键特征和第一值特征。其中,h1和w1分别表示输入数据对应矩阵F的高度和宽度,h2和w2分别表示变换矩阵w的高度和宽度,且w1=h2。
在一些可能的实现方式中,还可以基于不同的变换矩阵或者不同的线性变换方式,获得不同的第一查询特征、第一键特征和第一值特征。
在获得第一查询特征、第一键特征和第一值特征之后,即可在步骤S42中通过自注意力运算确定第一注意力特征。在一些可能的实现方式中,在步骤S42中,确定第一注意力特征的过程,包括:针对多个目标特征点,确定第一键特征中与局部区域对应的多个第一键特征点,以及第一值特征中与局部区域对应的第一值特征点(其中,目标特征点的数量可以是多个,相应的,在存在多个目标特征点的情况下,针对多个目标特征点,确定第一键特征中与局部区域对应的多个第一键特征点,以及第一值特征中与局部区域对应的第一值特征点,可以理解为针对各个目标特征点执行确定第一键特征中与局部区域对应的多个第一键特征点,以及第一值特征中与局部区域对应的第一值特征点的操作;例如,可以针对各个目标特征点,确定第一键特征中与局部区域对应的多个第一键特征点,以及第一值特征中与局部区域对应的第一值特征点);确定目标特征点与多个第一键特征点之间的相似性(其中,第一键特征点的数量可以是多个,相应的,在第一键特征点的数量是多个的情况下,确定目标特征点与多个第一键特征点之间的相似性,可以理解为分别确定目标特征点与各个第一键特征点之间的相似性;例如,可以确定目标特征点与各个第一键特征点之间的相似性),获得与目标特征点对应的第一相似度特征;根据第一相似度特征与第一值特征点,获得与目标特征点对应的第一注意力值;根据多个目标特征点的第一注意力值,获得第一注意力特征。
在一些可能的实现方式中,特征点之间的相似性可以基于余弦相似度、皮尔逊相关系数(Pearson Correlation)等方式计算获得,本公开实施例对相似性的确定方式不作限制。
在一些可能的实现方式中,在第一查询特征与第一键特征相同的情况下,确定目标特征点与多个第一键特征点之间的相似性,包括:在获得第i个目标特征点与第j个第一键特征点之间的相似性Sij的情况下,根据Sij确定第j个第一查询特征点与第i个目标特征点之间的相似性Sji;其中,i和j均为大于等于1且小于等于M的整数,M为第一查询特征或第一键特征中的特征点的总数量(M为大于或等于1的整数),且i≥j或i≤j。根据Sij确定相似性Sji,可以有效减少运算量,从而降低数据处理压力。
需要说明的是,在本公开实施例中,对于第一查询特征中的每个目标特征点,可以只针对该目标特征点在局部区域进行自注意力计算,获得第一注意力值,并基于多个目标特征点的第一注意力值获得第一注意力特征。相较于基于整个特征区域进行自注意力计算而言,基于局部区域的自注意力计算可以有效降低计算量,并且,由于局部区域相对于全局而言具备一定的归纳偏置,同时还具备旋转不变性,因此,可以获得效果较好的特征。
还需要说明的是,目标特征点是归属于第一查询特征的特征点,其与输入数据中的特征点之间具有对应关系,对目标特征点基于局部区域进行自注意力运算,其本质为针对输入数据基于局部区域确定相应的关联关系。
在一些可能的实现方式中,目标特征点的范围可根据需要灵活设置,其既可以包括第一查询特征中的所有特征点,也可以包括第一查询特征中指定的若干个特征点(若干个特征点可以是一个或多个特征点),本公开实施例对此不作限定。
在一些可能的实现方式中,在步骤S42之前,该数据处理方法还可以包括:从第一查询特征中选取多个特征点作为目标特征点;根据预设尺寸,确定多个目标特征点对应的局部区域。换言之,目标特征点的数量可以是多个,在存在多个目标特征点的情况下,可以根据预设尺寸,确定各个目标特征点对应的局部区域,即每个目标特征点可以对应相应的局部区域。
在一些可能的实现方式中,与目标特征点对应的局部区域可以是向量形式(例如,文本处理场景),也可以是矩形(包括正方形)形式(例如,图像或视频处理场景),本公开实施例对此不作限定。
在一些可能的实现方式中,第一查询特征包括向量和矩阵中的任意一种。在第一查询特征为向量的情况下,预设尺寸包括预设特征点数量,且预设特征点数量小于第一查询向量的特征点总数量,局部区域是以目标特征点为中心、特征点数量等于预设特征点数量的向量;在第一查询特征为矩阵的情况下,预设尺寸包括预设行数量和预设列数量,且预设行数量小于第一查询特征的行总数量,预设列数量小于第一查询特征的列总数量,局部区域是以目标特征点为中心、以预设行数量为高度、预设列数量为宽度的矩形区域。
在一个示例中,假设第一查询特征为5*5的矩阵,将该第一查询特征中的所有特征点确定为目标特征点,并且,将以目标特征点为中心,边长等于3个特征点的区域设置为与该目标特征点对应的局部区域。针对位于第一查询特征边缘处的目标特征点,其局部区域无法构成3*3的特征区域,在处理过程中,可通过补零的方式将这些目标特征点的局部区域补充为3*3的特征区域,以便于进行计算。
在上述内容中,步骤S42中获取第一注意力特征的过程包括:先确定目标特征点,再确定局部区域,进而确定第一注意力特征。在另外一些可能的实现方式中,可以采用滑动窗口的方式获取第一注意力特征。
在一些可能的实现方式中,在步骤S42中,基于滑动窗口方式获取第一注意力特征的过程包括:根据局部区域的预设尺寸,设置滑动窗口和步长;从预设的初始滑动位置开始,沿第一查询特征以步长滑动该滑动窗口,并确定多次滑动操作中与滑动窗口对应的目标特征点,第一键特征中与滑动窗口对应的多个第一键特征点,以及第一值特征中与滑动窗口对应的第一值特征点;确定目标特征点与多个第一键特征点之间的相似性,获得与目标特征点对应的第一相似度特征;根据第一相似度特征与第一值特征点,获得与目标特征点对应的第一注意力值;根据多个目标特征点的第一注意力值,获得第一注意力特征。
例如,可以根据局部区域的预设尺寸,设置滑动窗口和步长;从预设的初始滑动位置开始,沿第一查询特征以步长滑动该滑动窗口,并确定每次滑动操作中与滑动窗口对应的目标特征点,第一键特征中与滑动窗口对应的多个第一键特征点,以及第一值特征中与滑动窗口对应的第一值特征点;确定目标特征点与各个第一键特征点之间的相似性,获得与目标特征点对应的第一相似度特征;根据第一相似度特征与第一值特征点,获得与目标特征点对应的第一注意力值;根据多个目标特征点的第一注意力值,获得第一注意力特征。
换言之,在每次滑动操作中,滑动窗口处于相应的特定位置,其圈出了与本次滑动操作对应的局部区域,在确定该局部区域内的目标特征点的基础上,可以得到本次滑动操作下第一键特征中与滑动窗口对应的多个第一键特征点,以及第一值特征中与滑动窗口对应的第一值特征点。基于滑动窗口获取第一注意力特征的方式,与基于卷积核提取特征的方式较为类似,区别之处在于第一注意力特征是基于自注意力运算确定特征值,卷积核是通过卷积运算确定特征值。
获得第一注意力特征之后,可在步骤S43中根据第一注意力特征和输入数据确定输出特征。其 中,确定输出特征至少可以包括两种方式:第一种方式是将第一注意力特征做线性变换,使其与输入数据的尺寸相同,并将变换后的第一注意力特征叠加到输入数据中,从而获得输出特征;第二种方式是将第一注意力特征与输入数据的特征点建立位置映射关系,利用该位置映射关系,基于第一注意力特征和输入数据生成输出特征。在第一种方式中,输出特征的尺寸与输入数据能够保持相同,在第二种方式中,若目标特征点不是第一查询特征中的所有特征点时,输出特征的尺寸与输入数据不同,其可以只包括目标特征点对应的特征分量。
在一些可能的实现方式中,在步骤S43中,根据第一注意力特征和输入数据,确定与注意力卷积层对应的输出特征,包括:对第一注意力特征进行线性变换,获得与输入数据尺寸相同的第一匹配注意力特征;将第一匹配注意力特征与输入数据进行叠加,获得与输入数据对应的输出特征。
在一些可能的实现方式中,在步骤S43中,根据第一注意力特征和输入数据,确定与注意力卷积层对应的输出特征,包括:根据与第一注意力特征中第一注意力值对应的目标特征点在输入数据中的位置信息,将第一注意力值进行重新排列,获得第二匹配注意力特征;根据第二匹配注意力特征和输入数据中与目标特征点对应的特征点,获得与输入数据对应的输出特征。
综上所述,本公开实施例中的第一注意力机制本质上属于自注意力范畴,其具有“局部区域”特性、“归纳偏置”特性以及“旋转不变”特性。其中,“局部区域”是指在获取第一注意力特征时,仅对特征的局部区域进行自注意力运算,而非针对特征全局进行自注意力运算,可以有效降低计算量;“归纳偏置”特性是由于对局部区域开展自注意力运算而产生的附加特性,较全局进行自注意力运算不具有归纳偏置能力而言,只针对局部区域开展自注意力运算具备一定的归纳偏置能力;“旋转不变”特性是由于自注意力运算本身关注于特征点之间的关联关系,这种关联关系与特征之间的距离及相对位置无关,从而使得其对旋转角度不敏感。
下面结合图5与图6对根据本公开实施例的数据处理方法进行展开说明。
图5为本公开实施例提供的一种目标神经网络的示意图。
参照图5,目标神经网络包括第一网络层结构、注意力卷积层以及第二网络层结构。其中,第一网络层结构位于注意力卷积层之前,其可以包括一个或多个网络层(网络层可以是卷积层等),第二网络层结构位于注意力卷积层之后,其同样可以包括一个或多个网络层(网络层可以包括批标准化层(Batchnorm-Layer)和激活层(Activation Layer)等)。
在一些可能的实现方式中,将待处理数据输入目标神经网络之后,首先由第一网络层结构对待处理数据进行处理,获得中间数据,并将中间数据输入到注意力卷积层。该中间数据即为注意力卷积层的输入数据。注意力卷积层按照本公开实施例中任意一种实现方式对输入数据进行处理,获得输出特征,并将输出特征输入至第二网络层结构。第二网络层结构对输出特征进行处理,获得处理结果,目标神经网络向外输出该处理结果。
图6为本公开实施例提供的一种注意力卷积层的工作过程的示意图。
参照图6,注意力卷积层的输入数据F是一个hf*wf*c的张量,其中,hf表示F的高度,wf表示F的宽度,c表示F的通道数。
在一些可能的实现方式中,首先将F分别与第一变换矩阵wq、第二变换矩阵wk和第三变换矩阵wv相乘,实现对F的三次线性变换,并相应获得第一查询特征Q、第一键特征K和第一值特征V,其中,wq的尺寸为wf*hq*1,相应的,Q是hq*wq*c的张量,wk的尺寸为wf*hk*1,相应的,K是hk*wk*c的张量,wv的尺寸为wf*hv*1,相应的,V是hv*wv*c的张量。在获得Q、K和V之后,计算Q中各个目标特征点与K中处于该目标特征点对应的局部区域的特征点之间的余弦相似度,获得相似度特征S,其中,S为{hq*wq}*{hk*wk}大小的矩阵。
在一个示例中,获取相似度特征S的过程包括:首先,将Q变换为{hq*wq}*c的矩阵形式,其次,将K变换为c*{hk*wk}的矩阵形式,最后,基于变换后的矩阵形式执行矩阵相乘操作,获得尺寸为{hq*wq}*{hk*wk}的相似度特征S。其意义在于,S中第(i,j)位置的元素为第j个元素对第i个元素的影响,或第j个元素与第i个元素之间的相似性,从而实现全局上下文任意两个元素之间的依赖关系。需要说明的是,S具有稀疏性,其元素中只有目标特征点与局部区域特征点之间的相似性为非零取值,其他未进行自注意力运算的元素的取值均为零。
进一步地,对V进行变换,获得V′,V′为{hv*wv}*c的矩阵。将S与V′做内积运算,获得{hq*wq}*c的矩阵,并对该矩阵进行变换,获得hq*wq*c的张量,该张量即为hp*wp*c的第一注意力特征P(其中,hq=hp,wq*wp)。最后,对P进行线性变换,使其与F的尺寸相同,进而将其与F相加,获得最终的输出特征F′。其中,F′的尺寸为hf′*wf′*c,且hf′=hf,wf′=wf
需要说明的是,在一些可能的实现方式中,在获得相似度特征S之后,还可以通过Softmax等 方式对其中的元素进行归一化,使得数据处于相同数量级,方便进行比较分析。
还需要说明的是,在一些可能的实现方式中,为降低运算量,可以在线性变换过程中减少通道数c,即Q、K和V的通道数可以小于F的通道数,且Q、K与V的通道数可以不同(Q与K的通道数通常相同)。
基于滑窗方式确定输出特征,与上述过程的计算方式类似,在此不再重复描述。
上述内容描述了如何基于注意力卷积层获取输出特征,在实际应用中,获得输出特征之后,为增强特征表达效果,还可以将不同层级的特征进行融合,从而获得融合特征。
下面结合图7说明该目标神经网络进行特征融合的工作过程。
图7为本公开实施例提供的一种数据处理方法的工作过程的流程图。其中,目标神经网络的N级卷积层之间基于第二注意力机制进行特征融合,各级卷积层输出的特征的尺度不同,N为大于或等于2的整数。第二注意力机制包括不同尺度的输出特征之间针对输出特征的局部区域的注意力机制。参照图7,该方法包括:
步骤S71,针对第n级卷积层,根据第n-1级卷积层的第n-1级中间特征与第n级卷积层输出的第n级初始特征,确定第n级第二注意力特征。
其中,n表示卷积层的级数,且n为整数且2≤n≤N-1。
步骤S72,根据第n级第二注意力特征,对第n级初始特征进行更新,获得第n级中间特征。
步骤S73,根据第n+1级卷积层的第n+1级融合特征与第n级中间特征,确定第n级第三注意力特征。
步骤S74,根据第n级第三注意力特征,对第n级中间特征进行更新,获得第n级融合特征。
在一些可能的实现方式中,初始特征是待处理数据经由目标神经网络的卷积层处理后的处于初始状态的特征,中间特征是在初始特征基础上,结合其他层级的特征信息获得的处于中间状态的特征,融合特征是指将中间特征与其他层级的特征进行进一步融合之后获得的特征。相应的,处理结果是融合特征经由卷积层之后的网络层处理后得到的结果。
在一些可能的实现方式中,针对第1级卷积层,由于没有特征层次再比之低的卷积层,因此,可以设置第1级中间特征等于第1级初始特征。相应的,第1级融合特征是基于第1级第三注意力特征对第1级中间特征进行更新获得的,第1级第三注意力特征可以根据第2级融合特征和第1级中间特征获得。
在一些可能的实现方式中,对于第N级卷积层,由于没有特征层次再比之高的卷积层,因此,可以设置第N级融合特征等于第N级中间特征。
上述特征融合过程可以概括为:首先,基于第n级第二注意力特征对第n级初始特征进行更新,获得第n级中间特征;其次,基于第n级第三注意力特征对第n级中间特征进行更新,获得第n级融合特征。第n级第二注意力特征与第n级第三注意力特征均是基于第二注意力机制获得的特征,其中,第n级第二注意力特征由第n-1级中间特征与第n级初始特征获得,其反映了第n-1级中间特征的特征点与第n级初始特征的特征点之间的关联关系,第n级第三注意力特征由第n+1级融合特征与第n级中间特征获得,其反映了第n+1级融合特征的特征点与第n级中间特征的特征点之间的关联关系。
在一些可能的实现方式中,在步骤S71中,针对第n级卷积层,根据第n-1级卷积层的第n-1级中间特征与第n级卷积层输出的第n级初始特征,确定第n级第二注意力特征,包括:对第n级初始特征进行线性变换,获得与第n级初始特征对应的第n级第二查询特征;对第n-1级中间特征进行线性变换,获得与第n-1级中间特征对应的第n-1级第二键特征和第n-1级第二值特征;确定第n级第二查询特征的多个特征点与第n-1级第二键特征的多个特征点之间的映射关系(其中,第n级第二查询特征可以包括多个特征点,第n-1级第二键特征也可以包括多个特征点,相应的,在第n级第二查询特征包括多个特征点,且第n-1级第二键特征也包括多个特征点的情况下,确定第n级第二查询特征的多个特征点与第n-1级第二键特征的多个特征点之间的映射关系,可以理解为针对第n级第二查询特征的各个特征点,分别确定该特征点与第n-1级第二键特征的各个特征点之间的映射关系,得到第n级第二查询特征的各个特征点与第n-1级第二键特征的各个特征点之间的映射关系,也即得到第n级第二查询特征的多个特征点与第n-1级第二键特征的多个特征点之间的映射关系;例如,可以确定第n级第二查询特征的各个特征点与第n-1级第二键特征的各个特征点之间的映射关系);根据映射关系,确定第n级第二查询特征的多个特征点对应于第n-1级第二键特征的特征融合区域(其中,第n级第二查询特征可以包括多个特征点,并且在第n级第二查询特征包括多个特征点的情况下,确定第n级第二查询特征的多个特征点对应于第n-1级第二键特征的特征 融合区域,可以理解为针对第n级第二查询特征的各个特征点,分别确定该特征点对应于第n-1级第二键特征的特征融合区域,从而可以得到多个特征点对应于第n-1级第二键特征的特征融合区域;例如,可以确定第n级第二查询特征的各个特征点对应于第n-1级第二键特征的特征融合区域);确定第n级第二查询特征中的多个特征点与第n-1级第二键特征中处于特征融合区域的多个特征点之间的相似性(例如,可以确定第n级第二查询特征中的各个特征点与第n-1级第二键特征中处于特征融合区域的各个特征点之间的相似性),获得第n级第二相似度特征;确定第n-1级第二值特征中与特征融合区域对应的第n-1级第二值特征点;确定第n级第二相似度特征与第n-1级第二值特征点之间的内积,获得第n级第二注意力特征。
在一些可能的实现方式中,在步骤S72中,根据第n级第二注意力特征,对第n级初始特征进行更新,获得第n级中间特征,包括:将第n级第二注意力特征与第n级初始特征叠加,获得第n级中间特征。
在一些可能的实现方式中,在步骤S73中,根据第n+1级卷积层的第n+1级融合特征与第n级中间特征,确定第n级第三注意力特征,包括:对第n级中间特征进行线性变换,获得与第n级中间特征对应的第n级第三查询特征;对第n+1级融合特征进行线性变换,获得与第n+1级融合特征对应的第n+1级第三键特征和第n+1级第三值特征;确定第n级第三查询特征的多个特征点与第n+1级第三键特征的多个特征点之间的映射关系(其中,第n级第三查询特征可以包括多个特征点,第n+1级第三键特征也可以包括多个特征点,相应的,在第n级第三查询特征包括多个特征点,且第n+1级第三键特征也包括多个特征点的情况下,确定第n级第三查询特征的多个特征点与第n+1级第三键特征的多个特征点之间的映射关系,可以理解为针对第n级第三查询特征的各个特征点,分别确定该特征点与第n+1级第三键特征的各个特征点之间的映射关系,得到第n级第三查询特征的各个特征点与第n+1级第三键特征的各个特征点之间的映射关系,也即得到第n级第三查询特征的多个特征点与第n+1级第三键特征的多个特征点之间的映射关系;例如,可以确定第n级第三查询特征的各个特征点与第n+1级第三键特征的各个特征点之间的映射关系);根据映射关系,确定第n级第三查询特征的多个特征点对应于第n+1级第三键特征的特征融合区域(其中,第n级第三查询特征可以包括多个特征点,相应的,在第n级第三查询特征包括多个特征点的情况下,确定第n级第三查询特征的多个特征点对应于第n+1级第三键特征的特征融合区域,可以理解为针对第n级第三查询特征的各个特征点,确定该特征点对应于第n+1级第三键特征的特征融合区域,从而得到第n级第三查询特征的多个特征点对应于第n+1级第三键特征的特征融合区域;例如,可以根据映射关系,确定第n级第三查询特征的各个特征点对应于第n+1级第三键特征的特征融合区域);确定第n级第三查询特征中的多个特征点与第n+1级第三键特征中处于特征融合区域的多个特征点之间的相似性(其中,第n级第三查询特征可以包括多个特征点,第n+1级第三键特征处于特征融合区域的部分也可以包括多个特征点,相应的,在两者均包括多个特征点的情况下,确定第n级第三查询特征中的多个特征点与第n+1级第三键特征中处于特征融合区域的多个特征点之间的相似性,可以理解为针对第n级第三查询特征的各个特征点,分别确定该特征点与第n+1级第三键特征处于特征融合区域的各个特征点之间的相似性,得到第n级第三查询特征中的各个特征点与第n+1级第三键特征中处于特征融合区域的各个特征点之间的相似性,也即得到第n级第三查询特征中的多个特征点与第n+1级第三键特征中处于特征融合区域的多个特征点之间的相似性;例如,可以确定第n级第三查询特征中的各个特征点与第n+1级第三键特征中处于特征融合区域的各个特征点之间的相似性),获得第n级第三相似度特征;确定第n+1级第三值特征中与特征融合区域对应的第n+1级第三值特征点;确定第n级第三相似度特征与第n+1级第三值特征点之间的内积,获得第n级第三注意力特征。
在一些可能的实现方式中,在步骤S74中,根据第n级第三注意力特征,对第n级中间特征进行更新,获得第n级融合特征,包括:将第n级第三注意力特征与第n级中间特征叠加,获得第n级融合特征。
综上所述,本公开实施例中的第二注意力机制具有“他注意力”特性和“局部区域”特性。其中,“他注意力”特性是指,第二注意力特征与第三注意力特征均是在不同特征之间(即不同尺度的输出特征)进行注意力运算,而非针对同一特征的不同特征点进行注意力运算;“局部区域”特性是指在进行不同特征之间的注意力运算时,不是针对所有特征点进行注意力运算,其可以只针对两个特征中具有映射关系的若干特征点(若干特征点的数量可以是一个或多个)进行注意力运算。
下面结合图8-图12对根据本公开实施例的数据处理方法进行展开说明。
图8为本公开实施例提供的一种特征融合过程的示意图。参照图8,第1级初始特征、第2级 初始特征……第N级初始特征分别为使用第1层卷积层至第N级卷积层对输入数据进行采样获得的特征。其中,第1层卷积层至第N级卷积层分别对应采样率X1至XN,且按照从X1至XN依次降低,相应的,第1级初始特征的分辨率最高,第N级初始特征的分辨率最低。
针对第1级卷积层,考虑到没有特征层次更低的卷积层,其无法获得从低层级特征层传递的第二注意力特征,因此,设置第1级中间特征等于第1级初始特征。对于第2级卷积层,根据第1级卷积层的第1级中间特征与第2级卷积层输出的第2级初始特征,确定第2级第二注意力特征,并基于该第2级第二注意力特征,对第2级初始特征进行更新,获得第2级中间特征。类似的,针对第3级卷积层至第N级卷积层,按照上述方式进行处理,可以获得第3级中间特征至第N级中间特征。
在获得多个卷积层的中间特征之后(例如,在获得所有卷积层的中间特征之后),即可进一步确定各个卷积层的融合特征。
对于第N级卷积层,考虑到没有特征层次更高的卷积层,其无法获得从高层级特征层传递的第三注意力特征,因此,设置第N级中间特征等于第N级融合特征。对于第N-1级卷积层,根据第N级融合特征与第N-1级中间特征,确定第N-1级第三注意力特征,并根据第N-1级第三注意力特征对第N-1级中间特征进行更新,获得第N-1级融合特征。以此类推,针对第N-2级中间特征至第1级中间特征,按照上述方式进行处理可以获得第N-2级融合特征至第1级融合特征。
图9为本公开实施例提供的一种中间特征的获取过程的示意图。
参照图9,F1′表示第n-1级中间特征,其是一个h1*w1*c1的张量;F2表示第n级初始特征,其是一个h2*w2*c1的张量,n为大于1的整数。
首先,对F2做线性变换,生成与F2对应的第n级第二查询特征Q2,Q2是一个hq1*wq1*c1的张量。并且,对F1′进行线性变换,获得与F1′对应的第n-1级第二键特征K2和第n-1级第二值特征V2,K2是一个hk1*wk1*c1的张量,V2是一个hv1*wv1*c1的张量。其中,hk1=hv1,wk1=wv1,且hq1小于hk1/hv1,wq1小于wk1/wv1
其次,确定Q2的各个特征点与K2的各个特征点之间的映射关系,并根据该映射关系,确定Q2的各个特征点对应于K2的特征融合区域;并且,确定Q2中的各个特征点与K2中处于特征融合区域的各个特征点之间的相似性,获得第n级第二相似度特征S2(可以将S2中未计算相似性的特征点对应的取值置为0),其中,S2的尺寸为(hq1*wq1)*(hk1*wk1);然后,将V2展开为(hv1*wv1)*c1的矩阵V2′,并计算S2与V2′的内积(即计算{(hq1*wq1)*(hk1*wk1)}·{(hv1*wv1)*c1},·表示内积运算),获得尺寸为(hq1*wq1)*c1的矩阵,并将(hq1*wq1)*c1矩阵重新排列为hq1*wq1*c1形式的张量,该张量即为(hp1*wp1)*c1的第n级第二注意力特征P2。其中,hp1=hq1,wp1=wq1
最后,将P2进行线性变换,使其与F2尺寸相同,再将其与F2叠加,从而获得第n级中间特征F2′。其中,F2′的尺寸与F2相同。
在一些可能的实现方式中,Q2的各个特征点与K2的各个特征点之间的映射关系即同一特征在Q2中的位置与该特征在K2中的位置之间的对应关系,在计算第二相似度时,可以只针对该特征融合区域计算,较针对整个特征区域计算第二相似性而言,可以有效降低计算量。
图10为本公开实施例提供的一种融合特征的映射关系示意图。参照图10,Q2对应特征图中的阴影区域与K2对应特征图中的阴影区域具有映射关系(即两者对应同一特征模式)。针对Q2阴影区域内的任意一个特征点,其第二相似度的计算范围或者第二注意力机制的影响范围仅限于K2特征图的阴影区域。
图11为本公开实施例提供的一种融合特征的获取过程的示意图。
参照图11,F3′表示第n级中间特征,其是一个h3*w3*c2的张量;F4″表示第n+1级融合特征,其是一个h4*w4*c2的张量。
首先,对F3′进行线性变换,获得与F3′对应的第n级第三查询特征Q3,Q3是一个hq2*wq2*c2的张量;并且,对F4″也进行线性变换,获得与F4″对应的第n+1级第三键特征K3和第n+1级第三值特征V3,K3是一个hk2*wk2*c2的张量,V3是一个hv2*wv2*c2的张量。其中,hk2=hv2,wk2=wv2,且hq2大于hk2/hv2,wq2大于wk2/wv2
其次,确定Q3的各个特征点与K3的各个特征点之间的映射关系,并根据该映射关系,确定Q3的各个特征点对应于K3的特征融合区域;确定Q3中的各个特征点与K3中处于特征融合区域的各个特征点之间的相似性,获得第n级第三相似度特征S3(可以将S3中未计算相似性的特征点对应的取值置为0),S3的尺寸为(hq2*wq2)*(hk2*wk2);然后,将V3展开为(hv2*wv2)*c2的矩阵V3′,并计算S3与V3′的内积,获得的计算结果为(hq2*wq2)*c2的矩阵,将该(hq2*wq2)*c2 矩阵重新排列为hq2*wq2*c2的张量形式,从而获得(hp2*wp2)*c2的第n级第三注意力特征P3,其中,hp2=hq2,wp2=wq2
最后,将P3进行线性变换,使其与F3′尺寸相同,再将其与F3′叠加,从而获得第n级融合特征F3″。其中,F3″的尺寸与F3′相同。
在一些可能的实现方式中,Q3的各个特征点与K3的各个特征点之间的映射关系即同一特征在Q3中的位置与该特征在K3中的位置之间的对应关系,在计算第三相似度时,只针对该特征融合区域计算即可,较针对整个特征区域计算第三相似性而言,可以有效降低计算量。
图12为本公开实施例提供的一种融合特征的映射关系示意图。参照图12,Q3对应特征图中的阴影区域与K3对应特征图中的阴影区域具有映射关系(即两者对应同一特征模式)。针对Q3阴影区域内的任意一个特征点,其第二相似度的计算范围或者第二注意力机制的影响范围仅限于K3特征图的阴影区域。
需要说明的是,目标神经网络通常依赖于相应的电子设备,并通过调度电子设备的各类资源(例如,计算资源、存储资源、以及通信资源等)执行数据处理,以得到待处理数据的处理结果。由于本公开实施例中对目标神经网络的结构和处理机制的改进,使得目标神经网络在调度电子设备的资源时,对资源的调度也会存在相应改变(例如,对资源的消耗数量可能产生改变),从而影响电子设备的处理性能。例如,在使用同一电子设备对同样的待处理数据进行处理以得到相应的处理结果时,针对该电子设备利用相关技术进行处理,以及该电子设备通过内部的目标神经网络进行处理,两种情况所得到的处理结果可能是不同的,且两者在处理过程中所消耗的资源类型和/或资源数量也可能是不同的。换言之,本公开实施例的数据处理方法可以改变电子设备对资源的调度方式和调度量,从而改变电子设备的性能以及对资源的利用率。
可以理解,本公开提及的上述各个方法实施例,在不违背原理逻辑的情况下,均可以彼此相互结合形成结合后的实施例,限于篇幅,本公开不再赘述。本领域技术人员可以理解,在具体实施方式的上述方法中,各步骤的具体执行顺序应当以其功能和可能的内在逻辑确定。
此外,本公开还提供了数据处理装置、电子设备、计算机可读存储介质,上述均可用来实现本公开提供的任一种数据处理方法,相应技术方案和描述和参见方法部分的相应记载,不再赘述。
图13为本公开实施例提供的一种数据处理装置的框图。
参照图13,本公开实施例提供了一种数据处理装置,该数据处理装置包括:
数据处理模块13,被配置为将待处理数据输入目标神经网络中处理,用于得到待处理数据的处理结果。
其中,目标神经网络的至少一层卷积层为基于第一注意力机制的注意力卷积层,和/或,目标神经网络的至少两级卷积层之间基于第二注意力机制进行特征融合,第一注意力机制包括针对特征的局部区域的自注意力机制,第二注意力机制包括不同尺度的输出特征之间的注意力机制。
在一些可能的实现方式中,数据处理装置还可以包括输入模块,输入模块被配置为执行将待处理数据输入至目标神经网络的操作。
举例来说,待处理数据可以包括图像数据、语音数据、文本数据、视频数据中的任意一种。本公开实施例对待处理数据的类型和内容不作限制。
在一些可能的实现方式中,通过输入模块将待处理数据输入到目标神经网络之后,数据处理模块利用目标神经网络中的注意力卷积层对输入该层的数据进行自注意力运算,并获得输出特征,以供目标神经网络中的其他网络层基于输出特征进行数据处理,获得处理结果;并且,数据处理模块还可基于目标神经网络中的第二注意力机制对输出特征(包括但不限于注意力卷积层的输出特征)进行融合,获得融合特征,以供其他网络层基于融合特征进行进一步地数据处理,获得处理结果。
在一些可能的实现方式中,目标神经网络可用于执行图像处理任务、语音处理任务、文本处理任务、视频处理任务中的任意一种。与之相应的,待处理数据的处理结果可以是图像处理结果、语音处理结果、文本处理结果、视频处理结果中的任意一种(其中,处理可以包括识别、分类、标注等操作),其与待处理数据的类型、待处理数据的内容、以及目标神经网络的执行任务等相关。本公开实施例对目标神经网络所执行的具体任务类型及处理结果类型不作限制。
在一些可能的实现方式中,目标神经网络包括注意力卷积层,相应的,数据处理模块可被配置为基于第一注意力机制实现特征提取。
图14为本公开实施例提供的一种数据处理装置的框图。参照图14,该数据处理装置包括:变换子模块141、第一注意力处理子模块142和输出特征确定子模块143。其中,变换子模块141,被配置为针对任一注意力卷积层,对注意力卷积层的输入数据进行线性变换,获得与输入数据对应的 第一查询特征、第一键特征和第一值特征;第一注意力处理子模块142,被配置为根据第一查询特征、第一键特征和第一值特征,确定与第一查询特征的多个目标特征点对应的第一注意力特征,其中,第一注意力特征包括与多个目标特征点对应的第一注意力值(例如,第一注意力特征可以包括与各个目标特征点对应的第一注意力值),第一注意力值为针对与目标特征点对应的局部区域确定的,与目标特征点对应的局部区域是第一查询特征中以目标特征点为中心,并根据预设尺寸确定的区域,预设尺寸小于第一查询特征的尺寸,第一注意力值被配置为表征局部区域内的多个特征点与目标特征点之间的关联关系(例如,第一注意力值可以表征局部区域内的各个特征点与目标特征点之间的关联关系);输出特征确定子模块143,被配置为根据第一注意力特征和输入数据,确定与注意力卷积层对应的输出特征,其中,输入数据是待处理数据经由注意力卷积层之前的网络层处理后的数据,处理结果是输出特征经由注意力卷积层之后的网络层处理后得到的。
在一些可能的实现方式中,第一注意力处理子模块包括:区域映射单元、相似性确定单元、第一注意力值获取单元和第一注意力特征获取单元。其中,区域映射单元,被配置为针对多个目标特征点,确定第一键特征中与局部区域对应的多个第一键特征点,以及第一值特征中与局部区域对应的第一值特征点(例如,可以针对各个目标特征点,确定第一键特征中与局部区域对应的多个第一键特征点,以及第一值特征中与局部区域对应的第一值特征点);相似性确定单元,被配置为确定目标特征点与多个第一键特征点之间的相似性(例如,可以确定目标特征点与各个第一键特征点之间的相似性),获得与目标特征点对应的第一相似度特征;第一注意力值获取单元,被配置为根据第一相似度特征与第一值特征点,获得与目标特征点对应的第一注意力值;第一注意力特征获取单元,被配置为根据多个目标特征点的第一注意力值,获得第一注意力特征。
在一些可能的实现方式中,数据处理模块还可以包括:选取子模块和区域确定子模块。其中,选取子模块,被配置为从第一查询特征中选取多个特征点作为目标特征点;区域确定子模块,被配置为根据预设尺寸,确定与多个目标特征点对应的局部区域(例如,可以确定与各个目标特征点对应的局部区域)。
在一些可能的实现方式中,第一查询特征包括向量和矩阵中的任意一种;在第一查询特征为向量的情况下,预设尺寸包括预设特征点数量,且预设特征点数量小于第一查询向量的特征点总数量,局部区域是以目标特征点为中心、特征点数量等于预设特征点数量的向量;在第一查询特征为矩阵的情况下,预设尺寸包括预设行数量和预设列数量,且预设行数量小于第一查询特征的行总数量,预设列数量小于第一查询特征的列总数量,局部区域是以目标特征点为中心、以预设行数量为高度、预设列数量为宽度的矩形区域。
在一些可能的实现方式中,采用滑窗方式获取第一注意力特征,相应的,第一注意力处理子模块除包括第一相似性确定单元、第一注意力值获取单元和第一注意力特征获取单元之外,还可以包括:滑动设置单元、滑动单元。其中,滑动设置单元,被配置为根据局部区域的预设尺寸,设置滑动窗口和步长;滑动单元,被配置为从预设的初始滑动位置开始,沿第一查询特征以步长滑动该滑动窗口,并确定每次滑动操作中与滑动窗口对应的目标特征点,第一键特征中与滑动窗口对应的多个第一键特征点,以及第一值特征中与滑动窗口对应的第一值特征点;第一相似性确定单元,被配置为确定目标特征点与多个第一键特征点之间的相似性,获得与目标特征点对应的第一相似度特征;第一注意力值获取单元,被配置为根据第一相似度特征与第一值特征点,获得与目标特征点对应的第一注意力值;第一注意力特征获取单元,被配置为根据多个目标特征点的第一注意力值,获得第一注意力特征。
在一些可能的实现方式中,第一查询特征与第一键特征相同。第一相似性确定单元在确定目标特征点与多个第一键特征点之间的相似性时,包括:在获得第i个目标特征点与第j个第一键特征点之间的相似性Sij的情况下,根据Sij确定第j个第一查询特征点与第i个目标特征点之间的相似性Sji;其中,i和j均为大于等于1且小于等于M的整数,M为第一查询特征或第一键特征中的特征点的总数量,且i≥j或i≤j。
在一些可能的实现方式中,输出特征确定子模块包括:第一变换单元和第一叠加单元。其中,第一变换单元,被配置为对第一注意力特征进行线性变换,获得与输入数据尺寸相同的第一匹配注意力特征;第一叠加单元,被配置为将第一匹配注意力特征与输入数据进行叠加,获得与输入数据对应的输出特征。
在一些可能的实现方式中,输出特征确定子模块包括:重排列单元和特征获取单元。其中,重排列单元,被配置为根据与第一注意力特征中第一注意力值对应的目标特征点在输入数据中的位置信息,将第一注意力值进行重新排列,获得第二匹配注意力特征;特征获取单元,被配置为根据第 二匹配注意力特征和输入数据中与目标特征点对应的特征点,获得与输入数据对应的输出特征。
在一些可能的实现方式中,目标神经网络的N级卷积层之间基于第二注意力机制进行特征融合,各级卷积层输出的特征的尺度不同,N为大于或等于2的整数,相应的,数据处理模块可被配置为基于第二注意力机制实现特征融合。
图15为本公开实施例提供的一种数据处理装置的框图。参照图15,该数据处理装置包括:第二注意力处理子模块151、第一更新子模块152、第三注意力处理子模块153和第二更新子模块154。其中,第二注意力处理子模块151,被配置为针对第n级卷积层,根据第n-1级卷积层的第n-1级中间特征与第n级卷积层输出的第n级初始特征,确定第n级第二注意力特征,n为整数且2≤n≤N-1;第一更新子模块152,被配置为根据第n级第二注意力特征,对第n级初始特征进行更新,获得第n级中间特征;第三注意力处理子模块153,被配置为根据第n+1级卷积层的第n+1级融合特征与第n级中间特征,确定第n级第三注意力特征;第二更新子模块154,被配置为根据第n级第三注意力特征,对第n级中间特征进行更新,获得第n级融合特征,其中,初始特征是待处理数据经由目标神经网络的卷积层处理后的特征,处理结果是融合特征经由卷积层之后的网络层处理后得到的。
在一些可能的实现方式中,针对第1级卷积层,由于没有特征层次再比之低的卷积层,因此,设置第1级中间特征等于第1级初始特征。相应的,第1级融合特征是基于第1级第三注意力特征对第1级中间特征进行更新获得的,第1级第三注意力特征根据第2级融合特征和第1级中间特征获得。
在一些可能的实现方式中,对于第N级卷积层,由于没有特征层次再比之高的卷积层,因此,设置第N级融合特征等于第N级中间特征。
在一些可能的实现方式中,第二注意力处理子模块包括:第二变换单元、第三变换单元、第一映射单元、第一融合区域确定单元、第二相似性确定单元、第一特征点确定单元和第二注意力特征获取单元。其中,第二变换单元,被配置为对第n级初始特征进行线性变换,获得与第n级初始特征对应的第n级第二查询特征;第三变换单元,被配置为对第n-1级中间特征进行线性变换,获得与第n-1级中间特征对应的第n-1级第二键特征和第n-1级第二值特征;第一映射单元,被配置为确定第n级第二查询特征的多个特征点与第n-1级第二键特征的多个特征点之间的映射关系(例如,第一映射单元可以被配置为确定第n级第二查询特征的各个特征点与第n-1级第二键特征的各个特征点之间的映射关系);第一融合区域确定单元,被配置为根据映射关系,确定第n级第二查询特征的多个特征点对应于第n-1级第二键特征的特征融合区域(例如,第一融合区域确定单元,可以被配置为根据映射关系,确定第n级第二查询特征的各个特征点对应于第n-1级第二键特征的特征融合区域);第二相似性确定单元,被配置为确定第n级第二查询特征中的多个特征点与第n-1级第二键特征中处于特征融合区域的多个特征点之间的相似性,获得第n级第二相似度特征(例如,第二相似性确定单元,可以被配置为确定第n级第二查询特征中的各个特征点与第n-1级第二键特征中处于特征融合区域的各个特征点之间的相似性,获得第n级第二相似度特征);第一特征点确定单元,被配置为确定第n-1级第二值特征中与特征融合区域对应的第n-1级第二值特征点;第二注意力特征获取单元,被配置为确定第n级第二相似度特征与第n-1级第二值特征点之间的内积,获得第n级第二注意力特征。
在一些可能的实现方式中,第一更新子模块包括第二叠加单元,被配置为将第n级第二注意力特征与第n级初始特征叠加,获得第n级中间特征。
在一些可能的实现方式中,第三注意力处理子模块包括:第四变换单元、第五变换单元、第二映射单元、第二融合区域确定单元、第三相似性确定单元、第二特征点确定单元和第三注意力特征获取单元。其中,第四变换单元,被配置为对第n级中间特征进行线性变换,获得与第n级中间特征对应的第n级第三查询特征;第五变换单元,被配置为对第n+1级融合特征进行线性变换,获得与第n+1级融合特征对应的第n+1级第三键特征和第n+1级第三值特征;第二映射单元,被配置为确定第n级第三查询特征的多个特征点与第n+1级第三键特征的多个特征点之间的映射关系(第二映射单元,可以被配置为确定第n级第三查询特征的各个特征点与第n+1级第三键特征的各个特征点之间的映射关系);第二融合区域确定单元,被配置为根据映射关系,确定第n级第三查询特征的多个特征点对应于第n+1级第三键特征的特征融合区域(例如,第二融合区域确定单元,可以被配置为根据映射关系,确定第n级第三查询特征的各个特征点对应于第n+1级第三键特征的特征融合区域);第三相似性确定单元,被配置为确定第n级第三查询特征中的多个特征点与第n+1级第三键特征中处于特征融合区域的多个特征点之间的相似性,获得第n级第三相似度特征(例如,第三相似性确定单元,可以被配置为确定第n级第三查询特征中的各个特征点与第n+1级第三键特征 中处于特征融合区域的各个特征点之间的相似性,获得第n级第三相似度特征);第二特征点确定单元,被配置为确定第n+1级第三值特征中与特征融合区域对应的第n+1级第三值特征点;第三注意力特征获取单元,被配置为确定第n级第三相似度特征与第n+1级第三值特征点之间的内积,获得第n级第三注意力特征。
在一些可能的实现方式中,第二更新子模块包括第三叠加单元,第三叠加单元被配置为将第n级第三注意力特征与第n级中间特征叠加,获得第n级融合特征。
图16为本公开实施例提供的一种电子设备的框图。
参照图16,本公开实施例提供了一种电子设备,该电子设备包括:至少一个处理器1601;至少一个存储器1602,以及一个或多个I/O接口1603,连接在处理器1601与存储器1602之间;其中,存储器1602存储有可被至少一个处理器1601执行的一个或多个计算机程序,一个或多个计算机程序被至少一个处理器1601执行,以使至少一个处理器1601能够执行上述的数据处理方法。
需要说明的是,本公开实施例提供的数据处理方法还可应用于基于众核***的电子设备。图17为本公开实施例提供的一种电子设备的框图。
参照图17,本公开实施例提供了一种电子设备,该电子设备包括多个处理核1701以及片上网络1702,其中,多个处理核1701均与片上网络1702连接,片上网络1702被配置为交互多个处理核间的数据和外部数据。
其中,一个或多个处理核1701中存储有一个或多个指令,一个或多个指令被一个或多个处理核1701执行,以使一个或多个处理核1701能够执行上述的数据处理方法。
在一些实施例中,该电子设备可以是类脑芯片,由于类脑芯片可以采用向量化计算方式,且需要通过外部内存例如双倍速率(Double Data Rate,DDR)同步动态随机存储器调入神经网络模型的权重信息等参数。因此,本公开实施例采用批处理的运算效率较高。
本公开实施例还提供了一种计算机可读存储介质,其上存储有计算机程序。
图18为本公开实施例提供的一种计算机可读介质的组成框图。其中,计算机程序在被处理器/处理核执行时实现上述的数据处理方法。计算机可读存储介质可以是易失性或非易失性计算机可读存储介质。
本公开实施例还提供了一种计算机程序产品,包括计算机可读代码,或者承载有计算机可读代码的非易失性计算机可读存储介质,当计算机可读代码在电子设备的处理器中运行时,电子设备中的处理器执行上述数据处理方法。
本领域普通技术人员可以理解,上文中所公开方法中的全部或某些步骤、***、装置中的功能模块/单元可以被实施为软件、固件、硬件及其适当的组合。在硬件实施方式中,在以上描述中提及的功能模块/单元之间的划分不一定对应于物理组件的划分;例如,一个物理组件可以具有多个功能,或者一个功能或步骤可以由若干物理组件合作执行。某些物理组件或所有物理组件可以被实施为由处理器,如中央处理器、数字信号处理器或微处理器执行的软件,或者被实施为硬件,或者被实施为集成电路,如专用集成电路。这样的软件可以分布在计算机可读存储介质上,计算机可读存储介质可以包括计算机存储介质(或非暂时性介质)和通信介质(或暂时性介质)。
如本领域普通技术人员公知的,术语计算机存储介质包括在用于存储信息(诸如计算机可读程序指令、数据结构、程序模块或其他数据)的任何方法或技术中实施的易失性和非易失性、可移除和不可移除介质。计算机存储介质包括但不限于随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM)、静态随机存取存储器(SRAM)、闪存或其他存储器技术、便携式压缩盘只读存储器(CD-ROM)、数字多功能盘(DVD)或其他光盘存储、磁盒、磁带、磁盘存储或其他磁存储装置、或者可以被配置为存储期望的信息并且可以被计算机访问的任何其他的介质。此外,本领域普通技术人员公知的是,通信介质通常包含计算机可读程序指令、数据结构、程序模块或者诸如载波或其他传输机制之类的调制数据信号中的其他数据,并且可包括任何信息递送介质。
The computer-readable program instructions described here can be downloaded from a computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In scenarios involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized by utilizing state information of the computer-readable program instructions; the electronic circuit can execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.
The computer program product described here may be implemented by hardware, software, or a combination thereof. In one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).
Various aspects of the present disclosure are described here with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing device, create means for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing device, and/or other apparatus to operate in a particular manner, such that the computer-readable medium storing the instructions includes an article of manufacture that includes instructions implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing device, or other apparatus, causing a series of operational steps to be performed on the computer, other programmable data processing device, or other apparatus to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing device, or other apparatus implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by dedicated hardware-based systems that perform the specified functions or actions, or by combinations of dedicated hardware and computer instructions.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted only in a generic and descriptive sense and not for purposes of limitation. In some instances, it will be apparent to those skilled in the art that, unless explicitly stated otherwise, features, characteristics, and/or elements described in connection with a particular embodiment may be used alone, or in combination with features, characteristics, and/or elements described in connection with other embodiments. Accordingly, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the present disclosure as set forth in the appended claims.

Claims (18)

  1. A data processing method, comprising:
    inputting data to be processed into a target neural network for processing to obtain a processing result of the data to be processed, wherein at least one convolution layer of the target neural network is an attention convolution layer based on a first attention mechanism, and/or feature fusion is performed between at least two levels of convolution layers of the target neural network based on a second attention mechanism,
    wherein the first attention mechanism includes a self-attention mechanism for a local area of a feature, and the second attention mechanism includes an attention mechanism, between output features of different scales, for a local area of the output features.
  2. The data processing method according to claim 1, wherein the target neural network includes an attention convolution layer, and the inputting the data to be processed into the target neural network for processing to obtain the processing result of the data to be processed includes:
    for any attention convolution layer, performing a linear transformation on input data of the attention convolution layer to obtain a first query feature, a first key feature, and a first value feature corresponding to the input data;
    determining, according to the first query feature, the first key feature, and the first value feature, a first attention feature corresponding to a plurality of target feature points of the first query feature, wherein the first attention feature includes first attention values corresponding to the plurality of target feature points, each first attention value is determined for the local area corresponding to the target feature point, the local area corresponding to the target feature point is an area in the first query feature that is centered on the target feature point and determined according to a preset size, the preset size is smaller than the size of the first query feature, and the first attention value is used to characterize an association relationship between the plurality of feature points in the local area and the target feature point;
    determining an output feature corresponding to the attention convolution layer according to the first attention feature and the input data,
    wherein the input data is data obtained after the data to be processed is processed by the network layers preceding the attention convolution layer, and the processing result is obtained after the output feature is processed by the network layers following the attention convolution layer.
  3. The data processing method according to claim 2, wherein the determining, according to the first query feature, the first key feature, and the first value feature, the first attention feature corresponding to the plurality of target feature points of the first query feature includes:
    for each of the plurality of target feature points, determining a plurality of first key feature points in the first key feature corresponding to the local area, and first value feature points in the first value feature corresponding to the local area;
    determining similarities between the target feature point and the plurality of first key feature points to obtain a first similarity feature corresponding to the target feature point;
    obtaining a first attention value corresponding to the target feature point according to the first similarity feature and the first value feature points;
    obtaining the first attention feature according to the first attention values of the plurality of target feature points.
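As a minimal, non-authoritative sketch of the computation recited in claims 2 to 4, the following NumPy function attends, for each selected target feature point, over a preset-size local area centered on it; the softmax-normalized scaled dot product is an assumed similarity measure, and all names and the border clipping are hypothetical choices.

```python
import numpy as np

def local_self_attention(x, wq, wk, wv, size=3, targets=None):
    """Local-area self-attention over a feature map x of shape (H, W, C).
    `size` plays the role of the preset size; `targets` is the optional
    selection of target feature points (all points if None)."""
    q, k, v = x @ wq, x @ wk, x @ wv           # first query/key/value features
    H, W, C = q.shape
    r = size // 2
    if targets is None:                        # select target feature points
        targets = [(i, j) for i in range(H) for j in range(W)]
    attn = np.zeros_like(q)
    for i, j in targets:
        i0, i1 = max(i - r, 0), min(i + r + 1, H)
        j0, j1 = max(j - r, 0), min(j + r + 1, W)
        keys = k[i0:i1, j0:j1].reshape(-1, C)  # first key feature points
        vals = v[i0:i1, j0:j1].reshape(-1, C)  # first value feature points
        sim = keys @ q[i, j] / np.sqrt(C)      # first similarity feature
        sim = np.exp(sim - sim.max()); sim /= sim.sum()
        attn[i, j] = sim @ vals                # first attention value
    return attn
```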
  4. The data processing method according to claim 3, wherein before the determining, according to the first query feature, the first key feature, and the first value feature, the first attention feature corresponding to the plurality of target feature points of the first query feature, the data processing method further includes:
    selecting a plurality of feature points from the first query feature as the target feature points;
    determining the local areas corresponding to the plurality of target feature points according to the preset size.
  5. The data processing method according to claim 4, wherein the first query feature includes any one of a vector and a matrix;
    in a case where the first query feature is a vector, the preset size includes a preset number of feature points, the preset number of feature points is smaller than the total number of feature points of the first query feature, and the local area is a vector centered on the target feature point with a number of feature points equal to the preset number of feature points;
    in a case where the first query feature is a matrix, the preset size includes a preset number of rows and a preset number of columns, the preset number of rows is smaller than the total number of rows of the first query feature, the preset number of columns is smaller than the total number of columns of the first query feature, and the local area is a rectangular area centered on the target feature point with a height equal to the preset number of rows and a width equal to the preset number of columns.
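A small sketch of the two cases of claim 5 follows; the clipping at feature borders is an assumption, since the claim does not specify boundary handling, and the function names are hypothetical.

```python
import numpy as np

def local_area_1d(vec, center, num_points):
    """Local area of a vector: `num_points` feature points centered on `center`."""
    r = num_points // 2
    lo, hi = max(center - r, 0), min(center + r + 1, len(vec))
    return vec[lo:hi]

def local_area_2d(mat, center, rows, cols):
    """Local area of a matrix: a `rows` x `cols` rectangle centered on `center`."""
    i, j = center
    ri, rj = rows // 2, cols // 2
    return mat[max(i - ri, 0):min(i + ri + 1, mat.shape[0]),
               max(j - rj, 0):min(j + rj + 1, mat.shape[1])]

# e.g. local_area_1d(np.arange(10), center=4, num_points=3) -> array([3, 4, 5])
```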
  6. The data processing method according to claim 2, wherein the determining, according to the first query feature, the first key feature, and the first value feature, the first attention feature corresponding to the plurality of target feature points of the first query feature includes:
    setting a sliding window and a stride according to the preset size of the local area;
    starting from a preset initial sliding position, sliding the sliding window along the first query feature with the stride, and determining, for a plurality of sliding operations, the target feature point corresponding to the sliding window, a plurality of first key feature points in the first key feature corresponding to the sliding window, and first value feature points in the first value feature corresponding to the sliding window;
    determining similarities between the target feature point and the plurality of first key feature points to obtain a first similarity feature corresponding to the target feature point;
    obtaining a first attention value corresponding to the target feature point according to the first similarity feature and the first value feature points;
    obtaining the first attention feature according to the first attention values of the plurality of target feature points.
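Claim 6 reformulates the target-point selection of the preceding claims as a sliding-window traversal. A minimal 1-D sketch follows, with window size, stride, and initial position as the assumed parameters (here query, key, and value are taken identical for brevity):

```python
import numpy as np

def sliding_targets(length, size, stride, start=0):
    """Yield (target_center, window_slice) pairs produced by sliding a window
    of `size` feature points with the given `stride` from `start`."""
    pos = start
    while pos + size <= length:
        yield pos + size // 2, slice(pos, pos + size)
        pos += stride

q = np.random.rand(12, 8)                   # 12 feature points, 8 channels
for center, win in sliding_targets(len(q), size=3, stride=1):
    sim = q[win] @ q[center]                # similarity within the window
    sim = np.exp(sim - sim.max()); sim /= sim.sum()
    attn_value = sim @ q[win]               # first attention value for this target
```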
  7. The data processing method according to claim 3 or 6, wherein the first query feature is identical to the first key feature;
    the determining similarities between the target feature point and the plurality of first key feature points includes:
    in a case where a similarity Sij between an i-th target feature point and a j-th first key feature point has been obtained, determining, according to Sij, a similarity Sji between the j-th first query feature point and the i-th target feature point;
    wherein i and j are each an integer greater than or equal to 1 and less than or equal to M, M is the total number of feature points in the first query feature or the first key feature, and i≥j or i≤j.
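Where the query and key features are identical and the similarity measure is itself symmetric (as the plain dot product assumed here is), the Sji entries can be read off from the already-computed Sij entries, which is the saving claim 7 describes; a sketch:

```python
import numpy as np

def symmetric_similarity(x):
    """Compute only the upper triangle of S (i <= j) and mirror it,
    since Sij == Sji when the query and key features are identical."""
    M, _ = x.shape
    S = np.zeros((M, M))
    for i in range(M):
        for j in range(i, M):
            S[i, j] = x[i] @ x[j]   # computed once...
            S[j, i] = S[i, j]       # ...and reused for the mirrored pair
    return S

x = np.random.rand(5, 4)
assert np.allclose(symmetric_similarity(x), x @ x.T)
```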
  8. The data processing method according to claim 2, wherein the determining the output feature corresponding to the attention convolution layer according to the first attention feature and the input data includes:
    performing a linear transformation on the first attention feature to obtain a first matched attention feature with the same size as the input data;
    superposing the first matched attention feature with the input data to obtain the output feature corresponding to the input data.
  9. The data processing method according to claim 2, wherein the determining the output feature corresponding to the attention convolution layer according to the first attention feature and the input data includes:
    rearranging the first attention values according to position information, in the input data, of the target feature points corresponding to the first attention values in the first attention feature, to obtain a second matched attention feature;
    obtaining the output feature corresponding to the input data according to the second matched attention feature and the feature points in the input data corresponding to the target feature points.
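One possible reading of claim 9's rearrangement path, sketched with hypothetical shapes: each first attention value is scattered back to its target feature point's position in the input (yielding the second matched attention feature) and then combined with the corresponding input feature points by superposition.

```python
import numpy as np

H, W, C = 8, 8, 4
inp = np.random.rand(H, W, C)                 # input data of the layer
targets = [(1, 1), (3, 5), (6, 2)]            # selected target feature points
attn_vals = np.random.rand(len(targets), C)   # their first attention values

# Rearrange the attention values by the target points' positions in the input.
matched = np.zeros_like(inp)                  # second matched attention feature
for (i, j), a in zip(targets, attn_vals):
    matched[i, j] = a
# `matched` is zero away from the targets, so this updates only those points.
out = inp + matched                           # output feature
```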
  10. The data processing method according to claim 1, wherein feature fusion is performed between N levels of convolution layers of the target neural network based on the second attention mechanism, features output by the convolution layers of the respective levels have different scales, and N is an integer greater than or equal to 2,
    and the inputting the data to be processed into the target neural network for processing to obtain the processing result of the data to be processed includes:
    for the n-th level convolution layer, determining an n-th level second attention feature according to an (n-1)-th level intermediate feature of the (n-1)-th level convolution layer and an n-th level initial feature output by the n-th level convolution layer, wherein n is an integer and 2≤n≤N-1;
    updating the n-th level initial feature according to the n-th level second attention feature to obtain an n-th level intermediate feature;
    determining an n-th level third attention feature according to an (n+1)-th level fused feature of the (n+1)-th level convolution layer and the n-th level intermediate feature;
    updating the n-th level intermediate feature according to the n-th level third attention feature to obtain an n-th level fused feature,
    wherein the initial features are features obtained after the data to be processed is processed by the convolution layers of the target neural network, and the processing result is obtained after the fused features are processed by the network layers following the convolution layers.
  11. The data processing method according to claim 10, wherein the determining, for the n-th level convolution layer, the n-th level second attention feature according to the (n-1)-th level intermediate feature of the (n-1)-th level convolution layer and the n-th level initial feature output by the n-th level convolution layer includes:
    performing a linear transformation on the n-th level initial feature to obtain an n-th level second query feature corresponding to the n-th level initial feature;
    performing a linear transformation on the (n-1)-th level intermediate feature to obtain an (n-1)-th level second key feature and an (n-1)-th level second value feature corresponding to the (n-1)-th level intermediate feature;
    determining a mapping relationship between a plurality of feature points of the n-th level second query feature and a plurality of feature points of the (n-1)-th level second key feature;
    determining, according to the mapping relationship, feature fusion regions of the (n-1)-th level second key feature corresponding to the plurality of feature points of the n-th level second query feature;
    determining similarities between the plurality of feature points in the n-th level second query feature and the feature points of the (n-1)-th level second key feature located in the feature fusion regions, to obtain an n-th level second similarity feature;
    determining (n-1)-th level second value feature points in the (n-1)-th level second value feature corresponding to the feature fusion regions;
    determining an inner product between the n-th level second similarity feature and the (n-1)-th level second value feature points to obtain the n-th level second attention feature.
  12. The data processing method according to claim 10, wherein the determining the n-th level third attention feature according to the (n+1)-th level fused feature of the (n+1)-th level convolution layer and the n-th level intermediate feature includes:
    performing a linear transformation on the n-th level intermediate feature to obtain an n-th level third query feature corresponding to the n-th level intermediate feature;
    performing a linear transformation on the (n+1)-th level fused feature to obtain an (n+1)-th level third key feature and an (n+1)-th level third value feature corresponding to the (n+1)-th level fused feature;
    determining a mapping relationship between a plurality of feature points of the n-th level third query feature and a plurality of feature points of the (n+1)-th level third key feature;
    determining, according to the mapping relationship, feature fusion regions of the (n+1)-th level third key feature corresponding to the plurality of feature points of the n-th level third query feature;
    determining similarities between the plurality of feature points in the n-th level third query feature and the feature points of the (n+1)-th level third key feature located in the feature fusion regions, to obtain an n-th level third similarity feature;
    determining (n+1)-th level third value feature points in the (n+1)-th level third value feature corresponding to the feature fusion regions;
    determining an inner product between the n-th level third similarity feature and the (n+1)-th level third value feature points to obtain the n-th level third attention feature.
  13. The data processing method according to claim 10, wherein the updating the n-th level initial feature according to the n-th level second attention feature to obtain the n-th level intermediate feature includes:
    superposing the n-th level second attention feature with the n-th level initial feature to obtain the n-th level intermediate feature,
    and wherein the updating the n-th level intermediate feature according to the n-th level third attention feature to obtain the n-th level fused feature includes:
    superposing the n-th level third attention feature with the n-th level intermediate feature to obtain the n-th level fused feature.
  14. The data processing method according to claim 10, wherein a 1st level intermediate feature is equal to a 1st level initial feature, a 1st level fused feature is obtained by updating the 1st level intermediate feature based on a 1st level third attention feature, and the 1st level third attention feature is obtained according to a 2nd level fused feature and the 1st level intermediate feature;
    an N-th level fused feature is equal to an N-th level intermediate feature.
  15. A data processing device, comprising:
    a data processing module configured to input data to be processed into a target neural network for processing, to obtain a processing result of the data to be processed, wherein at least one convolution layer of the target neural network is an attention convolution layer based on a first attention mechanism, and/or feature fusion is performed between at least two levels of convolution layers of the target neural network based on a second attention mechanism,
    wherein the first attention mechanism includes a self-attention mechanism for a local area of a feature, and the second attention mechanism includes an attention mechanism, between output features of different scales, for a local area of the output features.
  16. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores one or more computer programs executable by the at least one processor, and the one or more computer programs are executed by the at least one processor to enable the at least one processor to perform the data processing method according to any one of claims 1 to 14.
  17. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the data processing method according to any one of claims 1 to 14.
  18. A computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes instructions for implementing the data processing method according to any one of claims 1 to 14.
PCT/CN2023/089744 2022-04-22 2023-04-21 Data processing method and device, equipment, and medium WO2023202695A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210427732.3 2022-04-22
CN202210427732.3A CN114781513A (zh) 2022-04-22 2022-04-22 Data processing method and device, equipment, and medium

Publications (1)

Publication Number Publication Date
WO2023202695A1 true WO2023202695A1 (zh) 2023-10-26

Family

ID=82430345

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/089744 WO2023202695A1 (zh) 2022-04-22 2023-04-21 Data processing method and device, equipment, and medium

Country Status (2)

Country Link
CN (1) CN114781513A (zh)
WO (1) WO2023202695A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114781513A (zh) 2022-04-22 2022-07-22 北京灵汐科技有限公司 Data processing method and device, equipment, and medium
CN115034375B (zh) * 2022-08-09 2023-06-27 北京灵汐科技有限公司 Data processing method and device, neural network model, equipment, and medium


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11200459B1 (en) * 2017-10-10 2021-12-14 Snap Inc. Adversarial network for transfer learning
CN111563551A (zh) * 2020-04-30 2020-08-21 支付宝(杭州)信息技术有限公司 Multi-modal information fusion method and apparatus, and electronic device
CN112906718A (zh) * 2021-03-09 2021-06-04 西安电子科技大学 Multi-target detection method based on a convolutional neural network
CN113393474A (zh) * 2021-06-10 2021-09-14 北京邮电大学 Method for classifying and segmenting three-dimensional point clouds based on feature fusion
CN113627466A (zh) * 2021-06-30 2021-11-09 北京三快在线科技有限公司 Image tag recognition method and apparatus, electronic device, and readable storage medium
CN114781513A (zh) 2022-04-22 2022-07-22 北京灵汐科技有限公司 Data processing method and device, equipment, and medium

Also Published As

Publication number Publication date
CN114781513A (zh) 2022-07-22

Similar Documents

Publication Publication Date Title
WO2023202695A1 (zh) Data processing method and device, equipment, and medium
Wang et al. Kvt: k-nn attention for boosting vision transformers
CN111104962A Image semantic segmentation method and apparatus, electronic device, and readable storage medium
CN112990219B Method and apparatus for image semantic segmentation
CN108280451B Semantic segmentation and network training method and apparatus, device, and medium
US11575502B2 Homomorphic encryption processing device, system including the same and method of performing homomorphic encryption processing
US20200257902A1 Extraction of spatial-temporal feature representation
CN111401436A Street-view image segmentation method fusing a network and a dual-channel attention mechanism
CN113343982A Entity relation extraction method, apparatus, and device based on multi-modal feature fusion
Ma et al. Spatial pyramid attention for deep convolutional neural networks
CN112329801A Method for constructing non-local information in a convolutional neural network
CN113343981A Character recognition method, apparatus, and device with visual feature enhancement
US8712159B2 Image descriptor quantization
CN113705575B Image segmentation method, apparatus, device, and storage medium
Fu et al. Featup: A model-agnostic framework for features at any resolution
CN114049491A Fingerprint segmentation model training and fingerprint segmentation method, apparatus, device, and medium
CN114202648A Text image rectification method, training method, apparatus, electronic device, and medium
WO2024032585A1 Data processing method and device, neural network model, equipment, and medium
CN117036699A Point cloud segmentation method based on a Transformer neural network
CN115222947B Rock joint segmentation method and apparatus based on a global self-attention transformer network
CN113688783B Face feature extraction method, low-resolution face recognition method, and device
US20200372280A1 Apparatus and method for image processing for machine learning
CN115082295B Image editing method and apparatus based on a self-attention mechanism
CN111242299A CNN model compression method and apparatus based on a DS structure, and storage medium
CN115620013B Semantic segmentation method and apparatus, computer device, and computer-readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23791347

Country of ref document: EP

Kind code of ref document: A1