CN113177562A - Vector determination method and device based on self-attention mechanism fusion context information - Google Patents

Vector determination method and device based on self-attention mechanism fusion context information

Info

Publication number
CN113177562A
CN113177562A (application CN202110488969.8A)
Authority
CN
China
Prior art keywords
vector
order context
context key
convolution operation
key vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110488969.8A
Other languages
Chinese (zh)
Other versions
CN113177562B (en)
Inventor
李业豪 (Yehao Li)
姚霆 (Ting Yao)
潘滢炜 (Yingwei Pan)
梅涛 (Tao Mei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JD Digital Technology Holdings Co Ltd
Original Assignee
JD Digital Technology Holdings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JD Digital Technology Holdings Co Ltd filed Critical JD Digital Technology Holdings Co Ltd
Priority to CN202110488969.8A priority Critical patent/CN113177562B/en
Publication of CN113177562A publication Critical patent/CN113177562A/en
Application granted granted Critical
Publication of CN113177562B publication Critical patent/CN113177562B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 10/443: Local feature extraction by analysis of parts of the pattern (e.g. edges, contours, corners, strokes) by matching or filtering
    • G06V 10/449: Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/25: Fusion techniques
    • G06N 3/045: Combinations of networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods


Abstract

The application discloses a vector determination method and device that fuse context information based on a self-attention mechanism. One embodiment of the method comprises: determining a key vector, a query vector and a value vector for each feature point in a feature map; performing a convolution operation on the key vectors of the feature points in the feature map with a convolution kernel of a preset size to obtain a first-order context key vector that fuses the context information of each feature point; obtaining a second-order context key vector from the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector; and fusing the first-order context key vector and the second-order context key vector to determine a target vector. The application thus provides a method for determining a vector that fuses context information based on a self-attention mechanism, improving the expressive power of the target vector; in turn, a target vector with stronger expressive power can be supplied to machine vision tasks, improving the accuracy with which those tasks are processed.

Description

Vector determination method and device based on self-attention mechanism fusion context information
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a vector determination method and device based on self-attention mechanism fusion context information.
Background
Inspired by self-attention in the Transformer from the field of natural language processing, neural network designs in the field of machine vision recognition have gradually incorporated the self-attention mechanism. The conventional self-attention mechanism generally calculates the attention weight corresponding to each key vector from pairwise independent query-key pairs, and the attention weights are finally applied to the value vectors (values) to obtain an output vector.
Disclosure of Invention
The embodiment of the application provides a vector determination method and device based on self-attention mechanism fusion context information.
In a first aspect, an embodiment of the present application provides a vector determination method that fuses context information based on a self-attention mechanism, including: determining a key vector, a query vector and a value vector for each feature point in a feature map; performing a convolution operation on the key vectors of the feature points in the feature map with a convolution kernel of a preset size to obtain a first-order context key vector that fuses the context information of each feature point; obtaining a second-order context key vector from the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector; and fusing the first-order context key vector and the second-order context key vector to determine a target vector.
In some embodiments, obtaining the second-order context key vector from the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector includes: concatenating the first-order context key vector and a target query vector to obtain a concatenated vector, wherein the target query vector is the query vector of the feature point at the central position of the receptive field of the convolution operation that produced the first-order context key vector; and obtaining the second-order context key vector from the concatenated vector and the value vectors of the feature points in that receptive field.
In some embodiments, obtaining the second-order context key vector from the concatenated vector and the value vectors of the feature points in the receptive field of the convolution operation that produced the first-order context key vector includes: performing multiple convolution operations on the concatenated vector to obtain an attention matrix; and obtaining the second-order context key vector via a local matrix multiplication between the value vectors of the feature points in the receptive field and the attention matrix.
In some embodiments, the size of the local matrix in the local matrix multiplication operation is the same as the preset size.
In some embodiments, the self-attention mechanism is a multi-headed self-attention mechanism; and the above method further comprises: and determining a target vector corresponding to the multi-head self-attention mechanism according to the target vector corresponding to each head in the multi-head self-attention mechanism.
In some embodiments, the above method further comprises: replacing the output vector of a convolution operation in a neural network whose convolution kernel has the preset size with the finally determined target vector, and processing a visual recognition task through the neural network.
In a second aspect, an embodiment of the present application provides a vector determination apparatus that fuses context information based on a self-attention mechanism, including: a first determination unit configured to determine a key vector, a query vector and a value vector for each feature point in a feature map; a convolution unit configured to perform a convolution operation on the key vectors of the feature points in the feature map with a convolution kernel of a preset size, obtaining a first-order context key vector that fuses the context information of each feature point; an obtaining unit configured to obtain a second-order context key vector from the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector; and a fusion unit configured to fuse the first-order context key vector and the second-order context key vector to determine a target vector.
In some embodiments, the obtaining unit is further configured to: concatenate the first-order context key vector and a target query vector to obtain a concatenated vector, wherein the target query vector is the query vector of the feature point at the central position of the receptive field of the convolution operation that produced the first-order context key vector; and obtain the second-order context key vector from the concatenated vector and the value vectors of the feature points in that receptive field.
In some embodiments, the obtaining unit is further configured to: perform multiple convolution operations on the concatenated vector to obtain an attention matrix; and obtain the second-order context key vector via a local matrix multiplication between the value vectors of the feature points in the receptive field and the attention matrix.
In some embodiments, the size of the local matrix in the local matrix multiplication operation is the same as the preset size.
In some embodiments, the self-attention mechanism is a multi-headed self-attention mechanism; and the above apparatus further comprises: and the second determining unit is configured to determine a target vector corresponding to the multi-head self-attention mechanism according to the target vector corresponding to each head in the multi-head self-attention mechanism.
In some embodiments, the above apparatus further comprises: a processing unit configured to replace the output vector of a convolution operation in a neural network whose convolution kernel has the preset size with the finally determined target vector, and to process a visual recognition task through the neural network.
In a third aspect, the present application provides a computer-readable medium, on which a computer program is stored, where the program, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement a method as described in any implementation of the first aspect.
According to the vector determination method and device that fuse context information based on a self-attention mechanism provided by the embodiments of the present application, the key vector, query vector and value vector of each feature point in the feature map are determined; a convolution operation is performed on the key vectors of the feature points in the feature map with a convolution kernel of a preset size, obtaining a first-order context key vector that fuses the context information of each feature point; a second-order context key vector is obtained from the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector; and the first-order context key vector and the second-order context key vector are fused to determine the target vector. This provides a method for determining a vector that fuses context information based on a self-attention mechanism and improves the expressive power of the target vector; in turn, a target vector with stronger expressive power can be supplied to machine vision tasks, improving the accuracy with which those tasks are processed.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for vector determination based on self-attention mechanism fusion context information according to the present application;
FIG. 3 is a schematic diagram of an application scenario of the vector determination method based on self-attention mechanism fusion of context information according to the present embodiment;
FIG. 4 is a flow diagram of yet another embodiment of a method for vector determination based on self-attention mechanism fusion context information according to the present application;
FIG. 5 is a schematic flow chart of obtaining a target vector with the multi-head self-attention mechanism according to the present application;
FIG. 6 is a block diagram of one embodiment of a vector determination apparatus fusing context information based on a self-attention mechanism according to the present application;
FIG. 7 is a block diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary architecture 100 to which the present vector determination method and apparatus based on self-attention mechanism fusion of context information may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The communication connections between the terminal devices 101, 102, 103 form a topological network, and the network 104 serves to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 may be hardware devices or software that support network connections for data interaction and data processing. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices supporting network connection, information acquisition, interaction, display, and processing, including but not limited to smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above and implemented, for example, as multiple pieces of software or software modules providing distributed services, or as a single piece of software or software module. This is not particularly limited herein.
The server 105 may be a server providing various services, for example, a background processing server that, for a feature map of an image to be recognized sent by a user through the terminal devices 101, 102, 103 for machine vision recognition, obtains a target vector fusing the context information of the feature points based on a self-attention mechanism. Optionally, the server may use the obtained target vector for various downstream machine vision tasks, such as target object detection and semantic segmentation. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., for providing distributed services), or as a single piece of software or software module. This is not particularly limited herein.
It should be further noted that the vector determination method based on the self-attention mechanism fusion context information provided by the embodiment of the present application may be executed by a server, or may be executed by a terminal device, or may be executed by the server and the terminal device in cooperation with each other. Accordingly, each part (for example, each unit) included in the vector determination apparatus based on the self-attention mechanism fusion context information may be entirely disposed in the server, may be entirely disposed in the terminal device, and may be disposed in the server and the terminal device, respectively.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. When the electronic device on which the vector determination method based on the self-attention mechanism fusion context information operates does not need to perform data transmission with other electronic devices, the system architecture may include only the electronic device (e.g., a server or a terminal device) on which the vector determination method based on the self-attention mechanism fusion context information operates.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for vector determination based on self-attention mechanism fusion context information is shown, comprising the steps of:
step 201, determining a key vector, a query vector and a value vector of a feature point in a feature map.
In this embodiment, an executing subject (e.g., the server in fig. 1) of the vector determination method based on the self-attention mechanism fused context information may determine a key vector (key), a query vector (query), and a value vector (value) of a feature point in a feature map.
The image to be recognized represented by the feature map is an image to be subjected to a machine vision recognition task, and the machine vision recognition task comprises but is not limited to image recognition, object detection, semantic segmentation and the like. Accordingly, any content may be included in the image to be recognized.
As an example, the execution body may obtain the key vector, query vector and value vector corresponding to each feature point by applying, to the feature vectors of the feature points in the feature map, the transformation matrices corresponding to the key vector, the query vector and the value vector respectively.
As yet another example, the execution subject may determine a feature vector of a feature point in the feature map as one or more of a key vector, a query vector, and a value vector corresponding to the feature point. Specifically, the executing body may determine the feature vector of the feature point in the feature map as the key vector and the query vector corresponding to the feature point, and obtain the value vector corresponding to the feature point by applying a transformation matrix corresponding to the value vector to the feature vector of the feature point in the feature map.
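The two derivation options above can be sketched as follows. This is an illustrative sketch only: the matrices `W_k`, `W_q`, `W_v` stand in for learned transformation matrices, and all shapes and names are assumptions rather than details from the application.

```python
import numpy as np

# Illustrative sketch: W_k, W_q, W_v are random placeholders for learned
# transformation matrices applied per feature point.
rng = np.random.default_rng(0)
H, W, C = 4, 4, 8                      # feature-map height, width, channels
feature_map = rng.standard_normal((H, W, C))

W_k = rng.standard_normal((C, C))
W_q = rng.standard_normal((C, C))
W_v = rng.standard_normal((C, C))

keys = feature_map @ W_k               # key vector per feature point
queries = feature_map @ W_q            # query vector per feature point
values = feature_map @ W_v             # value vector per feature point

# Variant from the second example: reuse the feature vector itself as the
# key and query, and only transform the values.
keys_alt, queries_alt = feature_map, feature_map
values_alt = feature_map @ W_v
```

Either way, each feature point ends up with one key, one query, and one value vector of the same channel dimension as the feature map.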
Step 202, performing a convolution operation on the key vectors of the feature points in the feature map with a convolution kernel of a preset size to obtain a first-order context key vector that fuses the context information of each feature point.
In this embodiment, the execution body may perform a convolution operation on the key vectors of the feature points in the feature map with a convolution kernel of a preset size, thereby obtaining a first-order context key vector that fuses the context information of each feature point.
The preset size may be set according to the actual situation (e.g., the computational cost of the convolution and the range of context information of the feature points to be fused); for example, the preset size may be 3 × 3. It should be understood that the term context information originates in the field of natural language processing; in the field of machine vision recognition, it can be understood concretely as the feature information of the feature points located around a given feature point.
The key vectors corresponding to the feature points can form a key vector matrix. For each feature point in the feature map, the execution subject may perform a convolution operation with a convolution kernel of the preset size to obtain a first-order context key vector of the context information represented by the key vectors of the feature points in the receptive field of the convolution operation corresponding to that feature point. In the receptive field corresponding to a feature point, the feature point sits at the central position.
Taking a 3 × 3 convolution kernel as an example, the execution body performs a convolution operation on the key vectors corresponding to the feature points contained in the key vector matrix within the receptive field of each convolution operation (3 × 3 = 9 feature points), so as to obtain the first-order context key vector corresponding to the feature point located at the central position of the receptive field.
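The 3 × 3 convolution over the key-vector map can be sketched as below. This is a minimal per-channel convolution with zero padding; the kernel weights are random placeholders, whereas a real implementation would use a learned (typically grouped) convolution.

```python
import numpy as np

def first_order_context_keys(keys, kernel):
    """3x3 convolution over the key-vector map (zero padding, stride 1).

    keys:   (H, W, C) key vectors of the feature points.
    kernel: (3, 3, C) per-channel weights (placeholder for learned ones).
    Returns (H, W, C): at each feature point, the first-order context key
    vector aggregating the 3 x 3 = 9 keys in its receptive field.
    """
    H, W, C = keys.shape
    padded = np.pad(keys, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(keys)
    for i in range(H):
        for j in range(W):
            # The feature point sits at the centre of its receptive field.
            patch = padded[i:i + 3, j:j + 3, :]
            out[i, j] = (patch * kernel).sum(axis=(0, 1))
    return out

rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 4, 8))
kernel = rng.standard_normal((3, 3, 8))
k1 = first_order_context_keys(keys, kernel)
```

The output has the same spatial layout as the input, so every feature point now carries a key vector already fused with its 3 × 3 neighbourhood.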
Step 203, obtaining a second-order context key vector from the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector.
In this embodiment, the execution body may obtain the second-order context key vector from the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector.
The query vectors and value vectors corresponding to that receptive field are the query vectors and value vectors corresponding to the feature points within it.
As an example, the execution subject may obtain the second-order context key vector through operations such as vector concatenation and multiplication among the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector.
In some optional implementations of this embodiment, the executing main body may execute the step 203 by:
firstly, splicing the first-order context key vector and the target query vector to obtain a spliced vector.
And the target query vector is characterized to obtain the query vector of the feature point at the central position in the receptive field of the convolution operation of the first-order context key vector.
The first-order context key vector corresponds to the feature point at the central position of the receptive field of the convolution operation for obtaining the first-order context key vector, the target query vector also corresponds to the feature point at the central position in the receptive field of the convolution operation for obtaining the first-order context key vector, and the corresponding first-order context key vector and the target query vector are spliced to obtain a spliced vector corresponding to the feature point. It can be understood that, for each feature point in the feature map, the above determination process of the stitching vector is performed, and a stitching vector corresponding to each feature point can be obtained.
And secondly, obtaining a second-order context key vector according to the splicing vector and the value vector of the feature point in the receptive field of the convolution operation for obtaining the first-order context key vector.
As an example, the execution agent may obtain an attention matrix representing attention to context information in the receptive field by concatenating vectors, and further perform a multiplication operation according to the attention matrix and a value vector of a feature point in the receptive field of a convolution operation for obtaining a first-order context key vector, to obtain a second-order context key vector.
In some optional implementations of this embodiment, the executing body may execute the second step by:
first, a plurality of convolution operations are performed on the spliced vectors to obtain an attention matrix.
As an example, the above-described execution subject may derive the attention matrix based on two convolution operations with a convolution kernel of 1 × 1. Wherein the first convolution operation of the two convolution operations has an activation function, for example, a ReLU activation function, and the second convolution operation has no activation function.
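A 1 × 1 convolution acts as a per-feature-point linear layer, so the two operations above can be sketched as matrix products. The weight shapes are assumptions; in particular, the attention matrix is taken here as one score per position of the 3 × 3 receptive field, which matches the local matrix multiplication described next.

```python
import numpy as np

def attention_matrix(k1, queries, W1, W2):
    """Concatenate the first-order context key with the centre query, then
    apply two 1x1 convolutions: the first with a ReLU activation, the
    second with none. Returns (H, W, 9): one attention score per position
    of the 3x3 receptive field at every feature point."""
    concat = np.concatenate([k1, queries], axis=-1)  # (H, W, 2C)
    hidden = np.maximum(concat @ W1, 0.0)            # 1x1 conv + ReLU
    return hidden @ W2                               # 1x1 conv, no activation

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
k1 = rng.standard_normal((H, W, C))       # first-order context keys
queries = rng.standard_normal((H, W, C))  # centre queries
W1 = rng.standard_normal((2 * C, C))      # placeholder 1x1 conv weights
W2 = rng.standard_normal((C, 9))          # 9 = 3x3 receptive-field positions
attn = attention_matrix(k1, queries, W1, W2)
```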
Then, the second-order context key vector is obtained via a local matrix multiplication between the value vectors of the feature points in the receptive field and the attention matrix.
The corresponding size of the local matrix may be specifically set according to an actual situation, and is not limited herein.
In some optional implementations of the present embodiment, a corresponding size of the local matrix in the local matrix multiplication operation is the same as a preset size.
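The local matrix multiplication can be sketched as follows: at each feature point, the attention scores weight the value vectors inside a 3 × 3 window matching the preset size. This is an illustrative sketch with random inputs, not the application's implementation.

```python
import numpy as np

def local_matrix_multiply(values, attn):
    """values: (H, W, C) value vectors; attn: (H, W, 9) attention scores.

    At each feature point, weight the 9 value vectors in its 3x3
    receptive field (zero padded) by the 9 attention scores, producing
    the second-order context key vector at that point."""
    H, W, C = values.shape
    padded = np.pad(values, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, C))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + 3, j:j + 3, :].reshape(9, C)
            out[i, j] = attn[i, j] @ patch   # (9,) @ (9, C) -> (C,)
    return out

rng = np.random.default_rng(0)
values = rng.standard_normal((4, 4, 8))
attn = rng.standard_normal((4, 4, 9))
k2 = local_matrix_multiply(values, attn)
```

Note that the local matrix size (3 × 3) matches the preset convolution-kernel size, as the implementation above specifies.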
And 204, fusing the first-order context key vector and the second-order context key vector to determine a target vector.
In this embodiment, the execution body may fuse the first-order context key vector and the second-order context key vector to determine the target vector.
With continued reference to fig. 3, fig. 3 is a schematic diagram 300 of an application scenario of the vector determination method based on self-attention mechanism fusion of context information according to the present embodiment. In the application scenario of fig. 3, the server 301 first obtains a feature map 303 from the terminal device 302. The server 301 then determines the key vectors, query vectors and value vectors of the feature points in the feature map 303. Next, a convolution operation is performed on the key vectors of the feature points in the feature map 303 with a convolution kernel of a preset size of 3 × 3, obtaining a first-order context key vector 304 that fuses the context information of the feature points. Then, a second-order context key vector 307 is obtained from the first-order context key vector 304 and the query vectors 305 and value vectors 306 corresponding to the receptive field of the convolution operation that produced the first-order context key vector; finally, the first-order context key vector 304 and the second-order context key vector 307 are fused to determine a target vector 308.
In the method provided by the above embodiment of the present application, the key vector, query vector and value vector of each feature point in the feature map are determined; a convolution operation is performed on the key vectors of the feature points in the feature map with a convolution kernel of a preset size, obtaining a first-order context key vector that fuses the context information of each feature point; a second-order context key vector is obtained from the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector; and the first-order context key vector and the second-order context key vector are fused to determine the target vector. This provides a method for determining a vector that fuses context information based on a self-attention mechanism and improves the expressive power of the target vector; in turn, a target vector with stronger expressive power can be supplied to machine vision tasks, improving the accuracy with which those tasks are processed.
In some alternative implementations of the present embodiment, the self-attention mechanism is a multi-head self-attention mechanism. The execution body may further determine the target vector corresponding to the multi-head self-attention mechanism from the target vectors corresponding to the individual heads of the multi-head self-attention mechanism.
As an example, the target vectors corresponding to the individual heads are concatenated and then linearly transformed to obtain the target vector corresponding to the multi-head self-attention mechanism.
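This concatenate-then-project step can be sketched as follows; the head count, the dimensions, and the output projection `W_o` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, num_heads = 4, 4, 8, 2

# Hypothetical per-head target vectors: stand-ins for the outputs of the
# single-head procedure described above, run once per head.
head_targets = [rng.standard_normal((H, W, C)) for _ in range(num_heads)]

# Concatenate along the channel axis, then apply a linear transformation.
W_o = rng.standard_normal((num_heads * C, C))    # output projection
concat = np.concatenate(head_targets, axis=-1)   # (H, W, num_heads * C)
multi_head_target = concat @ W_o                 # (H, W, C)
```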
In some optional implementations of this embodiment, the execution body may further replace the output vector of a convolution operation whose convolution kernel in the neural network is of the preset size with the finally determined target vector, and process a visual recognition task through the neural network.
In this implementation, replacing the output vector of the convolution operation whose convolution kernel is of the preset size with the finally determined target vector enables the neural network to process the visual recognition task using the target vector that fuses the context information, improving the accuracy of visual recognition.
Specifically, the execution body may replace a convolution operation whose convolution kernel in a neural network for processing a machine vision task is of the preset size with the network structure that obtains the target vector from the key vector, the query vector, and the value vector of the feature points.
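A minimal sketch of such a replacement, using hypothetical `Conv` and `ContextAttentionBlock` placeholder classes (the text does not prescribe a concrete API; only the substitution rule comes from the description above):

```python
from dataclasses import dataclass

@dataclass
class Conv:
    kernel_size: int   # e.g. 3 for a 3x3 convolution
    channels: int

@dataclass
class ContextAttentionBlock:
    channels: int      # stand-in for the structure that derives the target vector

def replace_convs(layers, preset_size):
    """Replace every convolution whose kernel matches the preset size
    with a context-attention block over the same channels."""
    return [
        ContextAttentionBlock(layer.channels)
        if isinstance(layer, Conv) and layer.kernel_size == preset_size
        else layer
        for layer in layers
    ]

# toy network: only the 3x3 convolutions are swapped out
net = [Conv(7, 64), Conv(3, 64), Conv(1, 128), Conv(3, 128)]
new_net = replace_convs(net, preset_size=3)
```

The 7×7 and 1×1 convolutions are left untouched, so the rest of the network topology is preserved while every preset-size convolution now outputs a context-fused target vector.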
With continuing reference to FIG. 4, a schematic flow chart 400 illustrating one embodiment of the vector determination method based on the self-attention mechanism fusing context information according to the present application is shown, including the following steps:
Step 401, determining a key vector, a query vector and a value vector of a feature point in a feature map.
Step 402, performing convolution operation on the key vectors of the feature points in the feature map by using a convolution kernel with a preset size to obtain a first-order context key vector of the context information of the fused feature points.
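Step 402 can be sketched as a plain "same"-padded K × K convolution over the key map; the shapes and the randomly initialized kernel below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(3)
H, W, C, K = 4, 4, 8, 3       # feature map size, channels, preset kernel size
pad = K // 2

keys = rng.standard_normal((H, W, C))        # key vectors of the feature points
kernel = rng.standard_normal((K, K, C, C))   # stand-in for the learned KxK kernel

# zero-pad so every feature point has a full KxK receptive field
padded = np.pad(keys, ((pad, pad), (pad, pad), (0, 0)))
first_order = np.zeros((H, W, C))
for i in range(H):
    for j in range(W):
        window = padded[i:i + K, j:j + K]    # (K, K, C) keys in the receptive field
        # each output vector mixes all key vectors in the window: this is how
        # the first-order context key fuses the context of the feature point
        first_order[i, j] = np.einsum('klc,klcd->d', window, kernel)
```

Because each output position aggregates the K × K neighborhood of key vectors, `first_order[i, j]` already encodes the static local context of feature point (i, j).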
Step 403, splicing the first-order context key vector and a target query vector to obtain a spliced vector.
The target query vector is the query vector of the feature point at the central position in the receptive field of the convolution operation that obtains the first-order context key vector.
Step 404, performing convolution operations on the spliced vector multiple times to obtain an attention matrix.
Step 405, a second-order context key vector is obtained based on a local matrix multiplication operation between the value vector of the feature point in the receptive field and the attention matrix.
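Steps 404 and 405 can be sketched as follows for a single head. The 1 × 1 convolutions reduce to per-position matrix multiplications; the intermediate ReLU and the random weights are assumptions for illustration, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, C, K, D = 4, 4, 8, 3, 16   # map size, channels, window size, hidden width
pad = K // 2

# assumed inputs: spliced map (first-order keys ++ queries) and a value map
spliced = rng.standard_normal((H, W, 2 * C))
values = rng.standard_normal((H, W, C))

# step 404: two successive 1x1 convolutions (per-position matmuls) -> attention
theta = rng.standard_normal((2 * C, D))
delta = rng.standard_normal((D, K * K))
attn = np.maximum(spliced @ theta, 0) @ delta   # (H, W, K*K), one weight per
                                                # position in the local window

# step 405: local matrix multiplication between the KxK window of value
# vectors and the attention weights at each position
padded = np.pad(values, ((pad, pad), (pad, pad), (0, 0)))
second_order = np.zeros((H, W, C))
for i in range(H):
    for j in range(W):
        window = padded[i:i + K, j:j + K].reshape(K * K, C)
        second_order[i, j] = attn[i, j] @ window   # (K*K,) @ (K*K, C) -> (C,)
```

Unlike global self-attention, the matrix multiplication here is local: each position only attends over the K × K receptive field, which is what makes the result a second-order *context* key vector.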
Step 406, fusing the first-order context key vector and the second-order context key vector to determine a target vector.
Step 407, determining a target vector corresponding to the multi-head self-attention mechanism according to the target vector corresponding to each head in the multi-head self-attention mechanism.
As an example, as shown in FIG. 5, a flow diagram 500 of obtaining a target vector with a multi-head self-attention mechanism is shown. The feature vectors of the feature points in the feature map X are taken as the key vectors and query vectors corresponding to the feature points, and the value vectors corresponding to the feature points are obtained by applying the transformation matrix corresponding to the value vectors to the feature vectors of the feature points in the feature map X.
First, a K × K convolution operation is applied to the feature map X of size H × W × C to obtain the feature map of size H × W × C corresponding to the first-order context key vectors. This feature map and the feature map corresponding to the query vectors are then spliced (Concat) to obtain a spliced feature map of size H × W × 2C. Next, a 1 × 1 convolution operation θ is applied to the spliced feature map to obtain a feature map of size H × W × D, and a further 1 × 1 convolution operation δ is applied to obtain a feature map of size H × W × (K × K × C_H), where C_H is the number of heads of the multi-head self-attention mechanism. Then, a local matrix multiplication is performed between the obtained feature map of size H × W × (K × K × C_H) and the feature map corresponding to the value vectors; the operation result is taken as the feature map corresponding to the second-order context key vectors, and is fused (Fusion) with the feature map corresponding to the first-order context key vectors to obtain the feature map Y corresponding to the target vectors.
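Under the figure's notation, the whole pipeline can be sketched end-to-end in NumPy. The queries are taken as X itself per the description above; the random weights, the naive convolution loop, and the elementwise addition standing in for the unspecified Fusion operator are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
H, W, C, K, C_H, D = 4, 4, 8, 3, 2, 16   # figure notation; C_H = number of heads
d_h = C // C_H                            # channels per head
pad = K // 2

X = rng.standard_normal((H, W, C))        # input feature map X

def conv2d(x, kernel):
    """Naive 'same'-padded convolution; kernel shape (K, K, C_in, C_out)."""
    k = kernel.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros((x.shape[0], x.shape[1], kernel.shape[-1]))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.einsum('klc,klcd->d', xp[i:i + k, j:j + k], kernel)
    return out

# K x K convolution on the keys -> first-order context key map (H, W, C)
first_order = conv2d(X, rng.standard_normal((K, K, C, C)))

# splice (Concat) with the query map (the queries are X itself) -> (H, W, 2C)
spliced = np.concatenate([first_order, X], axis=-1)

# two 1x1 convolutions theta and delta -> attention map (H, W, K*K*C_H)
theta = rng.standard_normal((2 * C, D))
delta = rng.standard_normal((D, K * K * C_H))
attn = (spliced @ theta) @ delta

# local matrix multiplication with the value map, one weight set per head
V = X @ rng.standard_normal((C, C))       # value map via a transformation matrix
Vp = np.pad(V, ((pad, pad), (pad, pad), (0, 0)))
second_order = np.zeros((H, W, C))
for i in range(H):
    for j in range(W):
        win = Vp[i:i + K, j:j + K].reshape(K * K, C_H, d_h)   # values per head
        a = attn[i, j].reshape(K * K, C_H)                    # weights per head
        second_order[i, j] = np.einsum('nh,nhd->hd', a, win).reshape(C)

# Fusion: elementwise addition as a stand-in for the unspecified operator
Y = first_order + second_order
```

The shapes track the figure exactly: H × W × C → H × W × 2C → H × W × D → H × W × (K × K × C_H) → H × W × C, with C_H attention-weight sets of size K × K at each position.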
Step 408, replacing the output vector of the convolution operation whose convolution kernel in the neural network is of the preset size with the finally determined target vector, and processing the visual recognition task through the neural network.
As can be seen from this embodiment, compared with the embodiment corresponding to FIG. 2, the flow 400 of the vector determination method based on the self-attention mechanism fusing context information in this embodiment specifically illustrates the process of determining the target vector of the multi-head self-attention mechanism and the process of replacing the output vector of the convolution operation whose convolution kernel is of the preset size with the finally determined target vector, thereby improving the accuracy of the results obtained when the neural network processes visual recognition tasks.
With continuing reference to fig. 6, as an implementation of the method shown in the above-mentioned figures, the present application provides an embodiment of a vector determination apparatus based on a self-attention mechanism fused context information, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 6, the vector determination apparatus fusing context information based on the self-attention mechanism includes: a first determining unit 601 configured to determine a key vector, a query vector, and a value vector of a feature point in a feature map; a convolution unit 602 configured to perform convolution operation on the key vectors of the feature points in the feature map by using a convolution kernel of a preset size to obtain a first-order context key vector of the context information of the fused feature points; an obtaining unit 603 configured to obtain a second-order context key vector according to the first-order context key vector, and a query vector and a value vector corresponding to a receptive field of the convolution operation of the first-order context key vector; a fusion unit 604 configured to fuse the first order context key vector and the second order context key vector to determine a target vector.
In some embodiments, the deriving unit 603 is further configured to: splice the first-order context key vector and a target query vector to obtain a spliced vector, wherein the target query vector is the query vector of the feature point at the central position in the receptive field of the convolution operation that obtains the first-order context key vector; and obtain the second-order context key vector according to the spliced vector and the value vectors of the feature points in the receptive field of the convolution operation that obtains the first-order context key vector.
In some embodiments, the deriving unit 603 is further configured to: perform convolution operations on the spliced vector multiple times to obtain an attention matrix; and obtain the second-order context key vector based on a local matrix multiplication operation between the value vectors of the feature points in the receptive field and the attention matrix.
In some embodiments, the size of the local matrix in the local matrix multiplication operation is the same as the preset size.
In some embodiments, the self-attention mechanism is a multi-headed self-attention mechanism; and the above apparatus further comprises: and a second determining unit (not shown in the figure) configured to determine a target vector corresponding to the multi-head self-attention mechanism according to the target vector corresponding to each head in the multi-head self-attention mechanism.
In some embodiments, the above apparatus further comprises: and a processing unit (not shown in the figure) configured to replace an output vector of a convolution operation in which a convolution kernel in the neural network is a preset size with a finally determined target vector, and process the visual recognition task through the neural network.
In this embodiment, the first determining unit in the vector determination apparatus based on the self-attention mechanism fusing context information determines the key vector, the query vector, and the value vector of the feature points in the feature map; the convolution unit performs a convolution operation on the key vectors of the feature points in the feature map with a convolution kernel of a preset size to obtain a first-order context key vector that fuses the context information of the feature points; the obtaining unit obtains a second-order context key vector according to the first-order context key vector and the query and value vectors corresponding to the receptive field of the convolution operation that obtains the first-order context key vector; and the fusion unit fuses the first-order context key vector and the second-order context key vector to determine the target vector. An apparatus for determining a vector fusing context information based on the self-attention mechanism is thus provided, which improves the expressive capacity of the target vector; in turn, a target vector with stronger expressive capacity can be provided for machine vision tasks, improving the accuracy with which those tasks are processed.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use in implementing devices of embodiments of the present application (e.g., devices 101, 102, 103, 105 shown in FIG. 1). The apparatus shown in fig. 7 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present application.
As shown in fig. 7, the computer system 700 includes a processor (e.g., a CPU, central processing unit) 701, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The processor 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by the processor 701, performs the above-described functions defined in the method of the present application.
It should be noted that the computer readable medium of the present application can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the client computer, partly on the client computer, as a stand-alone software package, partly on the client computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the client computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including a first determining unit, a convolution unit, a deriving unit, and a fusing unit. The names of these units do not, in some cases, limit the units themselves; for example, the convolution unit may also be described as a unit that performs a convolution operation on the key vectors of the feature points in the feature map with a convolution kernel of a preset size to obtain a first-order context key vector fusing the context information of the feature points.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: determine the key vector, the query vector, and the value vector of the feature points in a feature map; perform a convolution operation on the key vectors of the feature points in the feature map with a convolution kernel of a preset size to obtain a first-order context key vector fusing the context information of the feature points; obtain a second-order context key vector according to the first-order context key vector and the query and value vectors corresponding to the receptive field of the convolution operation that obtains the first-order context key vector; and fuse the first-order context key vector and the second-order context key vector to determine a target vector.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (14)

1. A vector determination method based on self-attention mechanism fusion context information comprises the following steps:
determining a key vector, a query vector and a value vector of a feature point in a feature map;
performing a convolution operation on the key vectors of the feature points in the feature map by using a convolution kernel with a preset size to obtain a first-order context key vector fusing the context information of the feature points;
obtaining a second-order context key vector according to the first-order context key vector and a query vector and a value vector corresponding to a receptive field of the convolution operation that obtains the first-order context key vector;
and fusing the first-order context key vector and the second-order context key vector to determine a target vector.
2. The method of claim 1, wherein the deriving a second-order context key vector according to the first-order context key vector, a query vector and a value vector corresponding to a receptive field of a convolution operation that derives the first-order context key vector comprises:
splicing the first-order context key vector and a target query vector to obtain a spliced vector, wherein the target query vector is the query vector of the feature point at the central position in the receptive field of the convolution operation that obtains the first-order context key vector;
and obtaining the second-order context key vector according to the spliced vector and the value vectors of the feature points in the receptive field of the convolution operation that obtains the first-order context key vector.
3. The method of claim 2, wherein the deriving the second-order context key vector from the spliced vector and the value vectors of the feature points in the receptive field of the convolution operation that obtains the first-order context key vector comprises:
performing convolution operations on the spliced vector multiple times to obtain an attention matrix;
and obtaining the second-order context key vector based on a local matrix multiplication operation between the value vectors of the feature points in the receptive field and the attention matrix.
4. The method of claim 3, wherein the size of the local matrix in the local matrix multiplication operation is the same as the preset size.
5. The method of claim 1, wherein the self-attention mechanism is a multi-headed self-attention mechanism; and
further comprising:
and determining a target vector corresponding to the multi-head self-attention mechanism according to the target vector corresponding to each head in the multi-head self-attention mechanism.
6. The method according to any one of claims 1-5, further comprising:
and replacing the output vector of a convolution operation whose convolution kernel in the neural network is of the preset size with the finally determined target vector, and processing a visual recognition task through the neural network.
7. A vector determination apparatus fusing context information based on a self-attention mechanism, comprising:
a first determination unit configured to determine a key vector, a query vector, and a value vector of feature points in a feature map;
the convolution unit is configured to perform a convolution operation on the key vectors of the feature points in the feature map by using a convolution kernel with a preset size to obtain a first-order context key vector fusing the context information of the feature points;
the obtaining unit is configured to obtain a second-order context key vector according to the first-order context key vector and a query vector and a value vector corresponding to a receptive field of convolution operation of the first-order context key vector;
and the fusion unit is configured to fuse the first-order context key vector and the second-order context key vector to determine a target vector.
8. The apparatus of claim 7, wherein the deriving unit is further configured to:
splicing the first-order context key vector and a target query vector to obtain a spliced vector, wherein the target query vector is the query vector of the feature point at the central position in the receptive field of the convolution operation that obtains the first-order context key vector; and obtaining the second-order context key vector according to the spliced vector and the value vectors of the feature points in the receptive field of the convolution operation that obtains the first-order context key vector.
9. The apparatus of claim 8, wherein the deriving unit is further configured to:
performing convolution operations on the spliced vector multiple times to obtain an attention matrix; and obtaining the second-order context key vector based on a local matrix multiplication operation between the value vectors of the feature points in the receptive field and the attention matrix.
10. The apparatus of claim 9, wherein the size of the local matrix in the local matrix multiplication operation is the same as the preset size.
11. The apparatus of claim 7, wherein the self-attention mechanism is a multi-headed self-attention mechanism; and
further comprising:
a second determining unit configured to determine a target vector corresponding to the multi-head self-attention mechanism according to the target vector corresponding to each head in the multi-head self-attention mechanism.
12. The apparatus of any of claims 7-11, further comprising:
and the processing unit is configured to replace an output vector of the convolution operation with the convolution kernel of the preset size in the neural network with a finally determined target vector and process a visual identification task through the neural network.
13. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-6.
14. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
CN202110488969.8A 2021-04-29 2021-04-29 Vector determination method and device for merging context information based on self-attention mechanism Active CN113177562B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110488969.8A CN113177562B (en) 2021-04-29 2021-04-29 Vector determination method and device for merging context information based on self-attention mechanism

Publications (2)

Publication Number Publication Date
CN113177562A true CN113177562A (en) 2021-07-27
CN113177562B CN113177562B (en) 2024-02-06

Family

ID=76928837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110488969.8A Active CN113177562B (en) 2021-04-29 2021-04-29 Vector determination method and device for merging context information based on self-attention mechanism

Country Status (1)

Country Link
CN (1) CN113177562B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018226960A1 (en) * 2017-06-08 2018-12-13 Facebook, Inc. Key-value memory networks
CN109948699A (en) * 2019-03-19 2019-06-28 北京字节跳动网络技术有限公司 Method and apparatus for generating characteristic pattern
CN110706302A (en) * 2019-10-11 2020-01-17 中山市易嘀科技有限公司 System and method for text synthesis image
WO2020103721A1 (en) * 2018-11-19 2020-05-28 腾讯科技(深圳)有限公司 Information processing method and apparatus, and storage medium
CN111259142A (en) * 2020-01-14 2020-06-09 华南师范大学 Specific target emotion classification method based on attention coding and graph convolution network
CN111709902A (en) * 2020-05-21 2020-09-25 江南大学 Infrared and visible light image fusion method based on self-attention mechanism
WO2020237188A1 (en) * 2019-05-23 2020-11-26 Google Llc Fully attentional computer vision
CN112232058A (en) * 2020-10-15 2021-01-15 济南大学 False news identification method and system based on deep learning three-layer semantic extraction framework

Non-Patent Citations (1)

Title
ZHANG Zhen et al., "Application of cross-lingual multi-task learning deep neural networks to Mongolian-Chinese machine translation", Computer Applications and Software, vol. 38, no. 01

Also Published As

Publication number Publication date
CN113177562B (en) 2024-02-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant after: Jingdong Technology Holding Co.,Ltd.

Address before: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant before: Jingdong Digital Technology Holding Co.,Ltd.

GR01 Patent grant