CN113516030A - Action sequence verification method and device, storage medium and terminal - Google Patents

Action sequence verification method and device, storage medium and terminal

Info

Publication number
CN113516030A
Authority
CN
China
Prior art keywords
sequence
action
action sequence
verified
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110469750.3A
Other languages
Chinese (zh)
Other versions
CN113516030B (en)
Inventor
高盛华
钱一成
罗伟鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ShanghaiTech University
Original Assignee
ShanghaiTech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ShanghaiTech University filed Critical ShanghaiTech University
Priority to CN202110469750.3A
Publication of CN113516030A
Application granted
Publication of CN113516030B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an action sequence verification method and device, a storage medium, and a terminal. The method comprises the following steps: acquiring an action sequence to be verified; performing feature extraction on the action sequence to be verified to obtain a corresponding feature sequence; performing information fusion on the feature sequence to obtain an overall sequence feature, from which the action category of the action sequence to be verified is judged; and comparing the feature sequence of the action sequence to be verified with the feature sequence of a standard action sequence to judge whether the two belong to the same action sequence. The action sequence is verified by constructing a new neural network model that constrains both the overall features and the temporally ordered feature sequence, yielding high verification accuracy. The method has a wide range of applications, such as recognizing whether the people in two video segments perform the same action, checking compliance with standardized procedures in factories and workshops, and scoring actions in sports and entertainment.

Description

Action sequence verification method and device, storage medium and terminal
Technical Field
The invention relates to the field of computer vision, in particular to an action sequence verification method, an action sequence verification device, a storage medium and a terminal.
Background
With the rapid development of information network technology, video has become a primary means of acquiring information, permeating fields such as production, security, transportation, and entertainment. How to effectively exploit video information for action recognition has therefore become a research hotspot. Within this area, the accurate verification of action sequences remains largely unexplored, e.g. judging whether the action sequences in two videos are the same, or whether work in a factory workshop follows the prescribed standard.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, it is an object of the present invention to provide an action sequence verification method, device, storage medium and terminal to solve the problems in the prior art.
To achieve the above and other related objects, a first aspect of the present invention provides an action sequence verification method, including: acquiring an action sequence to be verified; performing feature extraction on the action sequence to be verified to obtain a corresponding feature sequence; performing information fusion on the feature sequence to obtain an overall sequence feature so as to judge the action category of the action sequence to be verified; and comparing the feature sequence of the action sequence to be verified with the feature sequence of a standard action sequence to judge whether they belong to the same action sequence.
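Purely as an illustration, the four steps above can be sketched as the following PyTorch pipeline. The module names (extractor, fusion), the cosine-similarity comparison, and the 0.5 threshold are assumptions of this sketch, not details fixed by the invention; concrete sketches of the individual modules appear in the embodiments below.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def verify(frames_query, frames_standard, extractor, fusion, sim_threshold=0.5):
        # Step 1 happened upstream: frames_* are (T, 3, H, W) sampled video frames.
        # Step 2: per-frame feature extraction -> feature sequences of shape (T, D).
        seq_q = extractor(frames_query)
        seq_s = extractor(frames_standard)
        # Step 3: temporal information fusion -> class logits from the overall sequence feature.
        logits_q, _ = fusion(seq_q.unsqueeze(0))
        logits_s, _ = fusion(seq_s.unsqueeze(0))
        same_category = logits_q.argmax(-1).item() == logits_s.argmax(-1).item()
        # Step 4: timestep-by-timestep comparison against the standard sequence.
        sim = F.normalize(seq_q, dim=-1) @ F.normalize(seq_s, dim=-1).t()  # (T, T)
        aligned = sim.diagonal().mean().item() > sim_threshold
        return same_category and aligned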
In some embodiments of the first aspect of the present invention, the feature extraction is performed as follows: features of the action sequence to be verified are extracted based on a BN-Inception network or a ResNet-50 network, and the final fully connected layer is modified to output a feature sequence of a preset dimension.
In some embodiments of the first aspect of the present invention, the overall sequence feature is obtained by performing temporal information fusion on the feature sequence of the action sequence to be verified based on a Vision Transformer model.
In some embodiments of the first aspect of the present invention, the feature comparison comprises: comparing the feature sequence of the action sequence to be verified with the feature sequence of the standard action sequence one timestep at a time, in temporal order, to obtain a feature comparison result; and constructing a loss function to supervise and constrain the feature comparison result.
In some embodiments of the first aspect of the present invention, the loss function is a first loss function, and the method further comprises: acquiring the overall sequence feature of the action sequence to be verified based on a Vision Transformer model; judging the action category of the action sequence to be verified based on the overall sequence feature; and constructing a second loss function to supervise and constrain the action-category judgment.
In some embodiments of the first aspect of the present invention, the action sequence verification method comprises: performing a weighted calculation based on the first loss function and the second loss function to obtain a verification result for the action sequence to be verified.
In some embodiments of the first aspect of the present invention, the result of the feature extraction is a plurality of feature maps, and the method comprises: dividing all feature maps into a number of fixed-size patches to obtain a feature patch sequence; inputting the feature patch sequence into a Vision Transformer; for each patch, the Vision Transformer weights and fuses the features of the remaining patches through a self-attention module; a special token is also kept, and the feature at the token position represents the category information of the whole action sequence to be verified.
To achieve the above and other related objects, a second aspect of the present invention provides an action sequence verification apparatus, including: an action sequence acquisition module for acquiring an action sequence to be verified; a feature extraction module for performing feature extraction on the action sequence to be verified to obtain a corresponding feature sequence; a feature fusion module for performing information fusion on the feature sequence to obtain an overall sequence feature so as to judge the action category of the action sequence to be verified; and a feature comparison module for comparing the feature sequence of the action sequence to be verified with the feature sequence of a standard action sequence to judge whether they belong to the same action sequence.
To achieve the above and other related objects, a third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the action sequence verification method.
To achieve the above and other related objects, a fourth aspect of the present invention provides an electronic terminal, comprising: a processor and a memory; the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the terminal to execute the action sequence verification method.
As described above, the present invention provides an action sequence verification method and device, a storage medium, and a terminal. A feature sequence is obtained by extracting features from the action sequence, and finer-grained features of the action sequence to be verified are obtained by temporally matching and fusing that sequence, enabling accurate verification. The method can verify whether the action sequences in two videos are consistent, which better matches real-life scenarios where an action is completed through several consecutive atomic actions; it can also verify action sequences that were never predefined, widening its applicability. In addition, because verification is performed by a purpose-built neural network model, both the action information of each atomic action and the overall features of the sequence are extracted, and the overall features, the per-atomic-action features, and the temporal order of the feature sequence are constrained simultaneously, greatly improving the verification success rate. The invention has broad application: it can accurately verify whether the people in two video segments complete the same action sequence; it can check standardized procedures in production settings such as factories and workshops, helping improve product quality; and it can score the actions of athletes or actions in human-computer interaction games in the sports and entertainment fields.
Drawings
Fig. 1 is a flowchart illustrating an action sequence verification method according to an embodiment of the invention.
FIG. 2 is a diagram illustrating three groups of action sequences obtained by sampling frames from three videos according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of an action sequence verification neural network model and a working process thereof according to an embodiment of the invention.
Fig. 4 is a schematic structural diagram of an action sequence verification apparatus according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an electronic terminal according to an embodiment of the invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It is noted that in the following description, reference is made to the accompanying drawings which illustrate several embodiments of the present invention. It is to be understood that other embodiments may be utilized and that changes may be made without departing from the spirit and scope of the present invention. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present invention is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
In the present invention, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises", "comprising" and/or "includes", when used in this specification, specify the presence of stated features, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, operations, elements, components, items, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C". An exception to this definition occurs only when a combination of elements, functions, or operations is inherently mutually exclusive in some way.
The invention provides an action sequence verification method, an action sequence verification device, a storage medium and a terminal, and aims to solve the problems in the prior art.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention are further described in detail by the following embodiments in conjunction with the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example one
As shown in fig. 1, the present embodiment provides a flow chart of an action sequence verification method, which includes steps S11 to S14, and can be specifically described as follows:
S11, acquiring the action sequence to be verified. An action sequence consists of a number of atomic actions with an ordering relation; because a single atomic action can accomplish very little on its own, most everyday actions appear in the form of action sequences. Different action sequences may differ in the number of atomic actions they contain, in one or more of the corresponding atomic actions, or in both.
For example, fig. 2 shows three groups of action sequences obtained by sampling frames from three videos, where each video picture represents an atomic action. The three groups are named, in order, seq1, seq2, and seq3. An action sequence is composed of several atomic actions with an ordering relation (i.e. the corresponding video pictures in fig. 2). seq1 and seq2 contain the same atomic actions in the same order, so they belong to the same action sequence; seq2 and seq3 differ in both the order and the number of their atomic actions, so they do not belong to the same action sequence.
Generally, the action sequence to be verified comes from a video and can be obtained by sampling consecutive video frames according to a preset rule. Preferably, the acquired action sequence is preprocessed, e.g. the pictures are smoothed, restored, enhanced, median-filtered, edge-detected, and so on; this removes irrelevant information, restores useful information, enhances the detectability of relevant information, and simplifies the data as much as possible.
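As a concrete illustration of S11, the sketch below assumes the "preset rule" is uniform temporal sampling and uses OpenCV both for decoding and for one of the preprocessing steps named above (median filtering); both choices are ours, not prescribed by the embodiment.

    import cv2
    import numpy as np

    def sample_action_sequence(video_path, num_frames=16):
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        # pick num_frames indices spread evenly over the whole clip
        indices = np.linspace(0, total - 1, num_frames).astype(int)
        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
            ok, frame = cap.read()
            if ok:
                frame = cv2.medianBlur(frame, 3)  # optional preprocessing step
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
        return np.stack(frames)  # (T, H, W, 3) action sequence to be verified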
S12, performing feature extraction on the action sequence to be verified to obtain a corresponding feature sequence. Specifically, a convolutional neural network performs feature extraction on the action sequence to be verified, yielding a feature sequence ordered in time. This feature sequence is the representation, in feature space, of the motion information contained in the whole action sequence.
In a preferred version of this embodiment, the feature extraction proceeds as follows: a feature extraction module (Backbone) is built on a BN-Inception network or a ResNet-50 network to extract features of the action sequence to be verified, and the final fully connected layer (fc layer) is modified to output a feature sequence of a preset dimension.
Specifically, the BN-Inception and ResNet-50 networks comprise multiple layers that extract features from the input pictures hierarchically; as the layers deepen, the extracted features become higher-level and more global. Both networks are also robust feature extractors.
Further, the feature extraction module ultimately has to solve a 45-class classification problem, so the K-dimensional features must be mapped to 45 dimensions through the fc layer; the preset output dimension of the fc layer must therefore be larger than 45. If K is too large it would exceed the output dimension of the preceding backbone, so the preset output dimension of the fc layer must be smaller than 512. Preferably, the output dimension of the feature sequence is 128 or 256, which balances recognition efficiency against recognition accuracy.
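A minimal sketch of such a modified backbone, assuming torchvision's ResNet-50 and a 128-dimensional output (one of the preferred dimensions above); a BN-Inception backbone would be wired up analogously.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50

    class FrameFeatureExtractor(nn.Module):
        def __init__(self, feat_dim=128):
            super().__init__()
            self.backbone = resnet50(weights=None)
            # replace the original 1000-way fc with a feat_dim-dimensional head
            self.backbone.fc = nn.Linear(self.backbone.fc.in_features, feat_dim)

        def forward(self, frames):          # frames: (T, 3, H, W)
            return self.backbone(frames)    # (T, feat_dim) time-ordered feature sequence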
S13, performing information fusion on the feature sequence to obtain the overall sequence feature, so as to judge the action category of the action sequence to be verified. Specifically, temporal information fusion is performed on the feature sequence of the action sequence to be verified based on a Vision Transformer model to obtain the overall sequence feature.
Specifically, the Vision Transformer model works as follows: the video frame sequence (i.e. the action sequence to be verified) is fed into a backbone (e.g. ResNet-50) for feature extraction, producing a series of feature maps; all feature maps are divided into fixed-size patches, and the resulting feature patch sequence is input to the Vision Transformer; through its self-attention module, the Vision Transformer weights and fuses, for each patch, the features of all remaining patches (i.e. the input feature patch sequence is enhanced); meanwhile a special token is kept, and the feature at the token position represents the category information of the whole video frame sequence. The Vision Transformer model carries little inductive bias, offers strong temporal modeling capability at a modest computational cost, and is therefore particularly suitable for extracting and fusing features of temporally ordered action sequences.
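The temporal-fusion step can be sketched as follows, assuming a learnable class token, additive position embeddings, and a standard PyTorch Transformer encoder; all hyper-parameters are illustrative, and the patch-splitting of feature maps is abstracted into a per-frame feature sequence for brevity.

    import torch
    import torch.nn as nn

    class SequenceFusion(nn.Module):
        def __init__(self, feat_dim=128, num_classes=45, num_layers=4,
                     num_heads=4, max_len=64):
            super().__init__()
            self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
            self.pos_embed = nn.Parameter(torch.zeros(1, max_len + 1, feat_dim))
            layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
            self.classifier = nn.Linear(feat_dim, num_classes)  # 45-way head

        def forward(self, feats):                # feats: (B, T, feat_dim)
            b, t, _ = feats.shape
            cls = self.cls_token.expand(b, -1, -1)
            # prepend the special token, add position embeddings, then self-attend
            x = torch.cat([cls, feats], dim=1) + self.pos_embed[:, :t + 1]
            x = self.encoder(x)
            overall = x[:, 0]                    # token position = whole-sequence feature
            return self.classifier(overall), x[:, 1:]  # class logits, enhanced sequence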
Further, the output of the Vision Transformer model is kept consistent in feature dimension with the feature extraction module (Backbone), i.e. the dimension of the last fc layer of the Vision Transformer model is set equal to the dimension of the Backbone output features, so that the effectiveness of the Vision Transformer model can be evaluated.
S14, comparing the feature sequence of the action sequence to be verified with the feature sequence of the standard action sequence to judge whether they belong to the same action sequence. Specifically: the two feature sequences are compared one timestep at a time, in temporal order, to obtain a feature comparison result; a first loss function is constructed to supervise and constrain this result; the overall sequence feature of the action sequence to be verified is acquired with the Vision Transformer model; the action category is judged from the overall sequence feature; and a second loss function is constructed to supervise and constrain the category judgment.
In a preferred version of this embodiment, a weighted combination of the first and second loss functions balances the influence of the two losses on training the overall model, yielding the verification result for the action sequence to be verified. In some examples, the weight ratio of the first loss function to the second loss function may be set to 10, 2, 1, etc.; a ratio of 10 gives the most accurate verification results.
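A minimal sketch of this weighted supervision; assigning the 10x weight to the first (sequence alignment) loss is our reading of the 10:1 setting above, not an explicit statement of the embodiment.

    import torch.nn.functional as F

    def total_loss(alignment_loss, logits, labels, w_first=10.0, w_second=1.0):
        classification_loss = F.cross_entropy(logits, labels)  # second loss function
        # first (alignment) loss weighted 10x relative to the classification loss
        return w_first * alignment_loss + w_second * classification_loss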
In some embodiments, the method may run on a controller, such as an ARM (Advanced RISC Machines) controller, an FPGA (Field Programmable Gate Array) controller, an SoC (System on Chip) controller, a DSP (Digital Signal Processing) controller, or an MCU (Microcontroller Unit) controller. In some embodiments, the method is also applicable to computers comprising components such as memory, memory controllers, one or more processing units (CPUs), peripheral interfaces, RF circuits, audio circuits, speakers, microphones, input/output (I/O) subsystems, display screens, other output or control devices, and external ports; such computers include, but are not limited to, desktop computers, notebook computers, tablet computers, smartphones, smart televisions, personal digital assistants (PDAs), and the like. In other embodiments, the method may also run on a server, which may be deployed on one or more physical servers or may consist of a distributed or centralized server cluster, depending on factors such as function and load.
Example two
As shown in fig. 3, the present embodiment provides a schematic diagram of the action sequence verification neural network model and its workflow. Specifically, the model comprises: a feature extraction module (Intra-action module) built on a 2D Backbone, a feature fusion module (Inter-action module) built on a Vision Transformer, and a feature comparison module (Alignment module). Notably, the Vision Transformer-based feature fusion module not only allows parallel training but also captures global feature information of the action sequence, which helps guarantee verification accuracy. After the feature extraction and feature fusion modules have been trained, the action sequence to be verified and the standard action sequence are each fed through the network; once the two feature sequences are obtained, they are compared by the feature comparison module to judge whether they belong to the same action sequence.
Fig. 3 shows the workflow of the action sequence verification neural network model, taking as an example the verification of whether two groups of action sequences, input frames 1 and input frames 2, are the same action sequence. The process is as follows:
First, the action sequences input frames 1 and input frames 2 are fed into the 2D Backbone; the features corresponding to the atomic actions in each sequence are concatenated along a specified dimension with a concat operation, and the corresponding feature map sequence (Feature map sequence) is obtained with a reshape operation.
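For concreteness, the following toy snippet shows one way the concat and reshape operations could assemble per-atomic-action feature maps into a feature map sequence; all tensor shapes are assumptions.

    import torch

    # Suppose the 2D Backbone yields one (C, H, W) feature map per atomic action.
    t, c, h, w = 8, 256, 7, 7
    per_frame = [torch.randn(c, h, w) for _ in range(t)]
    stacked = torch.cat([f.unsqueeze(0) for f in per_frame], dim=0)  # concat -> (T, C, H, W)
    feature_map_sequence = stacked.reshape(t, c, h * w)              # reshape -> (T, C, H*W)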
Then, the feature sequences corresponding to input frames 1 and input frames 2 are fed into the feature fusion module. A linear projection layer converts the variable-length feature maps into fixed-length vectors, which are combined with the position encodings by element-wise addition; an extra learnable embedding is prepended, and its corresponding output after the Transformer encoder represents the whole action sequence, i.e. the overall sequence feature vector, from which the action category of the corresponding sequence is judged. The classification scores (class scores) are supervised by the second loss function, yielding a first classification loss value (Classification loss1) and a second classification loss value (Classification loss2).
Meanwhile, the feature sequences corresponding to input frames 1 and input frames 2 are fed into the feature comparison module (Alignment module), where the two feature sequences are compared and matched via a sequence similarity matrix (Sequence similarity matrix) against an identity matrix (identity matrix); the comparison result is supervised and constrained by the first loss function, yielding a sequence alignment loss value (Sequence Alignment loss).
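A minimal sketch of this alignment supervision: the similarity matrix between the two feature sequences is pushed toward the identity matrix, so that frame i of one sequence matches only frame i of the other. The symmetric cross-entropy formulation and the temperature value are our assumptions.

    import torch
    import torch.nn.functional as F

    def sequence_alignment_loss(feats_a, feats_b, temperature=0.1):
        # feats_a, feats_b: (T, D) feature sequences of the two action sequences
        a = F.normalize(feats_a, dim=-1)
        b = F.normalize(feats_b, dim=-1)
        sim = a @ b.t() / temperature                          # (T, T) similarity matrix
        target = torch.arange(sim.size(0), device=sim.device)  # identity-matrix supervision
        # symmetric cross-entropy over rows and columns of the similarity matrix
        return 0.5 * (F.cross_entropy(sim, target) + F.cross_entropy(sim.t(), target))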
Finally, the first classification loss value (Classification loss1), the second classification loss value (Classification loss2), and the sequence alignment loss value (Sequence Alignment loss) are combined with preset weights to produce the final action verification result, i.e. whether input frames 1 and input frames 2 are the same action sequence.
It is worth mentioning that the action sequence verification neural network model is trained on a newly proposed dataset. It differs from existing datasets in two respects: it focuses on action sequences rather than single atomic actions, and it contains multiple action sequences with atomic-action-level differences, which existing datasets lack.
In some examples, the training dataset was obtained as follows: 2000 videos covering 70 different action sequences were shot; videos with device or performance problems were removed, leaving 1938 videos. Every frame of the remaining videos was extracted, giving about 960,000 pictures in total; each video lasts 20.58 seconds on average and contains 495.85 frames. The 70 action sequences are divided into 14 groups of five: the first sequence in each group is the standard sequence, and the remaining four differ slightly from it and are defined as error sequences; two of them disturb the order of some atomic actions, and the other two delete some atomic actions from the standard sequence. The resulting dataset is used to train the action sequence verification neural network model provided by the invention; the trained model can verify an action sequence to be verified, and can also verify whether several groups of action sequences are the same action.
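To make the construction of the error sequences concrete, the sketch below derives the four variants from a standard sequence of atomic-action labels; the labels and the exact perturbation choices are hypothetical.

    import random

    def make_error_sequences(standard, seed=0):
        rng = random.Random(seed)
        errors = []
        for _ in range(2):                      # two order-disturbed variants
            seq = list(standard)
            i, j = rng.sample(range(len(seq)), 2)
            seq[i], seq[j] = seq[j], seq[i]     # swap two atomic actions
            errors.append(seq)
        for _ in range(2):                      # two deletion variants
            seq = list(standard)
            del seq[rng.randrange(len(seq))]    # drop one atomic action
            errors.append(seq)
        return errors

    # e.g. make_error_sequences(["reach", "grasp", "lift", "place"])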
EXAMPLE III
As shown in fig. 4, the present embodiment provides an action sequence verification apparatus, including: an action sequence obtaining module 41, configured to obtain an action sequence to be verified; the feature extraction module 42 is configured to perform feature extraction on the action sequence to be verified to obtain a corresponding feature sequence; a feature fusion module 43, configured to perform information fusion on the feature sequence to obtain an overall sequence feature to determine an action category of the action sequence to be verified; and the feature comparison module 44 is configured to perform feature comparison on the feature sequence of the action sequence to be verified and the feature sequence of the standard action sequence to determine whether the action sequences belong to the same action sequence.
The modules provided in this embodiment correspond to the methods and embodiments described above, so their details are not repeated here. Note that the division into modules is merely logical; in an actual implementation they may be wholly or partially integrated into one physical entity or kept physically separate. These modules may all be realized as software invoked by a processing element, entirely as hardware, or partly as software and partly as hardware. For example, the feature comparison module 44 may be a separate processing element, may be integrated into a chip of the apparatus, or may be stored in the memory of the apparatus as program code that a processing element calls to execute its functions. The other modules are implemented similarly. All or some of the modules can be integrated together or implemented independently. The processing element mentioned here may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each of the above modules, can be completed by an integrated logic circuit in hardware within the processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, e.g. one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs). As another example, when one of the modules is realized as program code scheduled by a processing element, that element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of calling program code. These modules may also be integrated together and implemented as a system-on-a-chip (SoC).
Example four
The present embodiment proposes a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the aforementioned action sequence verification method.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be performed by hardware controlled by a computer program. The computer program may be stored in a computer-readable storage medium; when executed, it performs the steps of the method embodiments described above. The storage medium includes ROM, RAM, magnetic disks, optical disks, and other media that can store program code.
EXAMPLE five
As shown in fig. 5, an embodiment of the present invention provides a schematic structural diagram of an electronic terminal. The electronic terminal provided by this embodiment comprises: a processor 51, a memory 52, and a communicator 53. The memory 52 is connected to the processor 51 and the communicator 53 through a system bus, over which they communicate; the memory 52 stores the computer program, the communicator 53 communicates with other devices, and the processor 51 runs the computer program so that the electronic terminal executes the steps of the action sequence verification method.
The above-mentioned system bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like; it may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration only one thick line is shown, but this does not mean that there is only one bus or one type of bus. The communication interface enables communication between the database access device and other devices (e.g. a client, a read-write library, a read-only library). The memory may comprise random access memory (RAM) and may further comprise non-volatile memory, such as at least one disk memory.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In summary, the present invention provides an action sequence verification method and device, a storage medium, and a terminal. A feature sequence is obtained by extracting features from the action sequence, and fine-grained features of the action sequence to be verified are obtained through temporal matching and information fusion of that sequence, enabling accurate verification. In real life an action is often completed as a sequence of several consecutive atomic actions, so verifying the sequence as a whole, with its continuity, effectively improves verification accuracy and widens the range of practical applications. Moreover, the action sequence verification neural network model is built on deep learning: a base network (backbone) first extracts preliminary features from the action sequence, and the Vision Transformer and Sequence Alignment modules then extract finer-grained features, so the feature information is extracted comprehensively and the verification accuracy is high. The present invention therefore effectively overcomes various disadvantages of the prior art and has high industrial value.
The foregoing embodiments merely illustrate the principles and utilities of the present invention and do not limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims (10)

1. A method for action sequence verification, comprising:
acquiring an action sequence to be verified;
performing feature extraction on the action sequence to be verified to obtain a corresponding feature sequence;
performing information fusion on the feature sequence to obtain an overall sequence feature so as to judge the action category of the action sequence to be verified;
and comparing the feature sequence of the action sequence to be verified with the feature sequence of a standard action sequence to judge whether they belong to the same action sequence.
2. The action sequence verification method according to claim 1, wherein the feature extraction manner includes:
performing feature extraction on the action sequence to be verified based on a BN-Inception network or a ResNet-50 network, and modifying the final fully connected layer to output a feature sequence of a preset dimension.
3. The action sequence verification method according to claim 1, wherein the overall sequence feature is obtained by:
performing temporal information fusion on the feature sequence of the action sequence to be verified based on a Vision Transformer model to obtain the overall sequence feature.
4. The action sequence verification method according to claim 1, wherein the manner of feature comparison comprises:
comparing the feature sequence of the action sequence to be verified with the feature sequence of the standard action sequence one timestep at a time, in temporal order, to obtain a feature comparison result;
and constructing a loss function to supervise and constrain the feature comparison result.
5. The action sequence verification method according to claim 4, wherein the loss function is a first loss function; the method further comprises the following steps:
acquiring the overall sequence feature of the action sequence to be verified based on a Vision Transformer model;
judging the action category of the action sequence to be verified based on the overall sequence feature;
and constructing a second loss function to supervise and constrain the action-category judgment.
6. The action sequence verification method according to claim 5, comprising:
performing a weighted calculation based on the first loss function and the second loss function to obtain a verification result of the action sequence to be verified.
7. The action sequence verification method according to claim 3, wherein the result of the feature extraction is a plurality of feature maps; the method comprises the following steps:
dividing all feature maps into a number of fixed-size patches to obtain a feature patch sequence;
inputting the obtained feature patch sequence into a Vision Transformer;
and, for each patch, the Vision Transformer weights and fuses the features of the remaining patches through a self-attention module; a special token is kept, and the feature at the token position represents the category information of the whole action sequence to be verified.
8. An action sequence verification apparatus, comprising:
an action sequence acquisition module, configured to acquire an action sequence to be verified;
a feature extraction module, configured to perform feature extraction on the action sequence to be verified to obtain a corresponding feature sequence;
a feature fusion module, configured to perform information fusion on the feature sequence to obtain an overall sequence feature so as to judge the action category of the action sequence to be verified;
and a feature comparison module, configured to compare the feature sequence of the action sequence to be verified with the feature sequence of a standard action sequence to judge whether they belong to the same action sequence.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the action sequence verification method of any one of claims 1 to 7.
10. An electronic terminal, comprising: a processor and a memory;
the memory is configured to store a computer program, and the processor is configured to execute the computer program stored by the memory to cause the terminal to perform the action sequence verification method according to any one of claims 1 to 7.
CN202110469750.3A 2021-04-28 2021-04-28 Action sequence verification method and device, storage medium and terminal Active CN113516030B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110469750.3A CN113516030B (en) 2021-04-28 2021-04-28 Action sequence verification method and device, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN113516030A (en) 2021-10-19
CN113516030B CN113516030B (en) 2024-03-26

Family

ID=78064106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110469750.3A Active CN113516030B (en) 2021-04-28 2021-04-28 Action sequence verification method and device, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN113516030B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8970348B1 (en) * 2012-08-28 2015-03-03 Intuit Inc. Using sequences of facial gestures to authenticate users
CN106845375A (en) * 2017-01-06 2017-06-13 天津大学 A kind of action identification method based on hierarchical feature learning
CN107122798A (en) * 2017-04-17 2017-09-01 深圳市淘米科技有限公司 Chin-up count detection method and device based on depth convolutional network
CN108573246A (en) * 2018-05-08 2018-09-25 北京工业大学 A kind of sequential action identification method based on deep learning
CN110602526A (en) * 2019-09-11 2019-12-20 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN111539289A (en) * 2020-04-16 2020-08-14 咪咕文化科技有限公司 Method and device for identifying action in video, electronic equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KIRILL GAVRILYUK ET AL.: "Actor-Transformers for Group Activity Recognition", CVPR, pages 1-3 *
ZHAOYUAN YIN: "Learning to recommend frame for interactive video object segmentation in the wild", arXiv *
ZHANG Zhou; WU Kewei; GAO Yang: "Action recognition based on key frames extracted via sequential verification", Intelligent Computer and Applications, no. 03
NIE Yong; ZHANG Peng; FENG Hui; YANG Tao; HU Bo: "Human action recognition in 3D video based on standard action sequences", Journal of Terahertz Science and Electronic Information Technology, no. 05

Also Published As

Publication number Publication date
CN113516030B (en) 2024-03-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant