CN117765432A - Action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction - Google Patents


Info

Publication number
CN117765432A
CN117765432A (application CN202311558506.XA)
Authority
CN
China
Prior art keywords
action
time
motion
sequence
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311558506.XA
Other languages
Chinese (zh)
Inventor
刘峰
王慧
宋婉茹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202311558506.XA priority Critical patent/CN117765432A/en
Publication of CN117765432A publication Critical patent/CN117765432A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides an action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction, which comprises the following steps: acquiring an experimental video to be tested; performing feature extraction on the experimental video using a feature extraction network to obtain video features; and performing boundary matching on the video features using a boundary matching network to obtain temporal segments, each formed by pairing an action start time with an action end time. The boundary matching network comprises two branches: in the first branch, after temporal convolution, each frame is modeled with a learnable Gaussian distribution and background frames are distinguished from action frames, yielding an initial boundary probability sequence; the second branch is processed to obtain a two-dimensional confidence map for action start-position matching, and the temporal segments are obtained by post-processing the initial boundary probability sequence together with the two-dimensional confidence map. Finally, the temporal segments are classified by a segment classification network to obtain the experimental action detection result. The invention effectively improves segment generation efficiency and the accuracy of action detection.

Description

Action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction
Technical Field
The invention relates to an action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction, and belongs to the technical field of temporal action localization in video understanding.
Background
With the continuous development of artificial intelligence technology, its applications in the education field are becoming increasingly widespread. In a traditional experiment examination for middle school physics and chemistry students, a teacher must personally grade each student's experiment report, which is time-consuming and labor-intensive and subject to subjective factors. In line with the trend of using innovative technology to promote the intelligent transformation of education, an AI-enabled solution for scoring physics and chemistry experiment operation examinations is proposed, establishing a new application scenario in this segment of education. An intelligent scoring system can automatically score a student's experiment report through artificial intelligence, improving scoring accuracy while saving teachers' time and energy.
Temporal action detection can be regarded as two subtasks: predicting the start-stop temporal interval of an action, and predicting the action's category. As action recognition has developed in recent years, algorithms for predicting action categories have gradually matured; the key to temporal action localization therefore lies in predicting the start-stop temporal intervals of actions.
Conventional methods include clustering-based and feature-similarity-based approaches, which generally rely on manually designed features or rules and cannot capture complex video scenes and semantic information.
Disclosure of Invention
The invention focuses on the temporal action detection algorithm in an intelligent evaluation system for physics and chemistry experiments, aiming to remove redundant information from a video and generate the key action segments within it. On the one hand, this provides key technical support for a scoring system: applying the segments before the scoring module saves running time and accelerates scoring. On the other hand, the generated key actions can be retained as a summary, facilitating later auditing and retrospective review by users and effectively reducing workload.
Purpose: in view of at least one of the above technical problems, the invention provides an action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction.
The invention adopts the following technical solution:
In a first aspect, the invention provides an action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction, comprising the following steps:
acquiring an experimental video to be tested; the video to be tested contains middle school physics, chemistry, or biology experiment actions;
performing feature extraction on the experimental video using a feature extraction network to obtain video features;
performing boundary matching on the video features using a boundary matching network to obtain temporal segments formed by pairing action start times with action end times; the boundary matching network comprises two branches: in the first branch, after temporal convolution, each frame is modeled with a learnable Gaussian distribution and background frames are distinguished from action frames to obtain an initial boundary probability sequence; the second branch is processed to obtain a two-dimensional confidence map for action start-position matching, and the temporal segments are obtained by post-processing the initial boundary probability sequence and the two-dimensional confidence map;
and classifying the temporal segments with a segment classification network to obtain an experimental action detection result.
In some embodiments, the feature extraction network employs an I3D network, the SlowFast network, the sparse-temporal-sampling network TSN, or the three-dimensional convolutional network C3D.
In some embodiments, the segment classification network employs a long short-term memory network, a graph convolutional network, or a recurrent neural network RNN.
Performing feature extraction on the experimental video using the feature extraction network to obtain video features comprises the following steps:
decoding the experimental video at 25 frames per second, determining a sliding window, and partitioning the video into non-overlapping windows to obtain pictures;
each time 16 frames of pictures of size C×C are input, passing them through the I3D feature extraction network to obtain the image RGB features;
introducing an L1 norm and using an optical flow estimation algorithm to obtain the optical flow between adjacent frames: computing a flow field to estimate the motion of pixels across two consecutive image frames, obtaining the optical flow components of the action in the horizontal and vertical directions by optical flow estimation and stacking them to form an optical flow map, the final output being the optical flow features;
and performing feature fusion on the image RGB features and the optical flow features to obtain the video features.
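The following is a minimal sketch of this extraction pipeline. The `i3d_rgb` and `i3d_flow` callables are hypothetical stand-ins for the pretrained RGB and optical-flow I3D branches; the patent does not name a specific implementation.

```python
# Illustrative sketch only: non-overlapping 16-frame windowing and RGB/flow
# feature fusion as described above. `i3d_rgb` and `i3d_flow` are assumed
# callables returning one feature vector per clip.
import numpy as np

def extract_video_features(frames, i3d_rgb, i3d_flow, window=16):
    """frames: (T, C, C, 3) array decoded at 25 fps."""
    feats = []
    for start in range(0, len(frames) - window + 1, window):
        clip = frames[start:start + window]                  # (16, C, C, 3)
        rgb_feat = i3d_rgb(clip)                             # appearance stream
        flow_feat = i3d_flow(clip)                           # motion stream
        feats.append(np.concatenate([rgb_feat, flow_feat]))  # feature fusion
    return np.stack(feats)                                   # (num_windows, D)
```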
Further, in some embodiments, each time 16 frames of size C×C are input, the pictures pass through the I3D feature extraction network to obtain the image RGB features, specifically comprising: each time 16 frames of pictures are input, performing a 7×7×7 three-dimensional convolution and a 1×3×3 max pooling to output a first-layer feature map N1; applying 1×1×1 and 3×3×3 three-dimensional convolutions to the first-layer feature map N1 and downsampling to obtain a second-layer feature map N2; passing the second-layer feature map N2 through two residual modules followed by a 3×3×3 max pooling to obtain a third-layer feature map N3; and passing the third-layer feature map N3 through five identical residual modules and a stride-2 downsampling to finally obtain the output image RGB features. Each residual module comprises four branches, all taking the output of the preceding module as input: the first branch applies a 1×1×1 three-dimensional convolution; the second and third branches apply a 1×1×1 followed by a 3×3×3 three-dimensional convolution; the fourth branch applies a 3×3×3 max-pooling downsampling followed by a 1×1×1 three-dimensional convolution; finally, the outputs of the four branches are feature-fused and downsampled with stride 2.
In some embodiments, the optical flow between adjacent frames is obtained through a total variation algorithm, constructing the following objective function:

min_u ∫_Ω (|∇u_1(x)| + |∇u_2(x)|) dΩ + λ ∫_Ω (I_1(x + u(x)) − I_0(x))² dΩ

where x denotes the position of any pixel of the image; I_0(x) is the image intensity of the previous frame at pixel position x; u_1(x) and u_2(x) are the horizontal and vertical displacement components of the optical flow u(x) at pixel position x; I_1(x + u(x)) is the image intensity of the next frame at position x + u(x); ∇ denotes the gradient operator; the first term ∫_Ω (|∇u_1(x)| + |∇u_2(x)|) dΩ is the regularization term yielding a smooth displacement field; λ is a weight coefficient controlling the regularization effect; and the second term ∫_Ω (I_1(x + u(x)) − I_0(x))² dΩ is the data term of the optical flow constraint;
by adding regularization to the objective function, the learned result satisfies sparsity; the optical flow components of the action in the horizontal and vertical directions are finally obtained through iteration and stacked to form an optical flow map, the final output being the optical flow features.
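As an illustration of this total-variation (TV-L1) optical flow step, the following sketch uses OpenCV's DualTVL1 implementation; the use of opencv-contrib-python and the λ value shown are assumptions, not part of the patent.

```python
# Minimal TV-L1 optical flow between adjacent frames (requires
# opencv-contrib-python). Returns the stacked horizontal/vertical flow map.
import cv2
import numpy as np

def tvl1_flow_map(prev_bgr, next_bgr, lam=0.15):
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    tvl1 = cv2.optflow.createOptFlow_DualTVL1()
    tvl1.setLambda(lam)                          # data-term weight lambda
    flow = tvl1.calc(prev_gray, next_gray, None)  # (H, W, 2): u_1, u_2
    # Stack horizontal and vertical components into the "optical flow map".
    return np.dstack([flow[..., 0], flow[..., 1]])
```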
In some embodiments, performing boundary matching on the video features using the boundary matching network to obtain temporal segments formed by pairing action start times with action end times comprises:
inputting the video features into two 3×3 convolution layers with stride 1 to obtain a temporal feature sequence, which is fed into the two branches respectively;
processing the temporal feature sequence in the first branch, namely two 3×3 convolution layers with max pooling followed by a normalization of depth 2; action-importance contrastive learning is introduced in the first branch, that is, each frame is modeled with a learnable Gaussian distribution, background frames are distinguished from action frames, the weight of each frame during training is adjusted according to its action importance, and the branch finally outputs the initial boundary probability sequences, namely a start probability sequence and an end probability sequence;
processing the temporal feature sequence in the second branch: a boundary matching layer determines a sliding window of size 16; with T denoting the maximum segment length of a sample, N points are uniformly sampled within each temporal range of the input temporal feature sequence to form an N×T mask weight, where N is the number of sampling points in a video segment; for the n-th sampling point, a C×T temporal feature map is generated from the temporal feature sequence, where C is the feature dimensionality; the C×T temporal feature map is weighted by the weight matrix corresponding to the sampling points to compute a C×N feature map;
applying a 32×1×1 three-dimensional convolution to the C×N feature map to reduce the sampling dimension to 1, then applying 1×1 and 3×3 convolutions, and generating a confidence output of channel dimension 2 through a sigmoid activation to obtain the two-dimensional confidence map for action start-position matching, wherein segments lying in the same row of the two-dimensional confidence map share the same duration and segments in the same column share the same start boundary;
multiplying and fusing the initial boundary probability sequences with the two-dimensional confidence map for action start-position matching to generate a confidence score for each action segment;
and removing redundancy with a non-maximum suppression method according to the action segment confidences to obtain the final candidate action segments, which form the temporal segments.
Further, introducing action-importance contrastive learning in the first branch specifically comprises:
modeling action importance with a learnable Gaussian distribution and explicitly assigning Gaussian-like weights to model the accuracy of action localization; the action importance p_loc of the i-th frame is modeled as:

p_loc(i) = p_s(i) · p_e(i)

where p_s(i) models the localization accuracy of the i-th frame for the start time, and p_e(i) models the localization accuracy of the i-th frame for the end time:

p_s(i) = exp(−(d(i) − μ_s)² / (2σ_s²)), p_e(i) = exp(−(d(i) − μ_e)² / (2σ_e²))

where μ_s and σ_s are learnable parameters denoting the mean and variance of the action-importance distribution of each category c at the start time; μ_e and σ_e are the mean and variance of the action-importance distribution of each category c at the end time; and d(i) denotes the distance of the current i-th frame from the center of the ground-truth annotated segment during training;
obtaining an importance sequence for the frames through this modeling and learning;
and fusing the importance sequence with the temporal feature sequence by accumulation to obtain the initial boundary probability sequences.
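A minimal PyTorch sketch of such a learnable Gaussian importance module is shown below; the per-category parameter layout and the interface are illustrative assumptions (the initial values μ_s = −0.5, μ_e = 0.5, σ = 1 are taken from the embodiment described later).

```python
# Learnable Gaussian action-importance weights, as a sketch of the modeling
# above; everything not stated in the description is an assumption.
import torch
import torch.nn as nn

class ActionImportance(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.mu_s = nn.Parameter(torch.full((num_classes,), -0.5))
        self.mu_e = nn.Parameter(torch.full((num_classes,), 0.5))
        self.sigma_s = nn.Parameter(torch.ones(num_classes))
        self.sigma_e = nn.Parameter(torch.ones(num_classes))

    def forward(self, d, c):
        """d: (T,) signed distance of each frame from the ground-truth
        segment center; c: category index. Returns p_loc per frame."""
        p_s = torch.exp(-(d - self.mu_s[c]) ** 2 / (2 * self.sigma_s[c] ** 2))
        p_e = torch.exp(-(d - self.mu_e[c]) ** 2 / (2 * self.sigma_e[c] ** 2))
        return p_s * p_e  # used to re-weight each frame during training
```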
Further, the initial boundary probability sequences and the two-dimensional confidence map for action start-position matching are multiplied and fused to generate a confidence score p_f for each action segment:

p_f = p_ts · p_te · p_cc · p_cr

where t_s denotes the start time of the action segment; t_e denotes the end time of the action segment; p_ts is the probability of the action segment at start time t_s; p_te is the probability of the action segment at end time t_e; p_cc is the classification confidence; and p_cr is the regression confidence.
Further, removing redundancy with the non-maximum suppression method according to the confidence of the action segments to obtain the final candidate action segments comprises:
each time the action segment with the highest confidence is taken out, computing the overlap between its start-end extent and the other action segments, and removing a segment if the overlap exceeds a threshold, thereby obtaining the final candidate action segments, each comprising an action start time, an action end time, and a confidence.
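An illustrative sketch of this suppression step follows; temporal IoU as the overlap measure and the 0.5 threshold are assumptions.

```python
# Greedy non-maximum suppression over (start, end, confidence) segments.
def nms_segments(segments, iou_threshold=0.5):
    def tiou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    kept = []
    remaining = sorted(segments, key=lambda s: s[2], reverse=True)
    while remaining:
        best = remaining.pop(0)   # highest-confidence segment
        kept.append(best)
        remaining = [s for s in remaining if tiou(best, s) <= iou_threshold]
    return kept
```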
In some embodiments, the segment classification network employs a long short-term memory network, and classifying the temporal segments with the segment classification network comprises:
inputting the temporal segments into the long short-term memory network, where at time t the input feature vector is n_t; information is updated and retained by three gating mechanisms and a cell memory unit, the three gating mechanisms comprising a forget gate, an input gate, and an output gate;
concatenating the output feature vector h_{t-1} of the previous time t−1 with the input feature vector n_t to form a feature matrix M, and inputting M into the forget gate to obtain the forget-gate output state vector f_t:

f_t = σ(W_f * M + b_f)

where W_f and b_f denote the weight matrix and bias vector of the forget gate, and σ denotes the sigmoid activation function;
inputting the feature matrix M into the input gate to obtain the input-gate output state vector f_i, while also obtaining the not-yet-updated candidate cell state vector C̃_t:

f_i = σ(W_i * M + b_i)
C̃_t = tanh(W_c * M + b_c)

where W_i and b_i denote the weight matrix and bias vector of the input gate, W_c and b_c denote the weight matrix and bias vector of the cell memory unit, and tanh denotes the tanh activation function;
updating and storing based on the candidate cell state vector C̃_t and the state vectors f_t and f_i to obtain the cell state vector C_t:

C_t = f_t * C_{t-1} + f_i * C̃_t

where C_{t-1} denotes the cell state vector at the previous time t−1;
passing the cell state vector C_t through the output gate to determine the effective-information output vector h_t from the current C_t, expressed as follows:

h_t = σ(W_o * M + b_o) * tanh(C_t)

where W_o and b_o denote the weight matrix and bias vector of the output gate;
and passing the effective-information output vector h_t through the fully connected layer to predict the segment category, generating the experimental action detection result.
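A compact PyTorch sketch of this classification stage is given below, using the built-in LSTM cell (which implements the same forget/input/output gating as the equations above) rather than hand-written gates; the layer sizes and the single-layer choice are assumptions.

```python
# Segment classifier: LSTM over the frames of one candidate segment,
# followed by a fully connected layer predicting the action category.
import torch
import torch.nn as nn

class SegmentClassifier(nn.Module):
    def __init__(self, feat_dim, hidden_dim, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)  # fully connected layer

    def forward(self, segment_feats):
        """segment_feats: (batch, frames, feat_dim) features of one
        candidate temporal segment. Returns class logits per segment."""
        outputs, (h_n, c_n) = self.lstm(segment_feats)
        return self.fc(h_n[-1])   # classify from the last hidden state h_t
```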
In a second aspect, the invention provides an action detection device for middle school physics, chemistry, and biology experiments based on action boundary prediction, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the method according to the first aspect.
In a third aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the first aspect.
In a fourth aspect, the present invention provides an apparatus comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of the first aspect described above.
The beneficial effects of the invention are as follows:
(1) The application scenario of the method is key action detection in middle school physics and chemistry experiment videos. A two-stage approach is adopted: temporal action localization first generates segments, and action segment recognition then follows, yielding higher action segment recognition accuracy.
(2) On the premise of training on a fully supervised dataset, the method proposes action importance learning: the action importance of each frame is modeled, and the learned importance is multiplied into each frame's loss function (as sketched below), prompting the model to pay more attention to important frames and thereby train better.
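A minimal sketch of this loss re-weighting is shown below; the choice of binary cross-entropy as the base boundary loss is an assumption.

```python
# Per-frame loss re-weighting by the learned importance p_loc, as a sketch
# of beneficial effect (2) above.
import torch
import torch.nn.functional as F

def weighted_boundary_loss(pred, target, p_loc):
    """pred: (T,) per-frame boundary probabilities in [0, 1];
    target: (T,) ground-truth labels; p_loc: (T,) learned importance."""
    per_frame = F.binary_cross_entropy(pred, target, reduction='none')
    return (p_loc * per_frame).mean()   # important frames weigh more
```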
Drawings
FIG. 1 is a temporal action detection flow chart according to an embodiment of the invention.
Fig. 2 is a schematic diagram of a feature extraction network according to an embodiment of the invention.
Fig. 3 is a schematic structural diagram of a boundary matching network according to an embodiment of the present invention.
Detailed Description
Embodiments of the invention are disclosed in the drawings, and for purposes of explanation, numerous practical details are set forth in the following description. However, it should be understood that these practical details are not to be taken as limiting the invention. That is, in some embodiments of the invention, these practical details are unnecessary.
Example 1
In a first aspect, the present embodiment provides an action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction, comprising:
acquiring an experimental video to be tested; the video to be tested contains middle school physics, chemistry, or biology experiment actions;
performing feature extraction on the experimental video using a feature extraction network to obtain video features;
performing boundary matching on the video features using a boundary matching network to obtain temporal segments formed by pairing action start times with action end times; the boundary matching network comprises two branches: in the first branch, after temporal convolution, each frame is modeled with a learnable Gaussian distribution and background frames are distinguished from action frames to obtain an initial boundary probability sequence; the second branch is processed to obtain a two-dimensional confidence map for action start-position matching, and the temporal segments are obtained by post-processing the initial boundary probability sequence and the two-dimensional confidence map;
and classifying the temporal segments with a segment classification network to obtain an experimental action detection result.
In some embodiments, the feature extraction network may employ an I3D network, the SlowFast network, the sparse-temporal-sampling network TSN, or the three-dimensional convolutional network C3D.
In some embodiments, the segment classification network may employ a long short-term memory network, a graph convolutional network, or a recurrent neural network RNN.
In this embodiment, the feature extraction network is an I3D network and the segment classification network is a long short-term memory network, as described in further detail below.
In some embodiments, as shown in fig. 1, the action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction comprises the following steps:
step 1: by using special hardware equipment, unifying the view angle range and the lens height, simulating the experimental examination of middle school students to acquire videos, ensuring that data are attached to actual scenes, then manufacturing a training data set, and preparing a JSON tag according to the following format: each video contains action information, action category labels, category names, start frames and end frames in each video segment. The label comprises a movable slide rheostat, an opening switch, zeroing of the ammeter and the like.
Step 2: constructing a temporal action detection network comprising the feature extraction network, the boundary matching network, and the segment classification network; acquiring the experimental action dataset and training the feature extraction network to obtain video features; annotating the video features to form a video feature dataset labeled with action start times and action end times, and training the boundary matching network with this dataset to obtain temporal segments; and annotating the temporal segments to form a temporal segment dataset with category labels and training the segment classification network with it, thereby obtaining the pretrained feature extraction network, boundary matching network, and segment classification network.
Step 3: as shown in fig. 2, the experimental video to be tested is acquired and fed into the I3D network to extract video features, with the following specific steps:
(3.1) decoding the experimental video at 25 frames per second, determining a sliding window of size 16, and cutting the video into non-overlapping windows;
(3.2) extracting the image RGB features: each time 16 frames of pictures of size C×C are input, they pass through the I3D network to obtain the output image RGB features;
specifically: each time 16 frames of pictures are input into the I3D network, the steps are as follows: a 7×7×7 three-dimensional convolution and one 1×3×3 max pooling are performed, outputting a first-layer feature map N1; 1×1×1 and 3×3×3 three-dimensional convolutions are applied to the first-layer feature map N1 with downsampling to obtain a second-layer feature map N2; the second-layer feature map N2 passes through two residual modules, each comprising four branches that take the output of the preceding module as input: the first branch applies a 1×1×1 three-dimensional convolution, the second and third branches apply a 1×1×1 followed by a 3×3×3 three-dimensional convolution, and the fourth branch applies a max-pooling downsampling followed by a 1×1×1 three-dimensional convolution; the outputs of the four branches are feature-fused and then downsampled with stride 2 to obtain a third-layer feature map N3; finally, N3 passes through five identical residual modules and a stride-2 downsampling to obtain the output image RGB features;
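The following PyTorch sketch illustrates the I3D stem described above (the 7×7×7 convolution and 1×3×3 max pooling that produce N1); the channel count and input resolution are assumptions, and the subsequent residual modules are omitted for brevity.

```python
# Sketch of the I3D stem producing the first-layer feature map N1.
import torch
import torch.nn as nn

stem = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(7, 7, 7), stride=(2, 2, 2), padding=3),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
)

clip = torch.randn(1, 3, 16, 224, 224)   # 16 frames, assumed 224x224 size
n1 = stem(clip)                           # first-layer feature map N1
```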
(3.3) extracting the optical flow features: by introducing an L1 norm, the optical flow between adjacent frames is obtained through an optical flow estimation algorithm; a flow field is computed to estimate the motion of pixels across two consecutive image frames, the optical flow components of the action in the horizontal and vertical directions are obtained by optical flow estimation and stacked to form an optical flow map, and the final output is the optical flow features; specifically:
the optical flow between adjacent frames is obtained through the total variation algorithm, constructing the following objective function:

min_u ∫_Ω (|∇u_1(x)| + |∇u_2(x)|) dΩ + λ ∫_Ω (I_1(x + u(x)) − I_0(x))² dΩ

where x denotes the position of any pixel of the image; I_0(x) is the image intensity of the previous frame at pixel position x; u_1(x) and u_2(x) are the horizontal and vertical displacement components of the optical flow u(x) at pixel position x; I_1(x + u(x)) is the image intensity of the next frame at position x + u(x); ∇ denotes the gradient operator; the first term is the regularization term yielding a smooth displacement field; λ is a weight coefficient controlling the regularization effect; and the second term ∫_Ω (I_1(x + u(x)) − I_0(x))² dΩ is the data term of the optical flow constraint.
Regularization added to the objective function makes the learned result satisfy sparsity; the optical flow components of the action in the horizontal and vertical directions are finally obtained and stacked to form an optical flow map, the final output being the optical flow features;
(3.4) performing feature fusion on the image RGB features and the optical flow features to obtain the video features.
Step 4: the extracted video features are input into the boundary matching network. As shown in fig. 3, the boundary matching network comprises two branches: in the first branch, after temporal convolution, each frame is modeled with a learnable Gaussian distribution and background frames are distinguished from action frames to obtain the initial boundary probability sequences; the second branch is processed to obtain the two-dimensional confidence map for action start-position matching; finally, the outputs of the two branches are post-processed to obtain temporal segments formed by pairing action start times with action end times. The specific steps are as follows:
(4.1) first, the video features obtained in step 3 are input into two 3×3 convolution layers with stride 1, and the processed input feature sequence serves as the temporal feature sequence shared by the two subsequent modules;
(4.2) the processed feature sequence is input into the temporal evaluation module, which comprises two 3×3 convolution layers with max pooling followed by a normalization of depth 2; an action-importance contrastive learning branch is introduced into the temporal evaluation module, adjusting the weight of each frame during training according to its action importance.
Because video information is continuous and certain key frames exist, a learnable Gaussian distribution is used to model action importance, explicitly assigning Gaussian-like weights to model the accuracy of action localization. The action importance p_loc of each frame i is modeled as:

p_loc(i) = p_s(i) · p_e(i)

where p_s(i) models the localization accuracy of the i-th frame for the start time and p_e(i) models the localization accuracy of the i-th frame for the end time:

p_s(i) = exp(−(d(i) − μ_s)² / (2σ_s²)), p_e(i) = exp(−(d(i) − μ_e)² / (2σ_e²))

where μ_s and σ_s are learnable parameters denoting the mean and variance of the action-importance distribution of each category c at the start time, initialized to μ_s = −0.5 and σ_s = 1; μ_e and σ_e are the mean and variance of the action-importance distribution of each category c at the end time, initialized to μ_e = 0.5 and σ_e = 1; and d(i) denotes the distance of the current i-th frame from the center of the ground-truth annotated segment during training;
(4.3) the initially processed feature sequence is input into the segment generation module: first a sliding window of size 16 is determined through the boundary matching layer; with T denoting the maximum segment length of a sample, N points are uniformly sampled within each temporal range of the input feature sequence to form an N×T mask weight, where N is the number of sampling points in a video segment; for the n-th sampling point, a C×T feature map is generated from the temporal feature sequence, where C is the feature dimensionality; the C×T temporal feature map is weighted by the weight matrix corresponding to the sampling points to compute a C×N feature map; a 32×1×1 three-dimensional convolution is applied to the feature map to reduce the sampling dimension to 1, then 1×1 and 3×3 convolutions are applied, and a sigmoid activation generates a confidence output of channel dimension 2, yielding the final two-dimensional confidence map for action start-position matching, in which segments lying in the same row share the same duration and segments in the same column share the same start boundary;
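The following sketch illustrates the sampling-and-weighting operation of (4.3) for a single proposal: a C×T temporal feature map is weighted by an N×T sampling mask to produce a C×N feature map. The shapes shown and the single-proposal simplification are assumptions; a real boundary matching layer builds one mask per (start, duration) cell.

```python
# Boundary-matching sampling: weight a C x T feature map by an N x T mask.
import torch

C, T, N = 256, 100, 32
features = torch.randn(C, T)       # temporal feature sequence for one video
mask = torch.zeros(N, T)
start, duration = 20, 16           # one hypothetical proposal
# N points sampled uniformly over the proposal's temporal extent.
positions = torch.linspace(start, start + duration - 1, N).long()
mask[torch.arange(N), positions] = 1.0

bm_feature = torch.einsum('ct,nt->cn', features, mask)   # C x N feature map
```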
(4.4) the initial boundary probability sequences generated in (4.2) and the two-dimensional confidence map of action start positions generated in (4.3) are multiplied and fused to generate a confidence score p_f for each segment, as follows:

p_f = p_ts · p_te · p_cc · p_cr

where t_s denotes the start time of the action segment; t_e denotes the end time of the action segment; p_ts is the probability of the action segment at start time t_s; p_te is the probability of the action segment at end time t_e; p_cc is the classification confidence; and p_cr is the regression confidence;
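A one-line sketch of the fusion in (4.4), under the product form reconstructed above:

```python
# Fuse boundary probabilities with the two confidence-map channels.
def fuse_confidence(p_start, p_end, p_cc, p_cr):
    """p_start, p_end: boundary probabilities at t_s and t_e;
    p_cc, p_cr: classification and regression confidences."""
    return p_start * p_end * p_cc * p_cr
```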
(4.5) finally, redundancy is removed with the non-maximum suppression method to obtain the final candidate action segments, each comprising an action start time, an action end time, and a confidence.
Step 5: a long short-term memory network is constructed for segment recognition; the temporal segment features generated in step 4 are normalized and input into the long short-term memory network for category prediction to generate the final result, with the following specific steps:
(5.1) the temporal segments (the candidate action segments generated in step 4) are input into the long short-term memory network, where at time t the input feature vector is n_t; information is updated and retained by three gating mechanisms and a cell memory unit, the three gating mechanisms comprising a forget gate, an input gate, and an output gate;
(5.2) the output feature vector h_{t-1} of the previous time t−1 and the input feature vector n_t are concatenated into a feature matrix M, which is input into the forget gate to obtain the forget-gate output state vector f_t, expressed as follows:

f_t = σ(W_f * M + b_f)

where W_f and b_f denote the weight matrix and bias vector of the forget gate, and σ denotes the sigmoid activation function;
(5.3) the feature matrix M is input into the input gate to obtain the input-gate output state vector f_i, while the not-yet-updated candidate cell state vector C̃_t is obtained at the same time, expressed as follows:

f_i = σ(W_i * M + b_i)
C̃_t = tanh(W_c * M + b_c)

where W_i and b_i denote the weight matrix and bias vector of the input gate, W_c and b_c denote the weight matrix and bias vector of the cell memory unit, and tanh denotes the tanh activation function;
(5.4) based on the candidate cell state vector C̃_t and the state vectors f_t and f_i, the cell state is updated and stored to obtain the cell state vector C_t:

C_t = f_t * C_{t-1} + f_i * C̃_t

where C_{t-1} denotes the cell state vector at the previous time t−1 and C̃_t is the not-yet-updated candidate cell state;
(5.5) the cell state vector C_t passes through the output gate to determine the effective-information output vector h_t from the current C_t, expressed as follows:

h_t = σ(W_o * M + b_o) * tanh(C_t)

where W_o and b_o denote the weight matrix and bias vector of the output gate;
(5.6) finally, the effective-information output vector h_t is passed through the fully connected layer to predict the segment category and generate the final action detection result.
Example 2
In a second aspect, based on embodiment 1, this embodiment provides an action detection device for middle school physics, chemistry, and biology experiments based on action boundary prediction, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the method according to embodiment 1.
Example 3
In a third aspect, based on embodiment 1, the present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method described in embodiment 1.
Example 4
In a fourth aspect, based on embodiment 1, this embodiment provides an apparatus comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of embodiment 1.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description is only illustrative of the invention and is not to be construed as limiting the invention. Various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, or the like, which is within the spirit and principles of the present invention, should be included in the scope of the claims of the present invention.

Claims (10)

1. An action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction, characterized by comprising the following steps:
acquiring an experimental video to be tested; the video to be tested contains middle school physics, chemistry, or biology experiment actions;
performing feature extraction on the experimental video using a feature extraction network to obtain video features;
performing boundary matching on the video features using a boundary matching network to obtain temporal segments formed by pairing action start times with action end times; the boundary matching network comprises two branches: in the first branch, after temporal convolution, each frame is modeled with a learnable Gaussian distribution and background frames are distinguished from action frames to obtain an initial boundary probability sequence; the second branch is processed to obtain a two-dimensional confidence map for action start-position matching, and the temporal segments are obtained by post-processing the initial boundary probability sequence and the two-dimensional confidence map;
and classifying the temporal segments with a segment classification network to obtain an experimental action detection result.
2. The action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction according to claim 1, wherein performing feature extraction on the experimental video using the feature extraction network to obtain video features comprises the following steps:
decoding the experimental video at 25 frames per second, determining a sliding window, and partitioning the video into non-overlapping windows to obtain pictures;
each time 16 frames of pictures of size C×C are input, passing them through the I3D feature extraction network to obtain the image RGB features;
introducing an L1 norm and using an optical flow estimation algorithm to obtain the optical flow between adjacent frames: computing a flow field to estimate the motion of pixels across two consecutive image frames, obtaining the optical flow components of the action in the horizontal and vertical directions by optical flow estimation and stacking them to form an optical flow map, the final output being the optical flow features;
and performing feature fusion on the image RGB features and the optical flow features to obtain the video features.
3. The action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction according to claim 2, wherein each time 16 frames of pictures of size C×C are input, the pictures pass through the I3D feature extraction network to obtain the output image RGB features, specifically comprising: each time 16 frames of pictures are input, performing a 7×7×7 three-dimensional convolution and a 1×3×3 max pooling to output a first-layer feature map N1; applying 1×1×1 and 3×3×3 three-dimensional convolutions to the first-layer feature map N1 and downsampling to obtain a second-layer feature map N2; passing the second-layer feature map N2 through two residual modules followed by a 3×3×3 max pooling to obtain a third-layer feature map N3; and passing the third-layer feature map N3 through five identical residual modules and a stride-2 downsampling to finally obtain the output image RGB features; wherein each residual module comprises four branches, all taking the output of the preceding module as input: the first branch applies a 1×1×1 three-dimensional convolution; the second and third branches apply a 1×1×1 followed by a 3×3×3 three-dimensional convolution; the fourth branch applies a 3×3×3 max-pooling downsampling followed by a 1×1×1 three-dimensional convolution; and finally the outputs of the four branches are feature-fused and downsampled with stride 2.
4. The action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction according to claim 2, wherein the optical flow between adjacent frames is obtained through a total variation algorithm, constructing the following objective function:

min_u ∫_Ω (|∇u_1(x)| + |∇u_2(x)|) dΩ + λ ∫_Ω (I_1(x + u(x)) − I_0(x))² dΩ

where x denotes the position of any pixel of the image; I_0(x) is the image intensity of the previous frame at pixel position x; u_1(x) and u_2(x) are the horizontal and vertical displacement components of the optical flow u(x) at pixel position x; I_1(x + u(x)) is the image intensity of the next frame at position x + u(x); ∇ denotes the gradient operator; the first term ∫_Ω (|∇u_1(x)| + |∇u_2(x)|) dΩ is the regularization term yielding a smooth displacement field; λ is a weight coefficient controlling the regularization effect; and the second term ∫_Ω (I_1(x + u(x)) − I_0(x))² dΩ is the data term of the optical flow constraint;
by adding regularization to the objective function, the learned result satisfies sparsity; the optical flow components of the action in the horizontal and vertical directions are finally obtained through iteration and stacked to form an optical flow map, the final output being the optical flow features.
5. The action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction according to claim 1, wherein performing boundary matching on the video features using the boundary matching network to obtain temporal segments formed by pairing action start times with action end times comprises:
inputting the video features into two 3×3 convolution layers with stride 1 to obtain a temporal feature sequence, which is fed into the two branches respectively;
processing the temporal feature sequence in the first branch, namely two 3×3 convolution layers with max pooling followed by a normalization of depth 2; action-importance contrastive learning is introduced in the first branch, that is, each frame is modeled with a learnable Gaussian distribution, background frames are distinguished from action frames, the weight of each frame during training is adjusted according to its action importance, and the branch finally outputs the initial boundary probability sequences, namely a start probability sequence and an end probability sequence;
processing the temporal feature sequence in the second branch: a boundary matching layer determines a sliding window of size 16; with T denoting the maximum segment length of a sample, N points are uniformly sampled within each temporal range of the input temporal feature sequence to form an N×T mask weight, where N is the number of sampling points in a video segment; for the n-th sampling point, a C×T temporal feature map is generated from the temporal feature sequence, where C is the feature dimensionality; the C×T temporal feature map is weighted by the weight matrix corresponding to the sampling points to compute a C×N feature map;
applying a 32×1×1 three-dimensional convolution to the C×N feature map to reduce the sampling dimension to 1, then applying 1×1 and 3×3 convolutions, and generating a confidence output of channel dimension 2 through a sigmoid activation to obtain the two-dimensional confidence map for action start-position matching, wherein segments lying in the same row of the two-dimensional confidence map share the same duration and segments in the same column share the same start boundary;
multiplying and fusing the initial boundary probability sequences with the two-dimensional confidence map for action start-position matching to generate a confidence score for each action segment;
and removing redundancy with a non-maximum suppression method according to the action segment confidences to obtain the final candidate action segments, which form the temporal segments.
6. The action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction according to claim 5, wherein introducing action-importance contrastive learning in the first branch specifically comprises:
modeling action importance with a learnable Gaussian distribution and explicitly assigning Gaussian-like weights to model the accuracy of action localization; the action importance p_loc of the i-th frame is modeled as:

p_loc(i) = p_s(i) · p_e(i)

where p_s(i) models the localization accuracy of the i-th frame for the start time, and p_e(i) models the localization accuracy of the i-th frame for the end time:

p_s(i) = exp(−(d(i) − μ_s)² / (2σ_s²)), p_e(i) = exp(−(d(i) − μ_e)² / (2σ_e²))

where μ_s and σ_s are learnable parameters denoting the mean and variance of the action-importance distribution of each category c at the start time; μ_e and σ_e are the mean and variance of the action-importance distribution of each category c at the end time; and d(i) denotes the distance of the current i-th frame from the center of the ground-truth annotated segment during training;
obtaining an importance sequence for the frames through this modeling and learning;
and fusing the importance sequence with the temporal feature sequence by accumulation to obtain the initial boundary probability sequences.
7. The action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction according to claim 5, wherein the initial boundary probability sequences and the two-dimensional confidence map for action start-position matching are multiplied and fused to generate a confidence score p_f for each action segment:

p_f = p_ts · p_te · p_cc · p_cr

where t_s denotes the start time of the action segment; t_e denotes the end time of the action segment; p_ts is the probability of the action segment at start time t_s; p_te is the probability of the action segment at end time t_e; p_cc is the classification confidence; and p_cr is the regression confidence.
8. The action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction according to claim 5, wherein removing redundancy with the non-maximum suppression method according to the confidence of the action segments to obtain the final candidate action segments comprises:
each time the action segment with the highest confidence is taken out, computing the overlap between its start-end extent and the other action segments, and removing a segment if the overlap exceeds a threshold, thereby obtaining the final candidate action segments, each comprising an action start time, an action end time, and a confidence.
9. The action detection method for middle school physics, chemistry, and biology experiments based on action boundary prediction according to claim 1, wherein the segment classification network employs a long short-term memory network, and classifying the temporal segments with the segment classification network comprises:
inputting the temporal segments into the long short-term memory network, where at time t the input feature vector is n_t; information is updated and retained by three gating mechanisms and a cell memory unit, the three gating mechanisms comprising a forget gate, an input gate, and an output gate;
concatenating the output feature vector h_{t-1} of the previous time t−1 with the input feature vector n_t to form a feature matrix M, and inputting M into the forget gate to obtain the forget-gate output state vector f_t:

f_t = σ(W_f * M + b_f)

where W_f and b_f denote the weight matrix and bias vector of the forget gate, and σ denotes the sigmoid activation function;
inputting the feature matrix M into the input gate to obtain the input-gate output state vector f_i, while also obtaining the not-yet-updated candidate cell state vector C̃_t:

f_i = σ(W_i * M + b_i)
C̃_t = tanh(W_c * M + b_c)

where W_i and b_i denote the weight matrix and bias vector of the input gate, W_c and b_c denote the weight matrix and bias vector of the cell memory unit, and tanh denotes the tanh activation function;
updating and storing based on the candidate cell state vector C̃_t and the state vectors f_t and f_i to obtain the cell state vector C_t:

C_t = f_t * C_{t-1} + f_i * C̃_t

where C_{t-1} denotes the cell state vector at the previous time t−1;
passing the cell state vector C_t through the output gate to determine the effective-information output vector h_t from the current C_t, expressed as follows:

h_t = σ(W_o * M + b_o) * tanh(C_t)

where W_o and b_o denote the weight matrix and bias vector of the output gate;
and passing the effective-information output vector h_t through the fully connected layer to predict the segment category, generating the experimental action detection result.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, implements the method of any of claims 1 to 9.
CN202311558506.XA 2023-11-21 2023-11-21 Motion boundary prediction-based middle school physical and chemical life experiment motion detection method Pending CN117765432A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311558506.XA CN117765432A (en) 2023-11-21 2023-11-21 Motion boundary prediction-based middle school physical and chemical life experiment motion detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311558506.XA CN117765432A (en) 2023-11-21 2023-11-21 Motion boundary prediction-based middle school physical and chemical life experiment motion detection method

Publications (1)

Publication Number Publication Date
CN117765432A 2024-03-26

Family

ID=90313410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311558506.XA Pending CN117765432A (en) 2023-11-21 2023-11-21 Motion boundary prediction-based middle school physical and chemical life experiment motion detection method

Country Status (1)

Country Link
CN (1) CN117765432A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117994864A (en) * 2024-04-03 2024-05-07 北京师范大学珠海校区 Method and device for evaluating biological experiment operation of middle school, electronic equipment and storage medium
CN117994864B (en) * 2024-04-03 2024-07-26 北京师范大学珠海校区 Method and device for evaluating biological experiment operation of middle school, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination