CN112949352B - Training method and device of video detection model, storage medium and electronic equipment - Google Patents


Info

Publication number
CN112949352B
CN112949352B
Authority
CN
China
Prior art keywords: training, key frames, detection model, video detection, module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911256542.4A
Other languages
Chinese (zh)
Other versions
CN112949352A (en)
Inventor
蒋正锴
王国利
张骞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd filed Critical Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN201911256542.4A priority Critical patent/CN112949352B/en
Publication of CN112949352A publication Critical patent/CN112949352A/en
Application granted granted Critical
Publication of CN112949352B publication Critical patent/CN112949352B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A training method and device for a video detection model, a storage medium and an electronic device are disclosed. The method includes the following steps: determining a preset relationship between key frames and non-key frames in a plurality of training videos; acquiring a plurality of training samples from the plurality of training videos based on the preset relationship between the key frames and the non-key frames; and training the video detection model according to the plurality of training samples. With a video detection model trained in this way, each frame of a video can be identified more accurately, so the video detection precision can be effectively improved.

Description

Training method and device of video detection model, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a training method and device for a video detection model, a storage medium and an electronic device.
Background
Video detection is an important application technology, with significant prospects in fields such as autonomous driving and security. Achieving fast and accurate detection is a central research goal of video detection.
Prior-art video detection schemes mainly rely on optical-flow-based feature alignment between different frames to track and mark objects in a video. Specifically, existing schemes train an end-to-end video detection model in which a detection network module and an optical flow network module are trained together. In use, the video to be detected is input to the video detection model, which outputs the information of the objects detected in each frame of the video and their corresponding labels.
However, the optical flow network module trained inside such a video detection model no longer computes optical flow in the traditional sense: it not only slows down video detection, but is also often inaccurate, so the precision of video detection is low.
Disclosure of Invention
In order to solve the technical problems, the application provides a training method and device of a video detection model, a storage medium and electronic equipment.
According to one aspect of the present application, there is provided a training method for a video detection model, including:
determining a preset relationship between key frames and non-key frames in a plurality of training videos;
acquiring a plurality of training samples from the plurality of training videos based on the preset relationship between the key frames and the non-key frames; and
training the video detection model according to the plurality of training samples.
According to another aspect of the present application, there is provided a training apparatus for a video detection model, comprising:
a determining module, used for determining a preset relationship between key frames and non-key frames in a plurality of training videos;
an acquisition module, used for acquiring a plurality of training samples from the plurality of training videos based on the preset relationship between the key frames and the non-key frames; and
a training module, used for training the video detection model according to the plurality of training samples.
According to another aspect of the present application, there is provided a computer readable storage medium storing a computer program for performing any one of the methods described above.
According to another aspect of the present application, there is provided an electronic device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform any of the methods described above.
The training method of the video detection model provided by the embodiments of the application determines a preset relationship between key frames and non-key frames in a plurality of training videos, acquires a plurality of training samples from the training videos based on that preset relationship, and trains the video detection model according to the training samples. With a video detection model trained by this scheme, each frame of a video can be identified more accurately, so the video detection precision can be effectively improved.
Drawings
The above and other objects, features and advantages of the present application will become more apparent from the detailed description of embodiments of the present application with reference to the accompanying drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application; they are incorporated in and constitute a part of this specification, illustrate the application together with its embodiments, and do not constitute a limitation of the application. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 is a flowchart of a training method of a video detection model according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a video detection model according to the present embodiment.
Fig. 3 is a flowchart of a training method of a video detection model according to a second embodiment of the present invention.
Fig. 4 is a flowchart of an embodiment of a video detection method of the present invention.
Fig. 5 is a block diagram of a training apparatus for a video detection model according to an embodiment of the present invention.
Fig. 6 is a block diagram of a training apparatus for a video detection model according to a second embodiment of the present invention.
Fig. 7 is a block diagram of an embodiment of a video detection apparatus according to the present invention.
Fig. 8 illustrates a block diagram of an electronic device according to an embodiment of the application.
Detailed Description
Hereinafter, exemplary embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
Summary of the application
The video detection scheme of the present application is implemented with a novel video detection model, which may include a motion prior learning module and a detection network module trained end to end. The scheme can be applied in fields such as autonomous driving and security to mark the objects in each frame of a video. In particular, in the technical scheme of the application, the detected bounding box of an object in each frame can be marked, where the bounding box can be understood as the frame information of the region where the object is located; at the same time, the predicted label information (Label) of the objects in the images can be marked, so that operations such as tracking and analysis can be performed on the objects in the video based on the video detection result.
Exemplary System
The training scheme of the video detection model of the present application can be deployed in a training apparatus for the video detection model, which may be a physical electronic device such as a large-scale computer, or may be implemented as an integrated software application. In use, the video detection model is trained in this apparatus so as to obtain a video detection model with higher detection precision. The trained model can then be deployed in a specific usage scenario. For example, in the field of autonomous driving, the video detection model may be installed in an unmanned vehicle to detect objects in the video captured by the vehicle. In the security field, the video detection model can be installed in a security system to detect the surveillance video captured by cameras. Similarly, in other fields, the trained video detection model can be deployed to detect the acquired video and thereby serve the corresponding scenario.
Exemplary method
Fig. 1 is a flowchart of a training method of a video detection model according to an embodiment of the present invention. As shown in fig. 1, the training method of the video detection model of the present embodiment may specifically include the following steps:
S101, determining a preset relationship between key frames and non-key frames in a plurality of training videos.
Before training, the training apparatus of the video detection model of this embodiment needs to determine the preset relationship between key frames and non-key frames. For example, key frames may be set according to a preset frame-interval rule: the 0th frame of the video is set as a key frame and, with a preset frame interval of 5, every 5th frame thereafter is also set as a key frame, so that frames 0, 5, 10, 15 and so on are key frames, while all other frames are non-key frames, as sketched below.
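As an illustration only, this key-frame rule can be expressed in a few lines of Python; the function name and the interval value of 5 are assumptions for this sketch, not limitations of the method.

```python
def is_key_frame(frame_index: int, interval: int = 5) -> bool:
    """Frame 0 and every `interval`-th frame thereafter are key frames."""
    return frame_index % interval == 0

# Frames 0, 5, 10, 15, ... are key frames; all other frames are non-key frames.
key_frames = [i for i in range(30) if is_key_frame(i)]        # [0, 5, 10, 15, 20, 25]
non_key_frames = [i for i in range(30) if not is_key_frame(i)]
```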
S102, acquiring a plurality of training samples from a plurality of training videos based on a preset relation between the key frames and the non-key frames.
The training samples of this embodiment are obtained by selecting, based on the preset relationship between key frames and non-key frames, some key frames and some non-key frames, and composing them into training samples.
It should be noted that, in order to cover the various situations that may occur in a video, the training samples collected in this embodiment may cover all possible combinations of two key frames and one non-key frame in the video.
For example, groups consisting of two sequentially arranged key frame images and one non-key frame image located after the two key frame images may be extracted from the plurality of training videos based on the preset relationship; the region information of the target object marked in the non-key frame image and the label information corresponding to the target object are then obtained. Each training sample thus includes two key frame images and one non-key frame image, with the two key frame images first in time order and the non-key frame image after them. Because the training of this embodiment is supervised, each training sample further includes the region information of the training object marked in the image data of the non-key frame and the corresponding label information, to facilitate the subsequent calculation of the detection loss. Therefore, the region information of the target object marked in the non-key frame image and the corresponding label (Label) also need to be determined in the training sample; these can be marked manually by annotators, so they can be read directly when the training sample is collected. The region information may be a frame marked in the non-key frame image that encloses the target object; the frame may be a rectangle, a square, a circle or any other regular or irregular shape, that is, any shape capable of enclosing the target object. If the non-key frame image contains multiple target objects, each target object has its own region information and needs its own frame, so one non-key frame image then necessarily contains multiple frames. In order to track moving objects when the video detection model is applied, in this embodiment a label also needs to be attached to the region information of each target object, that is, the label uniquely identifies the target object within its frame. For example, the label may be represented by at least one of a number, a letter or another symbol.
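A minimal sketch of how such a training sample might be assembled is given below; the data layout, the helper name sample_from_video and the fixed interval are assumptions of this sketch rather than details disclosed by the embodiment.

```python
import random

KEY_INTERVAL = 5  # assumed preset frame interval

def sample_from_video(frames, annotations):
    """Pick two adjacent key frames and one later non-key frame.

    `frames` is a list of per-frame images; `annotations` maps a frame index
    to the (region, label) pairs marked manually in that frame. Assumes the
    video contains at least three key frames.
    """
    key_idx = [i for i in range(len(frames)) if i % KEY_INTERVAL == 0]
    k = random.randrange(len(key_idx) - 2)   # leave room for frames after k2
    k1, k2 = key_idx[k], key_idx[k + 1]
    later_non_keys = [i for i in range(k2 + 1, len(frames))
                      if i % KEY_INTERVAL != 0]
    nk = random.choice(later_non_keys)
    # two key frame images, one later non-key frame image,
    # plus the non-key frame's marked regions and labels for supervision
    return (frames[k1], frames[k2], frames[nk]), annotations[nk]
```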
S103, training a video detection model according to the plurality of training samples.
In this embodiment, when the training samples are collected from the multiple training videos, one, two or more training samples may be collected from each training video. The number of collected training samples may reach the million level or more, and the more training samples there are, the more accurate the trained video detection model.
Fig. 2 is a schematic structural diagram of the video detection model of this embodiment. As shown in Fig. 2, the video detection model of this embodiment includes two parts, a motion prior learning module and a detection network module. The motion prior learning module combines the motion knowledge in consecutive frames and obtains the feature information of the current frame to be predicted based on the feature information of different frames in the video; the detection network module detects the information of objects and the corresponding labels based on the feature information of the current frame. Both modules are modeled with convolutional neural networks to realize their respective functions. During training, the motion prior learning module and the detection network module in the video detection model are trained together end to end.
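For concreteness, the two-module structure of Fig. 2 could be sketched as follows; since the embodiment only specifies that both parts are convolutional networks, all layer sizes, the single-object detection head and the PyTorch framing are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class MotionPriorLearning(nn.Module):
    """Fuses the features of two key frames into predicted features
    for a later non-key frame."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, feat_k1, feat_k2):
        return self.fuse(torch.cat([feat_k1, feat_k2], dim=1))

class DetectionNetwork(nn.Module):
    """Predicts a region (bounding box) and label scores from the fused
    features; simplified here to one object per image."""
    def __init__(self, channels: int = 64, num_classes: int = 10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.box_head = nn.Linear(channels, 4)
        self.cls_head = nn.Linear(channels, num_classes)

    def forward(self, feat):
        v = self.pool(feat).flatten(1)              # (B, channels)
        return self.box_head(v), self.cls_head(v)   # (B, 4), (B, num_classes)

class VideoDetectionModel(nn.Module):
    """End-to-end model: both modules are trained together."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # stand-in feature extractor
        self.motion_prior = MotionPriorLearning()
        self.detector = DetectionNetwork()

    def forward(self, img_k1, img_k2):
        fused = self.motion_prior(self.backbone(img_k1), self.backbone(img_k2))
        return self.detector(fused)
```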
Fig. 3 is a flowchart of a training method of a video detection model according to a second embodiment of the present invention. This embodiment describes in detail the training process of S103 in the embodiment shown in Fig. 1, that is, training the video detection model according to the plurality of training samples. As shown in Fig. 3, the training method of the video detection model of this embodiment may specifically include the following steps:
S201, selecting one training sample from the plurality of training samples, and inputting the image data of the two key frames and the image data of the non-key frame in that training sample into the motion prior learning module.
It should be noted that, before training, the parameters in the motion prior learning module and the detection network module need to be randomly initialized. These parameters are then trained in the training manner of this embodiment.
S202, acquiring the feature information of the non-key frame obtained by the motion prior learning module through feature fusion of the image data of the two key frames.
In this embodiment, one training sample may be selected for each training step. Specifically, the training sample is first input to the motion prior learning module, so that the module learns the fusion of the features of the two key frames into those of the non-key frame. Correspondingly, the feature information of the non-key frame, obtained by the motion prior learning module through feature fusion of the image data of the two key frames, can then be acquired.
In addition, it should be noted that, since both the two key frames and the non-key frame in a training sample of this embodiment are obtained according to the aforementioned preset relationship, the preset relationship includes the offset between the two key frames and the offset between the non-key frame and each key frame. These offsets may be obtained from the corresponding frame images, or may be labeled in advance based on the preset relationship, and during training the motion prior learning module can obtain them. For example, the motion prior learning module may learn the feature information of the second key frame based on the image data of the first key frame and the offset between the two key frames, and correct the learned feature information of the second key frame based on the image data of the second key frame. By learning this function, the motion prior learning module learns to predict images from key frame to key frame.
Feature fusion can likewise produce the feature information of the non-key frame from the image data of the first key frame, the image data of the second key frame, and the acquired offsets between the non-key frame and each of the two key frames. Because the feature information of the later non-key frame is obtained by fusing the image data of the two preceding key frames, the accuracy of the feature information of the non-key frame can be fully ensured. By learning this function, the motion prior learning module learns to track the image of a later non-key frame from the images of the two preceding key frames.
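One plausible way to condition the fusion on these frame offsets, assumed here since the embodiment does not specify the mechanism, is to broadcast each offset as an extra input channel:

```python
import torch
import torch.nn as nn

class OffsetConditionedFusion(nn.Module):
    """Fuses two key-frame feature maps into non-key-frame features,
    conditioned on each key frame's offset to the target frame."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * (channels + 1), channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, feat_k1, feat_k2, off1: float, off2: float):
        b, _, h, w = feat_k1.shape
        # broadcast the scalar offsets to spatial maps and stack as channels
        o1 = torch.full((b, 1, h, w), float(off1), device=feat_k1.device)
        o2 = torch.full((b, 1, h, w), float(off2), device=feat_k1.device)
        return self.fuse(torch.cat([feat_k1, o1, feat_k2, o2], dim=1))
```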
S203, inputting the feature information of the non-key frame obtained through fusion into the detection network module.
S204, acquiring the region information and label information of the training object in the image data of the non-key frame, as predicted by the detection network module.
The feature information of the non-key frame obtained through fusion by the motion prior learning module is input into the detection network module, so that the detection network module can predict, based on the input feature information of the non-key frame, the region information and label information of the training object in the image data of the non-key frame, and output them.
S205, calculating the detection loss according to the predicted region information and label information of the training object and the marked region information and corresponding label information of the training object.
Since the sample data includes the region information and corresponding label information of the training object marked in the image data of the non-key frame, the detection loss can be calculated from the predicted region information and label information of the training object and the region information and corresponding label information of the training object marked in the sample data, as illustrated below.
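As an illustration of step S205, the detection loss could combine a box-regression term with a label-classification term; the specific choices below (smooth L1 plus cross-entropy) are common practice and an assumption of this sketch, since the embodiment does not name the loss functions.

```python
import torch.nn.functional as F

def detection_loss(pred_boxes, pred_logits, gt_boxes, gt_labels):
    """pred_boxes: (B, 4) predicted regions; pred_logits: (B, C) label scores;
    gt_boxes: (B, 4) marked regions; gt_labels: (B,) marked label ids."""
    box_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)   # region information term
    cls_loss = F.cross_entropy(pred_logits, gt_labels)  # label information term
    return box_loss + cls_loss
```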
S206, judging whether the detection loss has converged; if not, performing step S207; otherwise, performing step S208.
S207, adjusting the parameters in the motion prior learning module and the detection network module by a gradient descent method; then returning to step S201, selecting the next training sample, and continuing training.
S208, judging whether training has remained converged over a preset number of consecutive rounds; if so, fixing the parameters in the motion prior learning module and the detection network module, thereby determining the motion prior learning module and the detection network module and, in turn, the video detection model; otherwise, returning to step S201, selecting the next training sample, and continuing training.
In this embodiment, when a training sample is used to train the video detection model for the first time, the detection loss is being calculated for the first time, so it cannot yet be judged whether the detection loss has converged; the next training sample is directly selected and training continues according to the above steps. For subsequent training steps, detection losses have already been calculated, so convergence can be judged by combining the previous detection results. To prevent small fluctuations from affecting the training result, in this embodiment the detection loss may be required to keep its minimum value over a preset number of consecutive rounds, such as 100, 80 or some other number of training rounds, without continuing to shrink toward 0; at that point the detection loss can be considered to have converged. The parameters in the motion prior learning module and the detection network module after the last adjustment are then taken as the trained parameters, which determines the motion prior learning module and the detection network module and, in turn, the video detection model.
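This convergence rule might be tracked as follows; the class name and the patience window of 100 rounds are assumed values for the sketch.

```python
class ConvergenceMonitor:
    """Declares convergence when the detection loss has not set a new
    minimum for `patience` consecutive training rounds (e.g. 80 or 100)."""
    def __init__(self, patience: int = 100):
        self.patience = patience
        self.best = float("inf")
        self.rounds_without_improvement = 0

    def update(self, loss: float) -> bool:
        if loss < self.best:
            self.best = loss
            self.rounds_without_improvement = 0
        else:
            self.rounds_without_improvement += 1
        return self.rounds_without_improvement >= self.patience
```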
The training of this embodiment converges end to end: whenever the detection loss has not converged, the motion prior learning module and the detection network module are adjusted together.
In the training of this embodiment, if enough training samples have been collected, the detection loss may converge within a single pass over the training samples; if not, two or more passes over the training samples may be performed until the detection loss converges. The complete loop is sketched below.
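Putting steps S201 to S208 together, a hedged end-to-end training loop might look like the following; it reuses the assumed classes from the earlier sketches, uses dummy tensors in place of real training samples, picks SGD as one possible gradient descent method, and for brevity conditions only on the two key frames.

```python
import torch

model = VideoDetectionModel()                    # both modules, from the sketch above
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # gradient descent on both modules
monitor = ConvergenceMonitor(patience=100)

for step in range(10_000):
    # dummy batch standing in for one collected training sample
    img_k1 = torch.randn(2, 3, 64, 64)           # first key frame images
    img_k2 = torch.randn(2, 3, 64, 64)           # second key frame images
    gt_boxes = torch.rand(2, 4)                  # marked regions in the non-key frame
    gt_labels = torch.randint(0, 10, (2,))       # marked labels of the training objects

    pred_boxes, pred_logits = model(img_k1, img_k2)
    loss = detection_loss(pred_boxes, pred_logits, gt_boxes, gt_labels)

    optimizer.zero_grad()
    loss.backward()       # end to end: gradients flow through both modules
    optimizer.step()

    if monitor.update(loss.item()):
        break             # detection loss has converged
```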
According to the training method of the video detection model of this embodiment, the motion prior learning module and the detection network module in the video detection model are trained with the above scheme, so that the motion prior learning module learns to perform feature fusion based on two key frames and thereby predict the features of any subsequent non-key frame, allowing each frame of a video to be detected more accurately. With the technical scheme of this embodiment, the trained video detection model can detect video more accurately, so the video detection precision can be effectively improved.
In addition, the video detection model of this embodiment adopts a motion prior learning module which, compared with the existing optical flow network module, has fewer parameters; this can further speed up the training of the video detection model and improve the accuracy of the trained model.
Moreover, the video detection model of this embodiment is trained end to end, yielding an end-to-end video detection model: the motion prior learning module and the detection network module it contains are trained together. In use, no module outputs a result independently; the whole video detection model outputs only a final result for a given input, that is, one problem is solved in one step. This end-to-end implementation introduces no accumulated error, and thus can effectively improve the accuracy of video detection.
Fig. 4 is a flowchart of an embodiment of a video detection method of the present invention. As shown in Fig. 4, the video detection method of this embodiment may specifically include the following steps:
S301, acquiring a video to be detected.
S302, acquiring, according to the video and a pre-trained video detection model, the region information of the objects detected in each frame of the video and the corresponding labels; the video detection model is trained end to end based on a detection network module and a motion prior learning module.
The video detection model used in this embodiment may specifically be a video detection model trained by the method of the foregoing embodiments.
The video to be detected in this embodiment may be a video to be detected in fields such as unmanned vehicles and security. The video detection model of this embodiment is trained end to end based on a detection network module and a motion prior learning module. The motion prior learning module can learn feature fusion between different frames, and can therefore accurately identify the image of the current frame based on the images of the key frames preceding it. In this way, even when an object moves so fast that its image in the video is not clear enough, the object can still be identified accurately.
In use, the video to be detected is directly input into the pre-trained video detection model; the motion prior learning module and the detection network module in the video detection model identify the objects in each frame of the video and output the region information and the corresponding label of each object in that frame. The region information of an object may be its bounding box (bounding-box), and the label of an object may be its unique identifier, which may specifically be represented by any one of numbers, letters, special symbols, Chinese characters and the like, or by a combination of at least two of them.
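A hedged sketch of this inference flow follows, reusing the assumed model above and pairing each frame with the two most recent key frames; the pairing policy is an assumption of the sketch.

```python
import torch

@torch.no_grad()
def detect_video(model, frames, interval: int = 5):
    """Runs detection over a list of frame tensors of shape (3, H, W)."""
    results, recent_keys = [], []
    for i, frame in enumerate(frames):
        if i % interval == 0:
            recent_keys = (recent_keys + [frame])[-2:]   # keep the last two key frames
        if len(recent_keys) == 2:
            boxes, logits = model(recent_keys[0].unsqueeze(0),
                                  recent_keys[1].unsqueeze(0))
            results.append((boxes.squeeze(0), int(logits.argmax(dim=1))))
        else:
            results.append(None)   # warm-up before two key frames are available
    return results
```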
According to the video detection method of this embodiment, by adopting a video detection model trained end to end based on a detection network module and a motion prior learning module, each frame of a video can be identified more accurately, and the video detection precision can be effectively improved.
Exemplary apparatus
Fig. 5 is a block diagram of a training apparatus embodiment of a video detection model according to the present invention. As shown in fig. 5, the training device for a video detection model of the present embodiment includes:
a determining module 11, configured to determine a preset relationship between key frames and non-key frames in a plurality of training videos;
an acquisition module 12, configured to acquire a plurality of training samples from the plurality of training videos based on the preset relationship between the key frames and the non-key frames; and
a training module 13, configured to train the video detection model according to the plurality of training samples.
The training apparatus for the video detection model of this embodiment uses the above modules to implement the training of the video detection model; its implementation principle and technical effect are the same as those of the related method embodiments, to whose description reference may be made for details, which are not repeated here.
Fig. 6 is a block diagram of a training apparatus embodiment of a video detection model provided by the present invention. As shown in fig. 6, the training device for a video detection model according to the present embodiment further describes the technical solution of the present invention in more detail on the basis of the technical solution of the embodiment shown in fig. 5.
As shown in fig. 6, the acquisition module 12 of the present embodiment includes:
an image obtaining unit 121, configured to extract, from the plurality of training videos and based on the preset relationship determined by the determining module 11, groups of two sequentially arranged key frame images and one non-key frame image located after the two key frame images; and
an object information obtaining unit 122, configured to obtain the region information of the target object marked in the non-key frame image and the label information corresponding to the target object.
Further alternatively, the training module 13 of the present embodiment specifically includes:
an input unit 131, configured to input, for each training sample, the image data of the two key frames and the image data of the non-key frame in that training sample into the motion prior learning module in the video detection model;
an obtaining unit 132, configured to obtain the feature information of the non-key frame produced by the motion prior learning module through feature fusion of the image data of the two key frames;
the input unit 131 being further configured to input the feature information of the non-key frame obtained through fusion into the detection network module in the video detection model;
the obtaining unit 132 being further configured to obtain the region information and label information of the training object in the image data of the non-key frame as predicted by the detection network module;
a calculating unit 133, configured to calculate the detection loss according to the predicted region information and label information of the training object and the marked region information and corresponding label information of the training object; and
an adjustment unit 134, configured to adjust the parameters of the video detection model based on the detection loss.
Further alternatively, the training module 13 of this embodiment further includes a judging unit 135 and a determining unit 136:
the judging unit 135 is configured to judge whether the detection loss calculated by the calculating unit 133 has converged;
the adjustment unit 134 adjusts the parameters in the motion prior learning module and the detection network module by a gradient descent method if the judging unit 135 determines that the detection loss has not converged;
the determining unit 136 is configured to, when it is determined that the detection loss has converged, fix the parameters in the motion prior learning module and the detection network module, thereby determining the motion prior learning module and the detection network module and, in turn, the video detection model.
Further alternatively, the training module 13 of the present embodiment further includes:
The initializing unit 137 is configured to randomly initialize parameters in the motion prior learning module and the detection network module.
Correspondingly, the processing of each unit in the training module 13 is performed after the initialization by the initializing unit 137.
The training apparatus for the video detection model of this embodiment uses the above modules to implement the training of the video detection model; its implementation principle and technical effect are the same as those of the related method embodiments, to whose description reference may be made for details, which are not repeated here.
Fig. 7 is a block diagram of an embodiment of a video detection apparatus according to the present invention. As shown in fig. 7, the video detection apparatus of this embodiment may specifically include:
an acquisition module 21, configured to acquire a video to be detected;
a detection module 22, configured to obtain, according to the video acquired by the acquisition module 21 and a pre-trained video detection model, the region information of the objects detected in each frame of the video and the corresponding labels; the video detection model is trained end to end based on a detection network module and a motion prior learning module.
The video detection apparatus of this embodiment uses the above modules with the same implementation principle and technical effect as the related method embodiments, to whose description reference may be made for details, which are not repeated here.
Exemplary electronic device
Fig. 8 illustrates a block diagram of an electronic device according to an embodiment of the application.
As shown in fig. 8, the electronic device 11 includes one or more processors 111 and a memory 112.
The processor 111 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 11 to perform desired functions.
Memory 112 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and may be executed by the processor 111 to implement the training method of the video detection model, the video detection method, and/or other desired functions of the various embodiments of the application described above. Various contents such as an input signal, a signal component, a noise component, and the like may also be stored in the computer-readable storage medium.
In one example, the electronic device 11 may further include: an input device 113 and an output device 114, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown).
For example, the input device 113 may be a camera, a microphone or a microphone array as described above, for capturing an image or the input signal of a sound source. When the electronic device is a stand-alone device, the input device 113 may be a communication network connector for receiving the acquired input signals from a neural network processor.
In addition, the input device 113 may also include, for example, a keyboard, a mouse, and the like.
The output device 114 may output various information to the outside, including the determined output voltage, output current information, and the like. The output device 114 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, etc.
Of course, only some of the components of the electronic device 11 relevant to the present application are shown in fig. 8 for simplicity, components such as buses, input/output interfaces, and the like being omitted. In addition, the electronic device 11 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods and apparatus described above, embodiments of the application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in the training method of the video detection model according to the various embodiments of the application described in the "exemplary methods" section of this specification.
The computer program product may write program code for performing operations of embodiments of the present application in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium, having stored thereon computer program instructions, which when executed by a processor, cause the processor to perform the steps in the training method of the video detection model according to the various embodiments of the present application described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present application have been described above in connection with specific embodiments, but it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not intended to be limiting, and these advantages, benefits, effects, etc. are not to be construed as necessarily possessed by the various embodiments of the application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not necessarily limited to practice with the above described specific details.
The block diagrams of the devices, apparatuses, equipment and systems referred to in the present application are only illustrative examples and are not intended to require or imply that connections, arrangements and configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, equipment and systems may be connected, arranged and configured in any manner. Words such as "including", "comprising", "having" and the like are open words that mean "including but not limited to" and are used interchangeably therewith. The term "or" as used herein refers to, and is used interchangeably with, the term "and/or", unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to".
It is also noted that in the apparatus, devices and methods of the present application, the components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered as equivalent aspects of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (8)

1. A method of training a video detection model, comprising:
determining a preset relationship between key frames and non-key frames in a plurality of training videos;
acquiring a plurality of training samples from the plurality of training videos based on the preset relationship between the key frames and the non-key frames; and
training the video detection model according to the plurality of training samples;
wherein training the video detection model according to the plurality of training samples comprises:
for each training sample, inputting the image data of the two key frames and the image data of the non-key frame in that training sample into a motion prior learning module in the video detection model;
acquiring the feature information of the non-key frame obtained by the motion prior learning module through feature fusion of the image data of the two key frames;
inputting the feature information of the non-key frame obtained through fusion into a detection network module in the video detection model;
acquiring the region information and label information of a training object in the image data of the non-key frame as predicted by the detection network module;
calculating a detection loss according to the predicted region information and label information of the training object and the marked region information and corresponding label information of the training object; and
adjusting parameters of the video detection model based on the detection loss.
2. The method of claim 1, wherein acquiring a plurality of training samples from the plurality of training videos based on the preset relationship between the key frames and the non-key frames comprises:
extracting, from the plurality of training videos and based on the preset relationship, a group of two sequentially arranged key frame images and one non-key frame image located after the two key frame images; and
acquiring the region information of the target object marked in the non-key frame image and the label information corresponding to the target object.
3. The method of claim 1, wherein adjusting parameters of the video detection model based on the detection loss comprises:
judging whether the detection loss has converged;
if the detection loss has not converged, adjusting the parameters in the motion prior learning module and the detection network module by a gradient descent method; and
when the detection loss is determined to have converged, fixing the parameters in the motion prior learning module and the detection network module, thereby determining the motion prior learning module and the detection network module and, in turn, the video detection model.
4. A method according to claim 3, wherein for each of the training samples, before inputting the image data of the two key frames and the image data of the non-key frames in the corresponding training samples into the motion prior learning module, the method further comprises:
randomly initializing the parameters in the motion prior learning module and the detection network module.
5. A training apparatus for a video detection model, comprising:
a determining module, used for determining a preset relationship between key frames and non-key frames in a plurality of training videos;
an acquisition module, used for acquiring a plurality of training samples from the plurality of training videos based on the preset relationship between the key frames and the non-key frames; and
a training module, used for training the video detection model according to the plurality of training samples;
wherein, training module includes:
an input unit, used for inputting, for each training sample, the image data of the two key frames and the image data of the non-key frame in that training sample into a motion prior learning module in the video detection model;
an acquisition unit, used for acquiring the feature information of the non-key frame obtained by the motion prior learning module through feature fusion of the image data of the two key frames;
the input unit being further used for inputting the feature information of the non-key frame obtained through fusion into a detection network module in the video detection model;
the acquisition unit being further used for acquiring the region information and label information of the training object in the image data of the non-key frame as predicted by the detection network module;
a calculating unit, used for calculating a detection loss according to the predicted region information and label information of the training object and the marked region information and corresponding label information of the training object; and
an adjusting unit, used for adjusting parameters of the video detection model based on the detection loss.
6. The apparatus of claim 5, wherein the acquisition module comprises:
an image acquisition unit, used for extracting, from the plurality of training videos and based on the preset relationship, a group of two sequentially arranged key frame images and one non-key frame image located after the two key frame images; and
an object information acquisition unit, used for acquiring the region information of the target object marked in the non-key frame image and the label information corresponding to the target object.
7. A computer readable storage medium storing a computer program for performing the training method of the video detection model according to any one of claims 1 to 4.
8. An electronic device, the electronic device comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the training method of the video detection model according to any one of claims 1 to 4.
CN201911256542.4A 2019-12-10 2019-12-10 Training method and device of video detection model, storage medium and electronic equipment Active CN112949352B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911256542.4A CN112949352B (en) 2019-12-10 2019-12-10 Training method and device of video detection model, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911256542.4A CN112949352B (en) 2019-12-10 2019-12-10 Training method and device of video detection model, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112949352A CN112949352A (en) 2021-06-11
CN112949352B true CN112949352B (en) 2024-05-24

Family

Family ID: 76225785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911256542.4A Active CN112949352B (en) 2019-12-10 2019-12-10 Training method and device of video detection model, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112949352B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609956B (en) * 2021-07-30 2024-05-17 北京百度网讯科技有限公司 Training method, recognition device, electronic equipment and storage medium
CN116778376B (en) * 2023-05-11 2024-03-22 中国科学院自动化研究所 Content security detection model training method, detection method and device


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101790097A (en) * 2010-03-05 2010-07-28 天津大学 Method for detecting multiple times of compression and coding of digital video
CN102196292A (en) * 2011-06-24 2011-09-21 清华大学 Human-computer-interaction-based video depth map sequence generation method and system
CN104159112A (en) * 2014-08-08 2014-11-19 哈尔滨工业大学深圳研究生院 Compressed sensing video transmission method and system based on dual sparse model decoding
CN107197297A (en) * 2017-06-14 2017-09-22 中国科学院信息工程研究所 A kind of video steganalysis method of the detection based on DCT coefficient steganography
CN110008793A (en) * 2018-01-05 2019-07-12 ***通信有限公司研究院 Face identification method, device and equipment
CN108289248A (en) * 2018-01-18 2018-07-17 福州瑞芯微电子股份有限公司 A kind of deep learning video encoding/decoding method and device based on content forecast
CN108615241A (en) * 2018-04-28 2018-10-02 四川大学 A kind of quick estimation method of human posture based on light stream
CN110443173A (en) * 2019-07-26 2019-11-12 华中科技大学 A kind of instance of video dividing method and system based on inter-frame relation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Compressed Sensing Video Sequence Reconstruction Algorithms Based on AR Models; Hu Pei; China Master's Theses Full-text Database, Information Science and Technology Series (No. 06); abstract, Chapter 4, first paragraph to last paragraph *

Also Published As

Publication number Publication date
CN112949352A (en) 2021-06-11

Similar Documents

Publication Publication Date Title
US10614310B2 (en) Behavior recognition
Paul et al. Robust visual tracking by segmentation
US11030755B2 (en) Multi-spatial scale analytics
Yao et al. Interactive object detection
US11321966B2 (en) Method and apparatus for human behavior recognition, and storage medium
US20210342639A1 (en) Product onboarding machine
CN108460427B (en) Classification model training method and device and classification method and device
US20160062456A1 (en) Method and apparatus for live user recognition
EP2672396A1 (en) Method for annotating images
CN109727275B (en) Object detection method, device, system and computer readable storage medium
CN112949352B (en) Training method and device of video detection model, storage medium and electronic equipment
KR102002812B1 (en) Image Analysis Method and Server Apparatus for Detecting Object
CN110751106B (en) Unmanned aerial vehicle target detection method and system
CN110998606A (en) Generating marker data for deep object tracking
CN111797733A (en) Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium based on image
Shah et al. Efficient portable camera based text to speech converter for blind person
KR20220075273A (en) Method of tracking multiple objects and apparatus for the same
CN110147724B (en) Method, apparatus, device, and medium for detecting text region in video
CN111144567A (en) Training method and device of neural network model
JP7302752B2 (en) Labeling training method and system for implementing it
CN117456204A (en) Target tracking method, device, video processing system, storage medium and terminal
CN109145991B (en) Image group generation method, image group generation device and electronic equipment
CN115527083B (en) Image annotation method and device and electronic equipment
US10540541B2 (en) Cognitive image detection and recognition
WO2022228325A1 (en) Behavior detection method, electronic device, and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant