CN117095337A - Target detection method and device, electronic equipment and storage medium - Google Patents

Target detection method and device, electronic equipment and storage medium

Info

Publication number
CN117095337A
Authority
CN
China
Prior art keywords
target
target detection
video stream
detection model
backbone network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311247047.3A
Other languages
Chinese (zh)
Inventor
杨烨飞
陶砚蕴
丁延超
陈银
郭信来
俄文娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Suzhou Automotive Research Institute of Tsinghua University
Original Assignee
Suzhou University
Suzhou Automotive Research Institute of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University and Suzhou Automotive Research Institute of Tsinghua University
Priority to CN202311247047.3A
Publication of CN117095337A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method and device, an electronic device, and a storage medium. The method comprises the following steps: acquiring a target video stream to be detected and a pre-constructed target detection model; inputting the target video stream into the target detection model, in which a first backbone network extracts large-size target features from the target video stream, a second backbone network extracts small-size target features from the target video stream, a lightweight feature pyramid network interpolates the small-size target features extracted by the second backbone network and concatenates them with the large-size target features extracted by the first backbone network to generate the final target features, and a detection result prediction module performs target detection based on the final target features; and obtaining the target detection result output by the target detection model. By improving the lightweight feature pyramid network in the model, the technical scheme improves target detection efficiency while maintaining accuracy.

Description

Target detection method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a target detection method, a target detection device, an electronic device, and a storage medium.
Background
With growing demand for video surveillance and intelligent video analysis, multi-stream target detection has become a research hotspot. In practical applications, multiple cameras or video streams usually need to be monitored simultaneously, and the targets in these video streams must be detected and identified. For multi-stream target detection, the prior art generally adopts algorithms based on multi-scale detection, algorithms based on feature fusion, and the like. However, these algorithms have limitations in multi-stream scenarios: because the data volume involved is large, traditional algorithms process it inefficiently and struggle to meet real-time requirements.
Disclosure of Invention
The invention provides a target detection method and device, an electronic device, and a storage medium, which improve target detection efficiency while maintaining accuracy by improving a lightweight feature pyramid network.
According to an aspect of the present invention, there is provided a target detection method, the method comprising:
acquiring a target video stream to be detected and a pre-constructed target detection model; the target detection model comprises a first backbone network, a second backbone network, a lightweight characteristic pyramid network and a detection result prediction module which are sequentially connected;
inputting the target video stream into the target detection model, wherein the first backbone network in the target detection model extracts large-size target features from the target video stream, the second backbone network extracts small-size target features from the target video stream, the lightweight feature pyramid network interpolates the small-size target features extracted by the second backbone network and concatenates them with the large-size target features extracted by the first backbone network to generate the final target features, and the detection result prediction module performs target detection based on the final target features;
and obtaining a target detection result output by the target detection model.
According to another aspect of the present invention, there is provided an object detection apparatus including:
the detection preparation module is used for acquiring a target video stream to be detected and a pre-constructed target detection model; the target detection model comprises a first backbone network, a second backbone network, a lightweight characteristic pyramid network and a detection result prediction module which are sequentially connected;
the target detection module is used for inputting the target video stream into the target detection model, wherein the first backbone network in the target detection model extracts large-size target features from the target video stream, the second backbone network extracts small-size target features from the target video stream, the lightweight feature pyramid network interpolates the small-size target features extracted by the second backbone network and concatenates them with the large-size target features extracted by the first backbone network to generate the final target features, and the detection result prediction module performs target detection based on the final target features;
and the result acquisition module is used for acquiring the target detection result output by the target detection model.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the object detection method according to any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to execute the object detection method according to any one of the embodiments of the present invention.
According to the technical scheme, a target video stream to be detected and a pre-constructed target detection model are acquired; the target detection model comprises a first backbone network, a second backbone network, a lightweight feature pyramid network, and a detection result prediction module connected in sequence. The target video stream is input into the target detection model, where the first backbone network extracts large-size target features from the target video stream, the second backbone network extracts small-size target features, the lightweight feature pyramid network interpolates the small-size target features extracted by the second backbone network and concatenates them with the large-size target features extracted by the first backbone network to generate the final target features, and the detection result prediction module performs target detection based on the final target features; the target detection result output by the target detection model is then obtained. By improving the lightweight feature pyramid network in the model, this technical scheme improves target detection efficiency while maintaining accuracy.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a target detection method according to a first embodiment of the present invention;
FIG. 2 is a flow chart of a target detection method according to a second embodiment of the present invention;
FIG. 3 is a block diagram of a target detection model according to a second embodiment of the present invention;
FIG. 4 is a detection flow chart of a target detection model according to a second embodiment of the present invention;
FIG. 5 is a graph comparing the effects of the Yolo-FastestV2 model and the target detection model according to the second embodiment of the present invention;
FIG. 6 is a diagram of the results of object detection of an object detection model according to a second embodiment of the present invention;
fig. 7 is a result view of 16 channels of video enabled for object detection according to the second embodiment of the present invention;
fig. 8 is a result view of 32 channels of video enabled for object detection according to the second embodiment of the present invention;
fig. 9 is a schematic structural diagram of an object detection device according to a third embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device implementing the target detection method according to the embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a target detection method according to an embodiment of the present invention, where the method may be performed by a target detection device, the target detection device may be implemented in hardware and/or software, and the target detection device may be configured in an electronic apparatus. As shown in fig. 1, the method includes:
s110, acquiring a target video stream to be detected and a pre-constructed target detection model; the target detection model comprises a first backbone network, a second backbone network, a lightweight characteristic pyramid network and a detection result prediction module which are sequentially connected.
The target video stream to be detected may include multiple video streams, such as 16 channels, 32 channels, and the like.
A backbone network is the feature-extraction part of a deep learning computer vision model, typically consisting of multiple convolution layers, used to extract high-level features from the input image. Common backbone networks include convolutional neural networks, residual networks, densely connected networks, and the like. The embodiment of the present invention does not limit the specific kind of backbone network; illustratively, the backbone network may be a ShuffleNetV2 convolutional neural network.
A lightweight feature pyramid network is a network architecture for image processing and computer vision tasks, designed to reduce the number of parameters and the computational complexity of the network while maintaining high performance. It should be noted that, in the embodiment of the present invention, when the target detection model is pre-built, the feature layer responsible for predicting small targets in the lightweight feature pyramid network is deleted, and the six groups of anchor boxes used to predict object positions are reduced to three groups.
In the embodiment of the present invention, before target detection, the preparation work must be completed: acquiring the target video stream to be detected and the pre-constructed target detection model. Specifically, the target video stream to be detected is acquired, together with a pre-built target detection model comprising a first backbone network, a second backbone network, a lightweight feature pyramid network, and a detection result prediction module connected in sequence.
The targets detected by the method provided by the embodiment of the present invention are mainly road vehicles, which are generally large in size. Therefore, by deleting the feature layer responsible for predicting small targets in the lightweight feature pyramid network, reducing the six groups of anchor boxes used to predict object positions to three groups, and predicting vehicle positions using only the feature layer responsible for large targets, the image processing speed can be improved while accuracy is maintained.
S120, inputting the target video stream into the target detection model, where the first backbone network in the target detection model extracts large-size target features from the target video stream, the second backbone network extracts small-size target features from the target video stream, the lightweight feature pyramid network interpolates the small-size target features extracted by the second backbone network and concatenates them with the large-size target features extracted by the first backbone network to generate the final target features, and the detection result prediction module performs target detection based on the final target features.
In general, there is a scale difference between the large-size target features extracted by the first backbone network and the small-size target features extracted by the second backbone network. The lightweight feature pyramid network therefore interpolates the small-size target features extracted by the second backbone network before concatenating them with the large-size target features extracted by the first backbone network, i.e., it performs multi-scale feature fusion to generate the final target features. This multi-scale fusion effectively combines the information of features extracted at different scales, enhancing the model's ability to perceive targets of different sizes.
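As an illustrative sketch only (Python, nearest-neighbour interpolation, and nested-list "feature maps" are assumptions here, not the patent's implementation), the interpolate-then-concatenate fusion step can be pictured like this:

```python
def upsample_nn(feat, scale=2):
    """Nearest-neighbour upsampling of an H x W x C feature map stored as nested lists."""
    out = []
    for row in feat:
        up_row = []
        for px in row:
            up_row.extend([px] * scale)   # repeat each pixel horizontally
        out.extend([up_row] * scale)      # repeat each row vertically
    return out

def concat_channels(a, b):
    """Channel-wise concatenation of two feature maps with equal spatial size."""
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]

# A 1x1 map with 2 channels is interpolated to 2x2, then concatenated with a
# 2x2 single-channel map, mirroring the small-to-large fusion described above.
small = [[[1, 2]]]
large = [[[9], [8]], [[7], [6]]]
fused = concat_channels(large, upsample_nn(small))
print(fused[0][0])  # [9, 1, 2]
```

In a real detector the interpolation and concatenation would operate on deep feature tensors, but the shape bookkeeping is the same: upsample the small map to the large map's spatial size, then stack along the channel axis.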
In the embodiment of the invention, in order to further improve the efficiency of target detection, when the target video stream is input into the target detection model, the target video stream needs to be subjected to time redundancy compression.
Specifically, inputting the target video stream into the target detection model includes: determining at least one set of homogeneous video frames in a target video stream; a target video frame is randomly selected from each group of homogeneous video frames, and is input into a target detection model.
Homogeneous video frames are multiple consecutive frames in a video sequence with identical or similar content; transmitting and storing all of them leads to redundant data and wastes bandwidth and storage resources.
In the embodiment of the present invention, randomly selecting one target video frame from each group of homogeneous video frames and inputting it into the target detection model reduces data redundancy, avoids unnecessary detection of homogeneous frames, and further improves target detection efficiency. Before this, the homogeneous video frames in the target video stream must be determined.
Specifically, determining at least one set of homogeneous video frames in a target video stream includes: determining the similarity of video frames contained in a target video stream, and taking the video frames with the similarity larger than a preset similarity threshold as homogeneous video frames; or, using N adjacent video frames in the target video stream as homogeneous video frames; wherein N is an integer greater than or equal to 2; or taking the video frames in the preset time range in the target video stream as homogeneous video frames.
In general, homogeneous video frames may be detected and screened by comparing content differences between frames. A common method is to compute the similarity of the video frames contained in the target video stream and treat frames whose similarity exceeds a preset threshold as homogeneous. It will be appreciated that homogeneous video frames generally occur consecutively; therefore, N adjacent frames in the target video stream may be treated as homogeneous, where N is an integer greater than or equal to 2. Alternatively, the video frames within a preset time range of the target video stream may be treated as homogeneous.
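The similarity-threshold grouping described above can be sketched as follows; the mean-absolute-difference similarity measure, the 0.95 threshold, and the flat-list frame representation are illustrative assumptions, not values from the patent:

```python
import random

def frame_similarity(a, b):
    """Similarity in [0, 1] from the mean absolute difference of two equal-length 8-bit pixel lists."""
    mad = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 - mad / 255.0

def group_homogeneous(frames, threshold=0.95):
    """Group consecutive frames whose similarity to the first frame of the current group exceeds threshold."""
    groups = []
    for f in frames:
        if groups and frame_similarity(groups[-1][0], f) > threshold:
            groups[-1].append(f)   # same content: extend the current homogeneous group
        else:
            groups.append([f])     # content changed: start a new group
    return groups

frames = [[10, 10], [10, 12], [200, 200], [201, 199]]
groups = group_homogeneous(frames)
picked = [random.choice(g) for g in groups]  # one randomly chosen target frame per group
print([len(g) for g in groups])  # [2, 2]
```

Only the `picked` frames would then be sent to the target detection model, realizing the temporal redundancy compression described above.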
S130, obtaining a target detection result output by the target detection model.
In the embodiment of the invention, after the detection result prediction module performs target detection based on the final target feature obtained by multi-scale feature fusion, the target detection result output by the target detection model can be obtained.
According to the technical scheme, a target video stream to be detected and a pre-constructed target detection model are obtained; the target detection model comprises a first backbone network, a second backbone network, a lightweight characteristic pyramid network and a detection result prediction module which are sequentially connected; inputting a target video stream into a target detection model, wherein a first backbone network in the target detection model is used for extracting large-size target features in the target video stream, a second backbone network is used for extracting small-size target features in the target video stream, a lightweight feature pyramid network is used for interpolating the small-size target features extracted by the second backbone network and then splicing the small-size target features extracted by the first backbone network with the large-size target features extracted by the first backbone network to generate final target features, and a detection result prediction module is used for carrying out target detection based on the final target features; and obtaining a target detection result output by the target detection model. According to the technical scheme, the lightweight characteristic pyramid network in the model is improved, so that the target detection efficiency is improved on the basis of ensuring the accuracy.
Example two
Fig. 2 is a flowchart of a target detection method according to a second embodiment of the present invention; this embodiment is optimized on the basis of the above embodiment, and for details not described here, refer to the above embodiment. As shown in fig. 2, the method includes:
s210, acquiring at least two sample video streams.
The sample video streams include video data collected by surveillance cameras, video data collected from a pedestrian's viewpoint, and public dataset data.
S220, performing target annotation and target anti-annotation on video frames in at least two sample video streams, and generating a training sample set.
Target annotation refers to marking targets of interest in a video frame, while target anti-annotation refers to marking targets in a video frame that should be ignored or masked. In the embodiment of the present invention, when performing target anti-annotation on the video frames in the at least two sample video streams, the targets to be anti-annotated must be marked with an anti-annotation label.
In the embodiment of the invention, the target annotation and the target anti-annotation can be carried out on the video frames in at least two sample video streams through the annotation software, so as to generate a training sample set. The embodiment of the invention does not limit the labeling software, and the labeling software can be labelImg by way of example.
Optionally, before the target annotation and target anti-annotation are performed on the video frames in the at least two sample video streams, the method further comprises: framing the at least two sample video streams. Illustratively, one frame is extracted every five frames from the at least two sample video streams.
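A minimal sketch of this every-fifth-frame sampling (Python is assumed here for illustration; the patent does not specify an implementation):

```python
def sample_frames(frames, step=5):
    """Keep one frame out of every `step` consecutive frames for annotation."""
    return frames[::step]

print(sample_frames(list(range(12))))  # [0, 5, 10]
```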
S230, training a machine learning model constructed in advance based on the training sample set to generate a target detection model.
In the embodiment of the present invention, based on the training sample set, a pre-constructed machine learning model is used to deep-learn the target features, and training is performed in combination with counterexample recognition using the anti-annotation labels, yielding the trained target detection model.
Illustratively, fig. 3 shows a structure diagram of the target detection model. As shown in fig. 3, the specific processing procedure of the target detection model on a picture is as follows:
1. The input picture is scaled to 352×352×3 pixels.
2. In the first stage, the picture sequentially undergoes convolution, a BatchNorm operation, a ReLU activation function, and max pooling, and the tensor channels are expanded to 24.
3. In the second to fourth stages, a large-size feature tensor of size 22×22×96 and a small-size feature tensor of size 11×11×192 are generated through three stages of the first or second backbone network, formed by stacking 4, 8, and 4 backbone-network basic building units respectively.
4. Through the lightweight feature pyramid network, the 11×11 features obtained in the fourth stage are interpolated and concatenated onto the 22×22 features of the third stage to fuse deep features.
5. The interpolated and concatenated feature tensor is then processed by convolution, a BatchNorm operation, and a ReLU activation function before being output.
6. The 11×11×192 output channels are deleted.
7. The output feature tensor is processed by a Dwconvblocks module, where Dwconvblocks is Conv+Bn+Relu+Conv+Bn+Relu+Bn+Conv. In this way, the number of channels can be adjusted so that the result matches the prediction types.
8. The feature tensor processed by the Dwconvblocks module is input into the detection result prediction module to realize target localization and classification.
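As a rough consistency check of the stage sizes above, the per-stage downsampling factors in the sketch below are assumptions (stage one: stride-2 convolution plus stride-2 max pooling, i.e. /4; stages two to four: /2 each) chosen to match the stated 22×22 and 11×11 outputs for a 352×352 input:

```python
def stage_sizes(input_size=352, stage_strides=(4, 2, 2, 2)):
    """Spatial size of the feature map after each stage, given per-stage downsampling factors."""
    sizes = []
    size = input_size
    for stride in stage_strides:
        size //= stride
        sizes.append(size)
    return sizes

print(stage_sizes())  # [88, 44, 22, 11] -- stage three gives 22x22, stage four gives 11x11
```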
S240, starting at least two first threads, wherein each first thread in the at least two first threads respectively acquires one target video stream to be detected, and storing video frames in each video stream into a corresponding picture queue.
The first thread is a thread for reading video frames, and the number of the first thread corresponds to the number of video streams in the target video stream, so that real-time detection of multiple paths of video streams is realized by utilizing a multithreading technology.
Specifically, for each first thread, a video object is created, and video frames of the target video stream corresponding to the thread are continuously read in a loop and placed at the tail of a picture queue. When the queue is full (e.g., holds 4 frames), the video frame at the head of the picture queue is removed.
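The drop-oldest picture queue described above can be sketched with Python's standard library; the class name and the maxlen of 4 are illustrative assumptions matching the example in the text:

```python
import threading
from collections import deque

class PictureQueue:
    """Bounded frame buffer: appending to a full queue drops the frame at the head."""
    def __init__(self, maxlen=4):
        self._buf = deque(maxlen=maxlen)  # deque with maxlen discards from the head when full
        self._lock = threading.Lock()

    def put(self, frame):
        with self._lock:
            self._buf.append(frame)

    def get(self):
        with self._lock:
            return self._buf.popleft() if self._buf else None

q = PictureQueue(maxlen=4)
for i in range(6):   # a frame-reading thread would do this in a loop per video stream
    q.put(i)
print(q.get())  # 2 -- frames 0 and 1 were dropped when the queue overflowed
```

The lock stands in for the read/write mutual exclusion between threads described below; `deque(maxlen=...)` gives the drop-oldest behaviour for free.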
Fig. 4 illustrates the detection flow of the target detection model. As shown in fig. 4, each frame-reading thread corresponds to one target video stream to be detected, receives the input of that stream, and, after reading the video frames of the stream, stores them in the corresponding picture queue.
S250, starting at least two second threads, wherein each of the at least two second threads respectively reads video frames from the corresponding picture queues, and inputs the video frames into the target detection model; the first thread and the second thread are mutually exclusive threads.
The second thread is a thread for detecting video frames. It should be noted that the thread for detecting the video frame and the thread for reading the video frame are mutually exclusive threads, that is, the two threads cannot work simultaneously, so as to solve the problem of read-write conflict and ensure the normal and stable operation of the threads.
Specifically, according to the target detection flow, the frame-detection thread can start reading video frames from the corresponding picture queue and inputting them into the target detection model only after the frame-reading thread has finished operating on that picture queue.
Illustratively, as shown in fig. 4, the frame-detection thread and the frame-reading thread are mutually exclusive, and they communicate as follows: after the frame-reading thread stores video frames into the picture queue, the frame-detection thread reads video frames from the corresponding picture queue, calls the target detection model, and inputs the frames read from the queue into it. The method removes the encapsulation of model-initialization loading and instead initializes the model in the main function, reducing the number of model initializations, while the model is called in the sub-threads.
And S260, respectively storing target detection results, which are output by the target detection model and are aimed at corresponding video frames, into a detection result queue by at least two second threads.
Specifically, for each second thread, a thread object is created; the thread object places the target detection result output by the target detection model for the corresponding video frame at the tail of the detection result queue.
S270, starting at least two third threads, wherein each third thread in the at least two third threads reads a target detection result from its corresponding detection result queue and displays the read target detection result; the second threads and the third threads are mutually exclusive threads.
The third thread is a display-result thread. It should be noted that the detect-video-frame thread and the display-result thread are mutually exclusive, that is, the two threads cannot operate at the same time; this resolves read-write conflicts and ensures that the threads run normally and stably.
Specifically, for each third thread, a thread object is created that reads a result from the head of the detection result queue and displays the detection result.
Illustratively, as shown in fig. 4, the detect-video-frame thread and the display-result thread are mutually exclusive, and they communicate as follows: after the detect-video-frame thread stores a target detection result into the detection result queue, the display-result thread reads the target detection result from the corresponding detection result queue and displays the read result.
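The read → detect → display handoff for one video stream can be sketched with two thread-safe queues. This is a sketch assuming Python's `queue.Queue`, whose internal lock serializes producer and consumer access (the mutual exclusion required between adjacent stages); the `model` lambda is a hypothetical stand-in for real detection.

```python
import queue
import threading

# End-to-end sketch of one stream's pipeline: read -> detect -> display.
# queue.Queue serializes access internally, so a producer and its consumer
# never touch the shared queue at the same time.
picture_q, result_q = queue.Queue(), queue.Queue()
displayed = []

def read_stage(frames):
    for f in frames:
        picture_q.put(f)         # store frame at the tail of the picture queue
    picture_q.put(None)          # sentinel: end of stream

def detect_stage(model):
    while (f := picture_q.get()) is not None:
        result_q.put(model(f))   # result to the tail of the result queue
    result_q.put(None)

def display_stage():
    while (r := result_q.get()) is not None:
        displayed.append(r)      # "display" the result read from the head

model = lambda f: f"boxes[{f}]"  # stand-in for the target detection model
stages = [threading.Thread(target=read_stage, args=(["a", "b"],)),
          threading.Thread(target=detect_stage, args=(model,)),
          threading.Thread(target=display_stage)]
for t in stages: t.start()
for t in stages: t.join()
print(displayed)   # -> ['boxes[a]', 'boxes[b]']
```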
By way of example, fig. 5 compares the effect of the Yolo-FastestV2 model with that of the target detection model: the target detection model's recognition rate is 6.42% lower than that of the Yolo-FastestV2 model, but its processing speed for a single picture is 15.92% faster.
Illustratively, fig. 6 shows a target detection result diagram of the target detection model; as shown in fig. 6, the detection accuracy of the target detection model meets the detection requirement.
For example, fig. 7 shows the target detection results when 16 video streams are opened for target detection with the target detection method. The graphics card used in this experiment is a GTX1650 (4 GB video memory) and the processor an Intel Core i5 (9th generation). Table 1 below shows the frame rate of each video stream, each of which reaches a real-time detection rate.
Table 1 Frame rates of the sixteen video streams
For example, fig. 8 shows the target detection results when 32 video streams are opened for target detection with the target detection method. The graphics card used in this experiment is a GTX4080 (16 GB video memory) and the processor an Intel Core i7 (13th generation). Table 2 below shows the frame rate of each video stream, each of which reaches a real-time detection rate.
Table 2 Frame rates of the thirty-two video streams
According to the technical scheme of this embodiment, at least two sample video streams are acquired; target annotation and target anti-annotation are performed on video frames in the at least two sample video streams to generate a training sample set; a pre-constructed machine learning model is trained on the training sample set to generate the target detection model; at least two first threads are started, each of which acquires one target video stream to be detected and stores the video frames of its stream into a corresponding picture queue; at least two second threads are started, each of which reads video frames from its corresponding picture queue and inputs them into the target detection model, the first and second threads being mutually exclusive; the at least two second threads respectively store the target detection results output by the target detection model for the corresponding video frames into detection result queues; at least two third threads are started, each of which reads a target detection result from its corresponding detection result queue and displays it, the second and third threads being mutually exclusive. According to the technical scheme provided by this embodiment of the invention, the lightweight feature pyramid network in the model is improved, so that target detection efficiency is raised while accuracy is preserved. Meanwhile, the use of mutually exclusive threads resolves read-write conflicts and ensures that the threads run normally and stably.
Example III
Fig. 9 is a schematic structural diagram of a target detection device according to a third embodiment of the present invention. As shown in fig. 9, the apparatus includes:
the detection preparation module 310 is configured to obtain a target video stream to be detected and a pre-constructed target detection model; the target detection model comprises a first backbone network, a second backbone network, a lightweight characteristic pyramid network and a detection result prediction module which are sequentially connected;
the target detection module 320 is configured to input the target video stream into the target detection model, where a first backbone network in the target detection model is configured to extract large-size target features in the target video stream, a second backbone network is configured to extract small-size target features in the target video stream, the lightweight feature pyramid network is configured to interpolate the small-size target features extracted by the second backbone network and splice them with the large-size target features extracted by the first backbone network to generate final target features, and the detection result prediction module is configured to perform target detection based on the final target features;
and the result obtaining module 330 is configured to obtain a target detection result output by the target detection model.
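The interpolate-then-splice fusion performed by the lightweight feature pyramid network can be illustrated with plain lists: the small-size (low-resolution) features are upsampled by nearest-neighbor interpolation to the resolution of the large-size features and then concatenated channel-wise. The shapes and values below are illustrative assumptions; a real implementation would operate on framework tensors.

```python
def upsample_nearest(feature_map, factor):
    # Nearest-neighbor interpolation of a 2D feature map (list of rows):
    # every cell is repeated `factor` times in both dimensions.
    out = []
    for row in feature_map:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def fuse(large_feats, small_feats, factor):
    # Interpolate the small-size (low-resolution, deep) features up to the
    # resolution of the large-size features, then concatenate along the
    # channel dimension -- the splice step of the lightweight FPN above.
    upsampled = [upsample_nearest(ch, factor) for ch in small_feats]
    return large_feats + upsampled     # channel-wise concatenation

# Illustrative channels: one 4x4 "large-size" channel from the first
# backbone, one 2x2 "small-size" channel from the second backbone.
large = [[[1] * 4 for _ in range(4)]]
small = [[[2, 3], [4, 5]]]
fused = fuse(large, small, factor=2)
print(len(fused), fused[1][0])   # -> 2 [2, 2, 3, 3]
```

After fusion, both channels share the 4x4 resolution of the large-size features, so a prediction head can consume them together.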
Optionally, the apparatus further includes:
the sample video stream acquisition module is used for acquiring at least two sample video streams;
the training sample set generation module is used for carrying out target annotation and target anti-annotation on video frames in the at least two sample video streams to generate a training sample set;
the target detection model generation module is used for training a machine learning model built in advance based on the training sample set to generate a target detection model.
Optionally, the detection preparation module 310 includes:
a first thread starting unit for starting at least two first threads;
the target video stream obtaining unit is used for each first thread in the at least two first threads to obtain a path of target video stream to be detected respectively, and storing video frames in each path of video stream into a corresponding picture queue.
Optionally, the target detection module 320 includes:
a second thread starting unit for starting at least two second threads;
the first target detection unit is used for reading video frames from the corresponding picture queues by each of the at least two second threads and inputting the video frames into a target detection model; wherein the first thread and the second thread are mutually exclusive threads.
Optionally, the result obtaining module 330 includes:
and the target detection result storage unit is used for storing target detection results, which are output by the target detection model and are aimed at corresponding video frames, into a detection result queue by the at least two second threads respectively.
Optionally, the result obtaining module 330 further includes:
a third thread starting unit for starting at least two third threads;
the target detection result display unit is used for reading target detection results from the corresponding detection result queues by each of the at least two third threads and displaying the read target detection results; wherein the second thread and the third thread are mutually exclusive threads.
Optionally, the target detection module 320 includes:
a homogeneous video frame determination unit for determining at least one set of homogeneous video frames in the target video stream;
and the second target detection unit is used for randomly selecting one target video frame from each group of homogeneous video frames and inputting the target video frames into the target detection model.
Optionally, the homogeneous video frame determining unit is specifically configured to:
determining the similarity of video frames contained in the target video stream, and taking the video frames with the similarity larger than a preset similarity threshold as homogeneous video frames; or,
taking N adjacent video frames in the target video stream as homogeneous video frames; wherein N is an integer greater than or equal to 2; or,
and taking the video frames in the preset time range in the target video stream as homogeneous video frames.
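Two of the grouping strategies above, plus the random selection of one target frame per group, can be sketched as follows. The similarity function and the numeric "frames" are hypothetical stand-ins; only the grouping and selection logic mirrors the scheme above (the preset-time-range strategy behaves like adjacency grouping applied to timestamps).

```python
import random

def group_by_adjacency(frames, n):
    # Strategy: every N adjacent frames form one homogeneous group.
    return [frames[i:i + n] for i in range(0, len(frames), n)]

def group_by_similarity(frames, similarity, threshold):
    # Strategy: consecutive frames whose pairwise similarity exceeds the
    # preset threshold are placed in the same homogeneous group.
    groups = [[frames[0]]]
    for prev, cur in zip(frames, frames[1:]):
        if similarity(prev, cur) > threshold:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    return groups

def select_targets(groups, rng=random):
    # One target frame is selected at random from each homogeneous group;
    # only those frames are fed to the target detection model.
    return [rng.choice(g) for g in groups]

frames = [10, 11, 12, 50, 51]                    # toy "frames" (intensities)
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))      # illustrative similarity
groups = group_by_similarity(frames, sim, threshold=0.4)
print(groups)            # -> [[10, 11, 12], [50, 51]]
```

Skipping near-duplicate frames this way reduces the number of model calls per stream without losing distinct scene content.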
The object detection device provided by the embodiment of the invention can execute the object detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example IV
Fig. 10 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 10, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the target detection method.
In some embodiments, the object detection method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the above-described object detection method may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the target detection method in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability found in traditional physical hosts and virtual private server (VPS) services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of detecting an object, comprising:
acquiring a target video stream to be detected and a pre-constructed target detection model; the target detection model comprises a first backbone network, a second backbone network, a lightweight characteristic pyramid network and a detection result prediction module which are sequentially connected;
inputting the target video stream into the target detection model, wherein a first backbone network in the target detection model is used for extracting large-size target features in the target video stream, a second backbone network is used for extracting small-size target features in the target video stream, the lightweight feature pyramid network is used for splicing the small-size target features extracted by the second backbone network with the large-size target features extracted by the first backbone network after interpolation to generate final target features, and the detection result prediction module is used for carrying out target detection based on the final target features;
and obtaining a target detection result output by the target detection model.
2. The method of claim 1, further comprising, prior to acquiring the target video stream to be detected and the pre-constructed target detection model:
acquiring at least two sample video streams;
performing target annotation and target anti-annotation on video frames in the at least two sample video streams to generate a training sample set;
training a machine learning model constructed in advance based on the training sample set to generate a target detection model.
3. The method of claim 1, wherein obtaining the target video stream to be detected comprises:
starting at least two first threads;
each first thread in the at least two first threads respectively acquires one path of target video stream to be detected, and video frames in each path of video stream are stored in corresponding picture queues.
4. A method according to claim 3, wherein inputting the target video stream into the target detection model comprises:
starting at least two second threads;
each second thread in the at least two second threads reads video frames from the corresponding picture queues respectively and inputs the video frames into a target detection model; wherein the first thread and the second thread are mutually exclusive threads;
the method for obtaining the target detection result output by the target detection model comprises the following steps:
and the at least two second threads store target detection results which are output by the target detection model and are aimed at the corresponding video frames into a detection result queue respectively.
5. The method of claim 4, further comprising, after the at least two second threads store the target detection results for the corresponding video frames output by the target detection model into a detection result queue, respectively:
starting at least two third threads;
each third thread in the at least two third threads respectively reads a target detection result from a corresponding detection result queue and displays the read target detection result; wherein the second thread and the third thread are mutually exclusive threads.
6. The method of claim 1, wherein inputting the target video stream into the target detection model comprises:
determining at least one set of homogeneous video frames in the target video stream;
randomly selecting a target video frame from each group of the homogeneous video frames, and inputting the target video frames into the target detection model.
7. The method of claim 6, wherein determining at least one set of homogeneous video frames in the target video stream comprises:
determining the similarity of video frames contained in the target video stream, and taking the video frames with the similarity larger than a preset similarity threshold as homogeneous video frames; or,
taking N adjacent video frames in the target video stream as homogeneous video frames; wherein N is an integer greater than or equal to 2; or,
and taking the video frames in the preset time range in the target video stream as homogeneous video frames.
8. An object detection apparatus, comprising:
the detection preparation module is used for acquiring a target video stream to be detected and a pre-constructed target detection model; the target detection model comprises a first backbone network, a second backbone network, a lightweight characteristic pyramid network and a detection result prediction module which are sequentially connected;
the target detection module is used for inputting the target video stream into the target detection model, wherein a first backbone network in the target detection model is used for extracting large-size target features in the target video stream, a second backbone network is used for extracting small-size target features in the target video stream, the lightweight feature pyramid network is used for interpolating the small-size target features extracted by the second backbone network and then splicing the small-size target features extracted by the first backbone network to generate final target features, and the detection result prediction module is used for carrying out target detection based on the final target features;
and the result acquisition module is used for acquiring the target detection result output by the target detection model.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the object detection method of any one of claims 1-7.
10. A computer readable storage medium storing computer instructions for causing a processor to perform the object detection method according to any one of claims 1-7.
CN202311247047.3A 2023-09-26 2023-09-26 Target detection method and device, electronic equipment and storage medium Pending CN117095337A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311247047.3A CN117095337A (en) 2023-09-26 2023-09-26 Target detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117095337A true CN117095337A (en) 2023-11-21

Family

ID=88773559

Similar Documents

Publication Publication Date Title
JP7535374B2 (en) On-device classification of fingertip motion patterns into real-time gestures
CN110610510B (en) Target tracking method and device, electronic equipment and storage medium
CN113255694B (en) Training image feature extraction model and method and device for extracting image features
Nguyen et al. Yolo based real-time human detection for smart video surveillance at the edge
US11900676B2 (en) Method and apparatus for detecting target in video, computing device, and storage medium
WO2023174098A1 (en) Real-time gesture detection method and apparatus
CN112154476A (en) System and method for rapid object detection
CN113436100B (en) Method, apparatus, device, medium, and article for repairing video
WO2023040146A1 (en) Behavior recognition method and apparatus based on image fusion, and electronic device and medium
CN113496208B (en) Video scene classification method and device, storage medium and terminal
WO2019084712A1 (en) Image processing method and apparatus, and terminal
CN109977875A (en) Gesture identification method and equipment based on deep learning
CN110008922B (en) Image processing method, device, apparatus, and medium for terminal device
CN108777779A (en) A kind of intelligent device, method, medium and the electronic equipment of video capture equipment
CN112487911A (en) Real-time pedestrian detection method and device based on improved yolov3 in intelligent monitoring environment
CN116863376A (en) Edge computing method and system for detecting abnormal event of elevator car passenger
CN117095337A (en) Target detection method and device, electronic equipment and storage medium
Kim et al. Low-complexity online model selection with Lyapunov control for reward maximization in stabilized real-time deep learning platforms
CN116012756A (en) Behavior action detection method, device, equipment and storage medium
CN113627354B (en) A model training and video processing method, which comprises the following steps, apparatus, device, and storage medium
CN115830362A (en) Image processing method, apparatus, device, medium, and product
CN116137671A (en) Cover generation method, device, equipment and medium
CN113610021A (en) Video classification method and device, electronic equipment and computer-readable storage medium
CN115147434A (en) Image processing method, device, terminal equipment and computer readable storage medium
CN113221920B (en) Image recognition method, apparatus, device, storage medium, and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination