WO2021022643A1 - A video target detection and tracking method and device

A video target detection and tracking method and device

Info

Publication number
WO2021022643A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
detection
video frame
image
frame image
Application number
PCT/CN2019/108080
Other languages
English (en)
French (fr)
Inventor
江浩
李亚
费晓天
任少卿
朱望江
董维山
Original Assignee
初速度(苏州)科技有限公司
Application filed by 初速度(苏州)科技有限公司
Publication of WO2021022643A1


Classifications

    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251: Analysis of motion using feature-based methods involving models
    • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75: Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06T2207/20081: Training; Learning

Definitions

  • The present invention relates to the technical field of computer vision, and in particular to a method and device for detecting and tracking targets in video.
  • Detecting and tracking targets in videos collected by capture devices is a core task of computer vision. For example, in order to drive autonomously, a self-driving car needs to know the driving environment around it; it therefore performs target detection and tracking on its surrounding environment through the collection equipment of the self-vehicle.
  • Current target detection methods only perform target detection on the targets in a single frame image of the video and do not consider the relationship between consecutive frames, so their detection accuracy is low. Current target tracking methods only track the targets appearing in the first frame of the video; when a new target appears in the video, the new target cannot be tracked. Therefore, there is an urgent need for a video target detection and tracking method that has high detection accuracy and can track newly appearing targets.
  • The present invention provides a video target detection and tracking method and device, so as to improve the detection accuracy of target detection and to track newly appearing targets. The specific technical solution is as follows.
  • An embodiment of the present invention provides a video target detection and tracking method, the method including:
  • when the position and category of a detected target are obtained and no detected target exists in the previous video frame image of the current video frame image, taking the detected targets of the current video frame image as first detected targets; for each first detected target, determining the rectangular image area corresponding to the first detected target in the current video frame image based on the position of the first detected target, scaling the width and height of the rectangular image area to the width and height of the input image of a pre-established local target detection model, inputting the scaled rectangular image area into the local target detection model to obtain the position and category of a second detected target, and returning to the step of detecting whether a current video frame image of the surrounding environment collected by the collection device in real time is received;
  • when no detected target is obtained and detected targets exist in the previous video frame image of the current video frame image, taking the detected targets existing in the previous video frame image as third detected targets; for each third detected target, determining the rectangular image area corresponding to the third detected target in the current video frame image, scaling the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, inputting the scaled rectangular image area into the local target detection model to obtain the position and category of a fourth detected target, establishing a correspondence between the fourth detected target and the third detected target, and returning to the step of detecting whether a current video frame image of the surrounding environment collected by the collection device in real time is received;
  • when the position and category of a detected target are obtained and detected targets exist in the previous video frame image of the current video frame image, taking the detected targets of the current video frame image and the detected targets existing in the previous video frame image as fifth detected targets; for each fifth detected target, determining the rectangular image area corresponding to the fifth detected target in the video frame image where it is located, scaling the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, and inputting the scaled rectangular image area into the local target detection model to obtain the position and category of a sixth detected target.
  • An embodiment of the present invention further provides a video target detection and tracking device, which includes:
  • a detection module, used to detect whether a current video frame image of the surrounding environment collected by the collection device in real time is received;
  • a judging module, used to judge, if the current video frame image is received, whether the frame number interval between the current video frame image and the video frame image of the last full-image target detection is a preset interval, and if so, to trigger the full-image target detection module;
  • a full-image target detection module, configured to perform full-image target detection on the current video frame image according to a pre-established full-image target detection model;
  • a first detection result module, used to: when the position and category of a detected target are obtained and no detected target exists in the previous video frame image of the current video frame image, take the detected targets of the current video frame image as first detected targets; for each first detected target, determine the rectangular image area corresponding to the first detected target in the current video frame image based on the position of the first detected target, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a second detected target, and trigger the detection module;
  • a second detection result module, used to: when no detected target is obtained and detected targets exist in the previous video frame image of the current video frame image, take the detected targets existing in the previous video frame image as third detected targets; for each third detected target, determine the rectangular image area corresponding to the third detected target in the current video frame image, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a fourth detected target, establish a correspondence between the fourth detected target and the third detected target, and trigger the detection module;
  • a third detection result module, used to: when the position and category of a detected target are obtained and detected targets exist in the previous video frame image of the current video frame image, take the detected targets of the current video frame image and the detected targets existing in the previous video frame image as fifth detected targets; for each fifth detected target, determine the rectangular image area corresponding to the fifth detected target in the video frame image where it is located, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a sixth detected target, perform target matching on the multiple sixth detected targets to obtain the targets that are successfully matched and the targets that are not successfully matched between the current video frame image and the previous video frame image, and trigger the detection module.
  • When the current video frame image undergoes full-image target detection, this embodiment can combine the detection result of the previous video frame image with the detection result of the current video frame image, and through the full-image/local alternating detection method, local target detection is continued after full-image target detection. This takes into account the relationship between consecutive video frame images and improves the detection accuracy of target detection. Moreover, the embodiment of the present invention performs target detection on each video frame image based on the full-image target detection model and the local target detection model, so that the targets existing in each video frame image can be detected; therefore, newly appearing targets in the video frames can be detected. At the same time, the correspondence of the same target between the previous video frame image and the current video frame image, as well as the matched targets between the two images, can be obtained. Thus, newly appearing targets can be tracked, rather than only the targets appearing in the first video frame image of the video.
  • Full-image target detection is not performed on every video frame; instead, full-image target detection is performed once every preset number of frames, and local target detection is performed on the other video frames. Since the amount of computation for local target detection is much smaller than that for full-image target detection, performing full-image target detection once per preset number of frames significantly reduces the amount of computation.
  • In this way, a full-image target detection model that associates the first sample images with the positions and categories of the targets in the detection frames can be obtained, and full-image target detection can be performed on video frame images through the full-image target detection model in order to obtain the positions and categories of the targets in the video frame images.
  • The width and height of the rectangular image area corresponding to the first detected target in the current video frame image are scaled to the width and height of the input image of the pre-established local target detection model, in preparation for subsequent local target detection.
  • In this way, a local target detection model that associates the second sample images with the positions and categories of the targets in the detection frames can be obtained. Local target detection can then be performed on the detected targets obtained by full-image target detection, in order to correct the positions and categories of the detected targets and obtain the precise positions and categories of the targets in the video frame images.
  • The targets that are successfully matched and the targets that are not successfully matched between the current video frame image and the previous video frame image are obtained. The successfully matched targets put the same target in the previous video frame image and the current video frame image into one-to-one correspondence, so that the position of the same target in the previous video frame image and its position in the current video frame are known; this serves the purpose of tracking the same target as well as of performing target detection on the same target, while the unsuccessfully matched targets serve the purpose of performing target detection on different targets.
  • FIG. 1 is a schematic flowchart of a video target detection and tracking method provided by an embodiment of the present invention;
  • FIG. 2 is a schematic structural diagram of a video target detection and tracking device provided by an embodiment of the present invention.
  • The embodiment of the present invention discloses a video target detection and tracking method, which can take into account the relationship between consecutive video frames, improve the detection accuracy of target detection, and at the same time track newly appearing targets. The embodiments of the present invention are described in detail below.
  • FIG. 1 is a schematic flowchart of a method for detecting and tracking a video target provided by an embodiment of the present invention. The method is applied to an electronic device and specifically includes the following steps S110 to S160:
  • Step S110: Detect whether a current video frame image of the surrounding environment collected by the collection device in real time is received; if so, execute step S120.
  • After the collection device collects video in real time, it sends the collected video to the electronic device. In the field of autonomous driving, for example, the collection device of the self-vehicle collects video in real time and then sends it to the electronic device of the self-vehicle, where the electronic device may be a processor of the vehicle. The electronic device detects whether the current video frame image of the surrounding environment collected by the collection device in real time is received, and performs the subsequent steps according to the detection result.
  • Step S120: Judge whether the frame number interval between the current video frame image and the video frame image of the last full-image target detection is a preset interval; if so, execute step S130.
  • The embodiment of the present invention does not perform full-image target detection on every video frame image; instead, full-image target detection is performed once every preset frame number interval. Therefore, when the electronic device detects that the current video frame image of the vehicle's surrounding environment collected in real time by the self-vehicle's collection device is received, it needs to judge whether the frame number interval between the current video frame image and the previous video frame image on which full-image target detection was performed is the preset interval, and perform the subsequent steps according to the result.
  • Step S130: Perform full-image target detection on the current video frame image according to a pre-established full-image target detection model.
  • When the frame number interval between the current video frame image and the video frame image of the last full-image target detection is the preset interval, the current video frame image is a video frame image that requires full-image target detection, and the pre-established full-image target detection model performs full-image target detection on the current video frame image.
  • The training process of the full-image target detection model may be as follows: acquire first sample images in a training set and the first positions and first categories corresponding to the targets in the detection frames contained in the first sample images, input them into a first initial network model, and iteratively adjust the model parameters; when the number of iterations reaches a first preset number, the training is completed, and a full-image target detection model that associates the first sample images with the positions and categories of the targets in the detection frames is obtained.
  • Specifically, the electronic device first needs to construct the first initial network model and then train it to obtain the full-image target detection model. For example, the caffe tool can be used to construct a first initial network model including a first feature extraction layer, a region generation network layer, and a first regression layer. The first initial network model may be Faster R-CNN (Faster Region-based Convolutional Neural Networks), R-FCN (Region-based Fully Convolutional Networks), the YOLO algorithm, or the SSD algorithm.
  • After the first sample images in the training set and the first positions and first categories corresponding to the targets in the detection frames contained in the first sample images are acquired, the first sample images together with those first positions and first categories are input into the first initial network model for training.
  • Specifically, a first sample image is input to the first feature extraction layer, and the full-image feature vector of the first sample image is determined through the first model parameters of the first feature extraction layer. The determined full-image feature vector is then input to the region generation network layer, and feature calculation is performed on it through the second model parameters of the region generation network layer to obtain feature information of candidate regions containing a first reference target. The feature information is then input to the first regression layer and regressed through the third model parameters of the first regression layer to obtain the first reference category to which the first reference target belongs and the first reference position of the first reference target in the first sample image.
  • After the first reference category and the first reference position are obtained, they are compared with the first category and the first position respectively: a first difference value between the first reference category and the first category, and a second difference value between the first reference position and the first position, can be calculated through a predefined objective function.
  • During training, it is possible to loop through all the first sample images and continuously adjust the first model parameters, the second model parameters, and the third model parameters of the first initial network model based on the difference values. When the number of iterations reaches the first preset number, the first initial network model at this point can adapt to most of the first sample images and obtain accurate results; the training of the first initial network model is then completed, and the full-image target detection model is obtained. It is understandable that the trained full-image target detection model associates the first sample images with the positions and categories of the targets in the detection frames; it takes a full image as input and outputs the positions and categories of the detected targets. In this way, a full-image target detection model that associates the first sample images with the positions and categories of the targets in the detection frames is obtained, and full-image target detection can be performed on video frame images through this model in order to obtain the positions and categories of the targets in the video frame images.
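  • As an illustration of the training procedure described above, the following sketch fine-tunes a detector of the Faster R-CNN family in Python. The patent mentions constructing the model with the caffe tool; torchvision, the layer names in the comments, and the training-step structure here are illustrative assumptions rather than the patent's actual implementation.

```python
# Illustrative sketch only: a torchvision Faster R-CNN stands in for the
# "first initial network model". The class count and dataset are assumptions.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_full_image_detector(num_classes: int):
    # Backbone ~ "first feature extraction layer", RPN ~ "region generation
    # network layer", box head ~ "first regression layer" in the patent's terms.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

def train_step(model, images, targets, optimizer):
    # targets: list of dicts with "boxes" (first positions) and "labels"
    # (first categories) for the first sample images.
    model.train()
    loss_dict = model(images, targets)   # classification + box regression losses
    loss = sum(loss_dict.values())       # combines the "difference values"
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```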
  • the detected target of the current video frame image is taken as the first detected target, and for each first detection target, A detected target, based on the position of the first detected target, determine the corresponding rectangular image area of the first detected target in the current video frame image, and scale the width and height of the rectangular image area to the pre-established local target detection
  • the model inputs the width and height of the image, and inputs the zoomed rectangular image area into the local target detection model to obtain the position and category of the second detected target, and then returns to step S110.
  • In order to improve detection accuracy, the embodiment of the present invention merges the detection result of the current video frame with the detection result of the previous video frame. When full-image target detection yields the position and category of a detected target and no detected target exists in the previous video frame image, the detected targets of the current video frame image are taken as first detected targets. In practice, when full-image target detection is performed by the full-image target detection model, a score for each detected target is also obtained; a score greater than a preset threshold indicates that the accuracy of the detected target is high. Therefore, when the position and category of a detected target are obtained and there is no detected target in the previous video frame image, the detected targets of the current video frame image whose scores are greater than the preset threshold are taken as the first detected targets.
  • The embodiment of the present invention proposes a full-image/local alternating detection method: after full-image target detection is performed, local target detection is continued on the first detected targets through a pre-established local target detection model. For the local target detection model, the size of the input image is a preset size, which is usually small; therefore, before local target detection is performed, the image on which local target detection is to be performed is scaled to the preset size. That is, for each first detected target, the rectangular image area corresponding to the first detected target in the current video frame image is determined based on the position of the first detected target, the width and height of the rectangular image area are scaled to the width and height of the input image of the pre-established local target detection model, and the scaled rectangular image area is input into the local target detection model to obtain the position and category of a second detected target. Then, the method returns to step S110. Because only one scaled rectangular image area is input at a time during local target detection, the amount of computation is small, and the probability of false detection is further reduced.
  • Optionally, inputting the scaled rectangular image area into the local target detection model to obtain the position and category of the second detected target may include: inputting the scaled rectangular image area into the local target detection model to obtain the positions, categories, and scores of candidate detected targets, and taking the candidate detected targets whose scores are greater than a preset threshold as the second detected targets.
  • Optionally, determining the rectangular image area corresponding to the first detected target in the current video frame image based on the position of the first detected target, and scaling the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, may include the following steps:
  • determining the coordinates of the upper-left corner point and the lower-right corner point of the first detected target in the current video frame image based on the position of the first detected target, and obtaining, in the current video frame image, a rectangular image area with the upper-left corner point and the lower-right corner point as its diagonal;
  • calculating the scaled coordinates of the upper-left corner point and the scaled coordinates of the lower-right corner point;
  • scaling, based on these coordinates, the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model.
  • It is understandable that once the position of the first detected target is obtained, the coordinates of the upper-left corner point and the lower-right corner point of the first detected target in the current video frame image are known; in order to perform local target detection, a rectangular image area with the upper-left corner point and the lower-right corner point as its diagonal is obtained in the current video frame image. The coordinates of the upper-left corner point consist of its abscissa and ordinate, and likewise for the lower-right corner point. The preset coordinate transformation coefficients include a first preset abscissa transformation coefficient, a first preset ordinate transformation coefficient, a second preset abscissa transformation coefficient, and a second preset ordinate transformation coefficient.
  • The scaled coordinates of the upper-left corner point and the scaled coordinates of the lower-right corner point can be calculated by a formula (rendered as an image in the original publication and not reproduced here) over the following quantities: a_x, the first preset abscissa transformation coefficient; a_y, the first preset ordinate transformation coefficient; d_x, the second preset abscissa transformation coefficient; d_y, the second preset ordinate transformation coefficient; x_lt, the abscissa of the upper-left corner point; y_lt, the ordinate of the upper-left corner point; x_rb, the abscissa of the lower-right corner point; y_rb, the ordinate of the lower-right corner point; F_w, the scaled abscissa of the upper-left corner point; F_h, the scaled ordinate of the upper-left corner point; W, the width of the input image of the local target detection model; and H, the height of the input image of the local target detection model.
  • It is understandable that, to scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, the required amount of scaling is first determined and the width and height are then scaled accordingly; that is, the scaling is performed based on the coordinates of the upper-left corner point, the coordinates of the lower-right corner point, and their scaled coordinates. In this way, based on the corner-point coordinates, the preset coordinate transformation coefficients, and the width and height of the input image of the pre-established local target detection model, the rectangular image area corresponding to the first detected target in the current video frame image is scaled to the width and height of the input image, in preparation for subsequent local target detection.
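  • The following is a minimal sketch of this preparation step, assuming it amounts to cropping the target's bounding rectangle and resizing it to the local model's input size; the patent's exact corner-coordinate formula (with the coefficients a_x, a_y, d_x, d_y) appears only as an equation image in the original and is not reproduced here. OpenCV and the helper name are illustrative choices.

```python
# Minimal sketch, assuming the scaling amounts to cropping the target's
# bounding rectangle and resizing it to the local model's input size (W, H).
import cv2

def crop_and_scale(frame, box, input_w, input_h):
    """box = (x_lt, y_lt, x_rb, y_rb): upper-left / lower-right corner points."""
    x_lt, y_lt, x_rb, y_rb = [int(v) for v in box]
    region = frame[y_lt:y_rb, x_lt:x_rb]              # rectangular image area
    scaled = cv2.resize(region, (input_w, input_h))   # scale to model input size
    # Per-axis scale factors map model-space boxes back to frame coordinates.
    sx = (x_rb - x_lt) / float(input_w)
    sy = (y_rb - y_lt) / float(input_h)
    return scaled, (sx, sy)
```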
  • The training process of the local target detection model may be as follows: the second sample images and the second positions and second categories corresponding to the targets in the detection frames contained in the second sample images are input into a second initial network model, where the second initial network model includes a second feature extraction layer and a second regression layer.
  • Specifically, the electronic device first needs to construct the second initial network model and then train it to obtain the local target detection model. For example, the caffe tool can be used to construct a second initial network model including a second feature extraction layer and a second regression layer. The second initial network model may be Faster R-CNN (Faster Region-based Convolutional Neural Networks), R-FCN (Region-based Fully Convolutional Networks), the YOLO algorithm, or the SSD algorithm.
  • After the second sample images and the corresponding second positions and second categories are acquired, they are input into the second initial network model for training. Specifically, a second sample image is input to the second feature extraction layer, and the feature vector of the second sample image is determined through the fourth model parameters of the second feature extraction layer. The determined feature vector is then input to the second regression layer and regressed through the fifth model parameters of the second regression layer to obtain the second reference category to which a second reference target belongs and the second reference position of the second reference target in the second sample image.
  • After the second reference category and the second reference position are obtained, they are compared with the second category and the second position respectively: a third difference value between the second reference category and the second category, and a fourth difference value between the second reference position and the second position, can be calculated through a predefined objective function. During training, it is possible to loop through all the second sample images and continuously adjust the fourth model parameters and the fifth model parameters of the second initial network model. When the number of iterations reaches a second preset number, the second initial network model at this point can adapt to most of the second sample images and obtain accurate results; the training of the second initial network model is then completed, and the local target detection model is obtained. It is understandable that the trained local target detection model associates the second sample images with the positions and categories of the targets in the detection frames; it takes a local image as input and outputs the position and category of the detected target.
  • In this way, a local target detection model that associates the second sample images with the positions and categories of the targets in the detection frames is obtained. The local target detection model can then perform local target detection on the detected targets obtained by full-image target detection, in order to correct the positions and categories of the detected targets and obtain the precise position and category of each target in the video frame image.
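  • As a sketch of what the "second initial network model" could look like, the following minimal PyTorch module pairs a feature extraction stage with a regression stage that outputs a category distribution and a refined box for one cropped region. The layer sizes and input resolution are assumptions for illustration; the patent itself leaves the architecture open (Faster R-CNN, R-FCN, YOLO, or SSD).

```python
# Minimal PyTorch stand-in for the "second initial network model":
# a feature extraction layer followed by a regression layer that outputs
# a category distribution and a refined box for the single cropped region.
import torch
import torch.nn as nn

class LocalTargetDetector(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(                  # "second feature extraction layer"
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        hidden = 64 * 4 * 4
        self.cls_head = nn.Linear(hidden, num_classes)  # second reference category
        self.box_head = nn.Linear(hidden, 4)            # second reference position

    def forward(self, x):
        f = self.features(x)
        return self.cls_head(f), self.box_head(f)

# Training would compare the outputs with the second category / second position
# (e.g. cross-entropy and smooth L1) and adjust the model parameters.
```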
  • Step S150: When no detected target is obtained and detected targets exist in the previous video frame image of the current video frame image, take the detected targets existing in the previous video frame image as third detected targets.
  • The cases in which the full-image target detection model detects no target include, but are not limited to, the current video frame image containing no target at all; for example, in the field of autonomous driving, the self-driving car is parked in a parking lot with no target around it. They also include the case in which a target exists in the current video frame image but the full-image target detection model fails to detect it.
  • For the third detected targets, the embodiment of the present invention likewise applies the full-image/local alternating detection method: after full-image target detection is performed, local target detection is continued on the third detected targets through the pre-established local target detection model. For the training process of the local target detection model, refer to the description in step S140, which is not repeated here. For the local target detection model, the size of the input image is a preset size, which is usually small; therefore, before local target detection is performed, the image on which local target detection is to be performed is scaled to the preset size. That is, for each third detected target, the rectangular image area corresponding to the third detected target in the current video frame image is determined, the width and height of the rectangular image area are scaled to the width and height of the input image of the pre-established local target detection model, the scaled rectangular image area is input into the local target detection model to obtain the position and category of a fourth detected target, a correspondence between the fourth detected target and the third detected target is established, and the method returns to step S110.
  • Optionally, determining the rectangular image area corresponding to the third detected target in the current video frame image may include: determining the first target position of the third detected target in the previous video frame image, determining in the current video frame image a first reference position that is the same as the first target position, and determining, based on the first reference position, the rectangular image area corresponding to the third detected target in the current video frame image. Since the position of the third detected target will not change much between two adjacent video frames, it can be assumed that in the current video frame the third detected target is still at its position in the previous video frame. Therefore, the first target position is determined first, and the rectangular image area corresponding to the first reference position (the same as the first target position) in the current video frame image is used as the rectangular image area corresponding to the third detected target in the current video frame image.
  • In this way, the width and height of the rectangular image area corresponding to the third detected target in the current video frame image are scaled to the width and height of the input image of the pre-established local target detection model, the scaled rectangular image area is input into the local target detection model to obtain the position and category of a fourth detected target, and the position of the third detected target in the current video frame image, that is, the position of the fourth detected target, is thereby obtained. A correspondence between the fourth detected target and the third detected target is established, and the method returns to step S110. Establishing the correspondence between the fourth detected target and the third detected target puts the same target in the previous video frame image and the current video frame image into correspondence, so that the position of the same target in the previous video frame image and its position in the current video frame are known, which serves the purpose of tracking the same target.
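  • The correspondence bookkeeping can be sketched as follows: each third detected target carries a track identifier that its refined fourth detected target inherits, linking the same physical target across the two frames. The field names are hypothetical.

```python
# Sketch of the correspondence bookkeeping: each third detected target keeps
# its track ID, and the refined fourth detected target inherits it, so the
# same physical target is linked across frames. Field names are illustrative.
def propagate_track_ids(third_targets, fourth_targets):
    """third_targets[i] produced fourth_targets[i] via local detection."""
    correspondences = {}
    for third, fourth in zip(third_targets, fourth_targets):
        fourth["track_id"] = third["track_id"]   # same target, new position
        correspondences[third["track_id"]] = (third["box"], fourth["box"])
    return correspondences
```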
  • Step S160: When the position and category of a detected target are obtained and detected targets exist in the previous video frame image of the current video frame image, take the detected targets of the current video frame image and the detected targets existing in the previous video frame image as fifth detected targets.
  • For the fifth detected targets, the embodiment of the present invention likewise applies the full-image/local alternating detection method: after full-image target detection is performed, local target detection is continued on the fifth detected targets through the pre-established local target detection model. For the training process of the local target detection model, refer to the description in step S140, which is not repeated here. For the local target detection model, the size of the input image is a preset size, which is usually small; therefore, before local target detection is performed, the image on which local target detection is to be performed is scaled to the preset size. That is, for each fifth detected target, the rectangular image area corresponding to the fifth detected target in the video frame image where it is located is determined, the width and height of the rectangular image area are scaled to the width and height of the input image of the pre-established local target detection model, and the scaled rectangular image area is input into the local target detection model to obtain the position and category of a sixth detected target.
  • For the method of determining the rectangular image area corresponding to the fifth detected target in the video frame image where it is located, and of scaling the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, refer to the corresponding method for the first detected target in step S140, which is not repeated here.
  • Since the sixth detected targets include both the detected targets of the previous video frame image and the detected targets of the current video frame image, in order to detect and track the targets, after the positions and categories of the sixth detected targets are obtained, target matching is performed on the multiple sixth detected targets to obtain the targets that are successfully matched and the targets that are not successfully matched between the current video frame image and the previous video frame image, and the method returns to step S110.
  • Optionally, the step of performing target matching on the multiple sixth detected targets to obtain the targets that are successfully matched and the targets that are not successfully matched between the current video frame image and the previous video frame image may include:
  • for each sixth detected target in the current video frame image, determining the intersection area and the union area between that sixth detected target and each sixth detected target in the previous video frame image, and calculating the quotient of the intersection area and the union area;
  • taking the sixth detected target of the current video frame image and the sixth detected target of the previous video frame image corresponding to a quotient not less than a preset threshold as successfully matched targets, and taking the sixth detected target of the current video frame image and the sixth detected target of the previous video frame image corresponding to a quotient less than the preset threshold as unsuccessfully matched targets.
  • In the embodiment of the present invention, the multiple sixth detected targets are matched by calculating IoU (Intersection over Union), which is the area of the intersection of two geometric figures divided by the area of their union. The higher the IoU, the larger the overlapping part and the more similar the two targets. Therefore, after the positions and categories of the sixth detected targets are obtained, for each sixth detected target in the current video frame image, the intersection area and the union area between it and each sixth detected target in the previous video frame image are determined, and the quotient of the two is calculated. After the quotient is obtained, it is compared with the preset threshold: if it is not less than the preset threshold, the two sixth detected targets are considered similar and are taken as successfully matched targets; if it is less than the preset threshold, the two are considered dissimilar and are taken as unsuccessfully matched targets.
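  • A minimal sketch of this matching step follows: IoU is computed between each pair of boxes from the current and previous frames, and pairs whose IoU is not less than the threshold are treated as successfully matched. The 0.5 threshold and the greedy best-match strategy are illustrative assumptions.

```python
# Minimal IoU-based matching sketch following the described procedure.
def iou(a, b):
    """Boxes as (x_lt, y_lt, x_rb, y_rb)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # intersection area
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter                 # union area
    return inter / union if union > 0 else 0.0

def match_targets(curr_boxes, prev_boxes, threshold=0.5):
    matched, unmatched = [], []
    for i, c in enumerate(curr_boxes):
        best = max(range(len(prev_boxes)),
                   key=lambda j: iou(c, prev_boxes[j]), default=None)
        if best is not None and iou(c, prev_boxes[best]) >= threshold:
            matched.append((i, best))   # same target in both frames
        else:
            unmatched.append(i)         # e.g. a newly appearing target
    return matched, unmatched
```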
  • Unsuccessfully matched targets may exist because the full-image target detection model detects a new target appearing in the current video frame image, or because a target that exists in both the previous video frame image and the current video frame image was detected in the previous video frame image but cannot be detected by the full-image target detection model in the current video frame image; the reasons are not limited to these.
  • In this way, the targets that are successfully matched and the targets that are not successfully matched between the current video frame image and the previous video frame image are obtained. The successfully matched targets put the same target in the previous video frame image and the current video frame image into one-to-one correspondence, so that the position of the same target in the previous video frame image and its position in the current video frame are known; this serves the purpose of tracking the same target as well as of performing target detection on the same target, while the unsuccessfully matched targets serve the purpose of performing target detection on different targets.
  • When the current video frame image undergoes full-image target detection, this embodiment can combine the detection result of the previous video frame image with the detection result of the current video frame image, and through the full-image/local alternating detection method, local target detection is continued after full-image target detection. This takes into account the relationship between consecutive video frame images and improves the detection accuracy of target detection. Moreover, the embodiment of the present invention performs target detection on each video frame image based on the full-image target detection model and the local target detection model, so that the targets existing in each video frame image can be detected; therefore, newly appearing targets in the video frames can be detected. At the same time, the correspondence of the same target between the previous video frame image and the current video frame image, as well as the matched targets between the two images, can be obtained. Thus, newly appearing targets can be tracked, rather than only the targets appearing in the first video frame image of the video.
  • The embodiments of the present invention can be applied to automatic driving: the electronic device of the self-vehicle detects and tracks the targets in the surrounding environment of the self-vehicle collected in real time by the collection device of the self-vehicle, so as to realize automatic driving.
  • Optionally, the video target detection and tracking method for automatic driving provided by the embodiment of the present invention may further include: if it is detected that no current video frame image of the vehicle's surrounding environment collected in real time by the self-vehicle's collection device is received, the collection device is no longer collecting images; at this point the algorithm ends, and the previously detected targets and the tracking results need to be output. That is, the positions and categories of the detected targets existing in the previous video frame image of the current video frame image, and the correspondence of each detected target, are output. In this way, target detection and tracking are realized by outputting the positions and categories of the detected targets existing in the previous video frame image of the current video frame image and the correspondences of the detected targets.
  • Optionally, the video target detection and tracking method for automatic driving may further include: when the frame number interval between the current video frame image and the video frame image of the last full-image target detection is not the preset interval and detected targets exist in the previous video frame image of the current video frame image, taking the detected targets existing in the previous video frame image as seventh detected targets; for each seventh detected target, determining the rectangular image area corresponding to the seventh detected target in the current video frame image, scaling the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, inputting the scaled rectangular image area into the local target detection model to obtain the position and category of an eighth detected target, establishing a correspondence between the eighth detected target and the seventh detected target, and returning to step S110.
  • When the frame number interval between the current video frame image and the video frame image of the last full-image target detection is not the preset interval, the current video frame image does not require full-image target detection. In this case, if no detected target exists in the previous video frame image of the current video frame image, no processing is performed; if detected targets exist in the previous video frame image, they are taken as the seventh detected targets.
  • For the seventh detected targets, local target detection is performed through the pre-established local target detection model; for its training process, refer to the description in step S140, which is not repeated here. For the local target detection model, the size of the input image is a preset size, which is usually small; therefore, before local target detection is performed, the image on which local target detection is to be performed is scaled to the preset size. That is, for each seventh detected target, the rectangular image area corresponding to the seventh detected target in the current video frame image is determined, the width and height of the rectangular image area are scaled to the width and height of the input image of the pre-established local target detection model, the scaled rectangular image area is input into the local target detection model to obtain the position and category of an eighth detected target, a correspondence between the eighth detected target and the seventh detected target is established, and the method returns to step S110.
  • Optionally, determining the rectangular image area corresponding to the seventh detected target in the current video frame image may include: determining the second target position of the seventh detected target in the previous video frame image, determining in the current video frame image a second reference position that is the same as the second target position, and determining, based on the second reference position, the rectangular image area corresponding to the seventh detected target in the current video frame image. Since the position of the seventh detected target will not change much between two adjacent video frames, it can be assumed that in the current video frame the seventh detected target is still at its position in the previous video frame. Therefore, the second target position is determined first, and the rectangular image area corresponding to the second reference position (the same as the second target position) in the current video frame image is used as the rectangular image area corresponding to the seventh detected target in the current video frame image.
  • In this way, the width and height of the rectangular image area corresponding to the seventh detected target in the current video frame image are scaled to the width and height of the input image of the pre-established local target detection model, the scaled rectangular image area is input into the local target detection model to obtain the position and category of an eighth detected target, and the position of the seventh detected target in the current video frame image, that is, the position of the eighth detected target, is thereby obtained. A correspondence between the eighth detected target and the seventh detected target is established, and the method returns to step S110. Establishing this correspondence puts the same target in the previous video frame image and the current video frame image into correspondence, so that the position of the same target in both frames is known, which serves the purpose of tracking the same target.
  • In the embodiment of the present invention, full-image target detection is not performed on every video frame; instead, full-image target detection is performed once every preset number of frames, and local target detection is performed on the other video frames. Since the amount of computation for local target detection is much smaller than that for full-image target detection, performing full-image target detection once per preset number of frames significantly reduces the amount of computation.
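  • Putting the steps together, the alternating control flow can be sketched as below, assuming hypothetical full_image_detect and local_detect helpers and a preset interval N; the matching of refined targets in step S160 is omitted for brevity, and this is an illustration rather than the patent's implementation.

```python
# Minimal sketch of the alternating control flow (S110-S160).
def run(frames, full_image_detect, local_detect, N=10):
    prev_targets = []
    last_full = -N                                  # so frame 0 gets full detection
    for idx, frame in enumerate(frames):            # S110: a frame is received
        if idx - last_full == N:                    # S120: preset frame interval?
            last_full = idx
            detected = full_image_detect(frame)     # S130: full-image detection
            seeds = detected or prev_targets        # S140/S150: pick detection seeds
            prev_targets = [local_detect(frame, t) for t in seeds]
        elif prev_targets:                          # off-interval: local-only refinement
            prev_targets = [local_detect(frame, t) for t in prev_targets]
    return prev_targets                             # final positions and categories
```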
  • FIG. 2 is a schematic structural diagram of a video target detection and tracking device provided by an embodiment of the present invention.
  • The device may include:
  • the detection module 210, configured to detect whether a current video frame image of the surrounding environment collected by the collection device in real time is received;
  • the judging module 220, used to judge, if the current video frame image is received, whether the frame number interval between the current video frame image and the video frame image of the last full-image target detection is a preset interval, and if so, to trigger the full-image target detection module 230;
  • the full-image target detection module 230, configured to perform full-image target detection on the current video frame image according to a pre-established full-image target detection model;
  • the first detection result module 240, used to: when the position and category of a detected target are obtained and no detected target exists in the previous video frame image of the current video frame image, take the detected targets of the current video frame image as first detected targets; for each first detected target, determine the rectangular image area corresponding to the first detected target in the current video frame image based on the position of the first detected target, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a second detected target, and trigger the detection module 210;
  • the second detection result module 250, used to: when no detected target is obtained and detected targets exist in the previous video frame image of the current video frame image, take the detected targets existing in the previous video frame image as third detected targets; for each third detected target, determine the rectangular image area corresponding to the third detected target in the current video frame image, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a fourth detected target, establish a correspondence between the fourth detected target and the third detected target, and trigger the detection module 210;
  • the third detection result module 260, used to: when the position and category of a detected target are obtained and detected targets exist in the previous video frame image of the current video frame image, take the detected targets of the current video frame image and the detected targets existing in the previous video frame image as fifth detected targets; for each fifth detected target, determine the rectangular image area corresponding to the fifth detected target in the video frame image where it is located, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a sixth detected target, perform target matching on the multiple sixth detected targets to obtain the targets that are successfully matched and the targets that are not successfully matched between the current video frame image and the previous video frame image, and trigger the detection module 210.
  • When the current video frame image undergoes full-image target detection, this embodiment can combine the detection result of the previous video frame image with the detection result of the current video frame image, and through the full-image/local alternating detection method, local target detection is continued after full-image target detection. This takes into account the relationship between consecutive video frame images and improves the detection accuracy of target detection. Moreover, the embodiment of the present invention performs target detection on each video frame image based on the full-image target detection model and the local target detection model, so that the targets existing in each video frame image can be detected; therefore, newly appearing targets in the video frames can be detected. At the same time, the correspondence of the same target between the previous video frame image and the current video frame image, as well as the matched targets between the two images, can be obtained. Thus, newly appearing targets can be tracked, rather than only the targets appearing in the first video frame image of the video.
  • Optionally, the foregoing device may further include:
  • an output module, used to output, if no current video frame image is received after the detection of whether the current video frame image of the surrounding environment collected by the collection device in real time is received, the positions and categories of the detected targets existing in the previous video frame image of the current video frame image and the correspondence of each detected target.
  • Optionally, the foregoing device may further include:
  • a fourth detection result module, used to: after it is judged whether the frame number interval between the current video frame image and the video frame image of the last full-image target detection is a preset interval, if the interval is not the preset interval and detected targets exist in the previous video frame image of the current video frame image, take the detected targets existing in the previous video frame image as seventh detected targets; for each seventh detected target, determine the rectangular image area corresponding to the seventh detected target in the current video frame image, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of an eighth detected target, establish a correspondence between the eighth detected target and the seventh detected target, and trigger the detection module.
  • the above-mentioned apparatus may further include a first training module configured to train to obtain the full-image target detection model, and the first training module may include:
  • the first acquisition submodule is configured to acquire the first sample image in the training set and the first position and the first category corresponding to the target in the detection frame contained in the first sample image;
  • the first input sub-module is used to input the first sample image and the first position and the first category corresponding to the target in the detection frame contained in the first sample image into the first initial network model, where the first initial network model includes a first feature extraction layer, a region generation network layer, and a first regression layer;
  • the full-image feature vector determining sub-module is configured to determine the full-image feature vector in the first sample image through the first model parameter of the first feature extraction layer;
  • the feature information determining sub-module is configured to perform feature calculation on the full image feature vector through the second model parameter of the region generation network layer to obtain feature information of the candidate region containing the first reference target;
  • the first generation sub-module is used to perform regression on the feature information through the third model parameter of the first regression layer to obtain the first reference category to which the first reference target belongs and the location of the first reference target The first reference position in the first sample image;
  • a first difference calculation sub-module configured to calculate a first difference value between the first reference category and the first category, and calculate a second difference value between the first reference position and the first position ;
  • the first adjustment sub-module is configured to adjust the first model parameter, the second model parameter, and the third model parameter based on the first difference value and the second difference value, and to trigger the first acquisition sub-module;
  • the first training completion sub-module is used to complete the training when the number of iterations reaches the first preset number of times to obtain a full-image target detection model that associates the first sample image with the position and category of the target in the detection frame.
  • the first detection result module 240 may be specifically used for:
  • for each first detected target, determine, based on the position of the first detected target, the coordinates of the upper left corner point and the coordinates of the lower right corner point of the first detected target in the current video frame image, and obtain, in the current video frame image, the rectangular image area with the upper left corner point and the lower right corner point as its diagonal;
  • calculate the scaled coordinates of the upper left corner point and the scaled coordinates of the lower right corner point according to the coordinates of the upper left corner point, the coordinates of the lower right corner point, the preset coordinate transformation coefficients, and the width and height of the input image of the pre-established local target detection model;
  • scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model based on the coordinates of the upper left corner point, the coordinates of the lower right corner point, the scaled coordinates of the upper left corner point, and the scaled coordinates of the lower right corner point.
  • the above-mentioned apparatus may further include a second training module configured to train to obtain the local target detection model, and the second training module may include:
  • the second acquisition submodule is configured to acquire the second sample image in the training set and the second position and second category corresponding to the target in the detection frame contained in the second sample image;
  • the second input sub-module is used to input the second sample image and the second position and second category corresponding to the target in the detection frame contained in the second sample image into the second initial network model, where the second initial network model includes a second feature extraction layer and a second regression layer;
  • the feature vector determining sub-module is configured to determine the feature vector in the second sample image through the fourth model parameter of the second feature extraction layer;
  • the second generation sub-module is used to perform regression on the feature vector through the fifth model parameter of the second regression layer to obtain the second reference category to which the second reference target belongs and the second reference target in the The second reference position in the second sample image;
  • a second difference calculation sub-module configured to calculate a third difference value between the second reference category and the second category, and calculate a fourth difference value between the second reference position and the second position ;
  • the second adjustment sub-module is configured to adjust the fourth model parameter and the fifth model parameter based on the third difference value and the fourth difference value, and to return to the step of acquiring the second sample image in the training set and the second position and second category corresponding to the target in the detection frame contained in the second sample image;
  • the second training completion sub-module is used to complete the training when the number of iterations reaches the second preset number to obtain a local target detection model that associates the second sample image with the position and category of the target in the detection frame.
  • the third detection result module 260 may be specifically used for:
  • for each sixth detected target of the current video frame image, determine the overlapping region and the union region between the sixth detected target and each sixth detected target of the previous video frame image, and calculate the quotient of the area of the overlapping region and the area of the union region;
  • take the sixth detected target of the current video frame image and the sixth detected target of the previous video frame image corresponding to a quotient not less than a preset threshold as successfully matched targets, and take the sixth detected target of the current video frame image and the sixth detected target of the previous video frame image corresponding to a quotient less than the preset threshold as unsuccessfully matched targets.
  • the foregoing device embodiment corresponds to the method embodiment, and has the same technical effect as the method embodiment.
  • the device embodiment is obtained based on the method embodiment, and the specific description can be found in the method embodiment part, which will not be repeated here.
  • the modules in the device of an embodiment may be distributed in the device of the embodiment according to the description of the embodiment, or may, with corresponding changes, be located in one or more devices different from this embodiment.
  • the modules of the above-mentioned embodiments can be combined into one module or further divided into multiple sub-modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the present invention discloses a video target detection and tracking method and device. The method includes: when full-image target detection is performed on the current video frame image, merging the detection result of the previous video frame image with the detection result of the current video frame image, and, by means of full-image/local alternating detection, continuing with local target detection after the full-image target detection. The relationship between consecutive video frame images is thereby taken into account, which improves the detection accuracy of target detection. Moreover, since target detection is performed on each video frame image based on a full-image target detection model and a local target detection model, newly appearing targets in the video frames can be detected; at the same time, the correspondence of the same target between the previous video frame image and the current video frame image, as well as the successfully matched targets between the previous video frame image and the current video frame image, can be obtained, so that newly appearing targets can be tracked.

Description

Video target detection and tracking method and device. Technical Field
The present invention relates to the technical field of computer vision, and in particular to a video target detection and tracking method and device.
Background
At present, tracking and detecting targets in videos collected by acquisition devices is a major task of computer vision. For example, in an autonomous driving scenario, in order to drive autonomously, the ego vehicle needs to know the driving environment around it; therefore, target detection and tracking of the surroundings of the ego vehicle must be performed through the vehicle's acquisition device.
Current target detection methods perform target detection only on the targets in individual frames of the video, without considering the relationship between preceding and following frames, so the detection accuracy of target detection is low. Current target tracking methods track only the targets that appear in the first frame of the video; when a new target appears in the video, it cannot be tracked. Therefore, there is an urgent need for a video target detection and tracking method that has high detection accuracy and can track newly appearing targets.
Summary of the Invention
The present invention provides a video target detection and tracking method and device, so as to improve the detection accuracy of target detection and to track newly appearing targets. The specific technical solution is as follows.
In a first aspect, an embodiment of the present invention provides a video target detection and tracking method, the method including:
detecting whether a current video frame image of the surrounding environment collected in real time by an acquisition device is received;
if the current video frame image is received, judging whether the frame number interval between the current video frame image and the video frame image on which full-image target detection was last performed is a preset interval;
if it is the preset interval, performing full-image target detection on the current video frame image according to a pre-established full-image target detection model;
when the position and category of a detected target are obtained and no detected target exists in the previous video frame image of the current video frame image, taking the detected target of the current video frame image as a first detected target; for each first detected target, determining, based on the position of the first detected target, the rectangular image area corresponding to the first detected target in the current video frame image, scaling the width and height of the rectangular image area to the width and height of the input image of a pre-established local target detection model, inputting the scaled rectangular image area into the local target detection model to obtain the position and category of a second detected target, and returning to the step of detecting whether a current video frame image of the surrounding environment collected in real time by the acquisition device is received;
when no detected target is obtained and a detected target exists in the previous video frame image of the current video frame image, taking the detected target existing in the previous video frame image as a third detected target; for each third detected target, determining the rectangular image area corresponding to the third detected target in the current video frame image, scaling the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, inputting the scaled rectangular image area into the local target detection model to obtain the position and category of a fourth detected target, establishing the correspondence between the fourth detected target and the third detected target, and returning to the step of detecting whether a current video frame image of the surrounding environment collected in real time by the acquisition device is received;
when the position and category of a detected target are obtained and a detected target exists in the previous video frame image of the current video frame image, taking the detected target of the current video frame image and the detected target existing in the previous video frame image as fifth detected targets; for each fifth detected target, determining the rectangular image area corresponding to the fifth detected target in the video frame image where the fifth detected target is located, scaling the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, inputting the scaled rectangular image area into the local target detection model to obtain the position and category of a sixth detected target, performing target matching on the multiple sixth detected targets to obtain the successfully matched targets and unsuccessfully matched targets between the current video frame image and the previous video frame image, and returning to the step of detecting whether a current video frame image of the surrounding environment collected in real time by the acquisition device is received.
In a second aspect, an embodiment of the present invention provides a video target detection and tracking device, the device including:
a detection module, used to detect whether a current video frame image of the surrounding environment collected in real time by an acquisition device is received;
a judgment module, used to judge, if the current video frame image is received, whether the frame number interval between the current video frame image and the video frame image on which full-image target detection was last performed is a preset interval, and if so, to trigger a full-image target detection module;
the full-image target detection module, used to perform full-image target detection on the current video frame image according to a pre-established full-image target detection model;
a first detection result module, used to, when the position and category of a detected target are obtained and no detected target exists in the previous video frame image of the current video frame image, take the detected target of the current video frame image as a first detected target, and, for each first detected target, determine, based on the position of the first detected target, the rectangular image area corresponding to the first detected target in the current video frame image, scale the width and height of the rectangular image area to the width and height of the input image of a pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a second detected target, and trigger the detection module;
a second detection result module, used to, when no detected target is obtained and a detected target exists in the previous video frame image of the current video frame image, take the detected target existing in the previous video frame image as a third detected target, and, for each third detected target, determine the rectangular image area corresponding to the third detected target in the current video frame image, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a fourth detected target, establish the correspondence between the fourth detected target and the third detected target, and trigger the detection module;
a third detection result module, used to, when the position and category of a detected target are obtained and a detected target exists in the previous video frame image of the current video frame image, take the detected target of the current video frame image and the detected target existing in the previous video frame image as fifth detected targets, and, for each fifth detected target, determine the rectangular image area corresponding to the fifth detected target in the video frame image where the fifth detected target is located, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a sixth detected target, perform target matching on the multiple sixth detected targets to obtain the successfully matched targets and unsuccessfully matched targets between the current video frame image and the previous video frame image, and trigger the detection module.
As can be seen from the above, when full-image target detection is performed on the current video frame image, this embodiment can merge the detection result of the previous video frame image with the detection result of the current video frame image and, by means of full-image/local alternating detection, continue with local target detection after the full-image target detection. This takes the relationship between consecutive video frame images into account and improves the detection accuracy of target detection. Moreover, since the embodiment of the present invention performs target detection on each video frame image based on the full-image target detection model and the local target detection model, the targets existing in every video frame image can all be detected; therefore, newly appearing targets in the video frames can be detected. At the same time, after local detection, the correspondence of the same target between the previous video frame image and the current video frame image, as well as the successfully matched targets between the previous video frame image and the current video frame image, can be obtained, so that newly appearing targets can be tracked, rather than only the targets appearing in the first video frame image of the video. Of course, implementing any product or method of the present invention does not necessarily require achieving all of the advantages described above at the same time.
The innovative points of the embodiments of the present invention include:
1. When full-image target detection is performed on the current video frame image, the detection result of the previous video frame image is merged with the detection result of the current video frame image, and local target detection is continued after full-image target detection by means of full-image/local alternating detection. This takes the relationship between consecutive video frame images into account and improves the detection accuracy of target detection. Moreover, since the embodiment of the present invention performs target detection on each video frame image based on the full-image target detection model and the local target detection model, the targets existing in every video frame image can all be detected, so newly appearing targets in the video frames can be detected; at the same time, after local detection, the correspondence of the same target between the previous and current video frame images, as well as the successfully matched targets between them, can be obtained, so newly appearing targets can be tracked, rather than only the targets appearing in the first video frame image of the video.
2. The embodiment of the present invention does not perform full-image target detection on every video frame; instead, full-image target detection is performed once every preset number of frames, and local target detection is performed on the other video frames. Since the computation required for local target detection is far less than that for full-image target detection, performing full-image target detection only once per preset frame interval significantly reduces the amount of computation.
3. By training the first initial network model, a full-image target detection model that associates the first sample image with the position and category of the target in the detection frame can be obtained; through this model, full-image target detection can be performed on video frame images to obtain the positions and categories of the targets in them.
4. Through the coordinates of the upper left corner point and the lower right corner point of the first detected target in the current video frame image, the preset coordinate transformation coefficients, and the width and height of the input image of the pre-established local target detection model, the width and height of the rectangular image area corresponding to the first detected target in the current video frame image are scaled to the width and height of the input image of the pre-established local target detection model, preparing for the subsequent local target detection.
5. By training the second initial network model, a local target detection model that associates the second sample image with the position and category of the target in the detection frame can be obtained; through this model, local target detection can be performed again on the detected targets obtained by full-image target detection, so as to correct their positions and categories and obtain accurate positions and categories of the targets in the video frame images.
6. By calculating the IoU, the relationship between consecutive video frames is taken into account, and the successfully matched and unsuccessfully matched targets between the current video frame image and the previous video frame image are obtained. The successfully matched targets put the same target in the previous and current video frame images into one-to-one correspondence, so the position of the same target in the previous video frame image and in the current video frame is known, which serves the purposes of both tracking and detecting the same target; the unsuccessfully matched targets serve the purpose of detecting distinct targets.
Brief Description of the Drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flowchart of the video target detection and tracking method provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the video target detection and tracking device provided by an embodiment of the present invention.
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整的描述。显然,所描述的实施例仅仅是本发明的一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有付出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
需要说明的是,本发明实施例及附图中的术语“包括”和“具有”以及它们的任何变形,意图在于覆盖不排他的包含。例如包含的一系列步骤或单元的过程、方法、***、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其他步骤或单元。
本发明实施例公开了一种视频目标检测与跟踪方法,能够考虑前后视频帧之间的关系,提高目标检测的检测准确率,同时,能够对新出现的目标进行跟踪。下面对本发明实施例进行详细说明。
Fig. 1 is a schematic flowchart of the video target detection and tracking method provided by an embodiment of the present invention. The method is applied to an electronic device and specifically includes the following steps S110 to S160:
S110: Detect whether a current video frame image of the surrounding environment collected in real time by the acquisition device is received; if so, execute step S120.
In the field of computer vision, in order to realize the corresponding functions, the electronic device needs to perform target tracking and detection on the video of the surrounding environment collected in real time by the acquisition device. For example, in an autonomous driving scenario, in order to drive autonomously, the ego vehicle needs to know the driving environment around it, such as the movement of other vehicles on the road and the walking routes of pedestrians; therefore, video of the surroundings of the ego vehicle must be collected through the vehicle's acquisition device.
After the acquisition device collects video in real time, it sends the collected video to the electronic device. For example, in an autonomous driving scenario, after the ego vehicle's acquisition device collects video in real time, it sends the video to the ego vehicle's electronic device, which may be the vehicle's processor. The electronic device detects whether the current video frame image of the surrounding environment collected in real time by the acquisition device is received, and executes the subsequent steps according to the detection result.
S120: Judge whether the frame number interval between the current video frame image and the video frame image on which full-image target detection was last performed is a preset interval; if so, execute step S130.
Since performing full-image target detection on every video frame image would require a huge amount of computation, in order to reduce the computation, the embodiment of the present invention no longer performs full-image target detection on every video frame image, but instead performs full-image target detection once every preset number of frames. Therefore, when the electronic device detects that the current video frame image of the vehicle's surroundings collected in real time by the ego vehicle's acquisition device is received, it needs to judge whether the frame number interval between the current video frame image and the video frame image on which full-image target detection was last performed is the preset interval, and execute the subsequent steps according to the result.
S130: Perform full-image target detection on the current video frame image according to the pre-established full-image target detection model.
When the frame number interval between the current video frame image and the video frame image on which full-image target detection was last performed is the preset interval, the current video frame image is one on which full-image target detection needs to be performed; at this point, full-image target detection is performed on the current video frame image according to the pre-established full-image target detection model.
The training process of the full-image target detection model may be as follows:
acquiring a first sample image in the training set and the first position and first category corresponding to the target in the detection frame contained in the first sample image;
inputting the first sample image and the first position and first category corresponding to the target in the detection frame contained in the first sample image into a first initial network model, where the first initial network model includes a first feature extraction layer, a region generation network layer, and a first regression layer;
determining the full-image feature vector of the first sample image through the first model parameters of the first feature extraction layer;
performing feature calculation on the full-image feature vector through the second model parameters of the region generation network layer to obtain feature information of candidate regions containing a first reference target;
performing regression on the feature information through the third model parameters of the first regression layer to obtain the first reference category to which the first reference target belongs and the first reference position of the first reference target in the first sample image;
calculating a first difference value between the first reference category and the first category, and calculating a second difference value between the first reference position and the first position;
adjusting the first model parameters, the second model parameters, and the third model parameters based on the first difference value and the second difference value, and returning to the step of acquiring a first sample image in the training set and the first position and first category corresponding to the target in the detection frame contained in the first sample image;
when the number of iterations reaches a first preset number, completing the training to obtain a full-image target detection model that associates the first sample image with the position and category of the target in the detection frame.
When building the full-image target detection model, the first sample images in the training set and the first position and first category corresponding to the target in the detection frame contained in each first sample image need to be acquired.
It can be understood that the electronic device first needs to construct a first initial network model and then train it to obtain the full-image target detection model. In one implementation, the caffe tool can be used to construct a first initial network model that includes a first feature extraction layer, a region generation network layer, and a first regression layer. Illustratively, the first initial network model may be Faster R-CNN (Faster Region Convolutional Neural Networks), R-FCN (Region-based Fully Convolutional Networks), the YOLO algorithm, or the SSD algorithm.
After the first sample image and the first position and first category corresponding to the target in the detection frame contained in the first sample image are acquired, they are input into the first initial network model for training.
Specifically, the first sample image is input into the first feature extraction layer, and the full-image feature vector of the first sample image is determined through the first model parameters of the first feature extraction layer. The determined full-image feature vector is then input into the region generation network layer, and feature calculation is performed on it through the second model parameters of the region generation network layer to obtain the feature information of the candidate regions containing the first reference target. The feature information is then input into the first regression layer, and regression is performed on it through the third model parameters of the first regression layer to obtain the first reference category to which the first reference target belongs and the first reference position of the first reference target in the first sample image.
After the first reference category and the first reference position are obtained, they are compared with the first category and the first position respectively; the first difference value between the first reference category and the first category and the second difference value between the first reference position and the first position can each be calculated through a predefined objective function. When the number of iterations has not reached the first preset number, the first initial network model at this point cannot yet fit most of the first sample images; in this case, the first model parameters, the second model parameters, and the third model parameters need to be adjusted through back-propagation based on the first difference value and the second difference value, and the process returns to the step of acquiring a first sample image in the training set and the first position and first category corresponding to the target in the detection frame contained in the first sample image.
During training, all the first sample images can be traversed in a loop, with the first model parameters, second model parameters, and third model parameters of the first initial network model continuously adjusted. When the number of iterations reaches the first preset number, the first initial network model at this point can fit most of the first sample images and obtain accurate results; at this point, training of the first initial network model is determined to be complete, and the full-image target detection model is obtained. It can be understood that the trained full-image target detection model associates the first sample image with the position and category of the target in the detection frame, and that the full-image target detection model takes the full image as input and outputs the positions and categories of the detected targets.
It can be seen that by training the first initial network model in the above way, a full-image target detection model that associates the first sample image with the position and category of the target in the detection frame can be obtained; through this model, full-image target detection can be performed on a video frame image to obtain the positions and categories of the targets in it.
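As a concrete illustration of the training procedure above, the following sketch trains a full-image detector with torchvision's Faster R-CNN, one of the network types named as an option for the first initial network model. The original implementation is described with the caffe tool, so the framework, dataset interface, and hyperparameters below are assumptions, not the disclosed implementation:

    import torch
    import torchvision

    def train_full_image_detector(dataset, num_classes, first_preset_number=10000):
        # Faster R-CNN bundles the three described parts: a feature extraction
        # backbone, a region generation (proposal) network, and a
        # classification/box-regression head.
        model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
            weights=None, num_classes=num_classes)
        model.train()
        optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
        loader = torch.utils.data.DataLoader(
            dataset, batch_size=2, shuffle=True,
            collate_fn=lambda batch: tuple(zip(*batch)))
        iterations = 0
        while iterations < first_preset_number:      # the "first preset number"
            for images, targets in loader:
                # targets: [{'boxes': Nx4 tensor, 'labels': N tensor}, ...],
                # i.e. the first positions and first categories of the samples.
                loss_dict = model(list(images), list(targets))
                loss = sum(loss_dict.values())       # category + position differences
                optimizer.zero_grad()
                loss.backward()                      # back-propagation adjusts the
                optimizer.step()                     # model parameters
                iterations += 1
                if iterations >= first_preset_number:
                    break
        model.eval()
        return model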
S140: When the position and category of a detected target are obtained and no detected target exists in the previous video frame image of the current video frame image, take the detected target of the current video frame image as a first detected target; for each first detected target, determine, based on the position of the first detected target, the rectangular image area corresponding to the first detected target in the current video frame image, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a second detected target, and return to step S110.
In order to take the relationship between consecutive video frames into account, the embodiment of the present invention needs to merge the detection result of the current video frame with that of the previous video frame. When the position and category of a detected target are obtained through the full-image target detection model and no detected target exists in the previous video frame image of the current video frame image, the detected target of the current video frame image is taken as the first detected target.
Since performing full-image target detection on the current video frame image with the pre-established full-image target detection model yields, in addition to the position and category of each detected target, a score for the detected target, and a score greater than a preset threshold indicates that the detected target has high accuracy, it is also possible, when the positions and categories of detected targets are obtained and no detected target exists in the previous video frame image of the current video frame image, to take only those detected targets of the current video frame image whose scores are greater than the preset threshold as first detected targets.
Since the results of full-image target detection may contain errors, in order to perform more accurate target detection, the embodiment of the present invention proposes the full-image/local alternating detection method, that is, after full-image target detection, local target detection is further performed on the first detected targets. Local target detection is performed through the pre-established local target detection model.
Since the input image of the pre-established local target detection model is generally a local part of the whole image, the input image size is a preset size, which is usually small; therefore, before local target detection, the image on which local target detection is to be performed needs to be scaled to the preset size. That is, for each first detected target, the rectangular image area corresponding to the first detected target in the current video frame image is determined based on its position, the width and height of the rectangular image area are scaled to the width and height of the input image of the pre-established local target detection model, and the scaled rectangular image area is input into the local target detection model to obtain the position and category of a second detected target. Then the process returns to step S110. Since only one scaled rectangular image area is input at a time during local target detection, the amount of computation is small, which further reduces the probability of false detections and improves the accuracy of target detection.
Since performing local target detection on the scaled rectangular image area with the pre-established local target detection model yields, in addition to the position and category of the detected target, a score for the detected target, and a score greater than a preset threshold indicates high accuracy, inputting the scaled rectangular image area into the local target detection model to obtain the position and category of the second detected target may include: inputting the scaled rectangular image area into the local target detection model to obtain the positions, categories, and scores of candidate detected targets, and taking the candidate detected targets whose scores are greater than the preset threshold as second detected targets.
The step of, for each first detected target, determining, based on the position of the first detected target, the rectangular image area corresponding to the first detected target in the current video frame image, and scaling the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model may include:
for each first detected target, determining, based on the position of the first detected target, the coordinates of the upper left corner point and the coordinates of the lower right corner point of the first detected target in the current video frame image, and obtaining, in the current video frame image, the rectangular image area with the upper left corner point and the lower right corner point as its diagonal;
calculating the scaled coordinates of the upper left corner point and the scaled coordinates of the lower right corner point according to the coordinates of the upper left corner point, the coordinates of the lower right corner point, preset coordinate transformation coefficients, and the width and height of the input image of the pre-established local target detection model;
scaling the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model based on the coordinates of the upper left corner point, the coordinates of the lower right corner point, the scaled coordinates of the upper left corner point, and the scaled coordinates of the lower right corner point.
For each first detected target, once its position is obtained, the coordinates of its upper left corner point and lower right corner point in the current video frame image are known; in order to perform local target detection, the rectangular image area with the upper left corner point and the lower right corner point as its diagonal is obtained in the current video frame image.
The scaled coordinates of the upper left corner point and the scaled coordinates of the lower right corner point are then calculated according to the coordinates of the upper left corner point, the coordinates of the lower right corner point, the preset coordinate transformation coefficients, and the width and height of the input image of the pre-established local target detection model. The coordinates of the upper left corner point include its abscissa and ordinate, and the coordinates of the lower right corner point include its abscissa and ordinate; the preset coordinate transformation coefficients include a first preset abscissa transformation coefficient, a first preset ordinate transformation coefficient, a second preset abscissa transformation coefficient, and a second preset ordinate transformation coefficient.
The scaled coordinates of the upper left corner point and the scaled coordinates of the lower right corner point can be calculated by the following formula:
[Equation image PCTCN2019108080-appb-000001: the formula for the scaled coordinates of the corner points; not reproduced in the text]
where a_x is the first preset abscissa transformation coefficient, a_y is the first preset ordinate transformation coefficient, d_x is the second preset abscissa transformation coefficient, d_y is the second preset ordinate transformation coefficient, x_lt is the abscissa of the upper left corner point, y_lt is the ordinate of the upper left corner point, x_rb is the abscissa of the lower right corner point, y_rb is the ordinate of the lower right corner point, F_w is the scaled abscissa of the upper left corner point, F_h is the scaled ordinate of the upper left corner point, H is the height of the input image of the local target detection model, and W is the width of the input image of the local target detection model.
After the scaled coordinates of the upper left and lower right corner points are obtained, comparing them respectively with the original coordinates of the two corner points reveals by how much the width and height of the rectangular image area need to be scaled to reach the width and height of the input image of the pre-established local target detection model; the width and height are then scaled by those amounts. That is, based on the coordinates of the upper left corner point, the coordinates of the lower right corner point, the scaled coordinates of the upper left corner point, and the scaled coordinates of the lower right corner point, the width and height of the rectangular image area are scaled to the width and height of the input image of the pre-established local target detection model.
Thus, through the coordinates of the upper left corner point and the lower right corner point of the first detected target in the current video frame image, the preset coordinate transformation coefficients, and the width and height of the input image of the pre-established local target detection model, the width and height of the rectangular image area corresponding to the first detected target in the current video frame image are scaled to the width and height of the input image of the pre-established local target detection model, preparing for the subsequent local target detection.
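For illustration, the cropping and scaling step can be sketched as follows. Since the exact corner transform with the coefficients a_x, a_y, d_x, d_y is given only by the equation image above, the plain linear mapping and the OpenCV resize used here are assumptions:

    import cv2  # assumed available for image resizing

    def crop_and_scale(frame, box, input_w, input_h):
        """frame: HxWx3 array; box: (x_lt, y_lt, x_rb, y_rb) of a detected target."""
        x_lt, y_lt, x_rb, y_rb = [int(round(v)) for v in box]
        # Rectangle with the upper left and lower right corner points as its diagonal.
        region = frame[y_lt:y_rb, x_lt:x_rb]
        # Scale the width and height to the local model's input width and height.
        scaled = cv2.resize(region, (input_w, input_h))
        # Per-axis scale factors, useful for mapping the local model's output
        # box back into full-frame coordinates.
        sx = input_w / max(x_rb - x_lt, 1)
        sy = input_h / max(y_rb - y_lt, 1)
        return scaled, (sx, sy)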
The training process of the local target detection model may be as follows:
acquiring a second sample image in the training set and the second position and second category corresponding to the target in the detection frame contained in the second sample image;
inputting the second sample image and the second position and second category corresponding to the target in the detection frame contained in the second sample image into a second initial network model, where the second initial network model includes a second feature extraction layer and a second regression layer;
determining the feature vector of the second sample image through the fourth model parameters of the second feature extraction layer;
performing regression on the feature vector through the fifth model parameters of the second regression layer to obtain the second reference category to which a second reference target belongs and the second reference position of the second reference target in the second sample image;
calculating a third difference value between the second reference category and the second category, and calculating a fourth difference value between the second reference position and the second position;
adjusting the fourth model parameters and the fifth model parameters based on the third difference value and the fourth difference value, and returning to the step of acquiring a second sample image in the training set and the second position and second category corresponding to the target in the detection frame contained in the second sample image;
when the number of iterations reaches a second preset number, completing the training to obtain a local target detection model that associates the second sample image with the position and category of the target in the detection frame.
When building the local target detection model, the second sample images in the training set and the second position and second category corresponding to the target in the detection frame contained in each second sample image need to be acquired.
It can be understood that the electronic device first needs to construct a second initial network model and then train it to obtain the local target detection model. In one implementation, the caffe tool can be used to construct a second initial network model that includes a second feature extraction layer and a second regression layer. Illustratively, the second initial network model may be Faster R-CNN (Faster Region Convolutional Neural Networks), R-FCN (Region-based Fully Convolutional Networks), the YOLO algorithm, or the SSD algorithm.
After the second sample image and the second position and second category corresponding to the target in the detection frame contained in the second sample image are acquired, they are input into the second initial network model for training.
Specifically, the second sample image is input into the second feature extraction layer, and the feature vector of the second sample image is determined through the fourth model parameters of the second feature extraction layer. The determined feature vector is then input into the second regression layer, and regression is performed on it through the fifth model parameters of the second regression layer to obtain the second reference category to which the second reference target belongs and the second reference position of the second reference target in the second sample image.
After the second reference category and the second reference position are obtained, they are compared with the second category and the second position respectively; the third difference value between the second reference category and the second category and the fourth difference value between the second reference position and the second position can each be calculated through a predefined objective function. When the number of iterations has not reached the second preset number, the second initial network model at this point cannot yet fit most of the second sample images; in this case, the fourth model parameters and the fifth model parameters need to be adjusted through back-propagation based on the third difference value and the fourth difference value, and the process returns to the step of acquiring a second sample image in the training set and the second position and second category corresponding to the target in the detection frame contained in the second sample image.
During training, all the second sample images can be traversed in a loop, with the fourth model parameters and fifth model parameters of the second initial network model continuously adjusted. When the number of iterations reaches the second preset number, the second initial network model at this point can fit most of the second sample images and obtain accurate results; at this point, training of the second initial network model is determined to be complete, and the local target detection model is obtained. It can be understood that the trained local target detection model associates the second sample image with the position and category of the target in the detection frame, and that the local target detection model takes a local image as input and outputs the position and category of the detected target.
It can be seen that by training the second initial network model in the above way, a local target detection model that associates the second sample image with the position and category of the target in the detection frame can be obtained; through this model, local target detection can be performed again on the detected targets obtained by full-image target detection, so as to correct their positions and categories and obtain the accurate positions and categories of the targets in the video frame images.
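The two-part structure described for the second initial network model (a feature extraction layer followed by a regression layer that outputs a category and a position) can be sketched as below; the layer sizes and heads are illustrative assumptions, since any of the named detector families could fill this role:

    import torch
    import torch.nn as nn

    class LocalDetector(nn.Module):
        """Sketch of the local model: one cropped region in, one category and box out."""
        def __init__(self, num_classes):
            super().__init__()
            self.features = nn.Sequential(              # second feature extraction layer
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.cls_head = nn.Linear(64, num_classes)  # second regression layer:
            self.box_head = nn.Linear(64, 4)            # category and position outputs

        def forward(self, x):
            f = self.features(x)
            return self.cls_head(f), self.box_head(f)

    # Training computes the third difference value (category) and fourth
    # difference value (position), e.g. cross-entropy and smooth L1 losses,
    # and back-propagates both to adjust the fourth and fifth model parameters.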
S150: When no detected target is obtained and a detected target exists in the previous video frame image of the current video frame image, take the detected target existing in the previous video frame image as a third detected target; for each third detected target, determine the rectangular image area corresponding to the third detected target in the current video frame image, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a fourth detected target, establish the correspondence between the fourth detected target and the third detected target, and return to step S110.
When no detected target is obtained through the full-image target detection model and a detected target exists in the previous video frame image of the current video frame image, the detected target existing in the previous video frame image is taken as a third detected target.
There are many situations in which no detected target is obtained through the full-image target detection model, including but not limited to: the current video frame image indeed contains no target (for example, in the autonomous driving field, when the ego vehicle is parked in a parking lot and its acquisition device faces a wall), and the current video frame image contains a target that the full-image target detection model fails to detect.
Since the results of full-image target detection may contain errors, in order to perform more accurate target detection, the embodiment of the present invention proposes the full-image/local alternating detection method, that is, after full-image target detection, local target detection is further performed on the third detected targets. Local target detection is performed through the pre-established local target detection model, whose training process can be found in the description of step S140 and is not repeated here.
Since the input image of the pre-established local target detection model is generally a local part of the whole image, the input image size is a preset size, which is usually small; therefore, before local target detection, the image on which local target detection is to be performed needs to be scaled to the preset size. That is, for each third detected target, the rectangular image area corresponding to the third detected target in the current video frame image is determined, the width and height of the rectangular image area are scaled to the width and height of the input image of the pre-established local target detection model, the scaled rectangular image area is input into the local target detection model to obtain the position and category of a fourth detected target, the correspondence between the fourth detected target and the third detected target is established, and the process returns to step S110.
Determining the rectangular image area corresponding to the third detected target in the current video frame image may include: determining a first target position of the third detected target in the previous video frame image, determining, in the current video frame, a first reference position identical to the first target position, and determining, based on the first reference position, the rectangular image area corresponding to the third detected target in the current video frame image.
Since the position of the third detected target does not change much between two consecutive video frame images, it can be assumed that in the current video frame image the third detected target is still at the first target position of the previous video frame image; the rectangular image area corresponding to the first reference position in the current video frame image, identical to the first target position, is then taken as the rectangular image area corresponding to the third detected target in the current video frame image. The width and height of this rectangular image area are then scaled to the width and height of the input image of the pre-established local target detection model, and the scaled rectangular image area is input into the local target detection model to obtain the position and category of the fourth detected target; in this way, the position of the third detected target in the current video frame image, i.e., the position of the fourth detected target, is obtained.
After the position and category of the fourth detected target are obtained, the correspondence between the fourth detected target and the third detected target is established, and the process returns to step S110. Establishing this correspondence links the same target in the previous video frame image and the current video frame image, so the position of the same target in the previous video frame image and in the current video frame can be known, which serves the purpose of tracking the same target.
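The carry-over step described here can be sketched as follows, reusing the crop_and_scale helper assumed earlier; the track dictionary format and the local_model call are illustrative assumptions:

    def refresh_track(local_model, current_frame, prev_target, input_w, input_h):
        # Assume the target has barely moved: reuse the previous frame's box
        # (the "first target position") as the crop region in the current frame.
        region, _ = crop_and_scale(current_frame, prev_target["box"],
                                   input_w, input_h)
        position, category = local_model(region)   # the fourth detected target
        return {
            "box": position,
            "category": category,
            # Correspondence between the fourth and third detected targets:
            # the track identity is carried over.
            "track_id": prev_target["track_id"],
        }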
S160: When the position and category of a detected target are obtained and a detected target exists in the previous video frame image of the current video frame image, take the detected target of the current video frame image and the detected target existing in the previous video frame image as fifth detected targets; for each fifth detected target, determine the rectangular image area corresponding to the fifth detected target in the video frame image where the fifth detected target is located, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a sixth detected target, perform target matching on the multiple sixth detected targets to obtain the successfully matched targets and unsuccessfully matched targets between the current video frame image and the previous video frame image, and return to step S110.
When the position and category of a detected target are obtained and a detected target exists in the previous video frame image of the current video frame image, the detected target of the current video frame image and the detected target existing in the previous video frame image are taken as fifth detected targets.
Since the results of full-image target detection may contain errors, in order to perform more accurate target detection, the embodiment of the present invention proposes the full-image/local alternating detection method, that is, after full-image target detection, local target detection is further performed on the fifth detected targets. Local target detection is performed through the pre-established local target detection model, whose training process can be found in the description of step S140 and is not repeated here.
Since the input image of the pre-established local target detection model is generally a local part of the whole image, the input image size is a preset size, which is usually small; therefore, before local target detection, the image on which local target detection is to be performed needs to be scaled to the preset size. That is, for each fifth detected target, the rectangular image area corresponding to the fifth detected target in the video frame image where it is located is determined, the width and height of the rectangular image area are scaled to the width and height of the input image of the pre-established local target detection model, and the scaled rectangular image area is input into the local target detection model to obtain the position and category of a sixth detected target.
The way of determining the rectangular image area corresponding to the fifth detected target in the video frame image where it is located and scaling its width and height to the width and height of the input image of the pre-established local target detection model can follow the way described in step S140 for determining the rectangular image area corresponding to the first detected target in the current video frame image and scaling it, and is not repeated here.
Since the sixth detected targets include both the detected targets of the previous video frame image and those of the current video frame image, in order to detect and track targets, after the positions and categories of the sixth detected targets are obtained, target matching is performed on the multiple sixth detected targets to obtain the successfully matched targets and unsuccessfully matched targets between the current video frame image and the previous video frame image, and the process returns to step S110.
The step of performing target matching on the multiple sixth detected targets to obtain the successfully matched targets and unsuccessfully matched targets between the current video frame image and the previous video frame image may include:
for each sixth detected target of the current video frame image, determining the overlapping region and the union region between the sixth detected target and each sixth detected target of the previous video frame image, and calculating the quotient of the area of the overlapping region and the area of the union region;
taking the sixth detected target of the current video frame image and the sixth detected target of the previous video frame image corresponding to a quotient not less than a preset threshold as successfully matched targets, and taking the sixth detected target of the current video frame image and the sixth detected target of the previous video frame image corresponding to a quotient less than the preset threshold as unsuccessfully matched targets.
In the embodiment of the present invention, target matching is performed on the multiple sixth detected targets by calculating the IoU (Intersection over Union), which is the area of the intersection of two geometric figures divided by the area of their union. The higher the IoU, the larger the overlapping part and the more similar the two targets. Therefore, after the positions and categories of the sixth detected targets are obtained, for each sixth detected target of the current video frame image, the overlapping region and the union region between it and each sixth detected target of the previous video frame image are determined, and the quotient of the area of the overlapping region and the area of the union region is calculated.
After the quotient is obtained, it is compared with the preset threshold: if it is greater than or equal to the preset threshold, the two sixth detected targets are fairly similar; if it is less than the preset threshold, they are not similar. Therefore, the sixth detected target of the current video frame image and the sixth detected target of the previous video frame image corresponding to a quotient not less than the preset threshold are taken as successfully matched targets, and the sixth detected target of the current video frame image and the sixth detected target of the previous video frame image corresponding to a quotient less than the preset threshold are taken as unsuccessfully matched targets.
Unsuccessfully matched targets may arise because the full-image target detection model failed to detect a target newly appearing in the current video frame image, or because a target present in both the previous and the current video frame image was detected in the previous video frame image but not detected in the current one by the full-image target detection model; of course, the causes are not limited to these.
Thus, by calculating the IoU, the relationship between consecutive video frames is taken into account, and the successfully matched and unsuccessfully matched targets between the current video frame image and the previous video frame image are obtained. The successfully matched targets put the same target in the previous and current video frame images into one-to-one correspondence, so the position of the same target in the previous video frame image and in the current video frame is known, which serves the purposes of both tracking and detecting the same target; the unsuccessfully matched targets serve the purpose of detecting distinct targets.
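For illustration, the IoU computation and threshold matching can be sketched as follows; boxes are assumed to be (x_lt, y_lt, x_rb, y_rb) tuples, and the greedy one-to-one assignment is an assumption, since the embodiment specifies only the threshold test:

    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)    # intersection area
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        union = area_a + area_b - inter                  # union area
        return inter / union if union > 0 else 0.0

    def match_targets(curr_boxes, prev_boxes, thresh=0.5):
        matched, unmatched = [], []
        used = set()
        for i, cb in enumerate(curr_boxes):
            best_j, best_iou = None, thresh
            for j, pb in enumerate(prev_boxes):
                if j in used:
                    continue
                v = iou(cb, pb)
                if v >= best_iou:                # quotient not less than threshold
                    best_j, best_iou = j, v
            if best_j is not None:
                used.add(best_j)
                matched.append((i, best_j))      # successfully matched target pair
            else:
                unmatched.append(i)              # unsuccessfully matched target
        return matched, unmatched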
As can be seen from the above, when full-image target detection is performed on the current video frame image, this embodiment can merge the detection result of the previous video frame image with the detection result of the current video frame image and, by means of full-image/local alternating detection, continue with local target detection after the full-image target detection. This takes the relationship between consecutive video frame images into account and improves the detection accuracy of target detection. Moreover, since the embodiment of the present invention performs target detection on each video frame image based on the full-image target detection model and the local target detection model, the targets existing in every video frame image can all be detected; therefore, newly appearing targets in the video frames can be detected. At the same time, after local detection, the correspondence of the same target between the previous video frame image and the current video frame image, as well as the successfully matched targets between them, can be obtained, so that newly appearing targets can be tracked, rather than only the targets appearing in the first video frame image of the video.
The embodiment of the present invention can be applied to autonomous driving: the ego vehicle's electronic device detects and tracks targets in the vehicle's surroundings collected in real time by the vehicle's acquisition device, so as to realize autonomous driving.
On the basis of the method shown in Fig. 1, after step S110, when it is detected that no current video frame image of the vehicle's surroundings collected in real time by the ego vehicle's acquisition device is received, the video target detection and tracking method for autonomous driving provided by the embodiment of the present invention may further include:
outputting the positions and categories of the detected targets existing in the previous video frame image of the current video frame image and the correspondence of each detected target.
When it is detected that no current video frame image of the vehicle's surroundings collected in real time by the ego vehicle's acquisition device is received, the acquisition device is no longer collecting images; at this point the algorithm ends, and the previously detected targets and the tracking results need to be output, i.e., the positions and categories of the detected targets existing in the previous video frame image of the current video frame image and the correspondence of each detected target need to be output.
Thus, when it is detected that no current video frame image of the vehicle's surroundings collected in real time by the ego vehicle's acquisition device is received, target detection and tracking are realized by outputting the positions and categories of the detected targets existing in the previous video frame image of the current video frame image and the correspondence of each detected target.
On the basis of the method shown in Fig. 1, after step S120, when it is judged that the frame number interval between the current video frame image and the video frame image on which full-image target detection was last performed is not the preset interval, the video target detection and tracking method for autonomous driving provided by the embodiment of the present invention may further include:
when a detected target exists in the previous video frame image of the current video frame image, taking the detected target existing in the previous video frame image as a seventh detected target; for each seventh detected target, determining the rectangular image area corresponding to the seventh detected target in the current video frame image, scaling the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, inputting the scaled rectangular image area into the local target detection model to obtain the position and category of an eighth detected target, establishing the correspondence between the eighth detected target and the seventh detected target, and returning to step S110.
When the frame number interval between the current video frame image and the video frame image on which full-image target detection was last performed is not the preset interval, the current video frame image does not need full-image target detection. In this case, if no detected target exists in the previous video frame image of the current video frame image, no processing is performed; if a detected target exists in the previous video frame image, the detected target existing in the previous video frame image is taken as a seventh detected target.
In order to perform more accurate target detection, after the seventh detected targets are obtained, local target detection is performed on them. Local target detection is performed through the pre-established local target detection model, whose training process can be found in the description of step S140 and is not repeated here.
Since the input image of the pre-established local target detection model is generally a local part of the whole image, the input image size is a preset size, which is usually small; therefore, before local target detection, the image on which local target detection is to be performed needs to be scaled to the preset size. That is, for each seventh detected target, the rectangular image area corresponding to the seventh detected target in the current video frame image is determined, the width and height of the rectangular image area are scaled to the width and height of the input image of the pre-established local target detection model, the scaled rectangular image area is input into the local target detection model to obtain the position and category of an eighth detected target, the correspondence between the eighth detected target and the seventh detected target is established, and the process returns to step S110.
Determining the rectangular image area corresponding to the seventh detected target in the current video frame image may include: determining a second target position of the seventh detected target in the previous video frame image, determining, in the current video frame, a second reference position identical to the second target position, and determining, based on the second reference position, the rectangular image area corresponding to the seventh detected target in the current video frame image.
Since the position of the seventh detected target does not change much between two consecutive video frame images, it can be assumed that in the current video frame image the seventh detected target is still at the second target position of the previous video frame image; the rectangular image area corresponding to the second reference position in the current video frame image, identical to the second target position, is then taken as the rectangular image area corresponding to the seventh detected target in the current video frame image. The width and height of this rectangular image area are then scaled to the width and height of the input image of the pre-established local target detection model, and the scaled rectangular image area is input into the local target detection model to obtain the position and category of the eighth detected target; in this way, the position of the seventh detected target in the current video frame image, i.e., the position of the eighth detected target, is obtained.
After the position and category of the eighth detected target are obtained, the correspondence between the eighth detected target and the seventh detected target is established, and the process returns to step S110. Establishing this correspondence links the same target in the previous video frame image and the current video frame image, so the position of the same target in the previous video frame image and in the current video frame can be known, which serves the purpose of tracking the same target.
The embodiment of the present invention does not perform full-image target detection on every video frame; instead, full-image target detection is performed once every preset number of frames, and local target detection is performed on the other video frames. Since the computation required for local target detection is far less than that for full-image target detection, performing full-image target detection only once per preset frame interval, as in the embodiment of the present invention, significantly reduces the amount of computation.
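The alternating schedule described here can be sketched as a per-frame dispatch loop; full_image_detect and local_refresh stand in for the two detection paths above and are assumed wrappers, not named parts of the disclosure:

    def process_stream(frames, preset_interval, full_image_detect, local_refresh):
        tracks = []
        last_full = -preset_interval          # force a full-image pass on frame 0
        for idx, frame in enumerate(frames):
            if idx - last_full == preset_interval:
                # Expensive full-image detection, run once per preset interval,
                # merged with the previous frame's results inside the wrapper.
                tracks = full_image_detect(frame, tracks)
                last_full = idx
            elif tracks:
                # Cheap local-only refresh on the in-between frames.
                tracks = [local_refresh(frame, t) for t in tracks]
        return tracks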
Fig. 2 is a schematic structural diagram of the video target detection and tracking device provided by an embodiment of the present invention. The device may include:
a detection module 210, used to detect whether a current video frame image of the surrounding environment collected in real time by an acquisition device is received;
a judgment module 220, used to judge, if the current video frame image is received, whether the frame number interval between the current video frame image and the video frame image on which full-image target detection was last performed is a preset interval, and if so, to trigger a full-image target detection module 230;
the full-image target detection module 230, used to perform full-image target detection on the current video frame image according to the pre-established full-image target detection model;
a first detection result module 240, used to, when the position and category of a detected target are obtained and no detected target exists in the previous video frame image of the current video frame image, take the detected target of the current video frame image as a first detected target, and, for each first detected target, determine, based on the position of the first detected target, the rectangular image area corresponding to the first detected target in the current video frame image, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a second detected target, and trigger the detection module 210;
a second detection result module 250, used to, when no detected target is obtained and a detected target exists in the previous video frame image of the current video frame image, take the detected target existing in the previous video frame image as a third detected target, and, for each third detected target, determine the rectangular image area corresponding to the third detected target in the current video frame image, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a fourth detected target, establish the correspondence between the fourth detected target and the third detected target, and trigger the detection module 210;
a third detection result module 260, used to, when the position and category of a detected target are obtained and a detected target exists in the previous video frame image of the current video frame image, take the detected target of the current video frame image and the detected target existing in the previous video frame image as fifth detected targets, and, for each fifth detected target, determine the rectangular image area corresponding to the fifth detected target in the video frame image where the fifth detected target is located, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a sixth detected target, perform target matching on the multiple sixth detected targets to obtain the successfully matched targets and unsuccessfully matched targets between the current video frame image and the previous video frame image, and trigger the detection module 210.
As can be seen from the above, when full-image target detection is performed on the current video frame image, this embodiment can merge the detection result of the previous video frame image with the detection result of the current video frame image and, by means of full-image/local alternating detection, continue with local target detection after the full-image target detection. This takes the relationship between consecutive video frame images into account and improves the detection accuracy of target detection. Moreover, since the embodiment of the present invention performs target detection on each video frame image based on the full-image target detection model and the local target detection model, the targets existing in every video frame image can all be detected; therefore, newly appearing targets in the video frames can be detected. At the same time, after local detection, the correspondence of the same target between the previous video frame image and the current video frame image, as well as the successfully matched targets between them, can be obtained, so that newly appearing targets can be tracked, rather than only the targets appearing in the first video frame image of the video.
In another embodiment of the present invention, the device may further include:
an output module, used to output, after the detection of whether a current video frame image of the surrounding environment collected in real time by the acquisition device is received, if the current video frame image is not received, the positions and categories of the detected targets existing in the previous video frame image of the current video frame image and the correspondence of each detected target.
In another embodiment of the present invention, the device may further include:
a fourth detection result module, used to, after the judgment of whether the frame number interval between the current video frame image and the video frame image on which full-image target detection was last performed is a preset interval, if it is not the preset interval and a detected target exists in the previous video frame image of the current video frame image, take the detected target existing in the previous video frame image as a seventh detected target, and, for each seventh detected target, determine the rectangular image area corresponding to the seventh detected target in the current video frame image, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of an eighth detected target, establish the correspondence between the eighth detected target and the seventh detected target, and trigger the detection module.
In another embodiment of the present invention, the device may further include a first training module, the first training module being used to train and obtain the full-image target detection model, and the first training module may include:
a first acquisition submodule, used to acquire a first sample image in the training set and the first position and first category corresponding to the target in the detection frame contained in the first sample image;
a first input submodule, used to input the first sample image and the first position and first category corresponding to the target in the detection frame contained in the first sample image into a first initial network model, where the first initial network model includes a first feature extraction layer, a region generation network layer, and a first regression layer;
a full-image feature vector determination submodule, used to determine the full-image feature vector of the first sample image through the first model parameters of the first feature extraction layer;
a feature information determination submodule, used to perform feature calculation on the full-image feature vector through the second model parameters of the region generation network layer to obtain the feature information of the candidate regions containing the first reference target;
a first generation submodule, used to perform regression on the feature information through the third model parameters of the first regression layer to obtain the first reference category to which the first reference target belongs and the first reference position of the first reference target in the first sample image;
a first difference calculation submodule, used to calculate the first difference value between the first reference category and the first category and the second difference value between the first reference position and the first position;
a first adjustment submodule, used to adjust the first model parameters, the second model parameters, and the third model parameters based on the first difference value and the second difference value, and to trigger the first acquisition submodule;
a first training completion submodule, used to complete the training when the number of iterations reaches the first preset number, obtaining a full-image target detection model that associates the first sample image with the position and category of the target in the detection frame.
In another embodiment of the present invention, the first detection result module 240 may be specifically used to:
for each first detected target, determine, based on the position of the first detected target, the coordinates of the upper left corner point and the coordinates of the lower right corner point of the first detected target in the current video frame image, and obtain, in the current video frame image, the rectangular image area with the upper left corner point and the lower right corner point as its diagonal;
calculate the scaled coordinates of the upper left corner point and the scaled coordinates of the lower right corner point according to the coordinates of the upper left corner point, the coordinates of the lower right corner point, the preset coordinate transformation coefficients, and the width and height of the input image of the pre-established local target detection model;
scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model based on the coordinates of the upper left corner point, the coordinates of the lower right corner point, the scaled coordinates of the upper left corner point, and the scaled coordinates of the lower right corner point.
In another embodiment of the present invention, the device may further include a second training module, the second training module being used to train and obtain the local target detection model, and the second training module may include:
a second acquisition submodule, used to acquire a second sample image in the training set and the second position and second category corresponding to the target in the detection frame contained in the second sample image;
a second input submodule, used to input the second sample image and the second position and second category corresponding to the target in the detection frame contained in the second sample image into a second initial network model, where the second initial network model includes a second feature extraction layer and a second regression layer;
a feature vector determination submodule, used to determine the feature vector of the second sample image through the fourth model parameters of the second feature extraction layer;
a second generation submodule, used to perform regression on the feature vector through the fifth model parameters of the second regression layer to obtain the second reference category to which a second reference target belongs and the second reference position of the second reference target in the second sample image;
a second difference calculation submodule, used to calculate the third difference value between the second reference category and the second category and the fourth difference value between the second reference position and the second position;
a second adjustment submodule, used to adjust the fourth model parameters and the fifth model parameters based on the third difference value and the fourth difference value, and to return to the step of acquiring a second sample image in the training set and the second position and second category corresponding to the target in the detection frame contained in the second sample image;
a second training completion submodule, used to complete the training when the number of iterations reaches the second preset number, obtaining a local target detection model that associates the second sample image with the position and category of the target in the detection frame.
In another embodiment of the present invention, the third detection result module 260 may be specifically used to:
for each sixth detected target of the current video frame image, determine the overlapping region and the union region between the sixth detected target and each sixth detected target of the previous video frame image, and calculate the quotient of the area of the overlapping region and the area of the union region;
take the sixth detected target of the current video frame image and the sixth detected target of the previous video frame image corresponding to a quotient not less than a preset threshold as successfully matched targets, and take the sixth detected target of the current video frame image and the sixth detected target of the previous video frame image corresponding to a quotient less than the preset threshold as unsuccessfully matched targets.
The above device embodiments correspond to the method embodiments and have the same technical effects as the method embodiments; for specific explanations, refer to the method embodiments. The device embodiments are obtained based on the method embodiments, and specific descriptions can be found in the method embodiment section, which are not repeated here.
Those of ordinary skill in the art can understand that the drawings are only schematic diagrams of one embodiment, and the modules or flows in the drawings are not necessarily required for implementing the present invention.
Those of ordinary skill in the art can understand that the modules in the device of an embodiment may be distributed in the device of the embodiment according to the description of the embodiment, or may, with corresponding changes, be located in one or more devices different from this embodiment. The modules of the above embodiments may be combined into one module or further split into multiple sub-modules.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of the technical features; such modifications or substitutions do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

  1. A video target detection and tracking method, characterized by comprising:
    detecting whether a current video frame image of the surrounding environment collected in real time by an acquisition device is received;
    if the current video frame image is received, judging whether the frame number interval between the current video frame image and the video frame image on which full-image target detection was last performed is a preset interval;
    if it is the preset interval, performing full-image target detection on the current video frame image according to a pre-established full-image target detection model;
    when the position and category of a detected target are obtained and no detected target exists in the previous video frame image of the current video frame image, taking the detected target of the current video frame image as a first detected target; for each first detected target, determining, based on the position of the first detected target, the rectangular image area corresponding to the first detected target in the current video frame image, scaling the width and height of the rectangular image area to the width and height of the input image of a pre-established local target detection model, inputting the scaled rectangular image area into the local target detection model to obtain the position and category of a second detected target, and returning to the step of detecting whether a current video frame image of the surrounding environment collected in real time by the acquisition device is received;
    when no detected target is obtained and a detected target exists in the previous video frame image of the current video frame image, taking the detected target existing in the previous video frame image as a third detected target; for each third detected target, determining the rectangular image area corresponding to the third detected target in the current video frame image, scaling the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, inputting the scaled rectangular image area into the local target detection model to obtain the position and category of a fourth detected target, establishing the correspondence between the fourth detected target and the third detected target, and returning to the step of detecting whether a current video frame image of the surrounding environment collected in real time by the acquisition device is received;
    when the position and category of a detected target are obtained and a detected target exists in the previous video frame image of the current video frame image, taking the detected target of the current video frame image and the detected target existing in the previous video frame image as fifth detected targets; for each fifth detected target, determining the rectangular image area corresponding to the fifth detected target in the video frame image where the fifth detected target is located, scaling the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, inputting the scaled rectangular image area into the local target detection model to obtain the position and category of a sixth detected target, performing target matching on the multiple sixth detected targets to obtain the successfully matched targets and unsuccessfully matched targets between the current video frame image and the previous video frame image, and returning to the step of detecting whether a current video frame image of the surrounding environment collected in real time by the acquisition device is received.
  2. The method according to claim 1, characterized in that, after the step of detecting whether a current video frame image of the surrounding environment collected in real time by the acquisition device is received, the method further comprises:
    if the current video frame image is not received, outputting the positions and categories of the detected targets existing in the previous video frame image of the current video frame image and the correspondence of each detected target.
  3. The method according to claim 1, characterized in that, after the step of judging whether the frame number interval between the current video frame image and the video frame image on which full-image target detection was last performed is a preset interval, the method further comprises:
    if it is not the preset interval, when a detected target exists in the previous video frame image of the current video frame image, taking the detected target existing in the previous video frame image as a seventh detected target; for each seventh detected target, determining the rectangular image area corresponding to the seventh detected target in the current video frame image, scaling the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, inputting the scaled rectangular image area into the local target detection model to obtain the position and category of an eighth detected target, establishing the correspondence between the eighth detected target and the seventh detected target, and returning to the step of detecting whether a current video frame image of the surrounding environment collected in real time by the acquisition device is received.
  4. The method according to claim 1, characterized in that the training process of the full-image target detection model is:
    acquiring a first sample image in a training set and the first position and first category corresponding to the target in the detection frame contained in the first sample image;
    inputting the first sample image and the first position and first category corresponding to the target in the detection frame contained in the first sample image into a first initial network model, wherein the first initial network model includes a first feature extraction layer, a region generation network layer, and a first regression layer;
    determining the full-image feature vector of the first sample image through the first model parameters of the first feature extraction layer;
    performing feature calculation on the full-image feature vector through the second model parameters of the region generation network layer to obtain feature information of candidate regions containing a first reference target;
    performing regression on the feature information through the third model parameters of the first regression layer to obtain the first reference category to which the first reference target belongs and the first reference position of the first reference target in the first sample image;
    calculating a first difference value between the first reference category and the first category, and calculating a second difference value between the first reference position and the first position;
    adjusting the first model parameters, the second model parameters, and the third model parameters based on the first difference value and the second difference value, and returning to the step of acquiring a first sample image in the training set and the first position and first category corresponding to the target in the detection frame contained in the first sample image;
    when the number of iterations reaches a first preset number, completing the training to obtain a full-image target detection model that associates the first sample image with the position and category of the target in the detection frame.
  5. The method according to claim 1, characterized in that the step of, for each first detected target, determining, based on the position of the first detected target, the rectangular image area corresponding to the first detected target in the current video frame image, and scaling the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model comprises:
    for each first detected target, determining, based on the position of the first detected target, the coordinates of the upper left corner point and the coordinates of the lower right corner point of the first detected target in the current video frame image, and obtaining, in the current video frame image, a rectangular image area with the upper left corner point and the lower right corner point as its diagonal;
    calculating the scaled coordinates of the upper left corner point and the scaled coordinates of the lower right corner point according to the coordinates of the upper left corner point, the coordinates of the lower right corner point, preset coordinate transformation coefficients, and the width and height of the input image of the pre-established local target detection model;
    scaling the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model based on the coordinates of the upper left corner point, the coordinates of the lower right corner point, the scaled coordinates of the upper left corner point, and the scaled coordinates of the lower right corner point.
  6. The method according to claim 1, characterized in that the training process of the local target detection model is:
    acquiring a second sample image in the training set and the second position and second category corresponding to the target in the detection frame contained in the second sample image;
    inputting the second sample image and the second position and second category corresponding to the target in the detection frame contained in the second sample image into a second initial network model, wherein the second initial network model includes a second feature extraction layer and a second regression layer;
    determining the feature vector of the second sample image through the fourth model parameters of the second feature extraction layer;
    performing regression on the feature vector through the fifth model parameters of the second regression layer to obtain the second reference category to which a second reference target belongs and the second reference position of the second reference target in the second sample image;
    calculating a third difference value between the second reference category and the second category, and calculating a fourth difference value between the second reference position and the second position;
    adjusting the fourth model parameters and the fifth model parameters based on the third difference value and the fourth difference value, and returning to the step of acquiring a second sample image in the training set and the second position and second category corresponding to the target in the detection frame contained in the second sample image;
    when the number of iterations reaches a second preset number, completing the training to obtain a local target detection model that associates the second sample image with the position and category of the target in the detection frame.
  7. The method according to claim 1, characterized in that the step of performing target matching on the multiple sixth detected targets to obtain the successfully matched targets and unsuccessfully matched targets between the current video frame image and the previous video frame image comprises:
    for each sixth detected target of the current video frame image, determining the overlapping region and the union region between the sixth detected target and each sixth detected target of the previous video frame image, and calculating the quotient of the area of the overlapping region and the area of the union region;
    taking the sixth detected target of the current video frame image and the sixth detected target of the previous video frame image corresponding to a quotient not less than a preset threshold as successfully matched targets, and taking the sixth detected target of the current video frame image and the sixth detected target of the previous video frame image corresponding to a quotient less than the preset threshold as unsuccessfully matched targets.
  8. A video target detection and tracking device, characterized by comprising:
    a detection module, used to detect whether a current video frame image of the surrounding environment collected in real time by an acquisition device is received;
    a judgment module, used to judge, if the current video frame image is received, whether the frame number interval between the current video frame image and the video frame image on which full-image target detection was last performed is a preset interval, and if so, to trigger a full-image target detection module;
    the full-image target detection module, used to perform full-image target detection on the current video frame image according to a pre-established full-image target detection model;
    a first detection result module, used to, when the position and category of a detected target are obtained and no detected target exists in the previous video frame image of the current video frame image, take the detected target of the current video frame image as a first detected target, and, for each first detected target, determine, based on the position of the first detected target, the rectangular image area corresponding to the first detected target in the current video frame image, scale the width and height of the rectangular image area to the width and height of the input image of a pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a second detected target, and trigger the detection module;
    a second detection result module, used to, when no detected target is obtained and a detected target exists in the previous video frame image of the current video frame image, take the detected target existing in the previous video frame image as a third detected target, and, for each third detected target, determine the rectangular image area corresponding to the third detected target in the current video frame image, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a fourth detected target, establish the correspondence between the fourth detected target and the third detected target, and trigger the detection module;
    a third detection result module, used to, when the position and category of a detected target are obtained and a detected target exists in the previous video frame image of the current video frame image, take the detected target of the current video frame image and the detected target existing in the previous video frame image as fifth detected targets, and, for each fifth detected target, determine the rectangular image area corresponding to the fifth detected target in the video frame image where the fifth detected target is located, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of a sixth detected target, perform target matching on the multiple sixth detected targets to obtain the successfully matched targets and unsuccessfully matched targets between the current video frame image and the previous video frame image, and trigger the detection module.
  9. The device according to claim 8, characterized in that the device further comprises:
    an output module, used to output, after the detection of whether a current video frame image of the surrounding environment collected in real time by the acquisition device is received, if the current video frame image is not received, the positions and categories of the detected targets existing in the previous video frame image of the current video frame image and the correspondence of each detected target.
  10. The device according to claim 8, characterized in that the device further comprises:
    a fourth detection result module, used to, after the judgment of whether the frame number interval between the current video frame image and the video frame image on which full-image target detection was last performed is a preset interval, if it is not the preset interval and a detected target exists in the previous video frame image of the current video frame image, take the detected target existing in the previous video frame image as a seventh detected target, and, for each seventh detected target, determine the rectangular image area corresponding to the seventh detected target in the current video frame image, scale the width and height of the rectangular image area to the width and height of the input image of the pre-established local target detection model, input the scaled rectangular image area into the local target detection model to obtain the position and category of an eighth detected target, establish the correspondence between the eighth detected target and the seventh detected target, and trigger the detection module.
PCT/CN2019/108080 2019-08-08 2019-09-26 Video target detection and tracking method and device WO2021022643A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910729242.7A CN112347817B (zh) 2019-08-08 2019-08-08 Video target detection and tracking method and device
CN201910729242.7 2019-08-08

Publications (1)

Publication Number Publication Date
WO2021022643A1 true WO2021022643A1 (zh) 2021-02-11

Family

ID=74367598

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/108080 WO2021022643A1 (zh) Video target detection and tracking method and device

Country Status (2)

Country Link
CN (1) CN112347817B (zh)
WO (1) WO2021022643A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113962141A (zh) * 2021-09-22 2022-01-21 北京智行者科技有限公司 Automated iteration method and device for a target detection model, and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150116504A1 (en) * 2008-12-04 2015-04-30 Sony Corporation Image processing device and method, image processing system, and image processing program
CN106228571A (zh) * 2016-07-15 2016-12-14 北京光年无限科技有限公司 Robot-oriented target tracking and detection method and device
CN106599836A (zh) * 2016-12-13 2017-04-26 北京智慧眼科技股份有限公司 Multi-face tracking method and tracking system
CN106875425A (zh) * 2017-01-22 2017-06-20 北京飞搜科技有限公司 Deep learning-based multi-target tracking system and implementation method
CN107563313A (zh) * 2017-08-18 2018-01-09 北京航空航天大学 Deep learning-based multi-target pedestrian detection and tracking method
CN108388879A (zh) * 2018-03-15 2018-08-10 斑马网络技术有限公司 Target detection method and device, and storage medium
CN108694724A (zh) * 2018-05-11 2018-10-23 西安天和防务技术股份有限公司 Long-term target tracking method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650630B (zh) * 2016-11-11 2019-08-23 纳恩博(北京)科技有限公司 Target tracking method and electronic device
CN108491816A (zh) * 2018-03-30 2018-09-04 百度在线网络技术(北京)有限公司 Method and device for target tracking in video
CN109035292B (zh) * 2018-08-31 2021-01-01 北京智芯原动科技有限公司 Deep learning-based moving target detection method and device
CN109584276B (zh) * 2018-12-04 2020-09-25 北京字节跳动网络技术有限公司 Key point detection method, device, equipment and readable medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966699A (zh) * 2021-03-24 2021-06-15 沸蓝建设咨询有限公司 Target detection system for communication engineering projects
CN114305317A (zh) * 2021-12-23 2022-04-12 广州视域光学科技股份有限公司 Method and system for intelligently identifying user-feedback optotypes
CN114305317B (zh) 2021-12-23 2023-05-12 广州视域光学科技股份有限公司 Method and system for intelligently identifying user-feedback optotypes

Also Published As

Publication number Publication date
CN112347817B (zh) 2022-05-17
CN112347817A (zh) 2021-02-09


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19940628

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19940628

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 14.11.2022)
