CN111325198A - Video object feature extraction method and device and video object matching method and device


Publication number: CN111325198A (application CN201811527701.5A)
Authority: CN (China)
Prior art keywords: video, identified, video frame, component, comprehensive
Legal status: Granted
Application number: CN201811527701.5A
Other languages: Chinese (zh)
Other versions: CN111325198B
Inventors: 陈广义, 鲁继文, 杨铭, 周杰
Current Assignee: Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee: Beijing Horizon Robotics Technology Research and Development Co Ltd
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN201811527701.5A
Publication of CN111325198A
Application granted; publication of CN111325198B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A video object feature extraction method and apparatus, a video object matching method and apparatus, an electronic device, and a computer-readable storage medium are disclosed, which solve the problem of low accuracy in existing video object feature extraction. The video object feature extraction method comprises the following steps: acquiring local feature information of each component area of the image plane where an object to be identified is located in each video frame of a video stream; acquiring spatial domain scores of each component area of the image plane where the object to be identified is located in each video frame; and acquiring comprehensive local feature information of each component area of the image plane where the object to be identified is located according to the acquired local feature information and spatial domain scores of each component area in each video frame.

Description

Video object feature extraction method and device and video object matching method and device
Technical Field
The invention relates to the technical field of video analysis, in particular to a video object feature extraction method and device, a video object matching method and device, electronic equipment and a computer-readable storage medium.
Background
The video object feature extraction technology is a process of extracting feature information representing a video object from a continuous video stream, and is widely applied in the field of video object identification and monitoring. Existing video object feature extraction methods extract video object features from the same region of interest across multiple video frames. However, in practical application scenarios, the monitoring scene of a video monitoring device is often dynamic. Once occlusion or other background interference occurs in the monitoring scene, the image content of the region of interest is biased and the video object features extracted from that region contain errors, which seriously reduces the accuracy of video object feature extraction.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for extracting video object features, a method and an apparatus for matching video objects, an electronic device, and a computer-readable storage medium, which solve the problem of low accuracy in extracting video object features in the prior art.
According to an aspect of the present invention, a video object feature extraction method provided by an embodiment of the present invention includes: acquiring local feature information of each component area of the image plane where an object to be identified is located in each video frame of a video stream; acquiring spatial domain scores of each component area of the image plane where the object to be identified is located in each video frame; and acquiring comprehensive local feature information of each component area of the image plane where the object to be identified is located according to the acquired local feature information and spatial domain scores of each component area in each video frame.
According to another aspect of the present invention, a video object matching method provided by an embodiment of the present invention includes: acquiring comprehensive local feature information of each component area of the image plane where a first object is located in a first video stream, comprehensive spatial domain scores of each component area, and global feature information of the first object; acquiring comprehensive local feature information of each component area of the image plane where a second object is located in a second video stream, comprehensive spatial domain scores of each component area, and global feature information of the second object; and judging whether the first object is consistent with the second object based on the comprehensive local feature information and comprehensive spatial domain scores of each component area of the image plane where the first object is located, the global feature information of the first object, the comprehensive local feature information and comprehensive spatial domain scores of each component area of the image plane where the second object is located, and the global feature information of the second object.
According to an aspect of the present invention, an embodiment of the present invention provides a video object feature extraction apparatus, including: the first characteristic acquisition module is configured to acquire local characteristic information of each component area of an image surface where an object to be identified is located in each video frame of a video stream; the spatial domain score acquisition module is configured to acquire spatial domain scores of all component areas of an image surface where an object to be identified is located in each video frame; and the local feature acquisition module is configured to acquire comprehensive local feature information of each component area of the image surface where the object to be identified is located according to the acquired local feature information and the spatial domain score of each component area of the image surface where the object to be identified is located in each video frame.
According to an aspect of the present invention, a video object matching apparatus provided by an embodiment of the present invention is communicatively connected to the video object feature extraction apparatus as described above, and includes: a measurement parameter obtaining module configured to obtain, from the video object feature extraction device, integrated local feature information of each component region of an image plane where a first object is located in a first video stream, an integrated spatial score of each component region of the first object, and global feature information of the first object; acquiring comprehensive local characteristic information of each component area of an image surface where a second object is located in a second video stream, comprehensive spatial domain scores of each component area of the second object and global characteristic information of the second object from the video object characteristic extraction device; and the measurement execution module is configured to judge whether the first object and the second object are consistent based on the comprehensive local feature information of each component area of the image surface where the first object is located, the comprehensive spatial domain score of each component area of the first object, the global feature information of the first object, the comprehensive local feature information of each component area of the image surface where the second object is located, the comprehensive spatial domain score of each component area of the second object, and the global feature information of the second object.
According to an aspect of the present invention, an embodiment of the present invention provides an electronic device, including: a processor; and a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform any of the methods described above.
According to an aspect of the present invention, an embodiment of the present invention provides a computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, cause the processor to perform the method as described in any of the preceding.
According to the video object feature extraction method and device, the video object matching method and device, the electronic device, and the computer-readable storage medium provided by the embodiments of the invention, spatial domain scores of each component area of the image plane where the object to be identified is located are obtained for each video frame, and these scores can be used to evaluate the image quality underlying the local feature information of each component area in each video frame. By referring to the spatial domain scores of each component area in each video frame, comprehensive local feature information of each component area of the image plane where the object to be identified is located can be obtained, so that the dynamic change of each component area in the time dimension is taken into account. Therefore, when a component area of the image plane where the object to be identified is located is occluded by an obstacle or interfered with by the background in a video frame, the spatial domain score obtained for that component area in that video frame is low; when the comprehensive local feature information of the component area is obtained, the importance of the local feature information from that video frame is reduced, its influence on the finally obtained comprehensive local feature information is reduced, and the accuracy of the comprehensive local feature information is improved. Interference from occlusions or other background in the monitoring scene can thus be effectively avoided, and the accuracy of video object feature extraction is remarkably improved.
Drawings
Fig. 1 is a schematic flow chart illustrating a method for extracting features of a video object according to an embodiment of the present invention.
Fig. 2 is a schematic diagram illustrating a principle of a video object feature extraction method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating a method for extracting features of a video object according to another embodiment of the present invention.
Fig. 4 is a schematic flow chart illustrating a method for extracting features of a video object according to another embodiment of the present invention.
Fig. 5 is a schematic diagram illustrating a method for extracting features of a video object according to another embodiment of the present invention.
Fig. 6 is a schematic flow chart illustrating a process of identifying a composition area in a video object feature extraction method according to an embodiment of the present invention.
Fig. 7a, 7b, and 7c are schematic diagrams illustrating a principle of identifying a composition region in a video object feature extraction method according to an embodiment of the present invention.
Fig. 8 is a schematic flow chart illustrating a process of acquiring local feature information of each component area in a video object feature extraction method according to an embodiment of the present invention.
Fig. 9 is a schematic flow chart illustrating a method for extracting features of a video object according to another embodiment of the present invention.
Fig. 10 is a schematic diagram illustrating a method for extracting features of a video object according to another embodiment of the present invention.
Fig. 11 is a schematic structural diagram of a residual attention network model in a video object feature extraction method according to another embodiment of the present invention.
Fig. 12 is a schematic flowchart illustrating a method for extracting features of a video object according to another embodiment of the present invention.
Fig. 13 is a schematic diagram illustrating a video object feature extraction method according to another embodiment of the present invention.
Fig. 14 is a schematic flow chart of a video object matching method according to an embodiment of the present invention.
Fig. 15 is a schematic structural diagram of a video object feature extraction apparatus according to an embodiment of the present invention.
Fig. 16 is a schematic structural diagram of a video object feature extraction apparatus according to another embodiment of the present invention.
Fig. 17 is a schematic structural diagram of a video object matching apparatus according to an embodiment of the present invention.
Fig. 18 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Summary of the application
In order to solve the problem of low accuracy of existing video object feature extraction, the dynamic change of the monitored scene must be taken into account, so as to reduce the influence of obstructions and other background interference in the monitored scene on the feature extraction result. Since occlusion and background interference are themselves dynamic, when a region of interest in one video frame is biased due to occlusion, the same region of interest in the next video frame, or in a video frame at a preset distance in the time dimension, may not be occluded. Therefore, an evaluation mechanism based on the time dimension needs to be established, so that the influence of biased regions of interest on the final feature extraction result is reduced and feature information is extracted mainly from unbiased regions of interest, thereby improving the accuracy of video feature extraction.
In view of the above technical problems, the basic concept of the present application is to provide a video object feature extraction method and apparatus, a video object matching method and apparatus, an electronic device, and a computer-readable storage medium. By obtaining spatial scores 52 of each component area of the image plane on which the object to be recognized is located in each video frame, the spatial scores 52 can be used to evaluate the image quality underlying the local feature information 51 of each component area in each video frame. By referring to the spatial domain scores 52 of the component areas in each video frame, the comprehensive local feature information 54 of the component areas of the image plane where the object to be recognized is located can be obtained, thereby taking into account the dynamic changes of the component areas in the time dimension. Therefore, when a component region of the image plane where the object to be identified is located is occluded by an obstacle or interfered with by the background in a video frame, the spatial domain score 52 obtained for that component region in that video frame is low, so that when the comprehensive local feature information 54 of the component region is obtained, the importance of the local feature information 51 from that video frame is reduced, its influence on the finally obtained comprehensive local feature information 54 is reduced, and the accuracy of the finally obtained comprehensive local feature information 54 is improved. Interference from occlusions or other background in the monitoring scene can thus be effectively avoided, and the accuracy of video object feature extraction is remarkably improved.
It should be noted that the feature information obtained by the video object feature extraction method provided by the embodiment of the present invention may be used in various video application scenarios such as video object matching, video object identification, video object monitoring, and video object tracking.
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary video object feature extraction method
Fig. 1 and fig. 2 are a schematic flowchart and a schematic diagram respectively illustrating a method for extracting features of a video object according to an embodiment of the present invention. As shown in fig. 1 and 2, the video object feature extraction method includes:
step 101: local feature information 51 of each component area of an image plane where an object to be identified is located in each video frame of a video stream is acquired.
The video stream is a video streaming media composed of a plurality of video frames for extracting feature information of a video object, and the video frames in the video stream can be acquired by splitting the video stream. It should be understood that the video stream may be a complete surveillance video file, may be a portion of a complete surveillance video file, or may be a combination of video frames selected from a complete surveillance video file. For example, when the video object feature extraction method is performed in real time, the video stream for extracting the feature information of the video object may be a video stream composed of a preset number of video frames before the current video frame, and the feature information of the video object is extracted by using the preset number of video frames. The length and content of the video stream are not limited by the present invention.
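As an illustration of the real-time case described above, a sliding buffer holding the most recent frames could be maintained as sketched below; the window length and the callback name are assumptions for illustration only, since the method merely requires "a preset number" of frames before the current one.

```python
from collections import deque

# Hypothetical sliding window over the most recent frames of a live stream.
# WINDOW_SIZE is an assumed value; the method only requires "a preset number".
WINDOW_SIZE = 8
frame_buffer = deque(maxlen=WINDOW_SIZE)

def on_new_frame(frame):
    """Append the current frame and, once enough frames have accumulated,
    return the list of frames used as the video stream for one extraction pass."""
    frame_buffer.append(frame)
    if len(frame_buffer) == WINDOW_SIZE:
        return list(frame_buffer)
    return None
```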
The image plane where the object to be recognized is located refers to an image of the object to be recognized in one video frame, the image plane where the object to be recognized is located can be composed of a plurality of component areas, and the specific component area forming mode can be determined according to the specific type of the object to be recognized. For example, when the object to be recognized is a human body, the image plane on which the human body is located may be composed of a plurality of component regions such as a head region, a body region, a left arm region, a right arm region, a left leg region, and a right leg region. However, it should be understood that the video object targeted by the video object feature extraction method may also be other objects besides a human body, such as vehicles, animals, and the like, and the composition of the video object and the image plane where the video object is located is not strictly limited by the present invention.
The local feature information 51 is used for representing the image content of a component area of an image plane where an object to be identified is located in a video frame, and the local feature information 51 corresponds to a component area in a video frame. For the entire video stream, each component region of the image plane on which the video object is located exists in each video frame, which means that each component region corresponds to a plurality of video frames, and each component region has a plurality of local feature information 51 corresponding to the plurality of video frames.
Step 102: and acquiring spatial domain values 52 of all component areas of the image surface of the object to be identified in each video frame.
The spatial score 52 is used to evaluate the image quality of a component region of the image plane where the object to be identified is located in a video frame. For example, when the object to be recognized is a human body, and when a head region of the human body in a video frame is occluded, the spatial score 52 of the head region in the video frame should be lower, because the image content quality of the head region in the video frame is poor, and the influence of the local feature information 51 acquired by the head region in the video frame on the final feature extraction result needs to be reduced.
In one embodiment of the present invention, as shown in FIG. 3, the spatial scores 52 may be obtained by a spatio-temporal evaluation neural network model 10. Specifically, each video frame of the video stream may be input into the spatio-temporal evaluation neural network model 10, and then the spatial scores 52 of the constituent regions of the image plane where the object to be identified is located in each video frame may be obtained from the spatio-temporal evaluation neural network model. It should be understood that the spatio-temporal evaluation neural network model 10 may be established through a pre-training process, for example, the spatio-temporal evaluation neural network model 10 may be obtained by manually evaluating a large number of image samples to obtain the spatial score 52 of each image sample, and then training the neural network based on the image samples and the corresponding spatial scores 52.
In an embodiment of the present invention, the trained spatio-temporal evaluation neural network model 10 may include a first convolutional layer 11, a first pooling layer 12, a second convolutional layer 13, a third convolutional layer 14, a second pooling layer 15, and a first fully-connected layer 16, which are connected in sequence along the data processing direction. It should be understood that the specific structure of the spatio-temporal evaluation neural network model 10 may also be adjusted according to the actual scene requirements, and the specific parameters (e.g., the block size, the stride size, and the output size) of each layer structure may also be adjusted according to the actual scene requirements, and the specific internal structure of the spatio-temporal evaluation neural network model 10 is not strictly limited by the present invention.
In one embodiment of the present invention, in order to make the spatial scores 52 of the component regions in each video frame convenient to aggregate and use, the spatial scores 52 of the component regions may be mapped to values between 0 and 1 by a continuous approximate sign function. As shown in FIG. 3, the continuous approximate sign function processing may be connected to the output of the spatio-temporal evaluation neural network model 10 to directly process its output and obtain spatial scores 52 with values between 0 and 1.
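A minimal sketch of such a spatio-temporal evaluation network in PyTorch is shown below, assuming an input frame size of 3×128×64 and six component regions; the channel counts and kernel sizes are illustrative assumptions, since the text does not specify them. A sigmoid plays the role of the continuous approximate sign function mapping each score into (0, 1).

```python
import torch
import torch.nn as nn

class SpatioTemporalEvaluationNet(nn.Module):
    """Illustrative stand-in for the spatio-temporal evaluation model 10:
    conv -> pool -> conv -> conv -> pool -> fully-connected, followed by a
    sigmoid that maps each component-region score into (0, 1).
    All channel counts and the 3x128x64 input size are assumptions."""
    def __init__(self, num_regions: int = 6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # first convolutional layer
            nn.MaxPool2d(2),                               # first pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # second convolutional layer
            nn.Conv2d(32, 32, kernel_size=3, padding=1),   # third convolutional layer
            nn.MaxPool2d(2),                               # second pooling layer
        )
        self.fc = nn.Linear(32 * 32 * 16, num_regions)     # first fully-connected layer

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        x = self.features(frame)                 # (N, 32, 32, 16) for a 3x128x64 input
        x = x.flatten(1)
        return torch.sigmoid(self.fc(x))         # one spatial score per component region

scores = SpatioTemporalEvaluationNet()(torch.randn(1, 3, 128, 64))  # shape (1, 6)
```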
Step 103: and acquiring comprehensive local characteristic information 54 of each component area of the image surface of the object to be identified according to the acquired local characteristic information 51 and spatial score 52 of each component area of the image surface of the object to be identified in each video frame.
As described above, by obtaining the spatial score 52 of each component area on the image plane where the object to be recognized is located in each video frame, the local feature information 51 of each component area on the image plane where the object to be recognized is located in each video frame can be evaluated by the spatial score 52. Thus, by referring to the spatial domain score 52 of each component area of the image surface where the object to be recognized is located in each video frame, the comprehensive local feature information 54 of each component area of the image surface where the object to be recognized is located can be obtained, and therefore the dynamic change of each component area of the image surface where the object to be recognized is located in the time dimension is taken into account.
In an embodiment of the present invention, the spatial score 52 of each video frame corresponding to a component region may be used as a weight, and the local feature information 51 of each video frame corresponding to a component region may be integrated into the integrated local feature information 54 of the component region. Thus, the component regions with lower spatial scores 52 are also lower weighted and have a lower impact on the final captured integrated local feature information 54.
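A short sketch of the score-weighted aggregation just described, assuming the local features of one component region have already been collected as a per-frame array; normalizing by the sum of scores is one reasonable choice and is not mandated by the text.

```python
import numpy as np

def aggregate_local_features(local_feats: np.ndarray, spatial_scores: np.ndarray) -> np.ndarray:
    """local_feats: (num_frames, feat_dim) local feature vectors of one component region.
    spatial_scores: (num_frames,) spatial scores of that region in each frame.
    Returns the comprehensive local feature of the region as a score-weighted average."""
    weights = spatial_scores / (spatial_scores.sum() + 1e-8)  # occluded (low-score) frames weigh less
    return weights @ local_feats
```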
Therefore, by adopting the video object feature extraction method provided by the embodiment of the invention, when a certain composition region of an image plane where an object to be identified is located in a video frame is shielded by an obstacle or interfered by a background image, the spatial score 52 obtained by the composition region in the video frame is also lower, so that when the comprehensive local feature information 54 of the composition region is obtained, the importance of the local feature information 51 of the composition region in the video frame is reduced, the influence of the local feature information 51 of the composition region in the video frame on the finally obtained comprehensive local feature information 54 is also reduced, the accuracy of the finally obtained comprehensive local feature information 54 is improved, the interference of a monitoring scene shield or other backgrounds can be effectively avoided, and the accuracy of video object feature extraction is remarkably improved.
Fig. 4 and fig. 5 are a schematic flowchart and a schematic diagram illustrating a method for extracting features of a video object according to an embodiment of the present invention. As shown in fig. 4 and 5, the local feature information 51 in the video object feature extraction method may be obtained by:
step 1011: and acquiring a global feature map of the object to be identified in each video frame.
The global feature map is used to represent the overall image content of the image plane where the object to be identified is located in a video frame. For example, in some application scenarios, in order to remove background influence and highlight the outline of the object to be identified, a global feature map describing the appearance information of the video frame can be obtained through a full convolution operator, and no detection or segmentation of local regions is involved in obtaining the global feature map. In an embodiment of the present invention, as shown in fig. 5, the global feature map may be obtained by the first global feature extraction neural network model 20. Specifically, each video frame of the video stream may be input into the first global feature extraction neural network model 20, and the global feature map of each video frame may then be obtained from the first global feature extraction neural network model 20. It should be understood that the first global feature extraction neural network model 20 may be established through a pre-training process; for example, a global feature map of each of a large number of image samples may be obtained through a preliminary image processing procedure, and the neural network may then be trained based on the image samples and the corresponding global feature maps.
In an embodiment of the present invention, the trained first global feature extraction neural network model 20 may include a fourth convolution layer 21, a first initiation layer 22, a second initiation layer 23, and a third initiation layer 24, which are sequentially connected along the data processing direction. It should be understood that the specific structure of the first global feature extraction neural network model 20 may also be adjusted according to the actual scene requirement, and the specific parameters (e.g., block size, stride size, and output size) of each layer structure may also be adjusted according to the actual scene requirement, and the specific internal structure of the first global feature extraction neural network model 20 is not strictly limited by the present invention.
Step 1012: and identifying each composition area of the image plane where the object to be identified is located in each video frame.
As mentioned above, the image plane where the object to be recognized is located may be composed of a plurality of component areas, the specific component area configuration manner may be determined according to the specific type of the object to be recognized, and the specific recognition manner may also be adjusted according to the specific type of the object to be recognized.
In an embodiment of the present invention, each component area of the image plane where the object to be recognized is located may be determined according to the feature recognition point. Specifically, as shown in fig. 6, identifying each component region of the image plane where the object to be identified is located in each video frame may include the following steps:
step 10121: a plurality of feature recognition points in a video frame are identified.
As shown in fig. 7a, if the object to be recognized in the video frame is a human body, the feature recognition points on the image plane where the human body is located may be J1-J14. The recognition process of the feature recognition points can also be realized by a pre-trained neural network model, and details are not repeated here.
Step 10122: and identifying each composition area of the image surface where the object to be identified is located according to the plurality of feature identification points, wherein each composition area is determined according to the positions of at least two feature identification points.
For example, based on the feature recognition points shown in fig. 7a, six different parts of the human body can be divided, each corresponding to one component area of the image plane where the object to be recognized is located. Specifically, as shown in fig. 7b, the feature recognition points J1, J2, J3 and J6 correspond to the head of the human body, J2, J3, J6, J9 and J12 correspond to the body of the human body, J3, J4 and J5 correspond to the left arm, J6, J7 and J8 correspond to the right arm, J12, J13 and J14 correspond to the right leg, and J9, J10 and J11 correspond to the left leg. After the different parts of the human body are divided as shown in fig. 7b, the corresponding component region configuration can be determined based on this division, as shown in fig. 7c.
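As a sketch of how component regions could be derived from the keypoints J1–J14, one could take the bounding box of each keypoint group; the grouping below follows the assignment in the text, while the box construction and the margin are assumptions made for illustration.

```python
# Hypothetical keypoint-to-region grouping following the text: each component
# region is taken as the bounding box of its keypoints, expanded by a small margin.
PART_KEYPOINTS = {
    "head":      ["J1", "J2", "J3", "J6"],
    "body":      ["J2", "J3", "J6", "J9", "J12"],
    "left_arm":  ["J3", "J4", "J5"],
    "right_arm": ["J6", "J7", "J8"],
    "left_leg":  ["J9", "J10", "J11"],
    "right_leg": ["J12", "J13", "J14"],
}

def part_regions(keypoints: dict, margin: int = 10) -> dict:
    """keypoints maps 'J1'..'J14' to (x, y) pixel coordinates.
    Returns an (x1, y1, x2, y2) box per component region."""
    boxes = {}
    for part, names in PART_KEYPOINTS.items():
        xs = [keypoints[n][0] for n in names]
        ys = [keypoints[n][1] for n in names]
        boxes[part] = (min(xs) - margin, min(ys) - margin,
                       max(xs) + margin, max(ys) + margin)
    return boxes
```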
Step 1013: according to the global feature map and the component areas of each video frame, local feature information 51 of the component areas of the image plane where the object to be identified is located in each video frame is obtained.
In an embodiment of the present invention, the local feature information 51 of each component area may be obtained based on the region-of-interest pooling process and the neural network, as shown in fig. 8 and 5, the method may specifically include the following steps:
step 10131: and performing region-of-interest pooling on the global feature map of the video frame and each composition area of the image plane where the object to be identified is located in the video frame.
Specifically, each composition area in the acquired video frame is used as an interesting area to be mapped to a corresponding position of the global feature map, the mapped global feature map area is divided into a plurality of sub-areas corresponding to the output dimension, and then maximum pooling processing is performed on each sub-area respectively to obtain the feature map corresponding to each interesting area. The feature map corresponding to each region of interest after the region of interest pooling processing has a smaller data size than the global feature map and can be used for representing the image content of each region of interest, so that the efficiency of extracting the whole video feature information can be remarkably improved.
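A simplified sketch of the region-of-interest pooling step described above: each component region is mapped onto the global feature map, split into a grid of sub-areas matching the output dimension, and max-pooled per sub-area. Real implementations handle scaling and rounding more carefully; the output grid size here is an assumption.

```python
import numpy as np

def roi_max_pool(feature_map: np.ndarray, box, output_size=(4, 4)) -> np.ndarray:
    """feature_map: (C, H, W) global feature map of one video frame.
    box: (x1, y1, x2, y2) component region already mapped to feature-map coordinates.
    Returns a (C, out_h, out_w) fixed-size feature map for the region."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    region = feature_map[:, y1:y2, x1:x2]
    c, h, w = region.shape
    out_h, out_w = output_size
    pooled = np.zeros((c, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Each output cell max-pools one sub-area of the mapped region.
            ys = slice(i * h // out_h, max((i + 1) * h // out_h, i * h // out_h + 1))
            xs = slice(j * w // out_w, max((j + 1) * w // out_w, j * w // out_w + 1))
            pooled[:, i, j] = region[:, ys, xs].max(axis=(1, 2))
    return pooled
```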
Step 10132: inputting the result of the region-of-interest pooling into the local feature extraction neural network model, and acquiring local feature information 51 of each component region of the image plane where the object to be identified is located in the video frame output by the local feature extraction neural network model.
As mentioned above, the result of the region-of-interest pooling process is a feature map corresponding to each region of interest, and each region of interest corresponds to one component region of the video frame, so when the feature map corresponding to each region of interest is input into the local feature extraction neural network model, the output of the local feature extraction neural network model is the local feature information 51 corresponding to each component region. It should be understood that the local feature extraction neural network model may be established through a pre-training process; for example, feature information of each of a large number of image samples may be obtained in advance, and the neural network may then be trained based on the image samples and the corresponding feature information.
In an embodiment of the present invention, the trained local feature extraction neural network model 30 may include a first local initiation layer 31, a second local initiation layer 32, a third local initiation layer 33, and a second fully-connected layer 34, which are sequentially connected along the data processing direction. It should be understood that the specific structure of the local feature extraction neural network model 30 can also be adjusted according to actual scene requirements, and the specific parameters (e.g., the block size, the stride size, and the output size) of each layer can also be adjusted according to actual scene requirements; the specific internal structure of the local feature extraction neural network model 30 is not strictly limited by the present invention.
Fig. 9 and fig. 10 are a schematic flowchart and a schematic diagram respectively illustrating a method for extracting features of a video object according to an embodiment of the present invention. As shown in fig. 9 and 10, in order to further improve the completeness of video feature extraction, in addition to the comprehensive local feature information 54 of each component area, the video object feature extraction method may further obtain global feature information 53 of the object to be identified, and the global feature information 53 and the comprehensive local feature information 54 of each component area of the image plane where the object to be identified is located are used together to represent the object to be identified. Specifically, the video object feature extraction method may further include:
step 901: the global feature map for each video frame is input to the second global feature extraction neural network model 40.
Step 902: global feature information 53 of the object to be recognized in the video stream output by the second global feature extraction neural network model 40 is obtained.
As mentioned above, the global feature map of each video frame is used to represent the overall image content of the image plane where the object to be identified is located in the video frame. The global feature map of the video frame is input into the second global feature extraction neural network model 40, so as to obtain the global feature information 53 output by the second global feature extraction neural network model 40. It should be understood that the second global feature extraction neural network model 40 may be established through a pre-training process, for example, the second global feature extraction neural network model 40 may be obtained by performing the pre-training process on a large number of feature map samples to obtain feature information of each feature map sample, and then performing neural network training based on the feature map samples and corresponding feature information.
In an embodiment of the present invention, the trained second global feature extraction neural network model 40 may include a first global initiation layer 41, a residual attention network model 42, a second global initiation layer 43, a third global initiation layer 44, and a fourth global initiation layer 45, which are sequentially connected along a data processing direction. It should be understood that the specific structure of the second global feature extraction neural network model 40 can also be adjusted according to the actual scene requirement, and the specific parameters (e.g., block size, stride size, and output size) of each layer structure can also be adjusted according to the actual scene requirement, and the specific internal structure of the second global feature extraction neural network model 40 is not strictly limited by the present invention.
In a further embodiment, to further improve the accuracy of the global feature information 53 extraction, as shown in fig. 11, the residual attention network model 42 in the second global feature extraction neural network model 40 may include: a first neural network module 421 and a convolutional neural network module 422 that process data in parallel. The first neural network module 421 may include: a fifth convolutional layer 4211, a third pooling layer 4212, a sixth convolutional layer 4213, a deconvolution layer 4214, a seventh convolutional layer 4215, and a continuous approximate sign function processing layer 4216, which are connected in this order in the data processing direction. The output result of the first neural network module 421 is integrated with the output result of the convolutional neural network module 422 as the output result of the residual attention network module 42.
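A minimal sketch of a residual-attention block of this shape is given below, assuming equal input and output channel counts; the way the two branch outputs are integrated is not spelled out in the text, so an element-wise residual gating (a common choice for residual attention) is used here purely for illustration.

```python
import torch
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    """Illustrative residual attention block: a mask branch (conv -> pool -> conv ->
    deconv -> conv -> sigmoid) run in parallel with a trunk convolution branch.
    Channel counts and the residual gating used to integrate the branches are assumptions."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.mask_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),        # fifth convolutional layer
            nn.MaxPool2d(2),                                    # third pooling layer
            nn.Conv2d(channels, channels, 3, padding=1),        # sixth convolutional layer
            nn.ConvTranspose2d(channels, channels, 2, stride=2),# deconvolution layer
            nn.Conv2d(channels, channels, 1),                   # seventh convolutional layer
            nn.Sigmoid(),                                       # continuous approximate sign function
        )
        self.trunk_branch = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = self.mask_branch(x)        # attention weights in (0, 1)
        trunk = self.trunk_branch(x)
        return trunk * (1.0 + mask)       # integrate the two branch outputs

out = ResidualAttentionBlock()(torch.randn(1, 64, 32, 16))  # same spatial size as the input
```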
Fig. 12 and fig. 13 are a schematic flowchart and a schematic diagram respectively illustrating a method for extracting features of a video object according to an embodiment of the present invention. As shown in fig. 12 and 13, in order to further improve the integrity of video feature extraction, in addition to the global feature information 53 and the comprehensive local feature information 54 of each component area, the video object feature extraction method may further include:
step 1201: and acquiring comprehensive spatial domain scores 55 of all component areas of the image surface of the object to be identified according to the spatial domain scores 52 of all component areas of the image surface of the object to be identified in each video frame.
The global feature information 53, the comprehensive local feature information 54 of each component area of the image surface where the object to be recognized is located, and the comprehensive spatial score 55 of each component area of the image surface where the object to be recognized is located are used for representing the object to be recognized together.
In an embodiment of the present invention, a weighted average value of all spatial scores 52 corresponding to one component region of the object to be identified may be obtained, and the weighted average value may be used as a comprehensive spatial score 55 of the component region. The weight of the component region can be determined according to the size of the spatial score 52 of the component region, so that when the spatial score 52 of a component region of a video frame is low, the influence of the spatial score 52 of the component region of the video frame on the final comprehensive spatial score 55 of the component region is also low, thereby being beneficial to further improving the accuracy of video object feature extraction.
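Continuing the aggregation sketch from step 103 above, the comprehensive spatial score of a component region can be computed as a self-weighted average of its per-frame scores, which is one reading of the description here; the exact weighting scheme is an assumption.

```python
import numpy as np

def comprehensive_spatial_score(spatial_scores: np.ndarray) -> float:
    """spatial_scores: (num_frames,) scores of one component region across frames.
    Uses the scores themselves as weights, so frames in which the region is
    occluded (low score) contribute little to the comprehensive score."""
    weights = spatial_scores / (spatial_scores.sum() + 1e-8)
    return float(weights @ spatial_scores)
```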
Exemplary video object matching method
Fig. 14 is a schematic flow chart of a video object matching method according to an embodiment of the present invention. As shown in fig. 14, the video object matching method may include the steps of:
step 1401: the integrated local feature information 54 of each component region of the image plane where the first object is located in the first video stream, the integrated spatial score 55 of each component region, and the global feature information 53 of the first object are obtained.
Specifically, the comprehensive local feature information 54 of each component region of the image plane where the first object is located may be obtained by the video object feature extraction method shown in fig. 1 and 2, the comprehensive spatial score 55 of each component region may be obtained by the video object feature extraction method shown in fig. 12 and 13, and the global feature information 53 of the first object may be obtained by the video object feature extraction method shown in fig. 9 and 10.
Step 1402: the comprehensive local feature information 54 of each component area of the image plane where the second object is located, the comprehensive spatial score 55 of each component area, and the global feature information 53 of the second object in the second video stream are obtained.
Specifically, the comprehensive local feature information 54 of each component region of the image plane where the second object is located may be obtained by the video object feature extraction method shown in fig. 1 and 2, the comprehensive spatial score 55 of each component region may be obtained by the video object feature extraction method shown in fig. 12 and 13, and the global feature information 53 of the second object may be obtained by the video object feature extraction method shown in fig. 9 and 10.
Step 1403: whether the first object is consistent with the second object is judged based on the comprehensive local feature information 54 of each component area of the image surface where the first object is located, the comprehensive spatial score 55 of each component area of the image surface where the first object is located, the global feature information 53 of the first object, the comprehensive local feature information 54 of each component area of the image surface where the second object is located, the comprehensive spatial score 55 of each component area of the image surface where the second object is located, and the global feature information 53 of the second object.
Specifically, the local feature distance between the first object and the second object may be calculated using the integrated local feature information 54 of each component region on the image plane where the first object is located and the integrated local feature information 54 of each component region on the image plane where the second object is located as the measurement variables, with the integrated spatial score 55 of each component region on the image plane where the first object is located and the integrated spatial score 55 of each component region on the image plane where the second object is located as the weights. Then, the global feature distance between the first object and the second object is calculated by using the global feature information 53 of the first object and the global feature information 53 of the second object as metric variables. And finally, judging whether the first object is consistent with the second object according to the local characteristic distance and the global characteristic distance. It should be understood that the specific principle of determining whether to be consistent may be adjusted according to the actual application scenario, for example, a first threshold and a second threshold may be set, and the first object and the second object may be determined to be consistent only when the local feature distance is lower than the first threshold and the global feature distance is lower than the second threshold, and the specific principle of determination is not strictly limited in the present invention.
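A sketch of the matching rule described above, with the score-weighted local distance, the global distance, and two thresholds; the use of Euclidean distance and the specific threshold values are assumptions, since the text leaves the metric and decision rule open.

```python
import numpy as np

def objects_match(local1, scores1, global1, local2, scores2, global2,
                  local_thr: float = 0.5, global_thr: float = 0.5) -> bool:
    """local1/local2: (num_regions, feat_dim) comprehensive local features per component region.
    scores1/scores2: (num_regions,) comprehensive spatial scores per component region.
    global1/global2: (feat_dim,) global feature vectors of the two objects."""
    # Per-region Euclidean distances, weighted by both objects' comprehensive spatial scores.
    region_dists = np.linalg.norm(local1 - local2, axis=1)
    weights = scores1 * scores2
    local_dist = float((weights * region_dists).sum() / (weights.sum() + 1e-8))
    global_dist = float(np.linalg.norm(global1 - global2))
    # Judge the objects consistent only if both distances fall below their thresholds.
    return local_dist < local_thr and global_dist < global_thr
```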
In an embodiment of the present invention, the video object feature matching method may be implemented by a trained model, and a min-type objective function may be used for training. The objective function takes as inputs the global feature information 53 of the video object, the comprehensive local feature information 54 of each component region of the image plane where the video object is located, and the comprehensive spatial score 55 of each component region. The smaller the value of the objective function, the better the trained model, that is, the higher the accuracy of video object feature matching.
Specifically, the min function is composed of three parts: a ranking term, which ensures that, for the three types of feature information of each video stream (global feature information 53, comprehensive local feature information 54, and comprehensive spatial score 55), the negative-sample distance is larger than the positive-sample distance by at least a threshold; a classification term, which ensures that each video stream can be correctly classified so that the intra-class variance is small; and a consistency term, which ensures consistency between the global feature information 53 and the comprehensive local feature information 54 of each component area.
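For orientation only, an objective with this three-part shape could be written as follows, where f_g stands for the global feature information 53, f_k for the comprehensive local feature information 54 of component region k, and s_k for its comprehensive spatial score 55; the concrete forms (triplet margin ranking, cross-entropy classification, score-weighted L2 consistency) are illustrative assumptions and do not reproduce the patent's own equations.

```latex
% Illustrative three-term objective; concrete term forms are assumptions.
\min_{\theta}\;
\underbrace{\sum_{(a,p,n)} \max\!\bigl(0,\; m + d(v_a, v_p) - d(v_a, v_n)\bigr)}_{\text{ranking: negatives farther than positives by margin } m}
\;+\;
\underbrace{\sum_{i} -\log p\!\bigl(y_i \mid f_g^{(i)}, \{f_k^{(i)}\}, \{s_k^{(i)}\}\bigr)}_{\text{classification of each video stream}}
\;+\;
\underbrace{\sum_{i}\sum_{k} s_k^{(i)}\,\bigl\lVert f_g^{(i)} - f_k^{(i)} \bigr\rVert_2^2}_{\text{global--local consistency}}
```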
It should be understood that although the above description uses three types of feature information, i.e., the global feature information 53, the comprehensive local feature information 54, and the comprehensive spatial score 55, in other embodiments of the present invention, whether the first object is consistent with the second object may be determined based on only the comprehensive local feature information 54 of each component area of the image plane where the first object is located and the comprehensive local feature information 54 of each component area of the image plane where the second object is located; or judging whether the first object is consistent with the second object based on the comprehensive local characteristic information 54 of each composition area of the image surface where the first object is located, the comprehensive spatial domain score 55 of each composition area of the image surface where the first object is located, the comprehensive local characteristic information 54 of each composition area of the image surface where the second object is located and the comprehensive spatial domain score 55 of each composition area of the image surface where the second object is located; or, whether the first object and the second object are consistent is judged based on the comprehensive local feature information 54 of each composition area of the image plane where the first object is located, the global feature information 53 of the first object, the comprehensive local feature information 54 of each composition area of the image plane where the second object is located, and the global feature information 53 of the second object. The invention does not strictly limit the matching process of the video objects between different video streams by specifically utilizing which characteristic information.
In addition, it should be further understood that, because the feature information acquired by the video object feature extraction method provided by the embodiment of the present invention can accurately represent a video object, the feature information acquired by the video object feature extraction method provided by the embodiment of the present invention can be used in a video object matching process, and can also be used in other video application scenarios such as video object identification, video object monitoring, video object tracking, and the like.
Exemplary video object feature extraction apparatus
Fig. 15 is a schematic structural diagram of a video object feature extraction apparatus according to an embodiment of the present invention. As shown in fig. 15, the video object feature extraction device 150 includes: a first feature acquisition module 1501, configured to acquire local feature information 51 of each component area of an image plane where an object to be identified is located in each video frame of a video stream; a spatial score acquisition module 1502 configured to acquire spatial scores 52 of each component region of an image plane where an object to be identified is located in each video frame; and a local feature obtaining module 1503, configured to obtain, according to the obtained local feature information 51 and spatial score 52 of each component region of the image plane where the object to be identified is located in each video frame, comprehensive local feature information 54 of each component region of the image plane where the object to be identified is located.
In an embodiment of the present invention, the local feature obtaining module 1503 is further configured to: the spatial score 52 of each video frame corresponding to a component region is taken as a weight, and the local feature information 51 of each video frame corresponding to the component region is integrated into the comprehensive local feature information 54 of the component region.
In an embodiment of the present invention, the spatial score obtaining module 1502 is further configured to: inputting each video frame into a space-time evaluation neural network model 10; and obtaining spatial scores 52 of all component areas of the image surface where the object to be identified is located in each video frame output by the spatial-temporal evaluation neural network model 10.
In one embodiment of the present invention, the spatio-temporal evaluation neural network model 10 includes: the data processing device comprises a first convolution layer 11, a first pooling layer 12, a second convolution layer 13, a third convolution layer 14, a second pooling layer 15 and a first full-connection layer 16 which are connected in sequence along the data processing direction.
In an embodiment of the present invention, the video object feature extraction apparatus 150 further includes: a spatial score processing module 1504 configured to map the spatial scores 52 of the component regions to values between 0 and 1 by a continuous approximate sign function.
In an embodiment of the present invention, as shown in fig. 16, the first feature obtaining module 1501 may include: a global feature map acquisition unit 15011 configured to acquire a global feature map of an object to be identified in each video frame; a component area identifying unit 15012 configured to identify component areas of an image plane where an object to be identified is located in each video frame; and a first feature acquisition unit 15013 configured to acquire local feature information 51 of each component area of the image plane on which the object to be identified is located in each video frame, based on the global feature map and each component area of each video frame.
In an embodiment of the present invention, the global feature map obtaining unit 15011 is further configured to: inputting each video frame into a first global feature extraction neural network model 20; and obtains a global feature map for each video frame output by the first global feature extraction neural network model 20.
In an embodiment of the present invention, the first global feature extraction neural network model 20 includes: a fourth convolution layer 21, a first initiation layer 22, a second initiation layer 23, and a third initiation layer 24 connected in this order in the data processing direction.
In an embodiment of the invention, the composition area identifying unit 15012 is further configured to: identifying a plurality of feature recognition points in a video frame; and identifying each composition area of the image surface where the object to be identified is located according to the plurality of feature identification points, wherein each composition area is determined according to the positions of at least two feature identification points.
In an embodiment of the present invention, the first feature obtaining unit 15013 is further configured to: performing region-of-interest pooling on the global feature map of the video frame and each component area of an image surface where an object to be identified is located in the video frame; inputting the result of the pooling process of the region of interest into the local feature extraction neural network model 30; and acquiring local feature information 51 of each composition region of the image plane where the object to be identified is located in the video frame output by the local feature extraction neural network model 30.
In an embodiment of the present invention, the local feature extraction neural network model 30 includes: a first local initiation layer 31, a second local initiation layer 32, a third local initiation layer 33, and a second fully-connected layer 34, which are connected in sequence along the data processing direction.
In an embodiment of the present invention, as shown in fig. 16, the video object feature extraction apparatus 150 may further include: a global feature information obtaining module 1505 configured to input the global feature map of each video frame into the second global feature extraction neural network model 40; and global feature information 53 of the object to be recognized in the video stream output by the second global feature extraction neural network model 40 is acquired.
In an embodiment of the present invention, the second global feature extraction neural network model 40 includes: a first global inception layer 41, a residual attention network model 42, a second global inception layer 43, a third global inception layer 44, and a fourth global inception layer 45, connected in sequence along the data processing direction.
In one embodiment of the present invention, the residual attention network model 42 includes a first neural network module 421 and a convolutional neural network module 422 that process data in parallel. The first neural network module 421 includes: a fifth convolutional layer 4211, a third pooling layer 4212, a sixth convolutional layer 4213, a deconvolution layer 4214, a seventh convolutional layer 4215, and a continuous sign-approximation processing layer 4216, connected in sequence along the data processing direction. The output of the first neural network module 421 and the output of the convolutional neural network module 422 are integrated to form the output of the residual attention network model 42.
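A minimal sketch of this structure, assuming the mask branch modulates the convolutional (trunk) branch residually as trunk * (1 + mask); the channel counts, the sigmoid standing in for the continuous sign-approximation layer, and the combination rule are assumptions rather than details fixed by the text.

```python
import torch
import torch.nn as nn

class ResidualAttention(nn.Module):
    """Illustrative residual attention block: mask branch and trunk branch in parallel."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.mask_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),          # fifth convolutional layer 4211
            nn.MaxPool2d(2),                                      # third pooling layer 4212
            nn.Conv2d(channels, channels, 3, padding=1),          # sixth convolutional layer 4213
            nn.ConvTranspose2d(channels, channels, 2, stride=2),  # deconvolution layer 4214
            nn.Conv2d(channels, channels, 3, padding=1),          # seventh convolutional layer 4215
            nn.Sigmoid(),  # stands in for the continuous sign-approximation layer 4216
        )
        self.trunk_branch = nn.Sequential(                        # convolutional module 422
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = self.mask_branch(x)
        trunk = self.trunk_branch(x)
        return trunk * (1.0 + mask)   # integrate the two parallel outputs residually
```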
In an embodiment of the present invention, as shown in fig. 16, the video object feature extraction apparatus 150 may further include: a comprehensive spatial score obtaining module 1506 configured to acquire the comprehensive spatial score 55 of each component area of the image plane where the object to be identified is located according to the spatial scores 52 of each component area of the image plane where the object to be identified is located in each video frame.
In an embodiment of the present invention, the comprehensive spatial score obtaining module 1506 is further configured to: compute a weighted average of all the spatial scores 52 corresponding to one component area of the object to be identified, and take the weighted average as the comprehensive spatial score 55 of that component area.
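For illustration, the sketch below fuses per-frame quantities over a video stream: the comprehensive spatial score of each component area is taken as a weighted average of its per-frame spatial scores (here self-weighted by the normalized scores, an assumption, since the weighting is not fixed by the text), and the per-frame local features are fused with the same weights, in line with the weighted-integration step described earlier.

```python
import torch

def fuse_over_frames(scores: torch.Tensor, local_feats: torch.Tensor):
    """Fuse per-frame scores and local features of each component area over a video stream.

    scores:      (T, K) spatial scores for T frames and K component areas,
                 assumed already mapped into (0, 1) by the spatial score processing module
    local_feats: (T, K, D) per-frame local feature vectors of the K component areas
    """
    weights = scores / scores.sum(dim=0, keepdim=True).clamp(min=1e-6)      # normalize over frames
    comprehensive_score = (scores * weights).sum(dim=0)                     # (K,) weighted averages
    comprehensive_local = (local_feats * weights.unsqueeze(-1)).sum(dim=0)  # (K, D) fused features
    return comprehensive_score, comprehensive_local
```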
The detailed functions and operations of the respective modules of the video object feature extraction apparatus 150 have already been described in the video object feature extraction method above with reference to figs. 1 to 14, and a repeated description is therefore omitted here.
It should be noted that the video object feature extraction apparatus 150 according to the embodiment of the present application may be integrated into the electronic device 180 as a software module and/or a hardware module; in other words, the electronic device 180 may include the video object feature extraction apparatus 150. For example, the video object feature extraction apparatus 150 may be a software module in the operating system of the electronic device 180, or an application developed for the electronic device 180; it may also be one of many hardware modules of the electronic device 180.
In another embodiment of the present invention, the video object feature extraction apparatus 150 and the electronic device 180 may also be separate devices (e.g., servers); in this case, the video object feature extraction apparatus 150 may be connected to the electronic device 180 through a wired and/or wireless network and exchange information with it according to an agreed data format.
Exemplary video object matching device
Fig. 17 is a schematic structural diagram of a video object matching apparatus according to an embodiment of the present invention. The video object matching apparatus is communicatively connected to the video object feature extraction apparatus 150 shown in figs. 15 and 16. As shown in fig. 17, the video object matching apparatus 170 includes: a measurement parameter obtaining module 1701 configured to obtain, from the video object feature extraction apparatus 150, the comprehensive local feature information 54 of each component area of the image plane where a first object is located in a first video stream, the comprehensive spatial score 55 of each component area of the first object, and the global feature information 53 of the first object, and to obtain, from the video object feature extraction apparatus 150, the comprehensive local feature information 54 of each component area of the image plane where a second object is located in a second video stream, the comprehensive spatial score 55 of each component area of the second object, and the global feature information 53 of the second object; and a measurement execution module 1702 configured to determine whether the first object and the second object are consistent based on the comprehensive local feature information 54 of each component area of the image plane where the first object is located, the comprehensive spatial score 55 of each component area of the first object, the global feature information 53 of the first object, the comprehensive local feature information 54 of each component area of the image plane where the second object is located, the comprehensive spatial score 55 of each component area of the second object, and the global feature information 53 of the second object.
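A hedged sketch of the matching computation performed by the measurement execution module 1702: a spatial-score-weighted local feature distance plus a global feature distance compared against a threshold. The way the two objects' comprehensive spatial scores are combined (elementwise minimum), the use of Euclidean distances, and the threshold are assumptions rather than the patent's prescribed metric.

```python
import torch

def objects_match(local1, score1, global1, local2, score2, global2, threshold=1.0):
    """Decide whether two video objects are consistent.

    local1, local2:   (K, D) comprehensive local features of the K component areas
    score1, score2:   (K,)   comprehensive spatial scores of the component areas
    global1, global2: (D,)   global feature vectors of the two objects
    """
    w = torch.minimum(score1, score2)                       # down-weight unreliable component areas
    w = w / w.sum().clamp(min=1e-6)
    local_dist = (w * (local1 - local2).norm(dim=1)).sum()  # weighted local feature distance
    global_dist = (global1 - global2).norm()                # global feature distance
    return (local_dist + global_dist).item() < threshold    # consistent if the total distance is small
```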
It should be noted that the video object matching apparatus 170 according to the embodiment of the present application may be integrated into the electronic device 180 as a software module and/or a hardware module; in other words, the electronic device 180 may include the video object matching apparatus 170. For example, the video object matching apparatus 170 may be a software module in the operating system of the electronic device 180, or an application developed for the electronic device 180; it may also be one of many hardware modules of the electronic device 180.
In another embodiment of the present invention, the video object matching apparatus 170 and the electronic device 180 may also be separate devices (e.g., servers); in this case, the video object matching apparatus 170 may be connected to the electronic device 180 through a wired and/or wireless network and exchange information with it according to an agreed data format.
Exemplary electronic device
Fig. 18 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 18, the electronic device 180 includes: one or more processors 1801 and memory 1802; and computer program instructions stored in the memory 1802 that, when executed by the processor 1801, cause the processor 1801 to perform a video object feature extraction method or a video object matching method as in any of the embodiments described above.
The processor 1801 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
Memory 1802 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 1801 to implement the steps of the video object feature extraction method or the video object matching method of the various embodiments of the present application described above and/or other desired functions. Intermediate data involved in these methods, such as video frames, feature maps, and spatial scores, may also be stored in the computer-readable storage medium.
In one example, the electronic device 180 may further include: an input device 1803 and an output device 1804, which are interconnected by a bus system and/or other form of connection mechanism (not shown in fig. 18).
For example, when the electronic device is a monitoring device, the input device 1803 may be a monitoring camera for capturing a video stream. When the electronic device is a stand-alone device, the input device 1803 may be a communication network connector for receiving a captured video signal from an external video capture device.
The output device 1804 may output various information to the outside, which may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 180 that are relevant to the present application are shown in fig. 18; components such as a bus and input/output interfaces are omitted. In addition, the electronic device 180 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatuses, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in the video object feature extraction method or the video object matching method of any of the above-described embodiments.
The computer program product may write program code for carrying out operations of embodiments of the present application in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the video object feature extraction method or the video object matching method according to the various embodiments of the present application described above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments. It should be noted, however, that the advantages, effects, and the like mentioned in the present application are merely examples and not limitations, and should not be considered essential to the various embodiments of the present application. The specific details disclosed above are provided only for the purpose of illustration and ease of understanding, and are not intended to limit the application to those precise details.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As those skilled in the art will appreciate, these devices, apparatuses, and systems may be connected, arranged, or configured in any manner. Words such as "including", "comprising", and "having" are open-ended words that mean "including, but not limited to" and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, "and/or", unless the context clearly dictates otherwise. The words "such as" are used herein to mean, and are used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.
The above description is only of preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent replacements, and the like made within the spirit and principle of the present invention shall be included within the scope of the present invention.

Claims (17)

1. A video object feature extraction method comprises the following steps:
acquiring local feature information of each component area of an image plane where an object to be identified is located in each video frame of a video stream;
acquiring spatial scores of the component areas of the image plane where the object to be identified is located in each video frame; and
acquiring comprehensive local feature information of each component area of the image plane where the object to be identified is located according to the acquired local feature information and spatial scores of the component areas of the image plane where the object to be identified is located in each video frame.
2. The method according to claim 1, wherein the acquiring comprehensive local feature information of each component area of the image plane where the object to be identified is located according to the acquired local feature information and spatial scores of the component areas of the image plane where the object to be identified is located in each video frame comprises:
fusing, for each component area, the local feature information of the component area in each video frame into the comprehensive local feature information of the component area, using the spatial score of the component area in each video frame as a weight.
3. The method according to claim 1, wherein the acquiring spatial scores of the component areas of the image plane where the object to be identified is located in each video frame comprises:
inputting each video frame into a spatio-temporal evaluation neural network model; and
acquiring the spatial scores of the component areas of the image plane where the object to be identified is located in each video frame output by the spatio-temporal evaluation neural network model.
4. The method of claim 3, further comprising:
mapping the spatial score of each component area to a value between 0 and 1 by a continuous approximation of the sign function.
5. The method according to claim 1, wherein the acquiring local feature information of each component area of the image plane where the object to be identified is located in each video frame of the video stream comprises:
acquiring a global feature map of the object to be identified in each video frame;
identifying each component area of the image plane where the object to be identified is located in each video frame; and
acquiring local feature information of each component area of the image plane where the object to be identified is located in each video frame according to the global feature map and the component areas of each video frame.
6. The method according to claim 5, wherein the acquiring a global feature map of the object to be identified in each video frame comprises:
inputting each video frame into a first global feature extraction neural network model; and
acquiring the global feature map of each video frame output by the first global feature extraction neural network model.
7. The method according to claim 5, wherein the identifying each component area of the image plane where the object to be identified is located in each video frame comprises:
identifying a plurality of feature recognition points in a video frame; and
identifying each component area of the image plane where the object to be identified is located according to the plurality of feature recognition points, wherein each component area is determined according to the positions of at least two feature recognition points.
8. The method according to claim 5, wherein the acquiring local feature information of each component area of the image plane where the object to be identified is located in each video frame according to the global feature map and the component areas of each video frame comprises:
performing region-of-interest pooling on the global feature map of a video frame and each component area of the image plane where the object to be identified is located in the video frame;
inputting the result of the region-of-interest pooling into a local feature extraction neural network model; and
acquiring the local feature information of each component area of the image plane where the object to be identified is located in the video frame output by the local feature extraction neural network model.
9. The method of claim 5, further comprising:
inputting the global feature map of each video frame into a second global feature extraction neural network model; and
acquiring global feature information of the object to be identified in the video stream output by the second global feature extraction neural network model.
10. The method of claim 1, further comprising:
acquiring a comprehensive spatial score of each component area of the image plane where the object to be identified is located according to the spatial scores of the component areas of the image plane where the object to be identified is located in each video frame.
11. The method according to claim 10, wherein the acquiring a comprehensive spatial score of each component area of the image plane where the object to be identified is located according to the spatial scores of the component areas of the image plane where the object to be identified is located in each video frame comprises:
computing a weighted average of all the spatial scores corresponding to one component area of the object to be identified, and taking the weighted average as the comprehensive spatial score of that component area.
12. A video object matching method, comprising:
acquiring comprehensive local feature information of each component area of an image plane where a first object is located in a first video stream, a comprehensive spatial score of each component area, and global feature information of the first object;
acquiring comprehensive local feature information of each component area of an image plane where a second object is located in a second video stream, a comprehensive spatial score of each component area, and global feature information of the second object; and
determining whether the first object and the second object are consistent based on the comprehensive local feature information of each component area of the image plane where the first object is located, the comprehensive spatial score of each component area of the image plane where the first object is located, the global feature information of the first object, the comprehensive local feature information of each component area of the image plane where the second object is located, the comprehensive spatial score of each component area of the image plane where the second object is located, and the global feature information of the second object.
13. The method according to claim 12, wherein the determining whether the first object and the second object are consistent based on the comprehensive local feature information of the component areas of the image plane where the first object is located, the comprehensive spatial score of each component area of the image plane where the first object is located, the global feature information of the first object, the comprehensive local feature information of the component areas of the image plane where the second object is located, the comprehensive spatial score of each component area of the image plane where the second object is located, and the global feature information of the second object comprises:
calculating a local feature distance between the first object and the second object by taking the comprehensive spatial scores of the component areas of the image plane where the first object is located and the comprehensive spatial scores of the component areas of the image plane where the second object is located as weights, and the comprehensive local feature information of the component areas of the image plane where the first object is located and the comprehensive local feature information of the component areas of the image plane where the second object is located as measurement variables;
calculating a global feature distance between the first object and the second object by taking the global feature information of the first object and the global feature information of the second object as measurement variables; and
determining whether the first object and the second object are consistent according to the local feature distance and the global feature distance.
14. A video object feature extraction apparatus comprising:
a first feature acquisition module configured to acquire local feature information of each component area of an image plane where an object to be identified is located in each video frame of a video stream;
a spatial score acquisition module configured to acquire spatial scores of the component areas of the image plane where the object to be identified is located in each video frame; and
a local feature acquisition module configured to acquire comprehensive local feature information of each component area of the image plane where the object to be identified is located according to the acquired local feature information and spatial scores of the component areas of the image plane where the object to be identified is located in each video frame.
15. A video object matching apparatus communicatively connected to the video object feature extraction apparatus of claim 14, the video object matching apparatus comprising:
a measurement parameter obtaining module configured to obtain, from the video object feature extraction apparatus, comprehensive local feature information of each component area of an image plane where a first object is located in a first video stream, a comprehensive spatial score of each component area of the first object, and global feature information of the first object, and to obtain, from the video object feature extraction apparatus, comprehensive local feature information of each component area of an image plane where a second object is located in a second video stream, a comprehensive spatial score of each component area of the second object, and global feature information of the second object; and
a measurement execution module configured to determine whether the first object and the second object are consistent based on the comprehensive local feature information of each component area of the image plane where the first object is located, the comprehensive spatial score of each component area of the first object, the global feature information of the first object, the comprehensive local feature information of each component area of the image plane where the second object is located, the comprehensive spatial score of each component area of the second object, and the global feature information of the second object.
16. An electronic device, comprising:
a processor; and
memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the method of any of claims 1 to 13.
17. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1 to 13.
CN201811527701.5A 2018-12-13 2018-12-13 Video object feature extraction method and device, and video object matching method and device Active CN111325198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811527701.5A CN111325198B (en) 2018-12-13 2018-12-13 Video object feature extraction method and device, and video object matching method and device


Publications (2)

Publication Number Publication Date
CN111325198A true CN111325198A (en) 2020-06-23
CN111325198B CN111325198B (en) 2023-05-16

Family

ID=71172262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811527701.5A Active CN111325198B (en) 2018-12-13 2018-12-13 Video object feature extraction method and device, and video object matching method and device

Country Status (1)

Country Link
CN (1) CN111325198B (en)


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006031567A (en) * 2004-07-20 2006-02-02 Chuden Gijutsu Consultant Kk Building image retrieval system
CN101742355A (en) * 2009-12-24 2010-06-16 厦门大学 Method for partial reference evaluation of wireless videos based on space-time domain feature extraction
JP2013172214A (en) * 2012-02-17 2013-09-02 Sony Corp Image processing device and image processing method and program
US20140140627A1 (en) * 2012-11-20 2014-05-22 Hao Wu Image rectification using sparsely-distributed local features
CN104216949A (en) * 2014-08-13 2014-12-17 中国科学院计算技术研究所 Method and system for expressing clustering of image features by fusion of space information
CN104598885A (en) * 2015-01-23 2015-05-06 西安理工大学 Method for detecting and locating text sign in street view image
CN104766096A (en) * 2015-04-17 2015-07-08 南京大学 Image classification method based on multi-scale global features and local features
CN105184235A (en) * 2015-08-24 2015-12-23 中国电子科技集团公司第三十八研究所 Feature-fusion-based second-generation identity card identification method
CN106548472A (en) * 2016-11-03 2017-03-29 天津大学 Non-reference picture quality appraisement method based on Walsh Hadamard transform
CN107067413A (en) * 2016-12-27 2017-08-18 南京理工大学 A kind of moving target detecting method of time-space domain statistical match local feature
CN107463920A (en) * 2017-08-21 2017-12-12 吉林大学 A kind of face identification method for eliminating partial occlusion thing and influenceing
CN108875494A (en) * 2017-10-17 2018-11-23 北京旷视科技有限公司 Video structural method, apparatus, system and storage medium
CN108229302A (en) * 2017-11-10 2018-06-29 深圳市商汤科技有限公司 Feature extracting method, device, computer program, storage medium and electronic equipment
CN108805278A (en) * 2018-05-15 2018-11-13 成都理想境界科技有限公司 A kind of feature extracting method and system applied to deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
P. SHI 等: "A Novel Fingerprint Matching Algorithm Based on Minutiae and Global Statistical Features" *
陈玲: "基于特征融合的掌纹识别方法研究" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307979A (en) * 2020-10-31 2021-02-02 成都新潮传媒集团有限公司 Personnel attribute identification method and device and computer equipment
CN113408448A (en) * 2021-06-25 2021-09-17 之江实验室 Method and device for extracting local features of three-dimensional space-time object and identifying object

Also Published As

Publication number Publication date
CN111325198B (en) 2023-05-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant