WO2023040146A1 - Image fusion-based behavior recognition method, device, electronic device and medium - Google Patents

Image fusion-based behavior recognition method, device, electronic device and medium

Info

Publication number
WO2023040146A1
Authority
WO
WIPO (PCT)
Prior art keywords
images
image
sampling
target
optical flow
Prior art date
Application number
PCT/CN2022/071329
Other languages
English (en)
French (fr)
Inventor
郑喜民
苏杭
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023040146A1

Classifications

    • G06N3/045 Combinations of networks (computing arrangements based on biological models; neural networks; architecture)
    • G06N3/08 Learning methods (neural networks)
    • G06T3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T7/11 Region-based segmentation (image analysis)

Definitions

  • the present application relates to the technical field of artificial intelligence, in particular to an image fusion-based behavior recognition method, device, electronic equipment and media.
  • Behavior recognition is a very challenging topic in the field of computer vision, because it not only analyzes the spatial information of the target object, but also analyzes the information in the time dimension. How to better extract space-time features is the key to the problem. With the widespread application and good results of deep neural networks in object detection, people are also exploring the use of neural networks for action recognition.
  • the first aspect of the present application provides a method of behavior recognition based on image fusion, the method comprising:
  • the multiple fused images are input into a pre-trained 3D convolutional neural network for behavior recognition, wherein the pre-trained 3D convolutional neural network has a single-branch network structure.
  • the second aspect of the present application provides a behavior recognition device based on image fusion, the device comprising:
  • An acquisition module configured to acquire a video stream containing the target object in response to an instruction to identify the behavior of the target object
  • An extraction module configured to extract a plurality of initial images from the video stream
  • a calculation module configured to perform optical flow calculation on the multiple initial images to obtain multiple optical flow images
  • a fusion module configured to fuse each of the initial images with the corresponding optical flow images based on an attention mechanism to obtain a plurality of fusion images
  • the recognition module is configured to input the multiple fused images into a pre-trained 3D convolutional neural network for behavior recognition, wherein the pre-trained 3D convolutional neural network has a single-branch network structure.
  • a third aspect of the present application provides an electronic device, the electronic device includes a processor and a memory, and the processor is configured to implement the following steps when executing the computer-readable instructions stored in the memory:
  • the multiple fused images are input into a pre-trained 3D convolutional neural network for behavior recognition, wherein the pre-trained 3D convolutional neural network has a single-branch network structure.
  • a fourth aspect of the present application provides a computer-readable storage medium, on which computer-readable instructions are stored, and when the computer-readable instructions are executed by a processor, the following steps are implemented:
  • the multiple fused images are input into a pre-trained 3D convolutional neural network for behavior recognition, wherein the pre-trained 3D convolutional neural network has a single-branch network structure.
  • the image fusion-based behavior recognition method, device, electronic equipment, and media described in this application can be applied in smart government affairs and other fields, thereby promoting the development of smart cities.
  • by fusing the initial images with the computed optical flow images, this application not only realizes the fusion of image information and temporal information, but also uses the optical flow information to guide the 3D convolutional neural network's understanding of the information in the initial images, and uses the attention mechanism to let the network learn more information about the regions of interest, so that behavior recognition performed on the fused images effectively ensures the accuracy of behavior recognition;
  • in addition, the trained 3D convolutional neural network has a single-branch network structure; compared with a two-branch 3D convolutional neural network, the single-branch structure reduces the complexity of the network model and hence of the entire behavior recognition process, making the whole process more centralized and improving the efficiency of behavior recognition.
  • FIG. 1 is a flow chart of an image fusion-based behavior recognition method provided in Embodiment 1 of the present application.
  • FIG. 2 is a schematic diagram of splicing fused images provided in Embodiment 2 of the present application.
  • FIG. 3 is a structural diagram of an image fusion-based behavior recognition device provided in Embodiment 2 of the present application.
  • FIG. 4 is a schematic structural diagram of an electronic device provided in Embodiment 3 of the present application.
  • the image fusion-based behavior recognition method provided in the embodiment of the present application is executed by an electronic device, and accordingly, the image fusion-based behavior recognition device runs in the electronic device.
  • artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • FIG. 1 is a flow chart of an image fusion-based behavior recognition method provided in Embodiment 1 of the present application.
  • the image fusion-based behavior recognition method specifically includes the following steps. According to different requirements, the order of the steps in the flow chart can be changed, and some of them can be omitted.
  • the target object refers to an object entity that requires behavior recognition, for example, a person or a pet. If it is necessary to perform behavior recognition on a certain person or a certain pet, an image acquisition device can be used to collect the video stream of the person or the pet.
  • the image acquisition device may be a high-definition digital image acquisition device.
  • the instruction to identify the behavior of the target object may be triggered by the user, or may be automatically triggered.
  • when the electronic device receives the instruction for recognizing the behavior of the target object, it responds by sending an acquisition instruction to the image acquisition device to control it to collect a video stream containing the target object.
  • after the image acquisition device collects the video stream containing the target object, it sends the collected video stream to the electronic device.
  • the image acquisition device can send the video stream while collecting it, or it can send the collected video stream to the electronic device after collecting it for a preset duration.
  • the video stream can be decomposed into two parts of information, space and time.
  • the spatial information is expressed in the form of single images, which carry static information such as the shape and color of the target object, while the temporal information is expressed dynamically through multiple frames of continuous images, reflecting the movement information of the target object.
  • a collection frame rate may be preset in the electronic device, and the video stream is collected at the collection frame rate to obtain a plurality of initial images, where the initial images are RGB images.
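  • As an illustrative sketch of this step (not part of the patent text): the following Python snippet samples a video at a preset collection frame rate to obtain the initial RGB images; the function name and default frame-rate value are assumptions.

```python
# Minimal sketch: extract initial RGB images at a preset collection frame rate.
import cv2

def extract_initial_images(video_path: str, collection_fps: float = 5.0):
    """Sample the video at `collection_fps` and return a list of RGB frames."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or collection_fps
    step = max(1, round(native_fps / collection_fps))  # keep every step-th frame
    frames, index = [], 0
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        if index % step == 0:
            # OpenCV decodes to BGR; the patent works with RGB initial images.
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        index += 1
    cap.release()
    return frames
```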
  • in an optional embodiment, after the plurality of initial images are extracted from the video stream, the method further includes: detecting a target area corresponding to the target object in each of the initial images; cropping the target area in each of the initial images to obtain a target image; and sampling the plurality of target images to obtain a plurality of sampled images.
  • the YOLO target detection algorithm may be used to select the area where the target object is located in the initial image with a detection frame, and the area selected by the detection frame is the target area.
  • because the number of pixels in the target area is much smaller than that of the entire initial image, and the target area contains almost only the target object (a person or a pet) and no other non-target objects, taking the target image cropped from the target area as the input of the 3D convolutional neural network model not only helps improve the efficiency with which the model recognizes the behavior of the target object, but also, since there is no interference from non-target objects in the target image, improves the accuracy with which the model recognizes that behavior.
  • in addition, because of differences in distance, the relative size of the target object differs between initial images, so the cropped target images differ in size.
  • to ensure the consistency of the images input to the 3D convolutional neural network model, the multiple target images need to be sampled and thereby normalized, ensuring that the resulting sampled images have the same size.
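  • A minimal sketch of the detect-and-crop step: the patent names the YOLO detection algorithm but no specific API, so `detect_target_box` below is a hypothetical placeholder standing in for any YOLO-style detector.

```python
# Hedged sketch: crop the detection-frame region from each initial image.
from typing import Callable, List, Tuple
import numpy as np

Box = Tuple[int, int, int, int]  # (x, y, width, height) of the detection frame

def crop_target_images(initial_images: List[np.ndarray],
                       detect_target_box: Callable[[np.ndarray], Box]) -> List[np.ndarray]:
    target_images = []
    for image in initial_images:
        x, y, w, h = detect_target_box(image)  # area selected by the detection frame
        target_images.append(image[y:y + h, x:x + w].copy())
    return target_images
```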
  • sampling the plurality of target images to obtain a plurality of sampled images includes: acquiring the size of each target image; performing first sampling on each target image according to the sizes to obtain a first sampled image, and performing second sampling on each target image to obtain a second sampled image; calculating a first image quality of each first sampled image and a second image quality of each second sampled image; comparing the multiple first image qualities with the multiple second image qualities to obtain a comparison result; and determining the plurality of sampled images according to the comparison result.
  • different target images have different sizes, so the multiple target images may have multiple different sizes.
  • each target image is sampled twice according to these sizes, so that after the two samplings each target image corresponds to two different sampled images.
  • the multiple first image qualities obtained by sampling can then be compared with the multiple second image qualities to obtain a comparison result, which determines which sampling method is used to sample the target images and obtain the sampled images.
  • performing the first sampling on each target image according to the sizes to obtain a first sampled image, and the second sampling on each target image to obtain a second sampled image, includes: acquiring the maximum size and the minimum size among the multiple sizes; determining a first sampling rate for each target image according to the maximum size, and a second sampling rate according to the minimum size; and upsampling the corresponding target image according to the first sampling rate to obtain the first sampled image, and downsampling it according to the second sampling rate to obtain the second sampled image.
  • for example, suppose there are five target images F1, F2, F3, F4, F5 with sizes T1, T2, T3, T4, T5 respectively; sorting these five sizes (T1, T2, T3, T4, T5) from large to small, or from small to large, yields a size sequence.
  • assume the maximum size in the sequence is T1: the first sampling rate of target image F1 is determined to be T1/T1 according to the maximum size T1, that of F2 to be T1/T2, that of F3 to be T1/T3, that of F4 to be T1/T4, and that of F5 to be T1/T5.
  • the first sampled images F11, F21, F31, F41, F51 obtained by upsampling each target image at its first sampling rate all have the same size as target image F1.
  • assume the minimum size in the sequence is T5: the second sampling rate of target image F1 is determined to be T5/T1 according to the minimum size T5, that of F2 to be T5/T2, that of F3 to be T5/T3, that of F4 to be T5/T4, and that of F5 to be T5/T5.
  • the second sampled images F12, F22, F32, F42, F52 obtained by downsampling each target image at its second sampling rate all have the same size as target image F5.
  • in this optional manner, the sampling rate is determined according to the size of the target image, and the corresponding target image is upsampled or downsampled according to that rate; this realizes dynamic sampling of different target images and ensures that the sampled images are consistent either with the maximum size or with the minimum size of the target images.
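  • A hedged sketch of the two samplings, assuming "size" means pixel dimensions (width, height) and using OpenCV interpolation for the resizing; the interpolation modes are assumptions, not specified by the patent.

```python
# Sketch: first sampling upsamples every target image to the largest size,
# second sampling downsamples every target image to the smallest size.
from typing import List, Tuple
import cv2
import numpy as np

def first_and_second_sampling(target_images: List[np.ndarray]
                              ) -> Tuple[List[np.ndarray], List[np.ndarray]]:
    sizes = [(img.shape[1], img.shape[0]) for img in target_images]  # (W, H)
    max_size = max(sizes, key=lambda s: s[0] * s[1])  # largest target size
    min_size = min(sizes, key=lambda s: s[0] * s[1])  # smallest target size
    # Upsampling to the max size corresponds to a sampling rate of T_max / T_i,
    # downsampling to the min size to a rate of T_min / T_i.
    first = [cv2.resize(img, max_size, interpolation=cv2.INTER_CUBIC)
             for img in target_images]
    second = [cv2.resize(img, min_size, interpolation=cv2.INTER_AREA)
              for img in target_images]
    return first, second
```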
  • determining a plurality of sampled images according to the comparison result includes: when the comparison result is that the average of the multiple first image qualities is greater than the average of the multiple second image qualities, determining the multiple first sampled images to be the plurality of sampled images; and when the comparison result is that the average of the multiple first image qualities is smaller than the average of the multiple second image qualities, determining the multiple second sampled images to be the plurality of sampled images.
  • because upsampling or downsampling raises or lowers the quality of a target image, the quality of a sampled image differs from that of the target image; to ensure that the sampled images are, with high probability, of better quality, the multiple first image qualities are compared with the multiple second image qualities.
  • the possible comparison results are: the average of the multiple first image qualities is greater than the average of the multiple second image qualities, or the average of the multiple first image qualities is smaller than the average of the multiple second image qualities.
  • if the average of the multiple first image qualities is greater than the average of the multiple second image qualities, the first image quality of most first sampled images obtained by upsampling is higher than the second image quality of most second sampled images obtained by downsampling; the electronic device therefore samples the target images by upsampling, that is, it determines the multiple first sampled images to be the final sampled images used as input to the 3D convolutional neural network model.
  • conversely, if the average of the multiple first image qualities is smaller than the average of the multiple second image qualities, the electronic device samples the target images by downsampling, that is, it determines the multiple second sampled images to be the final sampled images used as input to the 3D convolutional neural network model.
  • the case where the average of the multiple first image qualities equals the average of the multiple second image qualities can be handled as either of the above cases: the target images may be sampled by upsampling, so that the multiple first sampled images become the final sampled images, or by downsampling, so that the multiple second sampled images become the final sampled images.
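  • The patent does not pin down the image-quality measure; the sketch below assumes variance of the Laplacian (a common sharpness proxy) purely as a stand-in metric and selects whichever sampled set has the higher average quality.

```python
# Hedged sketch: compare average qualities of first and second sampled images.
from typing import List
import cv2
import numpy as np

def image_quality(image: np.ndarray) -> float:
    # Assumed metric: variance of the Laplacian as a sharpness proxy.
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def choose_sampled_images(first: List[np.ndarray],
                          second: List[np.ndarray]) -> List[np.ndarray]:
    mean_first = np.mean([image_quality(img) for img in first])
    mean_second = np.mean([image_quality(img) for img in second])
    # Equal averages may fall on either side; here they default to upsampling.
    return first if mean_first >= mean_second else second
```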
  • for example, assuming N consecutive initial images are extracted from the video stream, an optical flow image is calculated from every two adjacent initial images, yielding N-1 consecutive optical flow images.
  • in a possible embodiment, after the electronic device detects and crops the initial images, N consecutive target images are obtained, and sampling them yields N consecutive sampled images; performing optical flow calculation on the multiple initial images to obtain multiple optical flow images then means performing optical flow calculation on the multiple sampled images.
  • an optical flow image is calculated from every two adjacent sampled images among the N consecutive sampled images, yielding N-1 consecutive optical flow images.
  • performing optical flow calculation on the plurality of sampled images to obtain a plurality of optical flow images includes: using an optical flow algorithm to calculate the optical flow field of every two adjacent sampled images; performing threshold segmentation on the optical flow field; filtering out the target optical flow fields greater than the threshold; and determining the target sampled image corresponding to each target optical flow field and obtaining a target optical flow image from the target optical flow field.
  • within a short time span, the motion of the same target object between moments is limited: the brightness of the consecutive initial images extracted from the video stream does not change, the position of the target object does not change drastically, and the displacement of the target object between two adjacent images is very small (only translation or stretch/compression transformations exist), so consecutive initial images are strongly correlated; the optical flow algorithm can therefore be used to correct the multiple initial images and correlate adjacent initial images.
  • the optical flow algorithm computes the optical flow field: under suitable smoothness constraints, the motion field is estimated from the spatio-temporal gradients of the image sequence, and moving targets and scenes are detected and segmented by analyzing changes in the motion field.
  • there are generally two methods, based on the global optical flow field or on feature-point optical flow fields; the feature-point method is preferred for its small computation cost, speed, and flexibility.
  • the optical flow algorithm calculates the optical flow vector of each point between two adjacent frames; since the flow vectors of a moving object differ from the background flow vectors, threshold segmentation can divide the optical flow field into two parts, distinguishing the moving object from the background.
  • preferably, the threshold can be determined using the maximum between-class variance method (Otsu algorithm).
  • in an optional embodiment, after the threshold segmentation of the optical flow field, the method further includes: filtering the threshold-segmented optical flow field with morphological operations, and connecting the filtered optical flow field to obtain a new optical flow field.
  • after threshold segmentation, some isolated points or concave regions remain and interfere with extracting the target object; the opening operation in morphological filtering can first be used to remove those concave areas whose optical flow values do not match the structuring elements while retaining the matching ones, and the closing operation is then used to fill the concave areas, so that the region corresponding to the target object becomes a single connected region.
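  • A sketch of this step under the assumption that Farneback dense flow is the "optical flow algorithm" (the patent leaves the choice open), with Otsu thresholding of the flow magnitude and morphological opening/closing as described.

```python
# Hedged sketch: dense optical flow between two adjacent sampled images,
# Otsu threshold segmentation, and morphological filtering of the flow field.
import cv2
import numpy as np

def optical_flow_image(prev_rgb: np.ndarray, next_rgb: np.ndarray) -> np.ndarray:
    prev_gray = cv2.cvtColor(prev_rgb, cv2.COLOR_RGB2GRAY)
    next_gray = cv2.cvtColor(next_rgb, cv2.COLOR_RGB2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2)
    magnitude = np.linalg.norm(flow, axis=2)
    mag_u8 = cv2.normalize(magnitude, None, 0, 255,
                           cv2.NORM_MINMAX).astype(np.uint8)
    # Otsu (maximum between-class variance) separates the moving target's
    # flow vectors from the background.
    _, mask = cv2.threshold(mag_u8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # drop mismatched pits
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill concave areas
    # Keep the two flow channels only where they exceed the threshold.
    return flow * (mask[..., None] > 0)
```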
  • the electronic device can set parameters for fusing an initial image with its optical flow image, normalizing and weighting each channel to obtain a fused image.
  • each initial image is fused with one corresponding optical flow image to obtain one fused image; arranging the fused images in order yields a fused image stream, realizing the fusion of temporal and spatial information.
  • in a possible embodiment, fusing each initial image with the corresponding optical flow image based on the attention mechanism becomes fusing each target sampled image with the corresponding target optical flow image based on the attention mechanism.
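  • A minimal sketch of the channel-wise fusion, assuming simple per-channel normalization and fixed attention-style weights; the patent specifies normalization plus weighted fusion but does not fix the weighting scheme, so the weights here are assumptions.

```python
# Hedged sketch: per-channel normalization and weighted fusion of an RGB image
# with its 2-channel optical flow image into a single 5-channel fused image.
import numpy as np

def fuse(rgb: np.ndarray, flow: np.ndarray,
         rgb_weight: float = 0.6, flow_weight: float = 0.4) -> np.ndarray:
    def per_channel_norm(x: np.ndarray) -> np.ndarray:
        mean = x.mean(axis=(0, 1), keepdims=True)
        std = x.std(axis=(0, 1), keepdims=True) + 1e-6
        return (x - mean) / std
    rgb_n = per_channel_norm(rgb.astype(np.float32)) * rgb_weight
    flow_n = per_channel_norm(flow.astype(np.float32)) * flow_weight
    return np.concatenate([rgb_n, flow_n], axis=2)  # (H, W, 5): 3 RGB + 2 flow
```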
  • the pre-trained 3D convolutional neural network has a single-branch network structure.
  • for the 3D convolutional neural network, the size of the input fused image is B*C*T*H*W, where B is the batch_size, C is the number of channels (5 after fusion: the 3 RGB channels plus the 2 optical flow channels), T is the time sequence, and H and W are the height and width of the fused image; setting the channel number to 5 when building the network both captures the static features of the target object and learns the changing process of the optical flow, reducing the complexity of the network's overall framework while guaranteeing the same effect.
  • inputting the plurality of fused images into the pre-trained 3D convolutional neural network for behavior recognition includes: acquiring the feature map output by the last convolutional layer of the 3D convolutional neural network; splicing each fused image with the corresponding feature map to obtain a stitched image; and performing behavior recognition based on the stitched images.
  • the 3D convolutional neural network includes multiple convolutional layers, each containing one convolution kernel: the first kernel is 1x1x1, the second kernel is 3x3x3, and the last kernel is 1x1x1.
  • when a fused image is input, the network fuses the channel information and raises the feature dimension through the first convolutional layer, extracts features of the image information in both the time and space dimensions through the second convolutional layer, reduces the feature dimension through the last convolutional layer, and then splices the feature map output by the last convolutional layer with the fused image.
  • after the feature dimension is raised, it is reduced again to prevent gradient explosion; finally, to avoid the network reducing the feature dimension too far, the fused image is spliced with the feature map output by the last convolutional layer. Compared with that feature map alone, the stitched image has a higher feature dimension, which effectively prevents the gradient from vanishing.
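  • A PyTorch sketch of the single-branch block described above; the channel widths are illustrative assumptions, not values taken from the patent.

```python
# Minimal sketch: 1x1x1 conv fuses the 5 input channels and raises the feature
# dimension, 3x3x3 conv extracts spatio-temporal features, 1x1x1 conv reduces
# the dimension, and the block input is concatenated with the final feature map
# so the feature dimension does not drop too low.
import torch
import torch.nn as nn

class SingleBranchBlock(nn.Module):
    def __init__(self, in_channels: int = 5, hidden: int = 64, out_channels: int = 16):
        super().__init__()
        self.expand = nn.Conv3d(in_channels, hidden, kernel_size=1)        # 1x1x1
        self.spatiotemporal = nn.Conv3d(hidden, hidden, kernel_size=3,
                                        padding=1)                          # 3x3x3
        self.reduce = nn.Conv3d(hidden, out_channels, kernel_size=1)        # 1x1x1
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C=5, T, H, W) -- the fused image stream.
        features = self.reduce(self.act(self.spatiotemporal(self.act(self.expand(x)))))
        # Splice (concatenate) the fused input with the final feature map along
        # the channel axis, mirroring the stitching the patent describes.
        return torch.cat([x, features], dim=1)

# Usage: out = SingleBranchBlock()(torch.randn(2, 5, 8, 112, 112))  # (2, 21, 8, 112, 112)
```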
  • compared with traditional two-stream behavior recognition methods, the image fusion-based behavior recognition method can be applied in fields such as smart government affairs, thereby promoting the development of smart cities.
  • by fusing the initial images with the computed optical flow images, this application not only realizes the fusion of image information and temporal information, but also uses the optical flow information to guide the 3D convolutional neural network's understanding of the initial images, and uses the attention mechanism to let the network learn more information about the regions of interest, so that behavior recognition performed on the fused images effectively ensures the accuracy of behavior recognition;
  • in addition, the trained 3D convolutional neural network has a single-branch network structure; compared with a two-branch 3D convolutional neural network, the single-branch structure reduces the complexity of the network model and hence of the entire behavior recognition process, making the whole process more centralized and improving the efficiency of behavior recognition.
  • FIG. 3 is a structural diagram of an image fusion-based behavior recognition device provided in Embodiment 2 of the present application.
  • the image fusion-based behavior recognition device 30 may include a plurality of functional modules composed of computer-readable instruction segments.
  • the computer-readable instructions of the program segments in the image fusion-based behavior recognition device 30 can be stored in the memory of the electronic device and executed by at least one processor to perform the image fusion-based behavior recognition function (described in detail with reference to FIG. 1).
  • the behavior recognition device 30 based on image fusion can be divided into multiple functional modules according to the functions it performs.
  • the functional modules may include: an acquisition module 301 , an extraction module 302 , a sampling module 303 , a calculation module 304 , a fusion module 305 and an identification module 306 .
  • the module referred to in this application refers to a series of computer-readable instruction segments that can be executed by at least one processor and can complete fixed functions, and are stored in a memory. In this embodiment, the functions of each module will be described in detail in subsequent embodiments.
  • the acquiring module 301 is configured to acquire a video stream containing the target object in response to an instruction for identifying a behavior of the target object.
  • the target object refers to an object entity that requires behavior recognition, for example, a person or a pet. If it is necessary to perform behavior recognition on a certain person or a certain pet, an image acquisition device may be used to collect the video stream of the person or the pet.
  • the image acquisition device may be a high-definition digital image acquisition device.
  • the instruction to identify the behavior of the target object may be triggered by the user, or may be automatically triggered.
  • when the electronic device receives the instruction for recognizing the behavior of the target object, it responds by sending an acquisition instruction to the image acquisition device to control it to collect a video stream containing the target object.
  • after the image acquisition device collects the video stream containing the target object, it sends the collected video stream to the electronic device.
  • the image acquisition device can send the video stream while collecting it, or it can send the collected video stream to the electronic device after collecting it for a preset duration.
  • the extraction module 302 is configured to extract a plurality of initial images from the video stream.
  • the video stream can be decomposed into two parts of information, space and time.
  • the spatial information is expressed in the form of single images, which carry static information such as the shape and color of the target object, while the temporal information is expressed dynamically through multiple frames of continuous images, reflecting the movement information of the target object.
  • a collection frame rate may be preset in the electronic device, and the video stream is collected at the collection frame rate to obtain a plurality of initial images, where the initial images are RGB images.
  • the sampling module 303 is configured to sample the plurality of initial images to obtain a plurality of sampled images.
  • the sampling module 303 sampling the plurality of initial images to obtain a plurality of sampled images includes: detecting a target area corresponding to the target object in each of the initial images; cropping the target area in each of the initial images to obtain a target image; and sampling the plurality of target images to obtain a plurality of sampled images.
  • the YOLO target detection algorithm may be used to select the area where the target object is located in the initial image with a detection frame, and the area selected by the detection frame is the target area.
  • because the number of pixels in the target area is much smaller than that of the entire initial image, and the target area contains almost only the target object (a person or a pet) and no other non-target objects, taking the target image cropped from the target area as the input of the 3D convolutional neural network model not only helps improve the efficiency with which the model recognizes the behavior of the target object, but also, since there is no interference from non-target objects in the target image, improves the accuracy with which the model recognizes that behavior.
  • in addition, because of differences in distance, the relative size of the target object differs between initial images, so the cropped target images differ in size; to ensure the consistency of the images input to the 3D convolutional neural network model, the multiple target images need to be sampled and thereby normalized, ensuring that the resulting sampled images have the same size.
  • the sampling module 303 sampling the plurality of target images to obtain a plurality of sampled images includes: acquiring the size of each target image; performing first sampling on each target image according to the sizes to obtain a first sampled image, and performing second sampling on each target image to obtain a second sampled image; calculating a first image quality of each first sampled image and a second image quality of each second sampled image; comparing the multiple first image qualities with the multiple second image qualities to obtain a comparison result; and determining the plurality of sampled images according to the comparison result.
  • different target images have different sizes, so the multiple target images may have multiple different sizes.
  • each target image is sampled twice according to these sizes, so that after the two samplings each target image corresponds to two different sampled images.
  • the multiple first image qualities obtained by sampling can then be compared with the multiple second image qualities to obtain a comparison result, which determines which sampling method is used to sample the target images and obtain the sampled images.
  • performing the first sampling on each target image according to the sizes to obtain a first sampled image, and the second sampling on each target image to obtain a second sampled image, includes: acquiring the maximum size and the minimum size among the multiple sizes; determining a first sampling rate for each target image according to the maximum size, and a second sampling rate according to the minimum size; and upsampling the corresponding target image according to the first sampling rate to obtain the first sampled image, and downsampling it according to the second sampling rate to obtain the second sampled image.
  • for example, suppose there are five target images F1, F2, F3, F4, F5 with sizes T1, T2, T3, T4, T5 respectively; sorting these five sizes (T1, T2, T3, T4, T5) from large to small, or from small to large, yields a size sequence.
  • assume the maximum size in the sequence is T1: the first sampling rate of target image F1 is determined to be T1/T1 according to the maximum size T1, that of F2 to be T1/T2, that of F3 to be T1/T3, that of F4 to be T1/T4, and that of F5 to be T1/T5.
  • the first sampled images F11, F21, F31, F41, F51 obtained by upsampling each target image at its first sampling rate all have the same size as target image F1.
  • assume the minimum size in the sequence is T5: the second sampling rate of target image F1 is determined to be T5/T1 according to the minimum size T5, that of F2 to be T5/T2, that of F3 to be T5/T3, that of F4 to be T5/T4, and that of F5 to be T5/T5.
  • the second sampled images F12, F22, F32, F42, F52 obtained by downsampling each target image at its second sampling rate all have the same size as target image F5.
  • in this optional manner, the sampling rate is determined according to the size of the target image, and the corresponding target image is upsampled or downsampled according to that rate; this realizes dynamic sampling of different target images and ensures that the sampled images are consistent either with the maximum size or with the minimum size of the target images.
  • determining a plurality of sampled images according to the comparison result includes: when the comparison result is that the average of the multiple first image qualities is greater than the average of the multiple second image qualities, determining the multiple first sampled images to be the plurality of sampled images; and when the comparison result is that the average of the multiple first image qualities is smaller than the average of the multiple second image qualities, determining the multiple second sampled images to be the plurality of sampled images.
  • because upsampling or downsampling raises or lowers the quality of a target image, the quality of a sampled image differs from that of the target image; to ensure that the sampled images are, with high probability, of better quality, the multiple first image qualities are compared with the multiple second image qualities.
  • the possible comparison results are: the average of the multiple first image qualities is greater than the average of the multiple second image qualities, or the average of the multiple first image qualities is smaller than the average of the multiple second image qualities.
  • if the average of the multiple first image qualities is greater than the average of the multiple second image qualities, the first image quality of most first sampled images obtained by upsampling is higher than the second image quality of most second sampled images obtained by downsampling; the electronic device therefore samples the target images by upsampling, that is, it determines the multiple first sampled images to be the final sampled images used as input to the 3D convolutional neural network model.
  • conversely, if the average of the multiple first image qualities is smaller than the average of the multiple second image qualities, the electronic device samples the target images by downsampling, that is, it determines the multiple second sampled images to be the final sampled images used as input to the 3D convolutional neural network model.
  • the case where the average of the multiple first image qualities equals the average of the multiple second image qualities can be handled as either of the above cases: the target images may be sampled by upsampling, so that the multiple first sampled images become the final sampled images, or by downsampling, so that the multiple second sampled images become the final sampled images.
  • the calculation module 304 is configured to perform optical flow calculation on the multiple initial images to obtain multiple optical flow images.
  • for example, assuming N consecutive initial images are extracted from the video stream, an optical flow image is calculated from every two adjacent initial images, yielding N-1 consecutive optical flow images.
  • in a possible embodiment, after the electronic device detects and crops the initial images, N consecutive target images are obtained, and sampling them yields N consecutive sampled images; performing optical flow calculation on the multiple initial images to obtain multiple optical flow images then means performing optical flow calculation on the multiple sampled images.
  • an optical flow image is calculated from every two adjacent sampled images among the N consecutive sampled images, yielding N-1 consecutive optical flow images.
  • the calculation module 304 performing optical flow calculation on the plurality of sampled images to obtain a plurality of optical flow images includes: using an optical flow algorithm to calculate the optical flow field of every two adjacent sampled images; performing threshold segmentation on the optical flow field; filtering out the target optical flow fields greater than the threshold; and determining the target sampled image corresponding to each target optical flow field and obtaining a target optical flow image from the target optical flow field.
  • within a short time span, the motion of the same target object between moments is limited, so consecutive initial images are strongly correlated; the optical flow algorithm can be used to correct the multiple initial images and correlate adjacent initial images.
  • the optical flow algorithm computes the optical flow field: under suitable smoothness constraints, the motion field is estimated from the spatio-temporal gradients of the image sequence, and moving targets and scenes are detected and segmented by analyzing changes in the motion field.
  • the optical flow algorithm calculates the optical flow vector of each point between two adjacent frames; since the flow vectors of a moving object differ from the background flow vectors, threshold segmentation can divide the optical flow field into two parts, distinguishing the moving object from the background.
  • preferably, the threshold can be determined using the maximum between-class variance method (Otsu algorithm).
  • the electronic device may also filter the threshold-segmented optical flow field with morphological operations and connect the filtered optical flow field to obtain a new optical flow field.
  • the opening operation in morphological filtering can first be used to remove those concave areas whose optical flow values do not match the structuring elements while retaining the matching ones, and the closing operation is then used to fill the concave areas, so that the region corresponding to the target object becomes a single connected region.
  • the fusion module 305 is configured to fuse each of the initial images with the corresponding optical flow images based on an attention mechanism to obtain a plurality of fusion images.
  • the electronic device can set parameters for fusing an initial image with its optical flow image, normalizing and weighting each channel to obtain a fused image.
  • each initial image is fused with one corresponding optical flow image to obtain one fused image; arranging the fused images in order yields a fused image stream, realizing the fusion of temporal and spatial information.
  • in a possible embodiment, fusing each initial image with the corresponding optical flow image based on the attention mechanism becomes fusing each target sampled image with the corresponding target optical flow image based on the attention mechanism.
  • the recognition module 306 is used to input the multiple fused images into the pre-trained 3D convolutional neural network for behavior recognition.
  • the pre-trained 3D convolutional neural network has a single-branch network structure.
  • for the 3D convolutional neural network, the size of the input fused image is B*C*T*H*W, where B is the batch_size, C is the number of channels (5 after fusion: the 3 RGB channels plus the 2 optical flow channels), T is the time sequence, and H and W are the height and width of the fused image; setting the channel number to 5 when building the network both captures the static features of the target object and learns the changing process of the optical flow, reducing the complexity of the network's overall framework while guaranteeing the same effect.
  • the recognition module 306 inputting the plurality of fused images into the pre-trained 3D convolutional neural network for behavior recognition includes: acquiring the feature map output by the last convolutional layer of the 3D convolutional neural network; splicing each fused image with the corresponding feature map to obtain a stitched image; and performing behavior recognition based on the stitched images.
  • the 3D convolutional neural network includes multiple convolutional layers, each containing one convolution kernel: the first kernel is 1x1x1, the second kernel is 3x3x3, and the last kernel is 1x1x1.
  • when a fused image is input, the network fuses the channel information and raises the feature dimension through the first convolutional layer, extracts features of the image information in both the time and space dimensions through the second convolutional layer, reduces the feature dimension through the last convolutional layer, and then splices the feature map output by the last convolutional layer with the fused image.
  • after the feature dimension is raised, it is reduced again to prevent gradient explosion; finally, to avoid the network reducing the feature dimension too far, the fused image is spliced with the feature map output by the last convolutional layer. Compared with that feature map alone, the stitched image has a higher feature dimension, which effectively prevents the gradient from vanishing.
  • compared with traditional two-stream behavior recognition methods, the image fusion-based behavior recognition method can be applied in fields such as smart government affairs, thereby promoting the development of smart cities.
  • by fusing the initial images with the computed optical flow images, this application not only realizes the fusion of image information and temporal information, but also uses the optical flow information to guide the 3D convolutional neural network's understanding of the initial images, and uses the attention mechanism to let the network learn more information about the regions of interest, so that behavior recognition performed on the fused images effectively ensures the accuracy of behavior recognition;
  • in addition, the trained 3D convolutional neural network has a single-branch network structure; compared with a two-branch 3D convolutional neural network, the single-branch structure reduces the complexity of the network model and hence of the entire behavior recognition process, making the whole process more centralized and improving the efficiency of behavior recognition.
  • This embodiment provides a computer-readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the steps in the above embodiment of the image fusion-based behavior recognition method are implemented, for example S11-S15 shown in FIG. 1; alternatively, when the computer-readable instructions are executed by the processor, the functions of the modules in the above device embodiment are realized, for example modules 301-306 in FIG. 3:
  • the obtaining module 301 is configured to obtain a video stream containing the target object in response to an instruction for identifying the behavior of the target object;
  • the extraction module 302 is configured to extract a plurality of initial images from the video stream
  • the sampling module 303 is configured to sample the plurality of initial images to obtain a plurality of sampled images;
  • the calculation module 304 is configured to perform optical flow calculation on the multiple initial images to obtain multiple optical flow images
  • the fusion module 305 is configured to fuse each of the initial images with the corresponding optical flow images based on an attention mechanism to obtain a plurality of fusion images;
  • the recognition module 306 is configured to input the multiple fused images into the pre-trained 3D convolutional neural network for behavior recognition.
  • the electronic device 4 includes a memory 41 , at least one processor 42 , at least one communication bus 43 and a transceiver 44 .
  • the structure of the electronic device shown in FIG. 4 does not constitute a limitation of the embodiments of this application; it may be a bus-type structure or a star structure, and the electronic device 4 may also include more or less hardware or software than shown, or a different arrangement of components.
  • the electronic device 4 is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes but is not limited to microprocessors, application-specific integrated circuits, programmable gate arrays, digital processors, and embedded devices.
  • the electronic device 4 may also include a client device, which includes but is not limited to any electronic product that can interact with the client through a keyboard, mouse, remote control, touch pad or voice control device, for example, Personal computers, tablets, smartphones, digital cameras, etc.
  • the electronic device 4 is only an example, and other existing or future electronic products that can be adapted to this application should also be included in the scope of protection of this application, and are included here by reference .
  • the memory 41 comprises volatile and nonvolatile memory, such as random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disk storage, disk storage, tape storage, or any other computer-readable storage medium that can be used to carry or store data.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created from the use of blockchain nodes, and the like.
  • blockchain, essentially a decentralized database, is a series of data blocks associated with each other using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the at least one processor 42 is the control core (control unit) of the electronic device 4; it uses various interfaces and lines to connect the various components of the entire electronic device 4, and executes the various functions of the electronic device 4 and processes data by running or executing the programs or modules stored in the memory 41 and calling the data stored in the memory 41.
  • when the at least one processor 42 executes the computer-readable instructions stored in the memory, all or part of the steps of the image fusion-based behavior recognition method described in the embodiments of this application are realized, or all or part of the functions of the image fusion-based behavior recognition device are realized.
  • the at least one processor 42 may be composed of an integrated circuit, for example, may be composed of a single packaged integrated circuit, or may be composed of multiple integrated circuits with the same function or different functions, including one or more central processing units (Central Processing unit, CPU), microprocessor, digital processing chip, graphics processor and a combination of various control chips, etc.
  • the at least one communication bus 43 is configured to realize connection and communication between the memory 41 and the at least one processor 42 and so on.
  • the electronic device 4 may also include a power supply (such as a battery) for supplying power to each component.
  • preferably, the power supply may be logically connected to the at least one processor 42 through a power management device, thereby realizing functions such as charge management, discharge management, and power consumption management.
  • the power supply may also include one or more DC or AC power supplies, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
  • the electronic device 4 may also include various sensors, bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the above-mentioned integrated units implemented in the form of software function modules can be stored in a computer-readable storage medium.
  • the above-mentioned software function modules are stored in a storage medium and include several instructions for enabling an electronic device (which may be a personal computer, a network device, etc.) or a processor to execute parts of the methods described in the embodiments of this application.
  • the modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, and may be located in one place or distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or in the form of hardware plus software function modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

This application relates to the technical field of artificial intelligence and provides an image fusion-based behavior recognition method, device, electronic device and medium. By fusing the initial images with the computed optical flow images, the fusion of image information and temporal information is realized; the optical flow information guides the 3D convolutional neural network's understanding of the information in the initial images, and the attention mechanism lets the 3D convolutional neural network learn more information of interest, so that behavior recognition is performed on the fused images, effectively ensuring the accuracy of behavior recognition. The trained 3D convolutional neural network has a single-branch network structure; compared with a two-branch 3D convolutional neural network, the single-branch structure reduces the complexity of the network model while guaranteeing the accuracy of behavior recognition, thereby reducing the complexity of the entire behavior recognition process, making the whole process more centralized, and improving the efficiency of behavior recognition.

Description

Image fusion-based behavior recognition method, device, electronic device and medium
This application claims priority to the Chinese patent application filed with the China Patent Office on September 17, 2021, with application number 202111093387.6 and invention title "Image fusion-based behavior recognition method, device, electronic device and medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of artificial intelligence, and in particular to an image fusion-based behavior recognition method, device, electronic device and medium.
Background
Behavior recognition is a very challenging topic in the field of computer vision, because it requires analyzing not only the spatial information of the target object but also information in the time dimension. How to better extract spatio-temporal features is the key to the problem. With the wide application and good results of deep neural networks in object detection, people are also exploring the use of neural networks for action recognition.
In realizing this application, the inventor found that the prior art sets up two networks, one for processing image space and the other for processing temporal information, and finally uses an SVM to associate the outputs of the two networks, achieving static-and-dynamic fusion of the target object and thereby behavior recognition. However, because this method sets up two network structures, the network structure is relatively complex and the two networks must be trained simultaneously, which increases the complexity of model training and lowers training efficiency, resulting in low behavior recognition efficiency.
Summary
In view of the above, it is necessary to propose an image fusion-based behavior recognition method, device, electronic device and medium that can simplify the structure of the neural network model, reduce its complexity, and improve the efficiency of behavior recognition while guaranteeing the accuracy of behavior recognition.
A first aspect of this application provides an image fusion-based behavior recognition method, the method comprising:
in response to an instruction to recognize the behavior of a target object, acquiring a video stream containing the target object;
extracting a plurality of initial images from the video stream;
performing optical flow calculation on the plurality of initial images to obtain a plurality of optical flow images;
fusing each of the initial images with the corresponding optical flow image based on an attention mechanism to obtain a plurality of fused images;
inputting the plurality of fused images into a pre-trained 3D convolutional neural network for behavior recognition, wherein the pre-trained 3D convolutional neural network has a single-branch network structure.
A second aspect of this application provides an image fusion-based behavior recognition device, the device comprising:
an acquisition module, configured to acquire a video stream containing a target object in response to an instruction to recognize the behavior of the target object;
an extraction module, configured to extract a plurality of initial images from the video stream;
a calculation module, configured to perform optical flow calculation on the plurality of initial images to obtain a plurality of optical flow images;
a fusion module, configured to fuse each of the initial images with the corresponding optical flow image based on an attention mechanism to obtain a plurality of fused images;
a recognition module, configured to input the plurality of fused images into a pre-trained 3D convolutional neural network for behavior recognition, wherein the pre-trained 3D convolutional neural network has a single-branch network structure.
A third aspect of this application provides an electronic device comprising a processor and a memory, the processor being configured to implement the following steps when executing the computer-readable instructions stored in the memory:
in response to an instruction to recognize the behavior of a target object, acquiring a video stream containing the target object;
extracting a plurality of initial images from the video stream;
performing optical flow calculation on the plurality of initial images to obtain a plurality of optical flow images;
fusing each of the initial images with the corresponding optical flow image based on an attention mechanism to obtain a plurality of fused images;
inputting the plurality of fused images into a pre-trained 3D convolutional neural network for behavior recognition, wherein the pre-trained 3D convolutional neural network has a single-branch network structure.
A fourth aspect of this application provides a computer-readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the following steps are implemented:
in response to an instruction to recognize the behavior of a target object, acquiring a video stream containing the target object;
extracting a plurality of initial images from the video stream;
performing optical flow calculation on the plurality of initial images to obtain a plurality of optical flow images;
fusing each of the initial images with the corresponding optical flow image based on an attention mechanism to obtain a plurality of fused images;
inputting the plurality of fused images into a pre-trained 3D convolutional neural network for behavior recognition, wherein the pre-trained 3D convolutional neural network has a single-branch network structure.
In summary, the image fusion-based behavior recognition method, device, electronic device and medium described in this application can be applied in fields such as smart government affairs, thereby promoting the development of smart cities. By fusing the initial images with the computed optical flow images, this application not only realizes the fusion of image information and temporal information, but also uses the optical flow information to guide the 3D convolutional neural network's understanding of the information in the initial images, and uses the attention mechanism to let the network learn more information about the regions of interest, so that behavior recognition performed on the fused images effectively ensures the accuracy of behavior recognition. In addition, the trained 3D convolutional neural network has a single-branch network structure; compared with a two-branch 3D convolutional neural network, the single-branch structure reduces the complexity of the network model and hence of the entire behavior recognition process, making the whole process more centralized and improving the efficiency of behavior recognition.
Brief Description of the Drawings
FIG. 1 is a flow chart of the image fusion-based behavior recognition method provided in Embodiment 1 of this application.
FIG. 2 is a schematic diagram of splicing fused images provided in Embodiment 2 of this application.
FIG. 3 is a structural diagram of the image fusion-based behavior recognition device provided in Embodiment 2 of this application.
FIG. 4 is a schematic structural diagram of the electronic device provided in Embodiment 3 of this application.
Detailed Description
In order that the above objects, features and advantages of this application can be understood more clearly, this application is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where there is no conflict, the embodiments of this application and the features in the embodiments can be combined with each other.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field to which this application belongs. The terms used in the specification of this application are only for the purpose of describing specific embodiments and are not intended to limit this application.
The image fusion-based behavior recognition method provided in the embodiments of this application is executed by an electronic device, and accordingly the image fusion-based behavior recognition device runs in the electronic device.
The embodiments of this application can recognize the behavior of a target object based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
Basic artificial intelligence technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. AI software technology mainly includes several major directions: computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
Embodiment 1
FIG. 1 is a flow chart of the image fusion-based behavior recognition method provided in Embodiment 1 of this application. The method specifically includes the following steps; depending on requirements, the order of the steps in the flow chart may be changed and some steps may be omitted.
S11: in response to an instruction to recognize the behavior of a target object, acquire a video stream containing the target object.
The target object is an object entity whose behavior needs to be recognized, for example a person or a pet. If behavior recognition needs to be performed on a certain person or pet, an image acquisition device can be used to collect a video stream of that person or pet. The image acquisition device may be a high-definition digital image acquisition device.
The instruction to recognize the behavior of the target object may be triggered by a user or automatically. When the electronic device receives the instruction, it responds by sending an acquisition instruction to the image acquisition device to control it to collect a video stream containing the target object. After collecting the video stream containing the target object, the image acquisition device sends the collected video stream to the electronic device; it can send the stream while collecting, or send it after collecting for a preset duration.
S12: extract a plurality of initial images from the video stream.
A video stream can be decomposed into two parts of information, spatial and temporal. The spatial information is expressed in the form of single images carrying static information such as the shape and color of the target object, while the temporal information is expressed dynamically through multiple consecutive frames, reflecting the movement information of the target object.
A collection frame rate can be preset in the electronic device, and the video stream is sampled at that frame rate to obtain a plurality of initial images, which are RGB images.
In an optional embodiment, after the plurality of initial images are extracted from the video stream, the method further includes:
detecting a target area corresponding to the target object in each of the initial images;
cropping the target area in each of the initial images to obtain a target image;
sampling the plurality of target images to obtain a plurality of sampled images.
In this optional embodiment, the YOLO target detection algorithm can be used to select the area where the target object is located in the initial image with a detection frame; the area selected by the detection frame is the target area.
Because the number of pixels in the target area is much smaller than that of the entire initial image, and the target area contains almost only the target object (a person or a pet) and no other non-target objects, taking the cropped target image as the input of the 3D convolutional neural network model not only helps improve the efficiency with which the model recognizes the behavior of the target object, but also, since there is no interference from non-target objects in the target image, improves the accuracy of the recognition.
In addition, because of differences in distance, the relative size of the target object differs between initial images, so the cropped target images differ in size. To ensure the consistency of the images input to the 3D convolutional neural network model, the multiple target images need to be sampled and thereby normalized, ensuring that the resulting sampled images have the same size.
In an optional embodiment, sampling the plurality of target images to obtain a plurality of sampled images includes:
acquiring the size of each target image;
performing first sampling on each target image according to the sizes to obtain a first sampled image, and performing second sampling on each target image to obtain a second sampled image;
calculating a first image quality of each first sampled image and a second image quality of each second sampled image;
comparing the multiple first image qualities with the multiple second image qualities to obtain a comparison result;
determining the plurality of sampled images according to the comparison result.
In this optional embodiment, different target images have different sizes, so the multiple target images may have multiple different sizes; each target image is sampled twice according to these sizes, so that after the two samplings each target image corresponds to two different sampled images.
Sampling in different ways necessarily produces two sampled images of different quality for each target image. Using the sampled images of better quality as the input of the 3D convolutional neural network model helps improve the model's recognition accuracy and thus the accuracy of recognizing the target object's behavior. To obtain the better-quality sampled images, the multiple first image qualities can be compared with the multiple second image qualities, and the comparison result determines which sampling method is used to sample the target images and obtain the sampled images.
In an optional embodiment, performing the first sampling on each target image according to the sizes to obtain a first sampled image, and performing the second sampling on each target image to obtain a second sampled image, includes:
acquiring the maximum size and the minimum size among the multiple sizes;
determining a first sampling rate for each target image according to the maximum size, and a second sampling rate for each target image according to the minimum size;
upsampling the corresponding target image according to the first sampling rate to obtain the first sampled image, and downsampling the corresponding target image according to the second sampling rate to obtain the second sampled image.
For example, suppose there are five target images F1, F2, F3, F4, F5 with sizes T1, T2, T3, T4, T5 respectively; sorting these five sizes (T1, T2, T3, T4, T5) from large to small, or from small to large, yields a size sequence.
Assume the maximum size in the sequence is T1. The first sampling rate of target image F1 is determined to be T1/T1 according to the maximum size T1, that of F2 to be T1/T2, that of F3 to be T1/T3, that of F4 to be T1/T4, and that of F5 to be T1/T5. Upsampling F1 through F5 according to their first sampling rates yields first sampled images F11, F21, F31, F41, F51, which all have the same size as target image F1.
Assume the minimum size in the sequence is T5. The second sampling rate of target image F1 is determined to be T5/T1 according to the minimum size T5, that of F2 to be T5/T2, that of F3 to be T5/T3, that of F4 to be T5/T4, and that of F5 to be T5/T5. Downsampling F1 through F5 according to their second sampling rates yields second sampled images F12, F22, F32, F42, F52, which all have the same size as target image F5.
In this optional manner, the sampling rate is determined according to the size of the target image, and the corresponding target image is upsampled or downsampled according to that rate. This realizes dynamic sampling of different target images and ensures that the sampled images are consistent either with the maximum size or with the minimum size of the target images.
In an optional embodiment, determining the plurality of sampled images according to the comparison result includes:
when the comparison result is that the average of the multiple first image qualities is greater than the average of the multiple second image qualities, determining the multiple first sampled images to be the plurality of sampled images;
when the comparison result is that the average of the multiple first image qualities is smaller than the average of the multiple second image qualities, determining the multiple second sampled images to be the plurality of sampled images.
Because upsampling or downsampling lowers or raises the quality of the target image, the quality of a sampled image differs from that of the target image. To ensure that the sampled images are, with high probability, of better quality, the multiple first image qualities are compared with the multiple second image qualities. The possible comparison results are: the average of the first image qualities is greater than the average of the second image qualities, or the average of the first image qualities is smaller than the average of the second image qualities.
If the average of the first image qualities is greater than the average of the second image qualities, the first image quality of most first sampled images obtained by upsampling is higher than the second image quality of most second sampled images obtained by downsampling; the electronic device therefore samples the target images by upsampling, determining the multiple first sampled images to be the final sampled images used as input to the 3D convolutional neural network model.
If the average of the first image qualities is smaller than the average of the second image qualities, the electronic device samples the target images by downsampling, determining the multiple second sampled images to be the final sampled images used as input to the 3D convolutional neural network model.
It should be noted that the case where the two averages are equal can be handled as either of the above cases: the target images may be sampled by upsampling, so that the multiple first sampled images become the final sampled images, or by downsampling, so that the multiple second sampled images become the final sampled images.
S13: perform optical flow calculation on the plurality of initial images to obtain a plurality of optical flow images.
For example, assuming that N consecutive initial images are extracted from the video stream, an optical flow image is calculated from every two adjacent initial images, yielding N-1 consecutive optical flow images.
In a possible embodiment, after the electronic device has detected and cropped the initial images, N consecutive target images are obtained, and sampling them yields N consecutive sampled images; performing optical flow calculation on the multiple initial images to obtain multiple optical flow images then means performing optical flow calculation on the multiple sampled images. An optical flow image is calculated from every two adjacent sampled images among the N consecutive sampled images, yielding N-1 consecutive optical flow images.
In an optional implementation, performing optical flow computation on the plurality of sampled images to obtain a plurality of optical flow images includes:
computing the optical flow field of each pair of adjacent sampled images using an optical flow algorithm;
performing threshold segmentation on the optical flow field;
selecting the target optical flow field whose values in the optical flow field exceed the threshold;
determining the target sampled images corresponding to the target optical flow field, and obtaining target optical flow images from the target optical flow field.
Within a short time span, the movement speed of the same target object between adjacent moments is limited: the brightness does not change between the consecutive initial images extracted from the video stream captured by the image acquisition device, the position of the target object does not change drastically, and the displacement of the target object between two adjacent initial images at adjacent moments is very small, involving only translation or stretch/compression transformations. The initial images are therefore strongly correlated, their pixel-level representations are roughly equivalent, and local regions differ little. An optical flow algorithm can thus be used to refine the initial images and correlate adjacent ones.
In this optional implementation, the optical flow algorithm computes the optical flow field: under suitable smoothness constraints, it estimates the motion field from the spatio-temporal gradients of the image sequence and detects and segments moving targets and scenes by analyzing changes in the motion field. There are generally two approaches, based on the global optical flow field or on the feature-point optical flow field. The feature-point optical flow field is preferred, as it is computationally light, fast, and flexible. The optical flow algorithm computes the optical flow vector of each point between two adjacent feature vectors; since the flow vectors of a moving object differ from those of the background, threshold segmentation can divide the optical flow field into two parts, distinguishing the moving object from the background. Preferably, the threshold is selected using the maximum between-class variance method (Otsu's algorithm).
In an optional embodiment, after performing threshold segmentation on the optical flow field, the method further includes: filtering the threshold-segmented optical flow field with morphological operations, and connecting the filtered optical flow field to obtain a new optical flow field.
In this embodiment, the threshold-segmented optical flow field contains isolated points and concave regions that hinder the extraction of the target object. The opening operation of morphological filtering can first be used to remove the concave regions whose flow values do not match the structuring element while keeping those that do; the closing operation of morphological filtering then fills the concave regions, so that the region corresponding to the target object becomes a single connected area.
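The flow, threshold, and morphology steps might be combined as in the sketch below, assuming OpenCV's dense Farneback flow as the optical flow algorithm (the patent prefers a feature-point flow field, so this is a named substitution for brevity) with its classic default parameters.

```python
import cv2
import numpy as np

def flow_foreground_masks(sampled_images):
    """N sampled images -> N-1 binary masks separating moving target from background."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    grays = [cv2.cvtColor(img, cv2.COLOR_RGB2GRAY) for img in sampled_images]
    masks = []
    for prev, curr in zip(grays, grays[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude = np.linalg.norm(flow, axis=2)
        magnitude = cv2.normalize(magnitude, None, 0, 255,
                                  cv2.NORM_MINMAX).astype(np.uint8)
        # Otsu picks the threshold that separates the moving object from background.
        _, mask = cv2.threshold(magnitude, 0, 255,
                                cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        # Opening removes isolated flow points; closing fills concave regions so
        # the target's region becomes one connected area.
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
        mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
        masks.append(mask)
    return masks
```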
S14: fusing each initial image with the corresponding optical flow image based on an attention mechanism to obtain a plurality of fused images.
The electronic device can set parameters for fusing the initial images with the optical flow images, normalizing and weight-fusing each channel to obtain the fused images. Each initial image is fused with its corresponding optical flow image to obtain one fused image. Arranging the fused images in order produces a fused image stream, thereby fusing the temporal information with the spatial information.
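One plausible reading of the channel-wise normalization and attention-weighted fusion is sketched below in PyTorch; the sigmoid gating over learned per-channel logits is an assumption, since the patent does not fix the form of the attention weights.

```python
import torch

def fuse_rgb_and_flow(rgb, flow, attention_logits):
    """rgb: (3, H, W) image; flow: (2, H, W) u/v components; logits: (5,)."""
    fused = torch.cat([rgb, flow], dim=0)                     # (5, H, W)
    # Normalize each channel to zero mean / unit variance.
    mean = fused.mean(dim=(1, 2), keepdim=True)
    std = fused.std(dim=(1, 2), keepdim=True).clamp_min(1e-6)
    fused = (fused - mean) / std
    # Weighted fusion: one learned attention weight per channel.
    weights = torch.sigmoid(attention_logits).view(5, 1, 1)
    return fused * weights
```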
In a possible implementation, fusing each initial image with the corresponding optical flow image based on the attention mechanism is carried out as fusing each target sampled image with the corresponding target optical flow image based on the attention mechanism.
S15: inputting the plurality of fused images into a pre-trained 3D convolutional neural network for behavior recognition.
The pre-trained 3D convolutional neural network has a single-branch network structure.
For the 3D convolutional neural network, the input fused images have size B*C*T*H*W, where B is the batch size, C is the number of channels (5 after fusion: 3 RGB channels + 2 optical flow channels), T is the time-series length, and H and W are the height and width of the fused image. When building the 3D convolutional neural network, the number of channels is simply set to 5; this captures the static features of the target object while also learning the evolution of the optical flow, reducing the complexity of the overall 3D convolutional neural network framework while preserving the same performance.
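As a concrete illustration of the B*C*T*H*W input with C = 5, using placeholder sizes not prescribed by the patent:

```python
import torch

# 8 clips per batch, 5 channels (3 RGB + 2 optical flow), 16 time steps, 112x112 frames.
batch_size, channels, time_steps, height, width = 8, 5, 16, 112, 112
fused_clips = torch.randn(batch_size, channels, time_steps, height, width)
```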
In an optional implementation, inputting the plurality of fused images into the pre-trained 3D convolutional neural network for behavior recognition includes:
obtaining the feature map output by the last convolutional layer of the 3D convolutional neural network;
splicing each fused image with the corresponding feature map to obtain a spliced image;
performing behavior recognition based on the spliced image.
Referring also to FIG. 2, the 3D convolutional neural network includes a plurality of convolutional layers, each containing one convolution kernel: the first kernel is 1x1x1, the second is 3x3x3, and the last is 1x1x1.
The fused images are input into the 3D convolutional neural network, which fuses the channel information and raises the feature dimension through the first convolutional layer, extracts image features in both the temporal and the spatial dimensions through the second convolutional layer, lowers the feature dimension through the last convolutional layer, and then splices the feature map output by the last convolutional layer with the fused image.
In this optional implementation, after the 3D convolutional neural network raises the feature dimension, it lowers the feature dimension again to prevent gradient explosion during gradient propagation. Finally, to keep the 3D convolutional neural network from lowering the feature dimension too far, the fused image is spliced with the feature map output by the last convolutional layer; compared with that feature map alone, the spliced image obtained by the splicing has a higher feature dimension, which effectively prevents vanishing gradients.
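The 1x1x1 -> 3x3x3 -> 1x1x1 block with the final splicing step might be sketched as follows in PyTorch; the intermediate channel width of 32 is an assumption, and this is an illustrative reading of FIG. 2 rather than the exact disclosed network.

```python
import torch
import torch.nn as nn

class FusionBottleneck3D(nn.Module):
    def __init__(self, in_channels=5, mid_channels=32):
        super().__init__()
        self.expand = nn.Conv3d(in_channels, mid_channels, kernel_size=1)  # raise feature dim
        self.spatio_temporal = nn.Conv3d(mid_channels, mid_channels,
                                         kernel_size=3, padding=1)          # temporal + spatial features
        self.reduce = nn.Conv3d(mid_channels, in_channels, kernel_size=1)  # lower feature dim
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                        # x: (B, 5, T, H, W) fused clip
        features = self.act(self.expand(x))
        features = self.act(self.spatio_temporal(features))
        features = self.reduce(features)
        # Splice the fused input onto the last layer's feature map along the
        # channel axis, keeping a higher feature dimension against vanishing gradients.
        return torch.cat([x, features], dim=1)   # (B, 10, T, H, W)
```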
Compared with traditional two-stream behavior recognition methods, the image fusion-based behavior recognition method provided by the embodiments of the present application can be applied in fields such as smart government affairs, promoting the development of smart cities. By fusing the initial images with the computed optical flow images, the present application not only fuses image information with temporal information, but also lets the optical flow information guide the 3D convolutional neural network's understanding of the information in the initial images, and the attention mechanism lets the 3D convolutional neural network learn more information from the regions of interest, so behavior recognition based on the fused images effectively guarantees recognition accuracy. Moreover, the trained 3D convolutional neural network has a single-branch network structure; compared with a two-branch 3D convolutional neural network, the single-branch structure reduces the complexity of the network model and thus of the whole behavior recognition process, making the process more centralized and improving recognition efficiency.
Embodiment Two
FIG. 3 is a structural diagram of the image fusion-based behavior recognition apparatus provided in Embodiment Two of the present application.
In some embodiments, the image fusion-based behavior recognition apparatus 30 may include a plurality of functional modules composed of segments of computer-readable instructions. The computer-readable instructions of the program segments in the apparatus 30 can be stored in the memory of an electronic device and executed by at least one processor to perform the image fusion-based behavior recognition function (see the description of FIG. 1 for details).
In this embodiment, the image fusion-based behavior recognition apparatus 30 can be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: an acquisition module 301, an extraction module 302, a sampling module 303, a computation module 304, a fusion module 305, and a recognition module 306. A module as referred to in this application is a series of computer-readable instruction segments stored in the memory that can be executed by at least one processor and that perform a fixed function. In this embodiment, the functions of the modules are detailed in the following description.
The acquisition module 301 is configured to acquire, in response to an instruction to recognize the behavior of a target object, a video stream containing the target object.
The target object is the entity whose behavior is to be recognized, for example a person or a pet. To recognize the behavior of a particular person or pet, an image acquisition device can be used to capture a video stream of that person or pet. The image acquisition device may be a high-definition digital image acquisition device.
The instruction to recognize the behavior of the target object may be triggered by a user or triggered automatically. Upon receiving the instruction to recognize the behavior of the target object, the electronic device responds to the instruction by sending an acquisition instruction to the image acquisition device to control the image acquisition device to capture a video stream containing the target object. After capturing the video stream containing the target object, the image acquisition device sends the captured video stream to the electronic device. The image acquisition device may send the video stream while capturing it, or it may send the captured video stream to the electronic device after capturing a video stream of a preset duration.
The extraction module 302 is configured to extract a plurality of initial images from the video stream.
A video stream can be decomposed into spatial information and temporal information. The spatial information is presented in the form of individual images, each carrying static information about the target object such as its shape and color, while the temporal information is expressed dynamically through multiple consecutive frames and reflects the movement of the target object.
A sampling frame rate may be preset in the electronic device, and the video stream is sampled at this frame rate to obtain a plurality of initial images, where the initial images are RGB images.
In an optional implementation, after the plurality of initial images are extracted from the video stream, the sampling module 303 is configured to sample the plurality of initial images to obtain a plurality of sampled images.
In an optional implementation, the sampling module 303 sampling the plurality of initial images to obtain a plurality of sampled images includes:
detecting a target region corresponding to the target object in each initial image;
cropping the target region out of each initial image to obtain a target image;
sampling the plurality of target images to obtain a plurality of sampled images.
In this optional implementation, a YOLO object detection algorithm may be used to frame the region of each initial image in which the target object is located with a detection box; the region enclosed by the detection box is the target region.
Since the number of pixels in the target region is far smaller than that of the whole initial image, and the target region contains almost only the target object (a person or a pet) without other non-target objects, using the cropped target image as the input of the 3D convolutional neural network model not only helps improve the efficiency with which the model recognizes the behavior of the target object, but also, because the target image is free of interference from non-target objects, improves the accuracy of that recognition.
In addition, because of differences in shooting distance, the relative size of the target object differs between initial images, so the cropped target images differ in size. To ensure the consistency of the images input into the 3D convolutional neural network model, the plurality of target images need to be sampled to normalize them, ensuring that the resulting sampled images are of the same size.
In an optional implementation, the sampling module 303 sampling the plurality of target images to obtain a plurality of sampled images includes:
obtaining the size of each target image;
performing first sampling on each target image according to the sizes to obtain first sampled images, and performing second sampling on each target image to obtain second sampled images;
calculating a first image quality of each first sampled image, and calculating a second image quality of each second sampled image;
comparing the plurality of first image qualities with the plurality of second image qualities to obtain a comparison result;
determining the plurality of sampled images according to the comparison result.
In this optional implementation, different target images have different sizes, so the plurality of target images may cover several different sizes. Each target image is sampled twice according to these sizes, so that after the two samplings each target image corresponds to two different sampled images.
Sampling with different sampling methods inevitably produces two sampled images of different quality for each target image. Using the sampled image with the better image quality as the input of the 3D convolutional neural network model helps improve the recognition accuracy of the model and thus the accuracy of recognizing the behavior of the target object. To obtain sampled images of better quality, the plurality of first image qualities can be compared with the plurality of second image qualities to obtain a comparison result, and the comparison result then determines which sampling method is used to sample the target images.
In an optional implementation, performing first sampling on each target image according to the sizes to obtain first sampled images, and performing second sampling on each target image to obtain second sampled images, includes:
obtaining the maximum size and the minimum size among the plurality of sizes;
determining a first sampling rate for each target image according to the maximum size, and a second sampling rate for each target image according to the minimum size;
upsampling each target image at its first sampling rate to obtain a first sampled image, and downsampling each target image at its second sampling rate to obtain a second sampled image.
For example, suppose there are five target images F1, F2, F3, F4, F5, where F1 has size T1, F2 has size T2, F3 has size T3, F4 has size T4, and F5 has size T5. Sorting these five sizes (T1, T2, T3, T4, T5) in descending or ascending order yields a size sequence.
Take the maximum size in the size sequence, assumed to be T1. According to the maximum size T1, the first sampling rate of target image F1 is T1/T1, of F2 is T1/T2, of F3 is T1/T3, of F4 is T1/T4, and of F5 is T1/T5. Upsampling F1 at first sampling rate T1/T1 yields first sampled image F11; upsampling F2 at T1/T2 yields F21; upsampling F3 at T1/T3 yields F31; upsampling F4 at T1/T4 yields F41; upsampling F5 at T1/T5 yields F51. The upsampled first sampled images F11, F21, F31, F41, F51 all have the same size as target image F1.
Take the minimum size in the size sequence, assumed to be T5. According to the minimum size T5, the second sampling rate of target image F1 is T5/T1, of F2 is T5/T2, of F3 is T5/T3, of F4 is T5/T4, and of F5 is T5/T5. Downsampling F1 at second sampling rate T5/T1 yields second sampled image F12; downsampling F2 at T5/T2 yields F22; downsampling F3 at T5/T3 yields F32; downsampling F4 at T5/T4 yields F42; downsampling F5 at T5/T5 yields F52. The downsampled second sampled images F12, F22, F32, F42, F52 all have the same size as target image F5.
In this optional approach, a sampling rate is determined from the size of each target image, and the corresponding target image is upsampled or downsampled at that rate. This enables dynamic sampling of different target images and ensures that the sampled images are all as large as the largest target image, or all as small as the smallest target image.
In an optional implementation, determining the plurality of sampled images according to the comparison result includes:
when the comparison result is that the average of the plurality of first image qualities is greater than the average of the plurality of second image qualities, determining the plurality of first sampled images as the plurality of sampled images;
when the comparison result is that the average of the plurality of first image qualities is less than the average of the plurality of second image qualities, determining the plurality of second sampled images as the plurality of sampled images.
Because upsampling or downsampling lowers or raises the quality of a target image, the quality of a sampled image is higher or lower than that of the corresponding target image. To ensure with high probability that the sampled images have the better image quality, the plurality of first image qualities are compared with the plurality of second image qualities. The possible comparison results are: the average of the plurality of first image qualities is greater than the average of the plurality of second image qualities, or the average of the plurality of first image qualities is less than the average of the plurality of second image qualities.
If the average of the plurality of first image qualities is greater than the average of the plurality of second image qualities, most of the upsampled first sampled images have a higher first image quality than the second image quality of most of the downsampled second sampled images. The electronic device can therefore decide to upsample the target images, that is, determine the plurality of first sampled images as the final plurality of sampled images to be input into the 3D convolutional neural network model.
If the average of the plurality of first image qualities is less than the average of the plurality of second image qualities, most of the upsampled first sampled images have a lower first image quality than the second image quality of most of the downsampled second sampled images. The electronic device can therefore decide to downsample the target images, that is, determine the plurality of second sampled images as the final plurality of sampled images to be input into the 3D convolutional neural network model.
It should be noted that when the comparison result is that the average of the plurality of first image qualities equals the average of the plurality of second image qualities, either branch applies: the case may be handled as if the average of the first image qualities were greater, or as if it were less. That is, when the two averages are equal, the target images may be upsampled, so that the plurality of first sampled images are determined as the final plurality of sampled images, or downsampled, so that the plurality of second sampled images are determined as the final plurality of sampled images.
The computation module 304 is configured to perform optical flow computation on the plurality of initial images to obtain a plurality of optical flow images.
For example, suppose N consecutive initial images are extracted from the video stream. One optical flow image is computed from each pair of adjacent initial images among the N, yielding N-1 consecutive optical flow images.
In a possible implementation, after the electronic device has performed the detection and cropping of the initial images, N consecutive target images are obtained, and sampling these N consecutive target images yields N consecutive sampled images. In this case, performing optical flow computation on the plurality of initial images to obtain a plurality of optical flow images is carried out as performing optical flow computation on the plurality of sampled images: one optical flow image is computed from each pair of adjacent sampled images among the N, yielding N-1 consecutive optical flow images.
In an optional implementation, the computation module 304 performing optical flow computation on the plurality of sampled images to obtain a plurality of optical flow images includes:
computing the optical flow field of each pair of adjacent sampled images using an optical flow algorithm;
performing threshold segmentation on the optical flow field;
selecting the target optical flow field whose values in the optical flow field exceed the threshold;
determining the target sampled images corresponding to the target optical flow field, and obtaining target optical flow images from the target optical flow field.
Within a short time span, the movement speed of the same target object between adjacent moments is limited: the brightness does not change between the consecutive initial images extracted from the video stream captured by the image acquisition device, the position of the target object does not change drastically, and the displacement of the target object between two adjacent initial images at adjacent moments is very small, involving only translation or stretch/compression transformations. The initial images are therefore strongly correlated, their pixel-level representations are roughly equivalent, and local regions differ little. An optical flow algorithm can thus be used to refine the initial images and correlate adjacent ones.
In this optional implementation, the optical flow algorithm computes the optical flow field: under suitable smoothness constraints, it estimates the motion field from the spatio-temporal gradients of the image sequence and detects and segments moving targets and scenes by analyzing changes in the motion field. There are generally two approaches, based on the global optical flow field or on the feature-point optical flow field. The feature-point optical flow field is preferred, as it is computationally light, fast, and flexible. The optical flow algorithm computes the optical flow vector of each point between two adjacent feature vectors; since the flow vectors of a moving object differ from those of the background, threshold segmentation can divide the optical flow field into two parts, distinguishing the moving object from the background. Preferably, the threshold is selected using the maximum between-class variance method (Otsu's algorithm).
In an optional embodiment, after performing threshold segmentation on the optical flow field, the electronic device may further filter the threshold-segmented optical flow field with morphological operations and connect the filtered optical flow field to obtain a new optical flow field.
In this embodiment, the threshold-segmented optical flow field contains isolated points and concave regions that hinder the extraction of the target object. The opening operation of morphological filtering can first be used to remove the concave regions whose flow values do not match the structuring element while keeping those that do; the closing operation of morphological filtering then fills the concave regions, so that the region corresponding to the target object becomes a single connected area.
The fusion module 305 is configured to fuse each initial image with the corresponding optical flow image based on an attention mechanism to obtain a plurality of fused images.
The electronic device can set parameters for fusing the initial images with the optical flow images, normalizing and weight-fusing each channel to obtain the fused images. Each initial image is fused with its corresponding optical flow image to obtain one fused image. Arranging the fused images in order produces a fused image stream, thereby fusing the temporal information with the spatial information.
In a possible implementation, fusing each initial image with the corresponding optical flow image based on the attention mechanism is carried out as fusing each target sampled image with the corresponding target optical flow image based on the attention mechanism.
The recognition module 306 is configured to input the plurality of fused images into a pre-trained 3D convolutional neural network for behavior recognition.
The pre-trained 3D convolutional neural network has a single-branch network structure.
For the 3D convolutional neural network, the input fused images have size B*C*T*H*W, where B is the batch size, C is the number of channels (5 after fusion: 3 RGB channels + 2 optical flow channels), T is the time-series length, and H and W are the height and width of the fused image. When building the 3D convolutional neural network, the number of channels is simply set to 5; this captures the static features of the target object while also learning the evolution of the optical flow, reducing the complexity of the overall 3D convolutional neural network framework while preserving the same performance.
In an optional implementation, the recognition module 306 inputting the plurality of fused images into the pre-trained 3D convolutional neural network for behavior recognition includes:
obtaining the feature map output by the last convolutional layer of the 3D convolutional neural network;
splicing each fused image with the corresponding feature map to obtain a spliced image;
performing behavior recognition based on the spliced image.
Referring also to FIG. 2, the 3D convolutional neural network includes a plurality of convolutional layers, each containing one convolution kernel: the first kernel is 1x1x1, the second is 3x3x3, and the last is 1x1x1.
The fused images are input into the 3D convolutional neural network, which fuses the channel information and raises the feature dimension through the first convolutional layer, extracts image features in both the temporal and the spatial dimensions through the second convolutional layer, lowers the feature dimension through the last convolutional layer, and then splices the feature map output by the last convolutional layer with the fused image.
In this optional implementation, after the 3D convolutional neural network raises the feature dimension, it lowers the feature dimension again to prevent gradient explosion during gradient propagation. Finally, to keep the 3D convolutional neural network from lowering the feature dimension too far, the fused image is spliced with the feature map output by the last convolutional layer; compared with that feature map alone, the spliced image obtained by the splicing has a higher feature dimension, which effectively prevents vanishing gradients.
Compared with traditional two-stream behavior recognition methods, the image fusion-based behavior recognition method provided by the embodiments of the present application can be applied in fields such as smart government affairs, promoting the development of smart cities. By fusing the initial images with the computed optical flow images, the present application not only fuses image information with temporal information, but also lets the optical flow information guide the 3D convolutional neural network's understanding of the information in the initial images, and the attention mechanism lets the 3D convolutional neural network learn more information from the regions of interest, so behavior recognition based on the fused images effectively guarantees recognition accuracy. Moreover, the trained 3D convolutional neural network has a single-branch network structure; compared with a two-branch 3D convolutional neural network, the single-branch structure reduces the complexity of the network model and thus of the whole behavior recognition process, making the process more centralized and improving recognition efficiency.
Embodiment Three
This embodiment provides a computer-readable storage medium storing computer-readable instructions that, when executed by a processor, implement the steps of the above image fusion-based behavior recognition method embodiment, for example S11-S15 shown in FIG. 1:
S11: in response to an instruction to recognize the behavior of a target object, acquiring a video stream containing the target object;
S12: extracting a plurality of initial images from the video stream;
S13: performing optical flow computation on the plurality of initial images to obtain a plurality of optical flow images;
S14: fusing each initial image with the corresponding optical flow image based on an attention mechanism to obtain a plurality of fused images;
S15: inputting the plurality of fused images into a pre-trained 3D convolutional neural network for behavior recognition.
Alternatively, when executed by the processor, the computer-readable instructions implement the functions of the modules/units in the above apparatus embodiment, for example modules 301-306 in FIG. 3:
the acquisition module 301, configured to acquire, in response to an instruction to recognize the behavior of a target object, a video stream containing the target object;
the extraction module 302, configured to extract a plurality of initial images from the video stream;
the sampling module 303, configured to sample the initial images to obtain a plurality of sampled images;
the computation module 304, configured to perform optical flow computation on the plurality of initial images to obtain a plurality of optical flow images;
the fusion module 305, configured to fuse each initial image with the corresponding optical flow image based on an attention mechanism to obtain a plurality of fused images;
the recognition module 306, configured to input the plurality of fused images into a pre-trained 3D convolutional neural network for behavior recognition.
Embodiment Four
Referring to FIG. 4, a schematic structural diagram of the electronic device provided in Embodiment Four of the present application. In a preferred embodiment of the present application, the electronic device 4 includes a memory 41, at least one processor 42, at least one communication bus 43, and a transceiver 44.
Those skilled in the art should understand that the structure of the electronic device shown in FIG. 4 does not limit the embodiments of the present application. It may be a bus topology or a star topology, and the electronic device 4 may include more or less hardware or software than illustrated, or a different arrangement of components.
In some embodiments, the electronic device 4 is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits, programmable gate arrays, digital processors, and embedded devices. The electronic device 4 may also include client equipment, which includes, but is not limited to, any electronic product that can interact with a user via a keyboard, mouse, remote control, touchpad, or voice control device, for example a personal computer, a tablet, a smartphone, or a digital camera.
It should be noted that the electronic device 4 is only an example; other existing or future electronic products adaptable to the present application should also fall within the protection scope of the present application and are incorporated herein by reference.
In some embodiments, the memory 41 stores computer-readable instructions that, when executed by the at least one processor 42, implement all or part of the steps of the image fusion-based behavior recognition method described above. The memory 41 includes volatile and non-volatile memory, such as random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically-erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable storage medium capable of carrying or storing data. The computer-readable storage medium may be non-volatile or volatile.
Further, the computer-readable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application required by at least one function, and so on, and the data storage area may store data created from the use of blockchain nodes and the like.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated through cryptographic association, each block containing a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
In some embodiments, the at least one processor 42 is the control unit of the electronic device 4, connecting the components of the whole electronic device 4 via various interfaces and lines, and executing the functions of the electronic device 4 and processing data by running or executing programs or modules stored in the memory 41 and invoking data stored in the memory 41. For example, when executing the computer-readable instructions stored in the memory, the at least one processor 42 implements all or part of the steps of the image fusion-based behavior recognition method described in the embodiments of this application, or all or part of the functions of the image fusion-based behavior recognition apparatus. The at least one processor 42 may be composed of integrated circuits, for example a single packaged integrated circuit or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and so on.
In some embodiments, the at least one communication bus 43 is arranged to enable connection and communication between the memory 41 and the at least one processor 42.
Although not shown, the electronic device 4 may further include a power supply (such as a battery) for powering the components. Preferably, the power supply may be logically connected to the at least one processor 42 through a power management device, so that functions such as charge management, discharge management, and power consumption management are realized through the power management device. The power supply may further include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components. The electronic device 4 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described here.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to cause an electronic device (which may be a personal computer, a network device, or the like) or a processor to execute parts of the methods described in the embodiments of this application.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division into modules is only a division by logical function, and other divisions are possible in actual implementation.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that this application is not limited to the details of the above exemplary embodiments and can be realized in other specific forms without departing from the spirit or essential characteristics of this application. Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than by the above description, and all changes falling within the meaning and scope of equivalents of the claims are therefore intended to be embraced in this application. No reference sign in the claims should be construed as limiting the claim concerned. Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or apparatuses stated in the specification may also be implemented by one unit or apparatus through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application and not to limit them. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application can be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of this application.

Claims (22)

  1. A method of behavior recognition based on image fusion, wherein the method comprises:
    acquiring, in response to an instruction to identify the behavior of a target object, a video stream containing the target object;
    extracting a plurality of initial images from the video stream;
    performing optical flow computation on the plurality of initial images to obtain a plurality of optical flow images;
    fusing each of the initial images with the corresponding optical flow image based on an attention mechanism to obtain a plurality of fused images;
    inputting the plurality of fused images into a pre-trained 3D convolutional neural network for behavior recognition, wherein the pre-trained 3D convolutional neural network has a single-branch network structure.
  2. The method of behavior recognition based on image fusion of claim 1, wherein after the plurality of initial images are extracted from the video stream, the method further comprises:
    detecting a target region corresponding to the target object in each of the initial images;
    cropping the target region out of each of the initial images to obtain target images;
    sampling the plurality of target images to obtain a plurality of sampled images;
    wherein the performing optical flow computation on the plurality of initial images to obtain a plurality of optical flow images comprises: performing optical flow computation on the plurality of sampled images to obtain the plurality of optical flow images.
  3. The method of behavior recognition based on image fusion of claim 2, wherein the sampling the plurality of target images to obtain a plurality of sampled images comprises:
    obtaining a size of each of the target images;
    performing first sampling on each of the target images according to the sizes to obtain first sampled images, and performing second sampling on each of the target images to obtain second sampled images;
    calculating a first image quality of each of the first sampled images, and calculating a second image quality of each of the second sampled images;
    comparing the plurality of first image qualities with the plurality of second image qualities to obtain a comparison result;
    determining the plurality of sampled images according to the comparison result.
  4. The method of behavior recognition based on image fusion of claim 3, wherein the performing first sampling on each of the target images according to the sizes to obtain first sampled images, and performing second sampling on each of the target images to obtain second sampled images, comprises:
    obtaining a maximum size and a minimum size among the plurality of sizes;
    determining a first sampling rate of each of the target images according to the maximum size, and determining a second sampling rate of each of the target images according to the minimum size;
    upsampling the corresponding target image according to the first sampling rate to obtain a first sampled image, and downsampling the corresponding target image according to the second sampling rate to obtain a second sampled image.
  5. The method of behavior recognition based on image fusion of claim 3, wherein the determining the plurality of sampled images according to the comparison result comprises:
    when the comparison result is that an average of the plurality of first image qualities is greater than an average of the plurality of second image qualities, determining the plurality of first sampled images as the plurality of sampled images;
    when the comparison result is that the average of the plurality of first image qualities is less than the average of the plurality of second image qualities, determining the plurality of second sampled images as the plurality of sampled images.
  6. The method of behavior recognition based on image fusion of any one of claims 2 to 5, wherein the performing optical flow computation on the plurality of sampled images to obtain a plurality of optical flow images comprises:
    computing an optical flow field of each pair of adjacent sampled images using an optical flow algorithm;
    performing threshold segmentation on the optical flow field;
    selecting a target optical flow field whose values in the optical flow field exceed the threshold;
    determining target sampled images corresponding to the target optical flow field, and obtaining target optical flow images according to the target optical flow field;
    wherein the fusing each of the initial images with the corresponding optical flow image based on an attention mechanism comprises: fusing each of the target sampled images with the corresponding target optical flow image based on the attention mechanism.
  7. The method of behavior recognition based on image fusion of any one of claims 2 to 5, wherein the inputting the plurality of fused images into a pre-trained 3D convolutional neural network for behavior recognition comprises:
    obtaining a feature map output by the last convolutional layer of the 3D convolutional neural network;
    splicing each of the fused images with the corresponding feature map to obtain spliced images;
    performing behavior recognition based on the spliced images.
  8. A behavior recognition device based on image fusion, wherein the device comprises:
    an acquisition module configured to acquire, in response to an instruction to identify the behavior of a target object, a video stream containing the target object;
    an extraction module configured to extract a plurality of initial images from the video stream;
    a computation module configured to perform optical flow computation on the plurality of initial images to obtain a plurality of optical flow images;
    a fusion module configured to fuse each of the initial images with the corresponding optical flow image based on an attention mechanism to obtain a plurality of fused images;
    a recognition module configured to input the plurality of fused images into a pre-trained 3D convolutional neural network for behavior recognition, wherein the pre-trained 3D convolutional neural network has a single-branch network structure.
  9. An electronic device, wherein the electronic device comprises a processor and a memory, the processor being configured to execute computer-readable instructions stored in the memory to implement the following steps:
    acquiring, in response to an instruction to identify the behavior of a target object, a video stream containing the target object;
    extracting a plurality of initial images from the video stream;
    performing optical flow computation on the plurality of initial images to obtain a plurality of optical flow images;
    fusing each of the initial images with the corresponding optical flow image based on an attention mechanism to obtain a plurality of fused images;
    inputting the plurality of fused images into a pre-trained 3D convolutional neural network for behavior recognition, wherein the pre-trained 3D convolutional neural network has a single-branch network structure.
  10. The electronic device of claim 9, wherein after the plurality of initial images are extracted from the video stream, the processor further executes the computer-readable instructions to implement the following steps:
    detecting a target region corresponding to the target object in each of the initial images;
    cropping the target region out of each of the initial images to obtain target images;
    sampling the plurality of target images to obtain a plurality of sampled images;
    wherein the performing optical flow computation on the plurality of initial images to obtain a plurality of optical flow images comprises: performing optical flow computation on the plurality of sampled images to obtain the plurality of optical flow images.
  11. The electronic device of claim 10, wherein when the processor executes the computer-readable instructions to sample the plurality of target images to obtain a plurality of sampled images, the steps specifically comprise:
    obtaining a size of each of the target images;
    performing first sampling on each of the target images according to the sizes to obtain first sampled images, and performing second sampling on each of the target images to obtain second sampled images;
    calculating a first image quality of each of the first sampled images, and calculating a second image quality of each of the second sampled images;
    comparing the plurality of first image qualities with the plurality of second image qualities to obtain a comparison result;
    determining the plurality of sampled images according to the comparison result.
  12. The electronic device of claim 11, wherein when the processor executes the computer-readable instructions to perform first sampling on each of the target images according to the sizes to obtain first sampled images, and to perform second sampling on each of the target images to obtain second sampled images, the steps specifically comprise:
    obtaining a maximum size and a minimum size among the plurality of sizes;
    determining a first sampling rate of each of the target images according to the maximum size, and determining a second sampling rate of each of the target images according to the minimum size;
    upsampling the corresponding target image according to the first sampling rate to obtain a first sampled image, and downsampling the corresponding target image according to the second sampling rate to obtain a second sampled image.
  13. The electronic device of claim 11, wherein when the processor executes the computer-readable instructions to determine the plurality of sampled images according to the comparison result, the steps specifically comprise:
    when the comparison result is that an average of the plurality of first image qualities is greater than an average of the plurality of second image qualities, determining the plurality of first sampled images as the plurality of sampled images;
    when the comparison result is that the average of the plurality of first image qualities is less than the average of the plurality of second image qualities, determining the plurality of second sampled images as the plurality of sampled images.
  14. The electronic device of any one of claims 10 to 13, wherein when the processor executes the computer-readable instructions to perform optical flow computation on the plurality of sampled images to obtain a plurality of optical flow images, the steps specifically comprise:
    computing an optical flow field of each pair of adjacent sampled images using an optical flow algorithm;
    performing threshold segmentation on the optical flow field;
    selecting a target optical flow field whose values in the optical flow field exceed the threshold;
    determining target sampled images corresponding to the target optical flow field, and obtaining target optical flow images according to the target optical flow field;
    wherein when the processor executes the computer-readable instructions to fuse each of the initial images with the corresponding optical flow image based on the attention mechanism, the steps specifically comprise: fusing each of the target sampled images with the corresponding target optical flow image based on the attention mechanism.
  15. The electronic device of any one of claims 10 to 13, wherein when the processor executes the computer-readable instructions to input the plurality of fused images into the pre-trained 3D convolutional neural network for behavior recognition, the steps specifically comprise:
    obtaining a feature map output by the last convolutional layer of the 3D convolutional neural network;
    splicing each of the fused images with the corresponding feature map to obtain spliced images;
    performing behavior recognition based on the spliced images.
  16. A computer-readable storage medium having computer-readable instructions stored thereon, wherein when the computer-readable instructions are executed by a processor, the following steps are implemented:
    acquiring, in response to an instruction to identify the behavior of a target object, a video stream containing the target object;
    extracting a plurality of initial images from the video stream;
    performing optical flow computation on the plurality of initial images to obtain a plurality of optical flow images;
    fusing each of the initial images with the corresponding optical flow image based on an attention mechanism to obtain a plurality of fused images;
    inputting the plurality of fused images into a pre-trained 3D convolutional neural network for behavior recognition, wherein the pre-trained 3D convolutional neural network has a single-branch network structure.
  17. The computer-readable storage medium of claim 16, wherein after the plurality of initial images are extracted from the video stream, the computer-readable instructions are further executed by the processor to implement the following steps:
    detecting a target region corresponding to the target object in each of the initial images;
    cropping the target region out of each of the initial images to obtain target images;
    sampling the plurality of target images to obtain a plurality of sampled images;
    wherein the performing optical flow computation on the plurality of initial images to obtain a plurality of optical flow images comprises: performing optical flow computation on the plurality of sampled images to obtain the plurality of optical flow images.
  18. The computer-readable storage medium of claim 17, wherein when the computer-readable instructions are executed by the processor to sample the plurality of target images to obtain a plurality of sampled images, the steps specifically comprise:
    obtaining a size of each of the target images;
    performing first sampling on each of the target images according to the sizes to obtain first sampled images, and performing second sampling on each of the target images to obtain second sampled images;
    calculating a first image quality of each of the first sampled images, and calculating a second image quality of each of the second sampled images;
    comparing the plurality of first image qualities with the plurality of second image qualities to obtain a comparison result;
    determining the plurality of sampled images according to the comparison result.
  19. The computer-readable storage medium of claim 18, wherein when the computer-readable instructions are executed by the processor to perform first sampling on each of the target images according to the sizes to obtain first sampled images, and to perform second sampling on each of the target images to obtain second sampled images, the steps specifically comprise:
    obtaining a maximum size and a minimum size among the plurality of sizes;
    determining a first sampling rate of each of the target images according to the maximum size, and determining a second sampling rate of each of the target images according to the minimum size;
    upsampling the corresponding target image according to the first sampling rate to obtain a first sampled image, and downsampling the corresponding target image according to the second sampling rate to obtain a second sampled image.
  20. The computer-readable storage medium of claim 18, wherein when the computer-readable instructions are executed by the processor to determine the plurality of sampled images according to the comparison result, the steps specifically comprise:
    when the comparison result is that an average of the plurality of first image qualities is greater than an average of the plurality of second image qualities, determining the plurality of first sampled images as the plurality of sampled images;
    when the comparison result is that the average of the plurality of first image qualities is less than the average of the plurality of second image qualities, determining the plurality of second sampled images as the plurality of sampled images.
  21. The computer-readable storage medium of any one of claims 17 to 20, wherein when the computer-readable instructions are executed by the processor to perform optical flow computation on the plurality of sampled images to obtain a plurality of optical flow images, the steps specifically comprise:
    computing an optical flow field of each pair of adjacent sampled images using an optical flow algorithm;
    performing threshold segmentation on the optical flow field;
    selecting a target optical flow field whose values in the optical flow field exceed the threshold;
    determining target sampled images corresponding to the target optical flow field, and obtaining target optical flow images according to the target optical flow field;
    wherein the fusing each of the initial images with the corresponding optical flow image based on the attention mechanism specifically comprises: fusing each of the target sampled images with the corresponding target optical flow image based on the attention mechanism.
  22. The computer-readable storage medium of any one of claims 17 to 20, wherein when the computer-readable instructions are executed by the processor to input the plurality of fused images into the pre-trained 3D convolutional neural network for behavior recognition, the steps specifically comprise:
    obtaining a feature map output by the last convolutional layer of the 3D convolutional neural network;
    splicing each of the fused images with the corresponding feature map to obtain spliced images;
    performing behavior recognition based on the spliced images.
PCT/CN2022/071329 2021-09-17 2022-01-11 Image fusion-based behavior recognition method and apparatus, electronic device, and medium WO2023040146A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111093387.6A CN113792680A (zh) 2021-09-17 2021-09-17 Image fusion-based behavior recognition method and apparatus, electronic device, and medium
CN202111093387.6 2021-09-17

Publications (1)

Publication Number Publication Date
WO2023040146A1 true WO2023040146A1 (zh) 2023-03-23

Family

ID=78878787

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071329 WO2023040146A1 (zh) 2021-09-17 2022-01-11 基于图像融合的行为识别方法、装置、电子设备及介质

Country Status (2)

Country Link
CN (1) CN113792680A (zh)
WO (1) WO2023040146A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116309080A (zh) * 2023-05-11 2023-06-23 武汉纺织大学 Unmanned aerial vehicle video stitching method

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792680A (zh) * 2021-09-17 2021-12-14 平安科技(深圳)有限公司 基于图像融合的行为识别方法、装置、电子设备及介质
CN114399839A (zh) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 基于特征融合的行为识别方法、装置、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543526A (zh) * 2018-10-19 2019-03-29 谢飞 System for recognizing true and false facial paralysis based on deep difference features
CN110084228A (zh) * 2019-06-25 2019-08-02 江苏德劭信息科技有限公司 Automatic dangerous behavior recognition method based on a two-stream convolutional neural network
CN111462183A (zh) * 2020-03-31 2020-07-28 山东大学 Behavior recognition method and system based on an attention-mechanism two-stream network
CN112990077A (zh) * 2021-04-02 2021-06-18 中国矿业大学 Facial action unit recognition method and apparatus based on joint learning and optical flow estimation
CN113792680A (zh) * 2021-09-17 2021-12-14 平安科技(深圳)有限公司 Image fusion-based behavior recognition method and apparatus, electronic device, and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488475A (zh) * 2019-01-29 2020-08-04 北京三星通信技术研究有限公司 Image retrieval method and apparatus, electronic device, and computer-readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543526A (zh) * 2018-10-19 2019-03-29 谢飞 System for recognizing true and false facial paralysis based on deep difference features
CN110084228A (zh) * 2019-06-25 2019-08-02 江苏德劭信息科技有限公司 Automatic dangerous behavior recognition method based on a two-stream convolutional neural network
CN111462183A (zh) * 2020-03-31 2020-07-28 山东大学 Behavior recognition method and system based on an attention-mechanism two-stream network
CN112990077A (zh) * 2021-04-02 2021-06-18 中国矿业大学 Facial action unit recognition method and apparatus based on joint learning and optical flow estimation
CN113792680A (zh) * 2021-09-17 2021-12-14 平安科技(深圳)有限公司 Image fusion-based behavior recognition method and apparatus, electronic device, and medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116309080A (zh) * 2023-05-11 2023-06-23 武汉纺织大学 Unmanned aerial vehicle video stitching method
CN116309080B (zh) * 2023-05-11 2023-08-11 武汉纺织大学 Unmanned aerial vehicle video stitching method

Also Published As

Publication number Publication date
CN113792680A (zh) 2021-12-14

Similar Documents

Publication Publication Date Title
WO2023040146A1 (zh) Image fusion-based behavior recognition method and apparatus, electronic device, and medium
US9251425B2 (en) Object retrieval in video data using complementary detectors
US20180114071A1 (en) Method for analysing media content
US20230267735A1 (en) Method for structuring pedestrian information, device, apparatus and storage medium
CN111160350B (zh) 人像分割方法、模型训练方法、装置、介质及电子设备
CN111428664B (zh) 一种基于深度学习技术的计算机视觉的实时多人姿态估计方法
JP2022538928A (ja) 画像処理方法及び装置、電子機器、コンピュータ可読記憶媒体
CN110619284B (zh) 一种视频场景划分方法、装置、设备及介质
KR20200010993A (ko) 보완된 cnn을 통해 이미지 속 얼굴의 속성 및 신원을 인식하는 전자 장치.
WO2024001123A1 (zh) 基于神经网络模型的图像识别方法、装置及终端设备
KR102309111B1 (ko) 딥러닝 기반 비정상 행동을 탐지하여 인식하는 비정상 행동 탐지 시스템 및 탐지 방법
CN113297956B (zh) 一种基于视觉的手势识别方法及***
CN111353544A (zh) 一种基于改进的Mixed Pooling-YOLOV3目标检测方法
CN111382737A (zh) 多路负载均衡异步目标检测方法、存储介质及处理器
WO2023279799A1 (zh) 对象识别方法、装置和电子***
US11423262B2 (en) Automatically filtering out objects based on user preferences
Dahirou et al. Motion Detection and Object Detection: Yolo (You Only Look Once)
CN113011320B (zh) 视频处理方法、装置、电子设备及存储介质
Pavlov et al. Application for video analysis based on machine learning and computer vision algorithms
CN117094362A (zh) 一种任务处理方法及相关装置
WO2022228325A1 (zh) 行为检测方法、电子设备以及计算机可读存储介质
CN110427920B (zh) 一种面向监控环境的实时行人解析方法
CN113516148A (zh) 基于人工智能的图像处理方法、装置、设备及存储介质
Min et al. Vehicle detection method based on deep learning and multi-layer feature fusion
Das et al. Indian sign language recognition system for emergency words by using shape and deep features

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE