WO2024109902A1 - A multi-target recognition method and apparatus based on video super-resolution (一种基于视频超分辨率的多目标识别方法和装置)

Info

Publication number: WO2024109902A1
Application number: PCT/CN2023/133779
Authority: WO (WIPO, PCT)
Prior art keywords: target, video, image, target object, original video
Other languages: English (en), French (fr)
Inventors: 陈巍, 王珏, 焦国华, 罗栋, 赵琦
Original assignee: 中国科学院深圳先进技术研究院 (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2024109902A1

Classifications

    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Neural networks; learning methods
    • G06T 3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/48: Extraction of image or video features by mapping characteristic values of the pattern into a parameter space, e.g. Hough transformation
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 20/40: Scenes; scene-specific elements in video content
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions

Definitions

  • the present application relates to the field of computer technology, and in particular to a multi-target recognition method, device, electronic device and storage medium based on video super-resolution.
  • target recognition can help people find hard-to-detect objects against the overall background, or work with other modules to complete automated production or processing.
  • various target recognition technologies have already seen initial application in surveillance video analysis and similar areas.
  • however, existing target recognition technologies place high demands on the resolution of the surveillance video; if the video resolution is too low, target recognition cannot produce accurate results.
  • due to the resolution limits of the monitoring equipment, the resolution of surveillance video is often low. Replacing the equipment with more advanced devices can raise the video resolution, but the economic cost is too high.
  • the embodiments of the present application provide a multi-target recognition method, device, electronic device and storage medium based on video super-resolution, which can solve the problem of low accuracy of target recognition in videos existing in the related art.
  • a multi-target recognition method based on video super-resolution includes: obtaining an original video; performing video super-resolution reconstruction on the original video based on features extracted from the original video, and restoring the low-resolution original video to a high-resolution video to be recognized; the video to be recognized includes multiple frames of images to be recognized; performing target detection on at least one target object in each of the images to be recognized, and obtaining a detection result for each of the target objects, wherein the detection result of the target object includes at least a category to which the target object belongs; determining a target recognition model that is compatible with the category to which each of the target objects belongs, respectively calling target recognition models that are compatible with the categories to which different target objects belong, performing target recognition on the images to be recognized containing the target objects, and obtaining a recognition result for each of the target objects.
  • a multi-target recognition device based on video super-resolution includes: a video acquisition module, used to acquire an original video; a video super-resolution module, used to perform video super-resolution reconstruction on the original video based on features extracted from the original video, and restore the low-resolution original video to a high-resolution video to be recognized;
  • the video to be recognized includes multiple frames of images to be recognized;
  • a target detection module used to perform target detection on at least one target object in each of the images to be recognized, and obtain a detection result of each target object, wherein the detection result of the target object at least includes the category to which the target object belongs;
  • a target recognition module used to determine a target recognition model that is compatible with the category to which each target object belongs, and respectively call the target recognition models that are compatible with the categories to which different target objects belong, perform target recognition on the images to be recognized containing the target objects, and obtain a recognition result of each target object.
  • an electronic device includes: at least one processor, at least one memory, and at least one communication bus, wherein a computer program is stored in the memory, and the processor reads the computer program in the memory through the communication bus; when the computer program is executed by the processor, the multi-target recognition method as described above is implemented.
  • a storage medium stores a computer program thereon, and when the computer program is executed by a processor, the multi-target recognition method as described above is implemented.
  • a computer program product includes a computer program, the computer program is stored in a storage medium, a processor of a computer device reads the computer program from the storage medium, and the processor executes the computer program, so that the computer device implements the multi-target recognition method as described above when executing the computer program.
  • based on the restored high-resolution video to be identified, target detection is first performed on the target objects in each image to be identified, and then, based on the detection results, a target recognition model adapted to the category of each target object is selected to perform target recognition.
  • this not only improves the efficiency of target detection and target recognition, but also, by performing target recognition on the high-resolution video to be identified, increases the accuracy of target recognition, thereby solving the problem of low accuracy of target recognition in videos existing in the related art.
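As a rough illustration only, the Python sketch below strings these four stages together. The application discloses no source code, so every class and method name here (sr_model.reconstruct, detector.detect, model.recognize) is a hypothetical placeholder for the modules described above.

```python
def recognize_targets(original_frames, sr_model, detector, recognizers):
    """recognizers: dict mapping a category name to its adapted recognition model,
    e.g. {"vehicle": plate_model, "person": face_model} (names are illustrative)."""
    results = []
    # Restore the low-resolution original video to a high-resolution one.
    frames_to_recognize = sr_model.reconstruct(original_frames)
    for frame in frames_to_recognize:
        # Detect target objects in each image to be identified;
        # each detection carries at least a category and a box.
        for detection in detector.detect(frame):
            # Dispatch to the recognition model adapted to the category.
            model = recognizers.get(detection.category)
            if model is not None:
                results.append(model.recognize(frame, detection.box))
    return results
```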
  • FIG1 is a schematic diagram of an implementation environment involved in the present application.
  • FIG2 is a flow chart of a multi-target recognition method based on video super-resolution according to an exemplary embodiment;
  • FIG3 is a flow chart of one embodiment of step 370 in the embodiment corresponding to FIG2;
  • FIG4 is a flow chart of another embodiment of step 370 in the embodiment corresponding to FIG2;
  • FIG5 is a flow chart of one embodiment of step 330 in the embodiment corresponding to FIG2;
  • FIG6 is a flow chart of one embodiment of step 335 in the embodiment corresponding to FIG5;
  • FIG7 is a structural block diagram of one embodiment of step 335 in the embodiment corresponding to FIG5;
  • FIG8 is a flow chart of one embodiment of step 350 in the embodiment corresponding to FIG2;
  • FIG9 is a flow chart of one embodiment of the steps before step 370 in the embodiment corresponding to FIG2;
  • FIG10 is a schematic diagram of a specific implementation of a multi-target recognition method based on video super-resolution in an application scenario;
  • FIG11 is a structural block diagram of a multi-target recognition device based on video super-resolution according to an exemplary embodiment;
  • FIG12 is a hardware structure diagram of a server according to an exemplary embodiment (where the electronic device is a server);
  • FIG13 is a structural block diagram of an electronic device according to an exemplary embodiment.
  • the low video resolution of surveillance videos affects the accuracy of target recognition.
  • Target recognition often has certain requirements for the clarity of images or videos. If the resolution of the camera itself is not high, target recognition will often struggle to achieve the expected results. Likewise, if the target object is moving at high speed, the image captured by the camera will be blurred, which also degrades target recognition. In addition, a poor real-world environment, such as haze, dense fog or overcast weather, will also affect the recognition results.
  • moreover, a larger project may require detecting multiple kinds of target objects and applying several refined processing steps to them, and one or two neural network models can hardly meet this requirement.
  • it can be seen that the related art still suffers from low accuracy when performing target recognition on videos.
  • the multi-target recognition method can effectively improve the accuracy of target recognition.
  • accordingly, the multi-target recognition method is suitable for a multi-target recognition device, which can be deployed in an electronic device configured with a von Neumann architecture; for example, the electronic device can be a desktop computer, a laptop computer, a server, etc.
  • FIG1 is a schematic diagram of an implementation environment involved in a multi-target recognition method based on video super-resolution.
  • the implementation environment includes an acquisition end 110 and a server 130.
  • the acquisition end 110 may be an electronic device having the function of acquiring at least one kind of data among pictures, videos and multimedia, which is not specifically limited here.
  • the server 130 may be a desktop computer, a laptop computer, a server or other electronic device, a computer device cluster composed of multiple servers, or even a cloud computing center composed of multiple servers.
  • the server 130 is used to provide background services; for example, the background services include but are not limited to a multi-target recognition service.
  • the server 130 and the acquisition end 110 establish a network communication connection in advance through wired or wireless means, and data transmission between them is realized through this connection.
  • the transmitted data includes but is not limited to the original video and the like.
  • through the interaction between the acquisition end 110 and the server 130, the acquisition end 110 sends the original video to the server 130, and the server 130 processes the acquired original video in combination with video super-resolution to complete target detection and target recognition of the target objects.
  • an embodiment of the present application provides a multi-target recognition method, which is applicable to an electronic device, and the electronic device may be the server 130 in the implementation environment shown in FIG. 1 .
  • the method may include the following steps:
  • Step 310: obtain the original video.
  • the original video is obtained by shooting and collecting the environment where the target object is located through a video acquisition device.
  • the video acquisition device can be an electronic device with a video acquisition function, such as a camera, a smart phone equipped with a camera, etc.
  • the video acquisition device can be deployed around the environment where the target object is located. For example, if the target object is a person, the video acquisition device can be deployed on the pillars of the building where the target object will appear; if the target object is a vehicle, the video acquisition device can be deployed on the lamppost on the side of the road.
  • the original video can come from video captured by the video acquisition device in real time, or it can be video captured by the video acquisition device during a historical time period and pre-stored in the electronic device. For the electronic device, after the video acquisition device captures the original video, the electronic device can process it in real time, or store it first and process it later, for example, when the CPU load of the electronic device is low, or according to the instructions of the staff. Therefore, the multi-target recognition in this embodiment can target the original video acquired in real time or the original video acquired in a historical time period, which is not specifically limited here.
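Whether the original video arrives in real time or from storage, it is decoded into per-frame images before processing. The OpenCV-based sketch below is one common way to do this, assumed here purely for illustration; it is not part of the application.

```python
import cv2  # OpenCV, assumed here as the decoding library

def load_original_video(path):
    """Read an original video file into a list of frames (numpy arrays)."""
    capture = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:  # no more frames, or the stream ended/failed
            break
        frames.append(frame)
    capture.release()
    return frames
```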
  • Step 330: based on features extracted from the original video, perform video super-resolution reconstruction on the original video, restoring the low-resolution original video to a high-resolution video to be identified.
  • due to various objective factors, for example a low-resolution video acquisition device, the original video may suffer from noise, low resolution and other conditions that affect video quality.
  • target detection often has certain requirements on the clarity of images or videos. If the resolution of the original video is not high, it is often difficult for target detection to achieve the expected effect. At the same time, if the target object is moving at high speed, the original video obtained by the video acquisition device will be blurred, which will also affect the detection results of the target detection.
  • based on this, the video quality of the original video needs to be improved before target detection. A video super-resolution model can be used to extract features of the original video and restore the low-resolution original video to a high-resolution video to be identified, after which target detection is performed on the high-resolution video.
  • Target detection is performed in units of frame images, and the video to be identified includes multiple frames of images to be identified.
  • Step 350: perform target detection on at least one target object in each image to be identified to obtain a detection result for each target object.
  • the video to be identified includes multiple frames of images to be identified, and the categories and quantities of the target objects contained in each frame are not necessarily the same; therefore, target object detection needs to be performed on the images to be identified to determine the target objects contained in each of them.
  • target detection can be implemented with traditional methods, such as the optical flow method, background subtraction or frame differencing; with target detection algorithms such as the Cascade R-CNN algorithm, the DPM algorithm or HOG-based detectors; or with a model based on deep learning, which is not limited here.
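As one concrete, hypothetical example of the deep-learning route, a stock pretrained detector from torchvision returns boxes, labels and scores per image; the application does not fix a particular detector, and a Cascade R-CNN would be wired in the same way.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# A stand-in detector; Faster R-CNN is related to, but not the same as,
# the Cascade R-CNN mentioned above.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 480, 640)    # one image to be identified (C, H, W), values in [0, 1]
with torch.no_grad():
    output = detector([image])[0]  # dict with per-detection boxes, labels, scores
boxes, labels, scores = output["boxes"], output["labels"], output["scores"]
```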
  • the detection result of the target object at least includes the category to which the target object belongs. According to the classification idea, different target recognition models are used to identify target objects of different categories.
  • Step 370: determine the target recognition model adapted to the category of each target object, call the target recognition models adapted to the categories of the different target objects respectively, perform target recognition on the images to be identified containing the target objects, and obtain a recognition result for each target object.
  • the detection result includes at least the category to which the target object belongs, which may be a vehicle, a person, a plant, etc.
  • Target recognition of target objects of different categories can be achieved through a target recognition algorithm or through a machine learning model obtained through deep learning training.
  • regarding target recognition through machine learning models obtained by deep learning, it can be understood that different target recognition models are trained on target objects of different categories; therefore, using the target recognition model adapted to the category of a target object to recognize objects of that category yields a better recognition effect.
  • for example, if the detected target object belongs to the vehicle category, the target object is input into the vehicle recognition model; if the detected target object belongs to the person category, the target object is input into the face recognition model. By calling the target recognition models adapted to the categories of the different target objects, target recognition of multiple categories of target objects can be completed at the same time, which speeds up obtaining the recognition results.
  • the target recognition model includes a face recognition model, which is a machine learning model that is trained and has the ability to recognize the target image.
  • step 370 may include the following steps:
  • Step 410: locate the facial key points in the face region of the target image.
  • Step 430: based on the located facial key points, segment a face region image containing the face from the target image.
  • Step 450: map the face region image into a Euclidean space.
  • Step 470: based on the similarity between the face in the face region image and the faces in the sample images, obtain a recognition result for the face in the face region image.
  • it should be noted first that the target image can be a face image or a person image, where a person image contains both the body region and the face region of the person; the face recognition model recognizes the face region of the target image to obtain the recognition result of the target object. Therefore, if the target image is a person image, the person image is first detected to obtain its face region image. Furthermore, the similarity between adjacent frames of the video to be identified can be used to capture the face region of a walking person, which is not limited here.
  • the facial key points refer to the locations of key facial regions within the face region, including the eyebrows, eyes, nose, mouth and facial contour. It can be understood that the facial key points can be used to segment the face region image from the target image.
  • it should be noted that the Euclidean space contains the distances between the face region image of the target object and each sample image known to the face recognition model.
  • the distance in the Euclidean space is used to indicate the similarity between the face in the face region image and the face in a sample image; a sample image is a face region image used to train the face recognition model. It can be understood that the higher the similarity, the more likely that sample image is the recognition result for the target object; therefore, the sample image with the highest similarity can be selected as the recognition result.
  • of course, the sample images available to the face recognition model do not necessarily include the true identity of the target object. To avoid misrecognition when the most similar sample image is not actually a match, a threshold can be configured for the similarity: only when the similarity between a sample image and the face image is both the highest and exceeds the set threshold is that sample image taken as the recognition result of the target object.
  • obtaining the recognition result through this Euclidean space mapping can effectively improve the accuracy of face recognition.
  • the recognition result may include the identity information of the target object, such as name, age, occupation, etc., which is not limited here.
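A minimal sketch of this matching step, assuming the face recognition model has already mapped each face region image to a fixed-length embedding vector in the Euclidean space; the enrolled identities and the distance threshold are illustrative assumptions.

```python
import numpy as np

def match_face(face_embedding, sample_embeddings, sample_identities, threshold):
    """Return the identity of the closest enrolled sample, or None.

    face_embedding: (D,) embedding of the query face region image.
    sample_embeddings: (K, D) embeddings of the sample images.
    A larger Euclidean distance means a lower similarity, so the threshold
    is an upper bound on distance rather than a lower bound on similarity.
    """
    distances = np.linalg.norm(sample_embeddings - face_embedding, axis=1)
    best = int(np.argmin(distances))   # smallest distance = highest similarity
    if distances[best] > threshold:    # even the best match is too far: no result
        return None
    return sample_identities[best]
```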
  • the target recognition model includes a license plate recognition model, which is a machine learning model that is trained and has the ability to recognize the target image.
  • step 370 may include the following steps:
  • Step 510: extract features of the target image through each network layer of the license plate recognition model to obtain a feature sequence of the target image.
  • Step 530: map the feature sequence through the license plate recognition model to obtain a recognition result for the target image.
  • the feature sequence includes multiple features of the target image, and the target image can be a license plate image or a vehicle image, which is not limited here. Furthermore, each feature of the target image is used to indicate the character information in the license plate image/vehicle image.
  • the license plate recognition model mainly recognizes the license plate of the vehicle. It can be understood that the license plates of different vehicles are unique. Therefore, character recognition can be performed on the license plate area of the vehicle image to obtain the recognition result of the target object (vehicle).
  • mapping the feature sequence means that the license plate recognition model uses the contextual structure of the characters learned from the sample images to map the feature sequence into a prediction sequence with probabilities, which indicates the most likely character at each position in the license plate region.
  • in other words, from this prediction sequence, the recognition result of the target image, that is, the license plate number of the vehicle, can be obtained.
  • the sample image refers to the license plate image used to train the license plate recognition model.
  • it should be noted that the length of the feature sequence input to the license plate recognition model is related to the width of the target image, while the length of the prediction sequence output by the model is related to the width of the sample images. Since the two lengths may differ, the license plate recognition model may fail to obtain the correct recognition result.
  • based on this, a loss function can be introduced during training to solve the misalignment between the input sequence and the output sequence of the license plate recognition model.
  • for example, the segmentation-free CTC loss function can be used to train the license plate recognition model end to end.
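For example, PyTorch ships this criterion as nn.CTCLoss. The sketch below shows its use with assumed shapes; the application does not specify the network dimensions, so T, N, C and the label length are illustrative.

```python
import torch
import torch.nn as nn

T, N, C = 32, 4, 68   # time steps (tied to image width), batch size, classes incl. blank
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)  # per-step outputs
targets = torch.randint(1, C, (N, 7))                 # license plate character labels
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 7, dtype=torch.long)

# CTC itself aligns the unequal input and output sequence lengths,
# so no per-character segmentation of the plate is needed.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # end-to-end training signal
```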
  • target detection is performed based on the restored high-resolution video to be identified, which improves the efficiency of target detection and target recognition.
  • target recognition is performed based on the high-resolution video to be identified, which increases the accuracy of target recognition and solves the problem of low accuracy of target recognition in video that exists in related technologies.
  • step 330 may include the following steps:
  • Step 331: extract features from the original video to obtain shallow features of the original video.
  • it should be noted first that each original image in the original video contains both shallow features and deep features. Shallow features can be extracted by a shallow network structure; they have high resolution and carry more position and detail information. It can be understood that the position and detail information in the shallow features is indispensable for video super-resolution reconstruction of the original video.
  • the shallow features of the original video can be extracted through convolution, for example, 2D convolution.
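A minimal sketch of such a shallow extractor (the 64-channel width is an assumption for illustration): a single 2D convolution preserves spatial resolution, and with it the position and detail information.

```python
import torch
import torch.nn as nn

# One 3x3 convolution per frame; padding keeps H and W unchanged.
shallow_extractor = nn.Conv2d(in_channels=3, out_channels=64,
                              kernel_size=3, padding=1)

frame = torch.rand(1, 3, 180, 320)             # one low-resolution original image
shallow_features = shallow_extractor(frame)    # (1, 64, 180, 320): same resolution
```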
  • Step 333: divide the original video into a plurality of original video clips.
  • it should be noted first that the features of the same target object differ across frames of the original video. It can be understood that restoring a target object using its features from multiple frames gives a better restoration than using its features from a single frame. Therefore, in order to use features from multiple frames for video super-resolution reconstruction of the original video, the features of multiple frames need to be fused. Inter-frame feature fusion can be achieved with aligned or non-aligned methods.
  • in the non-aligned method, the position of the same target object changes over time across frames, and its posture also deforms. If the moving target object is not aligned, the same target object may appear at different positions in the multiple input frames, which requires a deeper network structure with a larger receptive field to extract the same features over a larger motion range. In the aligned method, the same target object (whose position has changed) is aligned to the same position in different frames, so that the target object stays in one place, making it easier to extract as many of its features as possible.
  • it follows that, before deep feature extraction, frame-shifting processing can be applied to the original video to align the same target object to the same position, thereby aligning all target objects across the frames of the original video.
  • based on this frame-shifting operation, several original video clips are obtained; the clips do not overlap, and each clip includes at least two consecutive frames of original images from the original video.
  • in addition, dividing the original video into non-overlapping clips and processing them in parallel reduces the amount of computation during feature extraction and improves efficiency.
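The division itself can be sketched as follows; two frames per clip is the minimum the text requires and is used here only as an illustrative default.

```python
def split_into_clips(frames, clip_length=2):
    """Divide a video (list of frames) into non-overlapping clips of
    clip_length consecutive frames each; trailing frames that cannot
    fill a whole clip are ignored in this sketch."""
    clips = []
    for start in range(0, len(frames) - clip_length + 1, clip_length):
        clips.append(frames[start:start + clip_length])
    return clips   # each clip can then be processed in parallel
```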
  • Step 335: using each stage of the parallel super-resolution mechanism, propagate features across the consecutive original images in each original video clip to obtain the deep features of each clip; each stage corresponds to one scale, and each stage includes a temporal mutual self-attention module and/or a parallel warping module.
  • shallow features contain more position and detail information, but they pass through fewer convolutions during extraction, so they carry weaker semantics and more noise.
  • when performing video super-resolution reconstruction, the semantic information in the original video is also crucial, so it needs to be obtained as well.
  • the deep features of each original image contain rich semantic information, usually information about the image as a whole. In other words, the semantic information of the original video can be obtained by extracting the deep features of each original image in the original video.
  • Step 337: based on the obtained shallow and deep features, perform feature reconstruction in parallel for each original image in the original video to obtain the video to be identified.
  • feature reconstruction starts from the sum of the shallow features and deep features of each original image in the original video, and original images of different frames are reconstructed independently from their corresponding shallow and deep features. Feature reconstruction can therefore be performed in parallel, and the low-resolution original video is then restored to the high-resolution video to be identified based on the high-definition features of each original image obtained after reconstruction.
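A minimal sketch of this per-frame reconstruction, assuming 64 feature channels and a 4x upscaling factor (both illustrative): shallow and deep features are summed, then a sub-pixel (PixelShuffle) layer produces the high-resolution frame.

```python
import torch
import torch.nn as nn

class ReconstructionHead(nn.Module):
    def __init__(self, channels=64, scale=4):
        super().__init__()
        # Expand channels so PixelShuffle can trade them for spatial resolution.
        self.expand = nn.Conv2d(channels, 3 * scale * scale, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, shallow, deep):
        fused = shallow + deep                    # reconstruction starts from their sum
        return self.shuffle(self.expand(fused))   # (N, 3, scale*H, scale*W)

head = ReconstructionHead()
shallow = torch.rand(1, 64, 180, 320)
deep = torch.rand(1, 64, 180, 320)
high_res_frame = head(shallow, deep)   # frames are independent, so this parallelizes
```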
  • step 335 includes the following steps:
  • Step 3351: in multiple stages corresponding to different scales, use the temporal mutual self-attention module of each stage to extract features from the consecutive original images in each original video clip, obtaining features of each clip at different scales.
  • Step 3353: use the parallel warping module connected in series with the temporal mutual self-attention module of each stage to align the features of each original video clip at the different scales.
  • Step 3355: in the last stage, use the temporal mutual self-attention module of that stage to fuse the features of each original video clip at the different scales, obtaining the deep features of each clip.
  • during deep feature extraction, as the number of downsampling operations increases, the receptive field gradually grows, and the overlap between receptive fields also keeps increasing.
  • at that point, each pixel represents the information of a region, and what is obtained are features of that region or of adjacent regions; the details are relatively less fine, but the semantic information is rich.
  • based on this, combined with the feature extraction structure of FIG7, feature extraction is performed on the consecutive original images in each original video clip. It can be understood that after each stage of feature extraction, that is, after one more downsampling, the obtained features contain more semantic information.
  • specifically, the temporal mutual self-attention modules in the first five stages are used to estimate the joint motion of target objects across the consecutive original images of each original video clip, completing feature extraction;
  • the parallel warping module connected in series with each temporal mutual self-attention module performs parallel feature warping on the features of the current frame and of adjacent frames in the original video clip, further fusing the features of the current frame with those of the adjacent frames;
  • the temporal mutual self-attention module in the last stage fuses the features of different scales extracted in the first five stages to obtain the deep features of each original video clip.
  • it can be seen that, because shallow features and deep features obtained through multi-scale feature extraction are both used for video super-resolution reconstruction, the resolution of the video to be identified is higher and more natural, and target detection based on this high-resolution video is more accurate.
  • moreover, the deep features obtained through multi-scale extraction contain more feature information, so video restoration based on them achieves higher resolution, richer detail and better results.
  • step 350 may include the following steps:
  • Step 351: grid each image to be identified to obtain a grid image corresponding to each image to be identified.
  • Step 353: perform target detection on each grid unit based on a convolutional neural network to obtain multiple prediction boxes and corresponding prediction parameters in each grid unit.
  • Step 355: screen the prediction boxes based on their prediction parameters, and use the screened prediction boxes and their prediction parameters as the detection results of the target objects.
  • the grid image includes a plurality of grid units.
  • the purpose of gridding is to generate multiple prediction boxes based on the grid units of the grid image; together these prediction boxes roughly cover the entire image area of the image to be identified.
  • performing target detection on these prediction boxes and obtaining their detection results completes the target detection of the image to be identified.
  • Each prediction box has a corresponding prediction parameter, wherein the prediction parameter includes at least one of the position of the corresponding prediction box in the grid image, the confidence for indicating whether there is a target object in the corresponding prediction box, and the probability for indicating that the target object in the corresponding prediction box belongs to different categories.
  • the detection result of the target detection of each prediction box is obtained based on the corresponding prediction parameter.
  • for each prediction box, whether a target object exists in it is determined from the confidence in the prediction parameters. If the confidence indicates that there is no target object in the prediction box, no further judgment is needed; if it indicates that a target object exists, the category of that target object is determined from the probabilities, in the prediction parameters, that the object belongs to different categories.
  • it should be noted that a threshold can be set for the confidence: when the confidence exceeds the corresponding threshold, the prediction box is considered to contain a target object.
  • as for the category probabilities, the category with the maximum probability can be selected as the detection result of the target object, which is not limited here.
  • as mentioned above, the prediction parameters also include the position of the corresponding prediction box in the grid image. If a target object exists in a prediction box, the position of that box is the position of the target object; based on this, the image of the target object can be segmented from the image to be identified according to this position.
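The screening of step 355 can be sketched as follows, with an illustrative confidence threshold; the boxes, confidences and class probabilities are assumed to come from the convolutional neural network of step 353.

```python
import numpy as np

def filter_predictions(boxes, confidences, class_probs, conf_threshold=0.5):
    """Keep prediction boxes whose confidence exceeds the threshold and attach
    the category with the maximum probability.

    boxes: (M, 4) positions in the grid image; confidences: (M,);
    class_probs: (M, num_categories).
    """
    detections = []
    for box, conf, probs in zip(boxes, confidences, class_probs):
        if conf <= conf_threshold:         # no target object in this box
            continue
        category = int(np.argmax(probs))   # most probable category
        detections.append((box, conf, category))
    return detections
```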
  • the target object includes at least one of a person and a vehicle. Specifically, as shown in FIG9, in a possible implementation, the following steps may be included after step 350:
  • Step 610: based on the detection result of the target object, obtain the position of the target object in the image to be identified.
  • Step 630: according to that position, locate the region containing the target object in each image to be identified of the video to be identified.
  • Step 650: segment the target image containing the target object from each image to be identified.
  • under the above embodiments, whether each prediction box contains a target object, and the category of that object, can be obtained in one pass from the prediction parameters of the boxes; and since the prediction parameters also include the position of each box, the target image can be segmented from the image to be identified according to that position, which reduces the amount of computation, increases the efficiency of target detection and improves its speed.
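A sketch of that segmentation, assuming the box position is given in pixel coordinates (x1, y1, x2, y2) and the image to be identified is a numpy array:

```python
def segment_target_image(image, box):
    """Crop the target image out of an image to be identified.

    image: numpy array of shape (H, W, C); box: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    return image[y1:y2, x1:x2]   # the region containing the target object
```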
  • FIG. 10 is a schematic diagram of a specific implementation of a multi-target recognition method based on video super-resolution in an application scenario.
  • in step 801, the original video is obtained.
  • in step 803, features are extracted from the original video, and based on these features the low-resolution original video is restored to a high-resolution video to be identified.
  • in step 805, target detection is performed on the video to be identified to obtain the target objects in it and the categories to which they belong.
  • in step 807, a target recognition model suited to the category of each target object is selected; the category may be a person, a vehicle, a plant, etc., and corresponding target recognition models can be configured according to the requirements of the task.
  • if the target object is a vehicle, then in step 809 the recognition result of each vehicle in the video to be identified is obtained with the license plate recognition model; if the target object is a person, then in step 811 the recognition result of each person is obtained with the face recognition model.
  • through step 813, target recognition has been performed on all target objects present in the video to be identified, and the recognition results are obtained.
  • in this application scenario, target detection is performed on the restored high-resolution video to be identified, and target recognition is performed on the detection results, so the recognition results of all target objects present in the video can be obtained quickly and accurately. In some special application scenarios this can assist in searching for target objects; for example, in criminal investigations, suspects or their vehicles can be quickly identified from low-resolution surveillance video.
  • the following are device embodiments of the present application, which can be used to execute the multi-target recognition method involved in the present application.
  • for details not disclosed in the device embodiments, please refer to the method embodiments of the multi-target recognition method involved in the present application.
  • a multi-target recognition device 900 is provided in an embodiment of the present application, including but not limited to: a video acquisition module 910 , a video super-resolution module 930 , a target detection module 950 , and a target recognition module 970 .
  • the video acquisition module 910 is used to acquire the original video
  • the video super-resolution module 930 is used to perform video super-resolution reconstruction on the original video based on the features extracted from the original video, and restore the low-resolution original video to a high-resolution video to be identified; the video to be identified includes multiple frames of images to be identified;
  • the target detection module 950 is used to perform target detection on at least one target object in each of the to-be-recognized images to obtain a detection result of each of the target objects, wherein the detection result of the target object at least includes a category to which the target object belongs;
  • the target recognition module 970 is used to determine the target recognition model that is compatible with the category to which each target object belongs, and respectively call the target recognition models that are compatible with the categories to which different target objects belong, perform target recognition on the image to be recognized containing the target object, and obtain the recognition results of each target object.
  • it should be noted that the multi-target recognition device provided in the above embodiment is illustrated only with the division into the above functional modules as an example.
  • in practical applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the multi-target recognition device can be divided into different functional modules to complete all or part of the functions described above.
  • FIG12 is a schematic structural diagram of a server according to an exemplary embodiment.
  • this server can serve as the server 130 in the implementation environment shown in FIG1.
  • it should be noted that this server is only an example adapted to the present application and cannot be considered as providing any limitation on the scope of use of the present application.
  • nor can this server be interpreted as needing to rely on, or necessarily having, one or more components of the exemplary server 2000 shown in FIG12.
  • the hardware structure of the server 2000 may vary greatly depending on its configuration or performance; as shown in FIG12, the server 2000 includes a power supply 210, an interface 230, at least one memory 250, and at least one central processing unit (CPU) 270.
  • the power supply 210 is used to provide operating voltage for each hardware device on the server 2000 .
  • the interface 230 includes at least one wired or wireless network interface for interacting with external devices, for example, for the interaction between the acquisition end 110 and the server 130 in the implementation environment shown in FIG1.
  • of course, in other examples adapted to the present application, the interface 230 may further include at least one serial-parallel conversion interface 233, at least one input-output interface 235, at least one USB interface 237, etc., as shown in FIG12, which is not specifically limited here.
  • the memory 250 is a carrier for storing resources, which may be a read-only memory, a random access memory, a disk or an optical disk, etc.
  • the resources stored thereon include an operating system 251, an application 253 and data 255, etc.
  • the storage method may be temporary storage or permanent storage.
  • the operating system 251 is used to manage and control the hardware devices and the application 253 on the server 2000, so that the central processing unit 270 can compute and process the massive data 255 in the memory 250; it can be Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
  • the application 253 is a computer program that performs at least one specific task on top of the operating system 251; it may include at least one module (not shown in FIG12), and each module may contain a series of computer programs for the server 2000.
  • the multi-target recognition device may be regarded as an application 253 deployed on the server 2000.
  • the data 255 may be photos, pictures, etc. stored in a disk, or may be original videos, videos to be identified, etc., stored in the memory 250 .
  • the central processor 270 may include one or more processors and is configured to communicate with the memory 250 through at least one communication bus to read the computer program stored in the memory 250, thereby realizing the operation and processing of the mass data 255 in the memory 250. For example, the multi-target recognition method is completed by the central processor 270 reading a series of computer programs stored in the memory 250.
  • it should also be noted that the present application can be implemented through hardware circuits or hardware circuits combined with software; thus, implementing the present application is not limited to any specific hardware circuit, software, or combination of the two.
  • An electronic device 4000 is provided in an embodiment of the present application.
  • the electronic device 4000 may include: a desktop computer, a laptop computer, a server, etc.
  • the electronic device 4000 includes at least one processor 4001, at least one communication bus 4002, and at least one memory 4003.
  • the processor 4001 and the memory 4003 are connected, such as through the communication bus 4002.
  • the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between the electronic device and other electronic devices, such as data transmission and/or data reception.
  • the number of transceivers 4004 is not limited to one, and the illustrated structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
  • Processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic devices, transistor logic devices, hardware components or any combination thereof. It may implement or execute various exemplary logic blocks, modules and circuits described in conjunction with the disclosure of this application. Processor 4001 may also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, etc.
  • the communication bus 4002 may include a path to transmit information between the above components.
  • the communication bus 4002 may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, etc.
  • the communication bus 4002 may be divided into an address bus, a data bus, a control bus, etc.
  • for ease of illustration, FIG13 shows the bus with only one thick line, but this does not mean that there is only one bus or one type of bus.
  • the memory 4003 may be a ROM (Read Only Memory) or another type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device capable of storing information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 4003 stores a computer program, and the processor 4001 reads the computer program stored in the memory 4003 through the communication bus 4002; when the computer program is executed by the processor 4001, the multi-target recognition method described above is implemented.
  • a storage medium is provided in an embodiment of the present application, on which a computer program is stored.
  • the computer program is executed by a processor, the multi-target recognition method in the above embodiments is implemented.
  • a computer program product is provided in an embodiment of the present application, the computer program product includes a computer program, and the computer program is stored in a storage medium.
  • a processor of a computer device reads the computer program from the storage medium, and the processor executes the computer program, so that the computer device executes the multi-target recognition method in each of the above embodiments.
  • in the above embodiments, target recognition is performed on the restored high-resolution video to be identified, which increases the accuracy of target recognition and solves the problem, existing in the related art, of low accuracy of target recognition on videos.
  • in addition, a target recognition model adapted to the category of each target object is selected for target recognition, which further improves the accuracy of target recognition.


Abstract

The embodiments of the present application provide a multi-target recognition method, device, electronic device and storage medium based on video super-resolution, relating to the field of computer technology. The method includes: obtaining an original video; performing video super-resolution reconstruction on the original video based on features extracted from it, restoring the low-resolution original video to a high-resolution video to be recognized; performing target detection on at least one target object in each image to be recognized of that video to obtain a detection result for each target object; and determining a target recognition model adapted to the category of each target object, calling the target recognition models adapted to the categories of the different target objects respectively, performing target recognition on the images to be recognized that contain the target objects, and obtaining a recognition result for each target object. The embodiments of the present application solve the problem of low target recognition accuracy in the related art.

Description

一种基于视频超分辨率的多目标识别方法和装置 技术领域
本申请涉及计算机技术领域,具体而言,本申请涉及一种基于视频超分辨率的多目标识别方法、装置、电子设备及存储介质。
背景技术
随着人工智能技术的不断发展,各种各样的新算法和新模型被提出。其中的有许多新兴技术可以被运用到人的生产生活当中,比如目标识别可以帮助人们在整体背景中发现较难察觉的物体,或与其它模块相互配合,完成自动化的生产或者处理过程。
其中,各种目标识别技术已经在监控视频分析等方面具有初步运用,然而,现有的目标识别技术对监控视频的分辨率有较高的要求,若监控视频的视频分辨率过低,目标识别无法得到准确的识别结果。但是,由于监控设备的分辨率问题,监控视频的分辨率不高,通过更换更先进的监控设备获取监控视频,可以提高监控视频的分辨率,但是经济成本过高。
由上可知,对视频进行目标识别的准确率不高成为了亟需解决的问题。
技术问题
本申请各实施例提供了一种基于视频超分辨率的多目标识别方法、装置、电子设备及存储介质,可以解决相关技术中存在的对视频进行目标识别的准确率不高的问题。
技术解决方案
根据本申请实施例的一个方面,一种基于视频超分辨率的多目标识别方法,包括:获取原始视频;基于所述原始视频提取得到的特征,对所述原始视频进行视频超分辨率重建,将低分辨率的所述原始视频恢复为高分辨率的待识别视频;所述待识别视频包括多帧待识别图像;对各所述待识别图像中的至少一个目标对象进行目标检测,得到各所述目标对象的检测结果,其中,所述目标对象的检测结果至少包括所述目标对象所属类别;确定与各所述目标对象所属类别相适配的目标识别模型,分别调用与不同所述目标对象所属类别相适配的目标识别模型,对包含所述目标对象的所述待识别图像进行目标识别,得到各所述目标对象的识别结果。
根据本申请实施例的一个方面,一种基于视频超分辨率的多目标识别方法,包括:视频获取模块,用于获取原始视频;视频超分辨率模块,用于基于所述原始视频提取得到的特征,对所述原始视频进行视频超分辨率重建,将低分辨率的所述原始视频恢复为高分辨率的待识别视频;所述待识别视频包括多帧待识别图像;目标检测模块,用于对各所述待识别图像中的至少一个目标对象进行目标检测,得到各所述目标对象的检测结果,其中,所述目标对象的检测结果至少包括所述目标对象所属类别;目标识别模块,用于确定与各所述目标对象所属类别相适配的目标识别模型,分别调用与不同所述目标对象所属类别相适配的目标识别模型,对包含所述目标对象的所述待识别图像进行目标识别,得到各所述目标对象的识别结果。
根据本申请实施例的一个方面,一种电子设备,包括:至少一个处理器、至少一个存储器、以及至少一条通信总线,其中,存储器上存储有计算机程序,处理器通过通信总线读取存储器中的计算机程序;计算机程序被处理器执行时实现如上所述的多目标识别方法。
根据本申请实施例的一个方面,一种存储介质,其上存储有计算机程序,计算机程序被处理器执行时实现如上所述的多目标识别方法。
根据本申请实施例的一个方面,一种计算机程序产品,计算机程序产品包括计算机程序,计算机程序存储在存储介质中,计算机设备的处理器从存储介质读取计算机程序,处理器执行计算机程序,使得计算机设备执行时实现如上所述的多目标识别方法。
有益效果
本申请提供的技术方案带来的有益效果是:
在上述技术方案中,基于恢复得到的高分辨率的待识别视频,首先对待识别包含的各待识别图像中目标对象进行目标检测,再基于检测结果选择与各目标对象所属类别相适配的目标识别模型进行目标识别,不仅提升了目标检测与目标识别的效率,并且,基于高分辨的待识别视频进行目标识别,增加了目标识别的准确性,解决了相关技术中存在的对视频进行目标识别的准确率低下的问题。
附图说明
图1是根据本申请所涉及的实施环境的示意图;
图2是根据一示例性实施例示出的一种基于视频超分辨率的多目标识别方法的流程图;
图3是图2对应实施例中步骤370在一个实施例的流程图;
图4是图2对应实施例中步骤370在一个实施例的流程图;
图5是图2对应实施例中步骤330在一个实施例的流程图;
图6是图5对应实施例中步骤335在一个实施例的流程图;
图7是图5对应实施例中步骤335在一个实施例的结构框图;
图8是图2对应实施例中步骤350在一个实施例的流程图;
图9是图2对应实施例中步骤370之前的步骤在一个实施例的流程图;
图10是一应用场景中一种基于视频超分辨率的多目标识别方法的具体实现示意图;
图11是根据一示例性实施例示出的一种基于视频超分辨率的多目标识别装置的结构框图;
图12是根据一示例性实施例示出的一种服务器的硬件结构图(电子设备为服务器);
图13是根据一示例性实施例示出的一种电子设备的结构框图。
本发明的实施方式
如前所述,监控视频的视频分辨率过低影响了目标识别的准确率。
目标识别往往对图像或是视频的清晰度具有一定的要求,如果摄像头本身的分辨率不高,那么该目标识别往往较难达到预期的效果。同时,如果识别的目标对象在高速运动过程中,摄像头所获得的图像便会产生模糊,这也将影响目标识别的具体情况,并且,如果目标识别面对的实际环境不佳,例如雾霾、浓雾、阴天,也会影响目标识别的识别结果。
并且,一个较大的项目可能有检测多种目标对象、并对目标对象有多种精细化处理的需求,而一两种神经网络模型较难满足这个需求。
由上可知,相关技术中仍存在对视频进行目标识别的准确率不高的缺陷。
为此,本申请提供的多目标识别方法,能够有效地提升目标识别的准确率,相应地,该多目标识别方法适用于信息推荐装置、该信息推荐装置可部署于配置冯诺依曼体系结构的电子设备,例如,该电子设备可以是台式电脑、笔记本电脑、服务器等等。
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
图1为一种基于视频超分辨率的多目标识别方法所涉及的一种实施环境的示意图。
该实施环境包括采集端110和服务端130。
具体地,采集端110,可以是具有采集图片、视频、多媒体中至少一种或多种数据功能的电子设备,在此不构成具体限定。
服务端130,该服务端130可以是台式电脑、笔记本电脑、服务器等等电子设备,还可以是由多台服务器构成的计算机设备集群,甚至是由多台服务器构成的云计算中心。其中,服务端130用于提供后台服务,例如,后台服务包括但不限于多目标识别服务等等。
服务端130与采集端110之间通过有线或者无线等方式预先建立网络通信连接,并通过该网络通信连接实现服务端130与采集端110之间的数据传输。传输的数据包括但不限于:原始视频等等。
通过采集端110与服务端130的交互,采集端110将原始视频发送给服务端130,服务端130结合视频超分辨率对获取到的原始视频进行处理,便能够完成对目标对象的目标检测与目标识别。
请参阅图2,本申请实施例提供了一种多目标识别方法,该方法适用于电子设备,该电子设备可以是图1所示出实施环境中的服务端130。如图2所示,该方法可以包括以下步骤:
步骤310,获取原始视频。
原始视频是通过视频采集设备对目标对象所在环境进行拍摄和采集得到的。其中,视频采集设备可以是具备视频采集功能的电子设备,例如,摄像机、配置摄像头的智能手机等等。视频采集设备可以部署在目标对象所在环境的四周,例如,若目标对象为人,则视频采集设备可以部署在目标对象会出现的建筑内的廊柱上;若目标对象为车辆,则视频采集设备可以部署在公路边的灯柱上。
关于原始视频的获取,原始视频可以来源于视频采集设备实时拍摄并采集的视频,也可以是预先存储于电子设备的一历史时间段由视频采集设备拍摄并采集的原始视频。那么,对于电子设备而言,在视频采集设备拍摄并采集得到原始视频之后,可以实时处理原始视频,还可以预先存储了再处理,例如,在电子设备的CPU低的时候处理原始视频,或者,根据工作人员的指示处理原始视频。由此,本实施例中的多目标识别可以针对实时获取到的原始视频,也可以针对历史时间段获取到的原始视频,在此并未进行具体限定。
步骤330,基于原始视频提取得到的特征,对原始视频进行视频超分辨率重建,将低分辨率的原始视频恢复为高分辨率的待识别视频。
首先说明的是,由于各种客观原因,例如,视频采集设备分辨率低,原始视频可以存在噪点、分辨率低等影响视频质量的情况。然而,目标检测往往对图像或是视频的清晰度具有一定的要求,如果原始视频的分辨率不高,那么目标检测往往较难达到预期的效果。同时,如果检测的目标对象在高速运动过程中,视频采集设备所获得的原始视频便会产生模糊,这也将影响目标检测的检测结果。
基于此,在进行目标检测之前需要改善原始视频的视频质量。可以通过视频超分辨率模型,提取原始视频的特征,将低分辨率的原始视频恢复为高分辨率的待识别视频,再基于高分辨率的待识别视频进行目标检测,其中,目标检测是以帧图像为单位进行的,待识别视频包括多帧待识别图像。
步骤350,对各待识别图像中的至少一个目标对象进行目标检测,得到各目标对象的检测结果。
如前所述,待识别图像中包括多帧待识别图像,每帧待识别图像所包含的目标对象的类别和数量不一定相同,因此,需要对待识别图像进行目标对象检测,确定各待识别图像中所包含的目标对象。其中,可以使用传统目标检测方法实现目标检测,例如光流法、背景减去法、帧插法等,也可以使用目标检测算法实现目标检测,例如Cascade R-CNN算法、DPM算法、HOG算法等,还可以通过基于深度学习的模型实现目标检测,在此不作限定。其中,目标对象的检测结果至少包括目标对象所属类别,根据分类思想,对于不同类别的目标对象,使用不同的目标识别模型对其进行识别。
步骤370,确定与各目标对象所属类别相适配的目标识别模型,分别调用与不同目标对象所属类别相适配的目标识别模型,对包含目标对象的待识别图像进行目标识别,得到各目标对象的识别结果。
如前所述,检测结果至少包括目标对象所属类别,所属类别可以是车辆、人物、植物等,关于对不同类别的目标对象进行目标识别,可以通过目标识别算法实现,也可以通过基于深度学习训练得到的机器学习模型实现。
关于通过基于深度学习训练得到的机器学习模型进行目标识别,可以理解,不同的目标识别模型基于不同类别的目标对象训练而成,因此,利用与目标对象所属类别相适配的目标模型对该类别的目标对象进行识别,识别效果更好。
例如,若检测得到的目标对象所属类别为车辆,则将该目标对象输入车辆识别模型,若检测得到的目标对象所属类别为人物,则将该目标对象输入人脸识别模型,通过调用与不同目标对象所属类别相适配的目标识别模型对目标对象进行目标识别,可以同时完成对多种类别目标对象的目标识别,加快得到目标识别的速度。
在一个可能的实现方式,目标识别模型包括人脸识别模型,人脸识别模型是经训练得到的、且具有对目标图像进行识别的能力的机器学习模型,如图3所示,步骤370可以包括以下步骤:
步骤410,对目标图像中人脸区域中的人脸关键点进行定位。
步骤430,基于定位得到的人脸关键点,从目标图像中分割出包含人脸的人脸区域图像。
步骤450,将人脸区域图像映射至欧式空间。
步骤470,基于人脸区域图像中人脸与样本图像中人脸的相似度,得到针对人物区域图像中人脸的识别结果。
首先说明的是,目标图像可以是人脸图像、也可以是人物图像,其中,人物图像包含人物的身体区域以及人脸区域,人脸识别模型针对目标图像的人脸区域进行识别,以此获取该目标对象的识别结果。因此,若目标对象是人物图像,首先对人物图像的进行检测,得到人物图像的人脸区域图像。进一步地,还可以结合待识别视频相邻帧图像之间的相似性,来捕获行走的人物中的人脸区域,在此不作限定。关于人脸关键点,是指针对人脸区域,定位出人脸面部的关键区域位置,包括眉毛、眼睛、鼻子、嘴巴、脸部轮廓等。可以理解,人脸关键点可以用于将人脸区域图像从目标图像中分割出来。
需要说明的是,欧式空间包括目标对象的人脸区域图像与人脸识别模型中各样本图像之间的距离,人脸区域图像在欧式空间中的距离用于指示人脸区域图像中人脸与样本图像中人脸的相似度;样本图像是指用于训练人脸识别模型的人脸区域图像。可以理解,相似度越高,该样本图像是该目标对象的识别结果的可能性越高,因此,可以选定相似度最高的样本图像为识别结果,当然,人脸识别模型中提供的样本图像不一定包括该目标对象的识别结果,那么,为了避免相似度最高的样本图像不是识别结果而造成误识别的情况出现,可以为相似度配置设定阈值,当样本图像与人脸图像之间的相似度最高且超过设定阈值时,该样本图像是该目标对象的识别结果,利用欧式空间映射的方法得到识别结果的方法,可以有效地提升人脸识别的准确率。
其中,该识别结果可以包括目标对象的身份信息,例如姓名、年龄、职业等,在此不作限定。
在一个可能的实现方式,目标识别模型包括车牌识别模型,车牌识别模型是经训练得到的、且具有对目标图像进行识别的能力的机器学习模型,如图4所示,步骤370可以包括以下步骤:
步骤510,通过车牌识别模型的各网络层,对目标图像进行特征提取,得到目标图像的特征序列。
步骤530,通过车牌识别模型对特征序列进行映射,得到针对目标图像的识别结果。
其中,特征序列包括目标图像的多个特征,目标图像可以是车牌图像,也可以是车辆图像,在此不作限定。进一步地,目标图像的各特征用于指示车牌图像/车辆图像中的字符信息。车牌识别模型主要是针对车辆的车牌进行识别,可以理解,不同车辆的车牌是唯一的,因此,可以针对车辆图像的车牌区域进行字符识别,以此得到对目标对象(车辆)的识别结果。
关于对特征序列的映射,是指通过车牌识别模模型利用样本图像中字符的上下文结构对特征序列进行映射,以输出一个带有概率的预测序列,其用于指示车牌区域各字符的最大可能性,也就是说,通过该预测序列,可以得到对目标图像的识别结果,即车辆的车牌号。其中,样本图像是指用于训练车牌识别模型的车牌图像。
需要说明的是,车牌识别模型输入的特征序列长度与目标图像宽度相关,而车牌识别模型输出的预测序列长度与样本图像宽度相关,由于特征序列长度与预测序列长度可能存在差异,可能会导致车牌识别模型无法得到正确的识别结果。基于此,可以在车牌识别模型的训练中引入损失函数,用于解决车牌识别模型的输入序列与输出序列不对齐的问题,例如,可以选用免分割的CTC损失函数对车辆识别模型进行端到端的训练。
通过上述过程,基于恢复得到的高分辨率的待识别视频进行目标检测,提升了目标检测与目标识别的效率,并且,基于高分辨的待识别视频进行目标识别,增加了目标识别的准确性,解决了相关技术中存在的对视频进行目标识别的准确率过低的问题。
请参阅图5,在一示例性实施例中,步骤330可以包括以下步骤:
步骤331,对原始视频进行特征提取,得到原始视频的浅层特征。
首先说明的是,对于原始视频中各原始图像而言,包含了各原始图像的浅层特征和深层特征。其中,浅层特征可以通过浅层网络结构提取得到,其分辨率高,包含更多位置、细节信息。可以理解,为了对原始视频进行视频超分辨率重建,不能缺少浅层特征中的位置、细节信息。关于浅层特征提取,可以通过卷积提取原始视频的浅层特征,例如,2D卷积。
步骤333,将原始视频划分为多个原始视频片段。
首先说明的是,同一个目标对象在原始视频的不同帧图像中包含的特征是不一样的,可以理解,利用同一个目标对象在多帧图像的特征对该目标对象进行恢复的恢复效果,会比仅利用该目标对象在一帧图像的特征进行恢复的恢复效果好。那么,为了利用原始视频多帧图像的特征对原始视频进行视频超分辨率重建,有必要将多帧图像的特征进行融合。可以通过对齐的方法和不对齐的方法实现帧间特征融合。
其中,在不对齐的方法,对于不同帧而言,同一个目标对象随着时间推进位置会发生变化,姿态也会产生一定的形变。如果不对运动中的目标对象进行对齐的话,可能会出现输入的多帧视频同一目标对象出现在不同位置的情况。这就要加深网络结构,以更大感受野,才能提取到运动范围较大的相同的特征;在对齐的方法,是将不同帧的同一个目标对象(位置发生变化的)对齐到同一位置,使得目标对象都处在相同位置,方便提取到目标对象尽可能多的特征。
由上可知,在进行深层特征提取之前,可以通过对原始视频进行移帧处理,将同一个目标对象对齐到同一个位置,以此将原始视频各帧图像的各目标对象都对齐。基于该移帧操作,得到若干个原始视频片段,各原始视频片段之间不重叠、且各原始视频片段均包括原始视频中连续的至少两帧原始图像。并且,将原始视频划分为不重叠的视频片段且并行运行,能降低特征提取时的计算量,提升效率。
步骤335,利用并行超分机制中的各个阶段,进行针对各原始视频片段中连续原始图像的特征传播,得到各原始视频片段的深层特征;每一个阶段对应一种尺度,各阶段包括时间相互自注意力模块和/或平行扭曲模块。
如前所述,浅层特征包含更多位置、细节信息,但是,特征提取时经过的卷积更少,其语义性更低,噪声更多。在进行视频超分辨率重建时,原始视频中的语义信息也至关重要,那么,也需要获取原始视频中的语义信息。而各原始图像的深层特征包含了丰富的语义信息,通常是图像整体性的信息。也就是说,可以通过提取原始视频中各原始图像的深层特征,以获取原始视频中的语义信息。
步骤337,基于得到的浅层特征和深层特征,为原始视频中各原始图像并行地进行特征重建,得到待识别视频。
其中,特征重建是同时从原始视频中各原始图像的浅层特征和深层特征的相加中进行特征重建,其中,不同帧的原始图像是根据与其对应的浅层特征和深层特征独立地进行重建的,因此,在进行特征重建时,可以并行地进行,再根据特征重建后得到的各原始图像的高清特征,将低分辨率的原始视频恢复为高分辨率的待识别视频。
Regarding deep feature extraction, in one possible implementation, as shown in FIG. 6, step 335 includes the following steps:
In step 3351, in multiple stages corresponding to different scales, feature extraction is performed on the consecutive original images in each original video clip using the TMSA module of each stage, obtaining features of each original video clip at the different scales.
In step 3353, the features of each original video clip at the different scales are aligned using the parallel warping module connected in series with the TMSA module of each stage.
In step 3355, in the last stage, feature fusion is performed on the features of each original video clip at the different scales using the TMSA module of that stage, obtaining the deep features of each original video clip.
During deep feature extraction, as the number of downsampling operations increases, the receptive field grows, and the overlap between receptive fields also keeps increasing. A pixel at this point represents the information of a region; what is obtained is the features of that region or of adjacent regions, with relatively coarse detail but rich semantic information.
On this basis, features are extracted from the consecutive original images of each original video clip using the feature extraction structure of FIG. 7. It can be understood that each stage of feature extraction adds one more downsampling operation, so the resulting features contain more semantic information.
Specifically, the TMSA modules of the first five stages perform joint motion estimation of the target objects over the consecutive original images of an original video clip to accomplish feature extraction; the parallel warping module connected in series with each TMSA module performs parallel feature warping between the features of the current original image frame and those of the adjacent frames in the original video clip, to further fuse the features of the current frame with those of the adjacent frames; and the TMSA module of the last stage fuses the features of the different scales extracted in the first five stages, obtaining the deep features of each original video clip.
It follows from the above that, since both the shallow features and the deep features obtained through multi-scale feature extraction are used for video super-resolution reconstruction, the to-be-recognized video has higher resolution and looks more natural, and target detection performed on the high-resolution video yields more accurate detection results. Furthermore, the deep features obtained through multi-scale extraction contain more feature information; video restoration based on these deep features achieves higher resolution, richer detail, and better results.
Referring to FIG. 8, in an exemplary embodiment, step 350 may include the following steps:
In step 351, gridding is applied to each to-be-recognized image, obtaining a grid image corresponding to each to-be-recognized image.
In step 353, target detection is performed on each grid cell by a convolutional neural network, obtaining multiple prediction boxes in each grid cell and their corresponding prediction parameters.
In step 355, the prediction boxes are screened based on their prediction parameters, and the prediction boxes obtained by the screening, together with their corresponding prediction parameters, are taken as the detection results of the target objects.
Here, a grid image includes multiple grid cells. The purpose of gridding is to generate, from the grid cells of the grid image, multiple prediction boxes that roughly cover the entire image region of the to-be-recognized image; performing target detection on these prediction boxes and obtaining their detection results completes the target detection of the to-be-recognized image. Each prediction box has corresponding prediction parameters, which include at least one of: the position of the prediction box in the grid image, a confidence indicating whether a target object exists in the prediction box, and probabilities indicating that the target object in the prediction box belongs to the different categories. The detection result of each prediction box is derived from its prediction parameters: the confidence determines whether a target object exists in the box; if the confidence indicates that none exists, no further judgment is needed; if one exists, the category of the target object is determined from the per-category probabilities in the prediction parameters.
It should be noted that a threshold can be set for the confidence: when the confidence exceeds the corresponding threshold, a target object exists in the prediction box. As for the probabilities, the category with the largest probability can be selected as the detection result of the target object, which is not limited here.
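A sketch of this screening step, assuming an illustrative 0.5 confidence threshold and NumPy arrays for the network outputs:

    import numpy as np

    def screen_boxes(boxes, obj_conf, class_probs, conf_thresh: float = 0.5):
        """boxes: (N, 4); obj_conf: (N,); class_probs: (N, num_classes).
        Keep boxes whose confidence clears the threshold, then take the
        most probable category for each surviving box."""
        keep = obj_conf > conf_thresh              # does the box hold a target object?
        categories = class_probs[keep].argmax(axis=1)
        return boxes[keep], categories, obj_conf[keep]

Boxes that fail the confidence test are discarded without any category judgment, matching the two-stage decision described above.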
As stated above, the prediction parameters also include the position of the prediction box in the grid image. If a target object exists in a prediction box, the position of that box is the position of the target object; on this basis, the image of the target object can be segmented out of the to-be-recognized image according to this position, where the target object includes at least one of a person and a vehicle. Specifically, as shown in FIG. 9, in one possible implementation, the following steps may be included after step 355:
In step 610, the position of the target object in the to-be-recognized image is obtained based on the detection result of the target object.
In step 630, the region containing the target object is located in each to-be-recognized image of the to-be-recognized video according to the position of the target object in the to-be-recognized image.
In step 650, a target image containing the target object is segmented out of each to-be-recognized image.
Under the above embodiment, whether each prediction box contains a target object, and the category of that target object, are obtained from the prediction parameters of the boxes in one pass; moreover, since the prediction parameters also include the position of each box, the target object image can be segmented out of the to-be-recognized image from that position, which reduces computation, increases target detection efficiency, and speeds up target detection.
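The segmentation from box position to target image (steps 610 to 650) reduces, in the simplest case, to array slicing; the clamping below is our own safeguard for boxes predicted slightly outside the frame:

    def crop_targets(image, boxes):
        """image: H x W x C array; boxes: (x1, y1, x2, y2) pixel tuples.
        Returns one target image per box."""
        h, w = image.shape[:2]
        crops = []
        for x1, y1, x2, y2 in boxes:
            x1, y1 = max(0, int(x1)), max(0, int(y1))
            x2, y2 = min(w, int(x2)), min(h, int(y2))
            crops.append(image[y1:y2, x1:x2])
        return crops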
FIG. 10 is a schematic diagram of a concrete implementation of a multi-target recognition method based on video super-resolution in an application scenario.
In step 801, the original video is acquired.
In step 803, features are extracted from the original video, and the low-resolution original video is restored into a high-resolution to-be-recognized video based on the features.
In step 805, target detection is performed on the to-be-recognized video, obtaining the target objects in the video and their categories.
In step 807, a target recognition model adapted to the category of each target object is selected according to that category. The category may be person, vehicle, plant, and so on, and the corresponding target recognition models can be configured according to the requirements of the task.
If the target object is a vehicle, then in step 809, the recognition result of each vehicle in the to-be-recognized video is obtained by the license plate recognition model. If the target object is a person, then in step 811, the recognition result of each person in the to-be-recognized video is obtained by the face recognition model.
In step 813, target recognition is performed on all target objects present in the to-be-recognized video, obtaining the recognition results.
In this application scenario, target detection is performed on the restored high-resolution to-be-recognized video and target recognition is performed on the detection results, so the recognition results of all target objects present in the video can be obtained quickly and accurately. In certain special scenarios this can assist in searching for target objects; for example, in a criminal investigation, a suspect, or the suspect's vehicle, can be quickly identified from low-resolution surveillance video.
The following are apparatus embodiments of this application, which can be used to perform the multi-target recognition method involved in this application. For details not disclosed in the apparatus embodiments, please refer to the method embodiments of the multi-target recognition method involved in this application.
Referring to FIG. 10, an embodiment of this application provides a multi-target recognition apparatus 900, including but not limited to: a video acquisition module 910, a video super-resolution module 930, a target detection module 950, and a target recognition module 970.
The video acquisition module 910 is configured to acquire an original video;
the video super-resolution module 930 is configured to perform video super-resolution reconstruction on the original video based on features extracted from the original video, restoring the low-resolution original video into a high-resolution to-be-recognized video, where the to-be-recognized video includes multiple frames of to-be-recognized images;
the target detection module 950 is configured to perform target detection on at least one target object in each to-be-recognized image, obtaining a detection result for each target object, where the detection result of a target object includes at least the category of the target object;
the target recognition module 970 is configured to determine target recognition models adapted to the categories of the target objects, and respectively invoke the target recognition models adapted to the categories of the different target objects to perform target recognition on the to-be-recognized images containing the target objects, obtaining a recognition result for each target object.
It should be noted that when the multi-target recognition apparatus provided in the above embodiment performs target recognition, the division into the above functional modules is merely illustrative; in practical applications, the above functions can be assigned to different functional modules as required, that is, the internal structure of the multi-target recognition apparatus can be divided into different functional modules to complete all or part of the functions described above.
In addition, the multi-target recognition apparatus provided in the above embodiment and the embodiments of the multi-target recognition method belong to the same concept; the specific ways in which the modules perform operations have been described in detail in the method embodiments and are not repeated here.
FIG. 11 is a schematic structural diagram of a server according to an exemplary embodiment. This server is applicable as the server 200 in the implementation environment shown in FIG. 1.
It should be noted that this server is merely an example adapted to this application and should not be considered as providing any limitation on the scope of use of this application. Nor should this server be interpreted as needing to depend on, or having to include, one or more components of the exemplary server 2000 shown in FIG. 11.
The hardware structure of the server 2000 may vary widely depending on configuration or performance. As shown in FIG. 11, the server 2000 includes: a power supply 210, an interface 230, at least one memory 250, and at least one central processing unit (CPU) 270.
Specifically, the power supply 210 is configured to provide operating voltage for the hardware devices on the server 2000.
The interface 230 includes at least one wired or wireless network interface for interacting with external devices, for example, for the interaction between the terminal 100 and the server 200 in the implementation environment shown in FIG. 1. Of course, in other examples adapted to this application, the interface 230 may further include at least one serial-to-parallel conversion interface 233, at least one input/output interface 235, at least one USB interface 237, and the like, as shown in FIG. 11, which does not constitute a specific limitation here.
The memory 250, as a carrier of resource storage, may be a read-only memory, a random access memory, a magnetic disk, an optical disc, or the like; the resources stored on it include an operating system 251, application programs 253, data 255, and so on, and the storage may be transient or persistent.
The operating system 251 is used to manage and control the hardware devices and the application programs 253 on the server 2000, so that the central processing unit 270 can operate on and process the massive data 255 in the memory 250; it may be Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
An application program 253 is a computer program that performs at least one specific task on top of the operating system 251; it may include at least one module (not shown in FIG. 11), and each module may contain its own computer program for the server 2000. For example, the multi-target recognition apparatus can be regarded as an application program 253 deployed on the server 2000.
The data 255 may be photos, pictures, and the like stored on a magnetic disk, or may be the original video, the to-be-recognized video, and the like, stored in the memory 250.
The central processing unit 270 may include one or more processors and is configured to communicate with the memory 250 through at least one communication bus, so as to read the computer programs stored in the memory 250 and thereby operate on and process the massive data 255 in the memory 250. For example, the multi-target recognition method is completed in the form of the central processing unit 270 reading a series of computer programs stored in the memory 250.
Furthermore, this application can equally be implemented by hardware circuits or by hardware circuits combined with software; therefore, implementing this application is not limited to any specific hardware circuit, software, or combination of the two.
Referring to FIG. 13, an embodiment of this application provides an electronic device 4000, which may include a desktop computer, a laptop computer, a server, and the like.
In FIG. 13, the electronic device 4000 includes at least one processor 4001, at least one communication bus 4002, and at least one memory 4003. The processor 4001 is connected to the memory 4003, for example through the communication bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which can be used for data interaction between this electronic device and other electronic devices, such as sending and/or receiving data. It should be noted that in practical applications the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of this application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or another programmable logic device, transistor logic device, hardware component, or any combination thereof. It can implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of this application. The processor 4001 may also be a combination implementing computing functions, for example a combination including one or more microprocessors, a combination of a DSP and a microprocessor, and so on.
The communication bus 4002 may include a path for transferring information between the above components. The communication bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus 4002 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 13, but this does not mean there is only one bus or one type of bus.
The memory 4003 may be a ROM (Read Only Memory) or another type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
A computer program is stored on the memory 4003, and the processor 4001 reads the computer program stored in the memory 4003 through the communication bus 4002.
When the computer program is executed by the processor 4001, the multi-target recognition method in the above embodiments is implemented.
In addition, an embodiment of this application provides a storage medium on which a computer program is stored; when the computer program is executed by a processor, the multi-target recognition method in the above embodiments is implemented.
An embodiment of this application provides a computer program product, which includes a computer program stored in a storage medium. A processor of a computer device reads the computer program from the storage medium and executes it, causing the computer device to perform the multi-target recognition method in the above embodiments.
Compared with the related art, performing target recognition on the high-resolution to-be-recognized video increases the accuracy of target recognition and solves the problem in the related art that the accuracy of target recognition on video is low; moreover, selecting a target recognition model adapted to the category of the target object for recognition further improves the accuracy of target recognition.
It should be understood that although the steps in the flowcharts of the accompanying drawings are displayed sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; nor is their execution order necessarily sequential, as they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.

Claims (10)

  1. A multi-target recognition method based on video super-resolution, comprising:
    acquiring an original video;
    performing video super-resolution reconstruction on the original video based on features extracted from the original video, restoring the low-resolution original video into a high-resolution to-be-recognized video, the to-be-recognized video comprising multiple frames of to-be-recognized images;
    performing target detection on at least one target object in each of the to-be-recognized images to obtain a detection result of each target object, wherein the detection result of the target object comprises at least a category of the target object; and
    determining target recognition models adapted to the categories of the target objects, and respectively invoking the target recognition models adapted to the categories of the different target objects to perform target recognition on the to-be-recognized images containing the target objects, obtaining a recognition result of each target object.
  2. The method according to claim 1, wherein the performing video super-resolution reconstruction on the original video based on the features extracted from the original video to restore the low-resolution original video into the high-resolution to-be-recognized video comprises:
    performing feature extraction on the original video to obtain shallow features of the original video;
    dividing the original video into multiple original video clips, wherein the original video clips do not overlap with one another, and each original video clip comprises at least two consecutive frames of original images of the original video;
    performing, by using stages of a parallel super-resolution mechanism, feature propagation over the consecutive original images in each original video clip to obtain deep features of each original video clip, wherein each stage corresponds to one scale, and each stage comprises a temporal mutual self-attention module and/or a parallel warping module; and
    performing, based on the obtained shallow features and deep features, feature reconstruction in parallel for each original image in the original video to obtain the to-be-recognized video.
  3. The method according to claim 2, wherein the performing, by using the stages of the parallel super-resolution mechanism, feature propagation over the consecutive original images in each original video clip to obtain the deep features of each original video clip comprises:
    in multiple said stages corresponding to different scales, performing feature extraction on the consecutive original images in each original video clip by using the temporal mutual self-attention module of each stage, obtaining features of each original video clip at the different scales;
    aligning the features of each original video clip at the different scales by using the parallel warping module connected in series with the temporal mutual self-attention module of each stage; and
    in the last said stage, performing feature fusion on the features of each original video clip at the different scales by using the temporal mutual self-attention module of that stage, obtaining the deep features of each original video clip.
  4. The method according to any one of claims 1 to 3, wherein the performing target detection on the target objects in each frame of to-be-recognized image of the to-be-recognized video to obtain detection results comprises:
    performing gridding on each to-be-recognized image to obtain a grid image corresponding to each to-be-recognized image, the grid image comprising multiple grid cells;
    performing target detection on each grid cell by a convolutional neural network to obtain multiple prediction boxes in each grid cell and corresponding prediction parameters, wherein the prediction parameters comprise at least one of: a position of the corresponding prediction box in the grid image, a confidence indicating whether a target object exists in the corresponding prediction box, and probabilities indicating that the target object existing in the corresponding prediction box belongs to different categories; and
    screening the prediction boxes based on their prediction parameters, and taking the prediction boxes obtained by the screening and the corresponding prediction parameters as the detection results of the target objects.
  5. The method according to any one of claims 1 to 3, wherein the target object comprises at least one of a person and a vehicle; and
    before the inputting the to-be-recognized image containing the target object into the adapted target recognition model for target recognition to obtain the recognition result of the target object, the method further comprises:
    obtaining a position of the target object in the to-be-recognized image based on the detection result of the target object;
    locating, according to the position of the target object in the to-be-recognized image, a region containing the target object in each to-be-recognized image of the to-be-recognized video; and
    segmenting a target image containing the target object out of each to-be-recognized image.
  6. The method according to claim 5, wherein the target recognition model comprises a face recognition model, the face recognition model being a trained machine learning model capable of recognizing the target image; and
    the inputting the to-be-recognized image containing the target object into the adapted target recognition model for target recognition to obtain the recognition result of the target object comprises:
    locating facial key points in a face region of the target image;
    segmenting, based on the located facial key points, a face region image containing a face out of the target image;
    mapping the face region image into a Euclidean space, wherein a distance of the face region image in the Euclidean space indicates a similarity between the face in the face region image and a face in a sample image, the sample image being a face region image used for training the face recognition model; and
    obtaining, based on the similarity between the face in the face region image and the face in the sample image, a recognition result for the face in the face region image.
  7. The method according to claim 5, wherein the target recognition model comprises a license plate recognition model, the license plate recognition model being a trained machine learning model capable of recognizing the target image; and
    the inputting the to-be-recognized image containing the target object into the adapted target recognition model for target recognition to obtain the recognition result of the target object comprises:
    performing feature extraction on the target image through network layers of the license plate recognition model to obtain a feature sequence of the target image; and
    mapping the feature sequence by the license plate recognition model to obtain a recognition result for the target image.
  8. A multi-target recognition apparatus based on video super-resolution, comprising:
    a video acquisition module configured to acquire an original video;
    a video super-resolution module configured to perform video super-resolution reconstruction on the original video based on features extracted from the original video, restoring the low-resolution original video into a high-resolution to-be-recognized video, the to-be-recognized video comprising multiple frames of to-be-recognized images;
    a target detection module configured to perform target detection on at least one target object in each of the to-be-recognized images to obtain a detection result of each target object, wherein the detection result of the target object comprises at least a category of the target object; and
    a target recognition module configured to determine target recognition models adapted to the categories of the target objects, and respectively invoke the target recognition models adapted to the categories of the different target objects to perform target recognition on the to-be-recognized images containing the target objects, obtaining a recognition result of each target object.
  9. An electronic device, comprising: at least one processor, at least one memory, and at least one communication bus, wherein
    a computer program is stored on the memory, and the processor reads the computer program in the memory through the communication bus; and
    when the computer program is executed by the processor, the multi-target recognition method according to any one of claims 1 to 7 is implemented.
  10. A storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the multi-target recognition method according to any one of claims 1 to 7 is implemented.
PCT/CN2023/133779 2022-11-25 2023-11-23 Multi-target recognition method and apparatus based on video super-resolution WO2024109902A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211497336.4 2022-11-25
CN202211497336.4A CN118097482A (zh) Multi-target recognition method and apparatus based on video super-resolution

Publications (1)

Publication Number Publication Date
WO2024109902A1 true WO2024109902A1 (zh) 2024-05-30

Family

ID=91148264

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/133779 WO2024109902A1 (zh) Multi-target recognition method and apparatus based on video super-resolution

Country Status (2)

Country Link
CN (1) CN118097482A (zh)
WO (1) WO2024109902A1 (zh)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019162241A1 (en) * 2018-02-21 2019-08-29 Robert Bosch Gmbh Real-time object detection using depth sensors
CN110956126A (zh) * 2019-11-27 2020-04-03 云南电网有限责任公司电力科学研究院 一种联合超分辨率重建的小目标检测方法
CN111695478A (zh) * 2020-06-04 2020-09-22 济南信通达电气科技有限公司 一种目标检测方法及设备
CN111784624A (zh) * 2019-04-02 2020-10-16 北京沃东天骏信息技术有限公司 目标检测方法、装置、设备及计算机可读存储介质
CN112215119A (zh) * 2020-10-08 2021-01-12 华中科技大学 一种基于超分辨率重建的小目标识别方法、装置及介质
CN113283396A (zh) * 2021-06-29 2021-08-20 艾礼富电子(深圳)有限公司 目标对象的类别检测方法、装置、计算机设备和存储介质
CN113888407A (zh) * 2021-09-16 2022-01-04 温州大学大数据与信息技术研究院 一种基于超分辨率技术的目标检测***
CN113901928A (zh) * 2021-10-13 2022-01-07 长沙理工大学 一种基于动态超分辨率的目标检测方法、输电线路部件检测方法及***
CN115082308A (zh) * 2022-05-23 2022-09-20 华南理工大学 基于多尺度局部自注意力的视频超分辨率重建方法及***

Also Published As

Publication number Publication date
CN118097482A (zh) 2024-05-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23893991

Country of ref document: EP

Kind code of ref document: A1