CN111837158A - Image processing method and device, shooting device and movable platform - Google Patents

Image processing method and device, shooting device and movable platform

Info

Publication number
CN111837158A
Authority
CN
China
Prior art keywords
target
image
information
semantic
target scene
Prior art date
Legal status
Pending
Application number
CN201980011444.6A
Other languages
Chinese (zh)
Inventor
王涛
李思晋
刘政哲
李然
Current Assignee
Shenzhen Zhuoyu Technology Co ltd
Original Assignee
SZ DJI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by SZ DJI Technology Co Ltd
Publication of CN111837158A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/45 Cameras or camera modules comprising electronic image sensors; Control thereof for generating image signals from two or more image sensors being of different type or operating in different modes, e.g. with a CMOS sensor for moving images in combination with a charge-coupled device [CCD] for still images

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

An image processing method, an image processing apparatus, a photographing apparatus and a movable platform. The method comprises: acquiring a binocular image of a target scene; determining depth information of the target scene according to the binocular image; and obtaining semantic information and position information of each target in the target scene according to the depth information and a semantic segmentation map of one monocular image of the binocular image. Because the method combines the depth information of the target scene with the semantic segmentation map of a monocular image of the target scene when identifying targets, the semantic information and position information of each target in the target scene can be acquired more accurately, categories that lack distance cues and have similar textures can be distinguished, and support is provided for constructing an accurate and practical semantic map. The method is particularly suitable for target scenes with complex backgrounds.

Description

Image processing method and device, shooting device and movable platform
Technical Field
The present invention relates to the field of image processing, and in particular, to an image processing method and apparatus, a photographing apparatus, and a movable platform.
Background
In the related art, when identifying targets in a scene, semantic segmentation is performed on a single image captured of the scene to obtain a semantic segmentation map, and targets are identified according to the semantic segmentation map. This recognition approach has difficulty distinguishing categories that lack distance cues and have similar textures, especially in scenes with complex backgrounds; for example, distinguishing a bush from the ground, or a front vehicle from a rear vehicle, is difficult to achieve in this way.
Disclosure of Invention
The invention provides an image processing method, an image processing device, a shooting device and a movable platform.
Specifically, the invention is realized by the following technical scheme:
according to a first aspect of the present invention, there is provided an image processing method, the method comprising:
acquiring a binocular image of a target scene;
determining the depth information of the target scene according to the binocular image;
and obtaining semantic information and position information of each target in the target scene according to the depth information and a semantic segmentation map of a monocular image in the binocular image.
According to a second aspect of the present invention, there is provided an image processing apparatus comprising:
storage means for storing program instructions;
one or more processors that invoke program instructions stored in the storage device, the one or more processors individually or collectively configured when the program instructions are executed to:
acquiring a binocular image of a target scene;
determining the depth information of the target scene according to the binocular image;
and obtaining semantic information and position information of each target in the target scene according to the depth information and a semantic segmentation map of a monocular image in the binocular image.
According to a third aspect of the present invention, there is provided a photographing apparatus including:
the image acquisition module is used for acquiring binocular images of a target scene;
storage means for storing program instructions;
one or more processors that invoke program instructions stored in the storage device, the one or more processors individually or collectively configured when the program instructions are executed to:
acquiring a binocular image of a target scene acquired by the image acquisition module;
determining the depth information of the target scene according to the binocular image;
and obtaining semantic information and position information of each target in the target scene according to the depth information and a semantic segmentation map of a monocular image in the binocular image.
According to a fourth aspect of the present invention there is provided a moveable platform comprising:
the image acquisition module is used for acquiring binocular images of a target scene;
storage means for storing program instructions;
one or more processors that invoke program instructions stored in the storage device, the one or more processors individually or collectively configured when the program instructions are executed to:
acquiring a binocular image of a target scene acquired by the image acquisition module;
determining the depth information of the target scene according to the binocular image;
and obtaining semantic information and position information of each target in the target scene according to the depth information and a semantic segmentation map of a monocular image in the binocular image.
According to the technical solution provided by the embodiments of the present invention, when identifying targets, the depth information of the target scene is combined with the semantic segmentation map of a monocular image of the target scene, so that the semantic information and position information of each target in the target scene can be acquired more accurately, categories that lack distance cues and have similar textures can be distinguished, and support is provided for constructing an accurate and practical semantic map. The method is particularly suitable for target scenes with complex backgrounds.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. The drawings described below illustrate only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of an image processing method in an embodiment of the present invention;
FIG. 2a is a monocular image of a binocular image of a target scene in an embodiment of the present invention;
FIG. 2b is a schematic representation of the depth information of the target scene shown in FIG. 2a;
FIG. 2c is a flowchart of a specific implementation of determining the depth information of a target scene according to the binocular image in an embodiment of the present invention;
FIG. 3 is a flowchart of a specific implementation of obtaining semantic information and position information of each target in a target scene according to the depth information and the semantic segmentation map of one monocular image of the binocular image in an embodiment of the present invention;
FIG. 4a is a schematic diagram of an application scenario of the image processing method in an embodiment of the present invention;
FIG. 4b is a schematic illustration of a depth map and binocular image of the scene of FIG. 4a;
FIG. 5a is a schematic diagram of another application scenario of the image processing method in an embodiment of the present invention;
FIG. 5b is a schematic illustration of a depth map and binocular image of the scene of FIG. 5a;
FIG. 6a is a schematic diagram of another application scenario of the image processing method in an embodiment of the present invention;
FIG. 6b is a schematic illustration of a depth map and binocular image of the scene of FIG. 6a;
FIG. 7 is a flowchart of another specific implementation of obtaining semantic information and position information of each target in a target scene according to the depth information and the semantic segmentation map of one monocular image of the binocular image in an embodiment of the present invention;
FIG. 8 is a flowchart of a specific implementation of the image processing method in an embodiment of the present invention;
FIG. 9 is a block diagram of an image processing apparatus in an embodiment of the present invention;
FIG. 10 is a block diagram of a photographing apparatus in an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a movable platform in an embodiment of the present invention.
Detailed Description
Traditional target recognition approaches have difficulty distinguishing categories in a scene that lack distance cues and have similar textures, such as a bush and the ground, front and rear vehicles, or adjacent walls.
In the embodiments of the present invention, when targets are identified, the depth information of the target scene is combined with the semantic segmentation map of a monocular image of the target scene, so that the semantic information and position information of each target in the target scene can be acquired more accurately and such categories can be distinguished.
The method can provide pixel-level semantic recognition of the scene from the viewpoint of the movable platform and construct a semantic map that supports key decision-making categories, such as drivable areas, people and vehicles. Semantic segmentation of a single monocular image of the binocular pair gives poor results because of the missing distance information, and some categories, such as a bush and the ground or front and rear vehicles, are hard to tell apart. The method therefore relies on the binocular image: when identifying targets, the depth information of the target scene is combined with the semantic segmentation map of a monocular image of the target scene, so that the semantic information and position information of each target in the scene viewed from the movable platform can be acquired accurately, categories that lack distance cues and have similar textures can be distinguished, and support is provided for other intelligent functions of the movable platform.
The movable platform has a shooting function and can be a vehicle, an unmanned aerial vehicle, a handheld gimbal, an unmanned ship and the like. The vehicle can be an unmanned (self-driving) vehicle, a remote-controlled vehicle, etc., and the unmanned aerial vehicle can be an aerial-photography drone or another unmanned aerial vehicle with a shooting function.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, in the following examples and embodiments, features may be combined with each other without conflict.
FIG. 1 is a flowchart of an image processing method in an embodiment of the present invention. Referring to fig. 1, the image processing method of this embodiment may include the following steps:
S101: acquiring a binocular image of a target scene;
the binocular image acquisition mode can be selected as required, for example, in some embodiments, a binocular shooting camera is used for shooting to acquire binocular images of a target scene. The binocular camera can be a camera of a binocular camera, and the binocular camera is carried on the movable platform for use or can be directly used; of course, the binocular camera can be integrated on the movable platform.
In some embodiments, monocular cameras are used to capture images at different positions to obtain binocular images of a target scene. Wherein the different positions correspond to the shooting positions of the binocular shooting camera. The monocular camera of the embodiment can be a camera of a monocular camera, and the monocular camera can be carried on a movable platform for use or can be directly used; of course, the monocular camera described above may also be integrated on the movable platform.
For the binocular images of the same target scene, one group or multiple groups of binocular images of the target scene can be acquired, and the group of binocular images comprise two monocular images, namely a left eye image and a right eye image.
S102: determining depth information of a target scene according to the binocular images;
The depth information may include relative distance information of each target in the target scene under a preset coordinate system. Optionally, the depth information includes distance information of each target in the target scene relative to the photographing device that captures the target scene, such as the distance of each target relative to the lens, or relative to another position of the photographing device. The preset coordinate system can be a world coordinate system or a user-defined coordinate system. It can be understood that, in other embodiments, absolute distance information may also be used to represent the depth information.
The depth information can be presented as a feature map or as data. FIG. 2a is a monocular image of a binocular image of a target scene; fig. 2b shows the depth information of the target scene shown in fig. 2a, which is presented as a feature map in this embodiment.
In the related art, depth information is determined using the triangle-similarity principle, which involves a complex calculation process and takes a long time. To reduce the time needed to determine the depth information, this embodiment determines it with a deep-learning method; fig. 2c shows a specific implementation of determining the depth information of the target scene from the binocular image. Referring to fig. 2c, when determining the depth information of the target scene from the binocular image, the image information of the binocular image is input into a first convolutional neural network trained in advance, which determines the depth information of the target scene. The image information includes the color information corresponding to each channel of the monocular images, such as the RGB components. In this embodiment, the image information of one or more groups of binocular images is input into the first convolutional neural network to determine the depth information of the target scene, and the finally determined depth information may be represented by a feature map having the same length and width as the monocular images of the binocular image.
The network structure of the first convolutional neural network may be designed as required. For example, in one possible implementation, the first convolutional neural network includes a plurality of first network units connected in sequence, each of which performs feature extraction on its input. Optionally, the first convolutional neural network includes three first network units connected in sequence: the input of the first of them is the image information of the binocular image, the input of the middle one is the output of the first, and the input of the last one is the output of the middle one. Optionally, the output of the first unit and the output of the middle unit are together used as the input of the last unit, so as to deepen the first convolutional neural network.
The first network unit of this embodiment may include at least one of a convolutional layer, a batch normalization layer and a nonlinear activation layer, all of which use standard operations and are not described in detail here. Of course, the first network unit may also include other network layers and is not limited to convolutional, batch normalization and/or nonlinear activation layers. A minimal sketch of such a network is given below.
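The following sketch, written in Python with PyTorch, illustrates one way the first convolutional neural network described above could be organized. The channel counts, kernel sizes, the one-channel depth head and the exact form of the skip connection are assumptions made for illustration; the text only specifies three sequentially connected units built from convolution, batch normalization and nonlinear activation layers.

```python
# Minimal PyTorch sketch of the first convolutional neural network described above.
import torch
import torch.nn as nn

class FirstNetworkUnit(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class FirstConvNet(nn.Module):
    """Estimates a dense depth map from a stacked binocular image pair."""
    def __init__(self):
        super().__init__()
        self.unit1 = FirstNetworkUnit(6, 32)   # left RGB + right RGB = 6 input channels
        self.unit2 = FirstNetworkUnit(32, 32)
        self.unit3 = FirstNetworkUnit(64, 32)  # receives unit1 and unit2 outputs together
        self.head = nn.Conv2d(32, 1, kernel_size=1)  # one-channel depth feature map

    def forward(self, left, right):
        x = torch.cat([left, right], dim=1)          # (N, 6, H, W)
        f1 = self.unit1(x)
        f2 = self.unit2(f1)
        f3 = self.unit3(torch.cat([f1, f2], dim=1))  # skip connection deepens the network
        return self.head(f3)                         # same height/width as the monocular images
```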
In addition, referring again to fig. 2c, in some embodiments, before determining the depth information of the target scene from the binocular image, the image processing method may further include: preprocessing the binocular image so that the two monocular images forming it have the same size and the corresponding image points on the two monocular images are matched. In some embodiments, before determining the depth information of the target scene from the binocular image, the image processing method may further include: removing the distortion of the two monocular images of the binocular image through binocular rectification. Such preprocessing improves the matching quality of the binocular image and thus the accuracy of the depth information.
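Preprocessing of this kind is commonly done with stereo rectification. The sketch below uses OpenCV; the calibration inputs (K1, D1, K2, D2, R, T) and the helper name rectify_pair are assumptions for illustration and are not taken from the patent.

```python
# Binocular preprocessing sketched with OpenCV: rectify and undistort the pair
# so the two monocular images have the same size and corresponding points lie
# on the same image rows.
import cv2

def rectify_pair(img_l, img_r, K1, D1, K2, D2, R, T):
    h, w = img_l.shape[:2]
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, (w, h), R, T)
    map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, (w, h), cv2.CV_32FC1)
    map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, (w, h), cv2.CV_32FC1)
    # Remapping also removes the lens distortion of both monocular images.
    rect_l = cv2.remap(img_l, map1x, map1y, cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_r, map2x, map2y, cv2.INTER_LINEAR)
    return rect_l, rect_r
```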
S103: obtaining semantic information and position information of each target in the target scene according to the depth information and the semantic segmentation map of one monocular image of the binocular image.
Combining the depth information of the target scene with the semantic segmentation map of a monocular image of the target scene for target recognition makes the semantic information and position information of each target in the target scene more accurate and allows categories that lack distance cues and have similar textures to be distinguished.
In this embodiment, the semantic information at least includes information indicating the category to which the target belongs, such as vehicle, pedestrian, road or sky.
Step S103 can be implemented in several ways. Optionally, referring to fig. 3, in some embodiments, the process of obtaining the semantic information and position information of each target in the target scene from the depth information and the semantic segmentation map of one monocular image of the binocular image may include, but is not limited to, the following steps:
S301: performing semantic segmentation on one monocular image of the binocular image to obtain a semantic segmentation map of the monocular image;
the monocular image can be a left eye image or a right eye image; since the left eye image is usually taken as a reference during shooting when the binocular image is acquired, the semantic segmentation is performed on the left eye image to obtain the semantic segmentation map of the left eye image in the embodiment.
Step S301 may be implemented with an existing semantic segmentation algorithm. In this embodiment, one monocular image of the binocular image is input into the second convolutional neural network, which determines the semantic segmentation map of the monocular image according to a preset target classification rule and the image information of the monocular image; the second convolutional neural network is described in detail below.
S302: determining semantic information and position information of each target in the target scene according to the depth information and the semantic segmentation map.
The process of determining the semantic information and position information of each target in the target scene from the depth information and the semantic segmentation map may include, but is not limited to, the following steps (a minimal sketch of the depth-based separation is given below):
(1) according to the depth information and the initial semantic information of each target in the semantic segmentation map, distinguishing a plurality of targets in the semantic segmentation map that have the same or similar target categories and are adjacent in position;
For a plurality of targets with the same or similar target categories and adjacent positions in the semantic segmentation map, the initial semantic information may identify them as a single target, leading to inaccurate recognition. The plurality of targets in this step may be, for example, a bush and the ground, front and rear vehicles, or adjacent walls in the target scene, or any other targets of the same or similar categories located next to each other.
(2) obtaining semantic information and boundary information of the plurality of targets.
Of course, the position information identified for each target in the target scene from the depth information and the semantic segmentation map may be other position information of the corresponding target and is not limited to boundary information.
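As one illustration of step (1), the sketch below separates targets that share a class label in the segmentation map by grouping pixels into coarse depth bands and splitting them into connected components, so that two vehicles at different distances, or one wall in front of another, receive different ids. The 2 m band width, the use of scipy's connected-component labeling and the helper name split_targets_by_depth are assumptions for illustration, not details given in the patent.

```python
# Sketch: separate same-class adjacent targets using the depth map.
import numpy as np
from scipy import ndimage

def split_targets_by_depth(seg_map, depth_map, band=2.0):
    """seg_map: (H, W) integer class ids; depth_map: (H, W) distances in meters."""
    instance_map = np.zeros_like(seg_map, dtype=np.int32)
    next_id = 1
    for cls in np.unique(seg_map):
        mask = seg_map == cls
        # Quantize depth inside this class region; same-class pixels at clearly
        # different distances fall into different bands.
        bands = np.where(mask, (depth_map // band).astype(np.int32), -1)
        for b in np.unique(bands[mask]):
            labeled, n = ndimage.label(bands == b)
            for i in range(1, n + 1):
                instance_map[labeled == i] = next_id  # one id per separated target
                next_id += 1
    return instance_map
```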
In the following embodiments, the above depth information is represented as a depth map.
The image processing method provided by this embodiment can be applied to semantic segmentation of an image with occlusion. In the top view shown in fig. 4a, when vehicle A and vehicle B are in the positions shown and the binocular camera (comprising camera a and camera b) observes along the direction of the arrow, vehicle A is partially occluded by vehicle B. After the binocular camera shoots along this viewing angle, a monocular image a (captured by camera a) and a monocular image b (captured by camera b) as shown in the lower part of fig. 4b are obtained. From monocular image a and monocular image b, the depth map shown in the upper part of fig. 4b (in which different fill patterns represent different depths) can be obtained. By combining the depth map with the initial semantic information of each target in the semantic segmentation map of monocular image a, or with that of monocular image b, the two vehicles at different distances in front can be distinguished, and the type of each vehicle (i.e. the target category) can be further determined.
Of course, this embodiment can also handle more complicated occlusion. In the top view shown in fig. 5a, when the binocular camera observes along the direction of the arrow, vehicle C occludes part of vehicle B and vehicle A, and vehicle B occludes part of vehicle A. After the binocular camera shoots along this viewing angle, a monocular image a (captured by camera a) and a monocular image b (captured by camera b) as shown in the lower part of fig. 5b are obtained, and from them the depth map shown in the upper part of fig. 5b can be obtained. By combining the depth map with the initial semantic information of each target in the semantic segmentation map of monocular image a, or with that of monocular image b, the different occlusion relationships in front can be distinguished, and the type of each vehicle (i.e. the target category) can be further determined.
The image processing method provided by this embodiment can also be applied to semantic segmentation of images containing objects with similar textures. For example, in the top view of fig. 6a, a cornered wall is in front of the binocular camera; wall D is closer to the binocular camera than wall E, and the two walls have similar textures. After the binocular camera shoots along the viewing angle, a monocular image a (captured by camera a) and a monocular image b (captured by camera b) as shown in the lower part of fig. 6b are obtained, and from them the depth map shown in the upper part of fig. 6b can be obtained. By combining the depth map with the initial semantic information of each target in the semantic segmentation map of monocular image a, or with that of monocular image b, the front-rear relationship between wall D and wall E can be distinguished, and the boundary information can further identify the wall as having a corner.
Referring to fig. 7, in some embodiments, the implementation process of obtaining the semantic information and the position information of each target in the target scene according to the depth information and the semantic segmentation map of one monocular image in the binocular image may include, but is not limited to, the following steps:
S701: inputting the depth information and the image information of one monocular image of the binocular image into a second convolutional neural network trained in advance to obtain semantic information and position information of each target in the target scene;
the second convolutional neural network is used for determining a semantic segmentation map of the monocular image according to a preset target classification rule and image information of the monocular image; and obtaining semantic information and position information of each target in the target scene based on the depth information and the semantic segmentation map.
In this embodiment, to classify targets more accurately, the image training set used to train the second convolutional neural network includes image training sets for a plurality of target categories, and the training set of each category includes at least one training set for a sub-category. Optionally, the target categories include at least two of the following: vehicle, sky, road, static obstacle and dynamic obstacle. Of course, the target categories are not limited to those listed above and can be set to other categories. In addition, the sub-categories of vehicle may include car, truck, bus, train, motor home, etc.; the sub-categories of static obstacle may include building, wall, guardrail, utility pole, traffic light, traffic sign, etc.; and the sub-categories of dynamic obstacle may include pedestrian, bicycle, motorcycle, etc. One possible organization of these categories is sketched below.
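As a small illustration only, the target categories and sub-categories mentioned above could be organized as a configuration table like the following; the names follow the examples in the text, while the structure itself is an assumption.

```python
# Illustrative configuration of target categories and their sub-categories.
TARGET_CATEGORIES = {
    "vehicle": ["car", "truck", "bus", "train", "motor home"],
    "static obstacle": ["building", "wall", "guardrail", "utility pole",
                        "traffic light", "traffic sign"],
    "dynamic obstacle": ["pedestrian", "bicycle", "motorcycle"],
    "road": ["road"],
    "sky": ["sky"],
}
```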
The target classification rule of the present embodiment corresponds to a target class, that is, the second convolutional neural network can identify a target belonging to the target class in the monocular image.
The network structure of the second convolutional neural network may be designed as required. For example, in one possible implementation, the second convolutional neural network includes a plurality of second network units connected in sequence, each of which performs target classification on its input. Optionally, the second convolutional neural network includes three second network units connected in sequence: the input of the first of them is the depth information together with the image information of one monocular image of the binocular image, the input of the middle one is the output of the first, and the input of the last one is the output of the middle one. Optionally, the output of the first unit and the output of the middle unit are together used as the input of the last unit, so as to deepen the second convolutional neural network.
The second network unit of this embodiment includes at least one of a convolutional layer, a batch normalization layer and a nonlinear activation layer, all of which use standard operations and are not described in detail here. Of course, the second network unit may also include other network layers and is not limited to convolutional, batch normalization and/or nonlinear activation layers. A sketch of such a network is given below.
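The sketch below, again in Python with PyTorch, shows one possible form of the second convolutional neural network: it takes the depth map together with one monocular image as a four-channel input and produces per-pixel class scores, from which a recognition result and a recognition confidence can be read. The channel counts, the softmax-based confidence and the helper name recognize are assumptions for illustration.

```python
# A possible form of the second convolutional neural network (PyTorch sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondNetworkUnit(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class SecondConvNet(nn.Module):
    """Per-pixel classification from one monocular image plus the depth map."""
    def __init__(self, num_classes):
        super().__init__()
        self.unit1 = SecondNetworkUnit(4, 32)   # 3 RGB channels + 1 depth channel
        self.unit2 = SecondNetworkUnit(32, 32)
        self.unit3 = SecondNetworkUnit(64, 32)  # fed by unit1 and unit2 outputs together
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)           # (N, 4, H, W)
        f1 = self.unit1(x)
        f2 = self.unit2(f1)
        f3 = self.unit3(torch.cat([f1, f2], dim=1))
        return self.head(f3)                         # per-pixel class scores

def recognize(net, rgb, depth):
    """Returns the recognition result (class id) and recognition confidence per pixel."""
    scores = net(rgb, depth)
    probs = F.softmax(scores, dim=1)
    confidence, class_id = probs.max(dim=1)
    return class_id, confidence
```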
The position information in step S701 may be boundary information of each object in the object scene, or may be other position information of each object in the object scene.
In certain embodiments, step S301 and step S302 are both implemented in the second convolutional neural network described above.
Referring to fig. 8, in some embodiments, the image information of the preprocessed binocular image is input into the first convolutional neural network to determine the depth information of the target scene; the depth information of the target scene and the image information of one monocular image of the binocular image are then input into the second convolutional neural network to obtain the semantic information and position information of each target in the target scene.
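Chaining the hypothetical helpers sketched above gives a rough picture of this two-stage flow; rectify_pair, FirstConvNet (depth_net), SecondConvNet (seg_net) and recognize are the illustrative pieces from the earlier sketches, not functions defined by the patent.

```python
# Rough end-to-end sketch of the flow in fig. 8, reusing the hypothetical helpers above:
# preprocess the pair, estimate depth, then classify per pixel.
import torch
from torchvision.transforms.functional import to_tensor

def process_pair(left_bgr, right_bgr, calib, depth_net, seg_net):
    # Preprocessing (binocular rectification), as sketched earlier.
    left, right = rectify_pair(left_bgr, right_bgr, *calib)
    l = to_tensor(left).unsqueeze(0)    # (1, 3, H, W)
    r = to_tensor(right).unsqueeze(0)
    with torch.no_grad():
        depth = depth_net(l, r)                        # first CNN -> depth map
        class_id, conf = recognize(seg_net, l, depth)  # second CNN -> semantic info
    return depth, class_id, conf
```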
Optionally, in some embodiments, the semantic information may include a recognition result and a corresponding recognition confidence. The recognition result represents the category to which the target belongs, and the recognition confidence represents how reliable the recognition result is; misrecognized targets can be removed by means of the confidence, which improves the accuracy of target recognition.
Further, referring to fig. 8, in some embodiments, after obtaining the semantic information and position information of each target in the target scene, the image processing method may further include: generating a semantic map of the target scene according to the recognition result, the corresponding recognition confidence and the position information, so that the target recognition result is presented visually on the basis of the semantic map. The process of generating the semantic map of the target scene according to the recognition result, the corresponding recognition confidence and the position information may include, but is not limited to, the following steps:
(1) determining the target corresponding to the recognition result in the semantic segmentation map according to the recognition result and the position information;
(2) displaying the outline of the target corresponding to the recognition result in the semantic segmentation map according to the recognition result and the position information;
(3) if the recognition confidence corresponding to the recognition result is greater than a preset confidence threshold, labeling, in the semantic segmentation map, the target corresponding to the recognition result with the preset label of the target category to which the recognition result belongs.
The labeled semantic segmentation map is the semantic map of the target scene; the target category of each recognized target is presented visually in the semantic segmentation map through the labels.
In this embodiment, the label of each target category is preset and may be represented by a color, a pattern or the like, with different target categories having different labels. Optionally, different target categories correspond to different colors, for example blue for the sky, brown for the ground and green for grass; optionally, different sub-categories within the same target category share the same color but differ in shade.
In addition, if the recognition confidence corresponding to a recognition result is less than or equal to the preset confidence threshold, the result is considered possibly misrecognized; for such a target, its target category information can simply be ignored so as not to affect the semantic segmentation result.
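The semantic map construction described above can be sketched as follows; the class ids, color table and the 0.5 threshold are placeholders for illustration, and only the confidence-gated labeling reflects the text.

```python
# Sketch of building the semantic map: pixels whose recognition confidence
# exceeds the threshold are painted with the color of their target category;
# lower-confidence results are ignored.
import numpy as np

CLASS_COLORS = {
    0: (135, 206, 235),  # sky -> blue
    1: (139, 69, 19),    # ground -> brown
    2: (0, 128, 0),      # grass -> green
}

def build_semantic_map(class_id, confidence, threshold=0.5):
    """class_id, confidence: (H, W) arrays from the recognition step."""
    h, w = class_id.shape
    semantic_map = np.zeros((h, w, 3), dtype=np.uint8)
    for cls, color in CLASS_COLORS.items():
        keep = (class_id == cls) & (confidence > threshold)
        semantic_map[keep] = color   # label only confidently recognized targets
    return semantic_map
```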
Corresponding to the image processing method of the above embodiment, an embodiment of the present invention further provides an image processing apparatus, and referring to fig. 9, the image processing apparatus 100 includes: a first storage device 110 and one or more first processors 120.
The first storage device 110 is used for storing program instructions; the one or more first processors 120, invoking program instructions stored in the first storage 110, the one or more first processors 120, individually or collectively, being configured, when the program instructions are executed, to: acquiring a binocular image of a target scene; determining depth information of a target scene according to the binocular images; and obtaining semantic information and position information of each target in the target scene according to the depth information and the semantic segmentation map of a monocular image in the binocular image.
The first processor 120 may implement the image processing method according to the embodiments shown in fig. 1, fig. 2c, fig. 3, fig. 7 and fig. 8 of the present invention, and the image processing apparatus 100 of the present embodiment will be described with reference to the image processing method according to the above embodiments.
It should be noted that the image processing apparatus 100 of the present embodiment may be a device with image processing capability, such as a computer, or may also be a shooting apparatus with a camera function, such as a camera, a video camera, a smart phone, an intelligent terminal, a shooting stabilizer, an unmanned aerial vehicle, and so on.
Corresponding to the image processing method of the above embodiments, an embodiment of the present invention further provides a photographing apparatus 200, referring to fig. 10, including: a first image acquisition module 210, a second storage device 220, and one or more second processors 230.
The first image acquisition module 210 is configured to acquire a binocular image of a target scene; a second storage device 220 for storing program instructions; the one or more second processors 230, invoking program instructions stored in the second storage 220, the one or more second processors 230, individually or collectively, being configured, when the program instructions are executed, to: acquiring a binocular image of a target scene acquired by the first image acquisition module 210; determining depth information of a target scene according to the binocular images; and obtaining semantic information and position information of each target in the target scene according to the depth information and the semantic segmentation map of a monocular image in the binocular image.
Optionally, the first image capturing module 210 includes a lens and an imaging sensor, such as a CCD, a CMOS, or other image sensor, which is matched with the lens.
The second processor 230 may implement the image processing method of the embodiments shown in fig. 1, fig. 2c, fig. 3, fig. 7 and fig. 8 of the present invention, and the photographing apparatus 200 of this embodiment can be understood with reference to the image processing method of the above embodiments.
The photographing apparatus 200 may be a camera with a video function, a video camera, a smartphone, an intelligent terminal, a shooting stabilizer (such as a handheld gimbal), an unmanned aerial vehicle (such as a drone), and so on.
An embodiment of the present invention provides a movable platform, and referring to fig. 11, the movable platform 300 includes: a second image acquisition module 310, a third storage 320, and one or more third processors 330.
The second image acquisition module 310 is configured to acquire a binocular image of a target scene; third storage 320 for storing program instructions; the one or more third processors 330, invoking program instructions stored in the third storage 320, the one or more third processors 330, individually or collectively, being configured, when the program instructions are executed, to: acquiring a binocular image of a target scene acquired by the second image acquisition module 310; determining depth information of a target scene according to the binocular images; and obtaining semantic information and position information of each target in the target scene according to the depth information and the semantic segmentation map of a monocular image in the binocular image.
The second image capturing module 310 of this embodiment may be a camera, or may be a structure formed by combining a lens and an imaging sensor (such as a CCD, a CMOS, or the like) and having a shooting function.
The third processor 330 may implement the image processing method according to the embodiments shown in fig. 1, fig. 2c, fig. 3, fig. 7 and fig. 8 of the present invention, and the movable platform 300 of the present embodiment will be described with reference to the image processing method according to the above embodiments.
In a feasible implementation, the movable platform 300 is an unmanned aerial vehicle. It should be understood that the unmanned aerial vehicle here is an aerial-photography unmanned aerial vehicle; unmanned aerial vehicles without a camera function do not fall within the scope of this embodiment. The unmanned aerial vehicle can be a multi-rotor or a fixed-wing unmanned aerial vehicle, and its type is not particularly limited in the embodiments of the present invention. Further, the second image acquisition module 310 may be mounted on the fuselage (not shown) through a gimbal (not shown) and stabilized by the gimbal, where the gimbal may be a two-axis or a three-axis gimbal, which is not limited in the embodiments of the present invention.
The storage device may include a volatile memory, such as a random-access memory (RAM); the storage device may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the storage device may also comprise a combination of the above kinds of memory.
The processor may be a central processing unit (CPU). The processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Furthermore, an embodiment of the present invention also provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the image processing method of the above-described embodiment.
The computer-readable storage medium may be an internal storage unit of the device according to any of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a smart media card (SMC), an SD card or a flash card provided on the device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the device. The computer-readable storage medium is used to store the computer program and other programs and data required by the device, and may also be used to temporarily store data that has been or will be output.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure is intended to be illustrative of only some embodiments of the invention, and is not intended to limit the scope of the invention.

Claims (33)

1. An image processing method, characterized in that the method comprises:
acquiring a binocular image of a target scene;
determining the depth information of the target scene according to the binocular image;
and obtaining semantic information and position information of each target in the target scene according to the depth information and a semantic segmentation map of a monocular image in the binocular image.
2. The method of claim 1, wherein the depth information comprises: relative distance information of each target in the target scene under a preset coordinate system.
3. The method of claim 2, wherein the depth information comprises: distance information of each target in the target scene relative to a photographing device that photographs the target scene.
4. The method according to any one of claims 1 to 3, wherein the determining depth information of the target scene from the binocular images comprises:
inputting the image information of the binocular image into a first convolutional neural network trained in advance, and determining the depth information of the target scene.
5. The method of claim 4, wherein the first convolutional neural network comprises a plurality of sequentially connected first network elements for feature extraction of respective inputs;
the first network element includes at least one of a convolutional layer, a batch normalization layer, and a nonlinear activation layer.
6. The method of claim 1, wherein prior to determining the depth information of the target scene from the binocular images, further comprising:
preprocessing the binocular images so that the sizes of the two monocular images forming the binocular images are consistent.
7. The method of claim 1, wherein obtaining semantic information and position information of each object in the object scene from the depth information and a semantic segmentation map of a monocular image of the binocular images comprises:
performing semantic segmentation on one monocular image of the binocular images to obtain a semantic segmentation map of the monocular image;
and determining semantic information and position information of each target in the target scene according to the depth information and the semantic segmentation map.
8. The method of claim 7, wherein determining semantic information and position information for each object in the object scene from the depth information and the semantic segmentation map comprises:
according to the depth information and the initial semantic information of each target in the semantic segmentation map, distinguishing a plurality of targets which have the same or similar target types and are adjacent in position in the semantic segmentation map;
and obtaining semantic information and boundary information of the plurality of targets.
9. The method according to claim 1 or 7, wherein obtaining semantic information and position information of each object in the object scene according to the depth information and a semantic segmentation map of one of the binocular images comprises:
inputting the depth information and image information of one monocular image in the binocular images into a second convolutional neural network trained in advance to obtain semantic information and position information of each target in the target scene;
the second convolutional neural network is used for determining a semantic segmentation map of the monocular image according to a preset target classification rule and the image information of the monocular image; and obtaining semantic information and position information of each target in the target scene based on the depth information and the semantic segmentation map.
10. The method of claim 9, wherein the image training set used to train the second convolutional neural network comprises a plurality of image training sets of target classes, each image training set of a class comprising at least one image training set of a sub-class;
the target classification rule corresponds to the target class.
11. The method of claim 10, wherein the object classes include at least two of:
vehicle, sky, road, static obstacle and dynamic obstacle.
12. The method of claim 9, wherein the second convolutional neural network comprises a plurality of sequentially connected second network elements for target classification of respective inputs;
the second network layer includes at least one of a convolutional layer, a batch normalization layer, and a nonlinear activation layer.
13. The method of claim 1, wherein the semantic information comprises: the recognition result and the corresponding recognition confidence.
14. The method of claim 13, wherein after obtaining the semantic information and the position information of each object in the object scene, further comprising:
generating a semantic map of the target scene according to the recognition result, the corresponding recognition confidence and the position information.
15. The method of claim 14, wherein generating the semantic map of the target scene according to the recognition result, the corresponding recognition confidence and the position information comprises:
determining a target corresponding to the recognition result in the semantic segmentation map according to the recognition result and the position information;
and if the recognition confidence corresponding to the recognition result is greater than a preset confidence threshold, labeling, in the semantic segmentation map, the target corresponding to the recognition result with the preset label of the target category to which the recognition result belongs.
16. An image processing apparatus, characterized in that the apparatus comprises:
storage means for storing program instructions;
one or more processors that invoke program instructions stored in the storage device, the one or more processors individually or collectively configured when the program instructions are executed to:
acquiring a binocular image of a target scene;
determining the depth information of the target scene according to the binocular image;
and obtaining semantic information and position information of each target in the target scene according to the depth information and a semantic segmentation map of a monocular image in the binocular image.
17. The image processing apparatus according to claim 16, wherein the depth information includes: relative distance information of each target in the target scene under a preset coordinate system.
18. The image processing apparatus according to claim 17, wherein the depth information includes: distance information of each target in the target scene relative to a photographing device that photographs the target scene.
19. The image processing apparatus of any of claims 16 to 18, wherein the one or more processors are further configured, individually or collectively, to:
inputting the image information of the binocular image into a first convolutional neural network trained in advance, and determining the depth information of the target scene.
20. The apparatus according to claim 19, wherein the first convolutional neural network comprises a plurality of sequentially connected first network elements, the first network elements being configured to perform feature extraction on respective inputs;
the first network element includes at least one of a convolutional layer, a batch normalization layer, and a nonlinear activation layer.
21. The image processing apparatus of claim 16, wherein the one or more processors, prior to determining depth information of the target scene from the binocular images, are further configured, individually or collectively, to:
preprocessing the binocular images so that the sizes of the two monocular images forming the binocular images are consistent.
22. The image processing apparatus of claim 16, wherein the one or more processors are further configured, individually or collectively, to:
performing semantic segmentation on one monocular image of the binocular images to obtain a semantic segmentation map of the monocular image;
and determining semantic information and position information of each target in the target scene according to the depth information and the semantic segmentation map.
23. The image processing apparatus of claim 22, wherein the one or more processors are further configured, individually or collectively, to:
according to the depth information and the initial semantic information of each target in the semantic segmentation map, distinguishing a plurality of targets which have the same or similar target types and are adjacent in position in the semantic segmentation map;
and obtaining semantic information and boundary information of the plurality of targets.
24. The image processing apparatus of claim 16 or 22, wherein the one or more processors are further configured, individually or collectively, to:
inputting the depth information and image information of one monocular image in the binocular images into a second convolutional neural network trained in advance to obtain semantic information and position information of each target in the target scene;
the second convolutional neural network is used for determining a semantic segmentation map of the monocular image according to a preset target classification rule and the image information of the monocular image; and obtaining semantic information and position information of each target in the target scene based on the depth information and the semantic segmentation map.
25. The image processing apparatus of claim 24, wherein the image training set used to train the second convolutional neural network comprises a plurality of image training sets of target classes, each image training set of a class comprising at least one image training set of a sub-class;
the target classification rule corresponds to the target class.
26. The image processing apparatus according to claim 25, wherein the object classes include at least two of:
vehicle, sky, road, static obstacle and dynamic obstacle.
27. The image processing apparatus of claim 24, wherein the second convolutional neural network comprises a plurality of sequentially connected second network elements for object classification of respective inputs;
the second network layer includes at least one of a convolutional layer, a batch normalization layer, and a nonlinear activation layer.
28. The image processing apparatus according to claim 16, wherein the semantic information includes: the recognition result and the corresponding recognition confidence.
29. The image processing apparatus of claim 28, wherein the one or more processors, after obtaining semantic information and location information for each object in the object scene, are further configured, individually or collectively, to:
generating a semantic map of the target scene according to the recognition result, the corresponding recognition confidence and the position information.
30. The image processing apparatus of claim 29, wherein the one or more processors are further configured, individually or collectively, to:
determining a target corresponding to the recognition result in the semantic segmentation map according to the recognition result and the position information;
and if the recognition confidence corresponding to the recognition result is greater than a preset confidence threshold, labeling, in the semantic segmentation map, the target corresponding to the recognition result with the preset label of the target category to which the recognition result belongs.
31. A photographing apparatus, characterized by comprising:
the image acquisition module is used for acquiring a binocular image of a target scene;
storage means for storing program instructions;
one or more processors that invoke program instructions stored in the storage device, the one or more processors individually or collectively configured to implement the method of any of claims 1-15 when the program instructions are executed.
32. A movable platform, comprising:
the image acquisition module is used for acquiring a binocular image of a target scene;
storage means for storing program instructions;
one or more processors that invoke program instructions stored in the storage device, the one or more processors individually or collectively configured to implement the method of any of claims 1-15 when the program instructions are executed.
33. The movable platform of claim 32, wherein the movable platform is at least one of an unmanned aerial vehicle and a vehicle.
CN201980011444.6A 2019-06-28 2019-06-28 Image processing method and device, shooting device and movable platform Pending CN111837158A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/093835 WO2020258286A1 (en) 2019-06-28 2019-06-28 Image processing method and device, photographing device and movable platform

Publications (1)

Publication Number Publication Date
CN111837158A true CN111837158A (en) 2020-10-27

Family

ID=72912596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980011444.6A Pending CN111837158A (en) 2019-06-28 2019-06-28 Image processing method and device, shooting device and movable platform

Country Status (2)

Country Link
CN (1) CN111837158A (en)
WO (1) WO2020258286A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784693A (en) * 2020-12-31 2021-05-11 珠海金山网络游戏科技有限公司 Image processing method and device
CN114049444A (en) * 2022-01-13 2022-02-15 深圳市其域创新科技有限公司 3D scene generation method and device
WO2024005707A1 (en) * 2022-07-01 2024-01-04 Grabtaxi Holdings Pte. Ltd. Method, device and system for detecting dynamic occlusion

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112950699A (en) * 2021-03-30 2021-06-11 深圳市商汤科技有限公司 Depth measurement method, depth measurement device, electronic device and storage medium
CN112967283B (en) * 2021-04-22 2023-08-18 上海西井科技股份有限公司 Target identification method, system, equipment and storage medium based on binocular camera
CN113570631B (en) * 2021-08-28 2024-04-26 西安安森智能仪器股份有限公司 Image-based pointer instrument intelligent identification method and device
CN113762267B (en) * 2021-09-02 2024-03-12 北京易航远智科技有限公司 Semantic association-based multi-scale binocular stereo matching method and device
CN115661668B (en) * 2022-12-13 2023-03-31 山东大学 Method, device, medium and equipment for identifying flowers to be pollinated of pepper flowers

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105554674A (en) * 2015-12-28 2016-05-04 努比亚技术有限公司 Microphone calibration method, device and mobile terminal
CN106778614A (en) * 2016-12-16 2017-05-31 中新智擎有限公司 A kind of human body recognition method and device
EP3340619A1 (en) * 2016-12-22 2018-06-27 Thomson Licensing Geometric warping of a stereograph by positional constraints
CN108229478A (en) * 2017-06-30 2018-06-29 深圳市商汤科技有限公司 Image, semantic segmentation and training method and device, electronic equipment, storage medium and program
CN108711144A (en) * 2018-05-16 2018-10-26 上海白泽网络科技有限公司 Augmented reality method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101742349B (en) * 2010-01-05 2011-07-20 浙江大学 Method for expressing three-dimensional scenes and television system thereof
KR102472767B1 (en) * 2017-09-14 2022-12-01 삼성전자주식회사 Method and apparatus of calculating depth map based on reliability
US10678256B2 (en) * 2017-09-28 2020-06-09 Nec Corporation Generating occlusion-aware bird eye view representations of complex road scenes
CN108734713A (en) * 2018-05-18 2018-11-02 大连理工大学 A kind of traffic image semantic segmentation method based on multi-characteristic
CN109002837A (en) * 2018-06-21 2018-12-14 网易(杭州)网络有限公司 A kind of image application processing method, medium, device and calculate equipment
CN108986136B (en) * 2018-07-23 2020-07-24 南昌航空大学 Binocular scene flow determination method and system based on semantic segmentation
CN109490926B (en) * 2018-09-28 2021-01-26 浙江大学 Path planning method based on binocular camera and GNSS

Also Published As

Publication number Publication date
WO2020258286A1 (en) 2020-12-30

Similar Documents

Publication Publication Date Title
CN111837158A (en) Image processing method and device, shooting device and movable platform
CN110869974B (en) Point cloud processing method, equipment and storage medium
CN109741241B (en) Fisheye image processing method, device, equipment and storage medium
WO2020258297A1 (en) Image semantic segmentation method, movable platform, and storage medium
CN112967283A (en) Target identification method, system, equipment and storage medium based on binocular camera
CN111950426A (en) Target detection method and device and delivery vehicle
CN111243003B (en) Vehicle-mounted binocular camera and method and device for detecting road height limiting rod
CN111928842B (en) Monocular vision based SLAM positioning method and related device
CN111928857B (en) Method and related device for realizing SLAM positioning in dynamic environment
WO2023016082A1 (en) Three-dimensional reconstruction method and apparatus, and electronic device and storage medium
CN113673584A (en) Image detection method and related device
CN115661522A (en) Vehicle guiding method, system, equipment and medium based on visual semantic vector
CN114170565A (en) Image comparison method and device based on unmanned aerial vehicle aerial photography and terminal equipment
CN112598743B (en) Pose estimation method and related device for monocular vision image
EP4287137A1 (en) Method, device, equipment, storage media and system for detecting drivable space of road
CN110827340B (en) Map updating method, device and storage medium
CN116363628A (en) Mark detection method and device, nonvolatile storage medium and computer equipment
CN115565155A (en) Training method of neural network model, generation method of vehicle view and vehicle
CN116051736A (en) Three-dimensional reconstruction method, device, edge equipment and storage medium
CN115790568A (en) Map generation method based on semantic information and related equipment
CN115346184A (en) Lane information detection method, terminal and computer storage medium
CN115249345A (en) Traffic jam detection method based on oblique photography three-dimensional live-action map
US20240203134A1 (en) Lane line detection method, vehicle-mounted device and storage medium
JP2019197359A (en) General object recognizing system
CN115063772B (en) Method for detecting vehicles after formation of vehicles, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240515

Address after: Building 3, Xunmei Science and Technology Plaza, No. 8 Keyuan Road, Science and Technology Park Community, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province, 518057, 1634

Applicant after: Shenzhen Zhuoyu Technology Co.,Ltd.

Country or region after: China

Address before: 518057 Shenzhen Nanshan High-tech Zone, Shenzhen, Guangdong Province, 6/F, Shenzhen Industry, Education and Research Building, Hong Kong University of Science and Technology, No. 9 Yuexingdao, South District, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: SZ DJI TECHNOLOGY Co.,Ltd.

Country or region before: China