CN113468969B - Aliased electronic component space expression method based on improved monocular depth estimation - Google Patents

Aliased electronic component space expression method based on improved monocular depth estimation

Info

Publication number
CN113468969B
CN113468969B (application CN202110618580.0A)
Authority
CN
China
Prior art keywords
module
electronic component
rgb
depth
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110618580.0A
Other languages
Chinese (zh)
Other versions
CN113468969A (en)
Inventor
顾寄南
雷文桐
张可
高伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202110618580.0A
Publication of CN113468969A
Application granted
Publication of CN113468969B
Active legal status
Anticipated expiration legal status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an aliased electronic component space expression method based on improved monocular depth estimation, relating to the field of machine vision. The system comprises an image acquisition module, a target detection network module, a semantic segmentation network module, and an HSV (hue, saturation, value) and RGB (red, green, blue) module. The image acquisition module acquires RGB images of different kinds of aliased electronic components in a bin; the target detection network module processes the acquired RGB image to obtain a depth image A; the semantic segmentation network module segments the depth image A to obtain rough depth information; and the HSV and RGB module refines the rough depth information to obtain the detailed depth information of each electronic component. The invention can effectively solve the problem of autonomous identification in complex working scenes where electronic components are aliased with one another.

Description

Aliased electronic component space expression method based on improved monocular depth estimation
Technical Field
The invention relates to the field of machine vision, and in particular to an aliased electronic component space expression method based on improved monocular depth estimation.
Background
Autonomous identification of electronic components is the basis of visual control for intelligent assembly robots, and complex-scene understanding is the fundamental support for such identification. Whether electronic components can be identified accurately and autonomously directly determines the accuracy and efficiency of robotic assembly. In practical production, assisting the manipulator with machine vision to assemble electronic components alleviates the problems of low production efficiency, high labor input and heavy worker workload, and fundamentally enables the transformation from traditional line production to intelligent production.
Existing machine-vision-based identification methods mainly handle electronic components that are scattered and evenly distributed; they still do not solve autonomous identification in complex working scenes where electronic components are aliased with one another and with the background.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an aliased electronic component space expression method based on improved monocular depth estimation, which can effectively solve the problem of autonomous identification in complex working scenes where electronic components are aliased with one another.
The present invention achieves the above technical object by the following means.
An aliased electronic component space expression method based on improved monocular depth estimation comprises an image acquisition module, a target detection network module, a semantic segmentation network module, and an HSV (hue, saturation, value) and RGB (red, green, blue) module;
the image acquisition module is used for acquiring RGB images of different kinds of aliasing electronic components in the feed box;
the target detection network module is used for processing the RGB image acquired by the image acquisition module to obtain a depth image A;
The semantic segmentation network module is used for segmenting the depth image A processed by the target detection network module to obtain rough depth information;
and the HSV and RGB module refines the rough depth information segmented by the semantic segmentation network module to obtain the detailed depth information of each electronic component.
Further, the target detection network module comprises an input image module, a data enhancement module, a feature extraction network module, a feature fusion module, a downsampling module, a fully connected layer module, a classifier and a prediction output module; specifically, the acquired RGB image passes sequentially through data enhancement, feature extraction, feature fusion, downsampling, the fully connected layer, the classifier and prediction output.
Further, the data enhancement module applies random scaling twice to the RGB image to obtain two images a and b, and random cropping twice to obtain two images c and d;
The feature extraction network module comprises a lightweight network and a deep convolution network, wherein a lightweight network algorithm is utilized to extract features of images a and c, and a deep convolution network algorithm is utilized to extract features of images b and d;
The feature fusion module performs three-time layering feature fusion: fusing the shallow features and the deep features of the graphs a and b to obtain a feature graph x, fusing the shallow features and the deep features of the graphs c and d to obtain a feature graph y, and fusing the shallow features and the deep features of the graphs x and y to obtain a feature graph z; and the feature map z is predicted and output by a prediction output module after passing through a downsampling module, a full-connection layer module and a classifier.
Further, the prediction output module predicts and outputs the depth image A, the position information of the electronic components and the category and probability distribution of the electronic components.
Further, the depth image a is an RGB color image.
Further, the HSV and RGB module comprises an HSV color model, an HSV cone model, an RGB three-dimensional coordinate model and an RGB value classifier;
firstly, the depth image A is segmented by the semantic segmentation network module to obtain rough depth information, which is input into the HSV color model to output the values of the three attributes H, S and V; secondly, the HSV cone model visualizes the H, S, V values on a color cone, and the cone model is converted into the RGB three-dimensional coordinate model to obtain the R, G, B values of the depth map; finally, the three ranges of R, G, B values are refined with the RGB value classifier.
Further, the HSV color model determines a color by the three attributes H, S and V, namely hue, saturation and value (brightness). Hue H is measured as an angle in the range 0°-360°, counted counterclockwise from red: red is 0°, green is 120° and blue is 240°. Saturation S indicates how close the color is to a pure spectral color, usually in the range 0-100%; the larger the value, the more saturated the color. Value V indicates the brightness of the color; for a light-source color it is related to the luminance of the illuminant, and for an object color it is related to the transmittance or reflectance of the object.
Further, the device also comprises a manipulator control module, wherein the manipulator control module realizes positioning, grabbing and assembling according to the position information of the electronic components, the category and probability distribution of the electronic components and the RGB depth information.
Compared with the prior art, the technical scheme of the invention has at least the following benefits:
1. The invention combines the light-weight network and the deep convolution network, thereby ensuring the comprehensiveness of image characteristics and detailed information, improving the speed of model prediction and realizing real-time target detection on mobile equipment and embedded equipment.
2. According to the invention, the three-time hierarchical feature fusion is carried out on the extracted image features, so that the low-level detail features and the high-level semantic features are fused, and the detection performance of the network is greatly improved.
3. Compared with a common target detection algorithm, the method increases the output of the depth image, divides the aliased electronic components in the depth direction, realizes the spatial expression of the aliased electronic components, and solves the problem that a computer is difficult to understand the aliased electronic components.
Drawings
FIG. 1 is a schematic flow diagram of an aliased electronic component spatial representation based on improved monocular depth estimation, according to an embodiment of the present invention;
FIG. 2 is a flow chart of the object detection module of FIG. 1 according to the present invention;
FIG. 3 is a schematic diagram of a specific workflow of the object detection module of FIG. 2 according to the present invention;
Fig. 4 is a schematic flow chart of the HSV, RGB module of fig. 1 according to the present invention.
Reference numerals:
1-an image acquisition module; 2-a target detection network module; 3-a semantic segmentation network module; 4-HSV, RGB module; 5-a manipulator control module; 6-an input image module; 7-a data enhancement module; 8-a feature extraction network module; 9-a feature fusion module; 10-a downsampling module; 11-a full connection layer module; 12-a classifier; 13-a prediction output module; 14-HSV color model; 15-HSV cone model; a 16-RGB three-dimensional coordinate model; 17-RGB value classifier.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
A camera captures images of the different kinds of aliased electronic components in the bin to obtain an RGB image of the components. Each picture is randomly scaled twice and randomly cropped twice; features of the processed images are extracted with a lightweight network and a deep convolutional network from the target detection stage; the extracted shallow and deep features are fused; and after downsampling, a fully connected layer and a classifier, the depth image, the electronic component position information, and the class prediction and probability score of each component are obtained. The semantic segmentation module, using color as the criterion, segments the depth image to provide pixel-level image understanding; by setting parameters and hyperparameters and training the network, the depth-map colors and the distance between component and lens are constrained to ranges more accurately. The segmented depth image is then combined with the HSV and RGB methods to refine these ranges and obtain the depth D of each electronic component. Combining the position information (x, y, w, h) of each component with its depth D, the complete three-dimensional position of each component is expressed by five parameters (x, y, w, h, D). Finally, the camera coordinate system is converted into the manipulator coordinate system, providing the manipulator end effector with accurate position and depth information of the aliased components, which facilitates high-precision positioning, grabbing and assembly.
The image acquisition module comprises a monocular color high-resolution CCD camera, an electronic component placement platform, a telescopic bracket and a light source. The CCD camera has 380,000 pixels, a color resolution of 480 lines and a black-and-white resolution of 600 lines. The camera is mounted on a 15 cm telescopic bracket above the experiment platform, with a distance of 10 cm between the lens and the platform surface.
The image acquisition objects of the invention are electronic components, specifically aliased electronic components of different kinds. The components include resistors, capacitors and inductors, with cylindrical, square, tubular, coil and other shapes, and their number is controlled at 15-25. The components are placed in a bin on the experiment platform; the bin is 10 cm long, 10 cm wide and 5 cm high.
According to the invention, the acquired images are randomly scaled twice (random resize) and randomly cropped twice (random crop) to enhance the data, which improves model accuracy and stability.
The feature extraction used in the invention comprises two kinds of networks: a deep convolutional network and a lightweight network. The deep convolutional network better extracts image features and detail information, including the colors, shapes, sizes, edge features and corner features of the components. The lightweight network reduces the number of network parameters without losing performance, eases model storage, speeds up model prediction, and enables real-time target detection on mobile and embedded devices.
The invention performs three stages of hierarchical feature fusion on the extracted features. Low-level features have higher resolution and contain more positional and detail information, while high-level features have lower resolution but richer semantic information. Fusing low-level detail features with high-level semantic features through the three fusion stages improves the detection performance of the network.
The target detection algorithm has three outputs (the depth image, the electronic component position information, and the category with its probability score). Compared with a common target detection algorithm, the depth image output is added, so the aliased components are separated in the depth direction, realizing their spatial expression and effectively solving the problem that a computer has difficulty understanding aliased components.
The method performs semantic segmentation of the obtained depth map with color as the criterion, roughly dividing the electronic components into upper-layer, middle-layer and bottom-layer components; the segmented depth image is then combined with the HSV and RGB methods and refined again to obtain the depth of each component, with the precision controlled to 0.1 mm.
According to the invention, 5 parameters (x, y, w, h and D) are used for expressing the complete three-dimensional position information of each electronic component, so that the spatial expression of the aliased electronic components is realized, the accurate position information and depth information of the aliased electronic components are provided for the manipulator end effector, and the high-precision positioning, grabbing and assembling of the follow-up intelligent assembling robot are facilitated.
Specifically, in the aliased electronic component space expression method based on improved monocular depth estimation, an industrial CCD camera captures images of the different kinds of aliased electronic components in the bin to obtain an RGB image; the images are enhanced by random scaling and random cropping; features are extracted with the lightweight network and the deep convolutional network of the target detection network; shallow and deep features are fused by the feature fusion stage of the target detection network; downsampling, a fully connected layer and a classifier yield the depth image A, the electronic component position information B, the class prediction C and the probability score P; the semantic segmentation network module segments the depth map to obtain rough depth information; the HSV and RGB module refines the rough depth information after semantic segmentation to obtain the detailed depth information D of each component; and the position information (x, y, w, h) of each component is combined with its detailed depth D so that the complete three-dimensional position of each component is expressed by five parameters (x, y, w, h, D), facilitating high-precision positioning, grabbing and assembly by the manipulator.
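For illustration only, the following minimal Python sketch shows how the five-parameter spatial expression might be held in code; the class name ComponentExpression, the field names and the example numbers are assumptions, not part of the patent.

```python
from dataclasses import dataclass

@dataclass
class ComponentExpression:
    """Five-parameter spatial expression (x, y, w, h, D) of one component."""
    x: float      # center x of the minimum enclosing rectangle (pixels)
    y: float      # center y of the minimum enclosing rectangle (pixels)
    w: float      # rectangle width (pixels)
    h: float      # rectangle height (pixels)
    depth: float  # refined distance from the camera D (cm)

    def as_tuple(self):
        return (self.x, self.y, self.w, self.h, self.depth)

# Example: a component detected at (x, y, w, h) with a refined depth of 6.3 cm
resistor = ComponentExpression(x=412.0, y=285.0, w=36.0, h=12.0, depth=6.3)
print(resistor.as_tuple())   # (412.0, 285.0, 36.0, 12.0, 6.3)
```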
The feature extraction is carried out by the lightweight network and the deep convolution network, and the deep convolution network can better extract image features and detail information, including the color, shape, size, edge features and corner features of electronic components; the lightweight network can reduce network parameters, does not lose network performance, solves the problem of model storage, can also improve the speed of model prediction, and realizes real-time target detection on mobile equipment and embedded equipment.
Feature fusion: three times of layered feature fusion are carried out on the extracted image features, the resolution of the low-level features is higher, and more position and detail information are contained; the resolution of the high-level features is low, and the semantic information is higher. And the low-level detail features and the high-level semantic features are fused through three-time layered feature fusion, so that the detection performance of the network is improved.
Compared with a general target detection algorithm, the output obtained after downsampling, the fully connected layer and the classifier additionally includes the depth image A; the aliased electronic components are separated in the depth direction, which solves the problem that a computer has difficulty understanding aliased components.
The semantic segmentation module is used for carrying out image semantic segmentation on the obtained depth map by taking the color as a standard, and roughly dividing the electronic components into an upper-layer electronic component, a middle-layer electronic component and a bottom-layer electronic component;
The HSV and RGB module inputs rough depth information obtained by the depth image A through a semantic segmentation network into an HSV color model and outputs H, S, V values of three attributes; visualizing the values of the H, S, V three attributes onto a color cone model using an HSV cone model; converting the HSV conical model into an RGB three-dimensional coordinate model to obtain R, G, B values of a depth map; and refining three ranges in the semantic segmentation network by using an RGB classifier, and controlling the distance precision to be 0.1mm, thereby obtaining the detailed depth information depth D (namely the distance from the camera) of each electronic component.
For manipulator positioning, grabbing and assembly, the complete three-dimensional position of each electronic component is expressed by the five parameters (x, y, w, h, D); the camera coordinate system is converted into the manipulator coordinate system, providing the manipulator end effector with accurate position and depth information of the aliased components, which enables high-precision positioning, grabbing and assembly.
Referring to fig. 1, the aliased electronic component space expression method based on improved monocular depth estimation uses an image acquisition module 1, a target detection network module 2, a semantic segmentation network module 3, an HSV and RGB module 4, and a manipulator control module 5.
The image acquisition module 1 acquires RGB images of the different kinds of aliased electronic components in the bin. The target detection network 2 applies data enhancement, feature extraction, feature fusion, downsampling, a fully connected layer and a classifier to the acquired RGB images to obtain the depth image, the electronic component position information, and the class prediction and probability score of each component. The semantic segmentation network module 3 segments the depth image using color as the criterion and, through parameter setting and network training, roughly constrains the depth-map colors and the distance between component and lens to ranges. The HSV and RGB module 4 refines these ranges to obtain the depth D of each component and combines the position information (x, y, w, h) of each component with its depth D, so that the complete three-dimensional position of each component is expressed by five parameters (x, y, w, h, D). The manipulator control module 5 converts the camera coordinate system into the manipulator coordinate system and provides the manipulator end effector with accurate position and depth information of the aliased components, achieving high-precision positioning, grabbing and assembly.
In a specific implementation, the monocular camera of the invention is a color high-resolution CCD camera with 380,000 pixels, a color resolution of 480 lines and a black-and-white resolution of 600 lines. The camera uses a fixed shooting position: it is mounted on a 15 cm telescopic bracket above the experiment platform, the distance between the lens and the platform surface is 10 cm, and the camera is locked once in position so that it cannot move or slide during the experiment.
In a specific implementation, the image acquisition objects of the invention are aliased electronic components of different kinds. The components include resistors, capacitors and inductors, with cylindrical, square, tubular, coil and other shapes. The components are placed in a bin on the experiment platform; the bin is 10 cm long, 10 cm wide and 5 cm high and is fixed on the platform by a chute so that it cannot move or shake during the experiment. The number of components is controlled at 15-25 and can be adjusted according to component size; the components must overlap and occlude one another and must not rise above the top plane of the bin.
In a specific implementation, the semantic segmentation module adopts PSPNet. PSPNet extracts abstract features through the residual network ResNet; context information is aggregated by a pyramid pooling module with 4 pyramid levels, yielding information at 4 different scales; the channel count of each of the 4 level feature maps is reduced to 512 by convolution (conv/BN/ReLU); and each feature map is upsampled by bilinear interpolation to restore its spatial size to that of the pyramid pooling module input, i.e., the output of each level is restored to (60, 60, 512).
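The pyramid pooling step described above can be sketched in PyTorch roughly as follows; this assumes a ResNet feature map of size (N, 2048, 60, 60), and the class name PyramidPooling and the bin sizes (1, 2, 3, 6) follow the common PSPNet configuration rather than anything stated in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """4-level pyramid pooling: pool, 1x1 conv to 512 channels, bilinear upsample."""
    def __init__(self, in_channels=2048, out_channels=512, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                        # context at one scale
                nn.Conv2d(in_channels, out_channels, 1, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        # Restore every pooled map to the input spatial size, e.g. (60, 60, 512)
        pyramids = [F.interpolate(stage(x), size=(h, w),
                                  mode="bilinear", align_corners=False)
                    for stage in self.stages]
        return torch.cat([x] + pyramids, dim=1)   # concatenate with the ResNet features

feats = torch.randn(1, 2048, 60, 60)              # stand-in for the ResNet output
print(PyramidPooling()(feats).shape)              # torch.Size([1, 4096, 60, 60])
```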
In a specific implementation, semantic segmentation divides the colors of the depth map into 6 ranges, from shallow to deep: "blue" (0,0,119)-(0,0,255), "cyan" (0,0,255)-(0,119,119), "green" (0,119,119)-(119,255,0), "yellow" (255,199,0)-(199,255,0), "orange" (255, 0)-(255,119,0), "red" (255,119,0)-(119,0,0). The segmentation result is divided into three parts: closer to the camera (displayed blue-cyan), 5-7 cm from the camera; medium distance (displayed green-yellow), 7-9 cm from the camera; and farther from the camera (displayed orange-red), 9-10 cm from the camera. The three ranges correspond, respectively, to the upper-layer components (no occlusion), the middle-layer components (partial occlusion, less than one half occluded / one layer of occlusion), and the bottom-layer components (partial occlusion, more than one half occluded / multiple layers of occlusion).
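As a rough illustration of mapping the segmented depth-map colors to the three layers, the sketch below classifies a pixel by its hue band (blue/cyan as near, green/yellow as medium, orange/red as far); the function name coarse_layer and the hue thresholds are simplifying assumptions, not values taken from the patent.

```python
import colorsys

def coarse_layer(r, g, b):
    """Map an RGB depth-map pixel (0-255 per channel) to a coarse layer and distance band."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    hue_deg = h * 360.0
    if hue_deg >= 180.0:                      # blue/cyan: closest to the camera
        return "upper layer (no occlusion)", "5-7 cm"
    if hue_deg >= 60.0:                       # green/yellow: medium distance
        return "middle layer (partly occluded)", "7-9 cm"
    return "bottom layer (heavily occluded)", "9-10 cm"   # orange/red: farthest

print(coarse_layer(0, 80, 255))    # a bluish pixel  -> upper layer, 5-7 cm
print(coarse_layer(255, 120, 0))   # an orange pixel -> bottom layer, 9-10 cm
```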
In a specific implementation, both target detection and semantic segmentation use transfer learning: pretrained weights are loaded and then all parameters are trained; the shallow-layer parameters learned by an existing network are transferred to the new network, so the new network also has the ability to recognize low-level generic features.
In a specific implementation, the transformation between the camera coordinate system and the manipulator coordinate system is as follows. Let OXY be the manipulator coordinate system and O′X′Y′ the camera coordinate system, and let θ be the angle between the two coordinate systems; the coordinate transformation is then:
x = x′ × r × cos(θ) − y′ × r × sin(θ) + x0    (1)
y = x′ × r × sin(θ) + y′ × r × cos(θ) + y0    (2)
where r is the millimeter-to-pixel ratio (mm/pixel, the physical length of one pixel in millimeters), θ is the angle between the two coordinate systems, and (x0, y0) is the offset from the image coordinate origin to the manipulator coordinate origin.
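A minimal Python sketch of equations (1)-(2) follows; the function name camera_to_robot and the numeric calibration values are placeholders for whatever the actual hand-eye calibration yields.

```python
import math

def camera_to_robot(x_cam, y_cam, r, theta_deg, x0, y0):
    """Convert camera-frame coordinates to manipulator-frame coordinates using
    x = x'·r·cosθ − y'·r·sinθ + x0 and y = x'·r·sinθ + y'·r·cosθ + y0."""
    theta = math.radians(theta_deg)
    x = x_cam * r * math.cos(theta) - y_cam * r * math.sin(theta) + x0
    y = x_cam * r * math.sin(theta) + y_cam * r * math.cos(theta) + y0
    return x, y

# Placeholder calibration: 0.2 mm per pixel, 30° between frames, 50 mm offsets
print(camera_to_robot(412.0, 285.0, r=0.2, theta_deg=30.0, x0=50.0, y0=50.0))
```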
Referring to fig. 2 and 3, the object detection network module includes an input image module 6, a data enhancement module 7, a feature extraction network module 8, a feature fusion module 9, a downsampling module 10, a full connection layer 11, a classifier 12, and a prediction output module 13.
The image input by the input image module 6 is the RGB image of the different kinds of aliased electronic components in the bin acquired by the image acquisition module 1. The data enhancement module 7 randomly scales the preprocessed picture twice to obtain two images a and b, and randomly crops it twice to obtain two images c and d. The feature extraction network 8 comprises a lightweight network (MobileNet) and a deep convolutional network (DenseNet): the lightweight network extracts features from images a and c, the deep convolutional network extracts features from images b and d, and the image sizes after feature extraction are unified to 900 × 900. The feature fusion module 9 performs three stages of hierarchical fusion: the shallow and deep features of images a and b are fused into feature map x, the shallow and deep features of images c and d are fused into feature map y, and the shallow and deep features of x and y are fused into feature map z. Feature map z then passes through the downsampling module 10, the fully connected layer 11 and the classifier 12 (softmax) to produce the prediction output 13, which comprises the depth image A, the electronic component position information B, the electronic component category C and the probability score P.
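The dual-branch extraction and three-stage fusion can be outlined as in the PyTorch sketch below; the real MobileNet and DenseNet backbones are replaced by tiny placeholder convolutions to keep the example self-contained, and the fusion operator (resize plus element-wise addition) is an assumption, since the patent does not specify how the features are combined.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def tiny_backbone(out_channels):
    """Placeholder for the MobileNet / DenseNet feature extractors."""
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),             # shallow
        nn.Conv2d(32, out_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # deep
    )

class DualBranchFusion(nn.Module):
    """Two branches (lightweight / deep) whose outputs are fused hierarchically three times."""
    def __init__(self, channels=64):
        super().__init__()
        self.light = tiny_backbone(channels)   # stands in for MobileNet on images a, c
        self.deep = tiny_backbone(channels)    # stands in for DenseNet on images b, d

    @staticmethod
    def fuse(p, q):
        # Assumed fusion operator: resize q to p's spatial size and add element-wise
        q = F.interpolate(q, size=p.shape[2:], mode="bilinear", align_corners=False)
        return p + q

    def forward(self, a, b, c, d):
        fx = self.fuse(self.light(a), self.deep(b))   # feature map x
        fy = self.fuse(self.light(c), self.deep(d))   # feature map y
        return self.fuse(fx, fy)                      # feature map z

imgs = [torch.randn(1, 3, 900, 900) for _ in range(4)]   # stand-ins for a, b, c, d
z = DualBranchFusion()(*imgs)
print(z.shape)   # torch.Size([1, 64, 225, 225])
```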
In particular, the depth image A is an RGB color image. In a depth map, the part closer to the camera is shown as blue to cyan (from near to far), the part at medium distance as cyan to green to yellow (from near to far), and the part farther from the camera as yellow to orange to red (from near to far).
In a specific implementation, the position information B of the electronic components is represented by coordinate values: taking one image corner as the origin (0, 0), the minimum enclosing rectangle is drawn around each component; the center of the rectangle is (x, y), its width is w and its height is h. The position of each component is expressed by the four parameters (x, y, w, h).
In a specific implementation, the electronic component category C and the probability score P are displayed, as text and as a decimal number respectively, at the upper-right corner of each component's minimum enclosing rectangle in the image. There are C+1 categories (including the background), namely the four classes "resistor", "capacitor", "inductor" and "background". The probability score P is the probability that the object enclosed by the minimum rectangle belongs to that category; P lies between 0 and 1 with a precision of 0.01.
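The category and probability score can be read from the classifier output with a softmax, as in this minimal sketch; the logits are made-up numbers used only to show the rounding to a precision of 0.01.

```python
import torch
import torch.nn.functional as F

classes = ["resistor", "capacitor", "inductor", "background"]   # C + 1 = 4 categories
logits = torch.tensor([2.3, 0.4, -1.1, 0.2])                    # made-up classifier output

probs = F.softmax(logits, dim=0)                # probabilities summing to 1
idx = int(torch.argmax(probs))                  # predicted category index
print(classes[idx], round(float(probs[idx]), 2))   # resistor 0.77
```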
Referring to fig. 4, the HSV and RGB modules include an HSV color model 14, an HSV cone model 15, an RGB three-dimensional coordinate model 16, and an RGB value classifier 17; coarse depth information obtained by the depth image A through the semantic segmentation network module 3 is input into the HSV color model 14, and values of three attributes are output H, S, V; HSV cone model 15 visualizes the values of H, S, V three attributes onto a color cone model; converting the HSV cone model 15 into an RGB three-dimensional coordinate model 16 to obtain R, G, B values of a depth map; the three ranges of R, G, B values are refined by the RGB classifier 17, and the distance accuracy is controlled to be 0.1mm, so that the depth D (i.e., the distance from the camera) of each electronic component is obtained.
In particular, the HSV color model 14 determines a color by the three attributes H, S and V, namely hue, saturation and value (brightness). Hue H is measured as an angle in the range 0°-360°, counted counterclockwise from red: red is 0°, green is 120° and blue is 240°; the complementary colors are yellow at 60°, cyan at 180° and magenta at 300°. Saturation S indicates how close the color is to a pure spectral color, usually in the range 0-100%; the larger the value, the more saturated the color. Value V indicates the brightness of the color; for a light-source color it is related to the luminance of the illuminant, and for an object color it is related to the transmittance or reflectance of the object. Its value normally ranges from 0% (black) to 100% (white).
In practice, the HSV cone model visualizes the values of the three attributes H, S and V on an inverted color cone. At the apex (origin) of the cone, V = 0 and H and S are undefined, representing black; at the center of the top surface, S = 0, V = 1 and H is undefined, representing white. The V axis of the HSV model corresponds to the main diagonal of the RGB color space.
In a specific implementation, the X, Y, Z axes of the RGB three-dimensional coordinate model correspond to the R, G, B channels respectively, each with a value range of 0 to 255. When 0° ≤ H < 360°, 0 ≤ S ≤ 1 and 0 ≤ V ≤ 1, HSV is converted to RGB as follows:
C = V × S    (3)
X = C × (1 − |(H/60°) mod 2 − 1|)    (4)
m = V − C    (5)
(R′, G′, B′) = (C, X, 0) for 0° ≤ H < 60°; (X, C, 0) for 60° ≤ H < 120°; (0, C, X) for 120° ≤ H < 180°; (0, X, C) for 180° ≤ H < 240°; (X, 0, C) for 240° ≤ H < 300°; (C, 0, X) for 300° ≤ H < 360°    (6)
(R, G, B) = ((R′+m)×255, (G′+m)×255, (B′+m)×255)    (7)
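A minimal Python implementation of equations (3)-(7) is sketched below; the function name hsv_to_rgb is illustrative, and the result is cross-checked against Python's standard colorsys module.

```python
import colorsys

def hsv_to_rgb(h, s, v):
    """Convert H in degrees [0, 360), S and V in [0, 1] to R, G, B in [0, 255]."""
    c = v * s                                   # eq. (3)
    x = c * (1 - abs((h / 60.0) % 2 - 1))       # eq. (4)
    m = v - c                                   # eq. (5)
    sector = int(h // 60) % 6                   # eq. (6): pick (R', G', B') by hue sector
    r1, g1, b1 = [(c, x, 0), (x, c, 0), (0, c, x),
                  (0, x, c), (x, 0, c), (c, 0, x)][sector]
    return ((r1 + m) * 255, (g1 + m) * 255, (b1 + m) * 255)   # eq. (7)

print(hsv_to_rgb(240, 1.0, 1.0))                                                 # pure blue -> (0.0, 0.0, 255.0)
print(tuple(round(c * 255) for c in colorsys.hsv_to_rgb(240 / 360, 1.0, 1.0)))   # (0, 0, 255)
```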
In particular, the RGB value classifier 17 refines the 6 color spans "blue" (0,0,119)-(0,0,255), "cyan" (0,0,255)-(0,119,119), "green" (0,119,119)-(119,255,0), "yellow" (255,199,0)-(199,255,0), "orange" (255, 0)-(255,119,0) and "red" (255,119,0)-(119,0,0) down to individual channel values (R, G, B), and controls the accuracy of the camera-to-component distance to 0.1 mm, thereby obtaining the detailed depth information D of each electronic component (i.e., its distance from the camera).
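The patent does not spell out how the RGB value classifier converts a channel value into a distance, so the sketch below simply interpolates linearly between the near and far ends of one color span and rounds to 0.1 mm; the span boundaries, the depth band and the linear mapping are assumptions for illustration only.

```python
def refine_depth(value, value_near, value_far, depth_near_mm, depth_far_mm):
    """Linearly interpolate a channel value inside one color span to a depth in mm,
    rounded to 0.1 mm (assumed mapping, for illustration)."""
    t = (value - value_near) / float(value_far - value_near)
    depth = depth_near_mm + t * (depth_far_mm - depth_near_mm)
    return round(depth, 1)

# Example: a pixel whose blue channel is 200 inside an assumed "blue" span 255-119,
# taken to cover the 50-70 mm band of the upper layer
print(refine_depth(200, value_near=255, value_far=119, depth_near_mm=50.0, depth_far_mm=70.0))  # 58.1
```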
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives, and variations may be made in the above embodiments by those skilled in the art without departing from the spirit and principles of the invention.

Claims (6)

1. An aliased electronic component space expression system based on improved monocular depth estimation, characterized by comprising an image acquisition module, a target detection network module, a semantic segmentation network module, and an HSV (hue, saturation, value) and RGB (red, green, blue) module;
the image acquisition module is used for acquiring RGB images of different kinds of aliasing electronic components in the feed box;
the target detection network module is used for processing the RGB image acquired by the image acquisition module to obtain a depth image A;
The semantic segmentation network module is used for segmenting the depth image A processed by the target detection network module to obtain rough depth information;
The HSV and RGB module refines the rough depth information segmented by the semantic segmentation network module to obtain detailed depth information of each electronic component; the HSV and RGB module comprises an HSV color model, an HSV cone model, an RGB three-dimensional coordinate model and an RGB value classifier;
Firstly, a depth image A is segmented by a semantic segmentation network module to obtain rough depth information, and the rough depth information is input into an HSV color model and is output into H, S, V values of three attributes; secondly, visualizing the values of the H, S, V attributes onto a color cone model by using the HSV cone model, and converting the HSV cone model into an RGB three-dimensional coordinate model to obtain a R, G, B value of the depth map; refining the three ranges of R, G, B values by using an RGB value classifier, so as to refine rough depth information obtained by a semantic segmentation module and obtain detailed depth information of the electronic component; the HSV color model determines colors by H, S, V attributes, namely hue, saturation and brightness; wherein, the hue H is measured by an angle, the value range is 0-360 degrees, the hue H is calculated from red in the anticlockwise direction, the red is 0 degree, the green is 120 degrees, and the blue is 240 degrees; the saturation S represents the degree of approaching the color to the spectrum color, and the value range is usually 0-100%, and the larger the value is, the more saturated the color is; the brightness V represents the degree of brightness of the color, and for the light source color, the brightness value is related to the brightness of the illuminant; for object colors, this value is related to the transmittance or reflectance of the object.
2. The aliased electronic component space expression system based on improved monocular depth estimation of claim 1, wherein the object detection network module comprises an input image module, a data enhancement module, a feature extraction network module, a feature fusion module, a downsampling module, a full-connection layer module, a classifier, and a prediction output module; specifically, the acquired RGB image is subjected to data enhancement processing, feature extraction, feature fusion, downsampling, full connection layer, classifier and prediction output.
3. The aliased electronic component space expression system based on improved monocular depth estimation of claim 2,
Carrying out 2 times of random scaling on the RGB images through a data enhancement module to obtain 2 images a and b; c, d is obtained by 2 times of random cutting;
The feature extraction network module comprises a lightweight network and a deep convolution network, wherein a lightweight network algorithm is utilized to extract features of images a and c, and a deep convolution network algorithm is utilized to extract features of images b and d;
The feature fusion module performs three-time layering feature fusion: fusing the shallow features and the deep features of the graphs a and b to obtain a feature graph x, fusing the shallow features and the deep features of the graphs c and d to obtain a feature graph y, and fusing the shallow features and the deep features of the graphs x and y to obtain a feature graph z; and the feature map z is predicted and output by a prediction output module after passing through a downsampling module, a full-connection layer module and a classifier.
4. The aliased electronic component spatial representation system based on improved monocular depth estimation of claim 3, wherein the predictive output module predicts output comprising depth image a, electronic component location information, class of electronic component, and probability distribution.
5. The aliased electronic component spatial representation system based on improved monocular depth estimation of claim 1, wherein the depth image a is an RGB color image.
6. The aliased electronic component spatial representation system based on improved monocular depth estimation of claim 1, further comprising a robot control module that performs positioning, grabbing, and assembly based on electronic component location information, class and probability distributions of electronic components, and detailed depth information of electronic components.
CN202110618580.0A 2021-06-03 2021-06-03 Aliased electronic component space expression method based on improved monocular depth estimation Active CN113468969B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110618580.0A CN113468969B (en) 2021-06-03 2021-06-03 Aliased electronic component space expression method based on improved monocular depth estimation


Publications (2)

Publication Number Publication Date
CN113468969A (en) 2021-10-01
CN113468969B (en) 2024-05-14

Family

ID=77872099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110618580.0A Active CN113468969B (en) 2021-06-03 2021-06-03 Aliased electronic component space expression method based on improved monocular depth estimation

Country Status (1)

Country Link
CN (1) CN113468969B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711413A (en) * 2018-12-30 2019-05-03 陕西师范大学 Image, semantic dividing method based on deep learning
CN111340864A (en) * 2020-02-26 2020-06-26 浙江大华技术股份有限公司 Monocular estimation-based three-dimensional scene fusion method and device
KR102127153B1 (en) * 2020-04-09 2020-06-26 한밭대학교 산학협력단 Depth estimation method and system using cycle GAN and segmentation


Also Published As

Publication number Publication date
CN113468969A (en) 2021-10-01

Similar Documents

Publication Publication Date Title
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
CN109559310B (en) Power transmission and transformation inspection image quality evaluation method and system based on significance detection
CN108876723B (en) Method for constructing color background of gray target image
CN112287939A (en) Three-dimensional point cloud semantic segmentation method, device, equipment and medium
CN111553949B (en) Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
CN110084243B (en) File identification and positioning method based on two-dimensional code and monocular camera
CN107103277B (en) Gait recognition method based on depth camera and 3D convolutional neural network
WO2022033076A1 (en) Target detection method and apparatus, device, storage medium, and program product
CN113963044B (en) Cargo box intelligent loading method and system based on RGBD camera
CN108229440A (en) One kind is based on Multi-sensor Fusion indoor human body gesture recognition method
CN113888631A (en) Designated object grabbing method based on target cutting area
CN108108670A (en) A kind of method of the remote sensing images extraction Port Warehouses of stratification screening
CA2611676A1 (en) Terrain map summary elements
Lacroix et al. Feature extraction using the constrained gradient
CN115641322A (en) Robot grabbing method and system based on 6D pose estimation
CN117011380A (en) 6D pose estimation method of target object
Rodriguez-Telles et al. A fast floor segmentation algorithm for visual-based robot navigation
CN116994135A (en) Ship target detection method based on vision and radar fusion
WO2021026855A1 (en) Machine vision-based image processing method and device
CN113681552B (en) Five-dimensional grabbing method for robot hybrid object based on cascade neural network
CN105374010A (en) A panoramic image generation method
CN113468969B (en) Aliased electronic component space expression method based on improved monocular depth estimation
CN113379684A (en) Container corner line positioning and automatic container landing method based on video
JP4821399B2 (en) Object identification device
CN116277030A (en) Model-free grabbing planning method and system based on depth vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant