CN113468955B - Method, device and storage medium for estimating distance between two points in traffic scene - Google Patents

Method, device and storage medium for estimating distance between two points in traffic scene

Info

Publication number
CN113468955B
CN113468955B (application number CN202110556836.XA)
Authority
CN
China
Prior art keywords
distance
feature
weights
interest
pixel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110556836.XA
Other languages
Chinese (zh)
Other versions
CN113468955A (en)
Inventor
萧允治
王礼闻
许伟林
伦栢江
李永智
肖顺利
陆允
曾国强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hong Kong Productivity Council
Original Assignee
Hong Kong Productivity Council
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hong Kong Productivity Council filed Critical Hong Kong Productivity Council
Priority to CN202110556836.XA priority Critical patent/CN113468955B/en
Publication of CN113468955A publication Critical patent/CN113468955A/en
Application granted granted Critical
Publication of CN113468955B publication Critical patent/CN113468955B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a method, device and storage medium for estimating the distance between two points in a traffic scene, wherein the method comprises the following steps: acquiring a video image sequence, wherein the video image sequence comprises a plurality of frame images acquired by a camera associated with a traffic scene; for each frame of image, determining one or more preset physical lengths in the image associated with a vehicle using a first deep learning model; for each preset physical length in the image, determining the distance weight of each corresponding pixel position according to the actual length value of the preset physical length; interpolating the distance weights of the pixel positions of the region of interest in the traffic scene using a second deep learning model, so as to obtain the distance weights of all pixel positions of the region of interest; and determining the actual distance between any two pixel positions on the image acquired by the camera according to the distance weights of the pixel positions of the region of interest. Thereby, accurate distance estimation that is independent of the camera's extrinsic or intrinsic parameters is achieved.

Description

Method, device and storage medium for estimating distance between two points in traffic scene
Technical Field
The present disclosure relates to the field of image recognition technologies, and in particular, to a method, an apparatus, and a storage medium for estimating a distance between two points in a traffic scene.
Background
Estimating distance from the captured video is a challenging task because the image captured by the camera does not provide distance information for each pixel. Recently, many techniques have provided solutions for distance estimation of road video.
In the related art, some methods calibrate the intrinsic and extrinsic parameters of a camera based on a camera model. The intrinsic parameters include the focal length and the center position of the camera, while the extrinsic parameters consist of a rotation matrix and a translation matrix. For example, some papers estimate the calibration parameters of a camera by finding three vanishing points. However, the vanishing points are detected based on the driving trajectories of vehicles, and the vehicles must not change lanes during the calibration. Meanwhile, the method assumes that the road is straight and flat, which imposes considerable limitations in practical applications. Another approach is to estimate the extrinsic parameters of the camera by mapping feature points in the image into real-world coordinates. It requires the intrinsic parameters to be known, and its accuracy depends heavily on the selected feature points. The method detects 10 key points on 10 cars and labels the real-world coordinates of these key points. By using the two-dimensional positions in the image and the corresponding three-dimensional real-world coordinates to solve the PnP problem, the extrinsic parameters of the camera can be estimated. Other methods utilize lane marking information of known width to estimate the calibration parameters. However, the lane markings of different roads differ, which requires additional labeling work.
Camera calibration aims to find a mapping function (i.e., a calibration matrix) from two-dimensional (i.e., image) space to three-dimensional (i.e., real-world coordinate) space, and is a challenging task. These methods either assume that the scene lies on a plane and estimate the calibration matrix based on an accurate vehicle model, or rely on strict assumptions (e.g., accurate marking information).
In the related art, patent US 2007/0154068 A1 presents a method for estimating the distance between a vehicle and a preceding vehicle by means of a vehicle-mounted camera, which requires that the camera be mounted parallel to the road surface and that the focal length be known; it estimates the distance between the camera and the vehicle ahead from the detected width of that vehicle. Patents EP 1005234 A2 and US 6172601 B1 disclose estimating distance from the distance moved by the host vehicle. However, these methods are designed for in-vehicle cameras and are intended to calculate the distance to a vehicle or obstacle ahead. Moreover, these methods rely on strict assumptions, for example, that the focal length and camera height must be known, which limits their use.
Disclosure of Invention
In order to solve the technical problems described above or at least partially solve the technical problems described above, the present application provides a method, an apparatus, and a storage medium for estimating a distance between two points in a traffic scene.
In a first aspect, the present application provides a method for estimating a distance between two points in a traffic scene, comprising: acquiring a video image sequence, wherein the video image sequence comprises a plurality of frame images acquired by a camera associated with a traffic scene; for each frame of image in the sequence of video images, determining, in the image, one or more preset physical lengths associated with a vehicle using a first deep learning model; for each preset physical length in the image, determining the distance weight of each pixel position corresponding to the preset physical length according to the actual length value of the preset physical length, wherein the distance weight represents the actual length represented by the pixel position, and the distance weight comprises a horizontal weight and a vertical weight; interpolating the distance weights of the pixel positions of the region of interest (Region of Interest, abbreviated as ROI) in the traffic scene using a second deep learning model, so as to obtain the distance weights of all pixel positions of the region of interest; and determining the actual distance between any two pixel positions on the image acquired by the camera according to the distance weights of all pixel positions of the region of interest.
In some embodiments, interpolating distance weights of pixel locations of a region of interest in a traffic scene using a second deep learning model to obtain distance weights of each pixel location of the region of interest, comprising: converting the input image from a color space to a feature space through a first convolution layer of a second deep learning model to obtain a first feature map of the input image; extracting a set of features of different scales of the first feature map through a set of feature extraction blocks of the second deep learning model; up-sampling, fusing and amplifying the set of features with different scales through a set of up-sampling blocks of a second deep learning model, and outputting a second feature map with the same size as the first feature map; the second feature map is input to a distance estimation head of the second deep learning model, and distance weights of pixel positions in the region of interest are output.
In some embodiments, for each feature extraction block, extracting features includes: outputting a first feature of the first input feature through a second convolution layer of the feature extraction block; mapping the first features back to the first input features through a deconvolution layer of the feature extraction block to obtain second input features; determining a difference between the first input feature and the second input feature, and inputting the difference into a third convolution layer of the feature extraction block; outputting a compensation term through a third convolution layer of the feature extraction block; and determining the output characteristic of the characteristic extraction block according to the compensation term and the first characteristic.
In some embodiments, upsampling, fusing, and amplifying the features for each upsampling block includes: preprocessing the input characteristics of the up-sampling block through a fourth convolution layer of the up-sampling block; upsampling the preprocessed features by a bilinear interpolation layer of an upsampling block; and processing the up-sampled characteristics through a fifth convolution layer of the up-sampling block to obtain output characteristics of the up-sampling block.
In some embodiments, outputting, by the distance estimation head, the distance weights for the pixel locations in the region of interest includes: the distance weights for pixel locations in the region of interest are output by a sixth convolution layer and a seventh convolution layer in series of the distance estimation head, where the excitation function of the seventh convolution layer uses a Sigmoid function to compress the output values to a range of 0 to 1.
In some embodiments, the second deep learning model is trained with at least one of the following as constraint terms: a horizontal direction constraint defining the degree to which the distance weights of horizontally adjacent pixel positions approximate each other, a vertical direction constraint defining the degree to which the distance weight of a pixel position increases as the position rises vertically, and a video consistency constraint defining the degree to which the distance weights of different frame images approximate each other.
In some embodiments, the constraint term is a weighted average of a horizontal direction constraint, a vertical direction constraint, and a video consistency constraint.
In some embodiments, determining the real distance between any two pixel positions on the image acquired by the camera according to the distance weights of the pixel positions of the region of interest includes: determining a reference origin on an image acquired by a camera; for each of any two pixel positions, determining a horizontal coordinate and a vertical coordinate of the pixel position relative to a reference origin, wherein the horizontal coordinate is the accumulation of horizontal weights, and the vertical coordinate is the accumulation of vertical weights; and determining the actual distance between the two pixel positions according to the horizontal coordinates and the vertical coordinates of the two pixel positions.
In a second aspect, the present application provides a computer device comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor; the computer program when executed by the processor implements the steps of the method for estimating a distance between two points in a traffic scene of any of the embodiments described above.
In a third aspect, the present application provides a computer readable storage medium having stored thereon a program for estimating a distance between two points in a traffic scene, which when executed by a processor, implements the steps of the method for estimating a distance between two points in a traffic scene of any of the embodiments described above.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages: the method provided by the embodiment of the application realizes accurate distance estimation independent of external or internal parameters of the camera.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flowchart of one implementation of a method for estimating a distance between two points in a traffic scene according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of one implementation of two-point coordinates provided in the examples of the present application;
FIG. 3 is a schematic diagram of an implementation of the preset physical length provided in the embodiment of the present application;
FIG. 4 is a schematic diagram of one implementation of a distance weight graph provided in an embodiment of the present application;
FIG. 5a is a schematic diagram of one example of an un-interpolated distance weight map provided by an embodiment of the present application;
FIG. 5b is a schematic diagram of one example of an interpolated distance weight map provided by an embodiment of the present application;
FIG. 6 is a block diagram of one implementation of a second deep learning model (DEN) provided by an embodiment of the present application;
FIG. 7 is a block diagram illustrating the structure of one implementation of a feature extraction block (FE) provided in an embodiment of the present application;
FIG. 8 is a block diagram of one implementation of an upsampling block (US) provided in an embodiment of the present application;
FIG. 9 is a block diagram of one implementation of a Distance Estimation Head (DEH) provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of one example of estimating distance using distance weights provided by embodiments of the present application; and
fig. 11 is a schematic hardware structure of an implementation manner of a computer device according to an embodiment of the present application.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In the following description, suffixes such as "module", "component", or "unit" for representing elements are used only for facilitating the description of the present application, and are not of specific significance per se. Thus, "module," "component," or "unit" may be used in combination.
The embodiment of the application relates to a distance estimation method, which can automatically calculate the distance between two positions in a captured video of a traffic scene. In the embodiment of the application, the distance is estimated by recording and analyzing the robust a priori information of the vehicles present in the video, mimicking the perception of humans. The distance between any points within the road area is automatically calculated, thereby promoting the development of existing traffic camera systems, embedding real world distance information into the captured video without additional measurements by the camera. The method can provide useful information for various applications such as vehicle speed estimation, collision warning systems, intelligent traffic control, etc.
The application provides a method for estimating the distance between two points in a traffic scene, as shown in fig. 1, the method comprises steps S102 to S110.
Step S102, a video image sequence is acquired, wherein the video image sequence comprises a plurality of frame images acquired by a camera associated with a traffic scene.
As an example, a camera is fixedly provided at a position near the traffic scene, and an image of the traffic scene is acquired by the camera.
Step S104, for each frame of image in the sequence of video images, determining one or more preset physical lengths of the images associated with the vehicle using a first deep learning model. As one example, the preset physical length includes a wheelbase, a length, etc. of the vehicle.
Step S106, for each preset physical length in the image, determining the distance weight of each pixel position corresponding to the preset physical length according to the actual length value of the preset physical length. Wherein the distance weight represents the real length represented by the pixel location, and the distance weight comprises a horizontal weight and a vertical weight.
Step S108, interpolation is carried out on the distance weights of the pixel positions of the region of interest in the traffic scene by using the second deep learning model, so that the distance weights of all the pixel positions of the region of interest are obtained.
Step S110, determining the actual distance between any two pixel positions on the image acquired by the camera according to the distance weights of the pixel positions of the region of interest.
In some embodiments, in the step S110, a reference origin on the image acquired by the camera is determined; for each of any two pixel positions, determining a horizontal coordinate and a vertical coordinate of the pixel position relative to a reference origin, wherein the horizontal coordinate is the accumulation of horizontal weights, and the vertical coordinate is the accumulation of vertical weights; and determining the actual distance between the two pixel positions according to the horizontal coordinates and the vertical coordinates of the two pixel positions.
In the above step S110, a reference origin is defined, which may be any point, or a point defined by two meaningful intersecting lines on the road. It serves as a reference point when building the real-world distance map. In some embodiments, horizontal and vertical lines are found in the video using a Hough transform. Then, the intersections of the horizontal and vertical lines are found, and one of them is selected as the reference origin.
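By way of an illustrative sketch only (not a required implementation of the embodiments), the reference origin could be located with OpenCV's probabilistic Hough transform; the edge-detection thresholds, the angle tolerances and the rule of picking the first horizontal/vertical pair are all assumptions of this sketch.

```python
import cv2
import numpy as np

def find_reference_origin(frame_bgr):
    """Pick a reference origin as the intersection of a near-horizontal and a
    near-vertical line found by the Hough transform (illustrative thresholds)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                            minLineLength=60, maxLineGap=10)
    if lines is None:
        return None
    horizontals, verticals = [], []
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = abs(np.degrees(np.arctan2(y2 - y1, x2 - x1)))
        if angle < 10:              # near-horizontal line
            horizontals.append((x1, y1, x2, y2))
        elif angle > 80:            # near-vertical line
            verticals.append((x1, y1, x2, y2))
    if not horizontals or not verticals:
        return None
    # Intersect the first horizontal/vertical pair (the selection rule is an assumption).
    hx1, hy1, hx2, hy2 = horizontals[0]
    vx1, vy1, vx2, vy2 = verticals[0]
    y0 = (hy1 + hy2) / 2.0          # horizontal line: y is roughly constant
    x0 = (vx1 + vx2) / 2.0          # vertical line:   x is roughly constant
    return int(round(x0)), int(round(y0))
```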
In the above step S110, the distance S_ab between two given points (x_a, y_a) and (x_b, y_b) is calculated. Taking a small area on the road (e.g., the rectangle in fig. 2) as an example, this area is small and can be modeled as a 2D plane, which can be described by two orthogonal vectors (the x-axis and y-axis in fig. 2). The distance between two points (e.g., d between a and b in fig. 2) can therefore be decomposed along these two orthogonal directions (the x and y directions in fig. 2). Thus, the real distance at each pixel location can be stored with two orthogonal elements (the horizontal and vertical distance weights). Mathematically, the assumption can be written as:
d_ab = d_x + d_y (1)
where d_x and d_y are two orthogonal distance vectors along the x-axis and y-axis directions, and d_ab represents the distance vector between points a and b.
The horizontal coordinate of a pixel position relative to the reference origin is determined as the accumulation of horizontal weights, and the vertical coordinate as the accumulation of vertical weights. Knowing the real coordinates of all points, the distance between any two points (x_a, y_a) and (x_b, y_b) can be obtained as:
S_ab = sqrt((x_a − x_b)^2 + (y_a − y_b)^2) (2)
where (x_a, y_a) and (x_b, y_b) represent the real-world coordinates of a and b.
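A minimal NumPy sketch of this coordinate-accumulation step is given below; the array layout (weights indexed as row, column) and the cumulative-sum construction are assumptions that follow the accumulation described above, not a verbatim implementation from the embodiments.

```python
import numpy as np

def real_world_coords(wx, wy, origin):
    """Accumulate horizontal/vertical distance weights into real-world
    coordinates relative to a reference origin (a sketch of step S110).

    wx, wy : H x W arrays of horizontal / vertical distance weights
             (real-world length represented by each pixel).
    origin : (col, row) pixel index of the reference origin.
    """
    oc, orow = origin
    # Cumulative sums give each pixel's coordinate; subtracting the origin's
    # cumulative value makes the origin the point (0, 0).
    cum_x = np.cumsum(wx, axis=1)
    cum_y = np.cumsum(wy, axis=0)
    x = cum_x - cum_x[:, oc:oc + 1]
    y = cum_y - cum_y[orow:orow + 1, :]
    return x, y

def distance_between(x, y, pa, pb):
    """Equation (2): Euclidean distance between pixels pa and pb, given as (row, col)."""
    (ra, ca), (rb, cb) = pa, pb
    return float(np.hypot(x[ra, ca] - x[rb, cb], y[ra, ca] - y[rb, cb]))
```

For example, distance_between(x, y, (200, 320), (260, 400)) would return the estimated real-world separation, in the same units as the weight maps, between those two pixel positions.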
In the embodiment of the present application, in step S104, for the acquired image, the first deep learning model is used to identify preset physical lengths of different objects which appear continuously and repeatedly in the scene, such as the vehicle in fig. 3. In the real world, the standard width of a vehicle is about 1.6-2.0 m, and the wheelbase length is about 2.6 m, depending on the vehicle type. In step S106, the distance within the image is roughly estimated using this a priori knowledge.
For example, in the above step S104, the vehicle is detected from the image using Mask R-CNN, and the wheels of the vehicle can be easily found. The preset physical lengths of the detected objects are known, and many such lengths occur continuously and repeatedly in the scene: vehicles repeatedly appear in the captured video clips, and their width is fixed in the real world, while the width in the image changes consistently as the vehicle moves along the road direction. Likewise, the wheelbase length is relatively fixed in the real world, while its length in the image changes continuously as the vehicle moves. These preset physical lengths can be used as scales to measure real-world length in a small area of the image.
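One possible choice of the first deep learning model is torchvision's pre-trained Mask R-CNN, sketched below; the score threshold, the use of the bounding-box width as the pixel length, and the COCO label id 3 for "car" are assumptions of this sketch rather than details fixed by the embodiments.

```python
import torch
import torchvision

# A possible "first deep learning model": torchvision's pre-trained Mask R-CNN.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

@torch.no_grad()
def detect_vehicle_widths(frame_rgb, score_thresh=0.7):
    """Return (pixel_width, box) pairs for detected cars in one RGB frame.

    frame_rgb: H x W x 3 uint8 NumPy image.
    """
    img = torch.from_numpy(frame_rgb).permute(2, 0, 1).float() / 255.0
    out = model([img])[0]
    widths = []
    for box, label, score in zip(out["boxes"], out["labels"], out["scores"]):
        if label.item() == 3 and score.item() >= score_thresh:  # COCO label 3 = "car"
            x1, y1, x2, y2 = box.tolist()
            widths.append((x2 - x1, (x1, y1, x2, y2)))
    return widths
```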
After finding the preset physical lengths associated with vehicles in the image using the first deep learning model, in step S106 the ratio (referred to herein as the distance weight, denoted δ) between the real-world length (denoted L) and the pixel length (denoted l) is calculated. Mathematically, the distance weight is defined as:
δ = L / l (3)
each pixel location is considered an infinitely small region according to the distance model in equation (3). For each pixel position of the image we define a distance weight δ to represent the ratio between the real world and the pixel length, i.e. the real length represented by each pixel position. For each frame of video, a first deep learning model is first used to identify the object of interest and a preset physical length of interest (e.g., width of the vehicle, wheelbase length of the vehicle, etc.) is found from the image. The distance weight for the pixel location is calculated based on a fixed, known preset physical length in the real world, which is known in advance.
Taking fig. 4 as an example, a white vehicle is detected from the captured image using the first deep learning model. The width of the vehicle is measured in pixels from the detected position, namely 55 pixels. In the real world, the average width of a vehicle is about 1.8 m. The distance weight is calculated according to equation (3); it represents the relationship between the real-world length and the pixel length at the vehicle's location on the image. Thus, the weight values at the vehicle location are recorded and a map is built (called the distance weight map, see the right part of fig. 4). Likewise, weights are calculated for all vehicles (or other lengths of interest) in the image following the same procedure and recorded on the distance weight map.
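The sketch below illustrates how such per-detection weights could be accumulated into a sparse distance weight map; the assumed average vehicle width of 1.8 m and the choice to spread each weight over the whole detection box are illustrative assumptions.

```python
import numpy as np

REAL_CAR_WIDTH_M = 1.8   # assumed average real-world vehicle width

def record_distance_weights(weight_map, count_map, detections):
    """Write distance weights (equation (3): delta = L / l) into a sparse map.

    weight_map, count_map : H x W float arrays (accumulated weights / hit counts).
    detections            : iterable of (pixel_width, (x1, y1, x2, y2)) boxes.
    """
    for pixel_width, (x1, y1, x2, y2) in detections:
        if pixel_width <= 0:
            continue
        delta = REAL_CAR_WIDTH_M / pixel_width        # real length per pixel
        r1, r2 = int(y1), int(y2) + 1
        c1, c2 = int(x1), int(x2) + 1
        weight_map[r1:r2, c1:c2] += delta             # record at the vehicle location
        count_map[r1:r2, c1:c2] += 1.0
    return weight_map, count_map

# Averaging the accumulated values yields the sparse weight map of fig. 5a:
# sparse_map = np.divide(weight_map, count_map,
#                        out=np.zeros_like(weight_map), where=count_map > 0)
```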
For a video sequence, each frame may be processed in this way. Fig. 5a gives an example in which a distance weight map records the distance weights of a video image sequence. As shown in fig. 5a, a distance weight map with a set of recorded values is obtained through the preceding processing. However, these values are sparse; a large portion of the region of interest is still not covered, and much of the road area lacks weight information and requires interpolation.
After proper training, a convolutional neural network (CNN) can interpolate well and fill in the blanks according to the designed rules. Therefore, in the above step S108, the distance weights of the road region are interpolated using the second deep learning model. Through the interpolation process, as shown in fig. 5b, every position of the weight map has a distance weight to represent the real-world distance.
In step S108, any deep learning model with interpolation capability may be used; the embodiment of the present application further proposes a new second deep learning model.
One implementation of the second deep learning model of the embodiment of the present application is described below.
Distance estimation is a high-level video understanding task that requires a fairly large receptive field to analyze the context, while localization information is also critical to refine the details of the distance map. The embodiments of the present application propose a second deep learning model, called the Distance Estimation Network (DEN), that combines the context and local information of global and local features to facilitate understanding of the input image. To reduce the computing-power requirement in practical applications, the network has a lightweight structure and is easy to deploy.
Distance Estimation Network (DEN)
As shown in fig. 6, the DEN consists of an upper half path and a lower half path. The upper half path contains a set of feature extraction blocks (Feature Extraction Block, abbreviated as FE), configured to progressively downsample the input image and extract more and more global features for subsequent processing; as described in fig. 6, this includes feature extraction blocks 201, 202 and 203. The lower half path contains a set of up-sampling blocks (Up-Sampling Block, abbreviated as US), configured to receive the extracted features of different scales and to continually process, fuse and scale the features back to their original size; as shown in fig. 6, this includes up-sampling blocks 301, 302 and 303. Finally, there are two distance estimation heads (Distance Estimation Head, abbreviated as DEH) 401 and 402: the distance estimation head 401 is configured to predict the distance weight in the horizontal direction (the horizontal weight), and the distance estimation head 402 is configured to predict the distance weight in the vertical direction (the vertical weight). As shown in fig. 6, the upper half path further includes a convolution layer 501 located before the feature extraction blocks and configured to process the input image 101, converting the input image 101 from the color space to the feature space to obtain a feature map 102; between the upper and lower paths there is also a convolution layer 502 configured to process the features output by the feature extraction block 203.
Mathematically, the function of the DEN is described as:
(Δ_x, Δ_y) = f(I) (4)
where f(·) represents the DEN, the symbol I ∈ R^(W×H×3) represents an input image with three (i.e., RGB) channels, and the outputs Δ_x and Δ_y are the estimated distance weight maps in the horizontal and vertical directions, respectively.
In step S108 described above, the input image 101 (for example, of size 640×360×3) is converted from the color space to the feature space by the convolution layer 501, resulting in a feature map 102 (for example, of size 640×360×16). Through the feature extraction blocks 201, 202 and 203, a set of features of different scales of the feature map 102 is extracted; as shown in fig. 6, these include features 103, 104 and 105, with sizes of 360×180×32, 180×90×64 and 90×45×128, respectively, as an example. The set of features of different scales is up-sampled, fused and amplified by the up-sampling blocks 301, 302 and 303, outputting a feature map 109 of the same size as feature map 102, wherein up-sampling block 301 processes feature map 106 (the output of convolution layer 502, of size 90×45×256), up-sampling block 302 processes feature map 107 (composed of feature map 104 and the output of up-sampling block 301, of size 180×90×(64+64)), and up-sampling block 303 processes feature map 108 (composed of feature map 103 and the output of up-sampling block 302, of size 360×180×(32+32)). The feature map 109 (composed of feature map 102 and the output of up-sampling block 303, of size 640×360×(16+16)) is input to the distance estimation heads 401 and 402, which output the distance weights Δ_x and Δ_y of the pixel positions in the region of interest.
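A structural sketch of this two-path layout in PyTorch is given below; the FE, US and DEH blocks are reduced here to plain convolutions with ReLU activations (more detailed sketches of those blocks follow in the corresponding sections), and the kernel sizes, strides and exact channel widths are assumptions that only mirror the example sizes above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDEN(nn.Module):
    """Structural sketch of the Distance Estimation Network of fig. 6.
    Each FE/US/DEH block is simplified to plain convolutions here."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 16, 3, padding=1)              # layer 501
        self.fe1 = nn.Conv2d(16, 32, 3, stride=2, padding=1)    # FE 201
        self.fe2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)    # FE 202
        self.fe3 = nn.Conv2d(64, 128, 3, stride=2, padding=1)   # FE 203
        self.mid = nn.Conv2d(128, 256, 3, padding=1)            # layer 502
        self.us1 = nn.Conv2d(256, 64, 3, padding=1)             # US 301
        self.us2 = nn.Conv2d(64 + 64, 32, 3, padding=1)         # US 302
        self.us3 = nn.Conv2d(32 + 32, 16, 3, padding=1)         # US 303
        self.head_x = nn.Conv2d(16 + 16, 1, 3, padding=1)       # DEH 401
        self.head_y = nn.Conv2d(16 + 16, 1, 3, padding=1)       # DEH 402

    def forward(self, img):
        f0 = F.relu(self.stem(img))                              # feature map 102
        f1 = F.relu(self.fe1(f0))                                # feature map 103
        f2 = F.relu(self.fe2(f1))                                # feature map 104
        f3 = F.relu(self.fe3(f2))                                # feature map 105
        m = F.relu(self.mid(f3))                                 # feature map 106
        u1 = F.interpolate(F.relu(self.us1(m)), size=f2.shape[-2:],
                           mode="bilinear", align_corners=False)
        u2 = F.interpolate(F.relu(self.us2(torch.cat([f2, u1], 1))),
                           size=f1.shape[-2:], mode="bilinear", align_corners=False)
        u3 = F.interpolate(F.relu(self.us3(torch.cat([f1, u2], 1))),
                           size=f0.shape[-2:], mode="bilinear", align_corners=False)
        fused = torch.cat([f0, u3], 1)                           # feature map 109
        return torch.sigmoid(self.head_x(fused)), torch.sigmoid(self.head_y(fused))
```

For instance, den = SimpleDEN(); wx, wy = den(torch.rand(1, 3, 360, 640)) would produce two 1×1×360×640 weight maps with values in (0, 1).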
Feature Extraction Block (FE)
The quality of the extracted features has a great influence on the performance of the DEN; therefore, efficiently extracting useful information from the input image is one of the most important parts of the DEN. To efficiently extract features for the distance estimation task, this embodiment provides a feature extraction block, as shown in fig. 7, where the input is a feature map of 640×360×16, denoted X. First, X enters the convolution layer E1, which extracts a feature Y' with more global information (of size 360×180×32). To evaluate the extracted feature Y', a deconvolution layer D maps Y' back to the original size, giving an estimate X' in the input domain. The difference R_X = X − X' between the original X and the estimate X' is then calculated; this difference represents the quality of the extracted feature Y', because a good Y' should give a small estimation error R_X in the input domain. Based on the measured difference R_X, a compensation term R_Y is estimated by a convolution layer E2, and the compensation term R_Y is used to enhance the extracted feature. Mathematically, the process of the feature extraction block is described as:
Y' = E1(X), X' = D(Y'), R_X = X − X', R_Y = E2(R_X), Y = Y' + R_Y (5)
the feature extraction block extracts multi-scale features from the input image in turn, as shown in fig. 6, which extracts four different scale features from the top half path. Features of different scales contain different semantic level information.
In some embodiments, for each feature extraction block, extracting features includes: outputting a feature Y' of the input feature X through the convolution layer E1 of the feature extraction block; mapping the feature Y' back to the input domain through the deconvolution layer D of the feature extraction block to obtain an estimate X'; determining the difference R_X between X and X', and inputting the difference R_X into the convolution layer E2 of the feature extraction block; outputting the compensation term R_Y through the convolution layer E2; and determining the output feature Y of the feature extraction block according to the compensation term R_Y and the feature Y'.
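A PyTorch sketch of one such block is given below; the kernel sizes, the stride-2 downsampling and the choice of a transposed convolution for D are assumptions (even spatial dimensions are assumed so that the reconstruction X' matches X).

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractionBlock(nn.Module):
    """Sketch of the FE block of fig. 7: extract a downsampled feature, reproject
    it to the input domain, and learn a compensation term from the error (eq. (5))."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.e1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)          # E1: extract
        self.d = nn.ConvTranspose2d(out_ch, in_ch, 4, stride=2, padding=1)  # D: map back
        self.e2 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)          # E2: compensate

    def forward(self, x):
        y_prime = F.relu(self.e1(x))   # extracted feature          Y'
        x_hat = self.d(y_prime)        # estimate in input domain   X'
        r_x = x - x_hat                # reconstruction error       R_X
        r_y = self.e2(r_x)             # compensation term          R_Y
        return y_prime + r_y           # enhanced output feature    Y
```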
Up-Sampling Block (US)
The lower half path uses a set of up-sampling blocks to gradually process, fuse and amplify the features so that they recover their original size. As shown in fig. 8, the up-sampling block (US) includes: a convolution layer E3 that pre-processes the features, a bilinear interpolation layer (Bilinear) B that up-samples the features, and another convolution layer E4 that processes the up-sampled features.
In some embodiments, up-sampling, fusing and amplifying the features for each up-sampling block includes: the input features 81 of the up-sampling block (of size 360×180×(32+32), as an example) are preprocessed by the convolution layer E3 of the up-sampling block; the preprocessed features 82 (of size 360×180×32, as an example) are up-sampled by the bilinear interpolation layer B of the up-sampling block; and the up-sampled features 83 (of size 640×360×32, as an example) are processed by the convolution layer E4 of the up-sampling block to obtain the output features 84 (of size 640×360×16, as an example) of the up-sampling block.
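A corresponding PyTorch sketch is shown below; the ×2 bilinear up-sampling factor and the channel counts are assumptions of this sketch (in fig. 8 the spatial size changes from 360×180 to 640×360, which an implementation could instead achieve by up-sampling to an explicit target size).

```python
import torch.nn as nn

class UpSamplingBlock(nn.Module):
    """Sketch of the US block of fig. 8: pre-process (E3), bilinear up-sample (B),
    then refine (E4).  Channel counts are assumptions."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.e3 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)          # E3: pre-process
        self.b = nn.Upsample(scale_factor=2, mode="bilinear",
                             align_corners=False)                 # B: up-sample
        self.e4 = nn.Conv2d(mid_ch, out_ch, 3, padding=1)         # E4: refine
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.e3(x))      # preprocessed features (82)
        x = self.b(x)                 # up-sampled features (83)
        return self.act(self.e4(x))   # output features (84)
```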
Distance Estimation Head (DEH)
The feature maps obtained through the preceding feature extraction and up-sampling processes contain rich distance estimation information. The DEH is designed to predict the distance weight map from the obtained features. As shown in fig. 9, each DEH has convolution layers E5 and E6, where the excitation function of the second convolution layer E6 uses the Sigmoid function to compress the output value into the range of 0 to 1. As an example, as shown in fig. 9, the size of the input feature 91 is 640×360×(16+16), the size of the feature 92 output by convolution layer E5 is 640×360×32, and the size of the feature 93 output by convolution layer E6 is 640×360×1. The reasons for using the Sigmoid function are: (a) the distance weight represents the ratio between the real-world distance and the pixel length, and is always positive; (b) a value that is too large is not useful in the application, because an object that is too far away is difficult to detect and measure, causing a large error; and (c) the labeled distance maps (Δ_x and Δ_y) are discrete, and mainly the differences at the labeled positions are measured during training. Because the labeled positions differ between frames, the losses of different frames differ greatly and the gradient during training is unstable. With the Sigmoid function, the gradient varies little (close to zero) when the estimate is too small or too large, thus filtering out gradient noise and stabilizing the training process.
In some embodiments, outputting, by the distance estimation head, the distance weights for the pixel locations in the region of interest includes: the distance weights of the pixel locations in the region of interest are output by two convolution layers E5 and E6 in series of the distance estimation head, wherein the excitation function of the second convolution layer E6 uses the Sigmoid function to compress the output values into the range of 0 to 1.
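A minimal PyTorch sketch of such a head follows; the 3×3 kernels, the intermediate channel count and the ReLU between E5 and E6 are assumptions.

```python
import torch.nn as nn

class DistanceEstimationHead(nn.Module):
    """Sketch of the DEH of fig. 9: two convolutions in series, with a Sigmoid on
    the last layer so that each output weight lies in (0, 1)."""
    def __init__(self, in_ch=32, mid_ch=32):
        super().__init__()
        self.e5 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)   # E5
        self.e6 = nn.Conv2d(mid_ch, 1, 3, padding=1)        # E6
        self.act = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Output: a single-channel distance weight map, e.g. 640x360x1.
        return self.sigmoid(self.e6(self.act(self.e5(x))))
```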
Constraint conditions
In some embodiments, the second deep learning model is trained with at least one of the following as constraint terms: a horizontal direction constraint defining the degree to which the distance weights of horizontally adjacent pixel positions approximate each other, a vertical direction constraint defining the degree to which the distance weight of a pixel position increases as the position rises vertically, and a video consistency constraint defining the degree to which the distance weights of different frame images approximate each other.
In some embodiments, the constraint term is a weighted average of a horizontal direction constraint, a vertical direction constraint, and a video consistency constraint.
Prior knowledge of the traffic video is used to assist in training the deep learning model. The present embodiment defines a constraint term consisting of three parts: the horizontal direction constraint Ω_h, the vertical direction constraint Ω_v and the video consistency constraint Ω_vid. The constraint term is used as part of the loss function to optimize the deep learning model during the training phase. Mathematically, the constraint term is defined as:
Ω = λ_1·Ω_h + λ_2·Ω_v + λ_3·Ω_vid
where λ_1, λ_2 and λ_3 are coefficients that balance the three terms.
Horizontal direction constraint Ω_h: the embodiments of the present application consider that a traffic camera is always positioned facing the road, whose direction is mainly longitudinal. The distance weight (i.e., the ratio of the real distance to the pixel distance) varies significantly along the road direction (the vertical direction), while horizontally adjacent positions have similar weights. To constrain the weight change in the horizontal direction, the embodiment of the application defines the horizontal constraint term Ω_h as follows:
Ω_h = (1/(W·H)) Σ_i Σ_j | δ(i+1, j) − δ(i, j) |
where W and H represent the width and height of the video frame, the symbol δ represents the estimated distance weight, and i and j are the indices in the horizontal and vertical directions, with the origin of the pixel index located in the upper left corner of the image.
Vertical direction constraint Ω_v: in a frame of traffic video, the sky (if present) is always at the top and the ground is at the bottom. For the same column of pixels, the ratio of real length to pixel length at a top position is always greater than at a bottom position, which means the distance weight δ increases as the position rises vertically. Based on this characteristic, the vertical direction constraint Ω_v is defined as follows:
Ω_v = (1/(W·H)) Σ_i Σ_j max(0, δ(i, j+1) − δ(i, j))
which penalizes any pixel whose weight fails to decrease towards the bottom of the image.
video consistency constraint Ω vid : the position of the road camera is fixed. The captured video always contains the same scene. Thus, video frames from the same video sequence should share the same distance weight map Δ. For each iteration of the training process, the distance estimation network of the embodiments of the present application optimizes the gradient of a batch of video frames from the same video sequence. Estimated weight map for different framesIt should be the same, the constraint is defined in the embodiments of the present application as follows:
where L e {1,2,..l } represents the sample index in a batch of video frames.
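A PyTorch sketch of how these three constraint terms could be computed over a batch of estimated weight maps is given below; the use of mean absolute differences, the ReLU penalty for the vertical ordering and the batch-mean target for video consistency are assumptions consistent with the definitions above, not the exact published formulas.

```python
import torch

def constraint_loss(wx_batch, wy_batch, lambdas=(1.0, 1.0, 1.0)):
    """Sketch of the constraint term Omega for a batch of frames from one video
    sequence.  wx_batch, wy_batch: L x 1 x H x W estimated weight maps."""
    lam_h, lam_v, lam_vid = lambdas
    losses = []
    for w in (wx_batch, wy_batch):
        # Horizontal constraint: horizontally adjacent weights should be similar.
        omega_h = (w[..., :, 1:] - w[..., :, :-1]).abs().mean()
        # Vertical constraint: weights should not increase downwards
        # (the pixel-index origin is the upper left corner).
        omega_v = torch.relu(w[..., 1:, :] - w[..., :-1, :]).mean()
        # Video consistency: all frames in the batch should share one weight map.
        omega_vid = (w - w.mean(dim=0, keepdim=True)).abs().mean()
        losses.append(lam_h * omega_h + lam_v * omega_v + lam_vid * omega_vid)
    return sum(losses)
```

In training, this term would be added to the supervised loss measured at the labeled (sparse) positions of the distance weight map.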
Fig. 10 gives an example of the proposed method. In this example, a short video sequence is acquired from a scene, and two distance weight maps in the horizontal and vertical directions are created using our method, as shown in fig. 10 (a) and (b). The distance from the stop line is calculated with one corner of the stop line as the reference point (see the dot in fig. 10 (c)). Each circle in the figure represents ten meters, and the distance from the reference point is indicated by a "contour". The estimated distance increases radially, and the "contours" form regular ellipses. As shown in fig. 10 (d), sample point-to-point distances are calculated, which indicates that the embodiments of the present application can accurately estimate the distance between any points in the road area.
In the embodiment of the application, each pixel position is regarded as an infinitely small area, where the road in that local area can be modeled as a small two-dimensional plane; an uneven road can then be described by a set of such two-dimensional planes. A reference origin of the scene is determined. One or several preset physical lengths that appear continuously and repeatedly in the scene, and whose statistical lengths remain consistently unchanged, are determined. The obtained preset physical lengths are mapped to pixels on the image acquired by the camera, forming a sketch of actual distances on the map. A pair of weight values is then found for each pixel of the region of interest on the image using a deep learning method, to form a weight map of the scene. Any deep learning model with interpolation capability can be used to construct the weight map of the scene, which is an advantage of this approach. Nevertheless, a dedicated deep learning architecture (the DEN network) is also proposed, whose loss function takes care of the consistency of the x- and y-coordinate values. To train the deep learning network for distance estimation, it uses a novel set of constraints that force the network to interpolate uncovered points using natural reasoning. Using the weight map, the coordinates of all pixels in the RoI can be estimated with reference to the origin, and the distance between two points can then be calculated from their coordinates with high accuracy.
The embodiment of the application also provides computer equipment. Fig. 11 is a schematic hardware structure of an implementation manner of a computer device provided in an embodiment of the present application, and as shown in fig. 11, the computer device 10 in the embodiment of the present application includes at least, but is not limited to: a memory 11 and a processor 12 that may be communicatively coupled to each other via a system bus. It should be noted that FIG. 11 shows only computer device 10 having components 11-12, but it should be understood that not all of the illustrated components are required to be implemented and that more or fewer components may be implemented instead.
In the present embodiment, the memory 11 (i.e., readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 11 may be an internal storage unit of the computer device 10, such as a hard disk or a memory of the computer device 10. In other embodiments, the memory 11 may also be an external storage device of the computer device 10, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the computer device 10. Of course, the memory 11 may also include both internal storage units of the computer device 10 and external storage devices. In this embodiment, the memory 11 is typically used to store an operating system and various types of software installed on the computer device 10. Further, the memory 11 may be used to temporarily store various types of data that have been output or are to be output.
Processor 12 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 12 is generally used to control the overall operation of the computer device 10. In the present embodiment, the processor 12 is configured to execute the program code stored in the memory 11 or process data, such as a method for estimating the distance between two points in a traffic scene.
The present embodiment also provides a computer-readable storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored, which when executed by a processor, performs the corresponding functions. The computer readable storage medium of the present embodiment is for storing program code for a method for estimating a distance between two points in a traffic scene, which when executed by a processor implements the method for estimating a distance between two points in a traffic scene.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present application are merely for describing, and do not represent advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the protection of the claims, which fall within the protection of the present application.

Claims (10)

1. A method for estimating a distance between two points in a traffic scene, comprising:
acquiring a video image sequence, wherein the video image sequence comprises multi-frame images acquired by a camera associated with the traffic scene;
for each frame of image in the sequence of video images, determining one or more preset physical lengths of the images associated with the vehicle using a first deep learning model; for each preset physical length in an image, determining the distance weight of each pixel position corresponding to the preset physical length according to the actual length value of the preset physical length, wherein the distance weight represents the actual length represented by the pixel position and comprises a horizontal weight and a vertical weight;
interpolation is carried out on the distance weights of the pixel positions of the region of interest in the traffic scene by using a second deep learning model, so that the distance weights of all the pixel positions of the region of interest are obtained;
and determining the actual distance between any two pixel positions on the image acquired by the camera according to the distance weights of all pixel positions of the region of interest.
2. The method of claim 1, wherein interpolating distance weights for pixel locations of a region of interest in the traffic scene using a second deep learning model to obtain distance weights for each pixel location of the region of interest comprises:
converting an input image from a color space to a feature space through a first convolution layer of the second deep learning model to obtain a first feature map of the input image;
extracting a set of features of different scales of the first feature map by a set of feature extraction blocks of the second deep learning model;
up-sampling, fusing and amplifying the set of features with different scales through a set of up-sampling blocks of the second deep learning model, and outputting a second feature map with the same size as the first feature map;
and inputting the second feature map into a distance estimation head of the second deep learning model, and outputting the distance weight of the pixel position in the region of interest.
3. The method of claim 2, wherein extracting features comprises, for each feature extraction block:
outputting a first feature of the first input feature through a second convolution layer of the feature extraction block;
mapping the first feature back to the first input feature through the deconvolution layer of the feature extraction block to obtain a second input feature;
determining a difference between the first input feature and the second input feature, inputting the difference into a third convolution layer of the feature extraction block;
outputting a compensation term through a third convolution layer of the feature extraction block;
and determining output characteristics of the characteristic extraction block according to the compensation term and the first characteristics.
4. The method of claim 2, wherein upsampling, fusing, and amplifying the features for each upsampled block comprises:
preprocessing the input characteristics of an upper sampling block through a fourth convolution layer of the upper sampling block;
upsampling the preprocessed features by a bilinear interpolation layer of the upsampling block;
and processing the up-sampled characteristics through a fifth convolution layer of the up-sampling block to obtain output characteristics of the up-sampling block.
5. The method of claim 2, wherein outputting, by the distance estimation head, the distance weights for pixel locations in the region of interest, comprises:
and outputting the distance weight of the pixel position in the region of interest through a sixth convolution layer and a seventh convolution layer of the distance estimation head which are connected in series, wherein an excitation function of the seventh convolution layer compresses an output value into a range of 0 to 1 by using a Sigmoid function.
6. The method of any one of claims 1 to 5, wherein the second deep learning model is trained with at least one of the following as constraint terms: a horizontal direction constraint defining the degree to which the distance weights of horizontally adjacent pixel positions approximate each other, a vertical direction constraint defining the degree to which the distance weights of pixel positions increase as the positions rise vertically, and a video consistency constraint defining the degree to which the distance weights of different frame images approximate each other.
7. The method of claim 6, wherein the constraint term is a weighted average of the horizontal direction constraint, the vertical direction constraint, and the video consistency constraint.
8. The method according to any one of claims 1 to 5, wherein determining the real distance between any two pixel locations on the image acquired by the camera according to the distance weights of the respective pixel locations of the region of interest comprises:
determining a reference origin on an image acquired by the camera;
for each of any two pixel positions, determining a horizontal coordinate and a vertical coordinate of the pixel position relative to a reference origin, wherein the horizontal coordinate is the accumulation of horizontal weights, and the vertical coordinate is the accumulation of vertical weights;
and determining the actual distance between the two pixel positions according to the horizontal coordinates and the vertical coordinates of the two pixel positions.
9. A computer device, the computer device comprising:
a memory, a processor, and a computer program stored on the memory and executable on the processor;
the computer program, when executed by the processor, implements the steps of the method for estimating a distance between two points in a traffic scene as claimed in any of claims 1 to 8.
10. A computer-readable storage medium, on which a program for estimating a distance between two points in a traffic scene is stored, which program, when being executed by a processor, implements the steps of the method for estimating a distance between two points in a traffic scene as claimed in any of claims 1 to 8.
CN202110556836.XA 2021-05-21 2021-05-21 Method, device and storage medium for estimating distance between two points in traffic scene Active CN113468955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110556836.XA CN113468955B (en) 2021-05-21 2021-05-21 Method, device and storage medium for estimating distance between two points in traffic scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110556836.XA CN113468955B (en) 2021-05-21 2021-05-21 Method, device and storage medium for estimating distance between two points in traffic scene

Publications (2)

Publication Number Publication Date
CN113468955A CN113468955A (en) 2021-10-01
CN113468955B true CN113468955B (en) 2024-02-02

Family

ID=77871028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110556836.XA Active CN113468955B (en) 2021-05-21 2021-05-21 Method, device and storage medium for estimating distance between two points in traffic scene

Country Status (1)

Country Link
CN (1) CN113468955B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114663497A (en) * 2022-03-24 2022-06-24 智道网联科技(北京)有限公司 Distance measuring method, device and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3027432A1 (en) * 2014-10-16 2016-04-22 Valeo Schalter & Sensoren Gmbh DISTANCE ESTIMATION OF A PIETON BY AN IMAGING SYSTEM ON A MOTOR VEHICLE
CN109325393A (en) * 2017-08-01 2019-02-12 苹果公司 Using the face detection of single network, Attitude estimation and away from the estimation of camera distance
JP2019211391A (en) * 2018-06-07 2019-12-12 株式会社東芝 Distance measuring device, voice processing device, vibration measuring device, computed tomograph for industry, and distance measuring method
CN111024040A (en) * 2018-10-10 2020-04-17 三星电子株式会社 Distance estimation method and apparatus

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021503134A (en) * 2017-11-15 2021-02-04 グーグル エルエルシーGoogle LLC Unsupervised learning of image depth and egomotion prediction neural networks
US11030484B2 (en) * 2019-03-22 2021-06-08 Capital One Services, Llc System and method for efficient generation of machine-learning models
US20200320462A1 (en) * 2019-04-03 2020-10-08 International Business Machines Corporation Calculating online social network distance between entities of an organization

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3027432A1 (en) * 2014-10-16 2016-04-22 Valeo Schalter & Sensoren Gmbh DISTANCE ESTIMATION OF A PIETON BY AN IMAGING SYSTEM ON A MOTOR VEHICLE
CN109325393A (en) * 2017-08-01 2019-02-12 苹果公司 Using the face detection of single network, Attitude estimation and away from the estimation of camera distance
JP2019211391A (en) * 2018-06-07 2019-12-12 株式会社東芝 Distance measuring device, voice processing device, vibration measuring device, computed tomograph for industry, and distance measuring method
CN111024040A (en) * 2018-10-10 2020-04-17 三星电子株式会社 Distance estimation method and apparatus

Also Published As

Publication number Publication date
CN113468955A (en) 2021-10-01

Similar Documents

Publication Publication Date Title
CN110163930B (en) Lane line generation method, device, equipment, system and readable storage medium
CN108564874B (en) Ground mark extraction method, model training method, device and storage medium
CN106952308B (en) Method and system for determining position of moving object
CN113819890B (en) Distance measuring method, distance measuring device, electronic equipment and storage medium
CN112528878A (en) Method and device for detecting lane line, terminal device and readable storage medium
CN115717894B (en) Vehicle high-precision positioning method based on GPS and common navigation map
CN113205447A (en) Road picture marking method and device for lane line identification
CN114359181A (en) Intelligent traffic target fusion detection method and system based on image and point cloud
CN111256693A (en) Pose change calculation method and vehicle-mounted terminal
CN109115232B (en) Navigation method and device
CN114299405A (en) Unmanned aerial vehicle image real-time target detection method
CN116279592A (en) Method for dividing travelable area of unmanned logistics vehicle
CN112257668A (en) Main and auxiliary road judging method and device, electronic equipment and storage medium
CN113468955B (en) Method, device and storage medium for estimating distance between two points in traffic scene
CN113449692A (en) Map lane information updating method and system based on unmanned aerial vehicle
CN116246119A (en) 3D target detection method, electronic device and storage medium
US11461944B2 (en) Region clipping method and recording medium storing region clipping program
CN112184700B (en) Monocular camera-based agricultural unmanned vehicle obstacle sensing method and device
CN117475428A (en) Three-dimensional target detection method, system and equipment
CN114972470B (en) Road surface environment obtaining method and system based on binocular vision
Xiong et al. A 3d estimation of structural road surface based on lane-line information
CN117011481A (en) Method and device for constructing three-dimensional map, electronic equipment and storage medium
US20230419522A1 (en) Method for obtaining depth images, electronic device, and storage medium
CN116778466A (en) Large-angle license plate recognition method, device, equipment and storage medium
CN117953483A (en) Omnidirectional vision-oriented three-dimensional target detection method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant