CN114792414A - Target variable detection method and system for carrier - Google Patents

Target variable detection method and system for carrier

Info

Publication number
CN114792414A
CN114792414A
Authority
CN
China
Prior art keywords
bird
frame
feature
eye view
view angle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210346628.1A
Other languages
Chinese (zh)
Inventor
黄骏杰
黄冠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jianzhi Technology Co ltd
Original Assignee
Beijing Jianzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jianzhi Technology Co ltd
Priority to CN202210346628.1A
Publication of CN114792414A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a target variable detection method and system for a carrier. The method comprises the following steps: performing image space encoding on an input image sequence to obtain image-view features corresponding to image frames at different moments; performing view transformation on the image-view features to obtain bird's-eye-view features; fusing the bird's-eye-view feature of the frame at the current moment with the bird's-eye-view feature of the frame at the previous moment to obtain a fused bird's-eye-view feature; performing bird's-eye-view spatial feature encoding on the fused bird's-eye-view feature to obtain an enhanced bird's-eye-view feature; and performing target variable detection based on the enhanced bird's-eye-view feature.

Description

Target variable detection method and system for carrier
Technical Field
The present disclosure relates to image sequence processing methods, and particularly to a target variable detection method suitable for a carrier.
Background
Autonomous vehicles can operate completely autonomously, without the need for a human driver. In some scenarios, an autonomously driven vehicle may image the environment around the vehicle using an imaging system with one or more optical cameras. The images may be used for perceiving objects, including, for example, the speed of objects in the vehicle's surroundings.
One image processing scheme for perceiving objects is described in the article by Huang J, Huang G, Zhu Z, et al., BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View [J]. arXiv preprint arXiv:2112.11790, 2021.
As shown in fig. 1, in the conventional method the input is an image, and the processing proceeds as follows: the image is encoded by an image space encoder to obtain image-view features; the image-view features are then converted into bird's-eye-view features by a view transformation; feature enhancement is then performed in the bird's-eye-view space by an additional feature encoder; and finally target variable detection is performed by a target variable detection module.
Some of the terms referred to herein are explained as follows:
Feature: essentially one or more variables that can store certain values; the definition and use of these values are learned by a neural network. For example, when an algorithm is used to determine whether a photo is a photo of a cat, a feature may be a number in the range [0, 1]. The magnitude of this value expresses the degree to which the algorithm judges the photo to be a cat: 0 means not a cat, 1 means a cat, and 0.5 means the algorithm considers there to be a 50% probability that it is a cat.
Feature enhancement: in essence, a new feature obtained by computation in the neural network; for example, the feature value mentioned above is enhanced from 0.5 to 0.8, so that the confidence that the picture is a cat picture is raised from 50% to 80%.
The image space encoder, the feature encoder of the bird's-eye-view space (the BEV space encoder), and the target variable detection module are all deep neural networks; they acquire their functions through training with manually labeled data and gradient back-propagation.
Bird's-eye view angle: bird-eye-view, also abbreviated as BEV.
BEV perception: the perception of the target variable on the ground level, typically measured in meters, is defined as distinguished from the perception of the image perspective: the target variable perceived by the image perspective is defined in the image plane in pixels. Common BEV perceptions include semantic segmentation, target detection, and the like.
Timing sequence information: the relationship between the contents of successive frames of data (e.g., images) constitutes timing information, but it does not take advantage of timing for BEV perception.
Therefore, in the conventional scheme, since only the current frame data (image) is used as the detection basis, time-series information is lacked, and thus variables such as the speed of the target cannot be detected well.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided a target variable detection method for a carrier, which may include: performing image space encoding on an input image sequence to obtain image-view features corresponding to image frames at different moments; performing view transformation on the image-view features to obtain bird's-eye-view features; fusing the bird's-eye-view feature of the frame at the current moment with the bird's-eye-view feature of the frame at the previous moment to obtain a fused bird's-eye-view feature; performing bird's-eye-view spatial feature encoding on the fused bird's-eye-view feature to obtain an enhanced bird's-eye-view feature; and performing target variable detection based on the enhanced bird's-eye-view feature.
In a further embodiment, the method further comprises: aligning the bird's-eye-view feature of the frame at the current moment with the bird's-eye-view feature of the frame at the previous moment to obtain aligned bird's-eye-view features, wherein the aligning is performed based on the displacement of the carrier from the previous moment to the current moment, and wherein the fusing comprises fusing the aligned bird's-eye-view features of the frame at the current moment and the frame at the previous moment to obtain the fused bird's-eye-view feature.
In one embodiment, for spatial coordinates at which no feature exists after alignment, a feature is supplemented automatically at those coordinates.
In various embodiments, the specific value supplemented automatically includes at least one of: a fixed value, a random value, and the feature value nearest to that coordinate position.
In various embodiments, the fusing operation comprises at least one of: concatenating the bird's-eye-view feature of the frame at the current moment with the bird's-eye-view feature of the frame at the previous moment, adding the bird's-eye-view feature of the frame at the current moment to the bird's-eye-view feature of the frame at the previous moment, and subtracting the bird's-eye-view feature of the frame at the previous moment from the bird's-eye-view feature of the frame at the current moment.
In various embodiments, the target variable comprises at least one of: speed, location, size, orientation, and category of the target.
According to another aspect of the present disclosure, there is provided a system for target variable detection of a carrier, the system comprising: an image space encoding module configured to encode an input image sequence to obtain image view characteristics at respective different time points; a perspective transformation module configured to transform the image perspective feature into a bird's-eye perspective feature; a fusion module configured to fuse the bird's-eye view angle feature of the frame at the current time with the bird's-eye view angle feature of the frame at the previous time to obtain a fused bird's-eye view angle feature; a bird's-eye view feature BEV spatial feature encoding module configured to perform bird's-eye view spatial feature encoding on the fused bird's-eye view feature to obtain an enhanced bird's-eye view feature; and a target variable detection module configured to perform target variable detection based on the enhanced bird's-eye view angle feature.
In a further embodiment, the system further comprises: an alignment module configured to align the bird's-eye view angle feature of the frame at the current time with the bird's-eye view angle feature of the frame at the previous time to obtain an aligned bird's-eye view angle feature, and wherein the alignment is performed based on an amount of displacement of the carrier from the previous time to the current time, and wherein the fusion module is further configured to fuse the aligned bird's-eye view angle features of the frame at the current time and the frame at the previous time to obtain a fused bird's-eye view angle feature.
In one embodiment, for spatial coordinates at which no feature exists after alignment, a feature is supplemented automatically at those coordinates.
In various embodiments, the specific value supplemented automatically includes at least one of: a fixed value, a random value, and the feature value nearest to that coordinate position.
In various embodiments, the fusing operation comprises at least one of: concatenating the bird's-eye-view feature of the frame at the current moment with the bird's-eye-view feature of the frame at the previous moment, adding the bird's-eye-view feature of the frame at the current moment to the bird's-eye-view feature of the frame at the previous moment, and subtracting the bird's-eye-view feature of the frame at the previous moment from the bird's-eye-view feature of the frame at the current moment.
In various embodiments, the target variable comprises at least one of: speed, location, size, orientation, and category of the target.
In one embodiment, the fusion module is implemented as a deep neural network.
In one embodiment, the alignment module is implemented as a deep neural network.
In various embodiments, the carrier comprises at least one of: cars, people, aircraft, and robots.
In one embodiment, the sequence of images is acquired by an imaging system disposed in a carrier.
In another embodiment, the sequence of images is acquired by an imaging system arranged independently of the carrier.
Drawings
FIG. 1 is a schematic diagram illustrating a method of target variable detection according to the prior art;
FIG. 2 illustrates a method schematic of target variable detection according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a method of target variable detection with an added alignment step according to an embodiment of the disclosure;
FIG. 4 shows a schematic diagram of the use of a concatenation operation in feature map processing;
FIG. 5 shows a schematic diagram of the use of both concatenation and alignment operations in feature map processing;
FIG. 6 shows a schematic diagram of a simplified method according to an embodiment of the disclosure;
FIG. 7 shows a schematic diagram of an alignment operation according to an embodiment of the present disclosure;
FIG. 8 shows a system framework schematic in accordance with an embodiment of the present disclosure.
Detailed Description
Example methods and systems are described herein. Any implementation or feature described herein as "exemplary," or "illustrative" is not necessarily to be construed as preferred or advantageous over other implementations or features. The example implementations described herein are not intended to be limiting. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein.
In addition, the particular arrangements shown in the drawings should not be considered limiting. It should be understood that other implementations may include more or less of each of the elements shown in a given figure. In addition, some of the illustrated elements may be combined or omitted. Moreover, example implementations may include elements not illustrated in the figures.
In the existing image processing method, perception uses only the current frame data and therefore lacks temporal information, so these target variables cannot be detected well. One of the technical problems to be solved by the present disclosure is how to use data from adjacent frames to construct features that carry temporal information, so as to detect variables such as the speed of a target effectively.
According to the present method and apparatus, on the basis of the existing scheme, the previous-frame data and the current-frame data are used together as the detection basis, and predictive inference is then performed on the detection result, which effectively improves the ability to estimate and predict the target's speed. The predictive inference algorithm is built with a deep neural network and trained with manually labeled data.
Specifically, the principle of the present disclosure is illustrated in fig. 2. Unlike the existing scheme, when detecting a target based on the data of the current frame (the frame at time T), the method also utilizes the intermediate features of the previous frame (the frame at time T-1); the feature values obtained by fusing (for example, concatenating) the features of the two adjacent frames make the target variable detection more accurate. In a further embodiment, the features obtained by alignment (Align) and fusion (e.g., concatenation) may replace the original features as the basis for subsequent prediction, as shown in fig. 3.
Although alignment is the preferred embodiment, unaligned features can also be fused. One preferred embodiment of the present disclosure introduces an alignment operation into the feature processing, for the following reason: the receptive field of the feature map (that is, the spatial extent, or region of interest, covered by the feature map) is a region around the carrier itself (labeled O_e in FIGS. 4 and 5), and this receptive field moves together with the carrier. As a result, the same target (in FIGS. 4 and 5 the static target is labeled O_s and the dynamic target is labeled O_m) appears at different positions in the feature maps of different frames. Fusing unaligned feature maps may therefore degrade the perception effect because of this deviation in target position. Specifically, as shown in FIG. 4, after the frame features at time T-1 and time T are concatenated (abbreviated "C" in the figure) and fused, the stationary target O_s may be identified as a moving target, and the dynamic target O_m may be identified as having a larger displacement between time T-1 and time T than it actually has. Translating the feature map according to the translation of the carrier achieves alignment (abbreviated "A" in the figure), and fusing the aligned feature maps enhances the features, effectively improving the perception effect. For example, as shown in FIG. 5, in the result of fusing the aligned feature maps, the stationary target O_s remains stationary, and the displacement identified for the dynamic target O_m between time T-1 and time T is closer to the true value.
The fusion operation may involve various feature-value computations, including but not limited to adding the feature values of two adjacent image frames, concatenating the features of two adjacent image frames, subtracting the feature values of two adjacent image frames, and the like. Specifically, operations such as concatenation, addition, and subtraction may be performed on the bird's-eye-view feature of the current frame and the bird's-eye-view feature of the previous frame.
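To make the fusion options concrete, the following is a minimal sketch only, assuming the bird's-eye-view features are PyTorch tensors of shape (N, C, H, W); the function name and the mode strings are illustrative and not part of the disclosure:

import torch

def fuse_bev_features(bev_t: torch.Tensor,
                      bev_t_minus_1: torch.Tensor,
                      mode: str = "concat") -> torch.Tensor:
    """Fuse the BEV feature of the current frame (T) with the (optionally
    aligned) BEV feature of the previous frame (T-1).

    Both inputs are assumed to have shape (N, C, H, W).
    """
    if mode == "concat":
        # Channel-wise concatenation: output shape (N, 2C, H, W).
        return torch.cat([bev_t, bev_t_minus_1], dim=1)
    if mode == "add":
        # Element-wise addition: output shape (N, C, H, W).
        return bev_t + bev_t_minus_1
    if mode == "subtract":
        # Current minus previous, consistent with the worked example below.
        return bev_t - bev_t_minus_1
    raise ValueError(f"unknown fusion mode: {mode}")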
One method according to the present disclosure includes the steps of:
as shown in fig. 6, in step 1, an image sequence is captured with an imaging system in the carrier, the image sequence comprising a sequence of image frames at a plurality of different moments in time (e.g. a T-1 point in time, a T point in time and a T +1 point in time), and the captured image sequence is input into an image spatial encoding module. In various embodiments, the carrier may be a vehicle, a human, an aircraft, or a robot, among others. In an alternative embodiment, the imaging system may also be an imaging system arranged separately from the carrier, in which case the imaging system may be fixed or movable.
In step 2, the image sequence is encoded by the image space encoding module to obtain image view characteristics corresponding to the image frames at different time instants.
In step 3, the image-view features are converted into bird's-eye-view features by the view transformation module.
In optional step 4, the bird's-eye view characteristic of the current frame (T) is aligned with the bird's-eye view characteristic of the previous frame (T-1) to obtain an aligned bird's-eye view characteristic. In an embodiment the alignment is based on the amount of displacement of the carrier during capturing of adjacent image frames, e.g. the amount of displacement of the carrier from a previous time instant, e.g. the T-1 time point, to a current time instant, e.g. the T time point. In a further embodiment, the amount of displacement may be determined based on a speed of movement of the carrier. In one embodiment, when the displacement amount of the carrier is 0, the alignment operation may be omitted.
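At the level of a two-dimensional BEV feature map, the displacement-based alignment of step 4 can be sketched as a simple translation with automatic filling of vacated cells. The helper below is an illustration only; the sign convention, the cell-size parameter, and the zero fill are assumptions, not the disclosure's prescription:

import torch
import torch.nn.functional as F

def align_bev_by_displacement(bev_prev: torch.Tensor,
                              displacement_m: tuple,
                              cell_size_m: float,
                              fill_value: float = 0.0) -> torch.Tensor:
    """Translate the previous-frame BEV feature map (N, C, H, W) so that it
    lines up with the current frame's BEV grid.

    displacement_m: carrier displacement (dx, dy) in meters from T-1 to T.
    cell_size_m:    size of one BEV grid cell in meters.
    Cells with no source feature after the shift receive fill_value.
    """
    dx = int(round(displacement_m[0] / cell_size_m))   # shift in columns
    dy = int(round(displacement_m[1] / cell_size_m))   # shift in rows
    if dx == 0 and dy == 0:
        return bev_prev                                # no motion: nothing to align
    # Pad symmetrically, then crop a window offset by (dy, dx) cells.
    padded = F.pad(bev_prev, (abs(dx), abs(dx), abs(dy), abs(dy)), value=fill_value)
    _, _, h, w = bev_prev.shape
    top, left = abs(dy) + dy, abs(dx) + dx
    return padded[:, :, top:top + h, left:left + w]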
As described above, fusing misaligned feature maps may cause problems such as ambiguity in the target position, degrading the perception effect. By translating the feature maps according to the translation of the carrier, alignment can be achieved; fusing the aligned feature maps then enhances the features and effectively improves the perception effect.
Without loss of generality, the alignment process is described using a feature with one spatial dimension as an example, with reference to fig. 7. Suppose the bird's-eye-view feature of the current frame (T) is a one-dimensional feature, denoted [a, b, c, d, e], and that the one-dimensional spatial coordinates of its feature values in the image frame are [1, 2, 3, 4, 5] meters. Suppose the corresponding feature of the previous frame (T-1) has the values [f, g, h, i, j]. Because the carrier of the imaging system moved (assume it moved one meter to the left between the time the previous frame T-1 was captured and the time the current frame T was captured), the coordinates of the T-1 feature in the one-dimensional space are [2, 3, 4, 5, 6]. The alignment operation uses the coordinates of the current frame (T) bird's-eye-view feature in the one-dimensional space ([1, 2, 3, 4, 5]) as an index and looks up the features at the same locations among the features of the previous frame (T-1). For example, aligning to the current-frame coordinates [1, 2, 3, 4, 5], the features found at the corresponding positions (coordinates [1, 2, 3, 4, 5]) in the previous frame are [0, f, g, h, i]. The value 0 appears because the T-1 frame has no feature at spatial coordinate 1, so an automatic feature supplement is performed; the specific supplemented value may vary and can be a fixed value, a random value, or the value nearest to that coordinate position (here, f).
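The one-dimensional example above can be reproduced directly. The sketch below is illustrative only; the function name and the fill options are assumptions that mirror the fixed-value, random-value, and nearest-value variants mentioned in the description:

import random

def align_previous_frame(prev_feature, prev_coords, cur_coords, fill="zero"):
    """For every current-frame coordinate, look up the previous-frame feature
    at the same spatial position; positions with no previous-frame feature
    are supplemented automatically ("zero", "random", or "nearest")."""
    lookup = dict(zip(prev_coords, prev_feature))
    aligned = []
    for c in cur_coords:
        if c in lookup:
            aligned.append(lookup[c])
        elif fill == "nearest":
            # Take the feature value whose coordinate is closest to c.
            nearest = min(prev_coords, key=lambda p: abs(p - c))
            aligned.append(lookup[nearest])
        elif fill == "random":
            aligned.append(random.random())
        else:
            aligned.append(0)   # fixed value
    return aligned

# Worked example from the description: the carrier moved one meter to the left
# between T-1 and T, so the T-1 features now sit at coordinates [2, 3, 4, 5, 6].
cur_feature  = ["a", "b", "c", "d", "e"]    # frame T,   coordinates [1, 2, 3, 4, 5]
prev_feature = ["f", "g", "h", "i", "j"]    # frame T-1, coordinates [2, 3, 4, 5, 6]
aligned_prev = align_previous_frame(prev_feature, [2, 3, 4, 5, 6], [1, 2, 3, 4, 5])
assert aligned_prev == [0, "f", "g", "h", "i"]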
In step 5, the bird's-eye view angle characteristic of the current frame (T) and the bird's-eye view angle characteristic of the previous frame (T-1) are fused to obtain a fused bird's-eye view angle characteristic.
On the basis of the existing scheme, both the previous-frame data and the current-frame data are used as the detection basis, and predictive inference is further performed on that basis, which effectively improves the ability to estimate and predict the target's speed.
In one embodiment, with the above alignment step, assume that the feature of the current frame (the frame captured at time T) at coordinates [1, 2, 3, 4, 5] is [a, b, c, d, e], and that the feature of the previous frame (the frame captured at time T-1, adjacent to the frame captured at time T) at the same positions is [0, f, g, h, i] after alignment, as shown in fig. 7. In a specific embodiment, the fusion uses concatenation; specifically, the concatenated feature at coordinates [1, 2, 3, 4, 5] may be [a, b, c, d, e, 0, f, g, h, i]. Note that, in other embodiments, the fusion may include adding the bird's-eye-view features of the two adjacent image frames (in the above example, the fusion result of the bird's-eye-view features of the current frame (T) and the previous frame (T-1) is [a, b+f, c+g, d+h, e+i]), subtracting the bird's-eye-view feature of the previous frame (T-1) from that of the current frame (T) (in the above example, the result is [a, b-f, c-g, d-h, e-i]), and so on.
In another embodiment, without the above alignment step, assume that the feature of the current frame (the frame captured at time T) at coordinates [1, 2, 3, 4, 5] is [a, b, c, d, e], and that the corresponding feature of the previous frame (the frame captured at time T-1, adjacent to the frame captured at time T) is [f, g, h, i, j]; the concatenated feature at coordinates [1, 2, 3, 4, 5] may then be [a, b, c, d, e, f, g, h, i, j]. Note that, in other embodiments, the fusion may include adding the bird's-eye-view feature of the current frame (T) to that of the previous frame (T-1) (in the above example, the fusion result is [a+f, b+g, c+h, d+i, e+j]), subtracting the bird's-eye-view feature of the previous frame (T-1) from that of the current frame (T) (in the above example, the result is [a-f, b-g, c-h, d-i, e-j]), and so on.
In step 6, the fused bird's-eye-view feature is feature-enhanced by using a BEV spatial feature encoding module. In one embodiment, the BEV spatial feature encoding module may be implemented as a deep neural network.
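As one concrete possibility (the disclosure does not prescribe a particular network design), the BEV spatial feature encoding module could be a small convolutional network over the fused BEV map; the layer choices below are assumptions for illustration:

import torch.nn as nn

class BEVFeatureEncoder(nn.Module):
    """Illustrative BEV-space feature encoder: a small stack of 2-D
    convolutions that refines (enhances) the fused BEV feature map."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, fused_bev):       # fused_bev: (N, in_channels, H, W)
        return self.net(fused_bev)      # enhanced BEV feature: (N, out_channels, H, W)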
In step 7, a target variable detection module is used to perform target variable detection based on the output of step 6, i.e., the feature enhancement result, wherein the variables may include at least one of speed, position, size, orientation, and category. In one embodiment, the target variable detection module may be implemented as a deep neural network.
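Likewise, the target variable detection module could be realized as a set of per-cell prediction branches over the enhanced BEV feature; the branch names and channel counts below are illustrative assumptions rather than the disclosure's specification:

import torch.nn as nn

class TargetVariableHead(nn.Module):
    """Illustrative detection head predicting, per BEV cell, the target
    variables mentioned in the description: speed, position, size,
    orientation, and category."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.velocity = nn.Conv2d(in_channels, 2, kernel_size=1)   # (vx, vy)
        self.position = nn.Conv2d(in_channels, 2, kernel_size=1)   # offset to object center
        self.size     = nn.Conv2d(in_channels, 3, kernel_size=1)   # length, width, height
        self.heading  = nn.Conv2d(in_channels, 2, kernel_size=1)   # (sin, cos) of yaw
        self.category = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, enhanced_bev):    # enhanced_bev: (N, in_channels, H, W)
        return {
            "velocity": self.velocity(enhanced_bev),
            "position": self.position(enhanced_bev),
            "size":     self.size(enhanced_bev),
            "heading":  self.heading(enhanced_bev),
            "category": self.category(enhanced_bev),
        }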
In a more specific embodiment, the image space encoding module, the view transformation module, the BEV spatial feature encoding module, and the target variable detection module may each be implemented as a deep neural network, and acquire their corresponding functions through training with manually labeled data and gradient back-propagation. The deep neural network design of these modules is not limited to any specific architecture; any suitable design may be used.
In one specific embodiment, the carrier is vehicle A, the imaging system is the on-board binocular camera of vehicle A, and vehicle B is the target of interest. While vehicle A is operating, its on-board binocular camera captures images in which vehicle B is in motion. The frame captured at time T and the frame captured at time T-1 can be fused, and the result finally obtained by the method of the present disclosure includes data such as the direction and speed of motion of vehicle B, so that the state of vehicle B can be judged effectively and accurately. The camera in this embodiment may be any camera capable of capturing images, such as a monocular camera, a binocular camera, or a panoramic camera. Those skilled in the art can also envision applying the method according to the present disclosure in products such as aircraft and action cameras.
In another specific embodiment, for simplicity of description, consider a robot that delivers dishes in a restaurant: the carrier is the dish-delivering robot, and the imaging system is a camera on the robot. The camera on the robot captures images, and the target may be a certain area in front of the robot. By fusing the frame captured at time T with the frame captured at time T-1 and performing target variable detection on the fused result, the specific situation in the area covered by the camera can be judged more accurately.
As shown in fig. 8, a system according to the present disclosure may include the following modules: an image space encoding module, a view transformation module, a fusion module, a BEV spatial feature encoding module, and a target variable detection module. In a preferred embodiment, the system may further include an optional alignment module.
In more particular embodiments, the image space encoding module may be configured to encode the input image sequence to obtain the image-view features at the respective different moments. The view transformation module may be configured to transform the image-view features into bird's-eye-view features. The fusion module may be configured to fuse the bird's-eye-view feature of the frame at the current moment with the bird's-eye-view feature of the frame at the previous moment to obtain a fused bird's-eye-view feature. The BEV spatial feature encoding module may be configured to perform bird's-eye-view spatial feature encoding on the fused bird's-eye-view feature to obtain an enhanced bird's-eye-view feature. The target variable detection module may be configured to perform target variable detection based on the enhanced bird's-eye-view feature.
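Putting the pieces together, one possible wiring of the modules in FIG. 8 is sketched below, with the sub-modules passed in as placeholders; the constructor arguments, the alignment callable, and the fusion modes are assumptions for illustration, not the disclosure's required implementation:

import torch
import torch.nn as nn

class TargetVariableDetectionSystem(nn.Module):
    """Illustrative composition of the modules shown in FIG. 8."""

    def __init__(self, image_encoder: nn.Module, view_transform: nn.Module,
                 bev_encoder: nn.Module, detection_head: nn.Module,
                 align_fn=None, fuse_mode: str = "concat"):
        super().__init__()
        self.image_encoder = image_encoder      # image space encoding module
        self.view_transform = view_transform    # view transformation module
        self.bev_encoder = bev_encoder          # BEV spatial feature encoding module
        self.detection_head = detection_head    # target variable detection module
        self.align_fn = align_fn                # optional alignment module
        self.fuse_mode = fuse_mode

    def forward(self, image_t, image_t_minus_1, ego_displacement=None):
        # Image-space encoding followed by view transformation, per frame.
        bev_t = self.view_transform(self.image_encoder(image_t))
        bev_prev = self.view_transform(self.image_encoder(image_t_minus_1))

        # Optional alignment: shift the previous-frame BEV feature by the
        # displacement of the carrier between T-1 and T.
        if self.align_fn is not None and ego_displacement is not None:
            bev_prev = self.align_fn(bev_prev, ego_displacement)

        # Fusion of the two adjacent frames' BEV features.
        if self.fuse_mode == "concat":
            fused = torch.cat([bev_t, bev_prev], dim=1)
        elif self.fuse_mode == "add":
            fused = bev_t + bev_prev
        else:                              # "subtract": current minus previous
            fused = bev_t - bev_prev

        # BEV-space feature enhancement, then target variable detection.
        enhanced = self.bev_encoder(fused)
        return self.detection_head(enhanced)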
The method and system according to the present disclosure can improve perception capability by using multi-frame information. The beneficial effects of applying the system and method according to the present disclosure to the field of automatic driving (i.e., the carrier is embodied as an autonomous vehicle) include at least the following aspects. On the one hand, the detection of targets by the autonomous vehicle is improved; for example, when a target is partially or completely occluded for a short time, detection can still be performed by referring to information from adjacent frames. On the other hand, by comparing the position difference of the features in two adjacent frames, the autonomous vehicle can estimate variables of moving objects, where the variables may include speed, position, size, category, and the like.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware (including the structures disclosed in this specification and their structural equivalents), or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or additionally, program instructions may be encoded on an artificially generated propagated signal (e.g., a machine-generated electrical, optical, or electromagnetic signal) that is generated to encode information for transmission to suitable receiver apparatus for execution by data processing apparatus.
The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be or further comprise special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for the computer program, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program (which may also be referred to or described as a program, software application, app, module, software module, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and a computer program can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the application, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and in combination with, special purpose logic circuitry, e.g., an FPGA or an ASIC.
A computer suitable for executing a computer program may be based on a general purpose or special purpose microprocessor or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for executing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Additionally, the computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; a magneto-optical disk; and CD-ROM and DVD-ROM disks.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings and described in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, but rather it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Specific embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims (17)

1. A target variable detection method for a carrier, comprising:
performing image space encoding on an input image sequence to obtain image-view features corresponding to image frames at different moments;
performing view transformation on the image-view features to obtain bird's-eye-view features;
fusing the bird's-eye-view feature of the frame at the current moment with the bird's-eye-view feature of the frame at the previous moment to obtain a fused bird's-eye-view feature;
performing bird's-eye-view spatial feature encoding on the fused bird's-eye-view feature to obtain an enhanced bird's-eye-view feature; and
performing target variable detection based on the enhanced bird's-eye-view feature.
2. The method of claim 1, further comprising: aligning the bird's-eye view angle characteristic of the frame at the current time with the bird's-eye view angle characteristic of the frame at the previous time to obtain an aligned bird's-eye view angle characteristic, wherein the alignment is performed based on the displacement amount of the carrier from the previous time to the current time, and
the fusion comprises the step of fusing the aligned aerial view angle characteristics of the frame at the current moment and the frame at the previous moment to obtain the fused aerial view angle characteristics.
3. The method of claim 2, wherein the spatial coordinates are automatically feature supplemented for the absence of features in the aligned spatial coordinates.
4. The method of claim 3, wherein the specific values of automatic feature supplementation include at least one of: a fixed value, a random value, a characteristic value nearest to its coordinate position.
5. The method of any of claims 1 to 4, wherein the fusing operation comprises at least one of: concatenating the bird's-eye-view feature of the frame at the current moment with the bird's-eye-view feature of the frame at the previous moment, adding the bird's-eye-view feature of the frame at the current moment to the bird's-eye-view feature of the frame at the previous moment, and subtracting the bird's-eye-view feature of the frame at the previous moment from the bird's-eye-view feature of the frame at the current moment.
6. The method of any of claims 1 to 5, wherein the target variable comprises at least one of: speed, location, size, orientation, and category of the target.
7. A system for target variable detection for a carrier, comprising:
an image space encoding module configured to encode an input image sequence to obtain image view characteristics at respective different time points;
a perspective transformation module configured to transform the image perspective characteristics into bird's-eye perspective characteristics;
a fusion module configured to fuse the bird's-eye view angle feature of the frame at the current time with the bird's-eye view angle feature of the frame at the previous time to obtain a fused bird's-eye view angle feature;
a bird's-eye view feature BEV spatial feature encoding module configured to perform bird's-eye view spatial feature encoding on the fused bird's-eye view feature to obtain an enhanced bird's-eye view feature; and
a target variable detection module configured to perform target variable detection based on the enhanced bird's-eye view feature.
8. The system of claim 7, further comprising: an alignment module configured to align the bird's-eye view angle feature of the frame at the current time with the bird's-eye view angle feature of the frame at the previous time to obtain an aligned bird's-eye view angle feature, and wherein the alignment is performed based on an amount of displacement of the carrier from the previous time to the current time, and
the fusion module is further configured to fuse the aligned bird's-eye view angle characteristics of the current frame and the previous frame to obtain a fused bird's-eye view angle characteristic.
9. The system of claim 8, wherein the spatial coordinates are automatically feature supplemented for the absence of features in the aligned spatial coordinates.
10. The system of claim 9, wherein the specific values of automatic feature supplementation include at least one of: a fixed value, a random value, a feature value nearest to its coordinate position.
11. The system of any of claims 7 to 10, wherein the fusing operation comprises at least one of: concatenating the bird's-eye-view feature of the frame at the current moment with the bird's-eye-view feature of the frame at the previous moment, adding the bird's-eye-view feature of the frame at the current moment to the bird's-eye-view feature of the frame at the previous moment, and subtracting the bird's-eye-view feature of the frame at the previous moment from the bird's-eye-view feature of the frame at the current moment.
12. The system of any of claims 7 to 10, wherein the target variables include at least one of: speed, location, size, orientation, and category of the target.
13. The system of any one of claims 7 to 12, wherein the fusion module is implemented as a deep neural network.
14. The system of any of claims 8 to 13, wherein the alignment module is implemented as a deep neural network.
15. The method of any one of claims 1 to 6 or the system of claims 7 to 14, wherein the carrier comprises at least one of: cars, people, aircraft, and robots.
16. The method or system of claim 15, wherein the sequence of images is acquired by an imaging system disposed in a carrier.
17. The method or system of claim 15, wherein the sequence of images is acquired by an imaging system disposed independently of the carrier.
CN202210346628.1A 2022-03-31 2022-03-31 Target variable detection method and system for carrier Pending CN114792414A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210346628.1A CN114792414A (en) 2022-03-31 2022-03-31 Target variable detection method and system for carrier

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210346628.1A CN114792414A (en) 2022-03-31 2022-03-31 Target variable detection method and system for carrier

Publications (1)

Publication Number Publication Date
CN114792414A true CN114792414A (en) 2022-07-26

Family

ID=82462728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210346628.1A Pending CN114792414A (en) 2022-03-31 2022-03-31 Target variable detection method and system for carrier

Country Status (1)

Country Link
CN (1) CN114792414A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115345321A (en) * 2022-10-19 2022-11-15 小米汽车科技有限公司 Data augmentation method, data augmentation device, electronic device, and storage medium
CN116246235A (en) * 2023-01-06 2023-06-09 吉咖智能机器人有限公司 Target detection method and device based on traveling and parking integration, electronic equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105946853A (en) * 2016-04-28 2016-09-21 中山大学 Long-distance automatic parking system and method based on multi-sensor fusion
CN110807379A (en) * 2019-10-21 2020-02-18 腾讯科技(深圳)有限公司 Semantic recognition method and device and computer storage medium
CN111583220A (en) * 2020-04-30 2020-08-25 腾讯科技(深圳)有限公司 Image data detection method and device
CN111860442A (en) * 2020-07-31 2020-10-30 浙江工业大学 Video target detection method based on time sequence feature sharing structure
US20210150203A1 (en) * 2019-11-14 2021-05-20 Nec Laboratories America, Inc. Parametric top-view representation of complex road scenes
CN113808013A (en) * 2020-06-17 2021-12-17 威达斯高级驾驶辅助设备有限公司 Method and device for generating aerial view point image

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105946853A (en) * 2016-04-28 2016-09-21 中山大学 Long-distance automatic parking system and method based on multi-sensor fusion
CN110807379A (en) * 2019-10-21 2020-02-18 腾讯科技(深圳)有限公司 Semantic recognition method and device and computer storage medium
US20210150203A1 (en) * 2019-11-14 2021-05-20 Nec Laboratories America, Inc. Parametric top-view representation of complex road scenes
CN111583220A (en) * 2020-04-30 2020-08-25 腾讯科技(深圳)有限公司 Image data detection method and device
CN113808013A (en) * 2020-06-17 2021-12-17 威达斯高级驾驶辅助设备有限公司 Method and device for generating aerial view point image
CN111860442A (en) * 2020-07-31 2020-10-30 浙江工业大学 Video target detection method based on time sequence feature sharing structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
栗科峰 (Li Kefeng): "人脸图像处理与识别技术" [Face Image Processing and Recognition Technology], 30 September 2018, pages 18-19 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115345321A (en) * 2022-10-19 2022-11-15 小米汽车科技有限公司 Data augmentation method, data augmentation device, electronic device, and storage medium
CN116246235A (en) * 2023-01-06 2023-06-09 吉咖智能机器人有限公司 Target detection method and device based on traveling and parking integration, electronic equipment and medium
CN116246235B (en) * 2023-01-06 2024-06-11 吉咖智能机器人有限公司 Target detection method and device based on traveling and parking integration, electronic equipment and medium

Similar Documents

Publication Publication Date Title
US11436743B2 (en) Systems and methods for semi-supervised depth estimation according to an arbitrary camera
CN112912920B (en) Point cloud data conversion method and system for 2D convolutional neural network
US11482014B2 (en) 3D auto-labeling with structural and physical constraints
WO2019230339A1 (en) Object identification device, system for moving body, object identification method, training method of object identification model, and training device for object identification model
US11064178B2 (en) Deep virtual stereo odometry
Levinson et al. Traffic light mapping, localization, and state detection for autonomous vehicles
US20200041276A1 (en) End-To-End Deep Generative Model For Simultaneous Localization And Mapping
WO2020172875A1 (en) Method for extracting road structure information, unmanned aerial vehicle, and automatic driving system
EP3822852B1 (en) Method, apparatus, computer storage medium and program for training a trajectory planning model
US11900626B2 (en) Self-supervised 3D keypoint learning for ego-motion estimation
CN114792414A (en) Target variable detection method and system for carrier
US11704822B2 (en) Systems and methods for multi-camera modeling with neural camera networks
AU2018209336A1 (en) Determining the location of a mobile device
US20220051425A1 (en) Scale-aware monocular localization and mapping
US11994408B2 (en) Incremental map building using learnable features and descriptors
US20210398301A1 (en) Camera agnostic depth network
US11494927B2 (en) Systems and methods for self-supervised depth estimation
EP4102403A1 (en) Platform for perception system development for automated driving system
US20220301206A1 (en) System and method to improve multi-camera monocular depth estimation using pose averaging
US11652972B2 (en) Systems and methods for self-supervised depth estimation according to an arbitrary camera
CN116469079A (en) Automatic driving BEV task learning method and related device
JP2022164640A (en) System and method for dataset and model management for multi-modal auto-labeling and active learning
Deigmoeller et al. Stereo visual odometry without temporal filtering
US11508080B2 (en) Systems and methods for generic visual odometry using learned features via neural camera models
US20210407117A1 (en) System and method for self-supervised monocular ground-plane extraction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination