CN114757301A - Vehicle-mounted visual perception method and device, readable storage medium and electronic equipment - Google Patents

Vehicle-mounted visual perception method and device, readable storage medium and electronic equipment

Info

Publication number
CN114757301A
CN114757301A
Authority
CN
China
Prior art keywords
vehicle
coordinate system
fusion
feature map
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210514756.2A
Other languages
Chinese (zh)
Inventor
朱红梅
孟文明
王梦圆
李翔宇
任伟强
张骞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN202210514756.2A
Publication of CN114757301A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

The embodiments of the disclosure disclose a vehicle-mounted visual perception method and device, a readable storage medium, and an electronic device. The method comprises the following steps: acquiring images at a plurality of continuous moments through a plurality of vehicle-mounted cameras arranged at preset positions on a vehicle to obtain a plurality of image frame sets, where each image frame set comprises a plurality of image frames collected by the same vehicle-mounted camera and each vehicle-mounted camera corresponds to one image frame at each moment; performing feature extraction on the image frames in the plurality of image frame sets through a coding branch network in a first network model to obtain a plurality of first feature maps; performing spatial fusion and time-sequence fusion on the plurality of first feature maps through a fusion branch network in the first network model to obtain a second feature map in a bird's-eye view image coordinate system; and identifying the second feature map based on a network model corresponding to a preset perception task, and determining a perception result corresponding to the preset perception task.

Description

Vehicle-mounted visual perception method and device, readable storage medium and electronic equipment
Technical Field
The disclosure relates to the technical field of automatic driving, in particular to a vehicle-mounted visual perception method and device, a readable storage medium and electronic equipment.
Background
In the field of automatic driving, the perception results produced by the multiple vehicle-mounted cameras of an exterior sensing system, each in its own coordinate system, cannot be used directly by the subsequent prediction and planning system; the dynamic and static perception results from different view angles must be fused in some manner and expressed uniformly in the own vehicle coordinate system. In the prior art, the fusion of different view angles is usually performed separately through post-processing.
Disclosure of Invention
The present disclosure is proposed to solve the above technical problems. The embodiment of the disclosure provides a vehicle-mounted visual perception method and device, a readable storage medium and electronic equipment.
According to an aspect of the disclosed embodiments, there is provided a vehicle-mounted visual perception method, including:
acquiring images at a plurality of continuous moments by a plurality of vehicle-mounted cameras arranged at preset positions on a vehicle to obtain a plurality of image frame sets; each image frame set comprises a plurality of image frames collected by the same vehicle-mounted camera, and each vehicle-mounted camera corresponds to one image frame at each moment;
performing feature extraction on the image frames in the plurality of image frame sets through a coding branch network in a first network model to obtain a plurality of first feature maps;
performing space fusion and time sequence fusion on the plurality of first feature maps through a fusion branch network in the first network model to obtain a second feature map under a bird's-eye view image coordinate system;
and identifying the second feature map based on a network model corresponding to a preset perception task, and determining a perception result corresponding to the preset perception task.
According to another aspect of the embodiments of the present disclosure, there is provided an in-vehicle visual perception device, including:
the image acquisition module is used for acquiring images at a plurality of continuous moments through a plurality of vehicle-mounted cameras arranged at preset positions on a vehicle to obtain a plurality of image frame sets; each image frame set comprises a plurality of image frames collected by the same vehicle-mounted camera, and each image frame corresponds to one moment;
the encoding module is used for carrying out feature extraction on image frames in a plurality of image frame sets obtained by the image acquisition module through an encoding branch network in a first network model to obtain a plurality of first feature maps;
the fusion module is used for performing space fusion and time sequence fusion on the plurality of first feature maps determined by the encoding module through a fusion branch network in the first network model to obtain a second feature map under a bird's-eye view image coordinate system;
and the perception module is used for identifying the second characteristic diagram determined by the fusion module based on a network model corresponding to a preset perception task to obtain a perception result corresponding to the preset perception task.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the in-vehicle visual perception method according to any one of the embodiments.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the vehicle-mounted visual perception method according to any one of the above embodiments.
Based on the vehicle-mounted visual perception method and device, the readable storage medium, and the electronic device provided by the embodiments of the present disclosure, spatial fusion and time-sequence fusion are realized inside the first network model, thereby realizing end-to-end learning of the network. Because no post-processing fusion is required, the complexity of performing spatial fusion and time-sequence fusion on images during post-processing is effectively avoided, as is the situation in which the same target is mistakenly identified as multiple targets in post-processing.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a schematic structural diagram of a first network model provided in an exemplary embodiment of the present disclosure.
Fig. 2 is a schematic flowchart of a vehicle-mounted visual perception method according to an exemplary embodiment of the present disclosure.
FIG. 3 is a schematic flow chart of step 203 in the embodiment shown in FIG. 2 according to the present disclosure.
FIG. 4 is a schematic flow chart of step 2031 in the embodiment of FIG. 3 according to the present disclosure.
FIG. 5a is a schematic flow chart of step 401 in the embodiment shown in FIG. 4 according to the present disclosure.
Fig. 5b is a schematic diagram of a conversion relationship between the bird's-eye view imaging plane and the preset plane of the own vehicle coordinate system according to an exemplary embodiment of the disclosure.
Fig. 6 is a schematic flow chart of step 2032 in the embodiment of fig. 3 of the present disclosure.
FIG. 7 is a schematic flow chart of step 204 in the embodiment shown in FIG. 2 according to the present disclosure.
Fig. 8 is a schematic structural diagram of an in-vehicle visual perception device according to an exemplary embodiment of the present disclosure.
Fig. 9 is a schematic structural diagram of an in-vehicle visual perception device according to another exemplary embodiment of the present disclosure.
Fig. 10 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely some, rather than all, of the embodiments of the present disclosure, and that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and are not intended to imply any particular technical meaning or any necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects, and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be discussed further in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, and servers, which are operational with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with such electronic devices include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the application
In the process of implementing the present disclosure, the inventors found that the vehicle-mounted visual perception methods provided in the prior art usually adopt a post-processing mode: after a neural network model outputs perception results, the perception results of different cameras are fused together through certain rules. This approach has at least the following problems: when the perception results of the overlapping regions of adjacent views are fused in post-processing, the same target yields detection results in the images of different views (because the views overlap) and appears at different positions in different images; complex recognition and matching are therefore needed during post-processing fusion, and the same target is easily and ambiguously recognized as multiple targets.
Exemplary network architecture
Fig. 1 is a schematic structural diagram of a first network model provided in an exemplary embodiment of the present disclosure. As shown in fig. 1, the first network model in the present embodiment includes: a coding branch network 101, a convergence branch network 102, and a decoding branch network 103.
In this embodiment, feature extraction is performed through the coding branch network 101 on the image frames acquired by the plurality of vehicle-mounted cameras to obtain a plurality of first feature maps. The plurality of vehicle-mounted cameras are arranged at different preset positions on the vehicle, each corresponding to one view angle, and their combined shooting fields of view can cover the environment around the vehicle; to avoid blind spots, the shooting fields of view of the vehicle-mounted cameras may overlap. While the vehicle is running, each vehicle-mounted camera continuously collects images at its set position at a plurality of moments from its view angle, yielding an image frame set corresponding to that camera: each vehicle-mounted camera collects one image frame at each moment, and each image frame set comprises a plurality of image frames collected by the same vehicle-mounted camera and arranged in time order. Optionally, the coding branch network 101 may be any network capable of implementing feature extraction, such as a convolutional neural network; this embodiment does not limit the specific network structure adopted.
Spatial fusion and time-sequence fusion are performed on the plurality of first feature maps through the fusion branch network 102 to obtain a second feature map in the bird's-eye view image coordinate system. Spatial fusion fuses the feature maps of the image frames acquired by the plurality of vehicle-mounted cameras at the same moment to obtain a third feature map in the bird's-eye view image coordinate system; time-sequence fusion then fuses the third feature maps corresponding to each of the multiple moments to obtain the time-sequence-fused second feature map.
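The two-stage fusion just described (spatial fusion per moment, then time-sequence fusion across moments) can be sketched on dummy data as follows; the tensor shapes and the max/mean fusion operators are illustrative assumptions, and the per-camera projection into the BEV grid is stubbed out entirely:

```python
import numpy as np

# Hypothetical sizes: N_CAM cameras, N_T moments, C channels, BEV grid H x W.
N_CAM, N_T, C, H, W = 6, 3, 16, 100, 100
rng = np.random.default_rng(0)

# First feature maps, one per camera per moment, assumed already projected
# into a common bird's-eye-view grid.
feats = rng.standard_normal((N_T, N_CAM, C, H, W))

# Spatial fusion: merge the N_CAM views at each moment (element-wise maximum
# is one of the fusion options the text lists), giving one third feature map
# per moment.
third_maps = feats.max(axis=1)        # shape (N_T, C, H, W)

# Time-sequence fusion: merge the per-moment maps into the single second
# feature map (element-wise mean as a stand-in for a learned fusion).
second_map = third_maps.mean(axis=0)  # shape (C, H, W)
```

In the real model both stages are learned layers inside the fusion branch network, not fixed reductions as here.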
The second feature map obtained by the fusion branch network 102 is decoded through the decoding branch network 103 to obtain a decoded second feature map. Optionally, in this embodiment, the decoded second feature map may be a feature map whose structure conforms to the subsequent perception task. Optionally, the decoding branch network 103 may be any network capable of implementing feature extraction, for example a convolutional neural network; this embodiment does not limit the specific network structure.
In order to implement a visual perception task, the first network model provided in this embodiment may be followed by any visual perception task network; the visual perception task may include, but is not limited to: a segmentation task, a detection task, a classification task, and the like.
In this embodiment, spatial fusion and time fusion are realized inside the first network model, so that disturbances of the camera's intrinsic and extrinsic parameters caused by posture changes of the camera while the vehicle is driving do not act directly on the perception result; the influence of such disturbances is overcome through the training and learning of the first network model, making the output perception result more stable and unaffected by camera posture changes during driving. Because spatial fusion and time fusion are completed inside the first network model, fusion during post-processing is avoided and post-processing complexity is reduced; end-to-end learning of the first network model becomes possible, with the potential of joint learning with downstream visual perception tasks; and the perception-fusion method of the first network model places lower demands on the computational complexity of the chip.
Exemplary method
Fig. 2 is a schematic flowchart of a vehicle-mounted visual perception method according to an exemplary embodiment of the present disclosure. The embodiment can be applied to an electronic device, as shown in fig. 2, and includes the following steps:
step 201, acquiring images at a plurality of continuous moments through a plurality of vehicle-mounted cameras arranged at preset positions on a vehicle to obtain a plurality of image frame sets.
Each image frame set comprises a plurality of image frames collected by the same vehicle-mounted camera, and each vehicle-mounted camera contributes one image frame at each moment.
Optionally, since each vehicle-mounted camera corresponds to one image frame set, the plurality of vehicle-mounted cameras yield a plurality of image frame sets. Each vehicle-mounted camera may face a different direction, and the image acquisition ranges of the cameras may or may not overlap; normally, arranging the plurality of vehicle-mounted cameras at a plurality of preset positions achieves omnidirectional imaging around the vehicle, providing more information for automatic driving or assisted driving.
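The image-frame-set bookkeeping described above can be illustrated with a toy sketch; the camera names and image placeholders are invented for the example:

```python
from collections import defaultdict

# Frames arrive tagged with (moment, camera); grouping by camera yields one
# time-ordered image frame set per vehicle-mounted camera.
frames = [
    (0, "front", "img_f0"), (0, "rear", "img_r0"),
    (1, "front", "img_f1"), (1, "rear", "img_r1"),
    (2, "front", "img_f2"), (2, "rear", "img_r2"),
]

frame_sets = defaultdict(list)
for moment, cam, img in sorted(frames):  # sorting guarantees time order
    frame_sets[cam].append(img)          # one frame per camera per moment
```

Each value of `frame_sets` then plays the role of one image frame set: the frames of a single camera, arranged by time.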
Step 202, performing feature extraction on the image frames included in the plurality of image frame sets through a coding branch network in the first network model to obtain a plurality of first feature maps.
Optionally, feature extraction is sequentially performed on image frames included in the plurality of image frame sets through a coding branch network in the first network model to obtain a plurality of first feature maps; the coding branch network in this embodiment can be understood with reference to the coding branch network 101 provided in fig. 1, and feature extraction of the image frame is implemented.
And step 203, performing space fusion and time sequence fusion on the plurality of first feature maps through the fusion branch network in the first network model to obtain a second feature map under the aerial view image coordinate system.
Optionally, the fusion branch network in this embodiment may be understood with reference to the fusion branch network 102 provided in fig. 1, which implements the fusion of the first feature maps.
And 204, identifying the second feature map based on the network model corresponding to the preset sensing task, and determining a sensing result corresponding to the preset sensing task.
The preset perception task in this embodiment may be any visual perception task, for example, a segmentation task, a detection task, a classification task, and the like, and the identification operation in this step is implemented by using a network model corresponding to the visual perception task.
According to the vehicle-mounted visual perception method provided by the embodiment of the disclosure, spatial fusion and time-sequence fusion are realized inside the first network model, thereby realizing end-to-end learning of the network. Because no post-processing fusion is required, the complexity of performing spatial fusion and time-sequence fusion on images during post-processing is effectively avoided, as is the situation in which the same target is mistakenly identified as multiple targets during post-processing.
As shown in fig. 3, based on the embodiment shown in fig. 2, step 203 may include the following steps:
step 2031, for each of the plurality of times, performing spatial fusion on the plurality of first feature maps corresponding to the time to obtain a plurality of third feature maps in the bird's eye view image coordinate system.
Wherein each third feature map corresponds to a time instant.
In this embodiment, the image frames acquired by each vehicle-mounted camera lie in the camera coordinate system corresponding to that camera. The plurality of first feature maps are therefore projected into the bird's-eye view image coordinate system before fusion, and the projected feature maps are then fused in the bird's-eye view image coordinate system. Fusing multiple feature maps in the same coordinate system may be implemented by a feature-map fusion method in the related art, for example element-by-element addition, splicing along the feature-channel dimension, taking the element-by-element maximum, neural-network fusion, and the like.
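The fusion options listed above can be demonstrated on two toy feature maps that are assumed to already share the bird's-eye view coordinate system and size:

```python
import numpy as np

# Two single-channel 2x2 feature maps in the same coordinate system.
a = np.array([[[1., 2.], [3., 4.]]])      # shape (C=1, H=2, W=2)
b = np.array([[[4., 1.], [0., 5.]]])

added  = a + b                            # element-by-element addition
maxed  = np.maximum(a, b)                 # element-by-element maximum
concat = np.concatenate([a, b], axis=0)   # feature-channel dimension splicing
```

Addition and maximum keep the channel count; splicing doubles it, leaving the reduction to a later learned layer (the "neural network fusion" option).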
Step 2032, performing time sequence fusion on the plurality of third feature maps to obtain a second feature map.
In this embodiment, each third feature map corresponds to one moment. The third feature map corresponding to each moment is reconstructed into the frame of a reference moment (for example, the latest moment, at which the last image frame was acquired), and the second feature map is obtained by fusing the plurality of reconstructed third feature maps. The fusion in this step can also be implemented by a feature-map fusion method in the related art, for example element-by-element addition, feature-channel splicing, element-by-element maximum, neural-network fusion, and the like. Performing spatial fusion first and time-sequence fusion afterwards makes the time-sequence fusion easier to realize and increases the speed of feature-map fusion.
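As a hedged sketch of this reconstruct-then-fuse step, the following toy code aligns each moment's map to the latest moment, with an integer-cell grid shift standing in for the real ego-motion reconstruction; the maps and shift values are invented:

```python
import numpy as np

def warp_to_latest(bev_map, shift_cells):
    # Crude stand-in for ego-motion compensation: shift the BEV grid by an
    # integer number of cells (np.roll wraps around, which a real
    # reconstruction would not do; acceptable for this toy sketch).
    dy, dx = shift_cells
    return np.roll(np.roll(bev_map, dy, axis=-2), dx, axis=-1)

# Three third feature maps (C=1 channel, 4x4 grid), oldest first, plus the
# number of grid cells the vehicle moved between each moment and the latest.
maps   = [np.eye(4)[None] for _ in range(3)]
shifts = [(2, 0), (1, 0), (0, 0)]

aligned    = [warp_to_latest(m, s) for m, s in zip(maps, shifts)]
second_map = np.mean(aligned, axis=0)     # element-by-element mean fusion
```

The real model would apply a full planar rigid-body transform with interpolation rather than a whole-cell shift.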
As shown in fig. 4, based on the embodiment shown in fig. 3, step 2031 may include the following steps:
step 401, performing homographic transformation on each of the plurality of first feature maps to obtain a plurality of transformed feature maps under the bird's-eye view image coordinate system.
A homographic transformation (also called a projective transformation) has 8 degrees of freedom and describes the mapping relationship between points on two planes. Converting each first feature map from its image coordinate system into the bird's-eye view image coordinate system through a homography yields a plurality of transformed feature maps.
And step 402, performing point-by-point fusion on the plurality of transformed feature maps to obtain a third feature map under the aerial view image coordinate system.
In this embodiment, the homographic transformation produces a plurality of transformed feature maps in the same coordinate system and of the same size, so they can be fused point by point. The point-by-point fusion method may include, but is not limited to: element-by-element addition, splicing along the feature-channel dimension, taking the element-by-element maximum, neural-network fusion, and the like. By mapping the plurality of first feature maps, each in a different coordinate system, into the same coordinate system through the homographic transformation, the feature maps of all view angles are converted from the camera perspectives to the bird's-eye view and fused in a single coordinate system. The fused third feature map can therefore carry image features for all angles around the vehicle; when a visual perception task is executed based on it, the perception results of the different vehicle-mounted cameras are fused without any post-processing, which improves fusion efficiency and reduces fusion difficulty.
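A minimal sketch of the warp-then-fuse procedure, assuming invented homographies and nearest-neighbour sampling (a real implementation would use a library warp with interpolation):

```python
import numpy as np

def warp_homography(src, H_mat, out_shape):
    """Inverse-map every output (BEV) pixel through H_mat (BEV -> source
    image), sample the nearest source pixel, leave out-of-bounds pixels zero."""
    h_out, w_out = out_shape
    out = np.zeros(out_shape, dtype=src.dtype)
    ys, xs = np.mgrid[0:h_out, 0:w_out]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    u, v, w = H_mat @ pts                       # homogeneous image coords
    u = np.round(u / w).astype(int)
    v = np.round(v / w).astype(int)
    ok = (u >= 0) & (u < src.shape[1]) & (v >= 0) & (v < src.shape[0])
    out_flat = out.reshape(-1)
    out_flat[np.flatnonzero(ok)] = src[v[ok], u[ok]]
    return out

# Two single-channel "first feature maps" and invented homographies:
# the identity, and a one-pixel horizontal translation.
f1 = np.arange(9, dtype=float).reshape(3, 3)
f2 = np.arange(9, dtype=float).reshape(3, 3)
H_id    = np.eye(3)
H_shift = np.array([[1., 0., 1.], [0., 1., 0.], [0., 0., 1.]])

t1 = warp_homography(f1, H_id, (3, 3))
t2 = warp_homography(f2, H_shift, (3, 3))
third = np.maximum(t1, t2)   # point-by-point fusion via element-wise maximum
```

Because both warped maps live on the same BEV grid, the fusion is a plain per-pixel operation.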
As shown in fig. 5a, based on the embodiment shown in fig. 4, step 401 may include the following steps:
step 4011, a first transformation matrix from the vehicle coordinate system to the camera coordinate system corresponding to the first feature map is determined based on the internal reference matrix and the external reference matrix of the vehicle-mounted camera corresponding to each first feature map.
In this embodiment, the internal reference matrix corresponding to each vehicle-mounted camera is a known 3 × 3 matrix K, and the external reference matrix corresponding to the vehicle-mounted camera can be calculated from the preset position of the vehicle-mounted camera on the vehicle. Based on the internal reference matrix and the external reference matrix, a first transformation matrix $T_{vcs2cam}$ from the own vehicle coordinate system to the camera coordinate system can be determined. The first transformation matrix includes rotation parameters and translation parameters; for example, in an optional example, the first transformation matrix may be expressed as:
$$T_{vcs2cam}=\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}$$
where $r_{11}, r_{12}, r_{13}, r_{21}, r_{22}, r_{23}, r_{31}, r_{32}, r_{33}$ denote the rotation parameters and $t_1, t_2, t_3$ denote the translation parameters.
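As an illustrative sketch, the [R | t] layout above can be assembled as follows; the pure-yaw rotation and the translation values are invented, since a real camera extrinsic comes from calibration:

```python
import numpy as np

def make_T_vcs2cam(yaw, t):
    # Assemble a 3x4 [R | t] matrix in the r_ij / t_i layout: rotation about
    # the z axis by `yaw`, followed by translation t = (t1, t2, t3).
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.],
                  [s,  c, 0.],
                  [0., 0., 1.]])
    return np.hstack([R, np.asarray(t, dtype=float).reshape(3, 1)])

T = make_T_vcs2cam(0.0, [1.0, 2.0, 0.5])
# The own-vehicle origin, in homogeneous coordinates, lands at t.
p_cam = T @ np.array([0.0, 0.0, 0.0, 1.0])
```

Applying T to a homogeneous own-vehicle point yields its camera-frame coordinates.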
And step 4012, determining a second transformation matrix from the bird's-eye view image coordinate system to the preset plane of the own vehicle coordinate system based on the sensing range on the preset plane of the own vehicle coordinate system and the scaling and translation distance of the bird's-eye view image coordinate system.
Alternatively, as shown in FIG. 5b, let $d_1, d_2, d_3, d_4$ represent the sensing range on the preset plane of the own vehicle coordinate system (the xy plane; $O_{vcs}$ in the figure represents the center point of the xy plane in the own vehicle coordinate system). The translation distance of the own vehicle coordinate system relative to the bird's-eye view image coordinate system can be determined based on the sensing range ($d_1, d_3$ in FIG. 5b). Assuming that the bird's-eye view imaging plane (whose center point may be $O_{BEV}$ in FIG. 5b) coincides with (or is parallel to) the xy plane of the own vehicle coordinate system, and letting r represent the scaling from the xy plane of the own vehicle coordinate system to the bird's-eye view imaging plane, the second transformation matrix can be expressed as:
$$T_{bev2vcs\_xy}=\begin{bmatrix} 1/r & 0 & -d_1 \\ 0 & 1/r & -d_3 \\ 0 & 0 & 1 \end{bmatrix}$$

(one plausible form; the exact signs and axis order depend on the coordinate conventions adopted).
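One hedged concrete reading of this scaling-plus-translation matrix, with invented values for r, $d_1$, and $d_3$:

```python
import numpy as np

def make_T_bev2vcs_xy(r, d1, d3):
    # Assumed reading: BEV pixel (u, v) maps to vehicle-plane coordinates
    # x = u / r - d1, y = v / r - d3. Axis directions and signs are
    # convention assumptions, not taken from the patent figure.
    return np.array([[1. / r, 0., -d1],
                     [0., 1. / r, -d3],
                     [0., 0., 1.]])

T = make_T_bev2vcs_xy(r=10.0, d1=50.0, d3=20.0)  # 10 px per metre (invented)
p = T @ np.array([500.0, 200.0, 1.0])            # BEV pixel -> plane metres
```

With these numbers, the BEV pixel (500, 200) maps to the vehicle-plane origin, illustrating how the offsets recentre the grid.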
and 4013, determining a third transformation matrix from the bird's-eye view image coordinate system to the image coordinate system corresponding to the vehicle-mounted camera based on the first transformation matrix, the second transformation matrix and the internal reference matrix of the vehicle-mounted camera.
Alternatively, in this embodiment, a transformation matrix from the bird's-eye view image coordinate system to the camera coordinate system may be derived from the first transformation matrix and the second transformation matrix, and the third transformation matrix $T_{bev2img}$ between the bird's-eye view image coordinate system and the image plane (corresponding to the image coordinate system) of the vehicle-mounted camera can then be determined by combining the internal reference matrix.
Step 4014, transforming the first feature map based on the third transformation matrix to obtain a transformed feature map.
In the present embodiment, the third transformation matrix is determined from the transformation relationship between the image coordinate system and the bird's-eye view image coordinate system, and through this relationship the image position (u1, v1) can be inversely indexed from the position (u, v) in the bird's-eye view image. For example, the feature mapping can be implemented by the following formula (1):
Figure BDA0003641021090000092
A transformed feature map fn after the view-angle conversion is thereby obtained. Similarly, the transformation operation is performed on the first feature map corresponding to each view angle, obtaining the transformed feature maps {fn, n = 1, 2, …, N} of the N view angles (N is the number of vehicle-mounted cameras and can be set according to the specific scene). Finally, feature fusion is performed on these transformed feature maps to obtain the spatially fused third feature map Ft at time t (one of the multiple times).
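As an illustration of steps 4011 to 4014, the sketch below inversely indexes each bird's-eye-view grid cell into a camera-view feature map through a 3×3 matrix and then fuses the warped maps point by point. It is a minimal NumPy sketch under assumed conventions (nearest-neighbor sampling, element-wise maximum fusion, hypothetical function names); the patent leaves the sampling and fusion operators open.

```python
import numpy as np

def warp_to_bev(feat, T_bev2img, bev_h, bev_w):
    """Inverse-index a camera-view feature map (C, H, W) into the BEV grid:
    for each BEV pixel (u, v), look up the image pixel (u1, v1) given by
    T_bev2img @ [u, v, 1]^T (nearest neighbor, zeros outside the image)."""
    C, H, W = feat.shape
    us, vs = np.meshgrid(np.arange(bev_w), np.arange(bev_h))      # BEV grid
    pts = np.stack([us.ravel(), vs.ravel(), np.ones(us.size)])    # (3, bev_h*bev_w)
    proj = T_bev2img @ pts                                        # formula (1)
    u1 = np.round(proj[0] / proj[2]).astype(int)
    v1 = np.round(proj[1] / proj[2]).astype(int)
    valid = (u1 >= 0) & (u1 < W) & (v1 >= 0) & (v1 < H)
    out = np.zeros((C, bev_h * bev_w), dtype=feat.dtype)
    out[:, valid] = feat[:, v1[valid], u1[valid]]
    return out.reshape(C, bev_h, bev_w)

def spatial_fuse(feats, transforms, bev_h, bev_w):
    """Warp every camera's first feature map to BEV and fuse point by point
    (element-wise maximum here) into one third feature map F_t."""
    warped = [warp_to_bev(f, T, bev_h, bev_w) for f, T in zip(feats, transforms)]
    return np.max(np.stack(warped), axis=0)
```

With an identity transform the warp is a no-op, which makes the inverse-indexing convention easy to verify before plugging in a real Tbev2img.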
Optionally, on the basis of the foregoing embodiment, step 3013 may further include:
a1, determining a fourth transformation matrix from the preset plane of the vehicle coordinate system to the camera coordinate system based on the first transformation matrix.
Alternatively, if the imaging plane of the bird's-eye view coincides with (or is parallel to) the xy plane of the own vehicle coordinate system, that is, the plane on which the z value of a three-dimensional point in own vehicle coordinates equals 0 (or equals some other constant value), then in the coincident case the column corresponding to the z axis (column 3) can simply be eliminated from the first transformation matrix Tvcs2cam, and in the parallel case that column can be set to the preset value, obtaining the fourth transformation matrix Tvcs_xy2cam between the imaging plane of the bird's-eye view image coordinate system and the camera coordinate system. For example, in an alternative example (the imaging plane of the bird's-eye view image coordinate system coincides with the xy plane of the own vehicle coordinate system), the fourth transformation matrix may be expressed as:
Tvcs_xy2cam =
[ r11  r12  t1 ]
[ r21  r22  t2 ]
[ r31  r32  t3 ]
[ 0    0    1  ]
a2, determining a fifth transformation matrix from the bird's eye view image coordinate system to the camera coordinate system based on the fourth transformation matrix and the second transformation matrix.
Alternatively, after the fourth transformation matrix is determined, it is combined with the second transformation matrix from the bird's-eye view image coordinate system to the own vehicle coordinate system; that is, the fifth transformation matrix from the bird's-eye view image coordinate system to the camera coordinate system may be determined by matrix multiplication of the fourth transformation matrix and the second transformation matrix. For example, the fifth transformation matrix Tbev2cam may be determined based on the following formula (2):
Tbev2cam = Tvcs_xy2cam × Tbev2vcs_xy    Formula (2)
a3, determining a third transformation matrix based on the fifth transformation matrix and the internal reference matrix of the vehicle-mounted camera.
In this embodiment, the numbers of rows and columns of the fifth transformation matrix and the internal reference matrix do not correspond. To enable matrix multiplication (the internal reference matrix is a 3 × 3 matrix), a truncation operation may be performed on the fifth transformation matrix: its first three rows are taken to obtain a truncated transformation matrix. For example, the first three rows of the fifth transformation matrix Tbev2cam are truncated to obtain the truncated transformation matrix T'bev2cam, and the third transformation matrix is then determined by matrix multiplication of the truncated transformation matrix and the internal reference matrix, for example as shown in the following formula (3):
Tbev2img = K × T'bev2cam    Formula (3)
The third transformation matrix can be obtained through the above processing, and the first feature map corresponding to the image coordinate system can be transformed onto the imaging plane under the bird's-eye view image coordinate system based on it. By determining, through coordinate-system relation transformation, the third transformation matrix between the bird's-eye view image coordinate system and the image coordinate system, features in the image coordinate system can be transformed directly into the bird's-eye view image coordinate system, realizing fast spatial fusion within the fusion branch model of the first network model; the specific parameters of the coordinate-system transformation are determined through learning by the first network model, which improves the accuracy and speed of the transformation.
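The chain of steps a1 to a3 above can be sketched as follows, assuming a 4×4 first transformation matrix in column-vector convention, a 3×3 second transformation matrix, and a 3×3 internal reference matrix K; the function name and concrete shapes are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def third_transform(T_vcs2cam, T_bev2vcs_xy, K):
    """Chain the coordinate-system relations of steps a1-a3:
    drop the z column of the 4x4 vcs->camera matrix (imaging plane at z = 0),
    multiply by the BEV->vcs-xy matrix (formula (2)), keep the first three
    rows, and left-multiply by the 3x3 intrinsics K (formula (3))."""
    T_vcs_xy2cam = T_vcs2cam[:, [0, 1, 3]]       # 4x3: step a1, z column removed
    T_bev2cam = T_vcs_xy2cam @ T_bev2vcs_xy      # 4x3: step a2, formula (2)
    T_bev2cam_trunc = T_bev2cam[:3, :]           # 3x3: truncation in step a3
    return K @ T_bev2cam_trunc                   # 3x3: formula (3)
```

The intermediate shapes (4×3, then 3×3 after truncation) are what make the final multiplication with the 3×3 intrinsics well-defined, matching the truncation described in step a3.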
As shown in fig. 6, based on the embodiment shown in fig. 3, step 2032 may include the following steps:
Step 601, taking the third feature map corresponding to the latest time among the multiple times as the reference feature map.
Optionally, when the multiple times include times t, t-1, …, t-s, the third feature map corresponding to time t (the latest time) may be used as the reference feature map. This is, of course, only a specific example of this embodiment; in practical applications, the third feature map corresponding to any time may be used as the reference feature map without affecting the operation or result of the temporal fusion. When the latest time is selected for the reference feature map, the result of the temporal fusion corresponds to the latest time, that is, the output second feature map corresponds to the latest time (for example, the current time), which improves the real-time performance of the perception result determined based on the second feature map, and hence of the vehicle-mounted visual perception method.
Step 602, reconstructing each third feature map in the at least one third feature map to obtain at least one fourth feature map.
In this embodiment, so that the third feature maps corresponding to the multiple times can be fused, each third feature map is mapped into the feature space corresponding to the reference feature map before feature fusion is performed.
Step 603, performing point-by-point fusion of the reference feature map and the at least one fourth feature map to obtain the second feature map.
Alternatively, the point-by-point fusion methods may include, but are not limited to: element-wise addition, splicing in the feature-channel dimension, taking the element-wise maximum, fusion by a neural network, and the like. In this embodiment, by performing temporal fusion on the third feature maps corresponding to the multiple times obtained after spatial fusion, spatial fusion and temporal fusion are realized in sequence inside the fusion branch network of the first network model, so that the multiple image frames acquired at multiple times from multiple view angles are fused into one second feature map under the bird's-eye view image coordinate system. The first network model can thus learn end to end how to realize the spatial and temporal fusion of the feature maps, with no need to execute spatial and temporal fusion outside the network model, which reduces the complexity of post-processing and the ambiguity problems caused by fusion in post-processing.
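The point-by-point fusion options listed in step 603 (element-wise addition, channel splicing, element-wise maximum) can be sketched as below; fusion by a learned neural network is omitted. The function name and the (C, H, W) layout are assumptions for illustration.

```python
import numpy as np

def pointwise_fuse(ref, others, mode="max"):
    """Point-by-point fusion of the reference feature map with the
    reconstructed fourth feature maps (all shaped (C, H, W))."""
    maps = [ref] + list(others)
    if mode == "add":      # element-by-element addition
        return np.sum(np.stack(maps), axis=0)
    if mode == "max":      # element-by-element maximum
        return np.max(np.stack(maps), axis=0)
    if mode == "concat":   # splicing in the feature-channel dimension
        return np.concatenate(maps, axis=0)
    raise ValueError(f"unknown fusion mode: {mode}")
```

Note that "add" and "max" preserve the channel count while "concat" multiplies it by the number of fused maps, which changes what the downstream decoding branch must expect.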
Optionally, on the basis of the foregoing embodiment, step 602 may further include:
b1, determining at least one homography transformation matrix between the first characteristic diagram corresponding to the reference characteristic diagram and the first characteristic diagram corresponding to the at least one third characteristic diagram based on the interframe motion transformation matrix corresponding to the latest time and the at least one time corresponding to the at least one third characteristic diagram of the vehicle-mounted camera.
Optionally, since the position of the vehicle may change between the times at which the vehicle-mounted camera captures images, the camera position changes correspondingly. Based on the inter-frame motion of the vehicle (each time corresponds to one image frame) and the known height d of the vehicle-mounted camera (e.g., the vehicle-mounted front-view camera) above the xy plane of the own vehicle coordinate system (this height belongs to the camera external parameters and may be determined by calculation with an existing algorithm or from information collected by a sensor), the homography transformation matrix H between the two frames before and after, at any view angle, can be calculated. Assuming that the rotation from time t to time t-1 is R and the translation is m, the homography transformation matrix H can be determined based on the following formula (4):
H = K (R - m n^T / d) K^(-1)    Formula (4)
wherein n = [0, 0, 1]^T is the normal direction of the xy plane of the own vehicle coordinate system; K represents the internal reference matrix of the vehicle-mounted camera; and d represents the height of the vehicle-mounted camera above the xy plane of the own vehicle coordinate system.
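Formula (4) has the standard plane-induced homography form for the xy plane of the own vehicle coordinate system; a sketch under that assumption, with an illustrative function name:

```python
import numpy as np

def interframe_homography(K, R, m, d):
    """Plane-induced homography between two frames for the vcs xy plane:
    H = K (R - m n^T / d) K^(-1), with n = [0, 0, 1]^T the plane normal and
    d the camera height above the plane (formula (4))."""
    n_T = np.array([[0.0, 0.0, 1.0]])             # n^T as a 1x3 row
    m = np.asarray(m, dtype=float).reshape(3, 1)  # translation as a 3x1 column
    return K @ (R - (m @ n_T) / d) @ np.linalg.inv(K)
```

When the vehicle does not move between frames (R identity, m zero), H reduces to the identity, which is a quick sanity check on the sign conventions.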
b2, determining at least one conversion matrix between the reference characteristic diagram and the at least one third characteristic diagram based on the at least one homographic transformation matrix and the third transformation matrix.
In this embodiment, a conversion matrix is determined for each third feature map by combining the third transformation matrix of the spatial transformation with the homography transformation matrix corresponding to that third feature map. Optionally, the conversion matrix Ttemp may be determined based on the following formula (5):
Figure BDA0003641021090000121
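Formula (5) appears only as a figure in the source; one plausible composition consistent with the surrounding text is sketched below: map the reference BEV grid into image coordinates with the third transformation matrix, apply the inter-frame homography H, then return to the earlier BEV grid. The direction of H and the use of the inverse are assumptions, not the patent's stated form.

```python
import numpy as np

def conversion_matrix(T_bev2img, H):
    """A candidate T_temp for formula (5): BEV(t) grid -> image(t) via
    T_bev2img, image(t) -> image(t-1) via the homography H, and
    image(t-1) -> BEV(t-1) grid via the inverse of T_bev2img."""
    return np.linalg.inv(T_bev2img) @ H @ T_bev2img
```

With an identity homography the composition collapses to the identity regardless of T_bev2img, as expected for a stationary vehicle.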
b3, reconstructing each third feature map in the at least one third feature map based on each conversion matrix in the at least one conversion matrix to obtain at least one fourth feature map.
In this embodiment, taking the third feature map at time t-1 as an example, the spatially fused feature at position (Ut-1, Vt-1) at time t-1 can be mapped by the conversion matrix Ttemp to position (Ut, Vt) of the spatially fused feature at time t, obtaining the reconstructed time-t feature F't. That is, the reconstruction of the third feature map at time t-1 is realized through the conversion matrix, yielding a fourth feature map; by analogy, the third feature map at each time is reconstructed to obtain the at least one fourth feature map. Through this feature-map reconstruction, feature maps corresponding to different times can be fused. Performing spatial fusion first and temporal fusion afterwards reduces the number of feature maps that temporal fusion must process and thus the difficulty of temporal fusion; and realizing temporal fusion by reusing the third transformation matrix from spatial fusion improves the reuse of parameters and the efficiency of fusion.
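The reconstruction of step b3 can be sketched as an inverse-indexing warp of the earlier third feature map into the reference grid; nearest-neighbor sampling, the 3×3 shape of the conversion matrix, and the function name are assumptions for illustration.

```python
import numpy as np

def reconstruct_to_reference(feat_prev, T_temp):
    """Reconstruct the third feature map of an earlier time into the feature
    space of the reference (latest) time: each reference-grid position
    (U_t, V_t) inversely indexes (U_{t-1}, V_{t-1}) through the 3x3 matrix
    T_temp (nearest neighbor, zeros where the index falls outside)."""
    C, H, W = feat_prev.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    pts = np.stack([us.ravel(), vs.ravel(), np.ones(us.size)])
    prev = T_temp @ pts
    u_p = np.round(prev[0] / prev[2]).astype(int)
    v_p = np.round(prev[1] / prev[2]).astype(int)
    valid = (u_p >= 0) & (u_p < W) & (v_p >= 0) & (v_p < H)
    out = np.zeros((C, H * W), dtype=feat_prev.dtype)
    out[:, valid] = feat_prev[:, v_p[valid], u_p[valid]]
    return out.reshape(C, H, W)
```

Positions that project outside the earlier feature map are filled with zeros; a learned fusion operator could instead down-weight them, but the patent does not specify this.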
As shown in fig. 7, based on the embodiment shown in fig. 2, step 204 may include the following steps:
Step 2041, decoding the second feature map through the decoding branch network in the first network model to obtain a decoded second feature map.
Alternatively, the decoding branch network in the present embodiment can be understood with reference to the decoding branch network 103 in the embodiment provided in fig. 1.
Step 2042, based on the second network model, the decoded second feature map is identified, and a sensing result corresponding to the preset sensing task is determined.
In this embodiment, the second feature map may be mapped, through the decoding branch network, into the feature space required by the preset perception task, so that the network model corresponding to the subsequent preset perception task can directly process the mapped feature map. This reduces intermediate processing, allows the network model of the preset perception task to be connected directly to the first network model, and realizes joint learning of the network models, which improves the accuracy of the perception result of the preset perception task.
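Steps 2041 and 2042 can be sketched as a toy pipeline: a 1×1-convolution-style decoding step followed by a per-pixel argmax head for a BEV segmentation task. These operations merely stand in for the learned decoding branch network and second network model, which the patent does not specify concretely; all names are hypothetical.

```python
import numpy as np

def decode(feat, W_dec):
    """Decoding branch as a 1x1 convolution: map the fused BEV second
    feature map (C, H, W) into the (C_out, H, W) space a head expects."""
    return np.tensordot(W_dec, feat, axes=([1], [0]))  # (C_out, H, W)

def segment_head(decoded):
    """Toy perception head for a BEV segmentation task: per-pixel argmax
    over the decoded channels yields an (H, W) class map."""
    return np.argmax(decoded, axis=0)
```

In a real system both steps would be learned jointly with the fusion branch, which is exactly the end-to-end property the embodiment emphasizes.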
Any of the in-vehicle visual perception methods provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including but not limited to: a terminal device, a server, and the like. Alternatively, any of the in-vehicle visual perception methods provided by the embodiments of the present disclosure may be executed by a processor; for example, the processor may execute any of the in-vehicle visual perception methods mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in a memory. This is not described in detail below.
Exemplary devices
Fig. 8 is a schematic structural diagram of a vehicle-mounted visual perception device according to an exemplary embodiment of the present disclosure. As shown in fig. 8, the apparatus provided in this embodiment includes:
the image acquisition module 81 is configured to acquire images at a plurality of consecutive moments through a plurality of vehicle-mounted cameras arranged at preset positions on a vehicle, so as to obtain a plurality of image frame sets.
Each image frame set comprises a plurality of image frames collected by the same vehicle-mounted camera, and each vehicle-mounted camera corresponds to one image frame at each moment.
And the encoding module 82 is configured to perform feature extraction on image frames included in the plurality of image frame sets obtained by the image acquisition module 81 through an encoding branch network in the first network model to obtain a plurality of first feature maps.
And the fusion module 83 is configured to perform spatial fusion and time sequence fusion on the plurality of first feature maps determined by the encoding module 82 through a fusion branch network in the first network model to obtain a second feature map in the bird's-eye view image coordinate system.
The perception module 84 is configured to identify the second feature map determined by the fusion module 83 based on the network model corresponding to the preset perception task, so as to obtain a perception result corresponding to the preset perception task.
According to the vehicle-mounted visual perception device provided by the embodiment of the disclosure, spatial fusion and temporal fusion are realized inside the first network model, so that end-to-end learning of the neural network is realized. Since no post-processing fusion is needed, the complexity of spatially and temporally fusing images in post-processing is effectively avoided, as is the situation in which the same target is mistakenly identified as multiple targets in post-processing.
Fig. 9 is a schematic structural diagram of an in-vehicle visual perception device according to another exemplary embodiment of the present disclosure. As shown in fig. 9, in the apparatus provided in this embodiment, the fusion module 83 includes:
and a spatial fusion unit 831 configured to perform spatial fusion on the plurality of first feature maps corresponding to the respective times to obtain a plurality of third feature maps in the bird's-eye view image coordinate system.
Wherein each third feature map corresponds to a time instant.
And a time sequence fusion unit 832, configured to perform time sequence fusion on the plurality of third feature maps to obtain a second feature map.
Optionally, the spatial fusion unit 831 is specifically configured to perform homographic transformation on each of the plurality of first feature maps to obtain a plurality of transformed feature maps in a bird's-eye view image coordinate system; and performing point-by-point fusion on the plurality of transformation feature maps to obtain a third feature map under the aerial view image coordinate system.
Optionally, the spatial fusion unit 831 is configured to, when performing homographic transformation on each of the plurality of first feature maps to obtain a plurality of transformed feature maps in the bird's eye view image coordinate system, determine a first transformation matrix from the vehicle coordinate system to the camera coordinate system corresponding to the first feature map based on the internal reference matrix and the external reference matrix of the vehicle-mounted camera corresponding to each of the first feature maps; determining a second transformation matrix from the bird's-eye view image coordinate system to the preset plane of the self-vehicle coordinate system based on the sensing range on the preset plane of the self-vehicle coordinate system and the scaling and translation distance of the bird's-eye view image coordinate system; determining a third transformation matrix from the aerial view image coordinate system to an image coordinate system corresponding to the vehicle-mounted camera based on the first transformation matrix, the second transformation matrix and the internal reference matrix of the vehicle-mounted camera; and transforming the first feature map based on the third transformation matrix to obtain a transformed feature map.
Optionally, when the space fusion unit 831 determines a third transformation matrix from the bird's-eye view image coordinate system to the image coordinate system corresponding to the vehicle-mounted camera based on the first transformation matrix, the second transformation matrix and the internal reference matrix of the vehicle-mounted camera, the space fusion unit 831 is configured to determine a fourth transformation matrix from a preset plane of the vehicle coordinate system to the camera coordinate system based on the first transformation matrix; determining a fifth transformation matrix from the aerial view image coordinate system to the camera coordinate system based on the fourth transformation matrix and the second transformation matrix; and determining a third transformation matrix based on the fifth transformation matrix and the internal reference matrix of the vehicle-mounted camera.
Optionally, the time sequence fusing unit 832 is specifically configured to use a third feature map corresponding to a latest time in the multiple times as a reference feature map; reconstructing each third feature map in the at least one third feature map respectively to obtain at least one fourth feature map; and performing point-by-point fusion on the reference characteristic diagram and the at least one fourth characteristic diagram to obtain a second characteristic diagram.
Optionally, the time-series fusion unit 832 is configured to determine, when reconstructing each of the at least one third feature map to obtain at least one fourth feature map, at least one homography transformation matrix between the first feature map corresponding to the reference feature map and the first feature map corresponding to the at least one third feature map based on an interframe motion transformation matrix corresponding to the vehicle-mounted camera at the latest time and the at least one time corresponding to the at least one third feature map; determining at least one transformation matrix between the reference feature map and the at least one third feature map based on the at least one homographic transformation matrix and the third transformation matrix; and reconstructing each third characteristic diagram in the at least one third characteristic diagram respectively based on each conversion matrix in the at least one conversion matrix to obtain at least one fourth characteristic diagram.
In some optional embodiments, the sensing module 84 includes:
the decoding unit 841 is configured to perform decoding processing on the second feature map through a decoding branch network in the first network model, so as to obtain a decoded second feature map.
The feature recognition unit 842 is configured to recognize the decoded second feature map based on the second network model, and determine a sensing result corresponding to the preset sensing task.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 10. The electronic device may be either or both of the first device 100 and the second device 200, or a stand-alone device separate from them that may communicate with the first device and the second device to receive the collected input signals therefrom.
FIG. 10 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure.
As shown in fig. 10, the electronic device 10 includes one or more processors 11 and memory 12.
The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer readable storage medium and executed by the processor 11 to implement the in-vehicle visual perception methods of the various embodiments of the present disclosure described above and/or other desired functionality. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
For example, when the electronic device is the first device 100 or the second device 200, the input device 13 may be a microphone or a microphone array as described above for capturing an input signal of a sound source. When the electronic device is a stand-alone device, the input means 13 may be a communication network connector for receiving the acquired input signals from the first device 100 and the second device 200.
The input device 13 may also include, for example, a keyboard, a mouse, and the like.
The output device 14 may output various information including the determined distance information, direction information, and the like to the outside. The output devices 14 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 10 relevant to the present disclosure are shown in fig. 10, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the in-vehicle visual perception method according to various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.
The computer program product may write program code for carrying out operations for embodiments of the present disclosure in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the in-vehicle visual perception method according to various embodiments of the present disclosure as described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, configurations, etc. must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, and systems may be connected, arranged, or configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as, but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (11)

1. An on-vehicle visual perception method, comprising:
acquiring images at a plurality of continuous moments by a plurality of vehicle-mounted cameras arranged at preset positions on a vehicle to obtain a plurality of image frame sets; each image frame set comprises a plurality of image frames collected by the same vehicle-mounted camera, and each vehicle-mounted camera corresponds to one image frame at each moment;
performing feature extraction on the image frames in the plurality of image frame sets through a coding branch network in a first network model to obtain a plurality of first feature maps;
performing space fusion and time sequence fusion on the plurality of first feature maps through a fusion branch network in the first network model to obtain a second feature map under a bird's-eye view image coordinate system;
and identifying the second feature map based on a network model corresponding to a preset perception task, and determining a perception result corresponding to the preset perception task.
2. The method according to claim 1, wherein the performing spatial fusion and time-series fusion on the plurality of first feature maps through a fusion branch network in the first network model to obtain a second feature map in a bird's-eye view image coordinate system comprises:
performing spatial fusion on a plurality of first feature maps corresponding to the time to obtain a plurality of third feature maps under a bird's-eye view image coordinate system for each time in the plurality of times; each third feature map corresponds to a moment;
and performing time sequence fusion on the plurality of third feature maps to obtain the second feature map.
3. The method according to claim 2, wherein the performing spatial fusion on the plurality of first feature maps corresponding to that time instant to obtain the plurality of third feature maps in the bird's-eye-view image coordinate system comprises:
performing a homography transformation on each first feature map in the plurality of first feature maps, respectively, to obtain a plurality of transformed feature maps in the bird's-eye-view image coordinate system; and
performing point-by-point fusion on the plurality of transformed feature maps to obtain the third feature map in the bird's-eye-view image coordinate system.
4. The method according to claim 3, wherein the performing a homography transformation on each of the plurality of first feature maps to obtain the plurality of transformed feature maps in the bird's-eye-view image coordinate system comprises:
determining, based on an intrinsic parameter matrix and an extrinsic parameter matrix of the vehicle-mounted camera corresponding to each first feature map, a first transformation matrix from an ego-vehicle coordinate system to a camera coordinate system corresponding to the first feature map;
determining a second transformation matrix from the bird's-eye-view image coordinate system to a preset plane of the ego-vehicle coordinate system, based on a sensing range on the preset plane of the ego-vehicle coordinate system and a scaling factor and a translation distance of the bird's-eye-view image coordinate system;
determining a third transformation matrix from the bird's-eye-view image coordinate system to an image coordinate system corresponding to the vehicle-mounted camera, based on the first transformation matrix, the second transformation matrix and the intrinsic parameter matrix of the vehicle-mounted camera; and
transforming the first feature map based on the third transformation matrix to obtain the transformed feature map.
5. The method according to claim 4, wherein the determining a third transformation matrix from the bird's-eye-view image coordinate system to an image coordinate system corresponding to the vehicle-mounted camera based on the first transformation matrix, the second transformation matrix and the intrinsic parameter matrix of the vehicle-mounted camera comprises:
determining a fourth transformation matrix from the preset plane of the ego-vehicle coordinate system to the camera coordinate system based on the first transformation matrix;
determining a fifth transformation matrix from the bird's-eye-view image coordinate system to the camera coordinate system based on the fourth transformation matrix and the second transformation matrix; and
determining the third transformation matrix based on the fifth transformation matrix and the intrinsic parameter matrix of the vehicle-mounted camera.
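The chain of transformation matrices in claims 4 and 5 can be sketched with numpy. The parametrisation of the sensing range (`resolution`, `x_min`, `y_min`) and the axis conventions below are assumptions for illustration; the patent does not fix them. The key geometric fact is that points on the ground plane have z = 0, so the extrinsic [R | t] collapses to a 3x3 homography.

```python
import numpy as np

def bev_to_ground(resolution, x_min, y_min):
    """Second transformation matrix: BEV pixel (u, v, 1) -> point (x, y, 1)
    on the preset (z = 0) plane of the ego-vehicle coordinate system.
    `resolution` is metres per BEV pixel (the scaling factor); (x_min, y_min)
    is the corner of the sensing range (the translation distance)."""
    return np.array([[resolution, 0.0, x_min],
                     [0.0, resolution, y_min],
                     [0.0, 0.0, 1.0]])

def ground_to_camera(R, t):
    """Fourth transformation matrix: for ground points z = 0, so
    p_cam = x * r1 + y * r2 + t, keeping only two rotation columns."""
    return np.column_stack([R[:, 0], R[:, 1], t])

def bev_to_image(K, R, t, resolution, x_min, y_min):
    """Third transformation matrix, composed as in claim 5:
    fifth = fourth @ second; third = K @ fifth."""
    M2 = bev_to_ground(resolution, x_min, y_min)
    M4 = ground_to_camera(R, t)
    return K @ (M4 @ M2)
```

For example, with identity intrinsics and rotation and the camera offset 5 m along its optical axis, a BEV pixel (2, 3) at 1 m resolution projects to image point (0.4, 0.6) after the perspective divide.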
6. The method according to claim 4 or 5, wherein the performing temporal fusion on the plurality of third feature maps to obtain the second feature map comprises:
taking the third feature map corresponding to the latest time instant among the plurality of time instants as a reference feature map;
reconstructing each of at least one third feature map other than the reference feature map, respectively, to obtain at least one fourth feature map; and
performing point-by-point fusion on the reference feature map and the at least one fourth feature map to obtain the second feature map.
7. The method according to claim 6, wherein the reconstructing each of the at least one third feature map to obtain the at least one fourth feature map comprises:
determining at least one homography transformation matrix between the first feature map corresponding to the reference feature map and the first feature map corresponding to each of the at least one third feature map, based on an inter-frame motion transformation matrix of the vehicle-mounted camera between the latest time instant and the at least one time instant corresponding to the at least one third feature map;
determining at least one conversion matrix between the reference feature map and the at least one third feature map based on the at least one homography transformation matrix and the third transformation matrix; and
reconstructing each of the at least one third feature map based on a respective one of the at least one conversion matrix, to obtain the at least one fourth feature map.
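The temporal reconstruction of claims 6 and 7 can be illustrated in BEV space directly: because the third feature maps live on the ground plane, ego-motion between two time instants induces a planar transform in BEV pixel coordinates. The 2D rigid ego-motion parametrisation below is a hypothetical simplification of the claim's camera inter-frame motion matrix, and `M2` is the BEV-pixel-to-ground matrix from the previous sketch's role (here passed in directly).

```python
import numpy as np

def ego_motion_ground(yaw, tx, ty):
    """2D rigid transform taking ground coordinates at the reference
    (latest) time instant to ground coordinates at an earlier instant;
    yaw/tx/ty would come from odometry (hypothetical parametrisation)."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

def bev_conversion_matrix(M2, T_prev_from_ref):
    """Conversion matrix between third feature maps: BEV pixel -> ground
    coords -> apply ego motion -> back to BEV pixels."""
    return np.linalg.inv(M2) @ T_prev_from_ref @ M2

def reconstruct(prev_feat, T_bev):
    """Inverse-warp an earlier BEV feature map (C, H, W) into the reference
    frame (a fourth feature map), nearest-neighbour sampling."""
    C, H, W = prev_feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    src = T_bev @ pts
    src = src[:2] / src[2:3]
    u, v = np.round(src[0]).astype(int), np.round(src[1]).astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros_like(prev_feat).reshape(C, -1)
    out[:, ok] = prev_feat[:, v[ok], u[ok]]
    return out.reshape(C, H, W)
```

After reconstruction, the warped maps are spatially aligned with the reference feature map, so the point-by-point fusion of claim 6 operates on corresponding ground locations.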
8. The method according to any one of claims 1 to 7, wherein the identifying the second feature map based on a second network model corresponding to the preset perception task and determining the perception result corresponding to the preset perception task comprises:
decoding the second feature map through a decoding branch network in the first network model to obtain a decoded second feature map; and
identifying the decoded second feature map based on the second network model, and determining the perception result corresponding to the preset perception task.
9. A vehicle-mounted visual perception device, comprising:
an image acquisition module configured to acquire images at a plurality of consecutive time instants through a plurality of vehicle-mounted cameras arranged at preset positions on a vehicle, to obtain a plurality of image frame sets, wherein each image frame set comprises a plurality of image frames collected by a same vehicle-mounted camera, and each vehicle-mounted camera corresponds to one image frame at each time instant;
an encoding module configured to perform feature extraction on the image frames in the plurality of image frame sets obtained by the image acquisition module through an encoding branch network in a first network model, to obtain a plurality of first feature maps;
a fusion module configured to perform spatial fusion and temporal fusion on the plurality of first feature maps determined by the encoding module through a fusion branch network in the first network model, to obtain a second feature map in a bird's-eye-view image coordinate system; and
a perception module configured to identify the second feature map determined by the fusion module based on a network model corresponding to a preset perception task, to obtain a perception result corresponding to the preset perception task.
10. A computer-readable storage medium storing a computer program for executing the vehicle-mounted visual perception method according to any one of claims 1 to 8.
11. An electronic device, comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the vehicle-mounted visual perception method according to any one of claims 1 to 8.
CN202210514756.2A 2022-05-12 2022-05-12 Vehicle-mounted visual perception method and device, readable storage medium and electronic equipment Pending CN114757301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210514756.2A CN114757301A (en) 2022-05-12 2022-05-12 Vehicle-mounted visual perception method and device, readable storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114757301A 2022-07-15

Family

ID=82334310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210514756.2A Pending CN114757301A (en) 2022-05-12 2022-05-12 Vehicle-mounted visual perception method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114757301A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381868A (en) * 2020-11-13 2021-02-19 北京地平线信息技术有限公司 Image depth estimation method and device, readable storage medium and electronic equipment
CN112669335A (en) * 2021-01-27 2021-04-16 东软睿驰汽车技术(沈阳)有限公司 Vehicle sensing method and device, electronic equipment and machine-readable storage medium
CN113033523A (en) * 2021-05-25 2021-06-25 杭州雄迈集成电路技术股份有限公司 Method and system for constructing falling judgment model and falling judgment method and system
CN113673444A (en) * 2021-08-19 2021-11-19 清华大学 Intersection multi-view target detection method and system based on angular point pooling
CN113902897A (en) * 2021-09-29 2022-01-07 北京百度网讯科技有限公司 Training of target detection model, target detection method, device, equipment and medium
CN114419568A (en) * 2022-01-18 2022-04-29 东北大学 Multi-view pedestrian detection method based on feature fusion

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115042821A (en) * 2022-08-12 2022-09-13 小米汽车科技有限公司 Vehicle control method, vehicle control device, vehicle and storage medium
CN115042821B (en) * 2022-08-12 2022-11-04 小米汽车科技有限公司 Vehicle control method, vehicle control device, vehicle and storage medium
CN115240171A (en) * 2022-08-17 2022-10-25 阿波罗智联(北京)科技有限公司 Road structure sensing method and device and computer program product
CN115240171B (en) * 2022-08-17 2023-08-04 阿波罗智联(北京)科技有限公司 Road structure sensing method and device
CN115578702A (en) * 2022-09-26 2023-01-06 北京百度网讯科技有限公司 Road element extraction method and device, electronic equipment, storage medium and vehicle
CN115578702B (en) * 2022-09-26 2023-12-05 北京百度网讯科技有限公司 Road element extraction method and device, electronic equipment, storage medium and vehicle
CN117173693A (en) * 2023-11-02 2023-12-05 安徽蔚来智驾科技有限公司 3D target detection method, electronic device, medium and driving device
CN117173257A (en) * 2023-11-02 2023-12-05 安徽蔚来智驾科技有限公司 3D target detection and calibration parameter enhancement method, electronic equipment and medium
CN117173693B (en) * 2023-11-02 2024-02-27 安徽蔚来智驾科技有限公司 3D target detection method, electronic device, medium and driving device
CN117173257B (en) * 2023-11-02 2024-05-24 安徽蔚来智驾科技有限公司 3D target detection and calibration parameter enhancement method, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN114757301A (en) Vehicle-mounted visual perception method and device, readable storage medium and electronic equipment
WO2020107930A1 (en) Camera pose determination method and apparatus, and electronic device
JP6798183B2 (en) Image analyzer, image analysis method and program
JP2020042816A (en) Object detection method, device, apparatus, storage media, and vehicle
CN109005334B (en) Imaging method, device, terminal and storage medium
CN112509047B (en) Pose determining method and device based on image, storage medium and electronic equipment
JP2008152530A (en) Face recognition device, face recognition method, gabor filter applied device, and computer program
CN111429354B (en) Image splicing method and device, panorama splicing method and device, storage medium and electronic equipment
CN113095228B (en) Method and device for detecting target in image and computer readable storage medium
CN112819875B (en) Monocular depth estimation method and device and electronic equipment
CN112037279B (en) Article position identification method and device, storage medium and electronic equipment
CN111402404B (en) Panorama complementing method and device, computer readable storage medium and electronic equipment
WO2023109221A1 (en) Method and apparatus for determining homography matrix, medium, device, and program product
CN112489114A (en) Image conversion method and device, computer readable storage medium and electronic equipment
WO2023231435A1 (en) Visual perception method and apparatus, and storage medium and electronic device
CN113592940A (en) Method and device for determining position of target object based on image
CN114821505A (en) Multi-view 3D target detection method, memory and system based on aerial view
CN112270246A (en) Video behavior identification method and device, storage medium and electronic equipment
CN115147683A (en) Pose estimation network model training method, pose estimation method and device
CN113129211B (en) Optical center alignment detection method and device, storage medium and electronic equipment
CN113639782A (en) External parameter calibration method and device for vehicle-mounted sensor, equipment and medium
CN115719476A (en) Image processing method and device, electronic equipment and storage medium
CN111179331A (en) Depth estimation method, depth estimation device, electronic equipment and computer-readable storage medium
CN115620250A (en) Road surface element reconstruction method, device, electronic device and storage medium
CN114119678A (en) Optical flow estimation method, computer program product, storage medium, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination